Lecture 10: Searching the Web
Special Topics in Computer Science
The Art of Information Retrieval
Chapter 13: Searching the Web
Alexander Gelbukh
www.Gelbukh.com
Previous chapter: Conclusions
Interface is a key element of the system. If the users
cannot use it, it does not matter how good it is.
Interface design choices are important at any stage
of the process
o Especially to formulate queries
o Also to present results
o 3D interfaces to present results
Also, overall system interface and action tracking
Difficult to assess quality. Difficult to find new ideas
Very promising if you find them!
Previous chapter: Research topics
Many ideas throughout the chapter
o some may be obsolete
New interface types! 3D interfaces
Ways of assessing the quality of interfaces
Web: challenges (differences)
Distributed data
Volatile data: 40% changes every month
Very large volume
o Very large answers
o 1998: 3,000,000 servers, 350,000,000 pages.
o 2003: Google alone: 3,307,998,701 pages (about 10 times more)
Unstructured and redundant data. 30% are duplicates
Quality of data. 0.5% errors, 30% in foreign names
Heterogeneous data (languages, alphabets: Chinese)
Heterogeneous and inexperienced users
Search engines
Difference from classical IR: full text is not available
o Now obsolete: Google stores it, some other engines too
Centralized (logically) architecture
o there are distributed (physically) architectures
Crawlers (robots) collect data/index in a central place
A search engine indexes only a small fraction
(2%? 30%?) of the Web
Recall is nearly irrelevant for simple queries
Google: a revolution (the AltaVista of our days)
Ranking
Commercial secret
Ranking can take into account hypertext
Google: PageRank algorithm
o Roughly, the number of incoming links (actually much more complicated)
Problems: tricks
o Link exchange
o Anti-trick measures: detect link exchangers
o Penalize tricks: repeated keywords, etc.
Related pages
o Co-cited or co-citing pages are related
o Clustering the search results
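The "roughly, incoming links" idea can be sketched as a simplified PageRank computed by power iteration. This is only an illustration: real ranking functions are a commercial secret, and the damping factor 0.85, the iteration count, and the toy graph `toy_web` are assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small base score (the "random jump" part).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split among its outgoing links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical toy Web: three pages link to "c", and "c" links to "a".
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
best = max(ranks, key=ranks.get)
```

Here "c" collects the most incoming links and ends up ranked highest, while "a" outranks "b" and "d" because its single incoming link comes from a highly ranked page. This is why a plain link count is only a rough approximation.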
Crawling
Depth-first? Breadth-first? Most popular first?
How to divide the work between crawlers?
Index is always obsolete
o Not equally obsolete (like light from stars: each page is seen as of a different time)
o Depends on crawling policy
o 2% to 9% of links are invalid. Snapshots.
Crawl pages with the highest PageRank first!
Robot instruction file (robots.txt) on each server
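A breadth-first crawl that respects the robot instruction file can be sketched as follows. To stay runnable without network access, the Web is replaced by a hypothetical in-memory graph `TOY_WEB`, and the robots.txt rules are fed to the standard library's `urllib.robotparser` as a list of lines; the crawler name `toy-crawler` and the URLs are made up for the example.

```python
from collections import deque
from urllib.robotparser import RobotFileParser


def crawl_breadth_first(seed, fetch_links, robots, agent="toy-crawler", limit=100):
    """Visit pages level by level, skipping URLs the robots rules disallow."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue and len(order) < limit:
        url = queue.popleft()
        if not robots.can_fetch(agent, url):
            continue  # disallowed: neither indexed nor expanded
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical toy Web standing in for real HTTP fetches.
TOY_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/private/x"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/"],
    "http://example.com/b": [],
    "http://example.com/private/x": ["http://example.com/secret"],
}

robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])

visited = crawl_breadth_first(
    "http://example.com/", lambda u: TOY_WEB.get(u, []), robots
)
```

Replacing the FIFO queue with a stack gives a depth-first crawl, and replacing it with a priority queue keyed by a popularity score gives a "most popular first" policy.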
Metasearchers
Search using many engines and unify the results
o How to rank?! Merge rankings?
o Inquirus: Download each page and analyze it; rank
Intersection of different major search engines is 1%
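One possible answer to "merge rankings?" is reciprocal rank fusion, sketched below. This is a generic rank-merging technique chosen for illustration, not the method used by Inquirus or any particular metasearcher; the constant `k=60` and the three engine result lists are assumptions.

```python
from collections import defaultdict


def merge_rankings(rankings, k=60):
    """Merge ranked lists: each list adds 1 / (k + position) to a page's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for position, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + position)
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists returned by three engines for the same query.
engine_a = ["p1", "p2", "p3"]
engine_b = ["p2", "p4", "p1"]
engine_c = ["p2", "p1", "p5"]

merged = merge_rankings([engine_a, engine_b, engine_c])
```

Pages returned by several engines near the top rise in the merged list, while pages found by only one engine keep a low score. Only rank positions are used, so the engines' secret scoring functions do not need to be comparable.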
Other topics
Indexing
Hierarchies
Interfaces
User problems (understanding Boolean search, ...)
All of the above have been covered in previous chapters
Hyperlink (structured) search
Fish search: explore neighborhood of a hit on the fly
o Relevant docs frequently have relevant neighbors
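The neighborhood exploration can be sketched as follows, loosely modelled on the fish search idea: links are followed from a hit, the search budget is restored whenever a relevant page is found, and it shrinks along trails of irrelevant pages. The graph, the relevance test, and the `depth` and `max_pages` parameters are hypothetical stand-ins for on-the-fly page downloads and a real relevance judgment.

```python
from collections import deque


def fish_search(start, get_links, is_relevant, depth=2, max_pages=50):
    """Explore the link neighborhood of a hit, favoring relevant regions."""
    found = []
    seen = {start}
    queue = deque([(start, depth)])
    visited = 0
    while queue and visited < max_pages:
        page, budget = queue.popleft()
        visited += 1
        if is_relevant(page):
            found.append(page)
            child_budget = depth       # relevant page: neighbors get a full budget
        else:
            child_budget = budget - 1  # irrelevant page: the trail is drying up
        if child_budget < 0:
            continue                   # budget exhausted: do not follow links
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append((link, child_budget))
    return found


# Hypothetical neighborhood of a search hit.
graph = {"hit": ["n1", "n2"], "n1": ["n3"], "n2": ["n4"], "n4": ["n5"]}
relevant = {"hit", "n1", "n5"}

shallow = fish_search("hit", lambda p: graph.get(p, []), relevant.__contains__, depth=1)
deep = fish_search("hit", lambda p: graph.get(p, []), relevant.__contains__, depth=2)
```

With `depth=1` the search gives up before reaching `n5`, which lies behind two irrelevant pages; with `depth=2` it is found.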
Research topics
NLP techniques to improve indexing and ranking
o Word sense disambiguation (WSD). Anaphora? Semantic structures
Semantic Web
o Ontologies
Text Mining to improve navigation. Web Mining (links)
Distributed architectures
Scalable index compression (or just bigger disks?)
Multimedia search
Conclusions
The Web has its own challenges compared with general
collections
Search engines have to cope with them
Gathering data (crawling) is a problem specific to the Web
Also, the Web provides new types of info (links), which
can be used by search engines
Chapter 14: Libraries and
Bibliographical Systems
Differences with IR...
Historically the first applications of searching
o Predecessor of IR
Docs: bibliographic records
o Free text
o Structured fields (e.g., date)
Users: mostly librarians, or users of a library
o thus: very limited budget
Usually use the Boolean model (IR: vector space model)
o Seems to be mostly due to historical reasons (among others)
o Recently tend to add natural language search
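A minimal sketch of Boolean search over bibliographic records that combine free text with structured fields. The three records are made-up toy data, and the helper names `build_index` and `term` are hypothetical.

```python
# Made-up toy bibliographic records: free-text title plus a structured field.
records = [
    {"id": 1, "title": "Information retrieval basics", "year": 1999},
    {"id": 2, "title": "Introduction to library science", "year": 1985},
    {"id": 3, "title": "Information retrieval in digital library systems", "year": 2001},
]


def build_index(records):
    """Inverted index: each title word maps to the set of record ids containing it."""
    index = {}
    for rec in records:
        for word in rec["title"].lower().split():
            index.setdefault(word, set()).add(rec["id"])
    return index


index = build_index(records)


def term(word):
    """Set of record ids whose title contains the word (empty set if none)."""
    return index.get(word, set())


# Boolean query: information AND retrieval AND NOT library
hits = (term("information") & term("retrieval")) - term("library")

# Free text combined with a structured field: library AND year >= 1990
recent = {rec["id"] for rec in records if rec["year"] >= 1990}
recent_library = term("library") & recent
```

The result of a Boolean query is an unordered set: a record either matches or it does not. The vector space model would instead assign each record a similarity score and return a ranked list.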
... Differences with IR
Creating the database is a subtask of such systems
o Suit the data to the system, not the system to the data as in IR
o Carefully selected, structured, and annotated data
o Annotation standards. Decimal classification, ...
Online Public Access Catalogs (OPAC)
Three generations:
1. Known-item finding tools (by title, author, ...)
2. Subject headings, keywords, ...
3. Search strategy assistance, natural language queries, improved GUI, ...
Prove to be very hard for inexperienced users to use
Nowadays tend to become similar to digital library tools
Research topics
Ease of use
More power and flexibility?
Integration with Digital Libraries?
Conclusions
Highly interoperable and standardized
Look like legacy systems...
Chapter 15: Digital Libraries
Digital libraries (DL)
Simplistic view: library in a machine-readable form
o Digitization issues. Multilingual.
5S model:
1. Streams (texts, multimedia, ...)
2. Structures (databases, indices, ...)
3. Spaces (interfaces in 1D, 2D, 3D, time, ...)
4. Scenarios (procedures, transformations, services, ...)
5. Societies (authors, annotators, ...)
This provides a way to define a DL
Architecture
Provide Web services
Manipulate Digital Objects (Items?)
Repositories of such objects. Access protocol.
Standards.
Security. Payment. Copyright. Watermarking
Parallel search across heterogeneous distributed
(multilingual) collections
Multimedia collections
Metadata, Standard formats
Systems
A lot of specific projects and systems are mentioned
in the book.
Interoperability. Standards for automatic searching
of remote libraries. Protocols
Research topics
Markup tools to produce high-quality documents
Scaling
Interoperability. Standards
Better integration with IR
Conclusions
Turning heaps of texts collected in conventional (or
new) libraries into searchable and accessible
information
DLs are technological solutions, which involve
IR as one of their aspects
Unlike the Web, they handle carefully prepared docs.
Very costly.
Like the Web, they are highly distributed and
heterogeneous. Hence the importance of standardization
and interoperability
Thank you!
The end
Exam?