Lecture 10: Searching the Web

Special Topics in Computer Science
The Art of Information Retrieval
Chapter 13: Searching the Web
Alexander Gelbukh
www.Gelbukh.com
Previous chapter: Conclusions
 Interface is a key element of the system. If the users
cannot use it, it does not matter how good it is.
 Interface design choices are important at any stage
of the process
o Especially to formulate queries
o Also to present results
o 3D interfaces to present results
 Also, overall system interface and action tracking
 Difficult to assess quality. Difficult to find new ideas
 Very promising if you find them!
2
Previous chapter: Research topics
 Many ideas throughout the chapter
o some may be obsolete
 New interface types! 3D interfaces
 Ways of assessing the quality of interfaces
3
Web: challenges (differences)
 Distributed data
 Volatile data: 40% changes every month
 Very large volume
o Very large answers
o 1998: 3,000,000 servers, 350,000,000 pages.
o 2003: Only Google: 3,307,998,701 pages (10 times more)
 Unstructured and redundant data. 30% are duplicates
 Quality of data. 0.5% errors, 30% in foreign names
 Heterogeneous data (languages, alphabets: Chinese)
 Heterogeneous and inexperienced users
4
Search engines
 Difference: full text is not available
o Now obsolete: Google stores it, some other engines too
 Centralized (logically) architecture
o there are distributed (physically) architectures
 Crawlers (robots) collect data/index in a central place
 A search engine indexes only a small fraction
(2%? 30%?) of the Web
 Recall is almost irrelevant for simple queries
 Google: a revolution (the AltaVista of our day)
5
Ranking
 Commercial secret
 Ranking can take into account hypertext
 Google: PageRank algorithm
o Roughly, the number of incoming links (the real algorithm is much more complicated)
 Problems: tricks
o Link exchange
o Anti-trick measures: detect link exchangers
o Penalize tricks: repeated keywords, etc.
 Related pages
o Co-cited or co-citing pages are related
o Clustering the search results
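The link-counting idea behind PageRank can be illustrated with a short power iteration. This is a minimal sketch of the textbook formulation, not Google's actual ranking (which is a commercial secret). The toy graph `toy_web`, the damping factor 0.85, and the function name `pagerank` are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share ("random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: C has the most incoming links, D has none.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
best = max(ranks, key=ranks.get)
worst = min(ranks, key=ranks.get)
```

In this toy graph the page with the most incoming links (`C`) ends up ranked highest and the page nobody links to (`D`) lowest, which also suggests why link exchange works as a trick: extra incoming links raise a page's score.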
6
Crawling
 Depth-first? Breadth-first? Most popular first?
 How to divide the work between crawlers?
 Index is always obsolete
o Not equally obsolete (like stars)
o Depends on crawling policy
o 2% - 9% of invalid links. Snapshots.
 PageRank first!
 Robot instruction file (robots.txt) on each server
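The crawling policies listed above can be compared on a small example. The following is a sketch only: it walks a hypothetical in-memory link graph (`site`) instead of fetching real pages, and the scores in `priority` stand in for a popularity measure such as PageRank. The robot instruction file is checked with `urllib.robotparser` from the Python standard library; the bot name `ExampleBot` and the `example.com` URLs are made up.

```python
import heapq
from collections import deque
from urllib.robotparser import RobotFileParser


def crawl_order(links, seed, priority=None):
    """Return the order in which pages would be visited.

    links: hypothetical in-memory link graph (page -> outgoing links).
    priority: optional dict page -> score; higher scores are fetched first.
              Without it the crawl is plain breadth-first.
    """
    seen = {seed}
    order = []
    if priority is None:
        frontier = deque([seed])
        while frontier:
            page = frontier.popleft()
            order.append(page)
            for nxt in links.get(page, []):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    else:
        # heapq is a min-heap, so scores are negated to pop the best first.
        frontier = [(-priority.get(seed, 0.0), seed)]
        while frontier:
            _, page = heapq.heappop(frontier)
            order.append(page)
            for nxt in links.get(page, []):
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(frontier, (-priority.get(nxt, 0.0), nxt))
    return order


site = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/d"], "/c": [], "/d": []}
bfs = crawl_order(site, "/")
popular_first = crawl_order(site, "/", priority={"/b": 0.9, "/d": 0.8, "/a": 0.1})

# Robot instruction file: parse the rules and ask before fetching a URL.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])
allowed = robots.can_fetch("ExampleBot", "http://example.com/index.html")
blocked = robots.can_fetch("ExampleBot", "http://example.com/private/x.html")
```

The two policies visit the same pages in a different order, so at any moment the index is "obsolete" in different places depending on which policy was chosen.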
7
Metasearchers
 Search using many engines and unify the results
o How to rank?! Merge rankings?
o Inquirus: Download each page and analyze it; rank
 Intersection of different major search engines is 1%
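One possible answer to the "merge rankings" question is reciprocal rank fusion, which needs only each document's position in each engine's list. This is a sketch of that one technique, not a description of how Inquirus or any other metasearcher works. The engine result lists and the constant `k=60` are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of URLs into one.

    Each list contributes 1 / (k + position) to a URL's score, so URLs
    ranked high by several engines rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for position, url in enumerate(ranking, start=1):
            scores[url] += 1.0 / (k + position)
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda u: (-scores[u], u))


# Hypothetical result lists from three engines.
engine_a = ["u1", "u2", "u3"]
engine_b = ["u2", "u4", "u1"]
engine_c = ["u2", "u1", "u5"]
merged = reciprocal_rank_fusion([engine_a, engine_b, engine_c])
```

Because the engines overlap so little, most URLs appear in only one list; a position-based scheme like this still orders them without needing the engines' internal scores, which are not comparable across engines.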
8
Other topics
 Indexing
 Hierarchies
 Interfaces
 User problems (understanding Boolean search, ...)
All of these have been covered in previous chapters
Hyperlink (structured) search
 Fish search: explore neighborhood of a hit on the fly
o Relevant docs frequently have relevant neighbors
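The fish search heuristic can be sketched as follows. This is a simplified version under stated assumptions: the web is a hypothetical in-memory dictionary (`pages`, `links`), relevance is a plain substring test, and `depth` is the number of link steps the search may go past the last relevant page.

```python
from collections import deque


def fish_search(pages, links, seed, query, depth=2):
    """Simplified fish search over a hypothetical in-memory web.

    pages: url -> text, links: url -> outgoing urls.
    A relevant page renews the depth budget of its links;
    an irrelevant page passes on a budget reduced by one.
    """
    def is_relevant(url):
        return query.lower() in pages.get(url, "").lower()

    hits = []
    seen = {seed}
    queue = deque([(seed, depth)])
    while queue:
        url, budget = queue.popleft()
        relevant = is_relevant(url)
        if relevant:
            hits.append(url)
        child_budget = depth if relevant else budget - 1
        if child_budget <= 0:
            continue  # too far from any relevant page: stop exploring here
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, child_budget))
    return hits


pages = {
    "s": "web search engines",
    "a": "cooking recipes",
    "b": "search ranking",
    "c": "search crawlers",
    "d": "gardening",
    "x": "sports news",
    "e": "search metasearch",
}
links = {"s": ["a", "b"], "a": ["c"], "c": ["d"], "d": ["x"], "x": ["e"]}
found_deep = fish_search(pages, links, "s", "search", depth=2)
found_shallow = fish_search(pages, links, "s", "search", depth=1)
```

With `depth=2` the search tolerates one irrelevant page (`a`) on the way to `c`, but gives up after two irrelevant pages in a row (`d`, `x`) and so never reaches `e`.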
9
Research topics
 NLP techniques to improve indexing and ranking
o Word sense disambiguation (WSD). Anaphora? Semantic structures
 Semantic Web
o Ontologies
 Text Mining to improve navigation. Web Mining (links)
 Distributed architectures
 Scalable index compression (? – just bigger disks)
 Multimedia search
10
Conclusions
 The Web has its own challenges compared with general
collections
 Search engines have to cope with them
 Gathering data (crawling) is a problem specific to the Web
 Also, the Web provides new types of info (links), which
can be used by search engines
11
Chapter 14: Libraries and Bibliographical Systems
Differences with IR...
 Historically the first search applications
o Predecessor of IR
 Docs: bibliographic records
o Free text
o Structured fields (e.g., date)
 Users: mostly librarians, or users of a library
o thus: very limited budget
 Usually use Boolean model (IR: vector space)
o Seems to be mostly due to historical reasons (among others)
o Recently tend to add natural language search
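The Boolean model over bibliographic records can be sketched with Python sets. This is a minimal illustration, not any real catalog system: the three records are invented, the free-text field is `title`, and `year` plays the role of a structured field.

```python
import re
from collections import defaultdict

# Invented sample records: free text (title) plus a structured field (year).
records = [
    {"id": 1, "title": "Web Information Retrieval", "year": 1999},
    {"id": 2, "title": "Introduction to Library Science", "year": 1985},
    {"id": 3, "title": "Information Science and Libraries", "year": 2001},
]


def build_index(records):
    """Inverted index: term -> set of ids of records whose title has it."""
    index = defaultdict(set)
    for rec in records:
        for term in re.findall(r"[a-z]+", rec["title"].lower()):
            index[term].add(rec["id"])
    return index


index = build_index(records)
by_id = {rec["id"]: rec for rec in records}

# information AND science, restricted to year >= 2000 (structured field)
matches = index["information"] & index["science"]
recent = sorted(i for i in matches if by_id[i]["year"] >= 2000)

# library OR libraries
either = sorted(index["library"] | index["libraries"])

# information AND NOT retrieval
without = sorted(index["information"] - index["retrieval"])
```

AND, OR, and NOT map directly to set intersection, union, and difference. The result is an unranked set: a record either matches or it does not, which is the main contrast with the vector space model.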
13
... Differences with IR
 Creating the database is a subtask of such systems
o Suit the data to the system, not the system to the data as in IR
o Carefully selected, structured, and annotated data
o Annotation standards. Decimal classification, ...
14
Online Public Access Catalogs (OPAC)
 Three generations:
1. Known-item finding tools (by title, author, ...)
2. Subject headings, keywords, ...
3. Search strategy assistance, natural language queries,
improved GUI, ...
 Have proved very hard for inexperienced users to use
 Nowadays tend to become similar to digital library
tools
15
Research topics
 Ease of use
 More power and flexibility ?
 Integration with Digital Libraries ?
16
Conclusions
 Highly interoperable and standardized
 Look like legacy systems...
17
Chapter 15: Digital Libraries
Digital libraries (DL)
 Simplistic view: library in a machine-readable form
o Digitization issues. Multilingual.
 5S model:
1. Streams (texts, multimedia, ...)
2. Structures (databases, indices, ...)
3. Spaces (interfaces in 1D, 2D, 3D, time, ...)
4. Scenarios (procedures, transformations, services, ...)
5. Societies (authors, annotators, ...)
 This provides a way to define a DL
19
Architecture
 Provide Web services
 Manipulate Digital Objects (Items?)
 Repositories of such objects. Access protocol.
Standards.
 Security. Payment. Copyright. Watermarking
 Parallel search across heterogeneous distributed
(multilingual) collections
 Multimedia collections
 Metadata, Standard formats
20
Systems
 A lot of specific projects and systems are mentioned
in the book.
 Interoperability. Standards for automatically searching
remote libraries. Protocols
21
Research topics
 Markup tools to produce high-quality documents
 Scaling
 Interoperability. Standards
 Better integration with IR
22
Conclusions
 Turning heaps of texts collected in conventional (or
new) libraries into searchable and accessible
information
 DLs are technological solutions, which involve
IR as one of their aspects
 Unlike the Web, they handle carefully prepared docs.
Very costly.
 Like the Web, they are highly distributed and
heterogeneous. Hence the importance of standardization
and interoperability
23
Thank you!
The end
Exam?
24