認識搜尋引擎

Transcript 認識搜尋引擎

搜尋引擎簡介
國立中正大學資訊工程研究所
吳昇副教授([email protected])
What is a search engine?
◆A web service site for the Internet
Users to find information in the
Internet Cyberspace。
◆The software to provide web search
service
Use of search engines?
◆Search for the url of a company/website
◆Look for the contact info about a person or
an organization
◆Search for information related to a term, eg.
to collect information about 櫻花鉤吻鮭
◆Look for news regarding XXX
◆Treat the search engine as a big dictionary
◆…...
Types of search engines
◆Directory browse/search
◆Web pages search
◆USENET news search
◆Ftp search
◆People/organization search
◆Daily-life information search
◆Library search
◆Commercial product search
Example search engines
◆Yahoo,
◆Google,
◆AltaVista,
◆MSN,
◆Excite,
◆Lycos, ...
◆YAM, Kimo, PCHome
◆GAIS, Openfind, ...
◆DejaNews,
◆Archie, ...
Portal Services
◆Directory / Search
◆Daily information: Weather, Maps. TV, ...
◆Free Emails, Free Pages, Calendar
◆Personalized services, channel subscription
◆Web Chat,
◆E-Commerce,
◆Content Aggregation
◆…...
Directory implementation
◆Each url data is a record
◆The url data is managed by a database
system
◆Search function is supported for searching
the data in the directory tree
Directory implementation
◆The search is in general for locating a
website or a category of web sites.
◆The data input is through manual
registration by the website owner or the
suffer
◆The management of the directory tree needs
intensive labor work by people who are
familiar with certain domain knowledge
The Advantages/Disadvantages
of Directory search engine
◆Advantages
– The data is manually maintained, and thus
contains less noise, and is more precise.
– The output of search can be categorized and can
be more organized
– Can support search within a category
The Advantages/Disadvantages
of Directory search engine
◆Disadvantages:
– The data coverage is limited, and sometimes,
can not find wanted
– Does not support relevance ranking
– Labor intensive
Implementation of Webpage
search engine
1.Feature consideration
2.Data Gathering
3.Data Preprocessing
4.Data Indexing
5.Query Processing
6. Interaction
7.Service tools
8.Personalization
Requirements for WebPage
search engines
0. The quality of the search result in a search
engine basically depends on
– a. the quality of the underlying data
– b. the search techniques such as ranking tech.
1. Data coverage should be large enough
2. Data needs to be filtered, such as removing
redundant pages
Requirements for WebPage
search engines
3. Full text search capability should be
provided
4. Relevance Ranking mechanism should be
provided
5. Search Speed should be fast enough
6. Search features ;
I.e., evaluation points:
Quality, speed, scale, robustness, features,
Data Gatherer
◆Also known as spider, crawler, robot, ...
◆Periodically travels the web space to collect
web pages
◆Need a list management to decide which and
when to collect
◆Need a link analyzer to generate new URL list
◆Need to decide what to collect and what not to.
Data Gatherer
◆Get-file function through http protocol is the
basic function
◆Webpage parser module used to extract link info
from a retrieved page,
◆URL bank manager module to manage the urls
to be fetched.
◆Robot-controller module to manage the data
collection using multiple clients
Issues of Robot
◆Site Based vs URL based
– Site based is popular such as wget, teleport
– robots.txt is easier to implement in SiteBased
robot
– URL based robot is more appropriate for large
scale search engines
◆Retrieval Schedule, BFS is better
◆Incremental Retrieval
Robot Issues
◆What to gather and what not to?
◆Hidden web data collection
◆Focused crawling
– targeting specialized content of web pages
– suitable for special search engines
– evaluated by precision and recall
Data Preprocessing
◆Remove redundant pages
◆Transform the page into internal data format.
◆Perform web cross-link analysis to generate a
URL databank.
◆Filter the data to remove data that better not be
indexed
◆Partition the data space*
Redundancy removal
◆15% to 20% of the web pages are replicated
on different websites, e.g., some tutorials
such as Java, Perl, Python, …
◆Can be implemented by partitioned-hashing
or external sorting
Ranking the URLs
◆Link analysis is done to count the mutual
reference between web pages
◆A URL receiving higher number of
references will get higher score
– weighted link
– discount internal link // such as back to home
◆Order the web pages in order of score such
that a page with higher rank will have lower
ID
Data Partition
◆The data is partitioned by language type
◆The language partition can be done as
follows:
– for each known language, collect certain
amount of webpages of that language
– build up high-frequent term set for each
language set from the analysis of the sample
data
– determine the language type by term analysis
Indexer
◆In general, inverted file is used to generate the
index
◆Need large data space for the indexing task.
◆For each indexed term, an index list is generated
to record which files/locations such term appears.
◆Need about the same or more space as the
original data
Indexer implementation issue
◆Data filter module is used to cope with different
data sources
◆Inversion module is the kernel module
◆Need to be scalable to handle continuous
growing data size.
– Hundreds of Giga bytes
– Tera bytes
◆Distributed/Concurrent Indexing
Indexer implementation issue
◆Temporary space minimization
◆Index speed is crucial
◆Memory can be utilized to improve the
index performance
◆Hashing and Sorting is the key!
Query Processing
◆Use dictionary/stop-list to preprocess the
query string
◆Parse the query into expressions of tokens
◆Use index structure to locate the matched
◆Use TF*IDF type technique to score the
matched documents
◆Combine URL scores to rank the result
Search CGI programs
◆search agent CGI:
– parse the query and fork a searcher process to do
the search (or use IPC to query the searcher)
– when the searcher returns, analyze and process the
result for formatted output
– process the result and store it in tmp result store
– log query and some status info
◆cgi for view-next-page
◆showmatch cgi
Output control
◆Site grouping:
group the pages from same website together
◆Title grouping:
group the pages with similar title
◆Sort the output according to certain criteria
Interaction
◆Term Suggestion:
–
–
–
–
Related terms
thesaurus
term-expansion
error correction
• phonetic
• spelling
Personalization
◆Keeping track of a user’s interest such that
the search result can be tuned to improve
the satisfaction to the user
◆Query Tracking and classification
Service tools
◆Query cache to improve the performance of
the Search, for queries that have been
served.
◆Use memory cache file system to reduce
the dick access overhead
◆Mechanism for special case handling
◆Log analyzer
Research Issues
◆Hidden Web data collection
◆Distributed index/search
◆Index minimization, incremental Indexing
◆Smart robot
◆Intelligent Retrieval
◆Output result auto classification/clustering
◆Data source clustering/classification
– classifying/clustering the whole web
Conclusion
◆Size does matter
◆Is still searching for a better engine!