Transcript Document
search engines
fdm 20c introduction to digital media
lecture 04.12.2008
warren sack / film & digital media department / university of california, santa cruz
outline
• what’s the difference between a web directory
and a search engine?
– ontologies
– clusters
• what does a search engine do?
– crawling
– indexing
– retrieving
what do search engines do?
• crawl the web
• index the pages found
• retrieve pages from the index in response to
user queries
crawling/spidering the web
depth-first crawl
source: http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/SearchAnimations.html
breadth-first crawl
source: http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/SearchAnimations.html
indexing web pages found by the crawler
1. tokenization: break the document into words,
punctuation, tags, and uris
2. stop words: remove any that are too
frequently used to be informational (these are
called stop words)
3. stemming: find the stem of each word
4. inverted index: create an index linking each
word found to one or more documents
tokenization
• T’was was a dark and stormy night in the
country manor.
==>
• T single-quote was a dark and stormy night in
the country manor period
stop words
• T single-quote was a dark and stormy night in
the country manor period
==>
•
T single-quote was a
period
dark
and
stormy night
in the
country manor
stemming
• dark stormy night country manor
==>
• dark storm night country manor
y
document vectors
• documents (i.e., web pages) are stored as
vectors or lists; e.g.,
country dark manor night storm
• note that syntactic structure is lost
inverted index
• dictionary: make a file listing all of the words
found in all of the web pages crawled
• postings: for each word, make a list of all of the
web pages in which the word was found
• encoding: remove any redundancies found in
the dictionary and postings files so that they are
as small as possible
inverted index / dictionary
•
•
•
•
create a document vector for each web page
combine all of the vectors together
remove any duplicate words
alphabetize the resulting list
inverted index / dictionary / example
• webpage 1: T’was a dark and stormy night in
the country manor.
==> country dark manor night storm
• webpage 2: Now is the time for all good men to
come to the aid of their country.
==> aid all come country good men time
• dictionary: aid all come country dark good
manor men night storm time
inverted index / postings / example
• postings: aid (2) all (2) come (2) country (1&2)
dark (1) good (2) manor (1) men (2) night (1)
storm (1) time (2)
• the postings file just annotates the dictionary file
with the documents in which the words appear
retrieval
• the postings file can support boolean queries
• for example, searching for webpages that
contain the word “country” returns 1&2.
• for example, searching for the webpages that
contain the words “country” and “manor” returns
webpage 1
better retrieval
• how should the webpages retrieved be
ordered?
• e.g., if webpage 1 and webpage 2 both contain
the word “country” should 1 be listed before or
after 2?
social networks and centrality
Diane is central; Jane is not.
See www.orgnet.com/sna.html
google’s page rank algorithm
• basic idea: if a webpage has a lot of other
webpages that link to it, then it is central and
probably an important page.
• If page A has pages T1...Tn which link to it (i.e.,
are citations) and C(A) is defined as the number
of links going out of page A, then PageRank
(PR) of a page A is given as follows:
• PR(A) = (1-d) + d (PR(T1)/C(T1) + ... +
PR(Tn)/C(Tn))
– where the parameter d is a damping factor which can
be set between 0 and 1 and is usually set to 0.85.
the difference is in the details
• The Boolean approach to information retrieval is the most widely used. Links
to documents are returned only if they contain exactly the same words as
your query. Even if a site is related in some way, it would never be retrieved
unless it contains the user's search string. In order to get better results from
this, most search engines finesse things a little. The results from the
inverted file are modified by factors such as: where the search term appears
in the item - whether the string of characters appears in specially tagged
areas such as titles or head, or whether it appears early in the body text of
the html document; how close the search words are together, or whether
they form an 'exact phrase'; the frequency of occurrence of search terms in
the inverted file; whether the search terms appear as keywords in the
metatags in the html file's head; how frequently the terms appear in relation
to the document's length (greater frequency indicating a probable greater
relevance). These techniques gradually move over into semantic analysis.
For instance selection by the frequency of occurrence of words in general in
which very uncommon words get heavier weight. Matthew Fuller