Search Engines
What Are They?
Four Components
A database of references to webpages
An indexing robot that crawls the WWW
An interface
Enables users to submit queries
Displays results
Information retrieval system
Each engine is unique, but most work in much the same way
2
Database
Where the user's query is matched
Contains only essential parts of pages
Only includes pages that were indexed
Search engines are therefore always somewhat out of date
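As a rough illustration of such a database, the sketch below builds a tiny inverted index: a map from each word to the pages that contain it. The page URLs and texts are made up for the example, and real engines store far more than bare words.

```python
from collections import defaultdict
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

# Hypothetical pages that the crawler has already visited.
pages = {
    "http://example.org/a": "Search engines index the web",
    "http://example.org/b": "A web crawler follows links",
}
index = build_index(pages)
print(sorted(index["web"]))
```

A page that was never crawled simply has no entry, which is why a query can only match pages that were indexed.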
3
Web Crawler
A robot that follows links
Records data it finds
Words in the webpage
Metadata
ALT attributes in IMG tags
Robot Exclusion Protocol
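The sketch below shows, under simplifying assumptions, what a crawler might record from one page and how it could honor the Robot Exclusion Protocol. The class name PageRecorder, the user agent "ToyBot", the sample HTML, and the robots.txt rules are all hypothetical; only Python's standard library is used and nothing is fetched over the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser
import re

class PageRecorder(HTMLParser):
    """Collect the links, words, metadata, and ALT text of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links, self.words, self.alts = [], [], []
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative links against the page's own URL.
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""
        elif tag == "img" and attrs.get("alt"):
            self.alts.append(attrs["alt"])

    def handle_data(self, data):
        self.words.extend(re.findall(r"[a-z0-9]+", data.lower()))

html = """<html><head><title>Toy page</title>
<meta name="description" content="A toy example"></head>
<body><p>Hello crawler</p><a href="/next.html">next</a>
<a href="/private/notes.html">notes</a>
<img src="cat.png" alt="A sleeping cat"></body></html>"""

recorder = PageRecorder("http://example.org/index.html")
recorder.feed(html)

# Hypothetical robots.txt content, parsed from lines instead of downloaded.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])

# Only follow the links that the site's robots.txt allows.
allowed = [u for u in recorder.links if robots.can_fetch("ToyBot", u)]
print(allowed)
```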
4
Search Engine Interfaces
Gathers input from users
Presents results from the IR system
Often in ranked order
5
Search Engine Interfaces
Input
User requirements
Search expression, search limits
Presentation style
Presentation format, search type
6
Search Engine Interfaces
Output
Results
Descriptions
Clusters
7
Search Term Matching
Trying to find a match in the database
Two main methods
Keyword searching
Matching single terms, computing cosine similarity
Concept-based searching
Examining clusters of words
Attempt to determine meaning of query and find records related to that meaning
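For keyword searching, one common way to compare a query with a page is the cosine of the angle between their term-count vectors. The sketch below assumes plain word counts with no weighting; the document texts are invented for the example.

```python
from collections import Counter
import math
import re

def term_vector(text):
    """Count how often each word occurs in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity of two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical records in the database.
docs = {
    "a": "Search engines index the web",
    "b": "A web crawler follows links",
}
query = term_vector("web crawler")
ranked = sorted(docs, key=lambda d: cosine(query, term_vector(docs[d])), reverse=True)
print(ranked)
```

Document "b" shares two terms with the query and "a" only one, so "b" is ranked first.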
8
Basic IR Features
Boolean operators
AND, OR, NOT, grouping
Extended operators
NEAR, ADJACENT, quoted phrases (")
Stop word deletion
Stemming
Searching in fields (e.g. host)
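A few of these features can be sketched in a handful of lines. The stop word list and the suffix-stripping stemmer below are deliberately crude stand-ins (a real engine would use something like the Porter stemmer), and the documents are made up. Boolean operators map directly onto set operations over an inverted index; NEAR, ADJACENT, and field search are not covered here.

```python
import re

STOP_WORDS = {"a", "an", "and", "of", "the", "to"}  # tiny illustrative list

def stem(word):
    """Very crude suffix stripping, for illustration only."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def terms(text):
    """Tokenize, delete stop words, and stem what is left."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {stem(w) for w in words if w not in STOP_WORDS}

# Hypothetical documents, keyed by ID.
docs = {
    1: "Crawling the web",
    2: "Search engines and crawlers",
    3: "The indexed web of linked pages",
}

index = {}
for doc_id, text in docs.items():
    for t in terms(text):
        index.setdefault(t, set()).add(doc_id)

def lookup(word):
    """Return the IDs of documents containing the stemmed word."""
    return index.get(stem(word.lower()), set())

web_and_crawl = lookup("web") & lookup("crawls")          # AND
engine_or_page = lookup("engines") | lookup("pages")      # OR
web_not_index = lookup("web") - lookup("indexing")        # NOT
grouped = (lookup("web") & lookup("crawls")) | lookup("engine")  # grouping
print(web_and_crawl, engine_or_page, web_not_index, grouped)
```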
9
Ranked Output
Most SEs produce ranked lists by applying simple rules:
Early words are more important
Title is very important
Frequency of occurrence matters for some
Infrequent words matter more
Modification date
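To make the simple rules concrete, here is a toy scoring function. The weights (5.0 for a title hit, 2.0 for an early hit), the five-word "early" window, and the sample pages are arbitrary assumptions, not values used by any real engine; modification date is left out for brevity.

```python
import math
import re

def words(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def score(query, page, doc_freq, n_pages):
    """Score one page against a query using the simple rules above."""
    title, body = words(page["title"]), words(page["body"])
    total = 0.0
    for term in words(query):
        if term not in doc_freq:
            continue  # term occurs nowhere in the collection
        rarity = 1.0 + math.log(n_pages / doc_freq[term])  # infrequent words matter more
        points = float(body.count(term))                   # frequency of occurrence
        if term in title:
            points += 5.0                                  # title is very important
        if term in body[:5]:
            points += 2.0                                  # early words are more important
        total += rarity * points
    return total

# Hypothetical pages.
pages = [
    {"title": "Web crawlers", "body": "A crawler follows links across the web"},
    {"title": "Cooking pasta", "body": "Boil water and add pasta then mention the web once"},
]

# Number of pages in which each term occurs at least once.
doc_freq = {}
for page in pages:
    for term in set(words(page["title"]) + words(page["body"])):
        doc_freq[term] = doc_freq.get(term, 0) + 1

ranked = sorted(pages, key=lambda p: score("web", p, doc_freq, len(pages)), reverse=True)
print([p["title"] for p in ranked])
```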
Google is different:
PageRank™ method based on popularity
Links as money
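The "links as money" idea can be sketched with the textbook form of PageRank: each page splits its current rank evenly among the pages it links to. This is a simplified illustration, not Google's production algorithm; the damping factor of 0.85, the iteration count, and the four-page link graph are assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to.

    Every link target must itself be a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                # A page with no links spreads its rank evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                # The page "pays" an equal share of its rank along each link.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph: three pages point at "c", nobody points at "d".
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))
```

Page "c" collects the most incoming rank and ends up on top, while "d", which nobody links to, keeps only the small baseline share.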
10
Googlebombing
Google spoofed from the lecture list
first hit from 1992
Official Google Blog explanation
11
What about the Invisible Web?
Also known as the Deep Web
Documents that are on the WWW but not indexed by Search Engines
Some are available only by submitting forms
Some are not generally accessible (in subnets)
Some are not in (X)HTML format
12
The Invisible Web Isn't So Invisible Anymore…
More search engines parse non-(X)HTML content now than before
Because of awareness of the problem, companies are making more content available using
Stable URLs
Robot-friendly sitemaps
But much content is still not indexed
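As an illustration of a robot-friendly sitemap, the sketch below writes a minimal XML sitemap in the sitemaps.org format, listing stable URLs for a crawler to visit. The function name build_sitemap and the example URLs are hypothetical, and optional fields such as last-modified dates are omitted.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return a minimal sitemap document listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical stable URLs for content that would otherwise sit behind a form.
sitemap_xml = build_sitemap([
    "http://example.org/catalog/item/1",
    "http://example.org/catalog/item/2",
])
print(sitemap_xml)
```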
13
But there are still plenty of important yet invisible docs
How to find them?
Use database tools from the university's library
Many of them are in databases
No one search engine covers everything
Especially for research articles
Use multiple search engines or a metacrawler
Dogpile is the most famous
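A metacrawler sends one query to several engines and merges what comes back. The sketch below shows one possible merging strategy, round-robin interleaving with duplicate removal; it is not a description of how any particular metacrawler works, and the result lists are invented rather than fetched from real engines.

```python
def merge_results(result_lists):
    """Interleave ranked lists round-robin, dropping duplicate URLs."""
    merged, seen = [], set()
    longest = max((len(r) for r in result_lists), default=0)
    for position in range(longest):
        for results in result_lists:
            if position < len(results) and results[position] not in seen:
                seen.add(results[position])
                merged.append(results[position])
    return merged

# Hypothetical ranked results from two engines for the same query.
engine_a = ["http://example.org/1", "http://example.org/2", "http://example.org/3"]
engine_b = ["http://example.org/2", "http://example.org/4"]
merged = merge_results([engine_a, engine_b])
print(merged)
```

Because no single engine covers everything, the merged list contains a page (the fourth URL) that the first engine missed entirely.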
14
Search Engines
A Summary of Practical Advice
How To Succeed With SEs
As a surfer:
If you don't know what you are looking for
Use multiple SEs, or a meta-crawler
Use Boolean expressions or search within results
Consider specialized engines
16
How To Succeed With SEs
As a creator:
HTML level
Always use ALT attributes with <IMG>, etc.
Avoid frames
Make it easier to index
Don't expect SEs to find your pages
Make links between your pages
Use metadata
Informal: <meta name="description" …>
Formal: Dublin core and others
Increase your pages' popularity
Don’t use systematic reciprocal linking: rings, exchanges, lists
The PageRank™ a page passes along each link is inversely proportional to its outdegree
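Some of the HTML-level advice can be checked automatically. The sketch below is a hypothetical lint pass, built on Python's standard HTML parser, that flags images without an ALT attribute, frames, and a missing description meta tag. The class and function names and the sample markup are made up for the example.

```python
from html.parser import HTMLParser

class IndexabilityCheck(HTMLParser):
    """Collect simple indexability problems found in one page."""

    def __init__(self):
        super().__init__()
        self.problems = []
        self.has_description = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and "alt" not in attrs:
            self.problems.append("img without ALT: " + str(attrs.get("src")))
        elif tag in ("frame", "frameset"):
            self.problems.append("uses frames")
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.has_description = True

def check(html):
    """Return a list of problems; an empty list means none were found."""
    checker = IndexabilityCheck()
    checker.feed(html)
    checker.close()
    if not checker.has_description:
        checker.problems.append("no meta description")
    return checker.problems

page = '<html><head></head><body><img src="a.png"><img src="b.png" alt="B"></body></html>'
print(check(page))
```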
17
How To Succeed With SEs
As a creator (cont.)
For surfers:
Use <meta name="description" …>
Don't expect surfers to start at the top of your hierarchy
Don't rely on a hierarchy
Include a context map near the top of each page
Don't use frames
Think through dynamic content implications
Stickiness… is for another day
18