Search Engines


Search Engines
What Are They?

Four Components
A database of references to webpages
An indexing robot that crawls the WWW
An interface
 Enables users to submit queries
 Displays results
An information retrieval system
Each is unique, but they are mostly the same
2
Database

Where the user's query is matched
Contains only essential parts of pages
Only includes pages that were indexed
Search engines are always somewhat out of date: pages change after they are indexed


3
Web Crawler


A robot that follows links
Records data it finds
 Words in the webpage
 Metadata
 ALT attributes in IMG tags
Robots Exclusion Protocol (robots.txt)
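
A minimal sketch of how a polite robot might consult the exclusion rules before fetching a page, using Python's standard urllib.robotparser. The robots.txt content, the "ExampleBot" name, and the example.com URLs are made up for illustration; a real crawler would download robots.txt from the site itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (assumption, not from any real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def make_checker(robots_txt):
    """Parse robots.txt text into a checker object without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

checker = make_checker(ROBOTS_TXT)

# "ExampleBot" is a hypothetical user-agent name.
allowed_public = checker.can_fetch("ExampleBot", "http://example.com/index.html")
allowed_private = checker.can_fetch("ExampleBot", "http://example.com/private/notes.html")
print(allowed_public, allowed_private)
```

The protocol is advisory: it only works because well-behaved robots check it before crawling.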
4
Search Engine Interfaces


Gathers input from users
Presents results from the IR system

Often in ranked order
5
Search Engine Interfaces

Input
User requirements
 Search expression, search limits
Presentation style
 Presentation format, search type
6
Search Engine Interfaces

Output
 Results
 Descriptions
 Clusters

7
Search Term Matching


Trying to find a match in the database
Two main methods
Keyword searching
 Matching single terms, computing cosine similarity
Concept-based searching
 Examining clusters of words
 Attempt to determine meaning of query and find records related to that meaning
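
To make the keyword-matching idea concrete, here is a small sketch of cosine similarity between a query and two documents, treating each text as a vector of raw word counts. The documents and the query are invented, and real engines typically weight terms rather than using raw counts.

```python
import math
from collections import Counter

def term_vector(text):
    """Represent a text as a bag of lowercase words with their counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Illustrative documents (assumptions, not from the slides).
docs = {
    "d1": "web crawler follows links",
    "d2": "search engine database of web pages",
}
query = term_vector("web crawler")

scores = {name: cosine(query, term_vector(text)) for name, text in docs.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

A score near 1 means the query and document use the same words in similar proportions; a score of 0 means they share no words.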

8
Basic IR Features

Boolean operators
 AND, OR, NOT, grouping
Extended operators
 NEAR, ADJACENT, quoted phrases (")
Stop word deletion
Stemming
Searching in fields (e.g. host)
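
A rough sketch of how Boolean operators and stop word deletion can be implemented over an inverted index (a map from each word to the set of documents containing it). The stop word list and documents are tiny made-up examples, and stemming is left out to keep the sketch short.

```python
from collections import defaultdict

# Tiny illustrative stop word list (assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "the", "of", "in"}

def build_index(docs):
    """Map each non-stop word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:
                index[word].add(doc_id)
    return index

# Invented documents for the example.
docs = {
    "d1": "the history of the web",
    "d2": "a web crawler in python",
    "d3": "python in the zoo",
}
index = build_index(docs)

# Boolean operators map directly onto set operations.
web_and_python = index["web"] & index["python"]   # AND
web_or_python = index["web"] | index["python"]    # OR
python_not_web = index["python"] - index["web"]   # NOT (as "python AND NOT web")

print(web_and_python, web_or_python, python_not_web)
```

Grouping with parentheses corresponds to nesting these set expressions.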
9
Ranked Output

Most SEs produce ranked lists by applying simple rules:
 Early words are more important
 Title is very important
 Frequency of occurrence matters for some
 Infrequent words matter more
 Modification date
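
The rules above can be combined into a toy scoring function. This is only a sketch: the title weight of 3.0, the position discount, and the rarity formula are all assumptions chosen for illustration, not the rules of any particular engine, and modification date is ignored.

```python
import math

def document_frequency(docs):
    """Count, for each word, how many documents contain it (title or body)."""
    freq = {}
    for doc in docs.values():
        words = set((doc["title"] + " " + doc["body"]).lower().split())
        for word in words:
            freq[word] = freq.get(word, 0) + 1
    return freq

def score(query_terms, doc, doc_freq, n_docs):
    """Toy score: rare terms weigh more; title hits and early body hits get a bonus."""
    title = doc["title"].lower().split()
    body = doc["body"].lower().split()
    total = 0.0
    for term in query_terms:
        # Infrequent words matter more (a smoothed inverse document frequency).
        rarity = math.log((n_docs + 1) / (doc_freq.get(term, 0) + 1)) + 1.0
        if term in title:
            total += 3.0 * rarity              # assumed title weight
        for position, word in enumerate(body):
            if word == term:
                total += rarity / (1 + position)  # earlier occurrences count more
    return total

# Invented documents for the example.
docs = {
    "A": {"title": "crawler basics", "body": "a crawler follows links on the web"},
    "B": {"title": "cooking notes", "body": "recipes from around the web and a note on a crawler"},
}
doc_freq = document_frequency(docs)
query = ["crawler"]
ranked = sorted(docs, key=lambda d: score(query, docs[d], doc_freq, len(docs)), reverse=True)
print(ranked)
```

Document A ranks first here because the query word appears in its title and early in its body.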
Google is different:
 PageRank™ method based on popularity
 Links as money
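
A simplified sketch of the "links as money" idea: each page splits its current rank evenly among the pages it links to, and the process repeats until the values settle. The three-page link graph is invented, and the damping factor of 0.85 is the commonly cited value, used here as an assumption; this is not a description of Google's production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively redistribute rank along links; every link target must be a key of `links`."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page "pays" an equal share of its rank to each page it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: both inner pages link back to "home".
links = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["home"],
}
ranks = pagerank(links)
print(ranks)
```

The page that receives links from the most (and the best-funded) pages ends up with the highest rank, here "home".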
10
Googlebombing

Google spoofed from the lecture list
first hit from 1992
 Official GoogleBlog explanation

11
What about the Invisible Web?


Also known as the Deep Web
Documents that are on the WWW but not indexed by search engines
 Some are available only by submitting forms
 Some are not generally accessible (in subnets)
 Some are not in (X)HTML format

12
The Invisible Web Isn't So Invisible Anymore…


More search engines parse non-(X)HTML formats now than before
Because of awareness of the problem, companies are making more content available using
 Stable URLs
 Robot-friendly sitemaps


But much content is still not indexed
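
As one illustration of a robot-friendly sitemap, the sketch below builds a minimal XML file in the sitemaps.org format (a urlset of url/loc entries). The example.com URLs are placeholders, and optional fields such as last-modified dates are omitted.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return a minimal sitemap document listing the given stable URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")

# Placeholder URLs (assumptions for the example).
sitemap_xml = build_sitemap([
    "http://example.com/",
    "http://example.com/reports/annual.html",
])
print(sitemap_xml)
```

Listing pages this way helps robots reach content that is not linked from anywhere they would otherwise crawl.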
13
But, there's still plenty of important yet invisible docs

How to find them?
Many of them are in databases
 Use database tools from the U.'s library
 Especially for research articles
No one search engine covers everything
 Use multiple search engines or a metacrawler
 Dogpile is the most famous
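
A metacrawler has to merge the ranked lists it gets back from several engines. The sketch below uses a simple reciprocal-rank rule (a result earns 1/position from each engine that returns it), which is one of several possible merging rules and not necessarily what Dogpile does. The engines and URLs are hypothetical.

```python
from collections import defaultdict

def merge_results(result_lists):
    """Merge ranked URL lists: each engine contributes 1/position to a URL's score."""
    scores = defaultdict(float)
    for results in result_lists:
        for position, url in enumerate(results, start=1):
            scores[url] += 1.0 / position
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical result lists from two engines.
engine_a = ["http://example.org/a", "http://example.org/b", "http://example.org/c"]
engine_b = ["http://example.org/b", "http://example.org/d"]

merged = merge_results([engine_a, engine_b])
print(merged)
```

Results returned by more than one engine rise toward the top, which is the main benefit of querying several engines at once.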
14
Search Engines
A Summary of Practical Advice
How To Succeed With SEs

As a surfer:

If you don't know what you are looking for
Use multiple SEs, or a meta-crawler
 Search within results


If you don't know what you are looking for
Use multiple SEs, or a meta-crawler
 Use Boolean expressions or search within results
 Consider specialized engines

16
How To Succeed With SEs

As a creator:

HTML level
 Always use ALT attributes with <IMG>, etc.
 Avoid frames
Make it easier to index
 Don't expect SEs to find your pages
 Make links between your pages
Use metadata
 Informal: <meta name="description" …>
 Formal: Dublin Core and others
Increase your pages' popularity
 Don't use systematic reciprocal linking: rings, exchanges, lists
 The PageRank™ a page passes along each link is inversely proportional to its outdegree
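
A page author could check the HTML-level advice automatically. The sketch below, using Python's standard html.parser, counts IMG tags with no ALT attribute and looks for a description meta tag. The sample page and the PageAudit class name are invented for illustration.

```python
from html.parser import HTMLParser

class PageAudit(HTMLParser):
    """Collect two simple indexability checks while parsing a page."""

    def __init__(self):
        super().__init__()
        self.images_missing_alt = 0
        self.has_description = False

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        attributes = dict(attrs)
        if tag == "img" and "alt" not in attributes:
            self.images_missing_alt += 1
        if tag == "meta" and (attributes.get("name") or "").lower() == "description":
            if attributes.get("content"):
                self.has_description = True

# Made-up sample page: one image has ALT text, one does not.
SAMPLE = """<html><head><title>Demo</title>
<meta name="description" content="A demo page about search engines">
</head><body>
<img src="logo.png" alt="Site logo">
<img src="chart.png">
</body></html>"""

audit = PageAudit()
audit.feed(SAMPLE)
print(audit.images_missing_alt, audit.has_description)
```

Running a check like this over every page before publishing catches the omissions that make pages harder for robots to index.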
17
How To Succeed With SEs


As a creator (cont.)
For surfers:
Use <meta name="description" …>
 Don't expect surfers to start at the top of your hierarchy

Don't rely on a hierarchy
 Include a context map near the top of each page
 Don't use frames
 Think through dynamic content implications
 Stickiness… is for another day

18