CS315-L08-SearchEngineBasics


Search Engine Basics
How do people search for information?
Using a Search Engine

Enter your search terms and hope for the best.
Direct navigation

Enter the URL directly into the browser.
Navigation within a Directory

Use a web portal as an entry point to the web.
Information seeking on the web is not straightforward
and people use a combination of techniques
but often turn to search engines.
Web Search is not simply IR
The scale of web search
is way beyond traditional IR.
The web is very dynamic.
The web contains
an enormous amount of duplication.
The quality of web pages is not uniform.
The range of topics on the web is open.
The web is globally distributed.
Users' typical habits are different
(short queries averaging 2.3 words; users inspect only the top 10 results).
The web is hypertextual.
From Information Need to Answers
Let’s say you discovered mice at home. Problem!
You would like to eliminate the problem.
How would you search for solutions?
Formulate the query terms:
You are a nice person, so you would like to get rid of them in
a politically correct way.
What changes in your search terms?
The classic search model
User task: Get rid of mice in a politically correct way
   | (Misconception?)
Info need: Info about removing mice without killing them
   | (Misformulation?)
Query: how trap mice alive
   |
Search engine (searches the Collection)
   |
Results
   |
Query refinement (back to Query)
Types of Needs for Web Search
Corpus: The publicly accessible Web
Need: Retrieve high quality results relevant to the user’s need.

Characterize the particular need:
Types of need
• Informational – want to learn about something (e.g., Low hemoglobin)
• Navigational – want to go to that page (e.g., United Airlines)
• Transactional – want to do something (web-mediated)
  • Access a service (e.g., Tampere weather)
  • Downloads (e.g., Mars surface images)
  • Shop (e.g., Nikon CoolPix)
• Gray areas
  • Find a good hub (e.g., Car rental Finland)
  • Exploratory search “see what’s there” (e.g., Abortion morality)
Categories of Web searchers
Broder: A taxonomy of web search
Informational ____ %

acquire some information about a topic from web pages.
Navigational ____ %

find a site to start navigation from.
Transactional ____ %

perform some activity mediated by a web site.
Think of your own searches. Do you agree?
How did Broder find these categories?
How did he measure the percentages?
Search Engines: The players and the field
The mechanics of a typical search.
The search engine wars.
Statistics from search engine logs.
The architecture of a search engine.
The query engine.
Mechanics: Results & ads come ranked
Result for phrase query
Search Engine Wars
The battle for domination of the web search space!
The competition is good news for users!
Crucial:
advertising is combined with search results!
What if one of the search engines
manages to dominate the space?
Yahoo!
Synonymous with the .com boom,
once the best known brand on the web.
Started off as a web directory service in 1994,
acquired Inktomi search engine technology in 2003.
Has very strong advertising and e-commerce partners
Lycos
One of the pioneers of the field
Introduced innovations in 1996 that
inspired the creation of Google
Google
To “Google” is synonymous with Web searching.
Has raised the bar on search quality
The most popular SE in the last few years.
Is innovative and dynamic.
Ask Jeeves
Specializes in natural language question answering.
Search driven by Teoma.
Tries to differ…
Bing (was: Live Search (was: MSN Search))
Successful third reincarnation of previous attempts.
Pyrrhic victory in the browser wars with Netscape.
Was synonymous with PC software.
“Stop searching, start deciding”: turned Google into a copycat!
Cuil
Newer player
Claimed to have indexed 120B pages!
It did not rank!
Other “search engines”…
How do you decide which is best?
How do you measure
similarity in ranking?
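One simple way to approach the ranking-similarity question is to compare the top-k result sets of two engines. The sketch below uses Jaccard overlap, which is only one possible measure and is not taken from the slides; the function name `topk_overlap` and the result lists are hypothetical.

```python
def topk_overlap(ranking_a, ranking_b, k=10):
    """Jaccard overlap of the top-k results of two rankings.

    Returns 1.0 when both engines show the same k results (in any order)
    and 0.0 when they share none.
    """
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    if not top_a and not top_b:
        return 1.0  # two empty result lists are trivially identical
    return len(top_a & top_b) / len(top_a | top_b)


# Hypothetical result lists from two engines for the same query.
engine_x = ["u1", "u2", "u3", "u4"]
engine_y = ["u2", "u1", "u5", "u3"]
print(topk_overlap(engine_x, engine_y, k=3))  # 2 shared of 4 distinct -> 0.5
```

This measure ignores the order inside the top k; an order-aware measure such as a rank correlation would be needed to capture position differences.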
How many people use the web? SEs?
Over 10% of the world’s population were online as of 2004.
Today? ________
Number of broadband users is growing
(over 50% of connected Americans use broadband).
Search engine share as of June 2004:

Google (41.6%), Yahoo! (31.5%), MSN (27.4%),
AOL (13.6%), Ask Jeeves (7%)
Today? _______
200 million hits per day to Google (mid 2004).
Today? ___
Search Engines as Info Gatekeepers
Search engines are becoming
the primary entry point for discovering web pages.
Ranking of web pages
influences which pages users will view.
Exclusion of a site from search engines
will cut off the site from its intended audience.
The privacy policy of a search engine is important.
Introna & Nissenbaum: Defining the Web: The Politics of Search Engines
Hindman et al: Googlearchy: How a few Heavily-Linked Sites Dominate Politics on the Web
Architecture of a Search Engine
[Figure: search engine architecture. Labeled components: User, Search, Web spider, Indexer, Indexes, Ad indexes, The Web (callouts A to E). The example screen is a results page for the query "miele" (Results 1 - 10 of about 7,310,000, 0.12 seconds), with Sponsored Links shown next to the ranked web results.]
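The indexer and indexes named in the architecture can be illustrated with a toy inverted index. This is a minimal sketch under stated assumptions, not the design of any real engine: the document IDs, the texts, and the function names `build_index` and `search` are hypothetical, tokenization is plain whitespace splitting, and no ranking is done.

```python
from collections import defaultdict


def build_index(pages):
    """Indexer: map each term to the set of pages that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return index


def search(index, query):
    """Query engine: pages containing every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)


# Hypothetical crawled pages.
pages = {
    "doc1": "how to trap mice alive",
    "doc2": "mice poison reviews",
    "doc3": "humane mice trap",
}
index = build_index(pages)
print(search(index, "trap mice"))  # ['doc1', 'doc3']
```

A production engine would add ranking, phrase handling, and a separate ad index; the sketch only shows why lookups are fast once the index exists.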
Newer features: suggest
Newer features: Trends
N-grams
How to Crawl the Web
• Frequency of crawl: important
• How do you design an experiment to measure frequency of crawls by S.E.’s on a particular page?
• Parallel machines crawl all the time
• robots.txt gives explicit directions on what not to crawl
• Mode of crawl: BFS or DFS or …
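The robots.txt point can be tried out with Python's standard `urllib.robotparser` module. The rules, the `ExampleBot` user agent, and the example.com URLs below are hypothetical; a real crawler would fetch the site's own robots.txt instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every crawler must stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(url, user_agent="ExampleBot"):
    """Return True if the rules above permit `user_agent` to crawl `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed("http://example.com/index.html"))          # True
print(allowed("http://example.com/private/notes.html"))  # False
```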
Where to start? How to crawl?
Crawling starting point
Put a starting page in a queue Q & repeat:
• Pick up a page P from the queue,
• Crawl P, and
• Put on the queue each page reachable from P
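The queue-based loop above can be sketched as a breadth-first crawl over a toy link graph. The graph `TOY_WEB` and the helper `toy_links` are hypothetical stand-ins for fetching and parsing real pages, and the `seen` set is an addition not stated on the slide; without it the loop would re-crawl pages forever on any cycle.

```python
from collections import deque

# Hypothetical toy web: each page maps to the pages it links to.
TOY_WEB = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "A"],
    "D": [],
}


def toy_links(page):
    """Stand-in for downloading a page and extracting its links."""
    return TOY_WEB.get(page, [])


def crawl(start, get_links):
    """Breadth-first crawl from `start`; returns pages in crawl order."""
    queue = deque([start])  # the queue Q, seeded with the starting page
    seen = {start}          # assumption: remember pages already queued
    order = []
    while queue:
        page = queue.popleft()        # pick up a page P from the queue
        order.append(page)            # "crawl" P (here: record the visit)
        for link in get_links(page):  # each page reachable from P
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


print(crawl("A", toy_links))  # ['A', 'B', 'C', 'D']
```

Taking pages from the same end they were added to (`queue.pop()` instead of `queue.popleft()`) turns the queue into a stack and gives a depth-first style crawl.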
Web Search vs. Intranet Search
Local search engines on web sites have a bad reputation.
Users often use a web search engine to find information
on web sites,
rather than the local web site search engine.
Many companies do not invest in local search.
Content management is a problem.
Information needs on web sites may be different.
The most popular search keywords
AltaVista (1998)   AlltheWeb (2002)   Excite (2001)
sex                free               free
applet             sex                sex
porno              download           pictures
mp3                software           new
chat               uk                 nude