Search Engines - Wellesley College

Download Report

Transcript Search Engines - Wellesley College

CS 115: COMPUTING FOR THE
SOCIO-TECHNO WEB
FINDING INFORMATION WITH
SEARCH ENGINES
THE WEB IS A DIRECTED GRAPH
Like a map of a country
with cities and
one-way roads
Directed Graph of
Nodes and
Arcs (one-way connections)


Nodes = web pages
Arcs = hyperlinks from a
page to another
Why is this cool?
Because…
it can be explored
it can be indexed
GOOGLE
HOW SEARCH ENGINES WORK
Sponsored Links
CG Appliance Express
Discount Appliances (650) 756-3931
Same Day Certified Installation
www.cgappliance.com
San Francisco-Oakland-San Jose,
CA
User
A
Miele Vacuum Cleaners
Miele Vacuums- Complete Selection
Free Shipping!
www.vacuums.com
Miele Vacuum Cleaners
Miele-Free Air shipping!
All models. Helpful advice.
www.best-vacuum.com
Web
Results 1 - 10 of about 7,310,000 for miele. (0.12 seconds)
Miele, Inc -- Anything else is a compromise
B
Web spider
At the heart of your home, Appliances by Miele. ... USA. to miele.com. Residential Appliances.
Vacuum Cleaners. Dishwashers. Cooking Appliances. Steam Oven. Coffee System ...
www.miele.com/ - 20k - Cached - Similar pages
Miele
Welcome to Miele, the home of the very best appliances and kitchens in the world.
www.miele.co.uk/ - 3k - Cached - Similar pages
Miele - Deutscher Hersteller von Einbaugeräten, Hausgeräten ... - [ Translate this
page ]
Das Portal zum Thema Essen & Geniessen online unter www.zu-tisch.de. Miele weltweit
...ein Leben lang. ... Wählen Sie die Miele Vertretung Ihres Landes.
www.miele.de/ - 10k - Cached - Similar pages
Herzlich willkommen bei Miele Österreich - [ Translate this page ]
Herzlich willkommen bei Miele Österreich Wenn Sie nicht automatisch
weitergeleitet werden, klicken Sie bitte hier! HAUSHALTSGERÄTE ...
www.miele.at/ - 3k - Cached - Similar pages
E
C
Search
Indexer
D
The Web
Indexes
Ad indexes
BINARY SEARCH
Halve things each time
MECHANICS OF A
TYPICAL SEARCH
WEB DIRECTORIES
ORGANIZE
INFORMATION IN
CATEGORIES
WITH HUMAN HELP
WHAT ARE YOU TRYING
TO FIND?
Types of queries:
• Informational – want to learn about something
Peripheral neuropathy
• Navigational – want to go to that page
Wellesley College
• Transactional – want to do something (web-mediated)
• Access a service
• Downloads
• Shop
• Gray areas
• Find a good hub (resource collection)
• Exploratory search “see what’s there”
Wellesley weather
Mars surface images
Nikon SLR camera
car rental Boston
morality of abortion
HOW FAR DO YOU LOOK FOR
RESULTS?
DIVERSITY IN
CONTENT
Languages
• Hundreds of languages (2001)
• Home pages (1997): English 82%, Next 15 languages: 13%
• Google’s index (mid 2001): English: 53%, JGCFSKRIP: 30%
• This trend is expected to continue today
Popular Query Topics (from 1 million Google queries, Apr 2000)
Arts
14.6%
Arts: Music
6.1%
Computers
13.8%
Regional: North America
5.3%
Regional
10.3%
Adult: Image Galleries
4.4%
Society
8.7%
Computers: Software
3.4%
Adult
8%
Computers: Internet
3.2%
Recreation
7.3%
Business: Industries
2.3%
Business
7.2%
Regional: Europe
1.8%
…
…
…
…
QUESTIONS ABOUT THE WEB
How big is the Web?
How many people use the Web?
How many people use search engines?
How hard is it to go from one page to another
through clicks?
What is the shape of the Web?
HOW BIG IS THE WEB?
Number of accessible web pages (the visible web)
• Google claims to have encountered 1 trillion unique URLs (though in the
past claimed to have indexed 26.6 billion pages
• Yahoo claims to have indexed 55 billion pages
• Cuil claims to have indexed 120 billion pages
The deep web (or hidden or invisible web)
• “contains 400-550 times more information”
Coverage (i.e. the proportion of the web indexed)
is crucial for search engines.
• Today, less than 15% pages are indexed!
CAN YOU MEASURE THE SIZE OF
WEB?
How do you count fish on a lake?
• Lincoln-Petersen Method aka:
Capture-Mark-Recapture method
• M = # of fish captured and marked; released.
• C= # of fish returned in second visit.
• R = # of marked fish in second visit.
Estimate :
M/N = R/C =>
N= (M x C) / R
N
M
R
C
CAN YOU MEASURE THE SIZE OF
WEB?
Capture-Mark-Recapture method
• SE1 = # of pages indexed search engine 1.
• QSE2 = # of pages returned by search engine 2 for typical queries.
• OVR = # of pages returned by both search engines for typical queries.
Estimate :
SE1 / WWW = OVR / QSE2 =>
WWW = (SE1 x QSE2) / OVR
WWW
SE1
OVR
QSE2
HOW MANY PEOPLE USE THE WEB?
87% of the American online
• 70% of Americans use broadband at home
68.0% of Americans access the internet on a cell phone, tablet,
or other mobile device
39% of the world had internet access in 2013
What does this tell you about the importance of the Web?
HOW MANY PEOPLE USE SEARCH
ENGINES?
49% of all internet users use a search engine on a daily basis
• 622 million queries per day (18.6 billion searches in April, 2014)
Search engine usage as of June 2004:
• Google (41.6%), Yahoo! (31.5%), MSN (27.4%), AOL (13.6%), Ask (7%)
Search engine usage as of April 2014:
• Google (67.6%), Yahoo! (10%), MSN (18.7%), AOL (1.3%), Ask (2.4%)
What does this tell you about the importance of the Search
Engines?
HOW HARD IS IT TO SURF
FROM ONE PAGE TO ANOTHER?
Over 75% of the time there is no directed path
from one random web page to another.
When a directed path exists
its average length is 16 clicks.
Short average path between pairs of nodes
is characteristic of a small-world network.
“Map of the Internet” (1998)
WHAT IS THE SHAPE OF THE WEB?
BOW-TIE SHAPE OF THE WEB
WHAT DOES THE WEB LOOK
LIKE?
STRONGLY CONNECTED
COMPONENT
strongly connected component (SCC) in a directed graph is a
subset of the nodes such that:
(i) every node in the subset has a path to every other; and
(ii) the subset is not part of some larger set with the property that
every node can reach every other.
A CONSTRUCTIVE ALGORITHM TO
PROVE THAT THE WEB IS A BOWTIE
Start with disconnected Web pages
Examine the shape after 1 link/page is considered
Bowtie appears after the 2nd link per page is considered
After that, the Bowtie shape gets stronger
AFTER ONE LINK IS
CONSIDERED
A collection of pseudo-trees
AFTER A SECOND LINK IS
CONSIDERED
A collection of bowties
WHEN MORE LINKS ARE
INCLUDED…
Consider the combinations of links
within the same bowtie
between bowties
Bowties are everywhere!
CORRECT THE SHAPE OF THE
WEB
EXERCISES
1-Draw a web graph of the course class website.
2-Handout
HOW ABOUT THE CLASS WEB?
Crawling starting point
CAN SE’S COVER ALL THE
WEB?
Crawling starting point
Put a starting Web page in a queue Q & repeat:
• Pick up a page P from the queue,
• Crawl P, and
• Put on the queue each page reachable from P