Programming Parallel N-Body Codes with the BSP Model


Web Search Engines and
Information Retrieval on the World-Wide Web
Torsten Suel
CIS Department
[email protected]
http://cis.poly.edu/suel
Overview:
• introduction and motivation
• research: improving cluster-based search engines
• research: future peer-to-peer search engine architectures
1. Introduction and Motivation
Web search engines:
1. Introduction and Motivation (cont.)
Basic structure of a search engine:
[Figure: the crawler fetches pages to disks, indexing builds the index, and a query such as “computer” entered at Search.com is answered by an index look-up]
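The basic structure above (crawl, index, look up) can be illustrated with a toy inverted index. This is a minimal sketch, not the design of any particular engine: the sample pages and the names `PAGES`, `tokenize`, `build_index`, and `look_up` are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict

# Hypothetical "crawled" pages: URL -> page text (a stand-in for crawler output).
PAGES = {
    "http://example.com/a": "Computer science and web search",
    "http://example.com/b": "Cooking recipes for the weekend",
    "http://example.com/c": "Buying a new computer for search research",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Indexing step: map each term to the sorted list of URLs containing it."""
    postings = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            postings[term].add(url)
    return {term: sorted(urls) for term, urls in postings.items()}

def look_up(index, query):
    """Query step: return the URLs that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)

INDEX = build_index(PAGES)
print(look_up(INDEX, "computer"))
```

A real engine would keep the postings on disk and rank the matches; this sketch only shows why a look-up touches the index rather than the pages themselves.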
1. Introduction and Motivation (cont.)
Challenges for search engines:
• coverage
(need to cover large part of the web)
need to crawl and store massive data sets
• good ranking
(in the case of broad queries)
smart information retrieval techniques
• freshness
(need to update content)
frequent recrawling of content
• user load
(up to 10000 queries/sec - Google)
many queries on massive data
• manipulation
(sites want to be listed first)
most techniques will be exploited quickly
1. Introduction and Motivation (cont.)
• more than 3 billion web pages and 10 million web sites
• need to crawl, store, and process terabytes of data
• 10000 queries / second (Google)
• cluster of more than 5000 Linux servers (Google)
• “planetary-scale web service”
(google, hotmail, yahoo, aol web caches, akamai)
• proprietary code and secret recipes
1. Introduction and Motivation (cont.)
Other types of web search tools
• Web directories
(yahoo, open directory project)
• Specialized search engines (cora, citeseer, achoo, findlaw)
• Local search engines
(for one site)
• Meta search engines
(dogpile, mamma, search.com)
• Personal search assistants
(alexa, google toolbar)
• Image search
(ditto, visoo)
• Database search
(completeplanet, brightplanet)
1. Introduction and Motivation (cont.)
Data collection, extraction & mining tools
• Example: Whizbang job database:
- collects job announcements on company web sites
- focused crawling to track down job announcements
- sorts job announcements by type, locations, etc.
• trademark and copyright enforcement
- track down mp3 and video files
- track down images with logos (Cobion)
• comparison shopping and auction bots
• competitive intelligence
• national security: monitoring certain websites
1. Introduction and Motivation (cont.)
[Figure: related areas: machine learning, systems, information retrieval, AI, natural language processing, algorithms, databases]
2. Cluster-Based Search Engines
Research Challenges:
• efficiency and scaling with query load
- per-node performance
- scaling cluster size
• data size and scaling with the web
- data acquisition: crawling and refresh
- index size and performance
- index updates
• better ranking for improved results
- link-based ranking
- topic- and context-specific ranking
Polybot crawler:
(with Vlad Shkapenyuk)
• scalable web crawler
• runs on cluster of servers
• 300 pages/sec (and beyond)
Storage and Indexing:
(Alex Okulov and Xiaohui Long)
• storing and indexing terabytes on network of workstations
• fast compression techniques for storage
• index performance and index updates
• index partitioning
[Figure: Linux servers with several disks each, connected by a high-speed LAN or SAN]
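The slides do not say which compression techniques are used. As one common illustration of fast index compression, the sketch below stores a posting list as gaps between document IDs and encodes each gap with variable-byte coding. The choice of technique and all function names here are assumptions for illustration only.

```python
def encode_gaps(doc_ids):
    """Turn a strictly increasing docID list into gaps (the first ID is kept as-is)."""
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps

def decode_gaps(gaps):
    """Inverse of encode_gaps: running sums give back the docIDs."""
    doc_ids, total = [], 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids

def vbyte_encode(numbers):
    """Encode non-negative ints 7 bits at a time, low-order bits first.
    The high bit is set only on the last byte of each number."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)   # 7 payload bits, more bytes follow
            n >>= 7
        out.append(n | 0x80)       # final byte of this number
    return bytes(out)

def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:
            numbers.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return numbers

POSTINGS = [3, 7, 150, 100000, 100001]   # hypothetical docIDs for one term
ENCODED = vbyte_encode(encode_gaps(POSTINGS))
DECODED = decode_gaps(vbyte_decode(ENCODED))
print(len(ENCODED), DECODED == POSTINGS)
```

Small gaps fit in a single byte, which is why gap encoding pays off for frequent terms whose postings are dense.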
Link-based ranking
(Yenyu Chen and Qingqing Gan)
• PageRank (Brin & Page / Google)
“significance of a page depends on significance of those referencing it”
• improving link-based ranking
• integration of term- and link-based methods
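The idea that a page's significance depends on the significance of the pages referencing it can be sketched as a power iteration over the link graph. This is a simplified illustration, not the lab's or Google's implementation: the damping factor of 0.85, the iteration count, the handling of pages without out-links, and the tiny `GRAPH` are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping a page to the list of pages it links to.
    Returns a dict mapping each page to its score (scores sum to 1)."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Each page passes an equal share of its rank to every page it links to.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical four-page web: "c" is referenced by three pages, "d" by none.
GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
RANKS = pagerank(GRAPH)
for page in sorted(RANKS, key=RANKS.get, reverse=True):
    print(page, round(RANKS[page], 4))
```

In this toy graph the page with the most incoming references ends up ranked highest, and the page that nobody links to keeps only the baseline score.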
3. Peer-to-Peer Search Engine Architectures
Future Search Engines and Search Tools
• expect powerful user interfaces beyond browser
- browsing assistants
- search and navigation tools
• many more search engine accesses
• most accesses programmatic in nature
• idea: split search engine into upper and lower tier
- lower tier: crawling, indexing, index queries (dumb, big data)
- upper tier: ranking, interface, analysis (smart stuff)
• idea: lower layer as highly distributed substrate to
support search and navigation tools
- open and agnostic
- scalable
“let a thousand flowers bloom”
“let a million queries fly”
P2P web search architecture:
• thousands of powerful machines all over the internet
• machines can join or leave
• agnostic: can implement many IR methods on top
[Figure: several search engines running on top of the shared P2P substrate]
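The slides do not specify how index data would be placed on machines that can join or leave. One well-known technique for that setting is consistent hashing, sketched below as an assumption rather than as the proposed architecture. The `Ring` class, the node names, and the number of virtual points per node are hypothetical.

```python
import bisect
import hashlib

def _hash(key):
    """Map a string to a point on the ring (a 64-bit integer)."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest()[:8], "big")

class Ring:
    """Consistent-hash ring that assigns index terms to nodes."""

    def __init__(self, replicas=64):
        self.replicas = replicas   # virtual points per node, to even out load
        self._points = []          # sorted list of (hash, node)

    def join(self, node):
        """Add a node by placing its virtual points on the ring."""
        for i in range(self.replicas):
            bisect.insort(self._points, (_hash(f"{node}#{i}"), node))

    def leave(self, node):
        """Remove all of a node's virtual points."""
        self._points = [(h, n) for h, n in self._points if n != node]

    def node_for(self, term):
        """Return the node owning the first ring point at or after the term's hash."""
        if not self._points:
            raise LookupError("no nodes in the ring")
        i = bisect.bisect_left(self._points, (_hash(term), ""))
        return self._points[i % len(self._points)][1]

RING = Ring()
for name in ("node-a", "node-b", "node-c"):
    RING.join(name)

TERMS = [f"term{i}" for i in range(1000)]
BEFORE = {t: RING.node_for(t) for t in TERMS}
RING.leave("node-b")
AFTER = {t: RING.node_for(t) for t in TERMS}
MOVED = [t for t in TERMS if BEFORE[t] != AFTER[t]]
print(len(MOVED), "of", len(TERMS), "terms changed owner")
```

The property of interest is that when a machine leaves, only the terms it owned are reassigned; terms held by the remaining machines stay where they are.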
West Exploration and Search Technology Lab:
• about 10 grad and undergrad students
• more information: http://cis.poly.edu/westlab
• courses on web search, IR, web protocols
Showcase slides at http://cis.poly.edu/showcase/