Slides (PowerPoint, 2.5 MB) - Department of Computer Science
Cornell Information Science
Research Seminar: The Web Lab
http://weblab.infosci.cornell.edu/
William Y. Arms
Manuel Calimlim
Lucy Walle
Felix Weigel
January 23, 2007
The Web Lab: A Joint Project of Cornell
University and the Internet Archive
Faculty
William Arms, Johannes Gehrke, Dan Huttenlocher, Jon Kleinberg,
Michael Macy, David Strang,...
Researchers
Manuel Calimlim, Dave Lifka, Ruth Mitchell, Lucia Walle, Felix
Weigel,...
Students
Selcuk Aya, Pavel Dmitriev, Blazej Kot, with more than 50 M.Eng.
and undergraduate students from Information Science and Computer
Science
Internet Archive
Brewster Kahle, Tracey Jaquith, Michael Stack, Kris Carpenter,...
2
Introduction to the Web Lab
Mining the History of the Web
The Internet Archive's Web Collection
• Complete crawls of the Web, every two months since 1996
• Total archive is about 110,000,000,000 pages (110 billion)
• Recent crawls are about 60+ TByte (compressed)
• Total archive is about 1,900 TByte (compressed)
• Metadata contains format, links, anchor text
3
The Library Stacks: the Internet Archive
4
The Wayback Machine
Demo:
http://www.archive.org/
5
Research using Metadata about Web Pages
Current NSF grant
Research using anchor text
• links to microsoft.com and google.com
Changes to the link structure of the Web
• differences between crawls
• densification (increases in average node degree)
Formation of online groups
6
Example of Past Work: Social and Information
Networks, Joining a Community
Close to one billion (user, community) instances
Work by: Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and
Xiangyang Lan
7
The Never-ending Research Dialog
RESEARCHER: Here's an analysis we would like to do...
INFORMATION SCIENTIST: We don't know how to do that analysis. Would this be any use to you?
RESEARCHER: Not as you suggest it, but here's another idea...
INFORMATION SCIENTIST: That might be possible, with the following modification...
Let's try it and see.
8
The Role of Web Data
for Social Science Research
Social networks are an important research topic
– Emergence of global phenomena from local effects
• Viral spreading of rumors
– Behavior of individuals in a community
• Roles in discussion threads, herd behavior in opinion polls
– Network structure and dynamics
• Strength of weak ties, triangle relations, homophily
9
How to Observe a Social Network?
• Social network research before the web
– Talk to people, make notes
– Distribute questionnaires, gather statistics
• Problems with this approach
– Tedious task
– Small scale
• The Internet Archive is a great resource for research
– Contains web pages with social networks
– Records the history of the pages
10
Social Networks on the Web
The web contains many social networks
– Sites for social networking, social bookmarking, file sharing
• MySpace, Facebook, Flickr, Delicious
– Community portals
• Yahoo Groups, DBLife
– Encyclopedia and folksonomy projects
• Wikipedia, Wikia
– Review sites and customer comments
• Amazon, Netflix
– Blogs, web forums, Usenet
11
The Bliss and Curse of Digital Data
Opportunities
– Collecting network data at an unprecedented scale
– Verifying hypotheses in many different networks
– Monitoring communities at a finer granularity
– Mining and searching social networks
Challenges
– Finding suitable information on the web
– Extracting information from web pages
– Making web data persistent
– Processing very large data sets
– Access rights and privacy
12
Web Lab and Social Science Research
• Collaboration with Cornell’s Institute for the Social Sciences
• Our goal: Make data available to researchers
– Large web graph database with multiple crawls
– Packaged subsets of crawls for analysis
– Visual extraction tool for creating new data sets (ongoing)
– Small-scale crawling for adding new web sites (starting)
– Full-text indexing (planned)
Demo of the extraction tool available at
http://www.cs.cornell.edu/~weigel/WrapperDemo/
13
Web Data Extraction
Researchers often care less about whole web pages than about
specific substructures inside the pages
– Blog postings
– Web forums
– Social tagging
– News headlines
– Tables of contents
– Bibliographies
– Product details
– Customer reviews
14
Web Data Collaboration Server
Data extraction
• Writing extraction code is a tedious task
• Create tools to make the data easily accessible in a structured
format (e.g., tables in a database)
Data sharing
• Extracting the same data repeatedly is a waste of time and storage
space
• Let users share their data and extraction rules
Data curation
• Web data is often incomplete and erroneous
• Let users collaborate to correct and complete the data
15
Demonstration
Demo of the extraction tool available at
http://www.cs.cornell.edu/~weigel/WrapperDemo/
16
The Web Lab System
[Diagram of system components. Labels: INTERNET ARCHIVE, Web Collection, Wayback Machine, Page store, Text indexes, Structure database, File server, Computer cluster, National supercomputers, CORNELL UNIVERSITY]
17
Technical Processing: the Web Lab
Different types of computer for different functions:
• Networking: Internet 2, National Lambda Rail
• Wayback Machine: Commodity computers with local file systems
• Structure database: Relational database system on large shared memory computer
• Data analysis: Specialized Linux cluster with Hadoop distributed file system and MapReduce programming
18
The Research Process
Select a subset for analysis
• Query the relational database directly with SQL
• Use the GetPages tool on the Web site to send an SQL query
Download the subset
• To the researcher's computer
• To the Web Lab file server
Clean up the data
• MapReduce tasks on the Hadoop cluster
Data analysis
• MapReduce tasks on the Hadoop cluster
19
Selection Methods
By known identifier (Wayback Machine)
web pages with the URL http://www.nsf.gov/
By character string (full text indexing) -- future
all pages containing "Internet is doubling every six months"
all pages containing the SARS-CoV genetic sequence
By metadata criteria
all web pages that link to microsoft.com but not to google.com
all email addresses that I used to receive mail from but have not had
mail from recently*
* Example provided by Marc Smith
20
Benefits of Using a Relational Database
• Simple query language for retrieving data
• Transaction support
• Concurrency control for parallel queries
• Multiple indices for high performance
• Reliability since databases have built-in recovery functionality
21
Metadata Loading
• The crawler outputs compressed metadata files (DAT
files).
• Each DAT file has a set of crawled pages with page
metadata, including things like crawl time, IP address,
mime type, language encoding, etc.
• Most importantly, the outgoing links from each page are
parsed, including the full URL and associated anchor text.
22
Database Schema
Crawl – Name of the crawl from which data is loaded
Page – Metadata about each webpage plus fields to help find and extract the full html text
Link – The outgoing links from crawled pages
Url – Lookup table for unique URLs
Host – Lookup table for unique hostnames
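A minimal sketch of how these tables might fit together, using Python's built-in sqlite3 module in place of the Web Lab's database system. The table names follow the slide; every column name, the sample rows, and the helper functions are assumptions made for illustration, and the Crawl table is reduced to a text column on Page to keep the sketch short. The query is one way to express "all web pages that link to microsoft.com but not to google.com".

```python
import sqlite3

# Table names come from the slide; all column names are hypothetical.
SCHEMA = """
CREATE TABLE Host (host_id INTEGER PRIMARY KEY, hostname TEXT UNIQUE);
CREATE TABLE Url  (url_id  INTEGER PRIMARY KEY, url TEXT UNIQUE,
                   host_id INTEGER REFERENCES Host(host_id));
CREATE TABLE Page (page_id INTEGER PRIMARY KEY,
                   url_id  INTEGER REFERENCES Url(url_id),
                   crawl   TEXT, mime_type TEXT);
CREATE TABLE Link (from_page_id INTEGER REFERENCES Page(page_id),
                   to_url_id    INTEGER REFERENCES Url(url_id),
                   anchor_text  TEXT);
"""

# Pages with at least one link to the first host and none to the second.
QUERY = """
SELECT DISTINCT pu.url
FROM Page p JOIN Url pu ON pu.url_id = p.url_id
WHERE EXISTS (SELECT 1 FROM Link l
              JOIN Url u  ON u.url_id  = l.to_url_id
              JOIN Host h ON h.host_id = u.host_id
              WHERE l.from_page_id = p.page_id AND h.hostname = ?)
  AND NOT EXISTS (SELECT 1 FROM Link l
              JOIN Url u  ON u.url_id  = l.to_url_id
              JOIN Host h ON h.host_id = u.host_id
              WHERE l.from_page_id = p.page_id AND h.hostname = ?)
ORDER BY pu.url
"""

def build_demo_db():
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO Host VALUES (?, ?)",
                     [(1, "microsoft.com"), (2, "google.com"), (3, "example.org")])
    conn.executemany("INSERT INTO Url VALUES (?, ?, ?)",
                     [(1, "http://microsoft.com/", 1), (2, "http://google.com/", 2),
                      (3, "http://example.org/a", 3), (4, "http://example.org/b", 3)])
    conn.executemany("INSERT INTO Page VALUES (?, ?, ?, ?)",
                     [(1, 3, "DJ", "text/html"), (2, 4, "DJ", "text/html")])
    conn.executemany("INSERT INTO Link VALUES (?, ?, ?)",
                     [(1, 1, "Microsoft"),   # page a links to microsoft.com only
                      (2, 1, "Microsoft"),   # page b links to both hosts
                      (2, 2, "Google")])
    return conn

def pages_linking_to_but_not(conn, wanted_host, unwanted_host):
    """Return URLs of pages linking to wanted_host but not to unwanted_host."""
    rows = conn.execute(QUERY, (wanted_host, unwanted_host)).fetchall()
    return [row[0] for row in rows]
```

Separate lookup tables for URLs and hosts keep the very large Link table narrow, since each link row stores integer keys rather than full URL strings.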
23
Crawls Loaded Into SQL DB
Crawl    Period                            Database size  Pages        Links        Urls         Hosts
DJ       Jan-April 2002                    2.5 TB         1.1 billion  26 billion   250 million  16 million
DV       Jan-April 2004                    15 TB          1.3 billion  110 billion  TBD          TBD
EB       Jan-March 2005                    20 TB          3 billion    130 billion  20 billion   380 million
Amazon   Jan-April 2004, Jan-August 2005   570 GB         40 million   3 billion    35 million   356
Cornell  Jan-April 2002, Jan-April 2004    5 GB           800,000      12 million   750,000      40,000
24
Selection from the Database
• Query the relational database directly with SQL
(Contact Manuel Calimlim)
• Use the GetPages tool on the Web site to send an SQL query (work in progress)
25
Demonstration
Demonstration of the Web Lab web site
http://weblab.infosci.cornell.edu/
and the GetPages tool
26
Massive Data Analysis by Non-Specialists
A typical scientist or social scientist:
• Has deep domain knowledge
• Has good algorithmic understanding
• Is often a competent computer user or has a research assistant
who is familiar with languages such as Fortran, Python, and
Matlab, or applications packages such as SAS and Excel.
But...
• Has limited understanding of large-scale data analysis
• Is not skilled at any form of computing that requires parallel
computing or concurrency
Typical problem of scale: Given 100 billion URLs, how do you
identify duplicates?
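The slide leaves the question open. One common approach, sketched here as an assumption rather than as the Web Lab's method, is to scatter the URLs into buckets by a stable hash, so that all copies of a URL land in the same bucket and each bucket can be checked independently on a different machine. The function names and the bucket count are illustrative.

```python
import hashlib
from collections import Counter

def bucket_of(url, num_buckets):
    """Stable hash: the same URL always maps to the same bucket."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

def find_duplicates(urls, num_buckets=4):
    """Return the set of URLs that occur more than once."""
    # Pass 1: scatter URLs into buckets (on a cluster: one file or node each).
    buckets = [[] for _ in range(num_buckets)]
    for url in urls:
        buckets[bucket_of(url, num_buckets)].append(url)
    # Pass 2: each bucket is small enough to count in memory, and no
    # bucket needs to see any other bucket's data.
    duplicates = set()
    for bucket in buckets:
        counts = Counter(bucket)
        duplicates.update(u for u, n in counts.items() if n > 1)
    return duplicates
```

No global data structure is needed: the hash alone decides where each URL goes, which is the same idea MapReduce uses to route keys to reduce tasks.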
27
Hadoop and MapReduce Programming
Hadoop
An open source distributed file system similar to the Google
File System. It supports MapReduce programming.
http://lucene.apache.org/hadoop/
MapReduce
A functional programming style to support large-scale data
analysis without the need for global data structures.
In the 1960s, Fortran gave scientists a simple way to
translate mathematical problems into efficient computer
codes.
MapReduce programming gives researchers a simple way
to run massive data analysis on large computer clusters.
28
The MapReduce Paradigm
[Diagram: input data split into files (split 0 to split 4), then M map tasks, then intermediate files, then R reduce tasks, then output files (Output 0, Output 1)]
Each intermediate file is divided into R partitions.
Each reduce task corresponds to one partition.
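The data flow above can be imitated on one machine. The following is a toy simulation, not Hadoop's API: run_mapreduce and the word-count functions are hypothetical names, and Python's built-in hash stands in for the partitioning function.

```python
from collections import defaultdict

def run_mapreduce(splits, map_fn, reduce_fn, num_reducers):
    """Simulate M map tasks (one per split) feeding R reduce tasks."""
    # Map phase: each map task writes one intermediate "file",
    # divided into R partitions by a hash of the key.
    intermediate = []
    for split in splits:
        partitions = [defaultdict(list) for _ in range(num_reducers)]
        for record in split:
            for key, value in map_fn(record):
                partitions[hash(key) % num_reducers][key].append(value)
        intermediate.append(partitions)
    # Reduce phase: reduce task r reads partition r of every
    # intermediate file and produces output file r.
    outputs = []
    for r in range(num_reducers):
        merged = defaultdict(list)
        for partitions in intermediate:
            for key, values in partitions[r].items():
                merged[key].extend(values)
        outputs.append({k: reduce_fn(k, vs) for k, vs in merged.items()})
    return outputs

def count_words_map(line):
    """Example map function: emit (word, 1) for every word in a line."""
    for word in line.split():
        yield word, 1

def count_words_reduce(word, counts):
    """Example reduce function: add up the counts for one word."""
    return sum(counts)
```

Because a key always hashes to the same partition, all values for that key reach the same reduce task, so the R output files never overlap.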
29
A Web Graph Example
[Figure: example web graph with six nodes, numbered 1 to 6]
30
Building the Web Graph
URLs, pages, and links:
• URLs contained in Web pages may link to pages never crawled
• URLs not canonicalized: different URLs may refer to same page
• Links are from a page to a URL
Web graph from crawl data:
• Nodes are union of pages crawled and URLs seen
• Each node and edge has time interval(s) over which it exists
31
Web Graph Example
Problem:
Given a set of URL pairs in uncanonicalized form (u0, v0), create
a list of all the edges that point to each node of the web graph:
• Replace each u0 or v0 with its canonicalized form u or v.
• Create a list of all nodes of the graph, i.e., the set of unique u.
• Discard all (u, v) pairs, where u = v, or v is not a node of the graph.
• Discard all duplicate edges.
• For each node v, create a list (v, {u}), where {u} is the set of nodes
that have edges to node v.
Each step is a simple programming task for a small number of links on
a single computer. How can this simplicity be retained with huge
numbers of links on a very large computer cluster?
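For comparison, the five steps can be sketched for a single computer as follows. The canonicalize function is a deliberately minimal, hypothetical stand-in (lowercase the scheme and host, drop the fragment, use "/" for an empty path); the slides do not say which canonicalization rules the Web Lab applies.

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    """Hypothetical minimal canonical form of a URL."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))

def inlink_lists(raw_pairs):
    """Map each node v to the set of nodes u that have an edge to v."""
    # Step 1: canonicalize both ends of every (u0, v0) pair.
    pairs = [(canonicalize(u0), canonicalize(v0)) for u0, v0 in raw_pairs]
    # Step 2: the nodes are the unique from-URLs u.
    nodes = {u for u, _ in pairs}
    # Steps 3 and 4: drop self-links and links to non-nodes;
    # building a set discards duplicate edges.
    edges = {(u, v) for u, v in pairs if u != v and v in nodes}
    # Step 5: for each node v, collect the nodes u that link to it.
    inlinks = {v: set() for v in nodes}
    for u, v in edges:
        inlinks[v].add(u)
    return inlinks
```

This works only while the sets nodes and edges fit in one machine's memory, which is the limitation the question above points at.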
32
MapReduce Example
Map task
Input: (u0, v0)
Output: (u, d) // Indicate that u is a from-URL
        (v, u) // Indicate that v is a to-URL with link from u
d is a dummy marker. Do not output if u = v.
This is simple application code to write.
33
A MapReduce Example
Merge
The input to the reduce process merges the output values
from the map task that correspond to each URL.
For each URL, w, it creates a list:
w, {d, ... , d, u1, ..., uk}
This merge is performed automatically by the system libraries.
34
A MapReduce Example
Reduce
Input: w, {d, ... , d, u1, ..., uk}, where w is any URL.
Output:
If there is no marker d in the list, discard and do not output. This
corresponds to a URL that never appears as the first element
of a (u, v) pair, i.e., a URL that is not a node of the graph.
Otherwise remove duplicates from u1, ..., uk and output.
The output is a to-URL and a list of the nodes that link to it:
v, {u1, ..., uk}
This is simple application code to write.
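The map, merge, and reduce steps can be sketched in plain Python as below. This is an illustration, not Hadoop code: the function names are hypothetical, the merge step that the system libraries perform is imitated with a dictionary, str.lower stands in for real URL canonicalization, and "Do not output if u = v" is read as emitting nothing at all for a self-link.

```python
from collections import defaultdict

MARKER = None  # the dummy marker d; URLs are strings, so None cannot collide

def map_task(pair, canonicalize=str.lower):
    """Emit (u, d) and (v, u) for one raw link (u0, v0)."""
    u, v = canonicalize(pair[0]), canonicalize(pair[1])
    if u == v:
        return  # self-link: emit nothing (one reading of the slide)
    yield u, MARKER  # u is a from-URL
    yield v, u       # v is a to-URL with a link from u

def merge(emitted):
    """Group emitted values by URL; the framework normally does this."""
    grouped = defaultdict(list)
    for key, value in emitted:
        grouped[key].append(value)
    return grouped

def reduce_task(w, values):
    """Return (w, set of from-URLs) if w is a node, else None."""
    if MARKER not in values:
        return None  # w never appears as a from-URL: not a node
    return w, {u for u in values if u is not MARKER}

def inlinks_by_mapreduce(raw_pairs):
    """Run map, merge, and reduce in sequence on one machine."""
    emitted = [kv for pair in raw_pairs for kv in map_task(pair)]
    results = (reduce_task(w, vals) for w, vals in merge(emitted).items())
    return dict(r for r in results if r is not None)
```

Each function looks at one pair or one URL at a time, so no step needs a global list of nodes; the marker d carries the "is a node" information to the reduce task.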
35
For the Future:
Examples of Tools and Services
The Web Lab is steadily building a set of tools for researchers
• API and Web services
• GetPages Web forms to select dataset by query of a relational
database with indexes by date, URL, domain name, file type,
anchor text, etc.
• Focused Web crawling (modification of Heritrix crawler)
• Extraction of Web graph from subset and calculations, e.g.,
PageRank, hubs and authorities
• Graph visualization
• Natural language processing of anchor text
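As an illustration of one calculation named in the list above, here is a textbook power-iteration sketch of PageRank over an in-link structure of the form (v, {u}). It is not the Web Lab's implementation; the damping factor 0.85, the iteration count, and the even redistribution of rank from nodes without out-links are conventional choices assumed here.

```python
def pagerank(inlinks, damping=0.85, iterations=50):
    """inlinks maps each node v to the set of nodes u that link to v.

    Every u must itself be a key of inlinks.
    """
    nodes = list(inlinks)
    n = len(nodes)
    if n == 0:
        return {}
    # Out-degree of u = number of in-link sets that contain u.
    out_degree = {v: 0 for v in nodes}
    for sources in inlinks.values():
        for u in sources:
            out_degree[u] += 1
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no out-links is spread over all nodes.
        dangling = sum(rank[v] for v in nodes if out_degree[v] == 0)
        rank = {
            v: (1 - damping) / n
               + damping * (dangling / n
                            + sum(rank[u] / out_degree[u] for u in inlinks[v]))
            for v in nodes
        }
    return rank
```

The ranks sum to 1 after every iteration, so they can be read as the share of a random surfer's time spent on each node.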
36
The Web Lab is Ready for Use
We are ready to work with a number of researchers:
Systems
Relational database operational
Hadoop pilot cluster (large cluster soon)
File server and web server operational
People
Manuel Calimlim (database)
Lucy Walle (Hadoop + MapReduce)
Tools
A variety of tools in prototype
Experience with large volumes of anchor text and URLs
37
Thanks
This work would not be possible without the forethought and
long standing commitment of Brewster Kahle and the Internet
Archive to capture and preserve the content of the Web for
future generations.
This work has been funded in part by the National Science
Foundation, grants CNS-0403340, DUE-0127308, SES-0537606, IIS-0634677, and IIS-0705774.