WebLifecycles_DataMining

Welcome/Agenda
9:00am – Welcome/Setting the Agenda for the Day
9:10am - 10:30am – Challenges of the Web Now & in the Future
Response to these Challenges
10:30am – BREAK
11:00am - 12:30pm – Intro to Metadata extraction, Data Mining & the
Web Archiving Lifecycle
12:30pm – LUNCH
1:30pm - 3pm – Data Mining Breakout sessions/Deep Dives
3pm – BREAK
3:15pm - 4:30pm – Data Mining Breakout sessions/Deep Dives
4:30pm – Wrap-up & Next Steps
IIPC GA Meeting
Ljubljana, Slovenia
April 26, 2013
Data Mining & Web
Archiving ‘Lifecycles’
Kris Carpenter Negulescu
Internet Archive
Use Cases
Election 2012 Collaborative
NLNZ 2013 Domain and GOV Collections
Wide00002/00005 Crawls
http://home.us.archive.org/~vinay/wide/wide-00002.html
http://home.us.archive.org/~vinay/wide/wide-00005.html
https://webarchive.jira.com/wiki/display/~vinay/Embed+Analysis+for+the+Wide00005+Crawl
IIPC General Assembly, The Hague, May 9, 2011
Traditional “Crawl” Lifecycles
Crawl seeded &
scoped
Lucene Shards
WARCs
Data & Services
Audited and
Monitored
Data Harvested and
De-duplicated
Collection QA’d,
Reports Generated,
Access Services
Deployed/Updated
CDXs/WATs
WARCs Written and
Ingested
WARCs Processed
and Indexed
Analyzing Scope & Quality
Target Resource
“Analysis”
Live Snapshot
Generation
WAT Generation
“Browser” Log Analysis
“Crawl” Log Analysis
Filtering APIs/Feeds
Seeding a Crawler
Frontier & Alternate
Capture Mechanisms
Web Graph Generation
In-link Analyses/Ranking
Anchor, Description, Full
Text Indexing & Mining
Embed & Out-link
Analyses
Preparing to
Collect/Scoping/Framing a
Crawl/Collection
Pre “Crawl” Workflows
Target identification (beyond curatorial
selection…)
• Automated Filtering of Data Sources by Topic,
Geo IP, file format, robots policy or other criteria
• Out-link analyses and ranking from selected
sources, In-link analyses
• Mining Anchor text/Page Descriptions/Title tags (if
not full text)
“Test” Capture Analyses (…routing to proper
capture mechanisms)
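The automated filtering step above can be sketched in a few lines of Python. This is a minimal illustration, not the workflow used for the collections named in this deck: the `ROBOTS` cache, the `SKIP_EXTENSIONS` list, the `example-crawler` agent name, and the candidate URLs are all hypothetical, and a real pipeline would fetch robots.txt files and add topic or Geo IP checks.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt bodies keyed by host; a real workflow would fetch these.
ROBOTS = {
    "example.org": "User-agent: *\nDisallow: /private/\n",
}

# Hypothetical file formats this collection chooses to skip.
SKIP_EXTENSIONS = (".zip", ".iso")


def allowed_by_robots(url, agent="example-crawler"):
    """Return True if the cached robots.txt for the URL's host permits the fetch."""
    body = ROBOTS.get(urlsplit(url).netloc)
    if body is None:
        return True  # assumption: no cached robots.txt means no restriction
    parser = RobotFileParser()
    parser.parse(body.splitlines())
    return parser.can_fetch(agent, url)


def filter_candidates(urls):
    """Keep URLs that pass both the file-format check and the robots check."""
    kept = []
    for url in urls:
        path = urlsplit(url).path.lower()
        if path.endswith(SKIP_EXTENSIONS):
            continue  # unwanted file format
        if not allowed_by_robots(url):
            continue  # excluded by robots policy
        kept.append(url)
    return kept


candidates = [
    "http://example.org/news/index.html",
    "http://example.org/private/report.html",
    "http://example.org/downloads/archive.zip",
    "http://example.net/about.html",
]
seeds = filter_candidates(candidates)
print(seeds)
```

Each criterion is a separate check, so further filters (topic, Geo IP) could be added as additional conditions in `filter_candidates`.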
Your Browser: Behind the Scenes
Extracted Metadata & Links (WAT)
WAT is WARC ☺
WAT records are WARC
metadata records
WARC-Refers-To header
identifies original WARC
record
WAT payload is JSON
Can be combined with
Curator generated
metadata
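Because a WAT record is a WARC metadata record with a JSON payload, it can be read with nothing more than a header parser and a JSON decoder. The sketch below assumes a single, uncompressed record held in a string. The sample record is made up and heavily trimmed, and the JSON key names (`Envelope`, `Payload-Metadata`, `HTML-Metadata`, `Links`) are assumptions modelled on the published WAT layout rather than taken from this deck.

```python
import json

# Made-up, heavily trimmed WAT payload for illustration only.
SAMPLE_PAYLOAD = json.dumps({
    "Envelope": {
        "WARC-Header-Metadata": {"WARC-Target-URI": "http://example.org/"},
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "HTML-Metadata": {
                    "Links": [
                        {"path": "A@/href", "url": "http://example.org/about"},
                        {"path": "IMG@/src", "url": "/logo.png"},
                    ]
                }
            }
        },
    }
})

# WARC headers, a blank line, then the JSON payload.
SAMPLE_RECORD = (
    "WARC/1.0\r\n"
    "WARC-Type: metadata\r\n"
    "WARC-Refers-To: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
    "Content-Type: application/json\r\n"
    f"Content-Length: {len(SAMPLE_PAYLOAD.encode('utf-8'))}\r\n"
    "\r\n" + SAMPLE_PAYLOAD + "\r\n\r\n"
)


def parse_wat_record(text):
    """Split one record into its version line, header dict, and decoded JSON payload."""
    head, _, body = text.partition("\r\n\r\n")
    lines = head.split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    # Content-Length counts bytes, so slice the encoded body.
    length = int(headers["Content-Length"])
    payload = json.loads(body.encode("utf-8")[:length].decode("utf-8"))
    return lines[0], headers, payload


version, headers, payload = parse_wat_record(SAMPLE_RECORD)
html_meta = payload["Envelope"]["Payload-Metadata"]["HTTP-Response-Metadata"]["HTML-Metadata"]
link_urls = [link["url"] for link in html_meta["Links"]]
print(headers["WARC-Refers-To"], link_urls)
```

The `WARC-Refers-To` value is the key for joining extracted metadata back to the original capture, or to curator-generated metadata keyed the same way.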
Monitoring/Enhancing/Confirming
Capture
Comparing Live Resources to Files Written
Evaluating Completeness (at all levels)
Generating Snapshots of Live and Archived
resources
Eliminating Spam/Detecting Scoping Mistakes
& Issues
Mining Crawl Logs (HIVE)
Mining Browser Logs
Mining/Analyzing Links
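As a small illustration of crawl log mining, the sketch below tallies fetch status codes per host, which is one quick way to spot scoping mistakes or hosts returning mostly errors. The slide names Hive for this work; plain Python is used here only as a stand-in. The log lines are made up, and the column layout (timestamp, status, size, URL, ...) is an assumption modelled on Heritrix-style crawl logs.

```python
from collections import Counter
from urllib.parse import urlsplit

# Made-up log lines; column order is an assumption (timestamp, status, size, URL, ...).
SAMPLE_LOG = """\
2013-04-26T09:00:01.000Z   200   5120 http://example.org/ - - text/html #001 20130426090000123+45 sha1:AAAA - -
2013-04-26T09:00:02.000Z   404    312 http://example.org/missing L http://example.org/ text/html #002 20130426090001123+20 sha1:BBBB - -
2013-04-26T09:00:03.000Z   200  20480 http://example.net/logo.png E http://example.org/ image/png #003 20130426090002123+30 sha1:CCCC - -
"""


def status_by_host(log_text):
    """Count (host, status) pairs across whitespace-delimited log lines."""
    counts = Counter()
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        status, url = fields[1], fields[3]
        counts[(urlsplit(url).netloc, status)] += 1
    return counts


counts = status_by_host(SAMPLE_LOG)
for (host, status), n in sorted(counts.items()):
    print(host, status, n)
```

The same grouping maps directly onto a `GROUP BY host, status` query once the log is loaded into a table.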
Characterizing/Documenting/Preserving
Captures & Collections
Enabling Access & Research
Host profiles
Link Graphs, Tag Clouds, & Visualizations
Collection Based: http://home.us.archive.org/~vinay/eot08-exploredata.html
Archive wide: http://home.us.archive.org/~vinay/global/1995-2011/stats.html
http://home.us.archive.org/~vinay/tld.html
Site/Page Evolution
http://archive.org/details/TheNewYorkTimesTimelapse1996-2010
Portal Browse/Search
• http://eotarchive.cdlib.org/
Research Use/Access
• History Tracker (Weber/Lazer)
• ARCLink (AlSum/Nelson)
HistoryTracker Tool
Beta Version!
Curated Data Sets
PIG Scripts in
Hadoop Environment
Link Lists
RU High-Speed
Computing Cluster
