Archive overview and projects too

Important links
• Need to sign up for a “library card”
– http://www.archive.org/account/login.createaccount.php
• Then you can access the following pages:
– www.archive.org/web/researcher/researcher.php
– www.archive.org/web/researcher/data_available.php
– www.archive.org/web/researcher/parallel.php
– www.archive.org/web/researcher/example_research_create_arc.php
Machine overview
• Data stored on ~200 desktop computers
• Host names: ia00xxx (e.g., ia00660)
– Initially, you’ll use ia0010[0-7]
• Four 160GB drives on each
– /0, /1, /2, and /3
– /1-/3 filled to capacity
– /0 filled to 1/2 capacity
• /0/tmp is “temp” space for computations
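As a rough worked example of the numbers above, the sketch below expands the initial host-name pattern and estimates how much data the machines hold. It assumes decimal units and treats "~200" as exactly 200, so the total is only an approximation.

```python
# Host-name pattern from the slide: ia0010[0-7] (the machines used initially).
initial_hosts = [f"ia0010{i}" for i in range(8)]

# Per-machine usage: /1, /2 and /3 are full, /0 is half full, 160 GB per drive.
DRIVE_GB = 160
per_machine_gb = 3 * DRIVE_GB + DRIVE_GB // 2

# "~200" machines is approximate, so this total is an estimate, not a measurement.
MACHINES = 200
total_tb = per_machine_gb * MACHINES / 1000  # decimal terabytes

print(initial_hosts)
print(per_machine_gb, "GB per machine,", total_tb, "TB in total (approx.)")
```

Under these assumptions each machine holds about 560 GB, or roughly 112 TB across the cluster.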
Your account
• Fill out form at:
http://www.soe.ucsc.edu/~raymie/290guserinfo.html
• I’ll take it from there
• Expect an e-mail
Files
• ARC files -- contain raw data
– Multiple documents per file, ~100MB per file
• DAT files -- contain commonly-used fields
• CDX files -- index of ARC and DAT
– /0/tmp/complete.cdx -- per machine
– Archive-wide cdx’s on 6 machines (wayback)
• All compressed (ARC on page boundaries)
ARC format
DAT format
Programs
• Unix tools
– grep, join, cut, awk, perl, screen(!), ...
• Alexa tools
• P2
Alexa tools
• av_arcfilter, av_cat, av_getpage, av_grep, av_prepend_random, av_randomize, av_search, av_sort
P2
• Based on data-parallel programming model
– SIMD, single-instruction, multiple data
– Thinking machines
• Idea: run the same command line on all machines
P2
• p2 program [-c combiner] -p machines
– program: command-line to be run
– combiner: program to combine results
– machines: machines to use
• “-p /net/ia00100 /net/ia00101”
• “-p $rack1”
– $rack[1-5], $arcs
P2 - example
• p2 uptime -p $ARCS
– Returns result of uptime on all machines
• p2 'zcat /0/tmp/complete.cdx.gz | wc -l' -p ..
– Returns length (in lines) of indexes
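For readers who prefer a script to a shell pipeline, here is a minimal Python sketch of what the per-machine command computes: the number of lines in a gzip-compressed index. The function name `count_lines` is made up for this illustration and is not one of the Alexa tools.

```python
import gzip
import sys

def count_lines(path):
    """Count the lines in a gzip-compressed text file.

    Unlike `wc -l`, this also counts a final line that lacks a trailing newline.
    """
    with gzip.open(path, "rb") as f:
        return sum(1 for _ in f)

if __name__ == "__main__" and len(sys.argv) > 1:
    # e.g. python count_lines.py /0/tmp/complete.cdx.gz
    print(count_lines(sys.argv[1]))
```

Reading the file as a stream keeps memory use flat, which matters for a per-machine index of unknown size.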
p2
• Output of “subprograms” sent to initiating “p2” program
• This program “combines” these lines
– By default, av_cat is used to get them to standard output
– The -c option allows the user to set a combiner
• But lines from subprograms can be interleaved
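The run-everywhere-then-combine pattern described above can be sketched in a few lines of Python. This is not how p2 is implemented: `run_everywhere` and `fake_line_count` are hypothetical names, the per-host work is simulated locally with made-up numbers, and each host's output is collected whole, so this sketch does not reproduce the interleaving that p2 can show.

```python
from concurrent.futures import ThreadPoolExecutor

def run_everywhere(run_on_host, hosts, combiner=None):
    """Run the same job on every host and combine the per-host outputs.

    run_on_host(host) returns a list of output lines (a hypothetical callback;
    a real version might start a remote shell here).
    combiner(list_of_line_lists) returns the final result; the default simply
    concatenates the lines in host order.
    """
    if combiner is None:
        combiner = lambda outputs: [line for lines in outputs for line in lines]
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        outputs = list(pool.map(run_on_host, hosts))  # map keeps host order
    return combiner(outputs)

def fake_line_count(host):
    # Stand-in for a per-machine line count: the numbers are invented.
    return [str(1000 + int(host[-1]))]

hosts = [f"ia0010{i}" for i in range(8)]

# Default combiner: one output line per host, concatenated.
per_host = run_everywhere(fake_line_count, hosts)

# Custom combiner: sum the per-host counts into a single total.
total = run_everywhere(
    fake_line_count,
    hosts,
    combiner=lambda outs: sum(int(line) for lines in outs for line in lines),
)
print(per_host)
print(total)
```

A summing combiner like the one here plays the role that the -c option is described as playing: it turns many per-machine answers into one.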
Possible projects
• Crawl catalog
• Counts & histograms
• Page-change
• Word-change study
• Language id
• Table detection
• RSS download/studies
• Id “soft” 404/30x’s
• Mirror detection
• Javascript link extract
• Storage redundancy
• URL database
• Validating host counts
– IP sampling vs. crawls
– Correcting for virtual hosts