Transcript: Web Mining in the Cloud

Web Mining in the Cloud
Hadoop/Cascading/Bixo in EC2
Ken Krugler, Bixo Labs, Inc.
ACM Data Mining SIG
08 December 2009
About me
 Background in vertical web crawl
– Krugle search engine for open source code
– Bixo open source web mining toolkit
 Consultant for companies using EC2
– Web mining
– Data processing
 Founder of Bixo Labs
– Elastic web mining platform
– http://bixolabs.com
Typical Data Mining
Data Mining Victory!
Meanwhile, Over at McAfee…
Web Mining 101
 Extracting & Analyzing Web Data
 More Than Just Search
 Business intelligence, competitive intelligence, events, people, companies, popularity, pricing, social graphs, Twitter feeds, Facebook friends, support forums, shopping carts…
4 Steps in Web Mining
 Collect - fetch content from web
 Parse - extract data from formats
 Analyze - tokenize, rate, classify, cluster
 Produce - “useful data”
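The four steps chain naturally into a pipeline. Here is a minimal sketch in Python (the actual HECB stack is Java/Cascading; this is language-agnostic pseudocode made runnable), with canned pages standing in for a real web fetch:

```python
from collections import Counter

def collect(urls):
    # Collect: fetch content from the web (stubbed with canned pages here)
    fake_web = {
        "http://example.com/a": "<html><title>Web Mining</title></html>",
        "http://example.com/b": "<html><title>Data Mining</title></html>",
    }
    return [fake_web[u] for u in urls if u in fake_web]

def parse(pages):
    # Parse: extract data from formats (here, just the <title> text)
    titles = []
    for page in pages:
        start = page.find("<title>")
        end = page.find("</title>")
        if start >= 0 and end >= 0:
            titles.append(page[start + len("<title>"):end])
    return titles

def analyze(titles):
    # Analyze: tokenize and count term frequency
    counts = Counter()
    for title in titles:
        counts.update(title.lower().split())
    return counts

def produce(counts):
    # Produce: "useful data" - terms sorted by descending frequency
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

result = produce(analyze(parse(collect(
    ["http://example.com/a", "http://example.com/b"]))))
print(result)  # [('mining', 2), ('data', 1), ('web', 1)]
```

In the real stack each stage is a Cascading pipe running as map-reduce jobs over HDFS, not an in-memory function call.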
Web Mining versus Data Mining
 Scale - 10 million isn’t a big number
 Access - public but restricted
– Special implicit rules apply
 Structure - not much
How to Mine Large Scale Web Data?
 Start with scalable map-reduce platform
 Add a workflow API layer
 Mix in a web crawling toolkit
 Write your custom data processing code
 Run in an elastic cloud environment
One Solution - the HECB Stack
Bixo
Cascading
Hadoop
EC2
EC2 - Amazon Elastic Compute Cloud
 True cost of non-cloud environment
– Cost of servers & networking (2 year life)
– Cost of colo (6 servers/rack)
– Cost of OPS salary (15% of FTE/cluster)
– Managing servers is no fun
 Web mining is perfect for the cloud
– “bursty” => savings are even greater
– Data is distilled, so no transfer $$$ pain
Why Hadoop?
 Perfect for processing lots of data
– Map-reduce
– Distributed file system
 Open source, large community, etc.
 Runs well in EC2 clusters
 Elastic Map Reduce as an option
Why Cascading?
 API on top of Hadoop
 Supports efficient, reliable workflows
 Reduces painful low-level MR details
 Build workflows using the “pipe” model
Why Bixo?
 Plugs into Cascading-based workflows
– Scales with the Hadoop cluster
– Runs well in EC2
 Handles grungy web crawling details
– Polite yet efficient fetching
– Errors, web servers that lie
– Parsing lots of formats, broken HTML
 Open source toolkit for web mining apps
SEO Keyword Data Mining
 Example of a typical web mining task
 Find common keywords (1, 2, 3 word terms)
– Do domain-centric web crawl
– Parse pages to extract title, meta, h1, links
– Output keywords sorted by frequency
 Compare to competitor site(s)
Workflow
Custom Code for Example
 Filtering URLs inside the domain
– Non-English content
– User-generated content (forums, etc)
 Generating keywords from text
– Special tokenization
– One, two, three word phrases
 But 95% of the code was generic
End Result in Data Mining Tool
What Next?
 Another example - mining mailing lists
 Go straight to Summary/Q&A
 Talk about Web Scale Mining
 Write tweets, posts & emails
“No minute off-line goes unpunished”
Another Example - HUGMEE
 Hadoop
 Users who
 Generate the
 Most
 Effective
 Emails
Helpful Hadoopers
 Use mailing list archives for data (collect)
 Parse mbox files and emails (parse)
 Score based on key phrases (analyze)
 End result is a score/name pair (produce)
Scoring Algorithm
 Very sophisticated point system
 “thanks” == 5
 “owe you a beer” == 50
 “worship the ground you walk on” == 100
High Level Steps
 Collect emails
– Fetch mod_mbox generated page
– Parse it to extract links to mbox files
– Fetch mbox files
– Split into separate emails
 Parse emails
– Extract key headers (messageId, email, etc)
– Parse body to identify quoted text
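The per-message parse step can be sketched with Python's standard-library email module (the real workflow splits mbox files first and runs in Cascading; the message below is entirely made up, and quoted text is identified by the conventional ">" prefix):

```python
from email import message_from_string

# Hypothetical reply message from a mailing list archive
RAW = """\
Message-ID: <[email protected]>
From: [email protected]
Subject: Re: job hangs

> How do I fix a hung reduce task?
Try raising the task timeout.
"""

msg = message_from_string(RAW)

# Extract key headers (messageId, email address)
message_id = msg["Message-ID"]
sender = msg["From"]

# Parse body to separate quoted text from the new reply text
body_lines = msg.get_payload().splitlines()
quoted = [l for l in body_lines if l.startswith(">")]
original = [l for l in body_lines if not l.startswith(">")]
```

Only the non-quoted lines feed the key-phrase scoring, so thanks buried in quoted history doesn't score twice.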
High Level Steps
 Analyze emails
– Find key phrases in replies (ignore signoff)
– Score emails by phrases
– Group & sum by message ID
– Group & sum by email address
 Produce ranked list
– Toss email addresses with no love
– Sort by summed score
Workflow
Building the Flow
mod_mbox Page
Custom Operation
Validate
This Hug’s for Ted!
Produce
Web Scale Mining
 Bigger Data
– 100M pages versus 1M pages
 Bigger Breadth
– 100K domains versus 1K domains
 Bigger Clusters
– 50 servers versus 5 servers
Web Scale == Endless Heuristics
 Document feature detection
– Charset
– Mime-type
– Language
– Many noisy sources of “truth”
 Duplicate detection
– Quest for the perfect hash function
 Spam/porn/link farm detection
Web Scale == Challenges
 All web servers lie
 Edge cases ad nauseam
 Avoiding spam/porn/junk
 Focusing on English content
 Scaling to 100K domains/100M pages
– Avoid bottlenecks
– Fix large cluster issues
Public Terabyte Dataset
 Sponsored by Concurrent/Bixolabs
 High quality crawl of top domains
– HECB Stack using Elastic Map Reduce
 Hosted by Amazon in S3, free to EC2 users
 Crawl & processing code available
 Questions, input? http://bixolabs.com/PTD/
Web Scale Case Study - PTD Crawl
 Robots.txt - Robot Exclusion Protocol
– Not a real standard, lots of extensions
– Many ways to mess it up (HTML, typos, etc)
 Great performance when all is well
– 25K pages/minute fetching
– 50K pages/minute parsing
 Hadoop 0.18.3 vs. 0.19.2
– Different APIs, behavior, bugs
– At the painful cluster tuning stage
Large Scale Web Mining Summary
 10K is easy, 100M is hard
– You encounter endless edge cases
– There’s always another bottleneck
– Cluster tuning is challenging
 Web mining toolkit approach works
– Easier to customize/optimize
– Easier to solve problems
Summary
 HECB stack works well for web mining
– Cheaper than typical colo option
– Scales to hundreds of millions of pages
– Reliable and efficient workflows
 Web mining has high & increasing value
– Search engine optimization, advertising
– Social networks, reputation
– Competitive pricing
– Etc, etc, etc.
Any Questions?
 My email: [email protected]
 Bixo mailing list: http://tech.groups.yahoo.com/group/bixo-dev/