Transcript: Web Mining in the Cloud
Web Mining in the Cloud
Hadoop/Cascading/Bixo in EC2
Ken Krugler, Bixo Labs, Inc.
ACM Data Mining SIG
08 December 2009
About me
Background in vertical web crawling
– Krugle search engine for open source code
– Bixo open source web mining toolkit
Consultant for companies using EC2
– Web mining
– Data processing
Founder of Bixo Labs
– Elastic web mining platform
– http://bixolabs.com
Typical Data Mining
Data Mining Victory!
Meanwhile, Over at McAfee…
Web Mining 101
Extracting & Analyzing Web Data
More Than Just Search
– Business intelligence, competitive intelligence, events, people, companies, popularity, pricing, social graphs, Twitter feeds, Facebook friends, support forums, shopping carts…
4 Steps in Web Mining
Collect - fetch content from web
Parse - extract data from formats
Analyze - tokenize, rate, classify, cluster
Produce - “useful data”
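These four steps compose as a simple pipeline. A minimal Java sketch, with entirely hypothetical interface names (none of this is Bixo or Cascading code), just to show how the stages chain:

    import java.util.List;

    // Hypothetical stage interfaces for the four steps above.
    interface Collector { List<byte[]> fetch(List<String> urls); }    // Collect
    interface Parser    { List<String> extract(List<byte[]> raw); }   // Parse
    interface Analyzer  { List<String> analyze(List<String> docs); }  // Analyze
    interface Producer  { void emit(List<String> results); }          // Produce

    class WebMiningPipeline {
        // Each stage's output feeds the next stage's input.
        void run(Collector c, Parser p, Analyzer a, Producer out, List<String> seeds) {
            out.emit(a.analyze(p.extract(c.fetch(seeds))));
        }
    }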
Web Mining versus Data Mining
Scale - 10 million isn’t a big number
Access - public but restricted
– Special implicit rules apply
Structure - not much
How to Mine Large Scale Web Data?
Start with a scalable map-reduce platform
Add a workflow API layer
Mix in a web crawling toolkit
Write your custom data processing code
Run in an elastic cloud environment
One Solution - the HECB Stack
Bixo
Cascading
Hadoop
EC2
EC2 - Amazon Elastic Compute Cloud
True cost of non-cloud environment
– Cost of servers & networking (2 year life)
– Cost of colo (6 servers/rack)
– Cost of OPS salary (15% of FTE/cluster)
Managing servers is no fun
Web mining is perfect for the cloud
– “Bursty” workloads => savings are even greater
– Data is distilled, so no transfer $$$ pain
Why Hadoop?
Perfect for processing lots of data
– Map-reduce
– Distributed file system
Open source, large community, etc.
Runs well in EC2 clusters
Elastic MapReduce as an option
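As a concrete map-reduce example, here is the classic word-count mapper against the Hadoop 0.18/0.19 “mapred” API (the versions this deck mentions later); this is stock Hadoop usage, not code from the talk:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Emits (word, 1) for every token; a reducer then sums the counts.
    public class TokenCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                output.collect(word, ONE);
            }
        }
    }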
Why Cascading?
API on top of Hadoop
Supports efficient, reliable workflows
Reduces painful low-level MR details
Build workflow using “pipe” model
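A sketch of the “pipe” model, adapted from the Cascading 1.x word-count example (paths and the tokenizing regex are placeholders):

    import cascading.flow.Flow;
    import cascading.flow.FlowConnector;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.TextLine;
    import cascading.tap.Hfs;
    import cascading.tap.Tap;
    import cascading.tuple.Fields;

    public class WordCountFlow {
        public static void main(String[] args) {
            Tap source = new Hfs(new TextLine(), args[0]);       // input dir on HDFS
            Tap sink = new Hfs(new TextLine(), args[1], true);   // output dir, replaced

            Pipe assembly = new Pipe("wordcount");
            // Each input line -> one tuple per word
            assembly = new Each(assembly, new Fields("line"),
                    new RegexGenerator(new Fields("word"), "\\S+"));
            // Group by word, then count each group
            assembly = new GroupBy(assembly, new Fields("word"));
            assembly = new Every(assembly, new Count());

            // Cascading plans this into map-reduce jobs and runs them
            Flow flow = new FlowConnector().connect(source, sink, assembly);
            flow.complete();
        }
    }

Note how the low-level MR details (job setup, intermediate keys) never appear; you only describe the data flow.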
Why Bixo?
Plugs into Cascading-based workflow
– Scales with Hadoop cluster
– Runs well in EC2
Handles grungy web crawling details
– Polite yet efficient fetching (sketch below)
– Errors, web servers that lie
– Parsing lots of formats, broken HTML
Open source toolkit for web mining apps
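An illustrative sketch of the “polite yet efficient” idea, not Bixo’s actual code: reserve per-domain fetch slots so no single domain gets hammered while many domains proceed in parallel. The 30-second delay is an assumed default:

    import java.util.HashMap;
    import java.util.Map;

    public class PolitenessThrottle {
        private static final long CRAWL_DELAY_MS = 30 * 1000L;  // assumed default
        private final Map<String, Long> nextSlot = new HashMap<String, Long>();

        // Reserve the next polite fetch slot for a domain, then wait for it.
        // Threads fetching other domains are never blocked by this one.
        public void waitForSlot(String domain) throws InterruptedException {
            long scheduled;
            synchronized (this) {
                long now = System.currentTimeMillis();
                Long next = nextSlot.get(domain);
                scheduled = (next == null || next < now) ? now : next;
                nextSlot.put(domain, scheduled + CRAWL_DELAY_MS);
            }
            long wait = scheduled - System.currentTimeMillis();
            if (wait > 0) {
                Thread.sleep(wait);
            }
        }
    }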
SEO Keyword Data Mining
Example of typical web mining task
Find common keywords (1, 2, 3 word terms)
– Do domain-centric web crawl
– Parse pages to extract title, meta, h1, links
– Output keywords sorted by frequency
Compare to competitor site(s)
Workflow
Custom Code for Example
Filtering URLs inside domain
– Non-English content
– User-generated content (forums, etc.)
Generating keywords from text
– Special tokenization
– One, two, three word phrases (sketch below)
But 95% of the code was generic
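A minimal sketch of the keyword-generation step: emit every one-, two- and three-word phrase (“shingle”) from page text. The talk used special tokenization; split() is a stand-in:

    import java.util.ArrayList;
    import java.util.List;

    public class KeywordGenerator {
        // Emit all phrases of 1..maxWords consecutive tokens.
        public static List<String> phrases(String text, int maxWords) {
            String[] tokens = text.toLowerCase().split("\\W+");
            List<String> result = new ArrayList<String>();
            for (int start = 0; start < tokens.length; start++) {
                if (tokens[start].length() == 0) continue;  // split() artifact
                StringBuilder phrase = new StringBuilder();
                for (int len = 1; len <= maxWords && start + len <= tokens.length; len++) {
                    if (len > 1) phrase.append(' ');
                    phrase.append(tokens[start + len - 1]);
                    result.add(phrase.toString());
                }
            }
            return result;
        }
    }

Feeding the title, meta and h1 text through phrases(text, 3) and counting the results gives the frequency-sorted keyword list.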
End Result in Data Mining Tool
What Next?
Another example - mining mailing lists
Go straight to Summary/Q&A
Talk about Web Scale Mining
Write tweets, posts & emails
“No minute off-line goes unpunished”
Another Example - HUGMEE
Hadoop
Users who
Generate the
Most
Effective
Emails
Helpful Hadoopers
Use mailing list archives for data (collect)
Parse mbox files and emails (parse)
Score based on key phrases (analyze)
End result is score/name pair (produce)
Scoring Algorithm
Very sophisticated point system
“thanks” == 5
“owe you a beer” == 50
“worship the ground you walk on” == 100
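The whole point system fits in a lookup table. A sketch in Java; exactly how repeats and overlapping phrases were scored is an assumption:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class HugScorer {
        private static final Map<String, Integer> PHRASE_SCORES =
                new LinkedHashMap<String, Integer>();
        static {
            PHRASE_SCORES.put("worship the ground you walk on", 100);
            PHRASE_SCORES.put("owe you a beer", 50);
            PHRASE_SCORES.put("thanks", 5);
        }

        // Score one reply: sum the points of every phrase it contains.
        public static int score(String emailBody) {
            String text = emailBody.toLowerCase();
            int total = 0;
            for (Map.Entry<String, Integer> e : PHRASE_SCORES.entrySet()) {
                if (text.contains(e.getKey())) {
                    total += e.getValue();
                }
            }
            return total;
        }
    }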
High Level Steps
Collect emails
– Fetch mod_mbox generated page
– Parse it to extract links to mbox files
– Fetch mbox files
– Split into separate emails (sketch below)
Parse emails
– Extract key headers (messageId, email, etc)
– Parse body to identify quoted text
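A minimal sketch of the “split into separate emails” step, using the mbox convention that each message starts with a line beginning “From ” (real mbox handling, e.g. “>From ” escaping, has more edge cases):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.Reader;
    import java.util.ArrayList;
    import java.util.List;

    public class MboxSplitter {
        public static List<String> split(Reader mbox) throws IOException {
            List<String> messages = new ArrayList<String>();
            StringBuilder current = null;
            BufferedReader reader = new BufferedReader(mbox);
            for (String line = reader.readLine(); line != null; line = reader.readLine()) {
                if (line.startsWith("From ")) {       // separator: new message begins
                    if (current != null) {
                        messages.add(current.toString());
                    }
                    current = new StringBuilder();
                } else if (current != null) {
                    current.append(line).append('\n');
                }
            }
            if (current != null) {
                messages.add(current.toString());     // last message in the file
            }
            return messages;
        }
    }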
High Level Steps
Analyze emails
– Find key phrases in replies (ignore signoff)
– Score emails by phrases
– Group & sum by message ID
– Group & sum by email address (sketch below)
Produce ranked list
– Toss email addresses with no love
– Sort by summed score
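In the real flow the two “group & sum” steps are Cascading GroupBy/Every operations; a plain-Java sketch of the same logic, keyed first by message ID and then by email address:

    import java.util.HashMap;
    import java.util.Map;

    public class ScoreAggregator {
        // Sum per-item scores into one total per key.
        public static Map<String, Integer> groupAndSum(
                Iterable<Map.Entry<String, Integer>> scores) {
            Map<String, Integer> totals = new HashMap<String, Integer>();
            for (Map.Entry<String, Integer> entry : scores) {
                Integer sum = totals.get(entry.getKey());
                totals.put(entry.getKey(), (sum == null ? 0 : sum) + entry.getValue());
            }
            return totals;
        }
    }

Dropping keys whose total is zero and sorting descending gives the ranked list.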
Workflow
Building the Flow
mod_mbox Page
Custom Operation
Validate
This Hug’s for Ted!
Produce
Web Scale Mining
Bigger Data
– 100M pages versus 1M pages
Bigger Breadth
– 100K domains versus 1K domains
Bigger Clusters
– 50 servers versus 5 servers
Web Scale == Endless Heuristics
Document feature detection
– Charset (sketch below)
– Mime-type
– Language
– Many noisy sources of “truth”
Duplicate detection
– Quest for the perfect hash function
Spam/porn/link farm detection
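Charset detection shows the “many noisy sources of truth” problem: the HTTP header, the <meta> tag and the raw bytes often disagree. A sketch using ICU4J’s statistical CharsetDetector; the confidence threshold is an arbitrary illustration, not a Bixo value:

    import com.ibm.icu.text.CharsetDetector;
    import com.ibm.icu.text.CharsetMatch;

    public class CharsetGuesser {
        // Weigh the byte-level guess against what the server claimed.
        public static String guess(byte[] content, String httpHeaderCharset) {
            CharsetMatch match = new CharsetDetector().setText(content).detect();
            // Trust the detector only when it is fairly confident (0-100 scale),
            // otherwise fall back to the HTTP header, then to a safe default.
            if (match != null && match.getConfidence() > 50) {
                return match.getName();
            }
            return (httpHeaderCharset != null) ? httpHeaderCharset : "ISO-8859-1";
        }
    }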
Web Scale == Challenges
All web servers lie
Edge cases ad nauseam
Avoiding spam/porn/junk
Focusing on English content
Scaling to 100K domains/100M pages
– Avoid bottlenecks
– Fix large cluster issues
Public Terabyte Dataset
Sponsored by Concurrent/Bixolabs
High quality crawl of top domains
– HECB Stack using Elastic MapReduce
Hosted by Amazon in S3, free to EC2 users
Crawl & processing code available
Questions, input? http://bixolabs.com/PTD/
Web Scale Case Study - PTD Crawl
Robots.txt - Robot Exclusion Protocol
– Not a real standard, lots of extensions
– Many ways to mess it up (HTML, typos, etc.) - minimal parser sketch below
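To see why it goes wrong so often, here is a deliberately minimal robots.txt check handling only User-agent and Disallow; real files add Allow, Crawl-delay, Sitemap, wildcards, HTML error pages served as robots.txt, and typos:

    import java.util.ArrayList;
    import java.util.List;

    public class SimpleRobotRules {
        private final List<String> disallowed = new ArrayList<String>();

        public SimpleRobotRules(String robotsTxt, String userAgent) {
            boolean inMatchingGroup = false;
            for (String line : robotsTxt.split("\n")) {
                line = line.replaceAll("#.*", "").trim();   // strip comments
                int colon = line.indexOf(':');
                if (colon < 0) continue;
                String field = line.substring(0, colon).trim().toLowerCase();
                String value = line.substring(colon + 1).trim();
                if (field.equals("user-agent")) {
                    inMatchingGroup = value.equals("*")
                            || userAgent.toLowerCase().contains(value.toLowerCase());
                } else if (inMatchingGroup && field.equals("disallow") && value.length() > 0) {
                    disallowed.add(value);
                }
            }
        }

        // A path is allowed unless some Disallow rule prefixes it.
        public boolean isAllowed(String path) {
            for (String prefix : disallowed) {
                if (path.startsWith(prefix)) {
                    return false;
                }
            }
            return true;
        }
    }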
Great performance when all is well
– 25K pages/minute fetching
– 50K pages/minute parsing
Hadoop 0.18.3 vs. 0.19.2
– Different APIs, behavior, bugs
– At painful cluster tuning stage
Large Scale Web Mining Summary
10K is easy, 100M is hard
– You encounter endless edge cases
– There’s always another bottleneck
– Cluster tuning is challenging
Web
mining toolkit approach works
– Easier to customize/optimize
– Easier to solve problems
Summary
HECB stack works well for web mining
– Cheaper than typical colo option
– Scales to hundreds of millions of pages
– Reliable and efficient workflow
Web mining has high & increasing value
– Search engine optimization, advertising
– Social networks, reputation
– Competitive pricing
– Etc, etc, etc.
Any Questions?
My email:
[email protected]
Bixo mailing list:
http://tech.groups.yahoo.com/group/bixo-dev/