Large Scale Applications on Hadoop in Yahoo
Vijay K Narayanan, Yahoo! Labs
04.26.2010
Massive Data Analytics
Over the Cloud
(MDAC 2010)
Outline
Hadoop in Yahoo!
Common types of applications on Hadoop
Sample applications in:
› Content Analysis
› Web Graph
› Mail Spam Filtering
› Search
› Advertising
User Modeling on Hadoop
Challenges and Practical Considerations
Hadoop in Yahoo
By the Numbers
About 30,000 nodes in tens of clusters
› 1 node = 4 × 1 TB disk, 8 cores, 16 GB RAM as a typical configuration
Largest single cluster of about 4,000 nodes
4 tiers of clusters
› Application research and development
› Production clusters
› Hadoop platform development and testing
› Proof-of-concept and ad-hoc work
Over 1,000 users across research, engineering, operations etc.
› Running more than 100,000 jobs per day
More than 3 PB of data
› Compressed and un-replicated volume
Currently running Hadoop 0.20
Advantages
Wide applicability of the M/R computing model
› Many problems in the internet domain can be solved by data parallelism
High throughput
› Stream through 100 TB of data in less than 1 hour
› Applications that earlier took weeks now complete in hours
Research prototyping, development, and production deployment systems are (almost) identical
Scalable, economical, fault-tolerant
Shared resource with common infrastructure operations
Entities in internet eco-system
Yahoo! leverages Hadoop extensively in all of these domains
[Diagram: the user browses content (pages, blogs etc.), issues queries to the search engine, and interacts with advertising (text, display etc.); content/display and search pages both serve ads]
Common Types of Applications
Common applications on Hadoop in Yahoo!
1. Near real-time data pipelines
› Backbone for analytics, reporting, research etc.
› Multi-step pipelines to create data feeds from logs
• Web servers – page content, layout and links, clicks, queries etc.
• Ad servers – ad serving opportunity data, impressions
• Clicks, beacons, conversion data servers
› Process large volumes of events
• Tens of billions of events/day
• Tens of TB of (compressed) data/day
› Latencies of tens of minutes to a few hours
› Continuous runs of jobs working on chunks of data
Example: Data Pipelines
Analytics
• Tens of billions of events/day
• Parse and transform event streams
• Join clicks with views (a sketch of this join follows below)
• Filter out robots
• Aggregate, sort, partition
• Data quality checks
[Diagram: the pipeline produces user sessions, user profiles, and ads-and-content data sets]
Uses
• Network analytics
• Experiment reporting
• Optimize traffic & engagement
• User session & click-stream analysis
• Path and funnel analysis
• User segment analysis
• Interest, measurements, modeling and scoring, experimentation
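The click-view join above is the classic reduce-side join pattern. Below is a minimal sketch of how it might look on the Hadoop 0.20 Java API; the feed paths, the record layout (impression id followed by a tab-separated payload), and all class names are illustrative assumptions, not Yahoo!'s production code.

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ClickViewJoin {

  // Tag each record with its source so the reducer can tell views from clicks.
  public static class TagMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] f = value.toString().split("\t", 2);   // impressionId \t payload
      String path = ((FileSplit) ctx.getInputSplit()).getPath().toString();
      String tag = path.contains("clicks") ? "C" : "V";  // assumed feed naming
      ctx.write(new Text(f[0]), new Text(tag + "\t" + f[1]));
    }
  }

  // For each impression id, emit the view record annotated with a click count.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      String view = null;
      int clicks = 0;
      for (Text v : values) {
        String[] f = v.toString().split("\t", 2);
        if ("V".equals(f[0])) view = f[1]; else clicks++;
      }
      if (view != null) ctx.write(key, new Text(view + "\t" + clicks));
    }
  }
}
```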
Common applications on Hadoop in Yahoo!
2. High throughput engine for ETL and reporting applications
› Put large data sources (e.g. logs) on HDFS
› Run canned aggregations, transformations, normalizations
› Load reports into RDBMS/data marts
› Hourly and daily batch jobs
3. Exploratory data research
› Ad-hoc analysis and insights into data
› Leveraging Pig and custom Map-Reduce scripts
› Pig is based on Pig Latin (upcoming support for SQL)
• Procedural language, designed for data parallelism
• Supports nested relational data structures
Common applications on Hadoop in Yahoo!
4. Indexing for efficient retrieval
› Build and update indices of content, ads etc.
› Updated in batch mode and pushed for online serving
› Efficient retrieval of content and ads during serving
5. Offline modeling
› Supervised and un-supervised learning algorithms
› Outlier detection methods
› Association rule mining techniques
› Graph analysis methods
› Time series analysis etc.
Common applications on Hadoop in Yahoo!
6. Batch and near real-time scoring applications
› Offline model scoring for upload to serving applications
› Frequency: hourly or daily jobs
7. Near real-time feedback from serving systems
› Update features and model weights based on feedback from serving
› Periodically push these updates to online scoring and serving
› Typical updates in minutes or hours
8. Monitoring and performance dashboards
› Analyze scoring and serving logs for:
• Monitoring end-to-end performance of scoring and serving systems
• Measurement of model performance
Sample Applications
Application: Content Analysis
Web data
› Information about every web site, page, and link crawled by Yahoo!
› Growing corpus of more than 100 TB of data from tens of billions of documents
Document processing pipeline on Hadoop
Enrich with features from page, site etc.
› Page segmentation
› Term document vector and weighted variants
Entity analysis
› Detection, disambiguation, and resolution of entities in a page
Concept and topic modeling and clustering
Page quality analysis
Application: Web graph analysis
Directed graph of the web
Aggregated views by different dimensions
› Sites, domains, hosts etc.
Large scale analysis of this graph
› 2 trillion links
› Jobs utilize 100,000+ maps, ~10,000 reduces
› ~300 TB compressed output

Attribute              | Before Hadoop          | With Hadoop
Time                   | 1 month                | Days
Maximum number of URLs | ~ Order of 100 billion | Many times larger
Application: Mail spam filtering
Scale of the problem
› ~ 25B connections, 5B deliveries per day
› ~ 450M mailboxes
User feedback on spam is often late, noisy and not always actionable

Problem                               | Algorithm                                         | Data size                      | Running time on Hadoop
Detecting spam campaigns              | Frequent itemset mining                           | ~ 20 MM spam votes             | 1 hour
"Gaming" of spam IP votes by spammers | Connected components (squaring a bipartite graph) | ~ 500K spammers, 500K spam IPs | 1 hour
Application: Mail Spam Filtering Campaigns
Example campaign signatures mined from spam votes (vote count, followed by the shared attribute set):
2595 (IPTYPE:none,FROMUSER:sales,SUBJ:It's Important You Know,FROMDOM:dappercom.info,URL:dappercom.info,ip_D:66.206.14.77,)
2457 (IPTYPE:none,FROMUSER:sales,SUBJ:Save On Costly Repairs,FROMDOM:aftermoon.info,URL:aftermoon.info,ip_D:66.206.14.78,)
2447 (IPTYPE:none,FROMUSER:sales,SUBJ:Car-Dealers-Compete-On-New-Vehicles,FROMDOM:sherge.info,URL:sherge.info,ip_D:66.206.25.227,)
2432 (IPTYPE:none,FROMUSER:sales,SUBJ:January 18th: CreditReport Update,FROMDOM:zaninte.info,URL:zaninte.info,ip_D:66.206.25.227,)
2376 (IPTYPE:none,FROMUSER:health,SUBJ:Finally. Coverage for the whole family,FROMDOM:fiatchimera.com,URL:articulatedispirit.com,ip_D:216.218.201.149,)
2184 (IPTYPE:none,FROMUSER:health,SUBJ:Finally. Coverage for the whole family,FROMDOM:fiatchimera.com,URL:stratagemnepheligenous.com,ip_D:216.218.201.149,)
1990 (IPTYPE:none,FROMUSER:sales,SUBJ:Closeout 2008-2009-2010 New Cars,FROMDOM:sastlg.info,URL:sastlg.info,ip_D:66.206.25.227,)
1899 (IPTYPE:none,FROMUSER:sales,FROMDOM:brunhil.info,SUBJ:700-Credit-Score-What-Is-Yours?,URL:brunhil.info,ip_D:66.206.25.227,)
1743 (IPTYPE:none,FROMUSER:sales,SUBJ:Now exercise can be fun,FROMDOM:accordpac.info,URL:accordpac.info,ip_D:66.206.14.78,)
1706 (IPTYPE:none,FROMUSER:sales,SUBJ:Closeout 2008-2009-2010 New Cars,FROMDOM:rionel.info,URL:rionel.info,ip_D:66.206.25.227,)
1693 (IPTYPE:none,FROMUSER:sales,SUBJ:January 18th: CreditReport Update,FROMDOM:astroom.info,URL:astroom.info,ip_D:66.206.25.227,)
1689 (IPTYPE:none,FROMUSER:sales,SUBJ:eBay: Work@Home w/Solid-Income-Strategies,FROMDOM:stamine.info,URL:stamine.info,ip_D:66.165.232.203,)
Application: Search Ranking
Rank web-pages based on relevance to queries
› Features based on content of page, site, queries, web graph etc.
› Train machine learning models to rank relevant pages for queries
› Periodically learn new models

Dimension    | Before Hadoop   | Using Hadoop
Features     | ~ 100's         | ~ 1000's
Running Time | ~ Days to weeks | ~ Hours
Application: Search Assist™
• Related concepts occur together. Analyze ~ 3 years of logs
• Build dictionaries on Hadoop and push to online serving

Dimension        | Before Hadoop | Using Hadoop
Time             | 4 weeks       | < 30 minutes
Language         | C++           | Python
Development Time | 2-3 weeks     | 2-3 days
Applications in Advertising
Expanding sets of seed keywords for matching with text ads
› Analyze text corpora, user query sessions, cluster keywords etc.
Indexing ads for fast retrieval
› Build and update an index of more than a billion text ads
Response prediction and relevance modeling
Categorization of pages and queries to help in matching
› Adult pages, gambling pages etc.
Forecasting of ad inventory
User modeling
Model performance dashboards
User Modeling on Hadoop
User activities
Large dimensionality of possible user activities
But a typical user has a sparse activity vector
Attributes of the events change over time
Attribute | Possible Values     | Typical values per user
Pages     | ~ MM                | 10 – 100
Queries   | ~ 100s of MM        | Few
Ads       | ~ 100s of thousands | 10s

Building a pipeline on Hadoop to model user interests from activities
User Modeling Pipeline
5 main components to train, score and evaluate models
1. Data Generation
   a. Data Acquisition
   b. Feature and Target Generation
2. Model Training
3. Offline Scoring and Evaluation
4. Batch scoring and upload to online serving
5. Dashboard to monitor the online performance
Overview of User Modeling Pipeline
[Diagram: user event history files on HDFS feed a Data Generation stage (merging, projection, join, filtering, aggregations, transformations) that produces feature and target sets; a Modeling Engine consumes these for Model Training and for Scoring and Evaluation (join, filtering, scoring, score- and graph-based evaluation); model files, scores and reports are written back to HDFS; models and scores are pushed to online serving systems; a flow manager coordinates the work]
1a. Data Acquisition
Input
› Multiple user event feeds (browsing activities, search etc.) per time period

User | Time | Event                                                 | Source
U1   | T0   | visited autos.yahoo.com                               | Web server logs
U1   | T1   | searched for "car insurance"                          | Search logs
U1   | T2   | browsed stock quotes                                  | Web server logs
U1   | T3   | saw an ad for "discount brokerage", but did not click | Ad logs
U1   | T4   | checked Yahoo Mail                                    | Web server logs
U1   | T5   | clicked on an ad for "auto insurance"                 | Ad logs, click server logs
1a. Data Acquisition
Tag and Transform
[Diagram: raw event feeds on HDFS pass through map operations that project relevant event attributes, filter irrelevant events, and tag events (categorization, topic, ...), producing Normalized Events (NE) on HDFS; a mapper sketch follows below]
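A minimal sketch of the map operations above, assuming a hypothetical tab-separated raw event layout (user, time, event type, detail); the filtering rule and the categorize() helper are stand-ins for Yahoo!'s real taggers.

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NormalizeEventsMapper extends Mapper<Object, Text, Text, Text> {

  @Override
  protected void map(Object key, Text value, Context ctx)
      throws IOException, InterruptedException {
    String[] f = value.toString().split("\t");
    if (f.length < 4) return;                    // drop malformed records
    String user = f[0], time = f[1], type = f[2], detail = f[3];
    if ("mail_usage".equals(type)) return;       // filter irrelevant events
    String tag = categorize(type, detail);       // tag the event
    // Project only the attributes downstream steps need.
    ctx.write(new Text(user), new Text(time + "\t" + type + "\t" + tag));
  }

  // Toy categorizer; the real pipeline would invoke trained taggers here.
  private String categorize(String type, String detail) {
    return detail.toLowerCase().contains("insurance")
        ? "Category: Auto Insurance" : "Uncategorized";
  }
}
```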
1a. Data Acquisition
Output:
› Single normalized feed containing all events for all users per time period

User | Time | Event            | Tag
U1   | T0   | Content browsing | Autos, Mercedes Benz
U2   | T2   | Search query     | Category: Auto Insurance
...  | ...  | ...              | ...
U23  | T23  | Mail usage       | Drop event
U36  | T36  | Ad click         | Category: Auto Insurance
1b. Feature and Target Generation
Features:
› Summaries of user activities over a time window
› Aggregates, moving averages, rates etc. over moving time windows
› Support online updates to existing features (a sketch follows below)
Targets:
› Constructed in the offline model training phase
› Typically user actions in a future time period indicating interest
• Clicks/click-through rates on ads and content
• Site and page visits
• Conversion events
– Purchases, quote requests etc.
– Sign-ups to newsletters, registrations etc.
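For the online-updatable features mentioned above, one common trick is an exponentially decayed moving average, which lets a stored aggregate absorb new events without replaying history. A small sketch; the decay constant is an illustrative choice, not a production value.

```java
// Exponentially decayed moving average: fold a new observation into a stored
// aggregate, weighting the old state by how much time has elapsed.
public final class DecayedRate {
  private static final double LAMBDA = 0.1;  // decay per time unit (assumed)

  /** Return the updated average after observing `value` at a gap of `dt`. */
  public static double update(double oldAvg, double value, double dt) {
    double w = Math.exp(-LAMBDA * dt);       // weight of the old state
    return w * oldAvg + (1 - w) * value;
  }
}
```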
1b. Feature and Target Windows
[Diagram: a timeline beginning at T0; within a moving window, a feature window of past events (e.g. a query, a visit to Y! Finance) is followed by a target window containing the interest event to be predicted]
1b. Feature Generation
Example normalized events for one user:

User | Time | Event                  | Tag
U1   | T0   | Content browsing       | Autos, Mercedes Benz
U1   | T2   | Search query           | Category: Auto Insurance
U1   | T3   | Click on search result | Category: Insurance premiums
U1   | T4   | Ad click               | Category: Auto Insurance

[Diagram: mappers read normalized events (NE 1–9) from HDFS and emit (user, event) pairs; all events for a given user meet at one reducer, which computes summaries over the user's event history: aggregates within the window, time- and event-weighted averages, event rates etc., writing the feature set to HDFS; a reducer sketch follows below]
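A minimal sketch of the feature-generation reduce step above: all normalized events for one user arrive at a single reducer, which emits simple windowed aggregates. The event layout (numeric time, type, tag) and the feature names are illustrative assumptions.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FeatureGenReducer extends Reducer<Text, Text, Text, Text> {

  @Override
  protected void reduce(Text user, Iterable<Text> events, Context ctx)
      throws IOException, InterruptedException {
    Map<String, Integer> countsByTag = new HashMap<String, Integer>();
    long minTime = Long.MAX_VALUE, maxTime = Long.MIN_VALUE;
    int total = 0;
    for (Text e : events) {
      String[] f = e.toString().split("\t");   // time \t type \t tag
      long t = Long.parseLong(f[0]);           // assumes numeric timestamps
      minTime = Math.min(minTime, t);
      maxTime = Math.max(maxTime, t);
      Integer c = countsByTag.get(f[2]);
      countsByTag.put(f[2], c == null ? 1 : c + 1);
      total++;
    }
    // Simple aggregates within the feature window: an overall event rate
    // (events per unit time) plus per-tag counts.
    double span = Math.max(1, maxTime - minTime);
    StringBuilder features = new StringBuilder("rate=" + (total / span));
    for (Map.Entry<String, Integer> e : countsByTag.entrySet()) {
      features.append("\t").append(e.getKey()).append("=").append(e.getValue());
    }
    ctx.write(user, new Text(features.toString()));
  }
}
```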
1b. Joining Features and Targets
Low target rates
› Typical response rates are in the range of 0.01% ~ 1%
› Many users have no interest activities in the target window
First construct the targets
Compute the feature vector only for users with targets (see the sketch below)
› Reduces the need for computing features for users without target actions
Allows stratified sampling of users with different target and feature attributes
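One way to realize this targets-first pattern is to ship the (much smaller) target-user list to every mapper, e.g. via the distributed cache, and drop events for all other users before feature generation. A minimal sketch; the file layout (one user id per line) and record format are assumptions.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TargetFilterMapper extends Mapper<Object, Text, Text, Text> {
  private final Set<String> targetUsers = new HashSet<String>();

  @Override
  protected void setup(Context ctx) throws IOException {
    // A small file of target user ids, one per line, registered with
    // DistributedCache.addCacheFile(...) when the job was submitted.
    Path[] cached = DistributedCache.getLocalCacheFiles(ctx.getConfiguration());
    BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
    String line;
    while ((line = in.readLine()) != null) targetUsers.add(line.trim());
    in.close();
  }

  @Override
  protected void map(Object key, Text value, Context ctx)
      throws IOException, InterruptedException {
    String[] f = value.toString().split("\t", 2);  // user \t event payload
    if (targetUsers.contains(f[0])) {              // keep only users with targets
      ctx.write(new Text(f[0]), new Text(f[1]));
    }
  }
}
```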
2. Model Training
Supervised models trained using a variety of techniques
Regressions
› Different flavors: linear, logistic, Poisson etc.
› Constraints on weights
› Different regularizations: L1 and L2
Decision trees
› Used for both regression and ranking problems
› Boosted trees
Naïve Bayes
Support vector machines
› Commonly used in text classification, query categorization etc.
Online learning algorithms
2. Model Training
Maximum entropy modeling
› Log-linear link function
› Classification problems with large-dimensional, sparse features
Conditional Random Fields
› Sequence labeling and named-entity recognition problems
Some of these algorithms are implemented in Mahout
Not all algorithms are easy to implement in the MR framework
Train one model per node
› Each node can train a model for one target response
3. Offline Scoring and Evaluation
Apply weights from the model training phase to features from the feature generation component
Mapper operations only
Janino* equation editor (a scoring sketch follows below)
› Embedded compiler can compile arbitrary scoring equations
› Can also embed any class invoked during scoring
› Can modify features on the fly before scoring
Evaluation metrics
› Sort by scores and compute metrics in the reducer
› Precision vs. recall curve
› Lift charts
* http://docs.codehaus.org/display/JANINO/Home
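A minimal sketch of scoring through Janino: the scoring equation arrives as a string (e.g. read from a model file) and is compiled once with ExpressionEvaluator, then evaluated per record. The feature names and the equation itself are hypothetical.

```java
import org.codehaus.janino.ExpressionEvaluator;

public class JaninoScoringDemo {
  public static void main(String[] args) throws Exception {
    ExpressionEvaluator ee = new ExpressionEvaluator();
    ee.setParameters(new String[] {"adClicks", "financeVisits"},
                     new Class[] {double.class, double.class});
    ee.setExpressionType(double.class);
    // Arbitrary scoring equation compiled at runtime; features can also be
    // transformed on the fly here (e.g. Math.log(1 + adClicks)).
    ee.cook("0.7 * Math.log(1 + adClicks) + 0.3 * financeVisits");

    // Evaluate once per record inside the mapper; shown once here.
    double score = (Double) ee.evaluate(new Object[] {5.0, 2.0});
    System.out.println("score = " + score);
  }
}
```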
Modeling Workflow
[Diagram: in the training phase, user event history flows through data acquisition into target generation and feature generation, and model training fits weights to the resulting targets and features; in the evaluation phase, the same acquisition, target-generation and feature-generation steps feed model scoring, and the scores plus targets drive evaluation]
4. Batch Scoring
[Diagram: user event history flows through data acquisition and feature generation; model scoring applies the trained weights to the features, and the resulting scores are uploaded to online serving systems]
User modeling pipeline system
Component                     | Data Processed                  | Time
Data Acquisition              | ~ 1 TB per time period          | 2 – 3 hours
Feature and Target Generation | ~ 1 TB × size of feature window | 4 – 6 hours
Model Training                | ~ 50 – 100 GB                   | 1 – 2 hours for 100's of models
Scoring                       | ~ 500 GB                        | 1 hour
Challenges and Practical Considerations
Current challenges
Limited size of name-node
› File and block meta-data in HDFS is kept in RAM on the name-node
› On a name-node with 64 GB RAM:
• ~ 100 million file blocks and 60 million files
› Upper limit of ~ 4,000 nodes per cluster
› Adding more reducers leads to a large number of small files
Copying data in/out of HDFS
› Limited by read/write rates of external file systems
High latency for small jobs
› Job set-up overhead may be large relative to small jobs
Practical considerations
Reduce the amount of data transferred from mapper to reducer
› There is still a disk write/read in going from mapper to reducer
• Mapper output (= reducer input) files can become large
• Can run out of disk space for intermediate storage
› Project a subset of relevant attributes in the mapper to send to the reducer
› Use combiners
› Compress intermediate data (a job-configuration sketch follows below)
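A minimal job-setup sketch showing the last two knobs, using the stock TokenCounterMapper/IntSumReducer classes that ship with Hadoop as a stand-in job; the compression property is named mapred.compress.map.output on 0.20 (renamed mapreduce.map.output.compress in later releases).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class JobSetupSketch {
  public static Job configure() throws Exception {
    Configuration conf = new Configuration();
    // Compress intermediate map output before it hits disk and the network.
    conf.setBoolean("mapred.compress.map.output", true);

    Job job = new Job(conf, "count-with-combiner");
    job.setJarByClass(JobSetupSketch.class);
    job.setMapperClass(TokenCounterMapper.class);  // stock (token, 1) mapper
    job.setCombinerClass(IntSumReducer.class);     // pre-aggregate per map task
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    return job;
  }
}
```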
Distribution of keys
› A reducer can become a bottleneck for common keys
› Use a Partitioner to control the distribution of map records to reducers
• E.g. distribute mapper records with common keys across multiple reducers in a round-robin manner (see the sketch below)
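A minimal sketch of such a partitioner; the hot key is hard-coded here purely for illustration, and a follow-up step must re-combine the partial aggregates that the round-robin spreading produces.

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HotKeyPartitioner extends Partitioner<Text, LongWritable> {
  private static final String HOT_KEY = "robots.txt";  // hypothetical common key
  private int next = 0;                                // per-map-task counter

  @Override
  public int getPartition(Text key, LongWritable value, int numPartitions) {
    if (HOT_KEY.equals(key.toString())) {
      next = (next + 1) % numPartitions;  // spread the hot key round-robin
      return next;
    }
    // Default hash partitioning for all other keys.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}
```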
Practical considerations
Judicious partitioning of data
› Multiple files help parallelism, but hit name-node limits
› A smaller number of files keeps the name-node happy, but at the expense of parallelism
Less ideal for distributed computing algorithms requiring communication (e.g. distributed decision trees)
› MPI on top of the cluster for communication
Acknowledgment
Numerous wonderful colleagues!
Questions?
Appendix: More Applications
Application: Content Optimization
Optimizing content across the Yahoo! portal pages
› Rank articles from an editorial pool of articles based on interest
• Yahoo Front Page
• Yahoo News etc.
› Customizing feeds on the My Yahoo portal page
› Top buzzing queries
› Content recommendations (RSS feeds)
Use Hadoop for feature aggregates and model weight updates
• Near real-time, and uploaded to online serving
Yahoo Front Page – Case Study
[Diagram: the Yahoo! front page annotated with Hadoop-backed components: content optimization modules, the search index, ads optimization, machine-learned spam filters, and RSS feed recommendations]
Application: Search Logs Analysis
Analyze search result view and click logs
› Reporting and measurement of user click response
› User session analysis
• Enrich, expand and re-write queries
• Spelling corrections
• Suggesting related queries
Traffic quality and protection
› Detect and filter out fraudulent traffic and clicks
Mail Spam Filtering: Connected Components
Y1 = Yahoo user 1, Y2 = Yahoo user 2
IP1 = IP address of the host Y1 "voted" not-spam from
[Diagram: a bipartite graph linking users (y1, y2) to the IPs they voted from (IP1, IP2); SQUARING the graph connects users who share voting IPs, here y1–y2 with weight = 2; a sketch of the squaring step follows below]
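A minimal sketch of the squaring step as one MR pass: group votes by IP, then emit a pair record for every two users who voted from the same IP; summing the pair counts in a follow-up job yields the user-user edge weights (y1–y2 gets weight 2 if they share two IPs). The input layout (user, then IP, tab-separated) is an assumption.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class GraphSquaring {

  // (user, ip) -> key by ip so co-voters meet at one reducer.
  public static class VoteMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] f = value.toString().split("\t");   // user \t ip
      ctx.write(new Text(f[1]), new Text(f[0]));
    }
  }

  // Emit every pair of users sharing this IP with weight 1; assumes the
  // per-IP voter list is small enough for the quadratic pair expansion.
  public static class PairReducer extends Reducer<Text, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    @Override
    protected void reduce(Text ip, Iterable<Text> users, Context ctx)
        throws IOException, InterruptedException {
      List<String> u = new ArrayList<String>();
      for (Text t : users) u.add(t.toString());
      for (int i = 0; i < u.size(); i++)
        for (int j = i + 1; j < u.size(); j++)
          ctx.write(new Text(u.get(i) + "," + u.get(j)), ONE);
    }
  }
}
```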
Mail Spam Filtering: Connected Components
[Diagram: connected components over the voting graph separate the set of Yahoo IDs voting not-spam (y1, y2, y3) and the set of "voted from" IPs used exclusively for voting not-spam (IP1, IP2) from the set of "voted on" IPs (IP3, IP4): likely new spamming IPs which are "worth" voting for]