Building Web Analytics on Hadoop at CBS Interactive

Michael Sun
[email protected]
Big Data Workshop, Boston 03/10/2012
Brands and Websites of CBS Interactive
Samples
ENTERTAINMENT
GAMES & MOVIES
MUSIC
SPORTS
TECH, BIZ & NEWS
CBSi Scale
• Top 20 global web property
• 235M worldwide monthly unique users
• Hadoop Cluster size:
– Current workers: 40 nodes (260 TB)
– This month: add 24 nodes, 800 TB total
– Next quarter: ~ 80 nodes, ~ 1 PB
• DW peak processing: > 500M events/day globally,
doubling next quarter (ad logs)
1 - Source: comScore, March 2011
Web Analytics Processing
• Collect web logs for web metrics analysis
– Web logs track clicks, page views, downloads, streaming video events, ad events, etc.
• Provide internal metrics for web site monitoring
• A/B testing
• Biller apps, external reporting
• Ad event tracking to support sales
• Provide data service
– Support marketing by providing data for data mining
– User-centric datastore (stay tuned)
– Optimize user experience
Sample web log record:

2105595680218152960 125.83.8.253 [07/Mar/2012:16:00:00 +0000] GET /clear/c.gif?ts=1331136009989&sid=115&ld=www.xcar.com.cn&ldc=a4887174-530c-40df-8002e06c199ba81a&xrq=fid%3D523%26page%3D10&brflv=10.3.183&brwinsz=1680x840&brscrsz=1680x1050&brlang=zhCN&tcset=utf8&im=dwjs&xref=http%3A%2F%2Fwww.xcar.com.cn%2Fbbs%2Fforumdisplay.php&srcurl=http%3A%2F%2Fwww.xcar.com.cn%2Fbbs%2Fforumdisplay.php%3Ffid%3D523%26page%3D11&title=%E5%B8%95%E6%9D%B0%E7%BD%97%E8%AE%BA%E5%9D%9B_%E5%B8%95%E6%9D%B0%E7%BD%97%E7%A4%BE%E5%8C%BA_%E5%B8%95%E6%9D%B0%E7%BD%97%E8%BD%A6%E5%8F%8B%E4%BC%9A_PAJERO%E8%AE%BA%E5%9D%9B_XCAR%20%E7%88%B1%E5%8D%A1%E6%B1%BD%E8%BD%A6%E4%BF%B1%E4%B9%90%E9%83%A8 HTTP/1.1 200 42
clgf=Cg+5E02cT/eWAAAAo0Y
http://www.xcar.com.cn/bbs/forumdisplay.php?fid=523&page=11
Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.802.30 Safari/535.1 SE 2.X MetaSr 1.0 - 1
schemas.append(Schema((  # schemas[0]
    SchemaField('web_event_id', 'int', nullable=False, signed=True, bits=64),
    SchemaField('ip_address', 'string', nullable=False, maxlen=15, io_encoding='ascii'),
    SchemaField('empty1', 'string', nullable=False, maxlen=5, io_encoding='ascii'),
    SchemaField('empty2', 'string', nullable=True, maxlen=5, io_encoding='ascii'),
    SchemaField('req_date', 'string', nullable=True, maxlen=30, io_encoding='ascii'),
    SchemaField('request', 'string', nullable=True, maxlen=2000, on_range_error='truncate',
                io_encoding='ascii'),
    SchemaField('http_status', 'int', nullable=True, signed=True),
    SchemaField('bytes_sent', 'int', nullable=True, signed=True),
    SchemaField('cookie', 'string', nullable=True, maxlen=100, on_range_error='truncate',
                io_encoding='utf-8'),
    SchemaField('referrer', 'string', nullable=True, maxlen=1000, on_range_error='truncate',
                io_encoding='utf-8'),
    SchemaField('user_agent', 'string', nullable=True, maxlen=2000, on_range_error='truncate',
                io_encoding='utf-8'),
    SchemaField('is_clear_gif_mask', 'int', nullable=False, on_null='default', on_type_error='default',
                signed=True, bits=2),
)))
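For readers without access to Lumberjack, the following standalone sketch shows how a raw record like the sample above could be mapped onto the field names declared in the schema, and how the tracking parameters in the c.gif query string could be pulled out. It assumes a tab-delimited record in the schema's field order; the delimiter, helper names, and error handling are illustrative, not the framework's actual API.

# Illustrative only: plain Python, not the Lumberjack Schema/SchemaField API.
import urllib.parse

FIELD_NAMES = [
    'web_event_id', 'ip_address', 'empty1', 'empty2', 'req_date', 'request',
    'http_status', 'bytes_sent', 'cookie', 'referrer', 'user_agent',
    'is_clear_gif_mask',
]

def parse_record(line, sep='\t'):
    """Split one raw log line into a dict keyed by the schema field names."""
    values = line.rstrip('\n').split(sep)
    if len(values) != len(FIELD_NAMES):
        return None  # malformed record; a real pipeline would count and route these
    row = dict(zip(FIELD_NAMES, values))
    # The request field looks like "GET /clear/c.gif?ts=...&sid=115&...";
    # pull the beacon's query parameters out for downstream filters.
    parts = row['request'].split(' ')
    if len(parts) >= 2 and '?' in parts[1]:
        query = urllib.parse.urlsplit(parts[1]).query
        row['params'] = dict(urllib.parse.parse_qsl(query))
    return row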
Modernize the platform
• Web log processing on a proprietary platform ran into its limits
– Code base was 10 years old
– The version we used is no longer supported by the vendor
– Not fault-tolerant
– Upgrading to the newer version was not cost-effective
• Data volume is increasing all the time
– 300+ web sites
– Video tracking increasing the fastest
– To support new business initiatives
• Use open source systems as much as possible
Hadoop to the Rescue / Research
• Open-source: scalable data processing framework
based on MapReduce
• Processes petabytes of data using the Hadoop Distributed File System (HDFS)
– High throughput
– Fault-tolerant
• Distributed computing model
– Based on the functional programming model
− MapReduce (M|S|R)
• Execution engine
– Used as a cluster for ETL
– Collect data (distributed harvester)
– Analyze data (M/R, streaming + scripting + R, Pig/Hive)
– Archive data (distributed archive)
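To make the M|S|R model above concrete, here is a minimal Hadoop Streaming pair (not code from the talk): the mapper and reducer are ordinary scripts that read stdin and write tab-separated key/value lines to stdout, and Hadoop performs the shuffle/sort between them. The column position of the site id and the job invocation paths are assumptions for illustration.

# mapper.py -- emit one count per site id (assumed here to be in the second column)
import sys

for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')
    if len(fields) > 1:
        print('%s\t1' % fields[1])

# reducer.py -- input arrives grouped and sorted by key, so a running total works
import sys

current_key, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip('\n').split('\t')
    if key != current_key and current_key is not None:
        print('%s\t%d' % (current_key, total))
        total = 0
    current_key = key
    total += int(value)
if current_key is not None:
    print('%s\t%d' % (current_key, total))

# Typical invocation (jar name and HDFS paths are illustrative):
# hadoop jar hadoop-streaming.jar -input /logs/parsed -output /metrics/pageviews \
#   -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py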
The Plan
• Build web log collection (codename Fido); a collector sketch follows this slide
– Apache web log piped to cronolog
– Hourly M/R collector job to
− Gzip hourly log files & checksum
− Scp from web servers to Hadoop datanodes
− Put on HDFS
• Build Python ETL framework (codename Lumberjack); a pipeline sketch follows this slide
– Based on stdin/stdout streaming, one process/one thread
– Can run stand-alone or on Hadoop
– Pipeline
– Filter
– Schema
• Build web log processing with Lumberjack
– Parse
– Sessionize
– Lookup
– Format data/Load to DB
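The Fido collection step amounts to an hourly loop over the web servers. The actual implementation is an hourly M/R collector job, but a simplified single-host sketch of the same steps (gzip + checksum on the source, scp to a datanode, put on HDFS) might look like the following. Host names, log paths, and HDFS directories here are made up for illustration.

import subprocess

def collect_hour(web_host, hour_stamp):
    """Fetch one rotated hourly Apache/cronolog file from web_host onto HDFS."""
    remote_log = '/var/log/apache/access.%s.log' % hour_stamp      # illustrative path
    staged_gz = '/data/incoming/%s.%s.log.gz' % (web_host, hour_stamp)
    # compress and checksum on the web server
    subprocess.check_call(['ssh', web_host,
                           'gzip %s && md5sum %s.gz > %s.gz.md5'
                           % (remote_log, remote_log, remote_log)])
    # copy the compressed file and its checksum to the local staging area
    subprocess.check_call(['scp', '%s:%s.gz' % (web_host, remote_log), staged_gz])
    subprocess.check_call(['scp', '%s:%s.gz.md5' % (web_host, remote_log),
                           staged_gz + '.md5'])
    # verify the transfer before loading into HDFS for the downstream jobs
    local_sum = subprocess.check_output(['md5sum', staged_gz]).split()[0]
    remote_sum = open(staged_gz + '.md5', 'rb').read().split()[0]
    if local_sum != remote_sum:
        raise RuntimeError('checksum mismatch for %s' % staged_gz)
    subprocess.check_call(['hadoop', 'fs', '-put', staged_gz,
                           '/logs/raw/%s/' % hour_stamp])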
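Lumberjack itself is under review for open-source release, so its real API is not shown in these slides; the Pipeline/Filter/Schema idea, though, is just composable record filters driven over stdin/stdout. A generic sketch of that pattern follows; the class and method names are illustrative, not the framework's.

import sys

class Filter(object):
    """One pipeline stage: consumes an iterator of records, yields records."""
    def process(self, records):
        raise NotImplementedError

class SplitFilter(Filter):
    """Turn raw delimited lines into dicts keyed by schema field names."""
    def __init__(self, field_names, sep='\t'):
        self.field_names, self.sep = field_names, sep
    def process(self, records):
        for line in records:
            values = line.rstrip('\n').split(self.sep)
            if len(values) == len(self.field_names):
                yield dict(zip(self.field_names, values))

class Pipeline(object):
    """Chain filters and stream stdin through every stage to stdout."""
    def __init__(self, *filters):
        self.filters = filters
    def run(self, source=sys.stdin, sink=sys.stdout):
        stream = source
        for f in self.filters:
            stream = f.process(stream)
        for record in stream:
            sink.write('%s\n' % record)

Because a pipeline built this way reads stdin and writes stdout, the same script can be tested stand-alone and then dropped into Hadoop Streaming as a mapper or reducer unchanged, which is what the "stand-alone or on Hadoop" bullet refers to.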
[Architecture diagram] Apache logs from the web sites are distributed to Hadoop/HDFS by Fido and processed with Python-ETL, MapReduce, and Hive, together with external data sources and CMS systems; results are loaded into the DW database and feed web metrics, billers, data mining, Clickmap, and other web analytics consumers.
Web Log Processing by Hadoop Streaming and Python-ETL
• Parsing web logs (a parsing sketch follows this slide)
– IAB filtering and checking
– Parsing user agents by regex
– IP range lookup
– Look up product key, etc.
• Sessionization (a sessionizer sketch follows this slide)
– Prepare Sessionize
– Sessionize
– Filter-unpack
• Process huge dimensions (URL, page title)
• Load Facts
– Format load files and load data to DB
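As a concrete illustration of the "parsing user agents by regex" and "IP range lookup" steps (the actual regexes and lookup tables are CBSi-internal and not in the slides), a parsing filter might look roughly like this; the browser patterns and the range table below are toy examples, not production data.

import bisect
import re

# Toy browser patterns; a production list is far longer and ordered by specificity.
BROWSER_PATTERNS = [
    ('chrome', re.compile(r'Chrome/(\d+)')),
    ('firefox', re.compile(r'Firefox/(\d+)')),
    ('msie', re.compile(r'MSIE (\d+)')),
]

def parse_user_agent(ua):
    """Return (browser_name, major_version) from a raw user-agent string."""
    for name, pattern in BROWSER_PATTERNS:
        m = pattern.search(ua or '')
        if m:
            return name, int(m.group(1))
    return 'other', None

# IP range lookup: (start, end, label) tuples sorted by start, searched with bisect.
IP_RANGES = [
    (167772160, 184549375, 'internal'),        # 10.0.0.0 - 10.255.255.255
    (2102593536, 2102593791, 'sample-range'),  # 125.83.8.0 - 125.83.8.255 (fabricated)
]
RANGE_STARTS = [r[0] for r in IP_RANGES]

def ip_to_int(ip):
    a, b, c, d = (int(x) for x in ip.split('.'))
    return (a << 24) | (b << 16) | (c << 8) | d

def lookup_ip(ip):
    """Map a dotted-quad address to the label of the range containing it."""
    n = ip_to_int(ip)
    i = bisect.bisect_right(RANGE_STARTS, n) - 1
    if i >= 0 and IP_RANGES[i][0] <= n <= IP_RANGES[i][1]:
        return IP_RANGES[i][2]
    return 'unknown'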
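The sessionize step groups a visitor's events and breaks them into sessions: the prepare-sessionize pass gets every event for the same cookie onto the same reducer sorted by time (exactly what the M|S|R shuffle provides), and the sessionizer then only has to watch the gap between consecutive events. The 30-minute timeout below is a common convention, not a number taken from the talk.

SESSION_TIMEOUT = 30 * 60  # seconds; conventional value, assumed for illustration

def sessionize(events):
    """events: iterable of (cookie, epoch_seconds, payload), already sorted by
    (cookie, epoch_seconds), e.g. by the prepare-sessionize shuffle/sort step.
    Yields (cookie, session_number, payload)."""
    last_cookie, last_ts, session_no = None, None, 0
    for cookie, ts, payload in events:
        if cookie != last_cookie:
            session_no = 1                      # new visitor: start at session 1
        elif ts - last_ts > SESSION_TIMEOUT:
            session_no += 1                     # same visitor, long gap: new session
        yield cookie, session_no, payload
        last_cookie, last_ts = cookie, ts

Run as a Hadoop Streaming reducer, the shuffle's sorted-by-key input guarantee makes this single pass sufficient; how the filter-unpack step then flattens the sessionized stream is not detailed in the slides.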
Benefits to Ops
• Processing time now meets the SLA, saving 6 hours
• Running 2 years in production without any big issues
• Withstood a 50%-per-year increase in data volume
• The architecture makes it easy to add new processing logic
• Robust and Fault-Tolerant
– Five dead datanodes, jobs still ran OK
– Upgraded JVM on a few datanodes while jobs running
– Reprocessing old data while processing data of current day
Conclusions I – Create a Tool Appropriate to the Job if Existing Tools Don't Have What You Want
• Python ETL Framework and Hadoop Streaming together
can do complex, big volume ETL work
• Python ETL Framework
– Home grown, under review for open-source release
– Rich functionality from Python
– Extensible
– NLS support
• Put on top of another platform, e.g. Hadoop, which provides
– Distributed/Parallel
– Sorting
– Aggregation
Conclusions II – Power and Flexibility for
Processing Big Data
• Hadoop - scale and computing horsepower
– Robustness
– Fault-tolerance
– Scalability
– Significant reduction of processing time to reach SLA
– Cost-effective
− Commodity HW
− Free SW
• Currently:
– Building multi-tenant Hadoop clusters using the Fair Scheduler
Questions?
[email protected]
Follow up on Lumberjack:
[email protected]