Transcript Part I

Delip Rao
[email protected]
What is the typical size of data you deal with on a daily basis?
Google:
• Processes 20 petabytes of raw data a day
• Works out to roughly 231 GB per second!
• Numbers from 2008; they grow by the day.
http://www.niallkennedy.com/blog/2008/01/google-mapreduce-stats.html
Facebook:
• 200 GB per day (March 2008)
• 2+ TB (compressed) per day in April 2009
• 4+ TB (compressed) per day in Oct 2009
• 15 TB (uncompressed) a day (2009)
And Many More …
• eBay: 50 TB / day
• NYSE: 1TB / day
• CERN LHC : 15 PB / Year
Storage and Analysis of Tera-scale Data: 1 of 2
415 Database Class
11/17/09
In Today’s Class We Will ..
• Deal with scale from a completely different
perspective
• Discuss problems with traditional approaches
• Discuss how to analyze large quantities of data
• Discuss how to store (physically) huge
amounts of data in a scalable, reliable fashion
• Discuss a simple effective approach to store
record-like data
Dealing with scale
• MySQL will crawl with 500 GB of data
• “Enterprise” databases (Oracle, DB2)
– Expensive $$$$$$$$
– Do not scale well
• Indexing is painful
• Aggregate operations (SELECT COUNT(*) ..) are almost impossible
• Distributed databases
– Expensive $$$$$$$$$$$$$$$$$$$$$$$$
– Don’t scale well either
Large Scale Data: Do we need
databases?
• Traditional database design is inspired by decades of research on storage and retrieval.
• Complicated database systems == more tuning
– A whole industry of “Database Administrators”
– Result: Increased operational expenses
• Complicated indexing and transaction-processing algorithms are not needed if all we care about is analyzing the data
Parallelize both data access and
processing
• Over time, processing capacity has grown much faster than
– Disk transfer rates (slow)
– Disk seek times (even slower)
• Solution:
– Process data on a cluster of nodes with independent CPUs and independent disks (a back-of-the-envelope comparison follows below).
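A rough back-of-the-envelope illustration (the ~100 MB/s per-disk transfer rate is an assumption, not a figure from the class): scanning 1 TB from a single disk at 100 MB/s takes about 10,000 seconds, close to 3 hours, while the same scan spread over 100 disks reading in parallel finishes in roughly 100 seconds. Seek-heavy access patterns make the single-disk case even worse, which is why parallel I/O across a cluster wins.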
Overview
• MapReduce is a design pattern:
– Manipulates large quantities of data
– Abstracts away system-specific issues
– Encourages cleaner software engineering
– Inspired by functional programming primitives
MapReduce by Example
• Output: Word frequency histogram
• Input: Text, read one line at a time
• Single-core design: Use a hash table
• MapReduce:
def mapper(line):
    for word in line.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))
Word Frequency Histogram (contd)
[Figure: the input text “the quick brown fox / the fox ate the rabbit / the brown rabbit” is split across three MAPPERs; the SHUFFLE step groups the mappers’ (word, 1) pairs by word and routes them to REDUCERs, which emit the final counts: (ate, 1), (brown, 2), (fox, 2), (quick, 1), (rabbit, 2), (the, 4).]
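To make the figure concrete, here is a small single-process Python sketch that simulates the map, shuffle, and reduce phases on the same three input lines; it only illustrates the data flow, not how Hadoop actually schedules or executes a job.

from collections import defaultdict

lines = ["the quick brown fox", "the fox ate the rabbit", "the brown rabbit"]

def mapper(line):
    # Emit a (word, 1) pair for every word on the line.
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    # Sum all counts that arrived for one word.
    return (key, sum(values))

# Map phase: apply the mapper to every input line.
mapped = [pair for line in lines for pair in mapper(line)]

# Shuffle phase: group all values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: one reducer call per distinct key.
print(sorted(reducer(k, v) for k, v in groups.items()))
# [('ate', 1), ('brown', 2), ('fox', 2), ('quick', 1), ('rabbit', 2), ('the', 4)]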
WordCount review
• Output: Word frequency histogram
• Input: Text, read one line at a time
– Key: ignore, Value: Line of text
def mapper(key, value):
    for word in value.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))
WordCount: In actual code
• Mapper
• Reducer
• Driver (main) method
Observe the benefits of abstraction from hardware dependence, reliability, and job distribution.
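The listings on these slides use the Hadoop Java API. As a hedged stand-in, the same WordCount job can be expressed with Hadoop Streaming and two short Python scripts; the file names and the streaming jar path are illustrative and vary by installation.

#!/usr/bin/env python
# mapper.py: reads raw text lines on stdin, emits one "word<TAB>1" line per word
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py: Streaming delivers mapper output sorted by key, so all counts
# for a given word arrive on consecutive lines and can be summed in one pass
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

# Driver: submitted from the command line instead of a main() method, e.g.
#   hadoop jar /path/to/hadoop-streaming.jar \
#     -input input_dir -output output_dir \
#     -mapper mapper.py -reducer reducer.py \
#     -file mapper.py -file reducer.py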
“Thinking” in MapReduce
• Input is a sequence of key value pairs (records)
• Processing of any record is independent of the
others
• Need to recast algorithms, and sometimes data, to fit this model
– Think of structured data (Graphs!)
Example: Inverted Indexing
• Say you have a large (billions) collection of
documents
• How do you efficiently find all documents that
contain a certain word?
• Database solution:
SELECT doc_id FROM doc_table WHERE doc_text CONTAINS 'word';
• Forget scalability. Very inefficient.
• Another demonstration of when not to use a DB
Example: Inverted Indexing
• Well-studied problem in the Information Retrieval community
• More about this in the 600.466 course (Spring)
• For now, we will build a simple index
– Scan all documents in the collection
– For each word, record the document in which it appears
• Can write a few lines of Perl/Python to do it
– Simple, but will take forever to finish
(What is the complexity of this code?)
“Thinking” in MapReduce (contd)
• Building inverted indexes
• Input: Collection of documents
• Output: For each word, find all documents containing the word
(What is the latency of this code?)
def mapper(filename, content):
    for word in content.split():
        output(word, filename)

def reducer(key, values):
    output(key, unique(values))
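As with WordCount, the logic can be checked locally before worrying about scale. The sketch below applies the same mapper/reducer idea to two in-memory “documents”; the file names and text are invented for illustration.

from collections import defaultdict

docs = {
    "doc1.txt": "the quick brown fox",
    "doc2.txt": "the lazy fox",
}

# word -> set of file names; the set plays the role of unique()
index = defaultdict(set)
for filename, content in docs.items():
    for word in content.split():      # mapper: output(word, filename)
        index[word].add(filename)     # shuffle + reduce collapsed into one step

for word in sorted(index):
    print(word, sorted(index[word]))
# brown ['doc1.txt'] ... fox ['doc1.txt', 'doc2.txt'] ... the ['doc1.txt', 'doc2.txt']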
Suggested Exercise
• Twitter has data in following format:
<user_id, tweet_text, timestamp>
• Write map-reduces for
– Finding all users who tweeted on “Comic Con”
– Ranking all users by frequency of their tweets
– Finding how the number of tweets containing “iPhone” varies with time
MapReduce vs RDBMS
            Traditional RDBMS          MapReduce
Data size   Gigabytes                  Petabytes
Access      Interactive & batch        Batch
Updates     Read & write many times    Write once, read many times
Structure   Static schema              Dynamic schema
Integrity   High                       Low
Scaling     Nonlinear                  Linear
The Apache Hadoop Zoo
[Diagram: the Hadoop subprojects: Common, Avro, MapReduce, HDFS, ZooKeeper, Pig, Hive, HBase, and Chukwa.]
Storing Large Data: HDFS
• Hadoop Distributed File System (HDFS)
• Very large distributed file system (~10 PB)
• Assumes commodity hardware
– Replication
– Failure detection & Recovery
• Optimized for batch processing
• Single namespace for entire cluster
hdfs://node-21/user/smith/job21/input01.txt
HDFS Concepts
• Blocks – A single unit of storage
• Namenode (master)
– manages namespace
• Filesystem namespace tree + metadata
– Maintains file to block mapping
• Datanodes (workers)
– Perform block-level operations
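A toy picture of the namenode’s bookkeeping, with every block ID and node name invented for illustration; the real metadata structures are considerably more involved.

# The namenode maps each file to an ordered list of blocks, and each block
# to the datanodes that hold a replica of it.
file_to_blocks = {
    "/user/smith/job21/input01.txt": ["blk_001", "blk_002"],
}
block_to_datanodes = {
    "blk_001": ["node-03", "node-11", "node-21"],   # 3-way replication
    "blk_002": ["node-05", "node-11", "node-17"],
}

def locate(path):
    # Which datanodes must a client contact, block by block, to read this file?
    return [(blk, block_to_datanodes[blk]) for blk in file_to_blocks[path]]

print(locate("/user/smith/job21/input01.txt"))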
Storing record data
• HDFS is a filesystem: An abstraction for files
with raw bytes and no structure
• A lot of real-world data occurs as tuples
– Hence RDBMSs. If only they were scalable …
• Google’s solution: BigTable (2004)
– A scalable distributed multi-dimensional sorted
map
• Currently used in 100+ projects inside Google
– 70+ PB data; 30+ GB/s IO (Jeff Dean, LADIS ’09)
Storing record data: HBase
• Open-source clone of Google’s Bigtable
• Originally created at Powerset in 2007
• Used in: Yahoo!, Microsoft, Adobe, Twitter, …
• Distributed column-oriented database on top of HDFS
• Real time read/write random-access
• Not Relational and does not support SQL
• But can work with very large datasets
– Billions of rows, millions of columns
HBase: Data Model
• Data stored in labeled tables
– Multi-dimensional sorted map
• Table has rows, columns.
• Cell: Intersection of a row & column
– Cells are versioned (timestamped)
– Contains an uninterpreted array of bytes (no type information)
• Primary key (row key): uniquely identifies a row
HBase: Data model (contd)
• Columns are grouped into column families
– E.g., temperature:air, temperature:dew_point
• Thus a column name has the form family_name:identifier
• The column families are assumed to be known a priori
• However, new columns can be added to an existing family at run time
ROW: location_id
COLUMN FAMILIES:
temperature:  temperature:air, temperature:dew_point
humidity:     humidity:absolute, humidity:relative, humidity:specific
…
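One way to picture this data model is as a sorted map of maps: row key -> { "family:qualifier" -> { timestamp -> bytes } }. The minimal sketch below uses invented row keys and values and only illustrates versioned cells and key-ordered scans, not HBase’s actual implementation.

import time

table = {}   # row key -> { "family:qualifier" -> { timestamp -> value } }

def put(row, column, value, ts=None):
    # Store a new version of the cell under the given (or current) timestamp.
    ts = ts if ts is not None else int(time.time() * 1000)
    table.setdefault(row, {}).setdefault(column, {})[ts] = value

def get(row, column):
    # Return the most recent version of the cell (cells are timestamped).
    versions = table.get(row, {}).get(column, {})
    return versions[max(versions)] if versions else None

put("location_042", "temperature:air", b"21.5", ts=1)
put("location_042", "temperature:air", b"22.0", ts=2)    # a newer version
put("location_042", "humidity:relative", b"0.63", ts=2)

print(get("location_042", "temperature:air"))            # b'22.0'

# Rows are kept sorted by row key; iterating in key order mimics a scan.
for row in sorted(table):
    print(row, sorted(table[row]))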
HBase: Data model (contd)
• Tables are partitioned into regions
• Region: Subset of rows
• Regions are units that get distributed across a
cluster
• Locking
– Row updates are atomic
– Updating a cell locks the entire row.
– Simple to implement. Efficient.
– Also, updates are rare
HBase vs. RDBMS
• Scale: billions of rows and millions of columns
• Traditional RDBMS:
– Fixed schema
– Good for small to medium volume applications
– Scaling RDBMS involves violating Codd’s Rules,
loosening ACID properties
HBase schema design case study
• Store information about students, courses,
course registration
• Relationships (Two one-to-many)
– A student can take multiple courses
– A course is taken by multiple students
HBase schema design case study
• RDBMS solution
STUDENTS: id (primary key), name, department_id
REGISTRATION: student_id, course_id, type
COURSES: id (primary key), title, faculty_id
HBase schema design case study
• HBase solution
ROW: <student_id>
COLUMN FAMILIES:
info:    info:name, info:department_id
course:  course:course_id=type

ROW: <course_id>
COLUMN FAMILIES:
info:     info:title, info:faculty_id
student:  student:student_id=type
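Spelled out with invented IDs and values, the two tables above would hold rows shaped roughly like the plain Python dicts below; this only shows the column layout, not how HBase stores the data.

# One row per <student_id>; the courses a student takes are folded into the row.
students = {
    "s42": {
        "info:name": "Ada Lovelace",
        "info:department_id": "cs",
        "course:db415": "enrolled",    # course:<course_id> = type
        "course:ir466": "audit",
    },
}

# One row per <course_id>; the students taking the course are folded into the row.
courses = {
    "db415": {
        "info:title": "Databases",
        "info:faculty_id": "f7",
        "student:s42": "enrolled",     # student:<student_id> = type
    },
}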
HBase: A real example
• Search Engine Query Log
ROW: request_md5_hash
COLUMN FAMILIES:
query:    query:text, query:lang, ...
cookie:   cookie:id
request:  request:user_agent, request:ip_addr, request:timestamp

It is common practice to use a hash of the record as the row key when no natural primary key exists. This is okay when data is accessed sequentially; there is no need to “look up” individual rows.
Suggested Exercise
• Write an RDBMS schema to model UserFollower network in Twitter
• Now, write its HBase equivalent
Access to HBase
• via the Java API
– Map semantics: Put, Get, Scan, Delete
– Versioning support
• via the HBase shell: $...hbase shell
(Not a real query language; useful for “inspecting” a table. Not the preferred way to access HBase; typically the API is used. A client-side sketch follows the shell example below.)
hbase> create 'test', 'data'
hbase> list
test
…
hbase> put 'test', 'row1', 'data:1', 'value1'
hbase> put 'test', 'row2', 'data:1', 'value1'
hbase> put 'test', 'row2', 'data:2', 'value2'
hbase> put 'test', 'row3', 'data:3', 'value3'
hbase> scan 'test'
…
hbase> disable 'test'
hbase> drop 'test'
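For comparison with the shell session, here is a rough client-side sketch of the same Put/Get/Scan/Delete semantics using the third-party Python happybase library, which talks to HBase through its Thrift gateway. This is an assumption for illustration only (the class material refers to the Java API), and the host name is a placeholder.

import happybase

# Assumes an HBase Thrift server is reachable at this placeholder host.
connection = happybase.Connection("localhost")
table = connection.table("test")

table.put(b"row1", {b"data:1": b"value1"})    # Put
print(table.row(b"row1"))                     # Get  -> {b'data:1': b'value1'}
for key, data in table.scan():                # Scan over all rows
    print(key, data)
table.delete(b"row1")                         # Delete the whole row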
A Word on HBase Performance
• Original HBase had performance issues
• HBase 0.20 (the latest release) is much faster
– Open-source development!
• Performance analysis by StumbleUpon.com
– The website uses over 9 billion rows in a single HBase table
– 1.2 million row reads/sec using just 19 nodes
– Scalable with more nodes
– Caching (a new feature) further improves performance
Are traditional databases really
required?
• Will a bank store all its data on HBase or its
equivalents?
• Unlikely, because Hadoop
– Has no notion of a transaction
– Has no security or access control like databases do
• Fortunately batch processing large amounts of
data does not require such guarantees
• Hot research topic: Integrate databases and
MapReduce – “In database MapReduce”
Summary
• (Traditional) Databases are not Swiss-Army knives
• Large data problems require radically different
solutions
• Exploit the power of parallel I/O and computation
• MapReduce as a framework for building reliable
distributed data processing applications
• Storing large data requires redesign from the
ground up, i.e. filesystem (HDFS)
Summary (contd)
• HDFS : A reliable open source distributed file
system
• HBase : A sorted multi-dimensional map for
record oriented data
– Not Relational
– No query language other than map semantics (Get
and Put)
• Using MapReduce + HBase requires a fair bit of programming experience
– Next class we will study Pig and Hive: “data analyst friendly” interfaces for processing large data.
Suggested Reading
• Jeffrey Dean and Sanjay Ghemawat,
“MapReduce: Simplified Data Processing on
Large Clusters”
• DeWitt and Stonebraker, “MapReduce: A major step backwards”
• Chu-Carroll, “Databases are hammers; MapReduce is a screwdriver”
• DeWitt and Stonebraker, “MapReduce II”
Suggested Reading (contd)
• Hadoop Overview
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
• Who uses Hadoop?
http://wiki.apache.org/hadoop/PoweredBy
• HDFS Architecture
http://hadoop.apache.org/common/docs/current/hdfs_design.html
• Chang et al., “Bigtable: A Distributed Storage System for Structured Data”
http://labs.google.com/papers/bigtable.html
• HBase
http://hadoop.apache.org/hbase/