Evaluation of Hbase Read/Write
(A study of Hbase and its benchmarks)
By Vaibhav Nachankar and Arvind Dwarakanath
Recap of Hbase
Hbase is an open-source, distributed, column-oriented, sorted-map data store.
It is the Hadoop database; it sits on top of HDFS.
Hbase can support reliable storage of, and efficient access to, a huge amount of structured data.
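The "sorted-map" description can be made concrete: conceptually, an Hbase table behaves like a sparse, sorted map from (row key, column) to a value, with rows kept in lexicographic key order. The following is a minimal Java sketch of that mental model only; it is not the Hbase client API, and it omits column families and per-cell timestamps that real Hbase adds.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Conceptual model of an Hbase table: a sorted map of row keys to
// column->value maps. Real Hbase also versions each cell by timestamp
// and groups columns into column families; both are omitted here.
public class SortedMapModel {
    private final NavigableMap<String, NavigableMap<String, String>> rows =
        new TreeMap<>();

    // put: insert a cell value under (rowKey, column)
    public void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    // get: read a single cell, or null if absent (tables are sparse)
    public String get(String rowKey, String column) {
        NavigableMap<String, String> row = rows.get(rowKey);
        return row == null ? null : row.get(column);
    }

    // firstRowKey: rows are stored in sorted order, so range scans
    // are contiguous walks of the map starting from a key
    public String firstRowKey() {
        return rows.isEmpty() ? null : rows.firstKey();
    }
}
```

Because rows are kept sorted by key, a scan over a key range is a contiguous walk, which is why row-key design matters so much for Hbase scan performance.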
Hbase Architecture
Recap of Hbase (contd.)
Modeled after Google's BigTable.
Supports map/reduce with Hadoop.
Optimized for real-time queries.
No single point of failure.
Random-access performance comparable to MySQL.
Application: Facebook's Messaging database.
Hbase Benchmark Techniques
'Hadoop Hbase-0.20.2 Performance Evaluation' by D. Carstoiu, A. Cernian, and A. Olteanu, University of Bucharest.
STRATEGY: uses random reads and writes to test and benchmark Hadoop with Hbase.
Hbase Benchmark Techniques (contd.)
A second performance evaluation of Hbase, by Kareem Dana at Duke University, exercises a varied set of test cases.
STRATEGY: tested across column families, columns, sorts, and interspersed reads/writes.
Yahoo! Cloud Serving Benchmark (YCSB)
‘Benchmarking Cloud Serving Systems with YCSB’
by Brian F. Cooper, Adam Silberstein, Erwin Tam,
Raghu Ramakrishnan, Russell Sears.
This paper/project is designed to benchmark existing and newer cloud storage systems.
So far the benchmark has been run on Hbase, Cassandra, MongoDB, Project Voldemort, and SQL databases.
YCSB
The benchmark tool is driven by workload files, which users can customize: you can specify a 50/50 read/write mix, 95/5, and so on.
The code for the project is available on Github.
https://github.com/brianfrankcooper/YCSB.git
Example of a Workload
# Yahoo! Cloud System Benchmark
# Workload A: Update heavy workload
# Application example: Session store recording recent actions
#
# Read/update ratio: 50/50
# Default data size: 1 KB records (10 fields, 100 bytes each, plus key)
# Request distribution: zipfian
recordcount=1000
operationcount=1000
workload=com.yahoo.ycsb.workloads.CoreWorkload
readallfields=true
readproportion=0.5
updateproportion=0.5
scanproportion=0
insertproportion=0
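A workload file like the one above is an ordinary properties file, so it is easy to sanity-check before a run; one useful invariant is that the operation-mix proportions sum to 1. The sketch below is our own illustration (the `WorkloadCheck` class is not part of YCSB), using only `java.util.Properties`:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class WorkloadCheck {
    // Sum the YCSB operation-mix proportions found in a workload's
    // properties; missing keys default to 0, matching a sparse file.
    public static double proportionSum(Properties p) {
        double sum = 0.0;
        for (String key : new String[] {
                "readproportion", "updateproportion",
                "scanproportion", "insertproportion" }) {
            sum += Double.parseDouble(p.getProperty(key, "0"));
        }
        return sum;
    }

    // Parse workload text in properties format into a Properties object.
    public static Properties load(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new RuntimeException(e); // cannot happen for a StringReader
        }
        return p;
    }
}
```

For Workload A above, the check confirms 0.5 + 0.5 + 0 + 0 = 1; for Workload B, 0.95 + 0.05 = 1.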
Example of a Workload
# Yahoo! Cloud System Benchmark
# Workload B: Read mostly workload
# Application example: photo tagging; add a tag is an update, but most operations are to read tags
#
# Read/update ratio: 95/5
# Default data size: 1 KB records (10 fields, 100 bytes each, plus key)
# Request distribution: zipfian
recordcount=1000
operationcount=1000
workload=com.yahoo.ycsb.workloads.CoreWorkload
readallfields=true
readproportion=0.95
updateproportion=0.05
scanproportion=0
insertproportion=0
Our Project
Install Hbase and get Hadoop to interface with it; study benchmark techniques.
Build a suite of programs and run it on Hadoop/Hbase, including basic get, put, and scan operations.
Extend Word Count's map-reduce job to write its output into Hbase.
Compare with Brisk (Hadoop on Cassandra).
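The word-count extension above boils down to the classic map-reduce counting step. The counting logic can be sketched independently of Hadoop, as below; in the actual job the mapper emits (word, 1) pairs, the reducer sums them, and each final count would be written to Hbase as a cell put rather than kept in memory. This standalone sketch is ours, not the project's code.

```java
import java.util.Map;
import java.util.TreeMap;

public class WordCountCore {
    // The map and reduce phases collapsed into one in-memory pass:
    // tokenize on whitespace and tally occurrences per word.
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

A `TreeMap` keeps the words sorted, which mirrors how the counts would land in row-key order if each word were used as an Hbase row key.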
About Brisk
Cassandra is a NoSQL, BigTable-based database.
DataStax built Brisk to interface Hadoop with Cassandra.
Hadoop + Cassandra = Brisk!
Brisk Architecture
Challenges Faced
Configuring Hbase is a tedious job; not for the weak of will!
Subsequent Hbase releases do not keep their APIs consistent, so we ran into many 'deprecated API' error messages.
Hadoop compatibility with Hbase has to be verified before proceeding with installation.
Challenges Faced (contd.)
Very few documents cover the installation details of Hbase, and even fewer cover Brisk!
Performance for Word Count (2 nodes/2 cores each)
[Chart: word-count run time in seconds across 5 readings, 1 mapper / 3 reducers; average = 45.484 s]
Performance for Word Count (contd.)
[Chart: word-count run time in seconds across 5 readings, 2 mappers / 3 reducers; average = 49.664 s]
Performance for Word Count (contd.)
[Chart: word-count run time in seconds across 5 readings, 2 mappers / 2 reducers; average = 43.7008 s]
Performance for a simple get/put/scan (2 nodes/ 2 core)
[Chart: time in seconds across 5 readings for get, scan, and put; averages: get = 1.84 s, scan = 1.6266 s, put = 1.71 s]
Performance for Word Count (3 nodes/2 cores each)
[Chart: word-count run time in seconds across 5 readings, 1 mapper / 3 reducers; average = 34.047 s]
Performance for Word Count (contd.)
[Chart: word-count run time in seconds across 5 readings, 2 mappers / 3 reducers; average = 36.1012 s]
Performance for Word Count (contd.)
[Chart: word-count run time in seconds across 5 readings, 2 mappers / 2 reducers; average = 37.4358 s]
Conclusions
Brisk seems a much more promising tool, as it integrates Cassandra and Hadoop without much ado.
Hbase/Hadoop APIs need to be kept consistent across releases; with that standardization, they would be much easier to work with.
Hbase reads are faster than writes.
Thank You
Questions?