Study of HBase


Evaluation of HBase Read/Write
(A study of HBase and its benchmarks)
BY
VAIBHAV NACHANKAR
ARVIND DWARAKANATH
Recap of HBase
 HBase is an open-source, distributed, column-oriented, sorted-map data store.
 It is the Hadoop database; it sits on top of HDFS.
 HBase supports reliable storage and efficient access of huge amounts of structured data.
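The "sorted map" in the description above can be illustrated with a toy model: rows kept in order by row key, each row mapping column names to values. This is plain Java for illustration only, not the HBase client API; the class and method names are our own.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class SortedMapModel {
    // HBase conceptually stores a table as a sorted map keyed by row key;
    // each row is itself a map from column name to value.
    private final TreeMap<String, Map<String, String>> rows = new TreeMap<>();

    public void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    public String get(String rowKey, String column) {
        Map<String, String> row = rows.get(rowKey);
        return row == null ? null : row.get(column);
    }

    // Scan: all rows with startRow <= key < stopRow, in sorted row-key order.
    // Because the map is sorted, a range scan is cheap -- this is the property
    // HBase's region-based storage exploits.
    public SortedMap<String, Map<String, String>> scan(String startRow, String stopRow) {
        return rows.subMap(startRow, stopRow);
    }
}
```

Keeping the rows sorted is what makes range scans efficient; random gets and puts are lookups in the sorted structure.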
HBase Architecture
Recap of HBase (contd.)
 Modeled after Google's BigTable.
 Integrates with Hadoop MapReduce.
 Optimized for real-time queries.
 No single point of failure.
 Random-access performance comparable to MySQL.
 Example application: Facebook's messaging database.
HBase Benchmark Techniques
 'Hadoop Hbase-0.20.2 Performance Evaluation' by D. Carstoiu, A. Cernian, and A. Olteanu, University of Bucharest.
 STRATEGY: uses random reads and writes to test and benchmark Hadoop with HBase.
HBase Benchmark Techniques (contd.)
 'Hadoop Hbase-0.20.2 Performance Evaluation' by Kareem Dana at Duke University presents a varied set of test cases for exercising HBase.
 STRATEGY: tests over column families, columns, sorting, and interspersed reads/writes.
Yahoo! Cloud Serving Benchmark (YCSB)
 'Benchmarking Cloud Serving Systems with YCSB' by Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears.
 The paper and accompanying project are designed to benchmark existing and newer cloud storage technologies.
 So far the benchmark has been run against HBase, Cassandra, MongoDB, Project Voldemort, and sharded MySQL.
YCSB
 The benchmark tool is driven by workload files, which users can customize.
 For example, you can specify a 50/50 read/write mix, a 95/5 mix, and so on.
 The code for the project is available on GitHub:
https://github.com/brianfrankcooper/YCSB.git
Example of a Workload
# Yahoo! Cloud System Benchmark
# Workload A: Update heavy workload
# Application example: Session store recording recent actions
#
# Read/update ratio: 50/50
# Default data size: 1 KB records (10 fields, 100 bytes each, plus key)
# Request distribution: zipfian
recordcount=1000
operationcount=1000
workload=com.yahoo.ycsb.workloads.CoreWorkload
readallfields=true
readproportion=0.5
updateproportion=0.5
scanproportion=0
insertproportion=0
Example of a Workload
# Yahoo! Cloud System Benchmark
# Workload B: Read mostly workload
# Application example: photo tagging; add a tag is an update, but most operations are to read tags
#
# Read/update ratio: 95/5
# Default data size: 1 KB records (10 fields, 100 bytes each, plus key)
# Request distribution: zipfian
recordcount=1000
operationcount=1000
workload=com.yahoo.ycsb.workloads.CoreWorkload
readallfields=true
readproportion=0.95
updateproportion=0.05
scanproportion=0
insertproportion=0
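Conceptually, the CoreWorkload draws each operation from the configured proportions: a uniform random number in [0, 1) is mapped onto consecutive intervals sized by readproportion, updateproportion, and so on. A minimal sketch of that selection logic (a hypothetical helper written for this talk, not YCSB's actual code):

```java
public class OperationChooser {
    // Maps a uniform random draw r in [0, 1) to an operation, given the
    // read and update proportions from a workload file. Scan and insert
    // proportions would extend the chain the same way.
    public static String choose(double r, double readProportion, double updateProportion) {
        if (r < readProportion) {
            return "read";
        }
        if (r < readProportion + updateProportion) {
            return "update";
        }
        return "other";
    }

    public static void main(String[] args) {
        // Workload A: readproportion=0.5, updateproportion=0.5
        System.out.println(choose(0.3, 0.5, 0.5)); // read
        // Workload B: readproportion=0.95, updateproportion=0.05
        System.out.println(choose(0.96, 0.95, 0.05)); // update
    }
}
```

With Workload A's 0.5/0.5 split, roughly half the draws land in each interval, producing the 50/50 read/update mix; Workload B's 0.95/0.05 split gives the read-mostly mix.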
Our Project
 Install HBase, get Hadoop to interface with it, and study benchmark techniques.
 Build a suite of test programs and run it on Hadoop/HBase.
 Include basic get, put, and scan operations.
 Extend Word Count's MapReduce job to write its output to HBase.
 Compare with Brisk (Cassandra).
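The counting step of Word Count can be sketched in plain Java. This is an in-memory illustration of the map/reduce logic only; the actual project runs it as a Hadoop MapReduce job and writes the results into an HBase table.

```java
import java.util.Map;
import java.util.TreeMap;

public class WordCount {
    // Map phase: emit (word, 1) for each whitespace-separated token.
    // Reduce phase: sum the counts per word. Here both phases are
    // collapsed into a single in-memory merge for illustration.
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("the quick fox jumps over the lazy dog"));
    }
}
```

In the real job, each mapper emits (word, 1) pairs, Hadoop shuffles them by key, and each reducer sums one word's counts; the HBase extension puts each (word, count) pair into a table instead of writing to HDFS.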
About Brisk
 Cassandra is a NoSQL database with a BigTable-style data model.
 DataStax built Brisk to interface Hadoop with Cassandra.
 Hadoop + Cassandra = Brisk!
Brisk Architecture
Challenges Faced
 Configuration of Hbase is a tedious job! Not for the
weak of will!
 Hbase subsequent releases do not keep the APIs
consistent. So we ran into a lot of ‘deprecated API’
error messages.
 Hadoop compatibility with Hbase has to be verified
before we proceed with installations.
Challenges Faced (contd.)
 Very few documents cover the installation details of HBase.
 Even fewer cover Brisk!
Performance for Word Count (2 nodes/2 cores each)
[Chart: 1 mapper/3 reducers; time in seconds vs. number of readings (1-5); average = 45.484 s]
Performance for Word Count (contd.)
[Chart: 2 mappers/3 reducers; time in seconds vs. number of readings (1-5); average = 49.664 s]
Performance for Word Count (contd.)
[Chart: 2 mappers/2 reducers; time in seconds vs. number of readings (1-5); average = 43.7008 s]
Performance for a simple get/put/scan (2 nodes/2 cores each)
[Chart: time in seconds vs. number of readings (1-5); averages: get = 1.84 s, scan = 1.6266 s, put = 1.71 s]
Performance for Word Count (3 nodes/2 cores each)
[Chart: 1 mapper/3 reducers; time in seconds vs. number of readings (1-5); average = 34.047 s]
Performance for Word Count (contd.)
[Chart: 2 mappers/3 reducers; time in seconds vs. number of readings (1-5); average = 36.1012 s]
Performance for Word Count (contd.)
[Chart: 2 mappers/2 reducers; time in seconds vs. number of readings (1-5); average = 37.4358 s]
Conclusions
 Brisk seems the more promising tool, as it integrates Cassandra and Hadoop without much ado.
 HBase/Hadoop APIs need to be made consistent; with standardization, they would be easier to work with.
 HBase reads are faster than writes.
Thank You
Questions?