Impala and BigQuery

By David Gruzman
BigDataCraft.com
Impala and BigQuery
► BigQuery is Google's database service based on Dremel. BigQuery is hosted by Google.
► Impala is an open source database inspired by the Dremel paper. Impala is part of the Cloudera Hadoop distribution.
Today's agenda
► Overview of Dremel as a technology
► Overview of BigQuery
► A few words about Impala
► DG Mediamind use case
► Deeper insights into Impala
► Conclusions
► Q&A
Why Dremel?
► Google was the first to have MapReduce
► Google was the first to face MapReduce's main problem: latency. The problem propagated to the engines built on top of MapReduce as well.
► It is logical that Google was the first to approach it by developing a real-time query capability for big data.
How Dremel is used at Google
► Dremel is not a replacement for MapReduce or Tenzing but complements them. (Tenzing is Google's Hive.)
► An analyst can run many fast queries using Dremel
► After getting a good idea of what is needed, run slow MapReduce (or SQL based on MapReduce) to get precise results
Why Dremel is unique
► Dremel, with BigQuery built on top of it, is probably the only interactive big data query engine today.
► I mean that it is the only engine capable of producing results over terabytes of data in seconds!
► The main idea (my guess) is to harness a huge cluster of machines for a single query.
Dremel as a technology
► Novel hierarchical columnar format
► LLVM-based code generation
► Distributed aggregation tree
► In-situ data processing (inside the storage)
Dremel: Aggregation tree
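The diagram for this slide did not survive in the transcript. As a rough illustration of the idea only, here is a minimal Python sketch of an aggregation tree: leaf workers compute partial aggregates over their own shards, and upper levels merge the partial results until a single answer remains. The function names, the (key, value) shard layout, the `fanout` parameter, and the in-process "workers" are all assumptions made for this sketch, not Dremel's actual implementation.

```python
from collections import Counter

def leaf_aggregate(shard):
    """Leaf worker: partial SUM(value) GROUP BY key over its local shard."""
    partial = Counter()
    for key, value in shard:
        partial[key] += value
    return partial

def merge(partials):
    """Intermediate or root worker: merge the partial aggregates of its children."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values per key
    return total

def aggregation_tree(shards, fanout=2):
    """Aggregate all shards through a tree with the given fanout (hypothetical parameter)."""
    level = [leaf_aggregate(shard) for shard in shards]
    while len(level) > 1:
        level = [merge(level[i:i + fanout]) for i in range(0, len(level), fanout)]
    return dict(level[0]) if level else {}

# Three shards, as if stored on three different machines.
shards = [
    [("us", 3), ("de", 1)],
    [("us", 2), ("fr", 5)],
    [("de", 4)],
]
result = aggregation_tree(shards)
```

Each level only ever sees already-reduced data, which is why this shape works well when the aggregation shrinks the data a lot.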
Dremel: Nested columnar format
BigQuery
► A service built by Google on top of the Dremel engine
► The only query engine as a service (known to me) that works with big data
► Query time does not depend on data size
BigQuery main capabilities
► Aggregations
► Join of a big table to a small table
► Join of two big tables (recently added)
► Hierarchical data format. It makes pre-aggregations cheaper.
Main limitations
► Small result size
► Intermediate results should not exceed memory size
► No "external tables"
Why BigQuery is not popular
So, why is BigQuery not popular?
► Data is not created in the Google cloud. It is hard and not practical to move big data. It is heavy, after all.
► Google is known to change its APIs. BigQuery has also changed over the last years. It is hard to build a business on it.
► Many companies in Internet-related businesses are wary of sharing data with Google.
► It is expensive. At $35 per TB, bills can reach thousands of dollars per day.
Dremel
At the same time, it is technically good
► I got references from a company doing serious testing
► Martin Fowler's company also tested it and gave very good feedback
Question to all of you
Why did your organization decide not to use Google's BigQuery?
Where can we find Impala
Impala
What is Impala
► A massively parallel processing (MPP) database engine, developed by Cloudera
► Integrated into the Hadoop stack at the same level as MapReduce, not above it (as Hive and Pig are)
[Diagram: Hive and Pig run on top of MapReduce; Impala sits beside MapReduce; both MapReduce and Impala work directly on HDFS]
Why Impala
► Data has gravity
► Today a lot of data lives in HDFS
► It is not practical to move big data
► It is practical to bring the engine to the data
► At the same time, MapReduce is not a must
► Impala processes data in the Hadoop cluster without using MapReduce
MapReduce bypass
► Several other modern database engines have also seen the opportunity to bypass MapReduce and work directly with HDFS.
► They take various approaches.
MapReduce bypass
► Existing MPP databases, like Greenplum, store their external tables in HDFS
MapReduce bypass
► JethroData stores data in its own format on HDFS and also works with it without the MR layer.
► Its proprietary format enables full indexing of the data together with columnar efficiency. For highly selective queries this approach has serious advantages.
Use case from DG
I think it will be a typical case in the future
► DG is using Hadoop and Hive
► They are evaluating Impala to do part of the work more efficiently
► After their case presentation we will come back to discuss the internals of Impala
Again: Impala has a different place than Pig and Hive
[Diagram: Hive and Pig on top of MapReduce; Impala beside MapReduce; both on top of HDFS]
Impala architecture
Impala – Dremel traces
► LLVM code generation
► It is really fast
► C++ as the implementation language (not Java...)
► A simple query engine. It actually does the things which can be done in memory.
► A broadcast join algorithm is implemented
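To make the broadcast join mentioned above concrete, here is a minimal Python sketch, not Impala's implementation. The small table is turned into an in-memory hash table that every node would receive a copy of, and each node then probes it with its own partition of the big table. The function name, the dict-based rows, and the sample tables are hypothetical.

```python
def broadcast_join(big_partitions, small_table, big_key, small_key):
    """Join a partitioned big table with a small table held fully in memory."""
    # Build the hash table once from the small table. In a real cluster this
    # table would be copied ("broadcast") to every node.
    lookup = {}
    for row in small_table:
        lookup.setdefault(row[small_key], []).append(row)

    joined = []
    for partition in big_partitions:  # each partition stands in for one node
        for row in partition:
            for match in lookup.get(row[big_key], []):
                joined.append({**match, **row})
    return joined

countries = [
    {"code": "us", "name": "United States"},
    {"code": "de", "name": "Germany"},
]
clicks = [
    [{"country": "us", "clicks": 10}, {"country": "fr", "clicks": 7}],
    [{"country": "de", "clicks": 3}],
]
rows = broadcast_join(clicks, countries, "country", "code")
```

The approach only works while the small table fits in the memory of each node, which matches the "things which can be done in memory" remark on the slide.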
LLVM code generation
► Assume you want to write custom code for a specific query. It will be super efficient.
► Code generation automates this process for each query
► We actually need to super-optimize the inner loop doing filtering (WHERE) and GROUP BY
► LLVM enables us to compile into native code in a fraction of a second
► LLVM enables us to use new CPU capabilities like SSE in a portable way
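The idea behind per-query code generation can be shown without LLVM. The sketch below is an analogy only: it uses Python's built-in `compile` and `exec` to produce a function specialized for one query, where Impala would emit native code through LLVM. The function names, the `amount` column, and the two supported operators are assumptions made for the example.

```python
def interpret(rows, column, op, constant):
    """Generic path: the operator is re-examined for every row."""
    total = 0
    for row in rows:
        value = row[column]
        if op == ">":
            keep = value > constant
        elif op == "<":
            keep = value < constant
        else:
            raise ValueError(op)
        if keep:
            total += row["amount"]
    return total

def generate(column, op, constant):
    """Code generation path: emit source specialized for this one query, then compile it."""
    if op not in (">", "<"):
        raise ValueError(op)
    source = (
        "def query(rows):\n"
        "    total = 0\n"
        "    for row in rows:\n"
        f"        if row[{column!r}] {op} {constant!r}:\n"
        "            total += row['amount']\n"
        "    return total\n"
    )
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["query"]

rows = [{"qty": 5, "amount": 10}, {"qty": 20, "amount": 7}, {"qty": 30, "amount": 1}]
query = generate("qty", ">", 10)  # SELECT SUM(amount) WHERE qty > 10
```

The generated inner loop contains the column name, operator, and constant directly, so nothing about the query is decided per row. That is the property the slide refers to as super-optimizing the inner loop.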
Why is code generation interesting?
► If you develop your own engine, or some piece of code responsible for processing serious data volumes, code generation may give you an order-of-magnitude boost.
► I have had cases where using this technology was a game changer
Impala – Hive traces
► While Dremel converts data into its own format, Impala supports multiple formats. It is a kind of schema on read.
► Impala shares the metastore with Hive, which makes adoption very simple
► Internally, Impala has a well-defined way to add new formats
Impala – unique things
► Impala's "format adapters", called scanners, have predicate pushdown capability
► Probably the only open source MPP engine
► Today we have no other means to run hundreds of CPU cores in one query efficiently without an expensive license
► Hive gives us the same, but not efficiently
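Predicate pushdown in a scanner, the first point above, can be illustrated with a small Python sketch. The `scan_csv` function and its signature are hypothetical and do not reflect Impala's scanner interface; the point is only that the filter is evaluated inside the format reader, so rejected rows never reach the rest of the query plan.

```python
import csv
import io

def scan_csv(text, columns, predicate=None):
    """Hypothetical scanner: parse CSV text and yield only rows passing `predicate`.

    Only the requested columns are returned for the rows that survive.
    """
    reader = csv.DictReader(io.StringIO(text))
    for record in reader:
        if predicate is None or predicate(record):
            yield {name: record[name] for name in columns}

data = "id,country,clicks\n1,us,10\n2,de,3\n3,us,8\n"

# SELECT id, clicks ... WHERE country = 'us', with the WHERE pushed into the scan.
us_rows = list(scan_csv(data, ["id", "clicks"], lambda r: r["country"] == "us"))
```

A scanner for a columnar format could go further and skip whole blocks of data based on the predicate, which a row-oriented CSV reader cannot do.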
Impala vs MPP
► It usually takes many years to create an MPP database
► There are serious simplifications:
  ► The data is read only
  ► There is actually no DBMS, only a query engine
  ► No serious resource management, but measurement (all over the code)
Impala – Hive killer?
► Not so quickly
► Hive does things Impala cannot do yet, like joins between several big tables
► Hive has convenient Java UDFs, while Impala does not
► Impala does not have intra-query fault tolerance
► At the same time, MapReduce is not a good framework for a database engine
Impala – Data formats
► There are scanners for the following types:
  ► RCFile
  ► Parquet (columnar format inspired by Dremel)
  ► CSV
  ► Avro
  ► SequenceFile
Impala – future
► Will get closer to other MPP engines
► Support for more formats
► More advanced scheduling and resource management
Basic benchmark
► TPC-H, Q1, SF=10
► 4 EC2 large instances
► 4 seconds, while Hive takes about 1 minute
► This number means a GROUP BY speed of about 235 MB/sec per core
Impala price per TB
► 1 large instance costs $0.24 per hour
► The cluster costs $0.96 per hour
► Cost of 1 second: $0.96 / 3600
► Such a cluster processes 1.75 GB per second
► So the cost of processing 1 TB is about $0.15
► That is more than 200 times cheaper than BigQuery at $35 per TB
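The arithmetic on this slide can be checked with a few lines of Python. The prices and the throughput are the figures quoted in the talk; the decimal terabyte (1000 GB) and the variable names are assumptions of this sketch.

```python
INSTANCE_PRICE_PER_HOUR = 0.24  # USD, one EC2 large instance (from the slide)
INSTANCES = 4                   # cluster size used in the benchmark
THROUGHPUT_GB_PER_SEC = 1.75    # cluster throughput (from the slide)
BIGQUERY_PRICE_PER_TB = 35.0    # USD, figure quoted earlier in the talk
GB_PER_TB = 1000                # assumption: decimal terabyte

cluster_price_per_sec = INSTANCE_PRICE_PER_HOUR * INSTANCES / 3600
seconds_per_tb = GB_PER_TB / THROUGHPUT_GB_PER_SEC
impala_price_per_tb = cluster_price_per_sec * seconds_per_tb
ratio = BIGQUERY_PRICE_PER_TB / impala_price_per_tb

print(f"Impala: ${impala_price_per_tb:.2f} per TB, about {ratio:.0f}x cheaper than BigQuery")
```

With these inputs the script gives roughly $0.15 per TB and a ratio of about 230. The comparison covers compute cost only and assumes the cluster is kept fully busy.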
Performance - summary
► It is fast when data reduction is big
► It is fast when the data is hot
► It should benefit from fast storage / SSD. My measurements show about 200 MB/sec per core for GROUP BY processing.
► Always at least 10 times faster than Hive
What about clouds?
Impala in the cloud is not elastic
► To be elastic we need to create a cluster when we need it
► Even if we accept per-hour granularity, storage will be a problem
► S3 will not give us hundreds of MB per second per instance
► Data stored in the local file system is transient
Impala - conclusions
► It is the first time I can remember that we can get our hands on a free MPP database
► There is no risk in trying it side by side with Hive
► It is possible to offload part of the work to Impala and do the rest with Hive
► It is part of the Cloudera Hadoop distribution and is easily installed with Cloudera Manager
Materials used
► Benchmarks
http://www.slideshare.net/sudabon/performanceevaluation-of-cloudera-impala-2012120815536323
https://amplab.cs.berkeley.edu/benchmark/
► Architecture
http://www.slideshare.net/scottleber/impala19176906
https://cloud.google.com/files/BigQueryTechnicalWP.pdf
Materials used - comparisons
► To Hive: http://www.quora.com/Cloudera/DoesCloudera-Impala-have-any-drawbacks-whencompared-with-Hive
► To Vertica: http://www.quora.com/ClouderaImpala/How-does-Cloudera-Impala-compare-toVertica
► To Dremel: http://www.quora.com/ClouderaImpala/How-does-Clouderas-Impala-compareto-Googles-Dremel
Thank you!!!
► Special thanks to Faina Kamenetsky, who helped set up the clusters in Amazon
BigDataCraft.com
► We are a boutique consulting company
► Our services are:
  ► On-paper POC
  ► On-hardware POC
  ► Architecture / design reviews
  ► Custom integrations and bug fixing
Impala - Flow