Shark: SQL and Rich Analytics at Scale


Presented By
Kirti Dighe
Drushti Gawade
What is Shark?
A new data analysis system
Built on top of RDDs and Spark
Compatible with Apache Hive data, metastores, and queries (HiveQL, UDFs, etc.)
Achieves speedups of up to 100x over Hive
Supports low-latency, interactive queries through in-memory computation
Supports both SQL and complex analytics such as machine learning
Shark Architecture

Can query an existing Hive warehouse and return results much faster, without modification
(Diagram of the Shark architecture)
Supports partial DAG execution (PDE)
Optimization of join algorithms



Features of Shark
Supports general computation
Provides an in-memory storage abstraction: the RDD
Engine is optimized for low latency



Spark's main abstraction: the RDD
A collection stored in an external storage system, or a data set derived from it
Can contain arbitrary data types (see the sketch below)
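
A minimal sketch, assuming a Spark deployment and a hypothetical HDFS path and CSV column layout, of how a base RDD is loaded from external storage, a derived data set is built from it, and the result is cached in memory:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

val sc = new SparkContext("local[*]", "rdd-sketch")
// Base RDD: a collection stored in an external storage system (the path is hypothetical).
val visits: RDD[String] = sc.textFile("hdfs://namenode/warehouse/uservisits")
// Derived RDD: the transformations are recorded as lineage rather than materialized eagerly.
val revenueByIp: RDD[(String, Double)] = visits
  .map(_.split(","))
  .map(cols => (cols(0), cols(3).toDouble))   // assumed columns: sourceIP at 0, adRevenue at 3
  .reduceByKey(_ + _)
revenueByIp.cache()   // keep the derived data set in memory for reuse at DRAM speed
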
Benefits of RDDs
Data is returned at the speed of DRAM
Use of lineage
Fast recovery
Immutability provides a foundation for relational processing.

Shark can tolerate the loss of any set of worker nodes.

Recovery is parallelized across the cluster.

The deterministic nature of RDDs also enables straggler mitigation.

Recovery works even in queries that combine SQL and machine learning UDFs.
Executing SQL over RDDs
The process of executing SQL queries includes:
Query parsing
Logical plan generation
Physical plan generation (see the sketch below)
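
A minimal sketch (not Shark's actual planner) of the last step: the physical plan for a simple aggregation query expressed directly as RDD operators.

import org.apache.spark.rdd.RDD

// Query: SELECT sourceIP, SUM(adRevenue) FROM uservisits GROUP BY sourceIP
// Input: an RDD of (sourceIP, adRevenue) pairs produced by the table scan.
def physicalPlan(scan: RDD[(String, Double)]): RDD[(String, Double)] =
  scan.reduceByKey(_ + _)   // the GROUP BY / SUM becomes a shuffle-based aggregation operator
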

Partial DAG Execution (PDE)
Static query optimization
Dynamic query optimization
Collection of statistics at run time
Examples of statistics (sketched below):
Partition sizes and record counts
Lists of "heavy hitters"
Approximate histograms
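
A minimal sketch (hypothetical types, not Shark's PDE code) of the kind of per-partition statistics that can be gathered while map output is materialized:

case class PartitionStats(recordCount: Long, sizeBytes: Long, keyFrequencies: Map[String, Long])

// Collect statistics for one partition of (key, approximate record size) pairs.
def collectStats(partition: Iterator[(String, Long)]): PartitionStats = {
  var count = 0L
  var bytes = 0L
  val freq = scala.collection.mutable.Map.empty[String, Long].withDefaultValue(0L)
  for ((key, size) <- partition) {
    count += 1
    bytes += size
    freq(key) += 1   // a real implementation would keep only approximate "heavy hitters"
  }
  PartitionStats(count, bytes, freq.toMap)
}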

Join Optimization
Skew handling and degree of parallelism
Task scheduling overhead (see the join-selection sketch below)
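
A minimal sketch of how such run-time statistics let the engine pick a join strategy; the threshold and names are assumptions, not Shark's actual values:

// Decide a join strategy from the table sizes reported by run-time statistics.
def chooseJoinStrategy(leftBytes: Long, rightBytes: Long, broadcastThreshold: Long): String =
  if (math.min(leftBytes, rightBytes) <= broadcastThreshold)
    "map (broadcast) join of the smaller table"   // avoids shuffling the larger table
  else
    "shuffle join of both tables"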
Columnar Memory Store
Simply caching records as JVM objects is insufficient
Shark employs column-oriented storage: a partition of columns is one MapReduce "record" (sketched below)
Benefits: compact representation, CPU-efficient compression, cache locality
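A minimal sketch (illustrative only) of the difference between caching one JVM object per row and a column-oriented partition that stores one array per column:

// Row-oriented caching: one JVM object (plus boxed fields) per record.
case class RankingRow(pageURL: String, pageRank: Int, avgDuration: Int)

// Column-oriented partition: one array per column, a compact, compression- and cache-friendly layout.
class RankingColumns(rows: Seq[RankingRow]) {
  val pageURL: Array[String] = rows.map(_.pageURL).toArray
  val pageRank: Array[Int] = rows.map(_.pageRank).toArray   // primitive ints, no per-row object headers
  val avgDuration: Array[Int] = rows.map(_.avgDuration).toArray
}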


Shark supports machine learning as a first-class citizen
The programming model is designed to express machine learning algorithms:
1. Language Integration
Shark allows queries to perform logistic regression over a user database.
Ex: a data analysis pipeline that performs logistic regression over a database (sketched below).
2. Execution Engine Integration
A common abstraction allows machine learning computations and SQL queries to share workers and cached data.
Enables end-to-end fault tolerance
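
A minimal sketch of the language integration, assuming a Shark-style sql2rdd call that returns query results as an RDD, plus hypothetical table and column names; the iterative logistic regression then runs over the cached RDD in the same Scala program:

import org.apache.spark.rdd.RDD

case class Point(features: Array[Double], label: Double)

// Gradient-descent logistic regression over an RDD of labeled points.
def logRegress(points: RDD[Point], dims: Int, iterations: Int): Array[Double] = {
  var w = Array.fill(dims)(0.0)
  for (_ <- 1 to iterations) {
    val gradient = points.map { p =>
      val margin = w.zip(p.features).map { case (wi, xi) => wi * xi }.sum
      val scale = (1.0 / (1.0 + math.exp(-p.label * margin)) - 1.0) * p.label
      p.features.map(_ * scale)
    }.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
    w = w.zip(gradient).map { case (wi, gi) => wi - gi }
  }
  w
}

// Hypothetical usage: the SQL result feeds directly into the algorithm and is cached for the iterations.
val users = sql2rdd("SELECT age, income, label FROM users")   // sql2rdd, table, and columns are assumptions
val points = users.map(r => Point(Array(r.getDouble(0), r.getDouble(1)), r.getDouble(2)))
val weights = logRegress(points.cache(), dims = 2, iterations = 10)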
How to improve Query Processing Speed
Minimize tail latency
Minimize the CPU cost of processing each row

Memory-based shuffle
Minimized temporary object creation
Bytecode compilation of expression evaluation (see the sketch below)
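
A minimal sketch (illustrative, not Shark's implementation) of why compiling expression evaluation beats walking an expression tree once per row:

// Interpreted evaluation: one virtual call per operator node for every row.
sealed trait Expr { def eval(row: Array[Double]): Double }
case class Col(index: Int) extends Expr { def eval(row: Array[Double]): Double = row(index) }
case class Add(left: Expr, right: Expr) extends Expr {
  def eval(row: Array[Double]): Double = left.eval(row) + right.eval(row)
}

val interpreted: Array[Double] => Double = Add(Col(0), Col(1)).eval
// What runtime code generation effectively produces: a single specialized closure per expression.
val compiled: Array[Double] => Double = row => row(0) + row(1)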
Evaluation of Shark using several datasets

Pavlo et al. Benchmark: 2.1 TB of data reproducing Pavlo et al.'s comparison of MapReduce vs. analytical DBMSs [25].
TPC-H Dataset: 100 GB and 1 TB datasets generated by the DBGEN program [29].
Real Hive Warehouse: 1.7 TB of sampled Hive warehouse data from an early industrial user of Shark.
Machine Learning Dataset: 100 GB synthetic dataset to measure the performance of machine learning algorithms.
Shark performs up to 100x faster than Hive.
Methodology and cluster setup
Amazon EC2 with 100 m2.4xlarge nodes
8 virtual cores per node
68 GB of memory
1.6 TB of local storage
Pavlo et al. Benchmark
1 GB/node rankings table
20 GB/node uservisits table

Selection Query (clustered index)
SELECT pageURL, pageRank
FROM rankings WHERE pageRank > X;
Aggregation Queries
SELECT sourceIP, SUM(adRevenue)
FROM uservisits GROUP BY sourceIP;

SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue)
FROM uservisits GROUP BY SUBSTR(sourceIP, 1, 7);

Join Query
SELECT INTO Temp sourceIP, AVG(pageRank), SUM(adRevenue) AS totalRevenue
FROM rankings AS R, uservisits AS UV
WHERE R.pageURL = UV.destURL
  AND UV.visitDate BETWEEN Date('2000-01-15') AND Date('2000-01-22')
GROUP BY UV.sourceIP;
(Figure: join query runtime from the Pavlo benchmark)
(Figure: join strategies chosen by the optimizers)
Data Loading
Shark can query data in HDFS directly, which means its data ingress rate is at least as fast as Hadoop's.
Micro-Benchmarks
Aggregation performance
SELECT [GROUP_BY_COLUMN], COUNT(*) FROM
lineitem GROUP BY [GROUP_BY_COLUMN]

 Join selection at runtime
Fault tolerance
Measuring Shark's performance in the presence of node failures: simulate failures and measure query performance before, during, and after failure recovery.
Real Hive warehouse
1. Query 1 computes summary statistics in 12 dimensions for users of a specific customer on a specific day.
2. Query 2 counts the number of sessions and distinct customer/client combinations, grouped by countries, with filters on eight columns.
3. Query 3 counts the number of sessions and distinct users for all but 2 countries.
4. Query 4 computes summary statistics in 7 dimensions, grouping by a column and showing the top groups sorted in descending order.
Machine Learning Algorithms
Compares the performance of Shark with the same workflow running in Hive and Hadoop
The workflow consisted of three steps (a sketch follows the list):
1) Selecting the data of interest from the warehouse using SQL
2) Extracting features
3) Applying iterative algorithms:
Logistic Regression
K-Means Clustering
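
A minimal sketch of the three-step workflow for k-means clustering, again assuming a Shark-style sql2rdd call and hypothetical table and column names:

def add(a: Array[Double], b: Array[Double]): Array[Double] =
  a.zip(b).map { case (x, y) => x + y }
def squaredDistance(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
def closestCenter(p: Array[Double], centers: Array[Array[Double]]): Int =
  centers.indices.minBy(i => squaredDistance(p, centers(i)))

val rows = sql2rdd("SELECT x, y FROM visits_sample")                       // 1) select the data of interest (query is hypothetical)
val points = rows.map(r => Array(r.getDouble(0), r.getDouble(1))).cache()  // 2) extract feature vectors
var centers = points.takeSample(withReplacement = false, num = 3)          // 3) iterate: k = 3 clusters
for (_ <- 1 to 10) {
  centers = points
    .map(p => (closestCenter(p, centers), (p, 1L)))
    .reduceByKey { case ((s1, c1), (s2, c2)) => (add(s1, s2), c1 + c2) }
    .map { case (_, (sum, count)) => sum.map(_ / count) }
    .collect()
}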
(Figure: Logistic regression, per-iteration runtime in seconds)
(Figure: K-means clustering, per-iteration runtime)


1. A warehouse combining relational queries and complex analytics.
2. Generalizes MapReduce using both traditional database techniques and novel partial DAG execution.
Shark is faster than Hive and Hadoop.