C-Store: Data Management in the Cloud

Download Report

Transcript C-Store: Data Management in the Cloud

C-Store: Data Management in
the Cloud
Jianlin Feng
School of Software
SUN YAT-SEN UNIVERSITY
Jun 5, 2009
What is the Cloud?

A definition from Wikipedia


Platform as a service (e.g, Amazon EC2)



Cloud computing is a style of computing in which
dynamically scalable and often virtualized resources are
provided as a service over the Internet.
allows customers to rent computers (virtual machines) on
which to run their own computer applications.
Software as a service
infrastructure as a service
Amazon EC2 (Elastic Compute Cloud)


EC2 uses Xen virtualization.
Each virtual machine, called an "instance",
functions as a virtual private server in one of
three sizes:


small, large or extra large.
Amazon.com sizes instances based on "EC2
Compute Units"

the equivalent CPU capacity of physical hardware.
One EC2 Compute Unit equals 1.0-1.2 GHz 2007
Opteron or 2007 Xeon processor.
Pricing

Amazon charges customers in two primary
ways:



Hourly charge per virtual machine
Data transfer charge
Amazon advertising describes the pricing
scheme as "you pay for resources you
consume".
Advantage of Public Cloud

Public clouds are hosted by large
infrastructure companies such as



Amazon, Google, Yahoo, Microsoft, Sun
Can afford huge cloud.
For many companies, especially for start-ups
and medium-sized business), setting up a
private cloud can be too expensive



hardware cost
Software cost
Personnel cost for maintaining the system
Cloud Characteristics

Computing power is elastic, but noly if workload is
parallelizable.


Data is stored at an un-trusted host.


Computing power comes from shared-nothing architecture.
A possible solution is encrypting data.
Data is replicated, often across large geographic
distance.

To provide data availability and durability.
Transactional Data Management (OLTP)

Typically does not use a shared-nothing architecture.


It is hard to maintain ACID guarantees in the face of
data replication over large geographic distances.



OLTP systems are usually less than 1TB in size.
Google’s Bigtable implements a replicated shared-nothing
database, by weaking “A” from ACID.
The H-Store project still remains in vision stage.
There are big risks in storing transactional data on
an un-trusted host.

Transactional data include details at the lowest granularity.
First Conclusion

Transactional data management applications
are not well suited for deployment in the
cloud.
Analytical Data Management (DW)

Tend to be read-mostly (read-only), with occasional
batch inserts.

Shared-nothing architecture is a good match.



The ever increasing amount of data is the primary driver for
choosing shared-nothing.
Large scans, multidimensional aggregations, and star
schema joins for analytical workload are easy to parallelize
on shared-nothing system.
Infrequent writes eliminates the need for complex
distributed locking and commit protocols.
Analytical Data Management (DW):
continued

ACID guarantees are typically not needed.


Snapshot isolation is usually enough.
Particularly sensitive data can often be left
out of the analysis.

Less granular versions of the data are usually
used for analysis.
Second Conclusion

Analytical Data Management applications are
well-suited for deployment in the cloud.
Vertica (C-Store) for the Cloud
Cloud DBMS Wish List

Efficiency

Fault Tolerance
 If a query must restart each time a node fails, then long, complex
queries are difficult to complete.

Ability to run in a heterogeneous environment.
 Should prevent the slowest node from making a disproportionate
affect on total query performance.

Ability to operate on encrypted data.

Ability to interface with business intelligence products.
MapReduce vs. Parallel DBMS (1)

Efficiency



Fault Tolerance



MapReduce is good for brute-force scan over unstructured
data such as text documents.
Parallel DBMS is good for selective access of structured
data.
MapReduce takes it as a high priority.
Most parallel DBMS restart a query upon a faiure.
Ability to run in a heterogeneous environment.


MapReduce does well.
Parallel DBMS are generally designed to run in a
homogeneous environment.
MapReduce vs. Parallel DBMS (2)

Ability to operate on encrypted data.


Neither has the native ability to operate on
encrypted data.
Ability to interface with business intelligence
products.


MapReduce is not intended for interfacing with BI
products.
Parallel DBMS supports BI products well.
A Call for A Hybrid Solution

Bring together ideas from MapReduce and
Parallel DBMS.

The hybrid solution should combine


Fault tolerance, heterogeneous cluster, and ease
of use out-of-the-box capabilities of MapReduce
With the efficiency, performance, and tool
plugability of shared-nothing parallel DBMS.
References
1.
2.
Abadi, Daniel J. Data Management in the
Cloud: Limitations and Opportunities. In
IEEE Data Engineering Bulletin, 2009.
Vertica Company. Getting Started with
Vertica Analytic Database for the Cloud.
2009.