C-Store: Data Management in the Cloud
Jianlin Feng
School of Software
SUN YAT-SEN UNIVERSITY
Jun 5, 2009
What is the Cloud?
A definition from Wikipedia
Cloud computing is a style of computing in which
dynamically scalable and often virtualized resources are
provided as a service over the Internet.
Infrastructure as a service (e.g., Amazon EC2)
Allows customers to rent computers (virtual machines) on
which to run their own computer applications.
Platform as a service
Software as a service
Amazon EC2 (Elastic Compute Cloud)
EC2 uses Xen virtualization.
Each virtual machine, called an "instance",
functions as a virtual private server in one of
three sizes:
small, large or extra large.
Amazon.com sizes instances based on "EC2
Compute Units",
the equivalent CPU capacity of physical hardware.
One EC2 Compute Unit is equivalent to a 1.0-1.2 GHz 2007
Opteron or 2007 Xeon processor.
Pricing
Amazon charges customers in two primary
ways:
Hourly charge per virtual machine
Data transfer charge
Amazon advertising describes the pricing
scheme as "you pay for resources you
consume".
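The two charges above can be combined into a rough cost estimate. The sketch below is illustrative only: the function name and all rates are hypothetical placeholders, not Amazon's actual prices.

```python
def estimate_cost(instances, hours, hourly_rate, gb_transferred, rate_per_gb):
    """Estimated bill: per-instance hourly charges plus the data transfer charge."""
    compute = instances * hours * hourly_rate   # hourly charge per virtual machine
    transfer = gb_transferred * rate_per_gb     # data transfer charge
    return compute + transfer

# Hypothetical rates, chosen only to make the arithmetic easy to follow.
cost = estimate_cost(instances=10, hours=24, hourly_rate=0.10,
                     gb_transferred=50, rate_per_gb=0.10)
print(f"Estimated cost: ${cost:.2f}")
```

Because the bill scales with instance-hours, running 10 instances for 24 hours costs the same in compute charges as running 240 instances for 1 hour, which is what makes elastic scaling attractive for parallelizable work.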
Advantage of Public Cloud
Public clouds are hosted by large
infrastructure companies such as
Amazon, Google, Yahoo, Microsoft, Sun
They can afford huge clouds.
For many companies (especially start-ups
and medium-sized businesses), setting up a
private cloud can be too expensive:
Hardware cost
Software cost
Personnel cost for maintaining the system
Cloud Characteristics
Computing power is elastic, but only if the workload is
parallelizable.
Computing power comes from a shared-nothing architecture.
Data is stored at an un-trusted host.
A possible solution is encrypting data.
Data is replicated, often across large geographic
distances,
to provide data availability and durability.
Transactional Data Management (OLTP)
Typically does not use a shared-nothing architecture.
OLTP systems are usually less than 1TB in size.
The H-Store project is still at the vision stage.
It is hard to maintain ACID guarantees in the face of
data replication over large geographic distances.
Google’s Bigtable implements a replicated shared-nothing
database by weakening the “A” (atomicity) in ACID.
There are big risks in storing transactional data on
an un-trusted host.
Transactional data include details at the lowest granularity.
First Conclusion
Transactional data management applications
are not well suited for deployment in the
cloud.
Analytical Data Management (DW)
Tends to be read-mostly (or read-only), with occasional
batch inserts.
Shared-nothing architecture is a good match.
The ever increasing amount of data is the primary driver for
choosing shared-nothing.
Large scans, multidimensional aggregations, and star
schema joins for analytical workload are easy to parallelize
on shared-nothing system.
Infrequent writes eliminate the need for complex
distributed locking and commit protocols.
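To illustrate why scans and aggregations parallelize easily on a shared-nothing system, here is a minimal sketch of a partitioned SUM ... GROUP BY. The function names, the round-robin partitioning, and the sample data are all assumptions made for illustration; a real parallel DBMS would run each local step on a separate machine.

```python
from collections import Counter

def partition(rows, num_nodes):
    """Spread rows round-robin across nodes; each node owns a disjoint slice."""
    parts = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        parts[i % num_nodes].append(row)
    return parts

def local_aggregate(rows):
    """Runs independently on one node: SUM(value) GROUP BY key over local rows."""
    totals = Counter()
    for key, value in rows:
        totals[key] += value
    return totals

def merge(partials):
    """Coordinator step: add up the per-node partial sums."""
    totals = Counter()
    for partial in partials:
        totals.update(partial)  # Counter.update adds values per key
    return dict(totals)

# Hypothetical fact table of (region, amount) rows.
sales = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 1)]
partials = [local_aggregate(p) for p in partition(sales, num_nodes=3)]
result = merge(partials)
print(result)
```

Each node reads only its own partition and the nodes exchange nothing until the final merge, so no locking or commit protocol is involved in this read-only query.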
Analytical Data Management (DW):
continued
ACID guarantees are typically not needed.
Snapshot isolation is usually enough.
Particularly sensitive data can often be left
out of the analysis.
Less granular versions of the data are usually
used for analysis.
Second Conclusion
Analytical Data Management applications are
well-suited for deployment in the cloud.
Vertica (C-Store) for the Cloud
Cloud DBMS Wish List
Efficiency
Fault Tolerance
If a query must restart each time a node fails, then long, complex
queries are difficult to complete.
Ability to run in a heterogeneous environment.
Should prevent the slowest node from having a disproportionate
effect on total query performance.
Ability to operate on encrypted data.
Ability to interface with business intelligence products.
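The heterogeneous-environment item can be made concrete with a small simulation. The sketch below compares an equal static split of work with a scheme where nodes pull small tasks as they become free; the function names, node speeds, and task size are hypothetical, and this is one possible scheduling approach rather than how any particular system works.

```python
import heapq

def static_makespan(total_work, speeds):
    """Equal split: every node gets the same share, so the slowest finishes last."""
    share = total_work / len(speeds)
    return max(share / s for s in speeds)

def dynamic_makespan(total_work, speeds, task_size):
    """Nodes pull small tasks as they become free, so fast nodes do more work."""
    # Min-heap of (time at which the node becomes free, node index).
    free_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(free_at)
    remaining = total_work
    finish = 0.0
    while remaining > 0:
        t, i = heapq.heappop(free_at)        # earliest-free node takes the next task
        work = min(task_size, remaining)
        remaining -= work
        done = t + work / speeds[i]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, i))
    return finish

# Hypothetical cluster: three normal nodes and one running at quarter speed.
speeds = [1.0, 1.0, 1.0, 0.25]
static_time = static_makespan(120, speeds)
dynamic_time = dynamic_makespan(120, speeds, task_size=1)
print(static_time, dynamic_time)
```

With the static split, total query time is dictated by the slow node; with small pulled tasks, the slow node simply completes fewer of them.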
MapReduce vs. Parallel DBMS (1)
Efficiency
MapReduce is good for brute-force scans over unstructured
data such as text documents.
Parallel DBMS is good for selective access of structured
data.
Fault Tolerance
MapReduce takes it as a high priority.
Most parallel DBMSs restart a query upon a failure.
Ability to run in a heterogeneous environment.
MapReduce does well.
Parallel DBMSs are generally designed to run in a
homogeneous environment.
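The fault-tolerance difference can be sketched by counting how many task executions each strategy needs when one task fails once. This is a simplified model with hypothetical function names, not the behavior of any specific MapReduce implementation or DBMS.

```python
def run_with_task_retry(tasks, fails_once):
    """MapReduce-style: a failed task is re-executed; completed tasks are kept."""
    executions = 0
    for task in tasks:
        executions += 1              # first attempt
        if task in fails_once:
            executions += 1          # re-run only the failed task
    return executions

def run_with_query_restart(tasks, fails_once):
    """Restart-style: any failure discards all progress and the query starts over."""
    executions = 0
    pending_failures = set(fails_once)
    while True:
        restarted = False
        for task in tasks:
            executions += 1
            if task in pending_failures:
                pending_failures.remove(task)   # each failure happens only once
                restarted = True
                break
        if not restarted:
            return executions

# Hypothetical query made of 10 tasks, where task 7 fails on its first attempt.
tasks = list(range(10))
retry_cost = run_with_task_retry(tasks, fails_once={7})
restart_cost = run_with_query_restart(tasks, fails_once={7})
print(retry_cost, restart_cost)
```

The gap between the two counts grows with query length and failure frequency, which is why restart-on-failure becomes costly for long queries on large clusters.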
MapReduce vs. Parallel DBMS (2)
Ability to operate on encrypted data.
Neither has the native ability to operate on
encrypted data.
Ability to interface with business intelligence
products.
MapReduce is not intended for interfacing with BI
products.
Parallel DBMS supports BI products well.
A Call for A Hybrid Solution
Bring together ideas from MapReduce and
Parallel DBMS.
The hybrid solution should combine
the fault tolerance, heterogeneous cluster support, and
out-of-the-box ease of use of MapReduce
with the efficiency, performance, and tool
pluggability of shared-nothing parallel DBMS.
References
1. Abadi, Daniel J. Data Management in the
Cloud: Limitations and Opportunities. In
IEEE Data Engineering Bulletin, 2009.
2. Vertica Company. Getting Started with
Vertica Analytic Database for the Cloud.
2009.