Cloud Computing - CS 609 : Database Management


Intro to Cloud Computing
Source: http://www.free-pictures-photos.com/
Cloud Computing
• No longer the next big thing – the current big
thing
– Began in 2007 – IBM and Google “Blue Cloud”
– Name cloud inspired by cloud symbol
representing internet in diagrams
What is Cloud Computing?
• But what is it?
• Everyone has a different opinion on what it is
• Is it trendy?
• “The computer industry is the only industry that is
more fashion-driven than women’s fashion”
– Larry Ellison
Questions to answer
• What clouds have you used today (yesterday)?
• What is a cloud?
Applications
• What does cloud computing actually do?
– Consider applications you may currently be
running on laptop, desktop, phone, server
– Cloud has them also, or can potentially bring them
to you
– Brings applications, views, manipulates, shares
data
Cloud Computing
• Everyone has an opinion on what to use a
cloud for
– Applications on the internet – email, tax prep
– Storage for business, personal data
– Web services for photos, maps, GPS
– Rent a virtual server, load software on it, turn it on
/off, clone it if sudden workload demand
– Store, secure data for authorized access (really?)
– Use a platform including OS, Apache, MySQL,
Python, PHP
Cloud Computing Characteristics
• So what are its characteristics?
• AKA On-demand computing, pay as you go,
software as a service, utility computing
• Typically access through the internet
• Distributed and highly parallel approach
• Usually costs $$$, but cost-effective
• Virtualization
• Elastic
• Replication, replication, replication …
Cloud Components
• 3 components
– Clients
• Mobile, thin, thick
– Datacenter
– Distributed servers
Data Center
• Data Center
– Collection of servers
– In large room in your building, across world
• Distributed Servers
– Distributed data centers
• geographically disparate
• Robust if failure
• Dynamic datacenter so
can increase as needed
Clouds
• Allow access to applications other than on
local computer or internet connected device
• Instead, company hosts your application
Advantages?
– No more licenses, service packs, etc.
– Less hardware, etc.
– Can access anywhere
but
– Works only as long as have internet connection
– Lose control – can’t optimize
Cloud Computing Characteristics
• Cost-effective
– start-up company to use a cloud instead of buy
computers, hire IT people, etc.
• Elastic Computing
– company has a temporary surge in business, use
cloud instead of invest in new computing
equipment
Virtualization
• What is virtualization?
– Software implementation of a computer that executes
programs like a physical machine
– Installation of one machine runs on another
– All software in the cloud runs on a server within virtual
machine
– AMD-Virtualization and Intel Virtualization Technologies
(IVT) extensions made it doable
Virtualization
• Virtual Machine VM
– isolated guest OS installation within a normal host OS
– Object of deployment
• Virtual Machine Image –
– Static data containing software (OS, apps, data files) the
VM will run once started
– Used to create VM instance
– Typically stored on disk
• Virtual Machine Instance –
– Running virtual machine
– Started from image, runs OS and processes, computes, etc.
– Dynamic object you can interact with
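The image/instance distinction above can be sketched in a few lines of Python. This is only an illustration: the names `VMImage`, `VMInstance` and `start` are hypothetical and do not model any real hypervisor API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class VMImage:
    """Static description of what a VM will run; never changes once built."""
    name: str
    os: str
    apps: tuple


@dataclass
class VMInstance:
    """A running VM created from an image; has its own mutable state."""
    image: VMImage
    instance_id: int
    running: bool = True
    processes: list = field(default_factory=list)

    def run(self, process_name: str) -> None:
        # Only a running instance can execute processes.
        if not self.running:
            raise RuntimeError("instance is stopped")
        self.processes.append(process_name)

    def stop(self) -> None:
        self.running = False


def start(image: VMImage, instance_id: int) -> VMInstance:
    """Create a new instance from the image; the image itself is untouched."""
    return VMInstance(image=image, instance_id=instance_id)


web_image = VMImage(name="web-server", os="Linux", apps=("Apache", "MySQL"))
vm1 = start(web_image, 1)
vm2 = start(web_image, 2)  # cloning: a second instance from the same image
vm1.run("httpd")
```

Both instances share one immutable image, while each keeps its own process list. That is the sense in which the image is static and the instance is dynamic.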
Virtualization
• Hypervisor – Virtual Machine Manager VMM
• One level higher than supervisory program
• Installed on server hardware
• Easily create copies of existing environments
• Can exist on same servers or different machines
• Single server multiple OS instances, minimize CPU idle
time
[Figure: Traditional Stack (Apps / Operating System / Hardware) vs. Virtualized Stack (Apps on separate OS instances / Hypervisor / Hardware)]
Elastic - Cloud Computing
Characteristics
• Use what you need
– Hardware, platform (OS), software
• Cloud infrastructure used depends on application
– Massive number of servers needed
OR
– Only need one server to run small job
• Company has a temporary surge in business, use cloud
instead of invest in new computing equipment
• Company has a decline in business, don’t have to maintain
unused equipment
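A rough sketch of the "use what you need" idea, in Python. The load figures and the per-server capacity are made-up numbers, and `servers_needed` is a hypothetical helper, not a provider API.

```python
import math


def servers_needed(requests_per_sec: float, capacity_per_server: float,
                   min_servers: int = 1) -> int:
    """Smallest number of servers that can absorb the current load."""
    if capacity_per_server <= 0:
        raise ValueError("capacity_per_server must be positive")
    return max(min_servers, math.ceil(requests_per_sec / capacity_per_server))


# Hypothetical load over one week: quiet days, a surge, then a decline.
daily_load = [80, 120, 950, 2400, 700, 90, 60]
fleet = [servers_needed(load, capacity_per_server=100) for load in daily_load]

# Server-days paid for when scaling elastically vs. owning peak capacity all week.
elastic_server_days = sum(fleet)
fixed_server_days = max(fleet) * len(fleet)
```

With these assumed numbers the elastic fleet uses 46 server-days, against 168 for a fleet sized for the peak.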
Cloud Computing Characteristics
• Redundancy
– Redundancy is the key to the success of clouds
– Google approach – cheap components that fail, so
replicate all processing and storage
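Why replicating cheap components works can be shown with a small calculation. The sketch below assumes replicas fail independently, which real systems only approximate, and the 5% failure probability is an invented figure.

```python
def availability(p_fail: float, replicas: int) -> float:
    """Probability that at least one replica is up, assuming independent failures."""
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be a probability")
    if replicas < 1:
        raise ValueError("need at least one replica")
    # All replicas are down with probability p_fail ** replicas.
    return 1.0 - p_fail ** replicas


single = availability(0.05, 1)  # one cheap component
triple = availability(0.05, 3)  # the same component replicated three times
```

Under these assumptions one component is available 95% of the time, and three replicas together about 99.99% of the time.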
What Motivated Cloud Computing
Initial motivation:
– Web-scale problems – data intensive
Solutions:
– Large data centers
How to access:
– Highly-interactive Web applications (thin client)
Next Step:
– Different models of computing
Data Intensive - How much data?
• CERN’s LHC will generate 15 PB a year
• Facebook – 2.5 PB, growing at 15 TB per day in 2012?
• 25 TB
– 1000 times volume of mail delivered by USPS
• Sloan Digital Sky Survey – 0.5 PB/month in 2015
• “all words ever spoken by human beings”
– ~ 5 EB (1 EB = 10^18 bytes)
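To get a feel for these magnitudes, the figures on this slide can be put on one scale. The sketch assumes decimal units (1 TB = 10^12 bytes) and simply reuses the slide's numbers.

```python
TB = 10**12
PB = 10**15
EB = 10**18

facebook_size = 2.5 * PB
facebook_growth_per_day = 15 * TB
# Days for the stated daily growth to add another 2.5 PB.
days_to_double = facebook_size / facebook_growth_per_day

# LHC output spread evenly over a year, in bytes per day.
lhc_per_day = 15 * PB / 365

# "All words ever spoken" expressed in multiples of the Facebook figure.
spoken_words = 5 * EB
ratio = spoken_words / facebook_size
```

At the stated rate the Facebook store would double in roughly 167 days, the LHC averages about 41 TB per day, and 5 EB is about 2000 times 2.5 PB.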
Solution: Large Data Centers
• Although Google famous for innovating web
searching, Google’s architecture as much a
revolution
– Instead of few expensive servers, use many cheap servers
($5000 instead of $100,000)
• 1/2M servers in ~12 locations
• With thin, wide network
• Cloud – robust and self-healing
– Uses a lot of power
• Need cheaper power solutions
The Result:
Different Computing Model
“Why do it yourself if you can pay someone to do it for you?”
Software-as-a-Service (SaaS)
Platform-as-a-Service (PaaS)
Infrastructure-as-a-Service (IaaS)
IaaS
• Infrastructure as a Service (IaaS) – aka Hardware as a
Service (HaaS) and Utility computing
– Why buy machines when you can rent cycles?
– Utility computing billing – based on what used
– Provides basic storage and compute capabilities as a
service
• Servers, storage systems, CPU cycles, switches,
routers, etc.
• Ex: Amazon’s EC2
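"Why buy machines when you can rent cycles?" can be made concrete with a back-of-the-envelope comparison. The hourly rate below is hypothetical and not a real price. The purchase price is the $5000 cheap-server figure quoted earlier in these slides. Power, staff and networking costs are ignored.

```python
def rental_cost(hours_used: float, hourly_rate: float) -> float:
    """Utility billing: pay only for the hours actually used."""
    return hours_used * hourly_rate


def breakeven_hours(purchase_price: float, hourly_rate: float) -> float:
    """Hours of use at which renting has cost as much as buying."""
    if hourly_rate <= 0:
        raise ValueError("hourly_rate must be positive")
    return purchase_price / hourly_rate


HOURLY_RATE = 0.10       # hypothetical rental price, dollars per server-hour
PURCHASE_PRICE = 5000.0  # cheap-server price quoted earlier in these slides

one_week_job = rental_cost(hours_used=7 * 24, hourly_rate=HOURLY_RATE)
hours_to_breakeven = breakeven_hours(PURCHASE_PRICE, HOURLY_RATE)
```

Under these assumptions a one-week job costs about $16.80 to rent, and renting only catches up with buying after about 50,000 hours of use.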
IaaS
• Does not provide applications to customers
(SaaS and PaaS do)
• Saves cost of purchasing
• Infrastructure can be scaled up or down
• Multiple tenants can use equipment at the
same time
• Device independence – access systems on
different hardware
• Low barriers to entry, example?
– e.g. Samba
PaaS
• Platform as a Service (PaaS) aka cloudware
– Supplies all resources needed to build apps and services
without having to download or install software
– Provides a computing platform and solution stack
– Customer interacts with platform through API
– Layer of software encapsulated provided as service to
build higher level services
– Ex: Google App Engine
PaaS provides
• Ability for development teams across the world to work
together
• Merge web services from multiple sources
• Cost savings from using built-in security,
scalability and failover
• Cost-savings from using higher-level
programming abstractions
SaaS
• Software as a Service (SaaS) – web based
applications
– Software available on cloud for use
– Application hosted as a service to customers who
access via the internet
– Single instance runs and services multiple end
users
– Ex: salesforce.com, Gmail
SaaS
• Pros/Cons
– Customer doesn’t have to maintain or support SW
– Out of customer’s hands when hosting service
changes it
– Use software out of box
– Instead of just paying for it once, billed on an ongoing basis
– Don’t have to pay as much up front, cheaper more
reliable
– Security (SSL used), don’t need VPNs (Virtual private
networks on back-end)
Benefits to SaaS
• Everyone knows WWW, little training needed
• Smaller IT staff needed
• Easier to customize
• Better marketing by providers, accommodate more
• Security (SSL used), don’t need VPNs (Virtual private
networks on back-end)
• But:
• Specific computational need not addressed – may
have to buy own
• Lock-in – can’t move to new vendor without penalty
Future of SaaS
• Move all processing power to the cloud and
carry ultralight input device
– Already happening?
• E-mail
• Google Docs
• Implications for Microsoft, software as purchasable
local application
– Windows Live (Microsoft’s cloud)
– Adobe web-based Photoshop
IaaS, PaaS, SaaS
When not to use a Cloud
• Legislative Issues
– Laws and policy allow freer access to data on a cloud
than private server
• FBI can access data without warrant or owner’s consent
• Geopolitical concerns
– If in Canada, cannot store data on U.S. cloud – Why?
• (because of Patriot Act…)
– What about storing your data on clouds outside of
USA?
Types of Clouds
• Public, Private, Hybrid Clouds
• Names do not necessarily dictate location
• Type may depend on whether temporary or
permanent
Data Bases in Cloud Environments
Based on:
Md. Ashfakul Islam
Department of Computer Science
The University of Alabama
Issues to Consider
• Distributed or Centralized application?
• How can ACID guarantees be maintained?
• CAP theorem
– Consistency, Availability, Partition tolerance
– Data availability (even if network partition) is achieved
by compromising consistency
– Traditional consistency techniques become obsolete
• Consistency becomes bottleneck of data
management deployment in cloud
– Costly to maintain
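The trade-off above can be illustrated with a toy model: two replicas stay available during a network partition, diverge, and converge again once the partition heals. The `Replica` class, the logical timestamps and the last-writer-wins merge rule are assumptions made for this sketch. Real systems use more careful conflict resolution.

```python
class Replica:
    """One copy of the data; stores key -> (logical_time, value)."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def write(self, key, value, logical_time):
        # Always accept the write locally: this is the "available" choice.
        self.store[key] = (logical_time, value)

    def read(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None

    def merge_from(self, other):
        # Anti-entropy with last-writer-wins: keep the newer version per key.
        for key, (t, value) in other.store.items():
            if key not in self.store or self.store[key][0] < t:
                self.store[key] = (t, value)


east, west = Replica("east"), Replica("west")

# Before the partition both replicas agree.
east.write("balance", 100, logical_time=1)
west.merge_from(east)

# Network partition: each side keeps serving writes on its own.
east.write("balance", 80, logical_time=2)
west.write("balance", 150, logical_time=3)
during_partition = (east.read("balance"), west.read("balance"))

# Partition heals: replicas exchange state and converge.
east.merge_from(west)
west.merge_from(east)
after_heal = (east.read("balance"), west.read("balance"))
```

During the partition the two sides answer 80 and 150. After healing both answer 150, and the write of 80 is silently lost. That lost update is the price paid for staying available.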
Analytical DBs - Data Warehousing
• Data Warehousing DW - Popular application of Hadoop
• Typically DW is relational (OLAP)
– but also semi-structured, unstructured data
• Can also be parallel DBs (Teradata)
– column oriented
– Can be expensive, e.g. TBs of data
• Hadoop for DW
– Facebook abandoned Oracle for Hadoop (Hive)
– Also Pig – for semi-structured
Evaluation of Analytical DB
• Analytical DB handles historical data with little or no
updates - no ACID properties.
• Elasticity
– Since no ACID – easier
• E.g. no updates, so locking not needed
– A number of commercial products support elasticity.
• Security
– requirement of sensitive and detailed data
– third party vendor store data
– potential risk of data leakage and privacy violation
• Replication
– Recent snapshot of DB serves purpose.
– Strong consistency isn’t required.
Transactional Data Management
Needed because:
• Transactional Data Management
– heart of database industry
– almost all financial transaction conducted
through it
– rely on ACID guarantees
• ACID properties are main challenge in
transactional DM deployment in Cloud.
Relational Joins
• Hadoop is not a DB
• Debate between parallel DBs and MR for
OLAP
– DeWitt/Stonebraker call MR “step backwards”
– Parallel faster because can create indexes
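For reference, a relational join written in MapReduce style looks roughly like the following single-process Python sketch (a reduce-side join). The table contents and function names are invented for illustration, and no Hadoop API is used.

```python
from collections import defaultdict


def map_phase(users, orders):
    """Tag every record with its source table and emit it under the join key."""
    for user_id, name in users:
        yield user_id, ("U", name)
    for order_id, user_id, amount in orders:
        yield user_id, ("O", (order_id, amount))


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """For each key, pair every user record with every order record."""
    for key in sorted(groups):
        names = [v for tag, v in groups[key] if tag == "U"]
        order_rows = [v for tag, v in groups[key] if tag == "O"]
        for name in names:
            for order_id, amount in order_rows:
                yield (key, name, order_id, amount)


users = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
orders = [(100, 1, 25.0), (101, 2, 10.0), (102, 1, 5.0)]
joined = list(reduce_phase(shuffle(map_phase(users, orders))))
```

Every record of both inputs is read and shuffled, with no index to narrow the work. That full scan is the cost the parallel-DB side of the debate points at.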
Consistency in Clouds
• Consistent database must remain consistent
after execution of successful operations.
• Inconsistency may cause problems
• Consistency is often sacrificed to achieve
availability and scalability.
• Strong consistency maintenance in cloud is
very costly.
DBs in the Cloud
• Slow start for DBs – why??
• Considered Scalable Transactions for Web
Applications in the Cloud
• Two important properties of Web applications
– all transactions are short-lived
– data request can be responded to with a small set
of well-identified data items
• Eventual consistency acceptable
Cloud Provider DB Options
Windows Azure
Data Management
• Can run SQL Server or another DBMS in a VM
created with Azure Virtual Machines
• Free to run NoSQL technologies such as
MongoDB and Cassandra
• Running your own database system is
straightforward, but also requires handling the
administration of that DBMS
Data Management Options
• Figure 3: For data management, Windows Azure provides
relational storage, scalable NoSQL tables, and unstructured
binary storage.
Data Management Options
• Each of the three options addresses a different need:
– relational storage
– fast access to potentially large amounts of simple typed
data
– unstructured binary storage.
• In all cases, data is automatically replicated across
three different computers in an Azure datacenter
• All three options can be accessed either by Windows
Azure applications or by applications running
elsewhere, such as an on-premises datacenter, a
laptop, or phone.
Relational Storage – SQL Database
• Provides all of the key features of a relational
database management system, including
– atomic transactions, concurrent data access by
multiple users with data integrity, ANSI SQL
queries, and a familiar programming model.
– If know SQL Server, using SQL Database is
straightforward.
– can be accessed using Entity Framework,
ADO.NET, JDBC
SQL Database
• But SQL Database isn't just a DBMS in the cloud; it's a
PaaS service.
• You control your data and who can access it and SQL
Database takes care of the administrative grunt work
– such as managing the hardware infrastructure and
automatically keeping the database and operating system
software up to date.
• SQL Database provides a federation option that
distributes data across multiple servers.
– Spread data access requests across multiple servers for
better performance.
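One common way to spread rows over several servers is to split the key space into ranges and route each request by key. The sketch below shows that general idea only. `RangeRouter`, the boundaries and the server names are hypothetical, and this is not the SQL Database federation API.

```python
import bisect


class RangeRouter:
    """Maps a key to the server that owns the key range containing it."""

    def __init__(self, boundaries, servers):
        # boundaries[i] is the smallest key handled by servers[i + 1].
        if len(servers) != len(boundaries) + 1:
            raise ValueError("need exactly one more server than boundaries")
        if list(boundaries) != sorted(boundaries):
            raise ValueError("boundaries must be sorted")
        self.boundaries = list(boundaries)
        self.servers = list(servers)

    def server_for(self, key):
        # bisect_right counts how many boundaries are <= key.
        return self.servers[bisect.bisect_right(self.boundaries, key)]


router = RangeRouter(boundaries=[1000, 2000], servers=["db0", "db1", "db2"])
targets = [router.server_for(k) for k in (5, 1000, 1999, 2000)]
```

Requests for different key ranges land on different servers, which is how the load is spread.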
Tables
• For an application that needs fast access to lots
of typed data, but doesn't need to perform
complex SQL queries
• For storing data, and retrieving it in simple
ways
• NOT relational
• Very scalable: a single table can hold as
much as a terabyte of data
Blobs
• Designed to store unstructured binary data.
• Like Tables, Blobs provides inexpensive
storage
• Single blob can be as large as one terabyte
• Application sees ordinary Windows files, but
the contents are stored in a blob
Amazon
Amazon
• Simple Storage Service S3
– Low-level put/get interface
– Store items up to 5GB
• AWS MySQL – traditional model (non-cloud) on EC2
• AWS MySQL/R – durability of the data guaranteed by the
Replication architecture
– Application server maintains connection to Master copy
and connections to one DB server
– Update transactions handled by Master
– Read-only transactions issued to DB server associated with
application server
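The routing rule described for MySQL/R (updates go to the master, read-only work goes to the replica attached to the application server) can be sketched as follows. The class and connection names are hypothetical, and treating any statement that starts with SELECT as read-only is a simplification.

```python
class AppServerConnections:
    """Connections held by one application server (names are hypothetical)."""

    def __init__(self, master, local_replica):
        self.master = master
        self.local_replica = local_replica

    def route(self, sql: str) -> str:
        """Return the name of the server that should execute this statement."""
        words = sql.split()
        if not words:
            raise ValueError("empty statement")
        # Simplification: only statements starting with SELECT count as read-only.
        is_read_only = words[0].upper() == "SELECT"
        return self.local_replica if is_read_only else self.master


conns = AppServerConnections(master="master", local_replica="replica-2")
read_target = conns.route("SELECT * FROM items WHERE id = 7")
write_target = conns.route("update items set qty = qty - 1 where id = 7")
```

Reads scale with the number of replicas, while all updates still pass through the single master.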
Amazon
• AWS RDS – relational database service, implements same as
AWS MySQL
– RDS is pre-packaged, so users don’t have to worry about
managing deployment of VMs, SW upgrades, etc.
• AWS Simple DB – retrieve records based on key values or
ranges on primary and secondary keys
– Does not synchronize concurrent read/write access to
different copies of same data
– Web service for running queries on structured data
– Eventual data consistency is maintained
– Does not support SQL
– Works with S3 and EC2 to store, process, query
Google
Google - App Engine (Megastore)
• Google has PaaS strategy
• App Engine uses the data engine Megastore
– Scalable structured data store
– Built on BigTable
– Partitioned into space of small DBs, each with own log
• Log stored across Paxos cluster (Paxos – protocol for solving consensus in unreliable
network)
• full ACID semantics within partitions
– Adopted a combined Partitioning and Replication architecture
– Lower consistency across partitions
– 3B write, 20B read transactions per day as of 1/11
– Tables can be arranged hierarchically
– Support for secondary indexes
Google - App Engine (Megastore)
– 3 levels of read consistency
• Current – last committed value
• Snapshot – value as of start of read transaction
• Inconsistent reads – used for cross entity group reads
– Updates within entity group
• Write updates to WAL of entity group, applies to data
• Limited by: log contention - one winner, one loser
– Paxos accepts limited update rate (10**2 per sec)
– Across entity groups
• 2PC
– Support for Backup and recovery
• Synchronous replication, snapshots and incremental log backups
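The "one winner, one loser" behaviour of log contention can be pictured with a single-process stand-in for an entity group's log. In Megastore the decision about who gets a log position is reached with Paxos across replicas. Here a plain length check plays that role, and all names are hypothetical.

```python
class EntityGroupLog:
    """Write-ahead log of one entity group (simplified, single process)."""

    def __init__(self):
        self.entries = []

    def next_position(self):
        return len(self.entries)

    def try_append(self, position, mutation):
        # The append succeeds only if nobody else has taken this position.
        if position != len(self.entries):
            return False
        self.entries.append(mutation)
        return True


log = EntityGroupLog()

# Two writers read the same next position, then race to commit.
pos_a = log.next_position()
pos_b = log.next_position()
a_won = log.try_append(pos_a, {"x": 1})
b_won = log.try_append(pos_b, {"x": 2})

# The loser must re-read the log position and try again.
b_retry = b_won or log.try_append(log.next_position(), {"x": 2})
```

Because writers to the same entity group serialize on one log position at a time, the update rate per group is limited, as noted above.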
Google - App Engine
• AppEngine supports Python, Java with embedded SQL
– Used to support simplified SQL dialect, GQL
– GQL – no aggregate functions or joins
Table 1: Overview of Cloud Services
Service             | AWS MySQL       | AWS MySQL/R     | AWS RDS         | AWS SimpleDB         | AWS S3               | Google AppEng      | MS Azure
Business Model      | IaaS            | IaaS            | PaaS            | PaaS                 | IaaS                 | PaaS               | PaaS
Cloud Provider      | Flexible        | Flexible        | Amazon          | Amazon               | Flexible             | Google             | Microsoft
Web/app server      | Tomcat          | Tomcat          | Tomcat          | Tomcat               | Tomcat               | AppEngine          | .Net Azure
Database            | MySQL           | MySQL Rep       | MySQL           | SimpleDB             | none                 | DataStore          | SQL Azure
Storage / File Sys. | EBS             | EC2 & EBS       | -               | -                    | S3                   | GFS                | Windows Azure
Consistency         | Repeatable Read | Repeatable Read | Repeatable Read | Eventual Consistency | Eventual Consistency | Snapshot Isolation | Snapshot Isolation
AppLanguage         | Java            | Java            | Java            | Java                 | Java                 | Java/AppEngine     | C#
DB-Language         | SQL             | SQL             | SQL             | SimpleDB Queries     | low-level API        | GQL                | SQL
Architecture        | Classic         | Replication     | Classic         | Part.+Repl.          | Distr. Control       | Part.+Repl.(+C)    | Replication
HW Config.          | manual          | manual          | manual          | manual/automatic     | manual               | automatic          | manual/automatic
• Link to paper by Kossmann
Cloud SQL
• Google Cloud SQL
– Available
– One of App Engine’s most requested features –
• Simple way to develop traditional DB driven
applications
– Quicker path to jump off App Engine platform
– DB import/export so can move existing MySQL DBs to
cloud
– Support for both Java JDBC, Python DB-API connections,
less code change required
– No support for PHP on AppEngine, can put PHP apps in
cloud using Quercus
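The "less code change" point rests on Python's DB-API: code written against `connect` / `cursor` / `execute` / `fetchall` looks much the same whichever driver is underneath. The sketch below uses the standard-library `sqlite3` driver purely as a local stand-in. It does not connect to Cloud SQL, and the table and data are invented. Parameter placeholders differ between drivers (`?` here, `%s` in typical MySQL drivers).

```python
import sqlite3


def top_customers(conn, min_total):
    """Return (customer, total) rows using only DB-API cursor calls."""
    cur = conn.cursor()
    cur.execute(
        "SELECT customer, SUM(amount) AS total FROM orders "
        "GROUP BY customer HAVING SUM(amount) >= ? ORDER BY total DESC",
        (min_total,),
    )
    rows = cur.fetchall()
    cur.close()
    return rows


# Local in-memory database standing in for a hosted MySQL instance.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", 25.0), ("bob", 10.0), ("ann", 5.0)],
)
conn.commit()
cur.close()

result = top_customers(conn, 20)
conn.close()
```

Only the `connect` call and the placeholder style would need to change when swapping drivers. The query logic in `top_customers` stays as it is.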
Google - Spanner
• Previous complaints – no cross row transactions
• 2PC too expensive to support because of performance or availability
problems
• What is a Spanner?
• A huge Semi-Relational Database
– Built on top of Colossus (GFS2)
– Seriously, it's huge!
– Scales up to millions of machines
– Shards across multiple data centers
– Data centers across multiple continents
– Lock-free reads
– Externally-consistent writes (transactions)
– Relational Schema
– SQL-like query language
– Reasonable performance
Google - Spanner
• A Layered System
– Relational
– Key-Value
– Paxos TrueTime Colossus
• Google says that the biggest new idea is
TrueTime API
Google - Spanner
• A table must have a primary key (ordered set of columns)
• A table must be marked as a directory or be interleaved in a parent
table
• Interleaved data is actually attached to a row in the parent table
• Data is actually stored as key-values (heterogeneous/interleaved)
• ON DELETE CASCADE means to delete when parent row is
deleted
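Interleaving can be pictured as sorted key-value pairs in which a child row's key starts with its parent row's key. The Python sketch below illustrates that layout and a cascading delete. The table names, keys and helper functions are invented for the example and say nothing about Spanner's actual storage format.

```python
# Each row is stored under a key tuple; child rows extend the parent's key.
rows = {
    ("Users", 2): {"name": "Bob"},
    ("Users", 1): {"name": "Ann"},
    ("Users", 1, "Albums", 20): {"title": "Trip"},
    ("Users", 2, "Albums", 10): {"title": "Cats"},
    ("Users", 1, "Albums", 10): {"title": "Home"},
}


def storage_order(table):
    """Keys in sorted order: each parent is followed directly by its children."""
    return sorted(table)


def delete_cascade(table, parent_key):
    """Delete the parent row and every row whose key starts with parent_key."""
    doomed = [k for k in table if k[:len(parent_key)] == parent_key]
    for k in doomed:
        del table[k]


before = storage_order(rows)
delete_cascade(rows, ("Users", 1))
after = storage_order(rows)
```

In sorted order each user's albums sit next to the user row, so a parent and its children can be read, moved or deleted together.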
Google - Spanner
• Lock-free Read
• Lock-free reads using timestamps
• Read Transaction System uses latest nonblocking timestamp
• Special non-blocking write transaction
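Lock-free reads using timestamps are commonly built on multi-versioning: every write adds a new timestamped version, and a read at timestamp t returns the newest version at or before t. The sketch below shows that idea with plain integers as timestamps. In Spanner the timestamps come from TrueTime, which is not modelled here, and `VersionedCell` is a hypothetical name.

```python
import bisect


class VersionedCell:
    """One data item that keeps every committed version with its timestamp."""

    def __init__(self):
        self.timestamps = []  # commit timestamps, kept in increasing order
        self.values = []

    def write(self, commit_ts, value):
        # In this sketch writes must arrive with increasing timestamps.
        if self.timestamps and commit_ts <= self.timestamps[-1]:
            raise ValueError("commit timestamps must increase")
        self.timestamps.append(commit_ts)
        self.values.append(value)

    def read_at(self, read_ts):
        # Latest version with commit_ts <= read_ts; no lock is taken.
        i = bisect.bisect_right(self.timestamps, read_ts)
        return self.values[i - 1] if i else None


cell = VersionedCell()
cell.write(10, "v1")
cell.write(20, "v2")
snapshot_before = cell.read_at(15)
cell.write(30, "v3")  # a later write does not disturb the snapshot
snapshot_after = cell.read_at(15)
latest = cell.read_at(30)
```

A reader pinned to timestamp 15 sees the same value before and after the later write, so it never has to block a writer or wait for one.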