Database overview
CS 292 Special topics on
Big Data
Yuan Xue
([email protected])
Part II NoSQL Database
(Overview)
Outline
From SQL to NoSQL
Motivation
Challenges
Approaches
Notable NoSQL systems
Database User/Application Developer: How to use?
(Logic) data model and CRUD operations
Database System Designer: How to design?
Under the hood: (Physical) data model and distribution algorithm
Database Designer: How to link application needs with database design
Schema design
Summary and Moving forward
Summary of NoSQL Data Modeling Techniques
Summary of NoSQL Data Distribution Models/Algorithms
Limits of NoSQL
NewSQL
Data Model
• Column Family
• Key-value
• Document
From SQL to NoSQL
Relational Database Review
Persistent data storage
Transaction support
ACID
Concurrency control + recovery
Standard data interface for data sharing
SQL
Design
Conceptual Design: Entity/Relationship model
Logical Design: Data model mapping, Logical Schema
Normalization: Normalized Schema
Physical Design: Physical (Internal) Schema
Operation
SQL query
SQL Review -- Putting Things Together
Users
Application Program/Queries
Query Processing
DBMS system
Query Processing
Data access (Database Engine)
Meta-data
Data
http://www.dbinfoblog.com/post/24/the-query-processor/
From SQL to NoSQL
Motivation I
Scaling (up/out) SQL database
Web-based application with SQL database as backend
High Web traffic → large volume of transactions
More users → large amount of data
Solution 1: cache
E.g. memcached
Only handles read traffic
Solution 2: Scale up (vertically)
Add more resources to a single node
Solution 3: Scale out (horizontally)
Add more nodes to the DB system
Data distribution among multiple nodes is non-trivial
Scale out SQL database – Techniques and Challenges
Two Techniques
Replication
Sharding
Replication
Master-Slave
• Duplication facilitates read
• Data consistency problem arises
All writes are written to the master
All reads performed against the replicated slave databases
Critical reads may be incorrect as writes may not have been propagated down
Large data sets can pose problems as master needs to duplicate data to slaves
Peer-to-peer
Writes can happen at any node
Inconsistent writes (which can be persistent)
SQL and multi-node clusters do not go well together
Sharding (Partition the dataset to multiple nodes)
Scales well for both reads and writes
Not transparent, application needs to be partition-aware
Can no longer have relationships/joins across partitions
Loss of referential integrity across shards
• Partitioning facilitates writes/reads
• Transaction support across partitions is lost
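The partition-awareness that sharding pushes into the application can be sketched as a routing function. This is a minimal illustration only, assuming hash-based partitioning over a fixed number of shards; the class name `ShardRouter` and the in-memory maps standing in for database nodes are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: each "shard" is an in-memory map standing in for one database node.
class ShardRouter {
    private final List<Map<String, String>> shards = new ArrayList<>();

    ShardRouter(int shardCount) {
        for (int i = 0; i < shardCount; i++) {
            shards.add(new HashMap<>());
        }
    }

    // The application, not the database, decides which node owns a key.
    int shardFor(String key) {
        return Math.floorMod(key.hashCode(), shards.size());
    }

    void put(String key, String value) {
        shards.get(shardFor(key)).put(key, value);
    }

    String get(String key) {
        return shards.get(shardFor(key)).get(key);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(3);
        router.put("alice", "cart-1");
        router.put("bob", "cart-2");
        System.out.println("alice lives on shard " + router.shardFor("alice"));
        System.out.println("bob -> " + router.get("bob"));
    }
}
```

Because every read and write touches exactly one shard, single-key operations scale with the number of nodes, while a join or a transaction over keys on different shards has no single node that can enforce it.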
From SQL to NoSQL
Motivation II
Limits in Relational Data Model
Impedance Mismatch
Difference between in-memory data structures and relational model
Predefined schema
Join operation
Not appropriate for
Graph data
Geographical data
Unstructured data
From SQL to NoSQL
Google: Search Engine
Store billions of documents → Bigtable + Google File System
Amazon: Online Shopping
Shopping cart management → Dynamo
Foundation for later systems:
Open source DBMS: HBase + HDFS, Cassandra, Riak, Redis, MongoDB, …
Supported by many social media sites with large data needs
• facebook
• twitter
Cloud-hosted managed DBMS: Amazon DynamoDB, Amazon SimpleDB, …
Utilized by companies
• imdb
• startups..
What is NoSQL
Stands for "Not Only SQL": non-relational databases
Umbrella term for many different types of data stores (databases)
Different types of data model:
Key-value DB
Column Family DB
Document DB
Graph DB
Just as there are different programming languages, NoSQL provides different data storage tools in the toolbox
Polyglot Persistence
What is the Magic?
Logic Data Model: From Table to Aggregate
Diverse data models
Aggregate: Column Family, Key-value, Document
Graph
No schema predefined
Allows an attribute/field to be added at run-time
Still need to consider how to define “key” and “column family”
Giving up built-in joins
Physical Data Handling: From ACID to BASE (CAP theorem coming up)
No full transaction support
Support at Aggregate level
Support both replication and sharding (automatically)
Relax one or more of the ACID properties
NoSQL Database Classification
-- Data Model View
Key Value Store: Dynamo, Riak, Redis, Memcached (in memory)
Column Family: HBase (BigTable), Cassandra
Document: MongoDB, Terrastore
Graph: FlockDB, Neo4J
Transaction Review
ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that
guarantee that database transactions are processed reliably.
Atomicity: "all-or-nothing" proposition
Each work unit performed in a database must either complete in its
entirety or have no effect whatsoever
Consistency: conformance to database constraints
Each client, each transaction:
Can assume all constraints hold when transaction begins
Must guarantee all constraints hold when transaction ends
Isolation: serial equivalency
Operations may be interleaved, but execution must be equivalent to some
sequential (serial) order of all transactions.
Durability: durable storage.
If system crashes after transaction commits, all effects of transaction
remain in database
ACID and Transaction Support in Distributed
Environment
Recall -- scaling out database
Distributed environment with multiple nodes
Data are distributed across nodes via
Replication
Sharding
Concerns from Distributed Networking Environment
Message Loss/Delay
Network partition
Can the ACID properties still hold for a database in a distributed environment?
CAP theorem comes as a guideline
CAP Theorem
Start with three properties for distributed systems:
Consistency: All nodes see the same data at the same time
Availability: Every request to a non-failing node in the system returns a response about whether it was successful or failed.
Partition Tolerance: System properties (consistency and/or availability) hold even when the system is partitioned (communication lost)
[Figure: triangle with corners Consistency, Availability, Partition tolerance]
CAP Theorem
Consistency – Atomic data object
As in ACID
1. In a distributed environment, multiple copies of a data item may exist on different nodes.
2. Consistency requires that all operations towards a data item are executed as if they are performed over a single instance.
[Figure: clients accessing data item X, which is replicated as Copy 1 and Copy 2]
CAP Theorem
Availability – Available data object
1. Requests to data (read/write) always succeed.
2. All (non-failing) nodes remain able to read and write even when the network is partitioned.
A system that keeps some, but not all, of its nodes able to read and write is not Available in the CAP sense, even if it remains available to clients and satisfies its SLAs for high availability.
Reference: https://foundationdb.com/white-papers/the-cap-theorem
CAP Theorem
Partition Tolerance
1. The network will be allowed to lose arbitrarily many messages sent from one node to another.
2. When the network is partitioned, all messages from one component to another will get lost.
Under Partition Tolerance
Consistency requirement implies that every data operation will be atomic, even though arbitrary messages may get lost.
Availability requirement implies that every node receiving a request from a client must respond, even though arbitrary messages may get lost.
CAP Theorem
You can have at most two of these three properties for any shared-data system
Consistency, availability and partition tolerance
To scale out, you have to support partition tolerance
NoSQL: under a network partition, choose either consistency or availability
[Figure: CAP triangle. SQL sits on the Consistency/Availability side; NoSQL keeps Partition tolerance and picks one side, Consistency or Availability]
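The choice a partition-tolerant system faces can be illustrated with a deliberately oversimplified sketch. It assumes a single replica that knows whether it is cut off from its peers; the class `PartitionedReplica` and its `Mode` enum are hypothetical names, and real systems decide this with quorum or leader protocols rather than a boolean flag.

```java
// Hypothetical sketch of one replica choosing between consistency and availability.
class PartitionedReplica {
    enum Mode { CP, AP }

    private final Mode mode;
    private boolean partitioned = false;
    private String value = "v0";

    PartitionedReplica(Mode mode) {
        this.mode = mode;
    }

    void setPartitioned(boolean partitioned) {
        this.partitioned = partitioned;
    }

    // Returns true if the write was accepted by this replica.
    boolean write(String newValue) {
        if (partitioned && mode == Mode.CP) {
            // CP: peers cannot be reached, so refuse rather than diverge from them.
            return false;
        }
        // AP, or a healthy network: accept locally; under a partition the replicas may now differ.
        value = newValue;
        return true;
    }

    String read() {
        return value;
    }

    public static void main(String[] args) {
        PartitionedReplica cp = new PartitionedReplica(Mode.CP);
        PartitionedReplica ap = new PartitionedReplica(Mode.AP);
        cp.setPartitioned(true);
        ap.setPartitioned(true);
        System.out.println("CP accepts write: " + cp.write("v1"));
        System.out.println("AP accepts write: " + ap.write("v1"));
    }
}
```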
NoSQL Database Classification
-- View from CAP Theorem
Consistency + Availability: Relational: MySQL, PostgreSQL
Availability + Partition tolerance: Dynamo and its derivatives: Cassandra, Riak
Consistency + Partition tolerance: BigTable and its derivatives (HBase), plus Redis and MongoDB
More on Consistency
Question: in an AP system, if the consistency property cannot hold, what property can be claimed?
Example
Data item X is replicated on nodes M and N
Client A writes X to node N
Some period of time t elapses.
Client B reads X from node M
Does client B see the write from client A?
[Figure: Client A writes to node N (data item X, Copy 2); Client B reads from node M (data item X, Copy 1)]
From client’s perspective, two kinds of consistency:
Strong consistency (as C in CAP): any subsequent access is guaranteed to
return the updated value
Weak consistency: subsequent access is not guaranteed to return the updated
value
Inconsistency window: The period between the update and the moment when it is guaranteed that any observer will always see the updated value
Consistency is a continuum with tradeoffs
Multiple consistency models
Eventual Consistency
Eventual consistency
A specific form of weak consistency
When no updates occur for a long period of time, eventually all updates will propagate through the system and all the nodes will be consistent; all accesses will then return the last updated value.
For a given accepted update and a given node, eventually either the
update reaches the node or the node is removed from service
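The inconsistency window from the M/N example can be made concrete with a small simulation. It is a sketch under stated assumptions: two in-memory maps play the replicas, updates travel through a queue, and the class and method names (`EventualReplica`, `writeToN`, `readFromM`, `propagate`) are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch: node N accepts writes, node M serves reads, updates are shipped later.
class EventualReplica {
    private final Map<String, String> nodeN = new HashMap<>();
    private final Map<String, String> nodeM = new HashMap<>();
    private final Queue<String[]> pending = new ArrayDeque<>(); // updates not yet applied on M

    void writeToN(String key, String value) {
        nodeN.put(key, value);
        pending.add(new String[] {key, value});
    }

    String readFromM(String key) {
        return nodeM.get(key);
    }

    // Applies all queued updates to M in write order; this closes the inconsistency window.
    void propagate() {
        while (!pending.isEmpty()) {
            String[] update = pending.poll();
            nodeM.put(update[0], update[1]);
        }
    }

    public static void main(String[] args) {
        EventualReplica db = new EventualReplica();
        db.writeToN("X", "new");
        System.out.println("inside the window, B reads: " + db.readFromM("X"));
        db.propagate();
        System.out.println("after propagation, B reads: " + db.readFromM("X"));
    }
}
```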
Based on CAP
SQL: ACID
NoSQL: BASE (Basically Available, Soft state, Eventual consistency)
Basically Available - system seems to work all the time
Soft State - it doesn't have to be consistent all the time
Eventually Consistent - becomes consistent at some later time
References and Additional Reading for CAP
http://en.wikipedia.org/wiki/CAP_theorem
Formal proof for CAP theorem: Seth Gilbert and Nancy Lynch, "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services"
Graphical illustration of CAP theorem: http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
Recent post from Brewer: "CAP Twelve Years Later: How the 'Rules' Have Changed"
Werner Vogels, "Eventually Consistent"
BigTable and HBase
Introduction
BigTable Background
Development began in 2004 at Google (published 2006)
need to store/handle large amounts of (semi)-structured data
Many Google projects store data in BigTable
Google’s web crawl
Google Earth
Google Analytics
HBase Background
Open-source implementation of BigTable built on top of HDFS
Initial HBase prototype in 2007
Hadoop becomes an Apache top-level project and HBase becomes its subproject in 2008
Road Map
Database User/Application Developer: How to use?
(Logic) data model and CRUD operations
Database System Designer: How to design?
Under the hood: (Physical) data model and distribution algorithm
Database Designer: How to link application needs with database design
Schema design
Data Model
A sparse, distributed, persistent multidimensional sorted map
Map indexed by a row key, column key, and a timestamp
(row:string, column:string, time:int64) → uninterpreted byte array
Rows maintained in sorted lexicographic order based on row key
A row key is an arbitrary string
Every read or write of data under a single row is atomic.
Row ranges dynamically partitioned into tablets
Unit of distribution and load balancing
Applications can exploit this property for efficient row scans
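The "sorted map" view of rows can be approximated with the JDK's `TreeMap`. The following is a sketch, not HBase code: values are plain strings instead of uninterpreted byte arrays, timestamps are left out, and the class name `SortedRowMap` is hypothetical.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch: row key -> (column key -> value), both levels in lexicographic order.
class SortedRowMap {
    private final TreeMap<String, TreeMap<String, String>> rows = new TreeMap<>();

    void put(String row, String column, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>()).put(column, value);
    }

    String get(String row, String column) {
        Map<String, String> columns = rows.get(row);
        return columns == null ? null : columns.get(column);
    }

    // Row scan over [startRow, stopRow): adjacent row keys sit next to each other in the map.
    SortedMap<String, TreeMap<String, String>> scan(String startRow, String stopRow) {
        return rows.subMap(startRow, stopRow);
    }

    public static void main(String[] args) {
        SortedRowMap table = new SortedRowMap();
        table.put("dave", "info:name", "Dave");
        table.put("alice", "info:name", "Alice");
        table.put("carol", "info:name", "Carol");
        table.put("bob", "info:name", "Bob");
        System.out.println(table.scan("b", "d").keySet());
    }
}
```

Choosing row keys so that related rows share a prefix turns "fetch all related rows" into one contiguous range scan.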
Data Model
Columns grouped into column families
Column key = family:qualifier
Column family must be created before data can be stored in a column key.
Column families provide locality hints
Unbounded number of columns
Data Model
Timestamps
64-bit integers, assigned by:
Bigtable: real time in microseconds
Client application: when unique timestamps are a necessity
Items in a cell are stored in decreasing timestamp order.
Application specifies how many versions (n) of data items are maintained in a cell.
Bigtable garbage collects obsolete versions.
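Version handling inside one cell can be sketched with a map ordered by decreasing timestamp. This is an illustration under assumptions: the class `VersionedCell` is hypothetical, values are strings, and obsolete versions are dropped eagerly on write, whereas a real store garbage-collects them later.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch of one cell that keeps at most maxVersions timestamped values.
class VersionedCell {
    // Reverse ordering puts the newest timestamp first.
    private final TreeMap<Long, String> versions = new TreeMap<>(Collections.reverseOrder());
    private final int maxVersions;

    VersionedCell(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    void put(long timestamp, String value) {
        versions.put(timestamp, value);
        while (versions.size() > maxVersions) {
            versions.pollLastEntry(); // the last entry is the oldest version
        }
    }

    String latest() {
        return versions.isEmpty() ? null : versions.firstEntry().getValue();
    }

    // Returns up to n values, newest first.
    List<String> read(int n) {
        List<String> out = new ArrayList<>();
        for (String value : versions.values()) {
            if (out.size() == n) {
                break;
            }
            out.add(value);
        }
        return out;
    }

    public static void main(String[] args) {
        VersionedCell cell = new VersionedCell(2);
        cell.put(1L, "first");
        cell.put(3L, "third");
        cell.put(2L, "second");
        System.out.println(cell.read(5));
    }
}
```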
Data Model – MiniTwitter Example
View as a Map of Map
Operations & APIs in HBase
Create and delete tables and column families; Modify meta-data
Operations are based on row keys
Single-row operations: Put, Get, Delete
Multi-row operations: Scan, MultiPut
Atomic R-M-W (read-modify-write) sequences on data stored under a single row key (no support for transactions across multiple rows).
No built-in joins
Can be done in the application
Using scan() and get() operations
Using MapReduce
Creating a Table
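import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

// Older (pre-1.0) HBase client API, as used on the slide
Configuration config = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(config);
// Family names are written without a trailing ':' (later releases reject ':' in a family name)
HColumnDescriptor[] column = new HColumnDescriptor[2];
column[0] = new HColumnDescriptor("columnFamily1");
column[1] = new HColumnDescriptor("columnFamily2");
HTableDescriptor desc = new HTableDescriptor(Bytes.toBytes("MyTable"));
desc.addFamily(column[0]);
desc.addFamily(column[1]);
admin.createTable(desc);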
Altering a Table
Disable the table before changing the schema
Single-row operations: Put()
Insert a new record (with a new key), or
Insert a record for an existing key
Implicit version number (timestamp)
Explicit version number
Put() in MiniTwitter
Update information
Single-row operations: Get()
Given a key, return the corresponding record
For each value, return the highest version
Can control the number of versions you want
Get() in MiniTwitter
Single-row operations: Delete()
Marks table cells as deleted
Multiple levels
Can mark an entire column family as deleted
Can mark all column families of a given row as deleted
// Delete an entire row (all column families)
Delete d = new Delete(Bytes.toBytes("rowkey"));
userTable.delete(d);
// Delete all versions of one column (family "cf", qualifier "attr")
Delete d = new Delete(Bytes.toBytes("rowkey"));
d.deleteColumns(
Bytes.toBytes("cf"),
Bytes.toBytes("attr"));
userTable.delete(d);
Multi-row operations: Scan()
Road Map
Database User/Application Developer: How to use?
(Logic) data model and CRUD operations
Database System Designer: How to design?
Under the hood: (Physical) data model and distribution algorithm
Single Node
Write, Read, Delete
Distributed System
Database Designer: How to link application needs with database design
Schema design
Basics
Terms
BigTable | HBase
SSTable | HFile
memtable | MemStore
tablet | region
tablet server | RegionServer
HFile/SSTable
Basic building block of Bigtable
Persistent, ordered immutable map from keys to values
Sequence of blocks on disk plus an index for block lookup
Stored in GFS
Can be completely mapped into memory
Supported operations:
Look up value associated with key
Iterate key/value pairs within a key range
[Figure: an SSTable as a sequence of 64K blocks plus an index]
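The "blocks plus an index" layout can be sketched in memory. This is a simplified illustration, assuming blocks hold a fixed number of entries rather than 64K bytes and live in a list rather than on disk; the class name `MiniSSTable` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch: an immutable sorted file made of blocks, with an index on first keys.
class MiniSSTable {
    private final List<TreeMap<String, String>> blocks = new ArrayList<>();
    private final TreeMap<String, Integer> index = new TreeMap<>(); // first key of block -> block number

    // Built once from already-sorted data; there is no method to modify it afterwards.
    MiniSSTable(SortedMap<String, String> sorted, int entriesPerBlock) {
        TreeMap<String, String> current = null;
        for (Map.Entry<String, String> entry : sorted.entrySet()) {
            if (current == null || current.size() == entriesPerBlock) {
                current = new TreeMap<>();
                index.put(entry.getKey(), blocks.size());
                blocks.add(current);
            }
            current.put(entry.getKey(), entry.getValue());
        }
    }

    // Lookup: the index picks the only block that can hold the key, then that block is searched.
    String get(String key) {
        Map.Entry<String, Integer> slot = index.floorEntry(key);
        if (slot == null) {
            return null; // key sorts before the first block
        }
        return blocks.get(slot.getValue()).get(key);
    }

    int blockCount() {
        return blocks.size();
    }

    public static void main(String[] args) {
        TreeMap<String, String> data = new TreeMap<>();
        for (String k : new String[] {"a", "b", "c", "d", "e"}) {
            data.put(k, k.toUpperCase());
        }
        MiniSSTable table = new MiniSSTable(data, 2);
        System.out.println(table.blockCount() + " blocks, c -> " + table.get("c"));
    }
}
```

Only the small index needs to stay in memory; a lookup then reads a single block.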
HDFS: Hadoop Distributed File System
Client requests metadata about a file from the NameNode
Data is served directly from the DataNodes
File Read/Write in HDFS
File Read
1. open (HDFS client calls DistributedFileSystem)
2. get block locations (from the NameNode)
3. read (through FSDataInputStream)
4. read from the closest node
5. read from the 2nd closest node
6. close
File Write
1. create (HDFS client calls DistributedFileSystem)
2. create (on the NameNode)
3. write (through FSDataOutputStream)
4. get a list of 3 data nodes
5. write packet
6. ack packet
7. close
8. complete (reported to the NameNode)
[Figure: client JVM on the client node, NameNode on the name node, and DataNodes on the data nodes, connected by the numbered steps above]
If a data node crashes, the crashed node is removed, the current block receives a new ID so that the partial data on the crashed node can be deleted later, and the NameNode allocates another node.
HBase: Logical storage vs Physical storage
Region/Tablet
Dynamically partitioned range of rows
Built from multiple SSTables
Column-Family oriented storage
[Figure: a tablet covering rows Start:Alice00 to End:Dave11, built from two SSTables, each a sequence of 64K blocks plus an index]
Table (HTable)
Multiple tablets make up the table
The entire BigTable is split into tablets of contiguous ranges of rows
Approximately 100MB to 200MB each
Tablets are split as their size grows
SSTables can be shared
[Figure: an HTable made up of Tablet1 and Tablet2, with row keys Alice00, Dave11, Emily, and Darth marking the tablet ranges; the tablets are built from SSTables]
• Each column family is stored in a separate file
• Key & version numbers are replicated with each column family
• Empty cells are not stored
Source: Graphic from slides by Erik Paulson
Table to Region
Physical Storage: MiniTwitter Example
HTable
Write Path in HBase
HLog (append-only WAL on HDFS, one per RegionServer)
Read Path in HBase
Deletion and Compaction in HBase
Delete() will mark the record for deletion
A new “tombstone” record is written for that value
BigTable | HBase
Merging compaction | Minor compaction
Minor compaction | flush
Major compaction | Major compaction
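How flushes, tombstones, and major compaction interact can be sketched with an in-memory model. It is an illustration under assumptions, not HBase's implementation: the write-ahead log is omitted, files are sorted maps in a list, and the names `MiniStore`, `flush`, and `majorCompact` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch: a MemStore plus immutable flushed files, with tombstone-based deletes.
class MiniStore {
    // Compared by identity, so no user value can be mistaken for the marker.
    private static final String TOMBSTONE = new String("<deleted>");

    private TreeMap<String, String> memStore = new TreeMap<>();
    private final List<TreeMap<String, String>> files = new ArrayList<>(); // oldest first

    void put(String key, String value) {
        memStore.put(key, value);
    }

    // A delete is just another write: it records a tombstone for the key.
    void delete(String key) {
        memStore.put(key, TOMBSTONE);
    }

    // Flush: the MemStore becomes a new file and is never modified again.
    void flush() {
        if (!memStore.isEmpty()) {
            files.add(memStore);
            memStore = new TreeMap<>();
        }
    }

    // Read: newest data wins, so check the MemStore first, then files from newest to oldest.
    String get(String key) {
        String value = memStore.get(key);
        for (int i = files.size() - 1; value == null && i >= 0; i--) {
            value = files.get(i).get(key);
        }
        return value == TOMBSTONE ? null : value;
    }

    // Major compaction: merge all files into one and drop tombstones and shadowed values.
    void majorCompact() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> file : files) {
            merged.putAll(file); // later (newer) files overwrite older entries
        }
        merged.values().removeIf(v -> v == TOMBSTONE);
        files.clear();
        if (!merged.isEmpty()) {
            files.add(merged);
        }
    }

    int fileCount() {
        return files.size();
    }

    int storedEntries() {
        int total = 0;
        for (TreeMap<String, String> file : files) {
            total += file.size();
        }
        return total;
    }

    public static void main(String[] args) {
        MiniStore store = new MiniStore();
        store.put("row1", "hello");
        store.flush();
        store.delete("row1");
        store.flush();
        System.out.println("entries on disk before compaction: " + store.storedEntries());
        store.majorCompact();
        System.out.println("entries on disk after compaction: " + store.storedEntries());
    }
}
```

Until a major compaction runs, a deleted value still occupies space in its file and is only hidden by the newer tombstone.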
Announcement
Lab 1 Due
Lab 2 Release (team up)
Project team up
Quiz 1 graded
Data Distribution and Serving -- Big Picture
Placement of Tablets and Data Serving
A tablet is assigned to one tablet server at a time.
Metadata for tablet locations and start/end row are stored in a special Bigtable cell
Master maintains:
The set of live tablet servers,
Current assignment of tablets to tablet servers (including the unassigned ones)
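Finding the server for a row key from start-row metadata can be sketched as a floor lookup in a sorted map. This is an assumption-laden illustration: the class `TabletDirectory`, the server names, and keying the metadata by start row only are hypothetical simplifications of what the slide describes.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: tablet start row -> tablet server currently serving that tablet.
class TabletDirectory {
    private final TreeMap<String, String> startRowToServer = new TreeMap<>();

    // Each tablet is assigned to exactly one server at a time; reassigning overwrites.
    void assign(String startRow, String server) {
        startRowToServer.put(startRow, server);
    }

    // The owning tablet is the one with the greatest start row <= the requested row key.
    String locate(String rowKey) {
        Map.Entry<String, String> entry = startRowToServer.floorEntry(rowKey);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        TabletDirectory directory = new TabletDirectory();
        directory.assign("", "server-1");       // first tablet starts at the empty key
        directory.assign("Dave11", "server-2"); // second tablet starts at Dave11
        System.out.println("Alice00 -> " + directory.locate("Alice00"));
        System.out.println("Emily   -> " + directory.locate("Emily"));
    }
}
```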
RegionServer and DataNode
Interacting with HBase
HBase Schema Design
How many column families should the table have?
What data goes into what column family?
How many columns should be in each column family?
What should the column names be? Although column names don’t
have to be defined on table creation, you need to know them when
you write or read data.
What information should go into the cells?
How many versions should be stored for each cell?
What should the rowkey structure be, and what should it contain?
MiniTwitter Review
Read operation
Whom does TheFakeMT follow?
Does TheFakeMT follow TheRealMT?
Who follows TheFakeMT?
Does TheRealMT follow TheFakeMT?
Write operation
A user follows someone
A user unfollows someone
MiniTwitter – Version 1
Version 2
Read operation: How many people does a user follow?
Atomic operation!
Version 3
Get rid of the counter
Problem: row access overhead
Version 4
Wide table vs tall table
Version 4 – client code
Version 5
Trick with hash code
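The slide names the hash-code trick without giving details. One common reading, assumed here and not confirmed by the slide, is a tall-table row key built by concatenating fixed-length hashes of the follower and the followed user; the class `FollowRowKey` and the choice of MD5 are illustrative assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical sketch: row key = MD5(follower) + MD5(followed), always 32 bytes long.
class FollowRowKey {
    static byte[] md5(String text) {
        try {
            return MessageDigest.getInstance("MD5").digest(text.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    static byte[] rowKey(String follower, String followed) {
        byte[] left = md5(follower);
        byte[] right = md5(followed);
        byte[] key = Arrays.copyOf(left, left.length + right.length);
        System.arraycopy(right, 0, key, left.length, right.length);
        return key;
    }

    public static void main(String[] args) {
        byte[] key = rowKey("TheFakeMT", "TheRealMT");
        System.out.println("row key length: " + key.length);
    }
}
```

Under this assumption every key has the same length, all rows for one follower share a 16-byte prefix (so "whom does TheFakeMT follow?" is a prefix scan), and "does TheFakeMT follow TheRealMT?" is a single Get on a computed key.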
Normalization vs Denormalization
Course
CourseID | CourseName | Hour | Description
CS292 | Special Topics on Big Data | 3 | large-scale data processing
CS283 | Computer networks | 3 | Networking technology
ClassSchedule
ClassID | CourseID | Semester | InstructorID | Classroom | Time
2014CS292 | CS292 | S2014 | xuey1 | FGH134 | Tue/Th 1:10-2:25
2014CS283 | CS283 | S2014 | jmatt | FGH236 | Tue/Th 1:10-2:25
Registration
ClassID | StudentID | Grade
2014CS292 | balice1 | NULL
VandyUser
VUNetID | FirstName | LastName | Email
xuey1 | Yuan | Xue | Yuan.xue
balice1 | Alice | Burch | Alice.burch
ClassSchedule
eID | SectionID | Semester | InstructorID | Classroom | Time
2 | 01 | S2014 | xuey1 | FGH134 | Tue/Th 1:10-2:25
3 | 01 | S2014 | jmatt | FGH236 | Tue/Th 1:10-2:25