NoSQL DATABASE - COW :: Ceng On the Web
Introduction to NoSQL Databases
Chyngyz Omurov
Osman Tursun
CENG, Middle East Technical University
OUTLINE
• NoSQL Definition
• Motivation
• Data Store Introduction
-- Key-value Stores
-- Document Stores
-- Extensible Record Stores
-- New Relational Database
• Conclusion
NoSQL: The Name
• “SQL” = traditional relational DBMS.
• Experience teaches us:
Not every data management/analysis problem
is best solved using a traditional relational DBMS.
• “NoSQL” = “No SQL” =
not using a traditional relational DBMS
• “No SQL” ≠
“don’t use the SQL language”
• “NoSQL” = “Not only SQL”
RDBMS
A database management system (DBMS) provides data storage and access that is:
• Convenient
• Multi-user
• Safe
• Persistent
• Reliable
• Massive
• Efficient
RDBMS
Web apps have different needs (than the
apps that RDBMSs were designed for)
-- Low and predictable response time (latency)
-- Scalability & elasticity (at low cost)
-- High availability
-- Flexible schemas / semi-structured data
-- Geographic distribution (multiple datacenters)
Web apps can (usually) do without
-- Transactions / strong consistency / integrity
-- Complex queries
NoSQL System
No declarative query language: more programming
Relaxed consistency: fewer guarantees
NoSQL System
The idea behind NoSQL:
By giving up ACID constraints, one can achieve
much higher performance and scalability.
ACID = Atomicity, Consistency, Isolation, and
Durability
BASE = Basically Available, Soft state, Eventually
consistent
CAP Theorem
• A distributed system can guarantee at most two of
the following three properties: consistency,
availability, and partition tolerance.
New relational DBMS
• These newer SQL systems aim to provide horizontal scalability
without abandoning SQL and ACID transactions.
Types of NoSQL Databases
Objective:
• Understand/compare each type of NoSQL
database
• Discuss 1-2 NoSQL databases in each family
Systems Beyond our Scope
Some authors have used a broad definition of
NoSQL, including any DB system that is not
relational:
• Graph database systems
• Object-oriented database systems
• Distributed object-oriented stores
• Data-warehousing database systems
- complex queries
- read-only or read-mostly
ACID
Types of NoSQL Databases
• Key-value stores
• Document stores
• Extensible record stores
Types of NoSQL Databases
NoSQL systems generally have six key features:
1. the ability to horizontally scale "simple operation" throughput over many servers
2. the ability to replicate and distribute (partition) data over many servers
3. a simple call-level interface or protocol (in contrast to a SQL binding)
4. a weaker concurrency model (BASE) than the ACID transactions of most relational (SQL) database systems
5. efficient use of distributed indexes and RAM for data storage
6. the ability to dynamically add new attributes to data records
Types of NoSQL Databases
• NoSQL systems differ mainly in their data model
• Specific implementations differ in their persistence
mechanism and additional functionality:
-- Replication
-- Versioning
-- Locking
-- Transactions
-- etc.
Key-Value Stores
• Global collection of key/value pairs
• Inspired by Amazon’s Dynamo and
distributed hash tables
• Operations
-- void Put(string key, byte[] data);
-- byte[] Get(string key);
-- void Remove(string key);
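The three operations above can be sketched with a small in-memory store. This is only an illustration of the interface, not how any real key-value store is implemented; the class name KeyValueStore and the sample key are assumptions made for this example.

```python
from typing import Dict, Optional


class KeyValueStore:
    """In-memory stand-in for a key-value store (illustrative only)."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        # Store or overwrite the opaque value under the key.
        self._data[key] = data

    def get(self, key: str) -> Optional[bytes]:
        # Return the value, or None when the key is absent.
        return self._data.get(key)

    def remove(self, key: str) -> None:
        # Removing a missing key is a no-op in this sketch.
        self._data.pop(key, None)


store = KeyValueStore()
store.put("user:42", b'{"name": "Bob"}')
print(store.get("user:42"))
store.remove("user:42")
print(store.get("user:42"))
```

The store never looks inside the value: it is an opaque byte string, which is why queries other than lookup by key are not available.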
Key-Value Stores: Examples
Project Voldemort
• Advanced key-value store
• Created by LinkedIn, now open source
• Written in Java
• Provides MVCC
• Asynchronous replication
• Sharding + Consistent Hashing
• Automatic failure detection and recovery
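The "Sharding + Consistent Hashing" point can be illustrated with a minimal hash ring. This is a generic sketch of the technique, not Voldemort's actual implementation; the node names, the number of virtual nodes, and the use of MD5 are assumptions chosen for the example.

```python
import bisect
import hashlib
from typing import List, Tuple


def _token(value: str) -> int:
    # Map a string to a position on the ring.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes: List[str], vnodes: int = 64) -> None:
        # Each physical node owns several positions on the ring.
        self._ring: List[Tuple[int, str]] = sorted(
            (_token(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._tokens = [token for token, _ in self._ring]

    def node_for(self, key: str) -> str:
        # A key belongs to the first ring position after its token,
        # wrapping around at the end of the ring.
        idx = bisect.bisect_right(self._tokens, _token(key)) % len(self._ring)
        return self._ring[idx][1]


keys = [f"key{i}" for i in range(1000)]
before = HashRing(["node-a", "node-b", "node-c"])
after = HashRing(["node-a", "node-b", "node-c", "node-d"])
moved = sum(before.node_for(k) != after.node_for(k) for k in keys)
print(f"{moved} of {len(keys)} keys moved after adding a node")
```

When a node is added, only the keys that now fall on the new node's ring positions change owner; with plain "hash modulo number of nodes" sharding, most keys would move.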
Project Voldemort
Operations:
value = store.get(key)
store.put(key, value)
store.delete(key)
Pros? & Cons?
Document Stores: Document?
• What is a document?
-- Semi-structured data
-- Encapsulates and encodes data (or information)
in some standard format or encoding
• Encodings:
-- XML
-- YAML
-- JSON
-- BSON
-- Binary forms: PDF, Microsoft Office documents, etc.
Document Stores: Document?
• Documents are like rows or records in relational
databases, BUT a row follows a schema, while a document has no schema:
FirstName:"Bob", Address:"5 Oak St.", Hobby:"sailing"
FirstName:"Jonathan", Address:"15 Wanamassa Point Road", Children:[{Name:"Michael",Age:10}, {Name:"Jennifer", Age:8}, {Name:"Samantha", Age:5}, {Name:"Elena", Age:2}]
Document Stores
• Similar to key-value stores, but with major
differences:
-- The value is a document
-- Secondary indexes are generally supported
• Flexible schema
-- Any number of fields can be added
-- Multiple types of documents (objects) and nested
documents or lists
• Documents stored in JSON or Binary JSON
(BSON)
• No ACID guarantees
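A minimal sketch of the two differences from key-value stores: the value is a document with arbitrary fields, and a secondary index allows lookups by a field rather than by key. The class DocumentStore and its method names are hypothetical; real document stores expose richer query interfaces.

```python
from collections import defaultdict
from typing import Any, Dict, List, Set


class DocumentStore:
    """Schema-free document store with one secondary index (illustrative only)."""

    def __init__(self, indexed_field: str) -> None:
        self._docs: Dict[str, Dict[str, Any]] = {}
        self._indexed_field = indexed_field
        self._index: Dict[Any, Set[str]] = defaultdict(set)

    def put(self, doc_id: str, doc: Dict[str, Any]) -> None:
        # Drop the stale index entry when a document is overwritten.
        old = self._docs.get(doc_id)
        if old is not None and self._indexed_field in old:
            self._index[old[self._indexed_field]].discard(doc_id)
        self._docs[doc_id] = doc
        # Documents lacking the indexed field are simply not indexed.
        if self._indexed_field in doc:
            self._index[doc[self._indexed_field]].add(doc_id)

    def find_by_index(self, value: Any) -> List[Dict[str, Any]]:
        return [self._docs[i] for i in sorted(self._index.get(value, ()))]


store = DocumentStore(indexed_field="FirstName")
# The two documents have different fields; no schema is declared anywhere.
store.put("1", {"FirstName": "Bob", "Address": "5 Oak St.", "Hobby": "sailing"})
store.put("2", {
    "FirstName": "Jonathan",
    "Address": "15 Wanamassa Point Road",
    "Children": [{"Name": "Michael", "Age": 10}, {"Name": "Jennifer", "Age": 8}],
})
print(store.find_by_index("Jonathan")[0]["Children"][0]["Name"])
```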
Document Stores: Examples
Terrastore
CouchDB
• Apache project since 2008
• Schema-free, document-oriented database
-- Documents are stored in JSON format
-- Supports secondary indexes
-- B-tree storage engine
-- MVCC model, no locking
-- No joins, no PK/FK
• Incremental replication
CouchDB
• REST API
CRUD     HTTP     Params
Create   PUT      /db/docid
Read     GET      /db/docid
Update   PUT      /db/docid (with the current revision)
Delete   DELETE   /db/docid
• Libraries for various languages that convert
native API calls into the RESTful calls
-- Java, C, PHP, etc.
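As a rough sketch of what such a client library does, the function below builds (but does not send) the HTTP request for each CRUD operation. The helper name build_request, the database name, and the localhost URL are assumptions for illustration. Note that CouchDB's HTTP API updates an existing document with PUT plus the document's current revision, and a delete also has to name the revision.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

# Assumption: a local CouchDB instance on its default port.
BASE_URL = "http://localhost:5984"


def build_request(op: str, db: str, doc_id: str,
                  doc: Optional[Dict[str, Any]] = None,
                  rev: Optional[str] = None) -> urllib.request.Request:
    """Translate a CRUD call into an HTTP request (hypothetical helper)."""
    url = f"{BASE_URL}/{db}/{doc_id}"
    if op in ("create", "update"):
        body = dict(doc or {})
        if op == "update":
            # An update must carry the revision it is based on.
            body["_rev"] = rev
        return urllib.request.Request(
            url,
            data=json.dumps(body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="PUT",
        )
    if op == "read":
        return urllib.request.Request(url, method="GET")
    if op == "delete":
        return urllib.request.Request(f"{url}?rev={rev}", method="DELETE")
    raise ValueError(f"unknown operation: {op}")


req = build_request("create", "people", "bob", {"FirstName": "Bob"})
print(req.get_method(), req.full_url)
# Sending it would be: urllib.request.urlopen(req), given a running server.
```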
CouchDB: Views
• Views
-- Filter, sort, “join”, aggregate, report
-- Map/Reduce based
-- K/V pairs from Map/Reduce are also stored in
the B-tree engine
-- Built on demand
-- Can be materialized & incrementally updated
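The Map/Reduce idea behind views can be simulated in a few lines. CouchDB itself typically expresses map and reduce functions in JavaScript and stores the results in its B-tree; the Python below is only a simulation of the data flow, and the function names map_children, reduce_sum, and build_view are made up for this sketch.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple

Doc = Dict[str, Any]


def map_children(doc: Doc) -> Iterator[Tuple[str, int]]:
    # Emit one (key, value) pair per child; documents without children emit nothing.
    for _child in doc.get("Children", []):
        yield doc["FirstName"], 1


def reduce_sum(values: List[int]) -> int:
    return sum(values)


def build_view(docs: Iterable[Doc],
               map_fn: Callable[[Doc], Iterator[Tuple[str, int]]],
               reduce_fn: Callable[[List[int]], int]) -> Dict[str, int]:
    # Group the emitted pairs by key, then reduce each group.
    grouped: Dict[str, List[int]] = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in sorted(grouped.items())}


docs = [
    {"FirstName": "Bob", "Hobby": "sailing"},
    {"FirstName": "Jonathan", "Children": [
        {"Name": "Michael", "Age": 10}, {"Name": "Jennifer", "Age": 8},
        {"Name": "Samantha", "Age": 5}, {"Name": "Elena", "Age": 2}]},
]
view = build_view(docs, map_children, reduce_sum)
print(view)
```

The view here counts children per parent; the map step filters (Bob emits nothing) and the reduce step aggregates.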
CouchDB: Local Consistency
• CouchDB uses Multi-Version Concurrency
Control (MVCC)
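A minimal sketch of the MVCC idea: every write must name the revision it was based on, and a write based on a stale revision is rejected instead of being serialized with a lock. Integer revisions and the names MvccStore and ConflictError are simplifications assumed for this example; CouchDB's real revision identifiers are strings.

```python
from typing import Any, Dict, Optional, Tuple


class ConflictError(Exception):
    """Raised when a write is based on an outdated revision."""


class MvccStore:
    """Optimistic, revision-checked document store (illustrative only)."""

    def __init__(self) -> None:
        self._docs: Dict[str, Tuple[int, Dict[str, Any]]] = {}

    def get(self, doc_id: str) -> Tuple[int, Dict[str, Any]]:
        rev, doc = self._docs[doc_id]
        return rev, dict(doc)

    def put(self, doc_id: str, doc: Dict[str, Any], rev: Optional[int] = None) -> int:
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        # No lock is taken: the writer either saw the latest revision or loses.
        if rev != current_rev:
            raise ConflictError(f"{doc_id}: expected rev {current_rev}, got {rev}")
        new_rev = (current_rev or 0) + 1
        self._docs[doc_id] = (new_rev, dict(doc))
        return new_rev


db = MvccStore()
db.put("bob", {"Hobby": "sailing"})
rev_a, doc_a = db.get("bob")  # writer A reads
rev_b, doc_b = db.get("bob")  # writer B reads the same revision
db.put("bob", {"Hobby": "rowing"}, rev=rev_a)  # A wins
try:
    db.put("bob", {"Hobby": "chess"}, rev=rev_b)  # B is now stale
except ConflictError as exc:
    print("conflict:", exc)
```

Readers are never blocked, because a reader always sees a complete committed revision; the losing writer has to re-read and retry.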
CouchDB: “Global” Consistency
• Incremental Replication
Extensible record stores
• Extensible record stores are also called column
stores.
-- Each key is associated with multiple
attributes (i.e. columns)
-- Hybrid row/column stores
-- Inspired by Google’s BigTable
-- Examples: HBase, Cassandra
Column: HBase
Based on Google’s BigTable
Apache top-level project (TLP)
Cloudera (certification, EC2 AMIs, etc.)
Layered over HDFS (Hadoop Distributed File
System)
Input/output for MapReduce jobs
APIs
-- Thrift, REST
Column: HBase
Automatic Partitioning
Automatic re-balancing/re-partitioning
Fault tolerant
-- HDFS
-- Multiple replicas
Highly distributed
Column: Cassandra
Created at Facebook for Inbox search
Facebook → Google Code → ASF
Commercial support available from Riptano
Features taken from both Dynamo and
BigTable
-- Dynamo: consistent hashing, partitioning,
replication
-- BigTable: column families, MemTables,
SSTables
Column: Cassandra
Symmetric nodes
-- No single point of failure
-- Linearly scalable
-- Ease of administration
Flexible/automated provisioning
Flexible replica placement
High availability
-- Eventual consistency
-- However, consistency is tunable
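"Tunable" consistency is commonly explained with the quorum rule from Dynamo-style systems: with N replicas, a read that contacts R replicas is guaranteed to overlap the latest acknowledged write on W replicas whenever R + W > N. The helper below is a sketch of that arithmetic only; the function name is an assumption for this example.

```python
def read_sees_latest_write(n: int, r: int, w: int) -> bool:
    """True when every read quorum overlaps every write quorum."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


# Three replicas, quorum reads and writes: overlapping, so reads are up to date.
print(read_sees_latest_write(n=3, r=2, w=2))
# Three replicas, single-replica reads and writes: fast, but only eventually consistent.
print(read_sees_latest_write(n=3, r=1, w=1))
```

Lower R and W values trade consistency for latency and availability, which is the tuning knob the slide refers to.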
Column: Cassandra
Partitioning
-- Random
---- Good distribution of data between nodes
---- Range scans not possible
-- Order preserving
---- Can lead to unbalanced nodes
---- Range scans, natural order
Extremely fast reads/writes (low latency)
Thrift API
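The trade-off between the two partitioning schemes can be sketched by laying out the same keys by hash token and by natural order. The sample keys and helper names are assumptions for illustration; MD5 stands in for the hash used by a random partitioner.

```python
import bisect
import hashlib
from typing import List


def random_token(key: str) -> int:
    # Random partitioning: placement depends on a hash of the key.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def range_scan(sorted_keys: List[str], start: str, end: str) -> List[str]:
    # With an order-preserving layout, a key range is one contiguous slice.
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, end)
    return sorted_keys[lo:hi]


keys = ["apple", "banana", "cherry", "date"]
ordered_layout = sorted(keys)                    # order preserving
random_layout = sorted(keys, key=random_token)   # random (hash order)

print(range_scan(ordered_layout, "b", "d"))
print(random_layout)
```

In the hash-ordered layout, neighbouring keys end up at unrelated positions, so a range query would have to visit every node; in exchange, hashing spreads keys evenly even when the key values themselves are skewed.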
Column: Cassandra
Column
-- Basic unit of storage
Column Family
-- Collection of like records
-- Record-level atomicity
-- Indexed
Keyspace
--Top level namespace
--Usually one per application
Column: Cassandra
Column details
-- Name
---- byte[]
---- Queried against
---- Determines sort order
-- Value
---- byte[]
---- Opaque to Cassandra
-- Timestamp
---- long
---- Conflict resolution (last write wins)
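The column structure and the "last write wins" rule can be sketched as follows. The Column dataclass and the resolve function are illustrative names, not Cassandra's API, and the tie-breaking choice (keep the first argument on equal timestamps) is a simplification assumed for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: bytes      # queried against; determines sort order
    value: bytes     # opaque to the store
    timestamp: int   # supplied by the client, used for conflict resolution


def resolve(a: Column, b: Column) -> Column:
    """Last write wins: keep the version with the newer timestamp."""
    if a.name != b.name:
        raise ValueError("can only resolve versions of the same column")
    return a if a.timestamp >= b.timestamp else b


replica_1 = Column(b"email", b"bob@old.example", timestamp=1000)
replica_2 = Column(b"email", b"bob@new.example", timestamp=2000)
print(resolve(replica_1, replica_2).value)
```

Because the winner is decided purely by timestamp, replicas can reconcile without coordination, but clients with skewed clocks can silently overwrite newer data.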
Column-oriented NoSQL
Name | Producer | Data Model | Querying
BigTable | Google | Set of couples (key, values) | Selection (by combination of row, column, and time stamp ranges)
HBase | Apache | Groups of columns (a BigTable clone) | JRuby IRB-based shell (similar to SQL)
Hypertable | Hypertable | Like BigTable | HQL (Hypertable Query Language)
Cassandra | Apache | Columns, groups of columns corresponding to a key (supercolumns) | Simple selection on key, range queries, column or column ranges
PNUTS | Yahoo | (Hashed or ordered) tables, typed arrays, flexible schema | Selection and projection from a single table (retrieve an arbitrary single record by primary key, range queries, complex predicates, ordering, top-k)
Scalable Relational Systems
• Also called NewSQL
• SQL
• ACID
• Performance and scalability through modern
innovative software architecture
Scalable Relational Systems
An RDBMS can provide scalability if applications:
-- Use small-scope operations
-- Use small-scope transactions
MySQL Cluster
• Shared-nothing cluster
• NDB storage engine (replaces InnoDB)
• Replication (two-phase commit, 2PC)
• Horizontal data partitioning
VoltDB
CONCLUSION: NoSQL pros/cons
• Advantages
-- Massive scalability
-- High availability
-- Lower cost (than competitive solutions at that scale)
-- (Usually) predictable elasticity
-- Schema flexibility, sparse & semi-structured data
• Disadvantages
-- Limited query capabilities (so far)
-- Eventual consistency is not intuitive to program for
---- Makes client applications more complicated
-- No standardization
---- Portability might be an issue
CONCLUSION
• For now
-- NoSQL databases still lag far behind advanced
database technologies
-- NoSQL will not replace traditional relational
DBMSs
• NoSQL databases are good for specialized applications
involving large, unstructured, distributed data
with high scaling requirements
References
• Cattell, R.: Scalable SQL and NoSQL data stores
http://dl.acm.org/citation.cfm?id=1978919
• Pokorný, J.: NoSQL Databases: a step to database scalability in Web environment
http://dl.acm.org/citation.cfm?id=2095583&dl=ACM&coll=DL&CFID=90098443&CFTOKEN=64346810
• http://couchdb.apache.org/
• http://project-voldemort.com/
• http://cassandra.apache.org/
• http://hbase.apache.org/