NoSQL DATABASE - COW :: Ceng On the Web

Introduction to NoSQL Databases
Chyngyz Omurov
Osman Tursun
Ceng, Middle East Technical University
OUTLINE
• NoSQL Definition
• Motivation
• Data Store Introduction
-- Key-value Stores
-- Document Stores
-- Extensible Record Stores
-- New Relational Databases
• Conclusion
NoSQL: The Name
• “SQL” = traditional relational DBMS.
• Experience teaches us:
Not every data management/analysis problem
is best solved using a traditional relational DBMS.
• “NoSQL” = “No SQL” =
not using a traditional relational DBMS
• “No SQL” ≠
“don’t use the SQL language”
NoSQL: The Name
Not every data management/analysis problem
is best solved using a traditional relational DBMS.
• “NoSQL” = “Not only SQL”
RDBMS
A database management system (DBMS) provides:
• Convenient
• Multi-user
• Safe
• Persistent
• Reliable
• Massive
• Efficient
RDBMS
Web apps have different needs (than the
apps that RDBMSs were designed for)
-- Low and predictable response time (latency)
-- Scalability & elasticity (at low cost)
-- High availability
-- Flexible schemas / semi-structured data
-- Geographic distribution (multiple datacenters)
Web apps can (usually) do without
-- Transactions / strong consistency / integrity
-- Complex queries
NoSQL System
• No declarative query language: more programming
• Relaxed consistency: fewer guarantees
NoSQL System
The idea behind NoSQL:
By giving up ACID constraints, one can achieve
much higher performance and scalability.
ACID = Atomicity, Consistency, Isolation, and
Durability
BASE = Basically Available, Soft state, Eventually
consistent.
CAP Theorem
• A distributed system can have only two of
the following three properties: consistency,
availability, and partition tolerance.
New relational DBMS
• These newer SQL systems provide horizontal scalability
without abandoning SQL and ACID transactions.
Types of NoSQL Databases
Objective:
• Understand/compare each type of NoSQL
database
• Discuss 1-2 NoSQL databases in each family
Systems Beyond our Scope
Some authors have used a broad definition of
NoSQL, including any DB system that is not
relational:
• Graph database systems
• Object-oriented database systems
• Distributed object-oriented stores
• Data-warehousing database systems
- complex queries
- read-only or read-mostly
ACID
Types of NoSQL Databases
• Key-value stores
• Document stores
• Extensible record stores
Types of NoSQL Databases
NoSQL systems generally have six key features:
1. the ability to horizontally scale "simple
operation" throughput over many servers
2. the ability to replicate and distribute (partition)
data over many servers
3. a simple call level interface or protocol (in
contrast to a SQL binding)
4. a weaker concurrency model than ACID
transactions of most relational (SQL) database
systems (BASE)
5. efficient use of distributed indexes and RAM
for data storage, and
6. the ability to dynamically add new attributes to
data records
Types of NoSQL Databases
• NoSQL systems differ mainly in their data model
• Specific implementations differ in their persistence
mechanism and additional functionality:
-- Replication
-- Versioning
-- Locking
-- Transactions
-- etc.
Key-Value Stores
• Global collection of key/value pairs
• Inspired by Amazon’s Dynamo and
distributed hash tables
• Operations
-- void Put(string key, byte[] data);
-- byte[] Get(string key);
-- void Remove(string key);
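The three operations above can be sketched as a toy, single-node, in-memory store. This is only an illustration of the interface: the class name InMemoryKeyValueStore and the sample key are hypothetical, and a real key-value store adds persistence, partitioning, and replication.

```python
from typing import Dict, Optional


class InMemoryKeyValueStore:
    """Toy single-node store exposing put/get/remove (hypothetical name)."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        # The value is opaque to the store: just bytes under a key.
        self._data[key] = data

    def get(self, key: str) -> Optional[bytes]:
        # Returns None when the key is absent.
        return self._data.get(key)

    def remove(self, key: str) -> None:
        # Removing a missing key is a no-op in this sketch.
        self._data.pop(key, None)


store = InMemoryKeyValueStore()
store.put("user:42", b'{"name": "Bob"}')
print(store.get("user:42"))
store.remove("user:42")
print(store.get("user:42"))
```

Note that the store cannot query inside the value; lookups are by key only.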
Key-Value Stores: Examples
Project Voldemort
• Advanced key-value store
• Created by LinkedIn, now open source
• Written in Java
• Provides MVCC
• Asynchronous replication
• Sharding + Consistent Hashing
• Automatic failure detection and recovery
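The "Sharding + Consistent Hashing" bullet can be illustrated with a minimal hash ring. This is a generic sketch of the technique, not Voldemort's actual implementation; the class name, the use of MD5, and the number of virtual nodes are all assumptions made for the example.

```python
import bisect
import hashlib
from typing import Dict, List


def _hash(value: str) -> int:
    # Map a string to a point on the ring (MD5 is an arbitrary choice here).
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    """Each node owns several points on a ring; a key goes to the next point clockwise."""

    def __init__(self, nodes: List[str], vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._ring: Dict[int, str] = {}
        self._points: List[int] = []
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        for i in range(self._vnodes):
            point = _hash(f"{node}#{i}")
            self._ring[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node: str) -> None:
        for i in range(self._vnodes):
            point = _hash(f"{node}#{i}")
            del self._ring[point]
            self._points.remove(point)

    def node_for(self, key: str) -> str:
        # First ring point after the key's hash, wrapping around at the end.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._ring[self._points[idx]]


ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))
```

The point of the design is that removing or adding one node only moves the keys that node owned; keys on the other nodes stay where they are.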
Project Voldemort
Operations:
value = store.get(key)
store.put(key, value)
store.delete(key)
Pros? & Cons?
Document Stores: Document?
• What is a document?
-- Semi-structured data
-- Encapsulates and encodes data (or information)
in some standard format or encoding
-- Encodings:
---- XML
---- YAML
---- JSON
---- BSON
---- Binary forms: PDF, Microsoft Office documents, etc.
Document Stores: Document?
• Documents are like rows or records in relational
databases, BUT
-- Row: schema
-- Document: no schema
• Example documents:
{FirstName:"Bob", Address:"5 Oak St.", Hobby:"sailing"}
{FirstName:"Jonathan", Address:"15 Wanamassa Point Road",
Children:[{Name:"Michael", Age:10}, {Name:"Jennifer", Age:8},
{Name:"Samantha", Age:5}, {Name:"Elena", Age:2}]}
Document Stores
• Similar to key-value stores, but with major
differences:
-- The value is a document
-- Secondary indexes are generally supported
• Flexible schema
-- Any number of fields can be added
-- Multiple types of documents (objects), nested
documents, and lists
• Documents are stored in JSON or binary JSON
(BSON)
• No ACID properties
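A rough sketch of how a document store differs from a plain key-value store: the value is a structured document with no fixed schema, and a secondary index allows lookups by a field other than the key. The class TinyDocumentStore and its methods are hypothetical, and the indexed field is assumed to hold hashable values; the sample documents are the two from the earlier slide.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional, Set


class TinyDocumentStore:
    """Schema-free documents plus one secondary index (hypothetical sketch)."""

    def __init__(self, indexed_field: str) -> None:
        self._docs: Dict[str, Dict[str, Any]] = {}
        self._indexed_field = indexed_field
        self._index: Dict[Any, Set[str]] = defaultdict(set)

    def put(self, doc_id: str, doc: Dict[str, Any]) -> None:
        old = self._docs.get(doc_id)
        if old is not None and self._indexed_field in old:
            # Drop the stale index entry before overwriting the document.
            self._index[old[self._indexed_field]].discard(doc_id)
        self._docs[doc_id] = doc
        if self._indexed_field in doc:
            self._index[doc[self._indexed_field]].add(doc_id)

    def get(self, doc_id: str) -> Optional[Dict[str, Any]]:
        return self._docs.get(doc_id)

    def find(self, value: Any) -> List[Dict[str, Any]]:
        # Secondary-index lookup: documents whose indexed field equals value.
        return [self._docs[i] for i in sorted(self._index.get(value, ()))]


people = TinyDocumentStore(indexed_field="FirstName")
people.put("p1", {"FirstName": "Bob", "Address": "5 Oak St.", "Hobby": "sailing"})
people.put("p2", {
    "FirstName": "Jonathan",
    "Address": "15 Wanamassa Point Road",
    "Children": [{"Name": "Michael", "Age": 10}, {"Name": "Jennifer", "Age": 8}],
})
print(people.find("Bob"))
```

The two documents have different fields and one contains a nested list, yet both live in the same collection.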
Document Stores: Examples
Terrastore (project hosted on Google Code)
CouchDB
• Apache project since 2008
• Schema-free, document-oriented database
-- Documents are stored in JSON format
-- Supports secondary indexes
-- B-tree storage engine
-- MVCC model, no locking
-- No joins, no PK/FK
• Incremental replication
CouchDB
• REST API
CRUD   | HTTP   | Params
Create | PUT    | /db/docid
Read   | GET    | /db/docid
Update | PUT    | /db/docid (with the current _rev)
Delete | DELETE | /db/docid
• Libraries for various languages convert
native API calls into RESTful calls
-- Java, C, PHP, etc.
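A minimal sketch of what such a client library does: it turns a CRUD call into an HTTP request. The helper build_request, the database name, and the base URL (a local CouchDB on its default port 5984) are assumptions for illustration. In CouchDB's HTTP API an update is a PUT that carries the document's current revision, and a delete passes the revision as a query parameter. The requests are only constructed here, not sent.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, Optional

BASE_URL = "http://localhost:5984"  # assumption: local CouchDB, default port


def build_request(op: str, db: str, doc_id: str,
                  doc: Optional[Dict[str, Any]] = None,
                  rev: Optional[str] = None) -> urllib.request.Request:
    """Map a CRUD operation onto an HTTP request (hypothetical helper)."""
    url = f"{BASE_URL}/{db}/{doc_id}"
    headers = {"Content-Type": "application/json"}
    if op == "create":
        body = json.dumps(doc or {}).encode("utf-8")
        return urllib.request.Request(url, data=body, headers=headers, method="PUT")
    if op == "read":
        return urllib.request.Request(url, method="GET")
    if op == "update":
        # The current revision travels with the document body.
        body = json.dumps(dict(doc or {}, _rev=rev)).encode("utf-8")
        return urllib.request.Request(url, data=body, headers=headers, method="PUT")
    if op == "delete":
        query = urllib.parse.urlencode({"rev": rev})
        return urllib.request.Request(f"{url}?{query}", method="DELETE")
    raise ValueError(f"unknown operation: {op}")


req = build_request("create", "people", "bob", {"FirstName": "Bob"})
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req) (requires a running server).
```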
CouchDB: Views
• Views
-- Filter, sort, “join”, aggregate, report
-- Map/Reduce based
-- K/V pairs from Map/Reduce are also stored in
the B-tree engine
-- Built on demand
-- Can be materialized & incrementally updated
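The Map/Reduce idea behind views can be sketched as follows. CouchDB views are normally written in JavaScript and stored in design documents; this Python version only imitates the data flow (map emits key/value pairs, rows are sorted by key, reduce aggregates per key). The function names and sample documents are hypothetical.

```python
from itertools import groupby
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple


def map_hobby(doc: Dict[str, Any]) -> Iterator[Tuple[str, int]]:
    # Emit one (key, value) row per document that has a Hobby field.
    if "Hobby" in doc:
        yield (doc["Hobby"], 1)


def reduce_count(values: List[int]) -> int:
    return sum(values)


def build_view(docs: Iterable[Dict[str, Any]],
               map_fn: Callable[[Dict[str, Any]], Iterator[Tuple[str, int]]],
               reduce_fn: Callable[[List[int]], int]) -> Dict[str, int]:
    # Rows are kept sorted by key, similar in spirit to storing them in a B-tree.
    rows = sorted(pair for doc in docs for pair in map_fn(doc))
    return {
        key: reduce_fn([value for _, value in group])
        for key, group in groupby(rows, key=lambda row: row[0])
    }


docs = [
    {"FirstName": "Bob", "Hobby": "sailing"},
    {"FirstName": "Ann", "Hobby": "chess"},
    {"FirstName": "Eve", "Hobby": "sailing"},
    {"FirstName": "Jonathan"},  # no Hobby field: the map function skips it
]
print(build_view(docs, map_hobby, reduce_count))
```

This sketch rebuilds the whole view on every call; an incrementally updated view would only re-run the map function on documents that changed.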
CouchDB: Local Consistency
• CouchDB uses Multi-Version Concurrency
Control (MVCC)
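A simplified sketch of the multi-version idea: every write creates a new version instead of overwriting in place, readers can still see older versions, and a writer must name the revision it read or the write is rejected as a conflict. The class VersionedStore and its integer revisions are assumptions for illustration (CouchDB's real revision identifiers are strings, and it does not keep old versions forever).

```python
from typing import Any, Dict, List, Optional, Tuple


class ConflictError(Exception):
    """Raised when a write is based on a stale revision."""


class VersionedStore:
    """Keeps every version of each document; no locks are taken (hypothetical sketch)."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[Dict[str, Any]]] = {}

    def get(self, doc_id: str, rev: Optional[int] = None) -> Optional[Tuple[int, Dict[str, Any]]]:
        versions = self._versions.get(doc_id)
        if not versions:
            return None
        rev = len(versions) if rev is None else rev
        if not 1 <= rev <= len(versions):
            return None
        return rev, versions[rev - 1]

    def put(self, doc_id: str, doc: Dict[str, Any], rev: Optional[int] = None) -> int:
        versions = self._versions.setdefault(doc_id, [])
        current_rev = len(versions) or None
        if rev != current_rev:
            # Someone else wrote in between: the caller must re-read and retry.
            raise ConflictError(f"expected rev {current_rev}, got {rev}")
        versions.append(dict(doc))
        return len(versions)


store = VersionedStore()
rev1 = store.put("bob", {"Hobby": "sailing"})
rev2 = store.put("bob", {"Hobby": "chess"}, rev=rev1)
print(store.get("bob"), store.get("bob", rev=rev1))
```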
CouchDB: “Global” Consistency
• Incremental Replication
Extensible record stores
• Extensible record stores are also called column
stores.
-- Each key is associated with multiple
attributes (i.e., columns)
-- Hybrid row/column stores
-- Inspired by Google’s BigTable
-- Examples: HBase, Cassandra
Column: HBase
• Based on Google’s BigTable
• Apache top-level project (TLP)
• Cloudera (certification, EC2 AMIs, etc.)
• Layered over HDFS (Hadoop Distributed File
System)
• Input/output for MapReduce jobs
• APIs
-- Thrift, REST
Column: HBase
• Automatic partitioning
• Automatic re-balancing/re-partitioning
• Fault tolerant
-- HDFS
---- Multiple replicas
• Highly distributed
Column: Cassandra
• Created at Facebook for Inbox Search
• Facebook → Google Code → ASF
• Commercial support available from Riptano
• Features taken from both Dynamo and BigTable
-- Dynamo: consistent hashing, partitioning,
replication
-- BigTable: column families, MemTables,
SSTables
Column: Cassandra
• Symmetric nodes
-- No single point of failure
-- Linearly scalable
-- Ease of administration
• Flexible/automated provisioning
• Flexible replica placement
• High availability
-- Eventual consistency
-- However, consistency is tunable
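What "tunable" means can be shown with the usual quorum arithmetic from Dynamo-style systems: with N replicas, a read that contacts R replicas and a write that waits for W replicas are guaranteed to overlap in at least one replica when R + W > N. The function names below are hypothetical, and the sketch ignores details such as hinted handoff and read repair.

```python
def quorum(n: int) -> int:
    # Smallest majority of n replicas.
    return n // 2 + 1


def reads_see_latest_write(n: int, r: int, w: int) -> bool:
    # Read and write sets must share at least one replica.
    return r + w > n


N = 3
print(quorum(N))                                          # majority of 3 replicas
print(reads_see_latest_write(N, r=quorum(N), w=quorum(N)))  # quorum reads + quorum writes
print(reads_see_latest_write(N, r=1, w=1))                  # fastest, only eventually consistent
```

Lower R and W values give lower latency and higher availability; higher values give stronger consistency, which is the trade-off the slide refers to.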
Column: Cassandra
• Partitioning
-- Random
---- Good distribution of data between nodes
---- Range scans not possible
-- Order preserving
---- Can lead to unbalanced nodes
---- Range scans, natural order
• Extremely fast reads/writes (low latency)
• Thrift API
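The trade-off between the two partitioning schemes can be sketched with two placement functions. Both are simplified assumptions: the node names and split points are made up, and the hash-modulo placement stands in for Cassandra's token ring rather than reproducing it.

```python
import bisect
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster

# Order-preserving: node-a holds keys < "h", node-b holds "h" <= key < "p", node-c the rest.
SPLIT_POINTS = ["h", "p"]


def random_partition(key: str) -> str:
    # Hashing destroys key order, so neighbouring keys land on unrelated nodes.
    digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return NODES[digest % len(NODES)]


def ordered_partition(key: str) -> str:
    # Keys keep their natural order, so a range scan touches contiguous nodes.
    return NODES[bisect.bisect_right(SPLIT_POINTS, key)]


for key in ["alice", "bob", "carol", "zoe"]:
    print(key, random_partition(key), ordered_partition(key))
```

With the order-preserving scheme, the keys "alice", "bob", and "carol" all fall on the same node: convenient for range scans, but a skewed key distribution overloads that node.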
Column: Cassandra
• Column
-- Basic unit of storage
• Column family
-- Collection of like records
-- Record-level atomicity
-- Indexed
• Keyspace
-- Top-level namespace
-- Usually one per application
Column: Cassandra
• Column details
-- Name
---- byte[]
---- Queried against
---- Determines sort order
-- Value
---- byte[]
---- Opaque to Cassandra
-- Timestamp
---- long
---- Conflict resolution (last write wins)
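The column structure and the "last write wins" rule can be sketched as below. The Column dataclass and resolve function are hypothetical names; ties between equal timestamps are resolved arbitrarily here (the first argument is kept), which is a simplification.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Column:
    name: bytes      # queried against; determines sort order within a row
    value: bytes     # opaque to the store
    timestamp: int   # supplied by the client, used for conflict resolution


def resolve(a: Column, b: Column) -> Column:
    # Last write wins: keep the version with the higher timestamp.
    return a if a.timestamp >= b.timestamp else b


def sorted_row(columns: List[Column]) -> List[Column]:
    # Columns of a row are kept ordered by name.
    return sorted(columns, key=lambda c: c.name)


old = Column(b"email", b"bob@old.example", timestamp=100)
new = Column(b"email", b"bob@new.example", timestamp=200)
print(resolve(old, new).value)
```

Because the timestamp comes from the client, clock skew between clients can make an older write win.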
Column-oriented NoSQL
Name | Producer | Data model | Querying
BigTable | Google | Set of couples (key, values) | Selection (by combination of row, column, and time stamp ranges)
HBase | Apache | Groups of columns (a BigTable clone) | JRuby IRB-based shell (similar to SQL)
Hypertable | Hypertable | Like BigTable | HQL (Hypertable Query Language)
Cassandra | Apache | Columns, groups of columns corresponding to a key (supercolumns) | Simple selection on key, range queries, column or column ranges
PNUTS | Yahoo | (Hashed or ordered) tables, typed arrays, flexible schema | Selection and projection from a single table (retrieve an arbitrary single record by primary key, range queries, complex predicates, ordering, top-k)
Scalable Relational Systems
• Also called NewSQL
• SQL
• ACID
• Performance and scalability through modern
innovative software architecture
Scalable Relational Systems
An RDBMS can provide scalability if applications:
-- Use small-scope operations
-- Use small-scope transactions
MySQL Cluster
• Shared-nothing cluster
• NDB storage engine (replaces InnoDB)
• Replication (2PC)
• Horizontal data partitioning
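The "2PC" in the replication bullet stands for two-phase commit. A generic sketch of the protocol follows: the coordinator first asks every participant to prepare, and commits only if all vote yes. The Participant class and its fields are hypothetical, and the sketch leaves out timeouts, logging, and crash recovery, so it is not a description of how MySQL Cluster implements it.

```python
from typing import List


class Participant:
    """One replica or partition taking part in a transaction (hypothetical)."""

    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self) -> bool:
        # Phase 1: vote yes only if the local work can be made durable.
        self.state = "prepared" if self.can_commit else "refused"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: List[Participant]) -> bool:
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit everywhere only on a unanimous yes, otherwise abort everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False


replicas = [Participant("r1"), Participant("r2")]
print(two_phase_commit(replicas), [p.state for p in replicas])
```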
VoltDB
CONCLUSION: NoSQL pros/cons
• Advantages
-- Massive scalability
-- High availability
-- Lower cost (than competitive solutions at that scale)
-- (Usually) predictable elasticity
-- Schema flexibility, sparse & semi-structured data
• Disadvantages
-- Limited query capabilities (so far)
-- Eventual consistency is not intuitive to program for
---- Makes client applications more complicated
-- No standardization
---- Portability might be an issue
CONCLUSION
• For now
-- NoSQL databases are still far from advanced
database technologies
-- NoSQL will not replace traditional relational
DBMSs
• NoSQL databases are good for specialized applications
involving large, unstructured, distributed data
with high scaling requirements
References
• Cattell, R.: Scalable SQL and NoSQL data stores
-- http://dl.acm.org/citation.cfm?id=1978919
• Pokorný, J.: NoSQL Databases: a step to database scalability in
Web environment
-- http://dl.acm.org/citation.cfm?id=2095583&dl=ACM&coll=DL&CFID=90098443&CFTOKEN=64346810
• http://couchdb.apache.org/
• http://project-voldemort.com/
• http://cassandra.apache.org/
• http://hbase.apache.org/