What is NoSQL?
NoSQL
by Michael Britton, Mark McGregor, and Sam Howard
Simplicity, Speed, Scalability
What is NoSQL?
● Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable
● The term “NoSQL” is actually misleading. A more appropriate term is “Not Only SQL”
Origins
● 1998 - Carlo Strozzi
● Still used Relational model
● More accurately called “NoRel”
● 2009 - Eric Evans and Johan Oskarsson
● Organized event to discuss open-source distributed databases
● Originally a term to label Non-ACID databases
● Meant to be a Twitter hashtag but went viral and stuck
Why NoSQL
What You Are Giving Up With NoSQL
● Relationships between entities are basically nonexistent
● Limited ACID transactions
● No standard language for queries (SQL)
● Less structured
RDBMS Vs. NoSQL
RDBMS
● Structured and organized data
● Structured Query Language (SQL)
● Data and its relationships stored in separate tables
● Data Manipulation Language, Data Definition Language
● Tight consistency
● ACID transactions
NoSQL
● No declarative query language
● No predefined schema
● Key-value pair storage, column store, document store, graph databases
● Eventual consistency rather than ACID properties
● Unstructured and unpredictable data
● CAP Theorem
● Prioritizes high performance, high availability and scalability
● BASE transactions
SQL VS NoSQL Queries
NoSQL Query:
SQL Query:
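The two query examples on this slide were images and are not in the transcript. As a stand-in, here is a minimal sketch of the contrast, assuming a hypothetical `users` data set with `name` and `age` fields. The SQL side uses Python's built-in `sqlite3`. The NoSQL side uses a toy `find` helper written for this illustration (it is not a real driver API); it only mimics the document-store style in which the query is itself a data structure.

```python
import sqlite3

# Hypothetical sample data, used for both styles of query.
USERS = [
    {"name": "Ada", "age": 36},
    {"name": "Linus", "age": 28},
    {"name": "Grace", "age": 45},
]

# SQL style: a declarative query string against a table with a fixed schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (:name, :age)", USERS)
sql_rows = conn.execute(
    "SELECT name FROM users WHERE age > ? ORDER BY name", (30,)
).fetchall()
sql_names = [row[0] for row in sql_rows]


def find(collection, query):
    """Toy document query: supports equality and a "$gt" operator only."""
    def matches(doc):
        for field, cond in query.items():
            if isinstance(cond, dict):
                if "$gt" in cond and not (field in doc and doc[field] > cond["$gt"]):
                    return False
            elif doc.get(field) != cond:
                return False
        return True
    return [doc for doc in collection if matches(doc)]


# Document style: the query is a data structure, not a SQL string.
doc_names = sorted(doc["name"] for doc in find(USERS, {"age": {"$gt": 30}}))

print(sql_names)
print(doc_names)
```

Both queries select the same users; the difference is in how the question is expressed and in whether a schema has to exist first.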
NoSQL vs. MySQL
● MySQL > 50 GB Data
● Writes Average: ~300 ms
● Reads Average: ~350 ms
● Cassandra > 50 GB Data
● Writes Average: 0.12 ms
● Reads Average: 15 ms
NoSQL Pros/Cons
Pros
● High Scalability
● Distributed Computing
● Lower Cost
● Schema Flexibility, Semi-Structured Data
● No Complicated Relationships
Cons
● No Standardization
● Limited query capabilities
● Eventually consistent model is not intuitive to program for
● Non-Relational: The concept of joining tables together by relations is non-existent.
● Distributed: A network of interconnected computers, controlled by a central Database Management System
● Open-Source: Anyone can make changes to the original source code.
● Horizontally Scalable: Using multiple computers as one unit to increase productivity
Non-Relational
● Relational databases join tables together using Primary Key / Foreign Key relationships
● Non-Relational databases have no such structure
● Items are aggregated into one file, much like a giant Excel spreadsheet
● Prone to data duplication
● Difficult to update records
Distributed
● Non-relational databases can easily be spread out over multiple machines on the same network
● Each machine in the distributed network can carry information most relevant to its area
● Controlled by the DDBMS - Distributed Database Management System
Open-Source
● Source code is generally available to the public
● Improve the software as needed
● Share with the community
Horizontally Scalable
[Figure: horizontal scaling vs. vertical scaling]
Other Important Terms
● Denormalization - adding redundant data or grouping data in order to optimize read performance and improve scalability
● does NOT mean that the data was never normalized
● Denormalization should ideally take place after 3NF has been
achieved
● Constraints are used to ensure that redundant copies of data are
synchronized
● Materialized View - a database object that contains the results of a query.
● query result is cached but can be updated from the original query as
necessary
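A materialized view can be sketched as a stored copy of a query result that is rebuilt on demand. The example below is only an illustration under stated assumptions: it uses Python's built-in `sqlite3`, which has no native materialized views, so the "view" is emulated as an ordinary table. The table names `orders` and `customer_totals` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 10), ("bob", 5), ("alice", 7)],
)


def refresh_totals(conn):
    """Rebuild the cached result of the aggregate query."""
    conn.execute("DROP TABLE IF EXISTS customer_totals")
    conn.execute(
        "CREATE TABLE customer_totals AS "
        "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer"
    )


def read_totals(conn):
    """Read the cached table without re-running the aggregation."""
    return dict(conn.execute("SELECT customer, total FROM customer_totals"))


refresh_totals(conn)
before = read_totals(conn)

# A new write is not visible in the cached result until the next refresh.
conn.execute("INSERT INTO orders VALUES ('bob', 20)")
stale = read_totals(conn)
refresh_totals(conn)
after = read_totals(conn)
```

Reads from the cached table are cheap, at the price of possibly stale results between refreshes. Denormalized data in general makes the same trade.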
Other Important Terms
● Keyspace - object that holds together all column families of a design
● outermost grouping of data in the datastore
● resembles a schema in an RDBMS
● Column Family - tuple (pair) consisting of a key-value pair, where the key is mapped to a value that is a set of columns
● object that contains columns of related data
● resembles a table in an RDBMS
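The nesting described above (keyspace, then column family, then row key, then columns) can be pictured with plain nested dictionaries. This is only a mental model with made-up names such as `users` and `user:1`, not how any particular database stores data on disk.

```python
# keyspace -> column family -> row key -> {column name: value}
keyspace = {
    "users": {  # column family, roughly a table
        "user:1": {"name": "Ada", "email": "ada@example.com"},
        "user:2": {"name": "Linus"},  # rows need not share the same columns
    },
    "tweets": {
        "tweet:9": {"author": "user:1", "body": "hello"},
    },
}


def get_column(keyspace, family, row_key, column, default=None):
    """Look up one column value, returning `default` if any level is missing."""
    return keyspace.get(family, {}).get(row_key, {}).get(column, default)


print(get_column(keyspace, "users", "user:1", "name"))
print(get_column(keyspace, "users", "user:2", "email"))
```

The second lookup returns the default because that row simply has no `email` column; no schema forces every row to have one.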
Other Important Terms
● Super Column Family - tuple (pair) consisting of a key-value pair, where the key is mapped to a value that is a set of column families
● similar to a view in an RDBMS
● Column (data store) - tuple (triplet) consisting of a unique name, a value, and a timestamp
● the timestamp distinguishes old data from new data
● not to be confused with a standard relational database column
● lowest-level object in a keyspace
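To make the role of the timestamp concrete, here is a minimal sketch of a column as a (name, value, timestamp) triplet, with a helper that keeps the newer of two versions. The `Column` class and `reconcile` function are illustrative names invented here, and "newest timestamp wins" is an assumed resolution rule, not a description of any specific product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    value: str
    timestamp: int  # e.g. microseconds since the epoch, supplied by the writer


def reconcile(a: Column, b: Column) -> Column:
    """Return the newer of two versions of the same column."""
    if a.name != b.name:
        raise ValueError("can only reconcile versions of the same column")
    return a if a.timestamp >= b.timestamp else b


old = Column("email", "old@example.com", timestamp=100)
new = Column("email", "new@example.com", timestamp=200)
print(reconcile(old, new).value)
```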
Other Important Terms
● Database Shard - a horizontal partition in a database or search engine. Each partition is a separate shard.
● shards can be distributed to separate hardware, reducing the number of rows in each table
● differs from plain horizontal partitioning, which splits one or more tables by rows within a single schema or database server
● Sharding - the process of forming shards within the distributed database system.
● traditionally done by hand coding
● auto-sharding code is highly sought after
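A minimal sketch of hash-based sharding follows. The `ShardedStore` class and `shard_for` function are hypothetical, and each shard is just an in-memory dictionary standing in for a separate server. Placement here is a simple hash modulo the shard count, which is one common scheme among several.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Key-value store whose rows are split across several shards."""

    def __init__(self, num_shards: int):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(num_shards=4)
for i in range(100):
    store.put(f"user:{i}", i)
print([len(shard) for shard in store.shards])  # rows per shard
```

With modulo placement, changing the number of shards moves most keys to a different shard, which is the problem consistent hashing is meant to reduce.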
Other Important Terms
● Consistent Hashing - special hashing in which, when the hash table is resized, only K / n keys need to be remapped on average
● K is the number of keys
● n is the number of slots
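The remapping property can be sketched with a small hash ring. This is an illustrative implementation under assumptions, not the code of any particular database: the `HashRing` class, the use of SHA-256, and the choice of 100 virtual points per node are all choices made for this example.

```python
import bisect
import hashlib


def _position(value: str) -> int:
    """Place a string on the ring using a stable hash."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._hashes = []  # sorted positions on the ring
        self._owners = []  # _owners[i] is the node that owns _hashes[i]
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each node gets several virtual points so load spreads more evenly.
        for i in range(self.vnodes):
            h = _position(f"{node}#{i}")
            idx = bisect.bisect(self._hashes, h)
            self._hashes.insert(idx, h)
            self._owners.insert(idx, node)

    def node_for(self, key):
        if not self._hashes:
            raise LookupError("ring has no nodes")
        # The first point clockwise from the key's position owns the key.
        idx = bisect.bisect(self._hashes, _position(key)) % len(self._hashes)
        return self._owners[idx]


ring = HashRing(["a", "b", "c"])
keys = [f"key{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}
ring.add_node("d")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved")
```

Every key that moves goes to the newly added node; keys owned by the other nodes stay where they were, so only a fraction of the data has to be transferred.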
All your BASE are belonging to NoSQL
● A BASE system gives up on strict (immediate) consistency.
● Basically Available indicates that the system guarantees availability.
● Soft state indicates that the state of the system may change over time, even without input.
● Eventual consistency indicates that the system will become consistent over time, given that the system doesn’t receive input during that time.
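Eventual consistency can be illustrated with a toy set of replicas that accept writes independently and later exchange state. This is a deliberately simplified sketch: the `Replica` class and `anti_entropy` function are invented for this example, and "highest timestamp wins" is an assumed conflict rule.

```python
class Replica:
    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def anti_entropy(replicas):
    """Exchange state so every replica holds the newest version of each key."""
    merged = {}
    for replica in replicas:
        for key, (ts, value) in replica.data.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    for replica in replicas:
        replica.data = dict(merged)


a, b, c = Replica(), Replica(), Replica()
a.write("x", "v1", timestamp=1)
b.write("x", "v2", timestamp=2)

# Before synchronization the replicas disagree (soft state).
reads_before = [r.read("x") for r in (a, b, c)]

# With no new writes arriving, one exchange makes them converge.
anti_entropy([a, b, c])
reads_after = [r.read("x") for r in (a, b, c)]
print(reads_before, reads_after)
```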
CAP Theorem (Brewer’s Theorem)
● There are three basic requirements which exist in a special relation when
designing for a distributed architecture.
● Consistency ‘C’ - the data in the database remains consistent after the
execution of the operation
● Availability ‘A’ - the system is always on, no downtime.
● Partition Tolerance ‘P’ - the system continues to function even if the
communication among the servers is unreliable.
CAP Theorem Cont.
● The CAP theorem states that a distributed system can satisfy at most 2 of the 3 requirements at the same time. Current NoSQL databases follow different combinations of C, A, and P.
● CA - Single site cluster, therefore all the nodes are always in contact.
● CP - Some data may not be accessible, but the rest is still
consistent/accurate
● AP - System is still available under partitioning, but some of the data
may be inaccurate.
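The CP and AP trade-offs can be sketched with a toy two-replica store. This is a deliberately simplified model written for illustration (the `TwoNodeStore` class and `Unavailable` exception are hypothetical): during a partition, the CP variant refuses requests rather than risk inconsistent answers, while the AP variant keeps answering and lets the replicas diverge.

```python
class Unavailable(Exception):
    """Raised when the CP variant refuses to answer during a partition."""


class TwoNodeStore:
    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = [{"x": "old"}, {"x": "old"}]
        self.partitioned = False  # True when the replicas cannot talk

    def write(self, node, key, value):
        if self.partitioned:
            if self.mode == "CP":
                raise Unavailable("write rejected: replicas cannot be kept in sync")
            self.replicas[node][key] = value  # AP: accept locally, replicas diverge
        else:
            for replica in self.replicas:
                replica[key] = value

    def read(self, node, key):
        if self.partitioned and self.mode == "CP":
            raise Unavailable("read rejected: value may be stale")
        return self.replicas[node][key]


ap = TwoNodeStore("AP")
ap.partitioned = True
ap.write(0, "x", "new")
print(ap.read(0, "x"), ap.read(1, "x"))  # the two sides now disagree

cp = TwoNodeStore("CP")
cp.partitioned = True
try:
    cp.read(0, "x")
except Unavailable as exc:
    print("CP store:", exc)
```

Real systems are more nuanced, for example by letting the side of the partition that holds a majority keep serving requests.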
Challenges of NoSQL
● Maturity - In comparison RDBMS systems have been around for a
long time. Most NoSQL alternatives are in pre-production versions
with many key features yet to be implemented.
● Support - Most NoSQL systems are Open Source projects, and the
companies that offer support are small start-ups without global
reach, support services, or the credibility of Oracle, Microsoft, or
IBM.
Challenges of NoSQL
● Analytics and Business Intelligence - NoSQL databases have evolved to meet the scaling demands of Web 2.0 applications, so most of their features target those applications rather than ad-hoc query and analysis.
● Administration - The design goal for NoSQL is to provide a zero-admin solution, but as of today it requires a lot of skill to install and a lot of effort to maintain.
● Expertise - Almost all NoSQL developers are still learning how to use and develop for NoSQL
Advantages of NoSQL
● Elastic Scaling - NoSQL databases are designed to expand
transparently to take advantage of new nodes, and they are usually
designed with low-cost commodity hardware in mind.
● Big Data - The volumes of data that can be handled by NoSQL
systems are greater than what can be handled by the biggest
RDBMS.
● No DBA - NoSQL databases are designed from the ground up to require less management: automatic repair, data distribution, and simpler data models lead to lower administration and tuning requirements.
Advantages of NoSQL
● Economic - NoSQL databases typically use clusters of cheap commodity servers to manage the ever-expanding amount of data and transactions.
● Flexible Data Models - NoSQL databases have more relaxed data
model restrictions. Key Value stores and document databases
allow the application to store virtually any structure it wants in a
data element.
Taxonomy (Data Models)
Key-value stores are the simplest NoSQL databases. Every
single item in the database is stored as an attribute name (or
"key"), together with its value. Examples of key-value stores
are Riak and Voldemort. Some key-value stores, such as
Redis, allow each value to have a type, such as "integer",
which adds functionality
Document databases pair each key with a complex data
structure known as a document. Documents can contain many
different key-value pairs, or key-array pairs, or even nested
documents.
Graph stores are used to store information about networks,
such as social connections. Graph stores include Neo4J and
HyperGraphDB.
Column stores such as Cassandra and HBase are optimized
for queries over large datasets, and store columns of data
together, instead of rows.
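The four data models can be compared by writing the same small data set in each shape. The structures below are plain Python dictionaries and lists used as a rough picture, with made-up keys such as `user:1`; they are not the storage format of any of the products named above.

```python
# Key-value store: an opaque value looked up by key.
key_value = {"user:1:name": "Ada", "user:2:name": "Linus"}

# Document store: the value is a nested, self-describing document.
documents = {
    "user:1": {"name": "Ada", "langs": ["Python", "C"], "address": {"city": "London"}},
    "user:2": {"name": "Linus"},
}

# Column store: values of one column are kept together, instead of whole rows.
columns = {
    "name": {"user:1": "Ada", "user:2": "Linus"},
    "city": {"user:1": "London"},
}

# Graph store: nodes with properties plus explicit edges between them.
nodes = {"user:1": {"name": "Ada"}, "user:2": {"name": "Linus"}}
edges = [("user:1", "FOLLOWS", "user:2")]


def followed_by(user_id):
    """Names of the users that `user_id` follows, found by walking edges."""
    return [
        nodes[dst]["name"]
        for src, rel, dst in edges
        if src == user_id and rel == "FOLLOWS"
    ]


print(key_value["user:1:name"])
print(documents["user:1"]["address"]["city"])
print(list(columns["name"].values()))
print(followed_by("user:1"))
```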
Key-Value stores
● Examples-Tokyo Cabinet/Tyrant,
Redis, Voldemort, Oracle BDB
● Typical Application- Content
caching (Focus on scaling to
huge amounts of data, designed
to handle massive load), logging,
etc.
● Strengths- Fast Lookups
● Weaknesses- Stored data has no
schema
Oracle Embraces NoSQL
● Distributed key-value database
● Designed to provide highly reliable, scalable, and available data storage
across a configurable set of systems that function as storage nodes
● Data is stored as key-value pairs, which are written to particular storage
node(s), based on the hashed value of the primary key.
● Storage nodes are replicated to ensure high availability, rapid failover in
the event of a node failure and optimal load balancing of queries.
● Customer applications are written using an easy-to-use Java/C API to read
and write data.
Oracle Embraces NoSQL
● Utilizes storage nodes
● More storage nodes provide greater throughput
● Storage Node Agent (SNA) monitors each node's behavior
● Replication nodes work in groups to serve the same data
● Replication factor of 3
● Single-master architecture
● Master node replicates to replication nodes
● Election system elects new master in case of failure
Column Stores
● Examples-Cassandra, HBase,
Riak
● Typical applications-Distributed
file systems
● Data model-Columns → column
families
● Strengths-Fast lookups, good
distributed storage of data
● Weaknesses-Very low-level API
Apache Cassandra Project
● Scalability and high availability without compromising performance
● Uses column indexes
● Denormalization
● Materialized Views
● Built-in caching
Apache Cassandra Project
● Used in over 1500 companies with large, active data sets
● Largest cluster has 300 TB of data on over 400 machines
● Replication across multiple data centers allows failed nodes to be replaced
with no downtime
● Every node is identical, allowing no single point of failure
● Users can choose between synchronous and asynchronous replication
Document Databases
● Examples-CouchDB, MongoDB
● Typical applications-Web applications (Similar to Key-Value stores, but the DB knows what the Value is)
● Data model-Collections of Key-Value collections
● Strengths-Tolerant of incomplete data
● Weaknesses-Query performance, no standard query syntax
Hu - MongoDB - us
● Stores data in the form of BSON (Binary JSON) documents with dynamic schemas, making the integration of data in certain types of applications easy and fast.
● Most talked about NoSQL DBMS technology because it features auto-sharding, replication, schema-less design, scalability, and more.
Hu - MongoDB - us
● Full indexing support - index on any attribute
● Replicable - mirror across WAN and LAN
● Auto Sharding
● Document-based querying
● Flexible aggregation
● GridFS allows for storage of data files larger than BSON allows
Graph Databases
● Graph databases store data as graphs to easily represent data
● Graphs record data in nodes with properties
● Nodes can have unlimited properties, but are generally broken up into multiple nodes
● Useful for answering questions based on related information
Neo4J
● Highly Scalable
● Fully ACID
● Intuitive graphical models
● Custom disk-based native storage engine
● Massively scalable, with potential for BILLIONS of nodes
● Highly available
Neo4J
● Expressive, powerful, human-readable graph query language
● EX:
MATCH (a:Actor { name:"Keanu Reeves" })
RETURN a
Other NoSQL DBMS Products Cont.
● CouchDB - stores data as a collection of documents. Each document is a set of ‘keys’ and corresponding ‘values’. CouchDB supports indices, queries, and views. It uses JSON to store data, JavaScript as its query language using MapReduce, and HTTP for the API.
● Redis - An in-memory key-value data store. Mostly used as a caching mechanism because it stores data in RAM, making it extremely fast when retrieving data. It is a data structure server and not a replacement for a traditional database. Used in combination with products like MySQL to deliver high performance when data needs to be delivered rapidly.
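The "cache in front of a database" arrangement described for Redis is often implemented as a cache-aside read. The sketch below shows the pattern only, under loud assumptions: a plain dictionary stands in for Redis and an in-memory `sqlite3` database stands in for MySQL, so that the example runs without any server. The `products` table and `get_product_name` function are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for MySQL
db.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO products VALUES (1, 'widget')")

cache = {}  # stand-in for Redis
stats = {"db_reads": 0}  # counts how often the slower database is hit


def get_product_name(product_id):
    """Cache-aside read: try the cache first, fall back to the database."""
    key = f"product:{product_id}:name"
    if key in cache:
        return cache[key]
    stats["db_reads"] += 1
    row = db.execute(
        "SELECT name FROM products WHERE id = ?", (product_id,)
    ).fetchone()
    if row is None:
        return None
    cache[key] = row[0]  # populate the cache for later reads
    return row[0]


first = get_product_name(1)   # served from the database, then cached
second = get_product_name(1)  # served from the cache
print(first, second, stats["db_reads"])
```

A real deployment would also need an expiry or invalidation policy so that cached values do not outlive changes in the database.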
Other NoSQL DBMS Products Cont.
● Hadoop - An open-source framework, written in Java, that supports data-intensive distributed applications. Supports applications running on large clusters of computers and allows analyzing data spread across many different computers. Designed to scale up from single servers to thousands of machines.
● There are currently 150 different NoSQL databases
Companies That Implement NoSQL
● Google - BIGTABLE
● Facebook - CASSANDRA
● Mozilla - HBASE
● Adobe - HBASE
● Foursquare - MongoDB
● LinkedIn - VOLDEMORT
● Digg - REDIS
● Twitter - HADOOP, PIG, CASSANDRA
Questions?
Tough!
Sources:
● http://nosql-database.org/
● http://www.ignoredbydinosaurs.com/2013/05/explaining-non-relational-databases-my-mom
● http://en.wikipedia.org/wiki/NoSQL
● http://greendatacenterconference.com/blog/the-five-key-advantages-and-disadvantages-of-nosql
● http://www.tutorialindustry.com/nosql-tutorial-for-beginners
● http://www.techrepublic.com/blog/10-things/10-things-you-should-know-about-nosql-databases
● http://readwrite.com/2011/10/24/oracle-formally-embraces-nosql#awesm=~oCvdI8zKkJmAiZ
● http://www.oracle.com/technetwork/database/database-technologies/nosqldb/overview/index.html
● http://cassandra.apache.org/
● http://www.neo4j.org/learn/nosql
● http://www.w3resource.com/mongodb/nosql.php
● http://architects.dzone.com/articles/putting-nosql-perspective