The CIO's Guide to NoSQL

Dan McCreary
July 7th, 2011
Version 4
Agenda
• Historical Context
• The Business Case for NoSQL
• Terminology
• How NoSQL is Different
• Key NoSQL Products
• Call to Action: The NoSQL Pilot Project
• The Future of NoSQL
Copyright Kelly-McCreary & Associates, LLC
Background for Dan McCreary
• Bell Labs
• NeXT Computer (Steve Jobs)
• Owner of Custom Object-Oriented Software Consultancy
• Federal data integration (National Information Exchange Model)
• Native XML/XQuery – 2006
• Advocate of NoSQL/XRX systems
NoSQL Training Areas
Tracks: Managers; Architects/Project Managers; Developer
Courses: The CIO's Guide to NoSQL (You Are Here); Architectural Tradeoff Modeling; Functional Programming; Project Manager's Guide to NoSQL; MapReduce; Hadoop; Transitioning to NoSQL; XQuery
Sample of NoSQL Jargon
Indexing
Document orientation
B-Tree
Schema free
Configurable durability
MapReduce
Documents for archives
Functional programming
Horizontal scaling
Document Transformation
Sharding and auto-sharding
Document Indexing and Search
Brewer's CAP Theorem
Alternate Query Languages
Consistency
Aggregates
OLAP
Reliability
XQuery
Partition tolerance
MDX
Single-point-of-failure
RDF
SPARQL
Object-Relational mapping
Architecture Tradeoff Modeling
Key-value stores
ATAM
Column stores
Document-stores
Memcached
Note that within the context of NoSQL, many of these terms have different meanings!
Selecting a Database…
"Selecting the right data storage solution is
no longer a trivial task."
Start: Does it look like a document?
Yes: Use Microsoft Office
No: Use the RDBMS
Stop
Pressures on SQL Only Systems
SQL
• Scalability
• OLAP/BI/Data Warehouse
• Social Networks
• Agile Schema Free
Simplicity is a Virtue
• Many systems derive their strength by dramatically limiting the
features in their system
• Simplicity allows database designers to focus on the primary
business driver
• Examples:
– Touch screen interfaces
– Key/Value data stores
Historical Context
Mainframe Era
• 1 CPU
• COBOL and FORTRAN
• Punchcards and flat files
• $10,000 per CPU hour
Commodity Processors
• 10,000 CPUs
• Functional programming
• MapReduce "farms"
• Pennies per CPU hour
Two Approaches to Computation
1930s and 40s
• Alonzo Church: Make computations act like math functions.
• John von Neumann: Manage state with a program counter.
Which is simpler? Which is cheaper? Which will scale to 10,000 CPUs?
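To make the two approaches concrete, here is a minimal Python sketch that computes the same sum of squares both ways. The function names and the sample data are illustrative assumptions, not part of the original slides.

```python
from functools import reduce

def total_stateful(values):
    """Von Neumann style: step through the data, mutating an accumulator."""
    total = 0
    for v in values:          # each step depends on the state left by the previous one
        total += v * v
    return total

def square(v):
    """A pure function: same input, same output, no shared state."""
    return v * v

def total_functional(values):
    """Church style: map a pure function over the data, then combine the results."""
    return reduce(lambda a, b: a + b, map(square, values), 0)

data = [1, 2, 3, 4]
print(total_stateful(data), total_functional(data))
```

Because `square` touches no shared state, each call could in principle run on a different CPU, which is the property the scaling question above is probing.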
Copyright 2010 Dan McCreary & Associates
Standard vs. MapReduce Prices
John's Way
Alonzo's Way
http://aws.amazon.com/elasticmapreduce/#pricing
MapReduce CPUs Cost Less!
[Bar chart: Cost Per CPU Hour (Cents), Standard CPU vs. MapReduce CPU]
Cuts cost from 32 to 6 cents per CPU hour!
Perhaps Alonzo was right!
Why? (hint: how "shareable" is this process)
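The arithmetic behind the 32 to 6 cents comparison can be sketched as follows. The two hourly rates come from the slide; the 10,000 CPU-hour workload is a hypothetical figure chosen only for illustration.

```python
STANDARD_CENTS_PER_CPU_HOUR = 32     # rate quoted on the slide
MAPREDUCE_CENTS_PER_CPU_HOUR = 6     # rate quoted on the slide

def job_cost_dollars(cpu_hours, cents_per_hour):
    """Cost of a job in dollars at a given per-CPU-hour rate."""
    return cpu_hours * cents_per_hour / 100

# Fraction of the standard price that is saved per CPU hour.
savings_fraction = 1 - MAPREDUCE_CENTS_PER_CPU_HOUR / STANDARD_CENTS_PER_CPU_HOUR

# Hypothetical workload of 10,000 CPU hours.
standard = job_cost_dollars(10_000, STANDARD_CENTS_PER_CPU_HOUR)
mapreduce = job_cost_dollars(10_000, MAPREDUCE_CENTS_PER_CPU_HOUR)
print(f"{savings_fraction:.1%} cheaper: ${standard:,.0f} vs ${mapreduce:,.0f}")
```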
http://aws.amazon.com/elasticmapreduce/#pricing
Perspectives
• Native XML
• Object Stores
• NoSQL for Web 2.0 and BigData
• OLAP MDX
• Graph Stores
Perspective depends on your context
Architectural Tradeoffs
"I want a fast car with good mileage."
"I want a scalable database with low cost that runs well on the 1,000 CPUs in our data center."
Recent History
• The term NoSQL was re-popularized around 2009
• Used for conferences of advocates of non-relational databases
• Became a contagious idea (a "meme")
• First of many "NoSQL meetups" in San Francisco organized by Johan Oskarsson
• Conversion from "No SQL" to "Not Only SQL" in recent years
NoSQL on Google Trends
NoSQL and Web 2.0 Startups
• Many Web 2.0 startups did not use Oracle or MySQL
• They built their own data stores, influenced by Amazon’s Dynamo and Google’s BigTable, in order to store and process huge amounts of data
• Most of these data stores, built for social community or cloud computing applications, became OpenSource software
Google MapReduce
• 2004 paper that had a huge impact on the use of functional programming across the entire community
• Copied by many organizations, including Yahoo
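For readers who have not seen the MapReduce pattern, a single-process word count is the usual illustration. The sketch below is a toy under stated assumptions: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample documents are made up here, and a real framework would run the map and reduce calls on many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group all emitted values by key, as a framework would between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    # Each map_phase call is independent, so the calls could run on different CPUs.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

docs = ["NoSQL is not only SQL", "SQL is not going away"]
counts = word_count(docs)
print(counts)
```

The map and reduce functions never share state, which is the functional-programming property that lets the work spread across a large farm of machines.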
Google Bigtable Paper
• 2006 paper that gave focus to scalable databases
• Designed to reliably scale to petabytes of data and thousands of machines
Amazon's Dynamo Paper
• Werner Vogels
• CTO - Amazon.com
• October 2, 2007
• Used to power Amazon's S3 service
• One of the most influential papers in the NoSQL movement
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin,
Swami Sivasubramanian, Peter Vosshall and Werner Vogels, “Dynamo: Amazon's Highly Available Key-Value Store”,
in the Proceedings of the 21st ACM Symposium on Operating Systems Principles, Stevenson, WA, October 2007.
NoSQL "Meetups"
“NoSQLers came to share how they had
overthrown the tyranny of slow, expensive
relational databases in favor of more
efficient and cheaper ways of managing
data.”
Computerworld magazine, July 1st, 2009
Key Motivators
• Licensing RDBMS on multiple CPUs
• The Three "V"s
– Velocity – lots of data arriving fast
– Volume – web-scale BigData
– Variability – many exceptions
• Desire to escape rigid schema design
• Avoidance of complex Object-Relational Mapping (the "Vietnam" of computer science)
Many Processes Today Are Driven By…
The constraints of yesterday…
Challenge:
Ask ourselves the question…
Does our current method of solving problems with tabular data…
Reflect the storage of the 1950s…
Or our actual business requirements?
What structures best solve the actual business problem?
Copyright 2008 Dan McCreary & Associates
No-Shredding!
My
Data
• Relational databases take a single hierarchical document and
shred it into many pieces so it will fit in tabular structures
• Document stores prevent this shredding
Is Shredding Really Necessary?
• Every time you take hierarchical data and put it into a traditional database you have to put repeating groups in separate tables and use SQL “joins” to reassemble the data
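The shredding and reassembly described above can be sketched with Python's built-in sqlite3 module. The order document, table names and column names are hypothetical, and the JSON-in-a-text-column table is only a stand-in for a real document store.

```python
import json
import sqlite3

# A small hierarchical document with one repeating group ("items").
order = {"id": 1, "customer": "Acme", "items": [
    {"sku": "A-1", "qty": 2},
    {"sku": "B-7", "qty": 1},
]}

db = sqlite3.connect(":memory:")

# Relational approach: the repeating group has to be shredded into its own table.
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
db.execute("CREATE TABLE order_items (order_id INTEGER, sku TEXT, qty INTEGER)")
db.execute("INSERT INTO orders VALUES (?, ?)", (order["id"], order["customer"]))
db.executemany(
    "INSERT INTO order_items VALUES (?, ?, ?)",
    [(order["id"], item["sku"], item["qty"]) for item in order["items"]],
)

# Reassembly needs a join plus application code to rebuild the hierarchy.
rows = db.execute(
    "SELECT o.id, o.customer, i.sku, i.qty FROM orders o "
    "JOIN order_items i ON i.order_id = o.id WHERE o.id = ? ORDER BY i.rowid",
    (1,),
).fetchall()
rebuilt = {"id": rows[0][0], "customer": rows[0][1],
           "items": [{"sku": r[2], "qty": r[3]} for r in rows]}

# Document approach: store and fetch the whole document under one key.
db.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT)")
db.execute("INSERT INTO documents VALUES (?, ?)", (order["id"], json.dumps(order)))
fetched = json.loads(db.execute("SELECT body FROM documents WHERE id = 1").fetchone()[0])
```

Both paths recover the same order, but the relational path needs two tables, a join and rebuild code, while the document path is a single write and a single read.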
Object Relational Mapping
Web Browser, Object Middle Tier, Relational Database
• T1 – HTML into Objects
• T2 – Objects into SQL Tables
• T3 – Tables into Objects
• T4 – Objects into HTML
"The Vietnam of Applications"
• Object-relational mapping has become one of
the most complex components of building
applications today
• A "Quagmire" where many projects get lost
• Many "heroic efforts" have been made to
solve the problem:
– Hibernate
– Ruby on Rails
• But sometimes the way to avoid complexity is
to keep your architecture very simple
Document Stores Need No Translation
Diagram labels: Document, Application Layer, Document Database
• Documents in the database
• Documents in the application
• No object middle tier
• No "shredding"
• No reassembly
• Simple!
Zero Translation (XML)
Diagram labels: Web Browser (XForms), REST-Interfaces, XML database
• XML lives in the web browser (XForms)
• REST interfaces
• XML in the database (Native XML, XQuery)
• XRX Web Application Architecture
• No translation!
"Schema Free"
• Systems that automatically determine how to
index data as the data is loaded into the
database
• No a priori knowledge of data structure
• No need for up-front logical data modeling
– …but some modeling is still critical
• Adding new data elements or changing data
elements is not disruptive
• Searching millions of records still has sub-second response time
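One way to picture "index as the data is loaded" is a store that builds an index entry for every field of every incoming document. The sketch below is a toy under stated assumptions: the `SchemaFreeStore` class and its methods are invented for illustration, documents are flat with hashable values, and real products use far more sophisticated index structures.

```python
from collections import defaultdict

class SchemaFreeStore:
    """Toy store that indexes every field of every document as it is loaded."""

    def __init__(self):
        self.documents = {}
        self.index = defaultdict(set)    # (field, value) -> set of document ids

    def load(self, doc_id, document):
        self.documents[doc_id] = document
        for field, value in document.items():
            self.index[(field, value)].add(doc_id)   # no schema declared up front

    def find(self, field, value):
        """Look up matching documents through the index rather than by scanning."""
        ids = sorted(self.index.get((field, value), ()))
        return [self.documents[i] for i in ids]

store = SchemaFreeStore()
store.load(1, {"name": "Ann", "city": "St. Paul"})
store.load(2, {"name": "Bob", "city": "St. Paul", "twitter": "@bob"})  # new field, no migration
print(store.find("twitter", "@bob"))
```

The second document introduces a field the store has never seen, and it becomes searchable immediately without any change to a schema.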
Monoculture and Mono-architecture
Image Source: Wikipedia
Eric Evans
“The whole point of seeking alternatives
[to RDBMS systems] is that you need to
solve a problem that relational databases
are a bad fit for.”
Eric Evans
Rackspace
Evolution of Ideas in OpenSource
Diagram labels: New Database Ideas (Schema-free, Auto-sharding, MapReduce, Cloud Computing); New Products (Product A, Product B); Proprietary Software; OpenSource
• How quickly can new ideas be recombined into new database products?
• OpenSource software has proved to be the most efficient way to quickly recombine new ideas into new products
Storage Architectural Patterns
Tables
Triples
Trees
Stars
Finding the Right Match
Schema-Free
Standards Compliant
Mature Query Language
Use CMU's Architecture Tradeoff Analysis Method (ATAM)
Brewer's CAP Theorem
• Consistency
• Availability
• Partition Tolerance
You cannot have all three, so pick two!
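The "pick two" choice shows up when the network between replicas fails. The toy model below is only a sketch of that moment: the `TwoNodeStore` class, its `prefer` setting and the two-replica layout are assumptions made for illustration and do not describe any particular product.

```python
class Replica:
    def __init__(self):
        self.value = None

class TwoNodeStore:
    """Two replicas; `prefer` selects what to give up when they cannot talk."""

    def __init__(self, prefer):
        self.prefer = prefer             # "consistency" or "availability"
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, value):
        """A client writes through replica A."""
        if not self.partitioned:
            self.a.value = self.b.value = value   # replicate to both nodes
            return True
        if self.prefer == "consistency":
            return False                 # refuse the write: consistent but unavailable
        self.a.value = value             # accept the write: available, replicas diverge
        return True

cp = TwoNodeStore("consistency")
ap = TwoNodeStore("availability")
for store in (cp, ap):
    store.write("v1")
    store.partitioned = True             # the link between A and B fails
cp_ok = cp.write("v2")
ap_ok = ap.write("v2")
print(cp_ok, cp.a.value, cp.b.value)
print(ap_ok, ap.a.value, ap.b.value)
```

During the partition, the consistency-first store keeps both replicas identical by rejecting the write, while the availability-first store keeps answering and leaves the replicas to be reconciled later.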
Avoidance of Unneeded Complexity
• Relational databases provide a variety of
features to ALWAYS support strict data
consistency
• Rich feature set and the ACID properties
implemented by RDBMSs might be more
than necessary for particular applications
and use cases
High Throughput
• Some NoSQL databases provide a
significantly higher data throughput than
traditional RDBMS
• Hypertable, which pursues Google’s Bigtable approach, allows the local search engine Zvents to store one billion data cells per day
• Google is able to process 20 petabytes a day stored in BigTable via its MapReduce approach
Complexity and Cost of Setting
up Database Clusters
NoSQL databases are designed
in a way that “PC clusters can be easily and
cheaply expanded without the complexity
and cost of ’sharding,’ which involves cutting
up databases into multiple tables to run on
large clusters or grids”.
Nati Shalom, CTO and founder of GigaSpaces
Compromising Reliability for Better
Performance
• Shalom argues that there are “different
scenarios where applications would be willing
to compromise reliability for better
performance.”
• Performance over reliability
• Example: HTTP session data
– “needs to be shared between various web
servers but since the data is transient in nature (it
goes away when the user logs off) there is no
need to store it in persistent storage.”
"One Size Fits…"
"One Size Does Not Fit All"
James Hamilton Nov. 3rd, 2009
http://perspectives.mvdirona.com/CommentView,guid,afe46691-a293-4f9a-8900-5688a597726a.aspx
Cloud Computing
• High scalability
– Especially in the horizontal direction (multiple CPUs)
• Low administration overhead
– Simple web page administration
Databases work well in the cloud
• Data warehousing specific databases for batch data processing and map/reduce operations
• Simple, scalable and fast key/value-stores
• Databases containing a richer feature set than key/value-stores, filling the gap with traditional RDBMS while offering good performance and scalability properties (such as document databases)
Scale Up vs. Scale Out
Scale Up
• Make a single CPU as fast as
possible
• Increase clock speed
• Add RAM
• Make disk I/O go faster
Scale Out
• Make Many CPUs work
together
• Learn how to divide your
problems into independent
threads
The NO-SQL Universe
Key-Value Stores
Document Stores
XML
Graph Stores
Object Stores
Column Stores
Types of Key-Value Stores
• Eventually-consistent Key-Value Stores
• Hierarchical Key-Value Stores
• Key-Value Stores in RAM
• Key-Value Stores on Disk
• Ordered Key-Value Stores
Key Value Stores
Columns: Key, Value
• A table with two columns and a simple interface
– Add a key-value pair
– For this key, give me the value
– Delete a key
• Blazingly fast and easy to scale
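The three-operation interface above is small enough to sketch in a few lines. This is an in-memory stand-in, assuming hypothetical method names `put`, `get` and `delete`; actual products name and extend these operations differently.

```python
class KeyValueStore:
    """In-memory stand-in for a key-value store with three operations."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Add (or overwrite) a key-value pair."""
        self._data[key] = value

    def get(self, key, default=None):
        """For this key, give me the value."""
        return self._data.get(key, default)

    def delete(self, key):
        """Delete a key; deleting a missing key is a no-op."""
        self._data.pop(key, None)

kv = KeyValueStore()
kv.put("user:42", '{"name": "Ann"}')
print(kv.get("user:42"))
kv.delete("user:42")
print(kv.get("user:42"))
```

The store never inspects the value, which is a large part of why this design is simple to partition across many machines.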
Different Thinking
Sequential Processing
• The output of any step can be used in the next step
Parallel Processing
• Each loop of a FLWOR statement is an independent thread
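The difference between the two styles can be sketched in Python, with a thread pool standing in for the independent iterations of an XQuery FLWOR expression. The `add_tax` function, the 7% rate and the prices (in cents) are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def add_tax(cents):
    """Independent per-item work: needs nothing from any other item."""
    return cents * 107 // 100

prices = [1000, 2000, 3000]              # prices in cents

# Sequential: each step consumes the output of the previous step.
running_total = 0
for p in prices:
    running_total = running_total + p    # cannot start until the prior step finishes

# Parallel: every item is processed independently, so the work can be spread out.
with ThreadPoolExecutor(max_workers=3) as pool:
    taxed = list(pool.map(add_tax, prices))   # results come back in input order

print(running_total, taxed)
```

Threads are used here only to show that the iterations are independent; for CPU-bound work in Python, separate processes or machines would be needed for a real speedup.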
Auto-Sharding
• When one database gets almost full it tells a "coordinator" system and the data automatically gets migrated to other systems
Before: 90% full
After: 45% full, 45% full
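The before and after picture can be reproduced with a small range-partitioned sketch. Everything here is an assumption made for illustration: the `Coordinator` class, the capacity of 20 keys, the 90% threshold and the split-at-the-median rule are not taken from any specific product.

```python
import bisect

class Coordinator:
    """Hypothetical coordinator that splits a shard once it passes a fullness threshold."""

    def __init__(self, capacity=20, threshold=0.9):
        self.capacity = capacity
        self.threshold = threshold
        self.lower_bounds = [""]         # shard i holds keys >= lower_bounds[i]
        self.shards = [{}]

    def _index_for(self, key):
        return bisect.bisect_right(self.lower_bounds, key) - 1

    def put(self, key, value):
        i = self._index_for(key)
        self.shards[i][key] = value
        if len(self.shards[i]) / self.capacity >= self.threshold:
            self._split(i)

    def get(self, key):
        return self.shards[self._index_for(key)].get(key)

    def _split(self, i):
        """Move the upper half of a shard's keys to a new shard."""
        keys = sorted(self.shards[i])
        half = len(keys) // 2
        moved = {k: self.shards[i].pop(k) for k in keys[half:]}
        self.lower_bounds.insert(i + 1, keys[half])
        self.shards.insert(i + 1, moved)

    def fullness(self):
        return [len(s) / self.capacity for s in self.shards]

coordinator = Coordinator()
for n in range(18):                      # the 18th key makes the single shard 90% full
    coordinator.put(f"k{n:02d}", n)
print(coordinator.fullness())
```

Callers keep using `put` and `get` with the same keys; the migration to a second shard happens without any change on their side.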
Functional Programming
• What does it mean to your IT staff?
• What experience do they have in
functional programming?
• Can they "unlearn" the habits of the
procedural world?
MongoDB
• Open Source License
• Document/Collection centric
• Sharding built-in, automatic
• Stores data in JSON format
• Query language is JSON
• Can be 10x faster than MySQL
• Many languages (C++, JavaScript, Java, Perl, Python etc.)
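To give a feel for a query that is itself a JSON document, the sketch below does equality-only "query by example" matching over a list of Python dictionaries. It is not MongoDB's implementation or driver API; the `matches` and `find` helpers and the sample collection are hypothetical, and real query documents support far more than equality.

```python
import json

def matches(document, query):
    """True if every field in the query equals the same field in the document."""
    return all(document.get(field) == wanted for field, wanted in query.items())

def find(collection, query_json):
    query = json.loads(query_json)       # the query itself arrives as JSON
    return [doc for doc in collection if matches(doc, query)]

collection = [
    {"name": "Ann", "role": "CIO", "city": "Minneapolis"},
    {"name": "Bob", "role": "DBA", "city": "Minneapolis"},
    {"name": "Cy", "role": "DBA"},       # documents need not share the same fields
]
dbas_in_mpls = find(collection, '{"role": "DBA", "city": "Minneapolis"}')
print(dbas_in_mpls)
```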
Hadoop/HBase
• Open source implementation of MapReduce algorithm written in Java
• Initially created by Yahoo
– 300 person-years development
• Column-oriented data store
• Java interface
• HBase designed specifically to work with Hadoop
Voldemort
• A distributed key-value system
• Used at LinkedIn
• 10K-20K node operations/CPU
• Auto-sharding
• Graceful server failure handling
Cassandra
• Apache open source project
• Originally developed by Facebook
• Designed for highly distributed systems
• Column-family data model
CouchDB
• Apache Document Store
• Written in Erlang
• RESTful JSON API
• Distributed, featuring robust, incremental replication with bi-directional conflict detection and management
Memcached
• Free & open source in-memory caching system
• Designed to speed up dynamic web applications by alleviating database load
• RAM resident key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering
• Simple interface
• Designed for quick deployment, ease of development
• APIs in many languages
MarkLogic
• Native XML database designed to be used for petabyte data stores
• ACID compliant
• Heavy use by federal agencies, document
publishers and "high-variability" data
• Arguably the most successful NoSQL
company
eXist
• OpenSource native XML database
• Strong support for XQuery and XQuery
extensions
• Heavily used by the Text Encoding
Initiative (TEI) community and
XRX/XForms communities
• Ideal for metadata management
• Integrated Lucene search and structured
search
Riak
• Community and Commercial licenses
• A "Dynamo-inspired" database
• Written in Erlang
• Query using JSON or Erlang
Hypertable
• Open Source
• Closely modeled after Google's Bigtable
project
• High performance distributed data storage
system
• Designed to support applications requiring
maximum performance, scalability, and
reliability
• Hypertable Query Language (HQL) that is
syntactically similar to SQL
Selecting a NoSQL Pilot Project
• The "Goldilocks Pilot Project Strategy"
• Not too big, not too small, just the right size
• Duration
• Sponsorship
• Importance
• Skills
• Mentorship
The Future of the NoSQL Movement
Growth
Diversity
• Will data sets continue to grow at exponential rates?
• Will new system options become more diverse?
• Will new markets have different demands?
• Will some ideas be "absorbed" into existing RDBMS vendors' products?
• Will the NoSQL community continue to be the place where new database ideas and products are incubated?
• Will the job of doing high-quality architectural tradeoff analysis become easier?
Using the Wrong Architecture
Start
Finish
Credit: Isaac Homelund – MN Office of the Revisor
Using the Right Architecture
Finish
Start
Find ways to remove barriers to empowering the non-programmers on your team.
Questions
Dan McCreary
President, Kelly-McCreary & Associates
[email protected]