Introduction to Big Data
and NoSQL
SQL Azure Saturday
April 21, 2012
Don Demsak
Advisory Solutions Architect
EMC Consulting
www.donxml.com
1
Meet Don
• Advisory Solutions Architect
– EMC Consulting
• Application Architecture, Development & Design
• DonXml.com, Twitter: donxml
• SlideShare - http://www.slideshare.net/dondemsak
2
The era of Big Data
3
How did we get here?
• Expensive
– Processors
– Disk space
– Memory
– Operating Systems
– Software
– Programmers
• Monoculture
– Limit CPU cycles
– Limit disk space
– Limit memory
– Limited OS Development
– Limited Software
– Programmers
• Mono-lingual
• Mono-persistence
4
Typical RDBMS Implementations
• Fixed table schemas
• Small but frequent reads/writes
• Large batch transactions
• Focus on ACID
– Atomicity
– Consistency
– Isolation
– Durability
5
How we scale RDBMS implementations
6
1st Step – Build a relational database
Database
7
2nd Step – Table Partitioning
[Diagram: one Database whose table is split into partitions p1, p2, p3]
8
3rd Step – Database Partitioning
[Diagram: Customer #1, Customer #2, and Customer #3 each get their own stack: Browser -> Web Tier -> B/L Tier -> Database]
9
4th Step – Move to the cloud?
[Diagram: Customer #1, Customer #2, and Customer #3 each get their own stack: Browser -> Web Tier -> B/L Tier -> SQL Azure Federation]
10
There have to be other ways
11
Polyglot Persistence
12
Polyglot Programmer
13
14
Where Did NoSQL Originate?
• 1998 - Carlo Strozzi
– NoSQL project - lightweight open-source relational DB with no SQL interface
• 2009 - Eric Evans (Rackspace) & Johan Oskarsson (Last.fm) wanted to organize an event to discuss open-source distributed databases
15
NoSQL (loose) Definition
• (often) Open source
• Non-relational
• Distributed
• (often) No ACID guarantees
16
Atlanta 2009
• No:sql(east) conference
– select fun, profit from real_world where relational=false
• Billed as “conference of no-rel datastores”
17
Types Of NoSQL Data Stores
18
5 Groups of Data Models
Relational
Document
Key Value
Graph
Column Family
19
Document Store
• Apache Jackrabbit
• CouchDB
• MongoDB
• SimpleDB
• XML Databases
– MarkLogic Server
– eXist.
20
Document?
• Think of a web page...
– Relational model requires a column per tag
– Lots of empty columns
– Wasted space
• Document model just stores the pages as is
– Saves on space
– Very flexible.
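To make the contrast concrete, here is a minimal sketch in plain JavaScript. The page fields (url, title, author, video) and the toRow helper are hypothetical, made up purely for illustration; they do not come from any particular document store.

```javascript
// Hypothetical pages: each document carries only the fields it actually has.
const pages = [
  { url: "/home", title: "Home", author: "Ann" },
  { url: "/about", title: "About" },
  { url: "/demo", title: "Demo", video: "demo.mp4" },
];

// Relational shape: one fixed column per possible field, so every row
// must hold a value (here null) for every column.
const columns = [...new Set(pages.flatMap((p) => Object.keys(p)))];
const toRow = (page) => columns.map((c) => (c in page ? page[c] : null));
const rows = pages.map(toRow);

// Count the empty cells that the fixed schema forces us to store.
const emptyCells = rows.flat().filter((v) => v === null).length;
console.log(columns, emptyCells);
```

The document form stores three small objects as they are; the tabular form needs every column on every row, and the nulls are the "lots of empty columns" from the slide.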
21
Graph Storage
• AllegroGraph
• Core Data
• Neo4j
• DEX
• FlockDB
• Microsoft Trinity (research project)
– http://research.microsoft.com/en-us/projects/trinity/
22
What’s a graph?
• Graph consists of
– Nodes (‘stations’ of the graph)
– Edges (lines between them)
• FlockDB
– Created by the Twitter folks
– Nodes = Users
– Edges = Nature of relationship between nodes.
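As a rough illustration of nodes and edges, the sketch below keeps a graph as a plain list of edges in JavaScript. The user names, the edge kinds, and the outgoing/incoming helpers are hypothetical; this is not FlockDB's API.

```javascript
// A graph as a list of edges. Nodes are users; each edge records the
// nature of the relationship between two users.
const edges = [
  { from: "ann", to: "bob", kind: "follows" },
  { from: "bob", to: "ann", kind: "follows" },
  { from: "cat", to: "ann", kind: "follows" },
  { from: "ann", to: "cat", kind: "blocks" },
];

// Everyone a user points at with a given kind of relationship.
function outgoing(user, kind) {
  return edges
    .filter((e) => e.from === user && e.kind === kind)
    .map((e) => e.to);
}

// Everyone who points at a user with a given kind of relationship.
function incoming(user, kind) {
  return edges
    .filter((e) => e.to === user && e.kind === kind)
    .map((e) => e.from);
}

console.log(outgoing("ann", "follows"));
console.log(incoming("ann", "follows"));
```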
23
Key/Value Stores
• On disk
• Cache in RAM
• Eventually Consistent
– Weak Definition
• “If no updates occur for a period, eventually all updates will propagate through the system and all replicas will be consistent”
– Strong Definition
• “For a given update and a given replica, eventually either the update reaches the replica or the replica retires”
• Ordered
– Distributed Hash Table allows lexicographical processing
Key/Value Examples
• Azure AppFabric Cache
• Memcached
• VMWare vFabric GemFire
25
Object Databases
• Db4o
• GemStone/S
• InterSystems Caché
• Objectivity/DB
• ZODB
26
Tabular
• BigTable
• Mnesia
• HBase
• Hypertable
• Azure Table Storage
• SQL Server 2012
27
Azure Table Storage Demo
28
Big Data
29
Big Data Definition
• Volumes & volumes of data
• Unstructured
• Semi-structured
• Not suited for Relational Databases
• Often utilizes MapReduce frameworks
30
Big Data Examples
• Cassandra
• Hadoop
• Greenplum
• Azure Storage
• EMC Atmos
• Amazon S3
• SQL Azure (with Federations support)
31
Real World Example
• Twitter
– The challenges
• Needs to store many graphs
– Who you are following
– Who’s following you
– Who you receive phone notifications from, etc.
• To deliver a tweet requires rapid paging of followers
• Heavy write load as followers are added and removed
• Set arithmetic for @mentions (intersection of users)
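The last two challenges, paging and set arithmetic, can be sketched together in a few lines of JavaScript. The follower sets and the pageOfCommonFollowers helper are hypothetical, and the whole thing runs in memory; it only shows the shape of the query, not how Twitter implements it.

```javascript
// Hypothetical follower sets, keyed by the user being followed.
const followers = {
  ann: new Set(["bob", "cat", "dan", "eve"]),
  bob: new Set(["ann", "cat", "dan", "eve"]),
};

// Set intersection: members present in both sets.
function intersect(a, b) {
  return new Set([...a].filter((x) => b.has(x)));
}

// Return one page of the users who follow both u1 and u2.
// Sorting gives a stable order so consecutive pages do not overlap.
function pageOfCommonFollowers(u1, u2, offset, limit) {
  const common = [...intersect(followers[u1], followers[u2])].sort();
  return common.slice(offset, offset + limit);
}

console.log(pageOfCommonFollowers("ann", "bob", 0, 2));
console.log(pageOfCommonFollowers("ann", "bob", 2, 2));
```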
32
What did they try?
• Started with Relational
Databases
• Tried Key-Value storage
of denormalized lists
• Did it work?
– Nope
• Either good at
 Handling the write load
 Or paging large
amounts of data
 But not both
33
What did they need?
• Simplest possible thing that would work
• Allow for horizontal partitioning
• Allow write operations to
– Arrive out of order
– Or be processed more than once
• Failures should result in redundant work
– Not lost work!
34
The Result was FlockDB
• Stores graph data
• Not optimized for graph traversal operations
• Optimized for large adjacency lists
– List of all edges in a graph
• Key is the edge; value is a set of the node end points
• Optimized for fast read and write
• Optimized for page-able set arithmetic.
35
How Does it Work?
• Stores graphs as sets of edges between nodes
• Data is partitioned by node
– All queries can be answered by a single partition
• Write operations are idempotent
– Can be applied multiple times without changing the result
• And commutative
– Changing the order of operands doesn’t change the result
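One common way to get writes that are both idempotent and commutative is a timestamped last-writer-wins rule. The JavaScript below is a minimal sketch of that idea, not FlockDB's actual code: the applyWrite and replay helpers, the edge key format, and the assumption that two writes to the same edge never share a timestamp are all made up for illustration.

```javascript
// Edge store keyed by "source->destination". Each entry keeps the state
// ("added" or "removed") and the timestamp of the write that set it.
function applyWrite(store, op) {
  const key = `${op.from}->${op.to}`;
  const current = store.get(key);
  // Last-writer-wins: ignore any write that is not newer than what we hold.
  // Assumes timestamps are unique per edge; ties would need a tie-breaker.
  if (!current || op.ts > current.ts) {
    store.set(key, { state: op.state, ts: op.ts });
  }
  return store;
}

// Apply a sequence of writes to an empty store.
function replay(ops) {
  return ops.reduce(applyWrite, new Map());
}

const add = { from: "ann", to: "bob", state: "added", ts: 1 };
const remove = { from: "ann", to: "bob", state: "removed", ts: 2 };

const inOrder = replay([add, remove]);
const outOfOrder = replay([remove, add]); // commutative
const duplicated = replay([add, remove, remove, add]); // idempotent

console.log(inOrder.get("ann->bob"), outOfOrder.get("ann->bob"), duplicated.get("ann->bob"));
```

All three replays end in the same state, so a retried or reordered write costs redundant work rather than lost or corrupted data.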
36
Working With Big Data
37
ACID
• Atomicity
– All or Nothing
• Consistency
– Valid according to all defined rules
• Isolation
– No transaction should be able to interfere with another transaction
• Durability
– Once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors
38
BASE
• Basically Available
– High availability but not always consistent
• Soft state
– Background cleanup mechanism
• Eventual consistency
– Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system and all the replicas will be consistent
39
Traditional (relational) Approach
[Diagram: Transactional Data Store -> Extract -> Transform -> Load -> Data Warehouse]
40
Big Data Approach
• MapReduce Pattern/Framework
– An input reader
– A map function, to transform input to a common shape (format)
– A partition function
– A compare function
– A reduce function
– An output writer
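The six pieces listed above can be wired together in a small in-memory sketch. This is plain JavaScript for illustration only, not any framework's API: the mapReduce function, its option names, and the sample documents are all hypothetical, and a real framework would run the partitions on different machines.

```javascript
// Minimal in-memory MapReduce; every name here is illustrative.
function mapReduce({ read, map, partition, compare, reduce, write, partitions }) {
  // 1. The input reader produces records; map turns each into [key, value] pairs.
  const pairs = read().flatMap((record) => map(record));

  // 2. The partition function assigns each key to one of N buckets.
  const buckets = Array.from({ length: partitions }, () => []);
  for (const pair of pairs) {
    buckets[partition(pair[0], partitions)].push(pair);
  }

  for (const bucket of buckets) {
    // 3. The compare function sorts pairs so equal keys sit next to each other.
    bucket.sort((a, b) => compare(a[0], b[0]));

    // 4. Reduce runs once per key over all of that key's values.
    let i = 0;
    while (i < bucket.length) {
      const key = bucket[i][0];
      const values = [];
      while (i < bucket.length && compare(bucket[i][0], key) === 0) {
        values.push(bucket[i][1]);
        i++;
      }
      // 5. The output writer receives each reduced result.
      write(key, reduce(key, values));
    }
  }
}

// Usage: count tags across two hypothetical documents.
const output = {};
mapReduce({
  read: () => [{ tags: ["nosql", "azure"] }, { tags: ["nosql"] }],
  map: (doc) => doc.tags.map((tag) => [tag, 1]),
  partition: (key, n) => key.length % n,
  compare: (a, b) => (a < b ? -1 : a > b ? 1 : 0),
  reduce: (key, values) => values.reduce((sum, v) => sum + v, 0),
  write: (key, value) => { output[key] = value; },
  partitions: 2,
});
console.log(output);
```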
41
MongoDB Example
// map function: emit each tag of the document with a count of 1
m = function () {
    this.tags.forEach(function (z) {
        emit(z, { count: 1 });
    });
};

// reduce function: sum the counts emitted for one tag
r = function (key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += values[i].count;
    }
    return { count: total };
};

// execute, writing the results to the "myoutput" collection
res = db.things.mapReduce(m, r, { out: "myoutput" });
42
MongoDB Demo
43
Big Data on Azure
• Azure Table Storage
– Azure Service Bus
• SQL Azure Federations
• MongoDB on Azure
– http://www.mongodb.org/display/DOCS/MongoDB+on+Azure
• Hadoop on Azure
– https://www.hadooponazure.com/
44
Using Azure for Computing
[Diagram: a Client submits work to a Master running the Job/Task Scheduler, which hands it to Workers; each Worker has its own Data]
45
Moving to Event Based Architecture
[Diagram, before: each request (Req) goes to a Web Role paired with its own Worker Role]
[Diagram, after: Web Roles put work on a shared Queue and Worker Roles take work from it]
Monitor queue length against user’s expectations
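One way to read "monitor queue length against user's expectations" is as a sizing rule for the worker pool. The JavaScript below is a back-of-the-envelope sketch under that assumption; the workersNeeded function, its parameters, and the numbers are hypothetical and do not come from any Azure API.

```javascript
// How many workers are needed to drain the queue within the wait that
// users will tolerate? Always keep at least one worker running.
function workersNeeded(queueLength, messagesPerWorkerPerSecond, maxWaitSeconds) {
  const capacityPerWorker = messagesPerWorkerPerSecond * maxWaitSeconds;
  return Math.max(1, Math.ceil(queueLength / capacityPerWorker));
}

// Each worker handles 5 messages per second; users accept a 10 second wait.
console.log(workersNeeded(0, 5, 10));
console.log(workersNeeded(120, 5, 10));
```

Because the web roles only enqueue work, the worker count can change without touching the front end.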
46
Aggregate Stores
47
Visualizing Aggregates
ID: 1001
Customer: Ann
Line Items
32411234   2   $48   $96
707423234  1   $56   $56
125145     1   $24   $24
Payment Details
Card: AmEx
CC#: 12343
Expiration: 07/2015
[Diagram labels: in the relational model this one order is spread across the Customers, Orders, Order Lines, and Credit Cards tables]
48
Visualizing Aggregates
ID: 1001
Customer: Ann
Line Items
32411234   2   $48   $96
707423234  1   $56   $56
125145     1   $24   $24
Payment Details
Card: AmEx
CC#: 12343
Expiration: 07/2015
{
"SalesOrdersView": {
ID: 1001,
Customer: Ann,
LineItems: []
……………..
}
}
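The aggregate on this slide can be sketched as a single JavaScript object. The field names (product, quantity, price, payment) are a guess at what the unlabeled columns mean, so treat them as assumptions rather than the presenter's schema.

```javascript
// The whole order as one aggregate: line items and payment details are
// nested inside the order instead of living in separate tables.
const salesOrder = {
  id: 1001,
  customer: "Ann",
  lineItems: [
    { product: 32411234, quantity: 2, price: 48 },
    { product: 707423234, quantity: 1, price: 56 },
    { product: 125145, quantity: 1, price: 24 },
  ],
  payment: { card: "AmEx", ccNumber: "12343", expiration: "07/2015" },
};

// Everything needed to total the order is in the one object: no joins.
const orderTotal = salesOrder.lineItems.reduce(
  (sum, item) => sum + item.quantity * item.price,
  0
);
console.log(orderTotal);
```

A document or key/value store can read or write this object in one operation, which is what makes the aggregate a natural unit for partitioning.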
49
MongoDB on Azure Demo
50
Next Steps
• Learn a NoSQL product
– Great places to start: AppFabric Cache, Azure Table Storage, MongoDB
• Pick a new programming language to learn
– Not Java or C#/VB
– Node.js, JavaScript, F#
51
THANK YOU
52