NoSQL and MongoDBx

NoSQL with MongoDB and R
Bob Wakefield
(I know stuff.)
Sit down. Strap in. Hang on.
This is a discussion.
Feel free to hop in and
correct me if I say
something crazy.
WARNING!
“We’ll talk about this
a little bit later.”
Motivations for this presentation
• Kaggle competition experience
• Recent experience on a client site
• NoSQL skills starting to be in demand
RecSys Challenge 2013: Yelp
business rating prediction
• Build a recommender system based on user ratings
• AKA ETL Hell
System 1
Schema changes break everything downstream!
System 2
Wakefield Career
Management System
Step 1: Scan market for high value skills.
Step 2: Acquire skills.
Step 3: Sell skills to the highest bidder.
Step 4: Get Paid.
Sample of Job Board Post
• Designs NoSQL dynamic schemas to leverage
simplicity and power of NoSQL
• Experience with NoSql-based databases, such as
MongoDB
• Knowledge of NoSQL databases (MongoDB, Hadoop,
Couch DB, etc…)
• Five years' experience with NoSQL, columnar, and
key/value databases
MongoDB ETL Example
• Data from Yelp Kaggle Competition
• Data in JSON format
• 229,907 reviews
{
'type': 'review',
'business_id': (encrypted business id),
'user_id': (encrypted user id),
'stars': (star rating),
'text': (review text),
'date': (date, formatted like '2012-03-14', %Y-%m-%d in strptime notation),
'votes': {'useful': (count), 'funny': (count), 'cool': (count)}
}
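A minimal sketch of the first ETL step for a record shaped like the one above: parse the JSON and flatten the nested votes into plain columns. The sample values and the helper name parse_review are made up for illustration, and the sketch assumes the file holds one JSON object per line.

```python
import json
from datetime import datetime

# Hypothetical sample record following the field layout above; values are invented.
raw_line = (
    '{"type": "review", "business_id": "b1", "user_id": "u1", "stars": 4, '
    '"text": "Great tacos.", "date": "2012-03-14", '
    '"votes": {"useful": 2, "funny": 0, "cool": 1}}'
)


def parse_review(line):
    """Turn one JSON line into a flat dict that is easy to load into a table."""
    doc = json.loads(line)
    return {
        "business_id": doc["business_id"],
        "user_id": doc["user_id"],
        "stars": doc["stars"],
        "text": doc["text"],
        # The date arrives as a string in %Y-%m-%d notation.
        "date": datetime.strptime(doc["date"], "%Y-%m-%d").date(),
        # Flatten the nested votes sub document into separate columns.
        "votes_useful": doc["votes"]["useful"],
        "votes_funny": doc["votes"]["funny"],
        "votes_cool": doc["votes"]["cool"],
    }


review = parse_review(raw_line)
print(review["stars"], review["date"], review["votes_useful"])
```

Every nested field needs its own flattening rule, which is where the ETL pain starts once the source schema changes.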
Main Topics for this Evening
• NoSQL
• MongoDB
• Modeling unstructured data
• MongoDB and R
Intended Audience
• Business/Data Analyst
• Data Architect
• General Scenario
• You have read only access to MongoDB and need
to retrieve your own data.
People I’m Going to Ignore
• Software Developers
• DBAs
Source Material - Books
• NoSQL Distilled by P. Sadalage and M. Fowler
• MongoDB Applied Design Patterns by R. Copeland
• The Definitive Guide to MongoDB by Plugge,
Membrey and Hawkins
• MongoDB online docs
Source Material - YouTube
• Introduction to NoSQL by Martin Fowler – GOTO
conferences
• Workshop: NoSQL Data Modelling (Jan Steemann)
Teil2 – ArangoDB
• NoSQL Data Modelling for Scalable eCommerce –
Dataversity Net
• Domain Driven Design – Zend
• Webinar on the rmongodb R package - comsystotv
Why NoSQL
• Handles schema changes well (easy development)
• Solves the impedance mismatch problem
• Rise of JSON
• Python module: simplejson
A really generic and
unofficial definition of NoSQL
An ill-defined set of mostly open-source
databases, mostly developed in the 21st
century, and mostly not using SQL.
Common Characteristics of
NoSQL Databases
• Non-relational
• Open source
• Cluster friendly
• Built from the ground up to handle 21st century data
challenges
• Schema-less*
Various Types of NoSQL
Databases
DB Name             Super Type   Subtype
MongoDB             Aggregate    Document
RavenDB             Aggregate    Document
CouchDB             Aggregate    Document
Cassandra           Aggregate    Column Family
Apache HBase        Aggregate    Column Family
Project Voldemort   Aggregate    Key Value
Riak                Aggregate    Key Value
Redis               Aggregate    Key Value
Neo4j               Graph
Example of graph database
What is an aggregate?
• Not what you think.
• Definition from Domain Driven Design
• “A group of related entities and value objects.”
• aggregate = document
What is a document?
• Not what you think.
• word document <> NoSQL document
Example of a document
{
"business_id": "rncjoVoEFUJGCUoC1JgnUA",
"full_address": "8466 W Peoria Ave\nSte 6\nPeoria, AZ 85345",
"open": true,
"categories": ["Accountants", "Professional Services", "Tax Services"],
"city": "Peoria",
"review_count": 3,
"name": "Peoria Income Tax Service",
"neighborhoods": [],
"longitude": -112.241596,
"state": "AZ",
"stars": 5.0,
"latitude": 33.581867000000003,
"type": "business"
}
A MongoDB Vocab lesson
Relational Term   MongoDB Equivalent
Database          Database
Tables            Collections
Rows              Documents
Facts about MongoDB that
will BLOW YOUR MIND!!
• No Schemas
• No transactions
• No joins
• Max document size of 16 MB
• Larger documents handled with GridFS
Facts about MongoDB that
are fairly mundane.
• Runs on most common OSs
• Windows
• Linux
• Mac
• Solaris
• Data stored as BSON (Binary JSON)
• used for speed
• translation handled by language drivers
Retrieving Data
SQL Statement                                  MongoDB command
SELECT * FROM table                            db.collection.find()
SELECT * FROM table WHERE Artist = 'Nirvana'   db.collection.find({Artist: "Nirvana"})
SELECT * FROM table ORDER BY Title             db.collection.find().sort({Title: 1})
DISTINCT                                       .distinct()
GROUP BY                                       .group()
>=, <                                          $gte, $lt
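To make the mapping above concrete, here is a toy sketch of how a query document is matched against documents. It runs on a plain Python list rather than a MongoDB server, it is not how MongoDB is implemented, and it only understands equality, $gte and $lt. The albums list, the Year field, and the helper names matches and find are assumptions made for illustration.

```python
def matches(doc, query):
    """Return True if doc satisfies a tiny subset of MongoDB query syntax."""
    for field, condition in query.items():
        if field not in doc:
            return False
        value = doc[field]
        if isinstance(condition, dict):
            # Operator form, e.g. {"Year": {"$gte": 1991, "$lt": 1993}}
            for op, operand in condition.items():
                if op == "$gte":
                    if not value >= operand:
                        return False
                elif op == "$lt":
                    if not value < operand:
                        return False
                else:
                    raise ValueError("unsupported operator: " + op)
        elif value != condition:
            # Plain form, e.g. {"Artist": "Nirvana"}
            return False
    return True


def find(collection, query=None):
    """Rough stand-in for db.collection.find(query)."""
    return [doc for doc in collection if matches(doc, query or {})]


# Hypothetical collection; the field names follow the slide's Artist/Title example.
albums = [
    {"Artist": "Nirvana", "Title": "Nevermind", "Year": 1991},
    {"Artist": "Nirvana", "Title": "In Utero", "Year": 1993},
    {"Artist": "Pearl Jam", "Title": "Ten", "Year": 1991},
]

nirvana = find(albums, {"Artist": "Nirvana"})                    # WHERE Artist = 'Nirvana'
by_title = sorted(find(albums), key=lambda d: d["Title"])        # ORDER BY Title
early = find(albums, {"Year": {"$gte": 1991, "$lt": 1993}})      # Year >= 1991 AND Year < 1993
artists = sorted({d["Artist"] for d in albums})                  # DISTINCT Artist

print([d["Title"] for d in nirvana])
print([d["Title"] for d in by_title])
print([d["Title"] for d in early])
print(artists)
```

The point to notice is that the query is itself a document: the WHERE clause becomes a nested structure of field names and operators instead of a string of SQL.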
Rules for building NoSQL
Data Structures
Rule 1: Every document must have an _id.
Rule 2: There is only one rule.
Designing NoSQL Data
Structures
• NoSQL data structures driven by application
design.
• Need to take into account necessary CRUD
operations
• To embed or not to embed. That is the question!
• Rule of thumb is to embed whenever possible.
• No modeling standards or CASE tools!
A (denormalized) embedded structure
An array of values
{
"business_id": "rncjoVoEFUJGCUoC1JgnUA",
"full_address": "8466 W Peoria Ave\nSte 6\nPeoria, AZ 85345",
"open": true,
"categories": ["Accountants", "Professional Services", "Tax Services"],
"city": "Peoria",
"review_count": 3,
"name": "Peoria Income Tax Service",
"neighborhoods": [],
"longitude": -112.241596,
"state": "AZ",
"stars": 5.0,
"latitude": 33.581867000000003,
"type": "business"
}
A (denormalized) embedded structure
An array of sub documents
{
"_id": "First Post",
"comments": [
{"author": "Bob", "text": "Nice Post!"},
{"author": "Tom", "text": "Dislike!"}
],
"comment_count": 2
}
This makes for a hairy query!
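A small sketch of why the query gets hairy, using plain Python dicts in place of a live collection. Reading one post with its comments is a single lookup, but asking a question about the comments themselves (here, everything Bob wrote) means digging into the array inside every post. The second post and the helper name comments_by are hypothetical.

```python
posts = [
    {
        "_id": "First Post",
        "comments": [
            {"author": "Bob", "text": "Nice Post!"},
            {"author": "Tom", "text": "Dislike!"},
        ],
        "comment_count": 2,
    },
    {
        # Hypothetical second post, added so the search spans more than one document.
        "_id": "Second Post",
        "comments": [{"author": "Bob", "text": "Still reading."}],
        "comment_count": 1,
    },
]


def comments_by(posts, author):
    """Walk every post and every embedded comment, keeping matches for one author."""
    return [
        (post["_id"], comment["text"])
        for post in posts
        for comment in post["comments"]
        if comment["author"] == author
    ]


bobs_comments = comments_by(posts, "Bob")
print(bobs_comments)
```

Embedding optimizes for the read the application does most often (show a post with its comments) at the cost of the reads it does rarely.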
A normalized structure
// db.post schema
{
"_id": "First Post",
"author": "Rick",
"text": "This is my first post."
}
// db.comments schema
{
"_id": ObjectId(...),
"post_id": "First Post",
"author": "Bob",
"text": "Nice Post!"
}
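With no joins, the normalized layout pushes the join into the application: one lookup for the post, a second for its comments. The sketch below imitates that with plain Python lists. The integer _id values stand in for ObjectId values, the second comment is borrowed from the embedded example, and the helper name post_with_comments is hypothetical.

```python
# Two separate "collections", mirroring the post and comments schemas above.
posts = [
    {"_id": "First Post", "author": "Rick", "text": "This is my first post."},
]
comments = [
    {"_id": 1, "post_id": "First Post", "author": "Bob", "text": "Nice Post!"},
    {"_id": 2, "post_id": "First Post", "author": "Tom", "text": "Dislike!"},
]


def post_with_comments(post_id):
    """Application-side join: fetch the post, then fetch its comments."""
    # Lookup 1: the post itself.
    post = next((p for p in posts if p["_id"] == post_id), None)
    if post is None:
        return None
    # Lookup 2: every comment that points back at the post.
    related = [c for c in comments if c["post_id"] == post_id]
    result = dict(post)
    result["comments"] = related
    return result


page = post_with_comments("First Post")
print(page["author"], [c["author"] for c in page["comments"]])
```

The trade-off runs the other way from embedding: comments are easy to query on their own, but showing one page costs two round trips.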
A polymorphic structure
• When all the documents in a collection are similarly, but
not identically, structured.
• Enables simpler schema migration.
• custom_field_1
• no more of this crap
• Better mapping of object-oriented inheritance and
polymorphism.
A polymorphic structure
// Page document (stored in nodes collection)
{
_id: 1,
title: "Welcome",
url: "/",
type: "page",
text: "Welcome to my wonderful wiki."
}
// Photo document (also stored in nodes collection)
{
_id: 3,
title: "Cool Photo",
url: "/photo.jpg",
type: "photo",
content: Binary(...)
}
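A rough sketch of how an application might consume a polymorphic collection like the one above: shared fields are read the same way for every document, and the type field decides how the rest is handled. Plain Python dicts stand in for the nodes collection, the bytes value is a placeholder for the photo's binary content, and the helper name describe is hypothetical.

```python
nodes = [
    {"_id": 1, "title": "Welcome", "url": "/", "type": "page",
     "text": "Welcome to my wonderful wiki."},
    # Placeholder bytes stand in for the real binary content of the photo.
    {"_id": 3, "title": "Cool Photo", "url": "/photo.jpg", "type": "photo",
     "content": b"\x89PNG"},
]


def describe(node):
    """Branch on the type field; each type carries its own extra fields."""
    if node["type"] == "page":
        return f'{node["title"]}: page with {len(node["text"])} characters of text'
    if node["type"] == "photo":
        return f'{node["title"]}: photo with {len(node["content"])} bytes'
    raise ValueError("unknown node type: " + node["type"])


# Fields shared by every document can be read without caring about the type.
titles = [node["title"] for node in nodes]
print(titles)
for node in nodes:
    print(describe(node))
```

Adding a new kind of node means adding documents with a new type value and one more branch in the application, with no migration of the existing documents.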
rmongodb
• Two packages available
• RMongo = Dodge Omni
• rmongodb = Porsche
• rmongodb usage example
Final Thoughts
• Data Architects should NOT be designing
NoSQL data structures
• Are NoSQL DBs going to totally replace
RDBMS?
• Polyglot Persistence
Questions?
• You should consider this presentation a book
report.
• I’ve only been studying this stuff for a month.
• I MIGHT have an answer to your question.
• I might not...