Hadoop, Hive, JSON, and Data!
Oh, my!!
TJay Belt
Database Administrator at Imagine Learning
eMail me
[email protected]
Read me
http://tjaybelt.blogspot.com
Follow me
@tjaybelt
Thanks to our Sponsors!
Yearly Partners
Gold Sponsors
Overview
Big Data ecosystem
30,000-foot view of our ecosystem
Issues found along the way
JSON (JavaScript Object Notation)
A lightweight data-interchange format. It is easy for humans to read and write, and easy for machines to parse and generate.
JSON (JavaScript Object Notation)
{
  "_id": "00000000-0000-0000-0000-000000000000",
  "Revision": 12,
  "ModelData": {
    "GradeLevel": "Kindergarten",
    "FirstLanguage": "English"
  },
  "SetTheStageData": {
    "LastSetTheStageLibraryWords": 1,
    "LastSetTheStageTakeATest": 0
  }
}
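To make "easy for machines to parse" concrete, here is a minimal sketch that loads the sample document with Python's standard `json` module and reads two fields. The variable names (`raw`, `doc`, `grade_level`, `revision`, `round_tripped`) are illustrative only and are not part of the original talk. The document is typed with plain ASCII double quotes, which a JSON parser requires.

```python
import json

# Sample document from the slide, typed with plain ASCII double quotes.
raw = """
{
  "_id": "00000000-0000-0000-0000-000000000000",
  "Revision": 12,
  "ModelData": {
    "GradeLevel": "Kindergarten",
    "FirstLanguage": "English"
  },
  "SetTheStageData": {
    "LastSetTheStageLibraryWords": 1,
    "LastSetTheStageTakeATest": 0
  }
}
"""

# Parse the text into nested Python dicts.
doc = json.loads(raw)

# Nested objects are reached by chaining keys.
grade_level = doc["ModelData"]["GradeLevel"]
revision = doc["Revision"]

# Serialize back to text to show the format round-trips.
round_tripped = json.dumps(doc, indent=2)

print(grade_level, revision)
```

A curly "smart" quote pasted in from a slide or word processor would make `json.loads` raise `json.JSONDecodeError`, which is a common stumbling block when copying JSON out of presentations.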
JSON (JavaScript Object Notation)
"TestInstances": [{
  "Product": "ILE",
  "Lesson": "30698aac-5a3d-4464-935c-16de4ba9db70",
  "LessonBranch": "Main",
  "TestType": "PlacementTest",
  "TimeStarted": "2015-11-13T15:16:51.8757165+00:00",
  "TimeCompleted": "2015-11-13T15:26:29.9646995+00:00",
  "TestInstanceId": "1",
  "TestSectionInstances": [{
    "TestSection": "Letter Recognition",
    "TestQuestionInstances": [{
      "TestQuestion": "q43",
      "TimeStarted": "2015-11-13T15:17:24.965+00:00",
      "TimeCompleted": "2015-11-13T15:17:33.432+00:00",
      "TestOptionInstances": [{
        "ClickCount": 1,
        "IsSelected": false,
        "ResponseLatency": 0,
        "TestOption": "opt256"
      },
      {
        "ClickCount": 1,
        "IsSelected": false,
        "ResponseLatency": 0,
        "TestOption": "opt258"
      },
      {
        "ClickCount": 1,
        "IsSelected": false,
        "ResponseLatency": 0,
        "TestOption": "opt257"
      },
      {
        "ClickCount": 1,
        "IsSelected": true,
        "ResponseLatency": -8467,
        "TestOption": "opt253"
      },
      {
        "ClickCount": 1,
        "IsSelected": false,
        "ResponseLatency": 0,
        "TestOption": "opt255"
      },
      {
        "ClickCount": 1,
        "IsSelected": false,
        "ResponseLatency": 0,
        "TestOption": "opt254"
      }]
    },
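The nesting in this fragment (test, section, question, option) is what makes such documents awkward to query as rows. As a rough sketch only, assuming the goal is one flat row per option, the arrays can be walked level by level while carrying the parent keys along. The `flatten_options` helper and the trimmed-down `doc` below are hypothetical; they reuse field names from the slide but are not the approach described in the talk.

```python
# Trimmed-down stand-in for one document; field names come from the slide,
# but only two of the six options are kept to stay short.
doc = {
    "TestInstances": [{
        "TestType": "PlacementTest",
        "TestInstanceId": "1",
        "TestSectionInstances": [{
            "TestSection": "Letter Recognition",
            "TestQuestionInstances": [{
                "TestQuestion": "q43",
                "TestOptionInstances": [
                    {"ClickCount": 1, "IsSelected": False,
                     "ResponseLatency": 0, "TestOption": "opt256"},
                    {"ClickCount": 1, "IsSelected": True,
                     "ResponseLatency": -8467, "TestOption": "opt253"},
                ],
            }],
        }],
    }],
}


def flatten_options(document):
    """Yield one flat dict per option, carrying parent keys along."""
    for test in document.get("TestInstances", []):
        for section in test.get("TestSectionInstances", []):
            for question in section.get("TestQuestionInstances", []):
                for option in question.get("TestOptionInstances", []):
                    yield {
                        "TestInstanceId": test["TestInstanceId"],
                        "TestType": test["TestType"],
                        "TestSection": section["TestSection"],
                        "TestQuestion": question["TestQuestion"],
                        "TestOption": option["TestOption"],
                        "IsSelected": option["IsSelected"],
                        "ResponseLatency": option["ResponseLatency"],
                    }


rows = list(flatten_options(doc))
for row in rows:
    print(row)
```

Each nested array multiplies the row count, so a document with many tests, sections, questions, and options expands into a large number of flat rows.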
Blob Storage
Reliable, cost-effective cloud storage for large amounts of unstructured data.
Microsoft Azure Cloud
MongoDB
MongoDB (from "humongous") is a cross-platform, document-oriented database.
It is classified as a NoSQL database: it eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas, which makes integrating data in certain types of applications easier and faster.
Hadoop
A Java-based programming framework that supports the processing of large data sets in a distributed computing environment.
MapReduce
A programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
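The programming model can be sketched on a single machine. The example below is a toy illustration, not Hadoop code: it counts records per `GradeLevel` using an explicit map step, a shuffle (group by key) step, and a reduce step. The `records` list, including the "First" grade value, and the function names `map_phase`, `shuffle`, and `reduce_phase` are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical input records; on a real cluster these would be spread
# across many machines.
records = [
    {"GradeLevel": "Kindergarten"},
    {"GradeLevel": "First"},
    {"GradeLevel": "Kindergarten"},
]


def map_phase(record):
    """Map: emit a (key, 1) pair for each input record."""
    yield record["GradeLevel"], 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


pairs = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

On a cluster, the map and reduce functions run in parallel on different nodes and the framework performs the shuffle between them; the logic per record and per key stays the same.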
Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
It supports queries expressed in a SQL-like language called HiveQL, which Hive automatically translates into MapReduce jobs executed on Hadoop.
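When JSON documents are stored as text, one common way to query them from Hive is to extract individual fields by path; Hive provides a built-in `get_json_object` function for this. The Python sketch below only imitates that idea for simple dotted paths such as `$.ModelData.GradeLevel`. It is not Hive's implementation, the helper name `get_json_value` is hypothetical, and whether the talk used this function is an assumption.

```python
import json


def get_json_value(json_text, path):
    """Return the value at a dotted path like '$.ModelData.GradeLevel'.

    Returns None when the text is not valid JSON or the path is missing.
    Only plain object keys are supported (no array indexes or wildcards).
    """
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    try:
        current = json.loads(json_text)
    except json.JSONDecodeError:
        return None
    # Drop the leading '$' and walk one key at a time.
    for key in filter(None, path[1:].split(".")):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


line = '{"Revision": 12, "ModelData": {"GradeLevel": "Kindergarten"}}'
grade = get_json_value(line, "$.ModelData.GradeLevel")
missing = get_json_value(line, "$.ModelData.FirstLanguage")
print(grade, missing)
```

Extracting by path re-parses the text for every field requested, so for deeply nested documents it may be worth flattening the data once instead of parsing it on every query.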
What do we have?
Things we tried
SQL Server JSON procs
SlamData
PowerQuery
DocumentDB
MongoDirector
SQL Azure
Issues I encountered
Thank You!
TJay Belt
Cell: (801) 735-9439
eMail: [email protected]
Blog: http://tjaybelt.blogspot.com
Linked In: www.linkedin.com/in/tjaybelt
Twitter: @tjaybelt
Skype: tjaybelt
Google+: link