MSBIC Hadoop Series Getting Started

MSBIC Hadoop Series
Getting Started
Bryan Smith
@smithbryanc
MSBIC Hadoop Series
http://msbic.sqlpass.org/
Learn the basics of Hadoop through a combination of demonstration and lecture.
Session participants are invited to follow along using emulation environments
and Azure-based clusters. We will cover how to set these up in our first session.
March – Getting Started
April – Understanding the File System
May – Implementing MapReduce Jobs
June – Querying the Data with Hive
July – Processing the Data with Pig
August – On Vacation
September – Hadoop & MS BI
October – To Be Announced
November – Loading Social Media Data
December – DW Integration
Today’s Session
Objectives:
1. Establish basic concepts & terminology
2. Set up a development environment for further learning
What is Hadoop?
Elastic, massively-parallel data storage & processing platform
Key Benefits:
• Store data without first having to define structure
• Leverage combined resources of multiple servers
• Expand or contract capacity without outage or data loss
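The first benefit, storing data before defining its structure, is often called schema-on-read. Below is a minimal sketch of the idea in plain Python. It is not Hadoop code, and the sample log lines and field names are hypothetical: the raw lines are kept untouched, and each reader applies its own structure at read time.

```python
# Schema-on-read sketch: raw records are stored as-is, and structure is
# applied only when the data is read. Plain Python, not a Hadoop API;
# the sample log lines and field names are hypothetical.

RAW_LINES = [
    "2014-03-01,alice,login",
    "2014-03-01,bob,purchase,19.99",
    "2014-03-02,alice,logout",
]


def read_as_events(lines):
    """Interpret each line as (date, user, action), ignoring extra fields."""
    for line in lines:
        date, user, action = line.split(",")[:3]
        yield {"date": date, "user": user, "action": action}


def read_as_purchases(lines):
    """Interpret the same lines as purchases, keeping only rows with an amount."""
    for line in lines:
        fields = line.split(",")
        if len(fields) == 4 and fields[2] == "purchase":
            yield {"user": fields[1], "amount": float(fields[3])}


if __name__ == "__main__":
    print(list(read_as_events(RAW_LINES)))
    print(list(read_as_purchases(RAW_LINES)))
```

Both readers work from the same stored lines, so a new interpretation of the data can be added later without reloading or restructuring what was stored.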
IMHO …
Ideal When:
• Query performance & ACID compliance not critical
• Don’t know all that I might do with data
• Data subject to variable interpretation
• Preprocessing of data (through ETL) not economical
• Data volumes leave few alternatives
Common Enterprise Applications:
• Staging for voluminous, low per-record value data or complex data
• Online archive of high volume data sets
• Data Scientist workbench
• Data Lake applications
Architectural Basics
[Diagram: a Job, a Name Node, and several Data Nodes running Tasks; blocks labeled X, Y, and Z each appear on more than one Data Node.]
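The split between a Name Node that tracks where blocks live and Data Nodes that hold copies of them can be illustrated with a toy model. This is a sketch in plain Python, not Hadoop code. It assumes round-robin placement and a replication factor of 3 (the HDFS default); real HDFS placement is rack-aware, and the class and node names are hypothetical.

```python
# Toy model of the Name Node / Data Node split. Not Hadoop code.
# Assumptions: round-robin placement and a replication factor of 3
# (the HDFS default); real HDFS placement is rack-aware.


class ToyCluster:
    def __init__(self, node_names, replication=3):
        if replication > len(node_names):
            raise ValueError("replication cannot exceed the number of data nodes")
        self.replication = replication
        # Data nodes hold the block contents.
        self.data_nodes = {name: {} for name in node_names}
        # The name node holds only metadata: block id -> list of node names.
        self.block_map = {}

    def write_block(self, block_id, content):
        """Copy a block to several data nodes and record where it went."""
        names = sorted(self.data_nodes)
        count = min(self.replication, len(names))
        start = len(self.block_map) % len(names)
        targets = [names[(start + i) % len(names)] for i in range(count)]
        for name in targets:
            self.data_nodes[name][block_id] = content
        self.block_map[block_id] = targets

    def fail_node(self, name):
        """Simulate losing a data node; its replicas disappear with it."""
        del self.data_nodes[name]
        for targets in self.block_map.values():
            if name in targets:
                targets.remove(name)

    def read_block(self, block_id):
        """Read from the first surviving replica listed by the name node."""
        for name in self.block_map[block_id]:
            return self.data_nodes[name][block_id]
        raise IOError(f"no live replica of block {block_id}")


if __name__ == "__main__":
    cluster = ToyCluster(["node1", "node2", "node3", "node4"])
    for block_id in ("X", "Y", "Z"):
        cluster.write_block(block_id, f"data-{block_id}")
    cluster.fail_node("node2")
    for block_id in ("X", "Y", "Z"):
        print(block_id, cluster.read_block(block_id), cluster.block_map[block_id])
```

Because every block has copies on other nodes, all three blocks remain readable after one node is removed, which is the idea behind changing capacity without data loss.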
Hadoop Ecosystem
• File System (HDFS)
• Job Execution (MapReduce)
• Query (Hive)
• Scripting (Pig)
• Metadata Services (HCatalog)
• Non-Relational Database (HBase)
• Data Integration (Flume, Sqoop)
• Workflow & Scheduling (Oozie)
• Management & Monitoring (Ambari, ZooKeeper)
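To make the Job Execution (MapReduce) layer more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases using the classic word-count example. It is plain Python rather than the Hadoop API, and the function names are hypothetical. On a real cluster, map and reduce calls run as parallel tasks on many nodes and the framework performs the shuffle.

```python
# Word count expressed as map, shuffle, and reduce steps.
# Single-process illustration only; function names are hypothetical.
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run the three phases in order over a list of input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["big data", "Big cluster"]))
```

Because each map call looks at only one line and each reduce call at only one key, the work can be split across machines without the pieces needing to coordinate.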
Hadoop Distributions
Apache Hadoop
• Hortonworks HDP: Microsoft, Teradata, SAP
• Cloudera CDH: Oracle, IBM*
• MapR M-Series: Amazon
*IBM promotes IBM InfoSphere BigInsights over Cloudera CDH
Microsoft & Hortonworks
• Hortonworks HDP on Windows
• HDInsight on Azure
• HDInsight on PDW
• HDInsight Emulator
Setting Up the HDInsight Emulator
Before Installation:
• Have a fully patched Windows 8 or 8.1 development environment
• Install Visual Studio 2010, 2012 or 2013, Express or higher edition
• Install NuGet Package Manager
Install HDInsight Emulator here
• Step-by-step installation instructions provided here
After Installation:
• Update the Hadoop services account so that its password does not expire
Setting Up an Azure HDInsight Cluster
Azure HDInsight Cluster
Azure SQL Database (Optional & Pre-Existing)
Azure Blob Storage
All components must be deployed within the same Azure data center
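The same-data-center requirement is easy to get wrong when components are created at different times, so a pre-deployment check can help. The sketch below is plain Python and does not call any Azure API; the component names and region strings are hypothetical placeholders for whatever your deployment plan records.

```python
# Pre-deployment sanity check: every planned component must name the
# same data center. No Azure API is called; the component names and
# region strings used here are hypothetical.


def check_same_region(components):
    """Return the shared region, or raise if the components span regions."""
    regions = set(components.values())
    if len(regions) != 1:
        raise ValueError(f"components span multiple regions: {sorted(regions)}")
    return regions.pop()


if __name__ == "__main__":
    plan = {
        "hdinsight_cluster": "East US",
        "blob_storage": "East US",
        "sql_database": "East US",
    }
    print(check_same_region(plan))
```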
Accessing Azure
Two Primary Options:
• Organizational Subscription
• Individual Subscriptions
  - MSDN Subscription ($150 credit/month)
  - BizSpark Subscription ($150 credit/month)
  - Trial Subscription ($200 credit for 30-day period)
  - Paid Subscriptions
For Next Session
Topic:
• Understanding the File System
• Loading Data to Hadoop
Requested Action(s):
• Come with a working HDInsight Emulator