February 25th, 2017
Big Data with Azure: where to begin?
25th February 2017
Pordenone, Italy
Concepts and best practices
Satya SK Jayanty
Principal Architect & Managing Consultant
[email protected]
#sqlsat589
Organizers
Speaking Engagements
Authored
http://tinyurl.com/sql2k8r2admincookbook
http://tinyurl.com/sql2012InstantCubeSecurity
http://www.manning.com/delaney/
Agenda…
.what agenda?
..
…
….
…..
……. no agenda!
..…... you like: small data… big data… all data!
…….….that’s why you are here today
What differentiates today’s thriving organizations?
Data.
Data in all forms & sizes is being generated faster than ever before
Capture & combine it for new insights & better, faster decisions
Strategic opportunity with Big Data…
Cloud
Mobile
Social
Big data
How do you use technology innovation… to architect business innovation?
Increased productivity
Customer growth
Real-time insights
Embrace new models
Be prepared to blow your mind!?!
Big Data Eco-system
Copyright: IBM
Big Data Components with Hadoop
Big Data Mountain
Handle traditional data to big data?
[Figure: data volume (Megabytes, Gigabytes, Terabytes, Petabytes) against Data Complexity: Variety and Velocity. Big Data sources shown: click stream, wikis/blogs, sensors/RFID/devices, social sentiment, audio/video, log files, spatial and GPS coordinates, data market feeds, eGov feeds, weather, text/image.]
Data Warehouse traditional approach
Peak points of traditional approach
Breaking points of traditional approach
Added points of traditional approach
Evolving Approaches to Analytics
ETL: Original Data → Extract, Transform, Load with an ETL Tool (SSIS, etc) → Transformed Data in the EDW (SQL Svr, Teradata, etc) → Data Marts → BI Tools, Dashboards
Data Lake(s): Original Data and Streaming data → Ingest (EL) → Scale-out Storage & Compute (HDFS, Blob Storage, etc) → Transform & Load → Apps
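The difference between the two approaches on this slide can be sketched in a few lines of Python. This is only an illustration: the sample records, field names and helper functions are invented here and do not come from any Azure or Hadoop API. ETL cleans the data before it is stored; the data lake approach ingests the original data as-is and transforms it later.

```python
# Hypothetical sample records; the field names are invented for illustration.
RAW_EVENTS = [
    {"user": "anna", "amount": "12.50", "country": "it"},
    {"user": "marco", "amount": "7.00", "country": "IT"},
    {"user": "li", "amount": "oops", "country": "cn"},  # malformed amount
]


def transform(record):
    """Clean one record; return None if the amount cannot be parsed."""
    try:
        amount = float(record["amount"])
    except ValueError:
        return None
    return {
        "user": record["user"],
        "amount": amount,
        "country": record["country"].upper(),
    }


def etl(records):
    """ETL: transform first, load only the cleaned rows into the warehouse."""
    warehouse = []
    for record in records:
        row = transform(record)
        if row is not None:
            warehouse.append(row)
    return warehouse


def ingest(records):
    """EL: copy the original data unchanged into the data lake."""
    return list(records)


def transform_and_load(lake):
    """Transform & Load, applied later when a consumer needs the data."""
    return [row for row in map(transform, lake) if row is not None]


warehouse = etl(RAW_EVENTS)
lake = ingest(RAW_EVENTS)
```

Note that the lake still holds the malformed record, so a different transformation can be applied to it later; the warehouse has already discarded it.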
Introducing Big Data
“Big data is a collection of data sets so large and complex that it becomes awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analysis, and visualization.” – Wikipedia
Cheap Storage
Inexpensive Computing
Sensor Networks
> 2 billion users
Enormous amounts of data
• online behavior of social networking users
• samples of medical ailments
• purchasing habits of grocery shoppers
• crime statistics of cities
• “internet of things” (IoT)
• 24/7 out-patient monitors
• real-time telemetric devices
90% of the data in the world has been created in the last 2 years
5 V’s
Cloud Computing Patterns
Application building blocks
Introducing Apache Hadoop
Hadoop stores files in a distributed file system
Hadoop can store very large amounts of data
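To make "stores files in a distributed file system" concrete, here is a toy Python sketch of the idea: a large file is cut into fixed-size blocks and each block is copied to several machines. The 128 MB block size and 3 copies are assumed values, the node names are made up, and the round-robin placement is a deliberate simplification, not the placement policy HDFS actually uses.

```python
import math

BLOCK_SIZE_MB = 128  # assumed block size for this sketch
REPLICATION = 3      # assumed number of copies kept of each block


def place_blocks(file_size_mb, nodes):
    """Split a file into blocks and assign each block to REPLICATION nodes.

    Round-robin placement is a simplification for illustration only.
    """
    n_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    placement = {}
    for block in range(n_blocks):
        placement[block] = [
            nodes[(block + i) % len(nodes)] for i in range(REPLICATION)
        ]
    return placement


# A 1000 MB file spread over four hypothetical nodes.
layout = place_blocks(1000, ["node1", "node2", "node3", "node4"])
```

Because every block lives on more than one node, losing a single machine does not lose data, and adding machines adds capacity.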
Industry use cases of Hadoop
Financial services
Retail
Healthcare
Telecom
Utilities, oil and gas
Manufacturing
Public sector
Introducing Hadoop
Comparison to Traditional RDBMS
[Table comparing TRADITIONAL RDBMS and HADOOP on: Data Size, Access, Updates, Structure, Integrity, Scaling, DBA Ratio]
Data variety
Data velocity
Hadoop is a platform with a portfolio of projects
• Hadoop Common – utilities to support the other modules
• HDFS (Hadoop Distributed File System) – high throughput
• YARN – job scheduling and cluster resource management
• MapReduce – YARN-based system for parallel processing
• Spark – compute engine
• Pig – data-flow language & execution framework
• Oozie – workflow scheduler
• Ambari – provisioning, managing and monitoring clusters
• Sqoop – bulk data transfer between Hadoop & relational databases
• Batch processing centric – using a “Map-Reduce” processing paradigm
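The “Map-Reduce” paradigm named above can be sketched with the classic word count. This is a single-process Python illustration of the three phases (map, shuffle, reduce); the function names are invented for this sketch and it does not use the Hadoop API. On a real cluster the map and reduce calls would run in parallel on many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data small data", "all data"])
```

Each map call looks at one line only and each reduce call at one key only, which is what lets the work be split across a cluster.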
A look at SQL and NoSQL
Getting Started with HDInsight
Introducing Azure HDInsight
100% Apache Hadoop
Powered by the cloud
Immersive insights
Position in Cloud
A Holistic Big Data Solution from Microsoft
Hadoop on Windows
HDInsight supports Hive
Hadoop 2.0
HDInsight supports HBase
[Diagram: HMaster, Coordination, Name Node and Job Tracker, plus four sets of Region Server, Data Node and Task Tracker.]
HDInsight supports Mahout
HDInsight supports Storm
TCO, Deployment & Geo-Redundancy
$£€¥
Connect cloud Hadoop with on-premises
Hybrid Compatibility
Microsoft is the only vendor with enterprise on-premises and cloud big data offerings
Name=Andy
Pnid=123456
123456
Hadoop On Premises
4712
HDInsight in Azure
HDP
Spark
Storm
…
The overall architectural challenge of Big Data is that just as data can vary, so must architectures.
Batch: Hadoop can operate with O(n) over petabytes of data
Interactive: Drill, Stinger and Tez bring much quicker querying
Real Time: Storm allows enormous scalable throughput
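The contrast between the batch and real-time styles can be illustrated with a small Python sketch. It assumes a made-up list of sensor readings and uses no Hadoop or Storm API: the batch function makes one O(n) pass over all stored data, while the streaming class updates its answer with constant work as each event arrives.

```python
class RunningMean:
    """Streaming style: constant work per event, no stored history."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        """Fold one new event into the current mean."""
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


def batch_mean(values):
    """Batch style: one O(n) pass over the full data set."""
    return sum(values) / len(values)


readings = [10.0, 20.0, 30.0, 40.0]  # hypothetical sensor values

stream = RunningMean()
for reading in readings:
    stream.update(reading)
```

Both routes arrive at the same mean; the difference is when the work happens and how much data has to be kept around to do it.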
Bringing Hadoop to a billion people
Introducing the zoo: HDInsight/Hadoop Eco system
Legend
Red = Core Hadoop
Blue = Data processing
Green = Packages
Purple = Microsoft integration points and value adds
Orange = Data Movement
Distributed Processing (MapReduce)
Distributed Storage (HDFS)
Programming HDInsight
Since HDInsight is a service-based implementation, you get immediate access to the tools you need to program against HDInsight/Hadoop
Existing Ecosystem
• Hive, Pig, Sqoop, Mahout, Cascading, Scalding, Scoobi, Pegasus, etc.
.NET
• C#, F# Map/Reduce, LINQ to Hive, .Net Management Clients, etc.
JavaScript
• JavaScript Map/Reduce, Browser-hosted Console, Node.js management clients
DevOps/IT Pros:
• PowerShell, Cross-Platform CLI Tools
Other Microsoft data science tools
Hadoop in the cloud
+ Storm (real-time analytics)
+ HBase (NoSQL)
+ Mahout (ML!)
Power BI: Power Query, Power View, and Dashboards
Excel
Azure Data Factory (ETL in the cloud)
Analytics Platform System (SQL Server on steroids + Hadoop + hardware)
Streaming data originating in the cloud
Based on HDInsight/Hadoop
Azure ML
Machine Learning in Azure
Pre-process data
Engineer features
Modelling ≣ machine learning ≣ data mining
Run R
Run Python
Experiments
Web Services
Limited: data size, experiment duration, scalability, speed
Relatively inexpensive, can be free
Challenges with implementing Hadoop
Big Data on-premises concerns include:
• Hardware costs
• IT and operational costs in setting up a machine cluster and supporting it
• Cost of personnel to work on the ecosystem
Barriers to Hadoop:
• Skills gap
• Weak business support
• Security concerns
• Data management hurdles
• Tool deficiencies
• Containing costs
Why Hadoop in the cloud?
The Microsoft data platform
Applications
Reports
Dashboards
Natural language query
Mobile
Complex event processing
Modeling
Machine learning
Data
Orchestration
Information management
Relational
Non-relational
NoSQL
Streaming
Internal & external
Cortana Analytics Suite
Transform data into intelligent action
DATA
INTELLIGENCE
ACTION
Azure Data Factory
A managed cloud service for building & operating data pipelines
Part of the Cortana Analytics Suite
PolyBase and queries
Provides a scalable, T-SQL-compatible query processing framework for combining data from both universes (relational and non-relational)
Access any data
Agnostic architecture
PolyBase is agnostic = No vendor lock in
PolyBase supports Hadoop on Linux & Windows
PolyBase integrates with the cloud
PolyBase supports HDInsight in APS & external Hadoop clusters
PolyBase builds the bridge
Just-in-Time data integration
Across relational and non-relational data
High performance parallel architecture
Fast, simple data loading
PolyBase = run time integration
Best of both worlds
Uses computational power at source for both relational data & Hadoop
Opportunity for new types of analysis
Uses existing analytical skills
Includes Power BI
Familiar SQL semantics & behaviour
Query with familiar tools
SSDT
PolyBase
[Diagram: User Perspective (External Table, External Data Source, External File Format) and Systems Perspective (PDW Engine, PDW Service, Bridge).]
What is R?
Extensible via packages
Talented community of contributors
High accuracy ML classifiers
Big data analytics
Open source implementation
Top tool for machine learning
OOL for statistical computing
In-memory analytics
Industry standard for computational mining
Amazing data-visualization capabilities
What is R?
Better
http://www.rstudio.com/
Rattle makes it even easier
Core R: the purest version: http://cran.r-project.org/
Revolution Analytics: parallelism & performance: http://www.revolutionanalytics.com/
Azure ML: built-in
Why is R famous?
R plotting
Box plot
Bar plot
Histogram
Contour
Dot plot
Mosaic
Scatter
Latticist
http://homes.cs.washington.edu/~jheer//files/zoo/?utm_source=twitterfeed&utm_medium=twitter
Revolution R Enterprise and SQL
Big data analytics platform
Based on open source R
High-performance, scalable, full-featured
Statistical and machine-learning algorithms are performant, scalable, and distributable
Write once, deploy anywhere
Scripts and models can be executed on a variety of platforms, including non-Microsoft (Hadoop, Teradata in-DB)
Integration with the R Ecosystem
Analytic algorithms are accessed via R functions with similar syntax for R users. Arbitrary R functions/packages can be used in conjunction
Advanced analytics
SQL Server 2016 R integration scenario
Exploration
Use RRE from an R IDE to analyze large datasets and build predictive and embedded models, with the compute happening on the SQL Server machine (SQL Server compute context)
Operationalization
Developers can operationalize an R script/model over SQL Server data by using T-SQL constructs
DBAs can manage resources, and secure and govern R runtime execution in SQL Server
R script library in Microsoft Azure Marketplace
Example solutions
• Fraud detection
• Sales forecasting
• Warehouse efficiency
• Predictive maintenance
Extensibility
• R Integration: Launch External Process
• New R scripts
• Microsoft Azure Machine Learning Marketplace
Built into SQL Server
• Analytic library
• T-SQL interface
• Relational data
Benefits
• Data Scientist: interacts directly with data
• Data Developer/DBA: manages data and analytics together
• Faster deployment of ML models
• Faster performance (moves compute close to the data)
• Improved scalability
Advanced analytics
Machine learning tools
Open source
• R – considered best fit
• Python
• Monte Carlo Machine Learning Library
• H2O
• Weka
• Octave-Forge
Commercial
• Microsoft Azure Machine Learning
• SAS Enterprise Miner
• IBM SPSS Modeler
• RapidMiner
• Apache Mahout
• MATLAB
• Oracle Data Mining
Rich Services
Heterogeneity
Integrate with on-premises
Lower Your Risk
Scaling
Azure – in hawk-eye mode
Platform Services
Security & Management
Cloud Services
Service Fabric
Hybrid Operations
Web Apps
API Apps
Portal
Azure Active Directory
Batch
SQL Database
Data Warehouse
DocumentDB
Redis Cache
Azure Search
Storage Tables
RemoteApp
Mobile Apps
Logic Apps
Azure AD B2C
Multi-Factor Authentication
Automation
Scheduler
AD Privileged Identity Management
Domain Services
Storage Queues
BizTalk Services
Hybrid Connections
Service Bus
Key Vault
Store/Marketplace
Azure AD Health Monitoring
Media Services
Content Delivery Network (CDN)
API Management
Notification Hubs
Backup
HDInsight
Visual Studio
Azure SDK
VS Online
App Insights
Data Factory
Machine Learning
Event Hubs
IoT Hub
Stream Analytics
Data Lake
Data Catalog
Mobile Engagement
Operational Analytics
Import/Export
Azure Site Recovery
StorSimple
VM Image Gallery & VM Depot
Infrastructure Services
Summary
Big Data refers to data sets so large and/or complex that they become awkward to work with in conventional ways
Hadoop and HDInsight = Microsoft’s answer to Big Data
Hadoop can store petabytes of data reliably and execute huge distributed computations
However – Big Data query results often involve significant latency
Power BI includes authoring add-ins to query, analyze and visualize data sourced from Azure HDInsight
Preload data in advance of business user queries
Big Data is just another data source!
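The advice to preload data can be sketched as follows. This Python example is only an illustration: the slow query is a stand-in with an artificial delay and invented sales figures, not a call to HDInsight or Power BI. The slow work runs once ahead of time (for example in a scheduled job), and business users are then served from the stored results.

```python
import time


def slow_big_data_query(region):
    """Stand-in for a high-latency query; the delay and data are invented."""
    time.sleep(0.01)
    sales = {"north": 120, "south": 80}
    return sales[region]


def preload(regions):
    """Run the slow queries ahead of time and keep the results."""
    return {region: slow_big_data_query(region) for region in regions}


# Preloading step, e.g. run on a schedule before users start work.
cache = preload(["north", "south"])


def dashboard_lookup(region):
    """Business-user query: answered from the preloaded results."""
    return cache[region]
```

The trade-off is freshness: preloaded results are only as current as the last preload run.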
Resources
Microsoft Big Data web site
http://www.microsoft.com/en-us/server-cloud/solutions/big-data.aspx
Azure HDInsight web site
http://azure.microsoft.com/en-us/documentation/services/hdinsight/
Hortonworks tutorials
http://hortonworks.com/tutorials
Numerous tutorials are available to learn about Big Data by using the Hortonworks Sandbox
Follow me
@SQLMaster
www.sqlserver-qa.net
Sponsors
Q&A
THANKS!