Polybase in the Modern Data Warehouse
Download
Report
Transcript Polybase in the Modern Data Warehouse
I.
o
o
II.
o
o
o
III.
o
o
o
New Insights through
Big Data
4
New World of Big Data & DW – Yet another ‘Hype’?
BI and analytics
Data warehouse
ETL
… data warehousing has reached the
most significant tipping point since its
inception. The biggest, possibly most
elaborate data management system in
IT is changing.
– Gartner, “The State of Data Warehousing in 2012”
Data sources
5
Impact on traditional DW
BI and analytics
3
Data warehouse
Near real-time
insights
ETL
1
Increasing
data volumes
Data sources
2
New data
sources and types
Non-relational data
4
Cloud-born
data
6
The modern Data Warehouse
Data sources
Non-relational data
When to use
what?
Gartner Survey
April 2014
10%
4%
5%
15%
57%
Piloting on premise
Piloting in the cloud
Production on-premise with cluster
Production on premise with
appliance
14%
2014 –
think Hadoop will
replace existing DW solution
(2013: 10%)
↓ 50% decline
even with the presence of Hadoop 2.0
60%
50%
→ primarily includes
SQL over HDFS (!)
40%
30%
20%
10%
0%
Interactive
Analytics
DBMS
Stream Processing
Search
Graph applications
Challenges for a modern data warehouse
Keep existing &
legacy investment
Limited
scalability and ability to
handle new data types
Acquire Big Data
solution
Buy new tier-one
hardware appliance
Acquire business
intelligence & tools
Significant training
and data silos
High acquisition and
migration costs
Complex with low
adoption
Microsoft’s Analytics Platform
Systems (APS)
12
Introducing the Microsoft Analytics Platform System
The turnkey modern data warehouse appliance
Enterprise-ready Big
Data
Next-generation
performance at scale
Engineered for
optimal value
• Relational and non-relational
data in a single appliance
• Near real-time performance
with In-Memory Columnstore
• Enterprise-ready Hadoop aka
HDInsight Region (HDI)
• Ability to scale out to
accommodate growing data
• Industry’s lowest data
warehouse appliance price per
terabyte
• Integrated querying across
HDI and PDW via PolyBase
• Removal of data warehouse
bottlenecks with MPP SQL
Server
• Integration with Microsoft
BI/Excel & 3rd party BI tools
• Concurrency that fuels rapid
adoption
• Value through a single
appliance solution
• Value with flexible hardware
options using commodity
hardware
Core APS Features
MPP and In-Memory Columnstore for best-in-class performance
Columnstore index representation
100x
Up to
faster queries
C1
C2
C3
C4 C5 C6
15x
Up to
more
compression
Updateable clustered columnstore vs. table with customary indexing
Parallel query execution
•
Query
•
Results
•
Store data in columnar format for
massive compression
Load data into or out of memory for
next-generation performance with up to
60% improvement in data loading speed
Updateable and clustered for real-time
trickle loading
Core APS Features
HDInsight Region
What’s HDInsight in the appliance?
o Hadoop bits run on same HW as PDW region
o Separate appliance region – interconnected via
Infiniband to PDW region
o Hortonworks HDP 2.0 bits on Windows Server
– including YARN execution framework
o PolyBase certified with HDInsight on-premise
15
Core APS Features
Your
Apps
PolyBase for the (Big) Data Continuum
1. Querying all of data sources
through APS/PDW
o allowing to query relational & nonrelational stores (on-premise) with
single query interface
o On-premise data sources include
not only Hadoop, but also RDBMSs
or document-oriented stores*
cloud service
HDInsight
Azure SQL Database
(Hadoop)
Azure Storage
2.
Microsoft Azure
1.
On-premise
Your
Apps
IBM DB2
2. APS/PDW as cloud-attach for
Azure
o allowing to scale storage &
leverage cloud compute
(SQL DB, Azure HDI, Azure DocDB,
Azure ML, Azure Search)
o Trickle effect (data migration from
on-premise to Azure)
Oracle
Hadoop
APS with
PolyBase
SQL Server
Split-based
query processing
Supporting
various Hadoop
distributions
Best-In-Class
Performance
Openness
& Flexibility
Integration with
Microsoft Azure
Hybrid
Mature T-SQL
language surface
Seamless with
Microsoft BI &
Third-Party
BI tools
Preserving
Existing skill
T-SQL Semantics sets & tools
17
18
PolyBase –
Building Blocks & Functionality
19
Data Scientists,
BI Users, DB Admins
Your
Apps
Microsoft
APS
APS
control &
data
nodes
w/ PolyBase
PowerPivot
PowerView
External Table
Hadoop
External
External File
Data Source
Format
Relational DW
PolyBase
Split-Based Query
Processing
Web
Apps
Social
Apps
Mobile
Apps
Sensor
&RFID
Key principle - Leverage computation power of
external Hadoop clusters
Microsoft Azure
o Transparent pushing of computation into Hadoop
on-the-fly
o No map/reduce or Hadoop skills needed
o Reducing data volume to be transferred
Indefinite Azure Storage
Generalizing key principle in future releases
o Apply for all external data sources
o Azure for indefinite storage + as appliance
extension by making use of Azure compute
o Leveraging compute of other data sources relational data sources or NoSQL platforms
Azure Express
Route
On-premise
Your
Apps
Extensible approach
o Ideas to plug-in other push-down engines in future
o Further performance improvements in near future
Hadoop
APS with
PolyBase
PolyBase Query Use Cases
Your
Apps
PolyBase query scenarios (SELECT)
1. Run T-SQL over HDFS/Hadoop data
2. Combine data from different Hadoop clusters
(e.g. production & dev/test) via PolyBase
3. Join relational data with non-relational data in
HDFS/Hadoop
T-SQL
Customer Value
o Ease-of-use & Improved Time-To-Insights
― Build the data lake w/o heavily investing into
new resources, i.e. Java & map/reduce experts
― Leverage familiar & mature T-SQL scripts
and constructs
― Seamless tool integration w/ PolyBase
Hadoop
Microsoft APS w/
PolyBase
Hadoop
Dev/Test
PolyBase Import & Export Use Cases
PolyBase ETL scenarios (CTAS/CETAS)
Your
Apps
o Storing subset of Hadoop in PDW for frequent access, i.e. in
SQL Server’s columnar format (→ updatable CCI)
o Enabling data aging scenarios to more economic
storage (→ ‘data lake’)
Customer Value
T-SQL
― Avoids the need of maintaining a separate import
or export utility
― Allows building multi-temperature DW platforms
o PDW/APS acts as hot query engine processing
most recent/relevant data sets
o Aged data immediately accessible via external tables
o No need for deleting any data anymore
Hadoop
Microsoft APS w/
PolyBase
PolyBase for Hybrid Use Cases (Azure Integration)
APS as Azure-attach
o Azure as cheap storage for keeping your data forever
o Mesh-up on-premise and cloud apps on your own terms
o PolyBase as bridge between on-premise and Azure
Your
Apps
cloud service
HDInsight
Azure SQL Database
(Hadoop)
Customer Value
Azure Storage
o Indefinite storage and compute
―
Azure as extension for your on-premise data assets
o Cloud transition on your own terms
― Move only subsets of on-premise data, e.g. non-sensitive data
o Leverage new Azure data services
― Reduced capex & availability of new emerging data
services in Azure for on-premise focused users
T-SQL
Your
Apps
APS w/ PolyBase
PolyBase for Round-Trip Scenarios
Your
Apps
PolyBase RoundTrip Scenario (CETAS)
o Processing non-relational data with relational data and
exporting it to your Big Data platform of choice
cloud service
HDInsight
Azure SQL Database
(Hadoop)
Customer Value
― Total freedom & flexibility
o
Azure Storage
User can decide where to store the results of querying
different data sets
― Simplicity
o Example: 1 T-SQL statement for
a) querying different Hadoop data sets,
b) combining with relational data and
c) storing results in Azure or in a different
Hadoop cluster
T-SQL
Your
Apps
APS w/ PolyBase
Hadoop
Dev/Test
26
Real-world customer production
use cases with PolyBase
27
‘We love Open-Source, but we love SQL
Server Analysis Services (even more)’
o SQL Server SMP & Hadoop cluster
Pain Points with current solution
1. ‘We are currently using a mix of SQOOP & BCP.’
2. The integration with SQL Server should be
better, simpler & as fast as possible’ !s
3. ‘Take a definition of an aggregate in Hive and
map it through SQL DDL’!
4. ‘Loading data from Hadoop to SQL must be
easier’.
5. ‘Hadoop is hard to deploy and maintain'. insights
‘Pay-as-you-drive’ - Driver Discount &
Policy adjustment
o SQL Server 2012 PDW & HDP 2.0/2.1 on Linux
What they want to accomplish
1. ‘We have increasing set of non-relational data – sensor data
& web data which we want to store in Hadoop.’
2. ‘We want to get answers to our complex questions fast’.
3. ‘We don’t want to learn a new tool or programming
language or paradigm to get the answers’.
4. ‘We know how to use Microsoft tools and services’.
5. ‘We don’t want to maintain a complex infrastructure – we
trust Microsoft & SQL Server to help us’.
Investing into Online Shopping website
(‘Korea’s Amazon’)
o SQL Server 2012 PDW & HDP 1.3/HDP 2.0 on Linux
What they want to accomplish
1. ‘We want perform complex data mining on customer
purchase data – basket analysis’.
2. ‘We want to understand the social media data
(reviews/Twitter) – specifically around our products
& stores’.
3. ‘We will use Hadoop to keep all of our data ~
envisioned to be around 480 TB. PDW will be the
efficient analysis engine for the hot & more recent
data’.
4. ‘PDW & PolyBase are much faster than Hive’.
5. ‘We’re interested in using data mining cloud services
in Azure (hybrid scenarios)’
Solution Architecture (Details) – ShinSeGae
Web log data(160GB/daily) –
1. External Tables A, B, C
2. Unstructured/semi-structured text
data - External Tables D, E, F
Text
(Board/SNS/
Internal Text )
3.
Complex Event
Processing (Storm)
SSG.com
(renewal)
Message Queues
(KAFKA, Open source)
Weather..
Recommendation engine &
personalized advertising
Tracking Log Servers
PolyBase
Queries
Company emails –
External Tables G, H, I
Mails
Online
Shopping Mall
raw/cold/
warm data
HDP on Linux
Campaign
Analytic information
(right customer targeting)
10 GB
Ethernet
APS/PDW
Operational
Data Store
EDW
Recent/hot data
stored in PDW
EIS
OLAP
(Tabular)
DATA Mining
Visualization
(Silverlight)
BI analyst
What they want to accomplish
o On-premise: 15-20 HDP Hadoop cluster + 1 Rack
PDW
o How they want using Azure:
1. Cost-optimal backup for Hadoop data [WASB]
2. Scale-out environment to spin-up more
compute on demand at busy times
3. Platform to leverage new data services, in
particular Azure HDI and Azure Machine Learning
o Daily - 50 GB compressed data into Azure
o
Your
Apps
cloud service
HDInsight
Azure SQL Database
(Hadoop)
Azure Storage
Either planning to use custom API or
PolyBase CETAS
o Planning to use PolyBase to combine data from
on-premise and Azure
Your
Apps
T-SQL
APS with scale-out
PDW region
‘Understanding network quality’
o SQL Server 2012 PDW & Hadoop on Linux
What they want to accomplish
1. ‘We collect millions of network records for quality
assessment and capacity planning – on a daily basis’.
2. ‘Hadoop will be used for storage and ETL of these
network record files’.
3. ‘PDW for more operational analysis, ad-hoc analysis,
operational reports’.
4. ‘We are using Polybase along with Oozie-based
orchestration for a seamless & automated integration’.
Solution Architecture (Details) – Telcom
Capturing Network logs (>300 GB/per
day) – External Tables A, B, C
High-frequency
Event Processing
(Network logs)
Usage of Hive’s
Metadata stores
HCatalog
Hadoop
cluster
(18+ servers)
PolyBase
Queries
APS/PDW
Operational
Data Store
Infiniband
raw/cold data
(Petabyte of network
logs)
EDW
Hot operational
PDW data
Oozie
Workflows
Remote procedure calls via stored
procedures to trigger PolyBase queries
Network
quality analysis
Capacity
Planning
Visualization
(PowerPivot)
BI analyst/Planner/
Decision-maker
‘Analyzing oil drilling rigs’
o SQL Server 2012 PDW & external Hadoop
What they want to accomplish
1. ‘Each drilling rig is equipped with sensor devices and we
want to store the incoming sensor data into Hadoop’.
2. ‘We want to use PolyBase to access these sensor data’.
3. ‘We want to monitor our rigs in near real-time’.
4. ‘We want to conduct predictive analysis – which rigs are
running poorly, which rigs show similar symptoms’.
‘Bridging the gap between cloud & on-prem’
o SQL Server 2012 PDW & Azure HDInsight
What they want to accomplish
1. ‘HDInsight/Hadoop in the cloud to store and massage our
raw data (XML files) generated by our web-application’.
2. ‘PDW to keep the data on-prem (legal requirement) and to
have an efficient query engine for analysis purposes’.
3. ‘PolyBase is a great way of accessing our files in the cloud via
simple T-SQL.’.
4. ‘With this solution, we can allow web users to quickly ask
questions while the heavy, more complex business analysis is
accomplished by PDW users’.
Solution Architecture (Details) – Government
HDI tools for data
transformation
Web apps- Generating tons of
smaller XML files (~7KB each)
Web Application for
Tax Filing (e-invoice)
Other Web Feeds
Transforming to
large text files ~ 10 GBs each
(External Tables)
PolyBase
Queries
Operational
Data Store
HDI on
Azure
Azure Blob
Storage
APS/PDW
Public Internet or
Azure Express Route
cheap data store – alternative to
Hadoop on-prem solution
EDW
PDW/APS for fast query
response & data processing of
hot data
Microsoft
BI stack
IBM Cognos
‘What are you drinking? Why is
the machine down’?
o SQL Server/APS with PDW & HDI region
What they want to accomplish
1. ‘We want a complete solution stack – we do
not have Hadoop experts in-house and
don’t have the money to get it’.
2. ‘We want to store all raw data coming from
vending machines into Hadoop’.
3. ‘360 degree of all our data – structured
customer data & unstructured data coming
from vending machines’.
4. ‘Predicate maintenance of machines’.
‘Data Management Services for Microsoft Teams’
o APS with PDW & HDInsight region
What they want to accomplish
o Offering data management services to other Microsoft
groups
o Cookie scoring for msn.com
o 3TB of data for each analysis
o Added PolyBase for Reporting & PowerQuery
o Previously –
o Analysis solely done via HiveQL
o Results are writing back in SQL Server for reporting
o PowerQuery via HiveODBC driver
Solution Architecture (Details) – Internal Microsoft Data Scientist
Data scientist group 1 - using chaining of
Hive queries & PowerQuery via HiveODBC
Hive & PowerQuery
via Hive ODBC
System Center &
AdminConsole
msn.com – Log
files
Microsoft servers –
Log files
Secure Gateway &
AD Integration
Analyzing
~3 TB Web Traffic
Polybase Queries
HDI region
1 scale unit
HDI region
Infiniband
Data scientist group 2 - Using Polybase for
existing tooling (T-SQL, BI tools) performing
processing of complex analytical queries &
consistent management experience
PowerQuery/Power
View/PowerMap
PDW region
Full Rack
PDW
APS with PDW & HDI region
Analytical queries
via SSDT
44
microsoft.com/sqlserver and Amazon Kindle Store
microsoftvirtualacademy.com
Azure Machine Learning, DocumentDB, and Stream Analytics
http://channel9.msdn.com/Events/TechEd
www.microsoft.com/learning
http://microsoft.com/technet
http://developer.microsoft.com