Data Integration
Download
Report
Transcript Data Integration
Do It Strategically with
Microsoft Business Intelligence!
Bojan Ciric
Strategic Consultant
ARCHITECTURE
Technical aspects of the project
Data
Integration
Data
Analysis
Data
Visualization
Solution Architecture
SQL Server as a backbone
DATA INTEGRATION
ETL: Extract, Transform, and Load
1. Extract data from the source systems
2. Transform the data to convert it to a desired format
3. Load the data into the data warehouse
Data Integration
More than Just ETL
Transform corporate data into meaningful and
actionable information
Challenges
– Retrieve and merge data from multiple sources
– Cleanse and transform the data
– Load the data into appropriate data stores for analysis
and reporting
Enterprises spend 60%–80% of their BI resources in
the data integration stage
Top tips for successful ETL:
Choose the right ETL
tool
•large volume of data and
complex transformations
requires powerful tool
available to dealing with
these challenges
Use a lot metadata
•since that essence of ETL
is composed by mapping
of operational systems to
destination as well as
data transformation,
store all of that in
metadata instead of
applications. Over the
time you will get benefits
through maintenance
lower costs.
Engage experienced
ETL developers
•they are expensive but
several times more
efficient then
inexperienced
Choose the right
destination data
model
•operational sources you
cannot choose, but good
destination models
allows you less complex
transformations and
better information
potential
Establish data
governance
•during ETL process, it is
extremely important and
quite efficient to
establish data
governance within the
organization if no any.
Good Data Governance
policy and procedures is
essential environment for
the ETL process during
the operational work.
Let’s do ETL with SSIS
SQL Server Integration
Services (SSIS) service
SSIS object model
Two distinct runtime
engines:
– Control flow
– Data flow
32-bit and 64-bit editions
The Package
The basic unit of work, deployment, and execution
An organized collection of:
–
–
–
–
–
–
Connection managers
Control flow components
Data flow components
Variables
Event handlers
Configurations
Can be designed graphically or built
programmatically
Saved in XML format to the file system or SQL
Server
Control Flow
Control flow is a process-oriented workflow engine
A package contains a single control flow
Control flow elements
–
–
–
–
Containers
Tasks
Precedence constraints
Variables
Data Flow
The Data Flow Task
– Performs traditional ETL and more
– Fast and scalable
Data Flow Components
– Extract data from Sources
– Load data into Destinations
– Modify data with Transformations
Service Paths
– Connect data flow components
– Create the pipeline
Data Flow Sources
Sources extract data from
– Relational tables and views
– Files
– Analysis Services databases
Data Flow Destinations
Destinations load data to
–
–
–
–
Relational tables and views
Files
Analysis Services databases and objects
DataReaders and Recordsets
Row Transformations
Update column values or create new columns
Transform each row in the pipeline input
Rowset Transformations
Create new rowsets that can include
–
–
–
–
Aggregated values
Sorted values
Sample rowsets
Pivoted or unpivoted rowsets
This is a heavy-weight performer of SSIS
Are also called asynchronous
components
Split and Join Transformations
Distribute rows to different outputs
Create copies of the transformation inputs
Join multiple inputs into one output
Perform lookup operations
Custom Developed Components
Fully aligned with IBM BDW model
Pexim Message Mapper
Classification Resolver
Pexim Surrogate Key
Pexim Accounting Unit
Pexim Rank Resolver
Source To BDW Mapping
Classification Management
Surrogate Key Management
Accounting Unit Management
Rank Management
ETL Concept Evolution – Custom Components
ETL Concept Evolution – Logic out of SQL Queries
Microsoft SQL Server Integration Services
DEMO
OLAM, Multidimensional Data, Data mining and so on
DATA ANALYSIS
OLAP or Multidimensional Data
Online Analytical Processing = Multidimensional
Data
Measures and Dimensions
Uses a calculation engine for fast, flexible
transformation of base data (such as aggregates)
Supports discovery of business trends and statistics
not directly visible in data warehouse queries
Cube (UDM)
Unified Dimensional Model
Combination of measures (from facts) and
dimensions as one conceptual model
Rich data model enhanced by
–
–
–
–
–
–
Calculations
Key Performance Indicators (KPIs)
Actions
Perspectives
Translations
Partitions
Formally, cube is called a UDM
Querying a Simple Cube
What sales did we expect to achieve in North America
for CY 2004 Q1?
5,005,000
Q1
Q2
Pacific
Europe
Q4
North America
Measures Dimension
Sa
le
s
Q3
Te
rri
to
ry
N/A
Star Schema
Star Schema Benefits
Transforms normalized data into a simpler model
Delivers high-performance queries
Delivers higher performing queries using Star Join
Query Optimization
Uses mature modeling techniques that are widely
supported by many BI tools
Requires low maintenance as the data warehouse
design evolves
Snowflake Dimension Tables
Define hierarchies using multiple dimension tables
Support fact tables with varying granularity
Simplify consolidation of data from multiple
sources
Potential for slower query performance in relational reporting
No difference in performance in Analysis Services database
Hierarchies
Benefits
– View of data at different levels of
summarization
– Path to drill down or drill up
Implementation
– Denormalized star
schema dimension
– Normalized snowflake
dimension
– Self-referencing
relationship
Fact Table Fundamentals
Collection of measurements associated with
a specific business process
Specific column types
– Foreign keys to dimensions
– Measures – numeric and additive
– Metadata and lineage
Consistent granularity – the most atomic
level by which the facts can be defined
Fact Table Examples
Parent-Child Hierarchy
A dimension that contains a parent attribute
A parent attribute describes a self-referencing
relationship, or a self-join, within a dimension table
Common examples
– Organizational charts
– General Ledger structures
– Bill of Materials
Parent-Child Hierarchy Example
Slowly Changing Dimensions
Support primary role of data warehouse to
describe the past accurately
Maintain historical context as new or
changed data is loaded into dimension
tables
Implement changes by Slowly Changing
Dimension (SCD) type
– Type 1: Overwrite the existing dimension record
– Type 2: Insert a new ‘versioned’ dimension
record
– Type 3: Track limited history with attributes
SCD Type 1
Existing record is updated
History is not preserved
SCD Type 2
Existing record is ‘expired’ and new record inserted
History is preserved
Most common form of SCD
SCD Type 3
Existing record is updated
Limited history is preserved
Implementation is rare
SQL Server 2008 Analysis Services
OLAP component
–
–
–
Aggregates and organizes data from
business data sources
Performs calculations difficult to perform
using relational queries
Supports advanced business intelligence,
such as Key Performance Indicators
Data mining component
–
–
Discovers patterns in both relational and
OLAP data
Enhances the OLAP component with
discovered results
Cube = Unified Dimensional Model
Multidimensional data
Combination of measures and dimensions as one
conceptual model
– Measures are sourced from fact tables
– Dimensions are sourced from dimension tables
Dimensions and Facts
Basis of All BI
Fact – something that happened
–
–
–
–
Sale, purchase, shipping...
Transaction or an event
Verb
Essentially a Measure
Dimension – describes a fact
– Customer, product, account...
– Object
– Noun
A fact (measure) is expressed in terms of dimensions
– 16 balls sold to John on 20090115.
Dimensions
Describe business entities
Contain attributes that provide context to numeric
data
Present data organized into hierarchies
Dimensions
Members from tables/views in a data source view
(based on a Data Warehouse)
Contain attributes matching dimension columns
Organize attributes as hierarchies
– One All level and one leaf level
– User hierarchies are multi-level combinations of
attributes
– Can be placed in display folders
Used for slicing and dicing by attribute
Hierarchy
Defined in Analysis Services
Ordered collection of attributes into levels
Navigation path through dimensional space
Very important to get right!
Measure Group
Group of measures with same dimensionality
Analogous to a fact table
Cube can contain more than one measure group
– E.g. Sales, Inventory, Finance
Defined by dimension relationships
Dimension Model
Calculations
Expressions evaluated at query time for values that
cannot be stored in fact table
Types of calculations
– Calculated members
– Named sets
– Scoped assignments
Calculations are defined using MDX
MDX = MultiDimensional EXpressions
Microsoft SQL Server Analysis Services - OLAP
DEMO
DATA MINING
Data Mining
Discovery of (very) hidden patterns in mountains of
data
Correlation search engine
Recent combination of statistics, probability
analysis, database technologies, machine learning,
and AI
What does Data Mining do?
Typical Uses
Seek
Profitable
Customers
Correct
Data During
ETL
Detect and
Prevent
Fraud
Understand
Customer
Needs
Data
Mining
Build
Effective
Marketing
Campaigns
Anticipate
Customer
Churn
Predict
Sales
Typical Scenarios
Customer Classification & Segmentation
• Who are our customers?
• Are there any relationships between their demographics and their buying power?
Profitability and Risk
• Who are our most profitable customers?
• Can I predict profit of a future customer based on demographics?
• Are they creditworthy?
• How much should I charge them to give a good loan and protect against losses?
Customer Needs Analysis
• How do they behave?
• What are they likely to do once they bought that really expensive car?
• Should I intervene?
Forecasting
• What are my sales going to be like in the next few months?
• Will I have credit problems?
• Will my server need an upgrade in the next 3 months?
Mining Process
Data Mining
DEMO
Summary
Data Mining is a powerful, predictive technology
Turns data into valuable, decision-making
knowledge
SQL Server 2008 Analysis Services support
Predictive Analytics
Mine your mountains of data for gems of
intelligence today!
DATA VISUALIZATION
Key Performance Indicator (KPI)
Quantifiable measurement comparing
business performance to goals
Measure of overall organizational health
when combined into a collection for a
business scorecard
Three main ways to build KPIs:
– Using OLAP (cubes)
– Directly in Performance Point Server
– Using data mining (predictive KPI)
KPI Characteristics
Value
Goal
Status
Trend
Dashboards and Scorecards
Scorecard
– Table (pivot-like) of KPIs
Dashboard
– Contains scorecards, reports, and other analytical
visualisations
Building Dashboard with Microsoft Excel 2010
DEMO
PowerPivot for Excel 2010
PowerPivoting Massive Data Volumes
With a few mouse clicks, a user can create and publish intuitive and interactive self-service analysis solutions.
Empower Your Users—With Familiar Tools
The ease of Excel now
with:
Unmatched
computational power
Advanced analytic
expressions
Empower Your Users—With Familiar Data
Users know reports:
What’s available
What’s useful
Where to find them
How to use them
IT provions reports:
With security and
scalability
Interactive Slicing and Dicing
Excel 2010 and Excel Services
Interactive slicers enable users to look at the data from various directions in Excel 2010 and
in the browser through PowerPivot for SharePoint and Excel Services.
© 2010 Asseco SEE. All rights reserved.
The information herein is for informational purposes only and represents the opinions and views of Asseco SEE and/or Bojan Ciric. The material
presented is not certain and may vary based on several factors.
Portions © 2009 Asseco SEE & entire material © 2009 Microsoft Corp. Some slides contain quotations from copyrighted materials by other authors,
as individually attributed or as already covered by Microsoft Copyright ownerships. All rights reserved. Microsoft, Windows, Windows Vista and
other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for
informational purposes only and represents the current view Asseco SEE as of the date of this presentation. Because Asseco SEE & Microsoft must
respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft and Asseco SEE
cannot guarantee the accuracy of any information provided after the date of this presentation. Asseco SEE makes no warranties, express, implied
or statutory, as to the information in this presentation. E&OE.