Unisys and SSIS Performance

Download Report

Transcript Unisys and SSIS Performance

SQL Server 2005
Project REAL – End to End BI
Rick Brewis
SQL TS
([email protected])
Microsoft
Erin Welker (ewelker@scalabilityexperts .com)
Practice Manager
Scalability Experts
Mike Heifner ([email protected])
National Sales & Marketing Director
Unisys
Agenda
8:30 - Continental Breakfast
9:00 - Introductions and Agenda
Unisys ES7000 – The Premier BI Platform
Break
Project REAL Presentation/Demonstration
Working Lunch
1:15 - Wrap-up
The REAL DEAL
What is Project REAL?
What impacts Design and Processing?
And Wait …
Best practices : performance tuning
tips!
Integration Services
Relational Datawarehouse
Analysis Services
Coming Attractions …
Project REAL
A quick overview
Project REAL Overview
Key goals for Project
REAL
Reference
Implementation
End-to-End
At-Scale
Lots of Users
Includes all parts of a BI
system:
Components
ETL,
DW RDBMS, OLAP Cube
Analysis/ reporting front-end
Aspects
Design, Deployment,
Management,
Backup, Monitoring
Project REAL Scenario
Based on Barnes & Noble Booksellers
800+ stores in the United States
40,000+ employees
Inventory Management System – for Buyers, Planners, Executive and
Analysts in the Headquarters, and Managers in the Stores and the
Distribution Centers
Business Case:
Buyer requires a series of analytic views to help determine the inventory
levels and the sales trends, and make better ordering decisions.
Buyer would like to view fully detailed information about any item of
concern.
Based on B&N – Enterprise Data Warehouse
All possible challenges of a large data warehouse
Quick access to large volumes of information
Multiple data sources: sales and inventory systems
Large data volume
Complex operations, data manipulations, business rules
Must expose data in a form useful to business users
Data Overview
Data sourced from 2 OLTP production systems at Barnes
& Noble
Sales and Inventory subject areas
15 Dimension Areas, 4 Fact Tables
Full copy of production data
warehouse as of late December 04
2 TB as delivered
Three months of daily updates
170 GB as delivered
“Masked” for use in Project REAL
Sales
data
Inventory
data
SSIS
B&N
Data
Warehouse
Copy for
Microsoft
Data Overview
4 fact tables
Store Sales, Store Inventory, Distribution Center Inventory (DC
Inventory), Division Strategy
Historical Load (without indexes) ~10 Billion Rows +
Dimension tables
DC Inventory facts
Store Inventory facts
Store Sales facts
Division Strategy facts
Tables Rows
Size (MB) Bytes/Row Avg. Rows/Table
15
12,666,277
6,420
n/a
n/a
18
54,405,164
4,363
84
3,022,509
53 8,630,298,635 435,983
53
162,835,823
157 1,366,052,628 192,354
148
8,700,972
1
33,417,014
2,013
63
33,417,014
Incremental Load ~760 Million Rows + per Quarter
Dimension data
DC Inventory facts
Store Inventory facts
Store Sales facts
Division Strategy facts
Tables Rows
Size (MB) Bytes/Row
7
12,050,392
8,734
n/a
1
298,496,583
31,525
111
1
294,776,968
65,713
234
1
148,801,022
29,129
205
1
6,782,314
517
80
Data Overview
Highlights on Large Dimensions
Item Dimensions (7+ million members)
Store sales for 36 months (every book, calendar, etc.
that Barnes and Noble has sold in the last 3 years)
Store inventory for 1 year (every book in every store)
Distribution center inventory for last quarter
Customer Dimension (5+ million members)
Source: the B&N loyalty card program
800+ Stores
12 other dims leading to lots of hierarchies, levels and
attributes . . .
Logical Architecture
DW
- SS RDBMS
Reports
- SSRS
Reports
via
web
browser
OLTP
Sales
ETL SSIS
OLTP
Inventory
Application
Web Farm
OLAP
& DM
- SSAS
One OS with
RDBMS, SSIS, SSAS
Analytica
l users
via AS
client
tools
OH Demo Physical Architecture
Presenter Laptop
Mgmt LAN (10/100)
Gigabit
Fiber
Application Servers (x2)
Unisys ES3040L (x86)
1-2x Intel Xeon MP 3.00 GHz
1GB RAM
2x 36GB internal drives
1x Ethernet NIC
Software Environment
Windows 2003 SP1
IIS
SharePoint Portal Server
Business Scorecard Manager
MU Tool
ProClarity
Office 12 Server
Database & OLAP Server
Unisys ES7000 /400 (IA64)
8x Itanium 1.6 GHz w/9M
64GB RAM
2x 72GB internal drives
2x Gigabit Ethernet NIC
4x Fibre HBA for Data
Software Environment
Windows 2003 SP1
SQL Server 2005 (Source DB &
Datawarehouse)
Analysis Services
Integration Services
Reporting Services
Storage EMC CX300
~2 TB
2 SPs with 2 Ports each
1 DAEs w/ 14 146GB drives
RAID5
Data
Source Database Files
Datawarehouse Files
Analysis Services Cubes
Files
Backups
Project REAL
Storage Overview Demo
Erin Welker – Scalability Experts
Practice Manager
Led the Data Lifecycle Management on Project
REAL
Table and cube partitioning
Optimized data placement
Dimension pruning
Scalability Experts Background
SQL Server specialists focused on advanced services including:
Performance
Migration
Business Intelligence
Scalability
Consolidation
Enterprise Manageability
Availability
Security
Knowledge Transfer
Recent accomplishments  Microsoft Data Management Solution of the Year 2005, 2003
 Lead partner and content designer for:
 SQL Server 2005 technical readiness program – IT Pro
 Gold Certified Partner for Business Intelligence Solutions
 Gold Certified Partner for Advanced Infrastructure Solutions – mission
critical scalability and availability
 Preferred consulting partner for advanced SQL Server design &
deployment by major hardware OEMs
 Training experienced Oracle DBAs in SQL Server
 Oracle to SQL Server migration tools
Project REAL
Design & Processing Overview
Schema Design: Dimensional Modeling
Kimball style “Star” schemas
Dimension tables describe
attributes of interest
Store, book, customer,
time…
Dimensions may contain
hierarchy levels and
attributes
Fact tables contain measures
FK for each dimension in the
cube
Surrogate Keys used for
Primary Key fields
Remaining fields are
measures
How many, what price…
Multiple fact tables share
Conformed Dimension tables
Data model matches business
model – good for users
Granularity
Fine grain data is a blessing and a curse
Incredible opportunities for analysis, discovery
Large volumes of data are harder to manage and process
Logical design and physical design may be in conflict
Cubes and relational warehouse don’t have to be at
same granularity
It can be a good idea to have summary cubes at higher
levels for quicker browsing / reporting
For relational reporting: summary tables are a similar idea
Get ready for more data!
Once, all “decision support” systems were summaries
Now, we try to track every item sold or warehoused
Soon, there will be multiple records per item (think RFID)
Indexing strategies
Indexing for ETL
Index only dimension tables, for key lookup
Indexes on fact tables just slow things down
Follow existing best practices of index management during ETL
Load data with/without indexes?
Indexing for cube processing
No need for fact table indexes here, either
Assuming MOLAP storage
Why would you do otherwise?
Indexing for reporting
Definite need for indexes to get report performance
Indexes that help reporting will hurt ETL speed
Find the balance
Loading & Partitioning Strategy
Historic Load
Targeted for no more than 20M rows each
~ 1500 weekly (6-12M each) partitions
By week and by item subject
52 weeks of data with the top 122 subjects broken into 24 groupings of
approximately equal size + one catchall subject partition
Daily Load
Integrated with data movement ETL
Includes dimension (type 1 and type 2 SCDs) and partition
fact table processing
Creates new partitions as needed by the data; cloned from
an existing “template” partition w/ aggregation design, error
configuration, etc.
Processes new data (as it detects that data is in the stream)
– full process of the partitions
90 days of daily Store Sales, Store Inventory and DC Inventory
data
Processing times
System: ES7000-400
12X CPU, 32GB
Memory used:
RDBMS 24GB
AS 6GB
DC Inventory (Duration: 11:00)
system-determined parallelism; 21 partitions;
total of 54,562,126 rows
Store Inventory (Duration: 35:59:31)
10 in parallel, 1300 partitions, total of
8,631,073,157 rows
Store Sales (Duration: 14:23:55)
10 in parallel, 156 partitions, total of
1,351,434,857 rows
Item Vendor (Duration: 6:39)
System-determined parallelism; 5 partitions;
total of 35,690,370 rows
Total duration: 50:41:05 for 10,072,760,510 rows, or
3,312,318 rows per minute (average)
Project REAL
Cube Processing Demo
Project REAL
And More… MU Tool Demo
The SSIS Architecture
Control Flow (Runtime)
A parallel workflow engine
Executes containers and tasks
Data Flow (“Pipeline”)
A special runtime task
A high-performance data pipeline
Applies graphs of components to
data movement
Component can be sources,
transformations or destinations
Highly parallel operations possible
ETL - Complex Control Flow
1 – Branching
Run the next 4 tasks
in parallel
2 – AND precedence
Wait for ALL inputs
before proceeding
ETL - Data Flow pattern
1 - DataSource
2 – Audit RowCount
3 - Calculations
4 – SCD Component
5 – Union All
6 - Destination
Project REAL
SSIS DEMO
Project REAL
SSIS e-cartography
Partitioning
Reasons for partitioning fact data
Managing the lifecycle of data
Improved performance
Simpler ETL design
SQL Server 2005 introduces Partitioned Tables in
RDBMS
The method before was to create multiple fact tables
Partitioned Tables can simplify…
The relational schema
The SSIS packages
The cube design
Custom development
Partitioning (cont.)
• Multiple Tables
– One Table per partition
– Can compose partitions
using a UNION ALL view
– Many limitations, not all
table operation supported
• Partitioned Tables
– Encapsulate partitions
natively
– All table operations
supported, e.g. Fast Load
Tbl_Fact_Store_Sales_WE_2002_01_05
Tbl_Fact_Store_Sales_WE_2002_01_12
Tbl_Fact_Store_Sales_WE_2002_01_19
Tbl_Fact_Store_Sales
Tbl_Fact_Store_Sales_WE_2002_01_26
… 153 more …
Tbl_Fact_Store_Sales_WE_2005_01_01
Partitioning Disk Layout
One partition per filegroup; One filegroup per file –
Facilitates movement of partitions at the file level
Optimize storage & structure
Partitions and indexes share filegroup
Co-Aligned is generally Good
P1 of FactSales has the same data-range as P1 of FactInventory
Co-Located is generally Bad
File
P1 of
FactSales
the same
P1 ofFile
FactInventory
File
1 File is
2 on File
3 …storage
Fileas
155
156 File 157
Filegroup
FG 1
FG 2
FG 3
…
FG 155
FG 156
FG 157
Partition
P1
P2
P3
…
P 155
P 156
P 157
Database Schemas
Decouple users and objects
server.database.schema.object
Eases management & programming – visual grouping of
objects
Loose-coupled programming
References do not break when user leaves
Schemas are securable containers
As used in Project REAL:
Many objects, what domain (app) do they belong to?
Administration and programming
Not using the security aspects at this stage
Examples:
admin.up_SetLogicalDate(@date)
util.uf_GetEndOfWeek(@date)
etl.up_TruncateDimDate()
Catalog Views
Supersede the Information_Schema & System Tables
Reside in the sys schema
sys.tables, sys.views, sys.partitions
More intuitive and complete than SQL2000
Compatibility objects re-implemented as views
As used in Project REAL:
Partitioning logic needs to dynamically & automatically
create new tables
It is very simple to access the metadata of any object
now
Code is simpler and more intuitive than SQL2000
Was: select * from sysobjects where xtype = 'U'
Now: select * from sys.tables
Extended T-SQL Set-Calculus
SQL Server 2005 includes new Set operators
Already have Union of Tables (UNION)
Intersection of Tables (INTERSECT)
Difference of Tables (EXCEPT)
UNION
INTERSECT
EXCEPT
Extended T-SQL Set-Calculus (cont.)
Sometimes source data does not distinguish
between Inserts and Updates
We can use these new Set operators to
perform complex Update/Insert logic
UPDATE Destination FROM (Source INTERSECT
Destination)
INSERT INTO Destination FROM (Source EXCEPT
Destination)
Saves us having to construct complex
correlated sub-queries or WHERE EXISTS logic
High performance as these operators expose
core engine functions
Database Snapshots
During development & testing we need to undo changes efficiently
Metadata and Data
Backup/Restore takes too long
Long running transactions are expensive
‘Delete Changes’ does not cover all scenarios
Use new Database Snapshot functionality
‘Copy On First Write’
456
Consistent delta that keeps track of changes
Rollback the database using this delta at any time
‘SourceSafe’ for SQL!
123
Production environment
Possible use as ETL end-of-day mechanism
Database
Performance implications – Test!
Snapshot
Project REAL
DB Snapshot & Partitioning Demo
Analysis Services
Lots more objects
More flexibility for modeling options
More flexibility for end-users
But can be a bit overwhelming when you are starting
off
Project REAL’s focus is on:
General development experience
Gotchas which we ran into at-scale
Performance results
Extensive Objects with Richer Semantics
SQL Server 2000
SQL Server 2005
Virtual cube
 Cube
Cube (single fact table)
 Measure group
Dimension
 Dimension
Hierarchy
Virtual dimension (multi-level)
Hierarchy
1
2
 Hierarchy
Member property (in a general sense)  Attribute
Level
 Attribute (when used in a hierarchy)
Virtual dimension (single-level)
 Attribute Hierarchy
Perspectives
KPIs
1
2
Translations
DSVs allows multiple fact tables in a measure group
True multiple hierarchies in a dimension
Key concepts
New dimension concepts in Analysis Services
2005:
Attributes
Attribute hierarchies
Attribute relationships
User-defined hierarchies
Natural Hierarchies
Reporting Hierarchies
Measure groups
Dimensions “the basics”
They are “way” you do analysis
Sales and Inventory
by Store
by Customer
Over Time


Consists of:
1. DSV dimension table(s) and
fields
2. Attributes built from fields (dragn-drop from relational fields)
Important: Not all DSV fields
need to be added as attributes,
although it is a good idea
By default, attributes also have
a single level, flat hierarchy, a
so-called “attribute hierarchy”,
associated with them
3. User-defined hierarchies
(drag-n-drop from attributes)
4. LOTS of objects!!
Note: 9 user-defined
hierarchies (nav only)
Note: 140+ fields in
dimension table
(possible attribute
hierarchies)
key
44 attribute
 Note:
hierarchies
Declaring related
attributes
Relationships allow:
A one-to-many drilldown
One-to-one (old-style
member property)
Not always established by the
design wizards; may need to
be added by-hand
In most cases, correct
numbers are calculated and
returned . . . But that doesn’t
mean these are optional except
for maybe a POC
Key attribute
Where data mining goes
Data mining should start to be integrated and
invisible in applications
“Not for guys in white lab coats any more”
Develop / explore models in BI Dev Studio
Analytical tool users: Models are in cubes
Data mining results appear in reports
Data mining is behind helpful suggestions
You might like this book…
Data mining is integrated in invisible ways
Error detection on data entry
Data cleansing in ETL operations
Data Mining Algorithms
Decision Trees
Clustering
Sequence Clustering
Time Series
Association
Neural Net
Naïve Bayes
Project REAL
Some Performance Tips
Performance measurements
Know why you are measuring things
Measurements relevant to the business
“KPIs of system performance”
ETL rows/sec
Query response time
Number of users supported
Diagnostic measurements
“Performance toolkit”
System: CPU usage, memory committed, disk
queue, context switches, network bytes
Application: RS cache hit ratio, AS thread pool size,
SQL Server latch waits
Performance tuning guidelines
You must have a workload to do anything
Rows to load, cube to process, reports to run
Make sure it represents the business – or else!
Use the KPIs to measure success
Run controlled experiments
Change only one thing at a time!
Use diagnostic measurements to direct
experiments
Have a way to record results systematically
Report your results in terms of the KPIs
Project REAL
Coming Attractions…
Excel Services "12"
Sean Boon, PM Excel Services
What is Excel Services?
New Office “12” functionality built on SharePoint
Server platform that provides:
Server-side Excel spreadsheet calculation
Browser based spreadsheet viewing &
interactivity
Web service access to spreadsheet calculation
Analyze data in Excel
Use spreadsheets in
dashboards and business
applications
Best with
SQL Server
Publish spreadsheets to
SharePoint & Excel Services
Manage and access and
explore spreadsheets in the
portal
Excel Services
Browser
100% thin
View and
Interact
Author & Publish
Spreadsheets
Open
Spreadsheet/Snapshot
Excel “12”
Web Services
Access
Excel “12”
client
Custom
applications
Server-based Spreadsheets
Publish Spreadsheets to the Server to Share Securely
Define parameter
cells that can be
changed on the
server
Control what is
visible on the
server
Saves regular
Excel “12” files
Confidential – Microsoft Corporation
Server-based Spreadsheets
View and interact with spreadsheets in the browser
Excellent visual
fidelity including all
new conditional
formatting
100% HTML and
script no client
components
Server side
charting
Confidential – Microsoft Corporation
Server-based Spreadsheets
Incorporate spreadsheet components in BI dashboards
Unified filtering
across all parts on
the dashboard
10+ out-of-thebox filter parts
Display specific
spreadsheet
ranges or charts
Confidential – Microsoft Corporation
Server-based Spreadsheets
Incorporate spreadsheet logic in business process and apps
Cross platform
web services
interface
Custom UI or
part of business
process
Calculate
spreadsheet and
set and get values
Confidential – Microsoft Corporation
Value of Excel on the Server
Provide a secure, centrally managed single
version of the truth
Protect proprietary information
Incorporate spreadsheets in dashboards
Utilize servers to offload calculations
Use Excel authored business logic in business
applications and processes
“Excel Services”
Architecture
Web front-end
SharePoint UI
HTML rendering + web services
Application server
Loads spreadsheets, refreshes data,
calculates
Web front end
Web front end
Excel web
Excel web
access
services
ECS Proxy
Independent
Scale-Out
Flexible Topology
Application Server
ECS interface
Maintains state for interactivity
File + query caches for performance
Excel calculation service
The Office servers provide:
Solution platform
Store: spreadsheets, data connection
files, admin settings
Application Server
Security: Authentication, Authorization
Single box or multi-tier
Independent scale-out
External data sources
What Excel Services Isn’t
No authoring in the browser
Doesn’t solve multi-user spreadsheet authoring
Doesn’t support 100% of Excel spreadsheets
Resources
http://blogs.msdn.com/excel
[email protected]
Detroit SQL Server User Group
Meet 2nd Thursday of each Month
Location: Microsoft Office
6:00 – 8:00pm
Next Meeting – May 11th, 2006
Website: http://sqlsig.org/
Project REAL
Questions?
http://www.microsoft.com/sql/bi/ProjectREAL