Distinct Count

Transcript Distinct Count

DAT353
Analysis Service: Server
Internals
Tom Conlon
Program Manager
SQL Server Business Intelligence Unit
Microsoft Corporation
Purpose of this Session
• Remove some of the mystery
• Explain how it is that we do some
things so much better than our
competitors
• Things are easier to understand when
the internals are understood
• Requirements:
– You already know the basics – this is for
the experienced
Agenda
•
•
•
•
•
•
•
Architecture Review
Aggregations
Data and Dimension Storage
Processing
Querying
Server Memory Management
Distinct Count
Architecture – Single Server
Analysis
Manager
Processing
Querying
ADO MD
OLAP
Store
Analysis
Server
Application
OLEDB
OLEDB for OLAP
Other
OLE DB
Providers
DSO
PivotTable Service
SQL Server
Data
Warehouse
Component Architecture - Query
METADATA
MANAGER
AGENT
FORMULA
ENGINE
CACHE
CACHE
MMSMDSRV.EXE
MSOLAP80.DLL
FORMULA ENGINE
Server Storage Engine
METADATA
MANAGER
MDX
Component Architecture - Management
MSMDGD80.DLL
MSMDCB80.DLL
DCube
Storage
Engine
DCube
PARSER
METADATA
MANAGER
METADATA
MANAGER
AGENT
FORMULA
ENGINE
CACHE
CACHE
MMSMDSRV.EXE
MSOLAP80.DLL
FORMULA ENGINE
Server Storage Engine
METADATA
MANAGER
MDX
DDL
Component Architecture - Distributed
MSMDGD80.DLL
MSMDCB80.DLL
DCube
Storage
Engine
DCube
PARSER
METADATA
MANAGER
METADATA
MANAGER
AGENT
FORMULA
ENGINE
CACHE
CACHE
MMSMDSRV.EXE
Server Storage Engine
MMSMDSRV.EXE
METADATA
MANAGER
CACHE
MSOLAP80.DLL
FORMULA ENGINE
Server Storage Engine
METADATA
MANAGER
MDX
Agenda
• Architecture Review
• Aggregations
•
•
•
•
•
Data and Dimension Storage
Processing
Querying
Server Memory Management
Distinct Count
Why Aggregations?
•
Aggregations can result in orders of
magnitude improvement in performance
–
–
–
Don’t have to access every fact table record to
determine query result
Further savings with data compression
Biggest savings: reduce disk scan
Aggregations - Overview
Highest Level Aggregation
Customer
Product
Units Sold
Sales
All
All
347814123
$345,212,301.3
Customers
Product
All Customers
Country
State
City
Name
All Products
Category
Brand
Name Intermediate Aggregation
countryCode
productID
Units Sold
SKU Can
sd452
9456
US
Facts
…
custID
SKU
Units Sold
Sales
345-23
135123
2
$45.67
563-01
451236
34
$67.32
…
yu678
4623
Sales
$23,914.30
$57,931.45
Partial Aggregation
•
Don’t want to create all possible aggregations
–
•
Data explosion!
What if a query is made to a combination of levels
where no aggregation exists?
–
–
Can compute from lower level aggregations
Don’t need to compute every possible aggregation
Customers
Product
All Customers
Country
State
City
Name
All Products
Category
Brand
Name
SKU
Queries including a
combination of Country and
Brand can be answered if
aggregation Country by
Name exists.
Aggregations
Show me all sales for all products for all . . .
Fact Table
Highest level of aggregation (1,1,1,1,…)
Most detailed
Aggregations
(m,m,m,…)
Partial Aggregation
Show me all sales for all products for all . . .
Fact Table
Highest level of aggregation
Most detailed
Aggregations
Aggregation Design
Month
Product
Quarter
Pro.
Family
Quarter
Pro.
Family
Products
Month
Fact Table
Aggregation Design
Results
•
Result: aggregations designed in waves
from the top of the pyramid
•
At 100% aggregations, ‘waves’ all touch:
overkill
•
20-30% Generally adequate (0% for the
smaller cubes)
Fact Table
Aggregation Design
•
Which aggregations are more important than
others?
–
–
•
All are equal
From design perspective, select the ones that
result in overall improved query performance
Usage Based Optimization: Weightings on
each aggregation based on usage frequency
Flexible and Rigid Aggregations
•
•
Changing dimensions allow members to be moved, added and
deleted.
After members move, only incremental process of dimension
is required
A
X
•
•
B
C
When member X is moved from
a child of A to a child of C, all
aggregations involving A or C
are invalided
X
‘Flexible’ aggregations deleted when a changing dimension is
incrementally processed.
‘Rigid’ aggregations remain valid
Aggregation Data Storage
Aggregations including (All) level are
rigid (if all other levels in the agg are
rigid)
A
B
C
X
•
•
•
Aggregations with this level are always
flexible
X
Aggregations with this
level are rigid (if all other
levels in the agg are rigid)
No impact on fact data or rigid aggregation data when changing
dimension incrementally processed
Flexible aggregations are invalidated when changing dimension
incrementally processed
Data is in three files:
–
–
–
partitionName.fact.data
partitionName.agg.rigid.data
[partitionName.agg.flex.data]
Incremental Dimension Processing
(Chg Dimension)
Query and process
dimension data:
Keys, member names,
member properties
•
Potential Resource
Competition during lazy
processing after changing
dimension incrementally
processed
•
Fewer Aggregations!
•
Result: Query performance
degradation
For each cube
using this dimension
Delete flexible
Aggs and indexes
Start lazy
aggregating
Start lazy
indexing
demo
Flexible Aggregation
Demo
Agenda
• Architecture Review
• Aggregations
• Data and Dimension
Storage
•
•
•
•
Processing
Querying
Server Memory Management
Distinct Count
Data Storage
•
•
•
•
No data stored for empty member combinations
With compression – data storage approx 1/3 of space
required in RDBMS source
Data is stored by record in pages
Each record contains all measures at an intersection of
dimension members
Page
Record 1: mbrd1, mbrd2,…mbrdn m1, m2,…mn
Record 2: mbrd1, mbrd2,…mbrdn m1, m2,…mn
…
Record 256: mbrd1, mbrd2,…mbrdn m1, m2,…mn
Data Structures
•
•
Partition data stored in a file divided into Segments
Each Segment contains 256 pages (each with 256
records) = 64K records
Data File
Segment 1
Segment n
Segment 2
Page 1
Page 2
Page 3
…
Page 256
…
…
Only last segment
has fewer than
256 pages
Clustering
•
Physical order of the records in each page and
segment is organized to improve performance
–
–
•
Try to minimize distribution of records with the same
member across segments and pages
–
–
•
Keeps records with same or close members together
Similar in concept to SQL clustered index where data
sorted by key values
Optimized, but no algorithm can keep records for the
same member (unless the cube contains a single
dimension)
Similarly – SQL can only have a single clustered index
Records with identical dimension members can be
in multiple segments
–
Data is read and processed in chunks (more on this
later…)
Indexing
•
•
How is the data retrieved?
–
Cubes can be in the terabyte range
–
Scanning data files not an option
Need an index by dimension member
–
•
Answers question “Where is the data associated
with this combination of dimension members?”
Map files provide this
Map Files
Segment 1
There is a map for each dimension which
indicates the page where the member is
included in a data record
Dimension 1 Map
Map
Page 3
…
…
Page 4
…
Member
Map
…
…
1324
…
…
Page 2
Member
Page 5
1213
Dimension 2 Map
Page 1
…
Page 6
…
Page 256
To resolve a query containing a member
from each dimension, get list of pages
containing all members
Other Approaches
• Array Based
– Normally allocates a cell for every
combination.
– Result: Data explosion - much more
disk space and longer processing
times
• Mix of Record and Array
– ‘Dense’ dimensions are record like
– Sparse are array like
•
Bit used per empty cell – sparsity
explodes db sizes
– User chooses decides whether a
dimension is dense or sparse
Agenda
• Architecture Review
• Aggregations
• Data and Dimension Storage
• Processing
• Querying
• Server Memory Management
• Distinct Count
Processing Buffer Memory Settings
•
‘Read-Ahead Buffer Size’ is the buffer
containing data read from source db
–
–
•
Data is processed in chunks of ‘Process
Buffer Size’
–
–
•
Defined in Server Property Dialog. Default: 4Meg
Rarely important – little effect when changed
Defined in Server Property Dialog
Data is clustered within Process Buffer Size
Bigger Process Buffer Size the better – make
as big as possible
–
–
Data for dimension members is clustered to keep
data for ‘close’ members close together
The larger these memory settings are, the more
effective clustering
Incremental Processing
Original
partition
Two Step Process
• First, a partition is
created with the
incremental data
• Second, the partition is
merged with the
original
• Complete Segments of
both partitions left
intact – incomplete
ones are merged
• After many
incremental
processes, data
distributed: degraded
performance
• Reprocess (if you have
a large Process Buffer
size) can provide
improved performance
…
…
…
+
Incremental
Partition
…
Agenda
•
•
•
•
Architecture Review
Aggregations
Data and Dimension Storage
Processing
• Querying
• Server Memory Management
• Distinct Count
Querying a Cube
•
CLIENT
–
•
Select {[North America],[USA],[Canada]} on rows,
Measures.members on columns from myCube
Need two things:
–
–
getting the dimension members – the axes
getting the data
Resolve Axis
•
Dimension members cached on client in
‘Client Member Cache’
–
–
–
•
•
•
Levels with #members < Large Level
Threshold sent in group
Levels with #members > Large Level
Threshold retrieved as needed
Large Level Threshold default value:1000,
can be changed in server property and in
connection string
Where members not cached, members and
descendents retrieved to client until needed
member retrieved
Levels with members with 1000s of siblings
result in degraded performance
Member cache not cleaned except for
disconnect or when cube structure changes.
Requested
member
Cached
members
‘Christmas trees’
Non-cached
members
Client Data Cache
•
•
Client retains data of previous queries in client data
cache
Client Cache Size property controls how much data
is in the client cache
–
–
–
•
•
When 0: unlimited
1-99 (inclusive), percent of physical memory
>99 use up to the value in KB
Default value: 25
When exceeded, client cache is cleaned at cube
granularity
How Cubes Are Queried
Data on disk
Partition: Canada
Segment 1 Segment 4
Segment 2 Segment 5
Segment 3 Segment 6
Partition: Mexico
Service
Query
Processor
Dimension
Memory
Segment 1 Segment 4
Segment 2 Segment 5
Segment 3 Segment 6
Partition: USA
Segment 1 Segment 4
Segment 2 Segment 5
Segment 3 Segment 6
Cache
Memory
Client
Client Data
Cache
Agenda
•
•
•
•
•
Architecture Review
Aggregations
Data and Dimension Storage
Processing
Querying
• Server Memory
Management
• Distinct Count
Service Start Up
•
–
•
Dimension requirements: ~125
bytes per member plus member
properties
–
–
•
Tip: invalidate a dimension if not
used in a cube
1M members: 125M
With 25 char member property (eg,
Address): 175M
Large dimensions can migrate to
separate process space
Dimension
Memory
Minimum allocated memory
•
Minimum Allocated Memory
defines the amount of memory
completely dedicated to the
server
All dimensions in the database
are retained in memory
During Processing
•
Shadow dimensions
Processing Buffers
–
–
•
Dimension
Memory
Read Ahead Buffer size
Process Buffer Size
If dimension and processing buffers
memory requirements exceed
Memory Conservation Threshold no room for data cache
Shadow
Dimensions
Processing
Buffers
Available
Cache
Memory conservation threshold
•
2 copies of dimensions stored in
memory while processing
Minimum allocated memory
–
During Querying
•
Data cache stores query data for reuse
•
If Dimension Memory requirements >
Memory Conservation Threshold, no Data
Cache
‘Cleaner’ wakes up periodically to reclaim
memory from data cache
–
BackgroundInterval registry setting. Default
value: 30 seconds
<= 0.5 * (Minimum Allocated Memory+
Memory Conservation Threshold): No
cleaning
0.5 * (Minimum Allocated Memory + Memory
Conservation Threshold) and < Memory
Conservation Threshold: mild cleaning
 Memory Conservation Threshold:
aggressive cleaning
Dimension
Memory
Available
Cache
Memory conservation threshold
•
Faster than retrieving from storage
Minimum allocated memory
–
demo
Setting Server
Properties
Agenda
•
•
•
•
•
•
Architecture Review
Aggregations
Data and Dimension Storage
Processing
Querying
Server Memory Management
• Distinct Count
Distinct Count
• Business Problem: Sales Manager wants to know:
– “How many customers are buying Computers?”
– “How many active customers do I have?”
Sales
All products
Hardware
Computers
Monitors
Printers
Software
Home
Business
Games
8000
3300
2000
800
500
4700
1500
2500
700
Number of
Customers
200
80
70
60
30
150
100
100
80
Distinct Count: Changes to Data Structure
•
DC Measure stored with each fact and aggregation record
–
•
Data ordered by DC measure
–
•
Just like a new dimension
“Order by” included in SQL statement during processing
Number of records can be increased by orders of magnitude
–
Dependant on number of distinct values per record
Sample aggregate record without
Distinct Count…
countryCode
productID
Units Sold
Sales
Can
sd452
9456
$23,914.30
…# records increases with distinct count on customers
CustID
countryCode
productID
Units Sold
Sales
132-45
Can
sd452
1305
4,453.01
432-39
Can
sd452
2325
7,212.23
639-53
Can
sd452
1406
4,890.01
430-30
Can
sd452
3110
8,476.54
964-90
Can
sd452
1310
4,490.23
Distinct Count: Changes to Query
•
Single thread per partition instead of per
segment
–
–
•
Unlike regular cubes, cannot do a single
aggregation of results from each segment as a
single value of the DC measure can cross
segments
Consequently – performance impact
Dimension slice requires much more disk
scan than before
–
–
Segments clustered by DC measure
Expensive
Distinct Count Tips
• Keep DC measures in their own cube
– All measures are retrieved on query –
even if some are not asked for
– Create virtual cube to merge DC with
other measures
• Incremental processing DC cubes is
very expensive
– Segments restructured and reordered to
keep records ordered by DC measure
– Time and memory intensive
Distinct Count Tips
•
Unlike regular cubes, best to distribute DC
values evenly across each partition
– Most effective use of multiple threads for
query processing
•
If you have a dimenion that corresponds
to DistinctCount Measure
– Aggregations recommended only on lowest
level
– (Example, Customer dimension in cube,
Customer as distinct count measure)
Summary
•
•
•
•
•
•
•
Architecture Review
Aggregations
Data and Dimension Storage
Processing
Querying
Server Memory Management
Distinct Count
Don’t forget to complete the
on-line Session Feedback form
on the Attendee Web site
https://web.mseventseurope.com/teched/

Distinct Count

Transcript Distinct Count

Directory