CS 345: Topics in Data Warehousing
Thursday, November 11, 2004
Review of Thursday’s Class
• Eliminating Bottlenecks
– Multi-channel RAID
– Clusters of Processors
• Shared-Memory (SMP)
• Shared-Disk
• Shared-Nothing
• Horizontal Partitioning
• Partitioning and Clustered Indexes
• Multi-Dimensional Clustering
• Clustered File Organization
• Integer Mapped Tables
Outline of Today’s Class
• Dimension Key Mapping Revisited
– Comments on Assignment #2
• Updating the Data Warehouse
– Incremental maintenance vs. drop & rebuild
– Self-maintainable views
• Approximate Query Processing
– Sampling-based techniques
• Computing confidence intervals
• Online vs. pre-computed samples
• Sampling and joins
– Alternative techniques
Notes on Assignment #2
• During 2nd warehouse load, need to determine which
player_key to use for fact rows
– Some player_ids had multiple player_keys in the dimension table
– Due to preserving history of dimension value changes
• Many people used the rule “the maximum player_key for
each player_id is the most current”
– Relies on fact that surrogate keys assigned in increasing order
• Worked fine for this assignment, but two things to note
– Fixing past errors when preserving history
– Non-determinism of order in SQL query results
• Frequent source of “Heisenbugs”
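A minimal SQL sketch of that rule, assuming a dimension table named player with columns player_id and player_key (names from the assignment discussion); it picks the largest surrogate key per player deterministically:
SELECT player_id, MAX(player_key) AS current_player_key
FROM player
GROUP BY player_id;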
Rewriting History Retroactively
• Simplest technique for handling change = update dimension table
– Restates attributes of past facts that reference modified dimension rows
– Desirable if the reason for the update is to correct a data entry error
– Often undesirable if update represents genuine change in an attribute over time
• Preserve history by creating new dim. rows in response to a value change
• But what if we later discover a past error?
– John Smith lives in CA in 2000
– John Smith moves to WA in 2001, but the source system is not changed
– John Smith moves to OR in 2002, and the source system is changed
– In 2003, we learn that John Smith lived in WA in 2001
– Would like to restate history
– Insert a 3rd dimension row for John Smith
– Update all 2001 facts for John Smith to point to the new dimension row
• Also update aggregate tables
– But now maximum dimension key for John Smith <> current dimension key
• Two solutions
– Add 4th dimension row for John Smith with State = ‘OR’
• Row #2 and Row #4 for John Smith are identical
– Maintain mapping table in staging area with current key for each person
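A sketch of the mapping-table solution, using hypothetical names and key values (key_map, sales_fact, keys 1 and 3); the staging area records the authoritative current surrogate key per natural key, so it keeps working even after history is restated:
-- During a fact load: resolve the natural key via the mapping table,
-- not via MAX(surrogate key)
SELECT current_dim_key
FROM key_map
WHERE natural_id = 'JOHN.SMITH';

-- When 2001 history is restated: repoint the affected facts
UPDATE sales_fact
SET cust_key = 3          -- the newly inserted 3rd dimension row (WA)
WHERE cust_key = 1        -- the old CA row
  AND sale_year = 2001;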
Use ORDER BY to guarantee order
• SQL uses multiset semantics, not ordered list semantics
• Inserting same records in different order → same relation
• Query result sets have order, but order is arbitrary
• Mandatory to use ORDER BY if ordering is important
• Example:
INSERT INTO player (id, key) VALUES (4503, 23);
INSERT INTO player (id, key) VALUES (4503, 640);
SELECT key FROM player WHERE id = 4503;
• What will be the result?
– Two possibilities
• 23, then 640
• 640, then 23
– To guarantee the first result, use:
SELECT key FROM player WHERE id = 4503
ORDER BY key;
Warehouse Updates
• Typical data warehousing practice is to batch updates
– Data warehouse is read-only most of the time
– Periodically warehouse is refreshed with new data
• Benefits of batch updates
– Manageability
• Avoid synchronization issues caused by concurrent queries and
updates
• Perform data transformations and integration tasks occasionally
rather than continually
• Quality assurance becomes easier
– Performance
• Indexes, materialized views, etc. can be updated once instead of
continuously
• Amortize fixed update costs across many changed records
• Potentially substitute sequential access for random access
• Downside of batch updates
– Data staleness (warehouse does not offer real-time view of data)
Updating Indexes and Aggregates
• Data warehouses use lots of indexes and materialized views
• These must be updated whenever the warehouse is updated
• Updating existing indexes can be expensive
– Index becomes fragmented, requires reorganization
• Index entries are stored sorted by index search key
• Deleting a record leaves gaps in the index
• Inserting new record requires moving index entries around to make room
– Assuming that entries are packed tightly
• Similar to the problem of maintaining a sorted array
• Typically “overflow pages” are used
– Physical ordering of pages on disk may not exactly match sort order
– Applying updates one at a time involves random reads
• To avoid random reads, sort updates first and apply them en masse
• Similar idea to sort merge join
• Specify “fill factor” when creating index
– How much empty space is built in to accommodate future inserts?
– High fill factor → More efficient space usage
– Low fill factor → Fragmentation is less of a problem
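A sketch of setting a fill factor at index creation time; the option name and syntax are DBMS-specific (the form below follows SQL Server's FILLFACTOR option, and the table/index names are illustrative):
-- Leave roughly 30% of each index page empty to absorb future inserts
CREATE INDEX ix_sales_date ON sales (sale_date)
WITH (FILLFACTOR = 70);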
Updating Indexes
Index before update (entries packed tightly, pages in sort order on disk):
[<2,RID> <4,RID> <4,RID>] [<5,RID> <7,RID> <10,RID>] [<12,RID> <14,RID> <15,RID>]
After inserting a record with value 6, the middle page splits: <5,RID> and <6,RID> move to an overflow page allocated elsewhere on disk:
[<2,RID> <4,RID> <4,RID>] [<7,RID> <10,RID>] [<12,RID> <14,RID> <15,RID>] ... overflow page: [<5,RID> <6,RID>]
A scan of the index in key order now jumps around on disk
Maintain vs. Drop/Rebuild
• Two techniques for managing indexes during incremental loads
– Incrementally maintain
– Drop and rebuild
• Incrementally maintain indexes
– Allow the DBMS to update indexes as new records are loaded
– Occasionally reorganize indexes to combat fragmentation
• Some systems will do this automatically on-the-fly
– Loading is slowed down by cost of index maintenance
• Drop and rebuild
– Drop indexes before loading new data
– Load the new data
– Recreate indexes from scratch after load completes
• Which is best depends on the situation
– Percentage of new or changed records
• Incrementally maintain: cost proportional to number of new records
• Drop and rebuild: cost proportional to number of total records
– Number of indexes
• More indexes → worse locality of access when maintaining indexes
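A sketch of the drop-and-rebuild sequence (object names are illustrative; exact DDL varies by DBMS):
DROP INDEX ix_sales_date;
-- ... bulk-load the new fact rows into sales ...
CREATE INDEX ix_sales_date ON sales (sale_date);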
Maintain vs. Drop/Rebuild
• Drop/rebuild allows fill factor of 100%
– Since indexes won’t be updated, no need to leave extra space
– Use smaller fill factor for incrementally maintained indexes
• Partitioning based on date improves performance of both
methods
– Especially drop/rebuild
– Only need to rebuild most recent partition
– (Size of new data) / (Size of existing partition data) is often large
→ partition-level rebuild is a good choice
• Rough rule of thumb
– Rebuild if > 20% of rows are new/changed
– Maintain if < 20% of rows are new/changed
– Actual threshold depends on hardware, number of indexes
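A sketch of a partition-level rebuild, assuming a date-partitioned fact table with local indexes; the statement follows Oracle-style syntax and the names are illustrative:
-- Rebuild only the index partition covering the newly loaded month
ALTER INDEX ix_sales_date REBUILD PARTITION p_2004_11;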
Self-Maintainable Views
• Incremental maintenance of materialized views is a hard problem
• Naïve solution: Always recompute view contents from scratch
– Re-execute query specified in view definition whenever data changes
• Better solution: Apply incremental maintenance rules
– Nice when it’s possible!
– Example: SELECT SUM(Quantity) FROM Sales
– Current value is 45
– Insert a new row into Sales with Quantity = 4
– New value for materialized view = 49
• A view (or set of views) is self-maintainable if the new contents of
the view(s) can be computed from:
– The old contents of the view(s)
– The changed records
• Not all views are self-maintainable
– Example: SELECT MAX(Quantity) FROM Sales
– Self-maintainable if database is append-only
– Not self-maintainable if deletes are allowed
• What if record with maximum quantity is deleted?
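A minimal sketch of the incremental rule for the SUM example above, assuming the view's contents are stored in a hypothetical one-row table sales_total(total):
-- A row with Quantity = 4 is inserted into Sales; apply the delta
-- instead of recomputing the view from scratch
UPDATE sales_total SET total = total + 4;   -- 45 + 4 = 49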
Self-Maintainable Views
• V1 =
SELECT Customer.State, SUM(Sales.Quantity) AS Quantity
FROM Sales, Customer
WHERE Sales.Cust_key = Customer.Cust_key
GROUP BY Customer.State
• V1 is not self-maintainable
– Suppose a new Sales record is inserted
– Can’t determine State from the new Sales record alone
• V2 =
SELECT Cust_key, State FROM Customer
• {V1, V2} is self-maintainable
– New Customer record → add a row to V2
– New Sales record → look up State from V2 and update V1
• Given a view, how to add more views to achieve self-maintenance?
– Interesting research problem (goal is to add as little as possible)
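A sketch of how {V1, V2} might be maintained when a Sales row with Cust_key = 1001 and Quantity = 4 arrives (the values and the assumption that V1 and V2 are stored as tables are hypothetical):
-- Look up the customer's State in V2, then fold the delta into V1
UPDATE V1
SET Quantity = Quantity + 4
WHERE State = (SELECT State FROM V2 WHERE Cust_key = 1001);
-- If that State is not yet present in V1, an INSERT is needed instead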
Exploratory Data Analysis
• Much data analysis is exploratory in nature
– Primary goal is improved understanding
– Better understanding of data leads (eventually) to specific,
actionable queries
• Direction and scope of exploratory analysis changes as
analysis proceeds
– Future queries are inherently unpredictable
– They depend on insights gained from the results of present
queries
• Unpredictable queries are hard to speed up
– Pre-computed summary structures work best when workload is
well-known, static
– Each index, aggregate, etc. targets specific class of queries
– Only a limited number of indexes, etc. can be constructed
• Due to disk space and load time considerations
Exploratory Data Analysis
• The most important computation takes place in the user’s head
– Make connections
– Develop mental models
– Hypothesize and experiment
• Interactive vs. batch-mode processing
– Interactive processing
• User asks a series of questions in one sitting
• Mental focus and concentration is maintained
• No “context switches” necessary
– Batch-mode processing
• User asks a question and goes off to do something else
• User retrieves the answer later, then formulates another question
• Cognitively expensive → many context switches
• Interactive processing is much preferable
– Places requirements on query processing time
– User needs to be able to maintain attention
– Limit time spent waiting for a query response to intervals that can be productively spent thinking
Approximate Query Processing
• Interactive response times are often impractical for large data
warehouses
– Data volumes are prohibitive
– Merely scanning the fact table may take many minutes!
– Queries that miss indexes/aggregates are slow
• One approach: fast, approximate answers
– Provide quick “ballpark estimates” based on small synopsis of database
– Approximate answer:
• $85,000 +/- $2000 (with 95% confidence)
• Return answer in 10 seconds
– Exact answer:
• $84,792.45
• Return answer in 1 hour
– Good enough for high-level view of database structure, trends
• Combine approximate and exact answers
– Start with approximate answers for high-level “quick-and-dirty” analysis
– Use exact answers to refine and confirm hypotheses
AQP Using Random Samples
• Simplest data synopsis = uniform random sample
– Evaluate query over random sample of the data
– Extrapolate from random sample to estimate overall result
– Large body of knowledge from statistics
– Example: Pre-election polls
• Unbiased estimator for COUNT / SUM queries
– Compute the aggregate over the random sample
– Scale the answer by (1/sampling rate)
– Expected value of estimate = actual query answer
• Definition of unbiased estimator
– Variance of estimate is also important
• Corresponds to confidence intervals reported by pollsters
• The lower the variance, the tighter the confidence interval
AQP Example
Sales (Product, Amount):
(CPU, 1), (CPU, 1), (CPU, 2), (CPU, 3), (CPU, 4), (Disk, 2), (Disk, 1), (Monitor, 1)
SalesSample (a 50% sample of Sales):
(CPU, 1), (CPU, 2), (CPU, 3), (Disk, 2)
SELECT SUM(Amount)
FROM Sales
WHERE Product = 'CPU'
Exact Answer: 1+1+2+3+4 = 11
Approx. Answer from SalesSample: (1+2+3) * 2 = 12
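A sketch of the rewritten query an AQP layer might issue against the pre-computed sample, scaling by 1/(sampling rate); the table and column names are from this example, and the 50% rate gives a scale factor of 2:
SELECT SUM(Amount) * 2 AS approx_sum
FROM SalesSample
WHERE Product = 'CPU';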
Variance of Estimates
• An estimate can be unbiased but still useless if confidence is too low
• Example: Ask 1 random person “Bush or Kerry?”
– If they say Bush, estimate that 100% of population prefers Bush.
– If they say Kerry, estimate that 0% of population prefers Bush.
– Estimate is unbiased
• Assume that in reality 51% of population prefers Bush
• Expected value = (0.51 * 100%) + (0.49 * 0%) = 51%
– Estimate is not very helpful because variance is too high
• Factors that affect variance of estimate
– Bigger sample size → less variance
• Confidence interval width is inversely proportional to the square root of the sample size
– Attribute being SUMmed is highly skewed → more variance
• Hard to find a needle in a haystack
• Scenario 1: 99 people have $0, 1 person has $100
• Scenario 2: 50 people have $2, 50 people have $0
• Estimate total wealth via sample of 10 people → Scenario 2 is easier
– More restrictive filters → more variance
• Records that are filtered out don’t help
Confidence Intervals for Sampling
Method                   | 90% Confidence Interval (±)     | Guarantees?
Central Limit Theorem    | 1.65 * σ(S) / sqrt(|S|)         | as |S| → ∞
Hoeffding                | 1.22 * (MAX - MIN) / sqrt(|S|)  | always
Chebychev (know σ(R))    | 3.16 * σ(R) / sqrt(|S|)         | always
Chebychev (est. σ(R))    | 3.16 * σ(S) / sqrt(|S|)         | as σ(S) → σ(R)
Confidence intervals for Average: select avg(R.A) from R
(Can replace R.A with any arithmetic expression on the attributes in R)
σ(R) = standard deviation of the values of R.A; σ(S) = s.d. for S.A
Slide from Garofalakis & Gibbons, VLDB 2001 tutorial
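A sketch of computing the CLT-based interval directly in SQL over a sample table S(A); STDDEV and SQRT exist in most DBMSs, though the function names vary (e.g. STDEV in SQL Server), and the table/column names are placeholders:
SELECT AVG(A) AS estimate,
       1.65 * STDDEV(A) / SQRT(COUNT(*)) AS ci_halfwidth_90pct
FROM S;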
Online vs. Pre-Computed Sample
• Two basic approaches to sampling
– Online sampling
• Generate sample “on-the-fly” when query is asked
– Pre-computed samples
• Generate samples of big tables in advance
• Store the pre-computed samples separately
• Online sampling can be slow
– Generation of sample can be very expensive
• Sometimes even more expensive than scanning the entire relation!
– Example:
•
•
•
•
200 tuples fit on each disk block
Take a 1% sample of relation
Average page has 2 tuples included in sample
• Most pages have ≥ 1 tuple included in the sample (with high probability)
• Most pages need to be read when generating the sample
Online Sampling
• Alternatives to tuple-level online sampling
– Block-level sampling
• Choose random sample of disk blocks on which a relation is stored
• Include all tuples on the selected blocks
• Problem: Correlations between queried attributes and sort order
– Customer relation is clustered by Age
– Query count of retired customers
– Age and RetiredFlag are highly correlated
– Retired customers are not spread evenly across blocks
– Confidence of estimate decreases
– Difficult to calculate accurate confidence intervals
– Store relation in random order
• To generate x% random sample, scan x% of rows
• Problem: Lose the ability to have a clustered index
• Problem: Relation must be rearranged when new records added
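A sketch contrasting tuple-level and block-level online sampling using the SQL standard's TABLESAMPLE clause (available in some DBMSs, with options and exact syntax that vary; the table name is illustrative):
-- Tuple-level (Bernoulli) 1% sample: most pages still get read
SELECT SUM(Amount) * 100 AS approx_sum FROM Sales TABLESAMPLE BERNOULLI (1);

-- Block-level 1% sample: reads ~1% of the pages, but correlations with
-- the physical row order can bias the estimate
SELECT SUM(Amount) * 100 AS approx_sum FROM Sales TABLESAMPLE SYSTEM (1);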
AQP with Pre-Computed Samples
• Similar architecture to aggregate navigator
• Client queries reference base-level fact and dimension tables
• Re-written queries reference sample tables
[Diagram: Warehouse Client → SQL queries → Query Rewriter → rewritten queries → Data Warehouse (base tables + Samples)]
Sampling and Joins
• Sampling for join queries presents added difficulties
• Here’s a poor approach:
– Construct 1% sample of fact
– Construct 1% sample of dimension 1
– Construct 1% sample of dimension 2
– Compute join of the three samples
– Expected number of tuples in result = only 0.0001% of tuples in actual join!
• Here’s a better approach:
– Construct 1% sample of fact
– Compute join of fact sample with full dimension tables
– Result is a 1% sample of join result
• Fact table is usually much bigger than dimensions
– Sampling dimension tables doesn’t speed things up much
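A sketch of the better approach, assuming a hypothetical 1% fact sample FactSample and the Customer dimension from earlier slides; the aggregate is scaled by 1/0.01 = 100:
SELECT c.State, SUM(f.Quantity) * 100 AS approx_quantity
FROM FactSample f, Customer c
WHERE f.Cust_key = c.Cust_key
GROUP BY c.State;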
Sampling and Joins
• Alternative approach for big dimension tables
– Compute 1% sample of fact
– Construct semi-join of big dimension with fact sample
• In other words, identify dimension rows that join with at least
one row in the fact sample
– Store semi-join result as “sample” of dimension
• The rest of the dimension rows are not relevant for queries
that use the fact sample
– (# of rows in semi-join result) ≤ (# of rows in fact
sample)
• Non-foreign-key joins cause further problems
– Fortunately, standard star schema queries only join on primary key-foreign key relationships
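A sketch of building the semi-joined dimension "sample", using hypothetical table names FactSample and CustomerSample; CREATE TABLE ... AS is one common (but not universal) way to materialize it:
CREATE TABLE CustomerSample AS
SELECT *
FROM Customer c
WHERE EXISTS (SELECT 1 FROM FactSample f WHERE f.Cust_key = c.Cust_key);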
Additional AQP Topics
• Non-uniform (stratified) sampling
– Can provide more accurate answers for some queries
– Example: query with filter State = ‘NV’
• Compare two equal-sized samples
– Uniform random sample
– Sample of only records from states in Pacific Time Zone
• Second sample has higher sampling rate → more relevant tuples →
more accurate answer
– Example: query that groups by State
• CA is better represented than WY in uniform sample
• Estimate for CA will be more accurate than estimate for WY
• Could use stratified sample with equal # of records from each state
– Different sampling rates for each state
– Scaling up the answer is more complicated
– Multiply sub-total for each state by a different scaling factor
• Stratified sample delivers lowest average error across all groups
– Researchers have explored various non-uniform sampling
techniques
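A sketch of the per-group scaling a stratified sample requires, assuming hypothetical tables StratSample(State, Amount) and StateRate(State, rate) holding each state's sampling rate:
SELECT s.State, SUM(s.Amount) / r.rate AS approx_sum
FROM StratSample s, StateRate r
WHERE s.State = r.State
GROUP BY s.State, r.rate;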
Additional AQP Topics
• Approximate answers using other synopses
– A sample is one data structure for summarizing a large data set
– Many other techniques exist
• Histograms
• Compression via wavelet decomposition
• “Sketches” based on random projection
– Methodology for AQP with other synopses is similar to using pre-computed samples
• Pre-compute a synopsis of the fact table
– And optionally synopses of the dimensions
• Translate query on facts and dimensions into query over synopses
• Answer the query over the synopses using the database server
• Generate estimated answer and confidence interval based on result
Next Week: Data Mining
• High-level course outline
– Logical database design
– Query processing
– Physical database design
– Data mining
• Data mining:
– “the process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”
– Or, “What can I do with my data warehouse?”
• Remainder of course
– Next 2 weeks: Data mining
• Decision trees, association rules, clustering
– Last 1 week: Project presentations & course review
– Final exam Wednesday December 8