CS 345: Topics in Data Warehousing

Download Report

Transcript CS 345: Topics in Data Warehousing

CS 345:
Topics in Data Warehousing
Tuesday, November 23, 2004
Review of Thursday’s Class
• Association Rule Mining
– Market basket analysis
– What is an association rule?
– Frequent itemsets
• Association rule mining algorithms
– A-priori algorithm
– Speeding up A-priori using hashing
– One- and two-pass algorithms
Overview of Today’s Class
• Clustering
– Types of clustering
– Clustering large data sets
• Administrative details
– Final exam
– Assignment #3 and #4
– Course Project
• Miscellaneous topics
–
–
–
–
Multiple fiscal calendars
Heterogeneous products
Sparse facts
Audit dimensions
Clustering
• One of the most popular data mining tasks
– Very general-purpose
• Divide records into groups of “similar” records
• What is a good clustering?
– Records in same cluster are similar
– Records in different clusters are dissimilar
• There are many ways to do it
–
–
–
–
Many notions of similarity have been proposed
Many clustering algorithms have been proposed
No consensus on which is “best”
Often quality of results is subjective
Clustering
• Are there 2 clusters or 3 clusters?
Types of Clustering Algorithms
• Hierarchical
– Agglomerative
• Initially, each record is its own cluster
• Repeatedly combine the 2 most similar clusters
– Divisive
• Initially, all records are in a single cluster
• Repeatedly split 1 cluster into 2 smaller clusters
• Partition-based
– Partition records into k clusters
– Optimize objective function measuring goodness of clustering
– Heuristic optimization via “hill-climbing”
• Random choice of initial clusters
• Iteratively refine clusters by re-assigning records when doing so
improves objective function
• Eventually converge to local maximum
Dendrograms
• A hierarchical clustering can be represented by
a dendrogram
• Many different clusterings in one
– Viewed at different levels of detail
Clustering and Data Warehouses
• Typically clustering is performed externally to database server
– Similar to association rule mining
– Data mining process:
•
•
•
•
•
Apply filters in data warehouse
Export data records that pass filters
Apply clustering algorithm to exported file
Assign each record to a cluster
Import cluster assignments back into warehouse
• Scaling clustering algorithms to large data sets
– Data summarization
•
•
•
•
Grow clusters incrementally in single pass
Summarize characteristics of each cluster using concise synopsis
Decide to split / merge clusters based on synopses
Example: BIRCH (Balanced Iterative Clustering and Reducing using
Hierarchies)
– Sampling 1
• Perform clustering on a random sample of the data
– Sampling 2
• Represent cluster by a sample group of representative points
• Example: CURE (Clustering Using REpresentatives)
Final Exam
• Final exam:
– Wednesday, Dec. 8
– 7:00pm-10:00pm
– Room TBD
• Open book / open notes
• Goals:
– Test understanding of major concepts
– Practice thinking critically about design alternatives
• Practice final handed out next Tuesday
– Format of real final will closely resemble practice final
Assignment #3
• Assignment #3 was very time-consuming for
many students
– I misjudged the difficult of the assignment
– Sorry about that!
– Assignment #4 will be much easier
• Grading policy for Assignment #3:
– Reasonable effort = full credit
– You won’t be penalized if your program wasn’t
completely finished
– Reward for finishing = don’t have to do Assignment #4
Assignment #4
•
Two options:
1. Finish Assignment #3
2. Write decision tree learning program
•
•
2nd option is described in handout
Can do both for extra credit
– Not advised unless your course project is
finished
Project Presentations
• Short presentations about course projects
– Next Tuesday and Thursday in class
– Sign up for your preferred day (handout)
• What is expected?
– 5-10 minute presentation
• Give an overview of your project
• Describe your results (preliminary results if necessary)
– Implementation projects:
• Overview of architecture
• Give a demo (if possible)
– Literature surveys:
• Summarize the papers being surveyed
– Informal presentation is fine
• Not meant to be a big deal
• No need to spend hours and hours preparing!
– Project doesn’t need to be complete by presentation date
• But you should have made good progress
• At least enough progress to make a decent presentation!
Project Write-Up
• Project write-up
– Main project deliverable
– Literature survey: written report
– Implementation project: whatever is appropriate
•
•
•
•
Document describing project design
Sample outputs
Executable code or link to demo-able system, if feasible
Contact the instructor if you’re not sure what to submit
• Due date: December 8
– Extensions available if needed
• Contact the instructor if you need one
– Recommended time frame:
•
•
•
•
Finish the project by your project presentation date
Submit your write-up when you do your presentation
Enjoy a low-stress finals week!
This may or may not be possible depending on your schedule…
Miscellaneous Topics
• Covered in Ch. 7-15 of Data Warehouse Toolkit
– These chapters are not required reading
– Density of new material is low
– Lots of application examples
•
•
•
•
•
Multiple calendars
Heterogeneous products
Unstructured text fields
Sparse facts
Audit dimension (data lineage)
Multiple Fiscal Calendars
• Businesses operate on fiscal calendars
– Used for accounting purposes
– Fiscal year may not correspond to calendar year
• Stanford’s fiscal year: Sept. 1 – Aug. 31
• US Government fiscal year: Oct. 1 – Sept. 30
– Fiscal month may not correspond to calendar month
• 5-4-4 calendar
• ISO week
• Date dimension includes fiscal calendar attributes
– Fiscal year, fiscal month, fiscal week
• Some organizations may use multiple fiscal calendars
– Subsidiaries or individual lines of business may report differently
– E.g. for historical reasons, following a merger
– How to handle this in the date dimension?
Three Options for
Multiple Calendars
• Three options
– Include all attributes in Date dimension
– Use outrigger tables
– Use extended dimension tables
• 1. Attributes for all fiscal calendars in Date dimension
– E.g. FiscalYear, FederalFiscalYear, OldFiscalYear
– Advantage: Simple data model (no new tables)
– Disadvantage: Wide/cluttered data dimension
• 2. Outrigger table for each non-standard fiscal calendar
– 1-to-1 relationship between outrigger row and date dim. row
– Advantage: Date dimension stays thin and simple
– Disadvantages:
• Extra tables
• Queries using non-standard attributes require extra join
Extended Dimension Tables
• 3. Create several versions of the Date dimension table
– The basic Date dimension contains all the “core” date attributes
– Alternative versions of the Date dimension contain:
• Core date attributes
• Additional attributes from a particular fiscal calendar
• All versions share the same surrogate keys
– E.g. surrogate key of Nov. 23, '04 = 1287
• In Date dimension...
• And also in all alternative Date dimensions
• Using alternative dimensions
– Most queries join fact to basic Date dimension
– Some queries join fact table to extended Date dimension instead
– Advantages:
• Basic Date dimension stays thin and simple
• No extra joins required
– Disadvantages:
• Additional management overhead to synchronize keys
• Extra tables / extra copies of data
Heterogeneous Products
• Some businesses have a variety of dissimilar products
– Example: Financial services
– American Express products include:
•
•
•
•
Credit cards
Checking and Savings
Brokerage accounts
Insurance
•Annuities
•Financial Planning
•Traveler’s checks
•…and many more
• Relevant attributes vary from one product to another
– Mortgages have very different attributes from credit cards
– Facts associated with each product may differ too
• Two desired types of analysis
– Global
• Unified view of all customer interactions across all lines of business
– Specific
• Detailed analysis of individual lines of business
Handling Heterogeneous Products
• Design “core” and “custom” schemas
• Core schema includes common data elements
– Useful for global analyses that cut across product lines
– Core facts include measures relevant to most or all product lines
– Core dimensions have attributes relevant to most/all product lines
• Custom schemas are specialized for specific product lines
– One custom schema for mortgages, another for credit cards, etc.
– Useful for detailed analyses of specific product lines
– Custom facts include:
• Core measurements
– E.g. Balance
• Product-specific measurements
– E.g. Finance charge, interest rate, fees
– Custom dimensions include:
• Core attributes
• Product-specific attributes
Same Three Options Apply
• All possible attributes included in “monster” dimension
and fact tables
– Infeasible if product lines are too diverse
– Data becomes overly sparse, wastes space
– Most entries in most tables will be NULL or N/A
• Outrigger tables with product-specific dimension
attributes
– Separate outrigger table for each product line
– Could have “fact outrigger tables” as well
• Core measurements are stored in main fact table
• “Outrigger fact tables” have additional measurements for specific
product lines
• Detailed analysis requires joining outrigger fact to main fact
• Saves space compared to storing separate core + extended fact
tables
• Core and extended dimension and fact tables
Option 1: Include All Attributes
Extended measurements
Extended attributes
Dimension
Table
Fact
Table
Dimension
Table
Core measurements
Dimension foreign keys
Core attributes
Option 2: Use Outriggers
Extended attributes
Extended measurements
Dim.
Table
Fact
Table
Dim.
Table
Core attributes
Core measurements
Dimension foreign keys
Option 3: Core + Extended
Extended attributes
Extended measurements
Core attributes
Fact
Dim.
Dim.
Table
Fact
Table
Dim.
Fact
Dim.
Dim.
Dim.
Table
Fact
Dim.
Dim.
Core measurements
Dimension foreign keys
Core attributes
Core measurements
Dimension foreign keys
Unstructured Text Fields
• Often data is collected as free-form text
– Or “semi-structured” text
– Examples:
• Comments fields
• Text of product reviews
• Résumés
• Querying free-form text requires information retrieval techniques
– Keyword search
• Special keyword indexing techniques (inverted indexes)
– Stemming
• Derived = derive = deriving
– “Sounds like” operator (Soundex)
• Smith = Smythe
• Ascroft = Ashcraft
• Sometimes query answering paradigm is different
– Relational query: Exact match
– IR query: Ranked results, fuzzy match
– Combining IR & relational query styles = research area
Sparse Facts
• Some applications have:
–
–
–
–
Many possible measurements
Only a few of them are made per fact record
Which ones varies from record to record
Examples:
• Data from scientific experiments
• Medical diagnostics for hospital patients
• Standard approach
– Include all measurement columns in fact
– Most measured values are 0 or NULL
– Can be wasteful of space
Sparse Facts
• Alternative approach
– One fact row with n measurements → n fact rows with one
measurement each
• Grain of fact table is changed
– Include just a single measurement column
• Or one column per data type: int_value, money_value,
decimal_value
– Special “fact type” dimension
• Indicates which measurement is being recorded
• All queries must filter to a single fact type or aggregates are
meaningless!
– Advantage:
• Saves space
• Queries are faster
– Disadvantages:
• Users must take care not to SUM incompatible fact types
– Temperature + Blood Pressure = Meaningless
• Difficult to compare multiple measures associated with same event
Audit Dimension
• Data warehouses are never perfect
– Mistakes can be made during load
– Incorrect data can be discovered later and corrected
• Modeling change history of warehouse data can be useful
– Where did this row come from?
– Has it been updated since it was first created?
• Known as the data lineage or data provenance problem
– Accurate tracking can be difficult when data from multiple sources is
combined and transformed
– Active area of research
• One solution: add audit dimension to fact
– Audit dimension includes data about when and how the fact record was
generated
– Goal: Understand and record process by which data gets into the
warehouse
– Answer questions like:
• Why did the answer to this query change between last week and this week?
• What fact rows were produced by last Thursday’s load? The data was bad!
– Sample attributes: SourceSystem, ExtractDate, AllocationPolicy,
ChangeDate, ChangeReason, IsOutlierValue