ppt - Stanford University
Download
Report
Transcript ppt - Stanford University
CS 345:
Topics in Data Warehousing
Tuesday, September 28, 2004
Outline of Today’s Class
•
•
•
•
What is data warehousing?
Transaction processing vs. data analysis
Course logistics
Data integration
A Brief History of
Information Technology
• The “dark ages”: paper forms in file cabinets
• Computerized systems emerge
– Initially for big projects like Social Security
– Same functionality as old paper-based systems
• The “golden age”: databases are everywhere
– Most activities tracked electronically
– Stored data provides detailed history of activity
• The next step: use data for decision-making
–
–
–
–
The focus of this course!
Made possible by omnipresence of IT
Identify inefficiencies in current processes
Quantify likely impact of decisions
Databases for Decision Support
• 1st phase: Automating existing processes
makes them more efficient.
– Automation → Lots of well-organized, easily
accessed data
• 2nd phase: Data analysis allows for better
decision-making.
– Analyze data → better understanding
– Better understanding → better decisions
• “Data Entry” vs. “Thinking”
– Data analysts are decision-makers: managers,
executives, etc.
OLTP vs. OLAP
• OLTP: On-Line
Transaction Processing
– Many short transactions
(queries + updates)
– Examples:
• Update account balance
• Enroll in course
• Add book to shopping cart
– Queries touch small
amounts of data (one
record or a few records)
– Updates are frequent
– Concurrency is biggest
performance concern
• OLAP: On-Line
Analytical Processing
– Long transactions, complex
queries
– Examples:
• Report total sales for each
department in each month
• Identify top-selling books
• Count classes with fewer
than 10 students
– Queries touch large
amounts of data
– Updates are infrequent
– Individual queries can
require lots of resources
Why OLAP & OLTP don’t mix (1)
Different performance requirements
• Transaction processing (OLTP):
– Fast response time important (< 1 second)
– Data must be up-to-date, consistent at all times
• Data analysis (OLAP):
– Queries can consume lots of resources
– Can saturate CPUs and disk bandwidth
– Operating on static “snapshot” of data usually OK
• OLAP can “crowd out” OLTP transactions
– Transactions are slow → unhappy users
• Example:
– Analysis query asks for sum of all sales
– Acquires lock on sales table for consistency
– New sales transaction is blocked
Why OLAP & OLTP don’t mix (2)
Different data modeling requirements
• Transaction processing (OLTP):
– Normalized schema for consistency
– Complex data models, many tables
– Limited number of standardized queries and updates
• Data analysis (OLAP):
– Simplicity of data model is important
• Allow semi-technical users to formulate ad hoc queries
– De-normalized schemas are common
• Fewer joins → improved query performance
• Fewer tables → schema is easier to understand
Why OLAP & OLTP don’t mix (3)
Analysis requires data from many sources
• An OLTP system targets one specific process
– For example: ordering from an online store
• OLAP integrates data from different processes
– Combine sales, inventory, and purchasing data
– Analyze experiments conducted by different labs
• OLAP often makes use of historical data
– Identify long-term patterns
– Notice changes in behavior over time
• Terminology, schemas vary across data sources
– Integrating data from disparate sources is a major
challenge
Data Warehouses
• Doing OLTP and OLAP in the same database
system is often impractical
– Different performance requirements
– Different data modeling requirements
– Analysis queries require data from many sources
• Solution: Build a “data warehouse”
–
–
–
–
Copy data from various OLTP systems
Optimize data organization, system tuning for OLAP
Transactions aren’t slowed by big analysis queries
Periodically refresh the data in the warehouse
Course Logistics
• Course web site:
http://cs345.stanford.edu
• Course format will be lecture-based
– As opposed to a paper-reading course
• Prerequisite:
– Knowledge of SQL
Assigned Work
• Five homework assignments
– One problem set
– Four programming assignments
• Not a lot of code to write
• Emphasis will be on interacting with Oracle
• Course project
– Open-ended
– Focus on a topic of your choosing
– Any of these types:
• Research project, or…
• Programming project, or…
• Survey of research literature
– May be done individually or in groups of two
• Final Exam
High-Level Course Outline
• Logical Database Design
– How should the data be modeled?
– Designing the data warehouse schema
• Query Processing
– Analysis queries are hard to answer efficiently
– What techniques are available to the DBMS?
• Physical Database Design
– How should the data be organized on disk?
– What data structures should be used?
• Data Mining
– What use is all this data?
– Which questions should we ask our data warehouse?
Additional Topics
• Related topics to be touched on briefly:
–
–
–
–
–
–
–
–
–
Data integration
Data cleaning
Approximate query answering
Data lineage
Data visualization
Incremental maintenance of materialized views
Answering queries using views
Indexing special data types (spatial, text, geographic)
Metadata management
• Projects can be done in these areas
The Textbook
• “The Data Warehouse Toolkit”
by Ralph Kimball and Margy Ross
•
•
•
•
•
•
•
Written by well-known data warehouse designer
Clearly written and readable
Lots of generic but realistic examples
Semi-technical (no math!)
Business-focused
We’ll use it for the first one-third of the course
Get the second edition!
Course Objectives
• Gain practical understanding of how data
warehouses are built and used
• Gain exposure to data modeling “best practices”
• Learn techniques used to process complex
queries over very large data sets
• Understand the performance trade-offs that
come from alternative data structures
• Learn commonly-used methods for mining and
analysis of large data sets
• Become familiar with current research directions
in data warehousing and related areas
Loading the Data Warehouse
Data is periodically
extracted
Data is cleansed and
transformed
Users query the data
warehouse
Source Systems
(OLTP)
Data Staging Area
Data Warehouse
Data Integration is Hard
• Data warehouses combine data from multiple sources
• Data must be translated into a consistent format
• Data integration represents ~80% of effort for a typical
data warehouse project!
• Some reasons why it’s hard:
– Metadata is poor or non-existent
– Data quality is often bad
• Missing or default values
• Multiple spellings of the same thing
(Cal vs. UC Berkeley vs. University of California)
– Inconsistent semantics
• What is an airline passenger?
Federated Databases
• An alternative to data warehouses
• Data warehouse
– Create a copy of all the data
– Execute queries against the copy
• Federated database
– Pull data from source systems as needed to answer queries
• “lazy” vs. “eager” data integration
Query
Extraction
Query
Answer
Warehouse
Source
Systems
Rewritten
Queries
Data Warehouse
Answer Mediator
Federated Database
Source
Systems
Warehouses vs. Federation
• Advantages of federated databases:
– No redundant copying of data
– Queries see “real-time” view of evolving data
– More flexible security policy
• Disadvantages of federated databases:
–
–
–
–
Analysis queries place extra load on transactional systems
Query optimization is hard to do well
Historical data may not be available
Complex “wrappers” needed to mediate between analysis server
and source systems
• Data warehouses are much more common in practice
– Better performance
– Lower complexity
– Slightly out-of-date data is acceptable