Data Warehousing
Download
Report
Transcript Data Warehousing
Data Warehouses
Richard Goerwitz
What’s a Data Warehouse?
People disagree on what a DW is
Instead of defining one up front I’m
going to
• Talk about key concepts used by data
warehouse architects
• Show you an example of a small data
warehouse component
• Talk about how you can repeat what I’ve
done
Key Terms, Acronyms
By the end of this talk, you’ll know
what these things all mean:
• OLTP
• OLAP
• ETL
• Star schema
• Conformed dimension
• Data mart
• Data warehouse
OLTP
OLTP = online transaction processing
The process of moving data around to
handle day-to-day affairs
•
•
•
•
Scheduling classes
Registering students
Recording grades
Recording payments, etc.
Systems supporting this kind of activity
are called transactional systems
Transactional Systems
Transactional systems are optimized
primarily for the here and now
• They allow for constant change
• Don’t record all previous data states
Name changes
Course rosters, etc.
• Support many simultaneous users
• Permit often heavy read/write access
Transactional Databases
Databases that support transactional
systems must be heavily normalized
This means they should • Shun duplicate data (1NF)
• Keep dependent data with the stuff it depends
on (2NF)
• Keep data dependencies straightforward within
each database table (3NF)
• Structure tables so that changes only “lock up”
a small part of the DB at a time
Carleton Databases
Here are some major DBs at Carleton
•
•
•
•
•
•
Informix – Card access
UniData 6.1 – SIS, financials, etc.
MS SQL Server 2000 – Asset tracking, etc.
Oracle 9i – Alumni, document management
MySQL 4.x – web system back ends
PostgreSQL 8.x – courses like this one
You tell me • Which ones are transactional in nature?
OLAP
OLAP = online analytical processing
The process of analyzing data derived
from OLTP (and other) systems
What might we analyze?
•
•
•
•
Class enrollments, by department
Courses taught, by department
Who is majoring in what
All of the above, correlated against an
additional time dimension
OLAP Support Systems
OLAP support systems are optimized
for analysis
What does this mean?
• They are geared mainly for read access
• They support fewer simultaneous users
• They hold snapshots of OLTP data
Provide history
Give us time depth to our analyses
• They change relatively slowly
ETL
When OLAP support systems change,
the changes are typically made via
periodic (e.g., nightly) ETL processes
ETL = extract, transform, load
ETL means specifically • Extracting data from various sources
• Transforming and cleaning the data
• Loading data into databases used for
analysis and reporting
Star Schemas
ETL processes need to load data into a
database
How does this database look?
• It’s structurally simple
• Separates data into fact tables and dimension
tables
• Reorganizes fact and dimension tables into a
series of star-like schemas
• (Don’t worry, I’ll be showing you a graphic
example of one of these.)
Data Marts
A data mart is really just a series of
star schemas
Dimension tables in a series of star
schemas must be identical or at least
conformed
• By conformed I mean
The tables are proper subsets of each other
They are hierarchized the same way
internally
Data Warehouses
So what is a data warehouse (DW)?
As noted, people use this term loosely, in
various ways
But in general we can say that a DW:
• Encompasses a number of smaller data marts
• Derives its data from multiple back-end
systems
• Is optimized for OLAP, not OLTP
• Holds historical snapshots
• Must be heavily documented