Data Warehousing

Download Report

Transcript Data Warehousing

Data Warehouses
Richard Goerwitz
What’s a Data Warehouse?


People disagree on what a DW is
Instead of defining one up front I’m
going to
• Talk about key concepts used by data
warehouse architects
• Show you an example of a small data
warehouse component
• Talk about how you can repeat what I’ve
done
Key Terms, Acronyms

By the end of this talk, you’ll know
what these things all mean:
• OLTP
• OLAP
• ETL
• Star schema
• Conformed dimension
• Data mart
• Data warehouse
OLTP


OLTP = online transaction processing
The process of moving data around to
handle day-to-day affairs
•
•
•
•

Scheduling classes
Registering students
Recording grades
Recording payments, etc.
Systems supporting this kind of activity
are called transactional systems
Transactional Systems

Transactional systems are optimized
primarily for the here and now
• They allow for constant change
• Don’t record all previous data states


Name changes
Course rosters, etc.
• Support many simultaneous users
• Permit often heavy read/write access
Transactional Databases


Databases that support transactional
systems must be heavily normalized
This means they should • Shun duplicate data (1NF)
• Keep dependent data with the stuff it depends
on (2NF)
• Keep data dependencies straightforward within
each database table (3NF)
• Structure tables so that changes only “lock up”
a small part of the DB at a time
Carleton Databases

Here are some major DBs at Carleton
•
•
•
•
•
•

Informix – Card access
UniData 6.1 – SIS, financials, etc.
MS SQL Server 2000 – Asset tracking, etc.
Oracle 9i – Alumni, document management
MySQL 4.x – web system back ends
PostgreSQL 8.x – courses like this one
You tell me • Which ones are transactional in nature?
OLAP



OLAP = online analytical processing
The process of analyzing data derived
from OLTP (and other) systems
What might we analyze?
•
•
•
•
Class enrollments, by department
Courses taught, by department
Who is majoring in what
All of the above, correlated against an
additional time dimension
OLAP Support Systems


OLAP support systems are optimized
for analysis
What does this mean?
• They are geared mainly for read access
• They support fewer simultaneous users
• They hold snapshots of OLTP data


Provide history
Give us time depth to our analyses
• They change relatively slowly
ETL



When OLAP support systems change,
the changes are typically made via
periodic (e.g., nightly) ETL processes
ETL = extract, transform, load
ETL means specifically • Extracting data from various sources
• Transforming and cleaning the data
• Loading data into databases used for
analysis and reporting
Star Schemas


ETL processes need to load data into a
database
How does this database look?
• It’s structurally simple
• Separates data into fact tables and dimension
tables
• Reorganizes fact and dimension tables into a
series of star-like schemas
• (Don’t worry, I’ll be showing you a graphic
example of one of these.)
Data Marts


A data mart is really just a series of
star schemas
Dimension tables in a series of star
schemas must be identical or at least
conformed
• By conformed I mean


The tables are proper subsets of each other
They are hierarchized the same way
internally
Data Warehouses



So what is a data warehouse (DW)?
As noted, people use this term loosely, in
various ways
But in general we can say that a DW:
• Encompasses a number of smaller data marts
• Derives its data from multiple back-end
systems
• Is optimized for OLAP, not OLTP
• Holds historical snapshots
• Must be heavily documented