Data Warehousing


What is a Data Warehouse? And Why Are So Many Schools Setting Them Up?
Richard Goerwitz
What Is a Data Warehouse?

Nobody can agree
So I’m not actually going to define a DW
Don’t feel cheated, though
By the end of this talk, you’ll
• Understand key concepts that underlie all
warehouse implementations (“talk the talk”)
• Understand the various components out of
which DW architects construct real-world data
warehouses
• Understand what a data warehouse project
looks like
Why Are Schools Setting Up
Data Warehouses?

A data warehouse makes it easier to:
• Optimize classroom and computer lab usage
• Refine admissions ratings systems
• Forecast future demand for courses, majors
• Tie private spreadsheet data into central repositories
• Correlate admissions and IR data with outcomes such as:
  - GPAs
  - Placement rates
  - Happiness, as measured by alumni surveys
• Notify advisors when extra help may be needed, based on:
  - Admissions data (student vitals; SAT, etc.)
  - Special events: an A student suddenly gets a C in his/her major
  - Slower trends: a student’s GPA falls for > 2 semesters/terms
• (Many other examples could be given!)

Better information = better decisions
• Better admission decisions
• Better retention rates
• More effective fundraising, etc.
Talking The Talk

To think and communicate usefully about data warehouses you’ll need to understand a set of common terms and concepts:
• OLTP
• ODS
• OLAP, ROLAP, MOLAP
• ETL
• Star schema
• Conformed dimension
• Data mart
• Cube
• Metadata

Even if you’re not an IT person, pay heed:
• You’ll have to communicate with IT people
• More importantly: evidence shows that IT will only build a successful warehouse if you are intimately involved!
OLTP


OLTP = online transaction processing
The process of moving data around to handle day-to-day affairs:
• Scheduling classes
• Registering students
• Tracking benefits
• Recording payments, etc.

Systems supporting this kind of activity are called transactional systems
Transactional Systems

Transactional systems are optimized primarily for the here and now:
• Can support many simultaneous users
• Can support heavy read/write access
• Allow for constant change
• Are big, ugly, and often don’t give people the data they want

As a result, a lot of data ends up in shadow databases
Some ends up locked away in private spreadsheets
Transactional systems don’t record all previous data states
Lots of data gets thrown away or archived, e.g.:
• Admissions data
• Enrollment data
• Asset tracking data (“How many computers did we support each year, from 1996 to 2006, and where do we expect to be in 2010?”)
Simple Transactional Database

Map of the Microsoft Windows Update Service (WUS) back-end database
• Diagrammed using Sybase PowerDesigner

Each green box is a database “table”
Arrows are “joins” or foreign keys
This is simple for an OLTP back end
More Complex Example



Recruitment Plus back-end database
Used by many admissions offices
Note again:
• Green boxes are tables
• Lines are foreign key relationships
• Purple boxes are views

Considerable expertise is required to report off this database!
Imagine what it’s like for even more complex systems:
• Colleague
• SCT Banner (over 4,000 tables)
The “Reporting Problem”



Often we require OLTP data as a snapshot, in a spreadsheet or report
Reports require querying back-end OLTP support databases
But OLTP databases are often very complex, and typically
• Contain many, often obscure, tables
• Utilize cryptic, unintuitive field/column names
• Don’t store all necessary historical data

As a result, reporting becomes a problem:
• Requires special expertise
• May require modifications to production OLTP systems
• Becomes harder and harder for staff to keep up!
Workarounds

Ways of working around the reporting problem include:
1. Have OLTP system vendors do the work
   • Provide canned reports
   • Write reporting GUIs for their products
2. Hire more specialists
   • To create simplified views of OLTP data
   • To write reports and create snapshots
3. Periodically copy data from OLTP systems to a place where
   • The data is easier to understand
   • The data is optimized for reporting
   • The data is easily pluggable into reporting tools
ODS



ODS = operational data store
ODSs were an early workaround to the “reporting problem”
To create an ODS you
• Build a separate/simplified version of an OLTP system
• Periodically copy data into it from the live OLTP system
• Hook it to operational reporting tools

An ODS can be an integration point or real-time “reporting database” for an operational system
It’s not enough for full enterprise-level, cross-database analytical processing
OLAP


OLAP = online analytical processing
OLAP is the process of creating and summarizing historical, multidimensional data
• To help users understand the data better
• To provide a basis for informed decisions
• To allow users to manipulate and explore data themselves, easily and intuitively

More than just “reporting”
Reporting is just one (static) product of OLAP
OLAP Support Databases


OLAP systems require support databases
These databases typically
• Support fewer simultaneous users than OLTP back ends
• Are structured simply, i.e., denormalized
• Can grow large
• Hold snapshots of data in OLTP systems
• Provide history/time depth to our analyses
• Are optimized for read (not write) access
• Are updated via periodic batch (e.g., nightly) ETL processes
ETL Processes

ETL = extract, transform, load
• Extract data from various sources
• Transform and clean the data from those sources
• Load the data into databases used for analysis and reporting

ETL processes are coded in various ways:
• By hand in SQL, UniBASIC, etc.
• Using more general programming languages
• In semi-automated fashion using specialized ETL tools like Cognos Decision Stream

Most institutions do hand ETL; but note well:
• Hand ETL is slow
• It requires specialized knowledge
• It becomes extremely difficult to maintain as code accumulates and databases/personnel change!
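A hand-coded ETL job of the kind described above can be sketched in a few lines of Python, with SQLite standing in for the source and target databases. All table and column names here (enr, stu_id, enrollment, etc.) are invented for illustration; they come from no real transactional system:

```python
import sqlite3

def run_etl(src, tgt):
    """Extract raw enrollment rows, clean them, and load them for reporting."""
    # Extract: pull raw rows from the transactional (OLTP) source
    rows = src.execute(
        "SELECT stu_id, crs_cd, grd FROM enr WHERE grd IS NOT NULL"
    ).fetchall()

    # Transform: normalize codes and convert letter grades to points
    points = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}
    cleaned = [
        (sid, code.strip().upper(), points[g.strip().upper()])
        for sid, code, g in rows
        if g.strip().upper() in points
    ]

    # Load: write to the reporting database, with readable column names
    tgt.execute(
        "CREATE TABLE IF NOT EXISTS enrollment "
        "(student_id INTEGER, course_code TEXT, grade_points REAL)"
    )
    tgt.executemany("INSERT INTO enrollment VALUES (?, ?, ?)", cleaned)
    tgt.commit()
    return len(cleaned)

# Demo: a tiny OLTP source with one unusable row ("W" = withdrawal)
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE enr (stu_id INTEGER, crs_cd TEXT, grd TEXT)")
src.executemany("INSERT INTO enr VALUES (?, ?, ?)",
                [(1, "bio101 ", "A"), (2, "his205", "b"), (3, "mat301", "W")])
tgt = sqlite3.connect(":memory:")
loaded = run_etl(src, tgt)  # loads the two gradeable rows; skips the "W"
```

Even this toy shows why hand ETL gets hard to maintain: every source quirk (stray whitespace, mixed case, withdrawal codes) becomes another line of code someone must remember.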
Where Does the Data Go?



What sort of database do the ETL processes dump data into?
Typically, into very simple table structures
These table structures are:
• Denormalized
• Minimally branched/hierarchized
• Structured into star schemas
So What Are Star Schemas?

Star schemas are collections of data arranged into star-like patterns
• They have fact tables in the middle, which contain amounts and measures (like counts, dollar amounts, GPAs)
• Dimension tables sit around the outside; they contain labels and classifications (like names, geocodes, majors)
• For faster processing, aggregate fact tables are sometimes also used (e.g., counts pre-averaged for an entire term)

Star schemas should
• Have descriptive column/field labels
• Be easy for users to understand
• Perform well on queries
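As a concrete sketch of these ideas, here is a toy star schema built in SQLite via Python. The table and column names (dim_student, fact_enrollment, etc.) are invented for illustration, not drawn from any real warehouse:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Dimension tables around the outside: labels and classifications
    CREATE TABLE dim_student (student_key INTEGER PRIMARY KEY,
                              name TEXT, major TEXT);
    CREATE TABLE dim_term    (term_key INTEGER PRIMARY KEY,
                              year INTEGER, semester TEXT);
    -- Fact table in the middle: measures, keyed to the dimensions
    CREATE TABLE fact_enrollment (student_key INTEGER REFERENCES dim_student,
                                  term_key INTEGER REFERENCES dim_term,
                                  credits REAL, grade_points REAL);
""")
con.executemany("INSERT INTO dim_student VALUES (?, ?, ?)",
                [(1, "Ann", "Biology"), (2, "Bob", "History")])
con.execute("INSERT INTO dim_term VALUES (1, 2006, 'Fall')")
con.executemany("INSERT INTO fact_enrollment VALUES (?, ?, ?, ?)",
                [(1, 1, 4.0, 16.0), (2, 1, 3.0, 9.0)])

# A typical star query: GPA by major, joining the fact to its dimensions
gpa_by_major = dict(con.execute("""
    SELECT s.major, SUM(f.grade_points) / SUM(f.credits)
    FROM fact_enrollment f
    JOIN dim_student s ON s.student_key = f.student_key
    JOIN dim_term t    ON t.term_key = f.term_key
    WHERE t.year = 2006
    GROUP BY s.major
"""))
```

Note how the query reads almost like the English question: facts supply the numbers, dimensions supply the labels to group and filter by.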
A Very Simple Star Schema
Data center UPS power output
Dimensions: Phase, Time, Date
Facts: Volts, Amps, etc.
A More Complex Star Schema


Freshman survey data (HERI/CIRP)
Dimensions:
• Questions
• Survey years
• Data about test takers
Facts:
• Answer (text)
• Answer (raw)
• Count (1)

Oops, answers should have been placed in their own dimension (creating a “factless fact table”). I’ll demo a better version of this star later!
Also oops:
• Not a star
• Snowflaked!
Data Marts

One definition:
• One or more star schemas that present data on a single or related set of business processes

Data marts should not be built in isolation
They need to be connected via dimension tables that are
• The same, or subsets of each other
• Hierarchized the same way internally

So, e.g., if I construct data marts for…
• GPA trends, student major trends, enrollments
• Freshman survey data, senior survey data, etc.

…I connect these marts via a conformed student dimension
• Makes correlation of data across star schemas intuitive
• Makes it easier for OLAP tools to use the data
• Allows nonspecialists to do much of the work
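To see why a conformed dimension pays off, here is a toy sketch (Python/SQLite, all names hypothetical) in which two marts share one student dimension. Correlating survey answers with GPAs then takes a single join, no specialist knowledge of either source system required:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- One conformed student dimension shared by both marts
    CREATE TABLE dim_student (student_key INTEGER PRIMARY KEY,
                              class_year INTEGER);
    -- Mart 1: GPA trends
    CREATE TABLE fact_gpa    (student_key INTEGER, term TEXT, gpa REAL);
    -- Mart 2: freshman survey answers
    CREATE TABLE fact_survey (student_key INTEGER, question TEXT, answer TEXT);
""")
con.executemany("INSERT INTO dim_student VALUES (?, ?)", [(1, 2009), (2, 2009)])
con.executemany("INSERT INTO fact_gpa VALUES (?, ?, ?)",
                [(1, "2006FA", 3.8), (2, "2006FA", 2.1)])
con.executemany("INSERT INTO fact_survey VALUES (?, ?, ?)",
                [(1, "study_hours", "10+"), (2, "study_hours", "0-2")])

# Both marts key on the same student dimension, so correlating
# survey answers with GPA is a simple three-way join:
result = con.execute("""
    SELECT v.answer, AVG(g.gpa)
    FROM dim_student d
    JOIN fact_gpa g    ON g.student_key = d.student_key
    JOIN fact_survey v ON v.student_key = d.student_key
    WHERE v.question = 'study_hours'
    GROUP BY v.answer
""").fetchall()
```

If the marts had each invented their own student keys, this query would instead need a fragile name-matching step between them.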
Simple Data Mart Example
UPS data mart:
• Battery star (by battery): run-time, % charged, current
• Input star (by phase): voltage, current
• Output star (by phase): voltage, current
• Sensor star (by sensor): temp, humidity

Note the conformed date and time dimensions!
CIRP Star/Data Mart




CIRP freshman survey data
Corrected from a previous slide
Note the CirpAnswer dimension
Note the student dimension (ties in with other marts)
CIRP Mart in Cognos BI 8
ROLAP, MOLAP

ROLAP = OLAP via direct relational query
• E.g., against a (materialized) view
• Against star schemas in a warehouse

MOLAP = OLAP via a multidimensional database (MDB)
• An MDB is a special kind of database
• Treats data kind of like a big, fast spreadsheet
• MDBs typically draw data in from a data warehouse
• Built to work best with star schemas
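ROLAP in miniature: an OLAP-style rollup answered by direct SQL against relational star tables, here through an ordinary SQLite view (names hypothetical; SQLite has plain views rather than true materialized views):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_term (term_key INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE fact_enrollment (term_key INTEGER, headcount INTEGER);
    -- A pre-aggregated, "cube-like" view rolled up by year;
    -- users query the view without touching the underlying star
    CREATE VIEW enrollment_by_year AS
        SELECT t.year, SUM(f.headcount) AS total
        FROM fact_enrollment f
        JOIN dim_term t ON t.term_key = f.term_key
        GROUP BY t.year;
""")
con.executemany("INSERT INTO dim_term VALUES (?, ?)", [(1, 2005), (2, 2006)])
con.executemany("INSERT INTO fact_enrollment VALUES (?, ?)",
                [(1, 480), (1, 20), (2, 510)])

totals = dict(con.execute("SELECT year, total FROM enrollment_by_year"))
```

A MOLAP tool would instead pre-load those same star tables into its own multidimensional store; the relational version trades some speed for simplicity.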
Data Cubes


The term data cube means different things to different people
Various definitions:
1. A star schema
2. Any DB view used for reporting
3. A three-dimensional array in an MDB
4. Any multidimensional MDB array (really a hypercube)

Which definition do you suppose is technically correct?
Metadata


Metadata = data about data
In a data warehousing context it can mean many things:
• Information on data in source OLTP systems
• Information on ETL jobs and what they do to the data
• Information on data in marts/star schemas
• Documentation in OLAP tools on the data they manipulate

Many institutions make metadata available via data malls or warehouse portals, e.g.:
• University of New Mexico
• UC Davis
• Rensselaer Polytechnic Institute
• University of Illinois

Good ETL tools automate the setup of malls/portals!
The Data Warehouse


OK, now we’re experts in terms like OLTP, OLAP, star schema, metadata, etc.
Let’s use some of these terms to describe how a DW works. A data warehouse:
• Provides ample metadata – data about the data
• Utilizes easy-to-understand column/field names
• Feeds multidimensional databases (MDBs)
• Is updated via periodic (mainly nightly) ETL jobs
• Presents data in a simplified, denormalized form
• Utilizes star-like fact/dimension table schemas
• Encompasses multiple, smaller data “marts”
• Supports OLAP tools (Access/Excel, Safari, Cognos BI)
• Derives data from (multiple) back-end OLTP systems
• Houses historical data, and can grow very big
A Data Warehouse is Not…

Vendor and consultant proclamations aside, a data warehouse is not:
• A project
  - With a specific end date
• A product you buy from a vendor
  - Like an ODS (such as SCT’s)
  - A canned “warehouse” supplied by iStrategy
  - Cognos ReportNet
• A database schema or instance
  - Like Oracle
  - SQL Server
• A cut-down version of your live transactional database
Kimball & Caserta’s Definition

According to Ralph Kimball and Joe Caserta, a data warehouse is:
  A system that extracts, cleans, conforms, and delivers source data into a dimensional data store and then supports and implements querying and analysis for the purpose of decision making.

Another definition: the union of all the enterprise’s data marts
Aside: the Kimball model is not without its critics:
• E.g., Bill Inmon
Example Data Warehouse (1)


This one is RPI’s
Five parts:
• Sources
• ETL stuff
• DW proper
• Cubes etc.
• OLAP apps
Example Data Warehouse (2)


Caltech’s DW
Five parts:
• Source systems
• ETL processes
• Data marts
• FM/metadata
• Reporting and analysis tools
• Note: they’re also customers of Cognos!
So Where is Colorado College?

Phil Goldstein (Educause Center for Applied Research fellow) identifies the major deployment levels:
• Level 1: Transactional systems only
• Level 2a: ODS or single data mart; no ETL
• Level 2: ODS or single data mart with ETL tools
• Level 3a: Warehouse or multiple marts; no ETL; OLAP
• Level 3b: Warehouse or multiple marts; ETL; OLAP
• Level 3: Enterprise-wide warehouse or multiple marts; ETL tools; OLAP tools

Goldstein’s study was just released in late 2005
It’s very good; based on real survey data
Which level is Colorado College at?
Implementing a Data Warehouse

In many organizations IT people want to huddle and work out a warehousing plan, but in fact
• The purpose of a DW is decision support
• The primary audience of a DW is therefore College decision makers
• It is College decision makers, therefore, who must determine
  - Scope
  - Priority
  - Resources

Decision makers can’t make these determinations without an understanding of data warehouses
It is therefore imperative that key decision makers first be educated about data warehouses
• Once this occurs, it is possible to
  - Elicit requirements (a critical step that’s often skipped)
  - Determine priorities/scope
  - Formulate a budget
  - Create a plan and timeline, with real milestones and deliverables!
Is This Really a Good Plan?



Sure, according to Phil Goldstein (Educause Center for Applied Research)
He’s conducted extensive surveys on “academic analytics” (= business intelligence for higher ed)
His four recommendations for improving analytics:
1. Key decision makers must lead the way
2. Technologists must collaborate
   • Must collect requirements
   • Must form strong partnerships with functional sponsors
3. IT must build the needed infrastructure
   • Carleton violated this rule with Cognos BI
   • As we discovered, without an ETL/warehouse infrastructure, success with OLAP is elusive
4. Staff must train and develop deep analysis skills

Goldstein’s findings mirror closely the advice of industry heavyweights – Ralph Kimball, Laura Reeves, Margy Ross, Warren Thornthwaite, etc.
Isn’t a DW a Huge Undertaking?



Sure, it can be huge
Don’t hold on too tightly to the big-sounding word, “warehouse”
Luminaries like Ralph Kimball have shown that a data warehouse can be built incrementally
• Can start with just a few data marts
• Targeted consulting help will ensure proper, extensible architecture and tool selection
What Takes Up the Most Time?


You may be surprised to learn which DW step takes the most time
Try guessing which:
• Hardware
• Physical database setup
• Database design
• ETL
• OLAP setup

[Chart: relative time spent on hardware, database, schemas, ETL, and OLAP tools]

According to Kimball & Caserta, ETL will eat up 70% of the time.
Other analysts give estimates ranging from 50% to 80%.
It is the most often underestimated part of the warehouse project!
Eight Month Initial Deployment
• Begin educating decision makers: 21 days
• Collect requirements: 14 days
• Decide general DW design: 7 days
• Determine budget: 3 days
• Identify project roles: 1 day
• Eval/choose ETL tool: 21 days
• Eval/choose physical DB: 14 days
• Spec/order, configure server: 20 days
• Secure, configure network: 1 day
• Deploy physical “target” DB: 4 days
• Learn/deploy ETL tool: 28 days
• Choose/set up modeling tool: 21 days
• Design initial data mart: 7 days
• Design ETL processes: 28 days
• Hook up OLAP tools: 7 days
• Publicize, train, train: 21 days
Conclusion

Information is held in transactional systems
• But transactional systems are complex
• They don’t talk to each other well; each is a silo
• They require specially trained people to report off of

For normal people to explore institutional data, data in transactional systems needs to be
• Renormalized as star schemas
• Moved to a system optimized for analysis
• Merged into a unified whole in a data warehouse

Note: this process must be led by “customers”
• Yes, IT people must build the infrastructure
• But IT people aren’t the main customers

So who are the customers?
• Admissions officers trying to make good admission decisions
• Student counselors trying to find/help students at risk
• Development officers raising funds that support the College
• Alumni affairs people trying to manage volunteers
• Faculty deans trying to right-size departments
• IT people managing software/hardware assets, etc.