Cornell's Data Warehousing Infrastructure

Download Report

Transcript Cornell's Data Warehousing Infrastructure

Cornell University’s
Data Warehousing
Infrastructure
Presented by:
Jeff Christen
Data Warehousing DBA – Team Lead
[email protected]
FORUM II
Best Practices in Data Warehousing in Higher Education:
A Framework for Higher Education Reporting
April 18, 2005
Slide 1
Jeff Christen
• DBA Lead - Data Warehousing
– Ten Years experience as a DBA (Oracle & Informix)
– Last Four Years Focused on Data Warehouse DBA Support
• Team Interface to other IT groups
• Team Interface to University
FORUM II
Best Practices in Data Warehousing in Higher Education:
A Framework for Higher Education Reporting
April 18, 2005
Slide 2
Responsibilities of a DW DBA
•
•
•
•
•
•
•
•
Production Support – 24x7 (Database & Load)
Performance Monitoring & Tuning
Database Backup & Recovery
Security Implementation
Object & Code Migrations (Dev / Test / Prod)
Infrastructure Development & Maintenance
Enforcement of University Policies & Practices
Integration of Models & ETL into Warehouse
Environment
• Assist in Data Modeling & ETL Development
FORUM II
Best Practices in Data Warehousing in Higher Education:
A Framework for Higher Education Reporting
April 18, 2005
Slide 3
Cornell’s Warehousing Challenges
•
•
•
•
•
•
•
Twelve production data marts
Twenty-five unique loads
Variety of sources
(mainframe, PeopleSoft)
Varied load frequencies (daily, weekly, monthly)
Varied load requirements
(full, partial, append)
Rapidly shrinking load windows
Multiple server & O.S. environment
FORUM II
Best Practices in Data Warehousing in Higher Education:
A Framework for Higher Education Reporting
April 18, 2005
Slide 4
DMTools as a Solution
DMTools is a Data Warehousing infrastructure
management tool developed and in use by
Cornell University.
•
•
•
•
•
Allows high data availability- 24x7 access
Repository driven
Manages loads
Toolbox written in Oracle PL/SQL (O.S. independent)
GUI console to manage load related metadata
FORUM II
Best Practices in Data Warehousing in Higher Education:
A Framework for Higher Education Reporting
April 18, 2005
Slide 5
High Data Availability
Table Renaming Process
Two copies of each data mart table are maintained
• A current table and a backup or work table
• Data mart users only “see” the current table
• New data is loaded into “work” table
• Rename tables
– rename person to person_b
– rename person_w to person
FORUM II
Best Practices in Data Warehousing in Higher Education:
A Framework for Higher Education Reporting
April 18, 2005
Slide 6
High Data Availability
Table Renaming Benefits
•
•
•
•
Instant access to new data
Rename does not interrupt users
“Backup” table for previous load’s data
Ability to instantly roll back load
FORUM II
Best Practices in Data Warehousing in Higher Education:
A Framework for Higher Education Reporting
April 18, 2005
Slide 7
High Data Availability
Table Renaming Example
FORUM II
Best Practices in Data Warehousing in Higher Education:
A Framework for Higher Education Reporting
April 18, 2005
Slide 8
High Data Availability
Partition Exchange
• Similar to table rename
• May exchange table partitions without disrupting users
• Maintain full partitioned copy of table & small nonpartitioned copy
• Data loaded into non-partitioned table, then
exchanged with appropriate partition
FORUM II
Best Practices in Data Warehousing in Higher Education:
A Framework for Higher Education Reporting
April 18, 2005
Slide 9
High Data Availability
Partition Exchange Example
FORUM II
Best Practices in Data Warehousing in Higher Education:
A Framework for Higher Education Reporting
April 18, 2005
Slide 10
High Data Availability
Additional Considerations
•
•
•
•
Move security between table copies
Remove / add policy to table during exchange
Constraint management
Manage index names between table copies (A or B
suffix to guarantee unique names)
• Step verification & error handling
FORUM II
Best Practices in Data Warehousing in Higher Education:
A Framework for Higher Education Reporting
April 18, 2005
Slide 11
Current Warehousing Environment
Infrastructure
Load (pull) from Replication Database
Production
AIX Server
Transformation
&
Staging
Production
AIX Server
Logging & Metadata
DMTools
Repository
Production
AIX Server
H/R
&
Payroll
Logging & Metadata
Load (push) from
Staging Database
Consolidated
DataMarts
Production
Windows Server
Replication
Database
(PeopleSoft)
Logging & Metadata
Production
Windows Server
Load (pull) from
Replication Database
Contributor
Relations
Logging & Metadata
Production
Windows Server
Load (pull) from Replication Database
Logging & Metadata
Student
Administration
FORUM II
Best Practices in Data Warehousing in Higher Education:
A Framework for Higher Education Reporting
April 18, 2005
Slide 12
DMTools Repository Data Model
DM_TAB_PART_COUNTS
DM_RUN_DETAILS
DM_IND_COLUMNS
DM_INDEXES
DM_ROW_COUNTS
DM_RUNS
DM_IND_PARTITIONS
DM_TABLES
DM_NAMES
DM_POLICIES
DM_CONSTRAINTS
DM_EMAIL_CONTENT
DM_GRANTS
DM_ADMIN_USERS
DM_EMAIL_SUBSCRIPTIONS
DM_CONS_COLUMNS
FORUM II
Best Practices in Data Warehousing in Higher Education:
A Framework for Higher Education Reporting
April 18, 2005
Slide 13
DMTools Repository Data Model
(System Catalog)
FORUM II
Best Practices in Data Warehousing in Higher Education:
A Framework for Higher Education Reporting
April 18, 2005
Slide 14
DMTools Repository Data Model
(Logging Metrics)
FORUM II
Best Practices in Data Warehousing in Higher Education:
A Framework for Higher Education Reporting
April 18, 2005
Slide 15
DMTools Repository
• Resides in a centralized database
• Contains metadata needed for load process
• Contains logging data related to load process & load
metrics
• Holds notification information
• Scalable (may use for multiple data marts)
FORUM II
Best Practices in Data Warehousing in Higher Education:
A Framework for Higher Education Reporting
April 18, 2005
Slide 16
Load Logging
• High level logging
–
–
–
–
–
Start & end times for pre load, load, post load
Load status
( completed, running, failed )
Snapshot date
Estimated completion time
Record counts (by table & partition)
• Low level logging
– Start /end time & description for every action
– Captures Oracle errors & SQL executing
FORUM II
Best Practices in Data Warehousing in Higher Education:
A Framework for Higher Education Reporting
April 18, 2005
Slide 17
Load Status Monitoring
• Predefined views
–
–
–
–
Status views
(high level load info for given DM)
Detail views
(used by DBA for trouble shooting)
Load metrics views (various object counts related to loads)
Flexibility to build additional views on logging tables as needed
FORUM II
Best Practices in Data Warehousing in Higher Education:
A Framework for Higher Education Reporting
April 18, 2005
Slide 18
DMTools Toolbox
• Set of Oracle stored procedures used to perform
common load tasks
• Resides on each database in the DMTools
environment
• Procedures bundled in packages related to their
function
– dmrunlog
– dmidxpack …(partial list only)…
FORUM II
Best Practices in Data Warehousing in Higher Education:
A Framework for Higher Education Reporting
April 18, 2005
Slide 19
Main Procedure
•
•
•
•
•
Unique to each data mart
Not owned by DMTools (data mart schema)
Templates for easy starting point
Modify template to accommodate
Preload, load, post load
FORUM II
Best Practices in Data Warehousing in Higher Education:
A Framework for Higher Education Reporting
April 18, 2005
Slide 20
DMTools Console
• GUI interface to the repository
• Metadata maintenance
– Adding/removing tables, indexes, grants, etc.
– Migration management
• Email/pager subscription & notification maintenance
• Warnings & reports
– Detect potential load problems prior to load
FORUM II
Best Practices in Data Warehousing in Higher Education:
A Framework for Higher Education Reporting
April 18, 2005
Slide 21
DMTools Console - GUI Interface (main)
FORUM II
Best Practices in Data Warehousing in Higher Education:
A Framework for Higher Education Reporting
April 18, 2005
Slide 22
DMTools Console - GUI Interface (add Table)
FORUM II
Best Practices in Data Warehousing in Higher Education:
A Framework for Higher Education Reporting
April 18, 2005
Slide 23
DMTools Advantages
•
•
•
•
•
•
•
•
High data availability (24x7 access)
Instant rollback to pre-load state
Centralized logging and notification
Metadata driven
O.S. independent (Oracle RDBMS only)
Works with various ETL tools
Simple setup and maintenance
Very scaleable
FORUM II
Best Practices in Data Warehousing in Higher Education:
A Framework for Higher Education Reporting
April 18, 2005
Slide 24
Cornell University’s
Data Warehousing
Infrastructure
INTEREST ? // QUESTIONS ?
Presented by:
Jeff Christen
Data Warehousing DBA – Team Lead
[email protected]
FORUM II
Best Practices in Data Warehousing in Higher Education:
A Framework for Higher Education Reporting
April 18, 2005
Slide 25