Data Warehouse

Download Report

Transcript Data Warehouse

DataWarehousing and
DataMining
Prof. Sin-Min Lee
DATA WAREHOUSE, OLAP, and
DATA MINING
• Concepts
– Data warehousing
– OLAP (On-Line Analytical Processing)
– Data mining
• Case Studies
– WebTarget (USN)
– TFDW (USMC)
DATA WAREHOUSE
• DATABASE MANAGEMENT IN THE INTERNET ERA
• CLIENT/SERVER - BASED
• ANALYTICAL vs OPERATIONAL (OLAP vs OLTP)
• MULTI-DIMENSIONAL ANALYSIS
• DATA WAREHOUSE (ENTERPRISE-WIDE) vs
DATA MART (FUNCTIONAL AREA)
MULTIDIMENSIONAL NATURE OF
DATA WAREHOUSES
• BORING QUERY: “How many Sailors/Marines chose not
to stay in the Navy/Marine Corps this year?”
• USEFUL QUERY: “What was our retention (separation)
rate this year by community by paygrade by years of
service by gender by rating and how did it compare to last
year and what can we expect next year?”
DW ARCHITECTURE
DW 3-TIER ARCHITECTURE
1. DATA QUALITY & DATA
CLEANSING
• #1 REASON FOR DW PROJECT FAILURE
• PROBLEMS
- Database heterogeneity
- Data heterogeneity
• FUNCTIONALITY OF TOOLS
- Removing unwanted data from operational databases
- Converting to common data names and definitions
- Calculating summaries and derived data
- Establishing defaults for missing data
- Accommodating source data definition changes
APPROACHES TO DATA
CLEANSING
• AUTOMATIC CODE GENERATION
Creates code to convert from source to target data
• DATA REPLICATION TOOLS
Captures changes to source database from recovery logs
and database triggers and propagates changes to copies of
the data
• DYNAMIC TRANFORMATION ENGINES
Rule-driven systems that capture data from source
databases at user-defined intervals, transform it, and export
it to a data warehouse/mart target
2. METADATA
(What does the data mean?)
•
•
•
•
•
•
•
•
Logical Structure of DW Including End User Views
Identification of Authoritative Data Sources
Transformation Rules for Populating DW
Transformation Rules to Deliver Data to End-User
Analytical Tools
Subscription Information for Information Delivery
DW Operational Information
DW Usage Metrics
Security Authorizations, Access Control Lists, etc.
3. DATA WAREHOUSE DATABASE
• PARALLEL COMPUTING PLATFORMS
Exs: Symmetric (Shared) Multiprocessors (SMPs);
Massively Parallel Processors (MPPs)
• ROLAP
Relational DBMS with “Heavy Duty” Indexing
Capabilities
• MOLAP
Multidimensional Databases (MDDBs)
3rd Party Tools that Augment Relational Model
4. DATA MARTS
•
•
•
•
A Data Warehouse Focused on a Specific Subject Area
Subsidiary to a Data Warehouse of Integrated Data
More Rapidly Deployable than a Data Warehouse
Subject-based vice Enterprise-based
5. ACCESS TOOLS
• QUERY AND REPORTING TOOLS
- Managed query tools: Layer between user and SQL (e.g.,
BrioQuery)
- Configurable report generators (e.g., Brio’s BrioReport)
• APPLICATIONS
- Application development platforms (e.g., PowerSoft’s
PowerBuilder; Microsoft’s Visual Basic)
ACCESS (cont’d)
• OLAP
- Support of multidimensional analysis
- Ability to drilldown and rollup along any of the
predefined dimensions
- Major vendors: Cognos, Business Objects, Brio
MULTIDIMENSIONAL DATA
MODEL: STAR SCHEMA
• FACTS: Core data element being analyzed, e.g.,
Units_of_Items_Sold
• DIMENSIONS: Attributes about FACTS, e.g.,
Product_Type, Purchase_Date
ROLE OF METRICS
• Facts should be defined as Measures of Effectiveness
(sometimes called Key Performance Indicators (KPI’s))
• Exs:
NEC Reutilization Rate
Retention Rate
Attrition Rate
Readiness (Personnel)
COGNOS DEMO
• http://www.cognos.com/products/tours/anal
ysis_launch.html
ACCESS: Data Mining
• “Searching for meaningful patterns in large data
sets”
• Knowledge acquisition
• Motivated and facilitated by:
–
–
–
–
Availability of large data sets
Advances in storage technology
Data warehouse technology
E-commerce and the Internet
• Exploratory vs. confirmatory analysis
6. DW ADMINISTRATION
AND MANAGEMENT
• “Normal” DBA Responsibilities plus:
• Source Data Quality Checks
• Keeping track of what all the source data means
• Managing Very Large Databases (gigabytes or terabytes in
size)
7. INFORMATION DELIVERY
SYSTEM
• How to get information from the data warehouse to users?
• Users subscribe to the data warehouse.
• Specifically, they subscribe to specific reports to be
delivered on a periodic basis.
• Reports are delivered to user’s Web browser as per
prescribed frequency.
• Powerful tool for delivering information to the people who
need it in an extremely timely fashion. True MIS; true
DSS.
BENEFITS OF DATA WAREHOUSE
• Freedom from restrictions of operational databases
• Decision-oriented
• Extremely efficient presentation of management
information
• Widespread access to critical information for those who
need it when they need it
• Knowledge discovery
• Improves business intelligence
• Relatively inexpensive to implement
• Does not require re-engineering of legacy systems
GIS: GEOGRAPHIC INFORMATION
SYSTEMS
•
•
•
•
Ability to visualize data spatially
Maps on top of a relational DBMS
Data is viewed on maps vice from tables
Features:
- Thematic maps
- Spatial queries
- Geocoding of data
• Vendors: MapInfo; ESRI (ArcInfo)