Transcript Slide 1

Managing Data for DSS II
Data Warehouse
Common characteristics:
– Database designed for analytical tasks, comprising
data from multiple applications
– Small number of users with intense and long interactions
– Read intensive usage
– Periodic updates to the contents
– Consists of current as well as historical data
– Relatively fewer but large tables
– Queries return large result sets, involving full table scans
and joins spanning several tables
– Aggregation, vector operations and summarization are
common
– The data frequently resides in external heterogeneous
sources
Introduction- Terminology
Current Detail Data- Data acquired directly from operational
databases, often representing the entire enterprise
Old Detail Data- Aged current detail data; historical data
organized by subject, which helps in trend analysis
Data Marts- A large data store for informational needs whose
scope is limited to a department, SBU, etc. In a phased
implementation, data marts are a way to build a warehouse.
Summarized Data- Data aggregated along the lines required for
executive reporting, trend analysis and decision support
Metadata- Data about the data: a description of contents,
location, structure, end-user views, identification of authoritative
data, history of updates, and security authorizations
Introduction- Architecture
[Architecture diagram: data from external and current operational sources passes through an Extract, Cleanup & Load component into the warehouse repository (current detail data, metadata, and a realized or virtual MDDB), which feeds the information delivery system: report, query & EIS tools, OLAP tools, and data mining tools, all under warehouse management.]
• The Data Warehouse is an integrated, subject-oriented,
time-variant, non-volatile database that provides support
for decision making.
– Integrated
• The Data Warehouse is a centralized,
consolidated database that integrates data
retrieved from the entire organization.
– Subject-Oriented
• The Data Warehouse data is arranged and
optimized to provide answers to questions
coming from diverse functional areas within a
company.
– Time Variant
• The Warehouse data represent the flow of data
through time. It can even contain projected data.
– Non-Volatile
• Once data enter the Data Warehouse, they are
never removed.
• The Data Warehouse is always growing.
Major Tasks in Data Preparation
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization and aggregation
• Data reduction
– Obtains reduced representation in volume but produces the same
or similar analytical results
• Data discretization
– Part of data reduction but with particular importance, especially
for numerical data
Extraction, Cleanup, Integration
• Data Cleaning
– Missing Values
• Ignore the tuple
• Fill in the value manually
• Use a global constant to fill in the value
• Use the attribute mean as the missing value
– e.g., the average income of all customers is 30,000 p.m.
• Use the attribute mean of all samples belonging to the same
class
– Replace the missing value with the average income of the
same class, e.g., the same credit_risk or emp_status
• Use the most probable value
– Determined by regression, Bayesian classifiers, or decision
tree induction
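As an illustration, here is a minimal pandas sketch of these fill-in strategies; the DataFrame, the income figures, and the credit_risk classes are all made up for the example:

```python
import pandas as pd

# Hypothetical customer data with missing incomes (NaN)
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income":      [25000, None, 18000, None, 35000],
})

# 1. Ignore the tuple: drop rows whose income is missing
dropped = df.dropna(subset=["income"])

# 2. Use a global constant to fill
by_constant = df["income"].fillna(0)

# 3. Use the attribute mean as the missing value
by_mean = df["income"].fillna(df["income"].mean())

# 4. Use the attribute mean of samples in the same class
by_class_mean = df.groupby("credit_risk")["income"].transform(
    lambda s: s.fillna(s.mean())
)
print(by_class_mean)
```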
Extraction, Cleanup, Integration
• Data Cleaning
– Noisy Data- A random error or variance in a measured value.
Given price, how can we smooth out the data to remove the noise?
• Binning
– Smooth the sorted data by consulting the neighbours (see
the sketch after this list).
– Given: 4, 8, 15, 21, 21, 24, 25, 28, 34
– Partition it: Bin 1: 4, 8, 15; Bin 2: 21, 21, 24; Bin 3: 25, 28, 34
– Replace the bin values by the bin mean (9, 22, 29) or by the
bin boundaries
• Clustering
• Regression- Smooth the data by fitting it to a function.
– Inconsistent Data- Corrected manually or through a rule base
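A minimal NumPy sketch of both smoothing variants, using the nine sorted values from the slide and equal-depth bins of size 3:

```python
import numpy as np

data = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])  # already sorted

# Equal-depth partitioning: three bins of three values each
bins = data.reshape(3, 3)

# Smoothing by bin means: each value becomes its bin's mean
by_means = np.repeat(bins.mean(axis=1), 3)

# Smoothing by bin boundaries: each value snaps to the nearer of
# its bin's minimum and maximum
lo, hi = bins[:, [0]], bins[:, [-1]]
by_bounds = np.where(bins - lo < hi - bins, lo, hi).ravel()

print(by_means)   # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
print(by_bounds)  # [ 4  4 15 21 21 24 25 25 34]
```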
Data Transformation
• Smoothing: remove noise from data
• Aggregation: summarization, data cube
construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small,
specified range
– min-max normalization
– Z-score normalization
– Normalization by decimal scaling
Data Transformation: Normalization
• min-max normalization

  v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A

Suppose that the minimum and maximum values for
the attribute income are £12,000 and £98,000,
respectively. We map income to the range [0.0, 1.0].
By min-max normalization, a value of £73,600 for
income is transformed to
(73600 - 12000) / (98000 - 12000) * (1.0 - 0.0) + 0.0 = 0.716.
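The arithmetic of the income example can be checked directly:

```python
v, min_a, max_a = 73600, 12000, 98000   # attribute min/max from the slide
new_min, new_max = 0.0, 1.0             # target range

v_prime = (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min
print(round(v_prime, 3))  # 0.716
```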
Data Transformation: Normalization
• min-max normalization

  v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A

• z-score normalization

  v' = (v - mean_A) / stand_dev_A

• normalization by decimal scaling

  v' = v / 10^j

where j is the smallest integer such that max(|v'|) < 1
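A sketch of all three normalizations with NumPy; the five sample values are invented for illustration:

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])  # made-up attribute values

# min-max normalization to [0.0, 1.0]
min_max = (v - v.min()) / (v.max() - v.min())

# z-score normalization
z_score = (v - v.mean()) / v.std()

# decimal scaling: divide by 10^j, with j the smallest integer
# such that max(|v'|) < 1 (assuming values >= 1)
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
decimal = v / 10 ** j

print(min_max)   # [0.    0.125 0.25  0.5   1.   ]
print(decimal)   # [0.02 0.03 0.04 0.06 0.1 ]
```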
Star Schema
• The star schema is a data-modeling technique used to
map multidimensional decision support into a relational
database.
• Star schemas yield an easily implemented model for
multidimensional data analysis while still preserving the
relational structure of the operational database.
• Four Components:
– Facts
– Dimensions
– Attributes
– Attribute hierarchies
A Simple Star Schema
Star Schema
• Facts
– Facts are numeric measurements (values) that represent a
specific business aspect or activity.
– The fact table contains facts that are linked through their
dimensions.
– Facts can be computed or derived at run-time (metrics).
• Dimensions
– Dimensions are qualifying characteristics that provide
additional perspectives to a given fact.
– Dimensions are stored in dimension tables.
Star Schema
• Attributes
– Each dimension table contains attributes.
Attributes are often used to search, filter, or
classify facts.
– Dimensions provide descriptive characteristics
about the facts through their attributes.
Possible Attributes For Sales Dimensions
Three Dimensional View Of Sales
Slice And Dice View Of Sales
Star Schema
• Attribute Hierarchies
– Attributes within dimensions can be ordered in a
well-defined attribute hierarchy.
– The attribute hierarchy provides a top-down data
organization that is used for two main purposes:
• Aggregation
• Drill-down/roll-up data analysis
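As an illustration of roll-up and drill-down along such a hierarchy, a minimal pandas sketch with an invented region > city > store location hierarchy:

```python
import pandas as pd

# Made-up sales rows tagged with a location hierarchy
sales = pd.DataFrame({
    "region": ["North", "North", "North", "South"],
    "city":   ["Delhi", "Delhi", "Agra",  "Chennai"],
    "store":  ["D1",    "D2",    "A1",    "C1"],
    "units":  [120,     80,      60,      150],
})

# Drill-down: the most detailed level of the hierarchy
by_store = sales.groupby(["region", "city", "store"])["units"].sum()

# Roll-up one level: stores aggregate into cities
by_city = sales.groupby(["region", "city"])["units"].sum()

# Roll-up to the top: cities aggregate into regions
by_region = sales.groupby("region")["units"].sum()
print(by_region)  # North 260, South 150
```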
A Location Attribute Hierarchy
Attribute Hierarchies In Multidimensional Analysis
Example of Star Schema
time dimension table:
  time_key, day, day_of_the_week, month, quarter, year
item dimension table:
  item_key, item_name, brand, type, supplier_type
branch dimension table:
  branch_key, branch_name, branch_type
location dimension table:
  location_key, street, city, province_or_state, country
Sales Fact Table:
  time_key, item_key, branch_key, location_key (foreign keys to the dimension tables)
  Measures: units_sold, dollars_sold, avg_sales
Sales Fact Table
Time_key  Prod_key  Loca_key  Branch_key  Unit_sales  $_sales  $_cost
1         2         2         1           20          100000   5000
4         6         17        8           30          600      20
6         12        4         77          25          5000     200
Time Table
Time_key  day  day_week  month  season  year
6         Mon  6         Jan    Q1      2008
4         Tue  4         Jan    Q1      2008
15        Wed  1         Jan    Q1      2008

Item Table
Item_key  Item_name  brand  type       supplier
12        Hp Cam     HP     HP3225     Hp Delhi
6         IBM PC     IBM    IBM PC 2G  IBM New York
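Finally, a minimal pandas sketch of a star join over the example tables above, rolling units sold up by brand and month. Note that the fact row with Time_key 1 has no match in the Time Table sample, so an inner join drops it:

```python
import pandas as pd

fact = pd.DataFrame({
    "Time_key":   [1, 4, 6],
    "Prod_key":   [2, 6, 12],
    "Unit_sales": [20, 30, 25],
})

time_dim = pd.DataFrame({
    "Time_key": [6, 4, 15],
    "month":    ["Jan", "Jan", "Jan"],
    "year":     [2008, 2008, 2008],
})

item_dim = pd.DataFrame({
    "Item_key":  [12, 6],
    "Item_name": ["Hp Cam", "IBM PC"],
    "brand":     ["HP", "IBM"],
})

# Star join: the fact table joined to each dimension on its key
star = (fact
        .merge(time_dim, on="Time_key")
        .merge(item_dim, left_on="Prod_key", right_on="Item_key"))

# Roll up the measure along dimension attributes
print(star.groupby(["brand", "month"])["Unit_sales"].sum())
# brand  month
# HP     Jan      25
# IBM    Jan      30
```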