Transcript mss2

MINING & WAREHOUSING (MSS2)
BY
CHANDRA S. AMARAVADI
1
EXTENSIONS TO DSS
BI systems (aka EIS)
Geographical Information Systems (GIS)
Collaborative Systems (formerly GDSS)
Expert Systems
OLAP/Data mining/warehousing
DATA WAREHOUSES
3
DATA WAREHOUSE
A large collection of historical data that is organized
specifically for use in decision support (i.e. OLAP,
data mining)
DATA WAREHOUSING
USERS
APPLICATIONS
WAREHOUSE
User1
Prodn.
Mktg.
Fin.
Acctg.
S
E
R
V
E
R
User2
User3
User4
THE DATA LIFE CYCLE
Dashboard/BI
Internal
Data
Data
Visualization
Data
Marts
OLAP
External
Data
Internal
Data
Data
Warehouse
Decisions
Data
Marts
Data Mining
Knowledge
The activities taking place with respect to data for warehouse/OLAP/mining
INTEGRATION OF DATA
Integration
Operational databases
application A – m,f
application B – 1,0
application C – x,y
application D – male, female
Data Warehouse
M/F
7
CHARACTERISTICS OF WAREHOUSES
Data warehouses have a number of characteristics
Subject-oriented
Integrated
Time-variant
Non-volatile
8
KEY FEATURES OF DATA WAREHOUSES…
Subject-oriented: A data warehouse is organized around major
subjects, such as customer, supplier, product, and sales.
Integrated: A data warehouse is usually constructed by
integrating data from multiple heterogeneous sources, such as
relational databases, flat files, and on-line transaction records.
Time-variant: Data are stored to provide information from a
historical perspective (e.g., the past 5-10 years). Every key
structure in the data warehouse contains, either implicitly or
explicitly, an element of time
Nonvolatile: The data in a warehouse is permanent.
DESIGN OF WAREHOUSE
Design of warehouses is similar to databases:
Schema not radically different
W/H designed primarily for queries
Database will be converted to cube at run
time.
10
ORGANIZING THE W/H
STAR SCHEMA: Consists of a large central table and a set
of smaller tables, one each for each dimension.
SNOWFLAKE SCHEMA: A variant of the star schema,
Where some dimension tables are normalized, thereby
splitting the data into additional tables.
CONSTELLATION SCHEMA: A collection of stars.
STAR SCHEMA OF A DATA WAREHOUSE (FYI)
BRANCH
ACCOUNTS
TRANSACTIONS
What are the dimensions here?
SNOWFLAKE SCHEMA OF A DATA
WAREHOUSE (FYI)
BRANCH
ACCOUNTS
TRANSACTIONS
CREDIT
DEBIT
CONSTELLATION SCHEMA OF A DATA
WAREHOUSE FOR SALES AND SHIPPING (FYI)
BRANCH
DEPARTMENTS
ACCOUNTS
EMPLOYEES
TRANSACTIONS
CREDIT
DEBIT
EXAMPLE OF WAREHOUSE & MARTS
Weekly sales by region
2007-2013
Weekly sales by
state
2005-2013
Sales by Product Line
2010-2013
Data
Mart
Data
Mart
Warehouse
Sales Detail
Sales Detail
2003-2013
Weekly sales by
product
2008-2013
OLAP & DATA WAREHOUSES
16
ONLINE ANALYTICAL PROCESSING
OLAP: Tools to analyze data in a warehouse for
decision support. How many light bulbs sold in
December?
Aggregation (“data summarization”)
Dimension
Slice & Dice
Cube organization (Multidimensional database)
AGGREGATION & CUBE
Region
North. E
South E.
South W.
Midwest
South W.
North E.
Units
40
20
30
20
50
65
NE
SE
SW
65
20
30
40
50
sales in the Northern region?
18
DIMENSIONS & CONCEPT HIERARCHIES
A dimension is an aspect of the data, it is a
characteristic of a variable such as location, for sales
variable.
Dimensions can have hierarchies (or various levels of
aggregations)
A concept hierarchy defines a sequence of mappings
from a set of low-level concepts to higher-level, more
general concepts
CONCEPT HIERARCHY
Mfr. dimension
iPhone5
GAL
GALII
30,000
13,600
iPhone4
150,600
G
15,067
G2
FX
18,240
51,326
MULTI-DIMENSIONAL ORGANIZATION
Sales, costs etc.
nw
Regions
sw
Products
(tables, desks, lamps..)
Cube organization supports slice & dice
MULTI-DIMENSIONAL ORGANIZATION..
LG
G
LG
G
G2
MW
G2
LG
G
FX
G2
FX
TOTAL FX
TOTAL
TOTAL
SW
TOTAL
MW
SW
SE
NW
NE
TOTAL
3,780 MW
4,893 SW
7,494 SE6,520 NW
2,450
NE 25,137 TOTAL
3,780
4,893
7,494 6,520
2,450
25,137
2,342
1,200
950
7,570
3,780 1,400
4,893 1,678
7,494 6,520
2,450
25,137
2,342
1,200
1,400 1,678
950
7,570
3,780
7,893
2,342 5,647
1,200 6,493
1,400 1,678 7,839 31,652
950
7,570
3,780
7,893
5,647 6,493
7,839
31,652
9,902
13,986 7,893
14,541 5,647
14,691
11,239
64,359
3,780
6,493
7,839
31,652
9,902 13,986 14,541 14,691
11,239
64,359
9,902
SE
13,986
NW
14,541
NE
14,691
11,239
March sales
February sales
January sales
64,359
shows multi-dimensional/cube organization
22
CUBE ORGANIZATION
Data from warehouse imported into memory
A sophisticated 3D representation is created
Referred to as “sparse matrix”
Sides of cube are dimensions
Allows “slice & dice”
Answers to high level queries/reports
DATA MINING
24
DATA MINING
Application of statistical and AI techniques to identify
patterns that exist in large databases but are hidden in the
vast amounts of data.
e.g. sequence/association, classification, and
clustering
Buyer non-buyer
Success/failure of an operation
Customers who spend more
Preferences of target groups
25
SOME DATA MINING APPLICATIONS
Predicting the probability of default for consumer loans
Predicting audience response to TV advertisements
Predicting the probability that a cancer patient will respond
to radiation therapy.
Predicting the probability that an offshore well will
produce oil
A phone company mined data from 140 million
households, each with as many as 10,000 attributes,
including life-style and calling habits. Identified 22
profiles
26
DATA MINING AS A STEP IN THE PROCESS
OF KNOWLEDGE DISCOVERY
Evaluation and
Presentation
Knowledge
Data Mining
Selection and
Transformation
Cleaning and
Integration
Database
Data Warehouse
Flat files
Patterns
DATA MINING PROCESS
Cleaning & integration – data is brought in from multiple sources
Selection & transformation – sometimes called dimensionality
reduction, it is concerned with selection of dimensions and sometimes
the raw data needs to be transformed to suit the problem e.g. calculate
margin.
Data mining - process of extracting data patterns, using statistical or
AI techniques.
Pattern evaluation - identifying patterns useful and relevant to the
organizational context.
Knowledge presentation -- Visualization and knowledge
representation techniques are used to present the mined knowledge to
the user.
28
OLAP, MINING & WAREHOUSING...
Data warehousing refers to the use of high speed/high
capacity servers to store historical transaction information
and to make this information accessible to decision makers.
OLAP is used to perform high level analysis of data based
on data summarization (aggregation) and slice and dice
operations. For e.g. how many shoes sold in midwest in
Feb?
Data mining refers to identification of patterns from data.
DATA MINING TECHNIQUES
30
TYPES OF ANALYSES
Sequence
-- Activities occurring one after another
e.g. loan after buying car, warranty.
Association -- (AKA Market Basket Analysis) Activities
which occur together (e.g. bread and meat)
Classification -- Identifying profiles of data classified
into pre-defined groups (frequent & infrequent
shoppers)
Clustering
-- Identifying natural characteristics of data
(what major areas are cust. coming from?)
SEQUENTIAL ANALYSIS
32
SEQUENTIAL ANALYSIS
Predict purchasing
40% of customers buy a gray skirt six months after buying
a red jacket
Identified by time-series analysis, Neural networks or
genetic algorithms
33
IDENTIFYING SEQUENTIAL
PATTERNS..
applications in forecasting exchange rates, meat consumption , bankruptcies etc.
34
ASSOCIATIVE ANALYSIS
35
ASSOCIATION
Identifies items purchased together
Customers who bought items A, B, C also buy Y
85 percent of customers who buy a certain wine brand also
buy a certain type of pasta
Many men buy a six-pack when they purchase diapers
Enhances product placement in stores
36
ANALYSIS TECHNIQUES
ASSOCIATION
TID
Items
100
ACD
200
BCE
300
ABCE
400
BE
Database
Algorithm A-priori
list all items – 1 item set (C1)
filter by min. transaction
support (L1)
identify 2-item sets
(L1* L1) = (C2 )
filter and generate L2
repeat process...
Min. transaction support is the number (sometimes given as %) of transactions
in which the item must occur.
37
A-PRIORI ALGORITHM
TID
Items
Itemset
Sup.
{A}
2
{B}
3
{C}
3
{D}
1
{E}
3
Itemset
Itemset
Sup.
Itemset
Sup.
{A B}
{A B}
1
{A C}
2
{A C}
2
{B C}
2
{A E}
1
{B E}
3
{B C}
{B C}
2
{C E}
2
{B E}
{B E}
3
{C E}
2
Itemset
Sup.
Itemset
Sup.
{B C E}
2
{B C E}
2
Scan
Database
100 A C D
200 B C E
300 A B C E
400 B E
{A C}
{A E}
{C E}
Itemset
{B C E}
C3
Scan
Database
C2
Scan
Database
minimum
Support: 2
C1
minimum
Support: 2
Itemset
Sup.
{A}
2
{B}
3
{C}
3
{E}
3
L1
L2
C2
L3
C3
38
DISCUSSION
Apply associative rule mining (Use A-priori algorithm) to the
following portfolios of clients of a brokerage company, to identify
stocks that are purchased together. Use a minimum support of
two.
Client1
Client2
Client3
Client4
AAPL, ORCL, WFC, BAC
BAC, WTU
GOOG, IBM, C
WTU, WFC, C, IBM
39
CLASSIFICATION
ANALYSIS
40
CLASSIFICATION
A technique for grouping data into pre-defined classes
using certain attributes of the data. E.g. defaulter or not,
cruise customer, 4G subscriber or not etc.
Define classes
Identify classification criteria
Select suitable technique
application
41
CLASSIFICATION TECHNIQUES
Discriminant analysis (categorical membership)
Bayesian classification
Logistic regression (probabilistic membership)
Neural networks
42
DISCRIMINANT ANALYSIS*
A method of classification that uses a Discriminant Function
to decide classes
DF based on attributes and weights.
Calculated as a weighted average of attributes and weights
(of training data) (Zave score)
E.g. (GMAT + 200 * UGPA) > 1200
Given a new set of data, calculate Z score
Compare with Zav
*a simplified version
DF – Discriminant Function
43
CLASSIFICATION WITH
BAYESIAN ANALYSIS
44
REFRESHER ON PROBABILITY
Probability is the chance that an event/outcome will
take place.
E.g. Cubs had 43 wins 36 losses in
one season
Probability of winning?
45
PRIOR (CONDITIONAL) PROBABILITIES
Prior probabilities are knowledge of other
events which may help improve predictions
E.g. p(IPO success) = 0.33
Suppose we know that a big company
behind IPO
p(IPO success/big company) = 0.99
Which is higher?
P(successful cellphone call) or
P(successful call/subscriber in service area)
46
CONDITIONAL PROBABILITIES..
College
# of students
CBT
1,800
COAS
2,300
COFAC
1,200
COE
6,700
Total
12,000
If we see a student in the union and he/she is a WIU student
What is the probability he/she is a) CBT?, b) COAS?, c) COFAC?, d) COE?
47
BAYESIAN ANALYSIS
Bayes theorem can be exploited for classification
Bayes theorem
P(A/B) = [P(B/A) x P(A)]/P(B)
Classify into class I or II based on conditional
probability
P(class I/x) > P(class II/x)? *
E.g. wiu student in union, which college are they?
A method for classifying objects/events into classes based
on probabilities of occurrence of the objects/events
*x is some condition e.g. surgery or being a shopper in a retail chain
48
INTRODUCTION TO
BAYESIAN ANALYSIS
We are interested in p(person becoming a manager/mba)*
How can we use Baye’s theorem?
ASSUME
300 m population; 100 m employees
500,000 are managers
10,000 managers go to college for an MBA
20 m go to college
100,000 do MBA
*you need to write formula using terms from the problem
49
THIS SLIDE IS INTENTIONALLY BLANK
50
BAYESIAN ANALYSIS - EXAMPLE
PROBLEM:
An observer has collected information about
Eagles & Hawks for a long time. If a new bird
is spotted with a certain wingspan (x), need to know
whether eagle or hawk
From “Data Mining – Methods for Knowledge Discovery” by K.Cios, W.Pedrycz, R.Swiniarski
51
BAYESIAN ANALYSIS - EXAMPLE
CLASSIFICATION PROBLEM:
Compare p(eagle/x) and p(hawk/x)
Whichever is higher  unknown bird
We don’t know p(eagle/x) or p(hawk/x)
Aquila rapax
52
BAYESIAN ANALYSIS - EXAMPLE
Bayes‘ theorem
p( x | eagle) P(eagle)
P(eagle | x) 
p ( x)
p( x | hawk ) P(hawk )
P(hawk | x) 
p( x)
Decision rule?
53
PROBABILITY DENSITY FUNCTION
Shows from observations of birds, the probability of
a bird having a particular wingspan
P(x|eagle),
P(x| hawk)
54
PROBABILITY OF EAGLE OR HAWK
N = Number of birds
= neagle + nhawk
P(eagle) 
neagles
N
= 0.8
P(hawk) 
nhawk
N
= 0.2
55
USING BAYES THEOREM
New bird’s size = 45 cm
p(45|eagle) = 2.22 x 10-2
p(45|hawk) = 1.10 x 10-2
(from known probability
density functions)
p( x | eagle) * P(eagle) vs p( x | hawk) * P(hawk)
2.22 x 10-2 x 0.8 vs 1.10 x 10-2 x 0.2
0.01776 > 0.0021
Decision rule predicts eagle
56
CLUSTERING
57
DATA MINING FUNCTIONS
Clustering
Goal is to identify natural groupings of data.
applicationications in market segmentation, discovering
affinity groups, and defect analysis
based on euclidean distance
similar to centroid
trial & error approach
need to guess # of clusters
58
CLUSTER ANALYSIS OF CUSTOMER
LOCATIONS IN A CITY
CUSTOMERS ARE CLUSTERED
INTO FOUR SEGMENTS
Income: Medium
Children: 2
Income: High
Children: 1
Car: Luxury
Car: Sedan and Car: Truck
Income: Medium
Children:3
Cluster 1
Cluster 4
Cluster 3
Income: Low
Children: 0
Car: Compact
Cluster 2
60
A SIMPLE EXAMPLE
ACCT#
MONTHLY SPENDING
109905
$200
112343
$30
124205
$1,510
125687
$11,078
130984
$341
131987
$543
140001
$178
140387
$2,000
150933
$1,750
167912
$230
61
63
DISCUSSION QUESTIONS
What kind of data is stored in a data warehouse?
What are the characteristics of a data warehouse?
How is data organized in a warehouse?
What tool(s) can we use to analyze data in a warehouse?
What type of multidimensional organization of data is also called a
“sparse matrix”?
What is a concept hierarchy?
What technique can you use to predict the probability that a cancer
patient will respond to radiation therapy?
What type of mining identifies natural characteristics of data?
What technique is useful for predicting trends?
What is minimum transaction support?
How is Bayes theorem used for classification?
64