Transcript mss2x
MINING & WAREHOUSING (MSS2)
BY
CHANDRA S. AMARAVADI
1
EXTENSIONS TO DSS
BI systems (aka EIS)
Geographical Information Systems (GIS)
Collaborative Systems (formerly GDSS)
Expert Systems
OLAP/Data mining/warehousing
Big Data
DATA WAREHOUSES
3
DATA WAREHOUSE
A large collection of historical data that is organized
specifically for use in decision support (i.e. OLAP,
data mining)
DATA WAREHOUSING
USERS
APPLICATIONS
WAREHOUSE
User1
Prodn.
Mktg.
Fin.
Acctg.
S
E
R
V
E
R
User2
User3
User4
THE DATA LIFE CYCLE
Dashboard/BI
Internal
Data
Data
Visualization
Data
Marts
OLAP
External
Data
Internal
Data
Data
Warehouse
Decisions
Data
Marts
Data Mining
Knowledge
The activities taking place with respect to data for warehouse/OLAP/mining
INTEGRATION OF DATA
Integration
Operational databases
application A – m,f
application B – 1,0
application C – x,y
application D – male, female
Data Warehouse
M/F
7
DESIGN OF WAREHOUSE
Design of warehouses is similar to databases:
Schema not radically different
W/H designed primarily for queries
Database will be converted to cube at run
time.
8
ORGANIZING THE W/H
STAR SCHEMA: Consists of a large central table and a set
of smaller tables, one each for each dimension.
SNOWFLAKE SCHEMA: A variant of the star schema,
Where some dimension tables are normalized, thereby
splitting the data into additional tables.
CONSTELLATION SCHEMA: A collection of stars.
STAR SCHEMA OF A DATA WAREHOUSE
BRANCH
ACCOUNTS
TRANSACTIONS
What are the dimensions here?
SNOWFLAKE SCHEMA OF A DATA
WAREHOUSE
BRANCH
ACCOUNTS
TRANSACTIONS
CREDIT
DEBIT
CONSTELLATION SCHEMA OF A DATA
WAREHOUSE FOR SALES AND SHIPPING
BRANCH
DEPARTMENTS
ACCOUNTS
EMPLOYEES
TRANSACTIONS
CREDIT
DEBIT
EXAMPLE OF WAREHOUSE & MARTS
Weekly sales by region
2010-2015
Weekly sales by
state
2010-2015
Sales by Product Line
2010-2015
Data
Mart
Data
Mart
Warehouse
Sales Detail
Sales Detail
2010-2015
Weekly sales by
product
2012-2015
OLAP & DATA WAREHOUSES
14
ONLINE ANALYTICAL PROCESSING
OLAP: Tools to analyze data in a warehouse for
decision support. How many light bulbs sold in
December?
Aggregation (“data summarization”)
Dimension
Slice & Dice
Cube organization (Multidimensional database)
AGGREGATION & CUBE
Region
Units
North E.
40
South E.
20
South W.
30
Midwest
20
North E.
65
North E.
80
NE
SE
SW
65
20
30
40
50
80
sales in the Northern region?
16
DIMENSIONS & CONCEPT HIERARCHIES
A dimension is an aspect of the data, it is a
characteristic of a variable such as location, for sales
variable.
Dimensions can have hierarchies (or various levels of
aggregations)
A concept hierarchy defines a sequence of mappings
from a set of low-level concepts to higher-level, more
general concepts
CONCEPT HIERARCHY
Mfr. dimension
Iphone 6S Plus
GAL S7
10 mil
GAL S6
38 mil
Iphone 6S
48.7 mil
G5
G4
G3
15,000
11 mil
10 mil
MULTI-DIMENSIONAL ORGANIZATION
Sales, costs etc.
nw
Regions
sw
Products
(tables, desks, lamps..)
Cube organization supports slice & dice
MULTI-DIMENSIONAL ORGANIZATION..
LG
MW
SW
SE
NW
NE
TOTAL
LG
MW
SW
SE
NW
NE
TOTAL
G5
3,780
4,893
7,494
6,520
2,450
25,137
LG
MW
SW
SE
NW
NE
TOTAL
G5
3,780
4,893
7,494 6,520
2,450
25,137
G4
2,342
1,200
1,400
1,678
950
7,570
G5
3,780
4,893
7,494
6,520
2,450
25,137
G4
2,342
1,200
1,400 1,678
950
7,570
G3
3,780
7,893
7,839 31,652
G4
2,342 5,647
1,200 6,493
1,400
1,678
950
7,570
G3
3,780
7,893
5,647 6,493
7,839
31,652
TOTAL G3
9,902
13,986 7,893
14,541 5,647
14,691 11,239
64,359
3,780
6,493
7,839
31,652
TOTAL
9,902 13,986 14,541 14,691
11,239
64,359
TOTAL
9,902 13,986 14,541 14,691
11,239
64,359
March sales
February sales
January sales
shows multi-dimensional/cube organization
20
CUBE ORGANIZATION
Data from warehouse imported into memory
A sophisticated 3D representation is created
Referred to as “sparse matrix”
Sides of cube are dimensions
Allows “slice & dice”
Answers to high level queries/reports
DATA MINING
22
DATA MINING
Application of statistical and AI techniques to identify
patterns that exist in large databases but are hidden in the
vast amounts of data.
e.g. sequence/association, classification, and
clustering
Buyer non-buyer
Success/failure of an operation
Customers who spend more
Preferences of target groups
23
SOME DATA MINING APPLICATIONS
Predicting the probability of default for consumer loans
Predicting audience response to banner ads
Predicting the probability that a cancer patient will respond
to radiation therapy.
Predicting the probability that an offshore well will
produce oil
A cellular company mined data from 140 million
households, each with as many as 10,000 attributes,
including life-style and calling habits. Identified 22
profiles
24
DATA MINING AS A STEP IN THE PROCESS
OF KNOWLEDGE DISCOVERY
Evaluation and
Presentation
Knowledge
Data Mining
Selection and
Transformation
Cleaning and
Integration
Database
Data Warehouse
Flat files
Patterns
DATA MINING PROCESS
Cleaning & integration – data is brought in from multiple sources
Selection & transformation – sometimes called dimensionality
reduction, it is concerned with selection of dimensions and sometimes
the raw data needs to be transformed to suit the problem e.g. calculate
margin.
Data mining - process of extracting data patterns, using statistical or
AI techniques.
Pattern evaluation - identifying patterns useful and relevant to the
organizational context.
Knowledge presentation -- Visualization and knowledge
representation techniques are used to present the mined knowledge to
the user.
26
OLAP, MINING & WAREHOUSING...
Data warehousing refers to the use of high speed/high
capacity servers to store historical transaction information
and to make this information accessible to decision makers.
OLAP is used to perform high level analysis of data based
on data summarization (aggregation) and slice and dice
operations. For e.g. how many shoes sold in midwest in
Feb?
Data mining refers to identification of patterns from data.
DATA MINING TECHNIQUES
28
TYPES OF ANALYSES
Sequence
-- Activities occurring one after another
e.g. loan after buying car, warranty.
Association -- (AKA Market Basket Analysis) Activities
which occur together (e.g. bread and meat)
Classification -- Identifying profiles of data classified
into pre-defined groups (frequent & infrequent
shoppers)
Clustering
-- Identifying natural characteristics of data
(what major areas are cust. coming from?)
SEQUENTIAL ANALYSIS
30
SEQUENTIAL ANALYSIS
Predict purchasing
40% of customers buy a gray skirt six months after buying
a red jacket
Identified by time-series analysis, Neural networks or
genetic algorithms
31
IDENTIFYING SEQUENTIAL
PATTERNS..
applications in forecasting exchange rates, meat consumption , bankruptcies etc.
32
ASSOCIATIVE ANALYSIS
33
ASSOCIATION
Identifies items purchased together
Customers who bought items A, B, C also buy Y
85 percent of customers who buy a certain wine brand also
buy a certain type of pasta
Many men buy a six-pack when they purchase diapers
Enhances product placement in stores
34
ANALYSIS TECHNIQUES
ASSOCIATION
TID Items
100
ACD
200
BCE
300
ABCE
400
BE
Database
Algorithm A-priori
list all items – 1 item set (C1)
filter by min. transaction
support (L1)
identify 2-item sets
(L1* L1) = (C2 )
filter and generate L2
repeat process...
Min. transaction support is the number (sometimes given as %) of transactions
in which the item must occur.
35
A-PRIORI ALGORITHM
TID Items
Scan
Database
Itemset
Sup.
{A}
2
100
ACD
200
BCE
{B}
3
ABCE
{C}
3
{D}
1
{E}
3
Itemset
Itemset
Sup.
{A B}
{A B}
1
300
400
BE
{A C}
Scan
Database
minimum
Support: 2
C1
minimum
Support: 2
Itemset
Sup.
{A}
2
{B}
3
{C}
3
{E}
3
Itemset
Sup.
{A C}
2
{B C}
2
{B E}
3
{C E}
2
{A C}
2
{A E}
1
{B C}
{B C}
2
{B E}
{B E}
3
{C E}
2
Itemset
Sup.
Itemset
Sup.
{B C E}
2
{B C E}
2
{A E}
{C E}
Itemset
{B C E}
C3
C2
Scan
Database
L1
L2
C2
L3
C3
36
DISCUSSION
Apply associative rule mining (Use A-priori algorithm) to the
following portfolios of clients of a brokerage company, to identify
stocks that are purchased together. Use a minimum support of
two.
Client
Stocks purchased
Client1
AAPL, ORCL, WFC, BA
Client2
BA, WTU
Client3
GOOG, IBM, C
Client4
WTU, WFC, C, IBM
37
CLASSIFICATION
ANALYSIS
38
CLASSIFICATION
A technique for grouping data into pre-defined classes
using certain attributes of the data. E.g. defaulter or not,
cruise customer, 4G subscriber or not etc.
Define classes
Identify classification criteria
Select suitable technique
application
39
CLASSIFICATION TECHNIQUES
Discriminant analysis (categorical membership)
Bayesian classification
Logistic regression (probabilistic membership)
Neural networks
40
DISCRIMINANT ANALYSIS*
A method of classification that uses a Discriminant Function
to decide classes
DF based on attributes and weights.
Calculated as a weighted average of attributes and weights
(of training data) (Zave score)
E.g. (GMAT + 200 * UGPA) > 1200
Given a new set of data, calculate Z score
Compare with Zav
*a simplified version
DF – Discriminant Function
41
CLASSIFICATION WITH
BAYESIAN ANALYSIS
42
REFRESHER ON PROBABILITY
Probability is the chance that an event/outcome will
take place.
E.g. A baseball team had 44 wins
36 losses in one season
Probability of winning?
43
PRIOR (CONDITIONAL) PROBABILITIES
Prior probabilities are knowledge of other
events which may help improve predictions
E.g. p(IPO success) = 0.33
Suppose we know that a big company
behind IPO
p(IPO success/big company) = 0.99
Which is higher?
P(successful cellphone call) or
P(successful call/subscriber in service area)
44
CONDITIONAL PROBABILITIES..
College
# of students
CBT
1,800
COAS
2,300
COFAC
1,200
COE
6,700
Total
12,000
If we see a student in the union and he/she is a WIU student
What is the probability he/she is a) CBT?, b) COAS?, c) COFAC?, d) COE?
45
BAYESIAN ANALYSIS
Bayes theorem can be exploited for classification
Bayes theorem
P(A/B) = [P(B/A) x P(A)]/P(B)
Classify into class I or II based on conditional
probability
P(class I/x) > P(class II/x)? *
E.g. wiu student in union, which college are they?
A method for classifying objects/events into classes based
on probabilities of occurrence of the objects/events
*x is some condition e.g. surgery or being a shopper in a retail chain
46
INTRODUCTION TO
BAYESIAN ANALYSIS
We are interested in p(person becoming a manager/mba)*
How can we use Baye’s theorem?
ASSUME
300 m population; 100 m employees
500,000 are managers
10,000 managers go to college for an MBA
20 m go to college
100,000 do MBA
*you need to write formula using terms from the problem
47
THIS SLIDE IS INTENTIONALLY BLANK
48
BAYESIAN ANALYSIS - EXAMPLE
PROBLEM:
An observer has collected information about
Eagles & Hawks for a long time. If a new bird
is spotted with a certain wingspan (x), need to know
whether eagle or hawk
From “Data Mining – Methods for Knowledge Discovery” by K.Cios, W.Pedrycz, R.Swiniarski
49
BAYESIAN ANALYSIS - EXAMPLE
CLASSIFICATION PROBLEM:
Compare p(eagle/x) and p(hawk/x)
Whichever is higher unknown bird
We don’t know p(eagle/x) or p(hawk/x)
Aquila rapax
50
BAYESIAN ANALYSIS - EXAMPLE
Bayes‘ theorem
p( x | eagle) P(eagle)
P(eagle | x)
p ( x)
p( x | hawk ) P(hawk )
P(hawk | x)
p( x)
Decision rule?
51
PROBABILITY DENSITY FUNCTION
Shows from observations of birds, the probability of
a bird having a particular wingspan
P(x|eagle),
P(x| hawk)
52
PROBABILITY OF EAGLE OR HAWK
N = Number of birds
= neagle + nhawk
P(eagle)
neagles
N
= 0.8
P(hawk)
nhawk
N
= 0.2
53
USING BAYES THEOREM
New bird’s size = 45 cm
p(45|eagle) = 2.22 x 10-2
p(45|hawk) = 1.10 x 10-2
(from known probability
density functions)
p( x | eagle) * P(eagle) vs p( x | hawk) * P(hawk)
2.22 x 10-2 x 0.8 vs 1.10 x 10-2 x 0.2
0.01776 > 0.0021
Decision rule predicts eagle
54
CLUSTERING
55
DATA MINING FUNCTIONS
Clustering
Goal is to identify natural groupings of data.
applicationications in market segmentation, discovering
affinity groups, and defect analysis
based on euclidean distance
similar to centroid
trial & error approach
need to guess # of clusters
56
CLUSTER ANALYSIS OF CUSTOMERS
CLUSTERING OF CAR BUYERS
Income: Medium
Children: 2
Income: High
Children: 1
Car: Luxury
Car: Sedan and Car: Truck
Income: Medium
Children:3
Cluster 1
Cluster 4
Cluster 3
Income: Low
Children: 0
Car: Compact
Cluster 2
58
A SIMPLE EXAMPLE
ACCT#
MONTHLY SPENDING
109905
$200
112343
$30
124205
$1,510
125687
$11,078
130984
$341
131987
$543
140001
$178
140387
$2,000
150933
$1,750
167912
$230
59
BIG DATA
Big Data: name given to large volumes of data (in petabytes
and exabytes)
Characteristics
Volume (PB, EB)
Velocity (produced in real time)
Variety (emails, social media posts, sensors etc)
Technologies
Apache Hadoop (open source)
Hadoop cluster servers
Hadoop Distributed File System (HDFS) for data storage
MapReduce for distributed processing of queries
61
DISCUSSION QUESTIONS
What kind of data is stored in a data warehouse?
What are the characteristics of a data warehouse?
How is data organized in a warehouse?
What tool(s) can we use to analyze data in a warehouse?
What type of multidimensional organization of data is also called a
“sparse matrix”?
What is a concept hierarchy?
What technique can you use to predict the probability that a cancer
patient will respond to radiation therapy?
What type of mining identifies natural characteristics of data?
What technique is useful for predicting trends?
What is minimum transaction support?
How is Bayes theorem used for classification?
62