Transcript IS-class

Decision support tools :
Data warehousing, OLAP and data
mining.
Sunita Sarawagi
Data explosion
• Banks, companies, websites, retail stores,
scientific labs --- contain terabytes of data and is
continually growing.
• Storage and processing getting cheaper
• Wealth of information hidden in the flood of data
• Conventional querying/analysis methods did not
scale
• Need new ways of interaction
Data warehousing and Data mining
What is a Data Warehouse?
A single, complete and
consistent store of data
obtained from a variety of
different sources made
available to end users in a
what they can understand
and use in a business
context.
[Barry Devlin]
Why a Warehouse?
• Large organizations have complex internal
organizations, and have data stored at different
locations, on different operational
• Data sources often store only current data, not
historical data
• Corporate decision making requires a unified view of
all organizational data, including historical data
• A data warehouse
– Greatly simplifies querying, permits study of historical
trends
– Shifts decision support query load away from transaction
processing systems
Decision support tools
Direct
Query
Merge
Clean
Summarize
Reporting
tools
OLAP
Crystal reports
Essbase
Mining
tools
Intelligent Miner
Relational
DBMS+
e.g. Redbrick
Data warehouse
Staging + ETL layer
Detailed
transactional
data
GIS
data
Bombay branch Delhi branch
Oracle
Calcutta branch
IMS
Operational data
Census
data
SAS
MTNL: case study
MTNL operates in 9 zones, each with 5-7 exchanges.
Database
Size
GB
CSMS: Customer service management system
40x9
FRS: Fault Repair System
200
FMS: Finance Management System
20
DQS: Directory Query system
PMS: Personal Management system
10
IGS: Interactive graphic System
5
TCMIS: Traffic Control Information Management System
10
INTERNET: Internet Database System
70
Data Warehouse vs. Operational DBMS
• OLTP (on-line transaction processing)
– Day-to-day operations: purchasing, inventory, banking, manufacturing,
payroll, registration, accounting, etc.
• Data warehouse system
– Data analysis and decision making
• Distinct features
– User and system orientation: customer vs. market
– Data contents: current, detailed vs. historical, consolidated
– Database design: ER + application vs. star + subject
– View: current, local vs. evolutionary, integrated
– Access patterns: update vs. read-only but complex queries
Data warehouse construction
• Heterogeneous schema integration
– merge from various sources, fuzzy matches
– remove inconsistencies
• Data cleaning:
– missing data, outliers, clean fields e.g. names/addresses
• Data loading: efficient parallel loads
• Products: Prism warehouse manager, Platinum info
refiner, info pump, QDB, Vality
Warehouse maintenance
• Data refresh
– when to refresh, what form to send updates?
• Materialized view maintenance with batch
updates.
• Query evaluation using materialized views
• Monitoring and reporting tools
– HP intelligent warehouse advisor
Warehouse Schemas
• Typically warehouse data is multidimensional, with
very large fact tables
– Examples of dimensions: item-id, date/time of sale, store
where sale was made, customer identifier
– Examples of measures: number of items sold, price of
items
• Dimension values are usually encoded using small
integers and mapped to full values via dimension
tables
– Resultant schema is called a star schema
• More complicated schema structures
– Snowflake schema: multiple levels of dimension tables
– Constellation: multiple fact tables
OLAP
Fast, interactive answers to large aggregate queries.
• Multidimensional model: dimensions with
hierarchies
– Dim 1: Bank location:
• branch-->city-->state
– Dim 2: Customer:
• sub profession --> profession
– Dim 3: Time:
• month --> quarter --> year
• Measures: loan amount, #transactions, balance
Multidimensional Data
• Sales volume as a function of product,
month, and region Dimensions: Product, Location, Time
Hierarchical summarization paths
Industry Region
Year
Product
Category Country Quarter
Product
City
Office
Month
Month Week
Day
A Sample Data Cube
2Qtr
3Qtr
4Qtr
sum
India
Canada
Mexico
sum
Country
TV
PC
VCR
sum
1Qtr
Date
Total annual sales
of TV in India
Typical OLAP Operations
• Roll up (drill-up): summarize data
– by climbing up hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
– from higher level summary to lower level summary or detailed data,
or introducing new dimensions
• Slice and dice:
– project and select
• Pivot (rotate):
– reorient the cube, visualization, 3D to series of 2D planes.
• Other operations
– drill across: involving (across) more than one fact table
– drill through: through the bottom level of the cube to its back-end
relational tables (using SQL)
OLAP products
• About 30 OLAP vendors
• Dominant ones:
– Oracle Express: largest market share
– Hyperion: technology leader
– Microsoft Plato: introduced late last year,
rapidly taking over...
Part 2: Data mining
Data mining
• Process of semi-automatically analyzing large
databases to find patterns that are:
–
–
–
–
valid: hold on new data with some certainity
novel: non-obvious to the system
useful: should be possible to act on the item
understandable: humans should be able to
interpret the pattern
• Other names: Knowledge discovery in databases,
Data analysis
Relationship with other fields
• Overlaps with machine learning, statistics,
artificial intelligence, databases, visualization
but more stress on
– scalability of number of features and instances
– stress on algorithms and architectures whereas
foundations of methods and formulations
provided by statistics and machine learning.
– automation for handling large, heterogeneous data
Applications
• Banking: loan/credit card approval
– predict good customers based on old customers
• Customer relationship management:
– identify those who are likely to leave for a competitor.
• Targeted marketing:
– identify likely responders to promotions
• Fraud detection: telecommunications, financial
transactions
– from an online stream of event identify fraudulent events
• Manufacturing and production:
– automatically adjust knobs when process parameter
changes
Applications (continued)
• Medicine: disease outcome, effectiveness of
treatments
– analyze patient disease history: find relationship
between diseases
• Molecular/Pharmaceutical: identify new drugs
• Scientific data analysis:
– identify new galaxies by searching for sub clusters
• Web site/store design and promotion:
– find affinity of visitor to pages and modify layout
Mining technology today
Data
warehouse
Preprocessing
utilities
Extract
data via
ODBC
Vendors
(IDC 1999)
– SAS: 29%
– SPSS: 13.5%
– IBM: 6%
•Sampling
•Attribute
transformation
Scalable algorithms
• association
• classification
• clustering
• sequence mining
Mining
operations
Visualization
Tools
Mining operations
Itemset mining
Clustering
Classification
• hierarchical • Association rules
• Regression
• EM
• Causality
• Classification trees
• density based
• Neural networks
• Bayesian learning
• Nearest neighbour
• Radial basis functions
• Support vector machines
• Meta learning methods
– Bagging,boosting
Classification
• Given old data about customers and payments,
predict new applicant’s loan eligibility.
Previous customers
Age
Salary
Profession
Location
Customer type
Classifier
Decision rules
Salary > 5 L
Prof. = Exec
New applicant’s data
Good/
bad
Classification methods
Goal: Predict class Ci = f(x1, x2, .. Xn)
• Regression: (linear or any other polynomial)
• Decision tree classifier: divide decision space into
piecewise constant regions.
• Neural networks: partition by non-linear boundaries
• Probabilistic/generative models
• Lazy learning methods: nearest neighbor
• Support vector machines: boundary to maximally
separate classes
Decision tree classifiers
• Widely used learning method
• Easy to interpret: can be re-represented as if-thenelse rules
• Approximates function by piece wise constant
regions
• Does not require any prior knowledge of data
distribution, works well on noisy data.
• Has been applied to:
– classify medical patients based on the disease,
– equipment malfunction by cause,
– loan applicant by likelihood of payment.
Decision trees
• Tree where internal nodes are simple decision rules
on one or more attributes and leaf nodes are
predicted class labels.
Salary < 1 M
Prof = teacher
Good
Bad
Age < 30
Bad
Good
Algorithm for tree building
• Greedy top-down construction.
Gen_Tree (Node, data)
make node a leaf?
Yes
Stop
Selection
Find best attribute and best split on attribute
criteria
Partition data on split condition
For each child j of node Gen_Tree (node_j, data_j)
Nearest neighbor
• Define proximity between instances, find neighbors
of new instance
– K-NN approach: assign majority class amongst k nearest
neighbour
– weighted regression: learn a new regression equation by
weighting each training instance based on distance from
new instance
• Pros
+ Fast training
• Cons
– Slow during application.
– No feature selection.
– Notion of proximity vague
Neural networks
• Useful for learning complex data like
handwriting, speech and image recognition
Decision boundaries:
Linear regression
Classification tree
Neural network
Bayesian learning
• Assume a probability model on
generation of data.
• Apply Bayes theorem to find most
likely class as:
p(d | c j ) p(c j )
c  max p( c j | d )  max
cj
cj
p(d )
• Naïve bayes: Assume attributes
conditionally independent given class value
p(c j ) n
c  max
p( ai | c j )

cj
p( d ) i 1
Meta learning methods
• No single classifier good under all cases
• Difficult to evaluate in advance the conditions
• Meta learning: combine the effects of the
classifiers
– Voting: sum up votes of component classifiers
– Combiners: learn a new classifier on the outcomes of
previous ones:
– Boosting: staged classifiers
• Disadvantage: interpretation hard
– Knowledge probing: learn single classifier to mimick
meta classifier
What is Cluster Analysis?
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Grouping a set of data objects into clusters
• Clustering is unsupervised classification: no predefined
classes
• Typical applications
– As a stand-alone tool to get insight into data distribution
– As a preprocessing step for other algorithms
Sarawagi
Applications
• Customer segmentation e.g. for targeted marketing
– Group/cluster existing customers based on time series
of payment history such that similar customers in same
cluster.
– Identify micro-markets and develop policies foreach
• Image processing
• Text clustering e.g. scatter/gather
• Compression
Distance functions
• Numeric data: euclidean, manhattan distances
– Minkowski metric: [sum(xi-yi)^m]^(1/m)
– Larger m gives higher weight to larger distances
• Categorical data: 0/1 to indicate presence/absence
– Euclidean distance: equal weightage to 1 and 0 match
– Hamming distance (# dissimilarity)
– Jaccard coefficients: #similarity in 1s/(# of 1s) (0-0
matches not important
– data dependent measures: similarity of A and B depends
on co-occurance with C.
• Combined numeric and categorical data:weighted normalized distance:
Clustering methods
• Hierarchical clustering
– agglomerative Vs divisive
– single link Vs complete link
• Partitional clustering
– distance-based: K-means
– model-based: EM
– density-based:
Agglomerative Hierarchical
clustering
• Given: matrix of similarity between every point
pair
• Start with each point in a separate cluster and
merge clusters based on some criteria:
– Single link: merge two clusters such that the minimum
distance between two points from the two different
cluster is the least
– Complete link: merge two clusters such that all points
in one cluster are “close” to all points in the other.
Partitional methods: K-means
• Criteria: minimize sum of square of distance
• Between each point and centroid of the cluster.
• Between each pair of points in the cluster
• Algorithm:
– Select initial partition with K clusters: random, first K, K
separated points
– Repeat until stabilization:
• Assign each point to closest cluster center
• Generate new cluster centers
• Adjust clusters by merging/splitting
Association Rule
• Given: (1) database of transactions, (2) each
transaction is a list of items (purchased by a
customer in a visit)
• Find: all rules that correlate the presence of one
set of items with that of another set of items
Rule form: “Body ead [support, confidence]”.
– E.g., 98% of people who have an checking account
and a PPF also apply for a credit card
–
Rule Measures: Support and
Confidence
Customer
buys both
Customer
buys b
Customer
buys d
• Find all the rules X & Y  Z
with minimum confidence and
support
– support, s, probability that a
transaction contains {X Y Z}
– confidence, c, conditional
probability that a transaction having
{X Y} also contains Z
Transaction ID Items Bought
Let minimum support 50%,
2000
A,B,C
and minimum confidence
1000
A,C
50%, we have
4000
A,D
5000
B,E,F
– A  C (50%, 66.6%)
– C  A (50%, 100%)
Applications of fast itemset counting
Find correlated events:
• Applications in medicine: find redundant
tests
• Cross selling in retail, banking
• Improve predictive capability of classifiers
that assume attribute independence
• New similarity measures of categorical
attributes [Mannila et al, KDD 98]
Mining market
• Around 20 to 30 mining tool vendors
• Major tool players:
– SAS’s Enterprise Miner.
– IBM’s Intelligent Miner,
– SGI’s MineSet,
• All pretty much the same set of tools
• Many embedded products:
–
–
–
–
fraud detection:
electronic commerce applications,
health care,
customer relationship management: Epiphany
Summary
• Need for new decision support tools
• Data warehousing
– Data integration, loading, cleaning
– Interactive data analysis/navigation: OLAP
• Data mining: definition and an overview of the
various operations:
– Classification: regression, nearest neighbour, neural
network, bayesian
– Clustering: distance based (k-means), Heirarchical
– Itemset counting