Transcript ida-2002

Data Mining: Intelligent Data Analysis for
Knowledge Discovery
Prof. Yike Guo
Dept. of Computing
Imperial College
Intelligent Data Analysis and Probability Inference
Course Overview
• Goal
– Basic Concepts of Data Mining
– Data Mining Techniques
– Data Mining Applications
– Future Research Trends on Data Mining
• Reference Books
– Data Mining: Concepts and Techniques, Jiawei Han and
Micheline Kamber
– Advances in Knowledge Discovery and Data Mining, U. M.
Fayyad and G. Piatetsky-Shapiro, AAAI/MIT Press, 1996
– Predictive Data Mining: A Practical Guide, Sholom M. Weiss and
Nitin Indurkhya, Morgan Kaufmann Publishers, Inc., 1997
– Intelligent Data Analysis, Springer, 1999
– Post-genome Informatics, Minoru Kanehisa, Oxford
University Press, 2000
What does the data say?
Day  Outlook   Temperature  Humidity  Wind    Play Tennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No
Turning Data into Knowledge
What does the data say?
[Figure: amounts plotted against year, 1965 to 2000. Vertical axis: Amount (x1000), with ticks at powers of ten from 0.001 to 100 000; horizontal axis: Year. Series: MEDLINE records, MEDLINE G5 MeSH, transistors per chip, DNA sequences, mapped human genes, 3-D structures.]
What Is Data Mining?
• Data mining (knowledge discovery in databases):
– Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) information or patterns
from data in large databases
• Alternative names and their “inside stories”:
– Data mining: a misnomer?
– Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting,
business intelligence, etc.
• What is not data mining?
– (Deductive) query processing.
– Expert systems or small ML/statistical programs
• Data: a set of facts F (records in a database)
• Pattern: an expression E in a language L describing the data in
a subset FE of F, where E is simpler than the enumeration of all
the facts in FE. FE is also called a class, and E is also
called a model or knowledge.
• Data Mining Process: data mining is a multi-step process
involving multiple choices, iteration and evaluation. It is
nontrivial since there is no closed-form solution. It always
involves intensive search.
• Validity: E is true (with high probability) for F
• Useful: patterns are not trivial inductive properties of the data
• Understandable: patterns should be understandable by data
owners to aid in understanding the data/domain
Why Data Mining
• Limitation of traditional database querying:
– Most queries of interest to data owners are difficult to
state in a query language
• “find me all records indicating fraud” => “tell me the
characteristics of fraud” (summarisation)
• “find me who is likely to buy product X” (classification
problem)
• “find all records that are similar to records in table X”
(clustering problem)
– Supporting analysis and decision making using
traditional (SQL) queries becomes infeasible (query
formulation problem).
Relational Database Revisited
• Terabyte databases, consisting of billions of records,
are becoming common
• Relational data model is the de facto standard
• A relational database: a set of relations
• A relation: a set of homogeneous tuples
• Relations are created, updated and queried using SQL
• Query = keyword-based search
SELECT telephone_number
FROM telephone_book
WHERE last_name = 'Smith';
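The query above can be tried out with a minimal Python sketch using the standard sqlite3 module. The column types and the sample rows are assumptions made up for illustration; only the table name, the column names and the query itself come from the slide.

```python
import sqlite3

# Hypothetical in-memory table mirroring the slide's telephone_book relation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE telephone_book (last_name TEXT, telephone_number TEXT)")
conn.executemany(
    "INSERT INTO telephone_book VALUES (?, ?)",
    [("Smith", "555-0101"), ("Jones", "555-0102"), ("Smith", "555-0103")],
)

# The query from the slide, with the name passed as a bound parameter
# and an ORDER BY added so the result order is deterministic.
rows = conn.execute(
    "SELECT telephone_number FROM telephone_book "
    "WHERE last_name = ? ORDER BY telephone_number",
    ("Smith",),
).fetchall()
numbers = [row[0] for row in rows]
print(numbers)
```

The query only retrieves tuples whose last_name matches the given key exactly; it says nothing about why those tuples are in the table.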
SQL : Relational Querying Language
• Provides a well-defined set of operations: scan, join,
insert, delete, sort, aggregate, union, difference
• Scan -- applies a predicate P to relation R
    For each tuple tr from R
        if P(tr) is true, tr is inserted in the output stream
• Join -- composes two relations R and S
    For each tuple tr from R
        For each tuple ts from S
            if the join attribute of tr equals the join attribute of ts
                form an output tuple by concatenating tr and ts
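The scan and join loops above can be sketched in Python as follows. Relations are modelled as lists of tuples and attributes are addressed by position; the sample relations (papers and authors keyed by MUID) and the function names are assumptions chosen for illustration.

```python
def scan(relation, predicate):
    """Yield every tuple of the relation that satisfies the predicate."""
    for tr in relation:
        if predicate(tr):
            yield tr


def nested_loop_join(r, s, r_attr, s_attr):
    """Yield concatenated tuples whose join attributes are equal."""
    for tr in r:
        for ts in s:
            if tr[r_attr] == ts[s_attr]:
                yield tr + ts


# Hypothetical relations: papers(muid, pages) and authors(muid, author).
papers = [(1, "10-19"), (2, "20-31"), (3, "32-40")]
authors = [(1, "Author 1-1"), (1, "Author 1-2"), (2, "Author 2-1")]

# Scan keeps papers with muid <= 2; the join then pairs them with their authors.
selected = list(scan(papers, lambda t: t[0] <= 2))
joined = list(nested_loop_join(selected, authors, 0, 0))
print(joined)
```

The nested-loop join compares every pair of tuples, so its cost grows with the product of the two relation sizes. Real database engines usually pick cheaper join strategies where they can.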
[Figure: relational database. A table (relation) is a set and the three basic table operations shown here (SELECT, PROJECT, JOIN) are extensions of the standard set operations. The example tables list papers (Paper 1, Paper 2, Paper 3, Paper 4, ...) and authors (Author 1-1, Author 1-2, Author 2-1, ...) with the attributes MUID, Pages and Author.]
The Query Formulation Problem
Consider the query:
What kinds of weather conditions are suitable for
playing tennis?
• It is not solvable via query optimisation
• It has not received much attention in the database
field or in traditional statistical approaches
• Such problems are inductive in nature: learning
from data rather than searching the data
• The natural solution is a train-by-example approach
that constructs inductive models as the answers
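As an illustration of learning an answer from examples, the sketch below applies a very simple inductive method to the 14-day Play Tennis table shown earlier in the slides. The method is the "one rule" (1R) heuristic: for each attribute, predict the majority class of each attribute value and count the misclassified days. The slides do not name this method; it is chosen here only because it is short.

```python
from collections import Counter, defaultdict

ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]
# The 14 days from the Play Tennis table; the last field is the class.
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]


def one_rule(attr_index):
    """Map each attribute value to its majority class and count the errors."""
    counts = defaultdict(Counter)
    for row in DATA:
        counts[row[attr_index]][row[-1]] += 1
    rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
    return rule, errors


results = {name: one_rule(i) for i, name in enumerate(ATTRS)}
for name, (rule, errors) in results.items():
    print(f"{name}: {rule} ({errors}/{len(DATA)} errors)")
```

On this table the rules built on Outlook and on Humidity each misclassify 4 of the 14 days, while Temperature and Wind each misclassify 5. No single SQL query over the table states this; the rule has to be induced from the examples.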
Why Data Mining Now
• Data Explosion
– Business Data : organisations such as supermarket chains, credit
card companies, investment banks, government agencies, etc.
routinely generate daily volumes of 100MB of data
– Scientific Data: scientific and remote sensing instruments collect
data at rates of gigabytes per day, far beyond human analysis
abilities.
• Data Wasting
– Only a small portion (5% - 10%) of the collected data is ever
analysed
– Data that may never be analysed continues to be collected, at great
expense.
• We are drowning in data, but starving for knowledge!
Steps of a KDD Process
• Learning the application domain:
– relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing: (may take 60% of effort!)
• Data reduction and transformation:
– Find useful features, dimensionality/variable reduction, invariant
representation.
• Choosing functions of data mining
– summarization, classification, regression, association, clustering.
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
Data Mining and Decision Support
• Data Warehousing: create/select target database
• Sampling: choose data for building models
• Data Reduction and Projection: derive useful features,
dimensionality reduction
• Data Cleaning: supply missing values, eliminate noisy data
• Data Mining: choose data mining tasks, choose data mining
methods to extract patterns/knowledge
• Model Test and Evaluation: test the accuracy of the model,
consistency check, model refinement
• Machine Learning Technologies
• Decision Support
Data Warehousing
• “A data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of management’s
decision-making process.” --- W. H. Inmon
• A data warehouse is
– A decision support database that is maintained separately from
the organization’s operational databases.
– It integrates data from multiple heterogeneous sources to
support the continuing need for structured and/or ad-hoc
queries, analytical reporting, and decision support.
Modeling Data Warehouses
• Modeling data warehouses: dimensions & measurements
– Star schema: A single object (fact table) in the middle
connected radially to a number of objects (dimension tables).
– Snowflake schema: A refinement of star schema where the
dimensional hierarchy is represented explicitly by normalizing
the dimension tables.
– Fact constellations: Multiple fact tables share dimension
tables.
• Storage of selected summary tables:
– Independent summary table storing pre-aggregated data, e.g.,
total sales by product by year.
– Encoding aggregated tuples in the same fact table and the
same dimension tables.
Example of Star Schema
[Figure: star schema with a central fact table linked to four dimension tables]
Sales Fact Table: Time_Key, Product_Key, Store_Key, Location_Key;
measures: unit_sales, dollar_sales, Yen_sales
Time Dimension Table: Time_Key, many time attributes
Product Dimension Table: Product_Key, many product attributes
Store Dimension Table: Store_Key, many store attributes
Location Dimension Table: Location_Key, many location attributes
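A cut-down version of such a star schema can be sketched with Python's sqlite3 module. Only the time and product dimensions are kept for brevity. The attribute names (year, month, name) and all rows are assumptions made for illustration. The final query computes a "total sales by product by year" summary of the kind mentioned on the previous slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim    (time_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, name TEXT);
-- The fact table holds one foreign key per dimension plus the measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    product_key  INTEGER REFERENCES product_dim(product_key),
    unit_sales   INTEGER,
    dollar_sales REAL
);
INSERT INTO time_dim VALUES (1, 2001, 12), (2, 2002, 1), (3, 2002, 2);
INSERT INTO product_dim VALUES (10, 'racket'), (11, 'balls');
INSERT INTO sales_fact VALUES
    (1, 10, 2, 120.0), (2, 10, 1, 60.0), (3, 10, 3, 180.0),
    (2, 11, 5, 25.0),  (3, 11, 4, 20.0);
""")

# Join the fact table to its dimensions and aggregate the measures.
summary = conn.execute("""
    SELECT p.name, t.year, SUM(f.unit_sales), SUM(f.dollar_sales)
    FROM sales_fact AS f
    JOIN time_dim AS t ON t.time_key = f.time_key
    JOIN product_dim AS p ON p.product_key = f.product_key
    GROUP BY p.name, t.year
    ORDER BY p.name, t.year
""").fetchall()
for row in summary:
    print(row)
```

Storing the result of this query in its own table would give an independent summary table; the sketch recomputes it on demand instead.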
OLAP: On-Line Analytical Processing
• A multidimensional, LOGICAL view of the data.
• Interactive analysis of the data: drill, pivot, slice_dice, filter.
• Summarization and aggregations at every dimension
intersection.
• Retrieval and display of data in 2-D or 3-D crosstabs, charts,
and graphs, with easy pivoting of the axes.
• Analytical modeling: deriving ratios, variance, etc. and
involving measurements or numerical data across many
dimensions.
• Forecasting, trend analysis, and statistical analysis.
• Requirement: Quick response to OLAP queries.
OLAP Architecture
• Logical architecture:
– OLAP view: multidimensional and logical presentation of
the data in the data warehouse/mart to the business user.
– Data store technology: The technology options of how and
where the data is stored.
• Three service components:
– data store services
– OLAP services, and
– user presentation services.
• Two data store architectures:
– Multidimensional data store: Multidimensional OLAP (MOLAP).
– Relational data store: Relational OLAP (ROLAP).
Multidimensional Data
• Sales volume as a function of product,
month, and region
• Dimensions: Product, Location, Time
• Hierarchical summarization paths:
– Product: Industry > Category > Product
– Location: Region > Country > City > Office
– Time: Year > Quarter > Month or Week > Day
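Summarising along one of these hierarchies (a roll-up) can be sketched in plain Python. The sales records, the field layout and the function names below are assumptions for illustration; the time path Month > Quarter > Year follows the slide.

```python
from collections import defaultdict

# Hypothetical sales records: (product, region, year, month, volume).
SALES = [
    ("TV", "Europe", 2001, 1, 10),
    ("TV", "Europe", 2001, 2, 12),
    ("TV", "Europe", 2001, 4, 7),
    ("TV", "Asia", 2001, 1, 5),
    ("PC", "Europe", 2001, 5, 3),
]


def quarter(month):
    """Map a month number (1-12) to its quarter (1-4)."""
    return (month - 1) // 3 + 1


def roll_up(records, key):
    """Sum the sales volume over all records that share the same key."""
    totals = defaultdict(int)
    for rec in records:
        totals[key(rec)] += rec[-1]
    return dict(totals)


# Month level -> quarter level -> year level (also dropping the region).
by_month = roll_up(SALES, lambda r: (r[0], r[1], r[2], r[3]))
by_quarter = roll_up(SALES, lambda r: (r[0], r[1], r[2], quarter(r[3])))
by_year = roll_up(SALES, lambda r: (r[0], r[2]))
print(by_quarter)
print(by_year)
```

Each roll-up replaces a finer key with a coarser one from the same hierarchy, so the grand total is the same at every level.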
Construction of Data Cubes
[Figure: data cube with dimensions Amount (0-20K, 20-40K, 40-60K, 60K-, sum), Province (B.C., Prairies, Ontario, sum) and Discipline (Comp_Method, Database, ..., sum); one highlighted cell holds all amounts for Comp_Method in B.C.]
Each dimension contains a hierarchy of values for one attribute
A cube cell stores aggregate values, e.g., count, sum, max, etc.
A “sum” cell stores dimension summation values.
Sparse-cube technology and MOLAP/ROLAP integration.
“Chunk”-based multi-way aggregation and single-pass computation.
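A naive way to fill a small cube, including its "sum" cells, is sketched below. It is not the chunk-based multi-way aggregation mentioned above: it simply adds every base fact into each of the 2^3 cells that generalise it. The facts and counts are hypothetical; the dimension values are taken from the figure.

```python
from collections import defaultdict
from itertools import product

ALL = "sum"  # marker for a dimension that has been aggregated away

# Hypothetical base facts: (discipline, province, amount band, count).
FACTS = [
    ("Comp_Method", "B.C.", "0-20K", 3),
    ("Comp_Method", "B.C.", "20-40K", 2),
    ("Comp_Method", "Ontario", "0-20K", 4),
    ("Database", "B.C.", "0-20K", 1),
    ("Database", "Prairies", "40-60K", 5),
]


def build_cube(facts, n_dims=3):
    """Aggregate each fact into all 2**n_dims cells that generalise it."""
    cube = defaultdict(int)
    for *dims, measure in facts:
        # Each dimension either keeps its value or is replaced by ALL.
        for mask in product((False, True), repeat=n_dims):
            cell = tuple(ALL if hide else d for d, hide in zip(dims, mask))
            cube[cell] += measure
    return dict(cube)


cube = build_cube(FACTS)
print(cube[("Comp_Method", "B.C.", ALL)])  # all amounts for Comp_Method in B.C.
print(cube[(ALL, ALL, ALL)])               # grand total
```

The cell ("Comp_Method", "B.C.", "sum") corresponds to the highlighted cell in the figure. This approach touches every cell once per fact, which is only workable for small, dense examples.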
A Star-Net Query Model
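[Figure: star-net query model. Radial lines represent the dimensions Customer Orders, Shipping Method, Customer, Time, Product, Geography, Promotion and Organization. Abstraction levels marked along the lines include CONTRACTS, ORDER, AIR-EXPRESS, TRUCK, DAILY, QTRLY, ANNUALLY, PRODUCT ITEM, PRODUCT GROUP, PRODUCT LINE, DISTRICT, REGION, COUNTRY, SALES PERSON and DIVISION.]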
Decision Support with Data Warehouse
• Ad Hoc Queries: Q: How many customers do we
have in London? A: 32776
• Report and Spreadsheet
• OLAP: Q: What are the sales figures for Y in the
different regions?
• Statistics: Q: Is there a relation between age and
buying behaviour? A: Older clients buy more
• Data Mining: Q: What factors influence buying
behaviour?
A1: Young men in sports cars buy 3
times as much audio equipment
(clustering/regression)
A2: Older women with dark hair more
often buy rinse (classification)
A3: Buyers of cars are also the buyers
of houses (association)
[Figure: decision tree with tests on Age (Young, Middle, Old), Hair color and Wage, and leaves labelled Y or N]
Data Mining Functionalities (1)
• Concept description: Characterization and
discrimination
– Generalize, summarize, and contrast data characteristics,
e.g., dry vs. wet regions
• Association (correlation and causality)
– Multi-dimensional vs. single-dimensional association
– age(X, “20..29”) ^ income(X, “20..29K”) → buys(X, “PC”)
[support = 2%, confidence = 60%]
– contains(T, “computer”) → contains(T, “software”)
[support = 1%, confidence = 75%]
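The two bracketed numbers attached to each rule can be computed directly from a set of transactions. In the sketch below, support is the fraction of transactions that contain both sides of the rule, and confidence is that fraction divided by the support of the left-hand side. The transactions are hypothetical and are not the data behind the percentages quoted above.

```python
# Hypothetical market-basket transactions.
TRANSACTIONS = [
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer", "printer"},
    {"software"},
    {"printer", "paper"},
]


def support(itemset, transactions):
    """Fraction of transactions containing every item of the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) from the transactions."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


# Rule: contains(T, "computer") -> contains(T, "software")
s = support({"computer", "software"}, TRANSACTIONS)
c = confidence({"computer"}, {"software"}, TRANSACTIONS)
print(f"support = {s:.0%}, confidence = {c:.0%}")
```

Here two of the five transactions contain both items and three contain "computer", so the rule has support 2/5 and confidence 2/3.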
Data Mining Functionalities (2)
• Classification and Prediction
– Finding models (functions) that describe and distinguish
classes or concepts for future prediction
– E.g., classify countries based on climate, or classify cars based
on gas mileage
– Presentation: decision-tree, classification rule, neural network
– Prediction: Predict some unknown or missing numerical values
• Cluster analysis
– Class label is unknown: Group data to form new classes, e.g.,
cluster houses to find distribution patterns
– Clustering based on the principle: maximizing the intra-class
similarity and minimizing the interclass similarity
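One common way to put the clustering principle into practice is k-means (Lloyd's algorithm), sketched here on one-dimensional data. The slides do not name a specific clustering algorithm, so this choice is an assumption; the house prices and starting centroids are also made up.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Lloyd's algorithm on scalars: assign to nearest centroid, then re-average."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical house prices (in thousands) and two arbitrary starting centroids.
prices = [100, 110, 120, 300, 310, 330]
centroids, clusters = kmeans_1d(prices, centroids=[100, 200])
print(centroids, clusters)
```

Each pass moves points to the nearest centroid, which keeps members of a cluster close to one another (high intra-class similarity) and leaves the two groups far apart (low inter-class similarity). No class labels are used at any point.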
Data Mining Functionalities (3)
• Outlier analysis
– Outlier: a data object that does not comply with the general behavior
of the data
– It can be considered as noise or exception but is quite useful in fraud
detection, rare events analysis
• Trend and evolution analysis
– Trend and deviation: regression analysis
– Sequential pattern mining, periodicity analysis
– Similarity-based analysis
• Other pattern-directed or statistical analyses
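For the outlier analysis item above, a minimal sketch is a z-score test: flag any value that lies more than a chosen number of standard deviations from the mean. The threshold of 2.0 and the call durations are assumptions for illustration. The slides do not prescribe a particular outlier test.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical call durations in minutes; one call is unusually long.
durations = [3, 4, 5, 4, 3, 5, 4, 60]
outliers = zscore_outliers(durations)
print(outliers)
```

A single extreme value inflates both the mean and the standard deviation, so on small samples this test can miss outliers. It is shown only to make the idea of "not complying with the general behavior" concrete.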
Example Data Mining Applications
• Commercial :
– Fraud detection: identify fraudulent transactions
– Loan approval: establish the creditworthiness of a customer requesting a loan
– Investment analysis: predict a portfolio's return on investment
– Marketing and sales data analysis: identify potential customers; establish the
effectiveness of a sales campaign
• Medical:
– Drug effect analysis: learn drug effects from patient records
– Disease causality analysis
• Political policy:
– Election policy: people’s voting patterns
– Social policy: tax/benefit policy
• Manufacturing:
– Manufacturing process analysis: identify the causes of manufacturing problems
– Experiment result analysis: summarise experiment results and create predictive
models
Market Analysis and Management (1)
• Where are the data sources for analysis?
– Credit card transactions, loyalty cards, discount coupons,
customer complaint calls, plus (public) lifestyle studies
• Target marketing
– Find clusters of “model” customers who share the same
characteristics: interest, income level, spending habits, etc.
• Determine customer purchasing patterns over time
– Conversion of a single to a joint bank account: marriage, etc.
• Cross-market analysis
– Associations/correlations between product sales
– Prediction based on the association information
Market Analysis and Management (2)
• Customer profiling
– data mining can tell you what types of customers buy what
products (clustering or classification)
• Identifying customer requirements
– identifying the best products for different customers
– use prediction to find what factors will attract new customers
• Provides summary information
– various multidimensional summary reports
– statistical summary information (data central tendency and
variation)
Fraud Detection and Management (1)
• Applications
– widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
• Approach
– use historical data to build models of fraudulent behavior
and use data mining to help identify similar instances
• Examples
– auto insurance: detect a group of people who stage
accidents to collect on insurance
– money laundering: detect suspicious money transactions
(US Treasury's Financial Crimes Enforcement Network)
– medical insurance: detect professional patients, rings of
doctors and rings of references
Fraud Detection and Management (2)
• Detecting inappropriate medical treatment
– The Australian Health Insurance Commission identified that in
many cases blanket screening tests were requested (saving
Australian $1m/yr).
• Detecting telephone fraud
– Telephone call model: destination of the call, duration, time
of day or week. Analyze patterns that deviate from an
expected norm.
– British Telecom identified discrete groups of callers with
frequent intra-group calls, especially mobile phones, and
broke a multimillion dollar fraud.
• Retail
– Analysts estimate that 38% of retail shrink is due to
dishonest employees.
Related Fields:
• Machine learning: Inductive reasoning
• Statistics : Sampling, Statistical Inference, Error
Estimation
• Pattern recognition: Neural Networks, Clustering
• Knowledge Acquisition, Statistical Expert Systems
• Data Visualisation
• Databases: OLAP, Parallel DBMS, Deductive
Databases
• Data Warehousing: collection, cleaning of
transactional data for on-line retrieval