
Introduction to Data Mining
Oct. 12/14, 2004
Xiangming Mu
What is Data Mining
• Definition: Extraction of hidden predictive information from large databases
– Predicts future trends and behaviors
– Uses automated, prospective analysis technologies from
• Statistics, artificial intelligence, and machine learning
– Supported by three technologies
• Massive data collection
• Powerful multiprocessor computers
• Data mining algorithms
The evolution of Data Mining
• Data Collection (1960s)
– Retrospective, static data delivery (e.g., total revenue over the last five years)
• Data Access (1980s)
– Retrospective, dynamic data delivery at the record level
– e.g., DBMS, SQL, ODBC
• Data Warehousing and Decision Support (1990s)
– Retrospective, dynamic data delivery at multiple levels
– On-line analytic processing (OLAP), multidimensional databases, data warehouses
– e.g., what were unit sales in New England last month? Drill down to Boston
• Data Mining
– Prospective, proactive information delivery
– Massive databases, algorithms, and powerful computers
– e.g., what’s likely to happen to Boston unit sales next month? Why?
Data Warehouse
• Combines data from heterogeneous data sources into a single homogeneous structure
• Organizes data in simplified structures for efficiency of analytical queries rather than for transaction processing
• Contains transformed data that is valid, consistent, consolidated, and formatted for analysis
• Provides stable data that represents business history
• Is updated periodically with additional data rather than frequent transactions
• Simplifies security requirements
• Provides a database organized for OLAP rather than OLTP (online transaction processing)
Data Warehouse (cont’)
• Usually multiple granularities are supported
– Current operational data
– Historical operational data
– Aggregated data
– Metadata
• Two kinds of tables
– Fact tables
• Many rows, primarily numeric data, static data
– Dimension tables
• Reference the data stored in the fact table
• Primarily character data, fewer rows, updatable
Scope of Data Mining
• Data Mining is
– Decision trees
– Nearest neighbor classification
– Neural networks
– Rule induction
– K-means clustering
• Data Mining is NOT
– Data warehousing
– SQL/Queries/Reporting
– Software agents
– Online Analytical Processing (OLAP)
Models
• A “black box” that makes predictions about the future based on information from the past and present
• A large number of inputs is usually available
• A good model offers both accuracy and understandability
– Decision trees -> rule induction -> regression models -> neural networks (roughly from most understandable to most accurate)
[Diagram: inputs such as Age, Beer, Diapers, and Blood Pressure feed into the Model, which outputs a prediction (“?”)]
Techniques for Data Mining
• Classical Techniques
– Statistics, Neighborhoods and Clustering
• New Techniques
– Trees, Networks and Rules
Statistics
• Helps answer questions like
– What patterns are there in my database?
– What is the chance that an event will occur?
– Which patterns are significant?
– What is a high-level summary of the data that gives me some idea of what is contained in the database?
• Description
– Histograms, Max, Min, Mean, Median, Variance
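These descriptive summaries can be computed directly with the standard library; a minimal sketch, using made-up monthly sales figures (the data is hypothetical):

```python
import statistics

# Hypothetical monthly unit sales
sales = [12, 15, 9, 21, 15, 30, 18]

summary = {
    "min": min(sales),
    "max": max(sales),
    "mean": statistics.mean(sales),          # arithmetic average
    "median": statistics.median(sales),      # middle value when sorted
    "variance": statistics.variance(sales),  # sample variance
}
print(summary)
```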
Statistics: regression
• Prediction = Regression (?)
• Maps values from predictors in such a way that the lowest error occurs when making a prediction
• Linear regression
– Y = a + b*X
– Multiple linear regression
• Non-linear regression
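The linear form Y = a + b*X can be fitted by least squares in a few lines; this sketch uses invented data points and plain Python rather than a statistics library:

```python
# Least-squares fit of Y = a + b*X; the data points are invented
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope b = covariance(X, Y) / variance(X); intercept a from the means
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

def predict(x):
    return a + b * x

print(round(a, 2), round(b, 2))  # 0.11 1.97
```

The fitted line minimizes the sum of squared prediction errors over the training points, which is exactly the "lowest error" criterion named above.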
Nearest Neighbor
• Objects that are “near” each other have similar prediction values as well
• If you know the prediction value of one object, you can predict it for its nearest neighbors
• Confidence: the chance of being correct
• Examples:
– Guessing your income from your neighbors’
– The “find more like this” question
K Nearest Neighbors
• The prediction for a record is a combination of the classes of the k records most similar to it in a historical dataset
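A minimal sketch of the k-nearest-neighbors idea, using a hypothetical dataset of (income, age) records labeled by whether the customer renewed (all values are illustrative):

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k training records nearest to `query`
    (Euclidean distance over the feature tuples)."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    nearest = sorted(train, key=lambda rec: dist(rec[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical records: ((income_thousands, age), renewed?)
train = [((30, 25), "no"), ((32, 27), "no"),
         ((70, 50), "yes"), ((75, 55), "yes"), ((68, 48), "yes")]
print(knn_predict(train, (72, 52)))  # "yes": its 3 nearest records renewed
```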
Clustering
• A method by which like records are grouped together
• There is no absolute “best” way to cluster
– The dimensions involved in clustering can vary (e.g., one dimension: clustering based on a single attribute of the database)
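The k-means algorithm listed earlier under the scope of data mining is one common way to group like records; a minimal one-dimensional sketch with invented customer ages (illustrative, not production code):

```python
def kmeans_1d(values, centers, iters=10):
    """Naive 1-D k-means: assign each value to its nearest center,
    then move each center to the mean of the values assigned to it."""
    for _ in range(iters):
        clusters = {c: [] for c in centers}
        for v in values:
            nearest = min(centers, key=lambda c: abs(v - c))
            clusters[nearest].append(v)
        # drop empty clusters to avoid dividing by zero
        centers = [sum(vs) / len(vs) for vs in clusters.values() if vs]
    return sorted(centers)

# Hypothetical customer ages, clustered around three seed centers
ages = [21, 23, 25, 44, 46, 48, 70, 72]
print(kmeans_1d(ages, centers=[20, 50, 80]))  # [23.0, 46.0, 71.0]
```

Note how the result depends on the number of seed centers chosen, one concrete sense in which there is no single "best" clustering.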
Clustering vs. Nearest Neighbor
• Nearest Neighbor
– Used for both data consolidation and prediction
– Space is defined by the problem
– Generally only uses distance metrics
• Clustering
– Primarily for data grouping
– Space is predefined by past experience
– Can use other metrics besides distance
Hierarchical clustering
• The number of clusters
– Two extreme situations: one cluster, or as many clusters as records; neither helps understand the data better
– Allows users to choose the appropriate number of clusters
– Usually viewed as a tree
• Advantages
– Defined solely by the data
– Easy to increase or decrease the number of clusters
Clustering approaches
• Single link approach
– Clusters are formed by linking the closest records together
• Complete link approach
– Cluster distance is measured between the most distant pair of records, so every record in a cluster is close to all the others
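The single link approach can be sketched as repeatedly merging the two clusters whose closest records are nearest; this pure-Python version uses one-dimensional points for brevity:

```python
def single_link(points, target_clusters):
    """Agglomerative single-link clustering on 1-D points: repeatedly
    merge the two clusters whose closest members are nearest."""
    clusters = [[p] for p in points]

    def link_dist(c1, c2):
        # single link: distance between the closest pair of records
        return min(abs(a - b) for a in c1 for b in c2)

    while len(clusters) > target_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs,
                   key=lambda ij: link_dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return [sorted(c) for c in clusters]

print(single_link([1, 2, 8, 9, 25], target_clusters=2))  # [[1, 2, 8, 9], [25]]
```

Stopping the merge loop at different `target_clusters` values yields the different levels of the hierarchy (the tree mentioned on the previous slide).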
What is a Decision Tree
• Tree-shaped structures
• Represent sets of decisions
• These decisions generate rules for the classification of a dataset
• Specific decision tree methods include
– Classification and Regression Trees (CART)
– Chi-Square Automatic Interaction Detection (CHAID)
[Example decision tree: predicting customer renewal]
50 Renewers / 50 Non-Renewers
  New Technology Phone?
    Yes: 30 Renewers / 50 Non-Renewers
      Customer < 2.3 years?
        Yes: 25 Renewers / 10 Non-Renewers
          Age < 55?
            Yes: 20 Renewers / 0 Non-Renewers
            No:  5 Renewers / 10 Non-Renewers
        No:  5 Renewers / 40 Non-Renewers
    No: 20 Renewers / 0 Non-Renewers
Characteristics of a decision tree
• A collection of data is divided into two sub-collections at each branch
• Each leaf of the tree represents a segment of the data; records in the segment are somewhat similar to each other
• The distribution of records among the leaves provides information for prediction
• The condition for a particular leaf is the combination of nodes along the path from the root
What is a good question?
• One that is good at distinguishing groups in the prediction variable
– i.e., distinguishes between renewers and non-renewers
• Judged by the level of homogeneity in each segment after the question
– CART (Classification and Regression Trees): tries all questions, then picks the best (based on entropy metrics)
– CHAID (Chi-Square Automatic Interaction Detector): relies on the chi-square test used in contingency tables
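The entropy-based choice CART makes can be illustrated with the renewer counts from the example tree: compute the entropy before and after a candidate split and keep the question with the largest reduction. A sketch (the split counts are the ones from the example):

```python
import math

def entropy(counts):
    """Shannon entropy of a class distribution, e.g. (renewers, non-renewers)."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def split_gain(parent, children):
    """Entropy reduction achieved by splitting `parent` into `children`."""
    total = sum(parent)
    after = sum(sum(ch) / total * entropy(ch) for ch in children)
    return entropy(parent) - after

# Root: 50 renewers / 50 non-renewers; the "new technology phone?"
# question splits it into (30, 50) and (20, 0) as in the example tree
gain = split_gain((50, 50), [(30, 50), (20, 0)])
print(round(gain, 3))  # 0.236
```

The (20, 0) segment is perfectly homogeneous (entropy 0), which is why this question scores well.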
When to stop splitting?
• Each segment contains only one record
• All the records in the segment have identical characteristics
• The improvement is not substantial enough
– The entropy does not change substantially
Artificial Neural Network
• What is an Artificial Neural Network?
– Resembles biological neural networks in structure
– Non-linear predictive models
– Learns through training
– Derives primarily from artificial intelligence rather than statistics
• Advantages
– Highly accurate prediction
– Can be applied across a large number of different types of problems
• Disadvantages
– Difficult to understand
– Difficult to deploy
Degree of automation
• Even though it offers a high degree of automation, a neural network still needs some manual arrangement
– How should the nodes in the network be connected?
– How many neuron-like processing units should be used?
– When should “training” be stopped in order to avoid overfitting? A network can be trained forever and still not be 100% accurate on the training set.
– Numeric data usually needs to be normalized (between 0 and 1.0)
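Min-max normalization, the usual way to bring numeric inputs into the 0 to 1.0 range, can be sketched as follows (the income values are invented):

```python
def normalize(values, lo=0.0, hi=1.0):
    """Min-max scale `values` into the interval [lo, hi]."""
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

# Hypothetical raw incomes scaled for use as network inputs
incomes = [20_000, 45_000, 70_000, 95_000]
scaled = normalize(incomes)
print(scaled)  # [0.0, 0.33..., 0.66..., 1.0]
```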
Feature Extraction
• Detect the most relevant and most important predictors
– e.g., high-level features for images
– Points and lines
[Diagram: input nodes -> hidden nodes -> output nodes -> prediction]
• Complexity of NN
– The number of links (weight)
– The number and layers of hidden nodes
[Diagram: credit-risk network. Inputs Income = 70,000 and Age = 50 are normalized to 0.7 and 0.5. With link weights 0.2 and 0.3, the output node computes 0.7*0.2 + 0.5*0.3 = 0.29. Output 0 = good credit risk (no default); 1 = bad credit risk (likely default).]
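The 0.29 output in the credit-risk example is just a weighted sum of the normalized inputs; assuming the 0.2 weight sits on the income link and 0.3 on the age link (the only assignment that reproduces 0.29), the forward pass is:

```python
def forward(inputs, weights):
    """Single-node forward pass: weighted sum of normalized inputs."""
    return sum(x * w for x, w in zip(inputs, weights))

# income 70,000 normalized to 0.7 (weight 0.2);
# age 50 normalized to 0.5 (weight 0.3) -- assumed assignment
score = forward([0.7, 0.5], [0.2, 0.3])
print(round(score, 2))  # 0.29: near 0, so predicted "good credit risk"
```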
Types of neural networks
• Back propagation
– Errors from the output nodes are propagated backwards through the hidden nodes to the input nodes
– At each level, the link weights between nodes are updated to avoid similar mistakes
– Different algorithms have been developed
• Gradient descent
• Newton’s method
• Genetic algorithms
• Other methods
– Kohonen feature maps
• Feedforward, with no hidden layer
– Radial Basis Function networks
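For a single linear unit, the gradient-descent weight update at the heart of back propagation reduces to a few lines; in a full network the same update is applied layer by layer as errors pass backwards. A sketch with invented training data chosen to match a known weight vector:

```python
def train_neuron(samples, lr=0.5, epochs=200):
    """Gradient descent on one linear neuron with squared error.
    Back propagation repeats this same update layer by layer,
    passing each node's error backwards through the network."""
    w = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        for inputs, target in samples:
            out = sum(x * wi for x, wi in zip(inputs, w))
            err = out - target
            # move each link weight against the error gradient
            w = [wi - lr * err * x for wi, x in zip(w, inputs)]
    return w

# Invented data consistent with weights 0.2 and 0.3 (illustration only)
samples = [([0.7, 0.5], 0.29), ([1.0, 0.0], 0.2), ([0.0, 1.0], 0.3)]
w = train_neuron(samples)
print([round(wi, 2) for wi in w])  # [0.2, 0.3]
```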
Rule Induction
• The most common form of knowledge discovery in unsupervised learning systems
• A bottom-up approach (decision trees are top-down)
• The extraction of useful if-then rules from data based on statistical significance
– e.g., if paper plates, then plastic forks
– Rules do not imply causality
• Two aspects
– Accuracy (confidence): how often is the rule correct?
– Coverage (support): how often does the rule apply?
Rule                                        Accuracy   Coverage
If nails, then hammer                          85%        20%
If bread, then swiss cheese                    15%         6%
If 42 years old and purchased pretzels
and dry roasted peanuts, then beer             95%      0.01%
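Accuracy and coverage can be computed directly from transaction data; a sketch with hypothetical market baskets (accuracy = how often the rule is correct when it fires, coverage = how often it fires):

```python
def rule_stats(transactions, antecedent, consequent):
    """Accuracy (confidence) and coverage (support) of an if-then rule."""
    fired = [t for t in transactions if antecedent <= t]
    correct = [t for t in fired if consequent <= t]
    accuracy = len(correct) / len(fired) if fired else 0.0
    coverage = len(fired) / len(transactions)
    return accuracy, coverage

# Hypothetical market baskets
baskets = [
    {"nails", "hammer"}, {"nails", "hammer", "bread"},
    {"nails", "glue"}, {"bread", "milk"}, {"milk"},
]
acc, cov = rule_stats(baskets, {"nails"}, {"hammer"})
print(acc, cov)  # 0.666..., 0.6
```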
Interpreting rules
• Targeting the consequent
– e.g., all the rules with a consequent related to “coffee”
• Targeting accuracy
– Highly accurate rules point to highly profitable activities
• Targeting coverage
– Rules that apply very often
• Targeting interestingness
– A balance between accuracy and coverage