Data Mining
Dr. Awad Khalil
Computer Science Department
AUC
Data Mining, by Dr. Khalil
Content
• What and Why Data Mining
• Data Mining Applications
• Data Mining Operations & Associated Techniques
• Predictive Modeling
• Database Segmentation
• Link Analysis
• Deviation Detection
• The Data Mining Process
• The CRISP-DM Model

What and Why Data Mining?

• Data mining is the process of extracting valid, previously
unknown, comprehensible, and actionable information
from large databases and using it to make crucial
business decisions.
• Data mining is concerned with the analysis of data and
the use of software techniques for finding hidden and
unexpected patterns and relationships in sets of data.
• The focus of data mining is to reveal information that is
hidden and unexpected.
• Data mining requires a single, separate, clean,
integrated, and self-consistent source of data. A data
warehouse is well equipped for providing data for data
mining.
• Data mining can provide huge paybacks for companies
that have made a significant investment in data
warehousing.
Data Mining Applications

Retail/Marketing:
• Identifying buying patterns of customers
• Finding associations among customer demographic characteristics
• Predicting response to mailing campaigns
• Market basket analysis

Banking:
• Detecting patterns of fraudulent credit card use
• Identifying loyal customers
• Predicting customers likely to change their credit card affiliation
• Determining credit card spending by customer groups

Insurance:
• Claims analysis
• Predicting which customers will buy new policies

Medicine:
• Characterizing patient behavior to predict surgery visits
• Identifying successful medical therapies for different illnesses

Data Mining Operations & Associated Techniques

Predictive Modeling:
• Classification
• Value prediction

Database Segmentation:
• Demographic clustering
• Neural clustering

Link Analysis:
• Association discovery
• Sequential pattern discovery
• Similar time sequence discovery

Deviation Detection:
• Statistics
• Visualization
Predictive Modeling

• Predictive modeling is similar to the human learning
experience in using observations to form a model of the
important characteristics of some phenomenon.
• This approach uses generalization of the “real world”
and the ability to fit new data into a general framework.
• Predictive modeling can be used to analyze an existing
database to determine some essential characteristics
(model) about the data set.
• Applications of predictive modeling include customer
retention management, credit approval, cross-selling,
and direct marketing.
• There are two techniques associated with predictive
modeling: classification and value prediction.
Classification
• Classification is used to establish a specific
predetermined class for each record in a database
from a finite set of possible class values.
• There are two specializations of classification:
  • Tree induction;
  • Neural induction.
Classification – Tree Induction




In the example shown, we are interested in predicting which customers who are
currently renting property are likely to be interested in buying property.
A predictive model has determined that only two variables are of interest: the
length of time the customer has rented property and the age of the customer.
The decision tree presents the analysis in an intuitive way.
The model predicts that customers who have rented for more than two
years and are over 25 years old are the most likely to be interested in buying
property.
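
As a rough illustration, the tree described above can be written as nested conditions. This is a minimal sketch of how an already-induced tree classifies a record, not of how the tree is induced from data. The function name, the field names, the sample customers, and the outcomes on the branches other than "rented more than two years and over 25" are assumptions, since the text describes only that one leaf.

```python
def likely_to_buy(years_renting: float, age: int) -> bool:
    """Apply the two-variable rule from the text to one customer record."""
    if years_renting > 2:
        if age > 25:
            return True   # the leaf described in the text
        return False      # assumed outcome: this branch is not described
    return False          # assumed outcome: this branch is not described


# Hypothetical customer records, used only to exercise the rule.
customers = [
    {"name": "A", "years_renting": 3.0, "age": 30},
    {"name": "B", "years_renting": 1.0, "age": 40},
    {"name": "C", "years_renting": 4.0, "age": 22},
]

prospects = [
    c["name"] for c in customers
    if likely_to_buy(c["years_renting"], c["age"])
]
```

Only customer A satisfies both conditions, so A is the only prospect selected.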
Classification – Neural Network





A Neural Network contains collections of connected nodes with input, output, and
processing at each node.
Between the visible input and output layers may be a number of hidden processing
layers.
Each processing unit (circle) in one layer is connected to each processing unit in the
next layer by a weighted value, expressing the strength of the relationship.
The network attempts to mirror the way the human brain works in processing patterns
by arithmetically combining all the variables associated with a given data point.
In this way, it is possible to develop nonlinear predictive models that “learn” by
studying combinations of variables and how different combinations of variables affect
different data sets.
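
The layered, weighted structure described above can be sketched as a forward pass through a very small network. This is only an illustration under stated assumptions: the layer sizes, the weight and bias values, and the sigmoid activation are all chosen arbitrarily here, and no training ("learning") step is shown.

```python
import math


def sigmoid(x: float) -> float:
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases):
    """One fully connected layer: every unit combines all inputs by weight."""
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]


# Assumed toy network: 2 inputs -> 2 hidden units -> 1 output unit.
hidden_w = [[0.5, -0.4], [0.3, 0.8]]
hidden_b = [0.0, -0.1]
output_w = [[1.2, -0.7]]
output_b = [0.05]


def predict(inputs):
    """Pass one data point through the hidden layer, then the output layer."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)[0]


score = predict([1.0, 0.5])
```

Because each unit applies a nonlinear function to its weighted sum, stacking layers gives a nonlinear model, which is the property the text refers to.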
Value Prediction






Value prediction is used to estimate a continuous numeric value
that is associated with a database record.
This technique uses the traditional statistical techniques of linear
regression and nonlinear regression.
Linear regression attempts to fit a straight line through a plot of
the data, such that the line is the best representation of the average
of all observations at that point in the plot.
Linear regression works well only with linear data and is sensitive to
the presence of outliers (that is, data values which do not conform
to the expected norm).
Although nonlinear regression avoids the main problems of linear
regression, it is still not flexible enough to handle all possible
shapes of the data plot.
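
The straight-line fit and its sensitivity to outliers can be sketched with ordinary least squares. This is a minimal illustration, assuming a single input variable; the function name and the sample values are made up for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input variable: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
predicted = slope * 5.0 + intercept  # estimated value for a new record, x = 5

# Replacing one observation with an outlier pulls the fitted line away.
ys_with_outlier = [3.0, 5.0, 7.0, 30.0]
slope_outlier, intercept_outlier = fit_line(xs, ys_with_outlier)
```

On the clean data the fit recovers the underlying line, while the single outlier makes the fitted slope much steeper.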
Applications of value prediction include credit card fraud
detection and target mailing list identification.
Database Segmentation





The aim of database segmentation is to partition a database into an unknown number of
segments, or clusters, of similar records, that is, records that share a number of
properties and so are considered to be homogeneous.
This approach uses unsupervised learning to discover homogeneous sub-populations in
a database to improve the accuracy of the profiles.
Database segmentation is less precise than other operations and is therefore less
sensitive to redundant and irrelevant features.
Applications of database segmentation include customer profiling, direct marketing, and
cross-selling.
Database segmentation is associated with demographic or neural clustering techniques,
which are distinguished by the allowable data inputs, the methods used to calculate the
distance between records, and the presentation of the resulting segments for analysis.
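
The slides name demographic and neural clustering but do not give an algorithm for either. As a stand-in, the sketch below uses plain k-means with Euclidean distance to show how a distance measure groups similar records. Note two assumptions that differ from the text: k-means needs the number of segments and the starting centroids to be supplied, and the records and field meanings here are hypothetical.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical records: (age, years renting).
records = [(22, 1), (24, 1), (23, 2), (45, 6), (47, 8), (50, 7)]
centroids, clusters = kmeans(records, centroids=[(22, 1), (50, 7)])
```

With these inputs the younger, short-term renters end up in one segment and the older, long-term renters in the other.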
Link Analysis



Link analysis aims to establish links, called associations, between the individual
records, or sets of records, in a database.
There are three specializations of link analysis:
• Association discovery: finds items that imply the presence of other items in the
same event. These affinities between items are represented by association rules. For
example, “when a customer rents a property for more than two years and is more
than 25 years old, in 40% of cases, the customer will buy a property. This
association happens in 35% of all customers who rent properties.”
• Sequential pattern discovery: finds patterns between events such that the
presence of one set of items is followed by another set of items in a database of
events over a period of time. For example, this approach can be used to understand
long-term customer buying behavior.
• Similar time sequence discovery: is used, for example, in the discovery of
links between two sets of data that are time-dependent, and is based on the degree
of similarity between the patterns that both time series demonstrate. For example,
within three months of buying property, new home owners will purchase goods
such as cookers, freezers, and washing machines.
Applications of link analysis include product affinity analysis, direct marketing, and
stock price movement.
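
The two percentages in the association rule example are usually called confidence (the 40% figure) and support (the 35% figure), although the slides do not use these names. The sketch below computes both for a rule over a small set of market baskets. The function names and the basket contents are assumptions made for illustration.

```python
def support(transactions, itemset):
    """Fraction of all transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the share that also
    contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

# Rule under test: bread -> milk.
rule_support = support(transactions, {"bread", "milk"})          # 3 of 5 baskets
rule_confidence = confidence(transactions, {"bread"}, {"milk"})  # 3 of 4 bread baskets
```

A rule is typically reported only when both values exceed thresholds chosen by the analyst.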
Deviation Detection




Deviation detection is a relatively new technique in terms of commercially available
data mining tools.
It identifies outliers, which express deviation from some previously known expectation
and norm.
This operation can be performed using statistics and visualization techniques. For
example, linear regression facilitates the identification of outliers in data while modern
visualization techniques display summaries and graphical representations that make
deviations easy to detect.
Applications of deviation detection include fraud detection in the use of credit cards and
insurance claims, quality control, and defect tracing.
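
One simple statistical way to flag deviations is to measure how many standard deviations each value lies from the mean. The sketch below is an assumption-laden illustration rather than the method the slides have in mind: the threshold of 2.0, the function name, and the spending amounts are all made up.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical daily credit card spending for one customer.
amounts = [42, 38, 45, 40, 41, 39, 44, 400]
outliers = find_outliers(amounts)
```

Here only the 400 transaction exceeds the threshold. A large outlier also inflates the mean and standard deviation themselves, so in practice the threshold needs care on small samples.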
The Data Mining Process

In 1996 a consortium of vendors and users developed a
specification called the Cross Industry Standard Process for Data
Mining (CRISP-DM).

CRISP-DM specifies a data mining process that is not specific to
any particular industry or tool.

CRISP-DM has evolved from the knowledge discovery processes
used widely in industry and in direct response to user
requirements.

The major aims of CRISP-DM are to make large data mining
projects run more efficiently as well as to make them cheaper,
more reliable, and more manageable.
The CRISP-DM Model






The CRISP-DM methodology is a hierarchical process model.
At the top level, the process is divided into six different generic
phases, ranging from business understanding to deployment of
project results.
The next level elaborates each of these phases as comprising
several generic tasks. At this level, the description is generic
enough to cover all the DM scenarios.
The third level specializes these tasks for specific situations. For
example, the generic task might be cleaning data, and the
specialized task could be cleaning of numeric or categorical
values.
The fourth level is the process instance, that is, a record of
the actions, decisions, and results of an actual execution of a DM
project.
The model also discusses relationships between different DM
tasks.
The CRISP-DM Phases
• Business understanding – determine business
objectives, assess situation, determine data mining goals,
and produce a project plan.
• Data understanding – collect initial data, describe
data, explore data, and verify data quality.
• Data preparation – select data, clean data, construct
data, integrate data, and format data.
• Modeling – select modeling technique, generate test
design, build model, and assess model.
• Evaluation – evaluate results, review process, and
determine next steps.
• Deployment – plan deployment, plan monitoring and
maintenance, produce final report, and review project.
Data Mining Tools
• There are a growing number of commercial data
mining tools in the marketplace.
• The important features of data mining tools
include:
  • Data preparation
  • Selection of data mining operations
(algorithms)
  • Product scalability and performance
  • Facilities for understanding results
Thank you