An Introduction to Data Mining
Hosein Rostani
Alireza Zohdi
Report 1 for the “Advanced Database” course
Supervisor: Dr. Masoud Rahgozar
December 2007
Outline
 Why data mining?
 Data mining applications
 Data mining functionalities
 Concept description
 Association analysis
 Outlier Analysis
 Evolution Analysis
 Classification
 Clustering
Why data mining?
 Motivation:
 Wide availability of huge amounts of data
 Need for turning data into useful info & knowledge
 Data mining:
 Extracting or “mining” knowledge from large amounts of data
 Knowledge: useful patterns
 Semiautomatic process
 Focus on automatic aspects
Data mining applications
 Prediction. Examples:
 Credit risk
 Customer switching to competitors
 Fraudulent phone calling card usage
 Associations. Examples:
 Related books to buy
 Related accessories to suggest, e.g. for a camera
 Causation discovery, e.g. in medicine
 Clusters. Example:
 Clusters of disease
Data mining functionalities
 Concept description
 Characterization & discrimination
 Association analysis
 Outlier Analysis
 Evolution Analysis
 Classification and Prediction
 Clustering
Concept description
 Description of concepts
 Summarized, concise & precise
 Ways:
 Data characterization
 Summarizing the data of the target class in general terms
 Data discrimination
 Comparison of the target class with the contrasting class(es)
 Examples of output forms:
 Pie charts, bar charts, curves & multidimensional tables
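The two ways above can be illustrated with a minimal Python sketch. The customer records, the class names ("big_spender", "budget") and the chosen summary statistics are hypothetical; they only show the idea of summarizing a target class and comparing it with a contrasting class.

```python
from collections import Counter
from statistics import mean

# Hypothetical records: (customer_class, age, region)
records = [
    ("big_spender", 45, "north"),
    ("big_spender", 52, "north"),
    ("big_spender", 38, "south"),
    ("budget", 23, "south"),
    ("budget", 31, "south"),
    ("budget", 27, "north"),
]

def characterize(rows, target):
    """Data characterization: summarize the target class in general terms."""
    members = [r for r in rows if r[0] == target]
    return {
        "count": len(members),
        "mean_age": mean(r[1] for r in members),
        "regions": dict(Counter(r[2] for r in members)),
    }

def discriminate(rows, target, contrast):
    """Data discrimination: compare the target class with a contrasting class."""
    t = characterize(rows, target)
    c = characterize(rows, contrast)
    return {"mean_age_difference": t["mean_age"] - c["mean_age"]}

print(characterize(records, "big_spender"))
print(discriminate(records, "big_spender", "budget"))
```

The resulting dictionaries are the kind of summary that would then be shown as a chart or a multidimensional table.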
Association analysis
 Mining frequent patterns
 For discovery of interesting associations within data
 Kinds of frequent patterns:
 Frequent itemset
 A set of items that frequently appear together, e.g. milk and bread
 Frequent subsequence
 E.g. a pattern of customers’ purchases: first a PC, then a digital camera, and then a memory card
 Frequent substructure
 Structural forms such as graphs, trees, or lattices
 Support and confidence
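Support and confidence can be made concrete with a small sketch. For a rule A => B, support is the fraction of transactions that contain both A and B, and confidence is the fraction of transactions containing A that also contain B. The transactions below are made up for illustration.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "eggs"},
    {"milk", "bread", "eggs"},
]

def count(itemset, rows):
    """Number of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for row in rows if itemset <= row)

def support(antecedent, consequent, rows):
    """Fraction of all transactions containing antecedent and consequent."""
    return count(set(antecedent) | set(consequent), rows) / len(rows)

def confidence(antecedent, consequent, rows):
    """Fraction of transactions with the antecedent that also hold the consequent."""
    return count(set(antecedent) | set(consequent), rows) / count(antecedent, rows)

# Rule: milk => bread
print(support({"milk"}, {"bread"}, transactions))     # 3 of 5 transactions: 0.6
print(confidence({"milk"}, {"bread"}, transactions))  # 3 of 4 milk transactions: 0.75
```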
Outlier Analysis
 Outliers:
 Data objects that do not comply with the general behavior of the data
 Approaches to outliers
 Discard as noise or exceptions
 Keep for applications such as fraud detection
 Example: detecting fraudulent usage of credit cards
 Ways:
 Using statistical tests
 Using distance measures
 Using deviation-based methods
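As one possible illustration of the statistical approach, the sketch below flags values whose z-score (distance from the mean in standard deviations) exceeds a threshold. The threshold of 2.0 and the purchase amounts are assumptions chosen for the example, not values from the slides.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical credit card purchase amounts; 500 is far from the rest.
amounts = [20, 25, 22, 19, 24, 21, 23, 500]
print(zscore_outliers(amounts))  # [500]
```

In a fraud detection setting the flagged value would be kept for inspection rather than discarded as noise.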
Evolution Analysis
 Description and modeling of trends
 For objects with changing behavior over time
 Ways:
 Applying other data mining tasks to time-related data
 Association analysis, classification, prediction, clustering, …
 Distinct ways
 Time-series data analysis
 Sequence or periodicity pattern matching
 Similarity-based data analysis
 Example: stock market: predict future trends in prices
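A very simple form of time-series analysis is smoothing with a moving average, which makes the underlying trend easier to see. The sketch below is only an illustration of that idea; the price series and the window size are hypothetical, and real trend prediction would need a proper forecasting model.

```python
def moving_average(series, window):
    """Average each run of `window` consecutive values to smooth the series."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Hypothetical daily closing prices.
prices = [10, 12, 11, 13, 15, 14, 16]
smoothed = moving_average(prices, 3)
print(smoothed)  # [11.0, 12.0, 13.0, 14.0, 15.0]

# A steadily increasing smoothed series suggests an upward trend.
rising = all(a < b for a, b in zip(smoothed, smoothed[1:]))
print(rising)  # True
```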
Classification and Prediction
 Classification:
 Process of finding a model that distinguishes data classes
 Purpose: using the model to predict the class of new objects
 Deriving model:
 Based on the analysis of a set of training data
 Data objects with known class labels
 Example:
 In a credit card company
 Classification of customers based on their payment history
Classification
 A two-step process for classification:
 First: learning or training step
 Building the classifier by analyzing or learning from training data
 Second: classifying step
 Using the classifier for classification
 Accuracy of a classifier (on a given test set)
 Percentage of test set tuples correctly classified by the classifier
 Classification methods:
 Decision tree, naïve Bayesian classification, neural network, k-nearest neighbor classification, …
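The two steps and the accuracy measure can be sketched with a small k-nearest neighbor classifier, one of the methods listed above. The data points, the labels "low" and "high", and the choice k = 3 are assumptions made for this example.

```python
import math

def train(training_data):
    """Learning step. k-NN is a lazy learner, so training only stores the labeled tuples."""
    return list(training_data)

def classify(model, x, k=3):
    """Classifying step: majority label among the k training tuples nearest to x."""
    neighbors = sorted(model, key=lambda item: math.dist(item[0], x))[:k]
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)

def accuracy(model, test_set, k=3):
    """Percentage of test set tuples correctly classified by the classifier."""
    correct = sum(1 for x, label in test_set if classify(model, x, k) == label)
    return 100.0 * correct / len(test_set)

# Hypothetical training data: (features, class label)
training = [
    ((1, 1), "low"), ((1, 2), "low"), ((2, 1), "low"),
    ((8, 8), "high"), ((8, 9), "high"), ((9, 8), "high"),
]
# Hypothetical test set; the last tuple lies closer to the "high" group.
test_set = [((2, 2), "low"), ((9, 9), "high"), ((7, 8), "high"), ((5, 6), "low")]

model = train(training)
print(classify(model, (2, 2)))    # low
print(accuracy(model, test_set))  # 75.0
```

Accuracy is measured on tuples that were not used for training, which is why the test set is kept separate.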
Decision tree
 Decision tree induction:
 Learning of decision trees from class-labeled training tuples
 Decision tree: a flowchart-like tree structure
 Internal nodes: tests on attributes
 Branches: outcomes of the test
 Leaves: class labels
 Usage in classification:
 Prediction by tracing a path from the root to a leaf node
 Testing attribute values of a new tuple against the decision tree
 Easy conversion of a decision tree to classification rules
Decision tree example: Does a customer buy a computer?
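A minimal sketch of such a tree, written as nested Python dictionaries. The attribute names (age, student, credit_rating), the branch values and the leaf labels are illustrative assumptions and are not taken from the slide's figure. The sketch shows prediction by tracing a path from the root to a leaf, and the conversion of the tree into classification rules.

```python
# Hypothetical decision tree: internal nodes test an attribute,
# branches are test outcomes, leaves (plain strings) are class labels.
tree = {
    "attribute": "age",
    "branches": {
        "youth": {
            "attribute": "student",
            "branches": {"no": "no", "yes": "yes"},
        },
        "middle_aged": "yes",
        "senior": {
            "attribute": "credit_rating",
            "branches": {"fair": "yes", "excellent": "no"},
        },
    },
}

def classify(node, record):
    """Trace a path from the root to a leaf using the record's attribute values."""
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node

def to_rules(node, conditions=()):
    """Turn every root-to-leaf path into one IF-THEN classification rule."""
    if not isinstance(node, dict):
        lhs = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {lhs} THEN buys_computer = {node}"]
    rules = []
    for value, child in node["branches"].items():
        rules.extend(to_rules(child, conditions + ((node["attribute"], value),)))
    return rules

customer = {"age": "youth", "student": "yes", "credit_rating": "fair"}
print(classify(tree, customer))  # yes
for rule in to_rules(tree):
    print(rule)
```

Each leaf yields exactly one rule, so this tree produces five rules.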
Bayesian Classification
 Bayesian classification
 Predicting the probability that a new tuple belongs to a particular class
 High accuracy and speed in large databases
 Based on Bayes’ theorem
 Conditional probability
 Naïve Bayesian classifier
 Assumption: class conditional independence
 Simplifies computations
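The class conditional independence assumption means that P(X | C) is computed as the product of the per-attribute probabilities P(x_k | C), so that P(C | X) is proportional to P(C) times that product. A minimal sketch for categorical attributes follows; the training rows and attribute names are hypothetical, and probabilities are plain relative frequencies without smoothing.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows):
    """rows: list of (attribute_dict, class_label). Returns the frequency counts."""
    class_counts = Counter(label for _, label in rows)
    value_counts = defaultdict(Counter)  # (label, attribute) -> counts of values
    for attrs, label in rows:
        for attr, value in attrs.items():
            value_counts[(label, attr)][value] += 1
    return class_counts, value_counts

def posterior(model, attrs):
    """Return P(class | attrs) for every class under the naive assumption."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior P(C)
        for attr, value in attrs.items():
            # P(x_k | C), estimated by relative frequency (no smoothing)
            score *= value_counts[(label, attr)][value] / n
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()} if norm else scores

# Hypothetical training tuples with known class labels.
rows = [
    ({"student": "yes", "credit": "fair"}, "buys"),
    ({"student": "yes", "credit": "excellent"}, "buys"),
    ({"student": "no", "credit": "fair"}, "buys"),
    ({"student": "no", "credit": "excellent"}, "does_not_buy"),
    ({"student": "no", "credit": "fair"}, "does_not_buy"),
    ({"student": "yes", "credit": "excellent"}, "does_not_buy"),
]

model = train_naive_bayes(rows)
probs = posterior(model, {"student": "yes", "credit": "fair"})
print(round(probs["buys"], 3))  # 0.8
```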
Clustering
 The process of grouping a set of physical or abstract objects into classes of similar objects
 Generating class labels for objects currently without a label
 Clustering is based on this principle:
 Maximizing the intraclass similarity and
 Minimizing the interclass similarity
 Clustering also facilitates taxonomy formation
 Hierarchical organization of observations
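The principle above can be sketched with k-means, used here as one common clustering algorithm; the slides do not name a specific one. Each point is assigned to its nearest centroid, which keeps similar objects together, and the centroids are then recomputed. The points and the initial centroids are hypothetical.

```python
import math

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then update centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical unlabeled objects described by two numeric attributes.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])

# The cluster index acts as the generated class label.
for label, cluster in enumerate(clusters):
    print(label, cluster)
```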
An example: clustering customers in a restaurant
[Figure: Restaurant database, then Preprocessing, then Object View for Clustering, then Clustering, then A Set of Similar Object Clusters, then Summarization. Example cluster summaries: “Young at midnight”, “White Collar for Dinner”, “Retired for Lunch”.]
Steps of database clustering
1. Define object-view
2. Select relevant attributes
3. Generate suitable input format for the clustering tool
4. Define similarity measure
5. Select parameter settings for the chosen clustering algorithm
6. Run clustering algorithm
7. Characterize the computed clusters
Challenge: database clustering
 Data collections are in many different formats
 Flat files
 Relational databases
 Object-oriented databases
 Flat file format:
 The simplest and most frequently used format in the traditional data analysis area
 Databases are more complex than flat files
Challenge: database clustering (cont.)
 Challenge: changing clustering algorithms to become more directly applicable to real-world databases
 Issues related to databases:
 Different types of objects in DB
 Relationships between objects: 1:1, 1:n & n:m
 Complexity in definition of object similarity
 Due to the presence of bags of values for an object
 Difficulty in selection of an appropriate similarity measure
 Due to the presence of different types for attributes of objects
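To illustrate why the similarity measure is hard to define, the sketch below combines three kinds of attributes in one score: a numeric attribute, a nominal attribute, and a multi-valued attribute that arises from a 1:n relationship. The attribute names, the equal weighting and the age range of 60 are assumptions made for this example. Jaccard similarity treats the multi-valued attribute as a set, so it ignores how often a value repeats in a bag.

```python
def jaccard(a, b):
    """Set overlap: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty collections are treated as identical
    return len(a & b) / len(a | b)

def numeric_similarity(x, y, value_range):
    """1.0 for equal values, falling linearly to 0.0 at the full value range."""
    return 1.0 - abs(x - y) / value_range

def object_similarity(o1, o2, age_range=60):
    """Unweighted average of per-attribute similarities (a hypothetical choice)."""
    parts = [
        numeric_similarity(o1["age"], o2["age"], age_range),  # numeric attribute
        1.0 if o1["job"] == o2["job"] else 0.0,                # nominal attribute
        jaccard(o1["orders"], o2["orders"]),                   # multi-valued attribute
    ]
    return sum(parts) / len(parts)

# Hypothetical customer objects; "orders" comes from a 1:n relationship.
a = {"age": 30, "job": "engineer", "orders": ["pizza", "salad"]}
b = {"age": 36, "job": "engineer", "orders": ["pizza", "soup"]}

print(round(object_similarity(a, b), 3))  # 0.744
```

A different weighting or a bag-aware measure would give different clusters, which is the difficulty the slide points at.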
References
 Han, J., Kamber, M., Data Mining: Concepts and Techniques, Second Edition, Elsevier Inc., 2006, 770 p., ISBN 1-55860-901-3.
 Silberschatz, A., Korth, H. F., Sudarshan, S., Database System Concepts, Fifth Edition, McGraw-Hill, 2005, ISBN 0-07-295886-3.
 Ryu, T., Eick, C., A Database Clustering Methodology and Tool, Information Sciences 171(1-3): 29-59 (2005).