Data Mining
Budi Santosa, PhD
2008
Lab Komputasi dan Optimasi Industri
Teknik Industri ITS
Content
Introduction
Overview of data mining technology
Association rules
Classification
Clustering
Applications of data mining
Commercial tools
Conclusion
Introduction
What is data mining?
Why do we need to ‘mine’ data?
On what kind of data can we ‘mine’?
What is data mining?
Data mining is a technique that combines statistical data-analysis methods with algorithms for processing large volumes of data.
Data mining is the process of discovering important information or patterns in large databases.
It is a part of the Knowledge Discovery in Databases (KDD) process.
Why data mining?
The explosive growth in data collection
The storing of data in data warehouses
The increased access to data through Web navigation and intranets
We have to find a more effective way to use these data in the decision-support process than just traditional query languages
On what kind of data?
Data warehouses
Transactional databases
Advanced database systems
Spatial and temporal
Time-series
Multimedia, text
WWW
…
GeneFilter Comparison Report
GeneFilter 1 Name: O2#1 8-20-99adjfinal
GeneFilter 2 Name: N2#1finaladj

ORF NAME   GENE NAME   CHRM   POSITION   RAW GF1   RAW GF2   NORMALIZED
YAL001C    TFC3        1      1 A 1 2    12.03     7.38      403.83
YBL080C    PET112      2      1 A 1 3    53.21     35.62     1,78…
YBR154C    RPB5        2      1 A 1 4    79.26     78.51     2,660.73
YCL044C                3      1 A 1 5    53.22     44.66     1,786.53
YDL020C    SON1        4      1 A 1 6    23.80     20.34     799.06
YDL211C                4      1 A 1 7    17.31     35.34     581.00
YDR155C    CPH1        4      1 A 1 8    349.78    401.84
YDR346C                4      1 A 1 9    64.97     65.88     2,180.87
YAL010C    MDM10       1      1 A 2 2    13.73     9.61      461.03
YBL088C    TEL1        2      1 A 2 3    8.50      7.74      285.38
YBR162C                2      1 A 2 4    226.84    293.83
Overview of data mining technology
Data Mining vs. Data Warehousing
Data Mining as a part of Knowledge Discovery Process
Goals of Data Mining and Knowledge Discovery
Types of Knowledge Discovery during Data Mining
Data Mining vs. Data Warehousing
A data warehouse is a repository of integrated information, available for queries and analysis. Data and information are extracted from heterogeneous sources as they are generated. This makes it much easier and more efficient to run queries over data that originally came from different sources.
The goal of data warehousing is to support decision
making with data!
Knowledge Discovery in Databases
and Data Mining
The non-trivial extraction of implicit, previously unknown, and potentially useful information from databases.
The knowledge discovery process comprises six phases: data selection, data cleansing, enrichment, data transformation or encoding, data mining, and the reporting and display of the discovered information.
Goals of Data Mining and KDD
Prediction: how certain attributes within the data will behave in the future.
Identification: identify the existence of an item, an event, or an activity.
Classification: partition the data into predefined categories.
Clustering: group objects into clusters of similar items.
Association: identify associations between objects.
Types of Knowledge Discovery during Data
Mining
Association rules
Classification hierarchies
Sequential patterns
Patterns within time-series
Clustering
Association Rules
Purpose
Provide rules that correlate the presence of one set of items with the presence of another set of items
Association Rules
Some concepts
Market-basket model
Look for combinations of products bought together
Put the SHOES near the SOCKS so that if a customer buys one they will buy the other
Transaction: the set of items a person buys together in one purchase, e.g. one supermarket basket
Association Rules
Some concepts
Transactions:

Transaction-id   Time   Items bought
101              6:35   Bread, milk, cookies, juice
792              7:38   Milk, juice
1130             8:05   Milk, eggs
1735             8:40   Bread, cookies, coffee

Item counts: milk 3, bread 2, cookies 2, juice 2, coffee 1, eggs 1

Support: how frequently a specific itemset occurs in the database
An association rule has the form X => Y
The confidence of the rule X => Y is support(X U Y) / support(X)
Example: confidence(Bread => Juice) = 0.25 / 0.50 = 50%
The goal of mining association rules is to generate all possible rules that exceed some minimum user-specified support and confidence thresholds
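The support and confidence definitions above can be written out directly. A minimal sketch, assuming Python; the transactions are the four from the example table:

```python
# Support and confidence over the four example transactions.
transactions = [
    {"bread", "milk", "cookies", "juice"},   # 101, 6:35
    {"milk", "juice"},                        # 792, 7:38
    {"milk", "eggs"},                         # 1130, 8:05
    {"bread", "cookies", "coffee"},           # 1735, 8:40
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    """confidence(X => Y) = support(X U Y) / support(X)."""
    return support(x | y) / support(x)

print(support({"milk"}))                  # 0.75
print(confidence({"bread"}, {"juice"}))   # 0.5
```

Here confidence(Bread => Juice) = support({bread, juice}) / support({bread}) = 0.25 / 0.50 = 50%, matching the example.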
Association Rules
Apriori Algorithm
Input: database of m transactions, D, and a minimum support,
mins, represented as a fraction of m
Output: frequent itemsets, L1, L2, …, Lk
Association Rules
Apriori algorithm
Example with mins = 2 (minf = 0.5):

Database D:
Transaction-id   Time   Items bought
101              6:35   Bread, milk, cookies, juice
792              7:38   Milk, juice
1130             8:05   Milk, eggs
1735             8:40   Bread, cookies, coffee

Candidate 1-itemsets: milk, bread, juice, cookies, eggs, coffee
  supports: 0.75, 0.5, 0.5, 0.5, 0.25, 0.25
Frequent 1-itemsets (freq >= 0.5): milk, bread, juice, cookies
  supports: 0.75, 0.5, 0.5, 0.5
Candidate 2-itemsets: {milk, bread}, {milk, juice}, {bread, juice}, {milk, cookies}, {bread, cookies}, {juice, cookies}
  supports: 0.25, 0.5, 0.25, 0.25, 0.5, 0.25
Frequent 2-itemsets: {milk, juice}, {bread, cookies}
  supports: 0.5, 0.5
Candidate 3-itemsets: Ф, so the frequent 3-itemsets are also Ф

Result: milk, bread, juice, cookies, {milk, juice}, {bread, cookies}
Apriori Algorithm
Begin
+ compute support(ij) = count(ij)/m for each individual item, i1, i2, …, in, by scanning the database
once and counting the number of transactions in which item ij appears
+ the candidate frequent 1-itemset, C1, will be the set of items i1, i2, …, in
+ the subset of items of C1 with support(ij) >= mins becomes the frequent 1-itemset, L1;
+ k = 1;
+ termination = false;
+ repeat
+ Lk+1 = Ф;
+ create the candidate frequent (k+1)-itemset, Ck+1, by combining members of Lk that have k-1
items in common; (this forms candidate frequent (k+1)-itemsets by selectively extending frequent
k-itemsets by one item)
+ in addition, only keep in Ck+1 those (k+1)-itemsets such that every subset of size k
appears in Lk;
+ scan the database once and compute the support for each member of Ck+1; if the support for a
member of Ck+1 >= mins then add that member to Lk+1;
+ if Lk+1 is empty then termination = true
else k = k+1;
+ until termination
end;
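The pseudocode above can be rendered as a short program. A minimal sketch, assuming Python; the `apriori` function name and set-based representation are illustrative choices, and the transactions are the four from the running example:

```python
from itertools import combinations

transactions = [
    {"bread", "milk", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]

def apriori(db, mins):
    """Return all frequent itemsets with support >= mins (a fraction)."""
    m = len(db)
    support = lambda s: sum(s <= t for t in db) / m
    # frequent 1-itemsets L1 from the candidate 1-itemsets
    items = sorted({i for t in db for i in t})
    L = [frozenset([i]) for i in items if support(frozenset([i])) >= mins]
    frequent = list(L)
    k = 1
    while L:
        # join step: combine members of Lk sharing k-1 items
        candidates = {a | b for a, b in combinations(L, 2) if len(a | b) == k + 1}
        # prune step: every size-k subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k))}
        L = [c for c in candidates if support(c) >= mins]
        frequent += L
        k += 1
    return frequent

freq = apriori(transactions, 0.5)
```

On this data `freq` contains the four frequent 1-itemsets plus {milk, juice} and {bread, cookies}, matching the worked example.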
Association Rules
Demo of Apriori Algorithm
Association Rules
Frequent-pattern tree algorithm
Motivated by the fact that Apriori-based algorithms may generate and test
a very large number of candidate itemsets.
Example:
with 1,000 frequent 1-items, Apriori would have to generate
on the order of 500,000 candidate 2-itemsets
The FP-growth algorithm is one approach that eliminates the generation of
a large number of candidate itemsets
Association Rules
Frequent-pattern tree algorithm
Generating a compressed version of the database in terms of an
FP-Tree
FP-Growth Algorithm for finding frequent itemsets
Association Rules
FP-Tree algorithm

[Figure: the FP-tree built from the four transactions (1: bread, milk, cookies, juice; 2: milk, juice; 3: milk, eggs; 4: bread, cookies, coffee).
Item head table (item : support, with a link to that item's nodes): milk : 3, bread : 2, cookies : 2, juice : 2.
Tree: Root → Milk:3 → Bread:1 → Cookies:1 → Juice:1; Milk:3 → Juice:1; Root → Bread:1 → Cookies:1.]
Association Rules
FP-Growth algorithm

[Figure: mining the FP-tree above. Processing the header items milk, bread, cookies, and juice bottom-up yields the frequent 2-itemsets {milk, juice} and {bread, cookies}, in addition to the frequent 1-itemsets.]
Association Rules
FP-Growth algorithm
Procedure FP-growth (tree, alpha);
Begin
If tree contains a single path then
For each combination, beta, of the nodes in the path
Generate pattern (beta U alpha) with support = minimum support of the nodes in beta
Else
For each item, i, in the header of the tree do
Begin
Generate pattern beta = (i U alpha) with support = i.support;
Construct beta’s conditional pattern base;
Construct beta’s conditional FP-tree, beta_tree;
If beta_tree is not empty then
FP-growth (beta_tree, beta);
End
end
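The first step, building the compressed FP-tree, can be sketched as follows. This is a sketch assuming Python; the `Node` class and function names are illustrative, and the recursive mining step above is omitted for brevity:

```python
from collections import Counter

transactions = [
    ["bread", "milk", "cookies", "juice"],
    ["milk", "juice"],
    ["milk", "eggs"],
    ["bread", "cookies", "coffee"],
]
mins = 2  # minimum support count

class Node:
    def __init__(self, item, count=0):
        self.item, self.count, self.children = item, count, {}

def build_fp_tree(db, mins):
    counts = Counter(i for t in db for i in t)
    # header table: frequent items in descending support order
    header = [i for i, c in counts.most_common() if c >= mins]
    root = Node(None)
    for t in db:
        # keep only frequent items, order them by the header, insert as a path
        node = root
        for item in sorted((i for i in t if i in header), key=header.index):
            child = node.children.setdefault(item, Node(item))
            child.count += 1
            node = child
    return root, header

root, header = build_fp_tree(transactions, mins)
```

On the four example transactions this reproduces the tree in the figure: Milk:3 at the root with a Bread:1 → Cookies:1 → Juice:1 path and a Juice:1 child, plus a separate Bread:1 → Cookies:1 branch.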
Association Rules
Demo of FP-Growth algorithm
Classification
Introduction
Classification is the process of learning a model that describes
different classes of data; the classes are predetermined
The model produced is usually in the form of a decision tree
or a set of rules
[Figure: a sample decision tree for loan risk. Internal nodes test married (yes/no), salary (<20k, 20k..50k, >=50k), account balance (<5k, >=5k), and age (<25, >=25); leaves assign the classes poor risk, fair risk, and good risk.]
RID   Married   Salary     Acct balance   Age    Loanworthy
1     No        >=50k      <5k            >=25   Yes
2     Yes       >=50k      >=5k           >=25   Yes
3     Yes       20k..50k   <5k            <25    No
4     No        <20k       >=5k           <25    No
5     No        <20k       <5k            >=25   No
6     Yes       20k..50k   >=5k           >=25   Yes
Expected information (pi is the proportion of records in class i):
I(S1, S2, …, Sn) = − Σ i=1..n  pi log2 pi
Here I(3,3) = 1

Entropy of an attribute A that partitions S into subsets Sj:
E(A) = Σ j=1..n  (( Sj1 + … + Sjn ) / S) · I(Sj1, …, Sjn)

Information gain:
Gain(A) = I − E(A)

E(Married) = 0.92, Gain(Married) = 0.08
E(Salary) = 0.33, Gain(Salary) = 0.67
E(A.balance) = 0.82, Gain(A.balance) = 0.18
E(Age) = 0.81, Gain(Age) = 0.19

Salary has the highest gain, so it becomes the root split:
<20k → class is “no” {4, 5}
20k..50k → split further on age: <25 → class is “no” {3}; >=25 → class is “yes” {6}
>=50k → class is “yes” {1, 2}
Classification
Algorithm for decision tree induction
Procedure Build_tree (Records, Attributes);
Begin
Create a node N;
If all records belong to the same class, C, then
Return N as a leaf node with class label C;
If Attributes is empty then
Return N as a leaf node with class label C, such that the majority of records belong to it;
Select the attribute Ai with the highest information gain from Attributes;
Label node N with Ai;
For each known value, Vj, of Ai do
Begin
Add a branch from node N for the condition Ai = Vj;
Sj = subset of Records where Ai = Vj;
If Sj is empty then
Add a leaf, L, with class label C, such that the majority of records belong to it, and return L
Else add the node returned by Build_tree(Sj, Attributes − Ai)
End;
End;
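The information-gain computation driving the attribute selection above can be checked with a short script. A sketch assuming Python; the record dictionaries and attribute names are a shorthand encoding of the six-row loan table:

```python
from math import log2

records = [
    {"married": "no",  "salary": ">=50k",    "balance": "<5k",  "age": ">=25", "class": "yes"},
    {"married": "yes", "salary": ">=50k",    "balance": ">=5k", "age": ">=25", "class": "yes"},
    {"married": "yes", "salary": "20k..50k", "balance": "<5k",  "age": "<25",  "class": "no"},
    {"married": "no",  "salary": "<20k",     "balance": ">=5k", "age": "<25",  "class": "no"},
    {"married": "no",  "salary": "<20k",     "balance": "<5k",  "age": ">=25", "class": "no"},
    {"married": "yes", "salary": "20k..50k", "balance": ">=5k", "age": ">=25", "class": "yes"},
]

def info(recs):
    """Expected information I: -sum p_i log2 p_i over the class proportions."""
    n = len(recs)
    probs = (sum(r["class"] == c for r in recs) / n for c in ("yes", "no"))
    return -sum(p * log2(p) for p in probs if p > 0)

def gain(attr):
    """Gain(A) = I - E(A), where E(A) weights the info of each partition of A."""
    e = 0.0
    for v in {r[attr] for r in records}:
        part = [r for r in records if r[attr] == v]
        e += len(part) / len(records) * info(part)
    return info(records) - e
```

With this data, `gain("salary")` is about 0.67 and `gain("married")` about 0.08, matching the slide, so Salary is selected for the root split.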
Classification
Demo of decision tree
Clustering
Introduction
The previous data mining task, classification, partitions data
based on a pre-classified training sample
Clustering is an automated process that groups related records
together
Records are grouped together on the basis of having
similar values for their attributes
The resulting groups are usually disjoint
Clustering
Some concepts
An important facet of clustering is the similarity function that is used
When the data is numeric, a similarity function based on distance is typically used
Examples: the Euclidean metric (Euclidean distance), the Minkowski metric, the Manhattan metric
Euclidean distance:
Distance(rj, rk) = sqrt( |rj1 − rk1|² + … + |rjn − rkn|² )
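The distance-based similarity measures named above amount to a few lines each. A small sketch, assuming Python, with records represented as tuples of numbers:

```python
from math import sqrt

def euclidean(r, s):
    """sqrt(|r1 - s1|^2 + ... + |rn - sn|^2)"""
    return sqrt(sum((a - b) ** 2 for a, b in zip(r, s)))

def manhattan(r, s):
    """|r1 - s1| + ... + |rn - sn|"""
    return sum(abs(a - b) for a, b in zip(r, s))

print(euclidean((0, 0), (3, 4)))   # 5.0
print(manhattan((0, 0), (3, 4)))   # 7
```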
Clustering
K-means clustering algorithm
Input: a database D of m records r1, …, rm and a desired number of
clusters, k
Output: a set of k clusters
Begin
Randomly choose k records as the centroids for the k clusters;
Repeat
Assign each record, ri, to the cluster whose centroid (mean) is closest to ri
among the k clusters;
Recalculate the centroid (mean) of each cluster based on the records assigned to
the cluster;
Until no change;
End;
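The loop above can be sketched directly in code. A sketch assuming Python; for reproducibility it takes the first k records as initial centroids rather than choosing them randomly, and the six 2-D points are illustrative:

```python
def kmeans(points, k):
    """Cluster points (tuples of numbers) into k groups; returns the clusters."""
    centroids = list(points[:k])                 # simplified initial centroids
    while True:
        # assignment step: each record goes to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        # update step: recompute each centroid as the mean of its cluster
        new = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:                     # until no change
            return clusters
        centroids = new

points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (0.9, 1.1), (8.2, 7.9), (7.9, 8.1)]
clusters = kmeans(points, 2)
```

On these points the algorithm converges after one update, separating the group around (1, 1) from the group around (8, 8).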
Clustering
Demo of K-means algorithm
Applications of data mining
Market analysis
Marketing strategies
Advertisement
Risk analysis and management
Finance and financial investments
Manufacturing and production
Fraud detection and detection of unusual patterns (outliers)
Telecommunications
Financial transactions
Anti-terrorism (!!!)
Applications of data mining
Text mining (news group, email, documents) and Web
mining
Stream data mining
DNA and bio-data analysis
Disease outcomes
Effectiveness of treatments
Identify new drugs
Example research papers:
An application of data mining for marketing in
telecommunications
Application of data mining to customer profile analysis in the
power electricity industry
Conditional market segmentation by neural networks
Cluster analysis in industrial markets
Market segmentation using support vector machines
Commercial tools
Oracle Data Miner
http://www.oracle.com/technology/products/bi/odm/odminer.html
Data To Knowledge
http://alg.ncsa.uiuc.edu/do/tools/d2k
SAS
http://www.sas.com/
Clementine
http://spss.com/clemetine/
Intelligent Miner
http://www-306.ibm.com/software/data/iminer/
Conclusion
Data mining is a “decision support” process in which we
search for patterns of information in data.
This technique can be used on many types of data.
It overlaps with machine learning, statistics, artificial
intelligence, databases, visualization…
Conclusion
The result of mining may be the discovery of the following types of
“new” information:
Association rules
Sequential patterns
Classification trees
…
References
R. Elmasri, S. B. Navathe, Fundamentals of Database Systems,
4th ed., Addison-Wesley, ISBN 0-321-20448-4
D. T. Larose, Discovering Knowledge in Data: An Introduction to
Data Mining, Wiley, ISBN 0-471-66652-2
Resources from the Internet
Data Mining: techniques for using data for business purposes (in Indonesian)
Thanks for listening!!!