Data Mining
Budi Santosa, PhD
2008
Lab Komputasi dan Optimasi Industri
Teknik Industri ITS
Content
Introduction
Overview of data mining technology
Association rules
Classification
Clustering
Applications of data mining
Commercial tools
Conclusion
Introduction
What is data mining?
Why do we need to ‘mine’ data?
On what kind of data can we ‘mine’?
What is data mining?
Data mining is a technique that combines statistical data-analysis methods with algorithms for processing large volumes of data.
Data mining is the process of discovering important information or patterns in large databases.
It is a part of the Knowledge Discovery in Databases (KDD) process.
Why data mining?
The explosive growth in data collection
The storing of data in data warehouses
The increased access to data through Web navigation and intranets
We have to find a more effective way to use these data in the decision-support process than just traditional query languages
On what kind of data?
Data warehouses
Transactional databases
Advanced database systems
Spatial and temporal
Time-series
Multimedia, text
WWW
…
GeneFilter Comparison Report
GeneFilter 1 Name: O2#1 8-20-99adjfinal
GeneFilter 2 Name: N2#1finaladj

ORF NAME   GENE NAME   CHRM   POSITION   RAW GF1   RAW GF2   NORMALIZED
YAL001C    TFC3        1      1 A 1 2    12.03     7.38      403.83
YBL080C    PET112      2      1 A 1 3    53.21     35.62     1,78…
YBR154C    RPB5        2      1 A 1 4    79.26     78.51     2,660.73
YCL044C                3      1 A 1 5    53.22     44.66     1,786.53
YDL020C    SON1        4      1 A 1 6    23.80     20.34     799.06
YDL211C                4      1 A 1 7    17.31     35.34     581.00
YDR155C    CPH1        4      1 A 1 8    349.78    401.84
YDR346C                4      1 A 1 9    64.97     65.88     2,180.87
YAL010C    MDM10       1      1 A 2 2    13.73     9.61      461.03
YBL088C    TEL1        2      1 A 2 3    8.50      7.74      285.38
YBR162C                2      1 A 2 4    226.84    293.83
Overview of data mining technology
Data Mining vs. Data Warehousing
Data Mining as a part of Knowledge Discovery Process
Goals of Data Mining and Knowledge Discovery
Types of Knowledge Discovery during Data Mining
Data Mining vs. Data Warehousing
A data warehouse is a repository of integrated information, available for queries and analysis. Data and information are extracted from heterogeneous sources as they are generated. This makes it much easier and more efficient to run queries over data that originally came from different sources.
The goal of data warehousing is to support decision
making with data!
Knowledge Discovery in Databases
and Data Mining
The non-trivial extraction of implicit, previously unknown, and potentially useful information from databases.
The knowledge discovery process comprises six phases: data selection, data cleansing, enrichment, data transformation or encoding, data mining, and the reporting and display of the discovered information.
Goals of Data Mining and KDD
Prediction: how certain attributes within the data will behave in the future.
Identification: identify the existence of an item, an event, or an activity.
Classification: partition the data into predefined categories.
Clustering: group objects into clusters of similar items.
Association: identify associations between objects.
Types of Knowledge Discovery during Data
Mining
Association rules
Classification hierarchies
Sequential patterns
Patterns within time-series
Clustering
Association Rules
Purpose
Provide rules that correlate the presence of one set of items with the presence of another set of items
Association Rules
Some concepts
Market-basket model
Look for combinations of products bought together
Put the SHOES near the SOCKS so that if a customer buys one they will buy the other
Transaction: the set of items a person buys together in one purchase, e.g. one supermarket basket
Association Rules
Some concepts
Transactions:

Transaction-id   Time   Items bought
101              6:35   Bread, milk, cookies, juice
792              7:38   Milk, juice
1130             8:05   Milk, eggs
1735             8:40   Bread, cookies, coffee

Item counts: milk 3, bread 2, cookies 2, juice 2, coffee 1, eggs 1

Support: how frequently a specific itemset occurs in the database
An association rule has the form X => Y
The confidence of the rule X => Y is support(X U Y) / support(X)
Example: confidence(Bread => Juice) = 0.25 / 0.50 = 50%
The goal of mining association rules is to generate all possible rules that exceed some minimum user-specified support and confidence thresholds
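The support and confidence definitions above can be written out directly. A minimal sketch, assuming Python; the transactions are the four from the example table:

```python
# Support and confidence over the four example transactions.
transactions = [
    {"bread", "milk", "cookies", "juice"},   # 101, 6:35
    {"milk", "juice"},                        # 792, 7:38
    {"milk", "eggs"},                         # 1130, 8:05
    {"bread", "cookies", "coffee"},           # 1735, 8:40
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    """confidence(X => Y) = support(X U Y) / support(X)."""
    return support(x | y) / support(x)

print(support({"milk"}))                  # 0.75
print(confidence({"bread"}, {"juice"}))   # 0.5
```

Here confidence(Bread => Juice) = support({bread, juice}) / support({bread}) = 0.25 / 0.50 = 50%, matching the example.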
Association Rules
Apriori Algorithm
Input: database of m transactions, D, and a minimum support,
mins, represented as a fraction of m
Output: frequent itemsets, L1, L2, …, Lk
Association Rules
Apriori algorithm
Example with mins = 2 (minf = 0.5):

Database D:
Transaction-id   Time   Items bought
101              6:35   Bread, milk, cookies, juice
792              7:38   Milk, juice
1130             8:05   Milk, eggs
1735             8:40   Bread, cookies, coffee

Candidate 1-itemsets: milk, bread, juice, cookies, eggs, coffee
  supports: 0.75, 0.5, 0.5, 0.5, 0.25, 0.25
Frequent 1-itemsets (freq >= 0.5): milk, bread, juice, cookies
  supports: 0.75, 0.5, 0.5, 0.5
Candidate 2-itemsets: {milk, bread}, {milk, juice}, {bread, juice}, {milk, cookies}, {bread, cookies}, {juice, cookies}
  supports: 0.25, 0.5, 0.25, 0.25, 0.5, 0.25
Frequent 2-itemsets: {milk, juice}, {bread, cookies}
  supports: 0.5, 0.5
Candidate 3-itemsets: Ф, so the frequent 3-itemsets are also Ф

Result: milk, bread, juice, cookies, {milk, juice}, {bread, cookies}
Apriori Algorithm
Begin
+ compute support(ij) = count(ij)/m for each individual item, i1, i2, …, in, by scanning the database
once and counting the number of transactions in which item ij appears
+ the candidate frequent 1-itemset, C1, will be the set of items i1, i2, …, in
+ the subset of items of C1 with support(ij) >= mins becomes the frequent 1-itemset, L1;
+ k = 1;
+ termination = false;
+ repeat
+ Lk+1 = Ф;
+ create the candidate frequent (k+1)-itemset, Ck+1, by combining members of Lk that have k-1
items in common; (this forms candidate frequent (k+1)-itemsets by selectively extending frequent
k-itemsets by one item)
+ in addition, only keep in Ck+1 those (k+1)-itemsets such that every subset of size k
appears in Lk;
+ scan the database once and compute the support for each member of Ck+1; if the support for a
member of Ck+1 >= mins then add that member to Lk+1;
+ if Lk+1 is empty then termination = true
else k = k+1;
+ until termination
end;
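The pseudocode above can be rendered as a short program. A minimal sketch, assuming Python; the `apriori` function name and set-based representation are illustrative choices, and the transactions are the four from the running example:

```python
from itertools import combinations

transactions = [
    {"bread", "milk", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]

def apriori(db, mins):
    """Return all frequent itemsets with support >= mins (a fraction)."""
    m = len(db)
    support = lambda s: sum(s <= t for t in db) / m
    # frequent 1-itemsets L1 from the candidate 1-itemsets
    items = sorted({i for t in db for i in t})
    L = [frozenset([i]) for i in items if support(frozenset([i])) >= mins]
    frequent = list(L)
    k = 1
    while L:
        # join step: combine members of Lk sharing k-1 items
        candidates = {a | b for a, b in combinations(L, 2) if len(a | b) == k + 1}
        # prune step: every size-k subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k))}
        L = [c for c in candidates if support(c) >= mins]
        frequent += L
        k += 1
    return frequent

freq = apriori(transactions, 0.5)
```

On this data `freq` contains the four frequent 1-itemsets plus {milk, juice} and {bread, cookies}, matching the worked example.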
Association Rules
Demo of Apriori Algorithm
Association Rules
Frequent-pattern tree algorithm
Motivated by the fact that Apriori-based algorithms may generate and test
a very large number of candidate itemsets.
Example:
with 1,000 frequent 1-items, Apriori would have to generate
on the order of 500,000 candidate 2-itemsets
The FP-growth algorithm is one approach that eliminates the generation of
a large number of candidate itemsets
Association Rules
Frequent-pattern tree algorithm
Generating a compressed version of the database in terms of an
FP-Tree
FP-Growth Algorithm for finding frequent itemsets
Association Rules
FP-Tree algorithm

[Figure: the FP-tree built from the four transactions (1: bread, milk, cookies, juice; 2: milk, juice; 3: milk, eggs; 4: bread, cookies, coffee).
Item head table (item : support, with a link to that item's nodes): milk : 3, bread : 2, cookies : 2, juice : 2.
Tree: Root → Milk:3 → Bread:1 → Cookies:1 → Juice:1; Milk:3 → Juice:1; Root → Bread:1 → Cookies:1.]
Association Rules
FP-Growth algorithm

[Figure: mining the FP-tree above. Processing the header items milk, bread, cookies, and juice bottom-up yields the frequent 2-itemsets {milk, juice} and {bread, cookies}, in addition to the frequent 1-itemsets.]
Association Rules
FP-Growth algorithm
Procedure FP-growth (tree, alpha);
Begin
If tree contains a single path then
For each combination, beta, of the nodes in the path
Generate pattern (beta U alpha) with support = minimum support of the nodes in beta
Else
For each item, i, in the header of the tree do
Begin
Generate pattern beta = (i U alpha) with support = i.support;
Construct beta’s conditional pattern base;
Construct beta’s conditional FP-tree, beta_tree;
If beta_tree is not empty then
FP-growth (beta_tree, beta);
End
end
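The first step, building the compressed FP-tree, can be sketched as follows. This is a sketch assuming Python; the `Node` class and function names are illustrative, and the recursive mining step above is omitted for brevity:

```python
from collections import Counter

transactions = [
    ["bread", "milk", "cookies", "juice"],
    ["milk", "juice"],
    ["milk", "eggs"],
    ["bread", "cookies", "coffee"],
]
mins = 2  # minimum support count

class Node:
    def __init__(self, item, count=0):
        self.item, self.count, self.children = item, count, {}

def build_fp_tree(db, mins):
    counts = Counter(i for t in db for i in t)
    # header table: frequent items in descending support order
    header = [i for i, c in counts.most_common() if c >= mins]
    root = Node(None)
    for t in db:
        # keep only frequent items, order them by the header, insert as a path
        node = root
        for item in sorted((i for i in t if i in header), key=header.index):
            child = node.children.setdefault(item, Node(item))
            child.count += 1
            node = child
    return root, header

root, header = build_fp_tree(transactions, mins)
```

On the four example transactions this reproduces the tree in the figure: Milk:3 at the root with a Bread:1 → Cookies:1 → Juice:1 path and a Juice:1 child, plus a separate Bread:1 → Cookies:1 branch.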
Association Rules
Demo of FP-Growth algorithm
Classification
Introduction
Classification is the process of learning a model that describes
different classes of data; the classes are predetermined
The model produced is usually in the form of a decision tree
or a set of rules
[Figure: a sample decision tree for loan risk. Internal nodes test married (yes/no), salary (<20k, 20k..50k, >=50k), account balance (<5k, >=5k), and age (<25, >=25); leaves assign the classes poor risk, fair risk, and good risk.]
RID   Married   Salary     Acct balance   Age    Loanworthy
1     No        >=50k      <5k            >=25   Yes
2     Yes       >=50k      >=5k           >=25   Yes
3     Yes       20k..50k   <5k            <25    No
4     No        <20k       >=5k           <25    No
5     No        <20k       <5k            >=25   No
6     Yes       20k..50k   >=5k           >=25   Yes
Expected information (pi is the proportion of records in class i):
I(S1, S2, …, Sn) = − Σ i=1..n  pi log2 pi
Here I(3,3) = 1

Entropy of an attribute A that partitions S into subsets Sj:
E(A) = Σ j=1..n  (( Sj1 + … + Sjn ) / S) · I(Sj1, …, Sjn)

Information gain:
Gain(A) = I − E(A)

E(Married) = 0.92, Gain(Married) = 0.08
E(Salary) = 0.33, Gain(Salary) = 0.67
E(A.balance) = 0.82, Gain(A.balance) = 0.18
E(Age) = 0.81, Gain(Age) = 0.19

Salary has the highest gain, so it becomes the root split:
<20k → class is “no” {4, 5}
20k..50k → split further on age: <25 → class is “no” {3}; >=25 → class is “yes” {6}
>=50k → class is “yes” {1, 2}
Classification
Algorithm for decision tree induction
Procedure Build_tree (Records, Attributes);
Begin
Create a node N;
If all records belong to the same class, C, then
Return N as a leaf node with class label C;
If Attributes is empty then
Return N as a leaf node with class label C, such that the majority of records belong to it;
Select the attribute Ai with the highest information gain from Attributes;
Label node N with Ai;
For each known value, Vj, of Ai do
Begin
Add a branch from node N for the condition Ai = Vj;
Sj = subset of Records where Ai = Vj;
If Sj is empty then
Add a leaf, L, with class label C, such that the majority of records belong to it, and return L
Else add the node returned by Build_tree(Sj, Attributes − Ai)
End;
End;
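The information-gain computation driving the attribute selection above can be checked with a short script. A sketch assuming Python; the record dictionaries and attribute names are a shorthand encoding of the six-row loan table:

```python
from math import log2

records = [
    {"married": "no",  "salary": ">=50k",    "balance": "<5k",  "age": ">=25", "class": "yes"},
    {"married": "yes", "salary": ">=50k",    "balance": ">=5k", "age": ">=25", "class": "yes"},
    {"married": "yes", "salary": "20k..50k", "balance": "<5k",  "age": "<25",  "class": "no"},
    {"married": "no",  "salary": "<20k",     "balance": ">=5k", "age": "<25",  "class": "no"},
    {"married": "no",  "salary": "<20k",     "balance": "<5k",  "age": ">=25", "class": "no"},
    {"married": "yes", "salary": "20k..50k", "balance": ">=5k", "age": ">=25", "class": "yes"},
]

def info(recs):
    """Expected information I: -sum p_i log2 p_i over the class proportions."""
    n = len(recs)
    probs = (sum(r["class"] == c for r in recs) / n for c in ("yes", "no"))
    return -sum(p * log2(p) for p in probs if p > 0)

def gain(attr):
    """Gain(A) = I - E(A), where E(A) weights the info of each partition of A."""
    e = 0.0
    for v in {r[attr] for r in records}:
        part = [r for r in records if r[attr] == v]
        e += len(part) / len(records) * info(part)
    return info(records) - e
```

With this data, `gain("salary")` is about 0.67 and `gain("married")` about 0.08, matching the slide, so Salary is selected for the root split.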
Classification
Demo of decision tree
Clustering
Introduction
The previous data mining task, classification, partitions data
based on a pre-classified training sample
Clustering is an automated process that groups related records
together
Records are grouped together on the basis of having
similar values for their attributes
The resulting groups are usually disjoint
Clustering
Some concepts
An important facet of clustering is the similarity function that is used
When the data is numeric, a similarity function based on distance is typically used
Examples: the Euclidean metric (Euclidean distance), the Minkowski metric, the Manhattan metric
Euclidean distance:
Distance(rj, rk) = sqrt( |rj1 − rk1|² + … + |rjn − rkn|² )
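The distance-based similarity measures named above amount to a few lines each. A small sketch, assuming Python, with records represented as tuples of numbers:

```python
from math import sqrt

def euclidean(r, s):
    """sqrt(|r1 - s1|^2 + ... + |rn - sn|^2)"""
    return sqrt(sum((a - b) ** 2 for a, b in zip(r, s)))

def manhattan(r, s):
    """|r1 - s1| + ... + |rn - sn|"""
    return sum(abs(a - b) for a, b in zip(r, s))

print(euclidean((0, 0), (3, 4)))   # 5.0
print(manhattan((0, 0), (3, 4)))   # 7
```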
Clustering
K-means clustering algorithm
Input: a database D of m records r1, …, rm and a desired number of
clusters, k
Output: a set of k clusters
Begin
Randomly choose k records as the centroids for the k clusters;
Repeat
Assign each record, ri, to the cluster whose centroid (mean) is closest to ri
among the k clusters;
Recalculate the centroid (mean) of each cluster based on the records assigned to
the cluster;
Until no change;
End;
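The loop above can be sketched directly in code. A sketch assuming Python; for reproducibility it takes the first k records as initial centroids rather than choosing them randomly, and the six 2-D points are illustrative:

```python
def kmeans(points, k):
    """Cluster points (tuples of numbers) into k groups; returns the clusters."""
    centroids = list(points[:k])                 # simplified initial centroids
    while True:
        # assignment step: each record goes to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        # update step: recompute each centroid as the mean of its cluster
        new = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:                     # until no change
            return clusters
        centroids = new

points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (0.9, 1.1), (8.2, 7.9), (7.9, 8.1)]
clusters = kmeans(points, 2)
```

On these points the algorithm converges after one update, separating the group around (1, 1) from the group around (8, 8).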
Clustering
Demo of K-means algorithm
Applications of data mining
Market analysis
Marketing strategies
Advertisement
Risk analysis and management
Finance and financial investments
Manufacturing and production
Fraud detection and detection of unusual patterns (outliers)
Telecommunications
Financial transactions
Anti-terrorism (!!!)
Applications of data mining
Text mining (news group, email, documents) and Web
mining
Stream data mining
DNA and bio-data analysis
Disease outcomes
Effectiveness of treatments
Identify new drugs
Example research papers:
An application of data mining for marketing in
telecommunications
Application of data mining to customer profile analysis in the
power electricity industry
Conditional market segmentation by neural networks
Cluster analysis in industrial markets
Market segmentation using support vector machines
Commercial tools
Oracle Data Miner
http://www.oracle.com/technology/products/bi/odm/odminer.html
Data To Knowledge
http://alg.ncsa.uiuc.edu/do/tools/d2k
SAS
http://www.sas.com/
Clementine
http://spss.com/clemetine/
Intelligent Miner
http://www-306.ibm.com/software/data/iminer/
Conclusion
Data mining is a “decision support” process in which we
search for patterns of information in data.
This technique can be used on many types of data.
It overlaps with machine learning, statistics, artificial
intelligence, databases, visualization…
Conclusion
The result of mining may be the discovery of the following types of
“new” information:
Association rules
Sequential patterns
Classification trees
…
References
R. Elmasri, S. B. Navathe, Fundamentals of Database Systems,
4th ed., Addison-Wesley, ISBN 0-321-20448-4
D. T. Larose, Discovering Knowledge in Data: An Introduction to
Data Mining, Wiley, ISBN 0-471-66652-2
Resources from the Internet
Data Mining: techniques for using data for business purposes (in Indonesian)
Thanks for listening!!!