Data Mining Concept
Ho Viet Lam - Nguyen Thi My Dung
May 14th, 2007
Content
o Introduction
o Overview of data mining technology
o Association rules
o Classification
o Clustering
o Applications of data mining
o Commercial tools
o Conclusion
Introduction
o What is data mining?
o Why do we need to ‘mine’ data?
o On what kind of data can we ‘mine’?
What is data mining?
o The process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.
o A part of the Knowledge Discovery in Databases (KDD) process.
Why data mining?
The explosive growth in data collection
The storing of data in data warehouses
The availability of increased access to data from Web navigation and intranets
We have to find a more effective way to use these data in the decision-support process than just traditional query languages
On what kind of data?
o Relational databases
o Data warehouses
o Transactional
databases
o Advanced database
systems
Object-relational
Spatial and temporal
Time-series
Multimedia, text
WWW
…
[Slide figures: examples of mineable data, including 3D anatomy (structure), 1D signals (function), annotation (metadata), and a GeneFilter comparison report listing ORF names, gene names, chromosome numbers, and raw/normalized spot intensities for two filters.]
Overview of data mining technology
o Data Mining vs. Data Warehousing
o Data Mining as a part of Knowledge
Discovery Process
o Goals of Data Mining and Knowledge
Discovery
o Types of Knowledge Discovery during
Data Mining
Data Mining vs. Data Warehousing
o A Data Warehouse is a repository of integrated information, available for queries and analysis. Data and information are extracted from heterogeneous sources as they are generated... This makes it much easier and more efficient to run queries over data that originally came from different sources.
o The goal of data warehousing is to support
decision making with data!
Knowledge Discovery in Databases
and Data Mining
o The non-trivial extraction of implicit,
unknown, and potentially useful
information from databases.
o The knowledge discovery process comprises six phases: data selection, data cleansing, enrichment, data transformation or encoding, data mining, and the reporting and display of the discovered information.
Goals of Data Mining and KDD
o Prediction: how certain attributes within the data will behave in the future.
o Identification: identify the existence of an item, an event, or an activity.
o Classification: partition the data
into categories.
o Optimization: optimize the use of
limited resources.
Types of Knowledge Discovery
during Data Mining
o Association rules
o Classification hierarchies
o Sequential patterns
o Patterns within time-series
o Clustering
Association Rules
o Purpose
Provide rules that correlate the presence of a set of items with the presence of another set of items
Association Rules
o Some concepts
o Market-basket model
  Look for combinations of products
  Put the SHOES near the SOCKS so that if a customer buys one they will buy the other
o Transaction: the fact that a person buys some items of the itemset at a supermarket
Association Rules
o Some concepts
Transaction table:

  Transaction-id  Time  Items bought
  101             6:35  Bread, milk, cookies, juice
  792             7:38  Milk, juice
  1130            8:05  Milk, eggs
  1735            8:40  Bread, cookies, coffee

Item supports (number of transactions containing the item): milk 3, bread 2, cookies 2, juice 2, coffee 1, eggs 1

Support: how frequently a specific itemset occurs in the database
For a rule X => Y, the confidence of the rule is:
  confidence(X => Y) = support(X U Y) / support(X)
Example: support({bread, juice}) = 1/4 = 25% and support({bread}) = 50%, so the confidence of bread => juice is 25% / 50% = 50%
The goal of mining association rules is to generate all possible rules that exceed some minimum user-specified support and confidence thresholds
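As a concrete illustration, the support and confidence computations can be sketched in Python on the toy transactions above (the function names are our own):

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    """Confidence of the rule X => Y: support(X U Y) / support(X)."""
    return support(x | y, transactions) / support(x, transactions)

# The toy database from the slides, one set of items per transaction.
db = [{"bread", "milk", "cookies", "juice"},
      {"milk", "juice"},
      {"milk", "eggs"},
      {"bread", "cookies", "coffee"}]
```

With this database, confidence({"bread"}, {"juice"}, db) reproduces the 50% confidence of the bread => juice rule.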
Association Rules
o Apriori Algorithm
Input: a database of m transactions, D, and a minimum support, mins, represented as a fraction of m
Output: frequent itemsets, L1, L2, …, Lk
Association Rules
o Apriori algorithm: worked example (on the transaction table above, with mins = 2, i.e. a minimum frequency minf = 0.5)

The candidate 1-itemsets: milk, bread, juice, cookies, eggs, coffee
  frequencies: 0.75, 0.5, 0.5, 0.5, 0.25, 0.25
The frequent 1-itemsets (frequency >= 0.5): milk, bread, juice, cookies
  frequencies: 0.75, 0.5, 0.5, 0.5
The candidate 2-itemsets: {milk, bread}, {milk, juice}, {bread, juice}, {milk, cookies}, {bread, cookies}, {juice, cookies}
  frequencies: 0.25, 0.5, 0.25, 0.25, 0.5, 0.25
The frequent 2-itemsets: {milk, juice}, {bread, cookies}
  frequencies: 0.5, 0.5
The candidate 3-itemsets: Ø, so the frequent 3-itemsets are also Ø
Result: milk, bread, juice, cookies, {milk, juice}, {bread, cookies}
o Apriori Algorithm
Begin
  compute support(ij) = count(ij)/m for each individual item, i1, i2, …, in, by scanning the database once and counting the number of transactions that item ij appears in;
  the candidate frequent 1-itemset, C1, will be the set of items i1, i2, …, in;
  the subset of items from C1 where support(ij) >= mins becomes the frequent 1-itemset, L1;
  k = 1;
  termination = false;
  repeat
    Lk+1 = empty set;
    create the candidate frequent (k+1)-itemset, Ck+1, by combining members of Lk that have k-1 items in common (this forms candidate frequent (k+1)-itemsets by selectively extending frequent k-itemsets by one item);
    in addition, only consider as elements of Ck+1 those (k+1)-itemsets such that every subset of size k appears in Lk;
    scan the database once and compute the support for each member of Ck+1; if the support for a member of Ck+1 >= mins then add that member to Lk+1;
    if Lk+1 is empty then termination = true
    else k = k+1;
  until termination;
end;
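The loop above can be sketched in a few lines of Python; run on the toy database from the earlier slides, it reproduces the worked example (this is a minimal sketch, and the function names are our own):

```python
from itertools import combinations

def apriori(transactions, minf):
    """Minimal Apriori: return every itemset with frequency >= minf."""
    m = len(transactions)
    freq = lambda s: sum(1 for t in transactions if s <= t) / m
    # Candidate 1-itemsets: every individual item in the database.
    items = sorted({i for t in transactions for i in t})
    L = [frozenset([i]) for i in items if freq(frozenset([i])) >= minf]
    result = list(L)
    k = 1
    while L:
        # Combine frequent k-itemsets sharing k-1 items into (k+1)-candidates.
        candidates = {a | b for a in L for b in L if len(a | b) == k + 1}
        # Prune candidates that have an infrequent k-subset.
        frequent_k = set(L)
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent_k
                             for s in combinations(c, k))}
        L = [c for c in candidates if freq(c) >= minf]
        result.extend(L)
        k += 1
    return result

db = [{"bread", "milk", "cookies", "juice"},
      {"milk", "juice"},
      {"milk", "eggs"},
      {"bread", "cookies", "coffee"}]
frequent = apriori(db, 0.5)
```

On this database the result is the four frequent 1-itemsets plus {milk, juice} and {bread, cookies}, matching the worked example.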
Association Rules
o Demo of Apriori Algorithm
Association Rules
o Frequent-pattern tree algorithm
Motivated by the fact that Apriori-based algorithms may generate and test a very large number of candidate itemsets
Example: with 1000 frequent 1-items, Apriori would have to generate almost 500,000 candidate 2-itemsets (and, over all sizes, up to 2^1000 candidate itemsets)
The FP-growth algorithm is one approach that eliminates the generation of a large number of candidate itemsets
Association Rules
o Frequent-pattern tree algorithm
  Generating a compressed version of the database in terms of an FP-tree
  FP-growth: an algorithm for finding frequent itemsets
Association Rules
o FP-Tree algorithm

Item header table (frequent items ordered by support, each with a link to its occurrences in the tree):
  milk 3, bread 2, cookies 2, juice 2

Each transaction is inserted with its frequent items sorted in header-table order (the infrequent items eggs and coffee are dropped):
  Transaction 1: milk, bread, cookies, juice
  Transaction 2: milk, juice
  Transaction 3: milk
  Transaction 4: bread, cookies

Resulting FP-tree:
  Root
  +-- milk:3
  |   +-- bread:1 -- cookies:1 -- juice:1
  |   +-- juice:1
  +-- bread:1 -- cookies:1
Association Rules
o FP-Growth algorithm
Mining the FP-tree above, item by item from the header table (milk, bread, cookies, juice), yields the frequent patterns:
  milk, bread, cookies, juice, {milk, juice}, {bread, cookies}
Association Rules
o FP-Growth algorithm
Procedure FP-growth (tree, alpha);
Begin
  If tree contains a single path then
    For each combination, beta, of the nodes in the path
      Generate pattern (beta U alpha) with support = minimum support of the nodes in beta
  Else
    For each item, i, in the header of the tree do
    Begin
      Generate pattern beta = (i U alpha) with support = i.support;
      Construct beta's conditional pattern base;
      Construct beta's conditional FP-tree, beta_tree;
      If beta_tree is not empty then
        FP-growth (beta_tree, beta);
    End
End;
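To make the tree construction concrete, here is a small Python sketch that builds an FP-tree from the toy transactions (the class and function names are our own, and this covers only tree construction, not the full FP-growth mining step):

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}  # item -> FPNode

def build_fptree(transactions, mins):
    """Build an FP-tree: order each transaction's frequent items by
    global support, then insert it as a path, sharing common prefixes."""
    support = Counter(i for t in transactions for i in t)
    frequent = {i for i, c in support.items() if c >= mins}
    root = FPNode(None, None)
    for t in transactions:
        # Keep only frequent items, most frequent first (ties by name).
        path = sorted((i for i in t if i in frequent),
                      key=lambda i: (-support[i], i))
        node = root
        for item in path:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root

db = [["bread", "milk", "cookies", "juice"],
      ["milk", "juice"],
      ["milk", "eggs"],
      ["bread", "cookies", "coffee"]]
tree = build_fptree(db, mins=2)
```

The resulting tree has a milk:3 node under the root with bread/cookies/juice and juice branches beneath it, plus a separate bread/cookies path, matching the diagram above.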
Association Rules
o Demo of FP-Growth algorithm
Classification
o Introduction
Classification is the process of learning a model that describes different classes of data; the classes are predetermined
The model that is produced is usually in the
form of a decision tree or a set of rules
Example decision tree for loan risk:

  married?
  +-- no  -> salary?
  |     <20k           -> poor risk
  |     >=20k and <50k -> fair risk
  |     >=50k          -> good risk
  +-- yes -> acct balance?
        <5k  -> poor risk
        >=5k -> age?
                 <25  -> fair risk
                 >=25 -> good risk
  RID  Married  Salary    Acct balance  Age   Loanworthy (class attribute)
  1    No       >=50k     <5k           >=25  Yes
  2    Yes      >=50k     >=5k          >=25  Yes
  3    Yes      20k..50k  <5k           <25   No
  4    No       <20k      >=5k          <25   No
  5    No       <20k      <5k           >=25  No
  6    Yes      20k..50k  >=5k          >=25  Yes
Expected information:
  I(S1, S2, …, Sn) = - sum over i of (pi * log2 pi)
For the table above, I(3,3) = 1

Entropy of an attribute A that partitions S into subsets S1, …, Sn:
  E(A) = sum over j of ((Sj1 + … + Sjn) / S) * I(Sj1, …, Sjn)

Information gain:
  Gain(A) = I - E(A)

For the loanworthy example:
  E(Married) = 0.92, Gain(Married) = 0.08
  E(Salary) = 0.33, Gain(Salary) = 0.67
  E(A.balance) = 0.92, Gain(A.balance) = 0.08
  E(Age) = 0.54, Gain(Age) = 0.46

Salary has the highest information gain, so it becomes the root of the tree:
  Salary <20k     -> class is "no" {4,5}
  Salary 20k..50k -> split on age: <25 -> class is "no" {3}; >=25 -> class is "yes" {6}
  Salary >=50k    -> class is "yes" {1,2}
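These quantities are easy to recompute; here is a minimal Python sketch of the information-gain calculation over the loan table (the field and function names are our own):

```python
from math import log2

def info(counts):
    """Expected information I(S1,...,Sn) = -sum pi * log2 pi."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def gain(records, attribute, target="Loanworthy"):
    """Information gain of `attribute` with respect to the class attribute."""
    def class_counts(rows):
        vals = [r[target] for r in rows]
        return [vals.count(v) for v in set(vals)]
    n = len(records)
    entropy = 0.0
    for v in {r[attribute] for r in records}:
        subset = [r for r in records if r[attribute] == v]
        entropy += len(subset) / n * info(class_counts(subset))
    return info(class_counts(records)) - entropy

table = [
    {"Married": "No",  "Salary": ">=50k",    "Balance": "<5k",  "Age": ">=25", "Loanworthy": "Yes"},
    {"Married": "Yes", "Salary": ">=50k",    "Balance": ">=5k", "Age": ">=25", "Loanworthy": "Yes"},
    {"Married": "Yes", "Salary": "20k..50k", "Balance": "<5k",  "Age": "<25",  "Loanworthy": "No"},
    {"Married": "No",  "Salary": "<20k",     "Balance": ">=5k", "Age": "<25",  "Loanworthy": "No"},
    {"Married": "No",  "Salary": "<20k",     "Balance": "<5k",  "Age": ">=25", "Loanworthy": "No"},
    {"Married": "Yes", "Salary": "20k..50k", "Balance": ">=5k", "Age": ">=25", "Loanworthy": "Yes"},
]
gains = {a: gain(table, a) for a in ["Married", "Salary", "Balance", "Age"]}
```

Salary comes out with the highest gain (about 0.67), matching its choice as the root of the tree.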
Classification
o Algorithm for decision tree induction
Procedure Build_tree (Records, Attributes);
Begin
  Create a node N;
  If all records belong to the same class, C, then
    Return N as a leaf node with class label C;
  If Attributes is empty then
    Return N as a leaf node with class label C, such that the majority of records belong to it;
  Select attribute Ai (with the highest information gain) from Attributes;
  Label node N with Ai;
  For each known value, Vj, of Ai do
  Begin
    Add a branch from node N for the condition Ai = Vj;
    Sj = subset of Records where Ai = Vj;
    If Sj is empty then
      Add a leaf, L, with class label C, such that the majority of records belong to it, and return L
    Else add the node returned by Build_tree(Sj, Attributes - Ai)
  End;
End;
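The induction procedure above can be sketched as a compact ID3-style function over the loan table (a minimal sketch with our own names; ties in information gain are broken by the order in which attributes are listed):

```python
from math import log2

def info(labels):
    """Expected information I of a list of class labels."""
    n = len(labels)
    return -sum(labels.count(v) / n * log2(labels.count(v) / n)
                for v in set(labels))

def build_tree(records, attributes, target="Loanworthy"):
    """Returns a class label (leaf) or an (attribute, {value: subtree}) pair."""
    labels = [r[target] for r in records]
    if len(set(labels)) == 1:            # all records in one class -> leaf
        return labels[0]
    if not attributes:                   # no attributes left -> majority leaf
        return max(set(labels), key=labels.count)
    def gain(a):                         # information gain of attribute a
        e = sum(len(s) / len(records) * info([r[target] for r in s])
                for v in {r[a] for r in records}
                for s in [[r for r in records if r[a] == v]])
        return info(labels) - e
    best = max(attributes, key=gain)     # attribute with highest gain
    return (best, {v: build_tree([r for r in records if r[best] == v],
                                 [a for a in attributes if a != best],
                                 target)
                   for v in {r[best] for r in records}})

table = [
    {"Married": "No",  "Salary": ">=50k",    "Balance": "<5k",  "Age": ">=25", "Loanworthy": "Yes"},
    {"Married": "Yes", "Salary": ">=50k",    "Balance": ">=5k", "Age": ">=25", "Loanworthy": "Yes"},
    {"Married": "Yes", "Salary": "20k..50k", "Balance": "<5k",  "Age": "<25",  "Loanworthy": "No"},
    {"Married": "No",  "Salary": "<20k",     "Balance": ">=5k", "Age": "<25",  "Loanworthy": "No"},
    {"Married": "No",  "Salary": "<20k",     "Balance": "<5k",  "Age": ">=25", "Loanworthy": "No"},
    {"Married": "Yes", "Salary": "20k..50k", "Balance": ">=5k", "Age": ">=25", "Loanworthy": "Yes"},
]
root, branches = build_tree(table, ["Married", "Salary", "Age", "Balance"])
```

On this table the root is Salary, with the <20k and >=50k branches as "no"/"yes" leaves and the 20k..50k branch split further on age, as in the example.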
Classification
o Demo of decision tree
Clustering
o Introduction
The previous data mining task, classification, deals with partitioning data based on a pre-classified training sample
Clustering is an automated process that groups related records together
Related records are grouped together on the basis of having similar values for their attributes
The groups are usually disjoint
Clustering
o Some concepts
An important facet of clustering is the similarity function that is used
When the data is numeric, a similarity function based on distance is typically used:
the Euclidean metric (Euclidean distance), the Minkowski metric, the Manhattan metric
Euclidean distance:
  Distance(rj, rk) = sqrt(|rj1 - rk1|^2 + … + |rjn - rkn|^2)
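For example, the Euclidean and Manhattan metrics can be sketched as follows (the function names are our own):

```python
from math import sqrt

def euclidean(r, s):
    """sqrt(|r1 - s1|^2 + ... + |rn - sn|^2)"""
    return sqrt(sum((a - b) ** 2 for a, b in zip(r, s)))

def manhattan(r, s):
    """|r1 - s1| + ... + |rn - sn|"""
    return sum(abs(a - b) for a, b in zip(r, s))
```

The two metrics can rank the same pairs of records differently, which is why the choice of similarity function matters for the resulting clusters.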
Clustering
o K-means clustering algorithm
o Input: a database D of m records, r1, …, rm, and a desired number of clusters, k
o Output: a set of k clusters
Begin
  Randomly choose k records as the centroids for the k clusters;
Repeat
Assign each record, ri, to a cluster such that the distance
between ri and the cluster centroid (mean) is the
smallest among the k clusters;
Recalculate the centroid (mean) for each cluster based
on the records assigned to the cluster;
Until no change;
End;
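The loop above can be sketched in Python; as a minimal sketch (our own names), it uses the first k records as initial centroids for reproducibility, whereas the algorithm as stated chooses them randomly:

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def kmeans(records, k):
    """Minimal k-means: assign each record to its nearest centroid,
    recompute each centroid as the mean of its records, repeat until
    the assignment no longer changes."""
    centroids = [tuple(r) for r in records[:k]]  # deterministic init
    assignment = None
    while True:
        # Assignment step: index of the nearest centroid for each record.
        new_assignment = [min(range(k), key=lambda j: dist(r, centroids[j]))
                          for r in records]
        if new_assignment == assignment:          # no change -> converged
            return centroids, assignment
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its records.
        for j in range(k):
            members = [r for r, a in zip(records, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))

data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),   # a tight group near (1, 1)
        (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]   # another group near (8, 8)
centroids, labels = kmeans(data, k=2)
```

On this toy data the two obvious groups end up in separate clusters; with random initialization, k-means may converge to different (and sometimes poorer) clusterings on harder data.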
Clustering
o Demo of K-means algorithm
Applications of data mining
o Market analysis
  Marketing strategies
  Advertisement
o Risk analysis and management
  Finance and financial investments
  Manufacturing and production
o Fraud detection and detection of unusual patterns (outliers)
  Telecommunications
  Financial transactions
  Anti-terrorism (!!!)
Applications of data mining
o Text mining (news groups, email, documents) and Web mining
o Stream data mining
o DNA and bio-data analysis
  Disease outcomes
  Effectiveness of treatments
  Identifying new drugs
Commercial tools
o Oracle Data Miner
http://www.oracle.com/technology/products/bi/od
m/odminer.html
o Data To Knowledge
http://alg.ncsa.uiuc.edu/do/tools/d2k
o SAS
http://www.sas.com/
o Clementine
http://spss.com/clemetine/
o Intelligent Miner
http://www-306.ibm.com/software/data/iminer/
Conclusion
o Data mining is a “decision support”
process in which we search for
patterns of information in data.
o This technique can be used on many
types of data.
o Overlaps with machine learning, statistics, artificial intelligence, databases, visualization…
Conclusion
The result of mining may be the discovery of the following types of "new" information:
Association rules
Sequential patterns
Classification trees
…
References
o Fundamentals of Database Systems, fourth edition -- R. Elmasri, S. B. Navathe -- Addison-Wesley -- ISBN 0-321-20448-4
o Discovering Knowledge in Data: An Introduction to Data Mining -- Daniel T. Larose -- Wiley -- ISBN 0-471-66652-2
o Resources from the Internet
Thanks for listening!!!