Knowledge Discovery in Databases


Knowledge Discovery
ITCS 6162
Transparencies prepared by Ho Tu Bao [JAIST]

Clustering
“Are there clusters of similar cells?”
Clusters suggested by the cell figure:
• Light color with 1 nucleus
• Dark color with 2 tails and 2 nuclei
• 1 nucleus and 1 tail
• Dark color with 1 tail and 2 nuclei

Association Rule Discovery
Task: Discovering association rules among
items in a transaction database.
An association between two items A and B means
that the presence of A in a record implies the
presence of B in the same record: A => B.
In general: A1, A2, … => B

Association Rule Discovery
“Are there any associations between the
characteristics of the cells?”

If Color = light and #nuclei = 1 then #tails = 1
(support = 12.5%; confidence = 50%)
If #nuclei = 2 and Cell = cancerous then #tails = 2
(support = 25%; confidence = 100%)
If #tails = 1 then Color = light
(support = 37.5%; confidence = 75%)

Many Other Data Mining Techniques
• Genetic Algorithms
• Rough Sets
• Bayesian Networks
• Text Mining
• Statistics
• Time Series

Lecture 1: Overview of KDD
1. What is KDD and Why?
2. The KDD Process
3. KDD Applications
4. Data Mining Methods
5. Challenges for KDD
Challenges and Influential Aspects
• Massive data sets, high dimensionality
(efficiency, scalability)
• Different sources of data (distributed,
heterogeneous databases; noisy, missing,
and irrelevant data, etc.)
• Changing data and knowledge
• Handling of different types of data with
different degrees of supervision
• Interactivity and visualization
• Understandability of patterns, and various
kinds of requests and results (decision
lists, inference networks, concept
hierarchies, etc.)

Massive Data Sets and High Dimensionality
High dimensionality increases the size of the
space of patterns exponentially: with p attributes,
each having d discrete values on average, the space
of possible instances has size d^p. With 38
attributes of 10 values each, # instances = 10^38.

Cell example (the figure shows cancerous cells
C1-C3 and healthy cells h1, h2):
Classes: {Cancerous, Healthy}
Attributes: Cell body: {dark, light}; #nuclei: {1, 2}; #tails: {1, 2}
(# instances = 2^3 = 8)
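A quick sanity check of the d^p count, as a minimal Python sketch (the function name is illustrative):

```python
def instance_space_size(domain_sizes):
    """d1 * d2 * ... * dp distinct instances for attribute domains of
    the given sizes; equals d**p when every domain has size d."""
    size = 1
    for d in domain_sizes:
        size *= d
    return size

print(instance_space_size([2, 2, 2]))   # cell example: 2**3 = 8
print(instance_space_size([10] * 38))   # 38 ten-valued attributes: 10**38
```
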
Different Types of Data
Symbolic attributes call mainly for combinatorial
search in hypothesis spaces (machine learning);
numerical attributes call mainly for matrix-based
computation (multivariate data analysis).

• No structure → Nominal (categorical), symbolic; e.g., Places, Color
• Ordinal structure → Ordinal (rank, resemblance); e.g., Age, Temperature, Taste
• Ring structure → Measurable, numerical; e.g., Income, Length
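The taxonomy is operational: each structure level licenses different computations. A minimal sketch (the example values are illustrative, not from the slides):

```python
# What each attribute type licenses:
#   nominal    -> equality tests only
#   ordinal    -> equality and order
#   measurable -> equality, order, and arithmetic (means, distances)

print("red" == "blue")                    # nominal: only == / != are meaningful

taste = ["bland", "mild", "spicy"]        # ordinal: positions can be ranked...
print(taste.index("mild") < taste.index("spicy"))   # True
# ...but the gap between ranks carries no meaning.

incomes = [30_000, 45_000, 60_000]        # measurable: arithmetic is meaningful
print(sum(incomes) / len(incomes))        # 45000.0
```
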
Brief introduction to lectures
Lecture 1: Overview of KDD
Lecture 2: Preparing data
Lecture 3: Decision tree induction
Lecture 4: Mining association rules
Lecture 5: Automatic cluster detection
Lecture 6: Artificial neural networks
Lecture 7: Evaluation of discovered knowledge
Lecture 2: Preparing Data
• As much as 80% of KDD is about preparing
data, and the remaining 20% is about mining
• Content of the lecture
1. Data cleaning
2. Data transformations
3. Data reduction
4. Software and case-studies
• Prerequisite: Nothing special, but some
understanding of statistics is expected
Data Preparation
The design and organization of data, including the
setting of goals and the composition of features,
is done by humans. There are two central goals for
the preparation of data:
• To organize data into a standard form that is
ready for processing by data mining programs.
• To prepare features that lead to the best
data mining performance.
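To make the first goal concrete, here is a minimal sketch of putting raw records into a standard attribute-value form; the cleaning choices (median imputation, min-max scaling) are illustrative assumptions, not prescriptions from the lecture:

```python
# Organize raw records into a standard form: fixed attribute order,
# missing numeric values imputed, numeric values scaled to [0, 1].

raw = [
    {"age": 37, "color": "light"},
    {"age": None, "color": "dark"},   # missing value to clean
    {"age": 52, "color": "light"},
]

ages = sorted(r["age"] for r in raw if r["age"] is not None)
median_age = ages[len(ages) // 2]     # upper median for an even count

lo, hi = min(ages), max(ages)
standard = [
    {
        "age": ((r["age"] if r["age"] is not None else median_age) - lo) / (hi - lo),
        "color": r["color"],
    }
    for r in raw
]
print(standard)
```
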
Brief introduction to lectures
Lecture 1: Overview of KDD
Lecture 2: Preparing data
Lecture 3: Decision tree induction
Lecture 4: Mining association rules
Lecture 5: Automatic cluster detection
Lecture 6: Artificial neural networks
Lecture 7: Evaluation of discovered knowledge
Lecture 3: Decision Tree Induction
• One of the most widely used KDD
classification techniques for supervised data.
• Content of the lecture
1. Decision tree representation and framework
2. Attribute selection
3. Pruning trees
4. Software C4.5, CABRO and case-studies
• Prerequisite: Nothing special
Decision Trees
A decision tree is a classifier in the form of a tree
structure that is either:
• a leaf node, indicating a class of instances, or
• a decision node that specifies some test to be
carried out on a single attribute value, with one
branch and subtree for each possible outcome of the
test
A decision tree can be used to classify an instance by
starting at the root of the tree and moving through it
until a leaf node is reached.
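A minimal sketch of that traversal, with a tree encoded as nested tuples and dicts (an illustrative encoding, not C4.5's or CABRO's); the tree is the flu example from the next slide:

```python
# A decision node is (attribute, {value: subtree}); a leaf is a class label.
tree = ("Temperature", {
    "normal": "no",
    "high": ("Headache", {"yes": "yes", "no": "no"}),
    "very high": ("Headache", {"yes": "yes", "no": "no"}),
})

def classify(node, instance):
    """Start at the root and follow the branch matching the instance's
    attribute value until a leaf (a class label) is reached."""
    while isinstance(node, tuple):
        attribute, branches = node
        node = branches[instance[attribute]]
    return node

print(classify(tree, {"Temperature": "high", "Headache": "yes"}))  # -> yes
```
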
General Framework of Decision Tree Induction
1. Choose the “best” attribute by a given selection measure
2. Extend the tree by adding a new branch for each value of the chosen attribute
3. Sort the training examples to the leaf nodes
4. If examples are unambiguously classified, Then Stop,
Else Repeat steps 1-4 for the leaf nodes
5. Prune unstable leaf nodes
Training examples:

Example  Headache  Temperature  Flu
e1       yes       normal       no
e2       yes       high         yes
e3       yes       very high    yes
e4       no        normal       no
e5       no        high         no
e6       no        very high    no

Resulting tree:

Temperature?
├─ normal    → {e1, e4}: no
├─ high      → {e2, e5}: Headache?  yes → {e2}: yes;  no → {e5}: no
└─ very high → {e3, e6}: Headache?  yes → {e3}: yes;  no → {e6}: no
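A minimal sketch of step 1 on this table, using information gain as the selection measure (one of several possibilities; see the next slide). Note that gain would actually prefer Headache as the root, while the tree above splits on Temperature first, which is exactly why the choice of measure matters:

```python
import math

# Training examples from the table above: (Headache, Temperature, Flu).
examples = [
    ("yes", "normal",    "no"),
    ("yes", "high",      "yes"),
    ("yes", "very high", "yes"),
    ("no",  "normal",    "no"),
    ("no",  "high",      "no"),
    ("no",  "very high", "no"),
]

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(v) / n) * math.log2(labels.count(v) / n)
                for v in set(labels))

def information_gain(attr_index):
    """Entropy of Flu minus its expected entropy after the split."""
    gain = entropy([e[-1] for e in examples])
    for value in set(e[attr_index] for e in examples):
        subset = [e[-1] for e in examples if e[attr_index] == value]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

print(information_gain(0))  # Headache    ~0.459
print(information_gain(1))  # Temperature ~0.251
```
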
Some Attribute Selection Measures

(Notation: $p_{i\cdot}$ and $p_{\cdot j}$ are the class and attribute-value proportions, $p_{i|j}$ is the proportion of class $i$ among examples with value $j$, $n_{ij}$ are observed counts and $e_{ij}$ expected counts.)

Gain-ratio (Quinlan, C4.5, 1993):
$$\text{Gain-ratio} = \frac{\sum_j p_{\cdot j} \sum_i p_{i|j} \log p_{i|j} \;-\; \sum_i p_{i\cdot} \log p_{i\cdot}}{-\sum_j p_{\cdot j} \log p_{\cdot j}}$$

Gini-index (Breiman, CART, 1984):
$$\text{Gini} = \sum_j p_{\cdot j} \sum_i p_{i|j}^{2} \;-\; \sum_i p_{i\cdot}^{2}$$

$\chi^2$ (statistics):
$$\chi^2 = \sum_i \sum_j \frac{(e_{ij} - n_{ij})^2}{e_{ij}}, \qquad e_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{n_{\cdot\cdot}}$$

R-measure (Ho & Nguyen, CABRO, 1996):
$$R = \sum_j p_{\cdot j}^{2} \,\max_i \{p_{i|j}\}$$
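A hedged sketch of the four measures computed from a contingency table of counts, following the notation above; treat the R-measure line in particular as an assumption of that exact form:

```python
import math

def selection_measures(n):
    """All four measures from a contingency table n[i][j]: the count of
    examples with class i and attribute value j."""
    total = sum(map(sum, n))
    classes, values = range(len(n)), range(len(n[0]))
    p_i = [sum(n[i]) / total for i in classes]                       # p_i.
    p_j = [sum(n[i][j] for i in classes) / total for j in values]    # p_.j
    cond = [[n[i][j] / (p_j[j] * total) for j in values] for i in classes]  # p_i|j

    info = -sum(p * math.log2(p) for p in p_i if p)
    split = -sum(p * math.log2(p) for p in p_j if p)
    gain = info + sum(p_j[j] * cond[i][j] * math.log2(cond[i][j])
                      for i in classes for j in values if cond[i][j])
    gain_ratio = gain / split

    gini = (sum(p_j[j] * sum(cond[i][j] ** 2 for i in classes) for j in values)
            - sum(p ** 2 for p in p_i))

    chi2 = sum((e - n[i][j]) ** 2 / e
               for i in classes for j in values
               for e in [sum(n[i]) * p_j[j]])        # e_ij = n_i. * n_.j / n_..

    r_measure = sum(p_j[j] ** 2 * max(cond[i][j] for i in classes)
                    for j in values)
    return gain_ratio, gini, chi2, r_measure

# Headache vs Flu from the earlier table (rows: Flu yes/no; cols: Headache yes/no).
print(selection_measures([[2, 0], [1, 3]]))
```
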
Avoiding Overfitting
How can we avoid overfitting?
• Stop growing when data split not statistically
significant (pre-pruning)
• Grow full tree, then post-prune (post-pruning)
Two post-pruning techniques
• Reduced-Error Pruning
• Rule Post-Pruning
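A minimal sketch of reduced-error pruning over the nested-tuple trees used earlier (an illustrative simplification; C4.5's rule post-pruning is more involved):

```python
def classify(node, instance):
    """Walk from the root to a leaf; leaves are plain class labels."""
    while isinstance(node, tuple):
        attribute, branches = node
        node = branches[instance[attribute]]
    return node

def prune(node, validation):
    """Reduced-error pruning: bottom-up, replace a subtree with the
    majority class of its validation examples whenever that does not
    lower accuracy on those examples."""
    if not isinstance(node, tuple) or not validation:
        return node
    attribute, branches = node
    node = (attribute, {
        value: prune(subtree,
                     [(x, c) for x, c in validation if x[attribute] == value])
        for value, subtree in branches.items()
    })
    labels = [c for _, c in validation]
    majority = max(set(labels), key=labels.count)
    kept_correct = sum(classify(node, x) == c for x, c in validation)
    leaf_correct = labels.count(majority)
    return majority if leaf_correct >= kept_correct else node

tree = ("Temperature", {
    "normal": "no",
    "high": ("Headache", {"yes": "yes", "no": "no"}),
    "very high": ("Headache", {"yes": "yes", "no": "no"}),
})
# A held-out validation set (illustrative):
validation = [({"Temperature": "normal", "Headache": "yes"}, "no"),
              ({"Temperature": "high", "Headache": "yes"}, "yes"),
              ({"Temperature": "high", "Headache": "no"}, "no")]
print(prune(tree, validation))
```
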
Converting A Tree to Rules
Tree:

outlook?
├─ sunny    → humidity?  high → no;  normal → yes
├─ overcast → yes
└─ rain     → wind?  true → no;  false → yes

Rules:

IF (Outlook = Sunny) and (Humidity = High)
THEN PlayTennis = No
IF (Outlook = Sunny) and (Humidity = Normal)
THEN PlayTennis = Yes
........
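A minimal sketch of the conversion, emitting one rule per root-to-leaf path (same illustrative tree encoding as before):

```python
def tree_to_rules(node, conditions=()):
    """Yield one (antecedent, class) rule per root-to-leaf path."""
    if not isinstance(node, tuple):              # leaf: emit the finished rule
        yield " and ".join(f"({a} = {v})" for a, v in conditions), node
        return
    attribute, branches = node
    for value, subtree in branches.items():
        yield from tree_to_rules(subtree, conditions + ((attribute, value),))

tree = ("Outlook", {
    "Sunny": ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain": ("Wind", {"True": "No", "False": "Yes"}),
})
for antecedent, label in tree_to_rules(tree):
    print(f"IF {antecedent} THEN PlayTennis = {label}")
```
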
CABRO: Decision Tree Induction
CABRO is based on the R-measure, a measure of attribute
dependency stemming from rough set theory.

[Screenshot: a discovered decision tree, with the path
matching an unknown case highlighted along with the class
it assigns to that case.]
Brief introduction to lectures
Lecture 1: Overview of KDD
Lecture 2: Preparing data
Lecture 3: Decision tree induction
Lecture 4: Mining association rules
Lecture 5: Automatic cluster detection
Lecture 6: Artificial neural networks
Lecture 7: Evaluation of discovered knowledge
Lecture 4:
Mining Association Rules
• A new technique and an attractive topic: it
allows discovering important associations among
the items of transactions in a database.
• Content of the lecture
1. Basic Definitions
2. Algorithm Apriori
3. The Basket Analysis Program
• Prerequisite: Nothing special
Measures of Association Rules
There are two measures of rule strength:
Support of (A => B) = [AB] / N, where [AB] is the
number of records containing both A and B, and N is
the number of records in the database.
The support of a rule is the proportion of records
to which the rule applies.
Confidence of (A => B) = [AB] / [A], where [A] is
the number of records containing A.
The confidence of a rule is the proportion of times
the rule is correct.
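A minimal sketch of both measures over a list of transactions; the transactions reproduce the table on the next slide:

```python
def support(itemset, transactions):
    """Proportion of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Proportion of transactions with the antecedent that also
    contain the consequent: support(A u B) / support(A)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(support({"A", "C"}, transactions))        # 0.5
print(confidence({"A"}, {"C"}, transactions))   # 0.5 / 0.75 = 2/3
```
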
Algorithm Apriori
• The task of mining association rules is
mainly to discover strong association rules
(high confidence and strong support) in
large databases.
• Mining association rules is composed
of two steps:
1. Discover the large itemsets, i.e., the
itemsets whose transaction support is above
a predetermined minimum support s.
2. Use the large itemsets to generate
the association rules.

Example (s = 40%):

TID    Items
1000   A, B, C
2000   A, C
3000   A, D
4000   B, E, F

Large itemsets   Support
{A}              75%
{B}              50%
{C}              50%
{A, C}           50%
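A brute-force sketch of step 1 on this table, enumerating every candidate itemset; this is exponential in the number of items, which is what Apriori (next slide) avoids:

```python
from itertools import combinations

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
min_support = 0.4

items = sorted(set().union(*transactions))
large = {}
for k in range(1, len(items) + 1):           # every itemset size
    for candidate in combinations(items, k): # every itemset of that size
        s = sum(set(candidate) <= t for t in transactions) / len(transactions)
        if s >= min_support:
            large[candidate] = s

print(large)  # {('A',): 0.75, ('B',): 0.5, ('C',): 0.5, ('A', 'C'): 0.5}
```
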
Algorithm Apriori: Illustration
(S = 40%, i.e., a minimum count of 2 of the 4 transactions)

Database D:
TID   Items
100   A, C, D
200   B, C, E
300   A, B, C, E
400   B, E

Scan D → C1 (candidate 1-itemsets and counts):
{A}: 2   {B}: 3   {C}: 3   {D}: 1   {E}: 3

L1 (large 1-itemsets):
{A}: 2   {B}: 3   {C}: 3   {E}: 3

C2 (candidates joined from L1):
{A,B}  {A,C}  {A,E}  {B,C}  {B,E}  {C,E}

Scan D → C2 counts:
{A,B}: 1   {A,C}: 2   {A,E}: 1   {B,C}: 2   {B,E}: 3   {C,E}: 2

L2:
{A,C}: 2   {B,C}: 2   {B,E}: 3   {C,E}: 2

C3: {B, C, E}
Scan D → {B, C, E}: 2

L3: {B, C, E}: 2
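A minimal sketch of the level-wise loop illustrated above, with a simplified join step (full Apriori additionally prunes any candidate that has a non-large subset):

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise search: C1 -> L1 -> C2 -> L2 -> ... as illustrated above."""
    items = sorted(set().union(*transactions))
    candidates = {frozenset([i]) for i in items}                     # C1
    large = {}
    while candidates:
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: k for c, k in counts.items() if k >= min_count}  # Lk
        large.update(level)
        # Join step: combine pairs of Lk itemsets that differ in one item.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}                   # Ck+1
        # NOTE: full Apriori also discards any candidate with a subset
        # missing from Lk; omitted here for brevity.
    return large

transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(transactions, min_count=2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

Running it reproduces L1, L2 and L3 from the illustration.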