Data Mining for Information
Retrieval
Chun-Nan Hsu
Institute of Information Science
Academia Sinica, Taipei, TAIWAN
Copyright © 1998 Chun-Nan Hsu, all rights reserved
The formation of the field
“data mining”
Statistics ~1800?
Expert Systems ~1970
Pattern Recognition ~1970
Relational Databases, Triggers ~1980
MIS decision support ~1990
Rule induction / Machine learning ~1980
Knowledge Discovery in Databases (KDD) ~1990
Data Mining ~1995
Taxonomies of data mining
Based on underlying technologies
» decision trees, rule-based, example-based, nonlinear
regression, neural networks, Bayesian networks, rough
sets...
Based on tasks at hand (due to Fayyad et al. 1997)
» classification, regression, clustering, summarization,
dependency modeling, change and deviation detection
Based on data? Formalize these ideas:
» collection of similarities
» time series
» image --- snapshot of a state
» collection of images
Collection of similarities
Characterize classes by generating classifiers
(supervised learning)
Cluster objects into classes (clustering, unsupervised
learning)
Many techniques available, most well understood
Time series
Forecasting, predicting the next (few) states
Characterizing the “trend” to detect changes and
deviations
Usually can be reformulated as a supervised learning
problem
Collection of images
Extracting dependencies, correlations
Example: a collection of shopping lists of
supermarket customers
Example: a collection of symptom lists of patients
taking a new medicine
Techniques
» Association rules
» Bayesian networks and other probabilistic graphical models
Image
Summarization
Key feature extraction
Not much is known
Example: a snapshot of an inventory database
Issue: Consistency of
Machine-generated Rules
[Diagram] Rules are learned/mined/discovered from database state (t). Transactions (insert/delete/update) then produce database state (t+1). Are the rules still consistent with database state (t+1)?
Dealing with Inconsistent Rules
Delete them?
» Simple, but the system might have no rule to use
Modify them?
» Smart, but the system might be busy modifying rules
Learn rules that are unlikely to become inconsistent
» Yes, but how does the system know which rules to learn?
Need a way to measure “likelihood of not becoming
inconsistent” --- Robustness of knowledge
Robustness vs. Predictive Accuracy
Given a rule A → C
Closed-world assumption on databases:
BOTH insertions and deletions can make a rule inconsistent
Robustness of a rule is measured with regard to
entire database states D:
Pr(A → C | D)
Predictive accuracy of a rule is measured with regard
to data tuples d:
Pr(C | A, d)
Definition of Robustness of
knowledge (1)
A rule is robust if it is unlikely that the rule becomes
inconsistent with a database state
Intuitively, this probability can be estimated as
(# of database states consistent with the rule) / (# of possible database states)
However:
» database states are not equally probable
» the # of possible database states is intractably large
Definition of Robustness of
knowledge (2)
A rule is robust given a current database state if
transactions that invalidate the rule are unlikely to be
performed.
Likelihood of database states depends on
» Current database state
» Probability of transactions performed on that state
New definition of robustness is 1 - Pr(t|d)
» t: a transaction that invalidates the rule is performed
» d: current database state
Robustness Estimation
Step 1: Find transactions that invalidate the input rule
Step 2: Decompose the probabilities of invalidating
transactions into local probabilities
Step 3: Estimate local probabilities
Step 1: Find Transactions that
Invalidate the Input Rule
R1: The latitude of a Maltese Geographic location is
greater than or equal to 35.89.
geoloc(_,_,?country,?latitude,_) & (?country = “Malta”) → ?latitude >= 35.89
Transactions that invalidate R1:
» T1: One of the existing tuples of geoloc with its country =
“Malta” is updated such that its latitude < 35.89
» T2: Insert an inconsistent tuple...
» T3: Update a tuple whose latitude < 35.89 so that its country becomes “Malta”
Robust(R1) = 1 - Pr(t|d)
= 1 - (Pr(T1|d) + Pr(T2|d) + Pr(T3|d))
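As an illustration only (not the original system's code), these transaction classes can be written down mechanically from the rule's antecedent and consequent; everything below, including the function name, is a hypothetical Python sketch:

```python
# Hypothetical sketch: enumerate the transaction classes that could invalidate
# rule R1 ("geoloc tuples with country = 'Malta' have latitude >= 35.89").
def invalidating_transactions(country="Malta", min_latitude=35.89):
    return [
        ("T1", "update", f"set latitude of an existing '{country}' tuple "
                         f"to a value below {min_latitude}"),
        ("T2", "insert", f"insert a tuple with country = '{country}' and "
                         f"latitude below {min_latitude}"),
        ("T3", "update", f"change the country of a tuple whose latitude is "
                         f"below {min_latitude} to '{country}'"),
    ]

for tid, kind, description in invalidating_transactions():
    print(tid, kind, "-", description)
```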
Step 2: Decompose the Probabilities
of Invalidating Transactions
A transaction t is decomposed along five questions:
» x1: type of transaction?
» x2: on which relation?
» x3: on which tuple?
» x4: on which attribute?
» x5: what new attribute value?
Pr(t|d) = Pr(x1,x2,x3,x4,x5|d)
        = Pr(x1|d) Pr(x2|x1,d) Pr(x3|x2,x1,d) Pr(x4|x2,x1,d) Pr(x5|x4,x2,x1,d)
        = p1 * p2 * p3 * p4 * p5
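As a minimal sketch under the slide's assumptions, the decomposition is just a product of the five local probabilities once they have been estimated; the function names below are illustrative:

```python
from math import prod

def pr_transaction(p1, p2, p3, p4, p5):
    # Pr(t|d) = Pr(x1|d) * Pr(x2|x1,d) * Pr(x3|x2,x1,d) * Pr(x4|x2,x1,d) * Pr(x5|x4,x2,x1,d)
    return prod([p1, p2, p3, p4, p5])

def robustness(invalidating_transaction_probs):
    # Robustness = 1 - Pr(some invalidating transaction is performed | d)
    return 1.0 - sum(invalidating_transaction_probs)
```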
Step 3: Estimate Local Probabilities
Estimate local probabilities using the Laplace Law of
Succession (Laplace 1820):
(r + 1) / (n + k)
Useful information for robustness estimation:
» transaction log
» expected size of tables
» information about attribute ranges, value distributions
When no information is available, use database
schema information
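The Laplace estimate is straightforward to write down; the helper below is illustrative, with r past occurrences of the outcome of interest, n past trials, and k possible outcomes:

```python
def laplace_estimate(r, n, k):
    """Laplace Law of Succession: (r + 1) / (n + k)."""
    return (r + 1) / (n + k)

# With an empty transaction log (r = 0, n = 0) and k = 3 transaction types
# (insert, delete, update), "the next transaction is an update" gets the
# uniform fallback 1/3 used in the example that follows.
print(laplace_estimate(0, 0, 3))  # 0.333...
```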
Example of Robustness Estimation
R1: geoloc(_,_,?country,?latitude,_) & (?country = “Malta”) → ?latitude >= 35.89
T1: One of the existing tuples of geoloc with its country = “Malta” is
updated such that its latitude < 35.89
» p1: update? 1/3 = 0.33
» p2: geoloc? 1/2 = 0.50
» p3: geoloc, country = “Malta”? 4/80 = 0.05
» p4: geoloc, latitude to be updated? 1/5 = 0.20
» p5: latitude updated to < 35.89? 1/2 = 0.50
Pr(T1|d) = p1 * p2 * p3 * p4 * p5 ≈ 0.0008
Pr(T2|d) and Pr(T3|d) can be estimated similarly
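Redoing the arithmetic with the local probabilities listed above (illustrative code; the product of these particular values is about 0.0008):

```python
p1, p2, p3, p4, p5 = 1/3, 1/2, 4/80, 1/5, 1/2
pr_t1 = p1 * p2 * p3 * p4 * p5
print(f"Pr(T1|d) ~= {pr_t1:.6f}")  # ~0.000833
# With Pr(T2|d) and Pr(T3|d) estimated the same way,
# Robust(R1) = 1 - (Pr(T1|d) + Pr(T2|d) + Pr(T3|d)).
```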
Example (cont.): When additional
information is available
Naive
» p1: update? 1/3 = 0.33
Laplace
» p1: update? (# of previous updates + 1) / (# of previous transactions + 3)
m-Probability (Cestnik & Bratko 1991)
» p1: update? (# of previous updates + m * Pr(U)) / (# of previous transactions + m)
» m is an expected number of future transactions
» Pr(U) is a prior probability of updates
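A sketch of the m-probability estimate as given above, with a hypothetical transaction log for comparison against the Laplace estimate (the log numbers are made up):

```python
def m_probability(updates, transactions, m, prior_update):
    """m-probability (Cestnik & Bratko 1991):
    (# previous updates + m * Pr(U)) / (# previous transactions + m)."""
    return (updates + m * prior_update) / (transactions + m)

# Hypothetical log: 10 of 40 previous transactions were updates.
print(m_probability(10, 40, m=20, prior_update=1/3))  # blends the log with the prior
print((10 + 1) / (40 + 3))                            # Laplace estimate, for comparison
```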
Applying Robustness Estimation
Robustness may not be the only desirable property of
target rules
Need to combine robustness and other utility
measures to guide learning
» Tautologies are the most robust
Using many measures to guide rule generation could
be difficult
Pruning Rule Literals with
Robustness Estimation
Use existing algorithms to generate rules
Prune literals of an output rule based on its
applicability and estimated robustness
Example:
if wharf in Malta, depth < 50ft, with one or more crane → its length > 1200ft
» shortest rule consistent with the database:
if wharf in Malta → its length > 1200ft
» the most robust:
if wharf in Malta with one or more crane → its length > 1200ft
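A minimal sketch of the pruning idea under stated assumptions: the consistency check and robustness estimator are passed in as callables (placeholders for the machinery described earlier), and antecedent literals are dropped greedily as long as the rule stays consistent and its estimated robustness does not decrease. This is an illustration, not the paper's algorithm:

```python
def prune_literals(antecedents, consequent, is_consistent, robustness):
    """Greedily drop antecedent literals while the rule remains consistent
    with the database and its estimated robustness does not decrease."""
    current = list(antecedents)
    best = robustness(current, consequent)
    improved = True
    while improved:
        improved = False
        for literal in list(current):
            candidate = [l for l in current if l != literal]
            if is_consistent(candidate, consequent) and \
               robustness(candidate, consequent) >= best:
                current = candidate
                best = robustness(candidate, consequent)
                improved = True
                break
    return current
```

In the wharf example, dropping the depth literal would be one such step if the shorter rule stays consistent and at least as robust.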
Applications
Learning rules for Semantic Query Optimization (Hsu &
Knoblock ML94, Siegel Boston U. thesis 89, Shekhar et al. IEEE TKDE 94)
Learning functional dependencies (Mannila & Raiha KDD94)
Discovering models to reconcile/integrate
heterogeneous databases (Dao, Son et al. KDD95)
Learning to answer intensional queries (Chu et al. 91)
Discovering knowledge for decision support
Summary
Data mining from an “image” (a snapshot of a database state) requires estimating the
robustness of the extracted knowledge
Robustness can be defined based on the probability
of invalidating transactions
Robustness can be estimated efficiently
Rule pruning guided by robustness and other utility
measures may yield robust and interesting rules
Discovering robust knowledge to enhance database
functionalities
Data Mining for IR?
Different tasks need different ways to collect and
prepare data
Data preparation and cleaning are important
Data Mining for IR? Issues
Potential Applications
» text categorization (a.k.a. classification, routing, filtering)
» fact extraction (a.k.a. template filling)
» clustering
» text summarization (a.k.a. abstracting, gisting)
» user profiling and modeling
» interactive query formulation
Issues
» scaling up to large volumes of data
» feature selection (a.k.a. dimensionality reduction)
Projects
Recent projects
» Template filling --- inducing information extractors from
labeled semi-structured documents (J of Info Systems, 1999)
» Feature Selection --- feature selection for backprop neural
network (IEEE Tools with AI, 1998)
(to-be-proposed) projects
» Alias-mining for digital library (NSC)
» Classifying NL diagnosis records to ICD-9-CM coding (NHI)
More projects to come… collaboration proposals are most welcome!