Green Data Mining: Privacy-Oriented Data Mining by Proof Checking

Download Report

Transcript Green Data Mining: Privacy-Oriented Data Mining by Proof Checking

Text Mining
the technology to convert text
into knowledge
Stan Matwin
School of Information Technology and Engineering
University of Ottawa
Canada
[email protected]
Plan
•
•
•
•
What?
Why?
How?
Who?
codata 2002
2
What?
• Text Mining (TM) = Data Mining from
textual data
• Finding nuggets in otherwise
uninteresting mountains of ore
• DM = finding interesting knowledge
(relationships, facts) in large amounts of
data
codata 2002
3
What cnt’d
•
•
•
•
•
•
Working with large corpora
…and little knowledge
Discovering new knowledge
… e.g. in Grimm’s fairy tales
vs uncovering of existing knowledge
…e.g. find mySQL developers with  1yr
experience in a file of 5000 CVs
• Has to treat data as NL
codata 2002
4
What? Cnt’d
•
•
•
•
Uncovering aspect of TM
TM = Information Extraction from Text
Text -> Data Base mapping
TM and XML
codata 2002
5
Examples
• Extracting information from CVs: skills,
systems, technologies etc
• Personal news filtering agent
• Research in functional genomics about
protein interaction
codata 2002
6
Why?
• Moore’s law, and…
• Storage law
codata 2002
7
How?
A combination of
– Machine learning
– Linguistic analysis
•
•
•
•
Stemming
Tagging
Parsing
Semantic analysis
codata 2002
8
Some TM-related tasks
•
•
•
•
•
Text segmentation
Topic identification and tracking
Text summarization
Language identification
Author identification
codata 2002
9
Two case studies
• CADERIGE
• Spam detection (with AmikaNow)
codata 2002 10
Caderige
« Catégorisation Automatique de Documents pour
l'Extraction de Réseaux d'Interactions Géniques »
Knowledge extraction from Natural
Language texts
Subset of
Potentially relevant abstracts
Keyword search
search
tir
MedLine
Filled-out
forms
Query in NL
Caderige
Data or knowledge
base
Contrôle
Agent
: sigma E
ObjetContrôle
: expression du
gèneAgent
dacBContrôle: sigma E
Objet
: expression
du
Justification
: homologie
Agent
: sigma E
gèneObjet
dacB
: expression du
Justification
: homologie
gène dacB
Justification: homologie
codata 2002 11
Caderige
• Objective: to extract information of
interest to geneticists from on-line
bastract and/or paper databases (e.g.
Medline)
• Ensure acceptable recall and precision
codata 2002 12
The araR gene is monocistronic, and the promoter region contains -10
and -35 regions (as determind by primer extension analysis) similar to
those recognized by RNA polymerase containing the major vegetative
cell sigma factor sigmaA. An insertion-deletion mutation in the araR
gene leads to constitutive expression of the L-arabinose metabolic
operaon. We demonstrate that the araR gene codes for a negative
regulator of the ara operon and that the expression of araR is repressed
by its own product.
The fragment (it.) can be selected by means of keywords
codata 2002 13
It has been proposed that Pho-P plays a key
role in the activation of tuA and in the
repression of tagA and tagD.
"What are the proteins involved in the
regulation of tagA?”
This question cannot be answered with keywords alone; semantic
knowledge that repression is a type of regulation is req’d
codata 2002 14
After determination of the nucleotide sequence
and deduction of the purR reading frame, the
PurR product was found to be highly similar
to the purR-encoded repressor from Bacillus
subtilis.
does not answer
"What are the proteins involved in the
regulation of purR?",
In fact, parsing is needed to see that PurR and purRencoded Repressor are objects of the verb to be similar
codata 2002 15
Conceptual interpretation is needed to see that
RNA isolated from a sigma B deletion mutant revealed that the
transcription of gspA is sigmaB dependent.
is an answer to
"What are the proteins involved in the regulation of gspA
gspA is sigmaB dependent is interpreted as
protein sigmaB regulates gspA
codata 2002 16
CADERIGE Architecture
Extraction
labeling
acquisition
of linguistic
resources
by
text mining
MedLine abstracts
Linguistic resources
- Thesaurus
-extraction grammars
- fragment selectors
-•••
query
fragment
selection
index
extraction using
extr.
gragrammar
conceptual
s
normalization
normalization
Query-text
matching
Forms
codata 2002 17
3 steps
1. Focusing: learned filters
2. Linguistic Analysis:
lexicalsyntactic/semantic
• Syntax-semantics mapping
3. Extraction
codata 2002 18
Caderige: example
POSITIVE INTERACTION
(
[NP
<protein>
1
<Protein>
)
1
]
</protein>
Subject( $2 ,$1 )
Dobj( $2 ,$3 )
</interaction>
<interaction>
(
[Ver b
2
Semantic Class :
Positive interaction
)
2
(
] <gene_ [ NP
3
expression>
<Gene
expression>
)
3
]
</gene_
expression>
codata 2002 19
Current stage
• 1 done
• XML for 3 designed
• Tools for 2 chosen
codata 2002 20
Email filters
•
•
•
•
Spam elimination
Automatic filing
Compliance enforcement
….
codata 2002 21
Email…
• The trick: cast it as a text classification
problem
• Build a training set
• train your favouritre classifier
• Deploy it
codata 2002 22
State of the art
• Current accuracy 80%
codata 2002 23
Difficulties
•
•
•
•
multi-class problem where
classes overlap
and are hierarchical
recall vs precision
codata 2002 24
TM: who – academically?
•
•
•
•
•
•
David Lewis
Yimin Yang – CMU
Ray Mooney - UT Austin
Nick Cercone - Waterloo
Guy Lapalme – U. de Montréal
TAMALE - University of Ottawa
codata 2002 25
Who – industrially?
• Google
• Clearforest
• AmikaNow
codata 2002 26
Conclusion
• Text mining – a necessity (so “!” instead
of “?”)
• Still in its infancy
• Methods must exploit linguistic
knowledge
codata 2002 27
Classification
• Prevalent practice:
examples are represented as vectors of
values of attributes
• Theoretical wisdom,
confirmed empirically: the more examples,
the better predictive accuracy
codata 2002 28
ML/DM at U of O
• Learning from imbalanced classes:
applications in remote sensing
• a relational, rather than propositional
representation: learning the maintainability
concept
• Learning in the presence of background
knowledge. Bayesian belief networks and how
to get them. Appl to distributed DB
codata 2002 29
Why text classification?
•
•
•
•
•
Automatic file saving
Internet filters
Recommenders
Information extraction
…
codata 2002 30
Text classification: standard approach
1. Remove stop words and markings
2. remaining words are all attributes
3. A document becomes a vector <word,
frequency>
4. Train a boolean classifier for each
class
5. Evaluate the results on an unseen
sample
codata 2002 31
Text classification: tools
• RIPPER
A rule-based learner
Works well with large sets of binary features
• Naïve Bayes
Efficient (no search)
Simple to program
Gives “degree of belief”
codata 2002 32
“Prior art”
• Yang: best results using k-NN: 82.3%
microaveraged accuracy
• Joachim’s results using Support Vector
Machine + unlabelled data
• SVM insensitive to high dimensionality,
sparseness of examples
codata 2002 33
SVM in Text classification
SVM
Transductive SVM
Maximum separation
Margin for test set
Training with 17 examples in 10
most frequent categories gives test
performance of 60% on 3000+ test
cases available during training
codata 2002 34
Combining classifiers
#
1
3
5
Reuters
representations
NP
BW, NP, NPS
BW, NP, NPS, KP, KPS
b.e.
.827
.845
.849
DigiTrad
representations
BWS
BW, BWS, NP
BW, BWS, NP, KPS, KP
b.e.
.360
.404e
.422e
Comparable to best known results (Yang)
codata 2002 35