NLP for Text Mining
Towards systems capable of
extracting semantic
information from texts
Presentation by: Tiago Vaz Maia
Introduction



- Keyword models of text are very poor (e.g. searching with Google).
- There is great advantage to a system that ‘understands’ texts, at some level.
- Need for semantic understanding.
CRYSTAL (UMass)

CRYSTAL: Inducing a Conceptual Dictionary
S. Soderland, D. Fisher, J. Aseltine, W. Lehnert
In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, 1995.
Information Extraction Systems


- Generally domain-specific (e.g. medical records, financial news).
- Work by having “a dictionary of linguistic patterns that can be used to identify references to relevant information in a text”. Op. Cit.
A ‘Concept-Node’ Definition
CN-type: Sign or Symptom
Subtype: Absent
Extract from Direct Object
Active voice verb
Subject constraints:
words include “PATIENT”
head class: <Patient or Disabled Group>
Verb constraints:
words include “DENIES”
Direct Object constraints:
head class <Sign or Symptom>
Op. Cit.
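To make this structure concrete, here is a minimal sketch of how such a concept-node might be represented in code. The class and field names are illustrative inventions, not CRYSTAL's actual data structures.

```python
from dataclasses import dataclass, field

@dataclass
class ConceptNode:
    """Illustrative representation of a CRYSTAL-style concept-node."""
    cn_type: str       # e.g. "Sign or Symptom"
    subtype: str       # e.g. "Absent"
    extract_from: str  # constituent whose text is extracted, e.g. "direct object"
    voice: str = "active"
    # One entry per constituent ("subject", "verb", "direct object", ...),
    # each holding exact-word constraints and semantic-class constraints.
    constraints: dict = field(default_factory=dict)

# The concept-node shown above, encoded with this sketch:
denies_cn = ConceptNode(
    cn_type="Sign or Symptom",
    subtype="Absent",
    extract_from="direct object",
    constraints={
        "subject": {"words": {"PATIENT"}, "classes": {"Patient or Disabled Group"}},
        "verb": {"words": {"DENIES"}, "classes": set()},
        "direct object": {"words": set(), "classes": {"Sign or Symptom"}},
    },
)
```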
Rationale for CRYSTAL


Building domain-specific dictionaries is
time-consuming (knowledge-engineering
bottleneck).
CRYSTAL builds such dictionaries
automatically, from annotated texts
(supervised learning).
Annotation of Texts


E.g. “Unremarkable with the exception of
mild shortness of breath and chronically
swollen ankles”.
Domain expert marks “shortness of
breath” and “swollen ankles” with CN type
“sign or symptom” and subtype “present”.
(Example from Op. Cit.)
CRYSTAL’s Output


A dictionary of information extraction rules
(i.e. concept-nodes) specific for the
domain.
These rules should be general enough to
apply to other texts in the same domain.
Algorithms

Next time…

Five minutes go by very fast!
Conclusions


Domain-specific information extraction
systems capable of semantic
understanding are within reach of current
technology.
CRYSTAL makes such systems scalable
and easily portable by automating the
process of dictionary construction.
CRYSTAL’s Dictionary
Induction Algorithms
Tiago V. Maia
Review: Information Extraction



Information Extraction Systems extract
useful information from free text.
This information can, for example, be
stored in a database, where it can be
data-mined, etc.
E.g., from hospital discharge reports we
may want to extract:
Review: Information Extraction

- Name of patient.
- Diagnosis.
- Symptoms.
- Prescribed treatments.
- Etc.
Review: Dictionary of Concept-Nodes


In the UMass system extraction rules are
stored in concept-nodes.
The set of all concept-nodes is called a
dictionary.
A ‘Concept-Node’ Definition
CN-type: Sign or Symptom
Subtype: Absent
Extract from Direct Object
Active voice verb
Subject constraints:
words include “PATIENT”
head class: <Patient or Disabled Group>
Verb constraints:
words include “DENIES”
Direct Object constraints:
head class <Sign or Symptom>
Op. Cit.
Review: CRYSTAL

CRYSTAL automatically induces a
domain-specific dictionary, from an
annotated training corpus.
Learning Algorithm

Two steps:
1. Create one concept-node per positive training instance.
2. Gradually merge different concept-nodes to achieve a more general and compact representation.
Constructing Initial CN’s

E.g. of step 1:
- Sentence: “Unremarkable with the exception of mild shortness of breath and chronically swollen ankles”.
- Annotation: “shortness of breath” and “swollen ankles” are marked with CN type “sign or symptom”, subtype “present”.
Initial CN Definition
CN-type: Sign or Symptom
Subtype: Present
Extract from Prep. Phrase “WITH”
Verb = <NULL>
Subject constraints:
words include “UNREMARKABLE”
Prep. Phrase constraints:
words include “THE EXCEPTION OF MILD SHORTNESS OF BREATH AND CHRONICALLY SWOLLEN ANKLES”
head class <Sign or Symptom>
Op. Cit.
Need for Induction


Initial concept-nodes are too specific to
be useful for any texts other than the
training corpus.
One needs an inductive step, capable of constructing more general definitions → step 2.
Inducing General CN’s



Main idea: Merge sufficiently similar CN’s,
until doing more merges starts generating
too many errors.
How do we merge similar CN’s?
The goal is to obtain a general CN that
‘covers’ both CN’s and provides a good
generalization for unseen cases.
Merging CN’s

The unification of two CN’s is found by relaxing the constraints in such a way that they cover both nodes.
- Word constraints: intersection of the word constraints from each CN. E.g.:
Verb constraints:
- “vehemently denies”
- “denies categorically”
→ “denies”
Merging CN’s

- Semantic class constraints: found by moving up the semantic hierarchy. E.g.:
Prep. phrase constraints:
- head class <Sign or Symptom>
- head class <Lab or Test Result>
→ head class <Finding>,
if in the semantic hierarchy we have:
Semantic Hierarchy

Finding
    Sign or Symptom
    Lab or Test Result
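A minimal sketch of these two relaxation operations, assuming a toy child-to-parent hierarchy and set-valued constraints; neither is CRYSTAL's actual implementation.

```python
# Toy semantic hierarchy: child class -> parent class.
HIERARCHY = {
    "Sign or Symptom": "Finding",
    "Lab or Test Result": "Finding",
    "Finding": None,  # root
}

def relax_word_constraints(words_a, words_b):
    """Word constraints are unified by intersection: only words appearing in both
    CNs survive, e.g. {"vehemently", "denies"} and {"denies", "categorically"}
    relax to {"denies"}."""
    return set(words_a) & set(words_b)

def ancestors(cls):
    """Return the class itself plus all of its ancestors, nearest first."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = HIERARCHY.get(cls)
    return chain

def relax_class_constraint(class_a, class_b):
    """Semantic-class constraints are unified by moving up the hierarchy to the
    lowest class covering both, e.g. <Sign or Symptom> and <Lab or Test Result>
    relax to <Finding>."""
    ancestors_b = set(ancestors(class_b))
    for cls in ancestors(class_a):
        if cls in ancestors_b:
            return cls
    return None  # no common ancestor

print(relax_word_constraints({"vehemently", "denies"}, {"denies", "categorically"}))
print(relax_class_constraint("Sign or Symptom", "Lab or Test Result"))
```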
Evaluating Merges


Every merged CN is tested against the
training corpus. If its error rate is above a
certain threshold, it is discarded.
The system continues merging CN’s until
no more can be merged without resulting
in a CN whose error rate exceeds the prespecified tolerance.
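Putting the two steps together, the overall induction loop might look roughly as follows. The similarity measure, the unification routine, and the error test are left as placeholders, and the control flow is a simplification of the loop described in the paper.

```python
def induce_dictionary(initial_cns, most_similar, unify, error_rate, tolerance=0.2):
    """Sketch of the induction loop: repeatedly unify similar concept-nodes and
    keep a generalization only while it stays below the error tolerance on the
    training corpus.  `most_similar`, `unify`, and `error_rate` are placeholders
    for the operations sketched earlier; `tolerance` is an illustrative value."""
    dictionary = list(initial_cns)
    while True:
        pair = most_similar(dictionary)      # pick the two closest CNs, or None if none left
        if pair is None:
            return dictionary
        cn_a, cn_b = pair
        merged = unify(cn_a, cn_b)           # relax constraints so that both are covered
        if error_rate(merged) > tolerance:   # too many false extractions: stop generalizing
            return dictionary
        dictionary.remove(cn_a)
        dictionary.remove(cn_b)
        dictionary.append(merged)
```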
Results



- MUC-3: dictionary built using 1500 hours of work by two advanced graduate students and one post-doc.
- MUC-4: using AutoSlog, a precursor of CRYSTAL, the dictionary was built with 8 hours of work by a first-year graduate student!
- Both dictionaries provided roughly the same functionality.
Conclusions


- Automated induction of domain-specific information extraction dictionaries is a very good alternative to hand-coding.
- The knowledge engineering effort is drastically reduced, allowing for widespread real-world applications.
Combining Information
Extraction and Data Mining
Tiago V. Maia
“Using Information Extraction to Aid the
Discovery of Prediction Rules from Text”
U.Y. Nahm and R.J. Mooney
In: KDD-2000 Workshop on Text Mining
An Approach to Text Mining

Text → IE → DB → KDD → Rules
The Application


Step 1. Starting from free-text job
postings in a newsgroup, build a
database of jobs.
Step 2. Mine that database, to find
interesting rules.
Sample Job Posting

Sample job posting:
“Leading Internet Provider using cutting edge web technology in Austin is accepting applications for a Senior Software Developer. The candidate must have 5 years of software development, which includes coding in C/C++ and experience with databases (Oracle, Sybase, Informix, etc.)…”
Sample Job Record

- Title: Senior Software Developer
- Salary: $70-85K
- City: Austin
- Language: Perl, C, Javascript, Java, C++
- Platform: Windows
- Application: Oracle, Informix, Sybase
- Area: RDBMS, Internet, Intranet, E-commerce
- Required years of experience: 5
- Required degree: BS
Sample Extracted Rule

“If a computer-related job requires
knowledge of Java and graphics then it
also requires knowledge of Photoshop”
Information Extraction

Uses RAPIER: a system similar to CRYSTAL that also constructs the extraction rules automatically from an annotated training corpus.
Rule Induction

- The induced rules predict the value in a database field, given the values in the rest of the record.
- Each slot-value pair is treated as a distinct binary feature. E.g., Oracle ∈ Application.
- An example of an induced rule:
  HTML ∈ Language ∧ Windows NT ∈ Platform ∧ Active Server Pages ∈ Application → Database ∈ Area
Algorithms for Rule Induction


Uses C4.5.
Decision trees are learned using the
binary representation for slot-value pairs,
and pruned to yield the rules.
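C4.5 itself is not available as a standard Python package, so the sketch below approximates this step with scikit-learn's CART-style decision tree over binary slot-value features; the tiny data set is invented purely for illustration, and the tree-to-rule conversion is omitted.

```python
# Approximation of the rule-induction step with scikit-learn (CART rather than C4.5).
from sklearn.tree import DecisionTreeClassifier, export_text

# Binary features: does the posting mention Java? graphics? HTML?
feature_names = ["Java ∈ Language", "graphics ∈ Area", "HTML ∈ Language"]
X = [
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 0],
]
# Target: does the posting also require Photoshop? (the slot-value pair being predicted)
y = [1, 1, 0, 0, 0]

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=feature_names))
```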
Conclusions


Text mining can be achieved by
information extraction followed by the
application of standard KDD techniques
to the resulting structured database.
Both IE and KDD are well understood,
and their combination should yield
practical real-world systems.
Learning Probabilistic
Relational Models
Getoor, L., Friedman, N., Koller, D., and Pfeffer, A.
Invited contribution to the book
Relational Data Mining,
Dzeroski, S. and Lavrac, N. (Eds.),
Springer-Verlag, 2001.
Applicability

Text → IE → DB → KDD → JPD
Probabilistic Relational
Models

Probabilistic relational models are a
marriage of:
1. Probabilistic graphical models, and
2. Relational models.
Why a Probabilistic Model?



Probabilistic graphical models (e.g.
Bayesian networks) have proven very
successful for representing statistical
patterns in data.
Algorithms have been developed for
learning such models from data.
Because what is learnt is a joint
probability distribution, these models are
not restricted to answering questions
about specific attributes.
Why a Relational Model?



Typically, Bayesian networks (or graphical
models in general) use a representation
that consists of several variables (e.g.
height, weight, etc.) but has no structure.
The most common way of structuring data
is in a relational form (e.g. relational
databases).
Data structured in this way consists of
individuals, their attributes, and relations
between individuals (e.g. database of
students, classes, etc.).
Probabilistic Relational
Models


Probabilistic relational models are a
natural way to represent and learn
statistical regularities in structured
information.
Moreover, because they are close to
relational databases, they are ideal for
data mining.
Introduction to Bayes Nets


The problem with using joint probability
distributions is their exponential character
in the general case.
E.g., assume that there are four random variables:
- Student Intelligence, Course Difficulty, Student Understands Material, Student Grade.
- Assume Student Intelligence, Course Difficulty, and Student Understands Material have three possible values: {low, medium, high}.
Exponential Complexity of
JPDs

- Further assume that the Student Grade has six possible values: {A, B, C, D, E, F}.
- Then, to have a joint probability distribution, we need to specify (or learn) 3 × 3 × 3 × 6 = 162 values.
- Imagine if we had hundreds of variables...
Independence Assumptions
in Bayes Nets


Bayes nets help because they exploit the
fact that each variable typically depends
directly only on a small number of other
variables.
In our example, we have:
Example of a Bayes Net

[Diagram: Intelligence → Understands Material ← Difficulty; Understands Material → Grade.]
Conditional Independence


That is, for example, the difficulty of the test is independent of the intelligence of the student.
Also, importantly, the grade is independent of the intelligence of the student or the difficulty of the test, given the student’s understanding of the material → conditional independence.
Conditional Independence

Formally, we have:
- Every node X is independent of every other node that does not descend from X, given the values of X’s parents.
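In symbols (notation mine, not from the slides), this local Markov property reads:

```latex
P\bigl(X \mid \mathrm{Parents}(X),\, \mathrm{NonDescendants}(X)\bigr)
  = P\bigl(X \mid \mathrm{Parents}(X)\bigr)
```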
Conditional Probability
Distributions (CPDs)

Associated with each node is a
conditional probability distribution, which
specifies the probability distributions for
the node given the values for the parents.
CPD for the Understanding Node (U = Understanding; rows indexed by the parents Intelligence and Difficulty)

Intelligence  Difficulty  P(U=low)  P(U=medium)  P(U=high)
low           low         0.20      0.60         0.20
low           medium      0.45      0.40         0.15
low           high        0.75      0.20         0.05
medium        low         0.15      0.30         0.55
medium        medium      0.20      0.60         0.20
medium        high        0.45      0.40         0.15
high          low         0.01      0.19         0.80
high          medium      0.10      0.35         0.55
high          high        0.20      0.60         0.20
Conditional Independence
Savings in Complexity

For our example, we need to specify the following values:
- 3 for the probabilities of the 3 possible values of Intelligence.
- 3 for the probabilities of the 3 possible values of Difficulty.
- 27 for the CPD of the Understanding node (3 × 3 parent combinations × 3 values).
- 18 for the CPD of the Grade node (3 parent values × 6 grades).
- Total: 51 (instead of 162).
Conditional Independence
Savings in Complexity (cont.)

For cases where there are hundreds or
thousands of variables, the fact that each
variable will depend only on a small
number of other variables will entail
significantly more dramatic savings.
Joint Probability Distribution

Given the CPDs, we can calculate the
JPD via the Chain Rule for Bayesian
Networks. In our example, that is:
P(Intelligence, Difficulty, Understanding, Grade) =
P(Intelligence) × P(Difficulty) × P(Understanding | Intelligence, Difficulty) × P(Grade | Understanding)
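As a concrete illustration, here is one entry of the joint computed with the chain rule; the Understanding probability comes from the CPD table above, while the priors and the Grade probability are invented for the example.

```python
# Illustrative numbers: the Understanding entry is from the CPD table above;
# the priors and the Grade entry are made up for this example.
p_intelligence_high = 1.0 / 3          # assumed uniform prior
p_difficulty_low = 1.0 / 3             # assumed uniform prior
p_understanding_high = 0.80            # P(U=high | Intelligence=high, Difficulty=low)
p_grade_a = 0.70                       # P(Grade=A | U=high), made up

# Chain rule for this network:
# P(I, D, U, G) = P(I) * P(D) * P(U | I, D) * P(G | U)
joint = p_intelligence_high * p_difficulty_low * p_understanding_high * p_grade_a
print(joint)   # ≈ 0.062
```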
Queries

Once we have the joint probability
distribution, we can answer any queries
about probability distributions of variables,
given any set of observed variables.
Bayesian Network
Classifiers
Friedman, N., Geiger, D., and Goldszmidt, M.
In: Machine Learning, 29:131-163, 1997.
Presentation by: Tiago Vaz Maia
Learning Bayes Nets

The problem:
- Given a training set (example j assigns values x1j, …, xMj to the M variables):
  x11, x21, …, xM1
  x12, x22, …, xM2
  …
  x1N, x2N, …, xMN
- Find the network that best matches this training set.
Learning Bayes Nets (cont.)

Two separable problems:
- Learning the structure of the network.
- Learning the parameters.

Learning the parameters is easy; the hard part is learning the structure.
Learning the Structure of
Bayes Nets


The general approach is to have a score
that evaluates the match between the
network and the training set, and to
search for the best matching network.
We will be concentrating on a score
based on minimal description length
(MDL).
MDL

MDL(B | D) = (log N / 2) · |B| − LL(B | D),

where:
- B is a Bayes net.
- D is the training set.
- N is the number of training examples.
- |B| is the number of parameters in B.
- LL(B | D) is the log-likelihood of the data D under B.

The goal is to find the network with minimum MDL.
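A rough sketch of how this score could be computed for a fully specified network; the network representation and the parameter counting (one parameter per CPD entry) are my own simplifications, and natural logarithms are used where the slide just says “log”.

```python
import math

def mdl_score(network, data):
    """MDL(B|D) = (log N / 2) * |B| - LL(B|D), for a toy network representation.

    `network` maps each variable to a dict with:
      "parents": list of parent variable names
      "cpd":     dict mapping (value, parent_value_tuple) -> probability
    `data` is a list of dicts mapping variable name -> observed value.
    """
    n = len(data)
    num_params = sum(len(node["cpd"]) for node in network.values())
    log_likelihood = 0.0
    for example in data:
        for var, node in network.items():
            parent_vals = tuple(example[p] for p in node["parents"])
            log_likelihood += math.log(node["cpd"][(example[var], parent_vals)])
    return (math.log(n) / 2) * num_params - log_likelihood

# Tiny made-up example: a single binary variable X with no parents.
network = {"X": {"parents": [], "cpd": {(1, ()): 0.75, (0, ()): 0.25}}}
data = [{"X": 1}, {"X": 1}, {"X": 1}, {"X": 0}]
print(mdl_score(network, data))
```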
MDL (cont.)



The first term represents the size of the
network, and forces the network to be
small (avoid overfitting).
The second term measures how closely B
models the probability distribution in data
D.
The combination of the two balances a good fit to the data against keeping few enough parameters that there is good generalization ability.
Search


Search is based on a greedy strategy.
Initially, we start with the empty network,
and successively apply local operations
that maximally improve the network’s
MDL, until we hit a local minimum.
The local operations are:
- Arc addition, arc deletion, and arc reversal.
Bayes Nets as Classifiers



Given a training set <A1, …, An, C>, one can use the learning technique just described to learn a Bayes network that, unlike naïve Bayes, does not rely on strong independence assumptions.
However, in cases where there are many
attributes (e.g. over 30), this approach
actually performs worse than naïve
Bayes.
Why?
Performance of Bayesian
Network Classifiers



The problem is in the definition of the
MDL score.
It is possible that a network might have a
very good MDL score and yet perform
poorly as a classifier.
Roughly speaking, the network is trying to fit all the data (including the attributes), rather than just trying to estimate P(C | A1, …, An) as accurately as possible, which is what is relevant for classification.
Tree-Augmented
Naïve Bayesian
Classifiers
(Continuation of Friedman et al.)
Presentation by: Tiago Vaz Maia
Introduction


Last time: General Bayesian networks
often perform worse than Naïve Bayes in
classification problems with many
attributes.
Today: An extension to Naïve Bayes that
allows edges between attributes, is
computationally tractable, and
outperforms both Naïve Bayes and
general Bayes nets in classification
problems.
Augmented Naïve Bayesian
Networks


The problem with the general Bayes nets
is that the structure of the classification
problem is not being exploited. In
classification, the class should be treated
differently from the attributes.
Idea: Maintain the structure of Naïve
Bayes, but allow edges between the
attributes.
Augmented Naïve Bayes
Nets (cont.)

E.g.:
[Diagram: class node C with edges to every attribute A1, A2, …, An, plus additional edges between the attributes.]
Augmented Naïve Bayes
(cont.)


Problem: Any combination of edges
between the attributes is possible, so this
is computationally intractable.
Solution: Allow only a subset of possible
edge combinations. Namely, the edges
between the attributes must form a tree.
Tree-Augmented Naïve
Bayes Nets (TAN)

E.g.:
[Diagram: an example TAN whose nodes include Document type, Professor, Publications, Course, Grade, and Student.]
Tree-Augmented Naïve
Bayes Nets


Obviously not ideal from the point of view
of expressiveness.
However, more expressive than Naïve Bayes, and computationally tractable:
- There exists a polynomial-time algorithm for finding the optimal tree structure between the attributes.
Learning Tree Bayes Nets
from Data

1. Compute the mutual information between every pair of variables: I(X;Y).
2. Build a complete undirected graph in which the vertices are the variables, and annotate the edge connecting X and Y with the weight I(X;Y).
3. Build a maximum-weight spanning tree.
4. Transform the resulting undirected tree into a directed one by choosing a root variable and setting the direction of all edges to be outward from it.
(A code sketch of this procedure follows.)
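A compact sketch of the four steps above (essentially the Chow-Liu procedure), assuming the networkx library is available for the spanning-tree and traversal steps; the mutual-information estimate is taken straight from empirical counts.

```python
import math
from collections import Counter
from itertools import combinations

import networkx as nx

def mutual_information(data, x, y):
    """Estimate I(X;Y) from a list of dicts (one dict per training example)."""
    n = len(data)
    joint = Counter((row[x], row[y]) for row in data)
    px = Counter(row[x] for row in data)
    py = Counter(row[y] for row in data)
    mi = 0.0
    for (vx, vy), c in joint.items():
        p_xy = c / n
        mi += p_xy * math.log(p_xy / ((px[vx] / n) * (py[vy] / n)))
    return mi

def learn_tree_structure(data, variables, root):
    # Steps 1-2: complete graph weighted by pairwise mutual information.
    g = nx.Graph()
    for x, y in combinations(variables, 2):
        g.add_edge(x, y, weight=mutual_information(data, x, y))
    # Step 3: maximum-weight spanning tree.
    tree = nx.maximum_spanning_tree(g)
    # Step 4: direct all edges outward from the chosen root.
    return list(nx.bfs_edges(tree, root))

# Tiny invented data set over three binary variables (A and B perfectly correlated).
data = [{"A": a, "B": a, "C": c} for a in (0, 1) for c in (0, 1)] * 3
print(learn_tree_structure(data, ["A", "B", "C"], root="A"))
```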
Learning TANs

Exactly the same, with the following minor changes:
- In step 1, instead of I(Ai;Aj), we use the conditional mutual information I(Ai;Aj | C).
- In the end, we add the class node and the edges from that node to all attributes.
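For reference, the conditional mutual information used in the modified step 1 is the standard quantity (this formula is not spelled out on the slide):

```latex
I(A_i; A_j \mid C) = \sum_{a_i, a_j, c} P(a_i, a_j, c)\,
  \log \frac{P(a_i, a_j \mid c)}{P(a_i \mid c)\, P(a_j \mid c)}
```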
Experimental Results

In a series of experimental tests conducted with data sets from the U.C. Irvine repository, TANs often outperformed:
- Naïve Bayes.
- General Bayes nets.
- C4.5.