Odsek za tehnologije znanja - Analytics @ IJS-E8

Download Report

Transcript Odsek za tehnologije znanja - Analytics @ IJS-E8

What Semantic Web researchers need
to know about Machine Learning?
http://analytics.ijs.si/events/Tutorial-MachineLearningForSemanticWeb-ISWC2007-Busan-Nov2007/
Marko Grobelnik, Blaž Fortuna, Dunja Mladenić
How Semantic Web compares to
Machine Learning? (1/2)
• Semantic Web is traditionally non-analytic
area of research…
– …it includes a lot of logic, some language
technologies, some standardization etc.
– Generally, manual encoding of knowledge is
still appreciated the most…
– …where the main paradigm is top-down.
– Top-down approaches enable accuracy when
designing knowledge bases, but can cost a lot
and are mostly not scalable
How Semantic Web compares to
Machine Learning? (2/2)
• Machine Learning is on the other hand
almost exclusively analytic & bottom-up…
– …where analytic techniques have the main
goal to extract knowledge from the data…
– …with a little or no human involvement.
– The cost per encoded piece of knowledge is
lower, achievable scale is high, …
– …but, the techniques are limited to what is
computable in a reasonable time.
What are stitching points
between SW and ML?
• …SW can contribute to ML and vice versa
• Machine Learning for Semantic Web
– …decrease costs when extracting knowledge
and building ontologies, speeding-up
reasoning, scalability
• Semantic Web for Machine Learning
– …Semantic Web techniques can provide new
insights into the data (e.g. via reasoning)
which are unobservable by statistical means
Part 1:
Understanding Machine Learning
in simple terms?
• Sub-areas of machine learning
• What are constituents of a machine
learning algorithm?
• What kind of problems one could try to
solve with ML methods?
• Why ML is easy and why it is hard?
Sub-areas of machine learning
• Data-Mining
• Main events: ACM KDD, IEEE ISCM, SIAM SDM
• Books: Witten and Frank, 1999
• AI style Machine Learning
• Main events: ICML, ECML/PKDD
• Books: Mitchell, 1997
• Statistical Machine Learning
• Main events: NIPS, UAI
• Books: Bishop 2007, Duda & Hard & Stork 2000
• Theoretical Machine Learning
• Main events: COLT, ALT
• Books: Vapnik 1999
Which areas are contributing?
Analysis of Data bases
Learning with
different
representations
Theory on learning
Statistical data
analysis
What is “Learning” in Machine Learning?
• Herbert Simon: “Learning is any process by which a
system improves performance from experience.”
• Improve on task T, with respect to performance metric P,
based on experience E:
– T: Recognizing hand-written words
– P: Percentage of words correctly classified
– E: Database of human-labeled images of handwritten words
– T: Driving on four-lane highways using vision sensors
– P: Average distance traveled before a human-judged error
– E: A sequence of images and steering commands recorded while observing a
human driver
– T: Categorize email messages as spam or legitimate
– P: Percentage of email messages correctly classified
– E: Database of emails, some with human-given labels
What are constituents of a
machine learning algorithm?
• Typical ML algorithm includes:
•
•
•
•
•
Input data in some kind of form (most often vectors)
Input parameters (settings)
Fitting data using some search algorithm into a model
Output the model in the form of y=f(x)
Use the model for classification of unseen data
– …the key elements are:
• Language of the model – determines complexity of the model
–
…popular language for many algorithms is linear model (used in SVM,
Perceptron, Bayes, ...): y = a x1 + b x2 + c x3 + …
• Search algorithm – determines the quality of a result
–
…most often it is some kind of local optimization
What kind of problems one could try
to solve with ML methods?
• Generally, machine learning is about finding
patterns and regularities in the data…
• …the result is always a model which could be
understood as a summary of the data used for
learning model
• The resulting model can be used for:
– Explaining existing phenomena in the data
– Predicting future situations
Why ML is easy and why it is hard?
• ML is easy:
– …because it seems the God is not playing dice and the Nature
usually behaves nicely and is not chaotic
– With ML techniques we try to discover the rules how the nature
was generating the observed data.
– Simple algorithmic approaches can bring a lot of success when
understanding the observed data.
• ML is hard:
– …because we don’t know the language the God is using when
programming the Universe
– There are many different ways to represent similar concepts…
– …we need to deal with scale, noise, dynamics, etc and is hard to
find the right way how to model the observed data
Part 2:
Main approaches in
Machine Learning
• Supervised learning
• Semi-supervised learning
• Unsupervised learning
When to apply each of the
approaches?
• …let’s take the example from medicine:
– we have patients which have symptoms (inputs) and diagnosis
(output)
• Supervised learning (classification)
– …given symptoms and corresponding diagnoses for many patients,
the goal is to find rules which can map/predict symptoms to
diagnosis for an unseen patient
• Semi-supervised learning (transduction, active learning)
– …given symptoms and corresponding diagnoses for only a few
symptoms, leverage these to find most probable diagnoses for all
the patients
• Unsupervised learning (clustering, decompositions)
• …given only symptoms for many patients, find the groups of similar
patients (with e.g. possibly similar diagnoses).
Supervised learning
Assign an object to a given finite set of classes:
• Medical diagnosis
– …assign diagnosis to a patient
• Credit card applications or transactions
– …assign credit score to an applicant
• Fraud detection in e-commerce
– …decide about fraud or non-fraud event in a business process
• Financial investments
– …decide whether to buy or sell or hold on a stock exchange
• Spam filtering of e-mails
– …decide if an email is a spam or a regular email
• Recommending articles in a newspaper
– …decide if an article fits the user profile
• Semantic/linguistic annotation
– …assign semantic or linguistic annotation to a word or phrase
Recommending News Articles
???
Machine Learning
labeled articles
(training examples)
predicted article class
(interesting or not)
new article
(test example)
Supervised learning
• Given is a set of labeled examples represented by feature vectors
• The goal is: to build a model approximating the target function
which would automatically assign right label to a new unlabeled
example
• Feature values:
– discrete (eg., eyes_color  {brown, blue, green})
– continuous (eg., age  [0..200])
– ordered (eg., size {small, medium, large})
• Values of the target function – labels:
•
– discrete (classification) or continuous (regression)
– exclude each other (eg., medical diagnosis) or not (eg., document
content can be about arts and computer science)
– have some predefined relations (taxonomy of document categories,
e.,g., DMoz or Medline)
The target function can be represented in different ways (storing
examples, symbolic, numerical, graphical,…) and modeled by using
different algorithms
Illustrative example
Recommending cartoon for a 5-year old boy
Title
Bob the builder
Pixar-Locomotion
Ice age
Over the hedge
Cars
Anima
South Park
Simpson
Main characters
vehicles, human
vehicles
animals
animals
vehicles
human
human
human
Duration
10 mins
5 mins
90 mins
60 mins
90 mins
90 mins
30 mins
20 mins
Main characters  {vehicle, human, animal}
Duration [5..90]
Target function
There is a trade-off between the expressiveness of
a representation and the ease of learning
– The more expressive a representation, the better it will
be at approximating an arbitrary function; however, the
more examples will be needed to learn an accurate
function
Illustrative example
•Values of the target function: discrete labels (classification),
exclude each other
Interesting movie: yes no
Possible data visualization
human = no
Possible Model
for not interesting
human = yes
(vehicles = no)
and
(human = yes)
vehicles = yes
vehicles = no
Generalization
• Model must generalize the data to correctly
classify yet unseen examples (the ones which
don’t appear in the training data)
• Lookup table of training examples is a
consistent model that does not generalize
– An example that was not in the training data can not
be classified
Occam’s razor:
– Finding a simple model helps ensure generalization
Ockham (Occam)’s Razor
• William of Ockham (1295-1349) was a
Franciscan friar who applied the criteria to
theology:
– “Entities should not be multiplied beyond
necessity” (Classical version but not an actual
quote)
– “The supreme goal of all theory is to make the
irreducible basic elements as simple and as few
as possible without having to surrender the
adequate representation of a single datum of
experience.” (Einstein)
• requires a precise definition of simplicity
• assumes that nature itself is simple
Algorithms for learning
classification models
Storing examples
– Nearest Neighbour
Symbolic
– Decision trees
– Rules in propositional logic or first order logic
Numerical
–
–
–
–
Perceptron algorithm
Winnow algorithm
Support Vector Machines
Logistic Regression
Probabilistic graphical models
– Naive Bayesian classifier
– Hidden-Markov Models
Nearest neighbor
• Storing training examples without generating
any generalization
– Simple, requires efficient storage
• Classification by comparing the example to the
stored training examples and estimating the
class based on classes of the most similar
examples
– Similarity function is crucial
Also known as:
– Instance-based, Case-based, Exemplar-based,
Memory-based, Lazy Learning
Similarity/Distance
• For continuous features use Euclidian distance
Dist (e1 , e2 ) 
n
2
(
f

f
)
 1i 2i
i 1
ek  f k ,1 , f k , 2 ,... f k ,n
• For discrete features, assume distance between two
values is 0 if they are the same and 1 if they are
different (eg., Hamming distance for bit vectors).
To compensate for difference in units across features,
scale all continuous values to the interval [0,1].
Nearest neighbor
Nearest neighbor - example
Model
Title
Main characters
Duration
Bob the builder
vehicles, human
10 mins
Pixar-Locomotion
vehicles
5 mins
Ice age
animals
90 mins
Over the hedge
animals
60 mins
Cars
vehicles
90 mins
Anima
human
90 mins
South Park
human
30 mins
Simpson
human
20 mins
Vehicles
(yes, no)
Human
(yes, no)
Animals
(yes, no)
Features
Duration
(5, 10, 20, 30, 60, 90)
 |F |

L1 (e1 , e2 ) || e1  e2 ||   ( f1i  f 2i ) ; f i  f j  1; f i  f j
 i 1

Nearest neighbor - example
Title
Main characters
Duration Dist.
Distance:
Bob the builder
vehicles, human
10 mins
3
1+0+1+1=3
Pixar-Locomotion
vehicles
5 mins
4
1+1+1+1=4
Ice age
animals
90 mins
3
1+1+0+1=3
Over the hedge
animals
60 mins
2
Cars
vehicles
90 mins
4
Anima
human
90 mins
3
South Park
human
30 mins
3
1+0+1+1=3
Simpson
human
20 mins
3
1+0+1+1=3
1+1+0+0=2
1+1+1+1=4
1+0+1+1=3
Jungle book
human, animals
60 mins
 |F |

L1 (e1 , e2 ) || e1  e2 ||   ( f1i  f 2i ) ; f i  f j  1; f i  f j
 i 1

Decision trees
• Recursively splitting set of examples until (almost)
all examples in the set have the same class value
–
–
–
–
starts with a set of all examples
find a feature giving the best split
split the examples according to the feature values
repeat for each subset of examples
InfGain( S , F )  Entropy( S ) 

f Values( F )
Sf
S
Entropy( S f )
Entropy( S )   pi log pi
i
• Classification by “pushing” the example from root
to leaf, assign class value of the leaf
Decision tree - example
Title
Main characters
Bob the builder
vehicles
Duration Entropy({5,3})   5 log 2 5  3 log 2 3
8
8 8
8
10 mins
= 0.63*0.68+0.38*0.53
Pixar-Locomotion vehicles
5 mins
Ice age
animals
90 mins
Over the hedge
animals
60 mins
Cars
vehicles
90 mins
Anima
human
90 mins
South Park
human
30 mins
Simpson
human
20 mins
vehicle
3
53
characters
human
3
InfGain=0.95-(0+0+0)=0.95
3
3
Entropy({3,0})   log 2  0  0
3
3
Entropy({x,0})  0
1
1 2
2
Entropy({1,2})   log 2  log 2
3
3 3
3
= 0.33*1.58+0.67*0.58
= 0.91
duration
≤10
animals
2
= 0.95
2
10<, ≤ 30 30 <,<90
2
1
53
90≥
12
InfGain=0.95-(0+0+0.38*0.91+0)=0.61
Decision tree - example
Title
Main characters
Duration
vehicles
10 mins
Pixar-Locomotion vehicles
5 mins
Ice age
animals
90 mins
Over the hedge
animals
60 mins
Cars
vehicles
90 mins
Anima
human
90 mins
South Park
human
30 mins
Simpson
human
20 mins
Jungle book
animals
60 mins
Bob the builder
characters
vehicle
3
human
3
53
animals
2
Decision tree model
If-then Rules
Generating rules by adding conditions:
– Λ – restricting the rule (less examples match)
– V – generalizing the rule (more examples match)
• Maximize quality of each rule (eg., matching
examples are of the same class) while aiming at
describing all the examples with a set of rules
If-then Rules
Converting a tree to rules
• If (Main characters = vehicles) then interesting
• If (Main characters = human) then uninteresting
• If (Main characters = animals) then interesting
:
vehicle
5 3
characters
animals
human
3
3
2
Support Vector Machine
• Learns a hyperplane in
higher dimensional space
– that separates the training
data and
– gives the highest margin
Positive
class
Margin
• Implicit mapping of the
original feature space into
higher dimensional space
– mapping using so called
kernel function (eg., linear,
polynomial, …)
Regarded as state-of-the-art in
text document classification
Hyperplane
Negative
class
Linear Model
Naïve Bayes
Determine class of example ek by estimating
P(ci ) P(ek | ci )
P(ci | ek ) 
 arg max P(ci ) P(ek | ci )
P(ek )
i
• P(ci) – estimate from the data using frequency:
no. of examples with class ck / no. of all examples
• P(ek|ci) – too many possibilities (all combinations
of feature values)
– assume feature independence given the class
n
P(ek | ci )   P( f kj | ci )
j 1
Model
Naïve Bayes - example
Main characters
Duration P(vehicle| )=0.6 P(5| )=0.2
vehicles
10 mins
Pixar-Locomotion vehicles
5 mins
Ice age
animals
90 mins
Over the hedge
animals
60 mins
Cars
vehicles
90 mins
Anima
human
90 mins
South Park
human
30 mins
Simpson
human
20 mins
Jungle book
animals
60 mins
Title
Bob the builder
P(
P(
| ) → 5/8*(0.4*0.2) = 0.05
| ) → 3/8*(0*0) = 0
P(vehicle| )=0
P(5| )=0
P(human| )=0
P(human| )=1
P(10| )=0.2
P(10| )=0
P(animal|
P(animal|
P(
P(
)=5/8
)=3/8
)=0.4 P(20| )=0
)=0 P(20| )=0.33
P(30| )=0
P(30| )=0.33
P(60| )=0.2
P(60| )=0
P(90| )=0.4
P(90| )=0.33
P(ci | ek )  arg max P(ci ) P( f kj | ci )
i
j
Generative Probabilistic Models
• Assume a simple (usually unrealistic) probabilistic
method by which the data was generated
• Each class value has a different parameterized
generative model that characterizes it
• Training: Use the data for each category to estimate the
parameters of the generative model for that category.
– Maximum Likelihood Estimation (MLE): Set parameters to
maximize the probability that the model produced the given
training data
– If Mλ denotes a model with parameter values λ and Dk is the
training data for the kth class, find model parameters for class k
(λk) that maximize the likelihood of Dk:
k  argmax P( Dk | M  )

• Testing: Use Bayesian analysis to determine the
category model that most likely generated a specific test
instance.
Semi-supervised learning
Similar to supervised learning except that
• we have examples and only some of them are
labeled
• we may have a human available for a limited
time to provide labels of examples
– …this corresponds to the situation where all the
patients in the database have symptoms, but only a
few have diagnosis
– …and occasionally we have doctors for a limited time
to respond the questions about the patients
Using unlabeled data
(Nigam et al., 2000)
• Given: a small number of labeled examples and a
large pool of unlabeled examples, no human
available
– e.g., classifying news article as interesting or not interesting
• Approach description (EM + Naive Bayes):
– train a classifier with only labeled documents,
– assign probabilistically-weighted class labels to unlabeled
documents,
– train a new classifier using all the documents
– iterate until the classifier remains unchanged
Using Unlabeled Data with
Expectation-Maximization (EM)
E-step: Estimate
labels of unlabeled
documents
Initialize: Learn
from labeled only
Naive
Bayes
M-step: Use all
documents to
rebuild classifier
Guarantees local maximum a posteriori parameters
Co-training (Blum & Mitchell, 1998)
Theory behind co-training
• Possible to learn from unlabeled examples
• Value of unlabeled data depends on
– How (conditionally) independent are the two
representations of the same data
• The more the better
– The number of redundant inputs (features)
• Expected error decreases exponentially with this number
• Disagreement on unlabeled data predicts true
error
Better performance on labelling unlabeled data
compared to EM approach
Bootstrap Learning to Classify Web Pages
Document
Given: set of documents where each
document is described by two independent
sets of features (e.g. document text +
hyperlinks anchor text)
ew labeled and many
unlabeled
Page
Classifier
Link
Classifier
Hyperlink to
the document
Active Learning
Active Learning
• We use this methods
whenever hand-labeled
data are rare or
expensive to obtain
• Interactive method
performance
• Requests only labeling
of “interesting” objects
• Much less human work
needed for the same
result compared to
arbitrary labeling
examples
Teacher
Teacher
Data &
passive student
labels
query
active student
label
Active student
asking smart
questions
Passive student
asking random
questions
number of questions
Approaches to Active Learning
• Uncertainty sampling (efficient)
– select example closest to the decision hyperplane (or the one with classification
probability closest to P=0.5) [Tong & Koller 2000]
• Maximum margin ratio change
– select example with the largest predicted impact on the margin size if selected
[Tong & Koller 2000]
• Monte Carlo Estimation of Error Reduction
– select example that reinforces our current beliefs [Roy & McCallum 2001]
• Random sampling as baseline
• Experimental evaluation (using F1-measure) of the four listed
approaches shown on three categories from Reuters-2000
dataset
– average over 10 random samples of 5000 training (out of 500k) and
10k testing (out of 300k)examples
– two of the methods a rather time consuming, thus we run them for
including the first 50 unlabeled examples
– experiments show that active learning is especially useful for
unbalanced data
Category with very unbalanced
class distribution having 2.7% of
positive examples
Uncertainty seems to
outperform MarginRatio
Illustration of Active learning
• starting with one labeled example from each class
(red and blue)
• select one example for labeling (green circle)
• request label and add re-generate the model using
the extended labeled data
Illustration of linear SVM model using
• arbitrary selection of unlabeled examples (random)
• active learning selecting the most uncertain
examples (closest to the decision hyperplane)
Uncertainty sampling
of unlabeled example
Unsupervised learning
• Given is a set of examples
• The goal is: to cluster the examples into several
groups based on some similarity measure
– examples inside the group should be similar while
examples between the groups should be different
Similarity measure plays a crucial role:
Cosine Cos(d , d ) 
1
2
d1  d 2

d1 d 2
Minkowsky (norm k)
x
x
1i 2 i
i
 j x 2j
k xk2
Jaccard
1/ k
 |W |

Lk (d 1, d 2 ) || d1  d 2 ||k    ( w1i  w2i ) k 
 i 1

Manhattan (k=1), Euclidean (k=2 )
J ( A, B) 
A B
A B
Clustering methods
• Hierarchical
– agglomerative – at each step merge two or more
groups
– divisive – at each step break the selected group into
two or more groups
• Non hierarchical
– requires specification of the number of clusters
– optimization of the initial clustering (e.g., maximize
similarity of examples inside the same group)
• Geometrical
– map multidimensional space into two- or threedimensional (e.g., principal component analysis)
• Graph-theoretical
K-Means clustering algorithm
• Given:
– set of examples (e.g., TFIDF vectors of documents),
– distance measure (e.g., cosine)
– K (number of groups)
• For each of K groups initialize its centroid with a
random document
• While not converging
– Each document is assigned to the nearest group
(represented by its centroid)
– For each group calculate new centroid (group mass
point, average document in the group)
Example of k-means clustering
Examples:
• A: 1,0,1,0,1 1.
• B: 1,0,0,0,1
• C: 1,0,1,0,0 2.
• D: 0,0,0,1,0
• E: 0,1,0,1,0 3.
K=2
4.
Randomly select two examples, e.g., A, D to
be representatives of two clusters I: A, II: D
Calculate similarity of other examples to the
them
B,I= 0.82, B,II= 0, C,I= 0.82, C,II= 0, E,I= 0, E,II= 0.7
Assign examples to the most similar cluster
I: (A,B,C) II: (D,E)
Calculate the cluster centroid
I: 1,0,0.67,0,0.67
5.
II: 0,0.5,0,1,0
Calculate similarity of all the examples to the
centroids A,I= 0.88, A,II= 0, B,I= 0.77, B,II= 0, C,I=
0.77, C,II= 0, D,I= 0, D,II= 0.82, E,I= 0, E,II= 0.87
6.
7.
Cluster the examples I: (A,B,C) II: (D,E)
Stop as the clustering got stabilized
Example of hierarchical clustering
(bisecting k-means)
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
0, 1, 2, 4, 6, 7, 9, 10, 11
3, 5, 8
0, 2, 4, 7, 10, 11
0, 2, 4, 7, 11
10
2, 4, 11
2, 11
2
0, 7
4
11
1, 6, 9
0
1, 9
1
7
3, 8
6
9
3
5
8
Latent Semantic Indexing
• LSI is a statistical technique that attempts
to estimate the hidden content structure
within documents:
– …it uses linear algebra technique SingularValue-Decomposition (SVD)
– …it discovers statistically most significant cooccurrences of terms
LSI Example
Original document-term mantrix
d1
d2
d3
d4
d5
d6
cosmonaut
1
0
1
0
0
0
astronaut
0
1
0
0
0
0
moon
1
1
0
0
0
0
car
1
0
0
1
1
0
truck
0
0
0
1
0
High correlation although
d2 and d3 don’t share
any word
Correlation matrix
1
d1
Rescaled document matrix,
Reduced into two dimensions
d1
d2
d3
d4
d5
d6
Dim1
-1.62
-0.60
-0.04
-0.97
-0.71
-0.26
Dim2
-0.46
-0.84
-0.30
1.00
0.35
0.65
d2
d3
d4
d5
d1
1.00
d2
0.8
1.00
d3
0.4
0.9
1.00
d4
0.5
-0.2
-0.6
1.00
d5
0.7
0.2
-0.3
0.9
1.00
d6
0.1
-0.5
-0.9
0.9
0.7
d6
1.00
Data one can analyze with ML methods
•
•
•
•
•
•
•
•
•
Structured, tabular data
Textual data
Multilingual data
Social networks
Images/videos
Relational data
Temporal data
Cross-modal analysis
Very large datasets
Structured, tabular data
Table, where each example is a raw, each
feature is a column with predefined values
feature
Title
Bob the builder
Main characters
Duration
vehicles, human 10 mins
Pixar-Locomotion vehicles
5 mins
Ice age
animals
90 mins
Over the hedge
animals
60 mins
.
Cars
vehicles
90 mins
.
Anima
human
90 mins
South Park
human
30 mins
Simpson
human
20 mins
example
.
Main characters  {vehicle, human, animal}
Duration  [5..90]
Textual data
Having a set of documents, represent each as a
feature vector:
• divide text into units (eg., words), remove punctuation,
(remove stop-words, stemming,…)
• each unit becomes a feature having numeric weight as its
value (eg., number of occurrences in the text - referred to
as term frequency or TF)
Commonly used weight is TFIDF:
 N 

TFIDF ( w)  tf ( w) * log 
 df ( w) 
• tf(w) – term frequency (no. of occurrences of word w in document
dokumentu)
• df(w) – document frequency (no. of documents containing word w)
• N – no. of all documents
Textual data - example
Bob the builder is a
children animated
movie on a character
Bob and his friends
that include several
vehicle characters.
They face challenges
and jointly solve
them, such as, repair
a roof or save Bob’s
cat from a tall tree…
Pixar has several
short animated
movies suitable for
children. Locomotion
is one of them
showing train engine
and a train wagon as
two characters that
face a challenge of
crossing a halfbroken bridge…
…
Simpson family
provokes a smile
on many adult
and children
faces showing
everyday life of a
family of four…
bob
builder
3
1
1
1
1
2
1
1
… …
… …
0
0
1
1
1
1
0
0
… …
…
0
…
0
0
…
0
0
…
0
…
0
… …
… …
children animated movie
…
…
1
character friend vehicle
…
Perceptron learning algorithm on
(document classification)
Input:
set of examples (documents) in the form of vectors;
each labeled as +1 or -1
Output:
linear model wi (one weight per word)
Algorithm:
• initialize the model wi by setting word weights = 0
• iterate through documents N times
• For document d from D
–
–
–
–
// Using current model wi classify the document d
If sum(di *wi) >= 0 then classify document as positive
Else classify document as negative
If document classification is wrong then
• // adjust weights of all words occurring in the document
• wt+1 = wt +sign(true-class) * Beta (input parameter Beta>0)
• // where sign(positive) = 1 and sign(negative) =-1
Multilingual data
• Text in several natural languages
• Perform machine learning and retrieval on
textual data regardless the language
differences
• Approach:
– Machine Translation (on sentence level)
– Multilingual lexicon (on word level)
– Mapping into semantic space (on word level,
eg., KCCA)
KCCA to handle multilingual data
•
•
KCCA enables representing documents in a
“language neutral way”
Intuition behind KCCA:
1. Given a parallel corpus (such as Acquis)…
2. …first, we automatically identify language independent
semantic concepts from text,
3. …then, we re-represent documents with the identified
concepts,
4. …finally, we are able to perform cross language
statistical operations (such as retrieval, classification,
clustering…)
English
German
Slovenian
Slovak
Czech
French
Hungarian
Spanish
Greek
Italian
Language Independent
Document Representation
Danish
Finnish
Lithuanian
Swedish
Dutch
New document
represented as text in
any of the above languages
New document
represented in
Language Neutral way
…enables cross-lingual retrieval, categorization, clustering, …
Input for KCCA
• On input we have set of
aligned documents:
Bag-of-words space
for English language
Bag-of-words space
for German language
– For each document we have a
version in each language
• Documents are represented as
bag-of-words vectors
Pair of aligned
documents
The Output from KCCA
• The goal: find pairs of
semantic dimensions that coappear in documents and
their translations with high
correlation
– Semantic dimension is a
weighted set of words.
• These pairs are pairs of
vectors, one from e.g. English
bag-of-words space and one
from German bag-of-words
space.
Semantic
dimension
loss, income,
company,
quarter
verlust,
einkommen,
firma, viertel
wage,
payment,
negotiations,
union
zahlung, volle,
gewerkschaft,
verhandlungsrunde
Semantic
dimensions pair
The Algorithm – Theory
Formally the KCCA solves:
max(x,y) Corr(<x,, , >, <y,, , >)
– x, y – semantic directions for English and German
– ( , ) is a pair of aligned documents
Examples of Semantic Dimensions from
Acquis corpus: English-French (1/2)
Most important words from semantic
dimensions automatically generated from
2000 documents:
Veterinary,
Transport
DIRECTIVE, DECISION, VEHICLES, AGREEMENT, EC, VETERINARY, PRODUCTS, HEALTH, MEAT
DIRECTIVE, DECISION, VEHICULES, PRESENTE, RESIDUS, ACCORD, PRODUITS, ANIMAUX, SANITAIRE
Customs
NOMENCLATURE, COMBINED, COLUMN, GOODS, TARIFF, CLASSIFICATION, CUSTOMS
NOMENCLATURE, COMBINEE, COLONNE, MARCHANDISES, CLASSEMENT, TARIF, TARIFAIRES
EMBRYOS, ANIMALS, OVA, SEMEN, ANIMAL, CONVENTION, BOVINE, DECISION, FEEDINGSTUFFS
EMBRYONS, ANIMAUX, OVULES, CONVENTION, SPERME, EQUIDES, DECISION, BOVINE, ADDITIFS
SUGAR, CONVENTION, ADDITIVES, PIGMEAT, PRICE, PRICES, FEEDINGSTUFFS, SEED
SUCRE, CONVENTION, PORC, ADDITIFS, PRIX, ALIMENTATION, SEMENCES, DECISION
EXPORT, LICENCES, LICENCE, REFUND, VEHICLES, FISHERY, CONVENTION, CERTIFICATE, ISSUED
EXPORTATION, CERTIFICATS, CERTIFICAT, PECHE, VEHICULES, LAIT, CONVENTION
Export Licences
Agriculture
Veterinary
Examples of Semantic Dimensions from
Acquis corpora: English-Slovene (2/2)
Most important words from semantic
dimensions automatically generated from
2000 documents :
Agriculture
OLIVE, OIL, AID, SUGAR, PRICE, STATE, MILK, LICENCES, OR, EXPORT, INTERVENTION
OLJA, OLJCNEGA, POMOCI, SLADKORJA, POMOC, OLJK, SLADKOR, ALI, DOVOLJENJA, CE
Customs
NOMENCLATURE, COLUMN, COMBINED, GOODS, TARIFF, CLASSIFICATION, ST, ANNEXED, INVOKED
NOMENKLATURO, STOLPCU, NOMENKLATURE, KOMBINIRANO, KOMBINIRANE, CARINSKI, BLAGA
QUOTAS, TARIFF, SEED, CUSTOMS, COLUMN, ENERGY, INVOKED, ATOMIC, QUOTA, OPENING
KVOT, TARIFNE, SEMENA, KVOTE, TARIFNIH, CARINSKI, ATOMSKO, ENERGIJO, ODPRTJU
DESIGNATIONS, GEOGRAPHICAL, INDICATIONS, EURATOM, PROTECTED, ECSC, NAMES, ORIGIN
OZNACB, EURATOM, GEOGRAFSKIH, POREKLA, ESPJ, ZASCITENIH, OZNACBE, IMEN, REGISTER
WINE, WINES, ALCOHOL, DRINKS, DISTILLATION, POULTRYMEAT, ICEWINE, ANALYSIS
VINO, VINA, VIN, VINSKEM, VINSKEGA, ALKOHOL, NAMIZNEGA, DESTILACIJO, DESTILACIJE
Wine
Agriculture protection
Energy
Applications of KCCA
•
Cross-lingual document retrieval: retrieved
documents depend only on the meaning of the
query and not its language.
•
Automatic document categorization: only
one classifier is learned and not a separate
classifier for each language
•
Document clustering: documents should be
grouped into clusters based on their content,
not on the language they are written in.
•
Cross-media information retrieval: in the
same way we correlate two languages we can
correlate text to images, text to video, text to
sound, …
Example of cross-lingual
information retrieval on
Reuters news corpus
using KCCA
Borse
= Stock =
‘Borse’
Exchange
‘Stock Exchange’
Search engine
Related approaches
• Usual approach for modelling cross
language Information Retrieval is Latent
Semantic Indexing (LSI/SVD) on parallel
corpora
– …measured performance of KCCA is
significantly better then of LSI
[Vinokourov et. al, 2002]
Availability/Scalability
• KCCA is available within Text-Garden textmining software environment
– …available at http://www.textmining.net
• Current version processes up-to 10.000
documents
• Next version (incremental) will be able to
process up-to 100.000 documents
Social networks
• Social networks can be also potential
source of data for machine learning and
building semantic structures
– …conceptually they share similar underlying
structure as text – namely, the underlying
distribution is generated by power-law
• In the next slides we show how social
networks can be modeled using
unsupervised techhniques
Analysis of e-mail graph
•
An e-mail graph can be analyzed in the following 5 major steps:
1. Starting with log files from an e-mail server where the data include information
about e-mail transactions with the fields: sender and the list of receivers.
2. After cleaning we get the data in the form of e-mail transactions which include email addresses of sender and receiver.
3. From a set of e-mail transactions we construct a graph where vertices are email addresses connected if there is a transaction between them
4. E-mail graph is transformed into a sparse matrix allowing to perform data
manipulation and analysis operations
5. Sparse matrix representation of the graph is analyzed with ontology learning
tools producing an ontological structure corresponding to the organizational
structure of the institution where e-mails came from.
Graph transformation
into a set of sparse matrix
• Graph with N vertices is transformed into N*N
sparse matrix where:
– …Xth row represents information for Xth vertex
– …Xth row has nonzero components for:
• Xth vertex itself and
• Xth vertex’s neighbors on the distance D (e.g. 1, 2, 3)
– Intuitively, Xth row represents numerically
“neighborhood” of the Xth vertex within the graph:
• Xth element in the Xth row has weight 1
• …elements representing neighbors have lower weights
relative to the distance (d) from the Xth vertex (1/(2^d))
– (e.g. 1, 0.5, 0.25, 0.125, …)
Graph transformation
into sparse matrix (example)
2
6
7
3
4
0
8
1
5
9
10
11
0
1
0
1
0.5
1
0.5
1
3
4
5
6
7
8
9
0.25
0.25
0.5
0.25
0.5
0.25
0.25
0.5
0.25
0.5
10
11
0.25
0.25
2
1
0.25
0.5
0.25
3
0.25
1
0.5
0.25
0.5
0.5
1
0.25
0.5
0.25
0.25
1
0.5
0.25
4
0.25
5
0.25
6
0.5
0.25
7
0.25
0.5
8
0.5
0.25
9
0.25
0.5
1
0.25
1
0.25
0.25
0.5
0.5
0.25
0.25
1
0.25
0.25
10
11
Transforming
Graph into
Matrix
2
0.25
0.25
0.25
0.5
0.5
1
0.5
0.5
1
1
Data used for Experimentation
• The data is the collection of log files with e-mail
transactions from local e-mail spam filter
software Amavis (http://www.amavis.org/):
– Each line of the log files denotes one event at the
spam filter software
– We were interested in the events on successful e-mail
transactions
• ...having information on time, sender, and list of receivers
– An example of successful e-mail transaction is the
following line:
• 2005 Mar 28 13:59:05 patsy amavis[33972]: (33972-01-3)
Passed CLEAN, [217.32.164.151] [193.113.30.29]
<[email protected]> -> <[email protected]>,
Message-ID:
<21DA6754A9238B48B92F39637EF307FD0D4781C8@i2km41ukdy.domain1.systemhost.net>, Hits: -1.668, 6389 ms
Some statistics about the data
• The log files include e-mails for 19 months:
– …this sums up to 12.8Gb of data.
– After filtering out successful e-mail transactions it
remains 564Mb
• …which contains approx. 2.7 million of successful e-mail
transitions used for further processing
– The whole dataset contains references to approx.
45000 e-mail addresses
• …after the data cleaning phase the number is reduced to
approx. 17000 e-mail addresses
• …out of which 770 e-mail addresses are internal from the
home institution (with local domain name)
Organizational structure of JSI produced from cleaned
e-mail transactions with OntoGen in <5 minutes
Organizational structure of JSI visualized from
e-mail transactions with Document-Atlas
Software Mining
Software
Link
analysis
Structured data
= networks
Unstructured data
= textual documents
Document network
= a set of interlinked
documents; each link
has type and weight
[Grčar, Mladenić, Grobelnik, 2007]
Text
mining
Extracting data
Comment references
/** The format of Documents. Subclasses
of DocumentFormat know about
* particular MIME types and how to unpack the information in any
* markup or formatting they contain into GATE annotations. Each MIME
* type has its own subclass of DocumentFormat, e.g. XmlDocumentFormat,
* RtfDocumentFormat, MpegDocumentFormat. These classes register themselves
* with a static index residing here when they are constructed. Static
* getDocumentFormat methods can then be used to get the appropriate
* format class for a particular document.
*/
public abstract class DocumentFormat
extends AbstractLanguageResource implements LanguageResource{
Class
comment
/** The MIME type of this format. */
private MimeType mimeType = null;
A field
A method
Method comment
Field
type
Field comment
Field
name
/**
* Find a DocumentFormat implementation that deals with a particular
* MIME type, given that type.
* @param aGateDocument this document will receive as a feature
*
the associated Mime Type. The name of the feature is
*
MimeType and its
Comment
value
reference
is in the format type/subtype
* @param mimeType the mime type that is given as input
*/
static public DocumentFormat getDocumentFormat(gate.Document aGateDocument,
MimeType mimeType){
Return
type
Method
name
} // getDocumentFormat(aGateDocument, MimeType)
} // class DocumentFormat
Class
name
Super-class
(base class)
Implemented
interface
Extracting data
DocumentFormat.class
DocumentFormat
Extracting data
DocumentFormat.class
DocumentFormat
LanguageResource
RtfDocumentFormat
MimeType
2
DocumentFormat
AbstractLanguageResource
Document
XmlDocumentFormat
MpegDocumentFormat
Example graph
See next slide
Example graph - zoomed
Structuring extracted knowledge
Source code
Preprocessing
Ontology
Feature
vectors
OntoGen
Semantic Web and Machine Learning
Examples of Semantic Web
with Machine Learning
•
•
•
•
•
•
•
Ontology learning
Detecting semantic via Google
Visualization of semantic spaces
Contextualized search
User modeling
Detecting bias in the news
Semantic summarization
ML view on Ontology
• Ontology is a data model that represents a set of
concepts within a domain and the relationships
between those concepts
• Ontology can be seen as a graph/network
structure consisting from:
– a set of concepts (vertices in a graph),
– a set of instances assigned to a particular concepts
(data records assigned to vertices in a graph)
– a set of relationships connecting concepts (directed
edges in a graph)
Example of a Topic Ontology
Ontology learning
• Define the ontology learning tasks in terms of mappings
between ontology components, where some of the
components are given and some are missing and we
want to induce the missing ones.
• Some typical scenarios in ontology learning are the
following:
– Inducing concepts/clustering of instances (given instances)
– Inducing relations (given concepts and the associated
instances)
– Ontology population (given an ontology and relevant, but not
associated instances)
– Ontology generation (given instances and any other
background information)
– Ontology updating/extending (given an ontology and
background information, such as, new instances or the ontology
usage patterns)
Ontology Learning with OntoGen
(developed on the top of Text Garden)
• Semi-Automatic
– provide suggestions and insights into the domain
– the user interacts with parameters of methods
– final decisions taken by the user
• Data-Driven
– most of the aid provided by the system is based on
some underlying data
– instances are described by features extracted from
the data (eg., words-vectors)
[Fortuna, Mladenić, Grobelnik, 2005]
Installation package is publicly available in binaries
at ontogen.ijs.si
Basic idea behind OntoGen
Text corpus
Ontology
Concept A
Concept B
Domai
n
Concept
C
112
Hierarchy of
concepts
These
twothe
views are synchronized
Details
about
selected concept
Concept
name
Descriptive
keywords
Distinctive
keywords
Selected
concept
Ontology
visualization
Root concept
Suggesting
sub-concepts
Number of
suggestions
List of
suggestions
Inspection
tool
Mining
Pharmacy
Real estate
Oil
Airlines
Restaurants
Life insurance
Description of
the concept
The system asks several
“yes or no” questions
Checkboxes indicate
whether a document
belongs to the
concept or not
List of
documents
Document
preview pane
Red dots represent
documents that
currently belong to
the concept
Similarity
graph
Google Meaning
Google meaning
• Google offers insight on how words are used in a large corpora
(web)
• This can be used to estimate distance between terms
– Estimation is based on the co-occurrence of terms
– Uses the frequency of each term and their co-occurrences
• Examples:
– GoogDist(“Oscar Wilde”, “The Picture of Dorian Gray”) = 0.2
– GoogDist(“Oscar Wilde”, “A Midsummer Night’s Dream”) = 0.5
Google meaning
• Google offers insight on how words are used in a large corpora
(web)
• This can be used to estimate similarities between terms
– Estimation is based on the co-occurrence of terms
– Uses the frequency of each term and their co-occurrences
Text Visualization
Document-Atlas
http://docatlas.ijs.si
Document Atlas – visualization of
document collections and their structure
http://docatlas.ijs.si
Contextualized web-search
SearchPoint:
http://searchpoint.ijs.si
Visualization of search results
• Search engines generally work very well
• There are cases where it is difficult to specify a
query
• Idea: help the user by clustering all the hits and
visualise the results space
Example: jaguar
Relevant part of
ontology
for the query
“jaguar”
Example: ISWC
Query ISWC
has many
meanings
Example: Password
SEKTbar – intelligent user
profiling
User focused browsing history in SEKTbar
• Web-based user profile is automatically generated
while the user is browsing the Web
– User profile is represented as a user-interest topic ontology
(root gives the user’s general interest, leaves specific
interests)
– topic ontology is generated by using hierarchical clustering
– nodes of current interest are determined by comparing the
profile node centroids to the centroid, computed out of the m
most recently visited pages
• User profile is visualized on the SEKTbar (Internet
Explorer Toolbar)
– the user can select a node in the hierarchy to see its specific
keywords and associated pages
[Grčar, Mladenić, Grobelnik, 2006]
Detecting the bias in media with
statistical learning methods
Media Bias
• The descriptions of the same event in news can differ
with respect to the news source and its bias.
– Can we detect this using statistical methods?
– How does this bias reflect in vocabulary?
• To answer these questions we need:
– A controlled collection of news for a longer time period
• A year corpus of news articles on international events
– Matching of news articles from different news sources based
on the described event
• We used bag-of-words model and “Best Reciprocal Hit” method
from Bioinformatics to match news articles describing similar
events
Experimental setup
• Time period: March 31st 2005 – April 14th 2006
• Size of collections:
• Number of discovered matches:
Prediction of news source
• The task: given a pair of news articles describing the
same event, can we predict the news source for
each?
• In this experiment we focused on CNN and Al
Jazeera.
• SVM linear classifier was used for prediction
– Evaluation was done using 10-fold cross-validation
– Significance of results was tested against random matches
• We used SVM feature selection [Brank et. al.] to
extract most important classification keywords.
Source prediction results
• News source can be predicted with 81% BEP
 high bias in vocabulary
• News source specific keywords:
Bias example
Keywords differences on topic level
News source similarity maps
Based on source
vocabulary differences
Based on covered events
Summarization of documents
through semantic-graphs
Approach Description
• Approach:
– Learn a machine learning model for selecting
sentences
– Use information about semantic structure of the
document (concepts and relations among concepts)
Results are promising
– achieved 70% recall of and 40% precision on extracted SubjectPredicate-Object triples on DUC (Document understanding
conference) data
Summarization
Original
Document
Human built document
summary
Cracks
Appear in U.N. Trade Embargo Against Iraq.
Semantic net of
Subj-Pred-Obj
triples
Creation of
semantic network
Cracks appeared Tuesday in the U.N. trade embargo against Iraq as Saddam Hussein sought to circumvent the economic noose around his country.
Japan, meanwhile, announced it would increase its aid to countries hardest hit by enforcing the sanctions. Hoping to defuse c riticism that it is not doing its
share to oppose Baghdad, Japan said up to $2 billion in aid may be sent to nations most affected by the U.N. embargo on Iraq. President Bush on
Tuesday night promised a joint session of Congress and a nationwide radio and television audience that ``Saddam Hussein will fail'' to make his conquest
of Kuwait permanent. ``America must stand up to aggression, and we will,'' said Bush, who added that the U.S. military may remain in the Saudi Arabian
desert indefinitely. ``I cannot predict just how long it will take to convince Iraq to withdraw from Kuwait,'' Bush said. More than 150,000 U.S. troops have
been sent to the Persian Gulf region to deter a possible Iraqi invasion of Saudi Arabia. Bush's aides said the president woul d follow his address to
Congress with a televised message for the Iraqi people, declaring the world is united against their government's invasion of Kuwait. Saddam had offered
Bush time on Iraqi TV. The Philippines and Namibia, the first of the developing nations to respond to an offer Monday by Saddam of free oil _ in exchange
for sending their own tankers to get it _ said no to the Iraqi leader. Saddam's offer was seen as a none-too-subtle attempt to bypass the U.N. embargo, in
effect since four days after Iraq's Aug. 2 invasion of Kuwait, by getting poor countries to dock their tankers in Iraq. But according to a State Department
survey, Cuba and Romania have struck oil deals with Iraq and companies elsewhere are trying to continue trade with Baghdad, all in defiance of U.N.
sanctions. Romania denies the allegation. The report, made available to The Associated Press, said some Eastern European countries also are trying to
maintain their military sales to Iraq. A well-informed source in Tehran told The Associated Press that Iran has agreed to an Iraqi request to exchange food
and medicine for up to 200,000 barrels of refined oil a day and cash payments. There was no official comment from Tehran or B aghdad on the reported
food-for-oil deal. But the source, who requested anonymity, said the deal was struck during Iraqi Foreign Minister Tariq Aziz's visit Sunday to Tehran, the
first by a senior Iraqi official since the 1980-88 gulf war. After the visit, the two countries announced they would resume diplomatic relations. Wellinformed oil industry sources in the region, contacted by The AP, said that although Iran is a major oil exporter itself, it currently has to import about
150,000 barrels of refined oil a day for domestic use because of damages to refineries in the gulf war. Along similar lines, ABC News reported that
following Aziz's visit, Iraq is apparently prepared to give Iran all the oil it wants to make up for the damage Iraq inflicted on Iran during their conflict.
Secretary of State James A. Baker III, meanwhile, met in Moscow with Soviet Foreign Minister Eduard Shevardnadze, two days after the U.S.-Soviet
summit that produced a joint demand that Iraq withdraw from Kuwait. During the summit, Bush encouraged Mikhail Gorbachev to w ithdraw 190 Soviet
military specialists from Iraq, where they remain to fulfill contracts. Shevardnadze told the Soviet parliament Tuesday the s pecialists had not reneged on
those contracts for fear it would jeopardize the 5,800 Soviet citizens in Iraq. In his speech, Bush said his heart went out to the families of the hundreds of
Americans held hostage by Iraq, but he declared, ``Our policy cannot change, and it will not change. America and the world wi ll not be blackmailed.'' The
president added: ``Vital issues of principle are at stake. Saddam Hussein is literally trying to wipe a country off the face of the Earth.'' In other
developments: _A U.S. diplomat in Baghdad said Tuesday up to 800 Americans and Britons will fly out of Iraqi-occupied Kuwait this week, most of them
women and children leaving their husbands behind. Saddam has said he is keeping foreign men as human shields against attack. On Monday, a
planeload of 164 Westerners arrived in Baltimore from Iraq. Evacuees spoke of food shortages in Kuwait, nighttime gunfire and Iraqi roundups of young
people suspected of involvement in the resistance. ``There is no law and order,'' said Thuraya, 19, who would not give her last name. ``A soldier can rape
a father's daughter in front of him and he can't do anything about it.'' _The State Department said Iraq had told U.S. officials that American males residing
in Iraq and Kuwait who were born in Arab countries will be allowed to leave. Iraq generally has not let American males leave. It was not known how many
men the Iraqi move could affect. _A Pentagon spokesman said ``some increase in military activity'' had been detected inside Iraq near its borders with
Turkey and Syria. He said there was little indication hostilities are imminent. Defense Secretary Dick Cheney said the cost of the U.S. military buildup in
the Middle East was rising above the $1 billion-a-month estimate generally used by government officials. He said the total cost _ if no shooting war breaks
out _ could total $15 billion in the next fiscal year beginning Oct. 1. Cheney promised disgruntled lawmakers ``a significant increase'' in help from Arab
nations and other U.S. allies for Operation Desert Shield. Japan, which has been accused of responding too slowly to the cris is in the gulf, said Tuesday it
may give $2 billion to Egypt, Jordan and Turkey, hit hardest by the U.N. prohibition on trade with Iraq. ``The pressure from abroad is getting so strong,''
said Hiroyasu Horio, an official with the Ministry of International Trade and Industry. Local news reports said the aid would be extended through the World
Bank and International Monetary Fund, and $600 million would be sent as early as mid-September. On Friday, Treasury Secretary Nicholas Brady visited
Tokyo on a world tour seeking $10.5 billion to help Egypt, Jordan and Turkey. Japan has already promised a $1 billion aid pac kage for multinational
peacekeeping forces in Saudi Arabia, including food, water, vehicles and prefabricated housing for non-military uses. But critics in the United States have
said Japan should do more because its economy depends heavily on oil from the Middle East. Japan imports 99 percent of its oi l. Japan's constitution
bans the use of force in settling international disputes and Japanese law restricts the military to Japanese territory, except for ceremonial occasions. On
Monday, Saddam offered developing nations free oil if they would send their tankers to pick it up. The first two countries to respond Tuesday _ the
Philippines and Namibia _ said no. Manila said it had already fulfilled its oil requirements, and Namibia said it would not ``sell its sovereignty'' for Iraqi oil.
Venezuelan President Carlos Andres Perez dismissed Saddam's offer of free oil as a ``propaganda ploy.'' Venezuela, an OPEC member, has led a drive
among oil-producing nations to boost production to make up for the shortfall caused by the loss of Iraqi and Kuwaiti oil from the world market. Their oil
makes up 20 percent of the world's oil reserves. Only Saudi Arabia has higher reserves. But according to the State Department, Cuba, which faces an oil
deficit because of reduced Soviet deliveries, has received a shipment of Iraqi petroleum since U.N. sanctions were imposed fi ve weeks ago. And
Romania, it said, expects to receive oil indirectly from Iraq. Romania's ambassador to the United States, Virgil Constantinescu, denied that claim
Tuesday, calling it ``absolutely false and without foundation.''.
Cracks appeared in the U.N. trade embargo
against Iraq. The State Department reports that
Cuba and Romania have struck oil deals with
Iraq as others attempt to trade with Baghdad in
defiance of the sanctions. Iran has agreed to
exchange food and medicine for Iraqi oil.
Saddam has offered developing nations free
oil if they send their tankers to pick it up.
Thus far, none has accepted.
Japan, accused of responding too slowly to the
Gulf crisis, has promised $2 billion in aid to
countries hit hardest by the Iraqi trade embargo.
President Bush has promised that Saddam's
aggression will not succeed.
Manual
summarization
70% recall, 40% precision
of selected triples according to
human generated summaries
Automatic
summarization by
selecting relevant triples
Mapping between
graphs learned with
ML methods
Automatically built
document
summary (not done
by us)
Nat. Lang.
Generation
Semantic net of
Subj-Pred-Obj
triples
Cracks appeared in the U.N. trade embargo
against Iraq. The State Department reports that
Cuba and Romania have struck oil deals with
Iraq as others attempt to trade with Baghdad in
defiance of the sanctions. Iran has agreed to
exchange food and medicine for Iraqi oil.
Saddam has offered developing nations free
oil if they send their tankers to pick it up.
Thus far, none has accepted.
Japan, accused of responding too slowly to the
Gulf crisis, has promised $2 billion in aid to
countries hit hardest by the Iraqi trade embargo.
President Bush has promised that Saddam's
aggression will not succeed.
Detailed Summarization Procedure
Linguistic analysis of the text
- Deep parsing of sentences
Refinement of the text parse
- Named-entity consolidation
Determine that ’George Bush’ = ‘Bush’
= ‘U.S. president’
- Anaphora resolution
Link pronouns with name-entities
Extract Subject–Predicate–Object
triples
Compose a graph from triples
Describe each triple with a set of
features for learning
Learn a model to classify triples into
the summary
Generate a summary graph
Use summary graph to generate
textual document summary
Tom Sawyer went to
town. He met a friend.
Tom was happy. …
Tom Sawyer went to town.
He [Tom Sawyer] met a
friend. Tom [Tom Sawyer]
was happy. …
Tom  go  town
Tom  meet  friend
Tom  is  happy
Example of summarization
Cracks Appear in U.N. Trade Embargo Against Iraq.
Human written
summary
Cracks appeared Tuesday in the U.N. trade embargo against Iraq as Saddam Hussein sought to circumvent the economic noose around his country. Japan,
meanwhile, announced it would increase its aid to countries hardest hit by enforcing the sanctions. Hoping to defuse criticism that it is not doing its share to oppose
Baghdad, Japan said up to $2 billion in aid may be sent to nations most affected by the U.N. embargo on Iraq. President Bush on Tuesday night promised a joint
session of Congress and a nationwide radio and television audience that ``Saddam Hussein will fail'' to make his conquest of Kuwait permanent. ``America must
stand up to aggression, and we will,'' said Bush, who added that the U.S. military may remain in the Saudi Arabian desert indefinitely. ``I cannot predict just how long
it will take to convince Iraq to withdraw from Kuwait,'' Bush said. More than 150,000 U.S. troops have been sent to the Persian Gulf region to deter a possible Iraqi
invasion of Saudi Arabia. Bush's aides said the president would follow his address to Congress with a televised message for the Iraqi people, declaring the world is
united against their government's invasion of Kuwait. Saddam had offered Bush time on Iraqi TV. The Philippines and Namibia, the first of the developing nations to
respond to an offer Monday by Saddam of free oil _ in exchange for sending their own tankers to get it _ said no to the Iraqi leader. Saddam's offer was seen as a
none-too-subtle attempt to bypass the U.N. embargo, in effect since four days after Iraq's Aug. 2 invasion of Kuwait, by getting poor countries to dock their tankers in
Iraq. But according to a State Department survey, Cuba and Romania have struck oil deals with Iraq and companies elsewhere are trying to continue trade with
Baghdad, all in defiance of U.N. sanctions. Romania denies the allegation. The report, made available to The Associated Press, said some Eastern European
countries also are trying to maintain their military sales to Iraq. A well-informed source in Tehran told The Associated Press that Iran has agreed to an Iraqi request to
exchange food and medicine for up to 200,000 barrels of refined oil a day and cash payments. There was no official comment from Tehran or Baghdad on the
reported food-for-oil deal. But the source, who requested anonymity, said the deal was struck during Iraqi Foreign Minister Tariq Aziz's visit Sunday to Tehran, the
first by a senior Iraqi official since the 1980-88 gulf war. After the visit, the two countries announced they would resume diplomatic relations. Well-informed oil industry
sources in the region, contacted by The AP, said that although Iran is a major oil exporter itself, it currently has to import about 150,000 barrels of refined oil a day for
domestic use because of damages to refineries in the gulf war. Along similar lines, ABC News reported that following Aziz's visit, Iraq is apparently prepared to give
Iran all the oil it wants to make up for the damage Iraq inflicted on Iran during their conflict. Secretary of State James A. Baker III, meanwhile, met in Moscow with
Soviet Foreign Minister Eduard Shevardnadze, two days after the U.S.-Soviet summit that produced a joint demand that Iraq withdraw from Kuwait. During the
summit, Bush encouraged Mikhail Gorbachev to withdraw 190 Soviet military specialists from Iraq, where they remain to fulfill contracts. Shevardnadze told the
Soviet parliament Tuesday the specialists had not reneged on those contracts for fear it would jeopardize the 5,800 Soviet citizens in Iraq. In his speech, Bush said
his heart went out to the families of the hundreds of Americans held hostage by Iraq, but he declared, ``Our policy cannot change, and it will not change. America and
the world will not be blackmailed.'' The president added: ``Vital issues of principle are at stake. Saddam Hussein is literally trying to wipe a country off the face of the
Earth.'' In other developments: _A U.S. diplomat in Baghdad said Tuesday up to 800 Americans and Britons will fly out of Iraqi-occupied Kuwait this week, most of
them women and children leaving their husbands behind. Saddam has said he is keeping foreign men as human shields against attack. On Monday, a planeload of
164 Westerners arrived in Baltimore from Iraq. Evacuees spoke of food shortages in Kuwait, nighttime gunfire and Iraqi roundups of young people suspected of
involvement in the resistance. ``There is no law and order,'' said Thuraya, 19, who would not give her last name. ``A soldier can rape a father's daughter in front of
him and he can't do anything about it.'' _The State Department said Iraq had told U.S. officials that American males residing in Iraq and Kuwait who were born in Arab
countries will be allowed to leave. Iraq generally has not let American males leave. It was not known how many men the Iraqi move could affect. _A Pentagon
spokesman said ``some increase in military activity'' had been detected inside Iraq near its borders with Turkey and Syria. He said there was little indication hostilities
are imminent. Defense Secretary Dick Cheney said the cost of the U.S. military buildup in the Middle East was rising above the $1 billion-a-month estimate generally
used by government officials. He said the total cost _ if no shooting war breaks out _ could total $15 billion in the next fiscal year beginning Oct. 1. Cheney promised
disgruntled lawmakers ``a significant increase'' in help from Arab nations and other U.S. allies for Operation Desert Shield. Japan, which has been accused of
responding too slowly to the crisis in the gulf, said Tuesday it may give $2 billion to Egypt, Jordan and Turkey, hit hardest by the U.N. prohibition on trade with Iraq.
``The pressure from abroad is getting so strong,'' said Hiroyasu Horio, an official with the Ministry of International Trade and Industry. Local news reports said the aid
would be extended through the World Bank and International Monetary Fund, and $600 million would be sent as early as mid-September. On Friday, Treasury
Secretary Nicholas Brady visited Tokyo on a world tour seeking $10.5 billion to help Egypt, Jordan and Turkey. Japan has already promised a $1 billion aid package
for multinational peacekeeping forces in Saudi Arabia, including food, water, vehicles and prefabricated housing for non-military uses. But critics in the United States
have said Japan should do more because its economy depends heavily on oil from the Middle East. Japan imports 99 percent of its oil. Japan's constitution bans the
use of force in settling international disputes and Japanese law restricts the military to Japanese territory, except for ceremonial occasions. On Monday, Saddam
offered developing nations free oil if they would send their tankers to pick it up. The first two countries to respond Tuesday _ the Philippines and Namibia _ said no.
Manila said it had already fulfilled its oil requirements, and Namibia said it would not ``sell its sovereignty'' for Iraqi oil. Venezuelan President Carlos Andres Perez
dismissed Saddam's offer of free oil as a ``propaganda ploy.'' Venezuela, an OPEC member, has led a drive among oil-producing nations to boost production to
make up for the shortfall caused by the loss of Iraqi and Kuwaiti oil from the world market. Their oil makes up 20 percent of the world's oil reserves. Only Saudi Arabia
has higher reserves. But according to the State Department, Cuba, which faces an oil deficit because of reduced Soviet deliveries, has received a shipment of Iraqi
petroleum since U.N. sanctions were imposed five weeks ago. And Romania, it said, expects to receive oil indirectly from Iraq. Romania's ambassador to the United
States, Virgil Constantinescu, denied that claim Tuesday, calling it ``absolutely false and without foundation.''.
Cracks appeared in the U.N. trade embargo against Iraq.
The State Department reports that Cuba and Romania have
struck oil deals with Iraq as others attempt to trade with
Baghdad in defiance of the sanctions. Iran has agreed to
exchange food and medicine for Iraqi oil. Saddam has offered
developing nations free oil if they send their tankers to
pick it up. Thus far, none has accepted.
Japan, accused of responding too slowly to the Gulf crisis, has
promised $2 billion in aid to countries hit hardest by the Iraqi
trade embargo. President Bush has promised that
Saddam's aggression will not succeed.
7800 chars, 1300 words
Automatically generated
graph of summary triples
Further online information
• Recorded tutorials, lectures, summerschools available from
http://videolectures.net
– Semantic Web:
http://videolectures.net/Top/Computer_Scienc
e/Semantic_Web/
– Machine Learning:
http://videolectures.net/Top/Computer_Scienc
e/Machine_Learning/
ML Summer Schools @ Video Lectures
• ML summer
schools online