Transcript ArtEquAKT

ArtEquAKT
Harith Alani, Sanghee Kim, Wendy Hall,
Paul Lewis, David Millard, Nigel Shadbolt,
Mark Weal
Overview


• Union of three projects: Artiste, Equator, and AKT
• Aims:
  • Use NLT to automatically extract relevant information about the life and work of artists from online documents
  • Feed this information automatically to an ontology designed for this domain
  • Generate stories by extracting and structuring information from the knowledge base in the form of biographical narratives in response to user requests
Objectives


• To find out how effective these technologies are when used together
• To explore the way in which the limitations of one process affect the others (e.g. how ambiguity during extraction might be reflected at the generation stage)
• To generate biographies that might not be as readable as those on the web but which:
  • contain information that is difficult to find manually
  • gather information from disparate sources
[Figure: system architecture. Three components (Information Extraction, Knowledge Management, Narrative Generation) connect the Web, the ontology, the KB, the DB, Linky story templates, and servlets through seven steps: 1. Extraction, 2. Population, 3. Consolidation, 4. Indexing, 5. Interaction, 6. Instantiation, 7. Rendering.]
[Figure: system architecture diagram repeated, highlighting Information Extraction.]
Knowledge Extraction Procedure
[Figure: knowledge extraction pipeline. Components shown: Query, Downloaded Text, Paragraph and sentence recognition, Apple Pie Parser (syntactic analyser), Semantic Analysis (relational learning), and XML output, with the Ontology, WordNet, and GATE loaded as resources.]
Search and Filter Documents




• Query search engines (‘Yahoo’, ‘Altavista’) with the artist name
• Calculate the similarity of each retrieved document to an example document
• Use normalised term frequency for the similarity computation
• Apply heuristics (e.g. sentence length) to filter out documents that contain mostly tables and/or links
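The filtering steps above can be sketched as follows. This is a minimal illustration, not the project's code: the slides only say "term frequency with normalisation", so the length normalisation, the cosine measure, the function names, and both thresholds are assumptions.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Return term frequencies normalised by document length."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (assumed measure)."""
    dot = sum(weight * tf_b.get(term, 0.0) for term, weight in tf_a.items())
    norm_a = math.sqrt(sum(w * w for w in tf_a.values()))
    norm_b = math.sqrt(sum(w * w for w in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_sentence_length(text):
    """Average number of words per sentence; pages of links score very low."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


def filter_documents(example, documents, min_similarity=0.2, min_sentence_words=5.0):
    """Keep documents that read like prose and resemble the example document."""
    example_tf = term_frequencies(example)
    kept = []
    for doc in documents:
        if mean_sentence_length(doc) < min_sentence_words:
            continue  # heuristic: mostly tables and/or links
        if cosine_similarity(example_tf, term_frequencies(doc)) >= min_similarity:
            kept.append(doc)
    return kept
```

Both thresholds would need tuning against real search results; the values here are placeholders.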
Relation Extraction




• Natural language processing techniques extract relations
• Extraction is guided by an ontology
• GATE (General Architecture for Text Engineering) and WordNet are used for entity recognition (e.g. person name, place name, or date)
• Term expansion using WordNet (synonym, hypernym, and hyponym), e.g. ‘depict’ maps to ‘portray’ (synonym) and ‘represent’ (hypernym)
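Term expansion can be illustrated with a toy lookup. The sketch below does not call WordNet; `TOY_LEXICON` is a hand-built stand-in that contains only the ‘depict’ example from the slide, and the function names are hypothetical.

```python
# Hand-built stand-in for WordNet lookups (assumption). It holds only the
# 'depict' example given on the slide.
TOY_LEXICON = {
    "depict": {
        "synonym": ["portray"],
        "hypernym": ["represent"],
        "hyponym": [],
    },
}


def expand_term(term, relations=("synonym", "hypernym", "hyponym"), lexicon=TOY_LEXICON):
    """Return the term together with its related terms for the chosen relations."""
    key = term.lower()
    entry = lexicon.get(key, {})
    expanded = {key}
    for relation in relations:
        expanded.update(entry.get(relation, []))
    return expanded


def matches_relation(verb, relation_keyword, lexicon=TOY_LEXICON):
    """True if a verb found in the text maps onto an ontology relation keyword."""
    return verb.lower() in expand_term(relation_keyword, lexicon=lexicon)
```

With this lexicon, a sentence that uses ‘portray’ would still be matched against an ontology relation keyed on ‘depict’.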
An Example

Given the sentence:
  Rembrandt Harmenszoon van Rijn was born on July 15, 1606, in Leiden, the Netherlands.

The following facts are extracted:
• Relation: Birth
• Person name: Rembrandt Harmenszoon van Rijn
• Date: July 15, 1606
• Place: Leiden, the Netherlands
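To make the input and output of this example concrete, here is a small sketch that pulls the same facts out of the sentence with a regular expression. The project itself uses GATE, the Apple Pie Parser, and WordNet; the single hard-coded pattern and the field names below are assumptions made only for illustration.

```python
import re

# One hand-written pattern for sentences of the form
# "<person> was born on <Month D, YYYY>, in <place>." (illustrative only).
BIRTH_PATTERN = re.compile(
    r"^(?P<person>.+?) was born on "
    r"(?P<date>[A-Z][a-z]+ \d{1,2}, \d{4}), in "
    r"(?P<place>.+?)\.?$"
)


def extract_birth(sentence):
    """Return the birth facts found in a sentence, or None if it does not match."""
    match = BIRTH_PATTERN.match(sentence.strip())
    if match is None:
        return None
    return {
        "relation": "Birth",
        "person_name": match.group("person"),
        "date": match.group("date"),
        "place": match.group("place"),
    }


facts = extract_birth(
    "Rembrandt Harmenszoon van Rijn was born on July 15, 1606, in Leiden, the Netherlands."
)
```

A pattern like this breaks as soon as the wording changes, which is one reason to use syntactic analysis and term expansion instead.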
Future Information Extraction Work





• Incorporate a learning capability for extracting relations
• Widen the scope of the NLP tool to increase performance
• Extract information about ‘painting’
• Extract links to painting images
• Investigate term expansion using WordNet further (e.g. consider context when mapping synonyms or hypernyms)
[Figure: system architecture diagram repeated, highlighting Knowledge Management.]
Knowledge Management





• Ontology of artists based on CIDOC CRM
• The ontology guides the extraction process
• Populating the ontology (feeding the KB)
• Knowledge consolidation
• Ontology server providing a set of inference queries
Artequakt Ontology
Populating the Ontology
<Paragraph>
  <url>Potted_biography.html</url>
  <text>In 1631, when Rembrandt's work had become well known and his studio in Leiden was flourishing, he moved to Amsterdam. He became the leading portrait painter in Holland and received many commissions for portraits as well as for paintings of religious subjects. …..It is estimated that he painted between 50 and 60 self-portraits.</text>
  <Painter>
    <name>Rembrandt</name>
    <place_of_work>leiden</place_of_work>
    <has_location>amsterdam</has_location>
    <number_of_paintings>between 50 and 60 self-portraits</number_of_paintings>
  </Painter>
  <Sentence>
    <url>Potted_biography.html</url>
    <text>He became the leading portrait painter in Holland and received many commissions for portraits as well as for paintings of religious subjects</text>
    <mood>third-person</mood>
    <tense>past</tense>
    <order>0</order>
  </Sentence>
  ………
</Paragraph>
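A sketch of how output in this shape might be loaded into a knowledge base. The dictionary-based KB, the function name, and the shortened sample document are assumptions; the real system feeds an ontology server rather than a Python dict.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the same shape as the extraction output above.
SAMPLE_XML = """<Paragraph>
  <url>Potted_biography.html</url>
  <Painter>
    <name>Rembrandt</name>
    <place_of_work>leiden</place_of_work>
    <has_location>amsterdam</has_location>
  </Painter>
</Paragraph>"""


def populate_kb(kb, xml_text):
    """Add every <Painter> fact to the KB, remembering the source URL of each value."""
    root = ET.fromstring(xml_text)
    source = root.findtext("url")
    for painter in root.iter("Painter"):
        name = painter.findtext("name")
        if not name:
            continue  # no instance to attach the facts to
        instance = kb.setdefault(name, {})
        for child in painter:
            if child.tag == "name":
                continue
            value = (child.text or "").strip()
            instance.setdefault(child.tag, []).append((value, source))
    return kb


kb = populate_kb({}, SAMPLE_XML)
```

Keeping the source URL next to each value is what later makes consolidation and verification possible.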
Knowledge Consolidation

After extracting information on Rembrandt from 10 web sites, the KB was populated with the following:

• Rembrandt instances: 26 Rembrandt, 37 Rembrandt Harmenszoon, 2 Van Rijn
• Date of birth: 15/7/1606, 1606, 1620, 1641
• Place of birth: Leiden, Leyden, Netherlands, Holland

We need to merge duplicates and verify inconsistencies before we can use this knowledge.
Duplication


• Same old problem!
• Our approach for consolidation:
  • Simple heuristics to consolidate most duplicates
    • Artist names are unique: all Rembrandts are merged
    • Merge less specific information into more detailed information: 1606 is merged into 15/7/1606
  • Term expansion using WordNet
    • Synonyms: Leiden and Leyden, The Netherlands and Holland
    • Holonyms (part of): Leiden is part of The Netherlands
  • Knowledge comparison
    • Rembrandt, Rembrandt Harmenszoon, and Van Rijn share a date of birth and a place of birth
    • Difficult with multiple values; verification might help
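Two of these heuristics can be sketched on the Rembrandt values from the previous slide. The `SYNONYMS` table is a hand-written stand-in for the WordNet lookup, the date format is assumed to be either `d/m/yyyy` or a bare year, and the holonym step is left out.

```python
# Stand-in for WordNet synonym lookup (assumption); pairs taken from the slide.
SYNONYMS = {"leyden": "Leiden", "holland": "Netherlands"}


def consolidate_places(places):
    """Collapse synonymous place names onto one canonical spelling."""
    merged = []
    for place in places:
        canonical = SYNONYMS.get(place.strip().lower(), place.strip())
        if canonical not in merged:
            merged.append(canonical)
    return merged


def consolidate_dates(dates):
    """Merge a bare year into a full date that falls in the same year."""
    years_with_full_date = {d.rsplit("/", 1)[1] for d in dates if "/" in d}
    merged = []
    for date in dates:
        if "/" not in date and date in years_with_full_date:
            continue  # less specific value is covered by a more detailed one
        if date not in merged:
            merged.append(date)
    return merged


dates = consolidate_dates(["15/7/1606", "1606", "1620", "1641"])
places = consolidate_places(["Leiden", "Leyden", "Netherlands", "Holland"])
```

After this step 1620 and 1641 still conflict with 15/7/1606, and Leiden and Netherlands are still separate values; those cases are left to the holonym check and to verification.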
Verification

Inconsistency
• We don’t aim for “the right answer”, but for some sort of confidence value
• Different sources may provide different information, e.g. Renoir’s date of birth is:
  • 5 Feb 1841 in www.pillipscollection.org/html/lbp.html
  • 25 Feb 1841 in www.abcgallery.com/R/renoir/renoirbio.html
• Which one is more likely to be correct?
  • Trust: certain sources can be more trusted than others, but how do we judge that?
  • Frequency: certain facts might be extracted more often than others
  • Extraction: some extraction rules are more reliable than others
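One way to turn the trust and frequency criteria into a confidence value is a trust-weighted vote. The slides do not give a formula, so the weighting scheme, the function name, and the source labels in the example are all assumptions; the observations are made up for illustration and are not project data.

```python
from collections import defaultdict


def confidence_values(observations, source_trust=None, default_trust=1.0):
    """Score each candidate value by the summed trust of the sources that state it.

    observations: iterable of (value, source) pairs.
    source_trust: optional mapping from source to a trust weight.
    Returns a mapping from value to a confidence between 0 and 1.
    """
    source_trust = source_trust or {}
    weights = defaultdict(float)
    for value, source in observations:
        weights[value] += source_trust.get(source, default_trust)
    total = sum(weights.values())
    return {value: w / total for value, w in weights.items()} if total else {}


# Hypothetical observations: three invented sources, two of which agree.
observations = [
    ("25 Feb 1841", "source_a"),
    ("25 Feb 1841", "source_b"),
    ("5 Feb 1841", "source_c"),
]
confidence = confidence_values(observations)
```

The reliability of the extraction rule could be folded in the same way, as a second weight multiplied into each observation.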
[Figure: system architecture diagram repeated, highlighting Narrative Generation.]
Biography Templates


• Specified as XML FOHM structures in Auld Linky
• Leaves of the template may be:
  • Queries into the DB for whole paragraphs
  • NLG using queries into the KB
• Context can be used to adjust the shape of the template according to user preferences
[Figure: example biography template. A top-level Sequence covers Birth, Family, Art, and Death. Nested Sequence and LoD structures select leaves by context (Low, Low, or High expertise). Leaves and the text they return:
• Construct sentence with DOB: "Rembrandt was born on July 15, 1606."
• Search for paragraph with DOB: "The greatest artist of the Dutch school, Rembrandt Harmenszoon van Rijn, was born on July 15, 1606."
• Paragraph about paintings: "In addition to portraits, Rembrandt attained fame for his landscapes, while as an etcher he ranks among the foremost of all time."
• Paragraph about style: "His early work was devoted to showing the lines, light and shade, and color of the people he saw about him."
• Paragraph about influences: "He was influenced by the work of Caravaggio and was fascinated by the work of many other Italian artists."
• Death: "On October 4, 1669, Rembrandt died in Amsterdam."]
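The template idea can be sketched as a small tree walk. This is not the FOHM or Auld Linky format: the node kinds, the dictionary layout, and the reading of LoD as a branch chosen by the reader's expertise are assumptions made for illustration.

```python
def instantiate(template, kb, paragraphs, expertise="low"):
    """Walk a nested template and return the text fragments it produces."""
    kind = template["kind"]
    if kind == "sequence":
        fragments = []
        for child in template["children"]:
            fragments.extend(instantiate(child, kb, paragraphs, expertise))
        return fragments
    if kind == "level_of_detail":
        # Context (here only the reader's expertise) selects one branch.
        branch = template["branches"].get(expertise)
        return instantiate(branch, kb, paragraphs, expertise) if branch else []
    if kind == "paragraph_query":
        # Leaf: look up a whole stored paragraph on a topic (stands in for the DB).
        text = paragraphs.get(template["topic"])
        return [text] if text else []
    if kind == "generate_sentence":
        # Leaf: build a sentence from KB facts; skip it if a fact is missing.
        try:
            return [template["pattern"].format(**kb)]
        except KeyError:
            return []
    raise ValueError(f"unknown template node: {kind}")


KB = {"name": "Rembrandt", "date_of_birth": "July 15, 1606"}
PARAGRAPHS = {
    "dob": "The greatest artist of the Dutch school, Rembrandt Harmenszoon "
           "van Rijn, was born on July 15, 1606.",
    "style": "His early work was devoted to showing the lines, light and "
             "shade, and color of the people he saw about him.",
}
TEMPLATE = {
    "kind": "sequence",
    "children": [
        {"kind": "level_of_detail", "branches": {
            "low": {"kind": "generate_sentence",
                    "pattern": "{name} was born on {date_of_birth}."},
            "high": {"kind": "paragraph_query", "topic": "dob"},
        }},
        {"kind": "paragraph_query", "topic": "style"},
    ],
}

print("\n".join(instantiate(TEMPLATE, KB, PARAGRAPHS, expertise="low")))
```

Changing `expertise` to `"high"` swaps the generated birth sentence for the stored paragraph while the rest of the sequence stays the same, which is the sense in which context adjusts the shape of the template.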
Future Biography Generation Work



• Use co-referencing techniques to smooth out chosen paragraphs
• Develop a ‘memory’ of what has previously been said (to catch paragraphs that include multiple ‘facts’)
• Use conflicting factual data as a resource:
  • compare conflicting accounts
  • generate statistical sentences such as “Most sources agree that…”
  • reference material so readers can evaluate the source
Future Direction for ArtEquAKT


• Improve the individual processes
• Incorporate images
  • Use their context (descriptions etc.) to extract knowledge about them
  • Deploy them in biographies to accompany the text
• Use inference
  • generate new relations in the KB
  • use NLP to generate sentences to describe them
• Apply technology to a physical setting (e.g. on a PDA around a gallery space)