BiographyNet - Vrije Universiteit Amsterdam
BiographyNet
Project review, year 1
September 18th, 2013
eScience Center 18 September 2013
BiographyNet Review Meeting, eScience
centre, September 18th, 2013
Agenda
• Project objectives and first year results (Piek)
• Methodology and historian perspective (Serge)
• Model, conversions and interface (Niels)
• NLP tools and research (Antske)
• Discussion
Starting point
• http://www.biografischportaal.nl
• Academic discipline of writing histories:
– computational tools marginally used,
– long scholarly tradition of study by reading,
– single-authored historical narratives,
– while more and more historical sources are digitally available.
• Project challenges:
– “Computational thinking in history”:
  • Narrative historians are not used to framing research problems in computational terms, while computer-science researchers understand little of the subtleties of historical analysis.
  • Strong multi-disciplinary cooperation of front runners in both fields and demonstrator development to achieve common understanding.
– Methodological and tool support
Contribution to historical research
• New research on the Dutch nation building and a revaluation of
biographical information.
– Bridging a gap between life histories, qualitative historical research,
and quantitative historical research.
• Open research on less static objects and relations such as events:
– most important pieces of information capturing changes and
processes that matter.
• Capture historiographic perspective:
– Requires a model that takes different framings of the same event into
account.
– Adds to the who-knows-who: when, where and how did the lives of
people cross; how did they affect each other’s lives and the world they
lived in?
– How do and did we conceive historic events, how are different
narratives created around the same history?
Expected outcome
• Demonstrator on top of the Biography Portal.
– Cyclic development.
– Links within the Biography Portal among the various
(textual and visual) datasets.
• Open-source release of the e-science platform for
analyzing biographical texts about people.
– Adherence to all relevant Web standards and APIs,
maximizing reusability.
• Proposal for methodology for extraction of a
network of relations between people and
(historic) events.
Short term goals
1. Building a richer data repository by connecting different
distributed sources of data through formalized links and
metadata.
2. Detection of (co-referenced) named-entities (persons, places
and dates) and events.
3. Harmonize the texts that vary from 19th century Dutch to
contemporary Dutch, where the OCR-ed texts also contain
errors.
4. Development of visualization, analytic tools, as well as
computational historiographical methods on the structured
data that is generated for 1. through 3.
Results first year
• Methodology:
– Use cases and the anticipation of data- and process-driven biases
– Formal modeling of provenance
– Sustainability, replication, reproducibility
• Software:
– Design of interfaces and analytic tools
– Text mining and evaluation
– Linked Data conversion scripts
• Data:
– Linked Data version of the Portal
– Linking to Agora
– Discussions with Wikimedia/Wikipedia/DBpedia & Bibliotheek.nl
– Verrijkt Koninkrijk
– HuygensING exploitation to extend the Portal with enriched data produced
• 6 accepted papers
BiographyNet and historical approaches to ‘big’ and heterogeneous data
The historian’s role
1. Methodology: Work on a methodology to
extract information, relationships and events
from short biographical texts
2. Question the data: develop use cases
3. Contribute to the design of a user interface that
challenges historians to dig deeper into the data
4. Sensitize target user groups (historians) to both
the possibilities and the limitations of
computational methods in historical research.
1: Methodology
• Year 1 - Historian’s focus: how reliable and
representative are the texts from this particular
dataset? Which questions can and cannot be
answered? How well do ‘tools’ perform, as compared
to the performance of a ‘real’ historian? See also
publications (below).
• Year 1 - Interdisciplinary focus: what is the provenance
of the information, how is it manipulated in order to
arrive at the answer to a query, and who is
responsible for the tools that manipulate those data?
2: Use Cases
• 12 cases developed, ranging from ‘simple’ to
‘highly complex’
• Simple: Group analysis of Governors-general
of the Dutch Indies
• More complex: when did Dutch elites get
involved with the ‘New World’?
• Complex: What can we say about nationalism
in biographical dictionaries from the
nineteenth and twentieth century?
Governors-General of the Dutch Indies
• Highest Official in the Dutch Indies 1610-1949
• 71 men
• What can we say about these men as a group?
• Who was appointed and what qualities did he have to have?
• Etc ….
3: User friendly interface
• Mainly work in progress,
– Discussion about the impact of a ‘design
metaphor’ (like “time line” … , “house of…”,
“building blocks for…”, “family tree…”) on the type
of questions raised by the user
• … presentation Niels.
[Illustrations of design metaphors: The House of History, Time line, Family Tree]
4: Sensitize target user groups
• Publication in Tijdschrift voor Biografie (reaching the nearest target user group of the demonstrator):
Serge ter Braake, ‘Het individu en zijn tijdgenoten. Wat een biograaf kan doen met prosopografie en biografische woordenboeken’, Tijdschrift voor Biografie 2 (summer 2013) vol. 2, 52-61.
• ‘Biography and Computational Methods’, joint paper in preparation, to be submitted before the end of the month to Journal for Historical Biography (Ter Braake, Ockeloen and Fokkens)
• Research on nationalism and national biographies, to be published in 2014
4: Sensitize target user groups
• Presentation at Huygens ING, 10 October 2013 (for
circa 50 professional historians)
• Presentation on provenance at KNAW Digital
Humanities Workshop, 14-15 November 2013
• Introduction to e-Humanities in the current curriculum
of BA1 students at the Vrije Universiteit (what is e-Humanities, how does one use a source like the Oxford
Dictionary of National Biography?)
• Design and development of a series of electives and a
minor on e-history and e-humanities (BA 2-3;
starting 2014/2015). The BiographyNet dataset will be
used in a lab for history bachelor students.
BiographyNet
Towards the demonstrator
Overview
Main components of the demonstrator
• Schema to structure the data
• Conversion of the BP to Linked Data
• NLP system setup
• Interface
A crash course on Linked Data
Online machine readable data with links
• Simple facts called ‘RDF Triples’
Thorbecke > hasBirthPlace > Zwolle
Some technology concepts:
• Schemas: To structure LD
• RDF Stores: To store LD
• SPARQL: To access LD
Huge growth in the past years:
• More than 300 data sources
• More than 30 billion triples
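The triple idea above can be sketched in a few lines of plain Python. This is only an illustration, not a real RDF store: the `match` helper is hypothetical, and a production setup would keep the data in an RDF store and query it through SPARQL.

```python
# A triple is (subject, predicate, object), mirroring
# "Thorbecke > hasBirthPlace > Zwolle" from the slide.
Triple = tuple[str, str, str]

triples: list[Triple] = [
    ("Thorbecke", "hasBirthPlace", "Zwolle"),
    # Extra facts with made-up predicate names, for illustration only.
    ("Thorbecke", "hasBirthYear", "1798"),
    ("Zwolle", "locatedIn", "Netherlands"),
]

def match(store, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

birth_places = match(triples, s="Thorbecke", p="hasBirthPlace")
print(birth_places)
```

Leaving a position as `None` plays roughly the role of a variable in a SPARQL graph pattern.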
The conversion process
Purely syntactic conversion
• Preserve the original structure of the data
• Prevent loss of information
• Allow for reinterpretation of the original data in the future
The conversion process
Conversion steps:
• Retrieval of XML dump of the Biography Portal
• Initial conversion to ‘crude’ RDF
– Using ClioPatria and the XMLRDF tool for ClioPatria
• RDF restructuring
• Linking to other sources
– Essential step in the ‘Linked Data’ philosophy
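To give a feel for what a purely syntactic, ‘crude’ conversion might look like, here is a minimal sketch using only the Python standard library. It does not use ClioPatria or XMLRDF, and the XML fragment and its element names are hypothetical; the real Biography Portal dump has its own structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML fragment; element and attribute names are assumptions.
XML = """
<biography id="bio1">
  <person>
    <name>Thorbecke</name>
    <birth when="1798" place="Zwolle"/>
  </person>
</biography>
""".strip()

def crude_triples(elem, subject):
    """Mirror the XML tree one-to-one as (subject, predicate, object) triples."""
    out = []
    # Attributes become literal-valued triples.
    for key, value in elem.attrib.items():
        out.append((subject, key, value))
    # Non-empty element text is kept under a generic "value" predicate.
    text = (elem.text or "").strip()
    if text:
        out.append((subject, "value", text))
    # Child elements become new nodes linked from their parent.
    for i, child in enumerate(elem):
        child_id = f"{subject}/{child.tag}{i}"
        out.append((subject, child.tag, child_id))
        out.extend(crude_triples(child, child_id))
    return out

root = ET.fromstring(XML)
triples = crude_triples(root, "bio1")
for t in triples:
    print(t)
```

Because nothing is interpreted at this stage, the original structure stays recoverable, which is what allows restructuring and reinterpretation later.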
The conversion process
Data schema:
• Based on the structure of the original XML files
• Needs to facilitate the coupling of different biographies of the same
person, without compromising the original data
• Needs to facilitate the incorporation of several enrichments, following
from NLP, Entity Reconciliation, etc.
• Compatible with existing schemas such as the Europeana Data Model, PROV, P-PLAN, DC terms, etc.
BiographyNet: Schema illustration
http://www.biographynet.nl/schema
Provenance: What is it?
Provenance information is information on how Entities
come into existence
• What are entities?
– Documents, Articles, Pictures, etc.
– Basically anything that can be ‘produced’ by something or someone
• What kind of information?
– Who did what?
– Using which entities?
– In which processes?
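The questions above (who did what, using which entities, in which processes) can be sketched as a small data structure. The class names loosely follow the entity/activity/agent vocabulary, but this is an illustrative assumption and not the PROV data model itself; all instance names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Entity:
    name: str

@dataclass(frozen=True)
class Agent:
    name: str

@dataclass
class Activity:
    name: str
    agent: Agent
    used: list = field(default_factory=list)
    generated: list = field(default_factory=list)

def provenance_of(entity, activities):
    """Answer 'who did what, using which entities?' for one entity."""
    return [
        (a.agent.name, a.name, [e.name for e in a.used])
        for a in activities
        if entity in a.generated
    ]

source = Entity("NNBW biography of Thorbecke")
enrichment = Entity("extracted birth event")
nlp_run = Activity(
    name="date extraction",      # hypothetical process name
    agent=Agent("NLP tool"),     # hypothetical agent name
    used=[source],
    generated=[enrichment],
)
print(provenance_of(enrichment, [nlp_run]))
```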
Provenance in BiographyNet
For the demonstrator, provenance needs to be
modeled:
• From several perspectives:
– Information involved
– Processes involved
– People involved
• At multiple levels:
– An aggregated level, i.e. per enrichment
– Detailed level, i.e. all individual processes
Why is provenance info important for
BiographyNet?
Needed to ensure credibility of the demonstrator, to
evaluate its performance and to improve the academic
status of the tool
• Historians need to be able to validate results
– Replication: Retrieving the same results later using the demonstrator
– Reproducibility: Manually by the historian
• The aggregated level – Targeted at the historian
– Which original sources were involved?
– Whom to contact in case results are called into question?
• The detailed level – Targeted at the computer scientist
– Detailed information on each individual step
– Allows for debugging the internal processing pipeline
BiographyNet
Enrichment example
[Diagram labels: Provenance Meta Data; NNBW “Thorbecke”; Biographical Description; Person Meta Data; Biography Parts; Enrichment; NLP Tool. Events shown: Birth 1798; Birth 1798-01-14, Zwolle. Text fragment shown: “Johan Rudolph Thorbecke werd in 1798 geboren op 14 januari in Zwolle en komt uit een half-Duit…” (“Johan Rudolph Thorbecke was born in 1798 on 14 January in Zwolle and comes from a half-Ger…”)]
More than just Provenance…
P-PLAN is not only used to model what actually
happened, but also what was supposed to happen
• ‘Plans’ describe the original idea behind an activity
– Describe what should happen in a certain activity
– Each ‘Plan’ corresponds with an ‘Activity’
• ‘Variables’ describe the input/output of an activity
– Structure, format, quantity, etc.
– Each ‘Variable’ corresponds with an input/output ‘Entity’ of an ‘Activity’
• ‘Plans’ have their own provenance info
– E.g. who was responsible for the creation of a plan?
Why model plans besides provenance?
The benefits of modeling plans:
• Forces the recording of what an activity and its input/output should look like
– Provides information on the original idea behind an activity
– As such, can provide info on possible assumptions and biases
• Allows for comparing the actual activity and its input/output with the original plan and its variables
– Do they differ from each other and to what extent?
– Makes finding errors much easier, as more information is available about what the input/output should look like
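Comparing a plan with what actually happened can be sketched as a simple check of outputs against planned variables. The `Variable` class and `deviations` function below are hypothetical names chosen for illustration; they are not part of P-PLAN, which is an RDF vocabulary rather than a Python API.

```python
from dataclasses import dataclass

@dataclass
class Variable:
    """What a plan says an input or output should look like."""
    name: str
    expected_type: type
    expected_count: int

def deviations(plan_vars, actual):
    """Compare actual outputs (name -> list of values) with planned variables."""
    problems = []
    for var in plan_vars:
        values = actual.get(var.name, [])
        if len(values) != var.expected_count:
            problems.append(
                f"{var.name}: expected {var.expected_count} value(s), got {len(values)}"
            )
        for v in values:
            if not isinstance(v, var.expected_type):
                problems.append(
                    f"{var.name}: expected {var.expected_type.__name__}, "
                    f"got {type(v).__name__}"
                )
    return problems

# Hypothetical plan: the extraction step should yield one date and one place.
plan = [Variable("birth_date", str, 1), Variable("birth_place", str, 1)]
actual_output = {"birth_date": ["1798-01-14"], "birth_place": []}
print(deviations(plan, actual_output))
```

An empty result would mean the activity behaved as planned; any entry points directly at the variable where plan and execution diverge.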
BiographyNet: Schema illustration
[Schema diagram with nodes: Variable, Plan, Agent, Person, Entity, Association, Activity, NLP Tool]
Recap / Current Status
Main components of the demonstrator
• Initial schema available (publication LISC @ISWC 2013)
– Schema models enrichments and aggregations alongside original sources
– Allows for storing various levels of provenance information
– Model will be adapted while progressing with building the demonstrator
• Initial conversion to Linked Data available
– Structure according to schema presented
– Next step is linking to external sources
• NLP system setup available (Antske)
• Interface
– Presentation of general outline and ideas
Interface: Focus
• The interface should be easy to use
• The demonstrator should inspire historians to
undertake new research and give direction,
rather than being the ‘closing factor’ in their
research
• The interface should allow users to ‘fine tune’
results returned upon an initial action
Interface: Options
• Query composition
• Faceted browsing
• A combination
Interface: Query composition
• Drop down boxes to select ‘Verbs’, data
elements and relations
Interface: Faceted browsing
• No explicit querying, but
convergence of the data through
browsing and selecting
• Provides better feedback to the user
• Allows for more direct and easier
adjustment of the selected data
Interface: A combination
• Query composition combined with faceted
browsing
• Create new facets by defining a query
– The result of the query is available as a subset of
the data by selecting the defined facet
– As such, combinable with other facets
• Method to integrate ‘open’ querying of the
data into a general interface and visualization
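One possible reading of the combined approach is that a facet is just a named filter, and a composed query is registered as one more filter. The sketch below assumes exactly that; the records, field names and helper functions are all hypothetical and say nothing about how the demonstrator is implemented.

```python
# Hypothetical records; field names are illustrative only.
people = [
    {"name": "A", "birth_year": 1798, "birth_place": "Zwolle", "occupation": "politician"},
    {"name": "B", "birth_year": 1650, "birth_place": "Amsterdam", "occupation": "governor-general"},
    {"name": "C", "birth_year": 1801, "birth_place": "Zwolle", "occupation": "governor-general"},
]

# A facet is a named predicate over a record.
facets = {
    "born in Zwolle": lambda p: p["birth_place"] == "Zwolle",
    "governor-general": lambda p: p["occupation"] == "governor-general",
}

def define_facet(name, query):
    """Register the result of a query as a new, reusable facet."""
    facets[name] = query

def select(records, *facet_names):
    """Faceted browsing: keep records that satisfy every selected facet."""
    return [r for r in records if all(facets[f](r) for f in facet_names)]

# Query composition: a user-defined query becomes a facet ...
define_facet("born in the 19th century", lambda p: 1801 <= p["birth_year"] <= 1900)
# ... and is then combinable with the existing facets.
result = select(people, "born in the 19th century", "born in Zwolle")
print([r["name"] for r in result])
```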
Interface: A combination
[Diagram labels: Facets, Selection, Process, Data, Question, Analysis, Results]
Interface: Demonstrator
Time and place are primary elements
BiographyNet
Text Mining
First year goals for Text Mining
• Methodology
– Requirements
– Approach
• Basic System for data enrichment in text
– Identify metadata in text
– Setup that can easily be improved and extended
– (co-referenced) named entities, events
– Deal with alternative spelling
Methodology Requirements
• Reproducing results in Natural Language
Processing is non-trivial
• Details in implementations or experimental
setup can influence results up to a point
where they tell a different story
Reproducing results
• Example: Performance of WordNet similarity
scores compared to human ranking:
Reproducing results
• Clear registration of all steps involved and storage
of (intermediate) system output can improve
reproducibility
• Systematic testing can help to gain insight into
the variation of the outcome of our systems and
hence lead to more insight in their performance
Antske Fokkens, Marieke van Erp, Marten Postma, Ted Pedersen, Piek Vossen
and Nuno Freire (2013) Offspring from Reproduction Problems: What
Replication Failure Teaches Us. In: Proceedings of ACL 2013, Sofia, Bulgaria,
August 2013.
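Registering every step and storing intermediate output, as recommended above, might look roughly like the following. The two toy steps stand in for real NLP components and are assumptions for illustration; the point is only that each step leaves a named, hashed record that a later run can be compared against.

```python
import hashlib
import json

def run_pipeline(text, steps):
    """Run steps in order, recording each step's name and a hash of its output."""
    log = []
    data = text
    for name, func in steps:
        data = func(data)
        digest = hashlib.sha256(
            json.dumps(data, sort_keys=True).encode("utf-8")
        ).hexdigest()
        log.append({"step": name, "output": data, "sha256": digest})
    return data, log

# Hypothetical toy steps standing in for real NLP components.
steps = [
    ("lowercase", str.lower),
    ("tokenize", str.split),
]

result, log = run_pipeline("Born in Zwolle", steps)
rerun, relog = run_pipeline("Born in Zwolle", steps)

# Identical hashes at every step indicate the rerun reproduced the first run.
same = [a["sha256"] == b["sha256"] for a, b in zip(log, relog)]
print(result, same)
```

If a rerun diverges, the first step whose hash differs shows where the variation enters the pipeline.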
Methodology requirements
• The method used to extract information may
introduce a bias that has unintended influence on
the outcome of the historian’s questions
• For example: location identification with
GeoNames
– Heuristic: when multiple locations share the same
name, take the one in or closest to the Netherlands
– High precision, but consider ‘America’ or ‘Willemstad’: what if
the historian investigates trips to the Netherlands by
officials overseas?
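The bias in this heuristic is easy to see in a small sketch. The candidate list below is hypothetical (it does not call the GeoNames service), the coordinates are approximate, and the reference point for ‘the Netherlands’ is an assumed rough centre.

```python
from math import asin, cos, radians, sin, sqrt

# Rough centre of the Netherlands (approximate, assumed for illustration).
NL = (52.2, 5.3)

# Hypothetical candidates as a gazetteer lookup might return them;
# coordinates are approximate.
CANDIDATES = {
    "Willemstad": [
        ("Willemstad, Noord-Brabant", 51.69, 4.44),
        ("Willemstad, Curaçao", 12.11, -68.93),
    ],
}

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def resolve(name):
    """Heuristic from the slide: pick the candidate closest to the Netherlands."""
    return min(
        CANDIDATES[name],
        key=lambda c: haversine_km(NL[0], NL[1], c[1], c[2]),
    )[0]

print(resolve("Willemstad"))
```

The heuristic always returns the Dutch town, so a biography of an official stationed on Curaçao would be silently mapped to the wrong place, which is exactly the kind of bias a historian needs to know about.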
Methodology requirements
• Maximize reuse of existing tools for
BiographyNet
• Maximize reuse of tools developed within
BiographyNet by other researchers
• How can we create a setup that facilitates
this?
Methodology approach
• Provenance modeling:
– Can help to improve reproducibility of research
– Can support systematic testing
– Can model the exact steps taken
• Flexible formats that support this:
– NLP Annotation Format (NAF) to manage output
and input of NLP tools
– Grounded Annotation Framework (GAF) for the
final output of the NLP pipeline
NLP Annotation Format
• Sustainable, because close to existing
linguistic formats (e.g. LAF, GRAF, NIF)
• Joint work across projects and with other
institutes (notably University of the Basque
Country, Fondazione Bruno Kessler)
• Flexible, because the output of individual
tools is added in separate layers
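The idea of adding each tool's output in a separate layer can be illustrated with a plain dictionary. This sketch does not reproduce the actual NAF schema; layer names, field names and tool names are all hypothetical.

```python
# A document with stand-off annotation layers: each tool adds its own layer
# and never overwrites another tool's output.
# Sample sentence: "Thorbecke was born in Zwolle" (Dutch).
doc = {"raw": "Thorbecke werd geboren in Zwolle", "layers": {}}

def add_layer(document, layer_name, tool, annotations):
    """Attach one tool's output as a separate, named layer."""
    if layer_name in document["layers"]:
        raise ValueError(f"layer {layer_name!r} already exists")
    document["layers"][layer_name] = {"tool": tool, "annotations": annotations}

# Hypothetical tokenizer output.
tokens = doc["raw"].split()
add_layer(doc, "tokens", "toy-tokenizer-0.1", tokens)

# Hypothetical entity recognizer output, pointing back at token positions.
add_layer(
    doc,
    "entities",
    "toy-ner-0.1",
    [{"token_index": 0, "type": "PERSON"}, {"token_index": 4, "type": "LOCATION"}],
)

print(list(doc["layers"]))
```

Recording the tool name with each layer also serves the provenance goal: it stays clear which component produced which annotation.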
Grounded Annotation Framework
• RDF compliant framework
• Introduces the denotedBy relation that links
mentions in text to formal representations of
their instances
• Provenance is marked using Named Graphs
• This allows us to accumulate information from
different sources and represent alternative
perspectives
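How named graphs let alternative perspectives coexist can be sketched with quads, i.e. triples that carry a fourth element naming their graph. Only the relation name `denotedBy` comes from the slide; every identifier, predicate and graph name below is a hypothetical placeholder.

```python
# Quads: (subject, predicate, object, graph). The graph name records where a
# statement comes from.
quads = [
    ("ev:birth1", "hasDate", "1798", "g:metadata"),
    ("ev:birth1", "hasDate", "1798-01-14", "g:nlp"),
    ("ev:birth1", "hasPlace", "Zwolle", "g:nlp"),
    # The instance is linked to the text mention it was derived from
    # (the mention identifier is made up).
    ("ev:birth1", "denotedBy", "nnbw:thorbecke#mention1", "g:nlp"),
    # Provenance statements about the graphs themselves.
    ("g:metadata", "derivedFrom", "portal metadata", "g:provenance"),
    ("g:nlp", "derivedFrom", "NLP pipeline", "g:provenance"),
]

def perspectives(subject, predicate):
    """Group alternative values for one property by their source graph."""
    out = {}
    for s, p, o, g in quads:
        if s == subject and p == predicate:
            out.setdefault(g, []).append(o)
    return out

print(perspectives("ev:birth1", "hasDate"))
```

Both birth dates are kept side by side, each attributable to its source, instead of one overwriting the other.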
Provenance Modeling
• It must be clear where information comes
from (original source, opinion holder,
automatically retrieved or from metadata)
• For NLP research:
– Model each step of the process
– Resources used (preprocessing + version), system
output
• For historic research:
– What may introduce biases? How can the process
be represented in an understandable manner?
Basic System
• Identifying metadata in text
– Linguistically naïve supervised machine learning
• Linguistic processing:
– Named Entity recognition (time and location)
– Concept identification
First Evaluation
• Use case: Governors-General of the Dutch Indies
• 129 biographies describing 71 individuals
• Serge ter Braake extracted information
manually
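Evaluating automatic extraction against manually extracted information usually comes down to precision and recall. The sketch below shows the computation under the assumption that both sides map a person identifier to a single value; the identifiers and dates are invented and are not the project's data.

```python
def precision_recall(gold, system):
    """Compare system output with a manually created gold standard.

    Both arguments map a person identifier to an extracted value.
    """
    correct = sum(1 for k, v in system.items() if gold.get(k) == v)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical values, not the project's actual data.
gold = {"p1": "1798-01-14", "p2": "1650-03-02", "p3": "1701-11-30", "p4": "1755-06-09"}
system = {"p1": "1798-01-14", "p2": "1650-03-02", "p3": "1702-11-30"}

p, r = precision_recall(gold, system)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here a wrong value lowers precision, while a value the system never found lowers only recall.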
Metadata versus text mining
[Bar chart comparing metadata and text; vertical axis from 0 to 100]
Preliminary outcome of text mining
Columns: Category, Correct, Incorrect, Both, Correct text, Incorrect text
Education: 2, 0, 2
Father: 0, 0, 9, 2
Mother: 0, 1, 2, 5
Occupation: 14, 6, 21, 4
Birthdate: 21, 2, 35, 9, 2
Recall problems (for birthdate):
1. Sentence not found (35): typical for wikipedia, bwn, vdaa
2. Value not found (7)
3. Wrong sentence (1), wrong date (1): date of marriage, date of death
Observations
• Recall problems (for birthdate):
– Sentence identification
• Easy ways to improve:
– Parents: named entity recognition
– Occupation, Education: concept tagged corpus
– Source specific training
• More difficult problems:
– Relations, functions of other people
– Negations or factuality (e.g. refused positions for
occupations)
NLP outlook
• Evaluation:
– Text based annotations
• Metadata extraction:
– Supervised with linguistically rich features
– Rule-based approaches
• Beyond Metadata:
– Time lines of people’s lives (2nd year)
– Networks between people (2nd year)
– Complex event modeling (3rd year)
Questions?
http://www.biographynet.nl/