BRAT: a web-based tool for manual annotation
Hans Paulussen
ITEC, KU Leuven KULAK
Overview
• Annotation
• BRAT
• LCF (Learner corpus French)
• Alternative editors
• Conclusion
Annotation
• Annotation = metadata: data on data
• Editing textual data and multimedia data requires different
approaches: stand-off vs. inline markup
• Typical multimedia editors: ELAN & ADVENE
o https://tla.mpi.nl/tools/tla-tools/elan/
o http://liris.cnrs.fr/advene/
Stand-off vs inline annotation
• Inline:
Data and metadata (annotation or markup) are
intermingled
• Stand-off:
o Metadata is stored in a separate document, using
reference anchors
o Alignment: based on token or character offsets
o Primary data is left untouched
Inline
John went to Paris yesterday. He
loved the excursion.
John_NNP went_VBD to_TO
Paris_NNP yesterday_NN ._.
He_PRP loved_VBD the_DT
excursion_NN ._.
John_NNP
went_VBD
to_TO
Paris_NNP
yesterday_NN
._.
He_PRP
loved_VBD
the_DT
excursion_NN
._.
Stand-off
         1         2         3         4         5
12345678901234567890123456789012345678901234567890123
John went to Paris yesterday. He loved the excursion.
1 4 NNP
6 9 VBD
11 12 TO
14 18 NNP
20 28 NN
29 29 .
31 32 PRP
34 38 VBD
40 42 DT
44 52 NN
53 53 .
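The offsets in the table above can be derived mechanically from the inline version. The following is a minimal sketch, not the method used in the presentation: the function name `inline_to_standoff` is hypothetical, and it assumes `token_TAG` inline markup, tokens that occur in the primary text in order, and 1-based inclusive character offsets as in the table.

```python
def inline_to_standoff(text, tagged):
    """Align 'token_TAG' items with the untouched primary text.

    Returns (start, end, tag) records with 1-based, inclusive
    character offsets, as in the stand-off table above.
    """
    records = []
    cursor = 0  # 0-based position in the primary text
    for item in tagged.split():
        # Split on the last underscore, so "._." gives token "." and tag "."
        token, _, tag = item.rpartition("_")
        start = text.index(token, cursor)  # raises ValueError if not found
        end = start + len(token)           # 0-based, exclusive
        records.append((start + 1, end, tag))
        cursor = end
    return records


text = "John went to Paris yesterday. He loved the excursion."
tagged = ("John_NNP went_VBD to_TO Paris_NNP yesterday_NN ._. "
          "He_PRP loved_VBD the_DT excursion_NN ._.")

for start, end, tag in inline_to_standoff(text, tagged):
    print(start, end, tag)
```

The primary text is only read, never modified, which is the point of stand-off annotation: the records can be stored in a separate file and re-anchored by offset.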
BRAT
• BRAT rapid annotation tool:
online environment for collaborative text annotation
o http://brat.nlplab.org/
Motivation
• Web-based environment
• Multi-user
• Easy to install & configure
• “Comprehensive” visualization
• Well-documented
LCF
Learner corpus French
• LCF: Learner corpus French
• French texts written by Dutch students from 4 Flemish
institutions
• 500K words (971 texts)
• Text types: argumentative, informative, journalistic, letter,
self-portrait, summary
Configuring BRAT
• Corpus preparation: conversion from XML to read-only text
format
• Create annotation configuration file
• Set up user accounts
• Create export filter to summarize annotated features
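The export filter in the last step can be sketched as follows. This is an illustration only, not the filter used for LCF: it assumes BRAT's stand-off `.ann` files, in which text-bound annotations are tab-separated lines of the form `T1<TAB>Type start end<TAB>text`. The function name `summarize_ann` and the labels `Spelling` and `Grammar` are hypothetical.

```python
from collections import Counter


def summarize_ann(ann_text):
    """Count text-bound annotations per type in the contents of a .ann file."""
    counts = Counter()
    for line in ann_text.splitlines():
        # Text-bound annotations have IDs starting with "T"; other lines
        # (relations, attributes, notes) are skipped in this sketch.
        if not line.startswith("T"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # malformed line
        # Second field is "Type start end"; keep only the type label.
        label = fields[1].split(" ", 1)[0]
        counts[label] += 1
    return counts


# Hypothetical annotation file contents, for illustration only.
sample = (
    "T1\tSpelling 0 4\tJonh\n"
    "T2\tGrammar 5 9\twent\n"
    "T3\tSpelling 13 18\tParsi\n"
    "#1\tAnnotatorNotes T1\ttransposed letters\n"
)

for label, n in sorted(summarize_ann(sample).items()):
    print(label, n)
```

In practice the same function could be applied to every `.ann` file in a collection directory and the counters summed, giving one feature table per corpus or per annotator.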
Alternative editors
Alternative annotation editors /1
• MAT (MITRE Annotation Toolkit): a suite of tools which can
be used for automated and human tagging of annotations.
o http://mat-annotation.sf.net
• TEITOK (The Tokenized TEI Environment): a web-based
system for viewing, creating, and editing corpora with both
rich textual mark-up and linguistic annotation
o http://alfclul.clul.ul.pt/teitok
• EGAS: a web-based platform for biomedical text mining
and collaborative curation, supporting manual and
automatic annotation of concepts and relations.
o https://demo.bmd-software.com/egas/
Alternative annotation editors /2
• TextAE: web-based (RESTful) annotation editor for HTML
documents
o http://textae.pubannotation.org/
• WebAnno: a general purpose web-based annotation tool
for a wide range of linguistic annotations
o https://code.google.com/p/webanno/
WebAnno workflow
WebAnno pros and cons
• First impressions (from colleagues):
o Improved project and user management
o Browser ‘sensitive’ behaviour
o Accepts larger texts than BRAT
o Data management only possible when files are closed
Conclusion
• Annotation editors for textual data have improved
considerably, mainly because of the standardisation of data
formats (XML) and web technology (HTML5)
• The choice of editor depends mainly on the user-friendliness of
the tool and the quality of its features for further exploitation