KONVENS Wien, 15 Sep 2004
EXMARaLDA – A modeling and
visualization framework for the
computer-assisted transcription
of spoken language
Thomas Schmidt
SFB 538 ‚Mehrsprachigkeit‘
University of Hamburg
Background
• Multilingual Database, SFB 538
„Mehrsprachigkeit“, University of Hamburg
• EXMARaLDA (Extensible Markup
Language for Discourse Annotation)
• Dissertation project „Computer-based
transcription of spoken language as a
modelling and visualisation process“
(Supervisor: Angelika Storrer)
Background
• Transcription of spoken language
– Interviewer / child interaction
– Classroom interaction
– Interpreted doctor-patient discourse
– for discourse / conversation analysis
– for (child) language acquisition studies
Background
• Problem: Diversity of Transcription Data
– Theoretical diversity:
• Entities of transcription (utterances, turns, non-verbal activities
etc.)
• Relations between entities (temporal, hierarchical, features, ...)
• Presentation formats (partitur notation, column notation, ...)
– Technological diversity:
• Storage formats (text, binary, RDB)
• Software (syncWriter, HIAT-DOS, DBM-Systems, word
processors, ...)
• Operating Systems (Windows, Mac OS)
Background
• Problem: Diversity of Transcription Data
• Aim: A common platform for computer-assisted transcription
– Exchange, reuse, archive transcription data
– Merge corpora
– Use different software tools with one piece of data
• (Elements of a) Solution
– XML technology
– Three level architecture
– Separate form from content
– Separate logical from physical structure
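The separation of logical from physical structure can be sketched as follows. This is only an illustration in Python: the element names (`transcription`, `timeline`, `point`, `tier`, `event`) and the sample utterances are assumptions made for this example and are not taken from the EXMARaLDA formats. The XML string is the physical representation; the list of tuples read from it is the logical structure that tools would work on.

```python
import xml.etree.ElementTree as ET

# Hypothetical physical representation: events point at timeline ids
# instead of relying on their position in the file.
doc = """
<transcription>
  <timeline><point id="T0"/><point id="T1"/><point id="T2"/></timeline>
  <tier speaker="A">
    <event start="T0" end="T1">so what</event>
    <event start="T1" end="T2">happened</event>
  </tier>
  <tier speaker="B">
    <event start="T1" end="T2">well</event>
  </tier>
</transcription>
"""

root = ET.fromstring(doc)

# Map each timeline id to its position in the ordered timeline.
order = {p.get("id"): i for i, p in enumerate(root.find("timeline"))}

# Logical structure: (speaker, start index, end index, text) per event.
events = [
    (tier.get("speaker"), order[e.get("start")], order[e.get("end")], e.text)
    for tier in root.iter("tier")
    for e in tier.iter("event")
]
```

Because the events refer to timeline points rather than to file positions, the same logical structure could in principle be stored in a different physical form without changing the tools that consume it.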
Topics of this talk
1. Some methodological considerations:
Linguistic methods ↔ Computer science
methods
„Computing in the humanities“
Interdisciplinary communication
2. Components of the developed system
Methodological considerations
[Diagram: the established view of transcription as theory and „Verschriftlichung“ (quality criteria: adequacy, readability) set against the modified view of transcription as modelling and visualisation; related to a model theory view (symbolic model, analogue model), a database view (E/R model, view) and a text technology view (content, form; logical layer, application, document)]
Methodological considerations
Transcription as Modeling and Visualization of spoken
language
→ Accordance with text-technological concepts
→ One model, different visualizations
→ No tradeoff between readability and adequacy
→ No tradeoff between human and computer processability
→ No “Standardization” of models
– a common modelling framework, not a common model
– no ontological specifications
→ XML = Standardization of physical representation
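The idea of one model with different visualizations can be illustrated with a minimal sketch. It assumes a simplified model in which every event spans exactly one interval of a shared timeline; the function names `line_transcript` and `partitur` and the sample data are hypothetical and do not reproduce the output of the EXMARaLDA tools.

```python
# Hypothetical model: (speaker, start, end, text) over timeline points 0..3.
# Simplifying assumption: every event spans exactly one timeline interval.
events = [
    ("A", 0, 1, "so what"),
    ("A", 1, 2, "happened"),
    ("B", 1, 2, "well"),
    ("B", 2, 3, "nothing"),
]

def line_transcript(events):
    """Visualization 1: one line per event, ordered by start point."""
    ordered = sorted(events, key=lambda e: (e[1], e[0]))
    return [f"{spk}: {text}" for spk, _, _, text in ordered]

def partitur(events):
    """Visualization 2: one row per speaker; simultaneous events share a column."""
    n = max(end for _, _, end, _ in events)
    speakers = sorted({spk for spk, *_ in events})
    cells = {(spk, start): text for spk, start, _, text in events}
    # Each column is as wide as the longest text in that interval.
    widths = [max(len(cells.get((s, i), "")) for s in speakers) for i in range(n)]
    return [
        f"{s}: " + " ".join(
            cells.get((s, i), "").ljust(widths[i]) for i in range(n)
        ).rstrip()
        for s in speakers
    ]

print("\n".join(line_transcript(events)))
print("\n".join(partitur(events)))
```

Both functions read the same `events` list and neither changes it, so readability becomes a property of the chosen visualization rather than of the stored model.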
Visualization to Model
Structural relations:
1. Temporal sequence
2. Simultaneity
3. Equivalence (Entity ↔ Feature)
4. Hierarchy (Containment)
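The four structural relations can be made concrete with a small sketch. This is an illustration under stated assumptions, not the data model presented in the talk: the class `Event`, the functions `precedes`, `simultaneous` and `contains`, the `features` field and the sample utterances are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    speaker: str
    start: int            # index of a point on a shared, ordered timeline
    end: int
    text: str
    features: tuple = ()  # (name, value) pairs attached to the entity

def precedes(a, b):
    """Temporal sequence: a is over before b begins."""
    return a.end <= b.start

def simultaneous(a, b):
    """Simultaneity: the two intervals overlap."""
    return a.start < b.end and b.start < a.end

def contains(a, b):
    """Hierarchy: b lies inside a and belongs to the same speaker."""
    return (a != b and a.speaker == b.speaker
            and a.start <= b.start and b.end <= a.end)

utterance = Event("A", 0, 3, "so what happened then")
# Entity and feature: the word entity carries a part-of-speech feature.
word = Event("A", 1, 2, "what", (("pos", "PRON"),))
reply = Event("B", 2, 4, "well")
later = Event("B", 4, 5, "nothing")

print(precedes(utterance, later))      # sequence
print(simultaneous(utterance, reply))  # overlap between speakers
print(contains(utterance, word))       # word inside utterance
```

Expressing all relations over points of one shared timeline is the design choice that lets sequence and simultaneity be checked with simple comparisons.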
Modeling framework
• Relational? → Sequence? Simultaneity?
• OHCO? → Simultaneity?
• DAG: Annotation Graphs? → Complexity?
→ Transcription Graphs
System architecture
Application: Input tools
• EXMARaLDA Partitur-Editor
• Simple EXMARaLDA Text file
• TASX annotator
• PRAAT
• EUDICO Linguistic Annotator (ELAN)
Application: Visualization
... as a wrapped partitur
... as a line transcript
... in column notation
Application: Corpus management
EXMARaLDA Corpus Manager (COMA)
Application: Query/Analysis
Search and Query Instrument for EXMARaLDA (SQUIRREL)
Project status
• Software past beta stage
• Five projects at our own institution use EXMARaLDA
for their corpus work
• Around 800 users in research and teaching outside
SFB
• Used at the IDS in Mannheim
• Submitted a suggestion for integration of data model
into P5 of the TEI guidelines
Summary
Transcription as theory and „Verschriftlichung“ →
Computer-assisted transcription as modelling and
visualisation
Interdisciplinary bridge / Methodology of
computational techniques in „classical“ linguistics
→ Concrete practical improvements for work with
transcription data
EXMARaLDA and Database „Multilingualism“
Data model, formats and tools building on the
separation of model and visualisation
Fin.