TalkBank - Brian MacWhinney

From CHILDES to TalkBank
An International Database of
Communicative Interaction
1
TalkBank
• Brian MacWhinney
– Carnegie Mellon University, Psychology
– Child Language Data Exchange System CHILDES
• Steven Bird, Mark Liberman
– University of Pennsylvania, Linguistics
– Linguistic Data Consortium, LDC
• Howard Wactlar
– Carnegie Mellon University, Computer Science
– Informedia Project
2
Basic Premise of TalkBank
• Human Communication is a unified fact,
• but it is studied by 8 disciplines and up to
40 subdisciplines.
• Analysis is important, but so is synthesis.
• We can put the puzzle back together by
focusing all the disciplines on the data.
3
Some Examples
• “My Theory”
• Bettino Craxi
• Nixon’s Watergate Tapes
• MacWhinney’s Lectures
• Ross and Mark
• Graphics lesson
• Bilingual Classroom
4
My Theory: An Example
Special Issue of Discourse Processes edited by Tim
Koschmann with articles from
• Rogers Hall
• Jay Lemke
• Annemarie Palincsar
• Carl Frederiksen
• Commentary by
– Judith Green & Marleen McClelland
– Jeremy Roschelle
5
TalkBank Areas
• Classroom Discourse - CMU Dec 99
• Conversation Analysis - Odense Oct
• Text and Discourse - Santa Barbara July
• Child Language Disorders - Madison 2002
• Language and Gesture - CMU October
• Child Language Learning - Madison Aug 2002
• Animal Communication - Penn May 2000
6
More areas ...
• Field Linguistics - LSA Dec 99, Penn Dec 2000
• Aphasia
• Corpus Linguistics
• Signed Language
• Second Language Learning
• Anthropological Linguistics
• Cross-cultural studies
7
More areas ...
• Multilingualism, code-switching - LIDES
• Mother-infant interaction
• Psychiatry
• Conflict Resolution
• Management Styles
• Small-group Interaction - soon
• Human-computer Interaction
8
More areas ...
• Speech Technology - ongoing
• Virtual Reality
• Guided Robots, Social Robots
9
Why data-sharing is important
• Increasing the size and reliability of the
empirical basis
• Opening science to the community,
practitioners, and students
• Opening science to collaborative
commentary
• Creating transparency across disciplines
10
Key Features of TalkBank
• Multimodal digitized data
• Internet access
• Defense of confidentiality
• Codon: transcription, coding, viewing, and
analysis
• XML standard for underlying representation
• Alliance of databases from many fields
11
Why TalkBank can be built now
• The Internet
• Fast computers, big disks, cheap storage
• Good audio and video digitization
• Advances in web-based database design
• Emergence of annotation standards
• Maturation of the social sciences
12
CHILDES: A Prototype
• Brian MacWhinney - CMU
• Leonid Spektor - CMU
• Catherine Snow - Harvard
• 2000 Members
• 400 Active contributors
13
1850-1950 Darwin and Diaries
• Darwin, Stern, Ament
• Emotion, gesture, language, the soul
• Card files and shoe boxes
14
1950-1984 Tapes
• Nagras and TEAC, VHS and Beta
• Dittos, mimeo, notes in the margins
• Good “raw” data, unclear transcription
15
1984 - 1994 PCs
CHILDES Concord Massachusetts 1984
16
1994 - 2001 childes.psy.cmu.edu
17
2000 - ? TalkBank
18
Universals
• Are there basic patterns to babbling?
• Are early word orders universal?
• Does UG give children a universal set of
functional categories?
• Is the vocabulary spurt universal?
The answer requires LOTS of data
19
Particulars
• Do children have individual styles?
– Gestalt vs. Analytic
– Enactive (1S) vs. Depictive (3S)
• Do children respond differentially to
parental recasts?
• Do children vary in their match to cue
validity?
Again, we need LOTS of data
20
Comparisons
• How should we match SLI children to
normal controls -- MLU? Morphology? TTR?
• How should we compare language
socialization processes across social
classes? Between cultures?
• How should we compare the course of
development across languages? The case of
Romance.
21
Three Components
• CHAT -- Transcription System
• CLAN -- Programs
• Database
22
CHAT Format
@Begin
@Participants: CHI Target_Child Sid, MOT Mother
*MOT: you want them to go in there?
*CHI: yeah. [+ Q]
*CHI: yeah. [+ SR]
*MOT: okay.
*CHI: okay. [+ I]
*CHI: look at this.
%act: CHI picks up piece of paper
@End
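
A minimal sketch of how the three line types in the sample above could be read programmatically. It assumes that "@" starts a header, "*" starts a speaker's main line, "%" starts a dependent tier attached to the preceding utterance, and bracketed "[+ ...]" items are codes trailing the utterance. The names parse_chat and Utterance are hypothetical and are not part of CLAN; real CHAT files follow further conventions (continuation lines, for example) that this sketch ignores.

```python
import re
from dataclasses import dataclass, field

# The sample transcript from the slide, reproduced verbatim.
SAMPLE = """\
@Begin
@Participants: CHI Target_Child Sid, MOT Mother
*MOT: you want them to go in there?
*CHI: yeah. [+ Q]
*CHI: yeah. [+ SR]
*MOT: okay.
*CHI: okay. [+ I]
*CHI: look at this.
%act: CHI picks up piece of paper
@End
"""

@dataclass
class Utterance:  # hypothetical container, not a CLAN type
    speaker: str
    text: str
    postcodes: list = field(default_factory=list)
    tiers: dict = field(default_factory=dict)

def parse_chat(lines):
    """Split CHAT-style lines into headers and utterances (simplified)."""
    headers, utterances = {}, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("@"):
            # Header line: "@Name: value" or a bare "@Name".
            name, _, value = line[1:].partition(":")
            headers[name.strip()] = value.strip()
        elif line.startswith("*"):
            # Main line: "*SPK: utterance [+ CODE]".
            speaker, _, text = line[1:].partition(":")
            codes = re.findall(r"\[\+ ([^\]]+)\]", text)
            text = re.sub(r"\s*\[\+ [^\]]+\]", "", text).strip()
            utterances.append(Utterance(speaker.strip(), text, codes))
        elif line.startswith("%") and utterances:
            # Dependent tier: attach to the most recent utterance.
            tier, _, value = line[1:].partition(":")
            utterances[-1].tiers[tier.strip()] = value.strip()
    return headers, utterances

headers, utterances = parse_chat(SAMPLE.splitlines())
for u in utterances:
    print(u.speaker, repr(u.text), u.postcodes, u.tiers)
```

Keeping the dependent tier on the utterance it follows mirrors the layout of the transcript, where "%act" annotates the line directly above it.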
23
CLAN Programs
24
String Search
• Freq
• KWAL
• Combo
• Gem
• GemFreq, GemList
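
The slide names the search programs without describing them. As a rough illustration only, the sketch below assumes the usual reading of the first two: Freq as a word frequency count and KWAL as a keyword-and-line search. The functions freq, kwal, and type_token_ratio are hypothetical Python stand-ins, not the CLAN implementations, and the tokenizer is a deliberate simplification.

```python
import re
from collections import Counter

# (speaker, utterance) pairs taken from the CHAT sample in this deck.
UTTERANCES = [
    ("MOT", "you want them to go in there?"),
    ("CHI", "yeah."),
    ("CHI", "yeah."),
    ("MOT", "okay."),
    ("CHI", "okay."),
    ("CHI", "look at this."),
]

def words(text):
    """Lowercase word tokens; punctuation is dropped (simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def freq(utterances, speaker=None):
    """Count word tokens, optionally restricted to one speaker."""
    counts = Counter()
    for who, text in utterances:
        if speaker is None or who == speaker:
            counts.update(words(text))
    return counts

def kwal(utterances, keyword):
    """Return (line number, speaker, text) for utterances containing keyword."""
    key = keyword.lower()
    return [
        (i, who, text)
        for i, (who, text) in enumerate(utterances, start=1)
        if key in words(text)
    ]

def type_token_ratio(counts):
    """Distinct word types divided by total tokens."""
    total = sum(counts.values())
    return len(counts) / total if total else 0.0

chi = freq(UTTERANCES, "CHI")
print(chi.most_common())
print(kwal(UTTERANCES, "okay"))
print(round(type_token_ratio(chi), 3))
```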
25
Indexes
• MLU
• MLT
• WdLen, MaxWd
• VOCD
• DSS
• IPSyn (in progress)
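
To make the first two indexes concrete, here is a small sketch that reads MLU as mean length of utterance and MLT as mean length of turn. It makes two loud simplifications: utterance length is counted in words rather than morphemes, and a turn is taken to be a run of consecutive utterances by one speaker. The function names are hypothetical, and the counting rules of the actual CLAN programs may differ.

```python
from itertools import groupby

# (speaker, utterance) pairs taken from the CHAT sample in this deck.
UTTERANCES = [
    ("MOT", "you want them to go in there?"),
    ("CHI", "yeah."),
    ("CHI", "yeah."),
    ("MOT", "okay."),
    ("CHI", "okay."),
    ("CHI", "look at this."),
]

def mlu_words(utterances, speaker):
    """Mean utterance length in words (assumption: words, not morphemes)."""
    lengths = [
        len(text.rstrip(".?!").split())
        for who, text in utterances
        if who == speaker
    ]
    return sum(lengths) / len(lengths) if lengths else 0.0

def utterances_per_turn(utterances, speaker):
    """Mean number of utterances per turn for one speaker.

    A turn is assumed to be a maximal run of consecutive utterances
    by the same speaker.
    """
    turns = [
        len(list(run))
        for who, run in groupby(utterances, key=lambda u: u[0])
        if who == speaker
    ]
    return sum(turns) / len(turns) if turns else 0.0

for spk in ("CHI", "MOT"):
    print(spk, mlu_words(UTTERANCES, spk), utterances_per_turn(UTTERANCES, spk))
```

On this six-line sample the child averages 1.5 words per utterance and 2 utterances per turn, while the mother averages 4 words per utterance and 1 utterance per turn.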
26
Profiles
• Chains
• Cooccur
• Dist
• CHIP
• KeyMap
• TimeDur
27
Phonology
• MakeMod
• ModRep
• PhonFreq
• UniCode
• Inventory (in progress, LIPP, CompProf)
• Process Analysis (in progress)
28
Utilities
• Dates
• Rely
• Lines
• SaltIn
• Check
29
The Database
• English - 25 corpora
• Non-English - 18 languages
• Clinical - 14 corpora, aphasia, SLI, Down,
autism, Williams, and other groups
• Narrative - Frog stories, Red Balloon
• Childhood Bilingualism
• Adult Second Language Learning
30
Morphology
• MOR
• Post, PostTrain -- Christophe Parisse
• Parse -- Kenji Sagae
• --> revised DSS, LARSP, IPSyn
• MinMor for 14 languages
• MaxMor for English, Spanish, Italian,
Hungarian, Dutch, German
31
New Technologies
• Sonic CHAT
• Bullets
• QuickTime Movies
• Sound editor by wave
• Movie editor by dragging
• Fast mode editing
• Web streaming of audio and video
32
Sample Topics
• Past tense debate
• Functional categories, tenseless verbs
• Verb frame generalization
• Fine-tuning of the input
• Theory of mind
• Lexical range and communicative context
• MLU and vocabulary growth in disorders
33
Research based on CHILDES
• Over 1200 published studies
• Syntax
• Morphology
• Discourse
• Lexicon
• Narrative, Literacy
• Language Impairments
• Phonology
34
Allied Efforts
• JCHAT, Chinese, Korean
• Dutch, Nordic, Celtic
• Romance (Italian, Spanish, Portuguese)
• Slavic (Krakow, Vienna)
• Bilingualism -- Catalan, Basque
• Frogs, Disorders, Code-switching
• Classroom discourse
35
36
CHILDES/BIB On-Line
37
Format Babel
Alembic
Annotator
Archivage
CA
CHAT
COCOSDA
CSAE
CSLU
DAISY
DAMSL
Delta
DRI
EAGLES
Emu
Festival
FSA’s
GATE
HIAT
Hyperlex
Intex
ISIP
LDC
MATE
MICASE
MPEG
MPI
Multitext
Observer
Partitur
Praat
SABLE
SAMPA
SGREP
SignStream
SIL
SLAM
SMDL
SNACK
StandOff
SUSANN
TalkBank
TEI
Tipster
Transcriber
TreeBank
TSNLP
Unicode
UTF
38
Video Tools
Media Tagger, CLAN,
Digital Lava, Informedia ….
39
The Script
40
syncWRITER
41
SignStream
42
43
Audio on the Web
44
Anthropology on the Web
Chagnon’s Yanomamö
45
Touch and Click for Audio
46
Pawnee Lexicon
47
Lexicon -> Cultural Encyclopedia
48
Cornell Bioacoustics Laboratory
49
Confidentiality Levels
1 - fully public
2 - copying block
3 - transcripts public, audio/video protected
4 - non-disclosure
5 - non-disclosure, no copying
6 - data-viewing with approval
7 - data-viewing under direct supervision
8 - archived only
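
One way the eight levels above might be represented in software is as an ordered enumeration. The sketch below is illustrative only: the member names are paraphrases of the slide, and the helper functions rest on two assumptions that the slide does not state, namely that a higher number always means tighter restriction and that media can be freely copied only at level 1.

```python
from enum import IntEnum

class Confidentiality(IntEnum):
    """The eight levels from the slide; member names are paraphrases."""
    FULLY_PUBLIC = 1
    COPYING_BLOCK = 2
    TRANSCRIPTS_PUBLIC_MEDIA_PROTECTED = 3
    NON_DISCLOSURE = 4
    NON_DISCLOSURE_NO_COPYING = 5
    VIEWING_WITH_APPROVAL = 6
    VIEWING_UNDER_SUPERVISION = 7
    ARCHIVED_ONLY = 8

def most_restrictive(levels):
    """Pick the tightest level, e.g. for a corpus that mixes sessions.

    Assumption: a higher number is always more restrictive.
    """
    return max(levels)

def media_freely_copyable(level):
    """Assumption: only fully public data may be copied without limits."""
    return level == Confidentiality.FULLY_PUBLIC

sessions = [
    Confidentiality.FULLY_PUBLIC,
    Confidentiality.TRANSCRIPTS_PUBLIC_MEDIA_PROTECTED,
    Confidentiality.COPYING_BLOCK,
]
corpus_level = most_restrictive(sessions)
print(corpus_level.name, media_freely_copyable(corpus_level))
```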
50
Conclusions
• Child Language has guided other fields, but
now we need to link to these other fields.
• CLAN must give way to more international
tools and distributed databases.
• Number counting will give way to reality-linked number counting.
• Lab-based research will have to open up to
collaborative annotation.
51