Phon - CHILDES - Carnegie Mellon University


Towards a solution for the sharing of phonological data
Yvan Rose
Memorial University of Newfoundland
Brian MacWhinney
Carnegie Mellon University
Map of presentation



Context: no specialized tool to facilitate research in phonological development
A preliminary attempt: ChildPhon
A more promising solution: Phon
    Current state of the Phon project
    Developments in foreseeable future
Potential
    Publicly available cross-linguistic database
Proposal
Context (until recently)

CHILDES tools (focus on CLAN)
    A number of tools for multimedia data storage and analysis
        Mostly deal with morphological and syntactic aspects of development
        Not easily extensible
What about phonology?
    No CHILDES tool adapted for phonology
    Data sharing and broad-based investigations are challenging


A first attempt

ChildPhon (Rose 2003)



Analytical (relational) database for child language data
Designed within FileMaker Pro
Main features
    Interface for double-blind transcriptions
    Automatic functions based on phonetic transcriptions:
        Syllabification of transcribed forms
        Detection of common processes observed in child language (e.g. onset cluster reduction)

Problems with ChildPhon

No support for Unicode fonts
    → No cross-platform compatibility (Macintosh-only)

Not compatible with CHILDES / TalkBank
    → No data exchange functions

Automatic parses limited, not customizable

Multimedia capabilities are minimal (at best)

Requires use of proprietary software and font

Algorithms are ‘destructive’

Statistical functions are minimal
No web implementation

In sum: Good idea -- Bad implementation

Phon: a more promising solution

Interdisciplinary project
(First of its kind between Linguistics and Computer Science at Memorial University of Newfoundland)
    Software designers and programmers:
        Rodrigue Byrne, Gregory Hedlund, Philip O'Brien, Yvan Rose, Harold Wareham
    Financial support:
        Faculty of Arts, Memorial University
        Social Sciences and Humanities Research Council of Canada (SSHRC)
        Canada Foundation for Innovation (CFI)
        National Science Foundation (NSF)
Phon: Overview

Software underpinnings:
    Programmed in Java, Unicode font encoding
    XML data storage structure
    Cross-platform compatible (Mac, Windows, …)
    Compatible with TalkBank schema
User management system
Extended multimedia capabilities
More flexible automatic algorithms
Specialized query language
Offers a complete solution for data sharing
Phon: usability




Intuitive graphical user interface
    Helpful wizards (e.g. project creation, queries)
    Record navigator
    Custom selection of data fields
        General / record-by-record
Intuitive query language
    Standard terminology
    Built-in queries (modifiable by user)
    Query memorization and saving
Phon: main functions




User management
Media segmentation
Phonetic transcription
Transcription merging (selection of ‘final’ transcriptions for analysis)
Phrase segmentation and alignment (further segmentation according to research needs)
Syllable alignment (alignment of syllables of target and actual forms)
Database query
User management


Secure login
User tasks / privileges management
Media segmentation

Generally similar to CLAN
    Hit the space bar to define a speech segment
Default segment length user-defined
    Useful for working on small speech segments
Segment editing:
    Change numerical value
    ‘Stretch’ the time segment by sliding pointer
Yvan Rose: Replace yellow line in segment “timebar” by waveform.
[Screenshot buttons: Play; Export sound clip]
Transcription: general interface
[Screenshot labels: Media window; Session info (drawer); Media controls; Transcription window]
Transcription

Built-in IPA character map
    Symbol ‘categories’
Access to sound segment
Interface for double-blind transcriptions
    Tied with user management functions
Yvan Rose: Link adult transcription to an electronic IPA dictionary.
Need to develop a transcription system for sounds that can’t be transcribed easily.
• Ability to assign a feature set to a dummy character
Transcription merging

Comparison of ‘competing’ transcriptions
    Direct access to media segment
Selection of most accurate transcription
Further refinement of selected transcription
Yvan Rose: People requested an algorithm that would enable a comparison of transcriptions based on specific parameters (e.g. voicing). This algorithm could build on the feature sets associated with each segment transcribed.
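The comparison requested in the note above could be sketched roughly as follows. This is only a minimal Python illustration, not Phon's implementation: the tiny `FEATURES` table, the function name `disagreements`, and the restriction to transcriptions of equal length are all assumptions made for the example.

```python
# Hypothetical, deliberately tiny feature table; a real system would cover the full IPA.
FEATURES = {
    "p": {"voiced": False, "place": "labial"},
    "b": {"voiced": True, "place": "labial"},
    "t": {"voiced": False, "place": "coronal"},
    "d": {"voiced": True, "place": "coronal"},
    "k": {"voiced": False, "place": "dorsal"},
    "g": {"voiced": True, "place": "dorsal"},
}


def disagreements(transcription_a, transcription_b, parameter):
    """Return positions where two equal-length transcriptions differ on one feature.

    Segments missing from FEATURES (vowels, in this sketch) are compared as whole symbols.
    """
    if len(transcription_a) != len(transcription_b):
        raise ValueError("this sketch only compares transcriptions of equal length")
    differing = []
    for index, (seg_a, seg_b) in enumerate(zip(transcription_a, transcription_b)):
        if seg_a in FEATURES and seg_b in FEATURES:
            same = FEATURES[seg_a][parameter] == FEATURES[seg_b][parameter]
        else:
            same = seg_a == seg_b
        if not same:
            differing.append(index)
    return differing


# Two transcribers disagree only on the voicing of the first consonant.
print(disagreements(list("dag"), list("tag"), "voiced"))  # [0]
print(disagreements(list("dag"), list("tag"), "place"))   # []
```

Restricting the comparison to one parameter at a time lets a reviewer see, for instance, that two transcribers agree on place of articulation even where they disagree on voicing.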
Phrase alignment

Further segmentation of the utterances
    Useful for research on phonological domains
    A simple mouse click sets and resets the domain boundaries
Yvan Rose: Several people requested different levels of segmentation. This includes morpho-syntactic levels of segmentation, as well as various levels of the prosodic hierarchy.
Syllabification algorithm

[Figure: syllable tree for ‘constraints’, with node labels O (onset), R (rhyme), and N (nucleus) above the transcribed segments]
Refined labeling of each syllabic position
Each label is a valid object for query
Syllabification algorithm

Parameters of syllabification are user-definable
Timing tier
Syllable constituents
Yvan Rose: The parameters will be revised thoroughly.
To add (among others): word-final codas, list of exceptional clusters.
Also add, to complement stress attraction, an option of ambisyllabic syllabification of intervocalic
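To make the idea of user-definable syllabification parameters concrete, here is a minimal Python sketch. It is not Phon's algorithm: it assumes a simple onset-maximization strategy, and the function name `syllabify`, the `vowels` set, and the `legal_onsets` set are hypothetical parameters chosen for illustration. The parse it produces for ‘constraints’ may differ from the one shown on the slide.

```python
def syllabify(segments, vowels, legal_onsets):
    """Label each segment as onset (O), nucleus (N), or coda (C).

    Every vowel is a nucleus. Consonants between two nuclei are split so that
    the following syllable gets the longest cluster found in legal_onsets;
    the remainder becomes the coda of the preceding syllable.
    """
    nuclei = [i for i, seg in enumerate(segments) if seg in vowels]
    labels = ["C"] * len(segments)  # default: coda; overwritten below
    for i in nuclei:
        labels[i] = "N"
    if not nuclei:
        return list(zip(segments, labels))
    # Word-initial consonants are all treated as onset in this sketch.
    for i in range(nuclei[0]):
        labels[i] = "O"
    for left, right in zip(nuclei, nuclei[1:]):
        # Try the longest candidate onset first (onset maximization).
        for start in range(left + 1, right):
            if tuple(segments[start:right]) in legal_onsets:
                for i in range(start, right):
                    labels[i] = "O"
                break
    return list(zip(segments, labels))


# Hypothetical parameter settings; the diphthong is treated as a single segment.
segments = ["k", "ə", "n", "s", "t", "r", "eɪ", "n", "t", "s"]
vowels = {"ə", "eɪ"}
legal_onsets = {("k",), ("n",), ("s",), ("t",), ("r",), ("t", "r"), ("s", "t", "r")}
parse = syllabify(segments, vowels, legal_onsets)
print(" ".join(f"{seg}/{label}" for seg, label in parse))
```

Changing `legal_onsets` (for example, removing the three-consonant cluster) changes where the syllable boundary falls, which is the kind of adjustment a user-definable parameter set is meant to allow.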
Syllable alignment


Automatic alignment of syllables
Manual modifications
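Automatic alignment of target and actual syllables can be illustrated with a generic dynamic-programming sketch in Python. This is not the algorithm used in Phon (nor Kondrak's); the scoring scheme, which counts symbols shared by two syllables, and the function name `align` are assumptions made for the example.

```python
def align(target, actual, gap="-"):
    """Globally align two syllable lists with dynamic programming.

    Scoring (an assumption of this sketch): +1 per symbol shared by two
    syllables, 0 for pairing a syllable with a gap.
    """
    def score(a, b):
        return len(set(a) & set(b))

    rows, cols = len(target) + 1, len(actual) + 1
    best = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            best[i][j] = max(
                best[i - 1][j - 1] + score(target[i - 1], actual[j - 1]),
                best[i - 1][j],  # target syllable has no counterpart
                best[i][j - 1],  # actual syllable has no counterpart
            )
    # Trace back from the bottom-right corner to recover the pairs.
    pairs, i, j = [], len(target), len(actual)
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                best[i][j] == best[i - 1][j - 1] + score(target[i - 1], actual[j - 1])):
            pairs.append((target[i - 1], actual[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and best[i][j] == best[i - 1][j]:
            pairs.append((target[i - 1], gap))
            i -= 1
        else:
            pairs.append((gap, actual[j - 1]))
            j -= 1
    return pairs[::-1]


# Hypothetical example: the initial unstressed syllable is truncated.
print(align(["bə", "næ", "nə"], ["næ", "nə"]))
```

An automatic proposal like this one would still need the manual modifications mentioned above, since a purely symbol-based score can pair unrelated syllables when the child's form diverges strongly from the target.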
Query language


Quick and accurate queries on large amounts of data
Language features
    Uses terms familiar to phonologists to compose queries
        Syllable constituents: onset, nucleus, …
        Stressed vs. unstressed syllables
    Custom predicates
    History of recent queries
    Ability to save queries
Query language components




Selectors (e.g. Onset(Syllable x))
Predicates (e.g. Branching(Onset(Syllable x)))
Boolean connectives
Example:
    let corpusName = "TestCorpus",
    let corpus = Corpus(corpusName),
    let records = Records(corpus)
    foreach r in records
        foreach p in Phrases(r)
            foreach s in Syllables(p)
                Branching(Onset(TargetSyllable(s)))
                AND NOT
                Branching(Onset(ActualSyllable(s)))
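For readers more familiar with general-purpose languages, the logic of the example query can be approximated in Python. This is a hedged sketch rather than a rendering of Phon's query engine: the `Syllable` class, the `branching` and `onset_reductions` functions, and the sample data are hypothetical, and the phrase level is flattened into a list of aligned syllable pairs per record.

```python
from dataclasses import dataclass, field


@dataclass
class Syllable:
    onset: list = field(default_factory=list)
    nucleus: list = field(default_factory=list)
    coda: list = field(default_factory=list)


def branching(constituent):
    """A constituent branches when it dominates more than one segment."""
    return len(constituent) > 1


def onset_reductions(records):
    """Yield (record id, syllable index) where a branching target onset
    corresponds to a non-branching actual onset."""
    for record_id, aligned_syllables in records.items():
        for index, (target, actual) in enumerate(aligned_syllables):
            if branching(target.onset) and not branching(actual.onset):
                yield record_id, index


# Hypothetical data: each record is a list of aligned (target, actual) syllable pairs.
records = {
    "r1": [
        (Syllable(["t"], ["ʌ"], ["n"]), Syllable(["t"], ["ʌ"], ["n"])),
        (Syllable(["d", "r"], ["ə"]), Syllable(["d"], ["ə"])),
    ],
    "r2": [(Syllable(["p", "l"], ["eɪ"]), Syllable(["p", "l"], ["eɪ"]))],
}
print(list(onset_reductions(records)))  # [('r1', 1)]
```

The selector corresponds to attribute access (`target.onset`), the predicate to `branching`, and the Boolean connectives to Python's `and not`.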
Query tree structure
Branching onset reduction in 2nd syllable
[Figure: query tree. A Record dominates a TargetPhrase and an ActualPhrase; each phrase is parsed into Syllable nodes with Onset, Rhyme, Nucleus, and Coda constituents. The first branching test evaluates to TRUE on the target form and the second to FALSE on the actual form, so the AND NOT combination yields a MATCH.]
branching( onset( pos( TargetPhrase , 2) ) )
AND NOT
branching( onset( pos( ActualPhrase , 2) ) )
Query results


View in application
Use to generate textual reports
    Recording session (e.g. to exemplify a given process)
    Time slice (e.g. to exemplify a stage of acquisition)
    Entire database (to exemplify a learning curve)
Export
    As Unicode file
    As ASCII file (modulo font conversion limitations)
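The difference between the two export formats can be illustrated with a short Python sketch. It does not describe Phon's export routine; the function name `export_text` and the sample report are hypothetical, and replacing unrepresentable characters with `?` is just one possible way of handling the conversion limitation.

```python
def export_text(lines, encoding):
    """Encode report lines for export.

    Characters the target encoding cannot represent are replaced with '?',
    which is one way an ASCII export can lose information.
    """
    return "\n".join(lines).encode(encoding, errors="replace")


# Hypothetical report content for illustration only.
report = ["target: kənˈstreɪnts", "actual: teɪt"]
unicode_bytes = export_text(report, "utf-8")
ascii_bytes = export_text(report, "ascii")
print(unicode_bytes.decode("utf-8"))
print(ascii_bytes.decode("ascii"))
```

The UTF-8 export round-trips the IPA symbols unchanged, whereas the ASCII export keeps only the characters that exist in ASCII.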
Enhancements (short term)

Improvement of syllable alignment algorithm (building on Kondrak’s 2003 algorithm)
Import function
    ChildPhon files (including font translator -- almost done!)
    CHAT files
Incorporation of user-defined fields
Incorporation of statistical functions
Chart report generator
    Ability to select various chart formats
        Bar graphs (for proportions within and across sessions)
        Line graphs (for learning curves)
Enhancements (longer term)

Interoperability with Praat
    Export to Praat (similar to CLAN function)
    Interface to accommodate acoustic measurement data
Web-based interface
    Data sharing at a distance
    Easy query of corpora on CHILDES database
Further automation
    Automatic detection of pre-identified processes
Yvan Rose: Include function to extract phonetic inventories per session/stage/…
Get examples of ‘canned’ analyses in literature on clinical phonology.
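The inventory-extraction function mentioned in the note above might look something like the following Python sketch. It is purely illustrative: the function name `inventories`, the input format (pairs of session label and transcribed segments), and the sample data are assumptions, not part of Phon.

```python
from collections import defaultdict


def inventories(records):
    """Map each session to the sorted set of segments attested in it.

    `records` is an iterable of (session, transcribed segments) pairs.
    """
    by_session = defaultdict(set)
    for session, segments in records:
        by_session[session].update(segments)
    return {session: sorted(found) for session, found in by_session.items()}


# Hypothetical data for illustration only.
records = [
    ("1;06", ["d", "a"]),
    ("1;06", ["m", "a", "m", "a"]),
    ("2;00", ["d", "ɔ", "g"]),
]
print(inventories(records))
```

Grouping by stage rather than by session would only require a different key in the input pairs.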
Development timeline

End of fall of 2004
    Completion of current development phase
    Release of testing (Beta) version
Winter of 2005
    Bug fixes
    Improvement of functionality and user interface (including short-term enhancements)
    Website creation (http://www.phon.ca/)
    Completion of technical documentation
        Notes to programmers
        User guide
Summer of 2005
    Release of Phon 1.0 as open-source freeware
Potential

Standard for data sharing
    Large-scale investigations
    Cross-linguistic investigations
Enhancement to CHILDES
    Elaboration of a database fulfilling the needs of acquisitionists focussing on phonology and related issues
    Investigation of interface issues (e.g. between morpho-syntax and phonology)
How to realize this potential

Team of researchers specializing in:
    Early acquisition (including babbling)
    Segmental development
    Prosodic development
    Phonological disorders
    Second language acquisition
    …
Feedback on software development project
Data contribution
    Existing corpora in digital format
    Conversion of printed corpora
        Identification of corpora (printed, with or without audio files)
        Setting of conventions for data conversion
Our proposal

Constitution of a research team to develop a phonological component of CHILDES
    Database
    Supporting software
Elaboration, with the research team, of a grant application to support:
    Database elaboration
    Software development
    Periodical meetings
    Workshops
    …
Concretely


Feedback on software project
    Software needs for various types of research
        Let us know what you need
    Implementation
        Let us know how you want it to work
Contribution to grant application
    Kinds of research the new database would enable
        Let us know what you would like to do
    Impacts of this research (e.g. theoretical, clinical, …)
    Supporting letters
Contribution to the public database
    Sharing of existing / future corpora
    Establishment of conventions to format older corpora
Special thanks

The ‘Phon’ team at Memorial:
    Rodrigue Byrne
    Harold Wareham
    Gregory Hedlund
    Philip O’Brien


For his great help with the TalkBank XML schema:
Franklin Chen (Carnegie Mellon University)

For their useful feedback on an early version of this
software:
Heather Goad (McGill), Paula Fikkert (Nijmegen), Clara Levelt (Leiden),
Katherine Demuth (Brown), Mark Johnson (Brown), Carrie Dyck
(Memorial), Phil Branigan (Memorial), Brian MacWhinney (Carnegie
Mellon), Bryan Gick (UBC), Sophie Wauquier-Gravelines (Nantes), Sharon
Inkelas (UC Berkeley), Conxita Lleó, Sonia Frota (Lisbon), Maria João
Freitas (Lisbon), Ronald Sprouse (UC Berkeley), Joe Pater (UMass,
Amherst), John Archibald (Calgary), Éliane Lebel (Memorial); hoping that
no one was forgotten…