Indian Language Initiatives at LDC
Denise DiPersio
[email protected]
Overview
Introduction to LDC
Tamil Projects/Resources
Indian Language Projects/Resources
Tamil Internet Conference 2011 Philadelphia, PA 17 June 2011
LDC: Origin and Model
Linguistic Data Consortium established in 1992
Via open, competitive government solicitation, won by U. Penn
Initial 5-year funding followed by self-sufficiency through
membership fees, data licenses
Power of the collective
Language resource distributor/archive
Centralized distribution, archiving, licensing
Resources from donations, funded projects, community
initiatives, LDC initiatives
Membership
Members support the consortium through fees, data, services
Ongoing rights to data published in membership years
Reduced fees on older corpora, extra copies
LDC: Roles
Data collection
Language resource (LR) production, including quality control
LR distribution and archiving
Intellectual property rights management and license management
Human subjects protocol management
Annotation, lexicon building
Creation of tools, specifications, best practices
Knowledge transfer: documentation, metadata, consulting, training
Corpus creation research and academic publication
Resource coordination in large multisite programs
Serving multiple research communities
Funding panelists, workshop participants, oversight committee members
LDC: Data Collection
News text
Web text (newsgroups, blogs, chatrooms, Twitter)
Biomedical texts and abstracts
Printed, handwritten and hybrid documents
Broadcast programming (news, conversation)
Conversational telephone speech
Lectures, meetings, interviews
Read and prompted speech
Role play
Video (broadcast, web)
Animal vocalizations
LDC: Annotation
Data scouting, selection, triage
Audio-audio alignment: bandwidth, signal quality, language, dialect,
program, speaker
Quick and careful transcription, aligned at turn, sentence, word level
Phonetic, dialect, sociolinguistic feature, supralexical
Tokenization, tagging of morphology, part-of-speech, gloss
Syntactic, semantic, discourse functions, disfluency, sense
disambiguation
Identification/classification of entities, relations, events and
coreference
Translation, alignment of translated text
Identification/classification of entities/events in video
Document zoning
LDC: Distribution
Since 1992, LDC has distributed
Nearly 75,000 copies of 1300 titles to more than 3000
organizations in over 65 countries
Approximately 8000 scholars and research groups receive LDC’s
monthly newsletter
Non-exclusive distribution of donated data
LDC research communities span human language
technologies, computer science, social sciences
Uniform licensing within and across research communities
Stable infrastructure
LRs permanently accessible, ongoing access to data
Standardized, simple terms of use and distribution methods
LDC: Data Scholarships
Formalizes LDC’s long practice of $0 distribution of data
to students without the means to otherwise license it
Competitive process
Student submits application that contains:
Data set requested, proposed need and use of data
Description of research agenda
Demonstration of high probability of success for work
Letter of support from department chair/advisor including statement of
financial need
Two cycles completed; next will be Fall 2011
16 recipients
Argentina, China, India, Indonesia, Mexico, UK, USA
~USD40,000 in data awarded
Tamil Projects:
REFLEX/LCTL 1/3
REFLEX-LCTL (Less Commonly Taught Languages)
Goal: to create human language technologies for the target
languages, especially machine translation, information extraction
Language selection criteria
Large population of native speakers
Relatively few language resources (electronic text), with intentional
variation in the difficulty of LR creation
Linguistic and geographic diversity
Include some related languages
Make use of existing collaborations
Thirteen languages: Amazigh (Berber), Bengali, Hungarian,
Kurdish, Pashto, Panjabi, Tamil, Tagalog, Thai, Tigrinya, Urdu,
Uzbek, Yoruba
Bengali, Panjabi, Urdu – related languages
Tamil Projects:
REFLEX/LCTL 2/3
LDC created language packs for each language consisting of
a monolingual news text corpus (500k words)
a parallel text corpus (250k words)
a lexicon (10k entries)
a grammatical sketch
an encoding converter
a sentence segmenter
a tokenizer
a name transliterator
a part of speech tagger and tagged text
a named entity tagger and tagged text
a morphological analyzer and tagged text
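To make two of the listed components concrete, here is a minimal Python sketch of a sentence segmenter and a tokenizer for Unicode text. It is not the code shipped in the LDC language packs. The function names `segment_sentences` and `tokenize` and the punctuation-based rules are assumptions chosen for illustration.

```python
import re
import unicodedata

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
# Real segmenters also need abbreviation and numeral handling.
_SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


def segment_sentences(text):
    """Split text into sentences at whitespace that follows '.', '!' or '?'."""
    return [s for s in _SENTENCE_BREAK.split(text.strip()) if s]


def tokenize(sentence):
    """Split on whitespace and emit each punctuation character as its own token.

    Combining marks (Unicode categories Mn/Mc, e.g. Tamil vowel signs) are not
    punctuation, so they stay attached to the preceding letter.
    """
    tokens, current = [], []
    for ch in sentence:
        is_punct = unicodedata.category(ch).startswith("P")
        if ch.isspace() or is_punct:
            # A space or punctuation mark closes the token being built.
            if current:
                tokens.append("".join(current))
                current = []
            if is_punct:
                tokens.append(ch)
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


text = "தமிழ் ஒரு மொழி. It has a long literary history!"
for sentence in segment_sentences(text):
    print(tokenize(sentence))
```

Classifying characters by Unicode category, rather than relying on a regex word class, keeps Tamil vowel signs and the pulli attached to their base consonants.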
Tamil Projects:
REFLEX/LCTL 3/3
Resources identified through individual scouting,
“Harvest Festivals”, native speakers
Tamil Language Pack
Text sources included websites (for monolingual and parallel
text)
Collaboration with Harold Schiffman, Vasu Renganathan
• Tamil lexicon – An English Dictionary of the Tamil Verb
• Consulted on encoding conversion
Project sponsor has not yet released pack for publication;
potential use in ongoing technology evaluations
Will be published in LDC catalog when cleared for distribution
Tamil Projects:
Language Resource Wiki
Language Resource (LR) Wiki designed to be
Publicly accessible, world-readable
Portal of found resources “harvested” in REFLEX-LCTL project
Editable by authenticated users outside LDC
Pages for seven languages, including Tamil
http://lrwiki.ldc.upenn.edu/mediawiki/index.php/Tamil/Tamil
Bengali, Berber, Panjabi, Pashto, Tagalog, Tamil, Urdu
Breton, Ewe pages in progress
Language summary, linguistic resources, encoding and fonts,
data sources, portals, tools and other natural language
processing resources
Tamil Projects:
CALLFRIEND
CALLFRIEND project supported the development of language
identification technology
LDC recruited native speakers in the target languages to make
telephone calls to other native speakers
Calls were unscripted and lasted between 5 and 30 minutes
Target languages: American English, Canadian French, Egyptian
Arabic, Farsi, German, Hindi, Japanese, Korean, dialectal Mandarin
Chinese, Spanish (Caribbean, non-Caribbean), Tamil, Vietnamese
CALLFRIEND Tamil LDC96S59
60 telephone conversations
Demographic data: sex, age, education
Call information: channel quality, number of speakers
Calls originated inside the continental United States and Canada
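The per-call metadata described above could be held in a record like the one sketched below. This is an illustration only: the class name `CallRecord`, the field names, the field types and the sample values are all hypothetical and do not reflect the actual layout of the CALLFRIEND documentation files.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    call_id: str                  # hypothetical identifier
    language: str
    duration_minutes: float
    num_speakers: int
    channel_quality: str          # hypothetical free-text label
    caller_sex: str
    caller_age: int
    caller_education_years: int   # assumed unit; the slide only says "education"

    def duration_in_range(self):
        """True if the call length falls in the 5 to 30 minute window."""
        return 5 <= self.duration_minutes <= 30


# Invented sample values, for illustration only.
example = CallRecord(
    call_id="example-001",
    language="Tamil",
    duration_minutes=17.5,
    num_speakers=2,
    channel_quality="good",
    caller_sex="F",
    caller_age=34,
    caller_education_years=16,
)
print(example.duration_in_range())
```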
Tamil Resources
An English Dictionary of the Tamil Verb Second Edition
LDC2009L01
Harold Schiffman, Vasu Renganathan (U Penn, Department of
South Asia Studies)
Translations for 6597 English verbs and definitions for 9716
Tamil verbs
Associated sound files for pronunciation; example sentences
Windows search and browse application
Complimentary copy in conference packet
Indian Language
Projects/Resources: Hindi
Hindi Surprise Language Exercise (2003)
Goal: to assemble found resources under timed conditions
LDC collected newswire, web data, some parallel text
Not all resources can be released due to intellectual property and
license restrictions
Further work needed for public release
Hindi WordNet LDC2008L02
Joint distribution with IIT Bombay
First WordNet for an Indian language
CALLFRIEND Hindi LDC96S52
Indian Language Resources: POS
Tagsets
Indian Language Part of Speech Tagsets (IL-POST)
Developed by Microsoft Research India; Anna University, Chennai;
Delhi University; IIT Bombay; Jawaharlal Nehru University, Delhi;
Tamil University, Tamilnadu
Goal: to provide a common tagset framework for Indian languages
that offers flexibility, cross-linguistic compatibility and reusability
across languages
LDC currently distributes three IL-POST sets at no cost: Bengali,
Hindi, Sanskrit
IL-POST Bengali LDC2010T16 – 103k words from web text, EMILLE
corpus (parallel newswire)
IL-POST Hindi LDC2010T24 – 98k words from web text
IL-POST Sanskrit LDC2011T04 – 57k words from Panchatantra stories
More languages planned, Tamil among them
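The catalog numbers cited in this talk (LDC96S59, LDC2009L01, LDC2010T16 and so on) share a visible pattern: a year, a letter, and a sequence number. The Python sketch below splits such an identifier into those parts. Reading the letter as a resource type (S for speech, T for text, L for lexicon) is an assumption inferred only from the entries listed here, and the helper name `parse_catalog_id` is hypothetical.

```python
import re
from typing import NamedTuple

# Assumed meaning of the type letter, inferred from the entries in this talk.
RESOURCE_TYPES = {"S": "speech", "T": "text", "L": "lexicon"}

# Year is either four digits (LDC2009L01) or two digits (LDC96S59).
_CATALOG_ID = re.compile(r"^LDC(\d{4}|\d{2})([A-Z])(\d+)$")


class CatalogId(NamedTuple):
    year: int
    kind: str
    number: int


def parse_catalog_id(catalog_id):
    """Parse an identifier such as 'LDC2009L01' into (year, kind, number)."""
    match = _CATALOG_ID.match(catalog_id.strip())
    if match is None:
        raise ValueError(f"not a recognized catalog ID: {catalog_id!r}")
    year_text, letter, number = match.groups()
    # Assumption: two-digit years fall in the 1990s, since LDC dates from 1992.
    year = int(year_text) if len(year_text) == 4 else 1900 + int(year_text)
    return CatalogId(year, RESOURCE_TYPES.get(letter, "unknown"), int(number))


for cid in ["LDC96S59", "LDC2009L01", "LDC2010T16"]:
    print(cid, parse_catalog_id(cid))
```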
LDC: Need to Know
LDC website, http://www.ldc.upenn.edu/
The LDC Corpus Catalog,
http://www.ldc.upenn.edu/Catalog/
Submitting Corpora and Other Resources to LDC,
http://www.ldc.upenn.edu/Providing/
LDC Online, https://online.ldc.upenn.edu/login.html
Member Resources,
http://www.ldc.upenn.edu/Membership/
Questions?
Thank you!