Indian Language Initiatives at LDC
Denise DiPersio
Overview

Introduction to LDC

Tamil Projects/Resources

Indian Language Projects/Resources
Tamil Internet Conference 2011 Philadelphia, PA 17 June 2011
LDC: Origin and Model



Linguistic Data Consortium established in 1992

Via open, competitive government solicitation, won by U. Penn

Initial 5-year funding followed by self-sufficiency through
membership fees, data licenses

Power of the collective
Language resource distributor/archive

Centralized distribution, archiving, licensing

Resources from donations, funded projects, community
initiatives, LDC initiatives
Membership

Members support the consortium through fees, data, services

Ongoing rights to data published in membership years

Reduced fees on older corpora, extra copies
LDC: Roles

Data collection

Language resource (LR) production, including quality control

LR distribution and archiving

Intellectual property rights management and license management

Human subjects protocol management

Annotation, lexicon building

Creation of tools, specifications, best practices

Knowledge transfer: documentation, metadata, consulting, training

Corpus creation research and academic publication

Resource coordination in large multisite programs

Serving multiple research communities

Funding panelists, workshop participants, oversight committee members
LDC: Data Collection

News text

Web text (newsgroups, blogs, chatrooms, twitter)

Biomedical texts and abstracts

Printed, handwritten and hybrid documents

Broadcast programming (news, conversation)

Conversational telephone speech

Lectures, meetings, interviews

Read and prompted speech

Role play

Video (broadcast, web)

Animal vocalizations
LDC: Annotation

Data scouting, selection, triage

Audio-audio alignment: bandwidth, signal quality, language, dialect,
program, speaker

Quick and careful transcription, aligned at turn, sentence, word level

Phonetic, dialect, sociolinguistic feature, supralexical

Tokenization, tagging of morphology, part-of-speech, gloss

Syntactic, semantic, discourse functions, disfluency, sense
disambiguation

Identification/classification of entities, relations, events and
coreference

Translation, alignment of translated text

Identification/classification of entities/events in video

Document zoning
LDC: Distribution

Since 1992, LDC has distributed nearly 75,000 copies of 1300 titles
to more than 3000 organizations in over 65 countries
Approximately 8000 scholars and research groups receive LDC’s
monthly newsletter
Non-exclusive distribution of donated data
LDC research communities span human language
technologies, computer science, social sciences
Uniform licensing within and across research communities
Stable infrastructure

LRs permanently accessible, ongoing access to data

Standardized, simple terms of use and distribution methods
LDC: Data Scholarships


Formalizes LDC’s long-standing practice of distributing data at no
cost to students who lack the means to license it
Competitive process

Student submits application that contains:

Data set requested, proposed need and use of data

Description of research agenda

Demonstration of high probability of success for work


Letter of support from department chair/advisor including statement of
financial need
Two cycles completed; next will be Fall 2011

16 recipients

Argentina, China, India, Indonesia, Mexico, UK, USA

~USD40,000 in data awarded
Tamil Projects: REFLEX/LCTL 1/3

REFLEX-LCTL (Less Commonly Taught Languages)

Goal: to create human language technologies for the target
languages, especially machine translation, information extraction

Language selection criteria



Large population of native speakers
Relatively few existing language resources (electronic text), with
intentional variation in the difficulty of LR creation

Linguistic and geographic diversity

Include some related languages

Make use of existing collaborations
Thirteen languages: Amazigh (Berber), Bengali, Hungarian,
Kurdish, Pashto, Panjabi, Tamil, Tagalog, Thai, Tigrinya, Urdu,
Uzbek, Yoruba

Bengali, Panjabi, Urdu – related languages
Tamil Projects: REFLEX/LCTL 2/3

LDC created language packs for each language consisting of

a monolingual news text corpus (500k words)

a parallel text corpus (250k words)

a lexicon (10k entries)

a grammatical sketch

an encoding converter

a sentence segmenter

a tokenizer

a name transliterator

a part of speech tagger and tagged text

a named entity tagger and tagged text

a morphological analyzer and tagged text
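
As a rough illustration of two of the simpler components listed above, the sentence segmenter and the tokenizer, here is a minimal Python sketch. It is not the REFLEX-LCTL implementation, which these slides do not describe. The function names `segment_sentences` and `tokenize`, the punctuation-based splitting rule, and the reliance on the Unicode Tamil block (U+0B80 to U+0BFF) are all assumptions made for the example.

```python
import re
from typing import List

# Assumption: a sentence ends at '.', '?' or '!' followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.?!])\s+")

# A token is either a run of word characters (plus anything in the Tamil
# block, so that vowel signs and the pulli stay attached to their consonant)
# or a single punctuation character.
TOKEN = re.compile(r"[\w\u0B80-\u0BFF]+|[^\w\s]")


def segment_sentences(text: str) -> List[str]:
    """Split text into sentences at sentence-final punctuation."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def tokenize(sentence: str) -> List[str]:
    """Split one sentence into word and punctuation tokens."""
    return TOKEN.findall(sentence)


if __name__ == "__main__":
    sample = "தமிழ் ஒரு மொழி. இது பழமையானது!"
    for sentence in segment_sentences(sample):
        print(tokenize(sentence))
```

Real text needs more care than this: abbreviations, decimal numbers, and text in legacy (non-Unicode) encodings would all break these rules, which is where a component such as the encoding converter would come in before tokenization.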
Tamil Projects: REFLEX/LCTL 3/3
Resources identified through individual scouting,
“Harvest Festivals”, native speakers

Tamil Language Pack

Text sources included websites (for monolingual and parallel
text)

Collaboration with Harold Schiffman, Vasu Renganathan

Tamil lexicon – An English Dictionary of the Tamil Verb

Consulted on encoding conversion

Project sponsor has not yet released pack for publication;
potential use in ongoing technology evaluations

Will be published in LDC catalog when cleared for distribution
Tamil Projects: Language Resource Wiki


Language Resource (LR) Wiki designed to be

Publicly accessible, world-readable

Portal of found resources “harvested” in REFLEX-LCTL project

Editable by authenticated users outside LDC
Pages for seven languages, including Tamil

http://lrwiki.ldc.upenn.edu/mediawiki/index.php/Tamil/Tamil

Bengali, Berber, Panjabi, Pashto, Tagalog, Tamil, Urdu

Breton, Ewe pages in progress

Language summary, linguistic resources, encoding and fonts,
data sources, portals, tools and other natural language
processing resources
Tamil Projects: CALLFRIEND

CALLFRIEND project supported the development of language
identification technology

LDC recruited native speakers in the target languages to make
telephone calls to other native speakers

Calls were unscripted and lasted between 5 and 30 minutes

Target languages: American English, Canadian French, Egyptian
Arabic, Farsi, German, Hindi, Japanese, Korean, dialectal Mandarin
Chinese, Spanish (Caribbean, non-Caribbean), Tamil, Vietnamese

CALLFRIEND Tamil LDC96S59

60 telephone conversations

Demographic data: sex, age, education

Call information: channel quality, number of speakers

Calls originated inside the continental United States and Canada
Tamil Resources

An English Dictionary of the Tamil Verb Second Edition
LDC2009L01

Harold Schiffman, Vasu Renganathan (U Penn, Department of
South Asia Studies)

Translations for 6597 English verbs and definitions for 9716
Tamil verbs

Associated sound files for pronunciation; example sentences

Windows search and browse application

Complimentary copy in conference packet
Indian Language Projects/Resources: Hindi



Hindi Surprise Language Exercise (2003)

Goal: to assemble found resources under timed conditions

LDC collected newswire, web data, some parallel text

Not all resources can be released due to intellectual property and
licensing constraints

Further work needed for public release
Hindi WordNet LDC2008L02

Joint distribution with IIT Bombay

First WordNet for an Indian language
CALLFRIEND Hindi LDC96S52
Indian Language Resources: POS Tagsets

Indian Language Part of Speech Tagsets (IL-POST)

Developed by Microsoft Research India; Anna University, Chennai;
Delhi University; IIT Bombay; Jawaharlal Nehru University, Delhi;
Tamil University, Tamilnadu

Goal: to provide a common tagset framework for Indian languages
that offers flexibility, cross-linguistic compatibility and reusability
across languages

LDC currently distributes three IL-POST sets at no cost: Bengali,
Hindi, Sanskrit


IL-POST Bengali LDC2010T16 – 103k words from web text, EMILLE
corpus (parallel newswire)

IL-POST Hindi LDC2010T24 – 98k words from web text

IL-POST Sanskrit LDC2011T04 – 57k words from Panchatantra stories
More languages planned, Tamil among them
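
The catalog numbers cited on these slides (for example LDC96S59, LDC2009L01, LDC2010T16) appear to share a pattern: the prefix `LDC`, a two- or four-digit year, a type letter, and a sequence number. The Python sketch below parses that pattern. The pattern and the letter meanings (S for speech, T for text, L for lexicon) are inferred from the resources named here, not taken from an LDC specification, and `parse_catalog_id` is a hypothetical helper.

```python
import re
from typing import NamedTuple, Optional

# Assumption: type letters as inferred from the resources on these slides.
TYPE_NAMES = {"S": "speech", "T": "text", "L": "lexicon"}

# LDC + 2- or 4-digit year + one type letter + sequence number.
CATALOG_ID = re.compile(r"^LDC(\d{2}|\d{4})([A-Z])(\d+)$")


class CatalogId(NamedTuple):
    year: int
    kind: str
    number: int


def parse_catalog_id(text: str) -> Optional[CatalogId]:
    """Parse an identifier such as 'LDC2009L01'; return None if it does not fit."""
    match = CATALOG_ID.match(text.strip())
    if match is None:
        return None
    year_text, letter, number = match.groups()
    year = int(year_text)
    if len(year_text) == 2:
        # Assumption: two-digit years belong to the 1990s.
        year += 1900
    return CatalogId(year, TYPE_NAMES.get(letter, "unknown"), int(number))


if __name__ == "__main__":
    for catalog_id in ("LDC96S59", "LDC2009L01", "LDC2010T16"):
        print(catalog_id, parse_catalog_id(catalog_id))
```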
LDC: Need to Know

LDC website, http://www.ldc.upenn.edu/

The LDC Corpus Catalog,
http://www.ldc.upenn.edu/Catalog/

Submitting Corpora and Other Resources to LDC,
http://www.ldc.upenn.edu/Providing/

LDC Online, https://online.ldc.upenn.edu/login.html

Member Resources,
http://www.ldc.upenn.edu/Membership/

Questions?

Thank you!