Transcript Document

Christopher Cieri1, Khalid Choukri2, Nicoletta Calzolari3, D. Terence Langendoen4, Johannes
Leveling5, Martha Palmer6, Nancy Ide7, James Pustejovsky8
1. Linguistic Data Consortium (LDC) (ccieri @ ldc.upenn.edu)
2. European Language Resources Association (ELRA) (choukri @ elda.org)
3. Istituto di Linguistica Computazionale, Consiglio Nazionale delle Ricerche (glottolo @ ilc.cnr.it)
4. Department of Linguistics, University of Arizona (langendt @ email.arizona.edu)
5. Centre for Next Generation Localisation (CNGL), Dublin City University (johannes.leveling @computing.dcu.ie)
6. Center for Computational Language and Education Research, Department of Computer Science, University of Colorado, Boulder (Martha.Palmer @ Colorado.edu)
7. Department of Computer Science, Vassar College, USA (ide @ cs.vassar.edu)
8. Department of Computer Science, Brandeis University (jamesp @ cs.brandeis.edu)
Background
 LRs remain expensive to create, thus rare relative to demand
 accidental re-creation of an LR is a nearly unforgivable waste of scarce resources
 Despite
 existence of a few large data centers focused on HLT
 ELRA, LDC
 prior harmonization project
 Networking Data Centers
 union catalog initiative
 Open Language Archives Community (OLAC)
 HLT researchers must still
 master multiple metadata sets
 to search multiple locations
 in order to find needed resources
 or else risk failing to note the existence of critical LRs, and then either
 recreate them, or
 do without them
Recent Pre-History
 OLAC (Open Language Archives Community)
 LDC, ELRA early adopters
 LREC Universal Catalog, LDC participating
 FLaReNet (Fostering Language Resources Network)
 A major condition for the take-off of the field of Language Resources
and Language Technologies is the creation of a shared policy for the
next years.
 FLaReNet Meeting, Vienna
 SILT (Sustainable Interoperability for Language Technology)
 turn existing, fragmented technology and resources developed to
support language processing technology into accessible, stable, and
interoperable resources that can be readily reused across several
fields
 SILT-FLaReNet Meeting, "Towards An Operationalized
Definition of Interoperability for Language Technology", Brandeis
University, Waltham, Massachusetts, 1-2 November, 2009.
Current Landscape
 Major Data Centers maintain own separate catalogs
 different metadata languages (categories, terminologies)
 export subsets of their metadata categories to OLAC
 OLAC provides
 specifications for OAI (Open Archives Initiative) compliant metadata
 routines for harvesting, interchanging, searching
 ELRA UC
 focusing on resources intended for HLT R&D
 includes a greater percentage of ELRA metadata fields
 exploits data mining to discover resources not produced or distributed by ELRA
 LREC Map
 uses LREC submission process to increase the contribution of LR metadata
 NICT Shachi catalog
 union catalog of resources
 records are scraped
 uses data mining technologies to discover LR features missing from home catalog entries
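The OAI-compliant harvesting mentioned above can be illustrated with OAI-PMH, in which a harvester sends an HTTP request carrying a `verb` parameter and receives XML in return. The following is a minimal sketch, not any data center's actual routine: the base URL, the record identifier and the sample response are invented for illustration, and only the protocol namespace, the `ListRecords` verb and the `oai_dc` metadata prefix come from the OAI-PMH specification.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for a repository."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def extract_records(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(f"{{{OAI_NS}}}record"):
        identifier = record.findtext(f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier")
        title = record.findtext(f".//{{{DC_NS}}}title")
        records.append((identifier, title))
    return records


# Hand-written sample response (assumption: not taken from a real catalog).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:LR-0001</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example news text corpus</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("https://example.org/oai"))
    print(extract_records(SAMPLE_RESPONSE))
```

A real harvester would also need to follow resumption tokens and handle protocol errors, which this sketch leaves out.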
Current Landscape (continued)
 LDC LR Wiki
 identifies LRs (for less commonly taught languages)
 organized by language and LR type
 area experts edit individual sections
 some resources (plain and parallel text, lexicons) are identified and even harvested automatically
 free text description => normalization
 LDC LR Papers Catalog
 research papers that introduce, describe, discuss, extend or rely upon another LR
 currently focusing on papers dealing with LDC data
 full bibliographic information on the paper
 link to the unique identifier of the LR referenced
Short Term Recommendations
 harmonize LR catalogs of largest international data centers
 non-reductionist approach to harmonization
 do not identify a minimal subset that applies to all LR types
 focus on LRs targeted toward HLT R&D
 identify the superset of metadata types contained in them
 distinguish
 those that can be normalized internally across data centers
 from those that encode irreconcilable differences
 agree to normalize, harmonize practice wherever possible
 governance body specifically for this project
 project partners, sponsors, individual and small-group LR providers, and LR users
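The non-reductionist approach above can be sketched as follows: take the superset of metadata fields across catalogs, then separate the fields that map onto an agreed shared category from those that are kept as they are. This is only an illustration under stated assumptions. The catalog names, field names and mapping table are hypothetical and do not reflect the actual ELRA or LDC schemas.

```python
# Hypothetical field inventories; real catalog field names may differ.
CATALOG_FIELDS = {
    "center_a": {"title", "language", "media_type", "licence"},
    "center_b": {"name", "lang", "modality", "price_band"},
}

# Hypothetical agreed mappings: (catalog, local field) -> shared category.
NORMALIZATION = {
    ("center_a", "title"): "title",
    ("center_b", "name"): "title",
    ("center_a", "language"): "language",
    ("center_b", "lang"): "language",
}


def harmonize(catalog_fields, normalization):
    """Split the superset of fields into shared and unreconciled categories.

    No field is discarded: every (catalog, field) pair ends up either under
    a shared category or in the unreconciled set.
    """
    shared = {}           # shared category -> set of (catalog, field) pairs
    unreconciled = set()  # (catalog, field) pairs with no agreed mapping
    for catalog, fields in catalog_fields.items():
        for field in fields:
            key = (catalog, field)
            if key in normalization:
                shared.setdefault(normalization[key], set()).add(key)
            else:
                unreconciled.add(key)
    return shared, unreconciled


if __name__ == "__main__":
    shared, unreconciled = harmonize(CATALOG_FIELDS, NORMALIZATION)
    for category, sources in sorted(shared.items()):
        print(category, "<-", sorted(sources))
    print("kept as-is:", sorted(unreconciled))
```

Keeping the unreconciled fields, rather than dropping them, is what distinguishes this from a minimal-subset approach.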
Outcomes
 harmonized catalogs
 definition of metadata categories
 database structure
 search engine customized to HLT LR search
 controlled vocabulary fields
 relevance-based search of entire catalog records
 specification of best metadata practices
 centralized metadata repository with a harvesting protocol
 searcher assistance based upon
 relations among metadata categories (dictionary ≅ lexicon)
 prior search behavior
 those who searched for “Gigaword” also searched for “news text corpora”
 metadata creator assistance based on
 searcher behavior
 “93% of searchers include a language name in their search” but “87% of all providers include ISO 639-3 language codes”
 behavior of other metadata providers
 “the metadata you have provided so far also characterize 32 other resources”
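The two kinds of searcher assistance listed above, relations among metadata categories and prior search behavior, could be prototyped roughly as below. This is a sketch under assumptions: the equivalence table and the search sessions are invented, and simple co-occurrence counting stands in for whatever method a production search engine would use.

```python
from collections import Counter
from itertools import permutations

# Hypothetical equivalence classes among metadata categories.
EQUIVALENTS = [{"dictionary", "lexicon"}]


def expand_term(term, equivalents=EQUIVALENTS):
    """Return the term plus any categories declared equivalent to it."""
    expanded = {term}
    for group in equivalents:
        if term in group:
            expanded |= group
    return expanded


def co_search_counts(sessions):
    """For each query, count the other queries seen in the same session."""
    counts = {}
    for session in sessions:
        for a, b in permutations(set(session), 2):
            counts.setdefault(a, Counter())[b] += 1
    return counts


def also_searched(query, sessions, top_n=3):
    """Suggest the queries that most often co-occur with `query`."""
    counts = co_search_counts(sessions).get(query, Counter())
    return [q for q, _ in counts.most_common(top_n)]


# Invented search sessions, for illustration only.
SESSIONS = [
    ["Gigaword", "news text corpora"],
    ["Gigaword", "news text corpora", "treebank"],
    ["lexicon", "treebank"],
]

if __name__ == "__main__":
    print(expand_term("dictionary"))
    print(also_searched("Gigaword", SESSIONS, top_n=1))
```

The same counting idea could be turned toward metadata creators, for example by counting how many existing records share the values a provider has entered so far.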
Middle Term Recommendations
 expand UC scope to include raw data & research papers
 some work already begun, but not coordinated across data centers and LR creators
 LCTL LR wiki at LDC
 Rosetta Project
 harvest of papers describing LRs
 at LDC using human effort
 within Rexa project using data mining technologies
 integrate effective workflows: social networking, web sourcing, data mining
 enhance UC with links to raw resources including
 web sites rich in monolingual and parallel text
 lexicons built for interactive use
 new harmonization challenges
 adjust governance and broaden the scope of its normalization activities
 implement sustainable business models
Requirements
 Representatives of
 relevant data and metadata centers: ELRA/ELDA, LDC, NICT, and OLAC
 interested professional organizations: ACL, LSA, LinguistList, SIL, ISCA
 journals willing to implement a version of the LREC Map: LRE, LILT
 organizers of conferences who agree to implement the LREC Map: LREC, AFLR
 related cataloging projects: Rexa Project
 leading industrial partners
 related LR development projects and centers: LanguageGrid, CDAC
 Resources
 support for partners (ELRA, LDC, NIST) some of which is already in place
 database schema (existing)
 search engine
 technology for data mining
 taxonomy/controlled vocabulary of applications, data types, etc.
 Activities
 outreach
 sample output early in the project
 institutional endorsement
 evaluation of metadata (user feedback)
 evaluation of performance of the Catalog in terms of LRs required
Use Case
 connect discussions of interoperability
 metadata, tools, documentation, standards
 design use case which is inclusive, advanced
 automatically training and HLT
 harmonized metadata: necessary, not sufficient
 corpus descriptions
 machine readable to identify locations, syntax, semantics
 data catalog registry for non-identical specifications
 human readable to assure consistent methodology
Progress to Date
 planning refined at 2010 FLaReNet Forum
 progress on funding advances in Europe
 FLaReNet, T4ME, MetaNet, ISO-CAT
 work on the UC continues
 work on
 UC
 LR Wiki
 Amazigh, Bengali, Panjabi, Pashto, Tagalog, Tamil, Urdu
 ~100 resources per language
 LDC Papers Catalog
 2500 papers