Transcript Document
Christopher Cieri (1), Khalid Choukri (2), Nicoletta Calzolari (3), D. Terence Langendoen (4),
Johannes Leveling (5), Martha Palmer (6), Nancy Ide (7), James Pustejovsky (8)
1. Linguistic Data Consortium (LDC) (ccieri @ ldc.upenn.edu)
2. European Language Resources Association (ELRA) (choukri @ elda.org)
3. Istituto di Linguistica Computazionale, Consiglio Nazionale delle Ricerche (glottolo @ ilc.cnr.it)
4. Department of Linguistics, University of Arizona (langendt @ email.arizona.edu)
5. Centre for Next Generation Localisation (CNGL), Dublin City University (johannes.leveling @ computing.dcu.ie)
6. Center for Computational Language and Education Research, Department of Computer Science, University of Colorado, Boulder (Martha.Palmer @ Colorado.edu)
7. Department of Computer Science, Vassar College, USA (ide @ cs.vassar.edu)
8. Department of Computer Science, Brandeis University (jamesp @ cs.brandeis.edu)
Background
LRs remain expensive to create, thus rare relative to demand
accidental re-creation of an LR is a nearly unforgivable waste of scarce
resources
Despite
existence of a few large data centers focused on HLT
ELRA, LDC
prior harmonization project
Networking Data Centers
union catalog initiative
Open Language Archives Community (OLAC)
HLT researchers must still
master multiple metadata sets
to search multiple locations
in order to find needed resources
or else risk failing to note the existence of critical LRs
recreate them
do without them.
Recent Pre-History
OLAC (Open Language Archives Community)
LDC, ELRA early adopters
LREC Universal Catalog, LDC participating
FLaReNet (Fostering Language Resources Network)
A major condition for the take-off of the field of Language Resources
and Language Technologies is the creation of a shared policy for the
next years.
FLaReNet Meeting, Vienna
SILT (Sustainable Interoperability for Language Technology)
turn existing, fragmented technology and resources developed to
support language processing technology into accessible, stable, and
interoperable resources that can be readily reused across several
fields
SILT-FLaReNet Meeting, "Towards An Operationalized
Definition of Interoperability for Language Technology", Brandeis
University, Waltham, Massachusetts, 1-2 November, 2009.
Current Landscape
Major Data Centers maintain own separate catalogs
different metadata languages (categories, terminologies)
export subsets of their metadata categories to the OLAC
OLAC provides
specifications for OAI (Open Archives Initiative) compliant metadata
routines for harvesting, interchanging, searching
ELRA UC
focusing on resources intended for HLT R&D
includes a greater percentage of ELRA metadata fields
exploits data mining to discover resources not produced or distributed by ELRA
LREC Map
uses LREC submission process to increase the contribution of LR metadata
NICT Shachi catalog
union catalog of resources
records are scraped
uses data mining technologies to discover LR features missing from home catalog
entries
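As a rough illustration of the OAI-compliant harvesting that OLAC specifies, the sketch below builds an OAI-PMH ListRecords request URL and extracts identifiers and titles from a response. The base URL and the sample record are hypothetical placeholders, not a real endpoint or catalog entry; a working harvester would also fetch over HTTP and follow resumption tokens.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for the given repository."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_records(xml_text):
    """Return (identifier, title) pairs found in a ListRecords response."""
    root = ET.fromstring(xml_text)
    results = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        title = record.findtext(".//dc:title", default="", namespaces=NS)
        results.append((identifier, title))
    return results


# Hypothetical response; the identifier and title are invented for illustration.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:LR-0001</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example News Text Corpus</dc:title>
          <dc:language>eng</dc:language>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("https://example.org/oai"))
    print(parse_records(SAMPLE_RESPONSE))
```

Because every center answers the same ListRecords request, a union catalog needs only one harvesting routine plus per-center mappings of the returned fields.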
Current Landscape
LDC LR Wiki
identifies LRs (for less commonly taught languages)
organized by language and LR type
area experts edit individual sections
some resources: plain/parallel text & lexicons identified &
even harvested automatically
free text description => normalization
LDC LR Papers Catalog
research papers
introduce, describe, discuss, extend or rely upon another LR
currently focusing on papers dealing with LDC data
full bibliographic information on the paper
link to the unique identifier of the LR referenced
Short Term Recommendations
harmonize LR catalogs of largest international data centers
non-reductionist approach to harmonization
do not identify a minimal subset that applies to all LR types
focus on LRs targeted toward HLT R&D
identify the superset of metadata types contained in them
distinguish
those that can be normalized internally across data centers
from those that encode irreconcilable differences
agree to normalize, harmonize practice wherever possible
governance body specifically for this project
project partners, sponsors, individual and small group LR providers
and LR users
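The non-reductionist approach above can be sketched as follows: take the superset of every center's metadata fields, then separate those with an agreed mapping to a shared category from those left center-specific. The center names, field names, and crosswalk are hypothetical and are not taken from any real catalog schema.

```python
def harmonize(catalog_fields, crosswalk):
    """Split the superset of metadata fields into harmonized and unreconciled.

    catalog_fields: {center: set of field names used by that center}
    crosswalk: {(center, field): shared category name} for agreed mappings
    """
    harmonized = {}       # shared category -> set of (center, field) pairs
    unreconciled = set()  # (center, field) pairs with no agreed mapping
    for center, fields in catalog_fields.items():
        for field in fields:
            key = (center, field)
            if key in crosswalk:
                harmonized.setdefault(crosswalk[key], set()).add(key)
            else:
                # Kept as-is rather than dropped: nothing is reduced away.
                unreconciled.add(key)
    return harmonized, unreconciled


# Hypothetical field inventories for two data centers.
catalog_fields = {
    "CenterA": {"lang", "modality", "license_class"},
    "CenterB": {"language", "medium", "member_price"},
}

# Hypothetical agreements reached by the governance body.
crosswalk = {
    ("CenterA", "lang"): "language",
    ("CenterB", "language"): "language",
    ("CenterA", "modality"): "modality",
    ("CenterB", "medium"): "modality",
}

harmonized, unreconciled = harmonize(catalog_fields, crosswalk)

if __name__ == "__main__":
    print(harmonized)
    print(unreconciled)
```

Keeping the unreconciled fields in the output, instead of discarding them, is what separates this from a minimal-subset approach.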
Outcomes
harmonized catalogs
definition of metadata categories
database structure
search engine customized to HLT LR search
controlled vocabulary fields
relevance-based search of entire catalog records
specification of best metadata practices
centralized metadata repository with a harvesting protocol
searcher assistance based upon
relations among metadata categories (dictionary ≅ lexicon)
prior search behavior
those who searched for “Gigaword” also searched for “news text corpora”
metadata creator assistance based on
searcher behavior
“93% of searchers include a language name in their search” but “87% of all providers
include ISO 639-3 language codes”
behavior of other metadata providers
“the metadata you have provided so far also characterize 32 other resources”
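A minimal sketch of the two kinds of searcher assistance listed above, assuming a hand-maintained table of equivalent categories and a log of queries grouped by session. The equivalence table, the session log, and all function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Assumed equivalences among metadata categories (dictionary ≅ lexicon).
EQUIVALENT = [{"dictionary", "lexicon"}]


def expand_query(term):
    """Return the term plus any categories treated as equivalent to it."""
    term = term.lower()
    for group in EQUIVALENT:
        if term in group:
            return sorted(group)
    return [term]


def co_search_counts(sessions):
    """Count how often two distinct queries occur in the same session."""
    counts = defaultdict(Counter)
    for queries in sessions:
        unique = {q.lower() for q in queries}
        for a, b in permutations(unique, 2):
            counts[a][b] += 1
    return counts


def also_searched(counts, query, top_n=3):
    """Return the queries most often issued alongside the given query."""
    ranked = counts.get(query.lower(), Counter()).most_common(top_n)
    return [other for other, _ in ranked]


# Hypothetical session log; each inner list is one searcher's queries.
sessions = [
    ["Gigaword", "news text corpora"],
    ["gigaword", "news text corpora", "parallel text"],
    ["lexicon", "Urdu"],
]
counts = co_search_counts(sessions)

if __name__ == "__main__":
    print(expand_query("Dictionary"))
    print(also_searched(counts, "Gigaword", top_n=1))
```

The same co-occurrence counting could be run over metadata records instead of queries to tell a provider which other resources share the values entered so far.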
Middle Term Recommendations
expand UC scope to include raw data & research papers
some work already begun,
not coordinated across data centers and LR creators
LCTL LR wiki at LDC
Rosetta Project
harvest of papers describing LRs
at LDC using human effort
within Rexa project using data mining technologies
integrate effective workflows: social networking, web sourcing, data mining
enhance UC with links to raw resources including
web sites rich in monolingual and parallel text
lexicons built for interactive use
new harmonization challenges
adjust governance and broaden the scope of its normalization activities
implement sustainable business models
Requirements
Representatives of
relevant data and metadata centers: ELRA/ELDA, LDC, NICT, and OLAC
interested professional organizations: ACL, LSA, LinguistList, SIL, ISCA
journals willing to implement version of LREC map: LRE, LILT
organizers of conferences who agree to implement LREC Map: LREC, AFLR
related cataloging projects: Rexa Project
leading industrial partners
related LR development projects and centers: LanguageGrid, CDAC
Resources
support for partners (ELRA, LDC, NIST) some of which is already in place
database schema (existing)
search engine
technology for data mining
taxonomy/controlled vocabulary of applications, data types, etc.
Activities
outreach
sample output early in the project
institutional endorsement
evaluation of metadata (user feedback)
evaluation of performance of the Catalog in terms of LRs required
Use Case
connect discussions of interoperability
metadata, tools, documentation, standards
design use case which is inclusive, advanced
automatically training and HLT
harmonized metadata: necessary not sufficient
corpus descriptions
machine readable to identify locations, syntax, semantics
data catalog registry for non-identical specifications
human readable to assure consistent methodology
Progress to Date
planning refined at 2010 FLaReNet Forum
progress on funding advances in Europe
FLaReNet, T4ME, MetaNet, ISO-CAT
work on the UC continues
work on
UC
LR Wiki
Amazigh, Bengali, Panjabi, Pashto, Tagalog, Tamil, Urdu
~100 resources per language
LDC Papers Catalog
2500 papers