Language resources, standardization and modern trends in NLP
Simon Krek
Jožef Stefan Institute, Artificial Intelligence Laboratory, Slovenia
COST Action
Working Groups / Objectives
• WG1: Integrated interface to European dictionary content
• WG2: Retro-digitized dictionaries
• WG3: Innovative e-dictionaries
• WG4: Lexicography and lexicology from a pan-European perspective
Innovative e-dictionaries
• The third working group will focus on the development of digitally
born dictionaries, with an emphasis on the latest developments in
e-lexicography and the interface between lexicography and
computational linguistics.
• Work will be carried out on:
• the analysis of the possible impact of automatic acquisition of lexical data
• the analysis of the interface between dictionary and computational lexica (cf.
wordnets) and syntactically and semantically annotated corpora (cf.
FrameNet, SemCor, Senseval)
• the investigation of the possible use of dictionary content for computational
linguistic applications
Electronic lexicography in the 21st century
• The first eLex conference: New challenges, new applications,
Louvain-la-Neuve (Belgium), 22 to 24 October 2009
• The second eLex conference: New applications for new users, Bled
(Slovenia), 10 to 12 November 2011
• The third eLex conference: Thinking outside the paper, Tallinn
(Estonia), 17 to 19 October 2013
• The fourth eLex conference: Linking Lexical Data in the digital age,
Herstmonceux Castle (UK), 11 to 13 August 2015
eLex 2011
Language data for digital natives: old wine in a new bottle or...?
• Text mining is a challenge
• Content is a problem
• Presentation is a bigger problem
What is in the middle?
• Presentation is a bigger problem (Web, Mobile)
• Design?
• Text mining is a challenge
• Content is a problem
Sinclair: Floating dictionary (2001)
• »A few years ago I felt that the time was ripe to plan a new kind of dictionary, one
that would never exist on paper, but would be automatic or almost automatic in
its self-updating.
• It would, so to speak, float on top of a corpus, rather like a jellyfish, its tendrils
constantly sensing the state of the language.
• As well as reporting on the settled usage and meanings of the words and phrases
of a language, like a normal dictionary does, the floating dictionary, when
interrogated, dips into the corpus and checks this information, offering instances
that match its criteria for the senses; also it explores further to see if there are
any instances that conflict with the criteria, and may signify a development of a
sense or the emergence of a new usage altogether.
• Within the limits of its powers, it organises this evidence as a comment on the
existing dictionary entry.«
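The behaviour Sinclair describes (dip into the corpus, find instances that match the sense criteria, flag those that conflict) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `Sense` class, the use of nearby cue words as "criteria", and the three-sentence corpus are hypothetical and far simpler than any real sense-matching method.

```python
from dataclasses import dataclass


@dataclass
class Sense:
    label: str
    cue_words: frozenset  # hypothetical sense criteria: words expected near the headword


def check_entry(headword, senses, corpus, window=3):
    """Sort corpus sentences containing the headword into per-sense matches
    and 'unexplained' instances that fit none of the recorded senses."""
    report = {sense.label: [] for sense in senses}
    unexplained = []
    for sentence in corpus:
        tokens = sentence.lower().split()
        if headword not in tokens:
            continue
        i = tokens.index(headword)
        # Context = up to `window` tokens on each side of the first occurrence.
        context = set(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
        matched = [sense.label for sense in senses if sense.cue_words & context]
        if matched:
            for label in matched:
                report[label].append(sentence)
        else:
            # Candidate evidence for a new usage or a shifted sense.
            unexplained.append(sentence)
    return report, unexplained


senses = [
    Sense("finance", frozenset({"account", "money"})),
    Sense("river", frozenset({"river", "shore"})),
]
corpus = [
    "she opened a bank account",
    "we sat on the river bank",
    "the pilot had to bank sharply left",
]
report, unexplained = check_entry("bank", senses, corpus)
print(report)
print(unexplained)
```

The `unexplained` list plays the role of the "comment on the existing dictionary entry": it is evidence for a lexicographer to review, not an automatic change to the entry.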
Does dictionary content know itself?
• The language technology (LT) community now has a basic idea of how
to store various types of information
• so does the Semantic Web (SW) community: RDF, RDFa, RDFS, OWL, SKOS, and more
• standardization in human-oriented dictionary encoding was never
really successful (XML, TEI?)
• the question is: if different types of lexicographic information
intended for human users will have to know about each other, will the
format be dictated by LT standards? (Probably yes.)
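As a rough illustration of what storing dictionary content in an SW format could look like, the sketch below serializes one entry as N-Triples using only the standard library. The assumptions are considerable: the `http://example.org/dict/` namespace and the entry layout are hypothetical, and modelling a dictionary sense as a `skos:Concept` with `skos:prefLabel` and `skos:definition` is a deliberate simplification, not a recommended encoding.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://example.org/dict/"  # hypothetical namespace, for illustration only


def literal(text, lang):
    """Format a language-tagged literal (only backslashes and quotes are escaped)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def entry_to_ntriples(entry):
    """Turn a simple dictionary entry into a list of N-Triples lines."""
    subject = f"<{BASE}{entry['id']}>"
    lines = [
        f"{subject} <{RDF_TYPE}> <{SKOS}Concept> .",
        f"{subject} <{SKOS}prefLabel> {literal(entry['headword'], entry['lang'])} .",
    ]
    for definition in entry["definitions"]:
        lines.append(f"{subject} <{SKOS}definition> {literal(definition, entry['lang'])} .")
    return lines


entry = {
    "id": "jellyfish-1",
    "headword": "jellyfish",
    "lang": "en",
    "definitions": ["a free-swimming sea animal with a soft body and tentacles"],
}
triples = entry_to_ntriples(entry)
print("\n".join(triples))
```

Once entries are expressed this way, definitions, examples and etymologies from different dictionaries can refer to one another through shared identifiers rather than through a dictionary-specific layout.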
Similar domain, different task
• EU projects: http://www.xlike.org/, http://xlime.eu/
• The goal of the XLike project is to develop technology to monitor and
aggregate knowledge that is currently spread across mainstream and
social media, and to enable cross-lingual services for publishers,
media monitoring and business intelligence.
• xLiMe proposes to extract knowledge from different media channels
and languages and relate it to cross-lingual, cross-media knowledge
bases. By doing this in near real-time we will provide a continuously
updated and comprehensive view on knowledge diffusion across
media.
Services
• Newsfeed
• a clean, continuous, real-time aggregated stream of semantically enriched
news articles from RSS-enabled sites across the world
• http://newsfeed.ijs.si/visual_demo/
• http://enrycher.ijs.si/
• EventRegistry
• a system that can analyze news articles and identify world events
• can identify groups of articles in different languages that describe the same
event
• http://eventregistry.org/
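To give a feel for how articles in different languages might be grouped into one event, here is a minimal sketch. It is not EventRegistry's actual method. It assumes that each article has already been annotated with language-independent entity identifiers (the `E1`, `E2`, ... values are made up) and it uses simple greedy clustering on Jaccard overlap, with a hypothetical threshold of 0.5.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_articles(articles, threshold=0.5):
    """Greedily assign each article to the most similar existing cluster,
    or start a new cluster when no cluster reaches the threshold."""
    clusters = []  # each cluster: {"entities": set, "ids": list}
    for article in articles:
        entities = set(article["entities"])
        best, best_score = None, threshold
        for cluster in clusters:
            score = jaccard(entities, cluster["entities"])
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            clusters.append({"entities": set(entities), "ids": [article["id"]]})
        else:
            best["ids"].append(article["id"])
            best["entities"] |= entities
    return [cluster["ids"] for cluster in clusters]


articles = [
    {"id": "en-1", "entities": ["E1", "E2", "E3"]},
    {"id": "sl-1", "entities": ["E1", "E2", "E3", "E4"]},
    {"id": "de-1", "entities": ["E7", "E8"]},
    {"id": "fr-1", "entities": ["E7", "E8", "E9"]},
]
events = group_articles(articles)
print(events)
```

Because the comparison runs over entity identifiers rather than words, the language of each article does not matter at this step; the hard cross-lingual work is assumed to have happened during annotation.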
EventRegistry system architecture
ENeL perspective
• Complex story about events = complex story about words/languages
Cross-lingual horizontal axis
Slovene Estonian English German French Hungarian Croatian Basque Swedish …
Diachronic vertical axis
2015 1950 1900 1850 1800 …
Cross-lingual synchronic horizontal axis
• "Never without data"
• Existing lexical resources (dictionaries, BabelNet, AnyNet, Linked Data, etc.)
• Corpora, the Web and NLP
• Definition extraction (and generation)
• RANLP 2009, International workshop on definition extraction
• Language Technology for eLearning (http://www.lt4el.eu/)
• Extraction of grammatical or lexical information
• Kookkurrenzdatenbank (http://corpora.ids-mannheim.de/ccdb/)
• Sketch Engine (http://www.sketchengine.co.uk/)
• Extraction of good (dictionary) examples
• ENeL Vienna workshop
• Extraction of translation equivalents
• Linguee etc.
• Extraction of Multi-word Expressions (Parseme)
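One of the extraction tasks listed above, finding collocation candidates from corpus data, can be sketched with plain co-occurrence counts and pointwise mutual information (PMI). This is an illustration only and does not reproduce what the Kookkurrenzdatenbank or Sketch Engine do; the window size, the frequency cut-off and the four-sentence corpus are arbitrary assumptions.

```python
import math
from collections import Counter


def collocations(sentences, window=2, min_count=2):
    """Score word pairs that co-occur within `window` tokens using PMI."""
    word_freq = Counter()
    pair_freq = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        word_freq.update(tokens)
        for i, word in enumerate(tokens):
            for other in tokens[i + 1:i + 1 + window]:
                # Store pairs in sorted order so (a, b) and (b, a) are counted together.
                pair_freq[tuple(sorted((word, other)))] += 1
    total_words = sum(word_freq.values())
    total_pairs = sum(pair_freq.values())
    scored = []
    for (w, v), count in pair_freq.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI values
        p_pair = count / total_pairs
        p_w = word_freq[w] / total_words
        p_v = word_freq[v] / total_words
        scored.append((w, v, count, round(math.log2(p_pair / (p_w * p_v)), 2)))
    return sorted(scored, key=lambda item: item[3], reverse=True)


sentences = [
    "she drinks strong tea",
    "he likes strong tea",
    "strong tea is good",
    "the tea is hot",
]
candidates = collocations(sentences)
for w, v, count, pmi in candidates:
    print(w, v, count, pmi)
```

On a real corpus the input would be lemmatized and part-of-speech tagged, which is what makes relation-specific output such as "as subject" or "as object" possible.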
Complex multimodal information extraction
Automatically Constructed Dictionary Content
Explain, combine, exemplify
Definitions
Combinations
Found
Collocations
as subject
as object
Generated
Multi-word expressions
Knowledge-Rich Contexts
Real-time data
Streaming
News Feeds
Twitter
Sounds, graphics and visuals
Sounds
Images
Recorded / Speech Recognition
Speech Synthesis
Videos
Graphics
Multi-lingual, cross-lingual
hub language
(Hidden) parallel corpora
ENeL
• WG1: Integrated interface to European dictionary content
• WG2: Retro-digitized dictionaries
• WG3: Innovative e-dictionaries
• WG4: Lexicography and lexicology from a pan-European perspective
Retro-digitization
• Digital Agenda for Europe (Europe 2020 Strategy – one of the pillars)
• Commission’s Recommendation on the digitization and online
accessibility of cultural material and digital preservation
• Put in place solid plans for their investments in digitization and
foster public-private partnerships to share the gigantic cost of
digitization (recently estimated at € 100 billion).
• Make 30 million objects available through Europeana by 2015,
including all Europe's masterpieces which are no longer protected by
copyright, and all material digitized with public funding.
Retro-digitized dictionaries
• encode and enrich dictionary data (standards and tools)
• (the question is: if different types of lexicographic information
intended for human users will have to know each other – will the
format be dictated by LT standards?)
• definitions
• examples
• etymology
• other types of information
• linking dictionary data with historical corpora
• http://nl.ijs.si/imp/
Lexical Cloud
Integrated interface to European (dictionary / lexical) content
Any base
Any corpus
AnyNet
Anypedia
Any dictionary
Conclusion
• any word/concept in any language on any device offers a story about
its current life and its history
• what is a "concept" (in the sense of "event")? X-Nets? Wikipedia?
• what is the central format?
• what is the appropriate context?
• EU projects? ICT? Cultural Heritage?
• Infrastructure (e.g. Clarin)?