Language and Speech
Technology: Introduction
Jan Odijk
January 2011
LOT Winter School 2011
1
Overview
• What is language and speech technology
(LST)? (3-7)
• Major Subfields of LST (8-25)
• Characterization of the last 30 years (26-27)
– 80s (28-36), 90s (37-49), 00s (50-56)
– Current Status (57-69)
• CLARIN infrastructure (70-75)
• This week’s programme (76)
2
Language Technology
• Language Technology is the study of
computational systems that process natural
language
• Alternative names:
– Human Language Technology (HLT)
– Natural Language Processing (NLP)
3
Speech Technology
• Speech Technology is the study of
computational systems that process speech
• Is a part of Language Technology
• Often, however, the term “Language Technology” is
reserved for the study of computational systems that
process written language
4
Computational Linguistics
• Computational Linguistics (CL) is the study
of language from a computational
perspective
• Often used interchangeably with language
technology
• Often grouped under Artificial Intelligence
(AI), although CL predates AI
– AI: the study and design of intelligent systems
5
Computational Systems
• Computational systems to process natural
language do not exist naturally (except in
the human brain)
– They must be designed, implemented, and
evaluated
– Therefore it is a kind of engineering
6
Computational Systems
• LST is NOT the study of the processing of natural
language by humans, as in
– cognition,
– (cognitive) psychology,
– (psycho)linguistics,
– phonetics
7
Language Technology Subfields
• Orthographic processing
– Text = sequence of characters
– Tokenization
• Text => sequence of tokens
• Token = occurrence of a word form
• Relatively simple for languages that use
punctuation (space, dot, comma, etc.) for
separating tokens
• More difficult for languages such as Chinese, Thai,
etc.
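For languages that separate tokens with spaces and punctuation, the step Text => sequence of tokens can be sketched with a single regular expression. This is a minimal illustration, not a production tokenizer: the pattern and the name `tokenize` are assumptions, and it does not handle abbreviations, numbers with separators, or languages such as Chinese or Thai.

```python
import re

# Hypothetical pattern: a token is either a run of word characters
# (optionally joined by an apostrophe or hyphen) or one punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Map a text (sequence of characters) to a sequence of tokens."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(tokenize("The man bought a present for his wife."))
```

Each element of the returned list is one token, that is, one occurrence of a word form or punctuation mark.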
8
Language Technology Subfields
• Orthographic processing
– Orthographic normalization
– Token => (token, normalized token)
– Normalized token = canonical orthographic
representation for a set of orthographic variants
– Examples:
• Contemporary spelling variants: aktie => actie
• Older spelling variants: vleesch => vlees
• Typos: actei => actie
• OCR errors: raarn => raam
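The mapping Token => (token, normalized token) can be sketched as a table lookup. The table below holds only the four examples from the slide. The names `VARIANTS` and `normalize` are made up for this illustration, and a real system would need far larger resources or a trained model.

```python
# Hypothetical variant table: orthographic variant -> canonical form.
VARIANTS = {
    "aktie": "actie",    # contemporary spelling variant
    "vleesch": "vlees",  # older spelling variant
    "actei": "actie",    # typo
    "raarn": "raam",     # OCR error
}


def normalize(token: str) -> tuple[str, str]:
    """Return (token, normalized token); unknown tokens map to themselves."""
    return token, VARIANTS.get(token.lower(), token)


if __name__ == "__main__":
    for word in ["aktie", "vleesch", "actei", "raarn", "raam"]:
        print(normalize(word))
```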
9
Language Technology Subfields
• Morphological processing
– Lemmatization: token => (token, lemma)
• Lemma = canonical orthographic representation for
an inflectional paradigm
• Often ambiguities
• Examples
– lemma(walked) = walk; lemma(men) = man
– lemma(graven) = {graf, graaf, graven} (Dutch)
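Because lemmatization is often ambiguous, a lemmatizer is more naturally sketched as returning a set of lemmas than a single one. The lexicon below contains only the slide's examples. `LEMMAS` and `lemmatize` are illustrative names, not part of any existing tool.

```python
# Hypothetical full-form lexicon: word form -> set of possible lemmas.
LEMMAS: dict[str, set[str]] = {
    "walked": {"walk"},
    "men": {"man"},
    "graven": {"graf", "graaf", "graven"},  # Dutch, three-way ambiguous
}


def lemmatize(token: str) -> tuple[str, set[str]]:
    """Return (token, candidate lemmas); fall back to the token itself."""
    key = token.lower()
    return token, LEMMAS.get(key, {key})


if __name__ == "__main__":
    print(lemmatize("walked"))
    print(lemmatize("graven"))
```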
10
Language Technology Subfields
• Morphological processing
– Inflection analysis/generation
• Word form <=> (lemma, inflectional features)
• Examples:
– graven <=> (graf, PoS=Noun, number=plural)
– graven <=> (graaf, PoS=Noun, number=plural)
– graven <=> (graven, PoS=Verb, form=infinitive)
– graven <=> (graven, PoS=Verb, form=indicative,
tense=present, number=plural)
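Inflection analysis and generation are two directions over the same relation between word forms and (lemma, inflectional features) pairs. The sketch below stores the four analyses of "graven" from the slide and queries them both ways. The data structure and the names `Analysis`, `feats`, `analyse` and `generate` are assumptions made for illustration.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    lemma: str
    features: frozenset  # set of (feature, value) pairs


def feats(**features: str) -> frozenset:
    """Build an order-independent feature bundle."""
    return frozenset(features.items())


# Hypothetical lexicon holding only the slide's example.
ANALYSES: dict[str, list[Analysis]] = {
    "graven": [
        Analysis("graf", feats(PoS="Noun", number="plural")),
        Analysis("graaf", feats(PoS="Noun", number="plural")),
        Analysis("graven", feats(PoS="Verb", form="infinitive")),
        Analysis("graven", feats(PoS="Verb", form="indicative",
                                 tense="present", number="plural")),
    ],
}


def analyse(word_form: str) -> list[Analysis]:
    """Analysis: word form -> all (lemma, features) readings."""
    return ANALYSES.get(word_form.lower(), [])


def generate(lemma: str, features: frozenset) -> list[str]:
    """Generation: (lemma, features) -> matching word forms."""
    return sorted(
        form
        for form, readings in ANALYSES.items()
        if Analysis(lemma, features) in readings
    )


if __name__ == "__main__":
    for reading in analyse("graven"):
        print(reading.lemma, dict(reading.features))
    print(generate("graaf", feats(PoS="Noun", number="plural")))
```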
11
Language Technology Subfields
• Morphological processing
– Compound processing
– word form => ((word form, affix?)+, word form)
– lemma => ((word form, affix?)+, lemma)
– Examples:
• vleeskoeienhouders => ([vlees, koeien],
houders) ‘meat cow farmers’
• gebiedsbepaling => ([(gebied, s)], bepaling)
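Compound analysis can be sketched as a recursive search for known word forms, optionally followed by a linking affix. The lexicon, the list of linking affixes and the function name `split_compound` are hypothetical. A real splitter also needs a way to rank competing analyses.

```python
# Hypothetical lexicon and linking affixes, just enough for the examples.
LEXICON = {"vlees", "koeien", "houders", "gebied", "bepaling"}
LINKERS = ("", "s")

Modifier = tuple[str, str]  # (word form, affix); affix may be empty


def split_compound(word: str) -> list[tuple[list[Modifier], str]]:
    """Return all analyses ((word form, affix?)+, head) found in LEXICON."""
    word = word.lower()
    analyses = []
    for i in range(1, len(word)):
        first = word[:i]
        if first not in LEXICON:
            continue
        for linker in LINKERS:
            if not word[i:].startswith(linker):
                continue
            rest = word[i + len(linker):]
            if not rest:
                continue
            if rest in LEXICON:
                analyses.append(([(first, linker)], rest))
            # The remainder may itself be a compound.
            for modifiers, head in split_compound(rest):
                analyses.append(([(first, linker)] + modifiers, head))
    return analyses


if __name__ == "__main__":
    print(split_compound("vleeskoeienhouders"))
    print(split_compound("gebiedsbepaling"))
```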
12
Language Technology Subfields
• Morphological processing
– Derivational morphology processing
– word form => (prefix*, lemma, suffix*)
– Example:
• characterization => ([], characterize, [ation])
13
Language Technology Subfields
• (PoS-)tagging
– Assignment of a grammatical tag to a token in
context (tag=label for grammatical properties)
– Token => (token, tag) in context
– Usually assignment of PoS-tags
– Often more detailed grammatical (inflectional)
tags
14
Language Technology Subfields
• (PoS-)tagging
– Context usually consists of:
• Some words and/or tags preceding
• Some words following
– Examples:
• (graven, Zij __ een graf) => V(ind,pres,pl) (‘They dig a grave’)
• (graven, De __ zijn boos) => N(pl) (‘The counts are angry’)
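How context resolves the tag of "graven" can be illustrated with a toy tagger that looks only at the tag of the preceding token. Everything here is an assumption made for the sketch: the mini lexicon, the tag notation such as `V(ind,pres,pl)`, and the single preference rule. Real taggers typically learn such preferences from annotated corpora and may also use following words.

```python
# Hypothetical lexicon: word form -> candidate tags.
LEXICON = {
    "zij": ["Pron"],
    "de": ["Det"],
    "een": ["Det"],
    "graf": ["N(sg)"],
    "zijn": ["V(ind,pres,pl)"],
    "boos": ["Adj"],
    "graven": ["N(pl)", "V(ind,pres,pl)"],  # ambiguous out of context
}

# Hypothetical context rule: preferred category after a given category.
PREFERRED_AFTER = {"Det": "N", "Pron": "V"}


def tag(tokens: list[str]) -> list[tuple[str, str]]:
    """Assign one tag per token, using the previous tag as context."""
    tagged = []
    previous_category = None
    for token in tokens:
        candidates = LEXICON.get(token.lower(), ["Unknown"])
        chosen = candidates[0]
        wanted = PREFERRED_AFTER.get(previous_category)
        if wanted is not None:
            for candidate in candidates:
                if candidate.startswith(wanted):
                    chosen = candidate
                    break
        tagged.append((token, chosen))
        previous_category = chosen.split("(")[0]
    return tagged


if __name__ == "__main__":
    print(tag("Zij graven een graf".split()))
    print(tag("De graven zijn boos".split()))
```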
15
Language Technology Subfields
• Chunking
– identifying major phrases in a sentence
– Example
• The man bought a present for his wife =>
• [NP The man] bought [NP a present] [PP for his wife]
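A chunker of this kind can be sketched as a pattern matcher over PoS-tagged tokens. The tag set (Det, Adj, N, V, P), the two patterns (NP = Det? Adj* N+, PP = P NP) and the function names are simplifying assumptions. The input is tagged by hand here.

```python
Tagged = list[tuple[str, str]]  # (token, PoS tag)


def np_end(tagged: Tagged, start: int) -> int:
    """Return the index just past an NP starting at `start`, else `start`."""
    j = start
    if j < len(tagged) and tagged[j][1] == "Det":
        j += 1
    while j < len(tagged) and tagged[j][1] == "Adj":
        j += 1
    k = j
    while k < len(tagged) and tagged[k][1] == "N":
        k += 1
    return k if k > j else start


def chunk(tagged: Tagged) -> str:
    """Bracket NP and PP chunks; other tokens are left as they are."""
    pieces = []
    i = 0
    while i < len(tagged):
        if tagged[i][1] == "P":
            label, end = "PP", np_end(tagged, i + 1)
            found = end > i + 1
        else:
            label, end = "NP", np_end(tagged, i)
            found = end > i
        if found:
            words = " ".join(word for word, _ in tagged[i:end])
            pieces.append("[" + label + " " + words + "]")
            i = end
        else:
            pieces.append(tagged[i][0])
            i += 1
    return " ".join(pieces)


if __name__ == "__main__":
    sentence = [("The", "Det"), ("man", "N"), ("bought", "V"),
                ("a", "Det"), ("present", "N"), ("for", "P"),
                ("his", "Det"), ("wife", "N")]
    print(chunk(sentence))
```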
16
Language Technology Subfields
• Parsing
– Assign a syntactic structure to a sentence
– Example: The man bought a present for his wife =>
[S
[subj/NP The man]
[pred/VP bought [obj/NP a present]
[pobj/PP for [obj/NP his wife]]
]
]
17
Language Technology Subfields
• Machine Translation
– Automatic translation of an input text
– Example
• The man bought a present for his wife =>
• L’homme a acheté un cadeau pour sa femme
18
Language Technology Subfields
• Content extraction and processing
– Named entity recognition
– Question-answering
– Information retrieval
– Information extraction
– Sentiment/opinion mining
– Reasoning/Inference on semantic representation
– …
19
Speech Technology Subfields
• Speech Synthesis
– Artificial production of human speech
– Text => speech
– Often called Text-To-Speech (TTS)
– TTS system usually contains two components
• Grapheme to Phoneme (G2P) component
– Text => symbolic speech representation (phonetic
representation)
• Speech Synthesis component
– Symbolic speech representation => speech
20
Speech Technology Subfields
• Speech Synthesis (cont.)
– Term Speech Synthesis often reserved for this
second component
– Meaning => speech
– Usually called Speech Generation, Concept-To-Speech, or Data-to-Speech
21
Speech Technology Subfields
• Speech Recognition
– Recognition of human speech
– Audio containing speech => text
– Often called automatic speech recognition
(ASR)
• Speech Understanding
– Understanding of human speech
– Audio containing speech => meaning or action
22
Speech Technology Subfields
• Speaker Recognition
– Recognition of a speaker given a speech signal
– Speech => person identity
• Speaker Verification
– Verification of the identity of a person
– Speech + claimed identity => Boolean
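The difference between the two tasks shows up in their signatures: recognition returns an identity, verification returns a Boolean. The sketch below assumes, purely for illustration, that a speech signal has already been reduced to a small feature vector and that speakers are compared by cosine similarity against enrolled reference vectors. The vectors, the threshold and all names are made up.

```python
import math

# Hypothetical enrolled reference vectors ("voiceprints").
ENROLLED = {
    "alice": [0.9, 0.1, 0.3],
    "bob": [0.2, 0.8, 0.5],
}


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recognize_speaker(features: list[float]) -> str:
    """Speech => person identity (closest enrolled speaker)."""
    return max(ENROLLED, key=lambda name: cosine(features, ENROLLED[name]))


def verify_speaker(features: list[float], claimed_identity: str,
                   threshold: float = 0.9) -> bool:
    """Speech + claimed identity => Boolean."""
    reference = ENROLLED.get(claimed_identity)
    return reference is not None and cosine(features, reference) >= threshold


if __name__ == "__main__":
    sample = [0.88, 0.12, 0.31]
    print(recognize_speaker(sample))
    print(verify_speaker(sample, "alice"), verify_speaker(sample, "bob"))
```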
23
Speech Technology Subfields
• Speech Compression
– Reduction of the size of speech representations
(speech encoding), or
– Time-compression of speech representations
(so that they sound faster to the listener)
24
Related fields
• Speech often used in dialogues
– Study of spoken dialogues (human-human,
human-machine)
• Speech often combined with other
modalities
– Study of Multimodal Interaction
• Speech is part of a man-machine interface
– Study of Human-Machine Interaction
25
Introduction
• Three decades:
– “80s”= 1980-1994
– “90s”= 1990-2005
– “00s” = 2000-2011
26
Overview
• 80s: Language Technology
• 80s: Speech Technology
• 90s: Language and Speech Technology
• 90s: Commercial Activity
• 90s: Importance of Data
• 00s: Language and Speech Technology
27
80s: Language Technology
• Focus on MT (in Europe)
– Eurotra (Europe)
– Rosetta (Philips, Netherlands)
– Distributed Translation (BSO, Netherlands)
28
80s: Language Technology
• Linguistic “Research Approach”
• Focus on Research
– not/less on Technology Development
• Knowledge-based approach
– hand-crafted lexicons and rules
– based on a theory / grammatical formalism
• Focus on linguistically interesting complex
phenomena
– less on phenomena that occur often
– not strongly data-driven
29
80s: Language Technology
• Focus on an idealized language
– not on actual language use
– no focus on robustness
• Computational approach seen (in research)
as a way to gain insight into language,
grammar and grammar formalisms
– no focus on developing a working system
– no pragmatic solutions
30
80s: Language Technology
• Little formal (quantitative) evaluation
– only with test suites
• constructed sentences illustrating linguistic
phenomena
• E.g. the HP Test Suite (Flickinger et al. 1987)
• computational linguistics rather than
language technology
31
80s: Language Technology
Major Problems (from a technology point of view):
• Ambiguity
– Real
– Temporary
• Computational Complexity
– computation-intensive grammar formalisms
• Complexity of language
– handcrafting lexicons and rules
• requires linguistic and computational expertise
• requires a lot of effort and time
32
80s: Language Technology
• Major problems (cont.):
• Idealized Language v. actual Language Use
• Large and rich lexicons are required, suited to
the application domain: they take a large
effort to build, and to tune (adapt) to
specific domains
33
80s: Speech Technology
• Automatic Speech Recognition (ASR)
• Statistical “Engineering Approach”
• Approach based on the Noisy Channel Model
• Derive acoustic models from a lot of
annotated speech examples
• Derive statistical language models from
large text corpora (n-gram probabilities)
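Deriving n-gram probabilities from a corpus can be sketched for the bigram case with plain relative-frequency (maximum likelihood) estimates. The two-sentence corpus, the boundary markers `<s>` and `</s>`, and the function name are assumptions. Real language models are trained on very large corpora and need smoothing for unseen n-grams.

```python
from collections import Counter


def bigram_probabilities(sentences: list[str]) -> dict[tuple[str, str], float]:
    """Estimate P(word | previous word) by relative frequency."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        context_counts.update(tokens[:-1])  # every token that has a successor
        bigram_counts.update(zip(tokens, tokens[1:]))
    return {
        (previous, word): count / context_counts[previous]
        for (previous, word), count in bigram_counts.items()
    }


if __name__ == "__main__":
    corpus = ["the man bought a present", "the man bought a book"]
    model = bigram_probabilities(corpus)
    print(model[("the", "man")])
    print(model[("a", "present")])
```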
34
80s: Speech Technology
• Focus on making (small) working systems
• Statistical approach: system uses
probabilities derived from data
• Focus initially on limited, “simple” tasks
(e.g. digit recognition), and increasingly on
more complex tasks
35
80s: Speech Technology
• Focus on real language use under realistic
conditions
• Progress made by making concrete systems
and evaluating them rigorously
36
90s: Language Technology
• Statistical MT
– derive language models from monolingual
corpora (probabilities of word (sequence)s)
– align “sentences” with their translations
– derive translation model from parallel corpora:
• estimate translation probabilities for words and
word sequences from the aligned “sentences”
• use these probabilities to compute translations for
new “sentences”
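One common way to combine the two models, assumed here rather than stated on the slide, is the noisy-channel formulation also used in speech recognition. For a source sentence f, the system looks for the target sentence e that maximizes the product of the language model P(e) and the translation model P(f | e):

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f)
        \;=\; \arg\max_{e} \frac{P(e)\, P(f \mid e)}{P(f)}
        \;=\; \arg\max_{e} P(e)\, P(f \mid e)
```

The denominator P(f) can be dropped because it does not depend on e.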
37
90s: Language Technology
• Ambiguity: resolved by probabilities based on statistics
• Computational Complexity
– computationally feasible formalisms
– proven in speech recognition
• Complexity of language
– language and translation model automatically derived from data
• Strong focus on actual language use
– Highly data driven
• Lexicons can be simpler and are derived automatically
from the data; adaptation to specific domains easy once the
data are available
38
90s: Language Technology
• Rise of Internet
• increasing need for information retrieval
• approximated by search for word and word
sequence strings
• Information Retrieval
– strongly statistically based
– Limited linguistics
– formal evaluation (recall, precision, F-score)
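The three evaluation measures can be computed directly from the set of retrieved documents and the set of relevant documents. The sketch below uses the standard definitions. The document identifiers and the function name are made up for the example.

```python
def precision_recall_f(retrieved: set[str], relevant: set[str],
                       beta: float = 1.0) -> tuple[float, float, float]:
    """Return (precision, recall, F-score) for one query."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = ((1 + beta ** 2) * precision * recall
               / (beta ** 2 * precision + recall))
    return precision, recall, f_score


if __name__ == "__main__":
    retrieved = {"d1", "d2", "d3", "d4"}
    relevant = {"d1", "d2", "d5"}
    print(precision_recall_f(retrieved, relevant))
```

With `beta = 1` the F-score is the harmonic mean of precision and recall. Larger values of `beta` weight recall more heavily.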
39
90s: Language Technology
• Resulted in
– strongly data-driven approach in language
technology
– increasing use of machine learning techniques
– explicit focus on formal, esp. quantitative
evaluation
– re-examination of simpler/computationally less
intensive formalisms (finite-state) for syntax
40
90s: Speech Technology
• Continued working under the established
paradigm
• increasingly improving performance and
extending environments and application
areas
41
90s: Companies
• many companies active in Speech
technology
– IBM, Microsoft, Siemens, Nokia, Philips,
Motorola, Matra Nortel, Nortel, ...
– Dragon, Kurzweil, Lernout & Hauspie,
SpeechWorks, Nuance, Babel, Loquendo,
Rhetorical, Vocalis, Telisma, Elan, ...
42
90s: Companies
• many companies in Language technology
– IBM, Microsoft, INSO, Novell, ...
– GMS, Apptek, Globalink, Lernout & Hauspie,
Systran, LANT (Xplanation), ...
43
90s: Companies
• MT systems:
– knowledge based systems,
– developed under an engineering approach
• simple grammatical formalism, or pruning of the
search space
– to reduce ambiguity
– to reduce computational resource requirements
– to reduce hand-crafting of rules
44
90s: Companies
• resulted in low quality MT systems
– still useful in many circumstances
• Differentiating factors
– rapid adaptation to (multi-word) terms /
vocabulary of new domain
– good performance on named entity recognition
45
90s: Data
• Researchers in knowledge-based NLP realized that
cooperation on lexicons was required
• ASR Methodology requires a lot of data:
– “There is no data like more data”
• This led to
– Data creation projects
– Set-up of data distribution centers
– Projects for developing standards for data
46
90s: Data
• Projects
– Lexicon projects
• Multilex
• Genelex
• Acquilex
• Parole
• WordNet, EuroWordNet
– SpeechDat projects
• SpeechDat, SpeechDat-Car, SpeechDat-East, SPEECON,
Orientel
– National / Local projects
• Spoken Dutch Corpus (Netherlands and Flanders)
47
90s: Data
• Data distribution Centers are set up
– LDC (1993)
– ELRA (1995)
• Standards:
– TEI for text corpora
• CES, XCES
– Eagles, ISLE for grammatical properties
48
Automating Data Production
• Usually existing (imperfect) tools are used
to create data (semi-)automatically
– G2P for creating phonetic dictionaries
– PoS-tagging for PoS-tagged text corpora
– Parsers for treebanks
• For bootstrapping annotations
– Faster and more consistent results
• Followed by (partial) manual correction
49
00s
• Early 00s
– Many data and research initiatives, nationally
– Netherlands
• IMIX 2001-2008
• STEVIN 2004-2011
• TST-Centrale (HLT Agency) 2005-..
– France
• EVALDA
• Technolangue
50
00s
• Early 00s
– International
• TREC
• CLEF
• TC-STAR 2004-2007
• EuroMatrix 2006-2009
• EuroMatrixPlus 2009-2012
• ECESS
• PASCAL / PASCAL2
• ACE
51
00s
• Early 00s
– International
• TAC (US)
• DUC (US)
• GALE (US)
• NTCIR (Japan)
• RTE
• SemEval
• SensEval
52
00s
• More recent projects
• FLaReNet
• META-NET
53
00s
• Companies offer services via the internet
and via mobile (smart) phones
– Search: Google, Bing, Yahoo!, etc.
– Social networks: Facebook, LinkedIn, YouTube
– Cloud Computing: Amazon, Google, Salesforce
• Companies gain access to huge amounts of
data (text, pictures, movies, etc.), including
user behavior
54
00s
• Data are used
– to improve existing services
– to create new services
– To personalize services and advertisements
55
00s
• New Services relevant for LST
– Google: Translation, search by voice, open
platform for mobile devices (Android)
– Amazon: Mechanical Turk
• Allows large scale distribution of work, e.g. on
manual annotation of language resources
– Apple: several iPhone Apps
• Dragon Dictate (for SMS, e-mail)
• Jibbigo
– ReCaptcha: transcription of (hand-written)
documents (now part of Google)
56
Current Status
• Language and Speech Technology in 2011:
– Exciting area!
• A lot of commercial activity, and expanding
• A large and active research community
• A lot of interesting topics are open for
research
57
Commercial Activity
• many companies in Language technology
– Google, Yahoo!, IBM, Microsoft, ...
– Apptek, Linguatec, Systran, Knowledge
Concepts, Q-go, ...
• applications
– MT, content management, information
retrieval, dealing with customer questions,
sentiment and opinion mining, ...
58
Commercial Activity
• many companies in Speech technology
– Google, IBM, Microsoft, Motorola, Nokia, ...
– Nuance, Loquendo, Acapela, SVOX, Telisma,
...
• even more in application development and
system integration
59
Commercial Activity
• applications
– Network IVR applications (Call centers,
banking, information services,...)
– Embedded applications
• in-car applications, e.g. voice activated dialing,
navigation (voice destination entry)
• mobile phone/PDA applications
– multimodal output e.g. for navigation
– command and control
– (SMS) dictation coming soon
60
Commercial Activity
• applications
– Office Applications
• Dictation, horizontal and vertical (medical, legal)
• Language learning
– Audiomining
• information retrieval from recorded speech (possibly
incl. other modalities): Radio/TV-broadcasts,
parliamentary sessions, ...
61
Research Topics?
• Speech Technology (Recognition)
– new paradigms?
• cf. FLaVoR project
http://www.esat.kuleuven.be/psi/spraak/projects/FLaVoR/
– Combination with other modalities
• AMI http://www.amiproject.org
• CHIL http://chil.server.de/servlet/is/101/
• IMIX (Interactive Multimodal Information eXtraction)
62
Research Topics?
• Speech Technology (Recognition)
– robustness against noise and other speakers
• increasing use in car and in public places on PDAs
and mobile phones
• MIDAS project
– pronunciation of names
• Autonomata I and TOO (incl. Nuance, Ghent,
Nijmegen and Utrecht)
63
Research Topics?
• Speech technology (Text-to-Speech)
– better control over prosody in corpus-based
TTS?
– Combination with other modalities
64
Research Topics?
• Language Technology
– Semantic Lexical databases created
– WordNet and EuroWordNet
– Cornetto
65
Research Topics?
• Language Technology
– Focus now on Semantic Annotation of Corpora
• OntoNotes http://www.isi.edu/naturallanguage/people/hovy/papers/06HLT-NAACLOntoNotes-short.pdf
• STEVIN D-COI and SONAR
• DutchSemCor
– How to use this semantic annotation in practical
systems?
66
Research Topics?
• Language Technology
– (Semi-)automatic lexicon creation/adaptation
– Sophisticated information retrieval
• Information extraction, summarization and
merging, opinion and sentiment mining, ...
67
Research Topics?
• Language And Speech Technology
– Speech to Speech Translation
• TC-STAR http://www.tc-star.org/
68
Research Topics?
• Dutch-Flemish STEVIN programme
– running from 2004-2011
– 11.4M€ budget
• resources
• research
• applications
• demonstration projects
– Most projects finished
– some projects are still running
– http://www.taalunieversum.nl/stevin
69
CLARIN
• aims to design, construct, validate, and
exploit
– a research infrastructure that is needed to
provide a sustainable and persistent eScience
working environment
– for researchers in the Social Sciences &
Humanities
– who want to make use of language data and
tools
70
CLARIN
• Make data and tools on different locations
easily accessible
– via web interfaces and services
– CLARIN portal(s) with intelligent searching,
browsing, viewing and querying services
• Make it possible for non-technical
researchers to extract/combine/enrich data
(supported by dissemination and training)
71
CLARIN
• Will make available interoperable data and
tools based on existing standards and best
practices
– Formal interoperability and
– Semantic interoperability
72
CLARIN
• For researchers that work with language
data and tools
– Humanities and Social Sciences
• Linguistics (broadly construed)
• Literary and Theatrical Studies
• Media and Culture
• History
• Political Sciences
• …
73
CLARIN
• Preparatory Project (CLARIN-prep)
– Funded by EU
– 2008-2011
– >33 partners from >23 countries
– Goals
• Get commitments from EU countries to contribute to the
CLARIN infrastructure after CLARIN-prep
• Investigate needs, requirements
• Make initial specification (and prototype implementations)
74
CLARIN
• Current Status
– Most countries in the process
– CLARIN infrastructure to start in mid-2011
– Netherlands committed and has leading role
• CLARIN-NL
– Funded by NWO
– 2009-2015
– Many subprojects running
– Focus on Humanities
75
This week’s Programme
• Tuesday: Parsing
• Wednesday: Machine Learning
• Thursday: Speech Recognition
– Guest lecturer: Arjan van Hessen
• Friday: Machine Translation
76
Thanks for Your Attention!
77
References
• Flickinger D., Nerbonne J., Sag I., Wasow T., "Toward Evaluation of NLP Systems",
Hewlett-Packard Laboratories, Palo Alto, CA, 1987.
78