Natural Language Recognition and Understanding


Natural Language Understanding
Raivydas Simenas
Overview

History
Speech Recognition
Natural Language Understanding
– statistical methods to resolve ambiguities
Current situation
History

Roots in teaching the deaf to speak using “visible speech”
1874: Alexander Bell’s invention of the harmonic telegraph
– Different frequency harmonics of an electrical signal could be separated
– Multiple messages could be sent over the same wire at the same time
1940s: separating the speech signal into different frequency components using the spectrogram
1950s: the beginning of computer use for automatic speech recognition
The Nature of Speech

Phoneme – a basic sound, e.g. a vowel
The complexity of the human vocal apparatus: about 18 phonemes per second
Speech viewed as a sound wave
Identifying sounds: analyzing the sound wave into its frequency components (see the sketch below)
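A minimal sketch of that last point: a waveform is synthesized from two known frequency components (200 Hz and 800 Hz, arbitrary illustration values not taken from the talk) and a discrete Fourier transform recovers them.

```python
import numpy as np

# Synthesize a 1-second "sound wave" from two known frequency components
# (200 Hz and 800 Hz are arbitrary illustration values).
fs = 8000                      # sample rate in Hz
t = np.arange(fs) / fs         # 1 second of sample times
wave = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 800 * t)

# Analyze the wave into its frequency components with the FFT.
spectrum = np.abs(np.fft.rfft(wave))
freqs = np.fft.rfftfreq(len(wave), d=1 / fs)

# The two strongest components sit at (roughly) 200 Hz and 800 Hz.
peaks = freqs[np.argsort(spectrum)[-2:]]
print(sorted(peaks.tolist()))   # -> [200.0, 800.0]
```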
The Spectrogram I

A visual representation of speech which contains all the salient information
Plots the amount of energy at different frequencies against time (see the sketch below)
Discontinuous speech (making a pause after each word) – easier to recognize on the spectrogram
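A minimal sketch of how such an energy-versus-frequency-versus-time plot can be computed; `scipy.signal.spectrogram` and the synthetic two-segment signal with a pause are my own illustration choices, not part of the talk.

```python
import numpy as np
from scipy.signal import spectrogram

fs = 8000
t = np.arange(fs) / fs
# Two "words" separated by a pause: a 300 Hz segment, silence, then a 1200 Hz segment.
signal = np.concatenate([
    np.sin(2 * np.pi * 300 * t[: fs // 3]),
    np.zeros(fs // 3),
    np.sin(2 * np.pi * 1200 * t[: fs // 3]),
])

# Energy at different frequencies plotted against time:
# f is the frequency axis, times is the time axis, Sxx is the energy grid.
f, times, Sxx = spectrogram(signal, fs=fs, nperseg=256)
print(Sxx.shape)                    # (frequency bins, time frames)
# The pause shows up as a run of frames with almost no energy,
# which is what makes discontinuous speech easier to segment.
print(Sxx.sum(axis=0).round(2))
```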
The Spectrogram II

The same word uttered twice (especially by different speakers – speaker independence) might look radically different on a spectrogram
The need to recognize invariant features in a spectrogram
Formants: resonant frequencies sustained for a short time period in pronouncing a vowel
Normalization: distinguishing between relevant and irrelevant information
Nonlinear time compression: taking care of the changing speed of speech
Matching a spoken word to a template (see the dynamic time warping sketch below)
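Nonlinear time compression when matching a word against a stored template is commonly implemented with dynamic time warping; below is a minimal DTW sketch over 1-D feature sequences (the sequences and costs are invented for illustration, not taken from the talk).

```python
import numpy as np

def dtw_distance(template, utterance):
    """Dynamic time warping: align two sequences that differ in speaking rate."""
    n, m = len(template), len(utterance)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(template[i - 1] - utterance[j - 1])
            # Each cell may extend a match, skip a template frame, or skip an input frame.
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    return cost[n, m]

# The same "word" spoken twice as slowly still matches its template better
# than a different word does, despite the change in speed.
template = [1, 3, 5, 3, 1]
slow = [1, 1, 3, 3, 5, 5, 3, 3, 1, 1]
other = [5, 5, 1, 1, 5, 5]
print(dtw_distance(template, slow), dtw_distance(template, other))
```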
Robust Speech Recognition

Need to maintain accuracy when the quality of the input speech is degraded or when the speech characteristics differ due to a change in environment or speakers
Dynamic parameter adaptation: either alter the input signal or the internally stored representations
– Optimal parameter estimation: based on a statistical model characterizing the differences between training and test sets
– Empirical feature comparison: based on a comparison between high-quality speech and the same speech recorded under degraded conditions
Stochastic Methods in Speech Recognition

Generating the sequence of word hypotheses for an acoustic signal is most often done using statistics
The process:
– A sequence of acoustic signals is represented using a collection of vectors
– Such collections are used to build acoustic word models, which consist of probabilities of certain sequences of vectors representing a word
Acoustic word models utilize Markov chains (see the sketch below)
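A minimal sketch of how such a Markov-chain word model can score an incoming sequence of quantized acoustic observations using the forward algorithm; the three-state topology and every probability below are invented for illustration.

```python
import numpy as np

# A toy acoustic word model: 3 states traversed left to right, each emitting
# one of a few discrete symbols (stand-ins for quantized feature vectors).
initial = np.array([1.0, 0.0, 0.0])
transition = np.array([[0.6, 0.4, 0.0],
                       [0.0, 0.7, 0.3],
                       [0.0, 0.0, 1.0]])
emission = np.array([[0.8, 0.1, 0.1],   # state 0 mostly emits symbol 0
                     [0.1, 0.8, 0.1],   # state 1 mostly emits symbol 1
                     [0.1, 0.1, 0.8]])  # state 2 mostly emits symbol 2

def word_likelihood(observations):
    """Forward algorithm: probability that this word model produced the observations."""
    alpha = initial * emission[:, observations[0]]
    for symbol in observations[1:]:
        alpha = (alpha @ transition) * emission[:, symbol]
    return alpha.sum()

# A sequence that walks through the expected symbols scores much higher than
# one that does not, which is how competing word models are compared.
print(word_likelihood([0, 0, 1, 2]), word_likelihood([2, 2, 0, 1]))
```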
Representing Sentences

Syntactic form: indicates the way the words are related to each other in a sentence
Logical form: identifies the semantic relationships between words based solely on knowledge of the language (independently of the situation)
Final meaning representation: maps the information from the syntactic and logical form into a knowledge representation
– The system uses the knowledge representation to represent and reason about its application domain
Parsing a Sentence

Parsing – determining the structure of the sentence according to the grammar
Tree representation of a sentence
Transition network grammars (see the sketch below)
– Start with the initial node
– Can traverse an arc only if it is labeled with an appropriate category
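A minimal sketch of a transition network for a simple noun phrase; the states, arcs, and toy lexicon are invented for illustration.

```python
# A toy transition network for a noun phrase: start at state "NP0";
# an arc can be traversed only if the next word carries the arc's category.
network = {
    "NP0": [("ART", "NP1"), ("N", "DONE")],
    "NP1": [("ADJ", "NP1"), ("N", "DONE")],
    "DONE": [],
}
lexicon = {"the": "ART", "a": "ART", "big": "ADJ", "dog": "N", "flies": "N"}

def accepts(words, state="NP0"):
    """Traverse the network, consuming one word per arc; succeed if all words are used."""
    if not words:
        return state == "DONE"
    category = lexicon.get(words[0])
    return any(accepts(words[1:], nxt)
               for cat, nxt in network[state] if cat == category)

print(accepts(["the", "big", "dog"]))  # True
print(accepts(["big", "the", "dog"]))  # False
```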
Stochastic Methods for Ambiguity Resolution I

Some sentences can be parsed in many different ways, e.g. “time flies like an arrow”
The most popular method for resolving such ambiguities is based on statistics
Some facts from probability theory
– The concept of the random variable, e.g. the lexical category of “flies”
– Probability function
  – assigns a probability to every possible value of the random variable, e.g. 0.3 for “flies” being a noun, 0.7 for its being a verb
  – conditional probability functions (Pr(A|B)), e.g. the probability of the occurrence of a verb given the fact that a noun has already occurred
Stochastic Methods for Ambiguity Resolution II

Probabilities are used to predict future events given some data about the past
Maximum likelihood estimator (MLE)
– Probability of X happening in the future = number of cases of X happening in the past / total number of events in the past
– Works well only if X occurred often; not very useful for low-frequency events
Expected likelihood estimator (ELE)
– Probability of X happening in the future = f(number of cases of X happening in the past) / Sum(f(number of cases of some event happening in the past)), e.g. if f(Pr(X)) = Pr(X) + 0.5 and we know that Pr(X) = 0.4 and Pr(Y) = 0.6, then ELE(X) = (0.4 + 0.5)/(0.4 + 0.5 + 0.6 + 0.5) = 0.45
– MLE is a special case of ELE, i.e. for MLE f(Pr(X)) = Pr(X)
Given a large amount of text, one can use MLE or ELE to determine the lexical category of an ambiguous word, e.g. the word “flies” (see the sketch below)
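The sketch below reproduces the slide's ELE arithmetic and contrasts it with MLE; following the slide, the smoothing function f is applied to the observed relative frequencies 0.4 and 0.6 (the counts 4 and 6 are my own choice of numbers that yield those frequencies).

```python
def mle(counts):
    """Maximum likelihood: relative frequency of each outcome."""
    total = sum(counts.values())
    return {x: c / total for x, c in counts.items()}

def ele(counts, f=lambda p: p + 0.5):
    """Expected likelihood: apply f to each estimate, then renormalize (as on the slide)."""
    raw = mle(counts)
    total = sum(f(p) for p in raw.values())
    return {x: f(p) / total for x, p in raw.items()}

# "flies" tagged as a noun 4 times and as a verb 6 times in some training text,
# so that Pr(noun) = 0.4 and Pr(verb) = 0.6 as in the slide's example.
counts = {"noun": 4, "verb": 6}
print(mle(counts))   # {'noun': 0.4, 'verb': 0.6}
print(ele(counts))   # {'noun': 0.45, 'verb': 0.55}
```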
Stochastic Methods for Ambiguity Resolution III

Always choosing the interpretation that occurs most frequently in the training set obtains on average a 90% success rate (not good)
Some of the local context should be used to determine the lexical category of a word
– Ideally, for a sequence of words w1, w2, …, wn we want a lexical category sequence c1, c2, …, cn which maximizes the probability of the right interpretation
– In practice, approximations of such probabilities are made
Stochastic Methods for Ambiguity Resolution IV

n-gram models
– Look at the probability of a lexical category Ci which follows the sequence of lexical categories Ci-1, Ci-2, …, Ci-n+1
– Probability of c1, c2, …, ck occurring is approximately the product of n-gram probabilities for each word, e.g. the probability of the sequence ART, N, V is 0.71 * 1 * 0.43 = 0.3053 (see the sketch below)
– In practice, bigram or trigram models are used most often
The models capturing this concept are called Hidden Markov Models
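A minimal sketch of the bigram product behind the slide's example; the values 0.71, 1 and 0.43 are the ones quoted above (presumably Pr(ART at the start), Pr(N | ART) and Pr(V | N)), and the start marker `<s>` is my own notation for the beginning of the sentence.

```python
# Bigram probabilities for lexical categories; only the three values quoted
# on the slide are filled in, the rest of the table is not given in the talk.
bigram = {
    ("<s>", "ART"): 0.71,
    ("ART", "N"): 1.0,
    ("N", "V"): 0.43,
}

def sequence_probability(categories):
    """Approximate Pr(c1, ..., ck) as the product of bigram probabilities."""
    prob = 1.0
    previous = "<s>"
    for category in categories:
        prob *= bigram.get((previous, category), 0.0)
        previous = category
    return prob

print(sequence_probability(["ART", "N", "V"]))   # 0.71 * 1.0 * 0.43 = 0.3053
```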
Stochastic Methods for Ambiguity Resolution V

In order to determine the most likely interpretation of a given sequence of n words, we want to maximize the value of $\prod_{i=1}^{n} \Pr(C_i \mid C_{i-1}) \cdot \Pr(w_i \mid C_i)$
The Viterbi algorithm (see the sketch below)
– Given k lexical categories, the total number of possibilities to consider for a sequence of n words is k^n
– The Viterbi algorithm reduces this number to const * n * k^2
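A minimal Viterbi sketch for maximizing the product above over category sequences; the probability tables are invented stand-ins, and the point is that the dynamic program does work on the order of n * k^2 rather than enumerating k^n sequences.

```python
# Minimal Viterbi tagger: every probability below is an invented stand-in.
categories = ["N", "V"]
start = {"N": 0.6, "V": 0.4}                        # Pr(C1)
trans = {("N", "N"): 0.3, ("N", "V"): 0.7,          # Pr(Ci | Ci-1)
         ("V", "N"): 0.8, ("V", "V"): 0.2}
emit = {("time", "N"): 0.7, ("time", "V"): 0.3,     # Pr(wi | Ci)
        ("flies", "N"): 0.4, ("flies", "V"): 0.6}

def viterbi(words):
    """Best category sequence; work grows as len(words) * k**2, not k**len(words)."""
    best = {c: (start[c] * emit[(words[0], c)], [c]) for c in categories}
    for word in words[1:]:
        new_best = {}
        for c in categories:
            score, path = max(
                (best[p][0] * trans[(p, c)] * emit[(word, c)], best[p][1] + [c])
                for p in categories)
            new_best[c] = (score, path)
        best = new_best
    return max(best.values())

print(viterbi(["time", "flies"]))   # highest-scoring category sequence and its probability
```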
Logical Form

Although interpreting a sentence often requires knowledge of the context, some interpretation can be done independently of it
– basic semantic properties of a word, its different senses, etc.
Ontology
– each word has one or more senses in which it can be used, e.g. “go” has about 40 senses
– the different senses of all the words of a natural language are organized into classes of objects, such as events, actions, etc.
– the set of such classes is called an ontology (see the sketch below)
The logical form of an utterance can be viewed as a function that maps the current discourse situation into a new one resulting from the occurrence of the utterance
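One way to picture the word/sense/ontology-class layering in code; the senses and class names below are invented examples (a real lexicon would list about 40 senses for “go”).

```python
# Each word maps to one or more senses; each sense belongs to an ontology class.
senses = {
    "go": ["go/motion", "go/functioning"],      # "go home" vs "the clock goes"
    "flies": ["fly/insect", "fly/motion"],
}
ontology = {
    "go/motion": "ACTION", "go/functioning": "EVENT",
    "fly/insect": "OBJECT", "fly/motion": "ACTION",
}

# Context-independent interpretation can at least enumerate a word's senses
# and the ontology classes they fall under.
for sense in senses["go"]:
    print(sense, "->", ontology[sense])
```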
Current Situation

Inexpensive software for speech recognition
The issues: large vocabulary, continuous speech and speaker independence
Automated speech recognition for restricted domains
The speed of serial processes in a computer vs. the number of parallel processes in the human brain
References

Survey of the State of the Art in Human Language Technology, edited by Ronald A. Cole, 1996
James Allen. Natural Language Understanding, 1995
Raymond Kurzweil. “When Will HAL Understand What We Are Saying? Computer Speech Recognition and Understanding.” In HAL’s Legacy, 1996