
KI2 – 5
Grammar inference
Lambert Schomaker
Kunstmatige Intelligentie / RuG
Grammar inference (GI)
- methods aimed at uncovering the grammar which underlies an observed sequence of tokens
- two variants:
  - explicit, formal GI (deterministic token generators)
  - implicit, statistical GI (stochastic token generators)
Grammar inference

AABBCCAA..(?).. what's next?

ABA -> 1A 1B 1A
AABBAA -> 2A 2B 2A
or
AAB (+ mirror-symmetric)
(2A B) (+ mirrored)

Operations: repetition, mirroring, insertion, substitution
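The step from AABBAA to 2A 2B 2A is run-length coding of the token string; a minimal sketch (the function name is mine):

```python
from itertools import groupby

def run_lengths(s):
    """Collapse a token string into (count, token) runs, e.g. AABBAA -> 2A 2B 2A."""
    return [(len(list(g)), t) for t, g in groupby(s)]

print(run_lengths("AABBAA"))  # [(2, 'A'), (2, 'B'), (2, 'A')]
```

The mirrored description (2A B)(+ mirrored) would then be a second, shorter rewrite of this run-length form.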
Strings of tokens
- DNA: ACTGAGGACCTGAC…
- output of speech recognizers
- words from an unknown language
- tokenized patterns in the real world
[Figure: a real-world pattern tokenized as A, B, A, whose structure can be described as Symm(B, A)]
GI
- induction of structural patterns from observed data
- representation by a formal grammar

versus:

- emulating the underlying grammar without making the rules explicit (NN, HMM)
GI, the engine

Data -> [Grammar Induction] -> Grammatical rules

aaabbb -> (seq (repeat 3 a) (repeat 3 b))
ab     -> (seq a b)
abccba -> (symmetry (repeat 2 c) (seq a b))
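The rule notation can be made executable. Below is a small interpreter for a tuple encoding of these s-expressions; seq, repeat and symmetry come from the slide, but the encoding and the exact meaning of symmetry (middle part plus mirrored wings) are my assumptions:

```python
def expand(rule):
    """Expand a toy grammar expression (nested tuples) into the string it generates.

    Supported forms:
      ('seq', r1, r2, ...)   concatenation of sub-rules
      ('repeat', n, r)       n copies of r
      ('symmetry', mid, r)   r, then mid, then r mirrored
      'a'                    a literal token
    """
    if isinstance(rule, str):
        return rule
    op, *args = rule
    if op == 'seq':
        return ''.join(expand(r) for r in args)
    if op == 'repeat':
        n, r = args
        return expand(r) * n
    if op == 'symmetry':
        mid, r = args
        wing = expand(r)
        return wing + expand(mid) + wing[::-1]
    raise ValueError(op)

print(expand(('symmetry', ('repeat', 2, 'c'), ('seq', 'a', 'b'))))  # abccba
```

Grammar induction is the inverse of this expansion: finding such an expression given only the strings.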
The hypothesis behind GI

Generator process G0 -> Data (aaabbb, ab, abccba) -> [Grammar Induction] -> G'

Goal: find G' ≈ G0. It is not claimed that G0 actually 'exists'.
Learning
- Until now it was implicitly assumed that the data consists of positive examples
- A very large amount of data is needed to induce an underlying grammar
- It is difficult to find a good approximation to G0 if there are no negative examples, e.g. "aaxybb does NOT belong to the grammar"
Learning…

Generator process G0 -> sample1, sample2, sample3, …, sampleN -> [Grammar Induction] -> G1, G1+2, G1+2+3, …, G*

Convergence G* = G0 is assumed for infinite N.
More realistic: a PAC (probably approximately correct) G*.
PAC GI
- L(G0): the language generated by G0
- L(G*): the language explained by G*

P[ p(L(G0) xor L(G*)) < ε ] > 1 - δ

The probability that "the probability of finding elements in the symmetric difference L(G0) xor L(G*) is smaller than ε" is larger than 1 - δ.
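The bound can be made concrete on finite toy languages: ε bounds the probability mass that falls in the symmetric difference of the two languages. The sets and the distribution below are invented purely for illustration:

```python
def disagreement_mass(L0, Lstar, p):
    """Probability mass p(.) on the symmetric difference L0 xor Lstar."""
    return sum(p.get(s, 0.0) for s in L0 ^ Lstar)

L0    = {'aa', 'aba', 'abba'}                 # language of the (hypothetical) G0
Lstar = {'aa', 'aba', 'abba', 'abbba'}        # G* overgeneralizes by one string
p     = {'aa': 0.5, 'aba': 0.3, 'abba': 0.15, 'abbba': 0.05}

print(disagreement_mass(L0, Lstar, p))  # 0.05: G* is epsilon-close for any epsilon > 0.05
```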
Example

S+ = { aa, aba, abba, abbba }

[Figure: the prefix-tree acceptor for S+ is built up string by string, and its states are merged into a compact automaton that covers all of S+]
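A common route from such a positive sample to a grammar is to build a prefix-tree acceptor and merge states. Assuming the merged automaton generalizes S+ to the language a b* a (my reading of this example, not stated explicitly on the slides), a regular expression can stand in for the automaton:

```python
import re

S_plus = ['aa', 'aba', 'abba', 'abbba']

# Hypothesised result of state merging: 'a', then any number of 'b', then 'a'.
G_star = re.compile(r'^ab*a$')

assert all(G_star.match(s) for s in S_plus)  # covers every positive example
assert G_star.match('abbbba')                # generalizes beyond the sample
assert not G_star.match('aaxybb')            # the negative example from the slides
```

Without negative examples, nothing in S+ itself rules out an even broader generalization; the regex above is only one PAC-style hypothesis.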
Many GI approaches are known (Dupont, 1997).
Second group: Grammar Emulation
- statistical methods, aiming at producing token sequences with the same statistical properties as the generator grammar G0
  1. recurrent neural networks
  2. Markov models
  3. hidden Markov models
Grammar emulation, training

ABGBABGACTVYAB <x> . . .

A context window over the observed token sequence is fed to the grammar emulator, which is trained to predict the next token x.
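This training setup amounts to harvesting (context window, next token) pairs from the sequence; a minimal sketch (the function name is mine):

```python
def training_pairs(tokens, width):
    """Slide a fixed-width context window over the sequence; each window
    is paired with the token immediately after it as the prediction target."""
    return [(tokens[i:i + width], tokens[i + width])
            for i in range(len(tokens) - width)]

pairs = training_pairs('ABGBABGACT', 3)
print(pairs[0])  # ('ABG', 'B')
```

Any of the emulators below (recurrent networks, Markov models) can be trained on pairs produced this way.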
Recurrent neural networks for grammar emulation
- Major types:
  - Jordan (output-layer recurrence)
  - Elman (hidden-layer recurrence)
Jordan MLPs
- Assumption: the current state is represented by the output-unit activation at the previous time step(s) and by the current input

[Figure: MLP with the output at time t fed back into the input state]
Elman MLPs
- Assumption: the current state is represented by the hidden-unit activation at the previous time step(s) and by the current input

[Figure: MLP with the hidden layer at time t fed back into the input state]
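The Elman recurrence can be sketched as a single forward step in plain Python. The sizes and weights below are toy values of my own choosing; this illustrates the update rule only, not a trained model:

```python
import math

def elman_step(x, h_prev, W_in, W_ctx, b):
    """One Elman step: h_t = tanh(W_in·x + W_ctx·h_{t-1} + b).

    The previous hidden activations h_prev act as the context units."""
    h = []
    for i in range(len(b)):
        a = b[i]
        a += sum(W_in[i][j] * x[j] for j in range(len(x)))
        a += sum(W_ctx[i][j] * h_prev[j] for j in range(len(h_prev)))
        h.append(math.tanh(a))
    return h

# Toy network: 2 inputs, 2 hidden units (all weights are arbitrary illustrations)
W_in  = [[0.5, -0.2], [0.1, 0.4]]
W_ctx = [[0.3, 0.0], [0.0, 0.3]]
b     = [0.0, 0.0]

h = [0.0, 0.0]                 # initial context (zero state)
for x in ([1, 0], [0, 1]):     # a two-step input sequence
    h = elman_step(x, h, W_in, W_ctx, b)
```

A Jordan network differs only in what is fed back: the previous output activations instead of h_prev.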
Markov variants
- Shannon: a fixed 5-letter window over English text to predict the next letter
- Variable-Length Markov Models (VLMM) (Guyon & Pereira): the width of the context window used to predict the next token is variable and depends on the statistics of the sequence
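The VLMM idea in miniature: count next-letter frequencies for contexts of several lengths, and at prediction time back off to the longest context that was actually observed. This is a strong simplification of Guyon and Pereira's scheme, and the function names are mine:

```python
from collections import Counter, defaultdict

def train(text, max_order):
    """Count next-letter frequencies for every context of length 0..max_order."""
    counts = defaultdict(Counter)
    for i in range(len(text)):
        for k in range(max_order + 1):
            if i - k >= 0:
                counts[text[i - k:i]][text[i]] += 1
    return counts

def predict(counts, history, max_order):
    """Back off from the longest context to shorter ones until one was observed."""
    for k in range(min(max_order, len(history)), -1, -1):
        ctx = history[len(history) - k:]
        if counts[ctx]:
            return counts[ctx].most_common(1)[0][0]

counts = train("abab", 2)
print(predict(counts, "ab", 2))  # 'a': the longest observed context 'ab' wins
```

Sampling from such a model instead of taking the most frequent letter yields texts like the news-trained example on the next slide.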
Results

Example output of a letter-level VLMM, trained on news item texts (250 MB training set):
“liferator member of flight since N. a report the
managical including from C all N months after dispute.
C and declaracter leaders first to do a lot of though a
ground out and C C pairs due to each planner of the
lux said the C nailed by the defender begin about in N.
the spokesman standards of the arms responded
victory the side honored by the accustomers was
arrest two mentalisting the romatory accustomers of
ethnic C C the procedure. “