Transcript Slides
Jamie Frost – Franks Society MT10
What is language?
What is a language?
Wikipedia
“A set of symbols of
communication and
the elements used to
manipulate them.”
OED
“The system of spoken or
written communication used
by a particular country,
people, community, etc.,
typically consisting of words
used within a regular
grammatical and syntactic
structure.”
What is a language
• An alphabet
The possible symbols that each ‘unit’ in the
language can take.
For human languages, the alphabet may be at a
character level.
Or we could choose it to be at a word level...
What is a language
• An alphabet
•...can form a ‘string’
Σ² = Σ × Σ gives us all the possible pairs of symbols.
Σ* = {λ} ∪ Σ ∪ (Σ × Σ) ∪ (Σ × Σ × Σ) ∪ ...
is known as the Kleene Star, and gives us all possible
strings, i.e. containing any combination of symbols and
of any length.
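A small sketch (my own Python illustration, not from the slides) of building Σ, Σ², Σ³, ... and taking their union; since Σ* itself is infinite, we cap the string length:

```python
from itertools import product

def kleene_star(alphabet, max_len):
    """Enumerate Σ* in order of length, up to max_len (Σ* itself is infinite)."""
    strings = [""]  # λ, the empty string
    for n in range(1, max_len + 1):
        # Σⁿ = Σ × Σ × ... × Σ (n times): every length-n combination of symbols
        strings += ["".join(p) for p in product(sorted(alphabet), repeat=n)]
    return strings

print(kleene_star({"a", "b"}, 2))  # → ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
```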
What is a language
• An alphabet
•...can form a ‘string’
•...which can be constrained by a grammar
Any sensible language doesn’t allow an unrestricted
combination of symbols. Human languages are bounded
by some grammatical structure.
Mint.
Grammars
So how do we define a grammar?
Grammars limit our possible strings to certain forms.
These vary in expressiveness – the more expressive
they are, the harder it is to do certain common tasks
with them.
Tasks might include:
• “finding the grammatical structure of a string given a grammar”, or
• “does a string satisfy a given grammar?”, or
• “are two grammars equivalent?”
(Diagram: task complexity plotted against expressiveness.)
The Chomsky Hierarchy
In 1956, Noam Chomsky characterised languages according
to how ‘complex’ or expressive they are.
Grammar   Language
Type-0    Recursively Enumerable
Type-1    Context Sensitive
Type-2    Context Free
Type-3    Regular
This is known as the Chomsky Hierarchy.
A language that satisfies a given type is also an instance of all
the types above it in the table.
A Formal Grammar
Consists of:
• A start symbol (S ∈ N)
• Terminal symbols (i.e. our alphabet Σ)
• Non-terminal symbols (N)
• Production rules

Example production rules:
1. S ⟶ i love T
2. T ⟶ T and T
3. T ⟶ smurfs
4. T ⟶ smurfettes
A Formal Grammar
Think of it as a game...
1. S ⟶ i love T
2. T ⟶ T and T
3. T ⟶ smurfs
4. T ⟶ smurfettes
A) Start with start symbol
S
B) We can use the production
rules to replace things.
⟶ i love T
⟶ i love T and T
⟶ i love smurfs and T
⟶ i love smurfs and T and T
⟶ i love smurfs and smurfs and T
⟶ i love smurfs and smurfs and smurfettes
C) We’re not allowed to
‘finish’ until we only have
terminal symbols.
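A sketch of the ‘game’ in Python (my own illustration), replaying the derivation above by always rewriting the leftmost matching non-terminal:

```python
# The four production rules from the slide, numbered as on the slide.
RULES = [
    ("S", "i love T"),
    ("T", "T and T"),
    ("T", "smurfs"),
    ("T", "smurfettes"),
]
NONTERMINALS = {"S", "T"}

def derive(rule_numbers):
    """Play the game: start with the start symbol S, then apply the chosen
    rules in order, each time rewriting the leftmost matching non-terminal."""
    tokens = ["S"]
    for n in rule_numbers:
        lhs, rhs = RULES[n - 1]
        pos = tokens.index(lhs)            # leftmost occurrence of the LHS
        tokens[pos:pos + 1] = rhs.split()  # replace it with the RHS
    # C) we're not allowed to finish until only terminal symbols remain
    assert not any(t in NONTERMINALS for t in tokens)
    return " ".join(tokens)

print(derive([1, 2, 3, 2, 3, 4]))  # → "i love smurfs and smurfs and smurfettes"
```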
Regular Grammars
The most restrictive.
The LHS of the production rules can only be a single
non-terminal.
The RHS of the production rules can be one of (a) a
single terminal symbol (b) a single non-terminal symbol
(c) a terminal followed by a non-terminal or (d) the
empty symbol.
The idea is that you don’t
have ‘memory’ of the
symbols you’ve
previously emitted in the
string.
Regular Grammars
Example grammar (rules reconstructed from the derivation below; the slide showed them as a figure):
S ⟶ aS
S ⟶ aT
T ⟶ bT
T ⟶ b
Example Generation:
S
⟶ aS
⟶ aaS
⟶ aaaT
⟶ aaabT
⟶ aaabb
Notice we’re
always generating
at the end of the
string.
Regular Grammar
Example: if Σ = {a,b}, consider the language where two
b’s are never seen next to each other.
e.g. a, ababa, baa, b, aaa, ...
As a regular expression: L = a*(baa*)*(b|ϵ)
Often helpful to draw a picture:
(Diagram: a start state with an ‘a’ loop, a ‘b’ edge to a second state, and an ‘a’ edge back.)
This kind of diagram is known as a ‘nondeterministic finite automaton’ or NFA.
Regular Grammar
We can use this picture to work out the regular grammar (reconstructed from the diagram, with S as the start state and T as the state reached after a ‘b’):
S ⟶ aS
S ⟶ bT
S ⟶ ϵ
T ⟶ aS
T ⟶ ϵ
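A sketch of simulating this NFA in Python. The two-state machine here is my reading of the diagram (an ‘a’ loop on the start state, a ‘b’ edge to a second state, an ‘a’ edge back, both states accepting):

```python
# q0: start, accepting ("last symbol was not b"); q1: accepting ("last symbol was b").
# There is no b-edge out of q1, so two b's in a row kill every run.
TRANSITIONS = {("q0", "a"): {"q0"}, ("q0", "b"): {"q1"}, ("q1", "a"): {"q0"}}
START, ACCEPTING = {"q0"}, {"q0", "q1"}

def accepts(string):
    """Simulate the NFA: track the set of states reachable after each symbol."""
    states = set(START)
    for sym in string:
        states = {q for s in states for q in TRANSITIONS.get((s, sym), set())}
    return bool(states & ACCEPTING)

for s in ["a", "ababa", "baa", "b", "aaa", "bb", "abba"]:
    print(s, accepts(s))
```

The strings from the slide (a, ababa, baa, b, aaa) are all accepted; anything containing “bb” is rejected.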
It’s Voting Time...
The language of palindromes,
i.e. strings which are the same when
reversed, e.g. “madam”, “acrobats stab
orca”.
{ aⁿbⁿ | n ≥ 1 }
i.e. ab, aabb, aaabbb, aaaabbbb, ...
Neither is regular.
The problem is that we cannot ‘remember’ the symbols already emitted.
We can use something called the pumping lemma to check if a language is regular.
Context Free Grammars
The restriction on the RHS of the production rules is
now loosened; we can have any combination of non-terminals and terminals.
We still restrict the LHS, however, to a single non-terminal. This is why the grammar is known as
“context free”: the production is not dependent
on the context in which it occurs.
While generating a string:
...
⟶ abXd
⟶ abyd
The production rule which allows the X to
become a y is not contingent on the
context, i.e. the preceding b or the
following d.
Context Free Grammars
Examples
Example grammar (reconstructed from the derivation; the slide showed the rules as a figure):
S ⟶ aSa
S ⟶ cSc
S ⟶ b
Example Generation:
S
⟶ aSa
⟶ acSca
⟶ acbca
Context Free Grammars
Examples
Example grammar (reconstructed from the derivation): S ⟶ aSb and S ⟶ ϵ
Example generation:
S
⟶ aSb
⟶ aaSbb
⟶ aaaSbbb
⟶ aaabbb
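The derivation above can be sketched in Python. The rules S ⟶ aSb and S ⟶ ϵ are my reconstruction from the derivation (strictly they also generate the empty string, i.e. n = 0); the recogniser illustrates the ‘memory’ a regular grammar lacks, using a stack in the spirit of a pushdown automaton:

```python
def generate(n):
    """Derive aⁿbⁿ: apply S → aSb n times, then S → ϵ."""
    s = "S"
    for _ in range(n):
        s = s.replace("S", "aSb")   # S → aSb
    return s.replace("S", "")       # S → ϵ

def recognise(string):
    """Recognising aⁿbⁿ needs a stack, not just finite states:
    push for each 'a', pop for each 'b'."""
    stack, seen_b = [], False
    for c in string:
        if c == "a":
            if seen_b:
                return False        # an 'a' after a 'b' is never aⁿbⁿ
            stack.append(c)
        elif c == "b":
            seen_b = True
            if not stack:
                return False        # more b's than a's
            stack.pop()
        else:
            return False
    return not stack and seen_b     # all a's matched, and n ≥ 1

print(generate(3))          # → "aaabbb"
print(recognise("aaabbb"))  # → True
```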
It’s Voting Time...
{ aⁿbⁿcⁿ | n ≥ 1 }
i.e. abc, aabbcc, aaabbbccc, ...
Nope.
A bit harder to see this time.
Can use a variant of the Pumping Lemma called the Bar-Hillel Lemma to show it isn’t.
(But an informal explanation: we could have non-terminals at the a-b and b-c boundaries generating
these pairs, but since our grammar is context free these non-terminals expand independently of each
other; thus we can only ensure that a and b have the same count, or that b and c do. And we can’t have a rule of the
form S ⟶ X abc Y, because then we can’t subsequently increase the number of b’s.)
Context-Sensitive Grammars
Now an expansion of a non-terminal is dependent on
the context it appears in.
Example generation (using rules reconstructed from the derivation: S ⟶ aSBC, S ⟶ aBC, CB ⟶ HB, HB ⟶ BC, aB ⟶ ab, bB ⟶ bb, bC ⟶ bc, cC ⟶ cc):
S
⟶ aSBC
⟶ aaBCBC
⟶ aaBHBC
⟶ aaBBCC
⟶ aabBCC
⟶ aabbCC
⟶ aabbcC
⟶ aabbcc
i.e. a ‘C’ can change into a ‘c’ only when preceded by another ‘c’. Note
that this context (i.e. this preceding ‘c’) must remain unchanged.
Preservation of context is the only restriction in CSGs.
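As a sketch, we can test derivability mechanically with a breadth-first search over sentential forms. The rule set below is my reconstruction from the derivation above (in particular, H is assumed to be the helper non-terminal that lets C and B swap while preserving context):

```python
from collections import deque

# Rules reconstructed from the derivation on the slide (an assumption on my part).
RULES = [("S", "aSBC"), ("S", "aBC"), ("CB", "HB"), ("HB", "BC"),
         ("aB", "ab"), ("bB", "bb"), ("bC", "bc"), ("cC", "cc")]

def derivable(target):
    """Breadth-first search over sentential forms: can the grammar derive `target`?
    No rule shrinks the string, so capping the length keeps the search finite."""
    max_len = len(target)
    seen, queue = {"S"}, deque(["S"])
    while queue:
        form = queue.popleft()
        if form == target:
            return True
        for lhs, rhs in RULES:
            start = 0
            while (i := form.find(lhs, start)) != -1:
                new = form[:i] + rhs + form[i + len(lhs):]
                start = i + 1
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
    return False

print(derivable("aabbcc"))  # → True
```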
The Chomsky Hierarchy
Grammar   Language                  Automaton                          Rules
Type-0    Recursively enumerable    Turing machine                     α ⟶ β
Type-1    Context-sensitive         Linear-bounded non-deterministic   αAβ ⟶ αγβ
                                    Turing machine
Type-2    Context-free              Non-deterministic pushdown         A ⟶ γ
                                    automaton
Type-3    Regular                   Finite state automaton             A ⟶ a and A ⟶ aB
                                    (the picture with circles and
                                    arrows we saw earlier)
English as a CFG
Before we get on to classifying English according to the
Chomsky Hierarchy, let’s see how English might be
represented as a CFG.
Our starting non-terminal S is a sentence. Since
sentences operate independently syntactically, it’s
sufficient to examine grammar on a sentence level.
Our terminals/alphabet Σ is just a dictionary in the
literal sense.
Σ = { a, aardvark, .... , zebra, zoology, zyzzyva }
English as a CFG
Our non-terminals are ‘constituents’, such as noun phrases,
verb phrases, verbs, determiners, prepositional phrases, etc.
These can be subdivided into further constituents (e.g. NP =
noun phrase), or generate terminals (e.g. V = verb).
Non-Terminal   Name                   Example
NP             Noun Phrase            the cat
VP             Verb Phrase            chastised the politician
PP             Prepositional Phrase   with the broccoli
CONJ           Conjunction            and
V              Verb                   chundered
ADV            Adverb                 everywhere
English as a CFG
Can use an American style ‘top-down’ generative
form of grammar.
S ⟶ NP VP
NP ⟶ DT N
NP ⟶ PN
VP ⟶ V NP
VP ⟶ VP PP
NP ⟶ NP PP
PP ⟶ P NP
NP ⟶ NP CONJ NP
DT ⟶ the
DT ⟶ a
N ⟶ monkey
N ⟶ student
N ⟶ telescope
PN ⟶ Corey
P ⟶ with
P ⟶ over
CONJ ⟶ and
CONJ ⟶ or
V ⟶ saw
V ⟶ ate
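As a sketch (my own illustration), the grammar above can be turned into a random sentence generator. VP ⟶ V NP is included too, since the derivations on the ambiguity slides use it, and a depth limit keeps the recursive NP/VP rules from running away:

```python
import random

# The toy grammar as Python data: non-terminal → list of possible expansions.
GRAMMAR = {
    "S":    [["NP", "VP"]],
    "NP":   [["DT", "N"], ["PN"], ["NP", "PP"], ["NP", "CONJ", "NP"]],
    "VP":   [["V", "NP"], ["VP", "PP"]],
    "PP":   [["P", "NP"]],
    "DT":   [["the"], ["a"]],
    "N":    [["monkey"], ["student"], ["telescope"]],
    "PN":   [["Corey"]],
    "P":    [["with"], ["over"]],
    "CONJ": [["and"], ["or"]],
    "V":    [["saw"], ["ate"]],
}

def generate(symbol="S", depth=0):
    """Top-down generation: expand non-terminals recursively; past a depth
    limit, take the shortest expansion so the recursion terminates."""
    if symbol not in GRAMMAR:
        return [symbol]                       # terminal: a dictionary word
    options = GRAMMAR[symbol]
    choice = random.choice(options) if depth < 4 else min(options, key=len)
    return [w for part in choice for w in generate(part, depth + 1)]

random.seed(0)
print(" ".join(generate()))
```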
Ambiguity
Curiously, it’s possible to generate a sentence in
multiple ways!
S
⟶ NP VP
⟶ PN VP
⟶ Corey VP
⟶ Corey VP PP
⟶ Corey V NP PP
⟶ Corey saw NP PP
⟶ ...
⟶ Corey saw the monkey
with the telescope.
S
⟶ NP VP
⟶ PN VP
⟶ Corey VP
⟶ Corey V NP
⟶ Corey saw NP
⟶ Corey saw NP PP
⟶ ...
⟶ Corey saw the monkey
with the telescope.
What does each of these derivations
mean semantically?
Ambiguity
We say that a formal grammar that can yield the same string from multiple
derivations is ‘ambiguous’.
So what kind of language...
Is a Regular Grammar sufficient?
Chomsky demonstrated this by showing that a particular part of
English grammar behaves akin to a palindrome. We saw
earlier that palindromes are not regular.
But there’s a nicer, more recent proof...
Embedded Structures (Yngve 60)
The cat likes tuna fish.
The cat the dog chased likes tuna fish.
The cat the dog the rat bit chased likes tuna fish.
The cat the dog the rat the elephant admired bit chased likes tuna fish.
Embedded Structures
The cat the dog the rat the elephant admired bit chased likes tuna fish.
If we let A = { the dog, the rat, the elephant } and
B = { admired, bit, chased }, then we can represent centre-embedding as
strings of the form: the cat aⁿ bⁿ likes tuna fish, where each a is drawn from A and each b from B.
But we already know from earlier that aⁿbⁿ is not regular!
So what kind of language...
Is a Context Free Grammar sufficient?
The fact that constituents in a sentence are syntactically
independent of each other naturally supports a grammar that
is ‘context free’.
A number of scholars initially argued that number agreement
(e.g. “Which problem did your professor say was unsolvable?”)
couldn’t be captured by a CFG, but they were just being a bit
dim. This can easily be incorporated by just introducing a few
new non-terminals (e.g. NP_PLUR, NP_SING).
Swiss German
A number of languages, such as Dutch and Swiss
German, allow for cross-serial dependencies
...mer d’chind em Hans es huus haend wele laa halfe aastriiche
...we the children/ACC Hans/DAT the house/ACC have wanted to let help paint.
DAT = Dative noun: the indirect object of a verb
(e.g. “Mary” in “John gave Mary the book”).
ACC = Accusative noun: the direct object of a verb
(e.g. “the book” in “John gave Mary the book”).
Swiss German
Shieber (1985) notes that among such sentences, those with all accusative
NPs preceding all dative NPs, and all accusative-subcategorising verbs
preceding all dative-subcategorising verbs are acceptable.
...mer d’chind em Hans es huus haend wele laa halfe aastriiche
...we the children/ACC Hans/DAT the house/ACC have wanted to let help paint.
The number of verbs requiring dative objects (halfe) must equal the
number of dative NPs (em Hans) and similarly for accusatives.
Can therefore put in form:
Swiss German
...mer d’chind em Hans es huus haend wele laa halfe aastriiche
...we the children/ACC Hans/DAT the house/ACC have wanted to let help paint.
aⁿbᵐcⁿdᵐ is not context free. It’s context-sensitive.
Summary
The Chomsky Hierarchy brings together different
grammar formalisms, listing them in increasing order of
expressiveness.
Languages don’t have to be human ones: they allow us to
generate strings with some given alphabet Σ, subject to
some grammatical constraints.
The least expressive grammars are Regular Grammars,
which are insufficient to represent the English language.
But Context Free Grammars are insufficient to represent
languages with cross-serial dependencies, such as Swiss
German.
Questions?