Regular Languages and Expressions

Download Report

Transcript Regular Languages and Expressions

Chapter 3
Regular languages and expressions
Operations on formal languages
Let L1 = {10} and L2 = {011, 11}.
Union:
L1  L2 = {10, 011, 11}
Concatenation: L1 L2 = {10011, 1011}
Kleene Star:
L1* = {λ, 10, 1010, 101010, … }
Other operations: intersection, complement, difference
Definition Of Regular Languages
A regular language over an alphabet  is one
that contains either a single string of length 0
or 1, or strings which can be obtained by
using the operations of union, concatenation,
or Kleene* on strings of length 0 or 1.
Alternative definition of regular languages
The simplest possible regular languages are the empty set
and languages consisting of a single string that is either
the empty string or has length one. For example; if
 = {a,b}, the simplest languages are , {λ}, {a}, and {b}.
A regular language is a language that can be built from
these simple languages, by using the three operations
of union, concatenation, and Kleene star.
Regular Languages correspond to
Regular Expressions
•
•
•
•
•
•
L=Ø
L = {}
L = {a}
L = L1  L2
L = L1 L2
L = L 1*
RE is Ø
RE = 
RE = a
RE = (r1 + r2)
RE = (r1r2)
RE = (r1*)
Regular expressions
A useful shorthand for describing regular languages.
Compare to arithmetic expressions, such as (x + 3)/2.
An arithmetic expression is constructed using arithmetic
operators, such as addition and division. A regular
expression is constructed using operations on languages,
such as concatenation, union, and Kleene star.
The value of an arithmetic expression is a number.
The value of a regular expression is a language.
Recursive definition of a regular expression
 is a regular expression corresponding to the language .
λ is a regular expression corresponding to the language {λ}.
For each symbol a  , a is a regular expression
corresponding to the language {a}.
For any regular expressions r and s, corresponding to
the regular languages L(r) and L(s), respectively,
each of the following is a regular expression:
(r + s) corresponds to the language L(r)  L(s)
(r · s) or (rs) corresponds to the language L(r)L(s)
(r*) corresponds to the language (L(r))*
Examples
a*+ b = {λ, a, b, aa, aaa, aaaa, aaaaa, … }
a*ba* = {w  * : w has exactly one b}
(a + b)* = any string of a’s and b’s
(a + b)*aa (a + b)* = {w  * : w contains aa}
(a + b)*aa (a + b)* + (a + b)*bb (a + b)* =
{w  * : w contains aa or bb}
(a + λ)b* = {abn : n  0} + {bn : n  0}
As with arithmetic expressions, there is an order of
precedence for operators -- unless you change it using
parentheses. The order is: star closure first, then
concatenation, then union.
More examples
All strings containing no more than two a’s:
(b + c)*(λ + a)(b + c)*(λ + a)(b + c)*
All strings containing no runs of a’s of length greater
than two:
(b + c)*(λ + a + aa)(b + c)*((b + c)(b + c)*(λ + a + aa)(b + c)*)*
All strings in which all runs of a’s have lengths that are
multiples of three:
(aaa + b + c)*
Hints for writing regular expressions
Assume  = {a, b, c}.
Zero or more a’s:
a*
One or more a’s:
aa*
Any string at all:
(a + b + c)*
Any nonempty string:
(a + b + c)(a + b + c)*
Any string that does not contain a:
(b + c)*
Any string containing exactly one a: (b + c)*a(b + c)*
Practice
Let  = {a,b,c}. Give a regular expression for the
following languages:
(a) all strings containing exactly two a’s
(b) all strings containing no more than three a’s
Practice
Let  = {a,b,c}. Give a regular expression for the
following languages:
(a) all strings containing exactly two a’s
(b + c)*a(b + c)*a(b + c)*
(b) all strings containing no more than three a’s
(b + c)*(λ + a)(b + c)*(λ + a)(b + c)*(λ + a)(b + c)*
Practice
What languages correspond to the following
regular expressions?
a*b
(aaa + bba)
(ab)*
More practice
Give regular expressions for the following languages,
where the alphabet is  = {a, b, c}.
--all strings ending in b
--all strings containing no more than two a’s
-- all strings of even length
More practice
Give regular expressions for the following languages,
where the alphabet is  = {0, 1}.
--all strings of one or more 0’s followed by a 1
--all strings of two or more symbols followed by three
or more 0’s
-- all strings that do not end with 01
Do these strings match the regular expression?
Regular expression
String
(01* + 1)
0101
(a + λ)b
b
(ab)*a*
λ
(a + b)(ab)
bb
Accepting (review)
Let M = (Q, , q0, d, A) be an FA.
• A string x   * is accepted by M if
d*(q0, x)  A
• The language accepted (or recognized) by M is the
set L(M) = {x   * | x is accepted by M}
• A language L over the alphabet  is regular iff
there is a Finite Automaton that accepts L.
Kleene’s theorem
1) For any regular expression r that represents language
L(r), there is a finite automaton that accepts that same
language.
2) For any finite automaton M that accepts language L(M),
there is a regular expression that represents the same
language.
Therefore, the class of languages that can be represented by
regular expressions is equivalent to the class of languages
accepted by finite automata -- the regular languages.
Kleene’s theorem part 1
NFA
proved
DFA
regular
expression
Kleene’s
Theorem part 2
Theorem 3.1
1st half of Kleene’s theorem: Let r be a regular
expression. Then there exists some nondeterministic
regular accepter that accepts L(r). Consequently, L(r) is
a regular language.
Proof strategy: for any regular expression,
we show how to construct an equivalent NFA.
Because regular expressions are defined recursively,
the proof is by induction.
Base step: Give a NFA that accepts each of the simple
or “base” languages, , {λ}, and {a} for each a  .
a
Inductive step: For each of the operations -- union,
concatenation and Kleene star -- show how to
construct an accepting NFA.
Closure under union:
M1
λ
λ
λ
M2
λ
Closure under concatenation:
M1
λ
M2
Closure under Kleene Star:
λ
λ
M1
λ
λ
Closure properties of Regular
Languages
• Union, concatenation, and Kleene star of two
regular languages will result in a regular
language, since we can write a regular
expression for them.
• Intersection and difference (complement) of two
regular languages will also produce a regular
language.
• The class of regular languages is said to be
closed under these operations. (More in Ch. 4.)
Exercise
Use the construction of the first half of Kleene’s
theorem to construct a NFA that accepts the language
L(ab*aa + bba*ab).
Exercise
Use the construction of the first half of Kleene’s
theorem to construct a NFA that accepts the language
L(ab*aa + bba*ab).
λ
FA accepting ab*aa
λ
q0
qf
λ
FA accepting bba*ab
λ
Exercise
Construct a NFA that accepts the language
corresponding to the regular expression:
((b(a+b)*a) + a)
Theorem 3.2
Kleene’s theorem part 2: Let L be a regular language.
Then there exists a regular expression r such that L =
L(r).
Any language accepted by a finite automaton can
be represented by a regular expression.
The proof strategy: For any DFA, we show
how create an equivalent regular expression. In other
words, we describe an algorithm for converting any
DFA to a regular expression.
Expression diagram
• A labeled directed graph (similar to a finite state
diagram) in which transitions are labeled by regular
expressions
• Has a single start state with no incoming transitions
• Has a single accepting state with no outgoing
transitions
ab
• Example:
(a+b)
a*
Algorithm for converting a DFA into an
equivalent regular expression
Initial step: Change every transition labeled a,b to (a+b).
Add a single start state with an outgoing λ-transition to
the current start state, and add a single final state with
incoming λ-transitions from every previous final state.
Main step: Until expression diagram has only two states
(initial state and final state), repeat the following:
-- pick some non-start, non-final state
-- remove it from the diagram and re-label transitions
with regular expressions so that the same language
is accepted
The key step is removing states and re-labeling
transitions with regular expressions. Here are
some examples of how to do this.
b
a
a
b
ab*a
ab*a
a
b
ab*b
a
b
a
b
ab*b
a
ab*a
Exercise
a,b
a
b
Continue ...
a
λ
(a+b)
b
λ
Exercise
a
λ
(a+b)
λ
b
(a+b)
a* b
λ
a*b (a+b)*
Exercise
Find a regular expression that corresponds to the
language accepted by the following DFA.
a
b
a
b
Exercise
a
b
a
λ
λ
b
a*bb*a
a*bb*
(a*bb*a)*a*bb*
λ
Homework
Find a regular expression that corresponds to the
language accepted by the following DFA.
0
q0
q1
1
0
1
q2
0
1
Applications of regular expressions
• Validation
– checking that an input string is in valid format
– example 1: checking format of email address on
WWW entry form
– example 2: UNIX regex command
• Search and selection
– looking for strings that match a certain pattern
– example: UNIX grep command
• Tokenization
– converting sequence of characters (a string) into
sequence of tokens (e.g., keywords, identifiers)
– used in lexical analysis phase of compiler
Grammar
• A grammar G = (V, T, S, P) consists of the
following quadruple:
– a set V of variables (non-terminal symbols),
including a starting symbol S  NT
– a set T of terminals (same as an alphabet, )
– A start symbol S  V
– a set P of production rules
• Example:
S  aS | A
A bA | λ
Derivation
• Strings are “derived” from a grammar
• Example of a derivation
S  aS  aaS  aaA  aabA  aab
• At each step, a nonterminal is replaced by
the sentential form on the right-hand side
of a rule (a sentential form can contain
nonterminals and/or terminals)
• Automata recognize languages; grammars
generate languages
Context-free grammar
• A grammar is said to be context-free if every
rule has a single non-terminal on the left-hand
side
• This means you can apply the rule in any
context. More complicated languages (such as
English) have context-dependent rules. A
language generated from a context-free
grammar is called a context-free language
Regular grammar
• A grammar is said to be right-linear if all
productions are of the form AxB or Ax,
where A and B are variables and x is a string of
terminals
• A grammar is said to be left-linear if all
productions are of the form ABx or Ax
• A regular grammar is either right-linear or leftlinear.
Linear grammar
• A grammar can be linear without being rightor left-linear.
• A linear grammar is a grammar in which at
most one variable can occur on the right side of
any production rule, without any restriction on
the position of the variable.
• Example:
S  aS | A
A Ab | λ
Another formalism for regular languages
• Every regular grammar generates a regular
language, and every regular language can be
generated by a regular grammar.
• A regular grammar is a simpler, special-case
of a context-free grammar
• The regular languages are a proper subset of
the context-free languages
Exercises
• Find a regular grammar that generates the
language on  = {a,b} consisting of all
strings with no more than three a’s
Exercises
• Find a regular grammar that generates the
language on  = {a,b} consisting of all
strings with no more than three a’s
S  bS | aA | λ
A  bA | aB | λ
B  bB | aC | λ
C  bC | λ
Exercises
• Find a regular grammar that generates the
language consisting of even-length strings
over {a,b}
Exercises
• Find a regular grammar that generates the
language consisting of even-length strings
over {a,b}
S  aaS | abS | baS | bbS | λ
Exercise
What language is generated by the following
context-free (but not regular) grammar?
S  aSa | bSb | a | b | λ
Exercise
What language is generated by the following
context-free grammar?
S  aSa | bSb | a | b | λ
The odd/even palindrome language:
L = {w(a+b+λ)wR}
Exercise
• Given a grammar, you should be able to say
what language it generates
• Use set notation to define the language
generated by the following grammars
1)
S  aaSB | λ
B  bB | b
2)
S  aSbb | A
A  cA | c
Exercise
S  aaSB | λ
B  bB | b
It helps to list some of the strings that can be formed:
S  aaSB  aaB  aab
S  aaSB  aaB  aabB  aabb
S  aaSB  aaB  aabB  aabbB  aabbb
S  aaSB  aaB  aabB  aabbB  aabbbB  aabbbb
S  aaSB  aaaaSBB  aaaaBB  aaaaBb  aaaabb
S  aaSB  aaaaSBB  aaaaBB  aaaaBbB  aaaaBbb 
aaaabbb
What is the pattern?
L = {(aa)nbnb*}
Exercise
• Given a language, you should be able give a
grammar that generates it.
• For example, give a regular (right-linear)
grammar for the language consisting of all
strings over {a, b, c} that begin with a,
contain exactly two b’s, and end with cc.
Exercise
• Give a regular (right-linear) grammar for
the language consisting of all strings over
{a, b, c} that begin with a, contain exactly
two b’s, and end with cc
S  aA
A  bB | aA | cA
B  bC | aB | cB
C  aC | cC | cD
Dc
Theorem 3.3
• Every language generated by a rightlinear grammar is regular.
• Proof:
– Specify a procedure for automatically
constructing an NFA that mimics the
derivations of a right-linear grammar.
Theorem 3.3
• Justification:
The sentential forms produced by a right linear
grammar have exactly one variable, which occurs
as the rightmost symbol.
Assume that our grammar has a production rule
D  dE
and that, during the derivation of a string, there is a
step wcD  wcdE
We can construct an NFA which has states D and E,
and an arc labeled d from D to E.
NFAs can be converted to DFAs.
All languages accepted by DFAs are regular.
Theorem 3.3
Construction:
For each variable Vi in the grammar there will be a state
in the automaton labeled Vi.
The initial state of the automaton will be labeled V0 and
will correspond to the S variable in the grammar.
For each production rule Vi  a1a2…amVj the
automaton will have transitions such that
δ*(Vi  a1a2…am) = Vj
For each production rule Vi  a1a2…am the automaton
will have transitions such that
δ*(Vi  a1a2…am) = Vfinal
Theorem 3.3
Construct an NFA that accepts the language generated
by the grammar:
S  aA
convert to: V0  aV1
A abS | b
V1  abV0 | b
a
V0
b
V1
a
b
Vf
Theorem 3.4
• Every regular language can be generated
by a right-linear grammar.
• Proof:
– Generate a DFA for the language.
– Specify a procedure for automatically
constructing a right-linear grammar from
the DFA.
Theorem 3.4
• Given a regular language L, let M = (Q, , δ,
q0, F) be a DFA that accepts L. Let Q = {q0,
q1, …, qn} and  = {a1, a2, …, am}.
• Construct the grammar G = (V, T, S, P) with:
V = {q0, q1, …, qn}
T = {a1, a2, …, am}
S = q0.
P = {} initially.
• P, the set of production rules, is constructed as
follows:
Theorem 3.4
• For each transition
δ(qi, aj) = qk
in the transition table of M, add to P the
production:
qi  ajqk
• If qk is in F, then add to P the production:
qk  λ
Theorem 3.4
• Example: Construct a right-linear grammar for
the language L = L(aab*a)
• First, build an NFA for L:
q0
a
q1
a
q2
b
a
qf
Theorem 3.4
q0
a
q1
a
q2
a
b
P = {} initially.
Add to P a rule for each transition in the NFA:
q0  aq1
q1  aq2
q2  bq2
q2  aqf
Since qf is in F, add to P the production:
qf  λ
qf
Theorem 3.4
q0
a
q1
a
q2
a
qf
b
Now P =
{q0  aq1
q1  aq2
q2  bq2
q2  aqf
qf  λ }
You can convert to normal grammar notation:
S  aA
A  aB
B  bB
B  aC
Cλ
Theorem 3.5
A language L is regular if and only if there exists a
left-linear grammar G such that L = L(G).
Proof:
The strategy here is a little tricky. We describe an
algorithm to construct a right-linear grammar that
generates the reverse of all the strings generated by the
left-linear grammar.
Theorem 3.5
Given any left-linear grammar we can construct from it an
right-linear grammar G’ by replacing productions of the
form:
A  Bv
with A  vRB
and
Av
with A  vR
Since L(G’) is generated by a right-linear grammar, it is
regular.
It can be demonstrated that L(G) = (L(G’))R.
It can be proven that the reverse of any regular language is
also regular (see exercise 12, section 2.3 in the Linz text).
Hence, L is regular.
Theorem 3.6
A language L is regular if and only if there exists a
regular grammar G such that L = L(G).
Proof:
Combine our definition of regular grammars,
which includes the statement, “A regular grammar is
either right-linear or left-linear”, with theorems 3.4 and
3.5
3 ways of specifying regular languages
Regular expressions
describe
DFA
NFA
accept
Regular languages
generate
Regular grammars