04_RegularExpressions
Download
Report
Transcript 04_RegularExpressions
Regular Expressions
CIS 361
Fundamental Problems
Need finite descriptions of infinite sets
of strings.
Discover and specify “regularity”.
The set of languages over a finite
alphabet is uncountable, while the set
of descriptions is countable
Regular Expressions
Language L is regular if there exists a
finite acceptor for it
Any language that is described by a regular
expression can be accepted by some finite
automaton
Regular Expressions
Regular expressions
Combination of strings of symbols from some alphabet,
parentheses and operators U, ., *
U is union (some literature uses +)
. (or nothing) is concatenation
* is star closure or Kleene star
superscripted
repetition, 0 or more times
+ is closure
superscripted
repetition, 1 or more times
Specifying Lexical Structure
Using Regular Expressions
Have some alphabet = set of symbols
Regular expressions are built from:
- empty string
Any letter from
r1r2 – String r1 followed by r2 (concatenation)
r1 U r2 (r1 + r2) – either regular expression r1 or r2
(union)
r* - iterated sequence and choice | r | r r | …
Parentheses to indicate grouping/precedence
Regular Expressions
Operations
Union
Complement
Intersection
Difference
Concatenation
Repetition
Kleene star
Plus operator
Regular Expressions
Union
LM
The union of two regular expressions Q
and R is Q U R
In terms of automata A and B, respectively
create a new initial state q
connect it to the initial states of A and B by
transitions
Regular Expressions
Complement
* - L
To construct the complement of a regular
expression L, inspect the automaton that
accepts its strings
convert the automaton for L to a deterministic
automaton
flips favorable and nonfavorable states
construct a regular expression for strings
accepted by the updated automaton
Regular Expressions
Complement of
bit strings with at least one “1”
= bit strings containing no “1”s
= 0*
Complement of
bit strings with exactly one “1”
= bit strings containing no “1”s
U bit strings with at least two “1”s
= 0* U (0* 1 0* 1 0*)(0 U 1)*
Regular Expressions
Intersection
LM
Apply DeMorgan’s law
Union of the complements of L and M
Regular Expressions
Difference
L–M
Can be expressed as the intersection of
languages L and * - M
Regular Expressions
Concatenation
Strings u and v over alphabet is string uv
Languages L1 and L2 concatenated
L1L2 ={uv|u L1, v L2}
Can be extended to any finite number of
languages
Regular Expressions
Concatenation
LM
Algorithm connects every favorable state of
L to the initial state of M by an arrow
labeled
Favorable states of L become non-favorable
Favorable states of M become favorable
states of the new automaton
Regular Expressions
Kleene star
L*
In terms of automaton
connect every favorable state of L to the initial
state of L by a transition labeled
create a new initial state s, make it the only
favorable state and connect it to the old initial
state by transition
Regular Expressions
Plus (+)
L+
In terms of automaton
connect every favorable state of L to the
initial state of L by a transition labeled
That’s it. This gets one or more times to a
favorable state
Naming Languages
Regular sets can be named using the
derivation in terms of the seed elements
and the closure operations. Regular
expressions formalize this approach.
Regular sets Regular Expressions
Numbers Numerals
Semantics Syntax
Example
Regular expressions for strings over {a,b}
containing at least one “a”.
Focus on the one “a”
(a u b)*a(a u b)*
Focus on the leftmost “a”
b*a(a u b)*
Focus on the “a”s
b*ab*(ab*)*
Further optimization
b*(ab*)+
Equivalence of regular expressions
Two regular expressions are equivalent
if they represent the same regular set.
Concept of Language Generated by
Regular Expressions
Set of all strings generated by a regular
expression is the language of the
regular expression
In general, a language may be
(countably) infinite
A string in a language is often called a
token
Examples of Languages and
Regular Expressions
= { 0, 1, . }
(0 U 1)*.(0 U 1)* - Binary floating point numbers
(00)* - even-length all-zero strings
1*(01*01*)* - strings with even number of zeros
= {A,…,Z, a,…,z, 0,…,9,_ }
* identifiers
(A U … U z)(A U … U z U 0 U … U 9 U _)
*
(1 U … U 9)(0 U … U 9) natural numbers (no negatives)
(0|1|2)* - trinary (base 3) numbers
Finite-State Automata
Alphabet
Set of states with initial and accepting states
Transitions between states, labeled with symbol(s)
(0 | 1)*.(0|1)*
1
1
0
0