04_RegularExpressions

Download Report

Transcript 04_RegularExpressions

Regular Expressions
CIS 361
Fundamental Problems

Need finite descriptions of infinite sets
of strings.


Discover and specify “regularity”.
The set of languages over a finite
alphabet is uncountable, while the set
of descriptions is countable
Regular Expressions

Language L is regular if there exists a
finite acceptor for it

Any language that is described by a regular
expression can be accepted by some finite
automaton
Regular Expressions

Regular expressions

Combination of strings of symbols from some alphabet,
parentheses and operators U, ., *



U is union (some literature uses +)
. (or nothing) is concatenation
* is star closure or Kleene star



superscripted
repetition, 0 or more times
+ is closure


superscripted
repetition, 1 or more times
Specifying Lexical Structure
Using Regular Expressions


Have some alphabet  = set of symbols
Regular expressions are built from:






 - empty string
Any letter from 
r1r2 – String r1 followed by r2 (concatenation)
r1 U r2 (r1 + r2) – either regular expression r1 or r2
(union)
r* - iterated sequence and choice  | r | r r | …
Parentheses to indicate grouping/precedence
Regular Expressions

Operations






Union
Complement
Intersection
Difference
Concatenation
Repetition


Kleene star
Plus operator
Regular Expressions

Union



LM
The union of two regular expressions Q
and R is Q U R
In terms of automata A and B, respectively


create a new initial state q
connect it to the initial states of A and B by 
transitions
Regular Expressions

Complement


* - L
To construct the complement of a regular
expression L, inspect the automaton that
accepts its strings



convert the automaton for L to a deterministic
automaton
flips favorable and nonfavorable states
construct a regular expression for strings
accepted by the updated automaton
Regular Expressions


Complement of
bit strings with at least one “1”
= bit strings containing no “1”s
= 0*
Complement of
bit strings with exactly one “1”
= bit strings containing no “1”s
U bit strings with at least two “1”s
= 0* U (0* 1 0* 1 0*)(0 U 1)*
Regular Expressions

Intersection


LM
Apply DeMorgan’s law

Union of the complements of L and M
Regular Expressions

Difference


L–M
Can be expressed as the intersection of
languages L and * - M
Regular Expressions

Concatenation


Strings u and v over alphabet  is string uv
Languages L1 and L2 concatenated


L1L2 ={uv|u  L1, v  L2}
Can be extended to any finite number of
languages
Regular Expressions

Concatenation


LM
Algorithm connects every favorable state of
L to the initial state of M by an arrow
labeled 


Favorable states of L become non-favorable
Favorable states of M become favorable
states of the new automaton
Regular Expressions

Kleene star


L*
In terms of automaton


connect every favorable state of L to the initial
state of L by a transition labeled 
create a new initial state s, make it the only
favorable state and connect it to the old initial
state by  transition
Regular Expressions

Plus (+)
L+
 In terms of automaton


connect every favorable state of L to the
initial state of L by a transition labeled 

That’s it. This gets one or more times to a
favorable state
Naming Languages
Regular sets can be named using the
derivation in terms of the seed elements
and the closure operations. Regular
expressions formalize this approach.
Regular sets  Regular Expressions
Numbers  Numerals
Semantics  Syntax

Example

Regular expressions for strings over {a,b}
containing at least one “a”.
 Focus on the one “a”
(a u b)*a(a u b)*

Focus on the leftmost “a”
b*a(a u b)*

Focus on the “a”s
b*ab*(ab*)*

Further optimization
b*(ab*)+
Equivalence of regular expressions
Two regular expressions are equivalent
if they represent the same regular set.
Concept of Language Generated by
Regular Expressions



Set of all strings generated by a regular
expression is the language of the
regular expression
In general, a language may be
(countably) infinite
A string in a language is often called a
token
Examples of Languages and
Regular Expressions


 = { 0, 1, . }
 (0 U 1)*.(0 U 1)* - Binary floating point numbers
 (00)* - even-length all-zero strings
 1*(01*01*)* - strings with even number of zeros
 = {A,…,Z, a,…,z, 0,…,9,_ }
* identifiers
 (A U … U z)(A U … U z U 0 U … U 9 U _)
*
 (1 U … U 9)(0 U … U 9) natural numbers (no negatives)
 (0|1|2)* - trinary (base 3) numbers
Finite-State Automata



Alphabet 
Set of states with initial and accepting states
Transitions between states, labeled with symbol(s)
(0 | 1)*.(0|1)*
1
1
0
0