Dov Levenglick: Beginners Track: Regular Expressions

Regular Expressions
12 February 2006
Dov Levenglick
Jason Elbaum
Freescale™ and the Freescale logo are trademarks of Freescale Semiconductor, Inc. All other product
or service names are the property of their respective owners. © Freescale Semiconductor, Inc. 2005.
Regular Expressions
A regular expression (“regex” for short) is a way to describe
a text pattern to search for within a string.
Regexes can be simple strings, which generally match
letter-for-letter, or more complex expressions written using
Perl’s regex grammar.
Examples and explanations will follow.
Regular Expressions – The Match Operator
• m/PATTERN/ is Perl’s built-in match operator. It operates on a scalar
(or by default on $_).
• It searches the scalar for the text described by PATTERN and returns
1 for success or “” for failure.
• To test if PATTERN exists in the string, use the =~ binding operator.
• The !~ binding operator is equivalent to !( =~ ).
Important: == and != compare numbers. Using them instead of =~ and !~
compares the two sides numerically instead of testing the pattern, and
gives wrong results.
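The binding operators can be sketched as follows. This is a minimal illustration, not part of the original slides; the variable names and the sample string are assumptions chosen for the example.

```perl
use strict;
use warnings;

my $name = 'Amir Sahar';   # hypothetical sample data

# =~ binds the string to the match operator.
my $bound = ($name =~ m/Amir/) ? 'yes' : 'no';

# !~ is the negated form of =~.
my $negated = ($name !~ m/Dov/) ? 'yes' : 'no';

# With no binding operator, m// tests $_ instead.
my $default;
for ($name) {              # aliases $_ to $name
    $default = m/Sahar/ ? 'yes' : 'no';
}

print "bound: $bound, negated: $negated, default: $default\n";
```

All three tests succeed here, so the script prints yes three times.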
Regular Expressions – The Match Operator
#!/usr/local/bin/perl5.8.7
$name = 'Amir Sahar';
if ($name =~ m/Dov/) {
    print "I thought that I was Amir?!\n";
}
elsif ($name =~ m/Amir/) {
    print "whew... I still remember my name\n";
}
else {
    print "what the heck is my name?\n";
}
Regular Expressions – The Match Operator
The m// operator interpolates its contents like double-quoted strings.
This lets you use variables in your search patterns.
The m// operator can use any non-alphanumeric, non-whitespace delimiter
instead of the ‘/’ delimiter. This comes in handy when the pattern itself
includes ‘/’, such as a path name.
Instead of: m/\/usr\/local\/bin/
Prefer: m!/usr/local/bin! or m(/usr/local/bin)
When using a delimiter other than ‘/’, the m must be specified; otherwise
it may be omitted:
/PATTERN/
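A short sketch of both points, interpolation and alternative delimiters. The variables $dir and $path and their values are hypothetical, introduced only for this illustration.

```perl
use strict;
use warnings;

my $dir  = 'local';                  # hypothetical search term
my $path = '/usr/local/bin/perl';    # hypothetical sample path

# $dir is interpolated into the pattern before matching.
my $has_dir = ($path =~ m/$dir/) ? 1 : 0;

# These three patterns are equivalent; only the delimiter differs.
my $slashes = ($path =~ m/\/usr\/local\/bin/) ? 1 : 0;
my $bangs   = ($path =~ m!/usr/local/bin!)    ? 1 : 0;
my $parens  = ($path =~ m(/usr/local/bin))    ? 1 : 0;

print "$has_dir $slashes $bangs $parens\n";   # prints: 1 1 1 1
```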
Regular Expressions – Built-in Match Variables
There are a few useful built-in variables that can help you with processing your pattern
matching results:
$& - contains the matched string
$` - contains everything before the matched string
$' - contains everything after the matched string
Notice that if the match fails, these variables will not change from their previous values,
therefore they can’t be used for testing for a successful match.
#!/usr/local/bin/perl5.8.7
$problem = 'I am running out of funny examples';
$problem =~ m|out|;
print "before:$`\nafter:$'\nmatched:$&\n";
Will produce (the spaces around the match stay with the text before and after it):
before:I am running 
after: of funny examples
matched:out
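The warning about failed matches can be seen in a small sketch. The second pattern, m/serious/, is a hypothetical one chosen because it does not occur in the string.

```perl
use strict;
use warnings;

my $text = 'I am running out of funny examples';

$text =~ m/out/;         # succeeds and sets $& to 'out'
my $first = $&;

$text =~ m/serious/;     # fails, so $& keeps its previous value
my $stale = $&;

# Test the return value of the match, not the variables.
my $result = ($text =~ m/serious/) ? 'found' : 'not found';

print "first=$first stale=$stale result=$result\n";
# prints: first=out stale=out result=not found
```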
Regular Expressions – The Substitute Operator
In addition to the m/PATTERN/ operator, a s/PATTERN/SUBSTITUTION/
operator exists. This operator searches for PATTERN in the string and
replaces it with SUBSTITUTION.
The SUBSTITUTION is treated as a double quoted string including
interpolation of variables.
s/// is used the same way as m//:
#!/usr/local/bin/perl5.8.7
$wish = 'I wish that this course would be over';
$wish =~ s/be over/continue forever/;
print "$wish\n";
The script above will produce:
I wish that this course would continue forever
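Interpolation in the SUBSTITUTION part can be sketched like this. The variable $ending is hypothetical. The example also relies on a fact the slides do not mention: s/// returns the number of substitutions it made.

```perl
use strict;
use warnings;

my $wish   = 'I wish that this course would be over';
my $ending = 'never end';            # hypothetical replacement text

# $ending is interpolated into the SUBSTITUTION part.
my $changed = ($wish =~ s/be over/$ending/);

print "$wish\n";                     # I wish that this course would never end
print "substitutions: $changed\n";   # substitutions: 1
```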
Regular Expressions - Metasymbols
So far we’ve seen patterns which match a fixed text
sequence. More generic text patterns can be described
using various predefined metasymbols and metacharacters.
The entire list can be found at:
http://cdserver/Volumes/PerlCD_A/prog/ch05_03.htm#INDEX-1497
however the most useful ones are described in the following
slides.
Regular Expressions - Metasymbols
Each of the metasymbols in this table matches a single
character in the text being searched:
Metasymbol   Matches
\w           Any alphanumeric character, as well as underscore
\W           Any character that \w doesn’t match
\d           Any digit character
\D           Any non-digit character
\s           Any whitespace character
\S           Any non-whitespace character
.            Anything but a newline (\n). This is a don’t-care match
Regular Expressions - Metasymbols
#!/usr/local/bin/perl5.8.7
$number = 1234;
print "huh?\n" if $number =~ /\D/;
# won't print anything
print "well, I knew that\n" if $number =~ /\w/;
# will print: well, I knew that
print "matching a don't care\n" if $number =~ /./;
# will print: matching a don't care
print "of course there is no whitespace in numbers\n" unless $number =~ /\s/;
# will print: of course there is no whitespace in numbers
Regular Expressions - Metasymbols
The following metasymbols help describe where in the string to
search for PATTERN:
(meta)symbol   Meaning
^              Beginning of string (or of each line when the /m modifier is used)
$              End of string, or just before a final newline (or end of each line when the /m modifier is used)
\b             Word boundary (between \W and \w or vice versa)
\B             Any position that is not a word boundary
\A             Beginning of string only
\Z             End of string only (or just before a final newline)
Regular Expressions - Metasymbols
$string = "beetlejuice beetlemania beetles";
$string =~ /^beetlemania/;     # won't match anything
$string =~ /^beetlejuice$/;    # won't match anything
$string =~ /\bbeetle\b/;       # won't match anything
$string =~ /\bbeetle/;         # will match at the start of the first word
$string =~ /\bbeetles$/;       # will match the last word
Regular Expressions - Metacharacters
In general, “metacharacters” are characters that have a special meaning
when they appear in regular expressions. If you want to search for one of
the metacharacters themselves, you must prefix it with a backslash: ‘\’.
The characters are:
\ | ( ) [ ] { } ^ $ * + ? .
Therefore, if you are searching for a ‘?’, your match pattern should look
like this:
m/is this a question\?/
We’ll see more about what these characters mean in the following slides.
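The effect of the backslash can be sketched with the ‘.’ metacharacter. The sample string 'perl5x8' is hypothetical, chosen so that it contains no literal dot.

```perl
use strict;
use warnings;

my $version = 'perl5x8';    # hypothetical string without a literal dot

# Unescaped, '.' is the don't-care metasymbol and matches the 'x'.
my $loose = ($version =~ m/5.8/)  ? 'match' : 'no match';

# Escaped, '\.' matches only a literal dot.
my $exact = ($version =~ m/5\.8/) ? 'match' : 'no match';

print "unescaped: $loose, escaped: $exact\n";
# prints: unescaped: match, escaped: no match
```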
Regular Expressions - Quantifiers
To search for a repeated pattern, place a quantifier
immediately after the part of the PATTERN that should be
repeated.
Perl has three basic quantifiers:
1. ? – Succeeds if the preceding pattern appears 0 or 1
times
2. * – Succeeds if the preceding pattern appears 0 or more
times in succession
3. + – Succeeds if the preceding pattern appears 1 or more
times in succession
Note that quantifiers have no meaning on their own; they
always modify the immediately preceding regex symbol.
Regular Expressions - Quantifiers
$fruit = "fifteen (15) bananas";
$fruit =~ /e+/;    print "$&\n";      # will print 'ee'
$fruit =~ /an*/;   print "$&\n";      # will print 'an'
$fruit =~ /(an)+/; print "$&\n";      # will print 'anan'
$fruit =~ /e*/;    print "$&\n";      # succeeds by matching the empty string at the
                                      # start, so only an empty line is printed
print "ok" if $fruit =~ m{(abc)?};    # will print 'ok'
print "ok" if $fruit =~ m{(abc)+};    # will not print anything, and fail
print "ok" if $fruit =~ /tef*en/;     # will print 'ok'
$fruit =~ /\w+/;      print "$&\n";   # will print 'fifteen'
$fruit =~ /\b\d+\b/;  print "$&\n";   # will print '15'
$fruit =~ /\(.*\)/;   print "$&\n";   # will print '(15)'
print "not found" if $fruit !~ /^banana/;   # will print 'not found'
Regular Expressions - Quantifiers
More precise specification of repeated matches is possible with these
additional quantifiers. Generally speaking, quantifiers will try to match
as many times as possible if the “maximal match” version is used or as few
times as possible if the “minimal match” version is used.
Maximal match   Minimal match   Meaning
*               *?              Match 0 or more times
+               +?              Match 1 or more times
?               ??              Match 0 or 1 times
{COUNT}                         Match exactly COUNT times
{MIN,}          {MIN,}?         Match at least MIN times
{MIN,MAX}       {MIN,MAX}?      Match at least MIN but not more than MAX times
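The counted quantifiers from the table can be sketched as follows. The sample string 'id-1234567' is hypothetical, chosen so that each quantifier captures a different number of digits.

```perl
use strict;
use warnings;

my $serial = 'id-1234567';    # hypothetical sample string

$serial =~ /\d{3}/;    my $exact    = $&;   # exactly 3 digits
$serial =~ /\d{2,}/;   my $at_least = $&;   # 2 or more, maximal
$serial =~ /\d{2,}?/;  my $minimal  = $&;   # 2 or more, minimal
$serial =~ /\d{2,5}/;  my $bounded  = $&;   # between 2 and 5, maximal

print "$exact $at_least $minimal $bounded\n";
# prints: 123 1234567 12 12345
```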
Regular Expressions - Quantifiers
$phrase = "Hold your horses";
$phrase =~ /.+o/;  print "$&\n";                       # will print 'Hold your ho'
$phrase =~ /.+?o/; print "$&\n";                       # will print 'Ho'
print "match\n" if $phrase =~ /.+H/;                   # will print nothing
print "match\n" if $phrase =~ /.*H/;                   # will print 'match'
print "match\n" if $phrase =~ /^H.{14}s$/;             # will print 'match'
print "hold is a word\n" if $phrase =~ /\BHold/;       # will print nothing
Regular Expressions - Alternatives
If you wish to match one of many possible subexpressions,
use the ‘|’ token to separate them and the round
parentheses to enclose them.
#!/usr/local/bin/perl5.8.7
$course = 'Perl course';
print "I am in the right course" if $course =~ /(Perl|Tcl|C) course/;
The script above is expected to produce:
I am in the right course
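Why the parentheses matter can be sketched with a string that names a language but not a course. The sample string 'Tcl lesson' is hypothetical.

```perl
use strict;
use warnings;

my $course = 'Tcl lesson';   # hypothetical sample string

# Without parentheses the alternatives are 'Perl', 'Tcl' and 'C course',
# so the bare word 'Tcl' is enough for a match.
my $ungrouped = ($course =~ /Perl|Tcl|C course/)   ? 'match' : 'no match';

# With parentheses, ' course' must follow whichever alternative matched.
my $grouped   = ($course =~ /(Perl|Tcl|C) course/) ? 'match' : 'no match';

print "ungrouped: $ungrouped, grouped: $grouped\n";
# prints: ungrouped: match, grouped: no match
```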
Regular Expressions – Character Sets
To match any one of a set of possible characters, use the square
brackets to surround them:
[0-9] is the same as \d
To match anything except the characters in the square brackets put a
caret sign (^) after the opening bracket:
[^0-9] is the same as \D
Keep in mind the difference between the square brackets, which group a
set of characters and the round parentheses, which group alternative
expressions:
[fee|fie|foe] is the same as [feio|]
m%(0[1-9]|[12]\d|3[01])/(0[1-9]|1[0-2])/\d{4}%   # matches a date dd/mm/yyyy
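A brief sketch that tries the date pattern on a few strings. The candidate strings and variable names are hypothetical. Note that the pattern checks only the shape of the date, not whether the day exists in that month.

```perl
use strict;
use warnings;

# Try the date pattern on three hypothetical candidates.
my %verdict;
for my $candidate ('25/12/2005', '31/13/2005', '00/01/2006') {
    $verdict{$candidate} =
        ($candidate =~ m%(0[1-9]|[12]\d|3[01])/(0[1-9]|1[0-2])/\d{4}%)
        ? 'date' : 'not a date';
}

# A negated set succeeds only if some character is not a digit.
my $has_non_digit = ('2006a' =~ /[^0-9]/) ? 'yes' : 'no';

print "$_: $verdict{$_}\n" for sort keys %verdict;
print "non-digit in 2006a: $has_non_digit\n";
```

Only the first candidate is accepted: the second has month 13 and the third has day 00.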
Regular Expressions – Captured Matches
In order to remember your matches for further reference, Perl has the built in $1, $2, $3, …
variables. Each variable contains the contents of a match that was surrounded by round
parentheses.
The parentheses are numbered according to the order of the opening parentheses, from
the leftmost one towards the rightmost one.
Notice that these variables are overwritten by every successful match, therefore it is
good practice to save them in your own variables right away. Otherwise a later match in
the same scope may replace them without your noticing.
If the result of the match operator is taken in list context, the elements of the resulting list
are what $1, $2… would have returned.
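The numbering rule for nested parentheses, together with list-context assignment, can be sketched as follows. The sample string and variable names are hypothetical.

```perl
use strict;
use warnings;

my $stamp = '12 February 2006';   # hypothetical sample string

# Groups are numbered by their opening parenthesis, left to right:
# group 1 is the outer pair, groups 2 to 4 are nested inside it.
my ($whole, $day, $month, $year) = ($stamp =~ /((\d+) (\w+) (\d+))/);

print "whole=$whole day=$day month=$month year=$year\n";
# prints: whole=12 February 2006 day=12 month=February year=2006
```

Assigning the list result to your own variables in one step also takes care of saving the captures before a later match replaces them.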
Regular Expressions – Captured Matches
$song = "oh lord wont you buy me a Mercedes Benz";
@matches = ($song =~ /(.*o)([a-z\s]*)(.*)/);
$1 and $matches[0] will equal ‘oh lord wont yo’
$2 and $matches[1] will equal ‘u buy me a ’
$3 and $matches[2] will equal ‘Mercedes Benz’
$1, $2 etc. can be used in the substitution pattern:
$time = "12:34"; $time =~ s/(..):(..)/$2:$1/; print $time
Is expected to produce:
34:12
Regular Expressions – Modifiers
The match rules for a pattern can be modified by certain flags that
can be used after the closing delimiter of the match operator:
Modifier   Meaning
/i         Ignore alphabetic case distinctions (case insensitive).
/s         Let . match a newline as well.
/m         Let ^ and $ match next to embedded newlines in the string.
/x         Ignore (most) whitespace and permit comments in pattern.
/o         Compile pattern once only.
Regular Expressions – Modifiers
print "match" if "Perl" =~ /perl/i;                 # will print 'match'
print "match" if "line 1\nLine 2" =~ /^l.*2/s;      # will print 'match'
print "match" if "line 1\nLine 2" =~ /^l.*2/m;      # will print nothing
print "match" if "line 1\nLine 2" =~ /^L.*2/m;      # will print 'match'
The following 3 regexes all match the same thing:
m/\w+:(\s+\w+)\s*\d+/;           # A word, colon, spaces, word, spaces, digits.
m/\w+: (\s+ \w+) \s* \d+/x;      # A word, colon, spaces, word, spaces, digits.
m{ \w+:          # Match a word and a colon.
   (             # (begin group)
       \s+       # Match one or more spaces.
       \w+       # Match another word.
   )             # (end group)
   \s*           # Match zero or more spaces.
   \d+           # Match some digits
}x;
Regular Expressions - Modifiers
An additional modifier is /g, the global match. It behaves
slightly differently for m// and s///.
For s///, the PATTERN is replaced throughout EXPR as
many times as it is found.
For m//, the PATTERN is repetitively matched each time
from where the last match left off.
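Where each m//g search resumes can be observed with Perl's built-in pos() function, which the slides do not cover. The sketch below is an illustration under that assumption; the variable names are hypothetical. It also uses the fact that s///g returns the number of replacements.

```perl
use strict;
use warnings;

my $fruit = 'banana';

# m//g in a loop: pos() reports the offset where the next search resumes.
my @resume_at;
while ($fruit =~ m/a/g) {
    push @resume_at, pos($fruit);
}
print "resume offsets: @resume_at\n";    # resume offsets: 2 4 6

# s///g replaces every occurrence in one pass and returns how many.
my $replaced = ($fruit =~ s/a/q/g);
print "$replaced replaced: $fruit\n";    # 3 replaced: bqnqnq
```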
Regular Expressions
The following script
#!/usr/local/bin/perl5.8.7
$fruit = "banana";
$counter = 0;
while ($fruit =~ m/a/g) {
    print ++$counter, "\n";
}
print "$fruit \n";
Is expected to produce:
1
2
3
banana
The following script
#!/usr/local/bin/perl5.8.7
$fruit = "banana";
$counter = 0;
while ($fruit =~ s/a/q/g) {
    print ++$counter, "\n";
}
print "$fruit \n";
Is expected to produce:
1
bqnqnq