Regular Expressions in Java

Regular Expressions
• A regular expression is a pattern for describing a set of
strings that “match” the pattern.
– used to search or modify string data
– employs a special syntax for declaring a regular expression
• Examples.
– [a-z] matches a single lower-case letter
– [a-zA-Z]* matches a sequence (zero or more) of letters
– [0-9] matches a single digit
– [0-9]+ matches an unsigned integer literal (1 or more digits)
– . (period) matches any character
– .*abc matches any sequence of characters ending in “abc”
– [ \t\n\x0B\f\r] matches a whitespace character
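The examples above can be checked with a small sketch. The class name RegexExamples and the helper fullMatch are illustrative only and do not come from the slides; Pattern.matches requires the whole input to match the expression.

```java
import java.util.regex.Pattern;

class RegexExamples {
    // Hypothetical helper: true when the ENTIRE input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("[a-z]", "q"));        // true
        System.out.println(fullMatch("[a-zA-Z]*", ""));     // true: zero letters is allowed
        System.out.println(fullMatch("[0-9]+", "2024"));    // true
        System.out.println(fullMatch("[0-9]+", ""));        // false: needs at least one digit
        System.out.println(fullMatch(".*abc", "xyzabc"));   // true
        // Backslashes are doubled because this is a Java string literal.
        System.out.println(fullMatch("[ \\t\\n\\x0B\\f\\r]", "\t")); // true
    }
}
```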
Three Types of Quantifiers
Assume input string = “abcabcx”
• Greedy (most common) – read the entire input string
prior to attempting the first match, then back off one
character at a time until a match is found.
– regular expression .*abc matches “abcabc”
(first six characters of input string).
• Reluctant – reluctantly read one character at a time
looking for a match.
– regular expression .*?abc matches “abc”
(first three characters of input string).
• Possessive – always read the entire input string, trying
only once for a match and never backing off.
– regular expression .*+abc does not match the input string.
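A minimal sketch of the three behaviours on the input “abcabcx”. The class QuantifierDemo and the helper firstMatch are names invented for this illustration; the helper returns the text of the first match found by Matcher.find(), or null if there is none.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Hypothetical helper: first match of regex in input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "abcabcx";
        System.out.println(firstMatch(".*abc", input));   // abcabc (greedy)
        System.out.println(firstMatch(".*?abc", input));  // abc    (reluctant)
        System.out.println(firstMatch(".*+abc", input));  // null   (possessive: no match)
    }
}
```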
Regular Expression Syntax:
Postfix Quantifiers
Greedy   Reluctant   Possessive   Meaning
?        ??          ?+           once or not at all (zero or 1 time)
*        *?          *+           zero or more times
+        +?          ++           one or more times
{n}      {n}?        {n}+         exactly n times
{n,}     {n,}?       {n,}+        at least n times
{n,m}    {n,m}?      {n,m}+       at least n but not more than m times
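As a rough illustration of the counted quantifiers, the sketch below tests a few inputs against whole-string matches. The class BoundedQuantifierDemo and its helper are hypothetical names, and the sample patterns are chosen only for this example.

```java
import java.util.regex.Pattern;

class BoundedQuantifierDemo {
    // Hypothetical helper: true when the ENTIRE input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("colou?r", "color"));      // true: "u" zero times
        System.out.println(fullMatch("colou?r", "colour"));     // true: "u" once
        System.out.println(fullMatch("[0-9]{3}", "123"));       // true: exactly 3 digits
        System.out.println(fullMatch("[0-9]{3}", "1234"));      // false: 4 digits
        System.out.println(fullMatch("[0-9]{2,}", "12345"));    // true: at least 2 digits
        System.out.println(fullMatch("[0-9]{2,4}", "12345"));   // false: more than 4 digits
    }
}
```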
Regular Expression Syntax:
Single Characters
x        character x
\\       backslash character
\0mnn    character with octal value 0mnn (0 ≤ m ≤ 3, 0 ≤ n ≤ 7)
\xhh     character with hexadecimal value 0xhh
\uhhhh   character with hexadecimal value 0xhhhh
\t       tab character ('\u0009')
\n       newline (line feed) character ('\u000A')
\r       carriage-return character ('\u000D')
\f       form-feed character ('\u000C')
\e       escape character ('\u001B')
\cx      control character corresponding to x
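When these escapes are written inside a Java string literal, each backslash that belongs to the regular expression must itself be escaped. The sketch below shows this under that assumption; EscapeDemo and fullMatch are illustrative names.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Hypothetical helper: true when the ENTIRE input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Four backslashes in source = two in the regex = one literal backslash matched.
        System.out.println(fullMatch("\\\\", "\\"));     // true
        System.out.println(fullMatch("\\x41", "A"));     // true: hexadecimal 41 is 'A'
        System.out.println(fullMatch("\\u0041", "A"));   // true: same character, four hex digits
        System.out.println(fullMatch("\\0101", "A"));    // true: octal 101 is 'A'
        System.out.println(fullMatch("\\t", "\t"));      // true: tab
    }
}
```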
Regular Expression Syntax:
Character Classes
[a-z]           lowercase letter
[A-Z]           uppercase letter
[0-9]           digit
[abc]           a, b, or c (simple class)
[^abc]          any character except a, b, or c (negation)
[a-zA-Z]        a through z or A through Z, inclusive (range)
[a-d[m-p]]      a through d, or m through p; equivalent to [a-dm-p] (union)
[a-z&&[def]]    d, e, or f (intersection)
[a-z&&[^bc]]    a through z, except for b and c; equivalent to [ad-z] (subtraction)
[a-z&&[^m-p]]   a through z, and not m through p; equivalent to [a-lq-z] (subtraction)
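The union, intersection, and subtraction forms can be tried out with single-character inputs, as in this sketch. CharClassDemo and fullMatch are names made up for the illustration.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Hypothetical helper: true when the ENTIRE input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("[a-d[m-p]]", "n"));     // true: union includes m-p
        System.out.println(fullMatch("[a-d[m-p]]", "e"));     // false
        System.out.println(fullMatch("[a-z&&[def]]", "e"));   // true: intersection
        System.out.println(fullMatch("[a-z&&[def]]", "g"));   // false
        System.out.println(fullMatch("[a-z&&[^bc]]", "a"));   // true: subtraction keeps a
        System.out.println(fullMatch("[a-z&&[^bc]]", "b"));   // false: b is removed
        System.out.println(fullMatch("[a-z&&[^m-p]]", "q"));  // true
        System.out.println(fullMatch("[a-z&&[^m-p]]", "n"));  // false
    }
}
```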
Regular Expression Syntax:
Predefined Character Classes
.     Any character (matches line terminators only when the Pattern.DOTALL flag is set)
\d    A digit; equivalent to [0-9]
\D    A non-digit; equivalent to [^0-9]
\s    A whitespace character; equivalent to [ \t\n\x0B\f\r]
\S    A non-whitespace character; equivalent to [^\s]
\w    A word character; equivalent to [a-zA-Z_0-9]
\W    A non-word character; equivalent to [^\w]
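A short sketch of the predefined classes in use. The class PredefinedClassDemo and its two helpers are hypothetical; normalizeWhitespace relies on String.replaceAll, which accepts a regular expression.

```java
import java.util.regex.Pattern;

class PredefinedClassDemo {
    // Hypothetical helper: true when the ENTIRE input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Hypothetical helper: collapses each run of whitespace into one space.
    static String normalizeWhitespace(String input) {
        return input.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("\\d\\d:\\d\\d", "09:30"));   // true
        System.out.println(fullMatch("\\w+", "user_42"));          // true
        System.out.println(fullMatch("\\w+", "user-42"));          // false: '-' is not a word character
        System.out.println(normalizeWhitespace("a  b\t\tc"));      // a b c
    }
}
```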
Regular Expression Syntax:
Boundary Matchers
^     The beginning of a line
$     The end of a line
\b    A word boundary
\B    A non-word boundary
\A    The beginning of the input
\G    The end of the previous match
\Z    The end of the input but for the final terminator, if any
\z    The end of the input
Without the Pattern.MULTILINE flag, ^ and $ apply to the entire
input rather than to each line.
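The following sketch counts matches to show how some boundary matchers behave. BoundaryDemo and its count helper are invented for this illustration; the sample inputs are arbitrary.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Hypothetical helper: number of matches of the pattern in the input.
    static int count(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Word boundaries: only the standalone word "cat" is counted.
        System.out.println(count(Pattern.compile("\\bcat\\b"), "cat concat cats"));   // 1

        String lines = "one\ntwo\nthree";
        // Default mode: ^ matches only at the start of the whole input.
        System.out.println(count(Pattern.compile("^\\w+"), lines));                    // 1
        // MULTILINE mode: ^ also matches at the start of every line.
        System.out.println(count(Pattern.compile("^\\w+", Pattern.MULTILINE), lines)); // 3

        // Upper-case Z tolerates a final line terminator; lower-case z does not.
        System.out.println(count(Pattern.compile("end\\Z"), "end\n"));                 // 1
        System.out.println(count(Pattern.compile("end\\z"), "end\n"));                 // 0
    }
}
```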
Regular Expression Syntax:
Logical Operators
XY     X followed by Y
X|Y    Either X or Y
(X)    Grouping (X as a capturing group)
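A small sketch of alternation and grouping. GroupingDemo, fullMatch, and groupOf are hypothetical names; groupOf uses Matcher.group(int) to read back the text captured by a parenthesized group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupingDemo {
    // Hypothetical helper: true when the ENTIRE input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Hypothetical helper: text captured by group n if the whole input matches, else null.
    static String groupOf(String regex, String input, int n) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(n) : null;
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("gr(a|e)y", "grey"));        // true
        System.out.println(fullMatch("gr(a|e)y", "groy"));        // false
        System.out.println(fullMatch("(ab)+", "ababab"));         // true: the group repeats
        System.out.println(groupOf("(\\d+)-(\\d+)", "10-20", 1)); // 10
        System.out.println(groupOf("(\\d+)-(\\d+)", "10-20", 2)); // 20
    }
}
```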
Regular Expressions in Java
Three classes in package java.util.regex
• Pattern – a compiled representation of a regular
expression
– public static “compile” methods accept a regular expression as
the first argument and return a Pattern object
• Matcher – the engine that interprets the pattern and
performs match operations against an input string
– obtain a Matcher by calling the matcher method on a Pattern object
• PatternSyntaxException – an unchecked exception
that indicates a syntax error in a regular expression
pattern
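Because PatternSyntaxException is unchecked, the compiler does not force callers to handle it. One possible way to validate a user-supplied expression is sketched below; SyntaxCheckDemo and isValidRegex are illustrative names.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class SyntaxCheckDemo {
    // Hypothetical helper: true if the expression compiles without a syntax error.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidRegex("[a-z]+"));  // true
        System.out.println(isValidRegex("[a-z"));    // false: unclosed character class
        System.out.println(isValidRegex("(abc"));    // false: unclosed group
    }
}
```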
Using Java Regular Expressions
Two steps:
• Create a Pattern object based on a regular expression
pattern.
• Use the Pattern object to create a Matcher object that
will match a specified string against the expression
pattern.
Example: Using Regular Expressions
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String regex = ...;
String input = ...;
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(input);
// find() advances to the next match and returns false when none remain
while (matcher.find())
{
    System.out.printf("Found text \"%s\" starting at "
        + "index %d and ending at index %d.\n",
        matcher.group(), matcher.start(), matcher.end());
    ...
}
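A self-contained variant of the loop above, with a concrete expression and input chosen only for illustration. FindAllDemo and describeMatches are hypothetical names; each entry records the matched text with its start index and end offset.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAllDemo {
    // Hypothetical helper: one description per match, e.g. "234" [4, 7).
    static List<String> describeMatches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            found.add(String.format("\"%s\" [%d, %d)",
                    matcher.group(), matcher.start(), matcher.end()));
        }
        return found;
    }

    public static void main(String[] args) {
        // Prints: "1" [1, 2)   then   "234" [4, 7)
        for (String line : describeMatches("[0-9]+", "a1bc234d")) {
            System.out.println(line);
        }
    }
}
```

The end value is the offset just past the last matched character, so the matched text is input.substring(start, end).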
Selected Methods from Class Pattern
• static Pattern compile(String regex)
Compiles the given regular expression into a pattern.
• Matcher matcher(CharSequence input)
Creates a matcher that will match the given input against this
pattern.
• static boolean matches(String regex, CharSequence input)
Compiles the given regular expression and attempts to match the
given input against it.
• String[] split(CharSequence input)
Splits the given input sequence around matches of this pattern.
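A brief sketch of split and the static matches method. PatternMethodsDemo and splitOnCommas are illustrative names, and the comma-with-optional-whitespace delimiter is just one possible choice.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class PatternMethodsDemo {
    // Compiled once and reused; a comma with optional surrounding whitespace.
    private static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    // Hypothetical helper: splits the input around each comma delimiter.
    static String[] splitOnCommas(String input) {
        return COMMA.split(input);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnCommas("a , b,c ,d"))); // [a, b, c, d]
        // The static matches method compiles and matches the whole input in one call.
        System.out.println(Pattern.matches("[a-z]+", "hello"));   // true
        System.out.println(Pattern.matches("[a-z]+", "hello!"));  // false
    }
}
```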
Selected Methods from Class Matcher
• int start()
Returns the start index of the previous match.
• int end()
Returns the offset after the last character matched.
• boolean find()
Attempts to find the next subsequence of the input sequence
that matches the pattern.
• boolean find(int start)
Resets this matcher and then attempts to find the next
subsequence of the input sequence that matches the pattern,
starting at the specified index.
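To illustrate find(int start) together with start() and end(), the sketch below searches from a given index. MatcherPositionDemo and spanFrom are hypothetical names.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherPositionDemo {
    // Hypothetical helper: {start, end} of the first match at or after 'from', or null.
    static int[] spanFrom(String regex, String input, int from) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (m.find(from)) {
            return new int[] { m.start(), m.end() };
        }
        return null;
    }

    public static void main(String[] args) {
        String input = "abcabcx";
        System.out.println(Arrays.toString(spanFrom("abc", input, 0))); // [0, 3]
        System.out.println(Arrays.toString(spanFrom("abc", input, 1))); // [3, 6]
        System.out.println(spanFrom("abc", input, 4) == null);          // true: no match after index 4
    }
}
```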
Selected Methods from Class Matcher
(continued)
• String group()
Returns the input subsequence matched by the previous match.
• String replaceAll(String replacement)
Replaces every subsequence of the input sequence that
matches the pattern with the given replacement string.
• String replaceFirst(String replacement)
Replaces the first subsequence of the input sequence that
matches the pattern with the given replacement string.
• Matcher usePattern(Pattern newPattern)
Changes the Pattern that this Matcher uses to find matches with.
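A minimal sketch contrasting replaceAll and replaceFirst. ReplaceDemo, maskAll, and maskFirst are names invented for the example; the "#" replacement is arbitrary.

```java
import java.util.regex.Pattern;

class ReplaceDemo {
    private static final Pattern DIGITS = Pattern.compile("[0-9]+");

    // Hypothetical helper: replaces every run of digits with "#".
    static String maskAll(String input) {
        return DIGITS.matcher(input).replaceAll("#");
    }

    // Hypothetical helper: replaces only the first run of digits with "#".
    static String maskFirst(String input) {
        return DIGITS.matcher(input).replaceFirst("#");
    }

    public static void main(String[] args) {
        System.out.println(maskAll("a1b22c333"));    // a#b#c#
        System.out.println(maskFirst("a1b22c333"));  // a#b22c333
    }
}
```

In a replacement string, a dollar sign or backslash has special meaning; Matcher.quoteReplacement can be used when the replacement should be taken literally.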
Pattern Method Equivalents
in java.lang.String
• public boolean matches(String regex)
– tells whether or not this string matches the given regular
expression
– “str.matches(regex)” yields exactly the same result as the
expression “Pattern.matches(regex, str)”
• public String[] split(String regex, int limit)
– splits this string around matches of the given regular expression
– “str.split(regex, n)” yields the same result as the
expression “Pattern.compile(regex).split(str, n)”
• public String[] split(String regex)
– splits this string around matches of the given regular expression
– works the same as if you invoked the two-argument split method
with the given expression and a limit argument of zero
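The effect of the limit argument is easiest to see with trailing delimiters. In this sketch, StringSplitDemo and parts are hypothetical names and ":" is an arbitrary delimiter.

```java
import java.util.Arrays;

class StringSplitDemo {
    // Hypothetical helper: splits on ":" using the given limit.
    static String[] parts(String input, int limit) {
        return input.split(":", limit);
    }

    public static void main(String[] args) {
        String s = "a:b:c::";
        // Limit 0 (same as the one-argument split): trailing empty strings are dropped.
        System.out.println(Arrays.toString(parts(s, 0)));   // [a, b, c]
        // Positive limit: at most that many pieces; the last holds the rest.
        System.out.println(Arrays.toString(parts(s, 2)));   // [a, b:c::]
        // Negative limit: no cap, and trailing empty strings are kept.
        System.out.println(Arrays.toString(parts(s, -1)));  // [a, b, c, , ]
        // String.matches tests the whole string, like Pattern.matches.
        System.out.println("hello".matches("[a-z]+"));      // true
    }
}
```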
Regular Expression Syntax:
POSIX Character Classes (US-ASCII)
\p{Lower}    A lower-case alphabetic character: [a-z]
\p{Upper}    An upper-case alphabetic character: [A-Z]
\p{ASCII}    All ASCII: [\x00-\x7F]
\p{Alpha}    An alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit}    A decimal digit: [0-9]
\p{Alnum}    An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Print}    A printable character: [\p{Graph}\x20]
\p{Blank}    A space or a tab: [ \t]
\p{XDigit}   A hexadecimal digit: [0-9a-fA-F]
\p{Space}    A whitespace character: [ \t\n\x0B\f\r]
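A quick sketch of the POSIX classes, assuming default flags so that they cover US-ASCII only. PosixClassDemo and fullMatch are illustrative names.

```java
import java.util.regex.Pattern;

class PosixClassDemo {
    // Hypothetical helper: true when the ENTIRE input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("\\p{Upper}\\p{Lower}+", "Java"));  // true
        System.out.println(fullMatch("\\p{XDigit}+", "CAFE42"));         // true
        System.out.println(fullMatch("\\p{Alnum}+", "abc123"));          // true
        System.out.println(fullMatch("\\p{Blank}", "\t"));               // true
        // An accented letter (e with acute) is outside US-ASCII, so Alpha rejects it.
        System.out.println(fullMatch("\\p{Alpha}+", "\u00e9"));          // false
    }
}
```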
Regular Expression Syntax:
java.lang Character Classes
\p{javaLowerCase}    Equivalent to java.lang.Character.isLowerCase()
\p{javaUpperCase}    Equivalent to java.lang.Character.isUpperCase()
\p{javaWhitespace}   Equivalent to java.lang.Character.isWhitespace()
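Unlike the POSIX classes, these classes follow the corresponding java.lang.Character methods, so they also accept non-ASCII characters. The sketch below compares the two; JavaClassDemo and fullMatch are hypothetical names.

```java
import java.util.regex.Pattern;

class JavaClassDemo {
    // Hypothetical helper: true when the ENTIRE input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        char lower = '\u00e9';  // e with acute accent, lower case
        char upper = '\u00c9';  // E with acute accent, upper case

        System.out.println(fullMatch("\\p{javaLowerCase}", String.valueOf(lower))); // true
        System.out.println(Character.isLowerCase(lower));                            // true
        System.out.println(fullMatch("\\p{javaUpperCase}", String.valueOf(upper))); // true
        System.out.println(Character.isUpperCase(upper));                            // true
        System.out.println(fullMatch("\\p{javaWhitespace}", "\t"));                 // true
        System.out.println(Character.isWhitespace('\t'));                            // true
    }
}
```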