Regular Expressions and Java
蕭宇程
[email protected]
http://swanky.adsldns.org/
Introduction to Regular Expressions
Regular Expression Syntax
Object Models
Using regexes in Java
Part 1:
Introduction to Regular Expressions
Regular expressions are the key to powerful,
flexible, and efficient text processing.
Searching Text Files: egrep
The Filename Analogy
In DOS/Windows
dir *.txt
* and ?: file globs, or wildcards
*: matches anything
?: matches any one character
Generalized Pattern Language
Regular Expressions:
A powerful, generalized pattern language,
and also the patterns themselves
The Language Analogy
Regular Expressions are composed of:
Metacharacters (special characters)
Literal (normal text characters)
Literal text acting as the words and
metacharacters as the grammar.
Part 2
Regular Expression Syntax
Small test program for regular expressions
http://www.javaworld.com.tw/blog/archives/ciyawasay/000381.html
Slides and example files for this regular expression talk
http://www.javaworld.com.tw/blog/archives/ciyawasay/000394.html
Start and End of the Line
Start: ^ (caret)
End: $ (dollar)
cat, ^cat, cat$
Match a position in the line rather than
any actual text characters themselves.
Question: What do ^cat$, ^$, and ^ each mean?
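The anchors above can be tried out with a minimal sketch like the following. The class name AnchorDemo, the helper found(), and the sample lines are illustrative assumptions, not part of the slides; found() simply asks whether the regex matches anywhere in the line, the way egrep would.

```java
import java.util.regex.Pattern;

class AnchorDemo {
    // True if the regex matches somewhere in the line (egrep-style search).
    static boolean found(String regex, String line) {
        return Pattern.compile(regex).matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(found("cat", "vacation"));   // true: "cat" appears inside the line
        System.out.println(found("^cat", "vacation"));  // false: the line does not start with "cat"
        System.out.println(found("^cat", "catalog"));   // true
        System.out.println(found("cat$", "bobcat"));    // true: the line ends with "cat"
        System.out.println(found("^cat$", "cat"));      // true: the line is exactly "cat"
        System.out.println(found("^$", ""));            // true: an empty line
        System.out.println(found("^", "anything"));     // true: every line has a start
    }
}
```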
Character Classes
Matching any one of several characters […]
Negated character classes [^…]
Matching any one of several characters
[…]
grey、gray ⇒ gr[ea]y
[0-9]: inside a class, the character-class metacharacter ‘-’ (dash) indicates a range
<h[123456]> ⇒ <h[1-6]>
[0-9a-fA-F] = [A-Fa-f0-9]
A dash is a metacharacter only within a character class;
otherwise it matches the normal dash character.
Negated character classes
[^…]
Matches any character that isn’t listed.
[^1-6]
matches a character that’s not 1 through 6.
Question: Why doesn’t q[^u] match ‘Qantas’ or ‘Iraq’?
Character Class Notes
A character class, even negated, still
requires a character to match.
Consider character classes as their
own mini language. The rules
regarding which metacharacters are
supported (and what they do) are
completely different inside and outside
of character classes.
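A small sketch of the character-class examples above, including the q[^u] question. The class and helper names and the extra test word "Iraqi" are assumptions made for illustration.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // True if the regex matches somewhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(found("gr[ea]y", "grey"));   // true
        System.out.println(found("gr[ea]y", "gray"));   // true
        System.out.println(found("<h[1-6]>", "<h3>"));  // true
        System.out.println(found("<h[1-6]>", "<h7>"));  // false: 7 is outside the range
        // q[^u] needs a lowercase q followed by some character that is not u.
        System.out.println(found("q[^u]", "Qantas"));   // false: the Q is uppercase
        System.out.println(found("q[^u]", "Iraq"));     // false: nothing follows the q
        System.out.println(found("q[^u]", "Iraqi"));    // true: q is followed by i
    }
}
```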
Matching Any Character with Dot
. (dot, point)
Matches any character
03.19.76 also matches 03a19●76
03/19/76, 03-19-76, 03.19.76 ⇒
03[-./]19[-./]76
The dots are not metacharacters within a
character class
[.-/] would be a mistake: there the dash forms a range from ‘.’ to ‘/’, so it no longer matches a literal dash
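The difference between the bare dot, the class [-./], and the mistaken [.-/] can be checked with a sketch like this one. The class name DotDemo and the helper are assumptions; the ● of the slides is written as an ordinary space in the code.

```java
import java.util.regex.Pattern;

class DotDemo {
    // True if the regex matches somewhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // The dot matches any character, so this is looser than intended.
        System.out.println(found("03.19.76", "03a19 76"));          // true
        // A class listing the three allowed separators, dash first.
        System.out.println(found("03[-./]19[-./]76", "03-19-76"));  // true
        System.out.println(found("03[-./]19[-./]76", "03/19/76"));  // true
        System.out.println(found("03[-./]19[-./]76", "03.19.76"));  // true
        System.out.println(found("03[-./]19[-./]76", "03a19 76"));  // false
        // With the dash in the middle, .-/ is a range from '.' to '/'.
        System.out.println(found("03[.-/]19[.-/]76", "03-19-76"));  // false
    }
}
```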
Alternation
Matching any one of several subexpressions
| (or, bar)
Combines multiple expressions into a single expression that
matches any of the individual ones.
gr[ea]y = grey|gray = gr(a|e)y
gr[a|e]y: Wrong! Within a class, the ‘|’
character is just a normal character.
Question
Jeffrey|Jeffery
(Geoff|Jeff)(rey|ery)
^(From|Subject|Date):● matches any of:
1. Start-of-line, followed by F, r, o, m, followed by ‘:●’
2. Start-of-line, followed by S, u, b, j, e, c, t, followed by ‘:●’
3. Start-of-line, followed by D, a, t, e, followed by ‘:●’
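The alternation examples above might look like this in Java. This is only a sketch: the class name, the headerName() helper, and the sample header lines are made up for illustration, and the ● is written as a plain space.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationDemo {
    private static final Pattern HEADER = Pattern.compile("^(From|Subject|Date): ");

    // Returns the header name if the line starts with one of the three headers, else null.
    static String headerName(String line) {
        Matcher m = HEADER.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(headerName("Subject: regex talk"));  // Subject
        System.out.println(headerName("To: someone"));          // null
        System.out.println("gray".matches("grey|gray"));        // true
        // Inside a class the bar is literal, so gr[a|e]y also accepts "gr|y".
        System.out.println("gr|y".matches("gr[a|e]y"));         // true
        // (Geoff|Jeff)(rey|ery) allows four spellings, not just two.
        System.out.println("Geoffery".matches("(Geoff|Jeff)(rey|ery)"));  // true
    }
}
```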
Character Class & Alternation
A character class can match just a single
character in the target text.
With alternation, since each alternative can
be a full-fledged regular expression in and of
itself, each alternative can match an
arbitrary amount of text.
Character Class & Alternation
Character classes are almost like their own
special mini-language (with their own ideas
about metacharacters, for example),
while alternation is part of the “main” regular
expression language.
Word Boundaries
\b (\<, \>)
Match the position at the start and end of a word
(word-based versions of ^ and $)
\bcat\b, \bcat, cat\b
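A brief sketch of the three word-boundary patterns above, using java.util.regex's \b. The class name, helper, and sample words are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // True if the regex matches somewhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(found("\\bcat\\b", "the cat sat"));  // true: "cat" is a whole word
        System.out.println(found("\\bcat\\b", "concatenate"));  // false: "cat" is inside a word
        System.out.println(found("\\bcat", "catalog"));         // true: a word starts with "cat"
        System.out.println(found("\\bcat", "bobcat"));          // false
        System.out.println(found("cat\\b", "bobcat"));          // true: a word ends with "cat"
    }
}
```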
Optional Items
? (question mark)
color|colour ⇒ colou?r
Optional: placed after the character that is allowed to appear
at that point in the expression, but whose existence isn’t
actually required to still be considered a successful match.
? can attach to a parenthesized expression.
4th|4 ⇒ 4(th)?
Other Quantifiers: Repetition
+ (plus)
One or more of the immediately-preceding item
* (asterisk, star)
Any number, including none, of the item
<hr●size=14> ⇒ <hr●+size●*=●*14●*>
Exercise: make the size part optional.
Defined range of matches: intervals
{min,max} (interval quantifier)
[a-zA-Z]{1,5}
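The quantifier examples above can be exercised with String.matches, which requires the whole string to match. The sketch below is illustrative only: the class name is an assumption, the ● is written as a space, and the last pattern is one possible answer to the exercise, not necessarily the one intended by the slides.

```java
class QuantifierDemo {
    public static void main(String[] args) {
        System.out.println("color".matches("colou?r"));    // true: the u is optional
        System.out.println("colour".matches("colou?r"));   // true
        System.out.println("4".matches("4(th)?"));         // true: ? applies to the whole group
        System.out.println("4th".matches("4(th)?"));       // true

        String hr = "<hr +size *= *14 *>";
        System.out.println("<hr   size = 14 >".matches(hr));  // true: extra spaces are allowed
        System.out.println("<hrsize=14>".matches(hr));        // false: + needs at least one space

        System.out.println("hello".matches("[a-zA-Z]{1,5}"));   // true: five letters
        System.out.println("abcdef".matches("[a-zA-Z]{1,5}"));  // false: six letters

        // One possible answer to the exercise: group the size part and make it optional.
        String hrOptional = "<hr( +size *= *14)? *>";
        System.out.println("<hr>".matches(hrOptional));          // true
        System.out.println("<hr size=14>".matches(hrOptional));  // true
    }
}
```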
Parentheses and Backreferences
Parentheses can “remember” text matched
by the subexpression they enclose.
Backreferencing: match new text that is the
same as some text matched earlier in the
expression.
Doubled-word problem: the●the
\b([A-Za-z]+)●+\1\b
([a-z])([0-9])\1\2
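A sketch of the doubled-word check built on the first pattern above. The class name, the firstDoubledWord() helper, and the sample sentences are assumptions; the ● is written as a space.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackrefDemo {
    // \1 must repeat exactly the text that group 1 captured.
    private static final Pattern DOUBLED = Pattern.compile("\\b([A-Za-z]+) +\\1\\b");

    // Returns the first doubled word in the text, or null if there is none.
    static String firstDoubledWord(String text) {
        Matcher m = DOUBLED.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstDoubledWord("Paris in the the spring"));  // the
        // The trailing \b keeps "the theory" from counting as a doubled word.
        System.out.println(firstDoubledWord("the theory"));               // null
        System.out.println("a1a1".matches("([a-z])([0-9])\\1\\2"));       // true
        System.out.println("a1b1".matches("([a-z])([0-9])\\1\\2"));       // false
    }
}
```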
The Great Escape
\ (escape)
When a metacharacter is escaped, it loses its
special meaning and becomes a literal character.
ega.att.com ⇒ ega\.att\.com
(very) ⇒ \([a-zA-Z]+\)
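Not escapes: in \<, \>, and \1 the backslash does not quote a metacharacter; it forms a metasequence with a special meaning of its own.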
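The effect of escaping can be seen in a short sketch. The class name, helper, and the sample text "megawatt computing" are assumptions; Pattern.quote is a standard java.util.regex method that quotes a whole string, shown here as an alternative to escaping each dot by hand. Note that in Java source the backslash itself must be doubled inside a string literal.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // True if the regex matches somewhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // Unescaped dots match any character, so this matches by accident.
        System.out.println(found("ega.att.com", "megawatt computing"));      // true
        // Escaped dots match only a literal '.'.
        System.out.println(found("ega\\.att\\.com", "megawatt computing"));  // false
        System.out.println(found("ega\\.att\\.com", "ega.att.com"));         // true
        // Pattern.quote treats the whole string as literal text.
        System.out.println(found(Pattern.quote("ega.att.com"), "megawatt computing"));  // false
        // Escaped parentheses match literal parentheses instead of grouping.
        System.out.println(found("\\([a-zA-Z]+\\)", "a (very) short note"));  // true
    }
}
```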
Part 3
Object Models
Tasks that need to be done when using a
regular expression:
Setup . . .
1. Accept a string as a regex; compile to an internal form.
2. Associate the regex with the target text.
Actually apply the regex . . .
3. Initiate a match attempt.
See the results . . .
4. Learn whether the match is successful.
5. Gain access to further details of a successful attempt.
6. Query those details (what matched, where it matched, etc.).
You might repeat from step 3 to find the next
match in the target string.
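The six steps above map onto java.util.regex roughly as follows. This is a sketch under assumptions: the class name MatchSteps, the allMatches() helper, and the "text@start-end" output format are invented here for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchSteps {
    // Collects every match as "text@start-end".
    static List<String> allMatches(String regex, String target) {
        Pattern p = Pattern.compile(regex);   // step 1: compile the regex string
        Matcher m = p.matcher(target);        // step 2: associate it with the target text
        List<String> out = new ArrayList<>();
        while (m.find()) {                    // steps 3 and 4: attempt a match, test success
            // steps 5 and 6: the Matcher itself holds the details; query them
            out.add(m.group() + "@" + m.start() + "-" + m.end());
        }                                     // looping repeats from step 3 for the next match
        return out;
    }

    public static void main(String[] args) {
        System.out.println(allMatches("\\d+", "a1 b22 c333"));  // [1@1-2, 22@4-6, 333@8-11]
    }
}
```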
An “all-in-one” model
A “match state” model (Java)
Pattern: Represents a compiled regular
expression.
Matcher: Has all of the state associated
with applying a Pattern object to a particular
string.
A “match result” model
Part 4
Using regexes in Java
Subexpression   Matches                                                          Notes
General
^               Start of line/string
$               End of line/string
\b              Word boundary
\B              Not a word boundary
\A              Beginning of entire string
\z              End of entire string
\Z              End of entire string (except allowable final line terminator)
.               Any one character (except line terminator)
[...]           "Character class"; any one character from those listed
[^...]          Any one character not from those listed
Alternation and grouping
(...)           Grouping (capture groups)
|               Alternation
(?:re)          Noncapturing parenthesis
\G              End of the previous match
\n              Back-reference to capture group number "n"
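A short sketch of how capturing and noncapturing parentheses differ, plus a back-reference. The class name, the groups() helper, and the sample strings are assumptions made for illustration.

```java
import java.util.regex.Pattern;

class GroupDemo {
    // Number of capture groups the regex defines.
    static int groups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(groups("(Geoff|Jeff)(rey|ery)"));    // 2: both pairs capture
        System.out.println(groups("(?:Geoff|Jeff)(rey|ery)"));  // 1: (?:...) groups without capturing
        // Both forms accept the same text; only the numbering of groups changes.
        System.out.println("Jeffrey".matches("(?:Geoff|Jeff)(rey|ery)"));  // true
        // \1 refers back to whatever group 1 matched: here, a doubled letter.
        System.out.println(Pattern.compile("(\\w)\\1").matcher("book").find());  // true
        System.out.println(Pattern.compile("(\\w)\\1").matcher("bike").find());  // false
    }
}
```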
Normal (greedy) multipliers
{m,n}           Multiplier for "from m to n repetitions"
{m,}            Multiplier for "m or more repetitions"
{m}             Multiplier for "exactly m repetitions"
{,n}            Multiplier for 0 up to n repetitions                             java.util.regex rejects an omitted minimum; write {0,n} (likewise {0,n}? and {0,n}+)
*               Multiplier for 0 or more repetitions                             Short for {0,}
+               Multiplier for 1 or more repetitions                             Short for {1,}
?               Multiplier for 0 or 1 repetitions (i.e., present                 Short for {0,1}
                exactly once, or not at all)
Reluctant (non-greedy) multipliers
{m,n}?          Reluctant multiplier "from m to n repetitions"
{m,}?           Reluctant multiplier "m or more repetitions"
{,n}?           Reluctant multiplier for 0 up to n repetitions
*?              Reluctant multiplier: 0 or more
+?              Reluctant multiplier: 1 or more
??              Reluctant multiplier: 0 or 1 times
Possessive (very greedy) multipliers
{m,n}+          Possessive multiplier "from m to n repetitions"
{m,}+           Possessive multiplier "m or more repetitions"
{,n}+           Possessive multiplier for 0 up to n repetitions
*+              Possessive multiplier: 0 or more
++              Possessive multiplier: 1 or more
?+              Possessive multiplier: 0 or 1 times
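The three families of multipliers differ in how much they take and whether they give anything back. The following sketch applies the greedy, reluctant, and possessive forms of + to the same input; the class name, the first() helper, and the sample markup are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierModes {
    // Returns the first match of the regex in the text, or null if there is none.
    static String first(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "<b>bold</b>";
        // Greedy: .+ takes as much as it can, then backs off just enough for the final '>'.
        System.out.println(first("<.+>", text));   // <b>bold</b>
        // Reluctant: .+? takes as little as it can.
        System.out.println(first("<.+?>", text));  // <b>
        // Possessive: .++ takes everything and never gives it back, so '>' can never match.
        System.out.println(first("<.++>", text));  // null
    }
}
```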
Escapes and shorthands
\               Escape (quote) character: turns most metacharacters off;
                turns subsequent alphabetic into metacharacters
\Q              Escape (quote) all characters up to \E
\E              Ends quoting begun with \Q
\t              Tab character
\r              Return (carriage return) character
\n              Newline character
\f              Form feed
\w              Character in a word                                              Use \w+ for a word
\W              A non-word character
\d              Numeric digit                                                    Use \d+ for an integer
\D              A non-digit character
\s              Whitespace                                                       Space, tab, etc., as determined by java.lang.Character.isWhitespace()
\S              A nonwhitespace character
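A few of the shorthands above in use. This is a sketch with assumed names and sample strings; it follows the table's notes in using \d+ for an integer and \w+ for a word, and uses \Q...\E to quote a string that contains a metacharacter.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ShorthandDemo {
    // Returns the first match of the regex in the text, or null if there is none.
    static String first(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(first("\\d+", "order 66 shipped"));  // 66
        System.out.println(first("\\w+", "order 66 shipped"));  // order
        // \s+ as a separator: any run of spaces or tabs.
        System.out.println(Arrays.toString("a  b\tc".split("\\s+")));  // [a, b, c]
        // Unquoted, the + is a multiplier, so the literal text "1+1=2" is not matched.
        System.out.println(first("1+1=2", "1+1=2"));            // null
        // Between \Q and \E every character is literal.
        System.out.println(first("\\Q1+1=2\\E", "1+1=2"));      // 1+1=2
    }
}
```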
Unicode blocks (representative samples)
\p{InGreek}     A character in the Greek block                                   (simple block)
\P{InGreek}     Any character not in the Greek block
\p{Lu}          An uppercase letter                                              (simple category)
\p{Sc}          A currency symbol
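The Unicode properties above can be tried with single-character strings. In this sketch the class name and the sample characters are assumptions; non-ASCII characters are written as Unicode escapes so the source file's encoding does not matter.

```java
class UnicodeDemo {
    public static void main(String[] args) {
        String alpha = "\u03b1";  // Greek small letter alpha
        String euro = "\u20ac";   // euro sign
        System.out.println(alpha.matches("\\p{InGreek}"));  // true: inside the Greek block
        System.out.println("a".matches("\\p{InGreek}"));    // false
        System.out.println("a".matches("\\P{InGreek}"));    // true: capital P negates
        System.out.println("D".matches("\\p{Lu}"));         // true: uppercase letter
        System.out.println("d".matches("\\p{Lu}"));         // false
        System.out.println(euro.matches("\\p{Sc}"));        // true: currency symbol
        System.out.println("$".matches("\\p{Sc}"));         // true
    }
}
```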
POSIX-style character classes (defined only for US-ASCII)
\p{Alnum}       Alphanumeric characters                                          [A-Za-z0-9]
\p{Alpha}       Alphabetic characters                                            [A-Za-z]
\p{ASCII}       Any ASCII character                                              [\x00-\x7f]
\p{Blank}       Space and tab characters
\p{Space}       Space characters                                                 [ \t\n\x0B\f\r]
\p{Cntrl}       Control characters                                               [\x00-\x1F\x7F]
\p{Digit}       Numeric digit characters                                         [0-9]
\p{Graph}       Printable and visible characters
                (not spaces or control characters)
\p{Print}       Printable characters                                             Same as \p{Graph}
\p{Punct}       Punctuation characters                                           One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Lower}       Lowercase characters                                             [a-z]
\p{Upper}       Uppercase characters                                             [A-Z]
\p{XDigit}      Hexadecimal digit characters                                     [0-9a-fA-F]
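A small sketch of the POSIX-style classes, including what "defined only for US-ASCII" means in practice with the default flags. The class name and sample strings are assumptions; the accented letter is written as a Unicode escape.

```java
class PosixDemo {
    public static void main(String[] args) {
        System.out.println("CAFE01".matches("\\p{XDigit}+"));  // true: all hexadecimal digits
        System.out.println("CAFEG1".matches("\\p{XDigit}+"));  // false: G is not a hex digit
        System.out.println(" \t".matches("\\p{Blank}+"));      // true: space and tab
        System.out.println("-".matches("\\p{Punct}"));         // true
        String eAcute = "\u00c9";  // Latin capital letter E with acute
        // The POSIX-style class covers only A-Z, while the Unicode category does not.
        System.out.println(eAcute.matches("\\p{Upper}"));      // false
        System.out.println(eAcute.matches("\\p{Lu}"));         // true
    }
}
```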
Java API
public final class Pattern {
// Flags values ('or' together)
public static final int
UNIX_LINES, CASE_INSENSITIVE, COMMENTS,
MULTILINE, DOTALL, UNICODE_CASE, CANON_EQ;
// Factory methods (no public constructors)
public static Pattern compile(String patt);
public static Pattern compile(String patt, int flags);
// Method to get a Matcher for this Pattern
public Matcher matcher(CharSequence input);
// Information methods
public String pattern();
public int flags();
// Convenience methods
public static boolean matches(String pattern,
CharSequence input);
public String[] split(CharSequence input);
public String[] split(CharSequence input, int max);
}
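A sketch of the Pattern methods listed above: flags combined with |, the static matches() convenience method, and split() with a limit. The class name, the countFromHeaders() helper, and the sample text are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternApiDemo {
    // Counts lines that begin with "from:" in any letter case.
    static int countFromHeaders(String text) {
        // MULTILINE lets ^ match after each line terminator, not just at the start of input.
        Pattern p = Pattern.compile("^from:", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
        Matcher m = p.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countFromHeaders("To: a\nFrom: b\nFROM: c"));  // 2
        // The static convenience method requires the whole input to match.
        System.out.println(Pattern.matches("\\d+", "12345"));   // true
        System.out.println(Pattern.matches("\\d+", "123 45"));  // false
        // A limit of 2 yields at most two pieces; the rest stays in the last one.
        System.out.println(Arrays.toString(Pattern.compile(",").split("a,b,c,d", 2)));  // [a, b,c,d]
    }
}
```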
public final class Matcher {
// Action: find or match methods
public boolean matches();
public boolean find();
public boolean find(int start);
public boolean lookingAt();
// "Information about the previous match" methods
public int start();
public int start(int whichGroup);
public int end();
public int end(int whichGroup);
public int groupCount();
public String group();
public String group(int whichGroup);
}
public final class Matcher {
// Reset methods
public Matcher reset();
public Matcher reset(CharSequence newInput);
// Replacement methods
public Matcher appendReplacement(StringBuffer where,
String newText);
public StringBuffer appendTail(StringBuffer where);
public String replaceAll(String newText);
public String replaceFirst(String newText);
// information methods
public Pattern pattern();
}
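The three match actions of Matcher differ in where the match must sit, and the group methods expose the captured pieces. The sketch below uses a fresh Matcher for each action so that no state carries over; the class name, the describe() helper, its output format, and the sample strings are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherApiDemo {
    private static final Pattern ORDINAL = Pattern.compile("(\\d+)(st|nd|rd|th)");

    // Describes the first ordinal found as "number|suffix|suffixStart-suffixEnd".
    static String describe(String text) {
        Matcher m = ORDINAL.matcher(text);
        if (!m.find()) {
            return "no match";
        }
        return m.group(1) + "|" + m.group(2) + "|" + m.start(2) + "-" + m.end(2);
    }

    public static void main(String[] args) {
        System.out.println(ORDINAL.matcher("1st test").matches());        // false: must match the entire input
        System.out.println(ORDINAL.matcher("1st test").lookingAt());      // true: must match at the beginning
        System.out.println(ORDINAL.matcher("the 1st test").lookingAt());  // false
        System.out.println(ORDINAL.matcher("the 1st test").find());       // true: may match anywhere
        System.out.println(describe("the 1st test"));                     // 1|st|5-7
    }
}
```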
/* String, showing only the RE-related methods */
public final class String {
public boolean matches(String regex);
public String replaceFirst(String regex, String newStr);
public String replaceAll(String regex, String newStr);
public String[] split(String regex);
public String[] split(String regex, int max);
}
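For one-off use, the String methods listed above avoid creating Pattern and Matcher objects explicitly. A minimal sketch with assumed sample strings:

```java
import java.util.Arrays;

class StringRegexDemo {
    public static void main(String[] args) {
        // matches() requires the whole string to match the regex.
        System.out.println("03-19-76".matches("\\d\\d[-./]\\d\\d[-./]\\d\\d"));  // true
        System.out.println("a1b22c".replaceAll("\\d+", "#"));    // a#b#c
        System.out.println("a1b22c".replaceFirst("\\d+", "#"));  // a#b22c
        // The separator is a regex: a comma followed by optional whitespace.
        System.out.println(Arrays.toString("a, b,c".split(",\\s*")));  // [a, b, c]
        // With a limit of 2, the remainder stays in the last element.
        System.out.println(Arrays.toString("a,b,c".split(",", 2)));    // [a, b,c]
    }
}
```

Each of these calls compiles the regex again, so a pattern that is applied repeatedly is usually better compiled once with Pattern.compile.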
SimpleRegexText.java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleRegexText {
    public static void main(String[] args) {
        String sampleText = "this is the 1st test string";
        // One or more digits followed by one or more word characters
        String sampleRegex = "\\d+\\w+";
        Pattern p = Pattern.compile(sampleRegex);
        Matcher m = p.matcher(sampleText);
        if (m.find()) {
            String matchedText = m.group();
            int matchedFrom = m.start();
            int matchedTo = m.end();
            System.out.println("matched [" + matchedText + "] from " +
                    matchedFrom + " to " + matchedTo + ".");
        } else {
            System.out.println("didn't match");
        }
    }
}
matched [1st] from 12 to 15.
Example: extracting English words (from Thinking in Java)
//: c12:FindDemo.java
import java.util.regex.*;
import com.bruceeckel.simpletest.*;
import java.util.*;

public class FindDemo {
    private static Test monitor = new Test();
    public static void main(String[] args) {
        Matcher m = Pattern.compile("\\w+").matcher(
            "Evening is full of the linnet's wings");
        while (m.find())
            System.out.println(m.group());
        monitor.expect(new String[] {
            "Evening",
            "is",
            "full",
            "of",
            "the",
            "linnet",
            "s",
            "wings"
        });
    }
} ///:~
import java.util.regex.*;

/** Split a String into a Java array of Strings divided by an RE. */
public class Split {
    public static void main(String[] args) {
        String[] x =
            Pattern.compile("ian").split(
                "the darwinian devonian explodianchicken");
        for (int i = 0; i < x.length; i++) {
            System.out.println(i + " \"" + x[i] + "\"");
        }
    }
}
0 "the darwin"
1 " devon"
2 " explod"
3 "chicken"
import java.util.regex.*;

/**
 * Quick demo of RE substitution: correct "demon" and other
 * spelling variants to the correct, non-satanic "daemon".
 */
public class ReplaceDemo {
    public static void main(String[] argv) {
        // Make an RE pattern to match almost any form (deamon, demon, etc.).
        String patt = "d[ae]{1,2}mon"; // i.e., 1 or 2 'a' or 'e' any combo
        // A test input.
        String input = "Unix hath demons and deamons in it!";
        System.out.println("Input: " + input);
        // Run it from a RE instance and see that it works
        Pattern r = Pattern.compile(patt);
        Matcher m = r.matcher(input);
        System.out.println("ReplaceAll: " + m.replaceAll("daemon"));
        // Show the appendReplacement method
        m.reset();
        StringBuffer sb = new StringBuffer();
        System.out.print("Append methods: ");
        while (m.find()) { // copy to before first match, plus the word "daemon"
            m.appendReplacement(sb, "daemon");
        }
        m.appendTail(sb); // copy remainder
        System.out.println(sb.toString());
    }
}
Input: Unix hath demons and deamons in it!
ReplaceAll: Unix hath daemons and daemons in it!
Append methods: Unix hath daemons and daemons in it!
The End
Thank you, everyone!
If you have questions, you are welcome to come to Lab 506 and work through them with me.