Regular Expressions

A simple and powerful way to match characters
Laurent Falquet, EPFL, March 2005
Swiss Institute of Bioinformatics
Swiss EMBnet node
Regular Expressions
What is a regular expression?
Literal (or normal) characters
Alphanumeric: abc…ABC…0123...
Punctuation: -_ ,.;:=()/+ *%&{}[]?!$'^|\<>"@#
Metacharacters
Ex: ls *.java
Flavors…
awk, egrep, Emacs, grep, Perl, POSIX, Tcl, PROSITE!
Example: PROSITE patterns are regular expressions




Pattern: <A-x-[ST](2)-x(0,1)-{V}
Perl Regexp: ^A.[ST]{2}.?[^V]
Text: The sequence must start with an alanine, followed by any amino acid, followed by a serine or a threonine, two times, followed by any amino acid or nothing, followed by any amino acid except a valine.
Only the syntax differs…
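The PROSITE-to-Perl correspondence above can be sketched in a few substitutions. This is a minimal illustration only: prosite_to_regex is a hypothetical helper, not a PROSITE tool, and it handles just the elements used in this pattern. The test sequences are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: translate a simple PROSITE pattern into Perl
# regexp syntax. Covers only the elements shown on the slide.
sub prosite_to_regex {
    my ($pattern) = @_;
    $pattern =~ s/-//g;       # drop the element separators
    $pattern =~ s/\{/[^/g;    # {V}  becomes [^V   (exclusion opens)
    $pattern =~ s/\}/]/g;     #      becomes [^V]  (exclusion closes)
    $pattern =~ s/\(/{/g;     # (2)  becomes {2    (repetition opens)
    $pattern =~ s/\)/}/g;     #      becomes {2}   (repetition closes)
    $pattern =~ s/x/./g;      # x    becomes .     (any amino acid)
    $pattern =~ s/^</^/;      # <    becomes ^     (start of sequence)
    $pattern =~ s/>$/\$/;     # >    becomes $     (end of sequence)
    return $pattern;
}

my $regex = prosite_to_regex('<A-x-[ST](2)-x(0,1)-{V}');
print "$regex\n";             # prints ^A.[ST]{2}.{0,1}[^V]

# Made-up sequences: only the first one satisfies the pattern.
for my $seq (qw(AGSTKL AGSTV GASTKL)) {
    printf "%-7s %s\n", $seq, ($seq =~ /$regex/ ? 'match' : 'no match');
}
```

Note that x(0,1) comes out as .{0,1}, which is equivalent to the .? written on the slide.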
Regular Expressions (1)
In Perl: /…/
Start and end of line: ^ start, $ end
Match any of several: […] or (…|…)
Match 0, 1 or more:
. exactly 1 of any character (except newline by default)
? 0 or 1
+ 1 or more
* 0 or more
{m,n} range (m to n times)
Negation: [^…] inside a character class, !~ as the negated match operator
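A small table-driven check can make the quantifiers and alternatives above concrete. This is a sketch with made-up strings; each pattern is anchored with ^ and $ so the whole string has to match.

```perl
use strict;
use warnings;

# Each entry: [pattern, string, expected result]. All strings are invented.
my @cases = (
    ['^ab?c$',      'ac',     1],   # ? allows zero b
    ['^ab?c$',      'abbc',   0],   # ? allows at most one b
    ['^ab+c$',      'ac',     0],   # + needs at least one b
    ['^ab*c$',      'abbbc',  1],   # * allows any number of b
    ['^a.c$',       'axc',    1],   # . is exactly one character
    ['^ab{2,3}c$',  'abbbbc', 0],   # {2,3} allows two or three b
    ['^[cd]og$',    'dog',    1],   # [...] picks one listed character
    ['^(cat|dog)$', 'cow',    0],   # (...|...) picks one alternative
);

my $failures = 0;
for my $case (@cases) {
    my ($pattern, $string, $expected) = @$case;
    my $got = ($string =~ /$pattern/) ? 1 : 0;
    $failures++ if $got != $expected;
    printf "%-12s %-7s %s\n", $pattern, $string, $got ? 'match' : 'no match';
}
print "unexpected results: $failures\n";
```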
Examples
Match every instance of a SwissProt AC
m/[OPQ][0-9][A-Z0-9]{3}[0-9]/;
m/[OPQ]\d[A-Z0-9]{3}\d/;
Match every instance of a SwissProt ID
m/[A-Z0-9]{2,5}_[A-Z0-9]{3,5}/;
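To collect every instance rather than just the first, the patterns above can be combined with the /g modifier in list context. The following is a sketch: the input text and its accession strings are made up for illustration, and the \b word boundaries are an addition not present on the slide.

```perl
use strict;
use warnings;

# Invented text containing AC-like and ID-like strings.
my $text = 'Hits: P12345 (CARD_HUMAN), Q9Y2X3 and O00165; see also ASC_HUMAN.';

# In list context, /g returns all captured matches.
my @acs = $text =~ /\b([OPQ]\d[A-Z0-9]{3}\d)\b/g;
my @ids = $text =~ /\b([A-Z0-9]{2,5}_[A-Z0-9]{3,5})\b/g;

print "ACs: @acs\n";   # ACs: P12345 Q9Y2X3 O00165
print "IDs: @ids\n";   # IDs: CARD_HUMAN ASC_HUMAN
```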
Regular Expressions (2)
Escape character or back reference: \char or \num
Shorthand
\d digit [0-9]
\s whitespace [space\f\n\r\t]
\w word character [a-zA-Z0-9_]
\D \S \W complement of \d \s \w
Byte notation
\num character in octal
\xnum character in hexadecimal
\cchar control character
Match operator: m/…/
$var =~ m/colou?r/;
$var !~ m/colou?r/;
Substitution operator: s/…/…/
$var =~ s/colou?r/couleur/;
Translate operator: tr/…/…/
$revcomp =~ tr/ACGT/tgca/;
Modifiers /…/#
/i case insensitive
/g global match
Many others: /s, /m, /o, /x...
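The three operators and the /i and /g modifiers can be tried together in a short script. This is a sketch with invented strings. Note that tr/ACGT/tgca/ on its own only complements the bases, so the example assumes the sequence is reversed first to obtain a reverse complement.

```perl
use strict;
use warnings;

my $text = 'Color or colour?';

# Match operator, with and without /i.
my $plain  = ($text =~ m/^colou?r/)  ? 1 : 0;   # 0: "Color" starts with a capital C
my $nocase = ($text =~ m/^colou?r/i) ? 1 : 0;   # 1: /i ignores case

# Substitution operator: first occurrence only, then all occurrences (/g).
(my $first = $text) =~ s/colou?r/couleur/i;     # couleur or colour?
(my $all   = $text) =~ s/colou?r/couleur/gi;    # couleur or couleur?

# Translate operator: reverse the sequence, then complement each base.
my $seq     = 'ATGCCG';
my $revcomp = reverse $seq;                     # GCCGTA
$revcomp =~ tr/ACGT/tgca/;                      # cggcat

print "$plain $nocase\n$first\n$all\n$revcomp\n";
```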
Regular Expressions (3)
Grouping
External reference
$var =~ s/sp\:(\w\d{5})/swissprot AC=$1/;
Internal reference
$var =~ s/tr\:(\w\d{5})\|\1/trembl AC=$1/;
Numbering
$1 to $9
$10 and beyond if needed...
Exercises
Create a regexp to recognize any pseudo IP address: 012.345.678.912
Create a regexp to recognize any email address: [email protected]
Create a regexp to change any HTML tag to another: <address> -> <pre>
On sib-dea: use visual_regexp-1.2.tcl to check your regular expressions (requires X-windows)
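The two grouping substitutions from this slide can be tried on sample strings. This is a sketch: the accession strings are made up, and the third case is added to show that the internal reference \1 only matches when the same text is repeated.

```perl
use strict;
use warnings;

# External reference: $1 is used in the replacement part.
my $sp = 'sp:P12345';
$sp =~ s/sp\:(\w\d{5})/swissprot AC=$1/;         # swissprot AC=P12345

# Internal reference: \1 must repeat the captured text inside the pattern.
my $tr = 'tr:Q12345|Q12345';
$tr =~ s/tr\:(\w\d{5})\|\1/trembl AC=$1/;        # trembl AC=Q12345

# Different text after the bar: \1 fails, so the string is left unchanged.
my $other = 'tr:Q12345|Q54321';
$other =~ s/tr\:(\w\d{5})\|\1/trembl AC=$1/;

print "$sp\n$tr\n$other\n";
```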

Regular Expressions (4)
Solution RegExp
/(\d{1,3}\.){3}\d{1,3}/
/\w+\.\w+\@\w+\-?\w+\.[a-z]{2,4}/
s/\<(\/?)address\>/\<$1pre\>/
generalized: replace address with \w+
s/\<(\/?)\w+\>/\<$1pre\>/
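The exercise solutions can be checked against a few strings. This is a sketch under stated assumptions: the pseudo IP pattern is written with a group, (?:…){3}, so that "one to three digits plus a dot" is repeated as a unit; the ^ and $ anchors are added for whole-string testing; and the email address and HTML snippet are invented.

```perl
use strict;
use warnings;

# Anchored versions of the patterns, for whole-string checks.
my $ip_re    = qr/^(?:\d{1,3}\.){3}\d{1,3}$/;
my $email_re = qr/^\w+\.\w+\@\w+\-?\w+\.[a-z]{2,4}$/;

my $ip_ok    = ('012.345.678.912' =~ $ip_re)          ? 1 : 0;   # 1
my $ip_short = ('12.34.56' =~ $ip_re)                 ? 1 : 0;   # 0: only three fields
my $mail_ok  = ('first.last@my-lab.org' =~ $email_re) ? 1 : 0;   # 1 (made-up address)

# Tag replacement: the optional slash is captured and reused via $1.
my $html = '<address>Lausanne</address>';
$html =~ s/<(\/?)address>/<$1pre>/g;                  # <pre>Lausanne</pre>

print "$ip_ok $ip_short $mail_ok\n$html\n";
```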
Perl In-liners
In-liners: some options










-a autosplit (only with -n or -p)
-c check syntax
-d debugger
-e pass script lines
-h help
-i direct editing of a file
-n loop without print
-p loop with print
-v version
…

Example:
perl -e 'print qq(hello world\n);'
In-liners: -n and -p
perl -pe 's/\r/\n/g' <file>
is equivalent to:
open READ, "file";
while (<READ>) {
    s/\r/\n/g;
    print;
}
close(READ);
perl -ne is the same without the "print"
perl -i -pe 's/\r/\n/g' <file>
Warning: the -i option modifies the file directly
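The loop that -p wraps around the script can be tried without touching a real file. This sketch reads from an in-memory filehandle (an assumption made for the demonstration, the slide uses a file) and collects the output in a string instead of printing each line.

```perl
use strict;
use warnings;

# Made-up input with carriage-return line endings.
my $input = "line one\rline two\rline three\r";

# Reading from a reference to a string gives an in-memory filehandle.
open my $in, '<', \$input or die "cannot open in-memory handle: $!";

my $output = '';
while (<$in>) {
    s/\r/\n/g;        # the script given to -e
    $output .= $_;    # -p would print here; -n would do nothing
}
close $in;

print $output;        # three lines, each ending in a newline
```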
In-liners: -a (only with -n or -p)
perl -ane 'print @F, "\n";' <file>
is equivalent to:
open READ, "file";
while (<READ>) {
    @F = split(' ');
    print @F, "\n";
}
close(READ);
Example:
hits -b 'sw' -o pff2 prf:CARD | perl -ane 'print join("\t", reverse(@F)),"\n";'
In-liners: -a (only with -n or -p)
hits -b 'sw' -o pff2 prf:CARD
sw:ICEA_XENLA     1    90  prf:CARD  5   -1  18.553
sw:RIK2_MOUSE   435   513  prf:CARD  5  -11  15.058
sw:CARC_HUMAN     1    88  prf:CARD  6   -1  15.395
sw:NAL1_HUMAN  1380  1463  prf:CARD  7   -1  15.058
sw:ASC_HUMAN    113   195  prf:CARD  7   -2  15.374
sw:CAR8_HUMAN   347   430  prf:CARD  8   -1  18.343
sw:CARF_HUMAN   134   218  prf:CARD  9   -1  12.932
hits -b 'sw' -o pff2 prf:CARD | perl -ane 'print join("\t", reverse(@F)),"\n";'
18.553   -1  5  prf:CARD    90     1  sw:ICEA_XENLA
15.058  -11  5  prf:CARD   513   435  sw:RIK2_MOUSE
15.395   -1  6  prf:CARD    88     1  sw:CARC_HUMAN
15.058   -1  7  prf:CARD  1463  1380  sw:NAL1_HUMAN
15.374   -2  7  prf:CARD   195   113  sw:ASC_HUMAN
18.343   -1  8  prf:CARD   430   347  sw:CAR8_HUMAN
12.932   -1  9  prf:CARD   218   134  sw:CARF_HUMAN
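The autosplit step can be reproduced in a script without the hits program. This sketch hard-codes two of the output rows shown above, and reverse_fields is a hypothetical helper name that does what -a followed by join and reverse does for one line.

```perl
use strict;
use warnings;

# Hypothetical helper: split a line on whitespace (what -a does to fill @F)
# and return the fields in reverse order, joined by tabs.
sub reverse_fields {
    my ($line) = @_;
    my @F = split ' ', $line;
    return join("\t", reverse @F) . "\n";
}

# Two rows copied from the hits output on the slide.
my @rows = (
    'sw:ICEA_XENLA     1    90  prf:CARD  5   -1  18.553',
    'sw:RIK2_MOUSE   435   513  prf:CARD  5  -11  15.058',
);
print reverse_fields($_) for @rows;
```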
In-liners: examples
perl -e 'print int(rand(100)),"\n" for 1..100' | perl -e '$x{$_}=1 while <>;print sort {$a<=>$b} keys %x'
Equivalent script:
for($i=0;$i<100;$i++) {
    $nb = int(rand(100));
    $hash{$nb} = 1;
}
print "$_\n" for sort {$a<=>$b} keys %hash;
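The idea behind this example is that hash keys are unique, so storing each number as a key removes duplicates before sorting. A sketch of the same technique under use strict follows; unique_sorted is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: remove duplicates through hash keys,
# then sort the remaining values numerically.
sub unique_sorted {
    my %seen;
    $seen{$_} = 1 for @_;
    return sort { $a <=> $b } keys %seen;
}

# 100 random integers between 0 and 99, as in the slide.
my @random = map { int(rand(100)) } 1 .. 100;
my @unique = unique_sorted(@random);
print "$_\n" for @unique;
```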
In-liners: extract FASTA from SP
open (READ, "/db/proteome/ECOLI.dat"); # open file
while ($line=<READ>) { # read line by line until the end
    if ($line =~ /^ID +(\w+)/) { print ">$1\n"; } # print fasta header
    if ($line =~ /^ /) {
        $line =~ s/ //g; # remove spaces
        print $line; # print sequence line
    }
}
close(READ);

cat /db/proteome/ECOLI.dat | perl -ne 'if (/^ID +(\w+)/) {print ">$1\n";} if (/^ /) {s/ //g; print}'
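The extraction can be tested without access to /db/proteome/ECOLI.dat. This sketch runs the same two regular expressions over a tiny made-up record in SwissProt-like layout, read through an in-memory filehandle; the entry name, accession, and sequence are invented, and swissprot_to_fasta is a hypothetical helper name.

```perl
use strict;
use warnings;

# Invented record: sequence lines start with spaces, as in SwissProt files.
my $record = join '',
    "ID   TEST_ECOLI     STANDARD;      PRT;    20 AA.\n",
    "AC   P00000;\n",
    "SQ   SEQUENCE   20 AA;\n",
    "     MKTAYIAKQR QISFVKSHFS\n",
    "//\n";

# Hypothetical helper: same logic as the script above,
# but returning the FASTA text instead of printing it.
sub swissprot_to_fasta {
    my ($fh) = @_;
    my $fasta = '';
    while (my $line = <$fh>) {
        $fasta .= ">$1\n" if $line =~ /^ID +(\w+)/;   # FASTA header from the ID line
        if ($line =~ /^ /) {                          # sequence line
            $line =~ s/ //g;                          # remove spaces
            $fasta .= $line;
        }
    }
    return $fasta;
}

open my $fh, '<', \$record or die "cannot open in-memory handle: $!";
my $fasta = swissprot_to_fasta($fh);
close $fh;
print $fasta;
```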
In-liners: your turn…

Create an In-liner that extracts non-redundant FASTA format sequences from a redundant database in SwissProt format
cat /db/proteome/ECOLI.dat | perl -ne 'if (/^ID +(\w+)/) {print ">$1\n";} if (/^ /) {s/ //g; print}' | perl -e 'while(<>) { if (/>/) { $i=$_; $x{$i}=""} $x{$i}.=$_} print values %x'
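The second stage of this pipeline keys each record on its header line, so a repeated header overwrites the earlier copy. A script sketch of that step follows. It assumes, unlike the in-liner, that records should come out in first-seen order (print values %x gives no guaranteed order); nonredundant_fasta is a hypothetical helper name and the input lines are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: keep one record per FASTA header.
sub nonredundant_fasta {
    my @lines = @_;
    my (%record, @order, $header);
    for my $line (@lines) {
        if ($line =~ /^>/) {
            $header = $line;
            push @order, $header unless exists $record{$header};
            $record{$header} = '';          # a repeated header restarts its record
        }
        next unless defined $header;        # skip anything before the first header
        $record{$header} .= $line;
    }
    return join '', map { $record{$_} } @order;
}

# Made-up redundant input: entry A appears twice.
my @input  = (">A\n", "MK\n", ">B\n", "TT\n", ">A\n", "MK\n");
my $result = nonredundant_fasta(@input);
print $result;
```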