Basics of Pattern matching
Lecture 7:
Perl pattern handling features
Pattern Matching
• Recall that =~ is the pattern matching operator
• A first simple match example:
    print "A methionine amino acid is found\n" if $AA =~ /m/;
– It means: if the string $AA contains the letter m, print that a methionine
amino acid was found.
– What is inside the / / is the pattern, and =~ is the pattern
matching operator
– It could also be written as
    if ($AA =~ /m/)
    {
        print "A methionine amino acid is found\n";
    }
– Met.pl
Pattern Matching
– If we want to check for the start codon we could use:
    if ($seq =~ /ATG/)
    {
        print "a start codon was found on line $.\n";
    }
– Or we could write /ATG/i (where i makes the match case-insensitive)
– If we want to see whether there is an A or T or G or C in the
sequence, use: $seq =~ /[ATGC]/
– The main way to use the Boolean OR is the | symbol:
    if ($dna =~ /GAATTC|AAGCTT/)
    {
        print "EcoRI or HindIII site found!\n";
    }
– (note: GAATTC is the recognition site of the restriction enzyme EcoRI,
and AAGCTT is the recognition site of HindIII)
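The matches on this slide can be tried out with a short script. This is a minimal sketch: the sub names has_start_codon and restriction_site and the test sequence are made up for illustration and are not part of the lecture files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# True if the sequence contains the start codon ATG in any letter case.
sub has_start_codon {
    my ($seq) = @_;
    return $seq =~ /ATG/i ? 1 : 0;
}

# Report which of the two sites is present (EcoRI is checked first).
# Returns undef in scalar context if neither pattern matches.
sub restriction_site {
    my ($dna) = @_;
    return 'EcoRI'   if $dna =~ /GAATTC/;
    return 'HindIII' if $dna =~ /AAGCTT/;
    return;
}

my $dna = 'ccgATGaattGAATTCgg';    # made-up test sequence
print "start codon found\n" if has_start_codon($dna);
print "site found\n"        if $dna =~ /GAATTC|AAGCTT/;
```

The alternation /GAATTC|AAGCTT/ only says that one of the two patterns matched; testing each pattern separately, as restriction_site does, is one way to find out which.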
Sequence size example
• File_size_2 example
    #!/usr/bin/perl
    # file size2.pl
    $length = 0; $lines = 0;
    while (<>) {
        chomp;
        # N refers to any nucleotide
        # (refer to http://blast.ncbi.nlm.nih.gov/blastcgihelp.shtml)
        $length = $length + length $_ if $_ =~ /[GATCNgatcn]/;
        $lines = $lines + 1;
    }
    print "LENGTH = $length\n"; print "LINES = $lines\n";
• The above is a modification of the file length example that counts
only input lines containing a G, A, T, C or N.
• However, this will lead to problems for FASTA files, as the descriptor
line will be included: Why?
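One way to explore the question above is to apply the same pattern to a descriptor line on its own. The sketch below uses the descriptor from the Fastafile_replace.txt example in this lecture; the sub name looks_like_sequence is hypothetical.

```perl
use strict;
use warnings;

# Same test as the script above: true if the line contains at least one
# of the letters G, A, T, C or N, in either case, anywhere in the line.
sub looks_like_sequence {
    my ($line) = @_;
    return $line =~ /[GATCNgatcn]/ ? 1 : 0;
}

my $header = '>gi|171361, Saccharomyces cerevisiae, cystathionine gamma-lyase';
print "descriptor line is counted as sequence\n" if looks_like_sequence($header);
```

Because the character class only has to match one character somewhere in the line, ordinary words in the descriptor are enough to satisfy it.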
Pattern Matching
• A Boolean NOT, for example to see whether a string contains
characters that are not vowels, can be expressed in a pattern
by using the ^ symbol at the start of a set of characters: e.g.
    if ($seq =~ /[^aeiou]/) { print "contains a character that is not a vowel\n"; }
– Note that this matches if at least one character is not a vowel; it does
not test that the string has no vowels.
• More flexible pattern syntax:
• It is quite common to check for words or numbers, so Perl provides
shorthand forms:
– /[0-9]/ or /\d/ is a digit
– A word character is /[a-zA-Z0-9_]/ and is represented by /\w/ (word)
– /\s/ represents a white space character
– Inverting the case of the letter reverses the meaning; e.g. /\S/ (non
white space)
• A more complete list of what are referred to as "metacharacters" is
shown in the next slide (you must of course use =~ in the expression)
Pattern matching: metacharacters
Metacharacter   Description
.               Any character except newline
\.              Full stop character
^               The beginning of a line
$               The end of a line
\w              Any word character (non-punctuation, non-white space)
\W              Any non-word character
\s              White space (spaces, tabs, carriage returns)
\S              Non-white space
\d              Any digit
\D              Any non-digit
You can also specify the number of times a pattern must occur (once, many times, or a specific number of times).
For more information, see the notes on metacharacters and other regular
expressions. Note that groups such as (abc) and the backreferences \1 and \2 are important for comparing sets of characters.
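A minimal sketch of how the metacharacters and a backreference might be used. The sub names count_digits and has_repeated_triplet are hypothetical, chosen only for this illustration.

```perl
use strict;
use warnings;

# Count the digits in a string using \d with a global match.
sub count_digits {
    my ($text) = @_;
    my $count = () = $text =~ /\d/g;    # list assignment in scalar context counts the matches
    return $count;
}

# True if the same three word characters occur twice in a row, e.g. ATGATG.
# (\w\w\w) captures three characters and \1 must match that same text again.
sub has_repeated_triplet {
    my ($seq) = @_;
    return $seq =~ /(\w\w\w)\1/ ? 1 : 0;
}

print count_digits('dt249 4'), " digits\n";
print "repeat found\n" if has_repeated_triplet('ccATGATGcc');
```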
Pattern matching: Quantifiers
Quantifier   Description
?            0 or 1 occurrence
+            1 or more occurrences
*            0 or more occurrences
{N}          Exactly N occurrences
{N,M}        Between N and M occurrences
{N,}         At least N occurrences
{0,M}        No more than M occurrences (older Perl versions do not treat {,M} as a quantifier)
Pattern matching: Quantifiers
• Consider the following pattern
– DT249 4 (your class code) consists of one or more word
characters, then a space and then a digit, so the match is:
• =~ /\w+\s\d/
• If the sequence has the following format:
– Pu-C-X(40-80)-Pu-C
• where Pu is [AG] and X is [ATGC]
– $sequence =~ /[AG]C[GATC]{40,80}[AG]C/;
Quantify.pl
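The two quantifier patterns above can be wrapped in small test subs. This is a sketch only: the sub names and the artificial spacer of repeated T characters are assumptions for illustration, and it is not the content of Quantify.pl.

```perl
use strict;
use warnings;

# True if the text contains something shaped like a class code such as
# "DT249 4": one or more word characters, one white space, then a digit.
sub is_class_code {
    my ($text) = @_;
    return $text =~ /\w+\s\d/ ? 1 : 0;
}

# True if the sequence contains Pu-C-X(40-80)-Pu-C.
sub has_motif {
    my ($sequence) = @_;
    return $sequence =~ /[AG]C[GATC]{40,80}[AG]C/ ? 1 : 0;
}

my $spacer = 'T' x 40;    # artificial 40-base spacer
print "class code ok\n" if is_class_code('DT249 4');
print "motif found\n"   if has_motif('AC' . $spacer . 'GC');
```

With a spacer shorter than 40 bases the {40,80} quantifier cannot be satisfied, so has_motif returns 0.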
Pattern Matching
• To determine where to look for a "pattern" in a sequence:
• Anchors
– The start of line anchor ^ (note: ^ only acts like the Boolean NOT operator when it
is the first character inside a character class, as in [^aeiou])
• /^>/ matches only lines beginning with >
– The end of line anchor $
• />$/ matches only lines where the last character is >
– /^$/ : what does this mean?
– The boundary anchor \b
• E.g. matching a word exactly:
• /\bword\b/ where \b is a word boundary: it matches "word" only as a whole word, and not
as part of a longer word such as "sword" or "wordy"
– The non-boundary anchor is \B
• /\Bworth\B/ matches inside words like unworthy, trustworthy... but not worthy or
worth
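A short sketch of the anchors in use, assuming input lines from a FASTA-like file. The sub names classify_line and has_whole_word and the category labels are made up for this example.

```perl
use strict;
use warnings;

# Classify one input line using anchored patterns.
sub classify_line {
    my ($line) = @_;
    chomp $line;
    return 'blank'      if $line =~ /^$/;             # start immediately followed by end
    return 'descriptor' if $line =~ /^>/;             # line begins with >
    return 'sequence'   if $line =~ /^[GATCN]+$/i;    # nucleotide codes from start to end
    return 'other';
}

# True if "word" appears as a whole word in the text.
sub has_whole_word {
    my ($text) = @_;
    return $text =~ /\bword\b/ ? 1 : 0;
}

print classify_line(">gi|171361\n"), "\n";
print classify_line("GATTACA\n"), "\n";
```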
Sequence Size example: modified
• File_size_2 example
    #!/usr/bin/perl
    # file size2.pl
    $length = 0; $lines = 0;
    while (<>) {
        chomp;
        # count only lines made up entirely of nucleotide codes
        $length = $length + length $_ if $_ =~ /^[GATCNgatcn]+$/;
        # Alternative: $length += length if /^[GATCN]+$/i;
        $lines = $lines + 1;
    }
    print "LENGTH = $length\n"; print "LINES = $lines\n";
• Refer to DNA sequence codes to see meaning of
A…N
Extracting Patterns
The second aspect of Perl pattern handling is pattern extraction.
Consider a sequence like
> M185580, clone 333a, complete sequence
– M18… is the sequence ID
– clone 333a, com… : optional comments
• We need to store some of the elements of the descriptor
line:
– $seq =~ /^>\s*(\w+)/; the part of the match inside the parentheses is extracted and put
into the variable $1
Extracting patterns
    #!/usr/bin/perl -w
    # demonstrates the effect of parentheses.
    while ( my $line = <> )
    {
        $line =~ /\w+ (\w+) \w+ (\w+)/;
        print "Second word: '$1' on line $..\n" if defined $1;
        print "Fourth word: '$2' on line $..\n" if defined $2;
    }
– Change it to catch the first and the third word of a sentence
• More examples in ExtractExample1.pl
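Captured groups can also be returned directly from the match. The following is a minimal sketch, assuming descriptor lines of the form shown earlier (ID, then a comma, then optional comments); the sub name parse_descriptor is hypothetical and this is not the content of ExtractExample1.pl.

```perl
use strict;
use warnings;

# Split a descriptor line into its ID and the optional comment text.
# Returns an empty list if the line does not start with ">".
sub parse_descriptor {
    my ($line) = @_;
    chomp $line;
    if ($line =~ /^>\s*([^\s,]+),?\s*(.*)$/) {
        return ($1, $2);    # $1 is the ID, $2 is whatever follows it
    }
    return;
}

my ($id, $comment) = parse_descriptor('> M185580, clone 333a, complete sequence');
print "ID: $id\nComment: $comment\n" if defined $id;
```

Testing the match in an if before using $1 and $2 avoids reading capture variables after a failed match.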
Search/replace and transliteration
• s/t/u/ replaces (t) thymine with (u) uracil; once only
• s/t/u/g (g = global) scans the whole string
• s/t/u/gi (global and case-insensitive)
– What about the following:
– s/^\s+//
– s/\s+$//
– s/\s+/ /g (where g stands for global)
• The transliteration search and replace function
– $seq =~ tr/ATGC/TACG/; gets the complement of a string of
characters. (the normal search and replace works in a
different way to the tr function)
• Refer to SearchReplace.pl
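The substitutions and the transliteration above can be sketched as small subs. The names transcribe, tidy and complement are made up for this illustration and are not taken from SearchReplace.pl.

```perl
use strict;
use warnings;

# Transcribe DNA to RNA: replace every T with U, keeping the letter case.
sub transcribe {
    my ($seq) = @_;
    $seq =~ s/T/U/g;
    $seq =~ s/t/u/g;
    return $seq;
}

# Remove leading and trailing white space, then squeeze each remaining
# run of white space down to a single space.
sub tidy {
    my ($text) = @_;
    $text =~ s/^\s+//;
    $text =~ s/\s+$//;
    $text =~ s/\s+/ /g;
    return $text;
}

# Complement a DNA string: tr maps each character once, in a single pass.
sub complement {
    my ($seq) = @_;
    $seq =~ tr/ATGC/TACG/;
    return $seq;
}

print transcribe('ATGTTT'), "\n";
print complement('ATGC'), "\n";
```

tr works character by character in one pass over the string, whereas each s///g is a separate pass, so a chain of substitutions can change letters that an earlier substitution has already replaced.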
Search/replace/extract
• Write a program that
• removes the > from the FASTA line descriptor
and assigns each element to appropriate
variables.
• Example Fastafile_replace.txt
>gi|171361, Saccharomyces cerevisiae, cystathionine gamma-lyase
GCAGCGCACGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCGCTTGCGAAAGCATCGAGTACC
GCTACAGAGCCAACCCGGTGGACAAACTCGAAGTCATTGTGGACCGAATGAGGCTCAATAACGAGATTAGCG
ACCTCGAAGGCCTGCGCAAATATTTCCACTCCTTCCCGGGTGCTCCTGAGTTGAACCCGCTTAGAGACTCCG
AAATCAACGACGACTTCCACCAGTGGGCCCAGTGTGACCGCCACACTGGACCCCATACCACTTCTTTTTGTT
ATTCTTAAATATGTTGTAACGCTATGTAATTCCACCCTTCATTACTAATAATTAGCCATTCACGTGATCTCA
GCCAGTTGTGGCGCCACACTTTTTTTTCCATAAAAATCCTCGAGGAAAAGAAAAGAAAAAAATATTTCAGTT
ATTTAAAGCATAAGATGCCAGGTAGATGGAACTTGTGCCGTGCCAGATTGAATTTTGAAAGTACAATTGAGG
CCTATACACATAGACATTTGCACCTTATACATATAC
Exercises
• Write a script that:
1. Confirms that the user has input the code in the
following format:
• Classcode_yearcode(papercode)
• E.g. dt249 4(w203c)
2. Many important DNA sequences have specific
patterns, e.g. TATA. Write a script to find the
position of this sequence in a FASTA file
sequence.
Exercises
3. Write a script that can find the reverse
complement of a DNA sequence without
using the tr function. (Hint: a global search
and replace will give an incorrect answer)
4. Coding regions begin with the AUG (ATG)
codon and end with a stop codon. Write a
Perl script that extracts a coding sequence
from a FASTA file.
Exercise
5. Modify the Sequence size example from
earlier to:
– Allow the user to input a file name and determine
its length.
Exam Questions
• Perl is an important bioinformatics language.
Explain the main features of Perl that make it
suitable for bioinformatics (10 marks)
• Write a Perl script that illustrates its pattern
matching, extraction and substitution abilities. (6
marks)
• (refer to assignment/previous papers' Perl
scripts)