Physical Mapping II

Download Report

Transcript Physical Mapping II

Physical Mapping II + Perl
CIS 667 March 2, 2004
Restriction Site Models
• Let each fragment in the Double Digest Problem
be represented by its length
 No measurement errors
 All fragments present
• Digesting the target DNA by the first enzyme
gives the multiset A = {a1, a2, …, an}
• The second enzyme gives B = {b1, b2, …, bn}
• Digestion with both gives O = {o1, o2, …, on}
Restriction Site Models
• We want to find a permutation A of the elements
of A and B of the elements of B
 Plot lengths A from on a line in the order of A
 Plot lengths B from on a line in the order of B on top
of previous plot
 Several new subintervals may be produced
 We need a one-to-one correspondence between each
resulting subinterval and each element of O
Restriction Site Models
• This problem is NP-complete
 It is a generalization of the set partition
problem
 The number of solutions is exponential
• Partial Digest problem has not been
proven to be NP-complete
 The number of solutions is much smaller than
for DDP
Interval Graph Models
• We model hybridization mapping using
interval graphs
 Much simpler than the real problem, but still
NP-complete
 Uses graphs
 Vertices represent clones
 Edges represent overlap information between
clones
First Interval Graph Model
• Uses two graphs
 Gr = (V, Er)
 (i, j)  Er means we know clones i, j overlap
 Gt = (V, Et)
 Et represents known and unknown overlap
information
 If we know for sure that two clones don’t overlap,
the corresponding edge is left out of the graph Gt
First Interval Graph Model
• Does there exist a graph Gs = (V, Es) such
that Er  Es  Et such that Gs is an interval
graph?
 An interval graph G = (V, E) is an undirected
graph obtained from a collection C of intervals
on the real line
 To each interval in C there corresponds a vertex in
G
 There is an edge between u and v only if their
intervals have a non-empty intersection
First Interval Graph Model
a
b
c
e
d
b
c
a
d
e
Non-Interval Graphs
b
c
a
e
d
b
c
a
d
e
Second Interval Graph Model
• Don’t assume that known overlap
information is reliable
 Construct a graph G = (V, E) using that
information
 Does there exist a graph G’ = (V, E’) such that
E’  E, G’ is an interval graph and |E’| is
maximum?
 We have discarded some false positives
 The solution is the interpretation that contains the
minimum number of false positives
Third Interval Graph Model
• Use overlap information along with
information about each clone
 Different clones come from different copies of
the same molecule
 Label each clone with the identification of the
molecule copy it came from
 Assume we had k copies of the target DNA
and different restriction enzymes were used to
break up each copy
Third Interval Graph Model
• Build a graph G = (V, E) with known overlap
information between clones
 Use k colors to color the vertices
 No edges between vertices of the same color since
they come from the same clone and hence cannot
overlap
 We say that such a graph has a valid coloring
 Does there exist graph G’ = (V’, E) such that , G’ is an interval
graph, and the coloring of G is valid for G’?
 I.e., Can we add edges to G transforming it into an interval
graph without violating the coloring?
Consecutive Ones Property
• We can apply the previous models in any
situation where we can obtain some type of
fingerprint for each fragment
 Now we use as a clone fingerprint the set of probes
that hybridize to it
 Assumptions
 Reverse complement of each probe’s sequence occurs only
once in the target DNA (“probes are unique”
 There are no errors
 All “clones X probes” hybridization experiments have been
done
Consecutive Ones Property
• If we have n clones and m probes we will build
an n  m binary matrix M, where each entry Mij
tells us whether probe j hybridized to clone i or
not
 Then obtaining a physical map from the matrix
becomes the problem of finding a permutation of the
columns (probes) such that all 1s in each row (clone)
are consecutive
 Such a matrix is said to have the consecutive 1s property for
rows (C1P)
Consecutive Ones Property
• There exist polynomial algorithms for the C1P
property
 Works only for data with no errors
 Realistic algorithms should try to find matrixes which
approximate the C1P property, while minimizing the
number of errors which must have been present to
lead to such a solution
 Allow 2 or 3 runs of 1s in a row
 Minimize the number of runs of 1s in the matrix
• Problem is now NP-hard
Now we will look at some
Perl in preparation for
assignment 1
Perl substitution operator
• Example of Perl substitution operator
substitute operator
REPLACEMENT
$RNA =~ s/T/U/g; text to replace PATTERN
Pattern modifier: g means
delimiter globally, throughout the
variable
string. Others:
binding operator
i case insensitive
m multiline
s single line
PATTERN regular expression
To be replaced by REPLACEMENT
Example 1
• Let’s use the substitution operator to
calculate the reverse complement of a
strand of DNA
Example 2
• One common task in bioinformatics is to look for
motifs, short segments of DNA or protein of
interest
 For example, regulatory elements of DNA
• Let’s see a program to
 Read in protein sequence data from a file
 Put all the sequence data into one string for easy
searching
 Look for motifs the user types in at the keyboard
Turning arrays into Scalars
• We often find sequence data broken into
short segments of 80 or so characters
 This is inconvenient for the Perl program
 Have to deal with motifs on more than one line
 Collapse an array into a scalar with join
 $protein = join( ‘’, @protein)
Regular expressions
• Regular expressions are ways of matching
one or more strings using special wildcardlike operators
 $protein =~ s/\s//g
 \s matches whitespace
 Can also be written [ \t\n\f\r]
 if ($motif =~ /^\s*$/ ) {
 ^ - beginning of line; $ - end of line
 * repeated zero or more times
Hashes
• There are three main data types in Perl:
scalar variables, arrays and hashes (also
called associative arrays)
 A hash provides a fast lookup of the value
associated with a key
 Initialized like this:
%classification =
‘dog’
=>
‘robin’
=>
‘asp’
=>
);
(
‘mammal’,
‘bird’
‘reptile’
Example 3
• Let’s look at the use of a hash by a
subroutine to translate a codon to an
amino acid using hash lookup
 codon2aa
Example 3
• The arguments to the subroutine are in the
@_ array
• Declare a local variable as a my variable
• my($dna) = @_;
Example 4
• We can use that subroutine to translate
DNA into protein
• Note the use of a module (library)
• Note the use of .= to concatenate