Physical Mapping II
Download
Report
Transcript Physical Mapping II
Physical Mapping II + Perl
CIS 667 March 2, 2004
Restriction Site Models
• Let each fragment in the Double Digest Problem
be represented by its length
No measurement errors
All fragments present
• Digesting the target DNA by the first enzyme
gives the multiset A = {a1, a2, …, an}
• The second enzyme gives B = {b1, b2, …, bn}
• Digestion with both gives O = {o1, o2, …, on}
Restriction Site Models
• We want to find a permutation A of the elements
of A and B of the elements of B
Plot lengths A from on a line in the order of A
Plot lengths B from on a line in the order of B on top
of previous plot
Several new subintervals may be produced
We need a one-to-one correspondence between each
resulting subinterval and each element of O
Restriction Site Models
• This problem is NP-complete
It is a generalization of the set partition
problem
The number of solutions is exponential
• Partial Digest problem has not been
proven to be NP-complete
The number of solutions is much smaller than
for DDP
Interval Graph Models
• We model hybridization mapping using
interval graphs
Much simpler than the real problem, but still
NP-complete
Uses graphs
Vertices represent clones
Edges represent overlap information between
clones
First Interval Graph Model
• Uses two graphs
Gr = (V, Er)
(i, j) Er means we know clones i, j overlap
Gt = (V, Et)
Et represents known and unknown overlap
information
If we know for sure that two clones don’t overlap,
the corresponding edge is left out of the graph Gt
First Interval Graph Model
• Does there exist a graph Gs = (V, Es) such
that Er Es Et such that Gs is an interval
graph?
An interval graph G = (V, E) is an undirected
graph obtained from a collection C of intervals
on the real line
To each interval in C there corresponds a vertex in
G
There is an edge between u and v only if their
intervals have a non-empty intersection
First Interval Graph Model
a
b
c
e
d
b
c
a
d
e
Non-Interval Graphs
b
c
a
e
d
b
c
a
d
e
Second Interval Graph Model
• Don’t assume that known overlap
information is reliable
Construct a graph G = (V, E) using that
information
Does there exist a graph G’ = (V, E’) such that
E’ E, G’ is an interval graph and |E’| is
maximum?
We have discarded some false positives
The solution is the interpretation that contains the
minimum number of false positives
Third Interval Graph Model
• Use overlap information along with
information about each clone
Different clones come from different copies of
the same molecule
Label each clone with the identification of the
molecule copy it came from
Assume we had k copies of the target DNA
and different restriction enzymes were used to
break up each copy
Third Interval Graph Model
• Build a graph G = (V, E) with known overlap
information between clones
Use k colors to color the vertices
No edges between vertices of the same color since
they come from the same clone and hence cannot
overlap
We say that such a graph has a valid coloring
Does there exist graph G’ = (V’, E) such that , G’ is an interval
graph, and the coloring of G is valid for G’?
I.e., Can we add edges to G transforming it into an interval
graph without violating the coloring?
Consecutive Ones Property
• We can apply the previous models in any
situation where we can obtain some type of
fingerprint for each fragment
Now we use as a clone fingerprint the set of probes
that hybridize to it
Assumptions
Reverse complement of each probe’s sequence occurs only
once in the target DNA (“probes are unique”
There are no errors
All “clones X probes” hybridization experiments have been
done
Consecutive Ones Property
• If we have n clones and m probes we will build
an n m binary matrix M, where each entry Mij
tells us whether probe j hybridized to clone i or
not
Then obtaining a physical map from the matrix
becomes the problem of finding a permutation of the
columns (probes) such that all 1s in each row (clone)
are consecutive
Such a matrix is said to have the consecutive 1s property for
rows (C1P)
Consecutive Ones Property
• There exist polynomial algorithms for the C1P
property
Works only for data with no errors
Realistic algorithms should try to find matrixes which
approximate the C1P property, while minimizing the
number of errors which must have been present to
lead to such a solution
Allow 2 or 3 runs of 1s in a row
Minimize the number of runs of 1s in the matrix
• Problem is now NP-hard
Now we will look at some
Perl in preparation for
assignment 1
Perl substitution operator
• Example of Perl substitution operator
substitute operator
REPLACEMENT
$RNA =~ s/T/U/g; text to replace PATTERN
Pattern modifier: g means
delimiter globally, throughout the
variable
string. Others:
binding operator
i case insensitive
m multiline
s single line
PATTERN regular expression
To be replaced by REPLACEMENT
Example 1
• Let’s use the substitution operator to
calculate the reverse complement of a
strand of DNA
Example 2
• One common task in bioinformatics is to look for
motifs, short segments of DNA or protein of
interest
For example, regulatory elements of DNA
• Let’s see a program to
Read in protein sequence data from a file
Put all the sequence data into one string for easy
searching
Look for motifs the user types in at the keyboard
Turning arrays into Scalars
• We often find sequence data broken into
short segments of 80 or so characters
This is inconvenient for the Perl program
Have to deal with motifs on more than one line
Collapse an array into a scalar with join
$protein = join( ‘’, @protein)
Regular expressions
• Regular expressions are ways of matching
one or more strings using special wildcardlike operators
$protein =~ s/\s//g
\s matches whitespace
Can also be written [ \t\n\f\r]
if ($motif =~ /^\s*$/ ) {
^ - beginning of line; $ - end of line
* repeated zero or more times
Hashes
• There are three main data types in Perl:
scalar variables, arrays and hashes (also
called associative arrays)
A hash provides a fast lookup of the value
associated with a key
Initialized like this:
%classification =
‘dog’
=>
‘robin’
=>
‘asp’
=>
);
(
‘mammal’,
‘bird’
‘reptile’
Example 3
• Let’s look at the use of a hash by a
subroutine to translate a codon to an
amino acid using hash lookup
codon2aa
Example 3
• The arguments to the subroutine are in the
@_ array
• Declare a local variable as a my variable
• my($dna) = @_;
Example 4
• We can use that subroutine to translate
DNA into protein
• Note the use of a module (library)
• Note the use of .= to concatenate