Introduction and Preliminaries - Department of Computer and
Download
Report
Transcript Introduction and Preliminaries - Department of Computer and
Computational Molecular Biology
Introduction and Preliminaries
Preliminaries in Computer Science
Strings and alphabet
Basic notations in graph theory
Algorithms and Complexity
My T. Thai
[email protected]
2
Strings
Consist of a sequence of letters:
DNA: four nucleotides A, C, G, T
Proteins: 20 symbol alphabet of animo acids
Given a string s, we have the following
notations:
Length: |s|
Substring: ACT is a substring of ATGACTG
Superstring: ATGACTG is a superstring of ACT
Index and interval: s[i] and s[i..j]
Prefix and suffix: s[1..j] and s[i..|s|]
My T. Thai
[email protected]
3
Graphs
G = (V, E) where V is a set of vertices and E is
a set of edges
Undirected graph: edges are undirected
Directed graph: edges are directed
Weighted graph G = (V, E, w) where each edge
has some weight
Some special graphs: complete graph, bipartite
graph, tree, and interval graph
Subgraph, spanning tree, steiner tree
My T. Thai
[email protected]
4
Interval Graphs
Intersection graph of a set of intervals on the
real line
A vertex represents an interval and an edge (u,
v) exists if intervals u and v intersect
My T. Thai
[email protected]
5
Some Problems in Graphs
Euler circuit: Given a graph, find a cycle that passes
through each edge exactly once
Hamiltonian circuit: Given a graph, find a cycle that
passes through each vertex exactly once
Minimum Spanning Tree: Given a weighted
undirected graph, find a spanning tree with minimum
total weight
Maximum Matching: Given an undirected graph, find
a maximum cardinality matching, which is a subset of
edges such that no two edges in the subset share an
endpoint
My T. Thai
[email protected]
6
P vs. NP
Class of P: Set of problems solvable by
polynomial-time algoirthms
Class of NP: Set of problems whose solutions,
once found, can be verified in polynomial time
NP-complete (NP-hard) problems: cannot
obtain an optimal solutions in polynomial time
My T. Thai
[email protected]
7
Some approaches for NP-complete
Problems
Special-case method: Work on the problem with a
restricted class of inputs
Exhaustive search: Design an exponential-time
algorithms that may perform well in practice
Approximation algorithms: Design a polynomialtime algorithm that is guaranteed to find near-optimal
solutions (with a good approximation ratio)
Heuristics: Fast algorithms that produce satisfactory
solutions most of the time but without guarantee
My T. Thai
[email protected]
8
Preliminaries in Molecular Biology
My T. Thai
[email protected]
9
DNA and Base Pairs
Double helix consisting of two
dual strands
Has four types of nucleotides:
Adenine, Thymine, Guanine,
Cytosine
Base Pairs: A↔T, C↔G
Two ends of a strand are
marked with 3’ and 5’
The entire DNA of a living
organism is called its genome
My T. Thai
[email protected]
10
DNA Sequences
My T. Thai
[email protected]
11
DNA Replication
Strands are separated
Each strand is replicated
using one of the parental
strands as a template
My T. Thai
[email protected]
12
Cell, Chromosome, and
DNA
My T. Thai
[email protected]
13
Cell Classification
My T. Thai
[email protected]
14
Chromosomes
Consists of a DNA molecule associated with proteins
that fold and pack the DNA thread into a more
compact structure and proteins required for the
process of gene expression, DNA replication and
DNA repair.
Human genome is distributed over 24 chromosomes
Each cell contains 46 chromosomes
22 pairs common to both males and females
2 sex chromosome X and Y in males and two Xs in
female
My T. Thai
[email protected]
15
Genes
Segments of DNA
Functional and physical
unit of heredity passed
from parent to offspring
Contain the information
for making a specific
protein
My T. Thai
[email protected]
16
Proteins
Shorts strings in the amino acid 20-letter
alphabet
Human genome: about 100,000 proteins, with
each protein a few hundred amino acids long
Bacteria make 500-1500 proteins
Made by genes (fragments of DNA) that are
roughly three times longer than the
corresponding proteins.
Why? Every 3 nucleotides in the DNA alphabet
code one letter in the protein alphabet of amino
acids
My T. Thai
[email protected]
17
Central Dogma of Molecular
Biology
My T. Thai
[email protected]
18
Transcription
My T. Thai
[email protected]
19
Translation
Translation
mRNA (after exported out of the nucleus and reaching the
cytosol) directs the synthesis of the protein by joining together
amino acids in the order encoded by the mRNA
Genetic code
Defines a mapping between codons and amino acid.
Codon
Triplet of nucleotides specifies a single amino acid in a
corresponding protein
64 codons and 20 amino acids
Translation is carried out by ribosomes
My T. Thai
[email protected]
20
Polymerase Chain Reaction (PCR)
Primer
Nucleic acid
strand
Serves as a
starting point of
DNA replication
My T. Thai
[email protected]
21
Plasmid Vector
Vector
an agent that can carry a DNA fragment into a host cell
Plasmid
Circular and doublestranded DNA
Antibiotic resistance
Automatic replication
Exists in bacteria
My T. Thai
[email protected]
22
DNA Cloning Using Plasmids as Vectors
(a) DNA recombination
(b) Transformation
My T. Thai
[email protected]
23
DNA Cloning Using Plasmids as Vectors (Cont)
(c) Selective amplification
(d) Isolation of desired DNA clones
My T. Thai
[email protected]
24
DNA Library Screening
Probe:
Labeled with radioisotope or
fluorescence
Used to detect specific DNA
sequences by hybridization
Hybridization:
Binding of two nucleic acid
chains by base paring
DNA Library Screening
To identify each clone whether it
contains a probe from a given set
of probes
Positive clone: contains a probe
My T. Thai
[email protected]
25
Some Computational Problems
Pooling Design
Non-unique probe selection
Sequence Alignment, Multi Sequence
Alignment
DNA sequencing
Genome Rearrangement
Protein Structure Prediction and Recognition
Protein-Protein Interactions
Functional Groups, Modules
My T. Thai
[email protected]
26
Pooling Designs
Problem Definition
Given a set of n clones with at most d positive
clones
Identify all positive clones with the minimum
number of tests
Pool: a subset of clones
Positive pool: a pool contains at least one positive
clone
My T. Thai
[email protected]
27
Pooling Designs
clones
pools
Mtxn =
c1
p1 0
p2 0
.
.
pi 0
.
.
pt 0
V(D)
c2
cj
cn
0 … 0 … 0 … 0 … 0
1 … 0 … 0 … 0 … 0
Testing
0 … 0 … 1 … 0 … 0
0 … 0 … 0 … 0 … 0
0
1
.
.
1
.
.
0
txn
tx1
M[i, j] = 1 iff the ith pool contains the jth clone
Decoding Algorithm:
Given M and V(D), identify all positive clones
My T. Thai
[email protected]
28
Challenges
Challenge 1: How to construct the binary
matrix M such that:
Outputs of any union of d columns are distinct
Challenge 2: How to design a decoding
algorithm with efficient time complexity [O(tn)]
My T. Thai
[email protected]
29
Probe Selection
Problem Definition:
Given a biological sample (e.g., blood) and a set of
probes
Identify the presence (or absence) of some
biological objects (e.g., viruses or bacteria) with the
minimum number of probes
My T. Thai
[email protected]
30
Unique Probes VS. Non-unique
Probes
Unique probes
Gene-specific probes or signature probes.
Difficult to find
Non-unique probes
Hybridize to more than one target.
Difficult to decode the results
My T. Thai
[email protected]
31
Probe-Target Matrix
12 probe candidates.
4 targets (genes).
For target set S, define P(S) as set of
probes reacting to any gene in S.
P({1, 2}) = {1, 2, 3, 4, 7, 8, 9, 10, 12}.
P({2, 3}) = {1, 3, 4, 5, 6, 7, 8, 9, 12}.
Symmetric set difference:
P({1, 2})∆P({2, 3}) = {2, 5, 6, 10}. Probes
that separate two sets.
My T. Thai
[email protected]
32
Sequence Alignment
Problem Definition:
Given: 2 DNA or protein sequences
Find: Best match between them
What is an Alignment:
Given: 2 Strings S and S’
Goal: The lengths of S and S’ are the same by inserting
spaces into these strings
A
--
T
C
--
A
--
C
T
C
A
A
My T. Thai
[email protected]
33
Matches, Mismatches and Indels
Match: two aligned, identical characters in an
alignment
Mismatch: two aligned, unequal characters
Indel: A character aligned with a space
A A C T A C T -- C C T A A C A C T -- --- -- C T C C T A C C T -- -- T A C T T T
10 matches, 2 mismatches, 7 indels
My T. Thai
[email protected]
34
Basic Algorithmic Problem
Find the alignment of the two strings that:
Max m where m = (# matches – mismatches –
indels)
m defines the similarity of the two strings, also
called Optimal Global Alignment
Biologically: a mismatch represents a mutation,
whereas an indel represents a historical
insertion or deletion of a single character
My T. Thai
[email protected]
35
Multiple Sequence Alignment
Problem Definition:
Similar to the sequence alignment problem but the
input has more than 2 strings
Challenges:
NP-hard
Guarantee factor: 2 – 2/k where k is the number of
the input sequences.
More work to reduce the time and space complexity
My T. Thai
[email protected]
36
DNA Sequencing
Problem Definition:
Given a set of fragments that are contained in a
DNA string S
Goal: Determine the string S
NP-complete
Further complicated due to the existence of
repetitive sequences in the genome
Can cast this as a Hamiltonian path or Euler
path problem (was introduced by Pavel
Pavzner)
My T. Thai
[email protected]
37
Genome Rearrangement
Problem Definition:
Given genomes of 2 different species
Goal: Find a sequence of evolutionary events that
turn the first genome to the second one.
Biological reasons: How close between these
species, how much evolution separate these
species.
E.g.: We usually test new drugs on mice before
humans. However, how close is a mouse to a
human?
My T. Thai
[email protected]
38
Genome Rearrangement
Can we use the solutions of sequence alignment
to solve this problem?
Answer: NO, because:
Genome is a very long strings (3 million letters for
a human genome
Model of sequence alignment is not appropriate for
human genome comparison since the differences
are not in terms of insertions/deletion/mutations of
a nucleotide, but a rearrangement of a long DNA
regions
The basic comparison is gene
My T. Thai
[email protected]
39
An Example
• If we compare these two strings by sequence alignment,
it’s impossible
• However, the second string is the first string after
reverse the fragment AATGGT…CCC.
My T. Thai
[email protected]
40
Main Evolutionary Events
Deletions: A fragment is removed
Duplications: create many copies of a fragment
and insert into different positions
Transpositions: A fragment is removed and reinserted into a different position
Inversions: A fragment is removed, reversed,
and then reinserted into the same position
Translocations: A pair of fragments are
exchanged between the ends of two
chromosomes
My T. Thai
[email protected]
41
My T. Thai
[email protected]
42
Protein Structure Prediction
Problem Definition:
Given: A sequence of amino acids
Goal: Predict the 3D structure of the protein
Some approaches:
Determine the position of a protein’s atoms so as to
minimize the total free energy
Find the similarities to some known proteins
My T. Thai
[email protected]
43
Community Structure
Problem Definition:
Given a graph G = (V, E) representing a network
Partition G into a set of subgraph (community structure) so
that nodes in each subgraph are highly connected
Biological reason: Genes with similar expression data
may have similar functions. Identify the community
structure can help us to reduce the number of tests
Others: Community structure is also studied in
different fields
My T. Thai
[email protected]
44