Introduction and Preliminaries - Department of Computer and

Download Report

Transcript Introduction and Preliminaries - Department of Computer and

Computational Molecular Biology
Introduction and Preliminaries
Preliminaries in Computer Science
 Strings and alphabet
 Basic notations in graph theory
 Algorithms and Complexity
My T. Thai
[email protected]
2
Strings
 Consist of a sequence of letters:
 DNA: four nucleotides A, C, G, T
 Proteins: 20 symbol alphabet of animo acids
 Given a string s, we have the following
notations:





Length: |s|
Substring: ACT is a substring of ATGACTG
Superstring: ATGACTG is a superstring of ACT
Index and interval: s[i] and s[i..j]
Prefix and suffix: s[1..j] and s[i..|s|]
My T. Thai
[email protected]
3
Graphs
 G = (V, E) where V is a set of vertices and E is
a set of edges
 Undirected graph: edges are undirected
 Directed graph: edges are directed
 Weighted graph G = (V, E, w) where each edge
has some weight
 Some special graphs: complete graph, bipartite
graph, tree, and interval graph
 Subgraph, spanning tree, steiner tree
My T. Thai
[email protected]
4
Interval Graphs
 Intersection graph of a set of intervals on the
real line
 A vertex represents an interval and an edge (u,
v) exists if intervals u and v intersect
My T. Thai
[email protected]
5
Some Problems in Graphs
 Euler circuit: Given a graph, find a cycle that passes
through each edge exactly once
 Hamiltonian circuit: Given a graph, find a cycle that
passes through each vertex exactly once
 Minimum Spanning Tree: Given a weighted
undirected graph, find a spanning tree with minimum
total weight
 Maximum Matching: Given an undirected graph, find
a maximum cardinality matching, which is a subset of
edges such that no two edges in the subset share an
endpoint
My T. Thai
[email protected]
6
P vs. NP
 Class of P: Set of problems solvable by
polynomial-time algoirthms
 Class of NP: Set of problems whose solutions,
once found, can be verified in polynomial time
 NP-complete (NP-hard) problems: cannot
obtain an optimal solutions in polynomial time
My T. Thai
[email protected]
7
Some approaches for NP-complete
Problems
 Special-case method: Work on the problem with a
restricted class of inputs
 Exhaustive search: Design an exponential-time
algorithms that may perform well in practice
 Approximation algorithms: Design a polynomialtime algorithm that is guaranteed to find near-optimal
solutions (with a good approximation ratio)
 Heuristics: Fast algorithms that produce satisfactory
solutions most of the time but without guarantee
My T. Thai
[email protected]
8
Preliminaries in Molecular Biology
My T. Thai
[email protected]
9
DNA and Base Pairs
 Double helix consisting of two
dual strands
 Has four types of nucleotides:
Adenine, Thymine, Guanine,
Cytosine
 Base Pairs: A↔T, C↔G
 Two ends of a strand are
marked with 3’ and 5’
 The entire DNA of a living
organism is called its genome
My T. Thai
[email protected]
10
DNA Sequences
My T. Thai
[email protected]
11
DNA Replication
 Strands are separated
 Each strand is replicated
using one of the parental
strands as a template
My T. Thai
[email protected]
12
Cell, Chromosome, and
DNA
My T. Thai
[email protected]
13
Cell Classification
My T. Thai
[email protected]
14
Chromosomes
 Consists of a DNA molecule associated with proteins
that fold and pack the DNA thread into a more
compact structure and proteins required for the
process of gene expression, DNA replication and
DNA repair.
 Human genome is distributed over 24 chromosomes
 Each cell contains 46 chromosomes
22 pairs common to both males and females
2 sex chromosome X and Y in males and two Xs in
female
My T. Thai
[email protected]
15
Genes
 Segments of DNA
 Functional and physical
unit of heredity passed
from parent to offspring
 Contain the information
for making a specific
protein
My T. Thai
[email protected]
16
Proteins
 Shorts strings in the amino acid 20-letter
alphabet
 Human genome: about 100,000 proteins, with
each protein a few hundred amino acids long
 Bacteria make 500-1500 proteins
 Made by genes (fragments of DNA) that are
roughly three times longer than the
corresponding proteins.
 Why? Every 3 nucleotides in the DNA alphabet
code one letter in the protein alphabet of amino
acids
My T. Thai
[email protected]
17
Central Dogma of Molecular
Biology
My T. Thai
[email protected]
18
Transcription
My T. Thai
[email protected]
19
Translation
 Translation
 mRNA (after exported out of the nucleus and reaching the
cytosol) directs the synthesis of the protein by joining together
amino acids in the order encoded by the mRNA
 Genetic code
 Defines a mapping between codons and amino acid.
 Codon
 Triplet of nucleotides specifies a single amino acid in a
corresponding protein
 64 codons and 20 amino acids
 Translation is carried out by ribosomes
My T. Thai
[email protected]
20
Polymerase Chain Reaction (PCR)
 Primer
 Nucleic acid
strand
 Serves as a
starting point of
DNA replication
My T. Thai
[email protected]
21
Plasmid Vector
Vector
 an agent that can carry a DNA fragment into a host cell
Plasmid
 Circular and doublestranded DNA
 Antibiotic resistance
 Automatic replication
 Exists in bacteria
My T. Thai
[email protected]
22
DNA Cloning Using Plasmids as Vectors
 (a) DNA recombination
 (b) Transformation
My T. Thai
[email protected]
23
DNA Cloning Using Plasmids as Vectors (Cont)
 (c) Selective amplification
 (d) Isolation of desired DNA clones
My T. Thai
[email protected]
24
DNA Library Screening
 Probe:
 Labeled with radioisotope or
fluorescence
 Used to detect specific DNA
sequences by hybridization
 Hybridization:
 Binding of two nucleic acid
chains by base paring
 DNA Library Screening
 To identify each clone whether it
contains a probe from a given set
of probes
 Positive clone: contains a probe
My T. Thai
[email protected]
25
Some Computational Problems
 Pooling Design
 Non-unique probe selection
 Sequence Alignment, Multi Sequence
Alignment
 DNA sequencing
 Genome Rearrangement
 Protein Structure Prediction and Recognition
 Protein-Protein Interactions
 Functional Groups, Modules
My T. Thai
[email protected]
26
Pooling Designs
 Problem Definition
 Given a set of n clones with at most d positive
clones
 Identify all positive clones with the minimum
number of tests
 Pool: a subset of clones
 Positive pool: a pool contains at least one positive
clone
My T. Thai
[email protected]
27
Pooling Designs
clones
pools
Mtxn =
c1
p1 0
p2 0
.
.
pi 0
.
.
pt 0
V(D)
c2
cj
cn
0 … 0 … 0 … 0 … 0
1 … 0 … 0 … 0 … 0
Testing
0 … 0 … 1 … 0 … 0
0 … 0 … 0 … 0 … 0
0
1
.
.
1
.
.
0
txn
tx1
M[i, j] = 1 iff the ith pool contains the jth clone
Decoding Algorithm:
Given M and V(D), identify all positive clones
My T. Thai
[email protected]
28
Challenges
 Challenge 1: How to construct the binary
matrix M such that:
 Outputs of any union of d columns are distinct
 Challenge 2: How to design a decoding
algorithm with efficient time complexity [O(tn)]
My T. Thai
[email protected]
29
Probe Selection
 Problem Definition:
 Given a biological sample (e.g., blood) and a set of
probes
 Identify the presence (or absence) of some
biological objects (e.g., viruses or bacteria) with the
minimum number of probes
My T. Thai
[email protected]
30
Unique Probes VS. Non-unique
Probes
 Unique probes
 Gene-specific probes or signature probes.
 Difficult to find
 Non-unique probes
 Hybridize to more than one target.
 Difficult to decode the results
My T. Thai
[email protected]
31
Probe-Target Matrix
 12 probe candidates.
 4 targets (genes).
 For target set S, define P(S) as set of
probes reacting to any gene in S.
 P({1, 2}) = {1, 2, 3, 4, 7, 8, 9, 10, 12}.
 P({2, 3}) = {1, 3, 4, 5, 6, 7, 8, 9, 12}.
 Symmetric set difference:
P({1, 2})∆P({2, 3}) = {2, 5, 6, 10}. Probes
that separate two sets.
My T. Thai
[email protected]
32
Sequence Alignment
 Problem Definition:
 Given: 2 DNA or protein sequences
 Find: Best match between them
 What is an Alignment:
 Given: 2 Strings S and S’
 Goal: The lengths of S and S’ are the same by inserting
spaces into these strings
A
--
T
C
--
A
--
C
T
C
A
A
My T. Thai
[email protected]
33
Matches, Mismatches and Indels
 Match: two aligned, identical characters in an
alignment
 Mismatch: two aligned, unequal characters
 Indel: A character aligned with a space
A A C T A C T -- C C T A A C A C T -- --- -- C T C C T A C C T -- -- T A C T T T
10 matches, 2 mismatches, 7 indels
My T. Thai
[email protected]
34
Basic Algorithmic Problem
 Find the alignment of the two strings that:
 Max m where m = (# matches – mismatches –
indels)
 m defines the similarity of the two strings, also
called Optimal Global Alignment
 Biologically: a mismatch represents a mutation,
whereas an indel represents a historical
insertion or deletion of a single character
My T. Thai
[email protected]
35
Multiple Sequence Alignment
 Problem Definition:
 Similar to the sequence alignment problem but the
input has more than 2 strings
 Challenges:
 NP-hard
 Guarantee factor: 2 – 2/k where k is the number of
the input sequences.
 More work to reduce the time and space complexity
My T. Thai
[email protected]
36
DNA Sequencing
 Problem Definition:
 Given a set of fragments that are contained in a
DNA string S
 Goal: Determine the string S
 NP-complete
 Further complicated due to the existence of
repetitive sequences in the genome
 Can cast this as a Hamiltonian path or Euler
path problem (was introduced by Pavel
Pavzner)
My T. Thai
[email protected]
37
Genome Rearrangement
 Problem Definition:
 Given genomes of 2 different species
 Goal: Find a sequence of evolutionary events that
turn the first genome to the second one.
 Biological reasons: How close between these
species, how much evolution separate these
species.
 E.g.: We usually test new drugs on mice before
humans. However, how close is a mouse to a
human?
My T. Thai
[email protected]
38
Genome Rearrangement
 Can we use the solutions of sequence alignment
to solve this problem?
 Answer: NO, because:
 Genome is a very long strings (3 million letters for
a human genome
 Model of sequence alignment is not appropriate for
human genome comparison since the differences
are not in terms of insertions/deletion/mutations of
a nucleotide, but a rearrangement of a long DNA
regions
 The basic comparison is gene
My T. Thai
[email protected]
39
An Example
• If we compare these two strings by sequence alignment,
it’s impossible
• However, the second string is the first string after
reverse the fragment AATGGT…CCC.
My T. Thai
[email protected]
40
Main Evolutionary Events
 Deletions: A fragment is removed
 Duplications: create many copies of a fragment
and insert into different positions
 Transpositions: A fragment is removed and reinserted into a different position
 Inversions: A fragment is removed, reversed,
and then reinserted into the same position
 Translocations: A pair of fragments are
exchanged between the ends of two
chromosomes
My T. Thai
[email protected]
41
My T. Thai
[email protected]
42
Protein Structure Prediction
 Problem Definition:
 Given: A sequence of amino acids
 Goal: Predict the 3D structure of the protein
 Some approaches:
 Determine the position of a protein’s atoms so as to
minimize the total free energy
 Find the similarities to some known proteins
My T. Thai
[email protected]
43
Community Structure
 Problem Definition:
 Given a graph G = (V, E) representing a network
 Partition G into a set of subgraph (community structure) so
that nodes in each subgraph are highly connected
 Biological reason: Genes with similar expression data
may have similar functions. Identify the community
structure can help us to reduce the number of tests
 Others: Community structure is also studied in
different fields
My T. Thai
[email protected]
44