Introduction

Download Report

Transcript Introduction

Chapter 1
Introduction
1- 1
Introduction – Gene(基因) History

1865 Mendel: The basic unit of inheritance
is a gene.



Mendel’s work was forgotten until 1900s.
1944 The gene was known to be made of
DNA (Deoxyribonucleic Acid).
1953 James Watson and Francis Crick :
Double helical structure of DNA.
(雙股螺旋)
1- 2
Introduction – Gene History (Cont.)




1990 The Human Genome Project (人類基
因體計畫 ) started.
1995 The first free-living organism to be
sequenced : haemophilus influenzae
(流行性感冒嗜血桿菌)
1998 CELERA joined the gene research.
2000 The human DNA sequence draft was
completed (published in 2001).
1- 3
Bioinformatics - 國內相關計畫


2000年國科會「生物資訊」跨領域研究
2001年國科會國家型研究計畫


基因體醫學國家型計畫
2001年國科會跨領域專題研究


工程處:資訊科技
生物處:生物資訊
1- 4
動物細胞(細胞核、細胞質、細胞膜)

DNA位於細胞核內之「核仁」
1- 5
DNA Double Helix (雙股螺旋)
1- 6
DNA Double Helix (雙股螺旋)
1- 7
DNA中核甘酸間之鍵結
1- 8
核甘酸


核甘酸(Nucleotide)為核酸分子構成單元
核甘酸包含:



五碳糖(去氧核糖, deoxyribose)
磷酸基(phosphate group)
含氮鹼基之一(A、G、C、T、U)
胞嘧啶 (C)
1- 9
DNA四種含氮鹼基
1- 10
DNA Double Helix (雙股螺旋)
1- 11
DNA Sequence
1- 12
DNA and RNA



Nucleotide (核甘酸):
腺嘌呤 (adenine, A)
鳥糞嘌呤(guanine, G)
胞嘧啶(cytosine, C)
胸腺嘧啶(thymine, T)
尿嘧啶(uracil, U)
DNA(deoxyribonucleic acid , 去氧核糖核酸)
{A, G, C, T} (base pair: GC, A=T )
RNA(ribonucleic acid, 核糖核酸)
{A, G, C, U} (base pair: GC, A=U, GU )
1- 13
DNA Length



The total length of the human DNA is about
3109 (30億) base pairs.
1% ~ 1.5% of DNA sequence is useful.
# of human genes: 30,000~40,000


Conclusion from the human genome project
Expected # is 100,000 originally.
1- 14
DNA Sequencing(定序)

Given DNA sequence:
TGCACTTGACGCATGCT
Cut the sequence after random A:
ATGCT length=5
ACGCATGCT length=9
AACGCATGCT length=10
ACTTGAACGCATGCT length=15
1- 15
DNA Sequencing

電泳法(eletrophoresis)
1- 16
DNA Sequencing
1- 17
Amino Acids (胺基酸)
胺基酸:蛋白質的基本單位,共20種
1- 18
General Structure of an Amino Acid
COO
Amino
Group
H3N +  C
Carboxyl
Group
H
 CH2
3 groups:
 CH2
Amino Group (胺基)
 CH2
Carboxyl Group (羧基)
 CH2
R Group (R 基團)
 NH3+
R Group
1- 19
Amino Acids (胺基酸)分子
1- 20
Amino Acids (胺基酸)分子
1- 21
Protein (蛋白質)分子
1- 22
Amino Acids and RNA
每三個核甘酸(codon,基因密碼)對應至一種胺基酸。
F
i
r
s
t
P
o
s
i
t
i
o
n
U
C
A
G
Second Position of Codon
U
C
A
G
UAU Tyr [Y]
UUU Phe [F] UCU Ser [S]
UGU Cys [C]
UAC Tyr [Y]
UUC Phe [F] UCC Ser [S]
UGC Cys [C]
UUA Leu [L] UCA Ser [S] UAA Ter [end] UGA Ter [end]
UUG Leu [L] UCG Ser [S] UAG Ter [end] UGG Trp [W]
CUU Leu [L] CCU Pro [P]
CAU His [H]
CGU Arg [R]
CUC Leu [L] CCC Pro [P]
CAC His [H]
CGC Arg [R]
CUA Leu [L] CCA Pro [P]
CAA Gln [Q]
CGA Arg [R]
CUG Leu [L] CCG Pro [P]
CAG Gln [Q]
CGG Arg [R]
AUU Ile [I]
ACU Thr [T]
AAU Asn [N]
AGU Ser [S]
AUC Ile [I]
ACC Thr [T]
AAC Asn [N]
AGC Ser [S]
AUA Ile [I]
ACA Thr [T]
AAA Lys [K]
AGA Arg [R]
AUG Met [M] ACG Thr [T]
AAG Lys [K]
AGG Arg [R]
GUU Val [V] GCU Ala [A] GAU Asp [D]
GGU Gly [G]
GUC Val [V] GCC Ala [A]
GAC Asp [D]
GGC Gly [G]
GUA Val [V] GCA Ala [A]
GAA Glu [E]
GGA Gly [G]
GUG Val [V] GCG Ala [A]
GAG Glu [E]
GGG Gly [G]
AUG is also the “start” codon.
U
C
A
G
U
C
A
G
U
C
A
G
U
C
A
G
T
h
i
r
d
P
o
s
i
t
i
o
n
1- 23
From DNA via RNA to Protein
1- 24
DNA, Genes and Proteins
DNA
Gene


Protein
DNA: program for cell processes
Proteins: execute cell processes
1- 25
Promoter(啟動子) and Gene
Transcriptional
Start Site
Transcriptional
Termination Site
TATA
TTG
ATG
TAG
intron
Upstream Promoter Downstream
exon
1- 26
Regulation (調控) of Genes
Transcription Factor
(Protein)
RNA polymerase
(Protein)
DNA
Regulatory Element
Gene
By Blanchette
1- 27
Regulation of Genes
Transcription Factor
(Protein)
RNA polymerase
DNA
Regulatory Element
Gene
By Blanchette
1- 28
Regulation of Genes
New protein
RNA
polymerase
Transcription Factor
DNA
Regulatory Element
Gene
By Blanchette
1- 29
From DNA via RNA to Protein
1- 30
From RNA to Protein
1- 31
From RNA to Protein
1- 32
Primary Structure (一級結構)
of Protein
牛的胰島素(一種蛋白質)之胺基酸序列
1- 33
Secondary Structure (二級結構)
of Protein
1- 34
Tertiary Structure (三級結構) of
Protein
血紅素分子三級結構
1- 35
Quaternary Structure (四級結構)
of Protein
血紅素分子四級結構
1- 36
Problems on Different Levels
1- 37
Some Problems in Bioinformatics

Sequence comparison





Fragment assembly of DNA sequences




Shortest common superstring
Physical mapping


Longest common subsequence
Edit distance
Similarity
Multiple sequence alignment
Double digest problem
Consecutive ones problem
Evolutionary trees
Molecular structure prediction

Protein folding
1- 38
Sequence Comparison

Goals:


Database search: Given a sequence S and a set
of sequences G, to find all the sequences in G,
which are similar to S.
Similarity: To find which parts of the sequences
are alike and which parts differ.
- Sequence alignment (global alignment)
- Local alignment
1- 39
Sequence Alignement

Global alignment

Local alignment
1- 40
Longest Common Subsequence(1)


To find a longest common subsequence
between two strings.
string1: TAGTCACG
string2: AGACTGTC
 LCS :
AGACG
Dynamic programming:
ci 1, j 1  1 if ai  b j

ci , j  max ci 1, j  0 if ai  b j

ci , j 1  0 if ai  b j
1- 41
Longest Common Subsequence(2)
S2
-
A G A C T G T C
-
0
0
0
0
0
0
0
0
0
T
0
0
0
0
0
1
1
1
1
A 0
1
1
1
1
1
1
1
1
G 0
1
2
2
2
2
2
2
2
T
0
1
2
2
2
3
3
3
3
C 0
1
2
2
3
3
3
3
4
A 0
1
2
3
3
3
3
3
4
C 0
1
2
3
4
4
4
4
4
G 0
1
2
3
4
4
5
5
5
S1
TAGTCACG
AGACTGTC
LCS:AGACG
1- 42
Edit Distance(1)

To find a smallest edit process between two strings.
S1: TAGTCAC G
S2 :
AG ACTGTC
Operation:
DMMDDMMIMII
ci 1, j 1  0

ci , j  min ci 1, j  dist (ai ,  )
c  dist ( , b )
j
 i , j 1
Match (ai  b j )
Delete
Insert
Suppose dist ( ai ,  )  dist ( , b j )  1.
1- 43
S1
S2
Edit Distance(2)
-
A
G
A
C
T
G
T
C
-
0
1
2
3
4
5
6
7
8
T
1
2
3
4
5
4
5
6
7
A
2
1
2
3
4
5
6
7
8
G
3
2
1
2
3
4
5
6
7
T
4
3
2
3
4
3
4
5
6
C
5
4
3
4
3
4
5
6
5
A
6
5
4
3
4
5
6
7
6
C
7
6
5
4
3
4
5
6
7
G
8
7
6
5
4
5
4
5
6
ci-1,j-1
ci-1,j
ci,j-1
ci,j
TAGTCAC G
AG ACTGTC
DMMDDMMIMII
1- 44
Similarity



Two sequences s1 and s2.
p is the match value if ai = bj, else it is the
mismatch value.
g is the gap penalty.
ci 1, j 1  p if ai  b j

ci , j  max ci 1, j  g if ai  b j

if ai  b j
ci , j 1  g
1- 45
Sequence Alignment
a = TAGTCACG
b = AGACTGTC

----TAGTCACG
AGACT-GTC--
TAGTCAC-G--AG--ACTGTC
Which one is better?
1- 46
Sequence Alignment Formula
c0,0 = 0
ci,0 =  i
c0,j =  j

ci 1, j 1  1


max ci 1, j  1
ci , j  max 


ci , j 1  1
c
 i 1, j 1  2
if ai  bj
if ai = bj
1- 47
Sequence Alignment Example
-
A G A C T G T C
0 -1 -2 -3 -4 -5 -6 -7 -8
T -1 -1 -2 -3 -4 -2 -3 -4 -5
A -2 1
0
0 -1 -2 -3 -4 -5
G -3 0
3
2
1
0
0 -1 -2
T -4 -1 2
2
1
3
2
2
1
C -5 -2 1
1
4
3
2
1
4
G -8 -5 -2 1
4
4
6
5
4
TAGTCAC-G-A -6 -3 0 3 3 3 2 1 3
-AG--ACTGTC
C -7 -4 -1 2 5 4 3 2 3
1- 48
Multiple Sequence Alignment
s1 = ATTCGAT
s2 = TTGAG
s3 = ATGCT
 alignment
s1 = ATTCGAT
s2 = -TT-GAG
s3 = AT--GCT
 If the number of sequences is k, and k is
large, how to solve the problem?
 NP-complete problem
1- 49
Multiple Sequence Alignment - SP

Sum-of-pairs
score =  scoring ( Si , S j )
i j
1- 50
Example of Sum-of-pairs Score
s1 = ATTCGAT
s2 = -TT-GAG
s3 = AT--GCT
For the alignment, the pairwise alignment scores are:
score(s1,s2) = 5
score(s2,s3) = 0
score(s1,s3) = 5
 SP score = 10
1- 51
Multiple Sequence Alignment - Star


Star alignment is an approximation system
of sum-of-pairs (SP) scoring system.
Star alignment score =  scoring ( ST , Si )
i T
1- 52
Example of Star Score
s1 = ATTCGAT
s2 = -TT-GAG
s3 = AT--GCT
For the alignment, the pairwise alignment scores are:
score(s1,s2) = 5
score(s2,s3) = 0
score(s1,s3) = 5
 Star score = max{score(s1,s2)+score(s1,s3),
score(s2,s1)+score(s2,s3),
score(s3,s1)+score(s3,s2)}
= max{5+5, 5+0, 5+0} = 10
1- 53
Multiple Sequence Alignment - Tree

Tree
score =
 scoring (S  , S
i
i, j
j
)   scoring ( S k , Sl )
k ,l
where Si and Sj are adjacent, Sk and Sl are
adjacent.
1- 54
Fragment Assembly

Depending on experimental factors:


Fragment length can be as low as 200 or high
as 700.
Typical problems involve target sequences
30,000 to 100,000 base-pairs long, and total
number of fragments is in the range 500 to
2000.
1- 55
Shortest Common Superstring

Given a set of k strings P ={s1,s2…,sk}, to
find a shortest superstring s containing
every string in P as a substring. That is, |s| is
the minimal.
ACCGT
CGTGC
TTAC
TACCGT

--ACCGT-----CGTGC
TTAC-----TACCGT
--------------TTACCGTGC
NP-complete problem
1- 56
Physical Mapping


Given a (0,1) matrix of probes versus clones,
to reconstruct the relative places of clones
or probes.
NP-complete problem
1- 57
Consecutive Ones Problem(1)

1- 58
Consecutive Ones Problem(2)




Consider a (0,1) matrix M, with rows indexed by
clones and columns by probes, and position (i, j) is
1 if clone i contains probe j.
The problem is to permute the columns so that the
ones in each row are consecutive.
A (0, 1) matrix has the k-consecutive ones
property (k-C1P) if there exists a column order
such that in each row the occurrences of all ones
appear in at most k consecutive blocks.
The k-consecutive ones Problem:


Does a given (0, 1) matrix have the k-consecutive ones
property?
NP-complete, for k 2
1- 59
Double Digest Problem(1)
enzyme a = |A| = {1, 3, 3, 12}
enzyme b = |B| = {1, 2, 3, 3, 4, 6}
c = |A  B| = |C| = {1, 1, 1, 1, 2, 2, 2, 3, 6}
1- 60
Double Digest Problem(2)


Given the lengths of fragments, |Xi  Xj|, 1  i  j 
n, obtained by applying either one of the two
restriction enzymes A and B, or both, to determine
the order of these fragments.
a = |A| = {ai: 1  i  n} from the first digest
b = |B| = {bi: 1  i  m} from the second
digest.
c = |A  B| = |C| = {ci: 1  i  l} from first
and second digests.
 ai   bi   ci
1i  n
1i  m
1i l
1- 61
Evolutionary Trees(1)
siamang gibbon orangutan
(合趾猴)(長臂猿) (猩猩)
gorilla
(大猩猩)
human chimpanzee
(人類) (黑猩猩)
1- 62
Evolutionary Trees(2)



Genome sequences: Given genomes of
several organisms, to build an evolutionary
tree in which the number of mutations
(changes) is minimal.
Character matrix: Given a (0, 1) character
state matrix of several organisms, to build a
perfect evolutionary tree.
Distance matrix: Given a distance matrix of
several organisms, to build a tree satisfying
the distances between all organisms.
1- 63
Perfect Phylogeny(1)
1- 64
Protein Structure
1- 65
Protein Folding


Given the primary structure of a protein, to
compute or evaluate its 3-dimensional
structure.
Primary structure (sequence):
1- 66
Protein Folding Problem

The characteristic of each amino:



H (hydrophobic, non-polar)
(hating water, 疏水性)
P (hydrophilic, polar)
(loving water, 親水性)
The amino acid sequence of a protein can be
viewed as a binary sequence of H’s (1’s) and
P’s (0’s).
1- 67
Example of H-P Model

Input sequence: 011001001110010
0
1
1
0
0
1
1
0
0
1
1
0
0
1
1
0
1
1
0
1
1
1
0
0
0
0
0
0
1
0
Score = 3
Score = 5
1- 68
Protein folding on H-P Model


The protein folding on H-P model: Given a
sequence of 1’s (H’s) and 0’s (P’s), to find a
self-avoiding paths embedded in either a 2D
or 3D lattice such that the number of pairs
of adjacent 1’s is maximized.
NP-complete even for 2D lattice.
1- 69
RNA Secondary Structure
Prediction (1)


RNA: {A, G, C, U}
Base pairs:
GC (Watson-Crick base pair)
A=U (Watson-Crick base pair)
GU (Wobble base pair)

(a,b) is defined as 1 if a and b can form a
base pair; otherwise it is 0.
1- 70
RNA Secondary Structure
Prediction (2)
1- 71
RNA Secondary Structure
Prediction (3)


X(i,j) is maximum number of base pairs in
the sequence aiai+1…aj, i  j.
Dynamic Programming:
X(i,j) = 0 if | j  i |  1.
 X (i  1, j  1)
X (i, j )  max 
max{[ X (i, k  1)  1  X (k  1, j )]  (ak , a j 1 )}
 i  k  j 1.
1- 72
Reference - Books




Algorithms on strings, trees, and sequences :
computer science and computational biology,
Dan Gusfield, Cambridge University Press, 1997.
Introduction to computational molecular biology,
Joao Carlos Setubal and Joao Meidanis, PWS Pub.,
1997.
Introduction to computational biology: maps,
sequences and genomes, Michael S. Waterman,
Champman & Hall, 1995.
Manuscript of Prof. R. C. T. Lee
http://www.csie.ncnu.edu.tw/~rctlee/biology.html
1- 73
Reference – Books (Biology)

生物學 C. Starr & R. Taggart 原著
丁澤民 王偉 張世衿 連慧瑞 編譯

現代分子生物學
朱玉賢 李毅 編著
藝軒出版社

分子生物學入門

駒野徹、酒井裕 合著
何士慶 譯
科技圖書
DNA 圖解小百科
(名詞解釋)
威惹利、培瑞、李哈 合著
潘震澤 譯
新新聞文化公司
1- 74
Reference – Journals (1)









Bioinfomatics (SCI)
Bulletin of Mathematical Biology (SCI)
Computer Applications in the Biosciences
Journal of Computational Biology (SCI expanded)
Journal of Mathematical Biology (SCI)
Journal of Molecular Biology (SCI)
Nucleic Acids Research (SCI)
Gene (SCI)
Science (SCI)
1- 75
Reference – Journals (2)








Genome Research (SCI)
PROTEINS: Structure, Function, and
Bioinformatics (SCI)
Gene (SCI)
Current Opinion in Structural Biology (SCI)
Protein-Structure Function and Bioinfomatics
(SCI)
BMC Bioinformatics (SCI Expanded)
Computational Biology and Chemistry (SCI)
BioSystems (SCI)
1- 76
Reference – Web Sites






C. B. Yang
http://par.cse.nsysu.edu.tw
Bioweb link of C.B.Yang
BioWeb http://bioweb.uwlax.edu/
MIT Biology Hypertextbook
http://esgwww.mit.edu:8001/esgbio/
Bioinformatics Related Journals
http://www.iscb.org/journals.html
NCBI (National Center for Biotechnology
Information
http://www.ncbi.nlm.nih.gov/
1- 77
Conclusion (1)
Bioinformatics and Computer Science




Algorithm: all computing problems.
Image processing: 3D images of RNA folds
or protein.
Database: massive database and retrieval.
Distributed system and parallel processing:
massive storage and accelerating
computation.
1- 78
Conclusion (2)

Biology easily has 500 years of exciting
problems to work on.
-- Donald E. Knuth
1- 79