Introduction to Bioinformatics and Databases


Todd D. Taylor, Ph.D.
Genome Annotation and Comparative Analysis Team
Computational and Experimental Systems Biology Group
RIKEN Genomic Sciences Center
Bioinformatics and Comparative Genome Analysis Course
Institut Pasteur Tunis - Tunisia
April 2, 2007

Human
 Chromosome 21 (Nature, May 2000)
 17 of 33.5 Mb
 Chromosome 18p (Nature, September 2005)
 16 Mb
 Chromosome 11q (Nature, March 2006)
 81 Mb
 ~4-5 % contribution to the Human Genome Project



Chimpanzee
 Chromosome 22q (Nature, May 2004)
 33.5 Mb (syntenic to human chr21)
 Chromosome Y (Nature Genetics, January 2006)
Development of novel methods for gene and promoter prediction
 Identifying genes missed by other high-throughput methods
Identification of unique regulatory mechanisms



Looking for similarities
 Compare with distant species, like mouse
 Regions that are conserved may be important
Looking for differences
 Compare with close species, like primates
 Regions that are different may be important
Of course, there are exceptions to every rule!
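
The two comparison strategies above can be illustrated with a sliding-window identity scan over a pairwise alignment. This is only a minimal sketch and not the method used in the talk: the function names, window size, thresholds and toy sequences are all assumptions made for illustration.

```python
def window_identity(seq_a, seq_b, window=10, step=5):
    """Percent identity of two aligned, equal-length sequences in sliding windows.

    Alignment columns containing a gap ('-') in either sequence are ignored.
    Returns a list of (window_start, percent_identity) tuples.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(0, len(seq_a) - window + 1, step):
        columns = [
            (a, b)
            for a, b in zip(seq_a[start:start + window], seq_b[start:start + window])
            if a != "-" and b != "-"
        ]
        if not columns:
            continue
        matches = sum(a == b for a, b in columns)
        results.append((start, 100.0 * matches / len(columns)))
    return results


def conserved_windows(windows, min_identity):
    """Distant-species comparison: keep windows that stayed similar."""
    return [w for w in windows if w[1] >= min_identity]


def divergent_windows(windows, max_identity):
    """Close-species comparison: keep windows that became different."""
    return [w for w in windows if w[1] < max_identity]


# Toy aligned sequences (hypothetical, not real genomic data).
human = "ACGTACGTACGTACGTACGT"
other = "ACGTACGTACTTACGAACGT"

windows = window_identity(human, other, window=10, step=10)
print(windows)
print(conserved_windows(windows, min_identity=90.0))
print(divergent_windows(windows, max_identity=90.0))
```

The same scan serves both strategies; only the filter changes. Sensible thresholds depend on the evolutionary distance between the species compared.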
[Figure: phylogenetic tree placing Homo within Mammalia. Nested clades, from most to least inclusive: Amniota (amniotes), Mammalia, Eutheria (placentalia), Primates, Anthropoidea, Catarrhini, Hominoidea, Hominidae. Lineages shown: Sauropsida (Reptilia + Aves), Prototheria, Metatheria, Lagomorpha, Rodents, Prosimians, New world monkeys, Old world monkeys, Gibbons, Pongo, Gorilla, Pan, Homo. Time marks: ~350 MYa, ~250 MYa, 5 MYa.]
Characters listed on the tree:
 Heterodonty
 Mammary glands
 Homoeothermic
 Hair
 Placentation (in most), amnion, internal fertilization
 Sweat and sebaceous glands
 Anucleate red blood cells

34% maps to identical sequence in human genome
Hiram Clawson and Kate Rosenbloom (UCSC). 09 June 2006

95% maps to identical sequence in human genome
Hiram Clawson and Kate Rosenbloom (UCSC). 09 June 2006
Nobrega, et al. Science 302, 413 (2003)





Size
Intelligence
Language
Ageing
Disease susceptibility
 Cancer
 Schizophrenia
 Autism
 Triplet expansion diseases
 AIDS
 Hepatitis
Newton, April 2002 issue
Science 295, 131-134 (2002)
1.23% substitution

Number of simple repetitive sequences

Insertion of Alu and L1 elements

Unique sequences
Local duplications
Translocations
Inversions
Fewer CpG Islands predicted in chimp




 Compare with small ‘representative’ human chromosome (21)
 Clone-based sequencing strategy
 Map chimp BAC-end sequences to human chr. 21
 Screen libraries for additional clones to fill gap regions
3 gaps, over 99% coverage
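
A coverage figure like the one above can be computed once clone positions on the reference chromosome are known. The sketch below is illustrative only: the clone coordinates are made up, the function names are hypothetical, and half-open `(start, end)` coordinates are an assumption.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(m) for m in merged]


def coverage_and_gaps(clones, region_start, region_end):
    """Return (fraction of the region covered, list of uncovered gaps)."""
    covered = 0
    gaps = []
    cursor = region_start
    for start, end in merge_intervals(clones):
        # Clip each merged block to the region of interest.
        start = max(start, region_start)
        end = min(end, region_end)
        if start >= end:
            continue
        if start > cursor:
            gaps.append((cursor, start))
        covered += end - start
        cursor = end
    if cursor < region_end:
        gaps.append((cursor, region_end))
    return covered / (region_end - region_start), gaps


# Hypothetical clone placements on a 1,000 bp toy region.
clones = [(0, 150), (100, 300), (320, 500), (480, 700), (710, 1000)]
fraction, gaps = coverage_and_gaps(clones, 0, 1000)
print(f"coverage: {fraction:.1%}, gaps: {gaps}")
```

Each reported gap is a candidate region for screening libraries for additional clones.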
[Figure: sequence identity (85-100%) between the chimp chr22 q-arm and the human chr21 q-arm, shown at 5 Mb and 1 Mb scales.]
[Figure: base change, insertion size (bp) and insertion frequency plotted along HSA21q and PTR22q. x-axis: position (Mb), 0-32. y-axes: base changes or insertion size per bp (0.00-0.25) and insertion frequency per bp (0.0000-0.0050).]
Chimpanzee Sequencing & Analysis Consortium. Nature (2005) 437:69-87
Overall          1.44%
SINE/Alu         1.81%
LINE/L1          1.38%
CpG islands      2.26%
Simple repeats   4.06%

                      Base change   Insertion frequency
Base change           1.000         -
Insertion frequency   0.907         1.000
Insertion size        0.051         0.013
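
The values with 1.000 on the diagonal read like pairwise correlation coefficients between per-window measurements. Assuming they are Pearson correlations (the slide does not say which coefficient was used), a matrix of this kind could be computed as follows. The per-window numbers below are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-window measurements along a chromosome arm.
measures = {
    "base change": [0.012, 0.015, 0.011, 0.018, 0.014],
    "insertion frequency": [0.0010, 0.0013, 0.0009, 0.0016, 0.0012],
    "insertion size": [0.05, 0.02, 0.09, 0.03, 0.04],
}

names = list(measures)
for i, row in enumerate(names):
    # Print the lower triangle only, as on the slide.
    cells = [f"{pearson(measures[row], measures[col]):.3f}" for col in names[: i + 1]]
    print(f"{row:20s}", *cells)
```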
                                     HSA21q       PTR22q
Size (bp)                            33,102,702   32,799,845
# of sequence gaps                   14           22
Estimated total size of clone gaps   73,108       74,311
# of clone gaps                      3            2
Base content
 G+C%                                40.94%       41.01%
 CG dinucleotide                     361,259      358,450
 CpG islands                         950          885
Repeats
                        HSA21q                         PTR22q
                        #        ID#      bp           #        ID#      bp
SINEs                   15,574   15,131   3,647,427    15,481   9,551    3,614,185
 Young Alus *1          75       75       21,798       12       12       3,122
LINEs                   13,758   8,731    5,848,427    13,671   6,223    5,737,082
 Young L1s *2           59       52       92,171       64       53       78,653
LTRs                    9,975    7,269    3,612,930    9,838    5,324    3,551,044
DNA elements            4,169    3,363    949,215      4,187    2,887    943,348
RNAs                    98       97       8,625        99       98       8,672
Satellite               23       20       17,246       20       17       14,773
Others                  41       38       30,452       49       42       34,852
Total                   43,638   34,649   14,114,322   43,345   24,142   13,903,956
Total (% of sequence)                     42.6%                          42.4%
*1 AluYa5, AluYa8, AluYb8 and AluYb9
*2 L1HS and L1PA2
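
As a quick arithmetic check, the 42.6% and 42.4% figures are consistent with dividing the total repeat bp by the region sizes given on the slide. The pairing of each size with each total is an assumption read off the slide layout, and the names below are hypothetical.

```python
# Values taken from the slide; assigning them to HSA21q / PTR22q follows
# the order in which they appear and is an assumption.
REGIONS = {
    "HSA21q": {"size_bp": 33_102_702, "repeat_bp": 14_114_322},
    "PTR22q": {"size_bp": 32_799_845, "repeat_bp": 13_903_956},
}


def repeat_percent(size_bp, repeat_bp):
    """Repeat content as a percentage of the region size."""
    return 100.0 * repeat_bp / size_bp


for name, values in REGIONS.items():
    print(f"{name}: {repeat_percent(**values):.1f}% repeats")
```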
Family      Subfamily     HS21   PTR22
LINE/L1     L1HS          11     2
LTR/ERV1    HERVIP10FH    14     5
            MER41A-int    10     2
            MER4A1-int    5      0
            MER83B-int    11     0
            MER87         32     12
SINE/Alu    AluYa5        23     3
            AluYb8        37     2
            AluYb9        7      1
DNA/MER2    Tigger3       42     67
LTR/ERV1    LTR49-int     11     23
LTR/MaLR    MLT1E-int     0      5
Human-specific characteristics have been acquired during the
5 million years since the divergence between Pan and Homo.
[Figure: phylogeny of Hominidae along a time axis: Pongo (Orangutan), Gorilla, Pan (Chimpanzee) and Homo (Human), with the Pan-Homo split marked at 5-6 MYa.]
[Figure: for a difference between Chimpanzee and Human, Gorilla and Pongo (Orangutan) serve as outgroups to infer the state in the LCA and so decide on which lineage the change occurred.]
(LCA: The Last Common Ancestor)
ACGTGTTTGAAATATTACTGATTGTAA
ACGAGTTTGAAATATTATTGATTGTAA
ACGTGTTTGAATCATTATTGATTGTAA
ACGTGTTTAAATTATTATTGGTTGCAA
ACGTGTTTGAAATATTATTGATTGTAA
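
The outgroup reasoning can be sketched as a simple parsimony rule: where human and chimp differ, the base shared with the outgroups is taken as ancestral, and the change is assigned to the other lineage. This is an illustrative sketch with hypothetical names. Because the species order of the aligned sequences above is not stated on the slide, the example uses short made-up sequences instead.

```python
def assign_lineage(human, chimp, outgroups):
    """Assign each human/chimp difference to a lineage using outgroups.

    human, chimp: aligned sequences of equal length.
    outgroups: list of aligned outgroup sequences (e.g. gorilla, orangutan).
    Returns a list of (position, call), where call is 'human', 'chimp'
    or 'ambiguous'.
    """
    calls = []
    for i, (h, c) in enumerate(zip(human, chimp)):
        if h == c:
            continue
        outgroup_bases = {seq[i] for seq in outgroups}
        if outgroup_bases == {c}:
            # Outgroups agree with chimp, so the change is on the human lineage.
            calls.append((i, "human"))
        elif outgroup_bases == {h}:
            # Outgroups agree with human, so the change is on the chimp lineage.
            calls.append((i, "chimp"))
        else:
            calls.append((i, "ambiguous"))
    return calls


# Toy sequences (hypothetical).
human = "ACGTGA"
chimp = "ACACGT"
gorilla = "ACATGA"
orangutan = "ACATGG"

print(assign_lineage(human, chimp, [gorilla, orangutan]))
```

Sites where the outgroups disagree with each other, or match neither species, are left unresolved rather than forced onto one lineage.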
[Figure: the HSA21q plot of base changes and insertion size/frequency per bp, repeated.]
Species compared: Human, Chimpanzee, Gorilla, Orangutan
IN/DEL examination based on 10,292,002 finished sequences (RIKEN)
                                  total   PCR primers designable   good amplification for both*
insertion to the human sequence   267     158                      139
insertion to the chimp sequence   222     147                      128
total                             489     305                      267
* positive amplification found for both chimp and human template DNA
[Figure: PCR assays of four indels. Lanes: Pt (chimpanzee), Hs (human), Gg (gorilla), Pp (orangutan). Numeric size labels on the gels range from 106 to 4200.]
Example 1: Deletion in Human Lineage
Example 2: Insertion in Human Lineage
Example 3: Deletion in Chimp Lineage
Example 4: Allelic Deletion in Chimp Lineage

284 genes
 223 known
 19 novel CDS
 25 novel transcripts
 12 putative
 5 predicted
85 pseudogenes

We lacked information for 6 genes located in sequencing gaps
6 hsa21 genes are absent from the ptr22 sequence (H2BFS, 5 KAP genes from the 21q22.1 cluster)
4 hsa21 genes appear to be pseudogenes in chimp
3 ptr22 pseudogenes are absent from the hsa21 sequence
1 hsa21 pseudogene has a complete ORF in ptr22

83% of genes have at least one amino acid replacement
10% of the potential ptr22 proteins are predicted to have a different length
 Amino acid insertion or deletion
 Different start codon
 Different stop codon
 Other, more complex rearrangement
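
Comparisons of this kind can be sketched by translating the orthologous coding sequences and comparing the resulting proteins. The code below is a minimal illustration using the standard genetic code. The function names and toy sequences are hypothetical, and real analyses would start from curated gene models and alignments.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds):
    """Translate a coding sequence up to (not including) the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def compare_proteins(human_cds, chimp_cds):
    """Report protein lengths and, when they match, the number of replacements."""
    human_protein = translate(human_cds)
    chimp_protein = translate(chimp_cds)
    same_length = len(human_protein) == len(chimp_protein)
    replacements = None
    if same_length:
        replacements = sum(h != c for h, c in zip(human_protein, chimp_protein))
    return {
        "human_length": len(human_protein),
        "chimp_length": len(chimp_protein),
        "same_length": same_length,
        "replacements": replacements,
    }


# Toy coding sequences (hypothetical).
human_cds = "ATGGCTGAAAAATAA"      # M A E K stop
chimp_replaced = "ATGGCTGATAAATAA"  # M A D K stop: one replacement
chimp_truncated = "ATGGCTTAAAAATAA"  # M A stop: earlier stop codon

print(compare_proteins(human_cds, chimp_replaced))
print(compare_proteins(human_cds, chimp_truncated))
```

When lengths differ, replacements are not counted here, since that would first require aligning the two proteins.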

Shorter in chimp: ADAMTS5
Longer in chimp: C21orf30
• 17 bp deletion in chimpanzee
• Human and chimpanzee splice sites are different
• Splice-site diversity
[Figure: sequence identity of genes, with the human chr21 genes ordered according to their chromosomal position. Labeled genes: FLJ32835, C21orf9, C21orf71, TCP10L, C21orf96.]
[Figure: human-specific and chimp-specific replacements plotted along the chromosome (x-axis: 0-34; y-axis: 0.0-2.5), with two numbered gene lists.]
Numbered genes, first list:
1. KIAA0184
2. COL6A2
3. HUNK
4. AGPAT3
5. DSCR3
6. PWP2H
7. STCH
8. SLC5A3
9. CHAF1B
10. SIM2
11. KCNE2
12. APP
13. C21orf98
14. C21orf61
15. IFNAR1
16. UBASH3A
17. TMPRSS3
18. DSCR1
19. C21orf7
20. ADARB1
21. TSGA2
22. IFNAR2
23. C21orf63
24. KCNE1
25. C21orf2
26. C21orf55
27. ATP5A
28. CLDN8
29. C21orf56
30. DNMTA1
Numbered genes, second list:
1. BACE2
2. TIAM1
3. BACH1
4. FAM3B
5. C21orf33
6. ADAMTS1
7. C21orf103
8. ITGB2
9. HLCS
10. DNMT3L
11. IFNGR2
12. PPIA3L
13. C21orf59
14. MRPL39
15. CLDN17
16. KRTAP11-1
17. CCT8
18. DSCR2
19. TFF2
20. BTG3
21. HSF2BP
22. C21orf115
[Figure: distribution of Ka/Ks across genes. x-axis: Ka / Ks, 0 to >1.1; y-axis: gene frequency, 0-35%.]
Chimpanzee Sequencing & Analysis Consortium. Nature (2005) 437:69-87
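
Ka and Ks are the nonsynonymous and synonymous substitution rates per site. A ratio well below 1 is usually read as purifying selection, and a ratio above 1 as a hint of positive selection. The sketch below is a simplified Nei-Gojobori style estimate and not the consortium's pipeline. It assumes the two coding sequences are already aligned in frame, contain no gaps or stop codons, and it counts mutations to stop codons as nonsynonymous when tallying sites. All names and toy sequences are hypothetical.

```python
import math
from itertools import permutations

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def synonymous_sites(codon):
    """Expected number of synonymous sites in one codon (0 to 3)."""
    total = 0.0
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                    total += 1.0 / 3.0
    return total


def codon_differences(c1, c2):
    """Average (synonymous, nonsynonymous) changes over all mutation orders."""
    positions = [i for i in range(3) if c1[i] != c2[i]]
    if not positions:
        return 0.0, 0.0
    syn = non = 0.0
    paths = 0
    for order in permutations(positions):
        current, s, n, valid = c1, 0, 0, True
        for pos in order:
            nxt = current[:pos] + c2[pos] + current[pos + 1:]
            if CODON_TABLE[nxt] == "*":  # skip pathways through stop codons
                valid = False
                break
            if CODON_TABLE[nxt] == CODON_TABLE[current]:
                s += 1
            else:
                n += 1
            current = nxt
        if valid:
            syn, non, paths = syn + s, non + n, paths + 1
    if paths == 0:
        return 0.0, float(len(positions))
    return syn / paths, non / paths


def jukes_cantor(p):
    """Jukes-Cantor correction of a proportion of differences."""
    if p == 0:
        return 0.0
    if p >= 0.75:
        return float("inf")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def ka_ks(seq1, seq2):
    """Return (Ka, Ks) for two in-frame, gap-free coding sequences."""
    if len(seq1) != len(seq2) or len(seq1) % 3:
        raise ValueError("sequences must be aligned, in frame and equal length")
    pairs = [(seq1[i:i + 3], seq2[i:i + 3]) for i in range(0, len(seq1), 3)]
    s_sites = sum(synonymous_sites(a) + synonymous_sites(b) for a, b in pairs) / 2.0
    n_sites = len(seq1) - s_sites
    sd = nd = 0.0
    for a, b in pairs:
        s, n = codon_differences(a, b)
        sd, nd = sd + s, nd + n
    ka = jukes_cantor(nd / n_sites)
    ks = jukes_cantor(sd / s_sites) if s_sites else 0.0
    return ka, ks


# Toy sequences (hypothetical): M A E L G P
reference = "ATGGCTGAACTGGGTCCA"
synonymous_variant = "ATGGCCGAACTGGGTCCA"     # GCT -> GCC, both Ala
nonsynonymous_variant = "ATGGCTGATCTGGGTCCA"  # GAA -> GAT, Glu -> Asp

print(ka_ks(reference, synonymous_variant))
print(ka_ks(reference, nonsynonymous_variant))
```

The ratio Ka/Ks is undefined when Ks is zero, which is common for short or nearly identical sequences, so a gene-level histogram needs a rule for such cases.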
Correlate phenotype with genotype
Using Affymetrix arrays, it could be shown that the amount of transcript per gene varies in a species-specific manner (Enard et al. 2001).
-> What DNA sequence differences are responsible for the observed differences in transcript levels?
[Figure: gene structure showing the enhancer, promoter, transcription start site (TSS), 5' UTR and 3' UTR.]
• Transcriptional control
• RNA stability
237 genes annotated for chromosome 21
189 annotated genes represented on the Affymetrix A-E arrays (Hellmann, Pääbo)
[Figure: expression of chromosome 21 genes in brain and liver. Categories: annotated genes, detected genes, upregulated (in human), downregulated (in human). Labeled genes: IFNAR2, IFNGR2, ETS2, ITSN, C21orf97, DSCR1, LSS, TTC3, CXADR, marked as higher in chimp or higher in human.]

Identifying cis-regulatory elements in the human genome is a major challenge of the post-genomic era
 Promoters and enhancers that regulate gene expression in normal and diseased cells and tissues

Inter-species sequence comparisons have emerged as a major technique for identifying human regulatory elements
 Particularly comparisons to the sequenced mouse, chicken and fish genomes

A significant fraction of empirically defined human regulatory modules are:
 Too weakly conserved in other mammalian genomes, such as the mouse, to distinguish them from nonfunctional DNA
 Completely undetectable in nonmammalian genomes

Identification of such significantly divergent functional sequences will require complementary methods in order to complete the functional annotation of the human genome
 Deep intra-primate sequence comparison is a novel alternative to the commonly used distant species comparisons
Non-coding sequences with primate-specific conservation include three regulatory elements
Nature (2003) 424:788-793
Conjoined gene: a fused transcript formed by combining the exons of two or more distinct genes (child genes)
[Figure: child gene A and child gene B drawn as exons and introns, with the conjoined gene A – B spanning both.]
• Transcript A-B combines at least one exon (complete or partial overlap) from both Gene A & Gene B
– Usually only supported by a few mRNA/EST sequences, and rarely by a CCDS
• Currently, about 32 known cases found by searching NCBI Entrez (including 8 from chr 11 recently submitted by our group)
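
The definition above translates directly into an interval test: a transcript qualifies if its exons overlap, fully or partially, at least one exon of each child gene. The following is a minimal sketch with made-up coordinates. The function names and the half-open `(start, end)` convention are assumptions, and a real screen would also check strand and chromosome.

```python
def overlaps(a, b):
    """True if two half-open (start, end) intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def shares_exon(transcript_exons, gene_exons):
    """True if any transcript exon overlaps any exon of the gene."""
    return any(overlaps(t, g) for t in transcript_exons for g in gene_exons)


def is_conjoined(transcript_exons, gene_a_exons, gene_b_exons):
    """True if the transcript uses at least one exon from both child genes."""
    return shares_exon(transcript_exons, gene_a_exons) and shares_exon(
        transcript_exons, gene_b_exons
    )


# Hypothetical exon coordinates on one chromosome and strand.
gene_a = [(100, 200), (300, 400)]
gene_b = [(1000, 1100), (1200, 1300)]

fused_transcript = [(100, 200), (350, 400), (1000, 1100)]  # partial overlap with A
ordinary_transcript = [(100, 200), (300, 400)]             # gene A only

print(is_conjoined(fused_transcript, gene_a, gene_b))
print(is_conjoined(ordinary_transcript, gene_a, gene_b))
```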
Chr1 SRP9 – EPHX1 fusion (evidence: 1 EST, DA417873)
Alternate splicing and novel exons observed in fused mRNA
Number of mRNAs examined: 456 (326 conjoined genes)

At least one exon* from both child genes conserved in   Number
Chimpanzee mRNAs                                         125 (69 conjoined genes)
Mouse mRNAs                                              30 (15 conjoined genes)
Both Chimpanzee and Mouse mRNAs                          25 (11 conjoined genes)
* Exons considered were part of conjoined gene mRNAs

27% conjoined genes conserved in Chimpanzee
6.5% conjoined genes conserved in Mouse
• RIKEN
• Yoshiyuki Sakaki
• Tulika P. Srivastava
• Vineet K. Sharma
• Asao Fujiyama
• Masahira Hattori
• Atsushi Toyoda
• Yoko Kuroki
• Yasushi Totoki
• Hideki Noguchi
• Hidemi Watanabe
• Takehiko Itoh (MRI)
• Chimpanzee Chr 22 Sequencing Consortium
• Chinese National Human Genome Center at Shanghai, China
• KRIBB Genome Research Center, Daejeon, Korea
• National Yang Ming University Genome Research Center, Taipei, Taiwan
• National Institute of Genetics, Mishima, Japan
• RIKEN Genomic Sciences Center, Yokohama, Japan
• GBF, Dept. of Genome Analysis, Braunschweig, Germany
• Institute for Molecular Biotechnology, Jena, Germany
• Max-Planck Institute for Molecular Genetics, Berlin, Germany