Bioinformatics: Basics


Evolution by natural selection
DNA->RNA->Protein
StructureFunction
What is bioinformatics?
Why is there bioinformatics?
Databases, the reagents
What does bioinformatics do?
Bioinformatics is about understanding how life works.
It is a hypothesis-driven science.
Bioinformatics is about
integrating biological themes
together with the help of
computer tools and biological
databases, and gaining new
knowledge from this.
Acquisition, curation, and analysis of
biological data
Hypothesis
• Lots of new sequences being added
• Automated sequencers
• Genome projects
• EST sequencing
• Microarray studies
• Proteomics
• Metagenomics
• WGS
• Patterns in datasets that can only be analyzed using computers

• Genome information
• DNA sequence
• Gene expression
• Protein expression
• Protein structure
• Genome mapping
• Metabolic networks
• Regulatory networks
• Trait mapping
• Gene function analysis
• Scientific literature

"Biology is mere stamp-collecting"
1951: (Sanger & Tuppy) 30 AAs of the β-chain of bovine insulin
1965: (Holley) nucleotide sequence of a yeast alanine tRNA
1970s: (various) various protein sequencing methods
1972: (Dayhoff) "Atlas of Protein Sequence and Structure"
1977: (Sanger; Maxam & Gilbert) DNA sequencing
1980s: (Brenner and various others) automated sequencing
1980s: community databases
1987-92: genome sequencing projects
1992: (Venter) Expressed Sequence Tags and patents
1998: C. elegans complete genome
2001: Human genome
Present: widespread use of automated sequencers
[Figure: number of databases per year, 1999 to 2011; y-axis: No. of databases, 0 to 1400]
Source: Nucleic Acids Research: Database issue
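[Figure: pie chart of database categories; slice counts as extracted, not matched to categories: 20, 42, 26, 27, 106, 134, 72, 67, 189, 129, 126, 113, 119, 270]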
DNA Sequence Databases (9%)
RNA Sequence Databases (5%)
Protein Sequence Databases (13%)
Structure Databases (9%)
Genomics Databases (19%)
Metabolic and Signaling Pathways (8%)
Human and other Vertebrate Genomes (8%)
Human Genes & Diseases (9%)
Microarray Data & Gene Expression (5%)
Proteomics Resources (1%)
Other Molecular Biology Databases (3%)
Organelle Databases (2%)
Plant Databases (7%)
Immunological Databases (2%)
INTERNATIONAL NUCLEOTIDE SEQUENCE DATABASE COLLABORATION (INSDC)
1. GenBank at the National Center for Biotechnology Information (NCBI) of the National Institutes of Health (NIH) in Bethesda, USA
2. European Nucleotide Archive at the European Bioinformatics Institute (EBI) in Hinxton, England
3. DNA Data Bank of Japan (DDBJ) at the National Institute of Genetics in Mishima, Japan
Ranks | Higher taxa | Genus | Species | Lower taxa | Total
Archaea | 108 | 127 | 699 | 199 | 1133
Bacteria | 1144 | 2136 | 11591 | 11445 | 26316
Eukaryota | 17812 | 55141 | 223387 | 19199 | 315535
Fungi | 1281 | 3860 | 23888 | 1769 | 30794
Metazoa | 12985 | 35329 | 102285 | 8925 | 159524
Viridiplantae | 2141 | 13580 | 89422 | 7214 | 112357
Viruses | 519 | 358 | 8092 | 68135 | 77104
All taxa | 19608 | 57770 | 249723 | 99013 | 426110
Source: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi
Entries | Bases | Species | Common name
19,301,988 | 15,881,839,899 | Homo sapiens | Human
9,296,587 | 9,118,049,806 | Mus musculus | Mouse
2,187,038 | 6,503,434,302 | Rattus norvegicus | Rat
2,199,144 | 5,381,235,474 | Bos taurus | Cow
3,927,943 | 5,055,840,446 | Zea mays | Corn
3,222,429 | 4,793,300,236 | Sus scrofa | Pig
1,705,210 | 3,127,958,433 | Danio rerio | Zebrafish
228,265 | 1,352,948,327 | Strongylocentrotus purpuratus | Sea urchin
1,343,269 | 1,251,053,810 | Oryza sativa Japonica Group | Rice
1,770,193 | 1,194,842,997 | Nicotiana tabacum | Tobacco
1,424,327 | 1,147,237,486 | Xenopus (Silurana) tropicalis | Clawed frog
2,321,188 | 1,138,511,865 | Arabidopsis thaliana | Thale cress
1,224,108 | 1,058,563,193 | Drosophila melanogaster | Fruit fly
214,236 | 1,003,309,475 | Pan troglodytes | Chimpanzee
1,454,515 | 947,332,578 | Canis lupus familiaris | Dog
662,510 | 915,431,680 | Vitis vinifera | Grape
811,571 | 896,784,038 | Gallus gallus | Chicken
1,890,146 | 895,052,594 | Glycine max | Soybean
82,708 | 828,906,407 | Macaca mulatta | Rhesus macaque
738,474 | 778,132,243 | Solanum lycopersicum | Tomato
Year | Sequence | Number of base pairs
1971 | First published DNA sequence | 12
1977 | PhiX174 | 5,375
1982 | Lambda | 48,502
1992 | Yeast Chromosome III | 316,613
1995 | Haemophilus influenzae | 1,830,138
1996 | Saccharomyces | 12,068,000
1998 | C. elegans | 97,000,000
2000 | D. melanogaster | 120,000,000
2001 | H. sapiens (draft) | 2,600,000,000
2003 | H. sapiens | 2,850,000,000
Growth in complete genomes
Cochrane G et al. Nucl. Acids Res. 2011;39:D15-D18
© The Author(s) 2010. Published by Oxford University Press.
Bioinformatics as an interdisciplinary science has to:
1. pick up, provide and apply the appropriate mathematical tools needed for tackling problems of systematic biology;
2. provide a suitable knowledge basis to specify the application of the developed tools;
3. develop appropriate algorithms and implement them as effective computer programs;
4. provide the required technical solutions for handling large amounts of biological data.
• Use of computational search and alignment techniques to compare a new genome against known genes
• Use of mathematical modeling techniques to identify common patterns, features and high-level functions
• An integrated approach that combines both
Purpose | Software
Sequence assembly | Arachne, GAP4, AMOS
Pairwise sequence comparison | FASTA, BLAST
Sequence-profile comparison | PSI-BLAST
Multiple alignment | ClustalW
Phylogenetic analysis | PAUP, Phylip
Gene identification | Genscan, GeneMarkHMM, GRAIL, Genei, Glimmer
Analysis of repetitive DNAs | RepeatMasker, RepeatFinder, RECON
Protein sequence | Pfam, ProDom, COG
‘Fingerprints’/motifs | PROSITE, PRINTS, BLOCKS
Microarray data analysis | GeneTraffic, GeneSpring, GCOS, Cluster, CaARRAY, BASE, Bioconductor
2-D gel analysis | SWISS-2DPAGE, Melanie, Flicker, PDQuest
Database/Tool | URL | Use
PlantsDB | http://mips.gsf.de/projects/plants | Similarities and dissimilarities, specific characteristics of individual plant genomes
POGs/PlantRBP | http://plantrbp.uoregon.edu/ | For cross-species comparison of genomes
AgBase | http://www.agbase.msstate.edu | For functional analysis of genes
TIGR Plant TA database | http://plantta.tigr.org | To generate a comprehensive resource of assembled and annotated gene transcripts
PathoPlant® | http://www.pathoplant.de | Plant–pathogen interactions and signal transduction reactions
PlantGDB | http://www.plantgdb.org/ | Resources for comparative genomics
SGN | http://sgn.cornell.edu | Solanaceae genomics network
Sputnik | http://mips.gsf.de/proj/sputnik/ | EST clustering and annotation system
PopulusDB | http://www.populus.db.umu.se/ | Open resource for tree genomics
HARVEST | http://harvest.ucr.edu/ | EST database viewing software
CR-EST | http://pgrc.ipk-gatersleben.de/cr-est/index.php | Crop EST database
VitisExpDB | http://cropdisease.ars.usda.gov/vitis_at/main-page.htm | Grape gene expression database
What is similar to my sequence?
• Searching gets harder as the databases get bigger, and quality changes
• Tools: BLAST and FASTA = time-saving heuristics (approximate methods)
• Statistics + informed judgment of the biologist
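
The "time-saving heuristic" idea can be illustrated with a toy word-seeding search: short words (k-mers) from the database are indexed once, and the query's words are looked up instead of aligning against every position. This is a minimal sketch of the general idea only, not how BLAST or FASTA is actually implemented; the function names, the word size k=4 and the toy sequences are assumptions made for illustration.

```python
from collections import defaultdict


def build_kmer_index(database, k):
    """Map each k-mer to the (sequence name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index


def seed_hits(query, index, k):
    """Count shared k-mers per (sequence, diagonal); high counts suggest a match."""
    hits = defaultdict(int)
    for q_pos in range(len(query) - k + 1):
        for name, db_pos in index.get(query[q_pos:q_pos + k], []):
            # Words from one ungapped match share the same diagonal.
            hits[(name, db_pos - q_pos)] += 1
    return dict(hits)


# Hypothetical toy database, for illustration only.
database = {
    "seqA": "ACGTACGTGGA",
    "seqB": "TTTTGGGGCCCC",
}
index = build_kmer_index(database, k=4)
hits = seed_hits("TACGTG", index, k=4)
best = max(hits, key=hits.get)  # (sequence name, diagonal) with the most seeds
```

A real search tool would extend the best seeds into full alignments and attach a statistical significance to each hit, which is where the biologist's judgment comes in.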
1. Sequence analysis
› Pairwise (global and local)
› Global: aligning sequence pairs in an end-to-end fashion
› Local: aligning the most similar regions in a pair of sequences
› Multiple sequence alignment (MSA)
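
The difference between global and local pairwise alignment can be sketched with one dynamic-programming routine. With `local=False` it follows the Needleman-Wunsch recurrence (end to end); with `local=True` it follows Smith-Waterman (best-scoring subregions). The function name and the scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for illustration; real tools use substitution matrices and affine gap penalties.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best pairwise alignment score by dynamic programming.

    local=False: Needleman-Wunsch (global, end to end).
    local=True:  Smith-Waterman (local, best-scoring subregions).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment charges for gaps at the sequence ends.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]
```

For example, `alignment_score("TTTTACGTTTTT", "GGACGTGG", local=True)` scores only the shared `ACGT` core, while the global score of the same pair is dragged down by the unmatched flanks.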

• Ab initio: the gene looks like the average of many genes
› Genscan, GeneMark, GRAIL…
• Similarity: the gene looks like a specific known gene
› Procrustes,…
• Hybrid: a combination of both
› GenomeScan (http://genes.mit.edu/genomescan/)
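
The simplest signal an ab initio method can use is an open reading frame: a start codon followed in frame by a stop codon. The sketch below is a toy under stated assumptions, not what Genscan, GeneMark or GRAIL do (those rely on statistical models of coding sequence and splice sites). It scans the forward strand only, ignores introns, and the `find_orfs` name and `min_codons` threshold are invented for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for forward-strand ORFs: ATG ... stop.

    end is exclusive and includes the stop codon; min_codons counts the
    codons from ATG up to, but not including, the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs
```

A similarity-based method would instead align the genomic sequence to a known protein or cDNA, and a hybrid method would combine both sources of evidence.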
GENERIC STEPS INVOLVED IN
EST ANALYSIS
Briefings in Bioinformatics 2006. VOL 8(1) 6-21
MINING FOR SSRs
TOOLS
1. MISA
2. WEBSAT
3. Microsatellite Repeat Finder
4. Perfect Microsatellite Repeat Finder
5. Tandem Repeats Finder
6. Repeat Finder
7. Etandem
8. Msatcommander
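
What these tools look for can be sketched with a regular expression: a short motif (1 to 6 bases) repeated in tandem at least a minimum number of times. This is an illustrative sketch for perfect repeats only, not the algorithm of any tool listed above; the `find_ssrs` name and the minimum-repeat thresholds are assumptions.

```python
import re


def find_ssrs(dna, min_repeats=None):
    """Return (start, motif, repeat_count) for perfect microsatellites."""
    if min_repeats is None:
        # Assumed thresholds: motif length -> minimum number of repeats.
        min_repeats = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}
    dna = dna.upper()
    found = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for m in pattern.finditer(dna):
            motif = m.group(1)
            # Skip motifs that are themselves repeats of a shorter unit,
            # e.g. "AA" or "ATAT", which smaller unit lengths already cover.
            if any(motif == motif[:d] * (unit_len // d)
                   for d in range(1, unit_len) if unit_len % d == 0):
                continue
            found.append((m.start(), motif, len(m.group(0)) // unit_len))
    return sorted(found)


# Hypothetical test sequence with three embedded repeats.
example = "GGC" + "AT" * 7 + "CCG" + "A" * 12 + "TTC" + "CAG" * 5 + "T"
ssrs = find_ssrs(example)
```

Dedicated tools additionally handle imperfect and compound repeats and can design flanking primers, which this sketch does not attempt.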
TRENDS in Biotechnology 2005 Vol. 23(1) 48-55
In silico SNP/indel identification
Tools
• AutoSNP
• QualitySNP
• HaploSNPer
• MAVIANT
• PolyBayes
• SNiPpER
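
In silico SNP discovery generally starts from redundant, aligned sequences (for example clustered ESTs) and asks which alignment columns show a second allele often enough to be unlikely to be a sequencing error. The sketch below illustrates only that redundancy idea; it is not the method of any tool listed above, and the `call_snps` name and the `min_minor_count` threshold are assumptions.

```python
from collections import Counter


def call_snps(aligned_reads, min_minor_count=2):
    """Return (column, allele_counts) for columns with a supported minor allele.

    aligned_reads: equal-length strings from one alignment; '-' marks a gap
    and 'N' an unknown base, and both are ignored here.
    """
    snps = []
    for col in range(len(aligned_reads[0])):
        counts = Counter(read[col] for read in aligned_reads)
        for symbol in "-N":
            counts.pop(symbol, None)
        ranked = counts.most_common()
        # Require the second most frequent allele to be seen repeatedly, so a
        # single sequencing error is not reported as a polymorphism.
        if len(ranked) >= 2 and ranked[1][1] >= min_minor_count:
            snps.append((col, dict(counts)))
    return snps


# Hypothetical aligned reads: column 2 is polymorphic, column 5 has one odd base.
reads = ["ACGTAC", "ACGTAC", "ACTTAC", "ACTTAC", "ACGTAG"]
snps = call_snps(reads)
```

Production tools add base-quality scores, haplotype information or Bayesian models on top of this kind of counting.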
Source: Genes, Genomes and Genomics- SPECIAL ISSUE: Tree and Forest Genetics ( 2010)
Tools
Database – miRBase - http://www.mirbase.org/
MiRAlign : http://bioinfo.au.tsinghua.edu.cn/miralign/
miRanda : microRNA Target Detection
Miracle : http://miracle.igib.res.in/miracle/
RegRNA : http://regrna.mbc.nctu.edu.tw/
miRTar : http://mirtar.mbc.nctu.edu.tw/
miRU: Plant microRNA Potential Target Finder
miRseek : http://220.227.138.213/mirnablast/mirnablast.php
Source: Asia Pac. J. Mol. Biol. Biotechnol., Vol. 15 (3), 2007
Can we predict the function of protein molecules from their sequence?
sequence > structure > function
Prediction of some simple 3-D structures (α-helix, β-sheet, membrane-spanning, etc.)
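
Membrane-spanning segments are one of the simple features that can be guessed from sequence alone, because they are stretches of hydrophobic residues. A minimal sketch using the Kyte-Doolittle hydropathy scale with a sliding window follows. The window of 19 residues and the cutoff of 1.6 are the values commonly quoted for this scale and are treated here as assumptions; the function names are invented for illustration, and this is not a substitute for a dedicated predictor.

```python
# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Mean hydropathy of every full window along the sequence."""
    values = [KD[aa] for aa in protein.upper()]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def candidate_tm_windows(protein, window=19, threshold=1.6):
    """Start positions of windows hydrophobic enough to suggest a membrane span."""
    profile = hydropathy_profile(protein, window)
    return [i for i, mean in enumerate(profile) if mean > threshold]
```

Calling `candidate_tm_windows` on a sequence with a long run of leucines flanked by charged residues returns the window positions centred on the hydrophobic run.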
COMBINED BIOINFORMATICS AND CHEMOINFORMATICS WORKFLOW
1. Sequence assembly.
2. Identification of target proteins.
3. BLASTp search against PDB to find homologous protein structures to be used as templates (red) for protein homology modeling experiments.
4. Protein model structures (blue) can in turn be employed for docking. For docking experiments, ligand structures have to be converted into their 3D form.
Genomics 89 (2007) 36–43

• Mapping: identifying the location of clones and markers on the chromosome by genetic linkage analysis and physical mapping
• Sequencing: assembling clone sequence reads into large (eventually complete) genome sequences
• Gene discovery: identifying coding regions in genomic DNA by database searching and other methods
• Function assignment: using database searches, pattern searches, protein family analysis and structure prediction to assign a function to each predicted gene
• Data mining: searching for relationships and correlations in the information
• Genome comparison: comparing different complete genomes to infer evolutionary history and genome rearrangements
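
The sequencing step, assembling reads into longer sequences, can be illustrated with a greedy overlap merge: repeatedly join the two reads whose suffix and prefix overlap the most. This is a toy sketch assuming error-free reads from one strand and no repeats; the function names and the `min_len` overlap cutoff are assumptions, and real assemblers such as those named earlier in the deck use far more elaborate methods.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    length = overlap(a, b, min_len)
                    if length > best_len:
                        best_len, best_i, best_j = length, i, j
        if best_len == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Hypothetical reads sampled from the sequence ATGGCGTACGTTAGC.
contigs = greedy_assemble(["ATGGCGTA", "GCGTACGT", "TACGTTAGC"])
```

Greedy merging can misassemble repetitive regions, which is one reason repeat analysis and physical maps matter in genome projects.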
• Development of automated sequencing techniques
• Joining the sequences of smaller fragments
• Prediction of promoters and protein coding regions
• Identifies the enzyme function of new genes by comparing with that of evolutionarily close genomes
• Network of gene-groups connected through the reactions catalyzed by enzymes embedded in the gene-groups
• Global modeling of chemical reactions in the microbial cells
To identify transcription factors for protein-DNA interactions there are four major approaches:
• Microarray analysis of gene expression
• Statistical analysis of promoter regions of orthologous genes
• Global analysis of frequency patterns of dimers in the intergenic region
• Biochemical modeling at the atomic level
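
As a rough illustration of the third approach, the sketch below tabulates how often each pair of adjacent nucleotides occurs in a sequence, so that intergenic regions can be compared for unusual patterns. Reading "dimers" as adjacent nucleotide pairs is an assumption made here (the slide may instead mean pairs of spaced words), and the `dimer_frequencies` name is invented for illustration.

```python
from collections import Counter
from itertools import product


def dimer_frequencies(dna):
    """Relative frequency of each of the 16 dinucleotides (overlapping windows)."""
    dna = dna.upper()
    counts = Counter(dna[i:i + 2] for i in range(len(dna) - 1))
    dimers = ["".join(p) for p in product("ACGT", repeat=2)]
    # Only the 16 unambiguous dimers are counted; pairs containing N are ignored.
    total = sum(counts[d] for d in dimers)
    return {d: (counts[d] / total if total else 0.0) for d in dimers}


# Hypothetical intergenic fragment, for illustration only.
freqs = dimer_frequencies("ACGTACGT")
```

Comparing such frequencies between intergenic regions and a genome-wide background is one simple way to flag over-represented patterns for follow-up.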
• We have only touched small parts of the elephant
• Trial and error (intelligently) is often your best tool
• Keep up with the main databases, and you'll have a pretty good idea of what is happening and available
