Ecological Genomics


A Brief Introduction to Several Computational Problems in Bioinformatics
Lin Kui (林魁)
Laboratory of Computational Molecular Biology
[http://cmb.bnu.edu.cn]
College of Life Sciences
Beijing Normal University
Outline
• Introduction to bioinformatics
• Genome annotation
• Genome evolution
• Metagenomics
• Acknowledgements
From U.S. Department of Energy Human Genome Program. http://www.ornl.gov/hgmis
Is Biology an Informational Science?
• The HGP changed how we view and practice biology.
• Biology is an informational science.
  – Digital genome
  – Environmental signals
• Biology has become a cross-disciplinary science.
Bioinformatics as an intersecting discipline
Mathematical sciences
Computer science
Life sciences
Developing the high throughput technologies and
computational/mathematical tools required for this new biology.
Why? Where? What? How?
• Why: What ideas motivate producing these huge datasets?
Biological background is needed.
• Where: Raw data need to be stored; IT platforms are required.
• What: Patterns in the datasets can be analyzed using
computers. Various data models and their respective
algorithms are needed.
• How: Different resources need to be integrated.
What is Bioinformatics?
• The field of biology specializing in developing hardware and
software to store and analyze the huge amounts of data being
generated by life scientists. (NIH)
• More than 20 different definitions can be found on Google!
• Computational Biology?
• Computational Molecular Biology?
Data integration
Various molecular biology databases
Bioinformatics applications
Data analysis
Key Challenge of Bioinformatics
The world of biology is very different from what it was
even ten years ago.
To bridge the considerable gap between technical data
production and its use by scientists for biological
discovery.
Schematic platform for bioinformatics applications
• Clients: browsers and light applications (HTML/XML; Perl/C/C++/Java, BioPerl)
• Network: Intranet and/or Internet
• Servers:
  – WWW servers
  – Database servers (MySQL)
  – Intensive computing servers (bioinformatics tools, statistical analysis with R, HPC with MPI)
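Next-generation genome sequencing (NGS) technologies
• Sanger (3730xl), dideoxynucleotide method: 100,000 bp
• Pyrosequencing: 100 million bp
• Sequencing by synthesis: 1.5 billion bp
• Sequencing by ligation: 2 billion bp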
Trade-off between read length and sequencing cost
Technology:        Sanger    454       (not named)
Read length:       Long      Middle    Short
Throughput:        Low       Middle    High
Sequencing cost:   High      Middle    Low
With the new technology
• New scientific questions emerge
• Existing questions can be answered in a way that was not
considered before.
Sequence assembly by the shotgun approach
• The master sequence is reconstructed from short sequences,
simply by examining the sequences for overlaps.
• No prior knowledge of the genome is needed.
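The overlap idea above can be sketched in a few lines of Python. This is a minimal greedy illustration, not the method used by any particular assembler: the function names `overlap` and `greedy_assemble`, the `min_len` threshold, and the toy reads are all assumptions made for the example, and it ignores sequencing errors, repeats, and reverse-complement reads.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences with the largest overlap.

    Stops when no pair overlaps by at least `min_len` bases and
    returns the remaining contigs.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs
        # Join the two reads, writing the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


# Toy reads (hypothetical) drawn from the sequence ATGCGTACCTGA.
print(greedy_assemble(["ATGCGTA", "CGTACCT", "ACCTGA"]))
```

The greedy strategy is chosen here only because it is the shortest way to show overlap-based merging; on real data, repeats longer than the reads can make it join the wrong pieces.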
Where is a gene?
CGGTTGAAAGCGGTAGCGTCCATGCGTATTACTCTTGAGCGGTCGAACCTTCTGAAATCGCTGAACCACGTCCACCGGGT
CGTCGAGCGTCGCAACACGATCCCGATCCTGTCCAACGTTCTGCTGCGCGCCTCCGGCGCCAATCTGGACATGAAGGCGA
CCGACCTCGATCTGGAAATCACCGAAGCGACCCCGGCCATGGTGGAGCAGGCTGGCGCCACCACCGTACCGGCACACCTG
CTTTACGAAATCGTGCGCAAGCTGCCGGATGGTTCCGAAGTGCTTCTGGCGACCAACCCGGACGGCTCCTCCATGACCGT
TGCGTCCGGCCGCTCGAAATTCTCGCTGCAATGCCTGCCGGAAGCGGATTTCCCTGACCTCACCGCCGGCACCTTCAGCC
ACACCTTCAAACTGAAGGCGGCCGATCTGAAGATGCTGATCGACCGGACGCAGTTTGCGATTTCGACCGAAGAGACGCGT
TATTACCTGAACGGCATTTTCTTCCACACCATCGAAAGCAATGGCGAGCTGAAACTGCGCGCCGTCGCCACCGACGGTCA
CCGCCTTGCGCGTGCTGACGTCGATGCGCCCTCCGGCTCCGAAGGCATGCCGGGCATCATCATTCCGCGCAAGACCGTCG
GTGAACTGCAGAAGCTGATGGACAATCCGGAACTGGAAGTCACAGTCGAAGTCTCGGATGCGAAGATCCGCCTGGCCATC
GGTTCCGTCGTTCTGACCTCGAAGCTGATCGACGGCACCTTTCCCGATTATCAGCGCGTCATCCCAACCGGCAACGACAA
GGAAATGCGCGTCGATTGCCAGACCTTCGCCCGGGCAGTGGACCGTGTTTCGACGATTTCTTCCGAGCGCGGCCGCGCCG
TGAAGCTGGCGCTAACTGACGGCCAGTTGACGCTGACCGTCAACAATCCCGACTCGGGAAGTGCTACCGAAGAAGTGGCC
GTTGGCTACGACAATGATTCGATGGAAATCGGCTTCAATGCCAAATATCTCCTCGACATCACGTCGCAGCTCTCCGGCGA
AGATGCGATTTTTCTGCTGGCGGATGCGGGTTCGCCAACACTGGTTCGCGATACCGCCGGCGACGACGCACTCTATGTTC
TGATGCCGATGCGCGTTTAAAACCGACCGTTTTCTTCAATTTTTCCAGAAACGCCGGTGGATCGCTTCATCGGCGTTTTT
TGATTCGGCGAACAGGTGGCTCTACCCGTAACTGAATTTTCTCAGTTACGACATTTTGCCTTGTTTTTGCGCCAAATGGG
ATCAACAGTACGTAACAATTTTTTGACAATGACCAATACATCCGAGGGGAATCATGGCACTCAACCTGAAGCAACGGCTT
GAACAAAAATTTGAGGAAGAAATCCGCTTTTTCAAAGGTATGGTCAGCCAGCCGAAAAAAGTCGGCGCCATTGTCCCGAC
GGTTCCGTCGTTCTGACCTCGAAGCTGATCGACGGCACCTTTCCCGATTATCAGCGCGTCATCCCAACCGGCAACGACAA
GGAAATGCGCGTCGATTGCCAGACCTTCGCCCGGGCAGTGGACCGTGTTTCGACGATTTCTTCCGAGCGCGGCCGCGCCG
TGAAGCTGGCGCTAACTGACGGCCAGTTGACGCTGACCGTCAACAATCCCGACTCGGGAAGTGCTACCGAAGAAGTGGCC
GTTGGCTACGACAATGATTCGATGGAAATCGGCTTCAATGCCAAATATCTCCTCGACATCACGTCGCAGCTCTCCGGCGA
AGATGCGATTTTTCTGCTGGCGGATGCGGGTTCGCCAACACTGGTTCGCGATACCGCCGGCGACGACGCACTCTATGTTC
TGATGCCGATGCGCGTTTAAAACCGACCGTTTTCTTCAATTTTTCCAGAAACGCCGGTGGATCGCTTCATCGGCGTTTTT
TGATTCGGCGAACAGGTGGCTCTACCCGTAACTGAATTTTCTCAGTTACGACATTTTGCCTTGTTTTTGCGCCAAATGGG
ATCAACAGTACGTAACAATTTTTTGACAATGACCAATACATCCGAGGGGAATCATGGCACTCAACCTGAAGCAACGGCTT
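As a rough illustration of the question this slide poses, the sketch below scans a DNA string for open reading frames (ORFs): an ATG followed in the same frame by a stop codon. It is a deliberately naive heuristic, not a real gene finder; the name `find_orfs`, the `min_codons` cutoff, and the toy input are assumptions for the example, and only the forward strand is scanned.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 30) -> list[tuple[int, int, int]]:
    """Return (frame, start, end) for forward-strand ORFs.

    `start` is the 0-based index of the ATG, `end` is exclusive and
    includes the stop codon. ORFs with fewer than `min_codons` codons
    before the stop are skipped.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((frame, start, pos + 3))
                start = None  # look for the next ORF in this frame
    return orfs


# Toy input (hypothetical): one short ORF in frame 2.
print(find_orfs("CCATGAAATTTGGGTAACC", min_codons=3))
```

Long ORFs are only weak evidence for a gene, which is why annotation pipelines combine such signals with alignment evidence and statistical models.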
1. Transcriptional control sites
2. Non-coding RNAs
3. mRNAs
• Evidence-based prediction
  – Cis-alignment
  – Trans-alignment
• Ab initio / de novo prediction
Three Layers of Genome Annotation
From Stein, L. 2001. Nature Reviews Genetics 2:493-503
Some models
• Dynamic programming
• Hidden Markov Models (HMMs)
• Conditional random field (CRF)
• Support vector machines (SVMs)
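To make one of these models concrete, here is a minimal sketch of Viterbi decoding for a two-state HMM that labels each base as "coding" or "noncoding". All probabilities below are made-up illustrative values (the coding state is simply assumed to be GC-rich), and the names `START`, `TRANS`, `EMIT`, and `viterbi` are hypothetical; real gene finders use far richer state structures.

```python
import math

STATES = ("noncoding", "coding")

# Illustrative parameters only; not estimated from any real genome.
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(seq: str) -> list[str]:
    """Most probable state path for `seq`, computed in log space."""
    if not seq:
        return []
    # scores[t][s]: best log-probability of any path ending in state s at t.
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]])
               for s in STATES}]
    back = []  # back[t-1][s]: best predecessor of state s at position t
    for base in seq[1:]:
        prev = scores[-1]
        col, ptr = {}, {}
        for s in STATES:
            best_prev = max(STATES,
                            key=lambda p: prev[p] + math.log(TRANS[p][s]))
            col[s] = (prev[best_prev] + math.log(TRANS[best_prev][s])
                      + math.log(EMIT[s][base]))
            ptr[s] = best_prev
        scores.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


print(viterbi("AAAAAAAAGCGCGCGCGC"))
```

Viterbi is itself a dynamic programming algorithm, so this one example touches the first two items in the list.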
Cucumber Genome Annotation Project
The Institute of Vegetables and Flowers,
Chinese Academy of Agricultural Sciences
Laboratory of Computational Molecular Biology,
Beijing Normal University
CMB genome annotation pipeline
• Inputs: genomic sequence, ESTs/cDNAs, UniProt proteins
• RepeatMasker: repeats
• EVM: protein-coding genes
• Rfam: ncRNA genes
• PseudoPipe: pseudogenes
• Storage: MySQL DBMS
• Functional annotation:
  – Protein homology
  – Domain annotation (InterProScan)
  – Mapping to Gene Ontology
  – Mapping to KEGG
• WWW service + visualization (GBrowse)
GBrowse
Depending on the state of the sequencing project, genomic coordinates
along a chromosome may change dramatically from assembly to
assembly.
Phylogenetics
• Evolutionary theory states that groups of similar
organisms are descended from a common ancestor.
• Phylogenetic systematics (cladistics) is a method
of taxonomic classification that groups organisms by their
evolutionary history.
• It was developed by Willi Hennig, a German
entomologist, in 1950.
Major reasons to use phylogenetics
Understand the lineage of different species
Organizing principle to sort species into a taxonomy
Understand how various functions evolved
Understand forces and constraints on evolution
Perform multiple sequence alignment
Predict gene function (phylogenetic footprint)
Species/Gene Trees
Species tree (how are my species related?)
• Contains only one representative from each species.
• When did speciation take place?
• All nodes indicate speciation events.
Gene tree (how are my genes related?)
• Often contains a number of genes from a single species.
• Nodes relate either to speciation or to gene duplication events.
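The distinction between speciation and duplication nodes in a gene tree can be illustrated with the species-overlap rule, a common heuristic that is not taken from these slides: an internal node is called a duplication when the same species appears under both of its children. The sketch below assumes a hypothetical tree encoding (nested 2-tuples with leaves named "species:gene") and a hypothetical function name `label_events`.

```python
from typing import Union

# A gene tree is either a leaf "species:gene" or a pair (left, right).
Tree = Union[str, tuple]


def species_of(leaf: str) -> str:
    """Extract the species name from a leaf label such as 'human:A1'."""
    return leaf.split(":", 1)[0]


def label_events(tree: Tree, events: list) -> frozenset:
    """Post-order walk that labels every internal node.

    Appends (event, sorted species under the node) to `events` and
    returns the set of species found under `tree`.
    """
    if isinstance(tree, str):
        return frozenset([species_of(tree)])
    left, right = tree
    left_sp = label_events(left, events)
    right_sp = label_events(right, events)
    # Shared species on both sides implies a duplication before the split.
    event = "duplication" if left_sp & right_sp else "speciation"
    events.append((event, sorted(left_sp | right_sp)))
    return left_sp | right_sp


# Toy gene tree (hypothetical): two paralogs, each in human and mouse.
gene_tree = ((("human:A1", "mouse:A1"), ("human:A2", "mouse:A2")), "fly:A")
events = []
label_events(gene_tree, events)
for event, species in events:
    print(event, species)
```

The rule needs no species tree as input, which keeps the example short; full reconciliation methods compare the gene tree against a known species tree instead.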
Species tree
Phylogenomics: Genome trees
• Explore genome evolution based on large data sets of DNA
or protein sequences.
• Use entire genomes to infer a species tree (Eisen and
Fraser 2003).
• Base the tree on the maximum amount of genetic information
so that anomalies average out.
• Has become the standard for reconstructing reliable
phylogenies (Ciccarelli et al. 2006; Daubin et al. 2002).
From Delsuc, F., et al. (2005) Phylogenomics and
the reconstruction of the tree of life. Nat. Rev.
Genet. 6, 361-375
Phylogenomics and the tree of life
From Delsuc, F., et al. (2005) Phylogenomics and the
reconstruction of the tree of life. Nat. Rev. Genet. 6, 361-375
Taxonomic resolution of some of the novel approaches
– Creating ever-more robust phylogenies on the basis of
diverse data sets.
Try
Evolutionary theory is evolving
100 trillion microbial cells
The dominant form of life on Earth
~1,000 Gbp of microbial genome sequences per gram of soil!
Why genomics is not enough
• Most microbes cannot be cultured
• Microbial diversity and variation have no limits
Metagenomics offers a way forward
• Who is out there?
• What are they doing?
  – What is being done by the community?
Definition:
Both a set of research techniques & a research field.
Difference between metagenomics & microbial genomics
16S rRNA-based analysis: fast, efficient, and widely used
• "Who is out there?"
• Developed and refined over many years (Olsen et al. 1986)
• Currently undergoing a renaissance (Tringe & Hugenholtz 2008)
• High conservation of 16S rRNA
• Variable regions
• Databases containing large numbers of rRNA gene sequences (>200,000)
(Cole et al. 2005; Medini et al. 2008)
• Encountering the limitations of existing tools
hsp60 gene: more sensitive than SSU rRNA
Cartoon of the general structure of the bacterial 16S rRNA gene
Who is there?
• "strain-level"
• "species"
• "genus"
• "family"
Analysis of diversity in the human gut microbial community based on
surveys of a limited number of humans.
Broad-range PCR amplification and sequencing
of microbial 16S rRNA genes
Microbial diversity in environmental samples
Why do we see clusters of very closely related sequences at the tips of
phylogenetic trees, separated by relatively long branches?
We need to identify the evolutionary and ecological mechanisms behind this pattern.
Functional analysis of complex microbial communities (EGTs)
Relative abundance of major phyla and relative abundance of
categories of function.
From Turnbaugh et al. 2009. A core gut microbiome in obese and lean twins. Nature
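The relative abundances mentioned on this slide come down to counting how many reads (or gene tags) were assigned to each category and dividing by the total. The sketch below assumes the assignments are already available as a list of labels; the function name `relative_abundance` and the toy counts are hypothetical and are not data from Turnbaugh et al.

```python
from collections import Counter


def relative_abundance(assignments: list[str]) -> dict[str, float]:
    """Fraction of reads assigned to each category.

    `assignments` holds one label per read, for example a phylum
    name or a functional category.
    """
    counts = Counter(assignments)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Toy sample (hypothetical labels and counts, for illustration only).
reads = ["Firmicutes"] * 6 + ["Bacteroidetes"] * 3 + ["Actinobacteria"]
for phylum, fraction in relative_abundance(reads).items():
    print(f"{phylum}: {fraction:.2f}")
```

The same function works for functional categories: pass one category label per gene tag instead of one phylum per read, then compare the resulting profiles across samples.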
Pathways and subnetworks reflect the adaptation of microbial
communities across environments and habitats.
From Gianoulis et al. 2009. PNAS
Resolving strain-level heterogeneity
• Gene content difference
• Sequence divergence
• Multiple strain sequence types
• Gene insertion
• Gene rearrangement
Allen & Banfield (2005) Community genomics in microbial ecology and evolution.
Nat Rev Microbiol, 3, 489-498
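Planned work
• Comparative analysis of community genomes across different
habitats (or samples), to clarify specifically how changes in key
environmental factors lead to variation in community composition;
• Combining gene expression, metabolic, and other data to explore
the relationships between the genomes of different communities and
major ecosystem processes (such as nitrogen fixation, carbon
degradation, denitrification, and anaerobic microbial ammonium removal).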
Acknowledgements
• Beijing Normal University
• All members of LCMB, BNU
http://cmb.bnu.edu.cn
Comments and Suggestions?