Transcript slides

High throughput biology projects
Swiss Institute of Bioinformatics
Institut Suisse de Bioinformatique
LF-2001.11
The new biology

Traditional biology:



Small team working on a specialized topic
Well defined experiment to answer precise questions
New « high-throughput » biology


Large international teams using cutting-edge technology that defines the project
Results are given raw to the scientific community without any underlying hypothesis
Examples of « high-throughput »








Complete genome sequencing
Large-scale sampling of the transcriptome
Simultaneous gene expression analysis of thousands of genes
(DNA chips)
Large-scale sampling of the proteome
Protein-protein interaction analysis: large-scale 2-hybrid (yeast, worm)
Large-scale 3D structure production (yeast)
Metabolism modelling
Biodiversity
Role of bioinformatics


Control and management of the data
Analysis of primary data e.g.





Base calling from chromatograms
Mass spectra analysis
DNA chip image analysis
Statistics
Results analysis in a biological context
Genomes in numbers

Sizes:





virus: 10^3 to 10^5 nt
bacteria: 10^5 to 10^7 nt
yeast: 1.35 x 10^7 nt
mammals: 10^8 to 10^10 nt
plants: 10^10 to 10^11 nt

Gene number:




virus: 3 to 100
bacteria: ~ 1000
yeast: ~ 7000
mammals: ~ 30’000
Sequencing projects

« small » genomes (<10^7 nt): bacteria, virus
Many already sequenced (industry excluded)
More than 60 bacterial genomes already in the public domain
More to come! (one every two weeks…)

« large » genomes (10^7 to 10^10 nt): eukaryotes
5 finished (S.cerevisiae, C.elegans, D.melanogaster, A.thaliana, Homo sapiens)
Many more to come: mouse, rat, rice (and other plants), fishes, many pathogenic parasites

EST sequencing
Partial mRNA sequences
~8.5 x 10^6 sequences in the public domain
Human genome






Size: 3 x 10^9 nt for a haploid genome
Highly repetitive sequences 25%, moderately repetitive sequences 25-30%
Size of a gene: from 900 to >2’000’000 bases (introns included)
Proportion of the genome coding for proteins: 5-7%
Number of chromosomes: 22 autosomes and 1 sex chromosome (X or Y) per haploid set
Size of a chromosome: 5 x 10^7 to 5 x 10^8 bases
How to sequence the human genome?

« International » consortium approach:




Generate genetic maps (meiotic recombination) and pseudogenetic
maps (chromosome hybrids) for indicator sequences
Generate a physical map based on large clones (BAC or PAC)
Sequence enough large clones to cover the genome
« commercial » approach (Celera):



Generate random libraries of fixed length genomic clones (2kb and
10kb)
Sequence both ends of enough clones to obtain a 10x coverage
Use computer techniques to reconstitute the chromosomal
sequences, check with the public project physical map
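
The "10x coverage" target above can be turned into a rough read count. The sketch below is only an illustration: the haploid genome size (3 x 10^9 nt) comes from the slides, while the read length of 500 nt is an assumption that the slides do not state, and the function names are hypothetical. The covered fraction uses the standard Poisson (Lander-Waterman style) approximation, which ignores repeats and cloning bias.

```python
import math


def reads_for_coverage(genome_size, read_length, coverage):
    """Smallest number of reads N with N * read_length / genome_size >= coverage."""
    total_bases = coverage * genome_size
    return -(-total_bases // read_length)  # integer ceiling division


def expected_fraction_covered(coverage):
    """Poisson estimate of the fraction of bases hit by at least one read."""
    return 1.0 - math.exp(-coverage)


GENOME_SIZE = 3 * 10**9  # haploid human genome, from the slides
READ_LENGTH = 500        # ASSUMPTION: not given in the slides
COVERAGE = 10            # target from the slides

n_reads = reads_for_coverage(GENOME_SIZE, READ_LENGTH, COVERAGE)
n_clones = n_reads // 2  # both ends of each clone are sequenced

print(n_reads, n_clones, expected_fraction_covered(COVERAGE))
```

Under these assumptions the read count scales linearly with the coverage target and inversely with the read length, so doubling the read length halves the number of clones to sequence.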
Mapping resources





Genetic and physical maps: Genethon, GDB, NCBI
Radiation hybrid map: Sanger
BAC production & mapping: Oakland, Caltech,
others
Clone information and retrieval: RZPD (Germany)
Physical maps in ACEDB format from chromosome
coordinators
Sequencing



Create shotgun library from BAC/PAC
Sequence individual clones to get a ten-fold
coverage
Phases:




0 = single sequence (like STS)
1 = unordered contigs
2 = ordered, oriented contigs
3 = finished, annotated sequence
Chromosome size sequences




Problem: full chromosomes or entire bacterial
genomes are too long to fit the database entry
specifications
Solution: split the sequence into overlapping “chunks”
New problem: have to reassemble chunks if you
want to analyze the whole sequence
GenBank provides “meta-entries” (CON division)
with assembly instructions
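
The split-and-reassemble problem above can be sketched in a few lines. This is a minimal illustration, not the GenBank CON format: the list of (start, piece) pairs stands in for the "assembly instructions", and the function names, chunk size and overlap are all hypothetical.

```python
def split_into_chunks(sequence, chunk_size, overlap):
    """Cut a sequence into overlapping pieces; return (start, piece) pairs."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append((start, sequence[start:start + chunk_size]))
        if start + chunk_size >= len(sequence):
            break
        start += step
    return chunks


def reassemble(chunks):
    """Rebuild the full sequence from (start, piece) pairs, dropping overlaps."""
    parts = []
    end = 0  # number of bases already reconstructed
    for start, piece in sorted(chunks):
        if start > end:
            raise ValueError("gap between chunks at position %d" % end)
        parts.append(piece[end - start:])  # keep only the new bases
        end = max(end, start + len(piece))
    return "".join(parts)


genome = "ACGT" * 25 + "TTGA"  # toy 104-nt "chromosome"
pieces = split_into_chunks(genome, chunk_size=30, overlap=10)
print(len(pieces), reassemble(pieces) == genome)
```

Storing the start coordinate with each piece is the design choice that makes reassembly trivial; without it, the overlaps themselves would have to be aligned to recover the order.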
Interpretation of the human draft



Many gaps and unordered small pieces
A genomic sequence does not tell you where the
genes are encoded. The genome is far from being
« decoded »
One must combine genome and transcriptome to
have a better idea
The transcriptome



The set of all functional RNAs (tRNA, rRNA, mRNA
etc…) that can potentially be transcribed from the
genome
The documentation of the localization (cell type)
and conditions under which these RNAs are
expressed
The documentation of the biological function(s) of
each RNA species
Public draft transcriptome

Information about the expression specificity and
the function of mRNAs



« full » cDNA sequences of known function
« full » cDNA sequences, but « anonymous » (e.g. KIAA or
DKFZ collections)
EST sequences




cDNA libraries derived from many different tissues
Rapid random sequencing of the ends of all clones
ORESTES sequences
Limited set of expression data
How to organise EST collections?




Clustering: associate individual EST sequences with
unique transcripts or genes
Assembling: derive consensus sequences from
overlapping ESTs belonging to the same cluster
Mapping: associate ESTs (or EST contigs) with
exons in genomic sequences
Interpreting: find and correct coding regions
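
The clustering step above can be illustrated with a deliberately naive sketch: two ESTs land in the same cluster if they share at least one exact k-mer, and clusters are merged with a union-find structure. This is an assumption made for illustration only; real EST clustering relies on alignments and has to handle sequencing errors, reverse complements and chimeric clones. The function names and the value of k are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests, k=8):
    """Group EST indices that are linked by at least one shared exact k-mer."""
    parent = list(range(len(ests)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    first_seen = {}  # k-mer -> index of the first EST containing it
    for idx, seq in enumerate(ests):
        for km in kmers(seq, k):
            if km in first_seen:
                union(first_seen[km], idx)
            else:
                first_seen[km] = idx

    clusters = defaultdict(list)
    for idx in range(len(ests)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())


toy_ests = [
    "ACGTACGTTGCA",
    "GTACGTTGCAGGCT",    # overlaps the first EST
    "TTTTTTTTCCCCCCCC",
    "CCCCCCCCAAAAAAAA",  # overlaps the third EST
    "GGGGGGGGGGGG",      # shares nothing with the others
]
print(cluster_ests(toy_ests))
```

Each resulting cluster would then be passed to the assembling step to derive a consensus sequence.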
Example mapping of ESTs and mRNAs
mRNAs
ESTs
Computer prediction
How to cope with the amount of data?



Enormous increase of sequences
Always moving data (phases…)
Automatic annotation projects



RefSeq (NCBI)
ENSEMBL (EBI)
HAMAP (SIB)
RefSeq: NCBI Reference sequences
mRNAs and Proteins
NM_123456 Reference mRNA
NP_123456 Reference Protein
XM_123456 Predicted Transcript
XP_123456 Predicted Protein
XR_123456 Predicted non-coding Transcript
Gene Records
NG_123456 Reference Genomic Sequence
Assemblies
NT_123456 Reference Contig (Mouse and Human Genomes)
NC_123455 Reference Chromosome, Microbial Genomes, Plasmid
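
The prefix table above lends itself to a small lookup. The sketch below maps an accession to the record type listed on the slide; the function name is hypothetical, the optional ".version" suffix is handled because RefSeq accessions are commonly written with one, and the number of digits is left open as an assumption.

```python
import re

# Prefix meanings copied from the slide.
REFSEQ_PREFIXES = {
    "NM": "Reference mRNA",
    "NP": "Reference Protein",
    "XM": "Predicted Transcript",
    "XP": "Predicted Protein",
    "XR": "Predicted non-coding Transcript",
    "NG": "Reference Genomic Sequence",
    "NT": "Reference Contig",
    "NC": "Reference Chromosome, Microbial Genome, Plasmid",
}

# Two-letter prefix, underscore, digits, optional ".version".
_ACCESSION = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def classify_refseq(accession):
    """Return the record type for a RefSeq-style accession, or None."""
    match = _ACCESSION.match(accession.strip())
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


for acc in ["NM_123456", "XP_123456.2", "NT_123456", "AB123456"]:
    print(acc, "->", classify_refseq(acc))
```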
Status codes


RefSeq records carry a status code that indicates the level of review a record has undergone.
REVIEWED
The RefSeq record has been reviewed by NCBI staff. The review process includes reviewing available sequence data and frequently also includes a review of the literature.
PROVISIONAL
The RefSeq record has not yet been subject to individual review.
PREDICTED
Some aspect of the RefSeq record is predicted and there is supporting evidence that the locus is valid.
GENOME ANNOTATION
This identifies the contig (NT_ accessions), mRNA (XM_), non-coding transcript (XR_), and protein (XP_) RefSeq records provided by the NCBI Genome Annotation process. These records are provided via automated processing.
Map view of RefSeq
NT_
XM_
NM_
ENSEMBL

Goals of Ensembl





Accurate, automatic analysis of genome data
Analysis maintained on the current data
Presentation of the analysis to biologists via the Web
Distribution of the analysis to other bioinformatics laboratories.
The Ensembl project will be a foundation for a next generation sequence database that
provides a curated, distributed, non redundant view of the genomes of model organisms.

Commitments of the Ensembl project

Public release of data
All the data and analysis will be put into the public domain immediately.
Open, collaborative software development
The software which forms the automated pipeline will be available to everyone under an open license, modelled after the Apache license.
Collaboration on agreed standards for distribution
We hope to provide the data in as many useful forms as is practical, including the EMBL flat file formats and new data distribution channels such as XML and CORBA.
ENSEMBL
ENSEMBL views
High quality Automated Microbial Annotation of Proteomes



Aim: automatically annotate with the highest level of quality a
significant percentage of proteins originating from microbial genome
sequencing projects.
The programs being developed are specifically designed to track down
"eccentric" proteins. Among the peculiarities recognized by the
programs are: size discrepancy, absence or mutation of regions involved
in activity or binding (to metals, nucleotides, etc), presence of paralogs,
contradiction with the biological context (i.e. if a protein belongs to a
pathway supposed to be absent in a particular organism), etc. Such
"problematic" proteins will not be automatically annotated.
This project should allow annotators in the SWISS-PROT groups at
SIB and EBI to concentrate on the proteins that really need careful
manual annotation.
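
One of the peculiarities listed above, size discrepancy, can be illustrated with a toy check. This is not the HAMAP implementation: the function name, the use of the family median as reference and the 20% tolerance are all assumptions chosen for the example.

```python
from statistics import median


def flag_size_discrepancies(family_lengths, candidates, tolerance=0.2):
    """Return names of candidates whose length deviates from the family median
    by more than `tolerance` (a fraction of the median)."""
    reference = median(family_lengths)
    flagged = []
    for name, length in candidates.items():
        if abs(length - reference) > tolerance * reference:
            flagged.append(name)
    return flagged


# Hypothetical lengths (in residues) of already annotated family members.
known_family = [300, 310, 295, 305]
# Hypothetical new proteins assigned to the same family.
new_proteins = {"protA": 301, "protB": 150, "protC": 420}

print(flag_size_discrepancies(known_family, new_proteins))
```

In a pipeline of this kind, flagged entries would be withheld from automatic annotation and left for manual review, in line with the rule stated above for "problematic" proteins.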
HAMAP origin



About 60 microbial genomes are available today
>1000 in a few years; >1 million microbial proteins!
Functional analysis and detailed biochemical
characterization will only be available:



For « all » proteins in a handful of model organisms (e.g.
E.coli, B.subtilis, etc.)
For proteins involved in pathogenicity (medical and
pharmaceutical interests)
For proteins involved in specific biosynthetic or catabolic
pathways (biotechnological and food industry interests)
HAMAP overview
HAMAP flow chart
HAMAP study case

The case of the Escherichia coli proteome






According to the original analysis in 1997: 4286 protein
coding genes
60 were missed (almost all <100 residues)
120 are most probably « bogus »
50 pairs or triplets of ORFs had to be fused
719 have proven or probable wrong start sites
~1800 are still not biochemically characterized; only one
new « functionalisation » per week…
Unix reminder





General: man, pwd, cd, ls, mkdir, rmdir, passwd, exit
File manipulation: cat, more, cp, mv, rm, grep, find,
diff, head, tail, chmod
Editing: vi, pico, emacs
Compression: tar, (un)compress, gzip
Various: redirection (<, >, >>) and piping (|)