Introduction - EMBnet node Switzerland

Introduction to Bioinformatics
Swiss Institute of Bioinformatics
Institut Suisse de Bioinformatique
LF-2004.08
SIB and EMBnet Bioinformatics
resources for biomedical scientists
The Swiss Institute of Bioinformatics





Founded in March 1998
Collaborative structure Lausanne - Geneva - Basel
Groups at ISREC, Ludwig Institute, Unil, HUG,
UniGe, recently UniBas and soon EPFL.
Several roles: teaching, services, research
Currently: ~ 160 employees
Projects at SIB

Databases:
SWISS-PROT, PROSITE, EPD, World-2DPAGE, SWISS-MODEL
TrEST, TrGEN (predicted proteins), tromer (transcriptome)

Software:
Melanie, Deep View, proteomic tools, ESTScan, pftools, Java applets

Services:
Web servers ExPASy, EMBnet, MyHits
Teaching and helpdesk

Research:
Mostly sequence and expression analysis, 3D structure, and proteomics
Teaching





Master's degrees in Bioinformatics (Bologna type): 90 ECTS credits at UniGe, Unil and UniBas
EMBnet courses: four one-week courses per year in Lausanne, Basel and Zürich
Undergraduate courses at the Universities of Geneva, Fribourg and Lausanne
Other courses at CHUV and EPFL
Courses in other countries: Colombia, Cambodia, Peru, …
Research





New algorithms (faster alignments…)
New technology (GRID or cluster computing)
New tools (protein analysis, microarrays, confocal
microscopy)
New databases (microarrays, transcriptome,
proteome)
Collaborations with lab researchers!
Three levels of services

Simple web access to software and databases:
Easy to use for basic, occasional research with few sequences
Potentially insecure

Command-line access with a local Unix account:
More powerful (automation) and secure
Requires an understanding of the Unix system and frequent practice

Collaboration with SIB:
Access to experts in the field (help desk)
For projects requiring extensive programming or special hardware resources

Help desk:
[email protected] or http://www.expasy.org/contact.html
SIB’s important sites

SIB home: www.isb-sib.ch
ExPASy - Expert Protein Analysis System: www.expasy.org
MyHits database and tools: myhits.isb-sib.ch
EMBnet Switzerland: www.ch.embnet.org
Geneva Bioinformatics: www.genebio.ch
SIB home
Expert Protein Analysis System
MyHits http://myhits.isb-sib.ch
Swiss node http://www.ch.embnet.org
EMBnet organisation

Started as a European network in 1988, now spread world-wide:
32 country nodes, 8 special nodes

Role:
Training, education (EMBER)
Software development (EMBOSS, SRS)
Computing resources (databases, websites, services)
Helpdesk and technical support
Publications (EMBnet.news, Briefings in Bioinformatics)

Access: www.embnet.org
Each node is reachable at “www.xx.embnet.org”, where xx is the country code (e.g., ch for Switzerland)
EMBnet home
European Molecular Biology Open Software Suite (EMBOSS)

Free Open Source (for most Unix platforms)
GCG successor (compatible with GCG file format)
More than 150 programs (ver. 2.9.0)
Easy to install locally, but no interface; requires local databases
Unix command-line only

Interfaces:
Jemboss, wEMBOSS, www2gcg, w2h… (with account)
Pise, EMBOSS-GUI, SRSWWW (no account)
Staden, Kaptain, CoLiMate, Jemboss (local)

Access: www.emboss.org or emboss.sourceforge.net
Other important sites

ExPASy - Expert Protein Analysis System: www.expasy.org
EBI - European Bioinformatics Institute: www.ebi.ac.uk
NCBI - National Center for Biotechnology Information: www.ncbi.nlm.nih.gov
Sanger - The Sanger Institute: www.sanger.ac.uk
Bioinformatics: definition

Every application of computer science to biology


Sequence analysis, image analysis, sample management, population modelling, …
Analysis of data coming from large-scale biological
projects

Genomes, transcriptomes, proteomes, metabolomes, etc…
The new biology

Traditional biology



Small team working on a specialized topic
Well defined experiment to answer precise questions
New « high-throughput » biology


Large international teams using cutting edge technology
defining the project
Results are given raw to the scientific community without
any underlying hypothesis
Example of « high-throughput »









Complete genome sequencing
Large-scale sampling of the transcriptome (EST)
Simultaneous expression analysis of thousands of genes (DNA microarrays, SAGE)
Large-scale sampling of the proteome
Large-scale protein-protein interaction analysis by two-hybrid screening (yeast, worm)
Large-scale 3D structure production (yeast)
Metabolism modelling
Simulations
Biodiversity
Role of bioinformatics


Control and management of the data
Analysis of primary data e.g.






Base calling from chromatograms
Mass spectra analysis
DNA microarray image analysis
Statistics
Database storage and access
Results analysis in a biological context
First information: a sequence ?

Nucleotide



RNA (or cDNA)
Genomic (intron-exon)
Complete or incomplete?



mRNA with 5’ and 3’ UTR regions
Entire chromosome
Protein




Pre/Pro or functional protein?
Function prediction
Post-translational modifications?
Holy Grail: 3D structure?
Genomes in numbers

Sizes:





virus: 10^3 to 10^5 nt
bacteria: 10^5 to 10^7 nt
yeast: 1.35 x 10^7 nt
mammals: 10^8 to 10^10 nt
plants: 10^10 to 10^11 nt

Gene number:





virus: 3 to 100
bacteria: ~ 1000
yeast: ~ 7000
mammals: ~ 30’000
Plants: 30’000-50’000?
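
As a rough worked example of the figures above, dividing genome size by gene number gives the average amount of sequence per gene. The sketch below uses the yeast values from this slide and, as an assumption, 3 x 10^9 nt for a mammal (the human haploid size, which lies inside the mammalian range given). The function name is illustrative only.

```python
def nt_per_gene(genome_size_nt: int, gene_count: int) -> float:
    """Average number of nucleotides per gene (genome size / gene number)."""
    return genome_size_nt / gene_count

# Yeast values come from the slide; the mammalian genome size is an
# assumed representative value within the 10^8 to 10^10 nt range.
yeast = nt_per_gene(13_500_000, 7_000)
mammal = nt_per_gene(3_000_000_000, 30_000)

print(f"yeast:  ~{yeast:,.0f} nt per gene")
print(f"mammal: ~{mammal:,.0f} nt per gene")
```

Under these assumptions a mammalian genome carries roughly fifty times more sequence per gene than yeast, which is consistent with only a small fraction of it coding for proteins.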
Sequencing projects

« small » genomes (<10^7 nt): bacteria, viruses
Many already sequenced (industry excluded)
More than 150 microbial genomes already in the public domain
More to come! (one new every two weeks…)

« large » genomes (10^7-10^10 nt): eukaryotes
>30 finished (S. cerevisiae, S. pombe, E. cuniculi, G. theta, C. elegans, D. melanogaster, A. gambiae, P. falciparum, P. yoelii, D. rerio, F. rubripes, A. thaliana, O. sativa (2x), M. musculus, Homo sapiens, P. troglodytes, R. norvegicus, C. familiaris, G. gallus…)
Many more to come: cat, elephant, pig, cow, maize (and other plants), insects, fishes, many pathogenic parasites (Leishmania…)

EST sequencing:
Partial mRNA sequences, ~20 x 10^6 sequences in the public domain
Human genome






Size: 3 x 10^9 nt for a haploid genome
Highly repetitive sequences 25%, moderately repetitive sequences 25-30%
Size of a gene: from 900 to >2’000’000 bases (introns included)
Proportion of the genome coding for proteins: 5-7%
Number of chromosomes: 22 autosomes and 1 sex chromosome
Size of a chromosome: 5 x 10^7 to 5 x 10^8 bases
[Figure labels: centromere, telomere, exons of a gene, regulatory elements, locus control region, repetitive sequences]
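
The percentages quoted for the human genome can be turned into absolute sizes with simple arithmetic. The sketch below only restates the slide's numbers; the variable and function names are illustrative.

```python
GENOME_NT = 3_000_000_000  # haploid human genome size quoted on the slide

def share_nt(percent: int) -> int:
    """Number of nucleotides corresponding to a whole-number percentage."""
    return GENOME_NT * percent // 100

coding_low, coding_high = share_nt(5), share_nt(7)  # 5-7% protein coding
highly_repetitive = share_nt(25)                    # 25% highly repetitive
mean_chromosome = GENOME_NT / 23                    # 22 autosomes + 1 sex chromosome

print(f"coding: {coding_low:,} to {coding_high:,} nt")
print(f"highly repetitive: {highly_repetitive:,} nt")
print(f"mean chromosome size: {mean_chromosome:,.0f} nt")
```

The mean chromosome size obtained this way falls inside the 5 x 10^7 to 5 x 10^8 range given for individual chromosomes.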
How to sequence the human genome?

Consortium « international » approach:




Generate genetic maps (meiotic recombination) and pseudogenetic
maps (chromosome hybrids) for indicator sequences
Generate a physical map based on large clones (BAC or PAC)
Sequence enough large clones to cover the genome
« commercial » approach (Celera):



Generate random libraries of fixed length genomic clones (2kb and
10kb)
Sequence both ends of enough clones to obtain a 10x coverage
Use computer techniques to reconstitute the chromosomal
sequences, check with the public project physical map
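
To give a feel for what a 10x coverage means in the random (shotgun) approach, the sketch below estimates how many reads are needed and what fraction of the genome would still be missed. The read length of 500 nt is an assumption, not a figure from the slides, and the uncovered fraction uses the idealised Lander-Waterman estimate e^(-coverage), which assumes uniformly random reads and ignores repeats.

```python
import math

def reads_for_coverage(genome_nt: int, coverage: float, read_nt: int) -> int:
    """Reads needed so that total sequenced bases = coverage x genome size."""
    return math.ceil(coverage * genome_nt / read_nt)

def expected_uncovered_fraction(coverage: float) -> float:
    """Lander-Waterman estimate of the fraction of bases hit by no read."""
    return math.exp(-coverage)

# Human haploid genome at 10x coverage; the 500 nt read length is assumed.
reads = reads_for_coverage(3_000_000_000, 10, 500)
gap = expected_uncovered_fraction(10)

print(f"{reads:,} reads, expected uncovered fraction {gap:.1e}")
```

In practice the repetitive fraction of the human genome makes assembly harder than this idealised estimate suggests, which is one reason the result was checked against the public physical map.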
Interpretation of the human draft



All chromosomes considered as finished
Even a genomic sequence does not tell you where the genes are encoded. The genome is far from being « decoded »
One must combine genome and transcriptome to have a better idea
Last freeze: NCBI34, July 2003
The transcriptome



The set of all functional RNAs (tRNA, rRNA, mRNA
etc…) that can potentially be transcribed from the
genome
The documentation of the localization (cell type)
and conditions under which these RNAs are
expressed
The documentation of the biological function(s) of
each RNA species
Public draft transcriptome

Information about the expression specificity and the function of mRNAs:
« full » cDNA sequences of known function
« full » cDNA sequences (HTC), but « anonymous » (e.g. KIAA or DKFZ collections)

EST sequences:
cDNA libraries derived from many different tissues
Rapid random sequencing of the ends of all clones
ORESTES sequences

Growing set of expression data (microarrays, SAGE etc…)
Increasing evidence for multiple alternative splicing and polyadenylation
Example mapping of ESTs and mRNAs
[Figure tracks: mRNAs, ESTs, computer prediction]
The proteome



Set of proteins present in a particular cell type
under particular conditions
Set of proteins potentially expressed from the
genome
Information about the specific expression and
function of the proteins
Information on the proteome

Separation of a complex mixture of proteins:
2D PAGE (IEF + SDS PAGE)
Capillary chromatography

Individual characterisation of proteins:
Tryptic peptide signature (MS)
Sequencing by chemistry or MS/MS
All post-translational modifications (PTMs)!
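
The tryptic peptide signature relies on the fact that trypsin cleaves on the C-terminal side of lysine (K) and arginine (R), generally not when the next residue is proline (P). A minimal in-silico digest under that simplified rule might look as follows; the function name and the example sequence are hypothetical, and missed cleavages and modifications are ignored.

```python
def tryptic_digest(sequence: str) -> list[str]:
    """Split a protein sequence after K or R, unless followed by P."""
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Trailing peptide that does not end in K or R.
        peptides.append(sequence[start:])
    return peptides

# Made-up sequence chosen to show one blocked site (R followed by P).
example_peptides = tryptic_digest("AKRPGKTR")
print(example_peptides)
```

The list of peptide masses derived from such a digest is what is compared against the masses measured by MS; computing those masses is left out of this sketch.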
Three-dimensional structures

Methods to determine structures:
X-ray crystallography
NMR

Data format:
Atom coordinates (except H) in a Cartesian space

Databases:
For proteins and nucleic acids (RCSB, was PDB)
Independent databases for sugars and small organic molecules
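
To illustrate what atom coordinates in a Cartesian space look like in practice, the sketch below reads two ATOM records in the fixed-column PDB text format and computes the distance between the atoms. The two records use made-up coordinates, and the helper name is illustrative; only a few of the record's fields are extracted.

```python
import math

def parse_atom(line: str) -> dict:
    """Extract a few fixed-column fields from a PDB ATOM record."""
    return {
        "name": line[12:16].strip(),     # atom name, columns 13-16
        "residue": line[17:20].strip(),  # residue name, columns 18-20
        "chain": line[21],               # chain identifier, column 22
        "xyz": (float(line[30:38]),      # x, columns 31-38
                float(line[38:46]),      # y, columns 39-46
                float(line[46:54])),     # z, columns 47-54
    }

# Two illustrative records (made-up coordinates, in angstroms).
records = [
    "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67",
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38",
]
n_atom, ca_atom = (parse_atom(record) for record in records)
distance = math.dist(n_atom["xyz"], ca_atom["xyz"])

print(f"{n_atom['name']}-{ca_atom['name']} distance: {distance:.2f} angstroms")
```

Because the fields sit at fixed column positions, slicing is used rather than splitting on whitespace, which would fail when adjacent numbers touch.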
Visualisation of the structures

Secondary structure elements:
Alpha helices, beta sheets, other

Software:
Various representations (atoms, bonds, secondary…)
Wide choice of commercial and free software (e.g., DeepView)
Sequence information, and so what ?

How to store and organise ?


Databases (next lecture)
How to access, search, compare ?








Pairwise alignments, dot plots (Tuesday)
BLAST searches in db (Tuesday)
Patterns, PSI-BLAST, Profiles and HMMs (Wednesday)
Gene prediction (Wednesday)
EST clustering (Thursday)
Multiple Alignments (Thursday)
Protein function prediction (Friday)
User problems (Friday)
Thank you