Transcript lecture1_06

Introduction to Bioinformatics
Lecturer: Dr. Yael Mandel-Gutfreund
Teaching Assistants:
Oleg Rokhlenko
Ydo Wexler
http://webcourse.cs.technion.ac.il/236523
What is Bioinformatics?
2
Course Objectives
• To introduce the bioinformatics discipline
• To make the students familiar with the major
biological questions which can be addressed
by bioinformatics tools
• To introduce the major tools used for
sequence and structure analysis and explain
in general how they work (limitations, etc.)
3
Course Structure and Requirements
1. Class Structure
Each class (except the first one) will be divided into two parts:
1. Lecture (in lecture room)
2. A Training Lab (in computer lab)*
• For the Training Lab the class will be divided into 2 groups.
• Each group will meet every second week,
starting from the second week.
• The work in the Training Labs will be in pairs.
• Lab assignments will be submitted at the end of each lab.
• Preparing yourself for the lab: a tutorial including self
home exercises and their answers will be posted on the web
a week before the lab.
2. A final home exam
4
Grading
• 30 % lab assignments
• 70% final exam
5
Literature list
• Gibas, C., Jambeck, P. Developing Bioinformatics
Computer Skills. O'Reilly, 2001.
• Lesk, A. M. Introduction to Bioinformatics. Oxford
University Press, 2002.
• Mount, D.W. Bioinformatics: Sequence and Genome
Analysis. 2nd ed., Cold Spring Harbor Laboratory
Press, 2004.
Advanced Reading
• Jones, N. C., Pevzner, P. A. An Introduction to
Bioinformatics Algorithms. MIT Press, 2004.
6
Course syllabus
7
What is Bioinformatics?
8
What is Bioinformatics?
“The field of science in which biology, computer
science, and information technology merge to
form a single discipline”
Ultimate goal: to enable the discovery of new
biological insights as well as to create a global
perspective from which unifying principles in
biology can be discerned.
9
from purely lab-based science to an information science
Bioinformatics
Bio = Informatics
10
Central Paradigm in Molecular Biology
Gene (DNA)
mRNA
Protein
21st century
Genome
Transcriptome
Proteome
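The flow from gene (DNA) to mRNA to protein can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `transcribe` and `translate` are made up for this example, the codon table is the standard genetic code, and the input is assumed to be a coding-strand sequence that starts at the first codon.

```python
# Minimal sketch of the DNA -> mRNA -> protein flow.
# Function names are illustrative; the table is the standard genetic code.

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Build the codon -> amino acid map ("*" marks a stop codon).
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_dna):
    """Return the mRNA matching a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate an mRNA from its first codon until a stop codon or the end."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    gene = "ATGGTGCATCTGACTCCTGAGGAGAAGTCT"  # start of the beta globin coding region
    mrna = transcribe(gene)
    print(mrna)
    print(translate(mrna))
```

Real transcripts also carry untranslated regions and are spliced, so this sketch covers only the coding portion.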
11
Genome
• Chromosomal DNA of an organism
• Coding and non-coding DNA
• Genome size and number of genes do not
necessarily determine organism complexity
12
Transcriptome
• Complete collection of all possible mRNAs
(including splice variants) of an organism.
• Regions of an organism’s genome that get
transcribed into messenger RNA.
• Transcriptome can be extended to include all
transcribed elements, including non-coding RNAs
used for structural and regulatory purposes.
13
Proteome
• The complete collection of proteins that can
be produced by an organism.
• Can be studied either as static (sum of all
proteins possible) or dynamic (all proteins
found at a specific time point) entity
14
From DNA to Genome
Timeline (axis: 1955 to 1985):
• Watson and Crick DNA model
• First protein sequence
• First protein structure
15
Timeline (axis: 1990 to 2000):
• First bacterial genome (Haemophilus influenzae), 1995
• Yeast genome
• First human genome draft, 2000
16
The Human Genome Project
Initiated in 1986
Completed in 2003
Project goals were to
• identify all the genes in human DNA,
• determine the sequences of the 3 billion chemical base
pairs that make up human DNA,
• store this information in databases,
• improve tools for data analysis and develop new tools
• address the ethical, legal, and social issues that may arise
from the project.
17
Human Genome Project
Timeline figure (axis: 1985, 1990, 1995, 2000). Events shown:
• USA Department of Energy announces project
• International Human Genome Organization founded
• Low-resolution linkage map published
• Celera Genomics founded
• First working drafts published
• Project successfully completed
18
The Human Genome Project
Initiated in 1986
Completed in 2003
How did we do??
• identify all the genes in human DNA ☺ ☺
• determine the sequences of the 3 billion chemical base pairs that
make up human DNA ☺ ☺ ☺
• store this information in databases ☺ ☺ ☺
• improve tools for data analysis and develop new tools ☺ ☺ ☺
• address the ethical, legal, and social issues that may arise from
the project ☺
19
What makes us human?
CHIMP GENOME
Chimpanzees are similar to humans in so many
ways: they are socially complex, sensitive and
communicative, and yet indisputably on the animal
side of the man/beast divide. Scientists have now
sequenced the genetic code of our closest living
relative, showing the striking concordances and
divergences between the two species, and perhaps
holding up a mirror to our own humanity.
20
How human
are chimps?
Perhaps not surprising!!!
Comparison between the full drafts of the human and chimp genomes
revealed that they differ only by 1.23%
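A percentage such as this is presumably derived by counting mismatches between aligned sequences. The sketch below assumes the two sequences are already aligned to equal length, with `-` marking gaps. The function name and the toy sequences are invented for illustration and are not taken from the genome comparison itself.

```python
def percent_difference(seq_a, seq_b):
    """Percent of aligned positions that differ, skipping columns with a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # gap columns are not counted as substitutions
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * mismatches / compared


if __name__ == "__main__":
    # Toy aligned sequences, not real human or chimp data.
    print(percent_difference("ACGTACGTAC", "ACGTTCGTAC"))
```

Whether gaps (insertions and deletions) are counted changes the result considerably, so the convention should always be stated alongside the number.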
21
Complete Genomes
• 1994: 0
• 1995: 1
• 2004: 234
• 2005: 303 (eukaryotes 24, bacteria 240, archaea 39)
22
What’s Next?
The “post-genomics” era
Annotation
Comparative
genomics
Structural
genomics
Functional
genomics
Goal: to understand the functional networks of a living cell
23
Open reading frames
Annotation
Functional sites
Structure, function
24
CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG
CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA
CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC
AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA
AGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAA
TAT GGA CAA TTG GTT TCT TCT CTG AAT ......
.............. TGAAAAACGTA
25
TF binding site
CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG
CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA
CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC
AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA
AGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAA
TAT GGA CAA TTG GTT TCT TCT CTG AAT .................................
Transcription
Start Site
promoter
.............. TGAAAAACGTA
ORF = Open Reading Frame
Ribosome binding site
CDS=Coding Sequence
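As a rough illustration of how open reading frames might be located in a raw sequence like the one above, here is a minimal sketch. It assumes the simplest possible definition (an ATG followed in-frame by a stop codon) and scans only the forward strand. The function name `find_orfs` and the `min_codons` parameter are invented for this example. Real gene finders also use signals such as promoters and ribosome binding sites.

```python
import re

START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start is the 0-based index of the ATG, end is exclusive and includes
    the stop codon, and min_codons counts codons before the stop.
    """
    dna = re.sub(r"\s+", "", dna).upper()  # drop spaces and line breaks
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


if __name__ == "__main__":
    toy = "CCATGAAATTTGGGTAACC"  # toy sequence, not from the slide
    for start, end, frame in find_orfs(toy):
        print(frame, start, end, toy[start:end])
```

A fuller version would also scan the reverse complement, giving six reading frames in total.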
26
Whole Genome Comparison
Concluding on regulatory
networks
Comparative
genomics
27
Chimps and Us
28
Whole Genome Comparison
Concluding on regulatory
networks
Comparative
genomics
Comparing ORFs
Identifying orthologs
Concluding on structure
and function
29
Researchers have learned a great deal about the
function of human genes by examining their
counterparts in simpler model organisms such as
the mouse.
Conservation of IGFALS (insulin-like growth factor binding protein,
acid-labile subunit) between human and mouse.
30
Genome-wide profiling of:
• mRNA levels
• Protein levels
Functional
genomics
Co-expression of genes
and/or proteins
31
Understanding the function of genes and other parts of the genome
32
Genome-wide profiling of:
• mRNA levels
• Protein levels
Functional
genomics
Co-expression of genes
and/or proteins
Identifying protein-protein
interaction
Networks of interactions
33
A network of interactions can be built
For all proteins in an organism
A large network of 8184 interactions among 4140 S. cerevisiae
proteins
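One simple way to represent such a network is as an undirected graph built from a list of interacting pairs. The sketch below uses hypothetical protein names (P1 to P4) and invented function names, and it is not the method used to build the yeast network. If every one of the 4140 proteins takes part in at least one of the 8184 distinct interactions, the average number of partners per protein is 2 × 8184 / 4140 ≈ 3.95.

```python
from collections import defaultdict


def build_network(interactions):
    """Build an undirected adjacency map from (protein_a, protein_b) pairs."""
    network = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # self-interactions are ignored in this sketch
        network[a].add(b)
        network[b].add(a)
    return network


def average_degree(network):
    """Mean number of interaction partners per protein in the network."""
    if not network:
        return 0.0
    return sum(len(partners) for partners in network.values()) / len(network)


if __name__ == "__main__":
    # Hypothetical interaction pairs, for illustration only.
    pairs = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P3", "P4")]
    net = build_network(pairs)
    print(sorted(net["P3"]))
    print(average_degree(net))
```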
34
Structural
genomics
Assign structure to all
proteins encoded in
a genome
35
Protein Structure
36
Resources and Databases
The different types of data are collected in
databases:
– Sequence databases
– Structural databases
– Databases of Experimental Results
All databases are connected
37
Database Types
Sequence databases
General: GenBank, EMBL; PIR, Swiss-Prot
Special: TF binding sites, promoters, genomes
Structure databases
General: PDB
Special: specific protein families, folds
Databases of experimental results
Co-expressed genes, protein-protein interactions, etc.
38
Sequence databases
• Gene database
• Genome database
• SNPs database
• Disease related mutation database
39
What can we learn about a Gene
40
mRNA, full length, EST
41
EST
Expressed Sequence Tags
• Partial copies of mRNA found within a
particular cell
• Can be used to identify genic regions;
splicing patterns of genes; etc
42
Different transcripts can be related to
the same gene!
43
Gene database
• Gives information on gene function
• Alternative splicing of genes
– Alternative pattern of exons included to create
gene product
• EST
44
Genome Databases
• Data organized by species
• Clones assembled into contiguous pieces
(‘contigs’) or whole chromosomes
• Information on non-coding regions
• Relativity
45
Genome Browsers
• Annotation adds value to sequence
• Easy “walk” through the genome
• Comparative genomics
46
Genome Browsers
• Ensembl Genome Browser (http://www.ensembl.org)
• UCSC Genome Browser http://genome.ucsc.edu/
• WormBase: http://www.wormbase.org/
• AceDB: http://www.acedb.org/
• Comprehensive Microbial Resource:
http://www.tigr.org/tigr-scripts/CMR2/CMRHomePage.spl
• FlyBase: http://flybase.bio.indiana.edu/
47
beta globin
48
49
RefSeq
• Set of mRNA sequences curated at NCBI
• Many experimentally validated
• Some partially validated via ESTs
• Some computationally predicted
50
51
52
53
54
55
SNP database
Single Nucleotide Polymorphisms (SNPs)
• Single base difference at a single position
between two different individuals of the same
species
• Play an important role in differentiation and
disease
56
Sickle Cell Anemia
• Caused by a single-base substitution (A to T) that places
valine instead of glutamic acid in hemoglobin
Image source: http://www.cc.nih.gov/ccc/ccnews/nov99/
57
Healthy Individual
>gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA
ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGA
GGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC
AGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATG
CTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGC
TCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGAT
CCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCA
CCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCA
CTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACT
GGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH
58
Diseased Individual
>gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA
ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGT
GGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC
AGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATG
CTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGC
TCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGAT
CCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCA
CCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCA
CTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACT
GGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPVEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH
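To see how a single base difference turns into an amino acid change, the comparison can be sketched in Python. The two short sequences below are the first ten codons of the beta globin coding region and the well-known sickle cell variant of the same stretch. They are typed in for this example rather than parsed from the records above, and the function names are invented for illustration.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def find_substitutions(ref, alt):
    """List (position, ref_base, alt_base) for two equal-length sequences."""
    if len(ref) != len(alt):
        raise ValueError("sequences must have the same length")
    return [(i, r, a) for i, (r, a) in enumerate(zip(ref, alt)) if r != a]


def codon_change(ref_cds, alt_cds, position):
    """Return (ref_codon, alt_codon, ref_aa, alt_aa) for a 0-based position.

    Both sequences are assumed to start at the first base of the start codon.
    """
    start = position - position % 3
    ref_codon = ref_cds[start:start + 3]
    alt_codon = alt_cds[start:start + 3]
    return ref_codon, alt_codon, CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]


HEALTHY = "ATGGTGCATCTGACTCCTGAGGAGAAGTCT"
SICKLE = "ATGGTGCATCTGACTCCTGTGGAGAAGTCT"

if __name__ == "__main__":
    for position, ref_base, alt_base in find_substitutions(HEALTHY, SICKLE):
        print(position, ref_base, alt_base, codon_change(HEALTHY, SICKLE, position))
```

In this excerpt the only difference is an A to T change in the seventh codon, which turns GAG (glutamic acid, E) into GTG (valine, V).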
59
Disease Databases
• Genes are involved in disease
• Many diseases are well studied
• Description of diseases and what is known
about them is stored
OMIM - Online Mendelian Inheritance in Man
60
61
Structure Databases
• 3-dimensional structures of proteins, nucleic
acids, molecular complexes etc
• 3-d data is available due to techniques such
as NMR and X-Ray crystallography
62
63
64
Databases of Experimental Results
• Data such as experimental microarray
images- expression data
• Clustering information
• Metabolic pathways, protein-protein
interaction data
65
Literature Databases
PubMed
http://www.ncbi.nlm.nih.gov/PubMed
Service of the National Library of Medicine
• MEDLINE publication database
– Over 17,000 journals
– 15 million citations since 1950
66
Putting it All Together
• Each Database contains specific
information
• Like biological systems, these databases
are interrelated
67
Figure: map of interrelated databases.
Categories shown: Protein, Disease, Assembled genomes, Motifs, Genomic data,
ESTs, Genes, SNPs, Gene expression, Structure, Pathway, Literature.
Databases shown: PIR, SWISS-PROT, LocusLink, OMIM, OMIA, GoldenPath, WormBase,
TIGR, BLOCKS, Pfam, Prosite, GenBank, DDBJ, EMBL, dbEST, RefSeq, UniGene,
AllGenes, dbSNP, PDB, MMDB, SCOP, Stanford MGDB, KEGG, NetAffx, COG,
ArrayExpress, GDB, PubMed.
68
Entrez – NCBI Engine
• Entrez is the integrated, text-based search
and retrieval system used at NCBI for the
major databases, including PubMed,
Nucleotide and Protein Sequences, Protein
Structures, Complete Genomes, Taxonomy,
and others.
http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi?itool=toolbar
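Beyond the web page, NCBI also exposes Entrez searches programmatically through its E-utilities service. That service is not covered in this lecture, so treat the following as a side note: a minimal sketch that only builds an ESearch query URL and sends nothing over the network. The helper name `esearch_url` is invented for this example, while `db`, `term` and `retmax` are ESearch parameters.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for one Entrez database (no request is sent)."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # The same text query can be pointed at different Entrez databases.
    print(esearch_url("pubmed", "sickle cell anemia"))
    print(esearch_url("protein", "beta globin", retmax=5))
```

Opening such a URL in a browser returns a list of matching record identifiers, which mirrors what the Entrez search page does across the linked databases.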
69
Entrez – NCBI Engine
70
• General Bioinformatic Webpages
– USA National Center for Biotechnology
Information: www.ncbi.nlm.nih.gov
– European Bioinformatics Institute:
www.ebi.ac.uk
– ExPASy Molecular Biology Server:
www.expasy.org
– Israeli National Node: inn.org.il
http://www.agr.kuleuven.ac.be/vakken/i287/bioinformatica.htm
71