Lecture1cont
The BIG Goal
“The greatest challenge, however, is
analytical. … Deeper biological insight
is likely to emerge from examining
datasets with scores of samples.”
Eric Lander, “array of hope” Nat. Gen.
volume 21 supplement pp 3 - 4, 1999.
Bio-informatics:
Provide methodologies for
elucidating biological knowledge
from biological data.
Central Paradigm of Bio-informatics
Genetic Information → Molecular Structure → Biochemical Function → Symptoms
Computer Science Tools are Crucial
http://www.sanger.ac.uk/PostGenomics/S_pombe/presentations/EMBOCopenhagenWebsite.pdf
• New bio-technologies create huge amounts
of data.
• It is impossible to analyze data by manual
inspection.
• Novel mathematical, statistical, algorithmic
and computational tools are necessary!
Automated Sequencing
http://cbms.st-and.ac.uk/academics/ryan/Teaching/SB&Bioinf/lecture1.htm
What is Bio-Informatics?
• A field of science in which Biology, Computer
Science and Information Technology merge into a
single discipline.
• Computers (& software tools) are used to collect,
analyze and interpret biological information at the
molecular level.
• Goal: To enable the discovery of new biological
insights and create a global perspective for
biologists.
Disciplines
• Development of new algorithms and statistical
methods to assess relationships among members
of large data sets.
• Analysis and interpretation of various types of
data.
• Development and implementation of tools to
efficiently access and manage different types
of information.
Why Use Bio-Informatics?
Explosive growth in the amount of biological information
necessitates the use of computers for cataloging and
retrieval of data (> 3 billion bp, > 30,000 genes).
• The human genome project.
• Automated sequencing.
• GenBank has over 16 billion bases
and is doubling every year!
New Types of Biological Data
• Micro arrays - gene expression.
• Multi-level maps: genetic, physical,
sequence, annotation.
• Networks of protein-protein
interactions.
• Cross-species relationships:
• Homologous genes.
• Chromosome organization.
http://www.the-scientist.com/yr2002/apr/research020415.html
Why Bio-Informatics? (cont.)
• A more global view of experimental design
(from the “one scientist = one gene/protein/disease”
paradigm to whole-organism consideration).
• Data mining: functional/structural information
is important for studying the molecular basis
of diseases, diagnostics, drug development
(personalized medicine), evolutionary patterns, etc.
Why Bio-Informatics? (cont.)
http://www.library.csi.cuny.edu/~davis/Bioinfo_326/lectures/lect14/lect_14.html
Future of Genomic Research
Principal milestones in data mining and genome analysis:
• Sanger method for sequencing, invented in 1977
(awarded the Nobel Prize in 1980).
• Polymerase chain reaction (PCR), invented in 1983
(awarded the Nobel Prize in 1993).
http://www.usgenomics.com/technology/index.shtml
The next step:
Locate all the genes
and understand their function.
This will probably take another 15-20 years!
Disease Genes Discovered
The job of biologists is changing…
One can efficiently find information
using databases and software on
the web.
Question: How likely are you
to use a free bio-informatics
library of accessible
software?
http://www.cryst.bbk.ac.uk/classlib/BBSRC_poster/potential.html
Molecular Biology Analysis
Software Tools Freely Available on the Web.
- Highlights
Broad Classification of Biological Databases
http://www.mrc-lmb.cam.ac.uk/genomes/madanm/pres/biodb.htm
NCBI
ENTREZ - PubMed
http://www3.ncbi.nlm.nih.gov/Entrez/index.html
Post-genomic terms (Oct. 2002)
Term            Google search   PubMed hits
Genome          2.1 x 10^6      76,566
Proteome        89,300          1,701
Transcriptome   9,960           229
Gene function   1.2 x 10^6      6.5 x 10^5
Metabolome      1,170           29
Glycome         138             6
From: Computational Proteomics, Mark B Gerstein, Yale U.
Proteome
http://cbms.st-and.ac.uk/academics/ryan/Teaching/SB&Bioinf/lecture1.htm
Similarity / Analogy
Examples:
If it looks like an elephant,
and smells like an elephant,
it’s an elephant.
If it walks like a duck,
and quacks like a duck,
it’s a duck.
http://cbms.st-and.ac.uk/academics/ryan/Teaching/molbiol/Bioinf_files/v3_document.htm
Similarity Search in Databanks
Find similar sequences
to a working draft.
As databanks grow,
homologies get harder to detect,
and quality is reduced.
Alignment Tools: BLAST & FASTA
(time-saving heuristics and approximations).
Pairwise alignment:
>gb|BE588357.1|BE588357 194087 BARC 5BOV Bos taurus cDNA 5'.
Length = 369
Score = 272 bits (137), Expect = 4e-71
Identities = 258/297 (86%), Gaps = 1/297 (0%)
Strand = Plus / Plus
Query: 17 aggatccaacgtcgctccagctgctcttgacgactccacagataccccgaagccatggca 76
|||||||||||||||| | ||| | ||| || ||| | |||| ||||| |||||||||
Sbjct: 1 aggatccaacgtcgctgcggctacccttaaccact-cgcagaccccccgcagccatggcc 59
Query: 77 agcaagggcttgcaggacctgaagcaacaggtggaggggaccgcccaggaagccgtgtca 136
|||||||||||||||||||||||| | || ||||||||| | ||||||||||| ||| ||
Sbjct: 60 agcaagggcttgcaggacctgaagaagcaagtggagggggcggcccaggaagcggtgaca 119
Query: 137 gcggccggagcggcagctcagcaagtggtggaccaggccacagaggcggggcagaaagcc 196
|||||||| | || | ||||||||||||||| ||||||||||| || ||||||||||||
Sbjct: 120 tcggccggaacagcggttcagcaagtggtggatcaggccacagaagcagggcagaaagcc 179
Query: 197 atggaccagctggccaagaccacccaggaaaccatcgacaagactgctaaccaggcctct 256
||||||||| | |||||||| |||||||||||||||||| ||||||||||||||||||||
Sbjct: 180 atggaccaggttgccaagactacccaggaaaccatcgaccagactgctaaccaggcctct 239
Query: 257 gacaccttctctgggattgggaaaaaattcggcctcctgaaatgacagcagggagac 313
|| || ||||| || ||||||||||| | |||||||||||||||||| ||||||||
Sbjct: 240 gagactttctcgggttttgggaaaaaacttggcctcctgaaatgacagaagggagac 296
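The "Identities" and "Gaps" figures in a report like the one above are simple column counts over the aligned rows. The following is a minimal sketch of that counting, not BLAST's own code: the function name `alignment_stats` is hypothetical, and the two strings are the first aligned block (Query 17-76, Sbjct 1-59) copied from the output above.

```python
def alignment_stats(query_row, subject_row, gap="-"):
    """Count identical columns and gap columns in two aligned rows.

    Both rows must already be aligned (same length, gaps written as '-').
    Returns (identities, gaps, alignment_length).
    """
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    identities = 0
    gaps = 0
    for q, s in zip(query_row, subject_row):
        if q == gap or s == gap:
            gaps += 1          # a column where one sequence has a gap
        elif q == s:
            identities += 1    # a column where both bases agree
    return identities, gaps, len(query_row)


# First aligned block from the report (Query 17-76 vs. Sbjct 1-59).
query = "aggatccaacgtcgctccagctgctcttgacgactccacagataccccgaagccatggca"
sbjct = "aggatccaacgtcgctgcggctacccttaaccact-cgcagaccccccgcagccatggcc"

identities, gaps, length = alignment_stats(query, sbjct)
print(f"Identities = {identities}/{length}, Gaps = {gaps}/{length}")
```

Running the same count over every block of the alignment and summing the results is how a whole-alignment figure such as 258/297 would be obtained. Finding the alignment in the first place is the hard part, which is what the BLAST and FASTA heuristics address.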
Multiple Sequence Alignment
Multiple alignment: find protein families
and functional domains.
Structure - Function
Relationships
sequence → structure → function
Protein Structure (domains)
Phylogeny
Evolution - a process in which
small changes occur within
species over time.
These changes can be monitored
today using molecular techniques.
The Tree of Life:
A classical, basic
science problem,
since Darwin’s 1859
“Origin of Species”.
Tree of Life: Searching Protein Sequence Databases
How far back can we see?
• Mammalian radiation
• Invertebrates/vertebrates
• Plants/animals
• Prokaryotes/eukaryotes
• First self-replicating systems
• Formation of the solar system
• Origin of the universe?
The Human Genome Project
(HGP)
• Write down all of human DNA on a single CD
(“completed” 2001).
• Identify all genes, their location and
function (far from completion).
Example of a Gene Localization
Bio-Tool (FISH).
FISH - Fluorescence In-Situ Hybridization.
• Fluorescently labeled probes hybridize to specific
chromosomal locations.
• Example application: low-resolution localization of a gene.
Sequencing Genes & Gene Assembly
Automated
sequencing
Gene Finding
• Only 2-3% of the human genome codes for functional genes.
• Genes are scattered among large non-coding DNA regions.
• Repeats, pseudo-genes, introns and vector contamination
make genes hard to recognize.
Gene Finding - cont.
Find special gene patterns:
• Translation start and stop sites (open reading
frames - ORF).
• Transcription
factors, promoters.
• Intron splice sites.
Etc…
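The first pattern in the list above, open reading frames, is simple enough to sketch. The code below is a deliberately naive illustration, not a real gene finder: it scans only the forward strand, uses the standard start codon ATG and stop codons TAA, TAG and TGA, and ignores introns, promoters and splice sites. The function name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for each ORF on the forward strand.

    start is the 0-based index of the ATG, end is the index just past
    the stop codon, and frame is the reading-frame offset (0, 1 or 2).
    min_codons counts the codons before the stop, including the ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # close it and keep scanning
    return orfs


example = "CCATGAAATAGGG"
for start, end, frame in find_orfs(example):
    print(frame, example[start:end])
```

In practice the reverse strand would also be scanned, and a much larger `min_codons` would be used to filter out ORFs that arise by chance.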
Micro Arrays (“DNA Chips”)
New biotechnology breakthrough: measure RNA expression
levels of thousands of genes (in one experiment).
The Idea Behind Micro Arrays
Clustering Analysis of Gene Expression Data
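The idea behind clustering expression data is that genes whose expression levels rise and fall together across experiments may be functionally related. As a rough sketch, the code below groups genes whose profiles have a high Pearson correlation. It is a simple greedy grouping chosen for brevity, not the hierarchical clustering typically used in microarray studies, and the gene names, expression values and `threshold` are made up for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles.

    Assumes neither profile is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def cluster_by_correlation(profiles, threshold=0.9):
    """Greedy grouping: each gene joins the first cluster whose first
    member correlates with it at or above the threshold, otherwise it
    starts a new cluster."""
    clusters = []
    for gene, values in profiles.items():
        for cluster in clusters:
            if pearson(values, profiles[cluster[0]]) >= threshold:
                cluster.append(gene)
                break
        else:
            clusters.append([gene])
    return clusters


# Hypothetical expression levels of four genes over four experiments.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.2, 3.9, 2.0],
}

print(cluster_by_correlation(profiles))
```

Here the two rising profiles end up in one group and the two falling profiles in another. Correlation is used rather than raw distance so that genes with the same pattern but different absolute levels are still grouped together.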
DNA chips and
personalized medicine
(leading edge,
future technologies).
Pharmaco-genomics
Use DNA information to measure and predict the
reaction to drugs.
Personalized medicine.
Faster clinical trials: selected populations.
Fewer drug side effects.
Protein and Other Arrays
Sequencing the human genome => finite problem.
Studying the proteome => endless possible variations, dynamic.
Future fields of study:
Proteins + Genomics = Proteomics
Lipids + Genomics = Lipomics
Sugars + Genomics = Glycomics
Protein array
Understanding Mechanisms of Disease
EC number
compound
Putting it all together: Bio-Informatics
• SEQUENCE ALIGNMENT
• ORTHOLOG GENES (Taxonomy)
• CODING REGIONS
• CONSERVED DOMAINS
• 3-D STRUCTURE
• SEQUENCES & LITERATURE
• SIGNAL PEPTIDE
• CELLULAR LOCATION
• GENE FAMILIES
• GENOME MAPS
• MUTATIONS & POLYMORPHISM
Putting it all together: Bio-Informatics
• SEQUENCE ALIGNMENT
• ORTHOLOG GENES (Taxonomy)
• CODING REGIONS
• 3-D STRUCTURE
• SIGNAL PEPTIDE
• GENE EXPRESSION, GENE FUNCTION, DRUG & PERSONAL THERAPY
• CELLULAR LOCATION
• GENOME MAPS
• CONSERVED DOMAINS
• GENE FAMILIES
• MUTATIONS & POLYMORPHISM