
CIS 595 Bioinformatics
Lecture 2
Introduction to Bioinformatics
A number of slides taken/modified from:
Russ B. Altman (http://www.smi.stanford.edu/projects/helix/bmi214/)
Patrik Medstrand (www.cmb.lu.se/devbiol/bioinfo/old/download/intro2003/databases_handouts.pdf)
Mark Gerstein (http://bioinfo.mbb.yale.edu/mbb452a/2002/sequences2002.pdf)
What is Bioinformatics?
• Every application of computer science to
biology
– Sequence analysis, image analysis, sample
management, population modeling, …
• Analysis of data coming from large-scale
biological projects
– Genomes, transcriptomes, proteomes,
metabolomes, etc…
The New Biology
• Traditional biology
– Small team working on a specialized topic
– Well defined experiment to answer precise
questions
• New “high-throughput” biology
– Large international teams using cutting edge
technology defining the project
– Results are given raw to the scientific
community without any underlying hypothesis
Examples of “High-Throughput”
• Complete genome sequencing
• Simultaneous expression analysis of thousands of
genes (DNA microarrays, SAGE)
• Large-scale sampling of the proteome
• Protein-protein interaction analysis by large-scale
two-hybrid screens (yeast, worm)
• Large-scale 3D structure production (yeast)
• Metabolism modeling
• Biodiversity
Role of Bioinformatics
• Control and management of the data
• Sequence, Structure and Function analysis
• Analysis of primary data, e.g.
– Mass spectra analysis
– DNA microarray image analysis
• Statistics
• Database storage and access
• Interpreting results in a biological context
Sequence, Structure and Function Analysis
To gain insight into how genes and gene products
(proteins) function, perform:
• SEQUENCE ANALYSIS: Analyze DNA and protein
sequences, searching for clues about structure, function,
and control.
• STRUCTURE ANALYSIS: Analyze biological structures,
searching for clues about sequence, function and control.
• FUNCTION ANALYSIS: Understand how the sequences
and structures lead to the functions.
Evolution and Bioinformatics
1. Common descent of organisms implies that they
will share many “basic technologies.”
2. Development of new phenotypes in response to
environmental pressure can lead to “specialized
technologies.”
3. More recent divergence implies more shared
technologies between species.
4. All of biology is about two things: understanding
shared features and understanding unshared features.
Biology is Fundamentally Information Science
Where is the information?
• DNA Sequences
– GenBank release 128 (2/02) contains 17,089,143,893
bases in 1,546,532 sequences
• Protein Sequences
– PIR or Swiss-Prot (as of 3/02): 106,736 sequences,
39,242,287 total amino acids
• Protein 3D Structures
– Protein Data Bank (PDB), as of March 2002: 17,679
coordinate entries; 15,855 proteins, 1,060 nucleic acids,
746 protein/nucleic acid complexes, 18 carbohydrates
Biology is Fundamentally Information Science
Where is the information?
• Online access to DNA microarray data
– http://smd.stanford.edu/; 10,000 to 40,000 genes per
chip; Each set of experiments involves 3 to 100
“conditions”
• Medical literature online
– Online database of published literature since 1966
(Medline, accessed through the PubMed resource): 4,600 journals,
11,000,000+ articles (most with abstracts)
• ETC…
Topics
• Sequence Alignment; Sequence Motifs; Gene Finding
• Computing with Biological Structures
• Phylogenetic Algorithms
• Microarray Data Analysis
• Genetic Networks
• Comparative Genomics
• Proteomics
• Biological Ontologies; Biological Text Mining
Sequence Alignment
• What is sequence alignment?
– Given two sequences and a scoring scheme, find the
optimal pairing of letters.
RKVA--GMAKPNM
RKIAVAAASKPAV
• Why align sequences?
– A few sequences have known structure and function;
many more have unknown properties.
– If one sequence has known structure/function, then
aligning it to another yields insight about the other.
– Similarity may be used as evidence of homology, but
does not necessarily imply homology
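The definition above (two sequences plus a scoring scheme, find the optimal pairing) can be sketched with global dynamic programming in the style of Needleman-Wunsch. This is a minimal illustration, not the method used to produce the slide's example: the match, mismatch and gap values are assumed toy scores, not a real substitution matrix such as PAM250.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment with toy scores (assumed values)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # pair a[i-1] with b[j-1]
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                row_a.append(a[i - 1]); row_b.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


best, top, bottom = needleman_wunsch("ACGT", "AGT")
print(best)
print(top)
print(bottom)
```

Swapping the match/mismatch rule for a lookup in a substitution matrix is the usual next step for protein sequences.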
Sequence Alignment
Types of alignment:
– Local vs. global;
– Pairwise vs. multiple
d1dhfa_  LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI
d8dfr__  LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI
d4dfra_  ISLIAALAVDRVIGMENAMPWN-LPADLAWFKRNTL--------NKPVIMGRHTWESI
d3dfr__  TAFLWAQDRDGLIGKDGHLPWH-LPDDLHYFRAQTV--------GKIMVVGRRTYESF
Sequence Alignment
How do we measure alignment quality?
– Define a scoring matrix (e.g. PAM250)
Sequence Alignment
Alignment algorithms:
• dot matrix
• dynamic programming
– FASTA
– BLAST
– PSI-BLAST
– Clustal
Similarity strength:
• Percent identity
• E-value (statistical measure)
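Percent identity can be computed directly from two aligned rows. The sketch below uses the pairwise example alignment from the earlier slide and assumes one particular convention: identity is counted over columns where neither row has a gap. Other denominators (full alignment length, shorter sequence length) are also in use, so the number depends on that choice. No E-value is computed here, since that requires a statistical model of the database search.

```python
def percent_identity(row_a, row_b):
    """Identity over ungapped columns of two aligned rows (assumed convention)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    # Keep only columns where both rows contain a residue.
    columns = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not columns:
        return 0, 0, 0.0
    matches = sum(1 for x, y in columns if x == y)
    return matches, len(columns), 100.0 * matches / len(columns)


matches, compared, pct = percent_identity("RKVA--GMAKPNM", "RKIAVAAASKPAV")
print(matches, compared, round(pct, 1))
```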
Sequence Motifs
• A subsequence of biological importance
that occurs in multiple sequences.
– Protein motifs often result from
structural features
– DNA motifs often provide
signals for protein binding or
nucleic acid folding
Sequence Motifs
• PROSITE database: a collection of
1135 different motifs
– A manually created collection of regular
expressions associated with different protein
families/functions.
– Globin sequence signature (PDOC00933):
F-[LF]-x(5)-G-[PA]-x(4)-G-[KRA]-x-[LIVM]-x(3)-H
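Because PROSITE patterns are regular expressions in a compact notation, the globin signature above can be translated into a standard regex and used to scan sequences. The converter below is a sketch that handles only the elements appearing in common patterns (`x`, `x(n)`, `x(n,m)`, `[...]`, `{...}`); the function name and the test sequence are made up for illustration.

```python
import re

GLOBIN_SIGNATURE = "F-[LF]-x(5)-G-[PA]-x(4)-G-[KRA]-x-[LIVM]-x(3)-H"


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regex string."""
    pieces = []
    for element in pattern.split("-"):
        # Split an element such as "x(5)" into its core "x" and repeat "5".
        m = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if m is None:
            raise ValueError("unsupported pattern element: " + element)
        core, repeat = m.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"    # any residue except these
        pieces.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(pieces)


regex = prosite_to_regex(GLOBIN_SIGNATURE)
print(regex)

# Hypothetical sequence built to contain the motif; not a real globin.
toy_sequence = "MMFLAAAAAGPAAAAGKALAAAHMM"
hit = re.search(regex, toy_sequence)
print(hit.start() if hit else None)
```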
Gene Finding
• Problem: Identify the genes within raw genomic DNA
sequence
• Input: Raw DNA sequence
• Output: Location of gene elements in the raw sequence
(including exons, introns, other sequence annotations)
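As a deliberately simplified illustration of the input/output contract above, the sketch below scans the forward strand for open reading frames (ATG to an in-frame stop codon) and reports their coordinates. This is an assumed toy approach: it ignores introns, the reverse strand and all statistical signals, so it is not how real gene finders locate exons in eukaryotic DNA.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) pairs, end exclusive, for forward-strand ORFs.

    min_codons counts the codons before the stop codon, ATG included.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close it and keep scanning
    return orfs


toy_dna = "CCATGAAATTTTAGGG"   # made-up input
for begin, end in find_orfs(toy_dna):
    print(begin, end, toy_dna[begin:end])
```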
Computing with Biological Structures
• General Issues
– How do we represent structure for computation?
– How do we compare structures?
– How can we summarize structural families?
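One common answer to the first two questions, offered here as an assumption rather than as the lecture's own choice, is to represent a structure as a list of 3D atom coordinates and to compare two structures by root-mean-square deviation (RMSD). The sketch assumes the two coordinate lists are already superposed and matched atom by atom; finding the best superposition is a separate step that is not shown.

```python
from math import sqrt


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) tuples.

    Assumes the structures are already superposed and atoms are paired by index.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return sqrt(total / len(coords_a))


# Hypothetical coordinates: the second set is the first shifted by 1 along z.
first = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
second = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(first, second))
```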
Computing with Biological Structures
Applications:
• Structure alignment
• Build fold library
[Figure: alignment of individual structures (Hb, Mb), fused into a single fold “template”]
Computing with Biological Structures
Why align structures:
– Provides the “gold standard” for
sequence alignment
– For nonhomologous proteins,
identify common substructures of
interest
– Classify proteins into clusters, based
on structural similarity (SCOP)
Computing with Biological Structures
Applications:
• Predicting RNA secondary structure (the MFOLD program,
http://www.bioinfo.rpi.edu/applications/mfold/old/rna/)
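MFOLD predicts structure by minimizing free energy. As a much simpler stand-in that shows the same dynamic-programming idea, the sketch below uses Nussinov-style base-pair maximization: it counts the largest number of nested base pairs a sequence can form. The allowed pairs (including G-U wobble) and the minimum loop length of 3 are assumptions of this sketch, and the result is not an energy-based prediction.

```python
ALLOWED_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"),
                 ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs (Nussinov-style recursion)."""
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within seq[i..j], inclusive
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]                  # case: position j unpaired
            for k in range(i, j - min_loop):        # case: j pairs with k
                if (seq[k], seq[j]) in ALLOWED_PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAAUCC"))
```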
Computing with Biological Structures
Protein secondary structure prediction
Sequence:  RPDFCLEPPYTGPCKARIIRYFYNAKAGLVQTFVYGGCRAKRNNFKSAEDAMRTCGGA
Structure: CCGGGGCCCCCCCCCCCEEEEEEETTTTEEEEEEECCCCCTTTTBTTHHHHHHHHHCC
Phylogenetic Algorithms
Why build evolutionary trees?
• Understand the lineage of different
species.
• Have an organizing principle for sorting
species into a taxonomy
• Understand how various functions
evolved.
• Understand forces and constraints on
evolution.
• Perform multiple alignment.
Phylogenetic Algorithms
Multiple Alignment and Trees
• Progressive alignment methods do multiple
alignment and evolutionary tree construction at the
same time.
• Sequence alignment provides scores which can be
interpreted as inversely related to distances in
evolution.
• Distances can be used to build trees.
• Trees can be used to give multiple alignments via
common parents.
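To illustrate the point that distances can be used to build trees, here is a sketch of UPGMA, one simple distance-based clustering method. The slides do not name a specific algorithm, so this choice is an assumption, and the distance values below are made up. The tree is returned as nested tuples without branch lengths.

```python
from itertools import combinations


def upgma(names, distances):
    """Build a tree (nested tuples) from pairwise distances with UPGMA.

    distances maps a pair of names, e.g. ("A", "B"), to a number.
    """
    size = {name: 1 for name in names}           # cluster -> number of leaves
    d = {frozenset(pair): value for pair, value in distances.items()}
    while len(size) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        for other in size:
            if other not in (a, b):
                # Size-weighted average distance to the new cluster.
                d[frozenset((merged, other))] = (
                    size[a] * d[frozenset((a, other))]
                    + size[b] * d[frozenset((b, other))]
                ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        del size[a], size[b]
    return next(iter(size))


# Hypothetical distances: A and B are close, C is distant from both.
toy_distances = {("A", "B"): 2.0, ("A", "C"): 6.0, ("B", "C"): 6.0}
print(upgma(["A", "B", "C"], toy_distances))
```

A progressive aligner would then align sequences in the order the tree joins them, which is the "common parents" idea in the last bullet.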
Microarray Data Analysis
Experimental Protocol
Microarray Data Analysis
What are expression arrays good for?
– Follow a population of (synchronized) cells over time, to
see how expression changes (vs. baseline).
– Expose cells to different external stimuli and measure
their response (vs. baseline).
– Take cancer cells (or cells with another pathology) and
compare them to normal cells.
– (Also some non-expression uses, such as assessing
presence/absence of sequences in the genome)
Microarray Data Analysis
Preprocessing (flowchart steps):
• Data input
• Background correction
• Cy5/Cy3 normalization
• Spot quality
• Artifactual regions
• Merging replicate experiments
• Score differential hybridization
• Duplicate spot variability
• Replicate experiment variability
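The background correction and Cy5/Cy3 normalization steps can be sketched as follows. This example assumes one common recipe: subtract a background level from each channel, take the log2 ratio per spot, then shift all ratios so their median is zero. The slide does not say which normalization was used, and the intensities here are made up.

```python
import math
from statistics import median


def normalized_log_ratios(cy5, cy3, background5=0.0, background3=0.0):
    """Background-correct both channels and median-center log2(Cy5/Cy3).

    Spots whose corrected intensity is not positive are returned as None.
    """
    ratios = []
    for red, green in zip(cy5, cy3):
        red -= background5
        green -= background3
        if red > 0 and green > 0:
            ratios.append(math.log2(red / green))
        else:
            ratios.append(None)                 # unusable spot
    center = median(r for r in ratios if r is not None)
    return [None if r is None else r - center for r in ratios]


# Hypothetical spot intensities for three genes.
print(normalized_log_ratios([200.0, 400.0, 800.0], [100.0, 100.0, 100.0]))
```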
Microarray Data Analysis
Convert microarray images to data
Microarray Data Analysis
Clustering:
– If two genes are expressed in the same
way, they may be functionally related.
– If a gene has unknown function, but
clusters with genes of known function,
this is a way to assign its general function.
– We may be able to look at high resolution
measurements of expression and figure
out which genes control which other
genes.
– E.g. peak in cluster 1 always precedes
peak in cluster 2 => cluster 1 turns cluster
2 on?
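The idea that genes "expressed in the same way" belong together can be sketched by measuring the Pearson correlation between expression profiles and grouping genes whose correlation exceeds a threshold. The grouping rule (a gene joins any cluster containing a sufficiently correlated member), the threshold of 0.9 and the profiles are all assumptions for illustration; real analyses typically use hierarchical clustering, k-means or similar methods.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlation_clusters(profiles, threshold=0.9):
    """Group genes linked by correlation >= threshold (single linkage)."""
    clusters = []
    for gene in profiles:
        linked = [c for c in clusters
                  if any(pearson(profiles[gene], profiles[g]) >= threshold
                         for g in c)]
        merged = {gene}.union(*linked)          # fuse every linked cluster
        clusters = [c for c in clusters if c not in linked] + [merged]
    return clusters


# Hypothetical expression profiles over four time points.
toy_profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.0, 4.0, 2.0],
}
print(correlation_clusters(toy_profiles))
```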
Microarray Data Analysis
Classification:
• Uses known groups of interest
(from other sources) to
– learn the features associated with
these groups in the primary data,
– create rules for associating the data
with the groups of interest.
• Often called “supervised machine
learning.”
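A minimal sketch of supervised classification, assuming a nearest-centroid rule: learn the average expression profile of each known group, then assign a new sample to the group whose average is closest. The classifier choice, the labels and the numbers are illustrative assumptions, not the method the lecture prescribes.

```python
def train_centroids(samples, labels):
    """Learn one mean profile (centroid) per known group."""
    sums, counts = {}, {}
    for profile, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(profile))
        for i, value in enumerate(profile):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}


def classify(centroids, profile):
    """Assign the profile to the group with the nearest centroid."""
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(profile, center))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Hypothetical two-gene expression profiles with known groups.
training = [[1.0, 1.0], [2.0, 2.0], [9.0, 9.0], [10.0, 10.0]]
groups = ["normal", "normal", "tumor", "tumor"]
model = train_centroids(training, groups)
print(classify(model, [1.5, 2.0]))
print(classify(model, [8.0, 9.0]))
```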
Genetic Networks
What is a genetic network?
– Individual genes have a function (e.g. transforming a
substance or binding to a substance)
– Sets of functions, when chained in sequence, can produce
pathways (e.g. output of one transformation is the input
to another)
– Sets of pathways, as they interact with other pathways,
create a genetic network of interactions.
Genetic Networks
Reconstructing Genetic Regulatory Networks:
– Hard problem.
– Given N genes, the number of possible sets of
connections between the genes grows exponentially with N.
– Relationships are generally not simply +/- but are continuous valued.
– Must use knowledge about
expected function and membership
in pathways to prune the list of
possible network interactions.
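To make the size of the search space concrete, the sketch below counts candidate networks under a simplifying assumption: each ordered pair of distinct genes either has a regulatory connection or does not. That gives N(N-1) possible connections and 2^(N(N-1)) possible networks. Continuous-valued relationships, as noted above, make the real space larger still.

```python
def count_possible_networks(n_genes):
    """Number of directed on/off networks over n_genes, no self-loops.

    Assumes each ordered gene pair is simply connected or not connected.
    """
    possible_edges = n_genes * (n_genes - 1)
    return 2 ** possible_edges


for n in (2, 3, 5, 10):
    print(n, count_possible_networks(n))
```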
Comparative Genomics
• Large scale comparison of genomes to
– understand the biology of individual genomes
– extract general principles applying to groups of
genomes.
• Assumption:
– many biological sequences, structures, and
functions are shared across organisms,
– the signal from these organisms can be
increased by combining them in analyses.
Comparative Genomics
Important issues for Comparative Genomics
– Aligning very large sequences
– Comparative approaches to gene finding
– Comparative approaches to assigning function
– Comparative approaches to identifying key
regulatory regions
Comparative Genomics
Example: Assigning protein functions
Proteomics
• What is PROTEOMICS?
– -OMICS has become the suffix to denote the
study of the entire set of something
– Genomics: study of all genes
– Proteomics: study of all proteins
– Transcriptomics: study of all mRNA transcripts
– Metabolomics: study of all metabolites in a cell
Proteomics
Proteomics questions
– Which proteins are made from the genome?
– What is their 3D structure?
– Where are they?
– What do they do?
– Which other proteins do they interact with?
– Are they modified in the cell post-translationally?
Proteomics
Key proteomic technologies
– 3D structure determination (X-ray/NMR)
– 2D Gels to assess all the proteins in a cell.
– Mass spectrometry to identify proteins, protein
modifications.
– Yeast-Two-Hybrid systems to assess protein-protein
interactions
– Protein Arrays to assess all proteins in a cell using
antibodies or other recognition technology.
Biomedical Ontologies
• In order to communicate effectively we need:
– common language
– basic knowledge
• Example:
– Metabolic Pathways:
• language: names of products, enzymes, substrates and
pathways
• knowledge: what is a reaction, how do enzymes and substrates
participate, what are the legal components of a pathway
Biomedical Ontologies
Gene Ontology
(http://www.geneontology.org/)
• Used to classify gene function.
• A controlled vocabulary with three
categories of annotation:
– Molecular Function
– Biological Process
– Cellular Component
Biological Text Mining
• Literature in Biomedicine
• Much literature generated quickly.
– 11 million citations in MEDLINE.
– 400,000 added yearly.
• Need methods to deal with data.
– Query
– Summarize
– Organize
– Understand
Long term challenges
• Computational model of physiology.
– Can we give a medication to a computer before we give it to a
human?
• Design of new compounds for medical and industrial use.
– Can we design a protein or nucleic acid to have a specified
function?
• Engineering new biological pathways.
– Can we devise methods for designing and implementing new
metabolic capabilities for treating disease?
• Data mining for new knowledge.
– Can we ask computer programs to examine data (in the context of
our models) and create new knowledge?