Transcript PPT
Computational Biology
Algorithmic Techniques &
Medical Applications
CSE 590YA
August 15, 2001
Outline
Overview
Biology
Technology
Algorithms & Applications
Low tech: String algorithms
High tech: Class discovery/prediction
Treatments & clinical outcomes
Conclusions
2
Overview
Human Genome Project
Why is it important?
Sequence functionality
Prevention & treatment of disease
Where is there computation in it?
Lab hardware/software
Analysis: assembly, element discovery
Could not accomplish w/o computers
3
Bigger Picture
Biology of the (not so) past
Isolated
Low level (one X at a time)
Slow accumulation of knowledge
Biology of the present
Global
High level (organismal/theoretical)
Rapid accumulation of knowledge
Rapid generation of open questions
4
Example: S. cerevisiae (yeast)
Yeast: before expression arrays
Model organism for experiments
Easy to grow, modify, and study
Genetics similar to higher organisms
Yeast: after expression arrays
Immensely more useful
Now know most gene functions
New results every month that used to take five
years
Results are directly applicable to higher organisms
5
A good beginning …
The genome is not the end
Code to be deciphered
Human road map
Greater need for computational tools and
power
Example: dbSNP
Data exists
Need help finding and relating it all
6
Computers – not just for analysis
Role reversal
Before: Biologists generate data,
computers analyze it
Now: Computers generate experiments,
biologists perform them
Cycle
New future for CMBists
Biotech has greatest opportunity for real
science to be done, and CS is crucial!
7
CB is good for CS
Old research revisited and applied
Clustering
Expired in the 70s, reborn 3 years ago
New papers reacceptance as research topic
Data mining, web statistics, e-commerce
Machine learning
Well-studied over the past couple decades
New needs in CB new research on tuning
8
Outline
Overview
Biology
Technology
Algorithms & Applications
Low tech: String algorithms
High tech: Class discovery/prediction
Treatments & clinical outcomes
Conclusions
9
Biochemistry 101
Cells
Basic building blocks of life *
Proteins
Key to functionality
Catalyze reactions *
Store and release energy
Build cells and cell components
Process-specific, yet resource-efficient
10
The genetics of proteins
DNA
Four-base alphabet *
Genes are instructions for building proteins
Cell cycle *
Extensive regulatory mechanism
Construct proteins at right time and place
Break down proteins and reuse components
Incredibly complex series of steps
11
Transcription & translation
DNA RNA
RNA protein
Transcription factors *
RNA polymerase
Translation at ribosome *
Amino acid chains
Protein degradation
12
Outline
Overview
Biology
Technology
Algorithms & Applications
Low tech: String algorithms
High tech: Class discovery/prediction
Treatments & clinical outcomes
Conclusions
13
Technology
DNA microarrays
Consensus RNAs adhered to slide
Test and control cDNAs produced *
Fluorescently labeled
Hybridized with RNAs on slide
Scan fluorescence with computer
Results: how much RNA present! *
What does this signify?
14
Example uses
Timepoints in the cell cycle
Which genes are always “on”?
Which genes are responsible for certain
events in the cycle?
Differential expression in experiment
Which genes are responsible for a
particular cell response?
What is the response pattern over time?
15
Outline
Overview
Biology
Technology
Algorithms & Applications
Low tech: String algorithms
High tech: Class discovery/prediction
Treatments & clinical outcomes
Conclusions
16
“Low tech” algorithms
90s: DNA is just a bunch of strings
Questions became answerable!
Are there gross similarities in the genome?
What do they imply?
Are there smaller recurring elements in the
genome? What is their function?
I know what Gene A does? Can I use that
to figure out what Gene B does?
17
String and sequence matching
String matching
Sequence matching
Find exact replicas of DNA sequence elsewhere in
the genome
Are they statistically unlikely?
Regions of DNA that look similar: allows for
evolution
Also applied to proteins
In reality, sequences are more important
18
Computer tools
Biological questions could be answered
better by a computer than by a biologist
GenBank, FASTA, BLAST, GAP
Not trivial developments, even for CS
Required novel approaches to NP-hard
problems
Web proliferation (ongoing)
www.cs.jhu.edu/~salzberg/appendixa.html
19
High tech: expression arrays
Use active gene data to classify a cell
Example: Cancer type prediction
Subtypes appear very similar histologically
Very different clinical courses
Diagnoses: biologists’ insight rather than
systematic/unbiased approaches
20
Classifying cancer
ALL vs. AML
Two kinds of leukemia (only recently separated)
Must be treated very differently
Distinguishable in clinic, but not 100% reliable
Golub (1999)
Goal: Determine cancer type by overall gene
expression; build an automated classifier
By-product: One of earliest quantitative uses of
DNA microarrays
21
Strategy
Get expression data for 6800 genes from 27
ALL and 11 AML patients
Clustering: Find genes with expression
levels that are strongly correlated with the
ALL-AML class distinction
Give each such gene a weighted predictive
vote for its class
Let important genes vote on test cases
22
Determining correlation w/ class
Idealized expression patterns
Neighborhood analysis *
Correlation metric
Significance
Euclidean distance, regression, TNOM
Q: Is gene more highly correlated with IEP than would
be expected by chance?
A: Examine correlation w/ random IEP permutations
Results: 1100 genes more highly correlated with
ALL-AML class distinction than expected by chance
23
Making a class predictor
Subset of informative genes will elect the
class of a new sample
Each casts weighted vote for its class: *
Expression level of gene in test sample
Original correlation of gene w/ class distinction
Prediction strength (PS)
Margin of victory after all genes vote
If less than threshold, then uncertain
24
Validation of the model (a)
Initial data set: cross-validation
For each patient sample:
Build a classifier without it (i.e. w/ 37 others)
Predict class of left-out sample
Calculate cumulative error rate
Results
Used top 50 genes
36/38 samples classified correctly, 2 uncertain
25
Validation of the model (b)
Independent data set: test validation
34 samples from diverse tissues
29/34 “strong” predictions; 100% accuracy
PS values quite high for both
.77 in cross-validation; .73 in independent
Mean PS lower for samples from one
particular laboratory: importance of
standardization in clinical setting
26
Further results of clinical importance
10 200 voting gene set had same accuracy
Voter gene function: not just lineage markers
Surface receptors, anti-apoptotic agents, cell cycle
regulators, DNA manipulators, known oncogenes
These genes provide insight into cancer causes
New biological knowledge as a result of
computational methods!
Other applications of CP & feature selection
Response to chemotherapy
Eventual outcome of disease
27
Other array-based classifiers (a)
k-means clustering
Select “high-scoring” features like before
Pick k points as initial cluster centroids
Add each new data point to nearest cluster
Move that cluster centroid to new mean
Use these centroids to classify test cases
28
Other array-based classifiers (b)
Support Vector Machines
Goal: find a plane that separates data points
If not separable
Boost the data points into a higher dimensional space
using some well-behaved kernel function
Try to find a separating hyperplane there
Key benefits of SVM version
Kernel avoids explicit representation of higher-dim space
Finding the maximum margin separating classes avoids
overfitting
29
Class discovery
What if we don’t know how many
clusters we want?
The discovery of finer-grained subtypes of
cancer has been arduous and slow
How can microarrays help here?
Golub (1999) again …
Automatic class discovery based solely on
gene expression
30
Self-organizing maps (SOMs)
Very much like k-means clustering
Results for 27 ALL/11 AML data set
However, we don’t know the discriminating
features in advance
Cluster based on all gene expression levels
Class A: 24/25 samples were ALL
Class B: 10/13 samples were AML
Quite effective, but not perfect
31
SOMs (cont’d)
How can we evaluate the “learned” clusters w/o
knowing the true classes?
Test by class prediction – accuracy should be high
if classes reflect true structure
Results
Predictors w/ variety of genes did well in cross-validation
Exception: the one AML in class A was often predicted to
be in class B
This suggests an iterative method for class
discovery: discover, predict, refine
32
Independent model validation
Cannot assess “accuracy” on test data
Instead, assess prediction strength
High PS indicates that structure in initial data is
also present in test data
Results
Median PS=.61, 74% of samples above threshold
Compared w/ random clusters, PS’s were highly
statistically significant
We have discovered ALL-AML distinction!
Even lower-level distinctions also discovered
33
Other CS w/ expression arrays
Regulatory element detection
Correlate expression data with frequency of DNA
motifs
Taxing even for fastest processors today
Discovery of regulatory pathways
Treat expression arrays over time as a graph
Establish a Bayesian network model for regulatory
pathways over the array graph structure
Infer network parameters pathway structure
34
Problems with DNA arrays
Different companies, different types
Even within one company
Much time spent on normalization
Different products over time
Different binding efficiencies
Even then, different groups’ results are hard to
compare
Biggest worry: RNA levels in cells do not
accurately reflect current protein content
Perhaps limits our discovery potential
35
Proteonomics
If protein is most important, why not study it
directly?
Much work is done on proteins already
But difficult to purify, prepare, quantify
Results are very coarse
Emerging technologies
More efficient protein purification and protein
arrays are being developed!
Lots of discoveries to come
36
Outline
Overview
Biology
Technology
Algorithms & Applications
Low tech: String algorithms
High tech: Class discovery/prediction
Treatments & clinical outcomes
Conclusions
37
Looking to the future
Biology is becoming a more theoretical,
unified science
The problem w/ biology has always been that
there are too many layers
Work has always been somewhere in the middle
Now research is beginning to focus on processes
and pathways and networks in general
This is the proper path to developing theories
Along the way …
Lots of hard computational problems to be solved!
38