Transcript PPT

Computational Biology
Algorithmic Techniques &
Medical Applications
CSE 590YA
August 15, 2001
Outline




Overview
Biology
Technology
Algorithms & Applications




Low tech: String algorithms
High tech: Class discovery/prediction
Treatments & clinical outcomes
Conclusions
2
Overview

Human Genome Project


Why is it important?

Sequence  functionality

Prevention & treatment of disease
Where is there computation in it?



Lab hardware/software
Analysis: assembly, element discovery
Could not accomplish w/o computers
3
Bigger Picture

Biology of the (not so) past




Isolated
Low level (one X at a time)
Slow accumulation of knowledge
Biology of the present




Global
High level (organismal/theoretical)
Rapid accumulation of knowledge
Rapid generation of open questions
4
Example: S. cerevisiae (yeast)

Yeast: before expression arrays




Model organism for experiments
Easy to grow, modify, and study
Genetics similar to higher organisms
Yeast: after expression arrays




Immensely more useful
Now know most gene functions
New results every month that used to take five
years
Results are directly applicable to higher organisms
5
A good beginning …

The genome is not the end




Code to be deciphered
Human road map
Greater need for computational tools and
power
Example: dbSNP


Data exists
Need help finding and relating it all
6
Computers – not just for analysis

Role reversal




Before: Biologists generate data,
computers analyze it
Now: Computers generate experiments,
biologists perform them
Cycle
New future for CMBists

Biotech has greatest opportunity for real
science to be done, and CS is crucial!
7
CB is good for CS

Old research revisited and applied

Clustering




Expired in the 70s, reborn 3 years ago
New papers  reacceptance as research topic
Data mining, web statistics, e-commerce
Machine learning


Well-studied over the past couple decades
New needs in CB  new research on tuning
8
Outline




Overview
Biology
Technology
Algorithms & Applications




Low tech: String algorithms
High tech: Class discovery/prediction
Treatments & clinical outcomes
Conclusions
9
Biochemistry 101

Cells


Basic building blocks of life *
Proteins


Key to functionality
Catalyze reactions *



Store and release energy
Build cells and cell components
Process-specific, yet resource-efficient
10
The genetics of proteins

DNA



Four-base alphabet *
Genes are instructions for building proteins
Cell cycle *




Extensive regulatory mechanism
Construct proteins at right time and place
Break down proteins and reuse components
Incredibly complex series of steps
11
Transcription & translation

DNA  RNA



RNA  protein



Transcription factors *
RNA polymerase
Translation at ribosome *
Amino acid chains
Protein degradation
12
Outline




Overview
Biology
Technology
Algorithms & Applications




Low tech: String algorithms
High tech: Class discovery/prediction
Treatments & clinical outcomes
Conclusions
13
Technology

DNA microarrays


Consensus RNAs adhered to slide
Test and control cDNAs produced *




Fluorescently labeled
Hybridized with RNAs on slide
Scan fluorescence with computer
Results: how much RNA present! *

What does this signify?
14
Example uses

Timepoints in the cell cycle



Which genes are always “on”?
Which genes are responsible for certain
events in the cycle?
Differential expression in experiment


Which genes are responsible for a
particular cell response?
What is the response pattern over time?
15
Outline




Overview
Biology
Technology
Algorithms & Applications




Low tech: String algorithms
High tech: Class discovery/prediction
Treatments & clinical outcomes
Conclusions
16
“Low tech” algorithms


90s: DNA is just a bunch of strings
Questions became answerable!



Are there gross similarities in the genome?
What do they imply?
Are there smaller recurring elements in the
genome? What is their function?
I know what Gene A does? Can I use that
to figure out what Gene B does?
17
String and sequence matching

String matching



Sequence matching



Find exact replicas of DNA sequence elsewhere in
the genome
Are they statistically unlikely?
Regions of DNA that look similar: allows for
evolution
Also applied to proteins
In reality, sequences are more important
18
Computer tools


Biological questions could be answered
better by a computer than by a biologist
GenBank, FASTA, BLAST, GAP



Not trivial developments, even for CS
Required novel approaches to NP-hard
problems
Web proliferation (ongoing)

www.cs.jhu.edu/~salzberg/appendixa.html
19
High tech: expression arrays


Use active gene data to classify a cell
Example: Cancer type prediction



Subtypes appear very similar histologically
Very different clinical courses
Diagnoses: biologists’ insight rather than
systematic/unbiased approaches
20
Classifying cancer

ALL vs. AML




Two kinds of leukemia (only recently separated)
Must be treated very differently
Distinguishable in clinic, but not 100% reliable
Golub (1999)


Goal: Determine cancer type by overall gene
expression; build an automated classifier
By-product: One of earliest quantitative uses of
DNA microarrays
21
Strategy




Get expression data for 6800 genes from 27
ALL and 11 AML patients
Clustering: Find genes with expression
levels that are strongly correlated with the
ALL-AML class distinction
Give each such gene a weighted predictive
vote for its class
Let important genes vote on test cases
22
Determining correlation w/ class


Idealized expression patterns
Neighborhood analysis *

Correlation metric


Significance



Euclidean distance, regression, TNOM
Q: Is gene more highly correlated with IEP than would
be expected by chance?
A: Examine correlation w/ random IEP permutations
Results: 1100 genes more highly correlated with
ALL-AML class distinction than expected by chance
23
Making a class predictor


Subset of informative genes will elect the
class of a new sample
Each casts weighted vote for its class: *



Expression level of gene in test sample
Original correlation of gene w/ class distinction
Prediction strength (PS)


Margin of victory after all genes vote
If less than threshold, then uncertain
24
Validation of the model (a)

Initial data set: cross-validation

For each patient sample:




Build a classifier without it (i.e. w/ 37 others)
Predict class of left-out sample
Calculate cumulative error rate
Results


Used top 50 genes
36/38 samples classified correctly, 2 uncertain
25
Validation of the model (b)

Independent data set: test validation



34 samples from diverse tissues
29/34 “strong” predictions; 100% accuracy
PS values quite high for both


.77 in cross-validation; .73 in independent
Mean PS lower for samples from one
particular laboratory: importance of
standardization in clinical setting
26
Further results of clinical importance


10  200 voting gene set had same accuracy
Voter gene function: not just lineage markers




Surface receptors, anti-apoptotic agents, cell cycle
regulators, DNA manipulators, known oncogenes
These genes provide insight into cancer causes
New biological knowledge as a result of
computational methods!
Other applications of CP & feature selection


Response to chemotherapy
Eventual outcome of disease
27
Other array-based classifiers (a)

k-means clustering

Select “high-scoring” features like before

Pick k points as initial cluster centroids
Add each new data point to nearest cluster
Move that cluster centroid to new mean

Use these centroids to classify test cases


28
Other array-based classifiers (b)

Support Vector Machines


Goal: find a plane that separates data points
If not separable



Boost the data points into a higher dimensional space
using some well-behaved kernel function
Try to find a separating hyperplane there
Key benefits of SVM version


Kernel avoids explicit representation of higher-dim space
Finding the maximum margin separating classes avoids
overfitting
29
Class discovery

What if we don’t know how many
clusters we want?



The discovery of finer-grained subtypes of
cancer has been arduous and slow
How can microarrays help here?
Golub (1999) again …

Automatic class discovery based solely on
gene expression
30
Self-organizing maps (SOMs)

Very much like k-means clustering



Results for 27 ALL/11 AML data set



However, we don’t know the discriminating
features in advance
Cluster based on all gene expression levels
Class A: 24/25 samples were ALL
Class B: 10/13 samples were AML
Quite effective, but not perfect
31
SOMs (cont’d)

How can we evaluate the “learned” clusters w/o
knowing the true classes?
 Test by class prediction – accuracy should be high
if classes reflect true structure
 Results



Predictors w/ variety of genes did well in cross-validation
Exception: the one AML in class A was often predicted to
be in class B
This suggests an iterative method for class
discovery: discover, predict, refine
32
Independent model validation


Cannot assess “accuracy” on test data
Instead, assess prediction strength


High PS indicates that structure in initial data is
also present in test data
Results




Median PS=.61, 74% of samples above threshold
Compared w/ random clusters, PS’s were highly
statistically significant
We have discovered ALL-AML distinction!
Even lower-level distinctions also discovered
33
Other CS w/ expression arrays

Regulatory element detection



Correlate expression data with frequency of DNA
motifs
Taxing even for fastest processors today
Discovery of regulatory pathways



Treat expression arrays over time as a graph
Establish a Bayesian network model for regulatory
pathways over the array graph structure
Infer network parameters  pathway structure
34
Problems with DNA arrays


Different companies, different types
Even within one company



Much time spent on normalization


Different products over time
Different binding efficiencies
Even then, different groups’ results are hard to
compare
Biggest worry: RNA levels in cells do not
accurately reflect current protein content

Perhaps limits our discovery potential
35
Proteonomics

If protein is most important, why not study it
directly?




Much work is done on proteins already
But difficult to purify, prepare, quantify
Results are very coarse
Emerging technologies


More efficient protein purification and protein
arrays are being developed!
Lots of discoveries to come
36
Outline




Overview
Biology
Technology
Algorithms & Applications




Low tech: String algorithms
High tech: Class discovery/prediction
Treatments & clinical outcomes
Conclusions
37
Looking to the future

Biology is becoming a more theoretical,
unified science





The problem w/ biology has always been that
there are too many layers
Work has always been somewhere in the middle
Now research is beginning to focus on processes
and pathways and networks in general
This is the proper path to developing theories
Along the way …

Lots of hard computational problems to be solved!
38