Interpreting Microarray Expression Data Using Text

Interpreting Microarray Expression Data
Using Text Annotating the Genes
Michael Molla, Peter Andreae, Jeremy Glasner, Frederick Blattner, Jude Shavlik
University of Wisconsin – Madison
The Basic Task
Given
Microarray Expression Data &
Text Annotations of Genes
Generate
Model of Expression
Motivation
• Lots of Data Available on the Internet
– Microarray Expression Data
– Text Annotations of Genes
• Maybe we can Make the Scientist’s Job
Easier
– Generate a Model of Expression Automatically
– Easier First Step for the Human
Microarray Expression Data
• Each spot represents a gene in E. coli
• Colors Indicate Up- or Down-Regulation
Under Antibiotic Shock
• For our Purposes, 3 Classes
– Up-Regulated
– Down-Regulated
– No-Change
Microarray Expression Data
From “Genome-Wide Expression in Escherichia coli K-12”, Blattner et al., 1999
Our Microarray Experiment
• 4290 genes
• 574 up-regulated
• 333 down-regulated
• 2747 un-regulated
• 636 not enough signal
Text Annotations of Genes
• The text from a sample SwissProt entry
(b1382)
– The “description” field
HYPOTHETICAL 6.8 KDA PROTEIN IN
LDHA-FEAR INTERGENIC REGION
– The “keyword” field
HYPOTHETICAL PROTEIN
Sample Rules From a Model for
Up-Regulation
• IF
– The annotation contains FLAGELLAR AND
does NOT contain HYPOTHETICAL
OR
– The annotation contains BIOSYNTHESIS
• THEN
– The gene is up-regulated
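To make the rule format concrete, here is a minimal sketch of how the two sample rules could be applied to an annotation. The whitespace tokenization and the function names are assumptions made for illustration, not part of the original system.

```python
def annotation_words(annotation):
    """Split an annotation into a set of upper-case words (assumed tokenization)."""
    return set(annotation.upper().split())


def predicts_up_regulated(annotation):
    """Apply the two sample rules; the gene is predicted up-regulated if either fires."""
    words = annotation_words(annotation)
    rule_1 = "FLAGELLAR" in words and "HYPOTHETICAL" not in words
    rule_2 = "BIOSYNTHESIS" in words
    return rule_1 or rule_2


# Description and keyword fields of the sample SwissProt entry (b1382).
b1382 = ("HYPOTHETICAL 6.8 KDA PROTEIN IN LDHA-FEAR INTERGENIC REGION "
         "HYPOTHETICAL PROTEIN")
print(predicts_up_regulated(b1382))
```

Neither rule fires for b1382, so this model would not predict it to be up-regulated.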
Why use Machine Learning?
• Concerned with machines learning from
available data
• Informed by text data, the learner can make a
first-pass model for the scientist
Desired Properties of a Model
• Accurate
– Measure with cross validation
• Comprehensible
– Measure with model size
• Stable to Small Changes in the Data
– Measure with random subsampling
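As an illustration of the accuracy measurement, the following is a rough sketch of k-fold cross validation. The number of folds, the `train` callable (which must return a prediction function), and all names are assumptions, since the slides do not give these details.

```python
import random


def cross_validated_error(examples, labels, train, k=10, seed=0):
    """Estimate the test-set error rate with k-fold cross validation.

    train(xs, ys) is a hypothetical learner that returns a predict(x) function.
    """
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    errors = 0
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in indices if i not in held_out]
        predict = train([examples[i] for i in train_idx],
                        [labels[i] for i in train_idx])
        # Count mistakes only on the held-out fold.
        errors += sum(predict(examples[i]) != labels[i] for i in fold)
    return errors / len(examples)
```

Every example is held out exactly once, so the returned value is the fraction of examples misclassified by a model that never saw them during training.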
Approaches
• Naïve Bayes
– Statistical method
– Uses all of the words (present or absent)
• PFOIL
– Covering algorithm
– Chooses words to use one at a time
Naïve Bayes
For each word wi, there are two likelihood ratios (lr):
lr (wi present) = p(wi present | up) / p(wi present | down)
lr (wi absent) = p(wi absent | up) / p(wi absent | down)
For each annotation, the lrs are combined to form a lr for a gene:
lr (gene) = ∏ over all words wi of lr (wi X)
where X is either present or absent.
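The likelihood-ratio calculation can be sketched as follows. This assumes the per-word ratios are combined by multiplication, as is standard for Naïve Bayes, and it adds Laplace smoothing (`alpha`) to avoid division by zero; the smoothing and all function names are assumptions rather than details taken from the slides.

```python
from math import prod


def likelihood_ratios(up_docs, down_docs, vocabulary, alpha=1.0):
    """Return {word: (lr_present, lr_absent)} from two lists of word sets."""
    ratios = {}
    for w in vocabulary:
        # Smoothed estimates of p(w present | class); alpha is an assumption.
        p_up = (sum(w in d for d in up_docs) + alpha) / (len(up_docs) + 2 * alpha)
        p_down = (sum(w in d for d in down_docs) + alpha) / (len(down_docs) + 2 * alpha)
        ratios[w] = (p_up / p_down, (1 - p_up) / (1 - p_down))
    return ratios


def gene_likelihood_ratio(words, ratios):
    """Multiply the 'present' or 'absent' ratio of every vocabulary word."""
    return prod(lr_p if w in words else lr_a
                for w, (lr_p, lr_a) in ratios.items())
```

A combined ratio above 1 favors up-regulation and a ratio below 1 favors down-regulation. Because every vocabulary word contributes a factor whether it is present or absent, the model uses all of the words, which is what makes it harder to read than a short rule set.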
PFOIL
• Learn rules from data
• Produces multiple if-then rules from data
• Builds rules by adding one word at a time
• Easy to interpret models
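The covering idea can be illustrated with a generic greedy rule learner. This is a sketch in the spirit of the bullets above and not the PFOIL implementation: it scores candidate words by the precision of the resulting rule, an assumption chosen for brevity, where FOIL-style learners typically use an information-gain heuristic. All names and limits (`max_literals`, `max_rules`) are hypothetical.

```python
def matches(rule, doc):
    """A rule is a list of (word, must_be_present) tests that must all hold."""
    return all((w in doc) == present for w, present in rule)


def learn_rule(pos, neg, vocabulary, max_literals=3):
    """Grow one rule, adding the best single-word test at each step."""
    rule = []
    while neg and len(rule) < max_literals:
        best = None
        for w in sorted(vocabulary):
            for present in (True, False):
                if (w, present) in rule:
                    continue
                p = sum((w in d) == present for d in pos)
                n = sum((w in d) == present for d in neg)
                if p == 0:
                    continue  # a useful test must keep some positives
                score = (p / (p + n), p)  # precision, then coverage (assumed)
                if best is None or score > best[0]:
                    best = (score, (w, present))
        if best is None:
            break
        word, present = best[1]
        rule.append((word, present))
        pos = [d for d in pos if (word in d) == present]
        neg = [d for d in neg if (word in d) == present]
    return rule


def learn_rule_set(pos, neg, vocabulary, max_rules=5):
    """Covering loop: learn a rule, drop the positives it covers, repeat."""
    rules, remaining = [], list(pos)
    while remaining and len(rules) < max_rules:
        rule = learn_rule(remaining, neg, vocabulary)
        if not rule:
            break
        rules.append(rule)
        remaining = [d for d in remaining if not matches(rule, d)]
    return rules
```

Here `pos` and `neg` are lists of word sets for the target class and the other class. Each learned rule is a short conjunction of word tests, and the rule set is their disjunction, which matches the IF ... OR ... THEN form of the sample rules.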
Accuracy/Comprehensibility Tradeoff
[Figure: Testset Error Rate (0% to 50%) versus Number of Words in Model (100 down to 0) for Baseline, PFOIL, and Naïve Bayes]
Stabilized PFOIL
• Repeatedly run PFOIL on randomly
sampled subsets
• For each word, count the number of models
it appears in
• Restrict PFOIL to only those words that
appear in a minimum of m models
• Rerun PFOIL with only those words
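The four steps above can be sketched as a wrapper around any rule learner. The `learner` callable, the number of runs, the subsample fraction, and the choice to do the final rerun on the full training data are all assumptions; the slides do not specify them.

```python
import random
from collections import Counter


def stabilized_learn(learner, pos, neg, vocabulary, m,
                     runs=50, fraction=0.8, seed=0):
    """Restrict a rule learner to words that recur across subsampled runs.

    learner(pos, neg, vocabulary) is a hypothetical function returning a list
    of rules, each rule being a list of (word, must_be_present) pairs.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(runs):
        # Step 1: run the learner on a random subset of each class.
        sub_pos = rng.sample(pos, max(1, int(fraction * len(pos))))
        sub_neg = rng.sample(neg, max(1, int(fraction * len(neg))))
        rules = learner(sub_pos, sub_neg, vocabulary)
        # Step 2: count each word once per model it appears in.
        counts.update({w for rule in rules for w, _ in rule})
    # Step 3: keep only words that appear in at least m models.
    stable_words = {w for w, c in counts.items() if c >= m}
    # Step 4: rerun the learner restricted to those words.
    return learner(pos, neg, stable_words), stable_words
```

Raising `m` shrinks the allowed vocabulary, which trades some accuracy for models that change less when the data changes.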
Stability Measure
After running the algorithm N times to
generate N rule sets:
stability = ( Σ over wi in U of count(wi) ) / ( N × |U| )
Where:
U = the set of words appearing in any rule set
count(wi) = number of rule sets containing word wi
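A short sketch of the measure, assuming it is the average over the words in U of the fraction of rule sets that contain each word, that is, the sum of count(wi) over U divided by N × |U|. This normalization is an assumption based on the definitions of U and count(wi), and the function name is hypothetical.

```python
def stability(rule_set_words):
    """rule_set_words: list of N sets, each holding the words used by one rule set."""
    n = len(rule_set_words)
    universe = set().union(*rule_set_words)  # U: words appearing in any rule set
    if n == 0 or not universe:
        return 0.0
    # Sum of count(w) over U, where count(w) = number of rule sets containing w.
    total = sum(sum(w in s for s in rule_set_words) for w in universe)
    return total / (n * len(universe))


print(stability([{"A", "B"}, {"A"}]))  # A appears in 2 of 2 sets, B in 1 of 2
```

Under this definition the value is 1.0 when every run uses exactly the same words and approaches 1/N when no word is shared between runs.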
Accuracy/Stability Tradeoff
[Figure: Stabilized PFOIL Error Rate (left axis, Testset Error Rate, 0% to 50%) and Stabilized PFOIL Stability (right axis, Stability, 0.0 to 1.0) versus Value of m (0 to 50), alongside Unstabilized PFOIL Stability]
Discussion
• Not very severe tradeoffs in Accuracy
– vs. stability
– vs. comprehensibility
• PFOIL not as good at characterizing data
– suggests not many dependencies
– need for “softer” rules
Future Directions
• M of N rules
• Permutation Test
• More Sources of Text Data
Take-Home Message
• This is just a first step toward an aid for
understanding expression data
• Make expression models based on text
instead of DNA sequence.
Acknowledgements
• This research was funded by the following
grants:
NLM 1 R01 LM07050-01,
NSF IRI-9502990,
NIH 2 P30 CA14520-29, and
NIH 5 T32 GM08349.