Robots and Automatic
Genome Annotation



Ross D. King
Department of Computer Science
University of Wales, Aberystwyth
Talk Plan

Data Mining based gene function prediction

The Robot Scientist

Automating annotation and experimentation
Data Mining Prediction





We have developed a method for predicting the
functional class of gene products based on data
mining.
The idea is to learn a reliable predictive function on
the examples of genes with products of known
function.
Then apply this function to genes where the
functional class is unknown.
Applied to: E. coli, M. tuberculosis, S. cerevisiae, A.
thaliana.
We call this approach: Data Mining Prediction (DMP).
Classification schemes
(MIPS/GO)
Hierarchy of classes
1,0,0,0 "METABOLISM"
1,1,0,0 "amino acid metabolism"
1,1,1,0 "amino acid biosynthesis"
1,1,4,0 "regulation of amino acid metabolism"
1,1,7,0 "amino acid transport"
1,1,10,0 "amino acid degradation (catabolism)"
1,1,99,0 "other amino acid metabolism activities"
1,2,0,0 "nitrogen and sulfur metabolism"
1,3,0,0 "nucleotide metabolism"
1,4,0,0 "phosphate metabolism"
1,5,0,0 "C-compound and carbohydrate metabolism"
1,6,0,0 "lipid, fatty-acid and isoprenoid metabolism"
1,7,0,0 "metabolism of vitamins, cofactors, and prosthetic groups"
1,20,0,0 "secondary metabolism"
... and ORFs may have multiple functions too!
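The class codes above can be handled as paths in a tree. The following is a minimal sketch, assuming the four-field, zero-padded codes shown on this slide; the function names are hypothetical and not part of DMP.

```python
from typing import Iterable, List, Set, Tuple


def parse_class(code: str) -> Tuple[int, ...]:
    """Turn '1,1,7,0' into (1, 1, 7) by dropping the trailing zero padding."""
    parts = [int(p) for p in code.split(",")]
    while parts and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def level(code: str) -> int:
    """Depth of a class in the hierarchy (1 = most general)."""
    return len(parse_class(code))


def ancestors(code: str) -> List[Tuple[int, ...]]:
    """All proper ancestors of a class, most general first."""
    cls = parse_class(code)
    return [cls[:i] for i in range(1, len(cls))]


def labels_at_level(codes: Iterable[str], lvl: int) -> Set[Tuple[int, ...]]:
    """Project an ORF's (possibly multiple) classes onto one hierarchy level."""
    return {parse_class(c)[:lvl] for c in codes if level(c) >= lvl}
```

Because an ORF may carry several classes, `labels_at_level` returns a set rather than a single label.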
Sequence Data
field             description                                     type
aa_rat_X          % of amino acid X in the protein                real
seq_len           length of the protein sequence                  int
aa_rat_pair_X_Y   % of the amino acids X and Y consecutively      real
mol_wt            molecular weight of the protein                 int
theo_pI           theoretical pI (isoelectric point)              real
atomic_comp_X     atomic composition of X (C,H,N,O,S)             real
aliphatic_index   aliphatic index                                 real
hydro             grand average of hydropathy                     real
strand            the DNA strand                                  'w' or 'c'
position          the number of exons (no. of start positions)    int
cai               codon adaptation index                          real
motifs            number of PROSITE motifs                        int
tmSpans           number of transmembrane spans                   int
chromosome        chromosome number                               1..16,mit
478 attributes in total
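A few of these attributes can be computed directly from the protein sequence. The sketch below covers only seq_len, aa_rat_X, and aa_rat_pair_X_Y. The function name is hypothetical, and normalising pair percentages by the number of adjacent pairs is an assumption, not something stated on the slide.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_attributes(seq: str) -> dict:
    """Compute seq_len, aa_rat_X and aa_rat_pair_X_Y for one protein sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    attrs = {"seq_len": n}
    counts = Counter(seq)
    for aa in AMINO_ACIDS:
        # percentage of residues that are amino acid `aa`
        attrs[f"aa_rat_{aa}"] = 100.0 * counts[aa] / n
    # adjacent residue pairs; only pairs that occur are stored (others are 0)
    pairs = Counter(seq[i:i + 2] for i in range(n - 1))
    for pair, count in pairs.items():
        # assumption: normalise by the number of adjacent pairs (n - 1)
        attrs[f"aa_rat_pair_{pair[0]}_{pair[1]}"] = 100.0 * count / (n - 1)
    return attrs
```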
Homology data
YAL001C: mvltiypdelvqivsdkiasnkgkitlnqlwdisgkyfdlsdk....
PSI-BLAST search against the sequence database (NRDB):

gene     organism           score
tfc      baker's yeast      0.0
sfc3     fission yeast      1.0e-18
wsv442   white spot virus   2.1
cg9463   fruit fly          2.9
f1l3     Arabidopsis        3.0

We look up the associated information from SwissProt, e.g.
sfc3: keyword(membrane), length(358), dbref(prosite), dbref(embl)
Predicted Secondary Structure
Data
mvltiypdelvqivsdkiasnkgkitlnqlwdisgkyfdlsdkkvk...
cbbbbccaaaaaaaaaaaacccccbbbbaaaaaacccbbccccccb...
We record length and relative
positions of the secondary
structure elements.
This is relational data.
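Recording the length and position of each element amounts to run-length encoding the predicted structure string. This is a minimal sketch, assuming the a/b/c alphabet shown above (alpha, beta, coil); the function names are hypothetical.

```python
from itertools import groupby


def structure_elements(pred: str) -> list:
    """Run-length encode a prediction string into (type, start, length) tuples."""
    elements = []
    pos = 0
    for kind, run in groupby(pred):
        length = len(list(run))
        elements.append((kind, pos, length))
        pos += length
    return elements


def followed_by(elements: list, first: str, second: str) -> list:
    """Adjacent element pairs where a `first` element is followed by a `second`."""
    return [
        (a, b)
        for a, b in zip(elements, elements[1:])
        if a[0] == first and b[0] == second
    ]
```

Queries such as "coil followed by alpha of a given length" can then be expressed over the adjacent pairs, which is what makes the data relational rather than a fixed-width attribute vector.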
Expression Data
• Microarray experiments to measure
expression changes in yeast under a variety
of conditions, including cell cycle, heat shock,
diauxic shift.
• Short time series data, numerical-valued

ORF        a0      a7      a14     a21     ...
YBR166C    0.33    -0.17   0.04    -0.07
YOR357C    -0.64   -0.38   0.32    -0.29
YLR292C    -0.23   0.19    0.36    0.14
YGL112C    -0.69   -0.89   -0.74   -0.56

Spellman et al (1998), Roth et al (1998), DeRisi et al (1997),
Eisen et al (1998), Gasch et al (2000, 2001), Chu et al (1998)
Phenotype Data
• Data from knockout gene growth experiments
• Many missing data
• Data taken from 3 sources (TRIPLES, MIPS, EUROFAN)

              growth medium
deleted ORF   calcofluor white   sorbitol   benomyl   H2O2   ...
YAL001C       w                  n          n         w
YAL019W       n                  s          w         w
YAL021C       n                  n          n         n
YAL029C       n                  w          w         r

s = sensitive (less growth)
w = wild-type (no observable effect)
r = resistant (more growth)
n = no data
What are the
Machine Learning Issues?
• Large volume of data
• Missing data
• Accurate results required
• Intelligible results required
• Class hierarchy
• Multiple labels
• Relational data
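Data Mining Prediction (DMP)
[Flow diagram: the entire database is split into data for rule
creation (2/3) and test data (1/3). The data for rule creation is
split again into training data (2/3) and validation data (1/3).
Rule generation (PolyFARM, C4.5) on the training data gives all
rules; the validation data is used to select the best rules; the
test data is used to measure rule accuracy.]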
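The partition used by DMP can be sketched as two successive random splits. This is an illustration only: the function names, the use of a seeded shuffle, and the rounding of split sizes are assumptions, not details taken from the slides.

```python
import random


def split(items, fraction, rng):
    """Shuffle a copy of `items` and cut it into (first part, remainder)."""
    items = list(items)
    rng.shuffle(items)
    cut = round(len(items) * fraction)
    return items[:cut], items[cut:]


def dmp_partition(orfs, seed=0):
    """Return (training, validation, test) sets of ORF identifiers."""
    rng = random.Random(seed)
    # 2/3 of the database for rule creation, 1/3 held out as test data
    rule_creation, test = split(orfs, 2 / 3, rng)
    # rule-creation data: 2/3 for training, 1/3 for validating (selecting) rules
    training, validation = split(rule_creation, 2 / 3, rng)
    return training, validation, test
```

Keeping the test third untouched until the end is what lets the measured rule accuracy serve as an estimate for ORFs of unknown function.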
Results
Application to Bacterial Genomes

Successful for both M. tuberculosis and E. coli.

Of the ORFs with no assigned function >40% were
predicted to have a function at one or more levels of
the class hierarchy.

It was found that many of the predictive rules were
more general than possible using sequence
homology.
References
King et al. (2000) KDD 2000
King et al. (2000) Yeast (Comparative and Functional Genomics)
King et al. (2001) Bioinformatics
Summary Results (Bacteria)

Using voting (2 or more rules agree on a prediction)
– Level 2: 128 ORFs predicted - 87.5% accuracy
– Level 3: 23 ORFs predicted - 91.3% accuracy

All predictions
– Level 2: 335 ORFs predicted - 64.5% accuracy
– Level 3: 204 ORFs predicted - 44.6% accuracy
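The voting criterion (two or more rules agreeing on a prediction) can be sketched as below. The function name, the tuple layout, and the ORF and rule identifiers used in the example are hypothetical.

```python
from collections import Counter, defaultdict


def vote(rule_predictions, min_votes=2):
    """Keep, per ORF, the classes predicted by at least `min_votes` distinct rules.

    rule_predictions: iterable of (orf, rule_id, predicted_class) tuples.
    """
    votes = defaultdict(Counter)
    for orf, rule_id, cls in set(rule_predictions):  # de-duplicate repeated firings
        votes[orf][cls] += 1
    agreed = {}
    for orf, counts in votes.items():
        classes = {cls for cls, n in counts.items() if n >= min_votes}
        if classes:
            agreed[orf] = classes
    return agreed


# Hypothetical example: three rules fire for orfA, one for orfB.
example = [
    ("orfA", 42, "Transport/binding proteins"),
    ("orfA", 56, "Transport/binding proteins"),
    ("orfA", 8, "Cell envelope"),
    ("orfB", 15, "Degradation of small molecules"),
]
voted = vote(example)
```

Voting trades coverage for accuracy, which matches the figures above: fewer ORFs receive a prediction, but those predictions are more often correct.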
Example Rule (level 2 E. coli)
If the ORF is not predicted to have a β-strand of length  3 and
a homologous protein from class Chytridiomycetes was found
Then its functional class is “Cell processes, Transport/binding
proteins”
12/13 (86%) correct on Test Set - probability of this
result occurring by chance is estimated at 4×10⁻⁷.
24 ORFs of unknown function are predicted by the rule.
16 ORFs now with putative or
confirmed function - 93.8% accurate
predictions
Experimental Confirmation

The original bacterial ORF predictions were made
over three years ago.

In the intervening time many more ORFs have been
sequenced, making traditional homology-based prediction
methods more accurate and sensitive, and the
functions of some ORFs have been determined by wet
biology.

The E. coli genome has recently been re-annotated
by Monica Riley’s group.
“Wet” Biology Confirmation

A number of predictions have been confirmed or
falsified by new “wet” experimental data.

This new data is biased towards hard classes.
Despite this the results are still good:
– Level 2: 23 predictions - 47.8% accuracy
– Level 3: 23 predictions - 43.4% accuracy
This is very much better than random as there
are many classes.
Confirmation of “Wet” Predictions
ORF    Rule  Predicted Class                 Confirmed Function                                                  Result
b0805  8     Cell envelope                   Outer membrane protein                                              C
b1519  15    Degradation of small molecules  Trans-aconitate methyltransferase                                   C
b1533  43    Transport/binding proteins      Cysteine pathway metabolite transport                               C
b1981  42    Transport/binding proteins      Shikimate and dehydroshikimate transport protein                    C
b1981  56    Transport/binding proteins      Shikimate and dehydroshikimate transport protein                    C
b2210  15    Degradation of small molecules  Malate:quinone oxidoreductase                                       C
b2392  43a   Transport/binding proteins      High-affinity manganese transporter                                 C
b2392  43b   Transport/binding proteins      High-affinity manganese transporter                                 C
b2392  54    Transport/binding proteins      High-affinity manganese transporter                                 C
b2924  45    Transport/binding proteins      Component of the MscS mechanosensitive channel – “new gene family”  C
b3839  43    Transport/binding proteins      Essential component of translocase                                  C
b0103  42    Transport/binding proteins      dephospho-CoA kinase                                                W
b0103  41    Transport/binding proteins      dephospho-CoA kinase                                                W
b0103  43    Transport/binding proteins      dephospho-CoA kinase                                                W
b1822  15    Degradation of small molecules  23S rRNA m1G745 methyltransferase                                   W
b2530  35    Global regulatory functions     cysteine desulfurase                                                W
b2392  14    Degradation of small molecules  High-affinity manganese transporter                                 W
b2889  50    Energy metabolism carbon        Isopentenyl diphosphate isomerase                                   W
b3222  54    Transport/binding proteins      ManNAc kinase                                                       W
b3223  39    Ribosome constituents           ManNAc epimerase                                                    W
b3337  28    Laterally acquired elements     regulatory or redox component                                       W
b3338  39    Ribosome constituents           Periplasmic endochitinase                                           W
b3569  32    Laterally acquired elements     transcriptional regulator of xylose utilization                     W
b3955  8     Cell envelope                   Required for invasion of brain microvascular endothelial cells      EF
b3955  18    Energy metabolism carbon        Required for invasion of brain microvascular endothelial cells      EA
b3955  20    Energy metabolism carbon        Required for invasion of brain microvascular endothelial cells      EA
Results (Yeast)


Many rules from each data type
Rules at each level of hierarchy

Some classes are much easier to predict than others
(for example "protein synthesis" at 71-93%, "energy"
at 20-47%)

Good levels of accuracy on held out test data

Many predictions for ORFs of unknown function
(some function at some level is predicted for 96% of
the ORFs of unknown function)

Some rules explainable by biology -> scientific
knowledge discovery
Clare & King (2003) Bioinformatics suppl. 2., 42-49
Accuracy Table
            Level
Datatype    1     2     3     4     all
Seq         55    55    33    0     71
Struc       49    43    0     0     58
Hom         65    38    69    20    55
Expr        42    37    35    0     75
Phen        75    40    7     0     68
Extension to Arabidopsis Genome





Collaborative project with the Institute of Grassland
and Environmental Research and the University of
Nottingham.
Large increase in data: 6,000 -> 25,000 ORFs. Large
amount of micro-array data from the Nottingham
Arabidopsis stock centre.
250 million Prolog facts, 200,000 attributes, file sizes
almost 2 GB.
7,964 gene function predictions with an expected
accuracy >70%, 2,974 with an expected accuracy
>90%.
We are currently growing 14 knockout varieties of
Arabidopsis to test a sample of these predictions
Availability
All predictions available at http://www.genepredictions.org
All rules and data available at
http://www.aber.ac.uk/compsci/Research/bio/dss/
The Robot Scientist
The Robot Scientist Concept
The robot scientist project aims to develop a computer system
that is capable of originating its own experiments, physically
doing them, interpreting the results, and then repeating the cycle.
[Diagram: the Robot Scientist cycle. Labels: Background Knowledge;
Analysis; Machine Learning; Consistent Hypotheses; Experiment(s)
selection; Experiment(s); Robot; Results; Final Theory.]
Motivation: Technological

In many areas of science our ability to generate data
is outstripping our ability to analyse the data.

One scientific area where this is true is functional
genomics, where data is now being generated on an
industrial scale.

The analysis of scientific data needs to become as
industrialised as its generation.
The Application Domain

Functional genomics

In yeast (S. cerevisiae) ~30% of the 6,000 genes still
have no known function.

EUROFAN 2 has knocked out each of the 6,000
genes in mutant strains.

The task is to determine the “function” of the gene by
auxotrophic growth experiments comparing mutants
and wild type.
Logical Cell Model

We have built a logical model of the known metabolic
pathways (coded in Prolog) - taken from KEGG and
other bioinformatic sources. This is essentially a
directed graph: with metabolites as nodes and
enzymes as arcs.

If a path can be found from cell inputs (metabolites in
the growth medium) to all the cell outputs (essential
compounds), then the cell can grow.
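The growth criterion can be sketched as forward chaining over the reaction graph. The authors' model is written in Prolog; the Python below is only an illustrative analogue. The function names, the gene and metabolite names, and the assumption that a reaction fires only when all its substrates are available are hypothetical.

```python
def reachable(medium, reactions, knocked_out=frozenset()):
    """Metabolites obtainable from the growth medium.

    reactions: iterable of (gene, substrates, products); a reaction is usable
    when its gene is not knocked out and all its substrates are available.
    """
    available = set(medium)
    changed = True
    while changed:
        changed = False
        for gene, substrates, products in reactions:
            if gene in knocked_out:
                continue
            if set(substrates) <= available and not set(products) <= available:
                available |= set(products)
                changed = True
    return available


def can_grow(medium, reactions, essential, knocked_out=frozenset()):
    """The cell grows iff every essential compound is reachable."""
    return set(essential) <= reachable(medium, reactions, knocked_out)


# Hypothetical linear pathway m1 -> m2 -> m3 -> m4, one gene per step.
toy_reactions = [
    ("geneA", {"m1"}, {"m2"}),
    ("geneB", {"m2"}, {"m3"}),
    ("geneC", {"m3"}, {"m4"}),
]
```

In this toy pathway, knocking out `geneB` blocks growth on a medium containing only `m1`. Adding `m3` (downstream of the block) restores growth, while adding `m2` (upstream) does not.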
AAA Model System

We started using the aromatic amino-acid (AAA)
pathway in yeast as a model system to prove the
principle of the Robot Scientist.

9 metabolites can be used off the shelf
15 knockout mutants from Eurofan


The mutant can grow iff all three aromatic amino acids can be synthesised (tyrosine, phenylalanine,
tryptophan). Based on a pathway from glycerate-2-phosphate.
Phenylalanine, Tyrosine, and Tryptophan Pathways for S. cerevisiae
[Pathway diagram: metabolites (nodes, labelled with KEGG compound
IDs) and ORFs (arcs) leading from glycerate-2-phosphate (C00631)
and D-erythrose-4-phosphate (C00279) via chorismate (C00251) to
tyrosine (C00082), phenylalanine (C00079), and tryptophan (C00078),
with metabolite import from the growth medium. ORFs shown: YGR254W,
YHR174W, YMR323W, YBR249C, YDR035W, YDR127W, YGL148W, YER090W,
YKL211C, YDR007W, YPR060C, YDR354W, YBR166C, YNL316C, YHR137W,
YGL202W, YGL026C.]
Experimental Methodology

Experiments consist of making particular growth
media and testing if the mutants can grow
(add metabolites to a basic defined medium).

A mutant is auxotrophic if it cannot grow on a defined
medium that the wild type can grow on.

By observing the pattern of chemicals that recover
growth the function of the knocked out mutant can be
inferred.
Inferring Hypotheses

In the philosophy of science it has often been
argued that only humans can make the “leaps of
imagination” necessary to form hypotheses.

We use Abductive Logic Programming to infer
missing arcs/labels in our metabolic graph. With
these inferred arcs we can explain (deductively) all
the experimental results.
Reiser et al. (2001) ETAI 5, 233-244.
The Form of the Hypotheses


The form of the hypotheses we can infer is currently
quite simple. Each hypothesis binds a particular
gene to an enzyme that catalyses the reaction.
– A correct hypothesis would be that: YPR060C
codes for the enzyme for the reaction
chorismate → prephenate.
– An incorrect hypothesis would be that: it codes for
the reaction chorismate → anthranilate.
We have also demonstrated how more complex
abductive hypotheses could be formed.
A Discriminating Experiment




Hypothesis 1: YPR060C codes for the enzyme for the
reaction: chorismate → prephenate.
Hypothesis 2: YPR060C codes for the enzyme for the
reaction: chorismate → anthranilate.
These can be distinguished by growing the knockout
YPR060C on prephenate or anthranilate.
Note that these two experiments will have differing
monetary cost.
Phenylalanine, Tyrosine, and Tryptophan Pathways for S. cerevisiae
[Pathway diagram repeated.]
Inferring Experiments
Given a set of hypotheses we wish to infer an experiment
that will efficiently discriminate between them
Assume:
• Every experiment has an associated cost.
• Each hypothesis has a probability of being correct.
The task:
• To choose a series of experiments which minimise the
expected cost of eliminating all but one hypothesis.
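The task can be illustrated with a brute-force recursion over the surviving hypotheses. This is a sketch under stated assumptions, not the ASE procedure itself. It assumes noise-free experiments whose outcome each hypothesis predicts exactly; the function name, hypothesis names, costs, and priors are all hypothetical.

```python
from functools import lru_cache


def plan_experiments(priors, experiments):
    """Return (minimum expected cost, first experiment to run).

    priors: dict hypothesis -> prior probability
    experiments: dict name -> (cost, dict hypothesis -> predicted outcome)
    """
    @lru_cache(maxsize=None)
    def solve(live):
        if len(live) <= 1:
            return 0.0, None  # nothing left to discriminate
        total = sum(priors[h] for h in live)
        best_cost, best_name = float("inf"), None
        for name in sorted(experiments):
            cost, predicted = experiments[name]
            # group surviving hypotheses by the outcome they predict
            groups = {}
            for h in live:
                groups.setdefault(predicted[h], set()).add(h)
            if len(groups) < 2:
                continue  # experiment cannot eliminate anything
            expected = cost
            for members in groups.values():
                weight = sum(priors[h] for h in members) / total
                expected += weight * solve(frozenset(members))[0]
            if expected < best_cost:
                best_cost, best_name = expected, name
        return best_cost, best_name

    return solve(frozenset(priors))


# Hypothetical example: e2 is cheaper, but e1 is the better first experiment.
priors = {"h1": 0.5, "h2": 0.25, "h3": 0.25}
experiments = {
    "e1": (1.5, {"h1": True, "h2": False, "h3": False}),
    "e2": (1.0, {"h1": True, "h2": True, "h3": False}),
}
expected_cost, first = plan_experiments(priors, experiments)
```

Here starting with `e1` has expected cost 2.0 against 2.125 for the cheaper `e2`, because `e1` has a 50% chance of settling the question in one step. Exhaustive search like this scales poorly with the number of hypotheses and experiments, so it is only practical for small examples.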
Comparison of different
experimental strategies

ASE - Expected cost minimization.

Naïve - Choose cheapest experiment.

Random - Randomly choose experiments.
The cost of a series of experiments is a function of the
time taken and money spent. “Time is Money”.
The Robot
Biomek 2000
Closing the Loop

We have physically implemented all aspects of the
Robot Scientist system.

To the best of our knowledge this is the first active
learning system that both explicitly forms hypotheses
and experiments, and physically performs real experiments.
Accuracy v Time
At the end of the 5th iteration: ASE 80.1%, Naïve 74.0%, Random
72.2%. ASE was significantly more accurate than either Naïve
(p < 0.05) or Random (p < 0.07) using a paired t-test.
Accuracy v Money
[Plot: classification accuracy (%), 50 to 100, against log10 cost (£),
0 to 4.5, for the ASE, Random, and Naïve strategies.]
Given a spend of ≤£102.26, ASE 79.5%, Naïve 73.9%, Random 57.4%.
ASE was significantly more accurate than either Naïve (p < 0.05) or
Random (p < 0.001).
Time and Money

“Cost” is a positive function of time & money.
ASE dominates for both, therefore ASE dominates for
any reasonable cost function.

For example: to achieve an accuracy of ~70%, ASE
requires fewer trial iterations, and a hundredth of the
price, of Random; and almost half the number of
iterations, and a third of the price, of Naïve.
King et al. (2004) Nature. 427, 247-252.
Human Comparisons

We were interested to compare the performance of
the Robot Scientist with that of humans.

We adapted the simulator to allow humans to
choose experiments and interpret the results of cycles of
experimentation.

We compared the Robot Scientist with nine graduate
computer scientists and biologists.

There was no significant difference between the best humans
and the Robot.
Robotic Annotation
New Biological Knowledge

So far with the Robot Scientist we have only shown
that we can automatically rediscover known biological
knowledge.

We wish to extend this result to the discovery of new
biological knowledge.

To do this we need to combine the robot scientist with
conventional genome annotation bioinformatics, and
DMP.
Robotic Annotation




One way of thinking about genome annotation is as a
hypothesis formation process.
Hypothesis formation is perhaps the hardest part of
automating science.
Our idea is to combine bioinformatic annotation
methods with the Robot Scientist.
The bioinformatic methods will generate the
hypotheses which the robot scientist will
experimentally test.
Genome Scale Model of
Yeast Metabolism




We have extended our model of aromatic amino acid
metabolism to cover most of what is known about
yeast metabolism.
Includes 1,166 ORFs (940 known, 226 inferred)
Growth if path from growth medium to defined endpoints.
83% accuracy (based on 914 strain/medium
predictions)
The Model is Incomplete




It is not possible to find a path from the inputs (growth
medium) to all the end-point metabolites using only
reactions encoded by known genes.
This suggests automated strategies for determining
the identity of the missing genes - new biological
knowledge.
One strategy is to take the EC enzyme class of each
missing reaction, identify genes that code for this
EC class in other organisms, and then find homologous
genes in yeast.
The predictions can be tested automatically by robot.
Confirmation of DMP
Yeast Predictions




The yeast gene YBR147W, of currently “unknown”
function.
It is predicted to have a function in “metabolism” by 2
DMP rules with expected accuracies of >80%.
It is predicted to have a function in “amino-acid
metabolism” with two rules with expected accuracies
of 50% and 60% respectively.
Using our robot scientist auxotrophic methodology we
have recovered growth of the knockout with: aspartic
acid, tyrosine, leucine, valine, phenylalanine, cystine,
arginine.
Conclusions

Machine learning can be used to accurately predict
gene function.

Simple forms of scientific reasoning and
experimentation can be fully automated.

To develop robotic systems capable of generating
new biological knowledge will require a synthesis of
traditional genome annotation techniques, machine
learning, and a Robot Scientist like methodology.
The Three Objects of the Intellect
• The True
• The Beautiful
• The Beneficial
Acknowledgements
DMP
Andreas Karwath      Aberystwyth
Amanda Clare         Aberystwyth
Paul Wise            Aberystwyth
Luc Dehaspe          Leuven
Robot Scientist
Ken Whelan           Aberystwyth
Philip Reiser        Aberystwyth
Ffion Jones          Aberystwyth
Ugis Sarkans         Aberystwyth (EBI)
Douglas Kell         Manchester (Aberystwyth)
Steve Oliver         Manchester
Stephen Muggleton    Imperial College (York)
Chris Bryant         Robert Gordons (York)
David Page           Wisconsin
BBSRC, EPSRC
PharmDM - Commercial Support
Relational vs Propositional
Propositional: single table, fixed number of columns/attributes

orf        time0   time7   time14
yal001c    0.34    0.52    0.48
yal002w    0.76    0.82    0.89
yal003w    0.77    0.46    0.78
yal004c    0.38    0.50    0.49

Relational: multiple tables, multiple values

orf        SwissProtID   e-val
yal001c    p03415        2e-4
yal001c    p08640        8e-58
yal002w    p32583        6e-52
yal002w    p08775        3e-42

SwissProtID   keyword
p03415        apoptosis
p03415        repeat
p03415        zinc
p08640        membrane
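A relational attribute is effectively a query across the two tables. The sketch below uses the rows from this slide; the function name and the choice of query are illustrative assumptions and do not reproduce how PolyFARM builds its patterns.

```python
# (orf, SwissProtID, e-value) rows from the homology table
HOMOLOGS = [
    ("yal001c", "p03415", 2e-4),
    ("yal001c", "p08640", 8e-58),
    ("yal002w", "p32583", 6e-52),
    ("yal002w", "p08775", 3e-42),
]

# (SwissProtID, keyword) rows from the keyword table
KEYWORDS = {
    ("p03415", "apoptosis"),
    ("p03415", "repeat"),
    ("p03415", "zinc"),
    ("p08640", "membrane"),
}


def has_homolog_with_keyword(orf, keyword, max_eval):
    """True if the ORF has a homolog with e-value <= max_eval carrying `keyword`."""
    return any(
        o == orf and e_val <= max_eval and (swissprot_id, keyword) in KEYWORDS
        for o, swissprot_id, e_val in HOMOLOGS
    )
```

Each such boolean query can be evaluated per ORF to give one column of a propositional table, which is how relational data can be fed to a learner that expects a single fixed-width table.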
Expression Data Rule
If in the micro-array experiment (sorbitol incubation) the ORF expression is > -0.25
and in the micro-array experiment (nitrogen depletion) the ORF expression is <= -1.29
and in the micro-array experiment (YPD stationary phase) the ORF expression is > 1.06
then the function of this ORF is
"pheromone response, mating type determination, sex-specific proteins"
Accuracy on training data: 11/12 (92%)
Accuracy on the test data: 3/4 (75%)
21 predictions made
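The rule above reads directly as a predicate over an ORF's expression values. In this sketch the dictionary keys and the function name are hypothetical, and treating a missing measurement as "rule does not fire" is an assumption.

```python
def pheromone_rule(expr):
    """Apply the expression-data rule to a dict of experiment -> expression value."""
    needed = ("sorbitol incubation", "nitrogen depletion", "YPD stationary phase")
    if any(expr.get(name) is None for name in needed):
        return False  # assumption: a missing value means the rule does not fire
    return (
        expr["sorbitol incubation"] > -0.25
        and expr["nitrogen depletion"] <= -1.29
        and expr["YPD stationary phase"] > 1.06
    )
```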
Structure Rule
If true: coil (of length 3) followed by alpha (10 <= length < 14)
and true: coil (of length 1 or 2) followed by alpha (10 <= length < 14)
and true: coil (of length 3) followed by alpha (3 <= length < 6)
and false: coil followed by beta followed by coil (c-b-c)
and false: coil (6 <= length < 10) followed by alpha (of length 1 or 2)
then the function of this ORF is
"mitochondrial transport"
• 80% accurate on test data
• Most matching ORFs belong to the Mitochondrial Carrier Family
• These have 6 long transmembrane alpha-helices of about 20-30
amino acids
• Why do we notice alpha-helices of length 10-14?
Alignment
YJL133W -------NEYNPLIHCLC----GSISGSTCAAITTPLDCIKTVLQIRG------------ 251
YKR052C -------NSYNPLIHCLC----GGISGATCAALTTPLDCIKTVLQVRG------------ 241
YIL006W ----NNTNSINLQRLIMA----SSVSKMIASAVTYPHEILRTRMQLKS------------ 310
YBR104W ----LTRNEIPPWKLCLF----GAFSGTMLWLTVYPLDVVKSIIQNDD------------ 271
YGR096W ----KTTAAHKKWELATLNHSAGTIGGVIAKIITFPLETIRRRMQFMNSKHLEK------ 250
YJR095W -----QMDVLPSWETSCI----GLISGAIGPFSNAPLDTIKTRLQKDK------------ 246
YKL120W -----LMKDGPALHLTAS-----TISGLGVAVVMNPWDVILTRIYNQK------------ 261
YLR348C -----FDASKNYTHLTAS-----LLAGLVATTVCSPADVMKTRIMNGS------------ 239
YMR166C ----DGRDGELSIPNEILT---GACAGGLAGIITTPMDVVKTRVQTQQPPSQSNKSYSVT 300
YDL198C ------DYSQATWSQNFIS---SIVGACSSLIVSAPLDVIKTRIQNRN------------ 242
YGR257C ----RFASKDANWVHFINSFASGCISGMIAAICTHPFDVGKTRWQISMMN---------- 302
YDL119C FIHYNPEGGFTTYTSTTVNTTSAVLSASLATTVTAPFDTIKTRMQLEP------------ 255
YJL133W SQTVSLEIMRKADTFSKAASAIYQVYGWKGFWRGWKPRIVANMPATAISWTAYECAKHF 310
YKR052C -SETVSIEIMKDANTFGRASRAILEVHGWKGFWRGLKPRIVANIPATAISWTAYECAKHF 300
YIL006W -DIPDSIQRR-----LFPLIKATYAQEGLKGFYSGFTTNLVRTIPASAITLVSFEYFRNR 364
YBR104W -LRKPKYKNS-----ISYVAKTIYAKEGIRAFFKGFGPTMVRSAPVNGATFLTFELVMRF 325
YGR096W FSRHSSVYGSYKGYGFARIGLQILKQEGVSSLYRGILVALSKTIPTTFVSFWGYETAIHY 310
YJR095W ---SISLEKQSGMKKIITIGAQLLKEEGFRALYKGITPRVMRVAPGQAVTFTVYEYVREH 303
YKL120W ----GDLYKG-----PIDCLVKTVRIEGVTALYKGFAAQVFRIAPHTIMCLTFMEQTMKL 312
YLR348C ----GDHQP------ALKILADAVRKEGPSFMFRGWLPSFTRLGPFTMLIFFAIEQLKKH 289
YMR166C
Alignment
YJL133W -------cccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------ 251
YKR052C -------cccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------ 241
YIL006W ----ccccccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------ 310
YBR104W ----ccccccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------ 271
YGR096W ----cccccccccccccbaaaaaaaaaaaaaaacccaaaaaaaaaacccccccc------ 250
YJR095W -----cccccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaccc------------ 246
YKL120W -----ccccccaaaaaaa-----aaaaaaaaaacccaaaaaaaaaacc------------ 261
YLR348C -----ccccccaaaaaaa-----aaaaaaaaaacccaaaaaaaaaacc------------ 239
YMR166C ----cccccccccaaaaaa---aaaaaaaaaaacccaaaaaaaaaacccccccccccccc 300
YDL198C ------cccccccaaaaaa---aaaaaaaaaaacccaaaaaaaaaacc------------ 242
YGR257C ----ccccccccccccaaaaaaaaaaaaaaaaacccaaaaaaaaaacccc---------- 302
YDL119C ccccccccccccccaaaaaaaaaaaaaaaaaaacccaaaaaaaaaacc------------ 255
YJL133W -ccccccccccccccaaaaaaaaaaaccccaaaaccaaaaaaacaaaaaaaaaaaaaaaa 310
YKR052C -ccccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 300
YIL006W -ccccccccc-----aaaaaaaaaaaccccaaacccaaaaaaaccaaaaaaaaaaaaaaa 364
YBR104W -ccccccccc-----aaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 325
YGR096W cccccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 310
YJR095W ---ccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 303
YKL120W ----cccccc-----aaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 312
YLR348C ----ccccc------aaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 289
YMR166C cccccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 360
YDL198C ---cccccca------aaaaaaaaaacccaaaaacccaaaaaaaaaaaaaaaaaaaaaaa 293
YGR257C ---ccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 359
YDL119C ----ccccca------aaaaaaaaaacccaaaaacccaaaaaaccaaaaaaaaaaaaaaa 305
Types of Logic
Deduction
Rule: If a cell grows, then it can synthesise tryptophan.
Fact: Cell cannot synthesise tryptophan.
∴ Cell cannot grow.
Given the rule P → Q, and the fact ¬Q, infer the fact ¬P
(modus tollens)
Abduction
Rule: If a cell grows, then it can synthesise tryptophan.
Fact: Cell cannot grow.
∴ Cell cannot synthesise tryptophan.
Given the rule P → Q, and the fact ¬P, infer the fact ¬Q