
Basic Overview of
Bioinformatics Tools and
Biocomputing Applications III
Dr Tan Tin Wee
Director
Bioinformatics Centre
More BioComputational Tools
• Phylogenetics Analysis
• Multiple Sequence Alignment
• Profile Searching
• Sensitivity and Specificity and Probabilities in the Prediction of Functions
Phylogenetic Analysis
• Assumption: evolutionary descent
• Divergence
• Phylogenetic tree
• Rooted and unrooted trees
[Figure: tree of species X, Y, A and B]
Rooted and Unrooted Trees
• Rooted: ancestral state of the evolved
organism or gene is known.
• Branches at bifurcation points until terminal
branches, or tips/ leaves.
• Unrooted trees represent branching order,
but do not indicate the root, i.e. the last
common ancestor
Phylogenetic inference for genes
• Infancy, inexact science
• computational tools based on general
mathematical and statistical principles
• Phylogenetic reconstructions may conflict with
common sense.
• Incorrect sequence alignments, inadequate models
• Sites within sequences evolve at different rates
• unequal rate effects
Some algorithms
• Maximum parsimony
• Maximum likelihood
• Distance methods
• UPGMA
• Paralinear (LogDet) distances
• Software Packages:
PAUP (Phylogenetic Analysis Using Parsimony)
PHYLIP (Phylogeny Inference Package)
MacClade, GAMBIT, MEGA/METREE
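As an illustration of one of the distance methods listed above, here is a minimal UPGMA sketch. It is not taken from any of the packages named on the slide; the function name `upgma`, the taxon labels and the distance values are all made up for the example, and branch lengths are left out to keep it short.

```python
def upgma(names, matrix):
    """Cluster taxa by UPGMA and return the tree as nested tuples.

    names:  list of taxon labels
    matrix: symmetric list-of-lists of pairwise distances
    """
    # Active clusters: id -> (subtree, number of leaves in it).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Distances between active clusters, keyed by (smaller_id, larger_id).
    d = {(i, j): matrix[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)

    while len(clusters) > 1:
        # Join the two closest clusters.
        a, b = min(d, key=d.get)
        (tree_a, n_a), (tree_b, n_b) = clusters.pop(a), clusters.pop(b)

        # Distance from the merged cluster to every other cluster is the
        # average weighted by cluster size (the "unweighted pair group" rule).
        new_d = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            new_d[(c, next_id)] = (n_a * d_ac + n_b * d_bc) / (n_a + n_b)

        # Drop distances that involve the two clusters just merged.
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new_d)
        clusters[next_id] = ((tree_a, tree_b), n_a + n_b)
        next_id += 1

    tree, _ = next(iter(clusters.values()))
    return tree


# Hypothetical distances between four taxa A, B, C, D.
example_names = ["A", "B", "C", "D"]
example_matrix = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
example_tree = upgma(example_names, example_matrix)
```

Because UPGMA always joins the closest pair and averages distances, it implicitly assumes a constant rate of change along all branches, which is one reason it can mislead when rates are unequal.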
Limitations
• Inspection of sequence alignments
• Removal of deviant sequences from the
phylogenetic inference
• Different genes analysed produce different
trees
• "Bootstrapping" for estimating statistical
significance may still have errors in
interpretation
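The resampling step behind bootstrapping can be sketched as follows. This is a minimal illustration, assuming the usual approach of drawing alignment columns with replacement; the function name `bootstrap_replicate` and the toy alignment are invented for the example, and the tree building on each replicate is left to whichever method is in use.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Return one bootstrap replicate of an alignment.

    alignment: list of equal-length aligned sequences (strings)
    rng:       a random.Random instance, passed in for reproducibility
    """
    length = len(alignment[0])
    # Draw column indices with replacement; some columns repeat, some vanish.
    columns = [rng.randrange(length) for _ in range(length)]
    # Apply the same column sample to every sequence so columns stay intact.
    return ["".join(seq[c] for c in columns) for seq in alignment]


# Toy alignment, made up for illustration.
toy_alignment = ["ACGT", "ACGA", "TCGA"]
replicate = bootstrap_replicate(toy_alignment, random.Random(0))
```

A tree would then be inferred from each of many such replicates, and the fraction of replicates in which a given grouping reappears is reported as its bootstrap support.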
Uses
• Molecular Taxonomy
• 16S and 23S rRNA analysis for bacterial
classification
• 18S rRNA analysis of nematodes, Drosophila
• epidemiological analysis of strain variation, e.g.
in infectious pathogens
Multiple Sequence Analysis
• Gather a set of sequences of putative
similarity or homology
• Pairwise comparison for each set of
multiple sequences
• Build a "tree" of similarity
• realignment of all sequences based on
"ancestral" sequence padding with gaps etc
• Used for generating "profiles"
Use
• Detection of conserved and variable regions
• Infer gene functions
• Variable segments - infer dispensable to function
or antigenic variants
• Motifs can be used to analyse unknown sequence
and infer possible function or relatedness
• Motifs as basis for annotation of genome project
sequences
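Detecting conserved and variable regions can be sketched by scoring each column of a multiple alignment. The snippet below is only an illustration: the function name `column_conservation`, the scoring rule (fraction of sequences sharing the most common residue) and the toy alignment are assumptions, not the method of any particular package.

```python
from collections import Counter


def column_conservation(alignment):
    """Score each alignment column by the share of its most common residue.

    alignment: list of equal-length aligned sequences, with '-' for gaps.
    Returns a list of floats in [0, 1], one per column.
    """
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            # Column made only of gaps: treat as not conserved.
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        # Divide by the number of sequences so gaps lower the score.
        scores.append(top_count / len(column))
    return scores


# Toy protein alignment, made up for illustration.
toy_msa = ["ACDE", "ACDF", "AC-G"]
scores = column_conservation(toy_msa)
conserved_columns = [i for i, s in enumerate(scores) if s == 1.0]
```

Runs of high-scoring columns are candidates for motifs, while low-scoring stretches correspond to the variable segments mentioned above.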
Software
• CLUSTALW
• Profile software based on Hidden Markov
Models (HMMs), e.g.
HMMer, HMMPro, META-MEME,
PROBE, BLOCKS
Example
• C. elegans genome project
• several large gene families of sequence
homology - function unknown.
• Now classified as putative G-protein coupled
receptors (GPCRs).
• Have to detect significant similarity between
putative Worm GPCRs and experimentally
known GPCRs in other species
Process
• Select a typical unknown sequence
• BLAST search against nr database
• Inspect hits and E-values
• Top scoring hits - mitochondrial L11 ribosomal
protein E=0.002 (not low enough to be trusted for
annotation)
• The rest of top scorers are all nematode-specific
unknown sequences
• Compare with PSI-BLAST iterative searching at
NCBI
• Similarity with mammalian GPCRs or the high
scoring mt rL11 protein ?
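Inspecting hits and E-values amounts to comparing each hit against a cutoff. The sketch below assumes hits are already parsed into (description, E-value) pairs; the function name `split_by_evalue`, the cutoff of 1e-5 and the second hit are hypothetical, and only the E-value of 0.002 comes from the slide.

```python
def split_by_evalue(hits, max_evalue=1e-5):
    """Split (description, evalue) pairs into trusted and doubtful hits.

    max_evalue is an arbitrary illustrative cutoff, not a recommended value.
    """
    trusted = [h for h in hits if h[1] <= max_evalue]
    doubtful = [h for h in hits if h[1] > max_evalue]
    return trusted, doubtful


hits = [
    ("mitochondrial L11 ribosomal protein", 0.002),  # E-value from the slide
    ("hypothetical well-supported hit", 1e-30),      # made-up example value
]
trusted, doubtful = split_by_evalue(hits)
```

With this cutoff the L11 hit lands in the doubtful list, matching the point above that E=0.002 is not low enough to be trusted for annotation.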
Further analysis
• Gather all nematode-specific sequences
• WormPep database of non-redundant sequences
• Discard sequences that are abnormally long or short
• Multiple sequence alignment using
CLUSTALW
• Generate a profile of the multiple alignment using
HMMer
• Use profile to search database again
Results
• Similarity at significance level detected
with Mammalian GPCRs
• Find that L11 protein has very significant
high score, E = 5 x 10^-49
• Pitfalls of PSI-Blast - significance of match
to the training set during iteration.
• Finally, L11 protein may be wrongly
annotated and not based on experimental
results
A. Sensitivity and Specificity of a Fairly Good Test
N = 100
                              Known gold standard
                              +ve       -ve
Experimental test result +ve  70        2
Experimental test result -ve  3         25
• Total real +ve = 73
• Total real -ve = 27
• Specificity = 25/(2+25) = .93
picked up 25 of the 27 negatives, very specific
Low false positives
• Sensitivity = 70/(70+3) = .96
able to pick up 70 of the total 73 that are known
positive, quite sensitive
Low false negatives
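The arithmetic on this slide can be checked with a few lines of Python. The function name `sensitivity_specificity` and its argument names are chosen for this sketch; the counts are the ones given for test A (70 true positives, 2 false positives, 3 false negatives, 25 true negatives).

```python
def sensitivity_specificity(tp, fp, fn, tn):
    """Return (sensitivity, specificity) from the four cells of a 2x2 table.

    tp: test +ve, gold standard +ve    fp: test +ve, gold standard -ve
    fn: test -ve, gold standard +ve    tn: test -ve, gold standard -ve
    """
    sensitivity = tp / (tp + fn)   # share of real positives picked up
    specificity = tn / (tn + fp)   # share of real negatives picked up
    return sensitivity, specificity


# Counts for test A from the slide.
sens_a, spec_a = sensitivity_specificity(tp=70, fp=2, fn=3, tn=25)
print(f"sensitivity = {sens_a:.2f}, specificity = {spec_a:.2f}")
```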
B. Increase Sensitivity but Lower Specificity of a Test
N = 100
                              Known gold standard
                              +ve       -ve
Experimental test result +ve  72        13
Experimental test result -ve  1         14
• Total real +ve = 73
• Total real -ve = 27
• Specificity = 14/(13+14) = .52
picked up 14 of the 27 negatives, not very specific
High false positives
• Sensitivity = 72/(72+1) = .99
able to pick up 72 of the total 73 that are known
positive, super sensitive
Low false negatives
C. Increase Specificity of a Test but Sensitivity may drop
N = 100
                              Known gold standard
                              +ve       -ve
Experimental test result +ve  50        0
Experimental test result -ve  23        27
• Total real +ve = 73
• Total real -ve = 27
• Specificity = 27/(0+27) = 1.0
picked up 27 of the 27 negatives, completely specific
Increase threshold to zero false positives, true positives will drop
• Sensitivity = 50/(50+23) = .68
able to pick up 50 of the total 73 that are known
positive, not very sensitive
High false negatives
Trade off involved
• If the test threshold is set high, so that all the
noise disappears, you may also miss some true
positives and get a lot of false negatives,
so the test is not very sensitive - Case C
• If the test threshold is set low, so that you pick
up as many of the positives as you can, i.e.
high sensitivity, non-specific false
positive hits start appearing - Case B
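The trade-off can be seen by moving a threshold over a set of scored items. This is a minimal sketch with made-up data: the function name `sens_spec_at`, the scores and the two thresholds are all hypothetical, and an item is called positive when its score is at or above the threshold.

```python
def sens_spec_at(threshold, scored):
    """Return (sensitivity, specificity) when calling score >= threshold positive.

    scored: list of (score, is_really_positive) pairs.
    """
    tp = sum(1 for s, pos in scored if pos and s >= threshold)
    fn = sum(1 for s, pos in scored if pos and s < threshold)
    tn = sum(1 for s, pos in scored if not pos and s < threshold)
    fp = sum(1 for s, pos in scored if not pos and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical scores: four real positives and four real negatives.
scored = [
    (0.9, True), (0.8, True), (0.7, False), (0.6, True),
    (0.4, False), (0.3, True), (0.2, False), (0.1, False),
]

low = sens_spec_at(0.25, scored)    # low threshold: catches every positive
high = sens_spec_at(0.75, scored)   # high threshold: no false positives
```

In this toy data the low threshold gives sensitivity 1.0 with specificity 0.5, and the high threshold gives specificity 1.0 with sensitivity 0.5, the same pattern as Cases B and C.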
Computational Predictions of
Gene Function
• Sensitivity and specificity have similar
tradeoffs.
• Cutoff threshold values have to be
empirically determined or arbitrarily chosen
depending on situation