BIOINFORMATICS AND GENE DISCOVERY

Download Report

Transcript BIOINFORMATICS AND GENE DISCOVERY

UNIVERSITY OF NORTH CAROLINA AT CHAPEL HILL
Bioinformatics Tutorials
BIOINFORMATICS
AND
GENE DISCOVERY
Iosif Vaisman
1998
From genes to proteins
From genes to proteins
DNA
RNA
PROMOTER
ELEMENTS
TRANSCRIPTION
SPLICE
SITES
mRNA
SPLICING
START
CODON
STOP
CODON
TRANSLATION
PROTEI
From genes to proteins
Comparative Sequence Sizes
•
•
•
•
•
•
•
Yeast chromosome 3
Escherichia coli (bacterium) genome
Largest yeast chromosome now mapped
Entire yeast genome
Smallest human chromosome (Y)
Largest human chromosome (1)
Entire human genome
350,000
4,600,000
5,800,000
15,000,000
50,000,000
250,000,000
3,000,000,000
Low-resolution physical map
of chromosome 19
Chromosome 19 gene map
Computational Gene Prediction
•Where the genes are unlikely to be located?
•How do transcription factors know where to bind a region of DNA?
•Where are the transcription, splicing, and translation start and stop
signals?
•What does coding region do (and non-coding regions do not) ?
•Can we learn from examples?
•Does this sequence look familiar?
Artificial Intelligence in Biosciences
Neural Networks (NN)
Genetic Algorithms (GA)
Hidden Markov Models (HMM)
Stochastic context-free grammars (CFG)
Information Theory
0
1
1 bit
Information Theory
00
01
1 bit
11
10
1 bit
Information Theory
1 bit
1 bit
Scientific Models
Physical models -- Mathematical models
Mechanistic models
Stochastic models
Mechanism
Black box
Predictive power
Elegance
Consistency
Predictive power
Hidden Markov models
Stochastic mechanism
Neural Networks
•interconnected assembly of simple processing elements (units or nodes)
•nodes functionality is similar to that of the animal neuron
•processing ability is stored in the inter-unit connection strengths (weights)
•weights are obtained by a process of adaptation to, or learning from, a set
of training patterns
Genetic Algorithms
Search or optimization methods using simulated evolution.
Population of potential solutions is subjected to
natural selection, crossover, and mutation
choose initial population
evaluate each individual's fitness
repeat
select individuals to reproduce
mate pairs at random
apply crossover operator
apply mutation operator
evaluate each individual's fitness
until terminating condition
Crossover
Parent A
Parent B
crossover point
Child AB
Child BA
Mutation
Markov Model (or Markov Chain)
A
T
C
T
A
G
Probability for each character based only on several
preceding characters in the sequence
# of preceding characters = order of the Markov Model
Probability of a sequence
P(s) = P[A] P[A,T] P[A,T,C] P[T,C,T] P[C,T,A] P[T,A,G]
Hidden Markov Models
States -- well defined conditions
Edges -- transitions between the states
T
G
A
A
C
C
T
ATGAC
ATTAC
ACGAC
ACTAC
Each transition asigned a probability.
Probability of the sequence:
single path with the highest probability --- Viterbi path
sum of the probabilities over all paths -- Baum-Welch method
Hidden Markov Model of Biased Coin Tosses
• States (Si):
Two Biased Coins {C1, C2}
• Outputs (Oj):
Two Possible Outputs {H, T}
• p(OutputsOij):
p(C1, H), p(C1, T), p(C2, H) p(C2, T)
• Transitions:
From State X to Y {A11, A22, A12, A21}
• p(Initial Si):
p(I, C1), p(I, C2)
• p(End Si):
p(C1, E), p(C2, E)
Hidden Markov Model for Exon and Stop Codon
(VEIL Algorithm)
GRAIL gene identification program
POSSIBLE EXONS
REFINED EXON
POSITIONS
FINAL EXON
CANDIDATES
Suboptimal Solutions for the Human Growth
Hormone Gene (GeneParser)
Measures of Prediction Accuracy
Nucleotide Level
TN
FN
TP
FP
TN
FN
TP
FN TN
REALITY
PREDICTION
c
c
TP
nc
FP
Sensitivity
Sn = TP / (TP + FN)
Specificity
nc
PREDICTION
REALITY
FN
TN
Sp = TP / (TP + FP)
Measures of Prediction Accuracy
Exon Level
MISSING
EXON
WRONG
EXON
CORRECT
EXON
Sn =
number of correct exons
number of actual exons
Sp =
number of correct exons
number of predicted exons
REALITY
PREDICTION
Sensitivity
Specificity
GeneMark Accuracy Evaluation
Bibliography
http://linkage.rockefeller.edu/wli/gene/list.html
and
http://www-hto.usc.edu/software/procrustes/fans_ref/
Gene Discovery Exercise
http://metalab.unc.edu/pharmacy/Bioinfo/Gene