TCGR: A Novel DNA/RNA Visualization Technique
Download
Report
Transcript TCGR: A Novel DNA/RNA Visualization Technique
TCGR: A Novel DNA/RNA
Visualization Technique
Margaret H. Dunham and Donya Quick
Southern Methodist University
Dallas, Texas 75275
[email protected]
Some slides presented at IEEE BIBE 2006
3/14/08, UMKC
1
Outline
Introduction
TCGR
EMM
miRNA Prediction using TCGR/EMM
Conclusion / Future Work
3/14/08, UMKC
2
Outline
Introduction
Background
CGR/FCGR
Motivation
Research Objective
TCGR
EMM
miRNA Prediction using TCGR/EMM
Conclusion / Future Work
3/14/08, UMKC
3
DNA
Deoxyribonucleic Acid
Basic building blocks of
organisms
Located in nucleus of cells
Composed of 4 nucleotides
Adenine (A)
Cytosine (C)
Guanine (G)
Thymine (T)
Two strands bound together
Contains genetic information
3/14/08, UMKC
Image source:
http://www.visionlearning.com/library/m
odule_viewer.php?mid=63
4
Nucleotide Bases
http://www.people.virginia.edu/~rjh9u/gif/bases.gif
3/14/08, UMKC
5
Transcription
During transcription, DNA is
converted in mRNA
RNA is processed and noncoding
regions removed
Coding regions are converted in
protein
Enzyme (RNA Polymerase) that
starts transcription by binding to
DNA code
3/14/08, UMKC
6
Transcription
http://ghs.gresham.k12.or.us/science/ps/sci/
ibbio/chem/nucleic/chpt15/transcripti
on.gif
3/14/08, UMKC
7
RNA
Ribonucleic Acid
Contains A,C,G but U (Uracil) instead
of T
Single Stranded
May fold back on itself
Needed to create proteins
Move around cells – can act like a
messenger
mRNA – moves out of nucleus to
other parts of cell
3/14/08, UMKC
8
Translation
Synthesis of Proteins from mRNA
Nucleotide sequence of mRNA
converted in amino acid sequence
of protein
Four nucleotides
Twenty amino acids
Codon – Group of 3 nucleotides
Amino acids have many codings
3/14/08, UMKC
9
Central Dogma: DNA -> RNA -> Protein
DNA
CCTGAGCCAACTATTGATGAA
transcription
RNA
CCUGAGCCAACUAUUGAUGAA
translation
Protein
PEPTIDE
3/14/08, UMKC www.bioalgorithms.info; chapter 6; Gene Prediction
10
http://www.time.com/time/magazine/article/0,91
71,1541283,00.html
3/14/08, UMKC
11
Human Genome
Scientists originally thought there would
be about 100,000 genes
Appear to be about 20,000
WHY?
Almost identical to that of Chimps.
What makes the difference?
Answers appear to lie in the noncoding
regions of the DNA (formerly thought to
be junk)
3/14/08, UMKC
12
More Questions
If each cell in an organism contains the
same DNA –
How does each cell behave differently?
Why do cells behave differently during
childhood development?
What causes some cells to act differently
– such as during disease?
DNA contains many genes, but only a few
are being transcribed – why?
One answer - miRNA
3/14/08, UMKC
13
miRNA
Short (20-25nt) sequence of noncoding RNA
Single strand
Previously assumed to be garbage
Impact/Prevent translation of mRNA
Bind to target areas in mRNA – Problem is
that this binding is not perfect (particularly in
animals)
mRNA may have multiple (nonoverlapping)
binding sites for one miRNA
3/14/08, UMKC
14
miRNA Functions
Causes some cancers
Embryo Development
Cell Differentiation
Cell Death
Prevents the production of a
protein that causes lung
cancer
Control brain development in
zebra fish
Associated with HIV
3/14/08, UMKC
15
Outline
Introduction
Background
CGR/FCGR
Motivation
Research Objective
TCGR
EMM
miRNA Prediction using TCGR/EMM
Conclusion / Future Work
3/14/08, UMKC
16
Chaos Game Representation (CGR)
Scatter plot showing occurrence of
patterns of nucleotides.
3/14/08, UMKC
University of the Basque Country
http://insilico.ehu.es/genomics/my_words/
17
FCGR
Chaos Game Representation
2D technique to visually
(CGR)
see
the distribution of
subpatterns
Our technique is based on
the following:
Generate totals for
each subpattern
Scale totals to a [0,1]
range. (Note scaling
can be a problem)
Convert range to
red/blue
A CA C
G UG U
A CA C
G UG U
A CA C
G UG U
A CA C
G UG U
A CA C
G UG U
A CA C
G UG U
A CA C
G UG U
A CA C
G UG U
• 0-0.5: White to
Blue
• 0.5-1: Blue to Red
3/14/08, UMKC
19
FCGR Example
Homo sapiens – all mature miRNA
Patterns of length 3
UUC
GUG
3/14/08, UMKC
21
Outline
Introduction
Background
CGR/FCGR
Motivation
Research Objective
TCGR
EMM
miRNA Prediction using TCGR/EMM
Conclusion / Future Work
3/14/08, UMKC
22
Motivation
2000bp Flanking Upstream Region
mir-258.2 in C elegans
a) All 2000 bp
3/14/08, UMKC
b) First 240 bp
b) Last 240 bp
23
Research Objectives
Identify, develop, and implement algorithms
which can be used for identifying potential
miRNA functions.
Create an online tool which can be used by other
researchers to apply our algorithms to new data.
3/14/08, UMKC
24
Outline
Introduction
CGR/FCGR
miRNA
Motivation
Research Objective
TCGR
EMM
miRNA Prediction using TCGR/EMM
Conclusion / Future Work
3/14/08, UMKC
25
Temporal CGR (TCGR)
Temporal version of Frequency CGR
In our context temporal means the starting location of a window
2D Array
Each Row represents counts for a particular window in sequence
• First row – first window
• Last row – last window
• We start successive windows at the next character location
Each Column represents the counts for the associated pattern in
that window
• Initially we have assumed order of patterns is alphabetic
Size of TCGR depends on sequence length and subpattern lengt
As sequence lengths vary, we only examine complete windows
We only count patterns completely contained in each window.
3/14/08, UMKC
26
TCGR Example
acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga
Moving Window
Pos 0-8
Pos 1-9
A
2
1
C
3
3
G
3
3
T
1
2
4
2
1
C
0.6
0.6
G
0.6
0.6
T
0.2
0.4
0.8
0.4
0.2
…
Pos 34-42 2
Pos 0-8
Pos 1-9
A
0.4
0.2
…
Pos 34-42 0.4
3/14/08, UMKC
27
TCGR Example (cont’d)
TCGRs for Sub-patterns of length 1, 2, and 3
3/14/08, UMKC
28
TCGR Example (cont’d)
ACGT
Window 0: Pos 0-8
Window 1: Pos 1-9
acgtgcacg
cgtgcacgt
Window 17: Pos 17-25
Window 18: Pos 18-26
tccggaacc
ccggaacca
Window 34: Pos 34-42
ccacgtcga
3/14/08, UMKC
29
TCGR – Mature miRNA
(Window=5; Pattern=3)
C. elegans
Homo sapiens
Mus musculus
All Mature
3/14/08, UMKC
ACG
CGC
GCG
UCG
31
Outline
Introduction
CGR/FCGR
miRNA
Motivation
Research Objective
TCGR
EMM
miRNA Prediction using TCGR/EMM
Conclusion / Future Work
3/14/08, UMKC
32
EMM Overview
Time Varying Discrete First Order
Markov Model
Nodes are clusters of real world states.
Learning continues during prediction
phase.
Learning:
Transition probabilities between
nodes
Node labels (centroid of cluster)
Nodes are added and removed as
data arrives
3/14/08, UMKC
33
EMM Definition
Extensible Markov Model (EMM): at any time
t, EMM consists of an MC with designated
current node, Nn, and algorithms to modify
it, where algorithms include:
EMMCluster, which defines a technique for
matching between input data at time t + 1
and existing states in the MC at time t.
EMMIncrement algorithm, which updates
MC at time t + 1 given the MC at time t and
clustering measure result at time t + 1.
EMMDecrement algorithm, which removes
nodes from the EMM when needed.
3/14/08, UMKC
34
EMM Cluster
Find closest node to incoming
event.
If none “close” create new node
Labeling of cluster is centroid of
members in cluster
O(n)
3/14/08, UMKC
35
EMM Increment
<18,10,3,3,1,0,0>
<17,10,2,3,1,0,0>
<16,9,2,3,1,0,0>
<14,8,2,3,1,0,0>
2/3
2/3
2/21
2/3
1/1
1/2
1/2
N3
N1
1/3
N2
1/1
1/2
1/1
<14,8,2,3,0,0,0>
<18,10,3,3,1,1,0.>
3/14/08, UMKC
36
Outline
Introduction
CGR/FCGR
miRNA
Motivation
Research Objective
TCGR
EMM
miRNA Prediction using TCGR/EMM
Conclusion / Future Work
3/14/08, UMKC
37
Research Approach
1. Represent potential miRNA sequence with
TCGR sequence of count vectors
2. Create EMM using count vectors for known
miRNA (miRNA stem loops, miRNA targets)
3. Predict unknown sequence to be miRNA (miRNA
stem loop, miRNA target) based on normalized
product of transition probabilities along clustering
path in EMM
3/14/08, UMKC
38
Related Work 1
Predicted occurrence of pre-miRNA
segments form a set of hairpin sequences
No assumptions about biological function or
conservation across species.
Used SVMs to differentiate the structure of
hiarpin segments that contained premiRNAs from those that did not.
Sensitivey of 93.3%
Specificity of 88.1%
1
C. Xue, F. Li, T. He, G. Liu, Y. Li, nad X. Zhang, “Classification of Real and
Pseudo MicroRNA Precursors using Local Structure-Sequence Features and
Support Vector Machine,” BMC Bioinformatics, vol 6, no 310.
3/14/08, UMKC
39
Preliminary Test Data1
Positive Training: This dataset consists of 163
human pre-miRNAs with lengths of 62-119.
Negative Training: This dataset was obtained from
protein coding regions of human RefSeq genes. As
these are from coding regions it is likely that there are
no true pre-miRNAs in this data. This dataset
contains 168 sequences with lengths between 63 and
110 characters.
Positive Test: This dataset contains 30 pre-miRNAs.
Negative Test: This dataset contains 1000 randomly
chosen sequences from coding regions.
1 C. Xue, F. Li, T. He, G. Liu, Y. Li, nad X. Zhang, “Classification of
Real and Pseudo MicroRNA Precursors using Local StructureSequence Features and Support Vector Machine,” BMC
Bioinformatics, vol 6, no 310.
3/14/08, UMKC
40
TCGRs for Xue Training Data
P
O
S
I
T
I
V
E
N
E
G
A
T
I
V
E
3/14/08, UMKC
41
TCGRs for Xue Test Data
P
O
S
I
T
I
V
E
N
E
G
AT
I
V
E
3/14/08, UMKC
42
Predictive Probabilities with Xue’s Data
EMM
Mean
Std Dev
Max
Min
Negative Test-Neg
0
0
0
0
Test-Pos
0
0
0
0
Train-Neg
0.37963 0.050085 0.91256 0.2945
Train-Pos
0
0
0
0
Test-Neg
0
0
0
0
Test-Pos
0.25894 0.18701
0.42075 0
Train-Neg
0
0
Train-Pos
0.38926 0.048439 0.91155 0.32209
Positive
3/14/08, UMKC
Test Data
0
0
43
Preliminary Test Results
Positive EMM
Cutoff Probability = 0.3
False Positive Rate = 0%
True Positive Rate = 66%
Test results could be improved by
meta classifiers combining multiple
positive and negative classifiers
together.
3/14/08, UMKC
44
Outline
Introduction
CGR/FCGR
miRNA
Motivation
Research Objective
TCGR
EMM
miRNA Prediction using TCGR/EMM
Conclusion / Future Work
3/14/08, UMKC
45
Future Research
1.
Obtain all known mature miRNA sequences for a species – initially the 119
C. elegans miRNAs.
2.
Create TCGR count vectors for each sequence and each sub-pattern length
(1,2,3,4,5).
3.
Train EMMs using this data for each sub-pattern length. Thus five EMMs will
be created
4.
Obtain negative data (much as Xue did in his research) from coding regions
for C. elegans.
5.
Train EMMs using this data for each sub-pattern length. Thus five EMMs will
be created
6.
Construct a meta-classifier based on the combined results of prediction from
each of these ten EMMs.
7.
Apply the EMM classifier to the existing ~75x106 base pairs of non-exonic
sequence in the C. elegans genome to search for miRNAs. Note: all 119
validated C. elegans miRNAs are contained in the non-exonic part of the
genome and thus the first pass of the algorithm will be tested for its ability to
detect all 119 validated miRNAs.
8.
Validate the prediction of novel miRNAs using molecular biology.
3/14/08, UMKC
46