A new way of seeing genomes
Combining sequence- and signal-based genome analyses
FLI
Maik Friedel, Thomas Wilhelm, Jürgen Sühnel
Introduction:
So far, genome analysis has been done almost exclusively by treating the sequence as a character string. We developed a new approach that may lead to an improved understanding of nucleotide sequences.
Our genome browser DiProGB (http://DiProGB.fli-leibniz.de) encodes the sequence by geometrical or physicochemical dinucleotide properties. The values of these properties are plotted as a dinucleotide-based sequence graph. This type of visualization makes it possible to recognize sequence patterns that are hidden in the usual character string representation. The graph can be manipulated in real time by zooming in and out, changing the amplitude, and smoothing with a shifting window technique. GenBank annotations such as exons and introns can be shown in different colors. The browser also supports searching for motifs in general and for repeats in particular, at both the character-based sequence level and the signal level. Finally, it offers a number of options for statistical analysis. In summary, the new genome browser is a tool for enhanced genome analysis that can lead to deeper insights into the organization and function of the genome. To provide a reliable basis of dinucleotide property sets, we have collected more than 100 of them in the dinucleotide property database DiProDB (http://DiProDB.fli-leibniz.de).
DiProGB
DiProDB
Friedel et al. Bioinformatics 2009; doi: 10.1093/bioinformatics/btp436
The genome browser is a computer program that converts DNA sequences
into a signal representation by applying dinucleotide parameters and
smoothing the signal using a shifting window technique.
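The conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not DiProGB's own C++ code: the function names `encode` and `smooth` are hypothetical, the smoothing is assumed to be a plain moving average over a window shifted one position at a time, and the twist values are those listed in the Example table of this poster (Gorin et al. 1995).

```python
# Twist (B-DNA, degrees) per dinucleotide, as listed in the poster's example table.
TWIST = {
    "AA": 35.8, "AC": 35.8, "AG": 30.5, "AT": 33.4,
    "CA": 36.9, "CC": 33.4, "CG": 31.1, "CT": 30.5,
    "GA": 39.3, "GC": 38.3, "GG": 33.4, "GT": 35.8,
    "TA": 40.0, "TC": 39.3, "TG": 36.9, "TT": 35.8,
}


def encode(seq, prop):
    """Map every overlapping dinucleotide of seq to its property value.

    A sequence of length n yields a signal of length n - 1.
    Characters outside A, C, G, T raise a KeyError in this sketch.
    """
    seq = seq.upper()
    return [prop[seq[i:i + 2]] for i in range(len(seq) - 1)]


def smooth(signal, window):
    """Average the signal over a window shifted one position at a time.

    Assumption: an unweighted moving average; the poster only says
    'shifting window technique'.
    """
    if not 1 <= window <= len(signal):
        raise ValueError("window must be between 1 and len(signal)")
    return [
        sum(signal[i:i + window]) / window
        for i in range(len(signal) - window + 1)
    ]


signal = encode("ACGTACGT", TWIST)   # 7 dinucleotide values
smoothed = smooth(signal, 3)         # 5 window averages
```

A larger window flattens local fluctuations and emphasizes long-range patterns, which is why the window size is worth adjusting interactively.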
Basic features:
• standalone computer program written in C++
• uploads nucleotide sequences of any size and type as GenBank, (multiple) FASTA or text files
• uploads different types of feature files (*.gff, *.ptt, *.bed, ...)
• colors annotated features of a feature or GenBank file
• manipulates the signal in real time (smoothing, changing amplitude, zooming)
Implemented tools:
• motif and repeat search at the signal and sequence levels
• statistical tools for average statistics
• random sequence generator
• dinucleotide properties editor
• editor for searching and sorting the list of annotated features
• editor for adding features and qualifiers to an existing GenBank file
• export functions for signal information and for the character-based sequence
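The idea of a motif search at the signal level, as opposed to the character level, can be illustrated as follows. The poster does not describe how DiProGB implements this search, so this is a sketch under assumptions: the function name `find_signal_motif` and its `tol` parameter are hypothetical, a hit is taken to be a window whose encoded values all lie within `tol` of the encoded motif, and purine content per dinucleotide (a letter-based property) is used as the encoding.

```python
# Purine content per dinucleotide: number of A/G bases (0, 1 or 2).
PURINE = {a + b: (a in "AG") + (b in "AG") for a in "ACGT" for b in "ACGT"}


def encode(seq, prop):
    """Map every overlapping dinucleotide of seq to its property value."""
    seq = seq.upper()
    return [prop[seq[i:i + 2]] for i in range(len(seq) - 1)]


def find_signal_motif(seq, motif, prop, tol=0.0):
    """Return start positions where the signal of seq matches the motif signal.

    Assumed matching rule: every value in the window differs from the
    corresponding motif value by at most tol.
    """
    if len(motif) < 2:
        raise ValueError("motif must contain at least one dinucleotide")
    sig = encode(seq, prop)
    query = encode(motif, prop)
    hits = []
    for start in range(len(sig) - len(query) + 1):
        window = sig[start:start + len(query)]
        if all(abs(s - q) <= tol for s, q in zip(window, query)):
            hits.append(start)
    return hits


seq = "TTAGGCTT"
char_hit = seq.find("GAAT")                        # character-level search
signal_hits = find_signal_motif(seq, "GAAT", PURINE)  # signal-level search
```

Here the string `GAAT` does not occur in `TTAGGCTT`, but the stretch `AGGC` has the same purine-content signal, so only the signal-level search reports a hit.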
Friedel et al. Nucleic Acids Res. 2009 Jan;37(Database issue):D37-40.
Database
(main table showing a list of twist parameter sets)
Basic features:
• includes more than 100 dinucleotide property sets
• full references for all sets
• all sets are classified according to:
- nucleic acid type (DNA, RNA, ...)
- strand (double, single)
- mode of property determination (experimental, calculated)
- property type (thermodynamic, conformational, letter-based)
• all information is shown in one table which can be customized
• users can submit their own datasets
Implemented tools:
• search and sorting functions
• data export as text file or input file for the Genome Browser
• Pearson’s correlation and Spearman’s rank correlation
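Comparing two property sets by correlation amounts to treating each set as a vector of 16 dinucleotide values. The following sketch shows both coefficients in plain Python. It is not the DiProDB implementation: the function names are hypothetical, ties in the Spearman ranking are assumed to receive average ranks, and the two sets compared are the twist values from this poster's Example table and a letter-based GC content computed on the spot.

```python
from math import sqrt

DINUCS = [a + b for a in "ACGT" for b in "ACGT"]  # AA, AC, ..., TT

# Twist (B-DNA, degrees) in the order of DINUCS, from the poster's example table.
TWIST = [35.8, 35.8, 30.5, 33.4, 36.9, 33.4, 31.1, 30.5,
         39.3, 38.3, 33.4, 35.8, 40.0, 39.3, 36.9, 35.8]

# Letter-based property: number of G/C bases in each dinucleotide.
GC_CONTENT = [sum(base in "GC" for base in d) for d in DINUCS]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


r_pearson = pearson(TWIST, GC_CONTENT)
r_spearman = spearman(TWIST, GC_CONTENT)
```

Spearman's coefficient only depends on the ordering of the values, so it is the more robust choice when two property sets are measured on very different scales.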
Genome Browser
(full genome of Euglena gracilis chloroplast; applied dinucleotide property: stacking energy)
Example: twist (B-DNA) [degree] (Gorin et al. J. Mol. Biol. 247, 34-48 (1995))
AA 35.8    AC 35.8    AG 30.5    AT 33.4
CA 36.9    CC 33.4    CG 31.1    CT 30.5
GA 39.3    GC 38.3    GG 33.4    GT 35.8
TA 40      TC 39.3    TG 36.9    TT 35.8
(Figure panels: Repeat finder; Motif finder)
The main window of the genome browser consists of three panels:
(1) Control panel: uploading and manipulating sequence information and coding parameters
(2) Main window: signal curve display
(3) Position panel: position information for the currently displayed sequence range
Applications
1. Visualization of evolutionary events
The DiPro Genome Browser can be used to distinguish between 3 types of rRNA gene clusters in chloroplast genomes. The patterns are best seen when applying the free energy change dataset for the DNA double strand.
1.) Inverted Repeats (25 kb): 79 of 88 genomes (figure: Saccharum officinarum chloroplast)
2.) Inverted Repeat Lacking Clade: 7 of 88 genomes (figure: Pinus thunbergii chloroplast)
3.) 3 Direct Repeats: 2 of 88 genomes (subclass: Euglenozoa; figure: Euglena gracilis chloroplast)
2. Visualization of gene and exon/intron organization
With DiProGB it can be shown that genes tend to be purine-rich. In the figures, the sequence (positive strand) is encoded by the purine content. On the left side all genes of the + strand and on the right side all genes of the – strand are shown in red. (figure: Ureaplasma parvum serovar 3 str.)
The exon (red) and intron (green) structure of a given gene can be seen in a GC content representation. Exons tend to have a higher GC content than introns. (figure: Euglena gracilis chloroplast (76235-81341))
3. Repeats which cannot be found by standard repeat search methods
We have shown this by hiding DNA sequence repeats in an artificial sequence that has only 50% alignment identity with the original. The artificial sequence contains the same repeats, but they are visible only in the signal representation.
1.) Original sequence repeats
2.) The same repeats hidden in an artificial sequence with only 50% sequence identity
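How a repeat can vanish at the character level yet persist in the signal can be shown with a small constructed example. The poster does not say how its artificial sequence was built, so the scheme below is purely an assumption for illustration: every second base is swapped for the other base of the same class (A with G, C with T), and the sequence is encoded by purine content per dinucleotide. The names `disguise`, `identity`, and `purine_signal` are hypothetical.

```python
# Swapping within a class keeps purines (A, G) purines and pyrimidines (C, T) pyrimidines.
SWAP = {"A": "G", "G": "A", "C": "T", "T": "C"}


def purine_signal(seq):
    """Number of purines (A or G) in each overlapping dinucleotide."""
    return [(a in "AG") + (b in "AG") for a, b in zip(seq, seq[1:])]


def disguise(seq):
    """Swap every second base within its class (assumed construction)."""
    return "".join(SWAP[c] if i % 2 else c for i, c in enumerate(seq))


def identity(a, b):
    """Fraction of positions at which two equally long sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


repeat = "ACGTTGCA"
hidden = disguise(repeat)                     # "ATGCTACG"
shared = identity(repeat, hidden)             # 0.5
same_signal = purine_signal(repeat) == purine_signal(hidden)  # True
```

The two copies agree at only half of their positions, so a character-based repeat search with a typical identity threshold would miss the pair, while their purine-content signals are identical.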
Conclusion:
The genome browser DiProGB is a powerful new tool for motif discovery in genomes. In addition to the standard sequence representation, the DNA is also analyzed in terms of thermodynamic and geometrical dinucleotide properties. More than 100 property sets are available in the new database DiProDB. This makes it possible to identify and visualize a broad range of both known and unknown genome patterns. The new way of seeing the genome can lead to a better understanding of its organization and function.