- Dr. Maik Friedel

Maik Friedel, Thomas Wilhelm, Jürgen Sühnel
FLI-Jena, Germany
http://www.fli-leibniz.de/tsb
Introduction:
During the last 10 years, a large number of complete genomes have been sequenced. With these data at hand, the basic aim is now to convert this information into biological knowledge. This requires the identification of biologically meaningful motifs in genomic data. Computational motif discovery has been used with some success in simple organisms such as yeast. For higher organisms with more complex genomes, more sensitive methods are required. There is also a growing awareness that not single motifs but motif combinations, usually called modules, may be relevant to biological function.
We describe here a new type of GenomeBrowser that offers user-friendly tools for the statistical analysis of single and multiple sequences as well as for the visual exploration of single sequences. A distinctive feature is that three kinds of sequence representation can be used: the standard representation in terms of the bases A, T, G and C; a reduced representation by purine/pyrimidine and AT/GC characteristics; and a representation in terms of a large number of dinucleotide parameters that can encode, for example, geometrical information on DNA structure. All of these coding schemes can be converted into a signal representation that allows for very effective visual motif discovery. Analyses can be performed for the + and – strands as well as for the double strand. Combining these sequence- and signal-based representations offers a new approach for the detection of new regulatory elements. The functionalities described make the GenomeBrowser a unique tool for the identification and analysis of functional motifs in genomes.
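The reduced representations mentioned above can be illustrated with a minimal Python sketch. It is not taken from the GenomeBrowser itself; the function names and the one-letter codes R/Y (purine/pyrimidine) and W/S (AT/GC, after the IUPAC weak/strong convention) are assumptions chosen for illustration.

```python
# Illustrative sketch only; names and letter codes are assumptions,
# not the GenomeBrowser's actual implementation.

# A, G -> purine (R); C, T -> pyrimidine (Y)
PURINE_PYRIMIDINE = str.maketrans("AGCT", "RRYY")
# A, T -> weak pairing (W); G, C -> strong pairing (S)
WEAK_STRONG = str.maketrans("ATGC", "WWSS")


def reduce_sequence(seq, table):
    """Recode a DNA sequence with a two-letter alphabet."""
    return seq.upper().translate(table)


def to_binary_signal(reduced, high):
    """Turn a reduced sequence into a 0/1 signal (1 where the letter equals `high`)."""
    return [1 if ch == high else 0 for ch in reduced]


seq = "ATGCGT"
ry = reduce_sequence(seq, PURINE_PYRIMIDINE)   # "RYRYRY"
ws = reduce_sequence(seq, WEAK_STRONG)         # "WWSSSW"
gc_signal = to_binary_signal(ws, "S")          # [0, 0, 1, 1, 1, 0]
print(ry, ws, gc_signal)
```

Averaging such a 0/1 signal over a window gives, for instance, a local GC content or purine content curve.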
Implemented tools
1. Repeat finder
Tool to search for any type of simple repeat in the sequence or signal representation
2. Motif finder
Tool for searching DNA motifs in the sequence or signal representation
3. Average statistic
Tool for calculating the average of any type of DNA feature over selected DNA fragments
4. Showing underlying DNA sequence
Feature that shows the underlying DNA sequence of a selected part of the signal representation
5. Property editor
Tool for searching, filtering and selecting all types of features indicated in the GenBank file
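As a rough illustration of what a sequence-based motif search on both strands involves, the following Python sketch reports exact matches on the + strand and, via the reverse complement, on the – strand. The exact-match strategy and all names are assumptions; the poster does not say how the Motif finder works internally, and searching in the signal representation is not covered here.

```python
# Illustrative sketch only; exact matching and all names are assumptions.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_motif(seq, motif):
    """Return (strand, start) pairs for every exact, possibly overlapping hit.

    Start positions are 0-based and always given in + strand coordinates.
    A motif on the - strand appears as its reverse complement on the + strand.
    """
    seq = seq.upper()
    hits = []
    for strand, pattern in (("+", motif.upper()), ("-", reverse_complement(motif))):
        start = seq.find(pattern)
        while start != -1:
            hits.append((strand, start))
            start = seq.find(pattern, start + 1)
    return hits


print(find_motif("TTGACGTTTCGTCAA", "GACG"))  # [('+', 2), ('-', 9)]
```

A palindromic motif (one equal to its own reverse complement) is reported once per strand at the same position.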
Main window
The main window of the GenomeBrowser consists of three panels. The first (1) is the control panel, which allows uploading and manipulation of sequence and coding parameter information. The main field (2) shows the signal curve, and the third panel (3) shows the position information of the sequence range currently displayed.
The DNA sequence in FASTA or GenBank format is converted into a signal representation by applying dinucleotide parameters and smoothed using a sliding window technique. All sequence features included in the GenBank file can be selected and shown in different colors.
Parameters
To visualize biochemical and biophysical properties of a DNA strand we have included about 40 different dinucleotide properties. All parameters are available for the complete set of 16 dinucleotide combinations.
The table shows, as an example, the free energy change (B-DNA) [kcal/mol] for the set of all 16 dinucleotides (M. Aida, J. Theor. Biol. 130, 327-335 (1988)).
AA  -1.20    AC  -1.50    AG  -1.50    AT  -0.90
CA  -1.70    CC  -2.10    CG  -2.80    CT  -1.50
GA  -1.50    GC  -2.30    GG  -2.10    GT  -1.50
TA  -0.90    TC  -1.50    TG  -1.70    TT  -1.20
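The conversion of a sequence into a smoothed signal can be sketched in Python using the free energy values from the table. This is a minimal sketch under stated assumptions: the function names, the plain moving average and the window size are illustrative choices, and the GenomeBrowser's actual smoothing may differ.

```python
# Illustrative sketch only; function names, window size and the plain
# moving average are assumptions, not the GenomeBrowser's implementation.

# Free energy change (B-DNA) in kcal/mol, values as given in the table.
FREE_ENERGY = {
    "AA": -1.20, "AC": -1.50, "AG": -1.50, "AT": -0.90,
    "CA": -1.70, "CC": -2.10, "CG": -2.80, "CT": -1.50,
    "GA": -1.50, "GC": -2.30, "GG": -2.10, "GT": -1.50,
    "TA": -0.90, "TC": -1.50, "TG": -1.70, "TT": -1.20,
}


def dinucleotide_signal(seq, params=FREE_ENERGY):
    """Map each overlapping dinucleotide to its parameter value.

    A sequence of length n yields n - 1 values. Characters other than
    A, C, G, T (for example N) raise a KeyError in this sketch.
    """
    seq = seq.upper()
    return [params[seq[i:i + 2]] for i in range(len(seq) - 1)]


def smooth(signal, window):
    """Average the signal over a window that slides one position at a time."""
    if not 1 <= window <= len(signal):
        raise ValueError("window must be between 1 and len(signal)")
    return [
        sum(signal[i:i + window]) / window
        for i in range(len(signal) - window + 1)
    ]


raw = dinucleotide_signal("ACGT")   # values for AC, CG, GT
print(raw)
print(smooth(raw, window=2))
```

Any of the other dinucleotide properties could be substituted by passing a different 16-entry dictionary as `params`.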
First applications
1. Visualization of evolutionary events
The GenomeBrowser can be used to distinguish between 3 types of rRNA gene clusters in chloroplast genomes. The patterns can best be seen by applying the free energy change measure for the DNA double strand.
1.) Inverted Repeats (25 kb): 79 of 88 genomes
2.) Inverted Repeat Lacking Clade: 7 of 88 genomes
3.) 3 Directed Repeats: 2 of 88 genomes (subclass: Euglenozoa)
2. Visualization of gene and exon/intron organization
With the help of the GenomeBrowser it can be shown that genes tend to be purine-rich. In both pictures the + strand is encoded by its pyrimidine content. In the left picture all genes of the + strand are shown in red, in the right picture all genes of the – strand.
The exon (red) and intron (green) structure of a given gene can be seen by adopting a GC content representation. Exons tend to have a higher GC content than introns.
3. Repeats which cannot be found by standard repeat search methods
We have shown this by hiding DNA sequence repeats in an artificial sequence with only 50% alignment identity. The new sequence contains the same repeats, but they are visible only in the signal representation.
1.) original sequence repeats
2.) the same repeats hidden in an artificial sequence with only 50% sequence identity
Conclusion:
The GenomeBrowser is a powerful new tool for motif discovery in genomes. In addition to the standard sequence representation, the DNA is also analysed in terms of biochemical and biophysical dinucleotide properties. This makes it possible to identify and visualize a broad range of both known and unknown genome patterns. This new way of seeing the genome can lead to a better understanding of its organisation and function.