
BINF 4360, Fall 2007
Rachel Adams, Jerry Choate, Nathan
Harrelson, Divya Mistry, and Whitney Smith
Overview
Goals
 Implementation
 Interface
 Images
 Final product
 Conclusions

Goals
Create a dynamic map of the Shewanella oneidensis MR-1 genome
Populate a local database with relevant information from web-based databases
Provide an efficient searching algorithm for key terms
Implement user-friendly navigation and readability

Implementation
SQL Schema
 Parsing
 Databases

Parsing

XPath

XPath was used to quickly parse through XML documents generated from
NCBI’s SOAP interface.
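use XML::XPath;

# $file is the path to an XML document saved from NCBI
my $xp = XML::XPath->new(filename => $file);
my ($name, $locus);
# Read the gene name and locus tag from each Gene-ref element
foreach my $var ($xp->find('//Gene-ref')->get_nodelist) {
    $name  = $var->find('Gene-ref_locus')->string_value;
    $locus = $var->find('Gene-ref_locus-tag')->string_value;
}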

LWP::Simple


LWP::Simple was used to grab content from a URL so it could be easily written to an
XML file.
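
A minimal sketch of that fetch-and-save step, assuming the getstore function exported by LWP::Simple. The helper name fetch_to_file is hypothetical, and the URL comes from the command line, so no particular NCBI address is assumed.

```perl
use strict;
use warnings;
use LWP::Simple qw(getstore is_success);

# Download $url and save the response body to $file.
# Returns true on success, false otherwise.
sub fetch_to_file {
    my ($url, $file) = @_;
    my $status = getstore($url, $file);   # returns the HTTP status code
    return is_success($status);
}

# Usage: perl fetch.pl <url> <output.xml>
# The URL is supplied by the caller; no real endpoint is assumed here.
if (@ARGV == 2) {
    my ($url, $file) = @ARGV;
    fetch_to_file($url, $file)
        or die "Could not fetch $url\n";
    print "Saved $url to $file\n";
}
```

Checking the status code matters because getstore writes nothing useful on a failed request, and a later parsing step would otherwise run on a missing or empty file.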
Regular Expressions

Regular expressions were used to parse through HTML files, match
specific string patterns, and manipulate text.
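
As an illustration only, the sketch below pulls the coords, href, and title attributes out of HTML image-map area tags with regular expressions. It assumes the parsed HTML contained such tags, which the slides do not state; the sub name parse_areas and the sample markup are hypothetical.

```perl
use strict;
use warnings;

# Extract attribute values from each <area> tag in an HTML string.
# Returns a list of hash references, one per tag that has coords.
sub parse_areas {
    my ($html) = @_;
    my @areas;
    while ($html =~ /<area\b([^>]*)>/gi) {
        my $attrs = $1;
        my %area;
        # Collect attribute="value" pairs inside the tag
        while ($attrs =~ /(\w+)\s*=\s*"([^"]*)"/g) {
            $area{lc $1} = $2;
        }
        push @areas, \%area if exists $area{coords};
    }
    return @areas;
}

# Hypothetical sample input, not taken from any real pathway page
my $html = <<'HTML';
<map name="example">
<area shape="rect" coords="10,20,56,37" href="/entry/SO_0001" title="SO_0001" />
<area shape="rect" coords="100,20,146,37" href="/entry/SO_0002" title="SO_0002" />
</map>
HTML

for my $area (parse_areas($html)) {
    print join("\t", @{$area}{qw(coords href title)}), "\n";
}
```

Regular expressions like these only handle double-quoted attributes and well-behaved markup; a dedicated HTML parser would be more robust on irregular pages.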
Schema

Sequences (area_area_id_seq, img_img_id_seq), each with columns:
sequence_name name, last_value bigint, increment_by bigint, max_value bigint, min_value bigint, cache_value bigint, log_cnt bigint, is_cycled boolean, is_called boolean

Tables:
img: img_id integer, map varchar(5)
imgplacement: img_id integer, tilex integer, tiley integer
area: area_id integer, target text, href text, coords text, title text, img_id integer
ncbi_proteins: locus_tag text, description text, date date, gene text, defintion text
ncbi_genes: id integer, location text, name text, description text, locus_tag text, function text, month integer, day integer, year integer, cog_id text, gi text, img_id text
pdb: id text, pdb text
kegg: gene_id text, kegg_id text

Databases

NCBI

Local databases were populated using information retrieved from gene, protein, and
3D domain web-based databases.

COG

Clusters of Orthologous Groups of proteins (COGs) were delineated by comparing
protein sequences encoded in complete genomes, representing major phylogenetic
lineages.

IMG

The Integrated Microbial Genomes (IMG) system's goal is to facilitate the visualization
and exploration of genomes from a functional and evolutionary perspective.

KEGG

KEGG, the Kyoto Encyclopedia of Genes and Genomes, stores knowledge-based methods
for uncovering higher-order systemic behaviors of the cell and the organism from
genomic information.

More Databases

MIST

The Microbial Signal Transduction database contains the signal transduction proteins
for 591 complete bacterial and archaeal organisms.

ORNL

The Genome Analysis and System Modeling Group of the Life Sciences Division of
ORNL provides bioinformatics and analytic services and resources to collaborators,
predicts prospective gene and protein models for analysis, and provides user services
for the general community.

PDB

The RCSB PDB provides a variety of tools and resources for studying the structures
of biological macromolecules and their relationships to sequence, function, and
disease.

ShewCyc

ShewCyc, part of BioCyc (a collection of 371 Pathway/Genome Databases), describes
the genome and metabolic pathways of Shewanella oneidensis MR-1.

Interface

Functions provided by the Google Maps API
were used to display pathways of the
Shewanella genome.




A small overview map is provided to give a bird’s eye view of the entire image. The
current view is indicated with a translucent box.
The user has the ability to view the pathways using 5 different zoom levels.
Text balloons show information relevant to the user’s selected target.
A search bar offers quick targeting of a user’s
query of interest.


The user can either pan over the images and click on areas of interest or
enter a query in a search bar to find specific information.
If the user submits a term to be queried, relevant targets are indicated on the
map with colored pins.
Images

ImageMagick is a free software suite to
create, edit, and compose bitmap images.

The main functions that we took advantage of
included the ability to resize, sharpen, pad, and
stitch together images.
We were also able to create a composite
image by combining 212 separate
images.
Placing the images within 16384 by 16384
pixels took strategic manipulation and
tedious offset calculation.
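
The kind of offset arithmetic involved can be sketched as follows. This is an illustration under stated assumptions, not the project's actual calculation: it assumes 256-pixel square tiles and that each step down in zoom halves the resolution, so the 16384-pixel composite is 64 (2**6) tiles wide at the deepest zoom. The sub name pixel_to_tile is hypothetical; tilex and tiley echo the imgplacement column names.

```perl
use strict;
use warnings;
use POSIX qw(floor);

my $TILE_SIZE = 256;     # assumed tile edge in pixels
my $FULL_SIZE = 16384;   # composite image edge in pixels (from the slides)
my $MAX_ZOOM  = 6;       # assumed: 16384 / 256 = 64 = 2**6 tiles per side

# Map a pixel position on the full-size composite to the tile that
# contains it at the given zoom level, plus the offset inside that tile.
sub pixel_to_tile {
    my ($x, $y, $zoom) = @_;
    my $scale = 2 ** ($MAX_ZOOM - $zoom);   # downscale factor at this zoom
    my ($zx, $zy) = ($x / $scale, $y / $scale);
    return (
        floor($zx / $TILE_SIZE), floor($zy / $TILE_SIZE),   # tile x, tile y
        floor($zx) % $TILE_SIZE, floor($zy) % $TILE_SIZE,   # offset in tile
    );
}

my ($tilex, $tiley, $offx, $offy) = pixel_to_tile(5000, 12000, 6);
print "tile ($tilex, $tiley), offset ($offx, $offy)\n";
```

Under these assumptions, the same pixel position lands in a different tile at each zoom level, which is why the placement of every source image has to be recomputed per level.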

Final Product
Zoomed image
Query for glycogen
Query for ATP
Conclusions



Using Google Maps, we were able to create a searchable
map of pathways in the Shewanella genome.
Efficient parsing methods made collecting and querying data
far simpler.
With more time, additional improvements could be
implemented to increase the usability of this application.



Currently we offer links to images, but it would be optimal to have
thumbnails of the pictures themselves readily viewable.
Google Web Toolkit has several functions that would make more
information available to the user. Tabs on text balloons could
separate data into topical subgroups. Overlaying a transparent map
on top of the current map could be a useful tool for comparing two
pathways.
Additionally, the overall scope of the project would be enhanced if we
had even more in-depth zoom levels, such that the user could actually
see the sequence of the amino acids and nucleotides.