Genome Informatics, CSHL, Nov. 2011


GENOMIC COLOCATION: A NEW OPTION IN THE STRATEGIES WDK TO COMBINE RESULT SETS USING RELATIVE GENOMIC LOCATIONS
Cristina Aurrecoechea¹, Brian P. Brunk², Steve Fischer², Xin Gao², Omar S. Harb², Mark Heiges¹, Jessica C. Kissinger¹, Eileen T. Kraemer¹, Cary Pennington¹, David S. Roos², Christian J. Stoeckert², Charles Treatman² & Susanne Warrenfeltz¹
¹Univ. Georgia, Athens GA, & ²Univ. Pennsylvania, Philadelphia PA
The EuPathDB suite of databases covers genomic and functional
genomics datasets for a variety of eukaryotic pathogens.
The EuPathDB Strategies Web Development Kit (Strategies WDK) is a search system and graphical interface for integrative genomics databases that
helps users perform dynamic in silico experiments. A search strategy is built up from individual searches using a graphical display that illustrates
how searches are combined and facilitates revising individual steps, with changes propagated forward through the strategy. The output of a
strategy might be a set of genes, SNPs, clinical or field isolates or any other data type in the database.
We recently added a genomic colocation option to combine the results of two consecutive steps in a strategy based on their members’ relative
genomic locations. The participating steps must contain features that map uniquely to the genome. Supported operators are overlap and contain.
For example, in a genes strategy a user may now add a search for a DNA motif and specify that the motif must overlap the region 500 bp
upstream of a gene, thus retaining only genes that have this motif in their promoter region. Other use cases include identifying SNPs upstream of
protein coding genes that are differentially expressed in two different strains, or identifying divergently transcribed genes that appear on opposite
strands with overlapping upstream regions. (For a strategy on genes with an AP-2 like motif see
http://plasmodb.org/plasmo/im.do?s=0ffa670cc2b0a579.)
The development of the genomic colocation user interface was challenging, as we will present, involving an iterative process that included
usability studies. The GUI guides users through specification of a search region in each of the two result sets to be combined, and the operator
that applies to them (overlap, contain). Each selected region is a configurable interval upstream, downstream or arbitrarily located with respect to
the genomic features in the set, and includes DNA strand orientation. The user also specifies from which input result set to draw the final results.
For example, in a genomic colocation operation that combines genes and SNPs, the user may choose to return either genes or SNPs. Members of the
chosen set are returned if their specified region relates to any specified region in the other set according to the chosen operator.
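The operation described above can be sketched in a few lines of Python. This is only an illustration, not the Strategies WDK implementation: the `Feature` class, the function names, and the 1-based inclusive coordinate convention are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    # Hypothetical record; coordinates are assumed 1-based and inclusive.
    name: str
    seq_id: str   # chromosome or contig
    start: int    # start <= end
    end: int
    strand: str   # "+" or "-"


def exact_region(f):
    """The feature's own interval."""
    return (f.start, f.end)


def upstream_region(f, length):
    """Interval of `length` bp immediately upstream, respecting strand."""
    if f.strand == "+":
        return (max(1, f.start - length), f.start - 1)
    return (f.end + 1, f.end + length)


def overlaps(a, b):
    """True if intervals a and b share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def contains(a, b):
    """True if interval a fully contains interval b."""
    return a[0] <= b[0] and b[1] <= a[1]


def colocate(set_a, region_a, set_b, region_b, operator, return_from="a"):
    """Return members of the chosen set whose region relates, via
    `operator`, to any region in the other set on the same sequence."""
    hits = []
    for a in set_a:
        for b in set_b:
            if a.seq_id == b.seq_id and operator(region_a(a), region_b(b)):
                hit = a if return_from == "a" else b
                if hit not in hits:
                    hits.append(hit)
    return hits


# Made-up example data: genes whose 500 bp upstream region contains a motif.
genes = [Feature("geneA", "chr1", 1000, 2000, "+"),
         Feature("geneB", "chr1", 5000, 6000, "-")]
motifs = [Feature("m1", "chr1", 700, 710, "+"),
          Feature("m2", "chr1", 6100, 6110, "+"),
          Feature("m3", "chr1", 3000, 3010, "+")]

genes_with_motif = colocate(genes, lambda g: upstream_region(g, 500),
                            motifs, exact_region, contains, return_from="a")
motifs_in_promoters = colocate(genes, lambda g: upstream_region(g, 500),
                               motifs, exact_region, contains, return_from="b")
print([g.name for g in genes_with_motif])
print([m.name for m in motifs_in_promoters])
```

The nested loop compares every pair and is quadratic in the set sizes; it is kept for readability. A production system would more likely push the comparison into the database or use an interval index.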
The Strategies WDK system is schema independent and available for download and installation at
http://code.google.com/p/strategies-wdk.
Did you ever wish you could intersect different genomic data sets on the basis of their genomic location?
For example, find annotated genes that have a specific DNA motif within the 500 base pairs upstream of the gene.
The goal was to develop a mechanism to identify features based on their
relative genomic coordinates (genomic colocation). Specifically, given two
sets of features with uniquely defined coordinates on a genomic sequence
(e.g., genes, SNPs, motifs, SAGE tags, ORFs), determine which feature pairs
“comply” with a user-defined relative distance from each other in the
genome and a relative strandness.
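As a rough illustration of what "compliance" could mean for a single pair, the sketch below checks a maximum gap and a strand relationship. The tuple layout, the function names, and the relation labels `"same"`, `"opposite"`, and `"either"` are hypothetical and are not taken from the WDK.

```python
def gap(a, b):
    """Number of bases separating two (start, end) intervals;
    0 if they overlap or are directly adjacent."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]) - 1)


def strand_ok(strand_a, strand_b, relation):
    """relation is 'same', 'opposite', or 'either' (assumed labels)."""
    if relation == "either":
        return True
    same = strand_a == strand_b
    return same if relation == "same" else not same


def complies(a, b, max_gap, relation):
    """a and b are (seq_id, start, end, strand) tuples,
    with 1-based inclusive coordinates (an assumption)."""
    seq_a, start_a, end_a, strand_a = a
    seq_b, start_b, end_b, strand_b = b
    return (seq_a == seq_b
            and gap((start_a, end_a), (start_b, end_b)) <= max_gap
            and strand_ok(strand_a, strand_b, relation))


# Made-up example: a SNP 49 bases to the left of a gene.
gene = ("chr1", 1000, 2000, "+")
snp = ("chr1", 950, 950, "+")
print(complies(gene, snp, max_gap=50, relation="either"))
print(complies(gene, snp, max_gap=10, relation="either"))
```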
[Figure: genome tracks labeled genes, motifs, and result set, with a 500 bp upstream region marked.]
The final version of the colocation user interface involves building a logical genomic colocation statement.
The challenge was to design an intuitive user interface for combining feature
sets based on genomic colocation. This involved multiple iterations with
input from the EuPathDB user community.
The statement can be modified using drop-down menus.
For each set we define a region relative to each feature:
• (a) genes: 500 bp upstream
• (b) motifs: exact region
(c) Next we define how these regions should relate. We want the motif region contained within the gene upstream region.
(d) Define the “strandness” relationship for a pair of features to be “compliant”.
(e) Select from which set you want the “compliant” features in your result.
[Figure: genes, gene regions, and motif regions, with parts of the colocation statement labeled (a) through (e).]
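One way to picture the logical statement built in steps (a) through (e) is as a small record with one field per drop-down menu. The sketch below is purely illustrative: the class names, field values, and sentence wording are invented for this example and do not reproduce the WDK interface text. The poster does not state a strandness choice for the motif example, so "either strand" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class RegionChoice:
    feature_type: str    # e.g. "gene" or "motif"
    anchor: str          # "exact", "upstream", or "downstream"
    length_bp: int = 0   # ignored when anchor == "exact"

    def describe(self):
        if self.anchor == "exact":
            return f"the exact region of a {self.feature_type}"
        return f"the {self.length_bp} bp {self.anchor} of a {self.feature_type}"


@dataclass
class ColocationStatement:
    region_a: RegionChoice   # step (a)
    region_b: RegionChoice   # step (b)
    operator: str            # step (c): "overlaps" or "contains"
    strandness: str          # step (d): e.g. "either strand"
    return_type: str         # step (e): which set to return

    def render(self):
        """Compose the individual choices into one readable sentence."""
        return (f"Return each {self.return_type} where "
                f"{self.region_a.describe()} {self.operator} "
                f"{self.region_b.describe()} "
                f"and the pair is on {self.strandness}.")


statement = ColocationStatement(
    region_a=RegionChoice("gene", "upstream", 500),
    region_b=RegionChoice("motif", "exact"),
    operator="contains",
    strandness="either strand",   # assumed; not specified on the poster
    return_type="gene",
)
print(statement.render())
```

Changing a single field, for example `return_type="motif"`, re-renders the whole sentence, which mirrors the idea of revising one menu at a time.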
Colocation enhances data integration via our search strategy system
Identify genes that may be co-regulated by shared promoter elements.
The genes should be located within 1000 bp of each other, be divergently transcribed, be expressed maximally at day 30 of the iRBC cycle ± 8 hrs, and show at least a 4-fold increase in expression.
1. Search for the expressed genes.
2. Add the same set of genes as a second step in your strategy.
3. Select the colocation operation.
• Color coded sets information (blue and red).
• Instantaneous graphical feedback on regions selected.
• Instantaneous graphical feedback on regions relationship.
• Instantaneous feedback on what features will be returned.
EuPathDB is an NIAID Bioinformatics Resource Center supported by NIAID Contract No. HHSN266200400037C and The Bill & Melinda Gates Foundation.
4. We turn on the “Pf-iRBC expression profile graph (GS array)” column to assess how well the pairs of genes compare in terms of expression.
Let’s specify the colocation for the genes we are looking for!
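A minimal sketch of the divergent-pair colocation, assuming the expression filter has already produced the gene list. It treats two genes as a divergent pair when they lie on opposite strands and the 1000 bp upstream region of one overlaps the other gene. That rule is one possible reading of the criteria above, not necessarily the setting used on the poster, and the tuple layout and function names are hypothetical.

```python
from itertools import combinations


def upstream(start, end, strand, length):
    """Interval of `length` bp immediately upstream, respecting strand
    (1-based inclusive coordinates assumed)."""
    if strand == "+":
        return (max(1, start - length), start - 1)
    return (end + 1, end + length)


def overlaps(a, b):
    """True if intervals a and b share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def divergent_pairs(genes, max_dist=1000):
    """genes: list of (name, seq_id, start, end, strand) tuples.
    Returns pairs of names on opposite strands where the upstream
    region of the first gene overlaps the second gene."""
    pairs = []
    for g, h in combinations(genes, 2):
        name_g, seq_g, start_g, end_g, strand_g = g
        name_h, seq_h, start_h, end_h, strand_h = h
        if seq_g != seq_h or strand_g == strand_h:
            continue
        if overlaps(upstream(start_g, end_g, strand_g, max_dist),
                    (start_h, end_h)):
            pairs.append((name_g, name_h))
    return pairs


# Made-up coordinates: g1/g2 are head-to-head, g2/g3 are tail-to-tail.
expressed = [("g1", "chr1", 100, 500, "-"),
             ("g2", "chr1", 1200, 2000, "+"),
             ("g3", "chr1", 2500, 3000, "-"),
             ("g4", "chr2", 100, 500, "-")]
print(divergent_pairs(expressed))
```

Because the upstream region points away from a gene's body, genes on opposite strands that face each other tail-to-tail (convergent) do not match, so only head-to-head neighbors are reported.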