CTEGD Symposium, UGA, Athens, May 2011


EuPathDB: an integrated resource and tool for eukaryotic pathogen bioinformatics
Aurrecoechea C., Heiges M., Warrenfeltz S. for the EuPathDB team
CTEGD, University of Georgia, Athens, GA USA
ABSTRACT: EuPathDB (http://eupathdb.org) is an integrated bioinformatics database covering several
eukaryotic pathogens. Genera represented are Cryptosporidium, Encephalitozoon, Entamoeba,
Enterocytozoon, Giardia, Leishmania, Neospora, Plasmodium, Toxoplasma, Trichomonas and
Trypanosoma, and the newly added Theileria and Babesia. Each of these groups is supported by a taxon-specific database and web interface which can be accessed independently of EuPathDB. EuPathDB
provides a portal to all these databases, and the opportunity to leverage orthology for searches across
genera. The databases are updated and expanded about every 2 months, providing online access to the
latest genomic-scale datasets including complete genome sequences, annotations, and functional
genomics such as proteomics, microarray, RNA-Seq, ChIP-chip, SAGE and EST data.
The specific advantage of the EuPathDB databases lies in the graphical search interface that allows
users to combine datasets while building a search strategy. Multistep search strategies are built one step
at a time choosing from more than 100 searches. The latest EuPathDB release debuts a search for DNA
motifs and a method of combining searches based on relative genomic location. This new operation allows
the results of successive steps to be combined based on each feature’s location relative to other features.
Parameters defining upstream/downstream distances and gene overlap restrict the search results in a way
that highlights biologically relevant relationships such as antisense transcription and promoter sharing.
The merger of EuPathDB’s user-friendly search strategy system with full and up-to-date databases offers
researchers a powerful tool for data mining during computational experiments.
New way of combining searches
based on relative genomic location:
Building search strategies:
Graphical search interface motivates users to prioritize search results based on a variety of
data types. The search strategy system provides the opportunity to explore and
identify biologically meaningful relationships
1. Run a search. This search for all protein-coding genes in P. falciparum returned 5418 genes.
● Quick access to ID and text search
options, login, contact, twitter, etc.
● Main Header Tab Bar: mouse-over
‘New Search’ to initiate searches;
click ‘My Strategies’ to enter your
workspace
2
● Portal to EuPathDB databases by
clicking on icons
● Initiate searches from center panels.
Over 100 search types available.
● Identify Genes by: look for Genes
based on a variety of datasets,
including whole genome sequence,
coding vs non-coding genes, transcript
evidence (microarray, EST), exon count,
etc.
● Identify Other Data Types: Look for
ESTs, SNPs or DNA motifs.
● Tools: Access tools like Blast and
PubMed from any EuPathDB home
page
3
1. Run a query choosing from more than 100 searches.
   ● Build strategies for several data types: genes, ESTs, SNPs, ORFs, etc.
2. Add a step: run a second query, combining its results with previous searches.
   ● Query the results of Step 1 based on functional genomics.
   ● Nest strategies to build complexity.
3. Add more steps…
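The step-by-step model above can be pictured as set operations on lists of IDs. The following is a minimal sketch under that assumption, not the EuPathDB implementation: the function `combine_steps`, the operator names and the gene IDs are all hypothetical and chosen only for illustration.

```python
def combine_steps(previous, current, operator):
    """Combine the ID set of the previous step with the ID set of a new search.

    The operator names are assumptions made for this sketch.
    """
    if operator == "intersect":
        return previous & current
    if operator == "union":
        return previous | current
    if operator == "minus":
        return previous - current
    raise ValueError(f"unknown operator: {operator}")


# Hypothetical gene IDs standing in for the results of three searches.
step1 = {"gene_A", "gene_B", "gene_C"}   # e.g. a gene search
step2 = {"gene_B", "gene_C", "gene_D"}   # e.g. a functional genomics search
step3 = {"gene_C"}                       # e.g. a further filter

# A three-step strategy: each step refines the result of the one before it.
strategy = combine_steps(step1, step2, "intersect")
strategy = combine_steps(strategy, step3, "minus")
print(sorted(strategy))
```

Each call consumes the previous result, which mirrors how a strategy grows one step at a time.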
Add a step. The second search here, based on DNA motif, searches for the EcoRI restriction enzyme site.
Combine search results using the co-location function.
E. dispar,
E. histolytica,
E. invadens
C. hominis,
C. muris,
C. parvum
New Search Type: DNA Motif Pattern
G.lamblia,
G.assemblage_B,
G.assemblage_E
E.cuniculi,
E.intestinalis, E.bieneusi,
N.parisii, O.bayeri
Taxon-specific databases provide access to the latest available genome-scale datasets. Because all databases are built on the same web architecture, search types and functions are the same across them.
B.bovis,
T.annulata,
T.parva
1. Search for DNA motifs such as restriction enzyme sites or transcription factor binding sites.
Choose Genomic segments, DNA Motif Pattern.
P.berghei, P.chabaudi,
P.falciparum, P.gallinaceum,
P.knowlesi, P.vivax,P.yoelii
4
N.caninum,
T.gondii
T.vaginalis
C.fasciculata, L.braziliensis, T.cruzi
L.infantum, L.major, L.mexicana, T.vivax
L.tarentolae, T.brucei, T.congolense
2. Initiate the search. It will find all occurrences of GAATTC in the genome.
Searches return a list of IDs (genes, ESTs, SNPs, proteins) that satisfy the conditions of your
query parameters. This gene search for protein coding genes in P. falciparum returned 5418
gene IDs.
● Results table with ID as the first
column. Columns can be added,
changed, deleted or sorted. Entire table
can be downloaded as Excel or other
formats.
● Click on the ID name to access details
in that ID’s record page.
i. Return IDs from either step 1 or 2.
ii. Define relative location (“Region”) of the returned data type. Search the exact region, upstream, or downstream of the returned data type.
iii. Define relationship between step 1 and 2 results’ regions: contains, overlaps, or is contained in.
iv. Define relative location (“Region”) of the other (non-returned) step result.
v. Define strand to be considered in the operation: either, same or both.
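The logic of these parameters can be sketched as interval arithmetic. The example below is an illustration under stated assumptions, not the EuPathDB implementation: the `Feature` class, the function names, the coordinates and the IDs are hypothetical; coordinates are assumed to be 1-based and inclusive; and the strand parameter is left out to keep the sketch short. It returns the genes whose upstream region contains a motif site.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    feature_id: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def upstream_region(feature, length):
    """Return (start, end) of the `length` bp immediately upstream of a feature."""
    if feature.strand == "+":
        return max(1, feature.start - length), feature.start - 1
    # On the reverse strand, upstream lies at higher coordinates.
    return feature.end + 1, feature.end + length


def relates(region_a, region_b, relationship):
    """Test whether region_a contains, overlaps, or is contained in region_b."""
    a_start, a_end = region_a
    b_start, b_end = region_b
    if relationship == "contains":
        return a_start <= b_start and b_end <= a_end
    if relationship == "is contained in":
        return b_start <= a_start and a_end <= b_end
    if relationship == "overlaps":
        return a_start <= b_end and b_start <= a_end
    raise ValueError(f"unknown relationship: {relationship}")


def colocate(genes, sites, upstream_bp, relationship):
    """Return IDs of genes whose upstream region relates to at least one site."""
    hits = []
    for gene in genes:
        region = upstream_region(gene, upstream_bp)
        if any(relates(region, (s.start, s.end), relationship) for s in sites):
            hits.append(gene.feature_id)
    return hits


# Hypothetical features on a single sequence.
genes = [
    Feature("gene_plus", 1000, 2000, "+"),   # upstream region: 500..999
    Feature("gene_minus", 3000, 4000, "-"),  # upstream region: 4001..4500
    Feature("gene_far", 9000, 9500, "+"),    # upstream region: 8500..8999
]
sites = [
    Feature("site_1", 700, 705, "+"),
    Feature("site_2", 4100, 4105, "+"),
    Feature("site_3", 6000, 6005, "+"),
]

hits = colocate(genes, sites, upstream_bp=500, relationship="contains")
print(hits)
```

Returning gene IDs here corresponds to choosing which step's IDs to return (parameter i); swapping the roles of `genes` and `sites` would return the motif segments instead.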
5
● Graphical representation of your
search strategy. Each step can be revised
by clicking on the step name.
● Filter table showing the
distribution of gene IDs across all
species in the database.
3. The search generates a step, and the results below show the list of genomic segment IDs corresponding to the locations of the EcoRI site: one segment ID for each occurrence of GAATTC in the genome.
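The idea of one segment per motif occurrence can be illustrated with a short scan over a sequence. This is a minimal sketch, assuming a literal motif and 1-based inclusive coordinates; the function `find_motif` and the toy sequence are hypothetical and do not reflect how EuPathDB implements the search.

```python
import re


def find_motif(sequence, motif):
    """Return 1-based (start, end) coordinates of every occurrence of `motif`.

    A lookahead pattern is used so that overlapping occurrences are reported.
    """
    pattern = re.compile(f"(?={re.escape(motif.upper())})")
    return [
        (match.start() + 1, match.start() + len(motif))
        for match in pattern.finditer(sequence.upper())
    ]


# Toy sequence, not real genome data.
toy_sequence = "ccGAATTCaaagaattcGG"
sites = find_motif(toy_sequence, "GAATTC")
print(sites)
```

GAATTC is its own reverse complement, so scanning one strand finds the sites on both; a non-palindromic motif would also need a scan for its reverse complement.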
Carefully consider the 5 user-defined parameters in the logic statement of the co-location function.
View results. The results table lists 214 IDs of genes whose upstream 500 bp region contains the EcoRI site. The column ‘Matched Regions’ defines the genomic location of the EcoRI site within the gene.