Building CryptoDB using GUS
Mark Heiges
Center for Tropical and Emerging Global Diseases
University of Georgia
[Overview diagram. Labels: Genomic Data, Analysis Results, Plugins, GUS, WDK, Tomcat, Apache]
[Pipeline diagram. External Resources: NCBI Taxonomy (SRes), SO (SRes), NRDB (DoTS), our data (DoTS). Analysis Input: contigs, proteins, NRDB. Other labels: Analysis Results, Plugins, helper scripts, Web Development Kit, GUS]
Site Design Considerations
• data types we wanted to warehouse
• additional analyses desired
• how to load data into GUS
• how to visualize data
– tables
– text
– graphics (interactive, static)
• what types of questions will be asked of the data
Deciding Factors
• What data was available.
• What the research community needed.
• What we could accomplish by the
contractual deadline for our first release.
Crypto External Resource Data
• Genomic sequence and gene annotations for two
species (GenBank)
– sequence
– CDS translations
– gene product descriptions
– exon coordinates
– RNA type (mRNA, tRNA, snoRNA, rRNA)
– other features
• EST/mRNA (GenBank)
Auxiliary Data Required
• NRDB
• NCBI Taxonomy Reference
• Sequence Ontology Definitions
GUS Plugins
• Perl modules for loading data into GUS
– facilities to connect to the GUS perl object
layer and the database
– process command line arguments
– create tracking information in the database
– log and handle errors
GUS Plugins
• Supported and Community plugins bundled
with GUS
• Plugins are versioned
• Each plugin version must be registered with
GUS before use
– records cvs version and md5 checksum
– auditing
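A recorded checksum makes it possible to notice that a plugin file changed after it was registered. The following is a minimal sketch of that idea in Python, not GUS code (GUS plugins are Perl modules): the function names, the in-memory `registry` dict standing in for the database tracking tables, and the `LoadExample.pm` file are all hypothetical.

```python
import hashlib
import os
import tempfile


def md5_of_file(path):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register(registry, name, version, path):
    """Record version and checksum in a hypothetical in-memory registry."""
    registry[name] = {"version": version, "md5": md5_of_file(path)}


def is_registered_copy(registry, name, path):
    """True only if the file on disk matches the registered checksum."""
    entry = registry.get(name)
    return entry is not None and entry["md5"] == md5_of_file(path)


def demo():
    """Register a throwaway file, edit it, and compare before and after."""
    registry = {}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "LoadExample.pm")  # made-up plugin file
        with open(path, "w") as handle:
            handle.write("package LoadExample; 1;\n")
        register(registry, "LoadExample", "1.1", path)
        before = is_registered_copy(registry, "LoadExample", path)
        with open(path, "a") as handle:
            handle.write("# local edit\n")
        after = is_registered_copy(registry, "LoadExample", path)
    return before, after
```

Calling `demo()` shows the point of the audit trail: the unmodified file matches its registration, and the edited file does not.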
Data Loading at CryptoDB
• Install GUS
• Register selected plugins
• Load Controlled Vocabularies
– NCBI Taxonomy
– Sequence Ontology Definitions
• Load Crypto annotated sequences from
GenBank records
• Load NRDB from FASTA file
Data Loading at CryptoDB
• Load Crypto mRNA GenBank records
• Load ESTs from U Penn's database of
NCBI's dbEST
CryptoDB Analyses
• BLASTP - compare annotated proteins to NRDB
• BLASTX - compare whole genome to NRDB
• BLASTN - synteny comparison of the two Crypto
species we host
• EST/mRNA clustering and alignment
• signal peptide predictions
• transmembrane predictions
Analysis Workflow
• Load Source Data into GUS (NRDB, genomic
seqs)
• Dump same data from GUS with GUS Ids
• Perform analysis with this data (BLASTX)
• Load results into GUS
• GUS Ids allow results to be linked back to
analysis input data
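Why carrying the Ids through the analysis matters can be sketched in a few lines of Python. This is an illustration only: the tab-separated hit format (`query_id`, `subject_id`, `score`) and the dict-based records are made up for the example and are not the output format of BLAST or of any GUS tool.

```python
def index_by_id(records):
    """Map each GUS-style Id to its record (plain dicts in this sketch)."""
    return {rec["id"]: rec for rec in records}


def link_hits(hit_lines, queries, subjects):
    """Join analysis hits back to the input records by Id.

    hit_lines holds 'query_id<TAB>subject_id<TAB>score' strings,
    a hypothetical format used only for this illustration.
    """
    query_index = index_by_id(queries)
    subject_index = index_by_id(subjects)
    linked = []
    for line in hit_lines:
        query_id, subject_id, score = line.rstrip("\n").split("\t")
        linked.append({
            "query": query_index[int(query_id)],
            "subject": subject_index[int(subject_id)],
            "score": float(score),
        })
    return linked
```

Because both input files were dumped with the database Ids, each hit resolves directly to the rows it came from, with no name matching or re-parsing of deflines.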
Data Analysis - BLASTP
• Dump NRDB records from GUS to FASTA
file - with GUS Ids
>336 source_id=0703290B secondary_identifier=223280 tubulin alpha length=411
TIGGGDDSFNTFFSETGAGKHVPRAVFVDLEPTVIDEVRTGTYRQLFHPEQLITGKEDAA
NNYARGHYTIGKEIIDLVLDRIRKLADQCTGLQGFSVFHSFGGGTGSGFTSLLMERLSVD
YGKKSKLEFSIYPARQVSTAVVEPYNSILTTHTTLEHSDCAFMVDNEAIYDICRRNLDIE
RQVSTAVVEPYNSILTTHTTLEHSDCAFMVDNEAIYDICRRNLDIE
• Dump annotated protein sequences from GUS to
FASTA file - with GUS Ids
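A small Python sketch of writing and reading such Id-labeled FASTA records follows. The defline layout (Id first, then `key=value` fields mixed with free-text description) is inferred from the single example above and is an assumption; the function names are hypothetical and this is not the GUS dump script.

```python
def format_fasta(gus_id, sequence, description="", width=60):
    """Return one FASTA record whose defline starts with the Id."""
    defline = ">{} {}".format(gus_id, description).rstrip()
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([defline] + lines) + "\n"


def parse_defline(defline):
    """Split a defline into (id, key=value fields, free-text description)."""
    tokens = defline.lstrip(">").split()
    gus_id = int(tokens[0])
    fields, words = {}, []
    for token in tokens[1:]:
        if "=" in token:
            key, value = token.split("=", 1)
            fields[key] = value
        else:
            words.append(token)
    return gus_id, fields, " ".join(words)
```

Keeping the Id as the first whitespace-delimited token means most tools that truncate the defline at the first space still report the Id.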
Data Analysis - BLASTP
• Run BLASTP algorithm with these two GUS Id
labeled datasets
– used the Perl wrapper around the BLAST executable that is
included with GUS; it produces plugin-compatible output
• Load BLAST results with plugin
– ga GUS::Common::Plugin::LoadBlastSimFast --file
blastSimilarity.out --restartAlgInvs "" --queryTable
DoTS::ExternalNASequence --subjectTable
DoTS::ExternalAASequence --commit
Post Data Loading
• Find where the results were loaded
– read documentation
• ga GUS::Common::LoadBLAST --help
– looked in plugin source code
– asked other users
– gusdb.org schema browser
– fishing expeditions in GUS tables
Getting Our Database On Line
[Pipeline diagram repeated]
Web Development Kit (WDK)
• provides accelerated development of
database driven web sites
– define questions and records in model XML file
– default JavaServer Pages (JSP) views provided
• not specific to GUS
• can be used with any RDBMS
WDK
Question - Summary - Record Paradigm
• Users supply parameter values to a canned
question on the website
– "Which genes have at least __ exons?"
• The result is returned in summary pages that list
links to the record pages
• Record page - detailed view of data object
– text
– graphics
– tables
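The three stages can be sketched with plain Python functions over an in-memory table. This is only an illustration of the paradigm, not WDK code: the gene records are made up, and only the `showRecord.do?id=` link form is borrowed from the record example in this talk.

```python
GENES = {  # made-up records, for illustration only
    "gene1": {"product": "hypothetical protein", "exons": 1},
    "gene2": {"product": "tubulin alpha", "exons": 3},
    "gene3": {"product": "kinase", "exons": 5},
}


def genes_by_exon_count(min_exons):
    """Question: which genes have at least `min_exons` exons?"""
    return sorted(gid for gid, rec in GENES.items() if rec["exons"] >= min_exons)


def summary(gene_ids):
    """Summary: one row per result, each linking to its record page."""
    return [{"id": gid, "link": "showRecord.do?id=" + gid} for gid in gene_ids]


def record(gene_id):
    """Record: the detailed view of a single data object."""
    return dict(GENES[gene_id], id=gene_id)
```

A user filling in the blank with 3 would get `summary(genes_by_exon_count(3))`, then follow one link to `record("gene2")`.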
Questions
Summary
Record
WDK
Model - View - Controller architecture
• Model XML configuration defines
– questions
– answer summaries
– records
• View
– displays the model
– defined in customizable JavaServer pages
• Controller
– internal, not configurable
WDK Setup
• build
• write WDK model (the WDK comes with a Toy site; we spent some time with that beforehand)
• test model from command line
• install WDK into Tomcat
• customize the view (jsp) pages
• integrate Tomcat with Apache - personal
preference
WDK Model:
Defining Questions
<question name="GeneByContig"
          displayName="Genes by Contig"
          queryRef="GeneFeatureIds.GeneByContig"
          summaryAttributesRef="source_id,product,organism,contig"
          recordClassRef="GeneRecordClasses.GeneRecordClass">
  <description>Find gene located on a given contig</description>
</question>
<sqlQuery name="GeneByContig" displayName="By Contig" isCacheable='true'>
  <description>
    Find Genes By Contig ID.
  </description>
  <paramRef ref="params.contig"/>
  <column name="source_id" isInternal="false"/>
  <sql>
    <!-- use CDATA because query includes angle brackets -->
    <![CDATA[
      select g.source_id
      from dots.genefeature g, dots.naentry nae,
           dots.sequencetype st,
           dots.externalNAsequence enas
      where nae.na_sequence_id = g.na_sequence_id
        and enas.sequence_type_id = st.sequence_type_id
        and enas.na_sequence_id = nae.na_sequence_id
        and st.name = 'contig'
        and nae.source_id = '$$contig$$'
      ORDER BY g.source_id
    ]]>
  </sql>
</sqlQuery>
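The `$$contig$$` macro marks where the user's parameter value enters the SQL. How the WDK expands it internally is not covered here; the Python sketch below shows one cautious way to handle such a macro, by rewriting it into a bound parameter instead of pasting the value into the SQL string. The `to_bound_query` helper, the one-table toy schema, and the use of SQLite are all assumptions made for the example.

```python
import re
import sqlite3

# Matches $$name$$, including the single quotes that may surround it.
MACRO = re.compile(r"'?\$\$(\w+)\$\$'?")


def to_bound_query(sql, params):
    """Replace each $$name$$ macro with a '?' placeholder, collecting values."""
    values = []

    def swap(match):
        values.append(params[match.group(1)])
        return "?"

    return MACRO.sub(swap, sql), values


def demo():
    """Run a simplified genes-by-contig query against a toy table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table gene (source_id text, contig text)")
    conn.executemany(
        "insert into gene values (?, ?)",
        [("g2", "c1"), ("g1", "c1"), ("g3", "c2")],
    )
    sql = ("select source_id from gene "
           "where contig = '$$contig$$' order by source_id")
    bound_sql, values = to_bound_query(sql, {"contig": "c1"})
    rows = [row[0] for row in conn.execute(bound_sql, values)]
    conn.close()
    return bound_sql, rows
```

Binding the value keeps a contig name containing a quote character from changing the meaning of the query.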
WDK Model - Record
<recordClass idPrefix=""
             name="GeneRecordClass" type="Gene"
             attributeOrdering="source_id,exoncount,overview,product,linkout,dnaContext,genomeCompare,tmdata,blastpgraphic,translation,sequence,reference">
  <attributeQueryRef ref="GeneAttributes.GeneAttrs"/>
  <attributeQueryRef ref="GeneAttributes.ExonCount"/>
  <attributeQueryRef ref="GeneAttributes.TMCount"/>
  <tableQueryRef ref="GeneTables.BlastP"/>
  <textAttribute name="overview" displayName="Overview">
    <text>
      <![CDATA[
        This <b><i>$$organism$$</i></b> gene spans positions
        <b>$$start_max$$</b> - <b>$$end_min$$</b> of contig
        <a href="showRecord.do?id=$$contig$$"><b>$$contig$$</b></a>
        which maps to chromosome <b>$$chromosome$$</b>
      ]]>
    </text>
  </textAttribute>
</recordClass>
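The text attribute is a template: each `$$name$$` is filled from the record's attribute values when the page is rendered. A minimal Python sketch of that kind of expansion is shown below. It is an illustration, not the WDK's implementation; `fill_template` is a hypothetical helper, and escaping the substituted values is a design choice added for the example.

```python
import html
import re


def fill_template(template, values):
    """Expand $$name$$ macros, HTML-escaping each substituted value."""
    def swap(match):
        return html.escape(str(values[match.group(1)]))
    return re.sub(r"\$\$(\w+)\$\$", swap, template)
```

For example, `fill_template("contig <b>$$contig$$</b>", {"contig": "contig_1"})` keeps the template's own markup and inserts the value as plain text.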
Testing the Model
command line tools
• wdkXml - check xml syntax
• wdkSummary - test a summary
• wdkQuery - run specific query
• wdkRecord - test a record
• wdkSanityTest - exercises all queries and
records
• wdkCache
Install WDK into Tomcat
• follow the installation instructions carefully
• relies on symbolic links from Tomcat webapp to
$GUS_HOME
– disallowed by default Tomcat configuration
• keep an eye on Tomcat logs for troubleshooting
• reload the webapp when model changes
– retest on command line
– don't forget about the cache
WDK Default View
CryptoDB Custom View
• Made style changes, added site branding
• Added additional form elements
– radio buttons, check boxes
• 'Flattened out' the questions
CryptoDB Custom View
• Record pages - alterations to achieve the
desired ordering and placement of text,
tables and graphics
• Standard JSP tags to embed external objects
– GBrowse graphic
Integrate Tomcat with Apache
• Apache front end answers all web requests
• Serves the static pages and cgi tools
– BLAST interface
– motif search
– BLASTX keyword search
• Calls to the WDK are passed to Tomcat
Pipeline
[Pipeline diagram repeated]