GUS - University of Pennsylvania

GUS
The Genomics Unified Schema
A Platform for Genomics Databases
V. Babenko, B. Brunk, J. Crabtree, S. Diskin, S. Fischer, G.
Grant, Y. Kondrahkin, L. Li, J. Liu, J. Mazzarelli, D. Pinney, A.
Pizarro, E. Manduchi, S. McWeeney, J. Schug, C. Stoeckert
Center for Bioinformatics, University of Pennsylvania
stevef,[email protected]
Overview
Abstract
The Genomics Unified Schema (GUS) is a strongly typed relational database
schema and accompanying portable object-based software platform used
for integration, analysis, curation, mining and presentation of
sequence-based genomics information. The schema is organized into five domains:
a detailed model of the central dogma (gene, RNA, protein), including DNA,
assembled RNA and protein sequence together with diverse sequence
annotation (DoTS); an MGED-compliant warehouse of transcript
expression experiments (RAD); a catalogue of grammars describing
regulatory regions (TESS); a wide range of controlled vocabularies and
ontologies (SRES); and a detailed representation of data provenance
(CORE). (A sixth domain for protein expression is in progress.) GUS’s
normalized relational structure and extent of integrated data enable
powerful queries not viable in many other genomics systems. The platform
facilitates maintenance of the warehouse and its use in web and data
mining applications.
Goals of GUS
•Generic platform for model organism or disease-specific databases
•Freely available at www.gusdev.org and www.cbil.upenn.edu
•Integration of genome, transcript and protein data, including:
  •Sequence
  •Function
  •Expression
  •Interaction
  •Regulation
  •Orthologs and paralogs
•Support for:
  •automated annotation and integration
  •manual curation
  •data mining/analysis and sophisticated queries
  •web access
GUS Powers Multiple Genomics DBs
[Figure: layered diagram. Recoverable labels, top to bottom:]
Web sites: AllGenes, PlasmoDB, EPConDB, other sites, other projects
Java Servlets
DoTS, RAD, TESS, SRes, Core
Oracle RDBMS
Object Layer for Data Loading
Components of GUS
•Relational database schema
•Lightweight object layer
•Application frameworks
  •Data access
  •Pipeline/workflow
  •Web (servlets)
•Applications
  •Annotator’s interface
  •Parsers and exporters (using standards)
  •Annotation and analysis programs
  •Schema browser
•Utilizes Oracle 9i
Architecture of GUS
[Figure: architecture diagram. Recoverable labels:]
Data sources: GenBank, InterPro, GO, etc.; genomic sequence; GSSs & ESTs; mapping data; microarray & SAGE experiments; QTL, POP, SNP, clinical
Processing: automated analysis & integration; annotation
Storage: Object Layer over Oracle/SQL holding DoTS, RAD, TESS, Core, SRes
Access: Java servlets & Perl CGI (WWW queries, browsing & download); mining applications; annotator’s interface
Usage of GUS
•Annotation
  •Of genomes: gene models, sequence features
  •Of genes: function, expression, regulation
•Integration
  •From sequence to expression
  •Map identifiers to/from external databases
•Data mining, creating curated datasets
  •Algorithm-based: GO function prediction
  •Genome-wide querying: find all pancreas-specific transcripts
  •PancChip: non-redundant genes expressed in pancreas, found using ESTs, microarrays and cDNA libraries
GUS Schema
Schema features
•Extensive integrated genomics schema (300 tables)
•Divided into 5 distinct domains
•Highly normalized
•Strongly typed
  •Controlled vocabularies used extensively
  •Avoid using name-value pairs
•Subclassing
  •Use views of superclass to define subclasses
  •Useful for mapping into the object layer
•Warehousing
  •Include databases such as GenBank, GO terms, ProDom, CDD
  •Facilitates management of value-added annotation across updates
  •Cross references to external databases
•Tracking and versioning
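The subclassing item above (views of a superclass define subclasses) can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module rather than Oracle, and every table, view and column name in it (na_feature_imp, subclass_view, gene_feature, and so on) is an illustrative assumption, not the actual GUS definition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical superclass table: one row per feature of any subclass.
CREATE TABLE na_feature_imp (
    na_feature_id INTEGER PRIMARY KEY,
    subclass_view TEXT NOT NULL,   -- names the subclass the row belongs to
    name          TEXT,
    string1       TEXT             -- generic column; meaning depends on subclass
);
-- Each subclass is a view that filters on subclass_view and renames columns.
CREATE VIEW gene_feature AS
    SELECT na_feature_id, name, string1 AS gene_type
    FROM na_feature_imp WHERE subclass_view = 'GeneFeature';
CREATE VIEW rna_feature AS
    SELECT na_feature_id, name, string1 AS rna_type
    FROM na_feature_imp WHERE subclass_view = 'RNAFeature';
""")

# Toy rows, invented for the example.
conn.executemany(
    "INSERT INTO na_feature_imp (subclass_view, name, string1) VALUES (?, ?, ?)",
    [("GeneFeature", "geneA", "protein_coding"),
     ("RNAFeature", "rnaA", "mRNA"),
     ("GeneFeature", "geneB", "pseudogene")],
)

gene_rows = conn.execute(
    "SELECT name, gene_type FROM gene_feature ORDER BY name").fetchall()
rna_rows = conn.execute("SELECT name, rna_type FROM rna_feature").fetchall()
print(gene_rows)
print(rna_rows)
```

Because each view exposes only the columns that make sense for its subclass, an object layer could plausibly map one class onto each view while all rows still live in a single physical table.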
Five domains
GUS is divided into 5 domains* (separate name spaces)

Namespace                       Domain                   Highlights
Core                            Data Provenance          Evidence
SRes (Shared Resources)         Shared Resources         Ontologies
DoTS (DB of Transcribed Seqs)   Sequence and annotation  Central dogma
RAD (RNA Abundance DB)          Gene expression          MIAME/MAGE
TESS (Trans Elem Search Site)   Gene regulation          Grammars

* Protein interaction domain underway
Querying across the domains
Core: Data Provenance
•Ownership
•Protection
•Algorithms
•Versioning
•Workflows
DoTS: Genomic Sequence
•Genes, gene models
•STSs, repeats, etc
•Cross-species analysis
DoTS: Transcribed Sequence
•Characterize transcripts
•RH mapping
•Library analysis
•Cross-species analysis
•DoTS assemblies
DoTS: Protein Sequence
•Domains
•Function
•Structure
•Cross-species analysis
SRes: Ontologies
•GO
•Species
•Anatomy/Tissue
•Developmental stage
•Disease state
RAD: Transcript Expression
•Arrays
•SAGE
•Conditions
TESS: Gene Regulation
•Binding Sites
•Patterns
•Grammars
Example query: "Transcription factors upregulated in acute myeloid leukemia
with sequence similarity to c-fos and common promoter motifs"
(Loose domain labels in the figure, RAD, SRes, Core and TESS, appear to mark which domain answers each part of the query.)
DoTS central dogma schema

            Gene                                 RNA                              Protein
Instance    Gene Instance                        RNA Instance                     Protein Instance
Feature     Gene Feature (isa NA Feature)        RNA Feature (isa NA Feature)     Protein Feature (isa NA Feature)
Sequence    Genomic Sequence (isa NA Sequence)   RNA Sequence (isa NA Sequence)   Protein Sequence (isa AA Sequence)
RAD schema uses MAGE/MIAME
[Figure: UML class diagram of the RAD schema. The association multiplicities (1, 0..1, 0..*) cannot be matched to their associations in this transcript and are omitted.]
MAGE packages (legend): Experiment; Array; BioMaterial; BioAssay; BioAssayData; Protocol, Descr.; HigherLevelAnalysis
MIAME sections (legend): Experimental Design; Array design; Samples; Hybridization, Measure; Normalization
Classes shown, in transcript order: StudyAssay, Array, Assay, Study, StudyDesignAssay, ArrayAnnotation, StudyDesign, Control, ElementAnnotation, BioMaterialCharacteristic, BioMaterialImp, ElementImp, StudyFactor, StudyDesignDescription, StudyFactorValue, AssayLabeledExtract, Channel, CompositeElementImp, BioMaterialMeasurement, Acquisition, LabelMethod, RelatedAcquisition, CompositeElementAnnotation, OntologyEntry, Treatment, AcquisitionParam, ElementResultImp, CompositeElementResultImp, ProcessResult, Quantification, MAGEDocumentation, RelatedQuantification, ProtocolParam, ProcessIO, MAGE_ML, QuantificationParam, Protocol, AnalysisInput, ProcessInvocation, ProcessInvocationParam, ProcessImplementationParam, AnalysisInvocation, AnalysisInvocationParam, AnalysisOutput, ProcessImplementation, Analysis, AnalysisImplementation, AnalysisImplementationParam
TESS schema
[Figure: class diagram. Recoverable labels, grouped by name prefix:]
TESS.Moiety: Moiety, MoietyHeterodimer, MoietyMultimer, MoietyComplex
TESS.Activity: ActivityProteinDnaBinding, ActivityTissueSpecificity, ...
TESS.Model: ModelString, ModelConsensusString, ModelPositionalWeightMatrix, ModelGrammar
DoTS.NaFeature: BindingSite, Promoter
DoTS.NaSequence
TESS.FootprintInstance
TESS.TrainingSet
TESS.ParameterGroup
TESS.Note
Ontologies and vocabularies
•Ontologies
  •Gene Ontology (GO)
  •Sequence Ontology (SO) (sequence features)
  •Phenotype and Trait Ontology (PATO)
  •Taxon (NCBI)
  •Anatomy (Penn)
  •Disease (ICD9)
  •Developmental stage (multiple sources)
•And vocabularies
  •External database names
  •Genetic codes
  •Review status
Evidence trail
•Evidence and tracking
  •Data tables have columns for user, date, project, algorithm invocation
  •Tables dedicated to algorithm, algorithm version and parameters
  •176 algorithms, including public and in-house
  •Tracks automated and manual annotation, similarity and integration
•Versioning
  •All updated or deleted rows are copied to version table
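The versioning rule above (updated or deleted rows are copied to a version table) might look like the following minimal sketch. It uses Python's sqlite3 in place of Oracle; the gene and gene_ver tables, their columns and the helper functions are hypothetical and chosen only for illustration, and the poster does not say whether GUS does the copy in application code or in the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (
    gene_id           INTEGER PRIMARY KEY,
    symbol            TEXT,
    modification_date TEXT
);
-- Hypothetical version table: same columns plus the date the row was retired.
CREATE TABLE gene_ver (
    gene_id           INTEGER,
    symbol            TEXT,
    modification_date TEXT,
    version_date      TEXT
);
""")

def archive(conn, gene_id, when):
    """Copy the current row into the version table before it changes."""
    conn.execute(
        "INSERT INTO gene_ver "
        "SELECT gene_id, symbol, modification_date, ? FROM gene WHERE gene_id = ?",
        (when, gene_id))

def update_symbol(conn, gene_id, symbol, when):
    archive(conn, gene_id, when)
    conn.execute(
        "UPDATE gene SET symbol = ?, modification_date = ? WHERE gene_id = ?",
        (symbol, when, gene_id))

def delete_gene(conn, gene_id, when):
    archive(conn, gene_id, when)
    conn.execute("DELETE FROM gene WHERE gene_id = ?", (gene_id,))

# Toy data: one insert, one update, one delete.
conn.execute("INSERT INTO gene VALUES (1, 'Fos', '2002-01-01')")
update_symbol(conn, 1, "c-Fos", "2002-06-01")
delete_gene(conn, 1, "2002-07-01")

history = conn.execute(
    "SELECT symbol, version_date FROM gene_ver ORDER BY version_date").fetchall()
remaining = conn.execute("SELECT COUNT(*) FROM gene").fetchone()[0]
print(history, remaining)
```

After the update and the delete, the live table is empty but both earlier states of the row remain queryable in the version table.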
Sophisticated queries
Sample queries from three projects that utilize GUS’s data
integration and analysis
•www.allgenes.org
  •“Is my cDNA similar to any mouse genes that are predicted to encode
   transcription factors and have been localized to mouse chromosome 5?”
•http://plasmodb.org
  •“List all genes whose proteins are predicted to contain a signal peptide
   and for which there is evidence that they are expressed in Plasmodium
   falciparum’s late schizont stage”
•www.cbil.upenn.edu/EPConDB
  •“Which genes on chromosome 2 are expressed in pancreas and are
   involved in signal transduction based on GO function assignments?”
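To show how a question like the EPConDB one above becomes a join across integrated tables, here is a minimal sketch using Python's sqlite3. The three tables, their columns and the rows are toy assumptions invented for the example; the real GUS schema spreads this information over many more normalized tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, heavily simplified tables with invented toy rows.
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT, chromosome TEXT);
CREATE TABLE expression (gene_id INTEGER, anatomy TEXT);
CREATE TABLE go_assignment (gene_id INTEGER, go_term TEXT);
INSERT INTO gene VALUES (1, 'geneA', '2'), (2, 'geneB', '2'), (3, 'geneC', '5');
INSERT INTO expression VALUES (1, 'pancreas'), (2, 'liver'), (3, 'pancreas');
INSERT INTO go_assignment VALUES
    (1, 'signal transduction'),
    (2, 'signal transduction'),
    (3, 'signal transduction');
""")

# One join per kind of evidence: location, expression, predicted function.
QUERY = """
SELECT DISTINCT g.symbol
FROM gene g
JOIN expression e    ON e.gene_id = g.gene_id
JOIN go_assignment f ON f.gene_id = g.gene_id
WHERE g.chromosome = ? AND e.anatomy = ? AND f.go_term = ?
ORDER BY g.symbol
"""
hits = [row[0] for row in
        conn.execute(QUERY, ("2", "pancreas", "signal transduction"))]
print(hits)
```

Only geneA satisfies all three conditions in the toy data: geneB is on chromosome 2 but expressed in liver, and geneC is expressed in pancreas but lies on chromosome 5.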
Application Frameworks
GUS Object layer
•Lightweight Perl implementation
•Java on the way
•One object per table
•Parent/child relationships
•Cascading delete
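The object-layer ideas above (one class per table, parent/child links, cascading delete) can be sketched in a few lines. The actual layer is written in Perl; this Python version, including the Row base class, its method names and the three example tables, is an illustrative assumption rather than the GUS API.

```python
class Row:
    """One instance per database row; one subclass per table."""
    table = None

    def __init__(self, **columns):
        self.columns = columns
        self.parent = None
        self.children = []
        self.deleted = False

    def add_child(self, child):
        """Link a child row (a row holding a foreign key to this one)."""
        child.parent = self
        self.children.append(child)
        return child

    def mark_deleted(self):
        """Cascading delete: mark all descendants first, then this row.

        Returns the rows in the order they would have to be deleted so
        that no foreign key is left dangling.
        """
        order = []
        for child in self.children:
            order.extend(child.mark_deleted())
        self.deleted = True
        order.append(f"{self.table}:{self.columns['name']}")
        return order


class Gene(Row):
    table = "Gene"


class RNA(Row):
    table = "RNA"


class Protein(Row):
    table = "Protein"


gene = Gene(name="g1")
rna = gene.add_child(RNA(name="r1"))
protein = rna.add_child(Protein(name="p1"))
deleted_order = gene.mark_deleted()
print(deleted_order)
```

Deleting children before parents mirrors what a relational database requires when foreign-key constraints are enforced.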
Data input
•The GusApplication program manages inserts and updates to GUS, handling tracking and versioning.
•Specific tasks are implemented as plugins.
•Plugins use either GUS objects or SQL access. Low-level database access is provided by DBI classes.
[Figure: layered diagram. Recoverable labels, top to bottom: GusApplication; Plugin; Objects and SuperClasses alongside SQL; DBI; Core, SRes, DoTS, RAD, TESS]
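The division of labour described above, where the application handles tracking and a plugin handles the specific task, could be sketched as follows. The class names, the plugin interface and the tracking fields are assumptions made for illustration (the real GusApplication is a Perl program), and the date column is left out to keep the example deterministic.

```python
import itertools


class Application:
    """Runs a plugin and stamps every row it produces with tracking data."""
    _invocation_ids = itertools.count(1)

    def __init__(self, user, project):
        self.user = user
        self.project = project
        self.rows = []  # stands in for the database

    def run(self, plugin, **args):
        invocation_id = next(self._invocation_ids)
        for row in plugin.run(**args):
            # Tracking columns are filled in here, not by the plugin.
            row.update(user=self.user, project=self.project,
                       algorithm=plugin.name, invocation_id=invocation_id)
            self.rows.append(row)
        return invocation_id


class LoadFasta:
    """Hypothetical plugin: turns FASTA text into sequence rows."""
    name = "LoadFasta"

    def run(self, text):
        for record in text.strip().split(">")[1:]:
            header, *lines = record.splitlines()
            yield {"source_id": header.split()[0], "sequence": "".join(lines)}


app = Application(user="annotator1", project="demo")
inv = app.run(LoadFasta(), text=">seq1 test\nACGT\nAC\n>seq2\nGG\n")
for row in app.rows:
    print(row)
```

Keeping the tracking logic in the application means a new plugin only has to produce rows, and every row still records who loaded it and under which algorithm invocation.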
Pipeline
•Perl API for defining annotation pipelines
•Supports sequential protocols
•Distributes compute-intensive work to a compute cluster
•Used for the 90-stage pipeline that builds the DoTS transcript index
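A sequential pipeline with optional distribution of heavy stages, as listed above, might be organized like this minimal sketch. The real API is in Perl and targets a compute cluster; here the Pipeline class and its methods are hypothetical, and a thread pool stands in for the cluster.

```python
from concurrent.futures import ThreadPoolExecutor


class Pipeline:
    """Runs named stages strictly in the order they were added."""

    def __init__(self):
        self.stages = []
        self.log = []  # names of stages that have finished

    def add_stage(self, name, func, distribute=False):
        self.stages.append((name, func, distribute))

    def run(self, items):
        for name, func, distribute in self.stages:
            if distribute:
                # Stand-in for a compute cluster: apply func to each item
                # in parallel; pool.map preserves the input order.
                with ThreadPoolExecutor(max_workers=4) as pool:
                    items = list(pool.map(func, items))
            else:
                items = func(items)
            self.log.append(name)
        return items


def reverse_complement(seq):
    return seq[::-1].translate(str.maketrans("ACGT", "TGCA"))


pipeline = Pipeline()
pipeline.add_stage("drop_short", lambda seqs: [s for s in seqs if len(s) >= 4])
pipeline.add_stage("revcomp", reverse_complement, distribute=True)
pipeline.add_stage("sort", sorted)
result = pipeline.run(["ACGTT", "AC", "GGCA"])
print(result, pipeline.log)
```

Ordinary stages receive the whole list, while a distributed stage receives one item at a time, which is the property that lets its work be farmed out.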
Web
•Servlet- and CGI-based design (JSP on the way)
•Automatic generation of HTML FORMs
  •Automated input checking
  •Integrated help features
  •INPUT elements populated from the database
•Query history facility
•Boolean queries (AND, OR, SUBTRACT)
•Declarative configuration file
•Base system is relatively independent of GUS
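The query history and Boolean query features above fit together naturally: if each query's result is kept as a set of identifiers, AND, OR and SUBTRACT reduce to set intersection, union and difference. The sketch below assumes that representation; the QueryHistory class and its methods are hypothetical and not the actual web framework.

```python
class QueryHistory:
    """Keeps each query's result set so later queries can combine them."""

    def __init__(self):
        self.results = []  # list of (label, frozenset of ids)

    def record(self, label, ids):
        self.results.append((label, frozenset(ids)))
        return len(self.results)  # 1-based history number

    def ids(self, number):
        return self.results[number - 1][1]

    def combine(self, op, left, right):
        a, b = self.ids(left), self.ids(right)
        ops = {"AND": a & b, "OR": a | b, "SUBTRACT": a - b}
        return self.record(f"#{left} {op} #{right}", ops[op])


history = QueryHistory()
q1 = history.record("expressed in pancreas", {1, 2, 3})
q2 = history.record("on chromosome 2", {2, 3, 4})
q3 = history.combine("AND", q1, q2)
q4 = history.combine("SUBTRACT", q1, q2)
q5 = history.combine("OR", q3, q4)
print(sorted(history.ids(q3)), sorted(history.ids(q4)), sorted(history.ids(q5)))
```

Because combined results are recorded in the history like any other query, they can themselves be combined again, as the final OR shows.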
Provided Applications
Annotator’s interface
[Figure: screenshot of the annotator’s interface. Recoverable labels:]
•Assign Gene Name/Symbol
•Assign Gene Description
•Assign Gene Synonym(s)
•Evidence
Parsing & exporting
•Parsing
  •Sequence DBs: GenBank (main, dbEST, NRDB), SWISS-PROT, TIGR
  •Protein Motifs: CDD, ProDom, InterPro
  •Expression: MAGE
  •Ontologies: GO, SO, PATO
  •Mapping data: RH maps
  •Gene predictors: GLIMMER, Genscan, PHAT, GeneFinder
  •Similarity: BLAST, BLAT, Sim4
  •CAP4
•Exporting
  •FASTA
  •MAGE
  •Table dumps
  •DoTS Assemblies
Analysis & annotation
•GO functional assignment
•Expression analysis (PaGE)
•Anatomy classification
•Library distribution
•Genes from BLAT of DoTS against genome
•DoTS assembly and annotation
  •Refresh warehouse
  •Cluster and assemble mRNAs/ESTs into putative transcripts
  •Annotate transcripts through similarity, GO function and markers
  •Integrate previously existing manual curation
DoTS Pipeline
[Figure: flow diagram of the DoTS pipeline; the arrows are not recoverable. Labels, in transcript order:]
Genomic Sequence; Gene predictions (GenScan/HMMer, PHAT); mRNA/EST Sequence; SIM4 or BLAT; Predicted Genes; Merge Genes; Clustering and Assembly; DoTS consensus Sequences; Gene/RNA cluster assignment; Annotate DoTS; Manual Annotation Tasks; RNAs; BLASTX; Other computed annotation (EPCR, AssemblyAnatomyPercent, Index Key Words, SNP analysis); BLAST Similarities; Gene Index; framefinder translation; BLASTP; Functional predictions; GO Functions; Proteins; PFAM, Smart, ProDom; Protein Motifs
References & Acknowledgements
References
•Scearce, L.M., Brestelli, J.E., McWeeney, S.K., Lee, C.S., Mazzarelli, J., Pinney, D.F., Pizarro, A., Stoeckert, C.J. Jr., Clifton, S., Permutt, M.A., Brown, J., Melton, D.A., Kaestner, K.H. (2002) Functional Genomics of the Endocrine Pancreas: The Pancreas Clone Set and PancChip, New Resources for Diabetes Research. Diabetes 51: 1997-2004.
•Schug, J., Diskin, S., Mazzarelli, J., Brunk, B.P., Stoeckert, C.J. (2002) Predicting Gene Ontology Functions from ProDom and CDD Protein Domains. Genome Res. 12: 648-655.
•Bahl, A., Brunk, B., Coppel, R.L., Crabtree, J., Diskin, S.J., Fraunholz, M.J., Grant, G.R., Gupta, D., Huestis, R.L., Kissinger, J.C., Labo, P., Li, L., McWeeney, S.K., Milgram, A.J., Roos, D.S., Schug, J., Stoeckert, C.J. (2002) PlasmoDB: The Plasmodium Genome Resource. An integrated database providing tools for accessing and analyzing mapping, expression and sequence data (both finished and unfinished). Nucleic Acids Res. 30: 87-90.
•Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert, C., Aach, J., Ansorge, W., Ball, C.A., Causton, H.C., Gaasterland, T., Glenisson, P., Holstege, F.C.P., Kim, I.F., Markowitz, V., Matese, J.C., Parkinson, H., Robinson, A., Sarkans, U., Schulze-Kremer, S., Stewart, J., Taylor, R., Vilo, J., Vingron, M. (2001) Minimum Information About a Microarray Experiment (MIAME): Toward Standards for Microarray Data. Nature Genetics 29: 365-371.
•Manduchi, E., Pizarro, A., Stoeckert, C. (2001) RAD (RNA Abundance Database): an infrastructure for array data analysis. Proc. SPIE 4266: 68-78.
•Davidson, S.B., Crabtree, J., Brunk, B.P., Schug, J., Tannen, V., Overton, G.C., Stoeckert, C.J. Jr. (2001) K2/Kleisli and GUS: Experiments in Integrated Access to Genomic Data Sources. IBM Systems Journal 40(2): 512-531.
•Crabtree, J., Wiltshire, T., Brunk, B., Zhao, S., Schug, J., Stoeckert, C., Bucan, M. (2001) High-resolution BAC-based Map of the Central Portion of Mouse Chromosome 5. Genome Res. 11: 1746-1757.
Acknowledgements
•NIH grant RO1-HG-01539-03
•DOE grant DE-FG02-00ER62893
•Burroughs Wellcome Fund
•NIDDK 56947 and 56954 with cosponsorship from the JDFI
Related posters
•114A. Web-Based Biological Discovery using the GUS Integrated Database.
•170A. TESS-II: Describing and Finding Gene Regulatory Sequences with Grammars
•148A. Integrating Eukaryotic Genomes by Orthologous Groups: What is Unique about Apicomplexan Parasites?