web_services_session


Session goals
• Review existing APIs, and how they fit with
– overall data architecture
– MBAT architecture
• Create a strategy for developing and assimilating uniform APIs, and priorities
• Explore consequences for MBAT architecture
Architecture; data types and interfaces
[Figure: layered architecture diagram. Labels, in reading order:
Clients: Data Registration, Portlets, MBAT, WOMBAT, other clients
Discovery, Retrieval, Analysis, Viz, Integration APIs
Mediator
Catalog wrappers, following uniform web service APIs
Catalogs and indexes: Spatial Registry, BIRNLex, CCDB, etc.
Data types: Publication, Gene Expression, 2D Images, 2D vector segmentations, 3D Volumes, Surfaces, Phenotype / behavioral, 4+D Volumes (FMRI), Time Series
Source wrappers, following uniform web service APIs
Sources: SRB, other sources]
APIs
• Read data from other atlases/databases, in a uniform way for the data type
• Find relevant data in other atlases/databases
• API for data retrieval and transformation
• View the region of interest in another atlas
• API for atlas state exchange
• API for atlas catalogs
Uniform Web Services API
(towards BIRN-ML??)
Web services are a standard way to access remote functionality across
platforms and to assemble applications. Atlases access several data types:
microarray data, 2D images, 3D volumes, surfaces, segmentations,
annotations, phenotype/behavioral data, FMRI, time series, etc. Some of
these data types have common representation models (e.g. MAGE). These
models are typically large and exist in multiple incarnations, and the
level of detail they provide is often not needed for data discovery or for
common data access and integration tasks. It would therefore be useful to
wrap such data in a common set of services that expose the most essential
data characteristics and represent the common-denominator queries against
each particular data type (e.g. getGenes, getProbes, getStructures...)
that any dataset of this type must answer. Such services would support
multiple clients, including atlases, the BDR interface, the mediator, etc.
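A minimal sketch of the common-denominator idea, assuming two hypothetical sources with different native record layouts. Only the method name getGenes comes from the text above; the class names, field names, and sample records are illustrative assumptions and do not describe any BIRN, GN, or ABA service.

```python
from abc import ABC, abstractmethod


class GeneExpressionSource(ABC):
    """Common-denominator interface that every wrapped source answers."""

    @abstractmethod
    def getGenes(self, keyword):
        """Return a list of {'source', 'symbol'} dicts matching keyword."""


class SourceA(GeneExpressionSource):
    # Hypothetical source whose native records are dicts keyed by 'geneSym'.
    _records = [{"geneSym": "Coch"}, {"geneSym": "Tspan2"}]

    def getGenes(self, keyword):
        return [{"source": "A", "symbol": r["geneSym"]}
                for r in self._records
                if keyword.lower() in r["geneSym"].lower()]


class SourceB(GeneExpressionSource):
    # Hypothetical source whose native records are (symbol, probe) tuples.
    _records = [("Tspan2", "probe1"), ("Drd2", "probe2")]

    def getGenes(self, keyword):
        return [{"source": "B", "symbol": sym}
                for sym, _probe in self._records
                if keyword.lower() in sym.lower()]


def find_genes(sources, keyword):
    """A client (atlas, mediator, ...) queries every source the same way."""
    hits = []
    for src in sources:
        hits.extend(src.getGenes(keyword))
    return hits
```

Each wrapper hides its native layout, so a client only needs to know the shared method and the shape of the returned records.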
Plug-in architecture vs SOA: no contradiction
(focus on a single product, vs on a larger system)
Issues/steps (for MA and 2D)
1. Figure out how search requests and outputs are implemented in MBAT
(http://www.loni.ucla.edu/twiki/bin/view/MouseBIRN/WebServices), MA module in BIRN
(http://microarray.nbirn.net/), and GN (http://www.genenetwork.org/CGIDoc.html)
2. Examine MAGE and see how the same MA requests and output can be expressed in MAGE.
Then, depending on the results of (1), either abandon MAGE in favor of some simpler XML
(potentially embedded in XCEDE?), or rely on MAGE constructs (and include them in XCEDE
wrappers for gene expression sources, as a foreign namespace?). This shall be done vis-à-vis
common information requirements of client applications (e.g. GetProbes?, GetGenes?,
GetStructures?, etc.)
3. In parallel, review the schema used in the MA module, for whether it sufficiently reflects
information model for GE data, and update as necessary
4. If we decide on the XCEDE route, make sure the mediator can connect with XCEDE sources, be it
a database source or a web source in XCEDE wrapper (see
http://mediator.nbirn.net:8080/axis/services/MedTestBService?wsdl – this would involve passing
web service calls via the ExecuteQuery? method, and conversion between XML output and
mediator’s recordset)
5. Identify additional sources or databases to be wrapped in the same API (GN, Gensat, ABA, BIRN
MA + GEO +UCSC (VISIGENE) – for MA data; CCDB, ABA, ArcIMS, spatial registry, Gensat – for
2D). Then finalize the signatures.
6. Make sure terms used in queries and in the output, are tagged with BIRNLex terms (e.g. develop
controlled vocabularies for each term)
7. Implement web services for the GN and MA module (incl testing/deployment)
8. Based on results of (3), update data publication tools (i.e. software for loading data from common
CSV and text files into the MA module), make sure controlled vocabularies are enforced;
9. Make sure AIDB’s XCEDE wrapper supports the services as well(?). Now CCDB-based.
10. Publish and document web services; develop a series of examples of how they can be called from
various programming environments and applications
11. GEO API: connect the region names with MBAT; this needs semantic registration.
Possibly scrape the GEO catalog, reconcile labels with MBAT semantics, and have a service
wrapper into GEO data.
About XCEDE and MA
• XCEDE is the common schema providing access to BIRN databases.
– HID and the emerging AIDB are being wrapped in XCEDE (see
http://www.namic.org/Wiki/index.php/Slicer3:Remote_Data_Handling, and, as deployed at
BIRN-CC, http://bcc-devmediator.nbirn.net:8080/axis/services/HidQuerierWS?wsdl). Web services are
being written against XCEDE, so both HID and AIDB will be accessible through
XCEDE web services. The goal, therefore, could be to route common metadata
requests against gene expression, 2D image, and 3D volume data in XCEDE, and to
extend XCEDE to support additional requests.
– If we switch to CCDB as the image catalog, what components of XCEDE shall
be retained?
• MAGE-ML/FUGE/MAGEv2
– MAGE-ML is derived from the Microarray Gene Expression Object Model (MAGE-OM),
which is developed and described using the Unified Modelling Language
(UML). MAGE-ML is designed to describe microarray designs, microarray
manufacturing information, microarray experiment setup and execution
information, gene expression data and data analysis results. MAGEv2 is being
built on top of FuGE as an extension to add in microarray-specific classes
(extending Data as ArrayDesign, DesignElementData, etc., Material as Array,
QPCRPlate, etc., and DimensionElement as DesignElement extended by
Feature, Reporter, and CompositeElement).
• FUGE Home Page
• MAGE Home Page
From XCEDE API
• Gets/Puts:
– GetProjects, GetProject, GetProjectDetail
– GetSubject, GetSubjects, GetSubjectDetail
– GetVisits,…
– GetStudies,…
– GetSeries,…
– Get Data Acquisitions
– Get Assessments,…
– getData, Get DataSizeEstimate
• Also some getCapabilities returns (e.g. getMethods)
API Examples: Mediator services
• http://mediator.nbirn.net:8080/axis/services/MedTestBService?wsdl
– SOAP Method: executeQuery (loginTimeoutSecs, maxByteCountPerBatch, queryID,
queryLifeInSeconds, queryParameters.item0.name, queryParameters.item0.value,
queryParameters.item1.name, queryParameters.item1.value, queryString,
queryTimeoutSecs, resultLifeInSeconds, securityCertificateString)
– SOAP Methods: fetchNextResultBatch, fetchPreviousResultBatch,
fetchCurrentResultBatch, fetchRelativeResultBatch, fetchResultBatch,
getErrorMessage, getStatistics
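The batch-oriented methods above suggest a client loop that runs a query and then drains its result batches. The sketch below illustrates that pattern against an in-memory stand-in, not the real SOAP service. The slide does not give the fetch method arguments, the return types, or how the end of results is signalled, so passing queryID to fetchNextResultBatch and treating an empty batch as the end are assumptions.

```python
class FakeMediator:
    """In-memory stand-in; a real client would issue SOAP calls instead."""

    def __init__(self, rows, batch_size):
        self._rows = rows
        self._batch_size = batch_size
        self._position = {}  # queryID -> index of the next unread row

    def executeQuery(self, queryID, queryString):
        # The stand-in ignores queryString and serves the same rows.
        self._position[queryID] = 0

    def fetchNextResultBatch(self, queryID):
        # Assumption: an empty list signals that no results remain.
        start = self._position[queryID]
        batch = self._rows[start:start + self._batch_size]
        self._position[queryID] = start + len(batch)
        return batch


def fetch_all(mediator, query_id, query_string):
    """Run a query, then collect batches until an empty one comes back."""
    mediator.executeQuery(query_id, query_string)
    rows = []
    while True:
        batch = mediator.fetchNextResultBatch(query_id)
        if not batch:
            break
        rows.extend(batch)
    return rows
```

A wrapper around a real source would also need to convert each batch between the source's XML output and the mediator's recordset, which this sketch leaves out.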
API Examples: BIRN MA
• BIRN Microarray (http://microarray.nbirn.net/get_data.php? )
– REST service: cmd=<get_probes|get_my_probes>
– user_id=<int>
– dset=<all|null>
– strain=<string>
– keyword=<string>
– species=<string>
– sex=<string>
– stage=<string>
– subject_group=<string>
– anatomy=<string>
– probe_id=<string>
GN will need several more:
- platformID
- GeneID (proxied by ProbeID here?)
- ExonID
- “bestID” (sort by quality, based on user-selected metric, e.g. highest expression)
Infomodel: Species – probes – structures - genes
Passing SQL queries as opposed to just filters that we have…
Is there a way to unify what is returned from GN, ABA, BIRN-MA?
Matrix (from ABA-Neuroblast): who are best covariants: spatially, semantically,
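A minimal sketch of how a client might assemble a filter request for the REST service listed above, using only the parameter names from the slide. The helper name build_query_url and the sample values (such as species=mouse) are assumptions. The code only builds the URL and makes no claim about what the service returns.

```python
from urllib.parse import urlencode

BASE_URL = "http://microarray.nbirn.net/get_data.php"

# Parameter names as listed on the slide, in the same order.
ALLOWED = ("cmd", "user_id", "dset", "strain", "keyword", "species",
           "sex", "stage", "subject_group", "anatomy", "probe_id")


def build_query_url(**filters):
    """Build a GET URL from keyword filters, rejecting unknown names."""
    unknown = set(filters) - set(ALLOWED)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    # Keep the slide's parameter order and skip filters left unset.
    pairs = [(name, filters[name]) for name in ALLOWED
             if filters.get(name) is not None]
    return BASE_URL + "?" + urlencode(pairs)
```

For example, `build_query_url(cmd="get_probes", species="mouse", sex="male")` yields a URL with three filters. All filters are combined by the service as it sees fit; the slide implies an "AND" of the parameters.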
API examples: Gensat
• GENSAT: http://maloney.loni.ucla.edu:8080/axis/GensatSource.jws?wsdl
• getGene(geneSym, geneName, exprLevel, anatStruc, stage, sex)
– Don’t have probes; ABA doesn’t have them either (= genes)
• get2DImage(geneSym, geneName, exprLevel, anatStruc, stage, sex, plane)
– No spatial info
• getDataTypes(dataSourceID)
– Essentially, a capabilities request
API Examples: GeneNetwork
• http://www.genenetwork.org/webqtl/WebQTL.py?
• cmd=birn (also: genotype, get, trait, map, interval, correlation…)
• species=XXXX
• tissue=XXXX
• symbol=XXXX
• ProbeId=XXXX
• function=XXXX
• Strain=XXXX
Check with Amarnath on ontology mapping for genes in GN
http://www.genenetwork.org/CGIDoc.html
Our expectations for MA data
Methods Summary
getGenes(String geneCode, String geneName, String geneFunction, String keyWord)
Get the gene information by gene code, gene name, gene function, or keyword.
getProbe(String ProbesID, String geneName, String geneCode)
Get the probe information by probe ID, gene code, or gene name.
getStructures(String structureName, String geneCode, String geneName)
Get the structure information by structure name, gene code, or gene name.
More?
Allen Brain Atlas API
The API to the Allen Brain Atlas-Mouse Brain
consists of a set of services allowing users
to programmatically download the
complete high resolution images, 3D
volumes, and metadata for more than
20,000 genes in the database. In addition
to the documentation, a demo has been
created to demonstrate the use of the
services of the API. The demo's source
code is also available…
ABA API details (expressions)
• ImageSeries Structure Expression (ImageSeries ID) -> XML in ABA schema
• Expression Energy Volumes (ImageSeriesID) -> sparse volume file (x,y,z for each
voxel where expression energy value > 0, + density of expression)
Comment:Smoothed energy volume for gene Tspan2
imageseriesId 75144618
Dimensions:67,41,58
38,14,4,2.01994e-06
39,14,4,2.37068e-05
40,14,4,3.08554e-05 . . .
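A sketch of a reader for the sparse volume excerpt above. It assumes the layout is exactly what the excerpt shows: free-text header lines, one "Dimensions:" line, then one "x,y,z,value" row per non-zero voxel. The function name and the returned structure are illustrative, and the real files may contain header fields not shown here.

```python
SAMPLE = """Comment:Smoothed energy volume for gene Tspan2
imageseriesId 75144618
Dimensions:67,41,58
38,14,4,2.01994e-06
39,14,4,2.37068e-05
40,14,4,3.08554e-05
"""


def parse_sparse_volume(text):
    """Return (header, voxels); voxels maps (x, y, z) to the energy value."""
    header = {"other": []}
    voxels = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("Dimensions:"):
            dims = line.split(":", 1)[1].split(",")
            header["dimensions"] = tuple(int(d) for d in dims)
        elif line[0].isdigit():
            # Voxel row: three integer coordinates and a float value.
            x, y, z, value = line.split(",")
            voxels[(int(x), int(y), int(z))] = float(value)
        else:
            # Any other header line is kept verbatim.
            header["other"].append(line)
    return header, voxels
```

Voxels absent from the dictionary are the ones with zero expression energy, which is what keeps the format compact.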
ABA API (Genes)
• Genes (GeneSymbol) -> xml (image-series, gene-expressions)
Notes:
– No need to put sex or strain
– Need to control for dummy entries on front end
– Resolution is another issue
– Provenance information, when multiple probes (have on their site)
– Status=OK|failed|single_best
<image-series>
<age>56</age>
<geneid>12593</geneid>
<imageseriesdisplayname>Coch-Coronal-052779</imageseriesdisplayname>
<imageseriesid>71717614</imageseriesid>
<ncbiaccessionnumber>NM_007728</ncbiaccessionnumber>
<plane>coronal</plane>
<probeorientation>antisense</probeorientation>
<projectname>0310</projectname>
<riboprobename>RP_050623_02_G08</riboprobename>
<sex>male</sex>
<specimenid>05-2779</specimenid>
<strain>C57BL/6J</strain>
<templateid>143280</templateid>
<transcriptgi>31982455</transcriptgi>
<transcriptid>9068</transcriptid>
<transcriptname />
<treatmenttype>ISH</treatmenttype>
</image-series>
<gene-expression>
<avgdensity>100.0</avgdensity>
<avglevel>93.9770317077637</avglevel>
<geneid>12593</geneid>
<projectcode>0310</projectcode>
<rgb>#a0d8e8</rgb>
<structureid>343</structureid>
<structurelabel>STRd</structurelabel>
<structurename>Striatum dorsal region</structurename>
</gene-expression>
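A sketch of how a wrapper might turn one gene-expression element into a flat record for a uniform API. The element names are taken from the sample above. The function name and the choice of output fields are assumptions, and the sample is cut down to a single element rather than a full ABA response.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<gene-expression>
  <avgdensity>100.0</avgdensity>
  <avglevel>93.9770317077637</avglevel>
  <geneid>12593</geneid>
  <structureid>343</structureid>
  <structurelabel>STRd</structurelabel>
  <structurename>Striatum dorsal region</structurename>
</gene-expression>"""


def parse_gene_expression(xml_text):
    """Convert one <gene-expression> element into a typed dictionary."""
    elem = ET.fromstring(xml_text)
    return {
        "geneid": int(elem.findtext("geneid")),
        "structureid": int(elem.findtext("structureid")),
        "structurelabel": elem.findtext("structurelabel"),
        "structurename": elem.findtext("structurename"),
        "avgdensity": float(elem.findtext("avgdensity")),
        "avglevel": float(elem.findtext("avglevel")),
    }
```

Records of this shape from ABA, GN, and BIRN-MA could then be compared field by field, which is one way to approach the question of unifying what the sources return.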
ABA API Details (images)
• Get Image:
http://www.brain-map.org/aba/api/image?zoom=[zoom]&path=[filePath]&mime=[mime]&top=[top]&left=[left]&width=[width]&height=[height]
– Default output = jpeg; zooms = 0…6; path = filepath to a file in image series
– Top, left – in image coords, for full size image (implied zoomify images)
• ImageProperties (by path; by ImageID):
– <IMAGE_PROPERTIES WIDTH="15185" HEIGHT="8817" NUMTILES="2832" NUMTIERS="7"
NUMIMAGES="1" VERSION="1.8" TILESIZE="256" />
• ImageSeries (ID)
– Have GetImage feature implemented in GN (per Rob)
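The ImageProperties values above can be cross-checked with a short calculation. The sketch assumes a Zoomify-style pyramid in which each tier halves the previous one (rounding up) until the image fits in a single tile. That halving rule is an assumption about the tiling scheme and is not stated on the slide; the width, height, and tile size come from the example.

```python
import math


def pyramid_tiers(width, height, tile_size=256):
    """Return [(width, height, n_tiles)], smallest tier first."""
    tiers = []
    w, h = width, height
    while True:
        n_tiles = math.ceil(w / tile_size) * math.ceil(h / tile_size)
        tiers.append((w, h, n_tiles))
        if w <= tile_size and h <= tile_size:
            break  # the whole image now fits in one tile
        w, h = math.ceil(w / 2), math.ceil(h / 2)
    tiers.reverse()
    return tiers


tiers = pyramid_tiers(15185, 8817)
print(len(tiers), sum(n for _w, _h, n in tiers))
```

Under that assumption the 15185 x 8817 image gives 7 tiers and 2832 tiles in total, matching NUMTIERS and NUMTILES, which would be consistent with zoom levels 0…6 selecting one tier each.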
ABA API Demos
Code Samples
API Wrapper: This Java class wraps the API URLs in convenience methods. A simple caching scheme is implemented.
3D Volume classes: A set of Java classes for reading & writing 3D expression volume files retrieved through the API. Methods are
available for reading & writing our text-based volume data in MetaImage-compatible binary format; other
methods allow the extraction of a slice of volume data as a Java BufferedImage.
Images: Define & download regions of interest or complete images at multiple resolutions.
Visualization: Java classes for displaying and navigating 3D volumes, setting color maps, adjusting image dynamic range.
Analysis: Generate a median expression volume over any input set of expression volumes. Query volume files and
compute an overall gene expression "energy" statistic for each region. Calculate ISH image regions of
interest based on 3D regions of interest defined in the Atlas coordinate space.
More: Several general purpose Java Swing-based UI forms for tasks like downloading & processing data; classes for
efficiently indexing Gene-to-image_series mapping data; define/read/write 2D & 3D ROIs.
Data
Annotated Atlas volumes: 3D volume files annotated with the Allen Brain Atlas structure IDs at each voxel. Volumes available at 25, 100 &
200 micron resolution.
Brain structure ontology: The major structures of the Allen Brain Atlas and their parent-child relationships; includes the abbreviations and
IDs used throughout the ABA data set.
Gene to image series mapping: A complete list of the genes available in the ABA data set, mapped to the IDs of all of the image series for
each. Each entry also references EntrezGene IDs and NCBI accession numbers.
Other existing APIs (spatial)
• The registry has a web service interface to find available images in an ROI:
– http://smartatlas.nbirn.net:8080/axis/services/ImageMetadataForROI?wsdl
– <request><category>mouse</category><regionofinterest>-2,2,-2,-2,2,-2,2,2,2,2</regionofinterest><slicenumber>031</slicenumber></request>
• Requesting image fragments:
– E.g. http://geon15.sdsc.edu/axis/services/ImageQueryService?wsdl
– the name of the method is getSimpleImageWithSpecs
– method inputs:
• host - 132.239.131.188
• serviceName - slice_15b_warped1194307428796
• minx - -6.035011 miny - -8.165376 maxx - 12.088584 maxy - 0.812765
• imageHeight - 800 imageWidth - 600
• There are standards in the GIS world on how you exchange spatial data,
e.g. GML simple features that are the basis of many application schemas.
E.g.
– <gml:Point srsName="urn:ogc:def:crs:EPSG:6.6:4269">
<gml:pos>45.256 -71.92</gml:pos>
</gml:Point>
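A sketch of how a client might assemble the registry's ROI request document shown earlier on this slide. The element names come from that example; the function name is an assumption, and the meaning and ordering of the coordinate list are not specified here, so the values are passed through as given.

```python
import xml.etree.ElementTree as ET


def build_roi_request(category, coords, slice_number):
    """Serialize an ROI query as a <request> document."""
    root = ET.Element("request")
    ET.SubElement(root, "category").text = category
    ET.SubElement(root, "regionofinterest").text = ",".join(str(c) for c in coords)
    # The slice number is kept as a string so leading zeros survive.
    ET.SubElement(root, "slicenumber").text = slice_number
    return ET.tostring(root, encoding="unicode")


request_xml = build_roi_request(
    "mouse", [-2, 2, -2, -2, 2, -2, 2, 2, 2, 2], "031")
```

The resulting string would then be passed as the argument of the SOAP call; the transport is outside the scope of this sketch.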
Catalog, and catalog services
• The current model:
– MBAT registers individual source services, queries
them for both metadata and data
• Response time based on the slowest of them
• Catalog-based:
– For each data type, there is a catalog that stores
information for initial discovery
• Eg. probes: ABA:probe1; geneA…;…. GN:probe1
(ensuring unique probe IDs…)
• Or images: ABA:image1; type=zoomify;URL=…;
– Discovery queries (getProbes, getProbeInfo,
getTissues, getTissueInfo, etc.) are executed against
the catalog, while getData go against data sources
– The Catalog is synched with data sources periodically
(sync services)
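The catalog-based model can be sketched with in-memory stand-ins. Discovery calls hit the catalog only, probe IDs are made unique by prefixing the source name (as in ABA:probe1), and getData is routed to the owning source. The class names and the list_probes method are assumptions introduced for illustration.

```python
class FakeSource:
    """Stand-in for a wrapped data source."""

    def __init__(self, probes, data):
        self._probes = probes  # {probe_id: gene}
        self._data = data      # {probe_id: expression values}

    def list_probes(self):
        return list(self._probes.items())

    def getData(self, probe_id):
        return self._data[probe_id]


class Catalog:
    """Holds lightweight discovery records keyed by source-qualified ID."""

    def __init__(self):
        self._probes = {}

    def sync(self, source_name, source):
        # Periodic sync: copy discovery metadata, never the data itself.
        for probe_id, gene in source.list_probes():
            self._probes[f"{source_name}:{probe_id}"] = gene

    def getProbes(self, gene):
        return sorted(pid for pid, g in self._probes.items() if g == gene)


def get_data(sources, qualified_id):
    """Route a retrieval call to the source named in the ID prefix."""
    source_name, probe_id = qualified_id.split(":", 1)
    return sources[source_name].getData(probe_id)
```

With this split, discovery latency depends on the catalog alone, and only the final retrieval waits on an individual source.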
Feature Requests
• Ability to mix "AND" and "OR" in queries
– (currently all queries assume "AND" of all
parameters)
– might need a "language" to specify query
• Ability to request "pages" of results
– for example, show me the first 10 results
– similar to most search engine results
• More requests: more complete SQL
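One possible shape for the first two requests, sketched over in-memory records. Queries are nested tuples, either ("AND" or "OR", [subqueries]) or (field, value), and results are returned one page at a time. This tuple notation is an illustrative assumption and not a proposed wire format.

```python
def matches(record, query):
    """Evaluate a nested query against one record."""
    op, arg = query
    if op == "AND":
        return all(matches(record, q) for q in arg)
    if op == "OR":
        return any(matches(record, q) for q in arg)
    # Leaf: (field, value) equality test.
    return record.get(op) == arg


def search(records, query, page=1, page_size=10):
    """Return one page of matching records; pages are numbered from 1."""
    hits = [r for r in records if matches(r, query)]
    start = (page - 1) * page_size
    return hits[start:start + page_size]


RECORDS = [
    {"id": 1, "species": "mouse", "sex": "male", "strain": "DBA/2J"},
    {"id": 2, "species": "mouse", "sex": "female", "strain": "C57BL/6J"},
    {"id": 3, "species": "mouse", "sex": "female", "strain": "DBA/2J"},
    {"id": 4, "species": "rat", "sex": "male", "strain": "other"},
]

# species = mouse AND (sex = male OR strain = C57BL/6J)
QUERY = ("AND", [("species", "mouse"),
                 ("OR", [("sex", "male"), ("strain", "C57BL/6J")])])
```

Calling `search(RECORDS, QUERY, page=2, page_size=1)` returns the second matching record only, which is the "show me the next results" behaviour requested above.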
Issues/steps (for MA and 2D)
1. Figure out how search requests and outputs are implemented in MBAT
(http://www.loni.ucla.edu/twiki/bin/view/MouseBIRN/WebServices), MA module in BIRN
(http://microarray.nbirn.net/), and GN (http://www.genenetwork.org/CGIDoc.html)
2. Examine MAGE and see how the same MA requests and output can be expressed in MAGE.
Then, depending on the results of (1), either abandon MAGE in favor of some simpler XML
(potentially embedded in XCEDE?), or rely on MAGE constructs (and include them in XCEDE
wrappers for gene expression sources, as a foreign namespace?). This shall be done vis-à-vis
common information requirements of client applications (e.g. GetProbes?, GetGenes?,
GetStructures?, etc.)
3. In parallel, review the schema used in the MA module, for whether it sufficiently reflects
information model for GE data, and update as necessary
4. If we decide on the XCEDE route, make sure the mediator can connect with XCEDE sources, be it
a database source or a web source in XCEDE wrapper (see
http://mediator.nbirn.net:8080/axis/services/MedTestBService?wsdl – this would involve passing
web service calls via the ExecuteQuery? method, and conversion between XCEDE and
mediator’s recordset)
5. Identify additional sources or databases to be wrapped in the same API (GN, Gensat, ABA, BIRN
MA – for MA data; CCDB, ABA, ArcIMS, spatial registry, Gensat – for 2D). Then finalize the
signatures
6. Make sure terms used in queries and in the output, are tagged with BIRNLex terms (e.g. develop
controlled vocabularies for each term)
7. Implement web services for the GN and MA module (incl testing/deployment)
8. Based on results of (3), update data publication tools (i.e. software for loading data from common
CSV and text files into the MA module), make sure controlled vocabularies are enforced;
9. Make sure AIDB’s XCEDE wrapper supports the services as well(?).
10. Publish and document web services; develop a series of examples of how they can be called from
various programming environments and applications
API for GE/MA
• What are use cases?
• What is the information model, and what is the catalog?
a) species -> subjects (age, sex, etc.) -> strains -> probes -> genes -> tissues
• API for discovery… (by tissues, genes, probe sets,…)
– getSpecies() -> list of species in the registry
– getSpeciesInfo -> Species metadata from the registry
– getProbes (species, strain, sex, age)
– getProbeInfo
• API for retrieval…
[Figure: GE/MA information model and call sequence. Entities: Species; Strains (genetic manipulations); Genes (from a master list); Subjects (stage|age, sex); Probes (resolution, failed or not); Manufacturer/provenance info, normalization, units, etc.; Genetic manipulations (biowarehouse); CCDB; Tissues (incl. pointer to tissue vocabulary); Probe-tissue catalog; GE values (categorical or numeric)]
Discovery stage: ultimately till GetProbeTissues call
getSpecies, getSpeciesInfo
getSubjects, ..
getGenes({probeseries}) ,..
getGenes({tissue})
getProbes({Genes})
getProbeTissues
Data retrieval: getGE ({probeTissues})
API for 2D images
• What are use cases?
• What is the information model, and what is
the catalog?
• API for discovery… (by ROI, by labeled
regions, by spatial relations…also,
getImageInfo? getImageSeriesInfo?)
• API for retrieval (getImage?
getImageStack?)
Use case: find expression for a gene based on gene name or abbreviation
[Figure: 2D image information model. Entities: Species, strains, subjects, projects; Image stacks and groups; Genes (name, abbrev); Proteins; Gene|protein-tissue catalog; Image catalog (type of image, gene expressed, spatial characteristics: coord system, spatial extent, plane, local XYZs); Imagery servers]
Discovery: getImages, getImageInfo (size, orientation..), getImageStacks, getImageStackInfo
Retrieval: from the imagery servers
• Have “representatives” for each of the GE
and 2D sources, who would vet the
schema for whether the sources can be
mapped into it without significant losses