Alternative Storage


Regions of Interest


What’s in a ROI?
Use cases
 Requirements

Current Storage System
 Problems

Alternative Storage

ROI
 Geometry
 Measurements
 ROI on Channel
 Annotations
▪ ROI
▪ Measurement
▪ Links

User created ROI
 Measurement tools

HCS generated ROI
 Automatic
 External

External analysis
 Particle Tracking
 Other

Templates
 ROIs without images

Human generated
 More interactions
▪ Merge, Propagate, Split, Delete
 Measurements
▪ Geometry
▪ Intensity
▪ Path
 ROI/ROI Links
 Tags mostly on ROI
 Write Many/Read Many

HCS Generated ROI
 Lots of ROI
 Attached to Channel
 Measurements Attached
▪ Multiple measurements
 Tags on ROI, Measurements
▪ Analysis, results and meta.
 Write Once, Read Many

External Tool can Generate ROI (+ scripts)
 Can be tagged
 Links (ROI/ROI, ROI/Image)
 Results can be in any format


ROI need not be attached to image
Template to define other ROI

N-Dimensional Data
 Storage of Image data simple
 ROI more complex
▪ Database entry, file format
 We don’t just want to store in HDF

Database
 ROI
 ROI Annotations

PyTables
 Mask ROI
 Measurements

PyTables
 ROI are heterogeneous
 Concurrency
 Python behind a core service call
 Measurements are optimal
 Tagging is an issue
▪ Inside file
▪ Multiple annotations reported to be slow





ROI can be stored in database
Mask data can be an issue
Tagging in RDBMS not best
Many more annotations than we’d like
Link to external source for measurements
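
As a rough illustration of the split summarised above (ROI and annotations in the relational database, bulk measurements held elsewhere and linked), here is a minimal sketch using Python's built-in sqlite3. The table names, columns and the measurement file path are hypothetical and do not come from any existing schema.

```python
import sqlite3

# In-memory database for illustration only; the schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE roi (
        id INTEGER PRIMARY KEY,
        image_id INTEGER,            -- NULL for a template ROI with no image
        shape_type TEXT NOT NULL,
        geometry TEXT NOT NULL       -- serialised shape parameters
    );
    CREATE TABLE roi_annotation (
        roi_id INTEGER NOT NULL REFERENCES roi(id),
        namespace TEXT NOT NULL,
        tag TEXT NOT NULL
    );
    CREATE TABLE roi_measurement_link (
        roi_id INTEGER NOT NULL REFERENCES roi(id),
        uri TEXT NOT NULL            -- where the measurement table lives
    );
""")

conn.execute("INSERT INTO roi VALUES (1, 10, 'Ellipse', 'cx=3;cy=75;rx=17;ry=17')")
conn.execute("INSERT INTO roi_annotation VALUES (1, 'bob', 'foo1')")
conn.execute("INSERT INTO roi_measurement_link VALUES (1, 'measurements/plate1.h5')")

# Finding tagged ROI together with their external measurements needs two joins.
rows = conn.execute("""
    SELECT r.id, r.shape_type, m.uri
    FROM roi r
    JOIN roi_annotation a ON a.roi_id = r.id
    JOIN roi_measurement_link m ON m.roi_id = r.id
    WHERE a.tag = ?
""", ("foo1",)).fetchall()
print(rows)
```

Every tag is a row in a link table, which is one way to see why heavy tagging produces many more annotation rows than ROI rows.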

Key-Value Pair Stores
 Berkeley DB
 Project Voldemort
 Tokyo Cabinet

Document DB
 MongoDB
 CouchDB

Graph DB
 Neo4J
 InfoGrid

Table DB
 Cassandra
 Hypertable
 HBase

Other opinions on the storage solutions
 MongoDB vs CouchDB, Cassandra, ..
 CouchDB vs MongoDB
 Pros and cons of MongoDB
 Digg on Cassandra
 What is a supercolumn
 Cassandra talk
 Indexing nodes in Neo4J

Document Database
 NOSQL movement
 Schemaless
 No Tables
▪ Collections of like data
 No Joins
▪ Document is equivalent of row of data
▪ Distributed file system (GridFS)
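
To make "schemaless" and "no joins" concrete, the following is a small pure-Python sketch, not a MongoDB client: a collection is modelled as a list of dictionaries, and the field names are hypothetical.

```python
import json

# A "collection" is a group of like documents; no table schema is declared.
rois = []

# Two documents in the same collection with different fields (schemaless).
rois.append({"id": 1, "type": "Ellipse", "cx": 3, "cy": 75, "rx": 17, "ry": 17})
rois.append({"id": 2, "type": "Mask", "width": 4, "height": 2,
             "tags": [{"tag": "mitosis", "namespace": "bob"}]})

# Related data is embedded in the document rather than joined from another table.
tagged = [doc["id"] for doc in rois if doc.get("tags")]

# Each document round-trips through JSON, much as a row would through a table.
restored = [json.loads(json.dumps(doc)) for doc in rois]
print(tagged, restored == rois)
```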
Pros
 It has bindings to numerous languages (C++, C#, Java, Python, ...)
 Allows storage, indexing, linking of any user data
 Annotations are now very easy, efficient
 Has mechanisms for schema upgrade
 Dynamic Queries
 Replication
 Sharding
 Map-Reduce framework
 Fast
 GridFS is a distributed file storage mechanism within Mongo
 Easy to install
Cons
 Schemaless, data integrity will need to be worked on
 Graph structures not inherently supported
DEPLOYMENTS
 SourceForge http://sourceforge.net/
 BusinessInsider http://www.businessinsider.com/
 New York Times http://www.nytimes.com/
 Disqus http://www.disqus.com/
Human Interaction
 Merge, Propagate, Split ✓
 Geometry ✓
 Intensity ✓
 Path ✓
 ROI/ROI Links ✓
 Tags ✓
HCS
 Many ROI ✓
 Tags on ROI ✓
 Tags on Measurement ✓
 Tables of Measurements ✓
Externally Generated
 Tags ✓
 ROI/ROI Links, ROI/Image Links
 Many formats, unknown types ✓
Other
 N-Dimensional ROI ✓
 Hierarchical Structures ✓
from pymongo import MongoClient

# MongoClient replaces the older Connection class; connects to the default host and port.
connection = MongoClient()
db = connection['databaseName']
collection = db['collectionName']

# Insert one ROI document with two embedded, individually tagged ellipse shapes.
collection.insert_one({"tags": [], "label": "MyROI", "shapes": [{
    "tags": [{"tag": "foo1", "namespace": "bob"}],
    "rx": 17,
    "ry": 17,
    "label": None,
    "cy": 75,
    "cx": 3,
    "t": 0,
    "z": 0,
    "type": "Ellipse",
    "id": 3
},
{
    "tags": [{"tag": "foo2", "namespace": "bob"}],
    "rx": 10,
    "ry": 16,
    "label": None,
    "cy": 82,
    "cx": 45,
    "t": 0,
    "z": 0,
    "type": "Ellipse",
    "id": 5
}], "type": "Roi", "id": 565})
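Find ROI with tag foofoo and shapes with tag foo1
from pymongo import MongoClient

connection = MongoClient()
db = connection['databaseName']
collection = db['collectionName']
# Both conditions must hold: a tag on the ROI and a tag on one of its shapes.
collection.find({"shapes.tags.tag": "foo1", "tags.tag": "foofoo"})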
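Find ROI shapes with a tag containing mitosis
from pymongo import MongoClient

connection = MongoClient()
db = connection['databaseName']
collection = db['collectionName']
# A quoted '/.../i' string would be matched literally, so use the $regex operator.
collection.find({"shapes.tags.tag": {"$regex": "mitosis", "$options": "i"}})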
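
To make the dotted keys in the queries above less opaque, here is a minimal pure-Python sketch of how a path such as shapes.tags.tag can be resolved against a nested document. It only approximates simple equality and regular-expression matches; the helper names values_at and matches are hypothetical, and this is not how MongoDB itself is implemented.

```python
import re

def values_at(doc, path):
    """Collect every value reachable by a dotted path, descending into lists."""
    current = [doc]
    for key in path.split("."):
        found = []
        for item in current:
            # A list met along the path is searched element by element.
            candidates = item if isinstance(item, list) else [item]
            for candidate in candidates:
                if isinstance(candidate, dict) and key in candidate:
                    found.append(candidate[key])
        current = found
    # Flatten a trailing list so array members can be compared individually.
    flat = []
    for value in current:
        flat.extend(value if isinstance(value, list) else [value])
    return flat

def matches(doc, query):
    """True if every path in the query has at least one matching value."""
    for path, wanted in query.items():
        values = values_at(doc, path)
        if isinstance(wanted, re.Pattern):
            ok = any(isinstance(v, str) and wanted.search(v) for v in values)
        else:
            ok = wanted in values
        if not ok:
            return False
    return True

roi = {"tags": [{"tag": "foofoo", "namespace": "bob"}],
       "shapes": [{"type": "Ellipse", "tags": [{"tag": "foo1", "namespace": "bob"}]},
                  {"type": "Ellipse", "tags": [{"tag": "early mitosis", "namespace": "bob"}]}]}

print(matches(roi, {"shapes.tags.tag": "foo1", "tags.tag": "foofoo"}))
print(matches(roi, {"shapes.tags.tag": re.compile("mitosis", re.IGNORECASE)}))
print(matches(roi, {"shapes.tags.tag": "foo3"}))
```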




Graph Database
Use nodes to represent objects
User specifies relationship between nodes
Allows complex traversal of node structures
PROS
 Handles graph structures nicely
 Transactional
 Supported by Gremlin
 Native RDF http://components.neo4j.org/neordf-sail/
 Easy to install
CONS
 No C++ language binding.
 Not distributed.
 Tables are not so easily modeled.
 Difficult to query on node contents
DEPLOYMENTS
 The Swedish Defence forces http://www.mil.se
 Windh Technologies http://www.windh.com
 Flextoll http://www.flextoll.se
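// Relationship types used to link OMERO objects in the graph.
public enum OMERORelations implements RelationshipType
{
    ASSOCIATE,
    DERIVE,
    AGGREGATE,
    COMPOSE
}

// Node for the source image, carrying its identifying properties.
// Note: Neo4j property values are limited to primitives, strings and arrays of
// these, so storing the IObject assumes it has been converted to such a type.
Node image = neo.createNode();
image.setProperty("IObject", imageI);
image.setProperty("id", imageI.getId().getValue());
image.setProperty("name", imageI.getName().getValue());

// Node for the image derived from it.
Node derivedImage = neo.createNode();
derivedImage.setProperty("IObject", derivedImageI);
derivedImage.setProperty("id", derivedImageI.getId().getValue());
derivedImage.setProperty("name", derivedImageI.getName().getValue());

// Link the two nodes and record how the derived image was produced.
Relationship relationship = image.createRelationshipTo(derivedImage, OMERORelations.DERIVE);
relationship.setProperty("type", "ROI");
relationship.setProperty("operation", "crop");
relationship.setProperty("roi", cropRoiI);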
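
The traversal that typed relationships such as DERIVE make possible can be sketched in a few lines. This is plain Python rather than the Neo4j API, and the node names and the traverse helper are hypothetical.

```python
from collections import deque

# Nodes hold properties; edges are (relationship type, target node id).
nodes = {1: {"name": "original"}, 2: {"name": "cropped"},
         3: {"name": "cropped-and-scaled"}, 4: {"name": "unrelated"}}
edges = {1: [("DERIVE", 2), ("ASSOCIATE", 4)], 2: [("DERIVE", 3)]}

def traverse(start, relationship):
    """Breadth-first walk following only edges of the given relationship type."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for rel, target in edges.get(node, []):
            if rel == relationship and target not in seen:
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order

derived = [nodes[n]["name"] for n in traverse(1, "DERIVE")]
print(derived)
```

Following only DERIVE edges returns every image derived directly or indirectly from the original, and skips the node that is merely associated with it.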
Human Interaction
 Merge, Propagate, Split ✓
 Geometry
 Intensity
 Path ✓
 ROI/ROI Links ✓
 Tags
HCS
 Many ROI ✓
 Tags on ROI ✓
 Tags on Measurement ✓
 Tables of Measurements
Externally Generated
 Tags ✓
 ROI/ROI Links, ROI/Image Links ✓
 Many formats, unknown types
Other
 N-Dimensional ROI
 Hierarchical Structures ✓
An implementation of Google’s BigTable: a complex implementation of a key/value store used to represent a table. A sophisticated toolset is required to get the most out of this solution; for instance, Google created Sawzall to query its system, and Digg has released a Python library called Lazyboy for working with Cassandra.
It works by creating a table whose columns are grouped into column families; like data lives in the same column family (for example, Ellipse ROI).
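
The column-family layout described above can be pictured as nested maps. The sketch below is plain Python, not a Cassandra client; the row keys, family names and the put and get_family helpers are hypothetical.

```python
# row key -> column family -> {column name: value}
table = {}

def put(row_key, family, column, value):
    """Store one cell; rows and columns are created on first write."""
    table.setdefault(row_key, {}).setdefault(family, {})[column] = value

# Like data shares a column family, but rows need not share columns.
put("roi:3", "ellipse", "cx", 3)
put("roi:3", "ellipse", "rx", 17)
put("roi:3", "tags", "bob:foo1", "")
put("roi:9", "rectangle", "x", 0)
put("roi:9", "rectangle", "width", 40)

def get_family(row_key, family):
    """Return all columns of one family for a row (empty if absent)."""
    return dict(table.get(row_key, {}).get(family, {}))

print(get_family("roi:3", "ellipse"))
print(get_family("roi:9", "ellipse"))
```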
Pros
 Quick
 Handles heterogeneous data well
 Can manage distributed data
 Different rows can have different columns
 Map/Reduce
 Focus on writes not reads
 Scales nicely
 Easy to install
Cons
 Not simple to work with
 Building hierarchical structures
 Sorting
 Querying
▪ Ad hoc queries are bad; Digg still uses MySQL for certain queries.
 Have to manage secondary indexes (K/V)
 Version 0.5
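
"Have to manage secondary indexes" means the application must keep its own lookup tables in step with the data. A minimal sketch of that bookkeeping, using plain Python dictionaries in place of a real key/value store (all names are hypothetical):

```python
store = {}       # primary key/value store: roi key -> record
tag_index = {}   # hand-maintained secondary index: tag -> set of roi keys

def put_roi(key, record):
    """Write a record and keep the tag index in step with it."""
    old = store.get(key)
    if old is not None:
        # Remove stale index entries before overwriting the record.
        for tag in old["tags"]:
            tag_index[tag].discard(key)
    store[key] = record
    for tag in record["tags"]:
        tag_index.setdefault(tag, set()).add(key)

def find_by_tag(tag):
    """Look up ROI keys through the index instead of scanning every record."""
    return sorted(tag_index.get(tag, set()))

put_roi("roi:3", {"type": "Ellipse", "tags": ["foo1"]})
put_roi("roi:5", {"type": "Ellipse", "tags": ["foo1", "mitosis"]})
put_roi("roi:3", {"type": "Ellipse", "tags": ["mitosis"]})  # retag roi:3

print(find_by_tag("foo1"))
print(find_by_tag("mitosis"))
```

Every write touches both structures, which is the extra work a relational database would otherwise do on the application's behalf.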
Deployments
 Facebook (MAYBE!!) http://www.facebook.com
 Digg http://www.digg.com
Human Interaction
 Merge, Propagate, Split ✓
 Geometry ✓
 Intensity ✓
 Path
 ROI/ROI Links
 Tags ✓
HCS
 Many ROI ✓
 Tags on ROI ✓
 Tags on Measurement ✓
 Tables of Measurements ✓
Externally Generated
 Tags ✓
 ROI/ROI Links, ROI/Image Links ✓
 Many formats, unknown types
Other
 N-Dimensional ROI ✓
 Hierarchical Structures
An implementation of Google’s BigTable: a complex implementation of a key/value store used to represent a table. A sophisticated toolset is required to get the most out of this solution; for instance, Google created Sawzall to query its system. Hypertable has a query language called HQL.
It works by creating a table whose columns are grouped into column families; like data lives in the same column family (for example, Ellipse ROI).
Pros
 Quick
 Handles heterogeneous data well
 Different rows can have different columns
 Can manage distributed data
 Map/Reduce
 Scales nicely
 Easy to install
Cons
 GPL License
 Building hierarchical structures
 Docs are weak
 HQL works for simple queries only
 Map/Reduce for other work
 Limit of 255 column families
 Secondary keys
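
Where a query language only covers simple queries, aggregation falls to Map/Reduce. The shape of such a job can be sketched in plain Python; this is not the Hypertable or Hadoop API, and the measurement rows and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical measurement rows: (roi key, channel, mean intensity).
rows = [("roi:3", 0, 10.0), ("roi:3", 1, 30.0),
        ("roi:5", 0, 20.0), ("roi:5", 1, 50.0)]

def map_row(row):
    """Emit (channel, intensity) for each measurement row."""
    _, channel, intensity = row
    yield channel, intensity

def reduce_values(channel, values):
    """Average all intensities emitted for one channel."""
    return channel, sum(values) / len(values)

# Shuffle step: group mapped values by key before reducing.
grouped = defaultdict(list)
for row in rows:
    for key, value in map_row(row):
        grouped[key].append(value)

result = dict(reduce_values(k, v) for k, v in sorted(grouped.items()))
print(result)
```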
Deployments
 Rediff http://www.rediff.com
 Zvents http://www.zvents.com/
Human Interaction
 Merge, Propagate, Split ✓
 Geometry ✓
 Intensity ✓
 Path
 ROI/ROI Links
 Tags ✓
HCS
 Many ROI ✓
 Tags on ROI ✓
 Tags on Measurement ✓
 Tables of Measurements ✓
Externally Generated
 Tags ✓
 ROI/ROI Links, ROI/Image Links ✓
 Many formats, unknown types
Other
 N-Dimensional ROI ✓
 Hierarchical Structures


Why do we have an RDBMS?
We don’t normalise the data
 Each import will normalise on:
▪ Image, ObjectiveSettings, LogicalChannel, LightSettings, DetectorSettings.
 Object Penalty
 Difference between normalisation and view