Alternative Storage
Regions of Interest
What’s in a ROI?
Use cases
Requirements
Current Storage System
Problems
Alternative Storage
ROI
Geometry
Measurements
ROI on Channel
Annotations
▪ ROI
▪ Measurement
▪ Links
User created ROI
Measurement tools
HCS generated ROI
Automatic
External
External analysis
Particle Tracking
Other
Templates
ROIs without images
Human generated
More interactions
▪ Merge, Propagate, Split, Delete
Measurements
▪ Geometry
▪ Intensity
▪ Path
ROI/ROI Links
Tags mostly on ROI
Write Many/Read Many
HCS Generated ROI
Lots of ROI
Attached to Channel
Measurements Attached
▪ Multiple measurements
Tags on ROI, Measurements
▪ Analysis, results and meta.
Write Once, Read Many
External Tool can Generate ROI (+ scripts)
Can be tagged
Links (ROI/ROI, ROI/Image)
Results can be in any format
ROI need not be attached to image
Template to define other ROI
N-Dimensional Data
Storage of Image data simple
ROI more complex
▪ Database entry, file format
We don’t just want to store in HDF
Database
ROI
ROI Annotations
PyTables
Mask ROI
Measurements
PyTables
ROI are heterogeneous
Concurrency
Python behind a core service call
Measurements are optimal
Tagging is an issue
▪ Inside file
▪ Multiple annotations reported to be slow
ROI can be stored in database
Mask data can be an issue
Tagging in an RDBMS not best
Many more annotations than we’d like
Link to external source for measurements
Key-Value Pair Stores
Berkeley DB
Project Voldemort
Tokyo Cabinet
Document DB
MongoDB
CouchDB
Graph DB
Neo4J
InfoGrid
Table DB
Cassandra
Hypertable
HBase
Other opinions on the storage solutions
MongoDB vs CouchDB, Cassandra, ..
CouchDB vs MongoDB
Pros and cons of MongoDB
Digg on Cassandra
What is a supercolumn
Cassandra talk
Indexing nodes in Neo4J
Document Database
NOSQL movement
Schemaless
No Tables
▪ Collections of like data
No Joins
▪ Document is equivalent of row of data
▪ Distributed file system (GridFS)
Pros
Bindings to numerous languages (C++, C#, Java, Python, ...).
Allows storage, indexing and linking of any user data.
Annotations are now very easy and efficient.
Has mechanisms for schema upgrade.
Dynamic queries.
Replication.
Sharding.
Map-Reduce framework.
Fast.
GridFS is a distributed file storage mechanism within Mongo.
Easy to install.
Cons
Schemaless: data integrity will need to be worked on.
Graph structures not inherently supported.
DEPLOYMENTS
SourceForge http://sourceforge.net/
BusinessInsider http://www.businessinsider.com/
New York Times http://www.nytimes.com/
Disqus http://www.disqus.com/
Human Interaction
▪ Merge, Propagate, Split ✓
▪ Geometry ✓
▪ Intensity ✓
▪ Path ✓
▪ ROI/ROI Links ✓
▪ Tags ✓
HCS
▪ Many ROI ✓
▪ Tags on ROI ✓
▪ Tags on Measurement ✓
▪ Tables of Measurements ✓
Externally Generated
▪ Tags ✓
▪ ROI/ROI Links, ROI/Image Links
▪ Many formats, unknown types ✓
Other
▪ N-Dimensional ROI ✓
▪ Hierarchical Structures ✓
from pymongo import Connection  # pymongo < 3.0; later releases use MongoClient
connection = Connection()
db = connection['databaseName']
collection = db['collectionName']
# Insert one ROI document holding two ellipse shapes, each with its own tags.
collection.insert({"tags" : [ ], "label" : "MyROI", "shapes" : [{
"tags" : [{"tag" : "foo1", "namespace" : "bob"}],
"rx" : 17,
"ry" : 17,
"label" : None,
"cy" : 75,
"cx" : 3,
"t" : 0,
"z" : 0,
"type" : "Ellipse",
"id" : 3
},
{
"tags" : [{"tag" : "foo2", "namespace" : "bob"}],
"rx" : 10,
"ry" : 16,
"label" : None,
"cy" : 82,
"cx" : 45,
"t" : 0,
"z" : 0,
"type" : "Ellipse",
"id" : 5
}], "type" : "Roi", "id" : 565 })
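Find ROI with tag foofoo and shapes with tag foo1
from pymongo import Connection  # pymongo < 3.0; later releases use MongoClient
connection = Connection()
db = connection['databaseName']
collection = db['collectionName']
# Dotted paths reach into embedded documents and arrays.
collection.find({"shapes.tags.tag": "foo1", "tags.tag": "foofoo"})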
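Find ROI shapes with a tag containing mitosis
from pymongo import Connection  # pymongo < 3.0; later releases use MongoClient
connection = Connection()
db = connection['databaseName']
collection = db['collectionName']
# A quoted '/.../i' string would be matched literally, so use the $regex operator.
collection.find({"shapes.tags.tag": {"$regex": "mitosis", "$options": "i"}})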
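The two queries above rely on dotted paths such as shapes.tags.tag reaching through arrays of embedded documents. The following is a minimal pure-Python sketch of that matching rule, not MongoDB's implementation; the helpers values_at and matches are hypothetical names, and the sample ROI is a cut-down variant of the inserted document with illustrative tags added so that the queries have something to find.

```python
import re


def values_at(doc, path):
    """Yield every value reachable via a dotted path, descending into lists."""
    if isinstance(doc, list):
        # An array is searched element by element with the same path.
        for item in doc:
            yield from values_at(item, path)
        return
    key, _, rest = path.partition(".")
    if not isinstance(doc, dict) or key not in doc:
        return
    value = doc[key]
    if rest:
        yield from values_at(value, rest)
    elif isinstance(value, list):
        # A list at the leaf matches if any of its elements matches.
        yield from value
    else:
        yield value


def matches(doc, query):
    """True if, for every path in the query, some value at that path satisfies it."""
    for path, cond in query.items():
        if isinstance(cond, re.Pattern):
            ok = any(isinstance(v, str) and cond.search(v) for v in values_at(doc, path))
        else:
            ok = any(v == cond for v in values_at(doc, path))
        if not ok:
            return False
    return True


# Illustrative document (tags are assumptions made for this example).
roi = {
    "id": 565, "type": "Roi", "label": "MyROI",
    "tags": [{"tag": "foofoo", "namespace": "bob"}],
    "shapes": [
        {"id": 3, "type": "Ellipse", "tags": [{"tag": "foo1", "namespace": "bob"}]},
        {"id": 5, "type": "Ellipse", "tags": [{"tag": "early mitosis", "namespace": "bob"}]},
    ],
}

found_exact = matches(roi, {"shapes.tags.tag": "foo1", "tags.tag": "foofoo"})
found_regex = matches(roi, {"shapes.tags.tag": re.compile("mitosis", re.I)})
found_none = matches(roi, {"shapes.tags.tag": "foo3"})
```

Because tags live inside the ROI and shape documents, a tag lookup is a walk over one document rather than a join across annotation tables.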
Graph Database
Use nodes to represent objects
User specifies relationship between nodes
Allows complex traversal of node structures
PROS
Handles graph structures nicely
Transactional
Supported by Gremlin
Native RDF http://components.neo4j.org/neordf-sail/
Easy to install
CONS
No C++ language binding.
Not distributed.
Tables are not so easily modeled.
Difficult to query on node contents
DEPLOYMENTS
The Swedish Defence forces http://www.mil.se
Windh Technologies http://www.windh.com
Flextoll http://www.flextoll.se
// Relationship types used to link OMERO objects in the graph.
public enum OMERORelations implements RelationshipType
{
    ASSOCIATE,
    DERIVE,
    AGGREGATE,
    COMPOSE
}
// Node for the original image.
Node image = neo.createNode();
image.setProperty("IObject",imageI);
image.setProperty("id",imageI.getId().getValue());
image.setProperty("name",imageI.getName().getValue());
// Node for the image derived from it.
Node derivedImage = neo.createNode();
derivedImage.setProperty("IObject",derivedImageI);
derivedImage.setProperty("id",derivedImageI.getId().getValue());
derivedImage.setProperty("name",derivedImageI.getName().getValue());
// DERIVE relationship recording that the derived image is a crop defined by an ROI.
Relationship relationship = image.createRelationshipTo(derivedImage, OMERORelations.DERIVE);
relationship.setProperty("type","ROI");
relationship.setProperty("operation","crop");
relationship.setProperty("roi",cropRoiI);
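To make the node, relationship and traversal idea concrete, here is a small in-memory sketch in Python. It is not the Neo4j API: the Graph class, its methods and the node names are all hypothetical, chosen only to mirror the DERIVE relationship in the Java example and show what a traversal along one relationship type returns.

```python
from collections import deque


class Graph:
    """Toy property graph: nodes and typed relationships both carry properties."""

    def __init__(self):
        self.nodes = {}  # node id -> property dict
        self.edges = {}  # node id -> list of (relationship type, target id, properties)

    def create_node(self, node_id, **props):
        self.nodes[node_id] = props
        self.edges.setdefault(node_id, [])

    def relate(self, source, rel_type, target, **props):
        self.edges[source].append((rel_type, target, props))

    def traverse(self, start, rel_type):
        """Breadth-first walk from start, following only one relationship type."""
        seen, queue, order = {start}, deque([start]), []
        while queue:
            current = queue.popleft()
            for kind, target, _props in self.edges[current]:
                if kind == rel_type and target not in seen:
                    seen.add(target)
                    order.append(target)
                    queue.append(target)
        return order


g = Graph()
g.create_node("image", name="original")
g.create_node("crop", name="crop of original")
g.create_node("crop_of_crop", name="crop of crop")
g.create_node("note", name="comment")
g.relate("image", "DERIVE", "crop", type="ROI", operation="crop")
g.relate("crop", "DERIVE", "crop_of_crop", type="ROI", operation="crop")
g.relate("image", "ASSOCIATE", "note")

# Everything derived, directly or indirectly, from the original image.
derived = g.traverse("image", "DERIVE")
```

The traversal follows relationships rather than inspecting node contents, which matches the trade-off listed for graph databases: hierarchies and links are natural, content queries are harder.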
Human Interaction
▪ Merge, Propagate, Split ✓
▪ Geometry
▪ Intensity
▪ Path ✓
▪ ROI/ROI Links ✓
▪ Tags
HCS
▪ Many ROI ✓
▪ Tags on ROI ✓
▪ Tags on Measurement ✓
▪ Tables of Measurements
Externally Generated
▪ Tags ✓
▪ ROI/ROI Links, ROI/Image Links ✓
▪ Many formats, unknown types
Other
▪ N-Dimensional ROI
▪ Hierarchical Structures ✓
Cassandra
An implementation of Google’s BigTable model: a complex key/value store used to represent a table.
A sophisticated toolset is required to get the most out of this solution; for instance, Google created Sawzall to query this system. Digg have released a Python library for working with Cassandra called Lazyboy.
Works by creating a table whose columns are grouped into column families; like data lives in the same column family (e.g. Ellipse ROI).
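The column-family layout can be pictured as nested maps: row key, then column family, then column name. The sketch below is plain Python under that assumption, not the Cassandra client API; ColumnFamilyTable, the family names and the row-key scheme are hypothetical.

```python
from collections import defaultdict


class ColumnFamilyTable:
    """Toy table: row key -> column family -> column name -> value."""

    def __init__(self, families):
        self.families = set(families)
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        # Families are fixed up front; columns inside a family are free-form.
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][family][column] = value

    def get_family(self, row_key, family):
        # Use .get so that reads never create empty rows as a side effect.
        return dict(self.rows.get(row_key, {}).get(family, {}))


table = ColumnFamilyTable(["ellipse", "rect"])

# Ellipse shapes keep their geometry in the "ellipse" family.
for column, value in {"cx": 3, "cy": 75, "rx": 17, "ry": 17}.items():
    table.put("roi:565:shape:3", "ellipse", column, value)

# A rectangle row uses a different family with different columns.
for column, value in {"x": 10, "y": 20, "width": 5, "height": 8}.items():
    table.put("roi:566:shape:1", "rect", column, value)

ellipse = table.get_family("roi:565:shape:3", "ellipse")
missing = table.get_family("roi:566:shape:1", "ellipse")
```

Different rows carrying different columns is what lets heterogeneous ROI shapes share one table.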
Pros
Quick
Handles heterogeneous data well
Can manage distributed data
Different rows can have different columns
Map/Reduce
Focus on writes not reads
Scales nicely
Easy to Install
Cons
Not simple to work with
Building hierarchical structures
Sorting
Querying
▪ Ad hoc queries are bad; Digg still use MySQL for certain queries.
Have to manage secondary indexes (K/V)
Version 0.5
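Managing secondary indexes by hand, as listed among the cons above, means every write has to update a second key/value mapping alongside the primary one. A minimal sketch of that bookkeeping, using plain Python dicts in place of a real store (TaggedStore and its record layout are hypothetical):

```python
class TaggedStore:
    """Key/value store with a hand-maintained tag -> ROI id index."""

    def __init__(self):
        self.primary = {}  # roi id -> record
        self.by_tag = {}   # tag -> set of roi ids (the secondary index)

    def put(self, roi_id, record):
        # Drop stale index entries before overwriting an existing record.
        old = self.primary.get(roi_id)
        if old is not None:
            for tag in old["tags"]:
                self.by_tag[tag].discard(roi_id)
        self.primary[roi_id] = record
        for tag in record["tags"]:
            self.by_tag.setdefault(tag, set()).add(roi_id)

    def find_by_tag(self, tag):
        return sorted(self.by_tag.get(tag, set()))


store = TaggedStore()
store.put(565, {"type": "Ellipse", "tags": ["mitosis", "bob"]})
store.put(566, {"type": "Rect", "tags": ["mitosis"]})
# Re-tagging ROI 565 must also remove it from the old index entries.
store.put(565, {"type": "Ellipse", "tags": ["interphase"]})

mitosis_ids = store.find_by_tag("mitosis")
interphase_ids = store.find_by_tag("interphase")
```

Forgetting the cleanup step on overwrite leaves the index pointing at records that no longer carry the tag, which is the kind of integrity work the store leaves to the application.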
Deployments
Facebook (MAYBE!!) http://www.facebook.com
Digg http://www.digg.com
Human Interaction
▪ Merge, Propagate, Split ✓
▪ Geometry ✓
▪ Intensity ✓
▪ Path
▪ ROI/ROI Links
▪ Tags ✓
HCS
▪ Many ROI ✓
▪ Tags on ROI ✓
▪ Tags on Measurement ✓
▪ Tables of Measurements ✓
Externally Generated
▪ Tags ✓
▪ ROI/ROI Links, ROI/Image Links ✓
▪ Many formats, unknown types
Other
▪ N-Dimensional ROI ✓
▪ Hierarchical Structures
Hypertable
An implementation of Google’s BigTable model: a complex key/value store used to represent a table.
A sophisticated toolset is required to get the most out of this solution; for instance, Google created Sawzall to query this system. Hypertable has a query language called HQL.
Works by creating a table whose columns are grouped into column families; like data lives in the same column family (e.g. Ellipse ROI).
Pros
Quick
Handles heterogeneous data well
Different rows can have different columns
Can manage distributed data
Map/Reduce
Scales nicely
Easy to Install
Cons
GPL License
Building hierarchical structures
Docs are weak
HQL works for simple queries only
Map/Reduce for other work
limit of 255 column families
Secondary keys
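Where HQL only covers simple queries, anything else becomes a Map/Reduce job. The sketch below shows the general shape of such a job in plain Python, with no Hypertable or Hadoop API involved; the function names and the mean_intensity field are assumptions made for illustration.

```python
from collections import defaultdict


def map_shape(shape):
    """Map step: emit (shape type, mean intensity) for one stored shape."""
    yield shape["type"], shape["mean_intensity"]


def reduce_mean(key, values):
    """Reduce step: average all intensities emitted for one shape type."""
    return key, sum(values) / len(values)


def map_reduce(records, mapper, reducer):
    # Shuffle phase: group every emitted value under its key.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


shapes = [
    {"type": "Ellipse", "mean_intensity": 10.0},
    {"type": "Ellipse", "mean_intensity": 20.0},
    {"type": "Rect", "mean_intensity": 4.0},
]

means = map_reduce(shapes, map_shape, reduce_mean)
```

Even a simple aggregate like this needs a mapper and a reducer, which is the cost hinted at by "Map/Reduce for other work".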
Deployments
Rediff http://www.rediff.com
Zvents http://www.zvents.com/
Human Interaction
▪ Merge, Propagate, Split ✓
▪ Geometry ✓
▪ Intensity ✓
▪ Path
▪ ROI/ROI Links
▪ Tags ✓
HCS
▪ Many ROI ✓
▪ Tags on ROI ✓
▪ Tags on Measurement ✓
▪ Tables of Measurements ✓
Externally Generated
▪ Tags ✓
▪ ROI/ROI Links, ROI/Image Links ✓
▪ Many formats, unknown types
Other
▪ N-Dimensional ROI ✓
▪ Hierarchical Structures
Why do we have an RDBMS?
We don’t normalise the data
Each import will normalise on:
▪ Image, ObjectiveSettings, LogicalChannel, LightSettings, DetectorSettings.
Object Penalty
Difference between normalisation and view