Smart Storage for Physical Properties
Or
How on Earth do we Store this Stuff?
Kieron Taylor with Jeremy Frey and Jonathan Essex
What makes up chemical data?
● Numbers - big, small, precise and vague
● Circumstances - How hot? What pressure?
● Assumptions
  – This is pretty pure, let's say it's pure
  – Standard conditions? More or less
  – That peak on the spectrum isn't important
Using the Data: QSPR
Take lots of data
Magical statistics occur
Validate results
Predictive model
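The four steps above can be sketched in a few lines. This is only an illustration, not the statistics used in this work: it assumes a single hypothetical descriptor, uses an ordinary least-squares line as the "magical statistics", and runs on synthetic numbers invented purely so the code executes.

```python
# Minimal QSPR-style sketch: fit on training data, validate on held-out data.
# All numbers below are synthetic placeholders, not measured properties.

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rmse(model, xs, ys):
    """Root-mean-square error of the model on the given points."""
    slope, intercept = model
    squared = [(slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)]
    return (sum(squared) / len(squared)) ** 0.5

# 1. Take lots of data (here: six synthetic descriptor/property pairs).
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
measured = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]

# 2. Statistics occur: fit on the first four points only.
model = fit_line(descriptor[:4], measured[:4])

# 3. Validate results on the two points the fit never saw.
validation_rmse = rmse(model, descriptor[4:], measured[4:])

# 4. Predictive model: apply the fit to a new descriptor value.
def predict(x):
    slope, intercept = model
    return slope * x + intercept
```

A real QSPR study would use many descriptors and a more careful validation scheme; the point here is only that the quality of the final model is bounded by the quality of the data fed in at step 1.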
So What is Real Data like?
Bad - take the commercial Physprop Database
Can we handle these melting points?
Let's Make a Database
● One data source is not enough
● Good(?) data isn't free
● Different sources have varied styles of content
● Most database software is not suited to data mining
● We cannot simply plumb these varied sources for data; we must reconcile them to make sensible statistics
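One small part of reconciling sources is bringing values onto a common footing before any statistics are run. The sketch below assumes the mismatch is in units, which is only one of the ways sources can differ; the source names, records and conversion table are hypothetical.

```python
# Sketch: normalise solubility values from differently styled sources to mg/L.
# The conversion table and the records are illustrative assumptions.

TO_MG_PER_L = {"mg/L": 1.0, "g/L": 1000.0, "ug/L": 0.001}

def to_mg_per_l(value, units):
    """Convert a concentration to mg/L, refusing units we do not know."""
    if units not in TO_MG_PER_L:
        raise ValueError(f"cannot reconcile unknown units: {units!r}")
    return value * TO_MG_PER_L[units]

records = [
    {"source": "source_a", "value": 2500.0, "units": "mg/L"},
    {"source": "source_b", "value": 2.65, "units": "g/L"},
]

# Every value now shares one unit and can be compared or averaged.
reconciled = [to_mg_per_l(r["value"], r["units"]) for r in records]
```

Raising an error on unknown units, rather than guessing, keeps silently wrong values out of the statistics.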
Relational Design
For one molecule: Cyclohexanone

Property       Value  Error   Units  Source       Method      Author  Note
Solubility     2500   +/-50   mg/L   Physprop     Laboratory  ...
Solubility     2650   +/-60   mg/L   Southampton  Simulation  Me      Superseded
Solubility     2599   +/-25   mg/L   Southampton  Laboratory  B       Our lab...
Melting point  -31    +/-0.1  C      Detherm      Laboratory  ...
Boiling point  155.4  +/-0.5  C      Merck Index  Laboratory  ...     Decomposing
Arbitrary numbers of points are hard to store in relational databases
We're not done yet: we still have to account for multiple experimental conditions, statements of validity and molecules.
Provenance = Senary relational model?
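To make the difficulty concrete, here is a minimal sketch of one possible relational layout, using Python's built-in sqlite3 module. The table and column names are assumptions for illustration, not the schema used in this work. The measurement rows are the cyclohexanone values from the slide; the single condition row is a placeholder that is not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per data point, with provenance flattened into fixed columns.
conn.execute("""
    CREATE TABLE measurement (
        id       INTEGER PRIMARY KEY,
        molecule TEXT NOT NULL,
        property TEXT NOT NULL,
        value    REAL NOT NULL,
        error    REAL,
        units    TEXT,
        source   TEXT,
        method   TEXT
    )
""")

# Each measurement may carry any number of experimental conditions,
# so they cannot be columns; they need a second table and a join.
conn.execute("""
    CREATE TABLE measurement_condition (
        measurement_id INTEGER REFERENCES measurement(id),
        name  TEXT NOT NULL,
        value REAL,
        units TEXT
    )
""")

rows = [
    ("Cyclohexanone", "Solubility", 2500.0, 50.0, "mg/L", "Physprop", "Laboratory"),
    ("Cyclohexanone", "Solubility", 2650.0, 60.0, "mg/L", "Southampton", "Simulation"),
    ("Cyclohexanone", "Melting point", -31.0, 0.1, "C", "Detherm", "Laboratory"),
    ("Cyclohexanone", "Boiling point", 155.4, 0.5, "C", "Merck Index", "Laboratory"),
]
conn.executemany(
    "INSERT INTO measurement "
    "(molecule, property, value, error, units, source, method) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    rows,
)

# Placeholder condition (hypothetical, not from the slides).
conn.execute(
    "INSERT INTO measurement_condition VALUES (?, ?, ?, ?)",
    (1, "temperature", 25.0, "C"),
)

solubilities = conn.execute(
    "SELECT value, source FROM measurement WHERE property = ? ORDER BY value",
    ("Solubility",),
).fetchall()
```

Every further kind of information (authors, validity statements, superseded values) tends to add another table and another join, which is the growth in arity the slide is pointing at.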
RDF Triplestore is the Solution
● RDF describes trees and networks of entities
● Data of this complexity lends itself well to a tree representation
● RDF trees enable additional clever things
● Triplestores provide persistent RDF models
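As a rough illustration of the tree idea, one data point can be written as a handful of (subject, predicate, object) triples. The sketch uses plain Python tuples rather than a real triplestore, and every "ex:" name is a hypothetical identifier, not the vocabulary used in this work.

```python
# One measurement of cyclohexanone expressed as triples.
# The "ex:" identifiers are made up for illustration.
triples = [
    ("ex:cyclohexanone", "ex:hasMeasurement", "ex:m1"),
    ("ex:m1", "ex:property", "Solubility"),
    ("ex:m1", "ex:value", 2500.0),
    ("ex:m1", "ex:error", 50.0),
    ("ex:m1", "ex:units", "mg/L"),
    ("ex:m1", "ex:source", "Physprop"),
    ("ex:m1", "ex:method", "Laboratory"),
]

def walk(node, depth=0):
    """Return indented lines describing the tree rooted at node."""
    lines = []
    for subject, predicate, obj in triples:
        if subject == node:
            lines.append("  " * depth + f"{predicate} -> {obj}")
            # Objects that are themselves subjects become sub-trees.
            lines.extend(walk(obj, depth + 1))
    return lines

tree = walk("ex:cyclohexanone")
```

Attaching a new kind of fact to a measurement means appending one more triple; nothing resembling a schema change is needed.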
What can we do with this?
● Store almost any chemical data as normal
● Track the where, when and how of each and every data point
● Filter values by whether they are real, simulated, old or new, from a particular source, or produced by a particular person
● Bolt on RDF schemas such as FOAF and our units system
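Filtering by provenance can be sketched over the same kind of triples. This is an in-memory toy, not a query against 3store, and the "ex:" predicates are hypothetical; the values are the cyclohexanone figures from the relational design slide.

```python
# Three data points with provenance, as (subject, predicate, object) triples.
triples = [
    ("ex:m1", "ex:property", "Solubility"),
    ("ex:m1", "ex:value", 2500.0),
    ("ex:m1", "ex:source", "Physprop"),
    ("ex:m1", "ex:method", "Laboratory"),
    ("ex:m2", "ex:property", "Solubility"),
    ("ex:m2", "ex:value", 2650.0),
    ("ex:m2", "ex:source", "Southampton"),
    ("ex:m2", "ex:method", "Simulation"),
    ("ex:m3", "ex:property", "Melting point"),
    ("ex:m3", "ex:value", -31.0),
    ("ex:m3", "ex:source", "Detherm"),
    ("ex:m3", "ex:method", "Laboratory"),
]

def matching(**wanted):
    """Return the subjects whose facts equal every wanted value."""
    facts = {}
    for subject, predicate, obj in triples:
        facts.setdefault(subject, {})[predicate] = obj
    return sorted(
        subject
        for subject, found in facts.items()
        if all(found.get("ex:" + key) == val for key, val in wanted.items())
    )

# Real (laboratory) values only, whatever the property.
lab_only = matching(method="Laboratory")

# Simulated solubilities only.
simulated_solubility = matching(property="Solubility", method="Simulation")

# Everything that came from one particular source.
from_physprop = matching(source="Physprop")
```

In a real triplestore the same selections would be written in its query language instead of Python, but the shape of the question is the same: pick data points by the facts attached to them.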
What have we done with this?
http://green.chem.soton.ac.uk/triangle/query.html
Thanks to:
● AKT and Steve Harris for 3store
● Rob Gledhill for web tech and discussion
● Perl for s/ / /g