Database-centric Data Analysis of Molecular Simulations
Yicheng Tu *, Sagar Pandit §, Ivan Dyedov *, and Vladimir Grupcev *
* Department of Computer Science and Engineering, § Department of Physics
Abstract
Molecular simulations (MS) have become an integral part of molecular and structural biology. By providing model descriptions of biochemical and biophysical processes at the nanoscopic scale, MS can provide a fundamental understanding of diseases and aid drug discovery. MS, by their nature, generate large amounts of data. Although many MS software packages are carefully designed to achieve maximum computational performance in simulation, they fall seriously short in storing and handling the large-scale data output. The objective of this project is to use database technologies to improve the efficiency, ease of maintenance, and security of MS data analysis. We accomplish this by developing novel data structures and query processing algorithms in the kernel of the database management system (DBMS), in addition to leveraging the advantages of such systems in their current forms. We focus on creative indexing and data organization techniques and on query processing and optimization strategies. We believe that such innovations will bring significant intellectual merit from which both the biomedical and database management communities will benefit.
Molecular Simulations (MS)
• Large-scale biological structures are represented using all of their individual atoms, thus providing a nanoscopic description of biological processes.
• Data is stored in single or multiple trajectory files containing time frames.
• Each frame is a sequential list of atoms with their positions, velocities, and possibly forces, masses, and types.
• The dataset is very large: millions of atoms, tens of thousands of frames.
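As a rough illustration of how the per-frame atom records described above could be held in a relational table and queried declaratively, rather than by scanning a trajectory file, consider the minimal sketch below. It is not the schema used in this project: the table name `atom_frame`, its columns, the helper functions, and the sample values are all hypothetical, and SQLite merely stands in for a full DBMS.

```python
import sqlite3

# In-memory database; SQLite is only a stand-in for a full DBMS (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per atom per time frame.
conn.execute(
    """
    CREATE TABLE atom_frame (
        frame_id  INTEGER,
        atom_id   INTEGER,
        atom_type TEXT,
        mass      REAL,
        x REAL, y REAL, z REAL,
        vx REAL, vy REAL, vz REAL,
        PRIMARY KEY (frame_id, atom_id)
    )
    """
)

# Made-up sample data: three atoms in two frames.
rows = [
    (0, 1, "O", 15.999, 0.10, 0.20, 0.30, 0.0, 0.0, 0.0),
    (0, 2, "H", 1.008, 0.15, 0.25, 1.50, 0.0, 0.0, 0.0),
    (0, 3, "H", 1.008, 0.05, 0.15, 0.90, 0.0, 0.0, 0.0),
    (1, 1, "O", 15.999, 0.11, 0.21, 0.35, 0.0, 0.0, 0.0),
    (1, 2, "H", 1.008, 0.16, 0.26, 1.45, 0.0, 0.0, 0.0),
    (1, 3, "H", 1.008, 0.06, 0.16, 2.10, 0.0, 0.0, 0.0),
]
conn.executemany("INSERT INTO atom_frame VALUES (?,?,?,?,?,?,?,?,?,?)", rows)


def atoms_in_slab(conn, frame_id, z_lo, z_hi):
    """Spatial range query: atoms of one frame whose z lies in [z_lo, z_hi]."""
    cur = conn.execute(
        "SELECT atom_id, atom_type, z FROM atom_frame "
        "WHERE frame_id = ? AND z BETWEEN ? AND ? ORDER BY atom_id",
        (frame_id, z_lo, z_hi),
    )
    return cur.fetchall()


def trajectory(conn, atom_id):
    """Trajectory query: the z coordinate of one atom across all frames."""
    cur = conn.execute(
        "SELECT frame_id, z FROM atom_frame WHERE atom_id = ? ORDER BY frame_id",
        (atom_id,),
    )
    return cur.fetchall()
```

Each new question about the data becomes a new query string instead of a new file-parsing program, and the database decides how to locate the matching rows.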
Research Challenges
• Application programs are difficult to maintain: tedious coding is required for each new query
• Data security is poorly supported: only at the whole-file level
• Most importantly, data retrieval is very inefficient: sequential file search is often needed
Our Approach
• A database-centric MS data analysis (DCMS) framework that
o stores and queries raw data in a database management system (DBMS)
o allows efficient application development via a declarative query language (e.g., SQL)
o provides fine-granularity access control and view-based data access
• Further improve the efficiency of data retrieval and analysis via
o novel indexing structures
o sophisticated query processing algorithms
Figure 2. DCMS architecture.
Processing Histogram Queries
• Histogram queries are very popular in DCMS
o given a set of (or all) atoms in a time frame, compute the distribution of a physical measurement in a histogram with bucket width h
• Histogram of pairwise distances (PDH) is more challenging
• The naive algorithm needs to compute all N(N-1)/2 distances, where N is the number of atoms
• Our solution uses a quadtree-based data structure called a density map
o If the distances between all atom pairs drawn from two cells of the map fall into the same histogram bucket, the individual distances need not be computed
• Time complexity is O(N^1.5) for 2D data and O(N^1.667) for 3D data
Figure 5. Solving a histogram query (bucket width h = 3) using two density maps generated from raw data (left) with low (middle) and high (right) resolution.
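The cell-pair shortcut behind the pairwise-distance histogram (PDH) can be illustrated with a small sketch. This is not the authors' implementation and makes several simplifying assumptions: it works on 2D points, uses a single fixed-resolution grid in place of the quadtree-based density map (so it does not refine unresolved cell pairs at a higher resolution and does not reproduce the stated complexity bounds), and all function and parameter names are hypothetical. What it does show is the core idea: when the smallest and largest possible distance between two cells fall into the same bucket, the product of the two cell counts is added to that bucket without computing any distance.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def bucket(dist, h):
    """Index of the histogram bucket [k*h, (k+1)*h) that contains dist."""
    return int(dist // h)


def naive_pdh(points, h, num_buckets):
    """Baseline: compute all N(N-1)/2 pairwise distances."""
    hist = [0] * num_buckets
    for (x1, y1), (x2, y2) in combinations(points, 2):
        hist[bucket(math.hypot(x1 - x2, y1 - y2), h)] += 1
    return hist


def grid_pdh(points, h, num_buckets, cell):
    """PDH over a one-level grid of square cells with side length `cell`."""
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell), int(y // cell))].append((x, y))

    hist = [0] * num_buckets
    keys = sorted(cells)
    for i, ka in enumerate(keys):
        pa = cells[ka]
        # Pairs inside one cell are closer than the cell diagonal.
        if math.hypot(cell, cell) < h:
            hist[0] += len(pa) * (len(pa) - 1) // 2
        else:
            for (x1, y1), (x2, y2) in combinations(pa, 2):
                hist[bucket(math.hypot(x1 - x2, y1 - y2), h)] += 1
        for kb in keys[i + 1:]:
            pb = cells[kb]
            gx, gy = abs(ka[0] - kb[0]), abs(ka[1] - kb[1])
            # Bounds on the distance between any point of one cell
            # and any point of the other.
            d_min = math.hypot(max(gx - 1, 0) * cell, max(gy - 1, 0) * cell)
            d_max = math.hypot((gx + 1) * cell, (gy + 1) * cell)
            if bucket(d_min, h) == bucket(d_max, h):
                # Resolved: every pair falls in one bucket, so only add counts.
                hist[bucket(d_min, h)] += len(pa) * len(pb)
            else:
                # Unresolved: fall back to computing the distances directly.
                for x1, y1 in pa:
                    for x2, y2 in pb:
                        hist[bucket(math.hypot(x1 - x2, y1 - y2), h)] += 1
    return hist


if __name__ == "__main__":
    random.seed(0)
    side, h = 12.0, 3.0  # made-up domain size; h = 3 as in the Figure 5 caption
    pts = [(random.random() * side, random.random() * side) for _ in range(500)]
    n_buckets = bucket(math.hypot(side, side), h) + 1
    print(grid_pdh(pts, h, n_buckets, cell=1.0))
```

In this simplified form the cell size is a tuning knob: smaller cells resolve more cell pairs but increase the number of cell pairs to examine, which is the trade-off a multi-resolution density map is meant to manage.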
Figure 1. A simulated hydrated dipalmitoylphosphatidylcholine bilayer system.
Indexing MS Data
• Multiple indexes are needed, each targeting a set of queries
o TPB-tree: random point and trajectory queries
o TPS-tree: spatial range queries
o kd-tree: range queries on other, non-spatial measurements
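Of the index types listed above, the kd-tree is a standard textbook structure, so a range query over it can be sketched briefly. The example below is a generic in-memory kd-tree, not the index built into DCMS; the choice of (mass, partial charge) as the non-spatial measurements, the sample values, and all names are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


class Node:
    """One kd-tree node: a point plus the axis it splits on."""

    def __init__(self, point: Point, axis: int,
                 left: Optional["Node"], right: Optional["Node"]):
        self.point = point
        self.axis = axis
        self.left = left
        self.right = right


def build(points: List[Point], depth: int = 0) -> Optional[Node]:
    """Build a kd-tree by splitting on the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(ordered[mid], axis,
                build(ordered[:mid], depth + 1),
                build(ordered[mid + 1:], depth + 1))


def range_query(node: Optional[Node], lo: Point, hi: Point,
                out: Optional[List[Point]] = None) -> List[Point]:
    """Collect all points p with lo[i] <= p[i] <= hi[i] on every axis."""
    if out is None:
        out = []
    if node is None:
        return out
    p, a = node.point, node.axis
    if all(l <= c <= h for l, c, h in zip(lo, p, hi)):
        out.append(p)
    # The left subtree holds values <= p[a], the right subtree values >= p[a],
    # so a side is visited only if the query interval can reach into it.
    if lo[a] <= p[a]:
        range_query(node.left, lo, hi, out)
    if p[a] <= hi[a]:
        range_query(node.right, lo, hi, out)
    return out


# Hypothetical (mass, partial charge) pairs for a handful of atoms.
atoms = [(1.008, 0.42), (15.999, -0.83), (12.011, 0.10),
         (14.007, -0.47), (30.974, 1.10), (12.011, -0.20)]
tree = build(atoms)
hits = range_query(tree, lo=(12.0, -0.5), hi=(16.0, 0.2))
```

The pruning tests let the query skip whole subtrees whose values on the splitting axis lie outside the requested interval, which is what distinguishes it from a sequential scan.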
Figure 3. Structure of Time-Parameterized B+-Tree (TPB) index.
State-of-the-art in MS Data Analysis
• Store trajectories in computer files: organize data into files
• Where to find data? Use the file names to encode file “content”
• Smarter systems: SimDB [1] and BioSimGrid [2] use relational databases to manage these trajectory files
Experimental results
• Four popular query types
• Comparison with Gromacs
• Dataset size: 286,000 atoms, 100,000 frames
Figure 4. Query processing time in file-based and database-based systems.
Summary
• Existing file-based MS data processing has serious drawbacks in application development, security, and efficiency of data access
• Storing and querying MS data in DCMS (with a legacy DBMS) provides a better solution that addresses the above problems
• DCMS improves query efficiency by 1-5 orders of magnitude
• Further improvement in efficiency can be achieved by augmenting the DCMS with novel indexes and query processing algorithms
References
[1] Feig et al., Future Generation Computer Systems, 16(1):101-110 (1999)
[2] Ng et al., Future Generation Computer Systems, 22(6):657-664 (2006)
UNIVERSITY OF SOUTH FLORIDA