Transcript PPT - S3Lab

An Efficient and Transparent Transaction Management based on the Data Workflow of HVEM DataGrid
Im Young Jung
Seoul National University
Introduction
 Transaction management for safe data update and insertion on e-Science DataGrid
 Heterogeneous storages according to the characteristics and the size of data
 Based on workflow, the storing precedence of data across heterogeneous storages in a transaction
 In this paper
 An efficient and transparent transaction management on HVEM DataGrid
 Dividing the transaction into sub-transactions according to the transaction states and classifying them
 Transaction hierarchy and parallelism provide
 efficient and safe large data upload to HVEM DataGrid
 transparency in the transaction including simultaneous access to heterogeneous
storages
 Automatic garbage collection
HVEM Grid
 High Voltage Electron Microscope(HVEM)
 Lets scientists perform 3D structure analysis of new materials at micrometer scale
 HVEM Grid
 Remote users can perform the same tasks as on-site scientists.
 Remote controlling of HVEM
 Storing, retrieving and searching data through HVEM DataGrid
 Processing data through HVEM Computational Grid
HVEM DataGrid
 Designed for biological experiments using HVEM
 A single logical view of storage over the DB and the file storage
 The small metadata is stored in the DB
 Information on materials, material handling methods, HVEM experiments, images, experimenters
 The large files are stored in file storages
 2D or 3D image files, the documents related to HVEM experiments
 Internal process to find files
 After finding their logical path in the file storage by searching the DB, users can
retrieve the files they want in the file storage
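The two-step lookup above (search the DB for a logical path, then fetch the file from the file storage) can be sketched as follows. This is only an illustration under stated assumptions, not the HVEM DataGrid implementation: the table name `image_metadata`, its columns, the sample path, and the dict standing in for the file storage are all hypothetical.

```python
import sqlite3

# Hypothetical stand-ins: an in-memory SQLite DB for the metadata store
# and a dict for the file storage (logical path -> file bytes).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE image_metadata (experiment TEXT, logical_path TEXT)")
db.execute(
    "INSERT INTO image_metadata VALUES (?, ?)",
    ("exp-001", "/hvem/exp-001/slice_01.tif"),
)
file_storage = {"/hvem/exp-001/slice_01.tif": b"raw image bytes"}


def retrieve_files(experiment):
    """Search the DB for logical paths, then fetch each file from storage."""
    rows = db.execute(
        "SELECT logical_path FROM image_metadata WHERE experiment = ?",
        (experiment,),
    ).fetchall()
    return {path: file_storage[path] for (path,) in rows}
```

The user never addresses the file storage directly; the DB query is the only entry point, which is what gives the single logical view.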
HVEM DataGrid
 A unified data management
 The storing precedence among data
 When storing all biological information for the images, we should keep the images in HVEM Grid at the same time
 The relational semantics between various data stored in distributed heterogeneous storages
 To upload many large files to HVEM DataGrid efficiently and safely
 Upload dependency & Serialization
 Ensure the transactions for safe parallel uploads
An efficient and transparent transaction
management
 Requirement for the transactions on HVEM DataGrid
 Consider the semantics of HVEM DataGrid
 A project is composed of several experiments
 The data for an experiment should be inserted according to its data workflow
 The file and its metadata should be stored to HVEM DataGrid simultaneously.
Otherwise, all of them should be deleted
 Support
 the long-lifetime transaction according to the time limit of an experiment or project
 the short-lifetime transaction which physically stores the data in HVEM DataGrid
 The optimization for the upload of large files to reduce the blocking time
should ensure safe transactions
 An asynchronous and parallel upload scheme should protect upload dependency and
ensure safe transactions
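The requirement that a file and its metadata are stored together or not at all, while still being uploaded in parallel, can be illustrated with a small sketch. Everything here is an assumption for illustration: the two dicts stand in for the DB and the file storage, and `store_both`, `store_metadata`, and `store_file` are hypothetical names, not part of HVEM DataGrid.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the two heterogeneous storages.
metadata_db = {}
file_storage = {}


def store_metadata(key, metadata):
    metadata_db[key] = metadata


def store_file(key, data):
    if data is None:  # simulate a failed upload
        raise IOError("upload failed")
    file_storage[key] = data


def store_both(key, metadata, data):
    """Store file and metadata in parallel; delete both if either fails."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [
            pool.submit(store_metadata, key, metadata),
            pool.submit(store_file, key, data),
        ]
        # exception() waits for each task and returns None on success.
        errors = [f.exception() for f in futures]
    if any(e is not None for e in errors):
        # Rollback: remove whatever part was already stored.
        metadata_db.pop(key, None)
        file_storage.pop(key, None)
        return False
    return True
```

Running the two stores concurrently shortens the blocking time, and the rollback branch keeps the pair consistent when one side fails.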
An efficient and transparent transaction
management
 Transaction hierarchy
 The transaction units act as checkpoints on incomplete data insertion
 Confine the rollback extent
[Figure: transaction hierarchy with levels for a project, for an experiment, for a group of TnSs (parallel processing), and for storing data to physical storage]
 When the data for an experiment or a project is not inserted into HVEM DataGrid by its time limit, the experiment or the project should be removed by the rollback of TnE or TnP
 TnS((((1)2)5)2)
 (1) represents the identity of the TnP it belongs to
 The next index ‘2’ indicates the identity of the TnE, and so on
 Support autonomous garbage collection
 Whether data is inserted into or deleted from HVEM DataGrid depends on the users.
 When they stop inserting experimental data for any reason without deleting the related data, HVEM DataGrid would accumulate a large amount of garbage.
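The nested index and the time-limit rollback described above lend themselves to a short sketch. It assumes that the nested label can be read as a flat list of identities (TnP first, then TnE, and so on), and that the time limits of still-incomplete transactions are kept in a mapping keyed by index prefix. The function names and both data layouts are hypothetical.

```python
import re


def parse_index(label):
    """Split a nested index such as 'TnS((((1)2)5)2)' into [1, 2, 5, 2]."""
    return [int(n) for n in re.findall(r"\d+", label)]


def collect_garbage(store, open_deadlines, now):
    """Remove stored items whose enclosing incomplete transaction expired.

    store: maps TnS labels to stored data (assumed layout).
    open_deadlines: maps an index prefix, e.g. (1,) for a TnP or (1, 2)
    for a TnE, to the time limit of a transaction that is still incomplete.
    """
    expired = [p for p, limit in open_deadlines.items() if now > limit]
    removed = []
    for label in list(store):
        index = tuple(parse_index(label))
        if any(index[:len(p)] == p for p in expired):
            del store[label]  # rollback of the enclosing TnE or TnP
            removed.append(label)
    return removed
```

Because every TnS carries the identities of its ancestors, an expired TnE or TnP can find and remove its leftover data without any action from the user.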
Transaction management Scheme
 HVEM DataGrid forks two processes, one connecting to the DB and one to the file storage.
 When the connections succeed, it gets the next requests and so on.
 In a light failure (LF) due to temporary failures on the network or server, retry the transaction a fixed number of times
 When the retries fail, a serious failure (SF) is assumed → rollback process
 The state change of TnS(((())j)i)
 jSiS
 jSiD (the notification from DB), jSiF (the notification from the file storage)
 jSiE (both of them arrive): TnS completes
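The retry and rollback behaviour of a TnS can be sketched roughly as below. This is a simplified illustration rather than the scheme itself: it runs the DB and file-storage operations one after the other instead of in two forked processes, it treats `ConnectionError` as the light failure, and the names `run_with_retry` and `run_tns` and the default retry count are assumptions.

```python
def run_with_retry(operation, max_retries):
    """Retry an operation on light failures (LF); return True on success."""
    for _ in range(1 + max_retries):
        try:
            operation()
            return True
        except ConnectionError:  # treated here as a light failure
            continue
    return False  # retries exhausted: a serious failure (SF) is assumed


def run_tns(db_op, file_op, rollback, max_retries=3):
    """Return 'complete' if both notifications arrive, else 'rolled back'."""
    notifications = set()
    if run_with_retry(db_op, max_retries):
        notifications.add("jSiD")  # notification from DB
    if run_with_retry(file_op, max_retries):
        notifications.add("jSiF")  # notification from the file storage
    if notifications == {"jSiD", "jSiF"}:
        return "complete"  # corresponds to jSiE: both notifications arrived
    rollback()
    return "rolled back"
```

A temporary network problem is absorbed by the retries, and only a persistent failure on either side triggers the rollback.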
Evaluation
 Analysis
 Transparency
 Through the transaction hierarchy and fine-grained state management, the transaction manager in HVEM DataGrid enables a transparent transaction that uploads the image files to the file storage and stores their metadata in the DB simultaneously.
 Serializability
 Many TnSs are upload-serializable because their state changes are logged through the transaction index.
 To keep the upload dependency,
 the transaction manager protects the first user entering TnW.
o If that user withdraws from the TnW, another user can initiate the TnW
 Transaction performance
 The transaction scheme supports asynchronism and parallelism
 Experiment Setting
 Because the sub-transaction time on DB is negligible compared with that on file storage due to data size, we only considered the upload time for image files
 Considering the semantics of the data workflow in HVEM DataGrid
 For an asynchronous file transfer, the request intervals for file transfer are chosen randomly
within 50 sec
 The physical locations of the file storages are assumed to be distributed
Evaluation
 Overhead
 Log management cost
 The cost for TnP, TnE and TnW; general transaction management already requires the log for TnS
 The log size for TnP, TnE and TnW is smaller than that for TnS because they function as checkpoints rather than real transaction units.
 Rollback cost
 The cascade rollback of TnS in TnW due to the upload dependency on parallel processing of TnS
 At LF, if the retry succeeds, the gain from transaction parallelism can be very large, especially for large file handling
 There are not many SFs or LFs because e-Science DataGrid is not as popular as multimedia storage
Conclusion
 A transaction management on HVEM Grid
 Safety
 Ensure a safe transaction considering the data workflow in HVEM DataGrid
 Efficiency
 Improve the performance to upload large files by asynchronism and parallelism
 Transparency
 Data management across the heterogeneous storages
 Automatic garbage collection
 Reduce garbage