JSOC Pipeline Processing: Infrastructure, Environment and

JSOC Pipeline Processing Overview
Rasmus Munk Larsen, Stanford University
[email protected]
650-725-5485
Rasmus Munk Larsen / Pipeline Processing 1
Overview
• Hardware overview
• JSOC data model
• Pipeline infrastructure & subsystems
• Pipeline modules
JSOC Connectivity
[Diagram: network connectivity. At Stanford, the DDS and a front end sit behind a firewall; the pipeline, JSOC disk array and science workstations connect through routers, with a further firewall toward the outside world. A 1 Gb private line links Stanford to LMSAL, which hosts its own pipeline, display, JSOC disk array and workstations behind routers and firewalls. Command paths to the MOC at NASA Ames run over the "White" Net through additional firewalls.]
JSOC Hardware configuration
JSOC data model: Motivation
• Evolved from MDI dataset concept to
  – Enable record-level access to meta-data for queries and browsing
  – Accommodate more complex data models required by higher-level processing
• Main design features
  – Lesson learned from MDI: separate meta-data (keywords) and image data
    • No need to re-write large image files when only keywords change (lev1.8 problem)
    • No out-of-date keyword values in FITS headers - can bind to most recent values on export
  – Data access through query-like dataset names
    • All access in terms of (sets of) data records, which are the "atomic units" of a data series
    • A dataset name is a query specifying a set of data records:
      – jsoc:hmi_lev1_V[#3000-#3020] (21 records from a series with known epoch and cadence)
      – jsoc:hmi_lev0_fg[t_obs=2008-11-07_02:00:00/8h][cam='doppler'] (8 hours worth of filtergrams)
  – Storage and tape management must be transparent to the user
    • Chunking of data records into storage units for efficient tape/disk usage done internally
    • Completely separate storage unit and meta-data databases: more modular design
    • MDI data and modules will be migrated to use the new storage service
  – Store meta-data (keywords) in a relational database
    • Can use the power of a relational database to search and index data records
    • Easy and fast to create time series of any keyword value (for trending etc.)
    • Consequence: data records must be well defined (e.g. have a fixed set of keywords)
JSOC data model
JSOC data will be organized according to a data model with the following classes:
• Series: A sequence of like data records, typically data products produced by a particular analysis
  – Attributes include: Name, Owner, primary search index, Storage unit size, Storage group
• Record: Single measurement/image/observation with associated meta-data
  – Attributes include: ID, Storage Unit ID, Storage Unit Slot#
  – Contain Keywords, Links, Data segments
  – Records are the main data objects seen by module programmers
• Keyword: Named meta-data value, stored in database
  – Attributes include: Name, Type, Value, Physical unit
• Link: Named pointer from one record to another, stored in database
  – Attributes include: Name, Target series, target record id or primary index value
  – Used to capture data dependencies and processing history
• Data Segment: Named data container representing the primary data on disk belonging to a record
  – Attributes include: Name, filename, datatype, naxis, axis[0…naxis-1], storage format
  – Can be either structure-less (any file) or an n-dimensional array stored in a tiled, compressed file format
• Storage Unit: A chunk of data records from the same series stored in a single directory tree
  – Attributes include: Online location, offline location, tape group, retention time
  – Managed by the Storage Unit Manager in a manner transparent to most module programmers
JSOC data model
[Diagram: JSOC data series (hmi_lev0_cam1_fg, aia_lev0_cont1700, hmi_lev1_fd_M, hmi_lev1_fd_V, aia_lev0_FE171, …) hold numbered data records (hmi_lev1_fd_V#12345 … #12353, …), chunked into storage units, where storage unit = directory. A single hmi_lev1_fd_V data record contains:
Keywords: RECORDNUM = 12345 (unique serial number), SERIESNUM = 5531704 (slots since epoch), T_OBS = '2009.01.05_23:22:40_TAI', DATAMIN = -2.537730543544E+03, DATAMAX = 1.935749511719E+03, …, P_ANGLE = LINK:ORBIT,KEYWORD:SOLAR_P, …
Links: ORBIT = hmi_lev0_orbit, SERIESNUM = 221268160; CALTABLE = hmi_lev0_dopcal, RECORDNUM = 7; L1 = hmi_lev0_cam1_fg, RECORDNUM = 42345232; R1 = hmi_lev0_cam1_fg, RECORDNUM = 42345233; …
Data Segments: V_DOPPLER = …]
JSOC subsystems
• SUMS: Storage Unit Management System
  – Maintains database of storage units and their location on disk and tape
  – Manages JSOC storage subsystems: disk array, robotic tape library
    • Scrubs old data from disk cache to maintain enough free workspace
    • Loads and unloads tapes to/from tape drives and robotic library
  – Allocates disk storage needed by pipeline processes through DRMS
  – Stages storage units requested by pipeline processes through DRMS
  – Design features:
    • RPC client-server protocol
    • Oracle DBMS (to be migrated to PostgreSQL)
• DRMS: Data Record Management System
  – Maintains database holding
    • Master tables with definitions of all JSOC series and their keyword, link and data segment definitions
    • One table per series containing record meta-data, e.g. keyword values
  – Provides distributed transaction processing framework for pipeline
  – Provides full meta-data searching through JSOC query language
    • Multi-column indexed searches on primary index values allow fast and simple querying for common cases
    • Inclusion of free-form SQL clauses allows advanced querying
  – Provides software libraries for querying, creating, retrieving and storing JSOC series, data records and their keywords, links, and data segments
    • Currently available in C. Wrappers (with read-only restriction?) for Fortran, Matlab and IDL are planned.
  – Design features:
    • TCP/IP socket client-server protocol
    • PostgreSQL DBMS
    • Slony DB replication system to be added for managing query load and enabling multi-site distributed archives
Pipeline software/hardware architecture
[Diagram: a pipeline program ("module") links the JSOC science libraries and utility libraries against the DRMS library, which maintains a record cache (keywords + links + data paths) and exposes calls such as OpenRecords/CloseRecords, GetKeyword/SetKeyword, GetLink/SetLink, and OpenDataSegment/CloseDataSegment for data segment I/O against the JSOC disks. Over the DRMS socket protocol the module talks to a Data Record Management Service (DRMS), which issues SQL queries to the database server (series tables, record catalogs, storage unit tables) and requests storage via AllocUnit/GetUnit/PutUnit calls to the Storage Unit Management Service (SUMS); SUMS handles storage unit transfer between the disks and the robotic tape archive.]
JSOC Pipeline Workflow
[Diagram: a pipeline operator prepares a pipeline processing plan and a processing script ("mapfile") listing the pipeline modules with the datasets needed for input and output. The PUI (Pipeline User Interface, the scheduler) runs the modules (Module1, Module2, Module3, …) inside a DRMS session; modules access records through DRMS (Data Record Management service) instances backed by SUMS (Storage Unit Management System), and a processing history log is recorded.]
Analysis modules: co-I contributions and collaboration
• Contributions from co-I teams:
  – Software for intermediate and high level analysis modules
  – Data series definitions
    • Keywords, links, data segments, size of storage units, primary index keywords etc.
  – Documentation
  – Test data and intended results for verification
  – Time
    • Explain algorithms and implementation
    • Help with verification
    • Collaborate on improvements if required (e.g. performance or maintainability)
• Contributions from HMI team:
  – Pipeline execution environment
  – Software & hardware resources (development environment, libraries, tools)
  – Time
    • Help with defining data series
    • Help with porting code to JSOC API
    • If needed, collaborate on algorithmic improvements, tuning for JSOC hardware, parallelization
    • Verification
HMI module status and MDI heritage
[Flow chart: intermediate and high-level data products derived from the primary observables, with a module-status legend (standalone "production" code routinely used; research code in use; MDI pipeline modules exist; code developed at HAO; code developed at Stanford):
– Doppler velocity → heliographic Doppler velocity maps → spherical harmonic time series → mode frequencies and splitting → internal rotation and internal sound speed
– Tracked tiles of Dopplergrams → ring diagrams → local wave frequency shifts; → time-distance cross-covariance function → wave travel times → full-disk velocity and sound-speed maps (0-30 Mm), Carrington synoptic v and cs maps (0-30 Mm), high-resolution v and cs maps (0-30 Mm), deep-focus v and cs maps (0-200 Mm); → egression and ingression maps → wave phase shift maps → far-side activity index
– Stokes I,V → line-of-sight magnetograms → line-of-sight magnetic field maps
– Stokes I,Q,U,V → full-disk 10-min averaged maps → vector magnetograms (fast algorithm); tracked tiles → vector magnetograms (inversion algorithm) → vector magnetic field maps → coronal magnetic field extrapolations → coronal and solar wind models
– Continuum brightness → brightness images; tracked full-disk 1-hour averaged continuum maps → solar limb parameters and brightness feature maps]
Example: Global Seismology Pipeline
Questions to be discussed at working sessions
• List of standard science data products
  – Which data products, including intermediate ones, should be produced by JSOC to accomplish the science goals of the mission?
  – What cadence, resolution, coverage etc. should each data product have?
  – Which data products should be computed on the fly and which should be archived?
  – What are the challenges to be overcome for each analysis technique?
• Detailing each branch of the processing pipeline
  – What are the detailed steps in each branch?
  – Can some of the computational steps be encapsulated in general tools that can be shared among different branches (example: tracking)?
  – What are the CPU and I/O resource requirements of computational steps?
• Contributed analysis modules
  – What groups or individuals will contribute code, and incorporate it in the pipeline?
  – If multiple candidate techniques and/or implementations exist, which should be included in the pipeline?
  – What is the test plan and what data is needed to verify the approach?
JSOC Series Definition
Global Database Tables
Database tables for example series hmi_fd_v
• Tables specific for each series contain per record values of
  – Keywords
  – Record numbers of records pointed to by links
  – DSIndex = an index identifying the SUMS storage unit containing the data segments of a record
  – Series sequence counter used for generating unique record numbers
Pipeline batch processing
• A pipeline batch is encapsulated in a single database transaction:
  – If no module fails, all data records are committed and become visible to other clients of the JSOC catalog at the end of the session
  – If a failure occurs, all data records are deleted and the database is rolled back
  – It is possible to commit data produced up to intermediate checkpoints during sessions
[Diagram: pipeline batch = atomic transaction. A DRMS service acts as session master over the record & series database and SUMS. The session is registered, modules 1 … N (some possibly in parallel, e.g. modules 2.1 and 2.2) read input data records and write output data records through the DRMS API, and the session ends with "commit data & deregister".]
Example of module code:
• A module doing a (naïve) Doppler velocity calculation could look as shown below
• Usage:
  doppler DRMSSESSION=helios:33546 "2009.09.01_16:00:00_TAI" "2009.09.01_17:00:00_TAI"

extern CmdParams_t cmdparams;   /* command line args */
extern DRMS_Env_t *drms_env;    /* DRMS environment */

int module_main(void)
{
  DRMS_RecordSet_t *filtergrams, *dopplergram;
  int first_frame, status;
  char query[1024], *start, *end;

  start = cmdparms_getarg(&cmdparams, 1);
  end = cmdparms_getarg(&cmdparams, 2);
  sprintf(query, "hmi_lev0_fg[T_Obs=%s-%s]", start, end);
  filtergrams = drms_open_records(drms_env, query, "RD", &status);
  if (filtergrams->num_recs == 0)
  {
    printf("Sorry, no filtergrams found for that time interval.\n");
    return -1;
  }

  first_frame = 0;                   /* Start looping over record set. */
  for (;;)
  {
    first_frame = find_next_framelist(first_frame, filtergrams);
    if (first_frame == -1)           /* No more complete framelists. Exit. */
      break;
    dopplergram = drms_create_records(drms_env, "hmi_fd_v", 1, &status);
    if (status)
      return -1;
    compute_dopplergram(first_frame, filtergrams, dopplergram);
    drms_close_records(drms_env, dopplergram);
  }
  return 0;
}
Example continued
int compute_dopplergram(int first_frame, DRMS_RecordSet_t *filtergrams,
                        DRMS_RecordSet_t *dopplergram)
{
  int i, n_rows, n_cols, tuning;
  DRMS_Segment_t *fg[10], *dop;
  short *fg_data[10];
  char *pol;
  double *dop_data;

  /* Get pointers for Doppler data array. */
  dop = drms_open_datasegment(dopplergram->records[0], "v_doppler", "RDWR");
  n_cols = drms_getaxis(dop, 0);
  n_rows = drms_getaxis(dop, 1);
  dop_data = (double *)drms_getdata(dop, 0, 0);

  /* Get pointers for filtergram data arrays. */
  for (i = first_frame; i < first_frame + 10; i++)
  {
    fg[i - first_frame] = drms_open_datasegment(filtergrams->records[i], "intensity", "RD");
    fg_data[i - first_frame] = (short *)drms_getdata(fg[i - first_frame], 0, 0);
    pol = drms_getkey_string(filtergrams->records[i], "Polarization");
    tuning = drms_getkey_int(filtergrams->records[i], "Tuning");
    printf("Using filtergram (%s, %d)\n", pol, tuning);
  }

  /* Do the actual Doppler computation. */
  calc_v(fg_data, dop_data);
  return 0;
}