High Performance Computing


Data Infrastructures at DKRZ
CAS2K11 in Annecy, France
September 11 – 14, 2011
Michael Lautenschlager
Content
• Overview DKRZ
• Climate research as data intensive science
• Data life cycle and services at DKRZ
• Data infrastructure development
© DKRZ 12.09.2011
M. Lautenschlager (DKRZ)
CAS2K11
Mission
DKRZ - to provide high performance computing platforms,
sophisticated and high capacity data management, and
superior service for premium climate science.
Our Competences
• High performance compute, storage, and visualization
systems optimized for climate research
• Parallelization and optimization of climate models and
workflows
• Efficient management of very large data volumes
• 3D visualization to communicate research results
• Support of current projects on climate research
Building
Computer Hall
Compute Nodes
Disk Subsystem
Air Conditioning
Compute Service
• IBM Power6-System
• 264 nodes with 8448 cores
• Clock rate 4.7 GHz
• Compute power per core 18.8 GFLOPS
• Maximum compute power 159 TFLOPS
• Linpack 110 TFLOPS and rank 72 in TOP500 of 2011
• Main memory more than 20 TB
• Hard disk storage 7 PB
• Interconnect 8x DDR Infiniband
• Cooling 75% water, 25% air
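The peak figure on this slide can be cross-checked from the core count and the per-core rate. The short Python sketch below does that arithmetic and also derives the ratio of the Linpack result to the peak value. The variable names are illustrative; the only inputs are the numbers quoted in the list above.

```python
# Cross-check of the Compute Service figures quoted on the slide.
nodes = 264
cores = 8448
gflops_per_core = 18.8       # compute power per core
linpack_tflops = 110.0       # measured Linpack result

cores_per_node = cores // nodes
peak_tflops = cores * gflops_per_core / 1000.0    # GFLOPS -> TFLOPS
linpack_efficiency = linpack_tflops / peak_tflops

print(f"cores per node: {cores_per_node}")
print(f"peak compute power: {peak_tflops:.1f} TFLOPS")
print(f"Linpack / peak: {linpack_efficiency:.0%}")
```

The computed peak of about 158.8 TFLOPS agrees with the 159 TFLOPS stated on the slide, and the Linpack result corresponds to roughly 69% of that peak.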
Tape Library

• HPSS – High Performance Storage System
• 7x Sun StorageTek SL8500
• In total 67,000 media slots
• More than 100 PB storage capacity
• 90 tape drives
◦ LTO-5, LTO-4, T10000A/B
◦ 9940B, 9840C
World Data Center for Climate
(approved by ICSU in 2003)
• Long-term data archive
• Approx. 500 TB of climate data
• Fully documented
• Search engine
• Field-based data access
• Server-side data processing (sub-setting, format conversion)
• Data download free of charge
Data Volume Increase: small to high PB
[Figure: IPCC GCM data volume in TB on a logarithmic axis (0.001 to 100000 TB) for AR 1 (1990), AR 2 (1995), AR 3 (2001), AR 4 (2007), AR 5 (2013), and AR 6 (2019), with DKRZ labeled in the plot. Source: Overpeck et al., Science 2011.]
CMIP5 Data Federation
Data estimates 2010:
• 10 PB in total
• 2.5 PB WCRP requested
• 1 PB IPCC-AR5 core
ESG infrastructure for CMIP5 provided by
• NCAR (ESG Portal)
• PCMDI (ESG Data Node)
CMIP5 Archive Status, Summary (Friday, 09. September 2011, 11:34AM (UTC)):
• Modeling centers: 13
• Models: 17
• Data nodes: 13
• Gateways: 5
• Datasets: 11051
• Size: 170.78 TB
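To put the archive status in proportion to the 2010 estimates, the following Python sketch computes the mean dataset size and the share of the estimated 10 PB total that had been published at the time of the snapshot. It assumes decimal units (1 PB = 1000 TB, 1 TB = 1000 GB), which the slide does not state, and the variable names are illustrative.

```python
# Figures from the CMIP5 archive status of 9 September 2011.
datasets = 11051
archive_size_tb = 170.78
estimated_total_pb = 10.0     # 2010 estimate for the complete CMIP5 archive

# Assumption: decimal prefixes, not stated on the slide.
TB_PER_PB = 1000.0
GB_PER_TB = 1000.0

mean_dataset_gb = archive_size_tb * GB_PER_TB / datasets
fraction_of_estimate = archive_size_tb / (estimated_total_pb * TB_PER_PB)

print(f"mean dataset size: {mean_dataset_gb:.2f} GB")
print(f"share of estimated total: {fraction_of_estimate:.1%}")
```

Under these assumptions the published volume is a little under 2% of the estimated total, with a mean dataset size of about 15 GB.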
Climate Research as Data Intensive Science
• Hey, Tansley and Tolle (2009) "The Fourth Paradigm":
– Data-intensive science consists of three basic activities: capture,
curation, and analysis. Data comes in all scales and shapes, covering
large international experiments; cross-laboratory, single-laboratory, and
individual observations; and potentially individuals’ lives. The discipline
and scale of individual experiments and especially their data rates make
the issue of tools a formidable problem. (Page XIII)
• Climate Modeling:
– In international experiments like CMIP5, data are produced without
knowing all applications beforehand, and these data are intended for
interdisciplinary use (impact). This broad range of applications
increases the volume of archived data and adds requirements beyond
those of community-specific data applications.
Data Management Requirements
for Data Intensive Science
• Complete data description for browsing, discovering, and using research data
• Efficient data access via common interfaces in standard formats
• Efficient data processing workflows, even in data federations (data mining might provide new methods for information discovery)
• Common security management across data federations in order to offer unique access to individual archives
• Data replication for security and access performance
• Agreed quality assurance workflow, with data processing and quality level documented in metadata in order to assign accepted quality levels
• Transparent data federation management
• ……….
My Essentials
From today onward, climate data archives need:
– Sufficient information to find and select data properly
– Sufficient standardization for automatic data processing
– Transparent data quality flags to convince people to trust the archive federation
– New methods to identify new information in federated data archives (data mining)
– Complete data life cycle support for seamless management of very large data volumes
Data Life Cycle Management
DKRZ distinguishes two layers:
a) Virtual research environments integrate community-based scientific research
b) Long-term archiving supports interdisciplinary data utilization
Services at DKRZ: Creation
Services at DKRZ: Evaluation
Code Optimization
CMIP5
Services at DKRZ: Archiving
CERA
CIM (EU-METAFOR)
WDCC (CERA):
Services at DKRZ: Dissemination
C3-Grid
IS-ENES
CMIP5 / ESGF
International Cooperation in Data Infrastructure Development
Target infrastructure:
IS-ENES: Infrastructure for the European Network for Earth System Modeling (https://is.enes.org/)
ExArch: Climate analytics on distributed exascale data archives (G8 project)
EUDAT: EUropean DATa (EU-FP7 project starting on October 1st)
ESGF: Earth System Grid Federation (http://esgf.org/)
GO-ESSP: Global Organization for Earth System Science Portals (http://go-essp.gfdl.noaa.gov/)
Digital Object Architecture of Climate Model Data
Data Objects: NetCDF/CF including use metadata
Metadata Objects: CIM metadata for browse + discovery
Information Objects: Related more general information
Transaction Record: Dissemination info. of digital objects
DataCite scientific data publication entity:
• DOI has been assigned
• Digital objects are frozen and approved by author
• Citation reference is assigned for direct use in scientific literature
• Realized with QC-L3 in the CMIP5 data quality assessment
Future Development: Identification of distinct data objects in data federations with PID and handle system (Cooperation with European Persistent Identifier Consortium (EPIC), http://pidconsortium.eu/).
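The publication rules on this slide (DOI assignment at QC level 3, author approval, frozen objects) can be illustrated with a small data structure. The Python sketch below is hypothetical: the class name, field names, the integer encoding of the QC level, and the placeholder DOI are all assumptions made for illustration and do not describe the actual DKRZ, CERA, or DataCite implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DigitalObject:
    """Hypothetical container for one publishable climate data entity."""
    data_files: list              # NetCDF/CF files (names or identifiers)
    cim_metadata: dict            # CIM metadata for browse and discovery
    qc_level: int = 1             # assumed integer encoding of the QC level
    doi: Optional[str] = None     # assigned once, at publication
    frozen: bool = False

    def publish(self, doi: str, author_approved: bool) -> None:
        """Assign a DOI and freeze the object if the slide's conditions hold."""
        if self.frozen:
            raise ValueError("object is already published and frozen")
        if self.qc_level < 3:
            raise ValueError("DOI assignment requires QC level 3")
        if not author_approved:
            raise ValueError("author approval is required")
        self.doi = doi
        self.frozen = True

    def add_file(self, name: str) -> None:
        """Add a data file; refused once the object is frozen."""
        if self.frozen:
            raise ValueError("published objects are frozen")
        self.data_files.append(name)


# Usage with made-up values; the DOI below is a placeholder, not a real one.
obj = DigitalObject(
    data_files=["tas_example.nc"],
    cim_metadata={"title": "example"},
    qc_level=3,
)
obj.publish(doi="10.0000/example", author_approved=True)
```

Freezing the object at publication is what makes the citation reference stable: once a DOI is assigned, the sketch rejects any further change to the data files.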
Planned DKRZ extension in 2014
• Peak compute performance 150 TFLOPS -> 3 PFLOPS (x20)
• Disk capacity 7 PB -> 150 PB (x20)
• Tape capacity 100 PB -> 1 EB (x10)
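The growth factors in this list follow directly from the quoted capacities. The Python sketch below recomputes them, assuming decimal prefixes (1 PFLOPS = 1000 TFLOPS, 1 EB = 1000 PB); the dictionary keys are illustrative labels only.

```python
# Planned 2014 extension as (current, planned) pairs in a common unit.
# Assumption: decimal prefixes, so 3 PFLOPS = 3000 TFLOPS and 1 EB = 1000 PB.
plans = {
    "peak compute (TFLOPS)": (150.0, 3000.0),
    "disk capacity (PB)": (7.0, 150.0),
    "tape capacity (PB)": (100.0, 1000.0),
}

factors = {
    name: planned / current
    for name, (current, planned) in plans.items()
}

for name, factor in factors.items():
    print(f"{name}: x{factor:.1f}")
```

Compute and tape come out at exactly x20 and x10, while the disk figure is closer to x21, so the x20 on the slide is a rounded value.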
Are we ready for the data tsunami?
Are the products ready for the data tsunami?
We will be happy to discuss these issues with you - before
the data sweeps us away
Thank you for your Attention!
http://www.dkrz.de