Parallel Methods for Nano/Materials Science
Applications
(Electronic Structure Calculations)
Andrew Canning
Computational Research Division LBNL
& UC Davis, Applied Science Dept.
Outline
• Introduction to Nano/Materials Science
• Electronic Structure Calculations (DFT)
• Parallelism for the Plane Wave Approach
• Code Performance on High Performance Parallel Computers
• New Methods and Applications
Milestones in Parallel Calculations
1991  Silicon surface reconstruction (7x7), Phys. Rev. (Stich, Payne, King-Smith, Lin, Clarke)
      Meiko i860, 64-processor Computing Surface
      (Brommer, Needels, Larson, Joannopoulos)
      Thinking Machines CM-2, 16,384 bit processors
1998  FeMn alloys (exchange bias), Gordon Bell prize, SC98 (Ujfalussy, Stocks, Canning, Y. Wang, Shelton et al.)
      Cray T3E, 1500 procs., first > 1 Tflop simulation
2006  1000-atom Molybdenum simulation with Qbox, SC06 Gordon Bell prize for peak performance (F. Gygi et al.)
      BlueGene/L, 207 Tflops on the IBM BG/L (LLNL)
2008  New algorithm to enable 400+ TFlop/s sustained performance in simulations of disorder effects in high-Tc superconductors, SC08 Gordon Bell prize (Thomas C. Schulthess et al.)
      Cray XT5 (ORNL), first > 1 Pflop simulation
      Linear-scaling divide-and-conquer electronic structure calculations (L.-W. Wang et al.), SC08 Gordon Bell prize for algorithmic innovation
Electronic Structure Calculations
• Accurate quantum mechanical treatment of the electrons
• Each electron represented on a grid or with basis functions (e.g. Fourier components)
• Compute intensive: each electron requires ~1 million points/basis functions (and 100s of electrons are needed)
• 70-80% of NERSC Materials Science computer time goes to first-principles electronic structure
[Figure: BaYCl:Ce excited state (new scintillator for gamma radiation detection)]
Motivation for Electronic Structure Calculations
• Most materials properties are only understood at a fundamental level from accurate electronic structure calculations (strength, cohesion, etc.)
• Many properties are purely electronic, e.g. optical properties (lasers)
• Complements experiments
• Computer design of materials at the nanoscale
Materials Science Methods
(in order of increasing number of atoms)
• Many-body quantum mechanical approach (Quantum Monte Carlo): 20-30 atoms
• Single-particle QM (Density Functional Theory): no free parameters, 100-1000 atoms
• Empirical QM models, e.g. Tight Binding: 1000-5000 atoms
• Empirical classical potential methods: thousands to millions of atoms
• Continuum methods
Single-particle DFT methods are the largest user of supercomputer cycles of any scientific method in any discipline.
Quantum Mechanics for Molecules and Crystals
Many-body Schrödinger equation for electrons (positions r_i) in a molecule or crystal with nuclei of charge Z_I and positions R_I (natural units, e = ħ = 1 etc.):

$$\Big\{ -\sum_i \tfrac{1}{2}\nabla_i^2 \;+\; \sum_{i<j}\frac{1}{|\mathbf{r}_i-\mathbf{r}_j|} \;-\; \sum_{i,I}\frac{Z_I}{|\mathbf{r}_i-\mathbf{R}_I|} \Big\}\,\Psi(\mathbf{r}_1,\dots,\mathbf{r}_N) \;=\; E\,\Psi(\mathbf{r}_1,\dots,\mathbf{r}_N)$$

(kinetic energy + electron-electron interaction + electron-nuclei interaction)
• Solution of this eigenfunction equation HΨ = EΨ (H is the Hamiltonian) for the wavefunction Ψ gives complete information about the system.
• Observables take the form of operators, e.g. momentum p (classical form mv): p_op = -iħ∇, and the observed value of p is given by the solution of p_op Ψ = pΨ.
• The lowest eigenpair (E_0, Ψ_0) is the ground state; higher eigenpairs correspond to excited states of the system.
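As a one-line illustration of the operator picture (a standard result, not from the slide): a plane wave is an eigenfunction of the momentum operator with eigenvalue ħk, which is also why the plane-wave basis used later diagonalizes the kinetic energy:

$$p_{\mathrm{op}}\, e^{i\mathbf{k}\cdot\mathbf{r}} \;=\; -i\hbar\nabla\, e^{i\mathbf{k}\cdot\mathbf{r}} \;=\; \hbar\mathbf{k}\, e^{i\mathbf{k}\cdot\mathbf{r}}$$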
Ab initio Method: Density Functional Theory
(Kohn, 1998 Nobel Prize)

Many-body Schrödinger equation (exact, but exponential scaling):

$$\Big\{ -\sum_i \tfrac{1}{2}\nabla_i^2 \;+\; \sum_{i<j}\frac{1}{|\mathbf{r}_i-\mathbf{r}_j|} \;-\; \sum_{i,I}\frac{Z_I}{|\mathbf{r}_i-\mathbf{R}_I|} \Big\}\,\Psi(\mathbf{r}_1,\dots,\mathbf{r}_N) \;=\; E\,\Psi(\mathbf{r}_1,\dots,\mathbf{r}_N)$$

Kohn-Sham equations (1965): the many-body ground state problem can be mapped onto a single-particle non-linear problem with the same electron density and a different effective potential (cubic scaling):

$$\Big\{ -\tfrac{1}{2}\nabla^2 \;+\; \int\frac{\rho(\mathbf{r}')}{|\mathbf{r}-\mathbf{r}'|}\,d\mathbf{r}' \;-\; \sum_{I}\frac{Z_I}{|\mathbf{r}-\mathbf{R}_I|} \;+\; V_{XC} \Big\}\,\psi_i(\mathbf{r}) \;=\; E_i\,\psi_i(\mathbf{r})$$

$$\rho(\mathbf{r}) \;=\; \sum_i |\psi_i(\mathbf{r})|^2 \qquad (\rho = \text{charge density, equal to that of the many-body ground state } \Psi)$$

Use the Local Density Approximation (LDA) for V_XC[ρ(r)] (good for Si, C).
Iterate to self-consistency until V_out = V_in.
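For reference, a standard statement of the LDA (textbook form, not spelled out on the slide): the exchange-correlation energy is taken from the homogeneous electron gas evaluated at the local density, and V_XC is its functional derivative:

$$E_{XC}^{\mathrm{LDA}}[\rho] = \int \rho(\mathbf{r})\,\epsilon_{xc}\!\big(\rho(\mathbf{r})\big)\,d\mathbf{r},
\qquad
V_{XC}^{\mathrm{LDA}}(\mathbf{r}) = \frac{\delta E_{XC}^{\mathrm{LDA}}}{\delta \rho(\mathbf{r})} = \epsilon_{xc}(\rho) + \rho\,\frac{d\epsilon_{xc}}{d\rho}\bigg|_{\rho=\rho(\mathbf{r})}$$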
Selfconsistent Solution
Fix V_in(r, ρ); this gives a linear eigenproblem:

$$\Big\{ -\tfrac{1}{2}\nabla^2 + V_{in}(\mathbf{r},\rho) \Big\}\,\psi_i(\mathbf{r}) \;=\; E_i\,\psi_i(\mathbf{r})$$

Solve for the lowest N eigenfunctions {ψ_i}, i = 1,…,N (N electrons, N wave functions), then construct the new density and potential:

$$\rho(\mathbf{r}) = \sum_{i=1}^{N} |\psi_i(\mathbf{r})|^2 \;\longrightarrow\; V_{out}(\mathbf{r},\rho)$$

Iterate until V_out = V_in.
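A minimal sketch of this self-consistency cycle (illustrative only: a toy 1-D grid Hamiltonian with a made-up density-dependent potential and simple linear mixing, not the plane-wave code itself):

```python
import numpy as np

def scf_loop(n_grid=64, n_electrons=4, mix=0.3, tol=1e-8, max_iter=100):
    """Toy self-consistent field loop on a 1-D grid (illustrative only)."""
    dx = 1.0
    # Kinetic energy: finite-difference -1/2 d^2/dx^2
    T = (np.diag(np.full(n_grid, 1.0 / dx**2))
         - 0.5 * np.diag(np.full(n_grid - 1, 1.0 / dx**2), 1)
         - 0.5 * np.diag(np.full(n_grid - 1, 1.0 / dx**2), -1))
    v_ext = -np.cos(2 * np.pi * np.arange(n_grid) / n_grid)   # fixed "ionic" potential
    rho = np.full(n_grid, n_electrons / n_grid)                # initial guess density

    for it in range(max_iter):
        v_in = v_ext + rho          # toy density-dependent (Hartree-like) term
        H = T + np.diag(v_in)
        eigvals, eigvecs = np.linalg.eigh(H)          # solve the fixed-potential (linear) problem
        psi = eigvecs[:, :n_electrons]                # lowest N eigenfunctions
        rho_out = np.sum(np.abs(psi) ** 2, axis=1)    # rho(r) = sum_i |psi_i(r)|^2
        if np.linalg.norm(rho_out - rho) < tol:       # V_out == V_in (converged)
            return eigvals[:n_electrons], rho_out
        rho = (1 - mix) * rho + mix * rho_out         # simple linear mixing
    return eigvals[:n_electrons], rho

if __name__ == "__main__":
    energies, density = scf_loop()
    print("occupied eigenvalues:", energies)
```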
Choice of Basis for DFT (LDA)
[Diagram: basis sets ordered by increasing basis size M: Gaussian, FLAPW, Fourier, grid]
Fraction of eigenpairs needed (N out of M basis functions): ~30% for the smallest bases down to ~2% for the largest (Fourier, grid)
Eigensolvers: direct (ScaLAPACK) or iterative
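A small sketch of the direct-versus-iterative choice (illustrative only; NumPy's dense solver and SciPy's LOBPCG stand in for ScaLAPACK and the plane-wave iterative solvers, and the matrix is made up):

```python
import numpy as np
from scipy.sparse.linalg import lobpcg

rng = np.random.default_rng(0)
M, N = 2000, 40                      # basis size M, number of occupied states N

# Hermitian "Hamiltonian" with a dominant diagonal (kinetic-energy-like)
H = np.diag(np.arange(1.0, M + 1)) + 0.01 * rng.standard_normal((M, M))
H = 0.5 * (H + H.T)

# Direct: full spectrum (what a dense ScaLAPACK solve would give in parallel)
w_direct = np.linalg.eigvalsh(H)[:N]

# Iterative: only the lowest N eigenpairs, starting from a guess subspace
X0 = rng.standard_normal((M, N))
w_iter, _ = lobpcg(H, X0, largest=False, tol=1e-8, maxiter=200)

print("direct   :", np.round(w_direct[:5], 4))
print("iterative:", np.round(np.sort(w_iter)[:5], 4))
```

Iterative solvers pay off exactly in the plane-wave regime above, where only ~2% of the eigenpairs of a very large matrix are needed.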
Plane-wave Pseudopotential Method in DFT
Solve the Kohn-Sham equations self-consistently for the electron wavefunctions within the Local Density Approximation:

$$\Big\{ -\tfrac{1}{2}\nabla^2 \;+\; \int\frac{\rho(\mathbf{r}')}{|\mathbf{r}-\mathbf{r}'|}\,d\mathbf{r}' \;-\; \sum_{I}\frac{Z_I}{|\mathbf{r}-\mathbf{R}_I|} \;+\; V_{XC}(\rho(\mathbf{r})) \Big\}\,\psi_j(\mathbf{r}) \;=\; E_j\,\psi_j(\mathbf{r})$$

1. Plane-wave (Fourier) expansion of the wavefunctions,
$$\psi_{j,\mathbf{k}}(\mathbf{r}) = \sum_{\mathbf{g}} C_{j,\mathbf{g}}(\mathbf{k})\, e^{i(\mathbf{g}+\mathbf{k})\cdot\mathbf{r}},$$
up to an energy cutoff |g| < g_cut (a sphere of Fourier coefficients).
2. Replace the "frozen" atom core (core electrons and nucleus) by a pseudopotential (pseudo-wavefunctions in the core region allow the cutoff to be reduced without loss of accuracy).

Different parts of the Hamiltonian are calculated in different spaces (Fourier and real) via 3D FFTs.
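A minimal sketch of building the sphere of Fourier coefficients for one wavefunction (illustrative; it assumes a simple cubic cell of side L and a kinetic-energy cutoff E_cut, both made up for the example):

```python
import numpy as np

def g_sphere(n_grid, box_length, e_cut):
    """Return the g-vectors inside the kinetic-energy cutoff |g|^2/2 < e_cut
    (the 'sphere' of plane-wave coefficients) for a cubic cell (illustrative)."""
    # Reciprocal-lattice points compatible with the FFT grid
    freqs = 2.0 * np.pi * np.fft.fftfreq(n_grid, d=box_length / n_grid)
    gx, gy, gz = np.meshgrid(freqs, freqs, freqs, indexing="ij")
    g2 = gx**2 + gy**2 + gz**2
    mask = 0.5 * g2 < e_cut                     # keep only |g|^2/2 < E_cut
    g_vectors = np.stack([gx[mask], gy[mask], gz[mask]], axis=1)
    return g_vectors, mask

if __name__ == "__main__":
    g, mask = g_sphere(n_grid=32, box_length=10.0, e_cut=5.0)
    print(f"{len(g)} plane waves kept out of {mask.size} grid points "
          f"({100.0 * len(g) / mask.size:.1f}% of the full FFT grid)")
```

The sphere is a small fraction of the full FFT grid, which is why the specialized data layouts and FFTs described later only store and move the non-zero coefficients.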
Computational Considerations
• H is never computed explicitly (it is available through the matrix-vector product)
• The matrix-vector product takes N log N steps (not N^2)
• Orthogonalization is expensive; the mat-vec is cheap
• The matrix is dense (in Fourier or real space)
• The real-space grid is approximately double the size of the Fourier sphere
• At each self-consistent step we have a good guess for the eigenvectors (from the previous step)
• Typically use stable CG-based iterative methods or Davidson
[Diagram: the kinetic term -1/2 ∇² ψ_i is applied in Fourier space, the potential V(r) in real space, connected by FFTs]
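A sketch of this split application of H (illustrative only: one wavefunction on a cubic FFT grid, kinetic term applied in Fourier space, a made-up local potential applied in real space):

```python
import numpy as np

def apply_hamiltonian(psi_g, g2, v_real, cutoff_mask):
    """Apply H = -1/2 grad^2 + V(r) to a wavefunction stored as Fourier
    coefficients psi_g on the full FFT grid (illustrative sketch)."""
    # Kinetic term: diagonal in Fourier space, (1/2)|g|^2 C(g)
    h_psi_g = 0.5 * g2 * psi_g
    # Local potential: go to real space, multiply by V(r), transform back
    psi_r = np.fft.ifftn(psi_g)
    h_psi_g = h_psi_g + np.fft.fftn(v_real * psi_r)
    # Coefficients outside the plane-wave sphere are not part of the basis
    h_psi_g[~cutoff_mask] = 0.0
    return h_psi_g

# Example usage on a 32^3 grid with a smooth made-up potential
n = 32
freqs = 2.0 * np.pi * np.fft.fftfreq(n)
gx, gy, gz = np.meshgrid(freqs, freqs, freqs, indexing="ij")
g2 = gx**2 + gy**2 + gz**2
mask = 0.5 * g2 < 5.0
rng = np.random.default_rng(0)
psi_g = np.where(mask, rng.standard_normal((n, n, n)), 0.0).astype(complex)
v_real = np.cos(2 * np.pi * np.arange(n) / n)[:, None, None] * np.ones((n, n, n))
h_psi = apply_hamiltonian(psi_g, g2, v_real, mask)
```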
Parallelization (load balance, minimize communications)
Dataset: a sphere of Fourier coefficients for each wavefunction (electron), 100s-1000s of them:
N spheres of M points per sphere (M ~ 100N)

Most costly calculations:
1. Calculation of the overlap matrices for orthogonalization: scales as MN^2 (written in BLAS3; a sketch follows below)
2. Non-local part of the pseudopotential (ion-electron interaction): scales as MN^2 (written in BLAS3); can also be done in real space
3. 3D FFTs to construct the charge density in real space: scales as N M log M

Load-balancing requirements:
1., 2. Give the same number of sphere points to each processor
3. Give the same number of columns of data to each processor
These conditions conflict; this is an NP-complete bin-packing problem.
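As an illustration of item 1 above (a sketch, not PARATEC's actual routine): the overlap matrix S = C†C of N wavefunctions with M coefficients each is a single BLAS3 matrix-matrix product costing ~MN², and its Cholesky factor can then be used to orthonormalize the wavefunctions.

```python
import numpy as np

M, N = 20_000, 200                          # M coefficients per sphere, N wavefunctions
rng = np.random.default_rng(1)
C = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))  # coefficient matrix

# Overlap matrix S_ij = sum_g conj(C_gi) C_gj: one ZGEMM-like call, cost ~ M N^2
S = C.conj().T @ C

# Orthonormalize via the Cholesky factor of S: C_orth = C (L^H)^{-1}
L = np.linalg.cholesky(S)
C_orth = C @ np.linalg.inv(L).conj().T
print(np.allclose(C_orth.conj().T @ C_orth, np.eye(N), atol=1e-8))
```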
Parallelization (load balance, minimize communications)
Load Balancing Scheme:
• Treat the sphere as a one-dimensional list of columns, ordered by decreasing length.
• Hand out the columns in that order, always giving the next column to the processor that currently holds the fewest points of the sphere.
[Figure: 3-processor example; close to the same number of points and columns goes to each processor]
A sketch of this greedy assignment follows below.
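A minimal sketch of the greedy column-assignment scheme described above (illustrative; the column lengths and processor count are made up):

```python
import heapq

def assign_columns(column_lengths, n_procs):
    """Greedy load balance: columns are taken in decreasing order of length and
    each is given to the processor with the fewest sphere points so far."""
    heap = [(0, p) for p in range(n_procs)]       # min-heap of (points_so_far, proc)
    heapq.heapify(heap)
    assignment = {p: [] for p in range(n_procs)}

    order = sorted(range(len(column_lengths)), key=lambda c: -column_lengths[c])
    for col in order:
        points, proc = heapq.heappop(heap)        # least-loaded processor
        assignment[proc].append(col)
        heapq.heappush(heap, (points + column_lengths[col], proc))
    return assignment

if __name__ == "__main__":
    lengths = [9, 8, 8, 7, 6, 5, 5, 4, 3, 2, 2, 1]   # made-up column lengths
    result = assign_columns(lengths, n_procs=3)
    for proc, cols in result.items():
        print(proc, cols, "points:", sum(lengths[c] for c in cols))
```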
Communications:
• With this data distribution the BLAS3 operations in Fourier space require limited communication
• The main communications occur in the 3D FFTs
So we write specialized 3D FFTs for this data layout.
3d FFT
Standard parallel 3D FFT on an N x N x N (x, y, z) grid:
1. Perform 1D FFTs on the N^2 x-direction columns
2. Transpose (x,y,z) -> (y,z,x)
3. Perform 1D FFTs on the N^2 y-direction columns
4. Transpose (y,z,x) -> (z,y,x)
5. Perform 1D FFTs on the N^2 z-direction columns
6. Transpose (z,y,x) -> (x,y,z) (optional)
(a serial sketch of these steps follows below)

Scaling issues (bandwidth and latency):
computation/communication ~ (N^2 · N log N) / N^3 = log N ~ O(10)
message size ~ (number of processors)^-2 for a 1D layout
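A serial NumPy sketch of steps 1-6 (illustrative; in the parallel code each transpose is a global communication, here they are just in-memory axis permutations):

```python
import numpy as np

def fft3d_by_columns(a):
    """3D FFT built from 1D column FFTs and transposes (steps 1-6 above)."""
    a = np.fft.fft(a, axis=0)          # 1. 1D FFTs along x columns
    a = np.transpose(a, (1, 2, 0))     # 2. (x,y,z) -> (y,z,x)
    a = np.fft.fft(a, axis=0)          # 3. 1D FFTs along y columns
    a = np.transpose(a, (1, 0, 2))     # 4. (y,z,x) -> (z,y,x)
    a = np.fft.fft(a, axis=0)          # 5. 1D FFTs along z columns
    a = np.transpose(a, (2, 1, 0))     # 6. (z,y,x) -> (x,y,z), optional
    return a

grid = np.random.default_rng(2).standard_normal((32, 32, 32))
print(np.allclose(fft3d_by_columns(grid), np.fft.fftn(grid)))   # True
```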
Specialized 3d FFT for Electronic Structure Codes (Plane Waves/Fourier)
[Figure: data layout at stages (a)-(f), with 1D FFTs performed between (a)-(b), (c)-(d), and (e)-(f)]
– Works for any grid size on any number of processors
– Only the non-zero elements are communicated/calculated
– Most communication is in the global transpose from (b) to (c); there is little communication from (d) to (e)
– Many 3D FFTs are done at the same time to avoid latency issues
– Much faster than the vendor-supplied 3D FFT (no grid size limitations)
– Used in many codes (PARATEC, PEtot, ESCAN, GIT code, etc.)
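A serial sketch of the "only non-zero elements calculated" idea (illustrative, not the distributed implementation): the first round of 1D FFTs is done only on columns that intersect the plane-wave sphere; everything outside the sphere is zero and can be skipped.

```python
import numpy as np

def sphere_fft(coeff_grid, sphere_mask):
    """3D FFT of a grid that is zero outside the plane-wave sphere.
    The first set of 1D FFTs is done only on (y, z) columns that actually
    contain non-zero data (illustrative sketch of the idea)."""
    out = np.zeros_like(coeff_grid, dtype=complex)
    nonzero_cols = np.argwhere(sphere_mask.any(axis=0))   # (y, z) columns touching the sphere
    for y, z in nonzero_cols:
        out[:, y, z] = np.fft.fft(coeff_grid[:, y, z])    # 1D FFT along x, only where needed
    out = np.fft.fft(out, axis=1)                         # remaining axes are dense
    out = np.fft.fft(out, axis=2)
    return out

n = 32
freqs = 2.0 * np.pi * np.fft.fftfreq(n)
gx, gy, gz = np.meshgrid(freqs, freqs, freqs, indexing="ij")
mask = 0.5 * (gx**2 + gy**2 + gz**2) < 3.0
coeffs = np.where(mask, np.random.default_rng(3).standard_normal((n, n, n)), 0.0)
print(np.allclose(sphere_fft(coeffs, mask), np.fft.fftn(coeffs)))   # True
print(f"x-direction FFTs done: {mask.any(axis=0).sum()} of {n*n} columns")
```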
Results: 3d FFT on a 512^3 grid on the Cray XT4 (Franklin, NERSC, LBNL)
• Cray XT4 (NERSC)
• Quad-core Opteron processors, 2.3 GHz
• 38,640 processor cores, 9,660 nodes
Communication strategies compared: isend/irecv; isend/irecv (1 core per node); alltoallv; isend/irecv and alltoallv with blocks of 40 FFTs.
Weak scaling: Nprocs x 512^2

Procs  Measured times (s)
128    0.6175  0.3350  0.5560
256    0.3752  0.1858  0.3033  0.3292  0.3013  0.1604
512    0.3417  0.1326  0.2007  0.1770  0.1609  0.1609
1024   6.3045  0.1221  0.1992  0.1913  0.0772  0.1711
2048   7.6232  6.2814  0.2767  0.5552  0.0477  0.1857
4096   9.4510  6.8414  0.4272  0.0439  0.2019  0.0369  0.2415
8192   0.3746  0.6888

Times are for a forward+reverse 3D FFT on the 512^3 grid; uses the FFTW 1D FFT libraries.
PARATEC (PARAllel Total Energy Code)
• PARATEC performs first-principles quantum mechanical total energy calculations using pseudopotentials and a plane-wave basis set
• Written in F90 and MPI
• Designed to run on large parallel machines (IBM SP, etc.) but also runs on PCs
• PARATEC uses an all-band CG approach to obtain the electron wavefunctions (blocked communications, specialized 3D FFTs)
• Generally obtains a high percentage of peak on different platforms (uses BLAS3 and 1D FFT libraries)
• Developed with Louie's and Cohen's groups (UCB, LBNL)
Overall ratio of calculations to communications ~ N (not log N)
PARATEC: Code Details
• Code written in F90 and MPI (~50,000 lines)
• 33% 3D FFT, 33% BLAS3, 33% hand-coded F90
• Global communications in the 3D FFT (transpose)
• Parallel 3D FFT handwritten to minimize communications and reduce latency (written on top of the vendor-supplied 1D complex FFT)
PARATEC: Performance
Problem: 488-atom CdSe quantum dot, grid size 250^3. Gflops per processor (% of peak):

Procs | Bassi NERSC (IBM Power5) | Jacquard NERSC (Opteron) | Thunder (Itanium2) | Franklin NERSC (Cray XT4) | NEC ES (SX6) | IBM BG/L
128   | 5.49 (72%)               |                          | 2.8 (51%)          |                           | 5.1 (64%)    |
256   | 5.52 (73%)               | 1.98 (45%)               | 2.6 (47%)          | 3.36 (65%)                | 5.0 (62%)    | 1.21 (43%)
512   | 5.13 (67%)               | 0.95 (21%)               | 2.4 (44%)          | 3.15 (61%)                | 4.4 (55%)    | 1.00 (35%)
1024  | 3.74 (49%)               |                          | 1.8 (32%)          | 2.93 (56%)                | 3.6 (46%)    |
2048  |                          |                          |                    | 2.37 (46%)                | 2.7 (35%)    |
• All architectures generally achieve high performance due to the computational intensity of the code (BLAS3, FFT)
• The ES achieves the highest overall performance: 5.5 Tflop/s on 2048 procs (5.3 Tflop/s on the XT4 on 2048 procs in single-processor-per-node mode)
• The FFT is used as a benchmark for NERSC procurements (run on up to 18K procs on the Cray XT4, weak scaling)
• Vectorisation directives and multiple simultaneous 1D FFTs were required for the NEC SX6
• Developed with Louie's and Cohen's groups (UCB, LBNL); also work with L. Oliker and J. Carter
Application: Gamma Ray Detection (Cerium-activated Scintillators)
Project funded by the DHS, in collaboration with S. Derenzo, M. Weber, and A. Chaudhry in Life Sciences (LBNL)
[Energy level diagram: Ce 5d and 4f states between the valence and conduction bands of the host, increasing energy upwards]
• Applications in cosmology, particle physics, medicine, and homeland security
• Scintillators produce light in the optical range when exposed to gamma rays
• The brightest scintillators are Cerium-doped materials (lanthanum halides)
• Theoretical screening of candidate materials
Scintillation process model (Ce activated)
• γ-ray excitation produces a hole and an electron in the host material
• The hole is captured by the dopant Ce site (Ce3+ -> Ce4+)
• The electron diffuses down through the conduction band into the Ce 5d level
• The 5d -> 4f transition on the Ce site emits an optical photon
Key assumption: the 4f and 5d states of the Ce atom must lie in the energy gap of the host crystal.
[Energy level diagram: Ce 5d and 4f levels in the gap between the host valence and conduction bands; the 5d -> 4f transition emits a photon hν]
Density of States Plot for Ce in LaBr3
[Figure: density of states for a single Ce atom in a large unit cell of LaBr3, showing the host valence and conduction bands, the Ce density of states with its 4f and 5d levels in the gap, and the Fermi energy]
Criteria to determine bright Ce-activated Scintillators
1. Size of the host material bandgap: the smaller the bandgap, the more electron-hole pairs, but it must still accommodate the Ce 4f and 5d levels
2. Energy difference between the VBM and the Ce 4f level: the larger this value, the less probable hole capture at the Ce3+ site (it requires multiple or high-energy phonons)
3. Energy difference between the CBM and the Ce 5d level: if the 5d is too close, thermal excitations may excite the electron into the conduction band
4. Level of localisation of the (Ce3+)* state on the Ce atom
[Energy level diagram: Ce 5d and 4f states between the host valence and conduction bands]
Predictions for some new materials

Compound     | Bandgap EG (eV) | Ce 4f - VBM (eV) | (Ce3+)* localisation (%)
Ba2YCl7:Ce   | 4.05            | 0.87             | 64
BaY2Br8:Ce   | 2.94            | 0.45             | 39
BaY2I8:Ce    | 2.50            | ~0.0             | 42
LaFS:Ce      | 3.27            | 0.71             | 45
La2Ca2O5:Ce  | 2.38            | 0.26             | 28
La2Mg2O5:Ce  | 2.80            | 0.1              | 12
Ba2YCl7:Ce was made by experimentalists and found to be a very bright scintillator. An initial patent filing has been taken out for the material Ba2YCl7:Ce.
The Quantization Condition of Quantum-Well States in Cu/Co(100)
• Theoretical investigation of quantum well states in Cu films using our codes (PARATEC, PEtot) to compare with experiments at the ALS (E. Rotenberg, Y.Z. Wu, Z.Q. Qiu)
• New computational methods for metallic systems were used in the calculations
• Led to an understanding of surface effects on the quantum well states; improves on the simple Phase Accumulation Model used previously
[Figure: photoemission geometry (photon beam in, electrons out) on a sample consisting of a 0-54 Å copper wedge on a cobalt layer over a copper substrate]
[Plot: quantum well states in the copper wedge, E-EF (eV) from 0.0 to -1.5 versus Cu thickness (ML) from 0 to 20]
The difference between theory and experiment is improved by taking surface effects into account.
Summary
• Plane wave electronic structure methods can scale to the 1K-10K processor regime
• Higher processor counts require different parallelization methods (state parallelism, k-point parallelism, e.g. the Qbox code)
• This research was funded by the DOE as part of the Modeling and Simulation in Nanoscience Initiative (MICS and BES), contract No. DE-AC03-76SF00098
• Gamma ray detection work was funded by the DHS
Future Directions
• O(N) methods (which exploit locality) can give a sparse matrix problem
• Excited state calculations (dynamical properties)
• Transport calculations (conductivity)