In this post, we describe how to compile and run Dr. Stephan Stellmach’s PADDI code on the Stampede2 supercomputer. Phase I of the Stampede2 rollout features 4,200 Knights Landing (KNL) nodes, the second generation of processors based on Intel’s Many Integrated Core (MIC) architecture.
I copied the PADDI tarball, PADDI_10.1_dist.tgz, from Hyades to Stampede2. The tarball contains generic x86-64 libraries and executables that were compiled with Intel MPI and the Intel compilers. Since KNL processors offer binary compatibility with Intel Xeon processors, legacy x86-64 binaries can run on KNL nodes without recompilation. However, those binaries won’t take advantage of KNL’s unique features (such as AVX-512), and therefore won’t run at optimal speed on KNL nodes. We’ll recompile PADDI for the KNL microarchitecture.
Unpack the tarball in the home directory on Stampede2:
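For example, assuming the tarball sits in the home directory:

```bash
cd $HOME
tar xzf PADDI_10.1_dist.tgz
```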
To build specifically for the KNL microarchitecture with the default Intel compilers, explicitly add the compiler flag -xMIC-AVX512. Alternatively, use the flags -xCORE-AVX2 -axMIC-AVX512 to build a fat binary that contains optimized code for both the Broadwell microarchitecture (the login nodes) and KNL (the Phase I compute nodes). In this post, we’ll use -xCORE-AVX2 -axMIC-AVX512.
We note in passing that the default optimization level for Intel compilers is -O2.
Dependencies
Let’s first clean house (a sketch of the commands appears after this list):
Delete all files in stuff_needed/bin
Delete all files in stuff_needed/lib
Delete all files except jcmagic.h in stuff_needed/include
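Assuming we are in the directory that the tarball unpacked into (the exact top-level directory name may differ), this could be done with:

```bash
cd stuff_needed
rm -f bin/*        # delete all files in bin
rm -f lib/*        # delete all files in lib
# delete everything in include except jcmagic.h
find include -type f ! -name 'jcmagic.h' -delete
```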
According to the README file, PADDI depends upon the following libraries:
FFTW3
Parallel NetCDF
a library called jutils (written by Joerg Schmalzl) used to save the data in compressed form
TACC provides optimized builds of FFTW3 and Parallel NetCDF on Stampede2, and we’ll simply use them. We only need to recompile jutils.
Note that TACC uses LMOD to manage software on Stampede2. LMOD is similar to the module utility deployed on Hyades and NERSC supercomputers. You can list currently loaded modules with:
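With LMOD, that is:

```bash
module list
```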
FFTW3
To see what FFTW packages are available:
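One way is LMOD’s search command, which accepts partial names:

```bash
module spider fftw
```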
fftw3 is what we need. To learn more about it:
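For example:

```bash
module spider fftw3
```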
To load the fftw3 module (note that we only need the module when compiling PADDI, not when running it):
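```bash
module load fftw3
```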
Parallel netCDF (PnetCDF)
To see what NetCDF packages are available:
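Again with LMOD’s search command (the exact list of matches depends on the installed module tree):

```bash
module spider netcdf
```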
The choices can be confusing, so a brief explanation is in order:
parallel-netcdf is the parallel version of NetCDF-4, built upon parallel HDF5; it is not what we need
pnetcdf is Parallel netCDF (PnetCDF), which supports NetCDF in the classic formats (CDF-1 and CDF-2); it is what we need
netcdf is the serial version of NetCDF-4
To learn more about the pnetcdf module:
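```bash
module spider pnetcdf
```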
To load the pnetcdf module (note that, as with fftw3, we only need the module when compiling PADDI, not when running it):
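```bash
module load pnetcdf
```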
jutils
We’ll recompile jutils. Go to the source directory:
Clean up the old build:
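For instance, from the top of the unpacked tree (the src_jutils directory name below is a guess; use wherever the jutils sources actually live, and make clean assumes the Makefile provides a clean target):

```bash
cd stuff_needed/src_jutils   # hypothetical path to the jutils sources
make clean
```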
Modify the Makefile so that it has the following contents (note that all we do is add the -xCORE-AVX2 -axMIC-AVX512 compiler flags):
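A sketch of the relevant lines; the variable names and compiler wrapper are assumptions about this particular Makefile, and the only real change is appending the two flags to the existing compile flags:

```makefile
# illustrative only: append -xCORE-AVX2 -axMIC-AVX512 to the existing flags
CC     = icc
CFLAGS = -O2 -xCORE-AVX2 -axMIC-AVX512
```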
Copy the newly built libraries:
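Something along these lines (the archive name libjutils.a is an assumption; copy whatever the build actually produced):

```bash
cp libjutils.a ../lib/   # hypothetical library name
```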
Compiling PADDI
Go to the source directory for PADDI:
Clean up the old build:
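For example (the src directory name is a guess; make clean assumes the Makefile provides a clean target):

```bash
cd src        # hypothetical path to the PADDI sources
make clean
```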
Modify the Makefile so that its first few lines look as follows (the rest stays the same as the original):
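A sketch of what the top of the Makefile might look like. The TACC_FFTW3_* and TACC_PNETCDF_* environment variables are set by the fftw3 and pnetcdf modules on Stampede2; the variable names on the left, the library names, and the relative paths to stuff_needed are assumptions about this particular Makefile:

```makefile
FC      = mpif90
FFLAGS  = -O2 -xCORE-AVX2 -axMIC-AVX512
INCLUDE = -I$(TACC_FFTW3_INC) -I$(TACC_PNETCDF_INC) -I../stuff_needed/include
LIBS    = -L$(TACC_FFTW3_LIB) -lfftw3 -L$(TACC_PNETCDF_LIB) -lpnetcdf \
          -L../stuff_needed/lib -ljutils
```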
Recompile PADDI:
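Assuming the default Makefile target builds everything, this is simply:

```bash
make
```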
It succeeded without a hitch!
Running PADDI
Copy the executable (in this case, double_diff_double_3D) and other requisite files to your $SCRATCH directory.
Here we request 4 KNL nodes, and we run 64 MPI tasks per node (256 MPI tasks in total). There are 68 cores per node, so you should not run more than 68 MPI tasks per node. You might, however, want to run fewer tasks per node.
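A minimal Slurm batch script for such a run might look like the following; the job name, wall time, and executable path are placeholders to adapt, and ibrun is TACC’s MPI launcher on Stampede2:

```bash
#!/bin/bash
#SBATCH -J paddi        # job name (placeholder)
#SBATCH -p normal       # KNL production queue
#SBATCH -N 4            # number of KNL nodes
#SBATCH -n 256          # total MPI tasks (64 per node)
#SBATCH -t 02:00:00     # wall-clock time limit (placeholder)

ibrun ./double_diff_double_3D
```

Submit the script with sbatch from the run directory in $SCRATCH.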