To get into the TOP 50, 100 or 500 lists of HPC (High Performance Computing) systems, results obtained with the HPL (High Performance Linpack) benchmark are used. The Linpack (Linear Algebra PACKage) benchmark solves systems of linear algebraic equations (SLAEs) using LU decomposition. The package is publicly available and easy to install and run, and it is a good way to demonstrate CPU performance.

Anyone familiar with the architecture of graphics accelerators might expect this package to be even better suited to testing GPU-based compute devices. However, the only version available for download online is the 2011 CUDA version for the Fermi architecture.

In this guide, I will give an example of building and running HPL for the GPU.

How to control access to software?
How to install CUDA?
How to install openmpi?
How to install openblas?
How to install HPL for GPU?

Installing the MODULES Package
To manage environment variables, install the MODULES package and prepare a test module file.
$ yum install environment-modules
$ mcedit /etc/modulefiles/test/v1.0
proc ModulesHelp { } {
global version
puts stderr "Modulefile for test v1.0"
}
set version v1.0
module-whatis "Modulefile for test v1.0"
setenv MAINDIR /nfs/software/test/v1.0
prepend-path PATH $env(MAINDIR)/bin
prepend-path C_INCLUDE_PATH $env(MAINDIR)/include
prepend-path CPLUS_INCLUDE_PATH $env(MAINDIR)/include
prepend-path LIBRARY_PATH $env(MAINDIR)/lib64
prepend-path LD_LIBRARY_PATH $env(MAINDIR)/lib64
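The same module file layout is reused for every package in this guide; only the name, version, install prefix, and library directory change. A minimal sketch of a generator follows. The function name write_modulefile is hypothetical (it is not part of the MODULES package), and the #%Module1.0 first line is an assumption based on the header that Environment Modules conventionally expects at the top of a Tcl module file.

```shell
#!/bin/bash
# Hypothetical helper (not part of environment-modules): print a module file
# in the layout used in this guide.
# Usage: write_modulefile NAME VERSION PREFIX [LIBDIR]   (LIBDIR defaults to lib64)
write_modulefile() {
    local name=$1 version=$2 prefix=$3 libdir=${4:-lib64}
    # Unquoted heredoc: $name etc. are expanded, \$env(...) stays literal.
    cat <<EOF
#%Module1.0
proc ModulesHelp { } {
    global version
    puts stderr "Modulefile for $name $version"
}
set version $version
module-whatis "Modulefile for $name $version"
setenv MAINDIR $prefix
prepend-path PATH \$env(MAINDIR)/bin
prepend-path C_INCLUDE_PATH \$env(MAINDIR)/include
prepend-path CPLUS_INCLUDE_PATH \$env(MAINDIR)/include
prepend-path LIBRARY_PATH \$env(MAINDIR)/$libdir
prepend-path LD_LIBRARY_PATH \$env(MAINDIR)/$libdir
EOF
}

# Example (prints to stdout; redirect into /etc/modulefiles/test/v1.0 as root):
# write_modulefile test v1.0 /nfs/software/test/v1.0
```

Passing lib as the fourth argument covers packages that install their libraries into lib rather than lib64.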
Check module files
Mistakes are easy to make when preparing a module file, so I check every path specified in it. To avoid checking each path by hand, I prepared a script. It prints each path followed by the exit status of ls; 0 means the path exists.
$ cat check-modulefiles
#!/bin/bash
# Usage: ./check-modulefiles <modulefile>
ModulePath=$1
# Value of MAINDIR: third whitespace-separated field of the setenv line
MainDir=$(awk '$1 == "setenv" && $2 == "MAINDIR" {print $3}' "$ModulePath")
# Third field of every prepend-path / append-path line
ListOfPaths=$(awk '$1 ~ /-path$/ {print $3}' "$ModulePath")
# Expand the $env(MAINDIR) placeholder to the real directory
ListOfPaths=$(echo $ListOfPaths | sed "s@\$env(MAINDIR)@$MainDir@g")
for u in $ListOfPaths; do
ls -la $u 1> /dev/null 2> /dev/null;
printf "%60s %4d\n" $u $?;
done
$ chmod +x check-modulefiles
$ ./check-modulefiles /etc/modulefiles/test/v1.0
/nfs/software/test/v1.0/bin 0
/nfs/software/test/v1.0/include 0
/nfs/software/test/v1.0/include 0
/nfs/software/test/v1.0/lib64 0
/nfs/software/test/v1.0/lib64 0
Module Management Commands
$ module avail
$ module add cuda/v10.1
$ nvcc --version
Cuda compilation tools, release 10.1, V10.1.168
$ module switch cuda/v10.1 cuda/v9.2
$ nvcc --version
Cuda compilation tools, release 9.2, V9.2.88
$ module list
$ module rm cuda/v9.2
By line of the listing above:
1. List the modules available for loading
2. Load a module
3-4. Check the version
5. Switch to another module
6-7. Check the version again
8. List the loaded modules
9. Unload the module

Install CUDA
Download the CUDA 9.2 installer for CentOS 7 from the NVIDIA website.
$ chmod +x cuda_9.2.run
$ ./cuda_9.2.run
Do you accept the previously read EULA? accept
Install the CUDA 9.2 Toolkit? yes
Enter Toolkit Location: /nfs/software/cuda/v9.2
Do you want to install a symbolic link at /usr/local/cuda? no
Install the CUDA 9.2 Samples? no
$ cat /etc/modulefiles/cuda/v9.2
proc ModulesHelp { } {
global version
puts stderr "Modulefile for cuda v9.2"
}
set version v9.2
module-whatis "Modulefile for cuda v9.2"
setenv MAINDIR /nfs/software/cuda/v9.2
prepend-path PATH $env(MAINDIR)/bin
prepend-path C_INCLUDE_PATH $env(MAINDIR)/include
prepend-path CPLUS_INCLUDE_PATH $env(MAINDIR)/include
prepend-path LIBRARY_PATH $env(MAINDIR)/lib64/stubs
prepend-path LIBRARY_PATH $env(MAINDIR)/lib64
prepend-path LD_LIBRARY_PATH $env(MAINDIR)/lib64/stubs
prepend-path LD_LIBRARY_PATH $env(MAINDIR)/lib64
$ module add cuda/v9.2
$ nvcc --version
Cuda compilation tools, release 9.2, V9.2.148
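Since several CUDA versions can be installed side by side, it can help to check automatically that the loaded module really selects the expected toolkit. The sketch below parses the "release X.Y" field from the nvcc --version output shown above. The function names cuda_release and check_cuda_release are hypothetical helpers, not part of CUDA or MODULES.

```shell
#!/bin/bash
# Hypothetical helper: extract the release number from `nvcc --version` output
# read on stdin (the line "Cuda compilation tools, release X.Y, VX.Y.Z").
cuda_release() {
    sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# Hypothetical helper: succeed only if the nvcc found on PATH reports the
# expected release.
check_cuda_release() {
    local expected=$1 actual
    actual=$(nvcc --version 2>/dev/null | cuda_release)
    if [ "$actual" = "$expected" ]; then
        echo "nvcc release $actual: ok"
    else
        echo "nvcc release '$actual' does not match expected $expected" >&2
        return 1
    fi
}

# Example, after `module add cuda/v9.2`:
# check_cuda_release 9.2
```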
Install OpenBLAS
$ wget https://github.com/xianyi/OpenBLAS/archive/v0.3.6.tar.gz
$ tar -xzvf v0.3.6.tar.gz
$ cd OpenBLAS-0.3.6
$ mkdir -p /nfs/software/openblas/v0.3.6
$ make -j4
$ make PREFIX=/nfs/software/openblas/v0.3.6/ install
$ ls -la /nfs/software/openblas/v0.3.6/lib/
$ cat /etc/modulefiles/openblas/v0.3.6
proc ModulesHelp { } {
global version
puts stderr "Modulefile for openblas v0.3.6"
}
set version v0.3.6
module-whatis "Modulefile for openblas v0.3.6"
setenv MAINDIR /nfs/software/openblas/v0.3.6
prepend-path PATH $env(MAINDIR)/bin
prepend-path C_INCLUDE_PATH $env(MAINDIR)/include
prepend-path CPLUS_INCLUDE_PATH $env(MAINDIR)/include
prepend-path LIBRARY_PATH $env(MAINDIR)/lib
prepend-path LD_LIBRARY_PATH $env(MAINDIR)/lib
$ ls -la /nfs/software/openblas/v0.3.6/lib
Install OpenMPI
$ wget https://download.open-mpi.org/release/open-mpi/v2.1/openmpi-2.1.6.tar.gz
$ tar -xzvf openmpi-2.1.6.tar.gz
$ cd openmpi-2.1.6
$ mkdir -p /nfs/software/openmpi/v2.1.6
$ module add cuda/v9.2
$ ./configure --prefix=/nfs/software/openmpi/v2.1.6/ --with-cuda --enable-static
$ make
$ make install
$ cat /etc/modulefiles/openmpi/v2.1.6
proc ModulesHelp { } {
global version
puts stderr "Modulefile for openmpi v2.1.6"
}
set version v2.1.6
module-whatis "Modulefile for openmpi v2.1.6"
setenv MAINDIR /nfs/software/openmpi/v2.1.6
prepend-path PATH $env(MAINDIR)/bin
prepend-path C_INCLUDE_PATH $env(MAINDIR)/include
prepend-path CPLUS_INCLUDE_PATH $env(MAINDIR)/include
prepend-path LIBRARY_PATH $env(MAINDIR)/lib
prepend-path LD_LIBRARY_PATH $env(MAINDIR)/lib
$ module add openmpi/v2.1.6
$ mpirun --version
mpirun (Open MPI) 2.1.6
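The --with-cuda option passed to configure above is easy to lose silently if the CUDA module was not loaded at build time. As a sketch, the build can be checked with ompi_info, assuming the installed tool reports the mpi_built_with_cuda_support parameter as described in the Open MPI documentation for CUDA-aware builds. The function name cuda_support_value is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: read `ompi_info --parsable --all` output on stdin and
# print the value of the mpi_built_with_cuda_support parameter (true/false).
# The relevant line is assumed to look like:
#   mca:mpi:base:param:mpi_built_with_cuda_support:value:true
cuda_support_value() {
    awk -F: '/mpi_built_with_cuda_support:value/ {print $NF}'
}

# Example, after `module add openmpi/v2.1.6`:
# ompi_info --parsable --all | cuda_support_value
```

If the value printed is false, or nothing is printed, rebuilding with the CUDA module loaded is the first thing to try.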
Install HPL for GPU
Set up the environment variables by loading the modules, then download HPL 2.0.
$ module add openmpi/v2.1.6
$ module add cuda/v9.2
$ module add openblas/v0.3.6
$ wget https://developer.download.nvidia.com/assets/cuda/secure/AcceleratedLinpack/hpl-2.0_FERMI_v15.tgz
$ tar -xvf hpl-2.0_FERMI_v15.tgz
$ mv hpl-2.0_FERMI_v15 hpl-2.0
$ cd hpl-2.0
Before building, you must edit several files. The first is Make.CUDA in the hpl-2.0 directory. Copy the following into Make.CUDA:
$ cat Make.CUDA
SHELL = /bin/sh
CD = cd
CP = cp
LN_S = ln -fs
MKDIR = mkdir -p
RM = /bin/rm -f
TOUCH = touch
ARCH = CUDA
TOPdir = /home/user/hpl-2.0
INCdir = $(TOPdir)/include
BINdir = $(TOPdir)/bin/$(ARCH)
LIBdir = $(TOPdir)/lib/$(ARCH)
HPLlib = $(LIBdir)/libhpl.a
MPdir = /nfs/software/openmpi/v2.1.6
MPinc = -I$(MPdir)/include
MPlib = -L$(MPdir)/lib -lmpi
LAdir = /nfs/software/openblas/v0.3.6
LAinc = -I$(LAdir)/include
LAlib = -L$(TOPdir)/src/cuda -ldgemm -L/nfs/software/cuda/v9.2/lib64 -lcuda -lcudart -lcublas -L$(LAdir)/lib -lopenblas
F2CDEFS = -DAdd__ -DF77_INTEGER=int -DStringSunStyle
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc)
HPL_LIBS = $(HPLlib) $(LAlib) $(MPlib)
HPL_OPTS = -DCUDA
HPL_DEFS = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)
CC = mpicc
CCFLAGS = -fopenmp -lpthread -fomit-frame-pointer -O3 -funroll-loops $(HPL_DEFS)
CCNOOPT = $(HPL_DEFS) -O0 -w
LINKER = $(CC)
LINKFLAGS = $(CCFLAGS)
ARCHIVER = ar
ARFLAGS = r
RANLIB = echo
MAKE = make TOPdir=$(TOPdir)
TOPdir: path to the hpl-2.0 directory
MPdir: path to OpenMPI
LAdir: path to OpenBLAS
LAlib: path to CUDA lib64

Replace the following lines in the hpl-2.0/src/cuda/cuda_dgemm.c file:
$ mcedit src/cuda/cuda_dgemm.c
…
// handle2 = dlopen ("libmkl_intel_lp64.so", RTLD_LAZY);
handle2 = dlopen ("libopenblas.so", RTLD_LAZY);
…
// dgemm_mkl = (void(*)())dlsym(handle, "dgemm");
dgemm_mkl = (void(*)())dlsym(handle, "dgemm_");
…
// handle = dlopen ("libmkl_intel_lp64.so", RTLD_LAZY);
handle = dlopen ("libopenblas.so", RTLD_LAZY);
…
// mkl_dtrsm = (void(*)())dlsym(handle2, "dtrsm");
mkl_dtrsm = (void(*)())dlsym(handle2, "dtrsm_");
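The symbol names change because OpenBLAS exports its Fortran-interface BLAS routines with a trailing underscore (dgemm_, dtrsm_). If dlsym cannot find the name, the failure only shows up at run time, so it may be worth checking the library before building. A minimal sketch, assuming GNU nm is available; exports_symbol is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: read `nm -D --defined-only LIB` output on stdin and
# succeed if SYMBOL is exported as a (possibly weak) function.
# Usage: nm -D --defined-only LIB | exports_symbol SYMBOL
exports_symbol() {
    awk -v sym="$1" '
        NF >= 2 && $NF == sym && $(NF-1) ~ /^[TW]$/ { found = 1 }
        END { exit !found }
    '
}

# Example with the install prefix used in this guide:
# lib=/nfs/software/openblas/v0.3.6/lib/libopenblas.so
# for s in dgemm_ dtrsm_; do
#     nm -D --defined-only "$lib" | exports_symbol "$s" || echo "missing: $s"
# done
```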
Build and run HPL on four GPUs:
$ make arch=CUDA
$ cd bin/CUDA
$ export LD_LIBRARY_PATH=/home/user/hpl-2.0/src/cuda/:$LD_LIBRARY_PATH
$ mpirun -np 4 ./xhpl
================================================================================
HPLinpack 2.0 -- High-Performance Linpack benchmark -- September 10, 2008
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 25000
NB : 768
PMAP : Row-major process mapping
P : 2
Q : 2
PFACT : Left
NBMIN : 2
NDIV : 2
RFACT : Left
BCAST : 1ring
DEPTH : 1
SWAP : Spread-roll (long)
L1 : no-transposed form
U : no-transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR10L2L2       25000   768     2     2              16.72              6.232e+02
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0019019 ...... PASSED
================================================================================
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
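The Gflops figure in the report can be sanity-checked from N and the run time. The sketch below assumes the operation count that HPL uses for its report, 2/3 N^3 + 3/2 N^2 floating-point operations; hpl_gflops is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: recompute the HPL rate from the problem size N and the
# wall time in seconds, assuming (2/3)*N^3 + (3/2)*N^2 operations.
# Usage: hpl_gflops N TIME_SECONDS
hpl_gflops() {
    awk -v n="$1" -v t="$2" 'BEGIN {
        flops = 2.0 / 3.0 * n * n * n + 1.5 * n * n
        printf "%.1f\n", flops / t / 1e9
    }'
}

# Values from the run above:
# hpl_gflops 25000 16.72
```

For the run above this gives about 623 Gflops, which agrees with the reported 6.232e+02 once the rounding of the printed time is taken into account.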
To change the test parameters, edit the hpl-2.0/bin/CUDA/HPL.dat file.
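When choosing N, NB, P and Q for HPL.dat, a common rule of thumb (an assumption here, not something this guide measured) is to size the N x N double-precision matrix to a fixed fraction of the available memory, round N down to a multiple of NB, and pick a process grid P x Q as close to square as possible with P <= Q. The helpers hpl_problem_size and hpl_grid below are hypothetical, and the 0.8 fraction in the example is only an illustrative starting point.

```shell
#!/bin/bash
# Hypothetical helper: largest N (multiple of NB) whose N*N matrix of 8-byte
# doubles fits into FRACTION of MEM_GIB gibibytes.
# Usage: hpl_problem_size MEM_GIB FRACTION NB
hpl_problem_size() {
    awk -v gib="$1" -v frac="$2" -v nb="$3" 'BEGIN {
        n = sqrt(gib * 1073741824 * frac / 8)
        printf "%d\n", int(n / nb) * nb
    }'
}

# Hypothetical helper: print "P Q" for NP processes, with P the largest
# divisor of NP that does not exceed sqrt(NP).
# Usage: hpl_grid NP
hpl_grid() {
    local np=$1 p=1 i=1
    while [ $((i * i)) -le "$np" ]; do
        if [ $((np % i)) -eq 0 ]; then
            p=$i
        fi
        i=$((i + 1))
    done
    echo "$p $((np / p))"
}

# Examples:
# hpl_problem_size 64 0.8 768   # 64 GiB in total, NB = 768 as in the run above
# hpl_grid 4                    # four MPI processes, as in the run above
```

For four processes, hpl_grid gives the same 2 x 2 grid that appears in the output above.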