8. Optimization

8.1 ARA


HPC-Cluster ARA. Source: https://wiki.uni-jena.de/pages/viewpage.action?pageId=22453005

8.1.1 - Uploading and running the code

First, we cloned our GitHub repository to “beegfs” and transferred the bathymetry and displacement data there with “wget https://cloud.uni-jena.de/s/CqrDBqiMyKComPc/download/data_in.tar.xz -O tsunami_lab_data_in.tar.xz”.

To use scons, we first have to execute “module load tools/python/3.8” and “pip install --user scons”, and then load the compiler via “module load compiler/gcc/11.2.0”.

sbatch file:

#!/bin/bash
#SBATCH --job-name=tohoku_1000
#SBATCH --output=tohoku_1000.out
#SBATCH --error=tohoku_1000.err
#SBATCH --partition=s_hadoop
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=10:00:00
#SBATCH --cpus-per-task=72

#load modules
module load compiler/gcc/11.2.0

# Enter your executable commands here
# Execute the compiled program
scons
./build/tsunami_lab

Since we only want to use one node, we set nodes and ntasks to 1 and cpus-per-task to 72.

8.1.2 - Visualizations

Tohoku 5000

Tohoku 1000

Chile 5000

Chile 1000

Compared with the simulations from assignment 6, it is clear that all simulations behave identically.

8.1.3 - Private PC vs ARA

Note

The code was compiled using scons mode=benchmark opt=-O2. The benchmarking mode disables all file output (and also skips all imports of <filesystem>).
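A sketch of how such a build switch can be wired up (the macro name BENCHMARK_MODE and the function writeOutput are hypothetical, not our actual build flag):

#ifndef BENCHMARK_MODE // hypothetical macro set by the benchmark build mode
#include <filesystem>  // only needed when file output is enabled
#endif

void writeOutput() // hypothetical stand-in for the file output routine
{
#ifndef BENCHMARK_MODE
  // ... write netCDF / CSV files ...
#endif
}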

Setups

If you are interested, you can view the used configurations here:

chile5000.json | chile1000.json | tohoku5000.json | tohoku1000.json

Results

execution times on different devices

Device    Config       Setup time   Computation time   Total time   Cells        Setup time per cell   Computation time per cell
-------   ----------   ----------   ----------------   ----------   ----------   -------------------   -------------------------
private   chile5000    5.56396s     31.6052s           41.3911s     413,000      0.01347ms             0.07652ms
ARA       chile5000    5.06402s     31.6528s           42.7966s     413,000      0.01226ms             0.07664ms
private   chile1000    139.167s     4017.46s           4160.25s     10,325,000   0.01348ms             0.3891ms
ARA       chile1000    127.257s     4204.83s           4338.24s     10,325,000   0.01233ms             0.40725ms
private   tohoku5000   1.70329s     14.6443s           22.5213s     162,000      0.01051ms             0.09040ms
ARA       tohoku5000   1.49705s     14.8791s           18.9277s     162,000      0.00924ms             0.09184ms
private   tohoku1000   41.2804s     1888.40s           1931.84s     4,050,000    0.01019ms             0.46627ms
ARA       tohoku1000   37.7817s     1949.85s           1990.99s     4,050,000    0.00933ms             0.48144ms

Note

Total time is not just setup + calculation time. The total execution timer is invoked directly at the start of the main function and stopped after the program has finished and all memory has been freed.
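A minimal sketch of this timer placement (hypothetical variable names, assuming std::chrono rather than our exact timing code):

#include <chrono>
#include <iostream>

int main()
{
  // total timer starts first, before any setup work or allocation
  auto l_totalStart = std::chrono::steady_clock::now();

  auto l_setupStart = std::chrono::steady_clock::now();
  // ... read bathymetry / displacement files, initialize cells ...
  auto l_setupEnd = std::chrono::steady_clock::now();

  auto l_computeStart = std::chrono::steady_clock::now();
  // ... time-stepping loop ...
  auto l_computeEnd = std::chrono::steady_clock::now();

  // ... free all memory ...

  // the total timer stops last, which is why total > setup + computation:
  // it also covers argument parsing, allocations and cleanup
  auto l_totalEnd = std::chrono::steady_clock::now();

  std::cout << "setup:       " << std::chrono::duration<double>(l_setupEnd - l_setupStart).count() << "s\n"
            << "computation: " << std::chrono::duration<double>(l_computeEnd - l_computeStart).count() << "s\n"
            << "total:       " << std::chrono::duration<double>(l_totalEnd - l_totalStart).count() << "s\n";
  return 0;
}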

Observations

In every scenario, ARA had a faster setup time but a slower computation time. We conclude that ARA has faster data/file access (the setup time is dominated by the speed of reading data from file), while the private PC seems to have better single-core performance.

8.2 Compilers

8.2.1 - Generic compiler support

We enabled generic compiler support by adding the following lines to our SConstruct file:

import os

...

# set local env
env['ENV'] = os.environ

# choose compiler
if 'CXX' in os.environ:
  env['CXX'] = os.environ['CXX']

Now, scons can be invoked with a compiler of choice, for example by running

CXX=icpc scons

8.2.2 & 8.2.3 - Test runs

Time measurements

For each run, we used the following configuration:

{
  "solver": "fwave",
  "simulationSizeX": 10000,
  "simulationSizeY": 10000,
  "offsetX": 5000,
  "offsetY": 5000,
  "nx": 2000,
  "ny": 2000,
  "setup": "ARTIFICIAL2D",
  "endTime": 50
}

We ran the tsunami_lab program in benchmarking mode without file output. The results can be seen below.
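For reference, each run was built and launched along these lines (the config file name configs/artificial2d.json is a hypothetical stand-in for the configuration above):

CXX=g++ scons mode=benchmark opt=-O0
./build/tsunami_lab configs/artificial2d.json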

execution times for different compilers and optimization flags

Compiler   Optimization flag   Setup time   Computation time   Total time
--------   -----------------   ----------   ----------------   ----------
g++        -O0                 0.152468s    741.348s           741.575s
g++        -O2                 0.0615546s   273.039s           273.151s
g++        -Ofast              0.0607083s   203.635s           203.743s
icpc       -O0                 0.207138s    1230.19s           1230.48s
icpc       -O2                 0.0724063s   254.169s           254.308s
icpc       -Ofast              0.0704465s   251.197s           251.33s

Note

Total time is not just setup + calculation time. The total execution timer is invoked directly at the start of the main function and stopped after the program has finished and all memory has been freed.

For g++, we used the module compiler/gcc/11.2.0.

Unfortunately, we were not able to use the latest compiler versions for icpc. When using compiler/intel/2020-Update2 in combination with compiler/gcc/11.2.0, we got errors such as:

tsunami_lab/build/src/setups/TsunamiEvent1d.cpp:38: undefined reference to `tsunami_lab::io::Csv::splitLine(std::__cxx11::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >, char, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&)'

With compiler/gcc/10.2.0, there were issues like:

/cluster/spack/opt/spack/linux-centos7-broadwell/gcc-8.1.0/gcc-10.2.0-ru4xdhhkxnma5i727b7njtnjoh6kff3s/include/c++/10.2.0/tuple(566): error: pack "_UElements" does not have the same number of elements as "_Elements"
__and_<is_nothrow_constructible<_Elements, _UElements>...>::value;

Versions compiler/intel/2019-Update5 and compiler/intel/2019-Update3 did not work due to missing licences.

We therefore ended up using compiler/intel/2018-Update1 together with gcc (GCC) 4.8.5, which is available without loading any module. This was the only configuration that worked for us, as we did not manage to fix all the errors the other combinations produced.

Observations from the table

As one would intuitively expect, the higher the optimization level, the faster the program finished.

One can observe that g++ was faster with both the -O0 and -Ofast flags; with -O2, however, icpc took the lead. It is also worth noting that the jump from -O2 to -Ofast was much bigger with g++ than with icpc.

In conclusion, it cannot be said that one compiler always generates faster code than the other; we do not have nearly enough data for such a comparison. We would also need to ensure that no other intensive processes are running that could unintentionally slow down the code. Nonetheless, using the table as a rough estimate, g++ seems to be faster with -O0 and -Ofast, while icpc is preferable for -O2.

8.2.3 - Optimization flags

To allow for an easy switch between optimization flags, we added the following code to our SConstruct:

EnumVariable( 'opt',
              'optimization flag',
              '-O3',
              allowed_values=('-O0',
                              '-O1',
                              '-O2',
                              '-O3',
                              '-Ofast') )

and

# set optimization mode
if 'debug' in env['mode']:
  env.Append( CXXFLAGS = [ '-g',
                           '-O0' ] )
else:
  env.Append( CXXFLAGS = [ env['opt'] ] )
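scons can then be invoked with, for example:

scons mode=benchmark opt=-Ofast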

The dangers of -Ofast

One of the options that -Ofast enables is -ffast-math, which in turn activates a number of further options, such as

  • -funsafe-math-optimizations

    • enables optimizations that allow arbitrary reassociations and transformations with no accuracy guarantees

    • does not try to preserve the sign of zeros

    • due to round-off errors, the associative law of algebra does not necessarily hold for floating-point numbers; expressions like (x + y) + z are therefore not necessarily equal to x + (y + z)

  • -ffinite-math-only

    • assumes that arguments and results are never NaN or ±Inf, which is unsafe whenever such values can actually occur

  • -fno-rounding-math

    • assumes that the rounding mode is round-to-nearest

  • -fexcess-precision=fast

    • operations may be carried out in a wider precision than the types specified in the source if that results in faster code

    • it is unpredictable when rounding to the types specified in the source code takes place

Our sources are https://gcc.gnu.org/wiki/FloatingPointMath and https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
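To make the reassociation danger concrete, here is a minimal standalone example (not taken from our solver):

#include <cstdio>

int main()
{
  float x = 1e20f, y = -1e20f, z = 1.0f;

  // floating-point addition is not associative:
  // (x + y) + z = 0.0f + 1.0f = 1.0f
  // x + (y + z) = 1e20f + (-1e20f) = 0.0f, because y + z rounds back to y
  std::printf("(x + y) + z = %f\n", (x + y) + z);
  std::printf("x + (y + z) = %f\n", x + (y + z));

  // with -funsafe-math-optimizations the compiler may rewrite one form
  // into the other, so results can change with the optimization level
  return 0;
}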

8.2.4 - Compiler reports

We added support for a compiler report flag with the following lines in our SConstruct:

EnumVariable( 'report',
              'flag for enabling reports',
              'none',
              allowed_values=('none',
                              '-qopt-report',
                              '-qopt-report=1',
                              '-qopt-report=2',
                              '-qopt-report=3',
                              '-qopt-report=4',
                              '-qopt-report=5') )

To test it out, we ran the code on the ARA machine with the following parameters:

CXX=icpc scons mode=benchmark opt=-O2 report=-qopt-report

The generated report for the main class (without the parts about submodules) can be found here.

We can see that five for-loops were not vectorized. For example:

LOOP BEGIN at build/src/main.cpp(488,5)
 remark #15333: loop was not vectorized: exception handling for a call prevents vectorization   [ build/src/main.cpp(497,54) ]

 LOOP BEGIN at build/src/main.cpp(492,7)
    remark #15333: loop was not vectorized: exception handling for a call prevents vectorization   [ build/src/main.cpp(497,54) ]
 LOOP END
LOOP END

This snippet refers to the loops that provide our solver with data from a setup:

for (tsunami_lab::t_idx l_cy = 0; l_cy < l_ny; l_cy++)
{
  for (tsunami_lab::t_idx l_cx = 0; l_cx < l_nx; l_cx++)
  {
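    // body elided: queries the setup for cell (l_cx, l_cy) and passes the
    // resulting values on to the solver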
  }
}

F-Wave optimization report

The full report can be found here.

Starting with the computeEigenvalues() function, the report tells us that the lines

t_real l_hSqrtL = std::sqrt(i_hL);
t_real l_hSqrtR = std::sqrt(i_hR);
t_real l_ghSqrtRoe = m_gSqrt * std::sqrt(l_hRoe);

are inlined:

-> INLINE: (20,21) std::sqrt(float)
-> INLINE: (21,21) std::sqrt(float)
-> INLINE: (29,34) std::sqrt(float)

This means that each call to std::sqrt(float) is replaced with the function's implementation at the call site, removing the call overhead.

For computeEigencoefficients, we can see that

t_real l_rInv[2][2] = {{0}};
...
t_real l_fDelta[2] = {0};

are implemented by the compiler using memset:

build/src/solvers/Fwave.cpp(48,23):remark #34000: call to memset implemented inline with stores with proven (alignment, offset): (16, 0)
build/src/solvers/Fwave.cpp(55,22):remark #34000: call to memset implemented inline with stores with proven (alignment, offset): (16, 0)

For netUpdates, the report tells us that

INLINE REPORT: (tsunami_lab::solvers::Fwave::netUpdates( [...] )) [3] build/src/solvers/Fwave.cpp(77,1)
-> INLINE: (86,3) tsunami_lab::solvers::Fwave::computeEigenvalues( [...] )
  [...]
-> INLINE: (97,3) tsunami_lab::solvers::Fwave::computeEigencoefficients( [...] )

We can conclude that the compiler is able to inline our calls to computeEigenvalues and computeEigencoefficients.

WavePropagation2d optimization report

The full report can be found here.

To keep it short, the report tells us that the loops for the x- and y-sweep (which compute the net update) could not be vectorized:

LOOP BEGIN at build/src/patches/WavePropagation2d.cpp(86,3)
 remark #15543: loop was not vectorized: loop with function call not considered an optimization candidate.

 LOOP BEGIN at build/src/patches/WavePropagation2d.cpp(88,5)
    remark #15523: loop was not vectorized: loop control variable l_ec was found, but loop iteration count cannot be computed before executing the loop
 LOOP END
LOOP END

LOOP BEGIN at build/src/patches/WavePropagation2d.cpp(152,3)
 remark #15543: loop was not vectorized: loop with function call not considered an optimization candidate.

 LOOP BEGIN at build/src/patches/WavePropagation2d.cpp(154,5)
    remark #15523: loop was not vectorized: loop control variable l_ed was found, but loop iteration count cannot be computed before executing the loop
 LOOP END
LOOP END

Note

Lines 86 and 88 are the two for-loops for y- and x-axis of the x-sweep and lines 152 and 154 are the two for-loops for y- and x-axis of the y-sweep.

8.3 Instrumentation and Performance Counters

8.3.1 to 8.3.4 - VTune

First, we used the GUI of Intel VTune to set up our reports.

Then the following batch script was used to run the hotspots measurement:

#!/bin/bash
#SBATCH --job-name=vTune
#SBATCH --output=vTune.out
#SBATCH --error=vTune.err
#SBATCH --partition=s_hadoop
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=10:00:00
#SBATCH --cpus-per-task=72

module load compiler/intel/2020-Update2

/cluster/intel/vtune_profiler_2020.2.0.610396/bin64/vtune -collect hotspots -app-working-dir /beegfs/xe63nel/tsunami_lab/build -- /beegfs/xe63nel/tsunami_lab/build/tsunami_lab ../configs/config.json

Hotspots

Hotspot analysis (bottom-up and top-down views): ../../_images/task_8-3-1_hotspot_bottomUp.png, ../../_images/task_8-3-1_hotspot_topDown.png

The most compute-intensive part is the computeEigencoefficients() function. This was to be expected, since it computes

  • the inverse of the right eigenvector matrix

  • \(\Delta f\)

  • \(\Delta x \Psi_{i-1/2}\)

  • the two eigencoefficients as the product of the inverse of the right eigenvector matrix and \(\Delta f\)

for every cell edge in every time step.

It was interesting to see (although it should not come as a surprise) that the timeStep() function used up almost 99% of the CPU time.

Threads

Thread analysis (summary and timeline): ../../_images/task_8-3-1_threads.png, ../../_images/task_8-3-1_threads_chart.png

The poor result of the thread report was also expected, since we compute everything sequentially on a single thread.

8.3.5 - Code optimizations

TsunamiEvent2d speedup

In order to speed up this setup, we introduced lastNegativeIndex variables for the x- and y-directions of the bathymetry and displacement data. The idea is the following: when making a bathymetry or displacement query, we need to find the point of the data grid that is closest to the queried point. With a naive implementation, we traverse the full arrays of x- and y-values every time until we have found the closest point. Since the values are monotonically increasing, we start from index 0 and, in the worst case, traverse the whole array.

If the first half of the array is filled with negative numbers due to an offset and the queried point has positive coordinates, we waste time traversing the negative part of the array. We can avoid this by traversing the array once during initialization and saving the index of the last negative value before the values turn positive. If we then look up a positive coordinate, we start searching from the saved index and skip the negative part of the array.

We tested this for the chile1000 scenario (Chile Tsunami Simulation with 1000m cell size) in benchmarking mode:

Setup time:

  • Before: 222.709s

  • After: 140.509s

This corresponds to a 36.9% decrease in setup time.

Code snippets of the implementation:

  1. traverse the x- and y-axis data for the bathymetry and note the last negative index (initialized to 0)

if (m_xDataB[0] < 0 && m_xDataB[m_nxB - 1] > 0)
{
  for (t_idx l_ix = 1; l_ix < m_nxB; l_ix++)
  {
    if (m_xDataB[l_ix - 1] < 0 && m_xDataB[l_ix] >= 0)
    {
      m_lastNegativeIndexBX = l_ix - 1;
      break;
    }
  }
}
  2. use the saved index if the queried point is >= 0

tsunami_lab::t_real tsunami_lab::setups::TsunamiEvent2d::getBathymetryFromArray(t_real i_x,
                                                                                t_real i_y) const
{
  [...]
  t_idx l_ix = i_x >= 0 ? m_lastNegativeIndexBX : 1;
  for (; l_ix < m_nxB; l_ix++)
  {
    [...]
  }
  [...]
}

F-Wave solver optimization

In computeEigencoefficients, we changed

t_real l_rInv[2][2] = {{0}};
l_rInv[0][0] = l_detInv * eigenvalueRoe_2;
l_rInv[0][1] = -l_detInv;
l_rInv[1][0] = -l_detInv * eigenvalueRoe_1;
l_rInv[1][1] = l_detInv;

to

t_real l_rInv[4] = {l_detInv * eigenvalueRoe_2, -l_detInv, -l_detInv * eigenvalueRoe_1, l_detInv};

in order to skip the unnecessary step of first initializing the array to 0. Furthermore, we now use a one-dimensional array for better performance.

We did the same for

t_real l_fDelta[2] = {0};
l_fDelta[0] = i_huR - i_huL;
l_fDelta[1] = (i_huR * i_huR / i_hR + t_real(0.5) * m_g * i_hR * i_hR) - (i_huL * i_huL / i_hL + t_real(0.5) * m_g * i_hL * i_hL);

to

t_real l_fDelta[2] = {i_huR - i_huL, (i_huR * i_huR / i_hR + t_real(0.5) * m_g * i_hR * i_hR) - (i_huL * i_huL / i_hL + t_real(0.5) * m_g * i_hL * i_hL)};

Furthermore, we established a constant for t_real(0.5) * m_g:

static t_real constexpr l_gHalf = 4.903325;
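Combining the two changes above, the flux difference then reads (a sketch of the resulting code):

// l_gHalf replaces t_real(0.5) * m_g at every use site
t_real l_fDelta[2] = {i_huR - i_huL,
                      (i_huR * i_huR / i_hR + l_gHalf * i_hR * i_hR) -
                      (i_huL * i_huL / i_hL + l_gHalf * i_hL * i_hL)};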

Coarse Output optimization

Inside the write() function in NetCdf.cpp we calculated

l_data[l_i] /= m_k * m_k;

to get the average value of the m_k * m_k cells which we summed up. To optimize this, we compute

t_real l_averagingFactor = t_real(1) / (m_k * m_k);

once and then reuse it wherever we need it:

l_b[l_i] *= l_averagingFactor;

This way, the division happens only once per write instead of once per cell.
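Putting it together, the averaging inside write() then looks roughly like this (the loop bound l_cellCount is a hypothetical name):

// one division per write() call instead of one per cell
t_real l_averagingFactor = t_real(1) / (m_k * m_k);
for (t_idx l_i = 0; l_i < l_cellCount; l_i++)
{
  l_data[l_i] *= l_averagingFactor;
}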

Individual phase ideas

For the individual phase, we plan on building a graphical user interface using ImGui.

We plan to implement the following features:

  • run a simulation with the click of a button

  • stop a simulation with the click of a button

  • specify simulation parameters (size, cells, endTime, etc.)

  • specify file paths (netCdf output, checkpoint, bathymetry, displacement)

  • save and load config files

  • enable/disable checkpoints

  • monitor time status in

    • simTime / endTime

    • timeStep / maxTimeSteps

    • current execution time (real time)

    • estimated real time left

  • monitor system info

  • trigger a re-compilation with parameters

If there is time left, we want to implement a Python script (callable from the GUI) that invokes ParaView and renders an animation from a solution file, without the need for the ParaView GUI. However, this part of the project is entirely optional, and we are unsure whether we will have the capacity to implement it.