7. Checkpointing and Coarse Output
All project authors contributed to this assignment in equal parts.
7.1 - Checkpointing
7.1.1 - Writing checkpoints
We extended NetCdf.cpp with a writeCheckpoint() function:
void tsunami_lab::io::NetCdf::writeCheckpoint(const char *path,
                                              t_idx i_stride,
                                              t_real const *i_h,
                                              t_real const *i_hu,
                                              t_real const *i_hv,
                                              t_real const *i_b,
                                              t_real i_t,
                                              t_real i_timeStep)
{
    t_idx start[] = {0, 0};
    t_idx count[] = {m_ny, m_nx};
    int l_i = 0;
    t_real *l_data = new t_real[m_nx * m_ny];
    // scalar values (only the simulation size is shown here)
    checkNcErr(nc_put_vara_float(m_ncCheckId, m_varCheckSimSizeXId, start, count, &m_simulationSizeX));
    checkNcErr(nc_put_vara_float(m_ncCheckId, m_varCheckSimSizeYId, start, count, &m_simulationSizeY));
    // height: zeros for a null input, otherwise a copy of the strided array
    l_i = 0;
    if (i_h == nullptr)
    {
        memset(l_data, 0, m_nx * m_ny * sizeof(t_real));
    }
    else
    {
        for (t_idx l_y = 0; l_y < m_ny; l_y++)
        {
            for (t_idx l_x = 0; l_x < m_nx; l_x++)
            {
                l_data[l_i++] = i_h[l_x + l_y * i_stride];
            }
        }
    }
    // ... (writing l_data and the remaining arrays is not shown)
    delete[] l_data;
    // flush all data
    if (m_outputFileOpened){
        // ... (not shown)
    }
}
First we write the scalar values: simulation size, offset, write-step count, current time step, current simulation time, and the k value from task 7.2. For brevity, the code snippet shows only the simulation size.
We then write the arrays, as shown for the height. The momenta and the bathymetry are handled in the same way. If an input pointer is a nullptr, we fill the array with zeros using memset.
At the end of the function, we flush not only the checkpoint data to its file, but also all standard netCDF output data.
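The copy step described above can be sketched in isolation. This is a minimal illustration, not the project's code: packField and its parameter names are hypothetical, float stands in for t_real, and the indexing mirrors the snippet above (no extra ghost-cell offset is applied).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: copy a strided 2D field into a contiguous
// row-major buffer of i_nx * i_ny values. A null input yields an
// all-zero buffer, mirroring the memset branch of writeCheckpoint().
std::vector<float> packField(const float *i_field,
                             std::size_t i_nx,
                             std::size_t i_ny,
                             std::size_t i_stride)
{
    std::vector<float> l_data(i_nx * i_ny, 0.0f); // zero-initialized
    if (i_field == nullptr)
    {
        return l_data;
    }
    std::size_t l_i = 0;
    for (std::size_t l_y = 0; l_y < i_ny; l_y++)
    {
        for (std::size_t l_x = 0; l_x < i_nx; l_x++)
        {
            // the stride may be larger than i_nx, so rows are not adjacent
            l_data[l_i++] = i_field[l_x + l_y * i_stride];
        }
    }
    return l_data;
}
```

Returning a std::vector instead of a raw new[] buffer removes the need for a matching delete[]; the contiguous data is still available through data() for a C API call.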
7.1.2 - CheckPoint setup
We decided against a separate checkpoint setup class, because the code was short enough to be implemented directly in main.cpp. We also hope for better performance, since we can access the arrays directly from there without function calls into a separate CheckPoint class.
We extended the initialization of the NetCdf instance in main.cpp so that, if a checkpoint exists, the scalar values are loaded directly from the checkpoint file. For this we overloaded the NetCdf constructor with a lighter version that takes only the two file paths as input.
// set up netCdf I/O
tsunami_lab::io::NetCdf *l_netCdf;
if (l_setupChoice == "CHECKPOINT")
{
    l_netCdf = new tsunami_lab::io::NetCdf(l_netcdfOutputPath,
                                           // ... (rest of the call and the loading of the scalar values not shown)
    std::cout << std::endl;
    std::cout << "Loaded following data from checkpoint: " << std::endl;
    std::cout << "  Cells x:                  " << l_nx << std::endl;
    std::cout << "  Cells y:                  " << l_ny << std::endl;
    std::cout << "  Simulation size x:        " << l_simulationSizeX << std::endl;
    std::cout << "  Simulation size y:        " << l_simulationSizeY << std::endl;
    std::cout << "  Offset x:                 " << l_offsetX << std::endl;
    std::cout << "  Offset y:                 " << l_offsetY << std::endl;
    std::cout << "  Current simulation time:  " << l_simTime << std::endl;
    std::cout << "  Current time step:        " << l_timeStep << std::endl;
    std::cout << std::endl;
}
else
{
    l_netCdf = new tsunami_lab::io::NetCdf(l_nx,
                                           // ... (remaining arguments not shown)
}
The name loadCheckpointDimensions() may be slightly misleading: "Dimensions" refers to values such as simulation size, cell count and offset, not to the dimensions of the checkpoint file.
For all other values, which are stored in arrays (height, momenta, bathymetry), we added the following code before the usual solver setup:
if (l_setupChoice == "CHECKPOINT")
{
    tsunami_lab::t_real *l_hCheck = new tsunami_lab::t_real[l_nx * l_ny];
    tsunami_lab::t_real *l_huCheck = new tsunami_lab::t_real[l_nx * l_ny];
    tsunami_lab::t_real *l_hvCheck = new tsunami_lab::t_real[l_nx * l_ny];
    tsunami_lab::t_real *l_bCheck = new tsunami_lab::t_real[l_nx * l_ny];
    l_netCdf->read(l_checkPointFilePath, "height", &l_hCheck);
    l_netCdf->read(l_checkPointFilePath, "momentumX", &l_huCheck);
    l_netCdf->read(l_checkPointFilePath, "momentumY", &l_hvCheck);
    l_netCdf->read(l_checkPointFilePath, "bathymetry", &l_bCheck);
    for (tsunami_lab::t_idx l_cy = 0; l_cy < l_ny; l_cy++)
    {
        for (tsunami_lab::t_idx l_cx = 0; l_cx < l_nx; l_cx++)
        {
            l_hMax = std::max(l_hCheck[l_cx + l_cy * l_nx], l_hMax);
            // the beginning of each of the following four calls is not shown
            /* ... */ l_hCheck[l_cx + l_cy * l_nx]);
            /* ... */ l_huCheck[l_cx + l_cy * l_nx]);
            /* ... */ l_hvCheck[l_cx + l_cy * l_nx]);
            /* ... */ l_bCheck[l_cx + l_cy * l_nx]);
        }
    }
    delete[] l_hCheck;
    delete[] l_huCheck;
    delete[] l_hvCheck;
    delete[] l_bCheck;
}
// normal solver setup
The read function was implemented last week. We only added a simpler version of it that does not require an x- and y-data array (those were needed last week to read the axis coordinates along with the actual data). Both versions are available as overloads.
7.1.3 & 7.1.4 - Checkpoint testing and automatic loading
An example log may look like:
LPMGs-Air-6:tsunami_lab lpmg$ ./build/tsunami_lab
### Tsunami Lab ###
### ###
### https://scalable.uni-jena.de ###
runtime configuration file: configs/config.json
Solution file exists but no checkpoint was found. The solution will be deleted.
Setting up solver...
Setup done. Operation took 31.5309ms = 3.15309e-05s
Writing every 100 time steps
Saving checkpoint every 5 seconds
entering time loop
simulation time / #time steps: 0 / 0
writing to netcdf
saving checkpoint to checkpoints/solution.nc
simulation time / #time steps: 0.227168 / 100
writing to netcdf
saving checkpoint to checkpoints/solution.nc
LPMGs-Air-6:tsunami_lab lpmg$ ./build/tsunami_lab
### Tsunami Lab ###
### ###
### https://scalable.uni-jena.de ###
runtime configuration file: configs/config.json
Found checkpoint file: checkpoints/solution.nc
Loaded following data from checkpoint:
Cells x: 1000
Cells y: 1000
Simulation size x: 100
Simulation size y: 100
Offset x: 0
Offset y: 0
Current simulation time: 0.268059
Current time step: 118
Setting up solver...
Setup done. Operation took 64.2207ms = 6.42207e-05s
Writing every 100 time steps
Saving checkpoint every 5 seconds
entering time loop
saving checkpoint to checkpoints/solution.nc
The log shows that, after the program is aborted, it loads the checkpoint file the next time it is executed. If there is a solution file but no checkpoint file for it, the existing solution is simply deleted. If there is a checkpoint but no solution file, the solution file is set up again and the simulation starts with the data provided by the checkpoint file.
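The start-up rules described above can be summarized as a small decision function. This is only a sketch of the described behavior; the names StartAction and chooseStartAction are hypothetical and do not appear in the project.

```cpp
// Hypothetical summary of the start-up behavior described in the text.
enum class StartAction
{
    LoadCheckpoint,              // resume; the solution file is set up again if it is missing
    DeleteSolutionAndStartFresh, // stale solution without a checkpoint
    StartFresh                   // neither file exists
};

StartAction chooseStartAction(bool i_checkpointExists, bool i_solutionExists)
{
    if (i_checkpointExists)
    {
        // a checkpoint always wins, with or without an existing solution file
        return StartAction::LoadCheckpoint;
    }
    if (i_solutionExists)
    {
        return StartAction::DeleteSolutionAndStartFresh;
    }
    return StartAction::StartFresh;
}
```

The two booleans could be obtained, for example, with std::filesystem::exists on the checkpoint and solution paths.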
The checkpoint frequency is given in seconds of real (wall-clock) time, not simulation time. The smallest valid value is 1; if a smaller number is provided, checkpointing is disabled.
We also decided to use a single checkpoint file and to overwrite it every time. Furthermore, simply for convenience, the checkpoint file is deleted after the program ends successfully.
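A wall-clock checkpoint interval like the one described above could be tracked as follows. This is a sketch under assumptions: the class CheckpointTimer is hypothetical, and the current time is passed in explicitly so that the logic can be exercised without waiting.

```cpp
#include <chrono>

// Hypothetical helper that reports when a checkpoint is due.
// An interval below 1 second disables checkpointing.
class CheckpointTimer
{
public:
    using Clock = std::chrono::steady_clock;

    CheckpointTimer(long i_intervalSeconds, Clock::time_point i_start)
        : m_enabled(i_intervalSeconds >= 1),
          m_interval(i_intervalSeconds),
          m_last(i_start)
    {
    }

    // Returns true if at least one interval has passed since the last
    // checkpoint, and restarts the interval in that case.
    bool due(Clock::time_point i_now)
    {
        if (!m_enabled || i_now - m_last < m_interval)
        {
            return false;
        }
        m_last = i_now;
        return true;
    }

private:
    bool m_enabled;
    std::chrono::seconds m_interval;
    Clock::time_point m_last;
};
```

Inside the time loop one would call due(CheckpointTimer::Clock::now()) and write the checkpoint whenever it returns true. steady_clock is used here because it is not affected by adjustments of the system time.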
7.2 - Coarse Output
7.2.1 - Implementation
First we added new member variables in the NetCdf constructor:
m_k = i_nk;
m_nkx = i_nx / i_nk;
m_nky = i_ny / i_nk;
m_nkx and m_nky state how many values we have in the x and y direction after dividing the cell counts by k. Since this is an integer division, any remainder is discarded. We use these values to define the sizes of the x and y dimensions.
We use the following loop for each of bathymetry, height, momentumX and momentumY. The example shows bathymetry:
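t_real *l_b = new t_real[m_nkx * m_nky](); // value-initialized to zero so that += accumulates correctly
l_i = 0;
for (t_idx l_gy = 0; l_gy < m_ny; l_gy += m_k)
{
    for (t_idx l_gx = 0; l_gx < m_nx; l_gx += m_k)
    {
        for (t_idx l_y = 0; l_y < m_k; l_y++)
            for (t_idx l_x = 0; l_x < m_k; l_x++)
                l_b[l_i] += i_b[l_gx + l_x + (l_y + l_gy) * i_stride];
        l_b[l_i] /= m_k * m_k; // average over the k * k cells of the block
        l_i++;                 // advance to the next coarse cell
    }
}
delete[] l_b;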
The two outer loops step through x and y in increments of k, so that cells which have already been used are skipped. The two inner loops then visit all cells inside the current k-by-k square. The loop variables and the stride together give the correct index into the input array. Afterwards, the data is written to the NetCdf file and the allocated memory is freed.
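The averaging can also be written as a self-contained function, which makes it easy to check on a small grid. This is a sketch, not the project's implementation: coarsen and its parameters are hypothetical, float stands in for t_real, and k is assumed to be at least 1. Unlike the loop above, it iterates over the coarse cells, so trailing cells that do not fill a complete k-by-k block are ignored when the grid size is not a multiple of k.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: average each complete k-by-k block of a strided
// 2D field into one output value (row-major, (i_nx / i_k) values per row).
std::vector<float> coarsen(const float *i_data,
                           std::size_t i_nx,
                           std::size_t i_ny,
                           std::size_t i_stride,
                           std::size_t i_k)
{
    const std::size_t l_nkx = i_nx / i_k; // integer division drops the remainder
    const std::size_t l_nky = i_ny / i_k;
    std::vector<float> l_out(l_nkx * l_nky, 0.0f);

    for (std::size_t l_cy = 0; l_cy < l_nky; l_cy++)
    {
        for (std::size_t l_cx = 0; l_cx < l_nkx; l_cx++)
        {
            float l_sum = 0.0f;
            for (std::size_t l_y = 0; l_y < i_k; l_y++)
            {
                for (std::size_t l_x = 0; l_x < i_k; l_x++)
                {
                    l_sum += i_data[(l_cx * i_k + l_x) + (l_cy * i_k + l_y) * i_stride];
                }
            }
            l_out[l_cx + l_cy * l_nkx] = l_sum / static_cast<float>(i_k * i_k);
        }
    }
    return l_out;
}
```

For a 4-by-4 grid holding the values 0 to 15 and k = 2, the four output values are the block averages 2.5, 4.5, 10.5 and 12.5.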
7.2.2 - Visualization
We ran a script for a cell size of 50 meters. While running the program, we got the following error:
simulation time / time steps: 2.55646 / 70
writing to netcdf
saving checkpoint to checkpoints/tohoku_50_5_solution.nc
Error: NetCDF: One or more variable sizes violate format constraints
We thought we had fixed this issue by enabling Large File Support using the NC_64BIT_OFFSET flag in nc_create. It solved the issue for writing our output data, but for some reason not for writing checkpoints. Due to lack of time we have not been able to find the cause yet, but we are working on it.
Because of the error, the solution file was corrupted and we could not open it in ParaView for visualization.
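One possible explanation for the error, not verified here, is the per-variable size limit of the 64-bit offset format. According to the netCDF documentation, a fixed-size variable in that format may not need more than 2^32 - 4 bytes of storage, unless it is the last fixed-size variable and there are no record variables. A checkpoint stores every cell, whereas the coarse output stores only one value per k-by-k block, so the checkpoint arrays could exceed the limit while the output arrays do not. The check below is a sketch; the function name and the grid sizes in the usage note are hypothetical.

```cpp
#include <cstdint>

// Largest data size of a single fixed-size variable in the netCDF
// 64-bit offset format (NC_64BIT_OFFSET): 2^32 - 4 bytes.
constexpr std::uint64_t kCdf2MaxVarBytes = (std::uint64_t{1} << 32) - 4;

// Hypothetical helper: does a 2D variable with i_nx * i_ny values of
// i_bytesPerValue bytes each fit into the 64-bit offset format?
bool fitsCdf2Variable(std::uint64_t i_nx,
                      std::uint64_t i_ny,
                      std::uint64_t i_bytesPerValue)
{
    return i_nx * i_ny * i_bytesPerValue <= kCdf2MaxVarBytes;
}
```

For example, a 1000-by-1000 grid of 4-byte floats fits easily, while a hypothetical 40000-by-30000 grid (4.8 GB per array) does not. If this turns out to be the cause, the NC_64BIT_DATA or NC_NETCDF4 creation modes of nc_create lift the per-variable limit.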