###################
9. Parallelization
###################

************
9.1 OpenMP
************

9.1.1 - Parallelization with OpenMP
====================================

An easy way to parallelize our for loops is to use

.. code:: cpp

    #pragma omp parallel for

For example:

.. code:: cpp

    #pragma omp parallel for
    for (unsigned short l_st = 0; l_st < 2; l_st++){
    ...

9.1.2 - Parallelization speedup
====================================

We used the following batch script on the Ara cluster:

.. code:: batch

    #!/bin/bash
    #SBATCH --job-name=tohoku_1000
    #SBATCH --output=tohoku_1000.out
    #SBATCH --error=tohoku_1000.err
    #SBATCH --partition=s_hadoop
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --time=10:00:00
    #SBATCH --cpus-per-task=72

    ./build/tsunami_lab configs/tohoku1000.json

We got the following results:

Without parallelization
-----------------------

.. code:: text

    Note: max 10520 steps will be computed.
    entering time loop
    finished time loop
    Calculation time: 1.94101e+06ms = 1941.01 seconds = 32.3501 minutes

With parallelization on 72 cores with 72 threads
------------------------------------------------

.. code:: text

    Note: max 10520 steps will be computed.
    entering time loop
    finished time loop
    Calculation time: 75507.5ms = 75.5075 seconds = 1.25846 minutes

.. note::

    We compiled in benchmark mode (no IO).

Speedup: :math:`\frac{1941}{75.5} \approx 25.7`

With parallelization on 72 cores with 144 threads
--------------------------------------------------

.. code:: text

    Note: max 10520 steps will be computed.
    entering time loop
    finished time loop
    Calculation time: 215822ms = 215.822 seconds = 3.59704 minutes

Speedup: :math:`\frac{1941}{215.8} \approx 8.99`

Doubling the number of threads made the computation considerably slower. We conclude that running more threads than physical cores degrades performance.

9.1.3 - 2D for loop parallelization
====================================

The results above were obtained by parallelizing the outer loop. Parallelizing the inner loop instead leads to the following time:

.. code:: text

    Calculation time: 791389ms = 791.389 seconds = 13.1898 minutes

It is clear that parallelizing the outer loop is more efficient, since the parallel region is then entered only once per sweep instead of once per outer iteration.

9.1.4 - Pinning and Scheduling
===============================

Scheduling
----------

The implementation above used the default :code:`schedule(static)`. With :code:`schedule(dynamic)` we get:

.. code:: text

    Calculation time: 1.57024e+06ms = 1570.24 seconds = 26.1706 minutes

With :code:`schedule(guided)` we get:

.. code:: text

    Calculation time: 95218ms = 95.218 seconds = 1.58697 minutes

With :code:`schedule(auto)` we get:

.. code:: text

    Calculation time: 84546.7ms = 84.5467 seconds = 1.40911 minutes

Pinning
-------

Using :code:`OMP_PLACES={0}:36:1` we get:

.. image:: ../../_static/assets/task_9-4-1_Pinning_36.png

.. code:: text

    Calculation time: 1.39672e+06ms = 1396.72 seconds = 23.2786 minutes

Using :code:`OMP_PLACES={0,36}:18:1` we get:

.. image:: ../../_static/assets/task_9-4-1_Pinning_18.png

.. code:: text

    Calculation time: 1.43714e+06ms = 1437.14 seconds = 23.9523 minutes

This shows that the first pinning strategy is slightly more efficient.
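
For reference, the sketch below shows how the outer-loop pragma from 9.1.1 can be combined with a schedule clause. It is only a minimal illustration: the function name, loop bounds and the flat height array are placeholder assumptions, not the solver's actual code.

.. code:: cpp

    #include <cstddef>
    #include <vector>

    // Sketch: parallelize the outer loop over rows and attach a schedule clause.
    // i_nCellsX, i_nCellsY and io_h are placeholder names, not the solver's API.
    void updateCells( std::size_t          i_nCellsX,
                      std::size_t          i_nCellsY,
                      std::vector<float> & io_h ) {
    #pragma omp parallel for schedule(guided)
      for( std::size_t l_ceY = 0; l_ceY < i_nCellsY; l_ceY++ ) {
        // the inner loop stays serial within each thread
        for( std::size_t l_ceX = 0; l_ceX < i_nCellsX; l_ceX++ ) {
          io_h[ l_ceY * i_nCellsX + l_ceX ] += 1.0f; // placeholder cell update
        }
      }
    }

Compiled with :code:`-fopenmp`, swapping :code:`guided` for :code:`static`, :code:`dynamic` or :code:`auto` reproduces the variants measured above; with :code:`schedule(runtime)` the strategy could even be selected via the :code:`OMP_SCHEDULE` environment variable without recompiling.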