2. Base

After exploring the fundamentals of assembly, we moved on to applying that knowledge by implementing small programs using only AArch64 base instructions.

2.1 Copying Data

In the first task, we were provided with a C++ driver that calls both C and assembly functions.

The goal was to replicate the behavior of the two C functions in assembly. The first C function copies seven 32-bit integers from one array to another. The second performs the same operation for a variable number of integers, with the element count passed as the first argument. The corresponding assembly functions were already declared but not yet implemented.

For context, the two C functions are as follows:

#include <stdint.h>

#ifdef __cplusplus
extern "C" {
#endif

void copy_c_0( int32_t * __restrict a,
               int32_t * __restrict b ) {
  b[0] = a[0];
  b[1] = a[1];
  b[2] = a[2];
  b[3] = a[3];
  b[4] = a[4];
  b[5] = a[5];
  b[6] = a[6];
}

void copy_c_1( int64_t              n,
               int32_t * __restrict a,
               int32_t * __restrict b ) {
  for( uint64_t i = 0; i < n; ++i ) {
    b[i] = a[i];
  }
}

#ifdef __cplusplus
}
#endif

Task 2.1.1 & 2.1.2

For the first function, which copies seven 32-bit integers, we used ldr and str instructions: each element is loaded from the source array into a temporary register and stored to the destination array. The memory addresses are formed from the base pointers plus immediate offsets that grow by 4 bytes per element (each 32-bit integer occupies 4 bytes).

For the second function, which supports a variable number of elements, we implemented a loop using two registers:

  • one register to track the number of copied elements

  • one register to maintain the current byte offset

Each iteration of the loop performs the following steps:

  1. load a 32-bit value from the source array at the current byte offset,

  2. store the value at the same offset in the destination array,

  3. increment the element counter by 1 and the byte offset by 4,

  4. compare the element counter against the target count using cmp,

  5. if the count has not been reached, branch back to the top of the loop; otherwise, return from the function.

Note that the exit condition is tested at the bottom of the loop, so the implementation assumes n ≥ 1: for n == 0, the C version copies nothing, while this loop body executes once.

The implementation is as follows:

    .text
    .type copy_asm_0, %function
    .global copy_asm_0
    /*
    * Copies an array of 7 32-bit integers from one
    * location to another.
    *
    * @param x0: address of the source array.
    * @param x1: address of the destination array.
    */
copy_asm_0:
    // save frame pointer and link register
    stp x29, x30, [sp, #-16]!
    // update frame pointer to current stack pointer
    mov x29, sp

    // b[0] = a[0]
    ldr w2, [x0]
    str w2, [x1]
    // b[1] = a[1]
    ldr w2, [x0, #4]
    str w2, [x1, #4]
    // b[2] = a[2]
    ldr w2, [x0, #8]
    str w2, [x1, #8]
    // b[3] = a[3]
    ldr w2, [x0, #12]
    str w2, [x1, #12]
    // b[4] = a[4]
    ldr w2, [x0, #16]
    str w2, [x1, #16]
    // b[5] = a[5]
    ldr w2, [x0, #20]
    str w2, [x1, #20]
    // b[6] = a[6]
    ldr w2, [x0, #24]
    str w2, [x1, #24]

    // restore frame pointer and link register
    ldp x29, x30, [sp], #16

    ret


    .text
    .type copy_asm_1, %function
    .global copy_asm_1
    /*
    * Copies an array of n 32-bit integers from one
    * location to another.
    *
    * @param x0: number of elements to copy.
    * @param x1: address of the source array.
    * @param x2: address of the destination array.
    */
copy_asm_1:
    // save frame pointer and link register
    stp x29, x30, [sp, #-16]!
    // update frame pointer to current stack pointer
    mov x29, sp
    // number of elements copied
    mov x3, #0
    // byte offset for array
    mov x4, #0
loop:
    // b[i] = a[i]
    ldr w5, [x1, x4]
    str w5, [x2, x4]
    // increment the number of elements copied
    add x3, x3, #1
    // increment the byte offset
    add x4, x4, #4
    // check if we have copied n elements
    cmp x3, x0
    // if not, loop again
    blt loop
    // restore frame pointer and link register
    ldp x29, x30, [sp], #16

    ret

After compiling and running the driver, the output confirms that both assembly implementations behave identically to their C function counterparts:

$ ./copy_driver
copy_c_0: copy succeeded
copy_c_1: copy succeeded
copy_asm_0: copy succeeded
copy_asm_1: copy succeeded
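
For reference, the verification in such a driver essentially amounts to declaring the assembly functions as extern "C", calling each variant, and comparing the destination array against the source. A minimal sketch for copy_asm_0 follows (our own reconstruction; the provided driver differs in its details):

#include <cstdint>
#include <cstring>
#include <iostream>

// the assembly implementation under test
extern "C" void copy_asm_0( int32_t * a, int32_t * b );

int main() {
  int32_t l_src[7] = { 1, 2, 3, 4, 5, 6, 7 };
  int32_t l_dst[7] = { 0 };

  copy_asm_0( l_src, l_dst );

  // the copy succeeded if destination and source match
  bool l_ok = std::memcmp( l_src, l_dst, sizeof( l_src ) ) == 0;
  std::cout << "copy_asm_0: copy " << ( l_ok ? "succeeded" : "failed" ) << std::endl;

  return l_ok ? 0 : 1;
}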

2.2 Instruction Throughput and Latency

In this task, we wrote a micro-benchmark to measure the throughput and the latency of two instructions:

  • ADD (shifted register)

  • MUL (multiply)

2.2.1 Throughput

To measure instruction throughput, we developed one assembly function per instruction. Each contains a loop in which every instruction is independent of the previous ones, allowing the processor to execute them in parallel across its execution ports. We use 25 independent instructions per iteration to keep all ports busy and to amortize the loop overhead (the subs and branch); x18 is skipped because it is reserved as the platform register. The ADD loop is shown first, followed by the analogous MUL loop:

loop:
    add x1, x27, x28
    add x2, x27, x28
    add x3, x27, x28
    add x4, x27, x28
    add x5, x27, x28

    add x6, x27, x28
    add x7, x27, x28
    add x8, x27, x28
    add x9, x27, x28
    add x10, x27, x28

    add x11, x27, x28
    add x12, x27, x28
    add x13, x27, x28
    add x14, x27, x28
    add x15, x27, x28

    add x16, x27, x28
    add x17, x27, x28
    add x19, x27, x28
    add x20, x27, x28
    add x21, x27, x28

    add x22, x27, x28
    add x23, x27, x28
    add x24, x27, x28
    add x25, x27, x28
    add x26, x27, x28

    subs x0, x0, #1          // decrement loop counter
    b.gt loop                // branch if greater than zero

loop:
    mul x1, x27, x28
    mul x2, x27, x28
    mul x3, x27, x28
    mul x4, x27, x28
    mul x5, x27, x28

    mul x6, x27, x28
    mul x7, x27, x28
    mul x8, x27, x28
    mul x9, x27, x28
    mul x10, x27, x28

    mul x11, x27, x28
    mul x12, x27, x28
    mul x13, x27, x28
    mul x14, x27, x28
    mul x15, x27, x28

    mul x16, x27, x28
    mul x17, x27, x28
    mul x19, x27, x28
    mul x20, x27, x28
    mul x21, x27, x28

    mul x22, x27, x28
    mul x23, x27, x28
    mul x24, x27, x28
    mul x25, x27, x28
    mul x26, x27, x28

    subs x0, x0, #1          // decrement loop counter
    b.gt loop                // branch if greater than zero

2.2.2 Latency

To measure the latency of the two instructions, we implemented two additional assembly functions. In contrast to the throughput benchmarks, where instructions were independent, the idea here was to create data dependencies between instructions.

Each instruction in the loop depends on the result of the previous one, forcing the processor to wait for the output of one instruction before executing the next. Again, the ADD loop is shown first, followed by the MUL loop:

loop:
    // repeat the following block 5 times
    .rept 5
    add x1, x1, x2
    add x1, x1, x2
    add x1, x1, x2
    add x1, x1, x2
    add x1, x1, x2
    .endr

    subs x0, x0, #1          // decrement loop counter
    b.gt loop                // branch if greater than zero

loop:
    // repeat the following block 5 times
    .rept 5
    mul x1, x1, x2
    mul x1, x1, x2
    mul x1, x1, x2
    mul x1, x1, x2
    mul x1, x1, x2
    .endr

    subs x0, x0, #1          // decrement loop counter
    b.gt loop                // branch if greater than zero

2.2.3 Results

To test our assembly functions, we implemented a benchmark in C++ that:

  1. calls each function multiple times,

  2. measures the total execution time,

  3. calculates the achieved GOPS (Giga Operations Per Second).

// time measurement for add_instr in microbench.cpp
auto l_start_time = std::chrono::high_resolution_clock::now();
add_instr( n );
auto l_end_time = std::chrono::high_resolution_clock::now();
double elapsedTime = std::chrono::duration_cast<std::chrono::microseconds>(l_end_time - l_start_time).count() / 1e6;

// GOPS calculation for add_instr
double totalOps = n * 25;  // 25 independent instructions per loop iteration
double opsPerSec = totalOps / elapsedTime;
double gops = opsPerSec / 1e9;
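
The kernels are made callable from C++ via extern "C" declarations. The following condensed sketch shows the surrounding harness; the bench helper, the iteration count, and the output format are our own illustration, and only the kernel signatures follow the snippets above:

#include <chrono>
#include <cstdint>
#include <iostream>

// assembly kernels: each runs its 25-instruction loop body n times
extern "C" void add_instr( uint64_t n );
extern "C" void mul_instr( uint64_t n );

// times one kernel and prints its estimated GOPS (illustrative helper)
void bench( char const * i_name, void (* i_kernel)( uint64_t ), uint64_t i_n ) {
  auto l_start_time = std::chrono::high_resolution_clock::now();
  i_kernel( i_n );
  auto l_end_time = std::chrono::high_resolution_clock::now();

  double elapsedTime = std::chrono::duration_cast<std::chrono::microseconds>(l_end_time - l_start_time).count() / 1e6;
  double gops = ( i_n * 25.0 ) / elapsedTime / 1e9;  // 25 instructions per loop iteration

  std::cout << i_name << ": " << gops << " GigaOps/sec" << std::endl;
}

int main() {
  uint64_t l_n = 1500000000;  // iteration count (illustrative value)
  bench( "ADD throughput", add_instr, l_n );
  bench( "MUL throughput", mul_instr, l_n );
  return 0;
}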

To compile and execute the benchmark, we ran:

$ g++ microbench.cpp add_instr.s add_lat_instr.s mul_instr.s mul_lat_instr.s -o microbench.o
$ ./microbench.o

We obtained the following results:

Benchmarking results (GOPS)
Benchmarking ADD throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s):   1.29126
Instructions per Second:   2.90414e+10
Estimated GOPS:   29.0414 GigaOps/sec
-----------------------------------------------

Benchmarking MUL throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s):   2.82838
Instructions per Second:   1.32584e+10
Estimated GOPS:   13.2584 GigaOps/sec
-----------------------------------------------

Benchmarking ADD latency ...
-----------------------------------------------
Measuring latency for Instruction
Total time (s):   0.856261
Instructions per Second:   4.37951e+09
Estimated GOPS:   4.37951 GigaOps/sec
-----------------------------------------------

Benchmarking MUL latency ...
-----------------------------------------------
Measuring latency for Instruction
Total time (s):   2.5642
Instructions per Second:   1.46244e+09
Estimated GOPS:   1.46244 GigaOps/sec
-----------------------------------------------

To interpret these results, we first looked up the clock speed of the Apple M4 chip, which is about 4.4 GHz.

Further, we needed information about the M4 microarchitecture: the Apple M4 has 8 ALUs, 3 of which can execute MUL instructions.
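
At the nominal 4.4 GHz clock, these unit counts imply theoretical peak throughputs of:

\[8 \times 4.4 \text{ GHz} = 35.2 \text{ GOPS for ADD}\]
\[3 \times 4.4 \text{ GHz} = 13.2 \text{ GOPS for MUL}\]

Our MUL measurement sits essentially at this peak, while the ADD measurement reaches about 82% of it.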

Conversely, dividing the measured throughput by the number of capable execution units yields an implied clock speed:

\[\frac{29.0414 \text{ GOPS}}{8} = 3.63 \text{ GHz}\]
\[\frac{13.2584 \text{ GOPS}}{3} = 4.42 \text{ GHz}\]

For the ADD instruction, the implied clock is below the specified 4.4 GHz; part of the gap is plausibly loop overhead, since the subs and branch in each iteration also occupy issue slots but are not counted as operations. For the MUL instruction, our result closely aligns with the specified clock speed.

For the latency, we can make a similar calculation:

\[\frac{4.4 \text{ GHz}}{4.37951 \text{ GOPS}} \approx 1 \text{ cycle per instruction}\]
\[\frac{4.4 \text{ GHz}}{1.46244 \text{ GOPS}} \approx 3 \text{ cycles per instruction}\]

The ADD latency matches the theoretical value of one cycle per instruction. The MUL latency is slightly higher than expected (3 vs. 2 cycles).