2. Base

After exploring the fundamentals of assembly, we moved on to applying that knowledge by implementing small programs using only AArch64 base instructions.

2.1 Copying Data

In the first task, we were provided with a C++ driver that calls both C and assembly functions.

The goal was to replicate the behavior of the two C functions in assembly. The first C function copies seven 32-bit integers from one array to another. The second performs the same operation for a variable number of integers, with the element count passed as the first argument. The corresponding assembly functions were already declared but not yet implemented.

For context, the two C functions are as follows:

#include <stdint.h>

#ifdef __cplusplus
extern "C" {
#endif

void copy_c_0( int32_t * __restrict a,
               int32_t * __restrict b ) {
  b[0] = a[0];
  b[1] = a[1];
  b[2] = a[2];
  b[3] = a[3];
  b[4] = a[4];
  b[5] = a[5];
  b[6] = a[6];
}

void copy_c_1( int64_t              n,
               int32_t * __restrict a,
               int32_t * __restrict b ) {
  for( uint64_t i = 0; i < n; ++i ) {
    b[i] = a[i];
  }
}

#ifdef __cplusplus
}
#endif

Task 2.1.1 & 2.1.2

For the first function, which copies seven 32-bit integers, we used ldr and str instructions: each element is loaded from the source array into a temporary register and stored to the destination array. The memory addresses are formed from the base pointers plus immediate offsets that grow by 4 bytes per element (each 32-bit integer occupies 4 bytes).

For the second function, which supports a variable number of elements, we implemented a loop using two registers:

  • one register to track the number of copied elements

  • one register to maintain the current byte offset

Each iteration of the loop performs the following steps:

  1. load a 32-bit value from the source array at the current byte offset,

  2. store the value at the same offset in the destination array,

  3. increment the element counter by 1 and the byte offset by 4,

  4. compare the element counter against the target count using cmp,

  5. if the count has not been reached, branch back to the top of the loop; otherwise, return from the function.

Note that the exit condition is tested at the bottom of the loop, so the implementation assumes n ≥ 1: for n == 0, the C version copies nothing, while this loop body executes once.

The implementation is as follows:

    .text
    .type copy_asm_0, %function
    .global copy_asm_0
    /*
    * Copies an array of 7 32-bit integers from one
    * location to another.
    *
    * @param x0: address of the source array.
    * @param x1: address of the destination array.
    */
copy_asm_0:
    // save frame pointer and link register
    stp x29, x30, [sp, #-16]!
    // update frame pointer to current stack pointer
    mov x29, sp

    // b[0] = a[0]
    ldr w2, [x0]
    str w2, [x1]
    // b[1] = a[1]
    ldr w2, [x0, #4]
    str w2, [x1, #4]
    // b[2] = a[2]
    ldr w2, [x0, #8]
    str w2, [x1, #8]
    // b[3] = a[3]
    ldr w2, [x0, #12]
    str w2, [x1, #12]
    // b[4] = a[4]
    ldr w2, [x0, #16]
    str w2, [x1, #16]
    // b[5] = a[5]
    ldr w2, [x0, #20]
    str w2, [x1, #20]
    // b[6] = a[6]
    ldr w2, [x0, #24]
    str w2, [x1, #24]

    // restore frame pointer and link register
    ldp x29, x30, [sp], #16

    ret


    .text
    .type copy_asm_1, %function
    .global copy_asm_1
    /*
    * Copies an array of n 32-bit integers from one
    * location to another.
    *
    * @param x0: number of elements to copy.
    * @param x1: address of the source array.
    * @param x2: address of the destination array.
    */
copy_asm_1:
    // save frame pointer and link register
    stp x29, x30, [sp, #-16]!
    // update frame pointer to current stack pointer
    mov x29, sp
    // number of elements copied
    mov x3, #0
    // byte offset for array
    mov x4, #0
loop:
    // b[i] = a[i]
    ldr w5, [x1, x4]
    str w5, [x2, x4]
    // increment the number of elements copied
    add x3, x3, #1
    // increment the byte offset
    add x4, x4, #4
    // check if we have copied n elements
    cmp x3, x0
    // if not, loop again
    blt loop
    // restore frame pointer and link register
    ldp x29, x30, [sp], #16

    ret

After compiling and running the driver, the output confirms that both assembly implementations behave identically to their C function counterparts:

$ ./copy_driver
copy_c_0: copy succeeded
copy_c_1: copy succeeded
copy_asm_0: copy succeeded
copy_asm_1: copy succeeded
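
For reference, the verification in such a driver essentially amounts to declaring the assembly functions as extern "C", calling each variant, and comparing the destination array against the source. A minimal sketch for copy_asm_0 follows (our own reconstruction; the provided driver differs in its details):

#include <cstdint>
#include <cstring>
#include <iostream>

// the assembly implementation under test
extern "C" void copy_asm_0( int32_t * a, int32_t * b );

int main() {
  int32_t l_src[7] = { 1, 2, 3, 4, 5, 6, 7 };
  int32_t l_dst[7] = { 0 };

  copy_asm_0( l_src, l_dst );

  // the copy succeeded if destination and source match
  bool l_ok = std::memcmp( l_src, l_dst, sizeof( l_src ) ) == 0;
  std::cout << "copy_asm_0: copy " << ( l_ok ? "succeeded" : "failed" ) << std::endl;

  return l_ok ? 0 : 1;
}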

2.2 Instruction Throughput and Latency

In this task, we wrote a micro-benchmark to measure the throughput and the latency of two instructions:

  • ADD (shifted register)

  • MUL (multiply)

2.2.1 Throughput

To measure instruction throughput, we developed one assembly function per instruction. Each contains a loop in which every instruction is independent of the previous ones, allowing the processor to execute them in parallel across its execution ports. We use 25 independent instructions per iteration to keep all ports busy and to amortize the loop overhead (the subs and branch); x18 is skipped because it is reserved as the platform register. The ADD loop is shown first, followed by the analogous MUL loop:

loop:
    add x1, x27, x28
    add x2, x27, x28
    add x3, x27, x28
    add x4, x27, x28
    add x5, x27, x28

    add x6, x27, x28
    add x7, x27, x28
    add x8, x27, x28
    add x9, x27, x28
    add x10, x27, x28

    add x11, x27, x28
    add x12, x27, x28
    add x13, x27, x28
    add x14, x27, x28
    add x15, x27, x28

    add x16, x27, x28
    add x17, x27, x28
    add x19, x27, x28
    add x20, x27, x28
    add x21, x27, x28

    add x22, x27, x28
    add x23, x27, x28
    add x24, x27, x28
    add x25, x27, x28
    add x26, x27, x28

    subs x0, x0, #1          // decrement loop counter
    b.gt loop                // branch if greater than zero

loop:
    mul x1, x27, x28
    mul x2, x27, x28
    mul x3, x27, x28
    mul x4, x27, x28
    mul x5, x27, x28

    mul x6, x27, x28
    mul x7, x27, x28
    mul x8, x27, x28
    mul x9, x27, x28
    mul x10, x27, x28

    mul x11, x27, x28
    mul x12, x27, x28
    mul x13, x27, x28
    mul x14, x27, x28
    mul x15, x27, x28

    mul x16, x27, x28
    mul x17, x27, x28
    mul x19, x27, x28
    mul x20, x27, x28
    mul x21, x27, x28

    mul x22, x27, x28
    mul x23, x27, x28
    mul x24, x27, x28
    mul x25, x27, x28
    mul x26, x27, x28

    subs x0, x0, #1          // decrement loop counter
    b.gt loop                // branch if greater than zero

2.2.2 Latency

To measure the latency of the two instructions, we implemented two additional assembly functions. In contrast to the throughput benchmarks, where instructions were independent, the idea here was to create data dependencies between instructions.

Each instruction in the loop depends on the result of the previous one, forcing the processor to wait for the output of one instruction before executing the next. Again, the ADD loop is shown first, followed by the MUL loop:

loop:
    // repeat the following block 5 times
    .rept 5
    add x1, x1, x2
    add x1, x1, x2
    add x1, x1, x2
    add x1, x1, x2
    add x1, x1, x2
    .endr

    subs x0, x0, #1          // decrement loop counter
    b.gt loop                // branch if greater than zero

loop:
    // repeat the following block 5 times
    .rept 5
    mul x1, x1, x2
    mul x1, x1, x2
    mul x1, x1, x2
    mul x1, x1, x2
    mul x1, x1, x2
    .endr

    subs x0, x0, #1          // decrement loop counter
    b.gt loop                // branch if greater than zero

2.2.3 Results

To test our assembly functions, we implemented a benchmark in C++ that:

  1. calls each function multiple times,

  2. measures the total execution time,

  3. calculates the achieved GOPS (Giga Operations Per Second).

// time measurement for add_instr in microbench.cpp
auto l_start_time = std::chrono::high_resolution_clock::now();
add_instr( n );
auto l_end_time = std::chrono::high_resolution_clock::now();
double elapsedTime = std::chrono::duration_cast<std::chrono::microseconds>(l_end_time - l_start_time).count() / 1e6;

// GOPS calculation for add_instr
double totalOps = n * 25;  // 25 independent instructions per loop iteration
double opsPerSec = totalOps / elapsedTime;
double gops = opsPerSec / 1e9;
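
The kernels are made callable from C++ via extern "C" declarations. The following condensed sketch shows the surrounding harness; the bench helper, the iteration count, and the output format are our own illustration, and only the kernel signatures follow the snippets above:

#include <chrono>
#include <cstdint>
#include <iostream>

// assembly kernels: each runs its 25-instruction loop body n times
extern "C" void add_instr( uint64_t n );
extern "C" void mul_instr( uint64_t n );

// times one kernel and prints its estimated GOPS (illustrative helper)
void bench( char const * i_name, void (* i_kernel)( uint64_t ), uint64_t i_n ) {
  auto l_start_time = std::chrono::high_resolution_clock::now();
  i_kernel( i_n );
  auto l_end_time = std::chrono::high_resolution_clock::now();

  double elapsedTime = std::chrono::duration_cast<std::chrono::microseconds>(l_end_time - l_start_time).count() / 1e6;
  double gops = ( i_n * 25.0 ) / elapsedTime / 1e9;  // 25 instructions per loop iteration

  std::cout << i_name << ": " << gops << " GigaOps/sec" << std::endl;
}

int main() {
  uint64_t l_n = 1500000000;  // iteration count (illustrative value)
  bench( "ADD throughput", add_instr, l_n );
  bench( "MUL throughput", mul_instr, l_n );
  return 0;
}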

To compile and execute the benchmark, we ran:

$ g++ microbench.cpp add_instr.s add_lat_instr.s mul_instr.s mul_lat_instr.s -o microbench.o
$ ./microbench.o

We obtained the following results:

Benchmarking results (GOPS)
Benchmarking ADD throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s):   1.29126
Instructions per Second:   2.90414e+10
Estimated GOPS:   29.0414 GigaOps/sec
-----------------------------------------------

Benchmarking MUL throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s):   2.82838
Instructions per Second:   1.32584e+10
Estimated GOPS:   13.2584 GigaOps/sec
-----------------------------------------------

Benchmarking ADD latency ...
-----------------------------------------------
Measuring latency for Instruction
Total time (s):   0.856261
Instructions per Second:   4.37951e+09
Estimated GOPS:   4.37951 GigaOps/sec
-----------------------------------------------

Benchmarking MUL latency ...
-----------------------------------------------
Measuring latency for Instruction
Total time (s):   2.5642
Instructions per Second:   1.46244e+09
Estimated GOPS:   1.46244 GigaOps/sec
-----------------------------------------------

To interpret these results, we first looked up the clock speed of the Apple M4 chip, which is about 4.4 GHz.

Further, we needed information about the M4 microarchitecture: the Apple M4 has 8 ALUs, 3 of which can execute MUL instructions.
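
At the nominal 4.4 GHz clock, these unit counts imply theoretical peak throughputs of:

\[8 \times 4.4 \text{ GHz} = 35.2 \text{ GOPS for ADD}\]
\[3 \times 4.4 \text{ GHz} = 13.2 \text{ GOPS for MUL}\]

Our MUL measurement sits essentially at this peak, while the ADD measurement reaches about 82% of it.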

Conversely, dividing the measured throughput by the number of capable execution units yields an implied clock speed:

\[\frac{29.0414 \text{ GOPS}}{8} = 3.63 \text{ GHz}\]
\[\frac{13.2584 \text{ GOPS}}{3} = 4.42 \text{ GHz}\]

For the ADD instruction, the implied clock is below the specified 4.4 GHz; part of the gap is plausibly loop overhead, since the subs and branch in each iteration also occupy issue slots but are not counted as operations. For the MUL instruction, our result closely aligns with the specified clock speed.

For the latency, we can make a similar calculation:

\[\frac{4.4 \text{ GHz}}{4.37951 \text{ GOPS}} \approx 1 \text{ cycle per instruction}\]
\[\frac{4.4 \text{ GHz}}{1.46244 \text{ GOPS}} \approx 3 \text{ cycles per instruction}\]

The ADD latency matches the theoretical value of one cycle per instruction. The MUL latency is slightly higher than expected (3 vs. 2 cycles).