2. Base
After exploring the fundamentals of assembly, we moved on to applying that knowledge by implementing small programs using only AArch64 base instructions.
2.1 Copying Data
In the first task, we were provided with a C++ driver that calls both C and assembly functions.
The goal was to replicate the behavior of the two C functions using assembly. The first C function copies seven 32-bit integers from one array to another. The second C function performs the same operation for a variable number of integers, with the number of elements passed as the first argument. The corresponding assembly functions were pre-defined but not yet implemented.
For context, the two C functions are as follows:
#include <stdint.h>

#ifdef __cplusplus
extern "C" {
#endif

void copy_c_0( int32_t * __restrict a,
               int32_t * __restrict b ) {
  b[0] = a[0];
  b[1] = a[1];
  b[2] = a[2];
  b[3] = a[3];
  b[4] = a[4];
  b[5] = a[5];
  b[6] = a[6];
}

void copy_c_1( int64_t n,
               int32_t * __restrict a,
               int32_t * __restrict b ) {
  for( uint64_t i = 0; i < n; ++i ) {
    b[i] = a[i];
  }
}

#ifdef __cplusplus
}
#endif
Task 2.1.1 & 2.1.2
For the first function, which copies seven 32-bit integers, we used ldr and str instructions to load each element from the source array and store it to the destination array.
The memory addresses were accessed using immediate offsets that grow by 4 bytes per element, since each 32-bit integer occupies 4 bytes.
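In C++ terms, this addressing pattern corresponds to explicit byte arithmetic; the following sketch (ours, not part of the provided driver) spells out the same access pattern:

```cpp
#include <cstdint>
#include <cstring>

// Same access pattern as the assembly, with explicit byte offsets:
// the immediate #off advances by 4 for each successive element.
void copy7_byte_offsets(const int32_t* a, int32_t* b) {
    const char* src = reinterpret_cast<const char*>(a);
    char* dst = reinterpret_cast<char*>(b);
    for (int off = 0; off <= 24; off += 4) {
        int32_t v;
        std::memcpy(&v, src + off, 4);  // ldr w2, [x0, #off]
        std::memcpy(dst + off, &v, 4);  // str w2, [x1, #off]
    }
}
```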
For the second function, which supports a variable number of elements, we implemented a loop using two registers:
one register to track the number of copied elements,
one register to maintain the current byte offset.
Our loop performs the following steps:
1. Load a 32-bit value from the source array at the current byte offset.
2. Store the value at the same offset in the destination array.
3. Increment both the element counter and the byte offset by 1 and 4, respectively.
4. Use the cmp instruction to check whether the target count has been reached.
5. If not, branch back to the top of the loop. Otherwise, return from the function.
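The steps above can be sketched in C++ (our illustration, not the driver code): "copied" plays the role of the element-counter register and "offset" the byte-offset register.

```cpp
#include <cstdint>
#include <cstring>

// Loop sketch mirroring the register roles described above.
void copy_n_sketch(int64_t n, const int32_t* a, int32_t* b) {
    if (n <= 0) return;      // nothing to copy
    int64_t copied = 0;      // number of elements copied
    int64_t offset = 0;      // byte offset into both arrays
    do {
        int32_t v;
        std::memcpy(&v, reinterpret_cast<const char*>(a) + offset, 4);
        std::memcpy(reinterpret_cast<char*>(b) + offset, &v, 4);
        copied += 1;         // one more element copied
        offset += 4;         // 4 bytes per 32-bit integer
    } while (copied < n);    // cmp + branch back while elements remain
}
```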
The implementation is as follows:
    .text
    .type copy_asm_0, %function
    .global copy_asm_0
/*
 * Copies an array of 7 32-bit integers from one
 * location to another.
 *
 * @param x0: address of the source array.
 * @param x1: address of the destination array.
 */
copy_asm_0:
    // save frame pointer and link register
    stp x29, x30, [sp, #-16]!
    // update frame pointer to current stack pointer
    mov x29, sp

    // b[0] = a[0]
    ldr w2, [x0]
    str w2, [x1]
    // b[1] = a[1]
    ldr w2, [x0, #4]
    str w2, [x1, #4]
    // b[2] = a[2]
    ldr w2, [x0, #8]
    str w2, [x1, #8]
    // b[3] = a[3]
    ldr w2, [x0, #12]
    str w2, [x1, #12]
    // b[4] = a[4]
    ldr w2, [x0, #16]
    str w2, [x1, #16]
    // b[5] = a[5]
    ldr w2, [x0, #20]
    str w2, [x1, #20]
    // b[6] = a[6]
    ldr w2, [x0, #24]
    str w2, [x1, #24]

    // restore frame pointer and link register
    ldp x29, x30, [sp], #16

    ret


    .text
    .type copy_asm_1, %function
    .global copy_asm_1
/*
 * Copies an array of n 32-bit integers from one
 * location to another.
 *
 * @param x0: number of elements to copy.
 * @param x1: address of the source array.
 * @param x2: address of the destination array.
 */
copy_asm_1:
    // save frame pointer and link register
    stp x29, x30, [sp, #-16]!
    // update frame pointer to current stack pointer
    mov x29, sp
    // nothing to copy if n <= 0
    cmp x0, #0
    b.le end
    // number of elements copied
    mov x3, #0
    // byte offset for array
    mov x4, #0
loop:
    // b[i] = a[i]
    ldr w5, [x1, x4]
    str w5, [x2, x4]
    // increment the number of elements copied
    add x3, x3, #1
    // increment the byte offset
    add x4, x4, #4
    // check if we have copied n elements
    cmp x3, x0
    // if not, loop again
    blt loop
end:
    // restore frame pointer and link register
    ldp x29, x30, [sp], #16

    ret
After compiling and running the driver, the output confirms that both assembly implementations behave identically to their C function counterparts:
$ ./copy_driver
copy_c_0: copy succeeded
copy_c_1: copy succeeded
copy_asm_0: copy succeeded
copy_asm_1: copy succeeded
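The driver's check boils down to copying a known source array and comparing the destination byte-for-byte; a minimal sketch of that shape (with a hypothetical stand-in routine, since the real driver links against copy_c_* and copy_asm_*) is:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for one copy routine.
static void copy7(const int32_t* a, int32_t* b) {
    for (int i = 0; i < 7; ++i) b[i] = a[i];
}

// Fill a source array, run the copy, and compare the result,
// as the driver does for each implementation.
bool copy7_succeeded() {
    int32_t src[7] = {1, 2, 3, 4, 5, 6, 7};
    int32_t dst[7] = {0};
    copy7(src, dst);
    return std::memcmp(src, dst, sizeof src) == 0;
}
```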
2.2 Instruction Throughput and Latency
In this task, we wrote a micro-benchmarking script to measure the throughput and the latency of two instructions: ADD (shifted register) and MUL.
2.2.1 Throughput
To measure the throughput of the instructions, we developed an assembly function for each instruction. The idea was to construct a loop in which every instruction is independent of the previous ones; with no data dependencies between them, the processor can execute the instructions in parallel.
loop:
    add x1, x27, x28
    add x2, x27, x28
    add x3, x27, x28
    add x4, x27, x28
    add x5, x27, x28

    add x6, x27, x28
    add x7, x27, x28
    add x8, x27, x28
    add x9, x27, x28
    add x10, x27, x28

    add x11, x27, x28
    add x12, x27, x28
    add x13, x27, x28
    add x14, x27, x28
    add x15, x27, x28

    add x16, x27, x28
    add x17, x27, x28
    add x19, x27, x28
    add x20, x27, x28
    add x21, x27, x28

    add x22, x27, x28
    add x23, x27, x28
    add x24, x27, x28
    add x25, x27, x28
    add x26, x27, x28

    subs x0, x0, #1 // decrement loop counter
    b.gt loop       // branch if greater than zero
loop:
    mul x1, x27, x28
    mul x2, x27, x28
    mul x3, x27, x28
    mul x4, x27, x28
    mul x5, x27, x28

    mul x6, x27, x28
    mul x7, x27, x28
    mul x8, x27, x28
    mul x9, x27, x28
    mul x10, x27, x28

    mul x11, x27, x28
    mul x12, x27, x28
    mul x13, x27, x28
    mul x14, x27, x28
    mul x15, x27, x28

    mul x16, x27, x28
    mul x17, x27, x28
    mul x19, x27, x28
    mul x20, x27, x28
    mul x21, x27, x28

    mul x22, x27, x28
    mul x23, x27, x28
    mul x24, x27, x28
    mul x25, x27, x28
    mul x26, x27, x28

    subs x0, x0, #1 // decrement loop counter
    b.gt loop       // branch if greater than zero
2.2.2 Latency
To measure the latency of the two instructions, we implemented two additional assembly functions. In contrast to the throughput benchmarks, where instructions were independent, the idea here was to create data dependencies between instructions.
Each instruction in the loop depends on the result of the previous one, forcing the processor to wait for the output of one instruction before executing the next:
loop:
    // repeat the following block 5 times
    .rept 5
    add x1, x1, x2
    add x1, x1, x2
    add x1, x1, x2
    add x1, x1, x2
    add x1, x1, x2
    .endr

    subs x0, x0, #1 // decrement loop counter
    b.gt loop       // branch if greater than zero
loop:
    // repeat the following block 5 times
    .rept 5
    mul x1, x1, x2
    mul x1, x1, x2
    mul x1, x1, x2
    mul x1, x1, x2
    mul x1, x1, x2
    .endr

    subs x0, x0, #1 // decrement loop counter
    b.gt loop       // branch if greater than zero
2.2.3 Results
To test our assembly functions, we implemented a benchmark in C++ that:
calls each function multiple times,
measures the total execution time,
calculates the achieved GOPS (giga-operations per second).
auto l_start_time = std::chrono::high_resolution_clock::now();
add_instr( n );
auto l_end_time = std::chrono::high_resolution_clock::now();
double elapsedTime = std::chrono::duration_cast<std::chrono::microseconds>( l_end_time - l_start_time ).count() / 1e6;
double totalOps = n * 25.0; // 25 instructions per loop iteration
double opsPerSec = totalOps / elapsedTime;
double gops = opsPerSec / 1e9;
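Putting the pieces together, the measurement around each kernel call has roughly the following shape (a sketch with our own names, and a placeholder C++ loop standing in for the assembly kernels):

```cpp
#include <chrono>
#include <cstdint>

// Placeholder kernel: stands in for add_instr / mul_instr, which
// execute n loop iterations of 25 instructions each.
static volatile uint64_t g_sink;
static void kernel_stub(uint64_t n) {
    uint64_t acc = 0;
    for (uint64_t i = 0; i < n; ++i) acc += i;
    g_sink = acc;  // keep the loop from being optimized away
}

// Same scheme as the snippet above: time the kernel, then convert
// n * 25 operations into GOPS.
double measure_gops(uint64_t n) {
    auto start = std::chrono::high_resolution_clock::now();
    kernel_stub(n);
    auto end = std::chrono::high_resolution_clock::now();
    double elapsed =
        std::chrono::duration_cast<std::chrono::microseconds>(end - start)
            .count() / 1e6;
    double totalOps = static_cast<double>(n) * 25;  // 25 ops per iteration
    return (totalOps / elapsed) / 1e9;              // ops/s -> GOPS
}
```

Note that n must be large enough for the elapsed time to be well above the microsecond resolution of the timer.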
To compile and execute the benchmark, we ran:
$ g++ microbench.cpp add_instr.s add_lat_instr.s mul_instr.s mul_lat_instr.s -o microbench.o
$ ./microbench.o
We obtained the following results:
Benchmarking ADD throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s): 1.29126
Instructions per Second: 2.90414e+10
Estimated GOPS: 29.0414 GigaOps/sec
-----------------------------------------------
Benchmarking MUL throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s): 2.82838
Instructions per Second: 1.32584e+10
Estimated GOPS: 13.2584 GigaOps/sec
-----------------------------------------------
Benchmarking ADD latency ...
-----------------------------------------------
Measuring latency for Instruction
Total time (s): 0.856261
Instructions per Second: 4.37951e+09
Estimated GOPS: 4.37951 GigaOps/sec
-----------------------------------------------
Benchmarking MUL latency ...
-----------------------------------------------
Measuring latency for Instruction
Total time (s): 2.5642
Instructions per Second: 1.46244e+09
Estimated GOPS: 1.46244 GigaOps/sec
-----------------------------------------------
To interpret these results, we first looked up the clock speed of the Apple M4 chip, which is about 4.4 GHz. We also needed information about the M4 microarchitecture: the Apple M4 has 8 ALUs, 3 of which can execute MUL instructions.
From these numbers and our measured throughput, we can estimate the clock speed per ALU:
ADD: 29.0414 GOPS / 8 ALUs ≈ 3.63 GHz
MUL: 13.2584 GOPS / 3 ALUs ≈ 4.42 GHz
For the ADD instruction, we are slightly below the specified clock speed of 4.4 GHz. For the MUL instruction, on the other hand, our result closely aligns with the given clock speed.
For the latency, we can make a similar calculation, dividing the clock speed by the measured rate of the dependent instruction chain:
ADD: 4.4 GHz / 4.37951 GOPS ≈ 1.0 cycles
MUL: 4.4 GHz / 1.46244 GOPS ≈ 3.0 cycles
The ADD latency matches the theoretical value of 1 cycle. The MUL latency is slightly higher than expected (3 vs. 2 cycles).
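These back-of-the-envelope estimates can be reproduced with two small helpers (the 4.4 GHz clock and the ALU counts are the assumptions stated above, not measured values):

```cpp
// Per-ALU clock estimate implied by a measured throughput.
double per_alu_ghz(double gops, int alus) {
    return gops / alus;
}

// Cycle count implied by a latency-bound throughput at a given clock speed.
double latency_cycles(double clock_ghz, double gops) {
    return clock_ghz / gops;
}
```

With our measurements, per_alu_ghz(29.0414, 8) gives about 3.63 GHz, per_alu_ghz(13.2584, 3) about 4.42 GHz, and latency_cycles(4.4, 1.46244) about 3.0 cycles.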