Competitive Programming on Apple Silicon

Posted: 2-3-2025

#arm #macos #assembly #competitive programming

For some reason on a random night I decided to solve some competitive programming problems in assembly. I picked out an easy and a medium problem that I had already solved in Python and got to work.

The Easy Problem

I’ve programmed in Assembly before for a Computer Architecture class but never set it up to work on my Mac. I used this github to help me setup and walk through the build process of ARM on Apple Silicon.

I copied the hello world program to start off with.

.global _main
.align 4

; Setup the parameters to print hello world and then call the Kernel to do it.
_main:
  mov x0, #1
  adr x1, helloworld  ; string to print
  mov x2, #13         ; length of our string
  mov x16, #4         ; Unix write system call
  svc #0x80           ; Call kernel to output the string

  ; exit program
  mov x0, #0    ; Use 0 return code
  mov x16, #1   ; System call number 1 terminates this program
  svc #0x80     ; Call kernel to terminate the program

helloworld:      .ascii  "Hello World!\n"

This worked super well and I was on my way to solve the easy problem.

AArch64 struggles

The premise for the easy problem was that you are given a string and you need to print out the first character of the string. I imported the _getchar method from C using .extern _getchar to make it easier.

When you call _getchar the result gets stored in the x0/w0 register. We need to move the result to a buffer for it to be persistent in our program. To do this I just used a buffer of size 1.

buffer: .space 1

Ok seems simple enough, right? I wrote this code to test and I thought it would work.

…but I get a zsh: bus error. Meaning that a process tried to access memory that the CPU cannot physically address - i.e. an invalid address.

To fix this I need to specify the .data and .text sections.

.global _main
.align 4
.extern _getchar

_main: 
  ; reads a single character
  bl _getchar ; Result will be in x0

  adr x1, buffer ; store the address of the buffer in x1
  strb w0, [x1] ; store the char in the buffer location

  ; excluded to save space
  print: ...
  exit: ...

buffer: .space 1 ; We only need 1 byte for a single char

Ok, easy enough, I do that and... now I get a assemble error 🫤

as -arch arm64 -g -o lubbi.o lubbi.s
lubbi.s:11:3: error: unknown AArch64 fixup kind!
  adr x1, buffer
  ^
make: *** [assemble] Error 1

This problem took me a bit to figure out and be satisfied with the answer. It turns out on AArch64 platforms (My Mac) there is a pretty strict limitation on the adr instruction. If adr references a symbol that’s not in a nearby memory location, then there is no built-in way to adjust the instruction properly during linking.

The symbol that adr points to has to be within 1 MB of the instruction. This is super restrictive and is not preferred. The solution is to use the instruction adrp. This instruction allows us to reference a symbol within a relative space of 4 GB (it is unlikely to exceed 4 GB) — [1] [2] [3].

Implementing that we get the following:

.text
.global _main
.align 4
.extern _getchar

_main: 
  ; reads a single character
  bl _getchar ; Result will be in x0

  adrp x1, buffer@PAGE
  add x1, x1, buffer@PAGEOFF
  strb w0, [x1] ; store the char in the buffer location

  ; excluded to save space
  print: ...
  exit: ... 
  
.data
buffer: .space 1 ; We only need 1 byte for a single char

adrp stands for Address Relative with Page Offset. This instruction is used to load the absolute page address of buffer into x1. It computes the address by loading the upper bits of the address, effectively discarding the lower 12 bits. The reason it does this is for efficient memory addressing; absolute addresses require large immediate value which wouldn’t fit in a single instruction.

Using the add x1, x1, buffer@PAGEOFF we add the page offset to x1 to compute the full address of buffer. To clarify, buffer@PAGEOFF gives the offset into the page that the symbol is located in — [1] [2]

Nice! 👍 This code works and we learned something new about the quirks our CPU architecture has.

The Medium Problem

In the easy problem we solved all our issues with AArch64 and referencing our buffers. For the medium problem it was just knowing how to translate my logic from Python to ARM Assembly.

In the problem, you were given 2 strings. The first one is the text on a coffee mug and the second string is the transformations (horizontal/vertical flip or 180 degree rotation) the mug takes as it falls to the ground.

You are asked to find what the text will the look like to you right before it hits the ground. (Read about the problem here).

My solution in Python took a few attempts to realize the trick, computing the final state and transform the string once or twice.

h = {'b': 'd', 'q': 'p', 'd': 'b', 'p': 'q'}
v = {'b': 'p', 'q': 'd', 'd': 'q', 'p': 'b'}

x = [c for c in input()]
f = input()

# flip the string based on the mapping
def flip(s, mapping, flip=False):
    for i in range(len(s)):
        s[i] = mapping[s[i]]
    if flip:
        s.reverse()

horizontal, vertical = False, False
for c in f:
    if c == 'h':
        horizontal = not horizontal
    elif c == 'v':
        vertical = not vertical
    elif c == 'r':
        horizontal = not horizontal
        vertical = not vertical

if horizontal:
    flip(x, h, True)
if vertical:
    flip(x, v)

print(''.join(x))

Solution

I’m not going to explain my assembly solution line by line, just some cool parts I learned.

ARM did not have a logical negation instruction. I had to use this workaround which is a bitwise XOR instruction.

eor x1, x1, #1

Our assembly code is basically doing x1 = x1 ^ 1 which negates it because of how XOR works.

The other piece of code I thought was very interesting was reversing the string. This is called when we need to apply the horizontal translation.

reverse_string_and_horiz:
    ; start and end indexes
    mov x11, #0
    mov x12, x19 ; x19 stores the length of the s string buffer (first string)
    sub x12, x12, #1 ; fix off by one error

    ; load s string address into x1
    adrp x1, s_buffer@PAGE
    add x1, x1, s_buffer@PAGEOFF

    reverse_loop:
        ldrb w24, [x1, x11] ; char at start of s
        ldrb w25, [x1, x12] ; char at end of s

        ; swap
        strb w24, [x1, x12]
        strb w25, [x1, x11]

        ; move pointers
        add x11, x11, #1
        sub x12, x12, #1

        ; check loop condition
        cmp x11, x12
        b.lt reverse_loop ; branch less than

    b apply_horizontal ; continue to apply the horizontal mapping

Conclusion

I’m glad that I went on this journey looking into how to do things at the lowest level. It always makes you think about how much we take for granted in programming and how much is abstracted away.

The full solution and my Makefiles can all be found at my Github repository: kattis-asm-solutions

👋 Thanks for reading!

If you enjoyed this post, then consider sharing it and/or following me on Github.