by Carl Burch, Hendrix College, October 2011
In this document, we study assembly language, the system for expressing the individual instructions that a computer should perform.
We are actually concerned with two types of languages, assembly languages and machine languages.
A machine language encodes instructions as sequences of 0's and 1's; this binary encoding is what the computer's processor is built to execute. Writing programs using this encoding is unwieldy for human programmers, though. Thus, when programmers want to dictate the precise instructions that the computer is to perform, they use an assembly language, which allows instructions to be written in textual form. An assembler translates a file containing assembly language code into the corresponding machine language.
Let's look at a simple example for ARM's design. Here is a machine language instruction:
1110 0001 1010 0000 0011 0000 0000 1001
When the processor is told to execute that binary sequence, it
copies the value from register 9
into register 3.
But as a programmer, you'd hardly want to read a long binary
sequence and make sense of it. Instead, a programmer would
prefer programming in assembly language, where we would express
this using the following line.
MOV R3, R9
Then the programmer would use an assembler to translate this into the binary encoding that the computer actually executes.
But there is not just one machine language: Each processor design employs its own machine language, designed with an eye to provide a powerful set of fast instructions while allowing a relatively simple circuit to be built. Often processors are designed to be compatible with a previous processor, so it follows the same machine language design. For example, Intel's line of processors (including 80386, Pentium, and Core i7) support similar machine languages. But ARM processors support an entirely different machine language. The design of the machine language encoding is call the instruction set architecture (ISA).
And for each machine language, there must be a different assembly language, since the assembly language must correspond to an entirely different set of machine language instructions.
Of the many ISAs (instruction set architectures), x86 is handily the most widely recognized. It was first designed by Intel in 1974 in for an 8-bit processor (the Intel 8080), and over the years it was extended to 16-bit form (1978, Intel 8086), then to 32-bit form (1985, Intel 80386), and then to 64-bit form (2003, AMD Opteron). Today, processors supporting IA32 are now manufactured by Intel, AMD, and VIA, and they can be found in most personal computers.
Another well-known ISA today is the PowerPC. Apple's Macintosh computers used these processors until 2006, when Apple switched their computers to the x86 line of processors. But PowerPC remains in common use for applications such as automobiles and gaming consoles (including the Wii, Playstation 3, and XBox 360).
But the ISA that we'll study comes from a company called ARM. (Like other successful ISAs, ARM's ISA has grown over the years. We'll examine version 4T.) Processors supporting ARM's ISA are distributed quite widely, usually for low-power devices such as cellphones, digital music players, and handheld game systems. The iPhone, Blackberry, and Nintendo DS are all prominent examples of devices that incorporate an ARM processor.
There are several reasons for examining ARM's ISA rather than IA32.
Assembly language programming is rarely used for more powerful computing systems, since it's far easier to program in a high-level programming language. But for small devices, assembly language programming remains important: Due to power and price constraints, the devices have very few resources, and developers can use assembly language to use these resources as efficiently as possible.
The multiple extensions to the IA32 architecture lead it to be far too complicated for us to really understand thoroughly.
IA32 dates from the 1970's, which was a completely different era in computing. ARM is more representative of more modern ISA designs.
We'll now turn to examining ARM's ISA.
Let's start our introduction using a simple example. Imagine that we want to add the numbers from 1 to 10. We might do this in C as follows.
int total;
int i;
total = 0;
for (i = 10; i > 0; i--) {
total += i;
}
The following translates this into the instructions supported by ARM's ISA.
MOV R0, #0 ; R0 accumulates total
MOV R1, #10 ; R1 counts from 10 down to 1
again ADD R0, R0, R1
SUBS R1, R1, #1
BNE again
halt B halt ; infinite loop to stop computation
You'll notice the mentions of R0 and R1 in the assembly language program. These are references to registers, which are places in a processor for storing data during computation. The ARM processor includes 16 easily accessible registers, numbered R0 through R15. Each stores a single 32-bit number. Note that though registers store data, they are very separate from the notion of memory: Memory is typically much larger (kilobytes or often gigabytes), and so it typically exists outside of the processor. Because of memory's size, accessing memory takes more time than accessing registers — typically about 10 times as long. Thus, assembly language programming tends to focus on using registers when possible.
Because each line of an assembly language program corresponds
directly to machine language, the lines are highly restricted in their
format. You can see that each line consists of two parts:
First is the opcode such as MOV that is an abbreviation
indicating the type of operation; and after it comes
arguments such as
.
Each opcode has strict requirements on the allowed arguments: For
example, a R0, #0MOV instruction must have exactly two
arguments, the first must identify a register, and the second
must provide either a register or a constant
(prefixed by a '#').
A constant placed directly in an
instruction is called an immediate, since it is
immediately available to the processor when reading the
instruction.
In the above assembly language program, we first use the
MOV instruction to initialize R0 at 0 and R1 at 10.
The ADD instruction computes the sum of R0 and R1 (the
second and third arguments) and places the result into R0 (the first
argument); this corresponds to the total += i; line
of the equivalent C program.
The subsequent SUBS instruction decreases R1 by 1.
To understand the next instruction, we need to understand that in
addition to the registers R0 through R15, the ARM processor also
incorporates a set of four flags,
labeled the zero flag (Z), the
negative flag (N), the carry flag (C), and the overflow flag (V).
Whenever an arithmetic instruction has an S at its end, as
SUBS does, these
flags will be updated based on the result of the computation.
In this case, if the result of decreasing R1 by 1 results in 0, the Z
flag will become 1; the N, C, and V flags are also updated, but they're
not useful to what follows.
The following instruction, BNE, will check the Z flag.
If the Z flag is not set (i.e., the previous subtraction gives a nonzero
result), then BNE arranges the processor so that the next
instruction executed is the ADD instruction, labeled
again; this leads to repeating the loop with a smaller
value of R1. If the Z flag is set, the processor will
simply continue on to the next instruction.
(BNE stands for Branch if Not Equal.
The name comes from imagining that we want to check whether two numbers
are equal. One way to do this using ARM's ISA would be to first
tell the processor to subtract the two numbers; if the difference is
zero, then the two numbers must be equal, and the zero flag will be 1.
them results in zero, which would set the zero flag.)
The final instruction, B, always branches back to
the named instruction. In this program, the instruction names itself,
effectively halting the program by putting the computer into a tight
infinite loop.
Let's look at another example. Here, suppose that we want to add the digits of a positive number; for example, given the number 1,024, we would want to compute 1 + 0 + 2 + 4, which is 7.
The obvious way to express this in C is as follows.
total = 0;
while (i > 0) {
total += i % 10;
i /= 10;
}
It's difficult to translate this into ARM's ISA, though, since the ARM lacks any instruction for dividing values. However, we can use a clever trick to perform this division using multiplication: If we take a number and multiply by 232 / 10, the upper 32 bits of the product tell us the result of dividing the original number by 10. This insight leads to the following alternative way of summing the digits in a number.
base = 0x1999999A;
total = 0;
while (i > 0) {
iDiv10 = (i * base) >> 32;
total += i - iDiv10 * 10;
i = iDiv10;
}
In translating this into assembly code, we have to confront two
issues. The more obvious is determining which instruction to use to
perform the multiplication. Here, we want to use the UMULL
instruction (Unsigned MULtiply Long), which
interprets two registers as unsigned 32-bit numbers,
and places the 64-bit product of the registers' values into two
different registers. The below example illustrates.
UMULL R4, R5, R0, R2 ; computes R0 * R2, placing lower 32 bits in R4, upper 32 in R5
The less obvious issue we have to confront is that of placing
0x1999999A into a register. You might be tempted at first to use
MOV, but this instruction has a major limitation:
Any immediate value
must be rotated by an even number of places to reach an eight-bit value.
For numbers between 0 and 255, this is not a problem; nor it is a
problem for 1,024, since 0x400 can be achieved by rotating 1 left 12
places. But there's no way to do this for 0x1999999A. The solution we'll
use is to load each byte separately, joining them using the
ORR instruction, which computes the bitwise OR of two
values.
MOV R0, #1024 ; R0 is input, decreases by factors of 10
MOV R1, #0 ; R1 is sum of digits
MOV R2, #0x19000000 ; R2 is constantly 0x1999999A
ORR R2, R2, #0x00990000
ORR R2, R2, #0x00009900
ORR R2, R2, #0x0000009A
MOV R3, #10 ; R3 is constantly 10
loop UMULL R4, R5, R0, R2 ; R5 is R0 / 10
UMULL R4, R6, R5, R3 ; R4 is now 10 * (R0 / 10)
SUB R4, R0, R4 ; R5 is now one's digit of R0
ADD R1, R1, R4 ; add it into R1
MOVS R0, R5
BNE loop
halt B halt
By the way, you may sometimes want to place a small negative number
like −10 into a register. You can't use MOV to
accomplish this, because its two's-complement representation is
0xFFFFFFF6, which can't be rotated into an 8-bit number. If it happens
that to know that some register holds the number 0, then you could use
SUB. But if it doesn't, then the MVN
(MoVe Not) instruction is useful: It places the
bitwise NOT of its argument into the destination register. So to get
−10 into R0, we can use
.MVN R0, #0x9
The ARM includes sixteen basic
arithmetic instructions,
numbered 0 through 15.
All sixteen are
listed below, with the functionality summarized by the relevant
C operator. (The number at the beginning of each line is used
in translating the instructions into machine language. There's
no reason for programmers to memorize this correspondence,
though: After all, this is why we have assemblers.)
Figure 1: ARM's basic arithmetic instructions
0. ANDregd, rega, argbregd ← rega & argb 1. EORregd, rega, argbregd ← rega ^ argb 2. SUBregd, rega, argbregd ← rega - argb 3. RSBregd, rega, argbregd ← argb - rega 4. ADDregd, rega, argbregd ← rega + argb 5. ADCregd, rega, argbregd ← rega + argb + carry 6. SBCregd, rega, argbregd ← rega - argb - !carry 7. RSCregd, rega, argbregd ← argb - rega - !carry 8. TSTregd, rega, argbset flags for rega & argb 9. TEQregd, rega, argbset flags for rega ^ argb 10. CMPregd, rega, argbset flags for rega - argb 11. CMNregd, rega, argbset flags for rega + argb 12. ORRregd, rega, argbregd ← rega | argb 13. MOVregd, argregd ← arg 14. BICregd, rega, argbregd ← rega & ~argb 15. MVNregd, argregd ← ~argb
Except for TST, TEQ, CMP, and
CMN, all instructions may have an S postfixed to
the opcode to signify that the operation should set the flags. For
TST, TEQ, CMP, and
CMN, the S is implicit: The instructions don't
change any general-purpose registers, so the only point in performing
the instruction is to set the flags.
We've also seen three other opcodes that aren't in the above
of basic arithmetic instructions:
UMULL is a non-basic
arithmetic instruction,
and B and BNE aren't arithmetic instructions.
Each ARM instruction may incorporate a condition
code specifying that the operation should take place only
when certain combinations of the flags hold. You can specify the
condition code by including it as part of the opcode.
It usually comes at the end of the opcode, but it precedes the
optional S on the basic arithmetic instructions.
The name for the condition codes is based on
the supposition that the flags were set based on a CMP or
SUBS instruction.
Figure 2: ARM's condition codes
0. EQequal Z 1. NEnot equal !Z 2. CSorHScarry set / unsigned higher or same C 3. CCorLOcarry clear / unsigned lower !C 4. MIminus / negative N 5. PLplus / positive or zero !N 6. VSoverflow set V 7. VCoverflow clear !V 8. HIunsigned higher C && !Z 9. LSunsigned lower or same !C || Z 10. GEsigned greater than or equal N == V 11. LTsigned less than N != V 12. GTsigned greater than !Z && (N == V) 13. LEsigned greater than or equal Z || (N != V) 14. ALor omittedalways true
The only instance of this condition code we have seen so far
is the BNE instruction: In this case, we have a B
instruction for branching, but the branch only takes place if
the Z flag is 0.
But ARM's ISA allows us to apply condition codes to other opcodes,
too. For example, ADDEQ says to perform an addition if
the Z flag is 1. One common scenario using condition codes on
non-branch instructions is in computing the greatest common divisor
of two numbers using Euclid's GCD algorithm.
a = 40;
b = 25;
while (a != b) {
if (a > b) a -= b;
else b -= a;
}
The traditional translation to assembly language would use condition codes only on branch instructions.
MOV R0, #40 ; R0 is a
MOV R1, #25 ; R1 is b
again CMP R0, R1
BEQ halt
BLT isLess
SUB R0, R0, R1
B again
isLess SUB R1, R1, R0
B again
halt B halt
However, the following is a much shorter and more efficient translation.
MOV R0, #40 ; R0 is a
MOV R1, #25 ; R1 is b
again CMP R0, R1
SUBGT R0, R0, R1
SUBLT R1, R1, R0
BNE again
halt B halt
This is more efficient for two reasons. More obviously,
the number of instructions executed per iteration is smaller
(four versus five). But the other reason comes from the fact
that modern processors pre-fetch
the following instruction
while executing the current instruction. However, branches disrupt this
process since the location of the next instruction can't be known
certainly. The second translation involves many fewer branch
instructions, so it will have fewer problems with pre-fetching
instructions.
We've seen how to build assembly programs that perform basic numerical computation. We'll now turn to examining how assembly programs can access memory.
The ARM supports memory access via two instructions,
LDR and STR. The LDR instruction
loads data out of memory,
and STR stores data into memory.
Each takes two arguments. The first argument is the data
register: For an LDR instruction, the loaded data is
placed into this register; for an STR
instruction, the data found in this register is stored into memory.
The second argument indicates the register that contains the
memory address being accessed; it will be written using the
register name enclosed in brackets.
(In Section 3.2, we will see that there are
other options for how this second argument can be written.)
For an example of how these instructions work, let's suppose we want a assembly program fragment that adds the integers in an array. We imagine that R0 holds the address of the first integer of the array, and R1 holds the number of integers in the array.
addInts MOV R4, #0
addLoop LDR R2, [R0]
ADD R4, R4, R2
ADD R0, R0, #4
SUBS R1, R1, #1
BNE addLoop
In this fragment, we use R4 to hold the sum of the integers so far.
In the LDR instruction, we look into R0 for a memory address
and load the data found at that address into R2. We then add this value
into R4. Then, we move R0 so that it contains the memory address of the
next integer in the array; we increase R0 by four because each integer
consumes four bytes of memory. Finally, we decrement R1, which is the
number of integers left to read from the array, and we repeat the
process if there are integers remaining.
Both LDR and STR load and store 32-bit values.
There are also instructions for working with 8-bit values, LDRB
and STRB; these are useful primarily for working with strings.
Below is an implementation of C's strcpy function; we imagine
that R0 holds the address of the first character of the destination
array, and that R1 holds the address of the first character of the
source string. We want to keep copying until we copy the terminating NUL
character (ASCII 0).
strcpy LDRB R2, [R1]
STRB R2, [R0]
ADD R0, R0, #1
ADD R1, R1, #1
TST R2, R2 ; repeat if R2 is nonzero
BNE strcpy
In the previous section's examples, we provided the address by enclosing a register's name in brackets. But the ARM allows several other ways of indicating the memory address, too. Each such technique is called an addressing mode; the technique of simply naming a register holding a memory address is one such addressing mode, called register addressing, but there are others.
One of these others is scaled register offset, where we include in the brackets a register, another register, and a shift value. To compute the memory address to access, the processor takes the first register, and adds to it the second register shifted according to the shift value. (Neither of the registers mentioned in brackets change values.) This addressing mode is useful when accessing an array where you know the array index. We can modify our earlier routine for adding the integers in an array to take advantage of this addressing mode.
addInts MOV R4, #0
addLoop SUBS R1, R1, #1
LDR R2, [R0, R1, LSL #2]
ADD R4, R4, R2
BNE addLoop
With each iteration of the loop, we first decrement our loop index R1. Then we retrieve the element at that entry of the array using a scaled register offset: We use R0 as our base, and we add to it R1 shifted left two places. We shift R1 left two places so that R1 is multiplied by four; after all, each integer in the array is four bytes long. After adding the loaded value into R4, which accumulates the total, we repeat the loop if R1 hasn't reached 0 yet.
Beyond using a different addressing mode, this version of the code is slightly different from our original implementation in three ways. First, it loads the numbers in the array in reverse order — that is, it loads the last number in the array first. Second, R0 remains unaltered in the course of the fragment. And finally, it will be somewhat faster since it has one less instruction per loop iteration.
Immediate post-indexed addressing is another addressing mode. To indicate this mode in assembly language, we follow the brackets with a comma and a positive or negative immediate. In executing the instruction, the processor still accesses the memory address found in the register, but after accessing the memory the address register is increased or decreased according to the immediate.
Our strcpy implementation is a useful example where immediate
post-indexed addressing is useful: After we store to R0, we
want R0 to increase by 1 for the following iteration;
and similarly, after we load from R1, we
want R1 to increase by 1. We can use immediate post-indexed
addressing to avoid the two ADD instructions of our earlier
version.
strcpy LDRB R2, [R1], #1
STRB R2, [R0], #1
TST R2, R2 ; repeat if R2 is nonzero
BNE strcpy
In total, the ARM processor supports ten addressing modes.
| [Rn, #±imm] | Immediate offset
Address accessed is imm more/less than the address found in Rn. Rn does not change. |
| [Rn] | Register
Address accessed is value found in Rn. This is just shorthand for [Rn, #0]. |
| [Rn, ±Rm, shift] | Scaled register offset
Address accessed is sum/difference of the value in Rn and the value in Rm shifted as specified. Rn and Rm do not change values. |
| [Rn, ±Rm] | Register offset
Address accessed is sum/difference of the value in Rn and the value in Rm. Rn and Rm do not change values. This is just shorthand for [Rn, ±Rm, LSL #0]. |
| [Rn, #±imm]! | Immediate pre-indexed
Address accessed is as with immediate offset mode, but Rn's value updates to become the address accessed. |
| [Rn, ±Rm, shift]! | Scaled register pre-indexed
Address accessed is as with scaled register offset mode, but Rn's value updates to become the address accessed. |
| [Rn, ±Rm]! | Register pre-indexed
Address accessed is as with register offset mode, but Rn's value updates to become the address accessed. |
| [Rn], #±imm | Immediate post-indexed
Address accessed is value found in Rn, and then Rn's value is increased/decreased by imm. |
| [Rn], ±Rm, shift | Scaled register post-indexed
Address accessed is value found in Rn, and then Rn's value is increased/decreased by Rm shifted according to shift. |
| [Rn], ±Rm | Register post-indexed
Address accessed is value found in Rn, and then Rn's value is increased/decreased by Rm. This is just shorthand for [Rn], ±Rm, LSL #0. |
For those addressing modes involving a shift, the shift technique is as with the arithmetic instructions (LSL, LSR, ASR, ROR, RRX). But the shift distance cannot be according to a register: The distance must be an immediate.
We often want to reserve memory for holding data in a program. To do
this, we use directives: directions for the
assembler to do something other than simply translate an
assembly language instruction into its corresponding machine code.
One useful directive
is DCD, which inserts one or more 32-bit numerical values
into the machine code output.
(DCD cryptically stands for
Define Constant Double-words.)
primes DCD 2, 3, 5, 7, 11, 13, 17, 19
In this example, we've created the label primes, which will
correspond to the address where 2 is placed into memory.
In the following four bytes is placed
the integer 3, then 5, and so on.
In our program, we would want to load
the address of the array into a register; to do this, we add
primes into the program counter PC (which is synonymous with
R15). The below fragment loads the fifth prime (11) into R1.
ADD R0, PC, #primes ; load address of primes[0] into R0
LDR R1, [R0, #16] ; load primes[4] into R1
Another directive worth mentioning is DCB, for
loading bytes into memory. Thus, we could write the following.
primes DCB 2, 3, 5, 7, 11, 13, 17, 19
However, we are using just one byte for each number, so we can only include numbers between −128 and 127. We can also include a string in the list; each character of the string will occupy one byte of memory.
greet DCB "hello world\n", 0
Notice how we included 0 after the string. Without this, the string won't be terminated by the NUL character.
One more directive worth noting here is the percent sign %.
This is useful when you wish you reserve a block of memory, but you
don't care about the memory's initial value.
array % 120 ; reserve 120 bytes of memory, which can hold 30 ints
The ARM ISA also includes instructions allowing several values to be
loaded or stored in the same instruction. The LDMIA instruction
is one such instruction: It allows loading into multiple registers starting
at an address named in another register. In the below example of
its usage, we take our code for adding the integers of an array,
and we modify it using LDMIA so that it
processes four integers with each iteration of the loop. This strategy
allows the program to run using fewer instructions, at the expense of more
complexity.
; R0 holds address of first integer in array
; R1 holds array's length; fragment works only if length is multiple of 4
addInts MOV R4, #0
addLoop LDMIA R0!, { R5-R8 }
ADD R5, R5, R6
ADD R7, R7, R8
ADD R4, R4, R5
ADD R4, R4, R7
SUBS R1, R1, #4
BNE addLoop
In executing the LDMIA instruction above, the ARM processor looks
into the R0 register for an address. It loads into R5 the four bytes starting
at that address, into R6 the next four bytes, into R6 the next four bytes,
and into R7 the next four bytes. Meanwhile, R0 is stepped forward by 16
bytes, so with the next iteration the LDMIA instruction will load
the next four words into the registers.
Inside the braces can be any list of registers, using dashes to indicate
ranges of registers, and using commas to separate ranges.
Thus, the instruction LDMIA R0!, { R1-R4, R8, R11-R12 } will load
seven words from memory. The order in which the registers are listed is not
significant; even if we write LDMIA R0!, { R11-R12, R8, R1-R4 },
R1 will receive the first word loaded from memory.
The exclamation point following R0 in our example may be omitted; if omitted, then the address register is not altered by the instruction. That is, R0 would continue pointing to the first integer in the array. In our example above, we want R0 to change so that it is pointing to the next block of four integers for the next iteration, so we included the exclamation point.
Another instruction is STMIA, which stores several registers into
memory. In the following example, we shift every number in an array into
the next spot; thus, the array <2,3,5,7> becomes
<0,2,3,5>.
; R0 holds address of first integer in array
; R1 holds array's length; fragment works only if length is multiple of 4
shift MOV R4, #0
shLoop LDMIA R0, { R5-R8 }
STMIA R0!, { R4-R7 }
MOV R4, R8
SUBS R1, R1, #4
BNE shLoop
Notice how the LDMIA instruction omits the exclamation point
so that R0 isn't modified. This is so that STMIA stores into
the same range of addresses that were just loaded into the registers.
The STMIA instruction has the exclamation point because R0 must
be modified in preparation for the next iteration of the loop.
The ARM processor includes four variants of the multiple-load and
multiple-store instructions; the LDM and STM
abbreviations must always indicate one of these four variants.
LDMIA, STMIA |
Increment after
We start loading from the named address and into increasing addresses. |
LDMIB, STMIB |
Increment before
We start loading from four more than the named address and into increasing addresses. |
LDMDA, STMDA |
Decrement after
We start loading from the named address and into decreasing addresses. |
LDMDB, STMDB |
Decrement before
We start loading from four less than the named address and into decreasing addresses. |
Across all four modes, the highest-numbered register always
corresponds to the highest address in memory. Thus, the instruction
LDMDA R0, { R1-R4 } will place R4 into the address named by R0,
R3 into R0 − 4, and so on.
As we'll see in Section 4.2, the different variants are particularly useful when we want to use a block of unused memory as a stack.
To make large programs more manageable, programmers break programs into smaller pieces. In C and Python, these pieces are called functions; in Java, they are called methods; and in assembly language, they are called subroutines. We'll now turn to seeing how to write subroutines for ARM's ISA. Through learning about subroutines, we'll also learn more about how methods/functions in higher-level languages are actually executed.
We've seen already that each processor has a register called its
program counter to track the address of the instruction it is
about to execute. The ARM processor uses R15 for this purpose. In fact,
assembly programs refer to
as
R15
instead, though the two are synonymous.PC
In invoking a subroutine, a program must store where the processor
should return after completing the subroutine. The ARM processor uses
R14 for this purpose; it is called the link register,
and is usually referenced as
in programs.LR
Below is an example illustrating how this works, using our
strcpy example from Section 3.2.
By the end of this current section, we'll see that one
uses the BL opcode rather than the two-instruction sequence
used below, but first we'll look at how calling a subroutine works without
using that special-purpose instruction.
; Note: This way of calling subroutines is stylistically poor.
ADD LR, PC, #after ; place into LR the address to return to
B strcpy
after ; main code continues its work here after subroutine completes
strcpy LDRB R2, [R1], #1
STRB R2, [R0], #1
TST R2, R2 ; repeat if R2 is nonzero
BNE strcpy
MOV PC, LR ; return back into the code calling strcpy
This fragment first loads into
LR the address of the instruction
following the call to strcpy, labeled after in the
above program. In the next instruction, it branches into
strcpy, and the subroutines starts its work. Once complete,
the subroutine copies LR back into PC.
Since PC is used to determine which instruction should
be loaded and executed next, changing its value leads the processor
to execute its next instruction at the after label.
The process of calling subroutines is so common that the ARM
designers felt that the two-instruction process in the above
illustration was too cumbersome. So they created a new opcode
explicitly for this purposes,
called BL for Branch and Link.
Thus, rather than the first two instructions above,
we'd write simply
.
This instruction loads into BL strcpyLR the address of the
instruction following the BL instruction, and it sets
PC so that the next instruction executed is the one at
strcpy.
Often our subroutines will themselves call subroutines. But that brings up a problem: How should we avoid the situation where one subroutine needs the values of some registers, but another subroutine that it calls will change those same registers?
The natural solution is to save registers' values into memory before the called subroutine does its work, and then to restore the values from memory once that subroutine has completed. In fact, we'll do this with something called the program stack. When a subroutine starts, it will allocate a new block of memory on the top of the stack; and when it returns, the subroutine will release the block from the top, restoring the stack to its older state.
We implement the stack using a a register called the
stack pointer, which holds the address of where the top
value in the stack can be found.
In the ARM processor, R12 is conventionally
used for the stack pointer, which we can reference this
as
.SP
The stack pointer will start at a large address, but as we
add things into the stack, the stack pointer will decrease;
thus, SP points to the top of the stack,
while SP + 4 is the address of the next-to-top value,
while SP + 8 points to the third-from-top value,
and so on.
For the ARM, we'll be able to push several things onto the
stack using the STMDB instruction; for example, the
instruction
decreases STMDB SP!, {R4,R5}SP by 8, effectively growing the stack by 8
bytes, and R4 and R5 are saved into those 8 bytes.
When we want to remove values from the top of the stack, we can
use the LDMIA instruction, as with
.STMIA SP!, {R4,R5}
An example of such a subroutine is below. This subroutine, an adaptation of our earlier fragment for adding the numbers of an array, needs two additional registers beyond the registers containing the parameters. Since any code that calls this subroutine may not be able to afford those registers, we instead opt to write the subroutine so that it saves both registers onto the stack as soon as it is called; and just before returning, it restores the registers to their previous values.
; sumArray: Places sum of entries in array into R0. On entry, R0
; should be address of first array element, R1 should be array length.
sumArray STMDB SP!, { R4, R5 } ; push R4 and R5 onto stack
MOV R4, #0
sumLoop LDR R5, [R0], #1
ADD R4, R4, R5
SUBS R1, R1, #1
BNE sumLoop
MOV R0, R4
LDMIA SP!, { R4, R5 } ; restore R4 and R5 from stack
MOV LR, PC ; return back to after sumArray call
One of the most common registers a subroutine will want to save is
the link register, since the subroutine will often want to modify the
link register itself as it calls other subroutines. There is a handy
trick involving this: When we restore the registers at the subroutine's
end, we can easily restore the link register LR's saved value
into the program counter PC instead. As a result, we won't need
the
instruction.MOV PC, LR
subName STMDB SP!, { R4-R5,LR }
; code within subroutine goes here, with perhaps some calls
; to other subroutines (thus changing LR)
LDMIA SP!, { R4-R5,PC } ; loading into PC returns out of subroutine
To write large assembly programs, we need a standard system for passing parameters, returning values, and allocating registers between subroutines. After all, if each subroutine created its own system of using registers, things would quickly get very confusing as we try to remember each subroutine's system. A standard system is called a calling convention, and it's important enough to have this convention that ISA designers typically define one even though it is not technically something that is built into a manufactured processor.
For the ARM processor, we'll follow the standard calling convention
that parameters are passed by placing the parameter values into
registers R0 through R3 before calling the subroutine, and a
subroutine
returns a value by placing it into R0 before returning.
In the rare situation that a subroutine wants more than four parameters,
we'd place any additional parameters onto the stack before entering the
subroutine (with the earlier parameters pushed last onto the stack, so
that the fifth parameter is on the stack's top (referenced by
SP).
Each subroutine is allowed to alter R0 through R3 as it wishes; but if it uses R4 through R12, it must restore them to their previous values. It must also restore the stack pointer R13, effectively removing everything from the stack. It may change the link pointer R14.
The registers are thus divided into caller-save registers and callee-save registers. Caller-save registers are those that the subroutine may change, such as R0 through R3 in the ARM convention described above: They are caller-save because since a caller of a subroutine must save the registers' values if it wants the values after the subroutine completes. Callee-save registers are those that a subroutine must leave unchanged, like R4 through R12 is the convention described above: Upon being called, the subroutine (the callee) must save the registers' values if it wishes to use them.
It's beneficial for a calling convention to designate both caller-save registers and callee-save registers. If the convention designated all registers as callee-save, then subroutines would not be able to use any registers at all without saving them onto the stack first — which would be a waste, since some of the saved registers would be transient values that the calling subroutine did not care about long-term. And if the convention designated all registers as caller-save, then programmers would be forced to save many registers before every call to a subroutine and to restore them afterwards, lengthening the amount of time to call a subroutine.
Let's get a bit more practice with ARM assembly programming
by looking at some more examples. In particular, we'll look at
examples for processing a linked list of integers. In C, we'd
define this as a struct with two fields, an int
field named value and a struct node* field named
next. In assembly language, if R0 contains an address of a node
of the linked list, then the node's integer value
(node->value)
would be at address R0 and the address of the following node
(node->next)
would be at address R0 + 4.
We'll begin by writing a subroutine for adding up all the values in a linked list. In C, we'd write a function to do this as follows.
int sumList(struct node *head) {
struct node *cur = head;
int sum = 0;
while (cur != NULL) {
sum += cur->value;
cur = cur->next;
}
return sum;
}
Following is a translation of this into ARM assembly language.
sumList MOV R3, R0 ; R3 holds "cur"
MOV R0, #0 ; R0 holds "sum"
sumList_loop TST R3, R3
MOVEQ PC, LR ; if cur is NULL, return: R0 already holds sum
LDR R2, [R3] ; load cur->value into R2, add it to R0
ADD R0, R0, R2
LDR R3, [R3, #4] ; update cur to be cur->next
B sumList_loop
In this translation, we were somewhat clever in deciding
conveniently to use R0 for holding the accumulated sum: By the
time we're ready to return from the subroutine, the return value
is already in R0, so we simply need to return using the
MOVEQ instruction.
Now let's look at a subroutine that adds a new node at the front of a linked list. The C function is below.
struct node* addFront(struct node *head, int val) {
struct node *n = (struct node*) malloc(sizeof(struct node));
n->value = val;
n->next = head;
return n;
}
This will be harder to translate into assembly language
since the addFront subroutine will need to call another
subroutine (malloc). What's more, the value of the R0,
R1, and LR registers will need to be saved before we enter
malloc, since these are all caller-save registers
and addFront will need their values after
malloc returns. (We need to save LR since the
instruction
modifies it.)BL malloc
In the below translation,
we end up pushing these registers onto the stack and later
popping these values into different registers;
thus, the head argument originally in R0 ends up in R2,
and the val argument originally in R1 ends up in R3.
We do this because malloc places its return value into
R0, and we wish to keep it in R0 for the return value of
addFront.
addFront STMDB SP!, {R0, R1, LR} ; push R0, R1, LR onto stack
MOV R0, #8 ; call malloc(8)
BL malloc
LDMIA SP!, {R2, R3, LR} ; pop stack into R2, R3, LR
STR R3, [R0] ; n->value = val (pushed R1 popped into R3)
STR R2, [R0, #4] ; n->next = head (pushed R0 popped into R2)
MOV PC, LR ; return from subroutine
Finally, an example that's complex enough to involve some real manipulation of the stack: Removing a number from the linked list.
struct node *remove(struct node *head, int val) {
struct node *cur = head;
struct node *prev = NULL;
while (cur != NULL) {
if (cur->value == val) {
if (cur == head) {
head = cur->next;
} else {
prev->next = cur->next;
}
free(cur);
return head;
}
prev = cur;
cur = cur->next;
}
return head;
}
Our implementation ends up needing more than just the
registers R0 through R3, so we'll need to use a callee-save
register. In our implementation below, you can see that the
callee-save register R4 is
saved onto the stack initially and restored when we return from
the subroutine. It also saves R0 onto the stack, since the value
of the head parameter is needed following the free
subroutine. However, in the innermost if statement, we
must modify head; we do this by storing the intended value into
the stack so that the new value is restored after free
completes.
remove STMDB SP!, {R0, R4, LR} ; Save "head", R4, LR onto stack.
MOV R2, R0 ; Use R2 for "cur".
MOV R3, R0 ; Use R3 for "prev".
remove_loop TST R2, R2 ; If cur == NULL, return with R0 unchanged.
LDMEQIA SP!, {R0, R4, PC}
LDR R4, [R2] ; Load cur->value,
CMP R4, R1 ; compare to "val" argument,
BEQ remove_find ; and go to remove_find if they match.
MOV R3, R2 ; Otherwise prev = cur,
LDR R2, [R2, #4] ; cur = cur->next,
B remove_loop ; and repeat.
remove_find LDR R4, [R2, #4] ; Load cur->next into R4.
CMP R2, R0 ; Check whether cur == head:
STREQ R4, [SP] ; if so, replace "head" on stack with R4;
STRNE R4, [R3, #4] ; if not, save R4 into prev->next.
MOV R0, R2 ; Now free(cur).
BL free
LDMIA SP!, {R0, R4, PC} ; Return "head" saved on stack.
The ARM processor has many more details than we have been able to discuss here. Notably missing from our discussion is any interaction with input or output devices. But one commonly writes subroutines for this (usually included as part of an operating system) without needing to think about this interaction again. But we've now seen all the pieces we need for writing computational code in the ARM assembly language.