Understanding the IR code – Basics of IR Code Generation
By Reginald Bellamy / March 30, 2022
Before generating the IR code, it’s good to know the main elements of the IR language. In Chapter 2, The Structure of a Compiler, we had a brief look at IR. An easy way to get more knowledge of IR is to study the output from clang. For example, save this C source code, which implements the Euclidean algorithm for calculating the greatest common divisor of two numbers, as gcd.c:
unsigned gcd(unsigned a, unsigned b) {
    if (b == 0)
        return a;
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}
Then, you can create the gcd.ll IR file by using clang and the following command:
$ clang --target=aarch64-linux-gnu -O1 -S -emit-llvm gcd.c
The IR code is not target-independent, even if it often looks like it is. The preceding command compiles the source file for a 64-bit ARM CPU on Linux. The -S option instructs clang to output an assembly file; combined with -emit-llvm, it produces an IR file instead. The -O1 optimization level is used to get easily readable IR code. Clang has many more options, all of which are documented in the command-line argument reference at https://clang.llvm.org/docs/ClangCommandLineReference.html. Let’s have a look at the generated file and understand how the C source maps to the LLVM IR.
A C file is translated into a module, which holds the functions and the data objects. A function has at least one basic block, and a basic block contains instructions. This hierarchical structure is also reflected in the C++ API. All data elements are typed. Integer types are represented by the letter i, followed by the number of bits. For example, the 64-bit integer type is written as i64. The most basic float types are float and double, denoting the 32-bit and 64-bit IEEE floating-point types. It is also possible to create aggregate types such as vectors, arrays, and structures.
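To make the type notation more concrete, here is a small C sketch whose comments name the IR types clang would typically pick for each declaration. The function and type names (low_byte, widen, to_double, pair, table) are made up for this illustration, and the IR types in the comments are an assumption about typical clang output for a target such as aarch64-linux-gnu, not a guarantee. You can check them yourself by compiling the file with the command shown earlier.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* The fixed-width C types have exactly the bit counts that the IR
   integer names spell out. */
static_assert(sizeof(uint8_t) * CHAR_BIT == 8, "uint8_t matches i8");
static_assert(sizeof(uint32_t) * CHAR_BIT == 32, "uint32_t matches i32");
static_assert(sizeof(uint64_t) * CHAR_BIT == 64, "uint64_t matches i64");

/* Assumed mapping: i32 parameter, i8 result. */
uint8_t low_byte(uint32_t x) { return (uint8_t)(x & 0xFF); }

/* Assumed mapping: i32 parameter, i64 result. */
uint64_t widen(uint32_t x) { return (uint64_t)x; }

/* Assumed mapping: float parameter, double result. */
double to_double(float x) { return (double)x; }

/* Aggregate types: a structure with an i32 and a double member,
   and a global array of four i32 elements. */
struct pair {
    int32_t count;
    double weight;
};

int32_t table[4] = {1, 2, 3, 4};

/* The structure is passed by pointer to keep the signature simple. */
double scaled(const struct pair *p) { return p->count * p->weight; }

int32_t table_sum(void) {
    int32_t sum = 0;
    for (int i = 0; i < 4; ++i)
        sum += table[i];
    return sum;
}
```

Comparing the generated IR with these comments is a quick way to see how the typed, hierarchical structure described above shows up in practice.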
Here is what the LLVM IR looks like. At the top of the file, some basic properties are established:
; ModuleID = 'gcd.c'
source_filename = "gcd.c"
target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-unknown-linux-gnu"
The first line is a comment that tells you which module identifier was used. The following line gives the name of the source file. With clang, both are the same.
The target datalayout string establishes some basic properties. The different parts are separated by -. The following information is included:
- A lowercase e means that bytes in memory are stored in little-endian order. To specify big-endian order, you must use an uppercase E.
- m: specifies the name mangling that’s applied to symbols. Here, m:e means that ELF name mangling is used.
- The entries in iN:A:P form, such as i8:8:32, specify the alignment of data, given in bits. The first number is the alignment required by the ABI, and the second number is the preferred alignment. For bytes (i8), the ABI alignment is 1 byte (8 bits) and the preferred alignment is 4 bytes (32 bits).
- n specifies which native register sizes are available. n32:64 means that 32-bit and 64-bit wide integers are natively supported.
- S specifies the alignment of the stack, again in bits. S128 means that the stack maintains a 16-byte alignment.
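The fields listed above can be pulled out of the string with a few lines of C. The following is a minimal sketch, not LLVM's own parser: the layout_info structure and the parse_component, parse_layout, and bits_to_bytes functions are hypothetical names introduced for this illustration, and the sketch handles only the e/E, m:, n, and S components while ignoring everything else.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical summary of the fields discussed above; not an LLVM API. */
struct layout_info {
    bool little_endian;        /* 'e' = little-endian, 'E' = big-endian */
    char mangling;             /* character after "m:", 'e' for ELF */
    unsigned stack_align_bits; /* from "S<bits>" */
    unsigned native_widths[8]; /* from "n<w1>:<w2>..." */
    size_t native_count;
};

/* Alignments in the string are given in bits. */
unsigned bits_to_bytes(unsigned bits) { return bits / 8; }

/* Parse one '-'-separated component of length len starting at s. */
void parse_component(const char *s, size_t len, struct layout_info *out) {
    if (len == 1 && (s[0] == 'e' || s[0] == 'E')) {
        out->little_endian = (s[0] == 'e');
    } else if (len >= 3 && s[0] == 'm' && s[1] == ':') {
        out->mangling = s[2];
    } else if (len >= 2 && s[0] == 'S') {
        out->stack_align_bits = (unsigned)strtoul(s + 1, NULL, 10);
    } else if (len >= 2 && s[0] == 'n') {
        const char *p = s + 1;
        const char *end = s + len;
        while (p < end && out->native_count < 8) {
            char *next;
            unsigned width = (unsigned)strtoul(p, &next, 10);
            out->native_widths[out->native_count++] = width;
            /* Continue after a ':' separator, otherwise stop. */
            p = (next < end && *next == ':') ? next + 1 : end;
        }
    }
    /* Other components, such as iN:A:P, are ignored in this sketch. */
}

/* Split the layout string at '-' and parse each component in turn. */
struct layout_info parse_layout(const char *layout) {
    assert(layout != NULL);
    struct layout_info info = {0};
    const char *start = layout;
    for (;;) {
        const char *dash = strchr(start, '-');
        size_t len = dash ? (size_t)(dash - start) : strlen(start);
        parse_component(start, len, &info);
        if (!dash)
            break;
        start = dash + 1;
    }
    return info;
}
```

Calling parse_layout with the string from gcd.ll should report little-endian storage, ELF mangling, native widths of 32 and 64 bits, and a stack alignment of 128 bits, which bits_to_bytes converts to 16 bytes.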