Working with arrays, structs, and pointers – IR Generation for High-Level Language Constructs-1
By Reginald Bellamy / February 17, 2024
For almost all applications, basic types such as INTEGER are not sufficient. For example, to represent mathematical objects such as a matrix or a complex number, you must construct new data types based on existing ones. These new data types are generally known as aggregate or composite types.
An array is a sequence of elements of the same type. In LLVM, arrays are always static, which means that the number of elements is constant. The tinylang type ARRAY [10] OF INTEGER or the C type long[10] is expressed in IR as follows:
[10 x i64]
Structures are composites of different types. In programming languages, they are often expressed with named members. For example, in tinylang, a structure is written as RECORD x: REAL; color: INTEGER; y: REAL; END; and the same structure in C is struct { float x; long color; float y; };. In LLVM IR, only the type names are listed:
{ float, i64, float }
To access a member, a numerical index is used. Like arrays, the first element has an index number of 0.
The members of this structure are arranged in memory according to the specification in the data layout string. Chapter 4, Basics of IR Code Generation, describes the data layout string in more detail.
Furthermore, unused padding bytes are inserted where necessary. If you need to take control of the memory layout, you can use a packed structure, in which all elements have a 1-byte alignment. In C, GCC and Clang provide the packed attribute for this:
struct __attribute__((packed)) { float x; long long color; float y; };
The syntax in LLVM IR is slightly different and looks like the following:
<{ float, i64, float }>
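To make the effect of padding concrete, the following minimal C sketch compares the natural and the packed layout of the structure. The names Point, PackedPoint, and padding_bytes are made up for illustration, and the packed attribute is assumed to be available (GCC or Clang); the exact sizes depend on the target.

```c
#include <assert.h>
#include <stddef.h>

/* Natural layout; corresponds to { float, i64, float } in LLVM IR. */
struct Point {
  float x;
  long long color;
  float y;
};

/* Packed layout; corresponds to <{ float, i64, float }>.
   __attribute__((packed)) is a GCC/Clang extension, not ISO C. */
struct __attribute__((packed)) PackedPoint {
  float x;
  long long color;
  float y;
};

/* In the packed layout, color starts directly after x, with no gap. */
static_assert(offsetof(struct PackedPoint, color) == sizeof(float),
              "packed struct must not contain padding before color");

/* Number of padding bytes the compiler added to the natural layout. */
size_t padding_bytes(void) {
  return sizeof(struct Point) - sizeof(struct PackedPoint);
}
```

On a typical 64-bit Linux target, one would expect the natural layout to take 24 bytes and the packed one 16, but other targets may differ.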
When loaded into a register, arrays and structs are treated as a unit. It is not possible to refer to a single element of the array-valued register %x as %x[3], for example. This is due to the SSA form: it is not possible to tell whether %x[i] and %x[j] refer to the same element or not. Instead, we need special instructions to extract and insert single-element values into an array. To read the second element, we use the following:
%el2 = extractvalue [10 x i64] %x, 1
We can also update an element such as the first one:
%xnew = insertvalue [10 x i64] %x, i64 %el2, 0
Both instructions work on structures, too. For example, to access the color member from register %pt, you write the following:
%color = extractvalue { float, i64, float } %pt, 1
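The value semantics of these two instructions can be imitated in C by wrapping the array in a struct, which C passes and returns by value. This is only a rough analogy, not how LLVM implements the instructions: Array10, extract_value, and insert_value are hypothetical names, and unlike in IR, the index here is an ordinary runtime parameter.

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_ELEMS = 10 };

/* Wrapping the array in a struct lets C copy it as a whole,
   loosely like an SSA value of type [10 x i64]. */
typedef struct {
  long long elem[NUM_ELEMS];
} Array10;

/* Loosely analogous to extractvalue: reads one element of the value. */
long long extract_value(Array10 agg, size_t index) {
  assert(index < NUM_ELEMS);
  return agg.elem[index];
}

/* Loosely analogous to insertvalue: returns a NEW aggregate. The
   caller's original is untouched because agg is a by-value copy. */
Array10 insert_value(Array10 agg, long long value, size_t index) {
  assert(index < NUM_ELEMS);
  agg.elem[index] = value;
  return agg;
}
```

Calling insert_value(x, extract_value(x, 1), 0) yields a new aggregate whose first element equals the second element of x, while x itself keeps its old contents, just as %x is unchanged after %xnew is defined.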
There is an important limitation on both instructions: the index must be a constant. For structures, this is easily explainable. The index number is only a substitute for the name, and languages such as C have no notion of dynamically computing the name of a struct member. For arrays, the reason is simply that a dynamic index can’t be implemented efficiently. Both instructions have value in specific cases when the number of elements is small and known. For example, a complex number could be modeled as an array of two floating-point numbers. It’s reasonable to pass this array around, and it is always clear which part of the array must be accessed during a computation.
For general use in the front end, we have to resort to pointers to memory. All global values in LLVM are expressed as pointers. Let’s declare a @arr global variable as an array of eight i64 elements. This is the equivalent of the long arr[8] C declaration:
@arr = common global [8 x i64] zeroinitializer
To access the second element of the array, an address calculation must be performed to determine the address of the indexed element. The value can then be loaded from that address. Put into a function @second, this looks like this:
define i64 @second() {
%1 = load i64, ptr getelementptr inbounds ([8 x i64], ptr @arr, i64 0, i64 1)
ret i64 %1
}
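The address calculation performed by getelementptr in @second can be spelled out by hand in C. The sketch below assumes the usual flat array layout; gep_offset is a hypothetical helper that mirrors the two indices (i64 0, i64 1), where the first index steps over whole [8 x i64] arrays and the second over single elements.

```c
#include <assert.h>
#include <stddef.h>

/* C counterpart of: @arr = common global [8 x i64] zeroinitializer */
long arr[8];

/* Byte offset implied by the two indices on type [8 x i64]:
   the first index steps over whole arrays, the second over elements. */
size_t gep_offset(size_t array_index, size_t elem_index) {
  return array_index * sizeof(arr) + elem_index * sizeof(arr[0]);
}

/* Counterpart of @second: compute the address, then load from it. */
long second(void) {
  const char *base = (const char *)arr;
  size_t offset = gep_offset(0, 1);
  assert(offset < sizeof(arr)); /* the address stays inside the array */
  const long *addr = (const long *)(base + offset);
  return *addr;
}
```

In everyday C, the body of second would simply be return arr[1]; the explicit byte arithmetic is only there to show what the index list stands for.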