
JIT design and implementation


This document explains the design and implementation of Julia's JIT after code generation has finished and unoptimized LLVM IR has been produced. The JIT is responsible for optimizing and compiling this IR to machine code, for linking it into the current process, and for making the code available for execution.

Introduction

The JIT manages compilation resources, looks up previously compiled code, and compiles new code. It is built primarily on LLVM's https://llvm.org/docs/ORCv2.html[On-Request Compilation] (ORCv2) technology, which supports a number of useful features such as concurrent compilation, lazy compilation, and the ability to compile code in a separate process. Although LLVM provides a basic JIT compiler in the form of LLJIT, Julia uses many ORCv2 APIs directly to create its own custom JIT compiler.
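For a sense of what ORCv2 provides, the following is a minimal, hedged sketch that uses LLVM's off-the-shelf LLJIT rather than Julia's custom JIT. The function name `double_it` is hypothetical, and the return type of `lookup` varies across LLVM versions (this sketch assumes LLVM 15 or newer, where it yields an `ExecutorAddr`):

```cpp
#include "llvm/ExecutionEngine/Orc/LLJIT.h"
#include "llvm/ExecutionEngine/Orc/ThreadSafeModule.h"
#include "llvm/Support/TargetSelect.h"

using namespace llvm;
using namespace llvm::orc;

// JIT-compile a module assumed to contain `int double_it(int)` and call it.
Expected<int> compileAndRun(ThreadSafeModule TSM) {
  // Codegen for the host target must be registered before JITting.
  InitializeNativeTarget();
  InitializeNativeTargetAsmPrinter();

  auto JIT = LLJITBuilder().create();
  if (!JIT)
    return JIT.takeError();

  // Hand the unoptimized IR to the JIT; optimization, compilation, and
  // linking happen when the symbol is first looked up.
  if (Error Err = (*JIT)->addIRModule(std::move(TSM)))
    return std::move(Err);

  auto Addr = (*JIT)->lookup("double_it");
  if (!Addr)
    return Addr.takeError();
  return Addr->toPtr<int (*)(int)>()(21);
}
```

Julia's JIT assembles the same ingredients (IR layers, object layers, lookup) from the lower-level ORCv2 building blocks instead of using LLJIT, which is what the rest of this document describes.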

Overview

During code generation, an LLVM module is produced containing the IR for one or more Julia functions, derived from the original Julia SSA IR produced by type inference (labeled as translate in the compiler diagram above). Code generation also produces a mapping from code instance to LLVM function name. However, although the Julia-based compiler has already applied some optimizations to Julia's IR, the LLVM IR produced by code generation still contains many opportunities for optimization. Thus, the first step the JIT takes is to run a target-independent optimization pipeline[This is not a fully target-independent pipeline, as transformations such as vectorization rely on target information such as vector register width and cost modeling. Additionally, code generation itself makes a few target-dependent assumptions, which the optimization pipeline then relies on.] on the LLVM module. Then, the JIT runs a target-dependent optimization pipeline, which includes target-specific optimizations and code generation, and emits an object file. Finally, the JIT links the resulting object file into the current process and makes the code available for execution. All of this is controlled by code in `src/jitlayers.cpp`.

Currently, only one thread at a time may enter the optimize-compile-link pipeline, due to restrictions imposed by one of our linkers (RuntimeDyld). However, the JIT is designed to support concurrent optimization and compilation, and the linker restriction is expected to be lifted in the future, once RuntimeDyld has been fully superseded on all platforms.

Optimization Pipeline

The optimization pipeline is based on LLVM's new pass manager, but tailored to Julia's needs. The pipeline is defined in `src/pipeline.cpp`, and broadly proceeds through the stages described below; a hedged sketch of assembling a similar pipeline follows the list.

  1. Early simplification

    1. These passes are mainly used to simplify the IR and canonicalize patterns so that later passes can identify those patterns more easily. Additionally, various intrinsic calls, such as branch prediction hints and annotations, are lowered into other metadata or other IR features. Key players here include https://llvm.org/docs/Passes.html#simplifycfg-simplify-the-cfg[`SimplifyCFG`] (simplify the control-flow graph), https://llvm.org/docs/Passes.html#dce-dead-code-elimination[`DCE`] (dead code elimination), and https://llvm.org/docs/Passes.html#sroa-scalar-replacement-of-aggregates[`SROA`] (scalar replacement of aggregates).

  2. Early optimization

    1. These passes are typically cheap and are primarily focused on reducing the number of instructions in the IR and propagating knowledge to other instructions. For example, https://en.wikipedia.org/wiki/Common_subexpression_elimination[`EarlyCSE`] is used to perform common subexpression elimination, while https://llvm.org/docs/Passes.html#instcombine-combine-redundant-instructions[`InstCombine`] and https://llvm.org/doxygen/classllvm_1_1InstSimplifyPass.html#details[`InstSimplify`] perform a number of small local optimizations to make operations less expensive.

  3. Loop optimization

    1. These passes canonicalize and simplify loops. Loops are frequently hot code, which makes loop optimization extremely important for performance. Key players here include https://llvm.org/docs/Passes.html#loop-rotate-rotate-loops[`LoopRotate`], https://llvm.org/docs/Passes.html#licm-loop-invariant-code-motion[`LICM`], and https://llvm.org/docs/Passes.html#loop-unroll-unroll-loops[`LoopFullUnroll`]. Some bounds-check elimination also happens here, as a result of the https://llvm.org/doxygen/InductiveRangeCheckElimination_8cpp_source.html[`IRCE`] pass, which can prove that certain bounds are never exceeded.

  4. Scalar optimization

    1. The scalar optimization pipeline contains a number of more expensive but more powerful passes, such as https://llvm.org/docs/Passes.html#gvn-global-value-numbering[`GVN`] (global value numbering), https://llvm.org/docs/Passes.html#sccp-sparse-conditional-constant-propagation[`SCCP`] (sparse conditional constant propagation), and another round of bounds-check elimination. These passes are expensive, but they can often delete large amounts of code and make vectorization much more successful and effective. Several other simplification and optimization passes are interspersed with the more expensive ones to reduce the amount of work they have to do.

  5. Vectorization

    1. https://en.wikipedia.org/wiki/Automatic_vectorization[Automatic vectorization] is an extremely powerful transformation for CPU-intensive code. In short, vectorization allows execution of a https://en.wikipedia.org/wiki/Single_instruction,_multiple_data[single instruction on multiple data] (SIMD), for example performing eight addition operations at the same time. However, proving that code is both able to be vectorized and profitable to vectorize is difficult, and this relies heavily on the prior optimization passes to massage the IR into a state where vectorization is worth it.

  6. Intrinsic lowering

    1. Julia inserts a number of custom intrinsics, for purposes such as object allocation, garbage collection, and exception handling. These intrinsics were originally placed to make optimization opportunities more obvious, but they are now lowered to LLVM IR so that the IR can be emitted as machine code.

  7. Cleanup

    1. These passes are last-chance optimizations, and perform small optimizations such as fused multiply-add propagation and division-remainder simplification. Additionally, targets that do not support half-precision floating-point numbers have their half-precision instructions lowered to single-precision instructions, and passes to support sanitizers are added.
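To make the shape of this pipeline concrete, here is a hedged sketch of assembling a broadly similar (but much abbreviated) pipeline with LLVM's new pass manager. The pass list is illustrative only and is not Julia's actual pipeline from `src/pipeline.cpp`:

```cpp
#include "llvm/IR/Module.h"
#include "llvm/IR/PassManager.h"
#include "llvm/Passes/PassBuilder.h"
#include "llvm/Support/Error.h"

using namespace llvm;

// Run an abbreviated, illustrative pipeline over a module: early
// simplification (simplifycfg/dce/sroa), early optimization
// (early-cse/instcombine), loop optimizations (loop-rotate/licm/unroll),
// scalar optimizations (gvn/sccp), and vectorization.
void runIllustrativePipeline(Module &M) {
  // Analysis managers required by the new pass manager.
  LoopAnalysisManager LAM;
  FunctionAnalysisManager FAM;
  CGSCCAnalysisManager CGAM;
  ModuleAnalysisManager MAM;

  PassBuilder PB;
  PB.registerModuleAnalyses(MAM);
  PB.registerCGSCCAnalyses(CGAM);
  PB.registerFunctionAnalyses(FAM);
  PB.registerLoopAnalyses(LAM);
  PB.crossRegisterProxies(LAM, FAM, CGAM, MAM);

  ModulePassManager MPM;
  cantFail(PB.parsePassPipeline(
      MPM, "function(simplifycfg,dce,sroa,early-cse,instcombine,"
           "loop(loop-rotate,licm),loop-unroll,gvn,sccp,"
           "loop-vectorize,slp-vectorizer)"));
  MPM.run(M, MAM);
}
```

Julia instead builds its pipeline programmatically, interleaving its custom intrinsic-lowering and GC-related passes with the standard LLVM ones.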

Target-dependent optimization and code generation

LLVM provides target-dependent optimization and machine code generation in a single pipeline, expressed through the TargetMachine for a given platform. These passes include instruction selection, instruction scheduling, register allocation, and machine code emission. The LLVM documentation gives a good overview of the process, and the LLVM source code is the best place to look for details on the pipeline and passes.
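For orientation, the following is a minimal sketch of driving this pipeline directly; the enum and stream types assume a recent LLVM, and the way Julia actually wires this up in `src/jitlayers.cpp` differs:

```cpp
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/LegacyPassManager.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/raw_ostream.h"
#include "llvm/Target/TargetMachine.h"

using namespace llvm;

// Emit an object file for the module into Buf using the TargetMachine's
// codegen pipeline. Instruction selection, scheduling, register
// allocation, and encoding all run inside this one pipeline.
bool emitObject(Module &M, TargetMachine &TM, SmallVectorImpl<char> &Buf) {
  raw_svector_ostream OS(Buf);
  legacy::PassManager PM;
  if (TM.addPassesToEmitFile(PM, OS, nullptr, CodeGenFileType::ObjectFile))
    return false; // The target does not support object file emission.
  PM.run(M);
  return true;
}
```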

Linking

Julia is currently in a transitional phase between the older RuntimeDyld linker and the newer https://llvm.org/docs/JITLink.html[JITLink]. JITLink contains a number of features that RuntimeDyld does not have, such as concurrent and reentrant linking, but it currently lacks good support for profiling integration and does not yet support all of the platforms that RuntimeDyld supports. Over time, JITLink is expected to replace RuntimeDyld entirely. For more details on JITLink, see the LLVM documentation.

Execution

Once the code has been linked into the current process, it is available for execution. This fact is made known to the generating code instance by updating its `invoke`, `specsigflags`, and `specptr` fields appropriately. Code instances support upgrading their `invoke`, `specsigflags`, and `specptr` fields, so long as every combination of these fields that exists at any point in time is valid to call. This allows the JIT to update these fields without invalidating existing code instances, supporting a potential future concurrent JIT. Specifically, the following states may be valid (a field-level sketch follows the list):

  1. `invoke` is NULL, `specsigflags` is 0b00, `specptr` is NULL

    1. This is the initial state of a code instance, and indicates that it has not yet been compiled.

  2. `invoke` is non-null, `specsigflags` is 0b00, `specptr` is NULL

    1. This indicates that the code instance was compiled without any specialization, and that it is to be invoked directly. Note that in this case `invoke` does not read either the `specsigflags` or `specptr` field, so they may be modified without invalidating the `invoke` pointer.

  3. `invoke` is non-null, `specsigflags` is 0b10, `specptr` is non-null

    1. This indicates that the code instance was compiled, but code generation deemed the specialized function signature unnecessary.

  4. `invoke` is non-null, `specsigflags` is 0b11, `specptr` is non-null

    1. This indicates that the code instance was compiled, and code generation deemed the specialized function signature necessary. The `specptr` field contains a pointer to the specialized function signature. The `invoke` pointer is permitted to read both the `specsigflags` and `specptr` fields.
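As a reference point for the states above and the patterns below, here is a hypothetical C++ modeling of the three fields; these are not Julia's actual definitions, which live in the C runtime:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for the relevant code instance fields.
struct CodeInstance {
  // Generic entry point; NULL until the instance is compiled.
  std::atomic<void *> invoke{nullptr};
  // Bit 0b01: the specialized signature is used by invoke.
  // Bit 0b10: all fields have reached their final values.
  std::atomic<uint8_t> specsigflags{0b00};
  // Pointer to the specialized function, if one exists.
  std::atomic<void *> specptr{nullptr};
};
```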

In addition, a number of different transitional states occur during the update process. To account for these potential situations, the following write and read patterns should be used when operating on these code instance fields (a code sketch follows the list).

  1. When writing the `invoke`, `specsigflags`, and `specptr` fields:

    1. Perform an atomic compare-exchange of `specptr`, assuming the old value was NULL. This compare-exchange operation should have at least acquire-release ordering, to provide ordering guarantees for the remaining memory operations in the write.

    2. If `specptr` was non-null, cease the write operation and wait for bit 0b10 of `specsigflags` to be written.

    3. Write the new low bit of `specsigflags` to its final value. This may be a relaxed write.

    4. Write the new `invoke` pointer to its final value. This must have at least release memory ordering to synchronize with reads of `invoke`.

    5. Set the second bit of `specsigflags` to 1. This must have at least release memory ordering to synchronize with reads of `specsigflags`. This step completes the write operation and announces to all other threads that all fields have been set.

  2. When reading all of the `invoke`, `specsigflags`, and `specptr` fields:

    1. Read the `invoke` field with at least acquire memory ordering. This load will be referred to as `initial_invoke`.

    2. If `initial_invoke` is NULL, the code instance is not yet executable. `invoke` is NULL, `specsigflags` may be treated as 0b00, and `specptr` may be treated as NULL.

    3. Read the `specptr` field with at least acquire memory ordering.

    4. If `specptr` is NULL, then the `initial_invoke` pointer must not be relying on `specptr` for correct execution. Therefore, `invoke` is non-null, `specsigflags` may be treated as 0b00, and `specptr` may be treated as NULL.

    5. If `specptr` is non-null, then `initial_invoke` may not be the final `invoke` field that uses `specptr`. This can occur if `specptr` has already been written, but `invoke` has not yet been written. Therefore, spin on the second bit of `specsigflags` until it is set to 1, with at least acquire memory ordering.

    6. Re-read the `invoke` field with at least acquire memory ordering. This load will be referred to as `final_invoke`.

    7. Read the `specsigflags` field with any memory ordering.

    8. `invoke` is `final_invoke`, `specsigflags` is the value read in step 7, and `specptr` is the value read in step 3.

  3. When updating `specptr` to a different but equivalent function pointer:

    1. Perform a release store of the new function pointer to `specptr`. Races here must be benign, as the old function pointer must still be valid, and any new ones must be valid as well. Once a pointer has been written to `specptr`, it must always be callable, whether or not it is later overwritten.
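Put together, a minimal sketch of these three patterns in terms of C++ `std::atomic` orderings, using the hypothetical `CodeInstance` from the earlier sketch, might look as follows:

```cpp
#include <atomic>
#include <cstdint>
// Assumes the hypothetical CodeInstance struct sketched above.

// Write pattern: publish newly compiled code.
void write_fields(CodeInstance &ci, void *new_invoke, void *new_specptr,
                  bool specsig) {
  void *expected = nullptr;
  // Step 1: compare-exchange specptr from NULL, acquire-release ordering.
  if (!ci.specptr.compare_exchange_strong(expected, new_specptr,
                                          std::memory_order_acq_rel)) {
    // Step 2: another writer won the race; wait for it to announce
    // completion by setting bit 0b10 of specsigflags.
    while (!(ci.specsigflags.load(std::memory_order_acquire) & 0b10)) {
    }
    return;
  }
  // Step 3: write the low bit of specsigflags; a relaxed write suffices.
  ci.specsigflags.store(specsig ? 0b01 : 0b00, std::memory_order_relaxed);
  // Step 4: publish the invoke pointer with release ordering.
  ci.invoke.store(new_invoke, std::memory_order_release);
  // Step 5: set bit 0b10 with release ordering, announcing completion.
  ci.specsigflags.fetch_or(0b10, std::memory_order_release);
}

// A callable combination of the three fields.
struct Snapshot {
  void *invoke;
  uint8_t specsigflags;
  void *specptr;
};

// Read pattern: obtain a consistent view of the three fields.
Snapshot read_fields(CodeInstance &ci) {
  // Steps 1-2: acquire-load invoke; NULL means not yet compiled.
  void *initial_invoke = ci.invoke.load(std::memory_order_acquire);
  if (!initial_invoke)
    return {nullptr, 0b00, nullptr};
  // Steps 3-4: acquire-load specptr; if NULL, invoke does not rely on it.
  void *specptr = ci.specptr.load(std::memory_order_acquire);
  if (!specptr)
    return {initial_invoke, 0b00, nullptr};
  // Step 5: specptr was written, but invoke may predate it; spin until
  // the writer sets bit 0b10.
  while (!(ci.specsigflags.load(std::memory_order_acquire) & 0b10)) {
  }
  // Steps 6-8: re-read invoke (final_invoke) and read specsigflags.
  void *final_invoke = ci.invoke.load(std::memory_order_acquire);
  uint8_t flags = ci.specsigflags.load(std::memory_order_relaxed);
  return {final_invoke, flags, specptr};
}

// Update pattern: swap in an equivalent specialized function pointer.
void update_specptr(CodeInstance &ci, void *equivalent_fn) {
  // Release store; old and new pointers must both remain callable.
  ci.specptr.store(equivalent_fn, std::memory_order_release);
}
```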

Although these write, read, and update steps are complicated, they ensure that the JIT can update code instances without invalidating existing code instances or existing `invoke` pointers. This enables the JIT to potentially reoptimize functions at higher optimization levels in the future, and will also enable the JIT to support concurrent compilation of functions in the future.