Engee documentation

Performance Tips

In the following sections, we briefly go through a number of techniques that can help make Julia code run as fast as possible.

Code for which performance is important should be placed in functions

Any code for which performance is important should be placed in a function. Code inside functions is usually much faster than top-level code, due to the way the Julia compiler works.

Functions are important not only for performance: they are more reusable and testable, and they make it clearer what is being done and what the inputs and outputs are. The Julia style guide also recommends writing functions rather than just scripts.

Functions should take arguments instead of operating directly on global variables (see the next section).
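As a minimal illustration (the names data and total_of are hypothetical), compare a top-level loop over a global with the same loop wrapped in a function that receives its data as an argument:

data = rand(1000)

# Top-level loop over a global: hard for the compiler to optimize.
total = 0.0
for v in data
    global total += v
end

# The same work in a function that takes its input as an argument;
# the compiler can specialize this for the concrete type of xs.
function total_of(xs)
    s = 0.0
    for v in xs
        s += v
    end
    return s
end

total_of(data)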

Avoid untyped global variables

The value of an untyped global variable can change at any time, which can lead to a change in its type. Therefore, it is difficult for the compiler to optimize code that uses global variables. This also applies to variables whose values are types, that is, to type aliases at the global level. Variables should be local or passed to functions as arguments where possible.

Global names are often constants, and declaring them as such can significantly improve performance:

const DEFAULT_VAL = 0

If a global variable is known to always have the same type, that type should be annotated.
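For example, a minimal sketch (requires Julia 1.8 or later, which supports type annotations on global variables; the name counts is hypothetical):

counts::Vector{Int} = zeros(Int, 10)   # the global binding `counts` now always holds a Vector{Int}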

Uses of untyped global variables can also be optimized by annotating their types at the point of use:

global x = rand(1000)

function loop_over_global()
    s = 0.0
    for i in x::Vector{Float64}
        s += i
    end
    return s
end

Passing arguments to functions is also better style. It improves reusability and makes it clearer what the inputs and outputs are.

All code in the REPL is evaluated in global scope, so a variable defined and assigned at top level will be a global variable. Variables defined at top-level scope inside modules are also global.

In a REPL session, this code:

julia> x = 1.0

is equivalent to:

julia> global x = 1.0

Therefore, all the performance-related features described above are applicable.

Measure performance with the @time macro and pay attention to memory allocation

The @time macro is a useful tool for measuring performance. Here we repeat the example with the global variable from above, but this time without the type annotation:

julia> x = rand(1000);

julia> function sum_global()
           s = 0.0
           for i in x
               s += i
           end
           return s
       end;

julia> @time sum_global()
  0.011539 seconds (9.08 k allocations: 373.386 KiB, 98.69% compilation time)
523.0007221951678

julia> @time sum_global()
  0.000091 seconds (3.49 k allocations: 70.156 KiB)
523.0007221951678

On the first call (@time sum_global()), the function gets compiled. (If @time has not yet been used in this session, the timing functions themselves are also compiled.) The results of this run should not be taken seriously. Note that for the second run, in addition to reporting the time, it also indicates that a significant amount of memory was allocated. Here we are simply computing the sum of all elements of a vector of 64-bit floating-point numbers, so there should be no need to allocate (heap) memory at all.

To clarify, @time reports heap allocations, which are typically required either for mutable objects or for creating or growing variable-sized containers (such as Array or Dict, strings, or "type-unstable" objects whose type is only known at runtime). Allocating (or deallocating) such blocks of memory may require an expensive call into libc (e.g., malloc in C), and the blocks must be tracked by the garbage collector. In contrast, immutable values such as numbers (except bignums), tuples, and immutable structs (struct) require much cheaper storage (e.g., on the stack or in CPU registers), so you generally do not have to worry about the performance cost of "allocating" them.

Unexpected memory allocation is almost always a sign of some problem in the code, usually related to type instability or the creation of many small temporary arrays. Consequently, beyond the allocation itself, the code generated for your function is very likely far from optimal. Take such hints seriously and follow the advice below.

In this particular case, the memory allocation is due to the use of the type-unstable global variable x. If we instead pass x to the function as an argument, it no longer allocates memory (the remaining allocation reported below is due to running the @time macro in global scope), and after the first call it is significantly faster:

julia> x = rand(1000);

julia> function sum_arg(x)
           s = 0.0
           for i in x
               s += i
           end
           return s
       end;

julia> @time sum_arg(x)
  0.007551 seconds (3.98 k allocations: 200.548 KiB, 99.77% compilation time)
523.0007221951678

julia> @time sum_arg(x)
  0.000006 seconds (1 allocation: 16 bytes)
523.0007221951678

The 1 allocation seen in the output is because @time itself runs in global scope. If we instead perform the timing inside a function, we see that no memory is allocated at all:

julia> time_sum(x) = @time sum_arg(x);

julia> time_sum(x)
  0.000002 seconds
523.0007221951678

In some situations, your function may legitimately need to allocate memory as part of its operation, which complicates the picture. In such cases, consider using one of the tools below to diagnose problems, or write a version of the function that separates allocation from the algorithmic work (see the section Pre-allocation of memory for output data).

For more serious benchmarking, consider the https://github.com/JuliaCI/BenchmarkTools.jl [BenchmarkTools.jl] package, which, among other things, evaluates the function multiple times in order to reduce noise.

Tools

Julia and its package ecosystem include tools that can help diagnose problems and improve the performance of your code:

  • Profiling allows you to measure the performance of your running code and identify lines that act as bottlenecks; a minimal workflow is sketched after this list. For large projects, the https://github.com/timholy/ProfileView.jl [ProfileView] package can help visualize the profiling results.

  • The https://github.com/aviatesk/JET.jl [JET] package can help identify common performance problems in your code.

  • Unexpectedly large memory allocation, as reported by @time, @allocated, or the profiler (through calls to the garbage-collection routines), hints at possible problems with your code. If there is no other obvious reason for the allocation, suspect a type problem first. You can also start Julia with the --track-allocation=user option and examine the resulting *.mem files to see where the allocations occur. See the section Memory allocation analysis.

  • The @code_warntype macro produces a representation of your code that can help you find expressions that result in type uncertainty. See the section on @code_warntype below.
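For instance, a minimal profiling workflow might look like this (the function compute is a hypothetical workload):

using Profile

compute(n) = sum(sqrt(abs(sin(i))) for i in 1:n)   # hypothetical workload

compute(10)                 # run once so compilation is excluded from the profile
@profile compute(10^7)      # collect samples while the code runs
Profile.print()             # report where the time was spent, line by line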

Avoid containers with abstract type parameters

When working with parameterized types, including arrays, it is best to avoid parameterization with abstract types whenever possible.

Consider the following code:

julia> a = Real[]
Real[]

julia> push!(a, 1); push!(a, 2.0); push!(a, π)
3-element Vector{Real}:
 1
 2.0
 π = 3.1415926535897...

Because a is an array of the abstract type Real, it must be able to hold any Real value. Since Real objects can be of arbitrary size and structure, a must be represented as an array of pointers to individually allocated Real objects. However, if we only allowed numbers of a single type, say Float64, to be stored in a, they could be stored more efficiently:

julia> a = Float64[]
Float64[]

julia> push!(a, 1); push!(a, 2.0); push!(a,  π)
3-element Vector{Float64}:
 1.0
 2.0
 3.141592653589793

Numbers assigned to a will now be converted to Float64, and a will be stored as a contiguous block of 64-bit floating-point values that can be manipulated efficiently.

If you cannot avoid containers with abstract value types, it is sometimes better to parameterize them with Any to avoid runtime type checking. For example, IdDict{Any, Any} will perform better than IdDict{Type, Vector}.

See also the discussion in the section Parametric types.

Type Declarations

In many languages with optional type declarations, adding declarations is the principal way to make code run faster. This is not the case in Julia: the compiler generally knows the types of all function arguments, local variables, and expressions. However, there are a few specific situations in which declarations are helpful.

Avoid fields with abstract types

Types can be declared without specifying the types of their fields:

julia> struct MyAmbiguousType
           a
       end

As a result, a can be of any type. This can often be useful, but it has a downside: for objects of type MyAmbiguousType, the compiler will not be able to generate high-performance code. The reason is that the compiler decides how to build code from the types of objects, not their values. Unfortunately, very little can be inferred from an object of type MyAmbiguousType:

julia> b = MyAmbiguousType("Hello")
MyAmbiguousType("Hello")

julia> c = MyAmbiguousType(17)
MyAmbiguousType(17)

julia> typeof(b)
MyAmbiguousType

julia> typeof(c)
MyAmbiguousType

The values of b and c have the same type, yet their underlying representations in memory are very different. Even if we only stored numeric values in the field a, the memory representations of, say, UInt8 and Float64 differ, and the CPU must therefore process them with different instructions. Since the required information is not available in the type, the decision has to be made at run time, which degrades performance.

You can do better by declaring the type of a. Here we consider the case where a might be any one of several types, in which case the natural solution is to use parameters. For example:

julia> mutable struct MyType{T<:AbstractFloat}
           a::T
       end

This option is better than

julia> mutable struct MyStillAmbiguousType
           a::AbstractFloat
       end

because in the first case, the type a is determined based on the type of the wrapper object. For example:

julia> m = MyType(3.2)
MyType{Float64}(3.2)

julia> t = MyStillAmbiguousType(3.2)
MyStillAmbiguousType(3.2)

julia> typeof(m)
MyType{Float64}

julia> typeof(t)
MyStillAmbiguousType

The type of field a can readily be determined from the type of m, but not from the type of t. Indeed, in t the type of the field a can change:

julia> typeof(t.a)
Float64

julia> t.a = 4.5f0
4.5f0

julia> typeof(t.a)
Float32

In contrast, once m is constructed, the type of m.a cannot change:

julia> m.a = 4.5f0
4.5f0

julia> typeof(m.a)
Float64

The fact that the type of m.a is known from the type of m, together with the fact that this type cannot change mid-function, allows the compiler to generate highly optimized code for objects like m, but not for objects like t.

Of course, all of this holds only if m is constructed with a concrete type. If it is explicitly constructed with an abstract type, the situation changes:

julia> m = MyType{AbstractFloat}(3.2)
MyType{AbstractFloat}(3.2)

julia> typeof(m.a)
Float64

julia> m.a = 4.5f0
4.5f0

julia> typeof(m.a)
Float32

For all practical purposes, such objects behave identically to objects of type MyStillAmbiguousType.

It is quite instructive to compare the sheer amount of code generated for a simple function

func(m::MyType) = m.a+1

using

code_llvm(func, Tuple{MyType{Float64}})
code_llvm(func, Tuple{MyType{AbstractFloat}})

For reasons of length the results are not shown here, but you may wish to try this yourself. Because the type is fully specified in the first case, the compiler does not need to generate any code to resolve the type at run time, which results in shorter and faster code.

Also keep in mind that types that are not fully parameterized behave like abstract types. For example, even though a fully specified Array{T,n} is concrete, Array itself with no parameters given is not concrete:

julia> !isconcretetype(Array), !isabstracttype(Array), isstructtype(Array), !isconcretetype(Array{Int}), isconcretetype(Array{Int,1})
(true, true, true, true, true)

In this case it would be better to avoid declaring MyType with a field a::Array and instead declare the field as a::Array{T,N} or as a::A, where {T,N} or A are parameters of MyType.
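For example, a minimal sketch (the name MyTypeConcrete is hypothetical):

struct MyTypeConcrete{T,N}
    a::Array{T,N}    # fully parameterized: concrete once T and N are known
end

MyTypeConcrete([1 2; 3 4])   # inferred as MyTypeConcrete{Int64, 2}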

The previous advice is especially useful when the fields of a structure are functions or, more generally, callable objects. It is tempting to define the structure as follows:

struct MyCallableWrapper
    f::Function
end

But since Function is an abstract type, every call wrapper.f will require dynamic dispatch, due to the type instability of accessing the field f. Instead, you should write something like:

struct MyCallableWrapper{F}
    f::F
end

which has nearly identical behavior but will be much faster (because the type instability is eliminated). Note that we do not impose F<:Function: this means callable objects that are not subtypes of Function are also allowed for the field f.
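For example, a minimal usage sketch:

w = MyCallableWrapper(sin)   # the parameter F is inferred as typeof(sin)
w.f(1.0)                     # the field access is type-stable, so the call is fast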

Avoid fields with abstract containers

The same best practices also apply to container types:

julia> struct MySimpleContainer{A<:AbstractVector}
           a::A
       end

julia> struct MyAmbiguousContainer{T}
           a::AbstractVector{T}
       end

julia> struct MyAlsoAmbiguousContainer
           a::Array
       end

For example:

julia> c = MySimpleContainer(1:3);

julia> typeof(c)
MySimpleContainer{UnitRange{Int64}}

julia> c = MySimpleContainer([1:3;]);

julia> typeof(c)
MySimpleContainer{Vector{Int64}}

julia> b = MyAmbiguousContainer(1:3);

julia> typeof(b)
MyAmbiguousContainer{Int64}

julia> b = MyAmbiguousContainer([1:3;]);

julia> typeof(b)
MyAmbiguousContainer{Int64}

julia> d = MyAlsoAmbiguousContainer(1:3);

julia> typeof(d), typeof(d.a)
(MyAlsoAmbiguousContainer, Vector{Int64})

julia> d = MyAlsoAmbiguousContainer(1:1.0:3);

julia> typeof(d), typeof(d.a)
(MyAlsoAmbiguousContainer, Vector{Float64})

For MySimpleContainer, the object is completely determined by its type and parameters, so the compiler can generate optimized functions. In most cases, this is likely to be enough.

While the compiler now does its job perfectly well, there are cases where you might want your code to do different things depending on the element type of a. Usually the best way to achieve this is to wrap the specific operation (here, foo) in a separate function:

julia> function sumfoo(c::MySimpleContainer)
           s = 0
           for x in c.a
               s += foo(x)
           end
           s
       end
sumfoo (generic function with 1 method)

julia> foo(x::Integer) = x
foo (generic function with 1 method)

julia> foo(x::AbstractFloat) = round(x)
foo (generic function with 2 methods)

This simplifies the code and allows the compiler to always generate optimized code.

However, there are cases where you may need to declare different versions of the outer function for different element types or for different AbstractVector types of the field a in MySimpleContainer. You can do it like this:

julia> function myfunc(c::MySimpleContainer{<:AbstractArray{<:Integer}})
           return c.a[1]+1
       end
myfunc (generic function with 1 method)

julia> function myfunc(c::MySimpleContainer{<:AbstractArray{<:AbstractFloat}})
           return c.a[1]+2
       end
myfunc (generic function with 2 methods)

julia> function myfunc(c::MySimpleContainer{Vector{T}}) where T <: Integer
           return c.a[1]+3
       end
myfunc (generic function with 3 methods)

julia> myfunc(MySimpleContainer(1:3))
2

julia> myfunc(MySimpleContainer(1.0:3))
3.0

julia> myfunc(MySimpleContainer([1:3;]))
4

Annotate values taken from untyped locations

It is often convenient to work with data structures that may contain values of any type (arrays of type Array{Any}). But if you are using one of these structures and happen to know the type of an element, it helps to share this knowledge with the compiler:

function foo(a::Array{Any,1})
    x = a[1]::Int32
    b = x+1
    ...
end

Here, we happen to know that the first element of a will be an Int32. An annotation like this has the added benefit of raising a runtime error if the value is not of the expected type, potentially catching certain bugs earlier.

In the case that the type of a[1] is not known precisely, x can be declared via x = convert(Int32, a[1])::Int32. The use of the convert function then allows a[1] to be any object convertible to an Int32 (such as a UInt8), which increases the genericity of the code by loosening the type requirement. Notice that convert itself needs a type annotation in this context to achieve type stability: the compiler cannot deduce the return type of a function, even convert, unless the types of all its arguments are known.
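As a minimal sketch of this idiom (the function name first_plus_one is hypothetical):

function first_plus_one(a::Array{Any,1})
    x = convert(Int32, a[1])::Int32   # accept anything convertible to Int32
    return x + 1
end

first_plus_one(Any[0x05, "unused"])   # the UInt8 element is converted; returns 6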

An annotation will not enhance (and can actually hinder) performance if the type is abstract or constructed at run time. This is because the compiler cannot use the annotation to specialize subsequent code, and the type check itself takes time. For example, in the code:

function nr(a, prec)
    ctype = prec == 32 ? Float32 : Float64
    b = Complex{ctype}(a)
    c = (b + 1.0f0)::Complex{ctype}
    abs(c)
end

the annotation of c harms performance. To write performant code involving types constructed at run time, use the function-barrier technique discussed below, and include the constructed type among the argument types of the kernel function so that the kernel operations are properly specialized by the compiler. For example, in the snippet above, as soon as b is constructed it can be passed to another function k, the kernel. If, for example, k declares its argument b as having type Complex{T}, where T is a type parameter, then a type annotation appearing in an assignment inside k of the form

c = (b + 1.0f0)::Complex{T}

does not hinder performance (but does not help either), because the compiler can determine the type of c at the time k is compiled.
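Putting this together, a minimal sketch of the function-barrier version of the example above (the name nr_barrier is hypothetical; k is the kernel function mentioned in the text):

function nr_barrier(a, prec)
    ctype = prec == 32 ? Float32 : Float64
    b = Complex{ctype}(a)
    return k(b)                   # function barrier: k is compiled for the concrete typeof(b)
end

function k(b::Complex{T}) where {T}
    c = (b + 1.0f0)::Complex{T}   # harmless annotation: T is known when k is compiled
    return abs(c)
end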

Be aware of when Julia avoids specializing

As a heuristic, Julia avoids automatically specializing on the type parameters of arguments in three specific cases: Type, Function, and Vararg. Julia will always specialize when the argument is actually used within the method, but not if the argument is merely passed on to another function. This usually has no performance impact at run time and improves compiler performance. If it does turn out to have a performance impact in your case, you can trigger specialization by adding a type parameter to the method declaration. Here are some examples:

This will not specialize:

function f_type(t)  # or t::Type
    x = ones(t, 10)
    return sum(map(sin, x))
end

but this will:

function g_type(t::Type{T}) where T
    x = ones(T, 10)
    return sum(map(sin, x))
end

These will not specialize:

f_func(f, num) = ntuple(f, div(num, 2))
g_func(g::Function, num) = ntuple(g, div(num, 2))

but this will:

h_func(h::H, num) where {H} = ntuple(h, div(num, 2))

This will not specialize:

f_vararg(x::Int...) = tuple(x...)

but this will:

g_vararg(x::Vararg{Int, N}) where {N} = tuple(x...)

One only needs to introduce a single type parameter to force specialization, even if the other types are unconstrained. For example, this will also specialize, and is useful when the arguments are not all of the same type:

h_vararg(x::Vararg{Any, N}) where {N} = tuple(x...)

Note that the macro @code_typed and related tools always show specialized code, even if Julia would not normally specialize that method call. If you want to check whether specializations are generated when argument types are changed, that is, whether Base.specializations(@which f(...)) contains specializations for the argument in question, you need to inspect the method's internals.
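For example, a minimal check might look like this (assumes Julia 1.10 or later, where Base.specializations is available):

g_type(Float64)                                        # call once so a specialization can exist
collect(Base.specializations(@which g_type(Float64)))  # list the method's compiled specializations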

Splitting functions into multiple definitions

Writing a function as many small definitions allows the compiler to directly call the most applicable code, or even inline it.

Here is an example of a "compound function" that should really be written as multiple definitions:

using LinearAlgebra

function mynorm(A)
    if isa(A, Vector)
        return sqrt(real(dot(A,A)))
    elseif isa(A, Matrix)
        return maximum(svdvals(A))
    else
        error("mynorm: invalid argument")
    end
end

This can be written more concisely and efficiently as:

mynorm(x::Vector) = sqrt(real(dot(x, x)))
mynorm(A::Matrix) = maximum(svdvals(A))

Note, however, that the compiler is generally quite efficient at optimizing away the dead branches in code written like the mynorm example.

Write type-stable functions

When possible, it helps to ensure that a function always returns a value of the same type. Consider the following definition:

pos(x) = x < 0 ? 0 : x

Although this seems innocent enough, the problem is that 0 is an integer (of type Int) while x might be of any type. Thus, depending on the value of x, this function may return a value of either of two types. This behavior is allowed, and may be desirable in some cases. But it can easily be fixed as follows:

pos(x) = x < 0 ? zero(x) : x

There is also a oneunit function, and a more general oftype(x, y) function, which returns y converted to the type of x.
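For example, with the type-stable definition above:

pos(-1)     # returns 0, an Int, matching the input type
pos(-2.5)   # returns 0.0, a Float64, matching the input type
pos(2.5)    # returns 2.5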

Avoid changing the type of a variable

An analogous "type-stability" problem exists for variables that are reused within a function:

function foo()
    x = 1
    for i = 1:10
        x /= rand()
    end
    return x
end

The local variable x starts out as an integer, but after one loop iteration it becomes a floating-point number (the result of the / operator). This makes it harder for the compiler to optimize the body of the loop. There are several ways to fix this (the first option is sketched after the list):

  • initialize x with the value x = 1.0;

  • declare the type of x explicitly as x::Float64 = 1;

  • use an explicit conversion: x = oneunit(Float64);

  • execute the loop for i = 2:10 with initialization x = 1 / rand() in the first iteration.
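For instance, a minimal sketch of the first option (the name foo_stable is hypothetical):

function foo_stable()
    x = 1.0              # x is a Float64 from the start, so its type never changes
    for i = 1:10
        x /= rand()
    end
    return x
end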

Separate kernel functions (i.e., use function barriers)

Many functions follow a pattern of performing some set-up work and then running many iterations to perform a core computation. Where possible, it is a good idea to put these core computations in separate functions. For example, the following contrived function returns an array of a randomly chosen type:

julia> function strange_twos(n)
           a = Vector{rand(Bool) ? Int64 : Float64}(undef, n)
           for i = 1:n
               a[i] = 2
           end
           return a
       end;

julia> strange_twos(3)
3-element Vector{Int64}:
 2
 2
 2

This should be written as:

julia> function fill_twos!(a)
           for i = eachindex(a)
               a[i] = 2
           end
       end;

julia> function strange_twos(n)
           a = Vector{rand(Bool) ? Int64 : Float64}(undef, n)
           fill_twos!(a)
           return a
       end;

julia> strange_twos(3)
3-element Vector{Int64}:
 2
 2
 2

Julia's compiler specializes code for argument types at function boundaries, so in the original implementation the type of a is unknown during the loop (since it is chosen at random). Therefore the second version is generally faster, since the inner loop can be recompiled as part of fill_twos! for different types of a.

The second form is also more correct in terms of style and makes code easier to reuse.

This pattern is used in several places in Julia Base. See, for example, vcat and hcat in abstractarray.jl, or the fill! function, which we could have used instead of writing our own fill_twos!.

Functions like strange_twos occur when dealing with data of uncertain type, for example data loaded from an input file that might contain integers, floating-point numbers, strings, or something else.

Types with values as parameters

Let’s say you want to create an N-dimensional array with a size of 3 on each axis. Such an array can be created as follows:

julia> A = fill(5.0, (3, 3))
3×3 Matrix{Float64}:
 5.0  5.0  5.0
 5.0  5.0  5.0
 5.0  5.0  5.0

This approach works very well: the compiler can figure out that A is an Array{Float64,2} because it knows the type of the fill value (5.0::Float64) and the dimensionality ((3, 3)::NTuple{2,Int}). This means the compiler can generate very efficient code for any further use of A in the same function.

But now suppose you need to write a function that creates a 3×3×... array of arbitrary dimension. An obvious implementation is:

julia> function array3(fillval, N)
           fill(fillval, ntuple(d->3, N))
       end
array3 (generic function with 1 method)

julia> array3(5.0, 2)
3×3 Matrix{Float64}:
 5.0  5.0  5.0
 5.0  5.0  5.0
 5.0  5.0  5.0

This works, but (as you can verify for yourself using @code_warntype array3(5.0, 2)) the problem is that the output type cannot be inferred: the argument N is a value of type Int, and type inference does not (and cannot) predict its value in advance. This means that code using the output of this function has to be conservative, checking the type on each access of A; such code will be very slow.

One very good way to solve such problems is to use the function-barrier technique described above. However, in some cases you might want to eliminate the type instability altogether. In such cases, one approach is to pass the dimensionality as a parameter, for example through Val{T}() (see the section on value types):

julia> function array3(fillval, ::Val{N}) where N
           fill(fillval, ntuple(d->3, Val(N)))
       end
array3 (generic function with 1 method)

julia> array3(5.0, Val(2))
3×3 Matrix{Float64}:
 5.0  5.0  5.0
 5.0  5.0  5.0
 5.0  5.0  5.0

Julia has a specialized version of ntuple that accepts a Val{::Int} instance as the second parameter; by passing N as a type parameter, you make its "value" known to the compiler. Consequently, this version of array3 allows the compiler to predict the return type.

However, making use of such techniques can be surprisingly subtle. For example, it would be of no help if you called array3 from a function like this:

function call_array3(fillval, n)
    A = array3(fillval, Val(n))
end

Here the same problem arises again: the compiler cannot infer n, so it does not know the type of Val(n). Attempting to use Val but doing it incorrectly can easily make performance worse in many situations. (The pattern above pays off only when Val is correctly combined with the function-barrier technique, making the kernel function more efficient.)

An example of correct usage of Val would be:

function filter3(A::AbstractArray{T,N}) where {T,N}
    kernel = array3(1, Val(N))
    filter(A, kernel)
end

In this example, N is passed as a parameter, so its "value" is known to the compiler. Essentially, Val(T) works only when T is either hard-coded/literal (Val(3)) or already specified in the type domain.

The dangers of abusing multiple dispatch (a continuation of types with values as parameters)

Once a programmer appreciates the advantages of multiple dispatch, there is an understandable tendency to go overboard and try to use it for everything. For example, one might use a structure like this to store information:

struct Car{Make, Model}
    year::Int
    ...more fields...
end

and then dispatch on objects like Car{:Honda,:Accord}(year, args...).

This might be worthwhile if either of the following is true:

  • Each Car object requires CPU-intensive processing, which becomes much more efficient if Make and Model are known at compile time, and the total number of different Make or Model values used is not too large.

  • There are homogeneous lists of Car objects of the same type that can be stored in an Array{Car{:Honda,:Accord},N}.

When the latter holds, a function processing such a homogeneous array can be productively specialized: Julia knows the type of each element in advance (all objects in the container have the same concrete type), so it can look up the correct method calls when the function is compiled (eliminating the need to check at run time), and thus emit efficient code for processing the whole list.

When these conditions do not hold, you will most likely get no benefit; worse, the resulting "combinatorial explosion of types" will be counterproductive. If items[i+1] has a different type than items[i], Julia has to look up the type at run time, search for the appropriate method in the method tables, decide (via type intersection) which one matches, determine whether it has been JIT-compiled yet (and compile it if not), and then make the call. In essence, you are asking the full type system and JIT-compilation machinery to execute the equivalent of a branch statement or dictionary lookup in your own code.

A number of benchmark results comparing type dispatch, dictionary lookup, and a branch statement can be found in the https://groups.google.com/forum/#!msg/julia-users/jUMu9A3QKQQ/qjgVWr7vAwAJ [forum discussion].

Perhaps even worse than the run-time impact is the compile-time impact: Julia will compile specialized functions for each different Car{Make, Model}. If you have hundreds or thousands of such types, then every function that accepts such an object as a parameter (whether a custom get_year function you write yourself, the generic push! function from Julia Base, or anything else) will have hundreds or thousands of variants compiled for it. Each of these increases the size of the cache of compiled code, the length of internal method lists, and so on. Excessive enthusiasm for values as parameters can easily squander enormous resources.
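If neither condition above holds, a plain struct with ordinary fields (a hedged sketch; the name PlainCar is hypothetical) sidesteps the combinatorial explosion of types:

struct PlainCar
    make::Symbol
    model::Symbol
    year::Int
end

cars = [PlainCar(:Honda, :Accord, 2019), PlainCar(:Toyota, :Corolla, 2021)]
# All elements share one concrete type, so functions over `cars` compile only once.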

Access arrays in the order they are placed in memory (by columns)

Multidimensional arrays in Julia are stored in column-major order, meaning they are laid out one column at a time. You can verify this using the vec function or the [:] syntax, as shown below (notice that the array is ordered [1 3 2 4], not [1 2 3 4]):

julia> x = [1 2; 3 4]
2×2 Matrix{Int64}:
 1  2
 3  4

julia> x[:]
4-element Vector{Int64}:
 1
 3
 2
 4

This convention for ordering arrays is common in many languages, such as Fortran, MATLAB, and R, to name a few. The alternative to column-major ordering is row-major ordering, which is the convention adopted by C and Python (numpy), among others. The memory ordering of arrays can have significant effects on performance when looping over them. A rule of thumb to keep in mind is that with column-major arrays, the first index changes most rapidly. Essentially this means that looping will be faster if the innermost loop index is the first to appear in a slice expression. Keep in mind that indexing an array with : is an implicit loop that iteratively accesses all elements within a particular dimension; it can be faster to extract columns than rows, for example.

Consider the following contrived example. Suppose we want to write a function that accepts a Vector and returns a square Matrix whose rows or columns are filled with copies of the input vector. Assume it is not important whether the rows or the columns are filled with these copies (perhaps the rest of the code can easily be adapted either way). There are at least four ways to implement this (in addition to the recommended call to the built-in repeat function):

function copy_cols(x::Vector{T}) where T
    inds = axes(x, 1)
    out = similar(Array{T}, inds, inds)
    for i = inds
        out[:, i] = x
    end
    return out
end

function copy_rows(x::Vector{T}) where T
    inds = axes(x, 1)
    out = similar(Array{T}, inds, inds)
    for i = inds
        out[i, :] = x
    end
    return out
end

function copy_col_row(x::Vector{T}) where T
    inds = axes(x, 1)
    out = similar(Array{T}, inds, inds)
    for col = inds, row = inds
        out[row, col] = x[row]
    end
    return out
end

function copy_row_col(x::Vector{T}) where T
    inds = axes(x, 1)
    out = similar(Array{T}, inds, inds)
    for row = inds, col = inds
        out[row, col] = x[col]
    end
    return out
end

Now we time each of these functions using the same random 10000 by 1 input vector:

julia> x = randn(10000);

julia> fmt(f) = println(rpad(string(f)*": ", 14, ' '), @elapsed f(x))

julia> map(fmt, [copy_cols, copy_rows, copy_col_row, copy_row_col]);
copy_cols:    0.331706323
copy_rows:    1.799009911
copy_col_row: 0.415630047
copy_row_col: 1.721531501

Notice that copy_cols is much faster than copy_rows. This is expected because copy_cols respects the column-major memory layout of the Matrix and fills it one column at a time. Additionally, copy_col_row is much faster than copy_row_col because it follows our rule of thumb that the first element to appear in a slice expression should be coupled with the innermost loop.

Pre-allocation of memory for output data

If your function returns an Array or some other complex type, it may have to allocate memory. Unfortunately, allocation and its converse, garbage collection, are often substantial bottlenecks.

Sometimes you can circumvent the need to allocate memory on each function call by preallocating the output. As a trivial example, compare

julia> function xinc(x)
           return [x, x+1, x+2]
       end;

julia> function loopinc()
           y = 0
           for i = 1:10^7
               ret = xinc(i)
               y += ret[2]
           end
           return y
       end;

with

julia> function xinc!(ret::AbstractVector{T}, x::T) where T
           ret[1] = x
           ret[2] = x+1
           ret[3] = x+2
           nothing
       end;

julia> function loopinc_prealloc()
           ret = Vector{Int}(undef, 3)
           y = 0
           for i = 1:10^7
               xinc!(ret, i)
               y += ret[2]
           end
           return y
       end;

Time measurement results:

julia> @time loopinc()
  0.529894 seconds (40.00 M allocations: 1.490 GiB, 12.14% gc time)
50000015000000

julia> @time loopinc_prealloc()
  0.030850 seconds (6 allocations: 288 bytes)
50000015000000

Preallocation has other advantages; for example, it allows the caller to control the output type of the algorithm. In the example above, we could have passed a SubArray rather than an Array, had we so desired.

Taken to its extreme, preallocation can make your code uglier, so performance measurements and some judgment may be required. However, for "vectorized" (element-wise) functions, the convenient syntax x .= f.(y) can be used for in-place operations with fused loops and no temporary arrays (see the section on dot syntax for vectorizing functions).
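For example, a minimal sketch of in-place fused broadcasting into a preallocated output:

y = rand(1000)
x = similar(y)           # preallocated output of the same size and type as y
x .= 2 .* y.^2 .+ 3      # fused element-wise update; no temporary arrays are allocated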

Use MutableArithmetics for more control over allocation of mutable arithmetic types

Some subtypes of Number, such as BigInt or BigFloat, may be implemented as mutable struct types or may have mutable components. The arithmetic interfaces in Julia Base usually opt for convenience over efficiency in such cases, so using them naively may result in suboptimal performance. The abstractions of the https://juliahub.com/ui/Packages/General/MutableArithmetics [MutableArithmetics] package, on the other hand, make it possible to exploit the mutability of such types to write fast code that allocates only as much as necessary. MutableArithmetics can also be used to copy values of mutable arithmetic types explicitly when needed. MutableArithmetics is a user package and is not affiliated with the Julia project.

More dots: fuse vectorized operations

Julia has a special dot syntax that converts any scalar function into a "vectorized" function call and any operator into a "vectorized" operator, with the special property that nested "dot calls" are fused at the syntax level into a single loop, without allocating temporary arrays. If you use .= and similar assignment operators, the result can also be stored in place in a preallocated array (see above).

In the context of linear algebra, this means that even though operations like vector + vector and vector * scalar are defined, it can be advantageous to instead use vector .+ vector and vector .* scalar, because the resulting loops can be fused with surrounding computations. For example, consider the two functions:

julia> f(x) = 3x.^2 + 4x + 7x.^3;

julia> fdot(x) = @. 3x^2 + 4x + 7x^3;  # equivalent to 3 .* x.^2 .+ 4 .* x .+ 7 .* x.^3

Both f and fdot compute the same thing. However, fdot (defined with the help of the @. macro) is significantly faster when applied to an array:

julia> x = rand(10^6);

julia> @time f(x);
  0.019049 seconds (16 allocations: 45.777 MiB, 18.59% gc time)

julia> @time fdot(x);
  0.002790 seconds (6 allocations: 7.630 MiB)

julia> @time f.(x);
  0.002626 seconds (8 allocations: 7.630 MiB)

That is, fdot(x) is ten times faster and allocates one sixth of the memory of f(x), because each * and + operation in f(x) allocates a new temporary array and executes in a separate loop. In this example f.(x) is as fast as fdot(x), but in many contexts it is more convenient to sprinkle some dots in your expressions than to define a separate function for each vectorized operation.

Fewer dots: unfuse certain intermediate broadcasts

The dot-loop fusion mentioned above enables concise and idiomatic code to express highly performant operations. However, it is important to remember that the fused operation is computed at every iteration of the broadcast. This means that in some situations, particularly in the presence of composed or multidimensional broadcasts, an expression with dot calls may be computing a function more times than intended. For example, say we want to build a random matrix whose rows have Euclidean norm one. We might write something like the following:

julia> x = rand(1000, 1000);

julia> d = sum(abs2, x; dims=2);

julia> @time x ./= sqrt.(d);
  0.002049 seconds (4 allocations: 96 bytes)

This works. However, this expression will actually recompute sqrt(d[i]) for every element in the row x[i, :], meaning that many more square roots are computed than necessary. To see precisely over which indices the broadcast will iterate, you can call Broadcast.combine_axes on the arguments of the fused expression. This returns a tuple of ranges whose entries correspond to the axes of iteration; the product of the lengths of these ranges is the total number of calls to the fused operation.

It follows that when some components of the broadcast expression are constant along an axis, like sqrt along the second dimension in the preceding example, there is potential for a performance improvement by forcibly "unfusing" those components, i.e., allocating the result of the broadcast operation up front and reusing the cached value along its constant axis. Some possible ways to do this are to use temporary variables, wrap components of a dot expression in identity, or use an equivalent intrinsically vectorized (but unfused) function.

julia> @time let s = sqrt.(d); x ./= s end;
  0.000809 seconds (5 allocations: 8.031 KiB)

julia> @time x ./= identity(sqrt.(d));
  0.000608 seconds (5 allocations: 8.031 KiB)

julia> @time x ./= map(sqrt, d);
  0.000611 seconds (4 allocations: 8.016 KiB)

Whichever way the unfusing is done, we get roughly a threefold speedup in this case; for larger broadcast objects the speedup can be asymptotically very large.

Use views for slices

In Julia, an array "slice" expression like array[1:5, :] creates a copy of that data (except when it appears on the left-hand side of an assignment, where array[1:5, :] = ... assigns in place to that portion of array). If you are doing many operations on the slice, this can be good for performance, because it is more efficient to work with a small contiguous copy than to index into the original array. On the other hand, if you are only doing a few simple operations on the slice, the cost of the allocation and copy can be substantial.

An alternative is to create a "view" of the array, which is an array object (a SubArray) that references the data of the original array in place, without making a copy. (If you write to a view, it modifies the original array's data as well.) This can be done for individual slices by calling the view function, or more simply for a whole expression or block of code by putting @views in front of the expression. For example:

julia> fcopy(x) = sum(x[2:end-1]);

julia> @views fview(x) = sum(x[2:end-1]);

julia> x = rand(10^6);

julia> @time fcopy(x);
  0.003051 seconds (3 allocations: 7.629 MB)

julia> @time fview(x);
  0.001020 seconds (1 allocation: 16 bytes)

Note that the fview version of the function is three times faster and allocates less memory.

Copying data is not always a bad thing

Arrays are stored contiguously in memory, lending themselves to CPU vectorization and fewer memory accesses due to caching. These are the same reasons it is recommended to access arrays in column-major order (see above). Irregular access patterns and non-contiguous views can drastically slow down computations on arrays because of non-sequential memory access.

Copying irregularly-accessed data into a contiguous array before accessing it repeatedly can result in a large speedup, as in the example below. Here, a matrix is accessed at randomly shuffled indices and then multiplied. Copying into plain arrays speeds up the multiplication even with the added cost of the copy and allocation.

julia> using Random

julia> A = randn(3000, 3000);

julia> x = randn(2000);

julia> inds = shuffle(1:3000)[1:2000];

julia> function iterated_neural_network(A, x, depth)
           for _ in 1:depth
               x .= max.(0, A * x)
           end
           argmax(x)
       end

julia> @time iterated_neural_network(view(A, inds, inds), x, 10)
  0.324903 seconds (12 allocations: 157.562 KiB)
1569

julia> @time iterated_neural_network(A[inds, inds], x, 10)
  0.054576 seconds (13 allocations: 30.671 MiB, 13.33% gc time)
1569

Provided there is enough memory, the cost of copying the view into an array is outweighed by the speed gain from performing the repeated multiplications on a contiguous array.

Use StaticArrays.jl for small fixed-size vector and matrix operations

If your application involves many small (< 100 element) arrays of fixed size (i.e., the size is known prior to execution), then you might want to consider the https://github.com/JuliaArrays/StaticArrays.jl [StaticArrays.jl] package. It allows you to represent such arrays in a way that avoids unnecessary heap allocations and lets the compiler specialize code for the size of the array, e.g., by completely unrolling vector operations (eliminating the loops) and storing elements in CPU registers.

For example, if you are doing 2D geometry computations, you might have many operations on 2-component vectors. Using the SVector type from StaticArrays.jl, you get convenient vector notation and operations like norm(3v - w) on vectors v and w, and the compiler can unroll the code to a minimal computation equivalent to @inbounds hypot(3v[1]-w[1], 3v[2]-w[2]).
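For example, a minimal sketch (assumes the StaticArrays.jl package is installed):

using LinearAlgebra, StaticArrays

v = SVector(1.0, 2.0)
w = SVector(3.0, 4.0)
norm(3v - w)    # the size is known at compile time: the loop is unrolled and nothing is heap-allocated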

Avoid string interpolation for I/O

When writing data to a file (or other I/O device), forming extra intermediate strings is a source of overhead. Instead of:

println(file, "$a $b")

use this one:

println(file, a, " ", b)

The first version of the code forms a string and then writes it to the file, while the second version writes the values directly to the file. Also note that in some cases string interpolation can be harder to read. Compare:

println(file, "$(f(a))$(f(b))")

with this option:

println(file, f(a), f(b))

Optimize network I/O during parallel execution

When executing a remote function in parallel, this code:

using Distributed

responses = Vector{Any}(undef, nworkers())
@sync begin
    for (idx, pid) in enumerate(workers())
        @async responses[idx] = remotecall_fetch(foo, pid, args...)
    end
end

will be faster than this:

using Distributed

refs = Vector{Any}(undef, nworkers())
for (idx, pid) in enumerate(workers())
    refs[idx] = @spawnat pid foo(args...)
end
responses = [fetch(r) for r in refs]

The first version results in a single network round trip to each worker, while the second results in two network calls per worker: the first from the @spawnat macro and the second from fetch (or even wait). The fetch/wait calls are also executed serially, which results in overall poorer performance.

Fix deprecation warnings

A deprecated function internally performs a lookup in order to print its warning only once. This extra lookup can cause a significant slowdown, so all uses of deprecated functions should be modified as suggested by the warnings.

The subtleties of optimization

These are some minor points that may be worth considering in tight inner loops.

Performance Annotations

Sometimes you can enable better optimization by promising certain program properties.

  • Use the @inbounds macro to eliminate array bounds checking within expressions. Be certain before doing this: if the indices are ever out of bounds, you may suffer crashes or silent data corruption.

  • Use the @fastmath macro to allow floating-point optimizations that are correct for real numbers but lead to differences for IEEE numbers. Be careful when doing this, as it may change numerical results. This corresponds to the -ffast-math option of clang.

  • Write @simd in front of for loops to assert that the iterations are independent and may be reordered. Note that in many cases Julia can automatically vectorize code without the @simd macro; it is only beneficial in cases where such a transformation would otherwise be illegal, including cases like allowing floating-point reassociativity and ignoring dependent memory accesses (@simd ivdep). Again, be very careful when asserting @simd, as erroneously annotating a loop with dependent iterations may lead to unexpected results. In particular, note that setindex! on some AbstractArray subtypes is inherently dependent on iteration order. This feature is experimental and could change or disappear in future versions of Julia.

The common idiom of using 1:n to index into an AbstractArray is not safe if the array uses unconventional indexing, and may cause a segmentation fault if bounds checking is turned off. Use LinearIndices(x) or eachindex(x) instead (see also the section Arrays with custom indexes).

While @simd needs to be placed directly in front of the innermost for loop, both @inbounds and @fastmath can be applied either to single expressions or to all expressions within nested blocks of code, e.g., using @inbounds begin or @inbounds for ....

Here is an example with both @inbounds and @simd markup (we use @noinline here to prevent the optimizer from being too clever and defeating our benchmark):

@noinline function inner(x, y)
    s = zero(eltype(x))
    for i=eachindex(x)
        @inbounds s += x[i]*y[i]
    end
    return s
end

@noinline function innersimd(x, y)
    s = zero(eltype(x))
    @simd for i = eachindex(x)
        @inbounds s += x[i] * y[i]
    end
    return s
end

function timeit(n, reps)
    x = rand(Float32, n)
    y = rand(Float32, n)
    s = zero(Float64)
    time = @elapsed for j in 1:reps
        s += inner(x, y)
    end
    println("GFlop/sec        = ", 2n*reps / time*1E-9)
    time = @elapsed for j in 1:reps
        s += innersimd(x, y)
    end
    println("GFlop/sec (SIMD) = ", 2n*reps / time*1E-9)
end

timeit(1000, 1000)

On a computer with a 2.4GHz Intel Core i5 processor, the result will be as follows:

GFlop/sec        = 1.9467069505224963
GFlop/sec (SIMD) = 17.578554163920018

(Performance is measured in GFlop/sec, and the higher the score, the better.)

Here is an example using all three kinds of markup. This program first computes the finite difference of a one-dimensional array, and then evaluates the L2 norm of the result:

function init!(u::Vector)
    n = length(u)
    dx = 1.0 / (n-1)
    @fastmath @inbounds @simd for i in 1:n  # by asserting that `u` is a `Vector`, we can assume 1-based indexing
        u[i] = sin(2pi*dx*i)
    end
end

function deriv!(u::Vector, du)
    n = length(u)
    dx = 1.0 / (n-1)
    @fastmath @inbounds du[1] = (u[2] - u[1]) / dx
    @fastmath @inbounds @simd for i in 2:n-1
        du[i] = (u[i+1] - u[i-1]) / (2*dx)
    end
    @fastmath @inbounds du[n] = (u[n] - u[n-1]) / dx
end

function mynorm(u::Vector)
    n = length(u)
    T = eltype(u)
    s = zero(T)
    @fastmath @inbounds @simd for i in 1:n
        s += u[i]^2
    end
    @fastmath @inbounds return sqrt(s)
end

function main()
    n = 2000
    u = Vector{Float64}(undef, n)
    init!(u)
    du = similar(u)

    deriv!(u, du)
    nu = mynorm(du)

    @time for i in 1:10^6
        deriv!(u, du)
        nu = mynorm(du)
    end

    println(nu)
end

main()

On a computer with an Intel Core i7 processor running at 2.7 GHz, the result will be as follows:

$ julia wave.jl;
  1.207814709 seconds
4.443986180758249

$ julia --math-mode=ieee wave.jl;
  4.487083643 seconds
4.443986180758249

Here, the --math-mode=ieee parameter disables the @fastmath macro so that the results can be compared.

Here, the speedup from @fastmath is a factor of about 3.7. This is unusually large; in general the speedup will be smaller. (In this particular example, the working set of the benchmark is small enough to fit into the processor's L1 cache, so memory access latency does not play a role and computing time is dominated by CPU usage. In many real-world programs this is not the case.) Also, in this case the optimization does not change the result, but in general the result will be slightly different. In some cases, especially for numerically unstable algorithms, the result can differ significantly.

What @fastmath does is rearrange floating-point expressions, e.g., changing the order of evaluation or assuming that certain special cases (inf, nan) cannot occur. In this case (and on this particular computer), the main difference is that the expression 1 / (2*dx) in the function deriv! is hoisted out of the loop (i.e., computed outside the loop), as if one had written idx = 1 / (2*dx). Inside the loop, the expression ... / (2*dx) then becomes ... * idx, which is much faster to evaluate. Of course, both the actual optimization applied by the compiler and the resulting speedup depend very much on the hardware. You can examine the change in generated code by using Julia's code_native function.

Note that @fastmath also assumes that NaN values will not occur during the computation, which can lead to surprising behavior:

julia> f(x) = isnan(x);

julia> f(NaN)
true

julia> f_fast(x) = @fastmath isnan(x);

julia> f_fast(NaN)
false

Treat subnormal numbers as zeros

Subnormal numbers, formerly called https://en.wikipedia.org/wiki/Denormal_number [denormal numbers], are useful in many contexts, but incur a performance penalty on some hardware. A call to set_zero_subnormals(true) grants permission for floating-point operations to treat subnormal inputs or outputs as zeros, which may improve performance on some hardware. A call to set_zero_subnormals(false) enforces strict IEEE behavior for subnormal numbers.

The following is an example of the significant impact of subnormal numbers on performance on some hardware.

function timestep(b::Vector{T}, a::Vector{T}, Δt::T) where T
    @assert length(a)==length(b)
    n = length(b)
    b[1] = 1                            # Boundary condition
    for i=2:n-1
        b[i] = a[i] + (a[i-1] - T(2)*a[i] + a[i+1]) * Δt
    end
    b[n] = 0                            # Boundary condition
end

function heatflow(a::Vector{T}, nstep::Integer) where T
    b = similar(a)
    for t=1:div(nstep,2)                # Assume nstep is even
        timestep(b,a,T(0.1))
        timestep(a,b,T(0.1))
    end
end

heatflow(zeros(Float32,10),2) # Forced compilation
for trial=1:6
    a = zeros(Float32,1000)
    set_zero_subnormals(iseven(trial))  # Odd trials use strict IEEE arithmetic
    @time heatflow(a,1000)
end

The result will be something like this:

  0.002202 seconds (1 allocation: 4.063 KiB)
  0.001502 seconds (1 allocation: 4.063 KiB)
  0.002139 seconds (1 allocation: 4.063 KiB)
  0.001454 seconds (1 allocation: 4.063 KiB)
  0.002115 seconds (1 allocation: 4.063 KiB)
  0.001455 seconds (1 allocation: 4.063 KiB)

Note that each even iteration runs significantly faster.

In this example, many subnormal numbers are generated because the values in a form an exponentially decreasing curve that gradually flattens out.

Treating subnormals as zeros should be used with caution, because doing so breaks some identities, such as x-y == 0 implies x == y:

julia> x = 3f-38; y = 2f-38;

julia> set_zero_subnormals(true); (x - y, x == y)
(0.0f0, false)

julia> set_zero_subnormals(false); (x - y, x == y)
(1.0000001f-38, false)

In some applications, an alternative to zeroing subnormal numbers is to inject a tiny bit of noise. For example, instead of initializing a with zeros, initialize it with:

a = rand(Float32,1000) * 1.f-9

@code_warntype

The macro @code_warntype (or its function variant code_warntype) can sometimes be helpful in diagnosing type-related problems. Here is an example:

julia> @noinline pos(x) = x < 0 ? 0 : x;

julia> function f(x)
           y = pos(x)
           return sin(y*x + 1)
       end;

julia> @code_warntype f(3.2)
MethodInstance for f(::Float64)
  from f(x) @ Main REPL[9]:1
Arguments
  #self#::Core.Const(f)
  x::Float64
Locals
  y::Union{Float64, Int64}
Body::Float64
1 ─      (y = Main.pos(x))
│   %2 = (y * x)::Float64
│   %3 = (%2 + 1)::Float64
│   %4 = Main.sin(%3)::Float64
└──      return %4

Interpreting the output of @code_warntype, like that of its cousins @code_lowered, @code_typed, @code_llvm, and @code_native, takes a little practice. Your code is being presented in a form that has been heavily digested on its way to generating compiled machine code. Most of the expressions are annotated by a type, indicated as ::T (where T might be Float64, for example). The most important characteristic of @code_warntype is that non-concrete types are displayed in red; since this document is written in Markdown, which has no color, red text is denoted here by uppercase.

The first line shows the inferred return type of the function as Body::Float64. The following lines represent the body of f in Julia's SSA intermediate representation. The numbered boxes are labels that represent targets of jumps (via goto) in your code. Looking at the body, you can see that pos is called first and its return value is inferred as the Union type Union{Float64, Int64}, shown in uppercase since it is not concrete. This means that the exact return type of pos cannot be determined from the input types. However, the result of y*x is a Float64 regardless of whether y is a Float64 or an Int64. The net result is that f(x::Float64) will not be type-unstable in its output, even if some of the intermediate computations are type-unstable.

How you use this information is up to you. Obviously, it would be best to fix pos so that it is type-stable: in that case all the variables in f would be concrete and performance would be optimal. However, in some situations this kind of ephemeral type instability does not matter much. For example, if pos is never used in isolation, the fact that the return type of f is stable (for Float64 inputs) shields later code from the effects of type instability. This is particularly relevant in cases where eliminating the type instability is difficult or impossible. In such situations, the tips above (for example, adding type annotations or breaking functions apart) are the best way to contain the "damage" of type instability. Note, too, that even Julia's Base module has functions that are type-unstable. For example, findfirst returns the index of the position in an array where a key is found, or nothing if it is not found, a clear case of type instability. To make it easier to spot the kinds of type instability that are likely to matter, Unions containing missing or nothing are highlighted in yellow rather than red.
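
For illustration, here is a quick REPL check of this behavior; the inferred return type of such calls is a Union of Int64 and Nothing:

julia> findfirst(==(2), [1, 2, 3])
2

julia> findfirst(==(4), [1, 2, 3]) === nothing
true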

Below are some examples that can help you interpret expressions marked as containing non-concrete types:

  • A function body starting with Body::Union{T1,T2}

    • Interpretation: a function with an unstable return type.

    • Recommendation: make the return value type-stable, even if you have to annotate it.

  • invoke Main.g(%%x::Int64)::Union{Float64, Int64}

    • Interpretation: a call to the type-unstable function g.

    • Recommendation: fix the function or, if necessary, annotate the return value.

  • invoke Base.getindex(%%x::Array{Any,1}, 1::Int64)::Any

    • Interpretation: accessing elements of poorly typed arrays.

    • Recommendation: use arrays with better-defined element types or, if necessary, annotate the types of individual element accesses.

  • Base.getfield(%%x, :(:data))::Array{Float64,N} where N

    • Interpretation: getting a field of a non-concrete type. In this example, the type of x, say ArrayContainer, has a field data::Array{T}; but for Array to be a concrete type, the dimension N must also be known.

    • Recommendation: use concrete types such as Array{T,3} or Array{T,N}, where N is a parameter of ArrayContainer (see the sketch after this list).
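
As an illustration of the last recommendation, here is a minimal sketch; the type names are hypothetical and only mirror the example above:

# Non-concrete field: the dimension N is unknown, so accesses to data are poorly typed
struct SlowContainer{T}
    data::Array{T}
end

# Concrete field: N is a type parameter, so data::Array{T,N} is fully concrete
struct ArrayContainer{T,N}
    data::Array{T,N}
end

c = ArrayContainer(rand(3, 3, 3))   # T and N are inferred: c.data isa Array{Float64,3}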

Performance of captured variables

Consider the following example, which defines an inner function:

function abmult(r::Int)
    if r < 0
        r = -r
    end
    f = x -> x * r
    return f
end

The function abmult returns a function f that multiplies its argument by the absolute value of r. The inner function assigned to f is called a closure. Inner functions are also used by the language for do blocks and generator expressions.

This style of code presents performance challenges for the language. When translating the code above into lower-level instructions, the parser substantially reorganizes it by extracting the inner function into a separate code block. "Captured" variables such as r, which are shared between inner functions and the enclosing scope, are also extracted into a heap-allocated "box" accessible to both the inner and the outer function. The reason is that the language specification requires r in the inner scope to be identical to r in the outer scope even after the outer scope (or another inner function) modifies r.

The parser mentioned in the previous paragraph refers to the stage of compilation at which the module containing abmult is first loaded, as opposed to the later stage at which it is first called. The parser does not "know" that Int is a fixed type, or that the statement r = -r transforms one Int value into another Int. The magic of type inference happens at the later stage of compilation.

Thus, the parser does not know that r has a fixed type (Int), nor that r does not change value once the inner function is created (so that the box would be unneeded). Therefore, the parser emits code for a box holding an object of an abstract type such as Any, which requires run-time type dispatch for each occurrence of r. This can be verified by applying @code_warntype to the function above (see below). Both the boxing and the run-time type dispatch can reduce performance.
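
A quick way to check this, assuming abmult has already been defined in the current session:

julia> @code_warntype abmult(3)

In the output, the captured variable typically appears wrapped in a Core.Box, and reads of it are inferred as Any.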

If captured variables are used in a performance-critical section of the code, the following tips can help. First, if it is known that a captured variable does not change type, this can be declared explicitly with a type annotation (on the variable itself, not on the right-hand side):

function abmult2(r0::Int)
    r::Int = r0
    if r < 0
        r = -r
    end
    f = x -> x * r
    return f
end

The type annotation partially recovers the performance lost to capturing, because the parser can associate a concrete type with the object in the box. Going further, if the captured variable does not need to be boxed at all (because it is not reassigned after the closure is created), this can be indicated with a let block, as follows.

function abmult3(r::Int)
    if r < 0
        r = -r
    end
    f = let r = r
            x -> x * r
    end
    return f
end

The let block creates a new variable r whose scope is only the inner function. The second technique fully recovers performance in the presence of captured variables. Keep in mind that this is a rapidly evolving aspect of the compiler, and it is likely that future versions will not require this degree of annotation to achieve full performance. In the meantime, community packages such as https://github.com/c42f/FastClosures.jl [FastClosures] automate the insertion of let statements as in abmult3; a sketch follows.
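
Assuming FastClosures is installed, its @closure macro can be used to obtain the same effect as the explicit let block in abmult3; this is a minimal sketch:

using FastClosures

function abmult4(r::Int)
    if r < 0
        r = -r
    end
    # @closure rewrites the anonymous function so that r is captured through a let binding
    f = @closure x -> x * r
    return f
end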

Multithreading and linear algebra

This section applies to multithreaded Julia code that performs linear algebra operations in each thread. These linear algebra operations involve BLAS/LAPACK calls, which are themselves multithreaded. In this situation, care must be taken not to oversubscribe the CPU cores with two different kinds of multithreading.

Julia compiles and uses its own copy of OpenBLAS for linear algebra, with the number of threads controlled by the environment variable OPENBLAS_NUM_THREADS. It can either be set as an environment variable when launching Julia, or changed during a Julia session with BLAS.set_num_threads(N) (the BLAS submodule is exported by using LinearAlgebra). Its current value can be queried with BLAS.get_num_threads().
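
For example, a minimal sketch of adjusting this setting from within a session:

using LinearAlgebra

BLAS.get_num_threads()   # query the current number of BLAS threads
BLAS.set_num_threads(1)  # e.g. use a single BLAS thread when Julia itself is multithreaded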

When the user does not specify anything, Julia tries to choose a reasonable value for the number of OpenBLAS threads (for example, based on the platform, the Julia version, etc.). However, it is usually recommended to check the value and set it manually. OpenBLAS behaves as follows:

  • If OPENBLAS_NUM_THREADS=1, OpenBLAS uses the calling Julia thread(s), i.e. it "lives" in the Julia thread that runs the computation.

  • If OPENBLAS_NUM_THREADS=N>1, OpenBLAS creates and manages its own thread pool (the total number is N). There is only one OpenBLAS thread pool used by all Julia threads.

When running Julia in multithreaded mode with JULIA_NUM_THREADS=X, it is usually recommended to set OPENBLAS_NUM_THREADS=1. Given the behavior described above, increasing the number of BLAS threads to N>1 can easily lead to performance degradation, especially when N<<X. However, this is only a rule of thumb, and the best way to choose each number of threads is to experiment with your specific application.
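
One way to follow this rule of thumb, assuming a Unix-like shell and four Julia threads, is to set both variables when launching Julia:

$ JULIA_NUM_THREADS=4 OPENBLAS_NUM_THREADS=1 julia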

Alternative linear algebra backends

As an alternative to OpenBLAS, there are several other backends that can help with linear algebra performance. Notable examples are https://github.com/JuliaLinearAlgebra/MKL.jl [MKL.jl] and https://github.com/JuliaMath/AppleAccelerate.jl [AppleAccelerate.jl].

These are external packages, so we won’t discuss them in detail here. See the relevant documentation (especially because their behavior regarding multithreading differs from that of OpenBLAS).

Execution latency, package load times, and precompilation

Reducing the time to first plot, etc.

The first time a Julia method is called, it (and any methods it calls, or any that can be statically determined) will be compiled. The @time family of macros demonstrates this:

julia> foo() = rand(2,2) * rand(2,2)
foo (generic function with 1 method)

julia> @time @eval foo();
  0.252395 seconds (1.12 M allocations: 56.178 MiB, 2.93% gc time, 98.12% compilation time)

julia> @time @eval foo();
  0.000156 seconds (63 allocations: 2.453 KiB)

Note that @time @eval is better suited for measuring compilation time, because without @eval part of the compilation may already have been done before the timing begins.

When developing a package, you can improve the future experience of its users with precompilation, so that the code they use is already compiled. To precompile package code effectively, it is recommended to use https://julialang.github.io/PrecompileTools.jl/stable/ [PrecompileTools.jl] to run a "precompilation workload" during precompilation. This workload should represent typical usage of the package; it caches the resulting compiled code in the pkgimage cache, significantly reducing the "time to first execution" (often called TTFX) for such usage.
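
A minimal sketch of such a workload, placed inside the package module and assuming the package exports a hypothetical function frobnicate:

using PrecompileTools

@setup_workload begin
    # Code here runs at precompile time but is not itself cached
    data = rand(100)
    @compile_workload begin
        # Calls made here are compiled and cached into the package image
        frobnicate(data)            # hypothetical function exported by this package
        frobnicate(Float32.(data))  # cover another commonly used method
    end
end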

If you do not want to spend the extra time on precompilation, which may be relevant during development of a package, the https://julialang.github.io/PrecompileTools.jl/stable/ [PrecompileTools.jl] workloads can be disabled, and sometimes configured, via preference settings.

Reducing package loading time

It is usually good to keep package loading times low where possible. Package developers can use the following tips.

  1. Reduce the number of dependencies, keeping only the ones you really need. Consider using package extensions to support interaction with other packages without bloating the set of core dependencies (see the Project.toml sketch after this list).

  2. Avoid __init__() functions unless there is no alternative, especially those that may trigger a lot of compilation or simply take a long time to execute.

  3. Where possible, fix https://julialang.org/blog/2020/08/invalidations/ [invalidations] both in your dependencies and in the package code.
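
Regarding item 1, a package extension is declared in the package's Project.toml. Below is a minimal sketch with placeholder names and a placeholder UUID:

[weakdeps]
SomeWeakDep = "00000000-0000-0000-0000-000000000000"  # replace with the real UUID of SomeWeakDep

[extensions]
SomeWeakDepExt = "SomeWeakDep"  # code in ext/SomeWeakDepExt.jl loads only when SomeWeakDep is loaded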

The @time_imports macro in the REPL can be used to investigate these points:

julia> @time @time_imports using Plots
      0.5 ms  Printf
     16.4 ms  Dates
      0.7 ms  Statistics
               ┌ 23.8 ms SuiteSparse_jll.__init__() 86.11% compilation time (100% recompilation)
     90.1 ms  SuiteSparse_jll 91.57% compilation time (82% recompilation)
      0.9 ms  Serialization
               ┌ 39.8 ms SparseArrays.CHOLMOD.__init__() 99.47% compilation time (100% recompilation)
    166.9 ms  SparseArrays 23.74% compilation time (100% recompilation)
      0.4 ms  Statistics → SparseArraysExt
      0.5 ms  TOML
      8.0 ms  Preferences
      0.3 ms  PrecompileTools
      0.2 ms  Reexport
... many deps omitted for example ...
      1.4 ms  Tar
               ┌ 73.8 ms p7zip_jll.__init__() 99.93% compilation time (100% recompilation)
     79.4 ms  p7zip_jll 92.91% compilation time (100% recompilation)
               ┌ 27.7 ms GR.GRPreferences.__init__() 99.77% compilation time (100% recompilation)
     43.0 ms  GR 64.26% compilation time (100% recompilation)
               ┌ 2.1 ms Plots.__init__() 91.80% compilation time (100% recompilation)
    300.9 ms  Plots 0.65% compilation time (100% recompilation)
  1.795602 seconds (3.33 M allocations: 190.153 MiB, 7.91% gc time, 39.45% compilation time: 97% of which was recompilation)

Note that several packages are loaded in this example, and some of them have __init__() functions, several of which cause compilation and others recompilation. Recompilation is caused by earlier packages invalidating methods; when subsequent packages then run their __init__() functions, they must recompile before their code can execute.

Also note that the Statistics extension SparseArraysExt was activated because SparseArrays is in the dependency tree, i.e. see the line 0.4 ms Statistics → SparseArraysExt.

This report is a good opportunity to consider whether the loading cost of your dependencies is worth the functionality they bring. The Pkg utility why can be used to find out why an indirect dependency exists:

(CustomPackage) pkg> why FFMPEG_jll
  Plots → FFMPEG → FFMPEG_jll
  Plots → GR → GR_jll → FFMPEG_jll

Alternatively, to see the indirect dependencies that a package introduces, you can pkg> rm the package, inspect the dependencies that are removed from the manifest, and then revert the change with pkg> undo, as in the example below.
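
For example (the Pkg output listing the manifest changes is omitted):

(CustomPackage) pkg> rm Plots    # the removed manifest entries show which indirect dependencies go away
(CustomPackage) pkg> undo        # restore the previous state of the project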

If loading time is dominated by slow __init__() methods that involve compilation, one verbose way to identify what is being compiled is to use the Julia argument --trace-compile=stderr, which reports a precompile statement each time a method is compiled. For example, the full setup would look like this:

$ julia --startup-file=no --trace-compile=stderr
julia> @time @time_imports using CustomPackage
...

Note the --startup-file=no flag, which helps isolate the test from packages you may have in startup.jl.

A more detailed analysis of the causes of recompilation can be performed using the package https://github.com/timholy/SnoopCompile.jl [SnoopCompile].