Multithreading
A presentation of Julia’s multithreading features is given in this https://julialang.org/blog/2019/07/multithreading/[blog post].
Running Julia with multiple threads
By default, Julia runs with a single thread of execution. You can check this using the command Threads.nthreads().
julia> Threads.nthreads()
1
The number of execution threads is controlled either by the command line argument -t/--threads or by the environment variable JULIA_NUM_THREADS. When both are specified, -t/--threads takes priority.
The number of threads can be set either as an integer (--threads=4) or as auto (--threads=auto), where auto tries to choose a sensible default number of threads to use (for more information, see the page Command Line Options).
Compatibility: Julia 1.5
The -t/--threads command line argument requires at least Julia 1.5. In older versions, use the environment variable instead.
Compatibility: Julia 1.7
Using auto as the value of the environment variable JULIA_NUM_THREADS requires at least Julia 1.7.
$ julia --threads 4
Let’s check that 4 threads are available to us.
julia> Threads.nthreads()
4
However, we are currently in the main thread. To check this, use the function Threads.threadid().
julia> Threads.threadid()
1
If you prefer to use an environment variable, you can set it as follows in Bash (Linux/macOS):
export JULIA_NUM_THREADS=4
In C shell on Linux/macOS, or in CMD on Windows:
set JULIA_NUM_THREADS=4
In PowerShell on Windows:
$env:JULIA_NUM_THREADS=4
Note that this must be done *before* starting Julia.
The number of threads specified with -t/--threads is propagated to worker processes spawned using the -p/--procs or --machine-file command line options.
Multiple garbage collector threads
The garbage collector can use multiple threads. The number used is either half the number of compute worker threads, or can be configured with the --gcthreads command line argument or the JULIA_NUM_GC_THREADS environment variable.
Compatibility: Julia 1.10
The --gcthreads command line argument requires at least Julia 1.10.
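As an illustrative sketch (the value 4 is arbitrary), the number of garbage collector threads can be set at startup:

```
$ julia --gcthreads 4
```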
Thread pools
When program threads are busy with many tasks, tasks can experience delays, which can hurt the responsiveness and interactivity of the program. To address this, you can mark a task as interactive when scheduling it with Threads.@spawn:
using Base.Threads
@spawn :interactive f()
Interactive tasks should avoid performing high-latency operations, and if they are long-running tasks, they should yield frequently.
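As a minimal sketch, an interactive task can wrap a latency-sensitive piece of work (the computation here is a placeholder):

```julia
using Base.Threads

# schedule a small, latency-sensitive task on the :interactive pool
t = @spawn :interactive begin
    sum(1:100)   # placeholder for quick, responsive work
end

fetch(t)   # returns 5050
```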
Julia can be started with one or more threads reserved for running interactive tasks:
$ julia --threads 3,1
The environment variable JULIA_NUM_THREADS can be used in the same way:
export JULIA_NUM_THREADS=3,1
This starts Julia with 3 threads in the :default thread pool and 1 thread in the :interactive thread pool:
julia> using Base.Threads
julia> nthreadpools()
2
julia> threadpool() # the main thread is in the interactive thread pool
:interactive
julia> nthreads(:default)
3
julia> nthreads(:interactive)
1
julia> nthreads()
3
The zero-argument version of nthreads returns the number of threads in the default pool.
Depending on whether Julia was started with interactive threads, the main thread is in either the default or the interactive thread pool.
Either or both of these numbers can be replaced with the word auto, which causes Julia to choose a reasonable default.
The `@threads` macro
Let’s look at a simple example using native threads. Create an array of zeros:
julia> a = zeros(10)
10-element Vector{Float64}:
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
We will operate on this array simultaneously using 4 threads, with each thread writing its thread ID into its assigned positions.
Julia supports parallel loops using the Threads.@threads macro. This macro is placed in front of a for loop to indicate to Julia that the loop is a multithreaded region.
julia> Threads.@threads for i = 1:10
a[i] = Threads.threadid()
end
The iteration space is split among the threads, after which each thread writes its thread ID to its assigned positions.
julia> a
10-element Vector{Float64}:
1.0
1.0
1.0
2.0
2.0
2.0
3.0
3.0
4.0
4.0
Note that Threads.@threads does not have an optional reduction parameter like @distributed does.
Using @threads without data races
The concept of a data race is discussed in detail in the section on communication and data races between threads. For now, just know that a data race can result in incorrect results and dangerous errors.
Suppose we want to make the function sum_single below multithreaded.
julia> function sum_single(a)
s = 0
for i in a
s += i
end
s
end
sum_single (generic function with 1 method)
julia> sum_single(1:1_000_000)
500000500000
Simply adding @threads causes a data race, with multiple threads reading and writing s simultaneously.
julia> function sum_multi_bad(a)
s = 0
Threads.@threads for i in a
s += i
end
s
end
sum_multi_bad (generic function with 1 method)
julia> sum_multi_bad(1:1_000_000)
70140554652
Note that the result is not 500000500000 as it should be, and will most likely change on each evaluation.
To fix this, you can use task-local buffers to divide the sum into race-free chunks. Here sum_single is reused, with its own internal buffer s. The input vector a is split into nthreads() chunks for parallel work. We then use Threads.@spawn to create tasks that sum each chunk individually. Finally, we sum the results from each task, again using sum_single:
julia> function sum_multi_good(a)
chunks = Iterators.partition(a, length(a) ÷ Threads.nthreads())
tasks = map(chunks) do chunk
Threads.@spawn sum_single(chunk)
end
chunk_sums = fetch.(tasks)
return sum_single(chunk_sums)
end
sum_multi_good (generic function with 1 method)
julia> sum_multi_good(1:1_000_000)
500000500000
Buffers should not be managed based on threadid(), i.e. buffers = zeros(Threads.nthreads()), because concurrent tasks can yield, meaning that multiple concurrent tasks may use the same buffer on a given thread, introducing the risk of data races.
Another option is to use atomic operations on variables shared between tasks/threads, which may be more performant depending on the characteristics of the operations.
Communication and data races between threads
Although Julia’s threads can communicate through shared memory, it is notoriously difficult to write correct multithreaded code that is free of data races. Julia’s Channels are thread-safe and can be used for safe communication. The sections below explain how to use locks and atomic operations to avoid data races.
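For instance, here is a minimal sketch of two tasks communicating safely through a Channel, without any shared mutable state (the squaring worker is purely illustrative):

```julia
using Base.Threads

jobs = Channel{Int}(32)
results = Channel{Int}(32)

# worker task: squares every job it receives until `jobs` is closed
worker = @spawn for x in jobs
    put!(results, x^2)
end

foreach(i -> put!(jobs, i), 1:4)
close(jobs)       # signal that no more jobs are coming
wait(worker)
close(results)
collect(results)  # [1, 4, 9, 16]
```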
Eliminating data races
You are solely responsible for ensuring that your program is free of data races. If you do not comply with this requirement, none of the promises made here are guaranteed, and the observed results may be completely counterintuitive.
In the presence of a data race, Julia does not guarantee memory safety. Be very careful when reading any data if another thread might write to it, as this can lead to segmentation faults or worse. Below are a couple of unsafe ways to access global variables from different threads.
Thread 1:
global b = false
global a = rand()
global b = true
Thread 2:
while !b; end
bad_read1(a) # accessing `a` here is NOT safe!
Thread 3:
while !@isdefined(a); end
bad_read2(a) # accessing `a` here is NOT safe.
Using locks to avoid data races
Locking is an important tool for avoiding data races, and thereby for writing thread-safe code. A lock can be locked and unlocked. If a thread has locked a lock and has not unlocked it, it is said to be "holding" the lock. If there is only one lock, and we write code that requires holding that lock in order to access some data, we can ensure that multiple threads will never access the same data simultaneously. Note that the link between a lock and a variable is established by the programmer, not by the program.
For example, we can create a lock my_lock and lock it while changing the variable my_variable. The easiest way to do this is with the @lock macro:
julia> my_lock = ReentrantLock();
julia> my_variable = [1, 2, 3];
julia> @lock my_lock my_variable[1] = 100
100
By using the same pattern with the same lock and variable, but on another thread, the operations are free of data races.
The operation above could also be performed with the functional version of lock, in either of the following two ways:
julia> lock(my_lock) do
my_variable[1] = 100
end
100
julia> begin
lock(my_lock)
try
my_variable[1] = 100
finally
unlock(my_lock)
end
end
100
All three options are equivalent. Note that the last version requires an explicit try block to ensure that the lock is always unlocked, whereas the first two versions do this internally. The locking pattern shown above should always be used when changing data (such as assigning to a global or closure variable) that is accessed by other threads. Failure to do so can have unforeseen and serious consequences.
Atomic operations
Julia supports accessing and modifying values atomically, that is, in a thread-safe way that avoids https://en.wikipedia.org/wiki/Race_condition[race conditions]. A value (which must be of a primitive type) can be wrapped in Threads.Atomic to indicate that it must be accessed in this way. See the following example.
julia> i = Threads.Atomic{Int}(0);
julia> ids = zeros(4);
julia> old_is = zeros(4);
julia> Threads.@threads for id in 1:4
old_is[id] = Threads.atomic_add!(i, id)
ids[id] = id
end
julia> old_is
4-element Vector{Float64}:
0.0
1.0
7.0
3.0
julia> i[]
10
julia> ids
4-element Vector{Float64}:
1.0
2.0
3.0
4.0
Had we tried to do the addition without the atomic wrapper, we might have gotten the wrong answer due to a race condition. Here is an example of what happens if we don’t avoid the race:
julia> using Base.Threads
julia> Threads.nthreads()
4
julia> acc = Ref(0)
Base.RefValue{Int64}(0)
julia> @threads for i in 1:1000
acc[] += 1
end
julia> acc[]
926
julia> acc = Atomic{Int64}(0)
Atomic{Int64}(0)
julia> @threads for i in 1:1000
atomic_add!(acc, 1)
end
julia> acc[]
1000
Per-field atomic operations
Atomic operations can also be used at a more granular level via the @atomic, @atomicswap, @atomicreplace, and @atomiconce macros.
Detailed information about the memory model and other aspects of their design is provided in the https://gist.github.com/vtjnash/11b0031f2e2a66c9c24d33e810b34ec0[Julia Atomics Manifest], which will be formally published later.
Any field in a struct declaration can be annotated with @atomic; every write to such a field must then also be marked @atomic, and must use one of the defined atomic orderings (:monotonic, :acquire, :release, :acquire_release, or :sequentially_consistent). Any read of an atomic field can also be annotated with an atomic-ordering constraint; if none is specified, a relaxed monotonic ordering is used.
Compatibility: Julia 1.7
Per-field atomic operations require at least Julia 1.7.
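A minimal sketch of a struct with an atomic field and the corresponding reads and writes:

```julia
mutable struct AtomicCounter
    @atomic count::Int
end

c = AtomicCounter(0)
@atomic c.count += 1            # atomic read-modify-write (sequentially consistent by default)
@atomic :monotonic c.count      # atomic read with an explicit relaxed ordering
```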
Side effects and mutable function arguments
When using multithreading, be careful with functions that are not https://en.wikipedia.org/wiki/Pure_function[pure], as you may get a wrong answer. For example, functions that have a name ending with !, by convention, modify their arguments and are therefore not pure.
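For example, sort! mutates its argument, so two tasks calling it on the same array race with each other; a sketch of the safe alternative is to use the non-mutating sort, which returns a new array:

```julia
using Base.Threads

data = [3, 1, 2]

# UNSAFE: both tasks would mutate the same array concurrently
# @spawn sort!(data); @spawn sort!(data)

# safe: `sort` (without `!`) leaves `data` untouched
t1 = @spawn sort(data)
t2 = @spawn sort(data; rev=true)
fetch(t1), fetch(t2)   # ([1, 2, 3], [3, 2, 1])
```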
@threadcall
External libraries, such as those called via ccall, pose a problem for Julia’s task-based I/O mechanism. If a C library performs a blocking operation, the Julia scheduler cannot execute any other tasks until the call returns. (Exceptions are calls into custom C code that makes callbacks into Julia, which may then yield, and C code that calls jl_yield(), the C equivalent of yield.)
The @threadcall macro provides a way to avoid stalling execution in such situations. It schedules the C function for execution on a separate thread, using a pool of four threads by default. The size of this thread pool is controlled by the environment variable UV_THREADPOOL_SIZE. While waiting for a free thread, and while the function executes once a thread becomes available, the requesting task (on the main Julia event loop) yields to other tasks. Note that @threadcall does not return until the execution is complete; from the user’s point of view it is therefore a blocking call, like other Julia APIs.
It is very important that the called function does not call back into Julia, as this will cause the program to crash.
The @threadcall macro may be removed or changed in future versions of Julia.
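As a sketch, a blocking C call can be routed through @threadcall so that other Julia tasks keep running in the meantime; the (:usleep, "libc") target is an assumption about the platform’s C library:

```julia
# sleep for 500 ms inside libc without blocking the Julia scheduler
# (usleep takes microseconds; assumes a POSIX libc)
@threadcall((:usleep, "libc"), Cint, (Cuint,), 500_000)
```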
Warnings
At present, most operations in the Julia runtime and standard libraries are thread-safe, provided that user code is free of data races. However, thread support is still being stabilized in some areas. Multithreaded programming has many inherent difficulties, so if a program using threads exhibits unusual or undesirable behavior (for example, crashes or mysterious results), thread interactions should typically be suspected first.
There are a number of specific limitations and warnings to be aware of when using threads in Julia:
- If multiple threads use a Base collection type simultaneously and at least one thread modifies the collection (common examples: push! on arrays, or inserting items into a Dict), manual locking is required.
- The schedule used by @spawn is nondeterministic and should not be relied on.
- Compute-bound tasks that do not allocate memory can prevent garbage collection from running on other threads that do allocate. In such cases it may be necessary to insert a manual call to GC.safepoint() to allow GC to run. This limitation will be removed in the future.
- Avoid running top-level operations, such as include or eval for defining types, methods, and modules, in parallel.
- Be aware that enabling threads may break finalizers registered by a library; additional ecosystem work may be needed before threads can be used freely with them. For more information, see Safe use of finalizers.
Task migration
After a task starts running on a particular thread, it can move to another thread if the task yields.
Such tasks may be started with @spawn or @threads, although the :static schedule option to @threads does freeze the thread ID.
This means that in most cases threadid() should not be treated as constant within a task, and therefore should not be used to index into a vector of buffers or stateful objects.
Compatibility: Julia 1.7
Task migration was introduced in Julia 1.7. Before that, tasks always remained on the thread on which they were started.
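A sketch of why threadid() must not be treated as constant within a task (the observed IDs depend on the scheduler):

```julia
using Base.Threads

t = @spawn begin
    before = threadid()
    sleep(0.1)           # a yield point; the task may migrate to another thread
    after = threadid()   # not guaranteed to equal `before`
    (before, after)
end
fetch(t)
```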
Safe use of finalizers
Since finalizers can interrupt the execution of any code, they must be very careful in how they interact with any global state. Unfortunately, the main reason finalizers are used is to update global state (it usually makes no sense to use a pure function as a finalizer). This puts us in a bit of a bind. There are a few approaches to dealing with this problem.
- When single-threaded, code can call the internal jl_gc_enable_finalizers C function to prevent finalizers from being scheduled inside a critical region. This is used internally by a number of functions (such as our C locks) to prevent recursion when performing certain operations (incremental package loading, code generation, etc.). The combination of a lock and this flag can be used to make finalizers safe.
- A second strategy, employed in some parts of the Base module, is to explicitly delay a finalizer until it can acquire its lock non-recursively. The following example shows how this strategy is applied in Distributed.finalize_ref:
function finalize_ref(r::AbstractRemoteRef)
    if r.where > 0 # Check if the finalizer has already run
        if islocked(client_refs) || !trylock(client_refs)
            # Delay the finalizer if we can't acquire the lock
            finalizer(finalize_ref, r)
            return nothing
        end
        try # `lock` should always be followed by `try`
            if r.where > 0 # Must check again here
                # Do the actual cleanup here
                r.where = 0
            end
        finally
            unlock(client_refs)
        end
    end
    nothing
end
- A related third strategy is to use a yield-free queue. We don't currently have a lock-free queue implemented in Base, but `Base.IntrusiveLinkedListSynchronized{T}` is suitable. This can frequently be a good strategy to use for code with event loops. For example, this strategy is employed by `Gtk.jl` to manage lifetime ref-counting. In this approach, we don't do any explicit work inside the `finalizer`, and instead add it to a queue to run at a safer time. In fact, Julia's task scheduler already uses this, so defining the finalizer as `x -> @spawn do_cleanup(x)` is one example of this approach. Note however that this doesn't control which thread `do_cleanup` runs on, so `do_cleanup` would still need to acquire a lock. That doesn't need to be true if you implement your own queue, as you can explicitly only drain that queue from your thread.