Profiling

The 'Profile` module provides tools to help developers improve code performance. During operation, it measures the running code and outputs results that help you understand how much time is spent on a single line (or lines). The most common use case is to identify "bottlenecks" as optimization targets.

The Profile module implements a mechanism known as a "sampling" profiler or https://en.wikipedia.org/wiki/Profiling_ (computer_programming)[statistical profiler]. It works by periodically receiving backtracking during the execution of any task. Each backtrace records the currently running function and line number, as well as the complete chain of function calls that led to that line, and is therefore a "snapshot" of the current execution state.

If it takes most of the time to execute a certain line of code, this line will often appear in the set of all backtraces. In other words, the "cost" of a given string (or, in fact, the cost of a sequence of function calls up to and including this string) is proportional to the frequency of its occurrence in the set of all backtraces.

The sampling profiler does not provide full line coverage, because backtracking occurs at intervals (by default, 1 ms on Unix systems and 10 ms on Windows, although the actual scheduling depends on the operating system load). Moreover, as discussed below, since samples are collected in a sparse subset of all execution points, statistical noise affects the data collected by the sampling profiler.

Despite these limitations, sampling profilers have significant advantages.

You don’t need to make changes to the code to take time measurements.
The profiler can profile the Julia core code and even (optionally) the C and Fortran libraries.
With infrequent execution, performance-related costs are very low. When profiling, the code can run at almost its own speed.

Therefore, it is recommended to try using the built-in sampling profiler before considering any alternatives.

Basic usage

Let’s work with a simple test case.:

julia> function myfunc()
           A = rand(200, 200, 400)
           maximum(A)
       end

First, you should run the code you are going to profile at least once (unless you are going to profile the Julia JIT compiler):

julia> myfunc() # запустить один раз для принудительного выполнения компиляции

Now everything is ready to profile this function.:

julia> using Profile

julia> @profile myfunc()

There are several graphical browsers for viewing the profiling results. One "family" of visualizers is based on https://github.com/timholy/FlameGraphs.jl [FlameGraphs.jl], with each member of the family providing its own user interface:

https://www.julia-vscode.org /[VS Code] is a full-featured IDE with built-in support for profile visualization.
https://github.com/timholy/ProfileView.jl [ProfileView.jl] is a standalone GTK-based visualizer.
https://github.com/davidanthoff/ProfileVega.jl [ProfileVega.jl] uses VegaLight and integrates well with Jupyter notebooks.
StatProfilerHTML.jl creates HTML code and presents some additional summaries, and integrates well with Jupyter notebooks.
ProfileSVG.jl renders SVG.
PProf.jl provides a local website for viewing graphs, flamegraph, etc.
https://github.com/pfitzseb/ProfileCanvas.jl [ProfileCanvas.jl] is the user interface of an HTML canvas-based profile viewer used by https://www.julia-vscode.org /[with the Julia VS Code extension], but it can also generate interactive HTML files.

A completely independent approach to profiling visualization is https://github.com/vchuravy/PProf.jl [PProf.jl], which uses the external tool `pprof'.

However, here we will use text mapping, which is included in the standard library.:

julia> Profile.print()
80 ./event.jl:73; (::Base.REPL.##1#2{Base.REPL.REPLBackend})()
 80 ./REPL.jl:97; macro expansion
  80 ./REPL.jl:66; eval_user_input(::Any, ::Base.REPL.REPLBackend)
   80 ./boot.jl:235; eval(::Module, ::Any)
    80 ./<missing>:?; anonymous
     80 ./profile.jl:23; macro expansion
      52 ./REPL[1]:2; myfunc()
       38 ./random.jl:431; rand!(::MersenneTwister, ::Array{Float64,3}, ::Int64, ::Type{B...
        38 ./dSFMT.jl:84; dsfmt_fill_array_close_open!(::Base.dSFMT.DSFMT_state, ::Ptr{F...
       14 ./random.jl:278; rand
        14 ./random.jl:277; rand
         14 ./random.jl:366; rand
          14 ./random.jl:369; rand
      28 ./REPL[1]:3; myfunc()
       28 ./reduce.jl:270; _mapreduce(::Base.#identity, ::Base.#scalarmax, ::IndexLinear,...
        3  ./reduce.jl:426; mapreduce_impl(::Base.#identity, ::Base.#scalarmax, ::Array{F...
        25 ./reduce.jl:428; mapreduce_impl(::Base.#identity, ::Base.#scalarmax, ::Array{F...

Each line here represents a specific location (line number) in the code. Indentation is used to indicate a nested sequence of function calls, with lines with more indentation located deeper in the sequence of calls. In each row, the first "field" is the number of backtraces (samples) performed in this row or any functions performed by this row_. The second field is the file name and line number, and the third field is the function name. Please note that the specific line numbers may change as the Julia code changes. If you want to track this point, it’s better to run this example yourself.

In this example, we can see that the top-level function being called is located in the 'event.jl` file. This is the function that triggers the REPL when Julia is started. If you look at line 97 of the REPL.jl file, you will see that the eval_user_input() function is called here. This function processes your input in the REPL, and since we are working interactively, these functions were called when we entered @profile myfunc(). The following line reflects the actions performed in the macro @profile.

The first line shows that 80 backtraces were executed in line 73 of the event.jl file, but it’s not that this line was "expensive" in itself: the third line shows that all 80 backtraces were actually launched inside its eval_user_input call, and so on. To find out which operations really take time, you need to look deeper into the call chain.

Here is the first important line in this output:

52 ./REPL[1]:2; myfunc()

The REPL refers to the fact that we defined the function (myfunc') in the REPL rather than putting it in a file. If we were using a file, its name would be displayed here. `[1] indicates that the function (myfunc') was the first expression evaluated in this REPL session. Line 2 of the function (`myfunc()) contains a call to `rand', and 52 (out of 80) backtraces occurred in it. Below you can see the function call `dsfmt_fill_array_close_open!`inside the file `dSFMT.jl'.

A little further you can see:

28 ./REPL[1]:3; myfunc()

Line 3 of the function ('myfunc') contains the maximum call, and 28 (out of 80) backtraces occurred in it. Below you can see the specific locations in the base/reduce' file.jl, in which time-consuming operations are performed in the maximum function for a given type of input data.

In general, we can make a preliminary conclusion that generating random numbers is about twice as expensive as searching for the maximum element. It would be possible to increase confidence in this result by collecting more samples.:

julia> @profile (for i = 1:100; myfunc(); end)

julia> Profile.print()
[....]
 3821 ./REPL[1]:2; myfunc()
  3511 ./random.jl:431; rand!(::MersenneTwister, ::Array{Float64,3}, ::Int64, ::Type...
   3511 ./dSFMT.jl:84; dsfmt_fill_array_close_open!(::Base.dSFMT.DSFMT_state, ::Ptr...
  310  ./random.jl:278; rand
   [....]
 2893 ./REPL[1]:3; myfunc()
  2893 ./reduce.jl:270; _mapreduce(::Base.#identity, ::Base.#scalarmax, ::IndexLinea...
   [....]

In general, if there are N samples collected in one row, we can expect uncertainty about the order of sqrt(N) (if we do not take into account other sources of noise, such as computer workload with other tasks). The main exception to this rule is garbage collection, which is performed infrequently, but is usually quite expensive. (Since the Julia garbage collector is written in C, such events can be detected using the C=true output mode described below, or using https://github.com/timholy/ProfileView.jl [ProfileView.jl].)

Next, a standard tree dump is shown. An alternative is a flat dump, which accumulates calculations regardless of their nesting.:

julia> Profile.print(format=:flat)
 Count File          Line Function
  6714 ./<missing>     -1 anonymous
  6714 ./REPL.jl       66 eval_user_input(::Any, ::Base.REPL.REPLBackend)
  6714 ./REPL.jl       97 macro expansion
  3821 ./REPL[1]        2 myfunc()
  2893 ./REPL[1]        3 myfunc()
  6714 ./REPL[7]        1 macro expansion
  6714 ./boot.jl      235 eval(::Module, ::Any)
  3511 ./dSFMT.jl      84 dsfmt_fill_array_close_open!(::Base.dSFMT.DSFMT_s...
  6714 ./event.jl      73 (::Base.REPL.##1#2{Base.REPL.REPLBackend})()
  6714 ./profile.jl    23 macro expansion
  3511 ./random.jl    431 rand!(::MersenneTwister, ::Array{Float64,3}, ::In...
   310 ./random.jl    277 rand
   310 ./random.jl    278 rand
   310 ./random.jl    366 rand
   310 ./random.jl    369 rand
  2893 ./reduce.jl    270 _mapreduce(::Base.#identity, ::Base.#scalarmax, :...
     5 ./reduce.jl    420 mapreduce_impl(::Base.#identity, ::Base.#scalarma...
   253 ./reduce.jl    426 mapreduce_impl(::Base.#identity, ::Base.#scalarma...
  2592 ./reduce.jl    428 mapreduce_impl(::Base.#identity, ::Base.#scalarma...
    43 ./reduce.jl    429 mapreduce_impl(::Base.#identity, ::Base.#scalarma...

If there is recursion in the code, a potentially difficult point is that a line in a child function can accumulate calculations that exceed the total number of backtraces. Consider the following function definitions.

dumbsum(n::Integer) = n == 1 ? 1 : 1 + dumbsum(n-1)
dumbsum3() = dumbsum(3)

If you profiled dumbsum3 and performed a backtrace during the execution of dumbsum(1), it would look like this:

dumbsum3
    dumbsum(3)
        dumbsum(2)
            dumbsum(1)

Therefore, this child function receives three calculations, although the parent function receives only one. The tree view makes this situation much more understandable, and for this reason (among others) is probably the most preferred way to view the results.

Accumulation and purification

Macro results @profile accumulates in the buffer. If you are executing multiple code snippets using a macro @profile, then Profile.print() outputs the combined results. It can be very useful. But sometimes you want to start with a clean slate, and this can be done using the function Profile.clear().

Options for controlling the display of profiling results

Profile.print has other parameters that we haven’t considered yet. Here is the full announcement:

function print(io::IO = stdout, data = fetch(); kwargs...)

Let’s first discuss two positional arguments, and then the named arguments.

'io` — allows you to save the results to a buffer, for example to a file, but by default the output is done in `stdout' (console).
data — contains the data that you want to analyze. By default, their source is the function Profile.fetch(), which extracts backtraces from a pre-allocated buffer. For example, to profile a profiler, you can do the following.

julia data = copy(Profile.fetch()) Profile.clear() @profile Profile.print(stdout, data) # Prints the previous results Profile.print() # Prints results from Profile.print()

Any combination of the following named arguments is possible:

format — presented above, determines whether backtracks are indented (by default, :tree) or without it (:flat), meaning a tree structure.
C — if the value is true, backtraces from the C and Fortran code will be displayed (they are excluded by default). Try to run the introductory example with Profile.print(C = true)'. This can be extremely useful for determining which code — Julia code or C code — is causing the bottleneck. Setting `C = true also improves the interpretability of nesting due to longer profiling dumps.
combine' — some lines of code contain multiple operations. For example, `s += A[i] contains both an array reference (A[i]) and a summation operation. They correspond to different lines in the generated machine code, and therefore there may be two or more different addresses in this line recorded during backtracking. 'combine = true` combines them, and this is probably what is required. However, it is possible to generate output separately for each unique instruction pointer using `combine = false'.
maxdepth — limits frames with a depth of more than 'maxdepth' in the :tree format.
sortedby' — controls the ordering in the `:flat format. ':filefuncline' (by default) sorts by the source string, whereas :count sorts in the order of the number of samples collected.
'noisefloor' — restricts frames that are below the heuristic noise level of the sample (applies only to the :tree format). The recommended value is 2.0 (default is 0). This parameter hides the samples for which n <= noisefloor * √N, where n is the number of samples in this row, and N is the number of samples for the called object.
`mincount' — limits frames with occurrences less than `mincount'.

The names of files or functions are sometimes truncated (with ...), and the margins are truncated with +n at the beginning, where n is the number of additional spaces that would have been inserted if there had been room for them. If you need a complete profile of deeply embedded code, it is often recommended to save its file using a wide displaysize in IOContext:

open("/tmp/prof.txt", "w") do s
    Profile.print(IOContext(s, :displaysize => (24, 500)))
end

Configuration

@profile simply accumulates backtraces, and the analysis occurs when you call Profile.print(). With a long-term calculation, it is quite possible that the pre-allocated buffer for storing backtraces will be filled. If this happens, the backtracking will stop, but the calculations will continue. As a result, some important profiling data may be missing (in this case, a warning will be displayed).

You can get and configure the relevant parameters as follows:

Profile.init() # возвращает текущие параметры
Profile.init(n = 10^7, delay = 0.01)

'n` is the total number of instruction pointers available for storage. The default value is 10^6. If a standard backtrace consists of 20 instruction pointers, you can collect 50,000 backtraces, which assumes a statistical error of less than 1%. This may be sufficient for most applications.

Therefore, you will most likely need to change the delay, expressed in seconds, which sets the amount of time available in Julia between snapshots to perform the required calculations. For a very long task, frequent backtracking may not be required. The default parameter is `delay = 0.001'. Of course, the delay can be either reduced or increased. However, the cost of profiling increases when the delay becomes similar to the time required to create a backtrace (~30 microseconds on the author’s laptop).

Memory allocation analysis

One of the most common methods of improving performance is to reduce the amount of allocated memory. There are several measurement tools available in Julia:

`@time`

The total amount of allocation can be measured using macros. @time, @allocated and @allocations, and specific allocation-activating rows can often be deduced from profiling based on the garbage collection costs associated with these rows. However, sometimes it is more efficient to directly measure the amount of memory allocated by each line of code.

Logging garbage collection statistics

Although @time logs high-level statistics on memory usage and garbage collection during the expression calculation process. It may be useful to log each garbage collection event to get an intuitive idea of how often the garbage collector is running, how long it has been running each time, and how much garbage it collects. This feature can be enabled using the function GC.enable_logging(true), which instructs Julia to write data to stderr at each garbage collection.

Allocation Profiler

Compatibility: Julia 1.8

This feature requires a version not lower than Julia 1.8.

The allocation profiler records the stack trace, type, and size of each allocation during its operation. You can run it using the command Profile.Allocs.@profile.

This allocation information is returned as an array of Alloc objects enclosed in the 'AllocResults` object. The best way to visualize at the moment are packages https://github.com/JuliaPerf/PProf.jl [PProf.jl] and https://github.com/pfitzseb/ProfileCanvas.jl [ProfileCanvas.jl], which can visualize the call stacks that perform the most allocations.

The allocation profiler is associated with significant costs, so you can pass the sample_rate argument to speed up its work by skipping some allocations. When passing sample_rate=1.0, it will record everything (but slowly), when passing sample_rate=0.1 it will only record 10% of the selections (faster), etc.

Compatibility: Julia 1.11

In older versions of Julia, writing types was not possible in any cases. If you see the Profile' selection in older versions of Julia.Allocs.UnknownType, this means that the profiler does not know which type of object was allocated. This mostly happened when the allocation was carried out from the code generated by the compiler. For more information, see https://github.com/JuliaLang/julia/issues/43688 [issue # 43688].

Starting from Julia 1.11, all selections must have the specified type.

For more information about using this tool, see the discussion at the JuliaCon 2022 event: https://www.youtube.com/watch?v=BFvpwC8hEWQ

An example of a selection profiler

In this simple example, we use PProf to visualize the selection profile. You can use another visualization tool instead. We collect the profile (by setting the sampling rate), and then visualize it.

using Profile, PProf
Profile.Allocs.clear()
Profile.Allocs.@profile sample_rate=0.0001 my_function()
PProf.Allocs.pprof()

Here is a more detailed sample rate setting example. It is recommended to have 1-10 thousand samples. If the number is too large, the profile visualizer may be overloaded and profiling will be slow. However, too small a number will not allow you to get a representative sample.

julia> import Profile

julia> @time my_function()  # Вычислим выделения по результатам (второго запуска) функции.
  0.110018 seconds (1.50 M allocations: 58.725 MiB, 17.17% gc time)
500000

julia> Profile.Allocs.clear()

julia> Profile.Allocs.@profile sample_rate=0.001 begin   # 1,5 млн * 0,001 = ~1,5 тыс выделений
           my_function()
       end
500000

julia> prof = Profile.Allocs.fetch();  # При желании результаты можно проверить вручную.

julia> length(prof.allocs)  # Проверим наличие ожидаемого количества выделений.
1515

julia> using PProf  # Теперь выполним визуализацию с помощью внешнего инструмента, например PProf или ProfileCanvas.

julia> PProf.Allocs.pprof(prof; from_c=false)  # При желании можно передать ранее полученный результат профиля.
Analyzing 1515 allocation samples... 100%|████████████████████████████████| Time: 0:00:00
Main binary filename not available.
Serving web UI on http://localhost:62261
"alloc-profile.pb.gz"

Then you can view the profile by clicking on the link http://localhost:62261 . The profile will be saved to disk. For more features, see the PProf package.

Tips for profiling selections

As mentioned above, try to have about 1 to 10 thousand samples in your profile.

Please note that we make a uniform sample in the area of all selections and do not weight our samples by the size of the selection. Therefore, this allocation profile may not give a representative idea of where the most bytes are allocated in your program, unless you set `sample_rate=1'.

Selections can be made by users who directly create objects, but their source can also be the runtime environment, or they can be inserted into compiled code to handle type instability. Examining the representation of the "source code" can help isolate them, and then you can use other external tools such as https://github.com/JuliaDebug/Cthulhu.jl [`Cthulhu.jl'], to identify the cause of the discharge.

Selection Profile Visualization Tools

Currently, there are several profile visualization tools for displaying selection profiles. Here is a short list of known remedies:

PProf.jl
ProfileCanvas.jl
Built-in VSCode profile visualizer (`@profview_allocs') [documentation required]
View the results directly in the REPL
- You can study the results in the REPL using the function Profile.Allocs.fetch() to see the stack trace and the type of each allocation.

Line-by-line selection tracking

To use an alternative way to measure allocations, run Julia with the command-line parameter --track-allocation=<setting>, which can be set to none (set by default, do not measure allocation), user (measure memory allocation everywhere except Julia kernel code) or all' (measure memory allocation in each line of Julia code). The allocation is measured for each line of compiled code. After exiting Julia, the total results are written to text files with `.mem added after their name and located in the same directory as the source file. Each line indicates the total number of bytes allocated. The `Coverage' package contains some basic analysis tools, for example, to sort strings in ascending order of the number of allocated bytes.

A number of important details should be taken into account when interpreting the results. In the user parameter (user), the first line of any function called directly from the REPL will display highlighting due to events occurring in the REPL code itself. Moreover, JIT compilation also increases the number of allocations, since most of the Julia compiler is written in Julia (and compilation usually requires memory allocation). It is recommended to force the compilation by executing all the commands that you want to analyze, then calling the function Profile.clear_malloc_data() to reset all allocation counters. Finally, you need to run the necessary commands and exit Julia to start generating .mem files.

'--track-allocation` modifies the generation of the allocation registration code, and therefore the allocation may differ from what happens without this parameter. Instead, it is recommended to use selection profiler.

External profiling

Julia currently supports Intel VTune, OProfile, and perf as external profiling tools.

Depending on the selected tool, compile using the variables USE_INTEL_JITEVENTS, USE_OPROFILE_JITEVENTS and USE_PERF_JITEVENTS, which are set to 1 in `Make.user'. Multiple flags are supported.

Before starting Julia, set the environment variable ENABLE_JITPROFILING value is 1.

Now you can use these tools in various ways. For example, using OProfile you can try a simple entry:

>ENABLE_JITPROFILING=1 sudo operf -Vdebug ./julia test/fastmath.jl
>opreport -l `which ./julia`

Or it can be done using perf:

$ ENABLE_JITPROFILING=1 perf record -o /tmp/perf.data --call-graph dwarf -k 1 ./julia /test/fastmath.jl
$ perf inject --jit --input /tmp/perf.data --output /tmp/perf-jit.data
$ perf report --call-graph -G -i /tmp/perf-jit.data

There are many other interesting things that you can measure in your program. The full list can be found at https://www.brendangregg.com/perf.html [the Linux perf examples page].

Remember that perf saves a perf.data file for each execution, which can become quite voluminous even for small programs. Additionally, the perf LLVM module saves temporary debugging objects to the ~/.debug/jit folder. Do not forget to clean it regularly.