Engee documentation
Notebook

Matrix multiplication on different calculators

In this example, we'll show how to study code performance using the BenchmarkTools package, and compare code execution on the CPU and on the GPU.

Connecting to the GPU

The availability of GPU resources remains a premium feature of the Engee platform for now. A GPU is a graphics card that lets you massively parallelize code execution by running it on tens of thousands of computing cores located inside the graphics coprocessor.

The main Julia library for working with the GPU is CUDA.jl. Let's install it, along with a toolkit for evaluating code performance (the BenchmarkTools package).

In [ ]:
# Comment out these lines if you install the libraries some other way
using Pkg
Pkg.add( url="https://github.com/JuliaBinaryWrappers/CUDA_Runtime_jll.jl.git" )
Pkg.add( ["CUDA", "cuDNN", "Flux", "BenchmarkTools"] );
Pkg.instantiate()
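Before moving on, it can be worth checking that a GPU is actually reachable from the session. A minimal check, assuming CUDA.jl is installed as above, might look like this:

```julia
using CUDA

# CUDA.functional() returns true only when a CUDA driver and a usable
# device are present in this session.
if CUDA.functional()
    CUDA.versioninfo()   # print driver, runtime and device details
else
    println("No functional GPU found; the GPU cells below will be skipped.")
end
```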

Matrix multiplication on the CPU

Let's see how long matrix multiplication takes, on average, on a conventional CPU.

In [ ]:
N = 1_000

A = rand(ComplexF64, (N,N))
B = rand(ComplexF64, (N,N))

using BenchmarkTools
@benchmark A*B
Out[0]:
BenchmarkTools.Trial: 11 samples with 1 evaluation per sample.
 Range (min … max):  384.014 ms … 500.289 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     496.367 ms                GC (median):    0.00%
 Time  (mean ± σ):   462.194 ms ±  50.123 ms   GC (mean ± σ):  0.00% ± 0.00%

            ▁                                                █  
  ▆▁▁▁▁▁▁▁▁▁█▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▁█ ▁
  384 ms           Histogram: frequency by time          500 ms <

 Memory estimate: 15.26 MiB, allocs estimate: 2.

Executing this cell can take quite a while, because the @benchmark macro runs the operation passed to it many times in order to smooth out the "warm-up" effect inherent to Julia. This also reduces the impact of rare conditions in which the code happens to show unusually low performance.
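One caveat worth noting: A and B above are global variables, and the BenchmarkTools documentation recommends interpolating such variables into the benchmarked expression with $, so that the measurement excludes the cost of accessing untyped globals. A sketch:

```julia
using BenchmarkTools

N = 1_000
A = rand(ComplexF64, (N, N))
B = rand(ComplexF64, (N, N))

# $A and $B are spliced in as local values, so each evaluation times
# only the multiplication itself, not global-variable lookup.
@benchmark $A * $B
```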

In this particular case, the experiment showed that multiplying two 1000-by-1000 complex matrices takes about 460 milliseconds on average.
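As a rough sanity check on that number, we can estimate the implied throughput. A complex N-by-N multiplication costs on the order of 8N³ real floating-point operations (each complex multiply-add is roughly 8 real operations), so, taking the mean time of about 0.462 s from the run above:

```julia
N = 1_000
t = 0.462                # mean time in seconds from the benchmark above
ops = 8.0 * N^3          # ≈ real floating-point operations in a complex matmul
gflops = ops / t / 1e9
println("≈ $(round(gflops, digits=1)) GFLOP/s")   # roughly 17 GFLOP/s
```

This order of magnitude is plausible for a BLAS complex matmul on a few CPU cores; treat it only as a back-of-the-envelope estimate.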

Matrix multiplication on the GPU

To multiply the matrices on the graphics card, they first need to be transferred to it, which can be done in several ways, for example with the command A |> gpu. However, since there may not be a GPU in the system, we will check the configuration of the computing environment and select an available device.

After the transfer, the matrices are no longer Matrix objects but CuArray objects. Multiplying them requires no additional code (thanks to overloading of the multiplication operator). However, we cannot multiply the matrix A_gpu by the matrix B without transferring both matrices to the same device (otherwise we get the error KernelError: kernel returns a value of type Union{}).
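Besides Flux's gpu helper, CUDA.jl itself provides ways to move arrays to the device. A small sketch of two common routes (assuming a functional GPU):

```julia
using CUDA

A = rand(ComplexF64, 4, 4)

if CUDA.functional()
    A1 = CuArray(A)   # explicit constructor; keeps the element type (ComplexF64)
    A2 = cu(A)        # convenience helper; by default it prefers 32-bit floats,
                      # so the element type here becomes ComplexF32
end
```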

In [ ]:
using CUDA, Flux

if CUDA.functional()
    A_gpu = A |> gpu
    B_gpu = B |> gpu
    @benchmark A_gpu * B_gpu
end
Out[0]:
BenchmarkTools.Trial: 9878 samples with 1 evaluation per sample.
 Range (min … max):   41.620 μs … 349.520 ms   GC (min … max): 0.00% … 94.80%
 Time  (median):     492.424 μs                GC (median):    0.00%
 Time  (mean ± σ):   503.189 μs ±   3.522 ms   GC (mean ± σ):  6.78% ±  0.98%

                                                            █    
  ▃▂▂▂▂▁▁▁▁▁▂▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▂▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▂▁▂▂▂▂▂▃█ ▂
  41.6 μs          Histogram: frequency by time          509 μs <

 Memory estimate: 1.62 KiB, allocs estimate: 71.

The minimum operation time on the GPU is almost 10,000 times less than the minimum computing time on the CPU (41 microseconds versus 384 milliseconds).
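A caveat about these GPU numbers: CUDA kernel launches are asynchronous, so part of what @benchmark measures above may be launch overhead rather than the completed multiplication. The CUDA.jl documentation suggests wrapping the benchmarked expression in CUDA.@sync so that timing waits for the device to finish. A sketch, assuming matrices like those in the previous cell:

```julia
using CUDA, BenchmarkTools

if CUDA.functional()
    A_gpu = CuArray(rand(ComplexF64, 1_000, 1_000))
    B_gpu = CuArray(rand(ComplexF64, 1_000, 1_000))

    # CUDA.@sync blocks until the device has finished the work, so the
    # benchmark times the full multiplication, not just the kernel launch.
    @benchmark CUDA.@sync($A_gpu * $B_gpu)
end
```

With synchronization the per-sample times are typically higher than the raw minimum above, but they better reflect the true cost of the computation.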

Conclusion

Julia lets you move calculations to the GPU, so a wide variety of applied computations can be accelerated many times over without rewriting their code. We performed matrix multiplication on the CPU and on the graphics card and found that 1000-by-1000 square matrices of random complex numbers are multiplied roughly ten thousand times faster on the graphics card than on the CPU (comparing minimum times).