Engee documentation
Notebook

Matrix multiplication on different calculators

In this example we will show how to study the code performance using the BenchmarkTools toolkit and compare its execution on CPU and GPU.

Connecting to GPU

Availability of GPU resources to users is still a premium feature of the Engee platform. A GPU is a graphics card that allows you to significantly parallelise code execution by running it on the tens of thousands of computational cores inside a graphics coprocessor.

The main library for working with GPUs is CUDA.jl. Let's install this library along with the BenchmarkTools package for evaluating code performance.

In [ ]:
# Comment out these lines if you install the libraries in another way
import Pkg
Pkg.add( url="https://github.com/JuliaBinaryWrappers/CUDA_Runtime_jll.jl.git" )
Pkg.add( ["CUDA", "cuDNN", "Flux", "BenchmarkTools"] );
Pkg.instantiate()

Matrix multiplication on CPU

Let's see how long it takes on average to multiply matrices on a regular CPU.

In [ ]:
N = 1_000

A = rand(ComplexF64, (N,N))
B = rand(ComplexF64, (N,N))

using BenchmarkTools
@benchmark A*B
Out[0]:
BenchmarkTools.Trial: 11 samples with 1 evaluation per sample.
 Range (min … max):  384.014 ms … 500.289 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     496.367 ms                GC (median):    0.00%
 Time  (mean ± σ):   462.194 ms ±  50.123 ms   GC (mean ± σ):  0.00% ± 0.00%

            ▁                                                █  
  ▆▁▁▁▁▁▁▁▁▁█▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▁█ ▁
  384 ms           Histogram: frequency by time          500 ms <

 Memory estimate: 15.26 MiB, allocs estimate: 2.

The execution of this cell can take quite a long time, because the @benchmark macro runs the operation passed to it many times: this smooths out the compilation "warm-up" effect inherent in Julia and reduces the impact of rare outlier runs in which the code randomly shows unusually low performance.
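The number of samples and the overall time budget of `@benchmark` can be tuned with keyword arguments (a sketch; `samples` and `seconds` are BenchmarkTools keywords, and the benchmarked expression here is just an illustration — interpolating `$x` keeps the global variable from being re-resolved on every run):

```julia
using BenchmarkTools

x = rand(1_000)
# Limit the trial to at most 20 samples or 5 seconds, whichever comes first.
b = @benchmark sum($x) samples=20 seconds=5
minimum(b.times)   # fastest observed run, in nanoseconds
```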

In this particular case, the experiment showed that multiplying 1000-by-1000 matrices of complex numbers takes about 460 milliseconds on average (with a median of roughly 496 ms).
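As a rough sanity check, we can translate this time into a throughput figure. Assuming a complex multiply-add costs about 8 real floating-point operations (4 multiplications and 4 additions), a full N×N complex matmul performs roughly 8·N³ flops:

```julia
N = 1_000
flops = 8 * N^3          # ≈ 8e9 real operations for one ComplexF64 matmul
t_cpu = 0.462            # mean CPU time in seconds from the run above
gflops = flops / t_cpu / 1e9
println("≈ $(round(gflops, digits = 1)) GFLOP/s on the CPU")
```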

Matrix multiplication on GPU

To multiply matrices on a video card, you first need to transfer them to it, which can be done in several ways, for example with the command A |> gpu. But since the system may not have a GPU, we will check the configuration of the computational environment and select an available device.

After the transfer, the Matrix objects become CuArray objects. They can be multiplied with each other without any additional code (thanks to overloading of the multiplication operator). But we cannot multiply the matrix A_gpu by the matrix B without transferring both matrices to the same device (otherwise we get the error KernelError: kernel returns a value of type Union{}).
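The transfer step can be sketched as follows (it runs only when a GPU is available; note that Flux's `gpu` helper, like `CUDA.cu`, may downcast 64-bit element types to 32-bit by default, while the `CuArray` constructor keeps the original `ComplexF64`):

```julia
using CUDA, Flux

if CUDA.functional()
    A = rand(ComplexF64, (4, 4))
    A_gpu = A |> gpu        # Flux's helper; may downcast to ComplexF32
    A_cu  = CuArray(A)      # CUDA.jl constructor; keeps ComplexF64
    @show typeof(A_gpu) typeof(A_cu)
    # Both operands of `*` must live on the same device: mixing a
    # CuArray with a CPU Matrix raises an error, so move both first.
end
```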

In [ ]:
using CUDA, Flux

if CUDA.functional()
    A_gpu = A |> gpu
    B_gpu = B |> gpu
    @benchmark A_gpu * B_gpu
end
Out[0]:
BenchmarkTools.Trial: 9878 samples with 1 evaluation per sample.
 Range (min … max):   41.620 μs … 349.520 ms   GC (min … max): 0.00% … 94.80%
 Time  (median):     492.424 μs                GC (median):    0.00%
 Time  (mean ± σ):   503.189 μs ±   3.522 ms   GC (mean ± σ):  6.78% ±  0.98%

                                                            █    
  ▃▂▂▂▂▁▁▁▁▁▂▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▂▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▂▁▂▂▂▂▂▃█ ▂
  41.6 μs          Histogram: frequency by time          509 μs <

 Memory estimate: 1.62 KiB, allocs estimate: 71.

The minimum operation time on the GPU is almost 10,000 times shorter than the minimum computation time on the CPU (41.6 microseconds vs. 384 milliseconds).
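One caveat: CUDA kernel launches are asynchronous, so `@benchmark A_gpu * B_gpu` may partly measure the launch overhead rather than the completed multiplication. A sketch of a stricter measurement using `CUDA.@sync`, which blocks until the device has actually finished:

```julia
using CUDA, BenchmarkTools

if CUDA.functional()
    A_gpu = CuArray(rand(ComplexF64, 1_000, 1_000))
    B_gpu = CuArray(rand(ComplexF64, 1_000, 1_000))
    # `CUDA.@sync` waits for the kernel to complete, so the timing
    # covers the whole multiplication, not just queuing it.
    @benchmark CUDA.@sync $A_gpu * $B_gpu
end
```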

Conclusion

Julia allows us to port computations to the GPU, so that a wide variety of applied computations can be accelerated many times over without rewriting their code. We performed matrix multiplication on a CPU and on a graphics card and determined that the graphics card multiplies 1000-by-1000 square matrices of random complex numbers roughly ten thousand times faster than the CPU.