Matrix multiplication on different calculators
In this example, we will show how to study code performance using the BenchmarkTools toolkit, and compare the execution of the same code on the CPU and on the GPU.
Connecting to the GPU
For now, access to GPU resources remains a premium feature of the Engee platform. The GPU is a graphics card that lets you massively parallelize code execution by running it on the tens of thousands of computing cores inside the graphics coprocessor.
The main library for working with the GPU is CUDA.jl. Let's install this library, and with it a toolkit for evaluating code performance (the BenchmarkTools package).
import Pkg
# Comment out this line if you will install the libraries in some other way
Pkg.add( url="https://github.com/JuliaBinaryWrappers/CUDA_Runtime_jll.jl.git" )
Pkg.add( ["CUDA", "cuDNN", "Flux", "BenchmarkTools"] );
Pkg.instantiate()
Matrix multiplication on the CPU
Let's see how long matrix multiplication takes, on average, on a conventional processor.
N = 1_000
A = rand(ComplexF64, (N,N))
B = rand(ComplexF64, (N,N))
using BenchmarkTools
@benchmark A*B
Executing this cell can take quite a while, since the @benchmark command runs the operation passed to it many times to smooth out the "warm-up" effect inherent in Julia. This also reduces the impact of rare conditions under which the code happens to show its worst performance.
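A side note on @benchmark: since A and B are global variables, it is more idiomatic to interpolate them with $ so that the timing excludes the cost of accessing untyped globals. A minimal sketch (it recreates the matrices so the cell is self-contained):

```julia
using BenchmarkTools

N = 1_000
A = rand(ComplexF64, (N, N))
B = rand(ComplexF64, (N, N))

# `$` interpolation makes BenchmarkTools treat A and B as local
# values, so the benchmark excludes global-variable lookup overhead.
b = @benchmark $A * $B

# The minimum time is usually the most stable statistic, being the
# least affected by OS noise and garbage collection.
println(minimum(b.times) / 1e6, " ms")
```

The reported times are in nanoseconds, hence the division by 1e6 to get milliseconds.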
In this particular case, the experiment showed that multiplying two 1000-by-1000 matrices of complex numbers takes about 300 milliseconds on average.
Matrix multiplication on the GPU
To multiply the matrices on the graphics card, they first need to be transferred to it, which can be done in several ways, for example with the command A |> gpu. However, since the system may not have a GPU, we will first check the configuration of the computing environment and select an available device.
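A quick way to inspect what the session actually provides is sketched below; the exact device names will of course vary from system to system:

```julia
using CUDA

if CUDA.functional()
    # List every CUDA device visible to this session
    for dev in CUDA.devices()
        println(CUDA.name(dev))
    end
    # Make the first device the active one (it already is by default)
    CUDA.device!(0)
else
    println("No functional GPU found; computations will stay on the CPU.")
end
```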
After the transfer, the matrices are no longer Matrix objects but CuArray objects. Multiplying them requires no additional code (thanks to overloading of the multiplication operator). However, we cannot multiply the matrix A_gpu by the matrix B without moving both matrices to the same device (otherwise we get the error KernelError: kernel returns a value of type Union{}).
using CUDA, Flux

if CUDA.functional()
    A_gpu = A |> gpu
    B_gpu = B |> gpu
    @benchmark A_gpu * B_gpu
end
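One caveat when timing GPU code: CUDA kernel launches are asynchronous, so a benchmark without synchronization may measure mostly the launch overhead rather than the full multiplication. A hedged variant using CUDA.@sync (and $-interpolation) is sketched below; expect larger but more realistic timings. Note also that Flux's gpu may convert ComplexF64 to ComplexF32, which contributes to the speedup, so here we use CuArray directly to keep the element type:

```julia
using CUDA, BenchmarkTools

if CUDA.functional()
    A_gpu64 = CuArray(A)   # CuArray() keeps the ComplexF64 element type,
    B_gpu64 = CuArray(B)   # unlike `|> gpu`, which may demote precision

    # CUDA.@sync blocks until the kernel has actually finished, so the
    # benchmark reflects the full computation, not just the launch.
    @benchmark CUDA.@sync $A_gpu64 * $B_gpu64
end
```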
The minimum operation time on the GPU is almost 10,000 times less than the minimum computation time on the CPU (41 microseconds versus 384 milliseconds).
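Speed aside, it is worth checking that both devices compute the same product. A quick sanity check, assuming A_gpu and B_gpu were created from A and B in the cell above (the tolerance is loose because the GPU copies may be in ComplexF32 precision):

```julia
using CUDA

if CUDA.functional()
    C_cpu = A * B
    C_gpu = Array(A_gpu * B_gpu)   # copy the result back to host memory

    # Loose tolerance: `|> gpu` may have reduced precision to ComplexF32
    println(isapprox(C_cpu, C_gpu; rtol = 1e-3))
end
```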
Conclusion
Julia lets you move calculations to the GPU, so a wide variety of applied computations can be accelerated many times over without rewriting their code. We performed matrix multiplication on the processor and on the graphics card and found that the graphics card multiplies 1000-by-1000 square matrices of random complex numbers roughly ten thousand times faster than the CPU.