Matrix multiplication on different calculators¶
In this example we will show how to study the code performance using the BenchmarkTools
toolkit and compare its execution on CPU and GPU.
Connecting to GPU¶
Availability of GPU resources is still a premium feature of the Engee platform. A GPU is a graphics card that lets you massively parallelise code execution by running it on tens of thousands of computational cores inside a graphics coprocessor.
The main library for working with GPUs is CUDA.jl
. Let's install this library together with the toolkit for evaluating code performance (the BenchmarkTools
package).
# Comment out these lines if you install the libraries in another way
using Pkg
Pkg.add( url="https://github.com/JuliaBinaryWrappers/CUDA_Runtime_jll.jl.git" )
Pkg.add( ["CUDA", "cuDNN", "Flux", "BenchmarkTools"] );
Pkg.instantiate()
Matrix multiplication on CPU¶
Let's see how long it takes on average to multiply matrices on a regular CPU.
N = 1_000
A = rand(ComplexF64, (N,N))
B = rand(ComplexF64, (N,N))
using BenchmarkTools
@benchmark A*B
Executing this cell can take quite a long time, because the @benchmark
macro runs the operation passed to it many times. This smooths out the "warm-up" (compilation) effect inherent in Julia and reduces the impact of rare occasions when the code randomly shows its worst possible performance.
In this particular case, the experiment showed that multiplying two 1000×1000 matrices of complex numbers takes about 300 milliseconds on average.
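When benchmarking with BenchmarkTools, it is worth interpolating global variables with `$` so that the measurement covers the operation itself rather than global-variable lookup. A minimal sketch (the 200×200 size here is illustrative, chosen only to keep the demonstration fast):

```julia
using BenchmarkTools
using Statistics: median

# Small illustrative matrices; the tutorial itself uses 1000×1000.
N = 200
A = rand(ComplexF64, N, N)
B = rand(ComplexF64, N, N)

# $-interpolation benchmarks the multiplication itself,
# not the cost of resolving the globals A and B.
t = @benchmark $A * $B

# Trial times are reported in nanoseconds.
println("median time: ", median(t.times) / 1e6, " ms")
```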
Matrix multiplication on GPU¶
To multiply matrices on a video card, they first need to be transferred to it, which can be done in several ways, for example with the command A |> gpu
. But since the system may not have a GPU at all, we will check the configuration of the computational environment and select an available calculator.
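Such a check can be sketched as follows. This is only one possible approach: `CUDA.functional()` reports whether a usable CUDA device is present, and `gpu`/`cpu` are Flux's data-movement helpers (`gpu` is already a no-op without a device, but making the choice explicit keeps the intent clear):

```julia
using CUDA, Flux

# Choose a target for the data: GPU if one is usable, otherwise stay on CPU.
to_device = CUDA.functional() ? gpu : cpu

# A tiny illustrative array; on a GPU-equipped machine its type becomes CuArray.
A_dev = rand(ComplexF64, 4, 4) |> to_device
println("array type after transfer: ", typeof(A_dev))
```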
After the transfer, the Matrix
objects become CuArray
objects. They can be multiplied with each other without any additional code (thanks to overloading of the multiplication operator). However, we cannot multiply the matrix A_gpu
by the matrix B
without first moving both matrices to the same calculator (otherwise we get the error KernelError: kernel returns a value of type Union{}
).
using CUDA, Flux

if CUDA.functional()
    # Transfer both matrices to the GPU before multiplying them
    A_gpu = A |> gpu
    B_gpu = B |> gpu
    @benchmark A_gpu * B_gpu
end
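It is also good practice to confirm that the GPU result matches the CPU result. A sanity-check sketch, assuming `|> gpu` preserves the `ComplexF64` element type (precision conversion in Flux is done separately, by `f32`/`f64`):

```julia
using CUDA, Flux

# Illustrative size; the tutorial itself uses 1000×1000.
A = rand(ComplexF64, 256, 256)
B = rand(ComplexF64, 256, 256)

if CUDA.functional()
    A_gpu, B_gpu = A |> gpu, B |> gpu
    # Array(...) copies the GPU result back into host memory.
    C_host = Array(A_gpu * B_gpu)
    # The products should agree up to floating-point round-off.
    @assert C_host ≈ A * B
    println("CPU and GPU products agree")
else
    println("no usable GPU; skipping the comparison")
end
```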
The minimum operation time on the GPU is almost 10,000 times smaller than the minimum computation time on the CPU (41 microseconds vs. 384 milliseconds).
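One caveat when interpreting such numbers: GPU kernel launches are asynchronous, so a naive timing may measure only the launch overhead rather than the computation itself. A sketch of a synchronised measurement using CUDA.jl's own timing macros:

```julia
using CUDA

if CUDA.functional()
    # CUDA.rand allocates directly on the device (Float32 by default).
    A_gpu = CUDA.rand(1_000, 1_000)
    B_gpu = CUDA.rand(1_000, 1_000)

    # CUDA.@sync blocks until the kernel finishes;
    # CUDA.@elapsed returns the elapsed time in seconds.
    t = CUDA.@elapsed CUDA.@sync A_gpu * B_gpu
    println("synchronised multiply: ", t * 1e3, " ms")
else
    println("no usable GPU; skipping the measurement")
end
```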
Conclusion¶
Julia allows us to port computations to the GPU, so that a wide variety of applied computations can be accelerated many times over without rewriting their code. We performed matrix multiplication on a CPU and on a graphics card and determined that the graphics card multiplies 1000×1000 square matrices of random complex numbers tens of thousands of times faster than the CPU.