Comparison of neural network training on CPU and GPU
Let's return to the example of training a multilayer neural network on a regression problem and compare how much the process speeds up when we switch to a GPU.
Installing Libraries
The following command may take several minutes if the named libraries have not been installed yet.
import Pkg
Pkg.add( ["Flux", "CUDA", "cuDNN", "Random"] )
using Flux, CUDA
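Before going further, it can be useful to check that a CUDA-capable GPU is actually visible to Julia. This is an optional check, not part of the original workflow:
CUDA.functional()    # true if a working GPU, driver and toolkit are available
# CUDA.versioninfo() # prints driver and toolkit details when a GPU is present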
Preparation
Let's create fairly large vectors of training data of type Float32:
Nx1, Nx2 = 200, 200
x1 = Float32.( range( -3, 3, length=Nx1 ) )
x2 = Float32.( range( -3, 3, length=Nx2 ) )
Xs = [ repeat( x1, outer=Nx2) repeat( x2, inner=Nx1) ];
Ys = @. 3*(1-Xs[:,1])^2*exp(-(Xs[:,1]^2) - (Xs[:,2]+1)^2) - 10*(Xs[:,1]/5 - Xs[:,1]^3 - Xs[:,2]^5)*exp(-Xs[:,1]^2-Xs[:,2]^2) - 1/3*exp(-(Xs[:,1]+1) ^ 2 - Xs[:,2]^2);
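As a quick sanity check (not part of the original listing), we can look at the array shapes. Flux's Dense layers expect features along the first dimension, which is why Xs' and Ys' are passed transposed later on:
size(Xs)     # (40000, 2) - one row per training example
size(Ys)     # (40000,)   - one target per example
size(Xs')    # (2, 40000) - the layout expected by Dense( 2 => 20, ... )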
And we'll set a fairly long training run:
epochs = 2000;
learning_rate = 0.08;
When running on the CPU, it's worth starting with epochs=100 and slowly increasing this value. Even if we stay far from the "industrial" values, we still risk losing a lot of CPU time while training the neural network with a suboptimal setup.
Data loading and training on the CPU
Let's create a standard training procedure:
model_cpu = Chain( Dense( 2 => 20, relu ), Dense( 20 => 5, relu ), Dense( 5 => 1 ) )
model_copy = deepcopy( model_cpu ) # We will need an identical copy of this model later
data = [ (Xs', Ys') ]
opt_state = Flux.setup( Adam( learning_rate ), model_cpu );
using Random
Random.seed!( 1 ); # Put the random number generator into a reproducible state
@time (for i in 1:epochs
Flux.train!(model_cpu, data, opt_state) do m, x, y
Flux.mse( m(x), y ) # Loss function
end
end)
The training took quite a long time. If we were limited to the CPU in our work, we would have to reduce the number of points in the training sample, at least at the start of training, while we select the neural network's hyperparameters.
About 1-2% of this time is spent on garbage collection (gc time) and on compiling the code (compile time).
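If we want these numbers programmatically rather than reading the @time printout, the @timed macro returns them as fields of a named tuple. A minimal sketch on a toy expression (the expression and values are purely illustrative):
stats = @timed sum(rand(Float32, 10^7))
stats.time      # total elapsed time in seconds
stats.gctime    # seconds spent in garbage collection
stats.bytes     # bytes allocated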
Transferring the process to the GPU
When the Flux library is loaded, the construction |> gpu (pipe the expression on the left into the gpu() function) becomes available. It lets us send a matrix or a model structure to the GPU and execute code on it, visually and without adding extra nesting levels to the data-processing code.
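The pipe is just ordinary function application, so the two forms below are equivalent. A minimal illustration (when CUDA is not functional, gpu simply returns the array unchanged):
x  = rand(Float32, 4, 4)
xg = gpu(x)        # explicit call
xg = x |> gpu      # the same operation written as a pipe
typeof(xg)         # a CuArray on a working GPU, a plain Array otherwise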
if CUDA.functional()
model_gpu = model_copy |> gpu;
Xg = Xs' |> gpu;
Yg = Ys' |> gpu;
data = [ (Xg, Yg) ]
opt_state = Flux.setup( Adam( learning_rate ), model_gpu );
end;
Note that we also tried using DataLoader. It significantly slowed down execution on the GPU, from which we concluded that it is better to trust the parallelization implemented by the Julia GPU compiler.
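For reference, a sketch of roughly what the DataLoader variant looked like (the batchsize value here is illustrative, not the one from our experiments):
loader = Flux.DataLoader( (Xg, Yg), batchsize=1024, shuffle=true )
for i in 1:epochs
    Flux.train!(model_gpu, loader, opt_state) do m, x, y
        Flux.mse( m(x), y )
    end
end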
Random.seed!( 1 ); # Put the random number generator into a reproducible state
if CUDA.functional()
@time CUDA.@sync (
for i in 1:epochs
Flux.train!(model_gpu, data, opt_state) do m, x, y
Flux.mse( m(x), y ) # Loss function
end
end
)
end
Summary of GPU migration
In general, transferring a neural network to a GPU requires three additional actions (see the consolidated sketch below):
- transferring the feature matrix: Xs |> gpu
- transferring the response matrix: Ys |> gpu
- transferring the neural network structure: model |> gpu
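A minimal consolidated sketch of these three transfers, using the names from the code above (the reverse transfer with |> cpu is shown later, when we plot the results):
Xg        = Xs' |> gpu          # feature matrix
Yg        = Ys' |> gpu          # response matrix
model_gpu = model_copy |> gpu   # the neural network itself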
With a small amount of training data (100-200 examples), the neural network learns faster on the CPU. But for 40,000 examples, the advantage is on the GPU side.
If we subtract the time spent on compilation (compile time) and on memory cleanup after execution (gc time, garbage collection time), we can say that
*The graphics card performed this task about 200 times faster than the central processor.*
Checking the results
Let's use the GPU-trained neural network to interpolate the original data set:
using Plots
gr()
# Return the model to the CPU
if CUDA.functional()
model_gpu = model_gpu |> cpu;
else
model_gpu = model_cpu
end
# Create a smaller data set (otherwise plotting would take a very long time)
Nx1, Nx2 = 40, 40
x1 = Float32.( range( -3, 3, length=Nx1 ) )
x2 = Float32.( range( -3, 3, length=Nx2 ) )
Xs = [ repeat( x1, outer=Nx2) repeat( x2, inner=Nx1) ];
Ys = @. 3*(1-Xs[:,1])^2*exp(-(Xs[:,1]^2) - (Xs[:,2]+1)^2) - 10*(Xs[:,1]/5 - Xs[:,1]^3 - Xs[:,2]^5)*exp(-Xs[:,1]^2-Xs[:,2]^2) - 1/3*exp(-(Xs[:,1]+1) ^ 2 - Xs[:,2]^2);
plot(
surface( Xs[:,1], Xs[:,2], vec(Ys), c=:viridis, cbar=false, title="Training set", titlefont=font(10)),
wireframe( x1, x2, vec(model_cpu( Xs' )), title="Neural network (CPU)", titlefont=font(10) ),
wireframe( x1, x2, vec(model_gpu( Xs' )), title="Neural network (GPU)", titlefont=font(10) ),
layout=(1,3), size=(1000,400)
)
Interestingly, a single call to Random.seed! was enough to obtain two visually identical results, one on the CPU and the other on the GPU.
Conclusion
We have trained a neural network to interpolate a sample of 40,000 elements on the CPU and on the GPU.
Training for 2,000 epochs ran almost 200 times faster on the GPU than on the CPU, which lets us iterate over hyperparameters many more times and train much larger networks, saving the developer's time.
For a sample of 100-1000 elements, we did not notice a difference in training speed: the frequent transfer of data to and from the GPU erased the advantage in execution speed. It may be worth organizing the training loop so that the data stays on the GPU for the entire duration of training.
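As a sketch of that idea (this is essentially the pattern already used for the large sample above): transfer the data once, outside the loop, and reuse it in every epoch instead of moving it to the GPU on each iteration.
Xg, Yg = Xs' |> gpu, Ys' |> gpu      # one-time host-to-device transfer
data   = [ (Xg, Yg) ]
for i in 1:epochs
    Flux.train!(model_gpu, data, opt_state) do m, x, y
        Flux.mse( m(x), y )          # no per-iteration data transfer here
    end
end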