
Comparison of neural network training on CPU and GPU

Let's return to the example of training a multilayer neural network on a regression problem and compare how much the process speeds up when we switch to a GPU.

Installing libraries

The following command may take several minutes if the listed libraries have not been installed yet.

In [ ]:
import Pkg
Pkg.add( ["Flux", "CUDA", "cuDNN", "Random"] )
In [ ]:
using Flux, CUDA

Preparation

Let's create fairly large vectors of training data of type Float32:

In [ ]:
Nx1, Nx2 = 200, 200
x1 = Float32.( range( -3, 3, length=Nx1 ) )
x2 = Float32.( range( -3, 3, length=Nx2 ) )
Xs = [ repeat( x1, outer=Nx2)  repeat( x2, inner=Nx1) ];
Ys = @. 3*(1-Xs[:,1])^2*exp(-(Xs[:,1]^2) - (Xs[:,2]+1)^2) - 10*(Xs[:,1]/5 - Xs[:,1]^3 - Xs[:,2]^5)*exp(-Xs[:,1]^2-Xs[:,2]^2) - 1/3*exp(-(Xs[:,1]+1) ^ 2 - Xs[:,2]^2);

And let's set a fairly long training run:

In [ ]:
epochs = 2000;
learning_rate = 0.08;

When running on the CPU, it is worth starting with epochs=100 and increasing this number gradually. Even if we stay far from "industrial" values, we still risk losing a lot of CPU time while training the neural network in a suboptimal configuration.
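As a sketch of such a probe run (the probe_* names are ours, and the throwaway model simply mirrors the one defined in the next section), one might time a short run and extrapolate:

using Flux

# Illustrative probe run: time 100 epochs on a throwaway model and extrapolate
# to estimate the cost of the full training (not part of the main flow).
probe_model  = Chain( Dense( 2 => 20, relu ), Dense( 20 => 5, relu ), Dense( 5 => 1 ) )
probe_state  = Flux.setup( Adam( learning_rate ), probe_model )
probe_data   = [ (Xs', Ys') ]
probe_epochs = 100
t_probe = @elapsed for i in 1:probe_epochs
    Flux.train!( (m, x, y) -> Flux.mse( m(x), y ), probe_model, probe_data, probe_state )
end
println( "Projected time for $epochs epochs: ≈ ", round( t_probe * epochs / probe_epochs, digits=1 ), " s" )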

Data loading and training on the CPU

Let's create a standard training procedure:

In [ ]:
model_cpu = Chain( Dense( 2 => 20, relu ), Dense( 20 => 5, relu ), Dense( 5 => 1 ) )
model_copy = deepcopy( model_cpu ) # We will need an identical copy of this model later
data = [ (Xs', Ys') ]
opt_state = Flux.setup( Adam( learning_rate ), model_cpu );
In [ ]:
using Random
Random.seed!( 1 ); # Set the random number generator to a reproducible state
In [ ]:
@time (for i in 1:epochs
    Flux.train!(model_cpu, data, opt_state) do m, x, y
        Flux.mse( m(x), y ) # Loss function
    end
end)
467.851443 seconds (20.01 M allocations: 27.787 GiB, 1.23% gc time, 3.11% compilation time)

The training took quite a long time. If we were limited to the CPU in our work, we would have to reduce the number of points in the training sample, at least at the beginning of training, while selecting the hyperparameters of the neural network.
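One possible way to do that is to prototype on a random subset of rows; a minimal sketch, where the subset size of 2000 and the names Xsmall, Ysmall, data_small are arbitrary:

using Random

# Prototype on a random subset of the training data while tuning hyperparameters
idx        = randperm( size(Xs, 1) )[1:2000]   # 2000 random row indices
Xsmall     = Xs[idx, :]
Ysmall     = Ys[idx]
data_small = [ (Xsmall', Ysmall') ]            # same layout as the full `data`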

About 1-2% of this time was spent on memory cleanup (gc time, i.e. garbage collection) and on compiling the code (compilation time).
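A common way to separate the one-time compilation cost is to time the same call twice; a minimal sketch (keep in mind that each call performs one extra optimization step on model_cpu):

# The first call includes compilation; the second shows the steady-state cost
loss( m, x, y ) = Flux.mse( m(x), y )
@time Flux.train!( loss, model_cpu, data, opt_state )   # includes compilation
@time Flux.train!( loss, model_cpu, data, opt_state )   # compilation already done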

Transferring the process to the GPU

Loading the Flux library gives us the construct |> gpu (pipe the left-hand expression into the gpu() function). It lets us send a matrix or a model structure to the GPU and execute code on it, visually and without adding extra levels of nesting to the data processing.
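For example, a minimal illustration (the names A, A_gpu and A_back are ours and not part of the notebook's main flow):

# `|> gpu` is equivalent to calling gpu() on the left-hand expression;
# `|> cpu` moves the data back to the host.
A      = rand( Float32, 4, 4 )
A_gpu  = A |> gpu      # same as gpu(A); becomes a CuArray when a GPU is available
A_back = A_gpu |> cpu  # back to an ordinary Array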

In [ ]:
if CUDA.functional()
    model_gpu = model_copy |> gpu;
    
    Xg = Xs' |> gpu;
    Yg = Ys' |> gpu;
    data = [ (Xg, Yg) ]
    
    opt_state = Flux.setup( Adam( learning_rate ), model_gpu );
end;

Note that we also tried using DataLoader. It noticeably slowed down execution on the GPU, so we concluded that it is better to rely on the parallelization implemented by the Julia GPU compiler.
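For reference, the DataLoader variant looked roughly like this (a sketch: the batch size of 1024 is arbitrary, and Xg, Yg are the GPU arrays from the cell above):

# Mini-batch variant (illustrative). On this problem it ran noticeably slower
# on the GPU than passing the whole dataset as a single batch.
loader = Flux.DataLoader( (Xg, Yg), batchsize = 1024, shuffle = true )
# `loader` can then be passed to Flux.train! in place of `data`:
# Flux.train!(model_gpu, loader, opt_state) do m, x, y
#     Flux.mse( m(x), y )
# end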

In [ ]:
Random.seed!( 1 ); # Set the random number generator to a reproducible state
In [ ]:
if CUDA.functional()
    @time CUDA.@sync (
        for i in 1:epochs
            Flux.train!(model_gpu, data, opt_state) do m, x, y
                Flux.mse( m(x), y ) # Loss function
            end
        end
    )
end
 26.461145 seconds (33.51 M allocations: 1.926 GiB, 2.39% gc time, 80.03% compilation time: 2% of which was recompilation)

Summary of GPU migration

In general, transferring a neural network to a GPU requires three additional actions (collected into a single sketch after the list):

  • transferring the feature matrix: Xs |> gpu
  • transferring the response matrix: Ys |> gpu
  • transferring the model structure: model |> gpu
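Put together, the pattern looks roughly like this (a condensed sketch; the names model_g, data_g and state_g are ours):

# Condensed sketch of the three migration steps described above
if CUDA.functional()
    model_g = model_copy |> gpu                   # the model
    data_g  = [ (Xs' |> gpu, Ys' |> gpu) ]        # the features and the responses
    state_g = Flux.setup( Adam( learning_rate ), model_g )
    # ...after which Flux.train! is called exactly as on the CPU
end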

With a small amount of training data (100-200 examples), the neural network trains faster on the CPU. But with 40,000 examples, the advantage is on the GPU's side.

If we subtract the time spent on compilation (compile time) and on memory cleanup after execution (gc time, garbage collection time), we can say that

the graphics card performed this task about 200 times faster than the central processor.

Checking the results

Let's use the GPU-trained neural network to interpolate the original dataset:

In [ ]:
using Plots
gr()

# Move the model back to the CPU
if CUDA.functional()
    model_gpu = model_gpu |> cpu;
else
    model_gpu = model_cpu
end

# Create a smaller dataset (otherwise plotting would take a very long time)
Nx1, Nx2 = 40, 40
x1 = Float32.( range( -3, 3, length=Nx1 ) )
x2 = Float32.( range( -3, 3, length=Nx2 ) )
Xs = [ repeat( x1, outer=Nx2)  repeat( x2, inner=Nx1) ];
Ys = @. 3*(1-Xs[:,1])^2*exp(-(Xs[:,1]^2) - (Xs[:,2]+1)^2) - 10*(Xs[:,1]/5 - Xs[:,1]^3 - Xs[:,2]^5)*exp(-Xs[:,1]^2-Xs[:,2]^2) - 1/3*exp(-(Xs[:,1]+1) ^ 2 - Xs[:,2]^2);

plot(
    surface( Xs[:,1], Xs[:,2], vec(Ys), c=:viridis, cbar=:false, title="Training sample", titlefont=font(10)),
    wireframe( x1, x2, vec(model_cpu( Xs' )), title="Neural network (CPU)", titlefont=font(10) ), 
    wireframe( x1, x2, vec(model_gpu( Xs' )), title="Neural network (GPU)", titlefont=font(10) ), 
    layout=(1,3), size=(1000,400)
)
Out[0]:

Interestingly, a single call to Random.seed! was enough to obtain two visually identical results: one on the CPU, the other on the GPU.

Conclusion

We have trained a neural network to interpolate a sample of 40,000 elements on both the CPU and the GPU.

Training for 2,000 epochs ran almost 200 times faster on the GPU than on the CPU, which makes it possible to iterate over hyperparameters many more times and to train much larger networks, saving the designer's time.

For a sample of 100-1000 elements, we did not notice a difference in training speed: the frequent transfer of data to and from the GPU erased the advantage in execution speed. It may be worth organizing the training loop so that the data stays on the GPU for the entire training run.
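A self-contained toy illustration of that idea (the sizes and hyperparameters below are arbitrary, not taken from the example above): the data is moved to the GPU once, before the loop, and stays there for all epochs.

# Toy illustration: transfer the data once and keep it resident on the GPU
if CUDA.functional()
    m  = Chain( Dense( 2 => 8, relu ), Dense( 8 => 1 ) ) |> gpu
    st = Flux.setup( Adam( 0.01 ), m )
    X  = rand( Float32, 2, 1000 ) |> gpu   # transferred once
    Y  = rand( Float32, 1, 1000 ) |> gpu   # transferred once
    d  = [ (X, Y) ]
    for i in 1:100
        Flux.train!( (mm, x, y) -> Flux.mse( mm(x), y ), m, d, st )
    end
end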