
Comparison of neural network training on CPU and GPU

Let's return to the example of training a multilayer neural network for a regression task and compare how much the process speeds up when we move it to the GPU.

Installing libraries

The following command may take a few minutes if the named libraries have not already been installed.

In [ ]:
import Pkg
Pkg.add( ["Flux", "CUDA", "cuDNN", "Random"] )
In [ ]:
using Flux, CUDA

Preparation

Let's create fairly large training data vectors of type Float32:

In [ ]:
# A 200×200 grid gives 40,000 training points
Nx1, Nx2 = 200, 200
x1 = Float32.( range( -3, 3, length=Nx1 ) )
x2 = Float32.( range( -3, 3, length=Nx2 ) )
Xs = [ repeat( x1, outer=Nx2)  repeat( x2, inner=Nx1) ];
Ys = @. 3*(1-Xs[:,1])^2*exp(-(Xs[:,1]^2) - (Xs[:,2]+1)^2) - 10*(Xs[:,1]/5 - Xs[:,1]^3 - Xs[:,2]^5)*exp(-Xs[:,1]^2-Xs[:,2]^2) - 1/3*exp(-(Xs[:,1]+1) ^ 2 - Xs[:,2]^2);

And set a sufficiently large number of training epochs:

In [ ]:
epochs = 2000;
learning_rate = 0.08;

When training on the CPU, it is worth starting with epochs=100 and increasing this value gradually. Even though we are far from "industrial" values, we still risk losing a lot of CPU time while training the neural network with a suboptimal setup.
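
One way to estimate the cost in advance is to time a short trial run and extrapolate. Here is a rough sketch (not a cell from the original notebook); it assumes the model_cpu, data and opt_state objects that are built in the next section:

In [ ]:
trial_epochs = 100                     # short trial budget, purely illustrative
t_trial = @elapsed for i in 1:trial_epochs
    Flux.train!(model_cpu, data, opt_state) do m, x, y
        Flux.mse( m(x), y )
    end
end
# Extrapolate linearly to the full budget
println( "Projected time for $epochs epochs: ≈ ", round( t_trial * epochs / trial_epochs, digits=1 ), " s" )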

Data loading and training on CPU

Let's create a standard training procedure:

In [ ]:
model_cpu = Chain( Dense( 2 => 20, relu ), Dense( 20 => 5, relu ), Dense( 5 => 1 ) )
model_copy = deepcopy( model_cpu ) # We will need an identical copy of this model later
data = [ (Xs', Ys') ]
opt_state = Flux.setup( Adam( learning_rate ), model_cpu );
In [ ]:
using Random
Random.seed!( 1 ); # Put the random number generator into a reproducible state
In [ ]:
@time (for i in 1:epochs
    Flux.train!(model_cpu, data, opt_state) do m, x, y
        Flux.mse( m(x), y ) # Loss function
    end
end)
467.851443 seconds (20.01 M allocations: 27.787 GiB, 1.23% gc time, 3.11% compilation time)

Training took quite a long time. If our work were limited to the CPU, we would have to reduce the number of points in the training sample, at least during the early stages while selecting the hyperparameters of the neural network.
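
For example, during the hyperparameter search the sample could be thinned out at random. A minimal sketch (the 4,000-point budget is an arbitrary illustration; randperm comes from the Random library loaded above):

In [ ]:
idx = randperm( size( Xs, 1 ) )[ 1:4_000 ]   # pick 4,000 of the 40,000 points at random
data_small = [ (Xs[idx, :]', Ys[idx]') ];    # same layout as the full `data` above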

About 1-2% of this time was spent on memory cleanup (gc time, garbage collection) and code compilation (compilation time).
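
If we wanted to exclude the compilation share from the measurement, one option is a warm-up pass before timing. A sketch under the same setup as above (note that the warm-up call also performs one extra optimisation step on the model):

In [ ]:
loss(m, x, y) = Flux.mse( m(x), y )
Flux.train!( loss, model_cpu, data, opt_state )   # warm-up: compiles the training code paths
@time for i in 1:epochs
    Flux.train!( loss, model_cpu, data, opt_state )
end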

Transferring the process to the GPU

By loading the Flux library we gain access to the construct |> gpu (pipe the expression on the left into the function gpu()). It allows us to send a matrix or a structure to the GPU and run code on it without introducing extra levels of nested data handling.
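
As a minimal illustration of this piping (a sketch outside the training flow; it requires a functional CUDA device):

In [ ]:
A = rand( Float32, 1000, 1000 )
A_gpu = A |> gpu          # the matrix is copied to GPU memory (a CuArray)
B_gpu = A_gpu * A_gpu     # the multiplication runs on the GPU
B = B_gpu |> cpu          # the result is copied back to host memory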

In [ ]:
if CUDA.functional()
    model_gpu = model_copy |> gpu;
    
    Xg = Xs' |> gpu;
    Yg = Ys' |> gpu;
    data = [ (Xg, Yg) ]
    
    opt_state = Flux.setup( Adam( learning_rate ), model_gpu );
end;

Note that we also tried using DataLoader. It slowed down execution on the GPU, from which we concluded that it is better to trust the parallelisation implemented by the Julia GPU compiler.
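
For reference, here is roughly what such a variant looks like; this is our reconstruction rather than the original cell, and the batch size is purely illustrative:

In [ ]:
loader = Flux.DataLoader( (Xg, Yg), batchsize = 4096, shuffle = true )
for i in 1:epochs
    Flux.train!( model_gpu, loader, opt_state ) do m, x, y
        Flux.mse( m(x), y )
    end
end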

In [ ]:
Random.seed!( 1 ); # Put the random number generator into a reproducible state
In [ ]:
if CUDA.functional()
    @time CUDA.@sync (
        for i in 1:epochs
            Flux.train!(model_gpu, data, opt_state) do m, x, y
                Flux.mse( m(x), y ) # Loss function
            end
        end
    )
end
 26.461145 seconds (33.51 M allocations: 1.926 GiB, 2.39% gc time, 80.03% compilation time: 2% of which was recompilation)

Summary on GPU porting

In general, porting a neural network to the GPU requires three additional steps (a compact sketch follows the list):

  • transferring the feature matrix: Xs |> gpu
  • transferring the response matrix: Ys |> gpu
  • transferring the neural network itself: model |> gpu
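
Below is a device-agnostic sketch of these three steps (the device variable and the *_dev names are our own, not part of Flux); it falls back to the CPU when no GPU is available:

In [ ]:
device = CUDA.functional() ? gpu : cpu    # use the GPU if present, otherwise stay on the CPU
model_dev = model_copy |> device
X_dev = Xs' |> device
Y_dev = Ys' |> device
data_dev = [ (X_dev, Y_dev) ]
opt_state_dev = Flux.setup( Adam( learning_rate ), model_dev );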

With a small amount of training data (100-200 examples) the neural network trains faster on the CPU, but with 40,000 examples the advantage is on the side of the GPU.

If we subtract the time spent on compilation (compilation time) and on memory cleanup after execution (gc time, garbage collection), we can say that

**The graphics card was about 200 times faster than the CPU.**

Checking the results

Let's use the GPU-trained neural network to interpolate the original dataset:

In [ ]:
using Plots
gr()

# Return the model to the CPU
if CUDA.functional()
    model_gpu = model_gpu |> cpu;
else
    model_gpu = model_cpu
end

# Create a smaller dataset (otherwise plotting would take a very long time)
Nx1, Nx2 = 40, 40
x1 = Float32.( range( -3, 3, length=Nx1 ) )
x2 = Float32.( range( -3, 3, length=Nx2 ) )
Xs = [ repeat( x1, outer=Nx2)  repeat( x2, inner=Nx1) ];
Ys = @. 3*(1-Xs[:,1])^2*exp(-(Xs[:,1]^2) - (Xs[:,2]+1)^2) - 10*(Xs[:,1]/5 - Xs[:,1]^3 - Xs[:,2]^5)*exp(-Xs[:,1]^2-Xs[:,2]^2) - 1/3*exp(-(Xs[:,1]+1) ^ 2 - Xs[:,2]^2);

plot(
    surface( Xs[:,1], Xs[:,2], vec(Ys), c=:viridis, cbar=false, title="Training set", titlefont=font(10)),
    wireframe( x1, x2, vec(model_cpu( Xs' )), title="Neural network (CPU)", titlefont=font(10) ),
    wireframe( x1, x2, vec(model_gpu( Xs' )), title="Neural network (GPU)", titlefont=font(10) ),
    layout=(1,3), size=(1000,400)
)
Out[0]:

Interestingly, a single call to Random.seed! was enough to obtain two (seemingly) identical results: one on the CPU and one on the GPU.
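
One way to quantify how close the two fits are is to compare their predictions directly. A minimal sketch (both models are on the CPU at this point; the value depends on the particular run):

In [ ]:
# Maximum absolute difference between the CPU- and GPU-trained predictions
maximum( abs.( model_cpu( Xs' ) .- model_gpu( Xs' ) ) )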

Conclusion

We trained a neural network to interpolate a sample of 40,000 elements on the CPU and on the GPU.

The training process over 2,000 epochs was nearly 200 times faster on the GPU than on the CPU, which allows many more hyperparameter iterations and much larger networks, saving the designer's time.

For sample sizes of 100-1,000 elements we saw no difference in training speed: the GPU's advantage was cancelled out by the overhead of transferring data to and from it. Perhaps the training loop should be organised so that the data stays on the GPU for the entire training run.