
Neural network image recognition and embedding

This example will help you take the first steps in a project that requires image classification, or one where the ResNet neural network will be used to extract features from an image or matrix.

Task description

Many of the most complex problems in computer vision can be solved with "foundational" neural networks, and one of the outstanding achievements of the last decade is the ResNet network, whose computations, and even their intermediate results, underlie many modern methods.

To work with this neural network, it is best to use Metalhead, a library that provides access to a solid collection of pre-trained neural networks. The networks in this library are built in the Flux format, whose commands make it easy to work with model topology. Finally, we will need the DataAugmentation library to simplify operations on images, although the same operations (scaling and rasterization) can be performed with ordinary matrix additions and multiplications.

In [ ]:
import Pkg
Pkg.add( ["Flux", "Metalhead", "DataAugmentation"] )

Recognizing a single image

Let's perform a small calculation on the pre-trained ResNet neural network, which will allow us to determine the class of the object represented in the image.

In [ ]:
using Flux, Metalhead
model = ResNet(18; pretrain = true);

The ResNet neural network was created to work with RGB images. Therefore, in addition to loading the image itself, we convert it to a three-channel format, dropping the alpha channel by applying the simple RGB command to each pixel of the image matrix.

In [ ]:
using Images
img = RGB.(load( "dog.png" ))
Out[0]:
(the image dog.png is displayed)
In [ ]:
using DataAugmentation
DATA_MEAN = (0.485, 0.456, 0.406)
DATA_STD = (0.229, 0.224, 0.225)
augmentations = CenterResizeCrop((224, 224)) |> ImageToTensor() |> Normalize(DATA_MEAN, DATA_STD)

data = apply(augmentations, Image(img)) |> itemdata
labels = readlines( "imagenet_classes_ru.txt" )
img_labels = Flux.onecold( model(Flux.unsqueeze(data, 4)), labels )
print( img_labels )
["эскимосская собака"]

Note that we obtained a vector rather than a single label; from here on we take its first element, as shown below.
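Since onecold returned a one-element vector here, extracting the label itself is plain indexing:

In [ ]:
img_class = img_labels[1]   # "эскимосская собака"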

So, if the image is not too unusual from the point of view of the ImageNet dataset, a neural network trained on that dataset recognizes it with predictable accuracy. We used the smallest of the "textbook" ResNet variants, 18 layers deep.

To find out the size of other ResNet neural networks available in the library, you can simply call help using this command:

In [ ]:
# ?ResNet

Recognizing multiple images

To perform this operation on multiple images, you can simply wrap the neural network call in a for loop (or, as below, a comprehension). A small 18-layer neural network runs in a matter of seconds even on a CPU.

In [ ]:
using Flux, Metalhead, Images, DataAugmentation

Classifying a number of images:

In [ ]:
# The preprocessing chain includes the normalization that is standard for ImageNet
DATA_MEAN = (0.485, 0.456, 0.406)
DATA_STD = (0.229, 0.224, 0.225)
augmentations = CenterResizeCrop((224, 224)) |> ImageToTensor() |> Normalize(DATA_MEAN, DATA_STD)

# Load the images using the Images library
imgs = load.("imgs/" .* ["dog1.png", "dog2.png", "cat1.png", "cat2.png", "chair1.png", "chair2.png", "rocket.png", "airplane.png"])
imgs = [ RGB.(img) for img in imgs ]

# Load the pre-trained model from Metalhead
model = ResNet(18; pretrain = true);

# Preprocess each image
imgs_data = [ apply(augmentations, Image(img)) |> itemdata for img in imgs ]

# Load the list of classes from a text file
labels = readlines("imagenet_classes_ru.txt")

# Feed each image to the network and take the most likely label
img_labels = [Flux.onecold(model(Flux.unsqueeze(data, 4)), labels)[1] for data in imgs_data]

# Display the images and their captions
println(img_labels)
imgs
["золотистый ретривер", "чихуахуа", "сиамская кошка", "африканский хамелеон", "раскладной стул", "диван-кровать", "снаряд", "авиалайнер"]
Out[0]:
(the eight images are displayed as a row to save space)

Obviously, not all labels in this list are a perfect match. As always, the limitations of the training set show, and the last image falls outside the boundaries of reliable recognition.

Intermediate latent representation

For more complex work with ResNet neural networks, you can use the intermediate representations of objects that these networks create while computing the result. One option is to exclude the final layers and work only with the backbone part of the model, which can serve as a feature generator for some other task.

In [ ]:
# The first part of ResNet, called the backbone model, accepts images of any size
backbone_model = Metalhead.backbone( model );
img_embeddings = Flux.activations(backbone_model, Flux.unsqueeze(imgs_data[1], 4))
length( img_embeddings )
Out[0]:
5

Why aren't there 18 layers? Besides the fact that we are working with the shortest version of ResNet, some layers of this model are grouped together: for example, Conv and MeanPool layers are combined into a single layer of type Chain. The activations function analyzes only the top-level Chain object, which the check below confirms.
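A minimal check of that claim (the exact printout depends on the Flux and Metalhead versions):

In [ ]:
# The number of top-level blocks should match the number of activations
println( length(backbone_model) )   # expected: 5
# println( backbone_model )         # prints the full nested Chain structure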

What is the size of the object produced by the last layer of the backbone part?

In [ ]:
size(img_embeddings[end])
Out[0]:
(7, 7, 512, 1)

You can verify that we have 512 images of size 7 by 7 in front of us. The last dimension is the batch dimension, but since we sent the images to the network one at a time, its size is 1.

Let's build an illustration where we will see all 512 layers of the image obtained by the convolutional part of the network.

In [ ]:
# Suppose img_embeddings[end] has size (7, 7, 512, 1)
img_activations = dropdims(img_embeddings[end], dims=4)  # Remove the trailing singleton dimension -> (7, 7, 512)

# Tile size and the number of tiles per row and per column
tile_height, tile_width, n_tiles_per_row, n_tiles_per_col = 7, 7, 32, 16

# Create an empty matrix for the tilemap
tilemap = zeros(Float32, tile_height * n_tiles_per_col, tile_width * n_tiles_per_row)

# Fill the tilemap
for k in 1:min(512, n_tiles_per_row * n_tiles_per_col)  # in case there are fewer than 512 activations
    i = div(k - 1, n_tiles_per_row) + 1  # row index in the tilemap
    j = mod(k - 1, n_tiles_per_row) + 1  # column index in the tilemap
    
    # Compute the target coordinates in the tilemap
    row_range = (1:tile_height) .+ (i-1)*tile_height
    col_range = (1:tile_width) .+ (j-1)*tile_width
    
    # Insert the activations
    tilemap[row_range, col_range] = img_activations[:, :, k]
end

# Visualization (heatmap is provided by the Plots library)
using Plots
heatmap(tilemap, 
       aspect_ratio=:equal, color=:viridis, cbar=false,
       title="Activation map of the penultimate layer (7×7×512 → 224×112)")
Out[0]:
(tiled heatmap of all 512 activation maps)

We see a lot of small "images". In fact, the neural network transformed the three-channel image into a 512-channel one, simultaneously reducing it to a size of 7 by 7.

This representation is useful for studying the internal structure of a neural network, but to work with a "raw" feature description, you need to apply a MeanPool-like operation to this representation and work with a feature vector rather than a multitude of small images.

In [ ]:
# Add a mean-pooling layer; the pooling size must match the output of the feature extractor
pool_model = Chain(
    backbone_model,
    AdaptiveMeanPool((1, 1)),
    Flux.flatten
)

# Loading and preprocessing
imgs = load.("imgs/" .* ["dog1.png", "dog2.png", "cat1.png", "cat2.png", "chair1.png", "chair2.png", "rocket.png", "airplane.png"])
imgs = [ RGB.(img) for img in imgs ]
imgs_data = [ apply(augmentations, Image(img)) |> itemdata for img in imgs ]

# Classification with the model `model` (to obtain the class labels)
img_labels = [Flux.onecold(model(Flux.unsqueeze(data, 4)), labels)[1] for data in imgs_data]

# One more pass, with the model `pool_model` (just to obtain the embeddings)
mlp_activations_512 = [Flux.activations( pool_model, Flux.unsqueeze(data, 4) )[end] for data in imgs_data];
plot(
    ([heatmap(reshape( activation, :, 32 ), cbar=false, title=title) for (activation,title) in zip(mlp_activations_512, img_labels)])...,
    layout=( 1,: ), size=(900,200)
)
Out[0]:
(eight heatmaps of the 512-feature embeddings, titled with the predicted labels)

The final latent representation

A latent representation (embedding) is the information taken at any slice of a neural network, as the sketch below illustrates.
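A minimal sketch of such slicing, assuming model.layers is the top-level Flux Chain of the Metalhead model (it is used this way further below):

In [ ]:
# Truncate the network after its first top-level block (the backbone)
slice = model.layers[1:1]                      # indexing a Chain returns a sub-Chain
emb = slice(Flux.unsqueeze(imgs_data[1], 4))   # the embedding at this slice
size(emb)                                      # (7, 7, 512, 1) for ResNet-18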

The last layer of ResNet-18 transforms the 512 features we saw in the last graph into 1000 features, and the most strongly expressed of them characterizes the object class.
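This final transformation can be examined on its own, assuming the library provides Metalhead.classifier as the counterpart of the Metalhead.backbone call used above:

In [ ]:
# The classifier head: typically pooling, flattening and a Dense(512 => 1000) layer
head = Metalhead.classifier( model )
println( head )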

The distribution of activations over the other "pixels" can also tell us something about the network's confidence in its prediction, or about the class of the object. However, since the network was not trained to classify high-level concepts (dog/cat) but to assign specific labels, the intra-class proximity between all dog breeds or all cat breeds may turn out to be low.

In [ ]:
DATA_MEAN = (0.485, 0.456, 0.406)
DATA_STD = (0.229, 0.224, 0.225)
augmentations = CenterResizeCrop((224, 224)) |> ImageToTensor() |> Normalize(DATA_MEAN, DATA_STD)

imgs = load.("imgs/" .* ["dog1.png", "dog2.png", "cat1.png", "cat2.png", "chair1.png", "chair2.png", "rocket.png", "airplane.png"])
imgs = [ RGB.(img) for img in imgs ]

model = ResNet(18; pretrain = true);

imgs_data = [ apply(augmentations, Image(img)) |> itemdata for img in imgs ]
labels = readlines("imagenet_classes_ru.txt")
img_labels = [Flux.onecold(model(Flux.unsqueeze(data, 4)), labels)[1] for data in imgs_data]

mlp_activations = [Flux.activations( model.layers, Flux.unsqueeze(data, 4) )[end] for data in imgs_data];
plot(
    ([heatmap(reshape( activation, 20, : ), cbar=false, title=title) for (activation,title) in zip(mlp_activations, img_labels)])...,
    layout=( 1,: ), size=(900,300), titlefont = Plots.font(9)
)
Out[0]:
(eight heatmaps of the 1000-feature outputs, titled with the predicted labels)

On the CPU, the execution of the previous cell takes 10-15 seconds. Using batches could speed up classification.
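A minimal batching sketch (the speed-up is an expectation, not measured here): stack the preprocessed images along the 4th dimension and run a single forward pass.

In [ ]:
# One forward pass over all eight images at once
batch = cat( imgs_data...; dims=4 )                 # 224×224×3×8
batch_labels = Flux.onecold( model(batch), labels )
println( batch_labels )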

Finally, let's build a correlation diagram. Had we been classifying, say, spectra with a pre-trained neural network, such a result could be of interest even before any machine learning.

In [ ]:
using Statistics # the cor function lives here

plot( heatmap( cor( hcat(mlp_activations_512...) ), title="512 features", c=:viridis, yflip=true, xticks=(1:8, img_labels), yticks=(1:8, img_labels)),
      heatmap( cor( hcat(mlp_activations...) ), title="1000 features", c=:viridis, yflip=true, xticks=(1:8, img_labels), yticks=(1:8, img_labels) ),
      size=(1000,500))
Out[0]:
(two 8×8 correlation heatmaps: 512 features vs 1000 features)

Judging by this illustration, the objects in the furniture group are closer to each other than the objects from the animal world, but more objects are needed for a full-fledged study. Note that, from the neural network's point of view, the image of the chihuahua is the least similar to furniture.

Up to this layer, the neural network has been trained to extract features relevant for classification into the declared 1000 classes, so the distribution of the final embedding can be informative. But since the ultimate goal of training was to maximize the response of one relevant logit out of 1000 (rather than to model the class distribution), semantically close objects (different dog breeds) may not show the expected proximity of distributions, and the correlation diagram may reveal nothing.
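To probe that proximity directly, we can compare pairs of embeddings with cosine similarity; a minimal sketch over the 512-feature activations computed above:

In [ ]:
using LinearAlgebra

# Cosine similarity between two feature vectors
cossim(a, b) = dot(a, b) / (norm(a) * norm(b))

e = [ vec(a) for a in mlp_activations_512 ]
println( cossim(e[1], e[2]) )   # dog1 vs dog2: expected to be relatively high
println( cossim(e[1], e[5]) )   # dog1 vs chair1: expected to be lower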

Conclusion

We took a brief dive into image classification with the ResNet neural network and studied what the embeddings (latent representations) of this network look like and how they can be useful in data analysis.