Year 2020: The Transformer Expansion into Computer Vision

Emil Bogomolov
Published in CodeX · Jul 30, 2021

In recent years transformers have performed remarkably well in the field of NLP. They have significantly improved the performance of language processing models, an effect comparable to what convolutional neural networks have done for image understanding since 2012. Now, at the end of 2020, transformers are entering the top quartile of well-known computer vision benchmarks, such as image classification on ImageNet and object detection on COCO.

Continuing the topic of the previous post about the transformer-based DETR and Sparse R-CNN, in this article we will overview the recent Facebook AI and Sorbonne University joint research "Training data-efficient image transformers & distillation through attention" [1], their DeiT model, and the scientific achievements that preceded this work.

The Transformer architecture

Overview

Many improvements of convolutional networks for image classification are inspired by transformers. For example, Squeeze-and-Excitation, Selective Kernel, and Split-Attention Networks exploit mechanisms akin to the transformer's self-attention mechanism.

Transformers, introduced for machine translation in the 2017 paper "Attention Is All You Need", are currently the reference model for all natural language processing tasks. Such tasks are at the heart of the increasingly common home AI assistants and call-center chatbots. The transformer is able to find dependencies in sequential data of arbitrary length without significantly complicating the model. From this comes its great performance in sequence-to-sequence tasks such as translating text from one language to another or generating the answer to a question.

Before transformers, recurrent networks were used for these tasks. Let's talk about the drawbacks that RNNs have compared to transformers. The slide below is taken from Pascal Poupart's lecture [2].

[Slide: comparison of RNN and transformer properties]

Here we can see the following specifics:

  1. To deal with long sequences and to find dependencies in the data, one needs an LSTM or GRU with a large recurrence depth, which in practice means a very deep neural network. A transformer, on the other hand, doesn't have to be very deep to capture long-range dependencies.
  2. There is no gradient vanishing or explosion in the transformer architecture: instead of computations proceeding step by step as in RNNs, the transformer computes over the entire sequence simultaneously, and in practice it has far fewer layers than an RNN.
  3. Due to its architecture, the transformer doesn't need as many training steps as an RNN, and the lack of recurrence enables parallel computation: it can process input tokens in batches (see the sketch below).
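To make the last point concrete, here is a minimal PyTorch sketch (layer sizes are arbitrary, chosen purely for illustration): an LSTM carries a hidden state from step to step, while a transformer encoder layer attends over all positions in a single call.

```python
import torch
import torch.nn as nn

seq_len, batch, dim = 128, 32, 256
x = torch.randn(seq_len, batch, dim)   # a batch of token embeddings, (seq, batch, dim)

# RNN: the hidden state is carried from step to step, so time steps
# cannot be computed independently of each other
rnn = nn.LSTM(input_size=dim, hidden_size=dim)
rnn_out, _ = rnn(x)

# Transformer encoder layer: self-attention looks at all positions at once,
# so the whole sequence is processed in parallel
encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8)
transformer_out = encoder_layer(x)

print(rnn_out.shape, transformer_out.shape)   # both torch.Size([128, 32, 256])
```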

Structure

The transformer has an encoder-decoder structure.

It takes tokens into the encoder and outputs tokens from the decoder. Both halves consist of composite blocks, which differ slightly between encoder and decoder.

The first part of the block is the self-attention layer, which helps the block encode a piece of the input, for example a word in the input sentence, with respect to the other inputs and their positions. After that, the outputs go to a feed-forward layer that creates a new representation. The decoder block has an additional masking layer because it operates on both token sequences and shouldn't see the following, not-yet-generated tokens: it takes the representation of the input French sentence and the embeddings of the English output words generated so far.
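As a rough, simplified sketch (not the exact implementation from any particular framework or paper), an encoder block can be written in PyTorch as multi-head self-attention followed by a feed-forward layer, each wrapped in a residual connection with layer normalization:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Simplified transformer encoder block: self-attention + feed-forward,
    each with a residual connection and layer normalization."""
    def __init__(self, dim=512, num_heads=8, ff_dim=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, ff_dim), nn.GELU(), nn.Linear(ff_dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # self-attention: every token is encoded with respect to all the others
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # the feed-forward layer builds a new representation of each token
        x = self.norm2(x + self.ff(x))
        return x

tokens = torch.randn(2, 10, 512)      # (batch, sequence length, embedding dim)
print(EncoderBlock()(tokens).shape)   # torch.Size([2, 10, 512])
```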

For a better understanding of the transformer architecture, take a look at the video lecture by one of the transformer's authors and at the well-illustrated post [3].

Self-attention

This part of the network is crucial for transformers, and the DeiT architecture also relies heavily on it.

Simplifying, one can consider attention as retrieving a value from a (key, value) storage using a query, where the resulting value is a weighted sum of the stored values based on the <query, key> similarities.

The authors of the paper define this more precisely:

The attention mechanism is based on a trainable associative memory with (key, value) vector pairs. A query vector q is matched against a set of k key vectors (packed together into a matrix K) using inner products. These inner products are then scaled and normalized with a softmax function to obtain k weights. The output is the weighted sum of a set of k value vectors and can be written as:

Attention(Q, K, V) = Softmax(QKᵀ / √d) · V

where the Softmax function is applied over each row of the input matrix and the √d term provides appropriate normalization.
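A minimal PyTorch sketch of this scaled dot-product attention (batch and sequence sizes are illustrative):

```python
import torch

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = Softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]                                # query/key dimension
    scores = Q @ K.transpose(-2, -1) / d ** 0.5    # scaled <query, key> similarities
    weights = torch.softmax(scores, dim=-1)        # softmax over each row
    return weights @ V                             # weighted sum of value vectors

Q = torch.randn(1, 4, 64)    # 4 queries of dimension 64
K = torch.randn(1, 10, 64)   # 10 keys
V = torch.randn(1, 10, 64)   # 10 values
print(scaled_dot_product_attention(Q, K, V).shape)   # torch.Size([1, 4, 64])
```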

The DeiT architecture

The neural network proposed by the authors is the successor, and in fact the logical continuation, of the Vision Transformer (ViT) proposed in this paper. The novelty the researchers propose lies essentially in the following three aspects.

  1. They train a vision transformer efficiently without a large amount of curated data, while the authors of ViT assumed the opposite was necessary.
  2. They perform various experiments on how to distill knowledge from convolutional neural networks, which benefit from almost a decade of tuning and optimization.
  3. They show that the distilled model outperforms its teacher in terms of the trade-off between accuracy and throughput.

On the graph below you can see a comparison, in terms of ImageNet accuracy and inference speed, between the proposed network variants, EfficientNet (convnet) variants, and ViT.

[Figure: ImageNet accuracy vs. throughput for DeiT, EfficientNet, and ViT, from the paper]

Vision Transformer

The Vision Transformer has a simple and elegant architecture that treats an input image as a sequence of N image patches of fixed spatial size, 16 × 16 pixels. Each patch is projected with a linear layer that preserves its overall dimension of 3 × 16 × 16 = 768.

The transformer block doesn't know anything about the order of the patches, which is why a fixed or trainable positional encoding is added to the patch embeddings before the first encoder layer.

To learn from supervised data, ViT uses a class token, the way it was proposed in the BERT paper. The class token plays a role analogous to the class label in convnet training.
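Here is a hedged sketch of this input pipeline: patch projection, class token, and positional embedding, with dimensions matching the 16 × 16 patch setup described above (a toy illustration, not the authors' code):

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)          # (batch, channels, height, width)
patch_size, dim = 16, 768                    # 3 * 16 * 16 = 768

# split the image into non-overlapping 16x16 patches and flatten each of them
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)
print(patches.shape)                         # (1, 196, 768): N = 14 * 14 patches

# a linear projection that keeps the overall dimension of each patch
proj = nn.Linear(3 * patch_size * patch_size, dim)
tokens = proj(patches)

# prepend a learnable class token, then add positional embeddings
cls_token = nn.Parameter(torch.zeros(1, 1, dim))
pos_embed = nn.Parameter(torch.zeros(1, tokens.shape[1] + 1, dim))
tokens = torch.cat([cls_token.expand(1, -1, -1), tokens], dim=1) + pos_embed
print(tokens.shape)                          # (1, 197, 768) goes into the encoder
```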

There are three variants of DeiT. Their performance is shown in the table below.

[Table: DeiT model variants with their parameter counts, throughput, and accuracy, from the paper]

DeiT-B is the same model as ViT-B, i.e. it has the same architecture but is trained differently. DeiT-Ti and DeiT-S are the smaller ones; the only parameters that vary across models are the embedding dimension and the number of heads in multi-head self-attention. Smaller models have a lower parameter count and a higher throughput. Throughput is measured for images at resolution 224×224.
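For reference, the variant hyper-parameters from that table can be summarized as a small config; the embedding dimensions and head counts below are the ones reported in the paper, with all variants keeping 12 transformer layers:

```python
# DeiT variants: all share 12 transformer layers, only the embedding
# dimension and the number of attention heads change across models
deit_variants = {
    #             embed dim     heads        layers
    "DeiT-Ti": {"dim": 192, "heads": 3,  "layers": 12},
    "DeiT-S":  {"dim": 384, "heads": 6,  "layers": 12},
    "DeiT-B":  {"dim": 768, "heads": 12, "layers": 12},  # same architecture as ViT-B
}
```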

Fine-tuning at different resolution

The authors adopt the fine-tuning procedure from this paper. It is shown that it is desirable to train at a lower resolution and fine-tune the network at a larger resolution. This approach speeds up full training and improves accuracy under prevailing data augmentation schemes, which is the case when you train a transformer on a relatively small dataset such as ImageNet-1k.

When increasing the resolution of an input image, the authors keep the patch size the same; therefore the length of the patch sequence changes. The transformer doesn't require sequences of fixed length, but one needs to adapt the positional embeddings of the patches. To do that, before fine-tuning the network, the authors use a bicubic interpolation that approximately preserves the norm of the vectors.
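A hedged sketch of how such an adaptation could look (illustrative, not the authors' exact implementation): treat the patch positional embeddings as a 2D grid, resize it with bicubic interpolation, and keep the class-token embedding untouched.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid=14, new_grid=24):
    """Interpolate patch positional embeddings from old_grid^2 to new_grid^2 positions.
    pos_embed: (1, 1 + old_grid**2, dim), with the class-token embedding first."""
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pos.shape[-1]
    # reshape the flat sequence of patch embeddings back into a 2D grid
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    # bicubic resize to the new grid implied by the larger input resolution
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)

pos_embed = torch.randn(1, 1 + 14 * 14, 768)   # trained at 224x224 (14x14 patches)
print(resize_pos_embed(pos_embed).shape)       # (1, 1 + 24*24, 768) for 384x384 input
```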

Distillation

In addition to data augmentation and fine-tuning at bigger resolutions, the authors propose a technique that noticeably improves the performance of DeiT compared to ViT: they apply knowledge distillation to DeiT to transfer the inductive bias of convnets.

There are a couple of insights to keep in mind before going to the performance tables.

  1. The authors propose a distillation token; it has the same nature as the class token and is used for label prediction on par with the classification token. The distillation token is an embedding: it is fed to the network like the class token and interacts with the other inputs through self-attention. The difference between the class and distillation tokens is that the target for the latter is obtained from the predictions of a convolutional network rather than from ground-truth labels.
  2. With the distillation token, the authors tried both soft distillation and hard distillation. Briefly, soft distillation tries to minimize the Kullback-Leibler divergence between the softmax of the teacher and the softmax of the student model, while hard distillation adjusts the student's loss function by treating the hard decision of the teacher as a true label. They show in the ablation study that hard distillation on labels plus the distillation token works better than the other combinations; a sketch of the hard-distillation loss follows below.
[Table: ImageNet accuracy of DeiT-B distilled with different convnet teachers, from the paper]

It is seen from the paper that with a convnet as a teacher, DeiT-B obtains better results than the teacher. For example, RegNetY-8.0GF (note that its accuracy varies between this and the original paper because of different augmentation) performs about two points worse than its student.

The table above shows how the accuracy of DeiT-B changes during fine-tuning (3rd and 4th columns) depending on which teacher is used for distillation (rows). The teachers' initial accuracy is given in the second column.
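Below is a hedged sketch of the hard-label distillation objective described above (function and tensor names are illustrative): the class-token logits are trained against the ground-truth label, the distillation-token logits against the teacher's hard prediction, and the two cross-entropy terms are averaged.

```python
import torch
import torch.nn.functional as F

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    """Hard-label distillation: the class token learns from the ground truth,
    the distillation token learns from the convnet teacher's hard decision."""
    teacher_labels = teacher_logits.argmax(dim=-1)             # teacher's hard prediction
    loss_cls = F.cross_entropy(cls_logits, labels)             # supervised term
    loss_dist = F.cross_entropy(dist_logits, teacher_labels)   # distillation term
    return 0.5 * loss_cls + 0.5 * loss_dist

# toy example: batch of 8 images, 1000 ImageNet classes
cls_logits = torch.randn(8, 1000)      # logits predicted from the class token
dist_logits = torch.randn(8, 1000)     # logits predicted from the distillation token
teacher_logits = torch.randn(8, 1000)  # output of the convnet teacher
labels = torch.randint(0, 1000, (8,))
print(hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels))
```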

Show me the code!

In this final section, as always, you can find the link to the paper's code. If you hope to find the paper's main proposal there, namely distillation, it is unfortunately not provided yet, but one of the issues says it will be released soon.

References

[1] Training data-efficient image transformers & distillation through attention https://arxiv.org/abs/2012.12877

[2] CS480/680 Lecture 19: Attention and Transformer Networks https://www.youtube.com/watch?v=OyFJWRnt_AY

[3] The Illustrated Transformer http://jalammar.github.io/illustrated-transformer/
