FLEXIBLE ATTENTION MECHANISM IN VISION TRANSFORMERS

Authors

  • Lucas Thimoteo, Departamento de Engenharia Elétrica, Pontifícia Universidade Católica do Rio de Janeiro
  • Keyslav Moreno, Departamento de Engenharia Elétrica, Pontifícia Universidade Católica do Rio de Janeiro
  • Jorge Machado do Amaral, Programa de Pós-Graduação de Engenharia Eletrônica, Universidade do Estado do Rio de Janeiro
  • Marley Vellasco, Departamento de Engenharia Elétrica, Pontifícia Universidade Católica do Rio de Janeiro

Keywords:

vision transformers, attention, sharing weights, image classification

Abstract

This work introduces the Flexible Vision Transformer, a novel architecture that modifies the head-creation mechanism in the attention calculation of Vision Transformers. Instead of slicing the Q, K, and V matrices into equal parts to generate each head, our approach employs random sampling in the embedding space, allowing any desired number of heads of any dimension to be generated. Our ablation studies, particularly with smaller embedding sizes (64 to 128), show that this approach can improve performance without substantially increasing the overall complexity of the model. The best improvement was observed at an embedding size of 96, with a test-set accuracy of 82.55% on CIFAR-10 and 55.27% on CIFAR-100. Moreover, our approach is compatible with other Vision Transformer models, since it only modifies the core attention component, opening the possibility of enhancing their performance without a significant increase in the number of trainable parameters. However, it is important to acknowledge the limitations of our approach and of the original Vision Transformer architecture, which tend to underperform on smaller datasets.
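
The sketch below illustrates the idea described in the abstract: rather than splitting Q, K, and V into contiguous chunks of size embed_dim // num_heads, each head gathers a randomly sampled subset of embedding dimensions, so the number of heads and the per-head dimension can be chosen independently of the embedding size. The module name, the head_dim parameter, and details such as sampling with replacement and fixing the indices at construction time are assumptions for illustration, not the authors' exact implementation.

    # Minimal PyTorch sketch of flexible head creation via random sampling
    # in the embedding space (details are assumptions, not the paper's code).
    import torch
    import torch.nn as nn

    class FlexibleAttention(nn.Module):
        def __init__(self, embed_dim: int, num_heads: int, head_dim: int):
            super().__init__()
            self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
            self.proj = nn.Linear(num_heads * head_dim, embed_dim)
            self.scale = head_dim ** -0.5
            # One random index set per head over the embedding dimension,
            # drawn once at initialization (assumed fixed thereafter).
            idx = torch.randint(0, embed_dim, (num_heads, head_dim))
            self.register_buffer("head_idx", idx)

        def forward(self, x):  # x: (batch, tokens, embed_dim)
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            # Gather the sampled dimensions for every head:
            # (batch, tokens, heads, head_dim) -> (batch, heads, tokens, head_dim)
            q = q[..., self.head_idx].permute(0, 2, 1, 3)
            k = k[..., self.head_idx].permute(0, 2, 1, 3)
            v = v[..., self.head_idx].permute(0, 2, 1, 3)
            attn = (q @ k.transpose(-2, -1)) * self.scale
            out = attn.softmax(dim=-1) @ v
            out = out.transpose(1, 2).flatten(2)  # concatenate heads
            return self.proj(out)

Because num_heads and head_dim are free hyperparameters here, a drop-in replacement of the standard attention block (e.g., FlexibleAttention(96, 8, 32)) can use more or wider heads than embed_dim // num_heads would allow, which is how the abstract frames the compatibility with other Vision Transformer models.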

Published

2024-10-18

Section

Articles