Attention Is All You Need

Research Paper

2017
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
transformers · attention-mechanism · neural-networks · NLP · deep-learning

Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing with large and limited training data.

Key Contributions

1. Self-Attention Mechanism

The Transformer is built entirely on self-attention: each position in the encoder can attend to all positions in the previous layer, allowing the model to capture dependencies between any two positions in the input sequence regardless of their distance.

2. Multi-Head Attention

Instead of performing a single attention function, the Transformer uses multiple attention heads in parallel. This allows the model to jointly attend to information from different representation subspaces at different positions.

3. Positional Encoding

Since the Transformer contains no recurrence or convolution, positional information must be injected to give the model some sense of the order of the sequence. The paper uses sinusoidal positional encodings.
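Because the encodings are a fixed function of position, they are easy to reproduce. Below is a minimal NumPy sketch of the sinusoidal scheme PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), assuming an even model dimension; the function name and shapes are illustrative choices, not taken from the paper's code.

import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings."""
    positions = np.arange(seq_len)[:, None]                 # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # even dimensions 0, 2, 4, ...
    angles = positions / np.power(10000.0, dims / d_model)  # one column per sin/cos pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                            # even indices get sine
    pe[:, 1::2] = np.cos(angles)                            # odd indices get cosine
    return pe

# Example: encodings for a 50-token sequence at the paper's d_model = 512.
pe = sinusoidal_positional_encoding(50, 512)                # shape (50, 512)

These encodings are simply added to the token embeddings, so nearby positions receive similar vectors while distant ones diverge.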

4. Encoder-Decoder Architecture

The Transformer follows an encoder-decoder structure:

  • Encoder: Maps an input sequence of symbol representations to a sequence of continuous representations
  • Decoder: Given that representation, generates an output sequence one element at a time, consuming its previously generated symbols as additional input at each step

Technical Details

Architecture Overview

Both the encoder and the decoder are stacks of N = 6 identical layers. Each encoder layer has two sub-layers (decoder layers add a third sub-layer that attends over the encoder output), and every sub-layer is wrapped in a residual connection followed by layer normalization, as sketched in code after the list:

  1. Multi-head self-attention mechanism
  2. Position-wise fully connected feed-forward network
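A minimal sketch of how one encoder layer wires these two sub-layers together, using the LayerNorm(x + Sublayer(x)) pattern described in the paper. The sub-layers themselves are passed in as plain callables, and the function names and shapes are illustrative assumptions, not the paper's reference implementation.

import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize each position's feature vector to zero mean, unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, self_attention, feed_forward):
    """Apply both sub-layers, each wrapped in a residual connection + layer norm."""
    x = layer_norm(x + self_attention(x))   # sub-layer 1: multi-head self-attention
    x = layer_norm(x + feed_forward(x))     # sub-layer 2: position-wise feed-forward
    return x

The learned gain and bias of layer normalization are omitted here for brevity; the point is the residual-then-normalize wiring around each sub-layer.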

Attention Function

The attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.

Attention(Q,K,V) = softmax(QK^T/√d_k)V
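As a concrete illustration, here is a minimal NumPy sketch of scaled dot-product attention for a single head and a single unmasked sequence; the example shapes and the softmax helper are assumptions made for the sketch, not details from the paper.

import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (len_q, len_k) similarity scores
    weights = softmax(scores, axis=-1)        # attention weights over the keys
    return weights @ V                        # weighted sum of the values

# Example: 4 query positions attending over 6 key/value positions, d_k = d_v = 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
out = scaled_dot_product_attention(Q, K, V)   # shape (4, 8)

The division by √d_k keeps the dot products from growing large with dimension, which would otherwise push the softmax into regions with vanishing gradients.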

Multi-Head Attention

Multi-head attention allows the model to jointly attend to information from different representation subspaces:

MultiHead(Q,K,V) = Concat(head_1, ..., head_h)W^O
where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
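Continuing the sketch above (and reusing scaled_dot_product_attention from it), the following shows the same concat-and-project structure, with random matrices standing in for the learned projections W_i^Q, W_i^K, W_i^V, and W^O; the head count and dimensions are illustrative only.

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """W_q, W_k, W_v: lists of per-head projection matrices; W_o: output projection."""
    heads = [
        scaled_dot_product_attention(Q @ wq, K @ wk, V @ wv)
        for wq, wk, wv in zip(W_q, W_k, W_v)
    ]
    return np.concatenate(heads, axis=-1) @ W_o   # Concat(head_1, ..., head_h) W^O

# Example: h = 2 heads, d_model = 16, per-head dimension d_k = d_v = 8.
rng = np.random.default_rng(0)
h, d_model, d_k = 2, 16, 8
X = rng.normal(size=(5, d_model))                 # self-attention: Q = K = V = X
W_q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_k = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_v = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_o = rng.normal(size=(h * d_k, d_model))
out = multi_head_attention(X, X, X, W_q, W_k, W_v, W_o)   # shape (5, d_model)

Because each head works in a lower-dimensional subspace (d_k = d_model / h in the paper), the total cost is similar to single-head attention with full dimensionality.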

Impact and Legacy

The Transformer architecture has become the foundation for:

  • BERT (Bidirectional Encoder Representations from Transformers)
  • GPT (Generative Pre-trained Transformer) series
  • T5 (Text-to-Text Transfer Transformer)
  • Vision Transformers (ViT)
  • Modern large language models like ChatGPT, Claude, and others

Applications

  1. Machine Translation: Original application in the paper
  2. Text Generation: GPT models for creative writing, code generation
  3. Question Answering: BERT and its variants
  4. Image Processing: Vision Transformers for computer vision
  5. Speech Processing: Speech recognition and synthesis
  6. Code Generation: GitHub Copilot, CodeT5

Why It Matters for Software Engineering

Understanding the Transformer architecture is crucial for:

  • AI/ML Engineers: Building modern language models
  • Software Engineers: Working with AI-powered applications
  • System Designers: Designing scalable AI inference systems
  • Product Managers: Understanding AI capabilities and limitations

The paper represents a paradigm shift from recurrent to attention-based architectures, fundamentally changing how we approach sequence modeling in deep learning.
