Attention Is All You Need

2 Aug 2023 | Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
The paper introduces the Transformer, a novel sequence transduction model that relies solely on attention mechanisms, eliminating the need for recurrent or convolutional layers. The Transformer is designed to be highly parallelizable and efficient, achieving superior performance in machine translation tasks while requiring significantly less training time than existing models. On the WMT 2014 English-to-German translation task, the Transformer achieves a BLEU score of 28.4, surpassing previous state-of-the-art models by over 2 BLEU. On the English-to-French translation task, it sets a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, at a fraction of the training cost of previous models. The Transformer also generalizes well to other tasks, such as English constituency parsing, demonstrating its versatility and robustness. The paper further discusses the architectural details of the Transformer, including multi-head attention, position-wise feed-forward networks, and positional encoding, and provides insights into the benefits of self-attention over recurrent and convolutional layers.
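As a minimal NumPy sketch of two of the components named above, the snippet below illustrates the paper's scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, and its sinusoidal positional encoding. The function names, shapes, and batching convention are illustrative choices for this sketch, not the authors' implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, as defined in the paper.
    Q, K, V have shapes (batch, len_q, d_k), (batch, len_k, d_k), (batch, len_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)        # (batch, len_q, len_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)           # row-wise softmax
    return weights @ V                                        # (batch, len_q, d_v)

def sinusoidal_positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...).
    Assumes an even d_model (e.g. 512, as in the base model)."""
    pos = np.arange(max_len)[:, None]                         # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                      # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                              # even dimensions
    pe[:, 1::2] = np.cos(angles)                              # odd dimensions
    return pe

# Illustrative usage: one attention head over a length-10 sequence.
Q = K = V = np.random.randn(1, 10, 64)
out = scaled_dot_product_attention(Q, K, V)                   # (1, 10, 64)
pe = sinusoidal_positional_encoding(10, 512)                  # added to embeddings before layer 1
```

In the full model, multi-head attention applies this attention function in parallel over several learned linear projections of Q, K, and V and concatenates the results; the positional encodings are added to the token embeddings so the otherwise order-agnostic attention layers can use sequence position.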