

Analysis of transformer mechanisms (Transformer Architecture Analysis)
Typology: Essays (university)


The Transformer architecture represents a shift from sequential processing models (like LSTMs and GRUs) to a parallelizable, attention-based mechanism. Introduced by Vaswani et al. in 2017, the model eliminates recurrence entirely, relying on "Self-Attention" to compute representations of its input and output without using sequence-aligned RNNs or convolution.
At the heart of the Transformer is the Attention function. It scores how relevant each word in a sentence is to every other word, regardless of their distance from each other, and uses those scores to build a weighted combination of the inputs. The core mathematical operation, known as scaled dot-product attention, is defined as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V$$
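In this equation, Q (Query), K (Key), and V (Value) are matrices whose rows are the query, key, and value vectors, each obtained from the input by a learned linear projection. Dividing by √d_k, where d_k is the dimension of the key vectors, keeps the dot products from growing large in magnitude when the dimension is high. Without this scaling, the softmax would saturate and its gradients would become extremely small.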
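The formula above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function names and the tiny example matrices are assumptions made for this sketch, and matrices are stored as lists of rows.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q is n x d_k, K is m x d_k, V is m x d_v (lists of rows)."""
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    output, weights = [], []
    for q in Q:
        # Scaled dot product of this query with every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value rows.
        output.append([sum(wj * v[c] for wj, v in zip(w, V))
                       for c in range(len(V[0]))])
    return output, weights

# Illustrative inputs (assumed values, not taken from the essay).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out, attn = scaled_dot_product_attention(Q, K, V)
```

Each row of `attn` sums to 1, and the first query places most of its weight on the first key because their dot product is the largest.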
3. Multi-Head Attention
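Instead of performing a single attention function, the model runs multiple "heads" in parallel. Each head applies its own learned linear projections to the queries, keys, and values, so it can attend to different parts of the input sequence. The head outputs are then concatenated and passed through a final linear projection. This allows the model to capture multiple relationships simultaneously (e.g., one head capturing syntactic dependencies while another captures semantic meaning).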
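A rough sketch of the idea, using only the Python standard library, might look like the following. The helper names, the seeded random matrices standing in for learned weights, and the sizes (3 tokens, model width 4, 2 heads) are all assumptions chosen for illustration.

```python
import math
import random

def matmul(A, B):
    """Multiply matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of rows."""
    scale = 1.0 / math.sqrt(len(K[0]))
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([scale * s for s in row]) for row in scores]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    # Stand-in for learned parameters (hypothetical initialization).
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def multi_head_self_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) triples, one per head."""
    head_outputs = [attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
                    for W_q, W_k, W_v in heads]
    # Concatenate head outputs along the feature axis, then mix them with W_o.
    concat = [sum((h[i] for h in head_outputs), []) for i in range(len(X))]
    return matmul(concat, W_o)

rng = random.Random(0)
seq_len, d_model, num_heads = 3, 4, 2
d_head = d_model // num_heads
X = random_matrix(seq_len, d_model, rng)
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(num_heads)]
W_o = random_matrix(num_heads * d_head, d_model, rng)
Y = multi_head_self_attention(X, heads, W_o)
```

Giving each head a width of `d_model // num_heads` keeps the total cost close to that of a single full-width head, and the output `Y` has the same shape as the input `X`.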
4. Positional Encoding
Because the Transformer processes all words in a sequence simultaneously, it has no inherent sense of word order. To address this, Positional Encodings are added to the input embeddings. These encodings use sine and cosine functions of different frequencies to provide the model with information about the absolute position of each token. Relative positions are also easy to express, because the encoding at a fixed offset from a position is a linear function of the encoding at that position.
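As an illustration, the sinusoidal scheme can be sketched as below. The base of 10000 follows the formulation in Vaswani et al.; the function name, the table sizes, and the placeholder embeddings are assumptions made for this example.

```python
import math

def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(seq_len):
        row = []
        for j in range(d_model):
            # Dimensions 2i and 2i+1 share one frequency:
            # even indices use sine, odd indices use cosine.
            angle = pos / (10000 ** ((j - j % 2) / d_model))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(seq_len=4, d_model=6)

# Placeholder token embeddings (assumed values); the encodings are added element-wise.
embeddings = [[0.1] * 6 for _ in range(4)]
inputs = [[e + p for e, p in zip(e_row, p_row)]
          for e_row, p_row in zip(embeddings, pe)]
```

Position 0 always encodes to alternating 0 and 1 values, since sin(0) = 0 and cos(0) = 1, while later positions trace out waves whose wavelength grows with the dimension index.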
5. Encoder and Decoder Stacks
The Encoder
The encoder consists of a stack of N identical layers. Each layer contains two sub-layers: a Multi-Head Self-Attention mechanism and a Position-wise Fully Connected Feed-Forward Network. Residual connections and Layer Normalization are applied around each sub-layer.
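The wiring of one encoder layer can be sketched as follows, assuming the arrangement LayerNorm(x + Sublayer(x)) from the original paper. The two sub-layers here are deliberately simplified, parameter-free stand-ins (hypothetical, for illustration only), and the layer normalization omits the learned gain and bias. In a real model each of the N layers has its own learned weights.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one vector to zero mean and (near) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def add_and_norm(X, sublayer):
    """Apply LayerNorm(x + Sublayer(x)) row by row."""
    Y = sublayer(X)
    return [layer_norm([x + y for x, y in zip(x_row, y_row)])
            for x_row, y_row in zip(X, Y)]

def toy_self_attention(X):
    # Hypothetical stand-in: every position receives the mean of all positions.
    n = len(X)
    mean_row = [sum(col) / n for col in zip(*X)]
    return [list(mean_row) for _ in X]

def toy_feed_forward(X):
    # Hypothetical stand-in for the position-wise network: an element-wise ReLU.
    return [[max(0.0, x) for x in row] for row in X]

def encoder_layer(X):
    X = add_and_norm(X, toy_self_attention)   # sub-layer 1
    return add_and_norm(X, toy_feed_forward)  # sub-layer 2

def encoder(X, num_layers):
    """Stack of layers with identical structure."""
    for _ in range(num_layers):
        X = encoder_layer(X)
    return X

X = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0], [0.0, 1.0, 0.0, 1.0]]
H = encoder(X, num_layers=2)
```

Because every sub-layer maps an input of a given shape to an output of the same shape, the residual addition is always well defined and layers can be stacked freely.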
The Decoder
The decoder shares a similar structure but includes a third sub-layer that performs Multi-Head Attention over the encoder's output. Additionally, the self-attention layer in the decoder is "masked"