Transformers from scratch
A deep dive on the architecture behind LLMs

I like building robots. I like exploring low level stuff. Sometimes I do stuff with my GPU.
We will go through the paper "Attention Is All You Need" and build out a transformer model from scratch. I will try to explain everything bottom-up, moving from raw text to complete architectural blocks. When I train this transformer model on something and have solid results, I will update the blog accordingly. This is still very much a work in progress for me; I am mostly writing this to learn. C-3PO pic unrelated.
Embeddings
It would be a lot easier to work with text, if we could somehow represent it numerically. We do this by converting each token in the text into a vector. We use nn.Embedding to achieve this. It is essentially a giant lookup table of shape (vocab_size, d_model). It should be noted that conventionally, these vectors are called embeddings.
Now what exactly does that mean?
We have a limited vocabulary of size vocab_size, and we would like to represent each item (token) in it as a d_model-dimensional vector. This is essentially what nn.Embedding does: it stores the vectors for the entire vocabulary as a 2-dimensional lookup table.
For example, the 10th row from this lookup table would give us a d_model dimensional vector that represents the 10th token in the vocabulary.
The paper specifies that we have to scale each of these vectors by the square root of d_model. Keeping that in mind, here is an implementation for the Embedding layer.
import torch
from torch import nn


class ScaledEmbedding(nn.Module):
    def __init__(self, d_model: int, vocab_size: int) -> None:
        super().__init__()
        self.scale: float = d_model**0.5  # precalc the scale
        self.embedding: nn.Embedding = nn.Embedding(
            num_embeddings=vocab_size, embedding_dim=d_model
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch_size, seq_len) token ids -> (batch_size, seq_len, d_model)
        return self.embedding(x) * self.scale
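To make the lookup-table picture concrete, here is a minimal plain-Python sketch (no PyTorch) of what the layer computes. The table values, the sizes (a vocabulary of 4 tokens, d_model of 3) and the helper name `embed` are all made up for illustration; a real nn.Embedding holds learned weights.

```python
import math

# Toy lookup table of shape (vocab_size, d_model) = (4, 3).
# The numbers are arbitrary; a trained model would learn them.
table = [
    [0.1, 0.2, 0.3],  # token id 0
    [0.4, 0.5, 0.6],  # token id 1
    [0.7, 0.8, 0.9],  # token id 2
    [1.0, 1.1, 1.2],  # token id 3
]
d_model = len(table[0])
scale = math.sqrt(d_model)


def embed(token_ids: list[int]) -> list[list[float]]:
    """Look up each token id's row and scale it by sqrt(d_model)."""
    return [[value * scale for value in table[t]] for t in token_ids]


vectors = embed([2, 0, 2])
print(len(vectors), len(vectors[0]))  # 3 tokens, each a 3-dimensional vector
print(vectors[0] == vectors[2])       # the same token id gives the same vector
```

Note that the same token id always maps to the same vector, no matter where it appears in the sequence.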
Positional Encoding
So far, we are able to convert textual data into an embedding (a vector). But as we’ll see soon, transformers process all tokens in parallel. So, the information about the order in which the tokens appear is lost. This is simpler to explain with an example:
The dog bit the man
The man bit the dog
Both of these sentences would result in the same set of embeddings, and the transformer model would have no information about the order in which they were originally present. So, we use positional encoding to “inject” positional data into our embeddings.
Simply put, we do this by adding a unique vector to each token in the input sentence. And this unique vector is calculated using the position of the token in the sentence alone. So two tokens with different positions in the sequence will have different unique vectors added to them.
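A tiny plain-Python illustration of the problem, under a simplifying assumption: instead of adding a vector per position, each token is just paired with its index. The variable names are hypothetical and only meant to show why position has to be injected somehow.

```python
sentence_a = "the dog bit the man".split()
sentence_b = "the man bit the dog".split()

# Without positions, each sentence is just a bag of tokens: the two are identical.
print(sorted(sentence_a) == sorted(sentence_b))  # True

# Pairing each token with its position makes the two inputs distinguishable.
with_pos_a = set(enumerate(sentence_a))
with_pos_b = set(enumerate(sentence_b))
print(with_pos_a == with_pos_b)  # False
```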
So how do we come up with these unique vectors?
The most common way is to use sine and cosine functions alternating along the length of a vector. Apart from using the index along the vector, we also use the index in the sequence itself.

The paper uses sinusoidal functions to achieve this:
$$PE(pos, 2i) = \sin{(pos / 10000^\frac{2i}{d_{model}})}$$
$$PE(pos, 2i+1) = \cos{(pos / 10000^\frac{2i}{d_{model}})}$$
However, this is not very numerically stable to calculate, so we will rely on logarithms to simplify:
$$\log{(pos / 10000^\frac{2i}{d_{model}})} = \log{pos} - \frac{2i}{d_{model}} * \log{10000}$$
$$\implies pos / 10000^\frac{2i}{d_{model}} = pos * \exp{(-\frac{2i}{d_{model}} * \log{10000})}$$
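As a quick sanity check, the sketch below compares the direct formula with the exponential form in plain Python. The value `d_model = 8` and the function names `direct` and `via_exp` are illustrative choices, not anything from the paper. The log step in the derivation needs pos > 0, but the final identity also holds for pos = 0, where both sides are zero.

```python
import math

d_model = 8  # small illustrative value, not the paper's 512


def direct(pos: int, i: int) -> float:
    """Angle from the paper's formula: pos / 10000^(2i / d_model)."""
    return pos / (10000 ** (2 * i / d_model))


def via_exp(pos: int, i: int) -> float:
    """Same angle, computed as pos * exp(-(2i / d_model) * ln(10000))."""
    return pos * math.exp(-(2 * i / d_model) * math.log(10000.0))


for pos in (0, 1, 7, 50):
    for i in range(d_model // 2):
        assert math.isclose(direct(pos, i), via_exp(pos, i), rel_tol=1e-9)

# Encoding for position 0: every sine is 0 and every cosine is 1.
pe_0 = []
for i in range(d_model // 2):
    angle = via_exp(0, i)
    pe_0 += [math.sin(angle), math.cos(angle)]
print(pe_0)  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```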
Visualizing the encoding for a maximum sequence length of 100 and an embedding dimension of 128:

Essentially, we add this to our embeddings tensor. And that injects positional information into them, which is helpful for the model to understand that proper ordering is of significance here.
Here is the final implementation:
import math


class PositionalEncoding(nn.Module):
    pe: torch.Tensor

    def __init__(self, d_model: int, max_len: int) -> None:
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
        log_10000 = math.log(10000.0)
        div_term = torch.exp(
            -torch.arange(0, d_model, 2).float() * (log_10000 / d_model)
        )  # (d_model // 2, )
        encoding = position * div_term  # (max_len, d_model // 2)
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(encoding)  # even feature indices
        pe[:, 1::2] = torch.cos(encoding)  # odd feature indices
        # dim 0 is the batch dimension; size 1 lets broadcasting work with any batch size
        pe = pe.unsqueeze(0)  # (1, max_len, d_model)
        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.pe[:, : x.shape[1], :]  # slice pe to the input's seq_len
        return x
Attention
Scaled Dot Product Attention
Before we try to tackle multi-head attention, let us understand Scaled Dot Product Attention. The paper mentions this as:
$$\text{Attention}(Q, K, V) = \text{softmax}(\frac{QK^T}{\sqrt{d_k}})V$$
Queries (Q) represent what each token in the sequence is looking for. They're literally a "query".
Keys (K) represent a sort of label that tokens have. A token’s key tells other tokens what kind of information it holds.
Values (V) represent the actual information that each token holds.
That is their physical significance. Now let us look at how they're actually used, mathematically.
In a general Transformer block, the Query (Q) tensor is shaped \((L_q, d_{model})\), while the Key (K) and Value (V) tensors are shaped \((L_{kv}, d_{model})\). This flexibility allows the model to calculate attention even when the source and target sequences have different lengths. This will make more sense if you come back to it after you finish reading this post.
Let us focus on the \(QK^T\) part of attention. Usually, the query tensor contains queries from all tokens in the input sequence. And the key and value tensors contain keys and values respectively for the tokens that the querying tokens should be attending to.

A matrix product between the query and key tensor results in a \((L_q, L_{kv})\) shaped tensor. This tensor is essentially a similarity map between each of the \(L_q\) querying tokens and the \(L_{kv}\) keyed tokens. Taking the softmax of this output, we get a normalized importance map.

The output of this softmax tells us how much each of the \(L_q\) querying tokens cares about each of the \(L_{kv}\) keyed tokens. Now that we know which tokens are the most important for the query tokens to attend to, we take the product between the attention weights and the value tensor.
\(\text{softmax}(\frac{QK^T}{\sqrt{d_k}})\) is a tensor telling us how much each token cares about any other token
\(V\) is a tensor telling us the information held by each token
If we matrix multiply these two, we get a new tensor where every row contains a token's "absorbed" information from every other token in the sequence, proportional to how much it cared about them.
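The whole computation can be sketched in plain Python on nested lists, which makes each step visible. This is a minimal illustration with made-up numbers, no batching and no learned projections; the function names `softmax` and `attention` are my own.

```python
import math


def softmax(row: list[float]) -> list[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for plain nested lists.

    q: (L_q, d_k), k: (L_kv, d_k), v: (L_kv, d_v)
    """
    d_k = len(q[0])
    # scores[i][j] = similarity between query i and key j, shape (L_q, L_kv)
    scores = [
        [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in k]
        for q_row in q
    ]
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted average of the value rows.
    d_v = len(v[0])
    output = [
        [sum(w * v_row[c] for w, v_row in zip(w_row, v)) for c in range(d_v)]
        for w_row in weights
    ]
    return weights, output


# Made-up numbers: 2 queries attending over 3 key/value pairs.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

weights, output = attention(Q, K, V)
print(len(weights), len(weights[0]))  # 2 3 -> shape (L_q, L_kv)
print([round(sum(row), 6) for row in weights])  # each row sums to 1
```

The first query matches the first and third keys equally well, so it puts equal weight on those two values and less on the second.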
So, what was the point of doing all this? Well, when we started out we had an input embedding matrix. Any word from this input sequence, say 'bank', would have had the same embedding values regardless of whether it was part of a river bank or a banking institution. What we did was rewrite the entire matrix into a more context-aware one. Now, the embedding values for 'bank' would be different depending on whether it came from a river or a financial context. This happened because the token absorbed information from the tokens it attended to, and now holds more information than it began with.
Multi-Head Attention
Instead of doing everything in one single pass, with just one of each Q, K, V tensors, we split the input matrix into smaller chunks. Literally, we split:
(seq_len_q, d_model) --> (h, seq_len_q, d_k)
Here, \(d_k = d_{model} // h\)
By doing this we run \(h\) attention computations in parallel, one per head, each on a \(d_k\)-dimensional slice. The idea is that each head will learn to specialize in a different kind of relationship, say grammatical relationships, or verb-object pairings, or patterns that we don't see ourselves. To make the Q, K, V tensors learnable, we introduce a linear layer for each of them and one more for the final output. We split the projected Q, K, V tensors into heads, run attention on each head, and then concatenate the per-head outputs back into a full d_model-wide tensor before the output layer.
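The split and the final concatenation are just slicing along the feature dimension. Here is a small plain-Python sketch with toy sizes (seq_len of 3, d_model of 4, h of 2); the helper names `split_heads` and `merge_heads` are hypothetical, and the toy matrix stands in for a projected Q, K or V tensor.

```python
seq_len, d_model, h = 3, 4, 2
d_k = d_model // h

# Toy (seq_len, d_model) matrix; entry = 10 * row + column so slices are easy to spot.
x = [[10 * r + c for c in range(d_model)] for r in range(seq_len)]


def split_heads(x, h, d_k):
    """(seq_len, d_model) -> (h, seq_len, d_k): head i gets columns [i*d_k, (i+1)*d_k)."""
    return [[row[i * d_k : (i + 1) * d_k] for row in x] for i in range(h)]


def merge_heads(heads):
    """(h, seq_len, d_k) -> (seq_len, d_model): concatenate the heads column-wise."""
    n_rows = len(heads[0])
    return [sum((head[r] for head in heads), []) for r in range(n_rows)]


heads = split_heads(x, h, d_k)
print(heads[0])  # [[0, 1], [10, 11], [20, 21]]
print(heads[1])  # [[2, 3], [12, 13], [22, 23]]
print(merge_heads(heads) == x)  # True
```

Every head sees all tokens, but only its own slice of each token's features.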
NOTE: We will implement this module with masking in mind. Masking will be required when implementing the decoder stack. We allow for an additional mask argument to our attention module, that lets us block out certain tokens from seeing certain other tokens. The reasoning behind this will be explained when we implement the decoder.
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, h: int) -> None:
        super().__init__()
        assert d_model % h == 0, "d_model is not perfectly divisible by h"
        self.h: int = h
        self.d_k: int = d_model // h
        self.sqrt_d_k: float = self.d_k**0.5
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(
        self,
        q: torch.Tensor,  # (batch_size, seq_len_q, d_model)
        k: torch.Tensor,  # (batch_size, seq_len_kv, d_model)
        v: torch.Tensor,  # (batch_size, seq_len_kv, d_model)
        mask: torch.Tensor | None,  # (batch_size, seq_len_q, seq_len_kv)
    ) -> torch.Tensor:
        query = self.w_q(q)  # (batch_size, seq_len_q, d_model)
        key = self.w_k(k)  # (batch_size, seq_len_kv, d_model)
        value = self.w_v(v)  # (batch_size, seq_len_kv, d_model)
        batch_size, seq_len_q, _ = q.shape
        seq_len_kv = k.shape[1]
        # split into h pieces
        # we have to do (batch_size, seq_len, d_model) --> (batch_size, h, seq_len, d_k)
        # first (batch_size, seq_len, d_model) --> (batch_size, seq_len, h, d_k)
        # then (batch_size, seq_len, h, d_k).transpose(1, 2) --> (batch_size, h, seq_len, d_k)
        query = query.view(batch_size, seq_len_q, self.h, self.d_k).transpose(
            1, 2
        )  # (batch_size, h, seq_len_q, d_k)
        key = key.view(batch_size, seq_len_kv, self.h, self.d_k).transpose(
            1, 2
        )  # (batch_size, h, seq_len_kv, d_k)
        value = value.view(batch_size, seq_len_kv, self.h, self.d_k).transpose(
            1, 2
        )  # (batch_size, h, seq_len_kv, d_k)
        attention_scores = (
            query @ key.transpose(-1, -2)
        ) / self.sqrt_d_k  # (batch_size, h, seq_len_q, seq_len_kv)
        if mask is not None:
            if mask.dim() == 3:
                # add a head dimension so the mask broadcasts across all h heads:
                # (batch_size, seq_len_q, seq_len_kv) --> (batch_size, 1, seq_len_q, seq_len_kv)
                mask = mask.unsqueeze(1)
            attention_scores = attention_scores.masked_fill(mask == 0, float("-inf"))
        attention_scores = torch.softmax(
            attention_scores, dim=-1
        )  # (batch_size, h, seq_len_q, seq_len_kv)
        attention = attention_scores @ value  # (batch_size, h, seq_len_q, d_k)
        # merge the heads back: (batch_size, h, seq_len_q, d_k) --> (batch_size, seq_len_q, d_model)
        attention = (
            attention.transpose(1, 2).contiguous().view(batch_size, seq_len_q, -1)
        )  # (batch_size, seq_len_q, d_model)
        return self.w_o(attention)  # (batch_size, seq_len_q, d_model)
Feed-Forward and Add & Norm
The architecture in the paper places a feed-forward network at the end of every encoder and decoder block. Additionally, layer normalization is applied after every attention and feed-forward sub-layer; we will use PyTorch’s nn.LayerNorm for this. As mentioned in the training section of the paper, we will also apply dropout to the output of each sub-layer. Dropout is not specific to transformers, but it is mentioned here for implementation’s sake.
Layer norm normalizes each token's feature vector to zero mean and unit variance, then scales and shifts it with learnable parameters. This keeps the activations in a consistent range and improves training stability. The details, however, are a topic for another blog post.
class FeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int) -> None:
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.linear1(x)
        x = torch.relu(x)
        x = self.linear2(x)
        return x
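For reference, this module corresponds to the position-wise feed-forward network from the paper, which is applied to every position separately and identically. Written out, with \(W_1, b_1\) standing for the parameters of linear1 and \(W_2, b_2\) for those of linear2 (the mapping to the attribute names is my own labelling):

$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

Here \(W_1\) maps from \(d_{model}\) to \(d_{ff}\) and \(W_2\) maps back from \(d_{ff}\) to \(d_{model}\), so the output has the same shape as the input.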
Residual Connections
One thing to keep in mind before going forward is the use of residual connections in the architecture. Every sub-layer’s (Attention or Feed-Forward) output is added to its own input before applying layer normalization.
$$\text{LayerNorm}(x + \text{Sublayer}(x))$$
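As a rough sketch of what this formula does to a single token vector, here is a plain-Python version. It is simplified on purpose: `gamma` and `beta` are scalars here, whereas nn.LayerNorm learns one scale and one shift per feature, and the `sublayer_out` values are made-up stand-ins for an attention or feed-forward output.

```python
import math


def layer_norm(
    x: list[float], gamma: float = 1.0, beta: float = 0.0, eps: float = 1e-5
) -> list[float]:
    """Normalize one token's features to zero mean and unit variance,
    then apply a scale (gamma) and a shift (beta)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in x]


def add_and_norm(x: list[float], sublayer_out: list[float]) -> list[float]:
    """LayerNorm(x + Sublayer(x)) for a single token vector."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])


x = [1.0, 2.0, 3.0, 4.0]
sublayer_out = [0.5, -0.5, 0.5, -0.5]  # stand-in for an attention or feed-forward output
y = add_and_norm(x, sublayer_out)
print(round(sum(y) / len(y), 6))  # mean is approximately 0
```

Because the input is added back before normalizing, the sub-layer only has to learn a correction to its input rather than a whole new representation.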
Encoder & Decoder
The encoder’s job is to take raw input embeddings and enrich them by making them context-aware, as discussed in the Attention section. The encoder makes use of Self-Attention, where every token attends to every other token in the same sentence. The encoder is made of the following block of Attention and Feed-Forward layers, stacked several times (the paper uses \(N=6\)).

The output of the encoder is a “latent map” which is a tensor of embeddings which are more context-aware. Here is the implementation:
class EncoderBlock(nn.Module):
    def __init__(self, d_model: int, h: int, d_ff: int, dropout: float) -> None:
        super().__init__()
        self.mha = MultiHeadAttention(d_model, h)
        self.ff = FeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask: torch.Tensor | None) -> torch.Tensor:
        # self-attention
        attention_output = self.mha(x, x, x, mask)  # (batch_size, seq_len, d_model)
        attention_output = self.dropout1(attention_output)
        x = self.norm1(attention_output + x)
        ff_output = self.ff(x)  # (batch_size, seq_len, d_model)
        ff_output = self.dropout2(ff_output)
        x = self.norm2(ff_output + x)
        return x


class Encoder(nn.Module):
    def __init__(
        self,
        d_model: int,
        vocab_size: int,
        max_len: int,
        h: int,
        d_ff: int,
        n_layers: int,
        dropout: float,
    ) -> None:
        super().__init__()
        # stack N encoder blocks
        self.layers = nn.ModuleList(
            [EncoderBlock(d_model, h, d_ff, dropout) for _ in range(n_layers)]
        )

    def forward(self, x: torch.Tensor, mask: torch.Tensor | None) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x, mask)
        return x
The decoder’s job is a bit more complicated. It looks at the encoder’s output (the richer, more context-aware representation of the input sentence) and it also looks at what it has generated itself so far. The decoder is autoregressive in this regard, i.e., it looks at its own output to generate future outputs. The idea is that the decoder will try to correctly predict what comes next by looking at the encoder’s latent map and its own output.
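But during training the decoder is fed the entire target sequence at once, and we do not want it to look at the tokens it is supposed to predict (and therefore cheat). So we blindfold it selectively, so that each position can only look at itself and at past tokens. We do this with Masked Self-Attention, where we mask the attention scores so that future positions end up with zero weight after the softmax.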
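The "only look at past tokens" rule can be sketched as a lower-triangular mask applied to the scores before the softmax. This plain-Python version is a minimal illustration; the helper names `causal_mask` and `masked_softmax` are hypothetical, and the convention (0 means blocked) is assumed to match the `mask == 0` check in the attention module above.

```python
import math


def causal_mask(seq_len: int) -> list[list[int]]:
    """mask[i][j] == 1 if position i may attend to position j (j <= i), else 0."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


def masked_softmax(scores: list[float], mask_row: list[int]) -> list[float]:
    """Set masked scores to -inf before softmax so they receive zero weight."""
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask_row)]
    m = max(masked)  # finite, because the diagonal is never masked
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(4)
for row in mask:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]

# Position 1 has equal raw scores for all four tokens, but may only see tokens 0 and 1.
print(masked_softmax([0.0, 0.0, 0.0, 0.0], mask[1]))  # [0.5, 0.5, 0.0, 0.0]
```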
Now, for the decoder to also be able to look at the encoder’s output, we have another Attention layer that takes queries from the decoder but keys and values from the encoder’s output. This cross talk allows the transformer to “understand” sequence to sequence patterns. This Attention layer is therefore called Cross-Attention. Similar to the encoder block, we follow up with a Feed-Forward layer, and it is also stacked up a number of times.

Here is the implementation:
class DecoderBlock(nn.Module):
    def __init__(self, d_model: int, h: int, d_ff: int, dropout: float) -> None:
        super().__init__()
        self.masked_mha = MultiHeadAttention(d_model, h)
        self.mha = MultiHeadAttention(d_model, h)
        self.ff = FeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)

    def forward(
        self,
        x: torch.Tensor,  # (batch_size, seq_len_tgt, d_model)
        encoder_output: torch.Tensor,  # (batch_size, seq_len_src, d_model)
        src_mask: torch.Tensor | None,  # (batch_size, seq_len_tgt, seq_len_src)
        tgt_mask: torch.Tensor | None,  # (batch_size, seq_len_tgt, seq_len_tgt)
    ) -> torch.Tensor:
        # masked self attention: queries, keys and values all come from the decoder
        masked_attention_output = self.masked_mha(
            x, x, x, tgt_mask
        )  # (batch_size, seq_len_tgt, d_model)
        masked_attention_output = self.dropout1(masked_attention_output)
        x = self.norm1(masked_attention_output + x)
        # cross attention: queries from the decoder, keys and values from the encoder
        attention_output = self.mha(
            x, encoder_output, encoder_output, src_mask
        )  # (batch_size, seq_len_tgt, d_model)
        attention_output = self.dropout2(attention_output)
        x = self.norm2(attention_output + x)
        ff_output = self.ff(x)  # (batch_size, seq_len_tgt, d_model)
        ff_output = self.dropout3(ff_output)
        x = self.norm3(ff_output + x)
        return x


class Decoder(nn.Module):
    def __init__(
        self,
        d_model: int,
        vocab_size: int,
        max_len: int,
        h: int,
        d_ff: int,
        n_layers: int,
        dropout: float,
    ) -> None:
        super().__init__()
        # stack N decoder blocks
        self.layers = nn.ModuleList(
            [DecoderBlock(d_model, h, d_ff, dropout) for _ in range(n_layers)]
        )

    def forward(
        self,
        x: torch.Tensor,
        encoder_output: torch.Tensor,
        src_mask: torch.Tensor | None,
        tgt_mask: torch.Tensor | None,
    ) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x, encoder_output, src_mask, tgt_mask)
        return x
Putting it all together: The Transformer
The output from the decoder stack is passed through a final linear layer and then through a softmax function. This step, where the decoder’s output is turned into a probability distribution over the entire vocabulary, is called Projection. At inference time, the token picked from this distribution is appended to the decoder’s input, and the process repeats in a loop until the sequence is complete.
Projection is done with an nn.Linear layer that maps from d_model to vocab_size. Finally, log_softmax converts the result into log-probabilities over the vocabulary. I’ve used a config class to instantiate my transformer class. The values of d_model, num_layers, num_heads, d_ff and dropout follow the paper's base model; the vocabulary sizes and max_len depend on the tokenizer and dataset, so treat them as placeholders. Here is the implementation of the final Transformer:
from dataclasses import dataclass


@dataclass
class TransformerConfig:
    d_model: int = 512
    vocab_size_src: int = 30000
    vocab_size_tgt: int = 30000
    num_layers: int = 6
    num_heads: int = 8
    d_ff: int = 2048
    dropout: float = 0.1
    max_len: int = 5000


class Transformer(nn.Module):
    def __init__(self, cfg: TransformerConfig) -> None:
        super().__init__()
        self.input_scaled_emb = ScaledEmbedding(cfg.d_model, cfg.vocab_size_src)
        self.output_scaled_emb = ScaledEmbedding(cfg.d_model, cfg.vocab_size_tgt)
        self.input_pos_enc = PositionalEncoding(cfg.d_model, cfg.max_len)
        self.output_pos_enc = PositionalEncoding(cfg.d_model, cfg.max_len)
        self.encoder = Encoder(
            cfg.d_model,
            cfg.vocab_size_src,
            cfg.max_len,
            cfg.num_heads,
            cfg.d_ff,
            cfg.num_layers,
            cfg.dropout,
        )
        self.decoder = Decoder(
            cfg.d_model,
            cfg.vocab_size_tgt,
            cfg.max_len,
            cfg.num_heads,
            cfg.d_ff,
            cfg.num_layers,
            cfg.dropout,
        )
        # use the configured rate instead of a hard-coded 0.1
        self.input_emb_dropout = nn.Dropout(cfg.dropout)
        self.output_emb_dropout = nn.Dropout(cfg.dropout)
        self.projection_linear = nn.Linear(cfg.d_model, cfg.vocab_size_tgt)

    def encode(self, x: torch.Tensor, mask: torch.Tensor | None) -> torch.Tensor:
        x = self.input_scaled_emb(x)  # (batch_size, seq_len, d_model)
        x = self.input_pos_enc(x)  # (batch_size, seq_len, d_model)
        x = self.input_emb_dropout(x)
        x = self.encoder(x, mask)  # (batch_size, seq_len, d_model)
        return x

    def decode(
        self,
        x: torch.Tensor,
        encoder_output: torch.Tensor,
        src_mask: torch.Tensor | None,
        tgt_mask: torch.Tensor | None,
    ) -> torch.Tensor:
        x = self.output_scaled_emb(x)  # (batch_size, seq_len, d_model)
        x = self.output_pos_enc(x)  # (batch_size, seq_len, d_model)
        x = self.output_emb_dropout(x)
        x = self.decoder(
            x, encoder_output, src_mask, tgt_mask
        )  # (batch_size, seq_len, d_model)
        return x

    def project(self, x: torch.Tensor) -> torch.Tensor:
        x = self.projection_linear(x)  # (batch_size, seq_len, vocab_size)
        x = torch.log_softmax(x, dim=-1)  # log-probabilities, (batch_size, seq_len, vocab_size)
        return x
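To illustrate the generation loop, here is a plain-Python sketch of greedy decoding, where the most likely token is picked at every step and fed back in. Greedy argmax is just one simple choice (the paper itself uses beam search). Everything here is hypothetical: `toy_next_token_logits` is a stand-in for the encode, decode and projection steps of a trained model, and the token ids are made up.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Log of softmax, computed stably by subtracting the max logit first."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_total for v in logits]


def toy_next_token_logits(generated: list[int], vocab_size: int) -> list[float]:
    """Hypothetical stand-in for a trained model's output at the last position.

    It simply favours the token id one higher than the last generated token.
    """
    target = (generated[-1] + 1) % vocab_size
    return [5.0 if t == target else 0.0 for t in range(vocab_size)]


def greedy_decode(start_id: int, end_id: int, vocab_size: int, max_steps: int) -> list[int]:
    generated = [start_id]
    for _ in range(max_steps):
        log_probs = log_softmax(toy_next_token_logits(generated, vocab_size))
        next_id = max(range(vocab_size), key=lambda t: log_probs[t])  # argmax
        generated.append(next_id)  # feed the prediction back in as input
        if next_id == end_id:
            break
    return generated


print(greedy_decode(start_id=0, end_id=3, vocab_size=5, max_steps=10))  # [0, 1, 2, 3]
```

With a real model, the stand-in function would be replaced by running the decoder on the tokens generated so far and taking the log-probabilities of the last position.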
That concludes this blog post. Might be updated soon with results after training on a real dataset. Feel free to comment, and correct any mistakes I have made.
