Learning Transformers Code First Part 2 — GPT Up Close and Personal

Digging into Generative Pre-Trained Transformers via nanoGPT

Photo by Luca Onniboni on Unsplash

Welcome to the second part of my project, where I delve into the intricacies of transformer and GPT-based models using the TinyStories dataset and nanoGPT, all trained on an aging gaming laptop. In the first part, I prepared the dataset for input into a character-level generative model. You can find a link to part one below.

Learning Transformers Code First Part 1

In this article, I aim to dissect the GPT model, its components, and its implementation in nanoGPT. I selected nanoGPT due to its straightforward Python implementation of a GPT model, which is approximately 300 lines long, and its similarly digestible training script. With the necessary background knowledge, one could quickly comprehend GPT models from simply reading the source code. To be frank, I lacked this understanding when I first examined the code. Some of the material still eludes me. However, I hope that with all I’ve learned, this explanation will provide a starting point for those wishing to gain an intuitive understanding of how GPT-style models function internally.

In preparation for this article, I read various papers. Initially, I assumed that simply reading the seminal work “Attention is All You Need” would suffice to bring my understanding up to speed. This was a naive assumption. While it’s true that this paper introduced the transformer model, it was subsequent papers that adapted it for more advanced tasks such as text generation. “AIAYN” was merely an introduction to a broader topic. Undeterred, I recalled an article on HackerNews that provided a reading list to fully understand LLMs. After a quick search, I found the article here. I didn’t read everything in sequence, but I intend to revisit this reading list to continue my learning journey after completing this series.

With that said, let’s dive in. To comprehend GPT models in detail, we must start with the transformer. The transformer employs a self-attention mechanism known as scaled dot-product attention. The following explanation is derived from this insightful article on scaled dot-product attention, which I recommend for a more in-depth understanding. Essentially, for every element of an input sequence (the i-th element), we want to replace that element with a weighted average of all the elements in the sequence. The weights are calculated by taking the dot product of the vector at the i-th element with every vector in the sequence, scaling the results by the square root of the vector dimension, and then applying a softmax so the weights are values between 0 and 1 that sum to 1. In the original “Attention is All You Need” paper, these inputs are named the query (the vector at the i-th element), the keys (the vectors it is compared against), and the values (the vectors that get averaged). In self-attention, all three are derived from the same input sequence through separate linear projections. The weights of those projections are initialized to random values and learned during training.
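To make the weighted-average idea concrete, here is a minimal sketch in plain Python (no PyTorch, so it runs anywhere). The helper names softmax, dot and attention are my own and do not come from nanoGPT, and the learned projections are left out, so the toy input is used directly as query, key and value.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors (illustrative only)."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # similarity of this query with every key, scaled by sqrt(d_k)
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)  # non-negative and sums to 1
        # weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# toy sequence of three 2-dimensional vectors used as q, k and v at once
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(x, x, x)
```

Each row of out is a blend of the three input vectors, with more weight on the inputs most similar to that row's query.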


nanoGPT implements scaled dot-product attention and extends it to multi-head attention, meaning multiple attention operations occurring at once. It also implements it as a torch.nn.Module, which allows it to be composed with other network layers.

import math

import torch
import torch.nn as nn
from torch.nn import functional as F

class CausalSelfAttention(nn.Module):

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads, but in a batch
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
        # output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        # regularization
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.dropout = config.dropout
        # flash attention make GPU go brrrrr but support is only in PyTorch >= 2.0
        self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention')
        if not self.flash:
            print("WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0")
            # causal mask to ensure that attention is only applied to the left in the input sequence
            self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                        .view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size() # batch size, sequence length, embedding dimensionality (n_embd)

        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)

        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        if self.flash:
            # efficient attention using Flash Attention CUDA kernels
            y = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=self.dropout if self.training else 0, is_causal=True)
        else:
            # manual implementation of attention
            att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
            att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
            att = F.softmax(att, dim=-1)
            att = self.attn_dropout(att)
            y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side

        # output projection
        y = self.resid_dropout(self.c_proj(y))
        return y

Let’s dissect this code further, starting with the constructor. First, we verify that the number of attention heads (n_head) divides the dimensionality of the embedding (n_embd) evenly. This is crucial because the embedding is divided into equal sections, one per head, and we want to cover all of the embedding space without any gaps. Next, we initialize two Linear layers, c_attn and c_proj. c_attn is a learned projection that produces the query, key, and value vectors for every head in a single matrix multiplication, while c_proj is a learned projection applied to the combined output of all heads. The output dimension of c_attn is three times the embedding dimension because it has to produce the three major components of attention: query, key, and value.

We also have two dropout layers, attn_dropout and resid_dropout. During training, a dropout layer randomly zeroes elements of its input with a given probability and rescales the remaining elements to compensate. According to the PyTorch docs, this serves the purpose of reducing overfitting for the model. The value in config.dropout is the probability that any given element will be zeroed.
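The behaviour of a dropout layer can be sketched in a few lines of plain Python. This is an illustration under my own assumptions (a hypothetical dropout helper working on a flat list), not how nn.Dropout is implemented internally, but it follows the same rule: zero with probability p during training, scale the survivors by 1 / (1 - p), and do nothing at evaluation time.

```python
import random

def dropout(xs, p, training=True, rng=random):
    """Inverted dropout on a list: zero each element with probability p
    and scale the kept elements by 1 / (1 - p). Identity when not training."""
    if not training or p == 0.0:
        return list(xs)
    keep = 1.0 - p
    return [x / keep if rng.random() >= p else 0.0 for x in xs]

# with p = 0.5 every element is either dropped (0.0) or doubled (2.0)
sample = dropout([1.0] * 8, p=0.5, rng=random.Random(0))
```

The rescaling keeps the expected value of each element unchanged, which is why the layer can simply be switched off at inference time.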

We finalize the constructor by checking whether torch.nn.functional provides scaled_dot_product_attention, an optimized implementation of scaled dot-product attention added in PyTorch 2.0. If it is available, the class uses it; otherwise we set up a bias mask. Despite its name, this buffer is not a bias term but the causal mask for the attention mechanism. The torch.tril method yields a matrix with its upper triangular section converted to zeros. When combined with the torch.ones method, it generates a lower-triangular mask of 1s and 0s that lets each position attend only to itself and to earlier positions, so the model cannot look ahead at the tokens it is supposed to predict.
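For a feel of what the mask looks like, here is a small pure-Python sketch that builds the same pattern of 1s and 0s as torch.tril(torch.ones(n, n)). The function name causal_mask is mine, chosen for illustration.

```python
def causal_mask(n):
    """Lower-triangular n x n matrix: row i has 1s in columns 0..i, 0s after."""
    return [[1 if col <= row else 0 for col in range(n)] for row in range(n)]

mask = causal_mask(4)
# mask[0] == [1, 0, 0, 0]  -> the first position sees only itself
# mask[3] == [1, 1, 1, 1]  -> the last position sees everything before it
```

Row i describes which positions the i-th token is allowed to attend to.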

Next, we delve into the forward method of the class, where the attention algorithm is applied. Initially, we read the three dimensions of the input tensor: B, the batch size; T, the sequence length (“time”); and C, the number of channels (the embedding size). nanoGPT employs a batched learning process, which we will explore in greater detail when examining the transformer model that utilizes this attention layer. For now, it’s sufficient to understand that we are dealing with the data in batches. We then feed the input x into the linear layer c_attn, which expands the last dimension from n_embd to three times n_embd. The output of that transformation is split into our q, k, and v variables, which are the inputs to the attention algorithm. Subsequently, the view and transpose methods reorganize each of these tensors from shape (B, T, C) into (B, n_head, T, C // n_head), so that every head attends over its own slice of the embedding; this is the layout expected by the PyTorch scaled_dot_product_attention function.
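The reshaping step is easier to follow on a tiny example. The sketch below mimics, for a single batch element and with nested lists instead of tensors, what view(...).transpose(1, 2) does to split the channels across heads and what transpose(1, 2).contiguous().view(...) does to put them back. The names split_heads and merge_heads are hypothetical.

```python
def split_heads(x, n_head):
    """(T, C) -> (n_head, T, C // n_head): each head gets a contiguous slice of channels."""
    channels = len(x[0])
    assert channels % n_head == 0, "n_head must divide the embedding size"
    hs = channels // n_head  # head size
    return [[row[h * hs:(h + 1) * hs] for row in x] for h in range(n_head)]

def merge_heads(heads):
    """(n_head, T, hs) -> (T, n_head * hs): head outputs side by side again."""
    seq_len = len(heads[0])
    return [[v for head in heads for v in head[t]] for t in range(seq_len)]

x = [[1, 2, 3, 4],   # token 0, 4 channels
     [5, 6, 7, 8]]   # token 1, 4 channels
heads = split_heads(x, n_head=2)
restored = merge_heads(heads)
```

Here heads[0] holds channels 0 and 1 for both tokens, heads[1] holds channels 2 and 3, and merging them restores the original layout.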

When the optimized function isn’t available, the code falls back to a manual implementation of scaled dot-product attention. It begins by taking the matrix product of q and k, with the last two dimensions of k transposed so the shapes line up, and the result is divided by the square root of the head size (the last dimension of k). We then mask the scaled output using the previously created bias buffer, replacing the entries where the mask is 0 with negative infinity. Next, a softmax function is applied along the last dimension of the att matrix, which turns the negative infinities into weights of exactly 0 and scales all other values to between 0 and 1 so that each row sums to 1. We then apply a dropout layer to avoid overfitting before multiplying the att matrix by v.
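The interplay between the negative infinities and the softmax is worth seeing on real numbers. This is a small sketch in plain Python, assuming a square matrix of already-scaled scores. The helper masked_softmax_rows is my own name, not something from nanoGPT.

```python
import math

def masked_softmax_rows(scores):
    """Causally mask a square score matrix, then softmax each row."""
    out = []
    for i, row in enumerate(scores):
        # positions to the right of the diagonal are "in the future"
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]  # math.exp(-inf) is exactly 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

weights = masked_softmax_rows([
    [0.0, 5.0, 5.0],
    [0.0, 0.0, 5.0],
    [0.0, 0.0, 0.0],
])
```

Even though the future positions have the largest raw scores in the first two rows, they end up with a weight of 0, and the remaining weights in each row still sum to 1.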

Regardless of the scaled dot-product implementation used, the outputs of the heads are reassembled side by side, passed through the output projection c_proj and a final dropout layer, and returned. This is the complete implementation of the attention layer in fewer than 50 lines of Python/PyTorch. If you don’t fully comprehend the above code, I recommend spending some time reviewing it before proceeding with the rest of the article.

Before we delve into the GPT module, which integrates everything, we require two more building blocks. The first is a simple multi-layer perceptron (MLP), referred to in the “Attention is All You Need” paper as a feed-forward network. The second is the transformer block, which combines the attention layer with an MLP to complete the basic transformer architecture represented in the paper. Both are implemented in the following snippet from nanoGPT.

class MLP(nn.Module):
    """Multi Layer Perceptron"""

    def __init__(self, config):
        super().__init__()
        # expand to 4x the embedding size, apply GELU, then project back down
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        self.gelu = nn.GELU()
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        x = self.dropout(x)
        return x

class Block(nn.Module):

    def __init__(self, config):
        super().__init__()
        # LayerNorm here is nanoGPT's own module (defined in model.py) with an optional bias
        self.ln_1 = LayerNorm(config.n_embd, bias=config.bias)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = LayerNorm(config.n_embd, bias=config.bias)
        self.mlp = MLP(config)

    def forward(self, x):
        # residual connections: each sub-layer's output is added back to its input
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x

The MLP layer, despite its apparent simplicity in terms of code lines, adds an extra layer of complexity to the model. Essentially, each Linear layer connects every element of its input to every element of its output, using a learned linear transformation to transfer the values between them. In the aforementioned code, the first layer takes the embedding size, n_embd, as its number of input features and quadruples it in the output. The factor of four is a convention rather than a requirement: it matches the ratio in the original paper, where the model dimension is 512 and the feed-forward dimension is 2048. The purpose of the MLP module is to enhance the network’s computation by adding more nodes. As long as the second layer projects back down to n_embd, so that the module’s input and output dimensions match, the expansion factor is merely another hyper-parameter. Another crucial element to consider is the activation function. This MLP implementation consists of two linear layers connected with the GELU activation function. The original paper uses a ReLU function, but nanoGPT employs GELU to ensure compatibility with GPT-2 model checkpoints.
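To see how GELU differs from ReLU, the two activations can be written out for a single number. The sketch below uses the exact erf-based form of GELU, which as far as I can tell is what nn.GELU() computes with its default arguments. The function names are mine.

```python
import math

def relu(x):
    """ReLU: pass positive values through, clamp negatives to zero."""
    return max(0.0, x)

def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# compare the two activations on a handful of inputs
comparison = [(x, relu(x), gelu(x)) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
```

Unlike ReLU, GELU is smooth around zero and lets small negative inputs through as small negative outputs instead of cutting them off completely.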

Next, we examine the Block module. This module finalizes our transformer block as outlined in the “Attention” paper. Essentially, it channels the input through a normalization layer before passing it to the attention layer, then adds the result back to the input. The output of this addition is normalized once more before being passed to the MLP, and the MLP’s result is added back to it. Note that nanoGPT, following GPT-2, normalizes before each sub-layer, whereas the original paper normalizes after each residual addition. This process implements the decoder side of the transformer as described in the “Attention” paper, minus the cross-attention sub-layer. For text generation, it’s common to use only a decoder, because the output only needs to be conditioned on the tokens that came before it. The transformer was initially designed for machine translation, which needs to account for both the input token encoding and the output token encoding. However, with text generation, only a single token sequence is used, eliminating the need for an encoder and the cross-attention that connects it to the decoder. Andrej Karpathy, the author of nanoGPT, provides a comprehensive explanation of this in his video linked in the first article in this series.
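The normalize, transform, add-back pattern in Block.forward can be sketched for a single vector in plain Python. This is a simplification under my own assumptions: layer_norm here has no learnable scale or bias (nanoGPT's LayerNorm does), and pre_norm_residual is a hypothetical helper name.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (roughly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_norm_residual(x, sublayer):
    """x + sublayer(layer_norm(x)): the pattern Block.forward applies twice."""
    y = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, y)]

# stand-in sub-layer that doubles its input; attention or the MLP would go here
x = [1.0, 2.0, 3.0]
out = pre_norm_residual(x, lambda v: [2.0 * e for e in v])
```

Because the sub-layer's output is added to the untouched input, a sub-layer that outputs zeros leaves x unchanged, which is part of what makes deep stacks of these blocks trainable.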

Finally, we reach the main component: the GPT model. The majority of the approximately 300-line file is dedicated to the GPT module. It manages beneficial features such as model fine-tuning and utilities designed for model training (the topic of the next article in this series). Therefore, I present a simplified version of what is available in the nanoGPT repository below.

class GPT(nn.Module):

    def __init__(self, config):
        super().__init__()
        assert config.vocab_size is not None
        assert config.block_size is not None
        self.config = config

        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            drop = nn.Dropout(config.dropout),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = LayerNorm(config.n_embd, bias=config.bias),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        # with weight tying when using torch.compile() some warnings get generated:
        # "UserWarning: functional_call was passed multiple values for tied weights.
        # This behavior is deprecated and will be an error in future versions"
        # not 100% sure what this is, so far seems to be harmless. TODO investigate
        self.transformer.wte.weight = self.lm_head.weight # https://paperswithcode.com/method/weight-tying

        # init all weights
        self.apply(self._init_weights)
        # apply special scaled init to the residual projections, per GPT-2 paper
        for pn, p in self.named_parameters():
            if pn.endswith('c_proj.weight'):
                torch.nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * config.n_layer))

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        device = idx.device
        b, t = idx.size()
        assert t <= self.config.block_size, f"Cannot forward sequence of length {t}, block size is only {self.config.block_size}"
        pos = torch.arange(0, t, dtype=torch.long, device=device) # shape (t)

        # forward the GPT model itself
        tok_emb = self.transformer.wte(idx) # token embeddings of shape (b, t, n_embd)
        pos_emb = self.transformer.wpe(pos) # position embeddings of shape (t, n_embd)
        x = self.transformer.drop(tok_emb + pos_emb)
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)

        if targets is not None:
            # if we are given some desired targets also calculate the loss
            logits = self.lm_head(x)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
        else:
            # inference-time mini-optimization: only forward the lm_head on the very last position
            logits = self.lm_head(x[:, [-1], :]) # note: using list [-1] to preserve the time dim
            loss = None

        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """
        Take a conditioning sequence of indices idx (LongTensor of shape (b,t)) and complete
        the sequence max_new_tokens times, feeding the predictions back into the model each time.
        Most likely you'll want to make sure to be in model.eval() mode of operation for this.
        """
        for _ in range(max_new_tokens):
            # if the sequence context is growing too long we must crop it at block_size
            idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:]
            # forward the model to get the logits for the index in the sequence
            logits, _ = self(idx_cond)
            # pluck the logits at the final step and scale by desired temperature
            logits = logits[:, -1, :] / temperature
            # optionally crop the logits to only the top k options
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float('Inf')
            # apply softmax to convert logits to (normalized) probabilities
            probs = F.softmax(logits, dim=-1)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)
            # append sampled index to the running sequence and continue
            idx = torch.cat((idx, idx_next), dim=1)

        return idx

Let’s begin with the constructor of the class. The different layers are assembled into a PyTorch ModuleDict, which provides some structure. We start with two Embedding layers: one for token embedding and one for positional embedding. The nn.Embedding module is a lookup table that maps each integer index (a token ID or a position) to a learned vector of size n_embd. Following this, we have a dropout layer, succeeded by n_layer Block modules that form our attention layers, and then a final layer normalization, ln_f. The lm_head Linear layer takes the output from the attention Blocks and projects it to the vocab size, acting as our main output for the GPT, apart from the loss value.

Once the layers are defined, additional setup is required before we can begin training the module. Here, Andrej ties the weights of the token embedding (wte) to those of the output layer (lm_head), so both layers share a single matrix. According to the paper linked in the code comments, this is done to reduce the model’s final parameters while also improving its performance. The constructor also initializes the model’s weights. As these weights will be learned during training, they are initialized to random numbers drawn from a Gaussian distribution with a standard deviation of 0.02, and the module biases are set to 0. Finally, a modification from the GPT-2 paper is applied: the weights of the residual projection layers (c_proj) are initialized with a smaller standard deviation, 0.02 divided by the square root of twice the number of layers.
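Two bits of arithmetic from the setup above are easy to check by hand. The sketch below computes the reduced standard deviation used for the c_proj weights and the number of parameters saved by sharing one matrix between wte and lm_head. The helper names are mine, and the example sizes (12 layers, a 50257-token vocabulary, 768-dimensional embeddings) are the GPT-2 small values, used here purely as an illustration.

```python
import math

def residual_proj_std(n_layer, base_std=0.02):
    """Standard deviation for c_proj weights: base_std / sqrt(2 * n_layer)."""
    return base_std / math.sqrt(2 * n_layer)

def params_saved_by_tying(vocab_size, n_embd):
    """Tying wte and lm_head means one (vocab_size x n_embd) matrix instead of two."""
    return vocab_size * n_embd

std_12_layers = residual_proj_std(12)        # roughly 0.00408
saved = params_saved_by_tying(50257, 768)    # 38,597,376 parameters
```

The factor of two in the square root reflects that every block adds to the residual stream twice, once after attention and once after the MLP.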

When feeding forward through the network, the batch size and the sequence length (here t) are pulled from the input size. We then create a tensor of position indices, from 0 to t - 1, on the same device as the input; these indices are used to look up the positional embedding. Next, we pass the input tokens through the token embedding layer wte. Following this, the positional embedding is looked up in the wpe layer. These embeddings are added together before being passed through a dropout layer. The result is then passed through each of the n_layer blocks and normalized. The final result is passed to the Linear layer lm_head, which projects each embedding vector to one score (a logit) per token in the vocab. These logits only become probabilities once a softmax is applied.
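The first step of the forward pass, looking up and adding the two embeddings, can be illustrated with plain lists standing in for the wte and wpe tables. The numbers below are made up, and embed is a hypothetical helper; in the real model both tables are learned.

```python
def embed(idx, wte, wpe):
    """For each position, add the token's vector and the position's vector."""
    return [[a + b for a, b in zip(wte[tok], wpe[pos])]
            for pos, tok in enumerate(idx)]

# toy tables: vocabulary of 2 tokens, block size of 3, embedding size of 2
wte = [[1.0, 0.0],       # vector for token 0
       [0.0, 1.0]]       # vector for token 1
wpe = [[0.5, 0.5],       # vector for position 0
       [0.25, 0.25],     # vector for position 1
       [0.125, 0.125]]   # vector for position 2

x = embed([1, 0, 1], wte, wpe)
```

The same token (ID 1) appears at positions 0 and 2 but ends up with two different vectors, which is how the model can tell the positions apart.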

When targets are provided (e.g., during training), we compute the loss by comparing the predicted logits at every position with the actual next tokens using cross-entropy. If not, the loss is None, and as a small optimization the logits are computed only for the last position in the sequence. Both the logits and the loss are returned from the forward method.
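For a single position, cross-entropy boils down to the negative log of the probability the model assigns to the correct token. Here is a minimal plain-Python sketch of that calculation for one vector of logits. F.cross_entropy additionally averages over all positions in the batch, which this cross_entropy helper (my own) leaves out.

```python
import math

def cross_entropy(logits, target):
    """-log(softmax(logits)[target]), computed stably via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# a model that has learned nothing spreads its guess evenly over 4 tokens
uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)
# a confident, correct prediction gives a loss close to zero
confident_loss = cross_entropy([10.0, 0.0, 0.0, 0.0], target=0)
```

The uniform case gives log(4), which is a handy sanity check: an untrained model's loss should start near the log of the vocabulary size.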

Unlike the previous modules, the GPT module has additional methods. The most relevant to us is the generate function, which will be familiar to anyone who has used a generative model before. Given a set of input tokens idx, a number of max_new_tokens and a temperature, it generates max_new_tokens many tokens. Let’s delve into how it accomplishes this. First it trims the input tokens to fit within the block_size (others call this context length), if necessary, keeping the most recent tokens at the end of the input. Next, the tokens are fed to the network and the logits for the final position are divided by the temperature. The higher the temperature, the flatter the resulting distribution, which makes the output more creative, less predictable, and more likely to hallucinate. If top_k is set, every logit outside the k largest is set to negative infinity so that those tokens cannot be selected. Next, a softmax is applied to convert the logits into probabilities between 0 and 1. A sampling function is used to select the next token from the probabilities, and that token is appended to the input sequence that gets fed back into the GPT model for the next character.
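One step of that sampling loop can be sketched without PyTorch. The function below is an illustration under my own assumptions: sample_next is a made-up name, it works on a single list of logits, and it uses random.choices where nanoGPT uses torch.multinomial.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Pick one token index: temperature scaling, optional top-k filter, softmax, sample."""
    scaled = [z / temperature for z in logits]
    if top_k is not None:
        k = min(top_k, len(scaled))
        cutoff = sorted(scaled, reverse=True)[k - 1]  # k-th largest logit
        scaled = [z if z >= cutoff else float("-inf") for z in scaled]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]  # filtered tokens get weight 0.0
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [1.0, 3.0, 2.0]
greedy_like = sample_next(logits, top_k=1, rng=random.Random(0))
varied = [sample_next(logits, temperature=2.0, rng=random.Random(s)) for s in range(10)]
```

With top_k=1 only the largest logit survives, so the choice is deterministic. A very low temperature has almost the same effect, while a high temperature pushes the probabilities toward uniform.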

Thank you for your patience in reading this comprehensive article. While examining annotated source code is a valuable method for understanding the function of a code segment, there’s no replacement for personally manipulating various parts and parameters of the code. In line with this, I’m providing a link to the complete model.py source code from the nanoGPT repository.

nanoGPT/model.py at master · karpathy/nanoGPT

In the upcoming article, we’ll explore the train.py script of nanoGPT and train a character-level model on the TinyStories dataset. Follow me on Medium to ensure you don’t miss out!

I utilized a vast array of resources to create this article, many of which have already been linked in this and the previous article. However, I would be neglecting my duty if I didn’t share these resources with you for further exploration of any topic or for alternative explanations of the concepts.

Let’s Build GPT: from scratch, in code, spelled out (YouTube)
LLM Reading List (Blog)
“Attention is All You Need” (Paper)
“Language Models are Unsupervised Multitask Learners” (GPT-2 Paper)
Multi-Layer Perceptrons Explained and Illustrated (Medium)
Weight Tying (Papers With Code)
Illustrated Guide to Transformers Neural Network: A step by step explanation (YouTube)

Edited using GPT-4 and a custom LangChain script.

Learning Transformers Code First Part 2 — GPT Up Close and Personal was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

