Case Study 06

Mini-GPT

Decoder-only transformer language model built from scratch

2025
PythonPyTorchTransformersNumPyMatplotlib
Key impact
Implemented a decoder-only GPT from scratch in PyTorch: token + learned positional embeddings, a stack of transformer blocks with causal multi-head self-attention, residual connections, layer norm, and GELU feed-forward layers — configured at 6 layers, 8 attention heads, and a 256-token context window.
DECODER-ONLY TRANSFORMERattnffnattnffnattnffn×6Thecausal attn6 layers · 8 heads · 256 ctx
Representative mockup

What I did

4
  1. 01

    Implemented a decoder-only GPT from scratch in PyTorch: token + learned positional embeddings, a stack of transformer blocks with causal multi-head self-attention, residual connections, layer norm, and GELU feed-forward layers — configured at 6 layers, 8 attention heads, and a 256-token context window.

  2. 02

    Trained autoregressively with AdamW (lr 5e-4) and dropout 0.2, generating coherent text through temperature and top-k sampling.

  3. 03

    Investigated tokenization tradeoffs by comparing a custom word-level vocabulary against GPT-2 byte-pair encoding, analyzing the effect of vocabulary size on sequence length and model capacity.

  4. 04

    Shipped a baseline and an improved model with reusable training/generation scripts, runtime sanity checks on tensor shapes, and an annotated notebook walking through the full architecture.

Tech stack

PythonPyTorchTransformersNumPyMatplotlib