Case Study 06

Mini-GPT

Decoder-only transformer language model built from scratch

2026

PythonPyTorchTransformersNumPyMatplotlib

Key impact

Implemented a decoder-only GPT from scratch in PyTorch: token + learned positional embeddings, a stack of transformer blocks with causal multi-head self-attention, residual connections, layer norm, and GELU feed-forward layers — configured at 6 layers, 8 attention heads, and a 256-token context window.

Representative mockup

What I did

01
Implemented a decoder-only GPT from scratch in PyTorch: token + learned positional embeddings, a stack of transformer blocks with causal multi-head self-attention, residual connections, layer norm, and GELU feed-forward layers — configured at 6 layers, 8 attention heads, and a 256-token context window.
02
Trained autoregressively with AdamW (lr 5e-4) and dropout 0.2, generating coherent text through temperature and top-k sampling.
03
Investigated tokenization tradeoffs by comparing a custom word-level vocabulary against GPT-2 byte-pair encoding, analyzing the effect of vocabulary size on sequence length and model capacity.
04
Shipped a baseline and an improved model with reusable training/generation scripts, runtime sanity checks on tensor shapes, and an annotated notebook walking through the full architecture.

Tech stack

PythonPyTorchTransformersNumPyMatplotlib

More projects

← All projects