Mini-GPT
Decoder-only transformer language model built from scratch
Implemented a decoder-only GPT from scratch in PyTorch: token + learned positional embeddings, a stack of transformer blocks with causal multi-head self-attention, residual connections, layer norm, and GELU feed-forward layers — configured at 6 layers, 8 attention heads, and a 256-token context window.
What I did
4- 01
Implemented a decoder-only GPT from scratch in PyTorch: token + learned positional embeddings, a stack of transformer blocks with causal multi-head self-attention, residual connections, layer norm, and GELU feed-forward layers — configured at 6 layers, 8 attention heads, and a 256-token context window.
- 02
Trained autoregressively with AdamW (lr 5e-4) and dropout 0.2, generating coherent text through temperature and top-k sampling.
- 03
Investigated tokenization tradeoffs by comparing a custom word-level vocabulary against GPT-2 byte-pair encoding, analyzing the effect of vocabulary size on sequence length and model capacity.
- 04
Shipped a baseline and an improved model with reusable training/generation scripts, runtime sanity checks on tensor shapes, and an annotated notebook walking through the full architecture.