Making a vintage LLM from scratch
Summary
This blog post describes the author’s hands-on process of building a vintage 340M-parameter LLM trained exclusively on pre-1900 English texts. It covers data curation from sources such as Project Gutenberg and the Library of Congress (LOC), a custom tokenizer, base-training and fine-tuning workflows modeled on PyTorch and NanoGPT-style tooling, and the trade-offs between cloud and local compute. It also discusses early model behavior, evaluation methods, and the open-source artifacts published on Hugging Face and GitHub, and notes limitations such as hallucinations and limited math memorization.
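To make the "pre-1900 only" constraint concrete, here is a minimal sketch of a date-based corpus filter. It is not the author's pipeline: the `TextRecord` fields, the `keep_record` helper, and the choice to drop undated texts are all assumptions made for illustration, and real Project Gutenberg or LOC metadata would need its own parsing.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional

CUTOFF_YEAR = 1900  # keep only texts dated strictly before this year


@dataclass
class TextRecord:
    """Hypothetical metadata record; the field names are illustrative."""
    title: str
    language: str         # language code, e.g. "en"
    year: Optional[int]   # publication year, or None if unknown
    text: str


def keep_record(rec: TextRecord, cutoff: int = CUTOFF_YEAR) -> bool:
    """Return True for English texts with a known year before the cutoff."""
    if rec.language != "en":
        return False
    if rec.year is None:
        # Assumption: undated texts are dropped so nothing later leaks in.
        return False
    return rec.year < cutoff


def filter_corpus(records: Iterable[TextRecord]) -> Iterator[TextRecord]:
    """Lazily yield only the records that pass keep_record."""
    return (r for r in records if keep_record(r))


sample = [
    TextRecord("Middlemarch", "en", 1871, "..."),
    TextRecord("The Waste Land", "en", 1922, "..."),
    TextRecord("Les Misérables", "fr", 1862, "..."),
    TextRecord("Undated pamphlet", "en", None, "..."),
]

kept_titles = [r.title for r in filter_corpus(sample)]
print(kept_titles)
```

Dropping undated records is the conservative choice here: it costs some usable text but avoids contaminating a period corpus with modern material.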