Training data as a readable ambient stream
Data Lava Lamp
A neon drift of tokens, domains, source hints, and human-feedback
signals. Leave it running for a few minutes and the weird texture of
model training data starts to come into focus.
Pre-training: web-scale text, floating excerpts
What This Shows
Pre-training data is the broad text soup a model studies before it
knows how to act like an assistant. Post-training data is more
deliberate: prompts, replies, ratings, and preference signals that
shape how the assistant responds.
Technical Details
This static version uses small local JSON samples so the page
loads quickly and keeps working without an API server. A later
version can keep this as the fallback while a Pages Function or
Worker streams random rows from hosted datasets.
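The fallback arrangement described above could look roughly like the sketch below. It is only an illustration under assumptions: the `SampleRow` shape, the `RowSource` type, and the `loadRows` and `pickRandom` helpers are hypothetical names, not taken from the page's actual code. Sources are passed in as functions so the same logic works whether rows come from a bundled JSON file or from a hosted endpoint.

```typescript
// Hypothetical row shape; the real sample files may use different fields.
interface SampleRow {
  text: string;
  domain?: string;
  year?: number;
}

// Anything that can produce rows: a bundled JSON file, a remote endpoint, etc.
type RowSource = () => Promise<SampleRow[]>;

// Try the remote source first (if one is configured). Fall back to the
// local samples when the remote call fails or returns no rows.
async function loadRows(
  remote: RowSource | null,
  local: RowSource
): Promise<SampleRow[]> {
  if (remote) {
    try {
      const rows = await remote();
      if (rows.length > 0) return rows;
    } catch {
      // Network or server error: fall through to the local samples.
    }
  }
  return local();
}

// Pick one row uniformly at random. The rng parameter is injectable
// so the choice can be made deterministic in tests.
function pickRandom<T>(rows: T[], rng: () => number = Math.random): T {
  if (rows.length === 0) throw new Error("no rows to pick from");
  return rows[Math.floor(rng() * rows.length)];
}
```

In the static version, `remote` would simply be `null`, so the page never depends on a server. A later version could pass a function that calls the streaming endpoint without changing the rest of the display code.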
Visible labels mix original dataset fields, such as domain, year,
role, and review counts, with a few light editorial tags that make
the fragments easier to scan.
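One way to keep original dataset fields and editorial tags distinguishable is to record where each label came from. The sketch below is a guess at how that might be structured. The `Fragment` and `Label` types, the `buildLabels` function, and the "short" tag rule are all hypothetical and exist only for illustration.

```typescript
// Where a label came from: a field in the source dataset, or a tag
// added by the page for readability.
interface Label {
  text: string;
  origin: "dataset" | "editorial";
}

// Hypothetical fragment shape covering the fields mentioned above.
interface Fragment {
  text: string;
  domain?: string;
  year?: number;
  role?: string;
  reviewCount?: number;
}

function buildLabels(fragment: Fragment): Label[] {
  const labels: Label[] = [];

  // Original dataset fields are passed through when present.
  if (fragment.domain) {
    labels.push({ text: fragment.domain, origin: "dataset" });
  }
  if (fragment.year !== undefined) {
    labels.push({ text: String(fragment.year), origin: "dataset" });
  }
  if (fragment.role) {
    labels.push({ text: fragment.role, origin: "dataset" });
  }
  if (fragment.reviewCount !== undefined) {
    labels.push({ text: `${fragment.reviewCount} reviews`, origin: "dataset" });
  }

  // Editorial tag (illustrative rule only): flag very short excerpts.
  if (fragment.text.length < 80) {
    labels.push({ text: "short", origin: "editorial" });
  }

  return labels;
}
```

Keeping the `origin` on each label would let the interface style the two kinds differently, so a viewer can tell what the dataset said from what the page added.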