NanoGPT Slowrun: Language Modeling with Limited Data, Infinite Compute
qlabs.sh - 125 points - 25 comments - 140548 seconds ago
Comments (12)
- bee_rider - 118998 seconds ago
> Directions we think are wide open
> Second-order optimizers and natural gradient methods
Do second-order optimizers help improve data efficiency? I assumed they’d help you get to the same minimum faster (but this is way outside my wheelhouse).
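A toy sketch of the “same minimum, faster” intuition (my own example, not from the post): on a badly conditioned quadratic, a Newton step that preconditions by the inverse Hessian lands on the minimizer in one step, while gradient descent crawls along the flat direction. Whether that speed ever translates into data efficiency is exactly the open question above.

    import numpy as np

    A = np.array([[10.0, 0.0], [0.0, 0.1]])  # badly conditioned Hessian
    b = np.array([1.0, 1.0])
    w_star = np.linalg.solve(A, b)           # minimizer of 0.5*w'Aw - b'w

    def grad(w):
        return A @ w - b

    # First-order: gradient descent, step size capped by the largest eigenvalue.
    w = np.zeros(2)
    for _ in range(100):
        w -= 0.1 * grad(w)
    print("GD error after 100 steps:", np.linalg.norm(w - w_star))

    # Second-order: one Newton step, preconditioned by the inverse Hessian.
    w = np.zeros(2) - np.linalg.solve(A, grad(np.zeros(2)))
    print("Newton error after 1 step:", np.linalg.norm(w - w_star))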
- linolevan - 124522 seconds ago
There was a very interesting paper out of Stanford this past September about pretraining under the unlimited-compute, limited-data paradigm [0]. Pretty much exactly the same thing, but with ~200M training tokens instead.
- kseniamorph - 129030 seconds ago
Curious about the baseline choice. modded-nanogpt was optimized for wall-clock speed, not data efficiency, so it seems like an unusual reference point for this kind of benchmark. Why not vanilla NanoGPT?
- archermarks - 135288 seconds ago
Very cool idea. Interested to see how this progresses. One question: how worried are you about over-training on this particular dataset, i.e. leaning toward memorization instead of generalization? Obviously you hold out a validation set, but since you're meta-optimizing the model itself by its performance on that validation set, you're still at risk of over-fitting.
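To make that risk concrete, here is a toy selection-bias sketch (my numbers, not the post's setup): pick the best of many equally good candidates by validation score, and the selected score looks better than the same model's score on untouched data, purely by chance.

    import random

    rng = random.Random(0)
    n_candidates = 1000  # e.g. configurations tried by the meta-optimizer

    # Every candidate is equally good; observed losses are just noise around 0.
    val_losses  = [rng.gauss(0, 1) for _ in range(n_candidates)]
    test_losses = [rng.gauss(0, 1) for _ in range(n_candidates)]

    best = min(range(n_candidates), key=lambda i: val_losses[i])
    print("selected val loss:", val_losses[best])   # extreme, roughly -3
    print("same model, test: ", test_losses[best])  # ~0 on average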
- lzaborowski - 133553 seconds ago
I like the idea of flipping the constraint. Most ML benchmarks assume unlimited data and limited compute, so people optimize for speed.
If high-quality training data becomes the real bottleneck, then the interesting question is how much signal you can extract from the same dataset when compute is cheap.
- navvyeanand - 133161 seconds ago
Amazing job!
- refulgentis - 120452 seconds ago
This looks awesome!!! I’m curious about the ensemble: does it mean “train 8 different models and pick the best one”? That’s what my mind jumps to, but it also seems wrong, because then you could just keep increasing the number of models you train to get a win.
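For what it's worth, "ensemble" in this setting usually means averaging the members' per-token probabilities rather than picking one winner; a minimal sketch with stand-in logits (an assumption about what the post does, not a description of it):

    import numpy as np

    rng = np.random.default_rng(0)
    n_models, vocab = 8, 50257  # hypothetical: 8 members, GPT-2 vocab size

    # Stand-in next-token logits, one row per independently trained model.
    logits = rng.normal(size=(n_models, vocab))

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    # Ensemble prediction: mean of member probabilities, not a best-of-N pick.
    probs = softmax(logits).mean(axis=0)
    print(probs.shape, probs.sum())  # (50257,) 1.0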
- suddenlybananas - 137713 seconds ago
Reminds me a fair bit of the BabyLM challenge. It would be good to give them a shout-out and explain how this challenge differs.