NanoGPT Slowrun: Language Modeling with Limited Data, Infinite Compute
qlabs.sh - 125 points - 25 comments - 140548 seconds ago
Comments (12)
- bee_rider - 118998 seconds ago
> Directions we think are wide open
> Second-order optimizers and natural gradient methods
Do second-order optimizers help improve data efficiency? I assumed they’d help you get to the same minimum faster (but this is way outside my wheelhouse).
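A toy sketch of the “same minimum, faster” intuition (my own example, not from the post): on a badly conditioned quadratic, a Newton step that preconditions by the inverse Hessian lands on the minimizer in one step, while gradient descent crawls along the flat direction. Whether that speed ever translates into data efficiency is exactly the open question above.

    import numpy as np

    A = np.array([[10.0, 0.0], [0.0, 0.1]])  # badly conditioned Hessian
    b = np.array([1.0, 1.0])
    w_star = np.linalg.solve(A, b)           # minimizer of 0.5*w'Aw - b'w

    def grad(w):
        return A @ w - b

    # First-order: gradient descent, step size capped by the largest eigenvalue.
    w = np.zeros(2)
    for _ in range(100):
        w -= 0.1 * grad(w)
    print("GD error after 100 steps:", np.linalg.norm(w - w_star))

    # Second-order: one Newton step, preconditioned by the inverse Hessian.
    w = np.zeros(2) - np.linalg.solve(A, grad(np.zeros(2)))
    print("Newton error after 1 step:", np.linalg.norm(w - w_star))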
- linolevan - 124522 seconds ago
There was a very interesting paper out of Stanford this past September about pretraining under the unlimited-compute, limited-data paradigm [0]. Pretty much exactly the same thing, but with ~200M training tokens instead.
- kseniamorph - 129030 seconds ago
Curious about the baseline choice. modded-nanogpt was optimized for wall-clock speed, not data efficiency, so it seems like an unusual reference point for this kind of benchmark. Why not vanilla NanoGPT?
- archermarks - 135288 seconds ago
Very cool idea. Interested to see how this progresses. One question: how worried are you about over-training on this particular dataset, i.e. leaning toward memorization instead of generalization? Obviously you hold out a validation set, but since you're meta-optimizing the model itself by its performance on that validation set, you're still at risk of over-fitting.
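To make that risk concrete, here is a toy selection-bias sketch (my numbers, not the post's setup): pick the best of many equally good candidates by validation score, and the selected score looks better than the same model's score on untouched data, purely by chance.

    import random

    rng = random.Random(0)
    n_candidates = 1000  # e.g. configurations tried by the meta-optimizer

    # Every candidate is equally good; observed losses are just noise around 0.
    val_losses  = [rng.gauss(0, 1) for _ in range(n_candidates)]
    test_losses = [rng.gauss(0, 1) for _ in range(n_candidates)]

    best = min(range(n_candidates), key=lambda i: val_losses[i])
    print("selected val loss:", val_losses[best])   # extreme, roughly -3
    print("same model, test: ", test_losses[best])  # ~0 on average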
- lzaborowski - 133553 seconds ago
I like the idea of flipping the constraint. Most ML benchmarks assume unlimited data and limited compute, so people optimize for speed.
If high-quality training data becomes the real bottleneck, then the interesting question is how much signal you can extract from the same dataset when compute is cheap.
- navvyeanand - 133161 seconds ago
Amazing job!
- refulgentis - 120452 seconds ago
This looks awesome!!! I’m curious about the ensemble: does it mean “train 8 different models and pick the best one”? That’s what my mind jumps to, but it also seems wrong, because then you could just keep increasing the number of models you train to get a win.
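For what it's worth, "ensemble" in this setting usually means averaging the members' per-token probabilities rather than picking one winner; a minimal sketch with stand-in logits (an assumption about what the post does, not a description of it):

    import numpy as np

    rng = np.random.default_rng(0)
    n_models, vocab = 8, 50257  # hypothetical: 8 members, GPT-2 vocab size

    # Stand-in next-token logits, one row per independently trained model.
    logits = rng.normal(size=(n_models, vocab))

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    # Ensemble prediction: mean of member probabilities, not a best-of-N pick.
    probs = softmax(logits).mean(axis=0)
    print(probs.shape, probs.sum())  # (50257,) 1.0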
- suddenlybananas - 137713 seconds ago
Reminds me a fair bit of the BabyLM challenge. It would be good to give them a shout-out and explain how this challenge differs.