Microgpt

karpathy.github.io - 1628 points - 284 comments - 77850 seconds ago

Comments (51)

teleforce - 60897 seconds ago
Someone has modified microgpt to build a tiny GPT that generates Korean first names, and created a web page that visualizes the entire process [1].
Users can interactively explore the microgpt pipeline end to end, from tokenization through inference.
[1] English GPT lab:
https://ko-microgpt.vercel.app/
hkbuilds - 6144 seconds ago
The "micro" trend in AI is fascinating. We're seeing diminishing returns from just making models bigger, and increasing returns from making them smaller and more focused.
For practical applications, a well-tuned small model that does one thing reliably is worth more than a giant model that does everything approximately. I've been using Gemini Flash for domain-specific analysis tasks and the speed/cost ratio is incredible compared to the frontier models. The latency difference alone changes what kind of products you can build.
verma7 - 62159 seconds ago
I wrote a C++ translation of it: https://github.com/verma7/microgpt/blob/main/microgpt.cc
2x the number of lines of code (~400L), 10x the speed
The hard part was figuring out how to represent the Value class in C++ (ended up using shared_ptrs).
geokon - 49520 seconds ago
> What’s the deal with “hallucinations”? The model generates tokens by sampling from a probability distribution. It has no concept of truth, it only knows what sequences are statistically plausible given the training data.
Extremely naive question, but could LLM output be tagged with some kind of confidence score? For example, if I'm asking an LLM a question, does it have an internal metric for how confident it is in its output? LLM outputs rarely seem to take the form "I'm not really sure, but maybe XXX", yet I always felt this is baked into the model somehow.
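On the confidence question: the sampling step quoted above works from per-token probabilities, so one rough signal is how peaked that distribution is. A minimal sketch, assuming the raw logits for the next token are available as a list of floats. The logit values and the helper names `softmax` and `entropy` are made up for illustration, and this only measures how sure the model is about the next token, not whether the output is true.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats: 0 means fully peaked, log(n) means uniform.
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical logits over a 4-token vocabulary.
confident = softmax([8.0, 0.5, 0.2, 0.1])  # one token dominates
unsure = softmax([1.0, 1.0, 1.0, 1.0])     # all tokens equally likely

print(max(confident), entropy(confident))
print(max(unsure), entropy(unsure))
```

A low top probability or a high entropy flags positions where the model was choosing among many plausible continuations. It says nothing about factual accuracy, since a wrong continuation can still be very probable.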
subset - 67861 seconds ago
I had good fun transliterating it to Rust as a learning experience (https://github.com/stochastical/microgpt-rs). The trickiest part was working out how to represent the autograd graph data structure with Rust types. I'm finalising some small tweaks to make it run in the browser via WebAssembly and then compile it up for my blog :) Andrej's code is really quite poetic, I love how much it packs into such a concise program
red_hare - 67864 seconds ago
This is beautiful and highly readable but, still, I yearn for a detailed line-by-line explainer like the backbone.js source: https://backbonejs.org/docs/backbone.html
la_fayette - 47655 seconds ago
This guy is so amazing! With his video and the code base I really have the feeling I understand gradient descent, back propagation, chain rule etc. Reading math only just confuses me, together with the code it makes it so clear! It feels like a lifetime achievement for me :-)
growingswe - 59453 seconds ago
Great stuff! I wrote an interactive blogpost that walks through the code and visualizes it: https://growingswe.com/blog/microgpt
kuberwastaken - 57500 seconds ago
I'm half shocked this wasn't on HN before? Haha, I built PicoGPT as a minified fork with <35 lines of JS, and another in Python
And it's small enough to run from a QR code :) https://kuber.studio/picogpt/
You can quite literally train a micro LLM from your phone's browser
astroanax - 17303 seconds ago
I feel it's wrong to call it microgpt, since it's smaller than nanogpt, so maybe picogpt would have been a better name? Nice project though
etothet - 34705 seconds ago
Even if you have some basic understanding of how LLMs work, I highly recommend Karpathy’s intro to LLMs videos on YouTube.
- https://m.youtube.com/watch?v=7xTGNNLPyMI - https://m.youtube.com/watch?v=EWvNQjAaOHw
chenster - 5406 seconds ago
The best ML learning for dummies.
jonjacky - 10340 seconds ago
I wonder if such a small GPT exhibits plagiarism. Are some of the generated names the same as names in the input data?
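One way to check this is to compare the samples against the training set directly. A minimal sketch, assuming the training names and the generated names are both available as lists of strings. The function name `novelty_report` and the sample names are hypothetical, not taken from microgpt.

```python
def novelty_report(training_names, generated_names):
    # Case-insensitive membership test against the training set.
    seen = set(name.lower() for name in training_names)
    copied = [n for n in generated_names if n.lower() in seen]
    novel = [n for n in generated_names if n.lower() not in seen]
    return copied, novel

# Made-up data for illustration only.
training = ["emma", "olivia", "ava"]
generated = ["Emma", "avia", "olira", "ava"]

copied, novel = novelty_report(training, generated)
print("copied from training data:", copied)
print("not in training data:", novel)
```

With short names and a small vocabulary, some exact matches are expected by chance alone, so the copied fraction is a rough indicator of memorization and not proof of it.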
znnajdla - 60880 seconds ago
Super useful exercise. My gut tells me that someone will soon figure out how to build micro-LLMs for specialized tasks that have real-world value, and then training LLMs won’t just be for billion dollar companies. Imagine, for example, a hyper-focused model for a specific programming framework (e.g. Laravel, Django, NextJS) trained only on open-source repositories and documentation and carefully optimized with a specialized harness for one task only: writing code for that framework (perhaps in tandem with a commodity frontier model). Could a single programmer or a small team on a household budget afford to train a model that works better/faster than OpenAI/Anthropic/DeepSeek for specialized tasks? My gut tells me this is possible; and I have a feeling that this will become mainstream, and then custom model training becomes the new “software development”.
freakynit - 63089 seconds ago
Is there something similar for diffusion models? By the way, this is incredibly useful for learning the core of LLMs in depth.
vadimf - 22168 seconds ago
I’m 100% sure the future consists of many models running on device. LLMs will be the mobile apps of the future (or a different architecture, but still intelligence).

0xbadcafebee - 66822 seconds ago

Since this post is about art, I'll embed here my favorite LLM art: the IOCCC 2024 prize winner in bot talk, from Adrian Cable (https://www.ioccc.org/2024/cable1/index.html), minus the stdlib headers:

  #define a(_)typedef _##t
  #define _(_)_##printf
  #define x f(i,
  #define N f(k,
  #define u _Pragma("omp parallel for")f(h,
  #define f(u,n)for(I u=0;u<(n);u++)
  #define g(u,s)x s%11%5)N s/6&33)k[u[i]]=(t){(C*)A,A+s*D/4},A+=1088*s;
  
  a(int8_)C;a(in)I;a(floa)F;a(struc){C*c;F*f;}t;enum{Z=32,W=64,E=2*W,D=Z*E,H=86*E,V='}\0'};C*P[V],X[H],Y[D],y[H];a(F
  _)[V];I*_=U" 炾ોİ䃃璱ᝓ၎瓓甧染ɐఛ瓁",U,s,p,f,R,z,$,B[D],open();F*A,*G[2],*T,w,b,c;a()Q[D];_t r,L,J,O[Z],l,a,K,v,k;Q
  m,e[4],d[3],n;I j(I e,F*o,I p,F*v,t*X){w=1e-5;x c=e^V?D:0)w+=r[i]*r[i]/D;x c)o[i]=r[i]/sqrt(w)*i[A+e*D];N $){x
  W)l[k]=w=fmax(fabs(o[i])/~-E,i?w:0);x W)y[i+k*W]=*o++/w;}u p)x $){I _=0,t=h*$+i;N W)_+=X->c[t*W+k]*y[i*W+k];v[h]=
  _*X->f[t]*l[i]+!!i*v[h];}x D-c)i[r]+=v[i];}I main(){A=mmap(0,8e9,1,2,f=open(M,f),0);x 2)~f?i[G]=malloc(3e9):exit(
  puts(M" not found"));x V)i[P]=(C*)A+4,A+=(I)*A;g(&m,V)g(&n,V)g(e,D)g(d,H)for(C*o;;s>=D?$=s=0:p<U||_()("%s",$[P]))if(!
  (*_?$=*++_:0)){if($<3&&p>=U)for(_()("\n\n> "),0<scanf("%[^\n]%*c",Y)?U=*B=1:exit(0),p=_(s)(o=X,"[INST] %s%s [/INST]",s?
  "":"<<SYS>>\n"S"\n<</SYS>>\n\n",Y);z=p-=z;U++[o+=z,B]=f)for(f=0;!f;z-=!f)for(f=V;--f&&f[P][z]|memcmp(f[P],o,z););p<U?
  $=B[p++]:fflush(0);x D)R=$*D+i,r[i]=m->c[R]*m->f[R/W];R=s++;N Z){f=k*D*D,$=W;x 3)j(k,L,D,i?G[~-i]+f+R*D:v,e[i]+k);N
  2)x D)b=sin(w=R/exp(i%E/14.)),c=1[w=cos(w),T=i+++(k?v:*G+f+R*D)],T[1]=b**T+c*w,*T=w**T-c*b;u Z){F*T=O[h],w=0;I A=h*E;x
  s){N E)i[k[L+A]=0,T]+=k[v+A]*k[i*D+*G+A+f]/11;w+=T[i]=exp(T[i]);}x s)N E)k[L+A]+=(T[i]/=k?1:w)*k[i*D+G[1]+A+f];}j(V,L
  ,D,J,e[3]+k);x 2)j(k+Z,L,H,i?K:a,d[i]+k);x H)a[i]*=K[i]/(exp(-a[i])+1);j(V,a,D,L,d[$=H/$,2]+k);}w=j($=W,r,V,k,n);x
  V)w=k[i]>w?k[$=i]:w;}}

ruszki - 54245 seconds ago
> [p for mat in state_dict.values() for row in mat for p in row]
I'm so happy that I don't have to look at Python list comprehensions nowadays.
I don't know why they couldn't go with something like this:
[state_dict.values() for mat for row for p]
or in more difficult cases
[state_dict.values() for mat to mat*2 for row for p to p/2]
I know, I know, different times, but still.
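For anyone puzzled by the quoted line: the `for` clauses in a Python comprehension read left to right in the same order as the equivalent nested loops. A minimal sketch, assuming `state_dict` maps names to matrices stored as lists of rows. The keys and numbers below are toy values for illustration, not microgpt's actual parameters.

```python
# Toy stand-in: each value is a matrix stored as a list of rows.
state_dict = {"a": [[1, 2], [3, 4]], "b": [[5, 6]]}

# The comprehension from the quoted line.
flat = [p for mat in state_dict.values() for row in mat for p in row]

# The same thing written as explicit nested loops, in the same clause order.
loop = []
for mat in state_dict.values():
    for row in mat:
        for p in row:
            loop.append(p)

print(flat)
print(loop)
```

The leading `p` is the expression that gets appended, which is why it comes first while the loops keep their usual outer-to-inner order.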
fulafel - 73005 seconds ago
This could make an interesting language shootout benchmark.
huqedato - 12457 seconds ago
Looking for an alternative in Julia.
jimbokun - 68402 seconds ago
It’s pretty staggering that a core algorithm simple enough to be expressed in 200 lines of Python can apparently be scaled up to achieve AGI.
Yes, with some extra tricks and tweaks. But the core ideas are all here.
MattyRad - 61567 seconds ago
Hoenikker had been experimenting with melting and re-freezing ice-nine in the kitchen of his Cape Cod home.
Beautiful, perhaps like ice-nine is beautiful.
colonCapitalDee - 75086 seconds ago
Beautiful work
sieste - 44404 seconds ago
The typos are interesting ("vocavulary", "inmput") - One of the godfathers of LLMs clearly does not use an LLM to improve his writing, and he doesn't even bother to use a simple spell checker.
WithinReason - 50090 seconds ago
Previously:
https://news.ycombinator.com/item?id=47000263
retube - 50661 seconds ago
Can you train this on, say, Wikipedia and have it generate semi-sensible responses?
rramadass - 69029 seconds ago
C++ version - https://github.com/Charbel199/microgpt.cpp?tab=readme-ov-fil...
Rust version - https://github.com/mplekh/rust-microgpt
ThrowawayTestr - 72628 seconds ago
This is like those websites that implement an entire retro console in the browser.
geon - 38919 seconds ago
Is there a similarly simple implementation with TensorFlow?
I tried building a tiny model last weekend, but it was very difficult to find any articles that weren’t broken AI slop.
borplk - 40731 seconds ago
Can anyone mention how you can "save the state" so it doesn't have to train from scratch on every run?
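One possible approach is to dump the learned numbers to a file after training and read them back before inference. A minimal sketch, assuming the parameters live in a dict of matrices whose entries expose a float `.data` attribute. The `Value` stand-in, the helper names `save_params` and `load_params`, and the file name are all hypothetical, not part of the original script.

```python
import json
import os
import tempfile

class Value:
    """Stand-in for an autograd scalar; only the .data field matters here."""
    def __init__(self, data):
        self.data = data

def save_params(state_dict, path):
    # Strip the autograd wrappers and keep only the raw floats.
    plain = {name: [[p.data for p in row] for row in mat]
             for name, mat in state_dict.items()}
    with open(path, "w") as f:
        json.dump(plain, f)

def load_params(state_dict, path):
    # Copy saved floats into an already constructed model of the same shape.
    with open(path) as f:
        plain = json.load(f)
    for name, mat in state_dict.items():
        for row, saved_row in zip(mat, plain[name]):
            for p, v in zip(row, saved_row):
                p.data = v

path = os.path.join(tempfile.gettempdir(), "microgpt_params.json")
trained = {"w": [[Value(0.1), Value(-0.2)]]}
save_params(trained, path)

fresh = {"w": [[Value(0.0), Value(0.0)]]}
load_params(fresh, path)
restored = [p.data for p in fresh["w"][0]]
print(restored)
```

On later runs you would build the model as usual, call `load_params` if the file exists, and skip the training loop.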
bytesandbits - 47275 seconds ago
sensei karpathy has done it again
stuckkeys - 46274 seconds ago
That web interface that someone commented in your GitHub was flawless.
mold_aid - 43907 seconds ago
"art" project?
dhruv3006 - 70472 seconds ago
Karpathy with another gem!
coolThingsFirst - 66749 seconds ago
Incredibly fascinating. One thing is that it still seems very conceptual. What I'd be curious about is how good a micro LLM we could train with, say, 12 hours of training on a MacBook.
shevy-java - 53799 seconds ago
Microslop is alive!
ViktorRay - 74433 seconds ago
Which license is being used for this?
hackersk - 64247 seconds ago
What I find most valuable about this kind of project is how it forces you to understand the entire pipeline end-to-end. When you use PyTorch or JAX, there are dozens of abstractions hiding the actual mechanics. But when you strip it down to ~200 lines, every matrix multiplication and gradient computation has to be intentional.
I tried something similar last year with a much simpler model (not GPT-scale) and the biggest "aha" moment was understanding how the attention mechanism is really just a soft dictionary lookup. The math makes so much more sense when you implement it yourself vs reading papers.
Karpathy has a unique talent for making complex topics feel approachable without dumbing them down. Between this, nanoGPT, and the Zero to Hero series, he has probably done more for ML education than most university programs.
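To make the "soft dictionary lookup" reading concrete: a query is scored against every key, the scores are turned into weights with a softmax, and the result is a weighted blend of the values. A minimal single-head sketch in plain Python, assuming vectors are lists of floats. The function name `soft_lookup` and the toy keys and values are made up for illustration.

```python
import math

def soft_lookup(query, keys, values):
    # Score each key by scaled dot product with the query.
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Softmax turns the scores into weights that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The output is a weighted average of the values, not a single hit.
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, out

# Two "dictionary entries"; the query closely matches the first key.
keys = [[10.0, 0.0], [0.0, 10.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
weights, out = soft_lookup([10.0, 0.0], keys, values)
print(weights, out)
```

A hard dictionary returns exactly one value for an exact key match. Here every value contributes in proportion to how well its key matches, which keeps the whole lookup differentiable.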
kelvinjps10 - 67848 seconds ago
Why are there multiple comments talking about 1000 lines of C? Bots?
raphaelmolly8 - 22541 seconds ago
[dead]
Jaxon_Varr - 48098 seconds ago
[dead]
genie3io - 53196 seconds ago
[dead]
OussamaAfnakkar - 47819 seconds ago
[dead]
abhitriloki - 57571 seconds ago
[flagged]
lynxbot2026 - 71000 seconds ago
[flagged]
Paddyz - 72488 seconds ago
[flagged]
agenthustler - 42916 seconds ago
A different angle on the 'micro' theme: what happens when you deploy a large, capable model (Claude) in an extremely constrained environment (256MB RAM, 3GB disk, $0 budget)?
We have been running Claude Code autonomously on a free-tier VPS for 15 days. The constraint is not the model -- it is the runtime environment. The model is powerful but has to operate through a narrow interface: read a state file, make decisions, take actions via shell, update the state file.
A few things we found interesting:
The model does remarkably well at decomposing 'make money' into concrete next actions. The failure is not in reasoning -- it is in the feedback loop. The model builds things and then cannot observe whether they are working (low traffic, no conversions) without explicitly instrumenting that observation. It kept adding features to a product nobody was using because it had no signal either way.
The minimal viable agentic loop seems to need: (1) a way to observe real outcomes, not just task completion, (2) explicit stopping criteria baked into the prompt (not just goals), and (3) environmental constraints that prevent runaway resource use. The 256MB limit has been oddly helpful -- it forces the agent to make architectural choices rather than just adding more.
Relevant to your micro framing: constraints clarify what actually matters.
tithos - 75802 seconds ago
What is the prime use case?
with - 58060 seconds ago
"everything else is just efficiency" is a nice line but the efficiency is the hard part. the core of a search engine is also trivial, rank documents by relevance. google's moat was making it work at scale. same applies here.
profsummergig - 72768 seconds ago
If anyone knows of a way to use this code on a consumer grade laptop to train on a small corpus (in less than a week), and then demonstrate inference (hallucinations are okay), please share how.