A developer has built a from-scratch ML compiler that lowers language models such as TinyLlama and Qwen2.5-7B to optimized CUDA kernels through six intermediate representations. The compiler achieves an overall 1.11× speedup over PyTorch eager execution and 1.20× over torch.compile on an RTX 5090, with per-operation wins of up to 4.7× on kernels such as attention and the KV projections.
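The write-up doesn't spell out the individual IR stages here, but as a rough illustration of what lowering a model through successive IRs to CUDA code can look like, here is a minimal sketch. The stage names, pass logic, and fusion rules below are hypothetical placeholders (and fewer than the six real stages), not the compiler's actual design:

```python
# Hypothetical sketch of a multi-stage lowering pipeline: each pass consumes
# one IR form and produces the next, ending at CUDA kernel stubs.
# Stage names and rules are illustrative, not the compiler's actual IRs.

def to_graph_ir(model_ops):
    # Stage 1: flatten framework-level ops into a list of graph nodes.
    return [{"op": op, "inputs": ins} for op, ins in model_ops]

def fuse_elementwise(graph_ir):
    # Stage 2: fuse an elementwise op with a following activation.
    fused, skip = [], False
    for i, node in enumerate(graph_ir):
        if skip:
            skip = False
            continue
        nxt = graph_ir[i + 1] if i + 1 < len(graph_ir) else None
        if nxt and node["op"] in {"add", "mul"} and nxt["op"] in {"relu", "silu"}:
            fused.append({"op": f'{node["op"]}_{nxt["op"]}', "inputs": node["inputs"]})
            skip = True
        else:
            fused.append(node)
    return fused

def to_loop_ir(graph_ir):
    # Stage 3: lower each node to an explicit loop/thread-mapping description.
    return [{"kernel": n["op"], "loops": ["block.x", "thread.x"]} for n in graph_ir]

def codegen_cuda(loop_ir):
    # Stage 4: emit CUDA kernel stubs (bodies elided).
    return "\n".join(
        f'__global__ void {k["kernel"]}_kernel(...) {{ /* {", ".join(k["loops"])} */ }}'
        for k in loop_ir
    )

ops = [("matmul", ["x", "w_q"]), ("add", ["h", "b"]), ("silu", ["h"])]
print(codegen_cuda(to_loop_ir(fuse_elementwise(to_graph_ir(ops)))))
```

The point of the staged design is that each pass only has to understand one representation, which is what keeps a pipeline like this small enough to read end to end.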
Why it matters: ML compilers keep growing in complexity (TVM is 500K+ lines of code). Showing that a hackable, maintainable compiler with competitive performance can be built in roughly 5,000 lines challenges industry assumptions about how much complexity is necessary and opens new possibilities for custom optimization.