Burn is a next generation tensor library and Deep Learning Framework that doesn't compromise on flexibility, efficiency and portability.
-
Updated
Jun 12, 2026 - Rust
Burn is a next generation tensor library and Deep Learning Framework that doesn't compromise on flexibility, efficiency and portability.
AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming
An efficient concurrent graph processing system
An agent harness that compiles a model into one provably-correct, self-retargeting CUDA megakernel and self-tunes it past cuBLAS at batch-1 LLM decode.
GoPTX: Fine-grained GPU Kernel Fusion by PTX-level Instruction Flow Weaving
Fused Triton kernels for TurboQuant KV cache compression — 2-4 bit quantization with RHT rotation. Drop-in HuggingFace & vLLM integration. Up to 4.9x KV cache compression for Llama, Qwen, Mistral, and more.
LAMB go brrr
Fused Triton kernels for Transformer inference: RMSNorm+RoPE, Gated MLP, FP8 GEMM — CPU-testable references, autotuning, and benchmarking
MLX + Metal implementation of mHC: Manifold-Constrained Hyper-Connections by DeepSeek-AI.
Noeris — autonomous kernel fusion discovery + Triton autotuning for LLM kernels and Gemma layer deeper fusion (A100/H100 wins).
Zero-dependency WebGPU deep learning inference engine (~50KB vs TensorFlow.js ~2MB)
Assigment 3 for the "Parallel & Distributed Systems" course (ECE, AUTh) - Fall 2024
High-performance CUDA implementation of LayerNorm for PyTorch achieving 1.46x speedup through kernel fusion. Optimized for large language models (4K-8K hidden dims) with vectorized memory access, warp-level primitives, and mixed precision support. Drop-in replacement for nn.LayerNorm with 25% memory reduction.
Compile time kernels fusion and expression trees as Alpaka boost.odeint backend. This is my team project developed in collaboration with and under the supervision of HZDR.
ADAS sensor fusion benchmark — 11-stage fused wgpu-native vs multi-kernel PyTorch. 12-15x faster on same GPU.
Production-grade Triton kernel fusing residual add + RMSNorm + packed QKV projection into a single GPU launch for decoder-only transformer inference (Llama-3, Mistral, Qwen2). +2.4% tok/s, -1.5 GB VRAM on A10G.
Write the math. Get the kernel. Fused CUDA kernel generation from mathematical specifications.
Pushing fused WebGPU transformer kernels to max model size — int4, tiled FFN, Phi-3-mini 3.6B in Chrome
WebGPU quantum many-body + chemistry simulator — statevector, MPS, kernel fusion, HF/UHF/UCCSD/DFT/MP2/CCSD, CCSD(T) on GPU (39× speedup), EE/IP/EA-EOM-CCSD (FCI-validated via brute-force), Cholesky density fitting. 401 tests, ITensor + PySCF + brute-force-EOM cross-checked. Browser-native.
Add a description, image, and links to the kernel-fusion topic page so that developers can more easily learn about it.
To associate your repository with the kernel-fusion topic, visit your repo's landing page and select "manage topics."