kernel-fusion

An agent harness that compiles a model into one provably-correct, self-retargeting CUDA megakernel and self-tunes it past cuBLAS at batch-1 LLM decode.

machine-learning gpu cuda gpu-programming kernel-fusion mlsys llm-inference agent-harness megakernel

Updated Jun 8, 2026
Python

wu-kan / GoPTX

GoPTX: Fine-grained GPU Kernel Fusion by PTX-level Instruction Flow Weaving

gpu compile ilp ptx kernel-fusion warp-stall data-hazard

Updated Jul 30, 2025
HTML

Argonaut790 / fused-turboquant

Fused Triton kernels for TurboQuant KV cache compression — 2-4 bit quantization with RHT rotation. Drop-in HuggingFace & vLLM integration. Up to 4.9x KV cache compression for Llama, Qwen, Mistral, and more.

Updated Apr 1, 2026
Python

nopperl / pytorch-fused-lamb

LAMB go brrr

cuda optimizer pytorch triton lamb kernel-fusion triton-lang

Updated Apr 11, 2024
Python

AICL-Lab / triton-fused-ops

Fused Triton kernels for Transformer inference: RMSNorm+RoPE, Gated MLP, FP8 GEMM — CPU-testable references, autotuning, and benchmarking

Updated May 25, 2026
Python

svdrecbd / mhc-mlx

MLX + Metal implementation of mHC: Manifold-Constrained Hyper-Connections by DeepSeek-AI.

performance deep-learning metal gpu transformers mhc mlx sinkhorn kernel-fusion sinkhorn-knopp apple-silicon metal-kernel mlx-explore fused-kernels manifold-constrained-hyper-connections hyperconnections birkhoff-polytope

Updated Jan 13, 2026
Python

0sec-labs / noeris

Noeris: autonomous kernel-fusion discovery and Triton autotuning for LLM kernels, plus deeper fusion of Gemma layers (A100/H100 wins).

benchmarking cuda pytorch triton autotuning gemma gpu-kernels github-actions kernel-fusion llm-training llm-inference kernel-optimization

Updated May 27, 2026
Python

AICL-Lab / tiny-dl-inference

Zero-dependency WebGPU deep learning inference engine (~50KB vs TensorFlow.js ~2MB)

machine-learning typescript browser deep-learning neural-network wasm inference mnist tensor gpu-computing webgpu kernel-fusion wgsl webgpu-compute

Updated May 25, 2026
TypeScript

fraidakis / PDS_BitonicSortCUDA

Assignment 3 for the "Parallel & Distributed Systems" course (ECE, AUTh) - Fall 2024

cuda shared-memory radix-sort bitonic-sort nvidia-gpu kernel-fusion

Updated Mar 16, 2025
Cuda

JonSnow1807 / Fused-LayerNorm-CUDA-Operator

High-performance CUDA implementation of LayerNorm for PyTorch achieving 1.46x speedup through kernel fusion. Optimized for large language models (4K-8K hidden dims) with vectorized memory access, warp-level primitives, and mixed precision support. Drop-in replacement for nn.LayerNorm with 25% memory reduction.

deep-learning cuda pytorch gpu-optimization kernel-fusion layernorm

Updated Aug 17, 2025
Python

ShkalikovOleh / alpaka_expr_trees

Compile-time kernel fusion and expression trees as an Alpaka backend for boost.odeint. This is my team project, developed in collaboration with and under the supervision of HZDR.

cuda accelerators kernel-fusion alpaka

Updated Feb 20, 2024
C++

ParCoreLab / gpu-fusion

GPU fusion code and algorithm

gpu cuda kernel-fusion

Updated May 24, 2024
Cuda

abgnydn / wgpu-adas-bench

ADAS sensor fusion benchmark — 11-stage fused wgpu-native vs multi-kernel PyTorch. 12-15x faster on same GPU.

rust benchmark metal vulkan pytorch autonomous-driving sensor-fusion adas webgpu kernel-fusion wgpu

Updated May 4, 2026
Rust

varad-more / fused-triton-rmsnorm-residual-qkv

Production-grade Triton kernel fusing residual add + RMSNorm + packed QKV projection into a single GPU launch for decoder-only transformer inference (Llama-3, Mistral, Qwen2). +2.4% tok/s, -1.5 GB VRAM on A10G.

cuda pytorch transformer triton llama memory-bandwidth gpu-kernels kernel-fusion rmsnorm llm-inference

Updated Apr 22, 2026
Python

TxsharDev / Molten

Write the math. Get the kernel. Fused CUDA kernel generation from mathematical specifications.

python machine-learning kernel deep-learning compiler gpu cuda inference pytorch code-generation kernel-fusion fused-kernels alia-labs

Updated Jun 11, 2026
Python

abgnydn / webgpu-fusion-max

Pushing fused WebGPU transformer kernels to max model size — int4, tiled FFN, Phi-3-mini 3.6B in Chrome

inference transformer quantization webgpu kernel-fusion wgsl llm phi-3 browser-llm

Updated May 4, 2026
HTML

abgnydn / webgpu-q

WebGPU quantum many-body + chemistry simulator — statevector, MPS, kernel fusion, HF/UHF/UCCSD/DFT/MP2/CCSD, CCSD(T) on GPU (39× speedup), EE/IP/EA-EOM-CCSD (FCI-validated via brute-force), Cholesky density fitting. 401 tests, ITensor + PySCF + brute-force-EOM cross-checked. Browser-native.

Updated Jun 10, 2026
TypeScript

kernel-fusion

Here are 20 public repositories matching this topic...

tracel-ai / burn

ROCm / iris

chhzh123 / Krill

RightNow-AI / AutoMegaKernel

wu-kan / GoPTX

Argonaut790 / fused-turboquant

nopperl / pytorch-fused-lamb

AICL-Lab / triton-fused-ops

svdrecbd / mhc-mlx

0sec-labs / noeris

AICL-Lab / tiny-dl-inference

fraidakis / PDS_BitonicSortCUDA

JonSnow1807 / Fused-LayerNorm-CUDA-Operator

ShkalikovOleh / alpaka_expr_trees

ParCoreLab / gpu-fusion

abgnydn / wgpu-adas-bench

varad-more / fused-triton-rmsnorm-residual-qkv

TxsharDev / Molten

abgnydn / webgpu-fusion-max

abgnydn / webgpu-q
