ParallelKernelBench: Frontier LLMs Still Struggle with Fast Multi-GPU Kernels

In brief: Frontier LLMs solve fewer than one-third of 87 multi-GPU CUDA benchmark tasks, though some generated kernels still outperform public reference implementations.

A new benchmark tests how well the most powerful language models can write multi-GPU CUDA kernels. Current frontier models solve fewer than one-third of its 87 real-world workloads, yet some generated kernels outperform existing public implementations.

Together AI has developed ParallelKernelBench, a benchmark covering 87 real-world workloads. Each workload calls for an optimized CUDA kernel that runs efficiently in parallel across multiple GPUs, a central requirement for scalable AI systems and compute-intensive applications.

According to the results, the best current frontier models solve fewer than one-third of these tasks. This underscores that LLMs cannot yet consistently generate hardware-optimized, high-performance parallel code, even though they excel at other programming tasks.

A notable side finding: some of the LLM-generated kernels outperform the documented public implementations for their respective workloads. This suggests that, in individual cases, models can find creative solutions or beat suboptimal reference code. For engineers, the benchmark sends a clear signal: GPU kernel optimization remains an area where human expertise and manual tuning are still necessary, and automation through LLMs is not yet production-ready here.


Source: www.together.ai · Published 23 June 2026
Lumi AI News — AI-assisted curation pursuant to Art. 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.7.1.