JetSpec overcomes scaling limits of speculative decoding through parallel tree drafting with causal conditioning, achieving up to 9.64x speedup in LLM inference.
EfficientRollout uses self-speculative decoding with adaptive system utilization to reduce rollout latency in RL scenarios without separate drafter pretraining or jeopardizing the target model.
Dedicated exploration models (4B–30B parameters) can handle code search in repositories more efficiently than general solver models while significantly reducing context pollution.
MSA reduces attention computation for million-token contexts by a factor of 28.4 through blockwise sparse selection and achieves practical speedups via co-design of algorithm and GPU kernel.
Aligning router rows with the principal singular directions of their associated expert matrices improves the efficiency and stability of Mixture-of-Experts models.
Corporate AI spending has spiraled out of control; OpenAI promises more efficient models, while the Jevons Paradox could drive renewed demand growth over the long term.