JetSpec overcomes scaling limits of speculative decoding through parallel tree drafting with causal conditioning, achieving up to 9.64x speedup in LLM inference.
MSA reduces attention computation for million-token contexts by a factor of 28.4 through blockwise sparse selection and achieves practical speedups via co-design of algorithm and GPU kernel.