LoopCoder-v2 with two loops substantially improves code reasoning benchmarks (SWE-bench Verified: 43.0 → 64.4 points), while three or more loops become counterproductive due to growing position errors.
AWS has developed P-EAGLE, a parallelized variant of speculative decoding that generates draft tokens in a single forward pass instead of sequentially, achieving inference throughput improvements of up to 1.69x on SageMaker AI.
LCLMs compress KV-caches through encoder-decoder architecture up to 1:16 more efficiently than previous methods while reducing peak memory consumption and processing time.
Encoder-decoder compressors with adaptive expansion improve KV-cache compression methods in speed and memory efficiency without significant quality loss.
A CPU-based RL controller optimizes adaptive sampling during test-time scaling, reducing computational overhead and latency compared to heuristic methods.