GLM-5.2 ranks as the leading open language model on the Artificial Analysis Index with a score of 51 and places second on the Code Arena WebDev Leaderboard, but it produces significantly more output tokens than competing models.
With two loops, LoopCoder-v2 substantially improves scores on code reasoning benchmarks (SWE-bench Verified: 43.0 → 64.4 points, a gain of 21.4), while three or more loops become counterproductive due to growing position errors.
Grammar-Constrained Decoding (GCD), a technique for ensuring syntactically correct code, also gives attackers a new jailbreak method, with a success rate more than 30 percentage points higher than previous approaches.
Arbor coordinates autonomous AI agents via persistent hypothesis trees and achieved 2.5× better results than Codex and Claude Code on six research tasks.
A self-learning framework for code-repair agents uses the agents' own solution traces directly to generate targeted training tasks, achieving higher accuracy than previous approaches.
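
For readers unfamiliar with grammar-constrained decoding, mentioned in the GCD item above, the following is a minimal toy sketch of the general idea, not the method from the reported work. At each step the decoder masks out every token that would break the grammar and picks the best-scoring token among those that remain. The grammar (balanced parentheses), the vocabulary, the depth cap, and the `toy_scores` function are all hypothetical stand-ins chosen for illustration; a real system would use model logits and a full grammar for the target language.

```python
from typing import Callable, Dict, List

# Hypothetical three-token vocabulary for the toy grammar.
VOCAB = ["(", ")", "<eos>"]
MAX_DEPTH = 3  # assumed nesting cap, only to keep the toy example small


def allowed_tokens(prefix: List[str]) -> List[str]:
    """Tokens that keep `prefix` a valid prefix of a balanced-parentheses string."""
    depth = prefix.count("(") - prefix.count(")")
    allowed = []
    if depth < MAX_DEPTH:
        allowed.append("(")      # opening is fine until the depth cap
    if depth > 0:
        allowed.append(")")      # closing needs an unmatched "("
    if depth == 0 and prefix:
        allowed.append("<eos>")  # stop only when balanced and non-empty
    return allowed


def constrained_decode(
    score: Callable[[List[str]], Dict[str, float]],
    max_steps: int = 20,
) -> str:
    """Greedy decoding restricted to grammar-valid tokens at every step."""
    prefix: List[str] = []
    for _ in range(max_steps):
        scores = score(prefix)
        candidates = allowed_tokens(prefix)  # the grammar mask
        token = max(candidates, key=lambda t: scores[t])
        if token == "<eos>":
            break
        prefix.append(token)
    return "".join(prefix)


def toy_scores(prefix: List[str]) -> Dict[str, float]:
    """Stand-in for model logits: always prefers ")", which alone would be invalid."""
    return {"(": 0.1, ")": 0.9, "<eos>": 0.5}


print(constrained_decode(toy_scores))
```

Without the mask, the toy scorer would start with `)`, which is not a valid string in this grammar. The mask forces an opening parenthesis first. This sketch does not guarantee a balanced result if the step budget runs out first; a fuller implementation would force closing tokens near the limit.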