In a nutshell: Output compression reduces inference costs by 1.4–3x, while input compression increases them by an average of 1.15x because models respond to imprecise prompts with longer answers.
A study finds that simplified input prompts (“speak briefly, without grammar”) do not reduce the cost of running Large Language Models but increase it, whereas output compression is effective.
The “Cavewoman” evaluation protocol systematically examined how eight LLMs respond to language compression in two channels: the input prompt and the generated output. The researchers tested five compression levels across five datasets, measuring task accuracy, realized cost per item, and semantic alignment with the uncompressed reference output.
Output compression, that is, instructing models to give shorter answers, reduces realized costs for most API models by 1.4–2.4x per model, and by up to 3x in the best cases. All four open-weight models tested also showed cost savings under public pricing. Input compression has the opposite effect: models compensate for underspecified prompts with longer answers, which raises net costs by an average of 1.15x, by 1.8x in the worst case, and by as much as 2.7x under aggressive compression, while task accuracy also declines.
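The cost mechanism described above can be illustrated with a small calculation. The following Python sketch is not taken from the study: the `Pricing` class, the helper functions, the per-token prices, and all token counts are hypothetical values chosen only to show why a shorter prompt can still cost more when the answer grows, assuming output tokens are billed at a higher rate than input tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical API prices in USD per one million tokens."""
    input_per_mtok: float
    output_per_mtok: float


def request_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Realized cost of a single request in USD."""
    total = (input_tokens * pricing.input_per_mtok
             + output_tokens * pricing.output_per_mtok)
    return total / 1_000_000


def cost_ratio(baseline: tuple[int, int], variant: tuple[int, int],
               pricing: Pricing) -> float:
    """Cost of the variant relative to the baseline (>1 means more expensive)."""
    return request_cost(*variant, pricing) / request_cost(*baseline, pricing)


# Assumed prices: output tokens cost four times as much as input tokens.
pricing = Pricing(input_per_mtok=1.0, output_per_mtok=4.0)

# (input_tokens, output_tokens) per request; all counts are invented.
baseline = (400, 500)           # uncompressed prompt, uncompressed answer
input_compressed = (200, 650)   # prompt halved, but the answer grows
output_compressed = (420, 200)  # slightly longer prompt asks for a brief answer

input_ratio = cost_ratio(baseline, input_compressed, pricing)
output_ratio = cost_ratio(baseline, output_compressed, pricing)

print(f"input compression:  {input_ratio:.2f}x of baseline cost")
print(f"output compression: {output_ratio:.2f}x of baseline cost")
```

With these made-up numbers the input-compressed request costs more than the baseline even though its prompt is half as long, because the extra output tokens outweigh the saved input tokens. The actual ratios reported by the study depend on the models, datasets, and prices it used.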
Additionally, the analysis reveals a semantic problem: in non-reasoning models, approximately half of all compressed generations superficially diverge from the uncompressed baseline, despite still formally solving the task. This divergence persists even after length normalization, statistical correction, and validation through alternative semantic metrics.
Source: arxiv.org · Published 22 June 2026
Lumi AI News — AI-assisted curation pursuant to Art. 50 EU AI Act. Paraphrase and classification via Lumi News Pipeline v1.7.1.