ThoughtFold: Shortened Reasoning Chains through Preference Learning

In a nutshell: ThoughtFold identifies and removes redundant exploration steps in reasoning chains, reducing token consumption by approximately 56% for DeepSeek-R1-Distill-Qwen-7B while maintaining state-of-the-art accuracy.

A new framework called ThoughtFold reduces the inefficiency of Large Reasoning Models by identifying and eliminating redundant steps in chains of thought. This significantly lowers token consumption without compromising solution accuracy.

Large Reasoning Models (LRMs) have so far achieved high performance through reinforcement learning with verifiable rewards (RLVR) applied to chains of thought (CoTs). The core problem: long CoTs systematically contain trial-and-error phases, and established RLVR approaches reinforce every exploration step that leads to a correct result, including the redundant ones. The result is overthinking: models consume many tokens inefficiently.

ThoughtFold addresses this problem through fine-grained preference learning instead of pure outcome optimization. The framework uses an introspective strategy to locate redundant segments within each correct solution chain. From a correct chain, multiple candidate subchains of varying lengths are created. A procedure called Masked Preference Optimization then explicitly rewards direct connections between essential reasoning steps and penalizes unnecessary exploration detours.
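The two ideas in this paragraph can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the summary above does not specify the algorithm. Everything here is an assumption made for illustration: the function names `candidate_subchains` and `masked_preference_loss` are hypothetical, candidates are built by brute-force deletion of one contiguous run of steps (a stand-in for the introspective strategy), and the loss is a DPO-style preference loss with a token mask (a stand-in for Masked Preference Optimization).

```python
import math
from typing import List, Sequence


def candidate_subchains(steps: Sequence[str], max_cut: int) -> List[List[str]]:
    """Build shorter candidates by deleting one contiguous run of interior steps.

    Assumption: the first and last steps are always kept, so every candidate
    still starts at the problem setup and ends at the final answer.
    """
    candidates = []
    n = len(steps)
    for start in range(1, n - 1):
        for length in range(1, max_cut + 1):
            end = start + length
            if end > n - 1:  # never delete the final step
                break
            candidates.append(list(steps[:start]) + list(steps[end:]))
    return candidates


def masked_logprob(token_logprobs: Sequence[float], mask: Sequence[int]) -> float:
    """Sum log-probabilities only over tokens whose mask entry is non-zero."""
    return sum(lp for lp, m in zip(token_logprobs, mask) if m)


def masked_preference_loss(
    chosen_lp: Sequence[float], chosen_mask: Sequence[int],
    rejected_lp: Sequence[float], rejected_mask: Sequence[int],
    ref_chosen_lp: Sequence[float], ref_rejected_lp: Sequence[float],
    beta: float = 0.1,
) -> float:
    """DPO-style loss restricted to masked tokens (hypothetical formulation).

    chosen   = shorter chain that connects the essential steps directly
    rejected = longer chain that contains the exploration detour
    The mask selects the tokens that should carry the preference signal.
    """
    chosen_margin = (masked_logprob(chosen_lp, chosen_mask)
                     - masked_logprob(ref_chosen_lp, chosen_mask))
    rejected_margin = (masked_logprob(rejected_lp, rejected_mask)
                       - masked_logprob(ref_rejected_lp, rejected_mask))
    logits = beta * (chosen_margin - rejected_margin)
    return math.log1p(math.exp(-logits))  # equals -log(sigmoid(logits))


if __name__ == "__main__":
    chain = ["setup", "try approach A", "abandon A", "approach B", "answer"]
    for candidate in candidate_subchains(chain, max_cut=2):
        print(candidate)
```

In this sketch the loss falls as the policy assigns relatively more probability to the direct chain than the reference model does, and relatively less to the masked detour tokens. Which candidates count as preferred, and how the mask is chosen, is decided by the framework itself and is not reproduced here.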

In experiments, ThoughtFold reduced the token consumption of DeepSeek-R1-Distill-Qwen-7B by approximately 56%, while solution accuracy remained at state-of-the-art levels. For CTOs, this means significantly lower inference costs and faster responses at unchanged quality, which matters for production deployments of reasoning models in the enterprise.
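To make the cost implication concrete, the short Python sketch below works through the arithmetic. Only the 56% reduction comes from the reported result. The token count and the per-token price are hypothetical placeholders, and the calculation assumes that cost scales linearly with generated tokens.

```python
def tokens_after_reduction(tokens_before: float, reduction: float) -> float:
    """Tokens left after cutting a fraction `reduction` (0.56 means 56%)."""
    return tokens_before * (1.0 - reduction)


def generation_cost(tokens: float, price_per_million: float) -> float:
    """Cost under simple per-token pricing (assumed linear in tokens)."""
    return tokens / 1_000_000 * price_per_million


if __name__ == "__main__":
    tokens_before = 10_000      # hypothetical reasoning tokens per request
    price_per_million = 2.0     # hypothetical price, any currency
    tokens_after = tokens_after_reduction(tokens_before, 0.56)
    print(round(tokens_after))  # roughly 4,400 tokens remain
    print(generation_cost(tokens_before, price_per_million))
    print(generation_cost(tokens_after, price_per_million))
```

Under these assumptions, a 56% token reduction leaves about 44% of the original generation cost per request, whatever the actual price.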


Source: arxiv.org · Published June 2, 2026
Lumi AI News — AI-assisted curation in accordance with Article 50 EU AI Act. Paraphrasing and classification by Lumi News Pipeline v1.2.9.
