At a glance: Granite 4.1 is a family of compact language models from IBM with 3B, 8B and 30B parameters, trained on about 15 trillion tokens with a context window of up to 512K tokens. The 8B Instruct model matches or exceeds its larger predecessor thanks to an optimized dense architecture, supervised fine-tuning and reinforcement learning.
This is a technical overview of the data processing, pretraining and optimization behind IBM’s Granite 4.1 language models. The new model family reaches the performance of earlier, larger Granite models with significantly fewer parameters.
The Granite 4.1 family consists of compact decoder-only language models in three sizes: 3 billion, 8 billion and 30 billion parameters. Training is based on a multi-stage pretraining pipeline in which the models see approximately 15 trillion tokens.
A notable feature is the extension of the context window to up to 512,000 tokens, which allows the models to process much longer text sequences. Base training is followed by a refinement phase: supervised fine-tuning on approximately 4.1 million carefully curated, high-quality training examples.
Training concludes with reinforcement learning using on-policy GRPO (Group Relative Policy Optimization) with the DAPO loss. In this stage the models generate responses themselves and are updated according to how those responses are scored, which further improves output quality.
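The overview names the method but gives no formulas or hyperparameters. As a rough illustration only, the sketch below shows the two ideas usually associated with these names: GRPO-style advantages computed relative to a group of sampled responses, and a DAPO-style clipped objective with an asymmetric clipping range that is averaged over all tokens in the group. The function names, the clipping values (0.2 and 0.28) and the toy inputs are assumptions chosen for illustration, not details of IBM's training setup.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward within its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dapo_token_loss(new_logps, old_logps, rewards, clip_low=0.2, clip_high=0.28):
    """Clipped surrogate loss, averaged over every token in the group.

    new_logps / old_logps: one list of per-token log-probabilities per response,
        under the current policy and the policy that sampled the responses.
    rewards: one scalar reward per response.
    clip_low / clip_high: illustrative values, NOT taken from the Granite 4.1 report.
    """
    advantages = group_relative_advantages(rewards)
    total, n_tokens = 0.0, 0
    for new_seq, old_seq, adv in zip(new_logps, old_logps, advantages):
        for new_lp, old_lp in zip(new_seq, old_seq):
            ratio = math.exp(new_lp - old_lp)
            # Asymmetric clipping: the upper bound is looser than the lower one.
            clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
            total += min(ratio * adv, clipped * adv)
            n_tokens += 1
    # Token-level normalization: long responses contribute more terms.
    return -total / n_tokens


# Toy group of two responses: the first was rewarded, the second was not.
loss = dapo_token_loss(
    new_logps=[[0.0], [-1.0]],
    old_logps=[[-1.0], [-1.0]],
    rewards=[1.0, 0.0],
)
print(round(loss, 4))
```

In the strictly on-policy case the two sets of log-probabilities coincide, so every ratio equals 1 and the clipping has no effect on the loss value; the group-relative advantages then determine the update on their own.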
Particularly noteworthy is the performance of the 8B Instruct model: it matches or exceeds the results of the previous Granite 4.0-H-Small, a Mixture-of-Experts model with 32 billion parameters. The new model instead uses a simpler dense architecture with fewer total parameters.