LSA predicts relevant context sections in advance and retains only these in GPU memory, compressing the KV-cache by over 86 percent without sacrificing accuracy.
Open models are closing the gap to the frontier, but different benchmarking methods and evaluation frameworks make reliable performance comparisons between open and closed systems difficult.