Monday, February 23, 2026

Optimized Asynchronous Semantic Caching for Tiered LLM Architectures

Large language models (LLMs) play a crucial role in search, assistance, and agentic workflows, making semantic caching vital for reducing inference cost and latency. Production systems typically employ a static-dynamic tiered design, combining a static cache of vetted responses mined from logs with a dynamic online cache. However, this approach often depends on a single embedding-similarity threshold, creating a tradeoff: conservative thresholds miss valuable reuse opportunities, while aggressive ones risk serving inaccurate responses. We present Krites, an asynchronous caching policy that expands static coverage without modifying the serving path. Krites retains standard static behavior on the critical path but, when the nearest static neighbor falls below the threshold, asynchronously invokes an LLM judge to verify whether that neighbor's curated response still answers the query. Verified responses are promoted to the dynamic cache, so future paraphrases can be served from curated static answers. In simulations, Krites boosts the rate of requests served with curated responses by up to 3.9 times for conversational and search queries while leaving critical-path latency unchanged.
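The control flow described above can be made concrete with a short sketch. The snippet below is a minimal illustration, not the paper's implementation: embed, llm_judge, generate, and both thresholds are hypothetical stand-ins, and only the flow (static hit on the critical path, asynchronous judge on a near miss, promotion into the dynamic cache) mirrors the design the abstract describes.

```python
import math
from concurrent.futures import ThreadPoolExecutor

STATIC_THRESHOLD = 0.85   # assumed value; the abstract does not specify one
DYNAMIC_THRESHOLD = 0.90  # assumed value for dynamic-cache lookups

def embed(text: str) -> list[float]:
    """Toy character-histogram embedding; a real system would use a model."""
    vec = [0.0] * 64
    for ch in text.lower():
        vec[ord(ch) % 64] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def llm_judge(query: str, candidate: str) -> bool:
    """Placeholder: a real judge asks an LLM whether candidate answers query."""
    return True

def generate(query: str) -> str:
    """Placeholder for the normal (uncached) inference path."""
    return f"fresh answer to: {query}"

class KritesStyleCache:
    def __init__(self, static_entries: list[tuple[str, str]]):
        # Static tier: (embedding, vetted response) pairs mined from logs.
        self.static = [(embed(q), r) for q, r in static_entries]
        # Dynamic tier: starts empty, filled by asynchronous promotions.
        self.dynamic: list[tuple[list[float], str]] = []
        self.pool = ThreadPoolExecutor(max_workers=2)

    def _nearest(self, tier, qvec):
        """Return (similarity, response) of the nearest entry in a tier."""
        return max(((cosine(qvec, e), r) for e, r in tier),
                   default=(-1.0, None))

    def serve(self, query: str) -> str:
        qvec = embed(query)
        # 1. Dynamic hit: a previously promoted, judge-verified answer.
        sim, resp = self._nearest(self.dynamic, qvec)
        if sim >= DYNAMIC_THRESHOLD:
            return resp
        # 2. Standard static hit on the critical path.
        sim, resp = self._nearest(self.static, qvec)
        if sim >= STATIC_THRESHOLD:
            return resp
        # 3. Near miss: answer normally now; verify the curated candidate
        #    off the critical path and promote it if the judge approves.
        if resp is not None:
            self.pool.submit(self._verify_and_promote, query, qvec, resp)
        return generate(query)

    def _verify_and_promote(self, query, qvec, candidate):
        if llm_judge(query, candidate):
            # list.append is atomic in CPython, so this sketch skips locking.
            self.dynamic.append((qvec, candidate))

cache = KritesStyleCache(
    [("how do I reset my password", "Visit Settings > Security...")]
)
print(cache.serve("password reset steps"))  # toy similarity decides the path
```

Note that the judge and the promotion run entirely off the critical path: a request never waits on verification, which is how the policy can widen curated coverage without adding serving latency.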
