Why Local LLMs Suddenly Slow Down at Long Context

This story was originally published on HackerNoon at: https://hackernoon.com/why-local-llms-suddenly-slow-down-at-long-context.
Your local LLM runs fine until it doesn't. A look at KV cache spilling from VRAM into shared memory, and why it happens silently on Windows.
Check more stories related to tech-stories at: https://hackernoon.com/c/tech-stories. You can also check exclusive content about #local-llms, #llama.cpp, #kv-cache, #vram, #gpu, #local-inference, #machine-learning, #hackernoon-top-story, and more.

This story was written by: @speederx. Learn more about this writer by checking @speederx's about page, and for more stories, please visit hackernoon.com.

Your local LLM runs fine until the context fills past a certain point, and then generation speed can drop by roughly 50%. The cause is the KV cache, which grows with every token of context, spilling out of VRAM into slower shared memory. On Windows this happens silently, with no out-of-memory error to warn you.
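To get a feel for where that tipping point sits, a rough estimate of KV cache size can help. The sketch below uses the standard approximation (keys plus values, per layer, per KV head, per token) and ignores compute buffers and other runtime overhead, so real usage will be somewhat higher. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit cache) and the 2 GiB of free VRAM are illustrative assumptions, not figures from the story.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Approximate KV cache size: one key and one value vector
    per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def max_context_before_spill(free_vram_bytes, n_layers, n_kv_heads, head_dim,
                             bytes_per_elem=2):
    """Largest context whose KV cache still fits in the VRAM left over
    after the model weights are loaded (overhead ignored)."""
    per_token = kv_cache_bytes(n_layers, n_kv_heads, head_dim, 1, bytes_per_elem)
    return free_vram_bytes // per_token


# Hypothetical model shape, for illustration only.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for n_ctx in (4096, 16384, 32768):
    size = kv_cache_bytes(n_ctx=n_ctx, **shape)
    print(f"{n_ctx:>6} tokens -> {size / GIB:.2f} GiB of KV cache")

# Assumed headroom: 2 GiB of VRAM left after loading the weights.
print(max_context_before_spill(2 * GIB, **shape), "tokens fit before spilling")
```

Under these assumptions each token costs 128 KiB of cache, so a context that seems modest can exceed the VRAM left after the weights; anything beyond that budget is what ends up in shared memory.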
