Why Local LLMs Suddenly Slow Down at Long Context
This story was originally published on HackerNoon at: https://hackernoon.com/why-local-llms-suddenly-slow-down-at-long-context.
Your local LLM runs fine until it doesn't. A look at KV cache spilling from VRAM into shared memory, and why it happens silently on Windows.
Check more stories related to tech-stories at: https://hackernoon.com/c/tech-stories. You can also check exclusive content about #local-llms, #llama.cpp, #kv-cache, #vram, #gpu, #local-inference, #machine-learning, #hackernoon-top-story, and more.
This story was written by: @speederx. Learn more about this writer by checking @speederx's about page, and for more stories, please visit hackernoon.com.
Your local LLM runs fine until the context grows past a certain point, and then generation speed can drop by ~50%. The cause is the KV cache spilling out of VRAM into slower shared memory. On Windows this happens silently, with no out-of-memory error to warn you.
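To see why the slowdown shows up at a particular context length, it can help to estimate how large the KV cache grows per token. The sketch below uses the standard sizing rule for a transformer with grouped-query attention (keys plus values, per layer, per KV head). The function names and the model shape in the usage lines (32 layers, 8 KV heads, head dimension 128, 16-bit cache) are illustrative assumptions, not figures from the story, and the estimate ignores compute buffers and other allocations that also compete for VRAM.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Approximate KV cache size in bytes for a given context length.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes a 16-bit cache; a quantized cache would be smaller.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def max_context_in_vram(free_vram_bytes, n_layers, n_kv_heads, head_dim,
                        bytes_per_elem=2):
    """Largest context length whose KV cache still fits in the given VRAM budget.

    free_vram_bytes is assumed to be what remains after the model weights
    are loaded; beyond this length the cache would have to live elsewhere.
    """
    per_token = kv_cache_bytes(n_layers, n_kv_heads, head_dim, 1, bytes_per_elem)
    return free_vram_bytes // per_token


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Hypothetical model shape, chosen only for illustration.
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

    size = kv_cache_bytes(n_ctx=8192, **shape)
    print(f"KV cache at 8192 tokens: {size / GIB:.2f} GiB")

    limit = max_context_in_vram(4 * GIB, **shape)
    print(f"Tokens that fit in 4 GiB of free VRAM: {limit}")
```

Under these assumptions each token costs 128 KiB of cache, so a model that leaves 4 GiB of VRAM free after loading weights would run out of room at roughly 32K tokens. Past that point the cache has to go somewhere else, which is consistent with the spill into shared memory described above.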