Why Local LLMs Suddenly Slow Down at Long Context

This story was originally published on HackerNoon at: https://hackernoon.com/why-local-llms-suddenly-slow-down-at-long-context.
Your local LLM runs fine until it doesn't. A look at KV cache spilling from VRAM into shared memory, and why it happens silently on Windows.
Check more stories related to tech-stories at: https://hackernoon.com/c/tech-stories. You can also check exclusive content about #local-llms, #llama.cpp, #kv-cache, #vram, #gpu, #local-inference, #machine-learning, #hackernoon-top-story, and more.

This story was written by: @speederx. Learn more about this writer by checking @speederx's about page, and for more stories, please visit hackernoon.com.

Your local LLM runs fine until the context fills past a certain point; then generation speed can drop by roughly 50%. The cause is the KV cache spilling out of VRAM into slower shared memory. On Windows this happens silently, with no out-of-memory error to warn you.
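
To get a feel for where that point might sit, here is a rough back-of-the-envelope sketch in Python. It is not taken from the story: it assumes a standard transformer whose KV cache stores one key and one value tensor per layer, with 16-bit cache entries by default. The function names and the model dimensions in the usage note are hypothetical, and real runtimes add overhead for weights, compute buffers, and the desktop itself, so treat the result as an estimate only.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Approximate KV cache size in bytes.

    Assumption: one K and one V tensor per layer, each holding
    n_kv_heads * head_dim values per token, at bytes_per_elem each
    (2 bytes corresponds to a 16-bit cache).
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def max_context_in_vram(free_vram_bytes, n_layers, n_kv_heads, head_dim,
                        bytes_per_elem=2):
    """Largest context length whose KV cache still fits in the given VRAM.

    free_vram_bytes is whatever is left after the model weights and
    other buffers are loaded (a hypothetical input you measure yourself).
    """
    per_token = kv_cache_bytes(n_layers, n_kv_heads, head_dim, 1, bytes_per_elem)
    return free_vram_bytes // per_token


if __name__ == "__main__":
    # Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
    gib = 1024 ** 3
    size = kv_cache_bytes(32, 8, 128, n_ctx=32768)
    print(f"KV cache at 32768 tokens: {size / gib:.2f} GiB")
    print("Tokens that fit in 2 GiB:", max_context_in_vram(2 * gib, 32, 8, 128))
```

With these made-up dimensions, each token costs 128 KiB of cache, so a card with 2 GiB to spare after loading the weights would hold about 16K tokens before anything has to spill. Past that estimate is where a slowdown like the one described above would be expected to begin.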