#006 - The Subtle Art of Inference with Adam Grzywaczewski
Artikel konnten nicht hinzugefügt werden
Der Titel konnte nicht zum Warenkorb hinzugefügt werden.
Der Titel konnte nicht zum Merkzettel hinzugefügt werden.
„Von Wunschzettel entfernen“ fehlgeschlagen.
„Podcast folgen“ fehlgeschlagen
„Podcast nicht mehr folgen“ fehlgeschlagen
-
Gesprochen von:
-
Von:
Über diesen Titel
In this episode of The Private AI Lab, Johan van Amersfoort speaks with Adam Grzywaczewski, a senior Deep Learning Data Scientist at NVIDIA, about the rapidly evolving world of AI inference.
They explore how inference has shifted from simple, single-GPU execution to highly distributed, latency-sensitive systems powering today’s large language models. Adam explains the real bottlenecks teams face, why software optimization and hardware innovation must move together, and how NVIDIA’s inference stack—from TensorRT-LLM to Dynamo—enables scalable, cost-efficient deployments.
The conversation also covers quantization, pruning, mixture-of-experts models, AI factories, and why inference optimization is becoming one of the most critical skills in modern AI engineering.
Topics covered
Why inference is now harder than training
Autoregressive models and KV-cache challenges
Mixture-of-experts architectures
NVIDIA Dynamo and TensorRT-LLM
Hardware vs software optimization
Quantization, pruning, and distillation
Latency vs throughput trade-offs
The rise of AI factories and DGX systems
What’s next for AI inference
