#006 - The Subtle Art of Inference with Adam Grzywaczewski

About this title

In this episode of The Private AI Lab, Johan van Amersfoort speaks with Adam Grzywaczewski, a senior Deep Learning Data Scientist at NVIDIA, about the rapidly evolving world of AI inference.


They explore how inference has shifted from simple, single-GPU execution to highly distributed, latency-sensitive systems powering today’s large language models. Adam explains the real bottlenecks teams face, why software optimization and hardware innovation must move together, and how NVIDIA’s inference stack—from TensorRT-LLM to Dynamo—enables scalable, cost-efficient deployments.


The conversation also covers quantization, pruning, mixture-of-experts models, AI factories, and why inference optimization is becoming one of the most critical skills in modern AI engineering.


Topics covered


  • Why inference is now harder than training

  • Autoregressive models and KV-cache challenges

  • Mixture-of-experts architectures

  • NVIDIA Dynamo and TensorRT-LLM

  • Hardware vs software optimization

  • Quantization, pruning, and distillation

  • Latency vs throughput trade-offs

  • The rise of AI factories and DGX systems

  • What’s next for AI inference
