
It uses disaggregated serving to separate the processing (prefill) and generation (decode) phases of large language models (LLMs) onto different GPUs, allowing each phase to be optimized independently for its specific needs and ensuring maximum GPU resource utilization, the chipmaker explained.
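To make the idea concrete, here is a minimal sketch of what splitting prefill and decode across separate worker pools looks like. The class and function names are illustrative assumptions for this example, not Dynamo's actual API.

```python
# Illustrative only: prefill (prompt processing) and decode (token generation)
# run on separate worker pools so each can be scaled and tuned independently.
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    max_new_tokens: int

class PrefillWorker:
    """Runs on GPUs tuned for compute-heavy prompt processing."""
    def prefill(self, req: Request) -> dict:
        # Produce the KV cache for the prompt (stubbed out here).
        return {"kv_cache": f"kv({req.prompt})", "req": req}

class DecodeWorker:
    """Runs on GPUs tuned for memory-bound token generation."""
    def decode(self, state: dict) -> str:
        # Generate tokens using the transferred KV cache (stubbed out here).
        return f"<{state['req'].max_new_tokens} tokens for prompt {state['req'].prompt!r}>"

def serve(req: Request, prefill: PrefillWorker, decode: DecodeWorker) -> str:
    state = prefill.prefill(req)   # phase 1: prompt processing on one GPU pool
    return decode.decode(state)    # phase 2: generation on a different GPU pool

print(serve(Request("Hello", 8), PrefillWorker(), DecodeWorker()))
```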
The efficiency gain is possible because Dynamo can map the knowledge that inference systems hold in memory from serving prior requests, known as the KV cache, across potentially thousands of GPUs.
It then routes new inference requests to the GPUs that have the best knowledge match, avoiding costly re-computations and freeing up GPUs to respond to new incoming requests, the chipmaker explained.
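The routing decision can be pictured as a prefix-matching problem: send each request to the worker whose cached tokens overlap most with the new prompt. The sketch below uses a simple shared-prefix score and a hypothetical worker map purely for illustration; it is not Dynamo's implementation.

```python
# Illustrative KV-cache-aware routing: pick the worker whose cached prompt
# prefixes overlap most with the new request, minimizing recomputation.
def prefix_overlap(cached_tokens: list[int], request_tokens: list[int]) -> int:
    """Length of the shared prefix between a cached sequence and the request."""
    n = 0
    for a, b in zip(cached_tokens, request_tokens):
        if a != b:
            break
        n += 1
    return n

def route(request_tokens: list[int], workers: dict[str, list[list[int]]]) -> str:
    """Return the worker id whose KV cache best matches the request."""
    best_worker, best_score = None, -1
    for worker_id, cached_sequences in workers.items():
        score = max((prefix_overlap(c, request_tokens) for c in cached_sequences),
                    default=0)
        if score > best_score:
            best_worker, best_score = worker_id, score
    return best_worker

workers = {
    "gpu-0": [[1, 2, 3, 4]],   # holds a cached prefix matching the request
    "gpu-1": [[9, 9]],         # no useful overlap
}
print(route([1, 2, 3, 4, 5], workers))  # -> "gpu-0"
```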
Dynamo upgrades make it better than vLLM and SGLang
Dynamo includes four upgrades over its predecessor that may help it reduce inference serving costs: a GPU Planner, a Smart Router, a low-latency Communication Library, and a Memory Manager.
The GPU Planner gives enterprises the ability to use Dynamo to add, remove, and reallocate GPUs in response to fluctuating request volumes and types, avoiding over- and under-provisioning, while the low-latency Communication Library enables faster GPU-to-GPU communication and data transfer.
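A toy version of the planning step is sketched below: compare observed load against pool capacity and decide how many GPUs to add or release. The thresholds, field names, and capacity model are assumptions made for this example, not the GPU Planner's actual logic.

```python
# Toy capacity planner: scale the GPU pool toward a target utilization.
from dataclasses import dataclass

@dataclass
class PoolStats:
    gpus: int
    requests_per_sec: float
    capacity_per_gpu: float  # requests/sec one GPU can sustain

def plan(stats: PoolStats, target_util: float = 0.7) -> int:
    """Return the number of GPUs to add (positive) or release (negative)."""
    needed = stats.requests_per_sec / (stats.capacity_per_gpu * target_util)
    desired = max(1, round(needed))
    return desired - stats.gpus

print(plan(PoolStats(gpus=4, requests_per_sec=90, capacity_per_gpu=20)))  # scale up
print(plan(PoolStats(gpus=8, requests_per_sec=30, capacity_per_gpu=20)))  # scale down
```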
The Smart Router upgrade, meanwhile, will allow enterprises to use Dynamo to pinpoint the specific GPUs in large clusters that can minimize response computation and route queries to them, Nvidia said.