Storage is an overlooked element of AI, overshadowed by all the emphasis on processors, namely GPUs. Large language models (LLMs) measure in the terabytes, and all of that data has to be moved around to be processed. The faster data can be moved, the less time GPUs sit idle waiting to be fed.
Nvidia says it has tested these Spectrum-X features on its Israel-1 AI supercomputer. The testing measured the read and write bandwidth generated by Nvidia HGX H100 GPU server clients accessing storage, first with the network configured as a standard RoCE v2 fabric, and then with Spectrum-X's adaptive routing and congestion control turned on.
Tests were run using a range of GPU server clients, from 40 to 800 GPUs. In every case, the enhanced Spectrum-X networking outperformed the standard configuration, with read bandwidth improving by 20% to 48% and write bandwidth by 9% to 41% over standard RoCE networking, according to Nvidia.
Another method for improving efficiency is checkpointing, in which the state of a processing job is saved periodically so that if a training run fails for any reason, it can be restarted from the most recent checkpoint rather than from the beginning.
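As a rough illustration of the general idea (not Nvidia's implementation), the sketch below shows a training-style loop in Python that periodically saves its state to disk and resumes from the last save on restart. The file name, checkpoint interval, and toy "training step" are all placeholders.

```python
import pickle
from pathlib import Path

CHECKPOINT = Path("checkpoint.pkl")  # hypothetical checkpoint file
CHECKPOINT_EVERY = 100               # hypothetical save interval, in steps
TOTAL_STEPS = 1000

def save_checkpoint(step: int, state: dict) -> None:
    """Persist the job state so a failed run can resume from this point."""
    with CHECKPOINT.open("wb") as f:
        pickle.dump({"step": step, "state": state}, f)

def load_checkpoint() -> tuple[int, dict]:
    """Return (step, state) from the last save, or a fresh start."""
    if CHECKPOINT.exists():
        with CHECKPOINT.open("rb") as f:
            saved = pickle.load(f)
        return saved["step"], saved["state"]
    return 0, {"loss": None}  # no checkpoint yet: start from scratch

start_step, state = load_checkpoint()
for step in range(start_step, TOTAL_STEPS):
    state["loss"] = 1.0 / (step + 1)  # stand-in for one real training step
    if (step + 1) % CHECKPOINT_EVERY == 0:
        save_checkpoint(step + 1, state)  # periodic save: the restart point
```

In a real LLM training run the saved state would include model weights and optimizer state measured in terabytes, which is why checkpoint writes are exactly the kind of bursty storage traffic these networking improvements target.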
Nvidia’s storage partners such as DDN, Dell, HPE, Lenovo, VAST Data, and WEKA will likely support these Spectrum-X features in the future.