Deep Learning (DL) applications often require processing video data for tasks such as object detection, classification, and segmentation. However, conventional video processing pipelines are typically inefficient for deep learning inference, leading to performance bottlenecks. In this post, we will leverage PyTorch and FFmpeg with NVIDIA hardware acceleration to remove those bottlenecks.
The inefficiency comes from how video frames are typically decoded and transferred between CPU and GPU. The standard workflow found in the majority of tutorials follows this structure:
- Decode Frames on CPU: Video files are first decoded into raw frames using CPU-based decoding tools (e.g., OpenCV, FFmpeg without GPU support).
- Transfer to GPU: These frames are then transferred from CPU to GPU memory to perform deep learning inference using frameworks like TensorFlow, PyTorch, ONNX, etc.
- Inference on GPU: Once the frames are in GPU memory, the model performs inference.
- Transfer Back to CPU (if needed): Some post-processing steps may require data to be moved back to the CPU.

This CPU-GPU transfer process introduces a significant performance bottleneck, especially when processing high-resolution videos at high frame rates. The unnecessary memory copies and context switches slow down the overall inference speed, limiting real-time processing capabilities.
As an example, the following snippet shows the typical video processing pipeline that you come across when starting to learn deep learning:
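The original snippet is not reproduced here, so below is a minimal sketch of such a pipeline, assuming OpenCV for CPU decoding and a torchvision ResNet-50 as a placeholder model; the video path is hypothetical.

```python
import cv2
import torch
from torchvision.models import resnet50, ResNet50_Weights

device = torch.device("cuda")
weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights).eval().to(device)
preprocess = weights.transforms()

cap = cv2.VideoCapture("video.mp4")  # hypothetical input file

with torch.no_grad():
    while True:
        ok, frame = cap.read()                              # 1. decode on CPU
        if not ok:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        tensor = torch.from_numpy(frame).permute(2, 0, 1)   # HWC -> CHW, uint8
        batch = preprocess(tensor).unsqueeze(0).to(device)  # 2. copy to GPU
        logits = model(batch)                               # 3. inference on GPU
        preds = logits.argmax(dim=1).cpu()                  # 4. back to CPU if needed

cap.release()
```

Every frame is decoded on the CPU, copied to the GPU for inference, and the predictions are copied back, which is exactly the round trip we want to avoid.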
The Solution: GPU-Based Video Decoding and Inference
A more efficient approach is to keep the entire pipeline on the GPU, from video decoding to inference, eliminating redundant CPU-GPU transfers. This can be achieved using FFmpeg with NVIDIA GPU hardware acceleration.
Key Optimisations
- GPU-Accelerated Video Decoding: Instead of using CPU-based decoding, we leverage FFmpeg with NVIDIA GPU acceleration (NVDEC) to decode video frames directly on the GPU.
- Zero-Copy Frame Processing: The decoded frames remain in GPU memory, avoiding unnecessary memory transfers.
- GPU-Optimized Inference: Once the frames are decoded, we perform inference directly using any model on the same GPU, significantly reducing latency.

Hands on!
Prerequisites
To achieve the aforementioned improvements, we will rely on FFmpeg built with NVIDIA GPU acceleration (NVDEC) together with the PyTorch stack (torch, torchaudio, torchvision); the exact versions used are listed below.
Installation
For a detailed guide on how to build and install FFmpeg with NVIDIA GPU acceleration, follow these instructions.
Tested with:
- System: Ubuntu 22.04
- NVIDIA Driver Version: 550.120
- CUDA Version: 12.4
- Torch: 2.4.0
- Torchaudio: 2.4.0
- Torchvision: 0.19.0
1. Install the NV-Codecs
2. Clone and configure FFmpeg
3. Validate whether the installation was successful with torchaudio.utils
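One way to perform the validation in step 3 is to ask torchaudio which FFmpeg decoders it can see; if the build picked up NVDEC, the NVIDIA (cuvid) decoders show up in the list. A minimal check using torchaudio's ffmpeg_utils helpers could look like this:

```python
import torch
from torchaudio.utils import ffmpeg_utils

# The GPU must be visible to PyTorch for the rest of the pipeline to work.
print("CUDA available:", torch.cuda.is_available())

# FFmpeg library versions that torchaudio was linked against.
print("FFmpeg versions:", ffmpeg_utils.get_versions())

# If FFmpeg was built with NVDEC support, the NVIDIA decoders are listed here.
decoders = ffmpeg_utils.get_video_decoders()
print("h264_cuvid available:", "h264_cuvid" in decoders)
print("hevc_cuvid available:", "hevc_cuvid" in decoders)
```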
Time to code an optimised pipeline!
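The exact code from the benchmark is not shown here, so the following is a minimal sketch of what such a pipeline can look like with torchaudio's StreamReader: frames are decoded with the h264_cuvid decoder (NVDEC) and, thanks to the hw_accel option, delivered as CUDA tensors, so they go straight into the model without a CPU round trip. The model (ResNet-50), chunk size, preprocessing, and video path are illustrative assumptions, not the author's exact setup.

```python
import torch
import torch.nn.functional as F
from torchaudio.io import StreamReader
from torchvision.models import resnet50, ResNet50_Weights

device = torch.device("cuda:0")
weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights).eval().to(device)

# ImageNet normalisation constants, kept on the GPU.
mean = torch.tensor([0.485, 0.456, 0.406], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225], device=device).view(1, 3, 1, 1)

# Decode on the GPU: h264_cuvid uses NVDEC, hw_accel keeps the frames in CUDA memory.
reader = StreamReader("video.mp4")  # hypothetical input file
reader.add_video_stream(frames_per_chunk=16, decoder="h264_cuvid", hw_accel="cuda:0")

with torch.no_grad():
    for (chunk,) in reader.stream():
        # chunk is a uint8 CUDA tensor of shape (frames, channels, height, width).
        batch = chunk.float().div(255)
        batch = F.interpolate(batch, size=(224, 224), mode="bilinear", align_corners=False)
        batch = (batch - mean) / std
        preds = model(batch).argmax(dim=1)  # stays on the GPU until you need it
```

Preprocessing (resize and normalisation) is done with plain tensor ops on the device, so the frames never leave GPU memory until the predictions are actually read out.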
Benchmarking
To check whether the optimisation actually makes a difference, we will be using this video from Pexels by Pawel Perzanowski. Since most videos there are really short, I have stacked the same video several times to get results for different video lengths. The original video is 32 seconds long, which gives us a total of 960 frames; the modified videos have 5520 and 9300 frames respectively.
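A comparison like this can be reproduced with nothing more than a wall-clock timer around each pipeline; the function names below are hypothetical wrappers around the two snippets shown earlier.

```python
import time

def benchmark(pipeline, video_path):
    """Run a pipeline end to end and report its wall-clock time."""
    start = time.perf_counter()
    pipeline(video_path)
    print(f"{pipeline.__name__}: {time.perf_counter() - start:.2f}s")

# benchmark(typical_pipeline, "original_video.mp4")    # hypothetical wrappers
# benchmark(optimised_pipeline, "original_video.mp4")
```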
Original video (960 frames)
- typical workflow: 28.51s
- optimised workflow: 24.2s
Okay… it doesn’t seem like a real improvement, right? Let’s test it with longer videos.
Modified video v1 (5520 frames)
- typical workflow: 118.72s
- optimised workflow: 100.23s
Modified video v2 (9300 frames)
- typical workflow: 292.26s
- optimised workflow: 240.85s
As the video duration increases, the benefits of the optimisation become more evident. In the longest test case, processing time drops by roughly 18%, a significant reduction. These gains matter most when handling large video datasets or real-time video analysis tasks, where small efficiency improvements accumulate into substantial time savings.
Conclusion
In today’s post, we have explored two video processing pipelines: the typical one, where frames are decoded on the CPU and copied to the GPU, introducing noticeable bottlenecks, and an optimised one, where frames are decoded on the GPU and passed directly to inference, saving a considerable amount of time as video duration increases.