
Custom Training Pipeline for Object Detection Models

What if you want to write the whole object detection training pipeline from scratch, so you can understand each step and be able to customize it? That’s what I set out to do. I examined several well-known object detection pipelines and designed one that best suits my needs and tasks, leveraging the Ultralytics, YOLOX, DAMO-YOLO, RT-DETR and D-FINE repos to gain a deeper understanding of various design details. I ended up implementing the SoTA real-time object detection model D-FINE in my custom pipeline.

Plan

  • Dataset, Augmentations and transforms:
    • Mosaic (with affine transforms)
    • Mixup and Cutout
    • Other augmentations with bounding boxes
    • Letterbox vs simple resize
  • Training:
    • Optimizer
    • Scheduler
    • EMA
    • Batch accumulation
    • AMP
    • Grad clipping
    • Logging
  • Metrics:
    • mAPs from TorchMetrics / cocotools
    • How to compute Precision, Recall, IoU?
  • Pick a suitable solution for your case
  • Experiments
  • Attention to data preprocessing
  • Where to start

Dataset

Dataset processing is the first thing you usually start working on. With object detection, you need to load your images and annotations. Annotations are often stored in COCO format as a JSON file, or in YOLO format, with a txt file for each image. Let’s take a look at the YOLO format. Each line is structured as: class_id, x_center, y_center, width, height, where the bbox values are normalized between 0 and 1.
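For illustration, here is a minimal sketch of parsing such a txt file and converting the normalized boxes to pixel-coordinate pascal_voc boxes (the same bbox format used in the Albumentations config below); the function name and return layout are assumptions for this example, not the exact code from my pipeline.

def load_yolo_txt(txt_path: str, img_w: int, img_h: int):
    # Each line: class_id x_center y_center width height, all normalized to [0, 1]
    boxes, labels = [], []
    with open(txt_path) as f:
        for line in f:
            class_id, xc, yc, w, h = line.split()
            xc, yc = float(xc) * img_w, float(yc) * img_h
            w, h = float(w) * img_w, float(h) * img_h
            # Convert to pascal_voc: absolute (x_min, y_min, x_max, y_max)
            boxes.append([xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2])
            labels.append(int(class_id))
    return boxes, labels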

When you have your images and txt files, you can write your dataset class, nothing tricky here. Load everything, transform (augmentations included) and return it during training. I prefer splitting the data by creating a CSV file for each split and then reading it in the Dataloader class, rather than physically moving files into train/val/test folders. This is an example of a customization that helped my use case.
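A minimal sketch of that idea, assuming a split CSV with image_path and label_path columns (the column names, and the load_yolo_txt helper from above, are illustrative assumptions rather than the exact code from my pipeline):

import cv2
import pandas as pd
from torch.utils.data import Dataset


class DetectionDataset(Dataset):
    def __init__(self, split_csv: str, transform=None):
        # Each split (train/val/test) is just a CSV listing image and annotation paths
        self.items = pd.read_csv(split_csv)
        self.transform = transform

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        row = self.items.iloc[idx]
        image = cv2.cvtColor(cv2.imread(row["image_path"]), cv2.COLOR_BGR2RGB)
        h, w = image.shape[:2]
        boxes, labels = load_yolo_txt(row["label_path"], w, h)
        if self.transform:  # an Albumentations Compose with bbox_params, as shown below
            out = self.transform(image=image, bboxes=boxes, class_labels=labels)
            image, boxes, labels = out["image"], out["bboxes"], out["class_labels"]
        return image, boxes, labels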

Augmentations

Firstly, when augmenting images for object detection, it’s crucial to apply the same transformations to the bounding boxes. To do that conveniently, I use the Albumentations library. For example:

    def _init_augs(self, cfg) -> None:
        if self.keep_ratio:
            resize = [
                A.LongestMaxSize(max_size=max(self.target_h, self.target_w)),
                A.PadIfNeeded(
                    min_height=self.target_h,
                    min_width=self.target_w,
                    border_mode=cv2.BORDER_CONSTANT,
                    fill=(114, 114, 114),
                ),
            ]

        else:
            resize = [A.Resize(self.target_h, self.target_w)]
        norm = [
            A.Normalize(mean=self.norm[0], std=self.norm[1]),
            ToTensorV2(),
        ]

        if self.mode == "train":
            augs = [
                A.RandomBrightnessContrast(p=cfg.train.augs.brightness),
                A.RandomGamma(p=cfg.train.augs.gamma),
                A.Blur(p=cfg.train.augs.blur),
                A.GaussNoise(p=cfg.train.augs.noise, std_range=(0.1, 0.2)),
                A.ToGray(p=cfg.train.augs.to_gray),
                A.Affine(
                    rotate=[90, 90],
                    p=cfg.train.augs.rotate_90,
                    fit_output=True,
                ),
                A.HorizontalFlip(p=cfg.train.augs.left_right_flip),
                A.VerticalFlip(p=cfg.train.augs.up_down_flip),
            ]

            self.transform = A.Compose(
                augs + resize + norm,
                bbox_params=A.BboxParams(format="pascal_voc", label_fields=["class_labels"]),
            )

        elif self.mode in ["val", "test", "bench"]:
            self.mosaic_prob = 0
            self.transform = A.Compose(
                resize + norm,
                bbox_params=A.BboxParams(format="pascal_voc", label_fields=["class_labels"]),
            )

Secondly, there are a lot of interesting, non-trivial augmentations:

  • Mosaic. The idea is simple: take several images (for example, 4) and stack them together in a grid (2×2). Then apply some affine transforms and feed it to the model (a minimal sketch follows this list).
  • MixUp. Originally used in image classification (it’s surprising that it works). The idea: take two images and overlay them with some percentage of transparency. In classification, it usually means that if one image is 20% transparent and the second is 80%, the model should predict 80% for class 1 and 20% for class 2. In object detection, we simply get more objects in one image.
  • Cutout. Cutout involves removing parts of the image (by replacing them with black pixels) to help the model learn more robust features.
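To make the mosaic idea concrete, here is a minimal sketch of the 2×2 stacking step (without the affine part); it assumes the four images have already been resized to the same square size, which is a simplification of what real pipelines do.

import numpy as np


def make_mosaic(images, boxes_list, labels_list, size: int = 640):
    # 2x2 mosaic: images are assumed pre-resized to (size, size), boxes in pascal_voc pixels
    canvas = np.full((2 * size, 2 * size, 3), 114, dtype=np.uint8)  # gray background
    offsets = [(0, 0), (size, 0), (0, size), (size, size)]  # (x_off, y_off) per quadrant
    out_boxes, out_labels = [], []
    for img, boxes, labels, (x_off, y_off) in zip(images, boxes_list, labels_list, offsets):
        canvas[y_off:y_off + size, x_off:x_off + size] = img
        for (x1, y1, x2, y2), lbl in zip(boxes, labels):
            out_boxes.append([x1 + x_off, y1 + y_off, x2 + x_off, y2 + y_off])
            out_labels.append(lbl)
    return canvas, out_boxes, out_labels

A real implementation would then apply a random affine crop/scale back to the target size and drop boxes that become too small.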

I often see mosaic applied with probability 1.0 for the first ~90% of epochs. Then it’s usually turned off, and lighter augmentations are used. The same idea applies to mixup, but I see it used a lot less (for the most popular detection framework, Ultralytics, it’s turned off by default; for another one, I see p=0.15). Cutout seems to be used less frequently.

You can read more about those augmentations in these two articles: 1, 2.

Results from just turning on mosaic are pretty good (the darker run without mosaic got mAP 0.89 vs 0.92 with it, tested on a real dataset).

Author’s metrics on a custom dataset, logged in Wandb

Letterbox or simple resize?

During training, you usually resize the input image to a square. Models often use 640×640 and benchmark on the COCO dataset. There are two main ways to get there:

  • Simple resize to a target size.
  • Letterbox: Resize the longest side to the target size (e.g., 640), preserving the aspect ratio, and pad the shorter side to reach the target dimensions.
Sample from VisDrone dataset with ground truth bounding boxes, preprocessed with a simple resize function
Sample from VisDrone dataset with ground truth bounding boxes, preprocessed with a letterbox

Both approaches have advantages and disadvantages. Let’s discuss them first, and then I will share the results of numerous experiments I ran comparing these approaches.

Simple resize:

  • Compute goes to the whole image, with no useless padding.
  • “Dynamic” aspect ratio may act as a form of regularization.
  • Inference preprocessing perfectly matches training preprocessing (augmentations excluded).
  • Kills real geometry. Resize distortion could affect the spatial relationships in the image. Although it might be a human bias to think that a fixed aspect ratio is important.

Letterbox:

  • Preserves real aspect ratio.
  • During inference, you can cut the padding and run on a non-square image if you don’t lose accuracy (some models can degrade); see the sketch after this list.
  • Can train on a bigger image size, then run inference with cut padding to get the same inference latency as with simple resize. For example, 640×640 vs 832×480. The second one preserves the aspect ratio, and objects appear roughly the same size.
  • Part of the compute is wasted on gray padding.
  • Objects get smaller.
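To make the padding-cutting trick concrete, here is a minimal OpenCV letterbox sketch that can pad either to a full square or only up to the nearest multiple of a stride (the stride of 32 and the gray fill value are assumptions; models differ in what they accept):

import cv2
import numpy as np


def letterbox(image, max_side: int = 640, pad_to_square: bool = True, stride: int = 32):
    # Resize the longest side to max_side, keep aspect ratio, then pad with gray
    h, w = image.shape[:2]
    scale = max_side / max(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    resized = cv2.resize(image, (new_w, new_h))
    if pad_to_square:
        target_h = target_w = max_side
    else:
        # "Cut padding": pad only to the nearest stride multiple, e.g. 640x384 instead of 640x640
        target_h = int(np.ceil(new_h / stride)) * stride
        target_w = int(np.ceil(new_w / stride)) * stride
    padded = np.full((target_h, target_w, 3), 114, dtype=image.dtype)
    padded[:new_h, :new_w] = resized  # top-left placement; keep scale to map boxes back
    return padded, scale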

How to test it and decide which one to use? 

Train from scratch with parameters:

  • Simple resize, 640×640
  • Keep aspect ratio, max side 640, and add padding (as a baseline)
  • Keep aspect ratio, larger image size (for example, max side 832), and add padding.

Then run inference with all 3 models. When the aspect ratio is preserved, cut the padding during inference. Compare latency and metrics.

Example of the same image from above with cut padding (640 × 384): 

Sample from VisDrone dataset

Here is what happens when you preserve ratio and inference by cutting gray padding:

params             |  F1 score  | latency (ms) |
-------------------+------------+--------------|
ratio kept, 832    |    0.633   |     33.5     |
no ratio, 640x640  |    0.617   |     33.4     |

As shown, training with preserved aspect ratio at a larger size (832) achieved a higher F1 score (0.633) compared to a simple 640×640 resize (F1 score of 0.617), while the latency remained similar. Note that some models may degrade if the padding is removed during inference, which kills the whole purpose of this trick and probably the letterbox too.

What does this mean: 

Training from scratch:

  • With the same image size, simple resize gets better accuracy than letterbox.
  • For letterbox: if you cut the padding during inference and your model doesn’t lose accuracy, you can train and run inference with a bigger image size to match the latency, and get slightly higher metrics (as in the example above).

Training with pre-trained weights initialized:

  • If you finetune, use the same preprocessing as the pre-trained model did; it should give you the best results if the datasets are not too different.

For D-FINE, I see lower metrics when cutting padding during inference; the model was also pre-trained with a simple resize. For YOLO, a letterbox is typically a good choice.

Training

Every ML engineer should know how to implement a training loop. Although PyTorch does much of the heavy lifting, you might still feel overwhelmed by the number of design choices available. Here are some key components to consider:

  • Optimizer – start with Adam/AdamW/SGD.
  • Scheduler – a fixed LR can be OK for Adam-family optimizers, but take a look at StepLR, CosineAnnealingLR or OneCycleLR.
  • EMA. This is a nice technique that makes training smoother and sometimes achieves higher metrics. After each batch, you update a secondary model (often called the EMA model) by computing an exponential moving average of the primary model’s weights (a minimal sketch is shown after the training loop below).
  • Batch accumulation is nice when your vRAM is very limited. Training a transformer-based object detection model means that in some cases, even with a mid-sized model, you can only fit 4 images into vRAM. By accumulating gradients over several batches before performing an optimizer step, you effectively simulate a larger batch size without exceeding your memory constraints. Another use case: when you have a lot of negatives (images without target objects) in your dataset and a small batch size, you can encounter unstable training. Batch accumulation can also help here.
  • AMP uses half precision automatically where applicable. It reduces vRAM usage and makes training faster (if you have a GPU that supports it). I see 40% less vRAM usage and at least a 15% training speed increase.
  • Grad clipping. Often, when you use AMP, training can become less stable. This can also happen with higher LRs. When your gradients are too big, training will fail. Gradient clipping will make sure gradients are never bigger than a certain value.
  • Logging. Try Hydra for configs and something like Weights and Biases or ClearML for experiment tracking. Also, log everything locally. Save your best weights and metrics, so after numerous experiments you can always find all the info on the model you need.
    def train(self) -> None:
        best_metric = 0
        cur_iter = 0
        ema_iter = 0
        one_epoch_time = None

        def optimizer_step(step_scheduler: bool):
            """
            Clip grads, optimizer step, scheduler step, zero grad, EMA model update
            """
            nonlocal ema_iter
            if self.amp_enabled:
                if self.clip_max_norm:
                    self.scaler.unscale_(self.optimizer)
                    torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.clip_max_norm)
                self.scaler.step(self.optimizer)
                self.scaler.update()

            else:
                if self.clip_max_norm:
                    torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.clip_max_norm)
                self.optimizer.step()

            if step_scheduler:
                self.scheduler.step()
            self.optimizer.zero_grad()

            if self.ema_model:
                ema_iter += 1
                self.ema_model.update(ema_iter, self.model)

        for epoch in range(1, self.epochs + 1):
            epoch_start_time = time.time()
            self.model.train()
            self.loss_fn.train()
            losses = []

            with tqdm(self.train_loader, unit="batch") as tepoch:
                for batch_idx, (inputs, targets, _) in enumerate(tepoch):
                    tepoch.set_description(f"Epoch {epoch}/{self.epochs}")
                    if inputs is None:
                        continue
                    cur_iter += 1

                    inputs = inputs.to(self.device)
                    targets = [
                        {
                            k: (v.to(self.device) if (v is not None and hasattr(v, "to")) else v)
                            for k, v in t.items()
                        }
                        for t in targets
                    ]

                    lr = self.optimizer.param_groups[0]["lr"]

                    if self.amp_enabled:
                        with autocast(self.device, cache_enabled=True):
                            output = self.model(inputs, targets=targets)
                        with autocast(self.device, enabled=False):
                            loss_dict = self.loss_fn(output, targets)
                        loss = sum(loss_dict.values()) / self.b_accum_steps
                        self.scaler.scale(loss).backward()

                    else:
                        output = self.model(inputs, targets=targets)
                        loss_dict = self.loss_fn(output, targets)
                        loss = sum(loss_dict.values()) / self.b_accum_steps
                        loss.backward()

                    if (batch_idx + 1) % self.b_accum_steps == 0:
                        optimizer_step(step_scheduler=True)

                    losses.append(loss.item())

                    tepoch.set_postfix(
                        loss=np.mean(losses) * self.b_accum_steps,
                        eta=calculate_remaining_time(
                            one_epoch_time,
                            epoch_start_time,
                            epoch,
                            self.epochs,
                            cur_iter,
                            len(self.train_loader),
                        ),
                        vram=f"{get_vram_usage()}%",
                    )

            # Final update for any leftover gradients from an incomplete accumulation step
            if (batch_idx + 1) % self.b_accum_steps != 0:
                optimizer_step(step_scheduler=False)

            wandb.log({"lr": lr, "epoch": epoch})

            metrics = self.evaluate(
                val_loader=self.val_loader,
                conf_thresh=self.conf_thresh,
                iou_thresh=self.iou_thresh,
                path_to_save=None,
            )

            best_metric = self.save_model(metrics, best_metric)
            save_metrics(
                {}, metrics, np.mean(losses) * self.b_accum_steps, epoch, path_to_save=None
            )

            if (
                epoch >= self.epochs - self.no_mosaic_epochs
                and self.train_loader.dataset.mosaic_prob
            ):
                self.train_loader.dataset.close_mosaic()

            if epoch == self.ignore_background_epochs:
                self.train_loader.dataset.ignore_background = False
                logger.info("Including background images")

            one_epoch_time = time.time() - epoch_start_time
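For reference, here is a minimal sketch of the EMA update called above via self.ema_model.update(ema_iter, self.model); the decay ramp-up is a common pattern, not necessarily the exact schedule from my pipeline.

import copy
import math

import torch


class EMAModel:
    def __init__(self, model: torch.nn.Module, decay: float = 0.9999, warmup_steps: int = 2000):
        self.ema = copy.deepcopy(model).eval()  # secondary model used for eval/export
        for p in self.ema.parameters():
            p.requires_grad_(False)
        self.decay = decay
        self.warmup_steps = warmup_steps

    @torch.no_grad()
    def update(self, step: int, model: torch.nn.Module):
        # Smaller effective decay early on, so the EMA weights catch up quickly
        d = self.decay * (1 - math.exp(-step / self.warmup_steps))
        model_state = model.state_dict()
        for k, v in self.ema.state_dict().items():
            if v.dtype.is_floating_point:
                v.mul_(d).add_(model_state[k].detach(), alpha=1 - d)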

Metrics

For object detection, everyone uses mAP, and how it is measured is already standardized. Use pycocotools, faster-coco-eval or TorchMetrics for mAP. But mAP means we check how good the model is overall, across all confidence levels. mAP0.5 means the IoU threshold is 0.5 (everything lower is considered a wrong prediction). I personally don’t fully like this metric, as in production we always use a single confidence threshold. So why not set the threshold and then compute metrics? That’s why I also always calculate confusion matrices, and based on that, Precision, Recall, F1-score, and IoU.
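As a quick example of the mAP route, here is a minimal TorchMetrics sketch (boxes are in pixel xyxy format; the numbers are made up, and the API assumes a reasonably recent torchmetrics version):

import torch
from torchmetrics.detection import MeanAveragePrecision

metric = MeanAveragePrecision(box_format="xyxy", iou_type="bbox")

preds = [{
    "boxes": torch.tensor([[10.0, 20.0, 100.0, 120.0]]),
    "scores": torch.tensor([0.87]),
    "labels": torch.tensor([1]),
}]
targets = [{
    "boxes": torch.tensor([[12.0, 18.0, 98.0, 125.0]]),
    "labels": torch.tensor([1]),
}]

metric.update(preds, targets)
result = metric.compute()
print(result["map"], result["map_50"])  # COCO-style mAP@0.5:0.95 and mAP@0.5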

But the matching logic can be tricky. Here is what I use (a minimal sketch follows this list):

  • 1 GT (ground truth) object = 1 predicted object, and it’s a TP if IoU > threshold. If there is no prediction for a GT object – it’s a FN. If there is no GT for a prediction – it’s a FP.
  • 1 GT should be matched by a prediction only 1 time. If there are 2 predictions for 1 GT, then I calculate 1 TP and 1 FP.
  • Class ids should also match. If the model predicts class_0 but GT is class_1, it means FP += 1 and FN += 1.
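Here is a minimal sketch of that matching logic for a single image (greedy, class-aware matching by IoU); it illustrates the rules above and is not the exact code from my pipeline. In practice you would also filter predictions by the confidence threshold first.

def box_iou(a, b):
    # IoU of two pascal_voc boxes [x1, y1, x2, y2]
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def match_image(gt_boxes, gt_labels, pred_boxes, pred_labels, iou_thresh=0.5):
    tp = fp = 0
    matched_gt = set()
    for pb, pl in zip(pred_boxes, pred_labels):
        best_iou, best_idx = 0.0, -1
        for i, (gb, gl) in enumerate(zip(gt_boxes, gt_labels)):
            if i in matched_gt or gl != pl:  # class ids must match; each GT is matched only once
                continue
            iou = box_iou(pb, gb)
            if iou > best_iou:
                best_iou, best_idx = iou, i
        if best_iou > iou_thresh:
            tp += 1  # matched a free GT of the same class
            matched_gt.add(best_idx)
        else:
            fp += 1  # duplicate prediction, wrong class, or no GT with sufficient IoU
    fn = len(gt_boxes) - len(matched_gt)  # GT objects that were never matched
    return tp, fp, fn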

During training, I select the best model based on the metrics that are relevant to the task. I typically consider the average of mAP50 and F1-score.

Model and loss

I haven’t discussed model architecture and loss function here. They usually go together, and you can choose any model you like and integrate it into your pipeline with everything from above. I did that with DAMO-YOLO and D-FINE, and the results were great.

Pick a suitable solution for your case

Many people use Ultralytics; however, it is AGPL-3.0 licensed, and you can’t use it in commercial projects unless your code is open source. So people often look into Apache 2.0 and MIT licensed models. Check out D-FINE, RT-DETRv2 or some YOLO models like YOLOv9.

What if you want to customize something in the pipeline? When you build everything from scratch, you should have full control. Otherwise, try choosing a project with a smaller codebase, as a large one can make it difficult to isolate and modify individual components.

If you don’t need anything custom and your usage is allowed by the Ultralytics license, it’s a great repo to use, as it supports multiple tasks (classification, detection, instance segmentation, keypoints, oriented bounding boxes), and the models are efficient and achieve good scores. Reiterating once more: you probably don’t need a custom training pipeline if you are not doing very specific things.

Experiments

Let me share some results I got with my custom training pipeline and the D-FINE model, and compare it to the Ultralytics YOLO11 model on the VisDrone-DET2019 dataset.

Trained from scratch:

model               | mAP 0.50 | F1-score | Latency (ms) |
--------------------+----------+----------+--------------|
YOLO11m TRT         |  0.417   |  0.568   |     15.6     |
YOLO11m TRT dynamic |    -     |  0.568   |     13.3     |
YOLO11m OV          |    -     |  0.568   |    122.4     |
D-FINEm TRT         |  0.457   |  0.622   |     16.6     |
D-FINEm OV          |  0.457   |  0.622   |    115.3     |

From COCO pre-trained:

model     | mAP 0.50 | F1-score |
----------+----------+----------|
YOLO11m   |  0.456   |  0.600   |
D-FINEm   |  0.506   |  0.649   |

Latency was measured on an RTX 3060 with TensorRT (TRT), static image size 640×640, including the time for cv2.imread. OpenVINO (OV) latency was measured on an i5 14000f (no iGPU). “Dynamic” means that gray padding is cut during inference for faster processing; this worked with the YOLO11 TensorRT version. More details about cutting gray padding are in the “Letterbox or simple resize” section above.

One disappointing result is the latency on an Intel N100 CPU with iGPU (a $150 mini PC):

model            | Latency (ms) |
------------------+-------------|
YOLO11m          |       188    |
D-FINEm          |       272    |
D-FINEs          |       11     |
Author’s screenshot of iGPU usage from n100 machine during model inference

Here, traditional convolutional neural networks are noticeably faster, maybe because of optimizations in OpenVINO for GPUs.

Overall, I conducted over 30 experiments with different datasets (including real-world ones), models, and parameters, and I can say that D-FINE gets better metrics. And it makes sense, as it also scores higher than all YOLO models on COCO.

D-FINE paper comparison to other object detection models

VisDrone experiments: 

Author’s metrics logged in WandB, D-FINE model
Author’s metrics logged in WandB, YOLO11 model

Example of D-FINE model predictions (green – GT, blue – pred): 

Sample from VisDrone dataset

Final results

Knowing all the details, let’s see a final comparison with the best settings for both models, on an i12400F and an RTX 3060, with the VisDrone dataset:

model                              |   F1-score    |   Latency (ms)    |
-----------------------------------+---------------+-------------------|
YOLO11m TRT dynamic                |      0.600    |        13.3       |
YOLO11m OV                         |      0.600    |       122.4       |
D-FINEs TRT                        |      0.629    |        12.3       |
D-FINEs OV                         |      0.629    |        57.4       |

As shown above, I was able to use a smaller D-FINE model and achieve both faster inference and higher accuracy than YOLO11. Beating Ultralytics, the most widely used real-time object detection framework, in both speed and accuracy is quite an accomplishment, isn’t it? The same pattern is observed across several other real-world datasets.

I also tried out YOLOv12, which came out while I was writing this article. It performed similarly to YOLO11 and even achieved slightly lower metrics (mAP 0.452 vs 0.456). It appears that YOLO models have been hitting a wall for the last couple of years. D-FINE was a great update for object detection models.

Finally, let’s look at the visual difference between YOLO11m and D-FINEs. YOLO11m, conf 0.25, NMS IoU 0.5, latency 13.3 ms:

Sample from VisDrone dataset

D-FINEs, conf 0.5, no NMS, latency 12.3 ms: 

Sample from VisDrone dataset

Both Precision and Recall are higher with the D-FINE model. And it’s also faster. Here is also the “m” version of D-FINE: 

Sample from VisDrone dataset

Isn’t it crazy that even that one car on the left was detected?

Attention to data preprocessing

This part goes a little outside the scope of the article, but I want to at least mention it quickly, as some parts can be automated and used in the pipeline. What I definitely see as a computer vision engineer is that when engineers don’t spend time working with the data, they don’t get good models. You can have all the SoTA models and everything done right, but garbage in – garbage out. So, I always pay a ton of attention to how to approach the task and how to gather, filter, validate, and annotate the data. Don’t assume the annotation team will do everything right. Get your hands dirty and manually check some portion of the dataset to be sure that the annotations are good and the collected images are representative.

Several quick ideas to look into:

  • Remove duplicates and near duplicates from val/test sets. The model should not be validated on one sample twice, and you definitely don’t want a data leak from having two identical images, one in the training and one in the validation set (see the sketch after this list).
  • Check how small your objects can be. Everything not visible to your eye should not be annotated. Also, remember that augmentations will make objects appear even smaller (for example, mosaic or zoom out). Configure these augmentations accordingly so you won’t end up with unusably small objects on the image.
  • When you already have a model for a certain task and need more data – try using your model to pre-annotate new images. Check cases where the model fails and gather more similar cases.
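As an example of the first point, here is a minimal sketch of flagging near-duplicates with perceptual hashing; the imagehash library, the glob pattern and the distance threshold of 5 are assumptions to illustrate the idea, tune them for your data.

from pathlib import Path

import imagehash
from PIL import Image


def find_near_duplicates(image_dir: str, max_distance: int = 5):
    # Flag image pairs whose perceptual hashes differ by at most max_distance bits
    hashes = {p: imagehash.phash(Image.open(p)) for p in Path(image_dir).glob("*.jpg")}
    paths = list(hashes)
    duplicates = []
    for i, a in enumerate(paths):
        for b in paths[i + 1:]:
            if hashes[a] - hashes[b] <= max_distance:  # Hamming distance between hashes
                duplicates.append((a, b))
    return duplicates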

Where to start

I worked a lot on this pipeline, and I am ready to share it with everyone who wants to try it out. It uses the SoTA D-FINE model under the hood and adds some features that were absent in the original repo (mosaic augmentations, batch accumulation, scheduler, more metrics, visualization of preprocessed images and eval predictions, exporting and inference code, better logging, unified and simplified configuration file).

Here is the link to my repo. Here is the original D-FINE repo, where I also contribute. If you need any help, please contact me on LinkedIn. Thank you for your time!

Citations and acknowledgments

VisDrone

@article{zhu2021detection,
  title={Detection and tracking meet drones challenge},
  author={Zhu, Pengfei and Wen, Longyin and Du, Dawei and Bian, Xiao and Fan, Heng and Hu, Qinghua and Ling, Haibin},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  volume={44},
  number={11},
  pages={7380--7399},
  year={2021},
  publisher={IEEE}
}

D-FINE

@misc{peng2024dfine,
      title={D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement},
      author={Yansong Peng and Hebei Li and Peixi Wu and Yueyi Zhang and Xiaoyan Sun and Feng Wu},
      year={2024},
      eprint={2410.13842},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Read More »

Data center vacancies hit historic lows despite record construction

The growth comes despite considerable headwinds facing data center operators, including higher construction costs, equipment pricing, and persistent shortages in critical materials like generators, chillers and transformers, CRBE stated. There is a considerable pricing disparity between newly built data centers and legacy facilities, reflecting the premium placed on modern, energy-efficient infrastructure. Specifically, liquid/immersion cooling is preferred over air cooling for modern server requirements, CRBE found. On the networking side of things, major telecom companies made substantial investments in fiber in the second half of 2024, reflecting the growing need for more network infrastructure and capacity to accommodate growing demand from AI and data providers. There have also been many notable deals recently: AT&T’s multi-year, $1 billion agreement with Corning to provide next-generation fiber, cable and connectivity solutions; Comcast’s proposed acquisition of Nitel; Verizon’s agreement to acquire Frontier, the largest pure-play fiber internet provider in the U.S.; and T-Mobile’s entry into the fiber internet market via partnerships with fiber-optic providers. In the quarter, Meta announced plans for a 25,000-mile undersea fiber cable that would connect the U.S. East and West coasts with global markets across the Atlantic, Indian and Pacific oceans. The project would mark the first privately owned and operated global fiber cable network. Data Center Outlook

Read More »

AI driving a 165% rise in data center power demand by 2030

Goldman Sachs Research estimates the power usage by the global data center market to be around 55 gigawatts, which breaks down as 54% for cloud computing workloads, 32% for traditional line of business workloads and 14% for AI. By 2027, that number jumps to 84 GW, with AI growing to 27% of the overall market, cloud dropping to 50%, and traditional workloads falling to 23%, Schneider stated. Goldman Sachs Research estimates that there will be around 122 GW of data center capacity online by the end of 2030, and the density of power use in data centers is likely to grow as well, from 162 kilowatts per square foot to 176 KW per square foot in 2027, thanks to AI, Schneider stated.  “Data center supply — specifically the rate at which incremental supply is built — has been constrained over the past 18 months,” Schneider wrote. These constraints have arisen from the inability of utilities to expand transmission capacity because of permitting delays, supply chain bottlenecks, and infrastructure that is both costly and time-intensive to upgrade. The result is that due to power demand from data centers, there will need to be additional utility investment, to the tune of about $720 billion of grid spending through 2030. And then they are subject to the pace of public utilities, which move much slower than hyperscalers. “These transmission projects can take several years to permit, and then several more to build, creating another potential bottleneck for data center growth if the regions are not proactive about this given the lead time,” Schneider wrote.

Read More »

Top data storage certifications to sharpen your skills

Organization: Hitachi Vantara Skills acquired: Knowledge of data center infrastructure management tasks automation using Hitachi Ops Center Automator. Price: $100 Exam duration: 60 minutes How to prepare: Knowledge of all storage-related operations from an end-user perspective, including planning, allocating, and managing storage and architecting storage layouts. Read more about Hitachi Vantara’s training and certification options here. Certifications that bundle cloud, networking and storage skills AWS Certified Solutions Architect – Professional The AWS Certified Solutions Architect – Professional certification from leading cloud provider Amazon Web Services (AWS) helps individuals showcase advanced knowledge and skills in optimizing security, cost, and performance, and automating manual processes. The certification is a means for organizations to identify and develop talent with these skills for implementing cloud initiatives, according to AWS. The ideal candidate has the ability to evaluate cloud application requirements, make architectural recommendations for deployment of applications on AWS, and provide expert guidance on architectural design across multiple applications and projects within a complex organization, AWS says. Certified individuals report increased credibility with technical colleagues and customers as a result of earning this certification, it says. Organization: Amazon Web Services Skills acquired: Helps individuals showcase skills in optimizing security, cost, and performance, and automating manual processes Price: $300 Exam duration: 180 minutes How to prepare: The recommended experience prior to taking the exam is two or more years of experience in using AWS services to design and implement cloud solutions Cisco Certified Internetwork Expert (CCIE) Data Center The Cisco CCIE Data Center certification enables individuals to demonstrate advanced skills to plan, design, deploy, operate, and optimize complex data center networks. They will gain comprehensive expertise in orchestrating data center infrastructure, focusing on seamless integration of networking, compute, and storage components. Other skills gained include building scalable, low-latency, high-performance networks that are optimized to support artificial intelligence (AI)

Read More »

Netskope expands SASE footprint, bolsters AI and automation

Netskope is expanding its global presence by adding multiple regions to its NewEdge carrier-grade infrastructure, which now includes more than 75 locations to ensure processing remains close to end users. The secure access service edge (SASE) provider also enhanced its digital experience monitoring (DEM) capabilities with AI-powered root-cause analysis and automated network diagnostics. “We are announcing continued expansion of our infrastructure and our continued focus on resilience. I’m a believer that nothing gets adopted if end users don’t have a great experience,” says Netskope CEO Sanjay Beri. “We monitor traffic, we have multiple carriers in every one of our more than 75 regions, and when traffic goes from us to that destination, the path is direct.” Netskope added regions including data centers in Calgary, Helsinki, Lisbon, and Prague as well as expanded existing NewEdge regions including data centers in Bogota, Jeddah, Osaka, and New York City. Each data center offers customers a range of SASE capabilities including cloud firewalls, secure web gateway (SWG), inline cloud access security broker (CASB), zero trust network access (ZTNA), SD-WAN, secure service edge (SSE), and threat protection. The additional locations enable Netskope to provide coverage for more than 220 countries and territories with 200 NewEdge Localization Zones, which deliver a local direct-to-net digital experience for users, the company says.

Read More »

Microsoft will invest $80B in AI data centers in fiscal 2025

And Microsoft isn’t the only one that is ramping up its investments into AI-enabled data centers. Rival cloud service providers are all investing in either upgrading or opening new data centers to capture a larger chunk of business from developers and users of large language models (LLMs).  In a report published in October 2024, Bloomberg Intelligence estimated that demand for generative AI would push Microsoft, AWS, Google, Oracle, Meta, and Apple would between them devote $200 billion to capex in 2025, up from $110 billion in 2023. Microsoft is one of the biggest spenders, followed closely by Google and AWS, Bloomberg Intelligence said. Its estimate of Microsoft’s capital spending on AI, at $62.4 billion for calendar 2025, is lower than Smith’s claim that the company will invest $80 billion in the fiscal year to June 30, 2025. Both figures, though, are way higher than Microsoft’s 2020 capital expenditure of “just” $17.6 billion. The majority of the increased spending is tied to cloud services and the expansion of AI infrastructure needed to provide compute capacity for OpenAI workloads. Separately, last October Amazon CEO Andy Jassy said his company planned total capex spend of $75 billion in 2024 and even more in 2025, with much of it going to AWS, its cloud computing division.

Read More »

John Deere unveils more autonomous farm machines to address skill labor shortage

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More Self-driving tractors might be the path to self-driving cars. John Deere has revealed a new line of autonomous machines and tech across agriculture, construction and commercial landscaping. The Moline, Illinois-based John Deere has been in business for 187 years, yet it’s been a regular as a non-tech company showing off technology at the big tech trade show in Las Vegas and is back at CES 2025 with more autonomous tractors and other vehicles. This is not something we usually cover, but John Deere has a lot of data that is interesting in the big picture of tech. The message from the company is that there aren’t enough skilled farm laborers to do the work that its customers need. It’s been a challenge for most of the last two decades, said Jahmy Hindman, CTO at John Deere, in a briefing. Much of the tech will come this fall and after that. He noted that the average farmer in the U.S. is over 58 and works 12 to 18 hours a day to grow food for us. And he said the American Farm Bureau Federation estimates there are roughly 2.4 million farm jobs that need to be filled annually; and the agricultural work force continues to shrink. (This is my hint to the anti-immigration crowd). John Deere’s autonomous 9RX Tractor. Farmers can oversee it using an app. While each of these industries experiences their own set of challenges, a commonality across all is skilled labor availability. In construction, about 80% percent of contractors struggle to find skilled labor. And in commercial landscaping, 86% of landscaping business owners can’t find labor to fill open positions, he said. “They have to figure out how to do

Read More »

2025 playbook for enterprise AI success, from agents to evals

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More 2025 is poised to be a pivotal year for enterprise AI. The past year has seen rapid innovation, and this year will see the same. This has made it more critical than ever to revisit your AI strategy to stay competitive and create value for your customers. From scaling AI agents to optimizing costs, here are the five critical areas enterprises should prioritize for their AI strategy this year. 1. Agents: the next generation of automation AI agents are no longer theoretical. In 2025, they’re indispensable tools for enterprises looking to streamline operations and enhance customer interactions. Unlike traditional software, agents powered by large language models (LLMs) can make nuanced decisions, navigate complex multi-step tasks, and integrate seamlessly with tools and APIs. At the start of 2024, agents were not ready for prime time, making frustrating mistakes like hallucinating URLs. They started getting better as frontier large language models themselves improved. “Let me put it this way,” said Sam Witteveen, cofounder of Red Dragon, a company that develops agents for companies, and that recently reviewed the 48 agents it built last year. “Interestingly, the ones that we built at the start of the year, a lot of those worked way better at the end of the year just because the models got better.” Witteveen shared this in the video podcast we filmed to discuss these five big trends in detail. Models are getting better and hallucinating less, and they’re also being trained to do agentic tasks. Another feature that the model providers are researching is a way to use the LLM as a judge, and as models get cheaper (something we’ll cover below), companies can use three or more models to

Read More »

OpenAI’s red teaming innovations define new essentials for security leaders in the AI era

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More OpenAI has taken a more aggressive approach to red teaming than its AI competitors, demonstrating its security teams’ advanced capabilities in two areas: multi-step reinforcement and external red teaming. OpenAI recently released two papers that set a new competitive standard for improving the quality, reliability and safety of AI models in these two techniques and more. The first paper, “OpenAI’s Approach to External Red Teaming for AI Models and Systems,” reports that specialized teams outside the company have proven effective in uncovering vulnerabilities that might otherwise have made it into a released model because in-house testing techniques may have missed them. In the second paper, “Diverse and Effective Red Teaming with Auto-Generated Rewards and Multi-Step Reinforcement Learning,” OpenAI introduces an automated framework that relies on iterative reinforcement learning to generate a broad spectrum of novel, wide-ranging attacks. Going all-in on red teaming pays practical, competitive dividends It’s encouraging to see competitive intensity in red teaming growing among AI companies. When Anthropic released its AI red team guidelines in June of last year, it joined AI providers including Google, Microsoft, Nvidia, OpenAI, and even the U.S.’s National Institute of Standards and Technology (NIST), which all had released red teaming frameworks. Investing heavily in red teaming yields tangible benefits for security leaders in any organization. OpenAI’s paper on external red teaming provides a detailed analysis of how the company strives to create specialized external teams that include cybersecurity and subject matter experts. The goal is to see if knowledgeable external teams can defeat models’ security perimeters and find gaps in their security, biases and controls that prompt-based testing couldn’t find. What makes OpenAI’s recent papers noteworthy is how well they define using human-in-the-middle

Read More »