YOLOv4 - the most accurate real-time neural network on the Microsoft COCO dataset

Darknet YOLOv4 is faster and more accurate than Google TensorFlow EfficientDet and Facebook PyTorch / Detectron RetinaNet / MaskRCNN.

The same article on medium : medium
Code : github.com/AlexeyAB/darknet
Article : arxiv.org/abs/2004.10934

We will show some nuances of comparing and using neural networks to detect objects.

Our goal was to develop an object detection algorithm for use in real products, not just to move science forward. The accuracy of the YOLOv4 neural network (608x608) is 43.5% AP / 65.7% AP50 on Microsoft COCO test-dev.

62 FPS - YOLOv4 (608x608 batch = 1) on Tesla V100 - by using Darknet-framework
400 FPS - YOLOv4 (416x416 batch = 4) on RTX 2080 Ti - by using TensorRT + tkDNN
32 FPS - YOLOv4 (416x416 batch = 1) on Jetson AGX Xavier - by using TensorRT + tkDNN

First, some useful links.

You can read a detailed description of the features used in YOLOv4 in this article: medium.com/@jonathan_hui/yolov4-c9901eaa8e61
YOLOv4 network structure in the Netron viewer: lutzroeder.imtqy.com/netron/?url=https%3A%2F%2Fraw.githubusercontent.com%2FAlexeyAB%2Fdarknet%2Fmaster%2Fcfg%2Fyolov4.cfg
YOLOv4 on a GPU in a Google Colab Jupyter notebook (use «Open in Playground»): colab.research.google.com/drive/12QusaaRj_lUwCGDvQNfICpa7kA7_a2dE Video: www.youtube.com/watch?v=mKAEGSxwOAY

Our YOLOv4 neural network and our own Darknet DL framework (C / C ++ / CUDA) outperform, in both speed (FPS) and accuracy (AP50:95 and AP50) on the Microsoft COCO dataset, other DL frameworks and neural networks: Google TensorFlow EfficientDet, Facebook Detectron RetinaNet / MaskRCNN, PyTorch Yolov3-ASFF, and many others. YOLOv4 achieves an accuracy of 43.5% AP / 65.7% AP50 on the Microsoft COCO test set at a speed of 62 FPS on a TitanV or 34 FPS on an RTX 2070. Unlike other modern detectors, YOLOv4 can be trained by anyone who has an nVidia gaming graphics card with 8-16 GB VRAM. Training is no longer reserved for large companies that can run hundreds of GPUs / TPUs with large mini-batch sizes to reach higher accuracy. We are returning control of artificial intelligence to ordinary users, because YOLOv4 does not require a large mini-batch: a size of 2 - 8 is enough.

YOLOv4 is optimal for real-time use, because the network lies on the Pareto-optimal curve of the AP (accuracy) / FPS (speed) graph.
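To make the Pareto statement concrete, here is a minimal sketch of how a Pareto front over (FPS, AP) pairs can be computed. The function name `pareto_front` and the four detectors with their numbers are hypothetical and chosen only for illustration. They are not measurements from the article.

```python
def pareto_front(models):
    """Return the names of models that no other model dominates.

    `models` is a list of (name, fps, ap) tuples. A model is dominated when
    another model is at least as fast and at least as accurate, and strictly
    better in at least one of the two.
    """
    front = []
    for name, fps, ap in models:
        dominated = any(
            o_fps >= fps and o_ap >= ap and (o_fps > fps or o_ap > ap)
            for _, o_fps, o_ap in models
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical detectors measured on the same GPU (made-up numbers).
candidates = [
    ("A", 30.0, 45.0),
    ("B", 60.0, 43.0),
    ("C", 50.0, 40.0),   # slower and less accurate than B, so dominated
    ("D", 100.0, 35.0),
]
print(pareto_front(candidates))  # ['A', 'B', 'D']
```

A detector on the front cannot be replaced by another one that is both faster and more accurate, which is the sense in which such a network is optimal for a given speed budget.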

Graphs of accuracy (AP) and speed (FPS) for many object detection neural networks, measured on TitanV / TeslaV100, TitanXP / TeslaP100 and TitanX / TeslaM40 GPUs, for the two main accuracy indicators: AP50:95 and AP50
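The two indicators differ in how strictly a predicted box must overlap the ground-truth box. AP50 counts a detection as correct at an intersection over union (IoU) of at least 0.5, while AP50:95 averages AP over IoU thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below shows only this matching criterion, not the full AP computation. The helper names and the two example boxes are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# IoU thresholds averaged by AP50:95: 0.50, 0.55, ..., 0.95
AP50_95_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred, truth):
    """List the IoU thresholds at which `pred` still matches `truth`."""
    value = iou(pred, truth)
    return [t for t in AP50_95_THRESHOLDS if value >= t]


truth = (0, 0, 100, 100)
pred = (10, 0, 110, 100)   # same size, shifted 10 pixels to the right
print(iou(pred, truth))                # about 0.818
print(thresholds_passed(pred, truth))  # matches at 0.5 through 0.8 only
```

The shifted box counts as correct for AP50 but fails the three strictest thresholds, which is why AP50:95 rewards precise localization more than AP50 does.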

For a fair comparison, we take the data from the published articles and compare only results obtained on GPUs of the same architecture.

Most practical tasks place minimum requirements on a detector: a minimum acceptable accuracy and a minimum acceptable speed. For real-time systems, the minimum acceptable speed is usually 30 FPS (frames per second) or higher.

As can be seen from the graphs, in real-time systems with 30 FPS or more:

YOLOv4-608: RTX 2070, $450 (34 FPS), 43.5% AP / 65.7% AP50
EfficientDet-D2: TitanV, $2250 (42 FPS), 43.0% AP / 62.3% AP50
EfficientDet-D0: RTX 2070, $450 (34 FPS), 33.8% AP / 52.2% AP50

That is, YOLOv4 needs hardware that is 5 times cheaper and is more accurate than EfficientDet-D2 (Google-TensorFlow). You can use EfficientDet-D0 (Google-TensorFlow) instead, in which case the hardware cost is the same but the accuracy is about 10% AP lower.
In addition, some industrial systems have limits on heat dissipation or must use passive cooling. In that case you cannot use a TitanV even if you can afford it.

When using YOLOv4 (416x416) on an RTX 2080 Ti GPU with TensorRT + tkDNN, we achieve a 2x speedup, and with batch = 4 a 3x-4x speedup. To keep the comparison honest, we do not present these results in the article on arxiv.org:

The YOLOv4 neural network (416x416), FP16 (Tensor Cores), batch = 1 reaches 32 FPS on the nVidia Jetson AGX Xavier using the TensorRT + tkDNN libraries: github.com/ceccocats/tkDNN
The OpenCV-dnn library compiled with CUDA gives a slightly lower speed: docs.opencv.org/master/da/d9d/tutorial_dnn_yolo.html

Sometimes the speed (FPS) of a neural network is reported in an article with a high batch size, or measured with specialized software (TensorRT) that optimizes the network and yields a higher FPS value. Comparing some networks with TRT against other networks without TRT is not fair. Using a high batch size increases FPS, but it also increases latency compared to batch = 1. If a network shows 40 FPS with batch = 1 and 60 FPS with batch = 32, the delay is 25 ms for batch = 1 and ~ 500 ms for batch = 32, because only ~ 2 batches (32 images each) are processed per second. This is why batch = 32 is not acceptable in many industrial systems. Therefore, the graphs compare results only with batch = 1 and without TensorRT.
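The latency arithmetic in the paragraph above can be checked with a few lines of Python. This is a minimal sketch under the assumption that latency is the time to finish one whole batch. The function name `batch_latency_ms` is made up for this example.

```python
def batch_latency_ms(fps, batch_size):
    """Time to finish one batch, in milliseconds.

    `fps` is the throughput in images per second, so the network finishes
    fps / batch_size batches per second.
    """
    batches_per_second = fps / batch_size
    return 1000.0 / batches_per_second


print(batch_latency_ms(40, 1))    # 25.0 ms
print(batch_latency_ms(60, 32))   # about 533 ms, i.e. roughly half a second
```

Throughput rises from 40 to 60 images per second, but every image in the batch waits for the whole batch, so the delay grows by a factor of about 20.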

Any process can be controlled either by people or by computers. When a computer system responds with a long delay because of its low speed and makes too many mistakes, it cannot be entrusted with full control. In that case a person controls the process and the computer system only gives hints: it is a recommendation system, where the person does the work and the system only points out where mistakes were made. When the system works quickly and with high accuracy, it can control the process independently, and a person only supervises it. Therefore, both accuracy and speed are always important. If 120 FPS for YOLOv4 416x416 seems like too much for your task and you would rather take a slower but more accurate algorithm, keep in mind that in real tasks you will most likely use something weaker than a 250 Watt Tesla V100, for example an RTX 2060 / Jetson-Xavier at 30-80 Watt. In that case you will get 30 FPS with YOLOv4 416x416, while other neural networks will run at 1-15 FPS or will not start at all.

Features of training various neural networks

EfficientDet has to be trained with a mini-batch size of 128 on several Tesla V100 32GB GPUs, while YOLOv4 was trained on just one Tesla V100 32GB GPU with mini-batch = 8 = batch / subdivisions, and it can be trained on a regular gaming graphics card with 8-16 GB of GPU VRAM.
The next nuance is the difficulty of training a neural network to detect your own objects. No matter how long you train other networks on a single 1080 Ti GPU, you will not reach the stated accuracy shown in the graph above. Most networks (EfficientDet, ASFF, ...) need to be trained on 4 - 128 GPUs (with a large mini-batch size using syncBN), and they have to be trained anew for each network resolution. Unless both conditions are met, it is impossible to achieve the AP or AP50 accuracy they declare.
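The relation mini-batch = batch / subdivisions mentioned above refers to two keys in the [net] section of a Darknet cfg file. The sketch below reads them from cfg text. The helper name and the example values (batch=64, subdivisions=8) are assumptions chosen to reproduce the mini-batch of 8, and the parser handles only the simple `key=value` layout shown here.

```python
def mini_batch_size(cfg_text):
    """Compute batch / subdivisions from the [net] section of a Darknet cfg."""
    values = {}
    in_net = False
    for raw in cfg_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line:
            continue
        if line.startswith("["):
            in_net = line == "[net]"          # only read the [net] section
            continue
        if in_net and "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            values[key] = value
    batch = int(values["batch"])
    # Assumption: a missing subdivisions key is treated as 1.
    subdivisions = int(values.get("subdivisions", 1))
    return batch // subdivisions


# Illustrative cfg fragment, not copied from a specific file.
EXAMPLE_CFG = """
[net]
batch=64
subdivisions=8
width=512
height=512

[convolutional]
filters=32
"""
print(mini_batch_size(EXAMPLE_CFG))  # 8
```

Raising subdivisions lowers the number of images held in GPU memory at once, which is what lets the same cfg fit on a card with less VRAM.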

You can see how object detection accuracy depends on the mini-batch size in other detectors: using 128 video cards instead of 8 makes training 10 times faster and the final accuracy 1.5 AP higher - MegDet: A Large Mini-Batch Object Detector arxiv.org/abs/1711.07240

Yolo ASFF: arxiv.org/abs/1911.09516

Following [43], we introduce a bag of tricks in the training process, such as the mixup algorithm [12], the cosine [26] learning rate schedule, and the synchronized batch normalization technique [30].

EfficientDet: arxiv.org/abs/1911.09070

Synchronized batch normalization is added after every convolution with batch norm decay 0.99 and epsilon 1e-3.

Each model is trained 300 epochs with batch total size 128 on 32 TPUv3 cores.

cloud.google.com/tpu/docs/types-zones#europe

v3-32 TPU type (v3) – 32 TPU v3 cores – 512 GiB Total TPU memory

You need 512 GB of TPU / GPU memory to train the EfficientDet model with synchronized batch normalization at batch = 128, while YOLOv4 was trained with mini-batch = 8 and only 32 GB of GPU memory. Despite this, YOLOv4 is faster and more accurate than other public networks, although it was trained only once, at a resolution of 512x512, on 1 GPU (Tesla V100 32GB / 16GB). At the same time, using a smaller mini-batch size and less GPU VRAM does not lead to such a dramatic loss of accuracy as in other neural networks:

Source: arxiv.org/abs/2004.10934

So you can train artificial intelligence locally on your PC instead of uploading your dataset to the cloud. This protects your personal data and makes artificial intelligence training available to everyone.

It is enough to train our network once with a network resolution 512x512, and then it can be used with different network resolutions in the range: [416x416 - 512x512 - 608x608].
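In Darknet the network resolution is set by the width and height keys in the [net] section of the cfg file, so switching between 416, 512 and 608 is a cfg edit rather than a retraining step. The sketch below rewrites those two keys in cfg text. The function name is hypothetical, and the multiple-of-32 check reflects the guidance in the Darknet README as I understand it, so treat it as an assumption.

```python
import re


def set_network_resolution(cfg_text, size):
    """Return cfg_text with width and height in [net] set to `size`."""
    if size % 32 != 0:
        # Assumption: the resolution should be a multiple of 32.
        raise ValueError("network resolution should be a multiple of 32")
    out = []
    in_net = False
    for line in cfg_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            in_net = stripped == "[net]"
        elif in_net and re.match(r"(width|height)\s*=", stripped):
            key = stripped.split("=", 1)[0].strip()
            line = f"{key}={size}"            # replace only the value
        out.append(line)
    return "\n".join(out)


# Illustrative cfg fragment; the trained weights file stays the same.
cfg = "[net]\nwidth=512\nheight=512\n[convolutional]\nsize=3"
print(set_network_resolution(cfg, 608))
```

Only the [net] section is touched, so keys with similar names in later sections are left unchanged.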

Most other models need to be trained separately for each network resolution, so training them takes many times longer.

Features of measuring accuracy of object detection algorithms

You can always find an image on which one algorithm works poorly and another works well, and vice versa. Therefore, detection algorithms are tested on a large set of ~20,000 images with 80 different types of objects: the MSCOCO test-dev dataset.

So that the algorithm cannot simply memorize the hash of each image and the coordinates on it (overfitting), detection accuracy is always checked on images and labels that the algorithm did not see during training. This ensures that the algorithm can detect objects in images / videos that it has never seen.

So that no one can make a mistake when calculating accuracy, only the test-dev images are publicly available. You run detection on them and send the results to the CodaLab evaluation server, where a program compares your results with test annotations that are not accessible to anyone.
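Results sent for COCO evaluation are a JSON list with one entry per detection, holding image_id, category_id, bbox as [x, y, width, height] and score. The sketch below converts corner-format boxes into that layout. The input dictionary layout, the function name and the sample values are assumptions for illustration, and the exact file naming and packaging required by the server are not covered.

```python
import json


def to_coco_results(detections):
    """Convert detections with (x1, y1, x2, y2) boxes to COCO result entries."""
    results = []
    for det in detections:
        x1, y1, x2, y2 = det["box"]
        results.append({
            "image_id": det["image_id"],
            "category_id": det["category_id"],
            # COCO expects the top-left corner plus width and height.
            "bbox": [round(x1, 2), round(y1, 2),
                     round(x2 - x1, 2), round(y2 - y1, 2)],
            "score": round(det["score"], 5),
        })
    return results


# Made-up detection on a made-up image id.
sample = [{"image_id": 42, "category_id": 18,
           "box": (10.0, 20.0, 110.0, 220.0), "score": 0.91}]
print(json.dumps(to_coco_results(sample)))
```

Because the annotations stay on the server, this file is the only thing a participant submits, and the score comes back from the evaluation program.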

The MSCOCO dataset consists of 3 parts

Training set: 120,000 images and a json file with the coordinates of each object
Validation set: 5,000 images and a json file with the coordinates of each object
Test set: 41,000 jpg-images without the coordinates of objects (some of these images are used to determine accuracy in tasks: Object Detection, Instance Segmentation, Keypoints, ...)

Features of the development of YOLOv4

When developing YOLOv4, I had to develop both the YOLOv4 neural network and the Darknet framework in C / C ++ / CUDA myself. Because Darknet has no automatic differentiation and does not apply the chain rule automatically, all gradients have to be implemented manually. On the other hand, we can depart from strict adherence to the chain rule, change backpropagation and try very non-trivial things to increase training stability and accuracy.
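As an illustration of what implementing a gradient manually involves, the sketch below writes the derivative of the Mish activation by hand and checks it against a numerical derivative. Mish is one of the activations used in YOLOv4 according to the paper. This is not Darknet's code (Darknet is written in C / CUDA), and the function names are chosen for this example only.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


def mish_grad(x):
    """Hand-derived derivative of mish, using d softplus / dx = sigmoid(x)."""
    t = math.tanh(softplus(x))
    sigmoid = 1.0 / (1.0 + math.exp(-x))
    return t + x * (1.0 - t * t) * sigmoid


def numeric_grad(f, x, eps=1e-6):
    """Central-difference approximation used only to check the formula."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


for x in (-3.0, -0.5, 0.0, 0.7, 4.0):
    print(x, mish_grad(x), numeric_grad(mish, x))
```

A framework with automatic differentiation derives `mish_grad` for you. Without it, every layer needs a pair like this, which is also what makes it possible to change the backward pass deliberately.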

Additional findings when creating neural networks

The best network for classifying objects is not always the best backbone for a network used to detect objects
Using weights trained with features that increased classification accuracy can adversely affect detector accuracy (on some networks)
Not all features stated in various studies improve network accuracy.
BFLOPS , BFLOPS
, receptive field , stride=2 / conv3x3, weights (filters) .

How to use and train YOLOv4

Object detection using trained YOLOv4 models is built into the OpenCV-dnn library (github.com/opencv/opencv/issues/17148), so you can use YOLOv4 directly from OpenCV without the Darknet framework. The OpenCV library supports running neural networks on the CPU, GPU (nVidia GPU) and VPU (Intel Myriad X). More details: docs.opencv.org/master/da/d9d/tutorial_dnn_yolo.html

OpenCV (dnn) framework:

C ++ example: github.com/opencv/opencv/blob/master/samples/dnn/object_detection.cpp
Python example: github.com/opencv/opencv/blob/master/samples/dnn/object_detection.py
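For orientation, here is a minimal sketch of running a Darknet model through the OpenCV dnn Python bindings. It is not the linked sample. It assumes an OpenCV build recent enough to load YOLOv4, and the file names in the usage comment (yolov4.cfg, yolov4.weights, coco.names, dog.jpg) are placeholders for files you would download yourself. The thresholds and the 416x416 input size are illustrative defaults.

```python
def load_class_names(text):
    """Split the contents of a .names file into a list of class names."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def detect_objects(image_path, cfg_path, weights_path, class_names,
                   input_size=(416, 416), conf_threshold=0.5,
                   nms_threshold=0.4):
    """Run a Darknet model with OpenCV dnn; return (label, score, box) tuples."""
    import cv2            # third-party: opencv-python, imported lazily
    import numpy as np

    net = cv2.dnn.readNetFromDarknet(cfg_path, weights_path)
    model = cv2.dnn_DetectionModel(net)
    # Scale pixels to [0, 1] and swap BGR to RGB, as Darknet models expect.
    model.setInputParams(size=input_size, scale=1.0 / 255.0, swapRB=True)

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)

    class_ids, scores, boxes = model.detect(
        image, confThreshold=conf_threshold, nmsThreshold=nms_threshold)

    results = []
    for class_id, score, box in zip(np.asarray(class_ids).reshape(-1),
                                    np.asarray(scores).reshape(-1), boxes):
        x, y, w, h = (int(v) for v in box)   # box is (x, y, width, height)
        results.append((class_names[int(class_id)], float(score), (x, y, w, h)))
    return results


# Usage with placeholder file names:
# names = load_class_names(open("coco.names").read())
# for label, score, box in detect_objects("dog.jpg", "yolov4.cfg",
#                                         "yolov4.weights", names):
#     print(label, round(score, 2), box)
```

To run on an nVidia GPU, the net would additionally need a CUDA backend and target selected, which requires an OpenCV build compiled with CUDA.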

Darknet framework:

Instructions for using YOLOv4 to detect objects: github.com/AlexeyAB/darknet#how-to-use-on-the-command-line
Instructions for training a neural network to detect your own objects: github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects
Instructions for training the CSPDarknet53 classifier on the ILSVRC2012 dataset (ImageNet): github.com/AlexeyAB/darknet/wiki/Train-Classifier-on-ImageNet-(ILSVRC2012)
Instructions for training YOLOv4 on the MS COCO dataset: github.com/AlexeyAB/darknet/wiki/Train-Detector-on-MS-COCO-(trainvalno5k-2014)-dataset

tkDNN + TensorRT - maximum object detection speed with YOLOv4: github.com/ceccocats/tkDNN

400 FPS - YOLOv4 (416x416 batch = 4) on RTX 2080 Ti
32 FPS - YOLOv4 (416x416 batch = 1) on Jetson AGX Xavier

YOLOv4 can be extended to detect rotated 3D bounding boxes or key points / facial landmarks, for example:

github.com/ouyanghuiyu/darknet_face_with_landmark