YOLOv4 - the most accurate real-time neural network on the Microsoft COCO dataset

Darknet YOLOv4 is faster / more accurate than Google TensorFlow EfficientDet and Facebook PyTorch / Detectron RetinaNet / MaskRCNN.

The same article on medium : medium
Code : github.com/AlexeyAB/darknet
Article : arxiv.org/abs/2004.10934


We will show some nuances of comparing and using neural networks to detect objects.

Our goal was to develop an object detection algorithm for use in real products, not just to move science forward. The YOLOv4 neural network (608x608) reaches 43.5% AP / 65.7% AP50 on Microsoft COCO test-dev.

62 FPS - YOLOv4 (608x608 batch = 1) on Tesla V100 - by using Darknet-framework
400 FPS - YOLOv4 (416x416 batch = 4) on RTX 2080 Ti - by using TensorRT + tkDNN
32 FPS - YOLOv4 (416x416 batch = 1) on Jetson AGX Xavier - by using TensorRT + tkDNN








Our YOLOv4 neural network and our own Darknet DL framework (C / C++ / CUDA) are better in speed (FPS) and in accuracy (both AP50:95 and AP50) on the Microsoft COCO dataset than other DL frameworks and neural networks: Google TensorFlow EfficientDet, Facebook Detectron RetinaNet / MaskRCNN, PyTorch YOLOv3-ASFF, and many others. YOLOv4 achieves 43.5% AP / 65.7% AP50 on Microsoft COCO test-dev at a speed of 62 FPS on a Titan V or 34 FPS on an RTX 2070. Unlike other modern detectors, YOLOv4 can be trained by anyone who has an nVidia gaming graphics card with 8-16 GB of VRAM. Until now, only large companies could train a neural network on hundreds of GPUs / TPUs in order to use large mini-batch sizes and achieve higher accuracy. We are returning control of artificial intelligence to ordinary users, because YOLOv4 does not require a large mini-batch: a size of 2 - 8 is enough.

YOLOv4 is optimal for real-time use, because the network lies on the Pareto optimality curve in the AP (accuracy) / FPS (speed) graph.
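To make "lies on the Pareto optimality curve" concrete, here is a minimal sketch of how such a curve can be computed from (FPS, AP) pairs. The two RTX 2070 entries use figures quoted in this article; the other two entries and all names in the code are hypothetical, added only so that the curve has more than one point.

```python
from typing import List, NamedTuple


class Detector(NamedTuple):
    name: str
    fps: float  # speed, higher is better
    ap: float   # accuracy in % AP, higher is better


def pareto_front(models: List[Detector]) -> List[Detector]:
    """Keep the models that no other model matches or beats on both axes."""
    front = []
    for m in models:
        dominated = any(
            o.fps >= m.fps and o.ap >= m.ap and (o.fps > m.fps or o.ap > m.ap)
            for o in models
        )
        if not dominated:
            front.append(m)
    return sorted(front, key=lambda d: d.fps)


models = [
    Detector("YOLOv4-608", 34, 43.5),         # RTX 2070 figure from this article
    Detector("EfficientDet-D0", 34, 33.8),    # RTX 2070 figure from this article
    Detector("slow-but-accurate", 10, 50.0),  # hypothetical
    Detector("fast-but-weak", 90, 30.0),      # hypothetical
]

front_names = [d.name for d in pareto_front(models)]
print(front_names)
```

A model is dropped from the curve only when another model is at least as fast and at least as accurate, and strictly better on one of the two.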



Graphs of accuracy (AP) and speed (FPS) of many object detection neural networks, measured on TitanV / TeslaV100, TitanXP / TeslaP100 and TitanX / TeslaM40 GPUs, for the two main accuracy metrics: AP50:95 and AP50.
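For readers new to the two metrics: in the COCO evaluation, AP50 counts a detection as correct when its box overlaps the ground-truth box with IoU of at least 0.5, and AP50:95 averages AP over the IoU thresholds 0.50, 0.55, ..., 0.95. The sketch below shows only the IoU part of that definition, not the full AP computation; the box layout (x_min, y_min, x_max, y_max) and the function names are assumptions made for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The ten IoU thresholds averaged by AP50:95.
THRESHOLDS = [t / 100 for t in range(50, 100, 5)]

ground_truth = (0, 0, 10, 10)  # made-up boxes for illustration
prediction = (0, 0, 10, 5)

overlap = iou(ground_truth, prediction)
passed = [t for t in THRESHOLDS if overlap >= t]
print(overlap, passed)
```

In this made-up case the prediction is a match for AP50 but fails every stricter threshold, which is why AP50:95 is always the harder of the two numbers.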

For a fair comparison, we take the figures from the published papers and compare only results obtained on GPUs of the same architecture.

Most practical tasks impose minimum requirements on a detector: a minimum acceptable accuracy and a minimum acceptable speed. For real-time systems, the minimum acceptable speed is usually 30 FPS (frames per second) or higher.

As can be seen from the graphs, in real-time systems with 30 FPS or more:

  • YOLOv4-608 on RTX 2070 ($450): 34 FPS, 43.5% AP / 65.7% AP50
  • EfficientDet-D2 on TitanV ($2250): 42 FPS, 43.0% AP / 62.3% AP50
  • EfficientDet-D0 on RTX 2070 ($450): 34 FPS, 33.8% AP / 52.2% AP50

That is, YOLOv4 needs hardware that is 5 times cheaper and is still more accurate than EfficientDet-D2 (Google TensorFlow). You can use EfficientDet-D0 (Google TensorFlow) instead, in which case the hardware cost is the same, but the accuracy is about 10 percentage points of AP lower (33.8% vs 43.5%).
In addition, some industrial systems have limits on heat dissipation or must use passive cooling. In that case you cannot use a TitanV even if you can afford it.

When running YOLOv4 (416x416) on an RTX 2080 Ti GPU with TensorRT + tkDNN, we get roughly 2x the speed, and with batch = 4 roughly 3x-4x the speed. For an honest comparison, we do not present these results in the article on arxiv.org.

The YOLOv4 neural network (416x416, FP16 on Tensor Cores, batch = 1) reaches 32 FPS on the nVidia Jetson AGX Xavier computer using the tkDNN + TensorRT libraries: github.com/ceccocats/tkDNN
The OpenCV-dnn library compiled with CUDA is slightly slower: docs.opencv.org/master/da/d9d/tutorial_dnn_yolo.html

Sometimes the speed (FPS) reported in papers is measured with a large batch size or with specialized software (TensorRT) that optimizes the network and yields a higher FPS value. Comparing some networks with TensorRT against other networks without it is not fair. A large batch size increases FPS, but it also increases latency compared to batch = 1. If a network shows 40 FPS with batch = 1 and 60 FPS with batch = 32, the latency is 25 ms for batch = 1 and ~500 ms for batch = 32, because only ~2 batches (32 images each) are processed per second. This is why batch = 32 is unacceptable in many industrial systems. Therefore, the graphs compare results only with batch = 1 and without TensorRT.
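The latency arithmetic above can be checked with a few lines of code. This is a simplified sketch: it assumes latency equals the time needed to process one full batch and ignores the time spent waiting for the batch to fill. The function name is made up for this example.

```python
def batch_latency_ms(fps: float, batch_size: int) -> float:
    """Time to finish one full batch, in milliseconds.

    fps is the measured throughput in images per second at that batch size.
    """
    batches_per_second = fps / batch_size
    return 1000.0 / batches_per_second


latency_b1 = batch_latency_ms(fps=40, batch_size=1)
latency_b32 = batch_latency_ms(fps=60, batch_size=32)

print(f"batch=1:  {latency_b1:.0f} ms")
print(f"batch=32: {latency_b32:.0f} ms")
```

With these inputs, batch = 32 gives 1.875 batches per second, so each image waits a little over half a second for its result even though the reported FPS is higher.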

Any process can be controlled either by people or by computers. When a computer system reacts with a long delay because of low speed and makes too many mistakes, it cannot be entrusted with full control. In that case a person controls the process and the computer system only gives hints: it is a recommendation system, where the person does the work and the system only points out where mistakes were made. When the system works quickly and with high accuracy, it can control the process independently, and a person only supervises it. Therefore, both accuracy and speed always matter. If 120 FPS for YOLOv4 416x416 seems excessive for your task and you would rather choose a slower but more accurate algorithm, keep in mind that in real tasks you will most likely use something weaker than a 250 Watt Tesla V100, for example an RTX 2060 / Jetson Xavier at 30-80 Watt. In that case you will get 30 FPS with YOLOv4 416x416, while other neural networks will run at 1-15 FPS or will not start at all.

Features of training various neural networks


EfficientDet has to be trained with a mini-batch size of 128 on several Tesla V100 32GB GPUs, while YOLOv4 was trained on just one Tesla V100 32GB GPU with mini-batch = 8 (mini-batch = batch / subdivisions), and it can be trained on a regular gaming graphics card with 8-16 GB of VRAM.
The next nuance is the difficulty of training a neural network to detect your own objects. No matter how long you train other networks on a single 1080 Ti GPU, you will not reach the stated accuracy shown in the graph above. Most networks (EfficientDet, ASFF, ...) need to be trained on 4 - 128 GPUs (with a large mini-batch size using syncBN), and they must be retrained from scratch for each network resolution. Unless both conditions are met, the AP or AP50 accuracy their authors report cannot be reached.
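The relation mini-batch = batch / subdivisions refers to two keys in the [net] section of a Darknet cfg file. The sketch below reads them from a cfg fragment and computes the mini-batch size. The fragment values (batch=64, subdivisions=8) are illustrative and were chosen to give the mini-batch of 8 mentioned above; the parser is a simplified stand-in written for this example, not Darknet's own, and its divisibility check is a choice of this sketch.

```python
CFG_FRAGMENT = """
[net]
batch=64          # images per weight update
subdivisions=8    # the batch is split into this many chunks

[convolutional]
filters=32
"""


def parse_net_section(text: str) -> dict:
    """Collect key=value pairs from the [net] section only."""
    values = {}
    in_net = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if line.startswith("["):
            in_net = line == "[net]"
            continue
        if in_net and "=" in line:
            key, value = line.split("=", 1)
            values[key.strip()] = value.strip()
    return values


def mini_batch_size(batch: int, subdivisions: int) -> int:
    """Number of images held in GPU memory at once."""
    if batch % subdivisions != 0:
        raise ValueError("this sketch expects batch to be divisible by subdivisions")
    return batch // subdivisions


net = parse_net_section(CFG_FRAGMENT)
mini_batch = mini_batch_size(int(net["batch"]), int(net["subdivisions"]))
print(mini_batch)
```

Raising subdivisions lowers the number of images held in GPU memory at once, which is how training fits on a smaller graphics card.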


You can see how detection accuracy depends on the mini-batch size in other detectors: using 128 video cards instead of 8 makes training 10 times faster and the final accuracy 1.5 AP higher. See MegDet: A Large Mini-Batch Object Detector, arxiv.org/abs/1711.07240

Yolo ASFF: arxiv.org/abs/1911.09516
Following [43], we introduce a bag of tricks in the training process, such as the mixup algorithm [12], the cosine [26] learning rate schedule, and the synchronized batch normalization technique [30].

EfficientDet: arxiv.org/abs/1911.09070
Synchronized batch normalization is added after every convolution with batch norm decay 0.99 and epsilon 1e-3.

Each model is trained 300 epochs with batch total size 128 on 32 TPUv3 cores.

cloud.google.com/tpu/docs/types-zones#europe
v3-32 TPU type (v3) - 32 TPU v3 cores - 512 GiB Total TPU memory

Training the EfficientDet model with synchronized batch normalization at batch = 128 requires 512 GB of TPU / GPU RAM, while YOLOv4 was trained with mini-batch = 8 on only 32 GB of GPU RAM. Despite this, YOLOv4 is faster / more accurate than other public networks, although it was trained only once, at a resolution of 512x512, on 1 GPU (Tesla V100 32GB / 16GB). At the same time, the smaller mini-batch size and GPU VRAM do not cause such a dramatic loss of accuracy as in other neural networks:


Source: arxiv.org/abs/2004.10934

So you can train artificial intelligence locally on your own PC instead of uploading your dataset to the cloud. This protects your personal data and makes artificial intelligence training available to everyone.

It is enough to train our network once with a network resolution 512x512, and then it can be used with different network resolutions in the range: [416x416 - 512x512 - 608x608].

Most other models need to be trained separately for each network resolution, which makes training take many times longer.

Features of measuring accuracy of object detection algorithms


You can always find an image on which one algorithm works poorly and another works well, and vice versa. Therefore, detection algorithms are tested on a large set of ~20,000 images with 80 different types of objects: the MSCOCO test-dev dataset.

To make sure the algorithm does not simply memorize the hash of each image and the coordinates of the objects on it (overfitting), detection accuracy is always checked on images and labels that the algorithm did not see during training. This ensures that the algorithm can detect objects in images / videos it has never seen.

So that no one can make a mistake when calculating accuracy, only the test-dev images are publicly available. You run detection on them and send the results to the CodaLab evaluation server, where a program compares your results with test annotations that are not accessible to anyone.
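The results file sent for evaluation is a JSON list in the COCO detection results format: one entry per detection with image_id, category_id, bbox as [x, y, width, height] in pixels, and score. A minimal sketch of producing it is shown below. The input tuple layout, the function name and the sample detection are assumptions made for this example.

```python
import json


def to_coco_results(detections):
    """Convert corner-format detections to COCO results entries.

    detections: iterable of
        (image_id, category_id, x_min, y_min, x_max, y_max, score)
    """
    results = []
    for image_id, category_id, x0, y0, x1, y1, score in detections:
        results.append({
            "image_id": int(image_id),
            # Must be the COCO category id, which is not the same as a
            # 0-79 class index because COCO ids have gaps.
            "category_id": int(category_id),
            # COCO boxes are [x, y, width, height], not two corners.
            "bbox": [round(x0, 2), round(y0, 2),
                     round(x1 - x0, 2), round(y1 - y0, 2)],
            "score": round(float(score), 5),
        })
    return results


sample = [(42, 1, 10.0, 20.0, 110.0, 220.0, 0.9)]  # made-up detection
results = to_coco_results(sample)
payload = json.dumps(results)
print(payload)
```

The conversion from corner coordinates to width and height is the step most often forgotten, and the server will not report it as an error: the accuracy simply comes out near zero.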

The MSCOCO dataset consists of 3 parts:

  1. Training set: 120,000 images and a json file with the coordinates of each object
  2. Validation set: 5,000 images and a json file with the coordinates of each object
  3. Test set: 41,000 jpg-images without the coordinates of objects (some of these images are used to determine accuracy in tasks: Object Detection, Instance Segmentation, Keypoints, ...)

Features of the development of YOLOv4


When developing YOLOv4, I had to develop both the YOLOv4 neural network and the Darknet framework in C / C++ / CUDA myself. Because Darknet has no automatic differentiation and no automatic application of the chain rule, all gradients have to be implemented manually. On the other hand, we can depart from strict adherence to the chain rule, change backpropagation, and try very non-trivial things to increase training stability and accuracy.
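To illustrate what "implementing gradients manually" involves, here is a small sketch that writes the derivative of one activation function by hand and checks it against a numerical derivative. The Mish activation is used only as a convenient example; the code is written in Python for readability and is not taken from Darknet, and all function names are made up for this example.

```python
import math


def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x))."""
    softplus = math.log1p(math.exp(x))
    return x * math.tanh(softplus)


def mish_grad(x: float) -> float:
    """Hand-derived d(mish)/dx using the product and chain rules."""
    softplus = math.log1p(math.exp(x))
    tanh_sp = math.tanh(softplus)
    sigmoid = 1.0 / (1.0 + math.exp(-x))  # derivative of softplus
    return tanh_sp + x * (1.0 - tanh_sp * tanh_sp) * sigmoid


def numeric_grad(f, x: float, h: float = 1e-5) -> float:
    """Central finite difference, used only to verify the manual gradient."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


max_error = max(
    abs(mish_grad(x) - numeric_grad(mish, x))
    for x in (-3.0, -1.0, 0.0, 0.5, 2.0)
)
print(max_error < 1e-6)
```

A check like this is the usual safety net when there is no automatic differentiation: every hand-written backward function is compared against finite differences before it is trusted.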

Additional findings when creating neural networks


  • The best network for classifying objects is not always the best backbone for a network used to detect objects
  • Using weights trained with features that increased classification accuracy can adversely affect detector accuracy (on some networks)
  • Not all features reported in various studies improve network accuracy
  • BFLOPS , BFLOPS
  • , receptive field , stride=2 / conv3x3, weights (filters) .

YOLOv4


Object detection with trained YOLOv4 models is built into the OpenCV-dnn library (github.com/opencv/opencv/issues/17148), so you can use YOLOv4 directly from OpenCV without the Darknet framework. The OpenCV library supports running neural networks on the CPU, GPU (nVidia GPU), and VPU (Intel Myriad X). More details: docs.opencv.org/master/da/d9d/tutorial_dnn_yolo.html

OpenCV (dnn) framework:


Darknet framework:


tkDNN + TensorRT gives the maximum object detection speed with YOLOv4: github.com/ceccocats/tkDNN

  • 400 FPS - YOLOv4 (416x416 batch = 4) on RTX 2080 Ti
  • 32 FPS - YOLOv4 (416x416 batch = 1) on Jetson AGX Xavier

YOLOv4 can be extended to detect 3D rotated bounding boxes or key points / facial landmarks, for example:

github.com/ouyanghuiyu/darknet_face_with_landmark

