Following [43], we adopt a bag of tricks during training, including the mixup algorithm [12], the cosine learning rate schedule [26], and synchronized batch normalization [30].
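As a minimal sketch of the mixup algorithm [12]: each input and its one-hot label are replaced by a convex combination with a randomly permuted partner, with the mixing weight drawn from a Beta distribution. The `alpha=0.2` default and the function name are illustrative assumptions, not values stated here.

```python
import numpy as np

def mixup(x, y, alpha=0.2, rng=None):
    """Mixup augmentation sketch (alpha=0.2 is an assumed default).

    x: batch of inputs, shape (B, ...); y: one-hot labels, shape (B, C).
    Returns convex combinations of example pairs and of their labels.
    """
    rng = rng or np.random.default_rng(0)
    # Mixing coefficient sampled from Beta(alpha, alpha)
    lam = rng.beta(alpha, alpha)
    # Pair each example with a randomly chosen partner from the same batch
    perm = rng.permutation(len(x))
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y + (1.0 - lam) * y[perm]
    return x_mix, y_mix
```

Because the labels are mixed with the same coefficient as the inputs, the training loss becomes a weighted cross-entropy over both classes of the pair.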
Synchronized batch normalization is added after every convolution, with a batch norm decay of 0.99 and an epsilon of 1e-3. Each model is trained for 300 epochs with a total batch size of 128 on 32 TPUv3 cores.
(A v3-32 TPU slice comprises 32 TPU v3 cores with 512 GiB of total TPU memory.)
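The cosine learning rate schedule [26] decays the rate from its base value to zero following a half-cosine over training. A small sketch, assuming a hypothetical base rate of 0.1 and an optional linear warmup (neither is specified here):

```python
import math

def cosine_lr(step, total_steps, base_lr=0.1, warmup_steps=0):
    """Cosine learning rate schedule sketch.

    base_lr and warmup_steps are assumed values for illustration.
    Decays from base_lr to 0 over total_steps, after an optional
    linear warmup for the first warmup_steps steps.
    """
    if step < warmup_steps:
        # Linear warmup from ~0 up to base_lr
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup schedule completed, in [0, 1]
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half-cosine decay: base_lr at t=0, 0 at t=1
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))
```

With 300 epochs at a total batch size of 128, `total_steps` would be 300 times the number of steps per epoch for the dataset in use.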