Translation of Andrew Ng's book Machine Learning Yearning, Chapters 36 and 37



Training and testing on data from different distributions


36. When you have to train and test algorithms on different distributions


Users of your cat app have uploaded 10,000 images, which you have labeled as containing or not containing cats. You also have a larger set of 200,000 images downloaded from the internet. How should you choose the training, validation, and test sets?


Since the 10,000 user-uploaded images closely reflect the actual probability distribution of the data on which your algorithm needs to work well, you can use them for the validation and test sets. If you are training a data-hungry deep learning algorithm, you might give it the additional 200,000 internet images for training. In that case, your training set and your validation/test sets will come from different probability distributions. How does this affect your work?


Instead of agonizing over how to partition the data into training, validation, and test sets, we could take all 210,000 of our images, shuffle them, and randomly assign data to each set. In that case all three sets would contain data from the same distribution.


But I recommend against this approach, because about 97.6% of the data (205,000 / 210,000 ≈ 97.6%) in the validation and test sets would then come from images found on the internet rather than from users, and would not reflect the real distribution on which you need to achieve high quality. Remember our recommendation on choosing validation and test sets:


Choose validation and test sets that reflect the data your algorithm will receive after the application launches and on which it must work well.


Most of the academic literature on machine learning assumes that the training, validation, and test sets are all drawn from the same distribution.


Note: there is some academic research on training and testing on different distributions. Examples include "domain adaptation", "transfer learning", and "multitask learning". But there is still a large gap between theory and practice. If you train on dataset A and test on some very different kind of data B, luck can have a huge effect on how well your algorithm performs. (Here "luck" includes features hand-designed by the researcher for the particular task, as well as other factors that we simply do not understand well.) This makes it difficult to study training and testing on different distributions in a systematic, academic way.


In the early days of machine learning, data was scarce. Usually there was only one dataset, drawn from some probability distribution. That data was randomly split into training, validation, and test sets, and the assumption that all of it came from the same source was usually satisfied.


But in the era of big data, we now have access to huge training sets, such as cat images from the internet. Even if such a training set comes from a different distribution than the validation/test sets, we still want to use it for training, since it can provide a lot of information.


Returning to our example: instead of putting all 10,000 user-uploaded images into the validation and test sets, we can put only 5,000 of them there. The remaining 5,000 user images go into the training set. The training set then contains 205,000 examples: 5,000 drawn from the distribution we care about, plus the 200,000 internet images. We will discuss in a later chapter why this method is helpful.
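The split described above can be sketched in a few lines of Python. The file names are hypothetical placeholders, and the 50/50 division of the 5,000 held-out user images between validation and test is an illustrative assumption, not something the text specifies:

```python
import random

random.seed(0)  # reproducible shuffle

# Hypothetical file lists standing in for the two data sources.
user_images = [f"user_{i}.jpg" for i in range(10_000)]      # target distribution
internet_images = [f"web_{i}.jpg" for i in range(200_000)]  # extra training data

random.shuffle(user_images)
val_set = user_images[:2_500]        # validation: user-uploaded data only
test_set = user_images[2_500:5_000]  # test: user-uploaded data only

# The remaining 5,000 user images join all 200,000 internet images,
# giving a 205,000-example training set that still contains some data
# from the validation/test distribution.
train_set = user_images[5_000:] + internet_images

print(len(train_set), len(val_set), len(test_set))  # 205000 2500 2500
```

The key point is that only user-uploaded images ever reach the validation and test sets, while the internet images are confined to training.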


Let's consider a second example. Suppose you are building a speech recognition system that transcribes street addresses for a voice-controlled navigation app. You have 20,000 recordings of users pronouncing street addresses. You also have 500,000 recordings of users speaking on other topics. You might take 10,000 address recordings for the validation and test sets, and use the remaining 10,000 together with the 500,000 other recordings for training.
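The same hold-out technique can be written as a small reusable helper. The function name and the even division of the held-out examples between validation and test are assumptions made for illustration:

```python
import random
from typing import List, Tuple

def split_with_mixed_distributions(
    target_data: List[str],   # data from the distribution you care about
    extra_data: List[str],    # large off-distribution dataset
    holdout: int,             # target examples to reserve for validation + test
    seed: int = 0,
) -> Tuple[List[str], List[str], List[str]]:
    """Reserve `holdout` target-distribution examples (split evenly between
    validation and test) and train on everything else."""
    rng = random.Random(seed)
    data = target_data[:]
    rng.shuffle(data)
    val = data[:holdout // 2]
    test = data[holdout // 2:holdout]
    train = data[holdout:] + extra_data
    return train, val, test

# The speech example from the text: 20,000 address clips, 500,000 other
# clips, 10,000 addresses held out for validation/test.
addresses = [f"addr_{i}" for i in range(20_000)]
other_audio = [f"misc_{i}" for i in range(500_000)]
train, val, test = split_with_mixed_distributions(addresses, other_audio, holdout=10_000)
print(len(train), len(val), len(test))  # 510000 5000 5000
```

Only the target-distribution recordings can appear in validation and test; the off-topic audio contributes solely to the training set.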


We will continue to assume that the validation data and the test data come from the same distribution. But it is important to understand that differing training and validation/test distributions create special challenges.


37. How to decide whether to use all your data


Suppose the training set of your cat detector includes 10,000 user-uploaded images. This data comes from the same distribution as the validation and test sets and represents the distribution on which you want to do well. You also have 20,000 additional images downloaded from the internet. Should you feed all 20,000 + 10,000 = 30,000 images to your learning algorithm as a training set, or discard the 20,000 internet images for fear of biasing the algorithm?


With earlier generations of learning algorithms (such as hand-designed computer-vision features followed by a simple linear classifier), there was a real risk that merging the two kinds of data would make performance worse. Therefore some engineers will warn you against including the 20,000 internet images.


But in the modern era of powerful, flexible learning algorithms such as large neural networks, this risk has greatly diminished. If you can afford to build a network with a large enough number of hidden units/layers, you can safely add the 20,000 internet images to the training set. Adding them is more likely to improve performance.


This observation relies on the fact that there is some mapping x -> y that works well for both types of data. In other words, there exists a system that takes either an internet image or a mobile-app image as input and reliably predicts the label, without even knowing where the image came from.


Adding the extra 20,000 images has the following effects:


  1. It gives the neural network more examples of what cats do and do not look like. This is helpful, since internet images and user-uploaded mobile-app images do share some similarities. The network can apply part of the knowledge acquired from internet images to mobile-app images.
  2. It forces the network to spend part of its capacity on learning properties specific to internet images (such as higher resolution, a different distribution of framings, and so on). If these properties differ greatly from mobile-app images, this "uses up" some of the network's representational capacity. As a result, less capacity remains for recognizing data drawn from the mobile-app distribution, which is what you actually care about. In theory, this could hurt your algorithm's quality.

To describe the second effect in other terms, we can turn to the fictional character Sherlock Holmes, who said that the brain is like an attic with only a limited amount of space: "for every addition of knowledge you forget something that you knew before. It is of the highest importance, therefore, not to have useless facts elbowing out the useful ones."


Fortunately, if you have the computational capacity to build a large enough neural network, that is, a large enough "attic", this is not a serious problem. You have enough capacity to learn from both internet images and mobile-app images without the two types of data competing for it. The "brain" of your algorithm is big enough that you need not worry about running out of attic space.


But if you do not have a large enough neural network (or another highly flexible learning algorithm), then you should pay closer attention to how well your training data matches the distribution of the validation and test sets.


If you believe some of your data brings no benefit, it is worth leaving it out, if only to save computation. For example, suppose your validation and test sets contain mostly casual photos of people, places, landmarks, and animals, while you also have a large collection of scanned historical documents:


[Image: scanned historical documents]


These documents do not contain anything resembling cats. They also look completely unlike the distribution of the validation and test sets. There is no point in including this data as negative examples: the benefit from the first effect described above would be negligible, since the neural network is unlikely to extract anything from this data that would help it work better on your application's validation and test sets. Including it would waste computing resources and might reduce the network's capacity to approximate functions (ultimately reducing its recognition ability).



