Translation of Andrew Ng's book, Machine Learning Yearning, Chapters 38 and 39


38. How to determine whether to add data with a different distribution


Suppose you want to learn to predict housing prices in New York. Given the size of a house (input feature x), the task is to predict its price (target value y).


Housing prices in New York are very high. Suppose you have a second dataset of housing prices in Detroit, Michigan, where real estate is much cheaper. Should this data be included in the training set?


For the same size x, the price y of a house differs greatly depending on whether it is in New York or in Detroit. If you only care about predicting housing prices in New York, combining the two datasets will hurt performance. In this case, it is better to leave the Detroit data out of the training set.


* Author's note: One way to address the inconsistency between the Detroit and New York data is to add an extra feature to each training example indicating the city. Given an input x that now specifies the city, the target value y becomes unambiguous. In practice, however, this is rarely done. *
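
The author's note can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy prices are made up (not taken from the book), the model is plain least-squares linear regression, and the names `fit_least_squares`, `is_ny`, `pred_size_only` and `pred_with_city` are hypothetical. It fits one model on the pooled data using size only, and a second model that also receives a city indicator.

```python
import numpy as np

# Hypothetical toy data (not from the book): house size in square metres
# and price in thousands of dollars. In this made-up example both cities
# share the same price per square metre but differ by a fixed offset.
sizes = np.array([50.0, 80.0, 120.0])
prices_ny = 2.0 * sizes + 500.0
prices_detroit = 2.0 * sizes + 20.0

size_all = np.concatenate([sizes, sizes])
is_ny = np.concatenate([np.ones(3), np.zeros(3)])  # 1 = New York, 0 = Detroit
price_all = np.concatenate([prices_ny, prices_detroit])
bias = np.ones_like(size_all)


def fit_least_squares(features, targets):
    """Return the least-squares weights of a linear model."""
    weights, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return weights


# Model A: size only, trained on the pooled New York + Detroit data.
w_size_only = fit_least_squares(np.column_stack([size_all, bias]), price_all)

# Model B: size plus a city indicator, trained on the same pooled data.
w_with_city = fit_least_squares(
    np.column_stack([size_all, is_ny, bias]), price_all
)

# Predict the price of a 100 m^2 house in New York (true value here: 700).
pred_size_only = np.array([100.0, 1.0]) @ w_size_only
pred_with_city = np.array([100.0, 1.0, 1.0]) @ w_with_city

print(f"size only: {pred_size_only:.1f}, with city feature: {pred_with_city:.1f}")
```

On this toy data the size-only model lands between the two cities (about 460 for a house whose New York price is 700), while the model with the city indicator recovers the New York price. Real data would not separate this cleanly.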


How does this case with real estate prices in New York and Detroit differ from the case with images of cats received from a mobile application and from the Internet?


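The cat image example is different because, given an input image x, you can reliably predict the label y (whether there is a cat in it) even without knowing whether the image came from the Internet or from the mobile app. In other words, there is a function f(x) that reliably maps the input x to the target output y, even without knowing the origin of x. The task of recognizing Internet images is therefore "consistent" with the task of recognizing mobile app images. This means there is little downside (other than computational cost) to including all the data, and possibly a significant upside.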


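In contrast, the New York and Detroit, Michigan, data are not consistent. Given the same x (house size), the price differs greatly depending on where the house is located.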


39. Weighting data


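Suppose you have 200,000 images from the Internet and 5,000 images from users of your mobile app. The ratio between the sizes of these datasets is 40:1. In theory, as long as you build a huge neural network and train it long enough on all 205,000 images, there is no harm in trying to make the algorithm do well on both Internet images and mobile app images.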


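In practice, however, having 40 times as many Internet images as mobile app images may mean you need 40 times (or more) the computational resources to model both, compared with training on only the 5,000 mobile app images.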


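If you do not have huge computational resources, you can give the Internet images a much lower weight as a compromise.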


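For example, suppose the optimization objective is squared error. (This is not a good choice for a classification task, but it simplifies the explanation.) The learning algorithm then tries to optimize:


$$\min_\theta \sum_{(x,y) \in \text{MobileImg}} (h_\theta(x) - y)^2 + \sum_{(x,y) \in \text{InternetImg}} (h_\theta(x) - y)^2$$


The first sum runs over the 5,000 mobile app images and the second over the 200,000 Internet images. You can instead optimize with an additional parameter β:


$$\min_\theta \sum_{(x,y) \in \text{MobileImg}} (h_\theta(x) - y)^2 + \beta \sum_{(x,y) \in \text{InternetImg}} (h_\theta(x) - y)^2$$


With β = 1/40, the algorithm gives equal weight to the 5,000 mobile app images and the 200,000 Internet images. You can also set β to other values, for example by tuning it on the validation set.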


By giving the Internet images less weight, you no longer need as massive a neural network to make the algorithm do well on both data distributions. This kind of reweighting is needed only when you suspect that the additional data (Internet images) has a very different distribution from the validation and test sets, or when the additional data is much larger than the data drawn from the same distribution as the validation and test sets (mobile app images).
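
The effect of down-weighting the Internet images can be checked with a small sketch. It assumes a squared-error objective in which the Internet term is multiplied by a weight β, with β = 1/40 taken from the 5,000 : 200,000 ratio of the two datasets. The function name `weighted_squared_error` and the placeholder predictions are hypothetical and stand in for a real model's outputs.

```python
import numpy as np


def weighted_squared_error(pred_mobile, y_mobile, pred_internet, y_internet, beta):
    """Return (total, mobile part, weighted Internet part) of the objective."""
    mobile_term = np.sum((pred_mobile - y_mobile) ** 2)
    internet_term = beta * np.sum((pred_internet - y_internet) ** 2)
    return mobile_term + internet_term, mobile_term, internet_term


n_mobile, n_internet = 5_000, 200_000
beta = n_mobile / n_internet  # 1/40, the ratio of the two dataset sizes

# Hypothetical placeholder values: all labels are 1 and every prediction
# is 0.5, so each example has the same squared error of 0.25.
y_mobile, pred_mobile = np.ones(n_mobile), np.full(n_mobile, 0.5)
y_internet, pred_internet = np.ones(n_internet), np.full(n_internet, 0.5)

# Weighted objective: both sources contribute the same amount.
total, mobile_part, internet_part = weighted_squared_error(
    pred_mobile, y_mobile, pred_internet, y_internet, beta
)

# Unweighted objective (beta = 1): the Internet images dominate.
plain_total, _, plain_internet = weighted_squared_error(
    pred_mobile, y_mobile, pred_internet, y_internet, beta=1.0
)

print(f"weighted: mobile={mobile_part:.0f}, internet={internet_part:.0f}")
print(f"unweighted Internet share: {plain_internet / plain_total:.3f}")
```

With equal per-example errors, the unweighted objective is dominated by the Internet images (40/41 of the total), while β = 1/40 makes the two sources contribute equally. Other values of β sit between these extremes.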

