42. More about data mismatch

Suppose you have developed a speech recognition system that performs very well on the training set and on the training dev set, but poorly on the dev set. That pattern indicates a data mismatch problem. What can you do about it?

I recommend the following: (i) Try to understand how the data distributions of the training set and the dev set differ. (ii) Find more training data that resembles the dev set examples on which the algorithm makes mistakes.

For example, suppose you carry out error analysis on the speech recognition dev set: you manually go through 100 examples and try to understand where the algorithm makes mistakes. You find that the system performs poorly because most of the audio clips in the dev set were recorded in a car, while almost all of the training examples have no background noise. Engine and road noise significantly degrade recognition quality. In this case, you can try to add more training examples recorded in a car. The purpose of the error analysis is to find the differences between the training set and the dev set that cause the data mismatch.
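The manual review described above can be kept as a simple tally. The following is a minimal sketch, not a prescribed procedure: the function name `tally_error_causes`, the tag names, and the toy review notes are all assumptions made for illustration.

```python
from collections import Counter

def tally_error_causes(reviewed):
    """Return the share of misrecognized examples carrying each suspected cause.

    `reviewed` is a list of (is_error, tags) pairs produced by manual review;
    the tag names are hypothetical labels chosen by the reviewer.
    """
    errors = [tags for is_error, tags in reviewed if is_error]
    # Count each tag at most once per example; an example may carry several tags.
    counts = Counter(tag for tags in errors for tag in set(tags))
    total = len(errors)
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.most_common()}

# Toy review notes for five reviewed clips (invented for illustration).
reviewed = [
    (True, ["car_noise"]),
    (True, ["car_noise", "far_from_mic"]),
    (False, []),
    (True, ["accent"]),
    (True, ["car_noise"]),
]
shares = tally_error_causes(reviewed)
for tag, share in shares.items():
    print(f"{tag}: {share:.0%} of errors")
```

A cause that dominates the tally, such as car noise in this toy data, points to the kind of training examples worth collecting next.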

43. Artificial data synthesis

While working on data synthesis, my teams sometimes spent weeks before we were able to reproduce details that allowed us to get close enough to the actual distribution of examples so that the synthesized data could have a significant effect. But if you can correctly reproduce in detail objects that are close to those on which the algorithm should show high quality, you have a chance to gain access to a much larger volume of the training sample than you had before.
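As one illustration of data synthesis for speech, a clean recording can be mixed with background noise at a chosen signal-to-noise ratio. The sketch below is an assumption-laden toy, not the method used by the author's teams: the helper names `rms` and `mix_at_snr`, the SNR value, and the sine and random signals standing in for real audio are all hypothetical.

```python
import math
import random

def rms(signal):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))

def mix_at_snr(speech, noise, snr_db, rng=random):
    """Add a randomly chosen segment of `noise` to `speech` at the given SNR in dB."""
    if len(noise) < len(speech):
        raise ValueError("noise clip must be at least as long as the speech clip")
    start = rng.randrange(len(noise) - len(speech) + 1)
    segment = noise[start:start + len(speech)]
    noise_rms = rms(segment)
    if noise_rms == 0:
        return list(speech)  # silent noise segment: nothing to add
    # Scale the noise so that 20*log10(rms(speech) / rms(scaled noise)) == snr_db.
    gain = rms(speech) / (noise_rms * 10 ** (snr_db / 20))
    return [s + gain * n for s, n in zip(speech, segment)]

# Toy signals standing in for real audio (invented for illustration).
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 5 * t / 200) for t in range(200)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
noisy = mix_at_snr(speech, noise, snr_db=10, rng=rng)
```

The mixing step itself is simple. The effort lies in making the noise source varied and realistic enough that the synthesized clips resemble the data the system will actually encounter.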

Translation of Andrew Ng's book Machine Learning Yearning, Chapters 42 and 43
