The Null problem in Data Science and Machine Learning

The current definition of Null in Data Science is severely limited. With a little effort, we can significantly improve the processing of data that previously ended up as Null.


The "Null" problem is an old one. It was formulated in an article by Codd on database semantics.


Programmers must work hard to handle null values. Perhaps that is why they dislike Null and have even promoted the idea that one can do without it. A popular saying is that including Null in SQL was a mistake.


The following definitions of Null are in use:


  • Not available
  • Not applicable
  • Missing
  • Unknown

The last definition is the one most commonly used in databases.


Data Science defines Null as a missing value.
Jake VanderPlas discusses the use and interpretation of Null, NaN, NA, and None values in Python, Pandas, and NumPy.
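
As a small illustration of how these markers behave in practice, here is a minimal sketch, assuming recent versions of pandas and NumPy are installed. The variable names are made up for this example. It shows only the default behaviour: every marker is collapsed into one generic "missing" value.

```python
import numpy as np
import pandas as pd

# One column, two spellings of "no value here".
s = pd.Series([1.0, None, np.nan, 4.0])

# In a float Series, None is converted to NaN on construction,
# so both entries are reported as missing.
missing_mask = s.isna()
n_missing = int(missing_mask.sum())

# NaN never compares equal to itself, so == cannot be used to find it.
nan_equals_itself = (np.nan == np.nan)

# Aggregations skip missing values by default (skipna=True).
mean_of_present = s.mean()

# The nullable integer dtype uses pd.NA instead of NaN.
ints = pd.Series([1, None, 3], dtype="Int64")
na_is_na = ints[1] is pd.NA

print(n_missing, nan_equals_itself, mean_of_present, na_is_na)
```

Whatever the original reason for the absence was, the library only records that the value is absent.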


Below I will show that the existing approach only partially reflects reality and that in many cases it can be extended specifically for use in Data Science.


Missing data ( AlkanSte !)


, (sample), , .



: , , . . .



  • : . , Null.
  • : , . .
  • : . , , . , .


  • outlier: " " 1000 . 1000 Null.

, .


Null . , Null " ", . Null " ", , . " " ( ).


ML , Null , .


Null


. โ€” . . :


  1. , , . .
  2. , . .
  3. . , . , , . , . .
  4. , , : , , .. .
  5. . .

Null. , ^ , : " ", " ", " ", "", " ". Null . , . , , .


- .


There is also a downside to replacing Null with several more detailed classes. Null is an abstraction at the level of data types, at the language level, which gives us many built-in functions and methods for data processing.


In effect, however, we are only adding new classes to our classification system, which does not complicate processing much.
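
One possible way to keep the built-in Null machinery while still recording more detailed classes is sketched below. It is only an illustration under stated assumptions, not a prescribed method. The column names, the `REASONS` labels (borrowed from the four meanings of Null listed earlier), and the imputation rule are all hypothetical. pandas is assumed to be installed.

```python
import numpy as np
import pandas as pd

# Hypothetical labels, taken from the four meanings of Null listed earlier.
REASONS = ["not available", "not applicable", "missing", "unknown"]

df = pd.DataFrame({
    # The value column keeps the ordinary Null (NaN).
    "income": [52000.0, np.nan, np.nan, 48000.0, np.nan],
    # A companion column records why each value is Null.
    "income_null_reason": pd.Categorical(
        [None, "not applicable", "unknown", None, "missing"],
        categories=REASONS,
    ),
})

# Built-in null handling still works on the value column.
n_null = int(df["income"].isna().sum())
mean_income = df["income"].mean()

# The companion column lets each kind of Null be treated differently.
# Here, as an assumed rule, only "missing" and "unknown" rows are imputed.
imputable = df["income_null_reason"].isin(["missing", "unknown"])
df.loc[imputable, "income"] = mean_income

reason_counts = df["income_null_reason"].value_counts().to_dict()
print(n_null, mean_income, reason_counts)
```

In this sketch the "not applicable" row stays Null after imputation, because filling it with an average would invent a value that cannot exist.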


And, at a minimum, we need to clearly understand what Null values mean in our data. A better understanding of the data will always lead to better results, won't it?

