⛱️ 🚠 🐇 DeepCode: side view 🧔 👩🏾‍💼 👲🏻

Not so long ago, DeepCode, a machine-learning-based static analyzer, began supporting the validation of C and C ++ projects. And now we can see in practice how the results of classical static analysis and static analysis based on machine learning differ.

We already mentioned DeepCode in our article " Using Machine Learning in Static Analysis of the Source Code of Programs ." And soon they published the article " DeepCode adds AI-based static code analysis support for C and C ++ " with the announcement of support for the analysis of projects written in C and C ++.

To look at the results of DeepCode analysis, I decided to check out the PhysX project. I don’t have a goal here to present an analysis of the errors found by PVS-Studio or DeepCode in this project, but I just wanted to take some third-party project and look at its example how the analyzers will work.

Running DeepCode analysis is extremely simple. Analysis of projects posted on GitHub starts automatically. Although there is a problem, since the entire repository was checked in a row, although it contained various projects, and warnings to them ended up being mixed up. Also, the analysis speed indicates that the code is not compiled and many errors may be hidden from the analyzer.

The overall result of DeepCode is as follows:

This does not quite reflect the number of trips, as they are grouped into one warning, as I will show later. Of course, it is obvious that the static C / C ++ analysis in DeepCode is an innovation, and work on it has just begun, but I think it is already possible to draw some conclusions.

Starting to examine DeepCode triggers, I came across the following warning for an error encountered by the analyzer no less than 31 times:

This response caught my eye, since PVS-Studio had a similar diagnosis and, moreover, I examined such a response in another article. When I met him for the first time, it seemed to me that very few people would write a mathematical constant “by hand”.

And PVS-Studio also found such careless use of constants, but there were no more than 31, but 52 trials of this kind. Almost immediately there was a moment where machine learning was inferior to the classical approach. If the error code at least somehow departs from a very typical error of this kind, finding it becomes much more difficult. For classical static analysis, there is no need for the error to occur many, many times - it is enough for the diagnostic developer to come up with a general principle for the appearance of an error.

Moreover, PVS-Studio diagnostics works not only on the Pi number or other frequently used constants, but also on specific ones (for example, the Landau-Ramanujan constant), the erroneous use of which can be difficult to catch on the basis of even a large training base due to their rare use .

Since DeepCode worked strictly on a constant of the form 3.1415, I added out of interest in one place in the code to the constant 4 decimal places (3.14159263), and the operation disappeared.

PVS-Studio works on various options for rounding constants.

Including there is a response to the changed section:

V624 There is probably a misprint in '3.14159263' constant. Consider using the M_PI constant from <math.h>. Crab.cpp 219

And the point here is not that this is some kind of terrible mistake / rough rounding / the possibility of making a typo, but that it confirms that training on the code will be limited to this very code and, if something happens get out of the general pattern or appear only in rare cases, an error or shortcoming can slip unnoticed.

Let's consider one more operation. DeepCode issued one warning on the following code:

bool Shader::loadShaderCode(const char* fragmentShaderCode, ....)
{
  ....
  if (mFragmentShaderSource != NULL)
    free(mFragmentShaderSource);

  ....

  if (mFragmentShaderSource != NULL)
    free(mFragmentShaderSource);
  ....
}

PVS-Studio had more complaints about this code:

V809 Verifying that a pointer value is not NULL is not required. The 'if (mFragmentShaderSource != 0)' check can be removed. Shader.cpp 178
V809 Verifying that a pointer value is not NULL is not required. The 'if (mFragmentShaderSource != 0)' check can be removed. Shader.cpp 194
V774 The 'mFragmentShaderSource' pointer was used after the memory was released. Shader.cpp 194
V586 The 'free' function is called twice for deallocation of the same memory space. Inspect the first argument. Check lines: 179, 195. Shader.cpp 195

You might think that DeepCode triggering simply turned out to be laconic, and PVS-Studio produced a lot of unnecessary triggering, but this is not so, and each triggering here has its own meaning.

The first two operations are related to excessive pointer checking, since such a check is not needed to use the free () function . The third actuation indicates that it is unsafe to use the pointer after it has been released. Even if the pointer itself is not dereferenced, but only checked, it is still suspicious and often indicates a typo. Well, the last operation just points to the same problem that DeepCode discovered.

There was a screenshot in the DeepCode article about the beginning of C / C ++ support , which, as it seemed to us, contains a false positive. The warning indicates a possible buffer overflow, however, the memory for the buffer is allocated taking into account the length of the home and the length of the line, which is further added by + 1. Thus, there is enough space in the buffer for sure.

The problem here may be the lack of verification that the memory was successfully allocated. Perhaps the DeepCode warning in this case says something else: the first sentence of the warning does not particularly help in understanding what the error is and what the analyzer actually swears about.

In this case, perhaps another factor that we wrote about is triggered - the difficulty of creating meaningful warnings. Either warnings after classifying the responses of a trained analyzer are written by people, or formed from comments on commits / documentation. But generating a good warning can be difficult if the error pattern was deduced through machine learning, and not by the developer.

I was wondering what needs to be changed in this code so that the warning disappears. But, strangely enough, on a completely similar section of code in the new file, such a warning did not occur, and another appeared.

Perhaps the point is that the environment that the analyzer was oriented to has disappeared, or I did something wrong. But the instability of warnings, the fact that their appearance is unpredictable, can lead, in my opinion, to inconvenience not only for users, but also for the developers of the analyzer itself and complicate the development process.

In general, although the majority of DeepCode responses are not yet false, their number is so small that the result of its work at the moment is rather scarce than accurate. A small number of false positives is accompanied by a small number of positives in principle.

I’ll give you some interesting errors found by PVS-Studio.

Fragment N1

V773The function was exited without releasing the 'line' pointer. A memory leak is possible. PxTkBmpLoader.cpp 166

bool BmpLoader::loadBmp(PxFileHandle f)
{
  ....
  int lineLen = ....;
  unsigned char* line = static_cast<unsigned char*>(malloc(lineLen));
  for(int i = info.Height-1; i >= 0; i--)
  {
    num = fread(line, 1, static_cast<size_t>(lineLen), f);
    if (num != static_cast<size_t>(lineLen)) { fclose(f); return false; }
    ....
  }
  free(line);
  return true;
}

Here, in case the reading from the file was interrupted, a memory leak occurs, since the memory is not freed before returning from the function.

Fragment N2

V595 The 'gSphereActor' pointer was utilized before it was verified against nullptr. Check lines: 228, 230. SnippetContactReportCCD.cpp 228

void initScene()
{
  ....
  gSphereActor = gPhysics->createRigidDynamic(spherePose);
  gSphereActor->setRigidBodyFlag(PxRigidBodyFlag::eENABLE_CCD, true);

  if (!gSphereActor)
    return;
  ....
}

The gSphereActor pointer is dereferenced, but immediately after that it is checked for nullptr, and the function exits. That is, a null pointer is possible here, but dereferencing still occurs. There were 24 errors of this kind in the PhysX project.

DeepCode gave only positives for a specific type of cases (as I understand it), where the pointer was initially initialized to zero, and there were no other assignments in the field of view. PVS-Studio didn’t work on such a code, since most often such a fire will be false, because the pointer can get a value in another translation unit (most of the fires were on global variables). This example shows that fewer false positives in trained static analysis are not necessarily true.

Here DeepCode, most likely, for some reason indicates the wrong initialization. Instead initialization gCooking highlighted initialization gFoundation , although the index is not further illuminated.

Fragment N3

V517 The use of 'if (A) {...} else if (A) {...}' pattern was detected. There is a probability of logical error presence. Check lines: 266, 268. AABox.cpp 266

bool AABox::initRT(int depthSamples, int coverageSamples)
{
  ....
  int query;
  ....
  if (....)  
  {
    ....
    if ( query < coverageSamples)
      ret = false;
    else if ( query > coverageSamples) 
      coverageSamples = query;
    ....
    if ( query < depthSamples)
      ret = false;
    else if ( query > depthSamples)
      depthSamples = query;
  }
  else {
    ....
    if ( query < depthSamples)
      ret = false;
    else if ( query < depthSamples) // <=
      depthSamples = query;
    ....
  }
  ....
}

Here it seems that the copy-paste error has crept in. The analyzer detected the same condition in if and else . From the previous if-else, you can see that the ">" in if and the "<" in else are usually checked . Although the pattern looks extremely simple, I did not find such a response among the DeepCode warnings, although it would seem to be a very simple pattern to detect.

Conclusion

Of course, DeepCode just announced support for C / C ++ and it is part of their product that has just begun to develop. You can try it on your projects as well, in their service a convenient feedback window has been implemented in case you have been given a false warning.

However, we can now see the shortcomings that accompany machine learning in static analysis. Therefore, we are skeptical of the idea of adding machine learning to static analyzers, since the gain is doubtful: the exact same analyzer support, writing or editing documentation, and working on the diagnostics themselves are necessary. In addition, problems encountered during development require more complex solutions and specific skills from developers, which reduces the number of potential candidates when expanding the team. Indeed, the developer in this case should not only be a specialist in the analyzed language, but also possess knowledge in the field of machine learning.

If you want to share this article with an English-speaking audience, then please use the translation link: Victoria Khanieva. DeepCode: Outside Perspective .

DeepCode: side view

More articles: