On the benefit of automatically filtering duplicate messages

Dec 14 2011

Author: Evgenii Ryzhkov

PVS-Studio has eliminated duplicate messages from the very beginning. For example, if a diagnostic message is generated for code in an .h file that is included in several .cpp files, our tool shows it only once. Some other analyzers don't do that: when they check .cpp files, they repeat the message for the same lines of the .h file every time it is included. As a result, our analyzer produces fewer messages than such tools. Until now we had no chance to estimate how much this helps. Now we have had such an occasion, and the results are really impressive.

To clarify the point, let me first give a code sample. Assume the Foo class is defined in the Foo.h file:

class Foo {
  int iChilds[2];
  ...
  bool hasChilds() const { return(iChilds[0] > 0 || iChilds[0] > 0); }
  ...
};

There are two files, Usage.cpp and Play.cpp, both containing the following line:

#include "Foo.h"

When these files are checked, the following message is generated: "V501. There are identical sub-expressions to the left and to the right of the 'foo' operator" (in this sample, the operator is '||'). The message is generated twice, because two compilation units have been checked, but the user sees it only once, since the repeated message is filtered out automatically.

Without filtering, you would see two V501 messages: one for the Usage.cpp file and another for the Play.cpp file.
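
The filtering described above can be sketched roughly as follows. This is only an illustration, not PVS-Studio's actual implementation: the Message structure, its fields, and the filterDuplicates function are hypothetical names, and we assume that two messages count as duplicates when they point at the same file, the same line, and carry the same diagnostic code.

```cpp
#include <set>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical message record; the real report format is not shown in this article.
struct Message {
    std::string file;  // file the diagnostic points at, e.g. "Foo.h"
    int line;          // line inside that file
    std::string code;  // diagnostic code, e.g. "V501"
    std::string unit;  // compilation unit that was being checked
};

// Keeps the first occurrence of each (file, line, code) triple, in input order.
// The compilation unit is deliberately left out of the key (an assumption):
// the same defect in Foo.h is one message no matter which .cpp file included it.
std::vector<Message> filterDuplicates(const std::vector<Message>& raw) {
    std::set<std::tuple<std::string, int, std::string>> seen;
    std::vector<Message> unique;
    for (const Message& m : raw) {
        // emplace().second is true only when the key was not in the set yet.
        if (seen.emplace(m.file, m.line, m.code).second) {
            unique.push_back(m);
        }
    }
    return unique;
}
```

With the sample above, the V501 message for Foo.h reported while checking Usage.cpp and the one reported while checking Play.cpp share the same key, so only the first of them is kept.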

We recently checked the source code of Mozilla Firefox. Although Firefox is built with Visual C++, it has no .sln files and is compiled through a makefile. So we added a call to the console version of PVS-Studio for each file directly to this makefile (as described in the documentation). In this mode, all messages are written one after another into one large "raw" report file, which can later be opened with PVS-Studio from Visual Studio and then saved as .plog (PVS-Studio's XML report). During this conversion, repeated messages are filtered out automatically.

The "raw" report contained about 2 000 000 messages (with numerous duplicates). The converted report contained only 80 000 messages, i.e. 25 times fewer: about 96% of the raw messages were duplicates. This ratio lets us estimate how many duplicate messages are filtered out automatically.

This example also confirms the idea that a static analyzer is a complex system, and that it is not enough just to print error messages to stdout.