Andrey Karpov

Nov 01 2010

Tags:

#Cpp

Difference of code analysis approaches in compilers and specialized tools

Nov 01 2010

Author: Andrey Karpov

Notepad++
Static analysis by compiler
Third-party static code analyzers
Some figures
UPDATE: Note

Compilers and third-party static code analyzers have one common task: to detect dangerous code fragments. However, there is a great difference in the types of analysis performed by each kind of these tools. I will try to show you the differences between these two approaches (and explain their source) by the example of the Intel C++ compiler and PVS-Studio analyzer.

This time, it is the Notepad++ 5.8.2 project that we chose for the test.

Notepad++

At first a couple of words about the project we have chosen. Notepad++ is an open-source and free source code editor that supports many languages and appears a substitute for the standard Notepad. It works in the Microsoft Windows environment and is released under the GPL license. What I liked about this project is that it is written in C++ and has a small size - just 73000 lines of code. But what is the most important, this is a rather accurate project - it is compiled by presence of the /W4 switch in the project's settings and /WX switch that makes analyzers treat each warning as an error.

Static analysis by compiler

Now let's study the analysis procedure from the viewpoints of a compiler and a separate specialized tool. The compiler is always inclined to generating warnings after processing only very small local code fragments. This preference is a consequence of very strict performance requirements imposed on the compiler. It is no coincidence that there exist tools of distributed project build. The time needed to compile medium and large projects is a significant factor influencing the choice of development methodology. So if developers can get a 5% performance gain out of the compiler, they will do it.

Such optimization makes the compiler solider and actually such steps as preprocessing, building AST and code generation are not so distinct. For instance, I may say relying on some indirect signs that Visual C++ uses different preprocessor algorithms when compiling projects and generating preprocessed "*.i" files. The compiler also does not need (it is even harmful for it) to store the whole AST. Once the code for some particular nodes is generated and they are no more needed, they get destroyed right away. During the compilation process, AST may never exist in the full form. There is simply no need for that - we parse a small code fragment, generate the code and go further. This saves memory and cache and therefore increases speed.

The result of this approach is "locality" of warnings. The compiler consciously saves on various structures that could help it detect higher-level errors. Let's see in practice what local warnings Intel C++ will generate for the Notepad++ project. Let me remind you that the Notepad++ project is built with the Visual C++ compiler without any warnings with the /W4 switch enabled. But the Intel C++ compiler certainly has a different set of warnings and I also set a specific switch /W5 [Intel C++]. Moreover, I would like to have a look at what the Intel C++ compiler calls "remark".

Let's see what kinds of messages we get from Intel C++. Here it found four similar errors where the CharUpper function is being handled (SEE NOTE AT THE END). Note the "locality" of the diagnosis - the compiler found just a very dangerous type conversion. Let's study the corresponding code fragment:

wchar_t *destStr = new wchar_t[len+1];
...
for (int j = 0 ; j < nbChar ; j++)
{
  if (Case == UPPERCASE)
    destStr[j] =
      (wchar_t)::CharUpperW((LPWSTR)destStr[j]);
  else
    destStr[j] =
      (wchar_t)::CharLowerW((LPWSTR)destStr[j]);
}

Here we see strange type conversions. The Intel C++ compiler warns us: "#810: conversion from "LPWSTR={WCHAR={__wchar_t} *}" to "__wchar_t" may lose significant bits". Let's look at the CharUpper function's prototype.

LPTSTR WINAPI CharUpper(
  __inout  LPTSTR lpsz
);

The function handles a string and not separate characters at all. But here a character is cast to a pointer and some memory area is modified by this pointer. How horrible.

Well, actually this is the only horrible issue detected by Intel C++. All the rest are much more boring and are rather inaccurate code than error-prone code. But let's study some other warnings too.

The compiler generated a lot of #1125 warnings:

"#1125: function "Window::init(HINSTANCE, HWND)" is hidden by "TabBarPlus::init" -- virtual function override intended?"

These are not errors but just poor naming of functions. We are interested in this message for a different reason: although it seems to involve several classes for the check, the compiler does not keep special data - it must store diverse information about base classes anyway, that is why this diagnosis is implemented.

The next sample. The message "#186: pointless comparison of unsigned integer with zero" is generated for the meaningless comparisons:

static LRESULT CALLBACK hookProcMouse(
  UINT nCode, WPARAM wParam, LPARAM lParam)
{
  if(nCode < 0)
  {
    ...
    return 0;
  }
  ...
}

The "nCode < 0" condition is always false. It is a good example of good local diagnosis. You may easily find an error this way.

Let's consider the last warning by Intel C++ and get finished with it. I think you have understood the concept of "locality".

void ScintillaKeyMap::showCurrentSettings() {
  int i = ::SendDlgItemMessage(...);
  ...
  for (size_t i = 0 ; i < nrKeys ; i++)
  {
    ...
  }
}

Again we have no error here. It is just poor naming of variables. The "i" variable has the "int" type at first. Then a new "i" variable of the "size_t" type is defined in the "for()" operator and is being used for different purposes. At the moment when "size_t i" is defined, the compiler knows that there already exists a variable with the same name and generates the warning. Again, it did not require the compiler to store any additional data - it must remember anyway that the "int i" variable is available until the end of the function's body.

Third-party static code analyzers

Now let's consider specialized static code analyzers. They do not have such severe speed restrictions since they are launched ten times less frequently than compilers. The speed of their work might get tens of times slower than code compilation but it is not crucial: for instance, the programmer may work with the compiler at day and launch a static code analyzer at night to get a report about suspicious fragments on the morning. It is quite a reasonable approach.

While paying with slow-down for their work, static code analyzers can store the whole code tree, traverse it several times and store a lot of additional information. It lets them find "spreaded" and high-level errors.

Let's see what the PVS-Studio static analyzer can find in Notepad++. Note that I am using a pilot version that is not available for download yet. We will present the new free general-purpose rule set in 1-2 months within the scope of PVS-Studio 4.00.

Surely, the PVS-Studio analyzer finds errors that may be referred to "local" like in case of Intel C++. This is the first sample:

bool _isPointXValid;
bool _isPointYValid;
bool isPointValid() {
  return _isPointXValid && _isPointXValid;
};

The PVS-Studio analyzer informs us: "V501: There are identical sub-expressions to the left and to the right of the '&&' operator: _isPointXValid && _isPointXValid".

I think the error is clear to you and we will not dwell upon it. The diagnosis is "local" because it is enough to analyze one expression to perform the check.

Here is one more local error causing incomplete clearing of the _iContMap array:

#define CONT_MAP_MAX 50
int _iContMap[CONT_MAP_MAX];
...
DockingManager::DockingManager()
{
  ...
  memset(_iContMap, -1, CONT_MAP_MAX);
  ...
}

Here we have the warning "V512: A call of the memset function will lead to a buffer overflow or underflow". This is the correct code:

memset(_iContMap, -1, CONT_MAP_MAX * sizeof(int));

And now let's go over to more interesting issues. This is the code where we must analyze two branches simultaneously to see that there is something wrong:

void TabBarPlus::drawItem(
  DRAWITEMSTRUCT *pDrawItemStruct)
{
  ...
  if (!_isVertical)
    Flags |= DT_BOTTOM;
  else
    Flags |= DT_BOTTOM;
  ...
}

PVS-Studio generates the message "V523: The 'then' statement is equivalent to the 'else' statement". If we review the code nearby, we may conclude that the author intended to write this text:

if (!_isVertical)
  Flags |= DT_VCENTER;
else
  Flags |= DT_BOTTOM;

And now get brave to meet a trial represented by the following code fragment:

void KeyWordsStyleDialog::updateDlg() 
{
  ...
  Style & w1Style =
    _pUserLang->_styleArray.getStyler(STYLE_WORD1_INDEX);
  styleUpdate(w1Style, _pFgColour[0], _pBgColour[0],
    IDC_KEYWORD1_FONT_COMBO, IDC_KEYWORD1_FONTSIZE_COMBO,
    IDC_KEYWORD1_BOLD_CHECK, IDC_KEYWORD1_ITALIC_CHECK,
    IDC_KEYWORD1_UNDERLINE_CHECK);

  Style & w2Style =
    _pUserLang->_styleArray.getStyler(STYLE_WORD2_INDEX);
  styleUpdate(w2Style, _pFgColour[1], _pBgColour[1],
    IDC_KEYWORD2_FONT_COMBO, IDC_KEYWORD2_FONTSIZE_COMBO,
    IDC_KEYWORD2_BOLD_CHECK, IDC_KEYWORD2_ITALIC_CHECK,
    IDC_KEYWORD2_UNDERLINE_CHECK);

  Style & w3Style =
    _pUserLang->_styleArray.getStyler(STYLE_WORD3_INDEX);
  styleUpdate(w3Style, _pFgColour[2], _pBgColour[2],
    IDC_KEYWORD3_FONT_COMBO, IDC_KEYWORD3_FONTSIZE_COMBO,
    IDC_KEYWORD3_BOLD_CHECK, IDC_KEYWORD3_BOLD_CHECK,
    IDC_KEYWORD3_UNDERLINE_CHECK);

  Style & w4Style =
    _pUserLang->_styleArray.getStyler(STYLE_WORD4_INDEX);
  styleUpdate(w4Style, _pFgColour[3], _pBgColour[3],
    IDC_KEYWORD4_FONT_COMBO, IDC_KEYWORD4_FONTSIZE_COMBO,
    IDC_KEYWORD4_BOLD_CHECK, IDC_KEYWORD4_ITALIC_CHECK,
    IDC_KEYWORD4_UNDERLINE_CHECK);
  ...
}

I can say that I am proud of our analyzer PVS-Studio that managed to find an error here. I think you have hardly noticed it or just have skipped the whole fragment to see the explanation. Code review is almost helpless before this code. But the static analyzer is patient and pedantic: "V525: The code containing the collection of similar blocks. Check items '7', '7', '6', '7' in lines 576, 580, 584, 588".

I will abridge the text to point out the most interesting fragment:

styleUpdate(...
  IDC_KEYWORD1_BOLD_CHECK, IDC_KEYWORD1_ITALIC_CHECK,
  ...);
styleUpdate(...
  IDC_KEYWORD2_BOLD_CHECK, IDC_KEYWORD2_ITALIC_CHECK,
  ...);
styleUpdate(...
  IDC_KEYWORD3_BOLD_CHECK, !!! IDC_KEYWORD3_BOLD_CHECK !!!,
  ...);
styleUpdate(...
  IDC_KEYWORD4_BOLD_CHECK, IDC_KEYWORD4_ITALIC_CHECK,
  ...);

This code was most likely written by the Copy-Paste method. As a result, it is IDC_KEYWORD3_BOLD_CHECK which is used instead of IDC_KEYWORD3_ITALIC_CHECK. The warning looks a bit strange reporting about numbers '7', '7', '6', '7'. Unfortunately, it cannot generate a clearer message. These numbers arise from macros like these:

#define IDC_KEYWORD1_ITALIC_CHECK (IDC_KEYWORD1 + 7)
#define IDC_KEYWORD3_BOLD_CHECK (IDC_KEYWORD3 + 6)

The last cited sample is especially significant because it demonstrates that the PVS-Studio analyzer processed a whole large code fragment simultaneously, detected repetitive structures in it and managed to suspect something wrong relying on heuristic method. This is a very significant difference in the levels of information processing performed by compilers and static analyzers.

Some figures

Let's touch upon one more consequence of "local" analysis performed by compilers and more global analysis of specialized tools. In case of "local analysis", it is difficult to make it clear if some issue is really dangerous or not. As a result, there are ten times more false alarms. Let me explain this by example.

When we analyzed the Notepad++ project, PVS-Studio generated only 10 warnings. 4 messages out of them indicated real errors. The result is modest, but general-purpose analysis in PVS-Studio is only beginning to develop. It will become one of the best in time.

When analyzing the Notepad++ project with the Intel C++ compiler, it generated 439 warnings and 3139 remarks. I do not know how many of them point to real errors, but I found strength to review some part of these warnings and saw only 4 real issues related to CharUpper (see the above description).

3578 messages are too many for a close investigation of each of them. It turns out that the compiler offers me to consider each 20-th line in the program (73000 / 3578 = 20). Well, come on, it's not serious. When you are dealing with a general-purpose analyzer, you must cut off as much unnecessary stuff as possible.

Those who tried the Viva64 rule set (included into PVS-Studio) may notice that it produces the same huge amount of false alarms. But we have a different case there: we must detect all the suspicious type conversions. It is more important not to miss an error than not to produce a false alarm. Besides, the tool's settings provide a flexible filtering of false alarms.

UPDATE: Note

It turned out that I had written a wrong thing here. There is no error in the sample with CharUpperW but nobody corrected me. I noticed it myself when I decided to implement a similar rule in PVS-Studio.

The point is that CharUpperW can handle both strings and individual characters. If the high-order part of a pointer is zero, the pointer is considered a character and not pointer any more. Of course, the WIN API interface in this place disappointed me by its poorness, but the code in Notepad++ is correct.

By the way, it turns out now that Intel C++ has not found any errors at all.

#Cpp