What comments hide

Sep 20 2012

Author: Dmitry Novikov

Much has been said about the benefits and harm of comments in program code, and no consensus has been reached yet. We decided to look at comments from a different angle: can they point a programmer studying the code to hidden errors?

While examining various projects for errors, we noticed that programmers sometimes see a defect but cannot track down its cause. Suspicion then falls on the compiler: my colleague recently discussed this effect in the article "The compiler is to blame for everything". As a result, programmers add workarounds ("crutches") to the code and leave comments next to them. These are often obscene.

We decided this was an interesting subject to investigate. Reviewing files manually or running an ordinary word-by-word search is slow and tedious, so we wrote a utility that searches ".c" and ".cpp" files for suspicious comments using a dictionary of "suspicious words". The dictionary includes words such as fuck, bug, stupid, and compiler.
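The utility itself is not shown in this article, so the following is only a minimal sketch of the idea, assuming the simplest possible approach: take the text after "//" on each line and look for dictionary words in it. The names lineComment and isSuspicious are hypothetical and are not taken from the real tool.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the text after the first "//" on a line,
// or an empty string if there is none. A real tool would also have to
// handle /* ... */ comments and "//" inside string literals.
std::string lineComment(const std::string& line) {
    const std::size_t pos = line.find("//");
    return pos == std::string::npos ? std::string() : line.substr(pos + 2);
}

// Case-insensitive substring check against a dictionary of lowercase words.
bool isSuspicious(const std::string& comment,
                  const std::vector<std::string>& dictionary) {
    std::string lowered = comment;
    std::transform(lowered.begin(), lowered.end(), lowered.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });
    for (const std::string& word : dictionary) {
        if (lowered.find(word) != std::string::npos) {
            return true;
        }
    }
    return false;
}
```

With a dictionary of {"bug", "stupid", "compiler"}, this sketch flags the line "x = 1; // VC-BUG?: weird" and ignores "int bug = 0;", because the latter has no comment. Note that plain substring matching also flags a harmless comment such as "// debug output".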

The utility gave us a lot of lines with comments of that kind. Picking out the fragments really worth considering was hard, tedious work, and we found much less of interest than we expected.

The goal of the search was to find new patterns of mistakes programmers make. Unfortunately, every defect we found either cannot be diagnosed by static code analysis at all or is already detected by PVS-Studio.

But a negative result is a result too. Most likely we will conclude that searching for strange comments is a dead end: it is too labor-intensive and catches too few bugs.

Still, since the investigation has been carried out, we decided to show you a couple of findings.

Consider this code first:

// Search for EOH (CRLFCRLF)
const char* pc = m_pStrBuffer;
int iMaxOff = m_iStrBuffSize - sizeof(DWORD);
for (int i = 0; i <= iMaxOff; i++) {
  if (*(DWORD*)(pc++) == 0x0A0D0A0D) {
    // VC-BUG?: '\r\n\r\n' results in 0x0A0D0A0D too,
    //although it should not!
    bFoundEOH = true;
    break;
  }
}

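As you can see from the comment "// Search for EOH (CRLFCRLF)", the programmer wanted to find the byte sequence 0D,0A,0D,0A (CR == 0x0D, LF == 0x0A). The four bytes are read as a single DWORD, and on a little-endian platform the first byte in memory becomes the least significant byte of the value, so the bytes appear in reverse order and the search constant is 0x0A0D0A0D.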

The program doesn't seem to cope well with a different order of carriage return and line feed characters. This confused the author, as the comment shows: "// VC-BUG?: '\r\n\r\n' results in 0x0A0D0A0D too, although it should not!". So why does the algorithm find not only the {0D,0A,0D,0A} sequence, but the {0A,0D,0A,0D} sequence too?

The answer is simple. The search moves through the array one byte at a time. If it comes across a longer run such as {0A,0D,0A,0D,0A,0D,0A,...}, the comparison fails at the first 0A, but at the next offset the four bytes are 0D,0A,0D,0A and the comparison succeeds. The algorithm therefore reports a match in data that starts with LF CR rather than CR LF, which is not what the programmer wanted.
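The effect is easy to reproduce. Here is a minimal sketch of the same byte-by-byte search; it compares bytes with memcmp instead of casting to DWORD, so it does not depend on byte order or alignment. The function name findEndOfHeaders is hypothetical and does not come from the project in question.

```cpp
#include <cstddef>
#include <cstring>

// Returns the offset of the first CR LF CR LF run in buf, or -1 if there
// is none. The window is shifted by one byte per iteration, just like the
// pc++ in the original loop.
long findEndOfHeaders(const char* buf, std::size_t size) {
    static const char kEoh[4] = {'\r', '\n', '\r', '\n'};
    for (std::size_t i = 0; i + sizeof(kEoh) <= size; ++i) {
        if (std::memcmp(buf + i, kEoh, sizeof(kEoh)) == 0) {
            return static_cast<long>(i);
        }
    }
    return -1;
}
```

For the five bytes "\n\r\n\r\n" the function returns 1: the data starts with LF CR, yet a CR LF CR LF run is found one byte further. For the four bytes "\n\r\n\r" alone it returns -1.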

Unfortunately, defects like this cannot be caught by static analysis.

Here is one more example of strange code:

TCHAR szCommand[_MAX_PATH * 2];
LPCTSTR lpsz = (LPCTSTR)GlobalLock(hData);
int commandLength = lstrlen(lpsz);
if (commandLength >= _countof(szCommand))
{
  // The command would be truncated.
  //This could be a security problem
  TRACE(_T("Warning: ........\n"));
  return 0;
}
// !!! MFC Bug Fix
_tcsncpy(szCommand, lpsz, _countof(szCommand) - 1);
szCommand[_countof(szCommand) - 1] = '\0';
// !!!

In this case the "MFC Bug Fix" label is simply untrue, because there is no error in MFC here. Written in this form, the code cannot cause errors, but perhaps an earlier version contained only this line: '_tcsncpy(szCommand, lpsz, _countof(szCommand) - 1);'. In that case the error did exist. However, you can implement correct string copying in a shorter way:

_tcsncpy(szCommand, lpsz, _countof(szCommand));

Functions like 'strncpy' automatically pad the destination with terminating nulls when the source string is shorter than the count passed to them. That is exactly our case, because the check above guarantees that commandLength is less than _countof(szCommand). PVS-Studio detects incorrect string copying well, so we haven't learned anything new here.
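This behavior is easy to check with the standard std::strncpy, the narrow-character counterpart of _tcsncpy. The sketch below is only an illustration; the helper copyIsTerminated and its 16-character buffer are hypothetical and are not part of the code under review.

```cpp
#include <cstddef>
#include <cstring>

// Copies src into a buffer pre-filled with 'X', passing 'count' to strncpy,
// and reports whether a terminating null ended up within the first 'count'
// characters of the buffer.
bool copyIsTerminated(const char* src, std::size_t count) {
    char buffer[16];
    if (count > sizeof(buffer)) {
        count = sizeof(buffer);  // keep the sketch within its fixed buffer
    }
    std::memset(buffer, 'X', sizeof(buffer));
    std::strncpy(buffer, src, count);
    return std::memchr(buffer, '\0', count) != nullptr;
}
```

copyIsTerminated("abc", 4) returns true, because the source is shorter than the count and strncpy writes the terminating null. copyIsTerminated("abcd", 4) returns false: the source fills the count exactly and no null is written. This is why the length check before the copy matters.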

Conclusion

We haven't managed to find any new error patterns to add to the database of errors detected by our static analyzer. Still, it was a useful experience in investigating alternative methods of software defect detection. For some time we will continue studying comments in the new projects we get for analysis. We also plan to make some improvements to the search utility:

implement a simple syntactic analysis to decrease detections of "uninteresting" lines;
extend the dictionary with new expressions.

Perhaps this program can be useful when you "inherit" a large project with a long code history and would like to see what your predecessors didn't like there.
