Andrey Karpov

Apr 02 2014

Tags:

#Cpp

Archeology for Entertainment, or Checking Microsoft Word 1.1a with PVS-Studio

Apr 02 2014

Author: Andrey Karpov

Where to find the source files
Checking Word 1.1a
A few words about Word 1.1a's code
Analysis results
Conclusion

The Microsoft company has recently made a present to all programmers eager to dig into some interesting stuff: they revealed the source codes of MS-DOS v 1.1, v 2.0 and Word for Windows 1.1a. The MS-DOS operating system is written in assembler, so the analyzer cannot be applied to it. But Word is written in C. Word 1.1a's source codes are almost 25 years old, but we still managed to analyze it. There's no practical use of it, of course. Just for fun.

Analysis based on pre-generated preprocessed files is no longer supported. Please consider using the Compiler Monitoring mode instead. This mode allows a generation and analysis of preprocessed files automatically during regular build process.

Where to find the source files

Perhaps many will like not this article itself, but the fact that one can download the source codes of MS-DOS v 1.1, v 2.0 and Word for Windows 1.1a. Those interested in digging the source files on their own should check the original source.

The press release: Computer History Museum Makes Historic MS-DOS and Word for Windows Source Code Available to the Public.

Checking Word 1.1a

Figure 1. Word for Windows 1.1a (click on the picture to enlarge).

Word for Windows 1.1a was released in 1990. Its source code was made public available on March 25, 2014. Word has always been a flagship product of Microsoft, and I, as well as many other programmers, was very eager to peek into the inside of the software product that so much contributed to Microsoft's commercial success.

I decided to check Word 1.1a's code with our tool PVS-Studio. It is a static analyzer for C/C++ code. That task was not that easy to fulfill, of course, as the analyzer is designed to work with projects developed at least in Visual Studio 2005. And now I had C source codes more than 20 years old. We can fairly call them a find from the prehistoric times. At least, the C language standard didn't exist then yet and every compiler had to be by itself. Fortunately, Word 1.1a's source codes appeared to be free of any specific nuances and abuse of non-standard compiler extensions.

Before you can perform code analysis, you need to get preprocessed files (*.i). Once you have generated them, you can use the PVS-Studio Standalone tool to run the analysis and examine the diagnostic messages. Of course, the analyzer is not designed to check 16-bit programs, but the results I got were quite enough to satisfy my curiosity. After all, a meticulous analysis of a 24 year old project just wouldn't make any sense.

So the basic obstacle was in obtaining the preprocessed files for the source codes. I asked my co-worker to find some solution, and he approached the task with much creativity: he chose to use GCC 4.8.1 to get the preprocessed files. I guess no one has ever mocked at Word 1.1's source codes in such a cruel way. How could it have occurred to him at all to use GCC? That ingenious imagination of his!

What's most interesting, it all turned out pretty fine. He wrote a small utility to run preprocessing by GCC 4.8.1 of each file from the folder it was stored in. As it displayed error messages regarding troubles with locating and including header files, we added -I switches into the launch parameters to specify the paths to the required files. A couple of header files which we failed to find were created empty. All the rest troubles with #include expanding were related to including resources, so we commented them out. The WIN macro was defined for preprocessing as the code contained branches both for WIN and MAC.

After that, PVS-Studio Standalone and I came into play. I noted down a few suspicious code fragments I want to show you. But let's first speak a bit more about the project itself.

A few words about Word 1.1a's code

The most complex functions

The following functions showed the highest cyclomatic complexity:

CursUpDown - 219;
FIdle - 192;
CmdDrCurs1 - 142.

#ifdef WIN23

While looking through the source codes, I came across "#ifdef WIN23" and couldn't help smiling. I even noted that fragment down. I thought it was a typo and the correct code was #ifdef WIN32.

When I saw WIN23 for the second time, I grew somewhat doubtful. And just then it struck me that I was viewing source files as old as 24 years by the moment. WIN23 stood for Windows 2.3.

Stern times

In some code fragment, I stumbled upon the following interesting line.

Assert((1 > 0) == 1);

It seems incredible that this condition can ever be false. But since there is such a check, there must be the reason for it. There was no language standard at that time. As far as I get it, it was a good style to check that the compiler's work met programmers' expectations.

Well, if we agree to treat K&R as a standard, the ((1 > 0) == 1) condition is always true, of course. But K&R was just a de facto standard. So it's just a check of the compiler's adequacy.

Analysis results

Now let's discuss the suspicious fragments I have found in the code. I guess it's the main reason why you are reading this article. So here we go.

Infinite loop

void GetNameElk(elk, stOut)
ELK elk;
unsigned char *stOut;
{
  unsigned char *stElk = &rgchElkNames[mpelkichName[elk]];
  unsigned cch = stElk[0] + 1;

  while (--cch >= 0)
    *stOut++ = *stElk++;
}

PVS-Studio's diagnostic message: V547 Expression '-- cch >= 0' is always true. Unsigned type value is always >= 0. mergeelx.c 1188

The "while (--cch >= 0)" loop will never terminate. The 'cch' variable is unsigned, which means that it will always be >= 0, however long you may decrease it.

A typo leading to an array overrun

uns rgwSpare0 [5];

DumpHeader()
{
  ....
  printUns ("rgwSpare0[0]   = ", Fib.rgwSpare0[5], 0, 0, fTrue);
  printUns ("rgwSpare0[1]   = ", Fib.rgwSpare0[1], 1, 1, fTrue);
  printUns ("rgwSpare0[2]   = ", Fib.rgwSpare0[2], 0, 0, fTrue);
  printUns ("rgwSpare0[3]   = ", Fib.rgwSpare0[3], 1, 1, fTrue);
  printUns ("rgwSpare0[4]   = ", Fib.rgwSpare0[4], 2, 2, fTrue);
  ....
}

PVS-Studio's diagnostic message: V557 Array overrun is possible. The '5' index is pointing beyond array bound. dnatfile.c 444

It turned out that the first line for some reason contains the text Fib.rgwSpare0[5]. That's incorrect: there are just 5 items in the array, therefore the largest index should be 4. The value '5' is just a typo. A zero index should have most likely been used in the first string:

printUns ("rgwSpare0[0]   = ", Fib.rgwSpare0[0], 0, 0, fTrue);

Uninitialized variable

FPrintSummaryInfo(doc, cpFirst, cpLim)
int doc;
CP cpFirst, cpLim;
{
  int fRet = fFalse;
  int pgnFirst = vpgnFirst;
  int pgnLast = vpgnLast;
  int sectFirst = vsectFirst;
  int sectLast = sectLast;
  ....
}

PVS-Studio's diagnostic message: V573 Uninitialized variable 'sectLast' was used. The variable was used to initialize itself. print2.c 599

The 'sectLast' variable is assigned to itself:

int sectLast = sectLast;

I suspect that it should have been initialized to the 'vsectLast' variable instead:

int sectLast = vsectLast;

I found one more error of that kind - must be a consequence of using the Copy-Paste method:

V573 Uninitialized variable 'sectLast' was used. The variable was used to initialize itself. print2.c 719

Undefined behavior

CmdBitmap()
{
  static int  iBitmap = 0;
  ....
  iBitmap = ++iBitmap % MAXBITMAP;
}

PVS-Studio's diagnostic message: V567 Undefined behavior. The 'iBitmap' variable is modified while being used twice between sequence points. ddedit.c 107

I don't know how people used to treat such code 20 years ago, but in our times it is treated as hooliganism as it leads to undefined behavior.

Other fragments with similar issues:

V567 Undefined behavior. The 'iIcon' variable is modified while being used twice between sequence points. ddedit.c 132
V567 Undefined behavior. The 'iCursor' variable is modified while being used twice between sequence points. ddedit.c 150

Unsuccessful call of the printf() function

ReadAndDumpLargeSttb(cb,err)
  int     cb;
  int     err;
{
  ....
  printf("\n - %d strings were read, "
         "%d were expected (decimal numbers) -\n");
  ....
}

PVS-Studio's diagnostic message: V576 Incorrect format. A different number of actual arguments is expected while calling 'printf' function. Expected: 3. Present: 1. dini.c 498

The printf() function is a variadic function. Passing or not passing arguments to it are both legal. In this case, the programmer forgot about the arguments, and it resulted in printing garbage all the time.

Uninitialized pointers

One of the auxiliary utilities included into the package of Word source files contains a very strange piece of code.

main(argc, argv)
int argc;
char * argv [];
{
  FILE * pfl;
  ....
  for (argi = 1; argi < argc; ++argi)
  {
    if (FWild(argv[argi]))
    {
      FEnumWild(argv[argi], FEWild, 0);
    }
    else
    {
      FEWild(argv[argi], 0);
    }

    fclose(pfl);
  }
  ....
}

PVS-Studio's diagnostic message: V614 Uninitialized pointer 'pfl' used. Consider checking the first actual argument of the 'fclose' function. eldes.c 87

The 'pfl' variable is initialized neither before the loop nor inside it, while the fclose(pfl) function is called multiple times. It all, however, may have worked pretty well. The function would return an error status and the program would go on running.

And here's another dangerous function which will most likely cause a program crash.

FPathSpawn( rgsz )
char *rgsz[];
{ /* puts the correct path at the beginning of rgsz[0]
     and calls FSpawnRgsz */
  char *rgsz0;

  strcpy(rgsz0, szToolsDir);
  strcat(rgsz0, "\\");
  strcat(rgsz0, rgsz[0]);
  return FSpawnRgsz(rgsz0, rgsz);
}

PVS-Studio's diagnostic message: V614 Uninitialized pointer 'rgsz0' used. Consider checking the first actual argument of the 'strcpy' function. makeopus.c 961

The ' rgsz0' pointer is not initialized to anything. However, it doesn't prevent copying a string into it.

Typo in a condition

....
#define wkHdr    0x4000
#define wkFtn    0x2000
#define wkAtn    0x0008
....
#define wkSDoc    (wkAtn+wkFtn+wkHdr)

CMD CmdGoto (pcmb)
CMB * pcmb;
{
  ....
  int wk = PwwdWw(wwCur)->wk;
    if (wk | wkSDoc)
      NewCurWw((*hmwdCur)->wwUpper, fTrue);
  ....
}

PVS-Studio's diagnostic message: V617 Consider inspecting the condition. The '(0x0008 + 0x2000 + 0x4000)' argument of the '|' bitwise operation contains a non-zero value. dlgmisc.c 409

The (wk | wkSDoc) condition is always true. The programmer must have actually meant to write the following code instead:

if (wk & wkSDoc)

That is, the | and & operators are swapped by mistake.

And finally a lengthy but simple sample

int TmcCharacterLooks(pcmb)
CMB * pcmb;
{
  ....
  if (qps < 0)
  {
    pcab->wCharQpsSpacing = -qps;
    pcab->iCharIS = 2;
  }
  else  if (qps > 0)
  {
    pcab->iCharIS = 1;
  }
  else
  {
    pcab->iCharIS = 0;
  }
  ....
  if (hps < 0)
  {
    pcab->wCharHpsPos = -hps;
    pcab->iCharPos = 2;
  }
  else  if (hps > 0)
  {
    pcab->iCharPos = 1;
  }
  else
  {
    pcab->iCharPos = 1;
  }
  ....
}

PVS-Studio's diagnostic message: V523 The 'then' statement is equivalent to the 'else' statement. dlglook1.c 873

When working with the 'qps' variable, the following values are written into 'pcab->iCharIS': 2, 1, 0.

The 'hps' variable is handled in a similar way, but in this case, some suspicious values are saved into the variable 'pcab->iCharPos': 2, 1, 1.

It must be a typo: a zero was most likely meant to be used at the very end.

Conclusion

I have found very few strange fragments. There are two reasons for that. Firstly, I found the code to be skillfully and clearly written. Secondly, the analysis had to be incomplete, while teaching the analyzer the specifics of the old C language wouldn't be of any use.

I hope you have enjoyed a few minutes of interesting reading. Thank you for attention. And welcome to try the PVS-Studio analyzer on your code.

#Cpp