An ideal static analyzer, or why ideals are unachievable

Mar 15 2012

Author: Evgenii Ryzhkov

An ideal static analyzer's characteristics
100% detection of all the types of programming errors
0% false positives
High performance
Integration with my favorite (i.e. every) IDE; ability to work under my favorite (i.e. every) operating system; analysis of code in my favorite (i.e. any) programming language
Free (freeware, open source) and high-quality customer support
Conclusion

Inspired by Eugene Kaspersky's post about an ideal antivirus, I decided to write a similar post about an ideal static analyzer, and along the way to consider how far our PVS-Studio is from that ideal.

An ideal static analyzer's characteristics

If you are not familiar with the notion of static code analysis, please follow the link. Let me enumerate the characteristics right away:

100% detection of all the types of programming errors;
0% false positives;
high performance - "whooosh, and the code is analyzed completely, almost in the same instant";
integration with my favorite (i.e. every) IDE; ability to work under my favorite (i.e. every) operating system; analysis of code in my favorite (i.e. any) programming language;
free (freeware, open source);
high-quality and prompt customer support.

Of course, this ideal can never be achieved, but it shows the direction in which companies developing solutions in this area can head.

100% detection of all the types of programming errors

You should understand that no static analyzer will ever provide 100% error detection. Why? If only because some error types are better detected by dynamic analyzers, and it's ridiculous to try to compete with them in this area. Likewise, dynamic analyzers cannot compete with static ones on some rule types.

It's difficult to obtain 100% error detection even for the diagnostics that are characteristic of static analysis. First, any living programming language is constantly developing, acquiring new syntax and therefore new ways of making an error. Second, over time people may use even old syntax in rather unusual ways that analyzer developers did not think of.

Finally, a static analyzer has no knowledge of what a program SHOULD contain; it doesn't have AI. If the program says "A is equal to B" while the correct statement is "A is not equal to B", static analysis won't help you with that.
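To make this concrete, here is a minimal sketch in Python. The article itself names no language for examples, so the language, the function name `comparison_ops` and both snippets are assumptions made purely for illustration. Both variants of the condition parse cleanly and differ only in a single operator node, so nothing in the code tells an analyzer which one the programmer meant.

```python
import ast

# Two versions of the same hypothetical function. Only the programmer
# knows which comparison matches the requirements.
WRITTEN = "def check(a, b):\n    return a == b\n"
INTENDED = "def check(a, b):\n    return a != b\n"


def comparison_ops(source):
    """Return the names of the comparison operators found in the source."""
    tree = ast.parse(source)
    return [
        type(op).__name__
        for node in ast.walk(tree)
        if isinstance(node, ast.Compare)
        for op in node.ops
    ]


# Both snippets are valid code; they differ only in the operator used.
print(comparison_ops(WRITTEN))   # ['Eq']
print(comparison_ops(INTENDED))  # ['NotEq']
```

From the analyzer's side both programs look equally reasonable; the requirement that decides between them lives outside the code.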

That's why the only real way out is to keep creating new diagnostic rules. This will never give you 100% error detection, but it will keep you close to it all the time.

0% false positives

Any static analyzer produces false positives because, in the end, only the programmer KNOWS what exactly the code IS SUPPOSED TO do. An analyzer only sees what the code really DOES and tries to UNDERSTAND what it SHOULD do.

Returning to the previous section about "100% detection of all the errors", one can make a naïve suggestion: "Why, let's detect everything that moves and we'll be happy!" That is, let's flag everything that looks even slightly like an error. But this approach is wrong because the number of false positives will get out of hand. And there is an opinion that when a user sees 10 false positives in a row, he/she closes the tool and never comes back to it.

We have the following ways to reduce the number of false positives:

Constantly reworking existing rules to refine their formulations. For example, if a rule was triggered 100 times in a test project and 50 of those were false positives, refining the rule can reduce this number to 10. You may lose 1 or 2 real warnings along the way, but that is the eternal issue of compromise.
Retiring rules that are no longer relevant. If you only add new rules and never remove (turn off) obsolete ones, some of your diagnostics lose their relevance with time.
Having useful tools to handle false positives. For instance, PVS-Studio provides a mechanism to suppress false positives: once you mark a message as a false report, you won't see it next time.
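The trade-off in the first point can be put into numbers. The sketch below assumes one reading of the example: refinement brings the false positives down from 50 to 10 and costs 2 of the 50 real warnings. The helper `precision` is a hypothetical name introduced here, not part of any tool.

```python
def precision(real_warnings, false_positives):
    """Share of reported warnings that point at real problems."""
    return real_warnings / (real_warnings + false_positives)


# Before refinement: 100 triggers, 50 of them false positives.
before = precision(real_warnings=50, false_positives=50)

# After refinement (assumed reading): 10 false positives remain,
# and 2 of the 50 real warnings are lost.
after = precision(real_warnings=48, false_positives=10)

print(f"before: {before:.2f}, after: {after:.2f}, real warnings lost: {50 - 48}")
# prints: before: 0.50, after: 0.83, real warnings lost: 2
```

Under these assumptions the user reads 58 messages instead of 100, and most of them are now worth reading; the price is two missed defects.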

High performance

Everybody wants software to work fast, but it's not always possible. Code analysis usually requires more resources than compilation, because the compiler checks only for very crude errors, while the analyzer's aim is to perform fine-grained analysis. Of course, it needs more data for that. The more data there is, the deeper the analysis and the more interesting the errors that can be found.

An obvious way to improve performance is to use several processor cores when analyzing the code. This is rather easy to implement in static analyzers: each file is checked separately and the results are then simply combined.

Less obvious is an attempt to check just a code fragment instead of a whole compilation unit (a file). This is a very complicated task and, in the general case, quite difficult to solve (for any language). You have to find and "calculate" data types, analyze the classes being used and so on. The cost of "extracting" the part you need to analyze might be even higher than that of simply analyzing the whole file.
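The "check each file separately, then combine" scheme can be sketched as follows. This is an illustration under stated assumptions, not how PVS-Studio is implemented: the rule (flagging calls to `gets`), the function names and the in-memory "files" are all made up for the example. It uses a thread pool to stay self-contained; for CPU-bound analysis in Python, `ProcessPoolExecutor` from the same module would be the way to actually occupy several cores.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_file(item):
    """Check one file in isolation and return its warnings."""
    name, text = item
    return [
        (name, lineno, "use of gets()")
        for lineno, line in enumerate(text.splitlines(), start=1)
        if "gets(" in line  # toy rule, assumed for this sketch
    ]


def analyze_project(files, workers=4):
    """Analyze files independently, then concatenate the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so the combined report is stable.
        per_file = pool.map(analyze_file, sorted(files.items()))
        return [warning for warnings in per_file for warning in warnings]


project = {
    "a.c": "int main() {\n  gets(buf);\n}\n",
    "b.c": "int f() { return 0; }\n",
}
print(analyze_project(project))  # [('a.c', 2, 'use of gets()')]
```

No file needs anything from another file's result, which is what makes the combination step trivial. Note also that this naive substring rule would fire on `fgets(` as well, a small reminder of the false-positive problem discussed earlier.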

Integration with my favorite (i.e. every) IDE; ability to work under my favorite (i.e. every) operating system; analysis of code in my favorite (i.e. any) programming language

The issue of providing support for a certain operating system, development environment or analyzable programming language is important in choosing between static analysis tools. To my great surprise, programmers, being the main users of static analyzers, often do not appreciate how difficult it is to support the whole zoo of operating systems they want. But let's take things in order.

Supported (analyzable) programming languages

Programming errors detected by code analyzers can surely occur in every programming language, and these errors have common features: in every language programmers forget to initialize variables, hit the wrong key when typing and so on. But parsing and analyzing a program differ VERY much from language to language.

If an analyzer is announced to support several programming languages, it most likely contains several analysis modules too. This may even be hidden from users! I'm writing this so that people understand that the question "Why don't you make the SAME but for C#/PHP/Java?" implies a great deal of work.

Supported operating systems

It's very naïve to think that a code analyzer "just" handles text and therefore can work in any operating system. Of course, different programming languages are "tied" to the environment to various extents: some more so, like C++; others less so, like PHP.

Where does this difference come from? The point is that large and powerful languages like C++ have several compilers, each with its own differences and subtleties in language syntax. Code written for Windows-based compilers differs slightly yet noticeably from code written for Linux-based compilers. Though this difference is not crucial from the user code's viewpoint, it may matter to a static analyzer: if the code being analyzed contains keywords specific to a particular compiler, the analyzer needs to be "taught" them. In this sense, supporting one more compiler and supporting one more operating system are, generally speaking, equivalent tasks.

Note that this is an easier task for simpler languages than C++.

Thus, the supported operating systems include not only the platforms the analyzer's executable runs on, but also the platforms whose code the analyzer can "understand".
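What "teaching" an analyzer compiler-specific keywords might look like can be sketched with a toy lookup table. The keywords listed are real extensions of the respective compilers (`__declspec` and `__forceinline` in Microsoft Visual C++, `__attribute__` and `__typeof__` in GCC), but the table layout, the profile names and the function `unrecognized_extensions` are assumptions for this illustration; a real parser handles such keywords in its grammar, not with a regular expression.

```python
import re

# Hypothetical per-compiler profiles: extension keywords the analyzer
# has been "taught". Deliberately incomplete.
EXTENSION_KEYWORDS = {
    "msvc": {"__declspec", "__forceinline", "__int64"},
    "gcc": {"__attribute__", "__extension__", "__typeof__"},
}


def unrecognized_extensions(source, compiler):
    """List double-underscore identifiers the chosen profile does not know.

    Toy heuristic: it would also report predefined macros such as
    __FILE__, which a real analyzer resolves during preprocessing.
    """
    known = EXTENSION_KEYWORDS[compiler]
    found = re.findall(r"\b__\w+\b", source)
    return sorted(set(found) - known)


windows_code = "__declspec(dllexport) void f();"
print(unrecognized_extensions(windows_code, "msvc"))  # []
print(unrecognized_extensions(windows_code, "gcc"))   # ['__declspec']
```

The same source line is unremarkable under one profile and unparseable under the other, which is why supporting one more compiler is real work even though the language is nominally the same.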

Supported IDE

There are a lot of development tools for different languages. What is important for users is this:

a static analyzer should be able to integrate with their favorite development environment;
the tool should be able to run automatically at night;
the analyzer should be able to integrate into continuous integration systems.

The last two points are often called "command line support", but they have little to do with the command line as such. Nowadays no one finds it interesting to watch white letters on a black screen; what users want is a conveniently organized report that can be converted into a text file and sent via e-mail or written into the build system's log.

Supporting different IDEs is a difficult, effort-intensive task, as each IDE imposes certain restrictions on its plugins, and these restrictions often vary from one environment to another.
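As a rough sketch of what a nightly or CI run needs from the analyzer, the snippet below turns a list of warnings into a plain-text report and a process exit status. The `path:line: message` layout, the function names and the convention of a non-zero status when warnings exist are all assumptions for illustration; actual tools and build systems define their own formats.

```python
def render_report(warnings):
    """Format warnings as plain text suitable for a log file or an e-mail."""
    lines = [f"{path}:{line}: {message}" for path, line, message in warnings]
    lines.append(f"{len(warnings)} warning(s) found")
    return "\n".join(lines)


def exit_status(warnings):
    """Assumed convention: non-zero status lets a CI job flag the build."""
    return 1 if warnings else 0


# Hypothetical warnings as an analysis run might produce them.
found = [
    ("src/main.cpp", 42, "variable 'n' is used uninitialized"),
    ("src/util.cpp", 7, "identical sub-expressions around '=='"),
]
print(render_report(found))
print("exit status:", exit_status(found))
```

The point is that the result is data a build system can store, mail or act upon, with nobody watching a console.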

Free (freeware, open source) and high-quality customer support

I've combined two sections into one because they are closely connected.

Static code analysis tools belong to the kind of software for which quality and continuous support are very important. Yes, there are a few tools distributed for free, but I believe they will never catch up with the market leaders (Coverity, Klocwork, Parasoft).

Generally speaking, a static analysis tool can become free and open-source if the developer company is purchased by some giant like Google, Microsoft or Intel, but this is a special case.

Static analysis tools are usually sold under an annually renewable license. Some users might not like it, but I will try to explain why this scheme is the best. And please forgive me if you came to the "Free" section and are now reading about licensing schemes.

As I've already said, customer support is very important for static analysis tools. In the field of static analysis, support means, first of all, handling cases when the analyzer cannot parse user code (because of complex C++ templates, non-standard compiler extensions, etc.). In these cases you need to improve the analyzer promptly (within a few days) so that it can parse the customer's code. User support also includes help with integrating the tool into the customer's development process. Implementing customer requests that make the tool more convenient to use is also necessary.

All this costs money. That's why you cannot sell a license once and support your users for free for the rest of your life.

One could sell new major releases instead, for example, versions v3, v4, v5... What is bad about this scheme is that it makes the developer hold back cool new capabilities of the tool until the next major version instead of releasing them as soon as they are ready.

Thus, annual license renewal appears to be the best way. Some vendors set the renewal price at 100% of the initial price, while others set a lower price (offering a renewal discount). The latter case can be explained this way: the first year's price includes the additional cost of teaching the customer to work with the tool.

So, it appears that a quality tool with quality support cannot be free, unless it is developed by a giant company, and in that case you can forget about targeted individual customer support.

Conclusion

In this article I've tried to show what characteristics an ideal static code analysis tool should possess, that is, how users want it to look. And it is the users, of course, who decide how closely a given tool matches this ideal.