Why Static Analysis Can Improve a Complex C++ Codebase

Aug 03 2019

Author: Andrey Karpov

Gradually and imperceptibly, the complexity of C++ projects has become extreme. Unfortunately, a C++ programmer can no longer manage alone.

Note. I first published this article on the Fluent C++ blog: Why Static Analysis Can Improve a Complex C++ Codebase.

First, there is so much code that a project can no longer count on having even a couple of programmers who know all of it. For example, the Linux 1.0.0 kernel contained about 176,000 lines of code. That's a lot, but with a coffee machine nearby it was possible to review the entire code and understand the general principles of how it works in a couple of weeks. The Linux 5.0.0 kernel, by contrast, already contains about 26 million lines of code, roughly 150 times more. You can only pick a few parts of the project and take part in their development. You can't just sit down and figure out exactly how it all works and how the different parts of the code are interconnected.

Second, the C++ language continues to develop rapidly. On the one hand, this is good: new constructs appear that allow writing more compact and safer code. On the other hand, because of backward compatibility, large old projects become heterogeneous. Old and new approaches to writing code intertwine in them, much like the growth rings in a tree cross-section. As a result, getting into C++ projects becomes more difficult every year. A developer has to understand both code written in the "C with classes" style and code that uses modern approaches (lambdas, move semantics and so on). It takes a long time to fully dig into C++.

Since projects still have to be developed, people start writing C++ code before they have fully studied all its nuances. This leads to additional defects. Still, it would be irrational to sit and wait until every developer knows C++ flawlessly.

Is the situation hopeless? No. A new class of tools comes to the rescue: static code analyzers. At this point many seasoned programmers purse their lips as if I had just handed them a lemon :). We know all about your linters, they say. Lots of warnings, a lot of noise and little use... And what do you mean, a new class of tools?! We ran linters 20 years ago!

Yet I would venture to say that this is a new class of tools. What existed 10-20 years ago is not what we now call static analyzers. First, I'm not talking about tools aimed at code formatting. Those are static analysis tools too, but here we're talking about finding bugs in code. Second, today's tools use sophisticated analysis technologies that take into account the relationships between different functions and virtually execute certain parts of the code. These are not the 20-year-old linters built on regular expressions. By the way, a decent static analyzer cannot be built on regular expressions. Technologies such as data flow analysis, automatic method annotation, symbolic execution and others are used to find errors.

These are not just abstract words; this is the reality I observe as one of the founders of the PVS-Studio tool. Check out this article to see what helps the analyzers find the most exciting errors.

More importantly, modern static analyzers have extensive knowledge of error patterns. In this respect they know more than even professional developers: it has become too difficult to keep all the nuances in mind while writing code. For instance, unless you have specifically read about it, you would never guess that calls to the memset function meant to clear private data sometimes disappear, because from the compiler's point of view the call is redundant: the buffer is never read afterwards. Meanwhile, this is a serious security defect, CWE-14, that is found literally everywhere. Or, for example, if you haven't heard about the relevant guideline, how would you know that it is dangerous to add an element to a container this way?

std::vector<std::unique_ptr<MyType>> v;
v.emplace_back(new MyType(123));

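I think not everyone will immediately realize that such code is potentially dangerous and can lead to a memory leak. If emplace_back has to reallocate the vector's storage and that allocation throws, the unique_ptr is never constructed and the object created by new is lost.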
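
One possible way to avoid the problem is to make sure the object already has an owner before the vector is touched. The sketch below is only an illustration: MyType is a stand-in I made up for the type in the snippet above, make_items is a hypothetical helper, and std::make_unique assumes C++14 or later.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for the MyType used in the snippet above.
struct MyType {
    explicit MyType(int v) : value(v) {}
    int value;
};

// Hypothetical helper: fills a vector without ever holding an unowned raw pointer.
std::vector<std::unique_ptr<MyType>> make_items() {
    std::vector<std::unique_ptr<MyType>> v;

    // Risky form: `new` runs first, and if emplace_back then throws while
    // growing the vector, nothing owns the freshly created object.
    // v.emplace_back(new MyType(123));

    // Safer: make_unique returns an owning unique_ptr before the vector is
    // modified, so a failed reallocation still destroys the object.
    v.push_back(std::make_unique<MyType>(123));
    v.emplace_back(std::make_unique<MyType>(456));

    assert(v.size() == 2);
    return v;
}
```

Calling make_items() returns two elements holding 123 and 456. Either push_back or emplace_back works here, as long as the argument is already a unique_ptr.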

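Returning to the memset case mentioned earlier, here is a minimal sketch of how the pattern looks and one common workaround. The names secure_zero and use_password are hypothetical, and the volatile-pointer loop is a widely used technique that compilers honor in practice, not a guarantee I can quote from the standard. Where available, platform facilities such as memset_s (optional C11 Annex K), explicit_bzero or SecureZeroMemory are usually a better choice.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: writes through a volatile pointer so that the
// compiler does not treat the stores as dead and remove them.
void secure_zero(void* data, std::size_t size) {
    volatile unsigned char* p = static_cast<volatile unsigned char*>(data);
    while (size--) {
        *p++ = 0;
    }
}

// Hypothetical function that keeps a secret in a local buffer.
// Returns the length of the stored secret.
std::size_t use_password(const char* input) {
    char password[64];
    std::strncpy(password, input, sizeof(password) - 1);
    password[sizeof(password) - 1] = '\0';
    std::size_t length = std::strlen(password);

    // The buffer is never read after this point, so an optimizer is
    // allowed to drop a plain call like the one below (CWE-14):
    // std::memset(password, 0, sizeof(password));

    secure_zero(password, sizeof(password));
    return length;
}
```

The point is not the helper itself but the pattern: a plain memset right before a buffer goes out of scope looks correct in a review, which is exactly why a tool that knows the pattern is useful.
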
In addition to extensive knowledge of patterns, static analyzers are endlessly attentive and never get tired. For example, unlike humans, they are not too lazy to look into header files to make sure that isspace and sprintf are actual functions and not insane macros that spoil everything. Such cases show how hard it is to find bugs in large projects: something changes in one place and breaks in another.
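
To illustrate why it matters whether a name is a function or a macro, here is a small sketch. Everything in it is hypothetical: MY_ISSPACE stands in for a macro that some header might define, and next_char is a helper I added only to make the side effect visible.

```cpp
#include <cassert>

// Hypothetical macro standing in for a function-like name from some
// header; it may evaluate its argument more than once.
#define MY_ISSPACE(c) ((c) == ' ' || (c) == '\t')

// The equivalent real function evaluates its argument exactly once.
inline bool my_isspace(char c) { return c == ' ' || c == '\t'; }

// Helper with a visible side effect: counts how many times it is called.
inline char next_char(int& calls) {
    ++calls;
    return 'x';
}

// Returns how many times the argument expression ran with the macro.
int evaluations_with_macro() {
    int calls = 0;
    bool result = MY_ISSPACE(next_char(calls));
    (void)result;
    return calls;  // 2: 'x' is not ' ', so the second comparison runs too
}

// Returns how many times the argument expression ran with the function.
int evaluations_with_function() {
    int calls = 0;
    bool result = my_isspace(next_char(calls));
    (void)result;
    return calls;  // 1
}

inline void check_evaluation_counts() {
    assert(evaluations_with_macro() == 2);
    assert(evaluations_with_function() == 1);
}
```

At the call site the two versions look identical, so the difference only shows up if someone, or something, actually opens the header.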

I'm sure that static analysis will soon become an intrinsic part of DevOps, as natural and necessary as using a version control system. This is already happening gradually: at development conferences, static analysis is increasingly mentioned as one of the first lines of defense against bugs.

Static analysis acts as a kind of coarse filter. It is inefficient to look for silly errors and typos using unit tests or manual testing. It's much faster and cheaper to fix them right after you've written the code, using static analysis to detect problems. This idea, as well as the importance of running the analyzer regularly, is well described in the article "Introduce static analysis into the process, don't look for bugs with it."

Someone may say that there is no point in special tools, since compilers are learning to perform such static checks as well. Yes, that's true. However, static analyzers keep evolving too and, as specialized tools, stay ahead of compilers. For example, every time we check LLVM with PVS-Studio, we find errors there.

The world offers a large number of static code analysis tools. As they say, choose whichever you prefer.

In summary, if you want to find a lot of bugs and potential vulnerabilities while you're writing code, and increase the quality of your codebase, use static code analyzers!

#Cpp #StaticAnalysis