>
>
How static analysis works

Andrey Karpov
Articles: 675

How static analysis works

This is a review article on what technologies underlies the work of static code analyzers. The article covers pattern-based analysis, data flow analysis, symbolic execution, taint checking, etc.

From the user's point of view, a static code analyzer works as follows: the programmer inputs the project source code and receives a report at the output. The report contains warnings pointing out code fragments that are likely to contain errors or fail to meet coding standards. It looks simple, but analyzers hide a lot of interesting and tricky stuff inside. Let's look behind the scenes.

What is the basis of static code analysis

Let's look at the 8 main technologies underlying the work of modern static code analyzers:

  • Parsing and semantic analysis;
  • Pattern-based analysis;
  • Data flow analysis;
  • Symbolic execution;
  • Identification of vulnerable components (SCA, software composition analysis);
  • Function annotation;
  • Interprocedural and intermodular analysis;
  • Taint checking;

Parsing and semantic analysis

Collecting data about the source code and analyzing it is an important pre-step for a thorough analysis. Older analyzers, sometimes called "linters", could do without this step and worked in a manner similar to that of searching for a text fragment with regular expressions. This is a dead-end approach, since it can only detect a very narrow range of errors. The drawbacks of this approach are discussed in more detail in the article "Static analysis and regular expressions".

Modern analyzers perform syntactic code analysis (parsing) and build a parse tree or an abstract syntax tree. The analyzer infers (resolves) types of variables, collects information about classes and other useful data. After that, the analyzer starts traversing the syntax tree to detect errors. At the same time, it uses additional information collected, such as variable types and class sizes.

Without knowing the information about types, it is impossible to tell whether the following construction is wrong or not (here and below the examples are written in C):

while ((ch = getchar()) != EOF)

If the ch variable is of the int type, the code is correct. If the ch variable is of the char type, then the character with the 0xFF (255) value turns into -1 and is interpreted in the same way as the end of the file (EOF). Here you can learn more about this error pattern.

Pattern-based analysis

Pattern-based analysis is the detection of known error patterns. A simple example is an extra semicolon after the for statement:

for (int i = 0; i < n; i++);
{
  foo();
}

Comparing takes place at the syntax tree level, which has several advantages over regular expressions. When you working with trees, it's easy enough to understand that A < B and B > A are the same thing, and find the error in the code:

if (A < B) { .... }
else if (B > A) { .... }

The second condition is always false, since it actually duplicates the first one. This is a classic pattern of a typo (examples in real apps).

If you search for repeating conditions with regular expressions, it will become extremely difficult task, since you will have to review many options for how this error pattern can manifest itself:

if (A < B) ....
else if (A < B) ....

if (A + 1 < B && C) ....
else if (1 + A < B && C) ....

if (A & B == 0) ....
else if (0 == B & A) ....

In fact, it's even more complicated. The static analyzer must take into account additional information. For example, that the operators are not overloaded. Or that there is a modification of variables in the condition:

if ((A = get()) < B) ....
else if ((A = get()) < B) .... // there is no need to issue a warning

So regular expressions are not suitable here at all.

Data flow analysis

Data-flow analysis is a way for the static analyzer to estimate values that variables or expressions have - across various locations in the source code. Some types of estimated values:

  • precise values (numbers, strings);
  • ranges of values;
  • a set of possible values.

There may be more complex types of data that are compared to some variable. For example, the analyzer may know that the pointer is in one of the following conditions:

  • Uninitialized (cannot be dereferenced and compared with another pointer);
  • Null (cannot be dereferenced);
  • Indicates a memory buffer of a certain size (it is possible to check for the buffer overrun);
  • Indicates deallocated memory (cannot be dereferenced, re-deallocated);
  • A copy of another pointer (however, this is more of a symbolic execution, which we will look at in the next section);
  • Etc.

An example where data about the possible values of a variable allows you to identify array index out of bounds:

float A[10];
for (size_t i = 0; i < sizeof(A); i++)
{
  // It is known that the value of the i variable lies within [0..39].
  A[i] = 1.0; // The analyzer will issue a warning about
              // array index out of bounds.
}

Symbolic execution

Symbolic execution is similar to data flow analysis, but the analyzer does not know anything about the values of variables. However, by solving the equations, it can draw some conclusions about the results of the calculations. Let me explain with the simplest example:

int A = foo();
int B = A;
int C = 100 / (A - B);

Let's say the analyzer can't figure out what values the foo function returns. However, since the A and B variables are equal, the A - B expression is 0. Knowing this, the analyzer will warn about division by 0.

Identification of vulnerable components (SCA, software composition analysis)

All the code analysis technologies discussed here make it possible to identify potential vulnerabilities, since in fact these are the same errors in the code. However, Software Composition Analysis is particularly worth highlighting.

Software projects use third-party components that may contain vulnerabilities. Accordingly, the application itself may also be vulnerable to malicious attacks.

Information about vulnerabilities in various versions of components is aggregated in open databases. Using these databases, static analyzers can warn developers that they are using outdated components that contain vulnerabilities.

It is important to understand that this approach allows you to identify only known vulnerabilities, but not zero-day vulnerabilities. This approach is somewhat similar to how antiviruses work, using databases that store signatures of known viruses.

Function annotation

Annotations provide information on the basis of which the analyzer detects incorrect use of functions. Examples:

if (memcmp(A, A, size) == 0) // It makes no sense to compare
                             // the buffer with itself.
{
  malloc(N * sizeof(float)); // Strange not to save the address
                             // of the allocated buffer
}

There are three types of function annotation:

  • Information about the functions is stored in the code analyzers and laid down by their creators;
  • Information about functions set by users;
  • Function information collected by the analyzer itself based on the definition of these functions.

Interprocedural and intermodular analysis

If you know how one function calls another, then you can identify, for example, the following error:

void use(struct S *p)
{
  p->a = 123;
}

struct S globalVar;

void foo(int x)
{
  struct S *q = (x == 5) ? NULL : &globalVar;
  use(q);
}

The potential dereference of the null pointer will be detected due to the joint work of the data flow analysis algorithm and interprocedural analysis.

Data flow analysis will calculate that either a null or a non-null pointer to a global variable can be passed to the use function.

Due to the interprocedural analysis, this information will be used to re-check the use function. Here again, the data flow analysis algorithm comes into operation. The algorithm will identify that the null pointer can be dereferenced.

What is interesting, there is another way to detect this error: the use of automatic function annotation and interprocedural analysis.

Going through the use function, the analyzer will make a "note" that the argument of the function is dereferenced without prior verification. Therefore, the analyzer will further warn that a pointer that may be NULL cannot be passed to the use function.

Intermodular analysis allows you to identify even more errors, as it allows you to analyze the joint operation of functions located in different translation units.

Taint checking

Taint checking is a technology that allows you to track the spread of unverified external data in a program during its operation. Getting such data into some key points can lead to various vulnerabilities, including SQL injection, XSS, path traversal and others.

Taint analysis is similar to data flow analysis. However, the goal here is not to understand what data is actually being transmitted. It is assumed that if data came from outside, it can be dangerous. It is the job of the analyzer to detect that this data reaches the receiver without prior verification and to issue a warning.

How code analyzers work, using PVS-Studio as an example

If you want to know how the described technologies are applied in practice, we invite you to read "PVS-Studio: static code analysis technology". Some of the implementation details are also described in the following articles:

Limitations of static code analysis

All static code analyzers have two drawbacks:

These limitations are due to the following reasons:

  • Static analyzers lack high-level information about how the program should actually work.
  • Insufficient amount of computing resources. Therefore, a trade-off has to be made between the accuracy of the analysis and the running time.
  • The "halting problem", unsolvable on a Turing machine.

All this is discussed in more detail in the article "What static analysis cannot find".

Exceptions to prevent false positives

As mentioned above, all static code analyzers are prone to false positives. Therefore, the developers of analyzers face not only the task of making the analyzers to issue as many useful warnings as possible, but also to reduce the number of useless ones.

For this purpose, a number of exceptions are usually written in the implementation of diagnostic rules. Let's look at an example of assigning a variable to itself:

variable = variable;

This is a good diagnostic rule that can help you find many real errors: example of errors from the PVS-Studio collection.

It seems that it is very easy to look for such an error, and even regular expressions would have coped. However, there are nuances. For example, the same PVS-Studio has about 10 exceptions in this diagnostic. Here's a couple of them:

if (foo())
  x = x;
else
  x = -x;

This code is redundant, but there is no error in it! The programmer's idea is obvious: if the function returned false, then the variable's sign should be changed. The author of the code will not be happy if they see a warning. They specifically wrote the code exactly so that it was as clear as possible. Accordingly, it is useful to recognize such a programming pattern and not issue a warning.

Another example:

#ifdef WORDS_BIGENDIAN
#define fci_endian_ulong(x) RtlUlongByteSwap(x)
#else
#define fci_endian_ulong(x) (x)
#endif
....
cffile.cbFile = fci_endian_ulong(cffile.cbFile);

If the WORDS_BIGENDIAN macro is not declared, the last line will be expanded by the preprocessor in:

cffile.cbFile = cffile.cbFile;

The variable is assigned to itself, but there is no need to issue a warning, since the macro could have expanded into a function call.

If you are interested in this section, here is an article for you: "The way static analyzers fight against false positives, and why they do it".

Additional links