Dec 26 2024

Technologies used in PVS-Studio

Dec 26 2024

Watch, don't read (YouTube)
Abstract Syntax Tree and pattern-based analysis
Semantic code model and type inference
Preprocessing in C and C++ source code
Monitoring of C and C++ source code compilation
Data-flow analysis and symbolic execution
Interprocedural analysis
Intermodular analysis and function annotations
Taint analysis (taint checking)
Software Composition Analysis (SCA)
Additional resources

PVS-Studio provides static analyzers for C, C++, C# and Java languages on Windows, Linux and macOS platforms. PVS-Studio analyzers can vary slightly due to certain features that the languages have. However, all our analyzers share common technologies and approaches to the implementation of static analysis.

As part of PVS-Studio, there are 3 separate software tools for static analysis: the C and C++ analyzer, the C# analyzer and the Java analyzer.

The PVS-Studio analyzer for C and C++ is written in C++. It builds upon the VivaCore closed source code parsing library. This library is a development of the PVS-Studio team as well.

The PVS-Studio analyzer for C# is written in C#. To parse code (to build an abstract syntax tree and a semantic model) and to integrate with the MSBuild \ .NET project system the analyzer uses the open source Roslyn platform.

The PVS-Studio analyzer for Java is written in Java. Data-flow analysis is implemented through the internal C++ library named VivaCore. To analyze source code (to build an AST and a semantic model), the analyzer uses the open source Spoon library.

All the PVS-Studio analyzers implement algorithms and mechanisms to run data-flow analysis (including symbolic execution, interprocedural context-sensitive analysis and intermodular analysis). These algorithms and mechanisms are built upon PVS-Studio own developments.

The PVS-Studio static code analysis technology is based on the following approaches and processes.

Watch, don't read (YouTube)

Abstract Syntax Tree and pattern-based analysis

First, let's look at two terms that we use from the theory of developing compilers and static code analyzers.

Abstract Syntax Tree (AST). AST is a finite oriented tree the nodes of which are correlated with the programming language's operators, and the leaves - with the corresponding operands. Compilers and interpreters use abstract syntax tree as an intermediate representation between parse trees and the internal code representation. The advantage of AST - sufficient structure compactness (abstractness). It is achieved due to the absence of nodes for constructs that do not affect semantics of the program.

AST-based analyzers do not depend on specific syntax. For example, names, coding style, code formatting, etc. This is the key advantage of the abstract syntax tree in comparison with direct analysis of program text (source code).

Parse tree (PT). The result of grammatical analysis. The derivation tree differs from the abstract syntactical tree in that it contains nodes for those syntactic rules which do not influence the program semantics. Classical examples of such nodes are grouping parentheses, while grouping of operands in AST is explicitly defined by the tree structure.

At a high level, we can say that the cores of all PVS-Studio analyzers for different languages work with an abstract syntax tree (AST). However, in practice, everything is a bit more complicated. In some cases, diagnostic rules require information about optional nodes or even about the number of spaces at the beginning of the line. In this case, the analysis proceeds down the parse tree and extracts additional information. All of the parse libraries that we use (Roslyn, Spoon, VivaCore) enable getting information at the parse tree level. The analyzer takes this opportunity in some cases.

PVS-Studio analyzers use the AST program representation to search for potential defects by pattern-based analysis. It's a category of relatively simple diagnostic rules. To decide whether the code is dangerous, these rules compare the constructions in the code with predefined templates of potential errors.

Note that the template search is a more advanced and efficient technology than regular expressions. Regular expressions are actually not suitable to build an effective static analyzer, for many reasons. We can explain this with a simple example. Let's say you need to find typos when the expression is compared with itself. For the simplest cases, you can use regular expressions:

if (A + B == A + B) 
if (V[i] == V[i])

However, if the expressions that contain errors are written differently, regular expressions are powerless. Rather, it is simply impossible to write them for all of the alternatives:

if (A + B == B + A) 
if (A + (B) == (B) + A) 
if (V[i] == ((V[i]))) 
if (V[(i)] == (V[i]))

In turn, in pattern matching, it's not a problem to detect such errors if you use an AST.

The abstract syntax tree representation of code is also a preparation step for the next level of analysis — the construction of a semantic model and type inference.

Semantic code model and type inference

In addition to the syntax analysis of the code, all PVS-Studio analyzers also perform semantic analysis based on the use of AST code representation described in the previous step. They build a complete semantic model of the code they check.

The generalized semantic model is a dictionary of correspondences of semantic symbols and elements of the syntactic representation of the same code (for which PVS-Studio uses nodes of the abstract syntax tree mentioned above).

Each such symbol defines the semantics of the corresponding syntactic language construction. This semantics may be subtle and cannot be deduced from the local syntax itself. To derive such semantics, you must refer to other parts of the syntactic code representation. Here is an example of a code fragment in C language:

A = B(C);

We don't know what 'B' stands for, so it's impossible to say what kind of language construction this is. This can be either a function call or a functional cast expression.

The semantic model thus allows to analyze the code semantics without the need to constantly traverse the syntactic representation of this code to resolve semantic facts that are not deduced from the local context. During analysis, the semantic model "remembers" the semantics of code for further use.

Based on the semantic model, PVS-Studio analyzers can perform type inference for any syntactic construction they encounter, which may be required when analyzing the code for potential defects. For instance, such as variable identifiers, expressions, etc. The semantic model complements the pattern-based analysis in cases where a single syntactic representation is not enough to decide whether the tested construction is dangerous.

Building a complete and correct semantic model requires consistency and, accordingly, compilability of the code we check. The compilability of the source code is a necessary condition for PVS-Studio analyzers to operate fully and correctly. PVS-Studio analyzers have fault tolerance mechanisms in cases when they deal with uncompilable code. However, the uncompilable code may impair the accuracy of diagnostic rules.

Preprocessing in C and C++ source code

Preprocessing of C and C++ code is the mechanism that expands compilation directives in the source code and substitute the macro values. In particular, the result of the preprocessor operation is the following. In place of #include directives the contents of header files are substituted, the paths to which are specified by the directive. In the case of such substitution, the preprocessor expands directives and macros sequentially in all the header files that were already expanded by the #include directive. Preprocessing is the first step of the compiler's work. It's a preparation of a compilation unit and its dependencies for source code translation into the internal compiler representation.

Expansion of #include directives leads to the merger of the source file and all the header files used in it into a single file, often called intermediate. By analogy with the compiler, C and C ++ PVS-Studio analyzer uses preprocessing before it starts the analysis. PVS-Studio uses the target compiler (in preprocessor mode) for preprocessing the checked code. The analyzed code was originally intended to be built by the compiler used by PVS-Studio. PVS-Studio supports large number of preprocessors, which are listed on the product page. The output format of the preprocessors of various compilers differs. For the analyzer to work correctly, it is necessary to use the right preprocessor that corresponds to the compiler which is used to build the code.

Before starting the C and C++ analysis, the PVS-Studio analyzer launches a preprocessor for each translation unit of the code it checks. Both contents of the source files and the compilation parameters affect the preprocessor operation. For preprocessing PVS-Studio uses the same build parameters that are used during code compilation. PVS-Studio receives information about the list of translation units and compilation parameters from the build system of the checked project, or by tracing (intercepting) compiler's calls during the project build.

The work of the PVS-Studio C and C ++ analyzer is based on the result of the work of the corresponding preprocessor. The analyzer does not analyze the source code directly. Preprocessing C and C ++ code by expanding compiler directives allows the analyzer to build a complete semantic model of the code being checked.

Monitoring of C and C++ source code compilation

PVS-Studio provides the monitoring feature that allows you to intercept process invocations at the level of your operating system's API. Intercepting a process being invoked allows to get complete information about this process: its invocation parameters and its working environment. PVS-Studio supports process invocation monitoring on Windows and Linux. The analyzer's Windows version uses WinAPI directly, while the Linux version employs the strace standard system utility.

The C and C++ PVS-Studio analyzers can use compilation process tracing as a way to analyze C++ code. PVS-Studio integrates directly with the most popular build systems for C and C++ projects. However, there are many build systems the analyzer does not support. This is because the ecosystem of C and C++ languages is extremely diverse and contains a very large number of build systems - for example, in the embedded sector. Although the C++ PVS-Studio analyzer supports low-level direct integration with such systems, implementing this integration requires a lot of effort. For each translation unit (a C or C++ source file), compilation parameters must be passed to the analyzer.

PVS-Studio's compilation process monitoring system can simplify and automate the process of supplying the analyzer with all the information that it needs for analysis. The monitoring system collects process compilation parameters, analyzes them, and modifies them (for example, by activating the compiler's preprocessing mode, as the analyzer requires this stage only). Then the monitoring system passes these parameters to the C++ PVS-Studio analyzer directly.

This way, thanks to the process invocation monitoring feature, PVS-Studio offers a universal solution to check C and C++ projects. Moreover, the system does not depend on the build system used, is easily configured, and takes the original parameters of the source code compilation fully into account.

Data-flow analysis and symbolic execution

Data-flow analysis is a way for the static analyzer to estimate values that variables or expressions have - across various locations in the source code. The estimated values here mean specific values, value ranges or sets of possible values. Additionally, the analyzer tracks whether memory linked to a pointer has been freed, what the array sizes are etc. Then the analyzer saves this information and processes it.

To estimate values, the analyzer tracks how variable values move along the control-flow graph, and analyzes the results. In many cases, the analyzer cannot know the variable's or expression's exact value. To evaluate the expressions, the analyzer uses direct and indirect restrictions, imposed on the expressions as the control-flow graph is traversed. The analyzer makes assumptions as to what ranges or sets of values given expressions can take at the control-flow graph's various points.

Sometimes source code's syntactic (AST) or semantic structure is insufficient for the analyzer to make a decision on whether certain code is dangerous. This is why all PVS-Studio analyzers use data-flow analysis to support the diagnostics and to make a more precise decision on whether that code is dangerous. To conduct data-flow analysis, PVS-Studio analyzers use their own internally-implemented algorithms. Data-flow analysis in PVS-Studio provides flow and path sensitivity. The branching in the analyzed source code is fully covered in the data-flow model constructed by the analyzer.

The PVS-Studio analyzer for C and C ++ and the PVS-Studio analyzer for Java use a shared internal C ++ library for data-flow analysis. The C# PVS-Studio analyzer has its own implementation of data-flow algorithms - they are in a library written in C#.

Sometimes, when processing code, the analyzer cannot calculate an expression's range of values. In this case, the analyzer employs the symbolic execution approach. Symbolic execution means that possible variable and expression values are represented as formulas. In this case, instead of specific variable values, the analyzer operates with symbols that are abstractions of these variables.

Study this C++ code example:

int F(std::vector<int> &v, int x) 
{ 
    int denominator = v[x] - v[x]; 
    return x / denominator; 
}

To detect division by zero here, the analyzer does not need to know which values the function takes when this function is called.

When traversing the control-flow graph, the analyzer can build formulas for expressions it encounters - and calculate the limitations of these expressions' values. To do this, the analyzer substitutes variables in these formulas for known limitations on symbols that a given expression depends on. The analyzer employs symbolic execution algorithms to solve the formulas it builds when traversing the control-flow graph. The algorithms allow the analyzer to calculate expression or variable value limitations based on the values of other expressions or variables. The calculation of the final value is postponed till the moment it is required (for example, when a specific diagnostic rule is running, the final value will be calculated based on the formula created earlier).

The PVS-Studio analyzers for C, C++ and Java use the symbolic execution approach as part of their data-flow algorithms.

Interprocedural analysis

Interprocedural analysis is a static analyzer's ability to discover function calls and figure out how these calls affect the sate of the program and its variables in the local context. The PVS-Studio analyzers use interprocedural analysis to confirm limitations and ranges of variable and expression values that are calculated with data-flow mechanisms.

During analysis, the PVS-Studio analyzers use the AST code representation and build a complete semantic model. This way, when the analyzers encounter a function call, they can represent this function's body as an AST - and get all semantic information from this AST.

In data-flow analysis, PVS-Studio's interprocedural analysis allows to account for values returned by function calls. PVS-Studio also tracks the states of variables and expressions passed to functions. This enables the analyzer to detect potentially dangerous constructions and operations inside function bodies - for values passed to these functions. The analyzer can see potential defects in the bodies of the functions called. It can also identify how values a function accepts limit values the function can return.

Interprocedural analysis is limited by the access to the source code of the functions the analyzer needs to expand. To expand functions defined in different source files PVS-Studio employs the intermodular analysis mechanism. Although it is impossible to analyze functions defined in third-party libraries (due to the unavailability of these functions' source code) - PVS-Studio analyzers can estimate values these functions return. The annotation mechanism makes this possible.

Intermodular analysis and function annotations

Aside from interprocedural analysis, PVS-Studio analyzers support intermodular analysis. PVS-Studio's intermodular analysis extends the capabilities of interprocedural analysis.

In different programming languages, modules may mean different things. However, the concept of a module is generally understood as a compilation unit. For C and C++ languages, a compilation unit is a separate source code file (a file with the .c or .cpp extension). For C# language, a compilation unit is a project. For Java - it's a source file (a file with the .java extension) with a class herein declared.

When analyzing a project's source code file, Java and C# PVS-Studio analyzers can get access to the code of functions that are defined in this file - and in other files of the analyzed project. The PVS-Studio analyzer for C# can also get and analyze the source code of functions defined in other projects - if these projects were also submitted to the analysis.

The C++ PVS-Studio analyzer can get bodies of methods, defined in the compilation unit that is being processed at the time. This compilation unit is a preprocessed source file with expanded inclusions of header files. The C++ analyzer's intermodular mode allows to also get data-flow information from other compilation units. To do this, the analyzer works through source code twice. During the first run, the analyzer gathers interprocedural data-flow information for all source files being checked. During the second run, the analyzer uses this information to analyze source files.

If, when processing code, the analyzer encounters a function it cannot expand for analysis - it can use the function annotation mechanism. Function annotations are a declarative specification of information about limitations on values passed to functions and values that functions can return.

PVS-Studio analyzers provide two kinds of annotations: for library functions and for user functions. All PVS-Studio analyzers provide annotations on many functions from standard and popular libraries. The C++ PVS-Studio analyzer has an extra feature. You can use special syntax to set annotations for custom functions that are specific to a particular project being checked.

Taint analysis (taint checking)

Taint analysis is a way to track how externally supplied unchecked - therefore tainted - data spreads across an application. When such data hits taint sinks, it can cause a number of security vulnerabilities: SQL injections, XSS (cross-site scripting), and many others. Standards for secure software development, such as OWASP ASVS (Application Security Verification Standard), describe potential software vulnerabilities that result from the spread of tainted data.

Generally, it's impossible to fully protect a program from potentially tainted data. This is why the most efficient way to counteract external tainted data is to check it before it enters taint sink. This process is called data sanitization.

The PVS-Studio analyzers for C and C++, as well as for C#, can use interprocedural data-flow analysis technologies to track how tainted data spreads across applications. An entire group of PVS-Studio rules is based on the tainted data tracking mechanism.

PVS-Studio analyzers control the entire route that tainted data takes - this includes locations where data travels between program modules and where data gets checked (sanitized). If the PVS-Studio analyzer detects that tainted data travels from the taint source to the taint sink unchecked, it issues a warning about a potential code security threat. This way, PVS-Studio guards taint sources and taint sinks both, and issues a warning not only if tainted data is used - but at the moment such data is supplied to an application.

Software Composition Analysis (SCA)

Many modern applications use third-party components like libraries, packages, etc. Some of these components contain vulnerabilities. If an application uses a component with a vulnerability, the application itself can become vulnerable as well.

Special utilities find these "malicious" dependencies by performing software composition analysis (SCA).

The PVS-Studio analyzer for C# supports SCA. This mechanism works as follows:

The analyzer creates a bill of materials (BOM) — a list of direct and transitive project dependencies and information about their versions. BOM is based on the MSBuild project files.
Of all dependencies, the analyzer selects those which are directly or indirectly used in the code. The analyzer checks whether data types declared in dependent packages are used in code.
For each BOM record, PVS-Studio searches for the corresponding record in the GitHub Advisory Database. At the same time, the analyzer takes into account the name of the dependency checked and its version.
If a match is found in the database, the analyzer issues a warning with information on the dependency and vulnerabilities it contains.

You can read more about the mechanism of SCA in the documentation for the V5625 diagnostic rule.

Additional resources

A talk at Italian Cpp Community 2021 (itCppCon21). Yuri Minaev. Inside a static analyzer: type system.
Oleg Lisiy, Sergey Larin. Intermodular analysis of C++ projects in PVS-Studio.
Andrey Karpov. Why PVS-Studio uses data flow analysis.
Sergey Vasiliev. OWASP, vulnerabilities, and taint analysis in PVS-Studio for C#.

Was this page helpful?

If you can’t find an answer to your question, fill in the form below and our developers will contact you

By clicking this button you agree to our Privacy Policy statement