Your code accepts external data? Congratulations, and welcome to the minefield! Any unchecked user input can lead to a vulnerability, and manually finding all the "tripwires" in a large project is nearly impossible. But there is a "sapper"—a static analyzer. Our "sapper's" demining tool is taint analysis. It helps detect "dirty" data that reaches dangerous points without prior validation. This note explains how it works.

Think back to your early programming days. Right after "Hello, World" you probably wrote something like this:
#include <stdio.h>

int main()
{
  char name[256];
  printf("Stand up. There you go. "
         "You were dreaming. What's your name?\n");
  scanf("%255s", name);
  printf(name); // the user-controlled string is used as a format string
  return 0;
}
Let's set aside the fixed-size name buffer for now: this is only our second program, and let's be honest, 255 characters are enough for everyone. Well... almost everyone. :) However, this program has another issue: the name string contains tainted data. We've covered the consequences of passing such a string to printf in a separate article.
You might think, "only students make such mistakes," but we've found similar errors far beyond academic projects.
To catch them during development, you can use static analyzers that employ taint analysis. Let's look under the hood of the PVS-Studio C and C++ analyzer and break down its inner workings in detail.
Data flow analysis is a mechanism used by compilers for optimizations and by static analyzers to find bugs and vulnerabilities in code.
To detect errors like overflows, division by zero, or array index out of bounds, the analyzer has to know the value of a variable or operand in a potentially erroneous expression. The variable can have different values depending on the execution path. A virtual value of a variable is the set of all possible values it can have at a specific point in the program. Look at this example:
#include <iostream>

int getElement(int index)
{
  int arr[2] {0, 1};
  return arr[index];
}

int main()
{
  int index = 0;     // index = 0
  std::cin >> index; // index = [INT_MIN; INT_MAX]
  index &= 1;        // index = [0; 1]
  int element = getElement(index);
  return element;
}
It shows the virtual value of the index variable at each program execution point:
- after initialization, index holds the value 0;
- after std::cin >> index, it can be anything in [INT_MIN; INT_MAX];
- after index &= 1, its value falls within the range [0; 1].

In this case, knowing the virtual value of index, the analyzer won't issue a warning, as it understands that accessing the array won't cause a buffer overrun. All potential values of index lie within the array bounds.
But what if we don't know the virtual value of the variable? For example, when it's a function parameter. Let's remove the main function from the previous example:
int getElement(int index)
{
  int arr[2] {0, 1};
  return arr[index]; // index is unknown
}
Here, we have a simple function returning an array value by index. When analyzing the function body without interprocedural information, the analyzer doesn't know the index parameter value.
A logical question arises: "Aren't index = [INT_MIN; INT_MAX] and index is unknown the same thing?"
No, they are not. Let's step back and clarify an important point.
There are sound and unsound static analysis strategies. Usually, the analyzer operates with an unsound strategy and issues warnings only if it can prove an error exists. Under this strategy, the example above won't trigger a warning about the array index out of bounds. We don't know the index value, and although the overrun is possible, we can't definitively claim the error exists. Otherwise, we'd drown in false and meaningless warnings.
The sound strategy follows the opposite principle: the analyzer issues a warning if it can't prove the absence of an error. The main problem with this strategy is a high number of false positives.
Taint analysis sits between these two strategies. Although it theoretically belongs to the sound strategy, its application is confined to the lifecycle of potentially tainted data along the narrow path from source to sink. That's why the side effects (false positives) are typically less severe.
OK, let's not delve too deeply into the general theory of taint analysis and static analysis. We've already explored this topic in detail in our articles about the Java analyzer.
As we've seen, taint analysis belongs to the sound strategy and builds on top of data flow analysis. Sources of tainted data (taint sources) are typically input streams: the console, files, sockets, and so on. Essentially, any data obtained externally, rather than computed during program execution, is potentially tainted.
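Here is a minimal sketch of typical taint sources (the file name is purely illustrative): both values come from outside the program, so the analyzer treats them as potentially tainted until they are validated.

#include <cstdio>
#include <iostream>
#include <string>

int main()
{
  std::string line;
  std::getline(std::cin, line);           // console input: tainted

  char buf[128] = "";
  if (FILE *f = fopen("config.txt", "r")) // illustrative file name
  {
    fgets(buf, sizeof(buf), f);           // file contents: tainted
    fclose(f);
  }
  return 0;
}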
Note. Besides well-known sources of potentially tainted data like std::cin and scanf, PVS-Studio enables adding custom sources using annotations in JSON format.
For example, a user can add their own source of potentially tainted data by annotating it as taint_source:
std::string ReadStrFromStream(std::istream &input, std::string &str)
{
  ....
  input >> str;
  ....
  return str;
}
"annotations": [
....
{
"type": "function",
"name": "ReadStrFromStream",
"params": [
{
"type": "std::istream &input"
},
{
"type": "std::string &str",
"attributes": [ "taint_source" ] // <=
}
],
"returns": { "attributes": [ "taint_source" ] }
}
....
]
You can read more about JSON annotations in our documentation.
Data obtained from sources is considered tainted and is marked with a corresponding tag. During operations with such data, this status can propagate to the results (taint propagation).
For instance, adding reliable data to tainted data results in a tainted output.
int taintedData;
scanf("%d", &taintedData);
int res = 5 + taintedData; // res is now tainted data too
....

// The same thing happens when tainted data
// is used in a loop condition
int taintedData;
scanf("%d", &taintedData);
for (int i = 0; i < taintedData; i++)
{
  // The value of i is in the range [0; taintedData].
  // Since the top bound of the range is tainted,
  // the variable derives the tainted status too.
}
Moving on, we approach the end of the tainted data lifecycle—the taint sinks.
Sinks for potentially tainted data (taint sinks) are functions or operations sensitive to the reliability of their arguments or operands. Examples of such sinks include memory allocation functions, division operations, and the subscript operator (operator[]).
#include <array>
#include <cstdio>

template <typename T, size_t size>
const auto &getElement(const std::array<T, size> &arr, size_t index)
{
  return arr[index]; // taint sink operation
}

int main()
{
  size_t index;
  scanf("%zu", &index);
  std::array<int, 5> arr { 1, 2, 3, 4, 5 };
  return getElement(arr, index);
}
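Division behaves the same way; here is a minimal sketch where a potentially tainted divisor reaches the division operation:

#include <cstdio>

int main()
{
  int divisor = 1;
  scanf("%d", &divisor);      // divisor becomes tainted
  int result = 100 / divisor; // taint sink: the divisor may be zero
  printf("%d\n", result);
  return 0;
}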
Using the taint_sink annotation, we can define a custom sink function:
{
  ....
  "annotations": [
    {
      "type": "function",
      "name": "DoSomethingWithData",
      "params": [
        {
          "type": "std::string &str",
          "attributes": [ "taint_sink" ] // <=
        }
      ]
    }
  ]
}
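For context, a sink function matching this annotation might look like the following sketch (the function body and the call site are purely illustrative):

#include <iostream>
#include <string>

// Annotated as a taint sink: the analyzer warns when
// potentially tainted data reaches the str parameter.
void DoSomethingWithData(std::string &str)
{
  // .... some operation sensitive to the contents of str
}

int main()
{
  std::string userInput;
  std::cin >> userInput;          // userInput becomes tainted
  DoSomethingWithData(userInput); // tainted data reaches the sink
  return 0;
}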
C++26 will introduce contracts into the language. Yes, they are intended for detecting errors at runtime, but they will also be very useful for static analysis. The point is, many functions already have implicit contracts. For example, the division operation assumes the divisor can't be zero.
The new syntax allows explicitly defining a function contract (function contract specifiers). Such conditions are described in the function declaration and are accessible even without the function body, enabling a static analyzer to match the predicates against the virtual values of the arguments and detect errors.
A sink function essentially has a contract that it imposes on its arguments. For example, here's how the getElement function would look with added contracts:
template <typename T, size_t size>
const auto &getElement(const std::array<T, size> &arr, size_t index)
  pre(index < std::size(arr))
{
  return arr[index];
}
Note that operator[] itself has a similar implicit contract. When accessing an array, the index is checked against this contract, and if it's violated, the analyzer issues a warning. Similarly, the analyzer can infer implicit function contracts during interprocedural analysis based on the function bodies. For instance, if a function passes its arguments to a sink internally, the contract of that sink also applies to the wrapper function, giving diagnostic rules more information to work with.
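As a rough illustration (the getFirstDigits wrapper is made up), the implicit contract of operator[] inside getElement also constrains any function that simply forwards its parameter to it:

#include <array>
#include <cstdio>

template <typename T, size_t size>
const auto &getElement(const std::array<T, size> &arr, size_t index)
{
  return arr[index]; // implicit contract: index < size
}

// The wrapper inherits the contract of getElement:
// its index parameter must also stay within the array bounds.
int getFirstDigits(size_t index)
{
  std::array<int, 5> digits { 1, 2, 3, 4, 5 };
  return getElement(digits, index);
}

int main()
{
  size_t index;
  scanf("%zu", &index);         // index is tainted
  return getFirstDigits(index); // the analyzer can warn here
}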
We've covered the conditions under which the analyzer issues warnings when data interacts with sinks. Now let's move on to data validation. Obviously, we have to check tainted data before using it to avoid such errors. How do we determine that the data has been validated and is no longer potentially tainted? With user annotations, it's straightforward: data that passes through a function defined by a user as a sanitizer is considered clean. But what if there are no user annotations? Do we have to rely solely on function names containing "validate" or "check"?
As we defined earlier, tainted data is external data that hasn't undergone sufficient validation.
This raises the question: "What constitutes 'sufficient validation'?"
That's a good question. Validation is the most intriguing part of taint analysis. This stage is defined by the rules for checking tainted data.
There are various ways to validate potentially tainted data.
How can we ensure the check is exhaustive? Look at the example:
int getElement(int array[5], int index)
{
  scanf("%d", &index);

  if (index >= 0)
  {
    return array[index];
  }
  return 0;
}
Checking only one boundary of the range for a signed number isn't considered sufficient, so the analyzer warns us about using tainted data as the array index.
The PVS-Studio warning: V557 Array underrun/overrun is possible. The 'index' index is from potentially tainted source.
Let's try a more precise check:
int getElement(int array[5], int index)
{
  scanf("%d", &index);

  if (index >= 0 && index < 6)
  {
    return array[index];
  }
  return 0;
}
This time, we checked both boundaries of the user-input index range, but we checked them incorrectly: the index used in operator[] could still lead to an array overrun. Is this still an error caused by tainted data or not?
The PVS-Studio warning: V557 Array overrun is possible. The value of 'index' index could reach 5.
The PVS-Studio analyzer still warns us about the error. However, this time it's a regular array overrun warning: the data itself is considered clean because the user checked both boundaries.
Even within our team, opinions on tainted data validation differ. Some think that checking both boundaries for integer data is sufficient. In this example, the array overrun is no longer related to potentially tainted data. However, others hold the opposite view: despite the check, user input can still trigger an error, so a warning about potentially tainted data is appropriate here.
What do you think about this example? When does tainted data cease to be tainted?
While validation methods like pattern matching and escaping control sequences and symbols typically require user annotations, type casting and whitelist checks can exist independently.
Besides its focus on high-level code vulnerabilities (SQL injections, XXE, XEE, etc.), taint analysis is also necessary at a much lower level: for example, in indexing operations, division, memory allocation, and so on. Therefore, taint analysis remains valuable even without user annotations.
In such cases, we can rely on the two remaining validation methods: type casting and matching against a whitelist of valid values. For instance, variables checked by functions like equal or contains can be considered clean.
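Here is a minimal sketch of such a whitelist check, assuming a small set of allowed values (the variable names are illustrative): once the check succeeds, the data can be considered clean.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

int main()
{
  const std::vector<std::string> allowedTables { "users", "orders", "logs" };

  std::string table;
  std::cin >> table; // table is tainted

  // Matching against a whitelist of valid values
  if (std::find(allowedTables.begin(), allowedTables.end(), table)
        != allowedTables.end())
  {
    // It's safe to use table here, e.g. to build a query.
  }
  return 0;
}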
As far as PVS-Studio is concerned, data validation boils down to restricting the range of possible values, either with explicit checks or with operations that constrain the result. For example, inside an if statement that checks both boundaries, we get a reliable value range for the variable:
int array[16] { };
int index;
scanf("%d", &index);

if (index > 0 && index < 16)
{
  // index = [1; 15] inside the if statement
  return array[index];
}
Some arithmetic operations can also validate numerical data. For instance, the modulo operation allows us to calculate the resulting range precisely, meaning the data is no longer tainted. We can interpret the result of a bitwise AND operation similarly.
int array[16] { };
unsigned index;
scanf("%u", &index);

int trustedValue1 = index & 0xf;  // trustedValue1 = [0; 15]
array[trustedValue1] = 0;

int trustedValue2 = index % 16;   // trustedValue2 = [0; 15]
array[trustedValue2] = 0;
Currently, PVS-Studio doesn't support annotations for sanitizer functions, but we plan to implement this functionality soon.
To sum up, taint analysis is a layer on top of data flow analysis that identifies key points, such as sources, sinks, and validators. The static analyzer then uses these to search for vulnerabilities and errors, leveraging classical tools and mechanisms.
To enhance effectiveness, users can apply JSON annotations to add their own sources and sinks for tainted data. Regardless, PVS-Studio provides basic taint analysis even without user annotations. And with the advent of contracts in C++26, the quality of analysis will improve even further.
If you want to try tainted data analysis yourself, don't hesitate to go for the free 30-day trial of the PVS-Studio analyzer.