Toyota ITC Benchmark is a synthetic test set for C and C++. It consists of approximately 650 examples and is designed for testing code analyzers. This article answers the question: "How well does the PVS-Studio static analyzer cover the Toyota ITC Benchmark?"
We first tested PVS-Studio on the Toyota ITC Benchmark about 5 years ago. It all started when Bill Torpey wrote a note called "Even Mo' Static" on his blog. Bill tested our analyzer and Cppcheck on the benchmark, compared the results, and concluded that the analyzers are almost equal in their capabilities.
We didn't like that conclusion — we thought (and still think) that PVS-Studio is much more powerful than Cppcheck. Therefore, my teammate Andrey Karpov did his own research and wrote an article about it: "Why I dislike synthetic tests".
After that, we never touched the Toyota ITC Benchmark. However, not long ago a user sent us a question: "What is the PVS-Studio coverage of the Toyota ITC Benchmark?". The user was interested in numbers, not in philosophical arguments that synthetic tests are evil. So we did new research; below, I describe the results and how we got them.
First, we need to figure out what to count. To do this, let's look at the structure of the Toyota ITC Benchmark. We'll take the version hosted on GitHub.
The benchmark includes 51 rules. By rule we mean a typical mistake that can be made in a C and/or C++ project. For example, Toyota ITC has the "conflicting cond" rule, which states that there should be no conflicting conditions in the code. The (a == 0) && (a == 1) expression contains two contradictory conditions, (a == 0) and (a == 1), so it is always false and most likely contains an error.
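For illustration, here's a minimal example of my own (not a benchmark test) with such a conflicting condition:

void conflicting_cond_example(int a, int *out)
{
  // 'a' can't be 0 and 1 at the same time, so this condition
  // is always false and the branch below never executes.
  if ((a == 0) && (a == 1))
  {
    *out = 1;
  }
}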
For each rule, Toyota ITC Benchmark has two test files. The first one is called "W_{rule name}.c/cpp" and contains tests that should trigger the analyzer. The second one is called "Wo_{rule name}.c/cpp" and contains tests on which the analyzer should keep silent. A test is a function with or without a typical error. The function's code has a comment marking the place that should or should not trigger the analyzer.
Of course, we could just count the number of tests the analyzer passes, taking their type into account: a test from a W file is passed if the analyzer issued a warning, and a test from a Wo file is passed if it didn't trigger the analyzer. We would then divide the number of successful tests by their total number and treat the resulting percentage as the benchmark coverage. However, this approach has a significant disadvantage: different rules have different numbers of tests. For example, the "dead_lock" rule has 10 tests, while the "overrun_st" rule has 108. Does this mean that finding a possible array index out of bounds is about 10 times more important than identifying a potential deadlock? I don't think so.
That's why we chose another approach. For each rule, we separately count the tests it passes. Then we divide this number by the total number of tests for that rule. If the resulting percentage is not lower than a pre-set threshold, we mark the rule as passed; otherwise, we don't. After that, we count the passed rules, divide their number by the total number of rules (51), and treat the resulting percentage as the benchmark coverage.
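Here's a small sketch of this counting scheme (my own illustration; the numbers in it are hypothetical):

#include <cstdio>
#include <vector>

struct RuleResult
{
  int passedTests;
  int totalTests;
};

// The share of rules whose per-rule pass rate reaches
// the given threshold (both values are in percent).
double coverage(const std::vector<RuleResult> &rules,
                double thresholdPercent)
{
  int passedRules = 0;
  for (const RuleResult &r : rules)
  {
    double passRate = 100.0 * r.passedTests / r.totalTests;
    if (passRate >= thresholdPercent)
      ++passedRules;
  }
  return 100.0 * passedRules / rules.size();
}

int main()
{
  // Say 10 of 10 "dead_lock" tests passed
  // and 60 of 108 "overrun_st" tests passed.
  std::vector<RuleResult> rules = { { 10, 10 }, { 60, 108 } };

  // Only one of the two rules reaches the 75% threshold,
  // so the program prints "50%".
  printf("%.0f%%\n", coverage(rules, 75.0));
}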
Let's look at the advantages of this approach. First, all rules are considered equivalent. Since the threshold value is the same for all rules, a rule with more tests requires more passed tests to be marked as passed. We can't achieve good statistics just by supporting a couple of rules with lots of tests and abandoning those with few tests.
Second, this approach provides flexibility in choosing the threshold percentage required to consider a rule passed. Some believe that a rule is supported only if all its tests pass; for others, 75% would be enough. Both can obtain the corresponding coverage percentage.
The disadvantages of this approach follow from its advantages. First, it's not suitable if we don't consider the rules equivalent. In that case, we'd have to assign each rule a weight and take it into account when calculating the final coverage. Second, the resulting coverage depends on the threshold required to pass a rule. This means we can no longer talk about X% coverage without also mentioning the Y% threshold, which may not be very convenient. That's why this article gives several different coverage values.
I chose three thresholds: 50%, 75%, and 100%.
PVS-Studio covers Toyota ITC Benchmark by 12% at the 100% threshold, by 27% at the 75% threshold, and by 39% at the 50% threshold.
Many tests were not passed because of special exceptions in our diagnostics. These exceptions make sense when we analyze real projects: they reduce the number of false positives. Theoretically, we could add a special analyzer mode that disables such exceptions; the Toyota ITC Benchmark coverage would then increase. We don't see the point in making this mode for most users. However, it could be useful for analyzing code with specific requirements, for example, in the automotive industry. If you are interested in this analyzer mode, or in the topic of the Toyota ITC Benchmark in general, and you want to discuss it, contact us.
Below are some examples from the tests that help explain how we got these numbers.
Toyota ITC Benchmark has the "dead_code" rule. This rule caused my first facepalm. The thing is, there are two distinct concepts: dead code and unreachable code. Dead code means that a code fragment can be executed, but its elimination does not change the program's behavior. Here's an example of dead code:
int i;
i = 5;
i = 10;
Here the i = 5; assignment is dead code.
Unreachable code means that a code fragment is never executed. An example:
bool cond = false;
int i;
if (cond)
{
  i = 5;
}
Here the i = 5; assignment is unreachable code.
So, all the tests for the "dead_code" rule are actually tests for unreachable code!
PVS-Studio doesn't have a single rule that would catch all variations of unreachable code. There's V779, which warns that code written after a call to a noreturn function is unreachable. However, that's only one of the many ways to get unreachable code. The presence of unreachable code in a program is the result of some error, not an error in itself. It's a symptom, not a cause. We think it's better to point the developer to the cause of the error. To do this, we made a number of diagnostics that detect errors which may lead to unreachable code appearing in the program.
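Here's a minimal sketch of my own (not a benchmark test) of the case V779 handles:

[[noreturn]] void fail(const char *msg);  // e.g. logs the message and calls abort()

int compute(int x)
{
  if (x < 0)
  {
    fail("negative value");
    x = 0;  // V779: this statement can never execute
  }
  return x;
}

In the Toyota ITC case, however, it was the V547 diagnostic that was triggered. Look at the example: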
void dead_code_002 ()
{
  int flag = 0;
  int a = 0;
  int ret;
  if (flag)
  {
    a ++; /*Tool should detect this line as error*/ /*ERROR:Dead Code*/
  }
  ret = a;
  sink = ret;
}
PVS-Studio warning: V547 Expression 'flag' is always false.
Here the flag variable is always 0, so the a ++; statement is unreachable. The analyzer warns that the condition of the if statement is always false. Even though PVS-Studio didn't issue the warning exactly on the a ++; line, I counted this test as passed.
Interestingly, a similar pattern occurs in real projects. There, however, the assignment and the use of the variable are usually separated by hundreds of lines of code. It's really difficult to find such an error without an analyzer.
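A contrived sketch of my own of what this looks like in practice (handleRareCase is a stand-in for some real action):

void handleRareCase();

void longFunction()
{
  int flag = 0;  // assigned here...

  // ...hundreds of lines that never modify 'flag'...

  if (flag)      // ...and checked here; V547: always false
  {
    handleRareCase();  // this call is unreachable
  }
}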
However, the V547 diagnostic remained silent on the following fragment.
void dead_code_001 ()
{
  int a = 0;
  int ret;
  if (0)
  {
    a ++; /*Tool should detect this line as error*/ /*ERROR:Dead Code*/
  }
  ret = a;
  sink = ret;
}
The thing is, the V547 diagnostic has an exception made specifically for cases like if (0) and while (1). We believe that if a developer writes such code, they know what they're doing, so there's no need to warn them about a suspicious condition. That's why PVS-Studio doesn't issue a warning on this example. This test, unlike the previous one, is purely synthetic and has no connection with reality. Still, I didn't mark it as passed.
Note. Why do developers write if (0) in real projects? It's simple. This is a well-known pattern for commenting out code: it's not executed but continues to compile. This allows you, when necessary, to bring the code back to life while being sure that it still compiles. Another, rarer technique: in debug mode, manually move the execution point into this code to perform some specific action that helps debugging, for example, to print some values. As for the while (1) construction, strange as it may seem, it occurs in real projects as the following pattern:
while (1)
{
  doSomething();
  if (condition) break;
  doSomethingElse();
}
This is a normal coding practice, and it makes no sense to issue a warning here.
The "null pointer" rule is another one on which PVS-Studio failed to pass 100% of the tests. The analyzer failed some of these tests because of an exception in the V522 diagnostic. Andrey Karpov has already described examples of this kind in his article.
Another rule that the analyzer couldn't cover at 100% is "free null pointer". This rule forbids passing a null pointer to the free function.
Note that calling the free function on a null pointer is not an error: in this case, the function simply does nothing. Nevertheless, we agree with the Toyota ITC Benchmark developers: in some cases, passing a null pointer to free may indicate an error. Here's a test example from the benchmark:
void free_null_pointer_001 ()
{
  char* buf = NULL;
  free(buf); /* Tool should detect this line as error */
             /*ERROR:Freeing a NULL pointer*/
  buf = NULL;
}
PVS-Studio warning: V575 The null pointer is passed into 'free' function. Inspect the first argument.
Here the analyzer does exactly what the test example expects — it warns that the null pointer buf is passed to the free function.
Things are not so good here:
int *free_null_pointer_002_gbl_ptr = NULL;

void free_null_pointer_002 ()
{
  int a = 20;
  if (a > 0)
  {
    free(free_null_pointer_002_gbl_ptr); /* Tool should detect this line as error */
                                         /*ERROR:Freeing a NULL pointer*/
    free_null_pointer_002_gbl_ptr = NULL;
  }
}
Here PVS-Studio is silent. The thing is, the V575 diagnostic issues a warning only if the free function receives exactly a null pointer. In this example, we're dealing with the non-constant global variable free_null_pointer_002_gbl_ptr. The analyzer stores virtual values only for constant global variables. The values of non-constant global variables can change anywhere in the program, so we don't track them. Because of this, PVS-Studio doesn't see that the free_null_pointer_002_gbl_ptr pointer is exactly null and doesn't issue a warning.
Okay, but can we teach the analyzer to see that this example passes exactly a null pointer? In this synthetic example, yes, we can. But it won't make PVS-Studio better: such an enhancement won't help find new errors in real code. In real projects, global variables are used in many places simultaneously, and it's hard (almost impossible for a static analyzer) to figure out which value a global variable holds at a given point.
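Here's a sketch of my own showing why (the file names and the second translation unit are hypothetical):

// file1.cpp
#include <cstdlib>

int *gbl_ptr = NULL;

void foo()
{
  // Is gbl_ptr still NULL here? The analyzer can't assume so:
  // any other translation unit may have changed it by this point.
  free(gbl_ptr);
}

// file2.cpp
#include <cstdlib>

extern int *gbl_ptr;

void bar()
{
  gbl_ptr = (int *)malloc(sizeof(int));  // changes the global's value
}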
There were other controversial tests. However, those examples are not so easy to explain, so I didn't analyze them in this note. Let me state the results once again: PVS-Studio covers Toyota ITC Benchmark by 12% at the 100% threshold, by 27% at the 75% threshold, and by 39% at the 50% threshold.
Above, we saw how PVS-Studio could increase its Toyota ITC Benchmark coverage. For example, simply disabling the exceptions in diagnostics would already noticeably increase the coverage. However, this mode wouldn't be useful for most of our users, and adding it only for the benchmark's sake is a very controversial decision. But if you're interested in something like this, contact us.
Thank you all for your attention, and may your code stay clean!