Why PVS-Studio uses data flow analysis: based on gripping error in Open Asset Import Library

Feb 18 2021

Author: Andrey Karpov

An essential part of any modern static code analyzer is data flow analysis. However, from an outside perspective, the use of data flow analysis and its benefit is unclear. Some people still consider static analysis a tool searching for something in code according to a certain pattern. Thus, we occasionally write blog posts to show how this or that technology, used in the PVS-Studio analyzer, helps to identify another interesting error. Today, we have such an article about the bug found in the Base64, one of the encoding standard implementations of binary data.

It all started with checking the latest version of the Qt 6 library. There was a separate usual article on this, where I'd described 77 errors found. It turned out that at first, I decided to flip through the report, not excluding the third-party libraries' warnings. In other words, I didn't exclude the warnings related to \src\3rdparty in the settings. It so happened that I immediately ran up against a gripping error example in the Open Asset Import Library. So, I decided to write this extra little note about it.

This defect highlights the benefit of data flow analysis in tools such as PVS-Studio. Without that, it's impossible to find numerous errors. By the way, if you're interested in learning more about data flow analysis and other aspects of the tool's setup, you can read the Technologies used in the PVS-Studio code analyzer for finding bugs and potential vulnerabilities article.

Now, let's turn our attention right to the error, found in the Open Asset Import Library (assimp). File: \src\3rdparty\assimp\src\code\FBX\FBXUtil.cpp.

std::string EncodeBase64(const char* data, size_t length)
{
    // calculate extra bytes needed to get a multiple of 3
    size_t extraBytes = 3 - length % 3;

    // number of base64 bytes
    size_t encodedBytes = 4 * (length + extraBytes) / 3;

    std::string encoded_string(encodedBytes, '=');

    // read blocks of 3 bytes
    for (size_t ib3 = 0; ib3 < length / 3; ib3++)
    {
        const size_t iByte = ib3 * 3;
        const size_t iEncodedByte = ib3 * 4;
        const char* currData = &data[iByte];

        EncodeByteBlock(currData, encoded_string, iEncodedByte);
    }

    // if size of data is not a multiple of 3,
    // also encode the final bytes (and add zeros where needed)
    if (extraBytes > 0)
    {
        char finalBytes[4] = { 0,0,0,0 };
        memcpy(&finalBytes[0], &data[length - length % 3], length % 3);

        const size_t iEncodedByte = encodedBytes - 4;
        EncodeByteBlock(&finalBytes[0], encoded_string, iEncodedByte);

        // add '=' at the end
        for (size_t i = 0; i < 4 * extraBytes / 3; i++)
            encoded_string[encodedBytes - i - 1] = '=';
    }
    return encoded_string;
}

If you want, for a start you may try to detect the error yourself. So that you don't accidentally read the answer right away, let me show you some other exciting articles and briefly tell you what Base64 is:). Here's a list of additional articles on related topics:

Ok, let's go on. Here is the coding algorithm implementation of a byte string in Base64 encoding. This is the coding standard of binary data with only 64 characters. The encoding alphabet contains text and numeric Latin characters A-Z, a-z, and 0-9 (62 characters) and 2 additional characters that vary among implementations. Base64 encoding converts every 3 source bytes into 4 encoded characters.

If only one or two bytes are left to encode, as a result, we have only the first two or three characters of the line. The output will be padded with one or two additional pad characters (=). The padding character "=" prevents further bits from being added to the reconstructed data. This point is implemented incorrectly in the function considered.

Found the error? Well done. If not, that's ok too. You need to delve into the code to notice that something goes wrong. The analyzer reports about this "something wrong" with the warning: V547 [CWE-571] Expression 'extraBytes > 0' is always true. FBXUtil.cpp 224

To understand what worried the analyzer, let's take a look at the initialization of the extraBytes variable:

// calculate extra bytes needed to get a multiple of 3
size_t extraBytes = 3 - length % 3;

The programmer planned to calculate how many additional bytes of input data need to be processed if their total number is not equal to 3. To do this, we just need to divide the number of processed bytes by modulo 3. A correct option of the variable initialization looks like this:

size_t extraBytes = length % 3;

Then, if, for example, 5 bytes are processed, we get 5 % 3 = 2. So, we need to additionally process 2 bytes. If the input received 6 bytes, then nothing needs to be processed separately, since 6 % 3 = 0.

Although, it may have meant the number of bytes missing for a multiple of three. Then, the correct code should look that way:

size_t extraBytes = (3 - length % 3) % 3;

Right now, I'm not interested in trying to figure out the right variant. Anyway, the programmer wrote some average meaningless version of the code:

size_t extraBytes = 3 - length % 3;

Right at the moment of analyzing this code, the analyzer uses data flow analysis. Whatever value is in the length variable, after modulo division, a value in the range [0..2] will be obtained. The PVS-Studio analyzer can work with ranges, exact values, and sets. That is, we are talking about Value Range Analysis. In this case, it is the range of values that will be used.

Let's continue the evaluations:

size_t extraBytes = 3 - [0..2];

It turns out that the extraBytes variable will never be equal to zero. The analyzer will evaluate the following possible range of its values: [1..3].

Until the moment of checking, the variable is not changed anywhere. The analyzer reports us that the check result will always be true. Therefore, the tool is absolutely right:

if (extraBytes > 0)

This is a simple but wonderful example. It shows how the data flow analysis allowed us to evaluate the range of variable values. It also helped us to be certain that the variable does not change, and finally, that the condition is always true.

Of course, the incorrectness of function operation is not limited to the execution of a code fragment that should not be executed. Everything goes awry there. Imagine, you want to encode 6 characters. In this case, the output string must contain 8 characters. Let's quickly estimate how the considered function will behave.

// calculate extra bytes needed to get a multiple of 3
size_t extraBytes = 3 - length % 3; // 3-6%3 = 3

// number of base64 bytes
size_t encodedBytes = 4 * (length + extraBytes) / 3; // 4*(6+3)/3 = 12

std::string encoded_string(encodedBytes, '=');

The output string happened to contain 12 characters, not 8. Further, everything will work in a wrong way, too. There's no point in going into details.

That's how nice and easy static analysis found the error in the code. Just imagine how painful it would be to debug and understand why the characters encoding in Base64 encoding went wrong. By the way, here comes the question of the third-party libraries' quality. I touched upon it in the following article: Why it is important to apply static analysis for open libraries that you add to your project.

Try to use PVS-Studio regularly in your development process to find many bugs as early as possible. You'll like it :). If you are developing an open-source project, you can use the analyzer for free. Thanks for your attention. Wish you bugless code.

#Cpp