>
>
Searching for errors by means of virtua…

Ilya Ivanov
Articles: 7

Searching for errors by means of virtual values evaluation

In the process of static analysis, exact values or ranges of values of some variables and expressions can be evaluated. This is useful information, which can be used when searching for errors. We call such values 'virtual values', and this article is devoted to this topic.

If a static analyzer is able to evaluate an expression, this will allow to perform more in-depth analysis, and detect more errors. It certainly is not only about the exact results of the expressions such as 1+2, but also about evaluating the range of values taken by a variable at a certain point in the program. In PVS-Studio analyzer, we call the algorithms responsible for evaluation of ranges a mechanism of virtual values. Such a mechanism is there in the C/C++ kernel of the analyzer, as well as in the C# kernel. In this article we discuss the mechanism of virtual values using C#-analyzer as an example.

In order to search for bugs in C# projects we use Roslyn in PVS-Studio analyzer to get all the necessary information about the source code. Roslyn provides us with a syntax tree, information about the types, a search for dependencies and so on. During the analysis process, PVS-Studio traverses a syntax tree and applies the diagnostics to its nodes. Besides that, the analyzer collects the information that can be of use later while traversing. An example of such additional information is virtual values.

Creating virtual values

Virtual values are stored for fields, properties, variables, and parameters when they are first mentioned in the code. If the first mention is the variable declaration or assignment, we try to evaluate the virtual value by analyzing the expression to the right of the equal sign. Otherwise, we usually don't know anything about the property/field/parameter, and suppose that it can be of any legitimate value. Here is an example:

public class MyClass
{
    private bool hasElements = false;
    public void Func(byte x, List<int> list)
    {
        int y = x;
        hasElements = (list.Count >= 0);
        if (hasElements && y >= 0) //V3022
        {
        }
    }
}

When the analyzer gets to the body of the Func function during the traverse of the tree, it will start evaluating virtual values of the variables. In the first string we'll have the declaration of y variable that is initialized with x value. Since all we know about x is that it has a byte type, then it can take any value from 0 to 255. So, this range of values will be assigned as the virtual value of the y variable. The same will be done for the field hasElements: the analyzer knows that the Count property of the list cannot take negative values, that's why true will be assigned as a virtual value of the hasElements variable. Now during the analysis of hasElements && y >= 0 condition, we know that the left and the right part are true, so the whole condition is also always true, which triggers the V3022 diagnostic.

Let's take a closer look at how the virtual value of an expression is evaluated.

Evaluation of a virtual value of an expression

The virtual value is evaluated differently for variables of different types. In the case that a variable is of an integer type, the virtual value is stored as a set of value ranges which the variable can be in. For example, consider this code:

public void Func(int x)
{
    if (x >= -10 && x <= 10 && x != 0)
    {
        int y = x + 5;
    }
}

In the beginning of the function, we don't know anything about x variable; its range of values - all legitimate values if int type: [int.MinValue, int.MaxValue]. When entering the if block, we can improve the virtual value based on the condition. Thus, within the if block, the variable x will have the range of [-10, -1], [1, 10]. So, now the analyzer will take into account the virtual value if x is used during the evaluation of the expression. In our example, y will receive the virtual value [-5, 4], [6, 15].

For the expressions of bool type, the virtual value is evaluated differently. Here we have only three options: false, true, or unknown value. So, we'll just substitute enough values for all variables of the expression, and check whether the expression will take one and the same value. For example:

public void Func(uint x)
{
    bool b = (x >= 0); //V3022
}

No matter what values of x parameter we take, the expression x >= 0 will always be true. That's why by substituting several values instead of x, we'll make sure of it, and assign true as the virtual value for b.

One more example from the Umbraco project:

private object Aggregate(object[] args, string name)
{
    ....
    if (name != "Min" || name != "Max") //V3022
    {
        throw new ArgumentException("Can only use min or max...");
    }
    ....
}

To make sure that the condition in the example is always true, the analyzer will substitute name with "Min", "Max", "", null values. In each of these cases, either the left or right side of the expression is true, then the expression in the condition is always true.

Virtual values evaluated for all variables are stored separately for each block. Entering the nested block, the analyzer creates its own set of virtual values based on the parent block. For a simple nested block, it's just a copy of all virtual values. For conditions, loops, and other blocks, the virtual values aren't just copied, some additional restrictions can be imposed on them.

Improving virtual values in the if/else block

Let's take a look at the virtual values during the processing of if/else block.

public void Func(int x)
{
    if (x >= 0)
    {                    //x:[0, int.MaxValue]
        if (x <= 10)
        {                //x:[0, 10]
        ....
        }
        else
        {                //x:[11, int.MaxValue]
        ....
        }
    }
}

Having analyzed the condition x > = 0, PVS-Studio will limit the range of the x variable for the first if block with values [0, int. MaxValue]. After processing the second condition x <= 10, the analyzer will create two more copies of virtual values of x variable - one for the if block and the other - for the else block. Moreover, these copies will be restricted, taking into account the virtual value of this variable from the parent block and the virtual value of the expression in the condition. So for a nested if block, the virtual value of x variable will be [0, 10], and for the else block - all other values - [11, int.MaxValue].

After the traversing of if/else block, it is necessary to combine virtual values from the if and else blocks for each variable. We should also remember that if in the end of if or else block there was a jump statement, return, for instance, then we don't have to combine the values from this block. Examples:

public void Func1(int x)
{
    bool b1 = false;
    bool b2 = false;
    if (x >= 0)
    {
        ....
        b1 = true;
        b2 = true;
    }
    else
    {
        ....
        b1 = true;
    }
    //Here we know that b1 is always true
    //b2 can take any value
}

public void Func2(int x)
{
    if (x < 0)
        return;
    //Here we know that x is always non-negative
}

Loop processing

The peculiarity of virtual values evaluation for loops, is that the body of the loop should be traversed twice. Let's look at an example.

public void Func()
{
    int i = 0;
    while (i < 10)
    {
        if (i == 5)
        {
        ....
        }
        i++;
    }
}

If we just copied virtual values from the parent block to the while block, then we would get a false positive of V3022 during the analysis of the condition i == 5, as we know, before the loop the i variable was zero. That's why before analyzing the body of the loop, we should evaluate which values variables can take at the end of each iteration, and also in all blocks containing continue statement, then combine all these values with the values of the variables before the loop is entered. Besides that if we analyze the for loop, we need to consider the initialization blocks and the changes of the counter. After the values of all the possible entry points of the loop are combined, we should apply the condition of the loop in the same way as it is done for the if block. In such a way we will get correct virtual values for the variables, and we can start the second traverse of the loop, where we'll have the diagnostics applied.

After traversing the loop we should combine all virtual values of variables from all the points, from which we get to the statement that follows the loop. These are values we had before the beginning of the loop (if there was no iteration), values of variables in the end of the body of the loop, and values of variables in the blocks, containing the break or continue statement. All these values we have already evaluated and remembered when the loop was traversed for the first time. Now we should combine them all and apply a condition, which is the opposite to the condition of the loop.

This was quite a complicated explanation, so let's look at an example.

public void Func(bool condition1, bool condition2, bool condition3)
{
    int x = 0;
    while (condition1)
    {
        //x:[0, 1], [3]
        if (condition2)
        {
            x = 2;
            break;
        }
        if (condition3)
        {
            x = 3;
            continue;
        }
        x = 1;
    }
    //x:[0, 3]
}

In this example the x variable is zero before entering the loop. Having done the first traverse of the loop, the analyzer will evaluate that the x variable can also take the value 1 and 2 in the block with break, and 3 in the block with continue. Since we have three points of transition to the next iteration of the loop, the variable x can be 0, 1 or 3 at the beginning of the loop body. We can get from four points to the statement following the loop. So here x can be 0, 1, 2 or 3.

Also the analyzer evaluates which values can accept the variables inside case blocks of switch statement, inside the try/catch/finally and for other language constructs.

Division by zero

Division by zero, is one of the mistakes which can be found by means of virtual values. This diagnostic is peculiar due to the fact that not every division, where we may theoretically have a zero in the denominator, will trigger it. Let's look at an example:

public int GetBlockCount(int dataLength, int blockSize)
{
    return dataLength / blockSize;
}

In this function, the blockSize can theoretically take any int value, and zero is included in this range. But if the diagnostic is issued for this code, it will not make sense, as the warnings that are actually useful, will be lost among false positives. So we had to teach the analyzer to detect real suspicious divisions in the mass of all divisions, like these for example:

public string GetDownloadAvgRateString() {
    if (SecondsDownloading >= 0) {
        return GetSpeed(Downloaded / SecondsDownloading);
    } else {
        return "";
    }
}

or these:

public void Func(int x, int y)
{
    for (int i = -10; i <= 10; i++)
    {
        y = x / i;
    }
}

As a solution, we differentiate ranges of virtual values as precise and non-precise. By default, the range is considered to be non-precise, until it is updated by explicit assignment of a constant, or a variable with a precise range, or by limiting in the condition of if or loop statement. If a zero gets into the precise range or to the precise range bound, then the analyzer will issue a diagnostic "Division by zero".

Examples

Now let's take a look at several examples taken from real projects, which were found by PVS-Studio with the help of virtual values.

Example N1 (RavenDB).

internal static void UrlPathEncodeChar (char c, Stream result)
{
    if (c < 33 || c > 126) {
        byte [] bIn = Encoding.UTF8.GetBytes (c.ToString ());
        for (int i = 0; i < bIn.Length; i++) {
            ....
        }
    }
    else if (c == ' ') { //V3022
        result.WriteByte ((byte) '%');
        result.WriteByte ((byte) '2');
        result.WriteByte ((byte) '0');
    }
    else
        result.WriteByte ((byte) c);
}

The first condition of the UrlPathEncodeChar function processes special characters, the second condition is a special optimization for the space. But as the ASCII code of the space is 32, then the space will be handled by the first block. PVS-Studio finds this error in the following way: inside the if block, the virtual value of the c variable will be equal to [0, 32], [127, char.MaxValue], and inside the first else block - all the other values : [33, 126]. As the code of the space doesn't get into this range, the analyzer reports about the error V3022 - expression c == ' ' is always false.

Example N2 (ServiceStack).

protected override sealed void Initialize()
{
    if (RootDirInfo == null)
        RootDirInfo = new DirectoryInfo(WebHostPhysicalPath);

    if (RootDirInfo == null || !RootDirInfo.Exists) //V3063
        throw new ApplicationException("...");
    ....
}

We don't know anything about the RootDirInfo property in the beginning of the Initialize function. After the analysis of the RootDirInfo == null two more copies of virtual values are created: one for the if block, where RootDirInfo is null, the second - for the else block, where RootDirInfo isn't null. Although there is no else block in the example, the virtual values are still created for it. Furthermore, inside the if block, the RootDirInfo property is assigned with a new value, which was received after the call of the constructor. As the constructor never returns null, then the value of RootDirInfo in the if block isn't null. Also, as the RootDirInfo for the else block isn't null either, then we see that when combining these two branches RootDirInfo will never be equal to null after the processing of the first condition. That's why PVS-Studio will issue a warning during the analysis of the second condition V3063 - a part of the condition is always false.

Example N3 (ServiceStack).

public static TextNode ParseTypeIntoNodes(this string typeDef)
{
    ....
    var lastBlockPos = typeDef.IndexOf('<');
    if (lastBlockPos >= 0)
    {
        ....
        //V3022
        while (lastBlockPos != -1 || blockStartingPos.Count == 0)
        {
            var nextPos = typeDef.IndexOfAny(blockChars,
                lastBlockPos + 1);
            if (nextPos == -1)
                break;
            ....
            lastBlockPos = nextPos;
        }
    }
}

Let's see what happens with the variable lastBlockPos in this example. First it is assigned with a result of IndexOf function. The analyzer knows that the functions IndexOf, IndexOfAny, LastIndexOf, LastIndexOfAny return non-negative values, or -1. Therefore, the range of values of the lastBlockPos variable is [-1, int.MaxValue]. After entering the if block, the range will be limited only by non-negative values [0, int.MaxValue]. Then the analyzer will traverse the body of the while loop. During the declaration, the nextPos variable will get the range [0, int.MaxValue]. After the analysis of the condition if (nextPos == -1) two copies of virtual values of the nextPost variable get created: [-1] for the if branch, and [0, int.MaxValue] for the else branch. As the if branch has the break statement, then in the rest of the body of the loop for the nextPost variable, only virtual values are used for the else branch: [0, int.MaxValue], that are then assigned in the end of the lastBlockPos variable.

Thus, we have two transition points to the body of the loop: one when entering a loop in which lastBlockPos has a value of [0, int. MaxValue], and the second when moving to the next iteration in which lastBlockPos has a value of [0, int. MaxValue]. Consequently, lastBlockPos never gets negative values in the condition of the loop, which means that the condition of the loop is always true; so the V3022 diagnostic is issued for this code.

It should be noted, that this error is really hard to detect manually, because the body of the loop contains about forty lines, and it's hard to track the way all branches get traversed.

Example N4 (TransmissionRemoteDotnet).

// Converts an IPv4 address into a long, for reading from geo database
long AddressToLong(IPAddress ip)
{
    long num = 0;
    byte[] bytes = ip.GetAddressBytes();
    for (int i = 0; i < 4; ++i)
    {
        long y = bytes[i];
        if (y < 0) //V3022
            y += 256;
        num += y << ((3 - i) * 8);
    }

    return num;
}

Here the variable y of long type is assigned with a value having byte type. So, the check y < 0 makes no sense, because the byte type is unsigned.

Example N5 (MSBuild).

private bool ValidateTaskNode()
{
    bool foundInvalidNode = false;
    foreach (XmlNode childNode in _taskNode.ChildNodes)
    {
        switch (childNode.NodeType)
        {
            case XmlNodeType.Comment:
            case XmlNodeType.Whitespace:
            case XmlNodeType.Text:
                // These are legal, and ignored
                continue;
            case XmlNodeType.Element:
                if (childNode.Name.Equals("Code") ||
                    childNode.Name.Equals("Reference") ||
                    childNode.Name.Equals("Using"))
                {
                    continue;
                }
                else
                {
                     foundInvalidNode = true;
                }
                break;
            default:
                foundInvalidNode = true;
                break;
        }

        if (foundInvalidNode) //V3022
        {
            ....
            return false;
        }
    }

    return true;
}

In this example, the body of the loop contains two statements - switch, and if. Let's look at switch block. The first section with three case labels, contains only the continue statement, that's why we cannot move to checking the condition foundInvalidNode. The second case section either moves to the next loop iteration, or sets foundinvalidNode and exits from switch. And, finally, the default section also sets foundinvalidNode and exits from switch. Thus, after exiting the switch, the foundInvalidNode variable will always be true, which means that the following if is redundant. The analyzer took into account that there is a default branch in this switch block, so the control cannot go to checking the condition directly - one of the switch sections will be executed.

We should also say that inside this switch statement, the continue statement is related to the loop, while break exits the switch, not the loop!

Conclusion

Evaluation of variable values during the static analysis is a powerful tool for searching out bugs in the code. The code can contain complex branching, nested conditions and loops, blocks having the size of hundreds of lines. To trace the changes of the variable manually and find an error can be really difficult, that's why PVS-Studio static analyzer can be of great help.