An unusual bug in Lucene.Net

14 Mar 2016

Author: Ilya Ivanov

Introduction
About the bug found
Conclusion
Additional links

When they hear about static analysis, some programmers say they don't really need it: their code is fully covered by unit tests, and that is enough to catch all the bugs. I recently found a bug that unit tests could catch in theory, but unless you already know it is there, you are very unlikely to write the test that exposes it.

Introduction

Lucene.Net is a port of the Lucene search engine library, written in C#, and targeted at .NET runtime users. The source code is open and available on the project website https://lucenenet.apache.org/.

The analyzer detected only 5 suspicious fragments. This is due to the slow pace of development, the small size of the code base, and the fact that the project is widely used in other projects for full-text search [1].

To be honest, I didn't expect to find more bugs. One of these errors seemed especially interesting to me, so I decided to tell our readers about it in our blog.

About the bug found

We have a diagnostic, V3035, that detects the case where a programmer mistakenly writes =+ instead of +=, so that the + becomes a unary plus. When I was writing it by analogy with V588, its counterpart for C++, I wondered whether a programmer could really make the same mistake in C#. In C++ it is understandable: people use all kinds of text editors instead of an IDE, and a typo can easily go unnoticed. But can you overlook such a misprint in Visual Studio, which automatically reformats the code as soon as you type a semicolon? It turns out you can. Such a bug was found in Lucene.Net. It is of great interest to us, mostly because it is rather hard to detect by any means other than static analysis. Let's take a look at the code:

protected virtual void Substitute( StringBuilder buffer )
{
    substCount = 0;
    for ( int c = 0; c < buffer.Length; c++ ) 
    {
        ....

        // Take care that at least one character
        // is left left side from the current one
        if ( c < buffer.Length - 1 ) 
        {
            // Masking several common character combinations
            // with an token
            if ( ( c < buffer.Length - 2 ) && buffer[c] == 's' &&
                buffer[c + 1] == 'c' && buffer[c + 2] == 'h' )
            {
                buffer[c] = '$';
                buffer.Remove(c + 1, 2);
                substCount =+ 2;
            }
            ....
            else if ( buffer[c] == 's' && buffer[c + 1] == 't' ) 
            {
                buffer[c] = '!';
                buffer.Remove(c + 1, 1);
                substCount++;
            }
            ....
        }
    }
}

The code above comes from the GermanStemmer class, which cuts off the suffixes of German words to extract a common root. It works as follows. First, the Substitute method replaces certain letter combinations with other symbols so that they are not confused with a suffix: for example, 'sch' becomes '$' and 'st' becomes '!' (both substitutions appear in the code above). The number of characters by which these substitutions shorten the word is stored in the substCount variable. Next, the Strip method cuts off the suffixes, and finally the Resubstitute method performs the reverse substitution: '$' to 'sch', '!' to 'st'. For instance, for the word "kapitalistischen" (capitalistic), the stemmer does the following: kapitalistischen => kapitali!i$en (Substitute) => kapitali!i$ (Strip) => kapitalistisch (Resubstitute).
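To make the masking step concrete, here is a minimal sketch of the substitution round trip. It is not the Lucene.Net implementation: the class name StemmerSketch is hypothetical, only the 'sch' and 'st' rules are kept, the Strip step is left out, and Resubstitute is approximated with plain string replacement.

```csharp
using System;
using System.Text;

static class StemmerSketch
{
    // Hypothetical reduction of Substitute: only the 'sch' -> '$'
    // and 'st' -> '!' rules from the article are reproduced here.
    public static string Substitute(string word)
    {
        var buffer = new StringBuilder(word);
        for (int c = 0; c < buffer.Length; c++)
        {
            if (c < buffer.Length - 2 && buffer[c] == 's' &&
                buffer[c + 1] == 'c' && buffer[c + 2] == 'h')
            {
                buffer[c] = '$';          // mask "sch" with one character
                buffer.Remove(c + 1, 2);  // word becomes 2 characters shorter
            }
            else if (c < buffer.Length - 1 &&
                     buffer[c] == 's' && buffer[c + 1] == 't')
            {
                buffer[c] = '!';          // mask "st" with one character
                buffer.Remove(c + 1, 1);  // word becomes 1 character shorter
            }
        }
        return buffer.ToString();
    }

    // Assumed reverse mapping: expand the masks back to the original letters.
    public static string Resubstitute(string word)
    {
        return word.Replace("$", "sch").Replace("!", "st");
    }

    static void Main()
    {
        string masked = Substitute("kapitalistischen");
        Console.WriteLine(masked);               // kapitali!i$en
        Console.WriteLine(Resubstitute(masked)); // kapitalistischen
    }
}
```

The masked word is three characters shorter than the original (one for 'st', two for 'sch'), and that difference is what substCount is supposed to track.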

Because of this typo, when 'sch' is replaced with '$', substCount is assigned the value 2 instead of being increased by 2. This error is really hard to find by means other than static analysis, and that is the answer to those who ask, "Do I need static analysis if I have unit tests?" To catch this bug with unit tests, one would have to test Lucene.Net on German text using GermanStemmer, and the tests would have to index a word that contains 'sch' together with one more letter combination that gets substituted. Moreover, that other combination must come before 'sch' in the word, so that substCount is already non-zero when substCount =+ 2 is executed. That is quite an unusual combination for a test, especially if you don't know the bug is there.
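The effect of the typo on the counter can be reproduced in isolation. The sketch below is an illustration under assumptions, not code from Lucene.Net: the class SubstCountSketch, the method CountRemoved, and its buggy flag are hypothetical, and only the 'sch' and 'st' rules are modelled.

```csharp
using System;
using System.Text;

static class SubstCountSketch
{
    // Hypothetical helper: returns the value substCount would hold after
    // the masking loop. 'buggy' selects the typo (=+) or the intended +=.
    public static int CountRemoved(string word, bool buggy)
    {
        var buffer = new StringBuilder(word);
        int substCount = 0;
        for (int c = 0; c < buffer.Length; c++)
        {
            if (c < buffer.Length - 2 && buffer[c] == 's' &&
                buffer[c + 1] == 'c' && buffer[c + 2] == 'h')
            {
                buffer[c] = '$';
                buffer.Remove(c + 1, 2);
                if (buggy)
                    substCount =+ 2;  // parsed as substCount = (+2): overwrites
                else
                    substCount += 2;  // intended: accumulates
            }
            else if (c < buffer.Length - 1 &&
                     buffer[c] == 's' && buffer[c + 1] == 't')
            {
                buffer[c] = '!';
                buffer.Remove(c + 1, 1);
                substCount++;
            }
        }
        return substCount;
    }

    static void Main()
    {
        // Only 'sch' occurs: both variants agree, so such a test passes.
        Console.WriteLine(CountRemoved("tischen", true));            // 2
        Console.WriteLine(CountRemoved("tischen", false));           // 2

        // 'st' comes before 'sch': the typo discards the earlier count.
        Console.WriteLine(CountRemoved("kapitalistischen", true));   // 2
        Console.WriteLine(CountRemoved("kapitalistischen", false));  // 3
    }
}
```

A test that uses a word like "tischen" passes with either operator; only an input where another substitution precedes 'sch' makes the two variants diverge.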

Conclusion

Unit tests and static analysis do not exclude each other; they complement each other as software development practices [2]. I suggest downloading the PVS-Studio static analyzer and finding the bugs that unit tests did not detect.