An unusual bug in Lucene.Net

14 Mar 2016
Author:

Some programmers, on hearing about static analysis, say they don't really need it: their code is fully covered by unit tests, and that is enough to catch all the bugs. Recently I found a bug that unit tests could catch in theory, but unless you already know it is there, writing a test that exposes it is next to impossible.

Introduction

Lucene.Net is a port of the Lucene search engine library, written in C#, and targeted at .NET runtime users. The source code is open and available on the project website https://lucenenet.apache.org/.

The analyzer detected only 5 suspicious fragments, which can be explained by the slow pace of development, the small size of the project, and the fact that it is widely used in other projects for full-text search [1].

To be honest, I didn't expect to find more bugs. One of these errors seemed especially interesting to me, so I decided to tell our readers about it in our blog.

About the bug found

We have a diagnostic, V3035, that detects cases where a programmer writes =+ instead of +=, so that + becomes a unary plus. When I was writing it by analogy with the V588 diagnostic for C++, I wondered whether a programmer could really make the same mistake in C#. In C++ it is understandable: people use various text editors instead of an IDE, and a typo can easily go unnoticed. But is it possible to overlook the misprint when typing in Visual Studio, which automatically reformats the code once a semicolon is typed? It turns out that it is. Such a bug was found in Lucene.Net. It is of great interest to us, mostly because it is rather hard to detect by means other than static analysis. Let's take a look at the code:

protected virtual void Substitute( StringBuilder buffer )
{
    substCount = 0;
    for ( int c = 0; c < buffer.Length; c++ ) 
    {
        ....

        // Take care that at least one character
        // is left to the right of the current one
        if ( c < buffer.Length - 1 ) 
        {
            // Masking several common character combinations
            // with a token
            if ( ( c < buffer.Length - 2 ) && buffer[c] == 's' &&
                buffer[c + 1] == 'c' && buffer[c + 2] == 'h' )
            {
                buffer[c] = '$';
                buffer.Remove(c + 1, 2);
                substCount =+ 2;    // <= the typo: assigns +2 instead of adding 2
            }
            ....
            else if ( buffer[c] == 's' && buffer[c + 1] == 't' ) 
            {
                buffer[c] = '!';
                buffer.Remove(c + 1, 1);
                substCount++;
            }
            ....
        }
    }
}
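
The two spellings differ only in the order of two characters, yet the compiler reads them differently. The following is a minimal sketch, independent of Lucene.Net; the class and method names are hypothetical and exist only for illustration.

```csharp
public static class OperatorOrderDemo
{
    // Intended form: compound assignment adds 2 to the running total.
    public static int AddTwo(int count)
    {
        count += 2;
        return count;
    }

    // Typo form: parsed as count = (+2), so the previous value is discarded.
    public static int AssignPlusTwo(int count)
    {
        count =+ 2;
        return count;
    }
}
```

With a starting value of 5, AddTwo returns 7 while AssignPlusTwo returns 2. With a starting value of 0 both return 2, which is why the typo can stay hidden for a long time.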

Lucene.Net has a class GermanStemmer, which cuts off suffixes of German words to extract a common root. It works as follows: first, the Substitute method replaces certain letter combinations with other symbols so that they are not mistaken for a suffix. The substitutions include 'sch' to '$' and 'st' to '!' (both can be seen in the code example). The number of characters by which these changes shorten the word is stored in the substCount variable. Next, the Strip method cuts off the suffixes, and finally the Resubstitute method performs the reverse substitution: '$' to 'sch', '!' to 'st'. For instance, for the word "kapitalistischen" (capitalistic), the stemmer does the following: kapitalistischen => kapitali!i$en (Substitute) => kapitali!i$ (Strip) => kapitalistisch (Resubstitute).
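
The masking and unmasking steps can be sketched in isolation. This is a simplified illustration, not the Lucene.Net implementation: it assumes only the two substitutions quoted in the article ('sch' and 'st'), leaves out the Strip step entirely, and the class name SubstitutionSketch is hypothetical.

```csharp
using System.Text;

public static class SubstitutionSketch
{
    // Forward pass: mask 'sch' as '$' and 'st' as '!'.
    // Only these two pairs are handled; the real method covers more.
    public static string Substitute(string word)
    {
        var buffer = new StringBuilder(word);
        for (int c = 0; c < buffer.Length; c++)
        {
            if (c < buffer.Length - 2 && buffer[c] == 's' &&
                buffer[c + 1] == 'c' && buffer[c + 2] == 'h')
            {
                buffer[c] = '$';
                buffer.Remove(c + 1, 2);
            }
            else if (c < buffer.Length - 1 &&
                     buffer[c] == 's' && buffer[c + 1] == 't')
            {
                buffer[c] = '!';
                buffer.Remove(c + 1, 1);
            }
        }
        return buffer.ToString();
    }

    // Reverse pass: expand the masks back into the original letter groups.
    public static string Resubstitute(string word)
    {
        return word.Replace("$", "sch").Replace("!", "st");
    }
}
```

Under these assumptions, Substitute("kapitalistischen") yields "kapitali!i$en", and Resubstitute("kapitali!i$") yields "kapitalistisch", matching the first and last steps of the chain described in the article.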

Because of this typo, when 'sch' is replaced with '$', substCount is assigned 2 instead of being increased by 2. This error is really hard to find by methods other than static analysis. That is the answer to those who ask, "Do I need static analysis if I have unit tests?" To catch such a bug with unit tests, one would have to test Lucene.Net on German texts using GermanStemmer; the tests would have to index a word that contains 'sch' together with one more letter combination that gets substituted. Moreover, that combination must occur in the word before 'sch', so that substCount is not zero by the time the expression substCount =+ 2 is executed. Quite an unusual combination for a test, especially if you don't know the bug is there.
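
To make the test requirement concrete, here is a hedged sketch that counts removed characters with either spelling of the operator. It is not the Lucene.Net code: the class SubstCountSketch, the method CountRemoved and the useTypo flag are hypothetical, and only the 'sch' and 'st' substitutions from the excerpt are modelled.

```csharp
using System.Text;

public static class SubstCountSketch
{
    // Returns the number of characters removed by the two substitutions.
    // useTypo = true reproduces 'substCount =+ 2'; false uses the intended '+='.
    public static int CountRemoved(string word, bool useTypo)
    {
        var buffer = new StringBuilder(word);
        int substCount = 0;
        for (int c = 0; c < buffer.Length; c++)
        {
            if (c < buffer.Length - 2 && buffer[c] == 's' &&
                buffer[c + 1] == 'c' && buffer[c + 2] == 'h')
            {
                buffer[c] = '$';
                buffer.Remove(c + 1, 2);
                if (useTypo) { substCount =+ 2; } else { substCount += 2; }
            }
            else if (c < buffer.Length - 1 &&
                     buffer[c] == 's' && buffer[c + 1] == 't')
            {
                buffer[c] = '!';
                buffer.Remove(c + 1, 1);
                substCount++;
            }
        }
        return substCount;
    }
}
```

For "schule" both variants return 2, and for "schwester" (where 'st' comes after 'sch') both return 3, so a test built on either word passes despite the typo. Only a word such as "kapitalistischen", where 'st' precedes 'sch', separates them: the intended variant returns 3, the typo variant returns 2.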

Conclusion

Unit tests and static analysis do not exclude each other; they complement each other as software development techniques [2]. I suggest downloading the PVS-Studio static analyzer and finding the bugs that unit tests did not detect.

Additional links
