
Bugs across the world's languages. Let's check LanguageTool

Dec 15 2025

In this article, we'll look at the traveling bugs that the PVS-Studio static analyzer has detected in LanguageTool, a grammar, style, and spell checker.

Hello, everyone! Hallo zusammen! Hola a tothom! مرحباً بالجميع!

On our blog, we frequently discuss static analysis, linters, and related tools. Today, we have an interesting one! LanguageTool is a multilingual tool that helps correct and rephrase texts by checking spelling, style, and grammar.

Let's take a look at its code and explore some fascinating things that the PVS-Studio static code analyzer detected there.

We used the 7778ca1 commit to check the project.

How did we do it?

I used the PVS-Studio plugin for Visual Studio Code to analyze the project.

First, I built the project using the Project Manager for Java plugin. Then, I clicked the analyze button and waited for the result.

Note. You can read about how to install and use the PVS-Studio plugin for Visual Studio Code here.

Open 24/7

Developers frequently forget about descriptors. The LanguageTool project is no exception:

Fragment 1

private Dictionary getDictionary(
  Supplier<List<byte[]>> lines, 
  String dictPath, 
  String infoPath, 
  boolean isUserDict, 
  int userDictSize
){
  ....
  InputStream metadata;
  if (new File(infoPath).exists()) {
    metadata = new FileInputStream(infoPath);  // <=
  } else {
    metadata = getDataBroker().getFromResourceDirAsStream(infoPath);
  }
  Dictionary dict = Dictionary.read(fsaInStream, metadata);
  if (!isUserDict) {
    dicPathToDict.put(cacheKey, dict);
  } else if (userDictCacheSize != null) {
    getUserDictCache().put(userDictName, dict);
  }
  return dict;
}

PVS-Studio warning: V6127 The 'metadata' Closeable object is not closed. This may lead to a resource leak. MorfologikMultiSpeller.java 320

The developers created an instance of the InputStream class and then forgot about it.

The Closeable interface explicitly signals that the instance may hold a system resource which must be explicitly released upon completion. In our case, FileInputStream holds the file descriptor. If the reference to that descriptor is lost without calling close() first, the system holds the resource until the garbage collector runs its finalizer. Since we can't predict when—or even if—that will happen, the risk of a leak remains.

One might assume that the stream is closed inside the Dictionary.read method, but the method's documentation states otherwise.

Attempts to load a dictionary from opened streams of FSA dictionary data and associated metadata. Input streams are not closed automatically.

Without the explicit call to close(), the file descriptor leaks. This can exhaust the limit on open file descriptors in a long-running application or code that runs often, leading to critical I/O failures and even denial-of-service. It's safer to use constructs that guarantee predictable resource cleanup, such as try-with-resources.
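As a rough illustration of the try-with-resources advice, here is a minimal sketch. It does not use the real Dictionary API: countBytes is a hypothetical stand-in for a reader that, like Dictionary.read, leaves its input stream open, and TrackingStream is a hypothetical helper that only records whether close() was called.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class MetadataStreamSketch {

  /** Hypothetical helper: an in-memory stream that remembers whether close() was called. */
  static final class TrackingStream extends ByteArrayInputStream {
    boolean closed = false;

    TrackingStream(byte[] data) {
      super(data);
    }

    @Override
    public void close() throws IOException {
      closed = true;
      super.close();
    }
  }

  /** Hypothetical stand-in for a reader that does NOT close the stream it is given. */
  static int countBytes(InputStream in) throws IOException {
    int count = 0;
    while (in.read() != -1) {
      count++;
    }
    return count;
  }

  /** The caller owns the stream, so the caller closes it via try-with-resources. */
  static int loadAndClose(TrackingStream metadata) {
    try (InputStream in = metadata) {
      return countBytes(in);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    TrackingStream stream = new TrackingStream(new byte[] {1, 2, 3});
    int bytesRead = loadAndClose(stream);
    System.out.println(bytesRead + " bytes read, closed = " + stream.closed);
  }
}
```

The try block closes the stream on both the normal and the exceptional path, so the descriptor is released even if the reader throws.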

Your analyzer can't understand us

Fragment 2

public String getEnclitic(AnalyzedToken token) {
  ....
  if (word.endsWith("ه")) {
    suffix = "ه";
  ....
  else if ((word.equals("عني") || 
            word.equals("مني")) && 
            word.endsWith("ني") // <=
  ) {    
    suffix = "ني";
  }
  ....
}

PVS-Studio warning: V6007 Expression 'word.endsWith("ني")' is always false. ArabicTagger.java 428

In this code snippet, the tool works with the Arabic language. Here, our analyzer faced a language barrier!

Let's look at the first part of the condition, which involves checking whether the string is equal to either عني or مني. In both cases, the string doesn't end with ني, so the word.endsWith("ني") expression is always false, as the analyzer pointed out. But!

The string doesn't end with these characters only if we read it left to right, the way we're used to. But this is Arabic, which is read from right to left! The analyzer correctly identified the problematic line but misinterpreted it: both strings that let execution reach endsWith do end with ني. So, the condition is always true, not false. In other words, this check is redundant.
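To check this reasoning, here is a small sketch. String.endsWith compares characters in logical (storage) order, regardless of how the text is rendered on screen. The class and method names are hypothetical, and the Unicode escapes are our assumption about the code points behind the literals in the fragment (ain or meem, then noon, then yeh).

```java
final class EndsWithSketch {

  // Assumed code points for the literals from the fragment.
  static final String ANNI = "\u0639\u0646\u064A";   // ain + noon + yeh
  static final String MINNI = "\u0645\u0646\u064A";  // meem + noon + yeh
  static final String NI = "\u0646\u064A";           // noon + yeh

  /** The condition as written in the fragment. */
  static boolean originalCondition(String word) {
    return (word.equals(ANNI) || word.equals(MINNI)) && word.endsWith(NI);
  }

  /** The same condition without the endsWith check. */
  static boolean simplifiedCondition(String word) {
    return word.equals(ANNI) || word.equals(MINNI);
  }

  public static void main(String[] args) {
    // Both words end with the suffix in logical order, so endsWith returns true.
    System.out.println(ANNI.endsWith(NI));
    System.out.println(MINNI.endsWith(NI));
    // The two conditions agree for matching and non-matching input alike.
    for (String word : new String[] {ANNI, MINNI, "\u0647"}) {
      System.out.println(originalCondition(word) == simplifiedCondition(word));
    }
  }
}
```

Under these assumptions, every line printed is true, which supports the conclusion that the extra check can be dropped without changing behavior.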

Since we've never seen code that uses Arabic before, we didn't anticipate any issues. Now we know they exist :)

Catalan copy-paste

Fragment 3

private String removeOldDiacritics(String s) {
  return s
    .replace("contrapèl", "contrapel")
    .replace("Contrapèl", "Contrapel")
    .replace("vés", "ves")
    .replace("féu", "feu")
    .replace("desféu", "desfeu")
    .replace("adéu", "adeu")
    .replace("dóna", "dona")
    .replace("dónes", "dones")
    .replace("sóc", "soc")
    .replace("vénen", "venen")
    .replace("véns", "véns")       // <=
    .replace("fóra", "fora")
    .replace("Vés", "Ves")
    .replace("Féu", "Feu")
    .replace("Desféu", "Desfeu")
    .replace("Adéu", "Adeu")
    .replace("Dóna", "Dona")
    .replace("Dónes", "Dones")
    .replace("Sóc", "Soc")
    .replace("Vénen", "Venen")
    .replace("Véns", "Vens")
    .replace("Fóra", "Fora");
}

PVS-Studio warning: V6009 Function 'replace' receives an odd argument. The '"véns"' argument was passed several times. Catalan.java 453

LanguageTool uses the removeOldDiacritics method to normalize the spelling of certain Catalan words. In other words, it replaces obsolete or alternative forms with diacritic marks, such as è or é, with modern ones.

However, this code fragment contains an error: the word véns has been replaced with itself. Most likely, the developers copied the original word and replaced the character, but forgot to change the character in the second argument:

....
.replace("véns", "vens")
....
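One way to make such slips easier to catch is to keep the pairs in a table and check it for entries that map a word to itself. The sketch below is only an illustration, not how LanguageTool is organized: the class, the helper names, and the three-pair sample are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ReplacementTableSketch {

  /** Illustrative subset of the pairs; the caller chooses the target for "véns". */
  static Map<String, String> samplePairs(String targetForVens) {
    Map<String, String> pairs = new LinkedHashMap<>();
    pairs.put("vénen", "venen");
    pairs.put("véns", targetForVens);
    pairs.put("fóra", "fora");
    return pairs;
  }

  /** Returns every key whose replacement is identical to the key itself. */
  static List<String> findNoOpPairs(Map<String, String> pairs) {
    List<String> noOps = new ArrayList<>();
    for (Map.Entry<String, String> entry : pairs.entrySet()) {
      if (entry.getKey().equals(entry.getValue())) {
        noOps.add(entry.getKey());
      }
    }
    return noOps;
  }

  /** Applies the replacements in insertion order. */
  static String applyAll(String text, Map<String, String> pairs) {
    String result = text;
    for (Map.Entry<String, String> entry : pairs.entrySet()) {
      result = result.replace(entry.getKey(), entry.getValue());
    }
    return result;
  }

  public static void main(String[] args) {
    // With the buggy pair, the check reports one no-op entry.
    System.out.println(findNoOpPairs(samplePairs("véns")).size());
    // With the fixed pair, the table is clean and the word is normalized.
    System.out.println(findNoOpPairs(samplePairs("vens")).isEmpty());
    System.out.println(applyAll("véns", samplePairs("vens")));
  }
}
```

A check like findNoOpPairs could run in a unit test, so a pair that replaces a word with itself fails the build instead of slipping through review.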

Trust but verify

Fragment 4

@Nullable
@Override
public RuleMatch acceptRuleMatch(
  RuleMatch match, 
  Map<String, String> arguments, 
  int patternTokenPos, 
  AnalyzedTokenReadings[] patternTokens, 
  List<Integer> tokenPositions
) throws IOException {
  int posWord = 0;
  ....
  AnalyzedTokenReadings[] tokens = match.getSentence()
                                     .getTokensWithoutWhitespace();
  ....
  posWord = verbSynth.getLastVerbIndex() +
            verbSynth.getNumPronounsAfter() + 1;
  primerAdverbi = posWord;
  while (
    posWord < tokens.length &&
    !adverbiFinal.contains(
      tokens[posWord]
        .getToken()
        .toLowerCase()
    )
  ) {
    posWord++;
  }
  if (
    posWord == tokens.length || 
    !adverbiFinal.contains(
      tokens[posWord]
        .getToken()
        .toLowerCase()
    )
  ) {
    return null;
  }
  darrerAdverbi = posWord;
  String darrerAdverbiStr = tokens[darrerAdverbi].getToken();    // <=
  ....
  if (primerAdverbi == -1 || darrerAdverbi == -1) {              // <=
    return null;
  }
  ....
}

PVS-Studio warning: V6079 Value of the 'darrerAdverbi' variable is checked after use. Potential logical error is present. DonarseliBeFilter.java 83

Here's another method for Catalan!

Here, the analyzer warns that the darrerAdverbi variable was used as an index before checking whether it equals -1. And it really does—the variable is used just a couple of lines above the check.

However, upon looking at this warning, we found another intriguing aspect of the method! To figure this out, let's take a closer look.

After data preparation, a search is performed to determine where to start looking for the last adverb from the given set. The starting position depends on the index of the last verb and on how many pronouns follow it.

....
posWord = verbSynth.getLastVerbIndex() + 
          verbSynth.getNumPronounsAfter() + 1;
....

Both called methods can theoretically return -1 when neither a verb nor pronouns are found. If that happens, posWord ends up as -1.

Right after that, the value is used as an index without any checks. The posWord < tokens.length guard doesn't help, because -1 is guaranteed to be less than tokens.length. The unchecked access first occurs in the while loop condition, and the same expression is repeated in the if statement that follows:

....
if (
  posWord == tokens.length ||
  !adverbiFinal.contains(
    tokens[posWord]          // <=
      .getToken()
      .toLowerCase()
  )
....

With a negative index, tokens[posWord] throws an ArrayIndexOutOfBoundsException. So, the crash can happen even earlier than the warning suggests.
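A possible way to avoid the problem is to validate the starting position before it is ever used as an index. The sketch below works on plain strings rather than AnalyzedTokenReadings, and findFrom is a hypothetical helper, not a method from the project.

```java
import java.util.Set;

final class AdverbSearchSketch {

  /**
   * Returns the index of the first token at or after start that is contained
   * in targets (compared in lower case), or -1 if there is none or if start
   * is outside the array.
   */
  static int findFrom(String[] tokens, int start, Set<String> targets) {
    // Reject a negative or too-large start before touching the array.
    if (start < 0 || start >= tokens.length) {
      return -1;
    }
    for (int i = start; i < tokens.length; i++) {
      if (targets.contains(tokens[i].toLowerCase())) {
        return i;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    String[] tokens = {"li", "dona", "Bé"};
    Set<String> adverbs = Set.of("bé");
    System.out.println(findFrom(tokens, 0, adverbs));   // prints 2
    System.out.println(findFrom(tokens, -1, adverbs));  // prints -1, no exception
  }
}
```

With a helper like this, the caller checks the result against -1 once, before indexing, so the check and the use appear in the right order.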

A static analyzer usually catches such errors using data-flow analysis, a technique that tracks how values move through the code and determines their possible ranges at different execution points.

Fragment 5

protected List<RuleMatch> getRuleMatches(
  String word, int startPos,
  AnalyzedSentence sentence, 
  List<RuleMatch> ruleMatchesSoFar, 
  int idx, 
  AnalyzedTokenReadings[] tokens
) throws IOException {
  ....
  //Translator translator = getTranslator(globalConfig);
  Translator translator = null;      // <=
  if (
    translator != null &&            // <=
    ruleMatch == null && 
    motherTongue != null &&
    language.getShortCode().equals("en") &&          
    motherTongue.getShortCode().equals("de")
  ) {....}
  ....
}

PVS-Studio warning: V6007 Expression 'translator != null' is always false. MorfologikSpellerRule.java 449

The analyzer warns that the condition is always false. To see this, just look at the previous line where null is assigned to the first variable being checked.

However, one line above that, we can see a commented-out assignment that used to give this variable a real value. A quick look at the commit history shows that someone commented it out to disable certain features.

This isn't a full-fledged bug, but the always-false condition resulted in about thirty lines of dead code in the project. This code used to run when the condition still worked.

Messy code isn't just an aesthetic issue; it also allows bugs to sneak in. We covered this topic in another article.

Fragment 6

private static String getDigitHundredJarStatus(
  int digit, 
  String inflectionCase
) {
  if (inflectionCase.equals("jar") || 
      inflectionCase.equals("jar")) {
    return ArabicNumbersWordsConstants
             .arabicJarHundreds.get(digit);
  }
  return ArabicNumbersWordsConstants
           .arabicHundreds.get(digit);
}

PVS-Studio warning: V6001 There are identical sub-expressions 'inflectionCase.equals("jar")' to the left and to the right of the '||' operator. ArabicNumbersWords.java 215

Here we go again with another condition. This time, the left and right operands of the || operator are the same. At first, it looked like a refactoring mistake, but it wasn't: the condition has looked this way since the commit that added the file.

Most likely, the two operands originally differed, and the developers replaced the strings using IDE tools during editing, because the same pattern repeats a couple more times.

V6001 There are identical sub-expressions 'inflectionCase.equals("jar")' to the left and to the right of the '||' operator. ArabicNumbersWords.java 207

V6001 There are identical sub-expressions 'inflectionCase.equals("jar")' to the left and to the right of the '||' operator. ArabicNumbersWords.java 223

The imitation game

Have you ever seen the little code that couldn't?

Fragment 7

public static void main(
  String[] args
) throws IOException {
  ....
  List<RuleMatch> matches = lt.check(incorrectExample);
  for (RuleMatch match : matches) {
    if (match.getSuggestedReplacements().isEmpty()) {
      ....
      printRule(rule.getId(), rule, incorrectExample, popularity);
      noSuggestion++;
    } else {
      suggestion++;
    }
    break;     // <=
  }
  ....
}

PVS-Studio warning: V6037 An unconditional 'break' within a loop. NoSuggestionRuleList.java 97

The loop looks like it's running multiple times, but it always ends after the first iteration because of break.

Most likely, the developers wanted to process only the first found element. This is a common approach, especially when only the first match in the collection is required.

However, using a loop just to retrieve the first element is excessive and misleading. Anyone reading the code might expect more complex logic, but it never actually happens. This pattern isn't necessarily a bug, but it reduces readability and can cause confusion when someone else maintains the code later on.
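If only the first match matters, the intent can be stated directly. Here is a minimal sketch with hypothetical names: each match is modeled simply as its list of suggested replacements, instead of a real RuleMatch.

```java
import java.util.List;

final class FirstMatchSketch {

  /**
   * Classifies only the first match, which is what a loop with an
   * unconditional break effectively does.
   */
  static String classifyFirst(List<List<String>> suggestionsPerMatch) {
    if (suggestionsPerMatch.isEmpty()) {
      return "no match";
    }
    List<String> first = suggestionsPerMatch.get(0);
    return first.isEmpty() ? "no suggestion" : "suggestion";
  }

  public static void main(String[] args) {
    System.out.println(classifyFirst(List.of()));                              // no match
    System.out.println(classifyFirst(List.of(List.of(), List.of("fix"))));     // no suggestion
    System.out.println(classifyFirst(List.of(List.of("fix"), List.of())));     // suggestion
  }
}
```

The explicit isEmpty check plus get(0) tells the reader that later elements are ignored on purpose.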

Repetition is the key to mastery

Fragment 8

private final static Map<String, Integer> id2prio = new HashMap<>();
static {
  id2prio.put("I_E", 10); 
  id2prio.put("CHILDISH_LANGUAGE", 8);   
  id2prio.put("RUDE_SARCASTIC", 6);   
  id2prio.put("FOR_NOUN_SAKE", 6);   
  id2prio.put("YEAR_OLD_HYPHEN", 6);   
  id2prio.put("MISSING_HYPHEN", 5);
  id2prio.put("WRONG_APOSTROPHE", 5);
  ....
  id2prio.put("SPURIOUS_APOSTROPHE", 1);   // <=
  ....
  id2prio.put("SPURIOUS_APOSTROPHE", 1);   // <=
  ....
}

PVS-Studio warning: V6033 An item with the same key '"SPURIOUS_APOSTROPHE"' has already been added. English.java 391

This fragment loads a large set of values into a hash map. The set is really huge: the inserts start around line 283 and wrap up only at line 612. Somewhere in that wall of code, the analyzer spotted two entries with the same key.

To be fair, the second value matches the first, so it's not necessarily an error. However, there are cases where the opposite is true.

Fragment 9

....
id2prio.put("SENTENCE_FRAGMENT", -50);  
id2prio.put("SENTENCE_FRAGMENT", -51); 
id2prio.put("SEEMS_TO_BE", -51);
....

PVS-Studio warning: V6033 An item with the same key '"SENTENCE_FRAGMENT"' has already been added. English.java 594

In this fragment, the analyzer most likely points to a real error, because the values for the duplicate key differ. It's funny that the overwriting value appears on the very next line. Most likely, the developers copied the line with the SENTENCE_FRAGMENT key and forgot to change the key to SEEMS_TO_BE.
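Silent overwrites like this can also be caught at class-load time. The sketch below uses a hypothetical putUnique helper built on Map.putIfAbsent. It assumes non-null values and is not part of the project.

```java
import java.util.HashMap;
import java.util.Map;

final class UniquePutSketch {

  /** Adds the entry, or fails loudly if the key is already present. Assumes non-null values. */
  static <K, V> void putUnique(Map<K, V> map, K key, V value) {
    V previous = map.putIfAbsent(key, value);
    if (previous != null) {
      throw new IllegalStateException(
          "Duplicate key '" + key + "': " + previous + " vs " + value);
    }
  }

  public static void main(String[] args) {
    Map<String, Integer> id2prio = new HashMap<>();
    putUnique(id2prio, "SENTENCE_FRAGMENT", -50);
    try {
      putUnique(id2prio, "SENTENCE_FRAGMENT", -51);
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
    // The first value is kept because the second put was rejected.
    System.out.println(id2prio.get("SENTENCE_FRAGMENT"));
  }
}
```

Whether to throw or just log is a design choice. Throwing in a static initializer makes the duplicate impossible to miss, at the cost of failing class loading.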

Fragment 10

private final Map<String,String> adverb2Adj = new HashMap<String, String>() {{
  // irregular ones:
  put("well", "good");
  put("fast", "fast");
  put("hard", "hard");
  ....
  put("jokily", "jokey");
  ....
  put("jokily", "joking");
  ....
}};

PVS-Studio warning: V6033 An item with the same key '"jokily"' has already been added. AdverbFilter.java 199

This fragment, like the previous ones, deals with English-language processing. The hash table maps adverbs to their corresponding adjectives. The word jokily appears twice with two different options. The real issue is that the later entry overwrites the earlier one.

A better approach would be to store lists of strings instead of single values. This way, one adverb could map to several adjective options. With the current setup, one of the options disappears.
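The list-based idea could look roughly like this. The class and the add helper are hypothetical, and the sketch keeps only the jokily entries from the fragment.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class AdverbMultiMapSketch {

  /** Appends an adjective to the adverb's list, creating the list on first use. */
  static void add(Map<String, List<String>> map, String adverb, String adjective) {
    map.computeIfAbsent(adverb, key -> new ArrayList<>()).add(adjective);
  }

  public static void main(String[] args) {
    Map<String, List<String>> adverb2Adj = new HashMap<>();
    add(adverb2Adj, "well", "good");
    add(adverb2Adj, "jokily", "jokey");
    add(adverb2Adj, "jokily", "joking");
    // Both options survive instead of the second overwriting the first.
    System.out.println(adverb2Adj.get("jokily"));  // [jokey, joking]
  }
}
```

Code that reads the map would then need to handle a list of candidates, so this change affects callers as well as the initializer.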

The end

This concludes our journey through the LanguageTool project. We'll create an issue in the GitHub repository to notify the project developers of all errors detected by the analyzer.

By the way, this project was selected for analysis from our repository. You can also suggest projects you'd like us to check by submitting a pull request.

You can try PVS-Studio static code analyzer on your project by getting a trial license at this link.

Clean code to you folks!
