>
>
>
Tesseract. Recognizing Errors in Recogn…

Andrey Karpov
Articles: 671

Tesseract. Recognizing Errors in Recognition Software

Tesseract is a free software program for text recognition developed by Google. According to the project description, "Tesseract is probably the most accurate open source OCR engine available". And what if we try to catch some bugs there with the help of the PVS-Studio analyzer?

Tesseract

Tesseract is an optical character recognition engine for various operating systems and is free software originally developed as proprietary software in Hewlett Packard labs between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some migration from C to C++ in 1998. A lot of the code was written in C, and then some more was written in C++. Since then all the code has been converted to at least compile with a C++ compiler. Very little work was done in the following decade. It was then released as open source in 2005 by Hewlett Packard and the University of Nevada, Las Vegas (UNLV). Tesseract development has been sponsored by Google since 2006. [taken from Wikipedia]

The source code of the project is available at Google Code: https://code.google.com/p/tesseract-ocr/

The size of the source code is about 16 Mbytes.

Analysis results

Below I will cite those code fragments that caught my attention while examining PVS-Studio analysis report. I could have probably missed something, so Tesseract's authors should carry out their own analysis. The trial version is active through 7 days, which is more than enough for such a small project. It will be then up to them to decide if they want to use the tool regularly and catch typos or not.

As usual, let me remind you the basic law: the static analysis methodology is all about using it regularly, not on rare occasions.

Poor division

void LanguageModel::FillConsistencyInfo(....)
{
  ....
  float gap_ratio = expected_gap / actual_gap;
  if (gap_ratio < 1/2 || gap_ratio > 2) {
    consistency_info->num_inconsistent_spaces++;
  ....
}

PVS-Studio diagnostic messages: V636 The '1 / 2' expression was implicitly casted from 'int' type to 'float' type. Consider utilizing an explicit type cast to avoid the loss of a fractional part. An example: double A = (double)(X) / Y;. language_model.cpp 1163

The programmer wanted to compare the 'gap_ratio' variable with the value 0.5. Unfortunately, he chose a poor way to write 0.5. 1/2 is integer division and evaluates to 0.

The correct code should look like this:

if (gap_ratio < 1.0f/2 || gap_ratio > 2) {

or this:

if (gap_ratio < 0.5f || gap_ratio > 2) {

There are some other fragments with suspicious integer division. Some of them may also contain really unpleasant errors.

The following are the code fragments that need to be checked:

  • baselinedetect.cpp 110
  • bmp_8.cpp 983
  • cjkpitch.cpp 553
  • cjkpitch.cpp 564
  • mfoutline.cpp 392
  • mfoutline.cpp 393
  • normalis.cpp 454

Typo in a comparison

uintmax_t streamtoumax(FILE* s, int base) {
  int d, c = 0;
  ....
  c = fgetc(s);
  if (c == 'x' && c == 'X') c = fgetc(s);
  ....
}

PVS-Studio diagnostic message: V547 Expression 'c == 'x' && c == 'X'' is always false. Probably the '||' operator should be used here. scanutils.cpp 135

The fixed check:

if (c == 'x' || c == 'X') c = fgetc(s);

Undefined behavior

I have discovered one interesting construct I have never seen before:

void TabVector::Evaluate(....) {
  ....
  int num_deleted_boxes = 0;
  ....
  ++num_deleted_boxes = true;
  ....
}

PVS-Studio diagnostic message: V567 Undefined behavior. The 'num_deleted_boxes' variable is modified while being used twice between sequence points. tabvector.cpp 735

It's not clear what the author meant by this code; it must be the result of a typo.

The result of this expression can't be predicted: the variable 'num_deleted_boxes' may be incremented both before and after the assignment. The reason is that the variable changes twice in one sequence point.

Other errors causing undefined behavior are related to shifts. For example:

void Dawg::init(....)
{
  ....
  letter_mask_ = ~(~0 << flag_start_bit_);
  ....
}

Diagnostic message V610 Undefined behavior. Check the shift operator '<<. The left operand '~0' is negative. dawg.cpp 187

The '~0' expression is of the 'int' type and evaluates to '-1'. Shifting negative values causes undefined behavior, so it is just pure luck that the program works well. To fix the bug, we need to make '0' unsigned:

letter_mask_ = ~(~0u << flag_start_bit_);

But that's not all. This line also triggers one more warning:

V629 Consider inspecting the '~0 << flag_start_bit_' expression. Bit shifting of the 32-bit value with a subsequent expansion to the 64-bit type. dawg.cpp 187

The point is that the variable 'letter_mask_' is of the 'uinT64' type. As far as I understand, it may be needed to write ones into the most significant 32 bits. In this case, the implemented expression is incorrect because it can handle only the least significant bits.

We need to make '0' of a 64-bit type:

letter_mask_ = ~(~0ull << flag_start_bit_);

Here is a list of other code fragments where negative numbers are shifted:

  • dawg.cpp 188
  • intmatcher.cpp 172
  • intmatcher.cpp 174
  • intmatcher.cpp 176
  • intmatcher.cpp 178
  • intmatcher.cpp 180
  • intmatcher.cpp 182
  • intmatcher.cpp 184
  • intmatcher.cpp 186
  • intmatcher.cpp 188
  • intmatcher.cpp 190
  • intmatcher.cpp 192
  • intmatcher.cpp 194
  • intmatcher.cpp 196
  • intmatcher.cpp 198
  • intmatcher.cpp 200
  • intmatcher.cpp 202
  • intmatcher.cpp 323
  • intmatcher.cpp 347
  • intmatcher.cpp 366

Suspicious double assignment

TESSLINE* ApproximateOutline(....) {
  EDGEPT *edgept;
  ....
  edgept = edgesteps_to_edgepts(c_outline, edgepts);
  fix2(edgepts, area);
  edgept = poly2 (edgepts, area);  // 2nd approximation.
  ....
}

PVS-Studio diagnostic message: V519 The 'edgept' variable is assigned values twice successively. Perhaps this is a mistake. Check lines: 76, 78. polyaprx.cpp 78

Another similar error:

inT32 row_words2(....)
{
  ....
  this_valid = blob_box.width () >= min_width;
  this_valid = TRUE;
  ....
}

PVS-Studio diagnostic message: V519 The 'this_valid' variable is assigned values twice successively. Perhaps this is a mistake. Check lines: 396, 397. wordseg.cpp 397

Incorrect order of class member initialization

Let's examine the 'MasterTrainer' class first. Notice that the 'samples_' member is written before the 'fontinfo_table_' member:

class MasterTrainer {
  ....
  TrainingSampleSet samples_;
  ....
  FontInfoTable fontinfo_table_;
  ....
};

According to the standard, class members are initialized in the constructor in the same order as they are declared inside the class. It means that 'samples_' will be initialized PRIOR to 'fontinfo_table_'.

Now let's examine the constructor:

MasterTrainer::MasterTrainer(NormalizationMode norm_mode,
                             bool shape_analysis,
                             bool replicate_samples,
                             int debug_level)
  : norm_mode_(norm_mode), samples_(fontinfo_table_),
    junk_samples_(fontinfo_table_),
    verify_samples_(fontinfo_table_),
    charsetsize_(0),
    enable_shape_anaylsis_(shape_analysis),
    enable_replication_(replicate_samples),
    fragments_(NULL), prev_unichar_id_(-1),
    debug_level_(debug_level)
{
}

The trouble is about using a yet uninitialized variable 'fontinfo_table_' to initialize 'samples_'.

A similar problem in this class is with initializing the fields 'junk_samples_' and 'verify_samples_'.

I cannot say for sure what to do with this class. Perhaps it would be sufficient just to move the declaration of 'fontinfo_table_' into the very beginning of the class.

Typo in a condition

This typo is not clearly seen, but the analyzer is always alert.

class ScriptDetector {
  ....
  int korean_id_;
  int japanese_id_;
  int katakana_id_;
  int hiragana_id_;
  int han_id_;
  int hangul_id_;
  int latin_id_;
  int fraktur_id_;
  ....
};

void ScriptDetector::detect_blob(BLOB_CHOICE_LIST* scores) {
  ....
  if (prev_id == katakana_id_)
    osr_->scripts_na[i][japanese_id_] += 1.0;
  if (prev_id == hiragana_id_)
    osr_->scripts_na[i][japanese_id_] += 1.0;
  if (prev_id == hangul_id_)
    osr_->scripts_na[i][korean_id_] += 1.0;
  if (prev_id == han_id_)
    osr_->scripts_na[i][korean_id_] += kHanRatioInKorean;
  if (prev_id == han_id_)             <<<<====
    osr_->scripts_na[i][japanese_id_] += kHanRatioInJapanese;
  ....
}

PVS-Studio diagnostic message: V581 The conditional expressions of the 'if' operators situated alongside each other are identical. Check lines: 551, 553. osdetect.cpp 553

The very last comparison is very likely to look like this:

if (prev_id == japanese_id_)

Unnecessary checks

There is no need to check the return result of the 'new' operator. If memory cannot be allocated, it will throw an exception. You can, of course, implement a special 'new' operator that returns null pointers, but that is a special case (learn more).

Keeping that in mind, we can simplify the following function:

void SetLabel(char_32 label) {
  if (label32_ != NULL) {
    delete []label32_;
  }
  label32_ = new char_32[2];
  if (label32_ != NULL) {
    label32_[0] = label;
    label32_[1] = 0;
  }
}

PVS-Studio diagnostic message: V668 There is no sense in testing the 'label32_' pointer against null, as the memory was allocated using the 'new' operator. The exception will be generated in the case of memory allocation error. char_samp.h 73

There are 101 other fragments where a pointer returned by the 'new' operator is checked. I don't find it reasonable to enumerate them all here - you'd better launch PVS-Studio and find them yourself.

Conclusion

Please use static analysis regularly - it will help you save much time to spend on solving more useful tasks than catching silly mistakes and typos.

And don't forget to follow me on Twitter: @Code_Analysis. I regularly publish links to interesting articles on C++ there.