Tesseract. Recognizing Errors in Recognition Software

May 21 2014

Tesseract is a free text recognition program whose development is sponsored by Google. According to the project description, "Tesseract is probably the most accurate open source OCR engine available". So what happens if we try to catch some bugs in it with the PVS-Studio analyzer?



Tesseract is an optical character recognition engine for various operating systems and is free software originally developed as proprietary software in Hewlett Packard labs between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some migration from C to C++ in 1998. A lot of the code was written in C, and then some more was written in C++. Since then all the code has been converted to at least compile with a C++ compiler. Very little work was done in the following decade. It was then released as open source in 2005 by Hewlett Packard and the University of Nevada, Las Vegas (UNLV). Tesseract development has been sponsored by Google since 2006. [taken from Wikipedia]

The source code of the project is available at Google Code.

The size of the source code is about 16 Mbytes.

Analysis results

Below I cite the code fragments that caught my attention while I was examining the PVS-Studio analysis report. I may well have missed something, so Tesseract's authors should carry out their own analysis. The trial version is active for 7 days, which is more than enough for such a small project. It is then up to them to decide whether they want to use the tool regularly to catch typos.

As usual, let me remind you of the basic rule: static analysis pays off when it is used regularly, not on rare occasions.

Poor division

void LanguageModel::FillConsistencyInfo(....)
  float gap_ratio = expected_gap / actual_gap;
  if (gap_ratio < 1/2 || gap_ratio > 2) {

PVS-Studio diagnostic message: V636 The '1 / 2' expression was implicitly casted from 'int' type to 'float' type. Consider utilizing an explicit type cast to avoid the loss of a fractional part. An example: double A = (double)(X) / Y;. language_model.cpp 1163

The programmer wanted to compare the 'gap_ratio' variable with the value 0.5. Unfortunately, he chose a poor way to write 0.5: 1/2 is integer division and evaluates to 0. As a result, the check is actually 'gap_ratio < 0', which never fires for a ratio between 0 and 0.5.

The correct code should look like this:

if (gap_ratio < 1.0f/2 || gap_ratio > 2) {

or this:

if (gap_ratio < 0.5f || gap_ratio > 2) {
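
To make the difference concrete, here is a minimal sketch that puts the buggy and the fixed condition side by side. The helper names (IsGapInconsistentBuggy, IsGapInconsistentFixed, CheckGapRatio) are hypothetical and exist only for this illustration; they are not part of Tesseract.

```cpp
#include <cassert>

// Hypothetical helper mirroring the original condition.
// 1/2 is integer division and yields 0, so the lower bound is 0, not 0.5.
bool IsGapInconsistentBuggy(float gap_ratio) {
  return gap_ratio < 1/2 || gap_ratio > 2;
}

// Hypothetical helper with the corrected condition: 0.5f is a float literal.
bool IsGapInconsistentFixed(float gap_ratio) {
  return gap_ratio < 0.5f || gap_ratio > 2;
}

// Small self-check documenting the behavior of both variants.
void CheckGapRatio() {
  assert(!IsGapInconsistentBuggy(0.25f));  // a ratio of 0.25 slips through
  assert(IsGapInconsistentFixed(0.25f));   // the fixed check catches it
  assert(IsGapInconsistentFixed(3.0f));    // the upper bound works in both
  assert(!IsGapInconsistentFixed(1.0f));   // a ratio of 1 is consistent
}
```

Calling CheckGapRatio() passes silently, which shows that only the fixed variant flags a ratio of 0.25.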

There are some other fragments with suspicious integer division. Some of them may also contain really unpleasant errors.

The following are the code fragments that need to be checked:

  • baselinedetect.cpp 110
  • bmp_8.cpp 983
  • cjkpitch.cpp 553
  • cjkpitch.cpp 564
  • mfoutline.cpp 392
  • mfoutline.cpp 393
  • normalis.cpp 454

Typo in a comparison

uintmax_t streamtoumax(FILE* s, int base) {
  int d, c = 0;
  c = fgetc(s);
  if (c == 'x' && c == 'X') c = fgetc(s);

PVS-Studio diagnostic message: V547 Expression 'c == 'x' && c == 'X'' is always false. Probably the '||' operator should be used here. scanutils.cpp 135

The fixed check:

if (c == 'x' || c == 'X') c = fgetc(s);
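
A short sketch of why the original condition can never be true. The helper names below are hypothetical and are not taken from scanutils.cpp.

```cpp
#include <cassert>

// Hypothetical helper with the original condition: one character can never
// equal both 'x' and 'X', so this always returns false.
bool IsHexMarkerBuggy(int c) { return c == 'x' && c == 'X'; }

// Hypothetical helper with the fixed condition: either spelling is accepted.
bool IsHexMarkerFixed(int c) { return c == 'x' || c == 'X'; }

// Small self-check documenting the behavior of both variants.
void CheckHexMarker() {
  assert(!IsHexMarkerBuggy('x'));
  assert(!IsHexMarkerBuggy('X'));
  assert(IsHexMarkerFixed('x'));
  assert(IsHexMarkerFixed('X'));
  assert(!IsHexMarkerFixed('0'));
}
```

With the buggy check, the marker character of a prefix such as "0x" is never skipped, so the parser is left looking at 'x' instead of the first digit.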

Undefined behavior

I have discovered one interesting construct I have never seen before:

void TabVector::Evaluate(....) {
  int num_deleted_boxes = 0;
  ++num_deleted_boxes = true;

PVS-Studio diagnostic message: V567 Undefined behavior. The 'num_deleted_boxes' variable is modified while being used twice between sequence points. tabvector.cpp 735

It's not clear what the author meant by this code; it must be the result of a typo.

The result of this expression cannot be predicted: 'num_deleted_boxes' may be incremented either before or after the assignment. The reason is that the variable is modified twice between two sequence points.

Other errors causing undefined behavior are related to shifts. For example:

void Dawg::init(....)
  letter_mask_ = ~(~0 << flag_start_bit_);

PVS-Studio diagnostic message: V610 Undefined behavior. Check the shift operator '<<'. The left operand '~0' is negative. dawg.cpp 187

The '~0' expression is of the 'int' type and evaluates to '-1'. Left-shifting a negative value causes undefined behavior, so it is pure luck that the program works. To fix the bug, we need to make '0' unsigned:

letter_mask_ = ~(~0u << flag_start_bit_);

But that's not all. This line also triggers one more warning:

V629 Consider inspecting the '~0 << flag_start_bit_' expression. Bit shifting of the 32-bit value with a subsequent expansion to the 64-bit type. dawg.cpp 187

The point is that the 'letter_mask_' variable is of the 'uinT64' type. As far as I understand, the mask may need to have ones in its most significant 32 bits. If so, the expression is incorrect: it is evaluated in 32 bits and only then expanded to 64 bits, so it can handle only the least significant bits.

We need to make '0' of a 64-bit type:

letter_mask_ = ~(~0ull << flag_start_bit_);
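
The 64-bit form can be sketched as a standalone helper. LowBitsMask and CheckLowBitsMask are hypothetical names used only for this illustration, and the sketch assumes the shift count is in the range 0 to 63, since shifting a 64-bit value by 64 or more is itself undefined.

```cpp
#include <cassert>

// Hypothetical helper: returns a mask with the lowest 'bits' bits set.
// Assumes 0 <= bits < 64. The operand ~0ull is unsigned and 64 bits wide,
// so the shift is well defined and no high bits are lost.
unsigned long long LowBitsMask(int bits) {
  return ~(~0ull << bits);
}

// Small self-check documenting the expected masks.
void CheckLowBitsMask() {
  assert(LowBitsMask(0) == 0ull);
  assert(LowBitsMask(8) == 0xFFull);
  assert(LowBitsMask(40) == 0xFFFFFFFFFFull);  // 40 ones, beyond 32 bits
}
```

The last check is the case a 32-bit '~0u' cannot handle: a shift count of 40 exceeds the width of the operand.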

Here is a list of other code fragments where negative numbers are shifted:

  • dawg.cpp 188
  • intmatcher.cpp 172
  • intmatcher.cpp 174
  • intmatcher.cpp 176
  • intmatcher.cpp 178
  • intmatcher.cpp 180
  • intmatcher.cpp 182
  • intmatcher.cpp 184
  • intmatcher.cpp 186
  • intmatcher.cpp 188
  • intmatcher.cpp 190
  • intmatcher.cpp 192
  • intmatcher.cpp 194
  • intmatcher.cpp 196
  • intmatcher.cpp 198
  • intmatcher.cpp 200
  • intmatcher.cpp 202
  • intmatcher.cpp 323
  • intmatcher.cpp 347
  • intmatcher.cpp 366

Suspicious double assignment

TESSLINE* ApproximateOutline(....) {
  EDGEPT *edgept;
  edgept = edgesteps_to_edgepts(c_outline, edgepts);
  fix2(edgepts, area);
  edgept = poly2 (edgepts, area);  // 2nd approximation.

PVS-Studio diagnostic message: V519 The 'edgept' variable is assigned values twice successively. Perhaps this is a mistake. Check lines: 76, 78. polyaprx.cpp 78

Another similar error:

inT32 row_words2(....)
  this_valid = blob_box.width () >= min_width;
  this_valid = TRUE;

PVS-Studio diagnostic message: V519 The 'this_valid' variable is assigned values twice successively. Perhaps this is a mistake. Check lines: 396, 397. wordseg.cpp 397

Incorrect order of class member initialization

Let's examine the 'MasterTrainer' class first. Notice that the 'samples_' member is written before the 'fontinfo_table_' member:

class MasterTrainer {
  TrainingSampleSet samples_;
  FontInfoTable fontinfo_table_;

According to the standard, class members are initialized in the constructor in the same order as they are declared inside the class. It means that 'samples_' will be initialized PRIOR to 'fontinfo_table_'.

Now let's examine the constructor:

MasterTrainer::MasterTrainer(NormalizationMode norm_mode,
                             bool shape_analysis,
                             bool replicate_samples,
                             int debug_level)
  : norm_mode_(norm_mode), samples_(fontinfo_table_),
    fragments_(NULL), prev_unichar_id_(-1),

The trouble is that the not-yet-initialized member 'fontinfo_table_' is used to initialize 'samples_'.

A similar problem in this class is with initializing the fields 'junk_samples_' and 'verify_samples_'.

I cannot say for sure what to do with this class. Perhaps it would be sufficient just to move the declaration of 'fontinfo_table_' to the very beginning of the class.
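
The suggested reordering can be sketched with simplified stand-in types. FontTable, SampleSet and Trainer below are hypothetical and much smaller than the real Tesseract classes; the sketch also assumes that the sample set reads from the table during construction, which is the situation where the order matters.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for FontInfoTable.
struct FontTable {
  std::vector<int> fonts{1, 2, 3};
};

// Hypothetical stand-in for TrainingSampleSet: reads the table when constructed.
struct SampleSet {
  explicit SampleSet(const FontTable& table) : font_count(table.fonts.size()) {}
  std::size_t font_count;
};

class Trainer {
 public:
  // The order in the initializer list is irrelevant; declaration order decides.
  Trainer() : samples_(fontinfo_table_) {}
  std::size_t font_count() const { return samples_.font_count; }

 private:
  FontTable fontinfo_table_;  // declared first, so it is constructed first
  SampleSet samples_;         // may safely use fontinfo_table_ in its constructor
};
```

With the two declarations swapped, 'samples_' would be constructed from a table whose own constructor has not run yet, and reading 'table.fonts' there would be undefined behavior.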

Typo in a condition

This typo is hard to spot by eye, but the analyzer is always alert.

class ScriptDetector {
  int korean_id_;
  int japanese_id_;
  int katakana_id_;
  int hiragana_id_;
  int han_id_;
  int hangul_id_;
  int latin_id_;
  int fraktur_id_;

void ScriptDetector::detect_blob(BLOB_CHOICE_LIST* scores) {
  if (prev_id == katakana_id_)
    osr_->scripts_na[i][japanese_id_] += 1.0;
  if (prev_id == hiragana_id_)
    osr_->scripts_na[i][japanese_id_] += 1.0;
  if (prev_id == hangul_id_)
    osr_->scripts_na[i][korean_id_] += 1.0;
  if (prev_id == han_id_)
    osr_->scripts_na[i][korean_id_] += kHanRatioInKorean;
  if (prev_id == han_id_)             // <<<<====
    osr_->scripts_na[i][japanese_id_] += kHanRatioInJapanese;

PVS-Studio diagnostic message: V581 The conditional expressions of the 'if' operators situated alongside each other are identical. Check lines: 551, 553. osdetect.cpp 553

The very last comparison was most likely meant to look like this:

if (prev_id == japanese_id_)

Unnecessary checks

There is no need to check the result returned by the 'new' operator. If memory cannot be allocated, it throws an exception (std::bad_alloc). You can, of course, use or implement a special 'new' operator that returns a null pointer, such as the standard 'new (std::nothrow)' form, but that is a special case.

Keeping that in mind, we can simplify the following function:

void SetLabel(char_32 label) {
  if (label32_ != NULL) {
    delete []label32_;
  }
  label32_ = new char_32[2];
  if (label32_ != NULL) {
    label32_[0] = label;
    label32_[1] = 0;
  }
}

PVS-Studio diagnostic message: V668 There is no sense in testing the 'label32_' pointer against null, as the memory was allocated using the 'new' operator. The exception will be generated in the case of memory allocation error. char_samp.h 73

There are 101 other fragments where a pointer returned by the 'new' operator is checked. I don't find it reasonable to enumerate them all here - you'd better launch PVS-Studio and find them yourself.
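
A simplified version of the function might look like the sketch below. LabelHolder is a hypothetical class written for this illustration, and 'char_32' is assumed here to be a plain 'int' typedef. The sketch also allocates the new buffer before releasing the old one, which is a choice of this example rather than something the original code does.

```cpp
#include <cassert>
#include <new>

typedef int char_32;  // assumption made for this sketch

// Hypothetical holder class modelled on the fragment above.
class LabelHolder {
 public:
  LabelHolder() : label32_(nullptr) {}
  ~LabelHolder() { delete[] label32_; }
  LabelHolder(const LabelHolder&) = delete;
  LabelHolder& operator=(const LabelHolder&) = delete;

  void SetLabel(char_32 label) {
    // Plain 'new' throws std::bad_alloc on failure and never returns null,
    // so no null check is needed after it.
    char_32* fresh = new char_32[2];
    fresh[0] = label;
    fresh[1] = 0;
    delete[] label32_;  // deleting a null pointer is a no-op
    label32_ = fresh;
  }

  const char_32* label() const { return label32_; }

 private:
  char_32* label32_;
};

// The non-throwing form is the case where a null check does make sense.
char_32* TryAllocLabel() {
  return new (std::nothrow) char_32[2];  // returns nullptr on failure
}
```

Both null checks from the original function disappear: 'delete[]' accepts a null pointer, and the plain 'new' either succeeds or throws.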


Please use static analysis regularly: it will save you a lot of time that you can spend on more useful tasks than catching silly mistakes and typos.

And don't forget to follow me on Twitter: @Code_Analysis. I regularly publish links to interesting articles on C++ there.
