Sviatoslav Razmyslov

Nov 20 2018

Tags:

#Cpp

NCBI Genome Workbench: Scientific Research under Threat

Nov 20 2018

Author: Sviatoslav Razmyslov

Introduction
Just two numerals to spoil it all
Other typos and copy-paste
/* with security is best be pedantic */
Suspicious loops
Bizarre array indexing
How to make users distrust your functions
Class designing mistakes
Unreachable code and other problems with code execution
Data read errors
Miscellaneous
Conclusion

Modern computer technologies, hardware and software solutions all make it much easier and faster for us to do various kinds of scientific research. Computer simulation is often the only way to verify many theories. Scientific software has its own peculiarities. For instance, it's often heavily tested yet poorly documented. But anyway, software is written by humans, and humans tend to make mistakes. When found in scientific applications, programming mistakes could cast doubt on the results of much of the research work. In this article, we will look at dozens of defects found in the code of the NCBI Genome Workbench software package.

Introduction

NCBI Genome Workbench offers researchers a rich set of integrated tools for studying and analyzing genetic data. Users can explore and compare data from multiple sources including the NCBI (National Center for Biotechnology Information) databases or the user's own private data.

As I already said, scientific software is usually richly covered by unit tests. When checking this project, I excluded 85 directories with test files from analysis, which makes about one thousand files. I guess this has to do with the test requirements for the various complex algorithms devised individually for each scientific study. That said, the rest of the code (other than the tests) is not as high-quality as one would like it to be. Well, this actually applies to any project that doesn't use static analysis yet :).

The data for this review (or I'd say research) was collected using PVS-Studio, a static code analyzer for C/C++/C#/Java.

Just two numerals to spoil it all

Using our bug database, which currently includes more than 12 thousand select samples, we can detect and describe specific coding patterns that lead to numerous errors. For example, we did the following studies:

With this project, we have discovered a new pattern. It has to do with the usage of numerals 1 and 2 in variable names such as file1 and file2, and the like. Such variables are very easy to mix up. Being a special case of typos, these defects all result from programmers' wish to work with variables sharing the same name save the ending numerals 1 and 2.

I'm running a bit ahead of the story, but I've got to tell you that all of the patterns we examined in the studies mentioned above are found in this project's code too :D.

Let's start with the first example from Genome Workbench:

V501 There are identical sub-expressions '(!loc1.IsInt() &&!loc1.IsWhole())' to the left and to the right of the '||' operator. nw_aligner.cpp 480

CRef<CSeq_align> CNWAligner::Run(CScope &scope, const CSeq_loc &loc1,
                                 const CSeq_loc &loc2,
                                 bool trim_end_gaps)
{
  if ((!loc1.IsInt() && !loc1.IsWhole()) ||
      (!loc1.IsInt() && !loc1.IsWhole()))
  {
    NCBI_THROW(CException, eUnknown,
               "Only whole and interval locations supported");
  }
  ....
}

You can see two variables, loc1 and loc2, and a typo: the loc2 variable is not used because loc1 is used one more time instead.

Another example:

V560 A part of conditional expression is always false: s1.IsSet(). valid_biosource.cpp 3073

static bool s_PCRPrimerSetLess(const CPCRPrimerSet& s1, const CPCRPrimerSet& s2)
{
  if (!s1.IsSet() && s1.IsSet()) {
    return true;
  } else if (s1.IsSet() && !s2.IsSet()) {
    return false;
  } else if (!s1.IsSet() && !s2.IsSet()) {
    return false;
  } else if (s1.Get().size() < s2.Get().size()) {
    return true;
  } else if (s1.Get().size() > s2.Get().size()) {
    return false;
  } else {
  .....
}

The programmer mixed up the variables s1 and s2 in the very first line. The name of the function suggests that it does comparison. But errors like that may come up just anywhere because if you name your variables Number1 and Number2, you are almost guaranteed to mess them up later. The more often these names are repeated in a function, the higher the risk.

Other typos and copy-paste

V501 There are identical sub-expressions to the left and to the right of the '!=' operator: bd.bit_.bits[i] != bd.bit_.bits[i] bm.h 296

bool compare_state(const iterator_base& ib) const
{
  ....
  if (this->block_type_ == 0
  {
    if (bd.bit_.ptr != ib_db.bit_.ptr) return false;
    if (bd.bit_.idx != ib_db.bit_.idx) return false;
    if (bd.bit_.cnt != ib_db.bit_.cnt) return false;
    if (bd.bit_.pos != ib_db.bit_.pos) return false;
    for (unsigned i = 0; i < bd.bit_.cnt; ++i)
    {
      if (bd.bit_.bits[i] != bd.bit_.bits[i]) return false;
    }
  }
  ....
}

I figure that after all those checks, the bits arrays of the objects bd.bit_ and ib_db.bit_ should be the same size. That's why the developer wrote one loop for element-by-element comparison of the bits arrays. But they mistyped the name of one of the objects under comparison. As a result, the objects may incorrectly compare equal in certain situations.

That's a nice example worth mentioning in the article "The Evil within the Comparison Functions".

V501 There are identical sub-expressions 'CFieldHandler::QualifierNamesAreEquivalent(field, kFieldTypeSeqId)' to the left and to the right of the '||' operator. field_handler.cpp 152

bool CFieldHandlerFactory::s_IsSequenceIDField(const string& field)
{
  if (   CFieldHandler::QualifierNamesAreEquivalent(field, kFieldTypeSeqId)
      || CFieldHandler::QualifierNamesAreEquivalent(field, kFieldTypeSeqId)) {
    return true;
  } else {
    return false;
  }
}

It looks like one of the checks is redundant. I haven't found any other variables with a name similar to kFieldTypeSeqId. And using the "||" operator could still invoke one extra call to the function, thus slowing down the program.

Here are two more cases of the same kind to be examined:

V501 There are identical sub-expressions 'uf->GetData().IsBool()' to the left and to the right of the '&&' operator. variation_utils.cpp 1711
V501 There are identical sub-expressions 'uf->GetData().IsBool()' to the left and to the right of the '&&' operator. variation_utils.cpp 1735

V766 An item with the same key 'kArgRemote' has already been added. blast_args.cpp 3262

void
CBlastAppArgs::x_IssueWarningsForIgnoredOptions(const CArgs& args)
{
  set<string> can_override;
  ....
  can_override.insert(kArgOutputFormat);
  can_override.insert(kArgNumDescriptions);
  can_override.insert(kArgNumAlignments);
  can_override.insert(kArgMaxTargetSequences);
  can_override.insert(kArgRemote);               // <=
  can_override.insert(kArgNumThreads);
  can_override.insert(kArgInputSearchStrategy);
  can_override.insert(kArgRemote);               // <=
  can_override.insert("remote_verbose");
  can_override.insert("verbose");
  ....
}

The analyzer has detected the addition of two identical values to a set container. As you know, this type of container can store only unique values and doesn't permit duplicate elements.

Code like that is often written using the copy-paste technique. What we are dealing with here is probably just an extra element, but it could also be a copy that was to be renamed to make a new variable. Deleting an extra insert call can help optimize the code a bit, but that's not a big deal. A much more serious concern is that this could be a missing element of the set.

V523 The 'then' statement is equivalent to the subsequent code fragment. vcf_reader.cpp 1105

bool
CVcfReader::xAssignFeatureLocationSet(....)
{
  ....
  if (data.m_SetType == CVcfData::ST_ALL_DEL) {
    if (data.m_strRef.size() == 1) {
      //deletion of a single base
      pFeat->SetLocation().SetPnt().SetPoint(data.m_iPos-1);
      pFeat->SetLocation().SetPnt().SetId(*pId);
    }
    else {
      pFeat->SetLocation().SetInt().SetFrom(data.m_iPos-1);
      //-1 for 0-based, 
      //another -1 for inclusive end-point ( i.e. [], not [) )
      pFeat->SetLocation().SetInt().SetTo( 
         data.m_iPos -1 + data.m_strRef.length() - 1); 
      pFeat->SetLocation().SetInt().SetId(*pId);
    }
    return true;
  }

  //default: For MNV's we will use the single starting point
  //NB: For references of size >=2, this location will not
  //match the reference allele.  Future Variation-ref
  //normalization code will address these issues,
  //and obviate the need for this code altogether.
  if (data.m_strRef.size() == 1) {
    //deletion of a single base
    pFeat->SetLocation().SetPnt().SetPoint(data.m_iPos-1);
    pFeat->SetLocation().SetPnt().SetId(*pId);
  }
  else {
    pFeat->SetLocation().SetInt().SetFrom(data.m_iPos-1);
    pFeat->SetLocation().SetInt().SetTo( 
      data.m_iPos -1 + data.m_strRef.length() - 1); 
    pFeat->SetLocation().SetInt().SetId(*pId);
  }
  return true;
}

The function contains large and absolutely identical blocks of code, while the comments are different. This code is written in a non-optimal and confusing way and may be faulty.

Here's the full list of suspicious if-else statements:

V523 The 'then' statement is equivalent to the 'else' statement. blk.c 2142
V523 The 'then' statement is equivalent to the subsequent code fragment. odbc.c 379
V523 The 'then' statement is equivalent to the subsequent code fragment. odbc.c 1414
V523 The 'then' statement is equivalent to the 'else' statement. seqdbvol.cpp 1922
V523 The 'then' statement is equivalent to the 'else' statement. seqdb_demo.cpp 466
V523 The 'then' statement is equivalent to the subsequent code fragment. blast_engine.c 1917
V523 The 'then' statement is equivalent to the 'else' statement. blast_filter.c 420
V523 The 'then' statement is equivalent to the 'else' statement. blast_parameters.c 636
V523 The 'then' statement is equivalent to the 'else' statement. unordered_spliter.cpp 684
V523 The 'then' statement is equivalent to the 'else' statement. bme.cpp 333
V523 The 'then' statement is equivalent to the 'else' statement. gme.cpp 484

/* with security is best be pedantic */

V597 The compiler could delete the 'memset' function call, which is used to flush 'passwd_buf' buffer. The memset_s() function should be used to erase the private data. challenge.c 366

/**
 * Crypt a given password using schema required for NTLMv1 authentication
 * @param passwd clear text domain password
 * @param challenge challenge data given by server
 * @param flags NTLM flags from server side
 * @param answer buffer where to store crypted password
 */
void
tds_answer_challenge(....)
{
#define MAX_PW_SZ 14
  ....
  if (ntlm_v == 1) {
    ....
    /* with security is best be pedantic */
    memset(hash, 0, sizeof(hash));
    memset(passwd_buf, 0, sizeof(passwd_buf));
    memset(ntlm2_challenge, 0, sizeof(ntlm2_challenge));
  } else {
    ....
  }
}

As you have already guessed, the title of this section is taken from the funny comment about security.

In brief, the compiler will delete the memset function because the buffers supposed to be cleared are no longer used. As a result, such data as hash or passwd_buf won't be erased. This non-obvious feature of the compiler is discussed in more detail in the article "Safe Clearing of Private Data".

V597 The compiler could delete the 'memset' function call, which is used to flush 'answer' object. The memset_s() function should be used to erase the private data. challenge.c 561

static TDSRET
tds7_send_auth(....)
{
  ....
  /* for security reason clear structure */
  memset(&answer, 0, sizeof(TDSANSWER));

  return tds_flush_packet(tds);
}

That's not the only snippet with "security" comments. Judging by those comments, the authors do care about security, so I'm including the complete - and pretty long - list of all such defects detected:

V597 The compiler could delete the 'memset' function call, which is used to flush 'heap' object. The memset_s() function should be used to erase the private data. ncbi_heapmgr.c 1300
V597 The compiler could delete the 'memset' function call, which is used to flush 'context' object. The memset_s() function should be used to erase the private data. challenge.c 167
V597 The compiler could delete the 'memset' function call, which is used to flush 'ks' object. The memset_s() function should be used to erase the private data. challenge.c 339
V597 The compiler could delete the 'memset' function call, which is used to flush 'md5_ctx' object. The memset_s() function should be used to erase the private data. challenge.c 353
V597 The compiler could delete the 'memset' function call, which is used to flush 'hash' buffer. The memset_s() function should be used to erase the private data. challenge.c 365
V597 The compiler could delete the 'memset' function call, which is used to flush 'ks' object. The memset_s() function should be used to erase the private data. challenge.c 406
V597 The compiler could delete the 'memset' function call, which is used to flush 'ntlm_v2_response' object. The memset_s() function should be used to erase the private data. login.c 795
V597 The compiler could delete the 'memset' function call, which is used to flush 'answer' object. The memset_s() function should be used to erase the private data. login.c 801
V597 The compiler could delete the 'memset' function call, which is used to flush 'packet' buffer. The memset_s() function should be used to erase the private data. numeric.c 256
V597 The compiler could delete the 'memset' function call, which is used to flush 'packet' buffer. The memset_s() function should be used to erase the private data. numeric.c 110
V597 The compiler could delete the 'memset' function call, which is used to flush 'pwd' buffer. The memset_s() function should be used to erase the private data. getpassarg.c 50
V597 The compiler could delete the 'memset' function call, which is used to flush 'context' object. The memset_s() function should be used to erase the private data. challenge.c 188
V597 The compiler could delete the 'memset' function call, which is used to flush 'buf' buffer. The memset_s() function should be used to erase the private data. challenge.c 243
V597 The compiler could delete the 'memset' function call, which is used to flush 'ntlm_v2_hash' buffer. The memset_s() function should be used to erase the private data. challenge.c 309
V597 The compiler could delete the 'memset' function call, which is used to flush 'md5_ctx' object. The memset_s() function should be used to erase the private data. challenge.c 354
V597 The compiler could delete the 'memset' function call, which is used to flush 'passwd_buf' buffer. The memset_s() function should be used to erase the private data. challenge.c 380
V597 The compiler could delete the 'memset' function call, which is used to flush 'ks' object. The memset_s() function should be used to erase the private data. challenge.c 393
V597 The compiler could delete the 'memset' function call, which is used to flush 'hash' buffer. The memset_s() function should be used to erase the private data. challenge.c 394
V597 The compiler could delete the 'memset' function call, which is used to flush 'ntlm2_challenge' buffer. The memset_s() function should be used to erase the private data. challenge.c 395
V597 The compiler could delete the 'memset' function call, which is used to flush 'ks' object. The memset_s() function should be used to erase the private data. challenge.c 419
V597 The compiler could delete the 'memset' function call, which is used to flush 'ntlm_v2_response' object. The memset_s() function should be used to erase the private data. challenge.c 556

Suspicious loops

V534 It is likely that a wrong variable is being compared inside the 'for' operator. Consider reviewing 'i'. taxFormat.cpp 569

void CTaxFormat::x_LoadTaxTree(void)
{
  ....
  for(size_t i = 0; i < alignTaxids.size(); i++) {
    int tax_id = alignTaxids[i];
    ....
    for(size_t j = 0; i < taxInfo.seqInfoList.size(); j++) {
      SSeqInfo* seqInfo = taxInfo.seqInfoList[j];
      seqInfo->taxid = newTaxid;
    }
    ....
  }
  ....
}

I suspect that the i variable wasn't really meant to be used in the inner loop's condition. It got there by mistake and should have been j instead.

V535 The variable 'i' is being used for this loop and for the outer loop. Check lines: 302, 309. sls_alp.cpp 309

alp::~alp()
{
  ....
  if(d_alp_states)
  {
    for(i=0;i<=d_nalp;i++)           // <=
    {
      if(i<=d_alp_states->d_dim)
      {
        if(d_alp_states->d_elem[i])
        {
          for(i=0;i<=d_nalp;i++)     // <=
          {
            ....
  ....
}

Two twin nested loops resetting the global counter to zero - that doesn't look right at all. The authors should take a good look at what's going on here.

Bizarre array indexing

V520 The comma operator ',' in array index expression '[-- i2, -- k]'. nw_spliced_aligner16.cpp 564

void CSplicedAligner16::x_DoBackTrace (
    const Uint2* backtrace_matrix, CNWAligner::SAlignInOut* data,
    int i_global_max, int j_global_max)
{
  ....
  while(intron_length < m_IntronMinSize || (Key & donor) == 0) {
      Key = backtrace_matrix[--i2, --k];
      ++intron_length;
      data->m_transcript.push_back(eTS_Intron);
  }
  ....
}

I'll tell you right off that there's no apparent error here (at least for now, lol). Take a look at this line:

Key = backtrace_matrix[--i2, --k];

The word 'matrix' and double indexing could make you think that this is a two-dimensional array, but it's not so. It's a regular pointer to an array of integers. But it was not for nothing that we designed the V520 diagnostic. Programmers do tend to get confused when indexing into two-dimensional arrays.

Here, the author simply wanted to save on one extra line of code, but why not write it like this then:

--i2;
Key = backtrace_matrix[--k];

V661 A suspicious expression 'A[B == C]'. Probably meant 'A[B] == C'. ncbi_service_connector.c 180

static EHTTP_HeaderParse s_ParseHeader(const char* header, ....)
{
  ....
  if (sscanf(header, "%u.%u.%u.%u%n", &i1, &i2, &i3, &i4, &n) < 4
      ||  sscanf(header + n, "%hu%x%n", &uuu->port, &tkt, &m) < 2
      || (header[m += n]  &&  !(header[m] == '$')  &&
          !isspace((unsigned char)((header + m)
                                   [header[m] == '$'])))) {
      break/*failed - unreadable connection info*/;
  }
  ....
}

This is another snippet where I had a hard time figuring out what was going on :D. The isspace() function is used to check the character with the m index, but if that character is '$', then what is passed to the function is the character with the index m + 1. However, the check for '$' has been already done before. Perhaps there's no error here, but this code could definitely be rewritten in a clearer way.

V557 Array overrun is possible. The 'row' index is pointing beyond array bound. aln_reader.cpp 412

bool CAlnReader::x_IsGap(TNumrow row, TSeqPos pos, const string& residue)
{
  if (m_MiddleSections.size() == 0) {
    x_CalculateMiddleSections();
  }
  
  if (row > m_MiddleSections.size()) {
      return false;
  }
  if (pos < m_MiddleSections[row].first) {
    ....
  }
  ....
}

This one is serious. The correct check of the row index should look like this:

if (row >= m_MiddleSections.size()) {
  return false;
}

Otherwise, there's a risk of accessing the data beyond the MiddleSections vector.

There are plenty of defects like that:

V557 Array overrun is possible. The 'i' index is pointing beyond array bound. resource_pool.hpp 388
V557 Array overrun is possible. The 'row' index is pointing beyond array bound. aln_reader.cpp 418
V557 Array overrun is possible. The 'fmt_idx' index is pointing beyond array bound. seq_writer.cpp 384
V557 Array overrun is possible. The 'fmt_idx' index is pointing beyond array bound. blastdb_formatter.cpp 183
V557 Array overrun is possible. The 'num' index is pointing beyond array bound. newcleanupp.cpp 13035

How to make users distrust your functions

V570 The 'm_onClickFunction' variable is assigned to itself. alngraphic.hpp 103

void SetOnClickFunctionName(string onClickFunction) {
  m_onClickFunction = m_onClickFunction;
}

No comment. You can only feel for users clicking again and again to no avail.

Two more cases where a variable is assigned to itself:

V570 The 'iter->level' variable is assigned to itself. align_format_util.cpp 189
V570 The 'd_elements_values[ind]' variable is assigned to itself. sls_alp_data.cpp 1416

V763 Parameter 'w1' is always rewritten in function body before being used. bmfunc.h 5363

/// Bit COUNT functor
template<typename W> struct bit_COUNT
{
  W operator()(W w1, W w2) 
  {
    w1 = 0;
    BM_INCWORD_BITCOUNT(w1, w2);
    return w1;
  }
};

A function that has its argument overwritten right after the invocation may confuse the developers. This code should be reviewed.

Class designing mistakes

V688 The 'm_qsrc' function argument possesses the same name as one of the class members, which can result in a confusion. compart_matching.cpp 873

class CElementaryMatching: public CObject
{
  ....
  ISequenceSource * m_qsrc;
  ....
  void x_CreateIndex (ISequenceSource *m_qsrc, EIndexMode index_more, ....);
  void x_CreateRemapData(ISequenceSource *m_qsrc, EIndexMode mode);
  void x_LoadRemapData (ISequenceSource *m_qsrc, const string& sdb);
  ....
};

Three class functions at once have an argument of the same name as a class field. This may lead to mistakes in the function bodies: the programmer may think they're working with a class member, while in reality they are altering the local variable's value.

V614 Uninitialized variable 'm_BitSet' used. SnpBitAttributes.hpp 187

/// SNP bit attribute container.
class CSnpBitAttributes
{
public:
  ....
private:
  /// Internal storage for bits.
  Uint8 m_BitSet;
};

inline CSnpBitAttributes::CSnpBitAttributes(Uint8 bits) : m_BitSet(bits)
{
}

inline CSnpBitAttributes::CSnpBitAttributes(const vector<char>& octet_string)
{
  auto count = sizeof(m_BitSet);
  auto byte = octet_string.end();

  do
    m_BitSet = (m_BitSet << 8) | *--byte;
  while (--count > 0);
}

One of the constructors is handling the m_BitSet variable in an unsafe manner. The problem is that this variable has not been initialized yet. Its "garbage" value will be used at the first loop iteration, and only then will it be initialized. This is a grave mistake, which could lead to undefined behavior.

V603 The object was created but it is not being used. If you wish to call constructor, 'this->SIntervalComparisonResult::SIntervalComparisonResult(....)' should be used. compare_feats.hpp 100

//This struct keeps the result of comparison of two exons
struct SIntervalComparisonResult : CObject
{
public:
  SIntervalComparisonResult(unsigned pos1, unsigned pos2,
                            FCompareLocs result, int pos_comparison = 0) 
  : m_exon_ordinal1(pos1), m_exon_ordinal2(pos2),
    m_result(result), m_position_comparison(pos_comparison) {}
    
  SIntervalComparisonResult()
  {
    SIntervalComparisonResult(0, 0, fCmp_Unknown, 0);
  }
  ....
};

I haven't seen errors like this for quite a while, but the problem still persists. The point here is that calling a parameterized constructor in a way like that leads to creating and deleting a temporary object while leaving the class fields uninitialized. The call to the other constructor should be done using the initializer list (see Delegating constructor).

V591 Non-void function should return a value. bio_tree.hpp 266

/// Recursive assignment
CBioNode& operator=(const CBioNode& tree)
{
  TParent::operator=(tree);
  TBioTree* pt = (TBioTree*)tree.GetParentTree();
  SetParentTree(pt);
}

The analyzer says the overloaded operator lacks this single line:

return *this;

V670 The uninitialized class member 'm_OutBlobIdOrData' is used to initialize the 'm_StdOut' member. Remember that members are initialized in the order of their declarations inside a class. remote_app.hpp 215

class NCBI_XCONNECT_EXPORT CRemoteAppResult
{
public:
  CRemoteAppResult(CNetCacheAPI::TInstance netcache_api,
          size_t max_inline_size = kMaxBlobInlineSize) :
      m_NetCacheAPI(netcache_api),
      m_RetCode(-1),
      m_StdOut(netcache_api, m_OutBlobIdOrData, m_OutBlobSize),
      m_OutBlobSize(0),
      m_StdErr(netcache_api, m_ErrBlobIdOrData, m_ErrBlobSize),
      m_ErrBlobSize(0),
      m_StorageType(eBlobStorage),
      m_MaxInlineSize(max_inline_size)
  {
  }
  ....
};

This snippet triggers 3 warnings at once. The order in which the class fields are initialized is the same order in which they are declared rather than the order in which they were added to the initializer list. This error typically occurs because not all programmers know or remember about this rule. And it's the initializer list here that has the wrong order, which looks as if it were random order.

V746 Object slicing. An exception should be caught by reference rather than by value. cobalt.cpp 247

void 
CMultiAligner::SetQueries(const vector< CRef<objects::CBioseq> >& queries)
{
  ....
  try {
    seq_loc->SetId(*it->GetSeqId());
  }
  catch (objects::CObjMgrException e) {
    NCBI_THROW(CMultiAlignerException, eInvalidInput,
               (string)"Missing seq-id in bioseq. " + e.GetMsg());
  }
  m_tQueries.push_back(seq_loc);
  ....
}

When catching exceptions by value, some of the information about the exception may be lost since a new object is created. A much better and safer practice is to catch exceptions by reference.

Other similar cases:

V746 Object slicing. An exception should be caught by reference rather than by value. agp_validate_reader.cpp 562
V746 Object slicing. An exception should be caught by reference rather than by value. aln_build_app.cpp 320
V746 Object slicing. An exception should be caught by reference rather than by value. aln_test_app.cpp 458
V746 Object slicing. An exception should be caught by reference rather than by value. cobalt.cpp 691
V746 Object slicing. An exception should be caught by reference rather than by value. cobalt.cpp 719
V746 Object slicing. An exception should be caught by reference rather than by value. cobalt.cpp 728
V746 Object slicing. An exception should be caught by reference rather than by value. cobalt.cpp 732

Unreachable code and other problems with code execution

V779 Unreachable code detected. It is possible that an error is present. merge_tree_core.cpp 627

bool CMergeTree::x_FindBefores_Up_Iter(....)
{
    ....
    FirstFrame->Curr = StartCurr;
    FirstFrame->Returned = false;
    FirstFrame->VisitCount = 0;
    FrameStack.push_back(FirstFrame);

    while(!FrameStack.empty()) {

        ....
        if(Rel == CEquivRange::eAfter) {
            Frame->Returned = false;
            FrameStack.pop_back();
            continue;
        } 
        else if(Rel == CEquivRange::eBefore) {
            ....
            continue;
        }
        else {
            if(Frame->VisitCount == 0) {
                ....
                continue;
            } else {
                ....
                continue;
            }
        }
        
        Frame->Returned = false; // <=
        FrameStack.pop_back();
        continue;
    }  // end stack loop
    
    FirstFrame->ChildFrames.clear();
    return FirstFrame->Returned;
}

The conditional operator is written in such a way that absolutely all of its branches end with a continue statement. This renders some of the lines in the while loop unreachable. And those lines do look strange. The problem must have occurred after refactoring and now calls for careful code review.

A few more cases:

V779 Unreachable code detected. It is possible that an error is present. dbapi_driver_utils.cpp 351
V779 Unreachable code detected. It is possible that an error is present. net.c 780
V779 Unreachable code detected. It is possible that an error is present. bcp.c 1495
V779 Unreachable code detected. It is possible that an error is present. remote_blast.cpp 1470
V779 Unreachable code detected. It is possible that an error is present. remote_blast.cpp 1522

V519 The 'interval_width' variable is assigned values twice successively. Perhaps this is a mistake. Check lines: 454, 456. aln_writer.cpp 456

void CAlnWriter::AddGaps(....)
{
  ....
  switch(exon_chunk->Which()) {
  case CSpliced_exon_chunk::e_Match:
      interval_width = exon_chunk->GetMatch();
  case CSpliced_exon_chunk::e_Mismatch:
      interval_width = exon_chunk->GetMismatch();
  case CSpliced_exon_chunk::e_Diag:
      interval_width = exon_chunk->GetDiag();
      genomic_string.append(....);
      product_string.append(....);
      genomic_pos += interval_width;
      product_pos += interval_width/res_width;
      break;
    ....
  }
  ....
}

The interval_width variable is overwritten several times as the case branches lack break statements. Though classic, it's still a bad bug to have in one's code.

V571 Recurring check. The 'if (m_QueryOpts->filtering_options)' condition was already verified in line 703. blast_options_local_priv.hpp 713

inline void
CBlastOptionsLocal::SetFilterString(const char* f)
{
  ....
  if (m_QueryOpts->filtering_options)      // <=
  {
    SBlastFilterOptions* old_opts = m_QueryOpts->filtering_options;
    m_QueryOpts->filtering_options = NULL;
    SBlastFilterOptionsMerge(&(m_QueryOpts->filtering_options),
      old_opts, new_opts);
    old_opts = SBlastFilterOptionsFree(old_opts);
    new_opts = SBlastFilterOptionsFree(new_opts);
  } 
  else
  {
     if (m_QueryOpts->filtering_options)   // <=
         m_QueryOpts->filtering_options = 
             SBlastFilterOptionsFree(m_QueryOpts->filtering_options);
     m_QueryOpts->filtering_options = new_opts;
     new_opts = NULL;
  }
  ....
}

The else branch obviously needs revising. I've got a couple ideas as to what the authors might have intended to do with the m_QueryOpts->filtering_options pointer, but the code is still pretty obscure. Please, guys, do make it clearer!

Bad luck comes in threes, you know:

V571 Recurring check. The 'if (sleeptime)' condition was already verified in line 205. request_control.cpp 208
V571 Recurring check. The 'if (assignValue.empty())' condition was already verified in line 712. classstr.cpp 718

Data read errors

V739 EOF should not be compared with a value of the 'char' type. The 'linestring[0]' should be of the 'int' type. alnread.c 3509

static EBool
s_AfrpInitLineData(
  ....
  char* linestring = readfunc (pfile);
  ....
  while (linestring != NULL  &&  linestring [0] != EOF) {
    s_TrimSpace (&linestring);
    ....
  }
  ....
}

Characters to be tested against EOF must not be stored in variables of type char; otherwise, there's a risk that the character with the value 0xFF (255) will turn into -1 and be interpreted as end-of-file. The implementation of the readfunc function should also be checked (just in case).

V663 Infinite loop is possible. The 'cin.eof()' condition is insufficient to break from the loop. Consider adding the 'cin.fail()' function call to the conditional expression. ncbicgi.cpp 1564

typedef std::istream CNcbiIstream;
void CCgiRequest::Serialize(CNcbiOstream& os) const
{
  ....
  CNcbiIstream* istrm = GetInputStream();
  if (istrm) {
    char buf[1024];
    while(!istrm->eof()) {
      istrm->read(buf, sizeof(buf));
      os.write(buf, istrm->gcount());
    }
  }
}

The analyzer has detected a potential error that could leave you running over an infinite loop. If the data can't be read, a call to the eof() function will be returning false all the time. To guarantee that the loop will terminate in this case, you need to additionally check the value returned by fail().

Miscellaneous

V502 Perhaps the '?:' operator works in a different way than it was expected. The '?:' operator has a lower priority than the '&&' operator. ncbi_connutil.c 1135

static const char* x_ClientAddress(const char* client_host,
                                   int/*bool*/ local_host)
{
  ....
  if ((client_host == c  &&  x_IsSufficientAddress(client_host))
      ||  !(ip = *c  &&  !local_host
            ? SOCK_gethostbyname(c)
            : SOCK_GetLocalHostAddress(eDefault))
      ||  SOCK_ntoa(ip, addr, sizeof(addr)) != 0
      ||  !(s = (char*) malloc(strlen(client_host) + strlen(addr) + 3))) {
      return client_host/*least we can do :-/*/;
  }
  ....
}

Note the expression:

!local_host ? SOCK_gethostbyname(c) : SOCK_GetLocalHostAddress(eDefault)

It won't be evaluated the way the programmer expected because the entire expression looks like this:

ip = *c  && !local_host ? SOCK_gethostbyname(c) : SOCK_GetLocalHostAddress(...)

The precedence of the && operator is higher than that of ?:. Because of that, the code executes differently from what was intended.

V561 It's probably better to assign value to 'seq' variable than to declare it anew. Previous declaration: validator.cpp, line 490. validator.cpp 492

bool CValidator::IsSeqLocCorrectlyOrdered(const CSeq_loc& loc, CScope& scope)
{
  CBioseq_Handle seq;
  try {
    CBioseq_Handle seq = scope.GetBioseqHandle(loc);
  } catch (CObjMgrException& ) {
    // no way to tell
    return true;
  } catch (const exception& ) {
    // no way to tell
    return true;
  }
  if (seq  &&  seq.GetInst_Topology() == CSeq_inst::eTopology_circular) {
    // no way to check if topology is circular
    return true;
  }

  return CheckConsecutiveIntervals(loc, scope, x_IsCorrectlyOrdered);
}

Because the programmer declared a new variable seq inside the try/catch section, the other seq variable will remain uninitialized and be used further in the code.

V562 It's odd to compare a bool type value with a value of 0: (((status) & 0x7f) == 0) != 0. ncbi_process.cpp 111

bool CProcess::CExitInfo::IsExited(void) const
{
    EXIT_INFO_CHECK;
    if (state != eExitInfo_Terminated) {
        return false;
    }
#if   defined(NCBI_OS_UNIX)
    return WIFEXITED(status) != 0;
#elif defined(NCBI_OS_MSWIN)
    // The process always terminates with exit code
    return true;
#endif
}

It seemed nothing could go wrong, but WIFEXITED turned out to be a macro expanding into the following:

return (((status) & 0x7f) == 0) != 0;

It turns out the function returns the opposite value.

There was one more function like that:

V562 It's odd to compare a bool type value with a value of 0. ncbi_process.cpp 126

V595 The 'dst_len' pointer was utilized before it was verified against nullptr. Check lines: 309, 315. zlib.cpp 309

bool CZipCompression::CompressBuffer(
  const void* src_buf, size_t  src_len,
  void*       dst_buf, size_t  dst_size,
  /* out */   size_t* dst_len)
{
  *dst_len = 0;

  // Check parameters
  if (!src_len  &&  !F_ISSET(fAllowEmptyData)) {
    src_buf = NULL;
  }
  if (!src_buf || !dst_buf || !dst_len) {
    SetError(Z_STREAM_ERROR, "bad argument");
    ERR_COMPRESS(48, FormatErrorMessage("CZipCompression::CompressBuffer"));
    return false;
  }
  ....
}

The dst_len pointer is dereferenced at the very beginning of the function and is later checked for null. This error will cause undefined behavior if dst_len is found to be equal to nullptr.

V590 Consider inspecting the 'ch != '\0' && ch == ' '' expression. The expression is excessive or contains a misprint. cleanup_utils.cpp 580

bool Asn2gnbkCompressSpaces(string& val)
{
  ....
  while (ch != '\0' && ch == ' ') {
    ptr++;
    ch = *ptr;
  }
  ....
}

The loop termination condition depends only on whether or not ch is a space character. It means the expression can be simplified as follows:

while (ch == ' ') {
  ....
}

Conclusion

Scientific software is already helping us make new discoveries and will continue to do so. So let's hope we won't miss the most important ones just because of some trivial typo.

I encourage the developers of the NCBI Genome Workbench project to contact us so that we could share with them the full analysis report by PVS-Studio.

I hope this small research of ours will help fix a lot of bugs and make the project more reliable. Don't hesitate to try PVS-Studio with your own projects if you haven't done so yet. You'll probably like it :).

#Cpp