About two months ago I wrote an article about the analysis of GCC using PVS-Studio. The idea of the article was as follows: GCC warnings are great, but they're not enough. It is necessary to use specialized tools for code analysis, for example, PVS-Studio. As proof of my words I showed errors that PVS-Studio was able to find the GCC code. A number of readers have noticed that the quality of the GCC code, and its diagnosis, aren't really great; while Clang compiler is up to date, of high quality, and fresh. In general Clang is awesome! Well, apparently, it's time to check LLVM project with the help of PVS-Studio.
I think there are few who do not know what LLVM is. Nevertheless, I'll keep the tradition of giving a short description of the project that has been tested.
LLVM (Low Level Virtual Machine) - a universal system of analysis, transformation and optimization of programs, implementing a virtual machine with RISC-based instructions. It can be used as an optimizing compiler of byte code into machine code for various architectures, or for its interpretation and JIT-compilation (for some platforms). Within the scope of LLVM project, the developers made the Clang front-end for C,C++, and Objective-C, translating the source code into byte code LLVM and allowing the use of LLVM as a full-fledged compiler.
Official site: http://llvm.org/
We checked the revision 282481. The code was checked with a PVS-Studio version, working under Linux. Since PVS-Studio for Linux is a new product, I'll give more details about the process of analysis. I'm sure this will demonstrate that it's really not hard to use our analyzer on Linux, and that you should try it out on your project without hesitation.
The Linux version of the analyzer is available for downloading on this page: http://www.viva64.com/en/pvs-studio-download-linux/
The previous projects were checked with a universal mechanism that tracks the compiler runs. This time we will use the information that PVS-Studio takes from the JSON Database Compilation for the analysis. Details can be found in the section "How to run PVS-Studio on Linux".
In LLVM 3.9 we have completely ceased using autoconf in favor of Cmake, and it was a good reason to try the support for JSON Compilation Database. What is it? This is a format used by the Clang utilities. It stores a list of compiler calls in the following way:
[
{
"directory": "/home/user/llvm/build",
"command": "/usr/bin/c++ .... file.cc",
"file": "file.cc"
},
....
]
It's very simple to get such a file for CMake-projects - you simply generate the project with an additional option:
cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=On ../llvm
After that there will be compile_commands.json in the current directory. It is this file that we need. Let's build the project first, because some projects use code generation.
make -j8
Now everything is ready for analysis. It starts with a single line:
pvs-studio-analyzer analyze -l ~/PVS-Studio.lic -o PVS-Studio.log -j
You can get compile_commands.json with the help of the Bear utility, for projects that aren't using CMake. But for complex assembly systems actively using environment variables or cross-compilation, the commands don't always provide detailed information about the translation unit.
Note N1. How to work with the report of PVS-Studio in Linux.
Note N2. We provide high-quality and quick support for our clients and potential users. So, if something is unclear or does not work, please contact us at support. You will like our service.
By the way, this is not the first check of LLVM. The article was inspired by previous checks:
Unfortunately, I can't say anything about the number of false positives or the density of errors found. The project is large, there are a lot of warnings, and I had quite a quick look at them. As an excuse I can say that the preparation for the Linux version of PVS-Studio took a lot of time, so I could not work on the article alone.
Enough talking, let's move on to the most interesting material. Let's take a look at the suspicious fragments in the LLVM code which PVS-Studio detected.
So we have such enumeration in the code:
enum Type {
ST_Unknown, // Type not specified
ST_Data,
ST_Debug,
ST_File,
ST_Function,
ST_Other
};
This is a "classical enumeration", if we can say so. Each name in the enumeration is assigned with an integer value that corresponds to a specific place in the order of the values in the enumeration:
Again, let me emphasize that this is just enumeration, not a set of masks. If the constants could be combined, they would be a power of 2.
Now it's time to look at the code, where this enumeration is used incorrectly:
void MachODebugMapParser::loadMainBinarySymbols(....)
{
....
SymbolRef::Type Type = *TypeOrErr;
if ((Type & SymbolRef::ST_Debug) ||
(Type & SymbolRef::ST_Unknown))
continue;
....
}
PVS-Studio warning: V616 The 'SymbolRef::ST_Unknown' named constant with the value of 0 is used in the bitwise operation. MachODebugMapParser.cpp 448
Let's recall from memory, that the ST_Unknown constant is zero. Therefore, you can shorten the expression:
if (Type & SymbolRef::ST_Debug)
Clearly something is wrong here. Apparently the programmer who wrote this code decided that he is working with enumeration consisting of flags. That is, he expected that one or another bit matches every constant. But it is not so. I think the correct check should be like this:
if ((Type == SymbolRef::ST_Debug) || (Type == SymbolRef::ST_Unknown))
I think, enum class should have been used here to avoid such errors. In this case, an incorrect expression simply would not be compiled.
The function is not a very complicated one, so I decided to cite it entirely. Before continuing to read the article, I suggest you try to guess what is suspicious here.
Parser::TPResult Parser::TryParseProtocolQualifiers() {
assert(Tok.is(tok::less) && "Expected '<' for qualifier list");
ConsumeToken();
do {
if (Tok.isNot(tok::identifier))
return TPResult::Error;
ConsumeToken();
if (Tok.is(tok::comma)) {
ConsumeToken();
continue;
}
if (Tok.is(tok::greater)) {
ConsumeToken();
return TPResult::Ambiguous;
}
} while (false);
return TPResult::Error;
}
PVS-Studio warning: V696 The 'continue' operator will terminate 'do { ... } while (FALSE)' loop because the condition is always false. Check lines: 1642, 1649. ParseTentative.cpp 1642
LLVM developers, of course, will be able to understand if there is a bug here or not. I have to play detective. Looking at the code, I was thinking in the following direction: The function should read the opening bracket '<', then it reads the identifiers and commas in the loop. If there is no comma, we expected a closing bracket. If something goes wrong, then the function returns an error code. I reckon there was supposed to be the following algorithm of the function work (pseudocode):
The trouble is that the programmer tries to resume the loop with the help of the continue operator. It passes the control not to the beginning of the loop body, but to the check of the condition of the loop continuation. And the condition is always false. As a result, the loop exits, and the algorithm becomes the following:
Thus, only the sequence from one element enclosed in square brackets can be correct. If there is more than one item in the sequence, separated by a comma, then the function will return an error status:TPResult::Error.
Now let's consider another case, when not more than one loop iteration is executed:
static bool checkMachOAndArchFlags(....) {
....
unsigned i;
for (i = 0; i < ArchFlags.size(); ++i) {
if (ArchFlags[i] == T.getArchName())
ArchFound = true;
break;
}
....
}
PVS-Studio warning: V612 An unconditional 'break' within a loop. MachODump.cpp 1206
Pay attention to the break statement. It will break the loop after the first iteration. I think the break statement must refer to a condition, so the correct code will look like this:
for (i = 0; i < ArchFlags.size(); ++i) {
if (ArchFlags[i] == T.getArchName())
{
ArchFound = true;
break;
}
}
There are two more similar fragments, but in order not to make the article too long, I'll only copy the analyzer warnings here:
static bool containsNoDependence(CharMatrix &DepMatrix,
unsigned Row,
unsigned Column) {
for (unsigned i = 0; i < Column; ++i) {
if (DepMatrix[Row][i] != '=' || DepMatrix[Row][i] != 'S' ||
DepMatrix[Row][i] != 'I')
return false;
}
return true;
}
PVS-Studio warning: V547 Expression is always true. Probably the '&&' operator should be used here. LoopInterchange.cpp 208
The expression makes no sense. I'll simplify the code to highlight the essence of the error:
if (X != '=' || X != 'S' || X != 'I')
The variable X will never be equal to something. As a result, the condition is always true. Most likely, instead of the "||" operators, the "&&" should have been used, then the expression would make sense.
SingleLinkedListIterator<T> &operator++(int) {
SingleLinkedListIterator res = *this;
++*this;
return res;
}
PVS-Studio warning: V558 Function returns the reference to temporary local object: res. LiveInterval.h 679
The function is a traditional implementation of a postfix increment:
The error is that the function returns a reference. This reference is not valid, because the temporary object res gets destroyed when the function is exited.
To fix this, you need to return a value, rather than a reference:
SingleLinkedListIterator<T> operator++(int) { .... }
I'll copy the entire function, so as to show that before the repeating assignment the variable ZeroDirective isn't used in any way.
HexagonMCAsmInfo::HexagonMCAsmInfo(const Triple &TT) {
Data16bitsDirective = "\t.half\t";
Data32bitsDirective = "\t.word\t";
Data64bitsDirective = nullptr;
ZeroDirective = "\t.skip\t"; // <=
CommentString = "//";
LCOMMDirectiveAlignmentType = LCOMM::ByteAlignment;
InlineAsmStart = "# InlineAsm Start";
InlineAsmEnd = "# InlineAsm End";
ZeroDirective = "\t.space\t"; // <=
AscizDirective = "\t.string\t";
SupportsDebugInformation = true;
MinInstAlignment = 4;
UsesELFSectionDirectiveForBSS = true;
ExceptionsType = ExceptionHandling::DwarfCFI;
}
PVS-Studio warning: V519 The 'ZeroDirective' variable is assigned values twice successively. Perhaps this is a mistake. Check lines: 25, 31. HexagonMCAsmInfo.cpp 31
The variable ZeroDirective is a simple pointer of const char * type. In the beginning it points to a string "\t.skip\t", but further on it is assigned with a line address "\t.space\t". It's weird, and doesn't make sense. There is a high probability that one of the assignments should change a completely different variable.
Let's look at another case of repetitive assignment.
template <class ELFT>
void GNUStyle<ELFT>::printFileHeaders(const ELFO *Obj) {
....
Str = printEnum(e->e_ident[ELF::EI_OSABI], makeArrayRef(ElfOSABI));
printFields(OS, "OS/ABI:", Str);
Str = "0x" + to_hexString(e->e_version); // <=
Str = to_hexString(e->e_ident[ELF::EI_ABIVERSION]); // <=
printFields(OS, "ABI Version:", Str);
Str = printEnum(e->e_type, makeArrayRef(ElfObjectFileType));
printFields(OS, "Type:", Str);
....
}
PVS-Studio warning: V519 The 'Str' variable is assigned values twice successively. Perhaps this is a mistake. Check lines: 2407, 2408. ELFDumper.cpp 2408
Apparently we are dealing with a typo. Instead of doing the reassignment, the programmer had to link two lines with the help of += operator. Then the correct code could be like this:
Str = "0x" + to_hexString(e->e_version);
Str += to_hexString(e->e_ident[ELF::EI_ABIVERSION]);
There are several more code fragments with the repeated assignment. In my opinion, these repetitive assignments do not pose any danger, so I'll just copy the warnings as a list:
Expected<std::unique_ptr<PDBFile>>
PDBFileBuilder::build(
std::unique_ptr<msf::WritableStream> PdbFileBuffer)
{
....
auto File = llvm::make_unique<PDBFile>(
std::move(PdbFileBuffer), Allocator);
File->ContainerLayout = *ExpectedLayout;
if (Info) {
auto ExpectedInfo = Info->build(*File, *PdbFileBuffer);
....
}
PVS-Studio warning: V522 Dereferencing of the null pointer 'PdbFileBuffer' might take place. PDBFileBuilder.cpp 106
The code is not clear to me, as I have not studied what llvm::make_unique is, and how it works in general. Nevertheless, both myself and the analyzer are confused by the fact that at first glance the possession of an object from a smart pointer PdbFileBuffer goes to File. After that we have dereferencing of a null pointer PdbFileBuffer that already contains nullptr. Specifically, this fragment looks strange:
.... llvm::make_unique<PDBFile>(::move(PdbFileBuffer), Allocator);
....
.... Info->build(*File, *PdbFileBuffer);
If this is a bug, then it should be fixed in 3 more fragments in the same file:
static bool areExclusiveRanges(BinaryOperatorKind OpcodeLHS,
const APSInt &ValueLHS,
BinaryOperatorKind OpcodeRHS,
const APSInt &ValueRHS) {
....
// Handle cases where the constants are different.
if ((OpcodeLHS == BO_EQ ||
OpcodeLHS == BO_LE || // <=
OpcodeLHS == BO_LE) // <=
&&
(OpcodeRHS == BO_EQ ||
OpcodeRHS == BO_GT ||
OpcodeRHS == BO_GE))
return true;
....
}
PVS-Studio warning: V501 There are identical sub-expressions 'OpcodeLHS == BO_LE' to the left and to the right of the '||' operator. RedundantExpressionCheck.cpp 174
This is a classic typo. The variable OpcodeLHS is compared with the BO_LE constant twice. It seems to me that one of the BO_LE constants should be replaced by BO_LT. As you can see the names of the constants are very similar and can be easily confused.
The following example demonstrates how static analysis complements other methodologies of writing high quality code. Let's inspect the incorrect code:
std::pair<Function *, Function *>
llvm::createSanitizerCtorAndInitFunctions(
....
ArrayRef<Type *> InitArgTypes, ArrayRef<Value *> InitArgs,
....)
{
assert(!InitName.empty() && "Expected init function name");
assert(InitArgTypes.size() == InitArgTypes.size() &&
"Sanitizer's init function expects "
"different number of arguments");
....
}
PVS-Studio warning: V501 There are identical sub-expressions 'InitArgTypes.size()' to the left and to the right of the '==' operator. ModuleUtils.cpp 107
One of the many good ways to improve code safety, is to use assert() macros. This macro, and ones similar to it, help to detect various errors at the developmental stage, and during debugging. But I won't go into detail here about the benefits of such macros, as it is beyond the scope of this article.
It is important to us that the assert() macros are used in the function createSanitizerCtorAndInitFunctions() to check the correctness of the input data. Too bad the second assert() macro is useless, because of a typo.
Fortunately, the static analyzer is of great help here, as it notices that the array size is compared with itself. As a result we can fix this check, and the correct condition in assert() may help to prevent some other error in the future.
Apparently, in the condition the array sizes InitArgTypes and InitArgs should be compared:
assert(InitArgTypes.size() == InitArgs.size() &&
"Sanitizer's init function expects "
"different number of arguments");
In the std::unique_ptr class there are two functions with similar names: release and reset. My observations show, they are sometimes confused. Apparently this is what happened here:
std::unique_ptr<DiagnosticConsumer> takeClient()
{ return std::move(Owner); }
VerifyDiagnosticConsumer::~VerifyDiagnosticConsumer() {
....
SrcManager = nullptr;
CheckDiagnostics();
Diags.takeClient().release();
}
PVS-Studio warning: V530 The return value of function 'release' is required to be utilized. VerifyDiagnosticConsumer.cpp 46
Perhaps there is no error here, and the programmer used some tricky logic. But it looks more like a resource leak. In any case, the developers should take one more look at this code fragment.
bool ARMDAGToDAGISel::tryT1IndexedLoad(SDNode *N) {
LoadSDNode *LD = cast<LoadSDNode>(N);
EVT LoadedVT = LD->getMemoryVT();
ISD::MemIndexedMode AM = LD->getAddressingMode();
if (AM == ISD::UNINDEXED ||
LD->getExtensionType() != ISD::NON_EXTLOAD ||
AM != ISD::POST_INC ||
LoadedVT.getSimpleVT().SimpleTy != MVT::i32)
return false;
....
}
PVS-Studio warning: V590 Consider inspecting this expression. The expression is excessive or contains a misprint. ARMISelDAGToDAG.cpp 1565
The condition is long, so I'll highlight the most important part:
AM == ISD::UNINDEXED || AM != ISD::POST_INC
This condition is redundant, and you can simplify it to:
AM != ISD::POST_INC
So, we see here redundancy in the condition, or some error. There is a chance, that the redundancy shows that some other condition was meant here. I cannot judge how dangerous this is, but it is certainly worth reviewing. Also, I want to draw the developers attention to two more analyzer warnings:
Pointers in C and C++ - an endless headache for programmers. You verify them against null, and then somewhere there is null pointer dereference again! The V595 diagnostic detects situations where the verification against null is done too late. Before this check the pointer already gets used. This is one of the most typical errors that we find in the code of various applications (proof). However, speaking in support of C/C++, I will say that the situation in C# is not much better. Despite the fact that the C# pointers are now called references, such bugs haven't disappeared (proof).
Let's go back to the LLVM code and look at a simple variant of the bug:
bool PPCDarwinAsmPrinter::doFinalization(Module &M) {
....
MachineModuleInfoMachO &MMIMacho =
MMI->getObjFileInfo<MachineModuleInfoMachO>();
if (MAI->doesSupportExceptionHandling() && MMI) {
....
}
PVS-Studio warning: V595 The 'MMI' pointer was utilized before it was verified against nullptr. Check lines: 1357, 1359. PPCAsmPrinter.cpp 1357
The case is simple, and everything is quite obvious. The check (... && MMI) tells us that the pointer MMI can be null. If so, the program won't get to this check during execution. It will be terminated earlier because of the null pointer dereference.
Let's look at one more code fragment:
void Sema::CodeCompleteObjCProtocolReferences(
ArrayRef<IdentifierLocPair> Protocols)
{
ResultBuilder
Results(*this, CodeCompleter->getAllocator(),
CodeCompleter->getCodeCompletionTUInfo(),
CodeCompletionContext::CCC_ObjCProtocolName);
if (CodeCompleter && CodeCompleter->includeGlobals()) {
Results.EnterNewScope();
....
}
PVS-Studio warning: V595 The 'CodeCompleter' pointer was utilized before it was verified against nullptr. Check lines: 5952, 5955. SemaCodeComplete.cpp 5952
The pointer CodeCompleter gets dereferenced first, and further on there is verification of the pointer against null. The same code was detected three more times in the same file:
These were simple cases, but sometimes the code is more complex, where it's hard to say how dangerous it is. So my suggestion to the developers, is to check the following fragments of the LLVM code:
I apologize that I cite here a hard to read code fragment. A little more patience, please, the article is almost at an end.
static bool print_class_ro64_t(....) {
....
const char *r;
uint32_t offset, xoffset, left;
....
r = get_pointer_64(p, offset, left, S, info);
if (r == nullptr || left < sizeof(struct class_ro64_t))
return false;
memset(&cro, '\0', sizeof(struct class_ro64_t));
if (left < sizeof(struct class_ro64_t)) {
memcpy(&cro, r, left);
outs() << " (class_ro_t entends past the .......)\n";
} else
memcpy(&cro, r, sizeof(struct class_ro64_t));
....
}
PVS-Studio warning: V649 There are two 'if' statements with identical conditional expressions. The first 'if' statement contains function return. This means that the second 'if' statement is senseless. Check lines: 4410, 4413. MachODump.cpp 4413
Pay attention to the check:
if (.... || left < sizeof(struct class_ro64_t))
return false;
If the value in the left variable is less than the class size, then the function will exit. It turns out that this choice of behavior does not make sense:
if (left < sizeof(struct class_ro64_t)) {
memcpy(&cro, r, left);
outs() << " (class_ro_t entends past the .......)\n";
} else
memcpy(&cro, r, sizeof(struct class_ro64_t));
The condition is always false, and therefore the else-branch always executes. This is very strange. Perhaps the program contains a logical error, or, we are dealing with a typo.
This place also needs some revision:
A class SequenceNumberManager is declared inside a template class RPC. It has such a move assignment operator:
SequenceNumberManager &operator=(SequenceNumberManager &&Other) {
NextSequenceNumber = std::move(Other.NextSequenceNumber);
FreeSequenceNumbers = std::move(Other.FreeSequenceNumbers);
}
PVS-Studio warning: V591 Non-void function should return a value. RPCUtils.h 719
As you can see return was forgotten in the end:
return *this;
Actually there is nothing terrible here. Compilers generally do not work with bodies of functions of template classes in any way, if these functions are not used. Apparently, we have this case here. Although I have not tested it, I am quite sure: if you call this move operator, the compiler will generate an error, or yield a warning. So, there is nothing wrong here, but I decided to point out this flaw.
There were several strange code fragments, where the value of the pointer returned by the new operator is verified against null. This code does not make sense, because if you are not able to allocate the memory, the exception std::bad_alloc will be thrown. Here's one such place:
LLVMDisasmContextRef LLVMCreateDisasmCPUFeatures(....) {
....
// Set up the MCContext for creating symbols and MCExpr's.
MCContext *Ctx = new MCContext(MAI, MRI, nullptr);
if (!Ctx)
return nullptr;
....
}
PVS-Studio warning: V668 There is no sense in testing the 'Ctx' pointer against null, as the memory was allocated using the 'new' operator. The exception will be generated in the case of memory allocation error. Disassembler.cpp 76
Two more warnings:
These code fragments don't look dangerous, so I decided to describe them in the section on unimportant warnings. Most likely, all three of these checks can simply be removed.
As you can see, the compiler warnings are good, but they are not enough. Specialized tools for static analysis, such as PVS-Studio will always outpace the compilers in diagnostic capabilities and configuration flexibility working with false positives. That's actually how the analyzer developers make money.
It's also important to note that the main effect from static analysis, will only be achieved with regular use of static code analyzers. A lot of errors will be detected at the earliest stage, so there won't be need to debug, or ask users to give a detailed description of the actions that led to the program crash. In the static analysis we have the warnings similar to the warnings of a compiler (actually, they are almost the same, but more intelligent). I think that everybody always checks the compiler warnings, not just one a month?!
I suggest downloading and trying out PVS-Studio on your project's code.