Andrey Karpov , Dmitry Sviridkin

Jun 07 2024

Tags:

#Cpp #Knowledge

C++ programmer's guide to undefined behavior: part 1 of 11


Introduction
- Author's note. Briefly about why and what for
- Editor's note
What is undefined behavior and what it leads to
- Some useful links
How do we look for undefined behavior?
- Useful links
Narrowing conversions and implicit type conversion
All chapters

This is the first part of an e-book on undefined behavior. It isn't a textbook: it's intended for those who are already familiar with C++ programming. Think of it as a C++ programmer's guide to undefined behavior and to its most secret and exotic corners. The book was written by Dmitry Sviridkin and edited by Andrey Karpov.

Introduction

Panic!

Author's note. Briefly about why and what for

The story starts simple and straightforward: an ordinary tenth grader becomes interested in programming and gets acquainted with algorithmic problems, the solutions to which must be fast. He finds out about C++ and learns minimal syntax, basic constructs, and containers. He solves problems with predefined, always correct input and output formats, and doesn't know any sorrow...

Meanwhile, somewhere in the big world, developers curse one programming language or another every day for various reasons: this one isn't user friendly, that one lacks some kind of feature, there are extra letters to write, here are bugs in the standard library... But there's one language that's criticized for all that, and especially for something as obscure and mysterious as undefined behavior (UB).

Five or six years later, our no longer tenth-grader, who has seen neither worries nor sorrows in the sea of programs detached from reality, suddenly learns that the most strongly disliked language has always been, still remains, and will be his C++.

Then, for several more years, he encounters the most nightmarish and unbelievable horrors that await C++ programmers at almost every turn. That's how this series of notes comes to be—a collection of the most disgusting examples you can easily stumble upon in everyday tasks.

"Premature optimization is the root of all evil" (D. E. Knuth or C. A. R. Hoare, depending on which source you consult).

The C++ language is perhaps the most vivid demonstration of this idea: a large number of errors in C++ programs stem from undefined behavior that was built into the foundation of the language just to open the gates for compile-time optimizations.

If you want to write code in C++ and be at least somewhat sure that it works correctly, you need to know about the various pitfalls and cleverly placed landmines in the language standard and its library, and avoid them in any way possible. Otherwise, your programs will work correctly only on a particular machine and only by chance.

Important: This collection is not a manual on the language. It targets those who are already familiar with programming and with the C++ language, and who understand its basic constructs.

Editor's note

I'm well familiar with the topic of undefined behavior. It permeates my own PVS-Studio project, where I'm one of the founders. PVS-Studio is a static code analysis tool that takes on the immense task of detecting this very undefined behavior. The analyzer has other tasks, such as searching for typos or unreachable code. However, UB is the largest and most inexhaustible source of issues in C++ programs and, therefore, of reasons to create new diagnostic rules to detect them.

So, when I found Dmitry Sviridkin's guide to UB on GitHub (ubbook), I was very excited to read it. I even wrote down some interesting thoughts for myself. They'll end up being the basis of new diagnostic rules. So, I both enjoyed and benefited from reading it.

Then, I started thinking: firstly, I also have something on the undefined behavior topic. Secondly, it would be nice to share such valuable and interesting material with as many programmers as possible. So, why not translate it into English? However, I didn't think about it for too long and decided to try to make it happen.

I contacted Dmitry with an offer to collaborate on the editing, completion, and translation of his material. He agreed, and we set to work on this e-book that we'll eventually try to turn into a printed one. You're welcome to see what we've done. Stock up on cookies and sharpen your attentiveness for an enjoyable and thoughtful reading experience.

What is undefined behavior and what it leads to

Undefined behavior, or UB, is an amazing peculiarity of some programming languages. It enables you to write syntactically correct code that works completely unpredictably when you port it from one platform to another, change compilation/interpretation options, or replace one compiler/interpreter with another. Most importantly, in addition to being syntactically correct, the code looks semantically correct.

The peculiarity is that the language specification intentionally doesn't define how the program behaves under certain conditions. This is done for performance reasons, since there's no need to generate additional instructions with checks, or for flexibility in implementing some features. The specification simply states, "If code does something wrong, then the behavior is undefined." For example:

if we dereference a null pointer, the behavior is undefined;
if we lock a non-recursive mutex twice in the same thread, the behavior is undefined;
if we divide an integer by zero, the behavior is undefined;
if we read uninitialized memory, the behavior is undefined;
and so on and so forth.

Note that "behavior is undefined" means that anything can happen: the disk gets formatted, the compilation fails, an exception is thrown, or maybe everything works fine. No guarantee is given. This is where all the hilarious, unexpected, and very sad consequences in production code come from.

Of course, C and C++ are most notorious for their undefined behavior. However, one needs to understand that it's also present in other languages. In many languages, you can find a rare special case with undefined behavior. However, in C and C++, it occurs when creating almost any program: too many language features have peculiarities that make undefined behavior possible.

So, what are the signs to look for in an application that might indicate UB? How much of undefined behavior is really undefined?

Back in the day, UB in code could indeed lead to anything. For example, GCC 1.17 started running games.

If you divide something by zero today, such a thing probably won't happen. However, trouble does come in many forms:

For a particular platform and compiler, the documentation tells exactly what will happen despite such scary words as "undefined behavior" in the standard. It will be fine. You know what you're doing. Nothing is undefined. Everything's cool.
UB in memory operations most often ends with a segmentation fault, and we get a nice SIGSEGV signal from the operating system. The application crashes.
The application runs and completes properly but gives different (or inadequate) results from run to run. The results also change from build to build if you change compiler options or the compiler itself. However, you didn't use any random number generators.
The application behaves incorrectly even though there are many checks, asserts, and try-catch blocks in the code. Each of them "confirms" that everything is correct. In the debugger, you see that calculations are correct, but suddenly everything breaks down.
The application executes code that is present in the binary but is never invoked. Functions that have never been called get executed.
The compiler refuses to build the code "for no reason" and without crashing. The linker gives "impossible and meaningless" errors.
The checks in the code don't work. Under the debugger, we can see that the execution flow doesn't get into the if or catch branches. However, according to the variable values, it should.
A sudden call to std::terminate for no apparent reason.
Infinite loops become finite and vice versa.

Undefined behavior is often confused with two other concepts.

Another scary UB acronym is unspecified behavior. The standard doesn't specify exactly what can happen but describes options. So, for example, the evaluation order of function arguments is unspecified behavior.
Implementation-defined behavior — you need to consult the documentation for your platform and compiler.

These two are much better than undefined behavior, though they have one thing in common: a program that relies on either of them is, in fact, unportable.
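To make the difference concrete, here's a minimal sketch. The helper names (traced, sum, argument_order, char_is_signed) are made up for illustration. The first function relies on unspecified behavior, the second one on implementation-defined behavior; neither contains undefined behavior.

```cpp
#include <cassert>
#include <climits>
#include <string>

// Hypothetical helper: appends a tag to the log and returns the value,
// so that we can observe the order in which the calls happen.
int traced(std::string& log, char tag, int value) {
    log += tag;
    return value;
}

int sum(int a, int b) { return a + b; }

// Unspecified behavior: the standard doesn't say which argument expression
// is evaluated first, only that both are evaluated before the call to sum.
std::string argument_order() {
    std::string log;
    const int result = sum(traced(log, 'a', 1), traced(log, 'b', 2));
    assert(result == 3);  // the value doesn't depend on the order
    (void)result;
    return log;           // "ab" or "ba", depending on the compiler
}

// Implementation-defined behavior: whether plain char is signed is chosen
// by the implementation and has to be documented by it.
bool char_is_signed() { return CHAR_MIN < 0; }
```

A program that prints argument_order() or branches on char_is_signed() is still a valid program, but its output may differ between compilers and platforms.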

There are also two classes of undefined behavior:

Library Undefined Behavior occurs when you've done something that a particular library (including the standard library, but not always) doesn't allow you to do. For example, to avoid undefined behavior, the gMock library doesn't allow you to reconfigure a mock object after you've started using it.
Language Undefined Behavior occurs when you've done something that a programming language specification doesn't define in its core. For example, it can be a null pointer dereference.

If you encounter the first one, you're in trouble. However, if everything works fine, there's a good chance it will continue to do so until you update the library or change platforms. Side effects can often occur only at a local level. It looks a lot like implementation-defined behavior.

If it's the second one, you're in serious trouble. Even with the slightest change, the code may suddenly stop working correctly. Moreover, users of your application may face serious security threats.

Some useful links

Stack Overflow. Undefined, unspecified and implementation-defined behavior.
Predrag Gruevski. Falsehoods programmers believe about undefined behavior.
John Regehr. A Guide to Undefined Behavior in C and C++.

How do we look for undefined behavior?

It's a question I'm asked very often. I've also asked it of myself and of others. Unfortunately, every C++ developer has to ask it.

The short answer is that there's no way. This is an algorithmically unsolvable problem, almost no different from the halting problem. However, programmers will keep solving unsolvable problems no matter how hard you try to stop them. So, for specific code and specific inputs, there are sometimes ways to get an answer.

We can check the code before compiling it using various static analyzers:

Cppcheck,
Clang Static Analyzer,
PVS-Studio,
etc.

A smart enough analyzer that works with the control-flow graph of the program and knows hundreds of standard language traps can find many issues and warn about suspicious code. However, not every analyzer can do that, and not in every case.

The Clang and GCC compilers with the -Wall and -Wpedantic flags enabled can find some errors.

For example, GCC issues a warning for the following code:

int arr[5] = {1,2,3,4,5};

int main() {
    int i = 5;
    return arr[i];
}

Here's the warning:

array subscript 5 is above array bounds of 'int [5]' [-Warray-bounds]
    6 |     return arr[i];
      |            ~~~~~^
note: while referencing 'arr'
    2 | int arr[5] = {1,2,3,4,5};

We can check some of the code at compile time using different sets of inputs and constexpr. In a context evaluated at compile time, UB is forbidden:

constexpr int my_div(int a, int b) {
    return a / b;
}

namespace test {
template <unsigned int N>
constexpr int div_test(const int (&A)[N], const int (&B)[N]) {
    int x = 0;
    for (auto i = 0u; i < N; ++i) {
        x = ::my_div(A[i], B[i]);
    }
    return x;
}

constexpr int A[] = {1,2,3,4,5};
constexpr int B[] = {1,2,3,4,0};
static_assert((div_test(A, A), true)); // OK
static_assert((div_test(A, B), true)); // Compilation error, zero division
} // namespace test

However, we can't use constexpr everywhere: depending on the version of the standard, it puts restrictions on the function body. A constexpr function is also implicitly inline, which effectively "forbids" moving the function definition to a separate translation unit (or, more simply, the definition has to be placed in a header file).

Finally, if we can't find errors using static analysis (external utilities or the compiler), we can resort to the help of dynamic analysis.

When building with the Clang or GCC compilers, we can enable sanitizers with the -fsanitize=undefined, -fsanitize=address, and -fsanitize=thread flags. They detect runtime errors, but at the cost of significant performance overhead. So, one should use such tools only at the testing and development stages.
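Here's a minimal sketch of the kind of code sanitizers are meant for. The element_at and demo_in_bounds functions are hypothetical, and the build command in the comment is just one possible combination of the flags mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: operator[] performs no bounds check,
// so an out-of-range index is undefined behavior.
int element_at(const std::vector<int>& v, std::size_t i) {
    return v[i];
}

// One possible way to build with sanitizers enabled:
//   g++ -g -O1 -fsanitize=address,undefined example.cpp
// The thread sanitizer (-fsanitize=thread) needs a separate build,
// because it can't be combined with -fsanitize=address.
int demo_in_bounds() {
    const std::vector<int> v{1, 2, 3};
    const std::size_t index = 2;
    assert(index < v.size());  // the index is valid, so there's no UB here
    return element_at(v, index);
    // element_at(v, 3) would read past the end of the buffer.
    // AddressSanitizer is designed to report such an access at run time
    // instead of letting the program silently continue.
}
```

Without the sanitizer, the out-of-range call would most likely return garbage and go unnoticed, which is exactly why such builds are worth running in tests.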

Also, for debug builds, standard library code is sometimes equipped with asserts. This is done, for example, for the various iterators of the standard library in the MSVC (Visual Studio) distribution.

Since undefined behavior can emerge due to the optimization features of different compilers, we need to build our code for different platforms with different optimization levels and compare its behavior. Error-free code should be portable, and it should always behave in the same way (unless, of course, its job is to generate completely random values).

Tests, various builds, static and dynamic analysis are the ways to increase your confidence that the code is UB-free. Only a group of experts who check every line of code against the standard and double-check each other three times can guarantee that. Even that may not be enough, though.

There's also a way to disable all optimizations by using compiler flags. There are also flags that permit various standard violations (the famous -fpermissive) and turn C++ into something completely different. However, I urge you to never tread that path. Your code will become unportable. Your code will no longer be C++ code. It's better to choose another programming language in such a case.

Useful links

GCC documentation. Options to Request or Suppress Warnings.
Clang documentation. AddressSanitizer.
Clang documentation. UndefinedBehaviorSanitizer.
Shafik Yaghmour. Exploring Undefined Behavior Using Constexpr.
MSVC documentation. /permissive- (Standards conformance).
Wikipedia. List of tools for static code analysis.

Narrowing conversions and implicit type conversion

Many modern programming languages, especially newer ones, forbid implicit type conversions.

So, in Rust, Haskell, or Kotlin, we can't just use float and int in the same arithmetic expression without explicitly stating in the code to convert one to the other. Python isn't as strict but still keeps strings, characters, and numbers from mixing.

C++ doesn't forbid implicit conversion, which leads to a lot of erroneous code. Moreover, such code can contain both defined (but unexpected) and undefined behavior.

Let's look at an example:

#include <vector>
#include <numeric>
#include <iostream>

int average(const std::vector<int>& v) {
    if (v.empty()) {
        return 0;
    }
    return std::accumulate(v.begin(), v.end(), 0) / v.size();
}

int main() {
    std::cout << average({-1,-1,-1});
}

Anyone who glances at this code would expect the result to be -1. Unfortunately, the result is different. A program built by GCC for the x86-64 platform displays the following:

1431655764

The code doesn't contain undefined behavior (at least not with this input). However, the implicit type conversion is there, making the result unexpected.

The third argument determines the return type of std::accumulate. In this case, it's the literal 0, which has the int type, the default type for integer literals.
The usual arithmetic conversions determine the result type of a division operation. In the example, the left operand type is int and the right operand type is size_t, a fairly wide unsigned integer. Wider than int. So, according to these rules, the int operand is converted to size_t, and the result is size_t.
-3 is implicitly converted to the size_t type; such a conversion is well-defined. The result is the unsigned number 2^N - 3, where N is the width of size_t in bits.
Next, the unsigned numbers are divided: (2^N - 3) / 3. The most significant bit of the result is zero.
The return type of the average function is declared as int. So, we need to perform another implicit conversion.
Generally speaking, the unsigned -> signed conversion of a value that doesn't fit into the target type is implementation-defined before C++20 (since C++20, the result is the value modulo 2^N, where N is the width of the target type).
- If the sizes of the int and size_t types are the same, then the positive number fits within the value range for the int type since the most significant bit is zero. The standard guarantees that there are no issues.
- If the sizes don't match, a narrowing conversion occurs, which is left to the implementation. So, instead of discarding the most significant bits that don't fit, as one would expect, an implementation may, for example, replace the value with std::numeric_limits<int>::max().
- For example, when building an application for a 64-bit platform with GCC, the narrowing conversion is defined as discarding the most significant bits, as expected. So, the final result is ((2^64 - 3) / 3) % 2^32.
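The arithmetic in the steps above can be checked at compile time, and the function can be repaired by keeping the whole computation signed. This is a minimal sketch, not the only possible fix; it assumes a 64-bit size_t and a 32-bit int, as on x86-64 Linux, and average_fixed is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Steps from the text, reproduced with fixed-width types.
constexpr std::uint64_t as_unsigned = static_cast<std::uint64_t>(-3); // 2^64 - 3
constexpr std::uint64_t quotient = as_unsigned / 3;                   // unsigned division
static_assert(quotient == 0x5555555555555554ULL,
              "most significant bit of the quotient is zero");
static_assert(quotient % (1ULL << 32) == 1431655764ULL,
              "the low 32 bits match the printed result");

// One possible fix: accumulate in a wider signed type and convert the size
// to a signed type explicitly, so that the division stays signed.
int average_fixed(const std::vector<int>& v) {
    if (v.empty()) {
        return 0;
    }
    const long long sum = std::accumulate(v.begin(), v.end(), 0LL);
    return static_cast<int>(sum / static_cast<long long>(v.size()));
}

void average_fixed_demo() {
    assert(average_fixed({-1, -1, -1}) == -1);  // the value everyone expected
}
```

Using 0LL as the initial value also makes std::accumulate sum in long long, which leaves more room before the sum itself overflows.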

Implicit type conversions apply not only to built-in primitives but also to more complex types. Worst of all, they interfere with the selection of an appropriate function overload, leading to various surprises that are often unpleasant.

Here's an example with abs:

#include <cmath>
#include <iostream>

int main() {
    std::cout << abs(3.5) << "\n"; // if this resolves to the C library
                                   // function int abs(int),
                                   // the result is 3
    std::cout << std::abs(3.5);    // the C++ library function
                                   // overloaded for double,
                                   // the result is 3.5
}

An even worse example is the std::string standard type:

#include <string>

int main() {
    std::string s;
    s += 48;    // implicit conversion to char.
    s += 1000;  // and there's a very unpleasant overflow
                // on a platform with signed char.
    s += 49.5;  // implicit conversion to char again
}

This monstrosity compiles!

It seems that this absolutely horrible usage example can never be found in normal code. Unfortunately, it can.

You can write your own generic version of std::accumulate with various checks of template arguments. Then, you may accidentally pass a string as the accumulator and a container of floating-point numbers as the range. And there won't be any compilation error. Just a weird bug in the program.

The application compiles, and the result looks like an empty string (each double is converted to the '\0' character, so the string actually holds three null characters):

#include <string>
#include <vector>
#include <iostream>

template <class Range, class Acc>
auto accumulate(Range&& r, Acc acc) 
requires(requires(){
    {acc += *std::begin(r) };
})
{
    for (auto&& x : r){
        acc += x;
    }
    return acc;
}


int main() {
    std::vector<double> v {0.5, 0.7, 0.1};
    auto res = accumulate(v, std::string{});
    std::cout << '"' << res << '"';
}

The program outputs:

""
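One possible way to rule out such a call is to require that the element type and the accumulator type are exactly the same. The sketch below is written so that it also compiles as C++17: it uses static_assert instead of the requires-clause from the example above, and strict_accumulate is a hypothetical name. The rule is deliberately strict and may be too strict for some code bases.

```cpp
#include <cassert>
#include <iterator>
#include <string>
#include <type_traits>
#include <vector>

template <class Range, class Acc>
Acc strict_accumulate(const Range& r, Acc acc) {
    // The element type with references and cv-qualifiers removed.
    using Elem = typename std::decay<decltype(*std::begin(r))>::type;
    static_assert(std::is_same<Elem, Acc>::value,
                  "element type and accumulator type must be the same");
    for (const auto& x : r) {
        acc += x;  // no implicit conversion can sneak in here
    }
    return acc;
}

void strict_accumulate_demo() {
    const std::vector<std::string> words{"a", "b", "c"};
    assert(strict_accumulate(words, std::string{}) == "abc");
    // strict_accumulate(std::vector<double>{0.5, 0.7, 0.1}, std::string{});
    // ^ rejected at compile time by the static_assert above
}
```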

Chains of implicit conversions can be very obscure:

#include <iostream>

void f(float&& x) { std::cout << "float " << x << "\n";  }
void f(int&& x) { std::cout << "int " << x << "\n";  }
void g(auto&& v) { f(v); } // C++20

int main() { 
    g(2);
    g(1.f);
}

Most surprisingly, this example displays the following:

float 2
int 1

Even though we passed literals of exactly the opposite types and almost certainly expected to get this:

int 2
float 1

This isn't a compiler bug or undefined behavior! A tricky chain of implicit conversions is to blame.

Let's look at it using the example of the first call to g(2) and substitute the template parameter:

void g(int&& v) {
    // Although v has the int&& type
    // Using v further in expressions results in int& !
    // decltype(v)   == int&&
    // decltype((v)) == int&

    // The f functions accept only rvalue references

    // Implicit conversion of int& to int&& is forbidden
    //  int&& x = 5;
    //  int&& y = x; // doesn't compile!

    // So, the f(int&&) overload cannot be used

    // f(float&&) remains
    // int can be implicitly converted to float
    // int& can implicitly act as just int
    // implicit static_cast<float>(v) returns a temporary float value
    // temporary values of the T type implicitly bind to T&&

    // Here we have a conversion chain:
    // int& -> int -> float -> float&& 

    f(v); // calls f(float&&) !

    // explicitly: f(static_cast<float>(v));
}
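If the intent is to pass the argument on with its original value category, the usual tool is std::forward. Here's a minimal sketch: f returns a string instead of printing so that the result is easy to check, and g_plain and g_forward are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <utility>

std::string f(float&&) { return "float"; }
std::string f(int&&)   { return "int"; }

// Same problem as in the example: inside the function, v is an lvalue,
// so the overload of the matching type can't bind to it.
template <class T>
std::string g_plain(T&& v) { return f(v); }

// std::forward restores the value category of the original argument.
template <class T>
std::string g_forward(T&& v) { return f(std::forward<T>(v)); }

void forwarding_demo() {
    assert(g_plain(2) == "float");    // int& -> int -> float -> float&&
    assert(g_forward(2) == "int");    // the rvalue int binds to int&&
    assert(g_plain(1.f) == "int");    // float& -> float -> int -> int&&
    assert(g_forward(1.f) == "float");
}
```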

Of course, nobody ever (at least explicitly) takes primitives over rvalue references because it's pointless. However, even without the rvalue reference for primitives, we can do something terrible:

#include <iostream>
#include <string>

struct MyMovableStruct {
    operator bool () {
        return !data.empty();
    }
    std::string data;
};

void consume(MyMovableStruct&& x) { 
    std::cout << "MyStruct: " << x.data << "\n";  
}
void consume(bool x) { std::cout << "bool " << x << "\n";  }
void g(auto&& v) { consume(v); }
int main() { 
    g(MyMovableStruct{"hello"});
}

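A similar conversion chain gives "bool 1" in the output: v is an lvalue, so it can't bind to MyMovableStruct&&, and the compiler falls back to consume(bool) via operator bool. The last step of the chain (binding a temporary to an rvalue reference) isn't needed here.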

Be sure to enable compiler warnings for implicit conversions (for example, -Wconversion in GCC and Clang). It's best to treat them as errors.

Always mark single-parameter constructors as explicit to avoid implicit conversions for your types.

If you overload the cast operators (operator T()) for your types, make them explicit as well.
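The two recommendations above can be sketched as follows. The Meters type and the is_nonzero function are made up for illustration; the static_assert lines only document what the explicit keyword changes.

```cpp
#include <cassert>
#include <type_traits>

struct Meters {
    explicit Meters(double v) : value(v) {}                   // no implicit double -> Meters
    explicit operator bool() const { return value != 0.0; }   // no implicit Meters -> bool
    double value;
};

// Implicit conversions are gone...
static_assert(!std::is_convertible<double, Meters>::value, "");
static_assert(!std::is_convertible<Meters, bool>::value, "");
// ...but explicit construction and explicit casts still work.
static_assert(std::is_constructible<Meters, double>::value, "");
static_assert(std::is_constructible<bool, Meters>::value, "");

bool is_nonzero(const Meters& m) {
    // Contextual conversion in a condition is still allowed
    // for an explicit operator bool.
    if (m) {
        return true;
    }
    return false;
}
```

With these declarations, an overload taking bool would no longer silently accept a Meters object, unlike consume(bool) in the example above.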

If your functions/methods are designed to work only with a particular primitive type, use templates, SFINAE, and concepts to restrict them. You can also use the mechanism of explicit overload deletion (=delete), which is really easy:

int only_ints(int x) { return x;}

template <class T>
auto only_ints(T x) = delete;

int main() {
    const int& x = 2;
    only_ints(2);
    only_ints(x);
    char c = '1';
    only_ints(c);   // Compilation Error.
    only_ints(2.5); // Explicitly deleted.
}

Author: Dmitry Sviridkin

Dmitry has over eight years of experience in high-performance software development in C and C++. From 2019 to 2021, he taught Linux system programming at SPbU and C++ hands-on courses at HSE. He currently works on system and embedded development in Rust and C++ for edge servers as a Software Engineer at AWS (CloudFront). His main area of interest is software security.

Editor: Andrey Karpov

Andrey has over 15 years of experience with static code analysis and software quality. He is the author of numerous articles on writing high-quality code in C++. Andrey Karpov was honored with the Microsoft MVP award in the Developer Technologies category from 2011 to 2021. Andrey is a co-founder of the PVS-Studio project. He has long been the company's CTO and was involved in the development of the C++ analyzer core. Andrey is currently responsible for team management, personnel training, and DevRel activities.

All chapters

Part 1: introduction; what is undefined behavior and what it leads to; narrowing conversions and implicit type conversion.
Part 2: overflow of signed integers; floating-point numbers; integer promotion; char and sign extension.
Part 3: dangling references; string_view; a fly in the syntactic sugar (range-based for); self-reference; std::vector and reference invalidation.
Part 4: lambda function capture lists; tuples; unexpected mutability; implicit references; use-after-move; lifetime extension.
Part 5: Most Vexing Parse; non-constant constants; move semantics; std::enable_if_t vs. std::void_t; forgotten return.
Part 6: ellipsis and functions; operator []; iostreams—good luck debugging!; comma operator; function-try-block; zero-sized types.
Part 7: null-terminated strings; std::shared_ptr; imexplicit type conversion; how to pass a standard function and not break anything.
Part 8: infinite loops and halting problem; recursion; false noexcept; buffer overflow.
Part 9: (N)RVO vs RAII; null pointer dereferencing; static initialization order fiasco; static inline; ODR violation; reserved names.
Part 10: trivial types and ABI; uninitialized variables; C++20 unbounded ranges; non-virtual yet virtual functions; VLA.
Part 11: invalid pointers; placement new for arrays; data race; mutex deadlock; signal (un)safety; how to do everything right and trigger the deadlock.
Part 12: std::vector::reserve and std::vector::resize; unaligned references; time of life and death; static analysis and UB; conclusion.