This is the first part of an e-book on undefined behavior. It isn't a textbook: it's intended for readers who are already familiar with C++ programming. Think of it as a C++ programmer's guide to undefined behavior and to its most secret and exotic corners. The book was written by Dmitry Sviridkin and edited by Andrey Karpov.
Panic!
The story starts simple and straightforward: an ordinary tenth grader becomes interested in programming and gets acquainted with algorithmic problems, the solutions to which must be fast. He finds out about C++ and learns minimal syntax, basic constructs, and containers. He solves problems with predefined, always correct input and output formats, and doesn't know any sorrow...
Meanwhile, somewhere in the big world, developers curse one programming language or another every day for various reasons: this one isn't user friendly, that one lacks some kind of feature, there are extra letters to write, here are bugs in the standard library... But there's one language that's criticized for all that, and especially for something as obscure and mysterious as undefined behavior (UB).
Five or six years later, our no longer tenth-grader, who has seen neither worries nor sorrows in the sea of programs detached from reality, suddenly learns that the most strongly disliked language has always been, still remains, and will be his C++.
Then, for several more years, he encounters the most nightmarish and unbelievable horrors that await C++ programmers at almost every turn. That's how this series of notes comes to be: a collection of the most disgusting examples you can easily stumble upon in everyday tasks.
"Premature optimization is the root of all evil" (D. E. Knuth or C. A. R. Hoare, depending on which source you consult).
The C++ language is perhaps the most vivid demonstration of the following idea: a large number of errors in C++ programs are related to undefined behavior embedded in the language foundation, just to open gates for optimizations at the compilation stage.
If you want to write code in C++ and be at least somewhat sure that it works, you need to know about the various pitfalls and cleverly placed landmines in the language standard and its library, and avoid them wherever possible. Otherwise, your programs will work correctly only on a particular machine and only by chance.
Important: This collection is not a manual on the language. It targets those who are already familiar with programming and the C++ language and who understand its basic constructs.
I'm well familiar with the topic of undefined behavior. It permeates PVS-Studio, the project I co-founded. PVS-Studio is a static code analysis tool that takes on the immense task of detecting this very undefined behavior. The analyzer has other tasks too, such as searching for typos or unreachable code. However, UB is the largest and most inexhaustible source of issues in C++ programs and, therefore, of reasons to create new diagnostic rules.
So, when I found Dmitry Sviridkin's guide to UB on GitHub (ubbook), I was very excited to read it. I even wrote down some interesting thoughts for myself. They'll end up being the basis of new diagnostic rules. So, I both enjoyed and benefited from reading it.
Then, I started thinking. Firstly, I also have something to say on the topic of undefined behavior. Secondly, it would be nice to share such valuable and interesting material with as many programmers as possible. So, why not translate it into English? I didn't think about it for too long and decided to try to make it happen.
I contacted Dmitry with an offer to collaborate on the editing, completion, and translation of his material. He agreed, and we set to work on this e-book that we'll eventually try to turn into a printed one. You're welcome to see what we've done. Stock up on cookies and sharpen your attentiveness for an enjoyable and thoughtful reading experience.
Undefined behavior, or UB, is an amazing peculiarity of some programming languages. It lets you write syntactically correct code that behaves completely unpredictably when you port it from one platform to another, change compilation/interpretation options, or replace one compiler/interpreter with another. Most importantly, in addition to being syntactically correct, the code also looks semantically correct.
The peculiarity is that the language specification intentionally doesn't define how the program behaves under certain conditions. This is done for performance reasons, since there's no need to generate additional instructions with checks, or for flexibility in implementing some features. The specification simply states, "If code does something wrong, then the behavior is undefined." For example:
Note that "behavior is undefined" means that anything can happen: the disk may get formatted, compilation may fail, an exception may be thrown, or maybe everything will be fine. No guarantee is given. This is where all the hilarious, unexpected, and very sad consequences in production code come from.
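To make this concrete, here is a minimal sketch built around one well-known case: overflow of a signed integer during addition is undefined behavior in C++. The helper name checked_add is hypothetical and not part of the book. It only illustrates that the range check has to happen before the operation, because once the overflowing addition has been executed, no guarantee remains.

```cpp
#include <cassert>
#include <limits>
#include <optional>

// Hypothetical helper (not from the book): returns a + b, or an empty
// optional if the mathematical result does not fit into int.
std::optional<int> checked_add(int a, int b) {
    constexpr int kMax = std::numeric_limits<int>::max();
    constexpr int kMin = std::numeric_limits<int>::min();
    // The checks themselves never overflow: kMax - b is evaluated only
    // for b > 0, and kMin - b only for b < 0.
    if (b > 0 && a > kMax - b) return std::nullopt;  // would exceed the maximum
    if (b < 0 && a < kMin - b) return std::nullopt;  // would go below the minimum
    return a + b;  // safe: the result is known to be representable
}

// Usage example.
void checked_add_usage() {
    assert(checked_add(2, 3) == 5);
    assert(!checked_add(std::numeric_limits<int>::max(), 1).has_value());
}
```

A check written after the fact, such as testing whether the sum came out smaller than an operand, would already rely on the overflow having happened, so the compiler is free to remove it.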
Of course, C and C++ are most notorious for their undefined behavior. However, one needs to understand that it's also present in other languages. In many languages, you can find a rare special case with undefined behavior. However, in C and C++, it occurs when creating almost any program: too many language features have peculiarities that make undefined behavior possible.
So, what are the signs to look for in an application that might indicate UB? How much of undefined behavior is really undefined?
Back in the day, UB in code could indeed lead to anything. For example, GCC 1.17 started running games.
If you divide something by zero today, such a thing probably won't happen. However, trouble does come in many forms:
Undefined behavior is often confused with two other concepts: unspecified behavior and implementation-defined behavior.
These two are much better than undefined behavior, though they have one thing in common: a program that relies on either of them is, in fact, unportable.
There are also two classes of undefined behavior:
If you encounter the first one, you're in trouble. However, if everything works fine, there's a good chance it will continue to do so until you update the library or change platforms. Side effects can often occur only at a local level. It looks a lot like implementation-defined behavior.
If it's the second one, you're in serious trouble. Even with the slightest change, the code may suddenly stop working correctly. Moreover, users of your application may face serious security threats.
How do you find undefined behavior in code? It's a question I'm asked very often. I've also asked it of myself and of others. Unfortunately, every C++ developer has to ask it.
The short answer is that there's no way. In general, this is an algorithmically unsolvable problem, almost no different from the halting problem. However, programmers will keep solving unsolvable problems no matter how hard you try to stop them. So, for specific code and specific inputs, there are sometimes ways to get an answer.
We can check the code before compiling it using various static analyzers:
A smart enough analyzer, working with the control-flow graph of a program and knowing hundreds of standard language traps, can find many issues and warn about suspicious code. However, not every analyzer can do that, and not always.
For example, GCC issues a warning for the following code:
int arr[5] = {1,2,3,4,5};
int main() {
  int i = 5;
  return arr[i];
}
Here's the warning:
array subscript 5 is above array bounds of 'int [5]' [-Warray-bounds]
6 | return arr[i];
| ~~~~~^
note: while referencing 'arr'
2 | int arr[5] = {1,2,3,4,5};
We can check some of the code at compile time using different sets of inputs and constexpr. In a context evaluated at compile time, UB is forbidden:
constexpr int my_div(int a, int b) {
  return a / b;
}
namespace test {
  template <unsigned int N>
  constexpr int div_test(const int (&A)[N], const int (&B)[N]) {
    int x = 0;
    for (auto i = 0u; i < N; ++i) {
      x = ::my_div(A[i], B[i]);
    }
    return x;
  }
  constexpr int A[] = {1,2,3,4,5};
  constexpr int B[] = {1,2,3,4,0};
  static_assert((div_test(A, A), true)); // OK
  static_assert((div_test(A, B), true)); // Compilation error, zero division
} // namespace test
However, we can't use constexpr everywhere: depending on the version of the standard, it restricts what the function body may contain. It also implicitly makes the function inline, which effectively "forbids" moving the function definition to a separate translation unit (put simply, the definition has to be placed in a header file).
Finally, if we can't find errors using static analysis (external utilities or the compiler), we can resort to the help of dynamic analysis.
When building with Clang or GCC, we can enable the -fsanitize=undefined, -fsanitize=address, and -fsanitize=thread sanitizers. They detect errors at run time, but at the cost of significant performance overhead. So, such tools should be used only at the testing and development stages.
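As a rough sketch of where a sanitizer build helps, the array example from the static analysis part can be changed so that the index arrives at run time, where a compile-time warning is no longer guaranteed. The function names and the file name demo.cpp are placeholders chosen for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// Same array as in the static-analysis example above.
int arr[5] = {1, 2, 3, 4, 5};

// The index is a run-time value here, so a compiler warning is not guaranteed.
int element_at(std::size_t i) {
    return arr[i];  // undefined behavior for i >= 5
}

// Valid calls only. A call like element_at(5) is the kind of bug
// a sanitizer build is meant to report when it is executed.
void element_at_usage() {
    assert(element_at(0) == 1);
    assert(element_at(4) == 5);
}

// Possible build commands ("demo.cpp" is a placeholder file name):
//   g++     -g -fsanitize=address,undefined demo.cpp
//   clang++ -g -fsanitize=address,undefined demo.cpp
```

A sanitizer only sees the paths that actually run, so the tests need to feed the code with inputs that reach the faulty branch.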
Also, for debug builds, standard library code is sometimes equipped with asserts. This is done, for example, for the various iterators of the standard library in the MSVC (Visual Studio) distribution.
Since undefined behavior can emerge due to the optimization features of different compilers, we need to build our code for different platforms with different optimization levels and compare its behavior. Error-free code should be portable, and it should always behave in the same way (unless, of course, its job is to generate completely random values).
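One classic illustration of why the optimization level matters is a check that relies on signed overflow. Since overflow of int is undefined behavior, a compiler is allowed to assume that x + 1 > x always holds and may fold the comparison into true in an optimized build, while an unoptimized build may actually perform a wrapping addition. The sketch below uses hypothetical function names and makes no claim about what any particular compiler version does.

```cpp
#include <cassert>
#include <limits>

// Relies on signed overflow for the largest int value: that case is UB,
// so the result there may differ between optimization levels.
bool next_is_greater(int x) {
    return x + 1 > x;
}

// UB-free alternative: compare against the limit instead of overflowing.
bool has_next(int x) {
    return x < std::numeric_limits<int>::max();
}

// Only inputs that do not overflow are used here.
void compare_checks() {
    assert(next_is_greater(41));
    assert(has_next(41));
    assert(!has_next(std::numeric_limits<int>::max()));
}
```

Comparing the output of next_is_greater for the largest int value across builds with different optimization flags is one way to notice that the code depends on undefined behavior.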
Tests, various builds, static and dynamic analysis are the ways to increase your confidence that the code is UB-free. Only a group of experts who check every line of code against the standard and double-check each other three times can guarantee that. Even that may not be enough, though.
It's also possible to disable optimizations with compiler flags, or to enable flags that tolerate various standard violations (the famous -fpermissive) and turn C++ into something completely different. However, I urge you never to tread that path. Your code will become unportable. Your code will no longer be C++ code. It's better to choose another programming language in such a case.
Many modern programming languages, especially newer ones, forbid implicit type conversions.
So, in Rust, Haskell, or Kotlin, we can't just use float and int in the same arithmetic expression without explicitly stating in the code to convert one to the other. Python isn't as strict but still keeps strings, characters, and numbers from mixing.
C++ doesn't forbid implicit conversion, which leads to a lot of erroneous code. Moreover, such code can contain both defined (but unexpected) and undefined behavior.
Let's look at an example:
#include <vector>
#include <numeric>
#include <iostream>
int average(const std::vector<int>& v) {
  if (v.empty()) {
    return 0;
  }
  return std::accumulate(v.begin(), v.end(), 0) / v.size();
}
int main() {
  std::cout << average({-1,-1,-1});
}
Anyone who takes a glimpse at this code would expect the result to be -1. However, unfortunately, the result is different. A program built by GCC for the x86-64 platform displays the following:
1431655764
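The code doesn't contain undefined behavior (at least not with this input). However, the implicit type conversion is there, making the result unexpected: v.size() returns an unsigned type, so the sum -3 is converted to a huge unsigned value before the division, and the quotient is then narrowed back to int.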
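One possible fix, shown here as a sketch rather than the only option, is to keep the whole computation in a single signed type and to spell out the conversion of the size explicitly. The name average_signed is made up for this example. GCC and Clang also provide the -Wconversion and -Wsign-conversion flags, which are intended to report conversions of this kind.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Sketch of a fix: every operand of the division is a signed long long.
long long average_signed(const std::vector<int>& v) {
    if (v.empty()) {
        return 0;
    }
    // The 0LL initial value makes std::accumulate sum in long long.
    const long long sum = std::accumulate(v.begin(), v.end(), 0LL);
    // Explicit conversion of the unsigned size, so the division stays signed.
    return sum / static_cast<long long>(v.size());
}

// Usage example with the input from the text.
void average_signed_usage() {
    const std::vector<int> values{-1, -1, -1};
    assert(average_signed(values) == -1);
}
```

The explicit static_cast does not make the code shorter, but it makes the conversion visible in the source and in code review.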
Implicit type conversions apply not only to built-in primitives but also to more complex types. Worst of all, they interfere with the selection of an appropriate function overload, leading to various surprises that are often unpleasant.
Here's an example with abs:
#include <cmath>
#include <iostream>
int main() {
  std::cout << abs(3.5) << "\n"; // may resolve to the C library function,
                                 // which takes the int type as input:
                                 // the result is then 3
  std::cout << std::abs(3.5);    // the C++ library function
                                 // overloaded for double,
                                 // the result is 3.5
}
An even worse example is the std::string standard type:
#include <string>
int main() {
  std::string s;
  s += 48;   // implicit conversion to char.
  s += 1000; // and there's a very unpleasant overflow
             // on a platform with signed char.
  s += 49.5; // implicit conversion to char again
}
This monstrosity compiles!
It seems that this absolutely horrible usage example can never be found in normal code. Unfortunately, it can.
You might write your own generic version of std::accumulate with some checks on the template arguments. Then, you may accidentally pass it a string as the accumulator along with a container of, say, floating-point numbers. And there won't be any compilation error. Just a weird bug in the program.
#include <string>
#include <vector>
#include <iostream>
template <class Range, class Acc>
auto accumulate(Range&& r, Acc acc)
  requires(requires(){
    {acc += *std::begin(r) };
  })
{
  for (auto&& x : r){
    acc += x;
  }
  return acc;
}
int main() {
  std::vector<double> v {0.5, 0.7, 0.1};
  auto res = accumulate(v, std::string{});
  std::cout << '"' << res << '"';
}
The program outputs:
""
Chains of implicit conversions can be very obscure:
#include <iostream>
void f(float&& x) { std::cout << "float " << x << "\n"; }
void f(int&& x) { std::cout << "int " << x << "\n"; }
void g(auto&& v) { f(v); } // C++20
int main() {
  g(2);
  g(1.f);
}
Most surprisingly, this example displays the following:
float 2
int 1
Even though we passed literals of exactly the opposite types and almost certainly expected to get this:
int 2
float 1
This isn't a compiler bug or undefined behavior! A tricky chain of implicit conversions is to blame.
Let's look at it using the example of the first call to g(2) and substitute the template parameter:
void g(int&& v) {
  // Although v has the int&& type
  // Using v further in expressions results in int& !
  //   decltype(v) == int&&
  //   decltype((v)) == int&
  // The f functions accept only rvalue references
  // Implicit conversion of int& to int&& is forbidden
  //   int&& x = 5;
  //   int&& y = x; // doesn't compile!
  // So, the f(int&&) overload cannot be used
  // f(float&&) remains
  // int can be implicitly converted to float
  // int& can implicitly act as just int
  // implicit static_cast<float>(v) returns a temporary float value
  // temporary values of the T type implicitly bind to T&&
  // Here we have a conversion chain:
  //   int& -> int -> float -> float&&
  f(v); // calls f(float&&) !
  // explicitly: f(static_cast<float>(v));
}
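A common way to avoid this particular chain is to forward the argument with std::forward, which restores the value category the caller used. The sketch below is an adaptation, not the book's code: f returns a string instead of printing so that the chosen overload can be checked, and g is written as an ordinary template so that it does not require C++20.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Same overload set as above, but returning the name of the chosen overload.
std::string f(float&&) { return "float"; }
std::string f(int&&)   { return "int"; }

template <class T>
std::string g(T&& v) {
    // std::forward<T>(v) is an rvalue again when g was called with an rvalue,
    // so f(int&&) can bind directly and no int -> float conversion is needed.
    return f(std::forward<T>(v));
}

// Usage example.
void forwarding_usage() {
    assert(g(2) == "int");
    assert(g(1.f) == "float");
}
```

This only repairs the rvalue case. If g is called with a named int variable, f(int&&) still cannot bind to it, and the call goes through the int to float conversion again.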
Of course, nobody ever (at least explicitly) takes primitives over rvalue references because it's pointless. However, even without the rvalue reference for primitives, we can do something terrible:
#include <iostream>
#include <string>
struct MyMovableStruct {
  operator bool () {
    return !data.empty();
  }
  std::string data;
};
void consume(MyMovableStruct&& x) {
  std::cout << "MyStruct: " << x.data << "\n";
}
void consume(bool x) { std::cout << "bool " << x << "\n"; }
void g(auto&& v) { consume(v); }
int main() {
  g(MyMovableStruct{"hello"});
}
A similar conversion chain gives "bool 1" in the output, except that the last step (binding a temporary to an rvalue reference) isn't needed here.
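Two small changes are enough to make this example pick the expected overload: marking the conversion operator as explicit and forwarding the argument. The following is a sketch under those assumptions. The consume functions return strings instead of printing, which is a change made only so that the result can be checked.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

struct MyMovableStruct {
    // explicit: the struct no longer converts to bool silently.
    explicit operator bool() const { return !data.empty(); }
    std::string data;
};

std::string consume(MyMovableStruct&& x) { return "MyStruct: " + x.data; }
std::string consume(bool x) { return x ? "bool 1" : "bool 0"; }

template <class T>
std::string g(T&& v) {
    // std::forward keeps an rvalue argument an rvalue.
    return consume(std::forward<T>(v));
}

static_assert(!std::is_convertible_v<MyMovableStruct, bool>,
              "implicit conversion to bool is disabled");

// Usage example.
void consume_usage() {
    assert(g(MyMovableStruct{"hello"}) == "MyStruct: hello");
    assert(g(true) == "bool 1");
}
```

With the explicit operator alone and without std::forward, the call consume(v) would no longer compile for this type, which is arguably better than silently printing "bool 1".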
Be sure to enable compiler warnings for all implicit conversions. It's best to treat them as errors.
Always mark single-parameter constructors as explicit to avoid implicit conversions for your types.
If you overload the cast operators (operator T()) for your types, make them explicit as well.
If your functions/methods are designed to work only with a particular primitive type, use templates, SFINAE, and concepts to restrict them. You can also use the mechanism of explicit overload deletion (=delete), which is really easy:
int only_ints(int x) { return x; }
template <class T>
auto only_ints(T x) = delete;
int main() {
  const int& x = 2;
  only_ints(2);
  only_ints(x);
  char c = '1';
  only_ints(c);   // Compilation error: explicitly deleted.
  only_ints(2.5); // Compilation error: explicitly deleted.
}
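The same idea of restricting types can be applied to the hand-written accumulate from earlier. The sketch below is one possible variant, not the book's code: it uses a static_assert with type traits (C++17) instead of concepts and requires the element type to match the accumulator type exactly. The name accumulate_strict is hypothetical.

```cpp
#include <cassert>
#include <iterator>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical stricter variant: the element type of the range must be
// exactly the accumulator type, so no implicit conversion can sneak in.
template <class Range, class Acc>
Acc accumulate_strict(Range&& r, Acc acc) {
    using Elem = std::decay_t<decltype(*std::begin(r))>;
    static_assert(std::is_same_v<Elem, Acc>,
                  "element type and accumulator type must match exactly");
    for (auto&& x : r) {
        acc += x;
    }
    return acc;
}

// Usage example.
void accumulate_strict_usage() {
    const std::vector<int> ints{1, 2, 3};
    assert(accumulate_strict(ints, 0) == 6);
    // Rejected at compile time, unlike the earlier version:
    // accumulate_strict(std::vector<double>{0.5, 0.7}, std::string{});
}
```

Requiring an exact match is deliberately strict. A real project might prefer a looser rule, for example allowing only conversions that cannot lose information.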
Author: Dmitry Sviridkin
Dmitry has over eight years of experience in high-performance software development in C and C++. From 2019 to 2021, he taught Linux system programming at SPbU and hands-on C++ courses at HSE. He currently works on system and embedded development in Rust and C++ for edge servers as a Software Engineer at AWS (Cloudfront). His main area of interest is software security.
Editor: Andrey Karpov
Andrey has over 15 years of experience with static code analysis and software quality. He is the author of numerous articles on writing high-quality code in C++. Andrey Karpov was honored with the Microsoft MVP award in the Developer Technologies category from 2011 to 2021. Andrey is a co-founder of the PVS-Studio project. He has long been the company's CTO and was involved in the development of the C++ analyzer core. Andrey is currently responsible for team management, personnel training, and DevRel activities.