Let's give our race cars a spit-shine and watch them navigate the twists and turns of compiler optimizations, using std::array as an example. Will it overtake a built-in array or fall behind?
Whoa, whoa, Johnny, put the gun away! I can explain everything! In the previous article on std::array, we covered why std::array in production-ready code is just as efficient as a built-in array, and not a bit slower.
After writing such an article, one can't help but wonder: can our friendly neighborhood std::array outrun its C-era grandfather? Just admit it, sometimes it's nice to indulge in freethinking and see where it takes you. Let's do it! What are we risking, after all? It's not like we're expecting a C++ inquisition, right?
Let's ponder for a while on how this might happen. std::array can be faster in two cases: if it offers something a built-in array lacks, or, conversely, if it lacks something a built-in array offers.
But std::array is implemented via an array! We've demonstrated that by proving std::array is no slower than a built-in array. So it has everything the array has, right? It turns out it also has something the array doesn't.
Since std::array is a first-class citizen, its objects behave like first-class citizens: they are copied like objects, and, unlike arrays, they don't decay to pointers. In addition, the receiving function always knows their size at compile time, since std::array is a class template, so the templates will handle it. Knowing something at compile time is always a win by default.
How can this help us?
We're not just using a compiler; we're using an optimizing compiler. At least the author does, we can't speak for all readers. So, it does some hanky-panky with our code, which very often results in code with higher performance (and perhaps even higher quality) than we originally intended. That's the good news. Here's more: in theory, we can write code in a way that helps our friendly neighborhood optimizer do its job even better.
That's what we're going to try out! Let's take advantage of the fundamental differences between std::array and a built-in array described above and see what we end up with.
Here's the code that we'll be working with. For compilation, we use GCC 14.2. Let's focus on the two functions under test:
template <size_t N>
void transform(const std::array<int, N> &src,
               std::array<int, N> &dst,
               int n)
{
  for (size_t i = 0; i < N; ++i)
  {
    dst[i] = src[i] + n;
  }
}

#define ARR_SIZE 1000000

template
void transform(const std::array<int, ARR_SIZE> &,
               std::array<int, ARR_SIZE> &,
               int);
void transform(const int *src, int *dst, size_t N, int n)
{
  for (size_t i = 0; i < N; ++i)
  {
    dst[i] = src[i] + n;
  }
}
It's as simple as it gets: both iterate through memory in the same way and modify one of the memory areas. Let's enable the -O1 flag and look at the assembly code:
void transform<1000000ul>(
std::array<int, 1000000ul> const&, std::array<int, 1000000ul>&, int):
mov eax, 0
.L2:
mov ecx, edx
add ecx, DWORD PTR [rdi+rax*4]
mov DWORD PTR [rsi+rax*4], ecx
add rax, 1
cmp rax, 1000000
jne .L2
ret
transform(int const*, int*, unsigned long, int):
test rdx, rdx
je .L1
mov eax, 0
.L3:
mov r8d, ecx
add r8d, DWORD PTR [rdi+rax*4]
mov DWORD PTR [rsi+rax*4], r8d
add rax, 1
cmp rdx, rax
jne .L3
.L1:
ret
We can't speak for you, but the author is tempted to call them identical. And they are virtually identical, as they do virtually identical things. The only real difference is that std::array has the advantage of its size being embedded in its type, so the size is hardcoded into the function body. Another point in favor of std::array, if you ask the author. Another plus in the C++ piggy bank; you get the wordplay. The second function, on the other hand, has an extra zero-length check, which certainly deserves some appreciation.
Let's not stay long on this snippet, as it's a simple iteration, the same thing we've written in our C++ code. Let's enable the -O2 flag right away and see what happens:
void transform<1000000ul>(
std::array<int, 1000000ul> const&, std::array<int, 1000000ul>&, int):
movd xmm2, edx
xor eax, eax
pshufd xmm1, xmm2, 0
.L2:
movdqu xmm0, XMMWORD PTR [rdi+rax]
paddd xmm0, xmm1
movups XMMWORD PTR [rsi+rax], xmm0
add rax, 16
cmp rax, 4000000
jne .L2
ret
transform(int const*, int*, unsigned long, int):
test rdx, rdx
je .L1
xor eax, eax
.L3:
mov r8d, DWORD PTR [rdi+rax*4]
add r8d, ecx
mov DWORD PTR [rsi+rax*4], r8d
add rax, 1
cmp rdx, rax
jne .L3
.L1:
ret
There have been some changes. Note the first function, the one that accepts std::array. Not just registers but vector registers have appeared there in some magical (actually not) way. Anything with the XMM prefix works with vectors, not scalars, meaning several scalar values are handled simultaneously. In GCC, these vector gimmicks are enabled by default with the -O2 flag, starting with version 12.1.
An XMM register is 128 bits wide. It holds several scalar values, and how many depends on what we put in it. Naturally, the count depends on the element size: a register fits more 8-bit scalars than 16-bit ones. That's clear. Just note that you can put different types in there, but all values in one register must be of the same type.
See the number 4,000,000? We don't have it in our code. But we do have the number 1,000,000, the size of the tested arrays: a fourfold difference. Well, the int type is 4 bytes on the target platform, so the number of bytes in our array is exactly 4,000,000. OK.
There's also the number 16 added to the rax register on each iteration. rax looks like a counter, doesn't it? When the counter reaches 4,000,000, the loop ends. We've already found out that XMM registers are 128 bits, or 16 bytes. Each loop iteration processes 16 bytes, i.e. we traverse the entire array at least four times faster than without vectorization.
The function that accepts pointers to int
has also undergone some changes, though not nearly as sweeping as in the previous function. You may notice that it still iterates one element at a time. Why?
Imagine two memory areas as in the picture below.
The author will explain it like you're in the fifth grade, since, let's be honest, kids these days are savvy, unlike us old folks. Back in the days (in the fifth grade), the author was trying to persuade mom to bring home a nice stick found in a park, but today's kids may be interested in vector operations.
Imagine that we handle these memory areas via std::array: we initialize one container with some values and another container with some other values. Then we pass both containers to our template function that accepts references. What do we get?
There are two scenarios. In the first, we pass different objects pointing to different (and guaranteed to be different) memory areas. No problem: the compiler happily vectorizes the element traversal, since the areas are independent of each other.
In the second scenario, we pass the same object to both function parameters. References point to the same memory, no problems here either—we read a value from a memory area, modify it and write it back. We can vectorize it without any concerns.
Let's move on to the second function, the one that accepts pointers. These pointers can point anywhere. In addition to the two scenarios above, a third one is possible: the memory areas may partially overlap.
This changes a lot. Now we can't simply update several array elements in one operation: by writing elements, we'd overwrite source values that haven't been read yet. This happens when the memory areas overlap only slightly, just enough to block the vector operations we discussed above.
We'll use two terms in this article: let's call completely identical memory areas overdubbing, and partially overlapping ones overlapping.
Does anything significantly change if we enable the -O3
flag? Let's find out:
void transform<1000000ul>(
std::array<int, 1000000ul> const&, std::array<int, 1000000ul>&, int):
movd xmm2, edx
xor eax, eax
pshufd xmm1, xmm2, 0
.L2:
movdqu xmm0, XMMWORD PTR [rdi+rax]
paddd xmm0, xmm1
movups XMMWORD PTR [rsi+rax], xmm0
add rax, 16
cmp rax, 4000000
jne .L2
ret
transform(int const*, int*, unsigned long, int):
mov r8d, ecx
test rdx, rdx
je .L1
lea rax, [rdx-1]
cmp rax, 2
jbe .L9
lea rcx, [rdi+4]
mov rax, rsi
sub rax, rcx
cmp rax, 8
jbe .L9
mov rcx, rdx
movd xmm2, r8d
xor eax, eax
shr rcx, 2
pshufd xmm1, xmm2, 0
sal rcx, 4
.L4:
movdqu xmm0, XMMWORD PTR [rdi+rax]
paddd xmm0, xmm1
movups XMMWORD PTR [rsi+rax], xmm0
add rax, 16
cmp rax, rcx
jne .L4
mov rax, rdx
and rax, -4
test dl, 3
je .L1
mov r9d, DWORD PTR [rdi+rax*4]
lea rcx, [0+rax*4]
add r9d, r8d
mov DWORD PTR [rsi+rax*4], r9d
lea r9, [rax+1]
cmp r9, rdx
jnb .L1
mov r9d, DWORD PTR [rdi+4+rcx]
add rax, 2
add r9d, r8d
mov DWORD PTR [rsi+4+rcx], r9d
cmp rax, rdx
jnb .L1
add r8d, DWORD PTR [rdi+8+rcx]
mov DWORD PTR [rsi+8+rcx], r8d
ret
.L9:
xor eax, eax
.L6:
mov ecx, DWORD PTR [rdi+rax*4]
add ecx, r8d
mov DWORD PTR [rsi+rax*4], ecx
add rax, 1
cmp rdx, rax
jne .L6
.L1:
ret
I know, I know, there's no such thing as too much assembly. But why is there so much code? And how did the XMM-prefixed instructions show up in the function that accepts pointers? Author, did you, uh, fool us?
In fact, there's no deception. Let's first deal with the function that works with std::array. It's all the same. Cool! At least we haven't lost any performance!
Well, now for the elephant in the room. Yes, there are vector operations in the second function. Yes, they can work. Yes, they can iterate over both entire arrays. But not always.
At the very beginning of the function, the length is checked for zero; we've seen this before. Then we check whether our arrays have more than three elements. If not, the compiler sees no point in vector operations and jumps to the very bottom of the function, where it iterates over the elements in a regular scalar loop.
Now look at what goes next:
lea rcx, [rdi+4]
mov rax, rsi
sub rax, rcx
cmp rax, 8
jbe .L9
We remember that on Linux x64, rdi and rsi are used to pass the first and second integer or pointer function arguments, which in our case are the two pointers. Formally, the 64-bit Linux kernel follows the conventions described in the System V Application Binary Interface; just a fun fact. So, the delta between the two pointers is being checked here. If the second pointer points to an element of the first array and is close enough to its beginning, execution jumps to the end of the function, where the regular scalar loop runs.
No doubt it should jump there. How can you vectorize the operation if the values just read must be written over immediately? This problem is akin to the one that strict aliasing rules solve. If the compiler sees two pointers of the same type, it cannot say for sure (at least in the basic scenario) whether they point to the same object. If they might, it has to reload the value from memory after each write through the other pointer, because that write may have changed it. When working with pointers to different, incompatible types, the compiler may assume they point to different objects, allowing it to optimize the extra loads away. In our case, strict aliasing rules don't directly apply, but they serve as a useful analogy.
Now imagine the two arrays overlap so that the distance between their starting points is smaller than a vector register. Then the compiler cannot vectorize the operation, as it would produce incorrect values in the array. If the distance is larger, why not vectorize? Indeed, if one array overlaps another but their starting points are at least four elements apart (one full vector register in our case), the vectorized writes won't clobber unread inputs. Awesome! We won't dissect these vector operations in detail: they closely resemble the ones in the std::array version.
Interestingly, the compiler doesn't seem to consider overdubbing arrays a vectorization candidate. That's strange, given that such a traversal can indeed be vectorized, as we've shown in previous parts of the article.
An extra-credit question for the most attentive readers. When checking the distance between the arrays' starting positions, we compare the plain difference, not the absolute one. What if the first array points into a part of the second array, meaning the second array starts before the first? Take your time to think it over. And if you don't feel like thinking, here's the answer: it's no big deal, because even if we overwrite values from the first array while writing them to the second array, those values won't be needed by that time.
By the way, a fun fact about C++: formally, the standard forbids arithmetic on pointers unless they're guaranteed to point to elements of the same array. That is, such a distance calculation could lead to undefined behavior in C and C++. But we're not talking about those programming languages right now, because we're the cool kids, and we're talking about assembly. Other developers look at us with respect and envy. Yeah!
By the way, look at the small code snippet that comes after the main vector loop:
mov rax, rdx
and rax, -4
test dl, 3
je .L1
In assembly, dl is the lowest byte of the array size held in rdx (rdx -> edx -> dx -> dl/dh). The test dl, 3 instruction checks its two lowest bits: if the size is a multiple of four (the remainder is zero), we're done; otherwise, up to three elements remain in the arrays, and we deal with them using hardcoded scalar operations.
The intermediate conclusion of this section is clear. Yes, with the -O3 flag we ended up with vector operations in both cases; modern compilers optimize code really well, and in this particular case GCC managed to concoct a function that performs well in many scenarios. But our example is trivial. And with this trivial example, we've trivially shown that std::array provides both a user-friendly interface and a performance boost. Sure, with the -O3 flag, the difference boils down to just a couple of runtime checks, nothing particularly impressive. But with -O2, the difference is much more noticeable. Do you always compile production code with -O3? Don't hesitate to write in the comments what you think!
An interesting philosophical observation: std::array behaves this way precisely because it lacks something a raw pointer has: the ability to, well... point. The author can't help but recall Uncle Bob's book "Clean Architecture". In the introduction, while discussing programming paradigms, he notes that each paradigm takes something away from the programmer rather than gifting something extra. In our case, the std::array performance advantage (albeit in a limited set of cases) comes from restricting unrestricted indirection.
Some readers may have already dashed off a comment about the second function—the one that accepts pointers. They may suggest writing it to guarantee vectorization. Let's do it.
The __restrict__ keyword is a GCC extension. It serves the same purpose as its counterpart restrict in the official C language standard: it promises that two memory areas don't overlap. This helps resolve strict aliasing issues when working with same-type pointers, and it removes the barriers to vectorizing operations that should be vectorizable. If we're confident enough, why not rewrite the pointer-based function using this keyword:
void transform(const int * __restrict__ src,
               int * __restrict__ dst,
               size_t N,
               int n)
{
  for (size_t i = 0; i < N; ++i)
  {
    dst[i] = src[i] + n;
  }
}
This small change makes the compiler shower us with vector optimizations like a horn of plenty! Vector operations now occur with both the -O2 and -O3 flags, always, regardless of the distance between the pointers! We'll omit the assembly snippets, as they're nearly identical to the previous ones. Here is the link to them, as well as to all examples from this part of the article.
We could go the other way and add a pragma doing the same thing as the __restrict__
keyword in our case:
void transform(const int *src, int *dst, size_t N, int n)
{
  #pragma GCC ivdep
  for (size_t i = 0; i < N; ++i)
  {
    dst[i] = src[i] + n;
  }
}
This pragma applies to an individual loop, instructing the compiler to traverse it with vector operations. Strictly speaking, the pragma works in more cases than the __restrict__ keyword, and is thus a slightly more versatile tool for manual vectorization. By the way, you can check out a code example on the GCC site where the pragma works while __restrict__ doesn't.
Win! We got what we wanted! All it cost us was tying the project to a specific compiler and breaking the function for some input data. Other than that, it's a banger!
While readers who don't appreciate sarcasm are busy scribbling angry comments, let's think about this. Assume we're the fastest, toughest cowboys, demanding uncompromising, absolute speed. Moreover, we want to control it unconditionally, at every turn of the control-flow track. Shall we try? Let's do it:
#include <experimental/simd>

namespace stdx = std::experimental;

const ptrdiff_t simd_int_width = stdx::native_simd<int>{}.size();

void transform(int *src, int *dst, size_t N, int n)
{
  size_t i = 0;
  size_t len = (N / simd_int_width) * simd_int_width;
  auto simd_n = stdx::native_simd<int>{n};
  if (dst - src >= simd_int_width)
  {
    for (/*size_t i*/; i < len; i += simd_int_width)
    {
      auto s =
        stdx::native_simd<int>{src + i, stdx::element_aligned} + simd_n;
      s.copy_to(dst + i, stdx::element_aligned);
    }
  }
  for (/*size_t i*/; i < N; ++i)
  {
    dst[i] = src[i] + n;
  }
}
Alright, cowboy, I wrangled up some code, and I ain't about to spoon-feed it to you or pretend I did it to win favors. We did manual vectorization and got hooked on the experimental part of the C++ standard library. By the way, simd will officially ship as part of the C++26 standard, so why not practice beforehand? We also handled the corner case where the two arrays sit too close together, which rules out vectorization; with __restrict__ and the pragma, we didn't have that option. Technically, we've introduced a small instance of UB (subtracting pointers that may not belong to the same array), but that's just the little things of life, right?
Plus, the assembly output from this direct approach turned out to be not as optimized as with other methods. As a reminder, all snippets from this part are available here.
But this article is not about manual vectorization. If you really need manual vectorization in your project, you probably aren't even thinking about std::array; you already have immintrin.h in your project, and this article would probably just make you laugh. Indeed, what's the use of optimizations that we can't fully control?
After all, this article is for people who walk on sinful ground, sometimes even with their own feet. They write the project parts that are not bottlenecks. They use standard library abstractions because they're convenient. They don't grind matrices on the GPU. And they won't do manual vectorization.
Attentive readers might have already noticed that we can pull the same trick with built-in arrays as with std::array. That's the plain truth: we can write a function template that accepts built-in arrays by reference. In this case, two distinct arrays cannot overlap, and neither can two distinct std::array objects: separate complete objects occupy separate storage. Remember, earlier in the article we introduced the concepts of overlapping and overdubbing memory? The compiler should be able to vectorize a traversal over overdubbing arrays, which is exactly what we're talking about now. Let's see what happens in reality.
First, let's write a new function:
template <size_t N>
void transform(const int (&src)[N], int (&dst)[N], int n)
{
  for (size_t i = 0; i < N; ++i)
  {
    dst[i] = src[i] + n;
  }
}

#define ARR_SIZE 1000000
template void transform(const int (&)[ARR_SIZE], int (&)[ARR_SIZE], int);
The new function, as promised at the beginning of this part, accepts arrays by reference. What's going on in the assembly? Let's enable the -O2 flag and look:
void transform<1000000ul>(int const (&) [1000000ul], int (&) [1000000ul], int):
xor eax, eax
.L2:
mov ecx, DWORD PTR [rdi+rax*4]
add ecx, edx
mov DWORD PTR [rsi+rax*4], ecx
add rax, 1
cmp rax, 1000000
jne .L2
ret
It turns out there are no vector operations, even though the memory areas may overdub but are guaranteed not to overlap.
What happens if we enable the -O3 flag? Let's see:
void transform<1000000ul>(int const (&) [1000000ul], int (&) [1000000ul], int):
lea rcx, [rdi+4]
mov rax, rsi
sub rax, rcx
cmp rax, 8
jbe .L5
movd xmm2, edx
xor eax, eax
pshufd xmm1, xmm2, 0
.L3:
movdqu xmm0, XMMWORD PTR [rdi+rax]
paddd xmm0, xmm1
movups XMMWORD PTR [rsi+rax], xmm0
add rax, 16
cmp rax, 4000000
jne .L3
ret
.L5:
xor eax, eax
.L2:
mov ecx, DWORD PTR [rdi+rax*4]
add ecx, edx
mov DWORD PTR [rsi+rax*4], ecx
add rax, 1
cmp rax, 1000000
jne .L2
ret
We won't go through the details of what's going on: the function is almost a duplicate of the one we got for the pointer version with the __restrict__ keyword or the pragma directive.
You can review all code snippets from this part of the article here. In this case, the compiler failed to vectorize the loop with the -O2 flag enabled. Strictly speaking, this looks more like a GCC peculiarity than a pattern: there's no reason for vector optimizations to be blocked at the -O2 level here. Clang, for example, uses vector operations at that level. What else is there to say? So many compilers, so many approaches!
We optimized the function that accepts pointers, but it took some effort. Also, in a couple of cases we even had to tie ourselves to a particular compiler.
At the same time, we used a convenient abstraction, std::array, and got the same performance gain for free. Actually, this gain was even a teeny bit larger. In addition, std::array enabled us to avoid a potential mistake that would have turned into a real bug had we passed the function variables pointing to overlapping memory areas.
We also checked whether std::array
has any performance advantage over classic array references and saw that, in some cases, it does.
We ended up using an abstraction that didn't shoot us in the foot, and got guaranteed optimization for free. If that's not happiness, I don't know what is.
Of course, at the -O3 optimization level, the performance difference is too minor for any project to take it seriously. But with the -O2 flag, we get a noticeable difference.
Again, you can find many cases where using std::array is impossible or even harmful, just like with many other parts of the standard library. In those cases, the very comparison between std::array and a built-in array makes no sense.
Also, the author leaves a link to an interesting study by a good Samaritan on how well (or poorly, you decide) modern compilers autovectorize standard algorithms. Make sure to check it out at the link; it's free and virus-free, promise.
Perhaps, the most important conclusion is the following: different things can happen under the hood of C++. Sometimes, it's worth indulging in heretical thoughts to find out the answer to the question—are all our beliefs born out of practice? And then ask ourselves another question—did we draw the right conclusions?
After all, the conclusions in our article are based on a particular version of a particular compiler. Remember that we used the GCC 14.2 in all examples. Other compilers and even other versions of the same compiler may behave differently.
The article wasn't meant to show that GCC 14.2 is worse at vectorizing some things in some scenarios. The goal was to research, and research we did!
Anyway, we hope you enjoyed this little journey and had fun with us. If you learned something interesting, we would be more than happy. And of course, if you have something to share on the topic, see you in comments, let's chat!
Thank you for making it to the end! El Psy Kongroo.