Webinar: Evaluation - 05.12
As Yuri Gagarin once said, "In the future, we will fly, we will fly a lot". To some extent, Gaijin's Dagor Engine makes these flights possible. Let's get a glimpse of the engine's innards and talk with its developers!
There were times when I was younger, my comrade Kaluga and I were part of a squadron flying in the skies over Stalingrad. At the end of the fierce battle, the enemy forces were defeated, and Yours Truly was already anticipating a speedy return to the hangar. Suddenly, Kaluga swerved and, saying, "Look what I can do," rammed the aircraft at full speed, turning my dreams of a hangar into reality. Then came the legitimate question:
"My dear distinguished comrade, would you be so kind as to explain your tactical maneuver?"
"I wanted to steer closer...," he said.
This is just one of the many incidents, amusing and upsetting, but always memorable, that Yours Truly has experienced in good old War Thunder. It's been a long time, but the news of the game engine code being publicly released couldn't leave Yours Truly indifferent! After all, how can you resist the pleasure of peeking into the engine of a game like this?
Anton Yudintsev, the founder of Gaijin Entertainment, joined us on this code journey. He agreed to comment on the suspicious code fragments we found and answer questions about some aspects of their product testing process in the interview. You can find the interview in the second part of the article, which comes right after the code analysis.
For the author of this article, this text became a proof of the advantages that may be drawn from a thorough approach to testing, which includes the use of static analyzers. As Anton will tell you, PVS-Studio is already used in the project. However, that didn't prevent us from writing an article based on the issued warnings and drawing some conclusions. We suggest the reader pay special attention to Anton's comments further in the text, and then compare his conclusions with ours, which are summarized at the end of the first part.
While reading, you will see code snippets with the ellipsis "...." that were added by the author of this article.
You can find the source code on Gaijin Entertainment's GitHub. Besides, each fragment has a link to a specific place in the code.
By the time this text is published, many code fragments may already have been fixed, but the links in the examples lead to the code you see in this article. We enjoy not only digging into projects, but also making them a little better!
Let's start with a simple but no less remarkable thing.
void ExprField::serialize ( AstSerializer & ser ) {
....
Module * module; ser << module;
....
}
DagorEngine/prog/1stPartyLibs/daScript/src/builtin/module_builtin_ast_serialize.cpp : 1303
The module variable is created without being initialized, after which it is passed to the operator << of the ser serializer. No problem here, the variable is not read. But what's next?
AstSerializer & AstSerializer::operator << ( Module * & module ) {
bool is_null = module == nullptr;
....
}
DagorEngine/prog/1stPartyLibs/daScript/src/builtin/module_builtin_ast_serialize.cpp : 831
Then the uninitialized variable is nevertheless read to be compared to nullptr. Although it appears to be a small mistake, UB remains frightening. Developer, beware: the night is dark and full of myths about undefined behavior!
Anton Yudintsev:
"This code is new and is not yet in use. It came through our library, which we haven't checked with the analyzer yet."
The analyzer issued a concise message:
V614 Uninitialized pointer 'module' used. DagorEngine/prog/1stPartyLibs/daScript/src/builtin/module_builtin_ast_serialize.cpp 1303
The example below clearly shows the null pointer dereference.
void copy(const Node &n, int sz)
{
....
for (int i = 0; i < 4; ++i)
if (n.leaf_linear[i])
{
if (leaf_linear[i])
leaf_linear[i] = new Leaf(*n.leaf_linear[i]);
else
*leaf_linear[i] = *n.leaf_linear[i];
}
else
....
....
}
DagorEngine/prog/dagorInclude/generic/dag_hierGrid.h : 60
Judge for yourself: first, there is a check for NULL in leaf_linear[i]. But what's next? If NULL is there, it is dereferenced when assigned. What else could it be but UB?!
Anton Yudintsev:
"Indeed, it is an error. We have a few spots with legacy code that is no longer in use. We deliberately filter out such warnings."
The analyzer reports:
V522 Dereferencing of the null pointer 'leaf_linear[i]' might take place. DagorEngine/prog/dagorInclude/generic/dag_hierGrid.h 71
In this part, we have collected some suspicious fragments that are all related to bitwise shift operations.
For example, here a shift is performed by a number of bits larger than the size of the operand variable:
__forceinline bool is_char_in_set (int32_t ch,
const TDim<uint32_t,8> & bitset,
Context * ctx, LineInfoArg * at)
{
if ( ch<0 || ch>=256 ) ctx->throw_error_at(at,"invalid character %d", ch);
return bitset[ch>>5] & (1u<<uint32_t(ch));
}
DagorEngine/prog/1stPartyLibs/daScript/include/daScript/simulate/aot.h : 820
Note the range of values that the ch variable can take according to the condition in the if statement. This is UB for sure.
Anton Yudintsev:
"Clearly, char_in_set is an error (&31 is missing). This code also came from our library."
The analyzer is stands guard:
V610 Undefined behavior. Check the shift operator '<<'. The right operand ('uint32_t(ch)' = [0..255]) is greater than or equal to the length in bits of the promoted left operand. DagorEngine/prog/1stPartyLibs/daScript/include/daScript/simulate/aot.h 820
Or, for example, the case below, where the possible problem lies in a bitwise shift to the right.
inline int does_line_intersect_box_side_two_points(....)
{
int ret = -1;
for (int j = 0; j < 3; ++j)
{
BBox2 box2;
const int j1_mask = (j + 1 - 3) >> 4; //>3 == 0 <3 == -1
const int j2_mask = (j + 2 - 3) >> 4;
....
}
}
DagorEngine/prog/dagorInclude/math/dag_mathUtils.h : 610
It is worth noting that, unlike the previous example, this one does not lead to undefined, but only to unspecified behavior. Another reason to get to know your compiler better!
Anton Yudintsev:
"It feels like this code is 'correct'. I mean, it is obviously unspecified, but it behaves the same way."
The analyzer keeps us in the loop:
V610 Unspecified behavior. Check the shift operator '>>'. The left operand is negative ('(j + 1 - 3)' = [-2..0]). DagorEngine/prog/dagorInclude/math/dag_mathUtils.h 612
V610 Unspecified behavior. Check the shift operator '>>'. The left operand is negative ('(j + 2 - 3)' = [-1..1]). /DagorEngine/prog/dagorInclude/math/dag_mathUtils.h 613
Look at the following short code snippet:
int SH3LightingManager::loadLtDataBinary(IGenLoad &crd, unsigned id)
{
SH3LightingData *ltData = SH3LightingData::loadBinary(crd);
if (!ltData)
{
delete ltData;
return -1;
}
return addLtData(ltData, id);
}
DagorEngine/prog/engine/scene/sh3LtMgr.cpp : 429
An object is created on the heap, a pointer to which is returned to the client. If NULL is returned to the client, it is deleted.
Wait, what?
Let's see what happens in the called function.
SH3LightingData *SH3LightingData::loadBinary(IGenLoad &crd)
{
....
SH3LightingData *data =
new (memalloc(sz, midmem), _NEW_INPLACE) SH3LightingData;
....
return data;
}
DagorEngine/prog/engine/scene/sh3LtMgr.cpp : 476
Indeed, the operator new is called, and a pointer to the newly created object is returned at the end.
Roughly speaking, the language does not prohibit shoving NULL into the operator delete, but that's what makes our example interesting. That's one of the cases when static analysis helps in searching for logic errors. After all, do we really want to delete NULL?
The good news is that it doesn't cause problems for the development team.
Anton Yudintsev:
"This code is of no use. Pity it wasn't deleted."
The analyze issues another concise message:
V575 The null pointer is passed into 'operator delete'. Inspect the argument. DagorEngine/prog/engine/scene/sh3LtMgr.cpp 435
Now let's look at the shortest example of code today:
void builtin_map_file(const FILE* f,
const TBlock<void, TTemporary<TArray<uint8_t>>>& blk,
Context* context, LineInfoArg * at) {
....
munmap(data, 0);
}
DagorEngine/prog/1stPartyLibs/daScript/src/builtin/module_builtin_fio.cpp : 200
In the man pages, we see the following error description: "EINVAL The len argument is 0".
Anton Yudintsev:
"Sure, that's an error, but the code works. It got into the project from the first-party library that hasn't been analyzed yet."
The analyzer issues the following message:
V575 The 'munmap' function processes '0' elements. Inspect the second argument. DagorEngine/prog/1stPartyLibs/daScript/src/builtin/module_builtin_fio.cpp 214
P.S. Dear reader, would you like to have a little debate? Yours Truly came up with the Fast Square Root dilemma (named after a famous function from the Quake 3 engine). Here it is: if the code leads to UB or has another bug, but it runs on our platforms with our compilers for sure, is it worth fixing? What do you think? Drop a comment below!
In this example, the analyzer complains about a simple variable initialization.
void DoInsertValues(const_iterator position,
size_type n,
const value_type& value) {
const value_type temp = value;
}
DagorEngine/prog/1stPartyLibs/dag/dag_vector.h : 567
We wonder what type it is? Let's see.
template <typename T, ....>
class Vector
{
....
typedef T value_type;
....
}
DagorEngine/prog/1stPartyLibs/dag/dag_vector.h : 118
Ok, so the type is set when the template is instantiated. The analyzer shows that in our case, T actually is the MountVromfsRec type (see the warning at the end of this section).
Let's see how the MountVromfsRec type is defined.
struct MountVromfsRec
{
VirtualRomFsPack *vd;
SimpleString fn;
SimpleString mountPath;
MountVromfsRec() : vd(NULL) {}
~MountVromfsRec()
{
remove_vromfs(vd);
close_vromfs_pack(vd, inimem);
vd = NULL;
}
};
DagorEngine/prog/tools/dargbox/main.cpp : 90
In the entire codebase, it is used directly only in the following location:
static Tab<MountVromfsRec> mnt_vromfs(inimem);
DagorEngine/prog/tools/dargbox/main.cpp : 105
This place drives us back to the dag::Vector user-defined type.
template <typename T>
using Tab = dag::Vector<T, dag::MemPtrAllocator, false, uint32_t>;
DagorEngine/prog/dagorInclude/generic/dag_tab.h : 15
It turns out that a structure that contains a pointer is copied. At the same time, it doesn't have a user-defined copy constructor. So, the pointer value is copied to the new variable, but the object it points to remains the same.
In this case, it's hard to say if it's a mistake. But we would like to draw your attention to it once again: such an error (if present) can easily gobble up a lot of time and effort during the debugging process. Don't we strive to have fun with coding?
Anton Yudintsev:
"Indeed, there is no mistake here."
The analyzer issues a warning:
V1002 Instantiation of Vector < MountVromfsRec, MemPtrAllocator, bool, uint32_t >: The 'MountVromfsRec' class, containing pointers, constructor and destructor, is copied by the automatically generated copy constructor. DagorEngine/prog/1stPartyLibs/dag/dag_vector.h 567
Below is a rather lengthy code snippet. Here's a little game for you: before you read the code example to the end, try to find out what raised our suspicions.
GrassLod &lod = layer->lods[lodIdx];
unsigned int numInstancesInCell = (unsigned int)(
baseNumInstances * lod.density / (GRID_SIZE * GRID_SIZE) + 0.5f
);
const ShaderMesh::RElem &elem = lod.mesh->getAllElems()[0];
lod.numInstancesInCell = min(
numInstancesInCell,
(unsigned int)MAX_VERTICES_IN_CELL / elem.numv
);
if (numInstancesInCell == 0)
return;
lod.singleVb = (lod.numInstancesInCell >= INSTANCES_IN_SINGLE_VB);
....
DagorEngine/prog/gameLibs/render/editorGrass.cpp : 133
The lod.numInstancesInCell variable is assigned the minimum value of the two variables. One of them, the numInstancesInCell variable, can be 0, as indicated by the check following the assignment. It was the check that made us suspicious. If checking for 0 occurs after the assignment, wouldn't it be better to check the assignee?
Theoretically, elem.numv could hardly be larger than MAX_VERTICES_IN_CELL. Let's see.
struct RElem
{
....
int numv; // number of vertices
....
}
DagorEngine/prog/dagorInclude/shaders/dag_shaderMesh.h : 303
Indeed, dividing one by the other will result in 0, which is not taken into account when checking if (numInstancesInCell == 0).
Anton Yudintsev:
"Theoretically, yes. In reality, nothing like that will happen here, there are orders of magnitude difference."
The analyzer issues a warning:
V1051 Consider checking for misprints. It's possible that the 'lod.numInstancesInCell' should be checked here. DagorEngine/prog/gameLibs/render/editorGrass.cpp:135:1, DagorEngine/prog/gameLibs/render/editorGrass.cpp 133
Here's a tiny example with a possible shot in the foot.
int cache_idx = bd->cache->files.getNameId(fn);
....
fs->data[idx].init((void *)intptr_t(cache_idx + 1), fs->data[idx].size());
DagorEngine/prog/engine/ioSys/vromfsBacked.cpp : 44
V1028 Possible overflow. Consider casting operands of the 'cache_idx + 1' operator to the 'intptr_t' type, not the result. DagorEngine/prog/engine/ioSys/vromfsBacked.cpp 44
Or a following variation:
const void *df_mmap(file_ptr_t fp, int *flen, int length, int offset)
{
....
int pa_diff = (base + offset) - pa_offs;
....
void *ret = mmap(NULL,
(size_t)(length + pa_diff),
PROT_READ, MAP_SHARED, fd, (off_t)pa_offs);
....
}
DagorEngine/prog/engine/osApiWrappers/mmap.cpp : 77
V1028 Possible overflow. Consider casting operands of the 'length + pa_diff' operator to the 'size_t' type, not the result. DagorEngine/prog/engine/osApiWrappers/mmap.cpp 77
If the result of the operation is cast to a larger type, why were the operands not cast to it?
Anton Yudintsev:
"This code runs fine. All operands and the sum fit into int because we still need our project to run on 32 bits. Anyway, it's better to fix it."
Yours Truly might recommend you read the article by a former Google employee and author of some Java features. There, he describes how a simple mergesort for years contained a potential overflow bug that didn't fire for a long time but finally did. Reader, beware!
Do you remember when we suggested comparing our findings of error? Let's do it! If you have read Anton's comments carefully, you have surely noticed that the errors we have found in the code fall into two categories. They either refer to those parts of the project that are not used at the moment (which does not stop them from being excellent material for examining the analyzer's features), or they refer to those parts of the project where static analysis is not used yet. This means that we do our work for a reason, and the benefits of implementing such technologies are obvious even to the naked eye.
But still, how does Gaijin Entertainment ensure that their engine keeps the code quality that high? Let's jump into the interview with Anton Yudintsev and find out!
In a nutshell, how do you test your projects?
We have been using PVS-Studio for several years, along with other static analyzers. CI is powered by Gerrit + Jenkins. We do code review in Gerrit and run tests in Jenkins. At the same time, we use a trunk-based approach; commits are sent to the master branch only if code review and tests are passed.
So, pull requests are made through Gerrit?
Kinda. Gerrit is a "marvel of engineering" with its own repositories, but to a developer, it looks like a push to origin/master. Other people won't get the commit until someone clicks Merge in the web interface. But you can't click it until it has the Verified and Reviewed status. A person who does code review marks the code Reviewed. Jenkins marks the code Verified after the tests are completed.
Is the trunk-based approach a deliberate choice, or is it a legacy of the past?
A deliberate choice. We are not going to switch to branches for some individual features yet. Our pipeline leans towards trunk-based development. At the same time, some features are developed in branches.
What tests do you run?
Jenkins runs autotests. Autotests include compilation for all platforms and checking by static analyzers. Then the analysis results are compared with the database of warnings, and if there are more warnings than before, the tests fail. We have zero warning tolerance, so the warnings database is not updated.
Do you analyze every commit?
Yes.
What happens if two developers commit at the same time?
We have a build node farm. Commits are put into a queue and analyzed sequentially. It's enough to keep the job running, but if a lot of people are pushing at once, you'll have to wait a little longer. But if the farm's capacity is not enough, we run the nodes in the cloud.
Do you balance the load across nodes in some way?
It's easy, isn't it? Jenkins distributes tasks to all interested workers in the queue. It is preferred by those who did the task last time. In fact, Jenkins does the balancing. But actually, balancing is not really needed; all nodes are equivalent. Our goal is to do it as fast as possible when needed, not to equally load the nodes. Nodes are idle at night but chug along during the day. We look at the queue size with the script: if it's too big, we run the nodes in the cloud.
How exactly does the analysis work?
Each commit triggers a build of modified components on all supported platforms, the project linking process on all platforms as well, and a static analysis. If any of these things don't go through, then the commit correspondingly doesn't go through either. Each check always resets the current master to workspace build nodes, and then cherry-picks the checked commit.
Your project seems to support many platforms. I can only imagine how frustrating it is if a project crashes on one platform.
Thankfully, a bot reports it to us. And you cannot take your hatred out on a soulless machine.
Do you check projects entirely?
Sure. Every N hours, there are slow checks that can't be done on every commit just physically. I wish it was all on a per-commit basis, but no such farms exist.
A few words about prod. Do you roll out to production automatically or by button?
The button, for sure. When it comes to production, we only do it manually and only from a branch. In the master, the binaries are built every morning. But we roll out to prod only from the branch, not the master. We fast-forward from the master to the branch, and only then cherry-pick from the master.
Do you mean stable?
Well, it is a name branch, but it is not stable. Every prod version has a branch. Besides, we have a monorepo, and there are multiple games. Every game has its own prod branches.
If it's no secret, was the monorepo deliberately chosen?
It was a deliberate decision with its own upsides. We'll probably move to submodules or analogue soon—too many games are made on the same engine. It's difficult for the engine team to minimize the impact on all projects; it slows down the development process.
By the way, is opening up the Dagor Engine code part of the plan to move to submodules?
These are independent but interrelated things.
Do you have plans to make a fully open-source project with maintainers and pull requests? I have seen the updates and the issues created.
And we are even solving these issues! Well, we do have plans, as far as we can.
Before we end, tell me: can I develop my own Thar Wunder on the engine if I want it badly enough?
You have to want it even more badly. We don't open the game's source code. But there are projects on our engine, such as Nau. Also, we used to give FOSS permissions to various teams before.
What a wonderful journey we had! To relive the memories of an amazing game and enjoy it in a little different way. Finding a few bugs in a project with such an emphasis on testing was unexpectedly satisfying, and even more satisfying was helping fix them.
We hope you enjoyed our journey as well!
And, of course, we would like to thank Anton Yudintsev and other members of the Gaijin Entertainment team for contributing to our article. We appreciate your help in making it much more curious!
Thank you for reading it to the end! El Psy Kongroo.
0