My name is Andrey Karpov. I develop software for developers, and I'm fond of writing articles on code quality issues. This is how I came to meet Walter Bright, the wonderful man who created the D language. In this interview, I will try to learn from him how the D language helps programmers get rid of the errors we all make when writing code.
Walter Bright is a programmer known to the world as the main developer of the first "native" C++ compiler, Zortech C++ (which later became Symantec C++, and then Digital Mars C++), and as the creator of the D language.
What is the D language?
D is an object-oriented programming language designed as an improved version of the C++ language. Despite the similarity, however, D is not a dialect of C++: some capabilities have been implemented anew, and particular attention is paid to safety. The main source of information about D is dlang.org. You can also find information in the Wikipedia article "D (programming language)" and on the Digital Mars company's website.
Slogans like "write programs without mistakes and you won't need additional tools" are senseless. We have always made mistakes when programming, we make them now, and we will keep making them. One certainly needs to improve one's coding style, carry out code reviews, and use various methodologies to reduce the number of errors. But errors will always be there. It's just wonderful when the compiler can catch at least a few of them. That's why I think it's a good idea to introduce special mechanisms into programming languages that help us avoid mistakes. Much attention is paid to this aspect in the D language, and that's great.
One should understand that language design and the compiler can prevent only some errors. If the programmer has implemented an algorithm using an incorrect formula, nothing can be done about it. To catch such errors, we would need an AI. However, many typical errors are quite simple and stem from human inattention, tiredness, and misprints. It is in this area that programming language syntax and compiler warnings can be of great help in detecting defects.
What errors result from a programmer's inattention? It's a difficult question, but I have something that lets me give a credible answer. While testing the PVS-Studio analyzer, we analyze various open-source projects and enter the errors we find into a database. The more errors of a certain type are found, the more samples are associated with the corresponding diagnostic. The analyzer doesn't search for all possible error types, of course, but it already has enough diagnostic rules for us to discover certain regularities.
You cannot just take all the diagnostics and see which of them have found the largest number of errors: new diagnostic rules appear gradually, so a diagnostic implemented long ago has taken part in the analysis of more projects than a recently created one. I've combined similar diagnostics, applied a correction factor and performed some computations. So as not to bore readers with the details, let me just give you the list of 10 diagnostics that detect the largest number of errors: V547+V560, V595, V501, V512+V579, V557, V567+V610, V597, V519, V576, V530. You can follow these links to see samples of errors detected by these rules. We may say this is a "top 10" of typical mistakes made by C/C++ programmers.
Sorry for the digression. I wanted to show you that the typical mistakes we're discussing are not something I made up, but a real problem that programmers face in software development. I will discuss these error types and try to get answers from Walter to the following questions:
There exist many reasons for a meaningless condition to appear. It may be a misprint, careless refactoring, or using an incorrect type. The C/C++ compiler is rather tolerant of conditions which are always either true or false. Such expressions may be useful sometimes. For example:
#define DEBUG_ON 0
if (DEBUG_ON && Foo()) Dump(X);
But they may also do much harm (samples, samples). Here is one typical example:
std::string::size_type pos = dtaFileName.rfind("/");
if (pos < 0) {
    pos = dtaFileName.rfind("\\");
}
The 'pos' variable has the unsigned type. That's why the (pos < 0) condition is always false.
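A minimal sketch of how such a check could be written correctly, assuming the intent was to fall back to a backslash when no forward slash is found. The helper name `lastSeparator` is hypothetical; `std::string::npos` is the value `rfind` returns when nothing is found.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: position of the last path separator, or npos if none.
std::string::size_type lastSeparator(const std::string& fileName) {
    std::string::size_type pos = fileName.rfind('/');
    if (pos == std::string::npos) {   // compare with npos, not with 0
        pos = fileName.rfind('\\');
    }
    // Postcondition: either "not found" or a valid index into the string.
    assert(pos == std::string::npos || pos < fileName.size());
    return pos;
}
```

Comparing against `npos` keeps the condition meaningful regardless of the fact that `size_type` is unsigned.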
Walter's comment:
Many of my comments are from the perspective of baking these issues into the language. When doing that, one has to either eliminate 100% of false positives, or provide a straightforward workaround that is non-ugly and always works. An optional checker tool can get away with a few false positives now and then.
The unsigned<0 is usually a bug when it appears in top level code. However, it can legitimately appear as a boundary case inside of generic code. It can also be used in code that tests if a type is unsigned, as in -(T)1<0. So I'm a bit uncomfortable in declaring it as always wrong.
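To illustrate the legitimate use Walter mentions, here is a rough C++ sketch of a signedness test built on an "always false" comparison. The function name `looksUnsigned` is hypothetical, and the expression is written as `T(-1) < T(0)` rather than `-(T)1<0`, which is an assumption made here so that integer promotion of small types does not change the answer.

```cpp
#include <type_traits>

// Hypothetical trait: true when T cannot represent negative values.
// For an unsigned T the comparison T(-1) < T(0) is always false,
// and that is exactly the information we want.
template <typename T>
constexpr bool looksUnsigned() {
    return !(T(-1) < T(0));
}

static_assert(looksUnsigned<unsigned int>(), "unsigned int has no negative values");
static_assert(!looksUnsigned<int>(), "int has negative values");
static_assert(looksUnsigned<unsigned long>() == std::is_unsigned<unsigned long>::value,
              "agrees with the standard trait");
```

A rule that flagged every such comparison would report this template each time it is instantiated with an unsigned type, which is the kind of false positive a language-level check cannot afford.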
I don't need to explain why programmers check pointers for null. But not many developers know how fragile this code is. Because of careless refactoring, the pointer is quite often used before the check (samples). For instance, this error may look like the following code:
buf = buf->next;
pos = buf->pos;
if(!buf) return -1;
This code can work for a long time until a nonstandard situation occurs and the pointer turns out to be null.
Walter's comment:
I can't think of a case where this might be done legitimately. But it does take some decent data flow analysis to be sure there are no intervening modifications of buf.
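For comparison, a small sketch of the same logic with the check placed before the first dereference. The `Buffer` structure and the `nextPos` function are hypothetical stand-ins for the types in the original project.

```cpp
#include <cassert>

// Hypothetical node type standing in for the buffer from the sample.
struct Buffer {
    Buffer* next;
    int pos;
};

// Returns the position stored in the next buffer, or -1 if there is none.
int nextPos(const Buffer* buf) {
    assert(buf != nullptr);   // precondition: the caller passes a valid node
    buf = buf->next;
    if (!buf) return -1;      // check first...
    return buf->pos;          // ...dereference afterwards
}
```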
The samples of errors found by the V501 diagnostic demonstrate very well why Copy-Paste is harmful. However, programming entirely without Copy-Paste is tiresome. That's why this kind of error is so persistent. Consider the following sample:
if( m_GamePad[iUserIndex].wButtons ||
    m_GamePad[iUserIndex].sThumbLX ||
    m_GamePad[iUserIndex].sThumbLX ||
    m_GamePad[iUserIndex].sThumbRX ||
    m_GamePad[iUserIndex].sThumbRY ||
    m_GamePad[iUserIndex].bLeftTrigger ||
    m_GamePad[iUserIndex].bRightTrigger )
In code like this, you cannot help feeling an urge to copy and paste a line and edit it a bit. The result of giving in to this urge is strange program behavior that shows up only under a specific set of circumstances. If the reader hasn't found the mistake, here's a hint: the 'sThumbLX' class member is checked twice, while there is no check for 'sThumbLY'.
Walter's comment:
I looked into doing this for D. The trouble, though, is if the duplicated condition has side effects, and if any conditions between the dups have side effects that may affect the result of the duplicated condition. To make this work reliably and not give false positives, some decent data flow analysis is required.
There's also the issue of generic code and function inlining, which may cause duplicates to appear but are not bugs in user code. So the test for dups must be done before generic code expansion and inlining, but generic code expansion and inlining have to be done before the data flow analysis can be done correctly. So there's a non-trivial chicken-and-egg problem with doing this correctly.
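The side-effect problem Walter describes can be sketched in C++ as follows. The `Reader` type and both function names are hypothetical; the point is only that two textually identical subexpressions may be perfectly intentional when each call changes state.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical token reader: every call consumes one element.
struct Reader {
    std::vector<int> data;
    std::size_t index;

    explicit Reader(const std::vector<int>& d) : data(d), index(0) {}

    bool nextIsZero() {
        assert(index < data.size());   // the caller must not read past the end
        return data[index++] == 0;     // side effect: advances the reader
    }
};

// The two operands look identical, yet they test different elements.
// Because of short-circuiting, the second call runs only if the first is false.
bool firstTwoContainZero(Reader& r) {
    return r.nextIsZero() || r.nextIsZero();
}
```

A purely textual "duplicate operand" check would flag `firstTwoContainZero`, although the code does what it says.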
Processing only a part of the buffer is a typical mistake when using functions such as memset, memcpy, strncmp. The error occurs when the pointer size is calculated instead of the buffer size. Such errors seem easy to spot at once, yet they live inside programs for many years (samples, samples). For example, the code below, intended to check the table integrity, almost works.
const char * keyword;
....
if (strncmp(piece, keyword, sizeof(keyword)) != 0) {
    HUNSPELL_WARNING(stderr,
        "error: line %d: table is corrupt\n", af->getlinenum());
Only a part of the keyword participates in the comparison: to be exact, 4 or 8 bytes, depending on the pointer size on the given platform.
Walter's comment:
These miserable problems are pretty much exorcised if one sticks to the D array syntax:
if (piece.startsWith(keyword)) ...
You should pretty much never see memset, strncmp, etc., in D code. D arrays know their lengths, and so typical C bugs where the length is incorrect are a thing of the past. In my not-so-humble opinion, C's biggest mistake was stripping the length from an array when passing it to a function, and its second biggest was using 0-terminated strings. D corrected both of these.
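C++ programmers can get a similar effect by keeping the length together with the data. Below is a minimal sketch that uses `std::string`; the function name `startsWith` is hypothetical and merely mirrors the D call above.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: true if `piece` begins with `keyword`.
// std::string knows its own length, so no sizeof is involved.
bool startsWith(const std::string& piece, const std::string& keyword) {
    // compare(pos, len, str) clamps len to the available characters.
    return piece.compare(0, keyword.size(), keyword) == 0;
}
```

With this version, "COMPOUNDMIN 1" is not mistaken for the keyword "COMPOUNDRULE", even though both share their first eight characters.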
This is a classic of programming. There are many ways to make these mistakes:
If you look through this article, you will easily find a sample for each of the previous items. Let me cite just one very simple example:
#define FINDBUFFLEN 64 // Max buffer find/replace size
static char findWhat[FINDBUFFLEN] = {'\0'};
....
findWhat[FINDBUFFLEN] = '\0';
Walter's comment:
This is more of the same of (4). D has real arrays, which know their length. Array bounds checking is done by default (but can be disabled as desired with a command line switch). Array overflows belong in the dustbin of history. Of course, D still allows you to use raw pointers and do arithmetic on them, but such use cases should be rare and not the norm, and best practice says to convert raw pointers to arrays as soon as possible.
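In C++ a comparable check is available on request rather than by default. The sketch below assumes the buffer is a `std::array` and uses its `at()` member, which throws `std::out_of_range` for an invalid index; the names `kFindBufLen` and `terminateAt` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <stdexcept>

const std::size_t kFindBufLen = 64;  // hypothetical counterpart of FINDBUFFLEN

// Writes a terminator at `index`; reports failure instead of overflowing.
bool terminateAt(std::array<char, kFindBufLen>& buf, std::size_t index) {
    try {
        buf.at(index) = '\0';   // at() checks index < size()
        return true;
    } catch (const std::out_of_range&) {
        return false;           // index == kFindBufLen ends up here
    }
}
```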
This type of error is the most debatable. Although the standard states very clearly that a certain construct causes undefined behavior, developers often disagree with that. Their arguments are as follows:
I don't feel like starting that old discussion once again and arguing about whether or not one should write such code. Every programmer has to draw his/her own conclusions. Personally, I consider it an absolutely unjustified risk that may later unexpectedly result in many hours spent on debugging.
You can see various samples here and here. Here are a couple of unsafe constructs:
m_Quant.IQuant = m_Quant.IQuant--;
intptr_t Val = -1;
Val <<= PointerLikeTypeTraits<PointerTy>::NumLowBitsAvailable;
In the first case, the variable is modified twice without an intervening sequence point. In the second, a negative value is shifted.
Walter's comment:
D's solution to this is to try and eliminate undefined behavior as much as possible. For example, the order of evaluation of expressions is defined (left to right).
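In C++ the programmer has to avoid these constructs by hand. Here is a rough sketch of well-defined alternatives, assuming the first sample was meant as a plain decrement and the second as a mask with the low bits cleared; both function names are hypothetical.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Instead of `q = q--;` state the intended effect explicitly.
int decrement(int q) {
    return q - 1;
}

// Instead of shifting the signed value -1, shift an all-ones unsigned value.
std::uintptr_t highBitsMask(unsigned lowBits) {
    assert(lowBits < sizeof(std::uintptr_t) * CHAR_BIT);  // shift count must be in range
    return ~std::uintptr_t(0) << lowBits;
}
```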
Programmers often appear not to know that the compiler sometimes performs very specific optimizations. For instance, it can remove a call to the memset() function that fills a buffer if this buffer is not used anywhere afterwards. As a result, if you were storing a password or any other important data in this buffer, the data will remain in memory. This is a potential vulnerability. I wrote on this subject in detail in the article "Security, security! But do you test it?". These are the corresponding samples.
For example:
void CSHA1::Final()
{
    UINT_8 finalcount[8];
    ...
    memset(finalcount, 0, 8);
    Transform(m_state, m_buffer);
}
The "finalcount" buffer is not used anymore after the call to memset(). This means the compiler is allowed to delete the call.
Walter's comment:
Being able to remove dead assignments is a very important compiler optimization. They come up when instantiating generic code and inlining functions, and as the consequence of performing other optimizations. In C you should be able to force the dead assignment to happen by declaring the target to be volatile. The only way to force a dead assignment in D to happen is to write it using inline assembler. Inline assembler is the ultimate "what you write is what you get".
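The volatile approach Walter mentions for C can be sketched like this. The helper name `secureZero` is hypothetical, and this is only an illustration of the idiom: platform-provided functions such as SecureZeroMemory on Windows or explicit_bzero on some Unix-like systems exist for the same purpose and may be preferable where available.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: zero a buffer through a volatile pointer so that
// every store is an observable access the optimizer should not drop.
void secureZero(void* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    volatile unsigned char* p = static_cast<volatile unsigned char*>(data);
    for (std::size_t i = 0; i < size; ++i) {
        p[i] = 0;
    }
}
```

In the sample above, calling such a helper in place of `memset(finalcount, 0, 8)` would express the intent that the wipe must really happen.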
It's not a mistake, of course, to write two values into one and the same variable. It can be useful. For instance, when you don't need a function result but want to see it while debugging the code, you store the result in a temporary variable. This is what such code may look like:
status = Foo(x);
status = Foo(y);
status = Foo(z);
But personally I believe that the compiler should warn about code like this. Quite often we come across errors of the following kind:
t=x1; x1=x2; x2=t;
t=y1; x1=y2; y2=t;
The exchange of values between the variables is written incorrectly. In this sample, values are assigned twice in a row to the x1 variable. The second line should have looked like this: "t=y1; y1=y2; y2=t;". And don't tell me this code was written by a student and that you would never make such a mistake. This code, by the way, is taken from the Qt library. And here are other samples of such errors in serious applications.
Walter's comment:
Double assignment is a form of dead assignment, and my comment on it is the same as for (7).
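One way to sidestep the hand-written exchange entirely in C++ is `std::swap`, which leaves no temporary variable to mistype. The `Segment` structure and the `swapEndpoints` function below are hypothetical and only stand in for the variables of the sample.

```cpp
#include <cassert>
#include <utility>

// Hypothetical pair of points standing in for x1, y1, x2, y2 from the sample.
struct Segment {
    int x1, y1, x2, y2;
};

// Exchanges the two endpoints; each coordinate pair is named exactly once.
void swapEndpoints(Segment& s) {
    std::swap(s.x1, s.x2);
    std::swap(s.y1, s.y2);
}
```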
The discussion of how dangerous the functions printf, scanf, etc. are is so old and trivial that I don't have to dwell on it. I mention it only because these errors are very frequent in C/C++ programs (samples). I wonder how this issue is handled in the D language.
Walter's comment:
In D you can call C's printf and use and misuse it just like in C. But, when calling D functions, D has a way to inquire at runtime what the types of the arguments are. Hence, std.stdio.writefln() is typesafe.
Another thing D has is that templates can also have variadic argument lists, but the use of those lists is still typesafe, and checked at compile time.
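C++11 variadic templates offer a similar compile-time guarantee. The following is only a sketch of the idea, not a replacement for a formatting library; the names `appendAll` and `concat` are hypothetical.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Base case: nothing left to append.
inline void appendAll(std::ostringstream&) {}

// Recursive case: the type of every argument is known at compile time,
// so there is no format string that could disagree with the arguments.
template <typename First, typename... Rest>
void appendAll(std::ostringstream& out, const First& first, const Rest&... rest) {
    out << first;
    appendAll(out, rest...);
}

// Hypothetical helper: concatenates the textual form of all arguments.
template <typename... Args>
std::string concat(const Args&... args) {
    std::ostringstream out;
    appendAll(out, args...);
    return out.str();
}
```

Passing an argument that has no `operator<<` for streams is rejected by the compiler instead of misbehaving at runtime.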
Pretty often we don't need to check function results. But there are functions whose results are clearly meant to be used; fopen() is one example. Unfortunately, some function names in the C++ language are rather poorly chosen and provoke programmers into making mistakes like this one:
void VariantValue::Clear()
{
    m_vtype = VT_NULL;
    m_bvalue = false;
    m_ivalue = 0;
    m_fvalue = 0;
    m_svalue.empty();
    m_tvalue = 0;
}
The empty() function is called instead of the clear() function. This is a very frequent error (samples).
The trouble is that there is no notion in the C/C++ language that a function result must be used. Of course, there exist various language extensions concerning this issue. But they are of little use. Does anybody use them?
Walter's comment:
D does not have anything specific saying that a function's return value must be used. In my experience, return values that get ignored are often error codes, and D encourages the use of exceptions to indicate errors rather than error codes.
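For readers using newer compilers: C++17 later added the standard `[[nodiscard]]` attribute, which encourages the compiler to warn when a marked function's result is ignored. The `Text` class below is a hypothetical wrapper used only to show where the attribute goes.

```cpp
#include <cassert>
#include <string>

// Hypothetical wrapper illustrating the attribute.
struct Text {
    std::string value;

    // Ignoring this result is almost certainly a mistake, so ask for a warning.
    [[nodiscard]] bool empty() const { return value.empty(); }

    void clear() { value.clear(); }
};
```

With this declaration, a stray `text.empty();` statement like the one in the sample would typically draw a compiler warning, while `text.clear();` compiles silently.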
There are other widespread patterns of typical errors: for example, errors detected by the V517 diagnostic. But we have to stop somewhere, unfortunately.
The result of our overview of these error patterns is quite natural. A language and compiler cannot be expected to catch every type of error. How code works is often far from obvious, and so far only a human is capable of figuring it out. However, we can see that much work has been done to make programming safer. The D language is a good example of this. The article shows how D, though similar to C/C++, allows programmers to avoid many of its predecessors' problems. It's wonderful! I wish this language the best of luck and suggest that all developers take a close look at it.