Andrey Karpov

Sep 21 2009

Tags:

#Cpp #Knowledge #64bit

About size_t and ptrdiff_t

Introduction
The size_t type
The ptrdiff_t type
Portability of size_t and ptrdiff_t
Safety of ptrdiff_t and size_t types in address arithmetic
Performance of code that uses ptrdiff_t and size_t types
Code refactoring to switch to ptrdiff_t and size_t
References

The article explains what size_t and ptrdiff_t types are, their purpose, and when to use them. The following information is especially valuable for developers starting to create 64-bit applications, where size_t and ptrdiff_t types provide high performance, the ability to work with large amounts of data, and portability between different platforms.

Introduction

Note that definitions and recommendations in the article relate to the most common current architectures (IA-32, Intel 64, IA-64). The information may be inaccurate in relation to exotic architectures.

size_t and ptrdiff_t types were created to perform correct address arithmetic. For a long time, developers assumed that the size of int matched the machine word (the processor's bit width), so int could serve as an index and store object sizes or pointers. As a result, address arithmetic was built on the int and unsigned types. Most C and C++ programming tutorials use int for loop counters and indexes. Here's an almost canonical example:

for (int i = 0; i < n; i++)
  a[i] = 0;

As processors evolved and their bit width grew, it became impractical to keep widening the int type. There were many reasons: keeping int narrow saves memory, preserves maximum compatibility, and so on. As a result, several data models describing the relative sizes of the base C and C++ types appeared. Table N1 shows the main data models and lists the most popular systems that use them.

Table N1. Data models

As you can see, it's not that easy to choose the type of a variable that stores a pointer or the size of an object. The size_t and ptrdiff_t types were introduced to solve this problem. They can safely be used in address arithmetic. Now, we can consider the following code canonical:

for (size_t i = 0; i < n; i++)
  a[i] = 0;

It can provide reliability, portability, and performance. We'll learn why further on.

The size_t type

size_t is a special unsigned integer type defined in standard libraries of C and C++. It is the type of the result returned by sizeof and alignof operators.

The maximum allowed value of the size_t type is the SIZE_MAX constant.

size_t can store the maximum size of a theoretically possible array or object. In other words, the number of bits in size_t is equal to the number of bits required to store the maximum address in the machine's memory. For example, on a 32-bit system, size_t occupies 32 bits, and on a 64-bit system, 64 bits. This means that a pointer can safely be placed in the size_t type (except on platforms with segment addressing, and except for pointers to class member functions).
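The properties described above can be checked with a short sketch. The helper name max_elements_of is made up for illustration; the static_assert lines rely only on what the C++ standard states about sizeof and SIZE_MAX.

```cpp
#include <cassert>
#include <cstddef>      // std::size_t
#include <cstdint>      // SIZE_MAX
#include <limits>
#include <type_traits>

// sizeof yields std::size_t, and size_t is unsigned: checked at compile time.
static_assert(std::is_same<decltype(sizeof(int)), std::size_t>::value,
              "sizeof must yield std::size_t");
static_assert(std::is_unsigned<std::size_t>::value,
              "size_t must be unsigned");

// SIZE_MAX is the largest value size_t can hold.
static_assert(SIZE_MAX == std::numeric_limits<std::size_t>::max(),
              "SIZE_MAX must match numeric_limits");

// Hypothetical helper: upper bound on the number of elements of the given
// size that an array could theoretically contain.
std::size_t max_elements_of(std::size_t element_size)
{
  assert(element_size != 0);
  return SIZE_MAX / element_size;
}
```

For example, max_elements_of(sizeof(double)) gives the theoretical limit for an array of double; real limits are lower because of the available address space.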

With this type, developers don't have to worry about the possible different behavior of integer variables when changing the platform. The implementation of the standard library takes care of this. So, the size_t type is safer and more efficient than ordinary unsigned integer types:

This type lets you write loops and counters without worrying about possible overflow when changing platforms, for example, when the number of required iterations exceeds UINT_MAX;
Since the standard guarantees that size_t can contain the size of the largest possible object in the system, this type is used to store the sizes of objects;
The same feature makes it possible to use size_t for array indexing. Unlike the usual basic integer types, size_t guarantees that the index value cannot be greater than SIZE_MAX;
Since a pointer can usually be safely placed in size_t, it is used for address arithmetic. However, for these purposes, it's better to use another unsigned integer type, uintptr_t, whose name says it all;
The compiler can build simpler and, therefore, faster code without unnecessary conversions of 32-bit and 64-bit data.
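A minimal sketch of two points from the list above: a counter loop driven by size_t, and a pointer stored in uintptr_t and restored from it. The function names count_zeros and round_trip are invented for this illustration.

```cpp
#include <cassert>
#include <cstddef>   // std::size_t
#include <cstdint>   // std::uintptr_t

// Hypothetical helper: counts zero elements. The index and the counter are
// size_t, so they cover any array the platform can hold.
std::size_t count_zeros(const int* data, std::size_t n)
{
  assert(data != nullptr || n == 0);
  std::size_t zeros = 0;
  for (std::size_t i = 0; i < n; ++i)
    if (data[i] == 0)
      ++zeros;
  return zeros;
}

// Hypothetical helper: stores a pointer in an integer and restores it.
// uintptr_t is the type intended for this; the round trip yields the
// original pointer value.
int* round_trip(int* p)
{
  const std::uintptr_t bits = reinterpret_cast<std::uintptr_t>(p);
  return reinterpret_cast<int*>(bits);
}
```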

In C, the size_t type is declared in the following header files: <stddef.h>, <stdlib.h>, <string.h>, <wchar.h>, <uchar.h>, <time.h>, and <stdio.h>. In C++, size_t is declared in the following files: <cstddef>, <cstdlib>, <cstring>, <cwchar>, <cuchar>, <ctime>, and <cstdio>. The size_t type is placed in the global namespace and in std. Standard header files of the C language used for backward compatibility can also be included in C++ programs.

Note. There's also the rsize_t type, defined in the optional Annex K of the C11 standard. It is very similar to the size_t type. However, it is designed to store the size of a single object. In other words, by using rsize_t, developers emphasize that they are working with the size of a single object. The RSIZE_MAX constant sets the maximum size of a single object.

The ptrdiff_t type

ptrdiff_t is a special signed integer type defined in the standard libraries of the C and C++ languages. It is the type of the result of subtracting two pointers. Its size behaves like that of size_t: on a 32-bit system, ptrdiff_t occupies 32 bits, and on a 64-bit system, 64 bits.

Also, when working with standard library containers, the result of subtracting two iterators has the difference_type type of the container used, which, depending on the standard library, is often equal to ptrdiff_t.

The ptrdiff_t type is often used for address arithmetic and array indexing when negative values are possible. Programs that use regular integer types (int) for this purpose can run into undefined behavior, for example, when the index value exceeds INT_MAX.

For arrays smaller than PTRDIFF_MAX, ptrdiff_t behaves like an analog of size_t: it can store the size of an array of any type and is very similar to intptr_t on most platforms. However, if an array is large enough (larger than PTRDIFF_MAX but smaller than SIZE_MAX), and the difference of its pointers cannot be represented as ptrdiff_t, then the result of subtracting such pointers is undefined.
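The following sketch illustrates the uses of ptrdiff_t described in this section: a signed pointer difference, a negative index, and an iterator difference. All function names are hypothetical and exist only for the example.

```cpp
#include <cassert>
#include <cstddef>   // std::ptrdiff_t
#include <vector>

// Signed distance from 'from' to 'to'. Both pointers must refer to
// elements of the same array, otherwise the subtraction is undefined.
std::ptrdiff_t signed_distance(const int* from, const int* to)
{
  return to - from;
}

// Element at a possibly negative offset from 'center'. The caller must
// keep center + offset inside the array.
int at_offset(const int* center, std::ptrdiff_t offset)
{
  return center[offset];
}

// Subtracting iterators yields the container's difference_type, which
// is commonly ptrdiff_t.
std::vector<int>::difference_type index_of_last(const std::vector<int>& v)
{
  assert(!v.empty());
  return (v.end() - 1) - v.begin();
}
```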

In C, the ptrdiff_t type is declared in the header file <stddef.h>. In C++, its declaration is located in <cstddef> and is placed in the global namespace and in std. Standard header files of the C language for backward compatibility can also be included in C++ programs.

Portability of size_t and ptrdiff_t

The size_t and ptrdiff_t types let you write portable code. On the architectures discussed here, the size of size_t and ptrdiff_t matches the size of a pointer. For this reason, these types should be used as indexes of large arrays, for storing pointers, and for pointer arithmetic.

Linux application developers often use the long type for this purpose. This does work within the 32-bit and 64-bit data models adopted in Linux, where the size of the long type is the same as the size of a pointer. However, such code is incompatible with the Windows data model and therefore cannot be considered truly portable: in the LLP64 model (Windows x64), the long type remained 32-bit. A more correct solution is to use the size_t and ptrdiff_t types.

Developers working on Windows can use DWORD_PTR, SIZE_T, SSIZE_T, and so on as an alternative to size_t and ptrdiff_t. However, it's better to restrict yourself to size_t, ptrdiff_t, uintptr_t, intptr_t for greater compatibility.
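One way to make the portability assumptions from this section explicit is to check them in code. This is only a sketch: can_hold_pointer and to_long_checked are invented names, and the static_assert lines assume a platform that provides uintptr_t and intptr_t, as the common architectures do.

```cpp
#include <cassert>
#include <cstddef>   // std::ptrdiff_t
#include <cstdint>   // std::uintptr_t, std::intptr_t
#include <limits>

// Hypothetical helper: true when T is at least as wide as a data pointer.
template <typename T>
constexpr bool can_hold_pointer()
{
  return sizeof(T) >= sizeof(void*);
}

static_assert(can_hold_pointer<std::uintptr_t>(), "uintptr_t holds a pointer");
static_assert(can_hold_pointer<std::intptr_t>(), "intptr_t holds a pointer");
// can_hold_pointer<long>() is true in the LP64 model and false in LLP64,
// which is why code relying on long is not portable to Windows x64.

// Hypothetical helper: narrows a ptrdiff_t to long and asserts (in debug
// builds) that the value fits, instead of truncating silently.
long to_long_checked(std::ptrdiff_t value)
{
  assert(value >= std::numeric_limits<long>::min() &&
         value <= std::numeric_limits<long>::max());
  return static_cast<long>(value);
}
```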

Safety of ptrdiff_t and size_t types in address arithmetic

The problems of address arithmetic began to manifest themselves actively with the emergence of 64-bit systems. The greatest number of issues when porting 32-bit applications to 64-bit systems is associated with types unsuitable for working with pointers and arrays, such as int and long. This is not the only problem of porting applications to 64-bit systems, but most errors are related to address arithmetic and indexes. The problems of code migration are described in more detail in Lessons on the development of 64-bit C/C++ applications [1].

Let's take a look at a simple example:

size_t n = ...;
for (int i = 0; i < n; i++)
  a[i] = 0;

If the array contains more than INT_MAX elements, this code is incorrect: the signed variable i overflows, which is undefined behavior. In the debug version of the program, the overflow will most likely end in an Access Violation. The release version, however, depending on the optimization settings and the surrounding code, may unexpectedly fill all the elements of the array correctly, creating the illusion of correct operation! As a result, intermittent errors show up in the program, appearing or disappearing after the slightest code change. You can find more information about such phantom errors and their dangers in the following article: A 64-bit horse that can count [2].
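A corrected variant of the loop might look like the sketch below. The wrapper function zero_fill and its parameters are assumptions made for the example; the point is only that the index has the same type as the element count.

```cpp
#include <cassert>
#include <cstddef>   // std::size_t

// Hypothetical helper: zeroes the first n elements of a.
// The index i is size_t, like n, so it cannot overflow before reaching n.
void zero_fill(int* a, std::size_t n)
{
  assert(a != nullptr || n == 0);
  for (std::size_t i = 0; i < n; i++)
    a[i] = 0;
}
```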

An example of another hidden error that will manifest itself with a certain combination of input data (the value of variables A and B):

int A = -2;
unsigned B = 1;
int array[5] = { 1, 2, 3, 4, 5 };
int *ptr = array + 3;
ptr = ptr + (A + B); // Error
printf("%i\n", *ptr);

This code executes successfully in the 32-bit version and prints the number 3. After compiling in 64-bit mode, code execution fails. Let's consider the sequence of code execution and the cause of the error:

The A variable of the int type is converted to the unsigned type;
There's an addition of A and B. As a result, we get the 0xFFFFFFFF value of the unsigned type;
The expression "ptr + 0xFFFFFFFF" is calculated. The result depends on the size of the pointer on the given platform. In a 32-bit program, the address wraps around, so the expression is equivalent to "ptr - 1", and we successfully print the number 3. In a 64-bit program, the 0xFFFFFFFF value is added to the pointer as is. As a result, the pointer ends up far outside the array.

size_t and ptrdiff_t types help avoid these errors. In the first case, if the type of the i variable is size_t, no overflow will occur. In the second case, if we use size_t or ptrdiff_t types for variables A and B, the program will correctly print the number 3.
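The second example can be sketched with ptrdiff_t as follows. The functions mixed_sum, signed_sum, and element_after_shift are hypothetical wrappers added so that both the faulty sum and the corrected one can be inspected.

```cpp
#include <cassert>
#include <cstddef>   // std::ptrdiff_t

// The sum from the faulty fragment: a is converted to unsigned before the
// addition, so -2 + 1 wraps around to the largest unsigned value.
unsigned mixed_sum(int a, unsigned b)
{
  return a + b;
}

// The same sum computed entirely in ptrdiff_t stays signed.
std::ptrdiff_t signed_sum(std::ptrdiff_t a, std::ptrdiff_t b)
{
  return a + b;
}

// Corrected pointer arithmetic: the offset is -1 on both 32-bit and
// 64-bit platforms, so the pointer moves from the element 4 to the element 3.
int element_after_shift()
{
  int array[5] = { 1, 2, 3, 4, 5 };
  int* ptr = array + 3;
  ptr = ptr + signed_sum(-2, 1);
  return *ptr;
}
```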

So, here's a tip: if you're working with pointers or arrays, it's better to use size_t and ptrdiff_t types.

To learn more about errors that you can avoid with the help of size_t and ptrdiff_t, read the following articles: [3], [4], [5], [6].

Performance of code that uses ptrdiff_t and size_t types

In addition to improving the reliability of the code, ptrdiff_t and size_t types in address arithmetic can give additional performance gains. For example, the int type as an index (the size of which differs from the size of the pointer) results in additional data conversion commands in the binary code. We are talking about 64-bit code in which the size of pointers became 64 bits, and the size of the int type remained 32-bit.

It's difficult to give a brief example demonstrating that size_t is better than unsigned. To be objective, it is necessary to use the optimizing capabilities of the compiler. However, the two versions of optimized code often become too dissimilar to easily demonstrate the difference. We tried to create something close to a simple example, but we succeeded only at the sixth attempt. Still, the example is not perfect, because it shows not the previously mentioned unnecessary data type conversions, but that the compiler was able to build more efficient code with the help of the size_t type. Let's consider the program code that arranges array items in the reverse order:

unsigned arraySize;
...
for (unsigned i = 0; i < arraySize / 2; i++)
{
  float value = array[i];
  array[i] = array[arraySize - i - 1];
  array[arraySize - i - 1] = value;
}

The variables arraySize and i have the unsigned type. You can easily replace the type with size_t and compare a small fragment of assembler code shown in Figure N1.

Figure N1. The comparison of 64-bit assembler code with unsigned and size_t types

The compiler managed to build more concise code when it used 64-bit registers. We are not claiming that the code generated for the unsigned type (text on the left) is necessarily slower than the code generated for the size_t type (text on the right): it is rather difficult to compare the speed of code execution on contemporary processors. However, the example shows that 64-bit types can let the compiler build more concise, and potentially faster, code.
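For reference, the size_t variant of the reversal loop could be written as in the sketch below. Wrapping the loop in a function named reverse_in_place is an assumption made to keep the example self-contained; the loop body is the same as in the unsigned version.

```cpp
#include <cassert>
#include <cstddef>   // std::size_t

// Hypothetical wrapper around the loop from the article, with the unsigned
// variables replaced by size_t.
void reverse_in_place(float* array, std::size_t arraySize)
{
  assert(array != nullptr || arraySize == 0);
  for (std::size_t i = 0; i < arraySize / 2; i++)
  {
    float value = array[i];
    array[i] = array[arraySize - i - 1];
    array[arraySize - i - 1] = value;
  }
}
```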

According to our personal experience, a competent replacement of int/unsigned types with ptrdiff_t/size_t can give an additional performance gain of up to 10% on a 64-bit system. You can view one of the examples of how ptrdiff_t and size_t types increase the performance in the fourth chapter of the following article: Development of resource-intensive applications in Visual C++ [7].

Code refactoring to switch to ptrdiff_t and size_t

As we've already discussed, ptrdiff_t and size_t types have a number of advantages for 64-bit programs. However, we can't just replace all unsigned types with size_t. Firstly, this does not guarantee the correctness of the program on a 64-bit system. Secondly, most likely, such a replacement will provoke new errors, break the compatibility of data formats, and so on. Do not forget that such a replacement can significantly increase the amount of memory consumed by the program. Moreover, an increase in the amount of required memory can slow down the application, because there will be fewer objects in the cache.
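The memory cost mentioned above is easy to see on a small structure. In this sketch, NarrowRecord, WideRecord, and bytes_for are invented names; the comments about sizes assume a typical 64-bit platform with 8-byte size_t.

```cpp
#include <cassert>
#include <cstddef>   // std::size_t
#include <cstdint>   // std::uint32_t, SIZE_MAX

// Hypothetical record with 32-bit fields: 8 bytes on common platforms.
struct NarrowRecord
{
  std::uint32_t offset;
  std::uint32_t length;
};

// The same record after a blind replacement with size_t:
// 16 bytes on a typical 64-bit platform, so twice the memory.
struct WideRecord
{
  std::size_t offset;
  std::size_t length;
};

// Hypothetical helper: total bytes needed for 'count' records.
std::size_t bytes_for(std::size_t count, std::size_t record_size)
{
  assert(record_size == 0 || count <= SIZE_MAX / record_size);
  return count * record_size;
}
```

With a million records, bytes_for(1000000, sizeof(WideRecord)) is double the NarrowRecord figure on such a platform, which also means fewer records fit in the cache.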

So, introducing ptrdiff_t and size_t types into legacy code is a matter of gradual, thoughtful refactoring that takes a lot of time. In effect, you would have to review the entire code base and make the necessary edits by hand, which is usually too expensive and inefficient. It's better to choose one of the following two options:

Use specialized tools such as PVS-Studio. This is a static code analyzer that detects places where it's better to change data types so that the program works correctly and efficiently on 64-bit systems.
If you don't plan to adapt a 32-bit program for 64-bit systems, then there's no point in refactoring data types. A 32-bit program will not benefit from ptrdiff_t and size_t types.

References

1. Andrey Karpov, Evgenii Ryzhkov. Lessons on the development of 64-bit C and C++ applications. https://pvs-studio.com/en/blog/lessons/
2. Andrey Karpov. A 64-bit horse that can count. https://pvs-studio.com/en/blog/posts/cpp/a0043/
3. Andrey Karpov, Evgenii Ryzhkov. 20 issues of porting C++ code to the 64-bit platform. https://pvs-studio.com/en/blog/posts/cpp/a0004/
4. Andrey Karpov. Safety of 64-bit code. https://pvs-studio.com/en/blog/posts/cpp/a0046/
5. Andrey Karpov, Evgenii Ryzhkov. Traps detection during migration of C and C++ code to 64-bit Windows. https://pvs-studio.com/en/blog/posts/cpp/a0012/
6. Andrey Karpov. Undefined behavior is closer than you think. https://pvs-studio.com/en/blog/posts/cpp/0374/
7. Andrey Karpov, Evgenii Ryzhkov. Development of resource-intensive applications in Visual C++. https://pvs-studio.com/en/blog/posts/a0018/