Webinar: Parsing C++ - 10.10
Suppose we need to sort the collection by multiple keys. In C#, we can do this with the help of OrderBy().OrderBy() or OrderBy().ThenBy(). But what is the difference between these calls? To answer this question, we need to delve into the source code.
The article has three chapters:
It all started with the following article: "Suspicious sortings in Unity, ASP.NET Core, and more". The article describes cases where the order of OrderBy().OrderBy() calls can lead to errors. However, it turns out that sometimes developers intentionally use OrderBy().OrderBy(), not OrderBy().ThenBy().
Take a look at an example. Let's say we have a Wrapper class and an array of instances of the following type:
class Wrapper
{
public int Primary { get; init; }
public int Secondary { get; init; }
}
var arr = new Wrapper[]
{
new() { Primary = 1, Secondary = 2 },
new() { Primary = 0, Secondary = 1 },
new() { Primary = 2, Secondary = 1 },
new() { Primary = 2, Secondary = 0 },
new() { Primary = 0, Secondary = 2 },
new() { Primary = 0, Secondary = 3 },
};
We want to sort this array: first by the Primary value, then by Secondary.
If we perform sorting as follows — we make a mistake:
var sorted = arr.OrderBy(p => p.Primary)
.OrderBy(p => p.Secondary);
....
Here's the result:
Primary: 2 Secondary: 0
Primary: 0 Secondary: 1
Primary: 2 Secondary: 1
Primary: 0 Secondary: 2
Primary: 1 Secondary: 2
Primary: 0 Secondary: 3
We get the wrong result. To sort the collection correctly, we need to use OrderBy().ThenBy():
var sorted = arr.OrderBy(p => p.Primary)
.ThenBy(p => p.Secondary);
....
Here's the result:
Primary: 0 Secondary: 1
Primary: 0 Secondary: 2
Primary: 0 Secondary: 3
Primary: 1 Secondary: 2
Primary: 2 Secondary: 0
Primary: 2 Secondary: 1
However, we can also use OrderBy().OrderBy() to get the correct result. We just need to swap the calls.
That's how we shouldn't do:
var sorted = arr.OrderBy(p => p.Primary)
.OrderBy(p => p.Secondary);
And that's how we should do:
var sorted = arr.OrderBy(p => p.Secondary)
.OrderBy(p => p.Primary);
It turns out that to get the correct result, we can use both options:
// #1
var sorted1 = arr.OrderBy(p => p.Secondary)
.OrderBy(p => p.Primary);
// #2
var sorted2 = arr.OrderBy(p => p.Primary)
.ThenBy(p => p.Secondary);
I think the second option is more readable.
When you see OrderBy().OrderBy(), you wonder: isn't there an error? So, it is better to use OrderBy().ThenBy(): the code is easier to read, and the developer's intention is clear.
However, these sorting methods differ not only in appearance: their performance and memory consumption vary too.
Let's experiment with the following code:
struct Wrapper
{
public int Primary { get; init; }
public int Secondary { get; init; }
}
Wrapper[] arr = ....;
// #1
_ = arr.OrderBy(p => p.Secondary)
.OrderBy(p => p.Primary)
.ToArray();
// #2
_ = arr.OrderBy(p => p.Primary)
.ThenBy(p => p.Secondary)
.ToArray();
So, let's sum up the main points:
To run tests, I took two sets of generated test data (instances of the Wrapper type). In the first set, the spread of Primary and Secondary values is greater, in the second set it is less. I wrote Wrapper objects (from 10 to 1,000,000) in arr and sorted them.
The test project is running on .NET 6.
I used BenchmarkDotNet to track the performance.
Below you can see how much it took to perform sorting and get an array. We're not interested in absolute values — the difference between sorting methods is more important.
Data set #1:
arr.Length |
10 |
100 |
1 000 |
10 000 |
100 000 |
1 000 000 |
---|---|---|---|---|---|---|
OrderBy().OrderBy() |
619 ns |
9 us |
170 us |
2 ms |
25.8 ms |
315 ms |
OrderBy().ThenBy() |
285 ns |
4.5 us |
100 us |
1.4 ms |
20.4 ms |
271 ms |
Ratio |
2.17 |
2 |
1.7 |
1.43 |
1.26 |
1.16 |
Data set #2:
arr.Length |
10 |
100 |
1 000 |
10 000 |
100 000 |
1 000 000 |
---|---|---|---|---|---|---|
OrderBy().OrderBy() |
553.3 ns |
8.7 us |
154 us |
2.1 ms |
29.5 ms |
364 ms |
OrderBy().ThenBy() |
316.4 ns |
4.2 us |
80 us |
1.1 ms |
16.9 ms |
240 ms |
Ratio |
1.75 |
2.07 |
1.93 |
1.91 |
1.75 |
1.52 |
Let's move on and see the time difference if we perform several sortings at once. To do this, we use the for loop:
for (int i = 0; i < iterNum; ++i)
{
// Perform sorting
}
Here's the time (in seconds) spend to sort 1,000,000 instances of the Wrapper type:
Number of iterations |
1 |
10 |
100 |
---|---|---|---|
OrderBy().OrderBy() |
0.819 |
6.52 |
65.15 |
OrderBy().ThenBy() |
0.571 |
5.21 |
42.94 |
Ratio |
1.43 |
1.25 |
1.30 |
Well, the difference in 20 seconds is worth considering.
There is a similar thing with memory — OrderBy().OrderBy() consumes more. It is especially noticeable on large amounts of data and several iterations.
Here's the difference in the number of objects created per iteration:
Type |
OrderBy().OrderBy() |
OrderBy().ThenBy() |
---|---|---|
Int32[] |
4 |
3 |
Comparison<Int32> |
2 |
1 |
Wrapper[] |
3 |
2 |
As the table suggests, OrderBy().OrderBy() calls create two more arrays. When I was running the test that had 100 sortings, I got a 1GB difference of allocated memory.
It's important to note that the larger the sorted collection is, the larger the "extra" arrays are. As a result, the amount of consumed memory also increases.
And now it's time to look "under the hood". Just to remind you — we are considering two ways of sorting:
// #1
_ = arr.OrderBy(p => p.Secondary)
.OrderBy(p => p.Primary)
.ToArray();
// #2
_ = arr.OrderBy(p => p.Primary)
.ThenBy(p => p.Secondary)
.ToArray();
To understand the difference, we need to analyze:
The source code of .NET 6 is available on GitHub.
We need to discuss three top-level methods: OrderBy, ThenBy, and ToArray. Let's consider each of them.
OrderBy
OrderBy is an extension method that returns an instance of the OrderedEnumerable<TElement, TKey> type:
public static IOrderedEnumerable<TSource>
OrderBy<TSource, TKey>(this IEnumerable<TSource> source,
Func<TSource, TKey> keySelector)
=> new OrderedEnumerable<TSource,
TKey>(source, keySelector, null, false, null);
Here's the OrderedEnumerable<TElement, TKey> constructor:
internal OrderedEnumerable( IEnumerable<TElement> source,
Func<TElement, TKey> keySelector,
IComparer<TKey>? comparer,
bool descending,
OrderedEnumerable<TElement>? parent
) : base(source)
{
....
_parent = parent;
_keySelector = keySelector;
_comparer = comparer ?? Comparer<TKey>.Default;
_descending = descending;
}
Here we're interested in the base constructor call — base(source). The base type is OrderedEnumerable<TElement>. The constructor looks as follows:
protected OrderedEnumerable(IEnumerable<TElement> source)
=> _source = source;
Let's brush up on what we've already discussed: as a result of the OrderBy call, the OrderedEnumerable<TElement, TKey> instance is created. Its state is determined by the following fields:
ThenBy
ThenBy is an extension method:
public static IOrderedEnumerable<TSource>
ThenBy<TSource, TKey>(this IOrderedEnumerable<TSource> source,
Func<TSource, TKey> keySelector)
{
....
return source.CreateOrderedEnumerable(keySelector, null, false);
}
In our case, the type of the source variable is OrderedEnumerable<TElement, TKey>. Let's take a look at the implementation of the CreateOrderedEnumerable method that will be called:
IOrderedEnumerable<TElement>
IOrderedEnumerable<TElement>
.CreateOrderedEnumerable<TKey>(Func<TElement, TKey> keySelector,
IComparer<TKey>? comparer,
bool descending)
=> new OrderedEnumerable<TElement,
TKey>(_source,
keySelector,
comparer,
@descending,
this);
We see that the constructor of the OrderedEnumerable<TElement, TKey> type (that we've already discussed in the OrderBy section) is called. The arguments of the call differ, and, as a result, the state of the created object differs.
Let's brush it up: ThenBy (like OrderBy), in our case returns an instance of the OrderedEnumerable<TElement, TKey> type.
ToArray
ToArray is an extension method:
public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source)
{
....
return source is IIListProvider<TSource> arrayProvider
? arrayProvider.ToArray()
: EnumerableHelpers.ToArray(source);
}
In both cases, source are the instances of the OrderedEnumerable<TElement, TKey> type. This type implements the IIlistProvider<TSource> interface. Therefore, the execution will go via the arrayProvider.ToArray() call. In fact, the OrderedEnumerable<TElement>.ToArray method will be called:
public TElement[] ToArray()
{
Buffer<TElement> buffer = new Buffer<TElement>(_source);
int count = buffer._count;
if (count == 0)
{
return buffer._items;
}
TElement[] array = new TElement[count];
int[] map = SortedMap(buffer);
for (int i = 0; i != array.Length; i++)
{
array[i] = buffer._items[map[i]];
}
return array;
}
And here the main differences occur. Before we continue to dive deeper, we need to examine the state of the objects we will be working with.
Let's get back to the source examples:
// #1
_ = arr.OrderBy(p => p.Secondary) // Wrapper[] -> #1.1
.OrderBy(p => p.Primary) // #1.1 -> #1.2
.ToArray(); // #1.2 -> Wrapper[]
// #2
_ = arr.OrderBy(p => p.Primary) // Wrapper[] -> #2.1
.ThenBy(p => p.Secondary) // #2.1 -> #2.2
.ToArray(); // #2.2 -> Wrapper[]
We need to compare four objects arranged in pairs:
As a result, we get 2 tables for comparing the states of objects.
Here are the states of objects created by the first OrderBy calls:
Field |
Object #1.1 |
Object #2.1 |
---|---|---|
_source |
arr |
arr |
_comparer |
Comparer<Int32>.Default |
Comparer<Int32>.Default |
_descending |
false |
false |
_keySelector |
p => p.Secondary |
p => p.Primary |
_parent |
null |
null |
This pair is identical. Only selectors differ.
Here are the states of objects created by the second call of OrderBy (#1.2) and ThenBy (#2.2):
Field |
Object #1.2 |
Object #2.2 |
---|---|---|
_source |
Object #1.1 |
arr |
_comparer |
Comparer<Int32>.Default |
Comparer<Int32>.Default |
_descending |
false |
false |
_keySelector |
p => p.Primary |
p => p.Secondary |
_parent |
null |
Object #2.1 |
Here the selectors also differ — and this is expected. Curious that _source and _parent fields differ. The state of the object in case of the ThenBy (#2.2) call seems more correct: the reference to the source collection is saved, and there's a "parent" — the result of the previous sorting.
Now we need to find out how the state of objects affects the execution flow.
Let's get back to the ToArray method:
public TElement[] ToArray()
{
Buffer<TElement> buffer = new Buffer<TElement>(_source);
int count = buffer._count;
if (count == 0)
{
return buffer._items;
}
TElement[] array = new TElement[count];
int[] map = SortedMap(buffer);
for (int i = 0; i != array.Length; i++)
{
array[i] = buffer._items[map[i]];
}
return array;
}
Keep in mind that objects received by different calls have different _source fields:
Let's consider the determining of the Buffer<TElement> type:
internal readonly struct Buffer<TElement>
{
internal readonly TElement[] _items;
internal readonly int _count;
internal Buffer(IEnumerable<TElement> source)
{
if (source is IIListProvider<TElement> iterator)
{
TElement[] array = iterator.ToArray();
_items = array;
_count = array.Length;
}
else
{
_items = EnumerableHelpers.ToArray(source, out _count);
}
}
}
This is where behavior starts to differ:
In the first case, we are getting back to the ToArray method that was given above. Then, we get into the Buffer constructor again, but the execution will follow the else branch, because the _source of the object #1.1 is Wrapper[].
EnumerableHelpers.ToArray just creates a copy of the array:
internal static T[] ToArray<T>(IEnumerable<T> source, out int length)
{
if (source is ICollection<T> ic)
{
int count = ic.Count;
if (count != 0)
{
T[] arr = new T[count];
ic.CopyTo(arr, 0);
length = count;
return arr;
}
}
else
....
....
}
The execution follows the then branch. I omitted the rest of code, because in our case it is unimportant.
The difference is clearer in call stacks. Pay attention to the bold "extra" calls:
Call stack for OrderBy().OrderBy() |
Call stack for OrderBy().ThenBy() |
---|---|
EnumerableHelpers.ToArray |
EnumerableHelpers.ToArray |
Buffer.ctor |
Buffer.ctor |
OrderedEnumerable.ToArray |
OrderedEnumerable.ToArray |
Buffer.ctor |
Enumerable.ToArray |
OrderedEnumerable.ToArray |
Main |
Enumerable.ToArray |
|
Main |
By the way, this also explains the difference in the number of created objects. Here's the table we discussed previously:
Type |
OrderBy().OrderBy() |
OrderBy().ThenBy() |
---|---|---|
Int32[] |
4 |
3 |
Comparison<Int32> |
2 |
1 |
Wrapper[] |
3 |
2 |
The most interesting arrays here are: Int32[] and Wrapper[]. They arise because the execution flow unnecessarily passes via the OrderedEnumerable<TElement>.ToArray method once again:
public TElement[] ToArray()
{
....
TElement[] array = new TElement[count];
int[] map = SortedMap(buffer);
....
}
Just to remind you — the sizes of array and map arrays depend on the size of the sorted collection: the larger it is, the larger the overhead will be — because of the unnecessary OrderedEnumerable<TElement>.ToArray call.
The same thing is with the performance. Let's take another look at the code of the OrderedEnumerable<TElement>.ToArray method:
public TElement[] ToArray()
{
Buffer<TElement> buffer = new Buffer<TElement>(_source);
int count = buffer._count;
if (count == 0)
{
return buffer._items;
}
TElement[] array = new TElement[count];
int[] map = SortedMap(buffer);
for (int i = 0; i != array.Length; i++)
{
array[i] = buffer._items[map[i]];
}
return array;
}
We are interested in the map array. It describes the relationship between the positions of elements in arrays:
Let's say map[5] == 62. This means that the element takes the 62nd position in the source array, and the 5th position in the resulting array.
To get a so-called "relationship map", the SortedMap method is used:
private int[] SortedMap(Buffer<TElement> buffer)
=> GetEnumerableSorter().Sort(buffer._items, buffer._count);
Here's the GetEnumerableSorter method:
private EnumerableSorter<TElement> GetEnumerableSorter()
=> GetEnumerableSorter(null);
Let's take a look at the method overload:
internal override EnumerableSorter<TElement>
GetEnumerableSorter(EnumerableSorter<TElement>? next)
{
....
EnumerableSorter<TElement> sorter =
new EnumerableSorter<TElement, TKey>(_keySelector,
comparer,
_descending,
next);
if (_parent != null)
{
sorter = _parent.GetEnumerableSorter(sorter);
}
return sorter;
}
Here comes up another difference between the sorting methods that we are discussing:
Here's the called EnumerableSorter constructor:
internal EnumerableSorter(
Func<TElement, TKey> keySelector,
IComparer<TKey> comparer,
bool descending,
EnumerableSorter<TElement>? next)
{
_keySelector = keySelector;
_comparer = comparer;
_descending = descending;
_next = next;
}
The constructor just initializes the object fields. There is another field that is not used in the constructor — _keys. It will be initialized later in the ComputeKeys method.
Let's consider what the fields are responsible for. To do this, let's discuss one of the sorting ways:
_ = arr.OrderBy(p => p.Primary)
.ThenBy(p => p.Secondary)
.ToArray();
To perform sorting via OrderBy, an instance of EnumerableSorter will be created. It has the following fields:
After the EnumerableSorter instance was created and initialized, the Sort method is called for it:
private int[] SortedMap(Buffer<TElement> buffer)
=> GetEnumerableSorter().Sort(buffer._items, buffer._count);
Here's the body of the Sort method:
internal int[] Sort(TElement[] elements, int count)
{
int[] map = ComputeMap(elements, count);
QuickSort(map, 0, count - 1);
return map;
}
Here's the ComputeMap method:
private int[] ComputeMap(TElement[] elements, int count)
{
ComputeKeys(elements, count);
int[] map = new int[count];
for (int i = 0; i < map.Length; i++)
{
map[i] = i;
}
return map;
}
Let's take a look at the ComputeKeys method:
internal override void ComputeKeys(TElement[] elements, int count)
{
_keys = new TKey[count];
for (int i = 0; i < count; i++)
{
_keys[i] = _keySelector(elements[i]);
}
_next?.ComputeKeys(elements, count);
}
In this method, the _keys array of the EnumerableSorter instance is initialized. The _next?.ComputeKeys(elements, count) call allows initializing the entire chain of related EnumerableSorter objects.
Why do we need the _keys field? The array stores the results of calling the selector on each element of the source array. So, we get an array of keys that will be used to perform sorting.
Here's an example:
var arr = new Wrapper[]
{
new() { Primary = 3, Secondary = 2 },
new() { Primary = 3, Secondary = 1 },
new() { Primary = 1, Secondary = 0 }
};
_ = arr.OrderBy(p => p.Primary)
.ThenBy(p => p.Secondary)
.ToArray();
In this example, two related EnumerableSorter instances will be created.
Field |
EnumerableSorter #1 |
EnumerableSorter #2 |
---|---|---|
_keySelector |
p => p.Primary |
p => p.Secondary |
_keys |
[3, 3, 1] |
[2, 1, 0] |
Thus, _keys stores sorting keys for each element of the source array.
Let's get back to the ComputeMap method:
private int[] ComputeMap(TElement[] elements, int count)
{
ComputeKeys(elements, count);
int[] map = new int[count];
for (int i = 0; i < map.Length; i++)
{
map[i] = i;
}
return map;
}
After calling the ComputeKeys method, the map array is created and initialized. This is the array that describes the relationship between the positions in the source and resulting arrays. In this method, it still describes the i -> i relationship. This means that the positions in the source array coincide with the positions in the resulting array.
Let's get back to the Sort method:
internal int[] Sort(TElement[] elements, int count)
{
int[] map = ComputeMap(elements, count);
QuickSort(map, 0, count - 1);
return map;
}
We are interested in the QuickSort method that makes the map array to look the way we need. Right after this operation we get the correct relationship between the positions of elements in the source array and in the resulting one.
Here's the body of the QuickSort method:
protected override void QuickSort(int[] keys, int lo, int hi)
=> new Span<int>(keys, lo, hi - lo + 1).Sort(CompareAnyKeys);
We won't dive into the details of Span and its Sort method. Let's focus on the fact that it sorts the array considering the Comparison delegate:
public delegate int Comparison<in T>(T x, T y);
It's a classic delegate for comparison. It takes two elements, compares them, and returns the value:
In our case, the CompareAnyKeys method is used for comparison:
internal override int CompareAnyKeys(int index1, int index2)
{
Debug.Assert(_keys != null);
int c = _comparer.Compare(_keys[index1], _keys[index2]);
if (c == 0)
{
if (_next == null)
{
return index1 - index2; // ensure stability of sort
}
return _next.CompareAnyKeys(index1, index2);
}
// ....
return (_descending != (c > 0)) ? 1 : -1;
}
So, let's see what we have:
int c = _comparer.Compare(_keys[index1], _keys[index2]);
if (c == 0)
....
return (_descending != (c > 0)) ? 1 : -1;
Two elements are compared via a comparator written in _comparer. Since we have not explicitly set any comparator, Comparer<T>.Default is used (in our case – Comparer<Int32>.Default).
If elements are not equal, the c == 0 condition is not executed, and the execution flow goes to return. The _descending field stores the information on how the elements are sorted — in descending or ascending order. If necessary, the field is used to correct the value returned by the method.
But what if the elements are equal?
if (c == 0)
{
if (_next == null)
{
return index1 - index2; // ensure stability of sort
}
return _next.CompareAnyKeys(index1, index2);
}
Here come the chains of EnumerableSorter instances related to each other. If the compared keys are equal, a check is performed to see if there are any other sorting criteria. If there is (_next != null), the comparison is performed via them.
It turns out that all sorting criteria are considered in one call of the Sort method.
What happens if we use OrderBy().OrderBy()? To find out, let's get back to creating the EnumerableSorter instance:
internal override EnumerableSorter<TElement>
GetEnumerableSorter(EnumerableSorter<TElement>? next)
{
....
EnumerableSorter<TElement> sorter =
new EnumerableSorter<TElement, TKey>(_keySelector,
comparer,
_descending,
next);
if (_parent != null)
{
sorter = _parent.GetEnumerableSorter(sorter);
}
return sorter;
}
The _parent value of the object obtained as a result of the second OrderBy call is null. This means that one EnumerableSorter instance is created. It is not related to something, the value of _next is null.
It turns out that we need to perform all the actions described above twice. We've already discussed how this affects the performance. To remind you, I'd duplicate one of the tables given above.
Here's the time (in seconds) spend to sort 1,000,000 instances of the Wrapper type:
Number of iterations |
1 |
10 |
100 |
---|---|---|---|
OrderBy().OrderBy() |
0.819 |
6.52 |
65.15 |
OrderBy().ThenBy() |
0.571 |
5.21 |
42.94 |
The OrderBy and ThenBy methods create OrderedEnumerable instances that are used to perform sorting. Instances of the EnumerableSorter type help perform sorting. They affect the algorithm, use specified selectors and a comparator.
The main difference between OrderBy().OrderBy() and OrderBy().ThenBy() calls is the relations between objects.
OrderBy().OrderBy(). There are no relations between OrderedEnumerable or EnumerableSorter. As a result, "extra" objects are created — instead of one sorting, we get two. Memory consumption is growing, and the code runs slower.
OrderBy().ThenBy(). Both OrderedEnumerable and EnumerableSorter instances are related. Because of this, one sorting operation is performed by several criteria at once. Extra objects are not created. Memory consumption is reduced, and the code runs faster.
The code that uses OrderBy().ThenBy() instead of OrderBy().OrderBy():
Follow me on Twitter, if you are interested in such publications.
0