Webinar: C++ semantics - 06.11
Regular expressions come in handy when you need to search for and replace text. However, in some cases, they may cause the system to slow down or even make vulnerable to ReDoS attacks.
ReDoS is a subtype of a DoS attack. The aim of a ReDoS attack is to halt an application or cause it to slow down via an inefficient regex.
ReDoS attacks can be divided into two types:
The main point of any ReDoS attack is using a vulnerable regular expression in an application. Passing a string of a certain format to a regex leads to its unreasonably long calculation.
If a ReDoS attack is successful, then the regex calculation results in catastrophic backtracking. It is a consequence of the backtracking function in the Regex engine, which iterates through possible string matches until it finds the correct one. If there's no correct match, a regular expression won't stop until it iterates through all possible options. A complete iteration of all possible options leads to an unacceptably long calculation of a regex. This is called catastrophic backtracking.
A regex is vulnerable to catastrophic backtracking if it contains at least one subexpression that may cause a large number of matching options.
Let's inspect several regular expressions for vulnerabilities.
I wrote a small program — it displays a graph of how the calculation time of regex depends on the number of characters in the evaluated string. In next examples, I will use this program to show you the catastrophic backtracking.
Example 1
Let's look at a simple synthetic example:
(x+)+y
Let's compare the calculation time of the (x+)+y expression in two cases:
The results of such an experiment are below:
Figure 1. The execution time of a regular expression if a string matches the pattern (x+)+y.
Figure 2. The execution time of a regular expression if a string does not match the (x+)+y pattern (the y character is missing at the end).
As you can see, the strings from the first set are processed instantly. However, the processing of the second set increases exponentially! Why so?
The thing is, in the first case a regular expression finds a match on the first try. When processing strings in the second case, everything becomes very complicated. The x+ template can match any number of x characters. The (x+)+ template can fit a string that consists of one or more substrings corresponding with x+. Because of this, there are many options for matching a string with a regular expression. Their number depends on the length of a substring consisting of x characters. Every time the regular expression does not find the y character, it starts checking the next option. Only after checking all of them, the regular expression gives the answer – no matches were found.
The table below shows several possible matches of the xxxx string with the (x+)+y regex:
Fortunately, not all regular expressions are vulnerable to catastrophic backtracking. A regex is vulnerable if it meets the following conditions:
Expressions of the (\d?|....|[1-9])+ type are a small exception. Here the (\d?|....|[1-9])+ expression includes subexpressions \d? and [1-9]. They are enumerated via the '|' operator. These subexpressions can also fit the same string, for example, 111. In this case, applying the '?' quantifier to one of the subexpressions also results in a vulnerability.
Example 2
We found out that the (x+)+y expression is vulnerable. Now let's change it a bit by adding a check for the presence of another character:
(x+z)+y
Now we have the (x+z)+ subexpression, and the xz and xxxxz strings can be matched to this expression. This subexpression includes the x+ subexpression, which can correspond to strings of x, xxxx, etc. As you can see, these subexpressions cannot be matched with the same values. Thus, the second condition is not met and there's no catastrophic backtracking.
Figure 3. Unsuccessful attempt to "break" a regex with a set of strings, each of them corresponds with either the x+ subexpression or the (x+z)+ subexpression.
Example 3
Now let's look at the next regex:
newDate\((-?\d+)*\)
This regex has a task — search for a substring of the newDate(12-09-2022) type. Can we call this regex safe? No. Besides correct strings, the regex will consider correct the newDate(8-911-111-11-11) and even newDate(1111111111111) strings. However, to understand the essence of the issue, such an expression will be enough for us.
None of the above options will lead to catastrophic backtracking. However, it will happen if we process strings of the 'newDate(1111111111111' type.
Figure 4. The execution time of the regex checking strings that do not match the pattern (there is no closing parenthesis at the end of the string)
We see catastrophic backtracking again. It happens because of the (-?\d+)* subexpression, which includes the \d+ subexpression. The '*' or '+' quantifiers are applied to both subexpressions and the same string can be matched with each of them, for example, 111.
Let's compare these observations with the previously examined conditions of the regular expression with a vulnerability:
By the way, the newDate\((-?\d+)*\) regex caused a vulnerability (CVE-2021-27293) in a real project – the RestSharp library.
Example 4
As a final example, let's look for vulnerability in a more complex regular expression:
^(([A-Z]:|\\main)(\\[^\\]+)*(,\s)?)+$
The task of this expression is to find strings that represent a list of paths to files or directories. Each element of this list is separated from each other with a comma and a space character. A list item can be represented by a path corresponding to one of two types:
Thus, the string corresponding to the pattern may look like this:
D:\catalog, C:\catalog\file.cs, \main\file.txt, \main\, project\main.csproj
A regular expression will evaluate such strings without any problems.
The same goes for processing almost any incorrect string processing, for example:
D:\catalog\file.cs\catalog\file.cs\catalog\file.cs\catalog\file.cs\catalog\file.cs\catalog\file.cs\\\
However, the situation changes if we pass a string of the following type to a regex:
D:\main\main\main\main\main\main\main\main\main\main\main\main\main\main\main\\\
Figure 5. The running time of the regular expression when processing strings of the D:\main ...\main\\\ format.
Let's inspect the original regular expression (^(([A-Z]:|\\main)(\\[^\\]+)*(,\s)?)+$) in more details. Note that subexpressions ([A-Z]:|\\main) and (\\[^\\]+)* that follow each other can be matched with the same \main string. Moreover, the following subexpression ((,\s)?) can be ignored, because the '?' quantifier allows the absence of a match with this template.
Thus, it is possible to simplify the original regular expression to check only one special case – strings of the D:\main ...\main format:
^(([A-Z]:|\\main)(\\main)*)+$
The catastrophic backtracking vulnerability becomes clear when we look at the simplified version of this string.
Therefore, both conditions for a vulnerable expression are met.
Let's highlight the main factors that cause catastrophic backtracking in the ^(([A-Z]:|\\main)(\\[^\\]+)*(,\s)?)+$ regular expression:
The absence of at least one of them would make the regular expression absolutely safe.
Let's look at the main ways to protect a regex from catastrophic backtracking. We'll use the newDate\((-?\d+)*\) as an example. The code below is written in C#. However, the similar functionality probably exists in other programming languages that support regular expressions.
Option 1. Add a limit on execution time for processing a string by a regular expression. In .NET, we can do it by setting the matchTimeout parameter when calling a static method or initializing the new Regex object.
RegexOptions options = RegexOptions.None;
TimeSpan timeout = TimeSpan.FromSeconds(1);
Regex pattern = new Regex(@"newDate\((-?\d+)*\)", options, timeout);
Regex.Match(str, @"newDate\((-?\d+)*\)", options, timeout);
Figure 6. The execution time of a regular expression is limited to one second
Option 2. Use atomic groups (?>...):
Regex pattern = new Regex(@"newDate\((?>-?\d+)*\)");
For expressions marked as atomic groups, the backtracking function is disabled. Thus, of all possible matching options, an atomic group will always be matched with only one substring containing the maximum number of characters.
Although atomic groups are a reliable way of protection against catastrophic backtracking, we recommend to use them carefully. In some cases, using atomic groups can decrease the accuracy of the regular expression's calculation.
Figure 7. Subexpressions marked as atomic groups are no longer vulnerable to catastrophic backtracking.
Option 3. Rewrite a regex by replacing an unsafe subexpression with a safe equivalent. For example, to find a string of the newDate(13-09-2022) type you can use newDate\((\d{2}-\d{2}-\d{4})\) instead of newDate\((-?\d+)*\).
The latter has two subexpressions: (-?\d+)* and \d+. The \d+ subexpression is included in (-?\d+)*. The same substring can match both these subexpressions. The safe equivalent allows matching any substring with only one template due to the mandatory check of the '-' character between templates \d{...}.
Let's sum up:
And that's all for today :). I hope this article was interesting.
Clean code and successful projects to you! See you in next articles!
0