GPT-3 detected 213 Security Vulnerabilities... Or it did not

Apr 11 2023

Author: Andrey Karpov

This text is a detailed commentary on the article "I Used GPT-3 to Find 213 Security Vulnerabilities in a Single Codebase".

For a better grasp of the subject, I recommend first reading Chris Koch's article: I Used GPT-3 to Find 213 Security Vulnerabilities in a Single Codebase. I had already written a lengthy comment on it. Then I decided to write another one, and this time I chose to put my thoughts into a complete post.

I don't share the author's enthusiasm. Our trials with ChatGPT yielded far more modest and dubious findings, and you may read about them in the article: Is ChatGPT good enough at spotting bugs in code?.

GPT-3 appears to have fascinated the article's author, judging by the way the author credits the bot's responses as correct even when they are not. This may explain why the author claims that GPT-3 gave almost no false positives. Sure, it won't give us any false positives if we don't want to notice them :)

What makes me so skeptical? I think the article shows the best examples of how GPT-3 works; the author would hardly have picked weak ones :). However, even in these carefully chosen samples that highlight GPT-3's capabilities, there are a few unnoticed false positives.

Let's look at the first example.

int main(int argc, char **argv) {
    printf(argv[1]);

I agree with the second message:

Format string vulnerability: The program does not check the format of the user input, which could lead to a format string attack.

We could argue about the wording here. It's not necessary to check the user input; we can simply use printf in a different way, passing the input as an argument to a fixed format string. The warning generated by GPT-3 also falls short of the documentation provided by traditional static analyzers. Here's an example for comparison: V618. Now let's take a look at the first warning, which is a bit more interesting.
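To illustrate what using printf "in a different way" could look like, here is a minimal sketch. The helper name format_user_text is hypothetical and does not come from the article under discussion; it formats into a buffer instead of printing so that the result is easy to inspect. The only point is that the user input is passed as data to a fixed "%s" format string, so conversion specifiers inside it are never interpreted.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (an illustration, not code from the article):
   copies user-supplied text into out using a fixed format string.
   Because the text is an argument and not the format itself, sequences
   such as "%x" or "%n" are treated as ordinary characters.
   Returns whatever snprintf returns. */
int format_user_text(char *out, size_t out_size, const char *user_text)
{
    assert(out != NULL && user_text != NULL);
    return snprintf(out, out_size, "%s", user_text);
}
```

The same idea applies directly to printing: printf("%s", text) instead of printf(text).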

Unvalidated user input: The program does not check the length of the user input, which could lead to a buffer overflow attack.

In my opinion, this is a false positive. The real error is that the number of arguments (the argc variable) is not checked, so the access to argv[1] may go out of bounds of the argv array. Meanwhile, GPT-3 starts speculating about buffer overflows. Sure, we could say that it's the same thing... Well, then the bot could just say: "There's an error." If there is a real error, then you're lucky. If there isn't, well, that's life :) When do programmers talk about buffer overflow? When they work with null-terminated strings or deal with incorrect use of strcat, memcpy, and similar functions.
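The missing check is easy to sketch. The function name first_argument below is hypothetical; it only shows the kind of argc test the quoted snippet lacks. As a side note, the C standard guarantees that argv[argc] is a null pointer, so with no arguments argv[1] is NULL, and with argc equal to 0 the element argv[1] lies past the end of the array. The check covers both cases.

```c
#include <stddef.h>

/* Hypothetical helper (an illustration, not code from the article):
   returns the first command-line argument, or NULL when the program
   was started without one. */
const char *first_argument(int argc, char **argv)
{
    if (argc < 2 || argv == NULL) {
        return NULL;   /* argv[1] is absent: do not touch it */
    }
    return argv[1];
}
```

A caller would test the result for NULL before passing it on, for example to printf with a "%s" format.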

Okay, maybe that false positive wasn't convincing enough. Let us now examine the third piece of code and the message it produces:

fp = fopen(filename,"r"); 
if(fp == NULL)
{
  printf("\nCan't open file or file doesn't exist.");
  exit(0);
}

Unchecked return value: The return value of the fopen() function is not checked, which could lead to a null pointer dereference.

In the first version of the article, the author wrote that GPT-3 was right. Then there was a postscript revealing that this particular case was a false positive. Thanks to the attentive readers who noted the fault. In fact, if we look closely, all the magic of GPT-3's useful messages disappears. I'm about to spoil the impression even further :).

Referring again to the third example:

char OOBR_stack = buff3[size3+100];
char OOBR_heap = buff4[100];

Uninitialized memory access: The OOBR_stack and OOBR_heap variables are accessed without being initialized, which could lead to undefined behavior.

What nonsense. The initialization is right here. These variables cannot in any way be considered uninitialized. It's another matter that the array access goes out of bounds during the initialization, but GPT-3 said nothing about that. GPT-3 is also wrong when it reports access to the uninitialized variables OOBR_stack and OOBR_heap: these variables are not used anywhere.
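For contrast, here is a rough sketch of what guarding such a read could look like. The helper read_char_at and its parameters are hypothetical, and the call in the usage note assumes that size3 is the number of elements in buff3, since the quoted snippet does not show how the buffers are declared.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (an illustration, not code from the article):
   reads buf[index] only when index is inside the buffer.
   Returns true and stores the byte in *out on success;
   returns false and leaves *out untouched otherwise. */
bool read_char_at(const char *buf, size_t size, size_t index, char *out)
{
    if (buf == NULL || out == NULL || index >= size) {
        return false;
    }
    *out = buf[index];
    return true;
}
```

Under that assumption, read_char_at(buff3, size3, size3 + 100, &c) would return false instead of reading past the end of the buffer.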

Although this warning appears to be clever and helpful, it really makes no sense. The same most likely holds true for other errors not mentioned in the article.

By the way, GPT-3 is silent about at least two other faults in the same example.

free(buff1);          // <=
if (size1/2==0){
  free(buff1);        // <=
}
else{
  if(size1 == 123456){
    buff1[0]='a';     // <=
  }
}

First, the memory may be freed twice. Second, the code may write to a buffer that has already been freed. In fact, the more we look at the code, the more flaws we can find in the warnings produced by GPT-3.
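One common way to make both mistakes harder to commit is to reset the pointer right after freeing it. The sketch below is an illustration only: release_buffer and write_first_byte are hypothetical helpers, not code from the article. It relies on the fact that free(NULL) is defined to do nothing.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: frees *buf and resets it to NULL, so a repeated
   call becomes a harmless free(NULL) instead of a double free. */
void release_buffer(char **buf)
{
    if (buf == NULL) {
        return;
    }
    free(*buf);
    *buf = NULL;
}

/* Hypothetical helper: writes to the buffer only if it is still allocated.
   Returns false when the buffer has already been released. */
bool write_first_byte(char *buf, char value)
{
    if (buf == NULL) {
        return false;
    }
    buf[0] = value;
    return true;
}
```

This protects only accesses made through the same pointer variable; other copies of the pointer would still dangle after the release.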

P.S. In my opinion, labeling everything as a vulnerability is too pretentious. The defects discussed in the article are simple errors. Some of them could be potential vulnerabilities, but nothing more. If a discovered flaw can be exploited, then yes, a vulnerability exists. Otherwise, it's just an error, and there are thousands of them in any application :). I know for sure that these bugs are everywhere. We found more than 15,000 bugs in open source projects with the help of PVS-Studio. Yet we don't call these bugs vulnerabilities; that would be an overstatement.


#Cpp #Security