Stumbling block for AI: UTF-8

Oct 13 2025

I suspect you're getting tired of vibe-coding topics. But don't worry, my goal isn't to talk about new groundbreaking achievements that change the world, blah-blah-blah... I find it more interesting to look for the points where code generation starts to fail. Knowing these points will help adapt static analyzers to the new task of checking code created by such systems.

Code generation experiments

I did some experiments generating code with GigaChat and DeepSeek. These weren't work tasks or thorough research. I was simply curious to find examples where problem complexity reaches a certain threshold and C++ code generation begins to struggle.

If you ask these models to generate code at the level of lab exercises or even some course projects, there are no issues. They produce excellent code for sorting arrays or counting set bits in byte arrays. For bit counting, they even suggested interesting speed-optimized solutions I wasn't aware of, despite having read about various approaches to this problem.
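For example, one classic speed-oriented approach (my own sketch from memory, not the generated output) maps bit counting onto a hardware instruction via C++20 std::popcount:

#include <bit>
#include <cstddef>
#include <cstdint>
#include <vector>

std::size_t count_set_bits(const std::vector<std::uint8_t>& bytes)
{
  std::size_t count = 0;
  for (std::uint8_t b : bytes)
    count += std::popcount(b); // usually a single POPCNT instruction
  return count;
}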

This highlights a potential problem in programming education: the temptation to get immediate answers without understanding how things work. We might have to return to writing code on paper :) But let's not digress.

For one experiment, I chose to create a program that reads source code files and performs operations on variable names or their components. One task was to get three tokens from the "RedColor" variable name: "RedColor", "Red", and "Color".

Tokenization is common in programming, and plenty of open-source code exists on this topic. So, in this formulation, both GigaChat and DeepSeek handle the task well.
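To give a feel for the task, a naive ASCII-only splitter (my own sketch rather than the models' output; split_camel_case is a hypothetical name) might look like this:

#include <cctype>
#include <string>
#include <vector>

// Split an identifier such as "RedColor" into "Red" and "Color"
// at uppercase boundaries; together with the full name, this
// yields the three tokens from the example above
std::vector<std::string> split_camel_case(const std::string& name)
{
  std::vector<std::string> parts;
  std::string current;
  for (char ch : name)
  {
    // An uppercase letter starts a new part, e.g. the 'C' in "RedColor"
    if (std::isupper(static_cast<unsigned char>(ch)) && !current.empty())
    {
      parts.push_back(current);
      current.clear();
    }
    current += ch;
  }
  if (!current.empty())
    parts.push_back(current);
  return parts;
}

Note that this only works for ASCII, which is precisely where things get interesting.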

However, the real world is more complex. We must work not only with standard ANSI encoding but also with UTF-8. When I started requesting code that processes files in UTF-8 format, the code generators began to struggle.
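The core complication is that a char no longer corresponds to a character. In UTF-8, a code point occupies one to four bytes, and its length is encoded in the leading bits of the first byte; a small helper of my own, just to illustrate:

// The length of a UTF-8 code point, derived from its first byte
int utf8_sequence_length(unsigned char lead)
{
  if (lead < 0x80)           return 1; // 0xxxxxxx: ASCII
  if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
  return -1;                           // continuation or invalid byte
}

Byte-wise checks like is_letter(text[pos]) silently misbehave on the continuation bytes of non-ASCII letters, and that's before any questions of normalization or case mapping arise.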

I suspect the amount of open-source code handling encodings, including UTF-8, is significantly smaller. Additionally, my task descriptions became more complex and therefore vaguer. I should note that even after a couple of attempts, I couldn't get working code. I'm sure it's possible with more generations and refined task descriptions, but my goal wasn't to write a finished program; I wanted to experiment with failure points :)

Bugs

The code below was generated for me by GigaChat. Interestingly, it cut corners by not implementing real conversion. However, the mere requirement to handle UTF-8 already leads to incorrect C++ code.

....
// A helper function to convert
// a char vector into a UTF-8 string
std::string to_utf8(const std::vector<char>& data)
{
  // For simplicity, we assume the vector is already in UTF-8 format
  return { data.begin(), data.end() };
}
....
// The main function to split a string into tokens
std::vector<std::string_view>
  tokenize_string_giga(const std::vector<char>& input_data)
{
  std::string text = to_utf8(input_data);

  std::vector<std::string_view> result;

  size_t pos = 0;
  while (pos < text.size())
  {
    if (is_letter(text[pos]) || is_digit(text[pos]))
    {
      size_t begin_pos = pos++;

      // Looking for the end of the current alphanumeric sequence
      while (pos < text.size() &&
             (is_letter(text[pos]) || is_digit(text[pos])))
        ++pos;

      // Extract the complete word
      std::string_view current_word =
        text.substr(begin_pos, pos - begin_pos);

      // Split the word into individual tokens
      split_word_into_tokens(current_word, result);
    }
    else
    {
      ++pos; // Skip non-alphanumeric characters
    }
  }

  return result;
}

The error relates to object lifetime. Note the fragment where words are extracted:

std::string_view current_word = text.substr(begin_pos, pos - begin_pos);

The substr function returns a temporary std::string object that is destroyed at the end of the expression. Consequently, current_word stores pointers into an already destroyed object.

Of course, I ran PVS-Studio on this code. It detects this error with the following warning: V1017 [CWE-416] Variable of the 'string_view' type references a temporary object which will be removed after evaluation of an expression. Tokenizer_giga.cpp 79
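By the way, the fix is tiny once the problem is clear. Since C++17, substr called on a std::string_view returns another view into the same buffer, so no temporary string is created; a hedged one-line replacement for the fragment above:

std::string_view current_word =
  std::string_view(text).substr(begin_pos, pos - begin_pos);

The views then stay valid for as long as text itself is alive.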

Interestingly, DeepSeek stumbled at approximately the same point. It generated more code because it actually attempted to handle UTF-8, so here's the relevant snippet:

std::vector<std::string_view>
  tokenize_string(const std::vector<char>& buffer)
{
  std::vector<std::string_view> tokens;
  ....
  std::string utf8_text = detect_and_convert_to_utf8(buffer);
  ....
  std::string_view token(utf8_text.data() + token_start, pos - token_start);
  ....
  tokens.push_back(token);
  ....
  return tokens;
}

A local variable utf8_text of type std::string is created inside the function. It is split into tokens, and views into it are stored in the output tokens array. When the function exits, utf8_text is destroyed and the references become invalid. Unfortunately, PVS-Studio couldn't help here: it failed to correlate the lifetimes of utf8_text and tokens.
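The shape of a lifetime-correct interface is worth spelling out: either return owning std::string objects, or make the caller own the converted text so the views cannot outlive their storage. A sketch of the second option, with the UTF-8 scanning replaced by trivial whitespace splitting just to keep it short (tokenize_utf8 is my name, not DeepSeek's):

#include <cctype>
#include <string>
#include <string_view>
#include <vector>

std::vector<std::string_view> tokenize_utf8(std::string_view text)
{
  std::vector<std::string_view> tokens;
  size_t pos = 0;
  while (pos < text.size())
  {
    // Skip separators
    while (pos < text.size() &&
           std::isspace(static_cast<unsigned char>(text[pos])))
      ++pos;

    // Collect the next token; substr on a string_view
    // returns a view, not a temporary string
    size_t begin_pos = pos;
    while (pos < text.size() &&
           !std::isspace(static_cast<unsigned char>(text[pos])))
      ++pos;
    if (pos > begin_pos)
      tokens.push_back(text.substr(begin_pos, pos - begin_pos));
  }
  return tokens;
}

// The caller owns the buffer for as long as it uses the tokens:
// std::string utf8_text = detect_and_convert_to_utf8(buffer);
// auto tokens = tokenize_utf8(utf8_text);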

Reflection

It was rather interesting to witness how the increased task complexity led to failures in the generated code.

There likely isn't a single reason why this happens; rather, several factors combine.

  • More generalized and less precise task descriptions.
  • Less code in the world "understands" UTF-8, so the models have seen less of it during training. The more atypical the task, the worse the result.
  • The std::string_view class is relatively new (C++17), so the body of code using it is still small compared to, say, std::string.
  • The concept of object lifetime might be difficult for the model to learn; the pattern is distilled in the example below.
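The last point boils down to just a few lines (my distillation, not generated code):

#include <string>
#include <string_view>

std::string make_text() { return "RedColor"; }

void example()
{
  // The temporary string returned by make_text() is destroyed
  // at the end of this statement, so sv dangles immediately
  std::string_view sv = make_text();

  // Correct: bind the string to a named object first
  std::string text = make_text();
  std::string_view ok = text; // valid while 'text' is alive
}

Both generated functions made exactly this mistake, just spread across more code.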

A few words about static analysis. The way we use code-checking tools might change. There's little point in manually fixing such errors; it's easier to regenerate the entire function with a clarified task. The analyzer helps developers quickly understand why the code doesn't work as expected. With that knowledge, they can tell the generator what it got wrong, refine or rephrase the task, or break it into subtasks.

Otherwise, you're left either regenerating code with different phrasing until it works, or manually reviewing the code. Both scenarios are mediocre. Of course, static analysis isn't a silver bullet, but if it helps fix errors faster, that's excellent. That has always been its goal :)

P.S. As before, I invite you to share similar cases in the comments.


