Webinar: Parsing C++ - 10.10
Our team wrote three articles related to the code analysis of Tizen operating system. The operating system contains a lot of code, so this is the reason why it is a fertile ground for writing different articles. I think that we will go back again to Tizen in future, but right now other interesting projects are waiting for us. So, I will sum up some results of the work done and answer a number of questions that have arisen after the previously published articles.
Well, our team wrote 3 articles:
After publishing this articles two great discussions appeared: first on Reddit, second on Hacker News. Some new posts also showed up. Main posts:
All of this made me think about a discussion of some additional issues and about answering some of the questions that have been raised in the discussions.
Recently, many enthusiasts have become very active, agitating to use Rust everywhere. A notable spate of discussion on this topic followed after the article "Rewrite the Linux kernel in Rust?".
These enthusiasts couldn't be quiet writing the comments to our articles. Their suggestion was - to avoid such errors, one has to rewrite all the code on the Rust.
Actually, I don't bother if something gets rewritten or not. In the world there is so much C and C++ code, that PVS-Studio analyzer will have enough projects to verify for at least 50 years. If such static analyzers for Cobol are still used, the analyzers for C and C++ code will be in high-demand as well.
Anyway, I can not evade this issue. Are you seriously suggesting to rewrite such projects in Rust? Just up and rewrite 72 MLOC of code in Rust? That's crazy!
This would require an incredible amount of time and effort. Moreover, having spent many years developing, you get exactly the same result that has already existed! It would be much better to invest these person-years in creating something new in an existing project.
Someone will argue that after such rewriting, code will become better and more reliable. There is no such guarantee at all. In large projects, the significance of a chosen language is not so great. In addition, many libraries in C or C++ have already been debugged, while when rewriting, one will have to reinvent the wheel, which will "please" the users with various errors for many years.
I believe, the one who suggests to rewrite 72 MLOC of code is simply incompetent. You can forgive the newcomer, but if a person with experience says this, he is, apparently, a troll.
Yes, such an approach may give inaccurate results. However, it would have made sense to worry about it only if we had checked 1000, 3000 or 10000 lines of code. It would have been worth worrying about it if we had checked one project written by one team. In another project the density of bugs can be very different.
I will remind that (using PVS-Studio analyzer) I checked 2 400 000 lines of code in C/C++. That's a lot! This is the size of some projects.
This code has been tested with different projects by the way. I used the selecting method "a shot in the dark". A fine, honest way. Here is a list of the projects I have studied:
alsa-lib-1.0.28, aspell-0.60.6.1, augeas-1.3.0, bind-9.11.0, efl-1.16.0, enlightenment-0.20.0, ise-engine-anthy-1.0.9, bluetooth-frwk-0.2.157, capi-appfw-application-0.5.5, capi-base-utils-3.0.0, capi-content-media-content-0.3.10, capi-maps-service-0.6.12, capi-media-audio-io-0.3.70, capi-media-codec-0.5.3, capi-media-image-util-0.1.15, capi-media-player-0.3.58, capi-media-screen-mirroring-0.1.78, capi-media-streamrecorder-0.0.10, capi-media-vision-0.3.24, capi-network-bluetooth-0.3.4, capi-network-http-0.0.23, cynara-0.14.10, e-mod-tizen-devicemgr-0.1.69, ise-engine-default-1.0.7, ise-engine-sunpinyin-1.0.10, ise-engine-tables-1.0.10, isf-3.0.186, org.tizen.app-selector-0.1.61, org.tizen.apps-0.3.1, org.tizen.bluetooth-0.1.2, org.tizen.browser-3.2.0, org.tizen.browser-profile_common-1.6.4, org.tizen.classic-watch-0.0.1, org.tizen.d2d-conv-setting-profile_mobile-1.0, org.tizen.d2d-conv-setting-profile_wearable-1.0, org.tizen.download-manager-0.3.21, org.tizen.download-manager-0.3.22, org.tizen.dpm-toolkit-0.1, org.tizen.elm-demo-tizen-common-0.1, org.tizen.indicator-0.2.53, org.tizen.inputdelegator-0.1.170518, org.tizen.menu-screen-1.2.5, org.tizen.myplace-1.0.1, org.tizen.privacy-setting-profile_mobile-1.0.0, org.tizen.privacy-setting-profile_wearable-1.0.0, org.tizen.quickpanel-0.8.0, org.tizen.screen-reader-0.0.8, org.tizen.service-plugin-sample-0.1.6, org.tizen.setting-1.0.1, org.tizen.settings-0.2, org.tizen.settings-adid-0.0.1, org.tizen.telephony-syspopup-0.1.6, org.tizen.voice-control-panel-0.1.1, org.tizen.voice-setting-0.0.1, org.tizen.volume-0.1.149, org.tizen.w-home-0.1.0, org.tizen.w-wifi-1.0.229, org.tizen.watch-setting-0.0.1, security-manager-1.2.17.
I am hardly that "lucky" to take so many projects written by one team. It is obvious that different teams of specialists worked on this fragment.
That's why we can assume that the obtained density value of detected errors is average for the rest of the project.
After publishing my article "27000 errors in the Tizen Operating System", several unreasonable news appeared on the Internet, where people wrote about the large number of vulnerabilities found in Tizen. For example, it was possible to meet such incorrect headlines, as "27000 vulnerabilities found in the code of Tizen operating system". This, of course, doesn't reflect the reality. Let me explain why.
I'll tell you right away, that I wrote not about vulnerabilities, but about the errors. I also didn't mention, that Tizen code is of low quality. Yes, I am saying that the PVS-Studio analyzer detects many errors, but in any large project there will be a lot of errors. Therefore, the total number of errors doesn't identify the quality of the code.
Let's talk a little more about vulnerabilities. Among all of the errors that occur in programs, programmers also distinguish security weaknesses. Their peculiarity is that such concourse of circumstances is possible when this error can be used by an attacker. These types of errors are described in CWE. CWE is a community-developed list of common software security weaknesses - https://cwe.mitre.org/.
In my article I classify many errors on the classification of CWE. However, it still means nothing. The fact of the matter is that such errors are rarely can be used as vulnerabilities. In other words, you can turn CWE into CVE very seldom. More details about the terminology can be found here: https://cwe.mitre.org/about/faq.html.
I will emphasize one more time that the error can be used as a vulnerability very very rarely. In most cases, a bug is just a bug that isn't quite pleasant for users, but does not cause security problems.
Do 27000 errors demonstrate good or bad quality of code? It is impossible to say. However, this is not a scary number, as it might seem at first glance. It should be taken into account that the size of the code is 72 500 000 lines in C, C++ (excluding comments). It turns out that the PVS-Studio analyzer detects approximately 0.37 errors per 1000 lines of code. Or in other words approximately 1 error per 3000 lines of code.
Note. It shouldn't be confused with the total number of errors in Tizen code. This is something that we can identify from the total number of them. I would like to draw your attention to that, because some people interpret this data incorrectly.
So, PVS-Studio detects approximately 0.37 errors per 1000 lines of code. Is it a lot or not? It's quite average. Sometimes it can be better and worse. Here are some examples:
Let's summarize. Actually, there is no sensation. 27000 errors are shocking, but this number is so big because of the size of Tizen project. If you take another big project, there will be a lot of mistakes as well.
The purpose of my article was to show that the PVS-Studio tool can be useful for the Tizen project. Well, it seems to me I managed to do it. However, I did not expect such a strong reaction and debates that arose around this article. We regularly write such notes. They can be found here: https://www.viva64.com/en/inspections/
I will speak in a roundabout way. Unfortunately, many people read the articles very inattentively. As a result, they are quite often wrong when perceiving numbers, which are specified there. I am quite familiar with this effect and try to take this into account when writing articles. For example, in the article about "27000 errors" I specifically wrote twice that I found 900 errors by examining the 3.3% of the code. Doing so, I have emphasized that it is precisely the error number, but not the number of warnings issued by the analyzer.
Even though I made myself safe, there appeared this comment:
900 warnings in the analogue analyzer Lint doesn't mean there are 900 bugs. I would even say that these indicators are not bound in any way. There are certainly errors in the formatting of code, scopes, etc. Screw such analysts!
A person didn't read the article, but saw the number 900 and now is so excited to share his opinion with others.
It is the reason why I don't write about the number of false positives. People will look at the numbers and then comment: "this is a bad analyzer, its percentage of false positives is NN".
The fact is that the analyzer requires thoughtful configuration. Besides that most of false positives are caused by a small number of macros. In some of my articles I have already demonstrated several times, how the warnings suppression of some macros dramatically reduces the number of false positives.
Exactly the same thing happened in Tizen. However, I'm afraid that people won't pay attention to these explanations and examples. While at the same time all readers will remember the large percentage of false positives.
So, a logical question comes up: Why don't you configure a static analyzer and show a good result right away?
Here is the answer. It will take time, and still there are such interesting projects as iOS or Android, waiting for me. However, this is not the main reason why I don't want to do it. The fact is that it is unclear where to stop. I know that having made some efforts we will be able to reduce the number of false positives to zero or almost zero. For example, we reduced to zero the number of false positives when were working on Unreal Engine project (see arts. 1, 2).
So, if I reduce the number of false positives to a very small percentage using configurations, readers will tell me that this was unfair. It turns out that on the one hand I would like to leave as less false positives as possible, on the other hand, I must not overdo, showing a too perfect result. I don't really like this whole situation. I believe that in this case it is better to do nothing.
How can a programmer understand if the analyzer works well or not? It is very simple! You need to download it and to check the working project. It will become clear at once if you like the tool or not. In addition, it will be immediately evident how many false positives there are and what is their type. Perhaps, after that you will gladly join to the list of our clients.
I also would like to ask not to make a mistake by trying to run the analyzer on small projects or on test examples. There are some reasons:
Update. I will add this note after writing the article. Congratulations, readers won :). I surrender and give the number. I made the analysis of EFL Core Libraries and counted that the static analyzer PVS-Studio will issue about 10-15% of false positives. Here is the article about this: "Characteristics of PVS-Studio Analyzer by the Example of EFL Core Libraries".
As always, there were comments that modern compilers do the static code analysis well and as a result additional tools are not needed.
But additional tools are really needed. Static analyzers are specialized tools that are always ahead of the compilers according to their diagnostic capabilities. There are reasons why these are paid tools.
However, in addition to the words I have some facts. Every time we check a compiler we find bugs:
In addition to that, we have to remember that static code analysis includes not only the warnings but the whole infrastructure as well. Here are some abilities of PVS-Studio:
More details are available in the documentation.
Yes, we don't have prices at the site. This is the standard practice of companies selling solutions in the sphere of static code analysis.
We see PVS-Studio as a B2B solution. When selling our tool to different companies we need to discuss many things that effect the price of the license. There is no sense in posting a specific price on the site, it is much better to begin the discussions.
Why do not we work with individual developers? We tried, but it didn't work out for us.
Individual developers can use one of options for a free license:
I invite company representatives to discuss your questions with me via mail.
Yes, maybe with a more thoughtful review, some fragments of code will be found correct. On the other hand, during a careful analysis it may turn out that, on the contrary, I missed some mistakes. For example, I did not bother studying warning V730 - Not all members of a class are initialized inside the constructor. Trying to understand in someone else's code, if it is an error or not, that a class member was not initialized in constructor, is very laborious. However, if we do so, we will find real bugs.
Let's take a close look at one of these cases. The code refers to project org.tizen.browser-profile_common-1.6.4.
Let's start by looking at the BookmarkItem class definition.
class BookmarkItem
{
public:
BookmarkItem();
BookmarkItem(
const std::string& url,
const std::string& title,
const std::string& note,
unsigned int dir = 0,
unsigned int id = 0
);
virtual ~BookmarkItem();
void setAddress(const std::string & url) { m_url = url; };
std::string getAddress() const { return m_url; };
void setTitle(const std::string & title) { m_title = title; };
std::string getTitle() const { return m_title; };
void setNote(const std::string& note){m_note = note;};
std::string getNote() const { return m_note;};
void setId(int id) { m_saved_id = id; };
unsigned int getId() const { return m_saved_id; };
....
....
bool is_folder(void) const { return m_is_folder; }
bool is_editable(void) const { return m_is_editable; }
void set_folder_flag(bool flag) { m_is_folder = flag; }
void set_editable_flag(bool flag) { m_is_editable = flag; }
private:
unsigned int m_saved_id;
std::string m_url;
std::string m_title;
std::string m_note;
std::shared_ptr<tizen_browser::tools::BrowserImage> m_thumbnail;
std::shared_ptr<tizen_browser::tools::BrowserImage> m_favicon;
unsigned int m_directory;
std::vector<unsigned int> m_tags;
bool m_is_folder;
bool m_is_editable;
};
We are interested in members m_is_folder and m_is_editable. Note that they are at the end of the class definition. I'm ready to bet $10 that originally they were not in the first version of the class and they appeared later in the development process of the project. So, when those members were added, only the one constructor was modified.
As a result, we have these two constructors:
BookmarkItem::BookmarkItem()
: m_saved_id(0)
, m_url()
, m_title()
, m_note()
, m_thumbnail(std::make_shared<.....>())
, m_favicon(std::make_shared<.....>())
, m_directory(0)
, m_is_folder(false)
, m_is_editable(true)
{
}
BookmarkItem::BookmarkItem(
const std::string& url,
const std::string& title,
const std::string& note,
unsigned int dir,
unsigned int id
)
: m_saved_id(id)
, m_url(url)
, m_title(title)
, m_note(note)
, m_directory(dir)
{
}
One constructor initializes members m_is_folder and m_is_editable, and the other does not. I do not have absolute certainty, but most likely it is an error.
PVS-Studio analyzer gives for a second constructor the following warning: V730. Not all members of a class are initialized inside the constructor. Consider inspecting: m_is_folder, m_is_editable. BookmarkItem.cpp 268
By the way, the PVS-Studio analyzer can detect 64-bit errors. Tizen is 32-bit so far, that's why they are not actual, but I have a few words to say about this issue.
To tell the truth, there are no "64-bit errors". However, it makes sense to distinguish some mistakes in such a category and consider them separately. The fact of the matter is that such errors do not reveal themselves in 32-bit versions of applications. What is more, they do not show up at all and it is impossible to find them using any test.
Let's look at a simple example: One wants to create an array of pointers, and for this there was written this incorrect code:
int **PtrArray = (int **)malloc(Count * size_of(int));
Memory is allocated for an array of int instead of an array of pointers. Correct code should be like this:
int **PtrArray = (int **)malloc(Count * size_of(int *));
A bug in 32-bit program does not show itself. The size of the pointer and the type int are the same, so the correct size buffer is allocated. Everything works correctly and the problems will appear only when we begin working with the 64-bit version of the program.
Note. Of course, in some 64-bit systems the size of the pointer can also be the same as the size of the int type. It may also be that the sizes will be different in 32-bit systems as well. These cases are outstanding, so we do not need to dwell on them. This example will work correctly in all common 32-bit systems and fail in 64-bit ones.
On our site you can find a lot of interesting materials on 64-bit errors and ways of fixing them:
Now we will go back to Tizen project and take capi-media-vision-0.3.24 project as an example. Here you can see the result of an interesting variety of 64-bit errors. PVS-Studio analyzer issues 11 warnings for it with code V204:
These warnings are issued for the totally harmless code at the first glance. Here it is:
*string = (char*)malloc(real_string_len * sizeof(char));
What is the reason? The fact of the matter is that a header file that declares the malloc function isn't included anywhere. You can verify this by running preprocessing of .c files and watching the contents of i-files. The malloc function is used, but it is not declared.
As this program is in C language, it compiles despite the absence of its declaration. If the function is not declared, it shall be deemed that it returns the arguments of int type.
So, the compiler thinks than the function is declared as follows:
int malloc(int x);
Thanks to that, a 32-bit program compiles and works perfectly. A pointer is located to int type and all is well.
This program will be compiled in the 64-bit mode too. It even works most of the time. The important thing is this very "most" time.
Everything will work fine while the memory is allocated in the lower addresses of the address space. However, in the process of working, the memory in the lower part of the address space may be occupied or fragmented. Then the memory manager will return memory allocated outside the lower addresses. The failure will occur due to the loss of high bits in the pointer. More details about the whole process are given here: "A nice 64-bit error in C"
As a result, we see 11 defects that can lead to poorly reproducible crashes. Very nasty bugs.
Unfortunately, the diagnostics of PVS-Studio for detecting 64-bit errors generate a lot of false positives and nothing can be done. That is their nature. The analyzer often does not know what is the range of some or other values, and cannot figure out what code will work correctly. But if you want to make a reliable and fast 64-bit application, you must work with all these warnings. By the way, we can take on this diligent work and fulfill an order to port an application to 64-bit system. We have some experience in this issue (see. "How to Port a 9 Million Code Line Project to 64 bits?")
So if Tizen developers want to make a system 64-bit, our team is ready to help with it.
Thank you for attention. Those who are interested in the PVS-Studio analyzer and want to learn more about its abilities, take a look at a detailed presentation (47 minutes): PVS-Studio static code analyzer for C, C++ and C#.
Subscribe to be informed about new publications:
0