To get a trial key
fill out the form below
Team License (a basic version)
Enterprise License (an extended version)
* By clicking this button you agree to our Privacy Policy statement

Request our prices
New License
License Renewal
--Select currency--
USD
EUR
GBP
RUB
* By clicking this button you agree to our Privacy Policy statement

Free PVS-Studio license for Microsoft MVP specialists
* By clicking this button you agree to our Privacy Policy statement

To get the licence for your open-source project, please fill out this form
* By clicking this button you agree to our Privacy Policy statement

I am interested to try it on the platforms:
* By clicking this button you agree to our Privacy Policy statement

Message submitted.

Your message has been sent. We will email you at


If you haven't received our response, please do the following:
check your Spam/Junk folder and click the "Not Spam" button for our message.
This way, you won't miss messages from our team in the future.

>
>
When a Butler Becomes a Victim

When a Butler Becomes a Victim

Dec 07 2017

Today we are going to depart from the classical cliché of second-rate detectives and will tell you about the case from our experience, when the butler himself became a victim and a search for criminals led us to an unexpected result. But don't be scared. Of course, not the real people, but the programs will be described. Why are we not scared to admit our own mistakes? We certainly know that not a single team is insured from making them.

0546_When_a_Butler_Becomes_a_Victim/image1.png

Introduction

Let me start with a little bit of theory. We develop a PVS-Studio static analyzer for C, C++, and C# code. For internal testing of our product, we use a variety of tools and techniques: Unit tests and Visual Studio UI-tests, a joint review of code, specialized tester application. I would like to comment in more detail on the last point.

When developing new diagnostic rules, as well as when you change the internal mechanisms of the analyzer, it is always necessary to understand the way these changes affected the quality of analysis. For this task, we perform test runs of the analyzer on a set of large open source projects. For C/C++ code we use about 150 projects, and 52 projects are used for C#. A tester app for analyzing C/C++ projects under Windows is called SelfTester. Further on, we will talk exactly about this tool. There is also a specialized system for Linux, but for now it will remain behind the scenes.

All work is performed on a build server with Windows 10. As a build system we use Jenkins which is configured to run nightly on a regular basis, among other things, the testers. The run of Jenkins itself is performed from cmd-file with a command:

java -jar jenkins.war >
%JENKINS_PROJECTS%\Logs\%YYYY%.%MM%.%DD%_%HH%.%MI%.log 2>&1

The JENKINS_PROJECTS variable specifies the location of the Jenkins project folder on the local disk. Variables YYYY, MM, DD, HH and MI contain the date and time at the moment of executing, respectively. Thus, after launching Jenkins, the output of its console will be redirected to the file with the name of the form:

2017.11.08_17.58.log

Cmd-file starts with the help of a Windows Task Scheduler. A launch of Jenkins as a Windows service turned out to be inapplicable for us, since in this case there were problems in the work of UI tests of Visual Studio: the service does not have a desktop. Therefore, the invocation chain looks like this:

Windows Task Manager -> cmd.exe -> java.exe

So, having read the short introduction, let's move on to the question of what went wrong.

Problem

About six months ago, we began to notice that Jenkins is closed during the night builds from time to time. As a result, it had to re-run be manually, we had to check the logs, and restart some tasks. The problem was fairly regular. The investigation was started with exploration of Jenkins behaviour, because it is Jenkins, which does all the "dirty work", and therefore is the first to fall under suspicion. Yes, that very butler.

Jenkins

First and foremost we checked the configuration and customization of Jenkins tasks. The check did not find anything serious. Moreover, if an error was made, for example, in the order, priority or start time of the tasks, then the problem would probably reveal itself in a different way. It is also unlikely that Jenkins will crash due to configuration errors.

As a temporary measure, a task of a Windows scheduler was reconfigured to restart Jenkins every 30 minutes. For protection against starting a second instance, a check for the existence of an already running Jenkins process was also performed.

The next step was to examine the console logs of Jenkins. As you probably remember, they are saved in the folder %JENKINS_PROJECTS%\Logs, and the file names contain a time stamp of creation (in fact - time of Jenkins restart). After a while, we have studied the contents of this folder and, as expected, found a fairly large number of files. It showed that Jenkins continued to restart. At the same time, Jenkins console logs did not contain anything criminal. The problem became somewhat different: a part of the nightly tests stopped being launched. They just were not starting. However, those that did, often terminated with errors.

All this spoke in favor of the fact that Jenkins was, probably, not to blame. Someone from the outside hindered its work, by "killing" the java.exe process.

0546_When_a_Butler_Becomes_a_Victim/image2.png

Windows

The next chosen suspect was Windows. Well, in fact, who will take the responsibility for an unplanned processes termination if not the operating system itself? Perhaps, the system reloads after installing updates, or for some reason closes only the java.exe process. Such situation is possible when someone hinders the restart process and if the restart is not happening, but the system has already managed to kill several of the running processes. One way or another, but the analysis of Windows logs gave no results. Yes, sometimes the system really reloaded, but everything was going as usual. In addition, time and place of the "crime" did not match.

Around the same time, when we were conducting our investigation, another incident happened, which was not given much importance immediately. One night, our build server crashed due to BSOD. Automatic restart did not happen, so in the morning we could see a very bleak picture.

To analyze a memory dump we used Windows Debugger (WinDbg). This utility was found by the path "C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\windbg.exe". The dump was located in itsregular place: "C:\Windows\MEMORY. DMP ". After the dump analysis in WinDbg, the reason of BSOD was discovered:

CRITICAL_PROCESS_DIED (ef)
CRITICAL_PROCESS:  svchost.exe
....
FAILURE_BUCKET_ID:
0xEF_svchost.exe_BUGCHECK_CRITICAL_PROCESS_TERMINATED_BY_
SelfTester.exe_6771c7c0

As you probably know, a forced termination of the instances of svchost.exe process usually results in BSOD. You can check it yourself.

Yes, our impeccable SelfTester tool led to the collapse of Windows. However, as it was the first such incident for all the long history of existence of the tester, and it has been a long time since the last edits were made to its code, we decided, that it was just an accident.

In the meantime, the investigation was going on as usual. To determine the perpetrator of java.exe process termination, a native Windows utility "Global Flags" for configuring the registry keys has been used. Among other things, it allows to set the monitoring mode for the exiting of the specified process, together with adding a record of such events in the system log:

0546_When_a_Butler_Becomes_a_Victim/image3.png

In case of a forced termination of a process by a third-party process, the Windows Application log should contain an event with an Event ID = 3001 from the 'Process Exit Monitor' source.

As it often happens in such cases, you just set a trap but the victim takes cover. The next couple of weeks the server worked perfectly, the tasks were successfully performed and a Windows log did not contain the expected events.

Finally, one fine morning the problem repeated itself and the following entry appeared in the system log :

The process 'C:\Program Files\java\jre1.8.0_102\bin\java.exe' was terminated by the process 'D:\SelfTester\SelfTester.exe' with termination code 0. The creation time for the exiting process was 0x01d3541edc4f867b.

The SelfTester Again. It appears that it was the SelfTester to blame that the process java.exe was terminated, which, in turn, itself launched the Selftester process. An incredible cunning.

0546_When_a_Butler_Becomes_a_Victim/image4.png

SelfTester

To find a perpetrator is just a half of the story. Now you should understand exactly what is happening and how to fix it. To do this, let's examine how the tester works.

To analyze each C++ project, the tester runs the process of the analyzer (PVS-Studio_cmd.exe) with parameters. Upon completion of the analysis, the tester does not wait for the termination of the analyzer process, but responds to an event that a report of the analysis (the plog file) is created. After that, the tester terminates the analyzer process and all of its child processes, "just in case". Probably, at this point, something unexpected occurs.

Actually, this operation algorithm is the legacy of the past and now it is no longer necessary. Previously, the devenv.exe process with parameters was started for the analysis. The analysis was performed by the PVS-Studio plugin for Visual Studio. During this, many child processes were generated which had to be closed.

To understand what exactly was going on, we had to add logging to the process termination logic. After examining the logs, we managed to find out the following: the child processes for PVS-Studio_cmd.exe in the tester are discovered using the following WMI query:

SELECT * FROM Win32_Process WHERE ParentProcessId={0}

The problem is that at the time of this query execution, the process might not be running anymore. The analyzer process honestly performed its tasks and exited but the tester does not know about it. Sometimes it happens, that the ID of the analyzer process is identical (was re-used) to the ID of a process that originally launched the cmd.exe, which in turn, launched the java.exe process (that is, the Task Scheduler process). Windows remembers the ID of a parent processes, even if it has already exited. This way, we are getting at this cmd.exe process and terminating it, and all of its child processes (recursively).

0546_When_a_Butler_Becomes_a_Victim/image5.png

Solution

The issue was solved by adding two conditions in the tester code under which it could not terminate child processes:

  • The parent process has already terminated.
  • The parent process has not exited, but it was started later than his alleged child ones. In this case, it could potentially be re-using the ID of the process that has already exited.

Conclusion

Therefore, the investigation is over. All suspicions are dismissed from the innocents and the identified perpetrators and severely punished (fixed).

In conclusion, I would like to note that this situation is, in my opinion, a fine example of how even a small mistake can lead to an error that is difficult to reproduce.

We are not ashamed to admit our mistakes and, as I think, it was interesting for readers to get acquainted with one of them.

We wish you bugless code and successful builds!

Popular related articles
The Evil within the Comparison Functions

Date: May 19 2017

Author: Andrey Karpov

Perhaps, readers remember my article titled "Last line effect". It describes a pattern I've once noticed: in most cases programmers make an error in the last line of similar text blocks. Now I want t…
PVS-Studio ROI

Date: Jan 30 2019

Author: Andrey Karpov

Occasionally, we're asked a question, what monetary value the company will receive from using PVS-Studio. We decided to draw up a response in the form of an article and provide tables, which will sho…
PVS-Studio for Java

Date: Jan 17 2019

Author: Andrey Karpov

In the seventh version of the PVS-Studio static analyzer, we added support of the Java language. It's time for a brief story of how we've started making support of the Java language, how far we've co…
Free PVS-Studio for those who develops open source projects

Date: Dec 22 2018

Author: Andrey Karpov

On the New 2019 year's eve, a PVS-Studio team decided to make a nice gift for all contributors of open-source projects hosted on GitHub, GitLab or Bitbucket. They are given free usage of PVS-Studio s…
Static analysis as part of the development process in Unreal Engine

Date: Jun 27 2017

Author: Andrey Karpov

Unreal Engine continues to develop as new code is added and previously written code is changed. What is the inevitable consequence of ongoing development in a project? The emergence of new bugs in th…
Characteristics of PVS-Studio Analyzer by the Example of EFL Core Libraries, 10-15% of False Positives

Date: Jul 31 2017

Author: Andrey Karpov

After I wrote quite a big article about the analysis of the Tizen OS code, I received a large number of questions concerning the percentage of false positives and the density of errors (how many erro…
The Ultimate Question of Programming, Refactoring, and Everything

Date: Apr 14 2016

Author: Andrey Karpov

Yes, you've guessed correctly - the answer is "42". In this article you will find 42 recommendations about coding in C++ that can help a programmer avoid a lot of errors, save time and effort. The au…
How PVS-Studio Proved to Be More Attentive Than Three and a Half Programmers

Date: Oct 22 2018

Author: Andrey Karpov

Just like other static analyzers, PVS-Studio often produces false positives. What you are about to read is a short story where I'll tell you how PVS-Studio proved, just one more time, to be more atte…
Appreciate Static Code Analysis!

Date: Oct 16 2017

Author: Andrey Karpov

I am really astonished by the capabilities of static code analysis even though I am one of the developers of PVS-Studio analyzer myself. The tool surprised me the other day as it turned out to be sma…
The Last Line Effect

Date: May 31 2014

Author: Andrey Karpov

I have studied many errors caused by the use of the Copy-Paste method, and can assure you that programmers most often tend to make mistakes in the last fragment of a homogeneous code block. I have ne…

Comments (0)

Next comments
This website uses cookies and other technology to provide you a more personalized experience. By continuing the view of our web-pages you accept the terms of using these files. If you don't want your personal data to be processed, please, leave this site.
Learn More →
Accept