Program code began using machines to kill people as early as 1985.
A standard one-time therapeutic dose of radiation is up to 200 rads.
1,000 rads is a lethal dose, and the rogue machine was burning defenseless humans with 20,000 rads.
Let's look into the case of a system error - the worst software bug in history - that occurred as a result of incremental yet uncoordinated software improvements.
Hardware locks were removed in the Therac-25, and the safety-maintaining functions were passed to the software instead.
In this article, we will talk about how the investigation went and what lessons IT engineers, programmers, and testers should learn from this story so that nothing like it happens again.
The Therac-25 is a radiation therapy machine, a medical linear accelerator produced by Atomic Energy of Canada Limited (AECL).
The plan of the facility is shown in the figure below.
And here's a commercial for housewives.
Between June 1985 and January 1987, this machine caused six radiation-overdose accidents, in which some of the patients were exposed to tens of thousands of rads. At least two patients died as a direct consequence of the overdoses.
The technician recalled changing the command from 'x' (X-ray mode) to 'e' (electron mode) that day. It was found that making this edit quickly enough resulted in a radiation overdose in almost 100% of cases.
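The failure pattern behind that quick edit can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not the Therac-25 code: the class `ToyConsole` and all of its fields and methods are hypothetical names invented for the illustration. It shows how a setup routine that reads the entered mode once, and never compares it with the screen again, can end up reporting "ready" for a configuration the operator has already changed.

```python
from dataclasses import dataclass


@dataclass
class ToyConsole:
    """Toy model of an operator console (hypothetical, not the real code)."""
    entered_mode: str = "xray"   # what the operator sees on the screen
    configured_mode: str = ""    # what the hardware is being set up for

    def begin_setup(self) -> None:
        # Setup reads the entered mode once and then takes a while to finish.
        self.configured_mode = self.entered_mode

    def edit_mode(self, new_mode: str) -> None:
        # A fast edit lands while setup is still running.
        self.entered_mode = new_mode

    def finish_setup_unchecked(self) -> bool:
        # Flawed variant: reports "ready" without comparing screen and hardware.
        return True

    def finish_setup_checked(self) -> bool:
        # Safer variant: refuse to proceed if the entry changed during setup.
        return self.configured_mode == self.entered_mode


console = ToyConsole()
console.begin_setup()             # hardware starts configuring for "xray"
console.edit_mode("electron")     # operator corrects the entry quickly
print(console.finish_setup_unchecked())  # True: mismatch goes unnoticed
print(console.finish_setup_checked())    # False: mismatch is caught
```

The point of the checked variant is simply that the final go/no-go decision compares what was requested with what was configured, instead of trusting that nothing changed in between.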
While prosecuting the cases against AECL, the Smith County District Attorney's office in Tyler, Texas, asked Nancy Leveson (a Computer Science professor at the University of California, Irvine, at the time) to assist as an expert in the investigation. Leveson has made considerable contributions to system and software safety. She and Clark Turner spent three years collecting the materials and reconstructing the events related to the Therac-25 accidents. This is an important result, because for most safety incidents the available information is incomplete, inconsistent, or incorrect.
AECL built three versions of their machine: Therac-6, Therac-20, and Therac-25. Versions 6 and 20 were manufactured in partnership with CGR, a French company. The partnership was dissolved before the Therac-25 was designed, but both companies retained access to the designs and source code of the earlier models.
The Therac-20 codebase was developed from the Therac-6. All three machines used a PDP-11 computer. The Therac-6 and Therac-20 didn't need that computer, though: both were designed to operate as standalone devices. In manual mode, a radiotherapy technician would set up various parts of the machine by hand, including positioning the turntable to place one of three devices in the path of the electron beam.
In electron mode, scanning magnets would be used to spread the beam out to cover a larger area. In X-ray mode, a target was placed in the electron beam with electrons striking the target to produce X-ray photons directed at the patient. Finally, a mirror could be placed in the beam. The electron beam would never switch on while the mirror was in place. The mirror would reflect a light which would help the radiotherapy technician to precisely aim the machine.
On the Therac-6 and Therac-20, hardware locks prevented the operator from doing something dangerous, such as selecting a high-power electron beam without the X-ray target in place.
Attempting to activate the accelerator in an invalid mode would trip a protective device, bringing everything to a halt. The PDP-11 and associated hardware were added as a convenience. The technician could enter a prescription on a VT-100 terminal, and the computer would use servos to position the turntable and other devices.
Hospitals loved the fact that the computer was faster at setup than a human. Less setup time meant more patients per day.
When it came time to design the Therac-25, AECL decided to go with computer control only. Not only did they remove many of the manual controls, they also removed the hardware locks. The computer would keep track of the machine setup and shut things down if it detected a dangerous situation.
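To make the idea of a software lock concrete, here is a minimal sketch of the kind of check that had to replace the hardware locks. It is an illustration only: the names `Turntable` and `beam_permitted` are hypothetical, and the rules encode nothing more than the three turntable positions described above. Such a check is only as good as its inputs, so the sketch assumes the turntable position comes from a sensor reading rather than from the program's own record of what it asked for.

```python
from enum import Enum


class Turntable(Enum):
    """The three devices that can sit in the beam path (hypothetical names)."""
    SCAN_MAGNETS = "scan_magnets"   # electron mode
    XRAY_TARGET = "xray_target"     # X-ray mode
    MIRROR = "mirror"               # aiming light only


def beam_permitted(sensed_position: Turntable, high_power: bool) -> bool:
    """Return True only for the valid combinations described in the article."""
    if sensed_position is Turntable.MIRROR:
        return False  # the beam must never switch on with the mirror in place
    if high_power:
        # A high-power beam is only valid with the X-ray target in the path.
        return sensed_position is Turntable.XRAY_TARGET
    # A low-power beam is only valid with the scanning magnets in the path.
    return sensed_position is Turntable.SCAN_MAGNETS


print(beam_permitted(Turntable.XRAY_TARGET, high_power=True))   # True
print(beam_permitted(Turntable.SCAN_MAGNETS, high_power=True))  # False
```

The second call is the dangerous case from the text: a high-power beam without the X-ray target in place.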
Well, well...
At least four bugs were found in the Therac-25 software that could cause radiation overdose.
A number of potential problems were also found. For example, the multitasking operating system did not synchronize access to data shared between tasks.
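What synchronization buys can be shown with a small sketch in modern Python. This is an assumption-laden illustration, not a reconstruction of the Therac-25 software: `SharedSetup` and `run_demo` are hypothetical names. Two values that must always change together (mode and power) are guarded by one lock, so a task that reads them can never observe a half-updated pair such as electron mode with high power.

```python
import threading


class SharedSetup:
    """Mode and power level shared between tasks (hypothetical example)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._mode = "electron"
        self._power = "low"

    def update(self, mode: str, power: str) -> None:
        # Both fields change inside one critical section.
        with self._lock:
            self._mode = mode
            self._power = power

    def snapshot(self) -> tuple:
        # Readers take the same lock, so they always see a matching pair.
        with self._lock:
            return (self._mode, self._power)


def run_demo(iterations: int = 10_000) -> set:
    """Run one writer and one reader; return every pair the reader saw."""
    setup = SharedSetup()
    seen = set()

    def writer() -> None:
        for i in range(iterations):
            if i % 2:
                setup.update("xray", "high")
            else:
                setup.update("electron", "low")

    def reader() -> None:
        for _ in range(iterations):
            seen.add(setup.snapshot())

    threads = [threading.Thread(target=writer), threading.Thread(target=reader)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return seen


valid_pairs = {("electron", "low"), ("xray", "high")}
print(run_demo() <= valid_pairs)  # True: no mixed pair is ever observed
```

Without the lock, nothing would stop the reader from running between the two assignments in `update` and acting on a combination that nobody ever requested.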
Complete list of fixes in English:
Source: Nancy G. Leveson, Therac-25 Accidents
The manufacturer said that the hardware and software had been tested over many years. However, the investigation found that only minimal testing had been done on a simulator, while most of the effort had been directed at testing the integrated system. In other words, the developers neglected unit testing and did integration testing only.
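For readers who have not written one, a unit test exercises a single small piece of logic in isolation, without the rest of the machine attached. The sketch below uses Python's standard `unittest` module. The function `dose_is_acceptable` and the constant `MAX_SINGLE_DOSE_RADS` are hypothetical; the 200 rad limit is simply the therapeutic figure quoted at the start of this article, used here as an assumed threshold.

```python
import unittest

# Assumed limit, taken from the one-time therapeutic dose quoted earlier.
MAX_SINGLE_DOSE_RADS = 200


def dose_is_acceptable(dose_rads: float) -> bool:
    """Accept only positive doses up to the assumed single-dose limit."""
    return 0 < dose_rads <= MAX_SINGLE_DOSE_RADS


class DoseLimitTest(unittest.TestCase):
    def test_typical_dose_is_accepted(self):
        self.assertTrue(dose_is_acceptable(180))

    def test_limit_itself_is_accepted(self):
        self.assertTrue(dose_is_acceptable(200))

    def test_overdose_is_rejected(self):
        self.assertFalse(dose_is_acceptable(20_000))

    def test_non_positive_dose_is_rejected(self):
        self.assertFalse(dose_is_acceptable(0))
        self.assertFalse(dose_is_acceptable(-5))
```

Saved to a file, these tests can be run with `python -m unittest`. Each one pins down a single boundary of a single function, which is exactly the kind of check that integration testing alone tends to miss.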
A naive assumption is often made that reusing software or using commercial off-the-shelf software increases safety because the software has been exercised extensively. However, reusing software modules does not guarantee safety in the new system to which they are transferred, because that system has its own design specifics. In many cases, rewriting the software from scratch may be safer.
In this case, the manufacturer chose to reuse the program code from the Therac-6 and Therac-20, though the Therac-6 did not provide X-ray mode at all, while the Therac-20 was equipped with hardware locks.
Since the Therac-25 events, the FDA has changed its attitude to many of the issues involving safety-critical systems and has moved to improve its reporting system and to augment its procedures and guidelines to include software. It was an important lesson not only for the FDA, but for everyone who builds industrial safety-critical systems.
According to data from the Software Engineering Institute, there is an average of 1 bug per 100 lines of code, and 98% of device malfunctions caused by software bugs could have been averted through proper testing. Now that I know this, I feel like joining the "let me see the code" movement. Sure, measures were taken after all those big incidents, but I wouldn't want to go to the dentist one day and be treated with a drill whose angular velocity is controlled by a variable with "just one extra zero" added by mistake. Dear testers (as well as programmers and developers), please do your job properly.
The University of California, Berkeley: Computer Science 61A — Lecture 35: Therac-25
This article was originally published (in Russian) on habrahabr.ru. The original and translated versions were posted on our blog with the permission of the author.