Infrastructure equipment that supports our society has become increasingly miniaturized in recent years thanks to advances in semiconductor production technologies. At the same time, the miniaturization of semiconductors has revealed that faint cosmic radiation reaching the earth can cause such equipment to malfunction.
To address this issue, Hitachi has established a method to assess the impact of cosmic rays on such equipment and developed countermeasure technologies to prevent malfunctions. "Non-stop infrastructure" made possible by more reliable equipment should help realize a society in which people can live with a sense of security.
(Publication: October 25, 2016)
UEZONO: In daily life, most of us have no clear image of what "cosmic rays" are. However, we are bombarded by various particles from outer space every day. These particles come in a variety of types, including electrons, protons and neutrons. Of these, neutrons are so small in their particle size that they may pass through buildings.
If a neutron happens to collide with a semiconductor and generates noise in it, and that noise is picked up by a memory device, the value (charge) stored in a memory cell is flipped from zero to one or vice versa. Such a flip may cause the equipment to malfunction or produce erroneous computation results. This is how cosmic-ray-induced neutrons cause a soft error.
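As a minimal sketch of the bit flip described above, the following Python snippet inverts one bit of a stored value. The 8-bit word, the chosen bit position, and the helper name `flip_bit` are illustrative assumptions, not details from the interview.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return `value` with the given bit inverted (0 -> 1 or 1 -> 0)."""
    return value ^ (1 << bit)


# Hypothetical 8-bit memory word holding the value 5.
stored = 0b00000101

# A single-bit upset at bit 6 (position chosen arbitrarily for illustration).
upset = flip_bit(stored, 6)

print(f"before: {stored:08b} ({stored})")
print(f"after:  {upset:08b} ({upset})")

# Nothing is physically damaged: rewriting the word restores the correct value.
rewritten = stored
print(f"rewritten: {rewritten:08b} ({rewritten})")
```

A single inverted bit is enough to turn 5 into 69 here, which is why one flipped cell can produce an erroneous computation result.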
SHIMBO: Soft errors have long been known to exist, but they were believed to occur only in very exceptional cases. I had long been involved in designing digital circuits, and I used to think that such a phenomenon would never arise.
UEZONO: Earlier, when semiconductor memory cells were rather large, tiny noises would not flip the value in a cell. More recently, as technological advances have made cells smaller and the electric charge retained in each cell has shrunk with them, it has become apparent that even tiny noises can flip the stored values.
SHIMBO: In an experiment in which equipment was artificially irradiated with neutrons, I was very surprised to see how frequently soft errors actually occurred. Later, as we investigated the incidence rate of soft errors for a variety of semiconductor devices, we found that certain devices had an incidence rate that was even higher than their permanent failure rate.
SHIMBO: Unlike malfunctions due to aging degradation or permanent device failures, a soft error is cleared once new data is written over it. Even if a piece of equipment momentarily malfunctions, it operates correctly again after we stop and restart it. But we do not know why the error occurred, and we cannot reproduce it, which leaves us in trouble. The most notable feature, and the most troublesome aspect, of a soft error is that the malfunction does not repeat.
UEZONO: We have developed technologies to assess and address soft errors. Our approach was to repeat a cycle of assessing and then addressing soft errors. In the cycle, the equipment is first assessed to understand its current state. Then, countermeasures are taken based on what the assessment has revealed. To find out whether the countermeasures are effective, the assessment is repeated. That's how the cycle goes around. Of these technologies, I have worked on the assessment side, while Mr. Shimbo has been in charge of the countermeasure technology.
UEZONO: The assessment technology employs a method in which the equipment is irradiated with strong neutron beams to cause errors on purpose. We know from the design stage which components are installed in the sections to be irradiated. Judging from the positions of the irradiated sections, we can infer which components generated the errors. So the assessment technology causes soft errors while changing the position of the neutron beam, and estimates which components in the equipment caused them. With this technology, we can reliably identify which sections of the equipment are vulnerable to neutrons and what countermeasures are needed.
Figure 1: Overview of the assessment technology for soft errors due to cosmic-ray-induced neutrons
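The position-based estimation described above can be sketched in Python as follows. This is a simplified illustration, not Hitachi's actual procedure: the board layout, the component names, the error counts, and the rule of splitting errors evenly among the components in a section are all made-up assumptions.

```python
from collections import Counter

# Hypothetical design data: beam position -> components installed in that section.
LAYOUT = {
    "A1": ["FPGA"],
    "A2": ["SRAM", "CPU"],
    "B1": ["CPU"],
    "B2": ["DRAM"],
}


def rank_suspects(observations, layout=LAYOUT):
    """Rank components by the soft errors attributed to them.

    observations: iterable of (beam_position, error_count) pairs.
    Errors seen at a position are split evenly among the components
    installed there (a naive assumption made only for this sketch).
    """
    scores = Counter()
    for position, errors in observations:
        components = layout[position]
        for component in components:
            scores[component] += errors / len(components)
    return scores.most_common()


# Illustrative counts only; these are not measured results.
runs = [("A1", 40), ("A2", 6), ("B1", 2), ("B2", 1)]

for component, score in rank_suspects(runs):
    print(f"{component}: {score:.1f}")
```

With these made-up numbers the FPGA section ranks first, which would mark it as the first candidate for countermeasures.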
UEZONO: A variety of devices are installed in equipment. From the assessments, we learned that a semiconductor device called an FPGA (field programmable gate array) is the most vulnerable to neutrons. An FPGA is a convenient device because its logic information, that is, the operation program stored in its memory, can be overwritten many times. FPGAs are being adopted in a growing number of products, including the most advanced communication equipment. Partly because of this, we decided to focus on FPGAs in developing the technology to address the problem.
UEZONO: We could not use neutron beams for assessments whenever we wanted. That was the most difficult part of the research. For the assessments, we needed equipment that could deliver, in a short period of time, several hundred million times as many neutrons as a normal environment. Such a large number of neutrons is needed to cause errors. In Japan, only two facilities have equipment that emits such neutron beams and are open for our use.
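To give a feel for what "several hundred million times" means, the arithmetic can be sketched as below. The acceleration factor of 3e8 is an assumed value chosen only to represent "several hundred million", and `equivalent_field_years` is a hypothetical helper name.

```python
HOURS_PER_YEAR = 24 * 365


def equivalent_field_years(beam_hours: float, acceleration: float) -> float:
    """Years of natural exposure that deliver the same number of neutrons
    as `beam_hours` in a beam `acceleration` times more intense."""
    return beam_hours * acceleration / HOURS_PER_YEAR


# Assumed acceleration factor, standing in for "several hundred million".
ACCELERATION = 3e8

years = equivalent_field_years(1.0, ACCELERATION)
print(f"1 hour in the beam is roughly {years:,.0f} years in a normal environment")
```

Under this assumption, one hour of beam time corresponds to tens of thousands of years of natural exposure, which is why errors that are rare in the field can be observed within a single test session.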
SHIMBO: When the facilities in Japan are not available to us, we use overseas facilities. There was an occasion when we could not use the domestic facilities, but we still had to produce results by the deadline.
UEZONO: At that time we flew to Sweden to conduct assessments. However, the operation of the equipment was not the same as what we were accustomed to. The power source was also different from what we use in Japan. In addition, there was clerical work, including export control procedures for the equipment we brought with us for the assessments and countermeasures. All of this was rather tough for us.
UEZONO: That's right. So we are going to develop technology that would allow us to make assessments with neutron beams having a much lower level of energy.
Originally, for implementing assessment tests for neutron-induced soft errors, we had no choice but to visit the Los Alamos National Laboratory in the U.S. One of the purposes of our technological development conducted thereafter was to enable us to conduct the assessments in Japan. Our development was recognized as an international standard test method in 2006. Even so, the method is applicable only for final checking of equipment and the like, because we need to use neutron beams that can be irradiated only at particular facilities.
Instead, if we could make assessments at facilities where only weak neutron beams are available for use, we would be able to check if certain equipment is all right in advance, even during its design stage. This would surely change the way we design equipment. I believe that if we could eventually succeed in standardizing the method so that it uses neutron beams having much lower energy and can make assessments at ordinary laboratories, then the technology would be accepted and used more widely in society.
SHIMBO: As we already mentioned, we learned from the assessment results that FPGAs are most vulnerable to neutrons. The cause of the vulnerability was the memory blocks inside an FPGA. The logic information, or program, to be written on an FPGA is stored in its internal memory blocks. Soft errors are prone to occur in these memory blocks, because their memory capacity is fairly large.
Originally, an FPGA incorporated a function to find errors by dividing the internal memory blocks into smaller areas and checking each area one by one. However, as FPGAs grew larger, finding errors took longer. This was a problem. For example, if an error occurred in a particular area immediately after it was checked, the error would not be found until the next round of checking reached that area. If the error propagated to the equipment during that period, the equipment would malfunction. Thus, in order to find any error in an FPGA as soon as possible, we developed a circuit that makes it possible to freely set the sequence in which the checks are conducted.
In fact, while an FPGA is operating, it does not use all areas of the device. Therefore, we arranged so that the checks only go through the areas that are actually used or so that important areas are preferentially checked. By doing so, we reduced the time required for finding errors. This allows the errors to be corrected before they are transferred to the equipment, and the equipment can continue operating without producing malfunctions.
Figure 2: Overview of countermeasure technology against soft errors due to cosmic-ray-induced neutrons
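The effect of restricting or prioritizing the checks can be illustrated with a small Python model. This is only a sketch under stated assumptions: the number of areas, the set of areas in use, the schedules, and the unit check time are hypothetical, and the function models the worst-case wait before an error is found rather than the circuit itself.

```python
def worst_case_latency(schedule, area, check_time=1.0):
    """Longest wait between consecutive checks of `area` when `schedule`
    (a list of area names) is repeated forever, in units of `check_time`."""
    slots = [i for i, name in enumerate(schedule) if name == area]
    if not slots:
        raise ValueError(f"{area!r} is never checked")
    n = len(schedule)
    # Distance from each check of `area` to the next one, wrapping around.
    gaps = [
        (slots[(k + 1) % len(slots)] - slots[k]) % n or n
        for k in range(len(slots))
    ]
    return max(gaps) * check_time


# Hypothetical device with 8 areas, of which only 3 are actually used.
full_scan = [f"area{i}" for i in range(8)]
used_only = ["area0", "area3", "area5"]
# "area3" is treated as the important area and checked twice per round.
prioritized = ["area3", "area0", "area3", "area5"]

for name, schedule in [("full scan", full_scan),
                       ("used areas only", used_only),
                       ("prioritized", prioritized)]:
    print(f"{name}: area3 is rechecked within "
          f"{worst_case_latency(schedule, 'area3')} time units")
```

In this toy model, skipping unused areas shortens the worst-case wait for `area3` from 8 time units to 3, and checking it twice per round shortens it to 2, at the cost of a longer wait (4 units) for the other used areas.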
SHIMBO: Exactly. This technology doesn't prevent errors from occurring. Rather, it masks errors so that they do not leave the device. There are methods to enhance resistance against soft errors by changing the configuration of semiconductors. To do so, we have to redesign the semiconductors, which takes a lot of time and money. Furthermore, if we buy semiconductors from external sources, we cannot change their internal configuration ourselves. Accordingly, in developing this countermeasure technology, we presented ideas on how to make the best use of purchased semiconductors. "Making the best use" was the most difficult part of our latest countermeasure technology, but was also an interesting subject for researchers.
SHIMBO: That's right. After the countermeasures were taken, the errors were corrected at an early stage. Consequently, we confirmed that the equipment as a whole continued operating normally. The rate at which soft errors occurred within the equipment under neutron irradiation was the same before and after the countermeasures were taken. After the countermeasures, however, the equipment did not show any malfunctions even when errors occurred in the FPGA within the equipment.
SHIMBO: The technologies we developed were meant to enhance the reliability of the equipment. Beyond this, however, I think they can also evolve from the perspective of security.
Electronic products, and especially FPGA devices, can produce malfunctions if they are artificially attacked with malicious intent. If the technologies we have developed are applied to such cases, the electronic products would be able to detect such attacks by themselves. In other words, the technologies should be able to contribute to enhancing the security of infrastructure systems.
SHIMBO: I want to continue doing research that will please and support our customers. I hope to contribute to people's lives by further developing and evolving systems that they can use with peace of mind in terms of security as well as reliability.
UEZONO: Ever since I joined Hitachi, I had hoped to work on the Company's infrastructure products, and especially on their reliability. This dream has come true, and luckily I have been engaged in soft error-related research. Going forward, I want to keep developing technologies that can address the variety of factors, not just soft errors, that can threaten reliability. By doing so, I wish to realize "non-stop" infrastructures.