While progress in information technology (IT) engineering is enhancing user convenience, the underlying IT infrastructures are becoming larger and more complicated. As a result, the workload of the operations managers of IT infrastructures keeps increasing.
Automating only simple tasks, as has been done to date, will not reduce the workload of operations managers. Accordingly, Hitachi has started working to automate operations management, focusing on complicated tasks such as "system troubleshooting" that have been considered difficult to automate.
NAGAI Takayuki
Researcher
MASUDA Mineyoshi
Senior Researcher
(Publication: August 31, 2017)
MASUDA: That is correct. Since the volume of data that must be handled has increased, the amount of equipment to be administered has become immense. Virtualization and other technologies have also made systems more complicated. Thus, operations management work is becoming more and more difficult. Still, such systems must be maintained around the clock and throughout the year. In practice, it is no longer possible for humans alone to conduct operations management. Against this backdrop, efforts are now actively being made to "automate operations management of IT infrastructures."
MASUDA: The word "automate" on its own covers both simple tasks and not-so-simple ones. Simple tasks include starting, stopping and restarting servers and applications, making a backup of data, and routine tasks that must be done at a predetermined time. These tasks can be automated relatively easily. Once you have prepared what must be processed and how, you just need to push a button to have them processed.
In contrast, with non-routine tasks it is hard to determine when they will occur, and thus they cannot be readily automated. A typical example of such tasks is system troubleshooting. We cannot foresee every problem. Each time one occurs, humans must determine what countermeasures to take.
For routine tasks, automation technologies have already been established. However, they are no longer sufficient to support IT infrastructures that have become large and complicated. So we need to expand automation into non-routine tasks such as troubleshooting.
NAGAI: Specifically, we run a cycle called a MAPE loop over and over. MAPE is an abbreviation of Monitoring, Analysis, Planning and Execution. The flow consists of constantly monitoring for problems, analyzing the cause if a problem occurs, planning how the problem should be solved and executing the solution at the appropriate time.
To support automation of the "Monitoring" and "Analysis" tasks, the "Hitachi Infrastructure Analytics Advisor" product was developed.
Figure 1: MAPE loop and support coverage of the Hitachi Infrastructure Analytics Advisor
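As a rough illustration of the four MAPE steps described above, the following Python sketch runs a single pass of the loop. It is not how the Hitachi products are implemented: the function names, the response-time threshold, the `suspects` lookup table and the "throttle I/O" action are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Problem:
    resource: str   # resource on which the symptom was observed
    value: float    # measured response time in milliseconds (assumed metric)


def monitor(metrics: Dict[str, float], threshold: float) -> Optional[Problem]:
    """Monitoring: return the first resource whose metric exceeds the threshold."""
    for resource, value in metrics.items():
        if value > threshold:
            return Problem(resource, value)
    return None


def analyze(problem: Problem, suspects: Dict[str, str]) -> str:
    """Analysis: look up a suspected root cause (hypothetical table);
    fall back to the affected resource itself if nothing is known."""
    return suspects.get(problem.resource, problem.resource)


def plan(root_cause: str) -> str:
    """Planning: choose a countermeasure (a fixed placeholder in this sketch)."""
    return f"throttle I/O of {root_cause}"


def execute(action: str, log: List[str]) -> None:
    """Execution: here we only record the action instead of running it."""
    log.append(action)


def mape_iteration(metrics: Dict[str, float], threshold: float,
                   suspects: Dict[str, str], log: List[str]) -> bool:
    """Run one pass of the loop; return True if a problem was handled."""
    problem = monitor(metrics, threshold)
    if problem is None:
        return False
    execute(plan(analyze(problem, suspects)), log)
    return True


actions: List[str] = []
handled = mape_iteration({"VM-A": 5.0, "VM-B": 40.0}, 20.0, {"VM-B": "VM-A"}, actions)
```

In a real deployment this function would be called repeatedly on fresh measurements, which is what makes it a loop.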
MASUDA: For the "Monitoring" part, the Hitachi Infrastructure Analytics Advisor visualizes the huge volume of performance information collected from the equipment to be administered, in a manner that is easy to see. The product allows all of the information, such as changes in the system configuration and what the administered equipment is doing, to be viewed on a single screen. Because of this, it is possible to see at a glance whether there is any performance problem.
For the "Analysis" part, the product supports identifying the root cause of an ongoing performance problem. When a problem occurs, its root cause does not always lie in the equipment where the problem is found. Very often, the behavior of other equipment affects unexpected parts of the system.
For example, when a virtual machine (VM) "A" starts a large processing task, the storage system that supports the processing may not perform as well, and a VM "B" that is connected to the same storage system may show a poorer response. In this case, the VM "A," which is the root cause of the worsened response of the VM "B," is called a "noisy neighbor."
The more complicated the system configuration becomes, the more difficult it is to identify the noisy neighbor. In order to make it easier to identify the noisy neighbor, the Hitachi Infrastructure Analytics Advisor has employed a screen called the "E2E (end to end) view."
NAGAI: The E2E view is a screen that displays the configuration information of complicated systems in an easy-to-see manner. Conventionally, in order to find the root cause, we had to check each component of the system one by one and examine the relationships of each of the components. By referring to the E2E view, however, the relationships of components can be seen on a single screen. This significantly facilitates the examination process.
Once the relationships of components are found, it is easier to identify the resource that has directly caused the problem, as well as to identify the underlying root cause and the noisy neighbor.
Figure 2: Identifying the noisy neighbor using the E2E view
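The VM "A" and VM "B" example can be expressed as a small search over component relationships. The sketch below is only an assumption about how such a search might look, not the product's actual logic: it takes a mapping from each VM to the storage system it uses, plus a per-VM I/O load figure, and ranks the VMs that share storage with the affected VM. The function name, the data layout and the load numbers are all hypothetical.

```python
from typing import Dict, List


def find_noisy_neighbors(victim: str,
                         vm_to_storage: Dict[str, str],
                         vm_iops: Dict[str, float],
                         top_n: int = 1) -> List[str]:
    """Return the VMs sharing the victim's storage, busiest first."""
    shared_storage = vm_to_storage[victim]
    # Only VMs connected to the same storage system can interfere through it.
    neighbors = [vm for vm, storage in vm_to_storage.items()
                 if storage == shared_storage and vm != victim]
    # Rank the candidates by their I/O load (missing data counts as zero).
    neighbors.sort(key=lambda vm: vm_iops.get(vm, 0.0), reverse=True)
    return neighbors[:top_n]


# Hypothetical topology: A, B and D share storage S1; C uses S2.
topology = {"A": "S1", "B": "S1", "C": "S2", "D": "S1"}
iops = {"A": 9000.0, "B": 300.0, "C": 12000.0, "D": 500.0}

suspects = find_noisy_neighbors("B", topology, iops)
```

Note that VM "C" has the highest load overall but is never considered, because it does not share a storage system with VM "B". Knowing the relationships between components is what narrows the search.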
MASUDA: Yes, it is. Over recent years, the amount of equipment to be administered has increased. The I/O load of such equipment tends to fall on the storage. This often causes a performance problem in the storage and troubles the administrators. In this regard, Hitachi has ample know-how in storage troubleshooting, as development of storage products has been one of the Company's mainstay businesses over many years. Based on this know-how, the Hitachi Infrastructure Analytics Advisor smoothly guides the administrators as to which procedures to take and what data to check so that they can get to the problem.
NAGAI: It is very difficult for the administrators to understand the complicated architecture of the storage and deal with the problem by themselves. In this sense, an advantage of the Hitachi Infrastructure Analytics Advisor is that it guides the administrators through storage troubleshooting in detail.
I myself have been engaged in storage administration to a certain degree. Still, it is difficult for me to understand the functions and behaviors of the storage in detail. In the actual development process, there were many things I didn't understand. Each time I came across such things, I asked researchers who work on storage systems. Thanks to this thorough information gathering, the Hitachi Infrastructure Analytics Advisor is full of Hitachi's know-how on storage.
MASUDA: We paid particular attention to how the product would guide the administrators so that they can get to the problem as smoothly as possible. A typical example of this is identifying the noisy neighbor, as mentioned above. I heard a story from an administrator in the field about a system user who suddenly started a load test without giving any advance notice, and this caused a problem. The administrator was called to the site at midnight to find the cause. But since nobody was aware that a load test was being conducted at such a time, it took a lot of time to find out that the equipment conducting the load test was the noisy neighbor. My hope is that the Hitachi Infrastructure Analytics Advisor will help reduce or eliminate such burdens on administrators.
MASUDA: We have already automated the "Execution" process with the Hitachi Automation Director, a product specialized in automatic execution. We are currently investigating how we can automate the "Planning" process.
In fact, the "Planning" process is the most difficult part of the MAPE loop. Nagai is the main researcher engaged in this part of our work.
NAGAI: We can find the cause by identifying the noisy neighbor. That is good, but how can we solve the problem? In such a case, the product would show plans to handle the problem at the administrator's request. That is the function we are working to achieve. The difficulty lies in determining how the product can generate the optimum solution for each particular situation. There are quite a few options for solutions, and it is meaningless if the product simply shows them all. We have to incorporate logic that lets the product create the optimum solution plan for each situation. But this is proving hard to accomplish.
Certain solution plans that should theoretically work may not actually work when they are executed. Each time we fail, we modify the logic and try again. Through such trial and error, we are working to create a product that achieves this function.
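One way to picture the "Planning" problem described above is to select from candidate plans rather than list them all. The sketch below is purely illustrative, since the interview does not describe the actual logic. It assumes each candidate plan carries an applicability check and an estimated benefit, filters out plans that do not fit the current situation, and shows only the best few. The plan names, situation fields and scoring formulas are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Situation = Dict[str, float]


@dataclass
class Plan:
    name: str
    applies: Callable[[Situation], bool]          # can this plan be used here?
    expected_gain: Callable[[Situation], float]   # rough benefit estimate, 0..1


def rank_plans(situation: Situation, plans: List[Plan], limit: int = 1) -> List[str]:
    """Keep only applicable plans and return the best `limit` by estimated gain."""
    applicable = [p for p in plans if p.applies(situation)]
    applicable.sort(key=lambda p: p.expected_gain(situation), reverse=True)
    return [p.name for p in applicable[:limit]]


# Hypothetical candidates for a storage performance problem.
candidates = [
    Plan("limit I/O of the noisy neighbor",
         lambda s: s["neighbor_iops"] > 1000,
         lambda s: min(s["neighbor_iops"] / 10000, 1.0)),
    Plan("migrate the volume to another pool",
         lambda s: s["free_pool_ratio"] > 0.3,
         lambda s: 0.95),
    Plan("take no action",
         lambda s: True,
         lambda s: 0.0),
]

situation = {"neighbor_iops": 9000.0, "free_pool_ratio": 0.1}
best = rank_plans(situation, candidates)
```

Here the migration plan has the highest nominal gain but is dropped because the assumed free-capacity condition is not met. The trial and error described above would correspond to revising the `applies` and `expected_gain` rules when a plan that looked good fails in practice.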
NAGAI: Certainly. We are currently investigating the logic by focusing on storage performance problems, which are believed to be particularly difficult to cope with. However, we don't think it is practical to build every aspect of the logic perfectly ourselves. Solution plans created only by developers like us have certain limitations, and we hope to reflect the comments received from our customers and sales companies going forward. For example, the customers would be able to input solutions directly, or we would interview the sales companies that are in contact with the customers and reflect the results. By doing so, we hope to build a scheme in which people can join hands to expand the logic we develop. We have already received positive responses from people concerned within the Company, who say they want to cooperate in creating the logic.
MASUDA: Well, you could say so. Once the MAPE loop is complete, and as the functions of the Hitachi Infrastructure Analytics Advisor and the Hitachi Automation Director are further enhanced and become easier to use, we will gradually see an increasing number of tasks performed automatically. This will promote automation of non-routine tasks that have been hard to deal with up to now, and the workload of the administrators will be reduced.
NAGAI: Over the past several years, I have been engaged in technologies to automate operations of IT infrastructures, such as the Hitachi Infrastructure Analytics Advisor. More recently, the environment surrounding this subject has been steadily expanding. For example, IT infrastructure facilities are now located outside of companies thanks to cloud computing, and the performance of equipment keeps improving considerably. Accordingly, I think we need to deal with the problems of other layers, such as middleware and applications, in addition to solutions within IT infrastructures. I would like to apply the knowledge and techniques we have built up through developing for IT infrastructures to other fields as well.
MASUDA: I wish to continue developing products that will reduce the burdens on administrators in the field. I have had many opportunities to talk with such administrators, and they are doing their job very earnestly. They are working to support the IT infrastructures day and night. We owe our living to them. That's why I want to keep trying to be of some help to the people on the frontlines.