Tony Grayson Discusses Putting an End to Human Error in Data Center Outages
Over the past decade, human error has played a significant role in most of the industry’s Data Center outages. According to the Uptime Institute’s 2021 Annual Outage Analysis, an aggregated year-on-year average of 63% of failures is due to human error.
If human error continues to be such a problem in the industry, we must ask ourselves why we haven’t been able to fix it. Tony Grayson suspects that this is because we have not been addressing the real root cause, one he believes is best understood through the Challenger explosion.
On 28 January 1986, the Space Shuttle Challenger broke apart after the failure of an O-ring that had degraded in the cold weather at launch. In her 1996 book The Challenger Launch Decision, sociologist Diane Vaughan theorized that NASA’s decision to launch on such a cold day was due to a social normalization of deviance. This theory describes a situation in which people within an organization come to tolerate behaviors or practices once considered unacceptable, even when those practices fall below their own standards for personal or equipment safety. This happens for several reasons, including, but not limited to, a feeling that the rules don’t apply, inconsistencies in the level of knowledge, a lack of understanding, or fear of speaking up. Vaughan theorizes that all these factors are often found in high-pressure environments.
Starting With Technicians
Consider a Data Center technician on shift who is continually engaged in activities that involve varying degrees of risk. When a problem is identified during a workday, the technician must either accept the risk of a quick fix or correct the problem via established procedures. But why would a technician feel enough pressure to risk themselves or the equipment rather than follow the procedures?
On a typical workday, like anyone, a technician must first balance work with their personal life, which is especially hard considering the rhythm of shift work. Then, while on the job, they must conduct preventive and corrective maintenance, practice casualty response, study for continuation training and additional certifications, fill out and update tickets, and still find time to eat. This leaves little time for extra work.
But because the cost of an outage is so high, risk tolerance in Data Centers is extremely low. This results in a heavy reliance on controls and on evidence of work completed. Unfortunately, this bureaucratic method can create a culture of apathy among technicians, because the process can be time-consuming and raising a problem carries consequences.
If a technician identifies a problem and follows the established procedures, there will be a fact-finding investigation, which some technicians consider a witch hunt. The fact-finding is followed by corrective actions, which are often viewed as punitive or cumbersome, especially when they result in additional work for a small team that is already overtasked.
Furthermore, the technician might lose their job if the problem turns out to be the result of their own mistake. Together, these factors can create a culture where it becomes acceptable to do whatever is needed to get a job done without considering the possible repercussions, such as an outage or, in more extreme situations, an injury to the technician.
How Can Data Center Leaders Evaluate Their Processes?
To address the potential for a social normalization of deviance developing in a Data Center, leaders should consider the following four points and then tailor additional actions based on their findings:
- Leaders need to ensure that they are communicating core values and beliefs in a way that develops buy-in on the deckplate. The message should be delivered so that the team can absorb not only what is being said, but why things are being done that way. The “why” is important because the team needs to see their role in the future of the organization.
- Data Center leaders need to develop processes that ensure there is meaningful work and pathways to success, while identifying and removing the tearing-down forces that might erode personal integrity. Teams need to feel confident that leaders have their back and are invested in their careers and families.
- Data Center leaders need to take a hard look at their existing safety culture to see where organizational pressures could influence a technician’s risk tolerance. This will help reveal how program policies can lead to deviance, and help ensure that communication and supervision are effective (but not overbearing), creating a safe environment and helping to prevent poor risk decisions.
- The Data Center industry needs to strengthen its ability to incorporate human factors into its root-cause analysis by adopting the Human Factors Analysis and Classification System (HFACS). This system was developed in response to a trend showing that some form of human error was the primary cause in 80% of all flight accidents in the Navy and Marine Corps. It can give Data Centers a more comprehensive, holistic approach to identifying and mitigating human-factor problems.
As leaders in the Data Center industry, we have a responsibility to reflect on how our leadership and our broader organizational, and sometimes overly bureaucratic, processes affect our people. Simply adding more checks might have the reverse effect by creating a culture where risks are taken just to get the job done. Having 63% of outages caused by human error is too much for something we should be able to control. Tony Grayson believes we have the power to make things better, but can we overcome our own inertia to do so?
About Tony Grayson: Tony Grayson is SVP Physical Infrastructure at Oracle. He has over 25 years of experience in technology and leadership, with expertise in global strategy, financial management, engineering, software, telecommunications, sustainability, and operations for data centers.