Implementing Resilience Engineering in IT

A Definition of Resilience Engineering


Resilience engineering requires designing systems and equipment to fit human nature. One of the classic signs that something is wrong with a manufacturing process is that the people who work there look like they work out, because the job makes them use their bodies that hard and that long. A process like this will fail when someone is not strong enough, or is too tired, to do what it requires. Other problems arise when we assume that people will fit the process by acting just like machines. That assumption is the designers’ fault, and it is what resilience engineering seeks to correct.

Resilience engineering goes beyond poka-yoke, or mistake proofing, such as a product that can be assembled in only one way or safety buttons that must be held down while someone operates a press. It designs the equipment and operations to fit human nature, so that making a mistake is almost impossible by design.


How to Implement Resilience Engineering


Your equipment and processes must also be designed to suit the human mind. Your process cannot rely on humans having perfect memory or being fully attentive and alert for the entire time they are on shift. People forget confusing procedures, get distracted (often by parallel processes) and get bored when they do the same thing over and over. Sometimes they cannot keep up with and prioritize the constant stream of notices and alerts that compete for their attention, so they don’t know what to do or they take the wrong action.

I’ve previously written about human attention as the most limited resource in the modern era. Processes are regularly created that assume that pop-up informational notices and warnings are value added, neglecting the time it takes for someone’s attention to shift back to the task at hand or the serious distraction the constant stream of pop-ups and notification beeps creates. For example, a user interface that throws up so many informational notices that someone may not see an urgent warning for some time has created its own failure mode.

When the system generates many competing notices of varying priorities, it distracts and confuses the user, which increases the odds of failure. Or users get in the habit of ignoring and closing every pop-up, whether it is a useless informational notice, a barely useful one or a critical warning. All of these cases are the opposite of a mistake-proof design or resilience engineering.
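One way to keep urgent warnings from being buried is to let only the most severe notices interrupt the user and hold the rest for a digest. The sketch below is illustrative only: the `Severity` levels, the `Notice` type and the `triage` function are hypothetical names, not part of any particular alerting product.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical priority levels; a real system would define its own.
    INFO = 1
    WARNING = 2
    CRITICAL = 3


@dataclass
class Notice:
    severity: Severity
    message: str


def triage(notices, interrupt_at=Severity.CRITICAL):
    """Split notices into those shown immediately and those held for a digest."""
    interrupt = [n for n in notices if n.severity >= interrupt_at]
    digest = [n for n in notices if n.severity < interrupt_at]
    # Most severe first, so the urgent warning is at the top of what the user sees.
    interrupt.sort(key=lambda n: n.severity, reverse=True)
    return interrupt, digest


notices = [
    Notice(Severity.INFO, "Backup finished"),
    Notice(Severity.CRITICAL, "Disk failing"),
    Notice(Severity.WARNING, "Certificate expires in 30 days"),
]
interrupt, digest = triage(notices)
```

Here only "Disk failing" interrupts the user; the other two notices wait in the digest until the user chooses to review them.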

Another problem arises with work systems that put too much of an intellectual burden on the employee. For example, systems that assume people can switch gears instantly when multi-tasking, and then give them multiple tasks to do simultaneously, increase the odds that someone will make mistakes. Perhaps they forget what they were doing and fail to return to it, or they return to it but miss steps. Or they continue the actions they were doing but now apply them to the wrong item. Over-work leads to fatigue and errors, but systems are typically designed as if people don’t get tired at the end of a shift or when working overtime. Demanding that people work from home or on the go doesn’t solve this problem, since someone pulled away from personal affairs or driving may make a hasty, incorrect decision just so they can get back to what they were doing.

These are the times people just select the default option or the first auto-fill suggestion before moving on. You can reduce the errors by requiring attention checks, not allowing auto-fill on critical tasks that require care, and reducing distractions. “Are you sure?” pop-ups are hardly of value in these cases because they are as easily clicked and closed as the other selections the person made without thinking.
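One common form of attention check is to make the user type the name of the item they are about to change, rather than click through a yes/no dialog. The following is a minimal sketch of that idea; `confirm_by_typing` and `decommission_server` are hypothetical names used only for illustration.

```python
def confirm_by_typing(expected: str, typed: str) -> bool:
    """Return True only if the user typed the exact name of the item being changed."""
    return typed.strip() == expected


def decommission_server(hostname: str, typed: str) -> str:
    # Hypothetical critical action: it refuses to run on a reflexive "y" or "yes".
    if not confirm_by_typing(hostname, typed):
        raise ValueError(f"Confirmation did not match {hostname!r}; nothing was changed.")
    return f"{hostname} decommissioned"
```

Because the expected text differs for every item, the user cannot satisfy the check from habit; they have to read which item is affected.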

When the default solution in a company is to blame the people who made the mistakes and retrain them, it prevents the root cause analysis that would show that the bad process is to blame. In fact, the end result may be altering the process the person followed to make it more complex and training the person who made the mistake in the new process, but all too often failing to train the other employees in it. Now the solution for one user nearly ensures mistakes by the others.

Knowledge-based errors include applying the wrong procedure when an error occurs and not knowing what to do. The former case occurs when someone can’t figure out what an error message means. The latter situation may be solved by training, but it can also occur when someone seeks help and can’t find it. I’ve even seen this error occur on help desks when company policies punished seeking subject matter expert advice or escalating tickets to a higher level. The end results ranged from first-level tech support applying the wrong process to the problem, because they couldn’t ask whether it was the right one, to spending an extensive amount of time troubleshooting a matter that the expert could have solved in a fraction of the time.

The unavailability of knowledge workers can also force users to make knowledge-based errors at their own level because they couldn’t get an expert opinion on the right course of action. Managers and knowledge workers can make knowledge-based errors themselves when users and lower-level employees don’t give them all the information for fear of the consequences. When it is considered bad to report bad news, the problem gets worse before it gets solved. A climate of fear or scarcity thus creates the environment for more mistakes to be made.

Rules and procedures are often based on legal compliance, even if the procedures don’t fit the work environment. This results in people getting in the habit of violating the rules to get the work done. Once working around the standard, non-working process becomes routine, the odds increase that people will start breaking critical rules too. Think about users getting in the habit of jailbreaking their phones to install software they want, or adding people to project roles and then asking for permission afterward because a higher-level manager told them to do X.

Rules and procedures can become a legalistic hamstring that leaves no room for logic and intelligent action. An example is when your user can’t confirm that they are who they say they are via either of two alternative processes, because a single situation prevents both from verifying. You must have a formal process for someone to handle the exceptions or rule conflicts, to avoid the problem of people going outside the formal process to do their job.
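A formal exception path can be as simple as routing the stuck case to a person instead of leaving the user at a dead end. The sketch below assumes each verification check returns `True` (passed), `False` (failed) or `None` (could not be performed); the function name, the result strings and the review queue are all hypothetical.

```python
def verify_identity(user, checks, exception_queue):
    """Run every check; escalate to a human reviewer if none could be performed."""
    results = [check(user) for check in checks]
    if any(r is True for r in results):
        return "verified"
    if any(r is False for r in results):
        return "rejected"
    # No check could run at all (for example, the phone behind both methods was lost).
    # Hand the case to a person rather than forcing the user outside the process.
    exception_queue.append(user)
    return "escalated"


# Hypothetical checks that both depend on the user's phone.
def sms_code(user):
    return None if user["phone_lost"] else True


def authenticator_app(user):
    return None if user["phone_lost"] else True


queue = []
outcome = verify_identity({"name": "pat", "phone_lost": True}, [sms_code, authenticator_app], queue)
```

In this example both checks are unavailable for the same reason, so the case lands in the review queue where someone with the authority to handle exceptions can resolve it.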

Access control limits often hamstring workers, leading to complex rules to determine who should have access and work-arounds by employees trying to do their jobs. The real solution is streamlining rules and regulations and simplifying the system, but the default solution is adding one more loop on a process chart that already looks like a bowl of spaghetti spilled on a table.

Your processes should be as simple as possible, but no simpler. For example, a website that refers people to a phone number if they have problems and a phone number that takes you only to a recorded message that they should go to the website is simple – and a failure. This isn’t a hypothetical scenario – I actually had to deal with it once.

Sometimes the solution seems to be to go outside the rules, such as when someone tries to implement a fix or work-around on their own. This creates new problems, sometimes major ones, such as when someone restarts a service without telling others or puts in a software patch without testing it thoroughly. The better solution is to have a formal process for testing improvements and new solutions in a deliberate, controlled manner, and to update all affected processes once a change is found to be an improvement.

Sometimes the solution is supposedly “go look at the process” and “update the process document”. Then the users run into problems because they weren’t notified about the process changing. Now they are running off of an old process and may call up tech support asking why the process they are accustomed to isn’t working right.
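One way to close this gap is to track which version of a process each user has acknowledged and notify anyone who is behind. This is a minimal sketch under that assumption; the version numbers, the `acknowledged` mapping and the function name are hypothetical.

```python
def users_needing_notice(current_version: int, acknowledged: dict[str, int]) -> list[str]:
    """Return users whose last acknowledged process version is older than the current one."""
    return sorted(user for user, version in acknowledged.items() if version < current_version)


# Hypothetical record of the last process version each user confirmed reading.
acknowledged = {"alice": 3, "bob": 2, "carol": 1}
stale = users_needing_notice(3, acknowledged)
```

Here `stale` lists bob and carol, the two users still working from an old version of the process.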




Design your IT processes, from software interfaces to user support, to take human failings into account. Design processes that don’t require humans to be machines by demanding 100% attention, incredible reaction time, data processing skills akin to a computer or perfect memory. Have formal processes in place to handle the exceptions and odd events without making the standard processes insanely complex.

Do take the time to train users, but also take a look at your processes to see if you can make them simpler … and then train users on the new processes to avoid new problems. Ensure that people have access to the knowledgeable experts and documents they need to make the right decision, and don’t throw too much information or distractions at them or they are sure to make more mistakes.