Triage in real life and IT

By Tamara Wilhite

I received a call that my elementary-aged daughter was hurt with a suspected head injury. When I arrived at the school, I walked past the room where my kindergarten-aged son was. The kindergarten teacher looked down the hall and saw me, exclaiming, “Oh, good, you’re here!” I went into the nurse’s office to see my daughter, who was disoriented but conscious. Then I turned to see the kindergarten teacher pushing in my kindergartener, a split lip leaving a bloody mess on his shirt.

I said I was there for my daughter. The kindergarten teacher said that I could take my son, as well, since he needed attention and wanted me.

“My daughter has a suspected head injury, which the pediatrician wants to review now. My son is a mess, but it isn’t life-threatening or even health-endangering. I’m taking her. If he needs a parent, call his father. I’m taking her and only her for medical treatment now.” The pediatrician saw her within an hour of the injury and sent us on to a nearby urgent care center for a suspected concussion. My daughter was merely in a disassociated state and not concussed, though it took a CT scan to prove it. My son ended up at home with Daddy later that afternoon, ice on his mouth while he watched cartoons. He was noisier and messier when hurt, but he wasn’t the highest priority because of the (literal) impact of events. It was a matter of triage.

System failures, and the need to triage them quickly, occur in IT as well. Three servers or systems are down after a power outage. Which one do you bring up first? The one with the most users, who have a workaround to get the data? The one with the fewest users, all of whom are configuration managers, so that production halts if they can’t do their job? This requires triage. And triage requires a plan.

  1. What systems are the top priority to take care of? Know which ones are mission critical to the organization. Those systems then take priority if multiple systems need attention.
  2. Which problems are a priority and which can wait? If multiple critical systems are down, do you work first on the one that can’t try to bring itself back up automatically? Does a hardware outage take priority over stopped services, since one may take more time to resolve than the other? Select the tie-breaker criteria for when problems of equal priority come in. A tie breaker might favor the problems that can be solved quickly, or the issues that require more time to work and so must be started sooner to be solved in a timely manner.
  3. How do you communicate outages and problems? If a system is down and no one hears the first user’s cries for help, a tidal wave of screaming can hit at once through various channels. Do users know how to report an outage promptly, through a channel that will get a quick response? For example, a system that restarts itself at 4 AM and does not come back up will be noticed by those trying to log in at 6 AM. If they don’t know how to report it, or they report it through the wrong channel, the issue builds in priority as more and more users try to get in and fail, and the first desperate users begin trying a dozen different avenues to get help. If the first reports had been received and answered by the first person in IT to arrive at 7 AM, the problem might have been resolved before most people arrived at 8 AM. Even if the problem isn’t fixed, a system-down broadcast would notify would-be callers of the problem and reassure users that the issue is being worked on.
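
The first two questions above can be written down ahead of time as an explicit ordering rule. What follows is only a minimal sketch in Python, not a prescribed method: the `Outage` record, its fields, the `triage_order` function, and the system names are all hypothetical, and the repair estimates are made-up numbers for illustration. It assumes mission-critical systems come first, systems that cannot restart themselves come next, and estimated repair time breaks any remaining tie.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """Hypothetical record of one system that is down."""
    name: str
    mission_critical: bool    # does production halt without it?
    self_recovering: bool     # can it try to bring itself back up?
    est_repair_minutes: int   # rough estimate, assumed known in advance


def triage_order(outages, longest_first=True):
    """Return outages in the order they should be worked.

    Priority: mission-critical systems first, then systems that cannot
    restart themselves. The tie breaker is estimated repair time:
    longest first (start slow fixes early) or shortest first (quick wins).
    """
    def key(o):
        repair = -o.est_repair_minutes if longest_first else o.est_repair_minutes
        # False sorts before True, so "not mission_critical" puts critical first.
        return (not o.mission_critical, o.self_recovering, repair, o.name)

    return sorted(outages, key=key)


# Illustrative data only: three systems down after a power outage.
outages = [
    Outage("file-share", mission_critical=False, self_recovering=False, est_repair_minutes=15),
    Outage("config-mgmt", mission_critical=True, self_recovering=False, est_repair_minutes=90),
    Outage("order-entry", mission_critical=True, self_recovering=True, est_repair_minutes=20),
]

for position, outage in enumerate(triage_order(outages), start=1):
    print(position, outage.name)
```

With this sample data the configuration management system is worked first, because it is mission critical and cannot recover on its own. Passing `longest_first=False` switches the tie breaker to quick fixes first. Which tie breaker is right depends on the organization; the point is to choose it before the outage, not during it.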

Triage in IT thus centers on knowing your priorities, your tie breakers, and well-defined communication methods. Lack of any of these three criteria can result in a massive outage becoming a wild scramble. In real life, having a plan can save a life. In IT, it can save your organization.