IT/OT Monitoring: Does Your Data Center Have a Check Engine Light?

Redundant designs have become so capable that we often don’t miss a beat when a failure occurs. Without monitoring these assets, issues can fly under the radar until a catastrophic failure occurs.

Thomas Roth, Hargrove Controls + Automation

For most industrial sites, preventive maintenance and monitoring are familiar practices for operations and maintenance teams. They identify, assess and replace failed hardware. They investigate alarms, evaluate their impact and proceed accordingly. Not all catastrophic failures are prevented, but usually there is some indication that a failure is imminent, and the appropriate preparations can take place.

Unfortunately, these practices do not often extend to the industrial data center and the industrial control system (ICS) network. By applying the same practices in the data center that we use for monitoring plants, we can help prevent catastrophic failures and reduce downtime for one of our most significant centralized resources.

Often, IT infrastructure conversations happen upfront with a focus on improving redundancy and removing single points of failure. Unfortunately, redundant designs can provide a false sense of security if they are implemented without initiating a commensurate monitoring program. In our experience, it is not uncommon for a plant to be running in partial failover without any indication to the operations or maintenance teams. Our redundant designs have become so capable that we often don’t miss a beat when a failure occurs. Without monitoring these assets, issues can fly under the radar until a catastrophic failure occurs. These failures can snowball into larger plant outages, impacting production and requiring costly repairs instead of relatively benign replacements.
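
One way to make partial failover visible is to evaluate the redundancy itself, not just whether the service is still running. The Python sketch below is illustrative only: the RedundancyGroup class, its field names and the example asset names are assumptions made for this illustration, not part of any particular monitoring product. It reports a group as degraded as soon as it has lost a member, even though the service it supports is still up.

```python
from dataclasses import dataclass


@dataclass
class RedundancyGroup:
    """Hypothetical model of a redundant asset group (all names are illustrative)."""
    name: str
    members_total: int      # members installed, e.g. drives in an array
    members_healthy: int    # members currently reporting healthy
    faults_tolerated: int   # failures the design can absorb, e.g. 1

    def failed(self) -> int:
        return self.members_total - self.members_healthy

    def status(self) -> str:
        if self.failed() == 0:
            return "OK"
        if self.failed() <= self.faults_tolerated:
            # Service is still up, but the redundancy margin is being consumed.
            return "DEGRADED"
        return "DOWN"

    def faults_remaining(self) -> int:
        # How many more failures the group can absorb before it goes down.
        return max(self.faults_tolerated - self.failed(), 0)


# Assumed example data: one array running on its last margin, one healthy pair.
groups = [
    RedundancyGroup("ics-storage-array", members_total=8, members_healthy=7, faults_tolerated=1),
    RedundancyGroup("core-switch-pair", members_total=2, members_healthy=2, faults_tolerated=1),
]

for group in groups:
    print(f"{group.name}: {group.status()} "
          f"(can absorb {group.faults_remaining()} more failure(s))")
```

In this sketch, the DEGRADED state is the one that matters: operators see nothing wrong, yet the next failure takes the system down.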

In a recent service call, we saw the risk of critical IT system failures firsthand. A client’s primary ICS storage array experienced a catastrophic hard drive failure. The system was designed with redundancy to tolerate a single fault, but it would not survive a second drive failure. Had the ICS storage array gone down, the plant would have incurred a costly outage and loss of production while arrays were rebuilt and backups were restored.

This plant had no systems in place to monitor its IT infrastructure or alert its operations staff. If not for a coincidental trip by a technician through a server room, the first indication of failure would have come when the system went down. In this case, we were able to replace the drive and rebuild the array without further failure, but only because of a sharp-eyed technician who chose to take action on an item unrelated to his task at hand.

We bet many of you can relate to similar close calls. Our post-mortem showed that the drive had initially indicated trouble three weeks prior. But by missing an early opportunity to avoid the catastrophic failure, we were forced into a higher-risk plan of action without suitable time to implement contingency plans and reduce risk. The ICS storage array was indicating imminent failure, but there was no check engine light to signal it.

To help mitigate the risk of catastrophic failure, we recommend using IT/OT monitoring programs that can indicate problems before they become failures. IT systems include a wealth of monitoring services, checks and data built in. But without a tool to aggregate all this data, it can be prohibitively expensive to manually monitor individual assets. And those assets will continue to run happily up until the point of failure.

By implementing a monitoring tool, maintenance and operations teams can have a single pane of glass that indicates the overall health of nearly all IT/OT assets in the plant. When failures are caught early, outages can be prevented and the overall cost of operations is reduced. As plants implement more Industrial Internet of Things (IIoT) devices and sensors, these concepts will translate directly into cost savings on the operational side as well.
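
As a rough sketch of the single-pane-of-glass idea, the Python example below rolls individual asset states up into one plant-wide indicator using a worst-of rule. The Health levels, function names and asset names are hypothetical and chosen only for illustration. In practice, the per-asset states would come from whatever health data each device already exposes.

```python
from enum import IntEnum


class Health(IntEnum):
    """Illustrative severity levels; higher values are worse."""
    OK = 0
    WARNING = 1
    CRITICAL = 2


def overall_health(asset_health: dict[str, Health]) -> Health:
    """Worst-of rollup: the plant is only as healthy as its least healthy asset."""
    return max(asset_health.values(), default=Health.OK)


def needs_attention(asset_health: dict[str, Health]) -> list[str]:
    """Names of assets that are not fully healthy, in alphabetical order."""
    return sorted(name for name, state in asset_health.items() if state != Health.OK)


# Assumed example data standing in for states collected from real assets.
plant = {
    "ics-storage-array": Health.WARNING,   # e.g. a drive reporting early trouble
    "historian-server": Health.OK,
    "control-network-switch": Health.OK,
}

print("Check engine light:", overall_health(plant).name)
print("Needs attention:", needs_attention(plant))
```

A worst-of rollup is deliberately conservative: a single early warning lights the indicator, which is the behavior that was missing in the storage array incident described earlier.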

