Fault tolerance for manufacturing

Aug. 1, 2003

The age-old business axiom, “time is money,” is more true today than ever before. Across the manufacturing industry, fluctuating market conditions, increased pressure from global competition, stricter government regulations and the daily demands of customers and partners have converged to dramatically alter business processes.

Steve Keilen

As a result, more emphasis has been placed on specific applications and processes for which performance and reliability have become essential to success and growth.

Each time a process is automated or an existing process upgraded, successful execution often depends on the information technology infrastructure. This is where “time is money” becomes more than just a folksy adage. If a mission-critical application is unavailable, the entire process can break down. But downtime—whether a few minutes or a few hours—is an all too frequent occurrence. Not only does downtime negatively impact productivity, but it also takes a toll on customers and business partners.

Consider batch process manufacturing, in which products are produced one at a time in a pre-determined timeframe. Because a batch process requires strict adherence to a production schedule, even a short outage can create a disastrous domino effect that ripples across the production schedules of numerous products.

So, how much does downtime cost? Can one hour of downtime truly impact the bottom line? According to a recent report by research firm Gartner (www.gartner.com), the answer is undeniably yes. Gartner estimates the average cost across all industries for each hour of downtime to be $44,000; it is certain to be more for many manufacturers.

When critical systems fail—even if only a few times a year—those are dollars that can add up quickly. A process that operates at 99.9 percent uptime efficiency, for example, will have about nine hours of downtime annually on average. Using Gartner’s conservative cost estimate, that amounts to a $396,000 cost.

In an effort to improve the availability of mission-critical applications, some manufacturers are evaluating fault-tolerant server platforms to provide performance and reliability guarantees. Only recently has this technology become available on industry-standard platforms, producing dramatic price reductions.

Simply stated, fault-tolerant servers can operate at better than 99.999 percent uptime. Commonly referred to as “five-nines,” this level of reliability translates into about five minutes of unplanned downtime annually on average. Again, using the Gartner estimate, that amounts to about $3,700, or $392,300 less than a system with 99.9 percent reliability.
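The arithmetic behind these figures can be reproduced in a few lines. The sketch below is illustrative only: it assumes a 365-day year and applies the $44,000 hourly figure cited above to any uptime percentage. The function and constant names are made up for this example.

```python
HOURS_PER_YEAR = 24 * 365   # 8,760 hours; leap years ignored (assumption)
COST_PER_HOUR = 44_000      # dollars; the cross-industry average cited above


def annual_downtime_hours(availability_percent: float) -> float:
    """Expected hours of downtime per year at a given uptime percentage."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


def annual_downtime_cost(availability_percent: float,
                         cost_per_hour: float = COST_PER_HOUR) -> float:
    """Expected yearly downtime cost in dollars."""
    return annual_downtime_hours(availability_percent) * cost_per_hour


if __name__ == "__main__":
    # Compare "three-nines" with "five-nines" uptime.
    for pct in (99.9, 99.999):
        minutes = annual_downtime_hours(pct) * 60
        cost = annual_downtime_cost(pct)
        print(f"{pct}% uptime: {minutes:.1f} minutes/year, about ${cost:,.0f}")
```

Unrounded, 99.9 percent uptime works out to 8.76 hours a year and 99.999 percent to roughly 5.3 minutes, so the computed costs differ slightly from the rounded nine-hour and five-minute figures used in the text. Manufacturers can substitute their own hourly cost for a closer estimate.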

Fault-tolerant servers achieve such high availability by using lockstep technology. This approach relies on replicated, fault-tolerant hardware components that simultaneously execute the exact same instructions in precisely the same way at the same time. If a component malfunctions, the partner component is already on the job, essentially an active spare that continues to carry out the transaction; there is no downtime and the application runs unaffected.

Compare a fault-tolerant platform to a clustered server system, which typically has been used for high availability. Clusters consist of multiple interconnected servers, each able to take over for another. When one server goes down, another takes over, providing good (but not fault-tolerant) levels of availability. When the primary server fails, there is a time lapse between failure and failover (the point at which the system recovers) and, therefore, an opportunity for data loss. This is an important distinction, because the terms “fault tolerance” and “failover” are often confused. Fault-tolerant servers are designed to avoid failure, preventing downtime and lost data entirely, while clusters are designed to recover after a failure has occurred.
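The difference between the two approaches can be illustrated with a toy simulation. This is a software sketch under simplifying assumptions, not a description of how any vendor's lockstep hardware or cluster software is built: the `Replica` class, both functions, and the `failover_steps` delay are hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    healthy: bool = True

    def execute(self, x: int) -> int:
        if not self.healthy:
            raise RuntimeError(f"{self.name} has failed")
        return x * 2  # stand-in for real transaction work


def run_lockstep(replicas, inputs, fail_at):
    """Every replica executes every input; a surviving result is always on hand."""
    results = []
    for step, x in enumerate(inputs):
        if step == fail_at:
            replicas[0].healthy = False  # simulate a component fault
        outputs = []
        for replica in replicas:
            try:
                outputs.append(replica.execute(x))
            except RuntimeError:
                pass  # the partner replica is already doing the same work
        results.append(outputs[0] if outputs else None)
    return results


def run_failover(primary, standby, inputs, fail_at, failover_steps):
    """Only the active server executes; the standby takes over after a delay."""
    results = []
    active = primary
    recover_at = None
    for step, x in enumerate(inputs):
        if step == fail_at:
            primary.healthy = False
            recover_at = step + failover_steps
        if recover_at is not None and step == recover_at:
            active = standby  # failover completes
        try:
            results.append(active.execute(x))
        except RuntimeError:
            results.append(None)  # work lost during the failover window
    return results


if __name__ == "__main__":
    work = [1, 2, 3, 4, 5]
    print(run_lockstep([Replica("A"), Replica("B")], work, fail_at=2))
    print(run_failover(Replica("primary"), Replica("standby"), work,
                       fail_at=2, failover_steps=2))
```

In this model the lockstep pair returns a result for every input even after one replica fails, while the failover pair records `None` for each step that falls between the failure and the moment the standby takes over.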

In manufacturing, the pressure is on for greater profitability, at a time when organizations are challenged to do more with less. That translates to an even greater need than in the past for 24x7 reliability of critical factory systems. Manufacturers should demand nothing less from their IT hardware and software vendors.

Steve Keilen is director of segment & alliance marketing at Stratus Technologies Inc.