Technology and Catastrophic Failure

Usually I spend a lot more time in this blog talking about learning technology than about pure technology. However, the Gulf of Mexico oil spill is an example of the kind of catastrophic technical failure we are beginning to see much too often. Whatever the ultimate analysis of the causes of this ecological horror yields, it is clear that some level of technical failure of a complex system, together with possible human error, was in play. The issues, and ultimately the solutions, span technology, project management, and human performance improvement. We are all invested in understanding how to prevent failures like this.

Richard Cook, MD, wrote an elegant paper in 1998 entitled How Complex Systems Fail. It's been referenced a few times since, usually after some major technical snafu. Michael Krigsman of ZDNet invoked it this time to point out that informed leadership must take a role in establishing a culture of safety when dealing with complex technical systems. I agree with his point. However, Cook's paper warns us of much, much more than that.

Cook outlines 18 points in his four-page paper. Each one of them is worth a year's study. I won't list them all here; I urge you to read the paper instead. Here are two important themes from it:

  1. Because of their redundancy, complex systems are inherently unstable in many circumstances. The redundancies are built-in "failsafe" responses to the possible, or even expected, failure of some set of components. Because the systems are designed to keep working despite component failures, they are constantly operating at sub-optimal levels. All it takes is some novel occurrence beyond the finite catalog of anticipated failures to place a system in an entirely new state of operation, which, of course, could be wholly inappropriate.
  2. Humans are a huge source of variability in complex systems. People are necessarily interchanged because of shift changes, vacations, sick time, promotions, layoffs, firings, mergers, etc. So even if a complex system is initially staffed with only top performers, the staffing will change, and the reliability of the system will change with it.

The net effect of these two points is that the reliability of a complex system is continuously changing, and the system can quickly slip into catastrophe when a series of seemingly minor incidents occurs in such a novel fashion that a completely unexpected major failure ensues.

The news is not good. There is no magic bullet. No simple root cause. At our current level of understanding, it takes more than just hard work to manage complex systems without error. Even with constant vigilance, preparedness, and training, a series of seemingly innocent failures in a complex system can end up as a name etched in the international consciousness, like Challenger, Bhopal, Three Mile Island, Exxon Valdez, and the BP Gulf Spill.

It is clear that we must learn much, much more about building and managing complex systems. Cook's paper tells us that the inherent nature of complex systems invites catastrophe. And now more than ever, we live in the age of complex systems.

