By Allen Brown, President and CEO, The Open Group
In early December, a technical problem at the U.K.’s central air traffic control center in Swanwick, England caused significant delays that were felt at airports throughout Britain and Ireland, also affecting flights between the U.K. and destinations from Europe to the U.S. At Heathrow alone, one of the world’s largest airports, there were a reported 228 cancellations, affecting 15 percent of the 1,300 daily flights flying to and from the airport. With a ripple effect that also disturbed flight schedules at airports in Birmingham, Dublin, Edinburgh, Gatwick, Glasgow and Manchester, the British National Air Traffic Services (NATS) was reported to have handled 20 percent fewer flights that day as a result of the glitch.
According to The Register, the problem was caused when a touch-screen telephone system that allows air traffic controllers to talk to each other failed to update during what should have been a routine shift change from the nighttime to the daytime system. News reports describe the NATS system as the largest of its kind in Europe, containing more than a million lines of code. It took the engineering and manufacturing teams nearly a day to fix the problem. As a result of the snafu, Irish airline Ryanair even went so far as to call on Britain’s Civil Aviation Authority to intervene to prevent further delays and to make sure better contingency plans are in place to prevent such failures from happening again.
Increasingly complex systems
As businesses have come to rely more and more on technology, the systems used to keep operations running smoothly from day to day have become not only larger but increasingly complex. We are long past the days when a single mainframe was used to handle a few batch calculations.
Today, large global organizations, in particular, have systems that are spread across multiple centers of technical operations, often scattered in various locations throughout the globe. And with industries also becoming more inter-related, even individual company systems are often connected to larger extended networks, such as when trading firms are connected to stock exchanges or, as was the case with the Swanwick failure, airlines are affected by NATS’ network problems. Often, when systems become so large that they are part of even larger interconnected systems, the boundaries of the entire system are no longer always known.
The Open Group’s vision for Boundaryless Information Flow™ has never been closer to fruition than it is today. Systems have become increasingly open out of necessity because commerce takes place on a more global scale than ever before. This is a good thing. But as these systems have grown in size and complexity, there is more at stake when they fail than ever before.
The ripple effect felt when technical problems shut down major commercial systems cuts far, wide and deep. Problems like the one at Swanwick can affect the entire extended system. In this case, NATS suffers damage to its reputation for maintaining good air traffic control procedures. The airlines suffer in terms of cancelled flights, travel vouchers that must be given out and angry passengers blasting them on social media. The software manufacturers and architects of the system are blamed for shoddy planning and for not having the foresight to prevent failures. And so on.
Looking for blame
When large technical failures happen, stakeholders, customers, the public and now governments are beginning to look for accountability, and for someone to blame. When the Obamacare website didn’t operate as expected, the U.S. Congress went looking for blame and jobs were lost. In the NATS fiasco, Ryanair asked for the government to intervene. Risk.net has reported that after the Royal Bank of Scotland experienced a batch processing glitch last summer, the U.K. Financial Services Authority wrote to large banks in the U.K. requesting that they identify the people in their organizations responsible for business continuity. And when U.S. trading company Knight Capital lost $440 million in 40 minutes after a trading software upgrade failed in August, U.S. Securities and Exchange Commission Chairman Mary Schapiro was quoted in the same article as stating: “If there is a financial loss to be incurred, it is the firm committing the error that should suffer that loss, not its customers or other investors. That more than anything sends a wake-up call to the entire industry.”
As governments, in particular, look to lay blame for IT failures, companies—and individuals—will no longer be safe from the consequences of these failures. And it won’t just be reputations that are lost. Lawsuits may ensue. Fines will be levied. Jobs will be lost. Today’s organizations are at risk, and that risk must be addressed.
Avoiding catastrophic failure through assuredness
As any IT person or enterprise architect well knows, completely preventing system failure is impossible. But mitigating system failure is not. Increasingly, the task of keeping systems from failing, rather than just keeping them up and running, will be the job of CTOs and enterprise architects.
When systems grow to a level of massive complexity that encompasses everything from old legacy hardware to Cloud infrastructures to worldwide data centers, how can we make sure those systems are reliable, highly available, secure and maintain optimal information flow while still operating at a maximum level that is cost effective?
In August, The Open Group introduced the first industry standard to address the risks associated with large complex systems, the Dependability through Assuredness™ (O-DA) Framework. This new standard is meant to help organizations both determine system risk and prevent failure as much as possible.
O-DA provides guidelines to make sure large, complex, boundaryless systems run according to the requirements set out for them while also providing contingencies for minimizing damage when stoppage occurs. O-DA can be used on its own or in conjunction with an existing architecture development method (ADM) such as the TOGAF® ADM.
O-DA encompasses lessons learned within a number of The Open Group’s forums and work groups. It borrows from the Security Forum’s Dependency Modeling (O-DM) and Risk Taxonomy (O-RT) standards and also from work done within The Open Group Trusted Technology Forum and the Real-Time and Embedded Systems Forum. Much of the work on this standard was completed thanks to the efforts of The Open Group Japan and its members.
This standard addresses the issue of responsibility for technical failures by providing a model for accountability throughout any large system. Accountability is at the core of O-DA because without accountability there is no way to create dependability or assuredness. The standard is also meant to address and account for the constant change that most organizations experience on a daily basis. The two underlying principles of the standard are modeled as a change accommodation cycle and a failure response cycle. Each cycle, in turn, provides instructions for creating a dependable and adaptable architecture, providing accountability for it along the way.
Ultimately, O-DA will help organizations identify potential anomalies and create contingencies for dealing with problems before or as they happen. The more organizations can do to build dependability into large, complex systems, the fewer technical disasters are likely to occur. As systems continue to grow and their boundaries continue to blur, assuredness through dependability and accountability will be an integral part of managing complex systems into the future.
Allen Brown is President and CEO, The Open Group – a global consortium that enables the achievement of business objectives through IT standards. For over 14 years Allen has been responsible for driving The Open Group’s strategic plan and day-to-day operations, including extending its reach into new global markets, such as China, the Middle East, South Africa and India. In addition, he was instrumental in the creation of the AEA, which was formed to increase job opportunities for all of its members and elevate their market value by advancing professional excellence.