By Stuart Boardman, KPN
Enterprises today are subject to, and increasingly make use of, a range of technological and business phenomena that enormously increase the number of factors affecting an enterprise’s ability to carry out its business effectively and efficiently. Some examples of this (Cloud, Big Data, the Internet of Things, Social Media/Business and Mobility) are the focus of The Open Group’s Open Platform 3.0™ Forum. An enterprise participating in some way in this world (i.e. any enterprise unable to lock itself inside its own walls) will have to find ways of matching the variety these phenomena introduce. I’m using the term Variety here in the sense defined by W. Ross Ashby – most notably in his Law of Requisite Variety (An Introduction to Cybernetics, 1956) – which I’ve written about more extensively elsewhere.
Variety can be internal or external to a system (an enterprise is a system) but it’s the external variety that is increased so dramatically by these new phenomena, because they typically involve having some part of an enterprise’s business performed by another party – or a network of parties, not all of whom are necessarily directly known to that enterprise.
Ashby’s law says that the more variety a system has to deal with, the more variety is needed in its responses. Variety must be matched by variety. You need at least to be able to monitor each factor and assess changes in its behavior if you are to have any hope of responding.
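To make this concrete, here’s a rough sketch in Python of Ashby’s law in its simplest counting form (the function name and figures are mine, purely for illustration): a regulator with R distinct responses can at best reduce D distinct disturbances to ⌈D/R⌉ distinct outcomes, so full control demands at least as many responses as disturbances.

```python
from math import ceil

def residual_outcome_variety(n_disturbances: int, n_responses: int) -> int:
    """Ashby's law in counting form: a regulator with n_responses distinct
    responses can at best compress n_disturbances distinct disturbances
    into ceil(n_disturbances / n_responses) distinct outcomes."""
    return ceil(n_disturbances / n_responses)

# A hypothetical enterprise facing 12 kinds of external disturbance:
print(residual_outcome_variety(12, 3))   # 4 outcomes still leak through
print(residual_outcome_variety(12, 12))  # 1: variety matched, full control
```

The point of the arithmetic is only this: if the external world throws more distinct situations at you than you have distinct responses, some situations will necessarily go unregulated.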
There are three main elements involved in developing a strategy to match variety.
First we need to find ways of identifying relevant variety and of understanding what its effect on our enterprise might be. That will tell us what meaningful options for response exist. We should not make the mistake of thinking that a deterministic response to any given type of variety is always possible. Ashby himself was very clear about this. Some factors (especially those involving people) don’t behave in a predictable manner. It’s therefore useful to classify each form of variety according to some scheme. I use Tom Graves’s SCAN framework.
There are other frameworks – I just find Tom’s semantics rich but easy to follow.
Second we need to understand the level of risk that a particular form of variety might pose. How much damage might a particular event do to our business? In the “always on” world that Platform 3.0 encompasses, there’s a tendency to assume that being offline is a drama. But is that always true? The size and cost of a response mechanism need to be in proportion to the risk involved.
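That proportionality test can be reduced to back-of-the-envelope arithmetic. A minimal sketch (the function and all the numbers are invented for illustration, not real figures): only invest in a response mechanism if its cost is below the expected loss it prevents.

```python
def mitigation_worthwhile(event_probability: float,
                          impact_cost: float,
                          mitigation_cost: float) -> bool:
    """Crude proportionality check: a control is worth having only if it
    costs less than the expected loss (probability x impact) it prevents."""
    expected_loss = event_probability * impact_cost
    return mitigation_cost < expected_loss

# Is an hour offline really a drama? It depends on the numbers:
print(mitigation_worthwhile(0.10, 50_000, 2_000))  # True: cheap insurance
print(mitigation_worthwhile(0.01, 50_000, 2_000))  # False: over-engineering
```

Real risk assessment is of course richer than a single expected-value product, but even this crude check exposes responses that cost more than the risk they address.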
Lastly we must decide what kind of response mechanism we actually want to implement – assuming the level of risk and the available options indicate any need for response at all. The fact that we could put a control mechanism in place doesn’t necessarily mean that it’s a good idea, as Nassim Nicholas Taleb shows in his book Antifragile (of which more in a minute).
Here’s an example from the Internet of Things (IoT): “Smart Charging” for electric automobiles. Here we know that the number of parties involved is quite small (Distribution Network Operator, Charging Provider, Local Controller/Provider and Automobile/User) and that both functional and legal/commercial contracts between parties will apply. If we look at an individual device (sensor, monitor, controller…) and its relationship to someone else’s device, there’s a good chance we can describe the behaviour with some confidence. So we’re talking about a Simple situation that’s amenable to a rules-based (“if A happens, do B”) response. But of course it’s not usually so straightforward. One can expect at least a one-to-many relationship between “our” device and the devices with which it exchanges information. So in reality we’re dealing with a Complicated situation. That doesn’t mean you can’t determine a reliable set of behaviours and responses, but it will be a sizeable matrix and will require significant analysis effort.
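The rules-based response can be sketched as a lookup table. All device names, events and responses below are hypothetical – the point is the shape: in the Complicated case the same event from different devices may need different responses, so the key is the (device, event) pair, and the table grows multiplicatively.

```python
# Hypothetical response matrix for a Smart Charging setup:
# the rule key is (device, event), not just the event.
RESPONSE_RULES = {
    ("charging_point", "overload_warning"): "reduce_charging_rate",
    ("charging_point", "sensor_timeout"):   "fall_back_to_default_profile",
    ("grid_sensor",    "overload_warning"): "notify_network_operator",
}

def respond(device: str, event: str) -> str:
    # Anything outside the analysed matrix is by definition not Simple,
    # so it gets escalated rather than guessed at.
    return RESPONSE_RULES.get((device, event), "log_and_escalate")

print(respond("charging_point", "overload_warning"))  # reduce_charging_rate
print(respond("grid_sensor", "voltage_dip"))          # log_and_escalate
```

Note the default: a rules-based mechanism should know the limits of its own matrix, which is precisely where the analysis effort goes.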
So what’s the risk? Well, that depends on who you are. A car owner, a charging provider and a network operator have quite different perceptions of what constitutes an event of business significance. They all have a common interest in the efficient functioning of the system as a whole but quite different views on which events require a response and what sort of response. A sensor or controller problem could lead to a failure to detect a potential network overload. So could faulty data about weather or consumption patterns, or poor (big) data analytics, all of which fall at best into the Ambiguous category. For a car owner this isn’t really a risk until something goes seriously wrong – and even then one can always work from home. For the network operator the significance is far greater, as they are legally responsible for providing sufficient capacity and additional infrastructure is expensive. On the other hand, if the network operator decides to play safe and reduce the capacity allocated to the charging provider, that will at least lead to irritation for car owners due to incomplete or slow charging of their cars. That is not usually a business-critical event but the possibility exists. For the charging provider an isolated local event is not much more than an annoyance, but a widespread effect can have direct financial or customer relationship consequences.
Then there’s the third consideration. Just because we could set up a control, does that mean we really should do so? In Antifragile Taleb shows that many systems are fragile exactly because they try to control everything. Now in general this applies to social/economic systems, which in SCAN terms are Ambiguous or Not-known and therefore not really amenable to tight control anyway. But even mechanical systems can suffer from this problem. It’s not uncommon that a response to some stimulus has knock-on effects elsewhere in the system, and if there’s a two-way relationship between a source of variety and our response mechanism, all kinds of unexpected things could happen. So we need to be very sure about what we’re doing.
Moreover tightly controlled systems have great difficulty with black swan events (another Taleb book), because these by definition are not catered for in the rule book. An over-reaction or mistaken reaction can have disastrous consequences. No reaction may sometimes be a better tactic. All of which brings me to another example.
The example is based on the (in)famous Amazon outage of a couple of years back and is in no way intended to knock Amazon. I’ve written about this in detail in another blog but the central point is that when there is a significant outage we (the customer) are in the Not-known territory. We have no direct ability to respond to the variety that caused the problem, so we need a different way of responding – something that we can decide for ourselves but which can’t possibly be based on a rules driven approach. I described a response that involved creating a separate back-up/recovery strategy – potentially with multiple options. But of course that comes at a price, so our risk assessment needs to be well thought through.
This example has another interesting aspect to it. The scale of the problem arose from a failure of a control structure that could manage expected events but which actually made things worse in the face of something in the order of a black swan event. And of course this isn’t just about machines – there were people involved too. The control structure was intended to be robust but was in fact fragile. But in the end how much damage was done? As far as I know no-one went bust. Amazon learned from the experience and continued to do so – and so did everyone else. So actually the whole system proved to be antifragile. It got better as a result of a few knocks. I don’t know exactly how Amazon do it now, but I hope they’ve given up trying to control everything with a rule book.
You could say that the mission of the Open Platform 3.0 Forum is to help enterprises gain the benefits they seek from all those phenomena. So here’s a great opportunity for the Forum to take a lead in an area that too often gets shoved off into the non-sexy world of “non-functional requirements”. I hope we can describe ways for enterprises to deal with variety in an intelligent and adequate manner – to reliably manage what can be managed without driving themselves crazy trying to manage the unmanageable.
Stuart Boardman is a Senior Business Consultant with KPN where he co-leads the Enterprise Architecture practice as well as the Cloud Computing solutions group. He is co-lead of The Open Group Cloud Computing Work Group’s Security for the Cloud and SOA project and a founding member of both The Open Group Cloud Computing Work Group and The Open Group SOA Work Group. Stuart is the author of publications by the Information Security Platform (PvIB) in The Netherlands and of his previous employer, CGI. He is a frequent speaker at conferences on the topics of Cloud, SOA, and Identity.