Why is DCIRN important to me? Don Carless - UK Expert on BSI TCT 7/3, TCT 7/1 & 7/2
Clean Hands and Best Practice
I first encountered the idea of a Data Centre Incident Reporting Network (DCIRN), a global means of anonymously posting DC incidents for open analysis and best-practice advice, when I saw a presentation by Ed Ansett (i3 Solutions) a few years ago at Data Centre Dynamics (DCD) in London.
Ed had been given dispensation to disclose his findings from investigating an incident at the Singapore Stock Exchange (SSE), sharing information as a simple but totally altruistic gesture to help the industry. Until that event the Data Centre industry was secretive: incidents and failures were communicated in whispers, causes were mostly speculative and lessons had to be relearned the hard way. I was struck by how valuable sharing root cause information can be, and by what a magnanimous and mature gesture SSE had made. It was an example to us all, and one that I recognised needed to be emulated throughout the industry. DCiRN will be that vehicle.
At the opposite end of the spectrum (in terms of responsible behaviour), I recently notified a manufacturer that I’d discovered a fault. Their response was to send me a firmware patch, which solved the problem, but the event had been avoidable. I asked why I hadn’t received the patch by default, and whether there were any other patches available for issues I hadn’t yet discovered. The manufacturer replied that they only send patches to customers who have experienced the issue. This attitude is typical of our industry. The motivation, it seems, was to protect their reputation and market perception, which is, in my opinion, a short-term and irresponsible attitude and not in the best interests of our industry.
Our industry was founded on the business requirement to store and process data, most of which was not time or life sensitive; end-of-day or end-of-period reconciliations could resolve outage problems. Our world now houses systems as diverse as hospital data and autonomous systems running on digital infrastructure. Smart Cities have integrated digital nervous systems to manage everything from traffic lights and trains to emergency communication systems, so incidents could result in human fatalities.
Ed drew the parallel with the airline industry and its anonymised reporting system, CHIRP. Early signs of bad practice, malfunction or poor design are shared across the airline industry. This makes sense to me. The end game is that you can’t hide an airplane crash: there will always be a follow-up investigation involving government bodies, and often lawsuits and compensation. As an industry, if a DC failure takes a life, how can we demonstrate that we have clean hands and used best practice? I believe the answer is DCiRN. The alternative will be the dead hand of Government regulation, which will be painful and expensive. We have to grow up and we need to share best practices. DCiRN is important because we need to be able to tell Governments “We’re on it!” and show them how we’re ensuring transparency and best practice.
Another benefit of DCiRN is that we can demonstrate to the consumer that our offerings are more reliable, designed and operated using the latest learnings and best practice that go beyond standards. The assurance of the anonymisation process protects our employers’ and customers’ reputations in the market. Within my organisation I’ve drummed into all my staff that if they are aware of an incident, it’s “hands up” and don’t hide anything. Now I can also seek root cause analysis and publish our lessons learned anonymously to the industry. My industry is a mechanical and electrical world, all designed and managed by humans. Currently a team’s reputation is all about how they recover from an incident. I would rather avoid the incident in the first place.
By passing our learnings forward and agreeing to share information via DCiRN, we can all sleep better at night.