Why does IT fail ...

Business Continuity Management Briefing BCM - IT Technology

If Information Technology be the foundation of your organisation, then read on

There can be very few modern organisations that do not rely, to a greater or lesser degree, on information, computing and telecoms to assist them in delivering their products and services. 

When the Chartered Management Institute (CMI) asked managers across all sizes and types of organisation what would have the greatest impact on their organisation, they replied: the loss of IT. So why is it that major systems still fail?

The most recent high-profile failure occurred when the outsourced data centre handling patient administration systems suffered multiple failures, leaving 72 primary care trusts and eight acute care trusts in the North West and West Midlands without access to patient records for up to five days.

It has been judged to be one of the worst IT crashes ever to hit the NHS. It appears that the incident was caused by a failure in storage area network equipment due to a power failure. Back-up power systems failed to kick in at the main data centre, as a number of the multiple UPS (uninterruptible power supply) systems in use there were down for essential maintenance. To add to the problems, the failover to an alternative computer centre also failed. Yet the key NHS contract for this service included a stringent BCM requirement. What went wrong?

Could it be that the failover system was not correctly tested? Testing and exercising of BC plans is an essential part of BCM, yet many large organisations are reluctant to undertake a full systems test, preferring to test individual components for fear that a full test might not work. By taking this route they are leaving themselves open to failure at the very time the failover system is required.

There has always been a problem with scheduling maintenance on working systems, and this may have been the case with the NHS system, where several of the UPS systems were out for essential maintenance. The major power outage that happened in London last year occurred when two of the four main grid transformers were out for annual maintenance, a third developed a fault due to loading and the fourth contained a fuse unable to take the full load. The lesson learnt as a result was that there has to be better co-ordination between maintenance and operational activities, and that the continuity arrangements must be fully operational before the work commences.
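The lesson about co-ordinating maintenance with continuity arrangements can be expressed as a simple pre-maintenance check. The sketch below is illustrative only: the function name safe_to_maintain, the unit names and every capacity and load figure are assumptions made up for the example, not data from either incident. It asks whether the units left in service could still carry the peak load if one of them failed unexpectedly while the others were out for maintenance.

```python
def safe_to_maintain(capacities, out_for_maintenance, peak_load, tolerate_faults=1):
    """Return True if the units still in service could carry peak_load
    even after `tolerate_faults` of them failed unexpectedly.

    capacities: dict of unit name -> capacity (hypothetical figures)
    out_for_maintenance: set of unit names taken out of service
    """
    in_service = [cap for name, cap in capacities.items()
                  if name not in out_for_maintenance]
    if len(in_service) <= tolerate_faults:
        # Nothing would be left running after the tolerated faults.
        return False
    # Worst case: the largest remaining units are the ones that fail,
    # so only the smallest ones survive.
    surviving = sorted(in_service)[:len(in_service) - tolerate_faults]
    return sum(surviving) >= peak_load


# Hypothetical example: four equal units and a peak load of 80.
units = {"T1": 50, "T2": 50, "T3": 50, "T4": 50}
one_out = safe_to_maintain(units, {"T1"}, peak_load=80)
two_out = safe_to_maintain(units, {"T1", "T2"}, peak_load=80)
```

With these assumed figures, taking one unit out still leaves enough capacity to survive a single fault, while taking two out does not. A real risk assessment would of course cover far more than raw capacity, but even a check this simple makes the question explicit before the work is scheduled.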

One major public body in London has learnt lessons as a result of a similar power failure and now carries out a full risk assessment when scheduling planned maintenance of equipment, informing all operational units of its intentions. Note: the similarities between the circumstances of last year's and last month's power outages in London's West End are striking! Systems are becoming more complex and access to data is key to any operation.

This is recognised by many of the major organisations that responded to a recent survey carried out by SteelEye Technology. 73% of organisations questioned have BC plans, with almost 65% having automated data replication and 36% having automated processing switchover. 87% of those with BC plans have stand-by sites. So the planning is there, but what about the exercising? The CMI research sheds some light on the problem. Only 46% of those with BC plans have tested them in the last 12 months and some 21% have never tested their plans. The most common reason given is that "it might not work when we test it".

The companies providing disaster recovery centres for many of the UK’s major businesses report that, although contracts include testing days, a high percentage of these are never taken up. When testing does take place it appears to be at component level and not of complete systems. With so much being invested in back-up systems and facilities, it is irresponsible to skip full tests and simply hope that everything will work when it is really needed.

There appears to be a different approach in the US, where the financial regulator insists that critical systems are tested and that key organisations can demonstrate their capability of operating from alternative sites with live data. It is not just major companies that need to improve their IT back-up. It is critical that public bodies also give closer consideration to their arrangements, as they are now subject to the Civil Contingencies Act requirements for BCM. Access to critical data at the time of an emergency is essential if the vulnerable in the community are to be protected.

During recent discussions with a major local authority it was acknowledged that IT data back-ups are undertaken on a regular basis, but that the users had given the IT department no indication of which data would be required first. The department responsible for the most vulnerable in their community just assumed that they would have access to their data quickly. This would not be the case unless the data appeared at the start of the back-up tape.
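One way to make the missing conversation between users and the IT department concrete is to record, for each dataset, how long its owners can manage without it and then derive the restore order from that. The following is a minimal sketch under stated assumptions: the Dataset fields, the restore_schedule function and all of the names and hours shown are hypothetical illustrations, not figures from the local authority discussed above.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    max_hours_without: float  # how long the owning department can cope without it
    restore_hours: float      # estimated time needed to restore it


def restore_schedule(datasets):
    """Order datasets so the most urgently needed is restored first.

    Returns a list of (name, hours_until_available, meets_target) tuples,
    assuming the datasets are restored one after another.
    """
    ordered = sorted(datasets, key=lambda d: d.max_hours_without)
    elapsed = 0.0
    schedule = []
    for d in ordered:
        elapsed += d.restore_hours
        schedule.append((d.name, elapsed, elapsed <= d.max_hours_without))
    return schedule


# Hypothetical figures agreed between the users and the IT department.
plan = restore_schedule([
    Dataset("email archive", max_hours_without=168, restore_hours=10),
    Dataset("finance ledgers", max_hours_without=72, restore_hours=6),
    Dataset("social care records", max_hours_without=4, restore_hours=2),
])
```

Any entry whose third value is False shows a dataset that would arrive later than its owners can tolerate, which is exactly the kind of gap that needs to be found before an emergency rather than during one.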

Access to replacement PCs is also an issue that is not thought through. Many plans contain instructions to go to PC World and buy a replacement. No thought is given to what software is needed, what network connections are required or what levels of security must be in place before the replacement system can be connected to the organisation’s network.

The person responsible for the BC plan at one police force felt it would be up to the IT department to sort it out, but had not shared his plans with them and therefore did not know whether PCs bought from PC World would be compatible with the stringent force requirements.

SMEs are encouraged to back up their data, as they are the most likely to suffer IT failure and data loss and, if failure does occur, they are the most vulnerable to losing business. Those that do back up often fail to take the basic steps with the backed-up data. Too often the back-up tapes or discs are left alongside the computer or, at best, stored on site in a cupboard or drawer. The back-ups should at least be locked in a fireproof safe or, better still, stored off-site. Accessibility of back-up data is also important. If the data is held in the home of the company director, what happens if they are away on holiday and nobody can get to it?

There are now offerings that allow automatic back-ups to be taken and sent on-line to remote data storage facilities. This gives authorised personnel access at any time, allowing recovery to take place quickly. But taking a regular data back-up is only half of the task. How many organisations test the restoration of their data from the back-up tapes? Very few.
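A restoration test need not be elaborate. As a minimal sketch, assuming the back-up has been restored into a scratch directory alongside the live data, the following compares every original file with its restored copy using SHA-256 checksums from the Python standard library. The function names file_digest and verify_restore are illustrative choices, and a real test would also need to account for files that legitimately changed after the back-up was taken.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_restore(source_dir, restored_dir):
    """Compare each file under source_dir with its restored copy.

    Returns a list of (relative_path, problem) tuples.
    An empty list means every file was restored intact.
    """
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        restored = restored_dir / rel
        if not restored.is_file():
            problems.append((rel.as_posix(), "missing"))
        elif file_digest(src) != file_digest(restored):
            problems.append((rel.as_posix(), "content differs"))
    return problems
```

Run after a trial restore, a non-empty result lists the files that were missed or corrupted, while the data can still be recovered from the live system.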

It is too late, when disaster strikes, to discover that the back-up has failed or that critical data has been missed. The clear message that comes through from IT failures is that not enough thought has been given to the alternative arrangements that should be in place, and that there is insufficient testing of systems and exercising of BC plans. Those who do exercise find errors and omissions before the plans are invoked for real.

According to the CMI research, 87% of UK organisations that annually rehearsed their plans found this to be the case. The new British Standard for BCM, BS 25999, looks set to require regular exercising of plans and evidence that lessons learnt have been incorporated into plan revisions. Perhaps all organisations should learn from others' failures and pay more attention to exercising before they are required to do so under the new Standard.

The Continuity Forum team is available to assist technology companies to deliver or support BCM events and workshops.

END  

If you would like to know more about how your organisation can get involved and benefit from working with the Continuity Forum, please email us or call on + 44 (0) 208 993 1599.