BBC News reports that British Airways is blaming its global IT failure that resulted in travel disruption for 75,000 people on human error:
The boss of British Airways’ parent company says that human error caused an IT meltdown that led to travel chaos for 75,000 passengers.
Willie Walsh, chief executive of IAG, said an engineer disconnected a power supply, with the major damage caused by a surge when it was reconnected.
He said there would now be an independent investigation “to learn from the experience”.
However, some experts say that blaming a power surge is too simplistic.
Mr Walsh, appearing at an annual airline industry conference in Mexico on Monday, said: “It’s very clear to me that you can make a mistake in disconnecting the power.
“It’s difficult for me to understand how to make a mistake in reconnecting the power,” he said.
He told reporters that the engineer was authorised to be in the data centre, but was not authorised “to do what he did”.
I can’t help but find the explanation very odd.
The fact is that big companies like British Airways have a variety of safety nets in place. Should one system fail, then it should be easy to switch over to another. If there is a power outage then you have backup power systems waiting to kick in, and you have hardware in place to handle any unexpected power surges.
Furthermore, you site your backup systems at different locations – so if something bad does happen at one site, the other is unaffected.
Your backup systems are kept in sync, mirroring each other's data, so that it's as simple as possible to switch from one to another with the minimum of fuss.
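The active/standby pattern described here can be sketched in a few lines: probe each site in priority order and route to the first healthy one. This is a minimal illustration only — the site names and the health-check stub are hypothetical, not a description of BA's actual setup:

```python
# Minimal active/standby failover sketch. Site names and the health
# probe are hypothetical, purely illustrative.

def is_healthy(site: str) -> bool:
    """Stand-in for a real probe (heartbeat, replication-lag check, etc.).
    Here we simulate the primary site being down."""
    return site != "primary-dc"

def select_active_site(sites: list[str]) -> str:
    """Return the first healthy site in priority order — the
    'switch over to another system' behaviour described above."""
    for site in sites:
        if is_healthy(site):
            return site
    raise RuntimeError("No healthy site available - total outage")

# Priority order: primary first, then a geographically separate backup.
sites = ["primary-dc", "backup-dc-remote"]
print(select_active_site(sites))  # falls back to the remote backup
```

The point of the sketch is that the failover decision itself is trivial; the hard part — and seemingly where BA fell down — is keeping the standby site genuinely ready to take the load.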
Did British Airways not have the right failsafes in place? Were its separate data centres not kept in sync? How was it possible for physically remote backup systems to be impacted?
And, perhaps most importantly, why did regular testing of potential disaster scenarios not allow them to discover there was a potential huge problem looming, and take remedial action?
British Airways would like us to believe that some gormless engineer pulled out a power cable and then plugged the systems in again without authorisation.
Yes, there almost certainly was some human error involved. But the people to blame are more likely to be in senior management than on the front line.
Check out this recent episode of the Smashing Security podcast, where we discuss the British Airways IT meltdown (amongst other topics):
Smashing Security #023: 'Covfefe'
Found this article interesting? Follow Graham Cluley on Twitter or Mastodon to read more of the exclusive content we post.
11 comments on “British Airways blames IT meltdown on human error”
It's all very suspicious.
My bog-standard Amazon Web Services account allows me to replicate my application and data across servers in different continents just by ticking a box. They charge extra for this of course.
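The "ticking a box" the commenter describes corresponds, on AWS, to something like S3 cross-region replication, which amounts to attaching a small configuration object to a bucket. A hedged sketch follows — the bucket names and IAM role ARN are invented, and the actual API call is shown commented out since it needs real credentials:

```python
# Sketch of an S3 cross-region replication rule — roughly the "tick a
# box" redundancy the commenter mentions. All names/ARNs are hypothetical.
replication_config = {
    "Role": "arn:aws:iam::123456789012:role/replication-role",  # hypothetical
    "Rules": [{
        "ID": "replicate-everything",
        "Status": "Enabled",
        "Priority": 1,
        "Filter": {},  # empty filter = replicate all objects
        "DeleteMarkerReplication": {"Status": "Disabled"},
        "Destination": {"Bucket": "arn:aws:s3:::my-backup-bucket-eu"},
    }],
}

# With credentials configured, applying it would look like this (boto3):
# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_replication(
#     Bucket="my-primary-bucket",
#     ReplicationConfiguration=replication_config,
# )
print(replication_config["Rules"][0]["ID"])
```

As the commenter notes, the cloud providers charge for the duplicated storage and transfer — but the mechanism itself is a solved problem.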
So I agree, there is no acceptable explanation for this debacle.
I can't believe that such a complex system had a) a single point of failure and b) it took so long to recover. Sounds like the complete system needs a close examination. Was it just BA or were other IAG companies affected?
Perhaps PaddyAir's pencil and paper system has its merits?
"And, perhaps most importantly, why did regular testing of potential disaster scenarios allow them to discover there was a potential huge problem looming, and take remedial action?"
Ahh yes. Well spotted that man! (Fixed)
More BA BS. Our small company has measures in place such that, if our main web survey server goes down, we can bring up another in a different centre with all the data on it, within a couple of hours of the outage, even with no access to the original server whatsoever. We have LOTS of backups all over the place and they are automatically maintained by our own in-house systems. If we can do it, then so can larger organisations. Personally, I find that the larger a company, the worse its IT adaptability is.
If BA thinks a single human failure is justification for a total crash, then god help us if they treat aircraft safety the same way.
I would hazard a guess that BA is pushing the idiotic fall-guy explanation for insurance purposes.
LOL. Willie Walsh told the public that BA systems have no redundancy. Just funny.
There is no formal standard for separation of primary and back-up datacentres, but the industry best practice is considered to be at least 30 miles (50 km) apart.
Although the Heathrow area isn’t famous for its risk of earthquakes, floods, tsunamis or tornadoes, datacentres should be kept away from large industrial facilities, and should not be dependent upon the same source of electrical power, but should be on a different power grid.
They're not easy to identify, but I believe I've located the 2 BA datacentres at Comet House & Boadicea House. If so, they are both less than 1km from Heathrow's main runways; 4km apart by road along the Heathrow perimeter road, but only 2km apart as the crow flies. Not the best locations if there was ever an 'incident' at Heathrow that required the area to be evacuated. Whilst the 2 sites themselves are not within designated flood risk zones, the route between them is!
If I were advising BA, I'd suggest a more remote 'back-up' site, such as Gatwick (40 miles away).
I think some are assuming that BA just has one big ol' ERP system to back up…simple. My experience is not so. The larger the client, the more systems (with integrations to other systems, copies of systems for testing, reporting, development, training, backups for everything…and on and on). I agree that it is unforgivable not to have a fallback system for production, but this is not a simple task. Everybody chimes in now with 'throw more money at the problem' – then who is the first to complain when they have to buy an airline ticket?
If the unfortunate guy who switched the power off simply said "Oh shit!" and switched it back on again, that was probably the worst thing he could possibly do. The inrush current to a rack of equipment can often be much greater than the running current, and in order to power a whole data centre back on you'd have to do it very, very carefully, and you'd probably need to take care not to switch anything on which depended on something else that you hadn't yet switched on. Furthermore, if anything was due to fail it's likely that the power-on stress would push it over the edge, quite apart from the fact that the sudden power loss may well have caused a number of instances of disk corruption. SSDs in particular really don't like nasty surprises when they're in the middle of doing their internal garbage collection.
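The ordering constraint described here — never power on a system before everything it depends on is already up — is essentially a topological sort over a dependency graph. A toy sketch using Python's standard library (the service names and dependencies are invented for illustration):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical dependency map: each service lists what must be
# running before it can safely be powered on.
deps = {
    "storage-array": [],
    "database": ["storage-array"],
    "booking-app": ["database"],
    "check-in-frontend": ["booking-app", "database"],
}

# static_order() yields every service only after all of its
# dependencies — a safe staged power-on sequence.
power_on_order = list(TopologicalSorter(deps).static_order())
print(power_on_order)
```

Bringing racks up in stages like this also spreads out the inrush current the commenter describes, instead of hitting the whole data centre with it at once.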
Quite why the failover didn't work is an altogether different question, and there will certainly be serious questions to be answered. With probably hundreds of applications it may be that piecemeal failover was all that had ever been tested. A full failover of everything when you're running a 24/7 worldwide business would be a pretty scary thing to do. Not that it shouldn't have been done, but how many of the people accusing BA of incompetence would like to take the responsibility of managing it?
Then there's the unknown unknowns, often obvious with the benefit of 20/20 hindsight but nearly impossible to foresee. I well remember discussing business continuity and environmental threats with our ISO27000 assessor some years ago. We discussed planes falling out of the sky, all the usual things. And this oil storage depot a quarter of a mile away. If that went up, we agreed, it'd just go straight up. No big problem for us except for likely traffic disruption. Even a 1000lb bomb at that distance, we thought, would do no more than crack a few windows. Then Buncefield hit the national headlines. It was a mystery for quite a long time to the investigators why 100 tons of petrol vapour had detonated rather than just going WWOOOMMPHH like petrol normally does. Meanwhile, it made a mess of our main building. Moral of the tale: with the best will in the world, there's always something you haven't thought of.