BA IT system down. How? Why? :: General Broadband Chatter

ian72
(eat-sleep-adslguide) Wed 31-May-17 08:30:19

Re: BA IT system down. How? Why?

[re: Michael_Chare]

The article does state there is a secondary data centre that took up "some of the slack". Given BA's reliance on IT I would have thought the secondary data centre should be scaled to take up all of the operations in the event of a failure of the primary - that does not appear to be the case if the article is correct.

Also, if it isn't capable of taking all of the load then prioritising systems that would keep core airline operations up and running should have been key - it appears that everything died when they should have been able to keep up a percentage of key services.
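As a rough illustration of that prioritisation idea, here is a minimal sketch of picking which systems to keep alive when the secondary site has less capacity than the primary. The service names, priority numbers and load figures are invented for the example and are not BA's; a real plan would also have to respect dependencies between systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    priority: int  # lower number = more critical (hypothetical scale)
    load: int      # capacity units the service needs (hypothetical figures)


def select_services(services, capacity):
    """Greedily keep the most critical services that fit in the capacity.

    Services are considered in priority order; one that does not fit is
    skipped, and less critical ones are still tried against what is left.
    """
    kept = []
    remaining = capacity
    for svc in sorted(services, key=lambda s: (s.priority, s.load)):
        if svc.load <= remaining:
            kept.append(svc.name)
            remaining -= svc.load
    return kept


if __name__ == "__main__":
    services = [
        Service("flight-ops", 1, 40),
        Service("check-in", 1, 30),
        Service("loyalty", 3, 20),
        Service("marketing-site", 4, 30),
    ]
    # The secondary site is assumed to offer 80 units against 120 required.
    print(select_services(services, 80))
```

With these made-up numbers the secondary keeps check-in and flight-ops and sheds the rest, which is the "percentage of key services" outcome rather than everything dying at once.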

Edited by ian72 (Wed 31-May-17 08:36:47)

richi
(regular) Thu 01-Jun-17 11:04:17

Re: BA IT system down. How? Why?

[re: ian72]

In reply to a post by ian72:
The article does state there is a secondary data centre that took up "some of the slack"

The whispers I've heard said that, when the failover data center came up, they discovered corrupt data, because the replication hadn't been working properly. So they couldn't use the secondary (known as Comet House).

In essence, all this talk about power surges is a smokescreen to cover up the fact that BA's DR strategy failed.

3 km line on THTG: 18/1.2 Mb/s with Sky
Previously: BT ISDN, Nildram, Plusnet, 186k, EFH, Be*, Plusnet (again), Pulse8

deleted
(deleted) Thu 01-Jun-17 11:39:55

Re: BA IT system down. How? Why?

[re: richi]

Perhaps Macrium Reflect would help them? It backs up my system daily and sends me an e-mail if the backup is not successful.

deleted
(deleted) Thu 01-Jun-17 11:48:30

Re: BA IT system down. How? Why?

[re: richi]

Presumably they didn't test the replication was working due to cost constraints?

ian72
(eat-sleep-adslguide) Thu 01-Jun-17 11:50:11

Re: BA IT system down. How? Why?

[re: deleted]

They should be doing periodic DR tests - but if the data corruption happened after the last test then it may not have been a simple thing to spot. I keep telling people that it doesn't matter how much redundancy you put in: you still have to account for the fact that IT may be unavailable for a protracted period of time, and so you need business continuity plans in place to know what to do if that happens.

RobertoS
(elder) Thu 01-Jun-17 12:27:26

Re: BA IT system down. How? Why?

[re: ian72]

At this level it should not be about replicating files and then bringing the remote server in to replace the other after the world-wide system has gone down. It should simply be an automatic re-routing.

In principle similar to hot-swapping drives at a local level, but obviously with a great deal more complexity.

The failing of the Heathrow hub should not have been visible to the outside world at all.

My broadband basic info/help site - www.robertos.me.uk. Domains, site and mail hosting - Tsohost.
Connection - AAISP Home::1 80/20. Sync 63679/13080Kbps @ 600m. BQMs - IPv4 & IPv6

ian72
(eat-sleep-adslguide) Thu 01-Jun-17 13:53:38

Re: BA IT system down. How? Why?

[re: RobertoS]

Yes, it should be a replica of the hardware with near-real-time data sync between the file systems (and those file systems are likely to be on an enterprise-grade SAN or similar). A failure at the main site would then automatically fail over to the secondary site with a delay measured in milliseconds. However, if the data is corrupt at the secondary site for some unknown reason then the systems could fail catastrophically - this is what is currently being posited by some people.
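To make the corrupt-replica risk concrete, here is a minimal sketch of a failover decision that checks the replica before trusting it. Everything in it is an assumption for illustration: the function names, the idea of comparing a SHA-256 digest recorded at the primary, and the three outcomes. It says nothing about how BA's systems or any particular SAN product actually work.

```python
import hashlib
from typing import Iterable


def digest(blocks: Iterable[bytes]) -> str:
    """Return a SHA-256 hex digest over a sequence of data blocks."""
    h = hashlib.sha256()
    for block in blocks:
        h.update(block)
    return h.hexdigest()


def choose_site(primary_healthy: bool,
                primary_digest: str,
                replica_blocks: Iterable[bytes]) -> str:
    """Decide where to serve from (hypothetical policy).

    - "primary" while the primary site is healthy
    - "secondary" if the replica matches the digest last recorded at the primary
    - "halt" if the replica does not match, instead of serving corrupt data
    """
    if primary_healthy:
        return "primary"
    if digest(replica_blocks) == primary_digest:
        return "secondary"
    return "halt"


if __name__ == "__main__":
    good = [b"booking-1", b"booking-2"]
    recorded = digest(good)
    print(choose_site(True, recorded, good))
    print(choose_site(False, recorded, good))
    print(choose_site(False, recorded, [b"booking-1", b"corrupt"]))
```

The point of the third outcome is that an unverified replica is worse than no replica: running this sort of comparison continuously, not just at the last DR test, is what would flag replication that had quietly stopped working.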

It seems to have been a chain of failures (none of which should have happened) that has resulted in an outage that the resilience should have ensured didn't happen.

Whilst this shouldn't happen it seems I am seeing it more and more at the moment - the root causes are different but the result is the same, loss of critical services for days.

RobertoS
(elder) Thu 01-Jun-17 14:53:29

Re: BA IT system down. How? Why?

[re: ian72]

For the replication to be duff means the system was not fit for purpose. Plus as per my opening post, before even this level of information was released:

Surely something that could affect BA worldwide should have at least two if not more mirrored systems/hubs geographically far apart? With dedicated links with minimal latency. It was possible in my day, so what and why and how can this have happened?

I agree with

In reply to a post by ian72:
The article does state there is a secondary data centre that took up "some of the slack". Given BA's reliance on IT I would have thought the secondary data centre should be scaled to take up all of the operations in the event of a failure of the primary - that does not appear to be the case if the article is correct.

Also, if it isn't capable of taking all of the load then prioritising systems that would keep core airline operations up and running should have been key - it appears that everything died when they should have been able to keep up a percentage of key services.

in some ways, but not in others. In particular, from the article:

Under normal circumstances, power would have been returned to the servers in Boadicea House slowly, allowing the airline's other Heathrow data centre, at Comet House, to take up some of the slack.

But, on Saturday morning, just minutes after the UPS went down, power was resumed in what one source described as "uncontrolled fashion." "It should have been gradual," the source went on.

This caused "catastrophic physical damage" to BA's servers, which contain everything from customer and crew information to operational details and flight paths. No data is however understood to have been lost or compromised as a result of the incident.

BA's technology team spent the weekend rebuilding the servers, allowing the airline to return to normal operations as of today.

Sources close to the airline indicated that had the power been restored more gradually, BA would have been able to cope with the outage, and return services far more quickly than was the case.

There is no suggestion there that the backup system was in a position to take over any functions instantaneously. It could well have been purely running remote disc mirroring.

As for the Heathrow system coming straight back up and so wrecking everything: what?

The point is that there should have been no downtime at all to the international online systems. We aren't talking about the Sainsbury's national network, which could legitimately work with a central failure, with tills and stock control running happily on the in-store systems.


Banger
(eat-sleep-adslguide) Fri 02-Jun-17 20:41:53

Re: BA IT system down. How? Why?

[re: RobertoS]

Latest headline from the Independent and Telegraph is that an "IT worker switched off the power supply to the server".

Tim
www.uno.net.uk & freenetname
Asus DSL-N55U and TP-Link WD9970 on 80 Meg LLU Fibre
http://www.thinkbroadband.com/speedtest/results.html...

Current Sync: 69892/17901

oldswan
(learned) Fri 02-Jun-17 21:57:15

Re: BA IT system down. How? Why?

[re: Banger]

Sounds like a Specsavers advert doesn't it? I wonder if they will have the cheek to make one about it?
