I do wonder though, what shiny new features were so important in this new software that (a) they put it straight into production, and (b) they persisted with trying to fix it *in production* for months, rather than rolling back to the stable software straight away.
The positive from this is that the existence of software which is stable under the same load strongly suggests that it's *not* flaky hardware after all. Not 100%, but very likely.
I'm assuming they've gone back to the original Firebrick OS that only runs on 2 cores, and that the OS they need to get working is the complete rewrite that allows the new Firebrick to use all the cores. This would explain why they can't just compare the code differences between the working version and the one that crashes: a complete rewrite can't be meaningfully compared with the old code. It might also explain why they needed to throw up more of the new Firebricks, as they would be under-performing on only 2 cores.
All conjecture on my part of course, but if you read this blog post
https://www.firebrick.co.uk/about/news/version-20/ the date of the new version 2.0 OS going live coincided with all the problems starting, and prior to OS 2.0, the new Firebricks were stable when running on the original, older software utilising only 2 cores.
Edited by E300 (Sat 25-May-24 09:50:00)