Hi all,
Thanks for all the feedback given in this thread.
We do appreciate it, and we know that our recent reliability for some customers has been unacceptable. I wanted to set out a bit more of the story, mainly for transparency rather than because we expect it to be "mitigation" in most people's minds.
This post refers to and updates a status post originally made at:
https://aastatus.net/42608
This is where our two roles, that of an ISP with broadband customers and that of a hardware manufacturer, meet head-on and, unfortunately and uncomfortably, collide.
To be abundantly clear, we are very sorry for the outages some customers have suffered. This falls below the standards we set ourselves. We are not happy about it, and a lot of effort is going into sorting it.
The story since
---------------
Several plausible causes have been found, fixed and put through our testing process (before deploying live). Many of these fixes will have resolved genuine problems, but they have not solved what appears to be the "main" issue.
Almost all of these have been at the meeting point between hardware and software. The problem with a hardware hang is that far less diagnostic information is available to assist with debugging.
On several go-arounds now, we have genuinely believed that the issue had been found and fixed, tested in our test-rig offline, and therefore we were keen to place the firmware in active use; the thought being that the sooner it was rolled out, the sooner the unreliability would disappear.
But then, some time after being put live, an FB9000 would suffer another hang. The timing of the hang has been unpredictable, sometimes taking days or weeks to surface. Meanwhile, until it did hang, we still believed the problem had been solved.
"Why not Cisco?"
----------------
Some customers have quite reasonably asked why we do not employ (even temporarily) a 3rd party hardware vendor, such as Cisco, as our LNS supplier. This is an option, but we still feel that the time and money it would cost to implement would be better spent on active R&D to resolve this problem.
We do still believe strongly that the FB9000, when stable, offers us features that distinguish our service from the service of almost all others. Simply, we want bonding, CQM graphs, low power consumption, etc.
It is part of what makes our ISP offering different and better; our USP.
Other issues
------------
Within this same time frame, we have had multiple instances of BT Wholesale doing planned work which they had not told us about in advance (and apparently had not told other ISPs about, either). We could have eliminated the impact of their planned work had they told us about it beforehand.
Multiple times we have raised this with our account manager and at higher levels, and we still have not had a satisfactory response. Of course, no wholesale network is 100% reliable, and we are not unreasonable about this, but the combined appearance, especially to customers not following matters closely, is that it's "another LNS blip". This is unlucky timing that would be bad at any time, but happens to be far worse just now.
A change of plan
----------------
Historically, our October "Factory" firmware release has been stable. The hangs we have seen have all occurred in releases either before or after that one. That release did have at least one major fix in it, addressing a hardware hang (the PCI/NVMe issue).
Our immediate decision is therefore to put all "live" production FB9000 hardware back onto the October "Factory" release, except for our test LNS. To this end, we have already rolled back almost all live LNSs.
Assistance requested if you're willing
--------------------------------------
We invite and encourage customers who want to help fix this to prepend "test-" to their login, which will steer them to the test LNS. Of course, this may be less stable than our regular LNS. Email support for more details.
Rounding up
-----------
Hopefully this post shows we are listening, that there is a vast amount of work going on, and that we've taken a different approach, recognising that this state of affairs has gone on too long and cannot continue.
I recognise that this level of openness is uncommon, but the situation we are in is uncommon; I doubt any other ISP develops its own core equipment.
I politely request that this post be taken for what it is, a genuine attempt to:
* explain in more depth
* announce a change of direction
- and -
* apologise for the outages
... and not as an invitation to simply slag off everything we do.
Nothing we do happens by accident or because of a lack of thought, or a lack of awareness, or a cavalier approach to customer well-being. Decisions sometimes do prove to be wrong, but decisions *are* made, and made with the best of intentions.
There are human beings writing the code.
There are human beings in our Ops and Support teams.
And there are human beings managing the business.
Nobody takes this in any other way than "extremely seriously".
Thanks for taking the time to read this, and we are happy to answer any questions, of course.
--- B
---
Bloor
GM, A&A.
Edited by aabloor (Fri 12-Apr-24 16:47:38)