Poor uptime and reliability :: AAISP


Pages in this thread: 1 | 2 | 3 | 4 | 5 | 6 | 7 | [8] | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | (show all)


E300
(committed) Thu 11-Apr-24 16:46:27

Re: Poor uptime and reliability

[re: Sun4Lw5LIQy]

In reply to a post by Sun4Lw5LIQy:
For a company that tries to pride itself on transparency, the ball really has been dropped recently. The FireBrick platform has some good features, but the trade-off is an unstable internet connection at a premium cost. I’m rooting for the team to get things fixed, but they might have to go back to the drawing board and admit the current platform isn’t going to work for customers. Noticeable speed drops, connection drops and a lack of transparency are making me heavily reconsider who I go with next. It won’t be A&A.

Yes, I agree. I'm out of contract and wondering how long I'll stick with it. It is starting to irk me that I'm paying a premium but seeing that reflected less and less in the product and service. This last drop, like the one before, has gone completely unacknowledged, and there have been no real updates on this issue for a while now.

AAISP BQM - IPv6 BQM - IPv4

Edited by E300 (Thu 11-Apr-24 18:03:07)

bellerby
(newbie) Thu 11-Apr-24 18:01:41

Re: Poor uptime and reliability

[re: E300]

Couldn't agree more. Unfortunately I'm still under contract but things will have to improve for me to stay. For all we know the latest incident could have happened on the "more stable" software. The lack of any update is very concerning.

candlerb
(knowledge is power) Fri 12-Apr-24 09:28:50

Re: Poor uptime and reliability

[re: bellerby]

I'm not an AAISP customer, and I'm very unlikely to consider them in future after this.

It sounds to me like AAISP now have a tough decision to make. They have to decide whether they are primarily a router hardware vendor, with an ISP on the side to act as a large group of (paying) beta testers; or primarily an ISP, whose job is to provide top-rate Internet connectivity.

If they want to be the latter, they have to acknowledge that Firebrick isn't currently "best of breed" when it comes to LNS/BRAS, and they need a second vendor to provide actual service while they sort out their problems, or at least to roll back to the older hardware.

The decision to remove the older, slower but reliable LNSes (which were providing service to the customers on lower speed connections), and replace them with a model which is faster but known to be unstable, was madness IMO.

jpm
(fountain of knowledge) Fri 12-Apr-24 11:53:28

Re: Poor uptime and reliability

[re: candlerb]

It's at least time to put out a blog post with an update as to where they are currently and what the plans for resolving this look like. I think the Firebrick is an ARM appliance so they can't even run the software on something else while they debug the hardware.

j0hn83
(knowledge is power) Fri 12-Apr-24 13:56:21

Re: Poor uptime and reliability

[re: candlerb]

In reply to a post by candlerb:
The decision to remove the older, slower but reliable LNSes (which were providing service to the customers on lower speed connections), and replace them with a model which is faster but known to be unstable, was madness IMO.

Those on the slower, more reliable FB6000 LNSs (nicknamed gormless, I believe) were moved to other FB6000s. The vacated FB6000s were swapped for the newer, troublesome FB9000s (nicknamed witless).
So nobody was moved from an FB6000 to an FB9000, but it added additional FB9000s to spread the witless LNS load.

So that particular move was sensible in my opinion, though the whole saga seems a bit of a s*** show.
Essentially everyone on a package above 80Mb is a beta tester.

E300
(committed) Fri 12-Apr-24 14:30:52

Re: Poor uptime and reliability

[re: j0hn83]

In reply to a post by j0hn83:
but it added additional FB9000's to spread the witless LNS load.

I wonder if they needed more of these LNSs because the firmware they are calling factory stable may be a very early one? From memory of the status updates over the last 16 months or so, the early firmware versions were not optimised for the new processors (something to do with only running on one core, or not being multi-threaded enough), but were stable.

So if they had to go back to this non-optimised but stable firmware, it would explain why they have had to throw more boxes at it to make up for the lower performance, as they presumably have more customers on the faster packages now, plus CityFibre with symmetrical connections that they didn't have before.

That isn't what they told us at the time. They suggested more boxes would mean fewer people affected by any one lock-up, but more boxes with the same probability of crashing would, over time, see the same total number of customers affected, so I couldn't see the logic in that.
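A rough way to check that reasoning, as a minimal sketch only: assume customers are spread evenly across the boxes and that each box hangs at the same fixed rate regardless of load. Both are assumptions for illustration, and the numbers below are made up rather than A&A figures. Under those assumptions the expected total of dropped sessions does not change with the number of boxes; only the number of customers hit by each individual hang gets smaller.

```python
def customers_hit_per_hang(customers: int, boxes: int) -> float:
    """Customers dropped by a single hang, assuming an even spread across boxes."""
    return customers / boxes


def expected_drops(customers: int, boxes: int, hangs_per_box_per_day: float, days: int) -> float:
    """Expected total dropped sessions over `days`.

    Assumes (hypothetically) that every box hangs at the same rate,
    independent of how many customers it carries.
    """
    expected_hangs = boxes * hangs_per_box_per_day * days
    return expected_hangs * customers_hit_per_hang(customers, boxes)


if __name__ == "__main__":
    # Illustrative numbers only: 10,000 customers, one hang per box every 100 days.
    for boxes in (4, 8):
        total = expected_drops(10_000, boxes, 0.01, 100)
        per_hang = customers_hit_per_hang(10_000, boxes)
        print(f"{boxes} boxes: {per_hang:.0f} customers per hang, {total:.0f} expected drops in 100 days")
```

If the hang rate actually rises with load per box, spreading customers over more boxes could reduce the total as well, so the conclusion depends on that assumption.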

They've now had an L2TP router lock up (https://aastatus.net/42655) and drop a lot of customers. They blamed that on an early FB9000 prototype and are replacing (or have replaced) it with a new box, but I'm sure they said the same about the LNSs and replaced that hardware with production kit, which we know didn't fix the issue. It would seem these new FireBricks are just not stable, full stop, and I really hope they have a plan B.

Of course, with the lack of any up-to-date information, we are all reading things into what is going on and perhaps not coming to the correct conclusions.

AAISP BQM - IPv6 BQM - IPv4

Edited by E300 (Fri 12-Apr-24 14:48:23)

candlerb
(knowledge is power) Fri 12-Apr-24 15:11:04

Re: Poor uptime and reliability

[re: j0hn83]

In reply to a post by j0hn83:
Those on the slower, more reliable FB6000 LNS's (nicknamed gormless I believe) were moved to other FB6000's. The FB6000's were swapped with the newer troublesome FB9000's (nicknamed witless).
So nobody was moved from an FB6000 to an FB9000, but it added additional FB9000's to spread the witless LNS load.

Ah yes, thank you: re-reading the thread it was made clear earlier on. There is a slightly smaller FB6000 pool as a result, and slightly less headroom/redundancy.

E300
(committed) Fri 12-Apr-24 15:13:23

Re: Poor uptime and reliability

[re: E300]

Just to add, there is some more information at https://social.aa.net.uk/public/local covering the issues and troubleshooting. It would be good if they added this link to the service status pages; it's a bit more chatty and verbose in the information it provides.

AAISP BQM - IPv6 BQM - IPv4

aabloor
(newbie) Fri 12-Apr-24 16:40:50

Re: Poor uptime and reliability

[re: E300]

Hi all,

Thanks for all the feedback given in this thread.

We do appreciate it, and we know that our recent reliability for some customers has been unacceptable. I wanted to set out a bit more of the story, mainly for transparency rather than because we expect it to be "mitigation" in most people's minds.

This post refers to and updates a status post originally made at:

https://aastatus.net/42608

This is where our two roles, that of an ISP with broadband customers and that of a hardware manufacturer, meet each other head-on and, unfortunately and uncomfortably, collide.

To be abundantly clear, we are very sorry for the outages some customers have suffered. This falls below the standards we set ourselves. We are not happy about it, and a lot of effort is going into sorting it.

The story since
---------------

Several plausible causes have been found, fixed and tested in our testing process (before deploying live). Many of these will have fixed genuine problems, but not solved what appears to be the "main" issue.

Almost all of these have been at the meeting point between hardware and software. The problem with a hardware hang is that far less diagnostic information is available to assist with debugging.

On several go-arounds now, we have genuinely believed that the issue had been found and fixed, tested in our test-rig offline, and therefore we were keen to place the firmware in active use; the thought being that the sooner it was rolled out, the sooner the unreliability would disappear.

But then, some time after being put live, an FB9000 would suffer another hang. The nature of the hang has been unpredictable (i.e. when it would happen); sometimes taking days or weeks to surface. Meanwhile, until it did hang, we still believed the problem had been solved.

"Why not Cisco?"
----------------

Some customers have quite reasonably asked why we do not employ (even temporarily) a third-party hardware vendor, such as Cisco, as our LNS supplier. This is an option, but we still feel the time and money that implementation would cost are better spent on active R&D to resolve this problem.

We do still believe strongly that the FB9000, when stable, offers us features that distinguish our service from the service of almost all others. Simply, we want bonding, CQM graphs, low power consumption, etc.

It is part of what makes our ISP offering different and better; our USP.

Other issues
------------

Within this same time frame, we have had multiple instances of BT Wholesale doing planned work which they had not told us about in advance (and apparently had not told other ISPs about, either). We could have reduced the impact of their planned work to zero, had they told us beforehand that they were doing it.

Multiple times we have raised this with our account manager and at higher levels, and we still have not had a satisfactory response. Of course, no wholesale network is 100% reliable; we are not unreasonable about this, but the combined appearance, especially to customers not following matters closely, is that it's "another LNS blip". Unlucky timing, which would be bad any time, but happens to be far worse just now.

A change of plan
----------------

Historically, our October "Factory" firmware has been stable. The hangs we have seen have all occurred in releases prior to that one, or since that one. That release did have at least one major fix in it, addressing a hardware hang (the PCI/NVMe issue).

Our immediate decision is to therefore put all "live" production FB9000 hardware back onto the October "Factory" release, except for our test LNS. To this end, we have already rolled back almost all live LNSs.

Assistance requested if you're willing
--------------------------------------

We invite and encourage customers who do want to assist with the process of fixing this to prepend "test-" onto their login, which will steer them to the test LNS, and help the effort to fix the problem. Of course this may be less stable than our regular LNS. Email support for more details.

Rounding up
-----------

Hopefully this post shows we are listening, that there is a vast amount of work going on, and that we've taken a different approach, recognising that this state of affairs has remained too long and cannot be carried on.

I recognise that this level of openness is uncommon, but the situation we are in is uncommon; I doubt any other ISP develops its own core equipment.

I politely request that this post is taken for what it is: a genuine offer to:

* explain in more depth
* announce a change of direction
- and -
* apologise for the outages

... and not as an invitation to simply slag off everything we do.

Nothing we do happens by accident or because of a lack of thought, or a lack of awareness, or a cavalier approach to customer well-being. Decisions sometimes do prove to be wrong, but decisions *are* made, and made with the best of intentions.

There are human beings writing the code.
There are human beings in our Ops and Support teams.
And there are human beings managing the business.

Nobody takes this in any other way than "extremely seriously".

Thanks for taking the time to read this, and we are happy to answer any questions, of course.

--- B

---
Bloor
GM, A&A.

Edited by aabloor (Fri 12-Apr-24 16:47:38)

perlen
(newbie) Fri 12-Apr-24 17:10:50

Re: Poor uptime and reliability

[re: aabloor]

Hi Alex, regarding:
"To this end, we have already rolled back almost all live LNSs."

Why wasn't this announced, or warned about, on the A&A status page?
A few of us have seen unplanned events affecting multiple users without knowing the reason.
Thanks.
