Standard User CecilWard
(learned) Fri 11-May-18 16:14:24

Corruption, HEC errors and AA CQM


I'm getting reported downstream HEC errors from one of my three AA-supplied Dlink DSL-320B-Z1 modems at what seems to be a modest rate but is actually scary when you do the maths. Upstream is fine. The second modem is 10 times better (lower) per cell and at one stage the third was 100 times better per cell.

The 'bad' modem gives downstream 0.3% HEC errors per cell.

I don't know if the modem’s reporting is genuine, and if the figure for the total number of cells is bogus then my fraction is bogus too. But even then, why the factor of 10 and 100 difference? That would seem to indicate that something odd is going on even if the modem’s figures are not right: either one modem is duff, or line #1 is bad, or the modem 1 settings are wrong (insufficient FEC for some reason, for example, or the wrong SNRM).

Is the modem mad? Or am I? The denominator problem.

My derivation of the figure of 0.3% is HEC_errors / xx_cells. What is xx_cells? Unfortunately the modem quotes two plausible-looking numbers, one called ‘total cells’ and one called ‘data cells’, and I have no idea what either of them means. You might expect a true total-cells number, and perhaps a count of data-bearing cells, or of idle cells, or all three. In any case the true total would obviously be the largest and would be the sum of the other two.

modem 1
'total' 1576618
'data' 549514

modem 2
'total' 1419050
'data' 14950880

The modems have not had comparable uptimes here, modem 1 was resynched recently. The two numbers in a pair belonging to one modem do not even have values in a constant relationship when compared with a different modem!

Because I have no idea what is going on, I just chose the max of the two values for the denominator xx_cells. An alternative would have been to add the two together.
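As a rough sketch of the denominator problem, the snippet below takes the four counter values quoted above and works out both candidate denominators (max and sum) for each modem. The function names are made up for illustration, and the HEC error counts themselves are not quoted in this post, so `hec_ratio` takes the count as a parameter rather than guessing one.

```python
# Counter values exactly as quoted from the two modems' status pages.
counters = {
    "modem 1": {"total": 1576618, "data": 549514},
    "modem 2": {"total": 1419050, "data": 14950880},
}

def denominators(c):
    """Return the two candidate values for xx_cells: max and sum."""
    return {
        "max": max(c["total"], c["data"]),
        "sum": c["total"] + c["data"],
    }

def hec_ratio(hec_errors, cells):
    """HEC errors per cell for whichever denominator is chosen."""
    return hec_errors / cells

for name, c in counters.items():
    d = denominators(c)
    # data/total shows that the two labels are not in a constant relationship.
    print(name, d, "data/total = %.2f" % (c["data"] / c["total"]))
```

For modem 1, switching from max to sum would lower the computed ratio by roughly a quarter, so the choice matters but does not by itself explain a factor of 10 between modems.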

I have not yet worked out how to cross-check the believability of the modem’s HEC ratio using the rate of RS uncorrected errors per (x) bits of data. The only other thing I have to go on is the downstream superframe count, and I don't know how to convert that into a number of bits without discovering some more parameter values and reading the standards docs again.

Anyway.

Hypothetically, let's suppose the downstream HEC errors per cell rate of 0.3% obtained as described above is genuine. Then we have one line that is ten times worse than line 2. What does this error rate mean in practice?

I am using the equation 1 - ( 1 - p )**32, where p is the HEC error rate per cell and ** denotes exponentiation, to obtain the probability of a corrupt 1500 byte IP PDU (=32 ATM cells). This comes out to about 9% (ie per IP PDU on that line). So on the one line, TCP will need to retx one in every 11 max-length packets, which is not good at all. I have three lines, so overall it will be about 1 in 33 packets, and that will cover the bad line up.
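The arithmetic can be checked with a few lines of Python. This is only a sketch under the post's own assumptions: cell errors are independent, any errored cell spoils the whole packet, a full-size packet is 32 cells, and (for the three-line figure) traffic is shared evenly with the other two lines contributing negligible errors. The function name is illustrative.

```python
def p_packet_corrupt(p_cell, cells_per_packet=32):
    """Probability that at least one of the packet's cells is errored,
    assuming independent cell errors: 1 - (1 - p)**n."""
    return 1.0 - (1.0 - p_cell) ** cells_per_packet

p = p_packet_corrupt(0.003)              # 0.3% per cell, 32 cells
print(f"per-packet probability: {p:.4f}")
print(f"one packet in {1 / p:.1f} on the bad line")
print(f"one packet in {3 / p:.1f} averaged over three lines")
```

This gives a little over 9% per packet, or about one packet in 11 on the bad line and about one in 33 across three lines, which agrees with the figures quoted.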

Could someone tell me if my maths is completely wrong? (We do not have to believe the modem, especially given that the labels on the figures make absolutely no sense, they can't possibly be ‘totals’, but I'd just like some help with a sanity check.)

If I am right then a HEC error rate that does not sound too scary means a big deal on a flat-out TCP download if 9% of your packets are corrupted.

One other possibility. It could be that the modem is reporting all cells that have errors even if these errors are successfully corrected by the HEC single-bit correction mechanism. There are not two figures labelled ‘HEC’, uncorrectable and corrected.

I asked AA about this, hoping for a sanity check, and asked about the 9% corrupt TCP rate. I didn't get any help at all; staff just refused to be drawn in. I explained that I wasn’t trying to assert that it’s a line issue, and was aware that the whole thing could be bogus. If the modems are not lying, and modem 1 is not broken, and the settings on line 1 are not off, then line 1 just happens to be a lot more noisy than the other two lines, and that's just what happens sometimes. If the numbers and my maths are not bogus, though, then I would say that it is AA’s problem, or ought to be, because they also sold me the modem and you would hope that they want to deliver a working service. That includes non-corrupted PPP frames, and it includes non-corrupted IP packets without relying on TCP (what about UDP-only protocols?).

I need to:
1. swap out modem 1 in case it is broken.
2. Do a test download and capture it in order to try and spot a high rate of TCP retx. Clueless can capture it, unfortunately I don't have the tools myself but no matter, luckily clueless saves the day.
3. Look at the SNRM on line 1 (again).

SNRM and target SNRM

All lines had a 3dB downstream target SNRM. I was expecting this to cause problems of exactly this type, which is why I took a look at the modems’ reported numbers. I increased the d/s target on line 1 from 3dB to the usual 6dB. Weirdly this never seems to do much, the actual SNRM was 2.3dB after a resynch soon after the change and the sync rate did not change. This is very odd and I don't understand it. Changing the target again to 9dB chopped 20% off the sync rate, a ‘cure’ that is worse than the disease and probably not a sensible option since sustaining a slightly higher actual SNRM might be all that is needed, but trying to get rather nearer to 6dB or a bit less rather than the status quo, which is 0.6-2.5 dB varying, does not seem to be happening.

What real current SNRM do other users see when they have chosen a 6dB d/s target with one of these modems? Is it something weird about the modem model, a naughty margin tweak? That doesn't explain why the lines are so different though, because the SNRM values reported are about the same, varying quite a lot.

Packet loss, CQM and 'pings'

AA staff asked me if I was seeing any packet loss and I said no. They then lost all interest, which doesn't seem reasonable.

It seems to me, when you do the maths and have done the to-the-power-32 thing, that you realise that a CQM ppp ping loss rate of 0.3% - because one PPP ping equals one ATM cell - isn't going to show up in clueless much. This is not going to show you that you actually have a really bad problem of 9% per packet corruption rate of full-size IP PDUs. And I have three lines, and that helps to cover up the bad problem that TCP has to deal with. AA needs to successfully deliver 1508 byte-long PPP frames not just short PPP pings. Also VoIP won't show it so much.

So it seems to me that we have something beyond what CQM as we have it currently can ‘see’, and we could put too much trust in it. (Like radiation monitoring equipment that could not see gamma rays.)

If someone wanted to do something very fancy, a L4-snoop function in an FB6000 could count TCP retxs and give a packet loss rate counter which might be interesting? Could even have an optional and configurable alarm on it if the value went weird, way out of the established ‘learned’ norm of that link or outside some global sanity limit. More load on an FB6000 CPU. Also I'm aware that VPNs would make it impossible. Also not everything is TCP. You wouldn't be covering other L4 protocols without additional L4-snooper modules that spoke certain real time UDP protocols or for example SCTP. But so what - you would just do what you could when you could and if some people couldn't get the benefit some times then so be it. It's an enhancement not an essential. And this would not specifically detect corruption, it would pick up a lot of natural packet loss that is essential in the internet as part of congestion-related behaviour and even congestion control. Also multiple line users such as myself would see problems like this diluted down - by a factor of three in my case - so this kind of corruption would be that bit harder to detect by L4 snoop.
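To make the L4-snoop idea concrete, here is a deliberately naive sketch of counting TCP retransmissions from already-parsed segments. It is not a Firebrick feature or API: the `Segment` record and `count_retransmissions` are hypothetical names, and the method (flagging any data segment whose flow and sequence number have been seen before) ignores sequence wraparound, partial overlaps and keepalives.

```python
from collections import namedtuple

# Hypothetical record for one observed TCP segment in one direction.
Segment = namedtuple("Segment", "src dst sport dport seq length")

def count_retransmissions(segments):
    """Return (retransmitted, total) counts of data-bearing segments.

    A segment counts as a retransmission if the same flow has already
    carried a data segment starting at the same sequence number.
    """
    seen = set()
    total = retx = 0
    for s in segments:
        if s.length == 0:
            continue  # pure ACKs carry no data, so skip them
        total += 1
        key = (s.src, s.dst, s.sport, s.dport, s.seq)
        if key in seen:
            retx += 1
        else:
            seen.add(key)
    return retx, total

# Made-up example traffic: the third segment repeats the first.
example = [
    Segment("a", "b", 1, 2, 1000, 100),
    Segment("a", "b", 1, 2, 1100, 100),
    Segment("a", "b", 1, 2, 1000, 100),
    Segment("a", "b", 1, 2, 1200, 0),
]
retx, total = count_retransmissions(example)
print(f"{retx} of {total} data segments retransmitted")
```

As noted above, a counter like this cannot tell corruption from ordinary congestion loss, so it would only be useful relative to a learned baseline for the link.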

Another Firebrick feature, turning now to the FB2x00 series, that is very interesting would be a counter for the number of bad L3 (or L4?) TCP/IP checksums seen. I don't know whether this is in there already?
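For reference, the check such a counter would perform at L3 is the standard Internet checksum (RFC 1071): sum the header as 16-bit big-endian words with end-around carry, and a valid IPv4 header sums to zero after complementing. The sketch below is a generic illustration, not Firebrick code; `count_bad_ipv4_headers` is a hypothetical helper and the sample header is a commonly used textbook example.

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 one's-complement checksum over 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"          # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:           # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def count_bad_ipv4_headers(headers):
    """Count headers whose checksum, computed over the whole header
    including its checksum field, does not come out as zero."""
    return sum(1 for h in headers if internet_checksum(h) != 0)

# Textbook 20-byte IPv4 header whose checksum field is 0xb861.
good = bytes.fromhex("45000073000040004011b861c0a80001c0a800c7")
bad = bytearray(good)
bad[8] ^= 0x01                   # flip one bit in the TTL byte
print(count_bad_ipv4_headers([good, bytes(bad)]))
```

Note that the IPv4 header checksum covers only the header; payload corruption would show up in the TCP or UDP checksum instead, which is computed the same way over a pseudo-header plus the segment.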

AA help

I really could do with a little help from AA. A hand. In the following -

1. Sanity check me, my maths, assumptions
2. Help check whether (their own) modems are lying / insane ie bugged
3. If so then we could advise customers that the stats on them are just bugged and not to be fooled thus saving everyone from a waste of time
4. Help find out what is going on with the instantaneous d/s SNRM and the target SNRM as I should be able to have the _option_ of getting the real instantaneous SNRM up to something nearer to 6dB without having to go crazy and set a horrible 9dB d/s target.

I was a little disappointed. I understand AA staff are bound to rightly take this as low priority compared to people who are really stuffed. But I still would appreciate a helping hand, whenever someone could take a while to look over it seriously and just be a second pair of eyes.

AA, you just got a five-star vote from me in the recent best ISP contest, so don't let me down.

Any other users who would be kind enough to sanity-check me, report their own experiences or advise please do help if you are able.
Standard User RobertoS
(elder) Fri 11-May-18 17:53:54

Re: Corruption, HEC errors and AA CQM


[re: CecilWard]
 
You haven't told us if these lines are ADSL2+ or FTTC. If FTTC, then tweaking the SNRM as you say you did is highly frowned upon by Openreach. A few modems seem to manage it, but all the ones I tried didn't do at all what I expected from my previous experiences on ADSL2+.
It is the DLM system that sets the line profile, and this should not be interfered with by CPs/users setting rates, SNR margins etc. at the modem.
BT SIN498.

To check if it is the modem either faulty or mis-reporting, I would just have swapped it with the best-performing one of the three. That would tell you if it was the modem or the line/socket.

Do you have any wired extensions on the "bad" line?

My broadband basic info/help site - www.robertos.me.uk. Domains, site and mail hosting - Tsohost.
Connection - AAISP Home::1 80/20. 200GB. Sync 67717/13670Kbps @ 600m. BQMs - IPv4 & IPv6
Standard User CecilWard
(learned) Sat 12-May-18 06:40:11

Re: Corruption, HEC errors and AA CQM


[re: RobertoS]
 
It's ADSL2 on an extremely long line, PPPoEoA, Firebrick router, three modems. I should indeed have set the scene.

A friend, ejs, has pointed out some of the flaws in my understanding of HEC errors. I had forgotten that they don't cover the whole cell, just the cell header. So the numbers cannot possibly be interpreted the way I did: if they were real, the true error rate would be far worse than I thought, given that I am not even counting the errors in the rest of each cell. So that means the whole thing is bogus.

This puts a hold on everything. All I can say is that there is a mountain of stuff I don't understand, and the DLink UI is incomprehensible and nonsensical as it lacks any kind of explanation, qualification or definition of terms. It's also a mystery why these strange numbers are so wildly different between modems.

Basically, the best I can say is - do not look at the DLink modem’s numbers. Health warning.



Standard User CecilWard
(learned) Sat 12-May-18 06:48:15

Re: Corruption, HEC errors and AA CQM


[re: RobertoS]
 
Roberto - I didn't perform any SNRM tweaking on the modems at all, there is no such facility on these models. You just assumed way too much. AA’s clueless.aa.net.uk server can alter BT's target SNRM and it gets BT to reconfigure the DSLAM and commands a resynch remotely. AA gives its users control over the range of target SNRM and stability settings that BT supports.
Standard User RobertoS
(elder) Sat 12-May-18 09:25:08

Re: Corruption, HEC errors and AA CQM


[re: CecilWard]
 
In reply to a post by CecilWard:
Roberto - I didn't perform any SNRM tweaking on the modems at all, there is no such facility on these models. You just assumed way too much. AA’s clueless.aa.net.uk server can alter BT's target SNRM and it gets BT to reconfigure the DSLAM and commands a resynch remotely. AA gives its users control over the range of target SNRM and stability settings that BT supports.
Note that my comments about tweaking the SNRM were preceded by an “If” :) and related to FTTC. As for tweaking on ADSL2+, either by the user settings in the modem at their end, or via AA on the exchange modem/DLM, then yes it is not a problem.

I’m unhappy about your comment that I “assumed way too much” :( It reads to me as being antagonistic.

What about swapping the modems as I suggested as well, as the obvious way to narrow down where your problem is? :( Though it is likely to be the line or socket/wiring. The three lines in question could well be subject to very different external influences between you, the cabinet and the exchange.

Standard User CecilWard
(learned) Sun 13-May-18 02:26:19

Re: Corruption, HEC errors and AA CQM


[re: RobertoS]
 
Roberto, my apologies, I did not spot the if. :)
Standard User ionic
(fountain of knowledge) Tue 15-May-18 11:27:50

Re: Corruption, HEC errors and AA CQM


[re: CecilWard]
 
Why not set up a ping graph on your firebrick directed to 81.187.81.187 over that particular link with a large payload size, or ask AA support to do something similar from their end back to the WAN IP on that line?

That way you will see if you're seeing the effect you're concerned about.
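A rough idea of what such a graph should show, if the hypothesis from the first post held, can be worked out in advance. The sketch below assumes AAL5 framing (48 payload bytes per cell plus an 8-byte trailer), independent cell errors, and the unverified 0.3% per-cell figure; `frame_bytes` is taken to be the whole frame handed to AAL5 with any encapsulation overhead already included, and the sizes listed are arbitrary examples.

```python
import math

CELL_PAYLOAD = 48   # payload bytes carried by one ATM cell
AAL5_TRAILER = 8    # bytes of AAL5 trailer appended before padding

def cells_for_frame(frame_bytes):
    """Number of ATM cells needed for one AAL5 frame of this size."""
    return math.ceil((frame_bytes + AAL5_TRAILER) / CELL_PAYLOAD)

def expected_loss(frame_bytes, p_cell=0.003):
    """Expected loss rate if any errored cell kills the whole frame."""
    return 1.0 - (1.0 - p_cell) ** cells_for_frame(frame_bytes)

for size in (40, 500, 1000, 1508):
    print(size, cells_for_frame(size), f"{expected_loss(size):.1%}")
```

If large-payload pings over the suspect line show loss close to that of small pings, rather than something that grows with size as in this table, the 0.3% figure is probably not a real per-cell corruption rate.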
Standard User RobertoS
(elder) Tue 15-May-18 13:18:22

Re: Corruption, HEC errors and AA CQM


[re: CecilWard]
 
That's OK :)

But, is there a technical reason in your setup you can't swap the best and worst reporting modems over? As I suggested earlier.
