Going to take a few minutes to just fill out a bit on the other posts I've made on this topic briefly rather than getting into a urinating contest with the forum technical Gods.
If you, as a Zen user, have seen 'ae' in a traceroute you've gone over a LAG - Juniper equipment, among others, names LAGs 'aggregated Ethernet'. I remember seeing this plenty when I used Zen retail.
Cisco call them Port Channels, so links with 'po' and a number often indicate Cisco LAGs.
All these do, when a packet needs to use the link, is a bit of maths to work out which physical link to use, that's it. With 2 links it's essentially nothing more than distilling the packet headers down to odd or even. With 4 it's turning those headers into a number, dividing by 4 and taking the remainder, then using link 0, 1, 2 or 3. Once that's done every packet in that connection will use the same link. Different connections will use different links, so any slow speed caused by one link would be inconsistent from one connection to the next. Anything, whether GEA, BTWholesale or TalkTalk, that goes over those links will experience the exact same issues. If the two sides of the LAG choose different links it doesn't matter - these are cables connecting the same chassis, and likely the same line cards either side, that are probably running next to each other and physically cable managed together in a bundle.
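For anyone who wants the 'bit of maths' spelled out, here's a rough sketch in Python. It is not how any particular vendor does it: real kit hashes in hardware, and which header fields go into the hash is configurable. The function name pick_link, the use of CRC32 and the example addresses are all my own assumptions, purely for illustration.

```python
import zlib

def pick_link(src_ip, dst_ip, src_port, dst_port, proto, n_links):
    """Map a connection's headers to one member link of a LAG.

    Hypothetical helper: CRC32 stands in for whatever hash the
    hardware really uses. The point is only hash -> remainder -> link.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    # Turn the headers into a number, divide by the number of links
    # and keep the remainder: that remainder is the link index.
    return zlib.crc32(key) % n_links

# Made-up connection using documentation-range addresses.
flow = ("198.51.100.7", "203.0.113.9", 51000, 443, "tcp")

# Same headers in, same link out: every packet of a connection
# stays on one physical link.
print(pick_link(*flow, 2), pick_link(*flow, 4))
```

With 2 links the answer is only ever 0 or 1 (the odd or even case), with 4 links it's 0 to 3. A different source port is a different connection and may well land on a different link.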
TL;DR it's extremely unlikely to be anything to do with that.
ECMP is nothing more than having multiple routes to the same place and using all of them. This can and does create asymmetry and jitter. However, how much jitter do you reckon is possible on links between the same chassis in the same building, probably in the same line of racks? And don't you think that if there were issues like that you'd see them in the following hops?
Traceroute:
https://www.thousandeyes.com/learning/glossary/trace...
Traceroute most commonly uses Internet Control Message Protocol (ICMP) echo packets with variable time to live (TTL) values. The response time of each hop is calculated.
The Time To Live is sent starting at 1, then 2, and so on until the probe reaches the destination. Most devices that receive it and aren't tunneling your packets will decrement the TTL by one when they receive the packet. If the TTL becomes zero they hand the packet to their control plane or a slow routing path to be dealt with, as it requires some action from the device itself beyond just forwarding it.
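The TTL walk described above can be sketched as a toy simulation. No real packets are sent here. The hop names and both function names are made up for the example, and it assumes every hop decrements the TTL and answers when it hits zero, which real networks don't always do.

```python
def send_probe(path, ttl):
    """Walk one probe along path; return (responder, kind).

    path is a list of hop names ending with the destination.
    Each router counts the TTL down by one; the router that takes
    it to zero answers with a TTL expired message.
    """
    for hop in path[:-1]:
        ttl -= 1
        if ttl == 0:
            return hop, "ttl-expired"
    # The probe survived every router, so the destination answers.
    return path[-1], "reply"

def traceroute(path, max_ttl=30):
    """Send probes with TTL 1, 2, 3... until the destination replies."""
    seen = []
    for ttl in range(1, max_ttl + 1):
        responder, kind = send_probe(path, ttl)
        seen.append(responder)
        if kind == "reply":
            break
    return seen

# Hypothetical path, names invented for illustration.
path = ["home-router", "isp-edge", "isp-core", "destination"]
print(traceroute(path))
```

Each TTL value reveals one more hop, which is all a traceroute listing really is.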
https://traceroute.home.blog/category/general-networ... has a nice quote:
In terms of configuration the control plane should be considered as an interface through which any traffic destined for the device must pass. This traffic can enter through any physical interface, but before it is processed it passes through the control plane “interface”.
Packets with TTL 1 hitting the router enter the control plane or a 'slow' data plane as they can't go further. They must be dealt with by this device and need to have an ICMP TTL expired message generated. The time it takes to generate and send this message and for it to arrive back at you is your latency for that hop in the traceroute. Where the TTL is above 1 the router will send it on through the fast data plane / forwarding path which in the case of these routers is a high speed, high bandwidth very low latency Application Specific Integrated Circuit routing and switching fabric.
There are, of course, buffers in between where this data hits the router and the CPUs that handle the control plane and slow data plane. When these buffers fill, or if there is policing in place to drop everything that isn't answered immediately, the traceroute probe is dropped. Until then packets wait for the CPU to service them and send out the ICMP TTL expired message. This is both right at the bottom of the priority list for the CPU and throttled in itself to protect the system, so there is policing both on entry to the control plane in the first place and on generating the ICMP messages. This is to protect the router from having its CPU drained.
That set of CPUs likely has some work to do handling the huge routing table. They probably export telemetry, handle generation of alerts, etc. As long as the higher latency doesn't show up in the hops after that one in a traceroute, it isn't an issue.
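To give a feel for the policing side, here's a minimal token bucket sketch, a common way of rate limiting in general. I'm not claiming any specific router does it exactly like this; the class name IcmpPolicer and the rate and burst numbers are assumptions picked for the example.

```python
class IcmpPolicer:
    """Toy token bucket limiting how many ICMP replies get generated.

    Hypothetical model: rate_per_sec tokens are added per second up
    to a ceiling of burst, and each reply generated costs one token.
    """

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.burst = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # time of the previous probe

    def allow(self, now):
        """Return True if a reply may be generated at time now."""
        elapsed = now - self.last
        self.last = now
        # Refill for the time that has passed, never above the burst.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the limit: this probe gets no answer

# Five probes arriving at the same instant, then one a second later.
policer = IcmpPolicer(rate_per_sec=1, burst=2)
burst_results = [policer.allow(0.0) for _ in range(5)]
later_result = policer.allow(1.0)
print(burst_results, later_result)
```

The first two probes in the burst get answered and the other three don't, then one more is allowed a second later once a token has been refilled. That's the sort of thing that shows up as loss or stars at a single hop without touching the traffic being forwarded through the router.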
https://hal.inria.fr/hal-01111190/document
The main problem that arises when making use of TTL-limited probes is that ICMP feedback from routers is often neither instantaneous nor entirely reliable. Indeed, as the generation of ICMP error messages takes place in the slow path of the data plane, manufacturers and operators impose a low priority on it, in order to minimize the overall load on routers. Other internal tasks mostly related to the control plane, like route computation and management operations, might take precedence over it, especially when resources are shared between slow path and control plane
If nothing shows up in the hops after the ones using LAGs / ECMP it's nothing to do with that either.
That's it. I appreciate the frustration, this thread is huge, but don't get sidetracked with this stuff. ANY issues with the LAGs, ECMP or the higher ping response coming from the Zen edge router would, if relevant, show in every hop from there on down the traceroute. It really is that simple - you'd see latency, loss or jitter throughout. You don't.
I appreciate I don't have 40,000+ posts in this forum's technical section; however, if you are really, really bored you can read
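If it helps, the rule of thumb above can be written down as a quick check. This is only a sketch: the function name, the 20 ms threshold and the latency figures are invented for the example, not taken from anyone's real traceroute.

```python
def first_persistent_hop(rtts_ms, threshold_ms):
    """Return the index of the first hop from which EVERY later hop,
    destination included, is above the threshold, or None.

    A hop that is slow on its own while later hops are fine is just
    a router being slow to answer, not a forwarding problem.
    """
    for i in range(len(rtts_ms)):
        if all(rtt > threshold_ms for rtt in rtts_ms[i:]):
            return i
    return None

# Made-up numbers: hop 1 answers slowly but later hops are fine.
slow_to_answer = [1.2, 35.0, 8.1, 8.4, 9.0]
# Made-up numbers: the latency carries on through every later hop.
real_problem = [1.2, 35.0, 36.2, 37.0]

print(first_persistent_hop(slow_to_answer, 20))  # None
print(first_persistent_hop(real_problem, 20))    # 1
```

Only the second pattern points at the link or the device. The first is what a low priority ICMP response from a busy edge router looks like.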
https://www.rfc-editor.org/rfc/rfc7747.html for why some edge devices have higher latency when being pinged and the truly excruciating
https://www.rfc-editor.org/rfc/rfc4098 for control plane / forwarding/data plane stuff. I have no intention of reading either.
The issues almost certainly relate to Zen's Plexus network, hence why some people on it are fine, others aren't, even when they're going across the dreaded ECMP LAGs.
Coffee break over. Cheers.
EDIT: Just to reiterate, Pluralist, I am not reading your posts here since I responded to the last one, it being abundantly clear it's a waste of both my time reading them and yours writing them. This post is for the benefit of those actually interested in this and to avoid support tickets flying into Zen because people see 'lag' in a traceroute and think it's a problem. I'm sure if you look hard enough through some documentation you can find something to nitpick, I've simplified a ton. However, quoting documentation on something you only realised existed yesterday says a lot more about you than it does me, so it's your call whether you waste your time trying to be 'right' or keep that to The Park.
EDIT 2: It's overly simplistic not to mention that there are 2 data plane forwarding paths in many routers, the fast path and the slow one, and in some cases the slow forwarding path, usually sharing resources with the control plane, handles traceroute responses. In others they go to the control plane itself. It depends, but for clarity it should be mentioned, and it cuts off some pedantry even though the end result is the same.
Edited by XGS_Is_On (Thu 10-Nov-22 14:00:20)