Cable was designed for mass distribution from the headend out to the customers. The reverse path is a later add-on and relies on a scheduled request-to-send mechanism to stop all the customers trying to talk at the same time, which adds latency to acks etc. Each upstream channel does this independently. TCP throughput is very latency dependent. Multi-threaded throughput increases if multiple acks can be bundled together, reducing the average latency.
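To put a rough number on "TCP throughput is very latency dependent": a single stream can have at most one window of data in flight per round trip, so the ceiling is window / RTT. A minimal sketch, where the 64 KiB window and the round-trip times are illustrative assumptions rather than figures from any real line:

```python
def max_tcp_throughput_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on single-stream TCP throughput: one window per round trip."""
    bits_per_window = window_bytes * 8
    rtt_seconds = rtt_ms / 1000
    return bits_per_window / rtt_seconds / 1_000_000


if __name__ == "__main__":
    # Hypothetical 64 KiB window at a few assumed round-trip times.
    for rtt_ms in (10, 20, 40):
        mbps = max_tcp_throughput_mbps(65536, rtt_ms)
        print(f"RTT {rtt_ms} ms -> at most {mbps:.1f} Mb/s")
```

Doubling the round-trip time halves the ceiling for a fixed window, which is why a few extra milliseconds matter more to one stream than to several running in parallel.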
Likewise on the downstream, each packet (not Ethernet packet) of data needs to be queued, since it is actually sent to everyone connected to that same channel and the modem picks off the packets intended for it. These packets, which can arrive on any one of the downstream channels in any order, then need to be put back together into Ethernet frames before being passed on to the router (or the router part of the SuperHub). Again there is additional latency, because the parts may arrive out of order but the data can only be passed on once all of them have arrived. This is why latency is higher and subject to more jitter on cable.
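The hold-back behaviour described here can be sketched as a small resequencing buffer. This is only an illustration of the idea, not how any particular modem does it: the sequence numbers starting at 0 and the function name are assumptions made for the example.

```python
def resequence(arrivals):
    """Return, for each arrival, the list of sequence numbers released in order.

    Anything that arrives ahead of the next expected number is held back
    until the gap in front of it has been filled.
    """
    pending = set()
    next_seq = 0  # assumed starting sequence number
    released = []
    for seq in arrivals:
        pending.add(seq)
        batch = []
        # Release every contiguous packet starting at the expected number.
        while next_seq in pending:
            pending.remove(next_seq)
            batch.append(next_seq)
            next_seq += 1
        released.append(batch)
    return released


if __name__ == "__main__":
    # Packets 1 and 3 arrive early and wait for 0 and 2 respectively.
    print(resequence([1, 0, 3, 2]))
```

An early packet sits in the buffer for as long as the packet in front of it takes to turn up, so the added delay depends on how far out of order the channels deliver.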
Okay.
Are you familiar with selective acknowledgement, DOCSIS scheduling, Payload Header Suppression, ack suppression and per-service flow upstream burst parameter/buffer configuration?
The scheduler arranges packets to minimise out-of-order arrival and to ensure that what remains can be mitigated by even a small buffer on the cable modem. It's a FIFO operation by default, so as soon as a packet arrives on the IP side it gets split up into 188 byte chunks, encapsulated in MPEG-2 frames and fired down the service group. There's not a lot of queuing and not much delay: you can fit an awful lot of 188 byte + overhead frames into a 55.6Mb/s bearer.
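For a sense of scale, here is the arithmetic on the 188 byte and 55.6Mb/s figures above. It is a back-of-envelope sketch that ignores the "+ overhead" part entirely, and the 1500 byte IP packet is an assumed example size:

```python
import math

FRAME_BYTES = 188          # chunk size quoted above
BEARER_BPS = 55.6e6        # bearer rate quoted above, in bits per second


def frame_time_us(bearer_bps: float) -> float:
    """Time to serialise one 188 byte frame onto the bearer, in microseconds."""
    return FRAME_BYTES * 8 / bearer_bps * 1e6


def frames_per_second(bearer_bps: float) -> int:
    """Whole 188 byte frames that fit into one second of the bearer."""
    return int(bearer_bps // (FRAME_BYTES * 8))


def chunks_for_packet(packet_bytes: int) -> int:
    """Number of 188 byte chunks needed for one packet (overhead ignored)."""
    return math.ceil(packet_bytes / FRAME_BYTES)


if __name__ == "__main__":
    per_frame = frame_time_us(BEARER_BPS)      # roughly 27 microseconds
    chunks = chunks_for_packet(1500)           # assumed 1500 byte packet
    print(f"{frames_per_second(BEARER_BPS)} frames/s, {per_frame:.2f} us each")
    print(f"1500 byte packet: {chunks} chunks, about {chunks * per_frame:.0f} us")
```

Even a full-size packet is on the wire for a fraction of a millisecond on one channel, which is small next to the millisecond-scale delays discussed elsewhere in the thread.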
The scheduler is not like MLPPP - the bonding is at a lower level than MLPPP, it runs at the MAC layer and is fully managed by the CMTS.
Selective acknowledgement and ack suppression dramatically reduce the need for upstream transmission, to the point where 350Mb/s of downstream can be accommodated by 7Mb/s of upstream with room to spare, through use of a TCP proxy on the CPE to eliminate redundant acknowledgements.
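A rough way to see why suppression matters for the 350 to 7 ratio above is to estimate the upstream rate that acks alone would need. The sketch below assumes 1500 byte downstream packets and 64 byte acks on the wire; both sizes, and the function name, are illustrative assumptions, and `segments_per_ack` stands in for however many data packets each surviving ack covers.

```python
DATA_PACKET_BYTES = 1500   # assumed downstream packet size
ACK_BYTES = 64             # assumed size of one ack on the wire


def ack_upstream_mbps(downstream_mbps: float, segments_per_ack: int) -> float:
    """Upstream rate consumed by acks for a given downstream rate."""
    packets_per_second = downstream_mbps * 1e6 / (DATA_PACKET_BYTES * 8)
    acks_per_second = packets_per_second / segments_per_ack
    return acks_per_second * ACK_BYTES * 8 / 1e6


if __name__ == "__main__":
    # One ack per 2 packets is ordinary delayed-ack behaviour; the higher
    # ratios model redundant acks being dropped before they go upstream.
    for ratio in (2, 4, 8):
        print(f"1 ack per {ratio} packets: "
              f"{ack_upstream_mbps(350, ratio):.2f} Mb/s upstream")
```

Under these assumptions, plain delayed acks would need about 7.5Mb/s for 350Mb/s of downstream, so it is the dropping of redundant acks that brings the figure comfortably under 7Mb/s.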
Payload Header Suppression reduces the overhead on layer 3/4 protocols by eliminating redundant information.
Buffers in DOCSIS 3.0 can be configured per service flow. VM have a lot of control over this depending on the firmware they use on their CMTS.
The ~4ms of jitter inherent in DOCSIS without congestion isn't avoidable without some more advanced work. The access network delay in DOCSIS 3.0 is 4-8ms due to the frequency of upstream MAPs (sent every 2ms on the downstream by default), alongside CMTS processing delays and contention slot availability, but there's no need for this to seriously impact throughput.
I have no idea what's happened at VM towers, but a friend nearby can max out with a single stream. They're on a very lightly utilised node running on a Cisco 10k with the 3 x 1 GE SPA delivering 12 downstreams, the tactical solution awaiting full CCAP deployment. That makes me think it's an issue with either CMTS software/hardware or CCAP software/hardware.
I'm sure you know about CCAP so no need for me to supply any links on that one.
There are some potential issues with regard to single thread performance when using Remote PHY, but as VM aren't doing that just yet, that shouldn't be a problem.
There may be some vendor-specific issues going on alongside an overly stretched core network, but DOCSIS 3.0 itself isn't an impediment to very high single-threaded throughput.