On 18/04/2020 16:06, Michael Everitt wrote:
Hmm, but how is that affecting routing? MAC is physical layer, that should theoretically make (virtually!) no difference? What problem is that causing higher up the stack??
I'll be honest chief, I wasted about 6 hours on this last night before - for the first time in years - giving up before I went crazy. Sometimes you just have to know when to move on to more pressing issues. Fortunately this is one of those things that got caught early in staging and so nothing in production is affected - I am going to have to fix it before long though, just got to smash the job queue a bit first.
More eyes on this is going to be the best way to figure out what's wrong, and I've already got a few other people looking it over (not all the configs are mine so it's entirely possible it's nothing I've personally done). I'll outline the issue as briefly as I can, as it replicates easily enough on my two main home workstations, which is where I first noticed the problem. I'll omit a lot of stuff for brevity.
The stuff:
Ubuntu 19.10 + 20.04 host machines
Both have LACP ethernet bonds across multiple interfaces + separate management LAN, etc
Enterprise switch has matching LAG group + VLANs tagged through
Virtualbox hypervisor on both machines, lots of different VMs (all are bridged to the LACP bond0 but can be switched to other interfaces or tunnels)
Both machines are multihomed with multiple gateways/routes/DNS available
Also lots of egress SSH and VPN tunnels in use on both

The symptoms:
Fire up random VM on either. _Everything_ network related works fine in VMs: internet browsing, wget files, git clone, etc. What does NOT work is the built in operating system package management tools on _some_ distros. That's literally the only thing that doesn't work - DNS timeouts everywhere but only for some of the configured repos. Seen so far on Arch, Mint, Ubuntu and Debian. Slack, Fedora, RHEL and Ubuntu 20.04 specifically and Windows are 'immune', haven't tested Macs or BSD or all my Linux VMs yet (too many of them just for a start).
Digging:
Switching between my egress methods whilst keeping the VMs in bridged mode has almost no discernible effect that can be presumed statistically significant, despite this changing their VLAN ID, network/subnet, gateway, DNS and route. Sometimes a slightly different apt mirror times out but it still results in general failure. _Some_ repos still work instantly every time - including all PPAs and random little personal repos (including my local ones). Flipping _any_ affected VM from bridged to NAT mode however instantly fixes it.
Smoking gun?:
On my system any traffic not specifically guided out otherwise "falls through" into my admin VLAN which is automatically dropped into a permanent Wireguard tunnel to my VPN provider. So any "naive" traffic, which the NAT'd VMs are included in, gets routed out through my VPN provider and not my local ISP connection - this comes obviously enough with its own separate gateway and physically emerges elsewhere (but still in the UK). Any affected VM works perfectly egressing like this! At least I can rule out that all these mirrors are somehow down, I didn't think that could be possible after all.
DNS Trouble:
I feel the bug is here specifically, but may be wrong. Generally, every single network segment and VLAN is handled specifically by my own local DNS server. There is some complexity in how it handles its upstream DNS, which varies depending on the VLAN feeding it requests (split horizon DNS obviously, so internal resources are available to everything but without necessarily leaking DNS requests through my ISP for private/work systems). Obviously I can change the VM client's DNS at will - and have - but this doesn't change anything!
Sanity checking:
I've remoted into some client networks and double checked similar setups in testing there - same results. But then they are nearly all pretty geographically local. Speaking of which...
Tentative diagnosis:
I *think* what is happening is something like this: the load balancers at the other end are messing me up somehow. It looks like DNS but clearly isn't that simple. Ubuntu/Debian/Mint all use geographically distributed load balancers on their big mirrors that not only pick a round robin DNS mirror to send back to the client but then that content is served from a specific box situated "close by" in network cost terms. THAT is the bit that somehow triggers the... fault? Bug? I'm not even quite sure how to class it. I'm not ruling out a configuration misstep either - it could even be a freakish combination of all of them.
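One way to compare what a bridged VM and a NAT'd VM are actually handed by round robin DNS is to resolve the same mirror name several times on each and tally the addresses. This is only a rough sketch, not something from the original investigation: the function name collect_addresses is made up for illustration, archive.ubuntu.com is assumed here as the top level mirror name, and any local caching resolver may hide the rotation entirely.

```python
import socket
from collections import Counter


def collect_addresses(hostname, attempts=5, resolver=socket.getaddrinfo):
    """Resolve hostname several times and count how often each address appears.

    resolver is injectable so the function can be exercised without a network;
    by default it is the standard library's socket.getaddrinfo.
    """
    seen = Counter()
    for _ in range(attempts):
        try:
            results = resolver(hostname, 80, type=socket.SOCK_STREAM)
        except socket.gaierror as exc:
            # Record lookup failures (e.g. DNS timeouts) alongside real answers.
            seen[f"error: {exc}"] += 1
            continue
        # Each result is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the IP address for both IPv4 and IPv6.
        for address in {info[4][0] for info in results}:
            seen[address] += 1
    return seen


# Example call (hostname is an assumption, run it inside each VM):
# for address, count in collect_addresses("archive.ubuntu.com").most_common():
#     print(count, address)
```

Running the same call on an affected bridged VM and again after flipping it to NAT would show whether the two egress paths are being pointed at different server sets, or whether the lookups themselves are what times out.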
If I leave my VMs where they should be (on the production LACP bond and VLAN, behind my DNS server and routed through my normal gateway via my ISP) I can make them work, it's just a pain and obviously it shouldn't need any adjustment whatsoever. Some VMs continue working flawlessly on default settings pulling from the _same damn mirrors_ (Ubuntu 20.04 is the stand out weird one here). On affected Ubuntu/Debian/Mint/Arch VMs if I manually edit their repositories away from top level mirrors with load balancers and choose specific servers, they immediately start working.
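For the apt based VMs, the manual edit described above amounts to swapping the hostname in each active deb line. A minimal sketch of that substitution follows, assuming a classic one-line-per-entry sources.list; pin_mirror is a hypothetical helper, archive.ubuntu.com is assumed as the load balanced name and mirror.example.org is a placeholder for whichever specific server gets chosen.

```python
def pin_mirror(sources_text, old_host, new_host):
    """Return sources.list text with old_host swapped for new_host.

    Only active "deb" and "deb-src" lines are touched; comments and
    commented-out entries are left exactly as they were.
    """
    rewritten = []
    for line in sources_text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(("deb ", "deb-src ")):
            # Match the host only where it appears as the URL authority.
            line = line.replace(f"://{old_host}/", f"://{new_host}/")
        rewritten.append(line)
    return "\n".join(rewritten) + "\n"


# Example (placeholder hostnames, illustrative suite name):
# text = "deb http://archive.ubuntu.com/ubuntu eoan main\n"
# print(pin_mirror(text, "archive.ubuntu.com", "mirror.example.org"))
```

The function only returns the new text, so the result can be diffed against the original before anything under /etc/apt is overwritten.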
So to be clear, there doesn't seem to be _anything_ wrong with my setup, even though I've changed a lot of stuff in the last week or so (the enterprise switches are new as are the bonds and some VLANs). I've also lived and breathed this stuff for years - I know how to configure all this stuff in my sleep. I also log _everything_ and the logs say nothing.
The big fat glitch seems to be how certain open source top level mirrors firstly feed back a different round robin DNS to the client, guiding it towards say bytemark or mirrors.ac.uk. Then, depending on the IP of that client, a specific server instance is chosen via geolocation. All of my normal VM traffic egresses out through my ISP so can be geolocated to down here in the South West pretty accurately. That hand off seems to be what is breaking in certain circumstances, for certain VMs. If I NAT my VM out it goes through the VPN tunnel and emerges in a London datacenter somewhere - and that gets perfect results. That seems to be the core of the problem. Geolocated by a top level load balancing FLOSS mirror to the South West and you're a VM bridged through a LACP VLAN behind a local DNS instance? FAIL. Same VM geolocated by same mirrors to a London datacenter? PASS.
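To separate "the name resolved" from "the chosen server instance is actually reachable", each address a mirror name hands back could be probed directly with a plain TCP connect from both egress paths. The following is a sketch under assumptions rather than a known fix: probe is a hypothetical helper, port 80 is assumed because the mirrors are served over HTTP, and a successful connect says nothing about whether the package manager's own requests then succeed.

```python
import socket
import time


def probe(address, port=80, timeout=5.0):
    """Try a TCP connection to address:port.

    Returns (ok, seconds, detail), where detail is "connected" on success
    or the text of the OSError (refused, timed out, unreachable) on failure.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True, time.monotonic() - start, "connected"
    except OSError as exc:
        return False, time.monotonic() - start, str(exc)


# Example (192.0.2.10 is a documentation address, substitute a resolved mirror IP):
# print(probe("192.0.2.10"))
```

If the same address connects from the NAT'd VM but not from the bridged one, the fault sits in the path to that server rather than in DNS; if both connect, attention shifts back to what the package manager does differently.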
Is it the package manager's logic that is somehow wrong? After all, any affected VM can still access Google/Amazon/everything else on the entire internet just fine and they're most definitely geographically load balanced systems too! It's just the damned package managers. I'd nearly narrowed it down to blaming some weird Debian specific glitch (the trashed VMs were all Debian flavours whilst Windows and Redhat derivative Linux continue without issues)... until the Arch VM turned up broken as well. I really wanted to blame the newest stuff first - LACP specifically. But literally everything else is working perfectly. And the workaround VMs NAT'd out over the VPN that do resume normal behaviour? That traffic is passing over the very same LACP bond. Doh!
Honestly mindblown by this one ¯\_(ツ)_/¯
It's on pause for now either way - please bear in mind that I've actually left out vast amounts of technical detail here as well, it would take months to explain the whole lot fully! My intuition is that something like this *has* to be my fault really. There's an unforeseen glitch or loop or rogue cache somewhere and eventually I'll find it and curse myself.
On with some easier jobs for now. Like solving world hunger or proving P=NP.
--
The Mailing List for the Devon & Cornwall LUG
https://mailman.dcglug.org.uk/listinfo/list
FAQ: http://www.dcglug.org.uk/listfaq