Re: [NLNOG] Curious problem with connections from Ziggo customers to Linode nodes in some data centers
Munging the HTTP URLs since the list server claims they are on a spam list; let's see if this gets through. Also switched to plain text since HTML didn't render the URLs properly.
Op 24-08-2023 12:50 CEST schreef Boudewijn Visser (nlnog) <bvisser-nlnog@xs4all.nl>:
Hi Stefan,
While I'm quite old skool, I just never really got into irc, so I missed the conversation.
I've had a look at your packet capture. It doesn't seem to be an MTU issue.
Filtering for the traffic captured on the server side: (ip.src_host == 192.46.232.6 && ip.dst_host == 84.28.119.251) || (ip.dst_host == 192.46.232.6 && ip.src_host == 84.28.119.251)
So it seems your Ziggo public IP is 84.28.119.251. And filtering for the capture from the inside client side: (ip.src_host == 192.46.232.6 && ip.dst_host == 192.168.0.107) || (ip.dst_host == 192.46.232.6 && ip.src_host == 192.168.0.107)
I see an OK session using source port 50006, and then a session that seems to have severe packet loss issues with source port 50007.
See all the TCP retransmissions for the source-port 50007 session; only rarely does a packet get through.
If you can still use this client (same public IP), try:
curl --local-port 50006 http://192.46.232.six
curl --local-port 50007 http://192.46.232.six
That should replicate the problem exactly: the first one always OK, the second one always major problems. Note: expect some bind failures (socket already in use) when trying multiple times shortly after each other.
And - the specific local port that fails or works very likely also depends on the client source IP.
Sabri's suggestion of tcp-traceroute is also valuable.
(Normally, traceroute is done using UDP (classic Unix, Cisco) or ICMP - but it can be done with TCP too.) With some luck, tcp-traceroute may give a hint about the node or path where the failure starts.
I've done a quick test (I happen to be behind Ziggo at the moment) but a tcp traceroute isn't too conclusive. Generally, load balancing within a network is deterministic - based on the IP/port combination, for example.
IMO, the whole problem still looks like a network link that has severe issues (it probably corrupts a large number of packets, which are then dropped at the neighbor node), with traffic load balanced over this link. So some session flows are impacted and others are not.
Since it seems limited to Ziggo clients, it is likely somewhere in the Ziggo network. Something at an exchange point is a more remote possibility - depending on what (other) destinations are impacted, it might just not have been noticed either.
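That load-balancing behaviour can be sketched in a few lines of Python. This is a toy model only - the hash function, the hashed fields, and the four-member link count are illustrative assumptions, not Ziggo's actual configuration - but it shows why the same source port would fail every time if one member of a load-balanced group is broken:

```python
import hashlib

def pick_link(src_ip, src_port, dst_ip, dst_port, n_links=4):
    # Toy ECMP/LAG flow hash: real routers use vendor-specific hardware
    # hashes, but the key property is the same - deterministic per flow.
    key = f"{src_ip}|{src_port}|{dst_ip}|{dst_port}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % n_links

# The same 4-tuple always selects the same member link...
assert pick_link("84.28.119.251", 50007, "192.46.232.6", 80) == \
       pick_link("84.28.119.251", 50007, "192.46.232.6", 80)

# ...while flows differing only in source port spread over several links,
# so if one link is broken, some ports always work and others always fail.
links = {pick_link("84.28.119.251", p, "192.46.232.6", 80)
         for p in range(50000, 50020)}
print(sorted(links))
```

With this model, re-running curl with a fixed `--local-port` keeps hitting the same link, which matches the "50006 always works, 50007 always fails" observation.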
(Some caveats: NAT in the Ziggo modem may change the source port, especially with repeated tests.)
I think that to get anything more, it will need a quite senior Ziggo network engineer to investigate further.
Best regards, Boudewijn
Op 24-08-2023 08:01 CEST schreef Stefan van den Oord <stefan+nlnog@medicinemen.eu>:
Thanks Boudewijn!
There was a lively conversation about this on #nlnog yesterday, so I forgot to respond to you. I tried changing the MTU to 1420; that didn’t make a difference. I did a packet capture as well. This was between server 192.46.232.6 and client 192.168.0.107. The command used on the server was:
tcpdump -Aennvvi eth0 -w server.pcap port not 22
And on the client (because I was connected through VNC):
sudo tcpdump -Aennvvi en1 -w client.pcap port not 22 and port not 5900
During this capture I did two requests (using curl) to http://192.46.232.six, the first one succeeded and the second one I aborted after half a minute. The result is here: http://192.46.232.six/client+server.pcap
I lack the experience to properly analyse this. Does this contain any clues to you?
--
Stefan van den Oord
CTO @ Medicine Men B.V.
Not in the office on Wednesdays
Regulierenring 22, 3981 LB Bunnik, The Netherlands
Thanks for looking further into this, Boudewijn.
On 24 Aug 2023, at 13:02, Boudewijn Visser (nlnog) <bvisser-nlnog@xs4all.nl> wrote:
I've had a look at your packet capture. It doesn't seem to be an MTU issue.
That’s good to know.
Filtering for the traffic captured on the server side: (ip.src_host == 192.46.232.6 && ip.dst_host == 84.28.119.251) || (ip.dst_host == 192.46.232.6 && ip.src_host == 84.28.119.251)
So it seems your Ziggo public IP is 84.28.119.251. And filtering for the capture from the inside client side: (ip.src_host == 192.46.232.6 && ip.dst_host == 192.168.0.107) || (ip.dst_host == 192.46.232.6 && ip.src_host == 192.168.0.107)
I see an OK session using source port 50006, and then a session that seems to have severe packet loss issues with source port 50007.
See all the TCP retransmissions for the source-port 50007 session; only rarely does a packet get through.
If you can still use this client (same public IP), try:
curl --local-port 50006 http://192.46.232.six
curl --local-port 50007 http://192.46.232.six
That should replicate the problem exactly: the first one always OK, the second one always major problems. Note: expect some bind failures (socket already in use) when trying multiple times shortly after each other.
That appears to be correct. What does that mean though?
And - the specific local port that fails or works very likely also depends on the client source IP.
Sabri's suggestion of tcp-traceroute is also valuable.
$ tcptraceroute -i en1 192.46.232.6 80
Selected device en1, address 192.168.0.107, port 50296 for outgoing packets
Tracing the path to 192.46.232.6 on TCP port 80 (http), 30 hops max
 1  192.168.178.1  3.926 ms  3.567 ms  3.542 ms
 2  * * *
 3  hvs-rc0002-cr102-et99-251.core.as33915.net (213.51.196.61)  32.259 ms  20.733 ms  26.945 ms
 4  asd-tr0021-cr101-be155-10.core.as9143.net (213.51.158.110)  15.686 ms  20.290 ms  13.737 ms
 5  nl-ams14a-ri1-ae51-0.core.as9143.net (213.51.64.186)  16.997 ms  18.788 ms  20.759 ms
 6  be3065.ccr41.ams03.atlas.cogentco.com (130.117.14.1)  37.960 ms  18.725 ms  20.228 ms
 7  be2813.ccr41.fra03.atlas.cogentco.com (130.117.0.122)  36.390 ms  18.615 ms  33.758 ms
 8  be2501.rcr21.b015749-1.fra03.atlas.cogentco.com (154.54.39.178)  35.337 ms  18.712 ms  23.170 ms
 9  204.130.243.21  21.488 ms  19.246 ms  27.173 ms
10  * * *
11  * * *
12  * * *
13  192-46-232-6.ip.linodeusercontent.com (192.46.232.6) [open]  32.722 ms  23.759 ms  *

Just dumping the information here, not sure if this provides insight.
(Normally, traceroute is done using UDP (classic Unix, Cisco) or ICMP - but it can be done with TCP too.) With some luck, tcp-traceroute may give a hint about the node or path where the failure starts.
I've done a quick test (I happen to be behind Ziggo at the moment) but a tcp traceroute isn't too conclusive. Generally, load balancing within a network is deterministic - based on the IP/port combination, for example.
IMO, the whole problem still looks like a network link that has severe issues (it probably corrupts a large number of packets, which are then dropped at the neighbor node), with traffic load balanced over this link. So some session flows are impacted and others are not.
Since it seems limited to Ziggo clients, it is likely somewhere in the Ziggo network. Something at an exchange point is a more remote possibility - depending on what (other) destinations are impacted, it might just not have been noticed either.
(Some caveats: NAT in the Ziggo modem may change the source port, especially with repeated tests.)
I think that to get anything more, it will need a quite senior Ziggo network engineer to investigate further.
That’s what I’m afraid of as well, and Ziggo customer support doesn’t even want (doesn’t have the expertise?) to talk about the issue I guess. My lack of deep networking expertise prevents me from understanding and diagnosing this issue properly. I have installed a work-around by introducing a proxy so that all our traffic goes to the Frankfurt server via London. This way our customers can at least use our service again. It would be awesome if someone, “for the greater good”, with more understanding than I have, could pick this up and put it to the right people, but I find myself forced to switch priorities now. Thanks again for your assistance (all of you)!
Hi Stefan, all,
Filtering for the traffic captured on the server side: (ip.src_host == 192.46.232.6 && ip.dst_host == 84.28.119.251) || (ip.dst_host == 192.46.232.6 && ip.src_host == 84.28.119.251)
So it seems your Ziggo public IP is 84.28.119.251. And filtering for the capture from the inside client side: (ip.src_host == 192.46.232.6 && ip.dst_host == 192.168.0.107) || (ip.dst_host == 192.46.232.6 && ip.src_host == 192.168.0.107)
I see an OK session using source port 50006, and then a session that seems to have severe packet loss issues with source port 50007.
See all the TCP retransmissions for the source-port 50007 session; only rarely does a packet get through.
If you can still use this client (same public IP), try:
curl --local-port 50006 http://192.46.232.six
curl --local-port 50007 http://192.46.232.six
That should replicate the problem exactly: the first one always OK, the second one always major problems. Note: expect some bind failures (socket already in use) when trying multiple times shortly after each other.
That appears to be correct. What does that mean though?
A heads-up that the bind failure is normal and to be expected here. I was sufficiently intrigued by the problem that I wanted to look at it soon, but didn't have (/take) the time during working hours to write a more extensive explanation.

A TCP socket is defined by the four-tuple of (source IP, source port, destination IP, destination port). When it is not fully and completely closed from both sides, it hangs around for some fairly long timeout in case any long-delayed packets still come in - even if the program that opened it has quit. So trying to re-use the exact same socket (same source/destination pairs) while it is waiting out those timeouts leads to the bind failure error. That error is not related to your problem.

If indeed the problem is exactly reproducible (always working with source ports x, y, z and always failing with source ports u, v, w) for a fixed source and destination, that is a strong indication that the problem is indeed a broken link that is part of a load sharing/load balancing path somewhere between the client and your server. Given that you've (only - how sure?) had this issue with Ziggo clients, it is likely in their network. The exact source/destination IP and port pairs are helpful/necessary for any engineer who can take a look. [..]
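The "address already in use" part can be demonstrated locally with two sockets. This is a simplified sketch: here the local port is simply still held by an open socket, whereas in the repeated curl tests it is the kernel's TIME_WAIT state that holds the old connection's tuple for a few minutes - but the resulting bind error is the same:

```python
import socket

s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s1.bind(("127.0.0.1", 0))        # let the OS pick a free local port
port = s1.getsockname()[1]

s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s2.bind(("127.0.0.1", port)) # same local (ip, port) -> bind fails
except OSError as e:
    print("bind failed:", e)     # e.g. "Address already in use"
finally:
    s1.close()
    s2.close()
```

This is also why curl's `--local-port` tests need a short pause between attempts: until the old tuple times out, the kernel refuses to hand out the same port again.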
(Some caveats: NAT in the Ziggo modem may change the source port, especially with repeated tests.)
I think that to get anything more, it will need a quite senior Ziggo network engineer to investigate further.
That’s what I’m afraid of as well, and Ziggo customer support doesn’t even want (doesn’t have the expertise?) to talk about the issue I guess.
People who understand this sort of networking problem are indeed not working at first-line customer support ;-) And of course, getting a ticket in as a non-customer is a big hassle anywhere. (I'm not a customer of Ziggo either - I was having a coffee at a place that happened to be a Ziggo customer.)
My lack of deep networking expertise prevents me from understanding and diagnosing this issue properly. I have installed a work-around by introducing a proxy so that all our traffic goes to the Frankfurt server via London. This way our customers can at least use our service again. It would be awesome if someone, “for the greater good”, with more understanding than I have, could pick this up and put it to the right people, but I find myself forced to switch priorities now.
Thanks again for your assistance (all of you)!
Unfortunately I don't think any Ziggo networking staff are on the nlnog.net list. I'm not sure if I know someone in the right place to ping and see if he can put it on the right person's desk. I'll give it (him) a try.
Best regards, Boudewijn
* bvisser-nlnog@xs4all.nl (Boudewijn Visser (nlnog)) [Thu 24 Aug 2023, 21:46 CEST]:
Unfortunately I don't think any Ziggo networking staff are on the nlnog.net list. I'm not sure if I know someone in the right place to ping and see if he can put it on the right person's desk. I'll give it (him) a try.
They're aware, despite some initial difficulties with emails hitting the Office365 spamfilter due to the inclusion of IP addresses. -- Niels. --
Hi,

Yes, they're aware, as are we (Liberty Global) - incident ticket number INC000004886338.

Regards,
--Steven.

-----Original Message-----
From: NLNOG <nlnog-bounces@nlnog.net> On Behalf Of Niels Bakker
Sent: Friday, 25 August 2023 00:08
To: nlnog@nlnog.net
Subject: Re: [NLNOG] Curious problem with connections from Ziggo customers to Linode nodes in some data centers

* bvisser-nlnog@xs4all.nl (Boudewijn Visser (nlnog)) [Thu 24 Aug 2023, 21:46 CEST]:
Unfortunately I don't think any Ziggo networking staff are on the nlnog.net list. I'm not sure if I know someone in the right place to ping and see if he can put it on the right person's desk. I'll give it (him) a try.
They're aware, despite some initial difficulties with emails hitting the Office365 spamfilter due to the inclusion of IP addresses.

-- Niels.

_______________________________________________
NLNOG mailing list
NLNOG@nlnog.net
http://mailman.nlnog.net/listinfo/nlnog
That’s great Steven. Just out of curiosity: is that something that has been going on for longer, or does it coincide with our issue that started about two weeks ago? Also: we lack the expertise to diagnose this, but if there is anything we can do in terms of providing information or testing things, please don’t hesitate to ask!

Kind regards,
--
Stefan van den Oord
CTO @ Medicine Men B.V.
Not in the office on Wednesdays
Regulierenring 22, 3981 LB Bunnik, The Netherlands
+31 85 1307020
OpenPGP Key <https://keys.openpgp.org/vks/v1/by-fingerprint/676D4DA93671DFA4CA44D13D6232002ADF3E0504>
On 25 Aug 2023, at 10:31, Van Steen, Steven via NLNOG <nlnog@nlnog.net> wrote:
Hi,
Yes, they're aware, as are we (Liberty Global) - incident ticket number INC000004886338.
Regards,
--Steven.
-----Original Message-----
From: NLNOG <nlnog-bounces@nlnog.net> On Behalf Of Niels Bakker
Sent: Friday, 25 August 2023 00:08
To: nlnog@nlnog.net
Subject: Re: [NLNOG] Curious problem with connections from Ziggo customers to Linode nodes in some data centers
* bvisser-nlnog@xs4all.nl (Boudewijn Visser (nlnog)) [Thu 24 Aug 2023, 21:46 CEST]:
Unfortunately I don't think any Ziggo networking staff are on the nlnog.net list. I'm not sure if I know someone in the right place to ping and see if he can put it on the right person's desk. I'll give it (him) a try.
They're aware, despite some initial difficulties with emails hitting the Office365 spamfilter due to the inclusion of IP addresses.
-- Niels.
participants (4)
- Boudewijn Visser (nlnog)
- niels=nlnog@bakker.net
- Stefan van den Oord
- Van Steen, Steven