2023-11-26 4AM
What was affected
- K3s cluster (on-premise)
- Internal tooling
What caused it
For some reason I can't locate the logs for, the BT router (the ISP's default) decided that the control plane node's IP address should be 172.16.3.154. For another reason, it also re-addressed the other nodes, somehow putting the node k3s-02 at 172.16.0.1, which is the IP address of the cluster's Kubernetes DNS service.
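For context (these aren't commands I ran at 4AM, just how the collision can be confirmed), the DNS service IP and the address the router handed out can be compared directly; k3s ships CoreDNS behind the kube-dns service in kube-system:

```sh
# Cluster IP the DNS service is supposed to own
kubectl -n kube-system get svc kube-dns

# Address the router actually assigned to the node (run on k3s-02)
ip -4 addr show
```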
I manually changed the IP address of the main node using netplan with the config below:
```yaml
network:
  version: 2
  ethernets:
    enp2s0:
      dhcp4: no
      addresses: [172.16.2.1]
      gateway4: 172.16.3.254
      nameservers:
        addresses: [1.1.1.1, 1.0.0.1]
```
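The config file on its own changes nothing; for completeness (this step wasn't captured in my notes), netplan's own tooling applies it:

```sh
# Test the new configuration with automatic rollback, then make it permanent
sudo netplan try
sudo netplan apply
```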
This then caused the k3s cluster to pick up the IP address on the default-route interface, now 172.16.2.1, as opposed to the IP address specified by tls-san and bind-address.
> It's not eth0, the kubelet picks the interface with the default route. This is standard Kubernetes behavior.
>
> @brandond via GitHub
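As a quick sanity check (a general sketch rather than output captured during the incident), the address k3s will pick can be seen from the default route and the interface that owns it:

```sh
# Which interface owns the default route - the kubelet derives the node IP from it
ip route show default

# Addresses currently assigned to that interface (enp2s0 in the netplan config above)
ip -4 addr show dev enp2s0

# What the cluster has recorded as each node's InternalIP
kubectl get nodes -o wide
```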
The address mismatch then caused the errors below:
```
Nov 26 03:55:26 k3s-01 k3s[21008]: time="2023-11-26T03:55:26Z" level=info msg="Connecting to proxy" url="wss://172.16.2.1:6443/v1-k3s/connect"
Nov 26 03:55:26 k3s-01 k3s[21008]: time="2023-11-26T03:55:26Z" level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp 172.16.2.1:6443: connect: connection refused"
Nov 26 03:55:26 k3s-01 k3s[21008]: time="2023-11-26T03:55:26Z" level=error msg="Remotedialer proxy error" error="dial tcp 172.16.2.1:6443: connect: connection refused"
Nov 26 03:55:30 k3s-01 k3s[21008]: W1126 03:55:30.337944 21008 lease.go:251] Resetting endpoints for master service "kubernetes" to [172.16.2.0]
Nov 26 03:55:31 k3s-01 k3s[21008]: time="2023-11-26T03:55:31Z" level=info msg="Stopped tunnel to 172.16.2.1:6443"
```
This in turn meant that the other nodes could not connect to the control plane.
I then decided that the easiest way to resolve the issue was to wipe the cluster and start again.
What monitoring was in place
- None
What was done to resolve it
Not really understanding the issue at hand, I made the call to wipe the cluster, reinstall it all, and re-connect Flux.
This took around 20 minutes.
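For the record, the rebuild was roughly the following; the exact install flags and the Flux repository details aren't in my notes, so the owner/repository/path values here are placeholders:

```sh
# Remove the old install (k3s ships these uninstall scripts)
/usr/local/bin/k3s-uninstall.sh        # on the server node
/usr/local/bin/k3s-agent-uninstall.sh  # on the agent nodes

# Reinstall the control plane from the official installer
curl -sfL https://get.k3s.io | sh -

# Re-connect Flux to the existing Git repository
# (owner/repository/path are placeholders, not the real values)
flux bootstrap github \
  --owner=<github-user> \
  --repository=<fleet-repo> \
  --path=clusters/home
```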
What did we learn
I could have added a couple of lines to the k3s config.yaml file (sketched below), and it would have fixed things. They would have told k3s to use the IP address 172.16.2.0 (the VRRP address) rather than the IP address attached to the network card with the default route.
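Something along these lines; the exact keys aren't preserved in my notes, so this is a reconstruction using standard k3s server options (tls-san and bind-address were apparently already pointed at the VRRP address, which makes node-ip and advertise-address the likely missing pieces):

```yaml
# /etc/rancher/k3s/config.yaml (sketch) - pin k3s to the VRRP address
# instead of letting it follow the default-route interface
node-ip: 172.16.2.0
advertise-address: 172.16.2.0
```

followed by a restart of the k3s service (systemctl restart k3s) to pick the change up.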
I also learnt that I need to stop putting off the Juniper router migration, as this could have been avoided quite easily.