2023-11-26 4AM

What was affected

  • On-premise K3s cluster
  • Internal Tooling

What caused it

For some reason I can't find logs for, the BT router decided that the control plane node's IP address should be 172.16.3.154

For another unknown reason, the router (the default one from the ISP) also re-addressed the other nodes, somehow putting the node k3s-02 at 172.16.0.1, which is the IP address of the cluster DNS Kubernetes service.

I manually changed the IP address of the main node using netplan with the config below (the address is shown with a /16 prefix, assumed from the addressing above, since netplan requires a prefix length):

network:
  version: 2
  ethernets:
    enp2s0:
      dhcp4: no
      # prefix length assumed as /16 to match the 172.16.x.x addressing; adjust if the subnet differs
      addresses: [172.16.2.1/16]
      gateway4: 172.16.3.254
      nameservers:
        addresses: [1.1.1.1, 1.0.0.1]
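
For reference, applying a change like this uses netplan's standard try/apply workflow; netplan try rolls the change back automatically if the node loses connectivity before you confirm.

# Test the new config with an automatic rollback timer, then persist it
sudo netplan try
sudo netplan apply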

This then caused the k3s cluster to pick up the IP address on the interface with the default route, which was now 172.16.2.1, as opposed to the IP address specified by tls-san and bind-address.

It's not eth0, the kubelet picks the interface with the default route. This is standard Kubernetes behavior.

@brandond via GitHub
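
A quick way to see which address that ends up being on a given node (standard iproute2 and kubectl commands, not taken from the original session):

# Show the source address behind the default route, i.e. what the kubelet will advertise
ip route get 1.1.1.1

# Confirm what each node is currently advertising as its InternalIP
kubectl get nodes -o wide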

The change of advertised address then caused the errors below:

Nov 26 03:55:26 k3s-01 k3s[21008]: time="2023-11-26T03:55:26Z" level=info msg="Connecting to proxy" url="wss://172.16.2.1:6443/v1-k3s/connect"
Nov 26 03:55:26 k3s-01 k3s[21008]: time="2023-11-26T03:55:26Z" level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp 172.16.2.1:6443: connect: connection refused"
Nov 26 03:55:26 k3s-01 k3s[21008]: time="2023-11-26T03:55:26Z" level=error msg="Remotedialer proxy error" error="dial tcp 172.16.2.1:6443: connect: connection refused"
Nov 26 03:55:30 k3s-01 k3s[21008]: W1126 03:55:30.337944   21008 lease.go:251] Resetting endpoints for master service "kubernetes" to [172.16.2.0]
Nov 26 03:55:31 k3s-01 k3s[21008]: time="2023-11-26T03:55:31Z" level=info msg="Stopped tunnel to 172.16.2.1:6443"

This in turn meant that the other nodes could not connect to the control plane.
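
On the agent side, the equivalent place to look is the k3s-agent systemd unit (unit name from the standard k3s install):

# On an agent node: check the service state and recent logs
systemctl status k3s-agent
journalctl -u k3s-agent --since "1 hour ago"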

I then decided that the easiest way to resolve this issue was to wipe the cluster and start again.

What monitoring was in place

  • None

What was done to resolve it

Not really understanding the issue at hand, I made the call to wipe the cluster, reinstall everything, and reconnect Flux.
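
A rebuild like this roughly boils down to the standard k3s uninstall/reinstall plus a Flux re-bootstrap. The sketch below uses the uninstall scripts the k3s installer drops on each node and placeholder values for the Flux repository; the exact bootstrap flags depend on how the Git repo is set up.

# Control plane node: remove the broken install
/usr/local/bin/k3s-uninstall.sh

# Each agent node
/usr/local/bin/k3s-agent-uninstall.sh

# Reinstall the server (it reads /etc/rancher/k3s/config.yaml if present)
curl -sfL https://get.k3s.io | sh -

# Each agent rejoins using the server URL and node token
curl -sfL https://get.k3s.io | K3S_URL=https://<server-ip>:6443 K3S_TOKEN=<node-token> sh -

# Reconnect Flux to the existing Git repository (placeholder owner/repo/path)
flux bootstrap github --owner=<owner> --repository=<repo> --path=<path>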

This took around 20 minutes

What did we learn

I could have added the lines below to the k3s config.yaml file, and it would have fixed things:

node-external-ip: 172.16.2.0
node-ip: 172.16.2.0

This would have told k3s to use the IP address 172.16.2.0 (the VRRP address) instead of the IP address attached to the network card with the default route.
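
Concretely, that would have been an edit to /etc/rancher/k3s/config.yaml on the server (the standard k3s config path) followed by a restart of the k3s service, something like:

# Pin the advertised addresses in the k3s server config, then restart the service
echo "node-ip: 172.16.2.0" | sudo tee -a /etc/rancher/k3s/config.yaml
echo "node-external-ip: 172.16.2.0" | sudo tee -a /etc/rancher/k3s/config.yaml
sudo systemctl restart k3s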

I also learnt that I need to stop putting off the Juniper router migration, as this could have been avoided quite easily.

