Resolve some issues while operating K8S on Bare-Metal

IPMI Troubleshooting

Infinite Loading of Supermicro BMC

  1. Turn off any ad blockers in your browser.

How to Fix the NVMe 1GB Issue?

Causes

  1. Firmware memory corruption due to a forced shutdown of the server
    • This actually happened on a Samsung PM983 NVMe SSD.

Solution

  1. nvme format <disk>
    • You may need to install the nvme-cli package first (see the sketch after this list).
  2. Reboot the node.
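
A minimal sketch of the solution above, assuming a Debian-based node. The device path /dev/nvme0n1 is a placeholder for your affected disk, and note that formatting erases all data on it.

```bash
# Install nvme-cli if the `nvme` command is missing (assumes apt).
sudo apt-get install -y nvme-cli

# Low-level format the affected drive to clear the corrupted state.
# WARNING: this destroys all data on the device.
sudo nvme format /dev/nvme0n1

# Reboot so the kernel re-enumerates the drive at its full capacity.
sudo reboot
```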

How to Recreate a Kubernetes ETCD Cluster While Keeping Your Data?

  1. Back up your ETCD data to a safe location.
  2. Open the etcd.env file on one of your ETCD cluster nodes and append the following (see the sketch after this list).
    • ETCD_FORCE_NEW_CLUSTER=true
    • ETCD_INITIAL_CLUSTER=(remove the broken nodes)
  3. Restart the etcd service.
  4. Check whether the etcd service is running.
    • Check whether the broken nodes have been removed from the member list.
  5. Remove the ETCD_FORCE_NEW_CLUSTER flag and restart the etcd service again.
  6. Wait a few minutes and check whether your Kubernetes cluster has recovered.
    • Restarting kubelet is recommended: it will recover broken core K8S services.
    • Restarting your provisioning services is also recommended.
    • Rebooting the nodes will resolve most container-related issues.
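
A minimal sketch of steps 2 through 5, assuming a Kubespray-style layout where the environment file is /etc/etcd.env, etcd runs as a systemd service, and the cert paths match the ones shown in the node-replacement section below; adjust all of these to your cluster.

```bash
# Step 2: force etcd to start a new cluster from this surviving node.
echo 'ETCD_FORCE_NEW_CLUSTER=true' | sudo tee -a /etc/etcd.env
# Also edit ETCD_INITIAL_CLUSTER in /etc/etcd.env to drop the broken nodes.

# Step 3: restart the service.
sudo systemctl restart etcd.service

# Step 4: verify that only the healthy members remain.
sudo etcdctl member list \
  --cacert /etc/etcd/ssl/ca.pem \
  --cert "/etc/etcd/ssl/admin-$(hostname).pem" \
  --key "/etc/etcd/ssl/admin-$(hostname)-key.pem"

# Step 5: remove the flag and restart once more so etcd runs normally.
sudo sed -i '/ETCD_FORCE_NEW_CLUSTER/d' /etc/etcd.env
sudo systemctl restart etcd.service
```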

How to Reinstall the OS on K8S Nodes?

  1. Back up your data to a safe location (see the sketch after this list).
    • ETCD: /opt/etcd/ /etc/etcd /etc/etcd.env
    • Control Plane: /etc/kubernetes /var/lib/kubelet
    • Rook Ceph: /var/lib/rook
  2. Drain the nodes.
  3. Reinstall the OS.
    • Rook Ceph: DO NOT WIPE THE DATA VOLUME.
  4. Restore the data and reinstall K8S.
  5. Uncordon the nodes.
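
A minimal sketch of the backup/drain/uncordon flow, assuming kubectl access from an admin machine; the node name node-1 and the archive name are placeholders.

```bash
# Step 1: archive the directories listed above (example for an ETCD node).
tar -czf etcd-backup.tar.gz /opt/etcd/ /etc/etcd /etc/etcd.env

# Step 2: evict workloads before taking the node down.
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data

# ... reinstall the OS, restore the backup, and rejoin the cluster ...

# Step 5: allow workloads to be scheduled on the node again.
kubectl uncordon node-1
```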

How to Replace Kubernetes ETCD Nodes?

  1. Add an ETCD node to the existing Kubernetes ETCD cluster (a consolidated sketch follows this list).
    • etcdctl member add [new-node-name] --peer-urls=https://[new-node-ip]:2380
    • You may need to pass cert files to authenticate the command, like below:
      • --cacert /etc/etcd/ssl/ca.pem
      • --cert /etc/etcd/ssl/admin-[old-node-k8s-name].pem
      • --key /etc/etcd/ssl/admin-[old-node-k8s-name]-key.pem
  2. Update /etc/kubernetes/manifests/kube-apiserver.yaml.
    • --etcd-servers=https://[new-node-ip]:2379
    • The Kubernetes manifest directory may differ (e.g., with Kubespray).
  3. Restart the kubelet service.
    • systemctl restart kubelet.service
  4. Wait a few seconds and check that the K8S cluster is running.
  5. Remove the old ETCD node from your cluster.
    • etcdctl member remove [old-node-id]
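
A consolidated sketch of the replacement flow above, assuming the Kubespray-style cert layout shown in step 1. The node names, IP, and member ID are placeholders; look up the real ID with etcdctl member list.

```bash
# Step 1: register the new member (run on an existing ETCD node).
etcdctl member add new-node --peer-urls=https://10.0.0.2:2380 \
  --cacert /etc/etcd/ssl/ca.pem \
  --cert /etc/etcd/ssl/admin-old-node.pem \
  --key /etc/etcd/ssl/admin-old-node-key.pem

# Step 2: edit --etcd-servers in
# /etc/kubernetes/manifests/kube-apiserver.yaml to point at the new member,
# then restart kubelet (step 3) so the static pod is recreated.
sudo systemctl restart kubelet.service

# Step 5: once the cluster is healthy, drop the old member by its ID.
etcdctl member remove 8211f1d0f64f3269 \
  --cacert /etc/etcd/ssl/ca.pem \
  --cert /etc/etcd/ssl/admin-old-node.pem \
  --key /etc/etcd/ssl/admin-old-node-key.pem
```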