Quarantine an EKS Node with Docker Containers
January 6, 2022 • 565 words
While performing these steps, please exercise caution and have the right dashboards open that will point to any issues.
Step 0. Annotate the node in question, ip-10-102-11-188.ec2.internal, so that it does not participate in auto-scaling (the cluster autoscaler will not scale it down).
$ kubectl annotate node ip-10-102-11-188.ec2.internal cluster-autoscaler.kubernetes.io/scale-down-disabled=true
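To sanity-check that the annotation stuck (an optional extra, not in the original steps), read it back from the node object:
$ kubectl get node ip-10-102-11-188.ec2.internal -o yaml | grep scale-down-disabled
    cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"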
Step 1. Cordon the affected node.
$ kubectl cordon ip-10-102-11-188.ec2.internal
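To confirm the cordon took effect (again an optional check, not part of the original write-up):
$ kubectl get node ip-10-102-11-188.ec2.internal
The STATUS column should now read Ready,SchedulingDisabled.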
Step 2. Get the list of app pods running on that instance.
$ kubectl get pods -n podinfo --field-selector spec.nodeName=ip-10-102-11-188.ec2.internal --show-labels
NAME READY STATUS RESTARTS AGE LABELS
podinfo-ffb8d6b8d-bbs5r 1/1 Running 0 48m app=podinfo,pod-template-hash=ffb8d6b8d
podinfo-ffb8d6b8d-jfs9b 1/1 Running 0 48m app=podinfo,pod-template-hash=ffb8d6b8d
podinfo-ffb8d6b8d-mpskj 1/1 Running 0 48m app=podinfo,pod-template-hash=ffb8d6b8d
Step 3. Label the pods for quarantine.
$ kubectl label pod podinfo-ffb8d6b8d-bbs5r -n podinfo app=quarantine --overwrite
pod/podinfo-ffb8d6b8d-bbs5r labeled
After you’ve changed the label, you will notice that the ReplicaSet creates a new replacement pod (on another node, since this one is cordoned), while the pod podinfo-ffb8d6b8d-bbs5r stays around because it no longer matches the ReplicaSet’s selector.
NOTE: Do this for every pod you found in step 2 (see the loop sketch below).
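If there are many pods, a small loop can relabel every remaining pod on the node in one go. This is a sketch, not from the original post; it assumes the pods still carry the app=podinfo label:
$ kubectl get pods -n podinfo -l app=podinfo --field-selector spec.nodeName=ip-10-102-11-188.ec2.internal -o name | xargs -n1 -I{} kubectl label {} -n podinfo app=quarantine --overwrite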
Step 4. Seek out any other Deployment-managed pods on that node that should be evicted.
$ kubectl get pods --all-namespaces --field-selector spec.nodeName=ip-10-102-11-188.ec2.internal
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system aws-node-b5qb8 1/1 Running 0 82d
kube-system coredns-59dfd6b59f-hn5sk 1/1 Running 0 56m
kube-system debug-agent-z2r2x 1/1 Running 0 82d
kube-system kube-proxy-g4sr7 1/1 Running 0 82d
podinfo podinfo-ffb8d6b8d-bbs5r 1/1 Running 0 50m
podinfo podinfo-ffb8d6b8d-jfs9b 1/1 Running 0 50m
podinfo podinfo-ffb8d6b8d-mpskj 1/1 Running 0 50m
For example, in the list above the coredns-59dfd6b59f-hn5sk pod belongs to the coredns deployment, so evict it. (Pods managed by DaemonSets, such as aws-node and kube-proxy, are left alone.)
$ kubectl delete pod -n kube-system coredns-59dfd6b59f-hn5sk
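If you are unsure which pods are Deployment-managed and which belong to DaemonSets, you can print each pod's owner kind; this is an extra check, not in the original post:
$ kubectl get pods --all-namespaces --field-selector spec.nodeName=ip-10-102-11-188.ec2.internal -o 'custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind'
Pods whose owner is a ReplicaSet belong to a Deployment and can be deleted as above; pods owned by a DaemonSet would simply be recreated on the same node, so leave them be.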
Step 5. SSH into the node ip-10-102-11-188.ec2.internal to stop and disable the kubelet.
$ ssh ec2-user@ip-10-102-11-188.ec2.internal
Last login: Tue Dec 21 09:27:28 2021 from ip-10-102-11-188.ec2.internal
__| __|_ )
_| ( / Amazon Linux 2 AMI
___|\___|___|
https://aws.amazon.com/amazon-linux-2/
14 package(s) needed for security, out of 44 available
Run "sudo yum update" to apply all updates.
[ec2-user@ip-10-102-11-188 ~]$ sudo systemctl stop kubelet
[ec2-user@ip-10-102-11-188 ~]$ sudo systemctl disable kubelet
Removed symlink /etc/systemd/system/multi-user.target.wants/kubelet.service.
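As a quick sanity check (not part of the original steps), confirm the kubelet is no longer running before logging out:
[ec2-user@ip-10-102-11-188 ~]$ sudo systemctl is-active kubelet
inactive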
The operation above should send the node into a NotReady state.
Step 6. Once the node ip-10-102-11-188.ec2.internal goes into a NotReady state, delete the node.
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-10-102-11-188.ec2.internal NotReady,SchedulingDisabled <none> 154m v1.17.17-eks-ac51f2
$ kubectl delete node ip-10-102-11-188.ec2.internal
node "ip-10-102-11-188.ec2.internal" deleted
This ensures the node is removed and is no longer part of the K8s cluster. The steps above leave the instance running with all of its Docker containers intact, but no longer attached to the K8s cluster. This ends our work with kubectl; step 7 is performed on the node itself, and the remaining steps on the AWS Management Console.
Step 7. At this point, while still logged in to the node, you should pause all the Docker containers. But before that, you might want to check for any spurious connections from the running containers; the one-liner below grabs each container's PID and uses nsenter to list established TCP connections from inside that container's network namespace.
# docker ps -q | xargs -n1 -I{} -P0 bash -c "docker inspect -f '{{.State.Pid}}' {}; exit 0;" | xargs -n1 -I{} -P0 bash -c 'nsenter -t {} -n ss -p -a -t4 state established; echo; exit 0'
And then pause the running containers.
$ docker ps -q | xargs -n1 -I{} -P0 bash -c 'docker pause {}; exit 0'
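To confirm everything is paused, or to resume the containers later during the investigation, the following should work (an addition to the original steps):
$ docker ps --filter status=paused
$ docker ps -q --filter status=paused | xargs -n1 -I{} -P0 bash -c 'docker unpause {}; exit 0'
Paused containers show (Paused) in the STATUS column of docker ps.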
Step 8. Head over to the EC2 > ASG dashboard on the AWS Management Console, and detach the instance from the ASG.
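If you prefer the AWS CLI over the console, the detach can be done from the terminal too; this is a sketch where the instance ID and ASG name are placeholders for your own values:
$ aws autoscaling detach-instances --instance-ids i-0abc123def456789a --auto-scaling-group-name my-eks-node-group-asg --should-decrement-desired-capacity
Use --no-should-decrement-desired-capacity instead if you want the ASG to launch a replacement for the detached instance.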
Step 9. Go to the instance details dashboard, remove the existing security groups, and add the QUARANTINE security group.
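The same change can be scripted with the AWS CLI; again a sketch, with the instance ID and the QUARANTINE security group ID as placeholders:
$ aws ec2 modify-instance-attribute --instance-id i-0abc123def456789a --groups sg-0123456789abcdef0
Note that --groups replaces the instance's entire security group list, so pass only the QUARANTINE group's ID.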