Autoscaling XNAT on Kubernetes with EKS

There are three types of autoscaling that Kubernetes offers:

  1. Horizontal Pod Autoscaling
    Horizontal Pod Autoscaling (HPA) scales the number of replica pods for an application up or down, based on observed usage measured against the resource requests specified in a values file.

  2. Vertical Pod Autoscaling
    Vertical Pod Autoscaling (VPA) increases or decreases the resources allocated to each pod once usage reaches a certain percentage, to make the best use of your resources. After some testing we consider this approach legacy: HPA is preferred and is also built into the Helm chart, so we won’t be utilising this technology.

  3. Cluster-autoscaling
    Cluster-autoscaling is where the Kubernetes cluster itself spins up or down new Nodes (think EC2 instances in this case) to handle capacity.

You can’t use HPA and VPA together (on the same CPU and memory metrics), so we will use HPA and Cluster-Autoscaling.



Prerequisites

  • Running Kubernetes Cluster and XNAT Helm Chart AIS Deployment
  • AWS Application Load Balancer (ALB) as an Ingress Controller with some specific annotations
  • Resources (requests and limits) need to be specified in your values file
  • Metrics Server
  • Cluster-Autoscaler




You can find more information on applying the ALB implementation for the AIS Helm Chart deployment in the ALB-Ingress-Controller document in this repo, so it will not be covered here, except to say that some specific annotations are required for autoscaling to work effectively.

Specific annotations required:

alb.ingress.kubernetes.io/target-group-attributes: "stickiness.enabled=true,stickiness.lb_cookie.duration_seconds=1800,load_balancing.algorithm.type=least_outstanding_requests"
alb.ingress.kubernetes.io/target-type: ip

Let’s break down and explain each section.

Change the stickiness of the Load Balancer:
It is important to set a stickiness time on the load balancer. This pins your session to the same pod and retains your session information. Without stickiness, the database records that you have logged in, but the load balancer can alternate which pod each request goes to. Session details are kept on each pod, so the new pod thinks you aren’t logged in and you are constantly logged out. Setting the stickiness time reasonably high (say, 30 minutes) gets around this.

stickiness.enabled=true,stickiness.lb_cookie.duration_seconds=1800

Change the Load Balancing Algorithm for best performance:

load_balancing.algorithm.type=least_outstanding_requests

Change the Target type:
Not sure why, but if target-type is set to instance rather than ip, the stickiness rules are disregarded.

alb.ingress.kubernetes.io/target-type: ip
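Putting the pieces together, the relevant part of the Ingress might look something like this. This is a sketch only; the resource names, scheme and any other annotations are illustrative and will differ in your deployment:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: xnat-xnat-web        # illustrative name
  namespace: xnat
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing   # assumption - internal is also valid
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/target-group-attributes: "stickiness.enabled=true,stickiness.lb_cookie.duration_seconds=1800,load_balancing.algorithm.type=least_outstanding_requests"
```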




Resources (requests and limits) need to be specified in your values file

In order for HPA and Cluster-autoscaling to work, you need to specify resources (requests and limits) in the AIS Helm chart values file, or the autoscaler won’t know when to scale.
This makes sense: how can it know when you are running out of resources and should scale up if it doesn’t know what your resources are to start with?

In your values file add the following lines below the xnat-web section (please adjust the CPU and memory to fit with your environment):

  resources:
    limits:
      cpu: 1000m
      memory: 3000Mi
    requests:
      cpu: 1000m
      memory: 3000Mi

You can read more about what this means here:

https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/

From my research with HPA, I discovered a few important facts.

  1. Horizontal Pod Autoscaler doesn’t care about limits; it bases autoscaling on requests. Requests are meant to be the minimum needed to safely run a pod and limits are the maximum. However, limits are completely irrelevant for HPA as it ignores them altogether, so I specify the same resources for requests and limits. See this issue for more details:

https://github.com/kubernetes/kubernetes/issues/72811

  2. XNAT is extremely memory hungry: any pod will use approximately 750 MB of RAM without doing anything. This is important because if the requests are set below that, pods will repeatedly scale up and then down, with no consistency for the user experience. This will play havoc with user sessions and annoy everyone a lot. Some applications, specifically XNAT Desktop, can use a LOT of memory for large uploads (I have seen 12 GB of RAM used on an instance), so try to specify as much RAM as you can for the instances you have. In the example above I have specified 3000 MB of RAM and 1 vCPU, and the worker node instance has 4 vCPUs and 4 GB of RAM. You would obviously use larger instances if you can. You will have to do some testing to work out the best pod-to-instance ratio for your environment.
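The scaling rule behind these observations can be sketched in a few lines. This is only an illustration of the HPA formula (desiredReplicas = ceil(currentReplicas × currentUtilisation / targetUtilisation)); the function name is made up, and note that utilisation is always measured against requests, never limits:

```python
from math import ceil

def desired_replicas(current_replicas, pod_usage_m, pod_request_m, target_pct):
    """Sketch of the HPA scaling rule: utilisation is measured against
    the pod's *request* (millicores here); limits are ignored."""
    current_pct = 100 * pod_usage_m / pod_request_m
    return ceil(current_replicas * current_pct / target_pct)

# 2 pods requesting 1000m CPU each, averaging 1600m actual usage,
# with an 80% utilisation target:
print(desired_replicas(2, 1600, 1000, 80))  # -> 4
```

This also shows why under-sized requests cause flapping: a request far below real idle usage makes the measured utilisation permanently high, so the HPA keeps adding pods it doesn’t need.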




Metrics Server

Download the latest Kubernetes Metrics server yaml file. We will need to edit it before applying the configuration or HPA won’t be able to see what resources are being used and none of this will work.

wget https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Add the following line:

        - --kubelet-insecure-tls

to here:

    spec:
      containers:
      - args:

Completed section should look like this:

    spec:
      containers:
      - args:
        - --kubelet-insecure-tls
        - --kubelet-preferred-address-types=InternalIP,ExternalIP
        - --cert-dir=/tmp
        - --secure-port=443
        - --kubelet-use-node-status-port
        - --metric-resolution=15s

Now apply it to your Cluster:

k -nkube-system apply -f components.yaml

Congratulations - you now have an up and running Metrics server.
You can read more about Metrics Server here:

https://github.com/kubernetes-sigs/metrics-server
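Before moving on, it is worth confirming the metrics API is actually registered and serving. These are standard kubectl checks against your cluster:

```shell
# The metrics APIService should report Available=True
kubectl get apiservice v1beta1.metrics.k8s.io

# If it does, node and pod resource usage should now be visible
kubectl top nodes
```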




Cluster-Autoscaler

There are quite a few ways to run the Cluster-autoscaler: node groups each deployed in a single Availability Zone (no AZ redundancy), single-zone node groups deployed across multiple Availability Zones, or a single node group spanning multiple Availability Zones. In this example we will be deploying the autoscaler across multiple Availability Zones (AZs).

In order to do this, a change needs to be made to the StorageClass configuration used.

Delete whatever StorageClasses you have and then recreate them with the VolumeBindingMode changed. At a minimum you will need to change the GP2 / EBS StorageClass VolumeBindingMode, but if you are using a persistent volume for archive / prearchive, that will also need to be updated.

Change this:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/aws-ebs
volumeBindingMode: Immediate
parameters:
  fsType: ext4
  type: gp2

to this:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/aws-ebs
volumeBindingMode: WaitForFirstConsumer
parameters:
  fsType: ext4
  type: gp2

Then run the following commands (assuming the file above is called storageclass.yaml):

kubectl delete sc --all
kubectl apply -f storageclass.yaml

This stops pods trying to bind to volumes in different AZs.

You can read more about this here:
https://aws.amazon.com/blogs/containers/amazon-eks-cluster-multi-zone-auto-scaling-groups/

Relevant section:
If you need to run a single ASG spanning multiple AZs and still need to use EBS volumes you may want to change the default VolumeBindingMode to WaitForFirstConsumer as described in the documentation here. Changing this setting “will delay the binding and provisioning of a PersistentVolume until a pod using the PersistentVolumeClaim is created.” This will allow a PVC to be created in the same AZ as a pod that consumes it.

If a pod is descheduled, deleted and recreated, or an instance where the pod was running is terminated then WaitForFirstConsumer won’t help because it only applies to the first pod that consumes a volume. When a pod reuses an existing EBS volume there is still a chance that the pod will be scheduled in an AZ where the EBS volume doesn’t exist.

You can refer to AWS documentation for how to install the EKS Cluster-autoscaler:

https://docs.aws.amazon.com/eks/latest/userguide/cluster-autoscaler.html
This is specific to your deployment (IAM roles, cluster names, etc.), so it will not be covered here.





Configure Horizontal Pod Autoscaler

Add the following lines into your values file under the xnat-web section:

  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 100
    targetCPUUtilizationPercentage: 80
    targetMemoryUtilizationPercentage: 80

Tailor it to your own environment. This will create 2 replicas (pods) at start-up, up to a limit of 100 replicas, and will scale pods up when 80% CPU or 80% memory utilisation is reached. Read more about that again here:
https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/
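For reference, the Helm chart renders those values into a HorizontalPodAutoscaler object roughly like the following. This is a sketch: the exact apiVersion and object name depend on your chart and cluster version:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: xnat-xnat-web
  namespace: xnat
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: xnat-xnat-web
  minReplicas: 2
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```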

These are the relevant parts of my environment when running the get command:

k -nxnat get horizontalpodautoscaler.autoscaling/xnat-xnat-web
NAME            REFERENCE                   TARGETS           MINPODS   MAXPODS   REPLICAS   AGE
xnat-xnat-web   StatefulSet/xnat-xnat-web   34%/80%, 0%/80%   2         100       2          3h29m

As you can see, 34% of memory and 0% of CPU is used. Here is an example of the get command for pods: no restarts and running nicely.

k -nxnat get pods
NAME                  READY   STATUS    RESTARTS   AGE
pod/xnat-xnat-web-0   1/1     Running   0          3h27m
pod/xnat-xnat-web-1   1/1     Running   0          3h23m




Troubleshooting

Check that the Metrics server is working (assuming the xnat namespace) and see memory and CPU usage:

kubectl top pods -nxnat
kubectl top nodes

Check Cluster-Autoscaler logs:

kubectl logs -f deployment/cluster-autoscaler -n kube-system

Check the HPA:

kubectl -nxnat describe horizontalpodautoscaler.autoscaling/xnat-xnat-web