Problem Introduction:

Kubernetes does a good job of self-healing and recovering applications from failure. New pods come up in place of pods that crash. One reason pods fail in a Kubernetes cluster is that the memory they consume exceeds the configured limit. In this case, the pods are OOM killed (out of memory) and there is a temporary outage before the new pods come up.

Prometheus and Alertmanager can be used to monitor the memory usage of pods: when a pod's usage gets close to its limit, an alert can be triggered. Usually these alerts are emails or Slack messages to engineers, who are then expected to fix the pod's limits or take other appropriate action.

If the action is just increasing the pod's memory limit, it can be automated by using the webhook alerts feature of Alertmanager and the webhook-triggered pipelines of Spinnaker.

Solution Introduction:

In this blog post, https://mallozup.github.io/posts/self-healing-systems-with-prometheus/, Dario Maiocchi shows how Alertmanager's webhooks can be used to trigger an external application.

More information about the Prometheus Alertmanager webhook configuration is in the documentation at https://prometheus.io/docs/alerting/latest/configuration/#webhook_config

The documentation for triggering a Spinnaker pipeline from external sources is at https://spinnaker.io/docs/guides/user/pipeline/triggers/webhooks/

So one can trigger a Spinnaker pipeline from Alertmanager and patch the deployment's manifest using the built-in patch stage described at https://spinnaker.io/docs/guides/user/kubernetes-v2/patch-manifest/

Prerequisites:

Spinnaker installed, and Prometheus and Alertmanager installed to monitor the pods running in the cluster.

Details:

A deployment whose pod consumes a constant amount of memory (150Mi) and has a configured limit (200Mi) is used as the candidate to monitor.

The YAML for the pod can be found at https://kubernetes.io/docs/tasks/configure-pod-container/assign-memory-resource/

Configure the Prometheus configmap for alerts with the following rule to monitor the pod/deployment created above:

				
    - name: feedback-container-menmory-too-high
      rules:
      - alert: feedback-container-menmory-too-high
        annotations:
          description: container memory-demo-ctr in namespace jobs in isdprod is taking too much memory
            may be evicted soon
          summary: memory-demo-ctr in namespace jobs in isdprod is taking too much memory
        expr: (sum(container_memory_max_usage_bytes{container="memory-demo-ctr"}) by (instance, area) / sum(container_spec_memory_limit_bytes{container="memory-demo-ctr"}) by (instance, area)) > .75
        for: 8m
        labels:
          severity: critical

Then configure the receiver in the Alertmanager configmap using the code below:

				
    - name: feedback-receiver
      webhook_configs:
      - url: "https://<spinnaker-gate-url>/webhooks/webhook/alerthandler"
        http_config:
          basic_auth:
            username: "<username>"
            password: "<password>"

Finally, route the alert to the receiver in the Alertmanager configmap:

				
      - match:
          alertname: feedback-container-menmory-too-high
        repeat_interval: 4m
        group_interval: 4m
        receiver: feedback-receiver

Now Prometheus and Alertmanager are ready. When the ratio of maximum memory usage to the memory limit stays above 0.75 for 8 minutes, an alert is sent to the Spinnaker webhook endpoint.
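
The firing condition can be sketched in a few lines of Python. This is only an illustration of the rule's logic (ratio above the threshold, sustained for the "for" duration), not how Prometheus evaluates rules internally; the function name should_fire and the sample format are assumptions made for this example.

```python
MI = 1024 * 1024          # bytes in one mebibyte
THRESHOLD = 0.75          # same ratio as the alert expression
FOR_SECONDS = 8 * 60      # same duration as "for: 8m"


def should_fire(samples, limit_bytes, threshold=THRESHOLD, for_seconds=FOR_SECONDS):
    """Return True once usage/limit has stayed above threshold for for_seconds.

    samples is a hypothetical list of (timestamp_seconds, usage_bytes)
    pairs, sorted by timestamp, standing in for scraped metric values.
    """
    pending_since = None  # time at which the condition first became true
    for timestamp, usage_bytes in samples:
        if usage_bytes / limit_bytes > threshold:
            if pending_since is None:
                pending_since = timestamp
            if timestamp - pending_since >= for_seconds:
                return True
        else:
            pending_since = None  # condition broke, so the timer starts over
    return False


limit = 200 * MI
steady = [(t, 160 * MI) for t in range(0, 481, 60)]  # ratio 0.80 for 8 minutes
print(should_fire(steady, limit))
```

One detail the sketch makes visible: a usage of exactly 150Mi against a 200Mi limit gives a ratio of exactly 0.75, which does not satisfy the strict ">" comparison. The alert only fires if the measured usage is at least slightly above 150Mi, which may well be the case once process overhead is counted.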

To get Spinnaker ready to receive the webhook, create a pipeline and, in the configuration stage, choose a webhook trigger that uses the same endpoint given to Alertmanager (alerthandler, in this case).
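
Before relying on a real alert, the trigger can be exercised by hand. The sketch below builds (but does not send) a POST request shaped like an Alertmanager webhook notification, using only the Python standard library. The gate host spinnaker-gate.example.com, the credentials, and the helper names are placeholders invented for this example, and the payload carries only a subset of the fields Alertmanager sends.

```python
import base64
import json
import urllib.request


def build_alert_payload(alertname, status="firing"):
    """Build a reduced payload in the shape of an Alertmanager webhook notification."""
    labels = {"alertname": alertname, "severity": "critical"}
    return {
        "version": "4",
        "status": status,
        "receiver": "feedback-receiver",
        "groupLabels": {"alertname": alertname},
        "commonLabels": labels,
        "alerts": [{"status": status, "labels": labels}],
    }


def build_webhook_request(gate_url, source, username, password, payload):
    """Prepare a POST to <gate_url>/webhooks/webhook/<source> with basic auth."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{gate_url.rstrip('/')}/webhooks/webhook/{source}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# Placeholder host and credentials: replace with real values before sending.
request = build_webhook_request(
    "https://spinnaker-gate.example.com",
    "alerthandler",
    "user",
    "secret",
    build_alert_payload("feedback-container-menmory-too-high"),
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Alertmanager notifications carry a status of either "firing" or "resolved", so it may be worth restricting the Spinnaker trigger (for example with its payload constraints) to firing alerts only, so that a resolved notification does not raise the limit a second time.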

Then create a patch stage to increase the memory limit to 225Mi. The YAML needed can be as below:

				
spec:
  template:
    spec:
      containers:
        - name: memory-demo-ctr
          resources:
            limits:
              memory: 225Mi

Now watch as Alertmanager triggers the pipeline, which increases the memory limit and stabilizes the pod.
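
To see what the patch does to the manifest, the same change can be reproduced in memory. The sketch below builds the patch body and applies a simplified, hand-written merge to a trimmed-down manifest; it does not call Kubernetes or Spinnaker. The helper names, the "<image>" placeholder, and the 100Mi request are assumptions for illustration, and the real merge performed by Kubernetes handles many more cases.

```python
import copy
import json


def build_memory_patch(container_name, new_limit):
    """Return the patch body used in the patch stage as a Python dict."""
    container = {"name": container_name, "resources": {"limits": {"memory": new_limit}}}
    return {"spec": {"template": {"spec": {"containers": [container]}}}}


def same_name(a, b):
    """True when both items are dicts that carry the same 'name' value."""
    return (isinstance(a, dict) and isinstance(b, dict)
            and "name" in a and a.get("name") == b.get("name"))


def merge(target, patch):
    """Simplified merge: dicts merge recursively, list items merge by name."""
    if isinstance(target, dict) and isinstance(patch, dict):
        merged = dict(target)
        for key, value in patch.items():
            merged[key] = merge(target[key], value) if key in target else copy.deepcopy(value)
        return merged
    if isinstance(target, list) and isinstance(patch, list):
        merged = [copy.deepcopy(item) for item in target]
        for item in patch:
            for index, existing in enumerate(merged):
                if same_name(existing, item):
                    merged[index] = merge(existing, item)
                    break
            else:
                merged.append(copy.deepcopy(item))  # no match: add as new item
        return merged
    return copy.deepcopy(patch)  # scalars and mismatched types are replaced


# Hypothetical, trimmed-down deployment manifest for the demo container.
deployment = {"spec": {"template": {"spec": {"containers": [{
    "name": "memory-demo-ctr",
    "image": "<image>",
    "resources": {"requests": {"memory": "100Mi"}, "limits": {"memory": "200Mi"}},
}]}}}}

patched = merge(deployment, build_memory_patch("memory-demo-ctr", "225Mi"))
print(json.dumps(patched["spec"]["template"]["spec"]["containers"][0]["resources"]))
```

Only the memory limit changes; the container's image and requests are left as they were, which is why a small patch is enough and the full manifest does not have to be redeployed.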

Conclusion:

A simple proof of concept is presented here that shows how webhooks can be used to connect Alertmanager and Spinnaker to stabilize pod memory usage and “heal” the pod before any out-of-memory errors occur and pods are killed or evicted.

Part Two of this series will be published soon and will cover both memory and CPU tuning, as well as multiple pod replicas and using Vertical Pod Autoscaler recommendations to tune the requests and limits of pod resources.

Future improvements:

The length of time to monitor pod memory before alerting varies from application to application and has to be tuned accordingly. The amount of extra memory to add to the pod's memory limit is not simple to calculate and also has to be tuned to the application's requirements.

The memory available on the Kubernetes node also has to be included in the equation, and the number of pod replicas will affect the tuning as well. Both need to be accounted for before this process is applied in practice.
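
One possible way to pick the next limit, sketched under loudly stated assumptions: grow the current limit by a fixed factor and never exceed a ceiling derived from what the node can spare. The 12.5% step (which happens to reproduce the 200Mi to 225Mi change used above), the 512Mi ceiling, and the function names are illustrative choices, not recommendations.

```python
import math


def parse_mi(quantity):
    """Parse a quantity such as '200Mi' into an integer number of MiB.

    Only the 'Mi' suffix is handled here; Kubernetes accepts many more.
    """
    if not quantity.endswith("Mi"):
        raise ValueError(f"unsupported quantity: {quantity!r}")
    return int(quantity[:-2])


def next_limit(current, factor=1.125, ceiling="512Mi"):
    """Return the next memory limit, or None if the ceiling is already reached."""
    current_mi = parse_mi(current)
    ceiling_mi = parse_mi(ceiling)
    if current_mi >= ceiling_mi:
        return None  # stop auto-increasing and leave the decision to a human
    proposed = math.ceil(current_mi * factor)
    return f"{min(proposed, ceiling_mi)}Mi"


print(next_limit("200Mi"))
```

Returning None at the ceiling is a deliberate choice in this sketch: a pod that keeps hitting its limit after several increases probably needs investigation rather than more memory.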

Acknowledgements:

I thank Sharief Shaik and Srinivas Kambhampati for their inputs.

Gopal Jayanthi

Gopal Jayanthi has 15+ years of experience in the software field in development, configuration management, build/release, and DevOps. He has worked at Cisco, AT&T (SBC), and IBM in the USA, and at Accenture, Bank of America, and Tech Mahindra in India. His expertise includes Kubernetes, Docker, Jenkins, SDLC management, version control, change management, and release management.
