
Problem Introduction:

In Part 1, we saw how to automate increasing the memory/CPU requests and limits of pods using the webhook alerts feature of Alertmanager and the webhook-triggered pipelines of Spinnaker.

In Part 2, we showed how to use the Vertical Pod Autoscaler (VPA) to get recommendations for pod memory and CPU requests as well as limits.

In this Part 3, we discuss using Prometheus metrics to tune pod memory and CPU. Parts 1 and 2 used pods with constant memory and CPU usage; in Part 3 we also use pods with more realistic usage patterns, tuning the resources of Spinnaker's clouddriver-caching pod. Pods with multiple containers and deployments with multiple replicas are tested as well.

Solution Introduction:

Prometheus provides the memory and CPU usage of a given container in a pod. It also provides functions such as avg_over_time and max_over_time to compute the average and maximum values of memory and CPU over a time window.

These values can then be used to tune the pod resources using Spinnaker pipelines with webhook, run job, and patch stages.

Prerequisites:

Spinnaker installed in the cluster, with Prometheus and Alertmanager installed to monitor the pods.

Details:

We used the same deployments as in Part 1; in addition, the clouddriver-caching pod of Spinnaker was used.

Since the pipeline must fetch metrics from the Prometheus HTTP API, a run job stage was used to query Prometheus with curl and extract the values from the JSON response with jq.

An example curl command is shown below.

    curl -g "http://<url>/api/v1/query?query=<query>" | jq
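To show how jq pulls the numeric value out of the response, here is a minimal sketch against a hand-written sample in the standard Prometheus instant-query JSON shape; the container name and sample value are illustrative, and in the run job stage the JSON would come from curl instead.

```shell
# Sample response in the shape returned by GET /api/v1/query.
# In the pipeline this string would be the output of the curl command above.
response='{"status":"success","data":{"resultType":"vector","result":[{"metric":{"container":"clouddriver-caching"},"value":[1700000000,"268435456"]}]}}'

# Each result's "value" is [timestamp, value-as-string];
# take the first result and its second element.
mem_bytes=$(echo "$response" | jq -r '.data.result[0].value[1]')
echo "$mem_bytes"   # 268435456
```

The same `.data.result[0].value[1]` path works for every query listed below, since instant queries all return the same vector shape.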

The following queries were used to calculate the resource specifications for the pod.

    memory request: avg_over_time(container_memory_working_set_bytes{container=\"$container\"}["$period"])
    memory limit:   max_over_time(container_memory_max_usage_bytes{container=\"$container\"}["$period"])
    cpu request:    avg_over_time(rate(container_cpu_usage_seconds_total{container=\"$container\"}[1m])["$period":1m])
    cpu limit:      max_over_time(rate(container_cpu_usage_seconds_total{container=\"$container\"}[1m])["$period":1m])

Here a period of 2h, i.e. 2 hours, was used; a different value could be chosen depending on the period for which data is available. The limits were also factored using a value of 0.8, i.e. the limit values returned by the above queries were divided by 0.8 to provide a conservative cushion.
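Assembling a query from the pipeline parameters and applying the 0.8 factor can be sketched in shell as follows; the container name, period, and observed peak value are illustrative, not taken from the actual pipeline runs.

```shell
# Assemble one of the queries above from pipeline parameters
# (container name and period are example values).
container="clouddriver-caching"
period="2h"
mem_limit_q="max_over_time(container_memory_max_usage_bytes{container=\"$container\"}[$period])"
echo "$mem_limit_q"

# Apply the 0.8 factor: dividing the observed peak by 0.8 sets the
# limit roughly 25% above the highest value seen in the window.
observed_max=800   # example: peak memory in MiB returned by the query
limit=$(awk -v v="$observed_max" 'BEGIN { printf "%.0f", v / 0.8 }')
echo "$limit"   # 1000
```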

Multi-container pods and multi-replica deployments were tested using Spinnaker pipelines with a webhook stage to trigger the pipeline, a run job stage to query Prometheus for the memory and CPU requests and limits, and finally a patch stage to patch the deployment.
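For the final step, a sketch of the kind of payload the patch stage applies is shown below; the container name and resource values are illustrative, assumed for the example rather than taken from the article's pipelines.

```shell
# Build a patch for one container's resources; because containers are
# matched by name, this works for multi-container pods too.
patch=$(cat <<'EOF'
{
  "spec": {
    "template": {
      "spec": {
        "containers": [
          {
            "name": "clouddriver-caching",
            "resources": {
              "requests": {"memory": "512Mi", "cpu": "500m"},
              "limits":   {"memory": "1000Mi", "cpu": "1000m"}
            }
          }
        ]
      }
    }
  }
}
EOF
)
# Validate the payload is well-formed and read back one field.
echo "$patch" | jq -r '.spec.template.spec.containers[0].resources.limits.memory'
```

Outside Spinnaker, an equivalent would be `kubectl patch deployment <name> --patch "$patch"`; updating the pod template triggers a rollout across all replicas, which is why the approach extends to multi-replica deployments.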

Conclusion:

Four cases were tested: high CPU, low CPU, high memory, and low memory. In all cases the pipelines successfully adjusted the memory and CPU requests and limits so that the alerts were addressed.

This reduced wastage of computing resources (in the case of pods with low resource usage) and prevented pod crashes due to out-of-memory errors as well as CPU throttling. When a real-life pod, clouddriver-caching, was used, the pipelines also managed to stabilize its resource usage. The method worked even when there were multiple containers per pod or multiple replicas per deployment.

Thus it was demonstrated that Spinnaker pipelines can be used to tune Spinnaker's own pods to appropriate compute resource usage.

Acknowledgements:

I thank Sharief Shaik and Srinivas Kambhampati for their inputs.

Gopal Jayanthi

Gopal Jayanthi has 15+ years of experience in the software field in development, configuration management, build/release, and DevOps. He has worked at Cisco, AT&T (SBC), and IBM in the USA, and at Accenture, Bank of America, and Tech Mahindra in India. His expertise includes Kubernetes, Docker, Jenkins, SDLC management, version control, change management, and release management.
