Useful kubectl cheatsheet: https://medium.com/faun/kubectl-commands-cheatsheet-43ce8f13adfb
Tips & tricks: https://hackernoon.com/top-10-kubernetes-tips-and-tricks-27528c2d0222
Configure bash completion: echo "source <(kubectl completion bash)" >> ~/.bashrc
List all pods with their QoS class:
kubectl get pods --all-namespaces -o custom-columns=NAME:.metadata.name,NAMESPACE:.metadata.namespace,QOS-CLASS:.status.qosClass
etcd example hardware configurations: https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/hardware.md#example-hardware-configurations
etcd CPU recommendations: https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/hardware.md#cpus
Check agent status: https://docs.datadoghq.com/agent/guide/agent-commands/?tab=agentv6#agent-status-and-information
When datadog agent cannot detect kubelet and the dashboard is empty: DataDog/integrations-core#2582
More generic agent troubleshooting: https://docs.datadoghq.com/agent/troubleshooting/
Manage datadog monitors using code: https://github.com/trueaccord/DogPush
Resource requests and limits:
https://www.replex.io/blog/kubernetes-in-production-readiness-checklist-and-best-practices-for-resource-management
https://www.magalix.com/blog/kubernetes-resource-requests-and-limits-101
https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-resource-requests-and-limits
Pod Disruption Budgets, number 6 here: https://hackernoon.com/top-10-kubernetes-tips-and-tricks-27528c2d0222
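A minimal sketch of creating a Pod Disruption Budget from the command line. The wrapper name create_pdb and the example values (web-pdb, app=web, 2) are hypothetical; the underlying command is kubectl create poddisruptionbudget with its --selector and --min-available flags.

```shell
# Create a PodDisruptionBudget that keeps at least MIN pods
# matching SELECTOR available during voluntary disruptions.
# Usage (hypothetical values): create_pdb web-pdb app=web 2
create_pdb() {
  kubectl create poddisruptionbudget "$1" --selector="$2" --min-available="$3"
}
```

Existing budgets can be listed afterwards with kubectl get pdb.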
As explained in the section about container metrics, some statistics reported by Docker should also be monitored, as they provide deeper (and more accurate) insights. The CPU throttling metric is a great example: it represents the number of times a container hit its specified CPU limit.
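A rough way to look at throttling by hand, assuming the container's cgroup exposes a cpu.stat file with nr_periods and nr_throttled counters (typically /sys/fs/cgroup/cpu/cpu.stat on cgroup v1 or /sys/fs/cgroup/cpu.stat on cgroup v2; the exact path depends on the host setup). The helper name throttle_pct is hypothetical.

```shell
# Print the percentage of CPU scheduling periods in which the cgroup
# was throttled, based on the nr_periods and nr_throttled counters.
# Usage (path is an assumption): throttle_pct /sys/fs/cgroup/cpu.stat
throttle_pct() {
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0) printf "%.1f\n", 100 * throttled / periods
      else print "0.0"
    }
  ' "$1"
}
```

A value that keeps growing between readings suggests the CPU limit is too low for the workload.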
About nodes and resources: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/
How to change eviction values: https://medium.com/faun/kubelet-pod-the-node-was-low-on-resource-diskpressure-384f590892f5
Kubernetes metrics in production: https://www.replex.io/blog/kubernetes-in-production-the-ultimate-guide-to-monitoring-resource-metrics
Draining nodes: https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/
Taints and tolerations: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
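A short sketch of the usual drain and taint commands, wrapped in helper functions. The function names and the example taint dedicated=batch:NoSchedule are hypothetical. The --delete-emptydir-data flag assumes a reasonably recent kubectl (older versions call it --delete-local-data).

```shell
# Evict pods from a node before maintenance (drain also cordons the node).
# Usage: drain_node NODE
drain_node() {
  kubectl drain "$1" --ignore-daemonsets --delete-emptydir-data
}

# Allow scheduling on the node again once maintenance is done.
# Usage: restore_node NODE
restore_node() {
  kubectl uncordon "$1"
}

# Add a taint, e.g. taint_node NODE dedicated=batch:NoSchedule
taint_node() {
  kubectl taint nodes "$1" "$2"
}

# Remove the same taint again (a trailing "-" removes it).
untaint_node() {
  kubectl taint nodes "$1" "$2-"
}
```

Pods that must still run on a tainted node need a matching toleration in their spec.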