Kubernetes spread like wildfire in 2017. No kidding! Here are some numbers from Scott's post:
“For companies with more than 5000 employees, Kubernetes is used by 48% and the primary orchestration tool for 33%.”
“79% of the sample chose Docker as their primary container technology.”
Riding the wave of Kubernetes, 2017 was a particularly fun year for Infrastructure/DevOps folks. Finally, we had some cool tools to play with after years of darkness. We started thinking about what we could do with such a paradigm shift. We tried to optimize developer velocity with Jenkins and Helm charts, with many more experiments to come :D
One thing I hold dear in my heart is democratizing Kubernetes for the Data team. It's a well-known fact that today's Data teams have to master an array of bleeding-edge technologies in order to stay productive and competitive. A few years ago MapReduce was, and still is, widely used, and its infrastructure requirements are not a walk in the park even by today's standards. Fast forward to 2018 and we see the same thing happening all over again with Deep Learning. To me, a Data team should not be distracted by infrastructure challenges and having to reinvent the wheel. A company's Systems team should work side by side with them.
Inspired by this talk by Lachlan Evenson at KubeCon 2017 about how he helped his Data team with Kubernetes, I decided to run a little experiment setting up a GPU-ready cluster.
First things first, let's create a k8s cluster with GPU-accelerated nodes. In this example we will use the AWS p2.xlarge
EC2 instance because it's the cheapest available option for this PoC (Proof of Concept). If you are trying this out yourself, I'd suggest using this instance type to avoid a hefty bill.
Again, in this example I use kops to create a k8s cluster on AWS with 1 master node and 2 GPU nodes.
$ kops create cluster \
--name steven.buffer-k8s.com \
--cloud aws \
--master-size t2.medium \
--master-zones=us-east-1b \
--node-size p2.xlarge \
--zones=us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1e,us-east-1f \
--node-count=2 \
--kubernetes-version=1.8.6 \
--vpc=vpc-1234567a \
--network-cidr=10.0.0.0/16 \
--networking=flannel \
--authorization=RBAC \
--ssh-public-key="~/.ssh/kube_aws_rsa.pub" \
--yes
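It can take a few minutes for everything to come up. A quick way to check, assuming kubectl is already pointed at the new cluster:

```shell
# Validate the cluster and confirm the two p2.xlarge nodes are Ready.
kops validate cluster --name steven.buffer-k8s.com
kubectl get nodes -o wide
```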
After kops creates the cluster successfully, you will have some GPU-accelerated nodes in k8s. But that doesn't mean you can access the GPU resources from Kubernetes yet. To make it work we have to jump through a few hoops, which I found could be tricky for most.
First we need to update the nodes to have the right configuration for k8s. This kops
command will open an editor for the default minion node configs.
$ kops edit ig nodes
We will change the default AMI to kope.io/k8s-1.8-debian-stretch-amd64-hvm-ebs-2018-01-05 because nvidia-docker requires a package at a version that stretch has but jessie doesn't. Then we add a config for kubelet to enable DevicePlugins. More about that later.
spec:
  image: kope.io/k8s-1.8-debian-stretch-amd64-hvm-ebs-2018-01-05
  kubelet:
    featureGates:
      DevicePlugins: "true"
  machineType: p2.xlarge
  maxSize: 2
  minSize: 2
Once you have finished editing the config, run this to apply the update.
$ kops update cluster steven.buffer-k8s.com --yes
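One thing to watch out for: kops update cluster only applies the new spec, and nodes that are already running keep their old AMI and kubelet flags. To my understanding, a rolling update is needed to replace them (note that this terminates and recreates the instances):

```shell
# Replace the existing nodes so they boot from the new AMI
# with the DevicePlugins feature gate enabled.
kops rolling-update cluster steven.buffer-k8s.com --yes
```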
Kubernetes is now ready to access the GPU via Docker. But wait, the Docker runtime isn't ready to access the GPU from the host. Read on, I will show you how.
Since the default AMI from kops
doesn't have the CUDA driver installed, we will have to handle this part manually by SSHing into each node.
Copy these commands to install the driver.
$ wget https://developer.nvidia.com/compute/cuda/9.1/Prod/local_installers/cuda_9.1.85_387.26_linux
$ sudo apt-get update && sudo apt-get install -y \
build-essential
$ sudo sh cuda_9.1.85_387.26_linux
# Verify
$ nvidia-smi
Now we need to set up Docker correctly to access the GPU via CUDA. To do that, let's install nvidia-docker and docker-ce, change a few configurations, and restart.
As you can see, there are a few hoops to jump through. I'm going to give you the exact steps I took to get it to work.
$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
sudo apt-key add -
$ curl -s -L https://nvidia.github.io/nvidia-docker/debian9/amd64/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
$ sudo apt-get update && sudo apt-get install -y \
apt-transport-https \
ca-certificates \
curl \
gnupg2 \
software-properties-common
$ curl -fsSL https://download.docker.com/linux/$(. /etc/os-release; echo "$ID")/gpg | sudo apt-key add -
$ sudo apt-key fingerprint 0EBFCD88
$ sudo add-apt-repository \
"deb [arch=amd64] https://download.docker.com/linux/$(. /etc/os-release; echo "$ID") \
$(lsb_release -cs) \
stable"
$ sudo apt-get update && sudo apt-get install -y \
nvidia-docker2 \
docker-ce
$ sudo vim /lib/systemd/system/docker.service
# In the editor, change the ExecStart line to:
ExecStart=/usr/bin/dockerd -H fd:// -s=overlay2
$ sudo systemctl daemon-reload
$ sudo tee /etc/docker/daemon.json <<EOF
{
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
EOF
$ sudo pkill -SIGHUP dockerd
$ sudo systemctl restart kubelet
# Verify
$ sudo docker run --rm nvidia/cuda nvidia-smi
The first block of commands installs docker-ce
and nvidia-docker
. The vim command opens Docker's systemd unit file; change the ExecStart
line to ExecStart=/usr/bin/dockerd -H fd:// -s=overlay2
and reload the daemon. The daemon.json written via tee switches the default Docker runtime to nvidia
so the GPU can be accessed from a Docker container. The SIGHUP and the kubelet restart let Docker and Kubernetes pick up the new nvidia
runtime correctly. The last command verifies that Docker is able to access CUDA from a container. We are now very close!
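One gotcha worth guarding against: if daemon.json ends up containing invalid JSON, dockerd will refuse to start at all. A quick sanity check with python3, shown here against a scratch copy of the file (on the node, point it at /etc/docker/daemon.json instead):

```shell
# Write the runtime config to a scratch file and verify it parses
# as JSON before reloading dockerd. On the node, validate
# /etc/docker/daemon.json instead of this copy.
CONF=/tmp/daemon.json
tee "$CONF" > /dev/null <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
python3 -m json.tool "$CONF" > /dev/null && echo "daemon.json: valid JSON"
```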
Not sure if you are still with me after the last part? I know it's brutal. If you are having issues, feel free to ping me on Twitter. Now, Kubernetes needs one last thing, and this part is quite straightforward. Since Kubernetes 1.8, GPU devices are accessed through a device plugin. Run one of the following commands to install the plugin as a DaemonSet and you will be set.
# For Kubernetes v1.8
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.8/nvidia-device-plugin.yml
# For Kubernetes v1.9
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.9/nvidia-device-plugin.yml
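To confirm the plugin picked up the GPUs, check that its pods are running and that the nodes now advertise nvidia.com/gpu as an allocatable resource (the label selector below matches the name label in NVIDIA's manifest; adjust it if yours differs):

```shell
# The device plugin pods should be Running on each GPU node...
kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds
# ...and each p2.xlarge node should report one allocatable GPU.
kubectl describe nodes | grep nvidia.com/gpu
```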
Now we have a fully GPU-enabled Kubernetes cluster. Imagination is your only limit :D To harness the power of the GPU, a pod needs to know a GPU is available and request it. This is not as intuitive as other resource counterparts like memory or CPU, but it might be a good thing: a glance at a pod template tells you its GPU requirements. It's always good to be explicit. The pod template below demonstrates how to request the GPUs that are now available in the cluster.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
    - name: digits-container
      image: nvidia/digits:6.0
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
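Assuming the manifest above is saved as gpu-pod.yaml (a hypothetical filename), you can schedule it and check that a container actually sees the device. One caveat: since both containers of a pod land on the same node, this two-container example needs a node with two GPUs; on p2.xlarge nodes (one GPU each) it will stay Pending unless you drop one container.

```shell
# Schedule the pod (manifest filename is an assumption)...
kubectl create -f gpu-pod.yaml
kubectl get pod gpu-pod
# ...and once it is Running, confirm the container sees the GPU.
kubectl exec gpu-pod -c cuda-container -- nvidia-smi
```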
- The velocity of a Data team shouldn't be limited by infrastructure requirements
- More and more technologies in Data require a specialized setup that is not easy to assemble
- A GPU-capable k8s cluster is one such example. Here is how to create one for Deep Learning workloads.