Running Pods on Kubernetes

December 11, 2017 | Docker, Machine Learning, Python

In this post we explore running Pods on the PRP Kubernetes cluster. Unless otherwise noted, we’ll execute all kubectl commands using the Kubeconfig for shaw@ucsc.edu:

$ kubectl config use-context shaw
Switched to context "shaw".
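
A quick way to see which contexts your Kubeconfig defines and which one is active (both are stock kubectl subcommands):

$ kubectl config get-contexts
$ kubectl config current-context
shaw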

Interactive run

Here are two simple examples that don’t require a Kubernetes manifest file:

1) Run Python interactively in a single TensorFlow container, and don’t restart it if it exits:

$ kubectl run -i --tty tf --image=gcr.io/tensorflow/tensorflow --restart=Never -- bash
If you don't see a command prompt, try pressing enter.
root@tf:/notebooks# python
Python 2.7.12 (default, Nov 20 2017, 18:23:56)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
2017-12-19 02:27:12.462302: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
>>> print(sess.run(hello))
Hello, TensorFlow!
>>> exit()
root@tf:/notebooks# exit
exit
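
If your terminal disconnects while the container is still running, you can reattach to the same session (kubectl attach is the standard way to rejoin a running container’s stdio):

$ kubectl attach -it tf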

Don’t forget to delete the Pod when done:

$ kubectl delete pod tf

2) Start Jupyter Notebook in the TensorFlow container (with the --rm option, the Pod is automatically deleted when done):

$ kubectl run --rm -it tf --image=gcr.io/tensorflow/tensorflow --restart=Never -- bash
root@tf:/notebooks# jupyter notebook password
Enter password:
Verify password:
[NotebookPasswordApp] Wrote hashed password to /root/.jupyter/jupyter_notebook_config.json
root@tf:/notebooks# jupyter notebook

In another terminal, forward local port 8888 to port 8888 on the Pod:

$ kubectl port-forward tf 8888:8888
Forwarding from 127.0.0.1:8888 -> 8888

Load http://127.0.0.1:8888 in a web browser.
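
If local port 8888 is already taken, forward a different local port instead (9999 below is an arbitrary choice) and adjust the URL to match:

$ kubectl port-forward tf 9999:8888
Forwarding from 127.0.0.1:9999 -> 8888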

When done, press Ctrl-C in the second terminal; then press Ctrl-C and type ‘exit’ in the first terminal.

TensorFlow GPU Pod example

Dmitry Mishin provides an example manifest for a TensorFlow GPU Pod on GitHub:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-example
spec:
  containers:
  - name: gpu-container
    image: gcr.io/tensorflow/tensorflow:latest-gpu
    imagePullPolicy: Always
    args: ["sleep", "36500000"]
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
      requests:
        alpha.kubernetes.io/nvidia-gpu: 1
    volumeMounts:
    - name: nvidia-driver
      mountPath: /usr/local/nvidia
      readOnly: true
    - mountPath: /examples
      name: tensor-examples
  restartPolicy: Never
  volumes:
    - name: nvidia-driver
      hostPath:
        path: /var/lib/nvidia-docker/volumes/nvidia_driver/384.90/
    - name: tensor-examples
      gitRepo:
        repository: "https://github.com/tensorflow/models.git"

Create a Pod from the manifest file:

$ kubectl create -f tensorflow-example.yaml
pod "gpu-pod-example" created

Check if the Pod is running:

$ kubectl get pod gpu-pod-example
NAME              READY     STATUS    RESTARTS   AGE
gpu-pod-example   1/1       Running   0          2m
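
If the Pod stays in Pending instead (for example, when no GPU is currently free), kubectl describe prints the scheduler’s events at the end of its output:

$ kubectl describe pod gpu-pod-example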

Connect to the running Pod:

$ kubectl exec -it gpu-pod-example -- bash
root@gpu-pod-example:/notebooks# nvidia-smi
Tue Dec 19 05:32:23 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN X (Pascal)    Off  | 00000000:8A:00.0 Off |                  N/A |
| 23%   24C    P8    14W / 250W |     10MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

root@gpu-pod-example:/notebooks# python
Python 2.7.12 (default, Nov 20 2017, 18:23:56)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
2017-12-19 05:33:20.279226: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2017-12-19 05:33:20.804666: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: TITAN X (Pascal) major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:8a:00.0
totalMemory: 11.90GiB freeMemory: 11.75GiB
2017-12-19 05:33:20.804758: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:8a:00.0, compute capability: 6.1)
>>> print(sess.run(hello))
Hello, TensorFlow!
>>> exit()
root@gpu-pod-example:/notebooks# exit
exit
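
Incidentally, a quicker check than the interactive session above is a one-liner inside the Pod: tf.test.gpu_device_name() (part of the TensorFlow 1.x API) returns the name of a GPU device, or an empty string if TensorFlow sees none:

root@gpu-pod-example:/notebooks# python -c "import tensorflow as tf; print(tf.test.gpu_device_name())"
/device:GPU:0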

Clean up by deleting the Pod:

$ kubectl delete -f tensorflow-example.yaml
pod "gpu-pod-example" deleted

Persistent Volume with Rook Block Storage

If you read and/or write a lot of data, one option is to mount the Rook Block Storage as a Persistent Volume on your Pod. Here is an example manifest file (tensorflow-rook-block.yaml):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tf-pvc
spec:
  storageClassName: rook-block
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: tf-gpu
spec:
  containers:
  - name: tensorflow-gpu
    image: gcr.io/tensorflow/tensorflow:latest-gpu
    args: ["sleep", "36500000"]
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
      requests:
        alpha.kubernetes.io/nvidia-gpu: 1
    volumeMounts:
    - name: nvidia-driver
      mountPath: /usr/local/nvidia
      readOnly: true
    - mountPath: /examples
      name: tensor-examples
    - mountPath: /rook
      name: tf-pvc
  restartPolicy: Never
  volumes:
    - name: nvidia-driver
      hostPath:
        path: /var/lib/nvidia-docker/volumes/nvidia_driver/384.90/
    - name: tensor-examples
      gitRepo:
        repository: "https://github.com/tensorflow/models.git"
    - name: tf-pvc
      persistentVolumeClaim:
        claimName: tf-pvc

Create a PersistentVolumeClaim and a Pod from the manifest file:

$ kubectl create -f tensorflow-rook-block.yaml
persistentvolumeclaim "tf-pvc" created
pod "tf-gpu" created

Check the status of the PVC and the Pod:

$ kubectl get pvc
NAME      STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
tf-pvc    Bound     pvc-d2e74790-e4e3-11e7-a68f-0cc47a6a1e1e   10Gi       RWO            rook-block     1m
$ kubectl get pods
NAME      READY     STATUS    RESTARTS   AGE
tf-gpu    1/1       Running   0          1m
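
If the PVC stays Pending instead of Bound, describing it usually surfaces the provisioner’s error events:

$ kubectl describe pvc tf-pvc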

Connect to the running Pod:

$ kubectl exec -it tf-gpu -- bash

root@tf-gpu:/notebooks# df -h /rook/
Filesystem      Size  Used Avail Use% Mounted on
/dev/rbd0        10G   33M   10G   1% /rook

root@tf-gpu:/notebooks# grep rook /proc/mounts
/dev/rbd0 /rook xfs rw,relatime,attr2,inode64,sunit=8192,swidth=8192,noquota 0 0

root@tf-gpu:/notebooks# exit
exit
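
While connected, a throwaway dd write is a quick way to confirm the block device is writable (a rough smoke test, not a benchmark; the file name is arbitrary):

root@tf-gpu:/notebooks# dd if=/dev/zero of=/rook/testfile bs=1M count=100
root@tf-gpu:/notebooks# rm /rook/testfile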

Delete the resources when done:

$ kubectl delete -f tensorflow-rook-block.yaml
persistentvolumeclaim "tf-pvc" deleted
pod "tf-gpu" deleted

Note that the default reclaim policy is Delete: when you delete your PersistentVolumeClaim, the PersistentVolume object is removed from Kubernetes, and the associated storage asset (the block image) is deleted from Rook as well.
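
If you need the block storage to outlive the claim, one option is to switch the bound PV’s reclaim policy to Retain before deleting anything (the PV name below comes from the earlier kubectl get pvc output):

$ kubectl patch pv pvc-d2e74790-e4e3-11e7-a68f-0cc47a6a1e1e -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'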

Persistent Volume with Rook Shared File System

If you read and/or write a lot of data, another option is to mount the Rook Shared File System on your Pod. Here is an example manifest file (tensorflow-rook-filesystem.yaml):

apiVersion: v1
kind: Pod
metadata:
  name: tf-gpu
spec:
  containers:
  - name: tensorflow-gpu
    image: gcr.io/tensorflow/tensorflow:latest-gpu
    args: ["sleep", "36500000"]
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
      requests:
        alpha.kubernetes.io/nvidia-gpu: 1
    volumeMounts:
    - name: nvidia-driver
      mountPath: /usr/local/nvidia
      readOnly: true
    - mountPath: /examples
      name: tensor-examples
    - mountPath: /rook
      name: rook-fs
  restartPolicy: Never
  volumes:
    - name: nvidia-driver
      hostPath:
        path: /var/lib/nvidia-docker/volumes/nvidia_driver/384.90/
    - name: tensor-examples
      gitRepo:
        repository: "https://github.com/tensorflow/models.git"
    - name: rook-fs
      flexVolume:
        driver: rook.io/rook
        fsType: ceph
        options:
          fsName: calogan-fs
          clusterName: rook

Create a Pod from the manifest file:

$ kubectl create -f tensorflow-rook-filesystem.yaml
pod "tf-gpu" created

Check the status of the Pod:

$ kubectl get pods
NAME      READY     STATUS    RESTARTS   AGE
tf-gpu    1/1       Running   0          14s

Connect to the running Pod:

$ kubectl exec -it tf-gpu -- bash
root@tf-gpu:/notebooks# df -h /rook
Filesystem                                                   Size  Used Avail Use% Mounted on
10.105.88.218:6790,10.96.181.144:6790,10.105.145.177:6790:/   41T  4.8T   36T  12% /rook

root@tf-gpu:/notebooks# grep rook /proc/mounts
10.105.88.218:6790,10.96.181.144:6790,10.105.145.177:6790:/ /rook ceph rw,relatime,name=admin,secret=<hidden>,acl,mds_namespace=calogan-fs 0 0

root@tf-gpu:/notebooks# mkdir -p /rook/ucsc-edu/shaw
root@tf-gpu:/notebooks# cp 1_hello_tensorflow.ipynb /rook/ucsc-edu/shaw/

root@tf-gpu:/notebooks# exit
exit

Delete the Pod when done:

$ kubectl delete -f tensorflow-rook-filesystem.yaml
pod "tf-gpu" deleted

Note that data on the Shared File System are truly persistent: they survive beyond the Pod’s lifecycle.
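
To see this, recreate the Pod from the same manifest and list the directory populated earlier; the notebook we copied should still be there:

$ kubectl create -f tensorflow-rook-filesystem.yaml
pod "tf-gpu" created
$ kubectl exec tf-gpu -- ls /rook/ucsc-edu/shaw
1_hello_tensorflow.ipynb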