In this post we explore running Pods on the PRP Kubernetes cluster. Unless otherwise noted, we’ll execute all kubectl commands using the kubeconfig for shaw@ucsc.edu:
$ kubectl config use-context shaw
Switched to context "shaw".
Interactive run
Here are two simple examples that do not use a Kubernetes manifest file:
1) Run Python interactively in a single TensorFlow instance, without restarting the Pod when it exits:
$ kubectl run -i --tty tf --image=gcr.io/tensorflow/tensorflow --restart=Never -- bash
If you don't see a command prompt, try pressing enter.
root@tf:/notebooks# python
Python 2.7.12 (default, Nov 20 2017, 18:23:56)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
2017-12-19 02:27:12.462302: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
>>> print(sess.run(hello))
Hello, TensorFlow!
>>> exit()
root@tf:/notebooks# exit
exit
Don’t forget to delete the Pod when done:
$ kubectl delete pod tf
2) Start Jupyter Notebook in the TensorFlow instance (with the --rm option, the Pod is deleted automatically when done):
$ kubectl run --rm -it tf --image=gcr.io/tensorflow/tensorflow --restart=Never -- bash
root@tf:/notebooks# jupyter notebook password
Enter password:
Verify password:
[NotebookPasswordApp] Wrote hashed password to /root/.jupyter/jupyter_notebook_config.json
root@tf:/notebooks# jupyter notebook
In another terminal, forward local port 8888 to port 8888 on the Pod:
$ kubectl port-forward tf 8888:8888
Forwarding from 127.0.0.1:8888 -> 8888
Load http://127.0.0.1:8888 in a web browser.
When done, press Ctrl-C in the second terminal, then press Ctrl-C and type ‘exit’ in the first terminal.
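If the page does not load, a quick way to tell whether the forward is listening at all is to try a plain TCP connection to the local port. The following is a minimal sketch using only the Python standard library; it assumes the forward is bound to 127.0.0.1:8888 as shown above, and the helper name port_is_open is made up for illustration.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # 127.0.0.1:8888 is the local end of `kubectl port-forward tf 8888:8888`.
    if port_is_open("127.0.0.1", 8888):
        print("forward is up")
    else:
        print("forward is not reachable")
```

A successful connection only shows that something is listening on the port; it does not prove that Jupyter itself is healthy.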
TensorFlow GPU Pod example
Dmitry Mishin provides an example manifest for a TensorFlow GPU Pod on GitHub:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-example
spec:
  containers:
  - name: gpu-container
    image: gcr.io/tensorflow/tensorflow:latest-gpu
    imagePullPolicy: Always
    args: ["sleep", "36500000"]
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
      requests:
        alpha.kubernetes.io/nvidia-gpu: 1
    volumeMounts:
    - name: nvidia-driver
      mountPath: /usr/local/nvidia
      readOnly: true
    - mountPath: /examples
      name: tensor-examples
  restartPolicy: Never
  volumes:
  - name: nvidia-driver
    hostPath:
      path: /var/lib/nvidia-docker/volumes/nvidia_driver/384.90/
  - name: tensor-examples
    gitRepo:
      repository: "https://github.com/tensorflow/models.git"
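When several Pods differ only in name or GPU count, the manifest can be generated rather than edited by hand. The sketch below rebuilds the same Pod as a Python dictionary and prints it as JSON, which kubectl create -f accepts as well as YAML. The function name gpu_pod_manifest and its parameters are illustrative only; the field values are copied from the manifest above.

```python
import json

# Resource name used by the manifest in this post.
GPU_RESOURCE = "alpha.kubernetes.io/nvidia-gpu"


def gpu_pod_manifest(name: str = "gpu-pod-example", gpus: int = 1) -> dict:
    """Build the GPU Pod manifest as a plain dict (hypothetical helper)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "gpu-container",
                "image": "gcr.io/tensorflow/tensorflow:latest-gpu",
                "imagePullPolicy": "Always",
                "args": ["sleep", "36500000"],
                "resources": {
                    "limits": {GPU_RESOURCE: gpus},
                    "requests": {GPU_RESOURCE: gpus},
                },
                "volumeMounts": [
                    {"name": "nvidia-driver",
                     "mountPath": "/usr/local/nvidia",
                     "readOnly": True},
                    {"name": "tensor-examples", "mountPath": "/examples"},
                ],
            }],
            "volumes": [
                {"name": "nvidia-driver",
                 "hostPath": {"path": "/var/lib/nvidia-docker/volumes/nvidia_driver/384.90/"}},
                {"name": "tensor-examples",
                 "gitRepo": {"repository": "https://github.com/tensorflow/models.git"}},
            ],
        },
    }


if __name__ == "__main__":
    # Print the manifest so it can be redirected to a file for kubectl.
    print(json.dumps(gpu_pod_manifest(), indent=2))
```

The output can be redirected to a file (any name will do) and passed to kubectl create -f in the same way as the YAML file.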
Create a Pod from the manifest file:
$ kubectl create -f tensorflow-example.yaml
pod "gpu-pod-example" created
Check if the Pod is running:
$ kubectl get pod gpu-pod-example
NAME READY STATUS RESTARTS AGE
gpu-pod-example 1/1 Running 0 2m
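In a script, polling is easier than rereading the table. Here is a rough sketch, assuming kubectl is on the PATH and the current context can see the Pod: it asks for the Pod's phase with -o jsonpath={.status.phase} and waits until the phase is Running. The helper names, the timeout values and the injectable run argument are made up for illustration.

```python
import subprocess
import time


def pod_phase(name: str, run=subprocess.run) -> str:
    """Return the Pod phase, e.g. Pending, Running, Succeeded or Failed."""
    result = run(
        ["kubectl", "get", "pod", name, "-o", "jsonpath={.status.phase}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def wait_until_running(name: str, timeout: float = 300.0,
                       interval: float = 5.0, run=subprocess.run) -> bool:
    """Poll the phase until it is Running; give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if pod_phase(name, run=run) == "Running":
            return True
        time.sleep(interval)
    return False
```

Calling wait_until_running("gpu-pod-example") before kubectl exec avoids connecting too early. Note that the Running phase does not by itself guarantee that every container is ready, so the READY column is still worth a look.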
Connect to the running Pod:
$ kubectl exec -it gpu-pod-example -- bash
root@gpu-pod-example:/notebooks# nvidia-smi
Tue Dec 19 05:32:23 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN X (Pascal) Off | 00000000:8A:00.0 Off | N/A |
| 23% 24C P8 14W / 250W | 10MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
root@gpu-pod-example:/notebooks# python
Python 2.7.12 (default, Nov 20 2017, 18:23:56)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
2017-12-19 05:33:20.279226: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2017-12-19 05:33:20.804666: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: TITAN X (Pascal) major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:8a:00.0
totalMemory: 11.90GiB freeMemory: 11.75GiB
2017-12-19 05:33:20.804758: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:8a:00.0, compute capability: 6.1)
>>> print(sess.run(hello))
Hello, TensorFlow!
>>> exit()
root@gpu-pod-example:/notebooks# exit
exit
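The GPU check done with nvidia-smi at the start of that session can also be scripted. The sketch below parses the CSV that nvidia-smi --query-gpu=name,memory.total --format=csv,noheader prints. It assumes Python 3.7 or later is available next to nvidia-smi (the image used above ships Python 2.7, so it would need a newer image), and the function names are hypothetical.

```python
import subprocess


def parse_gpu_csv(text: str) -> list:
    """Turn 'name, memory' CSV lines into (name, memory) tuples."""
    gpus = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the last comma so a GPU name containing commas stays intact.
        name, memory = line.rsplit(",", 1)
        gpus.append((name.strip(), memory.strip()))
    return gpus


def list_gpus() -> list:
    """Run nvidia-smi and return the GPUs it reports."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)
```

A job can call list_gpus() at startup and exit early when the list is empty, instead of discovering halfway through that it is running on CPU.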
Clean up by deleting the Pod:
$ kubectl delete -f tensorflow-example.yaml
pod "gpu-pod-example" deleted
Persistent Volume with Rook Block Storage
If you read and/or write a lot of data, one option is to mount Rook Block Storage as a Persistent Volume on your Pod. Here is an example manifest file (tensorflow-rook-block.yaml):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tf-pvc
spec:
  storageClassName: rook-block
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: tf-gpu
spec:
  containers:
  - name: tensorflow-gpu
    image: gcr.io/tensorflow/tensorflow:latest-gpu
    args: ["sleep", "36500000"]
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
      requests:
        alpha.kubernetes.io/nvidia-gpu: 1
    volumeMounts:
    - name: nvidia-driver
      mountPath: /usr/local/nvidia
      readOnly: true
    - mountPath: /examples
      name: tensor-examples
    - mountPath: /rook
      name: tf-pvc
  restartPolicy: Never
  volumes:
  - name: nvidia-driver
    hostPath:
      path: /var/lib/nvidia-docker/volumes/nvidia_driver/384.90/
  - name: tensor-examples
    gitRepo:
      repository: "https://github.com/tensorflow/models.git"
  - name: tf-pvc
    persistentVolumeClaim:
      claimName: tf-pvc
Create a PersistentVolumeClaim and a Pod from the manifest file:
$ kubectl create -f tensorflow-rook-block.yaml
persistentvolumeclaim "tf-pvc" created
pod "tf-gpu" created
Check the status of the PVC and the Pod:
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
tf-pvc Bound pvc-d2e74790-e4e3-11e7-a68f-0cc47a6a1e1e 10Gi RWO rook-block 1m
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
tf-gpu 1/1 Running 0 1m
Connect to the running Pod:
$ kubectl exec -it tf-gpu -- bash
root@tf-gpu:/notebooks# df -h /rook/
Filesystem Size Used Avail Use% Mounted on
/dev/rbd0 10G 33M 10G 1% /rook
root@tf-gpu:/notebooks# grep rook /proc/mounts
/dev/rbd0 /rook xfs rw,relatime,attr2,inode64,sunit=8192,swidth=8192,noquota 0 0
root@tf-gpu:/notebooks# exit
exit
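Because the block volume has a fixed size (10Gi here), a job that writes a large dataset may want to check how much room is left first. Below is a minimal sketch using the standard library; has_room is an illustrative name, and the /rook path in the usage note is the mount point from the manifest above.

```python
import shutil


def has_room(path: str, needed_bytes: int) -> bool:
    """Return True if the filesystem holding `path` has at least needed_bytes free."""
    return shutil.disk_usage(path).free >= needed_bytes
```

Inside the Pod, has_room("/rook", 2 * 1024 ** 3) would report whether 2 GiB can still be written to the volume.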
Delete the resources when done:
$ kubectl delete -f tensorflow-rook-block.yaml
persistentvolumeclaim "tf-pvc" deleted
pod "tf-gpu" deleted
Note that the default reclaim policy is Delete. When you delete your PersistentVolumeClaim, the PersistentVolume object is deleted from Kubernetes and the associated storage asset (block storage) is deleted from Rook as well.
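To see which policy applies to a particular volume, the reclaim policy can be read from the PersistentVolume objects. The sketch below takes the JSON printed by kubectl get pv -o json and maps each bound claim to its spec.persistentVolumeReclaimPolicy. It assumes your account is allowed to list PersistentVolumes, which are cluster-scoped, and the function name is hypothetical.

```python
import json


def reclaim_policies(pv_list_json: str) -> dict:
    """Map 'namespace/claim' to the reclaim policy of the bound PersistentVolume."""
    policies = {}
    for pv in json.loads(pv_list_json).get("items", []):
        spec = pv.get("spec", {})
        claim = spec.get("claimRef")
        if not claim:
            continue  # this volume is not bound to any claim
        key = "{}/{}".format(claim.get("namespace", ""), claim.get("name", ""))
        policies[key] = spec.get("persistentVolumeReclaimPolicy", "")
    return policies
```

A claim that maps to Delete loses its data when the claim is removed; a volume that should outlive its claim needs the Retain policy instead.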
Persistent Volume with Rook Shared File System
If you read and/or write a lot of data, one option is to mount the Rook Shared File System on your Pod. Here is an example manifest file (tensorflow-rook-filesystem.yaml):
apiVersion: v1
kind: Pod
metadata:
  name: tf-gpu
spec:
  containers:
  - name: tensorflow-gpu
    image: gcr.io/tensorflow/tensorflow:latest-gpu
    args: ["sleep", "36500000"]
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
      requests:
        alpha.kubernetes.io/nvidia-gpu: 1
    volumeMounts:
    - name: nvidia-driver
      mountPath: /usr/local/nvidia
      readOnly: true
    - mountPath: /examples
      name: tensor-examples
    - mountPath: /rook
      name: rook-fs
  restartPolicy: Never
  volumes:
  - name: nvidia-driver
    hostPath:
      path: /var/lib/nvidia-docker/volumes/nvidia_driver/384.90/
  - name: tensor-examples
    gitRepo:
      repository: "https://github.com/tensorflow/models.git"
  - name: rook-fs
    flexVolume:
      driver: rook.io/rook
      fsType: ceph
      options:
        fsName: calogan-fs
        clusterName: rook
Create a Pod from the manifest file:
$ kubectl create -f tensorflow-rook-filesystem.yaml
pod "tf-gpu" created
Check the status of the Pod:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
tf-gpu 1/1 Running 0 14s
Connect to the running Pod:
$ kubectl exec -it tf-gpu -- bash
root@tf-gpu:/notebooks# df -h /rook
Filesystem Size Used Avail Use% Mounted on
10.105.88.218:6790,10.96.181.144:6790,10.105.145.177:6790:/ 41T 4.8T 36T 12% /rook
root@tf-gpu:/notebooks# grep rook /proc/mounts
10.105.88.218:6790,10.96.181.144:6790,10.105.145.177:6790:/ /rook ceph rw,relatime,name=admin,secret=<hidden>,acl,mds_namespace=calogan-fs 0 0
root@tf-gpu:/notebooks# mkdir -p /rook/ucsc-edu/shaw
root@tf-gpu:/notebooks# cp 1_hello_tensorflow.ipynb /rook/ucsc-edu/shaw/
root@tf-gpu:/notebooks# exit
exit
Delete the Pod when done:
$ kubectl delete -f tensorflow-rook-filesystem.yaml
pod "tf-gpu" deleted
Note that data on the Shared File System are truly persistent: they survive beyond the Pod lifecycle.
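A training script can use this to keep its results after the Pod is gone. The sketch below repeats the mkdir -p and cp steps from the session above in Python; the function name save_to_shared and its defaults are illustrative, with the default path taken from that session.

```python
import shutil
from pathlib import Path


def save_to_shared(src: str, shared_root: str = "/rook",
                   subdir: str = "ucsc-edu/shaw") -> Path:
    """Copy src into shared_root/subdir, creating the directory if needed."""
    target_dir = Path(shared_root) / subdir
    # Same effect as `mkdir -p`: create parents, ignore an existing directory.
    target_dir.mkdir(parents=True, exist_ok=True)
    # copy2 returns the path of the new file inside target_dir.
    return Path(shutil.copy2(src, target_dir))
```

For example, save_to_shared("1_hello_tensorflow.ipynb") places the notebook in /rook/ucsc-edu/shaw/, where a later Pod mounting the same file system can read it.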