k8s Offline Deployment Notes: NVIDIA Device Plugin
Download the image
docker pull nvidia/k8s-device-plugin:v0.9.0
docker tag nvidia/k8s-device-plugin:v0.9.0 10.0.7.125:8000/library/k8s-device-plugin:v0.9.0
docker push 10.0.7.125:8000/library/k8s-device-plugin:v0.9.0
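The pull, tag, and push steps above can be wrapped in a small helper so that further images are mirrored the same way. This is only a sketch: `mirror_ref` and `mirror_image` are hypothetical names, and it assumes the private registry keeps images directly under the given project with the upstream namespace dropped, as in the commands above.

```shell
# Sketch only: mirror_ref and mirror_image are hypothetical helper names.

# Print the private-registry reference for an upstream image.
# Assumption: the upstream namespace (e.g. "nvidia/") is dropped.
mirror_ref() {
    registry=$1
    project=$2
    image=$3
    printf '%s/%s/%s\n' "$registry" "$project" "${image##*/}"
}

# Pull an upstream image, retag it for the private registry, and push it.
mirror_image() {
    target=$(mirror_ref "$1" "$2" "$3") || return 1
    docker pull "$3" &&
    docker tag "$3" "$target" &&
    docker push "$target"
}
```

For example, `mirror_image 10.0.7.125:8000 library nvidia/k8s-device-plugin:v0.9.0` would run the same three commands as above.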
Download and install the driver
Driver: https://www.nvidia.cn/Download/index.aspx?lang=cn
CUDA: https://developer.nvidia.com/cuda-toolkit-archive
Install nvidia-docker
apt download libnvidia-container1
apt download libnvidia-container-tools
apt download nvidia-container-toolkit
apt download nvidia-container-runtime
apt download nvidia-docker2
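Note that `apt download` only fetches the `.deb` files into the current directory; it does not install anything. A minimal sketch for checking that all five packages are present before copying them to the offline node, assuming the files keep the default `<package>_<version>_<arch>.deb` naming (`missing_debs` is a hypothetical helper name):

```shell
# Sketch only: missing_debs is a hypothetical helper name.
# Print the name of every required package that has no .deb file in DIR.
missing_debs() {
    dir=$1
    for pkg in libnvidia-container1 libnvidia-container-tools \
               nvidia-container-toolkit nvidia-container-runtime nvidia-docker2
    do
        found=no
        # The trailing underscore keeps libnvidia-container1 from
        # matching libnvidia-container-tools and vice versa.
        for f in "$dir/${pkg}_"*.deb; do
            [ -e "$f" ] && found=yes
        done
        [ "$found" = yes ] || echo "$pkg"
    done
}
```

If `missing_debs .` prints nothing, the directory is complete. On the offline node the copied packages can then typically be installed in one step with `sudo dpkg -i ./*.deb`.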
Edit /etc/docker/daemon.json so that it contains the following settings
sudo vim /etc/docker/daemon.json
{
    ...
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
    ...
}
Restart Docker
sudo systemctl daemon-reload
sudo systemctl restart docker
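After the restart, it is worth confirming that Docker actually picked up the new default runtime. A small sketch, assuming the `docker` CLI is on the PATH (`check_default_runtime` is a hypothetical helper name):

```shell
# Sketch only: check_default_runtime is a hypothetical helper name.
# Succeeds when Docker reports "nvidia" as its default runtime.
check_default_runtime() {
    runtime=$(docker info --format '{{.DefaultRuntime}}') || return 1
    if [ "$runtime" = "nvidia" ]; then
        echo "default runtime is nvidia"
    else
        echo "default runtime is '$runtime', expected 'nvidia'" >&2
        return 1
    fi
}
```

If the check fails, the usual suspects are a syntax error in daemon.json or a missing nvidia-container-runtime binary.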
Deploy
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # This toleration is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      # The image path must match the tag that was pushed to the private registry.
      - image: 10.0.7.125:8000/library/k8s-device-plugin:v0.9.0
        name: nvidia-device-plugin-ctr
        args: ["--fail-on-init-error=false"]
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
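The manifest still has to be applied to the cluster. A sketch of that step, assuming the YAML above is saved to a file (the name `nvidia-device-plugin.yml` used below is only an example) and `kubectl` is configured for the cluster; `deploy_plugin` is a hypothetical helper name:

```shell
# Sketch only: deploy_plugin is a hypothetical helper name.
# Apply the DaemonSet manifest, wait for the rollout, then list its pods.
deploy_plugin() {
    manifest=$1
    kubectl apply -f "$manifest" &&
    kubectl rollout status daemonset/nvidia-device-plugin-daemonset --timeout=120s &&
    kubectl get pods -l name=nvidia-device-plugin-ds -o wide
}
```

Calling `deploy_plugin nvidia-device-plugin.yml` should end with one plugin pod listed per schedulable node. Because the manifest sets no namespace, the resources land in the namespace of the current kubectl context.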
Test
kubectl describe node node1 | grep nvidia.com/gpu
# nvidia.com/gpu: 1
# nvidia.com/gpu: 1
# nvidia.com/gpu 0 0
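If the command prints nvidia.com/gpu lines like the ones above (the first two are the node's Capacity and Allocatable entries), the plugin has registered the GPU and the deployment succeeded.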
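For scripted checks, the allocatable GPU count can also be read directly instead of grepping the `describe` output. A sketch, assuming `kubectl` is configured for the cluster (`gpu_allocatable` is a hypothetical helper name; the dots in the resource name are escaped for JSONPath):

```shell
# Sketch only: gpu_allocatable is a hypothetical helper name.
# Print the number of allocatable nvidia.com/gpu resources on a node,
# or 0 when the node does not advertise the resource at all.
gpu_allocatable() {
    node=$1
    count=$(kubectl get node "$node" \
        -o jsonpath='{.status.allocatable.nvidia\.com/gpu}') || return 1
    echo "${count:-0}"
}
```

`gpu_allocatable node1` prints the count for the node used in the test above; a result of 0 suggests the plugin pod on that node has not registered any GPU.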