Deployed from an image via Rancher; container fails to start when configured with Volcano virtualized GPU (vGPU) #10564

@CCzzzzzzz

Reminder

  • I have read the above rules and searched the existing issues.

System Info

The log shows: sleep: error while loading shared libraries: /usr/lib/x86_64-linux-gnu/libcuda.so.1: file too short
My YAML configuration is below; the key settings are marked in bold:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llamafactory
  namespace: xxx
  labels:
    workload.user.cattle.io/workloadselector: apps.deployment-jmai-llamafactory
spec:
  replicas: 1
  selector:
    matchLabels:
      workload.user.cattle.io/workloadselector: apps.deployment-jmai-llamafactory
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0 # ensure zero-downtime updates
    type: RollingUpdate
  template:
    metadata:
      labels:
        xxx
    spec:
      schedulerName: volcano
      runtimeClassName: nvidia
      terminationGracePeriodSeconds: 30
      containers:
        - name: llamafactory
          image: goharbor.jomoo.cn/llmos-ai/llamafactory:0.9.5
          imagePullPolicy: IfNotPresent
          command:
            - llamafactory-cli
            - webui
            - '--host'
            - '0.0.0.0'
            - '--port'
            - '7860'
          ports:
            - containerPort: 7860
              name: http
              protocol: TCP
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: all
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: compute,utility
            - name: NVIDIA_DISABLE_REQUIRE
              value: 'true'
            - name: LD_LIBRARY_PATH
              value: >-
                /usr/lib/wsl/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu
          resources:
            limits:
              cpu: '16'
              memory: 16000Mi
              volcano.sh/vgpu-memory: '12288'
              volcano.sh/vgpu-number: '1'
            requests:
              cpu: '4'
              memory: 8000Mi
              volcano.sh/vgpu-memory: '12288'
              volcano.sh/vgpu-number: '1'
          securityContext:
            privileged: false
          volumeMounts:
            - mountPath: /dev/shm
              name: dshm
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 8Gi
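
For anyone trying to narrow this down: the dynamic loader typically prints "file too short" when the file it opens is smaller than an ELF header, for example a zero-byte placeholder sitting where the real driver library should be. The following is a minimal diagnostic sketch, not a fix. The helper name `inspect_shared_object` and the 64-byte threshold (the size of an ELF64 header) are my own assumptions for illustration, and the only path checked is the one from the error message above.

```python
import os

ELF_MAGIC = b"\x7fELF"
ELF64_HEADER_SIZE = 64  # assumption: 64-bit ELF header length in bytes


def inspect_shared_object(path):
    """Return a short diagnosis for a shared-library path (hypothetical helper)."""
    # os.path.exists follows symlinks, so a dangling symlink also reports "missing".
    if not os.path.exists(path):
        return "missing"
    real = os.path.realpath(path)
    if os.path.getsize(real) < ELF64_HEADER_SIZE:
        # Smaller than an ELF header: consistent with a "file too short" loader error.
        return "too-short"
    with open(real, "rb") as fh:
        if fh.read(4) != ELF_MAGIC:
            return "not-elf"
    return "ok"


if __name__ == "__main__":
    # Path taken from the error message in this issue.
    path = "/usr/lib/x86_64-linux-gnu/libcuda.so.1"
    print(path, "->", inspect_shared_object(path))
```

Running this inside the failing container image (with the entrypoint overridden) would show whether `libcuda.so.1` is an empty or truncated file in the image itself rather than the library provided by the host driver.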

Reproduction

Put your message here.

Others

No response

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working), pending (This problem is yet to be addressed)
