Adding a GPU Node to a Slurm 20.02.3 Cluster

This post shows how to add a GPU node to a Slurm cluster.

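1 Environment setup
One Slurm management node (186.31.29.21) and one GPU node (183.31.28.247).

The GPU node has a GTX 1080 Ti, driver version 440.100, and CUDA 10.0, with the matching cuDNN installed.

In fact, Slurm is not sensitive to the GPU model or driver. It simply looks for the hardware device files under /dev and then exposes them as Slurm generic resources (GRES).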
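Since Slurm only needs the device files, it helps to confirm they exist on the GPU node before touching any configuration. The following is a minimal sketch, assuming the NVIDIA driver names its device files /dev/nvidia0, /dev/nvidia1, and so on; the function name `find_nvidia_devices` is hypothetical and is not part of Slurm.

```python
import os
import re


def find_nvidia_devices(dev_dir="/dev"):
    """Return GPU device files such as /dev/nvidia0, sorted by GPU index.

    Control files like nvidiactl or nvidia-uvm are skipped because they
    do not correspond to a single GPU.
    """
    pattern = re.compile(r"^nvidia(\d+)$")
    found = []
    if not os.path.isdir(dev_dir):
        return []
    for name in os.listdir(dev_dir):
        match = pattern.match(name)
        if match:
            found.append((int(match.group(1)), os.path.join(dev_dir, name)))
    # Sort numerically so that nvidia10 comes after nvidia2.
    return [path for _, path in sorted(found)]


if __name__ == "__main__":
    devices = find_nvidia_devices()
    print(f"{len(devices)} GPU device file(s) found:", devices)
```

The number of files it reports is the count that would go after the colon in `Gres=gpu:N`.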

2 Modify the configuration files
Management node:

In slurm.conf, change the following two entries:

GresTypes=gpu
NodeName=gpunode01 Gres=gpu:1 CPUs=56 RealMemory=256000 Sockets=2 State=UNKNOWN
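The first line declares gpu as a generic resource type.

In the second line, the important parameter is `Gres=gpu:1`. Here gpu is the resource type and the number after the colon is the count: 1 for one GPU, 8 for eight.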
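To make the `type:count` format concrete, here is a small sketch that splits such a value into its two parts. It only handles the simple form used in this post, and `parse_gres` is a hypothetical helper, not a Slurm API.

```python
def parse_gres(value):
    """Split a Gres value such as 'gpu:1' into (type, count)."""
    # rpartition splits on the last colon, so the count is always the tail.
    name, _, count = value.rpartition(":")
    if not name or not count.isdigit():
        raise ValueError(f"expected TYPE:COUNT, got {value!r}")
    return name, int(count)


if __name__ == "__main__":
    for example in ("gpu:1", "gpu:8"):
        gres_type, gres_count = parse_gres(example)
        print(f"{example}: type={gres_type}, count={gres_count}")
```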



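Compute node:

Besides slurm.conf, the compute node also needs gres.conf. According to the official Slurm documentation, the contents of gres.conf can also be written in slurm.conf instead.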

NodeName=gpunode01 Name=gpu  File=/dev/nvidia0
The important parts of this line are the node name and the File path of the GPU device.
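For a node with several GPUs, the same kind of line is needed for each device. The sketch below generates them, assuming the devices are numbered consecutively from /dev/nvidia0; `gres_conf_lines` is a hypothetical helper written for illustration only.

```python
def gres_conf_lines(node_name, gpu_count, name="gpu"):
    """Build one gres.conf line per GPU device file.

    Assumes the device files are /dev/nvidia0 .. /dev/nvidia<gpu_count - 1>.
    """
    return [
        f"NodeName={node_name} Name={name} File=/dev/nvidia{index}"
        for index in range(gpu_count)
    ]


if __name__ == "__main__":
    # One GPU reproduces the single line used in this post.
    for line in gres_conf_lines("gpunode01", 1):
        print(line)
```

The `gpu_count` passed here should match the count given in `Gres=gpu:N` in slurm.conf.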



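3 Disable the firewall and test
To use the GPU node, the firewall must be stopped: `systemctl stop firewalld`

It is also best to flush and stop iptables:  `iptables -t nat -F`  `iptables -t filter -F` `systemctl stop iptables`

It is best to disable SELinux as well: run `vi /etc/selinux/config` and set SELINUX to disabled.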
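One way to check that nothing is still blocking traffic between the nodes is to try a TCP connection to the Slurm daemons. This is only a sketch: it assumes the default ports (6817 for slurmctld, 6818 for slurmd) have not been changed in slurm.conf, and `port_open` is a hypothetical helper.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Example usage (run from the management node, with slurmd started on the GPU node):
#   port_open("gpunode01", 6818)   # default slurmd port, an assumption here
# And from the GPU node toward the management node:
#   port_open("<management node address>", 6817)   # default slurmctld port
```

A False result does not prove the firewall is at fault, since the daemon may simply not be running, but a True result rules the firewall out.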



Now the GPU is ready to use.

Here is a GPU test script written in Python with tensorflow-1.14 that you can try:

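import tensorflow as tf

# Pin the ops to the first GPU (TensorFlow 1.x API, which still has tf.Session).
with tf.device('/gpu:0'):
    v1 = tf.constant([1.0, 2.0, 3.0], shape=[3], name='v1')
    v2 = tf.constant([1.0, 2.0, 3.0], shape=[3], name='v2')
    sumV = v1 + v2
    # Run the graph and print the element-wise sum.
    with tf.Session() as sess:
        print(sess.run(sumV))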
 

Run the script with Slurm's srun, using the following command:

srun --gres=gpu:1 python3 test_gpu.py
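If TensorFlow is not installed yet, a lighter check is to look at the environment inside the job. The sketch below assumes Slurm exports `CUDA_VISIBLE_DEVICES` for jobs that were allocated GPUs through GRES, which is how it normally behaves when the device files are listed in gres.conf; `allocated_gpus` and the file name are hypothetical.

```python
import os


def allocated_gpus(env=None):
    """Return the GPU indices listed in CUDA_VISIBLE_DEVICES as strings."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    # An unset or empty variable means no GPU was handed to this process.
    return [item for item in raw.split(",") if item]


if __name__ == "__main__":
    print("GPUs visible to this job:", allocated_gpus())
```

Saved as, say, check_gpu.py, it can be launched the same way as the test script: `srun --gres=gpu:1 python3 check_gpu.py`. Running it without `--gres` should give an empty list if the assumption above holds.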


END
小胖墩