Slurm cluster installation
Published: 2022-06-16

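Environment: three physical machines, all running Ubuntu 18.04 LTS, with hostnames tian-609-06, tian-609-07, and tian-609-08. tian-609-06 serves as both the control node and a compute node; the other two are compute nodes only.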

1. Install munge and slurm-wlm (all machines)

sudo apt install munge slurm-wlm
2. Configure /etc/slurm-llnl/slurm.conf (all machines, identical contents)


# slurm.conf file generated by configurator easy.html. 
# Put this file on all nodes of your cluster. 
# See the slurm.conf man page for more information. 
# 
ControlMachine=tian-609-06 #<YOUR-HOST-NAME> 
#ControlAddr= 
# 
#MailProg=/bin/mail 
MpiDefault=none 
#MpiParams=ports=#-# 
ProctrackType=proctrack/pgid 
ReturnToService=1 
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid 
#SlurmctldPort=6817 
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid 
#SlurmdPort=6818 
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd 
SlurmUser=slurm 
#SlurmdUser=root 
StateSaveLocation=/var/lib/slurm-llnl/slurmctld 
SwitchType=switch/none 
TaskPlugin=task/none 
# 
# 
# TIMERS 
#KillWait=30 
#MinJobAge=300 
#SlurmctldTimeout=120 
#SlurmdTimeout=300 
# 
# 
# SCHEDULING 
FastSchedule=1 
SchedulerType=sched/builtin 
#SchedulerPort=7321 
SelectType=select/linear 
# 
# 
# LOGGING AND ACCOUNTING 
AccountingStorageType=accounting_storage/none 
#AccountingStoragePass=/var/run/munge/global.socket.2 
ClusterName=workstation #<YOUR-HOST-NAME> 
#JobAcctGatherFrequency=30 
JobAcctGatherType=jobacct_gather/none 
#SlurmctldDebug=3 
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log 
#SlurmdDebug=4 
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log 
# 
# 
# COMPUTE NODES 
NodeName=tian-609-06,tian-609-07,tian-609-08 CPUs=48 Sockets=2 CoresPerSocket=12 RealMemory=257731 ThreadsPerCore=2 State=IDLE
PartitionName=debug Nodes=tian-609-06,tian-609-07,tian-609-08 Default=YES MaxTime=INFINITE State=UP
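
In this configuration CPUs matches Sockets × CoresPerSocket × ThreadsPerCore (2 × 12 × 2 = 48). A quick sanity check for that relationship can be sketched as follows. `check_node_cpus` is a hypothetical helper name, not a Slurm tool, and the sketch assumes all four fields appear on each NodeName line.

```shell
#!/bin/sh
# check_node_cpus (hypothetical helper, not part of Slurm): read
# slurm.conf-style text on stdin and, for every NodeName line, compare CPUs
# with Sockets * CoresPerSocket * ThreadsPerCore.
# Assumption: all four fields are present on each NodeName line.
check_node_cpus() {
    awk '
    BEGIN { bad = 0 }
    /^NodeName=/ {
        cpus = sockets = cores = threads = 0
        for (i = 2; i <= NF; i++) {
            split($i, kv, "=")
            if (kv[1] == "CPUs")           cpus    = kv[2]
            if (kv[1] == "Sockets")        sockets = kv[2]
            if (kv[1] == "CoresPerSocket") cores   = kv[2]
            if (kv[1] == "ThreadsPerCore") threads = kv[2]
        }
        if ((cpus + 0) == sockets * cores * threads) {
            print "ok: " $1
        } else {
            print "mismatch: " $1
            bad = 1
        }
    }
    END { exit bad }
    '
}

# Example: the NodeName line from the configuration above.
check_node_cpus <<'EOF'
NodeName=tian-609-06,tian-609-07,tian-609-08 CPUs=48 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 State=IDLE
EOF
# prints: ok: NodeName=tian-609-06,tian-609-07,tian-609-08
```

On a real node the function could be fed the file directly, for example `check_node_cpus < /etc/slurm-llnl/slurm.conf`. Running `slurmd -C` on a node prints the hardware values that slurmd itself detects, which is a useful cross-check for the numbers in the NodeName line.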

3. Add the hostname and IP address of every node to /etc/hosts (all machines)
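
Every machine needs to resolve all three hostnames used in slurm.conf. As an illustration, the sketch below reports which node names have no entry in a hosts-format file. `missing_hosts` is a hypothetical helper, and the 192.0.2.x addresses are documentation placeholders, not the real addresses of these machines.

```shell
#!/bin/sh
# missing_hosts FILE NAME... (hypothetical helper): print every NAME that
# does not appear as a hostname or alias in the hosts-format FILE.
missing_hosts() {
    hosts_file=$1
    shift
    for name in "$@"; do
        if ! awk -v n="$name" '
            /^[ \t]*#/ { next }                  # skip comment lines
            {
                for (i = 2; i <= NF; i++) {      # field 1 is the address
                    if ($i ~ /^#/) break         # rest of line is a comment
                    if ($i == n) found = 1
                }
            }
            END { exit !found }
        ' "$hosts_file"; then
            echo "$name"
        fi
    done
}

# Example with placeholder addresses; on a real node pass /etc/hosts instead.
sample=$(mktemp)
cat > "$sample" <<'EOF'
127.0.0.1   localhost
192.0.2.6   tian-609-06
192.0.2.7   tian-609-07
EOF
missing_hosts "$sample" tian-609-06 tian-609-07 tian-609-08
# prints: tian-609-08
rm -f "$sample"
```

An empty result means every listed name has an entry in the file.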

4. Start Slurm
sudo systemctl enable slurmctld  # control node tian-609-06
sudo service slurmctld start     # control node tian-609-06
sudo systemctl enable slurmd     # compute nodes tian-609-[06-08]
sudo service slurmd start        # compute nodes tian-609-[06-08]

5. Copy /etc/munge/munge.key from the control node to the same directory on the other machines; the file's owner and group are both slurm
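
munge authentication only works when every node holds the same key, so it can be worth confirming that the copies are byte-identical. The following is a minimal sketch: `keys_match` is a hypothetical helper that reads `sha256sum`-style lines and reports any digest that differs from the first. The demo uses throwaway files rather than a real key.

```shell
#!/bin/sh
# keys_match (hypothetical helper): read "<digest>  <label>" lines, as printed
# by sha256sum, and succeed only if every line carries the same digest.
keys_match() {
    awk '
        NR == 1      { first = $1 }
        $1 != first  { print "differs: " $2; bad = 1 }
        END          { exit (NR == 0 || bad) }
    '
}

# Demo with throwaway files standing in for copies of the key.
demo=$(mktemp -d)
printf 'not-a-real-key' > "$demo/node-a.key"
cp "$demo/node-a.key" "$demo/node-b.key"
if sha256sum "$demo/node-a.key" "$demo/node-b.key" | keys_match; then
    echo "keys match"
fi
rm -rf "$demo"
```

On the cluster, the input could be gathered by running `sudo sha256sum /etc/munge/munge.key` on each node and collecting the resulting lines.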

6. Start munge
sudo /etc/init.d/munge start  # all nodes

7. Check the Slurm cluster status

> sinfo 
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      3   idle tian-609-[06-08]
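
For scripted checks, the idle count can be pulled out of that table. The sketch below assumes the default six-column sinfo layout shown above and counts only rows whose state is exactly `idle`; states carrying a suffix (for example `idle*`) are deliberately not counted. `idle_node_count` is a hypothetical helper name.

```shell
#!/bin/sh
# idle_node_count (hypothetical helper): sum the NODES column of default
# sinfo output for rows whose STATE is exactly "idle".
idle_node_count() {
    awk 'NR > 1 && $5 == "idle" { total += $4 } END { print total + 0 }'
}

# Example: the sinfo output shown above.
idle_node_count <<'EOF'
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      3   idle tian-609-[06-08]
EOF
# prints: 3
```

On the control node this would be used as `sinfo | idle_node_count`.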

8. Run a test command

> srun -N 3 hostname
tian-609-07
tian-609-08
tian-609-06
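
The three hostnames are not printed in sorted order here, and the order may vary between runs. A small sketch for comparing results regardless of order, with `hosts_reported` as a hypothetical helper name:

```shell
#!/bin/sh
# hosts_reported (hypothetical helper): collapse hostname lines from stdin
# into one sorted, de-duplicated, space-separated line.
hosts_reported() {
    LC_ALL=C sort -u | paste -s -d ' ' -
}

# Example: the srun output shown above.
hosts_reported <<'EOF'
tian-609-07
tian-609-08
tian-609-06
EOF
# prints: tian-609-06 tian-609-07 tian-609-08
```

On the cluster this would be used as `srun -N 3 hostname | hosts_reported`, and the single output line can then be compared against the expected node list.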