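Environment: three physical machines, all running Ubuntu 18.04 LTS, with hostnames tian-609-06, tian-609-07, and tian-609-08. tian-609-06 acts as both the control node and a compute node; the other two are compute nodes only.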
1. Install MUNGE and Slurm (all machines)
sudo apt install munge slurm-wlm
2. Configure /etc/slurm-llnl/slurm.conf (all machines, identical contents)
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=tian-609-06 #<YOUR-HOST-NAME>
#ControlAddr=
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/builtin
#SchedulerPort=7321
SelectType=select/linear
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
#AccountingStoragePass=/var/run/munge/global.socket.2
ClusterName=workstation #<YOUR-HOST-NAME>
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
#SlurmdDebug=4
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#
#
# COMPUTE NODES
NodeName=tian-609-06,tian-609-07,tian-609-08 CPUs=48 Sockets=2 CoresPerSocket=12 RealMemory=257731 ThreadsPerCore=2 State=IDLE
PartitionName=debug Nodes=tian-609-06,tian-609-07,tian-609-08 Default=YES MaxTime=INFINITE State=UP
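
The hardware fields on the NodeName line have to be consistent with each other and with what slurmd detects on the node; a mismatch can leave the node marked as down or drained. The following is a minimal sketch of a consistency check, using the values from the file above. `expected_cpus` is a hypothetical helper name, not part of Slurm.

```shell
#!/bin/sh
# expected_cpus SOCKETS CORES_PER_SOCKET THREADS_PER_CORE
# Hypothetical helper: prints the CPUs value implied by the topology fields.
expected_cpus() {
    echo $(( $1 * $2 * $3 ))
}

# Values taken from the NodeName line above: 2 sockets, 12 cores, 2 threads.
echo "CPUs should be $(expected_cpus 2 12 2)"

# On each node, compare against what slurmd itself detects:
#   slurmd -C
```

`slurmd -C` prints the detected hardware in slurm.conf syntax, so its output can be compared with, or pasted into, the NodeName line.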
3. Add the hostname and IP address of every node to /etc/hosts (all machines)
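
As a rough illustration of step 3, the sketch below shows what the entries might look like and how to check that no node is missing. The 192.0.2.x addresses are placeholders from the documentation range, not the real addresses of these machines, and `missing_hosts` is a hypothetical helper name.

```shell
#!/bin/sh
# Example entries; the IP addresses are placeholders (assumption).
sample_hosts='192.0.2.6 tian-609-06
192.0.2.7 tian-609-07
192.0.2.8 tian-609-08'

# missing_hosts FILE HOST...: print every HOST that has no entry in FILE.
missing_hosts() {
    file=$1
    shift
    for host in "$@"; do
        # Skip comment lines, then look for the name in any column after the IP.
        awk -v h="$host" '
            $1 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
            END { exit !found }' "$file" || echo "$host"
    done
}

# Demonstration against the sample entries; prints nothing when all are present.
tmp=$(mktemp)
printf '%s\n' "$sample_hosts" > "$tmp"
missing_hosts "$tmp" tian-609-06 tian-609-07 tian-609-08
rm -f "$tmp"
```

On a real machine the same check would be `missing_hosts /etc/hosts tian-609-06 tian-609-07 tian-609-08`.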
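4. Start Slurm (Slurm authenticates through MUNGE by default, so on a fresh install complete steps 5 and 6 first)
sudo systemctl enable slurmctld   # control node, tian-609-06
sudo service slurmctld start      # control node, tian-609-06
sudo systemctl enable slurmd      # compute nodes, tian-609-[06-08]
sudo service slurmd start         # compute nodes, tian-609-[06-08]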
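5. Copy /etc/munge/munge.key from the control node to the same directory on the other machines. The file should be owned by the munge user and group (the account munged runs as on Ubuntu), not slurm, and must not be readable by group or others (mode 0400).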
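
A small sketch for checking the copied key, assuming GNU `stat` as shipped with Ubuntu. `key_ok` is a hypothetical helper name; the default owner `munge` reflects the Ubuntu package and is an assumption for other systems.

```shell
#!/bin/sh
# key_ok FILE [OWNER]: succeed if FILE is owned by OWNER (default: munge)
# and is not accessible to group or others (mode 400 or 600).
key_ok() {
    file=$1
    owner=${2:-munge}
    state=$(stat -c '%U %a' "$file" 2>/dev/null)
    case "$state" in
        "$owner 400" | "$owner 600") return 0 ;;
        *) echo "unexpected owner/mode on $file: ${state:-not found}" >&2
           return 1 ;;
    esac
}
```

Run as root on each node, the check would be `key_ok /etc/munge/munge.key`. If it fails, `chown munge:munge /etc/munge/munge.key` followed by `chmod 400 /etc/munge/munge.key` should restore the expected state.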
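6. Start MUNGE (if the service is already running, use restart instead of start so that the copied key is loaded)
sudo /etc/init.d/munge start      # all nodes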
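
Once MUNGE is running everywhere, a credential created on one node should decode on every other node. The sketch below only prints the check commands so they can be reviewed first. `munge_checks` is a hypothetical helper name, and the remote checks assume SSH access from the control node to the other machines.

```shell
#!/bin/sh
# munge_checks HOST...: print one local round-trip check, then one remote
# check per HOST (encode here, decode on HOST over SSH).
munge_checks() {
    echo "munge -n | unmunge"
    for host in "$@"; do
        echo "munge -n | ssh $host unmunge"
    done
}

munge_checks tian-609-07 tian-609-08
```

Each printed command should report a successful decode when the keys match. A failure on a remote host usually points to a different munge.key or a large clock difference between the machines.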
7. Check the state of the Slurm cluster
> sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      3   idle tian-609-[06-08]
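
For scripted checks, the idle node count can be read from the sinfo output. This is a minimal sketch that assumes the default column layout shown above (NODES in column 4, STATE in column 5); `count_idle` is a hypothetical helper name.

```shell
#!/bin/sh
# count_idle: read sinfo output on stdin and print the number of idle nodes.
count_idle() {
    # Skip the header line, then sum NODES for every row whose STATE is idle.
    awk 'NR > 1 && $5 == "idle" { n += $4 } END { print n + 0 }'
}

# Demonstration with the output shown above.
sample='PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 3 idle tian-609-[06-08]'
printf '%s\n' "$sample" | count_idle   # prints 3

# On a live cluster: sinfo | count_idle
# A node stuck in the down state after its problem has been fixed can be
# returned to service with, for example:
#   sudo scontrol update NodeName=tian-609-07 State=RESUME
```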
8. Run a test command
> srun -N 3 hostname
tian-609-07
tian-609-08
tian-609-06
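
The order of the hostnames varies between runs, so an automated version of this test should compare sets rather than lines. A minimal sketch follows; `all_hosts_seen` is a hypothetical helper name.

```shell
#!/bin/sh
# all_hosts_seen HOST...: read hostnames on stdin and succeed only if the
# set of names read equals the expected HOST list, in any order.
all_hosts_seen() {
    expected=$(printf '%s\n' "$@" | sort)
    actual=$(sort -u)
    [ "$expected" = "$actual" ]
}

# Demonstration with the output shown above.
printf 'tian-609-07\ntian-609-08\ntian-609-06\n' |
    all_hosts_seen tian-609-06 tian-609-07 tian-609-08 &&
    echo "all nodes answered"

# On a live cluster:
#   srun -N 3 hostname | all_hosts_seen tian-609-06 tian-609-07 tian-609-08
```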