Installing Slurm and OpenMPI
1. Introduction
- Slurm: A highly scalable cluster management and job scheduling system.
- OpenMPI: A high-performance message passing library.
- OS: Ubuntu 22.04 LTS
- Slurm Version: 21.08
- Shared Directory: /shared_workdir (NFS)
- Nodes:
  - slurmmaster (192.168.120.133): Controller Node
  - slurm1 (192.168.120.131): Compute Node
  - slurm2 (192.168.120.132): Compute Node
2. Common Configuration (All Nodes)
Perform these steps on all nodes (Master and Clients).
2.1 Hostname Resolution
Edit /etc/hosts so that every node can resolve the others by name.

    192.168.120.133 slurmmaster
    192.168.120.131 slurm1
    192.168.120.132 slurm2

2.2 SSH Keys
Generate SSH keys on all nodes. Because the command runs under sudo, the key pair is created for root, which is where section 3.1 collects it from (/root/.ssh/id_rsa.pub).

    sudo ssh-keygen -t rsa

2.3 Enable Root SSH Login
- Set Root Password:

      sudo passwd root

- Configure SSH: Edit /etc/ssh/sshd_config and set:

      PermitRootLogin yes

- Restart SSH:

      sudo systemctl restart sshd
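Before moving on, it can help to confirm that the common configuration took effect. The following is a minimal sketch, not part of the original procedure: `check_hosts` is a hypothetical helper name, and it only checks that each of the three cluster names from section 2.1 appears in a hosts file.

```shell
#!/usr/bin/env bash
# Report any cluster hostname that lacks an entry in a hosts file.
# check_hosts is a hypothetical helper, not a Slurm or system tool.
# Usage: check_hosts [hosts_file]   (defaults to /etc/hosts)
check_hosts() {
    local hosts_file="${1:-/etc/hosts}"
    local missing=0
    local name
    for name in slurmmaster slurm1 slurm2; do
        # Drop comments, then look for the name as a whole field after the address.
        if awk -v n="$name" '
            { sub(/#.*/, "") }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit (found ? 0 : 1) }
        ' "$hosts_file"; then
            echo "ok: $name"
        else
            echo "missing: $name"
            missing=1
        fi
    done
    return "$missing"
}
```

Running `check_hosts` on each node prints one line per hostname and returns a non-zero status if any entry is missing. To confirm the SSH setting, `sudo sshd -T | grep -i permitrootlogin` prints the value that sshd actually uses.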
3. Master Node Configuration (slurmmaster)
Perform these steps only on the Master Node.
3.1 SSH Trust
Exchange SSH keys to allow passwordless login between nodes.

    sudo su
    # Collect public keys
    cp /root/.ssh/id_rsa.pub slurmmaster.pub
    scp root@slurm1:/root/.ssh/id_rsa.pub slurm1.pub
    scp root@slurm2:/root/.ssh/id_rsa.pub slurm2.pub

    # Merge keys
    cat slurm*.pub >> /root/.ssh/authorized_keys

    # Distribute authorized_keys
    scp /root/.ssh/authorized_keys root@slurm1:/root/.ssh/authorized_keys
    scp /root/.ssh/authorized_keys root@slurm2:/root/.ssh/authorized_keys

3.2 NFS Server
Install and configure NFS to share the working directory.

    # Install NFS Server
    sudo apt install nfs-kernel-server

    # Create Shared Directory
    sudo mkdir -p /shared_workdir
    sudo chmod -R 777 /shared_workdir

Edit /etc/exports:

    /shared_workdir 192.168.120.0/24(rw,sync,no_root_squash,no_subtree_check)

Restart NFS:

    sudo service rpcbind restart
    sudo service nfs-kernel-server restart

3.3 Install Slurm & Dependencies
    sudo apt install mariadb-server munge slurmd slurmctld slurmdbd openmpi-bin openmpi-common libopenmpi-dev

3.4 Configure Munge

    # Enable Munge
    sudo systemctl enable munge
    sudo systemctl start munge

    # Distribute Munge Key (Do this AFTER installing munge on clients)
    # sudo scp /etc/munge/munge.key root@slurm1:/etc/munge/
    # sudo scp /etc/munge/munge.key root@slurm2:/etc/munge/

3.5 Configure MariaDB
    sudo systemctl enable mariadb
    sudo systemctl start mariadb

3.6 Configure Slurm
Create configuration files:

    sudo touch /etc/slurm/cgroup.conf
    sudo touch /etc/slurm/slurm.conf
    sudo touch /etc/slurm/slurmdbd.conf
    sudo chmod 600 /etc/slurm/slurmdbd.conf

cgroup.conf
    CgroupAutomount=yes
    ConstrainCores=no
    ConstrainRAMSpace=no

slurmdbd.conf
    AuthType=auth/munge
    AuthInfo=/var/run/munge/munge.socket.2
    DbdAddr=localhost
    DbdHost=localhost
    SlurmUser=root
    DebugLevel=verbose
    LogFile=/var/log/slurm/slurmdbd.log
    PidFile=/var/run/slurmdbd.pid
    StorageType=accounting_storage/mysql
    StoragePass=password
    StorageUser=root
    StorageLoc=slurm_acct_db

slurm.conf (CPU Only Example)
    ClusterName=cluster
    SlurmctldHost=slurmmaster
    MpiDefault=pmi2
    ProctrackType=proctrack/cgroup
    ReturnToService=1
    SlurmctldPidFile=/run/slurmctld.pid
    SlurmdPidFile=/run/slurmd.pid
    SlurmdSpoolDir=/var/lib/slurm/slurmd
    SlurmUser=root
    StateSaveLocation=/var/lib/slurm/slurmctld
    SwitchType=switch/none
    TaskPlugin=task/affinity

    SchedulerType=sched/backfill
    SelectType=select/cons_tres

    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageEnforce=associations,limits,qos
    AccountingStorageHost=localhost
    AccountingStoragePass=/var/run/munge/munge.socket.2
    JobCompHost=localhost
    JobCompLoc=slurm_acct_db
    JobCompPass=password
    JobCompType=jobcomp/mysql
    JobCompUser=root

    SlurmctldLogFile=/var/log/slurm/slurmctld.log
    SlurmdLogFile=/var/log/slurm/slurmd.log

    # COMPUTE NODES
    NodeName=slurm1 CPUs=4 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=3876 Gres=gpu:0 State=UNKNOWN
    NodeName=slurm2 CPUs=4 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=3876 Gres=gpu:0 State=UNKNOWN
    PartitionName=compute Nodes=slurm[1-2] Default=YES MaxTime=INFINITE State=UP

3.7 Start Slurm Services
    sudo systemctl enable slurmdbd
    sudo systemctl start slurmdbd
    sudo systemctl enable slurmctld
    sudo systemctl start slurmctld

4. Client Node Configuration (slurm1, slurm2)
Perform these steps only on Client Nodes.
4.1 NFS Client
    sudo apt install nfs-common
    sudo mkdir -p /shared_workdir
    sudo chmod -R 777 /shared_workdir

Mount NFS share:

    sudo mount -t nfs slurmmaster:/shared_workdir /shared_workdir

Auto-mount on boot (add to /etc/fstab):

    slurmmaster:/shared_workdir /shared_workdir nfs defaults 0 0

4.2 Install Slurm Client
    sudo apt install munge slurm-client slurmd openmpi-bin openmpi-common libopenmpi-dev

4.3 Configure Munge
- Receive Key: Ensure munge.key has been copied from the Master into /etc/munge/ (see section 3.4).
- Restart Munge:

      sudo systemctl restart munge
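Munge authentication only works when every node holds the same key, so comparing key digests across nodes is a quick sanity check. The sketch below is an illustration rather than part of the original steps: `key_digest` is a hypothetical helper name, and the commented example assumes the root SSH trust from section 3.1 is in place.

```shell
#!/usr/bin/env bash
# Print the SHA-256 digest of a key file so it can be compared across nodes.
# key_digest is a hypothetical helper, not part of munge.
# Usage: key_digest [key_file]   (defaults to /etc/munge/munge.key; reading it needs root)
key_digest() {
    local key_file="${1:-/etc/munge/munge.key}"
    sha256sum "$key_file" | awk '{ print $1 }'
}

# Example, run as root on a client:
#   local_digest=$(key_digest)
#   master_digest=$(ssh root@slurmmaster sha256sum /etc/munge/munge.key | awk '{ print $1 }')
#   [ "$local_digest" = "$master_digest" ] && echo "munge keys match"
#
#   munge -n | unmunge    # local encode/decode round trip through the munge daemon
```

If the digests differ, copy the key from the Master again and restart munge on the client.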
4.4 Configure Slurm
Copy slurm.conf and cgroup.conf from Master to /etc/slurm/.
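Once slurm.conf is in place, the NodeName lines can be checked for internal consistency: in the example configuration, CPUs equals Sockets × CoresPerSocket × ThreadsPerCore (2 × 2 × 1 = 4). The following sketch automates that arithmetic. `check_node_cpus` is a hypothetical helper name, and it assumes each NodeName line lists all four values explicitly, as the example in section 3.6 does.

```shell
#!/usr/bin/env bash
# For every NodeName line in a slurm.conf, check that
# CPUs == Sockets * CoresPerSocket * ThreadsPerCore.
# check_node_cpus is a hypothetical helper, not a Slurm command.
# Assumes all four keys are present on each NodeName line.
# Usage: check_node_cpus [conf_file]   (defaults to /etc/slurm/slurm.conf)
check_node_cpus() {
    local conf="${1:-/etc/slurm/slurm.conf}"
    awk '
        /^NodeName=/ {
            split("", v)                      # clear values from the previous line
            for (i = 1; i <= NF; i++) {
                split($i, kv, "=")            # key=value pairs are space separated
                v[kv[1]] = kv[2]
            }
            want = v["Sockets"] * v["CoresPerSocket"] * v["ThreadsPerCore"]
            if ((v["CPUs"] + 0) == want) {
                print "ok: " v["NodeName"]
            } else {
                print "mismatch: " v["NodeName"] " CPUs=" v["CPUs"] " expected=" want
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    ' "$conf"
}
```

Running `check_node_cpus` on a client after copying the file prints one line per node. To compare against the real hardware, `slurmd -C` prints the node configuration that slurmd detects on the local machine.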
If using GPUs, create gres.conf:

    Name=gpu Type=A800 File=/dev/nvidia[0-7]

4.5 Start Slurm Daemon
    sudo systemctl enable slurmd
    sudo systemctl start slurmd
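With slurmd running on both clients, a small test job is a reasonable way to confirm that scheduling and the shared directory work end to end. The sketch below is an illustration under stated assumptions: `write_test_job`, the file name hello.sbatch, and the job name are hypothetical choices, while the partition name and /shared_workdir come from the configuration above.

```shell
#!/usr/bin/env bash
# Write a two-node test job script into a directory and print its path.
# write_test_job and hello.sbatch are illustrative names, not Slurm conventions.
# Usage: write_test_job [dir]   (defaults to /shared_workdir)
write_test_job() {
    local dir="${1:-/shared_workdir}"
    cat > "$dir/hello.sbatch" <<EOF
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --partition=compute
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --output=$dir/hello-%j.out
# Each task prints the name of the node it landed on.
srun hostname
EOF
    echo "$dir/hello.sbatch"
}

# Example, run on slurmmaster once slurmd is up on both clients:
#   sinfo                              # both nodes should appear in the compute partition
#   sbatch "$(write_test_job)"         # submit the test job
#   squeue                             # watch the job while it is queued or running
#   cat /shared_workdir/hello-*.out    # expect one hostname line per compute node
```

Writing the output to the shared directory means the result file is visible from every node, which also exercises the NFS mount from section 4.1.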