Build a Slurm Desktop Cluster
December 17, 2024
Desktop Abaqus Slurm Cluster
This is a guide to create a Slurm Cluster using NUC computers to allow for development and testing of Abaqus simulations with a desktop form factor.
Hardware Overview
The hardware stack we are using is:
- Head Node - GMKtec Mini PC, Intel N100 (3.4GHz) 16GB 1TB M2 View on Amazon
- Compute Nodes - 3 Geekom XT12 Pro Mini PCs 12th Gen Intel i9-12900H 14 cores - 32GB. View on Amazon
- D-Link 5-Port 2.5GB Unmanaged Gaming Switch View on Amazon
- GeeekPi 8U Server Rack - View on Amazon
Here's the build:
Software Overview
The configuration is:
- Operating system - Ubuntu 24.04
- Slurm - Version
- Abaqus - Version 2023HF3
- OpenMPI -
- Fortran -
Install Steps
- Install Ubuntu
- Configure Networking
- Install Munge
- Create NFS shares
- Install Slurm
- Install Slurm REST API
- Install Abaqus Prerequisites
- Install Abaqus
- Build Test Code
Install Ubuntu
The link below has the steps to create a bootable USB drive for installing Ubuntu 24.04 from Windows, Linux, or macOS. We are using the desktop version.
Create a Ubuntu bootable USB stick
Configure Networking
On the Head node and all the Compute nodes execute the following commands:
Install the OpenSSH server:
sudo apt install openssh-server
Verify it's running:
sudo systemctl status ssh
If it's not running, start it:
sudo systemctl start ssh
Set it to autostart:
sudo systemctl enable ssh
This step is optional, but I prefer to install xrdp on the nodes for a graphical GNOME desktop.
The steps are the same as for SSH:
sudo apt install xrdp
sudo systemctl status xrdp
sudo systemctl start xrdp
sudo systemctl enable xrdp
Configure the ethernet interface for a private network:
I chose 192.168.110.0/24 as my network. Here are the details to plug into Ubuntu's manual IPv4 settings for the interface:
Address: 192.168.110.1 through 192.168.110.4 (one address per node)
Netmask: 255.255.255.0
Gateway: 192.168.110.255
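If you prefer to do this from the command line instead of the GNOME network settings dialog, nmcli can set the same values. This is a minimal sketch: the connection name "Wired connection 1" is an assumption, so check yours first and substitute the address assigned to that node.
nmcli connection show
sudo nmcli connection modify "Wired connection 1" ipv4.method manual ipv4.addresses 192.168.110.2/24
sudo nmcli connection up "Wired connection 1"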
On each node, add entries like these to the /etc/hosts file:
192.168.110.1 headnode
192.168.110.2 node1
192.168.110.3 node2
192.168.110.4 node3
At this point you should be able to ssh to and from each node.
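Optionally, key-based SSH makes hopping between nodes (and copying files like the munge key later) less tedious. A quick sketch, run from the head node; <USER> is your login account and the node names are the ones added to /etc/hosts above:
ssh-keygen -t ed25519            # accept the defaults
ssh-copy-id <USER>@node1         # repeat for node2 and node3
ssh node1 hostname               # should print node1 with no password prompt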
Configure Munge
Steps:
Install Munge packages:
sudo apt install munge libmunge2 libmunge-dev
Verify the Munge installation; this should return "Success":
munge -n | unmunge | grep STATUS
If verification wasn't successful, check that the munge key was created:
ls -la /etc/munge/munge.key
If the key doesn't exist create it:
sudo /usr/sbin/mungekey
Set permissions for the munge user:
sudo chown -R munge: /etc/munge/ /var/log/munge/ /var/lib/munge/ /run/munge/
sudo chmod 0700 /etc/munge/ /var/log/munge/ /var/lib/munge/
sudo chmod 0755 /run/munge/
sudo chmod 0700 /etc/munge/munge.key
sudo chown -R munge: /etc/munge/munge.key
Restart the munge service:
sudo systemctl enable munge
sudo systemctl restart munge
Check the munge server status:
systemctl status munge
Once Munge is installed on all nodes, copy the munge key from the head node to each worker node. Since /etc/munge on the workers is only writable by root, copy it to a temporary location first:
sudo scp /etc/munge/munge.key <USER>@node1:/tmp/munge.key
Then, on each worker node, move it into place and restore the ownership and permissions shown above:
sudo mv /tmp/munge.key /etc/munge/munge.key
sudo chown munge: /etc/munge/munge.key
sudo chmod 0600 /etc/munge/munge.key
Restart the munge service once the keys are copied:
sudo systemctl enable munge
sudo systemctl restart munge
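To confirm the keys match across nodes, generate a credential on the head node and decode it on a worker over SSH; a mismatched key will show a decode failure instead of Success:
munge -n | ssh node1 unmunge | grep STATUS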
Create NFS Shares
The head node will share the file systems. I created two directories to export:
- /apps - shared software
- /jobs - job output
sudo mkdir /apps
sudo mkdir /jobs
sudo chmod a+rwx /apps
sudo chmod a+rwx /jobs
Install the NFS server packages:
sudo apt install nfs-kernel-server
sudo systemctl start nfs-kernel-server.service
sudo systemctl enable nfs-kernel-server.service
Update the /etc/exports file with the shares:
/jobs 192.168.110.0/24(rw,sync,no_root_squash)
/apps 192.168.110.0/24(rw,sync,no_root_squash)
Export the file systems:
sudo exportfs -a
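You can confirm the exports are active on the head node by listing them:
sudo exportfs -v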
On the Compute nodes add and mount the NFS shares from the Head Node.
Install the NFS client software:
sudo apt install nfs-common
Create the directories:
- /apps - shared software
- /jobs - job output
sudo mkdir /apps
sudo mkdir /jobs
sudo chmod a+rwx /apps
sudo chmod a+rwx /jobs
Update the /etc/fstab file for the new nfs mounts:
sudo cp /etc/fstab /etc/fstab.bkp
sudo vi /etc/fstab
Add the lines:
headnode:/jobs /jobs nfs defaults 0 0
headnode:/apps /apps nfs defaults 0 0
Reload the daemon with the new fstab:
sudo systemctl daemon-reload
Mount the file systems:
sudo mount /apps
sudo mount /jobs
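To verify the shares from a compute node, check that both file systems are mounted and that a file created on one node is visible on the others; the test file name here is just an example:
df -h /apps /jobs                # both NFS mounts should be listed
touch /jobs/nfs-test-$(hostname) # the file should also appear on the head node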
Install Slurm
On the head node and all compute nodes, install Slurm:
sudo apt install slurm-wlm
Configure Slurm:
The slurm configuration tool is located at:
/usr/share/doc/slurmctld/slurm-wlm-configurator.html
Open the file with a web browser and fill out the following:
- ClusterName:
<CLUSTER-NAME>
- SlurmctldHost:
<HEAD-NODE-NAME>
- NodeName:
<COMPUTE-NODE-NAME>[1-N]
- Enter values for CPUs, Sockets, CoresPerSocket, and ThreadsPerCore according to $ lscpu.
- ProctrackType:
LinuxProc
Once completed, copy the generated text from the browser into the file below on all nodes:
/etc/slurm/slurm.conf
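For reference, here is a rough sketch of how the key lines in slurm.conf might end up looking for this build. The node names, memory, and CPU counts are assumptions for illustration only; use the values reported by lscpu on your compute nodes and the text generated by the configurator:
ClusterName=<CLUSTER-NAME>
SlurmctldHost=headnode
ProctrackType=proctrack/linuxproc
NodeName=node[1-3] CPUs=14 RealMemory=31000 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP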
Restart Slurm on all nodes. On the head node, enable and restart the controller daemon:
sudo systemctl enable slurmctld
sudo systemctl restart slurmctld
On each compute node, enable and restart the compute daemon:
sudo systemctl enable slurmd
sudo systemctl restart slurmd
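With the daemons running on every node, a quick sanity check from the head node is to list the partition state and run a trivial job across all three compute nodes:
sinfo                            # compute nodes should report state "idle"
srun -N3 hostname                # should print node1, node2 and node3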
Install Slurm REST API
The Slurm REST API provides an HTTP endpoint for submitting jobs, listing running jobs, and issuing other Slurm commands.
Installing the API server consists of the following components:
- Database - I installed MySQL.
- slurmdbd - the Slurm Database Daemon.
- slurmrestd - the REST API interface to Slurm.
Configure Slurm JWT support by creating a signing key (create the state save directory first if it doesn't already exist):
sudo mkdir -p /var/spool/slurm/statesave
sudo dd if=/dev/random of=/var/spool/slurm/statesave/jwt_hs256.key bs=32 count=1
sudo chown slurm:slurm /var/spool/slurm/statesave/jwt_hs256.key
sudo chmod 0600 /var/spool/slurm/statesave/jwt_hs256.key
sudo chown slurm:slurm /var/spool/slurm/statesave
sudo chmod 0755 /var/spool/slurm/statesave
Update both the slurm.conf and slurmdbd.conf and add:
AuthAltTypes=auth/jwt
AuthAltParameters=jwt_key=/var/spool/slurm/statesave/jwt_hs256.key
To create a user token:
scontrol token username=$USER lifespan=$LIFESPAN
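Since scontrol prints the token in the form SLURM_JWT=<token>, a convenient pattern (once slurmctld has been restarted with the JWT settings) is to export it straight into the shell environment; the one-hour lifespan here is just an example:
unset SLURM_JWT
export $(scontrol token lifespan=3600)
echo $SLURM_JWT                  # this value is passed to the REST API later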
Install and configure MySQL:
sudo apt install mysql-server
Open a MySQL session as root and create the Slurm database user (on MySQL 8 the user must be created before granting privileges; the password must match StoragePass in slurmdbd.conf below):
sudo mysql
create user 'slurm'@'localhost' identified by 'some_pass';
grant all on slurm_acct_db.* to 'slurm'@'localhost' with grant option;
Install and configure slurmdbd:
sudo apt install slurmdbd
Create the /etc/slurm/slurmdbd.conf file; this is a copy of the one I used:
cat /etc/slurm/slurmdbd.conf
AuthType=auth/munge
AuthInfo=/var/run/munge/munge.socket.2
AuthAltTypes=auth/jwt
AuthAltParameters=jwt_key=/var/spool/slurm/statesave/jwt_hs256.key
DbdHost=localhost
DebugLevel=debug5
StorageHost=localhost
StorageLoc=slurm_acct_db
StoragePass=SLURM-USER-MYSQL-PASSWORD
StorageType=accounting_storage/mysql
StoragePort=3306
StorageUser=slurm
LogFile=/opt/slurm/log/slurmdbd.log
PidFile=/tmp/slurmdbd.pid
SlurmUser=slurm
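Recent Slurm releases refuse to start slurmdbd if slurmdbd.conf is readable by anyone other than the slurm user, so tighten the ownership and permissions after creating the file:
sudo chown slurm:slurm /etc/slurm/slurmdbd.conf
sudo chmod 600 /etc/slurm/slurmdbd.conf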
Update the /etc/slurm/slurm.conf file and add:
AccountingStorageHost=localhost
Note: Do not enter the Slurm MySQL password here; slurmdbd reads it from slurmdbd.conf:
#AccountingStoragePass=
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
Start the Slurm database daemon:
sudo systemctl start slurmdbd
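With slurmdbd running, register the cluster in the accounting database; the name must match the ClusterName set in slurm.conf:
sudo sacctmgr add cluster <CLUSTER-NAME>
sacctmgr show cluster            # verify the cluster was added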
Install the Slurm REST API daemon:
sudo apt install slurmrestd
The REST daemon has to run as a non-privileged user, so create one:
sudo adduser slurmrest
Edit the slurmrestd service and set it to run as the slurmrest user:
sudo vi /usr/lib/systemd/system/slurmrestd.service
Update the following:
# an unprivileged user to run as.
User=slurmrest
Group=slurmrest
Environment="SLURM_JWT=daemon"
Start the Rest service:
sudo systemctl start slurmrestd
Restart the Slurm controller service:
sudo systemctl daemon-reload
sudo systemctl restart slurmctld
Copy the updated /etc/slurm/slurm.conf to all of the compute nodes and restart slurmd on each node.
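Finally, to confirm the REST API end to end, call the ping endpoint with the JWT token generated earlier. This is a sketch: it assumes slurmrestd is listening on TCP port 6820 (check the ExecStart line in the service file edited above) and that your build provides the v0.0.40 API version, so adjust the URL to match your Slurm release.
curl -H "X-SLURM-USER-NAME: $USER" -H "X-SLURM-USER-TOKEN: $SLURM_JWT" http://localhost:6820/slurm/v0.0.40/ping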