3 Running experiments through a job system
Note
In this tutorial, we will see how to run experiments through a job allocation system. It is divided into two parts, depending on the cluster configuration:
- The case of the HTCondor (https://htcondor.readthedocs.io/en/latest/) system
- The case of the SLURM (https://slurm.schedmd.com/documentation.html) system
Note
The code of this tutorial is available at: TODO
3.1 Objectives
In this tutorial, we will run the experiments of the previous tutorial (2 Experiments, Datasets and description files) through a job allocation system. We will see how to take the existing Qanat project structure and run the experiments on a cluster.
3.2 Setting up the project and environment
We assume that you have already completed the previous tutorial (2 Experiments, Datasets and description files), or downloaded its code, and set up the Qanat project. If you did so on a local machine that does not have a job system, you need to copy the project over to the job submission server or redo those steps on that machine.
The problem that needs addressing is to have a working Python installation on both the job submission server and the job execution server. The job submission server is responsible for submitting the jobs to the job execution server and needs Qanat installed; the job execution server is responsible for running the experiments and must be able to execute the experiment scripts.
A useful workaround is to recast the experiment executable as a bash script (sketched after this list) that does the following:
- Set up the python environment. Depending on your setup:
  - If you are using conda, you can use the conda activate command to activate an environment stored on a filesystem shared between the job submission server and the job execution server.
  - If you have a working python executable somewhere on the shared filesystem, you can use it to run the experiment by specifying the full path to that python executable.
- Run the python script of the experiment, forwarding the arguments to it. This is done with the $@ variable in bash and is very important, since the python script needs to access the --storage_path and --dataset_path options.
In this case, you need to update the experiment executable and execute command with qanat experiment update summary_iris.
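For example, a minimal wrapper could look like the following sketch, assuming a conda environment named iris_env installed under /shared/miniconda3 (both names are illustrative and depend on your setup):

#!/bin/bash
# run_iris.sh - wrapper around the experiment script.
# The conda installation path and the environment name below are
# illustrative; adapt them to your own shared filesystem layout.
source /shared/miniconda3/etc/profile.d/conda.sh
conda activate iris_env

# Forward all arguments ("$@") so that the --storage_path and
# --dataset_path options provided by Qanat reach the python script.
python experiments/summary_statistics/iris.py "$@"

You would then set this script as the experiment executable and bash as the execute command.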
Another approach is to use a containerized environment. This is more robust but requires more setup; we will not cover it in this tutorial. See (TODO-ADD-LINK) for more information.
Warning
We assume in Qanat that the current working directory is available on the job execution server, because the experiment executable is run from the current working directory. This is a limitation of the package.
3.3 HTCondor
Usually, when submitting a job through HTCondor, you need to write a submit description file (jobname.submit) that describes the job. This file contains the following information:
- The executable that will be run
- The arguments that will be passed to the executable
- The environment variables that will be set
- The input and output files that will be used
- The resources that will be requested
- The queue that will be used
See (https://htcondor.readthedocs.io/en/latest/users-manual/submitting-a-job.html) for more information.
Qanat takes a similar approach but ditches the need for a submit description file per executable. Instead, we use a YAML description file as a template containing the relevant keywords (resources, groups, etc.). When running, Qanat parses those descriptors and submits a job with the right executable and arguments for you, thanks to the Python bindings of HTCondor.
For example, the default template (when no template is specified for a job) is the following:
+WishedAcctGroup: group_usmb.listic
getenv: 'true'
request_cpus: 1
request_disk: 1GB
request_gpus: 0
request_memory: 1GB
universe: vanilla
Note
The +WishedAcctGroup is a custom keyword used to specify the accounting group of the job; the job allocation system uses it to determine the group under which the job runs. The default group is the one I have in the MUST datacenter of Université Savoie Mont-Blanc; you will need to change it in .qanat/config.yaml according to your needs.
In order to run the experiment through HTCondor, we need two things:
- specify the runner as htcondor when running the experiment with the qanat run command;
- specify the template that will be used for the job, by passing the --submit_template option at the end of the qanat run command. If no template is specified, the default one shown above is used.
A command to run the experiment through HTCondor would be:
qanat run --runner htcondor summary_iris --submit_template htcondor_template.yaml
Another approach, to have several templates without maintaining separate template files, is to put them in the Qanat configuration file .qanat/config.yaml, which looks like:
default_editor: vim
htcondor:
  default:
    +WishedAcctGroup: '"group_usmb.listic"'
    getenv: 'true'
    request_cpus: 1
    request_disk: 1GB
    request_gpus: 0
    request_memory: 1GB
    universe: vanilla
logging: INFO
result_dir: results
slurm:
  default:
    --cpus-per-task: 1
    --ntasks: 1
    --time: 1-00:00:00
You can edit the file to add your own templates. For example, to add a template for a job that uses 2 CPUs and 2GB of memory, you can add the following lines:
htcondor:
  default:
    +WishedAcctGroup: '"group_usmb.listic"'
    getenv: 'true'
    request_cpus: 1
    request_disk: 1GB
    request_gpus: 0
    request_memory: 1GB
    universe: vanilla
  two_cpus:
    +WishedAcctGroup: '"group_usmb.listic"'
    getenv: 'true'
    request_cpus: 2
    request_disk: 1GB
    request_gpus: 0
    request_memory: 2GB
    universe: vanilla
Note
The getenv option is used to make sure that the environment variables are forwarded to the job execution server. This allows the python environment set up on the job submission server to be used on the job execution server.
Notice also the '"..."' around the group name: the value of +WishedAcctGroup must itself be a quoted string, so the double quotes have to be kept inside the YAML string. If you omit them, the group name will not be interpreted correctly.
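If you want to double-check how the value is parsed, you can inspect it with PyYAML from the project root (assuming PyYAML is installed in your environment):

python -c "import yaml; print(yaml.safe_load(open('.qanat/config.yaml'))['htcondor']['default']['+WishedAcctGroup'])"

This should print "group_usmb.listic" with the double quotes included.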
Then you can run the experiment with the following command:
qanat run --runner htcondor summary_iris --submit_template two_cpus
If you manage to configure this, you will be able to launch the experiment and see output similar to the following:
[13:17:55] INFO Run 4 created. run.py:1078
INFO Setting up the run... run.py:1179
INFO Single group of parameters detected runs.py:209
INFO Creating /mustfs/MUST-DATA/listic/amian/iris_mnist/results/summary_iris/run_4 runs.py:210
INFO Running the experiment... run.py:1188
INFO Submitting job for command python experiments/summary_statistics/iris.py --storage_path /mustfs/MUST-DATA/listic/amian/iris_mnist/results/summary_iris/run_4 runs.py:806
--dataset_path /mustfs/MUST-DATA/listic/amian/iris_mnist/data/iris
INFO Jobs submitted to clusters runs.py:829
INFO - 4651
You can check that the job has been submitted by running the following command:
condor_q
-- Schedd: lappusmb7a.in2p3.fr : <134.158.84.226:9618?... @ 06/29/23 13:18:02
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
ammarmian summary_iris_4 6/29 13:17 _ 1 _ 1 4651.0
Total for query: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
Total for ammarmian: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
Total for all users: 3 jobs; 0 completed, 0 removed, 0 idle, 3 running, 0 held, 0 suspended
You can also check the status of the run with the following command:
> qanat experiment status summary_iris
🔖 Name: summary_iris
💬 Description: Summary statistics on IRIS dataset
📁 Path: experiments/summary_statistics
💾 Datasets:['iris']
⚙ Executable: experiments/summary_statistics/iris.py
⚙ Execute command: python
⏳ Number of runs: 3
🏷 Tags: ['First-order', 'Histograms', 'Correlation', 'Statistics']
🛠 Actions:
- plot: Plot summary statistics about the dataset
⏳ Runs:
🆔 ID 💬 Description 📁 Path 🖥 Runner 📆 Launch date ⏱ Duration 🔍 Status 🏷 Tags ⏳ Progress
4 results/summary_iris/run_4 htcondor 2023-06-29 13:17:58 0:00:11.333462 ▶
1 results/summary_iris/run_1 local 2023-06-29 10:51:06.260540 0:00:00.319506 🏁
3 results/summary_iris/run_3 htcondor 2023-06-29 11:35:04 0:00:14 🏁
Once the job is finished, you can check the results in the results/summary_iris/run_4 directory. You can also explore the run through a prompt with:
> qanat experiment run_explore summary_iris 4
Run 4 of experiment summary_iris informations:
- 🆔 Id: 4
- 💬 description:
- 🏷 Tags
- 🖥 Runner: htcondor
- 📓 Runner parameters:
◾ --submit_template: default
- 📁 Path: results/summary_iris/run_4
- 🔍 Status: 🏁
- 📆 Start time: 2023-06-29 13:17:58
- 📆 End time: 2023-06-29 13:18:15
- 📑 Commit: eac4a826bbbfc2700f4dd2f860acd802f62bd5b6
Run 4 of experiment summary_iris - Explore menu
> [a] Show output(s)
[b] Show error(s)
[c] Show parameters
[d] Show comment
[e] Explore run directory
[f] Show HTCondor log(s)
[g] Delete run
[h] Action: plot
┌── preview ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Show output(s) of the run with less │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Note
You can see that an option has been added to the menu: Show HTCondor log(s), which allows you to see the logs of the job. When more than one job is submitted, you can see the logs of all the jobs, for each separate command used.
Note
In this tutorial we kept things simple with a script that uses no parameters, but you can specify them after the experiment_name in the qanat run command as usual. You can also define groups of parameters and ranges over options. This is of course the point: being able to run the same experiment over a grid of parameters executed across several machines. For more information, see (TODO).
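For instance, if the experiment script accepted a hypothetical --n_bins option, the parameter would simply be appended after the experiment name:

qanat run --runner htcondor summary_iris --n_bins 10

Here --n_bins is not an actual option of iris.py; it only illustrates where script parameters go on the command line.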
3.4 SLURM
SLURM is a job scheduler used on many clusters. Its use through Qanat is very similar to HTCondor's, and the configuration follows the same pattern.
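Assuming the runner is simply named slurm, matching the slurm section of the configuration file shown earlier (this name is an assumption, not confirmed by this tutorial), the invocation should mirror the HTCondor one:

qanat run --runner slurm summary_iris --submit_template default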
TODO