Submitting and running jobs

More time-consuming runs are usually executed on a remote cluster, via a job queue. A good configuration for MCMC is often 4-6 chains, each using several cores (e.g. for OpenMP threading in cosmology codes), for a total running time of 4-24 hours. Cobaya has a convenient script to produce, submit and manage job submission scripts, which can be adapted for different systems. Once configured you can do:

cobaya-run-job --queue regular --walltime 12:00:00 [yaml_file].yaml

This produces a job script for your specific yaml_file, and then submits it to the queue using default settings for your cluster.
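If you want to check the generated script before anything is submitted, the --no_sub option (see Optional arguments below) produces the script without sending it to the queue:

# generate (but do not submit) the job script, for checking
cobaya-run-job --no_sub --queue regular --walltime 12:00:00 [yaml_file].yaml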

To do this, it loads a template for your job submission script. This template can be specified via a command line argument, e.g.

cobaya-run-job --walltime 12:00:00 --job-template job_script_NERSC [yaml_file].yaml

However, it is usually more convenient to set an environment variable on each cluster that you use, so that the appropriate job script template is used automatically. You can then submit jobs on different clusters with the same commands, without worrying about local differences. To set the environment variable, put in your .bashrc (or equivalent):

export COBAYA_job_template=/path/to/my_cluster_job_script_template
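If you share one .bashrc between clusters, a minimal sketch that selects the right template per machine (the hostname patterns and template paths here are hypothetical):

# pick a job script template based on which cluster we are logged into
# (hostname patterns and template paths below are hypothetical)
case "$(hostname -f)" in
  *nersc.gov)     export COBAYA_job_template="$HOME/templates/job_script_NERSC" ;;
  *hpc.cam.ac.uk) export COBAYA_job_template="$HOME/templates/job_script_CSD3" ;;
esac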

The job script templates are also used by grids, which can be used to manage running a batch of jobs at once; see the sketch below.
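For example, a grid of runs can be created and then submitted using the same template (a sketch: grid_dir and grid_settings.py are hypothetical names, and the queue options are assumed to mirror those of cobaya-run-job; check cobaya-grid-run -h):

# create a grid of runs from a settings file, then submit the whole batch
cobaya-grid-create grid_dir grid_settings.py
cobaya-grid-run grid_dir --queue regular --walltime 12:00:00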

Job script templates

These are essentially queue submission scripts with variable values replaced by {placeholder}s. There are also lines specifying default settings for the different cobaya-run-job options. For example, for NERSC the template might be:

#!/bin/bash
#SBATCH -C haswell
#SBATCH --nodes={NUMNODES}
#SBATCH --qos={QUEUE}
#SBATCH --time={WALLTIME}
#SBATCH --job-name={JOBNAME}

#OpenMP settings:
export OMP_NUM_THREADS={OMP}
export OMP_PLACES=threads
export OMP_PROC_BIND=spread

cd {ROOTDIR}
{COMMAND}
wait

###set things to be used by the python script, which extracts text from here with ##XX: ... ##
### command to use for each run in the batch
##RUN: time srun -n {NUMMPI} -c {OMP} --cpu_bind=cores {PROGRAM} {INI} > {JOBSCRIPTDIR}/{INIBASE}.log 2>&1 ##
### defaults for this script
##DEFAULT_qsub: sbatch ##
##DEFAULT_qdel: scancel ##
##DEFAULT_cores_per_node: 16 ##
##DEFAULT_chains_per_node: 4 ##
##DEFAULT_program: cobaya-run -r ##
##DEFAULT_walltime: 8:00:00##
##DEFAULT_queue: regular##

Here each word in {} braces is replaced with a value taken (or computed) from your cobaya-run-job arguments. The ##RUN line specifies the actual command to run; if there is more than one run per job, it may be used multiple times in the generated script file.

The lines starting ## are used to define default settings for jobs, in this case four chains per node, each using four cores (which does not use a complete NERSC node).
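For instance, with these defaults and a hypothetical input file planck_lcdm.yaml, the ##RUN line would expand to something like:

# hypothetical expansion of ##RUN for planck_lcdm.yaml (4 chains x 4 threads)
time srun -n 4 -c 4 --cpu_bind=cores cobaya-run -r planck_lcdm.yaml > scripts/planck_lcdm.log 2>&1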

You can see some sample templates for different queue systems supplied with the Cobaya source code.

The available placeholder variables are:

JOBNAME          name of job, from yaml file name
QUEUE            queue name
WALLTIME         running time (e.g. 12:00:00 for 12 hrs)
NUMNODES         number of nodes
OMP              number of OpenMP threads per chain (one chain per MPI process)
CHAINSPERNODE    number of chains per node for each run
NUMRUNS          number of runs in each job
NUMTASKS         total number of chains (NUMMPI * NUMRUNS)
NUMMPI           total number of MPI processes per run (= total number of chains per run)
MPIPERNODE       total number of MPI processes per node (CHAINSPERNODE * NUMRUNS)
PPN              total cores per node (CHAINSPERNODE * NUMRUNS * OMP)
NUMSLOTS         total number of cores on all nodes (PPN * NUMNODES)
MEM_MB           memory requirement
JOBCLASS         job class name
ROOTDIR          directory of invocation
JOBSCRIPTDIR     directory of the generated job submission script file
ONERUN           zero if only one run at a time (one yaml or multiple yaml run sequentially)
PROGRAM          name of the program to run (cobaya-run) [can be changed with the --program option, e.g. to change cobaya's optional run arguments]
COMMAND          substituted by the command(s) that actually runs the job, calculated from ##RUN

The ##RUN line in the template has the additional placeholders INI and INIBASE, which are substituted by the name of the input yaml file, and the base name (without .yaml) respectively.

Optional arguments

You can change various arguments when submitting jobs; running cobaya-run-job -h gives the details:

usage: cobaya-run-job [-h] [--nodes NODES] [--chains-per-node CHAINS_PER_NODE]
                      [--cores-per-node CORES_PER_NODE]
                      [--mem-per-node MEM_PER_NODE] [--walltime WALLTIME]
                      [--job-template JOB_TEMPLATE] [--program PROGRAM]
                      [--queue QUEUE] [--jobclass JOBCLASS] [--qsub QSUB]
                      [--dryrun] [--no_sub]
                      input_file [input_file ...]

Submit a single job to queue

positional arguments:
  input_file

optional arguments:
  -h, --help            show this help message and exit
  --nodes NODES
  --chains-per-node CHAINS_PER_NODE
  --cores-per-node CORES_PER_NODE
  --mem-per-node MEM_PER_NODE
                        Memory in MB per node
  --walltime WALLTIME
  --job-template JOB_TEMPLATE
                        template file for the job submission script
  --program PROGRAM     actual program to run (default: cobaya-run -r)
  --queue QUEUE         name of queue to submit to
  --jobclass JOBCLASS   any class name of the job
  --qsub QSUB           option to change qsub command to something else
  --dryrun              just test configuration and give summary for checking,
                        don't produce or do anything
  --no_sub              produce job script but don't actually submit it

To set default value yy for option xx in your job script template, add a line:

##DEFAULT_xx: yy ##
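For example, to make a 24-hour wall time the default for a particular cluster's template (a hypothetical choice; any of the options above can be given a default this way):

##DEFAULT_walltime: 24:00:00 ##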

Job control

When you use cobaya-run-job, it stores the details of that job, and of all other jobs started from the same directory, in a pickle file in your ./scripts directory (along with the generated and submitted job submission script). These details are used by two additional utility scripts. cobaya-running-jobs lists queued jobs, with optional filtering on whether they are actually running or still queued.

Use cobaya-delete-jobs to delete a job corresponding to a given input yaml base name, or to delete a range of job ids.
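As a sketch of typical usage (the filter flag and the way jobs are selected are assumptions here; check each script's -h output for the exact options):

# list jobs started from this directory that are actually running (flag assumed)
cobaya-running-jobs --running

# delete the job submitted for a given input yaml base name (selection assumed)
cobaya-delete-jobs planck_lcdm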