Previous: 1. HTCondor concepts | Index | Next: 3. Opportunistic computation |
The current status of the pool where the jobs can run is listed by condor_status. You will see the names of the slots and nodes, the status, the activity time and more. You will note from the output of condor_status that it lists the slots. To know the number of CPUs associated with one slot, you can use condor_status -format "%s\t" Name -format "%s\n" Cpus (more convenient: condor_status -autoformat:th Name Cpus).
You can also list only the available slots (condor_status -available) or the slots running jobs (condor_status -run), or use even more complex options (see also the dedicated page in the HTCondor manual and some more tips here).
condor_status can also be used to list the status of the schedd, collector, master and other daemons in the pool. Example: condor_status -schedd.
Note: HTCondor does not update the status of the workers in the pool in real time. You may still see a node as claimed or in a working state for a while after it has been freed, simply because its status has not been refreshed recently. Wait some minutes and retry, and you will see the correct status. Alternatively, you can force a reload of the HTCondor daemons on a specific node with condor_restart node, and the correct state will be shown immediately.
Before learning how to submit a job, let's take a look at the job monitoring commands.
To list the currently running jobs, you can use condor_q. From this list you can read the jobID, which can be used to perform a number of other actions. A single job can be analyzed with condor_q -analyze jobID, and the last lines of its output can be retrieved with condor_tail jobID. A job is deleted with condor_rm jobID.
Let's specify one more thing. We will see that a submit script can be used to run more than one job. When you submit, HTCondor creates a cluster of jobs and assigns it a number. All the jobs inside a single submit file belong to the same cluster, but they are identified by different process numbers. This means that a jobID looks like cluster.process, or cluster.0 if there is only one process in the cluster.
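As a quick illustration of the cluster.process format, the two components of a jobID can be separated with plain shell parameter expansion (no HTCondor needed; the jobID value below is just an example):

```shell
# Split a jobID of the form cluster.process
# ("63.2" is a hypothetical example value).
jobid="63.2"
cluster=${jobid%%.*}   # part before the first dot
process=${jobid#*.}    # part after the first dot
echo "cluster=$cluster process=$process"
```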
The HTCondor commands allow you to do much more using the advanced options, as you can read in this page or in the official manual. Searching the web you can find a lot of similar pages with useful tips.
The command that allows to submit a job is condor_submit. Its usage is very simple: condor_submit submitScript. What may be difficult is the configuration of the submitScript itself.
Here I will list some general tips that will be treated in more detail in the following chapters, with specific details on each universe. The full documentation is here.
I start with an example:
Universe = vanilla
jobname = cosmomcVanillaRun
Executable = cosmomcwrapper_vanilla
Arguments = test_planck.ini
transfer_input_files = /home/gariazzo/data/tests/test_planck.ini
when_to_transfer_output = ON_EXIT_OR_EVICT
on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)
next_job_start_delay=60
notification = Always
notify_user = gariazzo@to.infn.it
initialdir=/home/gariazzo/
Log = logs/$(jobname).$(cluster).$(process).l
Output = logs/$(jobname).$(cluster).$(process).o
Error = logs/$(jobname).$(cluster).$(process).e
Queue
Arguments = test_planck.$(process).ini
transfer_input_files = /home/gariazzo/data/tests/test_planck.$(process).ini
Queue 4
Let's see this in detail:
- the job name is cosmomcVanillaRun;
- the executable is cosmomcwrapper_vanilla (it executes some commands to load libraries before running cosmomc) and it takes test_planck.ini as the only argument;
- the input file is transferred with transfer_input_files;
- the output is transferred ON_EXIT_OR_EVICT, i.e. when the job ends or is evicted for some reason;
- the job is removed from the queue according to on_exit_remove = (ExitBySignal == False) && (ExitCode == 0) (the job is removed only if it was not killed and it exited with code 0), and in case of error it is started again after 60 seconds (next_job_start_delay=60);
- an email is sent to gariazzo@to.infn.it in case of job completion, errors in the execution or job suspension for any reason (notification = Always);
- the job runs inside initialdir, where the output files (of the program) are also saved. The names of the stderr, stdout and log files are created using the jobname, the cluster and the process numbers (the log file may be cosmomcVanillaRun.63.0.l, for example);
- Queue is the instruction that creates the first process in the cluster. This script creates 4 more processes with the last Queue 4 instruction. In this second case, the Arguments and the transfer_input_files are defined using the process number, so each process uses a different input file (they must all exist in the folder specified in transfer_input_files).

Should you need to define environment variables and/or run some commands before the job execution, there are two possibilities: you can use the condor_submit instructions in the submit file (getenv, environment and more: see the official manual) or you can define a wrapper that manages the environment preparation and the command execution (see next pages).
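As a sketch of the wrapper approach: the wrapper prepares the environment and then runs the command it receives, so in the submit file Executable would point to the wrapper and Arguments would hold the real program and its options. The variable names and values below are hypothetical placeholders, not taken from the example above:

```shell
# Hypothetical wrapper sketch: prepare the environment, then run the
# forwarded command. In a real submit file this would be a standalone
# script ending in `exec "$@"`.
run_job() {
  export LD_LIBRARY_PATH="/opt/mylibs:${LD_LIBRARY_PATH:-}"  # placeholder path
  export OMP_NUM_THREADS=1                                   # placeholder setting
  "$@"                                                       # run the forwarded command
}
run_job echo "job runs with the prepared environment"
```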
To conclude this section I want to show one interesting possibility.
HTCondor can run jobs at specific times, basically like cron.
This may be subject to the queue state, however.
The lines to be added to the submit script are something like:
on_exit_remove = false
cron_minute = 0
cron_hour = 0-23/3
cron_day_of_month = *
cron_month = *
cron_day_of_week = *
HTCondor allows the submission from nodes different from the one that hosts the schedd daemon. This means that you don't need to log in to the node where the schedd runs: you can submit from any node in the pool. The mechanism is almost the same, with the submit script, the executable and the input files located on the node you submit from. In the to4pxl pool, the command is:
condor_submit -remote to4pxl.gr4_5f job_submit_file.
The argument of the -remote option, in this case, is given by the hostname plus the default_domain_name configured for the pool. You can obtain the full name of the node hosting the schedd with condor_status -schedd -format "%s\n" Machine.
For normal submissions, the files containing stdout, stderr and the log are created and updated where specified in the submit script. The files are not updated in real time, however. You can retrieve the stderr and stdout contents using condor_tail jobID and its options (see this page).
For remote submissions, instead, the situation is slightly different. At the end of the job execution, you will notice that the job remains in the condor_q output in the "completed" state, instead of being automatically removed. At the same time, no output files will appear on the node you submitted from. To transfer the output data from the HTCondor spool folder you have to use condor_transfer_data -name to4pxl.gr4_5f jobID, where -name takes the same argument used with condor_submit -remote. This command will transfer the data from the spool folder to the local filesystem, and it can be used until the job is manually removed from the queue with condor_rm jobID.
Remember to transfer the output before removing the job!
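Putting the remote-submission steps together, a session could look like the sketch below (the schedd name to4pxl.gr4_5f is the one from the example above; jobID stands for the actual cluster.process number printed at submission, and the commands obviously require a running pool):

```shell
# 1. Submit from any node in the pool, addressing the remote schedd.
condor_submit -remote to4pxl.gr4_5f job_submit_file
# 2. Wait until condor_q shows the job in the "completed" state.
condor_q
# 3. Copy the results from the schedd spool folder to the local filesystem.
condor_transfer_data -name to4pxl.gr4_5f jobID
# 4. Only now remove the job from the queue.
condor_rm jobID
```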
Another possibility to check the job status and the output in real time is to connect to the job folder on the computer where it is running. HTCondor provides this possibility through the condor_ssh_to_job command. The usage is very simple:
- condor_ssh_to_job 85.0 to connect to the machine running the process 0 of the cluster 85;
- condor_ssh_to_job 21.1.0 to connect to the machine running the MPI node number 0 (main node) for the process 1 of the cluster 21.
For a more advanced usage of the command (copying files, rsync and more) see the man page.