Previous: 1. HTCondor concepts | Index | Next: 3. Opportunistic computation |
The current status of the pool where the jobs can run is listed by condor_status. You will see the names of the slots and nodes, the status, the activity time and more. You will note from the output of condor_status that it lists the slots. To know the number of CPUs associated with one slot, you can use condor_status -format "%s\t" Name -format "%s\n" Cpus (more convenient: condor_status -autoformat:th Name Cpus).
You can also list only the available slots (condor_status -available) or the slots running jobs (condor_status -run), or use even more complex options (see also the dedicated page in the HTCondor manual and some more tips here).
condor_status can also be used to list the status of the schedd, collector, master and other daemons in the pool. Example: condor_status -schedd.
Note: HTCondor does not update the status of the workers in the pool in real time. You may still see a node as claimed or in a working state for a while after it has been freed, simply because its status has not been refreshed recently. Wait some minutes and retry, and you will see the correct status. Alternatively, you can force a reload of the HTCondor daemons on a specific node with condor_restart node, and the correct state will be shown immediately.
Before learning how to submit a job, let's take a look at the job monitoring commands.
To list the currently running jobs, you can use condor_q. From this list you can read the jobID, which can be used to perform a number of other actions. A single job can be analyzed with condor_q -analyze jobID, and the last lines of its output can be retrieved with condor_tail jobID. A job is deleted with condor_rm jobID.
Let's specify one more thing. We will see that a submit script can be used to run more than one job. When you submit, HTCondor creates a cluster of jobs and assigns it a number. All the jobs inside a single submit file belong to the same cluster, but they are identified by different process numbers. This means that a jobID looks like cluster.process, or cluster.0 if there is only one process in the cluster.
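As a quick illustration of the cluster.process format, the two components of a jobID can be separated with plain shell parameter expansion (no HTCondor needed; the jobID value below is just an example):

```shell
# Split a jobID of the form cluster.process
# ("63.2" is a hypothetical example value).
jobid="63.2"
cluster=${jobid%%.*}   # part before the first dot
process=${jobid#*.}    # part after the first dot
echo "cluster=$cluster process=$process"
```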
The HTCondor commands allow you to do much more using the advanced options, as you can read in this page or in the official manual. Searching the web you can find a lot of similar pages with useful tips.
The command that allows to submit a job is condor_submit. Its usage is very simple: condor_submit submitScript. What may be difficult is the configuration of the submitScript itself.
Here I will list some general tips that will be treated in more detail in the following chapters, with specific details on each universe. The full documentation is here.
I start with an example:
Universe = vanilla
jobname = cosmomcVanillaRun
Executable = cosmomcwrapper_vanilla
Arguments = test_planck.ini
transfer_input_files = /home/gariazzo/data/tests/test_planck.ini
when_to_transfer_output = ON_EXIT_OR_EVICT
on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)
next_job_start_delay=60
notification = Always
notify_user = gariazzo@to.infn.it
initialdir=/home/gariazzo/
Log = logs/$(jobname).$(cluster).$(process).l
Output = logs/$(jobname).$(cluster).$(process).o
Error = logs/$(jobname).$(cluster).$(process).e
Queue
Arguments = test_planck.$(process).ini
transfer_input_files = /home/gariazzo/data/tests/test_planck.$(process).ini
Queue 4
Let's see this in detail:
- the job name is cosmomcVanillaRun;
- the executable is cosmomcwrapper_vanilla (it executes some commands to load libraries before running cosmomc) and it takes test_planck.ini as the only argument;
- the input file is transferred with transfer_input_files;
- the output is transferred ON_EXIT_OR_EVICT, i.e. when the job ends or is evicted for some reason;
- the job is removed from the queue according to on_exit_remove = (ExitBySignal == False) && (ExitCode == 0) (the job is removed only if it was not killed and it exited with code 0), and in case of error it is started again after 60 seconds (next_job_start_delay=60);
- an email is sent to gariazzo@to.infn.it in case of job completion, errors in the execution or job suspension for any reason (notification = Always);
- the job runs inside initialdir, where the output files (of the program) are also saved. The names of the stderr, stdout and log files are created using the jobname, the cluster and the process numbers (the log file may be cosmomcVanillaRun.63.0.l, for example);
- Queue is the instruction that creates the first process in the cluster. This script creates 4 more processes with the last Queue 4 instruction. In this second case, the Arguments and the transfer_input_files are defined using the process number, so each process uses a different input file (they must all exist in the folder specified in transfer_input_files).

Should you need to define environment variables and/or run some commands before the job execution, there are two possibilities: you can use the condor_submit instructions in the submit file (getenv, environment and more: see the official manual) or you can define a wrapper that manages the environment preparation and the command execution (see next pages).
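As a sketch of the wrapper approach: the wrapper prepares the environment and then runs the command it receives, so in the submit file Executable would point to the wrapper and Arguments would hold the real program and its options. The variable names and values below are hypothetical placeholders, not taken from the example above:

```shell
# Hypothetical wrapper sketch: prepare the environment, then run the
# forwarded command. In a real submit file this would be a standalone
# script ending in `exec "$@"`.
run_job() {
  export LD_LIBRARY_PATH="/opt/mylibs:${LD_LIBRARY_PATH:-}"  # placeholder path
  export OMP_NUM_THREADS=1                                   # placeholder setting
  "$@"                                                       # run the forwarded command
}
run_job echo "job runs with the prepared environment"
```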
To conclude this section I want to show one interesting possibility.
HTCondor can run jobs at specific times, basically like cron.
This may be subject to the queue state, however.
The lines to be added to the submit script are something like:
on_exit_remove = false
cron_minute = 0
cron_hour = 0-23/3
cron_day_of_month = *
cron_month = *
cron_day_of_week = *
HTCondor allows the submission from nodes different from the one that hosts the schedd daemon. This means that you don't need to log in to the node where the schedd runs: you can submit from any node in the pool. The mechanism is almost the same, with the submit script, the executable and the input files located on the node you submit from. In the to4pxl pool, the command is:
condor_submit -remote to4pxl.gr4_5f job_submit_file.
The argument of the -remote option, in this case, is given by the hostname plus the default_domain_name configured for the pool. You can obtain the full name of the node hosting the schedd with condor_status -schedd -format "%s\n" Machine.
For normal submissions, the files containing stdout, stderr and the log are created and updated where specified in the submit script. The files are not updated in real time, however. You can retrieve the stderr and stdout contents using condor_tail jobID and its options (see this page).
For remote submissions, instead, the situation is slightly different. At the end of the job execution, you will notice that the job remains in the condor_q output in the "completed" state, instead of being automatically removed. At the same time, no output files will appear on the node you submitted from. To transfer the output data from the HTCondor spool folder you have to use condor_transfer_data -name to4pxl.gr4_5f jobID, where -name takes the same argument used with condor_submit -remote. This command will transfer the data from the spool folder to the local filesystem, and it can be used until the job is manually removed from the queue with condor_rm jobID.
Remember to transfer the output before removing the job!
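Putting the remote-submission steps together, a session could look like the sketch below (the schedd name to4pxl.gr4_5f is the one from the example above; jobID stands for the actual cluster.process number printed at submission, and the commands obviously require a running pool):

```shell
# 1. Submit from any node in the pool, addressing the remote schedd.
condor_submit -remote to4pxl.gr4_5f job_submit_file
# 2. Wait until condor_q shows the job in the "completed" state.
condor_q
# 3. Copy the results from the schedd spool folder to the local filesystem.
condor_transfer_data -name to4pxl.gr4_5f jobID
# 4. Only now remove the job from the queue.
condor_rm jobID
```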
Another possibility to check the job status and the output in real time is to connect to the job folder on the computer where it is running. HTCondor provides this possibility through the condor_ssh_to_job command. The usage is very simple:
- condor_ssh_to_job 85.0 to connect to the machine running the process 0 of the cluster 85;
- condor_ssh_to_job 21.1.0 to connect to the machine running the MPI node number 0 (main node) for the process 1 of the cluster 21.
For a more advanced usage of the command (copying files, rsync and more) see the man page.