
6. The parallel universe

Differences with sequential jobs

The mechanisms behind parallel jobs are somewhat more complex than those needed for sequential jobs. The reason is that a number of operations must be managed by HTCondor: the SSH communication between nodes, the file transfers, the execution of the job on the designated nodes, and more. As a result, checkpointing does not work for parallel universe jobs and job suspension may cause problems: this is why parallel universe jobs must run on dedicated resources, where policies that include job suspension should be disabled.

The parallelization of the job is not managed by HTCondor directly, but by an MPI implementation such as OpenMPI. HTCondor is responsible for preparing the MPI environment before running the job.

Technically speaking, when a parallel job runs, HTCondor has to allocate the requested slots, set up the SSH communication between them, perform the necessary file transfers, and launch the job on the designated nodes.

Most of these operations are managed by HTCondor through a specific shell script. An example of such a script, working with OpenMPI, is shipped with the HTCondor documentation as openmpiscript. With some small modifications, it is reproduced at the end of this page.

The openmpiscript must be the executable in the HTCondor submit file. The reason is that it is considered the main program: it manages all the operations necessary to run the parallel job with OpenMPI. The real executable is then passed as one of the arguments.

Specific commands

Only three instructions in the submit file change from the vanilla to the parallel universe: universe = parallel is the most obvious one, and then there are the already mentioned differences in executable and arguments.

As we said, the value of executable must be the name of the MPI wrapper. In our case the dedicated resources provide OpenMPI, and one possible wrapper is reported at the end of this page. It is taken from the HTCondor documentation, with a few changes. You can copy it, save it as openmpiscript, and then use executable = openmpiscript in the submit file.

The other difference is in arguments, where you should use a syntax compatible with the MPI wrapper. For the wrapper at the end of the page, the syntax is arguments = /working/directory/for/mpirun actual_mpi_job arg1 arg2. If instead you remove the parts of the wrapper concerning the -wdir option, you should use arguments = actual_mpi_job arg1 arg2.

In addition to the commands already seen, there are two additions. The first one is the command that tells HTCondor how many MPI nodes you would like to use: machine_count = 4 requests 4 nodes, that is to say slots. The number of CPUs for each slot is defined by request_cpus, as in the vanilla universe.

The second addition tells HTCondor that only the main MPI process must be considered when deciding whether the job has concluded, since the other processes just wait for node 0: +ParallelShutdownPolicy = "WAIT_FOR_NODE0", which is the default value. If necessary, you can set +ParallelShutdownPolicy = "WAIT_FOR_ALL", which forces HTCondor to wait for every node to complete.

An example

As for the previous universes, I report here an example.

This example uses the openmpiscript reported below. I am assuming that the nodes where the job will be executed share the filesystem (as in the gr4cloud pool), or that the relevant folders have been synchronized beforehand, so that they have the same content on all the nodes. If the input/output operations are not too expensive, one can also use sshfs to share the required folders between the nodes, as sketched below.
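
For instance, a minimal sshfs sketch (the hostname node0 and the mount options are illustrative, not part of the setup above) could be:

# mount the software folder exported by the head node (hypothetical hostname node0)
# on a worker node; requires sshfs/FUSE and passwordless ssh between the nodes
sshfs node0:/home/gariazzo/software /home/gariazzo/software -o reconnect

# ... run the parallel job ...

# unmount once the job has finished
fusermount -u /home/gariazzo/software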

universe = parallel
executable = openmpiscript
jobname=cosmomcParallelRun

#for "arguments", use: /WDIR/PATH EXECUTABLE real_arguments
arguments = /home/gariazzo/software/1507_cosmomc ./cosmomc test_planck.ini

#number of machines:
machine_count = 4
#cpus for each machine:
request_cpus = 8

log    = logs/$(jobname).$(Cluster).l
output = logs/$(jobname).$(Cluster).o
error  = logs/$(jobname).$(Cluster).e

#+ParallelShutdownPolicy = "WAIT_FOR_NODE0"
on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)

queue
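
To submit and follow the job, assuming the submit description above has been saved as cosmomc_parallel.sub (a hypothetical file name) in the same folder as openmpiscript, the usual commands apply:

# submit the parallel job
condor_submit cosmomc_parallel.sub
# check its status in the queue
condor_q
# follow the user log while the job runs
tail -f logs/cosmomcParallelRun.*.l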

The OpenMPI wrapper - for MPI + OpenMP codes

The following script is the OpenMPI wrapper (openmpiscript) that manages the machine settings and the ssh communications and launches the executable. If not needed, you may omit the optional parts, namely the -wdir working directory handling (see the discussion of arguments above) and the job-specific setup commands such as the clik_profile.sh source line.

This version of the file works well with mixed MPI + OpenMP parallelized codes. If you are using a pure MPI code, see below.

This file is available in the /home/condor/submit folder on to4pxl.

#!/bin/bash
#
# modified 160608 by Stefano Gariazzo [gariazzo@to.infn.it]
#
##**************************************************************
##
## Copyright (C) 1990-2010, Condor Team, Computer Sciences Department,
## University of Wisconsin-Madison, WI.
## 
## Licensed under the Apache License, Version 2.0 (the "License"); you
## may not use this file except in compliance with the License.  You may
## obtain a copy of the License at
## 
##    http://www.apache.org/licenses/LICENSE-2.0
## 
## Unless required by applicable law or agreed to in writing, software
## distributed under the License is distributed on an "AS IS" BASIS,
## WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
## See the License for the specific language governing permissions and
## limitations under the License.
##
##**************************************************************

# YOU MUST CHANGE THIS TO THE PREFIX DIR OF OPENMPI
MPDIR=/usr

#if `uname -m | grep "64" 1>/dev/null 2>&1` 
#then 
    #MPDIR=/usr/lib64/openmpi
#fi

PATH=/usr/bin:$MPDIR:.:$PATH
export PATH
LD_LIBRARY_PATH=/usr/lib/openmpi/lib:/usr/bin:.:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH

_CONDOR_PROCNO=$_CONDOR_PROCNO
_CONDOR_NPROCS=$_CONDOR_NPROCS
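# (node number and total number of nodes, set in the job environment by the parallel universe)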

CONDOR_SSH=`condor_config_val libexec`
CONDOR_SSH=$CONDOR_SSH/condor_ssh

SSHD_SH=`condor_config_val libexec`
SSHD_SH=$SSHD_SH/sshd.sh
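# condor_ssh and sshd.sh are helper scripts shipped in HTCondor's libexec directory;
# sourcing sshd.sh below starts an sshd for this node and defines sshd_cleanup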

. $SSHD_SH $_CONDOR_PROCNO $_CONDOR_NPROCS 

# If not the head node, just sleep forever, to let the sshds run
if [ $_CONDOR_PROCNO -ne 0 ]
then
	wait
	sshd_cleanup
	exit 0
fi

WDIR=$1
shift
EXECUTABLE=$1
shift
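# (the first two arguments are the mpirun working directory and the actual executable;
#  the remaining arguments in $@ are passed on to the MPI job)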

# the binary is copied but the executable flag is cleared.
# so the script has to take care of this
chmod +x $EXECUTABLE

CONDOR_CONTACT_FILE=$_CONDOR_SCRATCH_DIR/contact
export CONDOR_CONTACT_FILE

# The second field in the contact file is the machine name
# that condor_ssh knows how to use
sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $2}' > machines

## run the actual mpijob
echo -e "---\non nodes:"
cat machines

#insert here some commands needed to run your job
source /home/common/Planck15/plc-2.0/bin/clik_profile.sh

echo -e "---\n`date`\n---\nrunning:"
echo mpirun.openmpi -v --prefix $MPDIR --mca btl_tcp_if_exclude lo,docker0 --mca plm_rsh_agent $CONDOR_SSH -n $_CONDOR_NPROCS  --host `paste -sd',' machines` -wdir $WDIR $EXECUTABLE $@
mpirun.openmpi -v --prefix $MPDIR --mca btl_tcp_if_exclude lo,docker0 --mca plm_rsh_agent $CONDOR_SSH -n $_CONDOR_NPROCS  --host `paste -sd',' machines` -wdir $WDIR $EXECUTABLE $@ 
err=$?
echo mpirun ended

sshd_cleanup
rm -f machines

echo exiting $err
exit $err
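
For the example submit file above (machine_count = 4, request_cpus = 8), the mpirun command built by this wrapper is equivalent to the following sketch, where the host names and the condor_ssh path are purely illustrative:

# one MPI rank per slot (-n 4); each rank can then use the 8 cpus of its slot
# for its OpenMP threads (host names and condor_ssh path are hypothetical)
mpirun.openmpi -v --prefix /usr \
    --mca btl_tcp_if_exclude lo,docker0 \
    --mca plm_rsh_agent /usr/lib/condor/libexec/condor_ssh \
    -n 4 --host node1,node2,node3,node4 \
    -wdir /home/gariazzo/software/1507_cosmomc ./cosmomc test_planck.ini

In other words, only one MPI process is started on each machine, and the remaining cpus of the slot are left to OpenMP.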

The OpenMPI wrapper - for pure MPI codes

The following script is the OpenMPI wrapper (openmpiscript) that manages the machine settings and the ssh communications and launches the executable. If not needed, you may omit the parts concerning the -wdir working directory option (see the discussion of arguments above).

This version of the file works well with pure MPI parallelized codes. If you are using a mixed MPI + OpenMP code, see above.

This file is available in the /home/condor/submit folder on to4pxl.

#!/bin/bash
#
# modified 160608 by Stefano Gariazzo [gariazzo@to.infn.it]
# modified 161115 by Hannes Zechlin [zechlin@to.infn.it]
#
##**************************************************************
##
## Copyright (C) 1990-2010, Condor Team, Computer Sciences Department,
## University of Wisconsin-Madison, WI.
## 
## Licensed under the Apache License, Version 2.0 (the "License"); you
## may not use this file except in compliance with the License.  You may
## obtain a copy of the License at
## 
##    http://www.apache.org/licenses/LICENSE-2.0
## 
## Unless required by applicable law or agreed to in writing, software
## distributed under the License is distributed on an "AS IS" BASIS,
## WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
## See the License for the specific language governing permissions and
## limitations under the License.
##
##**************************************************************

# YOU MUST CHANGE THIS TO THE PREFIX DIR OF OPENMPI
MPDIR=/usr

#if `uname -m | grep "64" 1>/dev/null 2>&1` 
#then 
    #MPDIR=/usr/lib64/openmpi
#fi

PATH=/usr/bin:$MPDIR:.:$PATH
export PATH
LD_LIBRARY_PATH=/usr/lib/openmpi/lib:/usr/bin:.:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH

_CONDOR_PROCNO=$_CONDOR_PROCNO
_CONDOR_NPROCS=$_CONDOR_NPROCS

_CONDOR_RCPUS=`condor_q -global -format '%d\n' RequestCpus | tail -1 | sed -n 's/\([0-9][0-9]*\).*/\1/p'`
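# (number of cpus requested per slot, read back from the queue with condor_q;
#  used below to compute the total number of MPI ranks)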

CONDOR_SSH=`condor_config_val libexec`
CONDOR_SSH=$CONDOR_SSH/condor_ssh

SSHD_SH=`condor_config_val libexec`
SSHD_SH=$SSHD_SH/sshd.sh

. $SSHD_SH $_CONDOR_PROCNO $_CONDOR_NPROCS 

# If not the head node, just sleep forever, to let the sshds run
if [ $_CONDOR_PROCNO -ne 0 ]
then
	wait
	sshd_cleanup
	exit 0
fi

WDIR=$1
shift
EXECUTABLE=$1
shift

# the binary is copied but the executable flag is cleared.
# so the script has to take care of this
chmod +x $EXECUTABLE

CONDOR_CONTACT_FILE=$_CONDOR_SCRATCH_DIR/contact
export CONDOR_CONTACT_FILE

# The second field in the contact file is the machine name
# that condor_ssh knows how to use
sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $2}' > machines

## run the actual mpijob
echo -e "---\non nodes:"
cat machines

#insert here some commands needed to run your job
#source /home/common/Planck15/plc-2.0/bin/clik_profile.sh

echo -e "---\n`date`\n---\nrunning:"
echo mpirun.openmpi -v --prefix $MPDIR --mca btl_tcp_if_exclude lo,docker0 --mca plm_rsh_agent $CONDOR_SSH -n `expr $_CONDOR_NPROCS \* $_CONDOR_RCPUS` -npernode $_CONDOR_RCPUS --host `paste -sd',' machines` -wdir $WDIR $EXECUTABLE $@
mpirun.openmpi -v --prefix $MPDIR --mca btl_tcp_if_exclude lo,docker0 --mca plm_rsh_agent $CONDOR_SSH -n `expr $_CONDOR_NPROCS \* $_CONDOR_RCPUS` -npernode $_CONDOR_RCPUS --host `paste -sd',' machines` -wdir $WDIR $EXECUTABLE $@ 
err=$?
echo mpirun ended

sshd_cleanup
rm -f machines

echo exiting $err
exit $err
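
For the same example submit file (machine_count = 4, request_cpus = 8), this version builds an mpirun command equivalent to the following sketch (host names and condor_ssh path again purely illustrative):

# 4 machines x 8 cpus = 32 MPI ranks in total, 8 ranks on each node
mpirun.openmpi -v --prefix /usr \
    --mca btl_tcp_if_exclude lo,docker0 \
    --mca plm_rsh_agent /usr/lib/condor/libexec/condor_ssh \
    -n 32 -npernode 8 --host node1,node2,node3,node4 \
    -wdir /home/gariazzo/software/1507_cosmomc ./cosmomc test_planck.ini

This is the only substantial difference with respect to the MPI + OpenMP version: every requested cpu hosts its own MPI rank instead of an OpenMP thread.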



