
1. HTCondor concepts

Introduction

HTCondor is a complex system for managing distributed computing on large collections of distributively owned computing resources.

This means that HTCondor is able to recognize when a node subscribes to the list of available "workers" and to use it for running the jobs submitted to the queue.

For these reasons, HTCondor was selected to manage the queue in the Torino cloud system: it easily recognizes when a new virtual machine (VM) is created or destroyed, and it dynamically updates the worker pool when a change occurs.

HTCondor is also able to detect whether a machine is being used by other processes and, if correctly configured, it can skip that machine when running jobs.

Daemons

HTCondor bases its operation on the execution of several daemons. Each daemon has a different function, as we will briefly see here.

The most important daemon is the master. This is the first daemon that is executed on each node where HTCondor runs. All the other HTCondor daemons are launched by the master, after it has read the configuration files.

The daemon that manages the pool is the collector. As the name suggests, it collects the information from all the nodes running HTCondor and its services inside the pool. All the other daemons send their information to the collector, which can be queried to retrieve the status of the services. There must be only one collector in each pool.
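For example, the collector can be queried from any node of the pool with the condor_status command (more on the user commands in the next section):

condor_status            # list the slots advertised by the startd daemons
condor_status -schedd    # list the schedd daemons known to the collector
condor_status -master    # list the master daemons, i.e. the nodes of the pool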

Another daemon is responsible for instantiating and executing jobs on the worker nodes: the start daemon, or startd. Each worker node has its own startd, which creates a starter daemon for each job that runs on the machine. The starter manages the input and output and monitors the job execution.

The daemon that collects the information on the jobs that must be executed, or in other words the daemon that receives the information on the submitted jobs, is called the scheduler daemon, or schedd. More than one schedd can exist in the same pool: for example, one schedd can be used to manage dedicated resources and parallel jobs, and another one for everything else. The schedd is the daemon that processes the submission details and manages the queue, but it does not decide where a job must run.

The daemon that matches waiting jobs to free worker nodes is the negotiator. This daemon checks the list of requirements of the queued jobs and looks for free matching resources, retrieving the information from the collector. After a job is associated with a free node, the communication starts and the job data are sent to the startd of the machine that will execute it.

This is not an exhaustive list, but these are the main daemons. Other daemons may be present in a pool, but they are not required. For example, one of the additional daemons is the one that manages the checkpoints for the standard universe jobs, called the ckpt_server. Another one is the defrag daemon, which is used to reduce the fragmentation in a pool where partitionable slots are present (see here).

To summarize: an HTCondor worker node must run the master and the startd. A submit node (if separate from the collector node) runs the master and the schedd, while the central node of the pool should run at least the master, collector and negotiator daemons.
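In the HTCondor configuration files, the daemons that the master starts on a node are selected with the DAEMON_LIST variable. As an illustrative sketch (the exact values depend on the pool), the three roles above correspond to something like:

# central manager
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR
# dedicated submit node
DAEMON_LIST = MASTER, SCHEDD
# worker node
DAEMON_LIST = MASTER, STARTD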

Permissions and domains

Being designed to work with a heterogeneous set of machines, HTCondor has a number of policies that are used to manage permissions, users and filesystems.

First of all, the communication between daemons has an underlying set of rules that specify what a daemon, a service or a node can do. Read permissions mean that a daemon can read the status of another daemon: for example, a negotiator that retrieves the pool status from the collector. Write permissions, instead, allow a daemon or a command to write information to another daemon: this is the case of the condor_submit command, which writes information to the schedd.

Permissions can be defined at the user or host level, and different authentication methods are compatible with HTCondor. In our case the authentication is made with the "password" method, since we are inside the local INFN network. Even though permissions are set at the host level, meaning that every PC that has the password can connect to the central node, in practice only the users that have an account on the pool manager (to4pxl) can submit jobs to the pool. Jobs can be submitted from any node of the pool, provided that the username on the node where the job is submitted matches a username on the pool manager.
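As a rough sketch (the host names below are placeholders, not the actual settings of our pool), the relevant part of the configuration looks like this:

# authenticate the nodes of the pool with the shared pool password
SEC_DEFAULT_AUTHENTICATION_METHODS = PASSWORD
SEC_PASSWORD_FILE = /etc/condor/pool_password
# host-level permissions: who can read the pool status and who can send commands
ALLOW_READ = *
ALLOW_WRITE = *.example.infn.it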

This is about submission; the execution of jobs works differently. We will see in the next section that HTCondor is flexible enough to run jobs in any kind of pool. While only users with an account on the central manager can submit to HTCondor, it is not required that the user accounts present on to4pxl exist on all the worker nodes. HTCondor, indeed, checks whether a worker node is part of the same UID_DOMAIN to know if it has the same users as the central manager.

In an equivalent way, HTCondor checks whether a shared filesystem exists. If not, it has to copy the files required to run a job before executing it.
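These two checks correspond to two configuration variables, which must have the same value on the submit machine and on the worker node for HTCondor to assume shared users and a shared filesystem (the domain name below is a placeholder):

# same value on submit and execute machine: jobs run as the submitting user
UID_DOMAIN = example.infn.it
# same value on submit and execute machine: no file transfer is needed
FILESYSTEM_DOMAIN = example.infn.it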

The existence of shared users and a shared filesystem simplifies the management of the jobs. This is the main difference between the pool hosted by to4pxl and the one hosted by gr4cloud: while the latter is based on a common filesystem and all the nodes have the same users, the former is an ensemble of independent nodes, each with its own filesystem and probably different users. In the next section we will see what this difference implies for the job execution.

Slots

HTCondor jobs are executed on the nodes that host a startd. These nodes must be configured so that the negotiator knows how many cores can be used and how many jobs can be hosted. For this reason, each worker contains a number of execution slots. Slots may be of different types: static slots, each with a fixed share of the cores, memory and disk of the machine, and partitionable slots, which are split on demand into dynamic slots sized according to the resources requested by the jobs (the defrag daemon mentioned above deals with the fragmentation this can produce).
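As a hedged sketch (the percentages are only an example), this is how a worker could be configured with a single partitionable slot covering the whole machine:

# one partitionable slot with all the resources of the machine;
# dynamic slots are carved out of it according to the job requests
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%, mem=100%, disk=100%
SLOT_TYPE_1_PARTITIONABLE = TRUE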

Job execution details

The HTCondor execution process is managed by the starter daemons created by the startd. For each new starter, a new job folder is created inside an HTCondor directory called spool (usually /var/lib/condor/spool). The spool folder contains the log file, the files that collect the stdout and stderr of the job and, if required, the executable, all the input files and the output files.

If the filesystem is shared, only the stdout, stderr and log files are in the spool folder dedicated to the job. In the other cases, all the files required to run the job correctly must be transferred there. This means that you should configure the submit script to transfer all the input files, executables and libraries that your job needs but that may not be available on the worker hosting the job. If you have a simple program that requires only system libraries, this may be easy. For a very complex code, instead, configuring the transfer of the input files may be extremely complex and time-consuming. In this case, you may want to consider using Docker to create a container with all the libraries required to run your code, and running the code inside it (see this page). The job folder, in any case, will also contain the output files created by the job. These are the files that you will have to copy back to the submit folder (you must configure the transfer of the output files in the submit script).
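As an illustrative sketch (the file names are hypothetical), a submit script for a pool without a shared filesystem enables the transfer mechanism and lists the input and output files explicitly:

universe                = vanilla
executable              = run_analysis.sh
# copy the listed files to the worker and bring the results back at the end
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = input.dat, mylib.so
transfer_output_files   = result.dat
output                  = job.out
error                   = job.err
log                     = job.log
queue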

Finally, a note on the job execution. A process must be executed by a user, and this includes the HTCondor daemons: the best case, but not the only possibility, is that HTCondor runs as root. If this happens, HTCondor has total control over the execution of the jobs. Root cannot submit jobs and jobs cannot be executed as root: this is for security reasons. If the worker nodes and the submit machine are inside the same UID_DOMAIN, the job is executed by the same user that submitted it. If the nodes do not share the same UID_DOMAIN, instead, the jobs are not executed by the user that submitted them, because there is no certainty that each user is present on all the nodes. In this case, the jobs are executed by the special nobody:nogroup user. The spool folders are created so that nobody has full permissions inside them.

However, in some cases it may be convenient to use the files and folders present in the local filesystem, at least for reading files. This is feasible in a dedicated pool without a shared filesystem (for example, by manually synchronizing the necessary folders on a limited set of computers), but certainly not in an opportunistic pool. Using local data for reading is straightforward, and with a little effort nobody can also be allowed to write its output where required. It is possible to configure the machine so that nobody is part of the users group: root must do that. If users is the group owner of a folder and the group has write permissions, nobody becomes able to write in that folder. Once that is done, and without needing root permissions, you can make nobody able to write in a /path/to/folder you own with a few commands:

chgrp -R users /path/to/folder
chmod -R g+rwX /path/to/folder
find /path/to/folder -type d -exec chmod g+s {} \;

The last command sets the special setgid permission on /path/to/folder and all the folders inside it, which ensures that the new files created inside these folders are owned by the users group.
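To verify the result you can list the folder: the group column should show users and the setgid bit should appear as an s in the group permissions (for example drwxrwsr-x):

ls -ld /path/to/folder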

Local networks

Now I will quickly explain how our local networks are configured. In the following pages I will refer to each pool by the name of its manager.

Universes

The last concept to understand is the way HTCondor treats jobs, which can be either sequential or parallel. These jobs are managed by several different HTCondor universes.

Sequential jobs can be run in several different "universes", including the vanilla universe which provides the ability to run most "batch ready" programs, and the standard universe in which the target application is re-linked with the HTCondor I/O library which provides for remote job I/O and job checkpointing. If the system is properly configured, sequential jobs can be suspended, moved to a different worker node and resumed without any data loss. The default universe is "vanilla".
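As a minimal sketch (the program name and arguments are hypothetical, and a shared filesystem is assumed, so no file transfer is configured), a vanilla universe submit file can be as short as:

universe   = vanilla
executable = my_program
arguments  = -i input.dat
output     = job.out
error      = job.err
log        = job.log
queue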

There is one further universe for running sequential jobs, which is similar to the "vanilla" universe. It is called docker because it is based on the Docker system, which is a container manager. Containers are a method to provide a standard system, in a way that is similar to virtual machines but without requiring the resources that a virtual machine requires. You can find something about Docker here.
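A docker universe job is submitted in almost the same way as a vanilla one; the main addition is the name of the image in which the job runs (the image and file names below are just examples):

universe     = docker
docker_image = centos:7
executable   = run_analysis.sh
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output = job.out
error  = job.err
log    = job.log
queue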

Parallel jobs, instead, must use the parallel universe. Since "parallel" means that the same job runs on different nodes connected to each other, it is much more complicated to suspend, move and resume such a job: for this reason and other technical reasons, the way of submitting a job is slightly different for a sequential and a parallel job. In particular, parallel jobs are managed by a dedicated scheduler and can run only on dedicated nodes.
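As a hedged sketch (the wrapper script name is hypothetical), the submit file of a parallel job specifies the parallel universe and the number of machines to reserve; the $(Node) macro expands to the node number of each process:

universe      = parallel
executable    = mpi_wrapper.sh
machine_count = 4
output = job.out.$(Node)
error  = job.err.$(Node)
log    = job.log
queue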


There is another universe that you may want to use. The local universe is designed to immediately run jobs on the same machine from which the job is submitted. As you can guess, this should be used only for testing. The syntax for this universe is the same that is used for the vanilla one, but HTCondor may be configured to run a maximum number of concurrent local universe jobs.
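The only change with respect to a vanilla submit file is the universe line (the script name is hypothetical); on the configuration side, the limit on concurrent local universe jobs is typically set with the START_LOCAL_UNIVERSE expression:

universe   = local
executable = quick_test.sh
output     = test.out
error      = test.err
log        = test.log
queue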



