Stefano Gariazzo

PhD in Physics and Astrophysics

Indirizzo di posta (mail address): gariazzo@to.infn.it
Homepage: Go

Gr4cloud network

Go to:

General Info
Filesystems
Users and Login
Software Installation
Common software
Submitting jobs
Other notes

General Info Top

It is a system of virtual machines, hosted in the Centro di Calcolo network where also TIER2 for CERN is running.

We bought some new physical machines that are supposed to be available for everyone in the network, if we are not using them. For this reason we are testing a cloud system in which we can create and destroy a number of virtual machines (VM), for a total of 48 cores.
The preferred format for these machines is 3 cores/8GB RAM, but there are other formats.
The cores have not a fixed speed, since it depends on the physical machine the VM is created and we have no control on this.
There is a head node with public ip (193.205.66.216 - with alias gr4cloud or gr4cloud.to.infn.it), the other nodes are on a private subnetwork.

As they are VM, the nodes can be created and destroyed in a relatively short time. We are asked to destroy the slave nodes, if we are not using them: in this way, the physical resources are available for other users of the network. After a VM is destroyed, the changes made on it are lost.
VMs are created using a configuration file that allow us to prepare the environment in some specified ways. In particular, through this mechanism we can run some specific commands and install specific packages, so that every VM is created with the same initial setup.

Filesystems Top

Each VM mounts different filesystems: some of them are temporary, that means they are destroyed with the VM and the changes made on them are lost:

/ is temporary;
/tmp: is temporary; if you need to write temporary data, it is better to write them here.

The other filesystem are persistent: they are shared between the VM through the network, so if you make changes on the files you have on a persistent disk from one of the machines, they will propagate in all the other VM immediately.
These filesystems are physically mounted on the head node, they are saved on a data storage every time the head node is cleanly terminated and restored once it is created again. All the data you want to save must be in one of the following:

/home: currently 20GB space, it contains the home directories of the users, with all the configuration files.
/data: larger filesystem, currently 1TB, where you can save all the data files. It is hosted on a less performing hardware than the /home fs, so you are encouraged to periodically backup the important files on remote computers, to save them in case of eventual disk problems (this suggestion is from the Centro di Calcolo staff).

Additionally, it is possible to mount remote filesystems through sshfs, so if you need more space or data located on different computers you can use this method.
Please note that the performance of a remote filesystem is lower, however.
Please note also that you should in any case have some backup of the important data, in order to avoid data loss in case of system faults, because NO kind of backup is implemented.

Users and Login Top

You can login to the system through the node with public ip, with ssh user@gr4cloud from inside the Torino INFN network, or from the public login machines.
After login, you are in the head node, from which you can login to the other nodes with ssh node-XXX where the number must match the IP of one of the currently existing VM.

The default password should be the usual one on the z* nodes (please check), but you are encouraged to create a .ssh/authorized_keys file after your first access, in order to login only with your ssh key (see here for useful information).
Since the home directory is shared, if you generate a ssh key in the head node you can insert it in the .ssh/authorized_keys file in order to successfully login to all the slave nodes.
After all of you will have the ssh key configured correctly, the password login should be disabled for security reasons (tell to one of the admin).

Software Installation Top

Since we will create and destroy the machines, if you need some specific package or configuration, there are two possibilities:

If you can, install and setup everything in our folders /home/username/ or /data/username/ (better!), so that the changes are maintained after the shutdown of the VM;
If you need to run some commands that require sudo privileges, including commands to install packages, please tell the list of commands to one of the admins and they will save them in the configuration files from which the VM are generated. If many of you need the same software that is not on the Ubuntu repository, maybe we can install it in some /data/common/ folder in order to make it persistent and accessible to everybody. Something is already present (see below).

The VM have Ubuntu 14.04LTS with some useful packages preinstalled, but it is possible that you need some other software: please send me (or to Hannes) the list of packages you need, so that we will add them in the startup routine or we can install them to some common folder. We can give you sudo rights, if really needed.

Common software Top

Here there is a list of software installed in the /data/common/ folder, that should be accessible to everyone.
You can import in your environment the correct path with the instruction lines in /data/common/exportvariables, where there are the commands for all the software listed below:

Anaconda, v1.9.2, in /data/common/Anaconda/.
Fermi Science Tools from NASA, v9r33p0, in /data/common/FermiTools/.
Intel Fortran XE 2013, sp1, v 14.0.2, in /data/common/intel/.
Planck 2013 likelihoods (plc_1.0), in /data/common/Planck13/.
Planck 2015 likelihoods (plc_2.0), in /data/common/Planck15/.
RootCern v5.34.19, in /data/common/root/.

Please note that this list can be incomplete, and also that some of the installed software may be not working. You are encouraged to notify any inconvenience or deficiency.

Submitting jobs Top

Jobs must be submitted using the HTCondor system. You can find a lot of information on HTCondor here.

At the moment we are using a flexible system that manages the creation and shutdown of the VMs. The system is Elastiq and it is connected to HTCondor: the idle VMs are destroyed, so that the cloud infrastructure is freed for other users, and when new jobs are added to the HTCondor queue the system tries to create new VMs. If there is free space in some physical node, the new VMs are created and automatically added to the common pool, so that HTCondor can send them any new job. You don't have to care about the creation of the new machines, since Elastiq will do the possible to enlarge the pool each time the HTCondor queue is full.

If you can enter the GR4 Cloud, you can find some useful scripts and a few submit files in /data/condor. The aim is to share what we are going to use, so that other people can take advantage of what the users have already done: feel free to copy and paste what it's there and what you modified for you personal use. Please don't edit what has been created by other people, create new files!

Other notes Top

Please notify to me, Hannes or Carlo every kind of problem you experience, so that we can try to solve them or ask to the people in Centro di Calcolo: we are not the first group to test this dynamical system, so someone else may have faced the same problem. Any feedback is definitely useful to improve the system.