Description of the tool

by Riccardo Di Meo last modified 2008-10-30 19:53

The reserve_smp_nodes tool can be used to run multi-threaded and shared-memory multi-processor jobs on the EGEE infrastructure.

Overview

Today most (if not all) Computing Elements (CEs) on the grid have multi-processor nodes and could therefore run very tightly coupled codes, at least in a limited fashion (using up to 8 CPUs at once, though 2 or 4 processors per node is the more common setup). With the current middleware, however, no tool is available that allows a user to submit this kind of job.

To overcome this limitation, we developed an application (written in Python) that allows a user to reserve an entire node on the grid and run computations on it without conflicting with other users' jobs.

The application, called reserve_smp_nodes, consists of two separate Python scripts: a server, which runs locally, and a client, which is executed on the grid.

The tool works on a job-reservation basis: the server submits a number of jobs to a CE in order to increase the chances that more than one ends up on the same node. Once this happens, the server sends the scripts to the fully or partially reserved nodes, where they are started, while all the other jobs expire, thus freeing the unneeded resources.

We acknowledge that, although this tool does not enable anything beyond what a careless user could already do to inconvenience others, job reservation is a practice that should generally be avoided. For this reason we invested considerable effort in security features that greatly reduce the chance of keeping resources booked by mistake; moreover, from our analysis it appears that no other user-level approach providing the same service exists (or, at least, it is very difficult to find one).

As a corollary, although the use of MPI with the shmem device, though sometimes necessary, may in some situations be replaced by the use of MPI over Infiniband or Myrinet (available on some CEs), reserve_smp_nodes is so far the only available way to submit multi-threaded jobs on the grid.

Therefore, until the EDG/gLite middleware provides (if ever) the ability to book an entire node, perhaps through a JDL tag, our application will remain the only way to run multi-threaded or shmem-device MPI codes on grids using that middleware.

The reserve_smp_nodes program can be found at the address:

http://www.ictp.it/~dimeo/reserve_smp_nodes-1.5.tar.bz2

Usage

Reserve_smp_nodes can be used in two ways: interactively and via the command line.

Since version 1.4, the old reserve_smp_nodes script has been split into two different utilities: reserve_interactive, which can be used to reserve a single node (and mimics what the earlier version did when invoked with the -i option), and reserve_smp_nodes, which is now used only for command line submission.

In order to make this application as simple and general as possible, we stripped it of all unnecessary features (such as complex data transfer or logging), leaving the bare minimum needed to run an application: once a node is reserved, the user provides a single executable (usually a script) along with its options, and that code runs independently on the WN, so the user can even shut down his or her computer.

Reserving multiple nodes at once

Since version 1.3, a new way of reserving tasks has been implemented to improve the efficiency of the utility: by using a "task file" to define a list of jobs, instead of one command line option for the script and another for its options, it is now possible to reserve multiple nodes at once and assign to each of them a different code to execute, with different arguments.

This allows reserve_smp_nodes to reserve resources far more efficiently, by:

  1. fitting the biggest tasks (those requiring more CPUs) first
  2. fitting as many tasks as possible

Since this is the most efficient way to submit, users are encouraged to gather multiple tasks of different sizes and use reserve_smp_nodes in this way.
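
For instance, a task file gathering tasks of different sizes (the format is described in detail below; the script names here are purely illustrative) might look like:

4:./big_simulation.sh:input1.dat
2:./medium_analysis.sh
1:./postprocess.sh:input2.dat;input3.dat

With such a file the 4-CPU task is fitted first, and the smaller ones are then used to fill whatever else gets reserved.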

As usual, the execution ends when one of the following conditions is met:

  • Timeout for the reservation expired
  • All submitted jobs have contacted the server back
  • All tasks have been submitted

(Note: what follows applies to the latest version, 1.5. Earlier versions may lack some features.)

Each task file can contain a mix of blank lines, comments and task definitions.

Lines containing only spaces and tabs are considered blank, while lines in which the first non-space, non-tab character is a # sign are ignored as comments.

E.g.

       # valid comments
# in a task definition file

Task definitions have a fixed, field-oriented format: by default each field is separated by a : sign, and each definition must contain at least 2 fields: the number of CPUs requested for the node and the full path of the executable to be run remotely. Trailing spaces are not allowed.

E.g.

2:./script.sh
3:./script2.sh:

An optional third field, a list of options that will be passed to the remote node, can also be specified: its format is fixed too, with each option separated by a ; sign by default.

E.g.

4:./script4.sh:option1
5:./script5.sh:option1;option2;option3

Since the format of the tasks may be restrictive for some uses (mostly due to the presence of the : character as a separator), e.g. if one of the arguments to a program is a URL, a special directive is also available to change the separators for the fields and the arguments.

A line consisting of a !, followed by two characters, changes the behavior of the parser: the 2nd and 3rd characters become the new field separator and argument separator for the following lines.

The "switch separators" directive can be used any number of time.

The following is an example of a valid task definition file which switches between different separators:

# By default the separators are : and ;
1:test.sh:arg1;arg2;arg3
1:test.sh
1:test.sh:

# switch to @ as a separator for the fields
!@;
4@../test2@gsiftp://some.se.somewhere.it/mydir/input;gsiftp://some.se.somewhere.it/mydir/output

# change the arguments separator too: use a space from now on
!@
4@../test2@gsiftp://some.se.somewhere.it/mydir/input gsiftp://some.se.somewhere.it/mydir/output
5@./test.sh@argument1 argument2

# Now revert to the original behavior
!:;
2:test.sh:arg1;arg2;arg3

To use a task definition file, pass it to the -J option, which overrides the -F, -O and -N ones.
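
A minimal invocation sketch, assuming the task definitions have been saved in a file named tasks.txt (the remaining settings, such as the target CE, the listening port and the number of jobs to submit, are omitted here; see the tool's own help for their option names):

./reserve_smp_nodes -J tasks.txt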

Requirements

The user needs a configured UI with a valid proxy certificate, as well as one or more scripts (though a binary can also be used) that carry out all the steps necessary to prepare the environment, execute the tasks and save the data back on the grid (since, as said, nothing other than the user script is sent to the WN).

Simple example of script usable with reserve_smp_nodes

Here is a very simple example of how a bare-bones script can be used to perform some computation:

#!/bin/bash

# Load the input data. The Catalog may also be used
globus-url-copy  gsiftp://.../input.dat file:`pwd`/input.dat

# Get the executable, which will use threads (or fork)
globus-url-copy gsiftp://.../threaded_bin file:`pwd`/threaded_bin
chmod +x threaded_bin

# Run it, with the options provided by the user
# (we assume the -n option sets the number of threads to spawn)
./threaded_bin -n ${RESERVE_SMP_NODES_CPU} $@

# Save the output back
globus-url-copy file:`pwd`/output.dat gsiftp://.../output.dat

The clever user may find in the previous script enough hints on how to port other applications in a more efficient way (e.g. by using compressed archives and running multiple computations, up to the maximum queue length).

As can be plainly seen, delegating to the script all the steps required to run a simulation does not take more lines than writing a JDL.

Using reserve_smp_nodes with MPICH with the shmem device

Since MPI usage among scientists seems to be far more common than the use of plain threads, we also provide a couple of solutions that can be used to port applications, such as Quantum Espresso or RegCM, which are already MPI-enabled and which would greatly benefit from the very low shmem latency and large bandwidth (and cannot be used efficiently, if at all, with the normal p4 device).

The first, straightforward solution is to compile and install the MPI package with the shmem device enabled as experimental software on as many CEs as possible.
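
As a rough sketch, and assuming an MPICH-1 based setup, such an installation boils down to something like the following (the release, configure flags and installation prefix are indicative only and depend on the site):

# Hypothetical installation of MPICH-1 with the shared memory device
# in the VO software area (paths and flags are assumptions)
tar xzf mpich-1.2.7p1.tar.gz
cd mpich-1.2.7p1
./configure --with-device=ch_shmem --prefix=$VO_EUINDIA_SW_DIR/mpich-shmem
make
make install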

We developed a script that takes care of this task (as well as of the installation of other software) and ran it on a number of CEs. However, this option lacks scalability, since quite a few CEs have setups which do not allow an easy or automatic installation of applications (in some cases the site administrator has to be contacted personally). Nevertheless, this remains the cleanest, most standard and fastest solution (from the point of view of code execution).

Due to the aforementioned limitations, we also studied the feasibility of an MPICH package that could be relocated to a directory different from the one it was compiled in, and we easily succeeded in creating one by a simple substitution of some variables in the mpirun executable.
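
For illustration only, the substitution amounts to something along the lines of the sketch below (the variable name MPIRUN_HOME and the file layout are assumptions, not the actual contents of the relocate.sh shipped with the package):

#!/bin/bash
# Hypothetical relocation step: rewrite the hard-coded installation
# prefix inside the mpirun shell script so that it points to the
# directory the package has just been unpacked into.
NEW_PREFIX=`pwd`/$1
sed -i "s|^MPIRUN_HOME=.*|MPIRUN_HOME=${NEW_PREFIX}/bin|" "${NEW_PREFIX}/bin/mpirun"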

This package (which can be trivially recreated from scratch) is provided at the address:

http://www.ictp.it/~dimeo/relocatable_mpich_shmem.tar.gz

and contains an MPICH environment compiled with the shmem device which is suitable for execution on the EUIndia grid, as well as a script called relocate.sh which adapts the package to the current directory.

Here is a simple example of how to execute an MPI code with shared memory using the latter approach (as will be evident, some steps overlap with the previous script):

#!/bin/bash

# Load the input data. The Catalog may be used
globus-url-copy  gsiftp://.../input.dat file:`pwd`/input.dat

# Get the executable (compiled against shmem!)
globus-url-copy  gsiftp://.../shmem_bin file:`pwd`/shmem_bin
chmod +x shmem_bin

# This is the new step (not needed if shmem mpi is installed as
# experimental software): get the relocatable mpi package and adapt it
# to the current directory
globus-url-copy                                 \
  gsiftp://.../relocatable_mpich_shmem.tar.gz   \
  file:`pwd`/relocatable_mpich_shmem.tar.gz
tar xvzf relocatable_mpich_shmem.tar.gz
./relocate.sh mpich_smp
source mpich.env.sh

# We are ready to run the code!
mpirun -np  ${RESERVE_SMP_NODES_CPU} ./shmem_bin

# Save the output
globus-url-copy file:`pwd`/output.dat gsiftp://.../output.dat

As can be plainly seen, once the relocatable package has been uploaded to an SE, only four extra commands are required in order to execute an MPI code with this approach.

A simple session (v1.2, interactive mode).

In order to show how easy it is to use reserve_smp_nodes, an example session with the interactive interface is provided here, in which we submit 10 jobs in order to reserve a 2-processor node on ictpgrid-ce-1:

$ ./reserve_smp_nodes  -i
Listening port (23000)? 23790
VO (euindia): [enter]
NS type: edg/glite/glite-wms (Default glite-wms)? [enter]
Checking the resources available...
Destination:
--------------------------------------------------------------
0) grid0.fe.infn.it:2119/jobmanager-lcgpbs-grid
1) grid012.ct.infn.it:2119/jobmanager-lcglsf-euindia
2) gridce.sns.it:2119/jobmanager-lcgpbs-grid
3) gridce2.pi.infn.it:2119/jobmanager-lcglsf-grid4
4) ictpgrid-ce-1.ictp.it:2119/jobmanager-pbs-euindia
5) prod-ce-01.pd.infn.it:2119/jobmanager-lcglsf-grid
6) serv03.hep.phy.cam.ac.uk:2119/jobmanager-lcgcondor-euindia
7) vecce01.vecc.eu-india.res.in:2119/jobmanager-lcgpbs-euindia
8) t2-ce-02.lnl.infn.it:2119/jobmanager-lcglsf-euindia
9) gridba2.ba.infn.it:2119/jobmanager-lcgpbs-infinite
10) gridba2.ba.infn.it:2119/jobmanager-lcgpbs-long
11) gridba2.ba.infn.it:2119/jobmanager-lcgpbs-short
12) * Use the matchmaking
Select an option(12): 4
How many cpus do you want to reserve (1)? 2
How many jobs do you want to submit (1)? 10
How long should i try to reserve the cpus (300")? [enter]
Script to execute? test.sh
Arguments to pass to it ("")? option1 option2 option3
------------------------------------------
All jobs correctly submitted!
** New connection established from 140.105.46.200:38695
   + Hostname received: node037.beowulf.ictp.it
** New connection established from 140.105.46.200:38752
   + Hostname received: node038.beowulf.ictp.it
** New connection established from 140.105.46.200:38939
   + Hostname received: node039.beowulf.ictp.it
** New connection established from 140.105.46.200:38940
   + Hostname received: node039.beowulf.ictp.it
Script 'test.sh' sent.

At this point the program gives the prompt back and the user is free to execute another task (or even to shut the computer down): the test.sh script has been executed and will run on its own on the WN node039.beowulf.ictp.it.
