You are here: Home Grid tools and utilities Reserve SMP nodes Automatic thread optimization + reserve_smp_nodes
Personal tools
Document Actions

Automatic thread optimization + reserve_smp_nodes

by Riccardo Di Meo last modified 2008-10-30 02:37

A simple proof of concept application is automatically optimized using the GotoBLAS library and submitted to the grid with the reserve_smp_nodes. 2 porting of the application are discussed (simple and real-time/interactive), performances are also briefly discussed.


This example consists in the execution of a simple set of matrix multiplications on the grid, using the Goto Blas library in order to automatically optimize the computation on smp processors and reserve SMP nodes to submit the job on a glite WMS-based grid.

The important concepts i'd like to clearly point out are:

  • using threads in scientific computing is often very easy
  • through the use of the Goto Blas library, very good performances and scalability can be obtained automatically
  • though the grid doesn't provide the facilities to to submit threaded code, using the reserve SMP nodes it is actually possible to execute such code easily and in a safe way.
  • the two submission scripts, the simple way in BASH and a real-time feedback using Python and XMLRPC show that porting an application in such a way that it benefits from reserve_smp_nodes can be performed with various investment of time and with different results, and that both the advanced and the average user can tackle the task.

The application

We developed a simple test program matmul_bench.f90: it's only action is to repeatedly perform matrix multiplications between different matrices of increasing size using the BLAS library and print, for each matrix size, the throughput (number of floating point operations per seconds) and the time required to perform the repetitions.

What's more important, is that, though the program has been decently written, no special optimization was implemented and no parallelization was included in the code at all!!

This latter point is important, since we are going to rely completely on the cleverness of the GotoBLAS library for both optimization (as it's natural) and parallelization: the Goto implementation of the BLAS library can, in fact, use multiple threads on a SMP and/or multi-core machine to speed up the it's own routines.

This is a somewhat strange approach, since programs are usually parallelized manually by the user (through the use of threads or the insertion of MPI calls) where with this library, code using the BLAS routines is automatically split between different processes which will run on different CPUs, if available (this usually cannot be done, due to performances and tuning issues, between processes running on different nodes).

To download the GotoBLAS library or know more about it, check this address

Using the reserve_smp_nodes, we will run this program (which is thread enabled and automagically optimized) to the grid and try to get an idea of the performance gain obtained by running the code on a single node with multiple processors: to do that, we will try to approaches, the first very simple, using a plain .sh file (much like the examples pointed in the reserve_smp_nodes page) and a second, written in python and based on a client-server architecture with interactivity and real time feedback.

Keep in mind, that this is, as far as i know, the only way in which multi thread jobs can be run on the grid.

Files required to execute the examples

Precompiled binaries, statically linked, required to run this example can be found in a bzip2 compressed package at the address:

This archive contains two binaries: matmul and matmul_short, both performing the same operations, though the first takes ~ 2 hours on a P4 2.4Ghz where the second only 5-10 minutes on the same processor.

The source of the aforementioned binaries can be found at the address:

though recompilation of the code should not be necessary, and the scripts used to run the code with the reserve_smp_nodes can be found here:

and here (for the advanced XMLRPC version of the scripts):

The reserve SMP nodes version which will be used for this example can be found here:

Simple straightforward version (

This very simple script can be used to execute matmul with reserve_smp_nodes and provides a bare bones example about porting applications for it.

The script performs the following operations:

  1. Gets the matmul binary from the grid (both binary and location are specified on the command line)
  2. Sets the permissions for the executable and the OMP_NUM_THREADS variable appropriately
  3. Run the matmul code and saves the standard output of matmul and other logging information to 2 local files
  4. Sends the files with the output and logs in the grid, marking it with an identifier passed as argument

The arguments that should be passed to it are, as specified in the comments inside the script:

  1. A name used to identify the output of the run once it will be saved in a SE
  2. a gsiftp location pointing to a directory for both read and write (without the gsiftp:// part), where the matmul binaries will be downloaded and where the script will dump the output files.
  3. The name of the matmul binary which will be used for the execution (in our case only matmul and matmul_short, though other programs could be in principle be used, as long as they are executed in similar ways)

Executing matmul with the script

Enter the reserve_smp_nodes-1.4 directory and download there the script.

Dump the matmul and matmul_short binaries to a SE (for such a simple test, the /tmp directory may do, don't use it for other purposes though!):

$ wget -c -t0
$ wget -c -t0
$ tar xvjf matmul_binaries.tar.bz2
$ globus-url-copy file:`pwd`/matmul_short gsi
$ globus-url-copy file:`pwd`/matmul gsi

The former steps will be required only once, and are performed in order to put all requirements in place.

At this point we are almost ready to run the reserve_smp_nodes: we just need a file with will specify the tasks to be submitted to the grid.

Since we would like to run a scalability test, we may like to use a task list like this one (which we will call tasks.txt):


which will give us a complete description about how our code scales (if we will be able to reserve the required CPUs).

Now we only need to find a cluster which we know has multiple CPUs on a single node (we will use in our example, which has 4 processors per node), start the reserve_smp_nodes and cross our fingers ;-) ):

$ ./reserve_smp_nodes -T 1000 -J tasks.txt -r -j 20

With the latter command we are submitting 20 jobs (though only 10 are required, in total, by our tasks, to account for the job loss associated with WNs partially owned by other users) to the specified queue and waiting for 1000 seconds at most for them to start running.

Here is the output from our first run:

$ ./reserve_smp_nodes -T 1000 -J tasks.txt -r -j 20
Checking port 23594...
Starting to receive...
All jobs correctly submitted!
** New connection established from
   + Hostname received: farm012
** New connection established from
   + Hostname received: farm037
** New connection established from
   + Hostname received: farm018
** New connection established from
   + Hostname received: farm044
Timeout hit: about to fit the tasks into the available resources
Out of the receiving cycle
Resources available:
Node 1/4 owned.
Node 3/4 owned.
Node 2/4 owned.
- sending a 3-task to a 3-processors node
Script './' sent.
Closing socket farm012 (
Closing socket farm012 (
Closing socket farm012 (
  task sent
- sending a 2-task to a 2-processors node
Script './' sent.
Closing socket farm015 (
Closing socket farm015 (
  task sent
- sending a 1-task to a 1-processors node
Script './' sent.
Closing socket farm020 (
  task sent
Fit of the remaining resources terminated
6 more cpu executing 3 more tasks
Some taks have not been assigned:
        * 4 CPUs:  - script: ./ args: 4proc matmul_short
Closing the remaining resources
3 tasks submitted for execution

As you can see, digging among the output (in this version some debugging messages have been left in the code, they are likely to disappear in the next versions), the reserve_smp_nodes was able to gather enough CPUs to fit 3 tasks, the ones of 1, 2 and 3 CPUs, where we weren't able go send the 4 CPU task (which may be due to a number of reasons: we didn't submitted enough jobs, the cluster had all nodes already occupied by at least one job or we didn't waited enough).

At this point, the directory at the location gsi should be periodically inspected for the files:


to appear. The files ending with _output.txt, containing only log info can be ignored, where the files ending like _matmul_short.txt contain the output of the command:

$ /usr/bin/time matmul_short

Here is a chunk of 2proc_matmul_short.txt:

  256  5766.238     0.291
  286  5692.760     0.411
  316  5738.047     0.550
  946  5832.605    14.515
  976  5860.311    15.865
 1006  5840.003    17.433
310.90user 0.33system 2:36.71elapsed 198%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+68911minor)pagefaults 0swaps

The output can be divided in 2 parts: the lines with the 3 column output are printed by matmul_short, each line is the profiling of a number of matrix multiplication (the first column is the size of the matrices involved, the second is the throughput, in mflops, and the third is the number of seconds required to perform the multiplications).

Confronting the output for different processors, should show that the scalability is very good (where the perfect situation is when, roughly doubling the number of CPU involved in the computation halves the time required to perform it).

[Graph of the performances of matmul_short for 1, 2 and 3 cpus]

The last 2 lines are instead the output of the /usr/bin/time command, which shows how much time was taken to execute the code, how much CPU time was consumed and the average percentage of the CPU involved (which can be far more than 100, if more than 1 CPU is involved, as in our example).

A quick and enlightening feedback about the performances of the GotoBLAS library comes also from the last 2 lines: the average usage of the CPU should be, for an N processors simulation, near N*100% and the wall time required to run the binary on it should be near T/N seconds, where T is the time required to run the same binary on a single processor of the same architecture.

==> 1proc_matmul_short.txt <==
210.55user 0.27system 3:30.89elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+68877minor)pagefaults 0swaps

==> 2proc_matmul_short.txt <==
310.90user 0.33system 2:36.71elapsed 198%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+68911minor)pagefaults 0swaps

==> 3proc_matmul_short.txt <==
311.61user 0.36system 1:45.30elapsed 296%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+68929minor)pagefaults 0swaps

Keep however in mind that the previous test is far too small and inaccurate to give a correct picture about the scalability of the code: more data should have been gathered and the matmul binary should have been used instead of it's short alternative.

A more complex example (with real-time feedback) using XMLRPC

Though the previous example is probably enough in most situations (and may be easily adapted to other codes), we will demonstrate here another way in which the porting of the matmul program has been done, which due to it's complexity will probably be more fit for the advanced users.

While for some tasks the previous solution may be the better one (since it allows for a kind of "fire and forget" submission model, very suited for many jobs), in other situations users would like to have a direct feedback about how a simulation is going, and even be able to interact with it in real time: in such scenarios, the former example is clearly too simple.

While the feedback problem could have been solved easily using the logging tool (also on this site), a two way interaction with the user is another story...

To give some clues about how such situations can be handled, in this example we use the SimpleXMLRPCServer module present in any standard python installation to set up a server (which can be run on any machine with inbound connectivity, though we will assume it will be running on the UI) which will receive the connections from another script (the client side), this time running on the WNs.

While, as in the previous exercise, the purpose of the script on the client is to prepare the environment for the matmul binary to start (keep in mind that the required cpus are already reserved as soon as the script starts, since reserve_smp_nodes handles this), it's very different the way in which such operation is performed, namely all the data is directly downloaded/upoaded from/to the UI, and the logging messages are printed directly on the user's screen.

One of the consequences of this setup, is that no SE is used at all: this is a mixed blessing, since while it is easier to create a simulation and the output is retrieved in real time, the time and bandwidth required to send the input data (in this case the executable only) to the WNs may make this workflow un-practical (though nothing forbids to mix this approach with the previous one).

Starting the interactive simulation: server side

Running the server without arguments returns us:

$ ./matmul_xmlrpc_server
Wrong number of arguments:
        ./matmul_xmlrpc_server <password> <listening port>

To start the simulation, first we need a suitable port for your server to listen on. To get it, the correct approach would be to use the command netstat, and pick a port among the available ones.

A more pragmatic approach to this problem, though, is to a) pick random number between 22000 and 25000 b) start the server and if it complains about the port being busy, going back to a)... let's say we will run the server on the port 22017.

To protect the server against requests from possibly malicious attackers or even against the risk of running more than one simulation on it at one time a password is also required: we will use foo, this time.

Though it's not specified, we also need the matmul binaries to be somewhere on the server filesystem: we will assume that they are in the same directory where we will start the server itself.

$ ./matmul_xmlrpc_server foo 22017

and we can, at this point, leave the server where it is: it will do nothing until an incoming connection will be received from one of our jobs: submitting them will be our next task.

All this has to be performed only once.

Starting the interactive simulation: client side

Switch to another terminal (leave the server pending); starting the client returns:

$ ./matmul_xmlrpc_client
Wrong number of arguments:
        ./matmul_xmlrpc_client <password> <host> <port>

The password and port will be the same used for the server, since they must match, the host is the fully resolved hostname of the machine where the server is running (the UI in our case).

Therefore, a task file for this job (task2.txt) could be like this:

2:./matmul_xmlrpc_client:foo 22017

Note that multiple lines in this file should point to different servers and employ different ports and passwords.

At this point we can launch, as we already mentioned in the previous example, the reserve_smp_nodes:

$ ./reserve_smp_nodes -T 1000 -J tasks2.txt -r -j 15

At this point we just have to wait that our reservation terminates successfully and then switch to the terminal with the server: if we did everything right, almost immediately the server will come ti life and ask us on behalf of the client the name of the binary to run (matmul or matmul_short):

./matmul_xmlrpc_server foo 22017
File to run execute?

which will be sent to the client and then executed while we will receive the results in real time:

$ ./matmul_xmlrpc_server foo 22017
File to run execute? matmul_short
Getting the executable...
The executable is 1155398 bytes long
Setting the permissions
Setting 'OMP_NUM_THREADS' to 2
About to run the code.
  256  6284.543     0.267
  286  6189.731     0.378
  976  6396.475    14.535
 1006  6368.522    15.987
285.10user 0.35system 2:23.79elapsed 198%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+68911minor)pagefaults 0swaps
Program terminated: shut down the server
or run another simulation

The submission can then be repeated (you don't need to start the server every time) in order to execute different computations or the server shut down and the session terminated.

Note that, with very little effort, the client could be modified in order to do as many computation as we want, thus relieving us from re-submitting the jobs and trying the reservation again: this is a winning strategy for short interactive jobs which take less than 24 hours to execute (like most interactive computations) since it allows us to use better the resources and rely less on the reservation strategy!!!

We invite the advanced users to inspect the scripts and modifying them for their needs: however they have been made in such a generic way that any program can be launched with them (even /usr/bin/date :-)), and that any GotoBLAS enabled one can benefit from the multi-threaded environment (as long as the output is written to the standard output).

Since the wrapper used doesn't affects the performances, we will not investigate the results: cfr. the comments at the end of the first example.

Related content
« December 2022 »
Su Mo Tu We Th Fr Sa

Powered by Plone This site conforms to the following standards: