You are here: Home BEMuSE Porting the application.
Personal tools
Document Actions

Porting the application.

by Riccardo Di Meo last modified 2007-11-29 20:41

A description of the problems faced through the porting of the application, and the solutions found.

Application description

From the computational point of view, the application implementing the algorithm that we received from Alessandro Laio is loosely coupled, CPU intensive and not IO-intensive.

The Metadynamics algorithm has been implemented in the Gromacs molecular dynamics code (, adding the Bias-Exchange component into the source in order to adapt it to the GRID environment (using MPI).

With the aforementioned addition, every process running on the grid can be assimilated to a different agent/walker, moving along a molecular dynamics trajectory, and periodically exchanging information on the explored space with all the other walkers.

Every walker is assigned a different reaction coordinate to explore, with the help of a suitable bias. The interaction among all walkers results in the efficient exploration of a high-dimensional free energy surface, representing e.g. the folding process of a protein, in a simulation time of the order of tens on nanoseconds/process.

The patched binary behaves as the original Gromacs one, from which it inherits most of the code, plus it handles a new file ('COLVAR.0') which saves some status information about the walker (which will be, from now on, a process running in a BEM simulation associated to a specific reaction coordinate).

Here is a link to a guide (courtesy of Fabio Pietrucci) with instructions about how to write the META_INP.0 files needed in order to run a a simulation:

Bias Exchange original implementation

The application we received was already enabled to use 2 different methodologies in order to share the data among processors in a cluster setup: through the exchange of files on a shared directory between nodes, and trough MPI, both ways allowing the operation to be performed with periods in the order of a second.

However the first experimental evidences where appearing that suggested that lower exchanges periods wouldn't had negatively impacted the results, which was one of the many aspects we where intentioned to investigate with our porting (since as it will appear evident from our solutions, will be one of the key aspects in the success of our implementation).

Another aspect we hoped to shed light on, was which influence the etherogeneous hardware present on the grid would had on the performances of the algorithm, since until that point, every simulation made with it involved cluster like, homogeneous resources.

Resources availability and scheduling.

Since the original application was MPI-enabled, a very easy and fast porting would have implied running the code "as it is" on the grid, simply providing a JDL file to mimic the original behavior.

Though this approach may seem the best one at a first glance, it hides a number of subtle and limiting issues which would had strongly affected the efficiency of the resulting application.

All the Good reasons to rule MPI out.

The first issue that came to our mind, since it has been our strongest objection against the use of MPI on grid since when we started using it, is the big difference in the chances of scheduling a MPI job when compared with the chances of scheduling a collection of serial jobs of the same size.

This comes from two different sources: the limited number of MPI - enabled clusters on the grid, and the inherent difficulty of getting multiple slots in a queue system at once.

Another drawback, which affects all loosely coupled applications, is the inability to use CPUs from different Computing Elements (CE): in this way, choosing MPI means automatically to limit every run of the application to a single cluster, which is reasonable for programs which do intensive network IO, but would had been a strong drawback for our program.

MPI jobs are limited to a small subset
  of the grid resources.

Since there where good arguments suggesting that the BEM algorithm wouldn't had suffered from an asynchronous recruiting of the computational resources or from slow network connections and considered the great advantage that getting rid of MPI would have dealt to application in the long term, we decided to drop the MPI layer for the grid simulations and to start from scratches, using plain sockets to implement the infrastructure that would had carried the data for the exchanges.

This decision added also another interesting aspect for us to investigate: how would had the algorithm tolerated the dynamic, asynchronous scheduling and availability of the resources we had in mind for the porting, and which contribute mixing walkers which could have previously runned for different periods and at different speeds could give to the simulation.

MPI jobs are limited to a small subset
  of the grid resources.

Toward a new grid implementation

In order to gain the maximum benefits from the rewriting of the communication routines, we had to find a way to let Worker Nodes (WN) in different CEs exchange informations, which is not trivial since the connectivity on the clusters usually allows for incoming connections only.

However we already had to face the same issue in a previous project (for references, see the paper at this location) and due to the low demanding requirements of the application, we opted to use a server on a resolved host, whose task would had been to accept the connections from the WNs and to bridge the data between them.

This setup had the advantage of solving a number of other problems too, like finding a way to synchronize the exchanges between different walkers, providing a centralized clock to the simulation.

The next step was to study a way to integrate our code with the application in such a way that wouldn't had prevented the improvement of the source code as well as a potential upgrade to a new version of gromacs could allow for an upgrade to a new version without too much intervention.

Since we expected the logic and network code of our add-on to be quite large and we knew that all of it could have been inserted in a single routine (handling the exchange between the walkers), the first problem could have been solved, as a matter of principle, quite easily by creating a separate module containing all the routines.

We didn't selected this design though, since we decided to implement all the IO wrapping in python, in order to speed up the development, and therefore in the first version we created a python wrapping around the executable, communicating with the computational code with named pipes (handled in a separate module, with only a couple of calls inserted into the original source).

Though the choice of python proved to be a winning solution, this design demonstrated some flaws and we changed it to a less elegant (but more stable) one, where the exchanges where implemented and integrated into the simulation by literally terminating and restarting the application itself.

The application (mdrun) get executed inside a python script which, before every exchange cycle, stops it, exchanges the data, and then restart it: in this way the program remain oblivious and simply process the (possibly) new files on the local directory, in the illusion of being on a shared directory.

This design, for crude that it may seem, proved to be the best one in term of reliability and didn't affected the performances (as we thought when we tried it first), allowing us to meet all the following goals:

  • Keeping the original application's code almost completely separated from the grid details.
  • Integrating the network communication into the simulation.
  • Avoid unwanted interactions.
  • Simplifying the check-pointing.
  • Remaining compatible with future versions of the same algorithm
  • Hiding all the added complexity to the users of the application.
  • Allowing for an easy integration with any language, like python, without the additional burden of creating a C interface.

Since we are the first to acknowledge the crufty nature of such solution, other combination where also considered, none of them meeting all the aforementioned goals at once.

In order to run the application one the grid, we had also to choose between remaining compatible with the obsolete python version present by default on the WNs, or switching to a recent implementation, with the added complexity of having to install the software on every CE (or, as we did, finding a way to provide it with other means), due to the extreme inadequacy of the 2.2 version (which was present in the grid at the time) for networking, the latter option was preferred.

The current implementation.

The architecture is client-server, both implemented in python (version 2.5 required): the server running on a resolved host (very commonly on the User Interface or UI)employed to submit the jobs), the client wrapping the protein folding executable, and running in the grid, on the WNs.

The server receives the connections from the clients as soon as they reach the WNs and possibly rejects them (if the maximum number of connection is reached), handling at set time the Bias exchange with the available clients (if any), at which point it starts bridging the information between different couples of walkers (randomly selected), thus allowing for inter-CE communication (otherwise forbidden).

FIXME add a graph showing the basics of the connection sequence.

Detailed logs are provided by the server on the local filesystem, which can be inspected for debugging purposes, as well as statistics on the number of picoseconds performed by all the connected processes on the grid, updated at every exchange (provided by the clients, which mediate all the interactions with the simulation).

FIXME local output from the server.FIXME.

On the WN, a bootstrapping script (made in bash) is executed by the grid middleware as soon as a job becomes 'Running', providing a suitable version of python from a Storage Element (where it was previously saved, if not already installed as experiment software) and executing then the python client.

The client in turn, contacts the server and, after a basic authentication, receive an ID which will uniquely identify it in the simulation and that will be used to fetch the correct input files from the SE where they where previously saved by the user (when the simulation is started) or by another client which previously died (e.g. because it reached the maximum run length for the qeue) at a checkpoint.

After fetching both the input files and all the executable needed for the simulation (the already mentioned mdrun plus a couple of useful tools, provided by the gromacs package, to handle the data files), the client starts the simulation and contacts the server to acknowledge that it's ready, and wait for the next exchange cycle.

FIXME more graphs showing the basics of the connection sequence (client point of view, this time).

« November 2022 »
Su Mo Tu We Th Fr Sa

Powered by Plone This site conforms to the following standards: