
Client - Server approach

by Riccardo Di Meo last modified 2008-12-06 18:42

We present here a port of the application to the client/server paradigm: a server runs on a host reachable from the network and schedules the tasks, while multiple clients receive and process them on the grid. Warning: the software this document points to is to be considered in a BETA stage of development!

Client - Server port: foreword

Although this port has been tested and can be considered functional, due to time constraints it was completed in roughly 2 days and it still lacks some refinements, features and careful testing; it should therefore be considered "beta" software.

Here are some notable rough edges:

  • the structure of the code is not very rational (in particular, the source should be split among different files and some methods should be re-organized or rethought)
  • some values that should be passed through parameters are hard coded
  • the application requires attention from the user, in particular in the (all but) unlikely case of a grid failure
  • messages on the standard output of the server are somewhat misleading for the user, and too much debug logging is still present
  • the code has not been thoroughly tested and some minor bugs could be present.
  • the password for the authentication travels through the network in clear text...
  • checks for missing/dead clients are very sloppy: the server is single threaded, therefore no check is performed until something happens on it (a way to work around this, if required, is to telnet to the server and write something: see the example after this list).
  • (as a consequence of the last issue) some error conditions are probably not checked properly: the server may keep running after all the datasets have been processed, for example.
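
For instance, the following throwaway connection is enough to wake the server up and make it run its bookkeeping (host and port here are placeholders, use the ones in your jdl):

$ telnet myhost.example.org 24777
hello

Type anything and press Enter: the server will reject the bogus client and, in doing so, check for missing ones.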

Good luck!

The package

CSCDriver_on_grid is really a system of 2 programs: one running on the User Interface (the server) and one running in multiple copies on the grid (the client).

The 2 main advantages of this approach are:

  1. Easy job handling: the server assigns the tasks to the clients and reschedules them if the clients die unexpectedly
  2. Real time feedback: the user doesn't need to worry about retrieving the output of the jobs, since it is delivered directly to his/her computer. Moreover, depending on the implementation, the user also gets a real time picture of the state of the grid, without the usual grid delays

To install the CSCDriver_on_grid package, just untar it in a directory of your choice and create a link to CSCDriver_server.py in your $HOME/bin. You can now create a new clean directory anywhere in your computer and work with the package in there.
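
For example, assuming the package tarball is called CSCDriver_on_grid.tar.bz2 (the actual file name may differ):

$ mkdir ~/CSCDriver_on_grid
$ tar xf CSCDriver_on_grid.tar.bz2 -C ~/CSCDriver_on_grid
$ ln -s ~/CSCDriver_on_grid/CSCDriver_server.py ~/bin/
$ mkdir ~/my_run && cd ~/my_run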

The client, instead, will run on the grid: it comes with a jdl to submit it, which must be modified at least once before the execution of a simulation, since it contains the coordinates of the server and a simple password providing a very basic authentication mechanism.
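
The jdl line to edit looks something like the following sketch; the host name, port and password are placeholders, and the argument order is an assumption (check the jdl shipped with the package):

Arguments = "myhost.example.org 24777 mysecret";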

The simulation workflow goes like this: start the server (usually on the UI, but any host with inbound connectivity will do, since the server doesn't use the middleware at all), modify the jdl to point the client to the right host and port with the right password, and then blindly submit a large number of jobs.
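
Submitting a large number of jobs is most easily done in a loop; for example, with the gLite WMS tools (assuming the jdl is called client.jdl and that 40 copies are wanted):

$ for i in $(seq 1 40); do glite-wms-job-submit -a -o jobids.txt client.jdl; done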

Each running job will become a client, querying the server for a SURL to process (a single client can fulfill more than a single task, until the maximum running time limit is hit) and sending the results back to the server.
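
A minimal sketch of the client's main loop, in Python 2 like the rest of the package; the xmlrpc method names (get_task, put_result) and the run_cscdriver helper are illustrative assumptions, not the actual protocol:

import time
import xmlrpclib                  # Python 2 standard library XML-RPC client

PASSWORD = "mysecret"             # must match CSCDRIVER_PASSWORD on the server
MAX_RUNNING_TIME = 6 * 3600       # stop before the batch system kills the job

def run_cscdriver(surl):
    # placeholder: download the dataset at surl and run CSCDriver on it
    return "output of CSCDriver"

server = xmlrpclib.ServerProxy("http://myhost.example.org:24777")
deadline = time.time() + MAX_RUNNING_TIME

while time.time() < deadline:
    surl = server.get_task(PASSWORD)                        # hypothetical call
    if not surl:
        break                                               # no tasks left
    server.put_result(PASSWORD, surl, run_cscdriver(surl))  # hypothetical call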

The user then reads the logs and watches the files in the server's directory: a new folder is created, named after the .root file involved, which contains the output of CSCDriver and, at the end of the job, the content of "results.txt" and the "summary.root" file.

The package with the client and server can be found at this link.

Running the simulation

The server program uses getopt to parse the input arguments (short options only): use -h to get a handy help message:

$ CSCDriver_server.py -h
Usage: CSCDriver_server.py [options]

 The options are:

 -p        (p)ort to open. Use a port in the range [22000:25000]
           for grid execution

 -s        (s)URL list: a file in the local computer with the
           links to the datasets

 -c        (c)ode URL: a lfn: link to the tar.bz2 package containing
           the CSCDriver program and libraries (as specified in the
           euindia.ictp.it page)

 -P        (P)roxy required to retrieve the data: a file in the local
           file system created with the voms-proxy-init command that
           provide the right credentials to access the data set.

           E.g. an ATLAS certificate created with the commands:

           $ voms-proxy-init --voms <VO> --valid 300:0 --out <proxy file>

 -h   This help message

  A password should also be provided through the environment variable
  CSCDRIVER_PASSWORD.

  Password and port should match the ones in the client.

  All options, with the exception of the -h one, are all but optional  :-)
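
For example, to set the password before starting the server (the value is a placeholder and must match the one in the jdl):

$ export CSCDRIVER_PASSWORD=mysecret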

Run the server, change the Arguments line in the jdl and submit some jobs.

If you did everything right and there's no firewall in the way, as soon as a client starts running on the grid you should see some messages on the server terminal, like:

$ ./CSCDriver_server.py  -P x509up_u561_atlas -c lfn:/grid/euindia/someuser/CSCDriver_code.tar.bz2 -s surls.txt -p 24777  -e 20000
File surl_done.txt not read. Starting from scratch...
20 SURLs to process
++ Client 1 identified itself
**** Sat Dec  6 16:58:58 2008
**** 1 clients identified
**** 20 SURLs not yet assigned
**** 0 SURLs being processed
++ Client 2 identified itself
**** Sat Dec  6 16:59:03 2008
**** 1 clients identified
**** 20 SURLs not yet assigned
**** 0 SURLs being processed
++ Client 3 identified itself
**** Sat Dec  6 16:59:33 2008
**** 1 clients identified
**** 20 SURLs not yet assigned
**** 0 SURLs being processed
Client 1 had no tasks
Processing srm://t2-dpm-01.na.infn.it/dpm/na.infn.it/home/atlas/atlasuserdisk/(...)
File 'user69.JRandomHacker.ganga.5208_napoli_etmiss.200810101345476545241._00017.root/output.txt' in 'user69(...)AnalysisSkeleton.aan.root' opened
**** Sat Dec  6 16:59:40 2008
**** 3 clients identified
**** 19 SURLs not yet assigned
**** 1 SURLs being processed
++ Client 4 identified itself
(...)

At some point, if enough resources are thrown into the simulation and no problems arise, the server should exit on its own.

The server also creates a directory for each running client, where it puts the standard output of the CSCDriver program (100 lines at a time) in a file called "output.txt": it is safe to inspect the content of the file while the server is running, as long as it's not modified.
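
A read-only viewer is the safest way to do so, e.g. (using one of the directory names from the listing below):

$ tail -n 20 user69.JRandomHacker.ganga.5208_napoli_etmiss.200810101547017181._00011.AnalysisSkeleton.aan.root/output.txt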

Don't delete the directories while the server is in execution, or it may crash!

Here is an example of the content of the server directory after the correct execution of the code:

$ ls
surl_list.txt
surl_done.txt
user69.JRandomHacker.ganga.5208_napoli_etmiss.200810101546009958._00008.AnalysisSkeleton.aan.root
user69.JRandomHacker.ganga.5208_napoli_etmiss.200810101546091385._00005.AnalysisSkeleton.aan.root
user69.JRandomHacker.ganga.5208_napoli_etmiss.200810101546119545._00007.AnalysisSkeleton.aan.root
user69.JRandomHacker.ganga.5208_napoli_etmiss.200810101546330276._00018.AnalysisSkeleton.aan.root
user69.JRandomHacker.ganga.5208_napoli_etmiss.200810101546380112._00019.AnalysisSkeleton.aan.root
user69.JRandomHacker.ganga.5208_napoli_etmiss.200810101546471704._00009.AnalysisSkeleton.aan.root
user69.JRandomHacker.ganga.5208_napoli_etmiss.200810101546509511._00015.AnalysisSkeleton.aan.root
user69.JRandomHacker.ganga.5208_napoli_etmiss.200810101546520396._00010.AnalysisSkeleton.aan.root
user69.JRandomHacker.ganga.5208_napoli_etmiss.200810101546527452._00004.AnalysisSkeleton.aan.root
user69.JRandomHacker.ganga.5208_napoli_etmiss.200810101547017181._00011.AnalysisSkeleton.aan.root
user69.JRandomHacker.ganga.5208_napoli_etmiss.200810101547045241._00013.AnalysisSkeleton.aan.root
user69.JRandomHacker.ganga.5208_napoli_etmiss.200810101911086384._00006.AnalysisSkeleton.aan.root
user69.JRandomHacker.ganga.5208_napoli_etmiss.200810101911576716._00002.AnalysisSkeleton.aan.root
user69.JRandomHacker.ganga.5208_napoli_etmiss.200810101912078520._00020.AnalysisSkeleton.aan.root
user69.JRandomHacker.ganga.5208_napoli_etmiss.200810101912162195._00016.AnalysisSkeleton.aan.root
user69.JRandomHacker.ganga.5208_napoli_etmiss.200810101913168203._00003.AnalysisSkeleton.aan.root
user69.JRandomHacker.ganga.5208_napoli_etmiss.200810101913289381._00014.AnalysisSkeleton.aan.root
user69.JRandomHacker.ganga.5208_napoli_etmiss.200810101914488348._00001.AnalysisSkeleton.aan.root
user69.JRandomHacker.ganga.5208_napoli_etmiss.200810101916548069._00012.AnalysisSkeleton.aan.root
user69.JRandomHacker.ganga.5208_napoli_etmiss.200810101917229581._00017.AnalysisSkeleton.aan.root
x509up_u561_atlas

where the content of one of the directories is:

$ ls -l user69.JRandomHacker.ganga.5208_napoli_etmiss.200810101547017181._00011.AnalysisSkeleton.aan.root
total 84
-rw-------  1 dimeo dimeo 37846 Dec  6 17:00 output.txt
-rw-------  1 dimeo dimeo    35 Dec  6 17:00 results.txt
-rw-------  1 dimeo dimeo 40062 Dec  6 17:00 summary.root

If a client dies on the grid, the server will not be immediately aware of it: due to the short development time, the server checks for such events only after each xmlrpc call (i.e. when it prints the statistics about the clients and the tasks available).

A client is declared missing in action by the server if it fails to contact the server within SERVER_TIMEOUT seconds (SERVER_TIMEOUT is a constant defined in CSCDriver_server.py, line 28, and is set to 1 hour): this value can be decreased if the download of the dataset is very likely to take much less time, and increased if required (e.g. because some clients get killed while they are simply trying to fetch the dataset).
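
The relevant line in CSCDriver_server.py looks something like the following (only the constant's name, location and default are documented; the exact form is an assumption):

SERVER_TIMEOUT = 3600   # seconds; clients silent for longer are declared dead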

Tips'n Tricks

Here are some hints for working with this application:

  • Since the submission is handled by the user, it is his/her responsibility to make sure that enough jobs are in a running state to make the execution of the simulation as fast as possible: this is especially important in case some clients fail on the grid.

  • Don't use rigorous math to decide how many jobs to submit: a 2x factor in the number of jobs submitted for each task to process is fine, since the server will get rid of any extra clients it doesn't need anyway (e.g. for the 20 SURLs of the run above, submitting about 40 jobs is plenty).

  • Since the application is bandwidth intensive, try to submit it to several different CEs at once, instead of running too many copies on the same one: in this way the inbound bandwidth of a single CE will not become a bottleneck.

  • If you are using the same CE for the execution of the jobs, it might help to submit them with some delay between one another: in this way, while one WN is downloading the data, the others will be busy computing (this, however, depends a lot on the execution time).

  • If the ATLAS proxy expires, the server will refuse to feed new SURLs to the clients (and will just kill them as they come, instead), but it will wait for the running clients to return their output before terminating. It is therefore important to feed the server a proxy that lives long enough for the simulation to run to completion.

  • If the voms certificate passed to the server is about to expire, the simulation's life can be prolonged by simply creating a new one and overwriting the old. Use the mv command to do this, not cp, since mv is an atomic operation within the OS (see the example after this list)!

  • At the end of the simulation, the server should (hopefully) produce lists of the SURLs that were not processed properly (either because they failed for some reason, or simply because the proxy expired before they were processed). You can rename and use either one of these lists as the input for another simulation.
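
As an example of the proxy refresh mentioned above (the VO and file names are placeholders; the destination must be the proxy file the server was started with):

$ voms-proxy-init --voms atlas --valid 300:0 --out x509up_new
$ mv x509up_new x509up_u561_atlas

mv replaces the file in a single atomic step (as long as both names are on the same file system), so the server can never read a half-written proxy; cp rewrites the file in place and can race with a read.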
