
OSG Grid Applications

by admin last modified 2006-07-10 09:35

A guide to making grid-enabled applications with the OSG.

Running a grid job with a grid-enabled application takes four steps.  The following directions are specific to OSG 0.4.1, but the ideas apply in general to many grids.  The four steps are:

  • Installing the application on the remote computer.
  • Transferring data.
  • Executing the remote application.
  • Retrieving results.
This tutorial is illustrated with an application that performs an Ising model calculation.  Download the tarball to your home directory and untar it:
wget http://t2.unl.edu/Members/brian/gpn-demo.tgz
tar zxf gpn-demo.tgz

Installing the Application

One feature of grids that catches many first-time application packagers is that the software installed at remote sites is unpredictable.  Any software dependencies your job has must therefore be brought in by you.  Grid sites also vary in their architecture, so the software must be compiled for a 32-bit architecture.

OSG software must be installed to $OSG_APP.  Make sure to give your application directory a unique name, such as $OSG_APP/<VO NAME>/<APP NAME>, to avoid namespace conflicts.

In our example, the Ising software requires python and several obscure python modules.  While a working python may be a safe assumption for the remote site, having obscure modules is not.  So, we have packaged our own python; in the gpn-demo files, it is "python-grid.tgz".  It contains python, and a file called "setup-grid.sh" which correctly configures the path when sourced.
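The contents of "setup-grid.sh" are not shown in this tutorial.  As a rough sketch only, a script of this kind might look like the following.  The bin and lib subdirectories, the PYTHON_GRID_HOME variable, and the local fallback for $OSG_APP are assumptions made for illustration; the real file in python-grid.tgz may differ.

```shell
#!/bin/sh
# Hypothetical sketch of a setup-grid.sh; the real file in python-grid.tgz may differ.
# Assumed layout (not confirmed by the tutorial):
#   $OSG_APP/gpn/brian/python-grid/bin   the bundled python interpreter
#   $OSG_APP/gpn/brian/python-grid/lib   shared libraries it needs

# Fall back to a scratch location when OSG_APP is unset, so the sketch can be tried locally.
: "${OSG_APP:=/tmp/osg-app-demo}"
PYTHON_GRID_HOME="$OSG_APP/gpn/brian/python-grid"

# Put the bundled interpreter ahead of any system python.
PATH="$PYTHON_GRID_HOME/bin:$PATH"
# Let the bundled interpreter find its own shared libraries first.
LD_LIBRARY_PATH="$PYTHON_GRID_HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export PYTHON_GRID_HOME PATH LD_LIBRARY_PATH
```

Because the file only sets variables, it has to be sourced rather than executed for the changes to reach the calling shell.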

The stage-in-data.sh script copies the python application to the remote compute element and unpacks it.  Here are its contents:
#!/bin/sh

MYHOST=gpn-husker.unl.edu
MYJM=/jobmanager-fork

MYPWD=$PWD
MYHOME=$HOME

export OSG_APP=`globus-job-run $MYHOST /bin/sh -c 'echo $OSG_APP'`

MY_DIRECTORY=$OSG_APP/gpn/brian

# Make sure the directory exists:
echo globus-job-run $MYHOST:$MYJM /bin/sh -c "mkdir -p $MY_DIRECTORY"
globus-job-run $MYHOST:$MYJM /bin/sh -c "mkdir -p $MY_DIRECTORY"

# Stage my software
echo globus-url-copy file:///$MYPWD/python-grid.tgz gsiftp://$MYHOST/$MY_DIRECTORY/python-grid.tgz
globus-url-copy file:///$MYPWD/python-grid.tgz gsiftp://$MYHOST/$MY_DIRECTORY/python-grid.tgz

# Unpack my software
echo globus-job-run  $MYHOST:$MYJM /bin/sh -c "cd $MY_DIRECTORY && tar zxf python-grid.tgz"
globus-job-run $MYHOST:$MYJM /bin/sh -c "cd $MY_DIRECTORY && tar zxf python-grid.tgz"
The two commands used here are "globus-job-run" and "globus-url-copy".  The first runs a command on the remote host, and the second copies a file.  Before each command is run, the script echoes it to standard output to show how much progress has been made.  Note the use of this command:
globus-job-run $MYHOST /bin/sh -c 'echo $OSG_APP'
to detect the location of the $OSG_APP directory on the remote host.  Any of the standard OSG directories can be detected in this way.
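The echo-then-run pattern and the directory detection can each be wrapped in a small helper.  This is a minimal sketch: the function names run and remote_env are hypothetical, and remote_env assumes globus-job-run is on the path and a valid grid proxy exists.

```shell
#!/bin/sh
# Echo a command line, then execute it, so each step is written only once.
run() {
    echo "$@"
    "$@"
}

# Print the value of an environment variable as seen on the remote host.
# Usage: remote_env HOST VARNAME
# (hypothetical helper; needs globus-job-run and a valid grid proxy)
remote_env() {
    globus-job-run "$1" /bin/sh -c "echo \$$2"
}

# Example, not executed here:
#   MYHOST=gpn-husker.unl.edu
#   OSG_DATA=`remote_env $MYHOST OSG_DATA`
#   run globus-job-run $MYHOST:/jobmanager-fork /bin/sh -c "mkdir -p $OSG_DATA/gpn/brian"
```

Note that the echoed line loses the original quoting, just as in the script above, so it is a progress indicator rather than something to paste back into a shell.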

Any large files should be staged in beforehand; having Condor copy over very large files is not reliable.

Transferring over Data

Any large data files should be transferred using globus-url-copy.  If you need to copy over large datasets (anything over 10 GB), the files should be staged with SRM to a storage element beforehand.  VOs that have the largest dataset needs (multi-terabyte) should have a dedicated transfer application.
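The size rule above can be written down as a small check.  This is only a sketch: the function names are hypothetical, and reading "10 GB" as 10 x 2^30 bytes is an assumption.

```shell
#!/bin/sh
# 10 GB threshold from the text; treating GB as 2^30 bytes is an assumption.
SRM_THRESHOLD_BYTES=$((10 * 1024 * 1024 * 1024))

# Print the suggested transfer route for a payload of the given size in bytes.
suggest_transfer() {
    if [ "$1" -gt "$SRM_THRESHOLD_BYTES" ]; then
        echo "srm"
    else
        echo "globus-url-copy"
    fi
}

# Print the combined size in bytes of the files given as arguments.
total_bytes() {
    total=0
    for f in "$@"; do
        size=$(wc -c < "$f")
        total=$((total + size))
    done
    echo "$total"
}

# Example, not executed here:
#   suggest_transfer `total_bytes input1.dat input2.dat`
```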

Our application does not need to transfer data; rather, it generates its initial conditions on the remote server.  This is done with a Condor job.  Here is the submit file that we use:
executable = ising-generate-wrapper.sh
output = ising.A.out
error  = ising.A.err
log    = ising.log
universe = globus
GlobusScheduler = gpn-husker.unl.edu:/jobmanager
notification = NEVER
transfer_input_files = ch10/ising-generate.py
WhenToTransferOutput = ON_EXIT_OR_EVICT
queue

The "ising-generate-wrapper.sh" script sets up the environment to use the application as installed above, and the transferred "ising-generate.py" script actually generates the data.  The contents of ising-generate-wrapper.sh are:

#!/bin/sh
mkdir -p $OSG_DATA/gpn/brian
cp ising-generate.py $OSG_DATA/gpn/brian
cd $OSG_DATA/gpn/brian
source $OSG_APP/gpn/brian/python-grid/setup-grid.sh
python ising-generate.py

Note how it copies ising-generate.py to $OSG_DATA/gpn/brian and runs it there.  Normally, data should be generated in $OSG_WN_TMP and then transferred to $OSG_DATA.  In this case, however, the data file is extremely small, so it makes no difference.
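For larger outputs, the usual pattern of working in $OSG_WN_TMP and copying only the result to $OSG_DATA could be sketched as follows.  The function name run_in_scratch, its arguments, and the "ising.XXXXXX" scratch directory name are hypothetical; the gpn/brian path follows the wrapper above.

```shell
#!/bin/sh
# Sketch: work in node-local scratch, then copy the result to shared storage.
# OSG_WN_TMP and OSG_DATA are expected to be set by the site.
# Usage: run_in_scratch OUTPUT_FILE COMMAND [ARGS...]   (hypothetical helper)
run_in_scratch() {
    output_file=$1
    shift
    dest="$OSG_DATA/gpn/brian"
    # Private scratch directory so concurrent jobs on one node do not collide.
    scratch=$(mktemp -d "$OSG_WN_TMP/ising.XXXXXX") || return 1
    mkdir -p "$dest" || return 1
    (
        cd "$scratch" || exit 1
        "$@" || exit 1                  # generate the data in scratch
        cp "$output_file" "$dest/"      # keep only the result
    )
    status=$?
    rm -rf "$scratch"                   # always clean up the worker node
    return $status
}
```

The command is run from inside the scratch directory, so any script it refers to should be given by absolute path, for example "$PWD/ising-generate.py" expanded before the call.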

Executing the Remote Application

Finally, we are ready to execute the remote application.  This job is submitted to the Condor jobmanager, so it runs on a worker node, not the head node:

executable = ising-grid-wrapper.sh
output = ising.B.out
error  = ising.B.err
log    = ising.log
universe = globus
GlobusScheduler = gpn-husker.unl.edu:/jobmanager-condor
transfer_input_files = ch10/ising-grid.py
WhenToTransferOutput = ON_EXIT_OR_EVICT
notification = NEVER
queue
Just like before, ising-grid-wrapper.sh sets up the environment, while ising-grid.py is the script which drives the application.  The contents are similar to ising-generate-wrapper.sh.

Retrieving Data

Most grid applications have nontrivial results which must be retrieved separately.  The stage-out-data.sh script does this; it is very similar to the stage-in-data.sh script:
#!/bin/sh
MYHOST=gpn-husker.unl.edu
MYJM=/jobmanager-fork

MYPWD=$PWD
MYHOME=$HOME

echo globus-job-run $MYHOST:$MYJM /bin/sh -c "export"
export OSG_DATA=`globus-job-run $MYHOST /bin/sh -c 'echo $OSG_DATA'`

echo $OSG_DATA

MY_DIRECTORY=$OSG_DATA/gpn/brian

# Stage out my data
globus-url-copy gsiftp://$MYHOST/$MY_DIRECTORY/matrix-out.txt file:///$MYPWD/matrix-out.txt

# Delete my data
globus-job-run $MYHOST:$MYJM /bin/sh -c "cd $MY_DIRECTORY && rm matrix-out.txt"

This will bring the remote data to the current directory, and then delete it from the remote computer.
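The script above deletes the remote file even if the copy failed.  A more cautious variant, sketched here, runs the cleanup only after the transfer has succeeded and the local file is non-empty.  The stage_out function and the COPY_CMD hook are hypothetical, and treating an empty file as a failure is an assumption.

```shell
#!/bin/sh
# COPY_CMD is a hypothetical hook so the logic can be tried without grid tools;
# by default it is the globus-url-copy command used in the tutorial.
: "${COPY_CMD:=globus-url-copy}"

# Usage: stage_out SOURCE_URL LOCAL_FILE CLEANUP_COMMAND [ARGS...]
# LOCAL_FILE must be an absolute path.
stage_out() {
    src=$1
    local_file=$2
    shift 2
    if ! $COPY_CMD "$src" "file://$local_file"; then
        echo "transfer of $src failed; leaving remote copy in place" >&2
        return 1
    fi
    # Assumption: an empty result is treated as a failed transfer.
    if [ ! -s "$local_file" ]; then
        echo "$local_file is missing or empty; leaving remote copy in place" >&2
        return 1
    fi
    "$@"    # the cleanup command, run only after a good transfer
}

# Example, not executed here:
#   stage_out gsiftp://$MYHOST/$MY_DIRECTORY/matrix-out.txt $MYPWD/matrix-out.txt \
#       globus-job-run $MYHOST:$MYJM /bin/sh -c "cd $MY_DIRECTORY && rm matrix-out.txt"
```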

Putting it all together with DAGMan

The entire chain for our grid job goes:
stage in -> generate job -> process job -> stage out
This whole process can be automated with a Condor DAGMan script.  The homepage for DAGMan is here: http://www.cs.wisc.edu/condor/dagman/.  A sample DAG script is found in the gpn-demo files:

# Filename: ising.dag
#
Job A Generate.condor
Job B Process.condor
Script PRE A stage-in-data.sh
Script POST B stage-out-data.sh
PARENT A CHILD B

Then, the entire process can be started with the following command:

condor_submit_dag ising.dag

Walk away, and wait for your results to come back in a couple of minutes.
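Before submitting, it can be worth confirming that every submit file and script named in the DAG actually exists.  The check below is a sketch: check_dag is a hypothetical helper, and it assumes the simple "Job NAME FILE" and "Script PRE|POST NAME FILE" line forms used in ising.dag, with paths relative to the current directory and no spaces in file names.

```shell
#!/bin/sh
# Report files referenced by JOB and SCRIPT lines of a DAG file that do not exist.
# Usage: check_dag ising.dag    (returns non-zero if anything is missing)
check_dag() {
    missing=0
    # JOB <name> <submit file>;  SCRIPT PRE|POST <job> <executable> [args]
    for f in $(awk 'toupper($1) == "JOB"    { print $3 }
                    toupper($1) == "SCRIPT" { print $4 }' "$1"); do
        if [ ! -f "$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    return $missing
}

# Example, not executed here:
#   check_dag ising.dag && condor_submit_dag ising.dag
```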


Conclusions

What we have accomplished in this walkthrough is the execution of a single, simplistic grid job for one user on a specific compute element.  The most elaborate grid applications submit thousands of complex jobs for multiple users across the entire grid, and include resubmission, failover, and monitoring capabilities.

So, there is a lot of work still to be done on our application.  It will suffice for a single-job-at-a-time approach, but would not work well for numerous jobs.  Normally, one would have yet another wrapper script on the local computer that generates the above job description and scripts in unique directories.  This way, each job's input and output data are kept separate.  Further, we have not addressed any reliability concerns: what do we do when a job fails?  Obviously, if we submit 1000 jobs, we do not want to check each by hand.  External utilities such as BOSS help with job monitoring and control.
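A local generator of the kind described could be sketched like this.  The make_job_dir function and the "job-N" directory naming are hypothetical; the submit-file keys are the ones from the generate job earlier in this tutorial, with the executable and input script passed in as absolute paths so that they resolve from any job directory.

```shell
#!/bin/sh
# Sketch: one directory per job, each holding its own submit file.
# Usage: make_job_dir BASE_DIR JOB_NUMBER INPUT_SCRIPT WRAPPER   (hypothetical helper)
# Prints the directory that was created.
make_job_dir() {
    job_dir="$1/job-$2"
    mkdir -p "$job_dir" || return 1
    cat > "$job_dir/Generate.condor" <<EOF
executable = $4
output = ising.A.out
error  = ising.A.err
log    = ising.log
universe = globus
GlobusScheduler = gpn-husker.unl.edu:/jobmanager
notification = NEVER
transfer_input_files = $3
WhenToTransferOutput = ON_EXIT_OR_EVICT
queue
EOF
    echo "$job_dir"
}

# Example, not executed here:
#   for n in 1 2 3; do
#       make_job_dir "$HOME/ising-runs" $n "$PWD/ch10/ising-generate.py" "$PWD/ising-generate-wrapper.sh"
#   done
```

Submitting from inside each job directory keeps the relative output, error, and log files of different jobs apart.  The remote side would need the same treatment, since every job in this tutorial writes to the same $OSG_DATA/gpn/brian directory.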

Finally, we have written scripts for each piece of a rather generic process.  There are efforts, such as Pegasus from the iVDGL, which automate the script generation by specifying the whole grid job in a single XML-based config file.

