SciClone Local Manual

pbslam


Purpose

Run a LAM/MPI program under PBS.

Synopsis

exec pbslam [-fghntTvx] [-c np] [-C load | -X load] [-D | -W dir] [-N net] [-r rpi] [-s coll] program [args...]

Description

The PBS job scheduling system allocates resources for parallel programs, but does not provide the system-specific procedures for actually initiating and executing parallel programs on those resources. pbslam provides such an interface between PBS and the LAM/MPI runtime system, including the following services:
 

In order to properly intercept termination signals, pbslam must be exec'ed, replacing the shell which invokes it. pbslam checks for this, and will complain if it is not in the proper location within the PBS process hierarchy.

For added flexibility, pbslam provides two different strategies for mapping processes onto processors; these are described in detail in the section on Process Mapping. Which strategy is best depends on the requirements of the application, the number and type of nodes requested for the job, and the number of processes which will be run on those nodes.

Arguments

-c np
Run np copies of program on the assigned nodes. If this option is not specified, one process is assigned to each PBS virtual processor.
 
-C load
Before starting the program, check the CPU utilization on each node assigned to the job, and report any node whose utilization exceeds load. load should be a decimal fraction in the range from 0.0 to 1.0. A certain amount of system-related background activity is unavoidable, so the minimum useful value for load is probably in the 0.01-0.02 range. By default, no checking is done. -C and -X are mutually exclusive.
 
-D
Use the directory which contains program as the working directory for LAM processes. By default, pbslam runs program in the directory from which it is invoked (i.e., the current working directory). The same directory pathname is used on all nodes. -D and -W are mutually exclusive.
 
-f
Do not configure LAM's standard I/O descriptors. Output from remote processes is directed to /dev/null. By default, stdout and stderr from remote processes are routed back to stdout and stderr of LAM's mpirun command, which in turn is routed to stdout and stderr of pbslam.
 
-g
Enable LAM's Guaranteed Envelope Resources (GER) mode. GER is disabled by default.
 
-h
Print a help message, listing the available options with a brief description of each.
 
-n
Use a node-order strategy (explained below) for mapping processes to processors.
 
-N net
On SciClone, many nodes are connected to more than one network. By default, pbslam uses the best available network for communication between nodes. The -N option allows the user to specify an alternate network, where net is a valid hostname suffix (minus the leading "-") defined in the networks table in the SciClone User's Guide. All of the nodes assigned to the job must have interfaces on the requested network and the requested RPI must be supported on the requested network; if not, pbslam will abort the job.
 
-r rpi
Override the default choice of RPI (Request Progression Interface, the LAM module that carries point-to-point messages). Allowable choices include tcp, usysv, sysv, gm, and lamd. If all of the nodes allocated to a job are situated on the same Myrinet (either Myrinet 1280 or Myrinet 2000), pbslam selects gm as the default RPI; otherwise usysv is the default. The gm RPI can be requested only on Myrinet-enabled nodes, and only when all of the nodes in the job are on the same Myrinet (equivalent to the default behavior). The tcp, usysv, sysv, and lamd RPIs can be used on any network, although performance may be suboptimal.
 
-s coll
Force LAM to use a particular module for collective operations. Available choices include lam_basic, smp, and shmem. lam_basic provides collective operations which are layered on top of the point-to-point communication operations, and can be used with any number and configuration of processors. smp is optimized for use with multiple nodes, each of which has multiple processes/processors assigned to it. shmem is optimized for use within a single shared-memory node and can only be used in that environment. By default, LAM picks an appropriate collective module at runtime; users should rarely need to override the default. Explicitly specifying smp enables associativity in MPI reduction operators, which may provide improved performance (and slightly different numerical results) for some applications.
 
-t
Enable LAM's trace generation capability, with tracing initially turned off. By default, tracing is disabled. Mutually exclusive with -T.
 
-T
Enable LAM's trace generation capability, with tracing initially turned on. Mutually exclusive with -t.
 
-v
Verbose mode. Enable verbose option on LAM commands, and generate additional output about the progress of pbslam, as well as a listing of allocated nodes and the mapping of processes to processors.
 
-W dir
Use dir as the working directory for LAM processes. -W and -D are mutually exclusive.
 
-x
Enable LAM's fault tolerant "heart beat" mode. By default, heart beat messages are disabled to achieve maximum performance.
 
-X load
Before starting the program, check the CPU utilization on each node assigned to the job, and abort if the utilization on any node exceeds load. load should be a decimal fraction in the range from 0.0 to 1.0. By default, no checking is done. -C and -X are mutually exclusive.
 
program
Name of the LAM MPI program to be invoked via mpirun. If a full pathname is not given, the current search path ($PATH environment variable) is used to locate program.
 
args...
Arguments for program.
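
The mutual-exclusion rules and the load range described above can be summarized in a few lines of Python. This is only an illustrative sketch, not pbslam's own code: the function name check_options and its interface are hypothetical, and pbslam's actual diagnostics may differ.

```python
# Hypothetical helper for illustration only; not part of pbslam.
# Option pairs the manual describes as mutually exclusive.
MUTUALLY_EXCLUSIVE = [("-C", "-X"), ("-D", "-W"), ("-t", "-T")]


def check_options(flags, load=None):
    """Return a list of problems found in a proposed pbslam option set.

    flags: set of option names as they would appear on the command line.
    load:  the value that would be given to -C or -X, if any.
    """
    problems = []
    for a, b in MUTUALLY_EXCLUSIVE:
        if a in flags and b in flags:
            problems.append(f"{a} and {b} are mutually exclusive")
    # The manual asks for a decimal fraction between 0.0 and 1.0.
    if load is not None and not 0.0 <= load <= 1.0:
        problems.append(f"load {load} is outside the range 0.0 to 1.0")
    return problems


print(check_options({"-n", "-v", "-C"}, load=0.05))
print(check_options({"-t", "-T", "-X"}, load=1.5))
```

The first call describes a consistent option set and yields an empty list; the second combines -t with -T and gives an out-of-range load, so it yields two messages.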

Process Mapping

When PBS is configured, each node in the system is assigned one or more virtual processors (or VPs, for short). On SciClone, the number of PBS virtual processors on each node is identical to the number of physical processors on that node (except for server nodes, which only allow PBS to use one processor). PBS then allocates virtual processors to jobs, based on the resource requirements specified by the qsub command. The hostname of each virtual processor allocated to a job is available at runtime in a file specified by the PBS_NODEFILE environment variable. The order in which hosts are listed in this file corresponds to the order in which they are requested by the "-l nodes=" option of qsub.

pbslam supports two different schemes for mapping LAM processes onto PBS virtual processors. We call one of these schemes "VP order", and the other "node order". By default, pbslam uses VP order; node order is invoked with the -n option. The contents of PBS_NODEFILE, as well as the mapping of processes onto nodes, are displayed on stdout when the -v option is specified.

VP Order: Processes are assigned one per VP in the order listed in PBS_NODEFILE. (Note that there may be more than one VP per node.) If the number of processes requested by the -c option is larger than the number of VPs allocated to the job, pbslam wraps around to the beginning of the VP list, assigning an additional process to each VP. This procedure repeats until all processes have been assigned.

Node Order: Processes are assigned one per node, wrapping around until all of the VP slots on all of the nodes are filled. If the number of processes requested by the -c option is larger than the number of VPs allocated to the job, pbslam wraps around and assigns an additional process to each node (rather than to each VP), repeating until all processes have been assigned.

Example 1: Our first example illustrates the difference in assignment strategies for a job which maps 12 processes onto 8 virtual processors which are spread across four nodes. Assume the following PBS job request:

qsub -l nodes=2:single+1:dual:ppn=2+1:quad:ppn=4
#!/usr/bin/csh
exec pbslam -c 12 myprog
^D

On SciClone, the resulting PBS_NODEFILE might look like:

wh01
wh02
tw01
tw01
hu01
hu01
hu01
hu01
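
Since each line of the node file names one virtual processor, the number of VPs per node can be recovered by counting repeated hostnames. The following Python sketch shows one way to do that for the listing above; the helper name vps_per_node is hypothetical and is not something pbslam provides.

```python
def vps_per_node(nodefile_text):
    """Count virtual processors per host, keeping first-seen host order."""
    counts = {}
    for line in nodefile_text.splitlines():
        host = line.strip()
        if host:  # skip blank lines
            counts[host] = counts.get(host, 0) + 1
    return counts


# The node file contents from the example above.
EXAMPLE_NODEFILE = """\
wh01
wh02
tw01
tw01
hu01
hu01
hu01
hu01
"""

for host, count in vps_per_node(EXAMPLE_NODEFILE).items():
    print(host, count)
```

Inside a running job, the same function could be applied to the contents of the file named by the PBS_NODEFILE environment variable.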

With the default VP order, processes would be mapped as follows:

p0 -> wh01
p1 -> wh02
p2 -> tw01
p3 -> tw01
p4 -> hu01
p5 -> hu01
p6 -> hu01
p7 -> hu01
p8 -> wh01
p9 -> wh02
p10 -> tw01
p11 -> tw01
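
The VP-order rule amounts to indexing the VP list modulo its length. A minimal Python sketch, assuming the VP list is simply the sequence of lines in PBS_NODEFILE (the function name vp_order is hypothetical, and this is not pbslam's actual implementation):

```python
def vp_order(vps, nprocs):
    """Map process ranks 0..nprocs-1 onto VPs, wrapping around the VP list."""
    return [vps[rank % len(vps)] for rank in range(nprocs)]


# VP list taken from the example node file, in file order.
vps = ["wh01", "wh02", "tw01", "tw01", "hu01", "hu01", "hu01", "hu01"]

for rank, host in enumerate(vp_order(vps, 12)):
    print(f"p{rank} -> {host}")
```

For 12 processes on these 8 VPs, the sketch reproduces the listing above: ranks 8 through 11 wrap around to the first four entries of the VP list.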

If the -n option had been used on the pbslam command, the mapping would instead be:

p0 -> wh01
p1 -> wh02
p2 -> tw01
p3 -> hu01
p4 -> tw01
p5 -> hu01
p6 -> hu01
p7 -> hu01
p8 -> wh01
p9 -> wh02
p10 -> tw01
p11 -> hu01
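
Node order can be sketched in Python as a round-robin over nodes that still have free VP slots, followed by a plain round-robin over all nodes once the slots are exhausted. This is one reading of the rule as described in this manual, written to reproduce the listing above; the function name node_order is hypothetical and pbslam's real implementation may differ in detail.

```python
def node_order(vps, nprocs):
    """Map process ranks onto hosts in node order.

    vps is the list of hostnames from PBS_NODEFILE, one entry per VP.
    """
    slots = {}
    for host in vps:
        slots[host] = slots.get(host, 0) + 1
    nodes = list(slots)  # distinct hosts, in first-seen order

    # Phase 1: one process per node per pass, skipping nodes whose
    # VP slots are already full, until every slot is used.
    mapping = []
    while any(slots.values()):
        for host in nodes:
            if slots[host] > 0:
                mapping.append(host)
                slots[host] -= 1
    mapping = mapping[:nprocs]

    # Phase 2: more processes than VPs, so add one extra process
    # per node (not per VP), repeating as needed.
    extra = 0
    while len(mapping) < nprocs:
        mapping.append(nodes[extra % len(nodes)])
        extra += 1
    return mapping


vps = ["wh01", "wh02", "tw01", "tw01", "hu01", "hu01", "hu01", "hu01"]
for rank, host in enumerate(node_order(vps, 12)):
    print(f"p{rank} -> {host}")
```

When the process count equals the node count, the first phase stops after a single pass, which is why -n combined with -c places exactly one process on each node even if every node contributes several VPs.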

Example 2: The -n option is particularly useful if you want exclusive use of multi-processor nodes, but only want to use a subset of the processors on each node. The following example allocates all 16 processors on 8 dual-CPU nodes (thereby ensuring that the job has exclusive use of the nodes), but only assigns one process to each node:

qsub -l nodes=8:compute:dual:ppn=2
#!/usr/local/bin/tcsh
exec pbslam -n -c 8 myprog
^D

Example 3: To force both local and remote communication to use TCP/IP and to route remote traffic via InfiniBand rather than Ethernet, use the -N and -r options in conjunction with the "ib4x" node property:

qsub -l nodes=16:dual:ib4x:ppn=2
#!/usr/local/bin/tcsh
exec pbslam -N 8 -r tcp myprog
^D

Exit Status

If the LAM job runs to completion, pbslam returns the exit status from the mpirun command. If pbslam terminates by catching a signal, it returns the signal number. If pbslam detects any other error condition, it returns a non-zero value.
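
Because a signal number and an mpirun failure status can share the same numeric value, a script that inspects pbslam's exit status can only narrow down the cause. The Python sketch below shows one cautious way to label a status; the function name describe_status and the message wording are hypothetical, and the signal names come from the system running the script rather than from pbslam.

```python
import signal


def describe_status(status):
    """Give a hedged, human-readable reading of a pbslam exit status."""
    if status == 0:
        return "success"
    try:
        # Look up a signal with this number on the local system.
        name = signal.Signals(status).name
    except ValueError:
        return f"non-zero status {status}: mpirun failure or pbslam error"
    return (f"status {status}: possibly caught {name}, "
            f"or an mpirun/pbslam error with the same value")


print(describe_status(0))
print(describe_status(int(signal.SIGTERM)))
```

A status that does not correspond to any signal number can be attributed to mpirun or to another pbslam error; one that does correspond remains ambiguous without additional information such as the job's stderr output.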

Bugs and Limitations

pbslam does not provide access to all of the options and features supported in LAM. In particular, there is no way to directly specify different executables for different processes, although various workarounds can be imagined.

The performance of Solaris 9 TCP/IP over Myrinet is especially poor with certain message sizes. Use of the tcp, sysv, usysv, and lamd RPIs in conjunction with Myrinet is therefore not recommended at this time.

Circumstances may arise in which pbslam (or any other PBS job, for that matter) might not be able to find and kill all of the processes belonging to it. If a pristine execution environment is essential, additional checks (beyond -C or -X) may be needed to ensure that no stray processes reside on a node.

Related Topics