SciClone Local Manual

pbsmpichgm


Purpose

Run an MPICH-GM program under PBS.

Synopsis

pbsmpichgm [-hmnsv] [-c np] [-C load | -X load] [-D | -W dir] [-k secs] [-r poll | block | hybrid] program [args...]

Description

The PBS job scheduling system allocates resources for parallel programs, but does not provide the system-specific procedures for actually initiating and executing parallel programs on those resources. pbsmpichgm provides such an interface for executing MPICH-GM jobs under PBS, including the following services:
 

MPICH-GM programs will only run on nodes with Myrinet interfaces. pbsmpichgm checks for this, and will complain if it determines that any of the nodes assigned to the job lack Myrinet connectivity. You can assure that only Myrinet nodes are assigned to a job by specifying the correct node properties with PBS's qsub command.

For added flexibility, pbsmpichgm provides two different strategies for mapping processes onto processors; these are described in detail in the section on Process Mapping. Which strategy is best depends on the requirements of the application, the number and type of nodes requested for the job, and the number of processes which will be run on those nodes.

Arguments

-c np
Run np copies of program on the assigned nodes. If this option is not specified, one process is assigned to each PBS virtual processor. As currently configured, MPICH-GM allows a maximum of four processes per node.
 
-C load
Before starting the program, check the CPU utilization on each node assigned to the job, and report any which exceed load. load should be a decimal fraction in the range from 0.0 to 1.0. By default, no checking is done. -C and -X are mutually exclusive. A certain amount of system-related background activity is unavoidable, so the minimum useful value for load is probably in the 0.01-0.02 range.
 
-D
Use the directory which contains program as the working directory for MPICH-GM processes. By default, pbsmpichgm runs program in the directory from which it is invoked (i.e., the current working directory). The same directory pathname is used on all nodes. -D and -W are mutually exclusive.
 
-h
Print a help message, listing the available options with a brief description of each.
 
-k secs
MPICH-GM does not implement the MPI_Abort() function correctly, so only processes which call MPI_Abort() directly will actually exit, leaving other processes active. As a workaround, mpirun.ch_gm can be configured to automatically kill all remaining processes in a job at some specified time interval after the first process exits. pbsmpichgm uses this feature to ensure that failed jobs are aborted in a timely manner and that stray processes get cleaned up. By default, the delay is set to 15 seconds. The -k option can be used to make this delay longer or shorter, but values longer than 45 seconds may interfere with PBS's job termination procedure and are therefore not recommended. Note that the delay occurs even when all processes in a job exit normally and/or at the same time, so large values for secs will slow down job termination and tie up processors needlessly. If processes in an application are supposed to exit at different times, there is a chance that slower processes could be killed prematurely. To prevent this, insert an extra MPI_Barrier() before the program's exit() or STOP statement to ensure that all processes exit together.  
 
-m
Enable shared memory communication between processes on the same node. Without this option, local communication is routed through the Myrinet interface. Use of shared memory greatly reduces latency, but places higher demands on the memory subsystem and perturbs the contents of processor caches. Which approach is faster depends on the characteristics of both the application and the hardware on which it is running, so some experimentation is in order. -m implies -r poll. Use of -r block or -r hybrid overrides -m and disables shared memory communication.
 
-n
Use a node-order strategy (explained below) for mapping processes to processors.
 
-r poll | block | hybrid
Specifies the behavior of blocking MPI receive calls. -r poll causes processes to check continuously for completion of send and receive events. This is appropriate when every process has a dedicated CPU available to it. -r block causes receives to block in the kernel, effectively releasing the CPU for other tasks. This option may perform better for applications which create more than one process or thread per processor. -r hybrid is a combination of the other two strategies, polling for up to 1 millisecond and then blocking if the operation has not been satisfied. The default is -r poll.
 
-s
Close standard input. This may be useful for some applications which expect stdin to be connected to a tty device. The default is to leave stdin open.
 
-v
Verbose mode. Enable the verbose option on mpirun.ch_gm, and generate additional output about the progress of pbsmpichgm, as well as a listing of allocated nodes and the mapping of processes to processors and to Myrinet ports.
 
-W dir
Use dir as the working directory for MPICH-GM processes. -W and -D are mutually exclusive.
 
-X load
Before starting the program, check the CPU utilization on each node assigned to the job, and abort if any of them exceed load. load should be a decimal fraction in the range from 0.0 to 1.0. By default, no checking is done. -C and -X are mutually exclusive.
 
program
Name of the MPICH-GM program to be invoked via mpirun.ch_gm. If a full pathname is not given, the current search path ($PATH environment variable) is used to locate program.
 
args...
Arguments for program.

Process Mapping

When PBS is configured, each node in the system is assigned one or more virtual processors (or VP's, for short). On SciClone, the number of PBS virtual processors on each node is identical to the number of physical processors on that node (except for the front end, which only allows PBS to allocate one of its two processors). PBS then assigns virtual processors to jobs, based on the resource requirements specified by the qsub command. The hostnames of each virtual processor allocated to a job are available at runtime in a file specified by the PBS_NODEFILE environment variable. The order in which hosts are listed in this file correpsonds to the order in which they are requested by the "-l nodes=" option of qsub.

pbsmpichgm supports two different schemes for mapping MPI processes onto PBS virtual processors. We call one of these schemes "VP order", and the other "node order". By default, pbsmpichgm uses VP order; node order is invoked with the -n option. The contents of PBS_NODEFILE, as well as the mapping of processes onto nodes, is displayed on stdout when the -v option is specified.

VP Order: Processes are assigned one per VP in the order listed in PBS_NODEFILE. (Note that there may be more than one VP per node.) If the number of processes requested by the -c option is larger than the number of VPs allocated to the job, then wrap around to the beginning of the VP list, assigning an additional process to each VP. This procedure repeats until all processes have been assigned.

Node Order: Processes are assigned one per node, wrapping around until all of the VP slots on all of the nodes are filled. If the number of processes requested by the -c option is larger than the number of VPs allocated to the job, wrap around and assign an additional process to each node (rather than VP), repeating until all processes have been assigned.

Example 1: Our first example illustrates the difference in assignment strategies for a job which maps 12 processes onto 8 virtual processors which are spread across four nodes. Assume the following PBS job request:

qsub -l nodes=3:dual:myri:ppn=2+1:quad:myri:ppn=4
#!/usr/bin/csh
pbsmpichgm -c 12 -r hybrid myprog
^D

On SciClone, the resulting PBS_NODEFILE might look like:

tn01
tn01
tn02
tn02
tn03
tn03

hu01
hu01
hu01
hu01

With the default VP order, processes would be mapped as follows:

p0 -> tn01
p1 -> tn01
p2 -> tn02
p3 -> tn02
p4 -> tn03
p5 -> tn03
p6 -> hu01
p7 -> hu01
p8 -> hu01
p9 -> hu01
p10 -> tn01
p11 -> tn01

If the -n option had been used on the pbsmpichgm command, the mapping would instead be:

p0 -> tn01
p1 -> tn02
p2 -> tn03
p3 -> hu01
p4 -> tn01
p5 -> tn02
p6 -> tn03
p7 -> hu01
p8 -> hu01
p9 -> hu01
p10 -> tn01
p11 -> tn02

Example 2: The -n option is particularly useful if you want exclusive use of a multi-processor node, but only want to use a subset of the processors on each node. The following example allocates all 16 processors on 8 dual-cpu nodes (thereby ensuring that the job has exclusive use of the nodes), but only assigns one process to each node:

qsub -l nodes=8:c2:myri:ppn=2
#!/usr/local/bin/tcsh
pbsmpichgm -n -c 8 myprog
^D

Exit Status

If pbsmpichgm detects an error, it exits with a non-zero value; otherwise, it returns the exit status from the mpirun.ch_gm command.

Bugs and Limitations

SciClone incorporates two distinct Myrinet networks which do not communicate with each other. pbsmpichgm will abort the job if all of the requested nodes do not reside on the same Myrinet network. The choice of Myrinet network is specified by using either the "myri" or "myri2" node property in the "-l nodes=" option of the qsub command. See the SciClone User's Guide for more information.

mpirun.ch_gm currently does not allow different executables to be started on different nodes, and neither does pbsmpichgm.

Circumstances may arise in which pbsmpichgm (or any other PBS job, for that matter) might not be able to find and kill all of the processes belonging to it. If a pristine execution environment is essential, additional checks (beyond -C or -X) may be needed to ensure that no stray processes reside on a node.

MPI-Abort() does not operate correctly under MPICH-GM. See the discussion of the -k option for a workaround.

Related Topics