SciClone Local Manual

pbsmpich


Purpose

Run an MPICH 1.2.5 program under PBS.

Synopsis

pbsmpich [-hMmnv] [-c np] [-C load | -X load] [-D | -W dir] [-N net] program [args...]

Description

The PBS job scheduling system allocates resources for parallel programs, but does not provide the system-specific procedures for actually initiating and executing parallel programs on those resources. pbsmpich provides such an interface between PBS and the MPICH runtime system, including the following services:
 

For added flexibility, pbsmpich provides two different strategies for mapping processes onto processors; these are described in detail in the section on Process Mapping. Which strategy is best depends on the requirements of the application, the number and type of nodes requested for the job, and the number of processes which will be run on those nodes.

Arguments

-c np
Run np copies of program on the assigned nodes. If this option is not specified, one process is assigned to each PBS virtual processor.
 
-C load
Before starting the program, check the CPU utilization on each node assigned to the job, and report any node whose utilization exceeds load. load should be a decimal fraction in the range from 0.0 to 1.0. Because a certain amount of system-related background activity is unavoidable, the minimum useful value for load is probably in the 0.01-0.02 range. By default, no checking is done. -C and -X are mutually exclusive.
 
-D
Use the directory which contains program as the working directory for MPICH processes. By default, pbsmpich runs program in the directory from which it is invoked (i.e., the current working directory). The same directory pathname is used on all nodes. -D and -W are mutually exclusive.
 
-h
Print a help message, listing the available options with a brief description of each.
 
-M
Use IP over Myrinet for communication between nodes. Same as "-N m2".
 
-m
Use shared memory for communication among processes residing on the same node. This option implies VP-order process mapping (see below) and overrides the -n option, if present. Without this option, communication always uses TCP/IP via sockets. Use of shared memory reduces software overheads, but places higher demands on the memory subsystem and may perturb the contents of processor caches. Which approach is faster depends on the characteristics of both the application and the hardware on which it is running, so some experimentation is recommended.
 
-n
Use a node-order strategy (explained below) for mapping processes to processors. Not available when shared memory communication is enabled via the -m option.
 
-N net
On SciClone, many nodes are connected to more than one network. By default, pbsmpich uses the system-wide internal Ethernet network (a.k.a. "jetstream") for communication between nodes. The -N option allows the user to specify an alternate network, where net is a valid hostname suffix (minus the leading "-") defined in the networks table in the SciClone User's Guide. All of the nodes assigned to the job must have interfaces on the requested network; if not, pbsmpich will abort the job.
 
-v
Verbose mode. Enable verbose option on mpirun, and generate additional output about the progress of pbsmpich, as well as a listing of allocated nodes and the mapping of processes to processors.
 
-W dir
Use dir as the working directory for MPICH processes. -W and -D are mutually exclusive.
 
-X load
Before starting the program, check the CPU utilization on each node assigned to the job, and abort the job if any node's utilization exceeds load. load should be a decimal fraction in the range from 0.0 to 1.0. By default, no checking is done. -C and -X are mutually exclusive.
 
program
Name of the MPICH program to be invoked via mpirun. If a full pathname is not given, the current search path ($PATH environment variable) is used to locate program.
 
args...
Arguments for program.
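
The difference between the -C and -X options described above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the code used by pbsmpich: how pbsmpich actually measures CPU utilization is not specified here, so the sketch takes already-measured values as input. The function names, the sample values, and the exit code 1 are all assumptions.

```python
def overloaded_nodes(utilization, load):
    """Return the nodes whose CPU utilization exceeds load, in input order.

    utilization maps a hostname to a fraction between 0.0 and 1.0.
    How the values are measured is outside the scope of this sketch.
    """
    if not 0.0 <= load <= 1.0:
        raise ValueError("load must be a fraction between 0.0 and 1.0")
    return [node for node, used in utilization.items() if used > load]


def check_load(utilization, load, abort=False):
    """Report overloaded nodes (like -C); with abort=True, also exit (like -X)."""
    busy = overloaded_nodes(utilization, load)
    for node in busy:
        print(f"{node}: CPU utilization {utilization[node]:.2f} exceeds {load:.2f}")
    if abort and busy:
        # Assumed exit code; the manual only says the status is non-zero.
        raise SystemExit(1)
    return busy
```

With abort left at False the function only reports, which corresponds to -C; with abort=True it stops as soon as any node is over the threshold, which corresponds to -X.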

Process Mapping

When PBS is configured, each node in the system is assigned one or more virtual processors (or VPs, for short). On SciClone, the number of PBS virtual processors available on each node is identical to the number of physical processors on that node (except for the front end, which only allows PBS to use one of its two processors). PBS then allocates virtual processors to jobs, based on the resource requirements specified by the qsub command. The hostname of each virtual processor allocated to a job is made available at runtime in a file specified by the PBS_NODEFILE environment variable. The order in which hosts are listed in this file corresponds to the order in which they are requested by the "-l nodes=" option of qsub.
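
As an illustration of the node file layout described above, the following Python sketch reads the file named by PBS_NODEFILE and counts the virtual processors allocated on each node. It assumes one hostname per line, one line per VP; the function names read_nodefile and vps_per_node are hypothetical helpers, not part of PBS or pbsmpich.

```python
import os
from collections import Counter


def read_nodefile(path=None):
    """Return the hostnames in the PBS node file, one list entry per VP."""
    if path is None:
        # PBS sets this variable inside a running job.
        path = os.environ["PBS_NODEFILE"]
    with open(path) as f:
        # Skip blank lines; keep the order in which PBS listed the hosts.
        return [line.strip() for line in f if line.strip()]


def vps_per_node(hosts):
    """Count VPs per node, keeping nodes in order of first appearance."""
    return dict(Counter(hosts))
```

Inside a job, vps_per_node(read_nodefile()) gives a quick summary of the allocation, similar in spirit to the node listing printed by the -v option.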

pbsmpich supports two different schemes for mapping MPICH processes onto PBS virtual processors. We call one of these schemes "VP order" and the other "node order". By default, pbsmpich uses VP order; node order is invoked with the -n option. The contents of PBS_NODEFILE and the mapping of processes onto nodes are displayed on stdout when the -v option is specified.

VP Order: Processes are assigned one per VP in the order listed in PBS_NODEFILE. (Note that there may be more than one VP per node.) If the number of processes requested by the -c option is larger than the number of VPs allocated to the job, pbsmpich wraps around to the beginning of the VP list, assigning an additional process to each VP. This procedure repeats until all processes have been assigned.
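
The VP-order rule amounts to cycling through the PBS_NODEFILE entries. The Python sketch below is one way to express it, not the pbsmpich source; the function name vp_order is hypothetical, and the host list is the one used in Example 1 of this section.

```python
def vp_order(hosts, np=None):
    """Map process ranks to hosts in VP order.

    hosts: PBS_NODEFILE entries, one per VP.
    np: number of processes (the -c value); defaults to one per VP.
    """
    if np is None:
        np = len(hosts)
    # Rank r goes to VP r modulo the VP count, which wraps around the list.
    return [hosts[rank % len(hosts)] for rank in range(np)]


# Host list taken from Example 1.
hosts = ["ty01", "ty02", "tn01", "tn01", "hu01", "hu01", "hu01", "hu01"]
for rank, host in enumerate(vp_order(hosts, 12)):
    print(f"p{rank} -> {host}")
```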

Node Order: Processes are assigned one per node, wrapping around until all of the VP slots on all of the nodes are filled. If the number of processes requested by the -c option is larger than the number of VPs allocated to the job, pbsmpich wraps around and assigns an additional process to each node (rather than each VP), repeating until all processes have been assigned.
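
The node-order rule has two phases: first fill the VP slots one node at a time, then, if processes remain, keep cycling over the nodes. The Python sketch below follows that description; it is not the pbsmpich source. The function name node_order is hypothetical, and the sketch assumes that nodes are visited in the order of their first appearance in PBS_NODEFILE. The host list is the one used in Example 1 of this section.

```python
def node_order(hosts, np=None):
    """Map process ranks to hosts in node order.

    hosts: PBS_NODEFILE entries, one per VP.
    np: number of processes (the -c value); defaults to one per VP.
    """
    if np is None:
        np = len(hosts)

    # Count the VP slots on each node, in order of first appearance.
    slots = {}
    for host in hosts:
        slots[host] = slots.get(host, 0) + 1

    # Phase 1: one process per node per pass until every VP slot is filled.
    remaining = dict(slots)
    fill = []
    while len(fill) < len(hosts):
        for node in slots:
            if remaining[node] > 0:
                fill.append(node)
                remaining[node] -= 1

    # Phase 2: more processes than VPs, so add one process per node per pass.
    nodes = list(slots)
    mapping = fill[:np]
    extra = 0
    while len(mapping) < np:
        mapping.append(nodes[extra % len(nodes)])
        extra += 1
    return mapping


# Host list taken from Example 1.
hosts = ["ty01", "ty02", "tn01", "tn01", "hu01", "hu01", "hu01", "hu01"]
for rank, host in enumerate(node_order(hosts, 12)):
    print(f"p{rank} -> {host}")
```

When fewer processes than VPs are requested, only the first pass or two of phase 1 is used, which is why -n can place a single process on each multi-processor node.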

Example 1: Our first example illustrates the difference in assignment strategies for a job that maps 12 processes onto 8 virtual processors spread across four nodes. Assume the following PBS job request:

qsub -l nodes=2:single+1:dual:ppn=2+1:quad:ppn=4
#!/usr/bin/csh
pbsmpich -c 12 myprog
^D

On SciClone, the resulting PBS_NODEFILE might look like:

ty01
ty02
tn01
tn01
hu01
hu01
hu01
hu01

With the default VP order, processes would be mapped as follows:

p0 -> ty01
p1 -> ty02
p2 -> tn01
p3 -> tn01
p4 -> hu01
p5 -> hu01
p6 -> hu01
p7 -> hu01
p8 -> ty01
p9 -> ty02
p10 -> tn01
p11 -> tn01

If the -n option had been used on the pbsmpich command, the mapping would instead be:

p0 -> ty01
p1 -> ty02
p2 -> tn01
p3 -> hu01
p4 -> tn01
p5 -> hu01
p6 -> hu01
p7 -> hu01
p8 -> ty01
p9 -> ty02
p10 -> tn01
p11 -> hu01

Example 2: The -n option is particularly useful if you want exclusive use of multi-processor nodes, but only want to use a subset of the processors on each node. The following example allocates all 16 processors on 8 dual-CPU nodes (thereby ensuring that the job has exclusive use of the nodes), but only assigns one process to each node:

qsub -l nodes=8:compute:dual:ppn=2
#!/usr/local/bin/tcsh
pbsmpich -n -c 8 myprog
^D

Example 3: To route communication over Myrinet instead of Ethernet, use the -M option in conjunction with the "myri" node property:

qsub -l nodes=16:dual:myri:ppn=2
#!/usr/local/bin/tcsh
pbsmpich -M myprog
^D

Example 4: In contrast with pbslam, pbsmpich doesn't have to be exec'ed, which means that a series of programs can be run from a single job:

qsub -l nodes=8:typhoon
#!/usr/local/bin/tcsh
cd ~/results
pbsmpich ~/bin/preprocess args... <raw_data >data
pbsmpich ~/bin/compute args... <data >output
pbsmpich ~/bin/analyze args... <output >final
^D

Exit Status

If pbsmpich detects an error condition, it returns a non-zero value. Otherwise, it returns the exit status from the mpirun command.
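
Because a non-zero status can come either from pbsmpich itself or from mpirun, a job that chains several runs may want to stop at the first failure. The Python sketch below shows one way a driver could do that. The stop-on-first-failure policy and the function name run_steps are assumptions made for illustration; nothing here is part of pbsmpich.

```python
import subprocess
import sys


def run_steps(steps):
    """Run each command in order; return the first non-zero exit status, else 0.

    steps is a list of argument lists, for example one pbsmpich
    invocation per entry.
    """
    for argv in steps:
        status = subprocess.run(argv).returncode
        if status != 0:
            print(f"step failed with status {status}: {argv}", file=sys.stderr)
            return status
    return 0
```

A call such as run_steps([["pbsmpich", "prog1"], ["pbsmpich", "prog2"]]) is purely illustrative; prog1 and prog2 are placeholders, and no shell expansion or redirection is performed on the arguments.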

Bugs and Limitations

pbsmpich does not provide access to all of the options and features supported in MPICH. In particular, there is no way to directly specify different executables for different processes, although various workarounds can be imagined.

Circumstances may arise in which pbsmpich (or any other PBS job, for that matter) might not be able to find and kill all of the processes belonging to it. If a pristine execution environment is essential, additional checks (beyond -C or -X) may be needed to ensure that no stray processes reside on a node.

Related Topics