SciClone Local Manual

pbsmvp2


Purpose

Run an MVAPICH2 1.2 program under TORQUE or PBS.

Synopsis

pbsmvp2 [-hv] [-c np] [-C load | -X load] [-D | -W dir] [-m vp | cyclic | blocked] [-N eth | ib] [-e VAR=value] ... program [args...]

Description

The TORQUE job scheduling system (a derivative of OpenPBS) allocates resources for parallel programs, but does not provide the system-specific procedures for actually initiating and executing parallel programs on those resources. pbsmvp2 provides such an interface between TORQUE and the MVAPICH2 runtime system: it reads the list of allocated nodes from the PBS_NODEFILE, maps MPI processes onto the assigned processors, propagates requested environment variables, and launches the program via MVAPICH2's mpirun_rsh.

For added flexibility, pbsmvp2 provides three different strategies for mapping processes onto processors; these are described in detail in the section on Process Mapping. Which strategy is best depends on the requirements of the application, the number and type of nodes requested for the job, and the number of processes which will be run on those nodes.
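
For example, assuming myprog is the user's own MPI executable, the following command line, placed in a TORQUE job script, would start one process per allocated virtual processor, distributed round-robin across the assigned nodes:

   pbsmvp2 -m cyclic myprog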

Arguments

-c np
Run np copies of program on the assigned nodes. If this option is not specified, one process is assigned to each TORQUE virtual processor.
 
-C load
Before starting the program, check the CPU utilization on each node assigned to the job, and report any which exceed load. load should be a decimal fraction in the range from 0.0 to 1.0. By default, no checking is done. -C and -X are mutually exclusive. A certain amount of system-related background activity is unavoidable, so the minimum useful value for load is probably in the 0.01-0.02 range.
 
-D
Use the directory which contains program as the working directory for MVAPICH2 processes. By default, pbsmvp2 runs program in the directory from which it is invoked (i.e., the current working directory). The same directory pathname is used on all nodes. -D and -W are mutually exclusive.
 
-e VAR=value
Set the environment variable VAR to value and propagate it to each process. To define multiple environment variables, -e may be specified multiple times. White space is not allowed in value (this is a limitation of mpirun_rsh).
 
-h
Print a help message, listing the available options with a brief description of each.
 
-m vp | cyclic | blocked
Select the desired process mapping scheme (see below). vp assigns processes to nodes in the order specified by the job scheduler in the PBS_NODEFILE. cyclic distributes processes among nodes in round-robin fashion. blocked partitions processes into groups of approximately equal size and maps these groups onto consecutive nodes. The default mapping is vp.
 
-N eth | ib
SciClone's Opteron-based Linux nodes support two different communication networks, Gigabit Ethernet (eth) and InfiniBand (ib). By default, pbsmvp2 uses the best-available network which spans the set of nodes assigned to the job (generally InfiniBand). The -N option allows the user to specify an alternate network. All of the nodes assigned to the job must have interfaces on the requested network; if not, pbsmvp2 will abort the job. Furthermore, the version of MVAPICH2 used to compile and run the program must support the requested network. mvapich2-*-ib will only run on the InfiniBand network. mvapich2-*-tcp can use either eth or ib, although in the latter case, performance is likely to suffer relative to the native InfiniBand version.
 
-v
Verbose mode. Generate additional output about the progress of pbsmvp2, as well as a listing of allocated nodes and the mapping of processes to processors.
 
-W dir
Use dir as the working directory for MVAPICH2 processes. -W and -D are mutually exclusive.
 
-X load
Before starting the program, check the CPU utilization on each node assigned to the job, and abort if any of them exceed load. load should be a decimal fraction in the range from 0.0 to 1.0. By default, no checking is done. -C and -X are mutually exclusive.
 
program
Name of the MVAPICH2 program to be invoked via mpirun_rsh. If a full pathname is not given, the current search path ($PATH environment variable) is used to locate program.
 
args...
Arguments for program.
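
As a sketch of how several of these options can be combined, the following invocation (myprog and MYVAR are placeholders for the user's own program and environment variable) would start 8 processes with the blocked mapping, abort if any assigned node is already more than 5% busy, and propagate one environment variable to every process:

   pbsmvp2 -c 8 -m blocked -X 0.05 -e MYVAR=42 myprog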

Process Mapping

When TORQUE is configured, each node in the system is assigned one or more virtual processors (or VPs for short). On SciClone, the number of TORQUE virtual processors available on each node is identical to the number of physical processors on that node (except for server nodes, which only allow TORQUE to use a single processor), and each processor core in a multicore system is treated as an individually allocatable processor.

When a job is scheduled to run, TORQUE allocates a set of virtual processors based on the resource requirements specified by the qsub command. The hostname of each virtual processor assigned to the job is made available at runtime in a file named by the PBS_NODEFILE environment variable. The order in which hosts are listed in this file corresponds to the order in which they are requested by the "-l nodes=" option of qsub.
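
For example, a request of "-l nodes=2:quad:ppn=4" typically yields a PBS_NODEFILE in which each of the two assigned hostnames appears four times, once per virtual processor (the hostnames below are purely illustrative):

   tp001
   tp001
   tp001
   tp001
   tp002
   tp002
   tp002
   tp002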

pbsmvp2 supports three different schemes for mapping MPI processes onto virtual processors. We designate these schemes as vp, cyclic, or blocked. Each of these is described in detail in the following paragraphs. By default, pbsmvp2 uses the vp mapping. To obtain a listing showing which nodes have been allocated to a job and how processes have been mapped onto those nodes, use the -v option.

vp Mapping: Processes are assigned one per VP in the order listed in PBS_NODEFILE. (Note that there may be multiple VPs per node.) If the number of processes requested by the -c option is larger than the number of VPs allocated to the job, pbsmvp2 wraps around to the beginning of the VP list and assigns an additional process to each VP, repeating until all processes have been assigned. If the number of requested processes is less than the number of VPs, some processors (and potentially entire nodes) will be left idle.

cyclic Mapping: Processes are assigned one per node, wrapping around until all of the VP slots on all of the nodes are filled. If the number of processes requested by the -c option is larger than the number of VPs allocated to the job, pbsmvp2 wraps around and assigns an additional process to each node (rather than each VP), repeating until all processes have been assigned. If the number of requested processes is less than the number of VPs, some processors (and potentially entire nodes) will be left idle, although in this situation the cyclic mapping will do a better job of distributing the workload among the available nodes than the vp mapping.

blocked Mapping: The number of processes requested (P) is divided by the number of nodes assigned to the job (N), and blocks of approximately P/N consecutively-numbered processes are assigned to each node. If P is not an exact multiple of N, the first P mod N nodes each receive one additional process. (For instance, six processes on four nodes yield one process per node, with the first 6 mod 4 = 2 nodes each receiving a second process, as in Example 3 below.) This scheme presumes, but does not require, that every node assigned to the job will have an equal number of processors. This mapping strategy is primarily useful on multiprocessor and multicore nodes when it is desirable (for performance or other reasons) to leave one or more of the processors idle by using the -c option to request fewer processes than VPs. In the common case when every assigned node has the same number of processors and every processor is in use, blocked mapping is equivalent to the vp mapping.

The differences among the mapping schemes are best illustrated with a few examples:

Example 1: This example illustrates the common case of multiprocessor nodes with one MPI process per processor. Assume the following TORQUE job request (^D indicates end-of-file for the job script):

qsub -l nodes=2:quad:ppn=4
#!/usr/bin/tcsh
pbsmvp2 myprog ...
^D

The resulting mappings of processes to nodes are shown in the following table. Each row corresponds to a processor slot allocated by the job scheduler. Nodes are numbered in the order in which they appear in the PBS_NODEFILE; processes are numbered by MPI rank.

Node      Mapping Scheme
Number    vp        cyclic    blocked
n0        p0        p0        p0
n0        p1        p2        p1
n0        p2        p4        p2
n0        p3        p6        p3
n1        p4        p1        p4
n1        p5        p3        p5
n1        p6        p5        p6
n1        p7        p7        p7

In this case, all three mappings result in an even distribution of processes among nodes, and the vp and blocked schemes produce the same result. The choice of cyclic over vp or blocked will depend on the performance characteristics of the application and is usually determined by experimentation.

Example 2: In this case, we allocate all of the processors within each of two quad-processor nodes, but spawn only 6 processes. This illustrates the improved load balancing of the cyclic and blocked approaches over the vp approach when it is desirable to leave one or more cores idle, for example, to avoid memory bottlenecks or to reduce contention for network adapters. Note the use of the -c option to limit the number of processes created:

qsub -l nodes=2:quad:ppn=4
#!/usr/bin/tcsh
pbsmvp2 -c 6 myprog
^D

The resulting mappings are as follows:

Node      Mapping Scheme
Number    vp        cyclic    blocked
n0        p0        p0        p0
n0        p1        p2        p1
n0        p2        p4        p2
n0        p3        -         -
n1        p4        p1        p3
n1        p5        p3        p4
n1        -         p5        p5
n1        -         -         -

The cyclic and blocked mappings produce evenly distributed workloads, while the vp mapping puts twice as much work on n0 as on n1.

Example 3: This example illustrates the behavior when the number of processes requested is not evenly divisible by the number of nodes. In this case, we're requesting six processes on four dual-processor nodes.

qsub -l nodes=4:dual:ppn=2
#!/usr/bin/tcsh
pbsmvp2 -c 6 myprog
^D

This results in the following mappings:

Node      Mapping Scheme
Number    vp        cyclic    blocked
n0        p0        p0        p0
n0        p1        p4        p1
n1        p2        p1        p2
n1        p3        p5        p3
n2        p4        p2        p4
n2        p5        -         -
n3        -         p3        p5
n3        -         -         -

The cyclic and blocked mappings fill both processor slots on the first two nodes, and one slot on each of the two remaining nodes. The vp mapping fills all of the processor slots on the first three nodes, but leaves the fourth node completely idle.

Example 4: In some situations it may be useful to allocate nodes with differing numbers of processors, for example when combining resources from multiple subclusters to tackle a very large problem. This simple example uses two quad-processor nodes and two dual-processor nodes to show how the different mapping strategies behave in such a scenario. In this case, we let the number of processes default to the number of processors.

qsub -l nodes=2:typhoon:ppn=4+2:tempest:ppn=2
#!/usr/bin/tcsh
pbsmvp2 myprog
^D

Here are the results:

Node      Mapping Scheme
Number    vp        cyclic    blocked
n0        p0        p0        p0
n0        p1        p4        p1
n0        p2        p8        p2
n0        p3        p10       -
n1        p4        p1        p3
n1        p5        p5        p4
n1        p6        p9        p5
n1        p7        p11       -
n2        p8        p2        p6, p7, p8
n2        p9        p6
n3        p10       p3        p9, p10, p11
n3        p11       p7

The vp and cyclic mappings fill every processor slot, but the blocked mapping leaves processors idle on nodes n0 and n1, while oversubscribing n2 and n3.

Processor Affinity

By default, MVAPICH2 locks processes onto specific processor cores in order to reduce cache pollution and improve performance. However, when multiple MVAPICH2 jobs are running on the same node, processes can be incorrectly mapped onto overlapping sets of processors, resulting in very poor performance. In situations where this problem could arise, pbsmvp2 will automatically disable processor affinity by setting the MV2_ENABLE_AFFINITY runtime variable to 0. Since the Linux kernel endeavors to maintain processor affinity anyway, disabling this setting within MVAPICH2 results in little if any performance degradation. The processor affinity setting chosen by pbsmvp2 is reported in the output produced by the -v option.

To ensure that MVAPICH2 processor affinity is enabled and/or to allow for explicit mapping of processes to cores via the MV2_CPU_MAPPING option, allocate whole nodes using an appropriate value for the ":ppn=" option when submitting jobs with qsub. For more information about processor affinity in MVAPICH2, see the MVAPICH2 1.2 User Guide.
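
For instance, a job that allocates an entire quad-processor node can pass an explicit core mapping to MVAPICH2 through pbsmvp2's -e option. The MV2_CPU_MAPPING value below is only illustrative; consult the MVAPICH2 1.2 User Guide for the exact syntax and for core numbering on a given node type:

qsub -l nodes=1:quad:ppn=4
#!/usr/bin/tcsh
pbsmvp2 -e MV2_CPU_MAPPING=0:1:2:3 myprog
^D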

Usage Notes

In contrast with some MPI packages such as LAM, recent versions of MVAPICH2 use a daemonless process management framework which greatly simplifies the task of detecting abnormal process termination and cleaning up remaining processes. As a result, pbsmvp2 does not have to be exec'ed as do pbslam and similar scripts which interface with daemon-based process managers. This means that a series of programs can be run from a single job and that post-processing steps (such as file copying) can be incorporated directly into the job script.

As an example, consider the following script which runs a series of programs in a loop, with the input and output filenames selected by the value of the loop index variable. Each program reads its input from the user's home directory and writes its output to the /local/scr filesystem. At the end of the job, all of the output files are copied back to the /sciclone/data10 filesystem (hosted on the server called gfs00) and the local files are deleted.

qsub -l nodes=8:typhoon
#!/usr/bin/tcsh
cd /local/scr/$user
set INDIR=$HOME/project
set OUTDIR="gfs00:/sciclone/data10/$user/project"
foreach i (1 2 3 4 5)
   pbsmvp2 $INDIR/myprog < $INDIR/input_$i > result_$i
end
rcp result_* $OUTDIR
rm -f result_*
^D

Tip: The output files could have been copied back using cp instead of rcp, but in the presence of heavy NFS write traffic, rcp is much more efficient, assuming that the proper server is associated with the destination directory. The Unix/Linux df command can be used to determine which server hosts a particular filesystem.
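
For example, to see which server hosts the destination filesystem used in the script above, one might run

   df /sciclone/data10

and use the reported server name (gfs00 in this case) when constructing the rcp destination.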

Environment Variables

pbsmvp2 queries the MVAPICH2_HOME environment variable to determine which version of MVAPICH2 to use. For more information, see the MVAPICH2 documentation page. pbsmvp2 also depends on several environment variables which are automatically set when running within a TORQUE or PBS environment.
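
For example, a job script can record which MVAPICH2 installation will be used by echoing the variable before the run (myprog is a placeholder):

   echo "Using MVAPICH2 from $MVAPICH2_HOME"
   pbsmvp2 myprog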

Exit Status

If pbsmvp2 detects an error condition, it returns a non-zero value. Otherwise, it returns the exit status from MVAPICH2's mpirun_rsh command.
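
Because the exit status is propagated, a job script can test it in the usual way; a minimal tcsh sketch (myprog is a placeholder):

pbsmvp2 myprog
set rc=$status
if ($rc != 0) then
   echo "pbsmvp2 run failed with status $rc"
   exit $rc
endif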

Bugs and Limitations

pbsmvp2 does not provide access to all of the options and features supported in MVAPICH2. In particular, there is no way to specify different executables for different processes, although various workarounds can be imagined.

Circumstances may arise in which pbsmvp2 (or any other parallel job, for that matter) might not be able to find and kill all of the processes belonging to it. If a pristine execution environment is essential, additional checks (beyond -C or -X) may be needed to ensure that no stray processes reside on a node.
