For added flexibility, pbsmvp2 provides three different strategies for mapping processes onto processors; these are described in detail in the section on Process Mapping. Which strategy is best depends on the requirements of the application, the number and type of nodes requested for the job, and the number of processes which will be run on those nodes.
When TORQUE is configured, each node in the system is assigned one or more virtual processors (VPs, for short). On SciClone, each processor core in a multicore system is treated as an individually allocatable processor, and the number of TORQUE virtual processors available on each node is identical to the number of physical processors on that node (except for server nodes, which allow TORQUE to use only a single processor).
When a job is scheduled to run, TORQUE allocates a set of virtual processors based on the resource requirements specified by the qsub command. The hostname of each virtual processor assigned to the job is made available at runtime in a file specified by the PBS_NODEFILE environment variable. The order in which hosts are listed in this file corresponds to the order in which they are requested by the "-l nodes=" option of qsub.
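As a rough illustration of how the node file can be inspected, the following Python sketch counts how many VPs each host contributes. It assumes the usual TORQUE layout of one hostname per line, repeated once per VP; the function name vps_per_node and the hostnames in the usage note are hypothetical and are not part of pbsmvp2.

```python
import os
from collections import Counter


def vps_per_node(nodefile_lines):
    """Return [(hostname, vp_count), ...] in first-seen host order.

    Assumes one hostname per line, repeated once per virtual processor.
    """
    counts = Counter()
    for line in nodefile_lines:
        host = line.strip()
        if host:  # ignore blank lines
            counts[host] += 1
    # Counter keeps insertion order, so hosts stay in nodefile order.
    return list(counts.items())


def read_job_nodefile():
    """Read the file named by PBS_NODEFILE (only set inside a running job)."""
    path = os.environ.get("PBS_NODEFILE")
    if path is None:
        return []
    with open(path) as f:
        return vps_per_node(f)
```

For instance, a node file containing the (made-up) lines node01, node01, node02 would yield two VPs for node01 and one for node02.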
pbsmvp2 supports three different schemes for mapping MPI processes onto virtual processors. We designate these schemes as vp, cyclic, or blocked. Each of these is described in detail in the following paragraphs. By default, pbsmvp2 uses the vp mapping. To obtain a listing showing which nodes have been allocated to a job and how processes have been mapped onto those nodes, use the -v option.
vp Mapping: Processes are assigned one per VP in the order listed in PBS_NODEFILE. (Note that there may be multiple VPs per node.) If the number of processes requested by the -c option is larger than the number of VPs allocated to the job, pbsmvp2 wraps around to the beginning of the VP list, assigning an additional process to each VP. This procedure repeats until all processes have been assigned. If the number of requested processes is less than the number of VPs, some processors (and potentially entire nodes) will be left idle.
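The vp rule can be sketched in a few lines of Python. This is an illustrative model, not pbsmvp2's actual code: the function name vp_mapping and the representation of a job as a list of per-node slot counts are assumptions made for the example.

```python
def vp_mapping(slots_per_node, nprocs=None):
    """Return a list whose entry r is the index of the node hosting MPI rank r.

    slots_per_node: VPs allocated on each node, in PBS_NODEFILE order.
    nprocs: number of processes (the -c value); defaults to the VP count.
    """
    # Flatten to one entry per VP, e.g. [4, 4] -> [0, 0, 0, 0, 1, 1, 1, 1].
    vps = [node for node, slots in enumerate(slots_per_node)
           for _ in range(slots)]
    if not vps:
        raise ValueError("no virtual processors allocated")
    if nprocs is None:
        nprocs = len(vps)
    # Walk the VP list in order, wrapping around when it is exhausted.
    return [vps[rank % len(vps)] for rank in range(nprocs)]
```

With two quad-processor nodes and six processes, vp_mapping([4, 4], 6) places ranks 0 through 3 on the first node and ranks 4 and 5 on the second.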
cyclic Mapping: Processes are assigned one per node, wrapping around until all of the VP slots on all of the nodes are filled. If the number of processes requested by the -c option is larger than the number of VPs allocated to the job, pbsmvp2 wraps around and assigns an additional process to each node (rather than to each VP), repeating until all processes have been assigned. If the number of requested processes is less than the number of VPs, some processors (and potentially entire nodes) will be left idle, although in this situation the cyclic mapping distributes the workload among the available nodes more evenly than the vp mapping.
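A minimal Python sketch of the cyclic rule follows. It is one interpretation of the description above, not pbsmvp2's implementation: the name cyclic_mapping and the list-of-slot-counts input are assumptions, and the oversubscription phase is modeled as a plain round-robin over nodes.

```python
def cyclic_mapping(slots_per_node, nprocs=None):
    """Return a list whose entry r is the index of the node hosting MPI rank r."""
    n = len(slots_per_node)
    total = sum(slots_per_node)
    if total == 0:
        raise ValueError("no virtual processors allocated")
    if nprocs is None:
        nprocs = total
    placement = []
    used = [0] * n
    node = 0
    # Phase 1: deal ranks out one per node, skipping nodes with no free slot.
    while len(placement) < min(nprocs, total):
        if used[node] < slots_per_node[node]:
            used[node] += 1
            placement.append(node)
        node = (node + 1) % n
    # Phase 2 (assumed behavior): once every slot is full, give each node
    # one extra process per pass until all ranks are placed.
    for i in range(nprocs - len(placement)):
        placement.append(i % n)
    return placement
```

For two quad-processor nodes and two dual-processor nodes, cyclic_mapping([4, 4, 2, 2]) alternates over all four nodes for the first eight ranks and then over the two larger nodes for the last four.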
blocked Mapping: The number of processes requested (P) is divided by the number of nodes assigned to the job (N), and blocks of P/N (rounded down) consecutively-numbered processes are assigned to each node. If P is not an exact multiple of N, then the first P mod N nodes will have an additional process assigned to them. This scheme presumes, but does not require, that every node assigned to the job has an equal number of processors. This mapping strategy is primarily useful on multiprocessor and multicore nodes when it is desirable (for performance or other reasons) to leave one or more of the processors idle by using the -c option to request fewer processes than VPs. In the common case when every assigned node has the same number of processors and every processor is in use, blocked mapping is equivalent to the vp mapping.
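The blocked rule, as the examples below apply it, gives each of the N nodes floor(P/N) consecutive ranks and gives the first P mod N nodes one more. The following Python sketch models that arithmetic; the function name blocked_mapping is hypothetical, and the sketch ignores per-node slot counts, just as the description does.

```python
def blocked_mapping(num_nodes, nprocs):
    """Return a list whose entry r is the index of the node hosting MPI rank r."""
    if num_nodes <= 0:
        raise ValueError("at least one node is required")
    # base ranks per node, plus one extra for the first `extra` nodes.
    base, extra = divmod(nprocs, num_nodes)
    placement = []
    for node in range(num_nodes):
        count = base + (1 if node < extra else 0)
        placement.extend([node] * count)
    return placement
```

Six processes on four nodes give divmod(6, 4) = (1, 2), so the first two nodes receive two ranks each and the last two receive one each.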
The differences among the mapping schemes are best illustrated with a few examples:
Example 1: This example illustrates the common case of multiprocessor nodes with one MPI process per processor. Assume the following TORQUE job request (^D indicates end-of-file for the job script):
qsub -l nodes=2:quad:ppn=4
#!/usr/bin/tcsh
pbsmvp2 myprog ...
^D
The resulting mappings of processes to nodes are shown in the following table. Each row corresponds to a processor slot allocated by the job scheduler. Nodes are numbered in the order in which they appear in the PBS_NODEFILE; processes are numbered by MPI rank.
Node      Mapping Scheme
Number    vp     cyclic   blocked
n0        p0     p0       p0
n0        p1     p2       p1
n0        p2     p4       p2
n0        p3     p6       p3
n1        p4     p1       p4
n1        p5     p3       p5
n1        p6     p5       p6
n1        p7     p7       p7
In this case, all three mappings result in an even distribution of processes among nodes, and the vp and blocked schemes produce the same result. The choice of cyclic over vp or blocked will depend on the performance characteristics of the application and is usually determined by experimentation.
Example 2: In this case, we allocate all of the processors within each of two quad-processor nodes, but spawn only 6 processes. This illustrates the improved load balancing of the cyclic and blocked approaches over the vp approach when it is desirable to leave one or more cores idle, for example, to avoid memory bottlenecks or to reduce contention for network adapters. Note the use of the -c option to limit the number of processes created:
qsub -l nodes=2:quad:ppn=4
#!/usr/bin/tcsh
pbsmvp2 -c 6 myprog
^D
The resulting mappings are as follows:
Node      Mapping Scheme
Number    vp     cyclic   blocked
n0        p0     p0       p0
n0        p1     p2       p1
n0        p2     p4       p2
n0        p3     -        -
n1        p4     p1       p3
n1        p5     p3       p4
n1        -      p5       p5
n1        -      -        -
The cyclic and blocked mappings produce evenly-distributed workloads, while the vp mapping puts twice as much work on n0 as on n1.
Example 3: This example illustrates the behavior when the number of processes requested is not evenly divisible by the number of nodes. In this case, we're requesting six processes on four dual-processor nodes.
qsub -l nodes=4:dual:ppn=2
#!/usr/bin/tcsh
pbsmvp2 -c 6 myprog
^D
This results in the following mappings:
Node      Mapping Scheme
Number    vp     cyclic   blocked
n0        p0     p0       p0
n0        p1     p4       p1
n1        p2     p1       p2
n1        p3     p5       p3
n2        p4     p2       p4
n2        p5     -        -
n3        -      p3       p5
n3        -      -        -
The cyclic and blocked mappings fill both processor slots on the first two nodes, and one slot on each of the two remaining nodes. The vp mapping fills all of the processor slots on the first three nodes, but leaves the fourth node completely idle.
Example 4: In some situations it may be useful to allocate nodes with differing numbers of processors, for example when combining resources from multiple subclusters to tackle a very large problem. This simple example uses two quad-processor nodes and two dual-processor nodes to show how the different mapping strategies behave in such a scenario. In this case, we let the number of processes default to the number of processors.
qsub -l nodes=2:typhoon:ppn=4+2:tempest:ppn=2
#!/usr/bin/tcsh
pbsmvp2 myprog
^D
Here are the results:
Node      Mapping Scheme
Number    vp     cyclic   blocked
n0        p0     p0       p0
n0        p1     p4       p1
n0        p2     p8       p2
n0        p3     p10      -
n1        p4     p1       p3
n1        p5     p5       p4
n1        p6     p9       p5
n1        p7     p11      -
n2        p8     p2       p6, p7, p8 (cell spans both n2 rows)
n2        p9     p6
n3        p10    p3       p9, p10, p11 (cell spans both n3 rows)
n3        p11    p7
The vp and cyclic mappings fill every processor slot, but the blocked mapping leaves processors idle on nodes n0 and n1, while oversubscribing n2 and n3.
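To check a mapping for idle or oversubscribed nodes, one can compare the number of processes placed on each node with the number of slots it offers. The Python sketch below does this for the blocked column of the Example 4 table; the function name node_load and the status labels are made up for illustration.

```python
def node_load(placement, slots_per_node):
    """Return [(processes, slots, status), ...], one tuple per node.

    placement: entry r is the index of the node hosting MPI rank r.
    slots_per_node: processor slots allocated on each node.
    """
    counts = [0] * len(slots_per_node)
    for node in placement:
        counts[node] += 1
    report = []
    for procs, slots in zip(counts, slots_per_node):
        if procs > slots:
            status = "oversubscribed"
        elif procs < slots:
            status = "idle slots"
        else:
            status = "full"
        report.append((procs, slots, status))
    return report


# Blocked column of Example 4: three consecutive ranks on each of four nodes.
blocked_example4 = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
for index, (procs, slots, status) in enumerate(node_load(blocked_example4, [4, 4, 2, 2])):
    print(f"n{index}: {procs} processes on {slots} slots ({status})")
```

Running the same check on the vp or cyclic column of that table reports every node as full.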
To ensure that MVAPICH2 processor affinity is enabled and/or to allow for explicit mapping of processes to cores via the MV2_CPU_MAPPING option, allocate whole nodes using an appropriate value for the ":ppn=" option when submitting jobs with qsub. For more information about processor affinity in MVAPICH2, see the MVAPICH2 1.2 User Guide.
As an example, consider the following script which runs a series of programs in a loop, with the input and output filenames selected by the value of the loop index variable. Each program reads its input from the user's home directory and writes its output to the /local/scr filesystem. At the end of the job, all of the output files are copied back to the /sciclone/data10 filesystem (hosted on the server called gfs00) and the local files are deleted.
Tip: The output files could have been copied back using cp instead of rcp, but in the presence of heavy NFS write traffic, rcp is much more efficient, assuming that the proper server is associated with the destination directory. The Unix/Linux df command can be used to determine which server hosts a particular filesystem.
qsub -l nodes=8:typhoon
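#!/usr/bin/tcsh
# Work in the local scratch filesystem.
cd /local/scr/$user
set INDIR=$HOME/project
# Destination on the server (gfs00) that hosts /sciclone/data10.
set OUTDIR="gfs00:/sciclone/data10/$user/project"
# Run the program once per input file.
foreach i (1 2 3 4 5)
    pbsmvp2 $INDIR/myprog $INDIR/input_$i result_$i
end
# Copy the results back with rcp, then delete the local copies.
rcp result_* $OUTDIR
rm -f result_*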
^D
pbsmvp2 uses the MVAPICH2_HOME environment variable to determine which version of MVAPICH2 to use. For more information, see the MVAPICH2 documentation page. pbsmvp2 also depends on several environment variables which are automatically set when running within a TORQUE or PBS environment.
Circumstances may arise in which pbsmvp2 (or any other parallel job, for that matter) might not be able to find and kill all of the processes belonging to it. If a pristine execution environment is essential, additional checks (beyond -C or -X) may be needed to ensure that no stray processes reside on a node.