SciClone User's Guide
This document provides background information and instructions for using the SciClone Cluster at the College of William and Mary. It is considered to be required reading for new users. Refer to other sections of the SciClone web site for more detailed information on specific topics such as software packages or hardware configuration.
How to Get Help
Questions, problems, or trouble reports: Send email to email@example.com.
Urgent problems: Call Tom Crockett (757-221-2762) or Andriy Kot (443-365-9107).
Emergencies: After hours phone numbers are posted on the computer room door in Savage House, or call Campus Police at 757-221-4596.
Application Support: To request assistance in accessing or using the system, send email to firstname.lastname@example.org.
Because SciClone is regarded as a research (rather than production) facility, support is available only during normal office hours.
When reporting problems, please provide as much relevant information as possible. Depending on the situation, this might include your user ID, the nodes or hostnames involved, the PBS job number, the commands you ran, and the exact text of any error messages.
Email Discussion List: An email list for discussing topics of general interest to the SciClone user community is available on the William and Mary list server. Access is by subscription only. To subscribe, William and Mary users should login to the list server with their campus account; external users must first create an account. Postings should be sent to email@example.com.
SciClone is operated by the Computational Science Cluster as a College-wide resource. William and Mary faculty, staff, and students with computation- or data-intensive applications from any discipline are welcome to apply for accounts on the system. The system is also available to those interested in developing tools or infrastructure to support more effective use of cluster computing systems. Requests for access from outside the William and Mary community (including W&M collaborators) are evaluated on a case-by-case basis. To apply for an account, follow the instructions in the SciClone Account Request Form.
Account Renewal, Expiration, and Deletion
Most accounts on SciClone have an associated expiration date, which is specified on the user's Account Request Form. An email notice will be sent to the user approximately two weeks before the account is set to expire. Accounts can be renewed as needed by submitting a new Account Request Form. If an account is not renewed, it will be disabled immediately after its expiration date. All files belonging to expired accounts are subject to deletion after a short grace period (30-day minimum). It is the user's responsibility to preserve any necessary files by moving or copying them elsewhere before the account expires. An expired account may be reactivated by submitting a new Account Request Form, but files previously associated with the account may not be available after the grace period.
Accessing the System
All access to SciClone from external systems is via Secure Shell using the SSH 2 protocol. SciClone includes subclusters based on two different and incompatible processor architectures, Sun UltraSPARC and AMD Opteron. Each of these architectures has a different login server which should be used to edit and compile programs, manipulate files, or submit batch jobs. The primary login server for the UltraSPARC nodes is monsoon.sciclone.wm.edu, a.k.a. sciclone.wm.edu. The primary login server for the Opteron nodes is squall.sciclone.wm.edu. These nodes are also referred to as "compile servers", "file servers", "front ends", "host nodes", or simply monsoon and squall. All of these terms are used interchangeably.
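For example, using OpenSSH from a Unix-like system (replace username with your SciClone user ID), either of the following will open a login session on the appropriate front end:

ssh username@sciclone.wm.edu
ssh username@squall.sciclone.wm.edu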
Two additional login servers are available for users involved in specific projects requiring access to certain node-locked software packages or the Oracle database server. These are maelstrom.sciclone.wm.edu (UltraSPARC) and mistral.sciclone.wm.edu (Opteron).
Unless you have made special arrangements for dedicated time, all access to nodes other than monsoon, maelstrom, squall, and mistral should be through the PBS job scheduler. This topic is discussed in more detail in the section on Running Programs.
Changing Your Password
You will be issued a default password by telephone when your SciClone account is created. To change it, you must login to monsoon.sciclone.wm.edu and run the passwd command. Changes will automatically propagate to all of the login servers within a minute or so. If you change your password on any node other than monsoon, the changes will not be permanent, and your password will eventually revert to its previous setting.
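For example:

ssh username@monsoon.sciclone.wm.edu
passwd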
A key feature of SciClone is its heterogeneous architecture, which provides both flexibility for applications as well as a controlled environment for studying the complex issues which arise in larger distributed systems. Specifically, SciClone's heterogeneity arises from its use of multiple processor configurations (single-, dual-, and quad-cpu nodes, each at different clock rates), multiple networking technologies (Fast Ethernet, Gigabit Ethernet, and Myrinet), and its organization as a "cluster of clusters".
SciClone features sixteen different node configurations, organized into eight distinct subclusters which can be used individually or in combination. The node types and subclusters are summarized in the tables below. See the Hardware Component List for detailed specifications. Nodes can be further classified as either "server" nodes or "compute" nodes, depending on their intended uses. As the name implies, server nodes provide specialized functions for the cluster as a whole, while compute nodes are dedicated to running users' jobs. Application programs should not run on server nodes unless they have a specific need to do so.
SciClone features a rich but complex networking environment. For a schematic, refer to the SciClone Architecture Diagram. The following table summarizes the networks used in SciClone.
SciClone's front end nodes, monsoon and squall, have interfaces on multiple networks, as do many of the other nodes. From outside the cluster, use the hostname monsoon.sciclone.wm.edu (a.k.a. sciclone.wm.edu) or squall.sciclone.wm.edu to login via the Savage House Fast Ethernet network. From other nodes within the cluster, use ms00.sciclone.wm.edu or sq00.sciclone.wm.edu (or just ms00 and sq00) to reach the front end nodes via the cluster's internal network. If you use the monsoon or squall addresses from within the cluster, you'll get routed outside to the building network and then back in via the external interface, a longer and slower path.
A complete list of hostnames and their corresponding network addresses is available in /etc/hosts on the front end.
When communication is routed via Myrinet, some of the distinctions between subclusters become blurred. In particular, the gulfstream and tornado subclusters plus the node nws01 can be thought of as a unified 39-node, 78-processor "metacluster". Although some details differ, these nodes have identical processor and memory configurations and share the same Myrinet-1280 network. However, I/O intensive processes should be placed on nws01 or one of the gulfstream nodes to take advantage of Gigabit Ethernet links to the servers. Nodes could be allocated from this combined pool via a PBS node specification such as "ultra60:myri:ppn=2". Similarly, nws02 can be used to augment the twister subcluster, using the Myrinet-2000 network for communication. In this case a node spec might look like "f280r:myri2:ppn=2" and nws02 would be the best place to locate I/O intensive processes.
Note that applications are free to combine nodes from multiple subclusters in any way they see fit, but in the general case, differences in processor speeds, communication interfaces, memory capacities, and disk performance pose difficult load balancing problems which can lead to very inefficient use of computational resources unless specialized parallel algorithms are employed.
When a user account is installed, subdirectories are created for the user in each of the shared filesystems: the home filesystems (/sciclone/home*), the global scratch filesystems (/sciclone/scr*), the local scratch partitions on individual nodes, and the QFS filesystem (/sciclone/qfs00).
Symlinks in each user's home directory point to the preconfigured global (~/scr*) and local (~/lscr*) scratch directories and the QFS filesystem (~/qfs00). /root, /usr, and /var filesystems are local to each node; /opt, /usr/local, and /import reside on the front end and are exported to each node via NFS.
Sun's StorEdge QFS is a high-performance, high-capacity SAN-based shared filesystem which is optimized for bulk accesses on large sequential files. The /sciclone/qfs00 filesystem spans 46 disk drives distributed across five separate RAID arrays, with a total formatted capacity of 5.3 terabytes. The disk arrays are connected to each other and to the SciClone servers via a Fibre Channel Storage Area Network, or SAN. Individual files exceeding a terabyte in size can be accommodated, space permitting.
QFS allocates disk space in large blocks, and is therefore rather inefficient for collections of small files, such as source code. As a rule of thumb, files which are stored in SciClone's QFS filesystem should have an average size of 1 MB or larger. Smaller files should be stored in home directories (/sciclone/home*) or scratch filesystems (/sciclone/scr*). Space allocation in /sciclone/qfs00 is monitored and you may be notified if you have too many small files stored there. Note that small files can be aggregated into larger ones with Unix utilities such as ar, tar, cpio, and zip. ar is particularly useful since it maintains an internal index structure which supports the addition, deletion, replacement, and retrieval of individual members of the archive.
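For example, a directory tree of small files could be bundled into a single archive in QFS, and individual members later updated or retrieved with ar (the paths and filenames below are illustrative):

tar cf /sciclone/qfs00/username/src.tar ./src
ar r /sciclone/qfs00/username/results.a run001.out run002.out
ar x /sciclone/qfs00/username/results.a run001.out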
I/O performance with QFS can be improved (sometimes dramatically) by reading and writing data in large blocks that match the blocking factor of the filesystem. In our tests, a blocksize of 65536 (64 KB) was near optimal, yielding write speeds in excess of 100 MB/s from monsoon and maelstrom.
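For example, an ordinary file copy into QFS can be blocked at 64 KB with dd (paths illustrative):

dd if=mydata of=/sciclone/qfs00/username/mydata bs=65536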
QFS is a shared filesystem, meaning that it can be mounted and accessed simultaneously by multiple servers. It allows simultaneous reads from the same or different files and simultaneous writes to different files. QFS can also be configured to allow simultaneous writes to the same file from applications which have been designed to perform page-aligned I/O. This option is not presently enabled on SciClone, but could be if the need arose.
The /sciclone/qfs00 filesystem is mounted directly on monsoon and maelstrom, and is exported from there via NFS to all other nodes in the cluster. Even-numbered nodes mount qfs00 from maelstrom and odd-numbered nodes from monsoon. This spreads the NFS load across both servers, resulting in higher throughput and less resource contention. Note that applications which generate a lot of output will still get better performance by writing files to the local scratch partitions and then copying them back to the servers with rcp, rather than relying on NFS. The even/odd strategy would be helpful in this scenario, too.
Home directories (/sciclone/home00 and /sciclone/home10) are normally backed up several times per week. QFS directories (/sciclone/qfs00) are backed up approximately every three days. Scratch directories are not backed up at all. Furthermore, files in any of the scratch partitions which have not been used in the past 30 days will be deleted automatically in order to maintain sufficient free space for active projects.
A default .cshrc file is provided in each user's home directory. If you modify it, be sure you know what you're doing: SciClone's hardware and software environment is considerably more complex than the typical UNIX workstation environment. The default configuration enables a 32-bit environment for compiling and linking, with LAM as the default MPI package. To use alternative communication packages such as MPICH or MPICH-GM, the settings of various environment variables will need to be changed as documented within the .cshrc file. Copies of the current recommended configuration files can be found in /usr/local/etc/templates on sciclone.wm.edu.
Four UNIX groups are established for each organization or department which has users on SciClone. These four groups correspond to the user's status within the organization, i.e. faculty/staff, graduate student, undergraduate student, and "other". For example, a professor from the William and Mary Computer Science Department would be assigned to the group "csf", while a CS undergraduate would be assigned to the group "csu". Default file access permissions are set so that a user's files and directories are read-write for the user, read-only for the user's group, and unreadable by anyone else. If this is not appropriate for your situation, you should change the default umask setting in your .cshrc file, or set file access modes on a case-by-case basis. Note that SciClone's filesystems are exported to other computers within the Computational Science Cluster, and may therefore be visible beyond the SciClone user community.
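For example, adding the following line to your .cshrc makes newly created files and directories accessible only by you:

umask 077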
Compilers and Libraries
Both Sun and GNU compilers are installed on the system. See the Software section of the SciClone web site for more information about what's available and how to use it. Unless otherwise noted, all of the third-party software which is installed on SciClone has been built using Sun's compilers. For maximum performance and to ensure compatibility with system libraries, we strongly recommend the use of Sun's compilers whenever possible.
Compiler Options and Code Optimization
When compiling and optimizing code for use on SciClone, care must be taken to ensure that the resulting executables will perform properly on the type of nodes on which they will be executed. The situation is further complicated by the fact that SciClone includes two distinct and incompatible families of processor architecture, Sun UltraSPARC and Intel/AMD x86/amd64. Even within a particular architecture family, there are variations in processor capabilities among the different types of nodes. By choosing appropriate compiler options, users can compile their applications for portability across an entire architecture family, or optimize performance for a particular type of node. With the current set of hardware, six different instruction set architectures (ISAs) are of interest: v8plusa, v8plusb, v9a, and v9b on UltraSPARC, and sse2a and amd64a on x86/amd64.
To achieve portability along with performance on UltraSPARC nodes, the following set of options is suggested:
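cc -fast -xarch=v8plusa ...

(a plausible form: v8plusa is the 32-bit ISA which runs on all of SciClone's UltraSPARC nodes, and, per the note below, -xarch must follow -fast).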
For best performance on AMD Opteron nodes, the following options provide a good starting point for further experimentation:
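cc -fast -xarch=sse2a ...

(again a plausible form, targeting 32-bit Opteron code; see below for 64-bit addressing with amd64a and the required memory-model option).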
By default, -fast generates code for the type of processor on which the compiler is running, which may or may not be compatible with the target execution processor. This is why it is important to explicitly specify the desired ISA with the -xarch option. Note that the order is important here: the -xarch, -xchip, and -xcache options must come after -fast.
Opteron processors support several different "memory models", depending on the intended use of the resulting code. For 64-bit addressing on SciClone's C8, S4, and S4A nodes, specify the "medium" memory model, for example:
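cc -fast -xarch=amd64a -xmodel=medium ...

(a sketch; -xmodel is the Sun Studio option which selects the memory model for 64-bit x86 code).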
To optimize for a specific type of node, you can specify the characteristics of the CPU (-xchip) and cache (-xcache) in addition to the ISA. The following table summarizes the options by node type.
So, for example, to optimize code for an Ultra 5 node, the following compiler options could be used:
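cc -fast -xarch=v8plusa -xchip=ultra2i -xcache=16/32/1:2048/64/1 ...

(The -xchip and -xcache values in this and the next two examples are illustrative of the form such specifications take; consult the table above or Sun's compiler documentation for the authoritative values for each node type.)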
To target a Sun Fire 280R, use:
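cc -fast -xarch=v8plusb -xchip=ultra3 -xcache=64/32/4:8192/512/1 ...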
For an Opteron-based node, use:
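cc -fast -xarch=sse2a -xchip=opteron -xcache=64/64/2:1024/64/16 ...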
To address more than 2 GB of memory in a single process, use v9a, v9b, or amd64a instead of v8plusa, v8plusb, or sse2a, respectively:
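cc -fast -xarch=v9a ...

(substituting v9b or amd64a as appropriate for the target node).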
Note that 64-bit addressing is generally of interest only for the C4, C7, C7A, C8, DB1, S3, S4, and S4A nodes, which each have 4 GB or more of physical memory installed on them; all other node types have 2 GB or less.
Programs compiled with v9a, v9b, or amd64a will not work with libraries which have been compiled for 32-bit addressing. Many of the performance-critical software packages installed on SciClone are compiled for all six ISAs, and the desired version can automatically be selected via the XARCHULTRA and XARCHX86 environment variables in the user's ~/.cshrc file. Consult the documentation for individual software packages to see which versions are available.
Note that the optimizations invoked by -fast may be too aggressive for some codes, with the potential for unintended or incorrect results. If you suspect this is a problem, you could try a lower optimization level, e.g.:
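cc -fast -xO3 -xarch=v8plusa ...

(the trailing -xO3 overrides the -xO5 level implied by -fast)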
or you could leave off -fast entirely. If -fast is not used, it may be necessary to specify the target ISA and any other desired optimization options individually.
Code optimization is a complex topic, and the use of any given option may help one routine but hinder another, so some experimentation is in order. Consult Sun's compiler documentation for full details, including information about many options not mentioned here.
Parallel Programming Tools
Although SciClone's ability to run many serial jobs concurrently is useful for some applications, bringing the full power of the system to bear on a single computation requires the use of parallel programming techniques. A variety of tools are available to assist in the development of parallel programs.
Many of the nodes in SciClone include more than one CPU, so effective use implies that all of the CPUs should be kept busy. If an application will fit on a single node, then the use of shared memory programming techniques may be the simplest way to boost performance on multiprocessor (SMP) nodes. There are several approaches for exploiting parallelism in shared memory environments, including automatic parallelization, compiler directives, thread libraries, system-level interprocess communication (IPC) services, and message passing.
Sun's compilers support automatic parallelization of well-behaved loop constructs via the -xautopar (C) and -autopar (Fortran) options. These are described in detail in the compiler manuals. Sometimes the compiler's ability to detect and exploit parallelism can be enhanced with straightforward changes to the code which eliminate dependencies or simplify the control flow.
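Illustrative compile lines might look like the following:

cc -xautopar -xO3 -o myprog myprog.c
f95 -autopar -xO3 -o myprog myprog.f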
In other cases, the programmer needs to convey additional information to the compiler in the form of directives (Fortran) or pragmas (C/C++). Among other things, directives and pragmas can be used to give the compiler hints about loops that can or cannot be safely parallelized. Sun's C, C++, and Fortran compilers support the OpenMP 2.5 API, a mix of directives and library calls which support a fork-join model of parallel execution. For more information on using OpenMP, refer to Chapter 3 in the C User's Guide, Chapter 10 of the Fortran Programming Guide, Appendix D of the Fortran User's Guide, and the OpenMP API User's Guide.
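As a minimal illustration (a generic sketch, compiled with Sun's C compiler using something like "cc -xopenmp -xO3"), a single pragma suffices to parallelize a well-behaved loop:

#include <stdio.h>

#define N 1000000
static double a[N];

int main(void)
{
    int i;
    /* the loop iterations are divided among the available threads */
#pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = 0.5 * i;
    printf("a[%d] = %f\n", N - 1, a[N - 1]);
    return 0;
}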
At a coarser level of granularity, entire programs, or major sections of them, can be structured as independent threads of control, each of which can potentially run on a separate CPU. Solaris supports two different thread packages, known as Solaris threads and Posix threads (or pthreads). Sun's Multithreaded Programming Guide covers both of these packages in detail. Additional information on multithreading for C++ applications can be found in Chapter 11 of the C++ User's Guide.
Java includes threads as a fundamental part of the language. Although not well suited for high performance computing due to its resource requirements and the interpretive nature of the language, Java may nonetheless be of interest for certain applications. Java programmers should consult the Java 2 SDK documentation.
Threads provide a lightweight mechanism for exploiting parallelism within the context of a single UNIX process. Parallelism can also be obtained by running several distinct processes at the same time. Like all UNIX variants, Solaris provides a number of system services which facilitate communication between processes. These include pipes, message queues, semaphores, shared memory segments, signals, sockets, and memory-mapped files. These facilities may be used directly or as the basis for process-to-process communication in higher level libraries such as MPI. Interprocess communication (IPC) facilities may be the mechanism of choice for applications which bring together several programs with different functionality. An overview of IPC services in Solaris can be found in the Programming Interfaces Guide.
In most cases, the message-passing communication libraries described in the next section can also be used in shared memory environments, sometimes quite efficiently. The LAM/MPI library, in particular, exhibits very low overheads on shared memory nodes. While the message-passing paradigm often requires more programming effort than some of the simpler shared memory schemes, it is more portable, allowing a single application to run in shared memory, distributed memory, or mixed environments.
Even the most powerful multiprocessor nodes on SciClone provide only a fraction of the aggregate system resources (about 1% of the total CPU power, 2% of the memory, and 1% of the disk capacity). To truly take advantage of the system, it is necessary to build distributed memory applications that can bring many nodes to bear on a single computation. Although lower-level system services such as sockets or remote procedure calls are sometimes used to build distributed applications, most scientific programmers working on SciClone will want to use MPI, the de facto standard for message passing on distributed-memory parallel architectures. SciClone currently supports three different MPI implementations, LAM, MPICH, and MPICH-GM. Additional communication packages are expected to be available in the future.
The shared memory and distributed memory approaches can be combined in applications which run on multiple SMP nodes. This can be useful if the computation exhibits parallelism at several different levels, for example loop-level parallelism within coarse-grained tasks. Shared memory constructs may offer performance advantages over message passing for local communication within SMP nodes, although in some cases the reverse may also be true. Whether the performance benefits of mixed-mode programming are worth the extra complexity seems to depend heavily on the characteristics of the application.
To provide conflict-free access to SciClone's computational resources, node allocation and job scheduling services are provided by OpenPBS, the freeware version of the Portable Batch System (PBS). To avoid interfering with PBS jobs, all access to both compute and server nodes, including interactive shell sessions, must be initiated through PBS. The only exceptions are normal interactive work on the login servers (monsoon, maelstrom, squall, and mistral), as described elsewhere in this document.
Direct rlogin/slogin/telnet access to individual compute nodes is disabled, and all rsh and rcp commands which reference the nodes should be submitted via PBS jobs. Stray processes on the nodes (i.e., those not belonging to an active PBS job) are subject to termination without warning.
In some cases interactive access to compute nodes is required. Examples include software packages with graphical user interfaces (e.g., MATLAB or various visualization systems), debugging, etc. PBS has a special interactive mode (described in more detail below) which provides this capability, including forwarding of X11 sessions to the user's workstation via SSH.
As discussed in the Architecture Overview, nodes in the SciClone cluster fall into one of two categories, server nodes or compute nodes. While compute nodes are intended to provide dedicated computational resources for one or more jobs, server nodes provide services for the system as a whole. Thus PBS jobs which run on server nodes can adversely impact the performance of the whole cluster, and will themselves be impacted by other activities on the system. Thus most jobs should specifically request to run on compute nodes, as explained in the following sections.
Nevertheless, there may be circumstances in which jobs with special requirements will need to create processes on server nodes. Our PBS configuration currently allows this, and, in fact, will allocate server nodes to jobs if (1) server nodes are specifically requested by the job, or (2) the job does not specifically request compute nodes and no other resources are currently available to satisfy the request. This latter case is designed to improve turnaround for small jobs when the system is otherwise saturated. Users should feel free to take advantage of this when it is really needed (for example, deadlines for class projects or conference and journal submissions), but should not use it routinely.
Because server nodes also host SciClone's global filesystems (/sciclone/home*, /sciclone/scr*, /sciclone/qfs00), they are also the most efficient place to locate processes that perform large amounts of I/O against these files. In this case an appropriate PBS node specification can be used to place an I/O process on the server which physically hosts the filesystem of interest.
Although SciClone's server nodes (monsoon, maelstrom, squall, mistral, tempest, hurricane, zephyr) all have dual processors, PBS is allowed to use only one processor per server node. This leaves the other processor free to provide system-wide services such as compilation, I/O, NFS, DNS, job scheduling, etc. To avoid overloading the front end nodes, all application programs with non-trivial resource requirements (> 30 secs. CPU time or > 128 MB memory) must be submitted as PBS jobs. Processes which violate this rule may be killed without warning. (Typical code development and job preparation activities, including editing, compilation, make, file manipulation, etc., are specifically allowed to run on monsoon and squall as part of their normal interactive workloads.) PBS references server nodes via aliases (ms00, ml00, sq00, mt00, tp00, hu00) which map to either Gigabit or 10-Gigabit Ethernet interfaces on SciClone's internal jetstream network.
To use PBS, your search path, man path, library path, and default PBS server must be set correctly in your ~/.cshrc file:
set path=($path /usr/local/pbs/bin)
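The remaining settings take a form along these lines (illustrative only; the authoritative values are in the template .cshrc described in the next paragraph):

setenv MANPATH ${MANPATH}:/usr/local/pbs/man
setenv LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:/usr/local/pbs/lib
setenv PBS_DEFAULT monsoon.sciclone.wm.edu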
If you are using the recommended environment configuration (available in /usr/local/etc/templates/cshrc on sciclone.wm.edu), all of these environment settings will be configured for you automatically.
To accommodate heterogeneous environments (such as SciClone), PBS allows an arbitrary set of node properties to be assigned to each node. These properties may be appended to node allocation requests to constrain the set of processors which may be used to run the job. Node properties for SciClone are listed in the following table:
The next table defines all of the node properties listed in the table above:
In PBS, a job is simply a shell script which is submitted to the scheduler via the qsub command (run "man qsub" for full details.) As the following examples illustrate, qsub's "-l nodes=" option is used to specify how many and what type of nodes are required.
To request a single compute node (of any type):
qsub -l nodes=1:compute
This is also the default resource request, so it is equivalent to:
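qsub myjob

(that is, the same as omitting the "-l nodes=" option entirely, where myjob is the name of your job script).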
To request a specific node, use the node name as a node property:
qsub -l nodes=1:wh64
To request several specific nodes:
qsub -l nodes=nws02+wh01+tn01+tw01+gfs01+hu01
To request 32 c3 compute nodes, you could use any of the following:
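qsub -l nodes=32:c3

(or, equivalently, with the redundant "compute" property appended: "qsub -l nodes=32:c3:compute").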
Because PBS allocates CPUs, rather than hosts, it is possible for more than one job to be assigned to a multiprocessor node. Unless told otherwise, PBS allocates only a single CPU from a multiprocessor node, leaving the other CPUs available for other jobs. So the following example only assigns one CPU to the job, even though it specifically requests a node with two CPUs:
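qsub -l nodes=1:dual

(assuming a "dual" node property analogous to the "quad" property used in the examples below).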
To allocate more than one CPU, append the "ppn=" modifier to the list of node properties:
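qsub -l nodes=1:dual:ppn=2
qsub -l nodes=10:dual:ppn=2+2:quad:ppn=4

(The second line is an illustrative reconstruction: 10 dual-CPU nodes plus 2 quad-CPU nodes gives the 28 processors on 12 nodes described below.)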
The first specification will provide exclusive access to the node; the second will cause PBS to allocate a total of 28 processors on 12 nodes, all with exclusive access. In contrast, the following example would leave two CPUs available for other jobs on a 4-CPU node:
qsub -l nodes=1:quad:ppn=2
It is also possible for jobs to specifically allow their resources to be shared with other jobs by appending a "#shared" modifier to the end of the node properties:
qsub -l 'nodes=1:quad:ppn=4#shared'
This would assign all four processors of a quad-CPU node to the job, but would also allow other jobs with a #shared attribute to use the same node in time-shared fashion. Note the use of quotes to prevent "#shared" from being interpreted as a shell comment.
If you don't explicitly specify "compute" and no other node properties restrict the set of possible nodes, then the job could potentially be scheduled to run anywhere, including server nodes. For example, the following request might map onto any of the nodes in the tornado or gulfstream subclusters, as well as nws01 or the server node hu00.
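qsub -l nodes=1:ultra60

(one plausible form of such a request, using the "ultra60" property mentioned in the networking discussion above).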
When you absolutely, positively need turnaround as fast as possible on a busy system, use as few qualifiers as possible (subject to the guidelines stated above for running jobs on server nodes). The following request maximizes the chance of finding an available resource, since it can potentially be scheduled anywhere:
qsub -l nodes=1
When a PBS job executes, the PBS_NODEFILE environment variable is set to the name of a file which contains the list of nodes which have been allocated to the job. This file can be examined by your PBS job script to determine specifically which nodes a job is using.
If PBS is unable to satisfy a request due to conflicting node properties or insufficient nodes with the specified properties, qsub will return with a message such as "Job rejected by all possible destinations" or "Job exceeds queue resource limits".
In developing a scheduling policy for SciClone, we are trying to satisfy two conflicting goals: (1) rapid turnaround for code development, testing, and experimentation, and (2) maximum throughput for large, computationally intensive scientific applications which may need several days to complete. While there is fundamental tension between these two requirements, there are several things that can be done to alleviate the conflict. First, jobs are segregated into queues based on the number of nodes they require and the expected duration of the job. Different queues are assigned different priorities, and the scheduler attempts to run higher priority jobs first. Each queue has limits on the total number of jobs that can be run simultaneously, as well as the number of jobs that an individual user may run. This helps to reduce the chance of the system being monopolized by a single user or by a particular type of job to the exclusion of other work.
These strategies are augmented with priority aging, which means that the longer a job sits in the queue, the higher its priority becomes. If a job remains queued for a long period of time (currently 48 hours), it is deemed to be "starving", and the scheduler will purposely defer other jobs in order to free up enough resources to run a starving job. This can lead to the idling of large numbers of nodes for prolonged periods of time, but we mitigate this problem somewhat by "backfilling" shorter and smaller jobs into "holes" in the schedule.
SciClone's heterogeneous architecture, with its multiple subclusters, multiple networks, and differing types of nodes, poses special problems for job schedulers. In particular, PBS' starving jobs feature assumes that all nodes are identical, which means that it sometimes holds idle nodes open even when a starving job needs a different type of node. When this situation arises, the system operators may intervene to improve utilization and turnaround. We are aware of no existing batch schedulers which incorporate all of the capabilities we need, so development of a customized scheduler for SciClone remains a long-term goal.

qsub automatically submits jobs to a routing queue which examines their resource requirements (walltime and number of nodes) and places them in the appropriate execution queue. The following table lists the execution queues along with their node and time limits, their relative priorities, and bounds on the number of executing jobs:
In addition to the queues listed above, there are three others, called sys, special, and defer. These are intended for system maintenance tasks, special jobs that can't be accommodated within the normal queue limits, and very low priority jobs, respectively. Users with requirements that can't be accommodated within the normal queue structure should contact the system manager.
Queue parameters are subject to change as the workload on the system varies. In addition, the long-running "l" and "x" queues may be stopped on occasion to allow a backlog of shorter jobs to complete.
Note: At the present time, the tempest subcluster runs a separate copy of the job scheduler. It is configured similarly, but the number of queues, their names, and their limits differ. The appropriate scheduler is selected based on the type of node (Opteron or UltraSPARC) on which a PBS command is issued. We plan to unite the entire system under a single job scheduler in the near future.
All jobs are submitted to PBS via the qsub command. The basic syntax is:
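qsub [options] [script]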
script contains the executable commands which comprise the job. It may also contain PBS directives, which are special shell comments that can be used to set job options (see the qsub man page for details). According to the documentation, PBS is supposed to execute script by invoking the user's login shell, which on SciClone is either csh or tcsh. However, this feature does not appear to work correctly (except for interactive jobs invoked with "qsub -I"). For reliable operation, it is therefore necessary to specify the desired shell by using one of the following as the first line of the script:
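#!/bin/csh
#!/bin/tcsh
#!/bin/sh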
Some users have reported problems with sh, so csh and tcsh are recommended.
If script is omitted, qsub will read commands and directives from standard input.
qsub provides numerous options. For our purposes, the most important are "-l nodes=", "-l walltime=", and "-I", each described below.
The "-l nodes=" option was described above in the section on Node Properties. If this option is omitted, the default node specification is "1:compute".
The "-l walltime=" option indicates the maximum elapsed time for a job. This value is used along with the "-l nodes=" option to determine which execution queue the job will be placed in. If a walltime limit is not specified, the job will use a default value of 5 minutes. If a job exceeds its walltime limit, PBS will kill it. (See the section on Terminating Jobs for important information about the termination process.)
To facilitate accurate job scheduling, the requested walltime should be a reasonably good estimate of the actual time which will be needed, with a modest padding factor to avoid accidental termination. Time estimates which are larger than necessary can cause jobs to be placed in inappropriate queues, and will allow runaway jobs to consume excessive resources, delaying other jobs. Walltime may be specified in several formats. The following are equivalent:
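qsub -l walltime=7200
qsub -l walltime=120:00
qsub -l walltime=2:00:00

(7200 seconds, 120 minutes, and 2 hours, respectively).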
The "-I" option is described below under the topic of Interactive Jobs.
PBS provides only resource allocation and scheduling services; it is not aware of system-specific procedures for initiating parallel programs. Instead, it relies on vendor- or site-dependent scripts or programs to interface between PBS and parallel applications. The following sections describe how to initiate various types of programs from within a PBS job.
Most programs that run on a single node can be invoked from a PBS script in exactly the same way as in any other shell, by simply executing the command. For example, the following job would run myprog from within the directory mydir on any available compute node:
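#!/bin/tcsh
#PBS -l nodes=1:compute
cd ~/mydir
./myprog

(a minimal sketch; the node request could equally be given on the qsub command line).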
Jobs which may be run this way include conventional serial programs, shell scripts and shell commands, and SMP-style shared memory programs (multi-threaded, auto-parallel, OpenMP, etc.). The exception is single-node MPI programs, which should be started using the appropriate PBS interface script, as described in subsequent sections.
PBS provides a command, called pbsdsh, which is supposed to run shell commands on multiple nodes simultaneously. However, its functionality is limited, and it sometimes fails in a very anti-social way on SciClone. It has been disabled in favor of a script called pbsdoall, which can be found in /usr/local/bin.
Each of the different MPI packages provided on SciClone has a slightly different method for specifying the set of nodes which should be used for a job, and for mapping processes onto those nodes. This process is further complicated by SciClone's heterogeneous architecture, and the ability to mix different types of nodes within a single PBS job. To make matters worse, conflicts can sometimes arise when multiple MPI jobs try to share a node, and some MPI implementations require special care to ensure that leftover processes get cleaned up in the event a job terminates abnormally or exceeds its time limit. For these reasons, relatively sophisticated interface scripts are needed to launch MPI jobs from within a PBS environment.
The procedures for running MPI jobs on SciClone are described in the following sections.
In ordinary use, several steps are required to run a LAM job. First the set of nodes which LAM is going to use must be booted (lamboot). Then the parallel application program is executed (mpirun). A final step shuts down LAM and cleans up leftover processes on the nodes (lamhalt).
On SciClone, these steps have been consolidated into a single script called pbslam. pbslam makes most LAM runtime options available to the user, with defaults which should be appropriate for most applications. It also performs an optional check for extraneous processor activity before executing the application and tries very hard to clean up stray processes in the event that a job terminates abnormally. pbslam features two different strategies for mapping processes onto processors, which can be used in conjunction with PBS node properties to afford considerable flexibility.
LAM commands such as lamboot, mpirun, lamclean, or lamhalt should not be used directly from within a PBS script.
To work its magic, pbslam must be exec'ed from the top-level shell within a PBS job, replacing the shell which invokes it (don't ask). This implies that pbslam must be the last command executed from within a PBS script, since it will never return. Furthermore, it cannot be executed at a lower level, for example, from another script which is called by the PBS script. pbslam checks for its proper place within the PBS process hierarchy, and will complain if it is invoked incorrectly. For a full description, see the pbslam man page.
The following example runs an 8-process LAM job on two quad-cpu nodes, aborting the job if extraneous CPU activity is detected on either node:
qsub -l nodes=2:quad:ppn=4 -l walltime=600
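where the job script might look like the following sketch (the pbslam option which enables the CPU-activity check is described in the pbslam man page and omitted here):

#!/bin/tcsh
cd ~/mydir
# pbslam must be exec'ed as the final command of the top-level script
exec pbslam ./myprog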
Although MPICH's procedure for launching jobs is simpler than LAM's, a PBS interface script (pbsmpich) is still required to map processes onto processors and to perform other checks. Unlike pbslam, however, pbsmpich does not have to be exec'ed, which means that multiple MPICH programs can be run from a single PBS job script. This makes it easier to orchestrate computations which are composed of several different programs, particularly if they need to pass data to each other via intermediate files which are stored on the local scratch partitions of individual nodes. By combining several programs into a single script, you can ensure that successive steps all use the same set of processors, in the same order, without having to resort to elaborate and restrictive specifications of PBS node properties.
The following simple example shows a three-step MPICH job which passes results from one step to the next via scratch files located on logical node 0's local disk. Initial input data and final results are stored on the front end's scratch partition. A more sophisticated application might open files directly on each node's local scratch disk, avoiding the communication and serialization needed to retrieve and redistribute results from the job's head node.
qsub -l nodes=4:c7 -l walltime=7200
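with a script along the following lines (program names, paths, and pbsmpich options are placeholders):

#!/bin/tcsh
cd ~/scr00/mydir
pbsmpich ./step1   # reads input here, leaves intermediates on node 0's local scratch
pbsmpich ./step2   # transforms the intermediate files in place on node 0
pbsmpich ./step3   # writes final results back to this directory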
The MPICH installation includes a companion suite of extensions and utilities called MPE, which may be useful in debugging, visualizing, or analyzing the performance of MPI programs. Although the MPE libraries and utilities are installed within the MPICH directory hierarchy (/usr/local/v8plusa/generic/mpich), they can also be used with other MPI packages such as LAM. MPICH and MPE libraries are currently available only for 32-bit (v8plusa) applications due to limitations in the build procedure.
Although MPICH is in some respects easier to use and more flexible than LAM, LAM has historically been more robust and usually outperforms MPICH, sometimes by a wide margin. A few users have reported that various MPI constructs are faster or more reliable in one package or the other, so it's probably worth trying both LAM and MPICH to see which one works best for a given application.
MPICH-GM is a version of MPICH which has been implemented on top of Myrinet's low-level GM communication library. MPICH-GM programs can be run only on nodes which have Myrinet interfaces (i.e., those with a "myri" or "myri2" node property). SciClone includes two distinct Myrinet networks based on different generations of the interconnection technology, but they are software compatible, meaning that the same MPICH-GM executable can be run on either network. See the section on Networking for details on which nodes and subclusters are connected to each Myrinet. MPICH-GM over Myrinet provides high bandwidth and low latency and is a good choice for communication-intensive applications, offering performance that can be dramatically better than the alternative TCP-based MPI packages.
The procedure for running MPICH-GM programs under PBS is similar to that for regular MPICH, but uses a different interface script (pbsmpichgm) with somewhat different options. The most important difference is the way in which MPICH-GM terminates programs. The first process to exit (either normally or abnormally) triggers a timer in the MPICH-GM runtime system which kills all of the remaining processes when it expires. To ensure that stray processes get cleaned up before PBS exits, this interval should be fairly short, no more than about 45 seconds. The default delay is 15 seconds, but it can be modified with the -k option of pbsmpichgm. Because of this brute-force hack for terminating programs, it may be advisable to insert an extra barrier synchronization in the code (MPI_Barrier) just before the call to MPI_Finalize to ensure that all processes exit together. Otherwise, slower processes could be killed prematurely. In a C program, this might look like:
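    MPI_Barrier(MPI_COMM_WORLD);   /* make sure all processes arrive... */
    MPI_Finalize();                /* ...before any of them exits */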
The following example shows a simple PBS job which uses pbsmpichgm to launch an MPICH-GM program on the tornado subcluster with round-robin (node-order) mapping of processes onto processors. (The "myri" node property is redundant since all c2 nodes are connected to Myrinet, but it's a good idea to make this requirement explicit anyway.)
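A sketch of such a job (pbsmpichgm's process-mapping option is documented in its man page and omitted here):

qsub -l nodes=8:c2:myri:ppn=2 -l walltime=3600 mygmjob

where the script mygmjob invokes "pbsmpichgm ./myprog" (optionally with -k to adjust the termination delay).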
MPICH-GM vs. LAM
In the past, MPICH-GM was the only option for MPI programs which wanted to take advantage of SciClone's Myrinet networks, but LAM 7 added native GM support for Myrinet. MPICH-GM offers slightly better Myrinet performance than LAM, but applications which are built with MPICH-GM can run only on Myrinet-enabled nodes. In contrast, applications built with LAM can run on any node in SciClone and the pbslam script will automatically choose the best available network based on the set of nodes allocated to the job.
If the "-I" option of qsub is used, the job will be interactive, meaning that the shell process will be connected to the terminal session from which qsub was invoked. Users can then run anything they want interactively, including MPI programs (via pbslam, pbsmpich, or pbsmpichgm) or parallel shell commands (via pbsdoall). If the job is running on a single node, then the interactive session can be used to run UNIX commands directly, just as in any other shell environment.
As an example, to get a login shell on wh33 for half an hour, use:
qsub -I -l nodes=wh33 -l walltime=30:00
The qlogin command provides a simple alternative to "qsub -I". With qlogin, the previous example becomes:
qlogin -t 30 wh33
In the Unix/Linux world, the X Window System (a.k.a. "X11" or simply "X") provides the underlying mechanism for constructing graphical user interfaces (GUIs) and displaying visual information on screen. One of the most powerful features of X11 is the ability to run application programs (X clients) on one system and have them display on another system (X server) by routing the graphical protocol stream across the network. On SciClone, this capability allows us to submit interactive X11-based jobs (such as MATLAB and OpenDX) to the PBS job scheduler and route the display back to the user's desktop workstation. This can be accomplished either with X11 Forwarding through Secure Shell (SSH), or via the Xauthority mechanism which is built into X11.
X11 Forwarding is the simplest and most secure mechanism. With this technique, the user simply logs into the front end (sciclone.wm.edu) via SSH, and then uses the qlogin command to initiate a job. qlogin automatically configures the DISPLAY environment variable so that X11 sessions will be forwarded through the front end back to the user's desktop (assuming that the user is running an X server on an authorized host and that his/her local SSH configuration allows X11 Forwarding). This makes qlogin the preferred way to launch simple GUI-based applications on compute nodes.
For applications which have complex GUIs, require frequent screen updates, and/or produce a high volume of image data, the SSH forwarding scheme is inefficient and may perform poorly. Overheads are incurred in encrypting the graphical data stream and routing it back through the front end node, where it may compete with several other users for a slice of a Fast Ethernet link to the outside world. A better approach for these types of applications is to establish a direct connection between a compute node and the user's workstation. To allow this, the user must first establish an X11 authorization key on his/her workstation and then propagate that to SciClone. The key is essentially a "permission slip" which gives X clients on SciClone the ability to connect to the user's local display. Some X11 installations create the appropriate keys automatically when a user logs in on his/her workstation, while others do not. In either case, the key must be copied over to SciClone.
To facilitate this process, we provide a script which can be downloaded and run on the user's workstation to automatically generate a key (if necessary) and merge it into the user's ~/.Xauthority file on SciClone. Once the key is in place, X11-based PBS jobs can be submitted to SciClone in the usual way (either interactive or batch mode), setting the DISPLAY environment variable to point back to the local workstation as shown in the following example:
setenv DISPLAY myworkstation.domain:0
/usr/local/bin/dx -processors 4 ...
X sessions established with this mechanism will bypass the front end and are normally routed directly to the campus backbone via SciClone's "back end" Gigabit Ethernet link. The primary disadvantage of this approach is that the connection is unencrypted and therefore inherently insecure: X11 authorization keys are transmitted "in the clear" across the network, as is the data stream between the X client and the X server. This means that anyone with the appropriate software tools and access to a host on any of the network segments along the route could potentially steal authorization keys and/or intercept the contents of X sessions. For this reason, we recommend using X authorization keys only for applications which need to maximize graphical performance. It is imperative that passwords, passphrases, or other sensitive information never be entered into X applications which are initiated using this technique.
PBS provides several commands for monitoring the status of jobs and queues. One of these, a graphical job monitor called xpbsmon, is no longer recommended for use on SciClone. xpbsmon suffers from a number of problems, including sluggish performance, excessive memory and CPU requirements, heavy network traffic, poor scalability, a limited color palette, and occasional crashes. As an alternative, we have developed a Java-based tool called Jobprobe which rectifies most, if not all, of these deficiencies. We recommend the routine use of Jobprobe by anyone who is actively running PBS jobs on SciClone. For more information about downloading and running Jobprobe, see the subsequent section on Performance Monitoring Tools.
The pbsnodes command can be used to check node status, as follows:
To list the status and node properties of all nodes:
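pbsnodes -a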
Information about jobs and queues is provided by the qstat command. The following list includes several useful examples. Consult the qstat man page for more details.
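For example:

qstat -q            # summarize all queues and their limits
qstat -a            # list all jobs in all queues
qstat -u username   # list jobs belonging to a particular user
qstat -f jobid      # show full details for a single job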
Important note 2: When a job exceeds its walltime limit, it may take a few minutes for PBS to notice, and a few more minutes for the termination and cleanup process to complete. During this period, do not attempt to kill the job with a qdel command. Wait until a job is at least ten minutes over its walltime limit before concluding that something is amiss.
These same cautions apply to the qsig command, which should be used only in extraordinary circumstances.
For interactive jobs (qsub -I), PBS connects stdin, stdout, and stderr to the terminal session which issued the qsub command.
For batch jobs, PBS collects the contents of stdout and stderr in a spool directory which resides on the first node assigned to the job. When the job terminates, PBS copies these spool files back to the directory from which the original qsub command was issued, using a filename of the form "jobname.[eo]NNNN", where jobname is the name assigned by the -N option of qsub, or, if -N was not specified, either the name of the job script or "STDIN" if the script is read from qsub's standard input). "e" or "o" indicates stderr or stdout, and NNNN is the job number assigned by PBS. The names and locations of these output files can be modified using the -e, -o, and -j options of qsub.
stdin for a batch job is attached to the PBS script file itself. Any command within the script which reads from stdin should redirect its input to avoid buffering conflicts with the shell which is running the script.
If you want to monitor stdout or stderr of your program while it is running (rather than waiting for PBS to deliver the results), you should redirect the output to a globally accessible filesystem (e.g., home00, home10, scr00, scr01, scr02, or scr10). Once the job begins execution, you can monitor its progress using the tail command or something similar. For example,
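(in the job script, using csh syntax and an illustrative path)

./myprog >& ~/scr00/mydir/myprog.out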
Wait for the job to begin, then
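tail -f ~/scr00/mydir/myprog.out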
Note that the output may not appear immediately, but rather in spurts as buffers fill up and get flushed. Any other output generated by PBS or by the script itself will be directed to stdout and delivered to ~/mydir/myjob.oNNNNN, where NNNNN is the job number.
Also note that this strategy does not work well with QFS filesystems, apparently due to the way that QFS manages read and write access to shared files. Continuous monitoring of the contents of a QFS file with "tail -f" or some similar utility can interfere with write access by an active job, resulting in very poor I/O performance.
Important note: The capacity of the PBS spool directory on each node is limited. If your PBS job will generate more than a few hundred megabytes of output on stdout or stderr, you should explicitly redirect it to a filesystem which has enough free space to hold the data.
PBS has many more features beyond those described here, but unfortunately, the developers have never written a user-oriented manual. The best source of information is the PBS man pages, available in /usr/local/pbs/man on sciclone.wm.edu. For a link to the Administrator's Guide and other information, see the OpenPBS page on the SciClone web site.
Performance Monitoring Tools
To obtain a better understanding of the dynamic behavior of the entire SciClone cluster, we have developed a suite of Java-based graphical monitoring tools. Employed together, these tools can provide considerable insight into the behavior of applications running on SciClone, as well as the overall performance of the system. The suite includes Jobprobe (introduced above under job monitoring) along with several companion displays.
Each of these tools is implemented as a rather sophisticated distributed system, with monitoring components running on the cluster itself and a Java-based display program running on the user's workstation.
The displays are updated in near-real-time (approximately every 15 seconds), which provides a good balance between low intrusion in the system and sufficient temporal resolution to see how applications are behaving.
To use these tools, you first need to have a recent version of the Java Runtime Environment (JRE) installed on your desktop PC or workstation (minimum acceptable version is 1.4.2_02, but 1.5.0_02 or later is recommended). Note that JRE is preinstalled on many systems so you may not need to download and install it yourself. Once JRE is available on your desktop, you then need to login to monsoon.sciclone.wm.edu and copy the following files to your workstation:
cd /usr/local/sciclonometer/clients
scp *.jar myhost:mydir

where "myhost" is the name of your workstation and "mydir" is whatever local directory you want to store the jar files under.
Once the files are copied over, you can execute them as follows on Linux or Unix (on a PC or Mac, start them as you would any other application):
cd mydir
java -jar toolname.jar

where "toolname.jar" is the name of the jar file for the desired tool. The display programs have been tested under Mac OS 10.3, Red Hat Enterprise Linux 3, and Solaris 9, but should theoretically work on any platform with a current version of Java.
A final caveat: although these tools have been in use for some time by SciClone's staff and user community, we still consider them to be somewhat experimental, so if you run into problems, please send a detailed trouble report to firstname.lastname@example.org.