SciClone Cluster Project Computational Science Cluster
Home
Introduction
Sponsors
Research
Hardware
Software
User Info
Documentation

SciClone User's Guide

Version 2.0

Revised: 6/10/09

This document provides background information and instructions for using the SciClone Cluster at the College of William and Mary. It is considered to be required reading for new users. Refer to other sections of the SciClone web site for more detailed information on specific topics such as software packages or hardware configuration.


Topics


How to Get Help

Questions, problems, or trouble reports: Send email to sciclone@wm.edu.

Urgent problems: Call Tom Crockett (757-221-2762) or Andriy Kot (443-365-9107).

Emergencies: After hours phone numbers are posted on the computer room door in Savage House, or call Campus Police at 757-221-4596.

Application Support: To request assistance in accessing or using the system, send email to sciclone@wm.edu.

Because SciClone is regarded as a research (rather than production) facility, support is available only during normal office hours.

When reporting problems, please provide as much relevant information as possible. This should include the following, as appropriate:

  • date and time when the problem occurred
  • node(s) or server(s) involved
  • text of the command(s) which you issued
  • exact and complete text of any error messages which were generated
  • source code and/or makefiles which demonstrate the problem
  • any other information which may help in identifying or resolving the problem

Email Discussion List: An email list for discussing topics of general interest to the SciClone user community is available on the William and Mary list server. Access is by subscription only. To subscribe, William and Mary users should login to the list server with their campus account; external users must first create an account. Postings should be sent to sciclone-discuss@lists.wm.edu.


Obtaining Accounts

SciClone is operated by the Computational Science Cluster as a College-wide resource. William and Mary faculty, staff, and students with computation- or data-intensive applications from any discipline are welcome to apply for accounts on the system. The system is also available to those interested in developing tools or infrastructure to support more effective use of cluster computing systems. Requests for access from outside the William and Mary community (including W&M collaborators) are evaluated on a case-by-case basis. To apply for an account, follow the instructions in the SciClone Account Request Form.


Account Renewal, Expiration, and Deletion

Most accounts on SciClone have an associated expiration date which is specified on the user's Account Request Form. An email notice will be sent to the user approximately two weeks before his/her account is set to expire. Accounts can be renewed if need be by submitting a new Account Request Form. If an account is not renewed before the expiration date, it will be disabled immediately following the expiration date. All files belonging to expired accounts are subject to deletion after a short grace period (30-day minimum). It is the user's responsibility to preserve any necessary files by moving or copying them elsewhere before the account expires. An account which has expired may be reactivated by submitting a new Account Request Form, but files previously associated with the account may not be available after the grace period.


Accessing the System

All access to SciClone from external systems is via Secure Shell using the SSH 2 protocol. SciClone includes subclusters based on two different and incompatible processor architectures, Sun UltraSPARC and AMD Opteron. Each of these architectures has a different login server which should be used to edit and compile programs, manipulate files, or submit batch jobs. The primary login server for the UltraSPARC nodes is monsoon.sciclone.wm.edu, a.k.a. sciclone.wm.edu. The primary login server for the Opteron nodes is squall.sciclone.wm.edu. These nodes are also referred to as "compile servers", "file servers", "front ends", "host nodes", or simply monsoon and squall. All of these terms are used interchangeably.

Two additional login servers are available for users involved in specific projects requiring access to certain node-locked software packages or the Oracle database server. These are maelstrom.sciclone.wm.edu (UltraSPARC) and mistral.sciclone.wm.edu (Opteron).

Unless you have made special arrangements for dedicated time, all access to nodes other than monsoon, maelstrom, squall, and mistral should be through the PBS job scheduler. This topic is discussed in more detail in the section on Running Programs.


Changing Your Password

You will be issued a default password by telephone when your SciClone account is created. To change it, you must login to monsoon.sciclone.wm.edu and run the passwd command. Changes will automatically propagate to all of the login servers within a minute or so. If you change your password on any node other than monsoon, the changes will not be permanent, and your password will eventually revert to its previous setting.


Architecture Overview

A key feature of SciClone is its heterogeneous architecture, which provides both flexibility for applications as well as a controlled environment for studying the complex issues which arise in larger distributed systems. Specifically, SciClone's heterogeneity arises from its use of multiple processor configurations (single-, dual-, and quad-cpu nodes, each at different clock rates), multiple networking technologies (Fast Ethernet, Gigabit Ethernet, and Myrinet), and its organization as a "cluster of clusters".

Node Types and Subclusters

SciClone features sixteen different node configurations, organized into eight distinct subclusters which can be used individually or in combination. The node types and subclusters are summarized in the tables below. See the Hardware Component List for detailed specifications. Nodes can be further classified as either "server" nodes or "compute" nodes, depending on their intended uses. As the name implies, server nodes provide specialized functions for the cluster as a whole, while compute nodes are dedicated to running users' jobs. Application programs should not run on server nodes unless they have a specific need to do so.

SciClone Node Types
Node
Type
Qty.
# of
CPUs
(x cores)
Clock
Speed
Memory /local/scr Comm.
Subcluster
Suggested Uses
S3
1
2
900 MHz 6 GB 33 GB Gigabit Ethernet
Primary server. Provides login, compilation, and job scheduling services for all of the UltraSPARC nodes in the cluster. Provides system-wide accounting, DNS, and global filesystem services. Should not be used for computation except in specialized circumstances.
S5
1
2x2
2.2 GHz 4 GB 6 GB 10 Gb Ethernet
Primary server. Provides login, compilation, and job scheduling services for all of the Opteron nodes in the cluster. Provides global filesystem services. Should not be used for computation except in specialized circumstances.
S4
1
2
2.4 GHz 4 GB 6 GB 10 Gb Ethernet
Secondary server. Provides auxilliary services for the tempest subcluster. Should not be used for logins or computation.
S4A
1
2
2.4 GHz 4 GB 6 GB 10 Gb Ethernet
Secondary server. Similar to S4 node, but tailored to specialized services and applications. Should not be used for computation except in specialized circumstances.
S2
1
2
450 MHz 1 GB 35 GB Gigabit Ethernet
Secondary server. Provides mail, printing, and global filesystem services for the entire cluster. Should not be used for logins or computations except in specialized circumstances.
DB1
1
2
450 MHz 4 GB 16 GB Gigabit Ethernet
Myrinet 2000
Database and bioinformatics server. Similar to S3 node, but with extra memory and larger disks to support database and bioinformatics applications. Also provides global filesystem services for the entire cluster. Direct logins allowed for access to specialized software packages. May be used by PBS jobs which need access to node-locked bioinformatics software.
M1 1 1 650 MHz 512 MB 10 GB Fast Ethernet System management node. Provides performance monitoring and control of computers, storage, and networks, along with centralized logging services. Not intended for use by applications.
N1
1
2
360 MHz 512 MB 12 GB Gigabit Ethernet
Myrinet 1280
Network compute node. Similar to a C2 compute node, but with a gigabit connection to the internal network and a direct connection to the building network. Useful for host/node, client/server, master/slave, or n+1 programming models, as well as for proxies or other processes which mediate between internal and external computations. Good choice as an I/O or "head" node for applications running in the tornado or gulfstream subclusters.
N2
1
2
900 MHz 2 GB 31 GB Gigabit Ethernet
Myrinet 2000
Network compute node. Similar to a C5 compute node, but with a gigabit connection to the internal network and a direct connection to the building network. Useful for host/node, client/server, master/slave, or n+1 programming models, as well as for proxies or other processes which mediate between internal and external computations. Good choice as an I/O or "head" node for applications running in the twister subcluster.
C2
32
2
360 MHz 512 MB 12 GB Fast Ethernet
Myrinet 1280
tornado
Compute node. General parallel and serial computation; jobs with intensive communication or local I/O requirements.
C3
64
1
650 MHz 1 GB 26 GB Fast Ethernet
whirlwind
Compute node. General computation. Preferred location for serial (non-parallel) computations with significant memory requirements.
C4
4
4
450 MHz 4 GB 6 GB Gigabit Ethernet
Myrinet 2000
hurricane
High-performance compute node with Gigabit Ethernet and Myrinet 2000. SMP parallel computations (multi-threaded, auto-parallelization, compiler directives, OpenMP), memory- and communication-intensive applications; serious number crunching.
C5
32
2
900 MHz 2 GB 31 GB Fast Ethernet
Myrinet 2000
twister
High-performance compute node. Memory-, CPU-, and I/O-intensive applications with modest communication requirements. Two local scratch partitions (53 GB total).
C6
4
2
360 MHz 512 MB 17 GB Gigabit Ethernet
Myrinet 1280
gulfstream
Compute node, with two local scratch disks (36 GB total), Gigabit Ethernet, and Myrinet 1280. Good choice for I/O-, communication-, and data-intensive applications, including interactive visualization work.
C6A
2
2
360 MHz 512 MB 17 GB Gigabit Ethernet
Myrinet 1280
gulfstream
Same as a C6 node, but with three local scratch disks (54 GB total).
C7 2 4 1.28 GHz 8 GB 202 GB Gigabit Ethernet
Myrinet 2000
vortex Data-intensive compute node with large memory (8 GB), high capacity local scratch disks (202 GB & 44 GB), Gigabit Ethernet, and Myrinet 2000. Well-suited for SMP parallel computations (multi-threaded, auto-parallelization, compiler directives, OpenMP) with large memory and I/O requirements, as well as communication-intensive distributed-memory applications, out-of-core methods, and serious number crunching.
C7A 2 4 1.28 GHz 16 GB 202 GB Gigabit Ethernet
Myrinet 2000
vortex Same as a C7 node, but with twice as much memory (16 GB).
C8 42 2 2.4 GHz 4 GB 56 GB Gigabit Ethernet
InfiniBand 4x
tempest High-performance Opteron-based compute node with Gigabit Ethernet and InfiniBand. Suitable for distributed-memory applications with demanding CPU, memory, I/O, and communication requirements.

 

SciClone Subclusters
Subcluster
# of
Nodes
Node
Type(s)

Node
Names

Suggested Uses
whirlwind
64
C3
wh01-
wh64
General computation; serial and embarassingly parallel applications; distributed-memory parallel computations; algorithm development and scalability studies. Communication is limited to Fast Ethernet (100 Mb/s).
tornado
32
C2
tn01-
tn32
General parallel computation; communication-intensive parallel computations via Myrinet; mixed mode (SMP+distributed) computations; algorithm development and scalabilty studies.
twister
32
C5
tw01-
tw32
General parallel computation; memory- and CPU-intensive applications; communication-intensive parallel applications via Myrinet; large out-of-core problems; mixed mode (SMP+distributed) computations; algorithm development and scalabilty studies.
hurricane
4
C4

hu01-
hu04

Memory-, CPU-, and communication-intensive parallel computations via Gigabit Ethernet or Myrinet; shared-memory applications; mixed mode programs. Lightweight computations should run somewhere else.
gulfstream
6
C6, C6A
gfs01-
gfs06
Data-intensive computing with multiple local scratch disks; communication-intensive computing via Gigabit Ethernet or Myrinet; visualization; large out-of-core problems.
vortex
4
C7, C7A

vx01-
vx04

Memory-, I/O-, and communication-intensive parallel computations via Gigabit Ethernet or Myrinet; shared-memory applications; mixed mode programs; very large out-of-core problems. Lightweight computations should run somewhere else.
tempest
42
C8

tp01-
tp42

CPU-, memory-, and communication-intensive parallel computations via Gigabit Ethernet or InfiniBand; large out-of-core problems; mixed mode (SMP+distributed) computations. Lightweight computations should run somewhere else.

 

Networks

SciClone features a rich but complex networking environment. For a schematic, refer to the SciClone Architecture Diagram. The following table summarizes the networks used in SciClone.

Net
ID
Technology
Network
Number
Hostname
Suffix
Description
1
Fast Ethernet /
Gigabit Ethernet
128.239.40-43
-f
-g
Also known as "jetstream", this is the primary internal network for SciClone. Every node in the cluster has an interface to this network. Uses a combination of Fast Ethernet (100 Mb/s) and Gigabit Ethernet (1000 Mb/s) switches connected by 3- and 4-way Gigabit Ethernet trunks. Also provides a gigabit route to the campus network for bulk data transfers and bandwidth-intensive applications such as visualization.
2
Myrinet-1280
198.168.2
-m2
1.28 Gb/s low-latency switched communication fabric connects the tornado and gulfstream subclusters plus nws01. Provides excellent performance for communication-intensive applications.
3
Gigabit Ethernet
192.168.3
-g3
Dedicated point-to-point connections for specialized applications such as visualization.
4
Fast Ethernet
128.239.33
-f4
Savage House network. Switched network connects SciClone to other systems within the building, and provides the preferred route for external hosts to reach the SciClone front end.
5
Ethernet
192.168.5
-e5
-f5
Device management network. 10/100 Mb/s switched Ethernet network allows monitoring and management of SciClone's switches, mass storage subsystems, and service processors without intruding on application traffic.
6
Fast Ethernet
172.31
-f6
Private class B network used by the Computer Science Department's Network Systems Testbed. Not intended for general use.
7
Myrinet-2000
192.168.7
-y7
2.0 Gb/s low-latency switched communication fabric connects the vortex, hurricane, and twister subclusters plus nws02 and maelstrom. Offers maximum performance for communication-intensive applications.
8
InfiniBand 4x
192.168.8
-i8
10 Gb/s low-latency switched communication fabric connects nodes within the tempest subcluster. Offers maximum performance for communication-intensive applications.

SciClone's front end nodes, monsoon and squall, have interfaces on multiple networks, as do many of the other nodes. From outside the cluster, use the hostname monsoon.sciclone.wm.edu (a.k.a. sciclone.wm.edu) or squall.wm.edu to login via the Savage House Fast Ethernet network. From other nodes within the cluster, use ms00.sciclone.wm.edu or sq00.sciclone.wm.edu (or just ms00 and sq00) to reach the front end nodes via the cluster's internal network. If you use the monsoon or squall addresses from within the cluster, you'll get routed outside to the building network and then back in via the external interface, a longer and slower path.

A complete list of hostnames and their corresponding network addresses is available in /etc/hosts on the front end.

Metaclusters

When communication is routed via Myrinet, some of the distinctions between subclusters become blurred. In particular, the gulfstream and tornado subclusters plus the node nws01 can be thought of as a unified 39-node, 78-processor "metacluster". Although some details differ, these nodes have identical processor and memory configurations and share the same Myrinet-1280 network. However, I/O intensive processes should be placed on nws01 or one of the gulfstream nodes to take advantage of Gigabit Ethernet links to the servers. Nodes could be allocated from this combined pool via a PBS node specification such as "ultra60:myri:ppn=2". Similarly, nws02 can be used to augment the twister subcluster, using the Myrinet-2000 network for communication. In this case a node spec might look like "f280r:myri2:ppn=2" and nws02 would be the best place to locate I/O intensive processes.

Note that applications are free to combine nodes from multiple subclusters in any way they see fit, but in the general case, differences in processor speeds, communication interfaces, memory capacities, and disk performance pose difficult load balancing problems which can lead to very inefficient use of computational resources unless specialized parallel algorithms are employed.


Filesystems

When a user account is installed, subdirectories are created in the following filesystems:

Filesystem
Name
Purpose
Description

One of:
/sciclone/home00
/sciclone/home01
/sciclone/home02
/sciclone/home10

Home directories
Primary location for source code, executables, scripts, and moderate-sized data files. Accessible system-wide via NFS. Files on these partitions are backed up on a regular basis. Files are subject to deletion after the user's account has expired. We do not archive expired accounts.
/sciclone/qfs00 Large file storage High-capacity, high-performance shared filesystem intended for storage of large files (1 MB and above). Mounted directly on monsoon and maelstrom via Fibre Channel SAN; accessible everywhere else via NFS. Files on this partition are backed up on a regular basis. Files are subject to deletion after the user's account has expired. See additional info on QFS below.
/sciclone/scr00
/sciclone/scr01
/sciclone/scr02
/sciclone/scr10
Global scratch space
High capacity storage for large files and short-term working data. Accessible system-wide via NFS. Files on these partitions are automatically deleted after 30 days of inactivity, and are not backed up.
/local/scr
/local/scr2
/local/scr3
Local scratch space
Multi-gigabyte scratch partition(s) physically resident on a node's local disk. Use for temporary storage of local data and intermediate results. Also useful as scratch space for out-of-core methods or as a staging area for input and output files. Provides better performance than NFS-mounted filesystems. Every node has a /local/scr partition. /local/scr2 is available only on C5, C6, C6a, C7, and C7a nodes. /local/scr3 is available only on C6a nodes. Files on these partitions are automatically deleted after 30 days of inactivity, and are not backed up. On hurricane, /local/scr points to /sciclone/scr00; on monsoon, /local/scr points to /sciclone/scr01.

Symlinks in each user's home directory point to the preconfigured global (~/scr*) and local (~/lscr*) scratch directories and the QFS filesystem (~/qfs00). /root, /usr, and /var filesystems are local to each node; /opt, /usr/local, and /import reside on the front end and are exported to each node via NFS.

QFS

Sun's StorEdge QFS is a high-performance, high-capacity SAN-based shared filesystem which is optimized for bulk accesses on large sequential files. The /sciclone/qfs00 filesystem spans 46 disk drives distributed across five separate RAID arrays, with a total formatted capacity of 5.3 terabytes. The disk arrays are connected to each other and to the SciClone servers via a Fibre Channel Storage Area Network, or SAN. Individual files exceeding a terabyte in size can be accommodated, space permitting.

QFS allocates disk space in large blocks, and is therefore rather inefficient for collections of small files, such as source code. As a rule of thumb, files which are stored in SciClone's QFS filesystem should have an average size of 1 MB or larger. Smaller files should be stored in home directories (/sciclone/home*) or scratch filesystems (/sciclone/scr*). Space allocation in /sciclone/qfs00 is monitored and you may be notified if you have too many small files stored there. Note that small files can be aggregated into larger ones with Unix utilities such as ar, tar, cpio, and zip. ar is particularly useful since it maintains an internal index structure which supports the addition, deletion, replacement, and retrieval of individual members of the archive.

I/O performance with QFS can be improved (sometimes dramatically) by reading and writing data in large blocks that match the blocking factor of the filesystem. In our tests, a blocksize of 65536 (64 KB) was near optimal, yielding write speeds in excess of 100 MB/s from monsoon and maelstrom.

QFS is a shared filesystem, meaning that it can be mounted and accessed simultaneously by multiple servers. It allows simultaneous reads from the same or different files and simultaneous writes to different files. QFS can also be configured to allow simultaneous writes to the same file from applications which have been designed to perform page-aligned I/O. This option is not presently enabled on SciClone, but could be if the need arose.

The /sciclone/qfs00 filesystem is mounted directly on monsoon and maelstrom, and is exported from there via NFS to all other nodes in the cluster. Even-numbered nodes mount qfs00 from maelstrom and odd-numbered nodes from monsoon. This spreads the NFS load across both servers, resulting in higher throughput and less resource contention. Note that applications which generate a lot of output will still get better performance by writing files to the local scratch partitions and then copying them back to the servers with rcp, rather than relying on NFS. The even/odd strategy would be helpful in this scenario, too.


Backups

Home directories (/sciclone/home00 and /sciclone/home10) are normally backed up several times per week. QFS directories (/sciclone/qfs00) are backed up approximately every three days. Scratch directories are not backed up at all. Furthermore, files in any of the scratch partitions which have not been used in the past 30 days will be deleted automatically in order to maintain sufficient free space for active projects.


Shell Environment

A default .cshrc file is provided in each user's home directory. If you modify it, be sure you know what you're doing—SciClone's hardware and software environment is considerably more complex than the typical UNIX workstation environment. The default configuration enables a 32-bit environment for compiling and linking, with LAM as the default MPI package. To use alternative communication packages such as MPICH or MPICH-GM, settings of various environment variables will need to be changed as documented within the .cshrc file. Copies of the current recommended configuration files can be found in /usr/local/etc/templates/.

Four UNIX groups are established for each organization or department which has users on SciClone. These four groups correspond to the user's status within the organization, i.e. faculty/staff, graduate student, undergraduate student, and "other". For example, a professor from the William and Mary Computer Science Department would be assigned to the group "csf", while a CS undergraduate would be assigned to the group "csu". Default file access permissions are set so that a user's files and directories are read-write for the user, read-only for the user's group, and unreadable by anyone else. If this is not appropriate for your situation, you should change the default umask setting in your .cshrc file, or set file access modes on a case-by-case basis. Note that SciClone's filesystems are exported to other computers within the Computational Science Cluster, and may therefore be visible beyond the SciClone user community.


Compilers and Libraries

Both Sun and GNU compilers are installed on the system. See the Software section of the SciClone web site for more information about what's available and how to use it. Unless otherwise noted, all of the third-party software which is installed on SciClone has been built using Sun's compilers. For maximum performance and to ensure compatibility with system libraries, we strongly recommend the use of Sun's compilers whenever possible.


Compiler Options and Code Optimization

When compiling and optimizing code for use on SciClone, care must be taken to ensure that the resulting executables will perform properly on the type of nodes on which they will be executed. The situation is further complicated by the fact that SciClone includes two distinct and incompatible families of processor architecture, Sun UltraSPARC and Intel/AMD x86/amd64. Even within a particular architecture family, there are variations in processor capabilities among the different types of nodes. By choosing appropriate compiler options, users can compile their applications for portability across an entire architecture family, or optimize performance for a particular type of node. With the current set of hardware, six different instruction set architectures (ISAs) are of interest:

v8plusa  - 32-bit addressing, compatible with all UltraSPARC node types
v9a  - 64-bit addressing, compatible with all UltraSPARC node types
v8plusb  - 32-bit addressing, runs only on C5, C7, C7A, N2, DB1, and S3 node types
v9b  - 64-bit addressing, runs only on C5, C7, C7A, N2, DB1, and S3 node types
sse2a  - 32-bit addressing, runs only on C8, S4 and S4A node types
amd64a  - 64-bit addressing, runs only on C8, S4 and S4A node types

To achieve portability along with performance on UltraSPARC nodes, the following set of options is suggested:

-fast -xarch=v8plusa -xchip=generic -xcache=generic

For best performance on AMD Opteron nodes, the following options provide a good starting point for further experimentation:

-fast -xarch=sse2a -xchip=opteron -xcache=64/64/2:1024/64/16

By default, -fast generates code for the type of processor on which the compiler is running, which may or may not be compatible with the target execution processor. This is why it is important to explicitly specify the desired ISA with the -xarch option. Note that the order is important here: the -xarch, -xchip, and -xcache options must come after -fast.

Opteron processors support several different "memory models", depending on the intended use of the resulting code. For 64-bit addressing on SciClone's C8, S4, and S4A nodes, specify the "medium" memory model, for example:

-fast -xarch=amd64a -xchip=opteron -xcache=64/64/2:1024/64/16 -xmodel=medium

To optimize for a specific type of node, you can specify the characteristics of the CPU (-xchip) and cache (-xcache) in addition to the ISA. The following table summarizes the options by node type.

Node Type -xarch -xchip -xcache
Any UltraSPARC v8plusa
v9a
generic generic
C2, C4, C6, C6A, N1, S2 v8plusa
v9a
ultra2 16/32/1:4096/64/1
C3, M1 v8plusa
v9a
ultra2e 16/32/1:512/64/4
C5, N2, S3, DB1 v8plusb
v9b
ultra3cu 64/32/4:8192/512/2
C7, C7A v8plusb
v9b
ultra3i 64/32/4:1024/64/4
Any Opteron sse2a
amd64a
opteron 64/64/2:1024/64/16
C8, S4, S4A sse2a
amd64a
opteron 64/64/2:1024/64/16

So, for example, to optimize code for an Ultra 5 node, the following compiler options could be used:

-fast -xarch=v8plusa -xchip=ultra2i -xcache=16/32/1:2048/64/1

To target a Sun Fire 280R, use:

-fast -xarch=v8plusb -xchip=ultra3cu -xcache=64/32/4:8192/512/2

For an Opteron-based node, use:

-fast -xarch=sse2a -xchip=opteron -xcache=64/64/2:1024/64/16

To address more than 2 GB of memory in a single process, use v9a, v9b, or amd64a instead of v8plusa, v8plusb, or sse2a, repsectively:

-fast -xarch=v9a -xchip=generic -xcache=generic
-fast -xarch=v9a -xchip=ultra2 -xcache=16/32/1:4096/64/1
-fast -xarch=v9b -xchip=ultra3cu -xcache=64/32/4:8192/512/2
-fast -xarch=v9b -xchip=ultra3i -xcache=64/32/4:1024/64/4
-fast -xarch=amd64a -xchip=opteron -xcache=64/64/2:1024/64/16 -xmodel=medium

Note that 64-bit addressing is generally of interest only for the C4, C7, C7A, C8, DB1, S3, S4, and S4A nodes, which each have 4 GB or more of physical memory installed on them; all other node types have 2 GB or less.

Programs compiled with v9a, v9b, or amd64a will not work with libraries which have been compiled for 32-bit addressing. Many of the performance-critical software packages installed on SciClone are compiled for all six ISAs, and the desired version can automatically be selected via the XARCHULTRA and XARCHX86 environment variables in the user's ~/.cshrc file. Consult the documentation for individual software packages to see which versions are available.

Note that the optimizations invoked by -fast may be too aggressive for some codes, with the potential for unintended or incorrect results. If you suspect this is a problem, you could try a lower optimization level, e.g.:

-fast -xO3 -xarch=... -xchip=... -xcache=...

or you could leave off -fast entirely. If -fast is not used, it may be necessary to use "-xmemalign=8s" (C and C++) or "-dalign -xmemalign=8s" (Fortran) to assure that the application's data alignment matches that expected by the system libraries.

Code optimization is a complex topic, and the use of any given option may help one routine but hinder another, so some experimentation is in order. Consult Sun's compiler documentation for full details, including information about many options not mentioned here.


Parallel Programming Tools

Although SciClone's ability to run many serial jobs concurrently is useful for some applications, bringing the full power of the system to bear on a single computation requires the use of parallel programming techniques. A variety of tools are available to assist in the development of parallel programs.

Shared Memory Programming

Many of the nodes in SciClone include more than one CPU, so effective use implies that all of the CPUs should be kept busy. If an application will fit on a single node, then the use of shared memory programming techniques may be the simplest way to boost performance on multiprocessor (SMP) nodes. There are several approaches for exploiting parallelism in shared memory environments, including automatic parallelization, compiler directives, thread libraries, system-level interprocess communication (IPC) services, and message passing.

Sun's compilers support automatic parallelization of well-behaved loop constructs via the -xautopar (C) and -autopar (Fortran) options. These are described in detail in the compiler manuals. Sometimes the compiler's ability to detect and exploit parallelism can be enhanced with straightforward changes to the code which eliminate dependencies or simplify the control flow.

In other cases, the programmer needs to convey additional information to the compiler in the form of directives (Fortran) or pragmas (C/C++). Among other things, directives and pragmas can be used to give the compiler hints about loops that can or cannot be safely parallelized. Sun's C, C++, and Fortran compilers support the OpenMP 2.5 API, a mix of directives and library calls which support a fork-join model of parallel execution. For more information on using OpenMP, refer to Chapter 3 in the C User's Guide, Chapter 10 of the Fortran Programming Guide, Appendix D of the Fortran User's Guide, and the OpenMP API User's Guide.

At a coarser level of granularity, entire programs, or major sections of them, can be structured as independent threads of control, each of which can potentially run on a separate CPU. Solaris supports two different thread packages, known as Solaris threads and Posix threads (or pthreads). Sun's Multithreaded Programming Guide covers both of these packages in detail. Additional information on multithreading for C++ applications can be found in Chapter 11 of the C++ User's Guide.

Java includes threads as a fundamental part of the language. Although not well suited for high performance computing due to its resource requirements and the interpretive nature of the language, Java may nonetheless be of interest for certain applications. Java programmers should consult the Java 2 SDK documentation.

Threads provide a lightweight mechanism for exploiting parallelism within the context of a single UNIX process. Parallelism can also be obtained by running several distinct processes at the same time. Like all UNIX variants, Solaris provides a number of system services which facilitate communication between processes. These include pipes, message queues, semaphores, shared memory segments, signals, sockets, and memory-mapped files. These facilities may be used directly or as the basis for process-to-process communication in higher level libraries such as MPI. Interprocess communication (IPC) facilities may be the mechanism of choice for applications which bring together several programs with different functionality. An overview of IPC services in Solaris can be found in the Programming Interfaces Guide.

In most cases, the message-passing communication libraries described in the next section can also be used in shared memory environments, sometimes quite efficiently. The LAM/MPI library, in particular, exhibits very low overheads on shared memory nodes. While the message-passing paradigm often requires more programming effort than some of the simpler shared memory schemes, it is more portable, allowing a single application to run in shared memory, distributed memory, or mixed environments.

Distributed Memory Programming

Even the most powerful multiprocessor nodes on SciClone provide only a fraction of the aggregate system resources (about 1% of the total CPU power, 2% of the memory, and 1% of the disk capacity). To truly take advantage of the system, it is necessary to build distributed memory applications that can bring many nodes to bear on a single computation. Although lower-level system services such as sockets or remote procedure calls are sometimes used to build distributed applications, most scientific programmers working on SciClone will want to use MPI, the de facto standard for message passing on distributed-memory parallel architectures. SciClone currently supports three different MPI implementations, LAM, MPICH, and MPICH-GM. Additional communication packages are expected to be available in the future.

Mixed-Mode Programming

The shared memory and distributed memory approaches can be combined in applications which run on multiple SMP nodes. This can be useful if the computation exhibits parallelism at several different levels, for example loop-level parallelism within coarse-grained tasks. Shared memory constructs may offer performance advantages over message passing for local communication within SMP nodes, although in some cases the reverse may also be true. Whether the performance benefits of mixed-mode programming are worth the extra complexity seems to depend heavily on the characteristics of the application.


Running Programs

To provide conflict-free access to SciClone's computational resources, node allocation and job scheduling services are provided by OpenPBS, the freeware version of the Portable Batch System (PBS). To avoid interfering with PBS jobs, all access to both compute and server nodes, including interactive shell sessions, must be initiated through PBS. The only exceptions are:

  1. logins and associated interactive processes (compilations, file manipulation, job submittal, etc.) on the front-end servers ([monsoon.]sciclone.wm.edu and squall.sciclone.wm.edu),
  2. interactive access to maelstrom and mistral in order to run certain node-locked commercial software packages (although these can also be launched via PBS jobs), and
  3. cases in which you have made prior arrangements for dedicated access to a set of nodes for the purpose of running experiments which cannot be accommodated within the PBS framework (rare).

Direct rlogin/slogin/telnet access to individual compute nodes is disabled, and all rsh and rcp commands which reference the nodes should be submitted via PBS jobs. Stray processes on the nodes (i.e., those not belonging to an active PBS job) are subject to termination without warning.

In some cases interactive access to compute nodes is required. Examples include software packages with graphical user interfaces (e.g., MATLAB or various visualization systems), debugging, etc. PBS has a special interactive mode (described in more detail below) which provides this capability, including forwarding of X11 sessions to the user's workstation via SSH.

Server Nodes vs. Compute Nodes

As discussed in the Architecture Overview, nodes in the SciClone cluster fall into one of two categories, server nodes or compute nodes. While compute nodes are intended to provide dedicated computational resources for one or more jobs, server nodes provide services for the system as a whole. Thus PBS jobs which run on server nodes can adversely impact the performance of the whole cluster, and will themselves be impacted by other activities on the system. Thus most jobs should specifically request to run on compute nodes, as explained in the following sections.

Nevertheless, there may be circumstances in which jobs with special requirements will need to create processes on server nodes. Our PBS configuration currently allows this, and, in fact, will allocate server nodes to jobs if (1) server nodes are specifically requested by the job, or (2) the job does not specifically request compute nodes and no other resources are currently available to satisfy the request. This latter case is designed to improve turnaround for small jobs when the system is otherwise saturated. Users should feel free to take advantage of this when it is really needed (for example, deadlines for class projects or conference and journal submissions), but should not use it routinely.

Because server nodes also host SciClone's global filesystems (/sciclone/home*, /sciclone/scr*, /sciclone/qfs00), they are also the most efficient place to locate processes that perform large amounts of I/O against these files. In this case an appropriate PBS node specification can be used to place an I/O process on the server which physically hosts the filesystem of interest.

Although SciClone's server nodes (monsoon, maelstrom, squall, mistral, tempest, hurricane, zephyr) all have dual processors, PBS is allowed to use only one processor per server node. This leaves the other processor free to provide system-wide services such as compilation, I/O, NFS, DNS, job scheduling, etc. To avoid overloading the front end nodes, all application programs with non-trivial resource requirements (> 30 secs. CPU time or > 128 MB memory) must be submitted as PBS jobs. Processes which violate this rule may be killed without warning. (Typical code development and job preparation activities, including editing, compilation, make, file manipulation, etc., are specifically allowed to run on monsoon and squall as part of their normal interactive workloads.) PBS references server nodes via aliases (ms00, ml00, sq00, mt00, tp00, hu00) which map to either Gigabit or 10-Gigabit Ethernet interfaces on SciClone's internal jetstream network.

PBS Environment Variables

To use PBS, your search path, man path, library path, and default PBS server must be set correctly in your ~/.cshrc file:

    set path=($path /usr/local/pbs/bin)
    setenv MANPATH "${MANPATH}:/usr/local/pbs/man"
    setenv PBS_DEFAULT `/usr/local/bin/pbs_default`
    setenv LD_LIBRARY_PATH "/usr/local/lib"

If you are using the recommended environment configuration (available in /usr/local/etc/templates/cshrc on sciclone.wm.edu), all of these environment settings will be configured for you automatically.

Node Properties

To accommodate heterogeneous environments (such as SciClone), PBS allows an arbitrary set of node properties to be assigned to each node. These properties may be appended to node allocation requests to constrain the set of processors which may be used to run the job. Node properties for SciClone are listed in the following table:

 

Node
Name
Node
Type
Subcluster CPU
Configuration
Memory
Limit (MB)
Local Scratch
Disk (GB)
Networks Network
Interfaces
Switches Operating
System
Special
Software
hu00 s2
ultra60
server
ultra2
mhz450
dual
cache4
msb
ppn=1*
m512
scr35
net1
net4
net5
geth jsg02
sms02
els100
sol7
ms00 s3
f280r
server
ultra3cu
mhz900
dual
cache8
msb
ppn=1*
m4096
m4gb
scr33 net1
net4
geth jsc01
els100
sol9
ml00 db1
f280r
server
ultra3cu
mhz900
dual
cache8
msb
ppn=1*
m2048
m2gb
scr16 net1
net4
net7
geth
myri2
jsc01
els100
myr02
sol9 oracle
gcg
sq00 s5
x4200
server
opteron
mhz2200
dual
core2
cache2
lsb
ppn=1*
m4096
m4gb
scr6 net1
net4
net5
feth
geth10
jsg05
sms02
els100
sol10
tp00 s4
v20z
server
opteron
mhz2400
dual
cache1
lsb
ppn=1*
m4096
m4gb
scr6 net1
net4
net5
feth
geth10
jsg05
sms02
els100
sol10
mt00 s4a
v20z
server
opteron
mhz2400
dual
cache1
lsb
ppn=1*
m4096
m4gb
scr6 net1
net4
feth
geth10
jsg05
els100
sol10
zp00 m1
v120
server
ultra2e
mhz650
single
cache512k
msb
ppn=1
m512 scr10 net1
net5
feth jsc01
sms01
sol9
nws01
n1
ultra60
compute
ultra2
mhz360
dual

cache4
msb

ppn=2
m512
scr12

net1
net2
net3
net4

geth
myri
myri1
jsg02
myr01
els100
sol9
nws02
n2
f280r
compute
ultra3cu
mhz900
dual

cache8
msb

ppn=2
m2048
m2gb
scr31

net1
net4
net7

geth
myri2
jsc01
myr02
els100
sol9
hu01-hu04
c4
e420r
compute
hurricane
ultra2
mhz450
quad

cache4
msb

ppn=4
m4096
m4gb
scr6

net1
net7

geth
myri2
jsg04
myr02
sol9
vx01-vx02
c7
c7a
v440
compute
vortex
ultra3i
mhz1280
quad

cache1
msb

ppn=4
m16384
m16gb
scr202

net1
net7

geth
myri2
jsg04
myr02
sol9
vx03-vx04
c7
v440
compute
vortex
ultra3i
mhz1280
quad

cache1
msb

ppn=4
m8192
m8gb
scr202

net1
net7

geth
myri2
jsg04
myr02
sol9
gfs01-gfs02
c6
c6a
ultra60
compute
gulfstream ultra2
mhz360
dual
cache4
msb
ppn=2
m512 scr17 net1
net2
net4
geth
myri
myri1
jsg03
myr01
els100
sol9
gfs03-gfs06 c6
ultra60
compute
gulfstream ultra2
mhz360
dual
cache4
msb
ppn=2
m512 scr17 net1
net2
net4
geth
myri
myri1
jsg03
myr01
els100
sol9
tn01-tn32 c2
ultra60
compute
tornado ultra2
mhz360
dual
cache4
msb
ppn=2
m512 scr12 net1
net2
feth
myri
myri1
jsf03
myr01
sol9
tw01-tw32 c5
f280r
compute
twister ultra3cu
mhz900
dual
cache8
msb
ppn=2
m2048
m2gb
scr31 net1
net7
feth
myri2
jsc01
myr02
sol9
tp01-tp42 c8
v20z
compute
tempest opteron
mhz2400
dual
cache1
lsb
ppn=2
m4096
m4gb
scr56 net1
net8
geth
ib4x
jsg05
ib01
sol10
wh01-wh32 c3
v120
compute
whirlwind
whlow
ultra2e
mhz650
single
cache512k
msb
ppn=1
m1024
m1gb
scr26 net1 feth jsc01 sol9
wh33-wh64 c3
v120
compute
whirlwind
whhigh
ultra2e
mhz650
single
cache512k
msb
ppn=1
m1024
m1gb
scr26 net1 feth jsc01 sol9
*Although these nodes contains two processors, only one of them is available to PBS.

The next table defines all of the node properties listed in the table above:

Node Property
Description
compute
server
Primary function.
c2, c3, c4, c5, c6, c6a,
c7, c7a, c8, n1, n2, m1, s2, s3, s4, s4a, s5, db1
Node type, as described above and in the Hardware Component list.
e420r
f280r
ultra5
ultra60
v120
v440
v20z
x4200
Manufacturer's model name.
gulfstream
hurricane
tornado
twister
tempest
vortex
whirlwind
Indicates which subcluster the node belongs to.
whlow, whhigh Indicates the lower or upper half of the whirlwind subcluster. Nodes in each half share the same Fast Ethernet switch module, and therefore can communicate with each other slightly more efficiently than with nodes in the opposite half of the subcluster.
ultra2
ultra2e
ultra3cu
ultra3i
opteron
CPU architecture.
mhz360
mhz450
mhz650
mhz900
mhz1280
mhz2400
CPU clock speed in megaHertz.
cache512k
cache1
cache2
cache4
cache8
Size of secondary processor cache, in kilobytes ("k" suffix) or megabytes (no suffix).
lsb
msb
Architecture byte order, either least-significant-byte-first (lsb) or most-significant-byte-first (msb).
single
dual
quad
Number of processors per node.
core2 Number of cores per processor.
ppn=1
ppn=2
ppn=4
Number of PBS virtual processors assigned to the node.

m512
m1024, m1gb
m2048, m2gb
m4096, m4gb
m8192, m8gb
m16384, m16gb

Approximate per-process memory limit, in megabytes (no suffix) or gigabytes ("gb" suffix), rounded up to the nearest power of 2. The amount of memory actually available to an application is a bit less than this number.
scr3
scr6
scr10
scr12
scr16
scr17
scr26
scr31
scr33
scr35
scr56
scr202
Size of the largest local scratch partition, in gigabytes. This is the maximum space available if the partition were completely empty. Some nodes have multiple local scratch partitions of varying sizes.
net1
net2
net3
net4
net5
net6
net7
net8
Indicates which network(s) the node is attached to, as defined in the table above.
feth
geth
geth10
myri, myri1
myri2
ib4x
Type(s) of network interface(s). feth = Fast Ethernet; geth = Gigabit Ethernet; geth10 = 10 Gigabit Ethernet (10 Gb/s); myri and myri1 = Myrinet-1280 (1.28 Gb/s); myri2 = Myrinet-2000 (2.0 Gb/s); ib4x = 4x InfiniBand (10 Gb/s).

jsc01
jsf03
jsg01, jsg02, jsg03, jsg04, jsg05
myr01
myr02
ib01
els100
sms01, sms02

Name(s) of switch(es) that a node is connected to: jsc01 is a multiple-module core switch with both Fast Ethernet and Gigabit Ethernet ports (JetStream Core); jsf03 is an internal Fast Ethernet switch (JetStream Fast); jsg01-jsg05 are internal Gigabit Ethernet switches (JetStream Gigabit); myr01 is a 64-port Myrinet-1280 switch; myr02 is a 48-port Myrinet-2000 switch; ib01 is a 48-port 4x InfiniBand switch; els100 is a Fast Ethernet switch on the Savage House network; sms01-sms02 are 10/100 Ethernet switches (Switch Management Switch).
sol7
sol9
sol10
Operating system, either Solaris 7 (sol7), Solaris 9 (sol9), or Solaris 10 (sol10).
oracle
gcg
Special software available only on designated nodes. On SciClone, most software packages are available for use everywhere on the system. A few packages may be installed for use only on certain nodes, due to resource requirements, compatiblity considerations, or licensing restrictions. Examples include the GCG Wisconsin bioinformatics package (gcg) and the Oracle database system (oracle).

Examples: Using Node Properties to Request Specific Resources

In PBS, a job is simply a shell script which is submitted to the scheduler via the qsub command (run "man qsub" for full details.) As the following examples illustrate, qsub's "-l nodes=" option is used to specify how many and what type of nodes are required.

To request a single compute node (of any type):

qsub -l nodes=1:compute

This is also the default resource request, so it is equivalent to:

qsub

To request a specific node, use the node name as a node property:

qsub -l nodes=1:wh64

To request several specific nodes:

qsub -l nodes=nws02+wh01+tn01+tw01+gfs01+hu01

To request 32 c3 compute nodes, you could use any of the following:

qsub -l nodes=32:c3
qsub -l nodes=32:whirlwind
qsub -l nodes=32:v120
qsub -l nodes=32:mhz650
qsub -l nodes=32:c3:compute:whirlwind:v120:mhz650:ultra2e:m1024

Because PBS allocates CPUs, rather than hosts, it is possible for more than one job to be assigned to a multiprocessor node.  Unless told otherwise, PBS allocates only a single CPU from a multiprocessor node, leaving the other CPUs available for other jobs. So the following example only assigns one CPU to the job, even though it specifically requests a node with two CPUs:

qsub -l nodes=1:dual:compute

To allocate more than one CPU, append the "ppn=" modifier to the list of node properties:

qsub -l nodes=1:dual:compute:ppn=2
qsub -l nodes=4:single+4:dual:compute:ppn=2+4:quad:ppn=4

The first specification will provide exclusive access to the node; the second will cause PBS to allocate a total of 28 processors on 12 nodes, all with exclusive access. In contrast, the following example would leave two CPUs available for other jobs on a 4-CPU node:

qsub -l nodes=1:quad:ppn=2

It is also possible for jobs to specifically allow their resources to be shared with other jobs by appending a "#shared" modifier to the end of the node properties:

qsub -l 'nodes=1:quad:ppn=4#shared'

This would assign all four processors of a quad-CPU node to the job, but would also allow other jobs with a #shared attribute to use the same node in time-shared fashion. Note the use of quotes to prevent "#shared" from being interpreted as a shell comment.

If you don't explicitly specify "compute" and no other node properties restrict the set of possible nodes, then the job could potentially be scheduled to run anywhere, including server nodes. For example, the following request might map onto any of the nodes in the tornado or gulfstream subclusters, as well as nws01 or the server node hu00.

qsub -l nodes=8:ultra60

When you absolutely positively need turnaround as fast as possible on a busy system, use as few qualifiers as possible (subject to the guidelines stated above for running jobs on server nodes). The following requests maximize the chance of finding an available resource since they can potentially be scheduled anywhere:

qsub -l nodes=1
qsub -l 'nodes=1#shared'

When a PBS job executes, the PBS_NODEFILE environment variable is set to the name of a file which contains the list of nodes which have been allocated to the job. This file can be examined by your PBS job script to determine specifically which nodes a job is using.

If PBS is unable to satisfy a request due to conflicting node properties or insufficient nodes with the specified properties, qsub will return with a message such as "Job rejected by all possible destinations" or "Job exceeds queue resource limits".

Scheduling Strategies

In developing a scheduling policy for SciClone, we are trying to satisfy two conflicting goals: (1) rapid turnaround for code development, testing, and experimentation, and (2) maximum throughput for large, computationally intensive scientific applications which may need several days to complete. While there is fundamental tension between these two requirements, there are several things that can be done to alleviate the conflict. First, jobs are segregated into queues based on the number of nodes they require and the expected duration of the job. Different queues are assigned different priorities, and the scheduler attempts to run higher priority jobs first. Each queue has limits on the total number of jobs that can be run simultaneously, as well as the number of jobs that an individual user may run. This helps to reduce the chance of the system being monopolized by a single user or by a particular type of job to the exclusion of other work.

These strategies are augmented with priority aging, which means that the longer a job sits in the queue, the higher its priority becomes. If a job remains queued for a long period of time (currently 48 hours), it is deemed to be "starving", and the scheduler will purposely defer other jobs in order to free up enough resources to run a starving job. This can lead to lead to the idling of large numbers of nodes for prolonged periods of time, but we mitigate this problem somewhat by "backfilling" shorter and smaller jobs into "holes" in the schedule.

SciClone's heterogeneous architecture, with its multiple subclusters, multiple networks, and differing types of nodes, poses special problems for job schedulers. In particular, PBS' starving jobs feature assumes that all nodes are identical, which means that it sometimes holds idle nodes open even when a starving job needs a different type of node. When this situation arises, the system operators may intervene to improve utilization and turnaround. We are aware of no exsting batch schedulers which incorporate all of the capabilities we need, so development of a customized scheduler for SciClone remains as a long-term goal.

Queues

To implement the above strategy within the confines of PBS's standard scheduler, a fairly elaborate queue structure has been established. Fortunately, users do not need to specify a particular queue when a job is submitted—in fact, PBS is configured to prevent this. Instead, qsub automatically submits jobs to a routing queue which examines their resource requirements (walltime and number of nodes) and places them in the appropriate execution queue. The following table lists the execution queues along with their node and time limits, their relative priorities, and bounds on the number of executing jobs:

Queue Name
No. of Nodes
Walltime
Priority
Queue Max User Max
q1q
1
0-15m
28
291 128
q1s
1
15m-1h
27
256 128
q1m
1
1-4h
26
128 96
q1l
1
4-24h
25
128 64
q1x 1 24-96h 5 64 32
q20q
2-20
0-15m
58
128 64
q20s
2-20
15m-1h
57
128 64
q20m
2-20
1-4h
56
96 48
q20l
2-20
4-24h
35
32 16
q20x 2-20 24-96h 6 10 5
q40q 21-40 0-15m 68 8 4
q40s
21-40
15m-1h
67
8 4
q40m
21-40
1-4h
66
8 4
q40l 21-40 4-24h 45 5 3
q40x 21-40 24-96h 7 4 2
q70q 41-70 0-15m 78 4 2
q70s
41-70
15m-1h
77
4 2
q70m
41-70
1-4h
76
4 2
q70l 41-70 4-24h 55 4 2
q70x 41-70 24-96h 8 2 1
q140q
71-140
0-15m
88
2 2
q140s
71-140
15m-1h
87
2 2
q140m
71-140
1-4h
86
2 2
q140l
71-140
4-24h
65
1 1
q200q
141-200
0-15m
98
1 1
q200s
141-200
15m-1h
97
1 1
q200m 141-200 1-4h 96 1 1
q207q 201-207 0-15m 99 1 1
q207s 201-207 15m-1h 9 1 1

In addition to the queues listed above, there are three others, called sys, special, and defer. These are intended for system maintenance tasks, special jobs that can't be accommodated within the normal queue limits, and very low priority jobs, respectively. Users with requirements that can't be accommodated within the normal queue structure should contact the system manager.

Queue parameters are subject to change as the workload on the system varies. In addition, the long-running "l" and "x" queues may be stopped on occasion to allow a backlog of shorter jobs to complete.

Note: At the present time, the tempest subcluster runs a separate copy of the job scheduler. It is configured similarly, but the number of queues, their names, and their limits differ. The appropriate scheduler is selected based on the type of node (Opteron or UltraSPARC) on which a PBS command is issued. We plan to unite the entire system under a single job scheduler in the near future.

Submitting Jobs

All jobs are submitted to PBS via the qsub command. The basic syntax is:

    qsub [options] [script]

script contains the executable commands which comprise the job. It may also contain PBS directives, which are special shell comments that can be used to set job options (see the qsub man page for details). According to the documentation, PBS is supposed to execute script by invoking the user's login shell, which on SciClone is either csh or tcsh. However, this feature does not appear to work correctly (except for interactive jobs invoked with "qsub -I"). For reliable operation, it is therefore necessary to specify the desired shell by using one of the following as the first line of the script:

#!/bin/sh
#!/bin/csh
#!/bin/tcsh

Some users have reported problems with sh, so csh and tcsh are recommended.

If script is omitted, qsub will read commands and directives from standard input.

qsub provides numerous options. For our purposes the most important ones are:

-l nodes=   Specify number of nodes and their properties.
-l walltime=   Specify a maximum wallclock time limit for the job.
-I   Run a job interactively.

The "-l nodes=" option was described above in the section on Node Properties. If this option is omitted, the default node specification is "1:compute".

The "-l walltime=" option indicates the maximum elapsed time for a job. This value is used along with the "-l nodes=" option to determine which execution queue the job will be placed in. If a walltime limit is not specified, the job will use a default value of 5 minutes. If a job exceeds its walltime limit, PBS will kill it. (See the section on Terminating Jobs for important information about the termination process.)

To facilitate accurate job scheduling, the requested walltime should be a reasonably good estimate of the actual time which will be needed, with a modest padding factor to avoid accidental termination. Time estimates which are larger than necessary can cause jobs to be placed in inappropriate queues, and will allow runaway jobs to consume excessive resources, delaying other jobs. Walltime may be specified in several formats. The following are equivalent:

-l walltime=4800   Seconds
-l walltime=80:00   Minutes and seconds
-l walltime=1:20:00   Hours, minutes, and seconds

The "-I" option is described below under the topic of Interactive Jobs.

PBS provides only resource allocation and scheduling services; it is not aware of system-specific procedures for initiating parallel programs. Instead, it relies on vendor- or site-dependent scripts or programs to interface between PBS and parallel applications. The following sections describe how to initiate various types of programs from within a PBS job.

Serial and Shared Memory Programs

Most programs that run on a single node can be invoked from a PBS script in exactly the same way as in any other shell, by simply executing the command. For example, the following job would run myprog from within the directory mydir on any available compute node:

qsub -l nodes=1:compute
#!/bin/sh
cd mydir
./myprog -arg <infile >outfile
^D

Jobs which may be run this way include conventional serial programs, shell scripts and shell commands, and SMP-style shared memory programs (multi-threaded, auto-parallel, OpenMP, etc.). The exception is single-node MPI programs, which should be started using the appropriate PBS interface script, as described in subsequent sections.

Parallel Shell Commands

PBS provides a command, called pbsdsh, which is supposed to run shell commands on multiple nodes simultaneously. However, its functionality is limited, and it sometimes fails in a very anti-social way on SciClone. It has been disabled in favor of a script called pbsdoall, which can be found in /usr/local/bin.

MPI Programs

Each of the different MPI packages provided on SciClone has a slightly different method for specifying the set of nodes which should be used for a job, and for mapping processes onto those nodes. This process is further complicated by SciClone's heterogeneous architecture, and the ability to mix different types of nodes within a single PBS job. To make matters worse, conflicts can sometimes arise when multiple MPI jobs try to share a node, and some MPI implementations require special care to ensure that leftover processes get cleaned up in the event a job terminates abnormally or exceeds its time limit. For these reasons, relatively sophisticated interface scripts are needed to launch MPI jobs from within a PBS environment.

The procedures for running MPI jobs on SciClone are described in the following sections.

LAM/MPI

In ordinary use, several steps are required to run a LAM job. First the set of nodes which LAM is going to use must be booted (lamboot). Then the parallel application program is executed (mpirun). A final step shuts down LAM and cleans up leftover processes on the nodes (lamhalt).

On SciClone, these steps have been consolidated into a single script called pbslam. pbslam makes most LAM runtime options available to the user, with defaults which should be appropriate for most applications. It also performs an optional check for extraneous processor activity before executing the application and tries very hard to clean up stray processes in the event that a job terminates abnormally. pbslam features two different strategies for mapping processes onto processors, which can be used in conjunction with PBS node properties to afford considerable flexibility.

LAM commands such as lamboot, mpirun, lamclean, or lamhalt should not be used directly from within a PBS script.

To work its magic, pbslam must be exec'ed from the top-level shell within a PBS job, replacing the shell which invokes it (don't ask). This implies that pbslam must be the last command executed from within a PBS script, since it will never return. Furthermore, it cannot be executed at a lower level, for example, from another script which is called by the PBS script. pbslam checks for its proper place within the PBS process hierarchy, and will complain if it is invoked incorrectly. For a full description, see the pbslam man page.

The following example runs an 8-process LAM job on two quad-cpu nodes, aborting the job if extraneous CPU activity is detected on either node:

qsub -l nodes=2:quad:ppn=4 -l walltime=600
#!/bin/csh
cd ~/scr/results
exec pbslam -v -X 0.02 ~/bin/myprog -arg1 -arg2 <infile >outfile
^D

MPICH

Although MPICH's procedure for launching jobs is simpler than LAM's, a PBS interface script (pbsmpich) is still required to map processes onto processors and to perform other checks. Unlike pbslam, however, pbsmpich does not have to be exec'ed, which means that multiple MPICH programs can be run from a single PBS job script. This makes it easier to orchestrate computations which are composed of several different programs, particularly if they need to pass data to each other via intermediate files which are stored on the local scratch partitions of individual nodes. By combining several programs into a single script, you can ensure that successive steps all use the same set of processors, in the same order, without having to resort to elaborate and restrictive specifications of PBS node properties.

The following simple example shows a three-step MPICH job which passes results from one step to the next via scratch files located on logical node 0's local disk. Initial input data and final results are stored on the front end's scratch partition. A more sophisticated application might open files directly on each node's local scratch disk, avoiding the communication and serialization needed to retrieve and redistribute results from the job's head node.

qsub -l nodes=4:c7 -l walltime=7200
#!/bin/tcsh
cd ~/myappdir
pbsmpich ./bin/step1 args... <~/scr/data >~/lscr/tmp1
pbsmpich ./bin/step2 args... <~/lscr/tmp1 >~/lscr/tmp2
pbsmpich ./bin/step3 args... <~/lscr/tmp2 >~/scr/results
rm /lscr/tmp1 /lscr/tmp2
^D

The MPICH installation includes a companion suite of extensions and utilities called MPE, which may be useful in debugging, visualizing, or analyzing the performance of MPI programs. Although the MPE libraries and utilities are installed within the MPICH directory hierarchy (/usr/local/v8plusa/generic/mpich), they can also be used with other MPI packages such as LAM. MPICH and MPE libraries are currently available only for 32-bit (v8plusa) applications due to limitations in the build procedure.

Although MPICH is in some respects easier to use and more flexible than LAM, LAM has historically been more robust and usually outperforms MPICH, sometimes by a wide margin. A few users have reported that various MPI constructs are faster or more reliable in one package or the other, so it's probably worth trying both LAM and MPICH to see which one works best for a given application.

MPICH-GM

MPICH-GM is a version of MPICH which has been implemented on top of Myrinet's low-level GM communication library. MPICH-GM programs can be run only on nodes which have Myrinet interfaces (i.e., those with a "myri" or "myri2" node property). SciClone includes two distinct Myrinet networks based on different generations of the interconnection technology, but they are software compatible, meaning that the same MPICH-GM executable can be run on either network. See the section on Networking for details on which nodes and subclusters are connected to each Myrinet. MPICH-GM over Myrinet provides high bandwidth and low latency and is a good choice for communication-intensive applications, offering performance that can be dramatically better than the alternative TCP-based MPI packages.

The procedure for running MPICH-GM programs under PBS is similar to that for regular MPICH, but uses a different interface script (pbsmpichgm) with somewhat different options. The most important difference is the way in which MPICH-GM terminates programs. The first process to exit (either normally or abnormally) triggers a timer in the MPICH-GM runtime system which kills all of the remaining processes when it expires. To ensure that stray processes get cleaned up before PBS exits, this interval should be fairly short, no more than about 45 seconds. The default delay is 15 seconds, but it can be modified with the -k option of pbsmpichgm. Because of this brute-force hack for terminating programs, it may be advisable to insert an extra barrier synchronization in the code (MPI_Barrier) just before the call to MPI_Finalize to ensure that all processes exit together. Otherwise, slower processes could be killed prematurely. In a C program, this might look like:

MPI_Barrier(MPI_COMM_WORLD);
MPI_Finalize();
exit(0);

The following example shows a simple PBS job which uses pbsmpichgm to launch an MPICH-GM program on the tornado subcluster with round-robin (node-order) mapping of processes onto processors. (The "myri" node property is redundant since all c2 nodes are connected to Myrinet, but it's a good idea to make this requirement explicit anyway.)

qsub -l nodes=16:c2:myri:ppn=2
#!/usr/local/bin/tcsh
cd ~/myappdir
pbsmpichgm -v -n -k 30 ./bin/myprog -arg1 -arg2
^D

MPICH-GM vs. LAM

In the past, MPICH-GM was the only option for MPI programs which wanted to take advantage of SciClone's Myrinet networks, but LAM 7 added native GM support for Myrinet. MPICH-GM offers slightly better Myrinet performance than LAM, but applications which are built with MPICH-GM can run only on Myrinet-enabled nodes. In contrast, applications built with LAM can run on any node in SciClone and the pbslam script will automatically choose the best available network based on the set of nodes allocated to the job.

Interactive Jobs

If the "-I" option of qsub is used, the job will be interactive, meaning that the shell process will be connected to the terminal session from which qsub was invoked. Users can then run anything they want interactively, including MPI programs (via pbslam, pbsmpich, or pbsmpichgm) or parallel shell commands (via pbsdoall). If the job is running on a single node, then the interactive session can be used to run UNIX commands directly, just as in any other shell environment.

As an example, to get a login shell on wh33 for half an hour, use:

    qsub -I -l nodes=wh33 -l walltime=30:00

The qlogin command provides a simple alternative to "qsub -I". With qlogin, the previous example becomes:

    qlogin -t 30 wh33

Warning: "qsub -I" or qlogin should be used instead of rlogin, slogin, telnet, rsh, ssh, rcp, or scp for interactive access to any node other than monsoon and maelstrom. Interactive sessions which violate this rule may be killed without warning by SciClone's automated housekeeping procedures.

Connecting to Remote X Servers

In the Unix/Linux world, the X Window System (a.k.a. "X11" or simply "X") provides the underlying mechanism for constructing graphical user interfaces (GUIs) and displaying visual information on screen. One of the most powerful features of X11 is the ability to run application programs (X clients) on one system and have them display on another system (X server) by routing the graphical protocol stream across the network. On SciClone, this capability allows us to submit interactive X11-based jobs (such as MATLAB and OpenDX) to the PBS job scheduler and route the display back to the user's desktop workstation. This can be accomplished either with X11 Forwarding through Secure Shell (SSH), or via the Xauthority mechanism which is built into X11.

X11 Forwarding is the simplest and most secure mechanism. With this technique, the user simply logs into the front end (sciclone.wm.edu) via SSH, and then uses the qlogin command to initiate a job. qlogin automatically configures the DISPLAY environment variable so that X11 sessions will be forwarded through the front end back to the user's desktop (assuming that the user is running an X server on an authorized host and that his/her local SSH configuration allows X11 Forwarding). This makes qlogin the preferred way to launch simple GUI-based applications on compute nodes.

For applications which have complex GUIs, require frequent screen updates, and/or produce a high volume of image data, the SSH forwarding scheme is inefficient and may perform poorly. Overheads are incurred in encrypting the graphical data stream and routing it back through the front end node, where it may compete with several other users for a slice of a Fast Ethernet link to the outside world. A better approach for these types of applications is to establish a direct connection between a compute node and the user's workstation. To allow this, the user must first establish an X11 authorization key on his/her workstation and then propagate that to SciClone. The key is essentially a "permission slip" which gives X clients on SciClone the ability to connect to the user's local display. Some X11 installations create the appropriate keys automatically when a user logs in on his/her workstation, while others do not. In either case, the key must be copied over to SciClone.

To facilitate this process, we provide a script which can be downloaded and run on the user's workstation to automatically generate a key (if necessary) and merge it into the user's ~/.Xauthority file on SciClone. Once the key is in place, X11-based PBS jobs can be submitted to SciClone in the usual way (either interactive or batch mode), setting the DISPLAY environment variable to point back to the local workstation as shown in the following example:

    qsub -l nodes=1:geth:quad:ppn=4 -l walltime=1200
    #!/bin/csh
    setenv DISPLAY myworkstation.domain:0
    cd mydir
    /usr/local/bin/dx -processors 4 ...
    ^D

X sessions established with this mechanism will bypass the front end and are normally routed directly to the campus backbone via SciClone's "back end" Gigabit Ethernet link. The primary disadvantage of this approach is that the connection is unencrypted and therefore inherently insecure: X11 authorization keys are transmitted "in the clear" across the network, as is the data stream between the X client and the X server. This means that anyone with the appropriate software tools and access to a host on any of the network segments along the route could potentially steal authorization keys and/or intercept the contents of X sessions. For this reason, we recommend using X authorization keys only for applications which need to maximize graphical performance. It is imperative that passwords, passphrases, or other sensitive information never be entered into X applications which are initiated using this technique.

Monitoring Job Status

PBS provides several commands for monitoring the status of jobs and queues. One of these, a graphical job monitor called xpbsmon, is no longer recommended for use on SciClone. xpbsmon suffers from a number of problems, including sluggish performance, excessive memory and CPU requirements, heavy network traffic, poor scalability, a limited color palette, and occasional crashes. As an alternative, we have developed a Java-based tool called Jobprobe which rectifies most, if not all, of these deficiencies. We recommend the routine use of Jobprobe by anyone who is actively running PBS jobs on SciClone. For more information about downloading and running Jobprobe, see the subsequent section on Performance Monitoring Tools.

The pbsnodes command can be used to check node status, as follows:

To list the status and node properties of all nodes:

    pbsnodes -a
To see which nodes are down or offline:
    pbsnodes -l

Information about jobs and queues is provided by the qstat command. The following list includes several useful examples. Consult the qstat man page for more details.

To list all of the queues and their run limits: qstat -q
To check the status of all the queues: qstat -Q
To list all jobs and their status: qstat -a
To list the status for all of your jobs: qstat -a -u $USER
To list all jobs with a brief explanation of their status:  qstat -a -s
To list all jobs that are running: qstat -r
To list all jobs that aren't running:

qstat -i

For detailed information about all jobs: qstat -f
For detailed information about job number 179427: qstat -f 179427
To get a list of nodes allocated to job number 179427:  qstat -n 179427

Terminating Jobs

Jobs may be deleted from the queues (and killed if they are running) by using the qdel command:
    qdel job_id
Important note 1: Job termination is a multi-step process which may take several minutes to complete. If more than one qdel request is issued for a running job, the node cleanup procedure may be aborted, leaving orphaned processes active on the nodes. As a rule of thumb, wait at least five minutes after a qdel before concluding that the termination request has failed.

Important note 2: When a job exceeds its walltime limit, it may take a few minutes for PBS to notice, and a few more minutes for the termination and cleanup process to complete. During this period, do not attempt to kill the job with a qdel command. Wait until a job is at least ten minutes over its walltime limit before concluding that something is amiss.

These same cautions apply to the qsig command, which should be used only in extraordinary circumstances.

Input/Output

For interactive jobs (qsub -I), PBS connects stdin, stdout, and stderr to the terminal session which issued the qsub command.

For batch jobs, PBS collects the contents of stdout and stderr in a spool directory which resides on the first node assigned to the job. When the job terminates, PBS copies these spool files back to the directory from which the original qsub command was issued, using a filename of the form "jobname.[eo]NNNN", where jobname is the name assigned by the -N option of qsub, or, if -N was not specified, either the name of the job script or "STDIN" if the script is read from qsub's standard input). "e" or "o" indicates stderr or stdout, and NNNN is the job number assigned by PBS. The names and locations of these output files can be modified using the -e, -o, and -j options of qsub.

stdin for a batch job is attached to the PBS script file itself. Any command within the script which reads from stdin should redirect its input to avoid buffering conflicts with the shell which is running the script.

If you want to monitor stdout or stderr of your program while it is running (rather than waiting for PBS to deliver the results), you should redirect the output to a globally accessible filesystem (e.g., home00, home10, scr00, scr01, scr02, or scr10). Once the job begins execution, you can monitor its progress using the tail command or something similar. For example,

cd ~/mydir
qsub -N myjob -l nodes=8:whirlwind -l walltime=3600 -j oe
#!/bin/tcsh
exec pbslam -v -W ~/mydir myprog -args < ~/mydir/infile >& ~/scr/outfile
^D

Wait for the job to begin, then

cd ~/scr
tail -f outfile

Note that the output may not appear immediately, but rather in spurts as buffers fill up and get flushed. Any other output generated by PBS or by the script itself will be directed to stdout and delivered to ~/mydir/myjob.oNNNNN, where NNNNN is the job number.

Also note that this strategy does not work well with QFS filesystems, apparently due to the way that QFS manages read and write access to shared files. Continuous monitoring of the contents of a QFS file with "tail -f" or some similar utility can interfere with write access by an active job, resulting in very poor I/O performance.

Important note: The capacity of the PBS spool directory on each node is limited. If your PBS job will generate more than a few hundred megabytes of output on stdout or stderr, you should explicitly redirect it to a filesystem which has enough free space to hold the data.

PBS Documentation

PBS has many more features beyond those described here, but unfortunately, the developers have never written a user-oriented manual. The best source of information is the PBS man pages, available in /usr/local/pbs/man on sciclone.wm.edu. For a link to the Administrator's Guide and other information, see the OpenPBS page on the SciClone web site.


Performance Monitoring Tools

To obtain a better understanding of the dynamic behavior of the entire SciClone cluster, we have developed a suite of Java-based graphical monitoring tools. Employed together, these tools can provide considerable insight into the behavior of applications running on SciClone, as well as the overall performance of the system. The suite of tools includes:

Jobprobe [screen shot]
The Jobprobe is a much-improved replacement for PBS's xpbsmon monitoring tool. Jobprobe provides a dynamic list of all the executing jobs in the system (similar to "qsub -r") along with a graphical view of which nodes they are running on. Jobprobe also shows which nodes are shared, offline, down, or otherwise not available to PBS. Every job is assigned a unique color or glyph which is displayed on the node icon(s) associated with that job. Moving the mouse over a node icon will highlight all of the nodes belonging to the same job(s), highlight the job listing(s) in the text pane, and pop-up a window with status information. Clicking on a job listing in the text pane will highlight the corresponding node icons in the graphical view. This is a very useful tool for determining at a glance which nodes are busy, which nodes are free, and which nodes have been assigned to a particular job. We encourage users to run Jobprobe on a routine basis whenever they are active on the system.
 
Nodeprobe [screen shot]
The Nodeprobe displays a variety of important statistics about each node in the system, including CPU and memory utilization, load average, paging, interrupts, disk I/O, local scratch space, and temperature (on certain types of nodes). Several of these statistics can be quite helpful in diagnosing performance problems with applications. The type of information to be displayed is selected by clicking on the desired tab; display options are selected from menus which are specific to each tab. A legend to the right of the node icons defines a color-coding which is appropriate to the type of data being displayed. Moving the mouse over a node icon will pop-up a window with precise information. Clicking on the life-ring icon will bring up a description of the statistics which are available via the selected tab.
 
Netprobe [screen shot]
The Netprobe displays network traffic and network errors (current and cumulative) for all of SciClone's major networks, including the internal (jetstream) and external (Savage House) ethernet networks, and both Myrinets. There are two different views, a node-oriented display which is most useful for monitoring application traffic, and a switch-oriented view which is helpful in monitoring switch-to-switch traffic. A row of menus below the node icons is used to select various display options. Placing the mouse over a node icon or switch port will pop-up detailed information about traffic and errors. This tool can be very valuable in assessing the communication loads and communication patterns of your applications, as well as I/O traffic to/from the servers. 

Each of these tools is implemented as a rather sophisticated distributed system comprised of:

  • a set of low- or no-intrusion system probes which run within the SciClone cluster,
  • a server process which runs on the Computational Science Cluster's web server (www.compsci.wm.edu) and collects data from the probes, and
  • a Java-based graphical display process which runs on the user's local workstation.

The displays are updated in near-real-time (approximately every 15 seconds), which provides a good balance between low intrusion in the system and sufficient temporal resolution to see how applications are behaving.

To use these tools, you first need to have a recent version of the Java Runtime Environment (JRE) installed on your desktop PC or workstation (minimum acceptable version is 1.4.2_02, but 1.5.0_02 or later is recommended). Note that JRE is preinstalled on many systems so you may not need to download and install it yourself. Once JRE is available on your desktop, you then need to login to monsoon.sciclone.wm.edu and copy the following files to your workstation:

cd /usr/local/sciclonometer/clients
scp *.jar myhost:mydir
where "myhost" is the name of your workstation and "mydir" is whatever local directory you want to store the jar files under.

Once the files are copied over, you can execute them as follows on Linux or Unix (on a PC or Mac, start them as you would any other application):

cd mydir
java -jar JobprobeDisplay.jar &
java -jar NetprobeDisplay.jar &
java -jar Nodeprobe.jar &
The display programs have been tested under Mac OS 10.3, RedHat Enterprise Linux 3, and Solaris 9, but theoretically should work on any platform with a current version of Java.

A final caveat: although these tools have been in use for some time by SciClone's staff and user community, we still consider them to be somewhat experimental, so if you run into problems, please send a detailed trouble report to sciclone@wm.edu.