SciClone Cluster Project Computational Science Cluster

SciClone User's Guide

Version 2.0

Revised: 7/20/07

This document provides background information and instructions for using the SciClone Cluster at the College of William and Mary. It is required reading for new users. Refer to other sections of the SciClone web site for more detailed information on specific topics such as software packages or hardware configuration.


How to Get Help

Questions, problems, or trouble reports: Send email to sciclone@wm.edu.

Urgent problems: Call Tom Crockett (757-221-2762) or Dale Castle (757-221-1701).

Emergencies: After hours phone numbers are posted on the computer room door in Savage House, or call Campus Police at 757-221-4596.

Application Support: The Computational Science Cluster employs an Applications Analyst who is prepared to assist users with a broad range of tasks, from basic procedures such as establishing SSH access to the system, compiling codes, and setting up Makefiles and job scripts, to more sophisticated activities such as installing software packages, porting and tuning applications, advising on algorithms and methodology, parallelizing existing codes, and visualizing results. To request assistance, send email to sciclone@wm.edu or call Chris Bording at 757-221-3488.

Because SciClone is regarded as a research (rather than production) facility, support is available only during normal office hours.

When reporting problems, please provide as much relevant information as possible. This should include the following, as appropriate:

  • date and time when the problem occurred
  • node(s) or server(s) involved
  • text of the command(s) which you issued
  • exact and complete text of any error messages which were generated
  • source code and/or makefiles which demonstrate the problem
  • any other information which may help in identifying or resolving the problem


Obtaining Accounts

SciClone is operated by the Computational Science Cluster as a College-wide resource. William and Mary faculty, staff, and students with computation- or data-intensive applications from any discipline are welcome to apply for accounts on the system. The system is also available to those interested in developing tools or infrastructure to support more effective use of cluster computing systems. Requests for access from outside the William and Mary community (including W&M collaborators) are evaluated on a case-by-case basis. To apply for an account, follow the instructions in the SciClone Account Request Form.


Account Renewal, Expiration, and Deletion

Most accounts on SciClone have an expiration date, which is specified on the user's Account Request Form. An email notice is sent to the user approximately two weeks before the account is set to expire. Accounts can be renewed by submitting a new Account Request Form. An account that is not renewed is disabled immediately after its expiration date. All files belonging to expired accounts are subject to deletion after a short grace period (30-day minimum). It is the user's responsibility to preserve any necessary files by moving or copying them elsewhere before the account expires. An expired account may be reactivated by submitting a new Account Request Form, but files previously associated with the account may not be available after the grace period.


Accessing the System

All access to SciClone from external systems is via Secure Shell using the SSH 2 protocol. SciClone includes subclusters based on two different and incompatible processor architectures, Sun UltraSPARC and AMD Opteron. Each of these architectures has a different login server which should be used to edit and compile programs, manipulate files, or submit batch jobs. The primary login server for the UltraSPARC nodes is monsoon.sciclone.wm.edu, a.k.a. sciclone.wm.edu. The primary login server for the Opteron nodes is squall.sciclone.wm.edu. These nodes are also referred to as "compile servers", "file servers", "front ends", "host nodes", or simply monsoon and squall. All of these terms are used interchangeably.

Two additional login servers are available for users involved in specific projects requiring access to certain node-locked software packages or the Oracle database server. These are maelstrom.sciclone.wm.edu (UltraSPARC) and mistral.sciclone.wm.edu (Opteron).

Unless you have made special arrangements for dedicated time, all access to nodes other than monsoon, maelstrom, squall, and mistral should be through the PBS job scheduler. This topic is discussed in more detail in the section on Running Programs.


Changing Your Password

You will be issued a default password by telephone when your SciClone account is created. To change it, you must log in to monsoon.sciclone.wm.edu and run the passwd command. Changes will automatically propagate to all of the login servers within a minute or so. If you change your password on any node other than monsoon, the changes will not be permanent, and your password will eventually revert to its previous setting.


Architecture Overview

A key feature of SciClone is its heterogeneous architecture, which provides both flexibility for applications and a controlled environment for studying the complex issues which arise in larger distributed systems. Specifically, SciClone's heterogeneity arises from its use of multiple processor configurations (single-, dual-, and quad-cpu nodes, each at different clock rates), multiple networking technologies (Fast Ethernet, Gigabit Ethernet, and Myrinet), and its organization as a "cluster of clusters".

Node Types and Subclusters

SciClone features eighteen different node configurations, organized into seven distinct subclusters which can be used individually or in combination. The node types and subclusters are summarized in the tables below. See the Hardware Component List for detailed specifications. Nodes can be further classified as either "server" nodes or "compute" nodes, depending on their intended uses. As the name implies, server nodes provide specialized functions for the cluster as a whole, while compute nodes are dedicated to running users' jobs. Application programs should not run on server nodes unless they have a specific need to do so.

SciClone Node Types

| Node Type | Qty. | # of CPUs (x cores) | Clock Speed | Memory | /local/scr | Comm. | Subcluster | Suggested Uses |
|---|---|---|---|---|---|---|---|---|
| S3 | 1 | 2 | 900 MHz | 6 GB | 33 GB | Gigabit Ethernet | | Primary server. Provides login, compilation, and job scheduling services for all of the UltraSPARC nodes in the cluster. Provides system-wide accounting, DNS, and global filesystem services. Should not be used for computation except in specialized circumstances. |
| S5 | 1 | 2x2 | 2.2 GHz | 4 GB | 6 GB | 10 Gb Ethernet | | Primary server. Provides login, compilation, and job scheduling services for all of the Opteron nodes in the cluster. Provides global filesystem services. Should not be used for computation except in specialized circumstances. |
| S4 | 1 | 2 | 2.4 GHz | 4 GB | 6 GB | 10 Gb Ethernet | | Secondary server. Provides auxiliary services for the tempest subcluster. Should not be used for logins or computation. |
| S4A | 1 | 2 | 2.4 GHz | 4 GB | 6 GB | 10 Gb Ethernet | | Secondary server. Similar to S4 node, but tailored to specialized services and applications. Should not be used for computation except in specialized circumstances. |
| S2 | 1 | 2 | 450 MHz | 1 GB | 35 GB | Gigabit Ethernet | | Secondary server. Provides mail, printing, and global filesystem services for the entire cluster. Should not be used for logins or computations except in specialized circumstances. |
| DB1 | 1 | 2 | 450 MHz | 4 GB | 16 GB | Gigabit Ethernet, Myrinet 2000 | | Database and bioinformatics server. Similar to S3 node, but with extra memory and larger disks to support database and bioinformatics applications. Also provides global filesystem services for the entire cluster. Direct logins allowed for access to specialized software packages. May be used by PBS jobs which need access to node-locked bioinformatics software. |
| M1 | 1 | 1 | 650 MHz | 512 MB | 10 GB | Fast Ethernet | | System management node. Provides performance monitoring and control of computers, storage, and networks, along with centralized logging services. Not intended for use by applications. |
| N1 | 1 | 2 | 360 MHz | 512 MB | 12 GB | Gigabit Ethernet, Myrinet 1280 | | Network compute node. Similar to a C2 compute node, but with a gigabit connection to the internal network and a direct connection to the building network. Useful for host/node, client/server, master/slave, or n+1 programming models, as well as for proxies or other processes which mediate between internal and external computations. Good choice as an I/O or "head" node for applications running in the tornado or gulfstream subclusters. |
| N2 | 1 | 2 | 900 MHz | 2 GB | 31 GB | Gigabit Ethernet, Myrinet 2000 | | Network compute node. Similar to a C5 compute node, but with a gigabit connection to the internal network and a direct connection to the building network. Useful for host/node, client/server, master/slave, or n+1 programming models, as well as for proxies or other processes which mediate between internal and external computations. Good choice as an I/O or "head" node for applications running in the twister subcluster. |
| C2 | 32 | 2 | 360 MHz | 512 MB | 12 GB | Fast Ethernet, Myrinet 1280 | tornado | Compute node. General parallel and serial computation; jobs with intensive communication or local I/O requirements. |
| C3 | 64 | 1 | 650 MHz | 1 GB | 26 GB | Fast Ethernet | whirlwind | Compute node. General computation. Preferred location for serial (non-parallel) computations with significant memory requirements. |
| C4 | 4 | 4 | 450 MHz | 4 GB | 6 GB | Gigabit Ethernet, Myrinet 2000 | hurricane | High-performance compute node with Gigabit Ethernet and Myrinet 2000. SMP parallel computations (multi-threaded, auto-parallelization, compiler directives, OpenMP), memory- and communication-intensive applications; serious number crunching. |
| C5 | 32 | 2 | 900 MHz | 2 GB | 31 GB | Fast Ethernet, Myrinet 2000 | twister | High-performance compute node. Memory-, CPU-, and I/O-intensive applications with modest communication requirements. Two local scratch partitions (53 GB total). |
| C6 | 4 | 2 | 360 MHz | 512 MB | 17 GB | Gigabit Ethernet, Myrinet 1280 | gulfstream | Compute node, with two local scratch disks (36 GB total), Gigabit Ethernet, and Myrinet 1280. Good choice for I/O-, communication-, and data-intensive applications, including interactive visualization work. |
| C6A | 2 | 2 | 360 MHz | 512 MB | 17 GB | Gigabit Ethernet, Myrinet 1280 | gulfstream | Same as a C6 node, but with three local scratch disks (54 GB total). |
| C7 | 2 | 4 | 1.28 GHz | 8 GB | 202 GB | Gigabit Ethernet, Myrinet 2000 | vortex | Data-intensive compute node with large memory (8 GB), high capacity local scratch disks (202 GB & 44 GB), Gigabit Ethernet, and Myrinet 2000. Well-suited for SMP parallel computations (multi-threaded, auto-parallelization, compiler directives, OpenMP) with large memory and I/O requirements, as well as communication-intensive distributed-memory applications, out-of-core methods, and serious number crunching. |
| C7A | 2 | 4 | 1.28 GHz | 16 GB | 202 GB | Gigabit Ethernet, Myrinet 2000 | vortex | Same as a C7 node, but with twice as much memory (16 GB). |
| C8 | 42 | 2 | 2.4 GHz | 4 GB | 56 GB | Gigabit Ethernet, InfiniBand 4x | tempest | High-performance Opteron-based compute node with Gigabit Ethernet and InfiniBand. Suitable for distributed-memory applications with demanding CPU, memory, I/O, and communication requirements. |

 

SciClone Subclusters

| Subcluster | # of Nodes | Node Type(s) | Node Names | Suggested Uses |
|---|---|---|---|---|
| whirlwind | 64 | C3 | wh01-wh64 | General computation; serial and embarrassingly parallel applications; distributed-memory parallel computations; algorithm development and scalability studies. Communication is limited to Fast Ethernet (100 Mb/s). |
| tornado | 32 | C2 | tn01-tn32 | General parallel computation; communication-intensive parallel computations via Myrinet; mixed mode (SMP+distributed) computations; algorithm development and scalability studies. |
| twister | 32 | C5 | tw01-tw32 | General parallel computation; memory- and CPU-intensive applications; communication-intensive parallel applications via Myrinet; large out-of-core problems; mixed mode (SMP+distributed) computations; algorithm development and scalability studies. |
| hurricane | 4 | C4 | hu01-hu04 | Memory-, CPU-, and communication-intensive parallel computations via Gigabit Ethernet or Myrinet; shared-memory applications; mixed mode programs. Lightweight computations should run somewhere else. |
| gulfstream | 6 | C6, C6A | gfs01-gfs06 | Data-intensive computing with multiple local scratch disks; communication-intensive computing via Gigabit Ethernet or Myrinet; visualization; large out-of-core problems. |
| vortex | 4 | C7, C7A | vx01-vx04 | Memory-, I/O-, and communication-intensive parallel computations via Gigabit Ethernet or Myrinet; shared-memory applications; mixed mode programs; very large out-of-core problems. Lightweight computations should run somewhere else. |
| tempest | 42 | C8 | tp01-tp42 | CPU-, memory-, and communication-intensive parallel computations via Gigabit Ethernet or InfiniBand; large out-of-core problems; mixed mode (SMP+distributed) computations. Lightweight computations should run somewhere else. |

 

Networks

SciClone features a rich but complex networking environment. For a schematic, refer to the SciClone Architecture Diagram. The following table summarizes the networks used in SciClone.

| Net ID | Technology | Network Number | Hostname Suffix | Description |
|---|---|---|---|---|
| 1 | Fast Ethernet / Gigabit Ethernet | 128.239.40-43 | -f, -g | Also known as "jetstream", this is the primary internal network for SciClone. Every node in the cluster has an interface to this network. Uses a combination of Fast Ethernet (100 Mb/s) and Gigabit Ethernet (1000 Mb/s) switches connected by 3- and 4-way Gigabit Ethernet trunks. Also provides a gigabit route to the campus network for bulk data transfers and bandwidth-intensive applications such as visualization. |
| 2 | Myrinet-1280 | 198.168.2 | -m2 | 1.28 Gb/s low-latency switched communication fabric connects the tornado and gulfstream subclusters plus nws01. Provides excellent performance for communication-intensive applications. |
| 3 | Gigabit Ethernet | 192.168.3 | -g3 | Dedicated point-to-point connections for specialized applications such as visualization. |
| 4 | Fast Ethernet | 128.239.33 | -f4 | Savage House network. Switched network connects SciClone to other systems within the building, and provides the preferred route for external hosts to reach the SciClone front end. |
| 5 | Ethernet | 192.168.5 | -e5, -f5 | Device management network. 10/100 Mb/s switched Ethernet network allows monitoring and management of SciClone's switches, mass storage subsystems, and service processors without intruding on application traffic. |
| 6 | Fast Ethernet | 172.31 | -f6 | Private class B network used by the Computer Science Department's Network Systems Testbed. Not intended for general use. |
| 7 | Myrinet-2000 | 192.168.7 | -y7 | 2.0 Gb/s low-latency switched communication fabric connects the vortex, hurricane, and twister subclusters plus nws02 and maelstrom. Offers maximum performance for communication-intensive applications. |
| 8 | InfiniBand 4x | 192.168.8 | -i8 | 10 Gb/s low-latency switched communication fabric connects nodes within the tempest subcluster. Offers maximum performance for communication-intensive applications. |

SciClone's front end nodes, monsoon and squall, have interfaces on multiple networks, as do many of the other nodes. From outside the cluster, use the hostname monsoon.sciclone.wm.edu (a.k.a. sciclone.wm.edu) or squall.sciclone.wm.edu to log in via the Savage House Fast Ethernet network. From other nodes within the cluster, use ms00.sciclone.wm.edu or sq00.sciclone.wm.edu (or just ms00 and sq00) to reach the front end nodes via the cluster's internal network. If you use the monsoon or squall addresses from within the cluster, you'll get routed outside to the building network and then back in via the external interface, a longer and slower path.

A complete list of hostnames and their corresponding network addresses is available in /etc/hosts on the front end.

Metaclusters

When communication is routed via Myrinet, some of the distinctions between subclusters become blurred. In particular, the gulfstream and tornado subclusters plus the node nws01 can be thought of as a unified 39-node, 78-processor "metacluster". Although some details differ, these nodes have identical processor and memory configurations and share the same Myrinet-1280 network. However, I/O intensive processes should be placed on nws01 or one of the gulfstream nodes to take advantage of Gigabit Ethernet links to the servers. Nodes could be allocated from this combined pool via a PBS node specification such as "ultra60:myri:ppn=2". Similarly, nws02 can be used to augment the twister subcluster, using the Myrinet-2000 network for communication. In this case a node spec might look like "f280r:myri2:ppn=2" and nws02 would be the best place to locate I/O intensive processes.
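The node and processor counts quoted for this metacluster follow directly from the node-type and subcluster tables. The arithmetic can be sketched in Python as follows; the dictionary layout and function name are illustrative only, and the sketch assumes that nws01 is the single N1 node (two CPUs).

```python
# Node counts and CPUs per node, transcribed from the tables above.
# The data structure and names are illustrative, not part of any SciClone tool.
# Assumption: nws01 is the N1 network compute node.
MYRINET_1280_POOL = {
    "tornado (C2)": {"nodes": 32, "cpus_per_node": 2},
    "gulfstream (C6, C6A)": {"nodes": 6, "cpus_per_node": 2},
    "nws01 (N1)": {"nodes": 1, "cpus_per_node": 2},
}


def pool_size(pool):
    """Return (total_nodes, total_cpus) for a pool of subclusters."""
    nodes = sum(p["nodes"] for p in pool.values())
    cpus = sum(p["nodes"] * p["cpus_per_node"] for p in pool.values())
    return nodes, cpus


print(pool_size(MYRINET_1280_POOL))  # (39, 78)
```

Because every node in this pool has two CPUs, a node specification with ppn=2 uses all of the processors on each allocated node.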

Note that applications are free to combine nodes from multiple subclusters in any way they see fit, but in the general case, differences in processor speeds, communication interfaces, memory capacities, and disk performance pose difficult load balancing problems which can lead to very inefficient use of computational resources unless specialized parallel algorithms are employed.


Filesystems

When a user account is installed, subdirectories are created in the following filesystems:

| Filesystem Name | Purpose | Description |
|---|---|---|
| One of: /sciclone/home00, /sciclone/home01, /sciclone/home02, /sciclone/home10 | Home directories | Primary location for source code, executables, scripts, and moderate-sized data files. Accessible system-wide via NFS. Files on these partitions are backed up on a regular basis. Files are subject to deletion after the user's account has expired. We do not archive expired accounts. |
| /sciclone/qfs00 | Large file storage | High-capacity, high-performance shared filesystem intended for storage of large files (1 MB and above). Mounted directly on monsoon and maelstrom via Fibre Channel SAN; accessible everywhere else via NFS. Files on this partition are backed up on a regular basis. Files are subject to deletion after the user's account has expired. See additional info on QFS below. |
| /sciclone/scr00, /sciclone/scr01, /sciclone/scr02, /sciclone/scr10 | Global scratch space | High capacity storage for large files and short-term working data. Accessible system-wide via NFS. Files on these partitions are automatically deleted after 30 days of inactivity, and are not backed up. |
| /local/scr, /local/scr2, /local/scr3 | Local scratch space | Multi-gigabyte scratch partition(s) physically resident on a node's local disk. Use for temporary storage of local data and intermediate results. Also useful as scratch space for out-of-core methods or as a staging area for input and output files. Provides better performance than NFS-mounted filesystems. Every node has a /local/scr partition. /local/scr2 is available only on C5, C6, C6A, C7, and C7A nodes. /local/scr3 is available only on C6A nodes. Files on these partitions are automatically deleted after 30 days of inactivity, and are not backed up. On hurricane, /local/scr points to /sciclone/scr00; on monsoon, /local/scr points to /sciclone/scr01. |

Symlinks in each user's home directory point to the preconfigured global (~/scr*) and local (~/lscr*) scratch directories and the QFS filesystem (~/qfs00). /root, /usr, and /var filesystems are local to each node; /opt, /usr/local, and /import reside on the front end and are exported to each node via NFS.

QFS

Sun's StorEdge QFS is a high-performance, high-capacity SAN-based shared filesystem which is optimized for bulk accesses on large sequential files. The /sciclone/qfs00 filesystem spans 46 disk drives distributed across five separate RAID arrays, with a total formatted capacity of 5.3 terabytes. The disk arrays are connected to each other and to the SciClone servers via a Fibre Channel Storage Area Network, or SAN. Individual files exceeding a terabyte in size can be accommodated, space permitting.

QFS allocates disk space in large blocks, and is therefore rather inefficient for collections of small files, such as source code. As a rule of thumb, files which are stored in SciClone's QFS filesystem should have an average size of 1 MB or larger. Smaller files should be stored in home directories (/sciclone/home*) or scratch filesystems (/sciclone/scr*). Space allocation in /sciclone/qfs00 is monitored and you may be notified if you have too many small files stored there. Note that small files can be aggregated into larger ones with Unix utilities such as ar, tar, cpio, and zip. ar is particularly useful since it maintains an internal index structure which supports the addition, deletion, replacement, and retrieval of individual members of the archive.
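As an illustration of the aggregation advice above, here is a minimal Python sketch that packs a directory of small files into a single tar archive using the standard tarfile module. It stands in for the tar utility mentioned above; the function name and its arguments are hypothetical.

```python
import os
import tarfile


def bundle_small_files(src_dir, archive_path):
    """Pack every regular file under src_dir into one uncompressed tar archive.

    Returns the number of files added. The archive should be written
    outside src_dir so that it does not try to include itself.
    """
    count = 0
    with tarfile.open(archive_path, "w") as tar:
        for root, _dirs, files in os.walk(src_dir):
            for name in sorted(files):
                full = os.path.join(root, name)
                # Store paths relative to src_dir so the archive is relocatable.
                tar.add(full, arcname=os.path.relpath(full, src_dir))
                count += 1
    return count
```

The resulting single large file is a better fit for QFS block allocation than the many small files it replaces, and individual members can still be extracted later.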

I/O performance with QFS can be improved (sometimes dramatically) by reading and writing data in large blocks that match the blocking factor of the filesystem. In our tests, a blocksize of 65536 (64 KB) was near optimal, yielding write speeds in excess of 100 MB/s from monsoon and maelstrom.
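The following Python sketch shows one way to move data in 64 KB blocks, as suggested above. It is an illustration under assumptions rather than a tuned implementation: the function name is hypothetical, and the best block size for a given application should be confirmed by measurement.

```python
BLOCK_SIZE = 65536  # 64 KB, the block size reported as near optimal above


def copy_in_blocks(src_path, dst_path, block_size=BLOCK_SIZE):
    """Copy src_path to dst_path in chunks of block_size bytes.

    Returns the number of bytes copied.
    """
    total = 0
    # Size Python's I/O buffers to one block so that data is handed to
    # the operating system in block-sized requests.
    with open(src_path, "rb", buffering=block_size) as src, \
         open(dst_path, "wb", buffering=block_size) as dst:
        while True:
            block = src.read(block_size)
            if not block:
                break
            dst.write(block)
            total += len(block)
    return total
```

The same idea applies in C or Fortran: issue reads and writes whose size matches the filesystem's blocking factor instead of many small requests.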

QFS is a shared filesystem, meaning that it can be mounted and accessed simultaneously by multiple servers. It allows simultaneous reads from the same or different files and simultaneous writes to different files. QFS can also be configured to allow simultaneous writes to the same file from applications which have been designed to perform page-aligned I/O. This option is not presently enabled on SciClone, but could be if the need arose.

The /sciclone/qfs00 filesystem is mounted directly on monsoon and maelstrom, and is exported from there via NFS to all other nodes in the cluster. Even-numbered nodes mount qfs00 from maelstrom and odd-numbered nodes from monsoon. This spreads the NFS load across both servers, resulting in higher throughput and less resource contention. Note that applications which generate a lot of output will still get better performance by writing files to the local scratch partitions and then copying them back to the servers with rcp, rather than relying on NFS. The even/odd strategy would be helpful in this scenario, too.
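To make the even/odd rule concrete, the sketch below predicts which server a given node would mount qfs00 from. It assumes that parity is determined by the numeric suffix of the node name (for example, tw07 is odd), which this guide does not state explicitly; the function name is hypothetical.

```python
import re


def qfs_nfs_server(node_name):
    """Return the server a node would mount qfs00 from under the even/odd rule.

    Assumption: parity is taken from the numeric suffix of the node name
    (e.g. "tw07" -> 7). Even-numbered nodes use maelstrom, odd use monsoon.
    """
    match = re.search(r"(\d+)$", node_name)
    if match is None:
        raise ValueError(f"no numeric suffix in node name: {node_name!r}")
    number = int(match.group(1))
    return "maelstrom" if number % 2 == 0 else "monsoon"


print(qfs_nfs_server("tw08"))  # maelstrom
print(qfs_nfs_server("wh01"))  # monsoon
```

A job that stages output on local scratch could use the same rule to choose the destination server when copying results back, so that copies from many nodes are spread across both servers.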


Backups

Home directories (/sciclone/home00 and /sciclone/home10) are normally backed up several times per week. QFS directories (/sciclone/qfs00) are backed up approximately every three days. Scratch directories are not backed up at all. Furthermore, files in any of the scratch partitions which have not been used in the past 30 days will be deleted automatically in order to maintain sufficient free space for active projects.
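To see which scratch files are at risk under the 30-day rule, a user could run something like the Python sketch below. The guide does not say which timestamp the cleanup examines, so this sketch assumes that a file counts as "used" if it was either accessed or modified within the window; the function name and parameters are hypothetical.

```python
import os
import time


def stale_files(directory, max_idle_days=30, now=None):
    """Return paths under directory whose last access AND last modification
    are both older than max_idle_days.

    Assumption: "inactivity" means neither timestamp is recent.
    """
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400  # 86400 seconds per day
    stale = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            st = os.lstat(path)
            if max(st.st_atime, st.st_mtime) < cutoff:
                stale.append(path)
    return sorted(stale)
```

Calling the function on one of your scratch directories lists the files that should be copied to a home or QFS directory if they are still needed.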


Shell Environment

A default .cshrc file is provided in each user's home directory. If you modify it, be sure you know what you're doing—SciClone's hardware and software environment is considerably more complex than the typical UNIX workstation environment. The default configuration enables a 32-bit environment for compiling and linking, with LAM as the default MPI package. To use alternative communication packages such as MPICH or MPICH-GM, settings of various environment variables will need to be changed as documented within the .cshrc file. Copies of the current recommended configuration files can be found in /usr/local/etc/templates/.

Four UNIX groups are established for each organization or department which has users on SciClone. These four groups correspond to the user's status within the organization, i.e. faculty/staff, graduate student, undergraduate student, and "other". For example, a professor from the William and Mary Computer Science Department would be assigned to the group "csf", while a CS undergraduate would be assigned to the group "csu". Default file access permissions are set so that a user's files and directories are read-write for the user, read-only for the user's group, and unreadable by anyone else. If this is not appropriate for your situation, you should change the default umask setting in your .cshrc file, or set file access modes on a case-by-case basis. Note that SciClone's filesystems are exported to other computers within the Computational Science Cluster, and may therefore be visible beyond the SciClone user community.
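The default permissions described above (read-write for the user, read-only for the group, no access for others) correspond to a umask of 027. That value is an inference from the description and is not quoted from the default .cshrc. The Python sketch below shows the effect of a umask on a newly created file; the function name is hypothetical.

```python
import os
import stat
import tempfile


def mode_with_umask(umask, requested=0o666):
    """Create a file under the given umask and return its permission bits."""
    old = os.umask(umask)
    try:
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "probe")
            # The kernel clears the umask bits from the requested mode.
            fd = os.open(path, os.O_CREAT | os.O_WRONLY, requested)
            os.close(fd)
            return stat.S_IMODE(os.stat(path).st_mode)
    finally:
        os.umask(old)  # always restore the caller's umask


print(oct(mode_with_umask(0o027)))  # 0o640
```

A stricter umask such as 077 would yield files readable only by their owner, while 022 would make them readable by everyone.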


Compilers and Libraries

Both Sun and GNU compilers are installed on the system. See the Software section of the SciClone web site for more information about what's available and how to use it. Unless otherwise noted, all of the third-party software which is installed on SciClone has been built using Sun's compilers. For maximum performance and to ensure compatibility with system libraries, we strongly recommend the use of Sun's compilers whenever possible.


Compiler Options and Code Optimization

When compiling and optimizing code for use on SciClone, care must be taken to ensure that the resulting executables will perform properly on the type of nodes on which they will be executed. The situation is further complicated by the fact that SciClone includes two distinct and incompatible families of processor architecture, Sun UltraSPARC and Intel/AMD x86/amd64. Even within a particular architecture family, there are variations in processor capabilities among the different types of nodes. By choosing appropriate compiler options, users can compile their applications for portability across an entire architecture family, or optimize performance for a particular type of node. With the current set of hardware, six different instruction set architectures (ISAs) are of interest:

v8plusa  - 32-bit addressing, compatible with all UltraSPARC node types
v9a  - 64-bit addressing, compatible with all UltraSPARC node types
v8plusb  - 32-bit addressing, runs only on C5, C7, C7A, N2, DB1, and S3 node types
v9b  - 64-bit addressing, runs only on C5, C7, C7A, N2, DB1, and S3 node types
sse2a  - 32-bit addressing, runs only on C8, S4 and S4A node types
amd64a  - 64-bit addressing, runs only on C8, S4 and S4A node types

To achieve portability along with performance on UltraSPARC nodes, the following set of options is suggested:

-fast -xarch=v8plusa -xchip=generic -xcache=generic

For best performance on AMD Opteron nodes, the following options provide a good starting point for further experimentation:

-fast -xarch=sse2a -xchip=opteron -xcache=64/64/2:1024/64/16

By default, -fast generates code for the type of processor on which the compiler is running, which may or may not be compatible with the target execution processor. This is why it is important to explicitly specify the desired ISA with the -xarch option. Note that the order is important here: the -xarch, -xchip, and -xcache options must come after -fast.

Opteron processors support several different "memory models", depending on the intended use of the resulting code. For 64-bit addressing on SciClone's C8, S4, and S4A nodes, specify the "medium" memory model, for example:

-fast -xarch=amd64a -xchip=opteron -xcache=64/64/2:1024/64/16 -xmodel=medium

To optimize for a specific type of node, you can specify the characteristics of the CPU (-xchip) and cache (-xcache) in addition to the ISA. The following table summarizes the options by node type.

| Node Type | -xarch (32-bit, 64-bit) | -xchip | -xcache |
|---|---|---|---|
| Any UltraSPARC | v8plusa, v9a | generic | generic |
| C2, C4, C6, C6A, N1, S2 | v8plusa, v9a | ultra2 | 16/32/1:4096/64/1 |
| C3, M1 | v8plusa, v9a | ultra2e | 16/32/1:512/64/4 |
| C5, N2, S3, DB1 | v8plusb, v9b | ultra3cu | 64/32/4:8192/512/2 |
| C7, C7A | v8plusb, v9b | ultra3i | 64/32/4:1024/64/4 |
| Any Opteron | sse2a, amd64a | opteron | 64/64/2:1024/64/16 |
| C8, S4, S4A | sse2a, amd64a | opteron | 64/64/2:1024/64/16 |

So, for example, to optimize code for an Ultra 5 node, the following compiler options could be used:

-fast -xarch=v8plusa -xchip=ultra2i -xcache=16/32/1:2048/64/1

To target a Sun Fire 280R, use:

-fast -xarch=v8plusb -xchip=ultra3cu -xcache=64/32/4:8192/512/2

For an Opteron-based node, use:

-fast -xarch=sse2a -xchip=opteron -xcache=64/64/2:1024/64/16

To address more than 2 GB of memory in a single process, use v9a, v9b, or amd64a instead of v8plusa, v8plusb, or sse2a, respectively:

-fast -xarch=v9a -xchip=generic -xcache=generic
-fast -xarch=v9a -xchip=ultra2 -xcache=16/32/1:4096/64/1
-fast -xarch=v9b -xchip=ultra3cu -xcache=64/32/4:8192/512/2
-fast -xarch=v9b -xchip=ultra3i -xcache=64/32/4:1024/64/4
-fast -xarch=amd64a -xchip=opteron -xcache=64/64/2:1024/64/16 -xmodel=medium

Note that 64-bit addressing is generally of interest only for the C4, C7, C7A, C8, DB1, S3, S4, and S4A nodes, which each have 4 GB or more of physical memory installed on them; all other node types have 2 GB or less.
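For users who build for several node types, the option table can be captured in a small helper. The Python sketch below is a hypothetical convenience, not a SciClone tool; the values are transcribed from the table above. It places -fast first, as required, and adds -xmodel=medium for amd64a, following the examples above.

```python
# (node types, 32-bit -xarch, 64-bit -xarch, -xchip, -xcache),
# transcribed from the option table above.
_GROUPS = [
    (("C2", "C4", "C6", "C6A", "N1", "S2"),
     "v8plusa", "v9a", "ultra2", "16/32/1:4096/64/1"),
    (("C3", "M1"),
     "v8plusa", "v9a", "ultra2e", "16/32/1:512/64/4"),
    (("C5", "N2", "S3", "DB1"),
     "v8plusb", "v9b", "ultra3cu", "64/32/4:8192/512/2"),
    (("C7", "C7A"),
     "v8plusb", "v9b", "ultra3i", "64/32/4:1024/64/4"),
    (("C8", "S4", "S4A"),
     "sse2a", "amd64a", "opteron", "64/64/2:1024/64/16"),
]


def compiler_flags(node_type, bits=32):
    """Return the suggested Sun compiler options for one node type."""
    if bits not in (32, 64):
        raise ValueError("bits must be 32 or 64")
    for nodes, arch32, arch64, chip, cache in _GROUPS:
        if node_type in nodes:
            arch = arch32 if bits == 32 else arch64
            # -fast must come first; the -x options after it override its defaults.
            flags = ["-fast", f"-xarch={arch}", f"-xchip={chip}", f"-xcache={cache}"]
            if arch == "amd64a":
                flags.append("-xmodel=medium")
            return " ".join(flags)
    raise KeyError(f"unknown node type: {node_type}")


print(compiler_flags("C5"))
print(compiler_flags("C8", bits=64))
```

The output of such a helper could be assigned to a compiler flags variable in a Makefile or build script, keeping the node-specific settings in one place.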

Programs compiled with v9a, v9b, or amd64a will not work with libraries which have been compiled for 32-bit addressing. Many of the performance-critical software packages installed on SciClone are compiled for all six ISAs, and the desired version can automatically be selected via the XARCHULTRA and XARCHX86 environment variables in the user's ~/.cshrc file. Consult the documentation for individual software packages to see which versions are available.

Note that the optimizations invoked by -fast may be too aggressive for some codes, with the potential for unintended or incorrect results. If you suspect this is a problem, you could try a lower optimization level, e.g.:

-fast -xO3 -xarch=... -xchip=... -xcache=...

or you could leave off -fast entirely. If -fast is not used, it may be necessary to use "-xmemalign=8s" (C and C++) or "-dalign -xmemalign=8s" (Fortran) to assure that the application's data alignment matches that expected by the system libraries.

Code optimization is a complex topic, and the use of any given option may help one routine but hinder another, so some experimentation is in order. Consult Sun's compiler documentation for full details, including information about many options not mentioned here.


Parallel Programming Tools

Although SciClone's ability to run many serial jobs concurrently is useful for some applications, bringing the full power of the system to bear on a single computation requires the use of parallel programming techniques. A variety of tools are available to assist in the development of parallel programs.

Shared Memory Programming

Many of the nodes in SciClone include more than one CPU, so effective use implies that all of the CPUs should be kept busy. If an application will fit on a single node, then the use of shared memory programming techniques may be the simplest way to boost performance on multiprocessor (SMP) nodes. There are several approaches for exploiting parallelism in shared memory environments, including automatic parallelization, compiler directives, thread libraries, system-level interprocess communication (IPC) services, and message passing.

Sun's compilers support automatic parallelization of well-behaved loop constructs via the -xautopar (C) and -autopar (Fortran) options. These are described in detail in the compiler manuals. Sometimes the compiler's ability to detect and exploit parallelism can be enhanced with straightforward changes to the code which eliminate dependencies or simplify the control flow.

In other cases, the programmer needs to convey additional information to the compiler in the form of directives (Fortran) or pragmas (C/C++). Among other things, directives and pragmas can be used to give the compiler hints about loops that can or cannot be safely parallelized. Sun's C, C++, and Fortran compilers support the OpenMP 2.5 API, a mix of directives and library calls which support a fork-join model of parallel execution. For more information on using OpenMP, refer to Chapter 3 in the C User's Guide, Chapter 10 of the Fortran Programming Guide, Appendix D of the Fortran User's Guide, and the OpenMP API User's Guide.

At a coarser level of granularity, entire programs, or major sections of them, can be structured as independent threads of control, each of which can potentially run on a separate CPU. Solaris supports two different thread packages, known as Solaris threads and Posix threads (or pthreads). Sun's Multithreaded Programming Guide covers both of these packages in detail. Additional information on multithreading for C++ applications can be found in Chapter 11 of the C++ User's Guide.

Java includes threads as a fundamental part of the language. Although not well suited for high performance computing due to its resource requirements and the interpretive nature of the language, Java may nonetheless be of interest for certain applications. Java programmers should consult the Java 2 SDK documentation.

Threads provide a lightweight mechanism for exploiting parallelism within the context of a single UNIX process. Parallelism can also be obtained by running several distinct processes at the same time. Like all UNIX variants, Solaris provides a number of system services which facilitate communication between processes. These include pipes, message queues, semaphores, shared memory segments, signals, sockets, and memory-mapped files. These facilities may be used directly or as the basis for process-to-process communication in higher level libraries such as MPI. Interprocess communication (IPC) facilities may be the mechanism of choice for applications which bring together several programs with different functionality. An overview of IPC services in Solaris can be found in the Programming Interfaces Guide.
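
As one small example of these services, the sketch below uses a pipe to return a result from a child process to its parent. It is an illustration under simple assumptions (a single fixed-size message, no signal handling), and the function name sum_in_child is hypothetical; pipe, fork, read, write, and waitpid are the standard UNIX calls.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child process that computes 1 + 2 + ... + n and sends the
 * result back to the parent through a pipe.
 * Returns 0 on success, -1 on failure. */
int sum_in_child(long n, long *result)
{
    assert(result != NULL);
    int fd[2];
    if (pipe(fd) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }

    if (pid == 0) {                     /* child: writes to the pipe */
        close(fd[0]);
        long total = 0;
        for (long i = 1; i <= n; i++)
            total += i;
        ssize_t w = write(fd[1], &total, sizeof total);
        close(fd[1]);
        _exit(w == (ssize_t)sizeof total ? 0 : 1);
    }

    /* parent: reads from the pipe, then collects the child's status */
    close(fd[1]);
    long value = 0;
    ssize_t r = read(fd[0], &value, sizeof value);
    close(fd[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    if (r != (ssize_t)sizeof value || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    *result = value;
    return 0;
}
```

The same pattern extends to several children, each with its own pipe, although at that point a higher-level library such as MPI is usually less work.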

In most cases, the message-passing communication libraries described in the next section can also be used in shared memory environments, sometimes quite efficiently. The LAM/MPI library, in particular, exhibits very low overheads on shared memory nodes. While the message-passing paradigm often requires more programming effort than some of the simpler shared memory schemes, it is more portable, allowing a single application to run in shared memory, distributed memory, or mixed environments.

Distributed Memory Programming

Even the most powerful multiprocessor nodes on SciClone provide only a small fraction of the aggregate system resources (about 1% of the total CPU power, 2% of the memory, and 1% of the disk capacity). To truly take advantage of the system, it is necessary to build distributed memory applications that can bring many nodes to bear on a single computation. Although lower-level system services such as sockets or remote procedure calls are sometimes used to build distributed applications, most scientific programmers working on SciClone will want to use MPI, the de facto standard for message passing on distributed-memory parallel architectures. SciClone currently supports three different MPI implementations: LAM, MPICH, and MPICH-GM. Additional communication packages are expected to be available in the future.

Mixed-Mode Programming

The shared memory and distributed memory approaches can be combined in applications which run on multiple SMP nodes. This can be useful if the computation exhibits parallelism at several different levels, for example loop-level parallelism within coarse-grained tasks. Shared memory constructs may offer performance advantages over message passing for local communication within SMP nodes, although in some cases the reverse may also be true. Whether the performance benefits of mixed-mode programming are worth the extra complexity seems to depend heavily on the characteristics of the application.


Running Programs

To provide conflict-free access to SciClone's computational resources, node allocation and job scheduling services are provided by OpenPBS, the freeware version of the Portable Batch System (PBS). To avoid interfering with PBS jobs, all access to both compute and server nodes, including interactive shell sessions, must be initiated through PBS. The only exceptions are:

  1. logins and associated interactive processes (compilations, file manipulation, job submittal, etc.) on the front-end servers ([monsoon.]sciclone.wm.edu and squall.sciclone.wm.edu),
  2. interactive access to maelstrom and mistral in order to run certain node-locked commercial software packages (although these can also be launched via PBS jobs), and
  3. cases in which you have made prior arrangements for dedicated access to a set of nodes for the purpose of running experiments which cannot be accommodated within the PBS framework (rare).

Direct rlogin/slogin/telnet access to individual compute nodes is disabled, and all rsh and rcp commands which reference the nodes should be submitted via PBS jobs. Stray processes on the nodes (i.e., those not belonging to an active PBS job) are subject to termination without warning.

In some cases interactive access to compute nodes is required. Examples include software packages with graphical user interfaces (e.g., MATLAB or various visualization systems), debugging, etc. PBS has a special interactive mode (described in more detail below) which provides this capability, including forwarding of X11 sessions to the user's workstation via SSH.

Server Nodes vs. Compute Nodes

As discussed in the Architecture Overview, nodes in the SciClone cluster fall into one of two categories, server nodes or compute nodes. While compute nodes are intended to provide dedicated computational resources for one or more jobs, server nodes provide services for the system as a whole. Thus PBS jobs which run on server nodes can adversely impact the performance of the whole cluster, and will themselves be impacted by other activities on the system. For this reason, most jobs should specifically request to run on compute nodes, as explained in the following sections.

Nevertheless, there may be circumstances in which jobs with special requirements will need to create processes on server nodes. Our PBS configuration currently allows this, and, in fact, will allocate server nodes to jobs if (1) server nodes are specifically requested by the job, or (2) the job does not specifically request compute nodes and no other resources are currently available to satisfy the request. This latter case is designed to improve turnaround for small jobs when the system is otherwise saturated. Users should feel free to take advantage of this when it is really needed (for example, deadlines for class projects or conference and journal submissions), but should not use it routinely.

Because server nodes also host SciClone's global filesystems (/sciclone/home*, /sciclone/scr*, /sciclone/qfs00), they are also the most efficient place to locate processes that perform large amounts of I/O against these files. In this case an appropriate PBS node specification can be used to place an I/O process on the server which physically hosts the filesystem of interest.

Although SciClone's server nodes (monsoon, maelstrom, squall, mistral, tempest, hurricane, zephyr) all have dual processors, PBS is allowed to use only one processor per server node. This leaves the other processor free to provide system-wide services such as compilation, I/O, NFS, DNS, and job scheduling. To avoid overloading the front-end nodes, all application programs with non-trivial resource requirements (more than 30 seconds of CPU time or more than 128 MB of memory) must be submitted as PBS jobs. Processes which violate this rule may be killed without warning. (Typical code development and job preparation activities, including editing, compilation, make, and file manipulation, are specifically allowed to run on monsoon and squall as part of their normal interactive workloads.) PBS references server nodes via aliases (ms00, ml00, sq00, mt00, tp00, hu00) which map to either Gigabit or 10-Gigabit Ethernet interfaces on SciClone's internal jetstream network.

PBS Environment Variables

To use PBS, your search path, man path, library path, and default PBS server must be set correctly in your ~/.cshrc file:

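    # Add the PBS commands to the search path.
    set path=($path /usr/local/pbs/bin)
    # Add the PBS man pages (assumes MANPATH is already set).
    setenv MANPATH "${MANPATH}:/usr/local/pbs/man"
    # Set the default PBS server from the output of pbs_default.
    setenv PBS_DEFAULT `/usr/local/bin/pbs_default`
    # Set the library path (this replaces any existing value).
    setenv LD_LIBRARY_PATH "/usr/local/lib"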

If you are using the recommended environment configuration (available in /usr/local/etc/templates/cshrc on sciclone.wm.edu), all of these environment settings will be configured for you automatically.

Node Properties

To accommodate heterogeneous environments (such as SciClone), PBS allows an arbitrary set of node properties to be assigned to each node. These properties may be appended to node allocation requests to constrain the set of processors which may be used to run the job. Node properties for SciClone are listed in the following table:

 

| Node Name | Node Type | Subcluster | CPU Configuration | Memory Limit (MB) | Local Scratch Disk (GB) | Networks | Network Interfaces | Switches | Operating System | Special Software |
|---|---|---|---|---|---|---|---|---|---|---|
| hu00 | s2, ultra60, server |  | ultra2, mhz450, dual, cache4, msb, ppn=1* | m512 | scr35 | net1, net4, net5 | geth | jsg02, sms02, els100 | sol7 |  |
| ms00 | s3, f280r, server |  | ultra3cu, mhz900, dual, cache8, msb, ppn=1* | m4096, m4gb | scr33 | net1, net4 | geth | jsc01, els100 | sol9 |  |
| ml00 | db1, f280r, server |  | ultra3cu, mhz900, dual, cache8, msb, ppn=1* | m2048, m2gb | scr16 | net1, net4, net7 | geth, myri2 | jsc01, els100, myr02 | sol9 | oracle, gcg |
| sq00 | s5, x4200, server |  | opteron, mhz2200, dual, core2, cache2, lsb, ppn=1* | m4096, m4gb | scr6 | net1, net4, net5 | feth, geth10 | jsg05, sms02, els100 | sol10 |  |
| tp00 | s4, v20z, server |  | opteron, mhz2400, dual, cache1, lsb, ppn=1* | m4096, m4gb | scr6 | net1, net4, net5 | feth, geth10 | jsg05, sms02, els100 | sol10 |  |
| mt00 | s4a, v20z, server |  | opteron, mhz2400, dual, cache1, lsb, ppn=1* | m4096, m4gb | scr6 | net1, net4 | feth, geth10 | jsg05, els100 | sol10 |  |
| zp00 | m1, v120, server |  | ultra2e, mhz650, single, cache512k, msb, ppn=1 | m512 | scr10 | net1, net5 | feth | jsc01, sms01 | sol9 |  |
| nws01 | n1, ultra60, compute |  | ultra2, mhz360, dual, cache4, msb, ppn=2 | m512 | scr12 | net1, net2, net3, net4 | geth, myri, myri1 | jsg02, myr01, els100 | sol9 |  |
| nws02 | n2, f280r, compute |  | ultra3cu, mhz900, dual, cache8, msb, ppn=2 | m2048, m2gb | scr31 | net1, net4, net7 | geth, myri2 | jsc01, myr02, els100 | sol9 |  |
| hu01-hu04 | c4, e420r, compute | hurricane | ultra2, mhz450, quad, cache4, msb, ppn=4 | m4096, m4gb | scr6 | net1, net7 | geth, myri2 | jsg04, myr02 | sol9 |  |
| vx01-vx02 | c7, c7a, v440, compute | vortex | ultra3i, mhz1280, quad, cache1, msb, ppn=4 | m16384, m16gb | scr202 | net1, net7 | geth, myri2 | jsg04, myr02 | sol9 |  |
| vx03-vx04 | c7, v440, compute | vortex | ultra3i, mhz1280, quad, cache1, msb, ppn=4 | m8192, m8gb | scr202 | net1, net7 | geth, myri2 | jsg04, myr02 | sol9 |  |
| gfs01-gfs02 | c6, c6a, ultra60, compute | gulfstream | ultra2, mhz360, dual, cache4, msb, ppn=2 | m512 | scr17 | net1, net2, net4 | geth, myri, myri1 | jsg03, myr01, els100 | sol9 |  |
| gfs03-gfs06 | c6, ultra60, compute | gulfstream | ultra2, mhz360, dual, cache4, msb, ppn=2 | m512 | scr17 | net1, net2, net4 | geth, myri, myri1 | jsg03, myr01, els100 | sol9 |  |
| tn01-tn32 | c2, ultra60, compute | tornado | ultra2, mhz360, dual, cache4, msb, ppn=2 | m512 | scr12 | net1, net2 | feth, myri, myri1 | jsf03, myr01 | sol9 |  |
| tw01-tw32 | c5, f280r, compute | twister | ultra3cu, mhz900, dual, cache8, msb, ppn=2 | m2048, m2gb | scr31 | net1, net7 | feth, myri2 | jsc01, myr02 | sol9 |  |
| tp01-tp42 | c8, v20z, compute | tempest | opteron, mhz2400, dual, cache1, lsb, ppn=2 | m4096, m4gb | scr56 | net1, net8 | geth, ib4x | jsg05, ib01 | sol10 |  |
| wh01-wh32 | c3, v120, compute | whirlwind, whlow | ultra2e, mhz650, single, cache512k, msb, ppn=1 | m1024, m1gb | scr26 | net1 | feth | jsc01 | sol9 |  |
| wh33-wh64 | c3, v120, compute | whirlwind, whhigh | ultra2e, mhz650, single, cache512k, msb, ppn=1 | m1024, m1gb | scr26 | net1 | feth | jsc01 | sol9 |  |
*Although these nodes contain two processors, only one of them is available to PBS.
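
To make the idea of appending properties to a node allocation request concrete, the sketch below formats a request string in the colon-separated form used by the PBS nodes resource, for example nodes=4:tornado:ppn=2. The helper format_node_request is hypothetical and is not part of PBS; it only builds the string, and the instructions for submitting jobs elsewhere in this guide remain the authority on what SciClone's PBS configuration accepts.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format a PBS node request such as "nodes=4:tornado:ppn=2" into buf.
 * property may be NULL or "" to omit the property constraint.
 * Returns 0 on success, -1 if the arguments are invalid or buf is
 * too small to hold the whole string. */
int format_node_request(char *buf, size_t size, int nodes,
                        const char *property, int ppn)
{
    assert(buf != NULL);
    if (nodes < 1 || ppn < 1)
        return -1;

    int len;
    if (property != NULL && strlen(property) > 0)
        len = snprintf(buf, size, "nodes=%d:%s:ppn=%d", nodes, property, ppn);
    else
        len = snprintf(buf, size, "nodes=%d:ppn=%d", nodes, ppn);

    return (len < 0 || (size_t)len >= size) ? -1 : 0;
}
```

For instance, requesting four tornado nodes with both processors on each would yield the string nodes=4:tornado:ppn=2, which matches the ppn=2 limit shown for tn01-tn32 in the table.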

The next table defines all of the node properties listed in the table above:

| Node Property | Description |
|---|---|
| compute, server | Primary function. |
| c2, c3, c4, c5, c6, c6a, c7, c7a, c8, n1, n2, m1, s2, s3, s4, s4a, s5, db1 | Node type, as described above and in the Hardware Component list. |
| e420r, f280r, ultra5, ultra60, v120, v440, v20z, x4200 | Manufacturer's model name. |
| gulfstream, hurricane, tornado, twister, tempest, vortex, whirlwind | Indicates which subcluster the node belongs to. |
| whlow, whhigh | Indicates the lower or upper half of the whirlwind subcluster. Nodes in each half share the same Fast Ethernet switch module, and therefore can communicate with each other slightly more efficiently than with nodes in the opposite half of the subcluster. |
| ultra2, ultra2e, ultra3cu, ultra3i, opteron | CPU architecture. |