SciClone User's Guide

Version 2.0

Revised:

This document provides background information and instructions for using the SciClone Cluster at the College of William and Mary. It is considered to be required reading for new users. Refer to other sections of the SciClone web site for more detailed information on specific topics such as software packages or hardware configuration.

Topics
How to Get Help

Questions, problems, or trouble reports: Send email to sciclone@wm.edu.

Urgent problems: Call Tom Crockett (757-221-2762) or Dale Castle (757-221-1701).

Emergencies: After-hours phone numbers are posted on the computer room door in Savage House, or call Campus Police at 757-221-4596.

Application Support: The Computational Science Cluster employs an Applications Analyst who is prepared to assist users with a broad range of tasks, from basic procedures such as establishing SSH access to the system, compiling codes, and setting up Makefiles and job scripts, to more sophisticated activities such as installing software packages, porting and tuning applications, advising on algorithms and methodology, parallelizing existing codes, and visualizing results. To request assistance, send email to sciclone@wm.edu or call Chris Bording at 757-221-3488.

Because SciClone is regarded as a research (rather than production) facility, support is available only during normal office hours.

When reporting problems, please provide as much relevant information as possible. This should include the following, as appropriate:
Obtaining Accounts

SciClone is operated by the Computational Science Cluster as a College-wide resource. William and Mary faculty, staff, and students with computation- or data-intensive applications from any discipline are welcome to apply for accounts on the system. The system is also available to those interested in developing tools or infrastructure to support more effective use of cluster computing systems. Requests for access from outside the William and Mary community (including W&M collaborators) are evaluated on a case-by-case basis. To apply for an account, follow the instructions in the SciClone Account Request Form.

Account Renewal, Expiration, and Deletion

Most accounts on SciClone have an associated expiration date which is specified on the user's Account Request Form. An email notice will be sent to the user approximately two weeks before the account is set to expire. Accounts can be renewed if need be by submitting a new Account Request Form. If an account is not renewed before the expiration date, it will be disabled immediately following the expiration date.

All files belonging to expired accounts are subject to deletion after a short grace period (30-day minimum). It is the user's responsibility to preserve any necessary files by moving or copying them elsewhere before the account expires. An account which has expired may be reactivated by submitting a new Account Request Form, but files previously associated with the account may not be available after the grace period.

Accessing the System

All access to SciClone from external systems is via Secure Shell using the SSH 2 protocol. SciClone includes subclusters based on two different and incompatible processor architectures, Sun UltraSPARC and AMD Opteron. Each of these architectures has a different login server which should be used to edit and compile programs, manipulate files, or submit batch jobs. The primary login server for the UltraSPARC nodes is monsoon.sciclone.wm.edu, a.k.a. sciclone.wm.edu. The primary login server for the Opteron nodes is squall.sciclone.wm.edu. These nodes are also referred to as "compile servers", "file servers", "front ends", "host nodes", or simply monsoon and squall. All of these terms are used interchangeably.

Two additional login servers are available for users involved in specific projects requiring access to certain node-locked software packages or the Oracle database server. These are maelstrom.sciclone.wm.edu (UltraSPARC) and mistral.sciclone.wm.edu (Opteron).

Unless you have made special arrangements for dedicated time, all access to nodes other than monsoon, maelstrom, squall, and mistral should be through the PBS job scheduler. This topic is discussed in more detail in the section on Running Programs.

Changing Your Password

You will be issued a default password by telephone when your SciClone account is created. To change it, you must log in to monsoon.sciclone.wm.edu and run the passwd command. Changes will automatically propagate to all of the login servers within a minute or so. If you change your password on any node other than monsoon, the changes will not be permanent, and your password will eventually revert to its previous setting.

Architecture Overview

A key feature of SciClone is its heterogeneous architecture, which provides both flexibility for applications and a controlled environment for studying the complex issues which arise in larger distributed systems. Specifically, SciClone's heterogeneity arises from its use of multiple processor configurations (single-, dual-, and quad-cpu nodes, each at different clock rates), multiple networking technologies (Fast Ethernet, Gigabit Ethernet, and Myrinet), and its organization as a "cluster of clusters".

SciClone features sixteen different node configurations, organized into eight distinct subclusters which can be used individually or in combination. The node types and subclusters are summarized in the tables below.
See the Hardware Component List for detailed specifications. Nodes can be further classified as either "server" nodes or "compute" nodes, depending on their intended uses. As the name implies, server nodes provide specialized functions for the cluster as a whole, while compute nodes are dedicated to running users' jobs. Application programs should not run on server nodes unless they have a specific need to do so.
SciClone features a rich but complex networking environment. For a schematic, refer to the SciClone Architecture Diagram. The following table summarizes the networks used in SciClone.
SciClone's front end nodes, monsoon and squall, have interfaces on multiple networks, as do many of the other nodes. From outside the cluster, use the hostname monsoon.sciclone.wm.edu (a.k.a. sciclone.wm.edu) or squall.sciclone.wm.edu to log in via the Savage House Fast Ethernet network. From other nodes within the cluster, use ms00.sciclone.wm.edu or sq00.sciclone.wm.edu (or just ms00 and sq00) to reach the front end nodes via the cluster's internal network. If you use the monsoon or squall addresses from within the cluster, you'll get routed outside to the building network and then back in via the external interface, a longer and slower path. A complete list of hostnames and their corresponding network addresses is available in /etc/hosts on the front end.

When communication is routed via Myrinet, some of the distinctions between subclusters become blurred. In particular, the gulfstream and tornado subclusters plus the node nws01 can be thought of as a unified 39-node, 78-processor "metacluster". Although some details differ, these nodes have identical processor and memory configurations and share the same Myrinet-1280 network. However, I/O intensive processes should be placed on nws01 or one of the gulfstream nodes to take advantage of Gigabit Ethernet links to the servers. Nodes could be allocated from this combined pool via a PBS node specification such as "ultra60:myri:ppn=2".

Similarly, nws02 can be used to augment the twister subcluster, using the Myrinet-2000 network for communication. In this case a node spec might look like "f280r:myri2:ppn=2" and nws02 would be the best place to locate I/O intensive processes.
Note that applications are free to combine nodes from multiple subclusters in any way they see fit, but in the general case, differences in processor speeds, communication interfaces, memory capacities, and disk performance pose difficult load balancing problems which can lead to very inefficient use of computational resources unless specialized parallel algorithms are employed.

Filesystems

When a user account is installed, subdirectories are created in the following filesystems:
Symlinks in each user's home directory point to the preconfigured global (~/scr*) and local (~/lscr*) scratch directories and the QFS filesystem (~/qfs00). /root, /usr, and /var filesystems are local to each node; /opt, /usr/local, and /import reside on the front end and are exported to each node via NFS.

Sun's StorEdge QFS is a high-performance, high-capacity SAN-based shared filesystem which is optimized for bulk accesses on large sequential files. The /sciclone/qfs00 filesystem spans 46 disk drives distributed across five separate RAID arrays, with a total formatted capacity of 5.3 terabytes. The disk arrays are connected to each other and to the SciClone servers via a Fibre Channel Storage Area Network, or SAN. Individual files exceeding a terabyte in size can be accommodated, space permitting.

QFS allocates disk space in large blocks, and is therefore rather inefficient for collections of small files, such as source code. As a rule of thumb, files which are stored in SciClone's QFS filesystem should have an average size of 1 MB or larger. Smaller files should be stored in home directories (/sciclone/home*) or scratch filesystems (/sciclone/scr*). Space allocation in /sciclone/qfs00 is monitored and you may be notified if you have too many small files stored there. Note that small files can be aggregated into larger ones with Unix utilities such as ar, tar, cpio, and zip. ar is particularly useful since it maintains an internal index structure which supports the addition, deletion, replacement, and retrieval of individual members of the archive.

I/O performance with QFS can be improved (sometimes dramatically) by reading and writing data in large blocks that match the blocking factor of the filesystem. In our tests, a block size of 65536 bytes (64 KB) was near optimal, yielding write speeds in excess of 100 MB/s from monsoon and maelstrom.

QFS is a shared filesystem, meaning that it can be mounted and accessed simultaneously by multiple servers.
It allows simultaneous reads from the same or different files and simultaneous writes to different files. QFS can also be configured to allow simultaneous writes to the same file from applications which have been designed to perform page-aligned I/O. This option is not presently enabled on SciClone, but could be if the need arose.

The /sciclone/qfs00 filesystem is mounted directly on monsoon and maelstrom, and is exported from there via NFS to all other nodes in the cluster. Even-numbered nodes mount qfs00 from maelstrom and odd-numbered nodes from monsoon. This spreads the NFS load across both servers, resulting in higher throughput and less resource contention. Note that applications which generate a lot of output will still get better performance by writing files to the local scratch partitions and then copying them back to the servers with rcp, rather than relying on NFS. The even/odd strategy would be helpful in this scenario, too.

Backups

Home directories (/sciclone/home00 and /sciclone/home10) are normally backed up several times per week. QFS directories (/sciclone/qfs00) are backed up approximately every three days. Scratch directories are not backed up at all. Furthermore, files in any of the scratch partitions which have not been used in the past 30 days will be deleted automatically in order to maintain sufficient free space for active projects.

Shell Environment

A default .cshrc file is provided in each user's home directory. If you modify it, be sure you know what you're doing: SciClone's hardware and software environment is considerably more complex than the typical UNIX workstation environment. The default configuration enables a 32-bit environment for compiling and linking, with LAM as the default MPI package. To use alternative communication packages such as MPICH or MPICH-GM, settings of various environment variables will need to be changed as documented within the .cshrc file.
Copies of the current recommended configuration files can be found in

Four UNIX groups are established for each organization or department which has users on SciClone. These four groups correspond to the user's status within the organization, i.e. faculty/staff, graduate student, undergraduate student, and "other". For example, a professor from the William and Mary Computer Science Department would be assigned to the group "csf", while a CS undergraduate would be assigned to the group "csu".

Default file access permissions are set so that a user's files and directories are read-write for the user, read-only for the user's group, and unreadable by anyone else. If this is not appropriate for your situation, you should change the default umask setting in your .cshrc file, or set file access modes on a case-by-case basis. Note that SciClone's filesystems are exported to other computers within the Computational Science Cluster, and may therefore be visible beyond the SciClone user community.

Compilers and Libraries

Both Sun and GNU compilers are installed on the system. See the Software section of the SciClone web site for more information about what's available and how to use it. Unless otherwise noted, all of the third-party software which is installed on SciClone has been built using Sun's compilers. For maximum performance and to ensure compatibility with system libraries, we strongly recommend the use of Sun's compilers whenever possible.

Compiler Options and Code Optimization

When compiling and optimizing code for use on SciClone, care must be taken to ensure that the resulting executables will perform properly on the type of nodes on which they will be executed. The situation is further complicated by the fact that SciClone includes two distinct and incompatible families of processor architecture, Sun UltraSPARC and Intel/AMD x86/amd64. Even within a particular architecture family, there are variations in processor capabilities among the different types of nodes.
By choosing appropriate compiler options, users can compile their applications for portability across an entire architecture family, or optimize performance for a particular type of node. With the current set of hardware, six different instruction set architectures (ISAs) are of interest:
To achieve portability along with performance on UltraSPARC nodes, the following set of options is suggested:
For best performance on AMD Opteron nodes, the following options provide a good starting point for further experimentation:
By default, -fast generates code for the type of processor on which the compiler is running, which may or may not be compatible with the target execution processor. This is why it is important to explicitly specify the desired ISA with the -xarch option. Note that the order is important here: the -xarch, -xchip, and -xcache options must come after -fast. Opteron processors support several different "memory models", depending on the intended use of the resulting code. For 64-bit addressing on SciClone's C8, S4, and S4A nodes, specify the "medium" memory model, for example:
To optimize for a specific type of node, you can specify the characteristics of the CPU (-xchip) and cache (-xcache) in addition to the ISA. The following table summarizes the options by node type.
So, for example, to optimize code for an Ultra 5 node, the following compiler options could be used:
To target a Sun Fire 280R, use:
For an Opteron-based node, use:
To address more than 2 GB of memory in a single process, use v9a, v9b, or amd64a instead of v8plusa, v8plusb, or sse2a, respectively:
Note that 64-bit addressing is generally of interest only for the C4, C7, C7A, C8, DB1, S3, S4, and S4A nodes, which each have 4 GB or more of physical memory installed on them; all other node types have 2 GB or less. Programs compiled with v9a, v9b, or amd64a will not work with libraries which have been compiled for 32-bit addressing. Many of the performance-critical software packages installed on SciClone are compiled for all six ISAs, and the desired version can automatically be selected via the XARCHULTRA and XARCHX86 environment variables in the user's ~/.cshrc file. Consult the documentation for individual software packages to see which versions are available. Note that the optimizations invoked by -fast may be too aggressive for some codes, with the potential for unintended or incorrect results. If you suspect this is a problem, you could try a lower optimization level, e.g.:
or you could leave off -fast entirely. If -fast is not used, it may be necessary to use

Code optimization is a complex topic, and the use of any given option may help one routine but hinder another, so some experimentation is in order. Consult Sun's compiler documentation for full details, including information about many options not mentioned here.

Parallel Programming Tools

Although SciClone's ability to run many serial jobs concurrently is useful for some applications, bringing the full power of the system to bear on a single computation requires the use of parallel programming techniques. A variety of tools are available to assist in the development of parallel programs.

Many of the nodes in SciClone include more than one CPU, so effective use implies that all of the CPUs should be kept busy. If an application will fit on a single node, then the use of shared memory programming techniques may be the simplest way to boost performance on multiprocessor (SMP) nodes. There are several approaches for exploiting parallelism in shared memory environments, including automatic parallelization, compiler directives, thread libraries, system-level interprocess communication (IPC) services, and message passing.

Sun's compilers support automatic parallelization of well-behaved loop constructs via the -xautopar (C) and -autopar (Fortran) options. These are described in detail in the compiler manuals. Sometimes the compiler's ability to detect and exploit parallelism can be enhanced with straightforward changes to the code which eliminate dependencies or simplify the control flow. In other cases, the programmer needs to convey additional information to the compiler in the form of directives (Fortran) or pragmas (C/C++). Among other things, directives and pragmas can be used to give the compiler hints about loops that can or cannot be safely parallelized.
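To make the point about eliminating dependencies concrete, here is a minimal sketch in C. The function names are hypothetical and the example is illustrative only; whether a given compiler actually parallelizes the second loop depends on its analysis and options. The first loop carries the running value x from one iteration to the next, so iterations cannot be reordered. The second computes each element from the loop index alone, which removes the loop-carried dependency.

```c
#include <assert.h>
#include <stddef.h>

/* Loop-carried dependency: iteration i needs the value of x
 * produced by iteration i-1, so the iterations must run in order. */
void fill_serial(double *a, size_t n, double start, double step)
{
    size_t i;
    double x = start;

    assert(a != NULL || n == 0);
    for (i = 0; i < n; i++) {
        a[i] = x;
        x += step;
    }
}

/* Same sequence, but each element depends only on the index i,
 * so the iterations are independent of one another. */
void fill_independent(double *a, size_t n, double start, double step)
{
    size_t i;

    assert(a != NULL || n == 0);
    for (i = 0; i < n; i++)
        a[i] = start + (double)i * step;
}
```

Note that the two versions can differ in the last bits of floating-point rounding for values of step that are not exactly representable, since one accumulates additions and the other multiplies.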
Sun's C, C++, and Fortran compilers support the OpenMP 2.5 API, a mix of directives and library calls which support a fork-join model of parallel execution. For more information on using OpenMP, refer to Chapter 3 in the C User's Guide, Chapter 10 of the Fortran Programming Guide, Appendix D of the Fortran User's Guide, and the OpenMP API User's Guide.

At a coarser level of granularity, entire programs, or major sections of them, can be structured as independent threads of control, each of which can potentially run on a separate CPU. Solaris supports two different thread packages, known as Solaris threads and Posix threads (or pthreads). Sun's Multithreaded Programming Guide covers both of these packages in detail. Additional information on multithreading for C++ applications can be found in Chapter 11 of the C++ User's Guide.

Java includes threads as a fundamental part of the language. Although not well suited for high performance computing due to its resource requirements and the interpretive nature of the language, Java may nonetheless be of interest for certain applications. Java programmers should consult the Java 2 SDK documentation.

Threads provide a lightweight mechanism for exploiting parallelism within the context of a single UNIX process. Parallelism can also be obtained by running several distinct processes at the same time. Like all UNIX variants, Solaris provides a number of system services which facilitate communication between processes. These include pipes, message queues, semaphores, shared memory segments, signals, sockets, and memory-mapped files. These facilities may be used directly or as the basis for process-to-process communication in higher level libraries such as MPI. Interprocess communication (IPC) facilities may be the mechanism of choice for applications which bring together several programs with different functionality. An overview of IPC services in Solaris can be found in the Programming Interfaces Guide.
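As a small illustration of the process-based approach, the following sketch uses two of the IPC services listed above, fork and a pipe, to compute a sum in a child process and hand the result back to the parent. The function name sum_in_child is hypothetical, and the example relies only on standard POSIX calls; it is a sketch of the pattern, not a recommended design for production codes.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Sum v[0..n-1] in a child process and return the total to the
 * parent through a pipe. Returns 0 on success, -1 on failure. */
int sum_in_child(const int *v, size_t n, long *result)
{
    int fd[2];
    int status;
    pid_t pid;
    ssize_t got;

    assert(result != NULL);
    if (pipe(fd) == -1)
        return -1;

    pid = fork();
    if (pid == -1) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: compute the sum and write it to the pipe. */
        long sum = 0;
        size_t i;
        ssize_t put;

        close(fd[0]);
        for (i = 0; i < n; i++)
            sum += v[i];
        put = write(fd[1], &sum, sizeof sum);
        close(fd[1]);
        _exit(put == (ssize_t)sizeof sum ? 0 : 1);
    }

    /* Parent: read the child's answer, then reap the child. */
    close(fd[1]);
    got = read(fd[0], result, sizeof *result);
    close(fd[0]);
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    if (got != (ssize_t)sizeof *result)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

The child exits with _exit rather than exit so that it does not flush any stdio buffers inherited from the parent.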
In most cases, the message-passing communication libraries described in the next section can also be used in shared memory environments, sometimes quite efficiently. The LAM/MPI library, in particular, exhibits very low overheads on shared memory nodes. While the message-passing paradigm often requires more programming effort than some of the simpler shared memory schemes, it is more portable, allowing a single application to run in shared memory, distributed memory, or mixed environments.

Distributed Memory Programming

Even the most powerful multiprocessor nodes on SciClone provide only a fraction of the aggregate system resources (about 1% of the total CPU power, 2% of the memory, and 1% of the disk capacity). To truly take advantage of the system, it is necessary to build distributed memory applications that can bring many nodes to bear on a single computation. Although lower-level system services such as sockets or remote procedure calls are sometimes used to build distributed applications, most scientific programmers working on SciClone will want to use MPI, the de facto standard for message passing on distributed-memory parallel architectures. SciClone currently supports three different MPI implementations: LAM, MPICH, and MPICH-GM. Additional communication packages are expected to be available in the future.

The shared memory and distributed memory approaches can be combined in applications which run on multiple SMP nodes. This can be useful if the computation exhibits parallelism at several different levels, for example loop-level parallelism within coarse-grained tasks. Shared memory constructs may offer performance advantages over message passing for local communication within SMP nodes, although in some cases the reverse may also be true. Whether the performance benefits of mixed-mode programming are worth the extra complexity seems to depend heavily on the characteristics of the application.
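The loop-level parallelism mentioned above is the kind of work OpenMP handles within a single SMP node. The sketch below shows a dot product with an OpenMP reduction; the function name dot is hypothetical. When OpenMP support is not enabled at compile time the pragma is ignored and the loop simply runs serially, producing the same result. The enabling flag is compiler-specific (to the best of our knowledge -xopenmp for Sun's compilers and -fopenmp for GCC), so treat that as an assumption and check the compiler manuals.

```c
#include <assert.h>

/* Dot product of x and y, each of length n. With OpenMP enabled,
 * iterations are divided among threads and the per-thread partial
 * sums are combined by the reduction clause. */
double dot(const double *x, const double *y, long n)
{
    double sum = 0.0;
    long i;

    assert(n >= 0);
#pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += x[i] * y[i];

    return sum;
}
```

In a mixed-mode application, a routine like this would typically be called by each MPI process on its local portion of the data, with message passing used to combine the per-node results.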
Running Programs

To provide conflict-free access to SciClone's computational resources, node allocation and job scheduling services are provided by OpenPBS, the freeware version of the Portable Batch System (PBS). To avoid interfering with PBS jobs, all access to both compute and server nodes, including interactive shell sessions, must be initiated through PBS. The only exceptions are:
Direct rlogin/slogin/telnet access to individual compute nodes is disabled, and all rsh and rcp commands which reference the nodes should be submitted via PBS jobs. Stray processes on the nodes (i.e., those not belonging to an active PBS job) are subject to termination without warning.

In some cases interactive access to compute nodes is required. Examples include software packages with graphical user interfaces (e.g., MATLAB or various visualization systems), debugging, etc. PBS has a special interactive mode (described in more detail below) which provides this capability, including forwarding of X11 sessions to the user's workstation via SSH.

Server Nodes vs. Compute Nodes

As discussed in the Architecture Overview, nodes in the SciClone cluster fall into one of two categories, server nodes or compute nodes. While compute nodes are intended to provide dedicated computational resources for one or more jobs, server nodes provide services for the system as a whole. Thus PBS jobs which run on server nodes can adversely impact the performance of the whole cluster, and will themselves be impacted by other activities on the system. Most jobs should therefore specifically request to run on compute nodes, as explained in the following sections.

Nevertheless, there may be circumstances in which jobs with special requirements will need to create processes on server nodes. Our PBS configuration currently allows this, and, in fact, will allocate server nodes to jobs if (1) server nodes are specifically requested by the job, or (2) the job does not specifically request compute nodes and no other resources are currently available to satisfy the request. This latter case is designed to improve turnaround for small jobs when the system is otherwise saturated. Users should feel free to take advantage of this when it is really needed (for example, deadlines for class projects or conference and journal submissions), but should not use it routinely.
Because server nodes also host SciClone's global filesystems (/sciclone/home*, /sciclone/scr*, /sciclone/qfs00), they are also the most efficient place to locate processes that perform large amounts of I/O against these files. In this case an appropriate PBS node specification can be used to place an I/O process on the server which physically hosts the filesystem of interest.

Although SciClone's server nodes (monsoon, maelstrom, squall, mistral, tempest, hurricane, zephyr) all have dual processors, PBS is allowed to use only one processor per server node. This leaves the other processor free to provide system-wide services such as compilation, I/O, NFS, DNS, job scheduling, etc.

To avoid overloading the front end nodes, all application programs with non-trivial resource requirements (> 30 secs. CPU time or > 128 MB memory) must be submitted as PBS jobs. Processes which violate this rule may be killed without warning. (Typical code development and job preparation activities, including editing, compilation, make, file manipulation, etc., are specifically allowed to run on monsoon and squall as part of their normal interactive workloads.)

PBS references server nodes via aliases (ms00, ml00, sq00, mt00, tp00, hu00) which map to either Gigabit or 10-Gigabit Ethernet interfaces on SciClone's internal jetstream network.

To use PBS, your search path, man path, library path, and default PBS server must be set correctly in your ~/.cshrc file:

    set path=($path /usr/local/pbs/bin)

If you are using the recommended environment configuration (available in /usr/local/etc/templates/cshrc on sciclone.wm.edu), all of these environment settings will be configured for you automatically.

To accommodate heterogeneous environments (such as SciClone), PBS allows an arbitrary set of node properties to be assigned to each node. These properties may be appended to node allocation requests to constrain the set of processors which may be used to run the job.
Node properties for SciClone are listed in the following table:
The next table defines all of the node properties listed in the table above: