
College of Arts & Sciences
Computing and Information Technology


Pople User Guide

System Overview

The Pople Linux Cluster consists of 18 compute nodes, each with two six-core processors, for a total of 216 cores. It is configured with 432 GB of total memory and 11 TB of shared disk space. The theoretical peak compute performance is 4.9 TFLOPS. The system supports an 11 TB global NFS file system. Nodes are interconnected with InfiniBand technology in a fat-tree topology with 40 Gbit/s point-to-point bandwidth.

All Pople nodes run the Linux CentOS 6.8 OS and support batch services through SGE 6.2. Global, data-intensive I/O is supported by a Lustre file system, while home directories are served by an NFS file system with global access. Other than traffic to the $HOME directory file system, all inter-node communication (MPI) travels over a Mellanox InfiniBand network. The configuration and features of the compute nodes, interconnect, and I/O systems are described and summarized below.

There are 18 nodes; each blade houses two six-core Westmere processors, 24 GB of memory, and a 600 GB disk.

System Name: Pople
Host Name: pople.psc.sc.edu
IP Address: 129.252.37.178
Operating System: Linux
Number of Processors: 216 (compute)
CPU Type: Hex-core Xeon 5660 processors (2.8 GHz)
Total Memory: 432 GB
Peak Performance: 4.9 TFLOPS
Total Disk Space: 11 TB (shared)
Primary Interconnect: QDR InfiniBand @ 40 Gbit/s

Compute Nodes

A regular node consists of an HP SL390 G7 blade running the 2.6 x86_64 Linux kernel from kernel.org. Each node contains two Intel Xeon hex-core 64-bit Westmere processors (12 cores in all) on a single board, operating as an SMP unit. The core frequency is 2.8 GHz; each core supports 4 floating-point operations per clock period, giving a peak performance of 11.2 GFLOPS/core, or 134 GFLOPS/node. Each node contains 24 GB of memory (2 GB/core). The memory subsystem has 3 channels from each processor's memory controller to 3 DDR3 ECC DIMMs running at 1333 MHz. The processor interconnect, QPI, runs at 6.4 GT/s.

Component | Technology
Sockets per Node / Cores per Socket | 2 / 6
Motherboard | HP SL390, Intel 5520 chipset
Memory per Node | 24 GB (6 x 4 GB), 3 channels, DDR3-1333 MHz
Processor Interconnect | 2x QPI, 6.4 GT/s
PCI Express | 36 lanes, Gen 2
Disk | 146 GB, 10K RPM SAS-SATA

Scientific Application Software:

  • ADF/2010.02
  • ADF/2012.01
  • ADF/2013.01(default)
  • AMBER/11(default)
  • AMBERTOOLS/13(default)
  • AUTODOCK/4.2.5.1
  • AUTODOCKVINA/1.1.2_x86
  • GROMACS/4.5.5(default)
  • GROMACS/4.6
  • INTEL Compiler/11.1-059
  • INTEL Compiler/11.1-072
  • INTEL Compiler/12.0.4
  • INTEL Compiler/12.1.1
  • INTEL Compiler/12.1.3(default)
  • INTELMPI/4.0
  • INTELMPI/4.1(default)
  • LAMMPS/7.2(default)
  • MPICH2/3.0.2(default)
  • NAMD/2.9(default)
  • OPENMPI/143-intel(default)
  • OPENMPI/161-gcc
  • OPENMPI/161-intel
  • PGI/12.1(default)
  • Scienomics/3.4
  • SPARTAN/2010
  • QCHEM
  • VMD/1.9(default)

Interconnect

The InfiniBand (IB) interconnect topology is a Clos fat tree (no oversubscription). Each of the 16 blades in a chassis is connected to a Voltaire InfiniBand switch blade within the chassis. Even with the I/O servers connected to the switches, about 25% of the switch capacity remains for system growth. Any processor is only one hop away from any other processor.

File Systems

Pople storage includes two 300 GB SAS drives in a RAID configuration on each node. $HOME directories are NFS-mounted to all nodes. The $SCRATCH file system, also accessible from all nodes, is a parallel file system served over NFS. Archival storage is not directly available from the login node, but it is accessible through scp.

The RCC HPC platforms have several different file systems with distinct storage characteristics. There are predefined, user-owned directories in these file systems for users to store their data. Of course, these file systems are shared with other users, so they are managed by either a quota limit, a purge policy (time-residency) limit, or a migration policy.

The $WORK and $SCRATCH directories on Pople are network file systems. They are designed for parallel and high performance data access from within applications. They have been configured to work well with MPI-IO, accessing data from many compute nodes.

Home directories use the NFS (Network File System) protocol, and their file systems are designed for smaller and less intense data access: a place for storing executables and utilities. Use MPI-IO only on the $WORK and $SCRATCH file systems; your $HOME directory does not support parallel I/O with MPI-IO.

To determine the amount of disk space used in a file system, cd to the directory of interest and execute the df -k . command, including the dot that represents the current directory. Without the dot all file systems are reported.

In the command output below, the file system name appears on the left (an IP number followed by the file system name); the used and available space (in 1 KB units, from the -k option) appear in the middle columns, followed by the percent used and the mount point:

  • df -k 
  • Filesystem 1K-blocks Used Available Use% Mounted on

To determine the amount of space occupied in a user-owned directory, cd to the directory and execute the du command with the -sh option (s = summary, h = human-readable units):

  • du -sh

To determine quota limits and usage on $HOME, execute the quota command without any options (from any directory).
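Putting these commands together, a typical check of your own usage might look like the following (the sequence is only illustrative):

  • cd $HOME && df -k .    # free space on the file system that holds $HOME
  • du -sh $HOME           # space used by your own files
  • quota                  # quota limits and current usage on $HOME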

The major file systems available on Pople are:

$HOME

  • At login, the system automatically sets the current working directory to your home directory.
  • Store your source code and build your executables here.
  • This file system is backed up.
  • The frontend nodes and any compute node can access this directory.
  • Use $HOME to reference your home directory in scripts.
  • Use cd /home/$USERNAME to change to $HOME.

NOTE: RCC staff may delete files from $WORK if the file system becomes full. A full file system inhibits use of the file system for everyone. The use of programs or scripts to actively circumvent the file purge policy will not be tolerated.

$SCRATCH

  • Change to this directory in your batch scripts and run jobs in this file system (see the sketch after this list).
  • The scratch file system is approximately 11 TB.
  • This file system is not backed up.
  • The frontend nodes and any compute node can access this directory.
  • Purge Policy: Files that have not been accessed for more than 10 days are purged.
  • Use $SCRATCH to reference this directory in scripts.
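A minimal sketch of how a batch job might use $SCRATCH (the file and directory names here are illustrative, not Pople conventions):

  • cd $SCRATCH                      # run the job in the scratch file system
  • cp $HOME/input.dat .             # stage input data from $HOME
  • ./a.out input.dat > output.log   # run the executable
  • cp output.log $HOME/results/     # copy results back before they are purged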

NOTE: RCC staff may delete files from $SCRATCH if the file system becomes full, even if files are less than 10 days old. A full file system inhibits use of the file system for everyone. The use of programs or scripts to actively circumvent the file purge policy will not be tolerated.

System Access

To ensure a secure login session, users must connect to machines using the secure shell (ssh) program. Data movement must be done using the secure commands scp and sftp.

Before any login session can be initiated using ssh, a working SSH client needs to be present on the local machine. Wikipedia is a good source of information on SSH in general and provides information on the various clients available for your particular operating system.

Do not run the optional ssh-keygen command to set up Public-key authentication. This option sets up a passphrase that will interfere with submitting job scripts to the batch queues. If you have already done this, remove the .ssh directory (and the files under it) from your home directory. Log out and log back in to test.

To initiate an ssh connection to a Pople login node from a UNIX or Linux system with ssh already installed, execute the following command:

  • login1$ ssh userid@pople.psc.sc.edu
Note: userid is needed only if the user name on the local machine and the RCC machine differ.
Password changes should comply with practices presented in the RCC Password Guide.
Computing Environment
Unix Shell
The most important component of a user's environment is the login shell that interprets text on each interactive command line and statements in shell scripts. Each login has a line entry in the /etc/passwd file, and the last field contains the shell launched at login. To determine your login shell, use:
  • echo $SHELL
You can use the chsh command to change your login shell. Full instructions are in the chsh man page. Available shells, along with their full paths, are defined in the /etc/shells file.
To display the list of available shells with chsh and change your login shell to bash, execute the following:
  • chsh -l
  • chsh -s /bin/bash
Environment Variables
The next most important component of a user's environment is the set of environment variables. Many of the UNIX commands and tools, such as the compilers, debuggers, profilers, editors, and just about all applications that have GUIs (Graphical User Interfaces), look in the environment for variables that specify information they may need to access. To see the variables in your environment execute the command:
  • env
The variables are listed as keyword/value pairs separated by an equal (=) sign, as illustrated below by the $HOME and $PATH variables.
  • HOME=/home/username
  • PATH=/bin:/usr/bin:/usr/local/apps:/share/apps/intel/bin
Notice that the $PATH environment variable consists of a colon (:) separated list of directories. Variables set in the environment (with setenv for C shells and export for Bourne shells) are carried to the environment of shell scripts and new shell invocations, while normal shell variables (created with the set command) are useful only in the present shell. Only environment variables are displayed by the env (or printenv) command. Execute set to see the (normal) shell variables.
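For example, to make a variable available to child shells and scripts (the variable name MYDATA is illustrative):
  • export MYDATA=$HOME/data     # Bourne-type shells (sh, bash)
  • setenv MYDATA $HOME/data     # C-type shells (csh, tcsh)
  • set localvar = temp          # a normal (non-environment) csh variable, visible only in the current shell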
Startup Scripts
All UNIX systems set up a default environment and provide administrators and users with the ability to execute additional UNIX commands to alter the environment. These commands are sourced. That is, they are executed by your login shell, and the variables (both normal and environmental), as well as aliases and functions, are included in the present environment.
Basic site environment variables and aliases are set in the following files:
  • /etc/csh.cshrc {C-type shells, non-login specific}
  • /etc/csh.login {C-type shells, specific to login}
  • /etc/profile {Bourne-type shells}

RCC coordinates the environments on several systems. In order to efficiently maintain and create a common environment among these systems, RCC uses its own startup files in /usr/local/etc.
A corresponding file in this directory is sourced by the startup scripts that reside in your home directory.
 
For historical reasons, the C based shells (csh, tcsh, etc.) source two types of files. The .cshrc type files are sourced first. These files are used to set up the execution environment used by all scripts and for access to the machine without an interactive login. For example, the following commands execute only the .cshrc type files on the remote machine:
  • scp data pople.psc.sc.edu:
  • ssh pople.psc.sc.edu date
The .login type files set up environment variables that accounts commonly use in an interactive session. They are sourced after the .cshrc type files.
The commands in the /etc files above set up operating system interaction and the initial PATH, ulimit, umask, and environment variables such as HOSTNAME. They also source command scripts in /etc/profile.d: /etc/csh.cshrc sources the files ending in .csh, and /etc/profile sources the files ending in .sh. Many site administrators use these scripts to set up the environments for common user tools (vim, less, etc.) and system utilities (ganglia, modules, Globus, LSF, etc.).
Long-time users of RCC computers have controlled their initial environment with the system-supplied .profile (bash) or .login and .cshrc (csh) files, placing their personal setup in .profile_user or .login_user, respectively. We will continue to support sourcing these files. However, this old system is no longer required: all the required setup has been moved to system files, so users are free to maintain their own startup files.
For Bash users we recommend:
  • ~/.profile (only executed by login shells)
  • if [ -f ~/.bashrc ]; then
  • . ~/.bashrc
  • fi
This will source the user's ~/.bashrc file if it exists.
Transferring Files to Pople
Data Transfer Methods/Software
Two utilities, bbcp and globus-url-copy, can be used to achieve higher performance than the rcp and scp programs when transferring large files between RCC clusters and the RCC archive (Ranch). During production, scp and rcp speeds between Ranger and Ranch average about 15 MB/s, while bbcp and globus-url-copy speeds are about 125 MB/s. These values vary with I/O and network traffic.
bbcp
The bbcp utility works much the same way as scp; it is available only on RCC machines and includes an option to copy subdirectories. For each transfer command it is necessary to provide your ssh passphrase or login password. You can use the ssh-agent and ssh-add commands to automatically supply passphrases for ssh commands (including bbcp) during your login session. The general bbcp syntax is:
  • bbcp [options] <file or directory> <to Machine>:<relative path>/<file or directory>
For example, the following command transfers <data> to $ARCHIVER as <data>:
  • bbcp <data> ${ARCHIVER}:$ARCHIVE
To transfer <data> to $ARCHIVER and force replacement, use:
  • bbcp -f <data> ${ARCHIVER}:$ARCHIVE/<data>
To transfer directory <dir1> and subdirectories, use:
  • bbcp -r <dir1> ${ARCHIVER}:$ARCHIVE

By default, bbcp does not overwrite files; use the -f option to force replacement. The -r option transfers directory contents recursively. You can also use bbcp to transfer files between Pople, Planck, and ACM-CHEM, as shown in this example:
  • bbcp -f <data> pople.psc.sc.edu:~
Application Development
Programming Models
There are two distinct memory models for computing: distributed-memory and shared-memory. In the former, the Message Passing Interface (MPI) is employed in programs to communicate between processors that use their own memory address space. In the latter, OpenMP (open multiprocessing, abbreviated OMP below) programming techniques are employed for multiple threads (lightweight processes) to access memory in a common address space.
For distributed-memory systems, the single-program multiple-data (SPMD) and multiple-program multiple-data (MPMD) programming paradigms are used. In the SPMD paradigm, each processor core loads the same program image and executes and operates on data in its own address space (different data). This is illustrated in Figure 2. It is the usual mechanism for MPI code: a single executable (a.out in the figure) is available on each node (through a globally accessible file system such as $WORK or $HOME) and launched on each node (through the batch MPI launch command, ibrun a.out).
In the MPMD paradigm, each processor core loads and executes a different program image and operates on different data sets, as illustrated in Figure 2. This paradigm is often used by researchers who are investigating the parameter space (parameter sweeps) of certain models and need to launch tens or hundreds of single-processor executions on different data. (This is a special case of MPMD in which the same executable is used and there is NO MPI communication.) The executables are launched through the same mechanism as SPMD jobs, but a UNIX script is used to assign input parameters for the execution command (through the batch MPI launcher), as sketched below.
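A hedged sketch of such a wrapper script, assuming the launcher exports a per-task rank in an environment variable (here illustratively called TASK_RANK) and that parameter files param.0, param.1, ... have been prepared:
  • #!/bin/bash
  • # wrapper.sh -- illustrative parameter-sweep wrapper; TASK_RANK and the param.* files are assumptions
  • INPUT=param.${TASK_RANK}               # each task picks a different input file
  • ./a.out $INPUT > output.${TASK_RANK}   # same executable, different data
The launcher is then pointed at wrapper.sh instead of a.out, so every core runs the same executable on its own parameter set.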
The shared-memory programming model is used on Symmetric Multi-Processor (SMP) nodes. Each node on this system contains 12 cores with a single 24GB memory subsystem.
The programming paradigm for this memory model is called Parallel Vector Processing (PVP) or Shared-Memory Parallel Programming (SMPP). The former name is derived from the fact that vectorizable loops are often employed as the primary structure for parallelization. The main point of SMPP computing is that all of the processors in the same node share data in a single memory subsystem. There is no need for explicit messaging between processors as with MPI coding.
In the SMPP paradigm either compiler directives (as pragmas in C, and special comments in Fortran) or explicit threading calls (e.g. with Pthreads) are employed. The majority of science codes now use OpenMP directives that are understood by most vendor compilers, as well as the GNU compilers.
In cluster systems that have SMP nodes and a high speed interconnect between them, programmers often treat all CPUs within the cluster as having their own local memory. On a node an MPI executable is launched on each CPU and runs within a separate address space. In this way, all CPUs appear as a set of distributed memory machines, even though each node has CPUs that share a single memory subsystem.
In clusters with SMPs, hybrid programming is sometimes employed to take advantage of higher performance at the node-level for certain algorithms that use SMPP (OMP) parallel coding techniques. In hybrid programming, OMP code is executed on the node as a single process with multiple threads (or an OMP library routine is called), while MPI programming is used at the cluster-level for exchanging data between the distributed memories of the nodes.
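A hedged sketch of building and launching a hybrid code (the source file name, process count, and thread count are illustrative):
  • mpif90 -openmp -O3 -xSSE4.2 -o hybrid.exe hybrid.f90   # compile with both MPI and OpenMP enabled
  • export OMP_NUM_THREADS=12                              # 12 OpenMP threads per MPI task
  • mpirun -np 2 ./hybrid.exe                              # 2 tasks x 12 threads = 24 cores; node placement is left to the launcher/batch system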
The number of applications that benefit from hybrid programming on dual-processor nodes (e.g., on Pople) is very small. The programming and support of hybrid codes is complicated by compiler and platform support of both paradigms. However, with new multi-core, multi-socket commodity systems on the horizon, there may be a resurgence in hybrid programming if these systems provide enhanced performance with SMPP (OMP) algorithms.
For further information, please see the Reference section of this document.
Compiling
 
The Pople programming environment uses the Intel C++ and Intel Fortran compilers by default. This section highlights the important HPC aspects of using the Intel compilers. The Intel compiler commands can be used for both compiling (making ".o" object files) and linking (making an executable from ".o" object files). For information on compiling GPU programs, please see the CUDA and OpenCL sections of this guide under the Tools section.
The Intel Compiler Suite
The Intel compiler version 12.1 is loaded as the default at login. The gcc compiler is also available (Use 'gcc --version' to display version information); we recommend using the Intel suite whenever possible. The Intel suite is installed with the EM64T 64-bit standard libraries and will compile programs as 64-bit applications (as the default compiler mode). Any programs compiled on 32-bit systems need to be recompiled to run natively on Pople. Any pre-compiled packages should be EM64T (x86-64) compiled or errors may occur. Since only 64-bit versions of the MPI libraries have been built on Pople, programs compiled in 32-bit mode will not execute MPI code.
The Intel Fortran compiler command is ifort (use 'ifort -V' for current version information).
Web accessible Intel manuals are available from the Intel website.
Compiling Serial Programs
Compiler | Language | File Extension | Example
icc | C | .c | icc [compiler_options] prog.c
icpc | C++ | .C, .cc, .cpp, .cxx | icpc [compiler_options] prog.cpp
ifort | F77 | .f, .for, .ftn | ifort [compiler_options] prog.f
ifort | F90 | .f90, .fpp | ifort [compiler_options] prog.f90
Appropriate file name extensions are required for each compiler. By default, the executable name is a.out; and it may be renamed with the -o option. To compile without the link step, use the -c option. The following examples illustrate renaming an executable and the use of two important compiler optimizations.
A C program example:
  • icc -o flamec.exe -O3 -xSSE4.2 prog.c
A Fortran program example:
  • ifort -o flamef.exe -O3 -xSSE4.2 prog.f90
Commonly used options may be placed in an icc.cfg or ifc.cfg file for compiling C and Fortran code, respectively.
For additional information, execute the compiler command with the -help option to display all compiler options, their syntax, and a brief explanation; or display the man page, as follows:
  • icc -help
  • icpc -help
  • ifort -help
  • man icc
  • man icpc
  • man ifort
Some of the more important options are listed in the Basic Optimization section of this guide. Additional documentation, references, and a number of user guides (pdf, html) are available in the Fortran and C++ compiler home directories ($IFC_DOC and $ICC_DOC).
Compiling OpenMP Programs
Since each of the HP SL 390 G7 blades (nodes) of the Pople cluster is a Xeon dual-processor system, applications can use the shared memory programming paradigm on the node. However, because of the limited number of processors in each node, the maximum theoretical benefit to using a shared-memory model on the node is a factor of 12.
The OpenMP compiler options are listed in the Basic Optimization section of this guide for those who need SMP support on the nodes. For hybrid programming, use the MPI compiler commands and include the '-openmp' option. An example follows.
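For example, a pure OpenMP build and run with the serial compiler might look like the following (the source file name is illustrative):
  • icc -openmp -O3 -xSSE4.2 -o omp_prog.exe omp_prog.c   # compile with OpenMP directives enabled
  • export OMP_NUM_THREADS=12                             # use all 12 cores of a node
  • ./omp_prog.exe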
Compiling MPI Programs
The mpicmds commands support the compilation and execution of parallel MPI programs for specific interconnects and compilers. At login, the MPI (openmpi) and Intel compiler (intel) modules are loaded to produce the default environment, which provides the location of the corresponding mpicmds.
Compiling Parallel Programs with MPI
The mpicc, mpicxx, mpif77, and mpif90 compiler scripts (wrappers) compile MPI code and automatically link startup and message passing libraries into the executable.
Appropriate file name extensions are required for each wrapper. By default, the executable name is a.out. You may rename it using the -o option. To compile without the link step, use the -c option. The following examples illustrate renaming an executable and the use of two important compiler optimization options.
A C program example:
  • mpicc -o prog.exe  -openmp -O3 -xSSE4.2 prog.c
A Fortran program example:
  • mpif90 -o prog.exe -openmp  -O3 -xSSE4.2 prog.f90
Include linker options, such as library paths and library names, after the program module names, as explained in the Loading Libraries section below. The Running Code section explains how to execute MPI executables in batch scripts and interactive batch runs on compute nodes.
We recommend that you use the Intel compiler for optimal code performance. RCC does not support the use of the gcc compiler for production code on the Pople system. For those rare cases when gcc is required, for either a module or the main program, you can specify the gcc compiler with the mpicc -cc=gcc option for the modules requiring gcc. (Since gcc- and Intel-compiled codes are binary compatible, you should compile all other modules that don't require gcc with the Intel compiler.) When gcc is used to compile the main program, an additional Intel library is required. The examples below show how to invoke the gcc compiler for the two cases:
  • mpicc -O3 -xSSE4.2 -c -cc=gcc suba.c
  • mpicc -O3 -xSSE4.2 mymain.c suba.o
  • mpicc -O3 -xSSE4.2 -c suba.c
  • mpicc -O3 -xSSE4.2 -cc=gcc -L$ICC_LIB -lirc mymain.c suba.o
Compiler Options
Compiler options must be used to achieve optimal performance of any application. Generally, the highest impact can be achieved by selecting an appropriate optimization level, by targeting the architecture of the computer (CPU, cache, memory system), and by allowing for interprocedural analysis (inlining, etc.). There is no set of options that gives the highest speed-up for all applications. Consequently, different combinations have to be explored.
The most basic level of optimization that the compiler can perform is -On options, explained below:
Compiler Optimization Levels
Level (-On) | Description
n = 0 | Fast compilation, full debugging support. Automatically enabled if using -g.
n = 1, 2 | Low to moderate optimization, partial debugging support: instruction rescheduling, copy propagation, software pipelining, common subexpression elimination, prefetching, loop transformations
n = 3 | Aggressive optimization; compile time/space intensive and/or of marginal effectiveness; may change code semantics and results (sometimes even breaks code!). Enables -O2 plus more aggressive prefetching and loop transformations.
The following table lists some of the more important compiler options that affect application performance, based on the target architecture, application behavior, loading, and debugging.
Compiler Options Affecting Performance
Option | Description
-c | Compile the source file only (no link step)
-O3 | Aggressive optimization (-O2 is the default)
-xSSE4.2 | Generates code with streaming SIMD extensions SSE2/3/4 for the Intel Core architecture
-g | Debugging information; generates a symbol table
-fpe0 | Enable floating-point exceptions; useful for debugging
-fp-model <arg> | Control floating-point model variation:
    [no-]except : enable/disable floating-point exception semantics
    fast[=1|2] : enables more aggressive floating-point optimizations
    precise : allows value-safe optimizations
    source : allows intermediates in source precision
    strict : enables fp-model precise and fp-model except, disables contractions, and enables pragma stdc fenv_access
    double : rounds intermediates in 53-bit (double) precision
    extended : rounds intermediates in 64-bit (extended) precision
-ip | Enable single-file interprocedural (IP) optimizations (within files)
-ipo | Enable multi-file IP optimizations (between files)
-opt-prefetch | Enables data prefetching
-opt-streaming-stores <arg> | Specifies whether streaming stores are generated:
    always : enable streaming stores under the assumption that the application is memory bound
    auto : [DEFAULT] compiler decides when streaming stores are used
    never : disable generation of streaming stores
-openmp | Enable the parallelizer to generate multi-threaded code based on the OpenMP directives
-openmp-report[0|1|2] | Controls the OpenMP parallelizer diagnostic level

Basic Optimization for Serial and Parallel Programming using OpenMP and MPI
The MPI compiler wrappers use the same compilers that are invoked for serial code compilation. So, any of the compiler flags used with the icc command can also be used with mpicc; likewise for ifort and mpif90; and icpc and mpicxx. Below are some of the common serial compiler options with descriptions.
More Compiler Options
Compiler Option | Description
-O3 | Performs some compile-time and memory-intensive optimizations in addition to those executed with -O2, but may not improve performance for all programs
-vec_report[0|...|5] | Controls the amount of vectorizer diagnostic information
-xSSE4.2 | Includes specialized code for the SSE4 instruction set
-fast | NOT RECOMMENDED - static load
-g -fp | Generates debugging information; disables use of EBP as a general-purpose register
-openmp | Enables the parallelizer to generate multi-threaded code based on the OpenMP directives
-openmp_report[0|1|2] | Controls the OpenMP parallelizer diagnostic level
-help | Lists options
Developers often experiment with the following options: -pad, -align, -ip, -no-rec-div and -no-rec-sqrt. In some codes performance may decrease. Please see the Intel compiler manual (below) for a full description of each option.
Use the -help option with the mpicmds commands for additional information:
  • mpicc -help
  • mpicxx -help
  • mpif90 -help
  • mpirun -help
Use the options listed for mpirun with the ibrun command in your job script. For details on the MPI standard, go to www.mcs.anl.gov/mpi.
Libraries
Some of the more useful load flags/options are listed below. For a more comprehensive list, consult the ld man page.
Use the -l loader option to link in a library at load time: e.g. ifort prog.f90 -l<name>
This links in either the shared library libname.so (the default) or the static library libname.a, provided it can be found in the loader's library search path or in the directories listed in the LD_LIBRARY_PATH environment variable.
To explicitly include a library directory, use the -L option, e.g. ifort prog.f -L/mydirectory/lib -l<name>
In the above examples, the user's libname.a library is not in the default search path, so the -L option is specified to point to the libname.a directory. (Only the library name is supplied in the -l argument; remove the lib prefix and the .a suffix.)
 
Many of the modules for applications and libraries, such as the mkl library module, provide environment variables for compiling and linking commands. Execute the module help module_name command for a description, listing, and use cases of the assigned environment variables. The following example illustrates their use for the mkl library:
  • mpicc  -mkl=sequential mkl_test.c
Here, the module-supplied environment variables RCC_MKL_LIB and RCC_MKL_INC contain the MKL library and header directory paths, respectively. The loader option -Wl,-rpath specifies that the $RCC_MKL_LIB directory should be recorded in the binary executable. This allows the run-time dynamic loader to determine the location of shared libraries directly from the executable instead of from LD_LIBRARY_PATH or the ld.so dynamic cache of bindings between shared libraries and directory paths, and it avoids having to set LD_LIBRARY_PATH (manually or through a module command) before running the executables. (This simple load sequence will work for some of the unthreaded MKL functions; see the MKL Library section for using the various packages within MKL.)
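A hedged sketch of a link line that uses these module-supplied variables together with -Wl,-rpath (the MKL library names shown are typical Intel MKL names and may differ on Pople):
  • mpicc -I$RCC_MKL_INC mkl_test.c -L$RCC_MKL_LIB -Wl,-rpath,$RCC_MKL_LIB -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm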
You can view the full path of the dynamic libraries inserted into your binary with the ldd command. The example below shows a partial listing for the a.out binary:
  • ldd a.out
A load map, which shows the library module for each static routine, together with a cross reference (--cref), can be used to validate which libraries are being used in your program. The following example shows that the ddot function used in the mdot.f90 code comes from the MKL library:
  • mpif90 mdot.f90 -mkl=sequential
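One hedged way to produce such a map is to pass the GNU linker's -Map and --cref options through the compiler driver (the map file name is illustrative):
  • mpif90 mdot.f90 -mkl=sequential -Wl,-Map,loadmap.txt,--cref
The cross-reference section of loadmap.txt then lists, for each symbol such as ddot, the library member that provides it.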
Performance Libraries
ISPs (Independent Software Providers) and HPC vendors provide high performance math libraries that are tuned for specific architectures. Many applications depend on these libraries for optimal performance. Intel has developed performance libraries for most of the common math functions and routines (linear algebra, transformations, transcendental, sorting, etc.) for the EM64T architectures. Details of the Intel libraries and specific loader/linker options are given below.
MKL Library
The Math Kernel Library consists of functions with Fortran, C, and C++ interfaces for the following computational areas:
BLAS (vector-vector, matrix-vector, matrix-matrix operations) and extended BLAS for sparse computations
LAPACK for linear algebraic equation solvers and eigensystem analysis
Fast Fourier Transforms
Transcendental Functions
In addition, MKL also offers a set of functions collectively known as the Vector Math Library (VML). VML is a set of vectorized transcendental functions that offer both high performance and excellent accuracy compared to the libm functions (for most of the Intel architectures). The vectorized functions are considerably faster than the standard library functions for vectors longer than a few elements.
To use MKL and VML, first load the MKL module using the command module load intel/12.1. This will set the MKL_LIB and MKL_INC environment variables to the directories containing the MKL libraries and the MKL header files. Below is an example command for compiling and linking a program that contains calls to BLAS functions (in MKL). Note that the library is for use within a single node; hence it can be used by both the serial compilers and the MPI wrapper scripts.
The following C and Fortran examples illustrate the use of the mkl library after loading the mkl module (module load mkl):
  • mpicc myprog.c -mkl=sequential
  • mpif90 myprog.f90 -mkl=sequential
When the MKL libraries are linked explicitly, the linker options --start-group and --end-group resolve dependencies between the libraries enclosed within them; a sketch is shown below. This useful option avoids having to find the correct linking order of the libraries and, in cases of circular dependencies, having to include a library more than once on a single link line.
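A hedged sketch of an explicit link line that uses these options (again, the MKL library names are typical and may differ on Pople):
  • mpif90 myprog.f90 -L$MKL_LIB -Wl,--start-group -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -Wl,--end-group -lpthread -lm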
 
RCC has summarized the MKL document in a quick guide. Assistance in constructing the MKL linker options is provided by the MKL Link Line Advisor utility.
Runtime Environment
Bindings to the most recent shared libraries are configured in the file /etc/ld.so.conf (and cached in the /etc/ld.so.cache file).
cat /etc/ld.so.conf to see the RCC-configured directories, or execute:
  • /sbin/ldconfig -p
to see a list of directories and candidate libraries. Use the -Wl,-rpath loader option or LD_LIBRARY_PATH to override the default runtime bindings.
Debugging
DDT is a symbolic, parallel debugger that allows graphical debugging of MPI applications. For information on how to perform parallel debugging using DDT on Pople, please see the DDT Debugging Guide.
Running Applications
SGE Batch Environment
Batch facilities such as LoadLeveler, LSF, and SGE differ in their user interface as well as the implementation of the batch environment. Common to all, however, is the availability of tools and commands to perform the most important operations in batch processing: job submission, job monitoring, and job control (hold, delete, resource request modification). The following paragraphs list the basic batch operations and their options, explain how to use the SGE batch environment, and describe the queue structure.
New users should visit the SGE wiki and read the first chapter of the Introduction to the N1 Grid Engine 6 Software document. To help those migrating from other systems, a comparison of the IBM LoadLeveler, LSF, and SGE batch options and commands is offered in a Batch Systems Comparison guide.
In addition to the environment variables inherited by the job from the interactive login environment, SGE sets additional variables in every batch session. The following table lists some of the important SGE variables:
SGE Batch Environment Variables
Environment Variable | Contains
JOB_ID | batch job id
TASK_ID | task id of an array job sub-task
JOB_NAME | name the user assigned to the job
Pople Queue Structure
The Pople production queues and their characteristics (wall-clock and processor limits; priority charge factor; and purpose) are listed in the table below. Queues that don't appear in the table (such as systest, sysdebug, and clean) are non-production queues for system and HPC group testing and special support.
SGE Batch Environment Queues
Queue Name | Max Runtime | Max Procs | SU Charge Rate | Purpose
normal | 24 hours | infinite | normal | priority
The latest queue information can be determined using the following commands:
Command | Comment
qconf -sql | Lists the available queues
qconf -sq <queue_name> | The s_rt and h_rt values are the soft and hard wall-clock limits for the queue
SGE provides the qsub command for submitting batch jobs:
  • qsub job_script
where <job_script> is the name of a UNIX format text file containing job script commands. This file should contain both shell commands and special statements that include qsub options and resource specifications. Some of the most common options are described below. Details on using these options and examples of job scripts follow.
List of the Most Common qsub Options
Option | Argument | Function
-q | <queue_name> | Submits to the queue designated by <queue_name>
-pe | <TpN>way <NoN x 12> | Executes the job with <TpN> tasks (cores) per node (the "wayness") and <NoN> x 12 total cores, where <NoN> is the number of nodes (see the example script below)
-N | <job_name> | Names the job <job_name>
-M | <email_address> | Specify the email address to use for notifications
-m | {b|e|a|s|n} | Specify when user notifications are to be sent
-V | (none) | Use the current environment settings in the batch job
-cwd | (none) | Use the current directory as the job's working directory
-j | y | Join stderr output with the file specified by the -o option (don't also use the -e option)
-o | <output_file> | Direct job output to <output_file>
-e | <error_file> | Direct job error to <error_file> (don't also use the -j option)
-A | <project_account_name> | Charges the run to <project_account_name>; used only for multi-project logins. Account names and reports are displayed at login.
-l | <resource>=<value> | Specify resource limits (see the qsub man page)
Options can be passed to qsub on the command line or specified in the job script file. The latter approach is preferable. It is easier to store commonly used qsub commands in a script file that will be reused several times rather than retyping the qsub commands at every batch request. In addition, it is easier to maintain a consistent batch environment across runs if the same options are stored in a reusable job script.
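For reference, the same options can also be given directly on the command line; a hedged example (the queue, core count, and run time are illustrative):
  • qsub -q normal -pe 12way 24 -l h_rt=01:30:00 -N myMPI job_script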
Batch scripts contain two types of statements: special comments and shell commands. Special comment lines begin with #$ and are followed by qsub options. The SGE shell_start_mode has been set to unix_behavior, which means the UNIX shell commands are interpreted by the shell specified on the first line after the #! sentinel; otherwise the Bourne shell (/bin/sh) is used. The job file below requests an MPI job with 24 cores and 1.5 hours of run time:
  • #!/bin/bash
  • #$ -V     #Inherit the submission environment
  • #$ -cwd                # Start job in submission directory
  • #$ -N myMPI     # Job Name
  • #$ -j y    # Combine stderr and stdout
  • #$ -o $JOB_NAME.o$JOB_ID      # Name of the output file (eg. myMPI.oJobID)
  • #$ -pe 12way 24               # Requests 12 tasks/node, 24 cores total
  • #$ -q normal      # Queue name normal
  • #$ -l h_rt=01:30:00          # Run time (hh:mm:ss) - 1.5 hours
  • #$ -M    # Use email notification address
  • #$ -m be              # Email at Begin and End of job
  • set -x     # Echo commands, use set echo with csh
  • mpirun -np $NSLOTS ./a.out      # Run the MPI executable named a.out
If you don't want stderr and stdout directed to the same file, replace the -j option line with a -e option to name a separate output file for stderr (but don't use both). By default, stdout and stderr are sent to <job_name>.o<job_id> and <job_name>.e<job_id>, respectively.
SGE provides several environment variables for the #$ options lines that are evaluated at submission time. The above $JOB_ID string is substituted with the job id. The job name (set with -N) is assigned to the environment variable JOB_NAME. The memory limit per task on a node is automatically adjusted to the maximum memory available to a user application (for serial and parallel codes).
Example job scripts are available online in /share/doc/sge. They include details for launching large jobs, running multiple executables with different MPI stacks, executing hybrid applications, and other operations.
This user guide is continued on Page Two.