Q Logic Switch IB6054601 00 D User Guide

S i m p l i f y  
Q
InfiniPath User Guide  
Version 2.0  
IB6054601-00 D  
Page i  
 
InfiniPath User Guide  
Version 2.0  
Q
Added info about using MPI over uDAPL. Need to load modules  
rdma_cm and rdma_ucm.  
Added section: Error messages generated by mpirun. This explains  
more about the types of errors found in the sub-sections. Also added  
error messages related to failed connections between nodes  
Added mpirun error message about stray processes to error message  
section  
Added driver and link error messages reported by MPI programs  
Added section about errors occurring when different runtime/compile  
time MPI versions are used  
2.0 mpirun incompatible with 1.3 libraries  
Added glossary entry for MTRR  
Added new index entries for MPI error messages format, corrected  
index formatting  
IB6054601-00 D  
Page iii  
 
InfiniPath User Guide  
Version 2.0  
Q
© 2006, 2007 QLogic Corporation. All rights reserved worldwide.  
© PathScale 2004, 2005, 2006. All rights reserved.  
First Published: August 2005  
Printed in U.S.A.  
Page iv  
IB6054601-00 D  
 
Table of Contents  
ipath_etherConfiguration on Fedora and RHEL4 . . . . . . . . . . .  
ipath_etherConfiguration on SUSE 9.3, SLES 9, and SLES 10  
Configuring sshand sshdUsing shosts.equiv . . . . . . . . . . . . . . . . . 2-15  
Process Limitation with ssh. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-17  
2-7  
2-8  
IB6054601-00 D  
Page v  
 
InfiniPath User Guide  
Version 2.0  
Q
Configuring for sshUsing ssh-agent . . . . . . . . . . . . . . . . . . . . . . . .  
The mpihostsFile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-11  
mpirunOptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-14  
3-5  
Page vi  
IB6054601-00 D  
 
InfiniPath User Guide  
Version 2.0  
Q
InfiniPath User Guide  
B-2  
Generating the mpihostsFile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  
C-5  
mpirunInstallation Requires 32-bit Support . . . . . . . . . . . . . . . . . . . .  
IB6054601-00 D  
Page vii  
 
InfiniPath User Guide  
Version 2.0  
Q
OpenFabrics Load Errors If ib_ipathDriver Load Fails . . . . . . . . . . C-10  
InfiniPath ib_ipathInitialization Failure . . . . . . . . . . . . . . . . . . . . . . C-11  
Set $IBPATHfor OpenFabrics Scripts . . . . . . . . . . . . . . . . . . . . . . . . . C-12  
Process Limitation with ssh. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-19  
C.8.11  
Lock Enough Memory on Nodes When Using a Batch Queuing  
Error Messages Generated by mpirun . . . . . . . . . . . . . . . . . . . . . . . C-22  
Check Cluster Homogeneity with ipath_checkout . . . . . . . . . . . . . C-29  
boardversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-31  
ibstatus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-32  
ibv_devinfo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-32  
ident . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-32  
ipath_checkout. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-33  
ipath_control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-34  
ipathbug-helper. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-35  
Page viii  
IB6054601-00 D  
 
InfiniPath User Guide  
Version 2.0  
Q
InfiniPath User Guide  
ipath_pkt_test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-35  
ipathstats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-35  
lsmod . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-36  
mpirun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-36  
rpm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-36  
status_str . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-36  
strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-38  
version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-38  
Figures  
Figure  
Page  
Tables  
Table  
Page  
1-1 PathScale-QLogic Adapter Model Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  
1-3  
IB6054601-00 D  
Page ix  
 
InfiniPath User Guide  
Version 2.0  
Q
Notes  
Page x  
IB6054601-00 D  
 
Section 1  
Introduction  
This chapter describes the objectives, intended audience, and organization of the  
InfiniPath User Guide.  
The InfiniPath User Guide is intended to give the end users of an InifiniPath cluster  
what they need to know to use it. In this case, end users are understood to include  
both the cluster administrator and the MPI application programmers, who have  
different but overlapping interests in the details of the technology.  
For specific instructions about installing the InfiniPath QLE7140 PCI Express™  
adapter, the QMI7140 adapter, or the QHT7140 /QHT7040 HTX™ adapters, and  
the initial installation of the InifiniPath Software, see the InfiniPath Install Guide.  
1.1  
Who Should Read this Guide  
ThisguideisintendedbothforreadersresponsibleforadministrationofanInfiniPath  
cluster network and for readers wanting to use that cluster.  
This guide assumes that all readers are familiar with cluster computing, that the  
cluster administrator reader is familiar with Linux administration and that the  
application programmer reader is familiar with MPI.  
1.2  
How this Guide is Organized  
The InfiniPath User Guide is organized into these sections:  
Section 1"Introduction". This section.  
Section 2 “InfiniPath Cluster Administration” describes the lower levels of the  
supplied InfiniPath software. This would be of interest mainly to an InfiniPath  
cluster administrator.  
Section 3 “Using InfiniPath MPI” helps the MPI programmer make best use of  
the InfiniPath MPI implementation.  
Appendix A “Benchmark Programs”  
Appendix B “Integration with a Batch Queuing System”  
Appendix C “Troubleshooting”. The Troubleshooting section provides  
information for troubleshooting installation, cluster administration, and MPI.  
Appendix D “Recommended Reading”  
IB6054601-00 D  
1-1  
 
           
1 – Introduction  
Interoperability  
Q
Appendix E Glossary of technical terms  
In addition, the InfiniPath Install Guide contains information on InfiniPath hardware  
and software installation.  
1.3  
Overview  
The material in this documentation pertains to an InfiniPath cluster. This is defined  
as a collection of nodes, each attached to an InfiniBand™-based fabric through the  
InfiniPath Interconnect. The nodes are Linux-based computers, each having up to  
eight processors.  
The InfiniPath interconnect is InfiniBand 4X, with a raw data rate of 10 Gb/s (data  
rate of 8Gb/s).  
InfiniPath utilizes standard, off-the-shelf InfiniBand 4X switches and cabling.  
InfiniPath OpenFabrics software is interoperable with other vendors’ InfiniBand  
HCAs running compatible OpenFabrics releases. There are two options for Subnet  
Management in your cluster:  
Use the Subnet Manager on one or more managed switches supplied with your  
Infiniband switches.  
Use the OpenSM component of OpenFabrics.  
1.4  
Switches  
The InfiniPath interconnect is designed to work with all InfiniBand-compliant  
switches. Use of OpenSM as a subnet manager is now supported. OpenSM is part  
of the OpenFabrics component of this release.  
1.5  
Interoperability  
InfiniPath participates in the standard InfiniBand Subnet Management protocols for  
configuration and monitoring. InfiniPath OpenFabrics (including IPoIB) is  
interoperablewithothervendorsInfiniBandHCAsrunningcompatibleOpenFabrics  
releases. The InfiniPath MPI and Ethernet emulation stacks (ipath_ether) are not  
interoperable with other InfiniBand Host Channel Adapters (HCA) and Target  
Channel Adapters (TCA). Instead, InfiniPath uses an InfiniBand-compliant  
vendor-specific protocol that is highly optimized for MPI and TCP between  
InfiniPath-equipped hosts.  
1-2  
IB6054601-00 D  
 
             
1 – Introduction  
What’s New in this Release  
Q
NOTE: OpenFabrics was known as OpenIB until March 2006. All relevant  
references to OpenIB in this documentation have been updated to reflect  
this change. See the OpenFabrics website at http://www.openfabrics.org  
for more information on the OpenFabrics Alliance.  
1.6  
What’s New in this Release  
QLogic Corp. acquired PathScale in April 2006. In this 2.0 release, product names,  
internal program and output message names now refer to QLogic rather than  
PathScale.  
The new QLogic and former PathScale adapter model numbers are shown in the  
table below.  
Table 1-1. PathScale-QLogic Adapter Model Numbers  
Former  
PathScale  
Model Number  
New QLogic Model  
Number  
Description  
HT-400  
IBA6110  
Single Port 10GBS InfiniBand to HTX ASIC  
ROHS  
PE-800  
IBA6120  
SinglePort10GBSInfiniBandtox8PCIExpress  
ASIC ROHS  
HT-460  
HT-465  
PE-880  
QHT7040  
QHT7140  
QLE7140  
Single Port 10GBS InfiniBand to HTX Adapter  
Single Port 10GBS InfiniBand to HTX Adapter  
SinglePort10GBSInfiniBandtox8PCIExpress  
Adapter  
PE-850  
QMI7140  
SinglePort10GBSInfiniBandIBMBladeCenter  
Adapter  
This version of InfiniPath provides support for all QLogic’s HCAs, including:  
InfiniPath QLE7140, which is supported on systems with PCIe x8 or x16 slots  
InfiniPath QMI7140, which runs on Power PC systems, particularly on the IBM®  
BladeCenter H processor blades  
InfiniPath QHT7040 and QHT7140, which leverage HTX™. The InfiniPath  
QHT7040 and QHT7140 are exclusively for motherboards that support  
HTXcards. The QHT7140 has a smaller form factor than the QHT7040, but is  
otherwise the same. Unless otherwise stated, QHT7140 will refer to both the  
QHT7040 and QHT7140 in this documentation.  
Expanded MPI scalability enhancements for PCI Express have been added. The  
QHT7040 and QHT7140 can support 2 processes per context for a total of 16. The  
QLE7140 and QMI7140 also support 2 processes per context, for a total of 8.  
IB6054601-00 D  
1-3  
 
     
1 – Introduction  
Supported Distributions and Kernels  
Q
SupportformultipleversionsofMPIhasbeenadded. Youcanuseadifferentversion  
of MPI and achieve the high-bandwidth and low-latency performance that is  
standard with InfiniPath MPI.  
Also included is expanded operating system support, and support for the latest  
OpenFabrics software stack.  
MultipleInfiniPathcardspernodearesupported.Asinglesoftwareinstallationworks  
for all the cards.  
Additional up-to-date information can be found on the QLogic web site:  
http://www.qlogic.com  
1.7  
Supported Distributions and Kernels  
The InfiniPath interconnect runs on AMD Opteron, Intel EM64T, and IBM Power  
Blade Center H) systems running Linux. The currently supported distributions and  
associated Linux kernel versions for InfiniPath and OpenFabrics are listed in the  
following table. The kernels are the ones that shipped with the distributions, unless  
otherwise noted.  
Table 1-2. InfiniPath/OpenFabrics Supported Distributions and Kernels  
InfiniPath/OpenFabrics supported  
Distribution  
Fedora Core 3 (FC3)  
kernels  
2.6.12 (x86_64)  
Fedora Core 4 (FC4)  
2.6.16, 2.6.17 (x86_64)  
Red Hat Enterprise Linux 4 (RHEL4)  
2.6.9-22, 2.6.9-34, 2.6.9-42(U2/U3/U4)  
(x86_64)  
CentOS 4.2-4.4 (Rocks 4.2-4.4)  
SUSE Linux 9.3 (SUSE 9.3)  
2.6.9 (x86_64)  
2.6.11 (x86_64)  
2.6.5 (x86_64)  
SUSE LInux Enterprise Server (SLES 9)  
SUSE LInux Enterprise Server (SLES 10) 2.6.16 (x86_64 and ppc64)  
NOTE: IBM Power systems run only with the SLES 10 distribution.  
The SUSE10 release series is no longer supported as of this InfiniPath 2.0 release.  
Fedora Core 4 kernels prior to 2.6.16 are also no longer supported.  
1-4  
IB6054601-00 D  
 
       
1 – Introduction  
Software Components  
Q
1.8  
Software Components  
The software provided with the InfiniPath Interconnect product consists of:  
InfiniPath driver (including OpenFabrics)  
InfiniPath ethernet emulation  
InfiniPath libraries  
InfiniPath utilities, configuration, and support tools  
InfiniPath MPI  
InfiniPath MPI benchmarks  
OpenFabrics protocols, including Subnet Management Agent  
OpenFabrics libraries and utilities  
OpenFabricskernelmodulesupportisnowbuiltandinstalledaspartoftheInfiniPath  
RPM install. The InfiniPath release 2.0 runs on the same code base as OpenFabrics  
Enterprise Distribution (OFED) version 1.1. It also includes the OpenFabrics  
1.1-based library and utility RPMs. InfiniBand protocols are interoperable between  
InfiniPath 2.0 and OFED 1.1.  
This release provides support for the following protocols:  
IPoIB (TCP/IP networking)  
SDP (Sockets Direct Protocol)  
OpenSM  
UD (Unreliable Datagram)  
RC (Reliable Connection)  
UC (Unreliable Connection)  
SRQ (Shared Receive Queue)  
uDAPL (user Direct Access Provider Library)  
This release includes a technology preview of:  
SRP (SCSI RDMA Protocol)  
Future releases will provide support for:  
iSER (iSCSI Extensions for RDMA)  
No support is provided for RD.  
IB6054601-00 D  
1-5  
 
   
1 – Introduction  
Documentation and Technical Support  
Q
NOTE: 32 bit OpenFabrics programs using the verb interfaces are not supported  
in this InfiniPath release, but will be supported in a future release.  
1.9  
Conventions Used in this Document  
This Guide uses these typographical conventions:  
Table 1-3. Typographical Conventions  
Convention  
Meaning  
command  
Fixed-space font is used for literal items such as commands,  
functions, programs, files and pathnames, and program  
output;  
variable  
Italic fixed-space font is used for variable names in programs  
and command lines.  
concept  
Italic font is used for emphasis, concepts.  
user input  
Bold fixed-space font is used for literal items in commands or  
constructs that you type in.  
$
#
Indicates a command line prompt.  
Indicates a command line prompt as root when using bash or  
sh.  
[ ]  
Brackets enclose optional elements of a command or  
program construct.  
...  
>
Ellipses indicate that a preceding element can be repeated.  
Right caret identifies the cascading path of menu commands  
used in a procedure.  
2.0  
The current version number of the software is included in the  
RPM names and within this documentation.  
NOTE:  
Indicates important information.  
1.10  
Documentation and Technical Support  
The InfiniPath product documentation includes:  
The InfiniPath Install Guide  
The InfiniPath User Guide  
Release Notes  
Quick Start Guide  
1-6  
IB6054601-00 D  
 
       
1 – Introduction  
Documentation and Technical Support  
Q
Readme file  
The Troubleshooting Appendix for installation, InfiniPath and OpenFabrics  
administration, and MPI issues is located in the InfiniPath User Guide.  
VisittheQLogicsupportWebsitefordocumentationandthelatestsoftwareupdates.  
IB6054601-00 D  
1-7  
 
1 – Introduction  
Documentation and Technical Support  
Q
Notes  
1-8  
IB6054601-00 D  
 
Section 2  
InfiniPath Cluster Administration  
This chapter describes what the cluster administrator needs to know about the  
InfiniPath software and system administration.  
2.1  
Introduction  
The InfiniPath driver ib_ipath, layered Ethernet driver ipath_ether, OpenSM,  
and other modules and the protocol and MPI support libraries are the components  
of the InfiniPath software providing the foundation that supports the MPI  
implementation.  
Figure 2-1, below, shows these relationships.  
MPIApplication  
InfiniPathMPI  
OpenFabricscomponents  
InfiniPathChannel(ADILayer)  
TCP/IP  
IPoIB  
OpenSM  
ipath_ether  
InfiniPathProtocolLibrary  
InfiniPathdriverib_ipath  
LinuxKernel  
InfiniPathHardware  
Figure 2-1. InfiniPath Software Structure  
2.2  
Installed Layout  
The InfiniPath software is supplied as a set of RPM files, described in detail in the  
InfiniPath Install Guide. This section describes the directory structure that the  
installation leaves on each node’s file system.  
The InfiniPath shared libraries are installed in:  
/usr/lib for 32-bit applications  
/usr/lib64 for 64-bit applications  
IB6054601-00 D  
2-1  
 
                 
2 – InfiniPath Cluster Administration  
Memory Footprint  
Q
MPI include files are in:  
/usr/include  
MPI programming examples and source for several MPI benchmarks are in:  
/usr/share/mpich/examples  
InfiniPath utility programs, as well as MPI utilities and benchmarks are installed in:  
/usr/bin  
The InfiniPath kernel modules are installed in the standard module locations in:  
/lib/modules (version dependent)  
They are compiled and installed when the infinipath-kernelRPM is installed.  
They must be rebuilt and re-installed when the kernel is upgraded. This can be done  
by running the script:  
/usr/src/infinipath/drivers/make-install.sh  
Documentation can be found in:  
/usr/share/man  
/usr/share/doc/infinipath  
/usr/share/doc/mpich-infinipath  
2.3  
Memory Footprint  
The following is a preliminary guideline for estimating the memory footprint of the  
InfiniPath adapter on Linux x86_64systems. Memory consumption is linear based  
2-2  
IB6054601-00 D  
 
   
2 – InfiniPath Cluster Administration  
Memory Footprint  
Q
on system configuration. OpenFabrics support is under development and has not  
been fully characterized. This table summarizes the guidelines.  
Table2-1. MemoryFootprintoftheInfiniPathAdapteronLinuxx86_64Systems  
Adapter  
component  
Required/  
optional  
Memory Footprint  
9 MB  
Comment  
InfiniPath Driver  
Required  
Includes accelerated IP  
support. Includes tables  
space to support up to  
1000 node systems.  
Clusters larger than 1000  
nodes can also be  
configured.  
MPI  
Optional  
71 MB per process with  
Several of these  
default parameters: 60 MB parameters (sendbufs,  
recvbufs and size of the  
shared memory region)  
are tunable if reduced  
memory footprint is  
desired.  
+ 512*2172 (sendbufs) +  
4096*2176 (recvbufs) +  
1024*1K(misc.allocations)  
+ 32 MB per node when  
multiple processes  
communicate via shared  
memory  
+ 264 Bytes per MPI node  
on the subnet  
OpenFabrics  
Optional  
1~6 MB  
This not been fully  
characterized as of this  
writing.  
+ ~500 bytes per QP  
+ TBD bytes per MR  
+ ~500 bytes per EE  
Context  
+ OpenFabrics stack from  
openfabrics.org (size not  
included in these  
guidelines)  
Here is an example for a 1024 processor system:  
1024 cores over 256 nodes (each node has 2 sockets with dual-core processors)  
1 adapter per node  
Each core runs an MPI process, with the 4 processes per node communicating  
via shared memory.  
Each core uses OpenFabrics to connect with storage and file system targets  
using 50 QPs and 50 EECs per core.  
IB6054601-00 D  
2-3  
 
 
2 – InfiniPath Cluster Administration  
Configuration and Startup  
Q
This breaks down to a memory footprint of 331MB per node, as follows:  
Table 2-2. Memory Footprint, 331 MB per Node  
Component  
Footprint (in MB)  
Breakdown  
Per node  
Driver  
MPI  
9
316  
4*71 MB (MPI per process)  
+ 32 MB (shared memory  
per node)  
OpenFabrics  
6
6 MB + 200 KB per node  
2.4  
Configuration and Startup  
2.4.1  
BIOS Settings  
A properly configured BIOS is required. The BIOS settings, which are stored in  
non-volatile memory, contain certain parameters characterizing the system,. These  
parametersmayincludedateandtime,configurationsettings,andinformationabout  
the installed hardware.  
There are currently two issues concerning BIOS settings that you need to be aware  
of:  
ACPI needs to be enabled  
MTRR mapping needs to be set to “Discrete”  
MTRR (Memory Type Range Registers) is used by the InfiniPath driver to enable  
write combining to the InfiniPath on-chip transmit buffers. This improves write  
bandwidth to the InfiniPath chip by writing multiple words in a single bus transaction  
(typically 64). This applies only to x86_64 systems.  
However, some BIOSes don’t have the MTRR mapping option. It may be referred  
to in a different way, dependent upon chipset, vendor, BIOS, or other factors. For  
example, it is sometimes referred to as "32 bit memory hole", which should be  
enabled.  
If there is no setting for MTRR mapping or 32 bit memory hole, please contact your  
system or motherboard vendor and inquire as to how write combining may be  
enabled.  
ACPI and MTRR mapping issues are discussed in greater detail in the  
Troubleshooting section of the InfiniPath User Guide.  
NOTE: BIOS settings on IBM Blade Center H (Power) systems do not need  
adjustment.  
2-4  
IB6054601-00 D  
 
     
2 – InfiniPath Cluster Administration  
Configuration and Startup  
Q
You can check and adjust these BIOS settings using the BIOS Setup Utility. For  
specific instructionsonhowtodothis, followthehardwaredocumentationthatcame  
with your system.  
2.4.2  
InfiniPath Driver Startup  
The ib_ipathmodule provides low level InfiniPath hardware support. It does  
hardware initialization, handles infinipath-specific memory management, and  
provides services to other InfiniPath and OpenFabrics modules. It provides the  
management functions for InfiniPath MPI programs, the ipath_etherethernet  
emulation, and general OpenFabrics protocols such as IPoIB, and SDP. It also  
contains a Subnet Management Agent.  
The InfiniPath driver software is generally started at system startup under control  
of these scripts:  
/etc/init.d/infinipath  
/etc/sysconfig/infinipath  
These scripts are configured by the installation. Debug messages are printed with  
the function name preceding the message.  
The cluster administrator does not normally need to be concerned with the  
configuration parameters. Assuming that all the InfiniPath and OpenFabrics  
software has been installed, the default settings upon startup will be:  
InfiniPath ib_ipathis enabled  
InfiniPath ipath_etheris not running until configured  
OpenFabrics IPoIB is not running until configured  
OpenSM is enabled on startup. Disable it on all nodes except where it will be  
used as subnet manager.  
2.4.3  
InfiniPath Driver Software Configuration  
The ib_ipathdriver has several configuration variables which provide for setting  
reservedbuffersforthesoftware, definingeventstocreatetracerecords, andsetting  
debug level. See the ib_ipathman page for details.  
2.4.4  
InfiniPath Driver Filesystem  
The InfiniPath driver supplies a filesystem for exporting certain binary statistics to  
user applications. By default, this filesystem is mounted in the /ipathfsdirectory  
when the infinipath script is invoked with the "start" option (e.g. at system startup)  
IB6054601-00 D  
2-5  
 
         
2 – InfiniPath Cluster Administration  
Configuration and Startup  
Q
and unmounted when the infinipath script is invoked with the "stop" option (e.g. at  
system shutdown).  
The layout of the filesystem is as follows:  
atomic_stats  
00/  
01/  
...  
The atomic_statsfile contains general driver statistics. There is one numbered  
directory per InfiniPath device on the system. Each numbered directory contains  
the following files of per-device statistics:  
atomic_counters  
node_info  
port_info  
The atomic_countersfile contains counters for the device: examples would be  
interrupts received, bytes and packets in and out, and so on. The node_infofile  
contains information such as the device’s GUID. The port_infofile contains  
information for each port on the device. An example would be the port LID.  
2.4.5  
Subnet Management Agent  
Each node in an InfiniPath cluster runs a Subnet Management Agent (SMA), which  
carries out two-way communication with the Subnet Manager (SM) running on one  
or more managed switches. The Subnet Manager is responsible for network  
initialization (topology discovery), configuration, and maintenance. The Subnet  
Manager also assigns and manages InfiniBand multicast groups, such as the group  
used for broadcast purposes by the ipath_etherdriver. The primary functions of  
the SMA are to keep the SM informed whether a node is alive and to get the node’s  
assigned identifier (LID) from the SM.  
2.4.6  
Layered Ethernet Driver  
The layered Ethernet component ipath_etherprovides almost complete Ethernet  
software functionality over the InfiniPath fabric. At startup this is bound to some  
Ethernet device ethx. All Ethernet functions are available through this device in a  
transparent way, except that Ethernet multicasting is not supported. Broadcasting  
is supported. You can use all the usual command line and GUI-based configuration  
tools on this Ethernet. Configuration of ipath_etheris optional.  
These instructions are for enabling TCP-IP networking over the InfiniPath link. To  
enable IPoIB networking, see section 2.4.7.1.  
2-6  
IB6054601-00 D  
 
       
2 – InfiniPath Cluster Administration  
Configuration and Startup  
Q
You must create a network device configuration file for the layered Ethernet device  
on the InfiniPath adapter. This configuration file will resemble the configuration files  
for the other Ethernet devices on the nodes. Typically on servers there are two  
Ethernet devices present, numbered as 0 (eth0) and 1 (eth1). This examples  
assumes we create a third device, eth2.  
NOTE: When multiple InfiniPath chips are present, the configuration for eth3,  
eth4, andsoonfollowthesameformatasforaddingeth2intheexamples  
below.  
Two slightly different procedures are given below for the ipath configuration; one  
for Fedora and one for SUSE, SLES9, or SLES 10.  
Many of the entries that are used in the configuration directions below are explained  
in the file sysconfig.txt. To familiarize yourself with these, please see:  
/usr/share/doc/initscripts-*/sysconfig.txt  
2.4.6.1  
ipath_etherConfiguration on Fedora and RHEL4  
These configuration steps will cause the ipath_ethernetwork interfaces to be  
automatically configured when you next reboot the system. These instructions are  
for the Fedora Core 3, Fedora Core 4 and Red Hat Enterprise Linux 4 distributions.  
Typically on servers there are two Ethernet devices present, numbered as 0 (eth0)  
and 1 (eth1). This example assumes we create a third device, eth2.  
NOTE: When multiple InfiniPath chips are present, the configuration for eth3,  
eth4, and so on follow the same format as for adding eth2in the  
examples below.  
1. Check for the number of Ethernet drivers you currently have by either one of  
the two following commands :  
$ ifconfig -a  
$ ls /sys/class/net  
As mentioned above we assume that two Ethernet devices (numbered 0 and  
1) are already present.  
2. Edit the file /etc/modprobe.conf(as root) by adding the following line:  
alias eth2 ipath_ether  
3. Create or edit the following file (as root).  
/etc/sysconfig/network-scripts/ifcfg-eth2  
IB6054601-00 D  
2-7  
 
   
2 – InfiniPath Cluster Administration  
Configuration and Startup  
Q
If you are using DHCP (dynamic host configuration protocol), add the following  
lines to ifcfg-eth2:  
# QLogic Interconnect Ethernet  
DEVICE=eth2  
ONBOOT=yes  
BOOTPROTO=dhcp  
If you are using static IP addresses, use the following lines instead, substituting  
your own IP address for the sample one given here.The normal matching  
netmask is shown.  
# QLogic Interconnect Ethernet  
DEVICE=eth2  
BOOTPROTO=static  
ONBOOT=YES  
IPADDR=192.168.5.101 #Substitute your IP address here  
NETMASK="255.255.255.0"#Normal matching netmask  
TYPE=Ethernet  
Thiswillcausetheipath_ether Ethernetdrivertobeloadedandconfiguredduring  
system startup. To check your configuration, and make the ipath_ether Ethernet  
driver available immediately, use the command (as root):  
# /sbin/ifup eth2  
4. Check whether the Ethernet driver has been loaded with:  
$ lsmod | grep ipath_ether  
5. Verify that the driver is up with:  
$ ifconfig -a  
2.4.6.2  
ipath_etherConfiguration on SUSE 9.3, SLES 9, and SLES 10  
These configuration steps will cause the ipath_ethernetwork interfaces to be  
automatically configured when you next reboot the system. These instructions are  
for the SUSE 9.3, SLES 9 and SLES 10 distributions.  
Typically on servers there are two Ethernet devices present, numbered as 0 (eth0)  
and 1 (eth1). This example assumes we create a third device, eth2.  
NOTE: When multiple InfiniPath chips are present, the configuration for eth3,  
eth4, and so on follow the same format as for adding eth2in the  
examples below. Similarly , in step 2, add one to the unit number, so  
replace .../00/guidwith /01/guidfor the second InfiniPath interface,  
and so on.  
2-8  
IB6054601-00 D  
 
     
2 – InfiniPath Cluster Administration  
Configuration and Startup  
Q
Step 3 is applicable only to SLES 10; it is required because SLES 10 uses a newer  
version of the udevsubsystem.  
NOTE: The MAC address (media access control address) is a unique identifier  
attachedtomostformsofnetworkingequipment.Step2belowdetermines  
the MAC address to use, and will be referred to as $MAC in the  
subsequent steps. $MAC must be replaced in each case with the string  
printed in step 2.  
The following steps must all be executed as the root user.  
1. Be sure that the ipath_ether module is loaded:  
# lsmod | grep -q ipath_ether || modprobe ipath_ether  
2. Determine the MAC address that will be used:  
# sed ’s/^\(..:..:..\):..:../\1/’ \  
/sys/bus/pci/drivers/ib_ipath/00/guid  
NOTE: Care should be taken when cutting and pasting commands such as  
the above from PDF documents, as quotes are special characters  
and may not be translated correctly.  
The output should appear similar to this (6 hex digit pairs, separated by colons):  
00:11:75:04:e0:11  
The GUID can also be returned by running:  
# ipath_control -i  
$Id: QLogic Release2.0 $ $Date: 2006-10-15-04:16 $  
00: Version: Driver 2.0, InfiniPath_QHT7140, InfiniPath1 3.2,  
PCI 2, SW Compat 2  
00: Status: 0xe1 Initted Present IB_link_up IB_configured  
00: LID=0x30 MLID=0x0 GUID=00:11:75:00:00:04:e0:11 Serial:  
1236070407  
Note that removing the middle two 00:00octets from the GUID in the above  
output will form the MAC address  
If either step 1 or step 2 fails in some fashion, the problem must be found and  
corrected before continuing. Verify that the RPMs are installed correctly, and  
that infinipathhas correctly been started. If problems continue, run  
ipathbug-helperand report the results to your reseller or InfiniPath support  
organization.  
3. Skip to Step 4 if you are using SUSE 9.3 or SLES 9. This step is only done on  
SLES 10 systems. Edit the file:  
/etc/udev/rules.d/30-net_persistent_names.rules  
If this file does not exist, skip to Step 4.  
IB6054601-00 D  
2-9  
 
2 – InfiniPath Cluster Administration  
Configuration and Startup  
Q
Check each of the lines starting with SUBSYSTEM=, to find the highest numbered  
interface. (For standard motherboards, the highest numbered interface will  
typically be 1.)  
Add a new line at the end of the file, incrementing the interface number by one.  
In this example, it becomes eth2. The new line will look like this:  
SUBSYSTEM=="net", ACTION=="add", SYSFS{address}=="$MAC",  
IMPORT="/sbin/ rename_netiface %k eth2"  
This will appear as a single line in the file. $MAC is replaced by the string from  
step 2 above.  
4. Create the network module file:  
/etc/sysconfig/hardware/hwcfg-eth-id-$MAC  
Add the following lines to the file:  
MODULE=ipath_ether  
STARTMODE=auto  
This will cause the ipath_etherEthernet driver to be loaded and configured  
during system startup.  
5. Create the network configuration file:  
/etc/sysconfig/network/ifcfg-eth2  
If you are using DHCP (dynamically assigned IP addresses), add these lines  
to the file:  
STARTMODE=onboot  
BOOTPROTO=dhcp  
NAME=’InfiniPath Network Card’  
_nm_name=eth-id-$MAC  
Proceed to Step 6.  
If you are you are using static IP addresses (not DHCP), add these lines to the  
file:  
STARTMODE=onboot  
BOOTPROTO=static  
NAME=’InfiniPath Network Card’  
NETWORK=192.168.5.0  
NETMASK=255.255.255.0  
BROADCAST=192.168.5.255  
IPADDR=192.168.5.211  
_nm_name=eth-id-$MAC  
Make sure that you substitute your own IP address for the sample IPADDR  
shown here. The BROADCAST, NETMASK, and NETWORK lines need to  
match for your network.  
2-10  
IB6054601-00 D  
 
2 – InfiniPath Cluster Administration  
Configuration and Startup  
Q
6. To verify that the configuration files are correct, you will normally now be able  
to run the commands:  
# ifup eth2  
# ifconfig eth2  
Note that it may be necessary to reboot the system before the configuration  
changes will work.  
2.4.7  
OpenFabrics Configuration and Startup  
In the prior InfiniPath 1.3 release the InfiniPath (ipath_core) and OpenFabrics  
(ib_ipath) modules were separate. In this release there is now one module,  
ib_ipath, which provides both low level InfiniPath support and management  
functions for OpenFabrics protocols. The startup script for ib_ipathis installed  
automatically as part of the software installation, and normally does not need to be  
changed.  
However, the IPoIB network interface and OpenSM components of OpenFabrics  
can be configured to be on or off. IPoIB is off by default; OpenSM is on by default.  
IPoIB and OpenSM configuration is explained in greater detail in the following  
sections.  
NOTE: The following instructions work for FC4, SUSE9.3, SLES 9, and SLES 10.  
2.4.7.1  
Configuring the IPoIB Network Interface  
Instructions are given here to manually configure your OpenFabrics IPoIB network  
interface. This example assumes that you are using shor bashas your shell, and  
that all required InfiniPath and OpenFabrics RPMs are installed, and your startup  
scripts have been run, either manually or at system boot.  
For this example, we assume that your IPoIB network is 10.1.17.0 (one of the  
networks reserved for private use, and thus not routable on the internet), with a /8  
host portion, and therefore requires that the netmask be specified.  
This example assumes that no hosts files exist, and that the host being configured  
has the IP address 10.1.17.3, and that DHCP is not being used.  
NOTE: We supply instructions only for this static IP address case. Configuration  
methods for using DHCP will be supplied in a later release.  
Type the following commands (as root):  
# ifconfig ib0 10.1.17.3 netmask 0xffffff00  
IB6054601-00 D  
2-11  
 
         
2 – InfiniPath Cluster Administration  
Configuration and Startup  
Q
To verify the configuration, type:  
# ifconfig ib0  
The output from this command should be similar to this:  
ib0 Link encap:InfiniBand HWaddr  
00:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  
inet addr:10.1.17.3 Bcast:10.1.17.255 Mask:255.255.255.0  
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1  
RX packets:0 errors:0 dropped:0 overruns:0 frame:0  
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0  
collisions:0 txqueuelen:128  
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)  
Next, type:  
# ping -c 2 -b 10.1.17.255  
The output of the pingcommand should be similar to that below, with a line for  
each host already configured and connected:  
WARNING: pinging broadcast address  
PING 10.1.17.255 (10.1.17.255) 517(84) bytes of data.  
174 bytes from 10.1.17.3: icmp_seq=0 ttl=174 time=0.022 ms  
64 bytes from 10.1.17.1: icmp_seq=0 ttl=64 time=0.070 ms (DUP!)  
64 bytes from 10.1.17.7: icmp_seq=0 ttl=64 time=0.073 ms (DUP!)  
The IPoIB network interface is now configured.  
NOTE: The configuration must be repeated each time the system is rebooted.  
2.4.8  
OpenSM  
OpenSM is an optional component of the OpenFabrics project that provides a  
subnet manager for InfiniBand networks. This package can be installed on all  
machines, but only needs to be enabled on the machine in your cluster that is going  
toactasasubnetmanager. YoudonotneedtouseOpenSMifanyofyourInfiniBand  
switches provide a subnet manager.  
After installing the opensmpackage, OpenSM is configured to be on on the next  
machine reboot. It only needs to be enabled on the node which acts as the subnet  
manager, sousethechkconfigcommand(asroot)todisableitontheothernodes:  
# chkconfig opensmd off  
The command to enable it on reboot is:  
# chkconfig opensmd on  
You can start opensmdwithout rebooting your machine as follows:  
# /etc/init.d/opensmd start  
2-12  
IB6054601-00 D  
 
     
2 – InfiniPath Cluster Administration  
Starting and Stopping the InfiniPath Software  
Q
and you can stop it again like this:  
# /etc/init.d/opensmd stop  
If you wish to pass any arguments to the OpenSM program, modify the file:  
/etc/init.d/opensmd  
and add the arguments to the "OPTIONS" variable. Here is an example:  
# Use the UPDN algorithm instead of the Min Hop algorithm.  
OPTIONS="-u"  
2.5  
SRP  
SRP stands for SCSI RDMA Protocol. It was originally intended to allow the SCSI  
protocol to run over InfiniBand for SAN usage. SRP interfaces directly to the Linux  
file system through the SRP Upper Layer Protocol. SRP storage can be treated as  
just another device.  
In this release SRP is provided as a technology preview. Add ib_srpto the module  
list in /etc/sysconfig/infinipathto have it automatically loaded.  
NOTE: SRP does not yet work with IBM Power Systems.This will be fixed in a  
future release.  
2.6  
Further Information on Configuring and Loading Drivers  
See the modprobe(8), modprobe.conf(5), lsmod(8), man pages for more  
information. Also see the file /usr/share/doc/initscripts-*/sysconfig.txt  
for more general information on configuration files. Section 2.7, below, may also be  
useful.  
2.7  
Starting and Stopping the InfiniPath Software  
The InfiniPath driver software runs as a system service, normally started at system  
startup. Normally you will not need to restart the software, but you may wish to do  
so after installing a new InfiniPath release, or after changing driver options, or if  
doing manual testing.  
The following commands can be used to check or configure state. These methods  
will not reboot the system.  
To check the configuration state, use the command:  
$ chkconfig --list infinipath  
To enable the driver, use the command (as root):  
# chkconfig infinipath on 2345  
IB6054601-00 D  
2-13  
 
         
2 – InfiniPath Cluster Administration  
Starting and Stopping the InfiniPath Software  
Q
To disable the driver on the next system boot, use the command (as root):  
# chkconfig infinipath off  
NOTE: This does not stop and unload the driver, if it is already loaded.  
You can start, stop, or restart (as root) the InfiniPath support with:  
# /etc/init.d/infinipath [start | stop | restart]  
This method will not reboot the system. The following set of commands shows how  
this script can be used. Please take note of the following:  
You should omit the commands to start/stop opensmdif you are not running it  
on that node.  
You should omit the ifdownand ifupstep if you are not using ipath_ether  
on that node.  
The sequence of commands to restart infinipathare given below. Note that this  
next example assumes that ipath_etheris configured as eth2.  
# /etc/init.d/opensmd stop  
# ifdown eth2  
# /etc/init.d/infinipath stop  
...  
# /etc/init.d/infinipath start  
# ifup eth2  
# /etc/init.d/opensmd start  
The ...represents whatever activity you are engaged in after InfiniPath is stopped.  
An equivalent way to specify this is to use same sequence as above, except use  
the restartcommand instead of startand stop:  
# /etc/init.d/opensmd stop  
# ifdown eth2  
# /etc/init.d/infinipath restart  
# ifup eth2  
# /etc/init.d/opensmd start  
NOTE: Restarting InfiniPath will terminate any InfiniPath MPI processes, as well  
as any OpenFabrics processes that are running at the time. Processes  
using networking over ipath_etherwill return errors.  
You can check to see if opensmdis running by using the following command; if  
there is no output, opensmdis not configured to run:  
# /sbin/chkconfig --list opensmd | grep -w on  
You can check to see if ipath_etheris running by using the following command.  
If it prints no output, it is not running.  
$ /sbin/lsmod | grep ipath_ether  
2-14  
IB6054601-00 D  
 
2 – InfiniPath Cluster Administration  
Configuring sshand sshdUsing shosts.equiv  
Q
If there is output, you should look at the output from this command to determine if  
it is configured:  
$ /sbin/ifconfig -a  
Finally, if you need to find which InfiniPath and OpenFabrics modules are running,  
try the following command:  
$ lsmod | egrep ’ipath_|ib_|rdma_|findex’  
2.8  
Software Status  
InfiniBand status can be checked by running the program ipath_control. Here is  
sample usage and output:  
$ ipath_control -i  
$Id: QLogic Release2.0 $ $Date: 2006-09-15-04:16 $  
00: Version: Driver 2.0, InfiniPath_QHT7140, InfiniPath1 3.2,  
PCI 2, SW Compat 2  
00: Status: 0xe1 Initted Present IB_link_up IB_configured  
00: LID=0x30 MLID=0x0 GUID=00:11:75:00:00:07:11:97 Serial:  
1236070407  
Another useful program is ibstatus. Sample usage and output is as follows:  
$ ibstatus  
Infiniband device ’ipath0’ port 1 status:  
default gid:  
base lid:  
sm lid:  
fe80:0000:0000:0000:0011:7500:0005:602f  
0x35  
0x2  
state:  
phys state:  
rate:  
4: ACTIVE  
5: LinkUp  
10 Gb/sec (4X)  
For more information on these programs, See appendix C.9.9 and appendix C.9.5.  
2.9  
Configuring sshand sshdUsing shosts.equiv  
Running MPI programs on an InfiniPath cluster depends, by default, on secure shell  
sshto launch node programs on the nodes. Jobs must be able to start up without  
the need for interactive password entry on every node. Here we see how the cluster  
administrator can lift this burden from the user through the use of the shosts.equiv  
mechanism. This method is recommended, provided that your cluster is behind a  
firewall and accessible only to trusted users.  
Later, in section 3.5.1, we show how an individual user can accomplish this end  
through the use of ssh-agent.  
IB6054601-00 D  
2-15  
 
       
2 – InfiniPath Cluster Administration  
Configuring sshand sshdUsing shosts.equiv  
Q
This next example assumes the following:  
Both the cluster nodes and the front end system are running the openssh  
package as distributed in current Linux systems.  
All cluster users have accounts with the same account name on the front end  
and on each node, either by using NIS or some other means of distributing the  
password file.  
The front end is called ip-fe.  
Root or superuser access is required on ip-feand on each node in order to  
configure ssh.  
ssh, including the host’s key, has already been configured on the system ip-fe.  
See the sshdand ssh-keygenman pages for more information.  
The example proceeds as follows:  
1. On the system ip-fe, the front end node, change /etc/ssh/ssh_configto  
allow host-based authentication. Specifically, this file must contain the following  
four lines, set to ‘yes’. If they are already present but commented out with an  
initial #, remove the #.  
RhostsAuthentication yes  
RhostsRSAAuthentication yes  
HostbasedAuthentication yes  
EnableSSHKeysign yes  
2. On each of the InfiniPath node systems, create or edit the file  
/etc/ssh/shosts.equiv, addingthenameofthefrontendsystem.You’llneed  
to add the line:  
ip-fe  
Change the file to mode 600 when finished editing.  
3. On each of the InfiniPath node systems, create or edit the file  
/etc/ssh/ssh_known_hosts. You’ll need to copy the contents of the file  
/etc/ssh/ssh_host_dsa_key.pubfrom ip-feto this file (as a single line),  
and then edit that line to insert ip-fe ssh-dssat the beginning of the line. This  
is very similar to the standard known_hostsfile for ssh. An example line might  
look like this (displayed as multiple lines, but a single line in the file):  
ip-fe ssh-dss  
AAzAB3NzaC1kc3MAAACBAPoyES6+Akk+z3RfCkEHCkmYuYzqL2+1nwo4LeTVWp  
CD1QsvrYRmpsfwpzYLXiSJdZSA8hfePWmMfrkvAAk4ueN8L3ZT4QfCTwqvHVvS  
ctpibf8n  
aUmzloovBndOX9TIHyP/Ljfzzep4wL17+5hr1AHXldzrmgeEKp6ect1wxAAAAF  
QDR56dAKFA4WgAiRmUJailtLFp8swAAAIBB1yrhF5P0jO+vpSnZrvrHa0Ok+Y9  
apeJp3sessee30NlqKbJqWj5DOoRejr2VfTxZROf8LKuOY8tD6I59I0vlcQ812  
E5iw1GCZfNefBmWbegWVKFwGlNbqBnZK7kDRLSOKQtuhYbGPcrVlSjuVpsfWEj  
u64FTqKEetA8l8QEgAAAIBNtPDDwdmXRvDyc0gvAm6lPOIsRLmgmdgKXTGOZUZ  
2-16  
IB6054601-00 D  
 
2 – InfiniPath Cluster Administration  
Performance and Management Tips  
Q
0zwxSL7GP1nEyFk9wAxCrXv3xPKxQaezQKs+KL95FouJvJ4qrSxxHdd1NYNR0D  
avEBVQgCaspgWvWQ8cL  
0aUQmTbggLrtD9zETVU5PCgRlQL6I3Y5sCCHuO7/UvTH9nneCg==  
Change the file to mode 600 when finished editing.  
4. On each node, the system file /etc/ssh/sshd_configmust be edited, so that  
the following four lines uncommented (no #at the start of the line) and are set  
to yes. Each of these lines is normally present, but commented out and set to  
noby default.  
RhostsAuthentication yes  
RhostsRSAAuthentication yes  
HostbasedAuthentication yes  
PAMAuthenticationViaKbdInt yes  
5. After creating or editing these three files in steps 2, 3 and 4, sshdmust be  
restarted on each system. If you are already logged in via ssh(or any other  
user is logged in via ssh), their sessions or programs will be terminated, so do  
this only on idle nodes. Tell sshdto use the new configuration files by typing  
(as root):  
# killall -HUP sshd  
NOTE: This will terminate all ssh sessions into that system. Run from the  
console, or have a way to log into the console in case of any problem.  
At this point, any user should be able to login to the ip-fefront end system, and  
then use sshto login to any InfiniPath node without being prompted for a password  
or pass phrase.  
2.9.1  
Process Limitation with ssh  
MPI jobs that use more than 8 processes per node may encounter an SSH throttling  
mechanism that limits the amount of concurrent per-node connections to 10. If you  
need to use more processes, you or your system administrator should increase the  
value of ’MaxStartups’ in your sshd configurations. See appendix C.8.8 for an  
example of an error message associated with this limitation.  
2.10  
Performance and Management Tips  
The following section gives some suggestions for improving performance and  
simplifying management of the cluster.  
2.10.1  
Remove Unneeded Services  
An important step that the cluster administrator can take to enhance application  
performance is to minimize the set of system services running on the compute  
IB6054601-00 D  
2-17  
 
       
2 – InfiniPath Cluster Administration  
Performance and Management Tips  
Q
nodes. Since these are presumed to be specialized computing appliances, they  
do not need many of the service daemons normally running on a general Linux  
computer.  
Following are several groups constituting a minimal necessary set of services.  
These are all services controlled by chkconfig. To see the list of services that are  
enabled, use the command:  
$ /sbin/chkconfig --list | grep -w on  
Basic network services:  
network  
ntpd  
syslog  
xinetd  
sshd  
For system housekeeping:  
anacron  
atd  
crond  
If you are using NFS or yp passwords:  
rpcidmapd  
ypbind  
portmap  
nfs  
nfslock  
autofs  
To watch for disk problems:  
smartd  
readahead  
The service comprising the InfiniPath driver and SMA:  
infinipath  
Other services may be required by your batch queuing system or user community.  
2.10.2  
Disable Powersaving Features  
If you are running benchmarks or large numbers of short jobs, it is beneficial to  
disable the powersaving features of the Opteron. The reason is that these features  
may be slow to respond to changes in system load.  
For rhel4, fc3 and fc4, run this command as root:  
# /sbin/chkconfig --level 12345 cpuspeed off  
2-18  
IB6054601-00 D  
 
     
2 – InfiniPath Cluster Administration  
Performance and Management Tips  
Q
For SUSE 9.3 and 10.0 run this command as root:  
# /sbin/chkconfig --level 12345 powersaved off  
After running either of these commands, the system will need to be rebooted for  
these changes to take effect.  
2.10.3  
Balanced Processor Power  
Higher processor speed is good. However, adding more processors is good only if  
processor speed is balanced. Adding processors with different speeds can result  
in load imbalance.  
2.10.4  
SDP Module Parameters for Best Performance  
To get the best performance from SDP, especially for bandwidth tests, edit one of  
these files:  
/etc/modprobe.conf (on Fedora and RHEL)  
/etc/modprobe.conf.local (on SUSE and SLES)  
Add the line:  
options ib_sdp sdp_debug_level=4  
sdp_zcopy_thrsh_src_default=10000000  
This should be a single line in the file. This sets both the debug level and the zero  
copy threshold.  
2.10.5  
CPU Affinity  
InfiniPath will attempt to run each node program with CPU affinity set to a separate  
logical processor, up to the number of available logical processors. If CPU affinity  
is already set (with sched_setaffinity(), or with the tasksetutility), then  
InfiniPath will not change the setting.  
The tasksetutility can be used with mpirunto specify the mapping of MPI  
processes to logical processors. This is useful, for example, to make best use of  
available memory bandwidth or cache locality when running on dual-core SMP  
cluster nodes.  
In the following example we use the NAS Parallel Benchmark’s MG (multi-grid)  
benchmark and the -coption to taskset.  
$ mpirun -np 4 -ppn 2 -m $hosts taskset -c 0,2 bin/mg.B.4  
$ mpirun -np 4 -ppn 2 -m $hosts taskset -c 1,3 bin/mg.B.4  
The first command forces the programs to run on CPUs (or cores) 0 and 2. The  
second forces the programs to run on CPUs 1 and 3. Please see the manpage for  
tasksetfor more information on usage.  
IB6054601-00 D  
2-19  
 
           
2 – InfiniPath Cluster Administration  
Performance and Management Tips  
Q
2.10.6  
Hyper-Threading  
If using Intel processors that support Hyper-Threading, it is recommended that  
HyperThreading is turned off in the BIOS. This will provide more consistent  
performance. You can check and adjust this setting using the BIOS Setup Utility.  
For specific instructions on how to do this, follow the hardware documentation that  
came with your system.  
2.10.7  
Homogeneous Nodes  
To minimize management problems, the compute nodes of the cluster should have  
very similar hardware configurations and identical software installations. A  
mismatch between the InfiniPath software versions may also cause problems. Old  
and new libraries should not be run within the same job. It may also be useful to  
distinguish between the InfiniPath-specific drivers and those that are associated  
with kernel.org, OpenFabrics, or are distribution-built. The most useful tools are:  
ipathbug-helper  
ipath_control  
rpm  
mpirun  
ident  
strings  
ipath_checkout  
NOTE: Run these tools to gather information before reporting problems and  
requesting support.  
ipathbug_helper  
TheInfiniPathsoftwareincludesashellscriptipathbug-helper, whichcan gather  
status and history information for use in analyzing InfiniPath problems. This tool is  
also useful for verifying homogeneity. It is best to run ipathbug-helperwith root  
privilege, since some of the queries require it. There is also a --verboseoption  
which greatly increases theamount of gatheredinformation. Simply run it on several  
nodes and examine the output for differences.  
ipath_control  
Run the shell script ipath_controlas follows:  
% ipath_control -i  
$Id: QLogic Release2.0 $ $Date: 2006-09-15-04:16 $  
00: Version: Driver 2.0, InfiniPath_QHT7140, InfiniPath1 3.2, PCI  
2, SW Compat 2  
00: Status: 0xe1 Initted Present IB_link_up IB_configured  
2-20  
IB6054601-00 D  
 
       
2 – InfiniPath Cluster Administration  
Performance and Management Tips  
Q
00: LID=0x30 MLID=0x0 GUID=00:11:75:00:00:07:11:97 Serial:  
1236070407  
Note that ipath_controlwill report whether the installed adapter is the QHT7040,  
QHT7140, or the QLE7140. It will also report whether the driver is InfiniPath-specific  
or not with the output associated with $Id.  
rpm  
To check the contents of an RPM, commands of these types may be useful:  
$ rpm -qa infinipath\* mpi-\*  
$ rpm -q --info infinipath # (etc)  
The option -qwill query and -qawill query all.  
mpirun  
mpirun can give information on whether the program is being run against a QLogic  
or non-QLogic driver. Sample commands and results are given below.  
QLogic-built:  
$ mpirun -np 2 -m /tmp/id1 -d0x101 mpi_latency 1 0  
asus-01:0.ipath_setaffinity: Set CPU affinity to 1, port 0:2:0 (1  
active chips)  
asus-01:0.ipath_userinit: Driver is QLogic-built  
Non-QLogic built:  
$ mpirun -np 2 -m /tmp/id1 -d0x101 mpi_latency 1 0  
asus-01:0.ipath_setaffinity: Set CPU affinity to 1, port 0:2:0 (1  
active chips)  
asus-01:0.ipath_userinit: Driver is not QLogic-built  
ident  
identstrings are available in ib_ipath.ko. Running ident(as root) will yield  
information similar to the following. For QLogic RPMs, it will look like:  
# ident /lib/modules/$(uname -r)/updates/*ipath.ko  
/lib/modules/2.6.16.21-0.8-smp/updates/ib_ipath.ko:  
$Id: QLogic Release2.0 $  
$Date: 2006-09-15-04:16 $  
$Id: QLogic Release2.0 $  
$Date: 2006-09-15-04:16 $  
For non-QLogic RPMs, it will look like:  
# ident /lib/modules/$(uname -r)/updates/*ipath_ether.ko  
/lib/modules/2.6.16.21-0.8-smp/updates/infinipath.ko:  
IB6054601-00 D  
2-21  
 
 
2 – InfiniPath Cluster Administration  
Customer Acceptance Utility  
Q
$Id: kernel.org InfiniPath Release 2.0 $  
$Date: 2006-09-15-04:16 $  
/lib/modules/2.6.16.21-0.8-smp/updates/ipath.ko:  
$Id: kernel.org InfiniPath Release2.0 $  
$Date: 2006-09-15-04:20 $  
NOTE: identis in the optional rcsRPM, and is not always installed.  
strings  
The command stringscan also be used. Here is a sample:  
$ strings /usr/lib/libinfinipath.so.4.0 | grep Date:  
will produce output like this:  
Date: 2006-09-15 04:07 Release2.0 InfiniPath $  
NOTE: stringsis part of binutils(a development RPM), and may not be  
available on all machines.  
ipath_checkout  
ipath_checkoutis a bashscript used to verify that the installation is correct, and  
that all the nodes are functioning. It is run on a front end node and requires a hosts  
file:  
$ ipath_checkout [options] hostsfile  
More complete information on ipath_checkoutis given below in section 2.11 and  
2.11  
Customer Acceptance Utility  
ipath_checkoutis a bashscript used to verify that the installation is correct and  
that all the nodes of the network are functioning and mutually connected by the  
InfiniPath fabric. It is to be run on a front end node, and requires specification of a  
hosts file:  
$ ipath_checkout [options] hostsfile  
wherehostsfiledesignatesafilelistingthehostnamesofthenodesofthecluster,  
one hostname per line. The format of hostsfileis as follows:  
hostname1  
hostname2  
...  
ipath_checkoutperforms the following seven tests on the cluster:  
1. pingall nodes to verify all are reachable from the frontend.  
2. sshto each node to verify correct configuration of ssh.  
2-22  
IB6054601-00 D  
 
   
2 – InfiniPath Cluster Administration  
Customer Acceptance Utility  
Q
3. Gather and analyze system configuration from nodes.  
4. Gather and analyze RPMs installed on nodes.  
5. Verify InfiniPath hardware and software status and configuration.  
6. Verify ability to mpirun jobs on nodes.  
7. Run bandwidth and latency test on every pair of nodes and analyze results.  
The possible options to ipath_checkoutare:  
-h, --help  
Displays help messages giving defined usage.  
-v, --verbose  
-vv, --vverbose  
-vvv, --vvverbose  
These specify three successively higher levels of detail in reporting results of tests.  
So, there are four levels of detail in all, including the case of where none these  
options are given.  
-c, --continue  
When not specified, the test terminates when any test fails. When specified, the  
tests continue after a failure, with failing nodes excluded from subsequent tests.  
--workdir=DIR  
Use DIR to hold intermediate files created while running tests. DIR must not already  
exist.  
-k, --keep  
Keep intermediate files that were created while performing tests and compiling  
reports. Results will be saved in a directory created by mktempand named  
infinipath_XXXXXXor in the directory name given to --workdir.  
--skip=LIST  
Skip the tests in LIST(e.g. --skip=2,4,5,7 will skip tests 2, 4, 5, and 7)  
-d, --debug  
Turn on -xand -vflags in bash.  
In most cases of failure, the script suggests recommended actions. Please see the  
ipath_checkout man page for further information and updates.  
IB6054601-00 D  
2-23  
 
2 – InfiniPath Cluster Administration  
Customer Acceptance Utility  
Q
Notes  
2-24  
IB6054601-00 D  
 
Section 3  
Using InfiniPath MPI  
This chapter provides information on using InfiniPath MPI. Examples are provided  
for compiling and running MPI programs.  
3.1  
InfiniPath MPI  
QLogic’s implementation of the MPI standard is derived from the MPICH reference  
implementation Version 1.2.6. The InfiniPath MPI libraries have been highly tuned  
for the InfiniPath Interconnect, and will not run over other interconnects.  
InfiniPath MPI is an implementation of the original MPI 1.2 standard. The MPI-2  
standard provides several enhancements of the original standard. Of the MPI-2  
features, InfiniPath MPI includes only the MPI-IO features implemented in ROMIO  
version 1.2.6 and the generalized MPI_Alltoallw communication exchange.  
In this Version 2.0release, the InfiniPath MPI implementation supports hybrid  
MPI/OpenMP, and other multi-threaded programs, as long as only one thread uses  
MPI. For more information, see section 3.10.  
3.2  
Other MPI Implementations  
As of this release, other MPI implementations can now be run over InfiniPath. The  
currently supported implementations are HP-MPI, OpenMPI and Scali. For more  
information see section 3.6.  
3.3  
Getting Started with MPI  
In this section you will learn how to compile and run some simple example programs  
that are included in the InfiniPath software product. Compiling and running these  
examples lets you verify that InfiniPath MPI and its components have been properly  
installed on your cluster. See appendix C.8 if you have problems compiling or  
running these examples.  
IB6054601-00 D  
3-1  
 
           
3 – Using InfiniPath MPI  
Getting Started with MPI  
Q
These examples assume that:  
Your cluster administrator has properly installed InfiniPath MPI and the  
PathScale compilers.  
Your cluster’s policy allows you to use the mpirunscript directly, without having  
to submit the job to a batch queuing system.  
You or your administrator has properly set up your sshkeys and associated files  
on your cluster. See section 3.5.1 and section 2.9 for details on ssh  
administration.  
To begin, copy the examples to your working directory:  
$ cp /usr/share/mpich/examples/basic/* .  
Next, create an MPI hosts file in the same working directory. It contains the host  
names of the nodes in your cluster on which you want to run the examples, with  
one host name per line. Name this file mpihosts. The contents can be in the  
following format:  
hostname1  
hostname2  
...  
There is more information on the mpihostsfile in section 3.5.6.  
3.3.1  
An Example C Program  
InfiniPath MPI uses some shell scripts to find the appropriate include files and  
libraries for each supported language. Use the script mpiccto compile an MPI  
program in C and the script mpirunto execute it.  
The supplied example program cpi.ccomputes an approximation to pi. First,  
compile it to an executable named cpi.  
$ mpicc -o cpi cpi.c  
mpicc, by default, runs the PathScale pathccor gcccompiler, and is used for  
both compiling and linking, exactly as you'd use the pathcccommand.  
NOTE: On ppc64 systems, gccis the default compiler. For information on using  
other compilers, see section 3.5.3.  
Then, run it with several different specifications for the number of processes:  
$ mpirun -np 2 -m mpihosts ./cpi  
Process 0 on hostname1  
Process 1 on hostname2  
pi is approximately 3.1416009869231241,  
Error is 0.0000083333333309  
wall clock time = 0.000149  
3-2  
IB6054601-00 D  
 
     
3 – Using InfiniPath MPI  
Getting Started with MPI  
Q
Here ./cpidesignates the executable of the example program in the working  
directory. The -npparameter to mpirundefines the number of processes to be  
used in the parallel computation. Now try it with four processes:  
$ mpirun -np 4 -m mpihosts ./cpi  
Process 3 on hostname1  
Process 0 on hostname2  
Process 2 on hostname2  
Process 1 on hostname1  
pi is approximately 3.1416009869231249,  
Error is 0.0000083333333318  
wall clock time = 0.000603  
If you run the program several times with the same value of the -npparameter, you  
may get the output lines in different orders. This is because they are issued by  
independent asynchronous processes, so their order is non-deterministic.  
The number of processes can be greater than the number of nodes. In this  
four-process example, the mpihostsfile listed only two hosts, hostname1 and  
hostname2. Generally, mpirunwill try to distribute the specified number of  
processes evenly among the nodes listed in the mpihostsfile, but if the number of  
processes exceeds the number of nodes listed in the mpihostsfile, then some  
nodes will be assigned more than one instance of the program.  
Up to a limit, the number of processes can even exceed the total number of  
processors on the specified set of nodes, although it is usually detrimental to  
performance to have more than one node program per processor. This limit is eight  
processes per node with the QHT7140, and four processes per node with the  
QLE7140. See section 3.5.9 for further discussion.  
Detailsonalternatemeansofspecifyingthempihostsfilearegiveninsection 3.5.6.  
Further information on the mpirunoptions are in section 3.5.5, section 3.5.9 and  
3.3.2  
Examples Using Other Languages  
This section gives more examples, one for Fortran77, one for Fortran90, and one  
for C++. Fortran95 usage will be similar to that for Fortran90.  
fpi.fis a Fortran77 program that computes pi in a way similar to cpi.c. Compile  
and link it with:  
$ mpif77 -o fpi3 fpi3.f  
and run it with:  
$ mpirun -np 2 -m mpihosts ./fpi3  
pi3f90.f90in the same directory is a Fortran90 program that does essentially the  
same computation. Compile and link it with:  
$ mpif90 -o pi3f90 pi3f90.f90  
IB6054601-00 D  
3-3  
 
 
3 – Using InfiniPath MPI  
Configuring MPI Programs for InfiniPath MPI  
Q
and run it with:  
$ mpirun -np 2 -m mpihosts ./pi3f90  
The C++ program hello++.ccis a parallel processing version of the traditional  
“Hello, World” program. Notice that this version makes use of the external C  
bindings of the MPI functions if the C++ bindings are not present.  
Compile it:  
$ mpicxx -o hello hello++.cc  
and run it:  
$ mpirun -np 10 -m mpihosts ./hello  
Hello World! I am 9 of 10  
Hello World! I am 2 of 10  
Hello World! I am 4 of 10  
Hello World! I am 1 of 10  
Hello World! I am 7 of 10  
Hello World! I am 6 of 10  
Hello World! I am 3 of 10  
Hello World! I am 0 of 10  
Hello World! I am 5 of 10  
Hello World! I am 8 of 10  
Each of the scripts invokes the PathScale compiler for the respective language and  
the linker. See section 3.5.3 for an example of how to use the gcccompiler. The  
use of mpirunis the same for programs in all languages.  
3.4  
Configuring MPI Programs for InfiniPath MPI  
When configuring an MPI program (generating header files and/or Makefiles), for  
InfiniPath MPI, you will usually need to specify mpicc, mpif90, etc. as the compiler,  
rather than pathcc, pathf90, etc.  
Typically this is done with commands similar to these (this assumes you are using  
shor bashas your shell):  
$ export CC=mpicc  
$ export CXX=mpicxx  
$ export F77=mpif77  
$ export F90=mpif90  
$ export F95=mpif95  
The shell variables will vary with the program being configured, but these examples  
show frequently used variable names. Users of cshwould instead use commands  
similar to:  
$ setenv CC mpicc  
3-4  
IB6054601-00 D  
 
       
3 – Using InfiniPath MPI  
InfiniPath MPI Details  
Q
You may need to instead pass arguments to configuredirectly, in a fashion similar  
to this:  
$ ./configure -cc=mpicc -fc=mpif77 -c++=mpicxx  
-c++linker=mpicxx  
Sometimes you may need to edit a Makefile to achieve this result, adding lines  
similar to:  
CC=mpicc  
F77=mpif77  
F90=mpif90  
F95=mpif95  
CXX=mpicxx  
In some cases, the configuration process may specify the linker. It is recommended  
that the linker be specified as mpicc, mpif90, etc. in these cases. That will  
automatically include the correct flags and libraries, rather than trying to configure  
to pass the flags and libraries explicitly. For example:  
LD=mpicc  
LD=mpif90  
These scripts pass appropriate options to the various compiler passes to include  
header files, required libraries, etc. While the same effect can be achieved by  
passing the arguments explicitly as flags, the required arguments may vary from  
release to release, so it's good practice to use the provided scripts.  
3.5  
InfiniPath MPI Details  
This section gives more details on the use of InfiniPath MPI. We assume the reader  
has some familiarity with standard MPI. See the references in appendix D.1. This  
implementation does include the manpages from the MPICH implementation for the  
numerous MPI functions.  
3.5.1  
Configuring for sshUsing ssh-agent  
Thecommandmpiruncanberunonthefrontendoronanyothernode. InInfiniPath  
MPI, this uses the secure shell command sshto start instances of the given MPI  
program on the remote compute nodes. To use ssh, the user must have generated  
RSA or DSA keys, public and private. The public keys must be distributed to all the  
compute nodes so that connections to the remote machines can be established  
without supplying a password. Each user can accomplish this through use of the  
ssh-agent. ssh-agentis a daemon that caches decrypted private keys. You use  
ssh-addto add your private keys to ssh-agent’s cache. When sshestablishes a  
new connection, it communicates with ssh-agentin order to acquire these keys,  
rather than prompting you for a passphrase.  
IB6054601-00 D  
3-5  
 
         
3 – Using InfiniPath MPI  
InfiniPath MPI Details  
Q
The process is shown in the following steps:  
1. Create a key pair. Use the default file name, and be sure to enter a passphrase.  
$ ssh-keygen -t rsa  
2. Enter a passphrase for your key pair when prompted. Note that the key agent  
does not survive X11 logout or system reboot:  
$ ssh-add  
3. This tells sshthat your key pair should let you in:  
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys  
edit ~/.ssh/configso that it reads like this:  
Host*  
ForwardAgent yes  
ForwardX11 yes  
CheckHostIP no  
StrictHostKeyChecking no  
This forwards the key agent requests back to your desktop. When you log into  
a front end node, you can ssh to compute nodes without passwords.  
4. Start ssh-agent by adding the following line to your ~/.bash_profile (or  
equivalent in another shell):  
eval ‘ssh-agent‘  
Use back-quotes rather than normal single-quotes. Programs started in your  
login shell will then be able to locate ssh-agentand query it for keys.  
5. Finally, test by logging into the front end node, and from the front end node to  
a compute node as follows:  
$ ssh frontend_node_name  
$ ssh compute_node_name  
For more information, see the man pages for ssh(1),ssh-keygen(1),  
ssh-add(1), and ssh-agent(1).  
Alternatively, the cluster administrator can accomplish this for all users through the  
shosts.equivmechanism, as described in section 2.9.  
3-6  
IB6054601-00 D  
 
3 – Using InfiniPath MPI  
InfiniPath MPI Details  
Q
3.5.2  
Compiling and Linking  
These scripts invoke the compiler and linker for programs in each of the respective  
languages, and take care of referring to the correct include files and libraries in each  
case.  
mpicc  
mpicxx  
mpif77  
mpif90  
mpif95  
On x86_64, by default these call the PathScale compiler and linker. To use other  
compilers, see section 3.5.3.  
NOTE: The 2.x PathScale compilers aren’t currently supported on systems that  
use the GNU 4.x compiler and environment. This includes FC4, FC5 and  
SLES10. For suggestions on how to work around this issue, see  
section 3.5.4. The 3.0 compiler release will support the GNU 4.x compiler  
environment.  
These scripts all provide the following command line options:  
-help  
Provides help.  
-show  
Lists each of the compiling and linking commands that would be called without  
actually calling them.  
-echo  
Gets verbose output of all the commands in the script.  
-compile_info  
Shows how to compile a program.  
-link_info  
Shows how to link a program.  
Further, each of these scripts allows a command line option for specifying the use  
of a different compiler/linker as an alternative to the PathScale Compiler Suite.  
These are described in the next section.  
Most other command line options are passed on to the invoked compiler and linker.  
The PathScale compiler and the usual alternatives all admit numerous command  
IB6054601-00 D  
3-7  
 
   
3 – Using InfiniPath MPI  
InfiniPath MPI Details  
Q
line options. See the PathScale compiler documentation and the manpages for  
pathccandpathf90for completeinformation on its options. Seethecorresponding  
documentation for any other compiler/linker you may call for its options.  
3.5.3  
To Use Another Compiler  
In addition to the PathScale Compiler Suite, InfiniPath MPI supports a number of  
other compilers. These include PGI 5.2 and 6.0, Intel 9.0, the GNU gcc 3.3.x, 3.4.x,  
and 4.0.x compiler suites and gfortran. The IBM XL family of compilers is also  
supported on ppc64 (Power) systems.  
NOTE: The 2.x PathScale compilers aren’t currently supported on systems that  
have the GNU 4.x compilers and compiler environment (header files and  
libraries). This includes Fedora Core 4, Fedora Core 5, SUSE 10, and  
SLES 10. To run on those distributions, you can compile your application  
on a system that does support the PathScale compiler. Then you can run  
the executable on one of the systems that uses the GNU 4.x compiler  
and environment. For more information on setting up for  
cross-compilation, see section 3.5.4. The GNU 4.x compiler environment  
will be supported by the PathScale Compiler Suite 3.0 release.  
NOTE: In addition, gfortranis not currently supported on Fedora Core 3, as it  
has dependencies on the GNU 4.x suite.  
The following example shows how to use gccfor compiling and linking MPI  
programs in C:  
$ mpicc -cc=gcc .......  
To use gccfor compiling and linking C++ programs use:  
$ mpicxx -CC=g++ .......  
To use gcc for compiling and linking Fortran77 programs use:  
$ mpif77 -fc=g77 .......  
In each case, .....stands for the remaining options to the mpicxxscript, the  
options to the compiler in question, and the names of the files it is to operate upon.  
Using the same pattern you will see that this next example is similar, except that it  
uses the PGI (pgcc) compiler for compiling and linking in C:  
$ mpicc -cc=pgcc .....  
To use PGI for Fortran90/Fortran95 programs, use:  
$ mpif90 -f90=pgf90 .....  
$ mpif95 -f95=pgf95 .....  
This example uses the Intel C compiler (icc):  
$ mpicc -cc=icc .....  
3-8  
IB6054601-00 D  
 
   
3 – Using InfiniPath MPI  
InfiniPath MPI Details  
Q
To use the Intel compiler for Fortran90/Fortran95 programs, use:  
$ mpif90 -f90=ifort .....  
$ mpif95 -f95=ifort .....  
Usage for other compilers will be similar to the examples above, substituting the  
options following -cc, -CC, -f77, -f90, or -f95. Consult the documentation for  
specific compilers for more details.  
Also,usempif77,mpif90,ormpif95forlinking,otherwiseyoumayhaveproblems  
with .true.having the wrong value. If you are not using the provided scripts for  
linking, you should link a sample program using the -showoption as a test, to see  
what libraries to add to your link line. Some examples follow.  
For Fortran90 programs:  
$ mpif90 -f90=pgf90 -show pi3f90.f90 -o pi3f90  
pgf90 -I/usr/include/mpich/pgi5/x86_64 -c -I/usr/include  
pi3f90.f90 -c  
pgf90 pi3f90.o -o pi3f90 -lmpichf90 -lmpich -lmpichabiglue_pgi5  
Fortran95 programs will be similar to the above.  
For C programs:  
$ mpicc -cc=pgcc -show cpi.c  
pgcc -c cpi.c  
pgcc cpi.o -lmpich -lpgftnrtl -lmpichabiglue_pgi5  
3.5.3.1  
Compiler and Linker Variables  
If you use environment variables (e.g., $MPICH_CC) to select which compiler  
mpicc, et al. should use, the scripts will also set the matching linker variable (e.g.  
$MPICH_CLINKER), if not already set. If both the environment variable and  
command line options are used (e.g, -cc=gcc), the command line variable is used.  
If both the compiler and linker variables are set, and they do not match for the  
compiler you are using, it is likely that the MPI program will fail to link, or if it links,  
it may not execute correctly. For a sample error message, please see section C.8.3  
in the Troubleshooting chapter.  
3.5.4  
Cross-compilation Issues  
The 2.x PathScale compilers aren’t currently supported on systems that use the  
GNU 4.x compilers and compiler environment (header files and libraries). This  
includes Fedora Core 4, Fedora Core 5 and SLES 10. The GNU 4.x environment  
will be supported in the PathScale Complier Suite 3.0 release.  
IB6054601-00 D  
3-9  
 
     
3 – Using InfiniPath MPI  
InfiniPath MPI Details  
Q
The current workaround for this is to compile on a supported and compatible  
distribution, then run the executable on one of the systems that uses the GNU 4.x  
compilers and environment.  
To run on FC4 or FC5, install FC3 or RHEL4/CentOS on your build machine.  
Compile your application on this machine.  
To run on SLES 10, install SUSE 9.3 on your build machine. Compile your  
application on this machine.  
Alternatively, gcccan be used as the default compiler. Set mpicc -cc=gccas  
described in section 3.5.3 "To Use Another Compiler".  
Next, on the machines in your cluster on which the job will run, install compatibility  
libraries. These libraries include C++ and Fortran compatibility shared libraries and  
libgcc.  
For an FC4 or FC5 system, you would need:  
pathscale-compilers-libs (for FC3)  
compat-gcc-32  
compat-gcc-32-g77  
compat-libstdc++-33  
On a SLES 10 system, you would need:  
compat-libstdc++ (for FC3)  
compat-libstdc++5 (for SLES 10)  
Depending upon the application, you may need to use the -W1,-Bstaticoption to  
use the static versions of some libraries.  
3.5.5  
Running MPI Programs  
The script mpirunlets you start your parallel MPI program on a set of nodes in a  
cluster. It starts, monitors, and terminates the node programs. mpirunuses ssh  
(secure shell) to log in to individual cluster machines and prints any messages that  
the node program prints on stdoutor stderron the terminal from which mpirun  
is invoked. It is therefore usually desirable to either configure all cluster nodes to  
use shosts.equiv(see section 2.9), or for users to use ssh-agent(see  
section 3.5.1) in order to allow MPI programs to be run without requiring that a  
password be entered for each node in the job.  
The general syntax is:  
$ mpirun [mpirun_options...] program-name [program options]  
3-10  
IB6054601-00 D  
 
   
3 – Using InfiniPath MPI  
InfiniPath MPI Details  
Q
program-name will generally be the pathname to the executable MPI program. If  
the MPI program resides in the current directory and the current directory is not in  
your search path, then program-name must begin with ‘./’, such as:  
./program-name  
Unless you want to run only one instance of the program, you need to use the -np  
option, as in:  
$ mpirun -np n [other options] program-name  
This spawns ninstances of program-name. We usually call these instances node  
programs.  
Eachnodeprogramisstartedasaprocessononenode. Whileitiscertainlypossible  
for a node program to fork child processes, the children must not themselves call  
MPI functions.  
mpirunmonitors the parallel MPI job, terminating when all the node programs in  
that job exit normally, or if any of them terminates abnormally.  
Killing the mpirunprogram kills all the processes in the job. Use Ctrl-Cto do this.  
3.5.6  
The mpihostsFile  
As noted in section 3.3 you have created an mpihostsfile (also called a machines  
file, node file, or hosts file) in your current working directory. This file names the  
nodes on which the node programs may run. The mpihostsfile contains lines of  
the form:  
hostname[:p]  
The optional part :pspecifies the number of node programs that can be spawned  
on that node. When not specified, the default value is 1. The two supported formats  
for the mpihostsfile are:  
hostname1  
hostname2  
...  
or  
hostname1:process_count  
hostname2:process_count  
...  
In the first format, if the -np count is greater than the number of lines in the machine  
file, the hostnames will be repeated (in order) as many times as necessary for the  
requested number of node programs.  
In the second format process_countcan be different for each host, and is normally  
the number of available processors on the node. Up to process_countnode  
IB6054601-00 D  
3-11  
 
     
3 – Using InfiniPath MPI  
InfiniPath MPI Details  
Q
programs will be started on that host before using the next entry in the mpihosts  
file. If the full mpihostsfile is processed, and there are still more processes  
requested, processing starts again at the start of the file.  
You have several alternative ways of specifying the mpihostsfile.  
1. First, as noted in section 3.3.1, you can use the command line option -m:  
$ mpirun -np n -m mpihosts [other options] program-name  
In this case, if the named file cannot be opened, the MPI job fails.  
2. If the -moption is omitted, mpirunchecks the environment variable MPIHOSTS  
for the name of the MPI hosts file. If this variable is defined and the file it names  
cannot be opened, then the MPI job fails.  
3. In the absence of both the -moption and the MPIHOSTSenvironment variable,  
mpirunuses the file ./mpihosts, if it exists.  
4. If none of these three methods of specifying the hosts file are used, mpirun  
looks for the file ~/.mpihosts.  
If you are working in the context of a batch queuing system, it may provide you with  
a job submission script that generates an appropriate mpihostsfile.  
3.5.7  
Console I/O in MPI Programs  
mpirunsends any output printed to stdout or stderrby any node program to the  
terminal. This output is line-buffered, so the lines output from the various node  
programs will be non-deterministically interleaved on the terminal. Using the -l  
option to mpirunwill label each line with the rank of the node program that produced  
it.  
Node programs do not normally use interactive input on stdin, and by default,  
stdin is bound to /dev/null. However, for applications that require standard input  
redirection, InfiniPath MPI supports two mechanisms to redirect stdin:  
1. If mpirunis run from the same node as MPI rank 0, all input piped to the mpirun  
command will be redirected to rank 0.  
2. If mpirunis not run from the same node as MPI rank 0 or if the input must be  
redirected to all or specific MPI processes, the -stdinoption can be used to  
redirect a file as standard input to all nodes or to a particular node as specified  
by the -stdin-targetoption.  
3.5.8  
Environment for Node Programs  
The environment variables existing on the front end node on which you run mpirun  
are not propagated to the other nodes. You can set the paths, such as  
3-12  
IB6054601-00 D  
 
           
3 – Using InfiniPath MPI  
InfiniPath MPI Details  
Q
LD_LIBRARY_PATH, and other environment variables for the node programs  
through the use of the -rcfileoption of mpirun:  
$ mpirun -np n -m mpihosts -rcfile mpirunrc program  
In the absence of this option, mpirunchecks to see if a file called  
$HOME/.mpirunrc exists in the user's home directory. In either case, the file is  
sourced by the shell on each node at time of startup of the node program.  
The .mpirunrcshould not contain any interactive commands. It may contain  
commands that output on stdoutor stderr.  
When you do not specify an mpirunrcfile, either through the option or the default  
~/.mpirunrc, the environment on each node is whatever it would be for the user’s  
login via ssh, unless you are using MPD. (See section 3.8.)  
There is a global options file that can be used for mpirunarguments. The default  
location of this file is:  
/opt/infinipath/etc/mpirun.defaults  
You can use an alternate file by setting the environment variable  
$PSC_MPIRUN_DEFAULTS_PATH. See the mpirunman page for more  
information.  
3.5.8.1  
Environment for Multiple Versions of InfiniPath or MPI  
The variable INFINIPATH_ROOTsets a root prefix for all Infinipath-related paths.  
It is used by mpirunto try to find the mpirun-ipath-ssh executable, and it is  
also used to set up LD_LIBRARY_PATHfor new programs. This allows multiple  
versions of the InfiniPath software releases to be installed on some or all nodes, as  
well as having InfiniPath MPI and other version(s) of MPI installed at the same time.  
It may be set in the environment, in mpirun.defaults, or in an rcfile (such  
as .mpirunrc, .bashrcor .cshrc) that will be invoked on remote nodes.  
If you have used the --prefixargument with the rpmcommand to change the  
root prefix for the InfiniPath installation, then set INFINIPATH_ROOTto the same  
value.  
If INFINIPATH_ROOTis notset, the normal PATHis usedunless mpirunis invoked  
with a full pathname.  
NOTE: mpirun-sshwas renamed mpirun-ipath-sshso as to avoid name  
collisions with other MPI implementations.  
IB6054601-00 D  
3-13  
 
     
3 – Using InfiniPath MPI  
InfiniPath MPI Details  
Q
3.5.9  
Multiprocessor Nodes  
Another command line option, -ppn, instructs mpirunto assign a fixed number p  
of node programs to each node, as it distributes the ninstances among the nodes:  
$ mpirun -np n -m mpihosts -ppn p program-name  
This option overrides the :p specifications, if any, in the lines of the MPI hosts file.  
As a general rule, mpiruntries to distribute the n node programs among the nodes  
without exceeding on any node the maximum number of instances specified by  
the :poption. The value of the :p option is specified by either the -ppncommand  
line option or in the mpihostsfile.  
NOTE: When the -npvalue is larger than the number of nodes in the mpi hostsfile  
times the -ppnvalue, mpirunwill cycle back through the hostsfile,  
assigning additional node programs per host.  
Normally, the number of node programs should be no larger than the number of  
processors on the node, at least not for compute-bound problems. In the current  
implementationoftheInfiniPathinterconnect, nonodecanrunmore thaneight node  
programs.  
For improved performance, InfiniPath MPI uses shared memory to pass messages  
between node programs running on the same host.  
3.5.10  
mpirunOptions  
Here is a list summarizing the most commonly used options to mpirun. See the  
man page for a more complete listing.  
-np np  
Number of processes to spawn.  
-ppn processes-per-node  
Create up to specified number of processes per node.  
-machinefile filename, -m filename  
Machines (mpihosts) file, the list of hosts to be used for this job.  
Default: $MPIHOSTS, then ./mpihosts, then ~/.mpihosts  
-M  
Print a formatted list of MPI-level stats of interest for the MPI programmer  
3-14  
IB6054601-00 D  
 
         
3 – Using InfiniPath MPI  
InfiniPath MPI Details  
Q
-verbose  
Print diagnostic messages from mpirunitself. Can be useful in troubleshooting  
Default: Off  
-version, -v  
Print MPI version. Default: Off  
-help, -h  
Print mpirunhelp message. Default: Off  
-rcfile node-shell-script  
Startup script for setting environment on nodes.  
Default: $HOME/.mpirunrc  
-in-xterm  
Run each process in an xterm window. Default: Off  
-display X-server  
X Display for xterm. Default: None  
-debug  
Run each process under debugger in an xterm window. Uses gdbby default.  
Default: Off  
Set -q 0when using -debug.  
-debug-no-pause  
Like debug, except doesn't pause at beginning. Uses gdbby default.  
Default: Off  
-debugger gdb|pathdb|strace  
Which debugger to use.Default: gdb  
-psc-debug-level mask  
Controls the verbosity of MPI and InfiniPath debug messages for node programs.  
A synonym is -d mask.  
Default: 1  
IB6054601-00 D  
3-15  
 
3 – Using InfiniPath MPI  
InfiniPath MPI Details  
Q
-nonmpi  
Run a non-MPI program. Required if the node program makes no MPI calls. Default:  
Off  
-quiescence-timeout, seconds  
Wait time in secondsfor quiescence (absence of MPIcommunication) onthe nodes.  
Useful for detecting deadlocks. 0 disables quiescence detection.  
Default: 900  
-disable-mpi-progress-check  
ThisoptiondisablesMPIcommunicationprogresscheck,withoutdisablingtheping  
reply check. Default: Off.  
-l  
Label each line of output on stdoutand stderrwith the rank of the MPI process  
which produces the output.  
-labelstyle string  
Specify the label that is prefixed to error messages and statistics. Process rank is  
the default prefix.  
-stdin filename  
Filename that should be fed as stdin to the node program. Default: /dev/null  
-stdin-target 0..np-1 | -1  
Process rank that should receive the file specified with the -stdinoption. -1 means  
all ranks. Default: -1  
-wdir path-to-working_dir  
Sets the working directory for the node program.  
Default: -wdir current-working-dir  
-print-stats  
Causes each node program to print various MPI statistics to stderron job  
termination. Can be useful for troubleshooting. Default: off. For details, see  
3-16  
IB6054601-00 D  
 
3 – Using InfiniPath MPI  
MPD  
Q
-statsfile file-prefix  
Specifies alternate file to receive the output from the -print-statsoption.  
Default: stderr  
3.6  
Using Other MPI Implementations  
Support for multiple MPI implementations has been added. You can use a different  
version of MPI and achieve the high-bandwidth and low-latency performance that  
it is standard with InfiniPath MPI.  
The currently supported implementations are HP-MPI, OpenMPI and Scali.  
These MPI implementations will run on multiple interconnects, and have their own  
mechanisms for selecting which one you will run on. Please see the documentation  
provided with the version of MPI that you wish to use.  
If you have downloaded and installed another MPI implementation, you will need  
to set your PATHup to pick up the version of MPI you wish to use.  
You will also need to set LD_LIBRARY_PATH, both in your local environment and  
in an rcfile (such as .mpirunrc, .bashrcor .cshrc) that will be invoked on  
remote nodes. See section 3.5.8 and section 3.5.3.1 for information on setting up  
your environment and section C.8.6 for information on setting your run-time library  
path. See also section C.8.7 for information on run time errors that may occur if  
there are MPI version mismatches.  
3.7  
MPI Over uDAPL  
Some MPI implementations can be run over uDAPL. uDAPL is the user mode  
version of the Direct Access Provider Library (DAPL). Examples of such MPI  
implementations are Intel MPI and one option on OpenMPI.  
If you are running such an MPI implementation, the rdma_cmand rdma_ucm  
modules will need to be loaded. To test these modules, use these commands (as  
root):  
# modprobe rdma_cm  
# modprobe rdma_ucm  
To ensure that the modules are loaded whenever the driver is loaded, add rdma_cm  
and rdma_ucmto the OPENFABRICS_MODULESassignment in  
/etc/sysconfig/infinipath.  
3.8  
MPD  
MPD is an alternative to mpirunfor launching MPI jobs. It is described briefly in the  
following sections.  
IB6054601-00 D  
3-17  
 
             
3 – Using InfiniPath MPI  
File I/O in MPI  
Q
3.8.1  
MPD Description  
The Multi-Purpose Daemon (MPD) was developed by Argonne National Laboratory  
(ANL), as part of the MPICH-2 system. While the ANL MPD had certain advantages  
over the use of their mpirun(faster launching, better cleanup after crashes, better  
tolerance of node failures), the InfiniPath mpirunoffers the same advantages.  
The disadvantage of MPD is reduced security, since it does not use sshto launch  
node programs. It is also a little more complex to use than mpirunbecause it  
requiresstartingaringofMPDdaemonsonthenodes. Therefore, mostusersshould  
use the normal mpirunmechanism for starting jobs as described in the previous  
chapter. However, for users who wish to use MPD, it is included in the InfiniPath  
software.  
3.8.2  
Using MPD  
To start anMPDenvironment, usethempdbootprogram. You mustprovidempdboot  
with a file listing the machines on which to run the mpddaemon. The format of this  
file is the same as for the mpihosts file in the mpiruncommand.  
Here is an example of how to run mpdboot:  
$ mpdboot -f hostsfile  
After mpdboothas started the MPD daemons, it will print a status message and  
drop you into a new shell.  
To leave the MPD environment, exit from this shell. This will terminate the daemons.  
TorunanMPIprogramfromwithintheMPDenvironment,usethempiruncommand.  
You do not need to provide a mpihostsfile or a count of CPUs; by default, mpirun  
will use all nodes and CPUs available within the MPD environment.  
To check the status of the MPD daemons, use the mpdpingcommand  
NOTE: To use MPD, the software package mpi-frontend-2.0*.rpmmust be  
installed on all nodes. See the InfiniPath Install Guide for more details on  
software installation.  
3.9  
File I/O in MPI  
File I/O in MPI is discussed briefly in the following two sections.  
3.9.1  
Linux File I/O in MPI Programs  
MPI node programs are Linux programs, which can do file I/O to local or remote  
files in the usual ways through APIs of the language in use. Remote files are  
3-18  
IB6054601-00 D  
 
           
3 – Using InfiniPath MPI  
InfiniPath MPI and Hybrid MPI/OpenMP Applications  
Q
accessed via some network file system, typically NFS. Parallel programs usually  
need to have some data in files to be shared by all of the processes of an MPI job.  
Node programs may also use non-shared, node-specific files, such as for scratch  
storage for intermediate results or for a node’s share of a distributed database.  
There are different styles of handling file I/O of shared data in parallel programming.  
You may have one process, typically on the front end node or on a file server, which  
is the only process to touch the shared files, and which passes data to and from  
the other processes via MPI messages. On the other hand, the shared data files  
could be accessed directly by each node program. In this case, the shared files  
would be available through some network file support, such as NFS. Also, in this  
case, the application programmer would be responsible for ensuring file  
consistency, either through proper use of file locking mechanisms offered by the  
OS and the programming language, such as fcntlin C, or by the use of MPI  
synchronization operations.  
3.9.2  
MPI-IO with ROMIO  
MPI-IO is the part of the MPI2 standard, supporting collective and parallel file IO.  
One of the advantages in using MPI-IO is that it can take care of managing file locks  
in case of file data shared among nodes.  
InfiniPath MPI includes ROMIO version 1.2.6, a high-performance, portable  
implementation of MPI-IO from Argonne National Laboratory. ROMIO includes  
everything defined in the MPI-2 I/O chapter of the MPI-2 standard except support  
for file interoperability and user-defined error handlers for files. Of the MPI-2  
features, InfiniPath MPI includes only the MPI-IO features implemented in ROMIO  
version 1.2.6 and the generalized MPI_Alltoallw communication exchange. See the  
ROMIO documentation in http://www.mcs.anl.gov/romio for details.  
3.10  
InfiniPath MPI and Hybrid MPI/OpenMP Applications  
InfiniPathMPIsupportshybridMPI/OpenMPapplicationsprovidedthatMPIroutines  
are only called by the master OpenMP thread. This is called the funneled thread  
model. Instead of MPI_Init/MPI_INIT (for C/C++ and Fortran respectively), the  
program can call MPI_Init_thread/MPI_INIT_THREAD to determine the level of  
thread support and the value MPI_THREAD_FUNNELED will be returned.  
To use this feature the application should be compiled with both OpenMP and MPI  
code enabled. To do this, use the -mpflag on the mpicc compile line.  
As mentioned above, MPI routines must only be called by the master OpenMP  
thread. The hybrid executable is executed as usual using mpirun, but typically only  
one MPI process is run per node and the OpenMP library will create additional  
threads to utilize all CPUs on that node. If there are sufficient CPUs on a node, it  
IB6054601-00 D  
3-19  
 
         
3 – Using InfiniPath MPI  
Debugging MPI Programs  
Q
may be desirable to run multiple MPI processes and multiple OpenMP threads per  
node.  
The number of OpenMP threads is typically controlled by the  
OMP_NUM_THREADS environment variable in the .mpirunrcfile. This may be  
used to adjust the split between MPI processes and OpenMP threads. Usually the  
number of MPI processes (per node) times the number of OpenMP threads will be  
set to match the number of CPUs per node. An example case would be a node with  
4 CPUs, running 1 MPI process and 4 OpenMP threads. In this case,  
OMP_NUM_THREADS is set to 4. OMP_NUM_THREADS is on a per-node basis.  
See the section 3.5.8 for information on setting environment variables.  
The MPI_THREAD_SERIALIZED and MPI_THREAD_MULTIPLE models are not  
yet supported.  
NOTE: If there are more threads than CPUs, then both MPI and OpenMP  
performance can be significantly degraded due to over-subscription of  
the CPUs.  
3.11  
Debugging MPI Programs  
Debugging parallel programs is substantially more difficult than debugging serial  
programs. Thoroughly debugging the serial parts of your code before parallelizing  
is good programming practice.  
3.11.1  
MPI Errors  
Almost all MPI routines (except MPI_Wtimeand MPI_Wtick) return an error code;  
as the function return value in C functions or as the last argument in a Fortran  
subroutine call. Before the value is returned, the current MPI error handler is called.  
By default, this error handler aborts the MPI job. Therefore you can get information  
about MPI exceptions in your code by providing your own handler for  
MPI_ERRORS_RETURN. See the manpage for MPI_Errhandler_setfor details.  
NOTE: MPI does not guarantee that an MPI program can continue past an error.  
See the standard MPI documentation referenced in appendix D for details on the  
MPI error codes.  
3.11.2  
Using Debuggers  
The InfiniPath software supports the use of multiple debuggers, including pathdb,  
gdb, and the system call tracing utility strace. These debuggers let you set  
breakpoints in a running program, and examine and set its variables.  
3-20  
IB6054601-00 D  
 
       
3 – Using InfiniPath MPI  
InfiniPath MPI Limitations  
Q
Symbolic debugging is easier than machine language debugging. To enable  
symbolic debugging you must have compiled with the -goption to mpicc so that  
the compiler will have included symbol tables in the compiled object code.  
To run your MPI program with a debugger use the -debug or -debug-no-pause  
and -debuggeroptions to mpirun. See the manpages to pathdb, gdb, and strace  
for details. When you run under a debugger, you get an xterm window on the front  
end machine for each node process. Thus, you can control the different node  
processes as desired.  
To use stracewith your MPI program, the syntax would be:  
$ mpirun -np n -m mpihosts strace program-name  
The following features of InfiniPath MPI especially facilitate debugging:  
Stack backtraces are provided for programs that crash.  
-debugand -debug-no-pauseoptions are provided for mpirunthat can make  
each node program start with debugging enabled. The -debugoption allows you  
to set breakpoints, and start running programs individually. The  
-debug-no-pauseoption allows postmortem inspection. Note that you should  
set -q 0when using -debug.  
Communication between mpirunand node programs can be printed by  
specifying the mpirun -verboseoption.  
MPI implementation debug messages can be printed by specifying the mpirun  
-psc-debug-level option. Note that this can substantially impact the  
performance of the node program.  
Support is provided for progress timeout specifications, deadlock detection, and  
generating information about where a program is stuck.  
Several misconfigurations (such as mixed use of 32-bit/64-bit executables) are  
detected by the runtime.  
A formatted list containing information useful for high-level MPI application  
profiling is provided by using the -print-statsoption with mpirun. Statistics  
include minimum, maximum and median values for message transmission  
protocols as well as a more detailed information for expected and unexpected  
message reception. See appendix C.8.13 for more information and a sample  
output listing.  
3.12  
InfiniPath MPI Limitations  
The current version of InfiniPath MPI has the following limitations:  
By default, at most eight node programs per node with the QHT7140 are allowed,  
and at most four node programs per node with the QLE7140. The error message  
when this limit is exceeded is:  
IB6054601-00 D  
3-21  
 
       
3 – Using InfiniPath MPI  
InfiniPath MPI Limitations  
Q
No ports available on /dev/ipath  
NOTE: If port sharing is enabled, this limit is raised to 16 and 8 respectively. To  
enable port sharing, set PSM_SHAREDPORTS=1 in your environment  
There are no C++ bindings to MPI -- use the extern C MPI function calls.  
In MPI-IO file I/O calls in the Fortran binding, offset or displacement arguments are  
limited to 32 bits. Thus, for example, the second argument of MPI_File_seekmust  
lie between -231 and 231-1, and the argument to MPI_File_read_atmust lie  
between 0 and 232-1.  
3-22  
IB6054601-00 D  
 
Appendix A  
Benchmark Programs  
Several MPI performance measurement programs are installed from the  
mpi-benchmarkRPM. This Appendix describes these useful benchmarks and how  
to run them. These programs are based on code from the group of Dr. Dhabaleswar  
K. Panda at the Network-Based Computing Laboratory at the Ohio State University.  
For more information, see:  
http://nowlab.cis.ohio-state.edu/  
These programs allow you to measure the MPI latency and bandwidth between two  
or more nodes in your cluster. Both the executables, and the source for those  
executables, are shipped. The executables are shipped in the mpi-benchmark  
RPM, and installed under /usr/bin. The source is shipped in the mpi-develRPM  
and installed under  
/usr/share/mpich/examples/performance.  
The examples given below are intended only to show the syntax for invoking these  
programs and the meaning of the output. They are NOT representations of actual  
InfiniPath performance characteristics.  
A.1  
Benchmark 1: Measuring MPI Latency Between Two Nodes  
In the MPI community, latency for a message of given size is defined to be the time  
difference between a node program’s calling MPI_Sendand the time that the  
corresponding MPI_Recvin the receiving node program returns. By latency, alone  
without a qualifying message size, we mean the latency for a message of size zero.  
This latency represents the minimum overhead for sending messages, due both to  
software overhead and to delays in the electronics of the fabric. To simplify the  
timing measurement, latencies are usually measured with a ping-pong method,  
timing a round-trip and dividing by two.  
The program osu_latency, from Ohio State University, measures the latency for a  
range of messages sizes from 0 to 4 megabytes. It uses a ping-pong method, in  
which the rank 0 process initiates a series of sends and the rank 1 process echoes  
them back, using the blocking MPI send and receive calls for all operations. Half  
the time interval observed by the rank 0 process for each such exchange is a  
measure of the latency for messages of that size, as defined above. The program  
uses a loop, executing many such exchanges for each message size, in order to  
get an average. It defers the timing until the message has been sent and received  
a number of times, in order to be sure that all the caches in the pipeline have been  
filled.  
IB6054601-00 D  
A-1  
 
           
A – Benchmark Programs  
Benchmark 2: Measuring MPI Bandwidth Between Two Nodes  
Q
This benchmark always involves just two node programs. You can run it with the  
command:  
$ mpirun -np 2 -ppn 1 -m mpihosts osu_latency  
The -ppn 1option is needed to be certain that the two communicating processes  
are on different nodes. Otherwise, in the case of multiprocessor nodes, mpirun  
might assign the two processes to the same node, and so the result would not be  
indicative of the latency of the InfiniPath fabric, but rather of the shared memory  
transport mechanism. Here is what the output of the program looks like:  
# OSU MPI Latency Test (Version 2.0)  
# Size  
0
Latency (us)  
1.26  
1
1.26  
2
1.26  
4
1.26  
8
1.26  
16  
1.45  
32  
1.47  
64  
1.52  
128  
1.63  
256  
1.88  
512  
2.34  
1024  
2048  
4096  
8192  
16384  
32768  
65536  
131072  
262144  
524288  
1048576  
2097152  
4194304  
3.25  
5.13  
7.34  
11.58  
20.25  
37.56  
78.69  
149.84  
287.49  
565.84  
1119.18  
2220.18  
4424.59  
The first column gives the message size in bytes, the second gives the average  
(one-way) latency in microseconds. Again, this example is given to show the syntax  
of the command and the format of the output, and is not meant to represent actual  
values that might be obtained on any particular InfiniPath installation.  
A.2  
Benchmark 2: Measuring MPI Bandwidth Between Two Nodes  
The osu_bwbenchmark is meant to measure the maximum rate at which you can  
pump data between two nodes. It also uses a ping-pong mechanism, similar to the  
osu_latencycode, except in this case, the originator of the messages pumps a  
number of them (64 in the installed version) in succession using the non-blocking  
A-2  
IB6054601-00 D  
 
       
A – Benchmark Programs  
Benchmark 3: Messaging Rate Microbenchmarks  
Q
MPI_Isend function, while the receiving node consumes them as quickly as it can  
usingthenon-blockingMPI_Irecv, andthenreturnsazero-lengthacknowledgement  
when all of the set has been received.  
You can run this program with:  
$ mpirun -np 2 -ppn 1 -m mpihosts osu_bw  
Typical output might look like:  
# OSU MPI Bandwidth Test (Version 2.0)  
# Size  
1
Bandwidth (MB/s)  
2.250465  
2
4.475789  
4
8.979276  
8
17.952547  
16  
27.615041  
32  
52.676363  
64  
128  
256  
512  
104.704225  
198.347505  
335.396929  
521.273433  
829.369420  
884.249845  
926.723948  
934.093084  
941.191459  
938.179872  
945.163478  
950.206048  
951.938802  
952.912385  
953.716825  
953.922714  
954.119999  
1024  
2048  
4096  
8192  
16384  
32768  
65536  
131072  
262144  
524288  
1048576  
2097152  
4194304  
Note that the increase in measured bandwidth with messages size results from the  
fact that latency’s contribution to the measured time interval becomes relatively  
smaller.  
A.3  
Benchmark 3: Messaging Rate Microbenchmarks  
mpi_multibw is the microbenchmark used to highlight QLogic’s messaging rate  
results. This benchmark is a modified form of the OSU NOWlab’s osu_bw  
IB6054601-00 D  
A-3  
 
     
A – Benchmark Programs  
Benchmark 3: Messaging Rate Microbenchmarks  
Q
benchmark (as shown in the example above). It has been enhanced with the  
following additional functionality:  
Messaging rate reported as well as bandwidth  
N/2 dynamically calculated at end of run  
Allows user to run multiple processes per node and see aggregate bandwidth  
and messaging rates  
The benchmark has been updated with code to dynamically determine which  
processes are on which host. This is an example showing the type of output you  
will see when you run mpi_multibw:  
$ mpirun -np 8 ./mpi_multibw  
This will run on four processes per node. Typical output might look like:  
# PathScale Modified OSU MPI Bandwidth Test  
(OSU Version 2.2, PathScale $Revision: 1.1 $)  
# Running on 4 procs per node  
# Size  
1
2
4
8
16  
32  
64  
128  
Aggregate Bandwidth  
8.150462  
16.693747  
33.086567  
66.733488  
(MB/s) Messages/s  
8150461.697283  
8346873.631841  
8271641.814960  
8341686.016159  
6756067.602089  
6679160.388156  
6086626.744516  
4450558.832794  
2835261.345093  
1639550.807757  
892019.592596  
466056.028717  
233025.991467  
116510.340772  
58257.857017  
29130.469523  
14481.726572  
7261.555084  
108.097082  
213.733132  
389.544112  
569.671531  
725.826904  
839.450014  
913.428063  
954.482747  
954.474461  
954.452712  
954.496729  
954.547225  
949.074433  
951.786548  
952.193849  
952.391830  
952.490368  
952.539382  
952.566591  
256  
512  
1024  
2048  
4096  
8192  
16384  
32768  
65536  
131072  
262144  
524288  
1048576  
2097152  
4194304  
3632.331272  
1816.543255  
908.365600  
454.206172  
227.109573  
Searching for N/2 bandwidth. Maximum Bandwidth of 954.547225  
MB/s...  
Found N/2 bandwidth of 476.993060 MB/s at size 94 bytes  
This microbenchmark is available and can be downloaded from the QLogic website:  
A-4  
IB6054601-00 D  
 
A – Benchmark Programs  
Benchmark 4: Measuring MPI Latency in Host Rings  
Q
A.4  
Benchmark 4: Measuring MPI Latency in Host Rings  
The program mpi_latencycan be used to measure latency in a ring of hosts. Its  
syntax is a bit different from Benchmark 1 in that it takes command line arguments  
that let you specify the message size and the number of messages over which to  
average the results. So, for example, if you have a hosts file listing four or more  
nodes, the command:  
$ mpirun -np 4 -ppn 1 -m mpihosts mpi_latency 100 0  
might produce output like this:  
0
1.760125  
This indicates that it took an average of 1.76 microseconds per hop to send a  
zero-length message from the first host, to the second, to the third, to the fourth,  
and then get replies back in the other direction.  
IB6054601-00 D  
A-5  
 
       
A – Benchmark Programs  
Benchmark 4: Measuring MPI Latency in Host Rings  
Q
Notes  
A-6  
IB6054601-00 D  
 
Appendix B  
Integration with a Batch Queuing System  
Most cluster systems use some kind of batch queuing system as an orderly way to  
provideuserswithaccesstotheresourcestheyneedtomeettheirjob’sperformance  
requirements. One of the tasks of the cluster administrator is to provide means for  
users to submit MPI jobs through such batch queuing systems. This can take the  
form of a script, which your users can invoke much as they would invoke mpirun  
to submit their MPI jobs. A sample script is presented in this section.  
B.1  
A Batch Queuing Script  
We give an example of the some of the functions that such a script might perform,  
in the context of the Simple Linux Utility Resource Manager (SLURM) developed  
at Lawrence Livermore National Laboratory. These functions assume the use of the  
bashshell. We will call this script batch_mpirun. It is provided here:  
#! /bin/sh  
# Very simple example batch script for InfiniPath MPI, using slurm  
# (http://www.llnl.gov/linux/slurm/)  
# Invoked as:  
# batch_mpirun #cpus mpi_program_name mpi_program_args ...  
#
np=$1 mpi_prog="$2" # assume arguments to script are correct  
shift 2 # program args are now $@  
eval ‘srun --allocate --ntasks=$np --no-shell‘  
mpihosts_file=‘mktemp -p /tmp mpihosts_file.XXXXXX‘  
srun --jobid=${SLURM_JOBID} hostname -s | sort | uniq -c \  
| awk ’{printf "%s:%s\n", $2, $1}’ > $mpihosts_file  
mpirun -np $np -m $mpihosts_file "$mpi_prog" $@  
exit_code=$?  
scancel ${SLURM_JOBID}  
rm -f $mpihosts_file  
exit $exit_code  
In the following sections, setup and the various functions of the script are discussed  
in further detail.  
B.1.1  
Allocating Resources  
When the mpiruncommand starts, it requires specification of the number of node  
programs it must spawn (via the -np option) and specification of an mpihostsfile  
listing the nodes on which the node programs may be run. (See section 3.5.8 for  
more information.) Normally, since performance is usually important, a user might  
IB6054601-00 D  
B-1  
 
         
B – Integration with a Batch Queuing System  
A Batch Queuing Script  
Q
require that his node program be the only application running on each node CPU.  
In a typical batch environment, the MPI user would still specify the number of node  
programs, but would depend on the batch system to allocate specific nodes when  
the required number of CPUs becomes available. Thus, batch_mpirunwould take  
at least an argument specifying the number of node programs and an argument  
specifying the MPI program to be instantiated. For example,  
$ batch_mpirun -np n my_mpi_program  
After parsing the command line arguments, the next step of batch_mpirunwould  
be to request an allocation of nprocessors from the batch system. In SLURM, this  
would use the command  
eval ‘srun --allocate --ntasks=$np --no-shell‘  
Make sure to use back-quotes rather than normal single-quotes. $npis the shell  
variable that your script has set from the parsing of its command line options. The  
--no-shelloption to srunprevents SLURM from starting a subshell. The srun  
command is run with evalin order to set the SLURM_JOBIDshell variable from the  
output of the sruncommand.  
With these specified arguments, the SLURM function srunblocks until there are  
$npprocessors available to commit to the caller. When the requested resources  
are available, this command opens a new shell and allocates the requested number  
of processors to it.  
B.1.2  
Generating the mpihostsFile  
Once the batch system has allocated the required resources, your script must  
generate a mpihostsfile, which contains a list of nodes that will be used. To do this,  
it must find out which nodes the batch system has allocated, and how many  
processes we can start on each node. This is the part of the script batch_mpirun  
that performs these tasks:  
mpihosts_file=‘mktemp -p /tmp mpihosts_file.XXXXXX‘  
srun --jobid=${SLURM_JOBID} hostname -s | sort | uniq -c \  
| awk ’{printf "%s:%s\n", $2, $1}’ > $mpihosts_file  
The first command creates a temporary hosts file with a random name, and assigns  
the name to the variable mpihosts fileit has generated.  
The next instance of the SLURM sruncommand runs hostname -sonce per  
process slot that SLURM has allocated to us. If SLURM has allocated two slots on  
one node, we thus get the output of hostname -stwice for that node.  
The sort | uniq -ccomponent tells us the number of times each unique line was  
printed. The awkcommand converts the result into the mpihostsfile format used  
B-2  
IB6054601-00 D  
 
   
B – Integration with a Batch Queuing System  
A Batch Queuing Script  
Q
bympirun.Eachlineconsistsofanodename, acolon, andthenumberofprocesses  
to start on that node.  
NOTE: This is one of two formats that the file may use. Seesection 3.5.6 for more  
information.  
B.1.3  
Simple Process Management  
At this point, your script has enough information to be able to run an MPI program.  
All that remains is to start the program when the batch system tells us that we can  
do so, and notify the batch system when the job completes. This is done in the final  
part of batch_mpirun:  
mpirun -np $np -m $mpihosts_file "$mpi_prog" $@  
exit_code=$?  
scancel ${SLURM_JOBID}  
rm -f $mpihosts_file  
exit $exit_code  
B.1.4  
Clean Termination of MPI Processes  
The InfiniPath software will normally ensure clean termination of all MPI programs  
when a job ends, but in some rare circumstances an MPI process will remain alive,  
and potentially interfere with future MPI jobs. To avoid this problem, the usual  
solution is to run a script before and after each batch job which kills all unwanted  
processes. QLogic does not provide such a script, but it is useful to know how to  
findoutwhichprocessesonanodeareusingtheInfiniPathinterconnect.Theeasiest  
way to do this is through use of the fusercommand, which is normally installed in  
/sbin.Run as root:  
# /sbin/fuser -v /dev/ipath  
/dev/ipath: 22648m22651m  
In this example, processes 22648 and 22651 are using the InfiniPath interconnect.  
It is also possible to use this command (as root):  
# lsof /dev/ipath  
This gets a list of processes using InfiniPath. Additionally, to get all processes,  
including statsprograms, ipath_sma, diags, and others, run the program in this  
way:  
# /sbin/fuser -v /dev/ipath*  
losfcan also take the same form:  
# lsof /dev/ipath*  
IB6054601-00 D  
B-3  
 
   
B – Integration with a Batch Queuing System  
Lock Enough Memory on Nodes When Using SLURM  
Q
The following command will terminate all processes using the InfiniPath  
interconnect:  
# /sbin/fuser -k /dev/ipath  
For more information, see the man pages for fuser(1)and lsof(8).  
NOTE: Run these commands as root to insure that all processes are reported.  
B.2  
Lock Enough Memory on Nodes When Using SLURM  
This is identical to information provided in appendix C.8.11. It is repeated here for  
your convenience.  
InfiniPath MPI requires the ability to lock (pin) memory during data transfers on each  
compute node. This is normally done via /etc/initscript, which is created or  
modified during the installation of the infinipathRPM (setting a limit of 64MB,  
with the command "ulimit -l 65536").  
Some batch systems, such as SLURM, propagate the user’s environment from the  
node where you start the job to all the other nodes. For these batch systems, you  
may need to make the same change on the node from which you start your batch  
jobs.  
If this file is not present or the node has not been rebooted after the infinipath  
RPM has been installed, a failure message similar to this will be generated:  
$ mpirun -m ~/tmp/sm -np 2 -mpi_latency 1000 1000000  
node-00:1.ipath_update_tid_err: failed: Cannot allocate memory  
mpi_latency:  
/fs2/scratch/infinipath-build-1.3/mpi-1.3/mpich/psm/src  
mq_ips.c:691:  
mq_ipath_sendcts: Assertion ‘rc == 0’ failed. MPIRUN: Node program  
unexpectedly quit. Exiting.  
You can check the ulimit -lon all the nodes by running ipath_checkout. A  
warning will be given if ulimit -lis less that 4096.  
There are two possible solutions to this. If infinipathis not installed on the node  
where you start the job, set this value in the following way. You must be root to set it:  
# ulimit -l 65536  
Or, if you have installed infinipathon the node, reboot it to insure that  
/etc/initscriptis run.  
B-4  
IB6054601-00 D  
 
   
Appendix C  
Troubleshooting  
This Appendix describes some of the existing provisions for diagnosing and fixing  
problems. The sections are organized in the following order:  
C.1 “Troubleshooting InfiniPath adapter installation”  
C.2 “BIOS settings”  
C.3 “Software installation issues”  
C.4 “Kernel and initialization issues”  
C.5 “OpenFabrics issues”  
C.6 “System administration troubleshooting”  
C.7 “Performance issues”  
C.8 “InfiniPath MPI troubleshooting”  
C.9 “Useful programs and files for debugging”  
C.1  
Troubleshooting InfiniPath Adapter Installation  
This section lists conditions you may encounter while installing the InfiniPath  
QLE7140 or QHT7140 adapter, and offers suggestions for working around them.  
C.1.1  
Mechanical and Electrical Considerations  
The LEDs function as link and data indicators once the InfiniPath hardware and  
software has been installed, the driver has been loaded, and the fabric is being  
actively managed by a Subnet Manager. The following table shows the possible  
IB6054601-00 D  
C-1  
 
           
C – Troubleshooting  
BIOS Settings  
Q
states of the LEDs. The green LED will normally illuminate first. The normal state  
is Green On, Amber On.  
Table C-1. LED Link and Data Indicators  
LED  
Color  
Status  
Power  
Green  
ON  
OFF  
Signal detected.  
Switch not powered up.  
Software not installed or started.  
Loss of signal.  
Ready to talk to an SM to bring  
link fully up.  
Check cabling.  
Link  
Amber  
ON  
OFF  
Link configured.  
SM may be missing.  
Properly connected and ready Link may not be configured.  
to receive data and link packets. Check the connection.  
If a node repeatedly and spontaneously reboots when attempting to load the  
InfiniPath driver, it may be a symptom that its InfiniPath interconnect board is not  
well seated in the HTX or PCIe slot.  
C.1.2  
Some HTX Motherboards May Need 2 or More CPUs in Use  
Some HTX motherboards may require that 2 or more of the CPUs be in use for the  
HTX InfiniPath card to be recognized. This is most evident in four-socket  
motherboards.  
C.2  
BIOS Settings  
This section covers issues related to improper BIOS settings.The two most  
important settings are:  
ACPI needs to be enabled  
MTRR mapping needs to be set to “Discrete”  
If ACPI has been disabled, it may result in initialization problems, as described in  
An improper setting for MTRR mapping can result in reduced performance. See  
NOTE: BIOS settings on IBM Blade Center H (Power) systems do not need  
adjustment.  
C-2  
IB6054601-00 D  
 
       
C – Troubleshooting  
BIOS Settings  
Q
C.2.1  
MTRR Mapping and Write Combining  
MTRR (Memory Type Range Registers) is used by the InfiniPath driver to enable  
write combining to the InfiniPath on-chip transmit buffers. This improves write  
bandwidth to the InfiniPath chip by writing multiple words in a single bus transaction  
(typically 64). This applies only to x86_64 systems. To see if is working correctly  
and to check your bandwidth use this command:  
$ ipath_pkt_test -B  
When configured correctly, PCIe InfiniPath will normally report in the range of  
1150-1500 MB/s, while HTX InfiniPath cards will normally report in the range of  
2300-2650 MB/s.  
However, some BIOSes don’t have the MTRR mapping option. It may be referred  
to in a different way, dependent upon chipset, vendor, BIOS, or other factors. For  
example, it is sometimes referred to as "32 bit memory hole", which should be  
enabled.  
If there is no setting for MTRR mapping or 32 bit memory hole, please contact your  
system or motherboard vendor and inquire as to how write combining may be  
enabled.  
C.2.2  
Incorrect MTRR Mapping  
In some cases, the InfiniPath driver may be unable to configure the CPU Write  
Combining attributes for the QLogic InfiniPath IBA6110. This would normally be  
seen for a new system, or after the system’s BIOS has been upgraded or  
reconfigured.  
If this error occurs, the InfiniPath interconnect will operate, but in a degraded  
performance mode. Typically the latency will increase to several microseconds, and  
the bandwidth may decrease to as little as 200 MBytes/sec.  
A message similar to this will be printed on the console, and normally to the system  
log (typically in /var/log/messages):  
infinipath: mtrr_add(feb00000,0x100000,WC,0) failed (-22)  
infinipath: probe of 0000:04:01.0 failed with error -22  
If you see this error message, you should edit the BIOS setting for MTRR Mapping.  
The setting should look like this:  
MTRR Mapping  
[Discrete]  
You can check and adjust the BIOS settings using the BIOS Setup Utility. Check  
the hardware documentation that came with your system for more information on  
how to do this. Section C.2.3, below, documents a related issue.  
IB6054601-00 D  
C-3  
 
         
C – Troubleshooting  
BIOS Settings  
Q
C.2.3  
Incorrect MTRR Mapping Causes Unexpected Low Bandwidth  
This same MTRR Mapping setting as described in the previous section can also  
cause unexpected low bandwidth if it is set incorrectly.  
The setting should look like this:  
MTRR Mapping  
[Discrete]  
The MTRR Mapping needs to be set to Discrete if there is 4GB or more memory in  
the system; it affects where the PCI, PCIe, and HyperTransport i/o addresses  
(BARs) are mapped. If there is 4GB or more memory in the system, and this is not  
set to Discrete, you will get very low bandwidth (under 250 MB/sec) on anything  
that would normally run near full bandwidth. The exact symptoms can vary with  
BIOS, amount of memory, etc., but typically there will be no errors or warnings.  
To check your bandwidth try:  
$ ipath_pkt_test -B  
When configured correctly, PICIe InfiniPath will normally report in the range of  
1150-1500 MB/s, while HTX InfiniPath cards will normally report in the range of  
2300-2650 MB/s. ipath_checkoutcan also be used to check bandwidth.  
You can check and adjust the BIOS settings using the BIOS Setup Utility. Check  
the hardware documentation that came with your system for more information on  
how to do this.  
C.2.4  
Change Setting for Mapping Memory  
In some cases, on systems with 4GB or more memory on Opteron systems with  
InfiniPath HTX cards (QHT7040 or QHT7140), and the Red Hat Enterprise Linux 4  
release with 2.6.9 Linux kernels, MPI jobs may fail to initialize or may terminate  
early. This can be worked around by changing the setting for mapping memory  
around the PCI configuration space ("SoftWare Memory Hole") to "Disabled" in the  
Chipset, Northbridge screen in the BIOS. This will result in a small loss in usable  
memory.  
C.2.5  
Issue with SuperMicro H8DCE-HTe and QHT7040  
The InfiniPath card may not be recognized on startup when using the SuperMicro  
H8DCE-HT-e and the QHT7040 adapter. To fix this problem, the OS selector option  
in the BIOS should be set for Linux. The option will look like this:  
OS Installation [Linux]  
C-4  
IB6054601-00 D  
 
         
C – Troubleshooting  
Software Installation Issues  
Q
C.3  
Software Installation Issues  
This section covers issues related to software installation.  
C.3.1  
OpenFabrics Dependencies  
You need to install sysfsutilsfor your distribution before installing the  
OpenFabrics RPMs, as there are dependencies. If sysfsutils has not been  
installed, you might see error messages like this:  
error: Failed dependencies:  
libsysfs.so.1()(64bit) is needed by  
libipathverbs-2.0-1_100.77_fc3_psc.x86_64  
libsysfs.so.1()(64bit) is needed by  
libibverbs-utils-2.0-1_100.77_fc3_psc.x86_64  
/usr/include/sysfs/libsysfs.h is needed by  
libibverbs-devel-2.0-1_100.77_fc3_psc.x86_64  
Check your distribution’s documentation for information about sysfsutils.  
C.3.2  
Install Warning with RHEL4U2  
You may see a warning similar to this when installing InfiniPath and OpenFabrics  
modules on RHEL4U2.  
infinipath-2.0-7277.1538_fc3_psc  
Building and installing InfiniPath and OpenIB modules for  
2.6.9-22.ELsmp kernel  
Building modules, stage 2.  
Warning: could not find versions for .tmp_versions/ib_mthca.mod  
This warning may be safely ignored.  
C.3.3  
mpirunInstallation Requires 32-bit Support  
On a 64-bit system, 32-bit glibcmust be installed before installing the  
mpi-frontend-*RPM. mpirun, which is part of the mpi-frontend-*RPM,  
requires 32-bit support.  
If 32-bit glibcis not installed on a 64-bit system, you will now see an error like this  
when installing mpi-frontend:  
# rpm -Uv ~/tmp/mpi-frontend-2.0-2250.735_fc3_psc.i386.rpm  
error: Failed dependencies:  
/lib/libc.so.6 is needed by mpi-frontend-2.0 2250.735_fc3_psc.i386  
IB6054601-00 D  
C-5  
 
             
C – Troubleshooting  
Software Installation Issues  
Q
In older distributions, such as RHEL4, the 32-bit glibc will be contained in the  
libgccRPM. The RPM will be named similarly to:  
libgcc-3.4.3-9.EL4.i386.rpm  
In newer distributions, glibcis an RPM name. The 32-bit glibcwill be named  
similarly to:  
glibc-2.3.4-2.i686.rpm  
or  
glibc-2.3.4-2.i386.rpm  
Check your distribution for the exact RPM name.  
C.3.4  
Installing Newer Drivers from Other Distributions  
The driver source now resides in infinipath-kernel. This means that newer  
drivers can be installed as they become available. Those who wish to install newer  
drivers, for example, from OFED (Open Fabrics Enterprise Distribution), should be  
abletodoso. However, someextrastepsneedtobetakeninordertoinstallproperly.  
1. Install all InfiniPath RPMs, including infinipath-kernel. The RPM  
infinipath-kernelinstalls into:  
/lib/modules/$(uname -r)/updates  
This should not affect any other installed InfiniPath or OpenFabrics drivers.  
2. Reload the InfiniPath and OpenFabrics modules to verify that the installation  
works by using this command (as root):  
# /etc/init.d/infinipath restart  
3. Run ipath_checkout or other OpenFabrics test program to verify that the  
InfiniPath card(s) work properly.  
4. Unload the InfiniPath and OpenFabrics modules with the command:  
# /etc/init.d/infinipath stop  
5. Remove the InfiniPath kernel components with the command:  
$ rpm -e infinipath-kernel --nodeps  
The option --nodepsis required because the other InfiniPath RPMs depend  
on infinipath-kernel.  
6. Verify that no InfiniPath or OpenFabrics modules are present in the  
/lib/modules/$(uname -r)/updatesdirectory.  
7. If not yet installed, install the InfiniPath and OpenFabrics modules from your  
alternate set of RPMs.  
C-6  
IB6054601-00 D  
 
 
C – Troubleshooting  
Kernel and Initialization Issues  
Q
8. Reload all modules by using this command (as root):  
# /etc/init.d/infinipath start  
An alternate mechanism can be used, if provided as part of your alternate  
installation.  
9. Run an OpenFabrics test program, such as ibstatus, to verify that your  
InfiniPath card(s) work correctly.  
C.3.5  
Installing for Your Distribution  
You may be using a kernel which is compatible with one of the supported  
distributions, but which may not be picked up during infinipath-kernel  
installation. It may also happen when using make-install.shto manually  
recompile the drivers.  
In this case, you can set your distribution with the $IPATH_DISTROoverride. Run  
this command before installation, or before running make-install.sh. We use  
the RHEL4 Update 4 distribution as an example in this command for bashor sh  
users:  
$ export IPATH_DISTRO=rhel4_U4  
The distribution arguments that are currently understood are listed below. They are  
found in the file build-guards.sh.  
These are used for RHEL, CentOS(Rocks), and Scientific Linux.  
rhel4_U2  
rhel4_U3  
rhel4_U4  
These are used for SLES, SUSE, and Fedora:  
sles9  
sles10  
suse9.3  
fc3  
fc4  
make-install.shand build-guards.share both found in this directory:  
/usr/src/infinipath/drivers  
C.4  
Kernel and Initialization Issues  
Issues that may prevent the system from coming up properly are described.  
IB6054601-00 D  
C-7  
 
           
C – Troubleshooting  
Kernel and Initialization Issues  
Q
C.4.1  
Kernel Needs CONFIG_PCI_MSI=y  
If the InfiniPath driver is being compiled on a machine without CONFIG_PCI_MSI=y  
configured, you will get a compilation error similar to this:  
ib_ipath/ipath_driver.c:46:2: #error "InfiniPath driver can only  
be used with kernels with CONFIG_PCI_MSI=y"  
make[3]: *** [ib_ipath/ipath_driver.o]  
Error 1  
Some kernels, such as some versions of FC4 (2.6.16), have CONFIG_PCI_MSI=n  
as the default. This default may also be introduced with updates to other Linux  
distributions or local configuration changes. This needs to be changed to  
CONFIG_PCI_MSI=y in order for the InfiniPath driver to function.  
The suggested remedy is to install one of the supported Linux kernels (see  
section 1.7), or download a patched kernel from the QLogic website.  
Pre-built kernels and patches for these distributions are available for download on  
the website. Please go to:  
Follow the links to the download page.  
NOTE: As of this writing, kernels later than 2.6.16-1.2108_FC4smp on FC4 no  
longer have this problem.  
C.4.2  
pci_msi_quirk  
A change was made in the kernel.org 2.6.12 kernel that can cause an InfiniPath  
driver runtime error with the QLE7140. This change is found in most linux  
distributions with 2.6.12 - 2.6.16 kernels, including Fedora Core 3, Fedora Core 4,  
and SUSE Linux 10.0. Affected systems are those that contain the AMD8131 PCI  
bridge. Such systems may experience a problem with MSI (Message Signaled  
Interrupt) that impairs the operation of the InfiniPath QLE7140 adapter. The  
InfiniPath driver will not be able to configure the InfiniBand link to the Active state.  
If messages similar to those below are displayed on the console during boot, or are  
in /var/log/messages, then you probably have the problem:  
PCI: MSI quirk detected. pci_msi_quirk set.  
path_core 0000:03:00.0: pci_enable_msi failed: -22, interrupts may  
not work  
Pre-built kernels and patches for these distributions are available for download on  
the website. Please go to:  
Follow the links to the downloads page.  
C-8  
IB6054601-00 D  
 
   
C – Troubleshooting  
Kernel and Initialization Issues  
Q
NOTE: This problem has been fixed in the 2.6.17 kernel.org kernel.  
C.4.3  
Driver Load Fails Due to Unsupported Kernel  
If you try to load the InfiniPath driver on a kernel that InfiniPath software does not  
support, the load fails. Error messages similar to this appear:  
modprobe: error inserting  
’/lib/modules/2.6.3-1.1659-smp/kernel/drivers/infiniband/hw/ipath/  
ib_ipath.ko’: -1 Invalid module format  
To correct this, install one of the appropriate supported Linux kernel versions as  
listed in section 2.3.3, then reload the driver.  
C.4.4  
InfiniPath Interrupts Not Working  
The InfiniPath driver will not be able to configure the InfiniPath link to a usable state  
unless interrupts are working. Check for this with the commands:  
$ grep ib_ipath /proc/interrupts  
Normal output will like similar to this:  
CPU0  
CPU1  
0: 22577705 22968429  
IO-APIC-edge timer  
IO-APIC-edge serial  
IO-APIC-edge rtc  
4:  
8:  
415  
774  
0
0
9:  
0
0 IO-APIC-level acpi  
14:  
15:  
15750  
64559  
23  
IO-APIC-edge ide0  
IO-APIC-edge ide1  
0
533817  
0
364263  
0
169:  
177:  
185:  
193:  
201:  
NMI:  
921 IO-APIC-level eth0  
22767 IO-APIC-level eth1  
0 IO-APIC-level ib_ipath  
0 IO-APIC-level libata  
0 IO-APIC-level ohci_hcd:usb1, ohci_hcd:usb2  
45570  
0
45641  
LOC: 45540410 45540372  
ERR:  
MIS:  
0
0
If there is no output at all, driver initialization has failed. For further information on  
driver problems, see appendix C.4.1, appendix C.4.3, or appendix C.4.6.  
However, if the output appears similar to one of these lines, then interrupts are not  
being delivered to the driver:  
66:  
0
0
0
0
PCI-MSI  
ib_ipath  
185:  
IO-APIC-level ib_ipath  
NOTE: The output you see may vary depending on board type, distribution, or  
update level.  
IB6054601-00 D  
C-9  
 
       
C – Troubleshooting  
Kernel and Initialization Issues  
Q
A zero count in all CPU columns means that no interrupts have been delivered to  
the processor.  
Possible causes are:  
BootingthelinuxkernelwithACPI(AdvancedConfigurationandPowerInterface)  
disabled on the boot command line, or in the BIOS configuration  
Other infinipathinitialization failures  
To check if the kernel was booted with the "noacpi"or "pci=noacpi"options, use  
this command:  
$ grep -i acpi /proc/cmdline  
If output is displayed, fix your kernel boot command line so that ACPI is enabled.  
This can be set in various ways, depending on your distribution. If no output is  
displayed, check to be sure that ACPI is enabled in your BIOS settings.  
To track down other initialization failures, see appendix C.4.6.  
The program ipath_checkoutcan also help flag these kinds of problems. See  
appendix C.9.8 for more information.  
C.4.5  
OpenFabrics Load Errors If ib_ipathDriver Load Fails  
When the ib_ipathdriver fails to load for any reason, all of the OpenFabrics  
drivers/modules loaded by /etc/init.d/infinipath fail with "Unknown symbol" errors:  
ib_mad: Unknown symbol ib_unregister_client  
ib_mad: Unknown symbol ib_query_ah  
.
ib_sa: Unknown symbol ib_unregister_client  
ib_sa: Unknown symbol ib_unpack  
.
ib_ipath: Unknown symbol ib_modify_qp_is_ok  
ib_ipath: Unknown symbol ib_unregister_device  
.
ipath_ether: Unknown symbol ipath_layer_get_mac  
ipath_ether: Unknown symbol ipath_layer_get_lid  
.
NOTE: Not all the error messages are shown here.  
C-10  
IB6054601-00 D  
 
 
C – Troubleshooting  
Kernel and Initialization Issues  
Q
C.4.6  
InfiniPath ib_ipathInitialization Failure  
There may be cases where ib_ipathwas not properly initialized. Symptoms of this  
may show up in error messages from an MPI job or another program. Here is a  
sample command and error message:  
$ mpirun -np 2 -m ~/tmp/mbu13 osu_latency  
<nodename>:The link is down  
MPIRUN: Node program unexpectedly quit. Exiting.  
First, check to be sure that the InfiniPath driver is loaded:  
$ lsmod | grep ib_ipath  
If no output is displayed, the driver did not load for some reason. Try the commands  
(as root):  
# modprobe -v ib_ipath  
# lsmod | grep ib_ipath  
# dmesg | grep ipath | tail -25  
This will indicate whether the driver has loaded. Printing out messages using dmesg  
may help to locate any problems with ib_ipath.  
If the driver loaded, but MPI or other programs are not working, check to see if  
problems were detected during the driver and InfiniPath hardware initialization with  
the command:  
$ dmesg | grep -i ipath  
This may generate more than one screen of output. Also, check the link status with  
the commands:  
$ cat /sys/bus/pci/driver/ib_ipath/0?/status_str  
These commands are normally executed by the ipathbug-helperscript, but  
running them separately may help locate the problem.  
Refer also to appendix C.9.16 and appendix C.9.8.  
C.4.7  
MPI Job Failures Due to Initialization Problems  
If one or more nodes do not have the interconnect in a usable state, messages  
similar to the following will occur when the MPI program is started:  
userinit: userinit ioctl failed: Network is down [1]: device init  
failed  
userinit: userinit ioctl failed: Fatal Error in keypriv.c(520):  
device init failed  
This could indicate that a cable is not connected, the switch is down, SM is not  
running, or a hardware error has occurred.  
IB6054601-00 D  
C-11  
 
     
C – Troubleshooting  
System Administration Troubleshooting  
Q
C.5  
OpenFabrics Issues  
This section covers items related to OpenFabrics, including OpenSM.  
C.5.1  
Stop OpenSM Before Stopping/Restarting InfiniPath  
OpenSM must be stopped before stopping or restarting InfiniPath. If not, error  
messages such as the following will occur:  
# /etc/init.d/infinipath stop  
Unloading infiniband modules: sdp cm umad uverbs ipoib sa ipath  
mad coreFATAL:Module ib_umad is in use.  
Unloading infinipath modules FATAL: Module ib_ipath is in use.  
[FAILED]  
C.5.2  
Load and Configure IPoIB Before Loading SDP  
SDP will generate "Connection Refused" errors if it is loaded before IPoIB has been  
loaded and configured. Loading and configuring IPoIB first should solve the  
problem.  
C.5.3  
Set $IBPATHfor OpenFabrics Scripts  
The environment variable $IBPATH should be set to /usr/bin. If this has not been  
set, or if you have it set to a location other than the installed location, you may see  
error messages similar to this when running some OpenFabrics scripts:  
/usr/bin/ibhosts: line 30: /usr/local/bin/ibnetdiscover: No such  
file or directory  
For the OpenFabrics commands supplied with this InfiniPath release, you should  
set the variable (if it has not been set already), to /usr/binas follows:  
$ export IBPATH=/usr/bin  
C.6  
System Administration Troubleshooting  
The following section gives details on locating problems related to system  
administration.  
C-12  
IB6054601-00 D  
 
             
C – Troubleshooting  
InfiniPath MPI Troubleshooting  
Q
C.6.1  
Broken Intermediate Link  
Sometimes message traffic passes through the fabric while other traffic appears to  
be blocked. In this case, MPI jobs fail to run.  
In large cluster configurations, switches may be attached to other switches in order  
to supply the necessary inter-node connectivity. Problems with these inter-switch  
(or intermediate) links are sometime more difficult to diagnose than failure of the  
final link between a switch and a node. The failure of an intermediate link may allow  
some traffic to pass through the fabric while other traffic is blocked or degraded.  
If you encounter such behavior in a multi-layer fabric, check that all switch cable  
connections are correct. Statistics for managed switches are available on a per-port  
basis, and may help with debugging. See your switch vendor for more information.  
C.7  
Performance Issues  
Performance issues that are currently being addressed are covered in this section.  
C.7.1  
MVAPICH Performance Issues  
MVAPICH over OpenFabrics over InfiniPath performance tuning has not yet been  
done. Improved performance will be delivered in future releases.  
C.8  
InfiniPath MPI Troubleshooting  
Problems specific to compiling and running MPI programs are detailed below.  
C.8.1  
Mixed Releases of MPI RPMs  
MakesurethatalloftheMPIRPMsarefromthesamerelease. Whenusingmpirun,  
an error message will occur if different components of the MPI RPMs are from  
different releases. This is a sample message in the case where mpirunfrom  
release 1.3 is being used with a 2.0 library:  
$ mpirun -np 2 -m ~/tmp/x2 osu_latency  
MPI_runscript-xqa-14.0: ssh -x> Cannot detect InfiniPath  
interconnect.  
MPI_runscript-xqa-14.0: ssh -x> Seek help on loading InfiniPath  
interconnect driver.  
MPI_runscript-xqa-15.1: ssh -x> Cannot detect InfiniPath  
interconnect.  
MPI_runscript-xqa-15.1: ssh -x> Seek help on loading InfiniPath  
interconnect driver.  
MPIRUN: Node program(s) exited during connection setup  
IB6054601-00 D  
C-13  
 
                 
C – Troubleshooting  
InfiniPath MPI Troubleshooting  
Q
$ mpirun -v  
MPIRUN:Infinipath Release2.0 : Built on Wed Nov 19 17:28:58 PDT  
2006 by mee  
The following is the error that occurs when mpirunfrom the 2.0 release is being  
used with the 1.3 libraries:  
$ mpirun-ipath-ssh -np 2 -ppn 1 -m ~/tmp/idev osu_latency  
MPIRUN: mpirun from the 2.0 software distribution requires all  
node processes to be running 2.0 software. At least node  
<nodename> uses non-2.0 MPI libraries  
C.8.2  
Cross-compilation Issues  
The 2.x PathScale compilers aren’t currently supported on systems that use the  
GNU 4.x compilers and compiler environment (header files and libraries). This  
includes Fedora Core 4, Fedora Core 5 and SLES 10. The GNU 4.x environment  
will be supported in the PathScale Complier Suite 3.0 release.  
The current workaround for this is to compile on a supported and compatible  
distribution, then run the executable on one of the systems that uses the GNU 4.x  
compilers and environment.  
To run on FC4 or FC5, install FC3 or RHEL4/CentOS on your build machine.  
Compile your application on this machine.  
To run on SLES 10, install SUSE 9.3 on your build machine. Compile your  
application on this machine.  
Alternatively, gcccan be used as the default compiler. Set mpicc -cc=gccas  
described in section 3.5.3 "To Use Another Compiler".  
Next, on the machines in your cluster on which the job will run, install compatibility  
libraries. These libraries include C++ and Fortran compatibility shared libraries and  
libgcc.  
For an FC4 or FC5 system, you would need:  
pathscale-compilers-libs (for FC3)  
compat-gcc-32  
compat-gcc-32-g77  
compat-libstdc++-33  
C-14  
IB6054601-00 D  
 
 
C – Troubleshooting  
InfiniPath MPI Troubleshooting  
Q
On a SLES 10 system, you would need:  
compat-libstdc++ (for FC3)  
compat-libstdc++5 (for SLES 10)  
Depending upon the application, you may need to use the -W1,-Bstaticoption to  
use the static versions of some libraries.  
C.8.3  
Compiler/Linker Mismatch  
This is a typical error message if the compiler and linker are not matching in C and  
C++ programs:  
$ export MPICH_CC=gcc  
$ mpicc mpiworld.c  
/usr/bin/ld: cannot find -lmpichabiglue_gcc3  
collect2: ld returned 1 exit status  
C.8.4  
Compiler Can’t Find Include, Module or Library Files  
RPMs can be installed in any location by using the --prefixoption. This can  
introduce errors when compiling, if the compiler cannot find the include files (and  
module files for Fortran90 and Fortran95) from mpi-devel*, and the libraries from  
mpi-libs*in the new locations. Compiler errors similar to this can occur:  
$ mpicc myprogram.c  
/usr/bin/ld: cannot find -lmpich  
collect2: ld returned 1 exit status  
NOTE: As noted in section 3.5.2 of the InfiniPath Install Guide, all development  
files now reside in specific *-Devel subdirectories.  
On development nodes, programs must be compiled with the appropriate options  
so that the include files and the libraries can be found in the new locations. In  
addition, when running programs on compute nodes, you need to insure that the  
run-time library path is the same as the path that was used to compile the program.  
Theexamplesbelowshowwhatcompileroptionstouseforincludefilesandlibraries  
on the development nodes, and how to specify this new library path on the compute  
nodes for the runtime linker. The affected RPMs are:  
mpi-devel* (on the development nodes)  
mpi-libs* (on the development or compute nodes)  
IB6054601-00 D  
C-15  
 
     
C – Troubleshooting  
InfiniPath MPI Troubleshooting  
Q
For these examples in Section C.8.5 below, we assume that these new locations  
are:  
/path/to/devel (for mpi-devel-*)  
/path/to/libs (for mpi-libs-*)  
C.8.5  
Compiling on Development Nodes  
If the mpi-devel-*rpm is installed with the --prefix /path/to/develoption  
then mpicc, etc. will need to be passed -I/path/to/devel/includein order for  
the compiler to find the MPI include files, as in this example:  
$ mpicc myprogram.c -I/path/to/devel/include  
If you are using Fortran90 or Fortran95, a similar option is needed for the compiler  
to find the module files:  
$ mpif90 myprogramf90.f90 -I/path/to/devel/include  
If the mpi-lib-*rpm is installed on these development nodes with the --prefix  
/path/to/libs option, then the compiler will need to be given the  
-L/path/to/libsoption so it can find the libraries. Here is the example for mpicc:  
$ mpicc myprogram.c -L/path/to/libs/lib (for 32 bit)  
$ mpicc myprogram.c -L/path/to/libs/lib64 (for 64bit)  
To find both the include files and the libraries with these non-standard locations, we  
would now see an example like this:  
$ mpicc myprogram.c -I/path/to/devel/include -L/path/to/libs/lib  
C.8.6  
Specifying the Run-time Library Path  
There are several ways to specify the run-time library path so that when the  
programs are run the appropriate libraries are found in the new location. There are  
three different ways to do this:  
Use the -Wl,-rpath,option when compiling on the development node.  
Update the /etc/ld.so.conffile on the compute nodes to include the path.  
Export the path in the .mpirunrcfile.  
These methods are explained in more detail below.  
1. An additional linker option, -Wl,-rpath, supplies the run-time library path  
when compiling on the development node. The compiler options now look like  
this:  
$ mpicc myprogram.c -I/path/to/devel/include  
-L/path/to/libs/lib -Wl,-rpath,/path/to/libs/lib  
C-16  
IB6054601-00 D  
 
     
C – Troubleshooting  
InfiniPath MPI Troubleshooting  
Q
The above compiler command insures that the program will run using this path  
on any machine.  
For the second option, we change the file /etc/ld.so.conf on the compute  
nodes rather than using the -Wl,-rpath, option when compiling on the  
development node. We assume that the mpi-lib-*rpm is installed on the  
compute nodes with the same --prefix /path/to/libs option as on the  
development nodes. Then, on the computer nodes we then add the following  
lines to the file /etc/ld.so.conf.  
/path/to/libs/lib  
/path/to/libs/lib64  
Then, to make sure that the changes are picked up, run (as root):  
# /etc/ldconfig  
The libraries can now be found by the runtime linker on the compute nodes.  
This method has the advantage that it will work for all InfiniPath programs,  
without having to remember to change the compile/link lines.  
2. Instead of either of the two above mechanisms, you can also put this line in the  
~/.mpirunrcfile:  
export LD_LIBRARY_PATH=/path/to/libs/{lib,lib64}  
See Section 3.5.8 in the chapter “Using InfiniPath MPI” for more information  
on using the -rcfileoption to mpirun.  
Choices between these options are left up to the cluster administrator and the  
MPI developer. See the documentation for your compiler for more information  
on the compiler options.  
C.8.7  
Run Time Errors With Different MPI Implementations  
It is now possible to run different implementations of MPI, such as HP-MPI, over  
InfiniPath. Many of these implementations share command (such as mpirun) and  
library names, so it is important to distinguish which MPI version is in use. This is  
done primarily through careful programming practices.  
IB6054601-00 D  
C-17  
 
   
C – Troubleshooting  
InfiniPath MPI Troubleshooting  
Q
Examples are given below.  
In the following command, the HP-MPI version of mpirunis invoked by the full  
pathname. However, the program mpi_nxnlatbwwas compiled with the QLogic  
version of mpicc. The mismatch will produce errors similar this:  
$ /opt/hpmpi/bin/mpirun -hostlist "bbb-01,bbb-02,bbb-03,bbb-04"  
-np 4 /usr/bin/mpi_nxnlatbw  
bbb-02: Not running from mpirun?.  
MPI Application rank 1 exited before MPI_Init() with status 1  
bbb-03: Not running from mpirun?.  
MPI Application rank 2 exited before MPI_Init() with status 1  
bbb-01: Not running from mpirun?.  
bbb-04: Not running from mpirun?.  
MPI Application rank 3 exited before MPI_Init() with status 1  
MPI Application rank 0 exited before MPI_Init() with status 1  
In the case below, mpi_nxnlatbw.cis compiled with the HP-MPI version of  
mpicc, and given the name of hpmpi-mpi_nxnlatbw, so that it is easy to see  
which version was used. However, it is run with the QLogic mpirun, which will  
produce errors similar to this:  
$ /opt/hpmpi/bin/mpicc \  
/usr/share/mpich/examples/performance/mpi_nxnlatbw.c -o  
hpmpi-mpi_nxnlatbw  
$ mpirun -m ~/host-bbb -np 4 ./hpmpi-mpi_nxnlatbw  
./hpmpi-mpi_nxnlatbw: error while loading shared libraries:  
libmpio.so.1: cannot open shared object file: No such file or  
directory  
./hpmpi-mpi_nxnlatbw: error while loading shared libraries:  
libmpio.so.1: cannot open shared object file: No such file or  
directory  
./hpmpi-mpi_nxnlatbw: error while loading shared libraries:  
libmpio.so.1: cannot open shared object file: No such file or  
directory  
./hpmpi-mpi_nxnlatbw: error while loading shared libraries:  
libmpio.so.1: cannot open shared object file: No such file or  
directory  
MPIRUN: Node program(s) exited during connection setup  
C-18  
IB6054601-00 D  
 
C – Troubleshooting  
InfiniPath MPI Troubleshooting  
Q
The following two commands will both work properly:  
QLogic mpirunand executable used together:  
$ mpirun -m ~/host-bbb -np 4 /usr/bin/mpi_nxnlatbw  
HP-MPI mpirunand executable used together:  
$ /opt/hpmpi/bin/mpirun -hostlist \  
"bbb-01,bbb-02,bbb-03,bbb-04" -np 4 ./hpmpi-mpi_nxnlatbw  
Hints:  
Use the rpmcommand to find out which RPM is installed in the standard installed  
layout. For example:  
# rpm -qf /usr/bin/mpirun  
mpi-frontend-2.0-964.731_fc3_psc.i386.rpm  
Check all rcfiles and /opt/infinipath/etc/mpirun.defaultsto make sure  
that the paths for binaries and libraries ($PATH and $LD_LIBRARY _PATH) are  
consistent.  
When compiling, use descriptive names for the object files.  
See section C.8.4, section C.8.5, and section C.8.6 for additional information.  
C.8.8  
Process Limitation with ssh  
MPI jobs that use more than 8 processes per node may encounter an sshthrottling  
mechanism that limits the amount of concurrent per-node connections to 10. If you  
have this problem, you will see a message similar to this when using mpirun:  
$ mpirun -m tmp -np 11 ~/mpi/mpiworld/mpiworld  
ssh_exchange_identification: Connection closed by remote host  
MPIRUN: Node program(s) exited during connection setup  
If you encounter a message like this, you or your system administrator should  
increase the value of ’MaxStartups’ in your sshd configurations.  
C.8.9  
Using MPI.mod Files  
MPI.mod (or mpi.mod) are the Fortran90/Fortran95 mpi modules files. These  
contain the Fortran90/Fortran95 interface to the platform-specific MPI library. The  
module file is invoked by ‘USE MPI’ or ‘use mpi’ in your application. If the application  
has an argument list that doesn’t match what mpi.mod expects, errors such as this  
can occur:  
$ mpif90 -O3 -OPT:fast_math -c communicate.F  
call mpi_recv(nrecv,1,mpi_integer,rpart(nswap),0,  
IB6054601-00 D  
C-19  
 
       
C – Troubleshooting  
InfiniPath MPI Troubleshooting  
Q
^
pathf95-389 pathf90: ERROR BORDERS, File = communicate.F, Line =  
407, Column = 18  
No specific match can be found for the generic subprogram call  
"MPI_RECV".  
If it is necessary to use a non-standard argument list, it is advisable to create your  
own MPI module file, and compile the application with it, rather than the standard  
MPI module file that is shipped in the mpi-devel-* RPM.  
The default search path for the module file is:  
/usr/include  
To include your own MPI.modrather than the standard version, use  
-I/your/search/directorywhich will cause /your/search/directoryto be  
checked before /usr/include:  
$ mpif90 -I/your/search/directory myprogram.f90  
Usage for Fortran95 will be similar to the example for Fortran90.  
C.8.10  
Extending MPI Modules  
MPI implementations provide certain procedures which accept an argument having  
any data type, any precision, and any rank, but it isn’t practical for an MPI module  
to enumerate every possible combination of type, kind, and rank. Therefore the  
strict type checking required by Fortran 90 may generate errors.  
For example, if the MPI module tells the compiler that "mpi_bcast" can operate on  
an integer but does not also say that it can operate on a character string, you may  
see a message similar to the following one:  
pathf95: ERROR INPUT, File = input.F, Line = 32, Column = 14  
No specific match can be found for the generic subprogram call  
"MPI_BCAST".  
If you know that an argument can in fact accept a data type which the MPI module  
doesn’texplicitlyallow,youcanextendtheinterfaceforyourself.Forexample,here’s  
a program which illustrates how to extend the interface for "mpi_bcast" so that it  
accepts a character type as its first argument, without losing the ability to accept an  
integer type as well:  
module additional_bcast  
use mpi  
implicit none  
interface mpi_bcast  
module procedure additional_mpi_bcast_for_character  
end interface mpi_bcast  
contains  
subroutine additional_mpi_bcast_for_character(buffer, count,  
datatype, & root, comm, ierror)  
character*(*) buffer  
C-20  
IB6054601-00 D  
 
     
C – Troubleshooting  
InfiniPath MPI Troubleshooting  
Q
integer count, datatype, root, comm, ierror  
! Call the Fortran 77 style implicit interface to "mpi_bcast"  
external mpi_bcast  
call mpi_bcast(buffer, count, datatype, root, comm, ierror)  
end subroutine additional_mpi_bcast_for_character  
end module additional_bcast  
program myprogram  
use mpi  
use additional_bcast  
implicit none  
character*4 c  
integer master, ierr, i  
! Explicit integer version obtained from module "mpi"  
call mpi_bcast(i, 1, MPI_INTEGER, master, MPI_COMM_WORLD, ierr)  
! Explicit character version obtained from module "additional_bcast"  
call mpi_bcast(c, 4, MPI_CHARACTER, master, MPI_COMM_WORLD, ierr)  
end program myprogram  
This is equally applicable if the module "mpi" provides only a lower-rank interface  
andyouwanttoaddahigher-rankinterface. Anexamplewouldbewherethemodule  
explicitly provides for 1-D and 2-D integer arrays but you need to pass a 3-D integer  
array.  
However, some care must be taken. One should only do this if:  
The module "mpi" provides an explicit Fortran 90 style interface for "mpi_bcast."  
If the module "mpi" does not, the program will use an implicit Fortran 77 style  
interface, which does not perform any type checking. Adding an interface will  
cause type-checking error messages where there previously were none.  
The underlying function really does accept any data type. It is appropriate for the  
first argument of "mpi_bcast" because the function operates on the underlying  
bits, without attempting to interpret them as integer or character data.  
C.8.11  
Lock Enough Memory on Nodes When Using a Batch Queuing System  
InfiniPath MPI requires the ability to lock (pin) memory during data transfers on each  
compute node. This is normally done via /etc/initscript, which is created or  
modified during the installation of the infinipathRPM (setting a limit of 64MB,  
with the command "ulimit -l 65536").  
Some batch systems, such as SLURM, propagate the user’s environment from the  
node where you start the job to all the other nodes. For these batch systems, you  
may need to make the same change on the node from which you start your batch  
jobs.  
IB6054601-00 D  
C-21  
 
   
C – Troubleshooting  
InfiniPath MPI Troubleshooting  
Q
If this file is not present or the node has not been rebooted after the infinipath  
RPM has been installed, a failure message similar to this will be generated:  
$ mpirun -m ~/tmp/sm -np 2 -mpi_latency 1000 1000000  
node-00:1.ipath_update_tid_err: failed: Cannot allocate memory  
mpi_latency:  
/fs2/scratch/infinipath-build-2.0/mpi-2.0/mpich/psm/src  
mq_ips.c:691:  
mq_ipath_sendcts: Assertion ‘rc == 0’ failed. MPIRUN: Node program  
unexpectedly quit. Exiting.  
You can check the ulimit -lon all the nodes by running ipath_checkout. A  
warning will be given if ulimit -l is less that 4096.  
There are two possible solutions to this. If InfiniPath is not installed on the node  
where you start the job, set this value in the following way (as root).  
# ulimit -l 65536  
Or, if you have installed InfiniPath on the node, reboot it to insure that  
/etc/initscriptis run.  
C.8.12  
Error Messages Generated by mpirun  
In the sections below, types of mpirun error messages are described. They fall into  
these categories:  
Messages from the InfiniPath Library  
MPI messages  
Messages relating to the InfiniPath driver and InfiniBand links  
Messages generated by mpirunfollow a general format:  
program_name: message  
function_name: message  
Messages may also have different prefixes, such and ipath_or psm_, which will  
indicate in which part of the software the errors are occurring.  
C.8.12.1  
Messages from the InfiniPath Library  
These messages may appear in the mpirunoutput.  
The first set are error messages, which indicate internal problems and should be  
reported to Support.  
Trying to cancel invalid timer (EOC)  
sender rank rank is out of range (notification)  
sender rank rank is out of range (ack)  
Reached TIMER_TYPE_EOC while processing timers  
C-22  
IB6054601-00 D  
 
       
C – Troubleshooting  
InfiniPath MPI Troubleshooting  
Q
Found unknown timer type type  
unknown frame type type  
recv done: available_tids now n, but max is m (freed p)  
cancel recv available_tids now n, but max is m (freed %p)  
[n] Src lid error: sender: x, exp send: y  
Frame receive from unknown sender. exp. sender = x, came from y  
Failed to allocate memory for eager buffer addresses: str  
The following error messages probably indicate a hardware or connectivity problem:  
Failed to get IB Unit LID for any unit  
Failed to get our IB LID  
Failed to get number of Infinipath units  
In these cases you can try to reboot, then call Support.  
The following indicate a mismatch between the InfiniPath interconnect hardware in  
use and the version for which the software was compiled:  
Number of buffer avail registers is wrong; have n, expected m  
build mismatch, tidmap has n bits, ts_map m  
These indicate a mismatch between the InfiniPath software and hardware versions.  
Consult Support after verifying that current drivers and libraries are installed.  
Thefollowingareallinformativemessagesaboutdriverinitializationproblems. They  
are not necessarily fatal themselves, but sometimes indicate problems that interfere  
with the application. In the actual printed output all of them are prefixed with the  
name of the function that produced them.  
Failed to get LID for unit u: str  
Failed to get number of units: str  
GETPORT ioctl failed: str  
can't allocate memory for ipath_ctrl_typ: type  
can't stat infinipath device to determine type: type  
file descriptor is not for a real device, failing  
get info ioctl failed: str  
ipath_get_num_units called before init  
ipath_get_unit_lid called before init  
mmap64 of egr bufs from h failed: str  
mmap64 of pio buffers at %llx failed: str  
mmap64 of pioavail registers (%llx) failed: str  
mmap64 of rcvhdr q failed: str  
mmap64 of user registers at %llx failed: str  
userinit allocation of rcvtail memory failed: str  
userinit ioctl failed: str  
Failed to set close on exec for device: str  
NOTE: These messages should never occur. Please inform Support if they do.  
IB6054601-00 D  
C-23  
 
C – Troubleshooting  
InfiniPath MPI Troubleshooting  
Q
The following message indicates that a node program may not be processing  
incoming packets, perhaps due to a very high system load:  
eager array full after overflow, flushing (head h, tail t)  
The following indicates an invalid InfiniPath link protocol version:  
InfiniPath version ERROR: Expected version v, found w (memkey h)  
The following error messages should rarely occur and indicate internal software  
problems:  
ExpSend opcode h tid=j, rhf_error k: str  
Asked to set timeout w/delay l, gives time in past (t2 < t1)  
Error in sending packet: str  
Fatal error in sending packet, exiting: str  
Fatal error in sending packet: str  
Here the strcan give additional clues to the reason for the failure.  
The following probably indicates a node failure or malfunctioning link in the fabric:  
Couldn’t connect to NODENAME, rank RANK#. Time elapsed HH:MM:SS.  
Still trying  
NODENAME is the node (host) name, RANK# is the MPI rank, and HH:MM:SS are  
the hours, minutes, and seconds since we started trying to connect.  
If you get messages similar to the following, it may mean that you are trying to  
receive to an invalid (unallocated) memory address, perhaps due to a logic error in  
the program, usually related to malloc/free:  
ipath_update_tid_err: Failed TID update for rendevous, allocation  
problem  
kernel: infinipath: get_user_pages (0x41 pages starting at  
0x2aaaaeb50000  
kernel: infinipath: Failed to lock addr 0002aaaaeb50000, 65 pages:  
errno 12  
TID is short for Token ID, and is part of the InfiniPath hardware. This error indicates  
a failure of the program, not the hardware or driver.  
C.8.12.2  
MPI Messages  
Some MPI error messages are issued from the parts of the code inherited from the  
MPICH implementation. See the MPICH documentation for descriptions of these.  
This section presents the error messages specific to the InfiniPath MPI  
implementation.  
C-24  
IB6054601-00 D  
 
   
C – Troubleshooting  
InfiniPath MPI Troubleshooting  
Q
These messages appear in the mpirunoutput. Most are followed by an abort, and  
possibly a backtrace. Each is preceded by the name of the function in which the  
exception occurred.  
Error sending packet: description  
Error receiving packet: description  
A fatal protocol error occurred while trying to send an InfiniPath packet.  
On Node n, process p seems to have forked.  
The new process id is q. Forking is illegal under  
InfiniPath. Exiting.  
An MPI process has forked and its child process has attempted to make MPI calls.  
This is not allowed.  
processlabel Fatal Error in filename line_no: error_string  
This is always followed by an abort. The processlabelusually takes the form of  
host name followed by process rank.  
At time of writing, the possible error_strings are:  
Illegal label format character.  
Recv Error.  
Memory allocation failed.  
Error creating shared memory object.  
Error setting size of shared memory object.  
Error mapping shared memory.  
Error opening shared memory object.  
Error attaching to shared memory.  
invalid remaining buffers !!  
Node table has inconsistent length!  
Timeout waiting for nodetab!  
The following indicates an unknown host:  
$ mpirun -np 2 -m ~/tmp/q mpi_latency 100 100  
MPIRUN: Cannot obtain IP address of <nodename>: Unknown host  
<nodename> 15:35_~.1019  
There is no route to a valid host:  
$ mpirun -np 2 -m ~/tmp/q mpi_latency 100 100  
ssh: connect to host <nodename> port 22: No route to host  
MPIRUN: Some node programs ended prematurely without connecting to  
mpirun.  
MPIRUN: No connection received from 1 node process on node  
<nodename>  
IB6054601-00 D  
C-25  
 
C – Troubleshooting  
InfiniPath MPI Troubleshooting  
Q
There is no route to any host:  
$ mpirun -np 2 -m ~/tmp/q mpi_latency 100 100  
ssh: connect to host <nodename> port 22: No route to host  
ssh: connect to host <nodename> port 22: No route to host  
MPIRUN: All node programs ended prematurely without connecting to  
mpirun.  
Node jobs have started, but one host couldn’t connect back to mpirun:  
$ mpirun -np 2 -m ~/tmp/q mpi_latency 100 100  
9139.psc_skt_connect: Error connecting to socket: No route to host  
<nodename> Cannot connect to mpirun within 60 seconds.  
MPIRUN: Some node programs ended prematurely without connecting to  
mpirun.  
MPIRUN: No connection received from 1 node process on node  
<nodename>  
Node jobs have started, both hosts couldn’t connect back to mpirun:  
$ mpirun -np 2 -m ~/tmp/q mpi_latency 100 100  
9158.psc_skt_connect: Error connecting to socket: No route to host  
<nodename> Cannot connect to mpirun within 60 seconds.  
6083.psc_skt_connect: Error connecting to socket: No route to host  
<nodename> Cannot connect to mpirun within 60 seconds.  
MPIRUN: All node programs ended prematurely without connecting to  
mpirun.  
$ mpirun -np 2 -m ~/tmp/q mpi_latency 1000000 1000000  
MPIRUN: <nodename> node program unexpectedly quit: Exiting.  
One program on one node died:  
$ mpirun -np 2 -m ~/tmp/q mpi_latency 100000 1000000  
MPIRUN: <nodename> node program unexpectedly quit: Exiting.  
The quiescence detectedmessage is printed when an MPI job does not seem  
to be making progress. The default timeout is 900 seconds. After this length of time  
all the node processes will be terminated. This timeout can be extended or disabled  
with the -quiescence-timeoutoption in mpirun.  
C-26  
IB6054601-00 D  
 
C – Troubleshooting  
InfiniPath MPI Troubleshooting  
Q
$ mpirun -np 2 -m ~/tmp/q -q 60 mpi_latency 1000000 1000000  
MPIRUN: MPI progress Quiescence Detected after 9000 seconds.  
MPIRUN: 2 out of 2 ranks showed no MPI send or receive progress.  
MPIRUN: Per-rank details are the following:  
MPIRUN: Rank  
MPIRUN: Rank  
0 (<nodename>) caused MPI progress Quiescence.  
1 (<nodename>) caused MPI progress Quiescence.  
MPIRUN: both MPI progress and Ping Quiescence Detected after 120  
seconds.  
Occasionally a stray process will continue to exist out of its context. mpirunchecks  
for stray processes; they are killed after detection.The following is an example of  
the type of message you will see in this case:  
$ mpirun -np 2 -ppn 1 -m ~/tmp/mfast mpi_latency 500000 2000  
iqa-38: Received 1 out-of-context eager message(s) from stray  
process PID=29745  
running on host 192.168.9.218  
iqa-35: PSM pid 10513 on host IP 192.168.9.221 has detected that I  
am a stray process, exiting.  
2000  
5.222116  
iqa-38:1.ips_ptl_report_strays: Process PID=29745 on host  
IP=192.168.9.218 sent  
1 stray message(s) and was told so 1 time(s) (first stray message  
at 0.7s (13%),last at 0.7s (13%) into application run)  
The following should never occur. Please inform Support if it does:  
Internal Error: NULL function/argument found:func_ptr(arg_ptr)  
C.8.12.3  
Driver and Link Error Messages Reported by MPI Programs  
Two types of error messages are described below.  
1. When the InfiniBand link fails during a job, a message will be reported once  
per occurrence. The message will be similar to this:  
ipath_check_unit_status: IB Link is down  
This can happen when a cable is disconnected, a switch is rebooted, or if there  
are other problems with the link. The job will continue retrying until the  
quiescence interval expires. See the mpirun -qoption for information on  
quiescence.  
2. If a hardware problem occurs, an error similar to this will be reported:  
infinipath: [error strings] Hardware error  
This will cause the MPI program to terminate. The error string may provide  
additional information as to the problem. To further determine the source of the  
problem, examine syslogon the node reporting the problem.  
IB6054601-00 D  
C-27  
 
   
C – Troubleshooting  
InfiniPath MPI Troubleshooting  
Q
C.8.13  
MPI Stats  
Using the -print-statsoption to mpirunwill result in a listing to stderrof various  
MPI statistics. Here is example output for the -print-statsoption when used with  
an 8-rank run of the HPCC benchmark.  
MPIRUN: MPI Statistics Summary  
MPIRUN: Messages sent  
(min, max, median @ rank)  
MPIRUN: Eager count  
(min=652.54K @ 0, max=653.39K @ 7, med= 653.15K)  
MPIRUN: Eager aggregate bytes  
MPIRUN: Rendezvous count  
MPIRUN: Rendezvous agg. bytes  
MPIRUN:  
(min= 2.08G @ 0, max= 2.08G @ 2, med=  
(None)  
(None)  
2.08G)  
MPIRUN: Messages received  
MPIRUN: Expected count  
MPIRUN: Expected aggregate bytes (min= 2.03G @ 2, max= 2.04G @ 1, med=  
MPIRUN: Unexpected count  
MPIRUN: Unexpected agg. bytes  
MPIRUN: Unexpected count %  
(min=590.48K @ 2, max=624.90K @ 6, med= 619.01K)  
2.04G)  
(min= 27.89K @ 6, max= 62.69K @ 2, med= 39.20K)  
(min= 44.57M @ 1, max= 57.95M @ 2, med= 48.04M)  
(min= 4% @ 6, max= 9% @ 2, med= 6%)  
Message statistics are available for transmitted and received messages. In all  
cases, the MPI rank number responsible for a minimum or maximum value is  
reported with the relevant value. For application runs of at least 3 ranks, a median  
is also available.  
Since transmitted messages employ either an Eager or a Rendezvous protocol,  
resultsareavailablerelativetobothmessagecountandaggregatedbytes.Message  
count represents the amount of messages transmitted by each protocol on a  
per-rank basis. Aggregated amounts of message bytes indicate the total amount of  
data that was moved on each rank by a particular protocol.  
On the receive side, messages are split into either expected or unexpected  
messages. Unexpected messages cause the MPI implementation to buffer the  
transmitted data until the receiver is able to produce a matching MPI receive buffer.  
Expected messages refer to the inverse case, which should be the common case  
in most MPI applications. An additional metric, Unexpected count %, representing  
the proportion of unexpected messages in relation to the total number of messages  
received is also shown because of the notable effect unexpected messages have  
on performance.  
For more precise information, users are encouraged to make use of MPI profilers  
such as mpiP. For more information on mpiP, see:  
For reference on the HPCC benchmark, see:  
C-28  
IB6054601-00 D  
 
     
C – Troubleshooting  
Useful Programs and Files for Debugging  
Q
C.9  
Useful Programs and Files for Debugging  
The most useful programs and files for debugging are listed in the sections below.  
Many of these programs and files have been discussed elsewhere in the  
documentation: this information is summarized and repeated here for your  
convenience.  
C.9.1  
Check Cluster Homogeneity with ipath_checkout  
Many problems can be attributed to the lack of homogeneity in the cluster  
environment. Use the following items as a checklist for verifying homogeneity. A  
difference in any one of these items in your cluster may cause problems:  
Kernels  
Distributions  
Versions of the InfiniPath boards  
Runtime and build environments  
.o files from different compilers  
Libraries  
Processor speeds  
With the exception of finding any differences between the runtime and build  
environments, ipath_checkoutwill pick up information on all the above items.  
Other programs useful for verifying homogeneity are listed in Table C-2.  
C.9.2  
Restarting InfiniPath  
If, on any node, the driver status appears abnormal, you can try restarting (as root):  
# /etc/init.d/infinipath restart  
These two commands perform the same functions:  
# /etc/init.d/infinipath stop  
# /etc/init.d/infinipath start  
It may also be useful to inspect the file /var/log/messages, to check for any  
abnormal activity.  
IB6054601-00 D  
C-29  
 
       
C – Troubleshooting  
Useful Programs and Files for Debugging  
Q
C.9.3  
Summary of Useful Programs and Files  
Useful programs and files are summarized in the table below. Descriptions for some  
of the programs and files follow. Check manpages for more information on the  
programs.  
Table C-2. Useful Programs and Files  
Use to verify  
Program or file name  
Function  
homogeneity?  
File. Check the version of the installed  
InfiniPath software.  
Yes  
chkconfig  
Check configuration state, enable/disable  
services, including drivers.  
No  
No  
No  
Checks status of InfiniBand devices when  
OpenFabrics is enabled.  
Lists info about InfiniBand devices in use.  
Use when OpenFabrics is enabled.  
Identifies RCSkeyword strings in files. Can Yes  
check for dates, release versions, and other  
identifying information.  
A bashshell script that performs sanity  
testing on cluster using InfiniPath hardware  
and software. If the program is run without  
errors, the node is properly configured.  
Yes  
A shell script that can be used to manipulate Yes  
various parameters for the InfiniPath driver.  
This script gathers the same information  
containedinboardversion,status_str,  
and version.  
Yes  
No  
A shell script that gathers status and  
history information for use in analyzing  
InfiniPath problems.  
Tests the InfiniBand link and bandwidth  
between two InfiniPath HCAs, or, using an  
InfiniBand loopback connector, within a  
single InfiniPath HCA.  
Displays both driver statistics, and hardware No  
counters, including both performance and  
"error" (including status) counters  
Shows status of modules in the Linux kernel. No  
Canusetocheckwhetherdriversareloaded.  
C-30  
IB6054601-00 D  
 
           
C – Troubleshooting  
Useful Programs and Files for Debugging  
Q
Table C-2. Useful Programs and Files (Continued)  
Use to verify  
Program or file name  
Function  
homogeneity?  
modprobe  
Adds or removes modules from the Linux  
kernel. Used to configure ipath_ether  
module on SUSE.  
No  
ps  
A front end program that starts an MPIjob  
on an InfiniPath cluster. Can be used to  
check the origin of the drivers.  
Yes  
No  
Displays information on current active  
processes. Use to check whether all  
necessary processes have been started.  
Package manager used to install, query,  
verify, update, or erase software packages.  
Can use to check contents of a package.  
Yes  
Prints the strings of printable characters in a Yes  
file. Useful for determining contents of  
non-text files such as date and version.  
File. Verifies that the InfiniPath software is  
loaded and functioning.  
No  
File.Provides version information of installed Yes  
software /drivers.  
/var/log/messages File. Various programs write messages to  
this logfile. Use to track activity on your  
system.  
No  
C.9.4  
boardversion  
It may be useful to keep track of the current version of the installed software. You  
can check the version of the installed InfiniPath software by looking in:  
/sys/bus/pci/drivers/ib_ipath/00/boardversion  
Example contents are:  
Driver 2.0,InfiniPath_QHT7140,InfiniPath13.2,PCI 2,SW Compat 2  
This information is useful when for reporting problems when requesting support.  
NOTE: This file returns information on which form factor adapter is installed. The  
HTX full height short form factor is referred to as the QHT7040, the HTX  
low profile form factor is referred to as the QHT7140, and the PCIe half  
height short form factor is the QLE7140. This information will make it  
easier for Support to help with any problems.  
IB6054601-00 D  
C-31  
 
       
C – Troubleshooting  
Useful Programs and Files for Debugging  
Q
C.9.5  
ibstatus  
This program displays basic information on the status of InfiniBand devices that are  
currently in use when the OpenFabrics modules are loaded.  
C.9.6  
ibv_devinfo  
This program displays information about InfiniBand devices, including various kinds  
of identification and status data. Use this program when OpenFabrics is enabled.  
C.9.7  
ident  
identstrings are available in ib_ipath.ko. Running ident(as root) will yield  
information similar to the following. For QLogic RPMs, it will look like:  
# ident /lib/modules/$(uname -r)/updates/*ipath.ko  
/lib/modules/2.6.16.21-0.8-smp/updates/ib_ipath.ko:  
$Id: QLogic Release2.0 $  
$Date: 2006-09-15-04:16 $  
$Id: QLogic Release2.0 $  
$Date: 2006-09-15-04:16 $  
For non-QLogic RPMs, it will look like:  
# ident /lib/modules/$(uname -r)/updates/*ipath_ether.ko  
/lib/modules/2.6.16.21-0.8-smp/updates/infinipath.ko:  
$Id: kernel.org InfiniPath Release 2.0 $  
$Date: 2006-09-15-04:16 $  
/lib/modules/2.6.16.21-0.8-smp/updates/ipath.ko:  
$Id: kernel.org InfiniPath Release2.0 $  
$Date: 2006-09-15-04:20 $  
NOTE: $ident is in the optional rcs RPM, and is not always installed.  
C-32  
IB6054601-00 D  
 
         
C – Troubleshooting  
Useful Programs and Files for Debugging  
Q
C.9.8  
ipath_checkout  
ipath_checkoutis a bashscript used to verify that the installation is correct and  
that all the nodes of the network are functioning and mutually connected by the  
InfiniPath fabric. It is to be run on a front end node, and requires specification of a  
hosts file:  
$ ipath_checkout [options] hostsfile  
wherehostsfiledesignatesafilelistingthehostnamesofthenodesofthecluster,  
one hostname per line. The format of hostsfileis as follows:  
hostname1  
hostname2  
...  
ipath_checkoutperforms the following seven tests on the cluster:  
1. pingall nodes to verify all are reachable from the frontend.  
2. sshto each node to verify correct configuration of ssh.  
3. Gather and analyze system configuration from nodes.  
4. Gather and analyze RPMs installed on nodes.  
5. Verify InfiniPath hardware and software status and configuration.  
6. Verify ability to mpirun jobs on nodes.  
7. Run bandwidth and latency test on every pair of nodes and analyze results.  
The possible options to ipath_checkoutare:  
-h, --help  
Displays help messages giving defined usage.  
-v, --verbose  
-vv, --vverbose  
-vvv, --vvverbose  
These specify three successively higher levels of detail in reporting results of tests.  
So, there are four levels of detail in all, including the case of where none these  
options are given.  
-c, --continue  
When not specified, the test terminates when any test fails. When specified, the  
tests continue after a failure, with failing nodes excluded from subsequent tests.  
IB6054601-00 D  
C-33  
 
   
C – Troubleshooting  
Useful Programs and Files for Debugging  
Q
--workdir=DIR  
Use DIR to hold intermediate files created while running tests. DIR must not already  
exist.  
-k, --keep  
Keep intermediate files that were created while performing tests and compiling  
reports. Results will be saved in a directory created by mktempand named  
infinipath_XXXXXXor in the directory name given to --workdir.  
--skip=LIST  
Skip the tests in LIST(e.g. --skip=2,4,5,7 will skip tests 2, 4, 5, and 7)  
-d, --debug  
Turn on -xand -vflags in bash.  
In most cases of failure, the script suggests recommended actions. Please see the  
ipath_checkout man page for further information and updates.  
C.9.9  
ipath_control  
This is a shell script that can be used to manipulate various parameters for the  
InfiniPath driver. Many of them are intended to be used only when diagnosing  
problems, and may require special system configurations. Use of the options may  
require restarting the driver or utility programs in order to recover from incorrect  
parameters.  
Mostofthefunctionalityisaccessedviathe/sysfilesystem. Thisshellscriptgathers  
the same information contained in these files:  
/sys/bus/pci/drivers/ib_ipath/00/boardversion  
/sys/bus/pci/drivers/ib_ipath/00/status_str  
/sys/bus/pci/drivers/ib_ipath/version  
These files are also documented in table C-2 and table C-3.  
Other than the -i option, this script will need to be run with root permissions. The  
-ioption is listed here, as it is the most commonly used. See the man pages for  
ipath_controlfor more details.  
Here is sample usage and output:  
$ ipath_control -i  
$Id: QLogic Release2.0 $ $Date: 2006-09-15-04:16 $  
00: Version: Driver 2.0, InfiniPath_QHT7140, InfiniPath1 3.2, PCI  
2, SW Compat 2  
00: Status: 0xe1 Initted Present IB_link_up IB_configured  
C-34  
IB6054601-00 D  
 
   
C – Troubleshooting  
Useful Programs and Files for Debugging  
Q
00: LID=0x30 MLID=0x0 GUID=00:11:75:00:00:07:11:97 Serial:  
1236070407  
C.9.10  
ipathbug-helper  
The tool ipathbug-helperis useful for verifying homogeneity. Prior to seeking  
assistance from QLogic technical support, you should run this script on the head  
node of your cluster and the compute nodes which are suspected to have problems.  
Inspection of the output will often help you to see the problem. Simply run it on  
several nodes and examine the output for differences.  
It is best to run ipathbug-helperwith root privilege, since some of the queries it  
makes requires it. There is also a --verbosewhich greatly increases the amount  
of gathered information.  
If you are unable to see the problem, send its stdoutoutput to your reseller, along  
with information on the version of the InfiniPath software you are using.  
C.9.11  
ipath_pkt_test  
This is a simple program that can be used to test the InfiniBand link and bandwidth  
between two InfiniPath HCAs, or, using an InfiniBand loopback connector, within a  
single InfiniPath HCA. It is runs in either ping-pong mode (send a packet, wait for  
a reply, repeat), or in stream mode (send packets as quickly as possible, receive  
responses as they come back).  
On completion, the sending side prints statistics on the packet bandwidth, showing  
both the payload bandwidth, and the total bandwidth (including InfiniBand and  
InfiniPath headers). See the man page for more information.  
C.9.12  
ipathstats  
The ipathstatsprogram can be useful for diagnosing InfiniPath problems,  
particularly those that are performance related. It displays both driver statistics, and  
hardware counters, including both performance and "error" (including status)  
counters.  
Running "ipathstats -c 10", for example, will show the number of packets and  
32 bit words of data being transferred on a node in each10 second interval. This  
may show differences in traffic patterns on different nodes, or at different stages of  
execution. For more information see the man page.  
IB6054601-00 D  
C-35  
 
       
C – Troubleshooting  
Useful Programs and Files for Debugging  
Q
C.9.13  
lsmod  
If you need to find which InfiniPath and OpenFabrics modules are running, try the  
following command:  
# lsmod | egrep ’ipath_|ib_|rdma_|findex’  
C.9.14  
mpirun  
mpiruncan give information on whether the program is being run against a QLogic  
or non-QLogic driver. Sample commands and results are given below.  
QLogic-built:  
$ mpirun -np 2 -m /tmp/id1 -d0x101 mpi_latency 1 0  
asus-01:0.ipath_setaffinity: Set CPU affinity to 1, port 0:2:0 (1  
active chips)  
asus-01:0.ipath_userinit: Driver is QLogic-built  
Non-QLogic built:  
$ mpirun -np 2 -m /tmp/id1 -d0x101 mpi_latency 1 0  
asus-01:0.ipath_setaffinity: Set CPU affinity to 1, port 0:2:0 (1  
active chips)  
asus-01:0.ipath_userinit: Driver is not QLogic-built  
C.9.15  
rpm  
To check the contents of an RPM, use these commands:  
$ rpm -qa infinipath\* mpi-\*  
$ rpm -q --info infinipath # (etc)  
The option-qwill query and --qawill query all.  
C.9.16  
status_str  
Check the file status_str to verify that the InfiniPath software is loaded and  
functioning. To locate this file go to:  
/sys/bus/pci/drivers/ib_ipath/  
and look for a subdirectory with the InfiniPath unit numbers 00, 01, and so on.  
status_strwill be found in this directory.  
C-36  
IB6054601-00 D  
 
           
C – Troubleshooting  
Useful Programs and Files for Debugging  
Q
The following table shows the possible contents of the file, with brief explanations  
of the entries.  
Table C-3. status_str File  
File contents  
Description  
Initted  
Present  
The driver has loadedand successfully initialized the  
IBA6110.  
The IBA6110 has been detected (but not initialized  
unless Initted is also here).  
IB_link_up  
IB_configured  
NOIBcable  
The IB link has been configured and is in the active  
state; packets can be sent and received.  
The IB link has been configured. It may or may not  
be up and usable.  
Unable to detect link present. Can be caused by no  
cable plugged into the QHT7140 or QLE7140, or  
connected there but not to a switch, or the switch it  
is connected to is down.  
Fatal_Hardware_Error  
Only appears if there is trouble.  
In this same directory are other files containing information related to status. They  
are summarized in table C-4.  
Table C-4. Other Files Related to Status  
File name  
lid  
Contents  
InfiniBand Local ID (LID). The address on the IB fabric, similar  
conceptually to an IP address for TCP/IP. The "Local" refers to it being  
unique only within a single IB fabric.  
mlid  
guid  
nguid  
The Multicast Local ID (MLID), for IB multicast. Used for doing  
InfiniPath ether broadcasts, since IB has no concept of broadcast.  
The Globally Unique ID (GUID) for the InfiniPath chip. Equivalent to  
an Ethernet MAC address.  
The number of GUIDs that are used. If nguids == 2, and two chips are  
discovered, the first one will be assigned the requested GUID (from  
eeprom, or ipath_sma), and the second chip gets that GUID+1.  
serial  
unit  
The serial number of the QHT7140 or QLE7140 board.  
Unique number for each card or chip in a system.  
status  
The numeric version of the status_strfile, described in the  
preceding table.  
IB6054601-00 D  
C-37  
 
   
C – Troubleshooting  
Useful Programs and Files for Debugging  
Q
C.9.17  
strings  
The command stringscan also be used. Its format is as follows:  
$ strings /usr/lib/libinfinipath.so.4.0 | grep Date:  
will produce output like this:  
$Date: 2006-09-15 04:07 Release2.0 InfiniPath $  
NOTE: stringsis part of binutils(a development RPM), and may not be  
available on all machines.  
C.9.18  
version  
You can check the version of the installed InfiniPath software by looking in:  
/sys/bus/pci/drivers/ib_ipath/version  
Example contents for QLogic-built drivers:  
$Id: QLogic Release2.0 $ $Date: 2006-09-15-04:16 $  
For non-Qlogic-built drivers (in this case kernel.org), it will look like this:  
$Id: kernel.org InfiniPath Release2.0 $ $Date: 2006-09-15-04:18 $  
C-38  
IB6054601-00 D  
 
   
Appendix D  
Recommended Reading  
Reference material for further reading is provided here.  
D.1  
References for MPI  
The MPI Standard specification documents.  
The MPICH implementation of MPI and its documentation.  
The ROMIO distribution and its documentation.  
D.2  
Books for Learning MPI Programming  
Gropp, William, Ewing Lusk, and Anthony Skjellum, Using MPI, Second Edition,  
1999, MIT Press, ISBN 0-262-57134-X.  
Gropp, William, Ewing Lusk, and Anthony Skjellum, Using MPI-2, Second Edition,  
1999, MIT Press, ISBN 0-262-57133-1.  
Pacheco,ParallelProgrammingwithMPI, 1997, MorganKaufmanPublishers, ISBN  
1-55860  
D.3  
Reference and Source for SLURM  
The open-source resource manager designed for Linux clusters.  
D.4  
InfiniBand  
The InfiniBand specification, found at the InfiniBand Trade Association site.  
D.5  
OpenFabrics  
Open InfiniBand Alliance.  
IB6054601-00 D  
D-1  
 
               
D – Recommended Reading  
Rocks  
Q
D.6  
Clusters  
Gropp, William, Ewing Lusk, and Thomas Sterling, Beowulf Cluster Computing with  
Linux, Second Edition, 2003, MIT Press, ISBN 0-262-69292-9.  
D.7  
Rocks  
Extensive documentation on installing Rocks and custom Rolls.  
D-2  
IB6054601-00 D  
 
   
Appendix E  
Glossary  
A glossary is provided below for technical terms used in the documentation.  
bandwidth  
The rate at which data can be transmitted. This  
represents the capacity of the network connection.  
Theoretical peak bandwidth is fixed, but the effective  
bandwidth, the ideal rate is modified by overhead in  
hardware and the computer operating system. Usually  
measured in bits/megabits or bytes/megabytes per  
second. Bandwidth is related to latency.  
BIOS  
For Basic Input/Output System. It typically contains  
code for initial hardware setup and bootstrapping.  
build node  
A machine on which source code, examples or  
benchmarks can be compiled.  
compute node  
DAPL  
A machine used to run a job.  
For Direct Access Provider Library. The reference  
implementation for RDMA transports. Consists of both  
kernel mode (kDAPL) and user mode (uDAPL)  
versions.  
development node  
DHCP  
Same as build node.  
For Dynamic Host Configuration Protocol. A  
communications protocol for allocating IP addresses.  
Also provides other basic networking information, such  
as router addresses and name servers.  
EATX  
fabric  
For Extended Advanced Technology Extended  
motherboard.  
The InfiniBand interconnect infrastructure, consisting  
of a set of HCAs (and possibly TCAs) connected by  
switches, such that each end node can directly reach  
all other nodes.  
front end node  
The machine or machines used to launch jobs.  
funneled thread model  
Only the main (master) thread may execute MPI calls.  
In InfiniPath MPI, hybrid MPI/OpenMP applications are  
supported, provided the MPI routines are called only by  
the master OpenMP thread.  
IB6054601-00 D  
E-1  
 
   
E – Glossary  
Q
GID  
For Global Identifier. Used for routing between different  
InfiniBand subnets.  
GUID  
For Globally Unique Identifier for the InfiniPath chip.  
Equivalent to Ethernet MAC address.  
head node  
HCA  
Same as front end node.  
For Host Channel Adapter. HCAs are I/O engines  
located within processing nodes, connecting them to  
the InfiniBand fabric.  
hosts file  
HTX  
Same as mpihosts file. Not the same as the /etc/hosts  
file.  
A specification that defines a connector and form factor  
for HyperTransport-enabled daughtercards and EATX  
motherboards.  
InfiniBand  
AlsoreferredtoasIB. Aninput/outputarchitectureused  
in high-end servers. It is also a specification for the  
serial transmission of data between processors and I/O  
devices. InfiniBand typically uses switched,  
point-to-point channels. These channels are usually  
createdbyattachinghostchanneladapters(HCAs)and  
target channel adapters (TCAs) through InfiniBand  
switches.  
IPoIB  
For Internet Protocol over InfiniBand, as per the  
OpenFabrics standards effort. This protocol layer  
allows the traditional Internet protocol (IP) to run over  
an InfiniBand fabric.  
iSER  
For iSCSI Extensions for RDMA. An upper layer  
protocol.  
kDAPL  
latency  
For kernel Direct Access Provider Library.kDAPL is the  
kernel mode version of the DAPL protocol.  
Thedelayinherentinprocessingnetworkdata. Interms  
of MPI, it is the time required to send a message from  
one node to another, independent of message size.  
Latency can be further split into sender and receiver  
processing overheads, as well as wire and switch  
overhead.  
launch node  
Same as front end node.  
layered driver  
A driver that does not directly manage any target  
devices. The layered driver calls another driver’s  
routines, which in turn manages the target devices.  
E-2  
IB6054601-00 D  
 
E – Glossary  
Q
LID  
For Local Identifier. Assigned by the Subnet Manager  
(SM) to each visible node within a single InfiniBand  
fabric. It is similar conceptually to an IP address for  
TCP/IP.  
Lustre  
Open source project to develop scalable cluster file  
systems.  
MAC Address  
For Media Access Control Address. It is a unique  
identifier attached to most forms of networking  
equipment.  
machines file  
MADs  
Same as mpihostsfile.  
For Management Datagrams. Subnet Managers (SMs)  
and Subnet Management Agents (SMAs)  
communicate via MADs.  
managed switch  
MGID  
A switch that can be configured to run an embedded  
Subnet Manager (SM).  
For Multicast Group ID. An identifier for a multicast  
group. This can be assigned by the SM at multicast  
group creation time, although frequently it is chosen by  
the application or protocol instead.  
MLID  
MPD  
MPI  
For Multicast Local ID for InfiniBand multicast. This is  
the identifier a member of a multicast group uses for  
addressing messages to other members of the group.  
For Multi-Purpose Daemon. An alternative to mpirun  
to launch MPI jobs, providing support for MPICH.  
Developed at Argonne National laboratory.  
For Message-Passing Interface. MPI is a  
message-passing library or collection of routines used  
in distributed-memory parallel programming. It is used  
in data exchange and task synchronization between  
processes. The goal of MPI is to provide portability and  
efficient implementation across different platforms and  
architectures.  
MPICH  
A freely available, portable implementation of MPI.  
mpihosts file  
A file containing a list of the hostnames of the nodes in  
a cluster on which node programs may be run. Also  
referred to as node file, hosts file, or machine(s) file.  
IB6054601-00 D  
E-3  
 
E – Glossary  
Q
MTRR  
For Memory Type Range Registers. MTRR For  
"Memory Type Range Registers". Used by the  
InfiniPath driver to enable write combining to the  
InfiniPath on-chip transmit buffers. This improves write  
bandwidth to the InfiniPath chip, by writing multiple  
words in a single bus transaction (typically 64). Applies  
only to x86_64 systems.  
MTU  
For Maximum Transfer Unit. The largest packet size  
that can be transmitted over a given network.  
multicast group  
A mechanism that a group of nodes use to  
communicate amongst either other. It is an efficient  
mechanismforbroadcastingmessagestomanynodes,  
as messages sent to the group are received by all  
members of the group without the sender having to  
explicitly send it to each individual member (or even  
having to know who the members are.) Nodes can join  
or leave the group at any time.  
node file  
Same as hostsfile.  
node program  
Each individual process that is part of the parallel MPI  
job. The machine on which it is executed is called a  
node.  
OpenIB  
The previous name of OpenFabrics.  
OpenFabrics  
OpenMP  
The open source InfiniBand protocol stack.  
Specification that provides an open source model for  
parallel programming that is portable across shared  
memory architectures from different vendors.  
OpenSM  
PCIe  
OpensourceSM(SubnetManager)that provides basic  
functionality for subnet discovery and activation.  
For PCI Express. Based on PCI concepts and  
standards, PCIe uses a faster serial connection  
mechanism.  
RDMA  
RPM  
For Remote Direct Memory Access. A communications  
protocol that enables data transmission from the  
memory of one computer to the memory of another  
without involving the CPU. The most common form of  
RDMA is over InfiniBand.  
For Red Hat Package Manager. A tool for packaging,  
installing, and managing software for Linux  
distributions.  
E-4  
IB6054601-00 D  
 
E – Glossary  
Q
SDP  
SRP  
SM  
For Sockets Direct Protocol. An InfiniBand-specific  
upper layer protocol. It defines a standard wire protocol  
to support stream sockets networking over InfiniBand.  
For SCSI RDMA Protocol. The implementation of this  
protocol is under development for utilizing block  
storage devices over an InfiniBand fabric.  
For Subnet Manager. A subnet contains a master  
Subnet Manager which is responsible for network  
initialization (topology discovery), configuration, and  
maintenance. The Subnet Manager discovers and  
configures all the reachable nodes in the InfiniBand  
fabric. It discovers them at switch startup, and  
continues monitoring changes in the physical network  
connectivityandtopology.Itisresponsibleforassigning  
local identifiers, called LIDs, to the visible nodes. It also  
handles multicast group setup. When the network  
contains multiple managed switches, they negotiate  
among themselves which will be the controlling Subnet  
Manager. It communicates with the SMAs that exist on  
all nodes in a cluster.  
SMA  
For Subnet Management Agent. SMAs exist on all  
nodes, and are responsible for interacting with the  
subnet manager to configure an individual node and  
report node parameters and statistics.  
subnet  
switch  
A single InfiniBand network.  
Used to connect HCAs and TCAs. Packets are  
forwarded from one port to another within the switch,  
based on the LID of the packet. The fabric is the  
connected group of switches.  
TCA  
TCP  
For Target Channel Adapter. A TCA is a channel  
adapter for I/O nodes, such as shared storage devices.  
For Transmission Control Protocol. One of the core  
protocols of the Internet protocol suite. A transport  
mechanismthatensuresthatdataarrivescompleteand  
in order.  
TID  
For Token ID. A method of identifying a memory region.  
Part of the InfiniPath hardware.  
uDAPL  
For user Direct Access Provider Library. uDAPL is the  
user space implementation of the DAPL protocol.  
unmanaged switch  
A switch that does not have an active Subnet Manager  
(SM).  
IB6054601-00 D  
E-5  
 
E – Glossary  
Q
Notes  
E-6  
IB6054601-00 D  
 
Index  
A
F
ACPI, enabling C-9  
Front matter  
intended audience for this guide 1-1  
organization of this guide 1-1  
typographic conventions in this guide 1-6  
B
Batch queuing for MPI jobs B-1B-4  
Benchmarking  
MPI bandwidth A-2A-3  
H
MPI latency measurement A-1A-2  
MPI latency measurement in host rings A-5  
HTX InfiniPath card not recognized C-2  
I
C
ib_ipath, startup of, 2-5  
ident 2-21, C-30, C-32  
InfiniPath interconnect, overview 1-2  
InfiniPath scripts  
for system startup 2-5  
using to start, stop, or restart drivers 2-13  
InfiniPath software  
Compiling MPI programs  
compiler and linker variables 3-9  
scripts for invoking compiler and linker 3-7  
specifying compilers and linkers 3-43-5  
Troubleshooting C-13C-20  
using other compilers 3-8  
Configuration  
components 2-1  
OpenSM 2-12  
installed layout 2-1  
Configuration, OpenSM 2-12  
CPU affinity, setting 2-19  
list of 1-2  
memory footprint 2-2  
startup of 2-5  
InfiniPath software, list of 1-5  
Installation  
dependencies C-5  
hardware troubleshooting C-1  
software issues C-5C-7  
Interoperability  
InfiniPath OpenFabrics 1-2  
Interrupts, problems with C-9  
ipath_checkout C-30  
D
Debugging MPI programs 3-203-21  
Distribution override, setting C-7  
Distributions supported 1-4, 1-1  
Driver configuration, IPoIB 2-11  
Drivers  
list of 2-1  
starting, stopping and restarting 2-13  
starting, stopping and testing 2-13  
see also ib_ipath; ipath_ether  
verifying installation with 2-22, C-33  
ipath_control C-34  
for checking version information C-30  
ipath_ether  
configuration of on Fedora and RHEL4  
E
Environment variables 3-12  
IB6054601-00 D  
Index-1  
 
   
InfiniPath User Guide  
Version 2.0 Beta2  
Q
configuration of on SUSE and SLES 10  
layered Ethernet driver 2-6  
ipathbug_helper C-30, C-35  
O
OpenFabrics Configuration 2-11  
OpenSM 2-12  
P
L
PathScale to QLogic Adapter model numbers  
Performance tips  
LEDs, showing state of system with C-1  
Limitations of PathScale MPI 3-21  
balanced processor speed 2-19  
disabling powersaving 2-18  
minimum set of services needed 2-172-18  
M
Management tips  
maintaining homogeneous nodes 2-20  
useful tools for verifying homogeneity 2-20  
MPD, as alternative to mpirun 3-173-18  
MPI  
bandwidth measurement, see benchmarking  
extending modules for C-20  
benchmarking  
Protocols, InfiniBand subnet management 1-2  
R
rpm, using for software package verification  
S
latency measurement, see benchmarking  
Linux file I/O in 3-18  
ssh(secure shell)  
administrator setup using shosts.equiv 2-15  
user setup using ssh-agent 3-5  
Status, checking software C-36  
strings C-31  
Subnet Management Agent (SMA), function of  
other implementations of 3-17  
PathScale MPI and hybrid MPI/OpenMP 3-19  
PathScale MPI and ROMIO 3-19  
Pathscale MPI limitations 3-21  
QLogic’s implementation of 3-1  
MPI programming  
Switches, supported 1-2  
examples 3-23-4  
specifying compilers and linkers 3-43-5  
MPI-2, supported features in ROMIO 3-19  
mpihosts file  
T
Troubleshooting  
formats of 3-11  
kernel and initialization issues C-7C-11  
MTRR BIOS mappings C-3, C-4  
OpenFabrics issues C-12  
generating using SLURM B-2  
getting started 3-2  
specifying 3-12  
PathScale MPI C-13C-28  
performance issues C-13  
mpi.mod files, using C-19  
mpirun C-36  
software installation issues C-5C-7  
system administration C-12  
useful programs for C-29C-31  
command 3-10, 3-12, 3-13, 3-14, 3-21, A-2,  
error message format of C-22  
options 3-14  
Index-2  
IB6054601-00 D  
 

Quadra Fire Indoor Fireplace DV400S User Guide
Quasar Fax Machine PAX405 User Guide
Radio Shack Cordless Telephone 43 5862 User Guide
Raritan Engineering Marine Sanitation System Crown Head II User Guide
RCA CRT Television F27318 User Guide
RCA Flat Panel Television LED52B45RQ User Guide
Rinnai Water Heater RC80E User Guide
SanDisk Computer Hardware SNS User Guide
Sanyo Microwave Oven EM G5595S User Guide
Scotts Bicycle ZX 10R 2004 05 User Guide