OpenSM Release Notes 3.1.10 ============================= Version: OpenFabrics Enterprise Distribution (OFED) 1.3 Repo: git://git.openfabrics.org/~ofed_1_3/management.git (release) git://git.openfabrics.org/~sashak/management.git (development) Date: February 2008 1 Overview ---------- This document describes the contents of the OpenSM OFED 1.3 release. OpenSM is an InfiniBand compliant Subnet Manager and Administration, and runs on top of OpenIB. The OpenSM version for this release is openib-3.1.10 This document includes the following sections: 1 This Overview section (describing new features and software dependencies) 2 Known Issues And Limitations 3 Unsupported IB compliance statements 4 Major Bug Fixes 5 Main Verification Flows 6 Qualified software stacks and devices 1.1 Major New Features * QoS manager (experimental) This QoS manager implementation is in accordance with IBA QoS Annex. Highly configurable QoS Policy is parsed from OpenSM QoS policy file. Valid QoS parameters will be reported in SA PathRecord and MultiPathRecord. In addition simple QoS levels per ULPs configuration is supported too. * Performance Manager When enabled it collects a fabric port counters and able to log it or to pass to external program via event plugin interface. It handles counters overflow, supports LID/QP redirection and is able to work when OpenSM is in master, standby, and inactive states. * Dimension Order routing (DOR) algorithm DOR Unicast routing algorithm - based on the Min Hop algorithm, but avoids port equalization except for redundant links between the same two switches. This provides deadlock free routes for hypercubes when the fabric is cabled as a hypercube and for meshes when cabled as a mesh (see details in OpenSM man page). * Routing improvements Speedup the current routing algorithms default MinHops, Up/Down and LASH and lid matrix generation. Fat Tree routing engine is able to work with not pure fat free topology. * Multiple IB routers support OpenSM now able to keep configurable subnet prefix to router table. SA will report path to this routers when SA PathRecord was issued with non-local DGID. * Node map This is possible to name nodes in this config file. Those names will be used for logging and by QoS configuration. * PKey index support Proper support for PKey index in GSI queries. * Incremental LFTs, PKey, SL2VL, and VLarbitration table updates OpenSM will only fetch those tables in first heavy sweep and then will maintain this internally. * Fast port and switch detector When port and/or switch was externally reset and it was fast so sweep doesn't find this device as disconnected OpenSM will detect this by changed port states and handle accordingly. * Duplicated GUIDs/port moving detector OpenSM will be able to detect port moving during a fabric discovery and will not report duplicated GUIDs in this case. * Multicast rerouting speedup Now OpenSM will calculate and setup multicast forwarding tables for all altered multicast groups and not for each one. * Event plugin API OpenSM allows to load dynamically various plugin modules. * Many generic improvements 1.2 Minor New Features: * Daemon mode can be activated with -B option. * Support multiple scopes for IPoIB multicast groups in partition config. * Loopback connection handling Loopback connection is not interpreted as duplicated GUID anymore. * Connect root nodes option for Up/Down routing engine. When this option is specified Up/Down will create routing paths between its root nodes. * Dump and log filenames changed from osm* to opensm*. * Support loopback console Socket console with only local access. * Configurable config directory (the default value is /etc/opensm) and configurable default values of OpenSM config filenames. * Add option for force SDR link speed Add option to opensm.opts to force link speed. Currently, only forcing to SDR link speed is supported. This option is not supported as a command line option. * Better packaging Building and RPM packaging were improved and simplified. * Handle "babbling" ports When a babbling port (port which causes a frequent trap generation) is detected, OpenSM will disable the port which should terminate the trap storm. 1.3 Library API Changes None 1.4 Software Dependencies OpenSM depends on the installation of either OFED 1.3, OFED 1.2, OFED 1.1, OFED 1.0, OpenIB gen2 (e.g. IBG2 distribution), OpenIB gen1 (e.g. IBGD distribution), or Mellanox VAPI stacks. The qualified driver versions are provided in Table 2, "Qualified IB Stacks". Also building of QoS manager policy file parser requires flex, and either bison or byacc installed. 1.5 Supported Devices Firmware The main task of OpenSM is to initialize InfiniBand devices. The qualified devices and their corresponding firmware versions are listed in Table 3. 2 Known Issues And Limitations ------------------------------ * No Service / Key associations: There is no way to manage Service access by Keys. * No SM to SM SMDB synchronization: Puts the burden of re-registering services, multicast groups, and inform-info on the client application (or IB access layer core). 3 Unsupported IB Compliance Statements -------------------------------------- The following section lists all the IB compliance statements which OpenSM does not support. Please refer to the IB specification for detailed information regarding each compliance statement. * C14-22 (Authentication): M_Key M_KeyProtectBits and M_KeyLeasePeriod shall be set in one SubnSet method. As a work-around, an OpenSM option is provided for defining the protect bits. * C14-67 (Authentication): On SubnGet(SMInfo) and SubnSet(SMInfo) - if M_Key is not zero then the SM shall generate a SubnGetResp if the M_Key matches, or silently drop the packet if M_Key does not match. * C15-0.1.23.4 (Authentication): InformInfoRecords shall always be provided with the QPN set to 0, except for the case of a trusted request, in which case the actual subscriber QPN shall be returned. * o13-17.1.2 (Event-FWD): If no permission to forward, the subscription should be removed and no further forwarding should occur. * C14-24.1.1.5 and C14-62.1.1.22 (Initialization): GUIDInfo - SM should enable assigning Port GUIDInfo. * C14-44 (Initialization): If the SM discovers that it is missing an M_Key to update CA/RT/SW, it should notify the higher level. * C14-62.1.1.12 (Initialization): PortInfo:M_Key - Set the M_Key to a node based random value. * C14-62.1.1.13 (Initialization): PortInfo:P_KeyProtectBits - set according to an optional policy. * C14-62.1.1.24 (Initialization): SwitchInfo:DefaultPort - should be configured for random FDB. * C14-62.1.1.32 (Initialization): RandomForwardingTable should be configured. * o15-0.1.12 (Multicast): If the JoinState is SendOnlyNonMember = 1 (only), then the endport should join as sender only. * o15-0.1.8 (Multicast): If a request for creating an MCG with fields that cannot be met, return ERR_REQ_INVALID (currently ignores SL and FlowLabelTClass). * C15-0.1.8.6 (SA-Query): Respond to SubnAdmGetTraceTable - this is an optional attribute. * C15-0.1.13 Services: Reject ServiceRecord create, modify or delete if the given ServiceP_Key does not match the one included in the ServiceGID port and the port that sent the request. * C15-0.1.14 (Services): Provide means to associate service name and ServiceKeys. 4 Major Bug Fixes ----------------- The following is a list of bugs that were fixed. Note that other less critical or visible bugs were also fixed. * osm_ucast_ftree.c: do load-leveling of non-CN routes * osm_ucast_ftree.c: ignore port 0 and loopbacks on switches * lash: fix possible segfault in osm_get_lash_sl() * osm_ucast_ftree.c: fixing coredump in fat-tree routing * osm_sa_slvl_record: fix overflow crash * Break multicast rerouting requests processing when heavy sweep is scheduled. * updn: report fallback properly * Fix incorrect identification of routing engine used * Don't zero base LID when invalid value is received * lash: fix wrong allocation size * Fixing broken logic in 'process world' part of LinkRecord processing * Fix lmc_mask bit order in osm_sa_link_record.c * Adding missing comparison by to_lid/from_lid in LinkRecord processing * Broken logic when scanning subnet for PIR request * No interactive games in daemon mode * Fixing memory leak in node description * Fix PortInfo update issues for switch port 0 * Changed method_mask type in user_mad interface in accordance with kernel ABI * Use umad_get_issm_path() in osm_vendor_set_sm() * Report message fix * Uninitialized variables usage fix * osm_ucast_ftree.c: Possible NULL ptr seg fault * osm_mcast_mgr.c: Possible NULL ptr seg fault * TrapRepress was failing for mkey != 0 * IB_PR_COMPMASK was used in MPR * Set hop limit when creating ipoib multicast groups * Fix outstanding mad counters tracking on the error paths. * Report new ports before handover mastership * Fix opvls and neighbormtu when remote port invalid. * Bug in coding trying to set vl_arb_high_limit when PortInfo.base_lid was still zero. * Protect SMInfo response against port moving issue. 5 Main Verification Flows ------------------------- OpenSM verification is run using the following activities: * osmtest - a stand-alone program * ibmgtsim (IB management simulator) based - a set of flows that simulate clusters, inject errors and verify OpenSM capability to respond and bring up the network correctly. * small cluster regression testing - where the SM is used on back to back or single switch configurations. The regression includes multiple OpenSM dedicated tests. * cluster testing - when we run OpenSM to setup a large cluster, perform hand-off, reboots and reconnects, verify routing correctness and SA responsiveness at the ULP level (IPoIB and SDP). 5.1 osmtest osmtest is an automated verification tool used for OpenSM testing. Its verification flows are described by list below. * Inventory File: Obtain and verify all port info, node info, link and path records parameters. * Service Record: - Register new service - Register another service (with a lease period) - Register another service (with service p_key set to zero) - Get all services by name - Delete the first service - Delete the third service - Added bad flows of get/delete non valid service - Add / Get same service with different data - Add / Get / Delete by different component mask values (services by Name & Key / Name & Data / Name & Id / Id only ) * Multicast Member Record: - Query of existing Groups (IPoIB) - BAD Join with insufficient comp mask (o15.0.1.3) - Create given MGID=0 (o15.0.1.4) - Create given MGID=0xFF12A01C,FE800000,00000000,12345678 (o15.0.1.4) - Create BAD MGID=0xFA. (o15.0.1.6) - Create BAD MGID=0xFF12A01B w/ link-local not set (o15.0.1.6) - New MGID with invalid join state (o15.0.1.9) - Retry of existing MGID - See JoinState update (o15.0.1.11) - BAD RATE when connecting to existing MGID (o15.0.1.13) - Partial JoinState delete request - removing FullMember (o15.0.1.14) - Full Delete of a group (o15.0.1.14) - Verify Delete by trying to Join deleted group (o15.0.1.14) - BAD Delete of IPoIB membership (no prev join) (o15.0.1.15) * GUIDInfo Record: - All GUIDInfoRecords in subnet are obtained * MultiPathRecord: - Perform some compliant and noncompliant MultiPathRecord requests - Validation is via status in responses and IB analyzer * PKeyTableRecord: - Perform some compliant and noncompliant PKeyTableRecord queries - Validation is via status in responses and IB analyzer * LinearForwardingTableRecord: - Perform some compliant and noncompliant LinearForwardingTableRecord queries - Validation is via status in responses and IB analyzer * Event Forwarding: Register for trap forwarding using reports - Send a trap and wait for report - Unregister non-existing * Trap 64/65 Flow: Register to Trap 64-65, create traps (by disconnecting/connecting ports) and wait for report, then unregister. * Stress Test: send PortInfoRecord queries, both single and RMPP and check for the rate of responses as well as their validity. 5.2 IB Management Simulator OpenSM Test Flows: The simulator provides ability to simulate the SM handling of virtual topologies that are not limited to actual lab equipment availability. OpenSM was simulated to bring up clusters of up to 10,000 nodes. Daily regressions use smaller (16 and 128 nodes clusters). The following test flows are run on the IB management simulator: * Stability: Up to 12 links from the fabric are randomly selected to drop packets at drop rates up to 90%. The SM is required to succeed in bringing the fabric up. The resulting routing is verified to be correct as well. * LID Manager: Using LMC = 2 the fabric is initialized with LIDs. Faults such as zero LID, Duplicated LID, non-aligned (to LMC) LIDs are randomly assigned to various nodes and other errors are randomly output to the guid2lid cache file. The SM sweep is run 5 times and after each iteration a complete verification is made to ensure that all LIDs that could possibly be maintained are kept, as well as that all nodes were assigned a legal LID range. * Multicast Routing: Nodes randomly join the 0xc000 group and eventually the resulting routing is verified for completeness and adherence to Up/Down routing rules. * osmtest: The complete osmtest flow as described in the previous table is run on the simulated fabrics. * Stress Test: This flow merges fabric, LID and stability issues with continuous PathRecord, ServiceRecord and Multicast Join/Leave activity to stress the SM/SA during continuous sweeps. InformInfo Set/Delete/Get were added to the test such both existing and non existing nodes perform them in random order. 5.3 OpenSM Regression Using a back-to-back or single switch connection, the following set of tests is run nightly on the stacks described in table 2. The included tests are: * Stress Testing: Flood the SA with queries from multiple channel adapters to check the robustness of the entire stack up to the SA. * Dynamic Changes: Dynamic Topology changes, through randomly dropping SMP packets, used to test OpenSM adaptation to an unstable network & verify DB correctness. * Trap Injection: This flow injects traps to the SM and verifies that it handles them gracefully. * SA Query Test: This test exhaustively checks the SA responses to all possible single component mask. To do that the test examines the entire set of records the SA can provide, classifies them by their field values and then selects every field (using component mask and a value) and verifies that the response matches the expected set of records. A random selection using multiple component mask bits is also performed. 5.4 Cluster testing: Cluster testing is usually run before a distribution release. It involves real hardware setups of 16 to 32 nodes (or more if a beta site is available). Each test is validated by running all-to-all ping through the IB interface. The test procedure includes: * Cluster bringup * Hand-off between 2 or 3 SM's while performing: - Node reboots - Switch power cycles (disconnecting the SM's) * Unresponsive port detection and recovery * osmtest from multiple nodes * Trap injection and recovery 6 Qualification ---------------- Table 2 - Qualified IB Stacks ============================= Stack | Version -----------------------------------------|-------------------------- OFED | 1.3 OFED | 1.2 OFED | 1.1 OFED | 1.0 OpenIB Gen2 (IBG2 distribution) | 1.0 OpenIB Gen1 (IBGD distribution) | 1.8.0 VAPI (Mellanox InfiniBand HCA Driver) | 3.2 and later Table 3 - Qualified Devices and Corresponding Firmware ====================================================== Mellanox Device | FW versions ------------------------------------|------------------------------- InfiniScale | fw-43132 5.2.000 (and later) InfiniScale III | fw-47396 0.5.000 (and later) InfiniHost | fw-23108 3.5.000 (and later) InfiniHost III Lx | fw-25204 1.2.000 (and later) InfiniHost III Ex (InfiniHost Mode) | fw-25208 4.8.200 (and later) InfiniHost III Ex (MemFree Mode) | fw-25218 5.3.000 (and later) ConnectX IB | fw-25408 2.3.000 (and later) QLogic/PathScale Device | Note --------|----------------------------------------------------------- iPath | QHT6040 (PathScale InfiniPath HT-460) iPath | QHT6140 (PathScale InfiniPath HT-465) iPath | QLE6140 (PathScale InfiniPath PE-880) iPath | QLE7240 iPath | QLE7280 Note 1: OpenSM does not run on an IBM Galaxy (eHCA) as it does not expose QP0 and QP1. However, it does support it as a device on the subnet. Note 2: QoS firmware and Mellanox devices HCAs: QoS supported by ConnectX. The current FW release doesn't support QoS. QoS-enabled FW release (2_5_000) is planned for May. If someone wishes to get QoS-enabled FW before the official release, they should contact Mellanox FAE. Switches: QoS supported by InfiniScale III Any InfiniScale III FW that is supported by OpenSM supports QoS.