Version: OpenFabrics Enterprise Distribution (OFED) 1.0
Repo:    https://openib.org/svn/gen2/branches/1.0/src/userspace/management/osm

This document describes the contents of the OpenSM OFED 1.0 release.
OpenSM is an InfiniBand compliant Subnet Manager and Administrator,
and runs on top of OpenIB. The OpenSM version for this release

This document includes the following sections:
1 This Overview section (describing new features and software
  dependencies)
2 Known Issues And Limitations
3 Unsupported IB Compliance Statements
4 Bug Fixes
5 Main Verification Flows
6 Qualified Software Stacks and Devices

* SA GuidInfoRecord support

* Default for maxsmps changed:
  maxsmps controls the number of SMPs sent in parallel; the change
  shortens the fabric initialization time.

* osmtest/osmt_slvl_vl_arb.c:
  Output file name changed from vl_arbs.txt to qos.txt.

* Support new IBTA Errata IsPortInfoCapMaskMatchSupported:
  This new capability of the SA enables matching of individual port
  capability bits, dramatically reducing the query size for agents
  such as the SRP initiator when it queries for SRP targets.

* Honor guid2lid when coming out of standby:
  This change adds an option to OpenSM that forces it to honor the
  given guid2lid file when it comes out of Standby state. Currently,
  when OpenSM comes out of Standby state, it ignores the guid2lid file
  it read and honors only the LIDs defined on the ports themselves.

* Add guid to opensm opts:
  This enables the port on which to run the SM to be defined through
  the configuration file as well as through the command line.
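
The capability-bit matching described under IsPortInfoCapMaskMatchSupported
above can be pictured with a small sketch. It is not OpenSM code: the
function names and the port table are hypothetical, and it assumes that a
port "matches" when every bit set in the query mask is also set in the
port's capability mask. The bit position used for device management should
likewise be treated as illustrative.

```python
# Illustrative sketch only; not taken from the OpenSM sources.
# Assumed bit position for "device management supported" in the
# PortInfo CapabilityMask (treat the exact value as an assumption).
IS_DEVICE_MGMT_SUPPORTED = 1 << 19


def capmask_matches(port_capmask, query_capmask):
    """Return True when every bit set in the query is also set on the port."""
    return (port_capmask & query_capmask) == query_capmask


def select_ports(ports, query_capmask):
    """Return the sorted LIDs of ports whose mask matches the query.

    `ports` is a hypothetical dict mapping LID -> CapabilityMask.
    """
    return sorted(lid for lid, mask in ports.items()
                  if capmask_matches(mask, query_capmask))


if __name__ == "__main__":
    ports = {
        1: 0x2,
        2: IS_DEVICE_MGMT_SUPPORTED | 0x2,
        3: IS_DEVICE_MGMT_SUPPORTED,
    }
    print(select_ports(ports, IS_DEVICE_MGMT_SUPPORTED))  # prints [2, 3]
```

In this toy table an exact-value comparison would return only port 3, and
a query without the filter would return all three records; bit matching
returns exactly the ports that advertise the capability.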

No PPC QA was performed.

1.2 Software Dependencies

OpenSM depends on the installation of one of the following stacks:
OFED 1.0, OpenIB gen2 (e.g. IBG2 distribution), OpenIB gen1 (e.g. IBGD
distribution), or Mellanox VAPI. The qualified driver versions
are provided in Table 2, "Qualified IB Stacks".

1.4 Supported Devices Firmware

The main task of OpenSM is to initialize InfiniBand devices. The
qualified devices and their corresponding firmware versions
are listed in Table 3.

2 Known Issues And Limitations
------------------------------

* No Partition/Pkey policy support:
  OpenSM does not provide a means to set partitions.

* No Service / Key associations:
  There is no way to manage Service access by Keys.

* No SM to SM SMDB synchronization:
  This puts the burden of re-registering services, multicast groups,
  and inform-info on the client application (or IB access layer core).

* No "port down" event handling:
  Changing the switch port through which OpenSM connects to the IB
  fabric may cause incorrect operation. Please restart OpenSM whenever
  such a connectivity change is made.

* Changing connections during SM operation:
  Under some conditions the SM can get confused by a change in
  cabling (moving a cable from one switch port to another) and
  momentarily see the same GUID as connected to two different IB
  ports. If the SM then fails to get the corresponding change event,
  it might mistakenly report this as a "duplicated GUID" case and
  abort. It is advisable to double-check the syslog after each such
  change in connectivity and restart OpenSM if it has exited.

No SL2VL and VLArbitration setting is performed by the SM.

3 Unsupported IB Compliance Statements
--------------------------------------
The following section lists all the IB compliance statements which
OpenSM does not support. Please refer to the IB specification for detailed
information regarding each compliance statement.

* C14-22 (Authentication):
  M_Key, M_KeyProtectBits, and M_KeyLeasePeriod shall be set in one
  SubnSet method. As a work-around, an OpenSM option is provided for
  defining the protect bits.

* C14-67 (Authentication):
  On SubnGet(SMInfo) and SubnSet(SMInfo) - if M_Key is not zero then
  the SM shall generate a SubnGetResp if the M_Key matches, or
  silently drop the packet if M_Key does not match.

* C15-0.1.23.4 (Authentication):
  InformInfoRecords shall always be provided with the QPN set to 0,
  except for the case of a trusted request, in which case the actual
  subscriber QPN shall be returned.

* o13-17.1.2 (Event-FWD):
  If no permission to forward, the subscription should be removed and
  no further forwarding should occur.

* C14-37.1.2 (Handover):
  Priority should be kept in non-volatile memory.

* C14-24.1.1.5 and C14-62.1.1.22 (Initialization):
  GUIDInfo - SM should enable assigning Port GUIDInfo.

* C14-44 (Initialization):
  If the SM discovers that it is missing an M_Key to update CA/RT/SW,
  it should notify the higher level.

* C14-62.1.1.11 (Initialization):
  PortInfo:VLHighLimit should match the configured VLArb on the port.

* C14-62.1.1.12 (Initialization):
  PortInfo:M_Key - Set the M_Key to a node based random value.

* C14-62.1.1.13 (Initialization):
  PortInfo:P_KeyProtectBits - set according to an optional policy.

* C14-62.1.1.24 (Initialization):
  SwitchInfo:DefaultPort - should be configured for random FDB.

* C14-62.1.1.32 (Initialization):
  RandomForwardingTable should be configured.

* o15-0.1.12 (Multicast):
  If the JoinState is SendOnlyNonMember = 1 (only), then the endport
  should join as sender only.

* o15-0.1.13 (Multicast):
  If a Join request using unrealistic parameters is received, return

* o15-0.1.8 (Multicast):
  If a request for creating an MCG with fields that cannot be met is
  received, return ERR_REQ_INVALID (currently ignoring SL and
  FlowLabelTclass).

* C15-0.1.11 (SA-Query):
  Query response should use only base LIDs (as the feature has not

* C15-0.1.19 (SA-Query):
  Respond to SubnGetMulti(MultiPathRec)

* C15-0.1.8.6 (SA-Query):
  Respond to SubnAdmGetTraceTable - this is an optional attribute.

* C15-0.1.8.7 (SA-Query):
  SubnAdmGetMulti and SubnAdmGetMultiResp - only in the case of a
  MultiPath.

* C15-0.1.13 (Services):
  Reject ServiceRecord create, modify or delete if the given
  ServiceP_Key does not match the one included in the ServiceGID port
  and the port that sent the request.

* C15-0.1.14 (Services):
  Provide means to associate service name and ServiceKeys.

4 Bug Fixes
-----------

The following is a list of bugs that were fixed. Note that other, less
critical or less visible bugs were also fixed.

* Eliminate error on active -> active port state transition:
  The SM may transition a port from armed to active, but in the
  meantime, due to passing a data packet with active enable set, the
  port may already have transitioned to active. The active -> active
  port state transition was indicated as an error although it is not
  really one. Fix: do not indicate an error in the osm log.

* Routing not set for the first LID in the last LFT block:
  Fix: corrected the last block calculation in
  osm_switch_get_fwd_tbl_block (osm_switch.c).

* Corrupted guid2lid file causes OpenSM exit:
  Fix: exit only if the exit_on_fatal option is set (the default).

* OpenSM was causing Client-Re-Registration continuously:
  The SM was storing the response PortInfo.ClientReRegistration
  bit and using it during the next Set(PortInfo). Fix: clear the bit on

* Multicast Query Selectors MTU, rate, and PacketLifeTime were not exact

* Try not to recognize port change as duplicated GUID:
  This fix solves the issue of a port move during a heavy sweep
  being recognized as a duplicated GUID. Fix: if the SM sees what
  seems to be a duplicated GUID, but it has also received an indication
  for immediately forcing another heavy sweep (for example, as a
  result of receiving trap 128), then it does not issue a duplicated
  GUID error (and possibly exit), but ignores the duplication and
  continues. This means that only if the SM recognizes such a
  duplication in a stable subnet will it issue the error (and

* Set PKey table on switch ports not supporting it:
  OpenSM attempted to set pkey table entries on external switch ports
  even if the switch declared a PartitionEnforcementCap of zero. The
  consequence was ERR 4108. Fix: observe a PartitionEnforcementCap of
  zero.

* Incorrect MCMemberRecord Get/GetTable in trusted mode:
  This change fixes the retrieval of the MCMember records according to
  Errata MGTWG3280. The fix provides a means to obtain all the group
  members by issuing a trusted GetTable query.

* Trusted MCMemberRecord query was not recognized as trusted:
  Trust is checked by comparing the request SM_Key field to the SM's
  SM_Key. The bug was in looking up the SM_Key from the response not

* Port left in down state after setting MTU or OpVLs on its neighbor:
  In case of a difference between the MTU of two ports, only the port
  with the higher MTU was set to down. Its remote port was recorded in
  the DB as being in the ACTIVE state although its real state was INIT.
  Because of this, the SM didn't try to move the remote port to

* Atomic counters used throughout the code were broken:
  A new implementation has been provided.

* MC Group creation with "less than" MTU ignores the requester MTU:
  When requesting to create an MC group with MTU(rate) selector 1
  (meaning less than the MTU(rate) specified), the MC group was created
  with the requested MTU(rate) - 1, without checking the MTU(rate) of
  the port requesting the creation of the multicast group. This means
  that if, for example, a port with MTU=2 sent a request for MC group
  creation with MTU selector=1 and MTU=5, OpenSM would try to create an
  MC group with MTU=4, and fail, since that MTU is not realizable by
  the port. Fix: creation of the MC group with MTU(rate) also takes
  into account the MTU(rate) of the port requesting the creation.

* MC Group join does not validate that the joining port's capabilities
  match those of the MC group. Fix: added verification of the endport's
  physical capability to join the MC group.

* ClientReRegistration not sent to ports discovered after first sweep:
  PortInfo was sent with the ClientReRegistration bit turned on only
  during the first sweep after becoming Master. This doesn't cover all
  cases where ClientReRegistration should be turned on. Fix: turn on
  this bit also on new ports the SM discovers (in cases of subnet merging, for

* segfault during a report on a deleted multicast group:
  In osm_mcast_mgr.c, executing the line of code:
    osm_mgrp_send_delete_notice( p_mgr->p_subn, p_mgr->p_log, p_mgrp );
  caused a segmentation fault because the handle p_mgrp had already
  been deleted when the function was called. Fix: the call now runs
  inside the protected section.

* segfault in osm_get_gid_by_mad_addr:
  The affected flows are ports and multicast joins.

* segfault in LID manager:
  Fix: handle a NULL p_rem_physp, which can validly be NULL when the
  remote SMA is not responding (but the physical link is up).

* segfault in Up/Down routing engine
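
The arithmetic behind the "less than" MTU fix in the list above can be
worked through in a few lines. This is a sketch, not the OpenSM
implementation: the function names are hypothetical, and it assumes the
fix amounts to capping the computed group MTU at the requesting port's
MTU, which is one plausible reading of "takes into account". MTU values
are the encoded values used in the text.

```python
# Illustrative sketch only; not the OpenSM implementation.
# MTU values are the encoded values used in the text (e.g. 2, 4, 5).


def group_mtu_before_fix(requested_mtu):
    """Old behaviour: 'less than N' became N - 1, ignoring the requester."""
    return requested_mtu - 1


def group_mtu_after_fix(requested_mtu, port_mtu):
    """Assumed fix: also cap the result at the requesting port's MTU."""
    if requested_mtu <= 1:
        # Assumption: 1 is the smallest encoded MTU, so nothing is below it.
        raise ValueError("no encoded MTU is less than the requested value")
    return min(requested_mtu - 1, port_mtu)


if __name__ == "__main__":
    # The example from the text: port MTU=2, selector=1, requested MTU=5.
    print(group_mtu_before_fix(5))    # 4: not realizable by the port
    print(group_mtu_after_fix(5, 2))  # 2: within the requester's capability
```

The same reasoning applies to the rate selector.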

5 Main Verification Flows
-------------------------

OpenSM verification is run using the following activities:
* osmtest - a stand-alone program
* ibmgtsim (IB management simulator) based - a set of flows that
  simulate clusters, inject errors, and verify OpenSM's capability to
  respond and bring up the network correctly.
* small cluster regression testing - where the SM is used on
  back-to-back or single-switch configurations. The regression includes
  multiple OpenSM dedicated tests.
* cluster testing - where OpenSM is run to set up a large cluster,
  perform hand-off, reboots and reconnects, and verify routing
  correctness and SA responsiveness at the ULP level (IPoIB and SDP).

osmtest is an automated verification tool used for OpenSM
testing. Its verification flows are described in the list below.

* Inventory File: Obtain and verify all port info, node info, and path

  - Register new service
  - Register another service (with a lease period)
  - Register another service (with service p_key set to zero)
  - Get all services by name
  - Delete the first service
  - Delete the third service
  - Bad flows of get/delete for a non-valid service
  - Add / Get same service with different data
  - Add / Get / Delete by different component mask values (services
    by Name & Key / Name & Data / Name & Id / Id only)

* Multicast Member Record:
  - Query of existing Groups (IPoIB)
  - BAD Join with insufficient comp mask (o15.0.1.3)
  - Create given MGID=0 (o15.0.1.4)
  - Create given MGID=0xFF12A01C,FE800000,00000000,12345678 (o15.0.1.4)
  - Create BAD MGID=0xFA. (o15.0.1.6)
  - Create BAD MGID=0xFF12A01B w/ link-local not set (o15.0.1.6)
  - New MGID with invalid join state (o15.0.1.9)
  - Retry of existing MGID - See JoinState update (o15.0.1.11)
  - BAD RATE when connecting to existing MGID (o15.0.1.13)
  - Partial JoinState delete request - removing FullMember (o15.0.1.14)
  - Full Delete of a group (o15.0.1.14)
  - Verify Delete by trying to Join deleted group (o15.0.1.14)
  - BAD Delete of IPoIB membership (no prev join) (o15.0.1.15)

  - All GUIDInfoRecords in subnet are obtained

* Event Forwarding: Register for trap forwarding using reports
  - Send a trap and wait for report
  - Unregister non-existing

* Trap 64/65 Flow: Register to Trap 64-65, create traps (by
  disconnecting/connecting ports) and wait for report, then unregister.

* Stress Test: send PortInfoRecord queries, both single and RMPP, and
  check the rate of responses as well as their validity.

5.2 IB Management Simulator OpenSM Test Flows:

The simulator provides the ability to simulate the SM handling of virtual
topologies that are not limited to actual lab equipment availability.
OpenSM was simulated to bring up clusters of up to 10,000 nodes. Daily
regressions use smaller clusters (16 and 128 nodes).

The following test flows are run on the IB management simulator:

Up to 12 links from the fabric are randomly selected to drop packets
at drop rates up to 90%. The SM is required to succeed in bringing the
fabric up. The resulting routing is also verified to be correct.

Using LMC = 2 the fabric is initialized with LIDs. Faults such as
zero LID, duplicated LID, and non-aligned (to LMC) LIDs are
randomly assigned to various nodes and other errors are randomly
output to the guid2lid cache file. The SM sweep is run 5 times and
after each iteration a complete verification is made to ensure that all
LIDs that could possibly be maintained are kept, and that all nodes
were assigned a legal LID range.

Nodes randomly join the 0xc000 group and eventually the
resulting routing is verified for completeness and adherence to
Up/Down routing rules.

The complete osmtest flow as described above is run on
the simulated fabrics.

This flow merges fabric, LID and stability issues with continuous
PathRecord, ServiceRecord and Multicast Join/Leave activity to
stress the SM/SA during continuous sweeps.
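
The three LID faults named in the LMC = 2 flow above (zero LID,
duplicated LID, LID not aligned to LMC) can be illustrated with a short
checker. This is a sketch under stated assumptions, not the simulator's
verification code: the function names and the GUID-to-base-LID table are
hypothetical, and it assumes each port owns 2^LMC consecutive LIDs
starting at a base LID that is a multiple of 2^LMC.

```python
# Illustrative sketch only; not the ibmgtsim verification code.


def lid_range(base_lid, lmc):
    """LIDs owned by a port: 2**lmc consecutive values from the base LID."""
    return range(base_lid, base_lid + (1 << lmc))


def find_lid_faults(assignments, lmc):
    """Return {guid: fault} for a hypothetical {guid: base_lid} table."""
    faults = {}
    owner = {}  # LID -> GUID that legally owns it
    for guid, base in sorted(assignments.items()):
        if base == 0:
            faults[guid] = "zero LID"
        elif base % (1 << lmc) != 0:
            faults[guid] = "not aligned to LMC"
        elif any(lid in owner for lid in lid_range(base, lmc)):
            faults[guid] = "duplicated LID"
        else:
            for lid in lid_range(base, lmc):
                owner[lid] = guid
    return faults


if __name__ == "__main__":
    table = {0xA: 4, 0xB: 0, 0xC: 6, 0xD: 4, 0xE: 8}
    for guid, fault in find_lid_faults(table, lmc=2).items():
        print(hex(guid), fault)
```

With LMC = 2 each port owns four LIDs, so in this table 0xA keeps LIDs
4 to 7, 0xE keeps 8 to 11, and the other three entries are flagged and
would need a fresh, legal LID range.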

5.3 OpenSM Regression

Using a back-to-back or single switch connection, the following set of
tests is run nightly on the stacks described in Table 2. The included

* Stress Testing: Flood the SA with queries from multiple channel
  adapters to check the robustness of the entire stack up to the SA.

* Dynamic Changes: Dynamic topology changes, produced by randomly
  dropping SMP packets, are used to test OpenSM adaptation to an
  unstable network and to verify DB correctness.

* Trap Injection: This flow injects traps to the SM and verifies that it
  handles them gracefully.

* SA Query Test: This test exhaustively checks the SA responses to all
  possible single component masks. To do that, the test examines the
  entire set of records the SA can provide, classifies them by their
  field values, then selects every field (using a component mask and a
  value) and verifies that the response matches the expected set of
  records. A random selection using multiple component mask bits is
  also performed.
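
The single-component-mask procedure of the SA Query Test above can be
sketched as follows. This is not the regression test itself: records are
modelled as plain dictionaries that all share the same fields, the
function names are hypothetical, and `fake_sa_query` is a stand-in for a
real query sent to the SA.

```python
# Illustrative sketch only; `fake_sa_query` stands in for a real SA query.


def fake_sa_query(records, component_mask, values):
    """Return records whose fields listed in component_mask equal `values`."""
    return [r for r in records
            if all(r[field] == values[field] for field in component_mask)]


def single_field_cases(records):
    """Yield (field, value, expected) for every field value seen in records."""
    fields = sorted({field for r in records for field in r})
    for field in fields:
        for value in sorted({r[field] for r in records}):
            expected = [r for r in records if r[field] == value]
            yield field, value, expected


def run_single_field_checks(records):
    """Check every single-field query; return the number of cases checked."""
    checked = 0
    for field, value, expected in single_field_cases(records):
        got = fake_sa_query(records, [field], {field: value})
        assert got == expected, (field, value)
        checked += 1
    return checked


if __name__ == "__main__":
    records = [
        {"lid": 1, "mtu": 4},
        {"lid": 2, "mtu": 4},
        {"lid": 3, "mtu": 2},
    ]
    print(run_single_field_checks(records))
```

In a real run the expected sets would be built from a full dump of the
SA's records, and each query would go to the SA rather than to a local
filter.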

Cluster testing is usually run before a distribution release. It
involves real hardware setups of 16 to 32 nodes (or more if a beta site
is available). Each test is validated by running all-to-all ping through
the IB interface. The test procedure includes:

* Hand-off between 2 or 3 SMs while performing
  - Switch power cycles (disconnecting the SMs)

* Unresponsive port detection and recovery

* osmtest from multiple nodes

* Trap injection and recovery

6 Qualified Software Stacks and Devices
---------------------------------------

Table 2 - Qualified IB Stacks
=============================

-----------------------------------------|--------------------------
OpenIB Gen2 (IBG2 distribution)          | 1.0
OpenIB Gen1 (IBGD distribution)          | 1.8.0
VAPI (Mellanox InfiniBand HCA Driver)    | 3.2 and later

Table 3 - Qualified Devices and Corresponding Firmware
======================================================

--------|-----------------------------------------------------------
MT43132 | InfiniScale - fw-43132 5.2.0 (and later)
MT47396 | InfiniScale III - fw-47396 0.5.0 (and later)
MT23108 | InfiniHost - fw-23108 3.3.2
MT25204 | InfiniHost III Lx - fw-25204 1.0.1
MT25208 | InfiniHost III Ex (InfiniHost Mode) - fw-25208 4.6.2 (and later)
MT25208 | InfiniHost III Ex (MemFree Mode) - fw-25218 5.0.1 (and later)

--------|-----------------------------------------------------------
iPath   | QHT6040 (PathScale InfiniPath HT-460)
iPath   | QHT6140 (PathScale InfiniPath HT-465)
iPath   | QLE6140 (PathScale InfiniPath PE-880)

Note: OpenSM does not run on an IBM Galaxy (eHCA) as it does not expose
QP0 and QP1. However, it does support it as a device on the subnet.