1 .TH OPENSM 8 "June 13, 2008" "OpenIB" "OpenIB Management"
4 opensm \- InfiniBand subnet manager and administration (SM/SA)
9 [\-F | \-\-config <file_name>]
10 [\-c(reate-config) <file_name>]
11 [\-g(uid) <GUID in hex>]
13 [\-p(riority) <PRIORITY>]
16 [\-R <engine name(s)> | \-\-routing_engine <engine name(s)>]
17 [\-A | \-\-ucast_cache]
18 [\-z | \-\-connect_roots]
19 [\-M <file name> | \-\-lid_matrix_file <file name>]
20 [\-U <file name> | \-\-lfts_file <file name>]
21 [\-S | \-\-sadb_file <file name>]
22 [\-a | \-\-root_guid_file <path to file>]
23 [\-u | \-\-cn_guid_file <path to file>]
24 [\-X | \-\-guid_routing_order_file <path to file>]
25 [\-m | \-\-ids_guid_file <path to file>]
27 [\-s(weep) <interval>]
28 [\-t(imeout) <milliseconds>]
30 [\-console [off | local | socket | loopback]]
31 [\-console-port <port>]
32 [\-i(gnore-guids) <equalize-ignore-guids-file>]
33 [\-f <log file path> | \-\-log_file <log file path> ]
34 [\-L | \-\-log_limit <size in MB>] [\-e(rase_log_file)]
35 [\-P(config) <partition config file> ]
36 [\-N | \-\-no_part_enforce]
37 [\-Q | \-\-qos [\-Y | \-\-qos_policy_file <file name>]]
38 [\-y | \-\-stay_on_fatal]
42 [\-\-perfmgr_sweep_time_s <seconds>]
43 [\-\-prefix_routes_file <path>]
44 [\-\-consolidate_ipv6_snm_req]
45 [\-v(erbose)] [\-V] [\-D <flags>] [\-d(ebug) <number>]
50 opensm is an InfiniBand compliant Subnet Manager and Administration,
51 and runs on top of OpenIB.
53 opensm provides an implementation of an InfiniBand Subnet Manager and
54 Administration. Such a software entity is required to run in order
55 to initialize the InfiniBand hardware (at least one per each
58 opensm also now contains an experimental version of a performance
61 opensm defaults were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB
62 fabric, initialize it, and sweep occasionally for changes.
64 opensm attaches to a specific IB port on the local machine and configures only
65 the fabric connected to it. (If the local machine has other IB ports,
66 opensm will ignore the fabrics connected to those other ports). If no port is
67 specified, it will select the first "best" available port.
69 opensm can present the available ports and prompt for a port number to
72 By default, the run is logged to two files: /var/log/messages and /var/log/opensm.log.
73 The first file will register only general major events, whereas the second
74 will include details of reported errors. All errors reported in this second
75 file should be treated as indicators of IB fabric health issues.
76 (Note that when a fatal and non-recoverable error occurs, opensm will exit.)
77 Both log files should include the message "SUBNET UP" if opensm was able to
78 set up the subnet correctly.
85 Prints OpenSM version and exits.
87 \fB\-F\fR, \fB\-\-config\fR <config file>
88 The name of the OpenSM config file. When not specified
89 \fB\% @OPENSM_CONFIG_DIR@/@OPENSM_CONFIG_FILE@\fP will be used (if it exists).
91 \fB\-c\fR, \fB\-\-create-config\fR <file name>
92 OpenSM will dump its configuration to the specified file and exit.
93 This is a way to generate an OpenSM configuration file template.
95 \fB\-g\fR, \fB\-\-guid\fR <GUID in hex>
96 This option specifies the local port GUID value
97 with which OpenSM should bind. OpenSM may be
98 bound to 1 port at a time.
99 If GUID given is 0, OpenSM displays a list
100 of possible port GUIDs and waits for user input.
101 Without -g, OpenSM tries to use the default port.
103 \fB\-l\fR, \fB\-\-lmc\fR <LMC value>
104 This option specifies the subnet's LMC value.
105 The number of LIDs assigned to each port is 2^LMC.
106 The LMC value must be in the range 0-7.
107 LMC values > 0 allow multiple paths between ports.
108 LMC values > 0 should only be used if the subnet
109 topology actually provides multiple paths between
110 ports, i.e. multiple interconnects between switches.
111 Without -l, OpenSM defaults to LMC = 0, which allows
112 one path between any two ports.
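The relationship between LMC and the number of LIDs per port can be checked with a small calculation. The following is a minimal sketch; the function name lids_per_port is hypothetical and is not part of OpenSM.

```python
def lids_per_port(lmc: int) -> int:
    """Return the number of LIDs assigned to each port for a given LMC value."""
    if not 0 <= lmc <= 7:
        raise ValueError("LMC must be in the range 0-7")
    return 1 << lmc  # 2^LMC


if __name__ == "__main__":
    for lmc in range(8):
        print(f"LMC={lmc}: {lids_per_port(lmc)} LID(s) per port")
```

With the default LMC of 0 each port gets a single LID, so only one path exists between any two ports.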
114 \fB\-p\fR, \fB\-\-priority\fR <Priority value>
115 This option specifies the SM\'s PRIORITY.
116 This will affect the handover cases, where master
117 is chosen by priority and GUID. Range goes from 0
118 (default and lowest priority) to 15 (highest).
120 \fB\-smkey\fR <SM_Key value>
121 This option specifies the SM\'s SM_Key (64 bits).
122 This will affect SM authentication.
123 Note that OpenSM version 3.2.1 and below used the default value '1'
124 in host byte order. This is fixed now, but you may need this option to
125 interoperate with an old OpenSM running on a little-endian machine.
127 \fB\-r\fR, \fB\-\-reassign_lids\fR
128 This option causes OpenSM to reassign LIDs to all
129 end nodes. Specifying -r on a running subnet
130 may disrupt subnet traffic.
131 Without -r, OpenSM attempts to preserve existing
132 LID assignments resolving multiple use of same LID.
134 \fB\-R\fR, \fB\-\-routing_engine\fR <Routing engine names>
135 This option chooses routing engine(s) to use instead of Min Hop
136 algorithm (default). Multiple routing engines can be specified
137 separated by commas so that specific ordering of routing algorithms
138 will be tried if earlier routing engines fail.
139 Supported engines: minhop, updn, file, ftree, lash, dor
141 \fB\-A\fR, \fB\-\-ucast_cache\fR
142 This option enables unicast routing cache and prevents routing
143 recalculation (which is a heavy task in a large cluster) when
144 there was no topology change detected during the heavy sweep, or
145 when the topology change does not require new routing calculation,
146 e.g. when one or more CAs/RTRs/leaf switches go down, or one or
147 more of these nodes come back after being down.
148 A very common case that is handled by the unicast routing cache
149 is host reboot, which otherwise would cause two full routing
150 recalculations: one when the host goes down, and the other when
151 the host comes back online.
153 \fB\-z\fR, \fB\-\-connect_roots\fR
154 This option forces a routing engine (currently up/down
155 only) to provide connectivity between root switches and in
156 this way to be fully IBA compliant. In many cases this can
157 violate the "pure" deadlock-free algorithm, so use it carefully.
159 \fB\-M\fR, \fB\-\-lid_matrix_file\fR <file name>
160 This option specifies the name of the lid matrix dump file
161 from where switch lid matrices (min hops tables will be
164 \fB\-U\fR, \fB\-\-lfts_file\fR <file name>
165 This option specifies the name of the LFTs file
166 from where switch forwarding tables will be loaded.
168 \fB\-S\fR, \fB\-\-sadb_file\fR <file name>
169 This option specifies the name of the SA DB dump file
170 from where SA database will be loaded.
172 \fB\-a\fR, \fB\-\-root_guid_file\fR <file name>
173 Set the root nodes for the Up/Down or Fat-Tree routing
174 algorithm to the guids provided in the given file (one to a line).
176 \fB\-u\fR, \fB\-\-cn_guid_file\fR <file name>
177 Set the compute nodes for the Fat-Tree routing algorithm
178 to the guids provided in the given file (one to a line).
180 \fB\-m\fR, \fB\-\-ids_guid_file\fR <file name>
181 Name of the map file with set of the IDs which will be used
182 by Up/Down routing algorithm instead of node GUIDs
183 (format: <guid> <id> per line).
185 \fB\-X\fR, \fB\-\-guid_routing_order_file\fR <file name>
186 Set the order port guids will be routed for the MinHop
187 and Up/Down routing algorithms to the guids provided in the
188 given file (one to a line).
190 \fB\-o\fR, \fB\-\-once\fR
191 This option causes OpenSM to configure the subnet
192 once, then exit. Ports remain in the ACTIVE state.
194 \fB\-s\fR, \fB\-\-sweep\fR <interval value>
195 This option specifies the number of seconds between
196 subnet sweeps. Specifying -s 0 disables sweeping.
197 Without -s, OpenSM defaults to a sweep interval of
200 \fB\-t\fR, \fB\-\-timeout\fR <value>
201 This option specifies the time in milliseconds
202 used for transaction timeouts.
203 Specifying -t 0 disables timeouts.
204 Without -t, OpenSM defaults to a timeout value of
207 \fB\-maxsmps\fR <number>
208 This option specifies the number of VL15 SMP MADs
209 allowed on the wire at any one time.
210 Specifying -maxsmps 0 allows unlimited outstanding
212 Without -maxsmps, OpenSM defaults to a maximum of
215 \fB\-console [off | local | socket | loopback]\fR
216 This option brings up the OpenSM console (default off).
217 Note that the socket and loopback options will only be available
218 if OpenSM was built with --enable-console-socket.
220 \fB\-console-port\fR <port>
221 Specify an alternate telnet port for the socket console (default 10000).
222 Note that this option only appears if OpenSM was built with
223 --enable-console-socket.
225 \fB\-i\fR, \fB\-ignore-guids\fR <equalize-ignore-guids-file>
226 This option provides the means to define a set of ports
227 (by node guid and port number) that will be ignored by the link load
228 equalization algorithm.
230 \fB\-x\fR, \fB\-\-honor_guid2lid\fR
231 This option forces OpenSM to honor the guid2lid file,
232 when it comes out of Standby state, if such file exists
233 under OSM_CACHE_DIR, and is valid.
234 By default, this is FALSE.
236 \fB\-f\fR, \fB\-\-log_file\fR <file name>
237 This option defines the log to be the given file.
238 By default, the log goes to /var/log/opensm.log.
239 For the log to go to standard output use -f stdout.
241 \fB\-L\fR, \fB\-\-log_limit\fR <size in MB>
242 This option defines maximal log file size in MB. When
243 specified the log file will be truncated upon reaching
246 \fB\-e\fR, \fB\-\-erase_log_file\fR
247 This option will cause deletion of the log file
248 (if it already exists). By default, the log file
251 \fB\-P\fR, \fB\-\-Pconfig\fR <partition config file>
252 This option defines the optional partition configuration file.
253 The default name is \fB\%@OPENSM_CONFIG_DIR@/@PARTITION_CONFIG_FILE@\fP.
255 \fB\-\-prefix_routes_file\fR <file name>
256 Prefix routes control how the SA responds to path record queries for
257 off-subnet DGIDs. By default, the SA fails such queries. The
259 section below describes the format of the configuration file.
260 The default path is \fB\%@OPENSM_CONFIG_DIR@/prefix\-routes.conf\fP.
262 \fB\-Q\fR, \fB\-\-qos\fR
263 This option enables QoS setup. It is disabled by default.
265 \fB\-Y\fR, \fB\-\-qos_policy_file\fR <file name>
266 This option defines the optional QoS policy file. The default
267 name is \fB\%@OPENSM_CONFIG_DIR@/@QOS_POLICY_FILE@\fP.
269 \fB\-N\fR, \fB\-\-no_part_enforce\fR
270 This option disables partition enforcement on switch external ports.
272 \fB\-y\fR, \fB\-\-stay_on_fatal\fR
273 This option will cause SM not to exit on fatal initialization
274 issues: if SM discovers duplicated guids or a 12x link with
275 lane reversal badly configured.
276 By default, the SM will exit on these errors.
278 \fB\-B\fR, \fB\-\-daemon\fR
279 Run in daemon mode - OpenSM will run in the background.
281 \fB\-I\fR, \fB\-\-inactive\fR
282 Start SM in inactive rather than init SM state. This
283 option can be used in conjunction with the perfmgr so as to
284 run a standalone performance manager without SM/SA. However,
285 this is NOT currently implemented in the performance manager.
288 Enable the perfmgr. Only takes effect if --enable-perfmgr was specified at
291 \fB\-\-perfmgr_sweep_time_s\fR <seconds>
292 Specify the sweep time for the performance manager in seconds
293 (default is 180 seconds). Only takes
294 effect if --enable-perfmgr was specified at configure time.
296 .BI --consolidate_ipv6_snm_req
297 Consolidate IPv6 Solicited Node Multicast group join requests into one
298 multicast group per MGID PKey.
300 \fB\-v\fR, \fB\-\-verbose\fR
301 This option increases the log verbosity level.
302 The -v option may be specified multiple times
303 to further increase the verbosity level.
304 See the -D option for more information about
308 This option sets the maximum verbosity level and
310 The -V option is equivalent to \'-D 0xFF -d 2\'.
311 See the -D option for more information about
315 This option sets the log verbosity level.
316 A flags field must follow the -D option.
317 A bit set/clear in the flags enables/disables a
318 specific log level as follows:
320 BIT LOG LEVEL ENABLED
321 ---- -----------------
322 0x01 - ERROR (error messages)
323 0x02 - INFO (basic messages, low volume)
324 0x04 - VERBOSE (interesting stuff, moderate volume)
325 0x08 - DEBUG (diagnostic, high volume)
326 0x10 - FUNCS (function entry/exit, very high volume)
327 0x20 - FRAMES (dumps all SMP and GMP frames)
328 0x40 - ROUTING (dump FDB routing information)
329 0x80 - currently unused.
331 Without -D, OpenSM defaults to ERROR + INFO (0x3).
332 Specifying -D 0 disables all messages.
333 Specifying -D 0xFF enables all messages (see -V).
334 High verbosity levels may require increasing
335 the transaction timeout with the -t option.
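The bit table above can be turned into a small decoder for checking what a given -D value enables. This is an illustrative sketch only; the names LOG_LEVELS and decode_log_flags are hypothetical, and the bit values are copied from the table.

```python
# Bit values and names as listed in the -D table (0x80 is unused).
LOG_LEVELS = {
    0x01: "ERROR",
    0x02: "INFO",
    0x04: "VERBOSE",
    0x08: "DEBUG",
    0x10: "FUNCS",
    0x20: "FRAMES",
    0x40: "ROUTING",
}


def decode_log_flags(flags: int) -> list:
    """Return the names of the log levels enabled by a -D flags value."""
    return [name for bit, name in sorted(LOG_LEVELS.items()) if flags & bit]


if __name__ == "__main__":
    for value in (0x0, 0x3, 0xFF):
        print(hex(value), decode_log_flags(value))
```

For example, the default value 0x3 decodes to ERROR and INFO, and 0 decodes to an empty list.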
337 \fB\-d\fR, \fB\-\-debug\fR <value>
338 This option specifies a debug option.
339 These options are not normally needed.
340 The number following -d selects the debug
341 option to enable as follows:
344 --- -----------------
345 -d0 - Ignore other SM nodes
346 -d1 - Force single threaded dispatching
347 -d2 - Force log flushing after each log message
348 -d3 - Disable multicast support
350 \fB\-h\fR, \fB\-\-help\fR
351 Display this usage info then exit.
354 Display this usage info then exit.
356 .SH ENVIRONMENT VARIABLES
358 The following environment variables control opensm behavior:
360 OSM_TMP_DIR - controls the directory in which the temporary files generated by
361 opensm are created. These files are: opensm-subnet.lst, opensm.fdbs, and
362 opensm.mcfdbs. By default, this directory is /var/log.
364 OSM_CACHE_DIR - opensm stores certain data to the disk such that subsequent
365 runs are consistent. The default directory used is /var/cache/opensm.
366 The following file is included in it:
368 guid2lid - stores the LID range assigned to each GUID
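A script that needs to locate the files mentioned above could resolve the directories the same way, falling back to the documented defaults. This is a sketch, not OpenSM code; the helper name opensm_dirs is hypothetical, and placing guid2lid directly under the cache directory is an assumption based on the description above.

```python
import os


def opensm_dirs(env=None) -> dict:
    """Resolve OpenSM directories from the environment, using documented defaults."""
    env = os.environ if env is None else env
    tmp_dir = env.get("OSM_TMP_DIR", "/var/log")
    cache_dir = env.get("OSM_CACHE_DIR", "/var/cache/opensm")
    return {
        "tmp_dir": tmp_dir,
        "cache_dir": cache_dir,
        # Assumption: guid2lid sits directly inside the cache directory.
        "guid2lid": cache_dir.rstrip("/") + "/guid2lid",
    }


if __name__ == "__main__":
    for key, path in opensm_dirs().items():
        print(f"{key}: {path}")
```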
372 When opensm receives a HUP signal, it starts a new heavy sweep as if a trap was received or a topology change was found.
374 Also, SIGUSR1 can be used to trigger a reopen of /var/log/opensm.log for
377 .SH PARTITION CONFIGURATION
379 The default name of the OpenSM partition configuration file is
380 \fB\%@OPENSM_CONFIG_DIR@/@PARTITION_CONFIG_FILE@\fP. The default may be changed by using
381 the --Pconfig (-P) option with OpenSM.
383 The default partition will be created by OpenSM unconditionally even
384 when partition configuration file does not exist or cannot be accessed.
386 The default partition has P_Key value 0x7fff. OpenSM\'s port will have
387 full membership in default partition. All other end ports will have
394 Line content following a \'#\' character is a comment and is ignored by
399 <Partition Definition>:<PortGUIDs list> ;
401 Partition Definition:
403 [PartitionName][=PKey][,flag[=value]][,defmember=full|limited]
405 PartitionName - string, will be used with logging. When omitted
406 empty string will be used.
407 PKey - P_Key value for this partition. Only low 15 bits will
408 be used. When omitted will be autogenerated.
409 flag - used to indicate IPoIB capability of this partition.
410 defmember=full|limited - specifies default membership for port guid
411 list. Default is limited.
413 Currently recognized flags are:
415 ipoib - indicates that this partition may be used for IPoIB; as a
416 result, an IPoIB-capable MC group will be created.
417 rate=<val> - specifies rate for this IPoIB MC group
418 (default is 3 (10 Gbps))
419 mtu=<val> - specifies MTU for this IPoIB MC group
420 (default is 4 (2048))
421 sl=<val> - specifies SL for this IPoIB MC group
423 scope=<val> - specifies scope for this IPoIB MC group
424 (default is 2 (link local)). Multiple scope settings
425 are permitted for a partition.
427 Note that values for rate, mtu, and scope should be specified as
428 defined in the IBTA specification (for example, mtu=4 for 2048).
432 PortGUID - GUID of partition member EndPort. Hexadecimal
433 numbers should start with 0x, decimal numbers
435 full or limited - indicates full or limited membership for this
436 port. When omitted (or unrecognized) limited
437 membership is assumed.
439 There are two useful keywords for PortGUID definition:
441 - 'ALL' means all end ports in this subnet.
442 - 'SELF' means subnet manager's port.
444 Empty list means no ports in this partition.
448 White space is permitted between delimiters ('=', ',',':',';').
450 The line can be wrapped after the ':' that follows the Partition Definition and
453 PartitionName does not need to be unique, but PKey does.
454 If PKey is repeated then those partition configurations will be merged
455 and first PartitionName will be used (see also next note).
457 It is possible to split partition configuration in more than one
458 definition, but then PKey should be explicitly specified (otherwise
459 different PKey values will be generated for those definitions).
463 Default=0x7fff : ALL, SELF=full ;
465 NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306 ;
467 YetAnotherOne = 0x300 : SELF=full ;
468 YetAnotherOne = 0x300 : ALL=limited ;
470 ShareIO = 0x80 , defmember=full : 0x123451, 0x123452;
471 # 0x123453, 0x123454 will be limited
472 ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full;
473 # 0x123456, 0x123457 will be limited
474 ShareIO = 0x80 , defmember=limited : 0x123456, 0x123457, 0x123458=full;
475 ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a;
476 ShareIO = 0x80 , defmember=full : 0x12345b, 0x12345c=limited, 0x12345d;
481 The following rule is equivalent to how OpenSM used to run prior to the
484 Default=0x7fff,ipoib:ALL=full;
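To make the grammar above concrete, here is a minimal Python sketch that parses a single, unwrapped partition definition line. It is not the OpenSM parser. The function name and result layout are hypothetical, wrapped lines are not handled, and the choice to map an unrecognized membership keyword to limited (rather than to defmember) is an assumption based on the PortGUID description.

```python
def parse_partition_line(line: str):
    """Parse '<Partition Definition>:<PortGUIDs list> ;' into a dict (sketch)."""
    line = line.split("#", 1)[0].strip()  # drop comments
    if not line:
        return None
    if not line.endswith(";"):
        raise ValueError("definition must end with ';'")
    definition, _, port_list = line[:-1].partition(":")
    fields = [f.strip() for f in definition.split(",")]
    name, _, pkey = (part.strip() for part in fields[0].partition("="))
    result = {
        "name": name,
        # Only the low 15 bits of the P_Key are used; None means autogenerated.
        "pkey": (int(pkey, 0) & 0x7FFF) if pkey else None,
        "flags": {},
        "defmember": "limited",
        "members": {},
    }
    for flag in fields[1:]:
        key, _, value = (part.strip() for part in flag.partition("="))
        if key == "defmember":
            result["defmember"] = value
        elif key:
            result["flags"][key] = value if value else True
    for member in port_list.split(","):
        guid, _, membership = (part.strip() for part in member.partition("="))
        if not guid:
            continue
        if not membership:
            membership = result["defmember"]      # omitted: use the default
        elif membership not in ("full", "limited"):
            membership = "limited"                # unrecognized: assume limited
        result["members"][guid] = membership
    return result


if __name__ == "__main__":
    print(parse_partition_line("Default=0x7fff : ALL, SELF=full ;"))
```

Applied to the NewPartition example above, the unrecognized keyword limi and the bare GUID both end up with limited membership.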
486 .SH QOS CONFIGURATION
488 There is a set of QoS-related low-level configuration parameters.
489 All these parameter names are prefixed by "qos_" string. Here is a full
490 list of these parameters:
492 qos_max_vls - The maximum number of VLs that will be on the subnet
493 qos_high_limit - The limit of High Priority component of VL
494 Arbitration table (IBA 7.6.9)
495 qos_vlarb_low - Low priority VL Arbitration table (IBA 7.6.9)
497 qos_vlarb_high - High priority VL Arbitration table (IBA 7.6.9)
499 Both VL arbitration templates are pairs of
501 qos_sl2vl - SL2VL Mapping table (IBA 7.6.6) template. It is
502 a list of VLs corresponding to SLs 0-15 (Note
503 that VL15 used here means drop this SL)
505 Typical default values (hard-coded in OpenSM initialization) are:
509 qos_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
510 qos_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
511 qos_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
513 The syntax is compatible with rest of OpenSM configuration options and
514 values may be stored in OpenSM config file (cached options file).
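The default values above can be read mechanically. The sketch below assumes that each VL arbitration entry is a VL:weight pair and that qos_sl2vl lists one VL per SL for SLs 0-15; the function names are hypothetical and not part of OpenSM.

```python
def parse_vlarb(value: str) -> list:
    """Parse a VL arbitration template such as '0:0,1:4' into (vl, weight) pairs."""
    pairs = []
    for entry in value.split(","):
        vl, weight = entry.split(":")
        pairs.append((int(vl), int(weight)))
    return pairs


def parse_sl2vl(value: str) -> dict:
    """Parse an SL2VL template into {sl: vl}; expects exactly 16 entries (SL 0-15)."""
    vls = [int(v) for v in value.split(",")]
    if len(vls) != 16:
        raise ValueError("SL2VL template must list a VL for each of SLs 0-15")
    return dict(enumerate(vls))


def dropped_sls(sl2vl: dict) -> list:
    """SLs mapped to VL15, which the text above describes as 'drop this SL'."""
    return [sl for sl, vl in sl2vl.items() if vl == 15]


if __name__ == "__main__":
    mapping = parse_sl2vl("0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7")
    print(mapping[15], dropped_sls(mapping))
```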
516 In addition to the above, we may define separate QoS configuration
517 parameters sets for various target types. As targets, we currently support
518 CAs, routers, switch external ports, and switch's enhanced port 0. The
519 names of such specialized parameters are prefixed by "qos_<type>_"
520 string. Here is a full list of the currently supported sets:
522 qos_ca_ - QoS configuration parameters set for CAs.
523 qos_rtr_ - parameters set for routers.
524 qos_sw0_ - parameters set for switches' port 0.
525 qos_swe_ - parameters set for switches' external ports.
529 qos_ca_sl2vl=0,1,2,3,5,5,5,12,12,0,
534 Prefix routes control how the SA responds to path record queries for
535 off-subnet DGIDs. By default, the SA fails such queries.
536 Note that IBA does not specify how the SA should obtain off-subnet path
538 The prefix routes configuration is meant as a stop-gap until the
539 specification is completed.
541 Each line in the configuration file is a 64-bit prefix followed by a
542 64-bit GUID, separated by white space.
543 The GUID specifies the router port on the local subnet that will
545 Blank lines are ignored, as is anything between a \fB#\fP character
546 and the end of the line.
547 The prefix and GUID are both in hex; the leading 0x is optional.
548 Either, or both, can be wild-carded by specifying an
549 asterisk instead of an explicit prefix or GUID.
551 When responding to a path record query for an off-subnet DGID,
552 opensm searches for the first prefix match in the configuration file.
553 Therefore, the order of the lines in the configuration file is important:
554 a wild-carded prefix at the beginning of the configuration file renders
555 all subsequent lines useless.
556 If there is no match, then opensm fails the query.
557 It is legal to repeat prefixes in the configuration file;
558 opensm will return the path to the first available matching router.
559 A configuration file with a single line where both prefix and GUID
560 are wild-carded means that a path record query specifying any
561 off-subnet DGID should return a path to the first available router.
562 This configuration yields the same behaviour formerly achieved by
563 compiling opensm with -DROUTER_EXP.
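The first-match behaviour described above can be sketched in a few lines of Python. This is an interpretation of the text, not the OpenSM implementation: the function names are hypothetical, and the list of "available routers" is assumed to be supplied by the caller as a list of port GUIDs.

```python
def parse_prefix_routes(text: str) -> list:
    """Parse prefix-routes lines into (prefix, guid) tuples; None means wild-card."""
    def field(token):
        return None if token == "*" else int(token, 16)  # leading 0x is optional

    routes = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # strip comments and blank lines
        if line:
            prefix, guid = line.split()
            routes.append((field(prefix), field(guid)))
    return routes


def select_router(routes, dgid_prefix, available_routers):
    """Return the GUID of the first available matching router, or None to fail."""
    for prefix, guid in routes:
        if prefix is not None and prefix != dgid_prefix:
            continue  # prefix does not match this line
        if guid is None:
            if available_routers:
                return available_routers[0]  # wild-card GUID: any router
        elif guid in available_routers:
            return guid
    return None


if __name__ == "__main__":
    conf = "fec0000000000001 0x0002c90200000001  # site router\n* *\n"
    routes = parse_prefix_routes(conf)
    print(select_router(routes, 0xFEC0000000000001, [0x0002C90200000001]))
```

Because the search stops at the first usable match, a line of "* *" placed first would shadow every line after it.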
567 OpenSM now offers five routing engines:
569 1. Min Hop Algorithm - based on the minimum hops to each node where the
570 path length is optimized.
572 2. UPDN Unicast routing algorithm - also based on the minimum hops to each
573 node, but it is constrained to ranking rules. This algorithm should be chosen
574 if the subnet is not a pure Fat Tree, and deadlock may occur due to a
577 3. Fat Tree Unicast routing algorithm - this algorithm optimizes routing
578 for congestion-free "shift" communication pattern.
579 It should be chosen if a subnet is a symmetrical or almost symmetrical
580 fat-tree of various types, not just K-ary-N-Trees: non-constant K, not
581 fully staffed, any Constant Bisectional Bandwidth (CBB) ratio.
582 Similar to UPDN, Fat Tree routing is constrained to ranking rules.
584 4. LASH unicast routing algorithm - uses InfiniBand virtual layers
585 (SL) to provide deadlock-free shortest-path routing while also
586 distributing the paths between layers. LASH is an alternative
587 deadlock-free topology-agnostic routing algorithm to the non-minimal
588 UPDN algorithm avoiding the use of a potentially congested root node.
590 5. DOR Unicast routing algorithm - based on the Min Hop algorithm, but
591 avoids port equalization except for redundant links between the same
592 two switches. This provides deadlock free routes for hypercubes when
593 the fabric is cabled as a hypercube and for meshes when cabled as a
594 mesh (see details below).
596 OpenSM also supports a file method which
597 can load routes from a table. See \'Modular Routing Engine\' for more
600 The basic routing algorithm is comprised of two stages:
602 1. MinHop matrix calculation
603 How many hops are required to get from each port to each LID?
604 The algorithm to fill these tables is different if you run standard
605 (min hop) or Up/Down.
606 For standard routing, a "relaxation" algorithm is used to propagate
607 min hop from every destination LID through neighbor switches.
608 For Up/Down routing, a BFS from every target is used. The BFS tracks link
609 direction (up or down) and avoids steps that will perform up after a down
612 2. Once MinHop matrices exist, each switch is visited and for each target LID a
613 decision is made as to what port should be used to get to that LID.
614 This step is common to standard and Up/Down routing. Each port has a
615 counter counting the number of target LIDs going through it.
616 When there are multiple alternative ports with same MinHop to a LID,
617 the one with fewer previously assigned ports is selected.
618 If LMC > 0, more checks are added: Within each group of LIDs assigned to
620 a. use only ports which have same MinHop
621 b. first prefer the ones that go to different systemImageGuid (then
622 the previous LID of the same LMC group)
623 c. if none - prefer those which go through another NodeGuid
624 d. fall back to the number of paths method (if all go to same node).
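The second stage can be illustrated for the LMC = 0 case with a short Python sketch. It assumes the MinHop table of one switch is given as a dictionary mapping each target LID to the hop count through each port; the function name and data layout are hypothetical, and the extra LMC > 0 checks (a to d above) are not modelled.

```python
def assign_output_ports(min_hops: dict) -> dict:
    """Pick an output port per target LID for one switch (LMC = 0 sketch).

    min_hops maps lid -> {port: hop_count}. Among the ports with the minimum
    hop count, the port with the fewest previously assigned LIDs is chosen.
    """
    load = {}  # per-port counter of target LIDs routed through it
    lft = {}   # resulting linear forwarding table: lid -> port
    for lid in sorted(min_hops):
        hops = min_hops[lid]
        best = min(hops.values())
        candidates = [port for port, h in hops.items() if h == best]
        # Ties on the counter are broken by the lower port number (assumption).
        port = min(candidates, key=lambda p: (load.get(p, 0), p))
        lft[lid] = port
        load[port] = load.get(port, 0) + 1
    return lft


if __name__ == "__main__":
    table = {1: {1: 1, 2: 1}, 2: {1: 1, 2: 1}, 3: {1: 2, 2: 1}}
    print(assign_output_ports(table))
```

In this example LIDs 1 and 2 are spread across ports 1 and 2, while LID 3 must use port 2 because only that port offers the minimum hop count.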
626 Effect of Topology Changes
628 OpenSM will preserve existing routing in any case where there is no change in
629 the fabric switches unless the -r (--reassign_lids) option is specified.
634 This option causes OpenSM to reassign LIDs to all
635 end nodes. Specifying -r on a running subnet
636 may disrupt subnet traffic.
637 Without -r, OpenSM attempts to preserve existing
638 LID assignments resolving multiple use of same LID.
640 If a link is added or removed, OpenSM does not recalculate
641 the routes that do not have to change. A route has to change
642 if the port is no longer UP or no longer the MinHop. When routing changes
643 are performed, the same algorithm for balancing the routes is invoked.
645 In the case of using the file based routing, any topology changes are
646 currently ignored. The 'file' routing engine just loads the LFTs from the file
647 specified, with no reaction to real topology. Obviously, this will not be able
648 to recheck LIDs (by GUID) for disconnected nodes, and LFTs for non-existent
649 switches will be skipped. Multicast is not affected by 'file' routing engine
650 (this uses min hop tables).
655 The Min Hop algorithm is invoked by default if no routing algorithm is
656 specified. It can also be invoked by specifying '-R minhop'.
658 The Min Hop algorithm is divided into two stages: computation of
659 min-hop tables on every switch and LFT output port assignment. Link
660 subscription is also equalized with the ability to override based on
661 port GUID. The latter is supplied by:
663 -i <equalize-ignore-guids-file>
665 -ignore-guids <equalize-ignore-guids-file>
666 This option provides the means to define a set of ports
667 (by guid) that will be ignored by the link load
668 equalization algorithm. Note that only endports (CA,
669 switch port 0, and router ports) and not switch external
672 LMC-aware routing is performed on a (remote) system or switch basis.
675 Purpose of UPDN Algorithm
677 The UPDN algorithm is designed to prevent deadlocks from occurring in loops
678 of the subnet. A loop-deadlock is a situation in which it is no longer
679 possible to send data between any two hosts connected through the loop. As
680 such, the UPDN routing algorithm should be used if the subnet is not a pure
681 Fat Tree, and one of its loops may experience a deadlock (due, for example,
684 The UPDN algorithm is based on the following main stages:
686 1. Auto-detect root nodes - based on the CA hop length from any switch in
687 the subnet, a statistical histogram is built for each switch (hop num vs
688 number of occurrences). If the histogram reflects a specific column (higher
689 than others) for a certain node, then it is marked as a root node. Since
690 the algorithm is statistical, it may not find any root nodes. The list of
691 the root nodes found by this auto-detect stage is used by the ranking
694 Note 1: The user can override the node list manually.
695 Note 2: If this stage cannot find any root nodes, and the user did
696 not specify a guid list file, OpenSM defaults back to the
697 Min Hop routing algorithm.
699 2. Ranking process - All root switch nodes (found in stage 1) are assigned
700 a rank of 0. Using the BFS algorithm, the rest of the switch nodes in the
701 subnet are ranked incrementally. This ranking aids in the process of enforcing
702 rules that ensure loop-free paths.
704 3. Min Hop Table setting - after ranking is done, a BFS algorithm is run from
705 each (CA or switch) node in the subnet. During the BFS process, the FDB table
706 of each switch node traversed by BFS is updated, in reference to the starting
707 node, based on the ranking rules and guid values.
709 At the end of the process, the updated FDB tables ensure loop-free paths
712 Note: Up/Down routing does not allow LID routing communication between
713 switches that are located inside spine "switch systems".
714 The reason is that there is no way to allow a LID route between them
715 that does not break the Up/Down rule.
716 One ramification of this is that you cannot run SM on switches other
717 than the leaf switches of the fabric.
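The ranking stage and the Up/Down rule can be sketched as follows. This is a simplified illustration, not the OpenSM code: the switch graph is assumed to be an adjacency dictionary, the function names are hypothetical, and ties between switches of equal rank are broken by comparing node identifiers, standing in for the guid comparison mentioned in stage 3.

```python
from collections import deque


def rank_switches(adjacency: dict, roots: list) -> dict:
    """BFS ranking: roots get rank 0, other switches their BFS distance to a root."""
    rank = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in rank:
                rank[neighbor] = rank[node] + 1
                queue.append(neighbor)
    return rank


def is_valid_updn_path(path: list, rank: dict) -> bool:
    """A path is legal if it never goes up (toward a root) after going down."""
    gone_down = False
    for src, dst in zip(path, path[1:]):
        going_up = (rank[dst], dst) < (rank[src], src)  # id tie-break is an assumption
        if going_up and gone_down:
            return False
        if not going_up:
            gone_down = True
    return True


if __name__ == "__main__":
    fabric = {"r1": ["a", "b"], "r2": ["a", "b"], "a": ["r1", "r2"], "b": ["r1", "r2"]}
    ranks = rank_switches(fabric, ["r1", "r2"])
    print(ranks, is_valid_updn_path(["a", "r1", "b"], ranks))
```

In this two-root example the path a, r1, b is legal (up then down), whereas r1, a, r2 goes down and then up and is rejected, which matches the note that spine switches cannot reach each other by LID routing.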
722 Activation through OpenSM
724 Use '-R updn' option (instead of old '-u') to activate the UPDN algorithm.
725 Use '-a <root_guid_file>' for adding an UPDN guid file that contains the
726 root nodes for ranking.
727 If the `-a' option is not used, OpenSM uses its auto-detect root nodes
730 Notes on the guid list file:
732 1. A valid guid file specifies one guid in each line. Lines with an invalid
733 format will be discarded.
735 2. The user should specify the root switch guids. However, it is also
736 possible to specify CA guids; OpenSM will use the guid of the switch (if
737 it exists) that connects the CA to the subnet as a root node.
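A guid list file of this kind can be validated before handing it to OpenSM. The sketch below assumes each line holds a single hexadecimal guid with an optional 0x prefix; the exact syntax OpenSM accepts is not stated here, and the function name is hypothetical.

```python
def read_guid_lines(lines) -> list:
    """Return the guids found one per line, discarding lines with an invalid format."""
    guids = []
    for line in lines:
        token = line.strip()
        try:
            guids.append(int(token, 16))  # accepts an optional leading 0x
        except ValueError:
            continue  # blank or malformed line: discard
    return guids


if __name__ == "__main__":
    sample = ["0x0002c902000000a1\n", "not-a-guid\n", "\n", "0002c902000000b2\n"]
    print([hex(g) for g in read_guid_lines(sample)])
```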
740 Fat-tree Routing Algorithm
742 The fat-tree algorithm optimizes routing for "shift" communication pattern.
743 It should be chosen if a subnet is a symmetrical or almost symmetrical
744 fat-tree of various types.
745 It supports not just K-ary-N-Trees, but also handles non-constant K,
746 cases where not all leaves (CAs) are present, and any CBB ratio.
747 As in UPDN, fat-tree also prevents credit-loop-deadlocks.
749 If the root guid file is not provided ('-a' or '--root_guid_file' options),
750 the topology has to be pure fat-tree that complies with the following rules:
751 - Tree rank should be between two and eight (inclusively)
752 - Switches of the same rank should have the same number
753 of UP-going port groups*, unless they are root switches,
754 in which case they shouldn't have UP-going ports at all.
755 - Switches of the same rank should have the same number
756 of DOWN-going port groups, unless they are leaf switches.
757 - Switches of the same rank should have the same number
758 of ports in each UP-going port group.
759 - Switches of the same rank should have the same number
760 of ports in each DOWN-going port group.
761 - All the CAs have to be at the same tree level (rank).
763 If the root guid file is provided, the topology doesn't have to be pure
764 fat-tree, and it should only comply with the following rules:
765 - Tree rank should be between two and eight (inclusively)
766 - All the Compute Nodes** have to be at the same tree level (rank).
767 Note that non-compute node CAs are allowed here to be at different
770 * ports that are connected to the same remote switch are referenced as
773 ** list of compute nodes (CNs) can be specified by \'-u\' or \'--cn_guid_file\'
776 Topologies that do not comply cause a fallback to min hop routing.
777 Note that this can also occur on link failures which cause the topology
778 to no longer be "pure" fat-tree.
780 Note that although fat-tree algorithm supports trees with non-integer CBB
781 ratio, the routing will not be as balanced as in case of integer CBB ratio.
782 In addition to this, although the algorithm allows leaf switches to have any
783 number of CAs, the closer the tree is to being fully populated, the more
784 effective the "shift" communication pattern will be.
785 In general, even if the root list is provided, the closer the topology to a
786 pure and symmetrical fat-tree, the more optimal the routing will be.
788 The algorithm also dumps compute node ordering file (opensm-ftree-ca-order.dump)
789 in the same directory where the OpenSM log resides. This ordering file provides
790 the CN order that may be used to create efficient communication pattern, that
791 will match the routing tables.
793 Activation through OpenSM
795 Use '-R ftree' option to activate the fat-tree algorithm.
796 Use '-a <root_guid_file>' to provide root nodes for ranking. If the `-a' option
797 is not used, routing algorithm will detect roots automatically.
798 Use '-u <root_cn_file>' to provide the list of compute nodes. If the `-u' option
799 is not used, all the CAs are considered as compute nodes.
801 Note: LMC > 0 is not supported by fat-tree routing. If this is
802 specified, the default routing algorithm is invoked instead.
805 LASH Routing Algorithm
807 LASH is an acronym for LAyered SHortest Path Routing. It is a
808 deterministic shortest path routing algorithm that enables topology
809 agnostic deadlock-free routing within communication networks.
811 When computing the routing function, LASH analyzes the network
812 topology for the shortest-path routes between all pairs of sources /
813 destinations and groups these paths into virtual layers in such a way
814 as to avoid deadlock.
816 Note that LASH analyzes routes and ensures deadlock freedom between switch
817 pairs. The link between an HCA and a switch does not need virtual
818 layers, as deadlock will not arise between a switch and an HCA.
820 In more detail, the algorithm works as follows:
822 1) LASH determines the shortest path between all pairs of source /
823 destination switches. Note that LASH ensures the same SL is used for all
824 SRC/DST - DST/SRC pairs, but there is no guarantee that the return
825 path for a given DST/SRC will be the reverse of the route SRC/DST.
827 2) LASH then begins an SL assignment process where a route is assigned
828 to a layer (SL) if the addition of that route does not cause deadlock
829 within that layer. This is achieved by maintaining and analysing a
830 channel dependency graph for each layer. Once the potential addition
831 of a path could lead to deadlock, LASH opens a new layer and continues
834 3) Once this stage has been completed, it is highly likely that the
835 first layers processed will contain more paths than the later ones.
836 To better balance the use of layers, LASH moves paths from one layer
837 to another so that the number of paths in each layer averages out.
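The layer assignment in step 2 can be sketched as follows. This is an illustrative model, not the OpenSM implementation: the function names and the route representation (a list of switch names) are hypothetical, the sketch places each route in the first layer that stays acyclic (a first-fit assumption), and the balancing of step 3 is left out.

```python
def route_dependencies(path):
    """Channel dependencies of a route: each directed link depends on the next."""
    channels = list(zip(path, path[1:]))
    return set(zip(channels, channels[1:]))


def has_cycle(edges):
    """Depth-first cycle detection on a directed graph given as (src, dst) edges."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    WHITE, GREY, BLACK = 0, 1, 2
    colour = dict.fromkeys(graph, WHITE)

    def visit(node):
        colour[node] = GREY
        for nxt in graph[node]:
            if colour[nxt] == GREY:
                return True  # back edge: the dependency graph has a cycle
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in graph)


def assign_layers(routes):
    """Put each route in the first layer whose dependency graph stays acyclic."""
    layers = []  # each layer keeps its routes and its accumulated dependencies
    for path in routes:
        deps = route_dependencies(path)
        for layer in layers:
            if not has_cycle(layer["deps"] | deps):
                layer["routes"].append(path)
                layer["deps"] |= deps
                break
        else:
            # No existing layer can take the route without risking deadlock.
            layers.append({"routes": [path], "deps": set(deps)})
    return [layer["routes"] for layer in layers]


# Four clockwise two-hop routes on a ring of switches A, B, C, D.
ring_routes = [["A", "B", "C"], ["B", "C", "D"], ["C", "D", "A"], ["D", "A", "B"]]
layers = assign_layers(ring_routes)
```

In this ring example the first three routes share one layer, and the fourth, which would close the dependency cycle, is placed in a second layer.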
839 Note, the implementation of LASH in opensm attempts to use as few layers
840 as possible. This number can be less than the number of actual layers
843 In general, LASH is a very flexible algorithm. It can, for example,
844 reduce to Dimension Order Routing in certain topologies. It is topology
845 agnostic and fares well in the face of faults.
847 It has been shown that for both regular and irregular topologies, LASH
848 outperforms Up/Down. The reason for this is that LASH distributes the
849 traffic more evenly through a network, avoiding the bottleneck issues
850 related to a root node, and always routes along shortest paths.
852 The algorithm was developed by Simula Research Laboratory.
855 Use the '-R lash -Q' option to activate the LASH algorithm.
857 Note: QoS support has to be turned on in order that SL/VL mappings are
860 Note: LMC > 0 is not supported by LASH routing. If this is
861 specified, the default routing algorithm is invoked instead.
864 DOR Routing Algorithm
866 The Dimension Order Routing algorithm is based on the Min Hop
867 algorithm and so uses shortest paths. Instead of spreading traffic
868 out across different paths with the same shortest distance, it chooses
869 among the available shortest paths based on an ordering of dimensions.
870 Each port must be consistently cabled to represent a hypercube
871 dimension or a mesh dimension. Paths are grown from a destination
872 back to a source using the lowest dimension (port) of available paths
873 at each step. This provides the ordering necessary to avoid deadlock.
874 When there are multiple links between any two switches, they still
875 represent only one dimension and traffic is balanced across them
876 unless port equalization is turned off. In the case of hypercubes,
877 the same port must be used throughout the fabric to represent the
878 hypercube dimension and match on both ends of the cable. In the case
879 of meshes, the dimension should consistently use the same pair of
880 ports, one port on one end of the cable, and the other port on the
881 other end, continuing along the mesh dimension.
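The port selection described above can be sketched in a simplified form. The data model is an assumption made for illustration: links maps each switch name to a dict of port number to neighbouring switch, and the function names are hypothetical. The sketch ignores balancing across multiple links between the same pair of switches.

```python
from collections import deque


def hops_to(links, dest):
    """Breadth-first hop count from every reachable switch to dest."""
    dist = {dest: 0}
    queue = deque([dest])
    while queue:
        sw = queue.popleft()
        for neighbour in links[sw].values():
            if neighbour not in dist:
                dist[neighbour] = dist[sw] + 1
                queue.append(neighbour)
    return dist


def dor_out_ports(links, dest):
    """For each switch, pick the lowest-numbered port on a shortest path to dest."""
    dist = hops_to(links, dest)
    table = {}
    for sw, ports in links.items():
        if sw == dest or sw not in dist:
            continue
        # Only ports leading one hop closer to dest are on a shortest path.
        table[sw] = min(p for p, nb in ports.items() if dist.get(nb) == dist[sw] - 1)
    return table


# 2-dimensional hypercube: port 1 flips the first bit, port 2 flips the second.
cube = {
    "00": {1: "10", 2: "01"},
    "10": {1: "00", 2: "11"},
    "01": {1: "11", 2: "00"},
    "11": {1: "01", 2: "10"},
}
out_ports = dor_out_ports(cube, "11")
```

Switch "00" has two shortest paths to "11" and the sketch picks port 1, the lower dimension, so every route towards "11" resolves dimensions in the same order.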
883 Use the '-R dor' option to activate the DOR algorithm.
888 To learn more about deadlock-free routing, see the article
889 "Deadlock Free Message Routing in Multiprocessor Interconnection Networks"
890 by William J Dally and Charles L Seitz (1985).
892 To learn more about the up/down algorithm, see the article
893 "Effective Strategy to Compute Forwarding Tables for InfiniBand Networks"
894 by Jose Carlos Sancho, Antonio Robles, and Jose Duato at the
895 Universidad Politecnica de Valencia.
897 To learn more about LASH and the flexibility behind it, the requirement
898 for layers, performance comparisons to other algorithms, see the
901 "Layered Routing in Irregular Networks", Lysne et al., IEEE
902 Transactions on Parallel and Distributed Systems, Vol. 16, No. 12,
905 "Routing for the ASI Fabric Manager", Solheim et al., IEEE
906 Communications Magazine, Vol. 44, No. 7, July 2006.
908 "Layered Shortest Path (LASH) Routing in Irregular System Area
909 Networks", Skeie et al., IEEE Computer Society Communication
910 Architecture for Clusters 2002.
913 Modular Routing Engine
915 The modular routing engine structure makes it easy to
916 "plug in" new routing modules.
918 Currently, only unicast callbacks are supported. Multicast
921 One existing routing module is up-down ("updn"), which may be
922 activated with the '-R updn' option (instead of the old '-u').
925 $ opensm -R 'module-name'
927 There is also a trivial routing module which is able
928 to load LFTs from a file.
932 - this will load switch LFTs and/or LID matrices (min hops tables)
933 - this will load switch LFTs according to the path entries introduced
935 - no additional checks will be performed (such as "is port connected",
937 - if fabric LIDs have changed, this will try to reconstruct
938 LFTs correctly if endport GUIDs are represented in the file
939 (in order to disable this, GUIDs may be removed from the file
942 The file format is compatible with the output of the 'ibroute' utility, and a file
943 for the whole fabric can be generated with the dump_lfts.sh script.
945 To activate the file-based routing module, use:
947 opensm -R file -U /path/to/lfts_file
949 If the lfts_file is not found or contains errors, the default routing
950 algorithm is used.
952 The ability to dump switch lid matrices (aka min hops tables) to a file and
953 to load them later is also supported.
955 The usage is similar to loading unicast forwarding tables from an lfts
956 file (introduced by the 'file' routing engine), but the lid matrix file
957 name should be specified with the -M or --lid_matrix_file option. For example:
959 opensm -R file -M ./opensm-lid-matrix.dump
961 The dump file is named \'opensm-lid-matrix.dump\' and is generated
962 in the standard opensm dump directory (/var/log by default) when the
963 OSM_LOG_ROUTING logging flag is set.
965 When the 'file' routing engine is activated but the lfts file is not specified
966 or cannot be opened, the default lid matrix algorithm will be used.
968 There is also a switch forwarding tables dumper which generates
969 a file compatible with dump_lfts.sh output. This file can be used
970 as input for forwarding tables loading by 'file' routing engine.
971 Either or both of the -U and -M options can be specified together with \'-R file\'.
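As a small illustration of how these options combine, the sketch below builds an opensm argument list for the 'file' routing engine. The helper name file_routing_argv and its parameters are hypothetical; only the -R, -U and -M options come from this page.

```python
def file_routing_argv(lfts_file=None, lid_matrix_file=None):
    """Build an opensm argument list for the 'file' routing engine.

    Both file arguments are optional, mirroring the text: either or both
    of -U and -M may accompany '-R file'.
    """
    argv = ["opensm", "-R", "file"]
    if lfts_file:
        argv += ["-U", lfts_file]  # switch forwarding tables (LFTs)
    if lid_matrix_file:
        argv += ["-M", lid_matrix_file]  # lid matrices (min hops tables)
    return argv


both = file_routing_argv("/path/to/lfts_file", "./opensm-lid-matrix.dump")
```

The resulting list could be passed to a process launcher such as subprocess.run on a host where opensm is installed.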
975 .B @OPENSM_CONFIG_DIR@/@OPENSM_CONFIG_FILE@
976 default OpenSM config file.
979 .B @OPENSM_CONFIG_DIR@/@NODENAMEMAPFILE@
980 default node name map file. See ibnetdiscover for more information on format.
983 .B @OPENSM_CONFIG_DIR@/@PARTITION_CONFIG_FILE@
984 default partition config file.
987 .B @OPENSM_CONFIG_DIR@/@QOS_POLICY_FILE@
988 default QoS policy config file.
991 .B @OPENSM_CONFIG_DIR@/@PREFIX_ROUTES_FILE@
992 default prefix routes file.
997 .RI < hal.rosenstock@gmail.com >
1000 .RI < sashak@voltaire.com >
1003 .RI < eitan@mellanox.co.il >
1006 .RI < kliteyn@mellanox.co.il >
1009 .RI < tsodring@simula.no >
1012 .RI < weiny2@llnl.gov >