contrib/ofed/management/opensm/doc/perf-manager-arch.txt

Performance Manager
2/12/07

This document describes an architecture and a phased plan
for an OpenFabrics (OpenIB) performance manager.

Currently, there is no open source performance manager, only
a perfquery diagnostic tool which some have scripted into a
"poor man's" performance manager.

The primary responsibilities of the performance manager are to:
1. Monitor subnet topology
2. Based on subnet topology, monitor performance and error counters,
   and possibly counters related to congestion
3. Perform data reduction (calculations such as rates and histograms)
   on the counters obtained
4. Log performance data and indicate "interesting" related events


Performance Manager Components
1. Determine subnet topology
   The performance manager can determine the subnet topology by subscribing
   for GID in and out of service events. Upon receipt of a GID in service
   event, it uses the GID to query the SA for the corresponding LID by
   issuing SubnAdmGet(NodeRecord) with PortGUID specified. The LID and
   NumPorts returned are then added to the monitoring list. Note that the
   monitoring list can be extended to be distributed, with the manager
   "balancing" the assignment of new GIDs across the set of known monitors.
   For GID out of service events, the GID is removed from the monitoring
   list.
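
   The event handling above can be sketched as follows. This is a minimal
   illustration in Python, not OpenSM code: MonitoringList, MonitoredPort
   and the lookup callable are hypothetical names, with lookup standing in
   for the SubnAdmGet(NodeRecord) query. It also assumes the PortGUID is
   carried in the low 64 bits of the GID.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

GUID_MASK = 0xFFFFFFFFFFFFFFFF  # assumption: low 64 bits of the GID hold the PortGUID


@dataclass
class MonitoredPort:
    lid: int
    num_ports: int


# Hypothetical stand-in for SubnAdmGet(NodeRecord) keyed by PortGUID;
# returns (LID, NumPorts) for the port that came into service.
NodeRecordLookup = Callable[[int], Tuple[int, int]]


class MonitoringList:
    def __init__(self, lookup: NodeRecordLookup) -> None:
        self._lookup = lookup
        self._ports: Dict[int, MonitoredPort] = {}

    def gid_in_service(self, gid: int) -> None:
        """Resolve the GID to (LID, NumPorts) and start monitoring it."""
        port_guid = gid & GUID_MASK
        lid, num_ports = self._lookup(port_guid)
        self._ports[port_guid] = MonitoredPort(lid, num_ports)

    def gid_out_of_service(self, gid: int) -> None:
        """Stop monitoring; unknown GIDs are ignored."""
        self._ports.pop(gid & GUID_MASK, None)

    def ports(self) -> Dict[int, MonitoredPort]:
        """Return a copy of the current monitoring list."""
        return dict(self._ports)
```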

2. Monitoring
   Counters to be monitored include performance counters (data octets and
   packets, both receive and transmit) and error counters. These are all in
   the mandatory PortCounters attribute. Support for the optional 64 bit
   counters (PortCountersExtended) is deferred to a future version, as this
   attribute is currently known to be supported on only one IB device. One
   congestion counter (PortXmitWait) will also be monitored initially, on
   switch ports.

   Polling rather than sampling will be used as the monitoring technique. The
   polling interval is configurable from 1 to 65535 seconds (default TBD).
   Note that with 32 bit counters, byte counts can max out in 16 seconds on
   4x SDR links and in 8 seconds on 4x DDR links. The polling interval needs
   to account for this if accurate byte and packet rates are desired. Since
   IB counters are sticky (they stop at their maximum value rather than
   wrapping), they need to be reset when they get "close" to maxing out.
   This will result in some inaccuracy. When counters are reset, the time of
   the reset will be tracked in the monitor and will be queryable. Note that
   when the 64 bit counters are supported more generally, polling can be
   less frequent.
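
   As a rough check of the figures above, the time for a 32 bit data counter
   to saturate can be estimated as below. The sketch assumes that
   PortXmitData/PortRcvData count in units of 4 octets and uses the 4x data
   rates after 8b/10b encoding (8 Gb/s for SDR, 16 Gb/s for DDR). The 75%
   reset threshold is purely an illustrative assumption; this document does
   not define "close".

```python
COUNTER_MAX = 2**32 - 1   # a sticky 32-bit counter stops at this value
OCTETS_PER_COUNT = 4      # assumption: data counters count 4-octet units


def seconds_to_saturate(data_rate_gbps: float) -> float:
    """Time for a data counter to go from 0 to COUNTER_MAX at full line rate."""
    bytes_per_second = data_rate_gbps * 1e9 / 8
    return COUNTER_MAX * OCTETS_PER_COUNT / bytes_per_second


def needs_reset(counter_value: int, fraction: float = 0.75) -> bool:
    """True once the counter has used up `fraction` of its range (assumed policy)."""
    return counter_value >= fraction * COUNTER_MAX


sdr_4x = seconds_to_saturate(8.0)    # 4x SDR
ddr_4x = seconds_to_saturate(16.0)   # 4x DDR
print(f"4x SDR: {sdr_4x:.1f} s, 4x DDR: {ddr_4x:.1f} s")
```

   Under these assumptions the estimates come out at roughly 17 and 8.6
   seconds, in line with the 16 and 8 second figures quoted above.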

   The performance manager will support parallel queries. The level of
   parallelism is configurable, with a default of 64 queries outstanding
   at one time.
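
   One way to bound the number of outstanding queries is sketched below.
   A thread pool is used here only for illustration; an actual
   implementation would more likely issue asynchronous MADs. poll_ports and
   the query callable are hypothetical names, with query standing in for a
   PortCounters Get sent to a single LID.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def poll_ports(lids: Iterable[int],
               query: Callable[[int], dict],
               max_outstanding: int = 64) -> Dict[int, dict]:
    """Query every LID, with at most `max_outstanding` queries in flight."""
    lid_list = list(lids)
    with ThreadPoolExecutor(max_workers=max_outstanding) as pool:
        # map() preserves input order, so results line up with lid_list.
        results = list(pool.map(query, lid_list))
    return dict(zip(lid_list, results))
```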

   Configuration and dynamic adjustment of any performance manager "knobs"
   will be supported.

   There will also be a console interface to obtain performance data.
   It will be able to reset counters and report on specific nodes or
   node types of interest (CAs only, switches only, all, ...). The
   specifics are TBD.

3. Data Reduction
   For errors, a rate rather than the raw value will be calculated. An error
   event is indicated only when the rate exceeds a threshold.
   For packet and byte counters, small changes will be aggregated
   and only significant changes will result in an update.
   Aggregated histograms (per node and possibly across all nodes; this is
   TBD) will be provided for each counter. Actual counters will also be
   written to files. NodeGUID will be used to identify the node. File
   formats are TBD. One format to be supported might be CSV.
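
   The reduction rules above might look like the following sketch. All
   names and the min_delta parameter are hypothetical, and the rate
   calculation assumes the counter was neither reset nor saturated between
   the two polls.

```python
from typing import Optional


def counter_rate(previous: int, current: int, interval_s: float) -> float:
    """Per-second rate between two polls of the same counter."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return (current - previous) / interval_s


def error_event(previous: int, current: int, interval_s: float,
                threshold_per_s: float) -> bool:
    """Indicate an error event only when the rate exceeds the threshold."""
    return counter_rate(previous, current, interval_s) > threshold_per_s


class AggregatedCounter:
    """Swallows small changes; reports a value only after a significant move."""

    def __init__(self, min_delta: int) -> None:
        self.min_delta = min_delta
        self.reported: Optional[int] = None

    def update(self, value: int) -> Optional[int]:
        """Return the value if it should be reported, otherwise None."""
        if self.reported is None or abs(value - self.reported) >= self.min_delta:
            self.reported = value
            return value
        return None
```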

4. Logging
   "Interesting" events determined by the performance manager will be
   logged, as will the performance data itself. Significant events
   will be logged to syslog. Logging raises some scalability issues,
   especially for the distributed model.

   Events will be based on rates compared against configured thresholds.
   There will be configurable thresholds for the error counters, with
   reasonable defaults. Correlation of PerfManager and SM events is
   of interest but not a mandatory requirement.


Performance Manager Scalability
Clearly, as the polling rate goes up, the number of nodes which can be
monitored from a single performance management node decreases. There is
some evidence that a single dedicated management node may not be able to
monitor the largest clusters at a rapid rate.

There are numerous PerfManager models which can be supported:
1. Integrated as thread(s) with OpenSM (run only when SM is master)
2. Standby SM
3. Standalone PerfManager (not running with master or standby SM)
4. Distributed PerfManager (most scalable approach)

Note that these models are listed in order of increasing implementation
complexity and hence of "schedule".

The simplest model is to run the PerfManager with the master SM. It is
also the least scalable. Note that in this model the topology can be
obtained without the GID in and out of service events, but those events
are needed to support any of the other models.

The next model is to run the PerfManager with a standby SM. Standbys are not
doing much currently (polling the master), so there is much idle CPU.
The downside of this approach is that if the standby takes over as master,
the PerfManager would need to be moved (or it becomes model 1).

A totally separate standalone PerfManager would allow for a deployment
model which eliminates the downside of model 2 (standby SM). It could
still be built in a similar manner to model 2, with unneeded functions
(SM and SA) not included. The advantage of this model is that it could
be more readily usable with a vendor specific SM (switch based or otherwise).
Vendor specific SMs usually come with a built-in performance manager, so
this assumes that there would be a way to disable that performance manager.
Model 2 can act like model 3 if a disable SM feature is supported in OpenSM
(command line/console). This would take the SM to the not active state.

The most scalable model is a distributed PerfManager. One approach to
distribution is a hierarchical model where there is a PerfManager at the
top level with a number of PerfMonitors, each responsible for some
portion of the subnet.

The separation of PerfManager from OpenSM brings up the following additional
issues:
1. What communication is needed between OpenSM and the PerfManager?
2. Integration of interesting events with the OpenSM log
(Does the performance manager assume OpenSM? Does it need to work with vendor
SMs?)

Hierarchical distribution brings up some additional issues:
1. How is the hierarchy determined?
2. How do the PerfManager and PerfMonitors find each other?
3. How is the subnet divided amongst the PerfMonitors?
4. Communication amongst the PerfManager and the PerfMonitors
   (including communication failures)

In terms of inter-manager communication, there seem to be several
choices:
1. Use vendor specific MADs (which can be RMPP'd) and build on top of
   this
2. Use RC QP communication and build on top of this
3. Use IPoIB, which is much more powerful as sockets can then be utilized

RC QP communication improves on the lower performance of the vendor
specific MAD approach but is not as powerful as the socket based approach.

The only downside of IPoIB is that it requires multicast to be functioning.
It seems reasonable to require IPoIB across the management nodes. This
can either be a separate IPoIB subnet or one shared with other endnodes
on the subnet. (If this communication is built on top of sockets, it
can be any IP subnet amongst the manager nodes.)

The first implementation phase will address models 1-3. Model 3 is optional
as it is similar to models 1 and 2 and may not be needed.

Model 4 will be addressed in a subsequent implementation phase (and a future
version of this document). Model 4 can be built on the basis of models 1 and
2, where some SM, not necessarily the master, is the PerfManager and the rest
are PerfMonitors.


Performance Manager Partition Membership
Note that as the performance manager needs to talk via GSI to the PMAs
in all the end nodes, and GSI utilizes PKey sharing, partition membership,
if used, must account for this.

The most straightforward deployment of the performance manager is
to have it be a full member of the default partition (P_Key 0xFFFF).
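
As an illustration of why full membership is the simple choice, the sketch
below applies the usual P_Key matching rule: the top bit is the membership
type (1 = full, 0 = limited), the low 15 bits are the partition key, and two
P_Keys match when the keys are equal and at least one side is a full member.
The function names are hypothetical.

```python
PKEY_MEMBERSHIP_BIT = 0x8000  # 1 = full member, 0 = limited member
PKEY_BASE_MASK = 0x7FFF       # low 15 bits identify the partition


def is_full_member(pkey: int) -> bool:
    return bool(pkey & PKEY_MEMBERSHIP_BIT)


def pkeys_match(a: int, b: int) -> bool:
    """Same partition, and at least one of the two is a full member."""
    same_partition = (a & PKEY_BASE_MASK) == (b & PKEY_BASE_MASK)
    return same_partition and (is_full_member(a) or is_full_member(b))


# A manager holding 0xFFFF can reach both full and limited members of the
# default partition; two limited members cannot talk to each other.
print(pkeys_match(0xFFFF, 0x7FFF), pkeys_match(0x7FFF, 0x7FFF))
```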


Performance Manager Redundancy
TBD (future version of this document)


Congestion Management
TBD (future version of this document)


QoS Management
TBD (future version of this document)
 180 QoS Management
 181 TBD (future version of this document)