From patchwork Fri Nov 12 22:11:22 2010
X-Patchwork-Submitter: Jim Schutt
X-Patchwork-Id: 321502
From: "Jim Schutt"
To: sashak@voltaire.com
Cc: linux-rdma@vger.kernel.org, "Jim Schutt"
Subject: [PATCH 13/13] opensm/doc/current-routing.txt: Sync torus-2QoS information with new man pages.
Date: Fri, 12 Nov 2010 15:11:22 -0700
Message-ID: <1289599882-15165-14-git-send-email-jaschut@sandia.gov>
X-Mailer: git-send-email 1.6.2.2
In-Reply-To: <1289599882-15165-1-git-send-email-jaschut@sandia.gov>
References: <1289599882-15165-1-git-send-email-jaschut@sandia.gov>
Sender: linux-rdma-owner@vger.kernel.org
X-Mailing-List: linux-rdma@vger.kernel.org

diff --git a/opensm/doc/current-routing.txt b/opensm/doc/current-routing.txt
index 4eaf861..5048c55 100644
--- a/opensm/doc/current-routing.txt
+++ b/opensm/doc/current-routing.txt
@@ -399,7 +399,7 @@ Use '-R dor' option to activate the DOR algorithm.
 Torus-2QoS Routing Algorithm
 ----------------------------

-Torus-2QoS is routing algorithm designed for large-scale 2D/3D torus fabrics.
+Torus-2QoS is a routing algorithm designed for large-scale 2D/3D torus fabrics.
The torus-2QoS routing engine can provide the following functionality on a 2D/3D torus: - routing that is free of credit loops @@ -411,6 +411,8 @@ a 2D/3D torus: - very short run times, with good scaling properties as fabric size increases +Unicast Routing: + Torus-2QoS is a DOR-based algorithm that avoids deadlocks that would otherwise occur in a torus using the concept of a dateline for each torus dimension. It encodes into a path SL which datelines the path crosses as follows: @@ -423,17 +425,18 @@ It encodes into a path SL which datelines the path crosses as follows: For a 3D torus, that leaves one SL bit free, which torus-2QoS uses to implement two QoS levels. -This is possible because torus-2QoS also makes use of the output port -dependence of the switch SL2VL maps. It computes in which torus coordinate -direction each interswitch link "points", and writes SL2VL maps for such -ports as follows: +Torus-2QoS also makes use of the output port dependence of switch SL2VL +maps to encode into one VL bit the information encoded in three SL bits. +It computes in which torus coordinate direction each inter-switch link +"points", and writes SL2VL maps for such ports as follows: for (sl = 0; sl < 16; sl ++) /* cdir(port) reports which torus coordinate direction a switch port * "points" in, and returns 0, 1, or 2 */ sl2vl(iport,oport,sl) = 0x1 & (sl >> cdir(oport)); -Thus torus-2QoS consumes 8 SL values (SL bits 0-2) and 2 VL values (VL bit 0) +Thus, on a pristine 3D torus, i.e., in the absence of failed fabric switches, +torus-2QoS consumes 8 SL values (SL bits 0-2) and 2 VL values (VL bit 0) per QoS level to provide deadlock-free routing on a 3D torus. Torus-2QoS routes around link failure by "taking the long way around" any @@ -454,7 +457,7 @@ torus below, where switches are denoted by [+a-zA-Z]: x=0 1 2 3 4 5 -For a pristine fabric the path from S to D would be S-n-T-r-d. In the +For a pristine fabric the path from S to D would be S-n-T-r-D. 
In the event that either link S-n or n-T has failed, torus-2QoS would use the path S-m-p-o-T-r-D. Note that it can do this without changing the path SL value; once the 1D ring m-S-n-T-o-p-m has been broken by failure, path @@ -463,11 +466,19 @@ dateline (between, say, x=5 and x=0) can be ignored for path segments on that ring. One result of this is that torus-2QoS can route around many simultaneous -link failures, as long as no 1D ring is broken into disjoint regions. For +link failures, as long as no 1D ring is broken into disjoint segments. For example, if links n-T and T-o have both failed, that ring has been broken -into two disjoint regions, T and o-p-m-S-n. Torus-2QoS checks for such +into two disjoint segments, T and o-p-m-S-n. Torus-2QoS checks for such issues, reports if they are found, and refuses to route such fabrics. +Note that in the case where there are multiple parallel links between a pair +of switches, torus-2QoS will allocate routes across such links in a round- +robin fashion, based on ports at the path destination switch that are active +and not used for inter-switch links. Should a link that is one of several +such parallel links fail, routes are redistributed across the remaining +links. When the last of such a set of parallel links fails, traffic is +rerouted as described above. + Handling a failed switch under DOR requires introducing into a path at least one turn that would be otherwise "illegal", i.e. not allowed by DOR rules. Torus-2QoS will introduce such a turn as close as possible to the @@ -476,8 +487,9 @@ failed switch in order to route around it. In the above example, suppose switch T has failed, and consider the path from S to D. Torus-2QoS will produce the path S-n-I-r-D, rather than the S-n-T-r-D path for a pristine torus, by introducing an early turn at n. -For traffic arriving at switch I from n, normal DOR rules will generate an -illegal turn in the path from S to D at I, and a legal turn at r. 
+Normal DOR rules will cause traffic arriving at switch I to be forwarded +to switch r; for traffic arriving from I due to the "early" turn at n, +this will generate an "illegal" turn at I. Torus-2QoS will also use the input port dependence of SL2VL maps to set VL bit 1 (which would be otherwise unused) for y-x, z-x, and z-y turns, i.e., @@ -549,6 +561,8 @@ VL with bit 1 set. In contrast to the earlier examples, the second hop after the illegal turn, q-r, can be used to construct a credit loop encircling the failed switches. +Multicast Routing: + Since torus-2QoS uses all four available SL bits, and the three data VL bits that are typically available in current switches, there is no way to use SL/VL values to separate multicast traffic from unicast traffic. @@ -649,7 +663,104 @@ a branch that crosses a dateline. However, again this cannot contribute to credit loops as it occurs on a 1D ring (the ring for x=3) that is broken by a failure, as in the above example. -Due to the use made by torus-2QoS of SLs and VLs, QoS configuration should -only employ SL values 0 and 8, for both multicast and unicast. Also, -SL to VL map configuration must be under the complete control of torus-2QoS, -so any user-supplied configuration must and will be ignored. +Torus Topology Discovery: + +The algorithm used by torus-2QoS to construct the torus topology from the +undirected graph representing the fabric requires that the radix of each +dimension be configured via torus-2QoS.conf. It also requires that the +torus topology be "seeded"; for a 3D torus this requires configuring four +switches that define the three coordinate directions of the torus. + +Given this starting information, the algorithm is to examine the cube +formed by the eight switch locations bounded by the corners (x,y,z) and +(x+1,y+1,z+1). 
Based on switches already placed into the torus topology at +some of these locations, the algorithm examines 4-loops of interswitch +links to find the one that is consistent with a face of the cube of switch +locations, and adds its switches to the discovered topology in the correct +locations. + +Because the algorithm is based on examining the topology of 4-loops of links, +a torus with one or more radix-4 dimensions requires extra initial seed +configuration. See torus-2QoS.conf(5) for details. Torus-2QoS will detect +and report when it has insufficient configuration for a torus with radix-4 +dimensions. + +In the event the torus is significantly degraded, i.e., there are many +missing switches or links, it may happen that torus-2QoS is unable to place +into the torus some switches and/or links that were discovered in the +fabric, and will generate a warning in that case. A similar condition +occurs if torus-2QoS is misconfigured, i.e., the radix of a torus dimension +as configured does not match the radix of that torus dimension as wired, +and many switches/links in the fabric will not be placed into the torus. + +Quality Of Service Configuration: + +OpenSM will not program switches and channel adapters with SL2VL maps or VL +arbitration configuration unless it is invoked with -Q. Since torus-2QoS +depends on such functionality for correct operation, always invoke OpenSM +with -Q when torus-2QoS is in the list of routing engines. + +Any quality of service configuration method supported by OpenSM will work +with torus-2QoS, subject to the following limitations and considerations. + +For all routing engines supported by OpenSM except torus-2QoS, there is a +one-to-one correspondence between QoS level and SL. Torus-2QoS can only +support two quality of service levels, so only the high-order bit of any SL +value used for unicast QoS configuration will be honored by torus-2QoS. + +For multicast QoS configuration, only SL values 0 and 8 should be used with +torus-2QoS. 
+ +Since SL to VL map configuration must be under the complete control of +torus-2QoS, any configuration via qos_sl2vl, qos_swe_sl2vl, etc., must and +will be ignored, and a warning will be generated. + +Torus-2QoS uses VL values 0-3 to implement one of its supported QoS levels, +and VL values 4-7 to implement the other. Hard-to-diagnose application +issues may arise if traffic is not delivered fairly across each of these +two VL ranges. Torus-2QoS will detect and warn if VL arbitration is +configured unfairly across VLs in the range 0-3, and also in the range +4-7. Note that the default OpenSM VL arbitration configuration does not +meet this constraint, so all torus-2QoS users should configure VL +arbitration via qos_vlarb_high, qos_vlarb_low, etc. + +Operational Considerations: + +Any routing algorithm for a torus IB fabric must employ path SL values to +avoid credit loops. As a result, all applications run over such fabrics +must perform a path record query to obtain the correct path SL for +connection setup. Applications that use rdma_cm for connection setup will +automatically meet this requirement. + +If a change in fabric topology causes changes in path SL values required to +route without credit loops, in general all applications would need to +repath to avoid message deadlock. Since torus-2QoS has the ability to +reroute after a single switch failure without changing path SL values, +repathing by running applications is not required when the fabric is routed +with torus-2QoS. + +Torus-2QoS can provide unchanging path SL values in the presence of subnet +manager failover provided that all OpenSM instances have the same idea of +dateline location. See torus-2QoS.conf(5) for details. + +Torus-2QoS will detect configurations of failed switches and links that +prevent routing that is free of credit loops, and will log warnings and +refuse to route. 
If "no_fallback" was configured in the list of OpenSM +routing engines, then no other routing engine will attempt to route the +fabric. In that case all paths that do not transit the failed components +will continue to work, and the subset of paths that are still operational +will continue to remain free of credit loops. OpenSM will continue to +attempt to route the fabric after every sweep interval, and after any +change (such as a link up) in the fabric topology. When the fabric +components are repaired, full functionality will be restored. + +In the event OpenSM was configured to allow some other engine to route the +fabric if torus-2QoS fails, then credit loops and message deadlock are +likely if torus-2QoS had previously routed the fabric successfully. Even if +the other engine is capable of routing a torus without credit loops, +applications that built connections with path SL values granted under +torus-2QoS will likely experience message deadlock under routing generated +by a different engine, unless they repath. + +To verify that a torus fabric is routed free of credit loops, use ibdmchk +to analyze data collected via ibdiagnet -vlr.
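
Reviewer illustration (not part of the patch): the SL2VL rule quoted in the
Unicast Routing hunk, and the statement in the Quality Of Service hunk that
only the high-order SL bit is honored for unicast QoS, can be sketched in a
few lines of Python. This is a minimal sketch under stated assumptions: the
function names sl2vl_bit0 and honored_qos_bit are hypothetical and are not
OpenSM APIs, the coordinate direction is passed in directly rather than
derived from a port, and only the pristine-torus rule (VL bit 0) is modeled.

```python
# Illustrative sketch only: these names are hypothetical, not OpenSM APIs.


def sl2vl_bit0(sl: int, cdir_oport: int) -> int:
    """Return VL bit 0 for a path SL leaving through an inter-switch port.

    Mirrors the rule quoted in the patch:
        sl2vl(iport, oport, sl) = 0x1 & (sl >> cdir(oport))
    cdir_oport stands in for the value cdir(oport) would report: 0, 1, or 2.
    """
    if not 0 <= sl < 16:
        raise ValueError("SL must be in the range 0-15")
    if cdir_oport not in (0, 1, 2):
        raise ValueError("coordinate direction must be 0, 1, or 2")
    return 0x1 & (sl >> cdir_oport)


def honored_qos_bit(sl: int) -> int:
    """Return SL bit 3, the high-order bit the text says is honored for QoS."""
    if not 0 <= sl < 16:
        raise ValueError("SL must be in the range 0-15")
    return (sl >> 3) & 0x1


if __name__ == "__main__":
    for sl in (0b0000, 0b0101, 0b1101):
        bits = [sl2vl_bit0(sl, d) for d in (0, 1, 2)]
        print(f"SL {sl:2d}: VL bit 0 per direction {bits}, "
              f"QoS bit {honored_qos_bit(sl)}")
```

In this sketch SL values 0 and 8 differ only in the bit returned by
honored_qos_bit, which is consistent with the advice above to use only those
two SL values for multicast QoS configuration.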