From patchwork Fri Dec 18 20:51:01 2009 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jim Schutt X-Patchwork-Id: 68808 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by demeter.kernel.org (8.14.3/8.14.2) with ESMTP id nBIKpGph024927 for ; Fri, 18 Dec 2009 20:51:22 GMT Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932426AbZLRUvV (ORCPT ); Fri, 18 Dec 2009 15:51:21 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S932420AbZLRUvV (ORCPT ); Fri, 18 Dec 2009 15:51:21 -0500 Received: from sentry-three.sandia.gov ([132.175.109.17]:47444 "EHLO sentry-three.sandia.gov" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932497AbZLRUvT (ORCPT ); Fri, 18 Dec 2009 15:51:19 -0500 X-WSS-ID: 0KUV8LL-08-7PS-02 X-M-MSG: Received: from sentry.sandia.gov (mm03snlnto.sandia.gov [132.175.109.20]) by sentry-three.sandia.gov (Postfix) with ESMTP id 2C2718E9563; Fri, 18 Dec 2009 13:51:20 -0700 (MST) Received: from [132.175.109.1] by sentry.sandia.gov with ESMTP (SMTP Relay 01 (Email Firewall v6.3.2)); Fri, 18 Dec 2009 13:51:11 -0700 X-Server-Uuid: 6BFC7783-7E22-49B4-B610-66D6BE496C0E Received: from localhost.localdomain (sale659.sandia.gov [134.253.4.20]) by mailgate.sandia.gov (8.14.1/8.14.1) with ESMTP id nBIKp1i5008814; Fri, 18 Dec 2009 13:51:10 -0700 From: "Jim Schutt" To: linux-rdma@vger.kernel.org cc: sashak@voltaire.com, eitan@mellanox.co.il, jaschut@sandia.gov Subject: [PATCH 12/12] opensm: Update documentation to describe torus-2QoS multicast support. Date: Fri, 18 Dec 2009 13:51:01 -0700 Message-ID: <1261169461-2516-12-git-send-email-jaschut@sandia.gov> X-Mailer: git-send-email 1.5.6.GIT In-Reply-To: <1258744509-11148-1-git-send-email-jaschut@sandia.gov> References: <1258744509-11148-1-git-send-email-jaschut@sandia.gov> X-PMX-Version: 5.5.7.378829, Antispam-Engine: 2.7.2.376379, Antispam-Data: 2009.12.18.204217 X-PerlMx-Spam: Gauge=IIIIIIII, Probability=8%, Report=' TO_NO_NAME 0, __HAS_MSGID 0, __HAS_X_MAILER 0, __MIME_TEXT_ONLY 0, __SANE_MSGID 0, __TO_MALFORMED_2 0, __URI_NS ' X-TMWD-Spam-Summary: TS=20091218205111; ID=1; SEV=2.3.1; DFV=B2009121816; IFV=NA; AIF=B2009121816; RPD=5.03.0010; ENG=NA; RPDID=7374723D303030312E30413031303230352E34423242454233462E303044313A534346535441543838363133332C73733D312C6667733D30; CAT=NONE; CON=NONE; SIG=AAAAAAAAAAAAAAAAAAAAAAAAfQ== X-MMS-Spam-Filter-ID: B2009121816_5.03.0010 MIME-Version: 1.0 X-WSS-ID: 673534B52OC300027-01-01 Sender: linux-rdma-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-rdma@vger.kernel.org diff --git a/opensm/doc/current-routing.txt b/opensm/doc/current-routing.txt index 141d793..78a2e01 100644 --- a/opensm/doc/current-routing.txt +++ b/opensm/doc/current-routing.txt @@ -400,8 +400,18 @@ Torus-2QoS Routing Algorithm ---------------------------- Torus-2QoS is routing algorithm designed for large-scale 2D/3D torus fabrics. - -It is a DOR-based algorithm that avoids deadlocks that would otherwise +The torus-2QoS routing engine can provide the following functionality on +a 2D/3D torus: +- routing that is free of credit loops +- two levels of QoS, assuming switches support 8 data VLs +- ability to route around a single failed switch, and/or multiple failed + links, without + - introducing credit loops + - changing path SL values +- very short run times, with good scaling properties as fabric size + increases + +Torus-2QoS is a DOR-based algorithm that avoids deadlocks that would otherwise occur in a torus using the concept of a dateline for each torus dimension. It encodes into a path SL which datelines the path crosses as follows: @@ -424,7 +434,7 @@ ports as follows: sl2vl(iport,oport,sl) = 0x1 & (sl >> cdir(oport)); Thus torus-2QoS consumes 8 SL values (SL bits 0-2) and 2 VL values (VL bit 0) - per QoS level to provide deadlock-free routing on a 3D torus. +per QoS level to provide deadlock-free routing on a 3D torus. Torus-2QoS routes around link failure by "taking the long way around" any 1D ring interrupted by a link failure. For example, consider the 2D 6x5 @@ -538,3 +548,108 @@ path S-n-I-q-r-D, with illegal turn at switch I, and with hop I-q using a VL with bit 1 set. In contrast to the earlier examples, the second hop after the illegal turn, q-r, can be used to construct a credit loop encircling the failed switches. + +Since torus-2QoS uses all four available SL bits, and the three data VL +bits that are typically available in current switches, there is no way +to use SL/VL values to separate multicast traffic from unicast traffic. +Thus, torus-2QoS must generate multicast routing such that credit loops +cannot arise from a combination of multicast and unicast path segments. + +It turns out that it is possible to construct spanning trees for multicast +routing that have that property. For the 2D 6x5 torus example above, here +is the full-fabric spanning tree that torus-2QoS will construct, where "x" +is the root switch and each "+" is a non-root switch: + + 4 + + + + + + + | | | | | | + 3 + + + + + + + | | | | | | + 2 +----+----+----x----+----+ + | | | | | | + 1 + + + + + + + | | | | | | + y=0 + + + + + + + + x=0 1 2 3 4 5 + +For multicast traffic routed from root to tip, every turn in the above +spanning tree is a legal DOR turn. + +For traffic routed from tip to root, and some traffic routed through the +root, turns are not legal DOR turns. However, to construct a credit loop, +the union of multicast routing on this spanning tree with DOR unicast +routing can only provide 3 of the 4 turns needed for the loop. + +In addition, if none of the above spanning tree branches crosses a dateline +used for unicast credit loop avoidance on a torus, and if multicast traffic +is confined to SL 0 or SL 8 (recall that torus-2QoS uses SL bit 3 to +differentiate QoS level), then multicast traffic also cannot contribute to +the "ring" credit loops that are otherwise possible in a torus. + +Torus-2QoS uses these ideas to create a master spanning tree. Every +multicast group spanning tree will be constructed as a subset of the master +tree, with the same root as the master tree. + +Such multicast group spanning trees will in general not be optimal for +groups which are a subset of the full fabric. However, this compromise must +be made to enable support for two QoS levels on a torus while preventing +credit loops. + +In the presence of link or switch failures that result in a fabric for +which torus-2QoS can generate credit-loop-free unicast routes, it is also +possible to generate a master spanning tree for multicast that retains the +required properties. For example, consider that same 2D 6x5 torus, with +the link from (2,2) to (3,2) failed. Torus-2QoS will generate the following +master spanning tree: + + 4 + + + + + + + | | | | | | + 3 + + + + + + + | | | | | | + 2 --+----+----+ x----+----+-- + | | | | | | + 1 + + + + + + + | | | | | | + y=0 + + + + + + + + x=0 1 2 3 4 5 + +Two things are notable about this master spanning tree. First, assuming +the x dateline was between x=5 and x=0, this spanning tree has a branch +that crosses the dateline. However, just as for unicast, crossing a +dateline on a 1D ring (here, the ring for y=2) that is broken by a failure +cannot contribute to a torus credit loop. + +Second, this spanning tree is no longer optimal even for multicast groups +that encompass the entire fabric. That, unfortunately, is a compromise that +must be made to retain the other desirable properties of torus-2QoS routing. + +In the event that a single switch fails, torus-2QoS will generate a master +spanning tree that has no "extra" turns by appropriately selecting a root +switch. In the 2D 6x5 torus example, assume now that the switch at (3,2), +i.e. the root for a pristine fabric, fails. Torus-2QoS will generate the +following master spanning tree for that case: + + | + 4 + + + + + + + | | | | | | + 3 + + + + + + + | | | | | + 2 + + + + + + | | | | | + 1 +----+----x----+----+----+ + | | | | | | + y=0 + + + + + + + | + + x=0 1 2 3 4 5 + +Assuming the y dateline was between y=4 and y=0, this spanning tree has +a branch that crosses a dateline. However, again this cannot contribute +to credit loops as it occurs on a 1D ring (the ring for x=3) that is +broken by a failure, as in the above example. + +Due to the use made by torus-2QoS of SLs and VLs, QoS configuration should +only employ SL values 0 and 8, for both multicast and unicast. Also, +SL to VL map configuration must be under the complete control of torus-2QoS, +so any user-supplied configuration must and will be ignored.