Message ID | 20190411142228.22587.63118.stgit@scvm10.sc.intel.com (mailing list archive) |
---|---|
State | Mainlined |
Commit | ba7d8117f3cca8eb70d579fde3f9ec8cd6a28f39 |
Delegated to: | Jason Gunthorpe |
Headers | show |
Series | [for-next] IB/core, ipoib: Do not overreact to SM LID change event | expand |
On Thu, Apr 11, 2019 at 07:22:35AM -0700, Dennis Dalessandro wrote: > When IPoIB receives an SM LID change event, it reacts by flushing its > path record cache and rejoining multicast groups. This is the same > behavior it performs when it receives a reregistration event. This > behavior is unnecessary as an SM may have database backup or > synchronization mechanisms which permit the SM location or LID to change > without loss of multicast membership and without impact to path records. > > Both opensm and the OPA FM issue reregistration events if a new SM is > started (or restarted with a new config) or an SM event occurs which > results in loss of multicast membership records by the SM (such as > opensm failover) or the SM encounters new nodes with Active ports (such > as after joining 2 fabrics by connecting switches via ISLs). Hence this > event can be depended on as the trigger for IPoIB cache and multicast > flushing. > > It appears that some drivers, such as qib, and hfi1 issue the > IB_EVENT_SM_CHANGE but other drivers such as mlx4 and mlx5 do not. > Empirical testing on Mellanox EDR using ibv_asyncwatch has confirmed > that Mellanox EDR HCAs do not generate SM change events and that opensm > does generate reregistration. > > An SM LID change event is generated by the mentioned drivers to reflect > that sm_lid and/or sm_sl in the local port info has changed. The intent > of this event is to permit applications and ULPs which have a local copy > of this information (or an address handle using it) to update their > information. > > The intent is that the reregistration event (caused by the SM via a bit > in Set(PortInfo)) be used to inform nodes that they need to rejoin > multicast groups, resubscribe for notices and potentially update path > records. > > When an SM migrates or fails over, a SM LID change event can occur. In > response IPoIB discards path records and multicast membership and loses > connectivity until these records are restored via SA requests. In very > large fabrics, it may take minutes for the SM to be ready and for the SA > responses to be supplied. This can result in undesirable and > unnecessary IPoIB connectivity impacts. It also can result in an > unnecessary storm of SA queries from all nodes in a cluster potentially > followed by yet another storm if the SM issues the reregistration > request. > > The fact the Mellanox HCAs do not even generate this event, is further > evidence that on modern IB fabrics there will be no ill side effects > from the proposed changes below to reduce the reaction by 3 kernel > components to this event. So these changes should be benign for Mellanox > IB fabrics and will benefit OPA fabrics while also making ib_core and > ULP behavor "correct" as intended by the IBTA spec and kernel RDMA event > APIs. > > Address these issues by removing IB_EVENT_SM_CHANGE handling from ipoib. > IPoIB does not locally store sm_lid nor sm_sl, so it does not need to do > anything on SM LID change. IPoIB makes use of other ib_core components > to issue SA requests for it and those components correctly track SM LID > and SM LID changes. > > Also in ib_core multicast handling, remove the test for > IB_EVENT_SM_CHANGE. This code is moving all multicast groups to the > error state, which will trigger rejoins. This code is used by IPoIB as > well as the connection manager and other clients of multicast groups. > This kernel module centralizes group membership status and joins since a > node can only join a given group once but multiple ULPs or applications > may want to join the same group. It makes use of the sa_query.c > component in ib_core, which correctly trackes SM LID and SL. This > component does not track SM LID nor SL itself and hence need not react > to their changes. > > Similarly in the ib_core cache code remove the handling for the > IB_EVENT_SM_CHANGE. In this function. The ib_cache_update function > which is ultimately called is updating local copies of the pkey table, > gid table and lmc. It does not update nor retain sm_lid nor sm_sl. As > such it does not need to be called on an SM LID change. It technically > also does not need to be called on a reregistration. The LID_CHANGE, > PKEY_CHANGE, GID_CHANGE and port state change events (PORT_ERR, > PORT_ACTICE) should be sufficient triggers. > > It is worth noting that the alternative of simply having the hfi1 and > qib drivers not generate the SM LID change event was explored. While > this would duplicate what Mellanox drivers do now, it is not the correct > behavior and removes the ability for an SM to migrate without requiring > reregistration. Since both opensm and OPA SM have mechanisms to backup > or synchronize registration information, it is desirable to let them > perform SM migrations (with LID or SL changes) without requiring > reregistration when they deem it appropriate. > > Suggested-by: Todd Rimmer <todd.rimmer@intel.com> > Tested-by: Michael Brooks <michael.brooks@intel.com> > Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> > Reviewed-by: Todd Rimmer <todd.rimmer@intel.com> > Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> > --- > drivers/infiniband/core/cache.c | 1 - > drivers/infiniband/core/multicast.c | 1 - > drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 3 +-- > 3 files changed, 1 insertions(+), 4 deletions(-) Applied to for-next, thanks Jason
diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c index 43c67e5..f60951b 100644 --- a/drivers/infiniband/core/cache.c +++ b/drivers/infiniband/core/cache.c @@ -1392,7 +1392,6 @@ static void ib_cache_event(struct ib_event_handler *handler, event->event == IB_EVENT_PORT_ACTIVE || event->event == IB_EVENT_LID_CHANGE || event->event == IB_EVENT_PKEY_CHANGE || - event->event == IB_EVENT_SM_CHANGE || event->event == IB_EVENT_CLIENT_REREGISTER || event->event == IB_EVENT_GID_CHANGE) { work = kmalloc(sizeof *work, GFP_ATOMIC); diff --git a/drivers/infiniband/core/multicast.c b/drivers/infiniband/core/multicast.c index d50ff70..cd338dd 100644 --- a/drivers/infiniband/core/multicast.c +++ b/drivers/infiniband/core/multicast.c @@ -804,7 +804,6 @@ static void mcast_event_handler(struct ib_event_handler *handler, switch (event->event) { case IB_EVENT_PORT_ERR: case IB_EVENT_LID_CHANGE: - case IB_EVENT_SM_CHANGE: case IB_EVENT_CLIENT_REREGISTER: mcast_groups_event(&dev->port[index], MCAST_GROUP_ERROR); break; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c index 1e88213..ba09068 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c @@ -279,8 +279,7 @@ void ipoib_event(struct ib_event_handler *handler, ipoib_dbg(priv, "Event %d on device %s port %d\n", record->event, dev_name(&record->device->dev), record->element.port_num); - if (record->event == IB_EVENT_SM_CHANGE || - record->event == IB_EVENT_CLIENT_REREGISTER) { + if (record->event == IB_EVENT_CLIENT_REREGISTER) { queue_work(ipoib_workqueue, &priv->flush_light); } else if (record->event == IB_EVENT_PORT_ERR || record->event == IB_EVENT_PORT_ACTIVE ||