
[6/8] IPoIB: Use dedicated workqueues per interface

Message ID f7af9c251d722675a549e4a673f46c0f31dfa266.1407885724.git.dledford@redhat.com (mailing list archive)
State Superseded, archived

Commit Message

Doug Ledford Aug. 12, 2014, 11:38 p.m. UTC
During my recent work on the rtnl lock deadlock in the IPoIB driver, I
saw that even once I fixed the apparent races for a single device, as
soon as that device had any children, new races popped up.  It turns out
that this is because, no matter how well we protect against races on a
single device, all devices share the same workqueue and
flush_workqueue() flushes *everything* from that workqueue.  So we can
have one device in the middle of a down, holding the rtnl lock, while
another totally unrelated device needs to run mcast_restart_task, which
wants the rtnl lock and will loop trying to take it unless it sees its
own FLAG_ADMIN_UP flag go away.  Because the unrelated interface will
never see its own ADMIN_UP flag drop, the interface going down will
deadlock trying to flush the queue.  There are several possible
solutions to this problem:

Make carrier_on_task and mcast_restart_task try to take the rtnl lock
for some set period of time and, if they fail, bail.  This runs the real
risk of dropping work on the floor, which can end up being its own
separate kind of deadlock.
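
Roughly, that first option would have looked something like the fragment
below (illustrative only, not part of this patch; the one-second bound and
the sleep interval are made up):

	/* inside carrier_on_task / mcast_restart_task */
	unsigned long timeout = jiffies + HZ;	/* arbitrary bound */

	while (!rtnl_trylock()) {
		if (time_after(jiffies, timeout))
			return;		/* give up: this work is dropped */
		msleep(20);
	}
	/* ... do the actual carrier-on / restart work ... */
	rtnl_unlock();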

Set some global flag in the driver that says some device is in the
middle of going down, letting all tasks know to bail.  Again, this can
drop work on the floor.  I suppose if our own ADMIN_UP flag doesn't go
away, then maybe after a few tries on the rtnl lock we can queue our own
task back up as a delayed work and return, avoiding dropping work on the
floor that way.  But I'm not 100% convinced that we won't cause other
problems.
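
Purely as a sketch of that fallback (nothing like this is in the patch, and
the retry count is arbitrary), it might look like:

	int tries = 3;

	while (!rtnl_trylock()) {
		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
			return;
		if (!--tries) {
			/*
			 * Don't drop the work; requeue ourselves instead.
			 * Requeueing with a delay would need the task
			 * converted to a delayed_work.
			 */
			queue_work(ipoib_workqueue, &priv->restart_task);
			return;
		}
		msleep(20);
	}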

Or the method this patch attempts to use: when we bring an interface
up, create a workqueue specifically for that interface, so that when we
take it back down we flush only those tasks associated with our
interface.  In addition, keep the global workqueue, but now limit it to
the flush tasks only.  In this way, the flush tasks can always flush the
device-specific workqueues without deadlock issues.
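
In short, the shape of the change (distilled from the diff below, using the
same names the patch introduces):

	/* ipoib_transport_dev_init(): one queue per interface */
	priv->wq = create_singlethread_workqueue("ipoib_wq");

	/* per-device work now goes on the device's own queue */
	queue_work(priv->wq, &priv->restart_task);

	/* taking the interface down flushes only this device's work */
	flush_workqueue(priv->wq);

	/* ipoib_transport_dev_cleanup(): tear the queue down */
	destroy_workqueue(priv->wq);
	priv->wq = NULL;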

Signed-off-by: Doug Ledford <dledford@redhat.com>
---
 drivers/infiniband/ulp/ipoib/ipoib.h           |  1 +
 drivers/infiniband/ulp/ipoib/ipoib_cm.c        | 18 +++++++++---------
 drivers/infiniband/ulp/ipoib/ipoib_ib.c        |  6 +++---
 drivers/infiniband/ulp/ipoib/ipoib_main.c      | 19 ++++++++++++-------
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 26 ++++++++++++--------------
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c     | 22 +++++++++++++++++++++-
 6 files changed, 58 insertions(+), 34 deletions(-)

Comments

Estrin, Alex Aug. 20, 2014, 3:01 p.m. UTC | #1

Workqueues per interface is a great idea. Thanks!
Acked-by: Alex Estrin <alex.estrin@intel.com>

Erez Shitrit Sept. 4, 2014, 6:49 a.m. UTC | #2
Hi Doug,

The idea that the "big" ipoib workqueue is only for the IB_EVENT_XXX
handling looks good to me.

One comment here: I don't see where the driver flushes it at all.
IMHO, when the driver goes down you should cancel all pending tasks on
that global wq prior to the call to destroy_workqueue().
The current logic calls destroy_workqueue() after remove_one; I think
you should cancel the pending works after the ib_unregister_event_handler()
call in the ipoib_remove_one() function, otherwise any tasks still pending
in that queue will be scheduled after the dev/priv are gone.
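
Something along these lines, purely as an illustration of the suggestion
(not tested):

	/* in ipoib_remove_one(), for each port's priv: */
	ib_unregister_event_handler(&priv->event_handler);

	/*
	 * Once the handler is gone no new flush work can be queued for
	 * this device, so drain (or cancel) whatever is already on the
	 * global queue before dev/priv are freed.
	 */
	flush_workqueue(ipoib_workqueue);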

Thanks, Erez
Doug Ledford Sept. 9, 2014, 7:09 a.m. UTC | #3
On 09/04/2014 02:49 AM, Erez Shitrit wrote:
> Hi Doug,
>
> The idea that "big" ipoib workqueue is only for the IB_EVENT_XXX looks
> good to me.
>
> one comment here: I don't see where the driver flushes it at all.
> IMHO, when the driver goes down you should cancel all pending tasks over
> that global wq prior the call for the destroy_workqueue.

I understand your thought, but I disagree with your conclusion.  If we 
ever get to the portion of the driver down sequence where we are 
removing the big flush work queue and there are still items on that work 
queue, then we have already failed and we are going to crash because 
there are outstanding flushes for a deleted device.  The real issue here 
is that we need to make sure it's not possible to delete a device that 
has outstanding flushes, and not possible to flush a device in the 
process of being deleted (Wendy, don't get mad at me, but the rtnl lock 
jumps out as appropriate for this purpose).

> The current logic calls the destroy_workqueue after the remove_one, i
> think you should cancel the pending works after the
> ib_unregister_event_handler call in the ipoib_remove_one function,

ipoib_remove_one() is device specific while the big flush workqueue is
not.  As such, canceling there could cancel work for devices other than
the device being removed, with no clear means of getting it back.

> otherwise if there are pending tasks in that queue they will schedule
> after the dev/priv are gone.

I agree that there is a problem here.  I think what needs to happen is 
that, now that we have a workqueue per device and all flushes come in to 
us via a workqueue that does not belong to the device, it is now possible 
to handle flushes synchronously.  This would allow us to lock around the 
flush itself and prevent removal until after the flush is complete, 
without getting into the hairy rat hole of trying to flush the flush 
workqueue, which might result in either flushing or canceling the very 
device we are working on.
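
A minimal sketch of that direction, assuming some lock dedicated to
serializing flush against removal (the mutex below is hypothetical; picking
the right lock is exactly the open question above):

	static DEFINE_MUTEX(ipoib_flush_mutex);		/* hypothetical */

	void ipoib_ib_dev_flush_normal(struct work_struct *work)
	{
		struct ipoib_dev_priv *priv =
			container_of(work, struct ipoib_dev_priv, flush_normal);

		mutex_lock(&ipoib_flush_mutex);
		__ipoib_ib_dev_flush(priv, IPOIB_FLUSH_NORMAL);
		mutex_unlock(&ipoib_flush_mutex);
	}

	/*
	 * ipoib_remove_one() would take the same mutex, so a device could
	 * not be torn down while one of its flushes is still in flight.
	 */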

Erez Shitrit Sept. 10, 2014, 5:27 p.m. UTC | #4
> On 09/04/2014 02:49 AM, Erez Shitrit wrote:
>> Hi Doug,
>>
>> The idea that "big" ipoib workqueue is only for the IB_EVENT_XXX looks
>> good to me.
>>
>> one comment here: I don't see where the driver flushes it at all.
>> IMHO, when the driver goes down you should cancel all pending tasks over
>> that global wq prior the call for the destroy_workqueue.
>
> I understand your thought, but I disagree with your conclusion. If we 
> ever get to the portion of the driver down sequence where we are 
> removing the big flush work queue and there are still items on that 
> work queue, then we have already failed and we are going to crash 
> because there are outstanding flushes for a deleted device.
It is not a rare case: SM events (pkey, hand-over, etc.) arrive while 
the driver is restarted on devices across the cluster, and you are there.
Removing one device of several on the same host is a rare condition; 
anyway, I agree that cancel_work is not the solution.
> The real issue here is that we need to make sure it's not possible to 
> delete a device that has outstanding flushes, and not possible to 
> flush a device in the process of being deleted (Wendy, don't get mad 
> at me, but the rtnl lock jumps out as appropriate for this purpose).
We can remove the device and drain the workqueue while checking, for each 
event, that it does not belong to the deleted device (we have the list of 
all existing priv objects in the system via ib_get_client_data()); after 
that we can be sure that no remaining works are for that deleted device.
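
For illustration only (the helper name is hypothetical, and locking around
the dev_list is ignored here), such a check could look like:

	static bool ipoib_priv_still_registered(struct ipoib_dev_priv *priv)
	{
		struct list_head *dev_list =
			ib_get_client_data(priv->ca, &ipoib_client);
		struct ipoib_dev_priv *p;

		if (!dev_list)
			return false;
		list_for_each_entry(p, dev_list, list)
			if (p == priv)
				return true;
		return false;
	}

	/* each flush work would bail out early when this returns false */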

Patch

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 43840971ad8..7bf7339eaef 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -317,6 +317,7 @@  struct ipoib_dev_priv {
 	struct list_head multicast_list;
 	struct rb_root multicast_tree;
 
+	struct workqueue_struct *wq;
 	struct delayed_work mcast_task;
 	struct work_struct carrier_on_task;
 	struct work_struct flush_light;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 933efcea0d0..56959adb6c7 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -474,7 +474,7 @@  static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even
 	}
 
 	spin_lock_irq(&priv->lock);
-	queue_delayed_work(ipoib_workqueue,
+	queue_delayed_work(priv->wq,
 			   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
 	/* Add this entry to passive ids list head, but do not re-add it
 	 * if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush list. */
@@ -576,7 +576,7 @@  void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 			spin_lock_irqsave(&priv->lock, flags);
 			list_splice_init(&priv->cm.rx_drain_list, &priv->cm.rx_reap_list);
 			ipoib_cm_start_rx_drain(priv);
-			queue_work(ipoib_workqueue, &priv->cm.rx_reap_task);
+			queue_work(priv->wq, &priv->cm.rx_reap_task);
 			spin_unlock_irqrestore(&priv->lock, flags);
 		} else
 			ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n",
@@ -603,7 +603,7 @@  void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 				spin_lock_irqsave(&priv->lock, flags);
 				list_move(&p->list, &priv->cm.rx_reap_list);
 				spin_unlock_irqrestore(&priv->lock, flags);
-				queue_work(ipoib_workqueue, &priv->cm.rx_reap_task);
+				queue_work(priv->wq, &priv->cm.rx_reap_task);
 			}
 			return;
 		}
@@ -827,7 +827,7 @@  void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 
 		if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {
 			list_move(&tx->list, &priv->cm.reap_list);
-			queue_work(ipoib_workqueue, &priv->cm.reap_task);
+			queue_work(priv->wq, &priv->cm.reap_task);
 		}
 
 		clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags);
@@ -1255,7 +1255,7 @@  static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id,
 
 		if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {
 			list_move(&tx->list, &priv->cm.reap_list);
-			queue_work(ipoib_workqueue, &priv->cm.reap_task);
+			queue_work(priv->wq, &priv->cm.reap_task);
 		}
 
 		spin_unlock_irqrestore(&priv->lock, flags);
@@ -1284,7 +1284,7 @@  struct ipoib_cm_tx *ipoib_cm_create_tx(struct net_device *dev, struct ipoib_path
 	tx->dev = dev;
 	list_add(&tx->list, &priv->cm.start_list);
 	set_bit(IPOIB_FLAG_INITIALIZED, &tx->flags);
-	queue_work(ipoib_workqueue, &priv->cm.start_task);
+	queue_work(priv->wq, &priv->cm.start_task);
 	return tx;
 }
 
@@ -1295,7 +1295,7 @@  void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx)
 	if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {
 		spin_lock_irqsave(&priv->lock, flags);
 		list_move(&tx->list, &priv->cm.reap_list);
-		queue_work(ipoib_workqueue, &priv->cm.reap_task);
+		queue_work(priv->wq, &priv->cm.reap_task);
 		ipoib_dbg(priv, "Reap connection for gid %pI6\n",
 			  tx->neigh->daddr + 4);
 		tx->neigh = NULL;
@@ -1417,7 +1417,7 @@  void ipoib_cm_skb_too_long(struct net_device *dev, struct sk_buff *skb,
 
 	skb_queue_tail(&priv->cm.skb_queue, skb);
 	if (e)
-		queue_work(ipoib_workqueue, &priv->cm.skb_task);
+		queue_work(priv->wq, &priv->cm.skb_task);
 }
 
 static void ipoib_cm_rx_reap(struct work_struct *work)
@@ -1450,7 +1450,7 @@  static void ipoib_cm_stale_task(struct work_struct *work)
 	}
 
 	if (!list_empty(&priv->cm.passive_ids))
-		queue_delayed_work(ipoib_workqueue,
+		queue_delayed_work(priv->wq,
 				   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
 	spin_unlock_irq(&priv->lock);
 }
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 72626c34817..bfd17d41b5f 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -655,7 +655,7 @@  void ipoib_reap_ah(struct work_struct *work)
 	__ipoib_reap_ah(dev);
 
 	if (!test_bit(IPOIB_STOP_REAPER, &priv->flags))
-		queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task,
+		queue_delayed_work(priv->wq, &priv->ah_reap_task,
 				   round_jiffies_relative(HZ));
 }
 
@@ -696,7 +696,7 @@  int ipoib_ib_dev_open(struct net_device *dev, int flush)
 	}
 
 	clear_bit(IPOIB_STOP_REAPER, &priv->flags);
-	queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task,
+	queue_delayed_work(priv->wq, &priv->ah_reap_task,
 			   round_jiffies_relative(HZ));
 
 	if (!test_and_set_bit(IPOIB_FLAG_INITIALIZED, &priv->flags))
@@ -881,7 +881,7 @@  timeout:
 	set_bit(IPOIB_STOP_REAPER, &priv->flags);
 	cancel_delayed_work(&priv->ah_reap_task);
 	if (flush)
-		flush_workqueue(ipoib_workqueue);
+		flush_workqueue(priv->wq);
 
 	begin = jiffies;
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 949948a46d4..255c8296566 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -839,7 +839,7 @@  static void ipoib_set_mcast_list(struct net_device *dev)
 		return;
 	}
 
-	queue_work(ipoib_workqueue, &priv->restart_task);
+	queue_work(priv->wq, &priv->restart_task);
 }
 
 static u32 ipoib_addr_hash(struct ipoib_neigh_hash *htbl, u8 *daddr)
@@ -954,7 +954,7 @@  static void ipoib_reap_neigh(struct work_struct *work)
 	__ipoib_reap_neigh(priv);
 
 	if (!test_bit(IPOIB_STOP_NEIGH_GC, &priv->flags))
-		queue_delayed_work(ipoib_workqueue, &priv->neigh_reap_task,
+		queue_delayed_work(priv->wq, &priv->neigh_reap_task,
 				   arp_tbl.gc_interval);
 }
 
@@ -1133,7 +1133,7 @@  static int ipoib_neigh_hash_init(struct ipoib_dev_priv *priv)
 
 	/* start garbage collection */
 	clear_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
-	queue_delayed_work(ipoib_workqueue, &priv->neigh_reap_task,
+	queue_delayed_work(priv->wq, &priv->neigh_reap_task,
 			   arp_tbl.gc_interval);
 
 	return 0;
@@ -1293,7 +1293,7 @@  int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
 	return 0;
 
 out_dev_uninit:
-	ipoib_ib_dev_cleanup();
+	ipoib_ib_dev_cleanup(dev);
 
 out_tx_ring_cleanup:
 	vfree(priv->tx_ring);
@@ -1646,7 +1646,7 @@  register_failed:
 	/* Stop GC if started before flush */
 	set_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
 	cancel_delayed_work(&priv->neigh_reap_task);
-	flush_workqueue(ipoib_workqueue);
+	flush_workqueue(priv->wq);
 
 event_failed:
 	ipoib_dev_cleanup(priv->dev);
@@ -1717,7 +1717,7 @@  static void ipoib_remove_one(struct ib_device *device)
 		/* Stop GC */
 		set_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
 		cancel_delayed_work(&priv->neigh_reap_task);
-		flush_workqueue(ipoib_workqueue);
+		flush_workqueue(priv->wq);
 
 		unregister_netdev(priv->dev);
 		free_netdev(priv->dev);
@@ -1758,8 +1758,13 @@  static int __init ipoib_init_module(void)
 	 * unregister_netdev() and linkwatch_event take the rtnl lock,
 	 * so flush_scheduled_work() can deadlock during device
 	 * removal.
+	 *
+	 * In addition, bringing one device up and another down at the
+	 * same time can deadlock a single workqueue, so we have this
+	 * global fallback workqueue, but we also attempt to open a
+	 * per device workqueue each time we bring an interface up
 	 */
-	ipoib_workqueue = create_singlethread_workqueue("ipoib");
+	ipoib_workqueue = create_singlethread_workqueue("ipoib_flush");
 	if (!ipoib_workqueue) {
 		ret = -ENOMEM;
 		goto err_fs;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 19e3fe75ebf..969ef420518 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -388,7 +388,7 @@  void ipoib_mcast_carrier_on_task(struct work_struct *work)
 	 * the workqueue while holding the rtnl lock, so loop
 	 * on trylock until either we get the lock or we see
 	 * FLAG_ADMIN_UP go away as that signals that we are bailing
-	 * and can safely ignore the carrier on work
+	 * and can safely ignore the carrier on work.
 	 */
 	while (!rtnl_trylock()) {
 		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
@@ -432,15 +432,14 @@  static int ipoib_mcast_join_complete(int status,
 	if (!status) {
 		mcast->backoff = 1;
 		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-			queue_delayed_work(ipoib_workqueue,
-					   &priv->mcast_task, 0);
+			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 
 		/*
-		 * Defer carrier on work to ipoib_workqueue to avoid a
+		 * Defer carrier on work to priv->wq to avoid a
 		 * deadlock on rtnl_lock here.
 		 */
 		if (mcast == priv->broadcast)
-			queue_work(ipoib_workqueue, &priv->carrier_on_task);
+			queue_work(priv->wq, &priv->carrier_on_task);
 	} else {
 		if (mcast->logcount++ < 20) {
 			if (status == -ETIMEDOUT || status == -EAGAIN) {
@@ -465,7 +464,7 @@  out:
 	if (status == -ENETRESET)
 		status = 0;
 	if (status && test_bit(IPOIB_MCAST_RUN, &priv->flags))
-		queue_delayed_work(ipoib_workqueue, &priv->mcast_task,
+		queue_delayed_work(priv->wq, &priv->mcast_task,
 				   mcast->backoff * HZ);
 	spin_unlock_irq(&priv->lock);
 	mutex_unlock(&mcast_mutex);
@@ -535,8 +534,7 @@  static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast,
 			mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
 
 		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-			queue_delayed_work(ipoib_workqueue,
-					   &priv->mcast_task,
+			queue_delayed_work(priv->wq, &priv->mcast_task,
 					   mcast->backoff * HZ);
 	}
 	mutex_unlock(&mcast_mutex);
@@ -584,8 +582,8 @@  void ipoib_mcast_join_task(struct work_struct *work)
 			ipoib_warn(priv, "failed to allocate broadcast group\n");
 			mutex_lock(&mcast_mutex);
 			if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-				queue_delayed_work(ipoib_workqueue,
-						   &priv->mcast_task, HZ);
+				queue_delayed_work(priv->wq, &priv->mcast_task,
+						   HZ);
 			mutex_unlock(&mcast_mutex);
 			return;
 		}
@@ -652,7 +650,7 @@  int ipoib_mcast_start_thread(struct net_device *dev)
 
 	mutex_lock(&mcast_mutex);
 	if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
-		queue_delayed_work(ipoib_workqueue, &priv->mcast_task, 0);
+		queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 	mutex_unlock(&mcast_mutex);
 
 	return 0;
@@ -670,7 +668,7 @@  int ipoib_mcast_stop_thread(struct net_device *dev, int flush)
 	mutex_unlock(&mcast_mutex);
 
 	if (flush)
-		flush_workqueue(ipoib_workqueue);
+		flush_workqueue(priv->wq);
 
 	return 0;
 }
@@ -737,7 +735,7 @@  void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 		__ipoib_mcast_add(dev, mcast);
 		list_add_tail(&mcast->list, &priv->multicast_list);
 		if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
-			queue_delayed_work(ipoib_workqueue, &priv->mcast_task, 0);
+			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 	}
 
 	if (!mcast->ah) {
@@ -952,7 +950,7 @@  void ipoib_mcast_restart_task(struct work_struct *work)
 	 * completes.  So do like the carrier on task and attempt to
 	 * take the rtnl lock, but if we can't before the ADMIN_UP flag
 	 * goes away, then just return and know that the remove list will
-	 * get flushed later by mcast_dev_flush.
+	 * get flushed later by mcast_stop_thread.
 	 */
 	while (!rtnl_trylock()) {
 		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
index c56d5d44c53..b72a753eb41 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
@@ -145,10 +145,20 @@  int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
 	int ret, size;
 	int i;
 
+	/*
+	 * the various IPoIB tasks assume they will never race against
+	 * themselves, so always use a single thread workqueue
+	 */
+	priv->wq = create_singlethread_workqueue("ipoib_wq");
+	if (!priv->wq) {
+		printk(KERN_WARNING "ipoib: failed to allocate device WQ\n");
+		return -ENODEV;
+	}
+
 	priv->pd = ib_alloc_pd(priv->ca);
 	if (IS_ERR(priv->pd)) {
 		printk(KERN_WARNING "%s: failed to allocate PD\n", ca->name);
-		return -ENODEV;
+		goto out_free_wq;
 	}
 
 	priv->mr = ib_get_dma_mr(priv->pd, IB_ACCESS_LOCAL_WRITE);
@@ -242,6 +252,10 @@  out_free_mr:
 
 out_free_pd:
 	ib_dealloc_pd(priv->pd);
+
+out_free_wq:
+	destroy_workqueue(priv->wq);
+	priv->wq = NULL;
 	return -ENODEV;
 }
 
@@ -270,6 +284,12 @@  void ipoib_transport_dev_cleanup(struct net_device *dev)
 
 	if (ib_dealloc_pd(priv->pd))
 		ipoib_warn(priv, "ib_dealloc_pd failed\n");
+
+	if (priv->wq) {
+		flush_workqueue(priv->wq);
+		destroy_workqueue(priv->wq);
+		priv->wq = NULL;
+	}
 }
 
 void ipoib_event(struct ib_event_handler *handler,