diff mbox

[2/9] IB: add a proper completion queue abstraction

Message ID 564F4F38.9040505@sandisk.com (mailing list archive)
State Superseded
Headers show

Commit Message

Bart Van Assche Nov. 20, 2015, 4:50 p.m. UTC
On 11/20/2015 02:16 AM, Christoph Hellwig wrote:
> On Wed, Nov 18, 2015 at 10:20:14AM -0800, Bart Van Assche wrote:
>> Are you perhaps referring to the sysfs CPU mask that allows to control
>> workqueue affinity ?
> 
> I think he is referring to the definition of WQ_UNBOUND:
> 
>    WQ_UNBOUND
> 
> 	Work items queued to an unbound wq are served by the special
> 	worker-pools which host workers which are not bound to any
> 	specific CPU.  This makes the wq behave as a simple execution
> 	context provider without concurrency management.  The unbound
> 	worker-pools try to start execution of work items as soon as
> 	possible.  Unbound wq sacrifices locality but is useful for
> 	the following cases.
> 
> 	* Wide fluctuation in the concurrency level requirement is
> 	  expected and using bound wq may end up creating large number
> 	  of mostly unused workers across different CPUs as the issuer
> 	  hops through different CPUs.
> 
> 	* Long running CPU intensive workloads which can be better
> 	  managed by the system scheduler.
 
Hello Christoph,

The comment about locality in the above quote is interesting. How about
modifying patch 2/9 as indicated below? The modification below does not
change the behavior of this patch as long as ib_cq.w.cpu is not modified,
and it allows users who care about locality, and who want to skip the
scheduler overhead, to set ib_cq.w.cpu to the index of the CPU on which
they want the work to be processed.
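
To make the intended usage concrete, here is a non-runnable kernel sketch
of how a ULP could use the new field after the modification below;
desired_cpu is a hypothetical value chosen by the caller, not something
from this patch:

```c
/* Sketch only: allocate a CQ in workqueue mode, then pin its polling
 * work to a chosen CPU via the new ib_cq.w.cpu field. */
struct ib_cq *cq;

cq = ib_alloc_cq(dev, private, nr_cqe, comp_vector, IB_POLL_WORKQUEUE);
if (IS_ERR(cq))
	return PTR_ERR(cq);

/* The default set by ib_alloc_cq() is WORK_CPU_UNBOUND; overriding it
 * makes ib_cq_completion_workqueue() queue the work on this CPU. */
cq->w.cpu = desired_cpu;
```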

Thanks,

Bart.

---
 drivers/infiniband/core/cq.c | 11 ++++++-----
 include/rdma/ib_verbs.h      |  5 ++++-
 2 files changed, 10 insertions(+), 6 deletions(-)

Comments

Sagi Grimberg Nov. 22, 2015, 9:51 a.m. UTC | #1
> Hello Christoph,
>
> The comment about locality in the above quote is interesting. How about
> modifying patch 2/9 as indicated below? The modification below does not
> change the behavior of this patch as long as ib_cq.w.cpu is not modified,
> and it allows users who care about locality, and who want to skip the
> scheduler overhead, to set ib_cq.w.cpu to the index of the CPU on which
> they want the work to be processed.

That sounds acceptable...
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Nov. 22, 2015, 10:13 a.m. UTC | #2
On Sun, Nov 22, 2015 at 11:51:13AM +0200, Sagi Grimberg wrote:
>
>> Hello Christoph,
>>
>> The comment about locality in the above quote is interesting. How about
>> modifying patch 2/9 as indicated below? The modification below does not
>> change the behavior of this patch as long as ib_cq.w.cpu is not modified,
>> and it allows users who care about locality, and who want to skip the
>> scheduler overhead, to set ib_cq.w.cpu to the index of the CPU on which
>> they want the work to be processed.
>
> That sounds acceptable...

Wouldn't it be a better idea to set the WQ_SYSFS interface and use
the standard sysfs interface for specifying cpumasks or node affinity?
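
For reference, a workqueue created with WQ_SYSFS appears under
/sys/devices/virtual/workqueue/ and, when it is also unbound, exposes a
cpumask attribute there. Assuming ib_comp_wq were created with that flag,
affinity could then be adjusted from userspace roughly as follows (the
path only exists once the flag is actually set on the workqueue):

```
# Restrict the CQ completion workers to CPUs 0-3 (hex mask).
echo 0f > /sys/devices/virtual/workqueue/ib-comp-wq/cpumask
```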
Sagi Grimberg Nov. 22, 2015, 10:36 a.m. UTC | #3
> Wouldn't it be a better idea to set the WQ_SYSFS interface and use
> the standard sysfs interface for specifying cpumasks or node affinity?

I think that Bart wants to allow the caller to select CPU affinity
per CQ. In this case ib_alloc_cq in workqueue mode would need to
accept an affinity_hint from the caller (defaulting to the wild-card
WORK_CPU_UNBOUND).
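
One non-runnable kernel sketch of what that could look like (the wrapper
name and the extra parameter are hypothetical, not part of the posted
patch):

```c
/* Hypothetical wrapper: let the caller pass a CPU affinity hint for
 * IB_POLL_WORKQUEUE mode.  Passing WORK_CPU_UNBOUND preserves the
 * current wild-card behavior. */
static struct ib_cq *ib_alloc_cq_affine(struct ib_device *dev,
					void *private, int nr_cqe,
					int comp_vector,
					enum ib_poll_context poll_ctx,
					int affinity_hint)
{
	struct ib_cq *cq = ib_alloc_cq(dev, private, nr_cqe,
				       comp_vector, poll_ctx);

	if (!IS_ERR(cq) && poll_ctx == IB_POLL_WORKQUEUE)
		cq->w.cpu = affinity_hint;
	return cq;
}
```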
Christoph Hellwig Nov. 22, 2015, 1:23 p.m. UTC | #4
On Sun, Nov 22, 2015 at 12:36:00PM +0200, Sagi Grimberg wrote:
>
>> Wouldn't it be a better idea to set the WQ_SYSFS interface and use
>> the standard sysfs interface for specifying cpumasks or node affinity?
>
> I think that Bart wants to allow the caller to select CPU affinity
> per CQ. In this case ib_alloc_cq in workqueue mode would need to
> accept an affinity_hint from the caller (defaulting to the wild-card
> WORK_CPU_UNBOUND).

Hmm, true.  How would we set that hint from userspace?  I'd really prefer
to see a practical justification for it first.
Sagi Grimberg Nov. 22, 2015, 2:57 p.m. UTC | #5
>>
>> I think that Bart wants to allow the caller to select CPU affinity
>> per CQ. In this case ib_alloc_cq in workqueue mode would need to
>> accept an affinity_hint from the caller (defaulting to the wild-card
>> WORK_CPU_UNBOUND).
>
> Hmm, true.  How would we set that hint from userspace?  I'd really prefer
> to see a practical justification for it first.

In order to assign CPUs from user space we'd need an ethtool-like
interface for isert/srpt/<xxxt>. Given that this is something we don't
want to get into right now, I assumed that Bart meant that srpt would
take a "least used" approach in the srpt driver (which isn't better than
taking the wild-card option, I'd say), so I'll let Bart answer...
Bart Van Assche Nov. 22, 2015, 4:55 p.m. UTC | #6
On 11/22/15 06:57, Sagi Grimberg wrote:
>>> I think that Bart wants to allow the caller to select CPU affinity
>>> per CQ. In this case ib_alloc_cq in workqueue mode would need to
>>> accept an affinity_hint from the caller (defaulting to the wild-card
>>> WORK_CPU_UNBOUND).
>>
>> Hmm, true.  How would we set that hint from userspace?  I'd really prefer
>> to see a practical justification for it first.
>
> In order to assign CPUs from user space we'd need an ethtool-like
> interface for isert/srpt/<xxxt>. Given that this is something we don't
> want to get into right now, I assumed that Bart meant that srpt would
> take a "least used" approach in the srpt driver (which isn't better than
> taking the wild-card option, I'd say), so I'll let Bart answer...

Hello Christoph and Sagi,

My intention is indeed to make CPU affinity controllable per CQ. One use
case is to implement a least-used policy in RDMA drivers that use
multiple completion queues. Another use case is to make CPU affinity
configurable from user space, through something similar to ethtool or
via sysfs.

Bart.

Patch

diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
index bf2a079..4d80d8c 100644
--- a/drivers/infiniband/core/cq.c
+++ b/drivers/infiniband/core/cq.c
@@ -94,18 +94,18 @@  static void ib_cq_completion_softirq(struct ib_cq *cq, void *private)
 
 static void ib_cq_poll_work(struct work_struct *work)
 {
-	struct ib_cq *cq = container_of(work, struct ib_cq, work);
+	struct ib_cq *cq = container_of(work, struct ib_cq, w.work);
 	int completed;
 
 	completed = __ib_process_cq(cq, IB_POLL_BUDGET_WORKQUEUE);
 	if (completed >= IB_POLL_BUDGET_WORKQUEUE ||
 	    ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0)
-		queue_work(ib_comp_wq, &cq->work);
+		queue_work_on(cq->w.cpu, ib_comp_wq, &cq->w.work);
 }
 
 static void ib_cq_completion_workqueue(struct ib_cq *cq, void *private)
 {
-	queue_work(ib_comp_wq, &cq->work);
+	queue_work_on(cq->w.cpu, ib_comp_wq, &cq->w.work);
 }
 
 /**
@@ -159,7 +159,8 @@  struct ib_cq *ib_alloc_cq(struct ib_device *dev, void *private,
 		break;
 	case IB_POLL_WORKQUEUE:
 		cq->comp_handler = ib_cq_completion_workqueue;
-		INIT_WORK(&cq->work, ib_cq_poll_work);
+		INIT_WORK(&cq->w.work, ib_cq_poll_work);
+		cq->w.cpu = WORK_CPU_UNBOUND;
 		ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
 		break;
 	default:
@@ -195,7 +196,7 @@  void ib_free_cq(struct ib_cq *cq)
 		irq_poll_disable(&cq->iop);
 		break;
 	case IB_POLL_WORKQUEUE:
-		flush_work(&cq->work);
+		flush_work(&cq->w.work);
 		break;
 	default:
 		WARN_ON_ONCE(1);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index f59a8d3..b1344f8 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1291,7 +1291,10 @@  struct ib_cq {
 	struct ib_wc		*wc;
 	union {
 		struct irq_poll		iop;
-		struct work_struct	work;
+		struct {
+			struct work_struct	work;
+			int			cpu;
+		} w;
 	};
 };