
[4/7] blk-mq: allow the driver to pass in an affinity mask

Message ID 20160905194759.GA26008@lst.de (mailing list archive)
State New, archived

Commit Message

Christoph Hellwig Sept. 5, 2016, 7:48 p.m. UTC
On Thu, Sep 01, 2016 at 10:24:10AM -0400, Keith Busch wrote:
> On Thu, Sep 01, 2016 at 10:46:24AM +0200, Christoph Hellwig wrote:
> > On Wed, Aug 31, 2016 at 12:38:53PM -0400, Keith Busch wrote:
> > > This can't be right. We have a single affinity mask for the entire
> > > set, but what I think we want is one affinity mask for each
> > > nr_io_queues. The irq_create_affinity_mask should then create an array
> > > of cpumasks based on nr_vecs..
> > 
> > Nah, this is Thomas' creative abuse of the cpumask type.  Every bit set
> > in the affinity_mask means this is a cpu we allocate a vector / queue to.
> 
> Yeah, I gathered that's what it was providing, but that's just barely
> not enough information to do something useful. The CPUs that aren't set
> have to use a previously assigned vector/queue, but which one?

Always the previous one.  Below is a patch to get us back to the
previous behavior:


Comments

Keith Busch Sept. 6, 2016, 2:39 p.m. UTC | #1
On Mon, Sep 05, 2016 at 09:48:00PM +0200, Christoph Hellwig wrote:
> On Thu, Sep 01, 2016 at 10:24:10AM -0400, Keith Busch wrote:
> > On Thu, Sep 01, 2016 at 10:46:24AM +0200, Christoph Hellwig wrote:
> > > On Wed, Aug 31, 2016 at 12:38:53PM -0400, Keith Busch wrote:
> > > > This can't be right. We have a single affinity mask for the entire
> > > > set, but what I think we want is one affinity mask for each
> > > > nr_io_queues. The irq_create_affinity_mask should then create an array
> > > > of cpumasks based on nr_vecs..
> > > 
> > > Nah, this is Thomas' creative abuse of the cpumask type.  Every bit set
> > > in the affinity_mask means this is a cpu we allocate a vector / queue to.
> > 
> > Yeah, I gathered that's what it was providing, but that's just barely
> > not enough information to do something useful. The CPUs that aren't set
> > have to use a previously assigned vector/queue, but which one?
> 
> Always the previous one.  Below is a patch to get us back to the
> previous behavior:

No, that's not right.

Here's my topology info:

  # numactl --hardware
  available: 2 nodes (0-1)
  node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
  node 0 size: 15745 MB
  node 0 free: 15319 MB
  node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
  node 1 size: 16150 MB
  node 1 free: 15758 MB
  node distances:
  node   0   1
    0:  10  21
    1:  21  10

If I have 16 vectors, the affinity_mask generated by what you're doing
looks like 0000ffff, CPUs 0-15. So the first 16 bits are set since each
of those is the first unique CPU, getting a unique vector just like you
wanted. If an unset bit just means share with the previous, then all of
my thread siblings (CPUs 16-31) get to share with CPU 15. That's awful!

What we want for my CPU topology is CPU 16 paired with CPU 0, 17 with 1,
18 with 2, and so on. You can't convey that information with this scheme.
We need affinity_masks per vector.
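
Purely to illustrate what I mean, not a proposal and not tested: the
function name, the allocation scheme and the scratch cpu_vec[] table
below are all made up. One cpumask per vector, with each thread sibling
landing in the same mask as its first sibling:

#include <linux/cpumask.h>
#include <linux/slab.h>
#include <linux/topology.h>

/* Sketch: return an array of nr_vecs masks, one per vector. */
static struct cpumask *create_vector_masks(unsigned int nr_vecs)
{
	struct cpumask *masks;
	unsigned int *cpu_vec;	/* vector chosen for each cpu */
	unsigned int cpu, first, vec = 0;

	masks = kcalloc(nr_vecs, sizeof(*masks), GFP_KERNEL);
	cpu_vec = kcalloc(nr_cpu_ids, sizeof(*cpu_vec), GFP_KERNEL);
	if (!masks || !cpu_vec) {
		kfree(masks);
		kfree(cpu_vec);
		return NULL;
	}

	for_each_online_cpu(cpu) {
		first = cpumask_first(topology_sibling_cpumask(cpu));
		if (cpu == first) {
			/* first thread of a core takes the next vector */
			cpu_vec[cpu] = vec++ % nr_vecs;
		} else {
			/*
			 * Thread sibling: share the vector of the first
			 * sibling.  Relies on the first sibling having the
			 * lower cpu number, as in the enumeration above.
			 */
			cpu_vec[cpu] = cpu_vec[first];
		}
		cpumask_set_cpu(cpu, &masks[cpu_vec[cpu]]);
	}

	kfree(cpu_vec);
	return masks;
}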
Christoph Hellwig Sept. 6, 2016, 4:50 p.m. UTC | #2
[adding Thomas as it's about the affinity_mask he (we) added to the
 IRQ core]

On Tue, Sep 06, 2016 at 10:39:28AM -0400, Keith Busch wrote:
> > Always the previous one.  Below is a patch to get us back to the
> > previous behavior:
> 
> No, that's not right.
> 
> Here's my topology info:
> 
>   # numactl --hardware
>   available: 2 nodes (0-1)
>   node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
>   node 0 size: 15745 MB
>   node 0 free: 15319 MB
>   node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
>   node 1 size: 16150 MB
>   node 1 free: 15758 MB
>   node distances:
>   node   0   1
>     0:  10  21
>     1:  21  10

How do you get that mapping?  Does this CPU use Hyperthreading and
thus expose siblings using topology_sibling_cpumask?  As that's the
only thing the old code used for any sort of special casing.

I'll need to see if I can find a system with such a mapping to reproduce.

> If I have 16 vectors, the affinity_mask generated by what you're doing
> looks like 0000ffff, CPUs 0-15. So the first 16 bits are set since each
> of those is the first unique CPU, getting a unique vector just like you
> wanted. If an unset bit just means share with the previous, then all of
> my thread siblings (CPUs 16-31) get to share with CPU 15. That's awful!
> 
> What we want for my CPU topology is CPU 16 paired with CPU 0, 17 with 1,
> 18 with 2, and so on. You can't convey that information with this scheme.
> We need affinity_masks per vector.

We actually have per-vector masks, but they are hidden inside the IRQ
core and awkward to use.  We could do the get_first_sibling magic
in the blk-mq queue mapping (and in fact with the current code I guess
we need to).  Or take a step back from trying to emulate the old code
and look at NUMA nodes instead of siblings, as some folks suggested
a while ago.
Keith Busch Sept. 6, 2016, 5:30 p.m. UTC | #3
On Tue, Sep 06, 2016 at 06:50:56PM +0200, Christoph Hellwig wrote:
> [adding Thomas as it's about the affinity_mask he (we) added to the
>  IRQ core]
> > Here's my topology info:
> > 
> >   # numactl --hardware
> >   available: 2 nodes (0-1)
> >   node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
> >   node 0 size: 15745 MB
> >   node 0 free: 15319 MB
> >   node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
> >   node 1 size: 16150 MB
> >   node 1 free: 15758 MB
> >   node distances:
> >   node   0   1
> >     0:  10  21
> >     1:  21  10
> 
> How do you get that mapping?  Does this CPU use Hyperthreading and
> thus expose siblings using topology_sibling_cpumask?  As that's the
> only thing the old code used for any sort of special casing.
> 
> I'll need to see if I can find a system with such a mapping to reproduce.

Yes, this is a two-socket server with hyperthreading enabled. Numbering
the physical CPUs before the hyperthread siblings is common on x86, so
we're going to see this split enumeration on any multi-socket
hyperthreaded server.

The topology_sibling_cpumask shows the right information. The resulting
mask from cpu 0 on my server is 0x00010001; cpu 1 is 0x00020002, etc...
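
Spelled out, purely as an illustration of reading those masks:

	/*
	 * cpumask_first(topology_sibling_cpumask(16)) == 0
	 * cpumask_first(topology_sibling_cpumask(17)) == 1
	 * ...which is exactly the 16/0, 17/1 pairing described above.
	 */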

> > What we want for my CPU topology is CPU 16 paired with CPU 0, 17 with 1,
> > 18 with 2, and so on. You can't convey that information with this scheme.
> > We need affinity_masks per vector.
> 
> We actually have per-vector masks, but they are hidden inside the IRQ
> core and awkward to use.  We could do the get_first_sibling magic
> in the blk-mq queue mapping (and in fact with the current code I guess
> we need to).  Or take a step back from trying to emulate the old code
> and look at NUMA nodes instead of siblings, as some folks suggested
> a while ago.

Adding the first sibling magic in blk-mq would fix my specific case,
but it doesn't help generically when we need to pair more than just thread
siblings.
Thomas Gleixner Sept. 7, 2016, 3:38 p.m. UTC | #4
On Tue, 6 Sep 2016, Christoph Hellwig wrote:

> [adding Thomas as it's about the affinity_mask he (we) added to the
>  IRQ core]
> 
> On Tue, Sep 06, 2016 at 10:39:28AM -0400, Keith Busch wrote:
> > > Always the previous one.  Below is a patch to get us back to the
> > > previous behavior:
> > 
> > No, that's not right.
> > 
> > Here's my topology info:
> > 
> >   # numactl --hardware
> >   available: 2 nodes (0-1)
> >   node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
> >   node 0 size: 15745 MB
> >   node 0 free: 15319 MB
> >   node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
> >   node 1 size: 16150 MB
> >   node 1 free: 15758 MB
> >   node distances:
> >   node   0   1
> >     0:  10  21
> >     1:  21  10
> 
> How do you get that mapping?  Does this CPU use Hyperthreading and
> thus expose siblings using topology_sibling_cpumask?  As that's the
> only thing the old code used for any sort of special casing.

That's a normal Intel mapping with two sockets and HT enabled. The cpu
enumeration is

Socket0 - physical cores
Socket1 - physical cores

Socket0 - HT siblings
Socket1 - HT siblings
 
> I'll need to see if I can find a system with such a mapping to reproduce.

Any 2-socket Intel with HT enabled will do. If you need access to one, let
me know.

> > If I have 16 vectors, the affinity_mask generated by what you're doing
> > looks like 0000ffff, CPUs 0-15. So the first 16 bits are set since each
> > of those is the first unique CPU, getting a unique vector just like you
> > wanted. If an unset bit just means share with the previous, then all of
> > my thread siblings (CPUs 16-31) get to share with CPU 15. That's awful!
> > 
> > What we want for my CPU topology is CPU 16 paired with CPU 0, 17 with 1,
> > 18 with 2, and so on. You can't convey that information with this scheme.
> > We need affinity_masks per vector.
> 
> We actually have per-vector masks, but they are hidden inside the IRQ
> core and awkward to use.  We could do the get_first_sibling magic
> in the blk-mq queue mapping (and in fact with the current code I guess
> we need to).  Or take a step back from trying to emulate the old code
> and look at NUMA nodes instead of siblings, as some folks suggested
> a while ago.

I think you want both.

NUMA nodes are certainly the first decision factor. You split the number of
vectors across the nodes:

  vecs_per_node = num_vector / num_nodes;

Then, per node, you work out how many cpus each of that node's vectors has
to cover:

  cpus_per_vec = cpus_on(node) / vecs_per_node;

If the number of cpus per vector is <= 1, you just use a round-robin
scheme. If not, you need to look at siblings.
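
Very roughly, as an untested sketch (it skips the sibling grouping for the
oversubscribed case and just takes a node's cpus in order, so it's not what
the RFC will do; "masks" is assumed to be an array of nr_vecs cpumasks):

#include <linux/cpumask.h>
#include <linux/kernel.h>
#include <linux/nodemask.h>
#include <linux/topology.h>

/* Sketch of the two-level spread described above. */
static void spread_vectors(struct cpumask *masks, unsigned int nr_vecs)
{
	unsigned int vecs_per_node = nr_vecs / nr_online_nodes;
	unsigned int vec = 0;
	int node;

	if (!vecs_per_node)
		vecs_per_node = 1;

	for_each_online_node(node) {
		const struct cpumask *node_cpus = cpumask_of_node(node);
		unsigned int cpus_per_vec =
			DIV_ROUND_UP(cpumask_weight(node_cpus), vecs_per_node);
		unsigned int assigned = 0, cpu;

		for_each_cpu(cpu, node_cpus) {
			cpumask_set_cpu(cpu, &masks[vec]);
			if (++assigned == cpus_per_vec && vec < nr_vecs - 1) {
				/* this vector is full, move on to the next */
				assigned = 0;
				vec++;
			}
		}
	}
}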

Looking at the whole thing, I think we need to be more clever when setting
up the msi descriptor affinity masks.

I'll send a RFC series soon.

Thanks,

	tglx

Patch

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 32f6cfc..09d4407 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -29,6 +29,8 @@  struct cpumask *irq_create_affinity_mask(unsigned int *nr_vecs)
 {
 	struct cpumask *affinity_mask;
 	unsigned int max_vecs = *nr_vecs;
+	unsigned int nr_cpus = 0, nr_uniq_cpus = 0, cpu;
+	unsigned int vec = 0, prev = -1, idx = 0;
 
 	if (max_vecs == 1)
 		return NULL;
@@ -40,24 +42,27 @@  struct cpumask *irq_create_affinity_mask(unsigned int *nr_vecs)
 	}
 
 	get_online_cpus();
-	if (max_vecs >= num_online_cpus()) {
-		cpumask_copy(affinity_mask, cpu_online_mask);
-		*nr_vecs = num_online_cpus();
-	} else {
-		unsigned int vecs = 0, cpu;
-
-		for_each_online_cpu(cpu) {
-			if (cpu == get_first_sibling(cpu)) {
-				cpumask_set_cpu(cpu, affinity_mask);
-				vecs++;
-			}
-
-			if (--max_vecs == 0)
-				break;
-		}
-		*nr_vecs = vecs;
+	for_each_online_cpu(cpu) {
+		nr_cpus++;
+		if (cpu == get_first_sibling(cpu))
+			nr_uniq_cpus++;
+	}
+
+	for_each_online_cpu(cpu) {
+		if (max_vecs >= nr_cpus || nr_cpus == nr_uniq_cpus)
+			vec = idx * max_vecs / nr_cpus;
+		else if (cpu == get_first_sibling(cpu))
+			vec = idx * max_vecs / nr_uniq_cpus;
+		else
+			continue;
+
+		if (vec != prev)
+			cpumask_set_cpu(cpu, affinity_mask);
+		prev = vec;
+		idx++;
 	}
 	put_online_cpus();
 
+	*nr_vecs = idx;
 	return affinity_mask;
 }