| Message ID | 20160905194759.GA26008@lst.de (mailing list archive) |
| --- | --- |
| State | New, archived |
On Mon, Sep 05, 2016 at 09:48:00PM +0200, Christoph Hellwig wrote:
> On Thu, Sep 01, 2016 at 10:24:10AM -0400, Keith Busch wrote:
> > On Thu, Sep 01, 2016 at 10:46:24AM +0200, Christoph Hellwig wrote:
> > > On Wed, Aug 31, 2016 at 12:38:53PM -0400, Keith Busch wrote:
> > > > This can't be right. We have a single affinity mask for the entire
> > > > set, but what I think we want is one affinity mask for each of the
> > > > nr_io_queues. The irq_create_affinity_mask should then create an array
> > > > of cpumasks based on nr_vecs.
> > >
> > > Nah, this is Thomas' creative abuse of the cpumask type. Every bit set
> > > in the affinity_mask means this is a cpu we allocate a vector / queue to.
> >
> > Yeah, I gathered that's what it was providing, but that's just barely
> > not enough information to do something useful. The CPUs that aren't set
> > have to use a previously assigned vector/queue, but which one?
>
> Always the previous one. Below is a patch to get us back to the
> previous behavior:

No, that's not right. Here's my topology info:

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 15745 MB
node 0 free: 15319 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 16150 MB
node 1 free: 15758 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

If I have 16 vectors, the affinity_mask generated by what you're doing
looks like 0000ffff, CPUs 0-15. So the first 16 bits are set since each
of those is the first unique CPU, getting a unique vector just like you
wanted. If an unset bit just means "share with the previous", then all of
my thread siblings (CPUs 16-31) get to share with CPU 15. That's awful!

What we want for my CPU topology is the 16th CPU to pair with CPU 0,
17 pairs with 1, 18 with 2, and so on. You can't convey that information
with this scheme. We need affinity_masks per vector.
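To make the complaint concrete, the two mappings can be written down as a tiny userspace sketch (an illustration only, assuming this box's 16-core/32-thread enumeration; `desired_vector` and `prev_bit_vector` are hypothetical helpers, not kernel code):

```c
#define NR_CORES 16	/* physical cores on this box; CPUs 16-31 are HT siblings */

/*
 * Desired mapping: CPU n and its hyperthread sibling n + NR_CORES share
 * one vector, so sibling 16 pairs with CPU 0, 17 with 1, and so on.
 */
unsigned int desired_vector(unsigned int cpu)
{
	return cpu % NR_CORES;
}

/*
 * What the "unset bit shares with the previous set bit" rule gives for
 * the mask 0x0000ffff: every sibling lands on vector 15.
 */
unsigned int prev_bit_vector(unsigned int cpu)
{
	return cpu < NR_CORES ? cpu : NR_CORES - 1;
}
```

The single cpumask can distinguish "has its own vector" from "shares", but not *which* vector is shared, which is the information `desired_vector` needs.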
-- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[adding Thomas as it's about the affinity_mask he (we) added to the
IRQ core]

On Tue, Sep 06, 2016 at 10:39:28AM -0400, Keith Busch wrote:
> > Always the previous one. Below is a patch to get us back to the
> > previous behavior:
>
> No, that's not right.
>
> Here's my topology info:
>
> # numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
> node 0 size: 15745 MB
> node 0 free: 15319 MB
> node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
> node 1 size: 16150 MB
> node 1 free: 15758 MB
> node distances:
> node   0   1
>   0:  10  21
>   1:  21  10

How do you get that mapping? Does this CPU use Hyperthreading and
thus expose siblings using topology_sibling_cpumask? As that's the
only thing the old code used for any sort of special casing.

I'll need to see if I can find a system with such a mapping to reproduce.

> If I have 16 vectors, the affinity_mask generated by what you're doing
> looks like 0000ffff, CPUs 0-15. So the first 16 bits are set since each
> of those is the first unique CPU, getting a unique vector just like you
> wanted. If an unset bit just means share with the previous, then all of
> my thread siblings (CPUs 16-31) get to share with CPU 15. That's awful!
>
> What we want for my CPU topology is the 16th CPU to pair with CPU 0,
> 17 pairs with 1, 18 with 2, and so on. You can't convey that information
> with this scheme. We need affinity_masks per vector.

We actually have per-vector masks, but they are hidden inside the IRQ
core and awkward to use. We could do the get_first_sibling magic in
the blk-mq queue mapping (and in fact with the current code I guess we
need to). Or take a step back from trying to emulate the old code and
instead look at NUMA nodes instead of siblings, which some folks
suggested a while ago.
On Tue, Sep 06, 2016 at 06:50:56PM +0200, Christoph Hellwig wrote:
> [adding Thomas as it's about the affinity_mask he (we) added to the
> IRQ core]
>
> > Here's my topology info:
> >
> > # numactl --hardware
> > available: 2 nodes (0-1)
> > node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
> > node 0 size: 15745 MB
> > node 0 free: 15319 MB
> > node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
> > node 1 size: 16150 MB
> > node 1 free: 15758 MB
> > node distances:
> > node   0   1
> >   0:  10  21
> >   1:  21  10
>
> How do you get that mapping? Does this CPU use Hyperthreading and
> thus expose siblings using topology_sibling_cpumask? As that's the
> only thing the old code used for any sort of special casing.
>
> I'll need to see if I can find a system with such a mapping to reproduce.

Yes, this is a two-socket server with hyperthreading enabled. Numbering
the physical CPUs before the hyperthreads is a common numbering on x86,
so we're going to see this split numbering on any multi-socket
hyperthreaded server.

The topology_sibling_cpumask shows the right information. The resulting
mask for cpu 0 on my server is 0x00010001; cpu 1 is 0x00020002, etc.

> > What we want for my CPU topology is the 16th CPU to pair with CPU 0,
> > 17 pairs with 1, 18 with 2, and so on. You can't convey that information
> > with this scheme. We need affinity_masks per vector.
>
> We actually have per-vector masks, but they are hidden inside the IRQ
> core and awkward to use. We could do the get_first_sibling magic
> in the blk-mq queue mapping (and in fact with the current code I guess
> we need to). Or take a step back from trying to emulate the old code
> and instead look at NUMA nodes instead of siblings, which some folks
> suggested a while ago.

Adding the first-sibling magic in blk-mq would fix my specific case, but
it doesn't help generically when we need to pair more than just thread
siblings.
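The sibling masks Keith reports can be modeled directly (a sketch under the assumption that CPU n and CPU n + 16 are thread siblings on this box; `sibling_mask` is an illustrative stand-in for what topology_sibling_cpumask would report, not a kernel interface):

```c
/*
 * On this 2-socket, 16-core, HT-enabled box, CPU n and CPU n + 16 are
 * thread siblings, so each pair's sibling mask has exactly two bits set.
 */
unsigned int sibling_mask(unsigned int cpu)
{
	unsigned int core = cpu % 16;

	return (1u << core) | (1u << (core + 16));
}
```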
On Tue, 6 Sep 2016, Christoph Hellwig wrote:
> [adding Thomas as it's about the affinity_mask he (we) added to the
> IRQ core]
>
> On Tue, Sep 06, 2016 at 10:39:28AM -0400, Keith Busch wrote:
> > > Always the previous one. Below is a patch to get us back to the
> > > previous behavior:
> >
> > No, that's not right.
> >
> > Here's my topology info:
> >
> > # numactl --hardware
> > available: 2 nodes (0-1)
> > node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
> > node 0 size: 15745 MB
> > node 0 free: 15319 MB
> > node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
> > node 1 size: 16150 MB
> > node 1 free: 15758 MB
> > node distances:
> > node   0   1
> >   0:  10  21
> >   1:  21  10
>
> How do you get that mapping? Does this CPU use Hyperthreading and
> thus expose siblings using topology_sibling_cpumask? As that's the
> only thing the old code used for any sort of special casing.

That's a normal Intel mapping with two sockets and HT enabled. The cpu
enumeration is

	Socket0 - physical cores
	Socket1 - physical cores
	Socket0 - HT siblings
	Socket1 - HT siblings

> I'll need to see if I can find a system with such a mapping to reproduce.

Any 2-socket Intel with HT enabled will do. If you need access to one,
let me know.

> > If I have 16 vectors, the affinity_mask generated by what you're doing
> > looks like 0000ffff, CPUs 0-15. So the first 16 bits are set since each
> > of those is the first unique CPU, getting a unique vector just like you
> > wanted. If an unset bit just means share with the previous, then all of
> > my thread siblings (CPUs 16-31) get to share with CPU 15. That's awful!
> >
> > What we want for my CPU topology is the 16th CPU to pair with CPU 0,
> > 17 pairs with 1, 18 with 2, and so on. You can't convey that information
> > with this scheme. We need affinity_masks per vector.
>
> We actually have per-vector masks, but they are hidden inside the IRQ
> core and awkward to use. We could do the get_first_sibling magic
> in the blk-mq queue mapping (and in fact with the current code I guess
> we need to). Or take a step back from trying to emulate the old code
> and instead look at NUMA nodes instead of siblings, which some folks
> suggested a while ago.

I think you want both. NUMA nodes are certainly the first decision
factor. You split the number of vectors to the nodes:

	vecs_per_node = num_vectors / num_nodes;

Then you spread the vectors on each node by the number of cpus on that
node:

	cpus_per_vec = cpus_on(node) / vecs_per_node;

If the number of cpus per vector is <= 1 you just use a round-robin
scheme. If not, you need to look at siblings.

Looking at the whole thing, I think we need to be more clever when
setting up the msi descriptor affinity masks. I'll send an RFC series
soon.

Thanks,

	tglx
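The two-level split Thomas describes can be sketched in userspace (NUM_NODES and CPUS_PER_NODE are assumptions matching Keith's machine, and the per-node cpu count stands in for the `cpus_on(node)` from the mail; this is a model, not kernel code):

```c
enum { NUM_NODES = 2, CPUS_PER_NODE = 16 };	/* Keith's topology */

/* First level: split the vector budget evenly across NUMA nodes. */
unsigned int vecs_per_node(unsigned int num_vectors)
{
	return num_vectors / NUM_NODES;
}

/* Second level: how many cpus each vector on a node has to cover. */
unsigned int cpus_per_vec(unsigned int num_vectors)
{
	return CPUS_PER_NODE / vecs_per_node(num_vectors);
}
```

With 16 vectors this yields 8 vectors per node and 2 cpus per vector, i.e. more than one cpu per vector, which is exactly the case where the sibling-aware pass would kick in.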
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 32f6cfc..09d4407 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -29,6 +29,8 @@ struct cpumask *irq_create_affinity_mask(unsigned int *nr_vecs)
 {
 	struct cpumask *affinity_mask;
 	unsigned int max_vecs = *nr_vecs;
+	unsigned int nr_cpus = 0, nr_uniq_cpus = 0, cpu;
+	unsigned int vec = 0, prev = -1, idx = 0;
 
 	if (max_vecs == 1)
 		return NULL;
@@ -40,24 +42,27 @@ struct cpumask *irq_create_affinity_mask(unsigned int *nr_vecs)
 	}
 
 	get_online_cpus();
-	if (max_vecs >= num_online_cpus()) {
-		cpumask_copy(affinity_mask, cpu_online_mask);
-		*nr_vecs = num_online_cpus();
-	} else {
-		unsigned int vecs = 0, cpu;
-
-		for_each_online_cpu(cpu) {
-			if (cpu == get_first_sibling(cpu)) {
-				cpumask_set_cpu(cpu, affinity_mask);
-				vecs++;
-			}
-
-			if (--max_vecs == 0)
-				break;
-		}
-		*nr_vecs = vecs;
+	for_each_online_cpu(cpu) {
+		nr_cpus++;
+		if (cpu == get_first_sibling(cpu))
+			nr_uniq_cpus++;
+	}
+
+	for_each_online_cpu(cpu) {
+		if (max_vecs >= nr_cpus || nr_cpus == nr_uniq_cpus)
+			vec = idx * max_vecs / nr_cpus;
+		else if (cpu == get_first_sibling(cpu))
+			vec = idx * max_vecs / nr_uniq_cpus;
+		else
+			continue;
+
+		if (vec != prev)
+			cpumask_set_cpu(cpu, affinity_mask);
+		prev = vec;
+		idx++;
 	}
 	put_online_cpus();
 
+	*nr_vecs = idx;
 	return affinity_mask;
 }
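The distribution arithmetic in the patch can be checked with a userspace simulation (a sketch assuming Keith's 32-CPU box where get_first_sibling(cpu) == cpu % 16; `simulate_mask` replays only the patch's second loop):

```c
enum { NR_CPUS = 32, NR_UNIQ = 16 };	/* Keith's topology */

/* Models get_first_sibling() for this sibling enumeration. */
static unsigned int first_sibling(unsigned int cpu)
{
	return cpu % NR_UNIQ;
}

/*
 * Replays the patch's distribution loop and returns the affinity mask it
 * would build; the resulting vector count is stored in *nr_vecs.
 */
unsigned int simulate_mask(unsigned int max_vecs, unsigned int *nr_vecs)
{
	unsigned int mask = 0, cpu, vec = 0, prev = -1u, idx = 0;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (max_vecs >= NR_CPUS || NR_CPUS == NR_UNIQ)
			vec = idx * max_vecs / NR_CPUS;
		else if (cpu == first_sibling(cpu))
			vec = idx * max_vecs / NR_UNIQ;
		else
			continue;

		if (vec != prev)
			mask |= 1u << cpu;
		prev = vec;
		idx++;
	}
	*nr_vecs = idx;
	return mask;
}
```

With max_vecs = 16 this sets exactly bits 0-15 (mask 0x0000ffff): only the first siblings get bits, so the sibling-to-vector pairing Keith asks for still has to happen somewhere else, e.g. in the blk-mq queue mapping.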