
blk-mq-rdma: remove queue mapping helper for rdma devices

Message ID 20230322123703.485544-1-sagi@grimberg.me (mailing list archive)
State New, archived
Series blk-mq-rdma: remove queue mapping helper for rdma devices

Commit Message

Sagi Grimberg March 22, 2023, 12:37 p.m. UTC
No rdma device exposes its irq vector affinity today, so the only
mapping we have left is the default blk_mq_map_queues, which we fall
back to anyway. Also fix up the only consumer of this helper
(nvme-rdma).

Remove this now-dead code.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 block/Kconfig               |  5 -----
 block/Makefile              |  1 -
 block/blk-mq-rdma.c         | 44 -------------------------------------
 drivers/nvme/host/rdma.c    |  7 ++----
 include/linux/blk-mq-rdma.h | 11 ----------
 5 files changed, 2 insertions(+), 66 deletions(-)
 delete mode 100644 block/blk-mq-rdma.c
 delete mode 100644 include/linux/blk-mq-rdma.h

Comments

Jason Gunthorpe March 22, 2023, 12:54 p.m. UTC | #1
On Wed, Mar 22, 2023 at 02:37:03PM +0200, Sagi Grimberg wrote:
> No rdma device exposes its irq vectors affinity today. So the only
> mapping that we have left, is the default blk_mq_map_queues, which
> we fallback to anyways. Also fixup the only consumer of this helper
> (nvme-rdma).

This was the only caller of ib_get_vector_affinity() so please delete 
op get_vector_affinity and ib_get_vector_affinity() from verbs as well

Jason
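
For reference, the verbs-side helper Jason is referring to is a thin
inline wrapper around the optional get_vector_affinity device op,
roughly as follows (a sketch based on include/rdma/ib_verbs.h; exact
details may differ in the tree being patched):

static inline const struct cpumask *
ib_get_vector_affinity(struct ib_device *device, int comp_vector)
{
	if (comp_vector < 0 || comp_vector >= device->num_comp_vectors ||
	    !device->ops.get_vector_affinity)
		return NULL;

	return device->ops.get_vector_affinity(device, comp_vector);
}

Removing the blk-mq consumer leaves both the op and this wrapper
without users.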
Sagi Grimberg March 22, 2023, 1 p.m. UTC | #2
>> No rdma device exposes its irq vectors affinity today. So the only
>> mapping that we have left, is the default blk_mq_map_queues, which
>> we fallback to anyways. Also fixup the only consumer of this helper
>> (nvme-rdma).
> 
> This was the only caller of ib_get_vector_affinity() so please delete
> op get_vector_affinity and ib_get_vector_affinity() from verbs as well

Yep, no problem.

Given that nvme-rdma was the only consumer, do you prefer this goes from
the nvme tree?
Jason Gunthorpe March 22, 2023, 1:50 p.m. UTC | #3
On Wed, Mar 22, 2023 at 03:00:08PM +0200, Sagi Grimberg wrote:
> 
> > > No rdma device exposes its irq vectors affinity today. So the only
> > > mapping that we have left, is the default blk_mq_map_queues, which
> > > we fallback to anyways. Also fixup the only consumer of this helper
> > > (nvme-rdma).
> > 
> > This was the only caller of ib_get_vector_affinity() so please delete
> > op get_vector_affinity and ib_get_vector_affinity() from verbs as well
> 
> Yep, no problem.
> 
> Given that nvme-rdma was the only consumer, do you prefer this goes from
> the nvme tree?

Sure, it is probably fine

Jason
Keith Busch March 22, 2023, 2:45 p.m. UTC | #4
On Wed, Mar 22, 2023 at 02:37:03PM +0200, Sagi Grimberg wrote:
> No rdma device exposes its irq vectors affinity today. So the only
> mapping that we have left, is the default blk_mq_map_queues, which
> we fallback to anyways. Also fixup the only consumer of this helper
> (nvme-rdma).
> 
> Remove this now dead code.

Thanks, this will make it even more practical to unify the queue setup for nvme
transports.

Acked-by: Keith Busch <kbusch@kernel.org>
Chaitanya Kulkarni March 22, 2023, 6:14 p.m. UTC | #5
On 3/22/23 05:37, Sagi Grimberg wrote:
> No rdma device exposes its irq vectors affinity today. So the only
> mapping that we have left, is the default blk_mq_map_queues, which
> we fallback to anyways. Also fixup the only consumer of this helper
> (nvme-rdma).
>
> Remove this now dead code.
>
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
>

Based on the discussion in the other thread on Keith's patch,
this looks good.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>

-ck
Leon Romanovsky March 23, 2023, 12:05 p.m. UTC | #6
On Wed, Mar 22, 2023 at 10:50:22AM -0300, Jason Gunthorpe wrote:
> On Wed, Mar 22, 2023 at 03:00:08PM +0200, Sagi Grimberg wrote:
> > 
> > > > No rdma device exposes its irq vectors affinity today. So the only
> > > > mapping that we have left, is the default blk_mq_map_queues, which
> > > > we fallback to anyways. Also fixup the only consumer of this helper
> > > > (nvme-rdma).
> > > 
> > > This was the only caller of ib_get_vector_affinity() so please delete
> > > op get_vector_affinity and ib_get_vector_affinity() from verbs as well
> > 
> > Yep, no problem.
> > 
> > Given that nvme-rdma was the only consumer, do you prefer this goes from
> > the nvme tree?
> 
> Sure, it is probably fine

I tried to do it two+ years ago:
https://lore.kernel.org/all/20200929091358.421086-1-leon@kernel.org

> 
> Jason
Jason Gunthorpe March 23, 2023, 1:03 p.m. UTC | #7
On Thu, Mar 23, 2023 at 02:05:15PM +0200, Leon Romanovsky wrote:
> On Wed, Mar 22, 2023 at 10:50:22AM -0300, Jason Gunthorpe wrote:
> > On Wed, Mar 22, 2023 at 03:00:08PM +0200, Sagi Grimberg wrote:
> > > 
> > > > > No rdma device exposes its irq vectors affinity today. So the only
> > > > > mapping that we have left, is the default blk_mq_map_queues, which
> > > > > we fallback to anyways. Also fixup the only consumer of this helper
> > > > > (nvme-rdma).
> > > > 
> > > > This was the only caller of ib_get_vector_affinity() so please delete
> > > > op get_vector_affinity and ib_get_vector_affinity() from verbs as well
> > > 
> > > Yep, no problem.
> > > 
> > > Given that nvme-rdma was the only consumer, do you prefer this goes from
> > > the nvme tree?
> > 
> > Sure, it is probably fine
> 
> I tried to do it two+ years ago:
> https://lore.kernel.org/all/20200929091358.421086-1-leon@kernel.org

Christoph's points make sense, but I think we should still purge this
code.

If we want to do proper managed affinity the right RDMA API is to
directly ask for the desired CPU binding when creating the CQ, and
optionally a way to change the CPU binding of the CQ at runtime.

This obfuscated 'comp vector number' thing is nonsensical for a kAPI -
creating a CQ on a random CPU then trying to backwards figure out what
CPU it was created on is silly.

Jason
Sagi Grimberg March 23, 2023, 3:07 p.m. UTC | #8
>>>>>> No rdma device exposes its irq vectors affinity today. So the only
>>>>>> mapping that we have left, is the default blk_mq_map_queues, which
>>>>>> we fallback to anyways. Also fixup the only consumer of this helper
>>>>>> (nvme-rdma).
>>>>>
>>>>> This was the only caller of ib_get_vector_affinity() so please delete
>>>>> op get_vector_affinity and ib_get_vector_affinity() from verbs as well
>>>>
>>>> Yep, no problem.
>>>>
>>>> Given that nvme-rdma was the only consumer, do you prefer this goes from
>>>> the nvme tree?
>>>
>>> Sure, it is probably fine
>>
>> I tried to do it two+ years ago:
>> https://lore.kernel.org/all/20200929091358.421086-1-leon@kernel.org
> 
> Christoph's points make sense, but I think we should still purge this
> code.
> 
> If we want to do proper managed affinity the right RDMA API is to
> directly ask for the desired CPU binding when creating the CQ, and
> optionally a way to change the CPU binding of the CQ at runtime.

I think the affinity management is referring to IRQD_AFFINITY_MANAGED
which IIRC is the case when the device passes `struct irq_affinity` to
pci_alloc_irq_vectors_affinity.

Not sure what that has to do with passing a cpu to create_cq.
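
For context, a minimal sketch of the managed-affinity case being
referred to: a PCI driver passes a struct irq_affinity so the irq core
spreads the vectors and marks them managed (the resulting masks are
then not adjustable from procfs). The function and the
"pdev"/"nr_io_queues" names below are placeholders, not taken from a
real driver:

#include <linux/interrupt.h>
#include <linux/pci.h>

static int example_alloc_managed_vectors(struct pci_dev *pdev,
					 unsigned int nr_io_queues)
{
	struct irq_affinity affd = {
		.pre_vectors = 1,	/* e.g. one non-managed vector for admin */
	};

	/* one managed vector per I/O queue, plus the reserved one */
	return pci_alloc_irq_vectors_affinity(pdev, 2, nr_io_queues + 1,
					      PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY,
					      &affd);
}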

> This obfuscated 'comp vector number' thing is nonsensical for a kAPI -
> creating a CQ on a random CPU then trying to backwards figure out what
> CPU it was created on is silly.

I don't remember if the comp_vector maps 1:1 to an irq vector; if it
doesn't, then it is indeed obfuscated. But a similar model is heavily
used by the network stack with cpu_rmap, which is where this was
derived from.
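
For context, a rough sketch of that cpu_rmap model as network drivers
use it for accelerated RFS (CONFIG_RFS_ACCEL); the helpers are recalled
from linux/cpu_rmap.h, the function and "irqs"/"nr_queues" names are
placeholders, and the error handling is simplified:

#include <linux/cpu_rmap.h>
#include <linux/netdevice.h>

static int example_setup_rx_cpu_rmap(struct net_device *netdev,
				     const int *irqs, unsigned int nr_queues)
{
	unsigned int i;

	/* reverse map from CPU to the "closest" completion queue */
	netdev->rx_cpu_rmap = alloc_irq_cpu_rmap(nr_queues);
	if (!netdev->rx_cpu_rmap)
		return -ENOMEM;

	for (i = 0; i < nr_queues; i++) {
		int rc = irq_cpu_rmap_add(netdev->rx_cpu_rmap, irqs[i]);

		if (rc) {
			free_irq_cpu_rmap(netdev->rx_cpu_rmap);
			netdev->rx_cpu_rmap = NULL;
			return rc;
		}
	}

	return 0;
}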

But regardless, it's been two years, it is effectively dead code, and
not a single user has complained about missing it. So we can safely
purge them, and if someone cares about it, we can debate adding it
back.
Jason Gunthorpe March 23, 2023, 3:57 p.m. UTC | #9
On Thu, Mar 23, 2023 at 05:07:24PM +0200, Sagi Grimberg wrote:
> 
> > > > > > > No rdma device exposes its irq vectors affinity today. So the only
> > > > > > > mapping that we have left, is the default blk_mq_map_queues, which
> > > > > > > we fallback to anyways. Also fixup the only consumer of this helper
> > > > > > > (nvme-rdma).
> > > > > > 
> > > > > > This was the only caller of ib_get_vector_affinity() so please delete
> > > > > > op get_vector_affinity and ib_get_vector_affinity() from verbs as well
> > > > > 
> > > > > Yep, no problem.
> > > > > 
> > > > > Given that nvme-rdma was the only consumer, do you prefer this goes from
> > > > > the nvme tree?
> > > > 
> > > > Sure, it is probably fine
> > > 
> > > I tried to do it two+ years ago:
> > > https://lore.kernel.org/all/20200929091358.421086-1-leon@kernel.org
> > 
> > Christoph's points make sense, but I think we should still purge this
> > code.
> > 
> > If we want to do proper managed affinity the right RDMA API is to
> > directly ask for the desired CPU binding when creating the CQ, and
> > optionally a way to change the CPU binding of the CQ at runtime.
> 
> I think the affinity management is referring to IRQD_AFFINITY_MANAGED
> which IIRC is the case when the device passes `struct irq_affinity` to
> pci_alloc_irq_vectors_affinity.
> 
> Not sure what that has to do with passing a cpu to create_cq.

I took Christoph's remarks to be that the system should auto configure
interrupts sensibly and not rely on userspace messing around in proc.

For instance, I would expect the NVMe driver to work the same way on
RDMA and PCI. For PCI it calls pci_alloc_irq_vectors_affinity(); RDMA
should call some ib_alloc_cq_affinity() and generate the affinity in
exactly the same way.
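
Purely as an illustration of that idea (no such API exists today; the
name and signature below are hypothetical), it could look like a
variant of ib_alloc_cq() that takes an irq_affinity descriptor instead
of a comp_vector:

/* hypothetical: let the core place the CQ interrupt, like NVMe PCI */
struct ib_cq *ib_alloc_cq_affinity(struct ib_device *dev, void *private,
				   int nr_cqe, enum ib_poll_context poll_ctx,
				   const struct irq_affinity *affd);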

So, I have no problem to delete these things as the
get_vector_affinity API is not part of solving the affinity problem,
and it seems NVMe PCI doesn't need blk_mq_rdma_map_queues() either.

Jason
Sagi Grimberg March 26, 2023, 7:12 a.m. UTC | #10
>>>>>>>> No rdma device exposes its irq vectors affinity today. So the only
>>>>>>>> mapping that we have left, is the default blk_mq_map_queues, which
>>>>>>>> we fallback to anyways. Also fixup the only consumer of this helper
>>>>>>>> (nvme-rdma).
>>>>>>>
>>>>>>> This was the only caller of ib_get_vector_affinity() so please delete
>>>>>>> op get_vector_affinity and ib_get_vector_affinity() from verbs as well
>>>>>>
>>>>>> Yep, no problem.
>>>>>>
>>>>>> Given that nvme-rdma was the only consumer, do you prefer this goes from
>>>>>> the nvme tree?
>>>>>
>>>>> Sure, it is probably fine
>>>>
>>>> I tried to do it two+ years ago:
>>>> https://lore.kernel.org/all/20200929091358.421086-1-leon@kernel.org
>>>
>>> Christoph's points make sense, but I think we should still purge this
>>> code.
>>>
>>> If we want to do proper managed affinity the right RDMA API is to
>>> directly ask for the desired CPU binding when creating the CQ, and
>>> optionally a way to change the CPU binding of the CQ at runtime.
>>
>> I think the affinity management is referring to IRQD_AFFINITY_MANAGED
>> which IIRC is the case when the device passes `struct irq_affinity` to
>> pci_alloc_irq_vectors_affinity.
>>
>> Not sure what that has to do with passing a cpu to create_cq.
> 
> I took Christoph's remarks to be that the system should auto configure
> interrupts sensibly and not rely on userspace messing around in proc.

Yes, that is correct.

> For instance, I would expect that the NVMe driver work the same way on
> RDMA and PCI. For PCI it calls pci_alloc_irq_vectors_affinity(), RDMA
> should call some ib_alloc_cq_affinity() and generate the affinity in
> exactly the same way.

But an RDMA ulp does not own the EQs the way the nvme driver does.
That is why NVMe is fine with managed affinity and RDMA is not.
The initial attempt was to make RDMA use managed affinity, but then
users started complaining that they were unable to tweak the irq
vector affinity via procfs.

> So, I have no problem to delete these things as the
> get_vector_affinity API is not part of solving the affinity problem,
> and it seems NVMe PCI doesn't need blk_mq_rdma_map_queues() either.

Cool.
Christoph Hellwig March 26, 2023, 11:16 p.m. UTC | #11
On Thu, Mar 23, 2023 at 10:03:25AM -0300, Jason Gunthorpe wrote:
> > > > Given that nvme-rdma was the only consumer, do you prefer this goes from
> > > > the nvme tree?
> > > 
> > > Sure, it is probably fine
> > 
> > I tried to do it two+ years ago:
> > https://lore.kernel.org/all/20200929091358.421086-1-leon@kernel.org
> 
> Christoph's points make sense, but I think we should still purge this
> code.

Given that we don't keep dead code around in the kernel as policy
we should probably remove it.  That being said, I'm really sad about
this, as I think what the RDMA code does here right now is pretty
broken.

> If we want to do proper managed affinity the right RDMA API is to
> directly ask for the desired CPU binding when creating the CQ, and
> optionally a way to change the CPU binding of the CQ at runtime.

Changing the bindings causes a lot of nasty interactions with CPU
hotplug.  The managed affinity and the way blk-mq interacts with it
is designed around the hotunplug notifier quiescing the queues,
and I'm not sure we can get everything right without a strict
binding to a set of CPUs.

> This obfuscated 'comp vector number' thing is nonsensical for a kAPI -
> creating a CQ on a random CPU then trying to backwards figure out what
> CPU it was created on is silly.

Yes.
Jason Gunthorpe March 27, 2023, 12:56 a.m. UTC | #12
On Mon, Mar 27, 2023 at 01:16:22AM +0200, Christoph Hellwig wrote:
> On Thu, Mar 23, 2023 at 10:03:25AM -0300, Jason Gunthorpe wrote:
> > > > > Given that nvme-rdma was the only consumer, do you prefer this goes from
> > > > > the nvme tree?
> > > > 
> > > > Sure, it is probably fine
> > > 
> > > I tried to do it two+ years ago:
> > > https://lore.kernel.org/all/20200929091358.421086-1-leon@kernel.org
> > 
> > Christoph's points make sense, but I think we should still purge this
> > code.
> 
> Given that we don't keep dead code around in the kernel as policy
> we should probably remove it.  That being said, I'm really sad about
> this, as I think what the RDMA code does here right now is pretty
> broken.

I don't know nvme well, but any affinity scheme that relies on setting
queue affinity through /proc/irq/*/smp_affinity is incompatible with
RDMA HW and more broadly incompatible with mlx5 HW.

The fundamental issue is that this class of HW can have more queues
than the CPUs can have interrupts. To make this work it has to mux
each MSI interrupt to N queues.

So, I think the only way it can really make sense is if the MSI
interrupt is per-cpu and the muxing is adjusted according to affinity
and hotplug needs.
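
As a toy illustration of that muxing (not taken from any driver): with
more queues than MSI vectors, each vector, and therefore its affinity,
ends up shared by several queues, e.g. via a simple round-robin:

/* toy example only: several queues share one MSI vector */
static unsigned int example_queue_to_vector(unsigned int queue,
					    unsigned int nr_vectors)
{
	return queue % nr_vectors;
}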

Thus, we'd need to see a scheme where something like nvme-cli directs
the affinity on the queue so it can flow down that way, as there is no
other obvious API that can manipulate a queue multiplexed on an MSI.

IIRC netdev sort of has this, in that you can set the number of queues
and the queues are laid out according to CPU number by default. So
requesting N-4 queues will reserve the last 4 CPUs for isolation, that
sort of thinking.

> > If we want to do proper managed affinity the right RDMA API is to
> > directly ask for the desired CPU binding when creating the CQ, and
> > optionally a way to change the CPU binding of the CQ at runtime.
> 
> Changing the bindings causes a lot of nasty interactions with CPU
> hotplug.  The managed affinity and the way blk-mq interacts with it
> is designed around the hotunplug notifier quiescing the queues,
> and I'm not sure we can get everything right without a strict
> binding to a set of CPUs.

Yeah, I think the netdev version is that the queues can't change until
the netdev is brought down and the queues destroyed. So from that view,
what you want is a tunable, provided at creation/startup time, for the
CPU mask you'd like the logical device to occupy.

I can imagine people isolating containers to certain CPU cores and
then wanting the netdevs and nvmef devices created only for that
container to follow the same CPU assignment.

So affinity at creation time does get a good whack of the use cases.

Jason
Christoph Hellwig April 12, 2023, 6:42 a.m. UTC | #13
Thanks,

applied to nvme-6.4.

Patch

diff --git a/block/Kconfig b/block/Kconfig
index 5d9d9c84d516..0b7c13f9a089 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -215,11 +215,6 @@  config BLK_MQ_VIRTIO
 	depends on VIRTIO
 	default y
 
-config BLK_MQ_RDMA
-	bool
-	depends on INFINIBAND
-	default y
-
 config BLK_PM
 	def_bool PM
 
diff --git a/block/Makefile b/block/Makefile
index 4e01bb71ad6e..b31b05390749 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -30,7 +30,6 @@  obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o blk-integrity.o
 obj-$(CONFIG_BLK_DEV_INTEGRITY_T10)	+= t10-pi.o
 obj-$(CONFIG_BLK_MQ_PCI)	+= blk-mq-pci.o
 obj-$(CONFIG_BLK_MQ_VIRTIO)	+= blk-mq-virtio.o
-obj-$(CONFIG_BLK_MQ_RDMA)	+= blk-mq-rdma.o
 obj-$(CONFIG_BLK_DEV_ZONED)	+= blk-zoned.o
 obj-$(CONFIG_BLK_WBT)		+= blk-wbt.o
 obj-$(CONFIG_BLK_DEBUG_FS)	+= blk-mq-debugfs.o
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
deleted file mode 100644
index 29c1f4d6eb04..000000000000
--- a/block/blk-mq-rdma.c
+++ /dev/null
@@ -1,44 +0,0 @@ 
-// SPDX-License-Identifier: GPL-2.0
-/*
- * Copyright (c) 2017 Sagi Grimberg.
- */
-#include <linux/blk-mq.h>
-#include <linux/blk-mq-rdma.h>
-#include <rdma/ib_verbs.h>
-
-/**
- * blk_mq_rdma_map_queues - provide a default queue mapping for rdma device
- * @map:	CPU to hardware queue map.
- * @dev:	rdma device to provide a mapping for.
- * @first_vec:	first interrupt vectors to use for queues (usually 0)
- *
- * This function assumes the rdma device @dev has at least as many available
- * interrupt vetors as @set has queues.  It will then query it's affinity mask
- * and built queue mapping that maps a queue to the CPUs that have irq affinity
- * for the corresponding vector.
- *
- * In case either the driver passed a @dev with less vectors than
- * @set->nr_hw_queues, or @dev does not provide an affinity mask for a
- * vector, we fallback to the naive mapping.
- */
-void blk_mq_rdma_map_queues(struct blk_mq_queue_map *map,
-		struct ib_device *dev, int first_vec)
-{
-	const struct cpumask *mask;
-	unsigned int queue, cpu;
-
-	for (queue = 0; queue < map->nr_queues; queue++) {
-		mask = ib_get_vector_affinity(dev, first_vec + queue);
-		if (!mask)
-			goto fallback;
-
-		for_each_cpu(cpu, mask)
-			map->mq_map[cpu] = map->queue_offset + queue;
-	}
-
-	return;
-
-fallback:
-	blk_mq_map_queues(map);
-}
-EXPORT_SYMBOL_GPL(blk_mq_rdma_map_queues);
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index bbad26b82b56..3a62867b6cab 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -12,7 +12,6 @@ 
 #include <linux/string.h>
 #include <linux/atomic.h>
 #include <linux/blk-mq.h>
-#include <linux/blk-mq-rdma.h>
 #include <linux/blk-integrity.h>
 #include <linux/types.h>
 #include <linux/list.h>
@@ -2163,10 +2162,8 @@  static void nvme_rdma_map_queues(struct blk_mq_tag_set *set)
 			ctrl->io_queues[HCTX_TYPE_DEFAULT];
 		set->map[HCTX_TYPE_READ].queue_offset = 0;
 	}
-	blk_mq_rdma_map_queues(&set->map[HCTX_TYPE_DEFAULT],
-			ctrl->device->dev, 0);
-	blk_mq_rdma_map_queues(&set->map[HCTX_TYPE_READ],
-			ctrl->device->dev, 0);
+	blk_mq_map_queues(&set->map[HCTX_TYPE_DEFAULT]);
+	blk_mq_map_queues(&set->map[HCTX_TYPE_READ]);
 
 	if (opts->nr_poll_queues && ctrl->io_queues[HCTX_TYPE_POLL]) {
 		/* map dedicated poll queues only if we have queues left */
diff --git a/include/linux/blk-mq-rdma.h b/include/linux/blk-mq-rdma.h
deleted file mode 100644
index 53b58c610e76..000000000000
--- a/include/linux/blk-mq-rdma.h
+++ /dev/null
@@ -1,11 +0,0 @@ 
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _LINUX_BLK_MQ_RDMA_H
-#define _LINUX_BLK_MQ_RDMA_H
-
-struct blk_mq_tag_set;
-struct ib_device;
-
-void blk_mq_rdma_map_queues(struct blk_mq_queue_map *map,
-		struct ib_device *dev, int first_vec);
-
-#endif /* _LINUX_BLK_MQ_RDMA_H */