diff mbox series

[blk-next,1/2] blk-mq-rdma: Delete not-used multi-queue RDMA map queue code

Message ID 20200929091358.421086-2-leon@kernel.org (mailing list archive)
State Changes Requested
Headers show
Series Delete the get_vector_affinity leftovers | expand

Commit Message

Leon Romanovsky Sept. 29, 2020, 9:13 a.m. UTC
From: Leon Romanovsky <leonro@nvidia.com>

The RDMA vector affinity code is not backed up by any driver and always
returns NULL to every ib_get_vector_affinity() call.

This means that blk_mq_rdma_map_queues() always takes fallback path.

Fixes: 9afc97c29b03 ("mlx5: remove support for ib_get_vector_affinity")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 block/Kconfig               |  5 -----
 block/Makefile              |  1 -
 block/blk-mq-rdma.c         | 44 -------------------------------------
 drivers/nvme/host/rdma.c    |  7 ++----
 include/linux/blk-mq-rdma.h | 11 ----------
 5 files changed, 2 insertions(+), 66 deletions(-)
 delete mode 100644 block/blk-mq-rdma.c
 delete mode 100644 include/linux/blk-mq-rdma.h

--
2.26.2

Comments

Christoph Hellwig Sept. 29, 2020, 10:20 a.m. UTC | #1
On Tue, Sep 29, 2020 at 12:13:57PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> The RDMA vector affinity code is not backed up by any driver and always
> returns NULL to every ib_get_vector_affinity() call.
> 
> This means that blk_mq_rdma_map_queues() always takes fallback path.
> 
> Fixes: 9afc97c29b03 ("mlx5: remove support for ib_get_vector_affinity")
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>

So you guys totally broken the nvme queue assignment without even
telling anyone?  Great job!
Leon Romanovsky Sept. 29, 2020, 10:35 a.m. UTC | #2
On Tue, Sep 29, 2020 at 12:20:46PM +0200, Christoph Hellwig wrote:
> On Tue, Sep 29, 2020 at 12:13:57PM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> >
> > The RDMA vector affinity code is not backed up by any driver and always
> > returns NULL to every ib_get_vector_affinity() call.
> >
> > This means that blk_mq_rdma_map_queues() always takes fallback path.
> >
> > Fixes: 9afc97c29b03 ("mlx5: remove support for ib_get_vector_affinity")
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
>
> So you guys totally broken the nvme queue assignment without even
> telling anyone?  Great job!

Who is "you guys" and it wasn't silent either? I'm sure that Sagi knows the craft.
https://lore.kernel.org/linux-rdma/20181224221606.GA25780@ziepe.ca/

commit 759ace7832802eaefbca821b2b43a44ab896b449
Author: Sagi Grimberg <sagi@grimberg.me>
Date:   Thu Nov 1 13:08:07 2018 -0700

    i40iw: remove support for ib_get_vector_affinity

....

commit 9afc97c29b032af9a4112c2f4a02d5313b4dc71f
Author: Sagi Grimberg <sagi@grimberg.me>
Date:   Thu Nov 1 09:13:12 2018 -0700

    mlx5: remove support for ib_get_vector_affinity

Thanks
Sagi Grimberg Sept. 29, 2020, 6:24 p.m. UTC | #3
>>> From: Leon Romanovsky <leonro@nvidia.com>
>>>
>>> The RDMA vector affinity code is not backed up by any driver and always
>>> returns NULL to every ib_get_vector_affinity() call.
>>>
>>> This means that blk_mq_rdma_map_queues() always takes fallback path.
>>>
>>> Fixes: 9afc97c29b03 ("mlx5: remove support for ib_get_vector_affinity")
>>> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
>>
>> So you guys totally broken the nvme queue assignment without even
>> telling anyone?  Great job!
> 
> Who is "you guys" and it wasn't silent either? I'm sure that Sagi knows the craft.
> https://lore.kernel.org/linux-rdma/20181224221606.GA25780@ziepe.ca/
> 
> commit 759ace7832802eaefbca821b2b43a44ab896b449
> Author: Sagi Grimberg <sagi@grimberg.me>
> Date:   Thu Nov 1 13:08:07 2018 -0700
> 
>      i40iw: remove support for ib_get_vector_affinity
> 
> ....
> 
> commit 9afc97c29b032af9a4112c2f4a02d5313b4dc71f
> Author: Sagi Grimberg <sagi@grimberg.me>
> Date:   Thu Nov 1 09:13:12 2018 -0700
> 
>      mlx5: remove support for ib_get_vector_affinity
> 
> Thanks

Yes, basically usage of managed affinity caused people to report
regressions not being able to change irq affinity from procfs.

Back then I started a discussion with Thomas to make managed
affinity to still allow userspace to modify this, but this
was dropped at some point. So currently rdma cannot do
automatic irq affinitization out of the box.
Christoph Hellwig Oct. 2, 2020, 6:45 a.m. UTC | #4
On Tue, Sep 29, 2020 at 11:24:49AM -0700, Sagi Grimberg wrote:
> Yes, basically usage of managed affinity caused people to report
> regressions not being able to change irq affinity from procfs.

Well, why would they change it?  The whole point of the infrastructure
is that there is a single sane affinity setting for a given setup. Now
that setting needed some refinement from the original series (e.g. the
current series about only using housekeeping cpus if cpu isolation is
in use).  But allowing random users to modify affinity is just a receipe
for a trainwreck.

So I think we need to bring this back ASAP, as doing affinity right
out of the box is an absolute requirement for sane performance without
all the benchmarketing deep magic.
Sagi Grimberg Oct. 2, 2020, 8:20 p.m. UTC | #5
>> Yes, basically usage of managed affinity caused people to report
>> regressions not being able to change irq affinity from procfs.
> 
> Well, why would they change it?  The whole point of the infrastructure
> is that there is a single sane affinity setting for a given setup. Now
> that setting needed some refinement from the original series (e.g. the
> current series about only using housekeeping cpus if cpu isolation is
> in use).  But allowing random users to modify affinity is just a receipe
> for a trainwreck.

Well allowing people to mangle irq affinity settings seem to be a hard
requirement from the discussions in the past.

> So I think we need to bring this back ASAP, as doing affinity right
> out of the box is an absolute requirement for sane performance without
> all the benchmarketing deep magic.

Well, it's hard to say that setting custom irq affinity settings is
deemed non-useful to anyone and hence should be prevented. I'd expect
that irq settings have a sane default that works and if someone wants to
change it, it can but there should be no guarantees on optimal
performance. But IIRC this had some dependencies on drivers and some
more infrastructure to handle dynamic changes...
Christoph Hellwig Oct. 5, 2020, 8:38 a.m. UTC | #6
On Fri, Oct 02, 2020 at 01:20:35PM -0700, Sagi Grimberg wrote:
>> Well, why would they change it?  The whole point of the infrastructure
>> is that there is a single sane affinity setting for a given setup. Now
>> that setting needed some refinement from the original series (e.g. the
>> current series about only using housekeeping cpus if cpu isolation is
>> in use).  But allowing random users to modify affinity is just a receipe
>> for a trainwreck.
>
> Well allowing people to mangle irq affinity settings seem to be a hard
> requirement from the discussions in the past.
>
>> So I think we need to bring this back ASAP, as doing affinity right
>> out of the box is an absolute requirement for sane performance without
>> all the benchmarketing deep magic.
>
> Well, it's hard to say that setting custom irq affinity settings is
> deemed non-useful to anyone and hence should be prevented. I'd expect
> that irq settings have a sane default that works and if someone wants to
> change it, it can but there should be no guarantees on optimal
> performance. But IIRC this had some dependencies on drivers and some
> more infrastructure to handle dynamic changes...

The problem is that people change random settings.  We need to generalize
it into a sane API (e.g. the housekeeping CPUs thing which totally makes
sense).
Leon Romanovsky Oct. 6, 2020, 4:58 a.m. UTC | #7
On Mon, Oct 05, 2020 at 10:38:17AM +0200, Christoph Hellwig wrote:
> On Fri, Oct 02, 2020 at 01:20:35PM -0700, Sagi Grimberg wrote:
> >> Well, why would they change it?  The whole point of the infrastructure
> >> is that there is a single sane affinity setting for a given setup. Now
> >> that setting needed some refinement from the original series (e.g. the
> >> current series about only using housekeeping cpus if cpu isolation is
> >> in use).  But allowing random users to modify affinity is just a receipe
> >> for a trainwreck.
> >
> > Well allowing people to mangle irq affinity settings seem to be a hard
> > requirement from the discussions in the past.
> >
> >> So I think we need to bring this back ASAP, as doing affinity right
> >> out of the box is an absolute requirement for sane performance without
> >> all the benchmarketing deep magic.
> >
> > Well, it's hard to say that setting custom irq affinity settings is
> > deemed non-useful to anyone and hence should be prevented. I'd expect
> > that irq settings have a sane default that works and if someone wants to
> > change it, it can but there should be no guarantees on optimal
> > performance. But IIRC this had some dependencies on drivers and some
> > more infrastructure to handle dynamic changes...
>
> The problem is that people change random settings.  We need to generalize
> it into a sane API (e.g. the housekeeping CPUs thing which totally makes
> sense).

I don't see many people jump on the bandwagon, someone should do it, but
who will? I personally have no knowledge in that area to do anything
meaningful.

Thanks
diff mbox series

Patch

diff --git a/block/Kconfig b/block/Kconfig
index bbad5e8bbffe..8ede308a1343 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -227,11 +227,6 @@  config BLK_MQ_VIRTIO
 	depends on BLOCK && VIRTIO
 	default y

-config BLK_MQ_RDMA
-	bool
-	depends on BLOCK && INFINIBAND
-	default y
-
 config BLK_PM
 	def_bool BLOCK && PM

diff --git a/block/Makefile b/block/Makefile
index 8d841f5f986f..bbdc3e82308a 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -29,7 +29,6 @@  obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o blk-integrity.o
 obj-$(CONFIG_BLK_DEV_INTEGRITY_T10)	+= t10-pi.o
 obj-$(CONFIG_BLK_MQ_PCI)	+= blk-mq-pci.o
 obj-$(CONFIG_BLK_MQ_VIRTIO)	+= blk-mq-virtio.o
-obj-$(CONFIG_BLK_MQ_RDMA)	+= blk-mq-rdma.o
 obj-$(CONFIG_BLK_DEV_ZONED)	+= blk-zoned.o
 obj-$(CONFIG_BLK_WBT)		+= blk-wbt.o
 obj-$(CONFIG_BLK_DEBUG_FS)	+= blk-mq-debugfs.o
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
deleted file mode 100644
index 14f968e58b8f..000000000000
--- a/block/blk-mq-rdma.c
+++ /dev/null
@@ -1,44 +0,0 @@ 
-// SPDX-License-Identifier: GPL-2.0
-/*
- * Copyright (c) 2017 Sagi Grimberg.
- */
-#include <linux/blk-mq.h>
-#include <linux/blk-mq-rdma.h>
-#include <rdma/ib_verbs.h>
-
-/**
- * blk_mq_rdma_map_queues - provide a default queue mapping for rdma device
- * @map:	CPU to hardware queue map.
- * @dev:	rdma device to provide a mapping for.
- * @first_vec:	first interrupt vectors to use for queues (usually 0)
- *
- * This function assumes the rdma device @dev has at least as many available
- * interrupt vetors as @set has queues.  It will then query it's affinity mask
- * and built queue mapping that maps a queue to the CPUs that have irq affinity
- * for the corresponding vector.
- *
- * In case either the driver passed a @dev with less vectors than
- * @set->nr_hw_queues, or @dev does not provide an affinity mask for a
- * vector, we fallback to the naive mapping.
- */
-int blk_mq_rdma_map_queues(struct blk_mq_queue_map *map,
-		struct ib_device *dev, int first_vec)
-{
-	const struct cpumask *mask;
-	unsigned int queue, cpu;
-
-	for (queue = 0; queue < map->nr_queues; queue++) {
-		mask = ib_get_vector_affinity(dev, first_vec + queue);
-		if (!mask)
-			goto fallback;
-
-		for_each_cpu(cpu, mask)
-			map->mq_map[cpu] = map->queue_offset + queue;
-	}
-
-	return 0;
-
-fallback:
-	return blk_mq_map_queues(map);
-}
-EXPORT_SYMBOL_GPL(blk_mq_rdma_map_queues);
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 9e378d0a0c01..5989d4e35ef3 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -12,7 +12,6 @@ 
 #include <linux/string.h>
 #include <linux/atomic.h>
 #include <linux/blk-mq.h>
-#include <linux/blk-mq-rdma.h>
 #include <linux/types.h>
 #include <linux/list.h>
 #include <linux/mutex.h>
@@ -2171,10 +2170,8 @@  static int nvme_rdma_map_queues(struct blk_mq_tag_set *set)
 			ctrl->io_queues[HCTX_TYPE_DEFAULT];
 		set->map[HCTX_TYPE_READ].queue_offset = 0;
 	}
-	blk_mq_rdma_map_queues(&set->map[HCTX_TYPE_DEFAULT],
-			ctrl->device->dev, 0);
-	blk_mq_rdma_map_queues(&set->map[HCTX_TYPE_READ],
-			ctrl->device->dev, 0);
+	blk_mq_map_queues(&set->map[HCTX_TYPE_DEFAULT]);
+	blk_mq_map_queues(&set->map[HCTX_TYPE_READ]);

 	if (opts->nr_poll_queues && ctrl->io_queues[HCTX_TYPE_POLL]) {
 		/* map dedicated poll queues only if we have queues left */
diff --git a/include/linux/blk-mq-rdma.h b/include/linux/blk-mq-rdma.h
deleted file mode 100644
index 5cc5f0f36218..000000000000
--- a/include/linux/blk-mq-rdma.h
+++ /dev/null
@@ -1,11 +0,0 @@ 
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _LINUX_BLK_MQ_RDMA_H
-#define _LINUX_BLK_MQ_RDMA_H
-
-struct blk_mq_tag_set;
-struct ib_device;
-
-int blk_mq_rdma_map_queues(struct blk_mq_queue_map *map,
-		struct ib_device *dev, int first_vec);
-
-#endif /* _LINUX_BLK_MQ_RDMA_H */