
[v6,7/7] genirq/affinity: Add support for non-managed affinity sets

Message ID 20190216172228.869750763@linutronix.de (mailing list archive)
State New, archived
Series genirq/affinity: Overhaul the multiple interrupt sets support

Commit Message

Thomas Gleixner Feb. 16, 2019, 5:13 p.m. UTC
Some drivers need an extra set of interrupts which should not be marked
managed, but should get initial interrupt spreading.

Add a bitmap to struct irq_affinity which allows the driver to mark a
particular set of interrupts as non-managed. Check the bitmap during
spreading and use the result to mark the interrupts in the sets
accordingly.

The unmanaged interrupts get initial spreading, but user space can change
their affinity later on. For the managed sets, i.e. those whose
corresponding bit in the mask is not set, there is no change in behaviour.

Usage example:

	struct irq_affinity affd = {
		.pre_vectors	= 2,
		.unmanaged_sets	= 0x02,
		.calc_sets	= drv_calc_sets,
	};
	....

For both interrupt sets the interrupts are properly spread out, but the
second set is not marked managed.
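
A minimal sketch of what drv_calc_sets() could look like; the even split
between the two sets is a made-up driver policy, and nvecs is the number
of spreadable vectors handed in by the core:

	static void drv_calc_sets(struct irq_affinity *affd, unsigned int nvecs)
	{
		/* Two sets; the second one is flagged via .unmanaged_sets */
		affd->nr_sets = 2;
		affd->set_size[0] = nvecs - nvecs / 2;
		affd->set_size[1] = nvecs / 2;
	}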

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/interrupt.h |    2 ++
 kernel/irq/affinity.c     |   16 +++++++++++-----
 2 files changed, 13 insertions(+), 5 deletions(-)

Comments

Ming Lei Feb. 17, 2019, 1:45 p.m. UTC | #1
Hi Thomas,

On Sat, Feb 16, 2019 at 06:13:13PM +0100, Thomas Gleixner wrote:
> Some drivers need an extra set of interrupts which should not be marked
> managed, but should get initial interrupt spreading.

Could you share the drivers and their use case?

> 
> Add a bitmap to struct irq_affinity which allows the driver to mark a
> particular set of interrupts as non-managed. Check the bitmap during
> spreading and use the result to mark the interrupts in the sets
> accordingly.
> 
> The unmanaged interrupts get initial spreading, but user space can change
> their affinity later on. For the managed sets, i.e. those whose
> corresponding bit in the mask is not set, there is no change in behaviour.
> 
> Usage example:
> 
> 	struct irq_affinity affd = {
> 		.pre_vectors	= 2,
> 		.unmanaged_sets	= 0x02,
> 		.calc_sets	= drv_calc_sets,
> 	};
> 	....
> 
> For both interrupt sets the interrupts are properly spread out, but the
> second set is not marked managed.

Given that drivers only care about the managed vs non-managed interrupt
numbers, I'm wondering why this case can't be covered by .pre_vectors &
.post_vectors?

Also this kind of usage may easily break blk-mq, in which the following
rules need to be respected:

1) all CPUs are required to be spread among the IRQs of each interrupt set

2) no CPU is shared between two IRQs in the same set.
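
As a toy illustration only (not kernel code), with CPUs modeled as bits in
a single word and one affinity mask per IRQ of a set, the two rules amount
to:

	#include <stdbool.h>

	/* irq_mask[i] is the affinity of IRQ i within one interrupt set,
	 * all_cpus has one bit set per possible CPU. */
	static bool set_mapping_valid(const unsigned long *irq_mask,
				      unsigned int nr_irqs,
				      unsigned long all_cpus)
	{
		unsigned long covered = 0;
		unsigned int i;

		for (i = 0; i < nr_irqs; i++) {
			if (covered & irq_mask[i])
				return false;	/* rule 2: CPU shared in set */
			covered |= irq_mask[i];
		}
		return covered == all_cpus;	/* rule 1: all CPUs covered */
	}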

> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  include/linux/interrupt.h |    2 ++
>  kernel/irq/affinity.c     |   16 +++++++++++-----
>  2 files changed, 13 insertions(+), 5 deletions(-)
> 
> Index: b/include/linux/interrupt.h
> ===================================================================
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -251,6 +251,7 @@ struct irq_affinity_notify {
>   *			the MSI(-X) vector space
>   * @nr_sets:		The number of interrupt sets for which affinity
>   *			spreading is required
> + * @unmanaged_sets:	Bitmap to mark entries in the @set_size array unmanaged
>   * @set_size:		Array holding the size of each interrupt set
>   * @calc_sets:		Callback for calculating the number and size
>   *			of interrupt sets
> @@ -261,6 +262,7 @@ struct irq_affinity {
>  	unsigned int	pre_vectors;
>  	unsigned int	post_vectors;
>  	unsigned int	nr_sets;
> +	unsigned int	unmanaged_sets;
>  	unsigned int	set_size[IRQ_AFFINITY_MAX_SETS];
>  	void		(*calc_sets)(struct irq_affinity *, unsigned int nvecs);
>  	void		*priv;
> Index: b/kernel/irq/affinity.c
> ===================================================================
> --- a/kernel/irq/affinity.c
> +++ b/kernel/irq/affinity.c
> @@ -249,6 +249,8 @@ irq_create_affinity_masks(unsigned int n
>  	unsigned int affvecs, curvec, usedvecs, i;
>  	struct irq_affinity_desc *masks = NULL;
>  
> +	BUILD_BUG_ON(IRQ_AFFINITY_MAX_SETS > sizeof(affd->unmanaged_sets) * 8);
> +
>  	/*
>  	 * Determine the number of vectors which need interrupt affinities
>  	 * assigned. If the pre/post request exhausts the available vectors
> @@ -292,7 +294,8 @@ irq_create_affinity_masks(unsigned int n
>  	 * have multiple sets, build each sets affinity mask separately.
>  	 */
>  	for (i = 0, usedvecs = 0; i < affd->nr_sets; i++) {
> -		unsigned int this_vecs = affd->set_size[i];
> +		bool managed = affd->unmanaged_sets & (1U << i) ? true : false;

The above check is inverted.
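
Presumably the intended, non-inverted form is:

	bool managed = !(affd->unmanaged_sets & (1U << i));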

Thanks,
Ming
Thomas Gleixner Feb. 17, 2019, 7:17 p.m. UTC | #2
On Sun, 17 Feb 2019, Ming Lei wrote:
> On Sat, Feb 16, 2019 at 06:13:13PM +0100, Thomas Gleixner wrote:
> > Some drivers need an extra set of interrupts which should not be marked
> > managed, but should get initial interrupt spreading.
> 
> Could you share the drivers and their use case?

You were Cc'ed on that old discussion:

 https://lkml.kernel.org/r/300d6fef733ca76ced581f8c6304bac6@mail.gmail.com

> > For both interrupt sets the interrupts are properly spread out, but the
> > second set is not marked managed.
> 
> Given that drivers only care about the managed vs non-managed interrupt
> numbers, I'm wondering why this case can't be covered by .pre_vectors &
> .post_vectors?

Well, yes, but post/pre are not subject to spreading and I really don't
want to go there.

> Also this kind of usage may easily break blk-mq, in which the following
> rules need to be respected:
> 
> 1) all CPUs are required to be spread among the IRQs of each interrupt set
> 
> 2) no CPU is shared between two IRQs in the same set.

I don't see how that would break blk-mq. The unmanaged set is not used by
the blk-mq stuff, that's some driver internal voodoo. So blk-mq still gets
a perfectly spread and managed interrupt set for the queues.

> >  	for (i = 0, usedvecs = 0; i < affd->nr_sets; i++) {
> > -		unsigned int this_vecs = affd->set_size[i];
> > +		bool managed = affd->unmanaged_sets & (1U << i) ? true : false;
> 
> The above check is inverted.

Doh. Stupid me.

Thanks,

	tglx
Ming Lei Feb. 18, 2019, 2:49 a.m. UTC | #3
Hi Thomas,

On Sun, Feb 17, 2019 at 08:17:05PM +0100, Thomas Gleixner wrote:
> On Sun, 17 Feb 2019, Ming Lei wrote:
> > On Sat, Feb 16, 2019 at 06:13:13PM +0100, Thomas Gleixner wrote:
> > > Some drivers need an extra set of interrupts which should not be marked
> > > managed, but should get initial interrupt spreading.
> > 
> > Could you share the drivers and their use case?
> 
> You were Cc'ed on that old discussion:
> 
>  https://lkml.kernel.org/r/300d6fef733ca76ced581f8c6304bac6@mail.gmail.com

Thanks for providing the link.

> 
> > > For both interrupt sets the interrupts are properly spread out, but the
> > > second set is not marked managed.
> > 
> > Given that drivers only care about the managed vs non-managed interrupt
> > numbers, I'm wondering why this case can't be covered by .pre_vectors &
> > .post_vectors?
> 
> Well, yes, but post/pre are not subject to spreading and I really don't
> want to go there.
> 
> > Also this kind of usage may easily break blk-mq, in which the following
> > rules need to be respected:
> > 
> > 1) all CPUs are required to be spread among the IRQs of each interrupt set
> > 
> > 2) no CPU is shared between two IRQs in the same set.
> 
> I don't see how that would break blk-mq. The unmanaged set is not used by
> the blk-mq stuff, that's some driver internal voodoo. So blk-mq still gets
> a perfectly spread and managed interrupt set for the queues.

From the discussion above, the use case is megaraid_sas, where one of the
two interrupt sets (managed and non-managed) is chosen at runtime depending
on the workload.

Each interrupt set actually defines one blk-mq queue mapping, and that
queue mapping needs to respect the rules I mentioned above. However, the
affinity of non-managed interrupts can be changed arbitrarily at any time
by user space.
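
For example, a single write from user space is enough to retarget a
non-managed IRQ. A minimal sketch (IRQ number and CPU mask are made up;
the kernel rejects such writes for managed IRQs):

	#include <stdio.h>

	int main(void)
	{
		/* /proc/irq/<N>/smp_affinity takes a hex cpumask */
		FILE *f = fopen("/proc/irq/42/smp_affinity", "w");

		if (!f)
			return 1;
		fputs("4", f);	/* bit 2 set: move IRQ 42 to CPU 2 */
		return fclose(f) ? 1 : 0;
	}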

Recently HPSA tried to add a module parameter for using non-managed
IRQs[1].

Also, NVMe RDMA uses non-managed interrupts, and at least one CPU hotplug
issue there has never been fixed[2].

[1] https://marc.info/?t=154387665200001&r=1&w=2
[2] https://www.spinics.net/lists/linux-block/msg24140.html


thanks,
Ming
Thomas Gleixner Feb. 18, 2019, 7:25 a.m. UTC | #4
Ming,

On Mon, 18 Feb 2019, Ming Lei wrote:
> On Sun, Feb 17, 2019 at 08:17:05PM +0100, Thomas Gleixner wrote:
> > I don't see how that would break blk-mq. The unmanaged set is not used by
> > the blk-mq stuff, that's some driver internal voodoo. So blk-mq still gets
> > a perfectly spread and managed interrupt set for the queues.
> 
> From the discussion above, the use case is megaraid_sas, where one of the
> two interrupt sets (managed and non-managed) is chosen at runtime depending
> on the workload.
> 
> Each interrupt set actually defines one blk-mq queue mapping, and that
> queue mapping needs to respect the rules I mentioned above. However, the
> affinity of non-managed interrupts can be changed arbitrarily at any time
> by user space.
> 
> Recently HPSA tried to add a module parameter for using non-managed
> IRQs[1].
> 
> Also, NVMe RDMA uses non-managed interrupts, and at least one CPU hotplug
> issue there has never been fixed[2].
> 
> [1] https://marc.info/?t=154387665200001&r=1&w=2
> [2] https://www.spinics.net/lists/linux-block/msg24140.html

Interesting. I misread the description which the megasas folks provided
then. I'll drop that patch and get the other lot merged. Thanks for your
work and help with that!

Thanks,

	tglx

Patch

Index: b/include/linux/interrupt.h
===================================================================
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -251,6 +251,7 @@  struct irq_affinity_notify {
  *			the MSI(-X) vector space
  * @nr_sets:		The number of interrupt sets for which affinity
  *			spreading is required
+ * @unmanaged_sets:	Bitmap to mark entries in the @set_size array unmanaged
  * @set_size:		Array holding the size of each interrupt set
  * @calc_sets:		Callback for calculating the number and size
  *			of interrupt sets
@@ -261,6 +262,7 @@  struct irq_affinity {
 	unsigned int	pre_vectors;
 	unsigned int	post_vectors;
 	unsigned int	nr_sets;
+	unsigned int	unmanaged_sets;
 	unsigned int	set_size[IRQ_AFFINITY_MAX_SETS];
 	void		(*calc_sets)(struct irq_affinity *, unsigned int nvecs);
 	void		*priv;
Index: b/kernel/irq/affinity.c
===================================================================
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -249,6 +249,8 @@  irq_create_affinity_masks(unsigned int n
 	unsigned int affvecs, curvec, usedvecs, i;
 	struct irq_affinity_desc *masks = NULL;
 
+	BUILD_BUG_ON(IRQ_AFFINITY_MAX_SETS > sizeof(affd->unmanaged_sets) * 8);
+
 	/*
 	 * Determine the number of vectors which need interrupt affinities
 	 * assigned. If the pre/post request exhausts the available vectors
@@ -292,7 +294,8 @@  irq_create_affinity_masks(unsigned int n
 	 * have multiple sets, build each sets affinity mask separately.
 	 */
 	for (i = 0, usedvecs = 0; i < affd->nr_sets; i++) {
-		unsigned int this_vecs = affd->set_size[i];
+		bool managed = affd->unmanaged_sets & (1U << i) ? true : false;
+		unsigned int idx, this_vecs = affd->set_size[i];
 		int ret;
 
 		ret = irq_build_affinity_masks(affd, curvec, this_vecs,
@@ -301,8 +304,15 @@  irq_create_affinity_masks(unsigned int n
 			kfree(masks);
 			return NULL;
 		}
+
+		idx = curvec;
 		curvec += this_vecs;
 		usedvecs += this_vecs;
+		if (managed) {
+			/* Mark the managed interrupts */
+			for (; idx < curvec; idx++)
+				masks[idx].is_managed = 1;
+		}
 	}
 
 	/* Fill out vectors at the end that don't need affinity */
@@ -313,10 +323,6 @@  irq_create_affinity_masks(unsigned int n
 	for (; curvec < nvecs; curvec++)
 		cpumask_copy(&masks[curvec].mask, irq_default_affinity);
 
-	/* Mark the managed interrupts */
-	for (i = affd->pre_vectors; i < nvecs - affd->post_vectors; i++)
-		masks[i].is_managed = 1;
-
 	return masks;
 }