diff mbox

[v5,02/33] genirq: Add irq_alloc_reserved_desc()

Message ID 1388707565-16535-3-git-send-email-yinghai@kernel.org (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas
Headers show

Commit Message

Yinghai Lu Jan. 3, 2014, 12:05 a.m. UTC
For ioapic hot-add support, it would be easy if we have continuous
irq numbers for hot added ioapic controller.

We can reserve irq range at first, and later allocate desc for those
pre-reserved irqs when they are needed.

The reasons for not allocating them during reserving:
1. only several pins of one ioapic are used, allocate for all pins, will
   waste memory for not used pins.
2. allocate later when is needed could make sure irq_desc is allocated
   on local node ram, as dev->node is set at that point.

-v2: update changelog by adding reasons, requested by Konrad.
-v3: according to tglx:
       separate core code change with arch code change.
       change function name to irq_alloc_reserved_desc.
       kill __irq_is_reserved().
       remove not need exports.
     according to Sebastian:
       spare one comments by put two functions together.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
---
 include/linux/irq.h  |  3 +++
 kernel/irq/irqdesc.c | 23 +++++++++++++++++++++++
 2 files changed, 26 insertions(+)

Comments

Thomas Gleixner Jan. 23, 2014, 12:03 a.m. UTC | #1
Yinghai,

On Thu, 2 Jan 2014, Yinghai Lu wrote:

> For ioapic hot-add support, it would be easy if we have continuous
> irq numbers for hot added ioapic controller.

I really don't care about easy. Easy to solve problems are for
wimps.

What you really want to say is, that ioapic hot-add support requires a
contiguous irq number space for a hotplugged ioapic to avoid expensive
translations in the ioapic hotplug code.

That's a proper reason for making that change to the core code.
 
> We can reserve irq range at first, and later allocate desc for those
> pre-reserved irqs when they are needed.
> 
> The reasons for not allocating them during reserving:
> 1. only several pins of one ioapic are used, allocate for all pins, will
>    waste memory for not used pins.
> 2. allocate later when is needed could make sure irq_desc is allocated
>    on local node ram, as dev->node is set at that point.
> 
> -v2: update changelog by adding reasons, requested by Konrad.
> -v3: according to tglx:
>        separate core code change with arch code change.

Thanks for splitting the patches!

Now the scope of this change becomes more obvious and what I already
suspected before becomes crystal clear.

The initial intention of irq_reserve_irqs() was to cope with the
legacy interrupts to prevent the dynamic allocator from giving them
out, but it was at least a misnomer if not even a misconception.

Did you notice that? No!

Did you even think why irq_reserve_irqs() exists? No!

You just hacked it into submission for your purpose. As usual, sigh!

What prevents a user of __irq_alloc_reserved_desc() to request
something completely out of its range? Nothing as you happily return
an existing interrupt via:

+       if (irq_to_desc(irq))
+               return irq;

which is true for all already existing interrupts. So some random off
by one is going to cause a spurious and extremly hard to debug
issue. Brilliant.

No, we are not going to play the "it works for Yinghai" game again. I
wasted enough time with that already.

There is a clear step by step approach to get this done proper:

 1) Get rid of the existing misconception/misnomer of
    irq_reserve_irqs().
    
    Make it explicit that this is dealing with legacy irq spaces. It's
    not that hard as there are only two users in tree which are both
    trivial to fix. 

 2) Provide a proper reservation mechanism which does not piggypack
    blindly on the allocation bitmap.

    So what you want is a reservation which:

    A) Marks the irq range in the allocation bitmap

       This prevents other code pathes to stomp on that range.

    B) Stores a unique generated ID in a separate radix tree for that
       particular irq range.

       The generated ID is returned to the caller as it is required
       for actually allocating an interrupt from that range.

       We don't have to bother with making this conditional as the
       initial memory consumption of the radix tree is minimal and we
       only expand it when we actually use that hotplug feature.

  3) Provide a proper alloc_reserved_irqdesc() function

     This function verifies against the reservation ID which was
     handed out by the reservation function.

     It's questionable whether we want to allow the reuse of already
     allocated irq descriptors. I'm leaning to avoid that. See #4

  4) Provide a proper mechanism to free the registered irq descriptors
     and the reservation range when the physical device is removed
     from the system. So you don't have to preserve state in the
     ioapic code. Physical hotplug is not a high frequency hotpath
     operation.

  5) Modify the x86 ioapic code to always use the reserve first and
     allocate later mechanism to avoid ifdeffery and pointless
     conditional code pathes. That also ensures proper test coverage.

     TBH, I could not be bothered to look at your x86 related changes,
     but I expect they are from the "make it work for Yinghai"
     departement as well. I'll review them once the core code changes
     are in an acceptable shape.

Thanks,

	tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yinghai Lu Feb. 18, 2014, 12:59 a.m. UTC | #2
On Wed, Jan 22, 2014 at 4:03 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> There is a clear step by step approach to get this done proper:
>
>  1) Get rid of the existing misconception/misnomer of
>     irq_reserve_irqs().
>
>     Make it explicit that this is dealing with legacy irq spaces. It's
>     not that hard as there are only two users in tree which are both
>     trivial to fix.

Hi, Thomas,

While going through the code for kill irq_reserve_irqs(), I found that
there is irq_reserve_irq().

in include/linux/irq.h

static inline int irq_reserve_irq(unsigned int irq)
{
        return irq_reserve_irqs(irq, 1);
}

it is called via kernel/irq/chip.c::irq_set_chip().

        /*
         * For !CONFIG_SPARSE_IRQ make the irq show up in
         * allocated_irqs. For the CONFIG_SPARSE_IRQ case, it is
         * already marked, and this call is harmless.
         */
        irq_reserve_irq(irq);

There are tens of irq_set_chip... calling for arches that does not
support SPARSE_IRQ yet, and they does not use  irq_alloc_desc()
anywhere.

so how about change those irq_reserve_irq to irq_set_allocated_irqs() and
leave them there?

Thanks

Yinghai
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jiang Liu Feb. 21, 2014, 6:45 a.m. UTC | #3
On 2014/1/3 8:05, Yinghai Lu wrote:
> For ioapic hot-add support, it would be easy if we have continuous
> irq numbers for hot added ioapic controller.
> 
> We can reserve irq range at first, and later allocate desc for those
> pre-reserved irqs when they are needed.
> 
> The reasons for not allocating them during reserving:
> 1. only several pins of one ioapic are used, allocate for all pins, will
>    waste memory for not used pins.
> 2. allocate later when is needed could make sure irq_desc is allocated
>    on local node ram, as dev->node is set at that point.
> 
> -v2: update changelog by adding reasons, requested by Konrad.
> -v3: according to tglx:
>        separate core code change with arch code change.
>        change function name to irq_alloc_reserved_desc.
>        kill __irq_is_reserved().
>        remove not need exports.
>      according to Sebastian:
>        spare one comments by put two functions together.
> 
> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> Cc: Joerg Roedel <joro@8bytes.org>
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
> ---
>  include/linux/irq.h  |  3 +++
>  kernel/irq/irqdesc.c | 23 +++++++++++++++++++++++
>  2 files changed, 26 insertions(+)
> 
> diff --git a/include/linux/irq.h b/include/linux/irq.h
> index 0229caf..e5f6493 100644
> --- a/include/linux/irq.h
> +++ b/include/linux/irq.h
> @@ -595,10 +595,13 @@ static inline u32 irq_get_trigger_type(unsigned int irq)
>  
>  int __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
>  		struct module *owner);
> +int __irq_alloc_reserved_desc(int at, int node, struct module *owner);
>  
>  /* use macros to avoid needing export.h for THIS_MODULE */
>  #define irq_alloc_descs(irq, from, cnt, node)	\
>  	__irq_alloc_descs(irq, from, cnt, node, THIS_MODULE)
> +#define irq_alloc_reserved_desc_at(at, node)	\
> +	__irq_alloc_reserved_desc(at, node, THIS_MODULE)
>  
>  #define irq_alloc_desc(node)			\
>  	irq_alloc_descs(-1, 0, 1, node)
> diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
> index a151db6..1166545 100644
> --- a/kernel/irq/irqdesc.c
> +++ b/kernel/irq/irqdesc.c
> @@ -410,6 +410,29 @@ __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
>  EXPORT_SYMBOL_GPL(__irq_alloc_descs);
>  
>  /**
> + * __irq_alloc_reserved_desc - allocate irq descriptor for irq that is already reserved
> + * @irq:	Allocate for specific irq number if irq >= 0
Hi Yinghai
	Should we skip "if irq >= 0" here because irq should have already been
reserved? Or should we add range check for irq?
Thanks!
Gerry

> + * @node:	Preferred node on which the irq descriptor should be allocated
> + * @owner:	Owning module (can be NULL)
> + *
> + * Returns the irq number or error code
> + */
> +int __ref __irq_alloc_reserved_desc(int irq, int node, struct module *owner)
> +{
> +	mutex_lock(&sparse_irq_lock);
> +	if (!test_bit(irq, allocated_irqs)) {
> +		mutex_unlock(&sparse_irq_lock);
> +		return -EINVAL;
> +	}
> +	mutex_unlock(&sparse_irq_lock);
> +
> +	if (irq_to_desc(irq))
> +		return irq;
> +
> +	return alloc_descs(irq, 1, node, owner);
> +}
> +
> +/**
>   * irq_reserve_irqs - mark irqs allocated
>   * @from:	mark from irq number
>   * @cnt:	number of irqs to mark
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Gleixner Feb. 22, 2014, 10:08 a.m. UTC | #4
On Mon, 17 Feb 2014, Yinghai Lu wrote:
> On Wed, Jan 22, 2014 at 4:03 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > There is a clear step by step approach to get this done proper:
> >
> >  1) Get rid of the existing misconception/misnomer of
> >     irq_reserve_irqs().
> >
> >     Make it explicit that this is dealing with legacy irq spaces. It's
> >     not that hard as there are only two users in tree which are both
> >     trivial to fix.
> 
> Hi, Thomas,
> 
> While going through the code for kill irq_reserve_irqs(), I found that
> there is irq_reserve_irq().
> 
> in include/linux/irq.h
> 
> static inline int irq_reserve_irq(unsigned int irq)
> {
>         return irq_reserve_irqs(irq, 1);
> }
> 
> it is called via kernel/irq/chip.c::irq_set_chip().
> 
>         /*
>          * For !CONFIG_SPARSE_IRQ make the irq show up in
>          * allocated_irqs. For the CONFIG_SPARSE_IRQ case, it is
>          * already marked, and this call is harmless.
>          */
>         irq_reserve_irq(irq);
> 
> There are tens of irq_set_chip... calling for arches that does not
> support SPARSE_IRQ yet, and they does not use  irq_alloc_desc()
> anywhere.
> 
> so how about change those irq_reserve_irq to irq_set_allocated_irqs() and
> leave them there?

As I said before irq_reserve_irq() is a misnomer and a
misconception. Of course this needs to be fixed as well.

And you cannot just blindly change it because !SPARSE can use the
allocation. We are not creating stupid corner cases just to support
your sloppyness. Its not rocket science to do it the right way.

That said, it might be worthwhile to get rid of the !SPARSE case
completely. That would probably make quite some stuff simpler.

Thanks,

	tglx




--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yinghai Lu Feb. 22, 2014, 5:28 p.m. UTC | #5
On Sat, Feb 22, 2014 at 2:08 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> As I said before irq_reserve_irq() is a misnomer and a
> misconception. Of course this needs to be fixed as well.
>
> And you cannot just blindly change it because !SPARSE can use the
> allocation. We are not creating stupid corner cases just to support
> your sloppyness. Its not rocket science to do it the right way.
>
> That said, it might be worthwhile to get rid of the !SPARSE case
> completely. That would probably make quite some stuff simpler.

So we need to make all arches support SPARSE_IRQ at first?

Now we have arm, arm64, c6x, metag, powerpc, sh, x86 support SPARSE_IRQ.

The following  are not with SPARSE_IRQ yet:
alpha, arc, avr32, blackfin, cris, frv, hexagon, m32r, m68k, microblaze,
mips, mn10300, openrisc, parisc, s390, score, sparc, tile, um,
unicore32, xtensa.

Thanks

Yinghai


Thanks

Yinghai
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/include/linux/irq.h b/include/linux/irq.h
index 0229caf..e5f6493 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -595,10 +595,13 @@  static inline u32 irq_get_trigger_type(unsigned int irq)
 
 int __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
 		struct module *owner);
+int __irq_alloc_reserved_desc(int at, int node, struct module *owner);
 
 /* use macros to avoid needing export.h for THIS_MODULE */
 #define irq_alloc_descs(irq, from, cnt, node)	\
 	__irq_alloc_descs(irq, from, cnt, node, THIS_MODULE)
+#define irq_alloc_reserved_desc_at(at, node)	\
+	__irq_alloc_reserved_desc(at, node, THIS_MODULE)
 
 #define irq_alloc_desc(node)			\
 	irq_alloc_descs(-1, 0, 1, node)
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index a151db6..1166545 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -410,6 +410,29 @@  __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
 EXPORT_SYMBOL_GPL(__irq_alloc_descs);
 
 /**
+ * __irq_alloc_reserved_desc - allocate irq descriptor for irq that is already reserved
+ * @irq:	Allocate for specific irq number if irq >= 0
+ * @node:	Preferred node on which the irq descriptor should be allocated
+ * @owner:	Owning module (can be NULL)
+ *
+ * Returns the irq number or error code
+ */
+int __ref __irq_alloc_reserved_desc(int irq, int node, struct module *owner)
+{
+	mutex_lock(&sparse_irq_lock);
+	if (!test_bit(irq, allocated_irqs)) {
+		mutex_unlock(&sparse_irq_lock);
+		return -EINVAL;
+	}
+	mutex_unlock(&sparse_irq_lock);
+
+	if (irq_to_desc(irq))
+		return irq;
+
+	return alloc_descs(irq, 1, node, owner);
+}
+
+/**
  * irq_reserve_irqs - mark irqs allocated
  * @from:	mark from irq number
  * @cnt:	number of irqs to mark