
[10/13] x86/irq: Limit IOAPIC and MSI domains' affinity without IR

Message ID 20201005152856.974112-10-dwmw2@infradead.org (mailing list archive)
State New, archived
Series Fix per-domain IRQ affinity, allow >255 CPUs on x86 without IRQ remapping

Commit Message

David Woodhouse Oct. 5, 2020, 3:28 p.m. UTC
From: David Woodhouse <dwmw@amazon.co.uk>

When interrupt remapping isn't enabled, only the first 255 CPUs can
receive external interrupts. Set the appropriate max affinity for
the IOAPIC and MSI IRQ domains accordingly.

This also fixes the case where interrupt remapping is enabled but some
devices are not within the scope of any active IOMMU.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kernel/apic/io_apic.c | 2 ++
 arch/x86/kernel/apic/msi.c     | 3 +++
 2 files changed, 5 insertions(+)

Comments

Thomas Gleixner Oct. 6, 2020, 9:54 p.m. UTC | #1
On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:

> From: David Woodhouse <dwmw@amazon.co.uk>
>
> When interrupt remapping isn't enabled, only the first 255 CPUs can

No, only CPUs with an APICid < 255 ....

> receive external interrupts. Set the appropriate max affinity for
> the IOAPIC and MSI IRQ domains accordingly.
>
> This also fixes the case where interrupt remapping is enabled but some
> devices are not within the scope of any active IOMMU.

What? If this fixes a pre-existing problem then

      1) Explain the problem proper
      2) Have a patch at the beginning of the series which fixes it
         independently of this pile

If it's fixing a problem in your pile, then you got the ordering wrong.

You didn't start kernel programming as of yesterday, so you really know
how that works.

>  	ip->irqdomain->parent = parent;
> +	if (parent == x86_vector_domain)
> +		irq_domain_set_affinity(ip->irqdomain, &x86_non_ir_cpumask);

OMG

>  	if (cfg->type == IOAPIC_DOMAIN_LEGACY ||
>  	    cfg->type == IOAPIC_DOMAIN_STRICT)
> diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c
> index 4d891967bea4..af5ce5c4da02 100644
> --- a/arch/x86/kernel/apic/msi.c
> +++ b/arch/x86/kernel/apic/msi.c
> @@ -259,6 +259,7 @@ struct irq_domain * __init native_create_pci_msi_domain(void)
>  		pr_warn("Failed to initialize PCI-MSI irqdomain.\n");
>  	} else {
>  		d->flags |= IRQ_DOMAIN_MSI_NOMASK_QUIRK;
> +		irq_domain_set_affinity(d, &x86_non_ir_cpumask);

So here it's unconditional

>  	}
>  	return d;
>  }
> @@ -479,6 +480,8 @@ struct irq_domain *hpet_create_irq_domain(int hpet_id)
>  		irq_domain_free_fwnode(fn);
>  		kfree(domain_info);
>  	}
> +	if (parent == x86_vector_domain)
> +		irq_domain_set_affinity(d, &x86_non_ir_cpumask);

And here we need a condition again. Completely obvious and reviewable - NOT.

Thanks,

        tglx
David Woodhouse Oct. 7, 2020, 7:48 a.m. UTC | #2
On Tue, 2020-10-06 at 23:54 +0200, Thomas Gleixner wrote:
> On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
> 
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > When interrupt remapping isn't enabled, only the first 255 CPUs can
> 
> No, only CPUs with an APICid < 255 ....

Ack.

> > receive external interrupts. Set the appropriate max affinity for
> > the IOAPIC and MSI IRQ domains accordingly.
> > 
> > This also fixes the case where interrupt remapping is enabled but some
> > devices are not within the scope of any active IOMMU.
> 
> What? If this fixes a pre-existing problem then
> 
>       1) Explain the problem proper
>       2) Have a patch at the beginning of the series which fixes it
>          independently of this pile
> 
> If it's fixing a problem in your pile, then you got the ordering wrong.

It's not that simple; there's not a single patch which fixes that and
which can go first. I can, and do, fix the "no IR" case in a simple
patch that goes first, simply by restricting the kernel to the APIC IDs
which can be targeted.

This is the case I called out in the cover letter:

    This patch series implements a per-domain "maximum affinity" set and
    uses it for the non-remapped IOAPIC and MSI domains on x86. As well as
    allowing more CPUs to be used without interrupt remapping, this also
    fixes the case where some IOAPICs or PCI devices aren't actually in
    scope of any active IOMMU and are operating without remapping.

To fix *that* case, we really do need the whole series giving us per-
domain restricted affinity, and to use it for those MSIs/IOAPICs that
the IRQ remapping doesn't cover.
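
As an illustration, that simpler first patch boils down to something like the
sketch below (the hook point, the helper name and the remapping flag are
placeholders for illustration, not the actual patch):

/*
 * Sketch only: skip CPUs that external interrupts cannot target when
 * interrupt remapping is unavailable. The compatibility MSI/IOAPIC
 * format carries only 8 bits of destination ID, and 0xff is broadcast.
 */
static bool apic_id_targetable(u32 apic_id, bool remapping_enabled)
{
	if (!remapping_enabled && apic_id >= 0xff)
		return false;
	return true;
}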

> You didn't start kernel programming as of yesterday, so you really know
> how that works.
> 
> >  	ip->irqdomain->parent = parent;
> > +	if (parent == x86_vector_domain)
> > +		irq_domain_set_affinity(ip->irqdomain, &x86_non_ir_cpumask);
> 
> OMG

This IOAPIC function may or may not be behind remapping.

> 
> >  	if (cfg->type == IOAPIC_DOMAIN_LEGACY ||
> >  	    cfg->type == IOAPIC_DOMAIN_STRICT)
> > diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c
> > index 4d891967bea4..af5ce5c4da02 100644
> > --- a/arch/x86/kernel/apic/msi.c
> > +++ b/arch/x86/kernel/apic/msi.c
> > @@ -259,6 +259,7 @@ struct irq_domain * __init native_create_pci_msi_domain(void)
> >  		pr_warn("Failed to initialize PCI-MSI irqdomain.\n");
> >  	} else {
> >  		d->flags |= IRQ_DOMAIN_MSI_NOMASK_QUIRK;
> > +		irq_domain_set_affinity(d, &x86_non_ir_cpumask);
> 
> So here it's unconditional

Yes, this code is only for the non-remapped case and there's a separate
arch_create_remap_msi_irq_domain() function a few lines further down
which creates the IR-PCI-MSI IRQ domain. So no need for a condition
here.

> >  	}
> >  	return d;
> >  }
> > @@ -479,6 +480,8 @@ struct irq_domain *hpet_create_irq_domain(int hpet_id)
> >  		irq_domain_free_fwnode(fn);
> >  		kfree(domain_info);
> >  	}
> > +	if (parent == x86_vector_domain)
> > +		irq_domain_set_affinity(d, &x86_non_ir_cpumask);
> 
> And here we need a condition again. Completely obvious and reviewable - NOT.

And HPET may or may not be behind remapping so again needs the
condition. I had figured that part was fairly obvious but can note it
in the commit comment.
Thomas Gleixner Oct. 7, 2020, 12:59 p.m. UTC | #3
On Wed, Oct 07 2020 at 08:48, David Woodhouse wrote:
> On Tue, 2020-10-06 at 23:54 +0200, Thomas Gleixner wrote:
>> On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
> This is the case I called out in the cover letter:
>
>     This patch series implements a per-domain "maximum affinity" set and
>     uses it for the non-remapped IOAPIC and MSI domains on x86. As well as
>     allowing more CPUs to be used without interrupt remapping, this also
>     fixes the case where some IOAPICs or PCI devices aren't actually in
>     scope of any active IOMMU and are operating without remapping.
>
> To fix *that* case, we really do need the whole series giving us per-
> domain restricted affinity, and to use it for those MSIs/IOAPICs that
> the IRQ remapping doesn't cover.

Which do not exist today.

>> >  	ip->irqdomain->parent = parent;
>> > +	if (parent == x86_vector_domain)
>> > +		irq_domain_set_affinity(ip->irqdomain, &x86_non_ir_cpumask);
>> 
>> OMG
>
> This IOAPIC function may or may not be behind remapping.

>> >  		d->flags |= IRQ_DOMAIN_MSI_NOMASK_QUIRK;
>> > +		irq_domain_set_affinity(d, &x86_non_ir_cpumask);
>> 
>> So here it's unconditional
>
> Yes, this code is only for the non-remapped case and there's a separate
> arch_create_remap_msi_irq_domain() function a few lines further down
> which creates the IR-PCI-MSI IRQ domain. So no need for a condition
> here.
>
>> > +	if (parent == x86_vector_domain)
>> > +		irq_domain_set_affinity(d, &x86_non_ir_cpumask);
>> 
>> And here we need a condition again. Completely obvious and reviewable - NOT.
>
> And HPET may or may not be behind remapping so again needs the
> condition. I had figured that part was fairly obvious but can note it
> in the commit comment.

And all of this is completely wrong to begin with.

The information has to be a property of the relevant irq domains and the
hierarchy allows you nicely to retrieve it from there instead of
sprinkling this all over the place.
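
In other words, something along these lines (a rough sketch of that
suggestion; the ->max_affinity member and the helper name are hypothetical,
not an existing kernel API):

static const struct cpumask *
irq_domain_get_max_affinity(struct irq_domain *d)
{
	/* Walk towards the root; the first level that has a limit wins. */
	for (; d; d = d->parent) {
		if (d->max_affinity)		/* hypothetical member */
			return d->max_affinity;
	}
	return cpu_possible_mask;		/* no restriction anywhere */
}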

Thanks,

        tglx
David Woodhouse Oct. 7, 2020, 1:08 p.m. UTC | #4
On 7 October 2020 13:59:00 BST, Thomas Gleixner <tglx@linutronix.de> wrote:
>On Wed, Oct 07 2020 at 08:48, David Woodhouse wrote:
>> On Tue, 2020-10-06 at 23:54 +0200, Thomas Gleixner wrote:
>>> On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
>> This is the case I called out in the cover letter:
>>
>>     This patch series implements a per-domain "maximum affinity" set and
>>     uses it for the non-remapped IOAPIC and MSI domains on x86. As well as
>>     allowing more CPUs to be used without interrupt remapping, this also
>>     fixes the case where some IOAPICs or PCI devices aren't actually in
>>     scope of any active IOMMU and are operating without remapping.
>>
>> To fix *that* case, we really do need the whole series giving us per-
>> domain restricted affinity, and to use it for those MSIs/IOAPICs that
>> the IRQ remapping doesn't cover.
>
>Which do not exist today.

Sure. But at patch 10/13 into this particular patch series, it *does* exist.

(Ignoring, for the moment, that I'm about to rewrite half the preceding patches to take a different approach)


>>> >  	ip->irqdomain->parent = parent;
>>> > +	if (parent == x86_vector_domain)
>>> > +		irq_domain_set_affinity(ip->irqdomain, &x86_non_ir_cpumask);
>>> 
>>> OMG
>>
>> This IOAPIC function may or may not be behind remapping.
>
>>> >  		d->flags |= IRQ_DOMAIN_MSI_NOMASK_QUIRK;
>>> > +		irq_domain_set_affinity(d, &x86_non_ir_cpumask);
>>> 
>>> So here it's unconditional
>>
>> Yes, this code is only for the non-remapped case and there's a separate
>> arch_create_remap_msi_irq_domain() function a few lines further down
>> which creates the IR-PCI-MSI IRQ domain. So no need for a condition
>> here.
>>
>>> > +	if (parent == x86_vector_domain)
>>> > +		irq_domain_set_affinity(d, &x86_non_ir_cpumask);
>>> 
>>> And here we need a condition again. Completely obvious and reviewable - NOT.
>>
>> And HPET may or may not be behind remapping so again needs the
>> condition. I had figured that part was fairly obvious but can note it
>> in the commit comment.
>
>And all of this is completely wrong to begin with.
>
>The information has to be a property of the relevant irq domains and the
>hierarchy allows you nicely to retrieve it from there instead of
>sprinkling this all over the place.

No. This is not a property of the parent domain per se. Especially if you're thinking that we could inherit the affinity mask from the parent, then twice no.

This is a property of the MSI domain itself, and how many bits of destination ID the hardware at *this* level can interpret and pass on to the parent domain.
Thomas Gleixner Oct. 7, 2020, 2:05 p.m. UTC | #5
On Wed, Oct 07 2020 at 14:08, David Woodhouse wrote:
> On 7 October 2020 13:59:00 BST, Thomas Gleixner <tglx@linutronix.de> wrote:
>>On Wed, Oct 07 2020 at 08:48, David Woodhouse wrote:
>>> To fix *that* case, we really do need the whole series giving us per-
>>> domain restricted affinity, and to use it for those MSIs/IOAPICs that
>>> the IRQ remapping doesn't cover.
>>
>>Which do not exist today.
>
> Sure. But at patch 10/13 into this particular patch series, it *does*
> exist.

As I told you before: Your ordering is wrong. We do not introduce bugs
first and then fix them later ....

>>And all of this is completely wrong to begin with.
>>
>>The information has to be a property of the relevant irq domains and the
>>hierarchy allows you nicely to retrieve it from there instead of
>>sprinkling this all over the place.
>
> No. This is not a property of the parent domain per se. Especially if
> you're thinking that we could inherit the affinity mask from the
> parent, then twice no.
>
> This is a property of the MSI domain itself, and how many bits of
> destination ID the hardware at *this* level can interpret and pass on
> to the parent domain.

Errm what?

The MSI domain does not know anything about what the underlying domain
can handle and it shouldn't.

If MSI is on top of remapping then the remapping domain defines what the
MSI domain can do and not the other way round. Ditto for the non
remapped case in which the vector domain decides.

The top most MSI irq chip does not even have a compose function, neither
for the remap nor for the vector case. The composition is done by the
parent domain from the data which the parent domain constructed. Same
for the IO/APIC just less clearly separated.

The top most chip just takes what the underlying domain constructed and
writes it to the message store, because that's what the top most chip
controls. It does not control the content.
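
For reference, this is roughly how the core resolves composition today
(paraphrased from irq_chip_compose_msi_msg() in kernel/irq/chip.c): the
first level in the hierarchy that provides irq_compose_msi_msg() does the
work, which for PCI-MSI and HPET is the vector or remapping parent, never
the device-level chip.

int irq_chip_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
{
	struct irq_data *pos;

	/* Find the first parent level that knows how to compose. */
	for (pos = NULL; !pos && data; data = data->parent_data) {
		if (data->chip && data->chip->irq_compose_msi_msg)
			pos = data;
	}
	if (!pos)
		return -ENOSYS;

	pos->chip->irq_compose_msi_msg(pos, msg);
	return 0;
}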

Thanks,

        tglx
David Woodhouse Oct. 7, 2020, 2:23 p.m. UTC | #6
On Wed, 2020-10-07 at 16:05 +0200, Thomas Gleixner wrote:
> On Wed, Oct 07 2020 at 14:08, David Woodhouse wrote:
> > On 7 October 2020 13:59:00 BST, Thomas Gleixner <tglx@linutronix.de> wrote:
> > > On Wed, Oct 07 2020 at 08:48, David Woodhouse wrote:
> > > > To fix *that* case, we really do need the whole series giving us per-
> > > > domain restricted affinity, and to use it for those MSIs/IOAPICs that
> > > > the IRQ remapping doesn't cover.
> > > 
> > > Which do not exist today.
> > 
> > Sure. But at patch 10/13 into this particular patch series, it *does*
> > exist.
> 
> As I told you before: Your ordering is wrong. We do not introduce bugs
> first and then fix them later ....

I didn't introduce that bug; it's been there for years. Fixing it
properly requires per-irqdomain affinity limits.

There's a cute little TODO at least in the Intel irq-remapping driver,
noting that we should probably check if there are any IOAPICs that
aren't in the scope of any DRHD at all. But that's all.

If it happens, I think we'll end up doing the right thing and
instantiating a non-IR IOAPIC domain for it, and if we *don't* have any
CPU with an APIC ID above... (checks notes)... 7 ... then it'll just
about work out. Over 7 and we're screwed (because logical mode; see
also https://git.infradead.org/users/dwmw2/linux.git/commit/429f0c33f
for a straw man but that's getting even further down the rabbit hole)
David Woodhouse Oct. 7, 2020, 3:05 p.m. UTC | #7
On Wed, 2020-10-07 at 16:05 +0200, Thomas Gleixner wrote:
> > > The information has to be a property of the relevant irq domains and the
> > > hierarchy allows you nicely to retrieve it from there instead of
> > > sprinkling this all over the place.
> > 
> > No. This is not a property of the parent domain per se. Especially if
> > you're thinking that we could inherit the affinity mask from the
> > parent, then twice no.
> > 
> > This is a property of the MSI domain itself, and how many bits of
> > destination ID the hardware at *this* level can interpret and pass on
> > to the parent domain.
> 
> Errm what?
> 
> The MSI domain does not know anything about what the underlying domain
> can handle and it shouldn't.
> 
> If MSI is on top of remapping then the remapping domain defines what the
> MSI domain can do and not the other way round. Ditto for the non
> remapped case in which the vector domain decides.
> 
> The top most MSI irq chip does not even have a compose function, neither
> for the remap nor for the vector case. The composition is done by the
> parent domain from the data which the parent domain constructed. Same
> for the IO/APIC just less clearly separated.
> 
> The top most chip just takes what the underlying domain constructed and
> writes it to the message store, because that's what the top most chip
> controls. It does not control the content.

Sure, whatever. The way we arrange the IRQ domain hierarchy in x86
Linux doesn't really match my understanding of the real hardware, or
how qemu emulates it either. But it is at least internally self-
consistent, and in this model it probably does make some sense to claim
the 8-bit limit on x86_vector_domain itself, and *remove* that limit in
the remapping irqdomain.

Not really the important part to deal with right now, either way.
Thomas Gleixner Oct. 7, 2020, 3:25 p.m. UTC | #8
On Wed, Oct 07 2020 at 16:05, David Woodhouse wrote:
> On Wed, 2020-10-07 at 16:05 +0200, Thomas Gleixner wrote:
>> The top most MSI irq chip does not even have a compose function, neither
>> for the remap nor for the vector case. The composition is done by the
>> parent domain from the data which the parent domain constructed. Same
>> for the IO/APIC just less clearly separated.
>> 
>> The top most chip just takes what the underlying domain constructed and
>> writes it to the message store, because that's what the top most chip
>> controls. It does not control the content.
>
> Sure, whatever. The way we arrange the IRQ domain hierarchy in x86
> Linux doesn't really match my understanding of the real hardware, or
> how qemu emulates it either. But it is at least internally self-
> consistent, and in this model it probably does make some sense to claim
> the 8-bit limit on x86_vector_domain itself, and *remove* that limit in
> the remapping irqdomain.

It's clearly how the hardware works. MSI has a message store of some
sorts and if the entry is enabled then the MSI chip (in PCI or
elsewhere) will send exactly the message which is in that message
store. It knows absolutely nothing about what the message means and how
it is composed. The only things which MSI knows about is whether the
message address is 64bit wide or not and whether the entries are
maskable or not and how many entries it can store.

Software allocates a message target at the underlying irq domain (vector
or remap) and that underlying irq domain defines the properties.

If qemu emulates it differently then it's qemu's problem, but that does
not make it in anyway something which influences the irq domain
abstractions which are correctly modeled after how the hardware works.

> Not really the important part to deal with right now, either way.

Oh yes it is. We define that upfront and not after the fact.

Thanks,

        tglx
David Woodhouse Oct. 7, 2020, 3:46 p.m. UTC | #9
On Wed, 2020-10-07 at 17:25 +0200, Thomas Gleixner wrote:
> It's clearly how the hardware works. MSI has a message store of some
> sorts and if the entry is enabled then the MSI chip (in PCI or
> elsewhere) will send exactly the message which is in that message
> store. It knows absolutely nothing about what the message means and how
> it is composed. The only things which MSI knows about is whether the
> message address is 64bit wide or not and whether the entries are
> maskable or not and how many entries it can store.
> 
> Software allocates a message target at the underlying irq domain (vector
> or remap) and that underlying irq domain defines the properties.
> 
> If qemu emulates it differently then it's qemu's problem, but that does
> not make it in anyway something which influences the irq domain
> abstractions which are correctly modeled after how the hardware works.
> 
> > Not really the important part to deal with right now, either way.
> 
> Oh yes it is. We define that upfront and not after the fact.

The way the hardware works is that something handles physical address
cycles to addresses in the (on x86) 0xFEExxxxx range, and turns them
into actual interrupts on the appropriate CPU — where the APIC ID and
vector (etc.) are directly encoded in the bits of the address and the
data written. That compatibility x86 APIC MSI format is where the 8-bit 
(or 15-bit) limit comes from.
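
Concretely, a compatibility-format message looks roughly like this
(illustration only; the field layout is from the Intel SDM, with the
extended destination ID bits being the 15-bit variant mentioned above):

/* Compatibility-format MSI, fixed delivery mode, edge triggered. */
static u32 compat_msi_address(u32 dest_apicid)
{
	u32 addr = 0xfee00000;

	addr |= (dest_apicid & 0xff) << 12;		/* dest ID bits 7:0 */
	addr |= ((dest_apicid >> 8) & 0x7f) << 5;	/* ext dest ID 14:8 */
	return addr;
}

static u32 compat_msi_data(u8 vector)
{
	return vector;
}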

Then interrupt remapping comes along, and now those physical address
cycles are actually handled by the IOMMU — which can either handle the
compatibility format as before, or use a different format of
address/data bits and perform a lookup in its IRTE table.

The PCI MSI domain, HPET, and even the IOAPIC are just the things out
there on the bus which might perform those physical address cycles. And
yes, as you say they're just a message store sending exactly the
message that was composed for them. They know absolutely nothing about
what the message means and how it is composed.

It so happens that in Linux, we don't really architect the software
like that. So each of the PCI MSI domain, HPET, and IOAPIC have their
*own* message composer which has the same limits and composes basically
the same messages as if it was *their* format, not dictated to them by
the APIC upstream. And that's what we're both getting our panties in a
knot about, I think.

It really doesn't matter that much to the underlying generic irqdomain
support for limited affinities. Except that you want to make the
generic code support the concept of a child domain supporting *more*
CPUs than its parent, which really doesn't make much sense if you think
about it. But it isn't that hard to do, and if it means we don't have
to argue any more about the x86 hierarchy not matching the hardware
then it's a price worth paying.
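
For the generic part, the shape is roughly this (a sketch; the
->max_affinity member is the per-domain limit this series introduces, not
something mainline has today):

static int clamp_affinity_to_domain(struct irq_data *d,
				    const struct cpumask *requested,
				    struct cpumask *effective)
{
	const struct cpumask *limit = d->domain->max_affinity;

	if (!limit)
		limit = cpu_possible_mask;

	/* Reject a request that leaves no CPU this domain can target. */
	if (!cpumask_and(effective, requested, limit))
		return -EINVAL;
	return 0;
}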
Thomas Gleixner Oct. 7, 2020, 4:02 p.m. UTC | #10
On Wed, Oct 07 2020 at 15:23, David Woodhouse wrote:
> On Wed, 2020-10-07 at 16:05 +0200, Thomas Gleixner wrote:
>> On Wed, Oct 07 2020 at 14:08, David Woodhouse wrote:
>> > On 7 October 2020 13:59:00 BST, Thomas Gleixner <tglx@linutronix.de> wrote:
>> > > On Wed, Oct 07 2020 at 08:48, David Woodhouse wrote:
>> > > > To fix *that* case, we really do need the whole series giving us per-
>> > > > domain restricted affinity, and to use it for those MSIs/IOAPICs that
>> > > > the IRQ remapping doesn't cover.
>> > > 
>> > > Which do not exist today.
>> > 
>> > Sure. But at patch 10/13 into this particular patch series, it *does*
>> > exist.
>> 
>> As I told you before: Your ordering is wrong. We do not introduce bugs
>> first and then fix them later ....
>
> I didn't introduce that bug; it's been there for years. Fixing it
> properly requires per-irqdomain affinity limits.
>
> There's a cute little TODO at least in the Intel irq-remapping driver,
> noting that we should probably check if there are any IOAPICs that
> aren't in the scope of any DRHD at all. But that's all.

So someone forgot to remove the cute little TODO when this was added:

       if (parse_ioapics_under_ir()) {
                pr_info("Not enabling interrupt remapping\n");
                goto error;
        }

Thanks,

        tglx
David Woodhouse Oct. 7, 2020, 4:15 p.m. UTC | #11
On 7 October 2020 17:02:59 BST, Thomas Gleixner <tglx@linutronix.de> wrote:
>On Wed, Oct 07 2020 at 15:23, David Woodhouse wrote:
>> On Wed, 2020-10-07 at 16:05 +0200, Thomas Gleixner wrote:
>>> On Wed, Oct 07 2020 at 14:08, David Woodhouse wrote:
>>> > On 7 October 2020 13:59:00 BST, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> > > On Wed, Oct 07 2020 at 08:48, David Woodhouse wrote:
>>> > > To fix *that* case, we really do need the whole series giving us per-
>>> > > domain restricted affinity, and to use it for those MSIs/IOAPICs that
>>> > > the IRQ remapping doesn't cover.
>>> > > 
>>> > > Which do not exist today.
>>> > 
>>> > Sure. But at patch 10/13 into this particular patch series, it *does*
>>> > exist.
>>> 
>>> As I told you before: Your ordering is wrong. We do not introduce bugs
>>> first and then fix them later ....
>>
>> I didn't introduce that bug; it's been there for years. Fixing it
>> properly requires per-irqdomain affinity limits.
>>
>> There's a cute little TODO at least in the Intel irq-remapping driver,
>> noting that we should probably check if there are any IOAPICs that
>> aren't in the scope of any DRHD at all. But that's all.
>
>So someone forgot to remove the cute little TODO when this was added:
>
>       if (parse_ioapics_under_ir()) {
>                pr_info("Not enabling interrupt remapping\n");
>                goto error;
>        }

And HPET, and PCI devices including those that might be hotplugged in future and not be covered by any extant IOMMU's scope?
Thomas Gleixner Oct. 7, 2020, 5:23 p.m. UTC | #12
On Wed, Oct 07 2020 at 16:46, David Woodhouse wrote:
> The PCI MSI domain, HPET, and even the IOAPIC are just the things out
> there on the bus which might perform those physical address cycles. And
> yes, as you say they're just a message store sending exactly the
> message that was composed for them. They know absolutely nothing about
> what the message means and how it is composed.

That's what I said.

> It so happens that in Linux, we don't really architect the software
> like that. So each of the PCI MSI domain, HPET, and IOAPIC have their
> *own* message composer which has the same limits and composes basically
> the same messages as if it was *their* format, not dictated to them by
> the APIC upstream. And that's what we're both getting our panties in a
> knot about, I think.

Are you actually reading what I write and caring to look at the code?

PCI-MSI does not have a compose message callback in the irq chip. The
message is composed by the underlying parent domain. Same for HPET.

The only dodgy part is the IO/APIC for hysterical raisins and because
I did not come around yet to sort that out.

> It really doesn't matter that much to the underlying generic irqdomain
> support for limited affinities. Except that you want to make the
> generic code support the concept of a child domain supporting *more*
> CPUs than its parent, which really doesn't make much sense if you think
> about it.

Right. So we really want to stick the restriction into a compat-MSI
domain to make stuff match reality and to avoid banging the head against
the wall sooner than later.

Thanks,

        tglx
David Woodhouse Oct. 7, 2020, 5:34 p.m. UTC | #13
On Wed, 2020-10-07 at 19:23 +0200, Thomas Gleixner wrote:
> > It so happens that in Linux, we don't really architect the software
> > like that. So each of the PCI MSI domain, HPET, and IOAPIC have their
> > *own* message composer which has the same limits and composes basically
> > the same messages as if it was *their* format, not dictated to them by
> > the APIC upstream. And that's what we're both getting our panties in a
> > knot about, I think.
> 
> Are you actually reading what I write and caring to look at the code?
> 
> PCI-MSI does not have a compose message callback in the irq chip. The
> message is composed by the underlying parent domain. Same for HPET.
> 
> The only dodgy part is the IO/APIC for hysterical raisins and because
> I did not come around yet to sort that out.

Right. And the problem is that the "underlying parent domain" in this
case is actually x86_vector_domain. Whereas it probably makes more
sense, at least in theory, for there to be an *intermediate* domain
responsible for composing the Compat MSI messages.

Both the Compat-MSI and the IR-MSI domains would be children of the
generic x86_vector_domain, and then any given HPET, IOAPIC or PCI-MSI
domain would be a child of either one of those generic MSI domains.
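
Roughly this shape (a sketch of the proposal; only x86_vector_domain exists
under that name today):

                        x86_vector_domain
                         /             \
           compat-MSI domain         IR-MSI domain (per IOMMU)
          (8/15-bit dest IDs)        (remapped, no dest ID limit)
            /      |      \            /      |      \
        PCI-MSI  HPET   IOAPIC     PCI-MSI  HPET   IOAPIC

The destination-ID restriction would then live in exactly one place, the
compat-MSI parent, and every child would inherit it through the hierarchy.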

> > It really doesn't matter that much to the underlying generic irqdomain
> > support for limited affinities. Except that you want to make the
> > generic code support the concept of a child domain supporting *more*
> > CPUs than its parent, which really doesn't make much sense if you think
> > about it.
> 
> Right. So we really want to stick the restriction into a compat-MSI
> domain to make stuff match reality and to avoid banging the head against
> the wall sooner than later.

Right.

Patch

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 4d0ef46fedb9..1c8ce7bc098f 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -2332,6 +2332,8 @@  static int mp_irqdomain_create(int ioapic)
 	}
 
 	ip->irqdomain->parent = parent;
+	if (parent == x86_vector_domain)
+		irq_domain_set_affinity(ip->irqdomain, &x86_non_ir_cpumask);
 
 	if (cfg->type == IOAPIC_DOMAIN_LEGACY ||
 	    cfg->type == IOAPIC_DOMAIN_STRICT)
diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c
index 4d891967bea4..af5ce5c4da02 100644
--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -259,6 +259,7 @@  struct irq_domain * __init native_create_pci_msi_domain(void)
 		pr_warn("Failed to initialize PCI-MSI irqdomain.\n");
 	} else {
 		d->flags |= IRQ_DOMAIN_MSI_NOMASK_QUIRK;
+		irq_domain_set_affinity(d, &x86_non_ir_cpumask);
 	}
 	return d;
 }
@@ -479,6 +480,8 @@  struct irq_domain *hpet_create_irq_domain(int hpet_id)
 		irq_domain_free_fwnode(fn);
 		kfree(domain_info);
 	}
+	if (parent == x86_vector_domain)
+		irq_domain_set_affinity(d, &x86_non_ir_cpumask);
 	return d;
 }