[for-4.13,2/2] x86/ioapic: don't use raw entry reads/writes in clear_IO_APIC_pin

Message ID 20191107150609.93004-3-roger.pau@citrix.com (mailing list archive)
State Superseded
Series x86/ioapic: fix clear_IO_APIC_pin when using interrupt remapping

Commit Message

Roger Pau Monne Nov. 7, 2019, 3:06 p.m. UTC
clear_IO_APIC_pin can be called after the iommu has been enabled, and
using raw entry reads and writes will result in a misconfiguration of
the entries already setup to use the interrupt remapping table. This
fixes the following panic seen on AMD Rome boxes:

(XEN) [   10.082154] ENABLING IO-APIC IRQs
(XEN) [   10.087789]  -> Using new ACK method
(XEN) [   10.093738] Assertion 'get_rte_index(rte) == offset' failed at iommu_intr.c:328

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Juergen Gross <jgross@suse.com>
---
 xen/arch/x86/io_apic.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Comments

Jan Beulich Nov. 7, 2019, 3:28 p.m. UTC | #1
On 07.11.2019 16:06, Roger Pau Monne wrote:
> clear_IO_APIC_pin can be called after the iommu has been enabled, and
> using raw entry reads and writes will result in a misconfiguration of
> the entries already setup to use the interrupt remapping table.

I'm afraid I don't understand this: Raw reads and writes don't even
go to the IOMMU interrupt remapping code, so how would the assertion
be triggered?

> (XEN) [   10.082154] ENABLING IO-APIC IRQs
> (XEN) [   10.087789]  -> Using new ACK method
> (XEN) [   10.093738] Assertion 'get_rte_index(rte) == offset' failed at iommu_intr.c:328
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

"Reported-by: Sergey ..." ahead of this?

> --- a/xen/arch/x86/io_apic.c
> +++ b/xen/arch/x86/io_apic.c
> @@ -514,13 +514,13 @@ static void clear_IO_APIC_pin(unsigned int apic, unsigned int pin)
>          entry.mask = 1;
>          __ioapic_write_entry(apic, pin, false, entry);
>      }
> -    entry = __ioapic_read_entry(apic, pin, true);
> +    entry = __ioapic_read_entry(apic, pin, false);
>  
>      if (entry.irr) {
>          /* Make sure the trigger mode is set to level. */
>          if (!entry.trigger) {
>              entry.trigger = 1;
> -            __ioapic_write_entry(apic, pin, true, entry);
> +            __ioapic_write_entry(apic, pin, false, entry);
>          }

All we do here is set the trigger bit. No translation back and forth
of the RTE should be needed.

> @@ -530,9 +530,9 @@ static void clear_IO_APIC_pin(unsigned int apic, unsigned int pin)
>       */
>      memset(&entry, 0, sizeof(entry));
>      entry.mask = 1;
> -    __ioapic_write_entry(apic, pin, true, entry);
> +    __ioapic_write_entry(apic, pin, false, entry);

I may be able to understand why this one can't use raw mode, but as
per above a better overall description is needed.

> -    entry = __ioapic_read_entry(apic, pin, true);
> +    entry = __ioapic_read_entry(apic, pin, false);
>      if (entry.irr)
>          printk(KERN_ERR "IO-APIC%02x-%u: Unable to reset IRR\n",
>                 IO_APIC_ID(apic), pin);

This read again shouldn't need conversion, as the IRR bit doesn't
get touched (I think) by the interrupt remapping code during the
translation it does.

Jan

Roger Pau Monne Nov. 7, 2019, 3:46 p.m. UTC | #2
On Thu, Nov 07, 2019 at 04:28:56PM +0100, Jan Beulich wrote:
> On 07.11.2019 16:06, Roger Pau Monne wrote:
> > clear_IO_APIC_pin can be called after the iommu has been enabled, and
> > using raw entry reads and writes will result in a misconfiguration of
> > the entries already setup to use the interrupt remapping table.
> 
> I'm afraid I don't understand this: Raw reads and writes don't even
> go to the IOMMU interrupt remapping code, so how would the assertion
> be triggered?

Because the code does something like:

memset(&rte, 0, ...);
...
__ioapic_write_entry(apic, pin, true, rte);

At which point you misconfigure an ioapic entry that was already setup
to point to an interrupt remapping entry, and the AMD IOMMU code
chokes in the assert below.
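
The sequence described above can be modelled in a few lines of C. This is
only a sketch under stated assumptions: model_pins, raw_write, remap_setup,
index_matches and clear_pin_raw are hypothetical names, and the bit positions
are placeholders for whatever vendor-specific encoding the IOMMU driver
stores in the entry. It is not the Xen implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model only: where the remapping-table index is assumed to live. */
#define MODEL_INDEX_MASK UINT64_C(0xffff)
/* Model only: the mask bit of the entry. */
#define MODEL_MASK_BIT   (UINT64_C(1) << 16)

/* One 64-bit value per pin stands in for the hardware registers. */
static uint64_t model_pins[24];

/* Raw write: stores the value untouched, bypassing any remapping logic. */
static void raw_write(unsigned int pin, uint64_t value)
{
    model_pins[pin] = value;
}

/* What the remapping side does when it first points a pin at its table. */
static void remap_setup(unsigned int pin, unsigned int offset)
{
    raw_write(pin, offset & MODEL_INDEX_MASK);
}

/* Stand-in for the consistency check that later fails. */
static bool index_matches(unsigned int pin, unsigned int offset)
{
    return (model_pins[pin] & MODEL_INDEX_MASK) == offset;
}

/* The problematic sequence: zero the entry, set the mask, write it raw. */
static void clear_pin_raw(unsigned int pin)
{
    uint64_t entry = 0;

    entry |= MODEL_MASK_BIT;
    raw_write(pin, entry);
}
```

In this model the check passes right after remap_setup() and fails once
clear_pin_raw() has run on the same pin, because the stored index is gone
by the time anything looks at the entry again.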

> 
> > (XEN) [   10.082154] ENABLING IO-APIC IRQs
> > (XEN) [   10.087789]  -> Using new ACK method
> > (XEN) [   10.093738] Assertion 'get_rte_index(rte) == offset' failed at iommu_intr.c:328
> > 
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> "Reported-by: Sergey ..." ahead of this?

Oh yes.

> > --- a/xen/arch/x86/io_apic.c
> > +++ b/xen/arch/x86/io_apic.c
> > @@ -514,13 +514,13 @@ static void clear_IO_APIC_pin(unsigned int apic, unsigned int pin)
> >          entry.mask = 1;
> >          __ioapic_write_entry(apic, pin, false, entry);
> >      }
> > -    entry = __ioapic_read_entry(apic, pin, true);
> > +    entry = __ioapic_read_entry(apic, pin, false);
> >  
> >      if (entry.irr) {
> >          /* Make sure the trigger mode is set to level. */
> >          if (!entry.trigger) {
> >              entry.trigger = 1;
> > -            __ioapic_write_entry(apic, pin, true, entry);
> > +            __ioapic_write_entry(apic, pin, false, entry);
> >          }
> 
> All we do here is set the trigger bit. No translation back and forth
> of the RTE should be needed.
> 
> > @@ -530,9 +530,9 @@ static void clear_IO_APIC_pin(unsigned int apic, unsigned int pin)
> >       */
> >      memset(&entry, 0, sizeof(entry));
> >      entry.mask = 1;
> > -    __ioapic_write_entry(apic, pin, true, entry);
> > +    __ioapic_write_entry(apic, pin, false, entry);
> 
> I may be able to understand why this one can't use raw mode, but as
> per above a better overall description is needed.

Yes, this is the one that's actually incorrect, but see my reasoning
below.

> 
> > -    entry = __ioapic_read_entry(apic, pin, true);
> > +    entry = __ioapic_read_entry(apic, pin, false);
> >      if (entry.irr)
> >          printk(KERN_ERR "IO-APIC%02x-%u: Unable to reset IRR\n",
> >                 IO_APIC_ID(apic), pin);
> 
> This read again shouldn't need conversion, as the IRR bit doesn't
> get touched (I think) by the interrupt remapping code during the
> translation it does.

TBH, I think raw mode should only be used by the iommu code in order
to setup the entries to point to the interrupt remapping table,
everything else shouldn't be using raw mode. While it's true that some
of the cases here are safe to use raw mode I would discourage its
usage as it can lead to issues, and this is not a performance critical
path anyway.

Thanks, Roger.

Roger Pau Monne Nov. 7, 2019, 3:51 p.m. UTC | #3
On Thu, Nov 07, 2019 at 04:46:32PM +0100, Roger Pau Monné wrote:
> On Thu, Nov 07, 2019 at 04:28:56PM +0100, Jan Beulich wrote:
> > On 07.11.2019 16:06, Roger Pau Monne wrote:
> > > clear_IO_APIC_pin can be called after the iommu has been enabled, and
> > > using raw entry reads and writes will result in a misconfiguration of
> > > the entries already setup to use the interrupt remapping table.
> > 
> > I'm afraid I don't understand this: Raw reads and writes don't even
> > go to the IOMMU interrupt remapping code, so how would the assertion
> > be triggered?
> 
> Because the code does something like:
> 
> memset(&rte, 0, ...);
> ...
> __ioapic_write_entry(apic, pin, true, rte);
> 
> At which point you misconfigure an ioapic entry that was already setup
> to point to an interrupt remapping entry, and the AMD IOMMU code
> chokes in the assert below.

Just to clarify since I think my reply hasn't been fully clear, the
ASSERT doesn't trigger in clear_IO_APIC_pin, but at a later point when
the IO-APIC entry is configured.

Thanks, Roger.

Jan Beulich Nov. 7, 2019, 3:56 p.m. UTC | #4
On 07.11.2019 16:46, Roger Pau Monné  wrote:
> On Thu, Nov 07, 2019 at 04:28:56PM +0100, Jan Beulich wrote:
>> On 07.11.2019 16:06, Roger Pau Monne wrote:
>>> @@ -530,9 +530,9 @@ static void clear_IO_APIC_pin(unsigned int apic, unsigned int pin)
>>>       */
>>>      memset(&entry, 0, sizeof(entry));
>>>      entry.mask = 1;
>>> -    __ioapic_write_entry(apic, pin, true, entry);
>>> +    __ioapic_write_entry(apic, pin, false, entry);
>>
>> I may be able to understand why this one can't use raw mode, but as
>> per above a better overall description is needed.
> 
> Yes, this is the one that's actually incorrect, but see my reasoning
> below.
> 
>>
>>> -    entry = __ioapic_read_entry(apic, pin, true);
>>> +    entry = __ioapic_read_entry(apic, pin, false);
>>>      if (entry.irr)
>>>          printk(KERN_ERR "IO-APIC%02x-%u: Unable to reset IRR\n",
>>>                 IO_APIC_ID(apic), pin);
>>
>> This read again shouldn't need conversion, as the IRR bit doesn't
>> get touched (I think) by the interrupt remapping code during the
>> translation it does.
> 
> TBH, I think raw mode should only be used by the iommu code in order
> to setup the entries to point to the interrupt remapping table,
> everything else shouldn't be using raw mode. While it's true that some
> of the cases here are safe to use raw mode I would discourage its
> usage as it can lead to issues, and this is not a performance critical
> path anyway.

You also should take the other possible perspective - not using
raw mode means going through interrupt remapping logic, which
can (needlessly) trigger errors. I think you want to break the
patch into a necessary and an optional part. The optional part
should be discussed separately and deferred until after 4.13.

Jan

Roger Pau Monne Nov. 8, 2019, 9:27 a.m. UTC | #5
On Thu, Nov 07, 2019 at 04:56:09PM +0100, Jan Beulich wrote:
> On 07.11.2019 16:46, Roger Pau Monné  wrote:
> > On Thu, Nov 07, 2019 at 04:28:56PM +0100, Jan Beulich wrote:
> >> On 07.11.2019 16:06, Roger Pau Monne wrote:
> >>> @@ -530,9 +530,9 @@ static void clear_IO_APIC_pin(unsigned int apic, unsigned int pin)
> >>>       */
> >>>      memset(&entry, 0, sizeof(entry));
> >>>      entry.mask = 1;
> >>> -    __ioapic_write_entry(apic, pin, true, entry);
> >>> +    __ioapic_write_entry(apic, pin, false, entry);
> >>
> >> I may be able to understand why this one can't use raw mode, but as
> >> per above a better overall description is needed.
> > 
> > Yes, this is the one that's actually incorrect, but see my reasoning
> > below.
> > 
> >>
> >>> -    entry = __ioapic_read_entry(apic, pin, true);
> >>> +    entry = __ioapic_read_entry(apic, pin, false);
> >>>      if (entry.irr)
> >>>          printk(KERN_ERR "IO-APIC%02x-%u: Unable to reset IRR\n",
> >>>                 IO_APIC_ID(apic), pin);
> >>
> >> This read again shouldn't need conversion, as the IRR bit doesn't
> >> get touched (I think) by the interrupt remapping code during the
> >> translation it does.
> > 
> > TBH, I think raw mode should only be used by the iommu code in order
> > to setup the entries to point to the interrupt remapping table,
> > everything else shouldn't be using raw mode. While it's true that some
> > > of the cases here are safe to use raw mode I would discourage its
> > usage as it can lead to issues, and this is not a performance critical
> > path anyway.
> 
> You also should take the other possible perspective - not using
> raw mode means going through interrupt remapping logic, which
> can (needlessly) trigger errors. I think you want to break the
> patch into a necessary and an optional part. The optional part
> should be discussed separately and deferred until after 4.13.

IMO generic IO-APIC code has no business playing with raw entries
when interrupt remapping is enabled, the layout of IO-APIC entries in
that case is vendor-specific, and hence the generic IO-APIC code is
not able to parse it.

For example the code in clear_IO_APIC_pin modifies the mask or the
trigger fields of raw entries. Is there any guarantee that those
fields don't have different meanings/layout when interrupt remapping
is enabled?
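
To make the concern concrete: generic code that inspects a raw entry is
implicitly decoding it with the standard, non-remapped layout. A minimal
sketch of that decoding follows. The bit positions are those of the standard
IO-APIC redirection entry (vector in bits 0-7, remote IRR in bit 14, trigger
mode in bit 15, mask in bit 16); the helper names are hypothetical, and
nothing here says what a remapped entry keeps in those bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions in a standard (non-remapped) redirection entry. */
#define RTE_REMOTE_IRR_BIT 14
#define RTE_TRIGGER_BIT    15
#define RTE_MASK_BIT       16

/* Extract a single bit from the raw 64-bit entry. */
static bool rte_bit(uint64_t raw, unsigned int bit)
{
    return (raw >> bit) & 1;
}

/* Remote IRR: set while a level-triggered interrupt awaits its EOI. */
static bool rte_irr(uint64_t raw)
{
    return rte_bit(raw, RTE_REMOTE_IRR_BIT);
}

/* Trigger mode: true for level, false for edge. */
static bool rte_level_triggered(uint64_t raw)
{
    return rte_bit(raw, RTE_TRIGGER_BIT);
}

/* Mask: true when the pin is masked. */
static bool rte_masked(uint64_t raw)
{
    return rte_bit(raw, RTE_MASK_BIT);
}

/* Vector: the low eight bits of the entry. */
static uint8_t rte_vector(uint64_t raw)
{
    return (uint8_t)(raw & 0xff);
}
```

These helpers are only meaningful if the entry really uses this layout,
which is exactly the assumption being questioned for remapped entries.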

I can split the specific bugfix into a separate patch, but IMO the
code in clear_IO_APIC_pin is not safe.

Thanks, Roger.

Jan Beulich Nov. 8, 2019, 10:16 a.m. UTC | #6
On 08.11.2019 10:27, Roger Pau Monné  wrote:
> On Thu, Nov 07, 2019 at 04:56:09PM +0100, Jan Beulich wrote:
>> On 07.11.2019 16:46, Roger Pau Monné  wrote:
>>> On Thu, Nov 07, 2019 at 04:28:56PM +0100, Jan Beulich wrote:
>>>> On 07.11.2019 16:06, Roger Pau Monne wrote:
>>>>> @@ -530,9 +530,9 @@ static void clear_IO_APIC_pin(unsigned int apic, unsigned int pin)
>>>>>       */
>>>>>      memset(&entry, 0, sizeof(entry));
>>>>>      entry.mask = 1;
>>>>> -    __ioapic_write_entry(apic, pin, true, entry);
>>>>> +    __ioapic_write_entry(apic, pin, false, entry);
>>>>
>>>> I may be able to understand why this one can't use raw mode, but as
>>>> per above a better overall description is needed.
>>>
>>> Yes, this is the one that's actually incorrect, but see my reasoning
>>> below.
>>>
>>>>
>>>>> -    entry = __ioapic_read_entry(apic, pin, true);
>>>>> +    entry = __ioapic_read_entry(apic, pin, false);
>>>>>      if (entry.irr)
>>>>>          printk(KERN_ERR "IO-APIC%02x-%u: Unable to reset IRR\n",
>>>>>                 IO_APIC_ID(apic), pin);
>>>>
>>>> This read again shouldn't need conversion, as the IRR bit doesn't
>>>> get touched (I think) by the interrupt remapping code during the
>>>> translation it does.
>>>
>>> TBH, I think raw mode should only be used by the iommu code in order
>>> to setup the entries to point to the interrupt remapping table,
>>> everything else shouldn't be using raw mode. While it's true that some
>>> of the cases here are safe to use raw mode I would discourage its
>>> usage as it can lead to issues, and this is not a performance critical
>>> path anyway.
>>
>> You also should take the other possible perspective - not using
>> raw mode means going through interrupt remapping logic, which
>> can (needlessly) trigger errors. I think you want to break the
>> patch into a necessary and an optional part. The optional part
>> should be discussed separately and deferred until after 4.13.
> 
> IMO generic IO-APIC code has no business playing with raw entries
> when interrupt remapping is enabled, the layout of IO-APIC entries in
> that case is vendor-specific, and hence the generic IO-APIC code is
> not able to parse it.
> 
> For example the code in clear_IO_APIC_pin modifies the mask or the
> trigger fields of RAW entries, is there any guarantee that those
> fields don't have different meanings/layout when interrupt remapping
> is enabled?

From an abstract pov there's no such guarantee, but in practice
the meaning of the fields doesn't change. You make a good point
though nevertheless: For VT-d the trigger mode fields in RTE and
IRTE need to match, so the interrupt remapping code needs to see
the trigger mode change. See below for a possible alternative
patch.
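
The point about matching trigger modes can be illustrated with a toy model.
It is a sketch under stated assumptions: struct model_pin and the three
helpers are hypothetical, and the two booleans merely stand in for the
trigger mode fields of the IO-APIC entry and of the remapping table entry.

```c
#include <stdbool.h>

/* Toy model: the two places that each hold a trigger mode for one pin. */
struct model_pin {
    bool rte_level;   /* trigger mode as held in the IO-APIC entry */
    bool irte_level;  /* trigger mode as held in the remapping table entry */
};

/* Raw write: only the IO-APIC entry changes. */
static void set_level_raw(struct model_pin *p)
{
    p->rte_level = true;
}

/* Translated write: the remapping code sees the change and mirrors it. */
static void set_level_translated(struct model_pin *p)
{
    p->irte_level = true;
    p->rte_level = true;
}

/* The invariant the two copies are expected to satisfy. */
static bool trigger_modes_match(const struct model_pin *p)
{
    return p->rte_level == p->irte_level;
}
```

Starting from an edge-triggered pin, set_level_raw() leaves the two copies
out of step, while set_level_translated() keeps them consistent.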

> I can split the specific bugfix into a separate patch, but IMO the
> code in clear_IO_APIC_pin is not safe.

A change is needed, yes, but in particular because of the use of
the function from clear_IO_APIC(), in turn called from
disable_IO_APIC(), yet in turn used e.g. during emergency
shutdown after a crash, I'd like the function to do as simple
operations as possible, i.e. specifically avoid going through
interrupt remapping code (because its data structures may also
be corrupted at that point) unless really needed (hence the
alternative patch suggestion below).

As an aside, iommu_crash_shutdown() - even if actually doing
something, i.e. disabling interrupt remapping - does _not_
cause the RTEs to be re-written in non-translated format.

Jan

--- unstable.orig/xen/arch/x86/io_apic.c
+++ unstable/xen/arch/x86/io_apic.c
@@ -519,8 +519,9 @@ static void clear_IO_APIC_pin(unsigned i
     if (entry.irr) {
         /* Make sure the trigger mode is set to level. */
         if (!entry.trigger) {
+            entry = __ioapic_read_entry(apic, pin, false);
             entry.trigger = 1;
-            __ioapic_write_entry(apic, pin, TRUE, entry);
+            __ioapic_write_entry(apic, pin, false, entry);
         }
         __io_apic_eoi(apic, entry.vector, pin);
     }
@@ -530,7 +531,7 @@ static void clear_IO_APIC_pin(unsigned i
      */
     memset(&entry, 0, sizeof(entry));
     entry.mask = 1;
-    __ioapic_write_entry(apic, pin, TRUE, entry);
+    __ioapic_write_entry(apic, pin, false, entry);
 
     entry = __ioapic_read_entry(apic, pin, TRUE);
     if (entry.irr)

Roger Pau Monne Nov. 8, 2019, 4:07 p.m. UTC | #7
On Fri, Nov 08, 2019 at 11:16:26AM +0100, Jan Beulich wrote:
> On 08.11.2019 10:27, Roger Pau Monné  wrote:
> > On Thu, Nov 07, 2019 at 04:56:09PM +0100, Jan Beulich wrote:
> >> On 07.11.2019 16:46, Roger Pau Monné  wrote:
> >>> On Thu, Nov 07, 2019 at 04:28:56PM +0100, Jan Beulich wrote:
> >>>> On 07.11.2019 16:06, Roger Pau Monne wrote:
> >>>>> @@ -530,9 +530,9 @@ static void clear_IO_APIC_pin(unsigned int apic, unsigned int pin)
> >>>>>       */
> >>>>>      memset(&entry, 0, sizeof(entry));
> >>>>>      entry.mask = 1;
> >>>>> -    __ioapic_write_entry(apic, pin, true, entry);
> >>>>> +    __ioapic_write_entry(apic, pin, false, entry);
> >>>>
> >>>> I may be able to understand why this one can't use raw mode, but as
> >>>> per above a better overall description is needed.
> >>>
> >>> Yes, this is the one that's actually incorrect, but see my reasoning
> >>> below.
> >>>
> >>>>
> >>>>> -    entry = __ioapic_read_entry(apic, pin, true);
> >>>>> +    entry = __ioapic_read_entry(apic, pin, false);
> >>>>>      if (entry.irr)
> >>>>>          printk(KERN_ERR "IO-APIC%02x-%u: Unable to reset IRR\n",
> >>>>>                 IO_APIC_ID(apic), pin);
> >>>>
> >>>> This read again shouldn't need conversion, as the IRR bit doesn't
> >>>> get touched (I think) by the interrupt remapping code during the
> >>>> translation it does.
> >>>
> >>> TBH, I think raw mode should only be used by the iommu code in order
> >>> to setup the entries to point to the interrupt remapping table,
> >>> everything else shouldn't be using raw mode. While it's true that some
> >>> of the cases here are safe to use raw mode I would discourage its
> >>> usage as it can lead to issues, and this is not a performance critical
> >>> path anyway.
> >>
> >> You also should take the other possible perspective - not using
> >> raw mode means going through interrupt remapping logic, which
> >> can (needlessly) trigger errors. I think you want to break the
> >> patch into a necessary and an optional part. The optional part
> >> should be discussed separately and deferred until after 4.13.
> > 
> > IMO generic IO-APIC code has no business playing with raw entries
> > when interrupt remapping is enabled, the layout of IO-APIC entries in
> > that case is vendor-specific, and hence the generic IO-APIC code is
> > not able to parse it.
> > 
> > For example the code in clear_IO_APIC_pin modifies the mask or the
> > trigger fields of RAW entries, is there any guarantee that those
> > fields don't have different meanings/layout when interrupt remapping
> > is enabled?
> 
> From an abstract pov there's no such guarantee, but in practice
> the meaning of the fields doesn't change. You make a good point
> though nevertheless: For VT-d the trigger mode fields in RTE and
> IRTE need to match, so the interrupt remapping code needs to see
> the trigger mode change. See below for a possible alternative
> patch.
> 
> > I can split the specific bugfix into a separate patch, but IMO the
> > code in clear_IO_APIC_pin is not safe.
> 
> A change is needed, yes, but in particular because of the use of
> the function from clear_IO_APIC(), in turn called from
> disable_IO_APIC(), yet in turn used e.g. during emergency
> shutdown after a crash, I'd like the function to do as simple
> operations as possible, i.e. specifically avoid going through
> interrupt remapping code (because its data structures may also
> be corrupted at that point) unless really needed (hence the
> alternative patch suggestion below).

Isn't just masking the entries fine when disabling the IO-APIC, or is a
full wipe of all the entries required?

I would be fine with having a maskall_IO_APIC function that reads
entries in raw format, sets the mask bit and writes the entry back in
raw format as long as it's annotated that this is done in order to
limit as much as possible the chances of hitting corrupted data in the
crash case.
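
A minimal sketch of what such a helper could look like, written against a
simulated register file so that it is self-contained. The assumptions are
loud ones: model_regs, model_read_raw, model_write_raw and model_maskall are
hypothetical names, the pin count is arbitrary, and the mask bit is taken to
be bit 16 as in the standard redirection entry layout. No such function
exists in the tree at this point.

```c
#include <stdint.h>

#define NR_MODEL_PINS  24                   /* arbitrary for the model */
#define MODEL_RTE_MASK (UINT64_C(1) << 16)  /* mask bit, standard layout */

/* Simulated redirection table: one raw 64-bit entry per pin. */
static uint64_t model_regs[NR_MODEL_PINS];

/* Raw accessors: no translation, no remapping data structures touched. */
static uint64_t model_read_raw(unsigned int pin)
{
    return model_regs[pin];
}

static void model_write_raw(unsigned int pin, uint64_t value)
{
    model_regs[pin] = value;
}

/*
 * Crash-path helper: read each entry raw, set only the mask bit and write
 * it back raw, so every other bit (including any remapping index) survives.
 */
static void model_maskall(void)
{
    for (unsigned int pin = 0; pin < NR_MODEL_PINS; pin++) {
        uint64_t rte = model_read_raw(pin);

        if (!(rte & MODEL_RTE_MASK))
            model_write_raw(pin, rte | MODEL_RTE_MASK);
    }
}
```

The read-modify-write keeps the rest of each entry intact, unlike the
memset-and-write sequence in clear_IO_APIC_pin.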

> As an aside, iommu_crash_shutdown() - even if actually doing
> something, i.e. disabling interrupt remapping - does _not_
> cause the RTEs to be re-written in non-translated format.

I'm not sure whether that will work, it's possible that some entries
can't be translated because they use x2APIC IDs, and thus don't fit in
a non-translated IO-APIC entry.

> Jan
> 
> --- unstable.orig/xen/arch/x86/io_apic.c
> +++ unstable/xen/arch/x86/io_apic.c
> @@ -519,8 +519,9 @@ static void clear_IO_APIC_pin(unsigned i
>      if (entry.irr) {
>          /* Make sure the trigger mode is set to level. */
>          if (!entry.trigger) {
> +            entry = __ioapic_read_entry(apic, pin, false);
>              entry.trigger = 1;
> -            __ioapic_write_entry(apic, pin, TRUE, entry);
> +            __ioapic_write_entry(apic, pin, false, entry);
>          }
>          __io_apic_eoi(apic, entry.vector, pin);
>      }
> @@ -530,7 +531,7 @@ static void clear_IO_APIC_pin(unsigned i
>       */
>      memset(&entry, 0, sizeof(entry));
>      entry.mask = 1;
> -    __ioapic_write_entry(apic, pin, TRUE, entry);
> +    __ioapic_write_entry(apic, pin, false, entry);
>  
>      entry = __ioapic_read_entry(apic, pin, TRUE);
>      if (entry.irr)

Well, this is certainly better than what's there currently, and should
fix the issue reported by Sergey, albeit I still think checking the
irr or the trigger fields of a raw entry when using interrupt
remapping is not future-proof.

There's no guarantee that future interrupt remapping implementations
don't clobber the non-translated fields with different ones when using
a remapped IO-APIC entry, and hence while this fixes the current issue
at hand it seems fragile.

Anyway, I think this should be fixed ASAP, so if you are happy with
this version that's fine for me. Do you want me to pick this up and
rebase it on top of my TRUE/FALSE removal patch, or would you rather
send it formally standalone?

Thanks, Roger.

Jan Beulich Nov. 8, 2019, 4:51 p.m. UTC | #8
On 08.11.2019 17:07, Roger Pau Monné  wrote:
> On Fri, Nov 08, 2019 at 11:16:26AM +0100, Jan Beulich wrote:
>> On 08.11.2019 10:27, Roger Pau Monné  wrote:
>>> On Thu, Nov 07, 2019 at 04:56:09PM +0100, Jan Beulich wrote:
>>>> On 07.11.2019 16:46, Roger Pau Monné  wrote:
>>>>> On Thu, Nov 07, 2019 at 04:28:56PM +0100, Jan Beulich wrote:
>>>>>> On 07.11.2019 16:06, Roger Pau Monne wrote:
>>>>>>> @@ -530,9 +530,9 @@ static void clear_IO_APIC_pin(unsigned int apic, unsigned int pin)
>>>>>>>       */
>>>>>>>      memset(&entry, 0, sizeof(entry));
>>>>>>>      entry.mask = 1;
>>>>>>> -    __ioapic_write_entry(apic, pin, true, entry);
>>>>>>> +    __ioapic_write_entry(apic, pin, false, entry);
>>>>>>
>>>>>> I may be able to understand why this one can't use raw mode, but as
>>>>>> per above a better overall description is needed.
>>>>>
>>>>> Yes, this is the one that's actually incorrect, but see my reasoning
>>>>> below.
>>>>>
>>>>>>
>>>>>>> -    entry = __ioapic_read_entry(apic, pin, true);
>>>>>>> +    entry = __ioapic_read_entry(apic, pin, false);
>>>>>>>      if (entry.irr)
>>>>>>>          printk(KERN_ERR "IO-APIC%02x-%u: Unable to reset IRR\n",
>>>>>>>                 IO_APIC_ID(apic), pin);
>>>>>>
>>>>>> This read again shouldn't need conversion, as the IRR bit doesn't
>>>>>> get touched (I think) by the interrupt remapping code during the
>>>>>> translation it does.
>>>>>
>>>>> TBH, I think raw mode should only be used by the iommu code in order
>>>>> to setup the entries to point to the interrupt remapping table,
>>>>> everything else shouldn't be using raw mode. While it's true that some
>>>>> of the cases here are safe to use raw mode I would discourage its
>>>>> usage as it can lead to issues, and this is not a performance critical
>>>>> path anyway.
>>>>
>>>> You also should take the other possible perspective - not using
>>>> raw mode means going through interrupt remapping logic, which
>>>> can (needlessly) trigger errors. I think you want to break the
>>>> patch into a necessary and an optional part. The optional part
>>>> should be discussed separately and deferred until after 4.13.
>>>
>>> IMO generic IO-APIC code has no business playing with raw entries
>>> when interrupt remapping is enabled, the layout of IO-APIC entries in
>>> that case is vendor-specific, and hence the generic IO-APIC code is
>>> not able to parse it.
>>>
>>> For example the code in clear_IO_APIC_pin modifies the mask or the
>>> trigger fields of RAW entries, is there any guarantee that those
>>> fields don't have different meanings/layout when interrupt remapping
>>> is enabled?
>>
>> From an abstract pov there's no such guarantee, but in practice
>> the meaning of the fields doesn't change. You make a good point
>> though nevertheless: For VT-d the trigger mode fields in RTE and
>> IRTE need to match, so the interrupt remapping code needs to see
>> the trigger mode change. See below for a possible alternative
>> patch.
>>
>>> I can split the specific bugfix into a separate patch, but IMO the
>>> code in clear_IO_APIC_pin is not safe.
>>
>> A change is needed, yes, but in particular because of the use of
>> the function from clear_IO_APIC(), in turn called from
>> disable_IO_APIC(), yet in turn used e.g. during emergency
>> shutdown after a crash, I'd like the function to do as simple
>> operations as possible, i.e. specifically avoid going through
>> interrupt remapping code (because its data structures may also
>> be corrupted at that point) unless really needed (hence the
>> alternative patch suggestion below).
> 
> Isn't just masking the entries fine when disabling the IO-APIC, or is a
> full wipe of all the entries required?

Just masking the RTE _should_ be fine (but you never know).

> I would be fine with having a maskall_IO_APIC function that reads
> entries in raw format, sets the mask bit and writes the entry back in
> raw format as long as it's annotated that this is done in order to
> limit as much as possible the chances of hitting corrupted data in the
> crash case.

Right, certainly something we may want to do for 4.14.

>> As an aside, iommu_crash_shutdown() - even if actually doing
>> something, i.e. disabling interrupt remapping - does _not_
>> cause the RTEs to be re-written in non-translated format.
> 
> I'm not sure whether that will work, it's possible that some entries
> can't be translated because they use x2APIC IDs, and thus don't fit in
> a non-translated IO-APIC entry.

And of course there's no expectation that interrupts would still
work, but any inspection (e.g. via dumping) of the RTEs would be
misleading at that point.

>> --- unstable.orig/xen/arch/x86/io_apic.c
>> +++ unstable/xen/arch/x86/io_apic.c
>> @@ -519,8 +519,9 @@ static void clear_IO_APIC_pin(unsigned i
>>      if (entry.irr) {
>>          /* Make sure the trigger mode is set to level. */
>>          if (!entry.trigger) {
>> +            entry = __ioapic_read_entry(apic, pin, false);
>>              entry.trigger = 1;
>> -            __ioapic_write_entry(apic, pin, TRUE, entry);
>> +            __ioapic_write_entry(apic, pin, false, entry);
>>          }
>>          __io_apic_eoi(apic, entry.vector, pin);
>>      }
>> @@ -530,7 +531,7 @@ static void clear_IO_APIC_pin(unsigned i
>>       */
>>      memset(&entry, 0, sizeof(entry));
>>      entry.mask = 1;
>> -    __ioapic_write_entry(apic, pin, TRUE, entry);
>> +    __ioapic_write_entry(apic, pin, false, entry);
>>  
>>      entry = __ioapic_read_entry(apic, pin, TRUE);
>>      if (entry.irr)
> 
> Well, this is certainly better than what's there currently, and should
> fix the issue reported by Sergey, albeit I still think checking the
> irr or the trigger fields of a raw entry when using interrupt
> remapping is not future-proof.
> 
> There's no guarantee that future interrupt remapping implementations
> don't clobber the non-translated fields with different ones when using
> a remapped IO-APIC entry, and hence while this fixes the current issue
> at hand it seems fragile.
> 
> Anyway, I think this should be fixed ASAP, so if you are happy with
> this version that's fine for me. Do you want me to pick this up and
> rebase it on top of my TRUE/FALSE removal patch, or would you rather
> send it formally standalone?

I'd appreciate you making this a v2 of your original patch, ideally
with a further improved description.

Jan

Patch

diff --git a/xen/arch/x86/io_apic.c b/xen/arch/x86/io_apic.c
index b9c66acdb3..13b41b46a3 100644
--- a/xen/arch/x86/io_apic.c
+++ b/xen/arch/x86/io_apic.c
@@ -514,13 +514,13 @@  static void clear_IO_APIC_pin(unsigned int apic, unsigned int pin)
         entry.mask = 1;
         __ioapic_write_entry(apic, pin, false, entry);
     }
-    entry = __ioapic_read_entry(apic, pin, true);
+    entry = __ioapic_read_entry(apic, pin, false);
 
     if (entry.irr) {
         /* Make sure the trigger mode is set to level. */
         if (!entry.trigger) {
             entry.trigger = 1;
-            __ioapic_write_entry(apic, pin, true, entry);
+            __ioapic_write_entry(apic, pin, false, entry);
         }
         __io_apic_eoi(apic, entry.vector, pin);
     }
@@ -530,9 +530,9 @@  static void clear_IO_APIC_pin(unsigned int apic, unsigned int pin)
      */
     memset(&entry, 0, sizeof(entry));
     entry.mask = 1;
-    __ioapic_write_entry(apic, pin, true, entry);
+    __ioapic_write_entry(apic, pin, false, entry);
 
-    entry = __ioapic_read_entry(apic, pin, true);
+    entry = __ioapic_read_entry(apic, pin, false);
     if (entry.irr)
         printk(KERN_ERR "IO-APIC%02x-%u: Unable to reset IRR\n",
                IO_APIC_ID(apic), pin);