diff mbox series

[RFC] Xen crashes on ASSERT on suspend/resume, suggested fix

Message ID alpine.DEB.2.22.394.2305181638390.128889@ubuntu-linux-20-04-desktop (mailing list archive)
State New, archived
Headers show
Series [RFC] Xen crashes on ASSERT on suspend/resume, suggested fix | expand

Commit Message

Stefano Stabellini May 18, 2023, 11:44 p.m. UTC
Hi all,

After many PVH Dom0 suspend/resume cycles we are seeing the following
Xen crash (it is random and doesn't reproduce reliably):

(XEN) [555.042981][<ffff82d04032a137>] R arch/x86/irq.c#_clear_irq_vector+0x214/0x2bd
(XEN) [555.042986][<ffff82d04032a74c>] F destroy_irq+0xe2/0x1b8
(XEN) [555.042991][<ffff82d0403276db>] F msi_free_irq+0x5e/0x1a7
(XEN) [555.042995][<ffff82d04032da2d>] F unmap_domain_pirq+0x441/0x4b4
(XEN) [555.043001][<ffff82d0402d29b9>] F arch/x86/hvm/vmsi.c#vpci_msi_disable+0xc0/0x155
(XEN) [555.043006][<ffff82d0402d39fc>] F vpci_msi_arch_disable+0x1e/0x2b
(XEN) [555.043013][<ffff82d04026561c>] F drivers/vpci/msi.c#control_write+0x109/0x10e
(XEN) [555.043018][<ffff82d0402640c3>] F vpci_write+0x11f/0x268
(XEN) [555.043024][<ffff82d0402c726a>] F arch/x86/hvm/io.c#vpci_portio_write+0xa0/0xa7
(XEN) [555.043029][<ffff82d0402c6682>] F hvm_process_io_intercept+0x203/0x26f
(XEN) [555.043034][<ffff82d0402c6718>] F hvm_io_intercept+0x2a/0x4c
(XEN) [555.043039][<ffff82d0402b6353>] F arch/x86/hvm/emulate.c#hvmemul_do_io+0x29b/0x5f6
(XEN) [555.043043][<ffff82d0402b66dd>] F arch/x86/hvm/emulate.c#hvmemul_do_io_buffer+0x2f/0x6a
(XEN) [555.043048][<ffff82d0402b7bde>] F hvmemul_do_pio_buffer+0x33/0x35
(XEN) [555.043053][<ffff82d0402c7042>] F handle_pio+0x6d/0x1b4
(XEN) [555.043059][<ffff82d04029ec20>] F svm_vmexit_handler+0x10bf/0x18b0
(XEN) [555.043064][<ffff82d0402034e5>] F svm_stgi_label+0x8/0x18
(XEN) [555.043067]
(XEN) [555.469861]
(XEN) [555.471855] ****************************************
(XEN) [555.477315] Panic on CPU 9:
(XEN) [555.480608] Assertion 'per_cpu(vector_irq, cpu)[old_vector] == irq' failed at arch/x86/irq.c:233
(XEN) [555.489882] ****************************************

Looking at the code in question, the ASSERT looks wrong to me.

Specifically, if you see send_cleanup_vector and
irq_move_cleanup_interrupt, it is entirely possible to have old_vector
still valid and also move_in_progress still set, but only some of the
per_cpu(vector_irq, me)[vector] cleared. It seems to me that this could
happen especially when an MSI has a large old_cpu_mask.

While we go and clear per_cpu(vector_irq, me)[vector] in each CPU one by
one, the state is that not all per_cpu(vector_irq, me)[vector] are
cleared and old_vector is still set.

If at this point we enter _clear_irq_vector, we are going to hit the
ASSERT above.

My suggestion was to turn the ASSERT into an if. Any better ideas?

Cheers,

Stefano

Comments

Jan Beulich May 22, 2023, 1:03 p.m. UTC | #1
On 19.05.2023 01:44, Stefano Stabellini wrote:
> Hi all,
> 
> After many PVH Dom0 suspend/resume cycles we are seeing the following
> Xen crash (it is random and doesn't reproduce reliably):
> 
> (XEN) [555.042981][<ffff82d04032a137>] R arch/x86/irq.c#_clear_irq_vector+0x214/0x2bd
> (XEN) [555.042986][<ffff82d04032a74c>] F destroy_irq+0xe2/0x1b8
> (XEN) [555.042991][<ffff82d0403276db>] F msi_free_irq+0x5e/0x1a7
> (XEN) [555.042995][<ffff82d04032da2d>] F unmap_domain_pirq+0x441/0x4b4
> (XEN) [555.043001][<ffff82d0402d29b9>] F arch/x86/hvm/vmsi.c#vpci_msi_disable+0xc0/0x155
> (XEN) [555.043006][<ffff82d0402d39fc>] F vpci_msi_arch_disable+0x1e/0x2b
> (XEN) [555.043013][<ffff82d04026561c>] F drivers/vpci/msi.c#control_write+0x109/0x10e
> (XEN) [555.043018][<ffff82d0402640c3>] F vpci_write+0x11f/0x268
> (XEN) [555.043024][<ffff82d0402c726a>] F arch/x86/hvm/io.c#vpci_portio_write+0xa0/0xa7
> (XEN) [555.043029][<ffff82d0402c6682>] F hvm_process_io_intercept+0x203/0x26f
> (XEN) [555.043034][<ffff82d0402c6718>] F hvm_io_intercept+0x2a/0x4c
> (XEN) [555.043039][<ffff82d0402b6353>] F arch/x86/hvm/emulate.c#hvmemul_do_io+0x29b/0x5f6
> (XEN) [555.043043][<ffff82d0402b66dd>] F arch/x86/hvm/emulate.c#hvmemul_do_io_buffer+0x2f/0x6a
> (XEN) [555.043048][<ffff82d0402b7bde>] F hvmemul_do_pio_buffer+0x33/0x35
> (XEN) [555.043053][<ffff82d0402c7042>] F handle_pio+0x6d/0x1b4
> (XEN) [555.043059][<ffff82d04029ec20>] F svm_vmexit_handler+0x10bf/0x18b0
> (XEN) [555.043064][<ffff82d0402034e5>] F svm_stgi_label+0x8/0x18
> (XEN) [555.043067]
> (XEN) [555.469861]
> (XEN) [555.471855] ****************************************
> (XEN) [555.477315] Panic on CPU 9:
> (XEN) [555.480608] Assertion 'per_cpu(vector_irq, cpu)[old_vector] == irq' failed at arch/x86/irq.c:233
> (XEN) [555.489882] ****************************************
> 
> Looking at the code in question, the ASSERT looks wrong to me.
> 
> Specifically, if you see send_cleanup_vector and
> irq_move_cleanup_interrupt, it is entirely possible to have old_vector
> still valid and also move_in_progress still set, but only some of the
> per_cpu(vector_irq, me)[vector] cleared. It seems to me that this could
> happen especially when an MSI has a large old_cpu_mask.
> 
> While we go and clear per_cpu(vector_irq, me)[vector] in each CPU one by
> one, the state is that not all per_cpu(vector_irq, me)[vector] are
> cleared and old_vector is still set.
> 
> If at this point we enter _clear_irq_vector, we are going to hit the
> ASSERT above.
> 
> My suggestion was to turn the ASSERT into an if. Any better ideas?

While I'm fully willing to believe there is something that needs fixing,
we need to understand what's going wrong before applying any change.
Even more so when the suggestion is the simplest possible, converting an
assertion to an if(). There's no explanation at all what happens in the
"else" case: Are we (silently) leaking vectors then? Any other undue
side effects?

What I'm particularly missing is any connection to suspend/resume. There
my primary suspect would be an issue with/in fixup_irqs().

What might be relevant in the investigation is what the value is that
is found in the array slot. In particular if it was already ~irq, it
may hint at where the issue is coming from and/or that the assertion
may merely need a little bit of relaxing. Allowing for that value in an
array slot would in particular not raise any questions towards "what if
not" (as mentioned above), because that's already the value we store
there anyway.

Jan
Roger Pau Monné May 23, 2023, 1:54 p.m. UTC | #2
On Thu, May 18, 2023 at 04:44:53PM -0700, Stefano Stabellini wrote:
> Hi all,
> 
> After many PVH Dom0 suspend/resume cycles we are seeing the following
> Xen crash (it is random and doesn't reproduce reliably):
> 
> (XEN) [555.042981][<ffff82d04032a137>] R arch/x86/irq.c#_clear_irq_vector+0x214/0x2bd
> (XEN) [555.042986][<ffff82d04032a74c>] F destroy_irq+0xe2/0x1b8
> (XEN) [555.042991][<ffff82d0403276db>] F msi_free_irq+0x5e/0x1a7
> (XEN) [555.042995][<ffff82d04032da2d>] F unmap_domain_pirq+0x441/0x4b4
> (XEN) [555.043001][<ffff82d0402d29b9>] F arch/x86/hvm/vmsi.c#vpci_msi_disable+0xc0/0x155
> (XEN) [555.043006][<ffff82d0402d39fc>] F vpci_msi_arch_disable+0x1e/0x2b
> (XEN) [555.043013][<ffff82d04026561c>] F drivers/vpci/msi.c#control_write+0x109/0x10e
> (XEN) [555.043018][<ffff82d0402640c3>] F vpci_write+0x11f/0x268
> (XEN) [555.043024][<ffff82d0402c726a>] F arch/x86/hvm/io.c#vpci_portio_write+0xa0/0xa7
> (XEN) [555.043029][<ffff82d0402c6682>] F hvm_process_io_intercept+0x203/0x26f
> (XEN) [555.043034][<ffff82d0402c6718>] F hvm_io_intercept+0x2a/0x4c
> (XEN) [555.043039][<ffff82d0402b6353>] F arch/x86/hvm/emulate.c#hvmemul_do_io+0x29b/0x5f6
> (XEN) [555.043043][<ffff82d0402b66dd>] F arch/x86/hvm/emulate.c#hvmemul_do_io_buffer+0x2f/0x6a
> (XEN) [555.043048][<ffff82d0402b7bde>] F hvmemul_do_pio_buffer+0x33/0x35
> (XEN) [555.043053][<ffff82d0402c7042>] F handle_pio+0x6d/0x1b4
> (XEN) [555.043059][<ffff82d04029ec20>] F svm_vmexit_handler+0x10bf/0x18b0
> (XEN) [555.043064][<ffff82d0402034e5>] F svm_stgi_label+0x8/0x18
> (XEN) [555.043067]
> (XEN) [555.469861]
> (XEN) [555.471855] ****************************************
> (XEN) [555.477315] Panic on CPU 9:
> (XEN) [555.480608] Assertion 'per_cpu(vector_irq, cpu)[old_vector] == irq' failed at arch/x86/irq.c:233
> (XEN) [555.489882] ****************************************
> 
> Looking at the code in question, the ASSERT looks wrong to me.
> 
> Specifically, if you see send_cleanup_vector and
> irq_move_cleanup_interrupt, it is entirely possible to have old_vector
> still valid and also move_in_progress still set, but only some of the
> per_cpu(vector_irq, me)[vector] cleared. It seems to me that this could
> happen especially when an MSI has a large old_cpu_mask.

i guess the only way to get into such situation would be if you happen
to execute _clear_irq_vector() with a cpu_online_map smaller than the
one in old_cpu_mask, at which point you will leave old_vector fields
not updated.

Maybe somehow you get into this situation when doing suspend/resume?

Could you try to add a:

ASSERT(cpumask_equal(tmp_mask, desc->arch.old_cpu_mask));

Before the `for_each_cpu(cpu, tmp_mask)` loop?

Thanks, Roger.
Roger Pau Monné May 23, 2023, 2:50 p.m. UTC | #3
On Tue, May 23, 2023 at 03:54:36PM +0200, Roger Pau Monné wrote:
> On Thu, May 18, 2023 at 04:44:53PM -0700, Stefano Stabellini wrote:
> > Hi all,
> > 
> > After many PVH Dom0 suspend/resume cycles we are seeing the following
> > Xen crash (it is random and doesn't reproduce reliably):
> > 
> > (XEN) [555.042981][<ffff82d04032a137>] R arch/x86/irq.c#_clear_irq_vector+0x214/0x2bd
> > (XEN) [555.042986][<ffff82d04032a74c>] F destroy_irq+0xe2/0x1b8
> > (XEN) [555.042991][<ffff82d0403276db>] F msi_free_irq+0x5e/0x1a7
> > (XEN) [555.042995][<ffff82d04032da2d>] F unmap_domain_pirq+0x441/0x4b4
> > (XEN) [555.043001][<ffff82d0402d29b9>] F arch/x86/hvm/vmsi.c#vpci_msi_disable+0xc0/0x155
> > (XEN) [555.043006][<ffff82d0402d39fc>] F vpci_msi_arch_disable+0x1e/0x2b
> > (XEN) [555.043013][<ffff82d04026561c>] F drivers/vpci/msi.c#control_write+0x109/0x10e
> > (XEN) [555.043018][<ffff82d0402640c3>] F vpci_write+0x11f/0x268
> > (XEN) [555.043024][<ffff82d0402c726a>] F arch/x86/hvm/io.c#vpci_portio_write+0xa0/0xa7
> > (XEN) [555.043029][<ffff82d0402c6682>] F hvm_process_io_intercept+0x203/0x26f
> > (XEN) [555.043034][<ffff82d0402c6718>] F hvm_io_intercept+0x2a/0x4c
> > (XEN) [555.043039][<ffff82d0402b6353>] F arch/x86/hvm/emulate.c#hvmemul_do_io+0x29b/0x5f6
> > (XEN) [555.043043][<ffff82d0402b66dd>] F arch/x86/hvm/emulate.c#hvmemul_do_io_buffer+0x2f/0x6a
> > (XEN) [555.043048][<ffff82d0402b7bde>] F hvmemul_do_pio_buffer+0x33/0x35
> > (XEN) [555.043053][<ffff82d0402c7042>] F handle_pio+0x6d/0x1b4
> > (XEN) [555.043059][<ffff82d04029ec20>] F svm_vmexit_handler+0x10bf/0x18b0
> > (XEN) [555.043064][<ffff82d0402034e5>] F svm_stgi_label+0x8/0x18
> > (XEN) [555.043067]
> > (XEN) [555.469861]
> > (XEN) [555.471855] ****************************************
> > (XEN) [555.477315] Panic on CPU 9:
> > (XEN) [555.480608] Assertion 'per_cpu(vector_irq, cpu)[old_vector] == irq' failed at arch/x86/irq.c:233
> > (XEN) [555.489882] ****************************************
> > 
> > Looking at the code in question, the ASSERT looks wrong to me.
> > 
> > Specifically, if you see send_cleanup_vector and
> > irq_move_cleanup_interrupt, it is entirely possible to have old_vector
> > still valid and also move_in_progress still set, but only some of the
> > per_cpu(vector_irq, me)[vector] cleared. It seems to me that this could
> > happen especially when an MSI has a large old_cpu_mask.
> 
> i guess the only way to get into such situation would be if you happen
> to execute _clear_irq_vector() with a cpu_online_map smaller than the
> one in old_cpu_mask, at which point you will leave old_vector fields
> not updated.
> 
> Maybe somehow you get into this situation when doing suspend/resume?
> 
> Could you try to add a:
> 
> ASSERT(cpumask_equal(tmp_mask, desc->arch.old_cpu_mask));
> 
> Before the `for_each_cpu(cpu, tmp_mask)` loop?

I see that the old_cpu_mask is cleared in release_old_vec(), so that
suggestion is not very useful.

Does the crash happen at specific points, for example just after
resume or before suspend?

Roger.
Stefano Stabellini May 24, 2023, 11:51 p.m. UTC | #4
On Tue, 23 May 2023, Roger Pau Monné wrote:
> On Tue, May 23, 2023 at 03:54:36PM +0200, Roger Pau Monné wrote:
> > On Thu, May 18, 2023 at 04:44:53PM -0700, Stefano Stabellini wrote:
> > > Hi all,
> > > 
> > > After many PVH Dom0 suspend/resume cycles we are seeing the following
> > > Xen crash (it is random and doesn't reproduce reliably):
> > > 
> > > (XEN) [555.042981][<ffff82d04032a137>] R arch/x86/irq.c#_clear_irq_vector+0x214/0x2bd
> > > (XEN) [555.042986][<ffff82d04032a74c>] F destroy_irq+0xe2/0x1b8
> > > (XEN) [555.042991][<ffff82d0403276db>] F msi_free_irq+0x5e/0x1a7
> > > (XEN) [555.042995][<ffff82d04032da2d>] F unmap_domain_pirq+0x441/0x4b4
> > > (XEN) [555.043001][<ffff82d0402d29b9>] F arch/x86/hvm/vmsi.c#vpci_msi_disable+0xc0/0x155
> > > (XEN) [555.043006][<ffff82d0402d39fc>] F vpci_msi_arch_disable+0x1e/0x2b
> > > (XEN) [555.043013][<ffff82d04026561c>] F drivers/vpci/msi.c#control_write+0x109/0x10e
> > > (XEN) [555.043018][<ffff82d0402640c3>] F vpci_write+0x11f/0x268
> > > (XEN) [555.043024][<ffff82d0402c726a>] F arch/x86/hvm/io.c#vpci_portio_write+0xa0/0xa7
> > > (XEN) [555.043029][<ffff82d0402c6682>] F hvm_process_io_intercept+0x203/0x26f
> > > (XEN) [555.043034][<ffff82d0402c6718>] F hvm_io_intercept+0x2a/0x4c
> > > (XEN) [555.043039][<ffff82d0402b6353>] F arch/x86/hvm/emulate.c#hvmemul_do_io+0x29b/0x5f6
> > > (XEN) [555.043043][<ffff82d0402b66dd>] F arch/x86/hvm/emulate.c#hvmemul_do_io_buffer+0x2f/0x6a
> > > (XEN) [555.043048][<ffff82d0402b7bde>] F hvmemul_do_pio_buffer+0x33/0x35
> > > (XEN) [555.043053][<ffff82d0402c7042>] F handle_pio+0x6d/0x1b4
> > > (XEN) [555.043059][<ffff82d04029ec20>] F svm_vmexit_handler+0x10bf/0x18b0
> > > (XEN) [555.043064][<ffff82d0402034e5>] F svm_stgi_label+0x8/0x18
> > > (XEN) [555.043067]
> > > (XEN) [555.469861]
> > > (XEN) [555.471855] ****************************************
> > > (XEN) [555.477315] Panic on CPU 9:
> > > (XEN) [555.480608] Assertion 'per_cpu(vector_irq, cpu)[old_vector] == irq' failed at arch/x86/irq.c:233
> > > (XEN) [555.489882] ****************************************
> > > 
> > > Looking at the code in question, the ASSERT looks wrong to me.
> > > 
> > > Specifically, if you see send_cleanup_vector and
> > > irq_move_cleanup_interrupt, it is entirely possible to have old_vector
> > > still valid and also move_in_progress still set, but only some of the
> > > per_cpu(vector_irq, me)[vector] cleared. It seems to me that this could
> > > happen especially when an MSI has a large old_cpu_mask.
> > 
> > i guess the only way to get into such situation would be if you happen
> > to execute _clear_irq_vector() with a cpu_online_map smaller than the
> > one in old_cpu_mask, at which point you will leave old_vector fields
> > not updated.
> > 
> > Maybe somehow you get into this situation when doing suspend/resume?
> > 
> > Could you try to add a:
> > 
> > ASSERT(cpumask_equal(tmp_mask, desc->arch.old_cpu_mask));
> > 
> > Before the `for_each_cpu(cpu, tmp_mask)` loop?
> 
> I see that the old_cpu_mask is cleared in release_old_vec(), so that
> suggestion is not very useful.
> 
> Does the crash happen at specific points, for example just after
> resume or before suspend?

Hi Roger, Jan,

Unfortunately we are all completely blind on this issue. It is not
reproducible. It only happened once. We only have our wits to solve this
problem. However, looking at the codebase I think there are a few
possible races. Here is my analysis and suggested fix.


---
xen/irq: fix races between send_cleanup_vector and _clear_irq_vector

It is possible that send_cleanup_vector and _clear_irq_vector are
running at the same time on different CPUs. In that case we have a race
as both _clear_irq_vector and irq_move_cleanup_interrupt are trying to
clear old_vector.

This patch fixes 3 races:

1) As irq_move_cleanup_interrupt is running on multiple CPUs at the
same time, and also _clear_irq_vector is running, it is possible that
only some per_cpu(vector_irq, cpu)[old_vector] are valid but not all.
So, turn the ASSERT in _clear_irq_vector into an if.

2) It is possible that _clear_irq_vector is running at the same time as
release_old_vec, called from irq_move_cleanup_interrupt. At the moment,
it is possible for _clear_irq_vector to read a valid old_cpu_mask but an
invalid old_vector (because it is being set to invalid by
release_old_vec). To avoid this problem in release_old_vec move clearing
old_cpu_mask before setting old_vector to invalid. This way, we know that
in _clear_irq_vector if old_vector is invalid also old_cpu_mask is zero
and we don't enter the loop.

3) It is possible that release_old_vec is running twice at the same time
for the same old_vector. Change the code in release_old_vec to make it
OK to call it twice. Remove both ASSERTs. With those gone, it should be
possible now to call release_old_vec twice in a row for the same
old_vector.

Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com>
---
 xen/arch/x86/irq.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index 20150b1c7f..d9c139cf1b 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -112,16 +112,11 @@ static void release_old_vec(struct irq_desc *desc)
 {
     unsigned int vector = desc->arch.old_vector;
 
-    desc->arch.old_vector = IRQ_VECTOR_UNASSIGNED;
     cpumask_clear(desc->arch.old_cpu_mask);
+    desc->arch.old_vector = IRQ_VECTOR_UNASSIGNED;
 
-    if ( !valid_irq_vector(vector) )
-        ASSERT_UNREACHABLE();
-    else if ( desc->arch.used_vectors )
-    {
-        ASSERT(test_bit(vector, desc->arch.used_vectors));
+    if ( desc->arch.used_vectors )
         clear_bit(vector, desc->arch.used_vectors);
-    }
 }
 
 static void _trace_irq_mask(uint32_t event, int irq, int vector,
@@ -230,9 +225,11 @@ static void _clear_irq_vector(struct irq_desc *desc)
 
         for_each_cpu(cpu, tmp_mask)
         {
-            ASSERT(per_cpu(vector_irq, cpu)[old_vector] == irq);
-            TRACE_3D(TRC_HW_IRQ_MOVE_FINISH, irq, old_vector, cpu);
-            per_cpu(vector_irq, cpu)[old_vector] = ~irq;
+            if ( per_cpu(vector_irq, cpu)[old_vector] == irq )
+            {
+                TRACE_3D(TRC_HW_IRQ_MOVE_FINISH, irq, old_vector, cpu);
+                per_cpu(vector_irq, cpu)[old_vector] = ~irq;
+            }
         }
 
         release_old_vec(desc);
Jan Beulich May 25, 2023, 11:30 a.m. UTC | #5
On 25.05.2023 01:51, Stefano Stabellini wrote:
> xen/irq: fix races between send_cleanup_vector and _clear_irq_vector

This title is, I'm afraid, already misleading. No such race can occur
afaict, as both callers of _clear_irq_vector() acquire the IRQ
descriptor lock first, and irq_complete_move() (the sole caller of
send_cleanup_vector()) is only ever invoked as or by an ->ack()
hook, which in turn is only invoked with, again, the descriptor lock
held.

> It is possible that send_cleanup_vector and _clear_irq_vector are
> running at the same time on different CPUs. In that case we have a race
> as both _clear_irq_vector and irq_move_cleanup_interrupt are trying to
> clear old_vector.
> 
> This patch fixes 3 races:
> 
> 1) As irq_move_cleanup_interrupt is running on multiple CPUs at the
> same time, and also _clear_irq_vector is running, it is possible that
> only some per_cpu(vector_irq, cpu)[old_vector] are valid but not all.
> So, turn the ASSERT in _clear_irq_vector into an if.

Note again the locking which is in effect.

> 2) It is possible that _clear_irq_vector is running at the same time as
> release_old_vec, called from irq_move_cleanup_interrupt. At the moment,
> it is possible for _clear_irq_vector to read a valid old_cpu_mask but an
> invalid old_vector (because it is being set to invalid by
> release_old_vec). To avoid this problem in release_old_vec move clearing
> old_cpu_mask before setting old_vector to invalid. This way, we know that
> in _clear_irq_vector if old_vector is invalid also old_cpu_mask is zero
> and we don't enter the loop.

All invocations of release_old_vec() are similarly inside suitably
locked regions.

> 3) It is possible that release_old_vec is running twice at the same time
> for the same old_vector. Change the code in release_old_vec to make it
> OK to call it twice. Remove both ASSERTs. With those gone, it should be
> possible now to call release_old_vec twice in a row for the same
> old_vector.

Same here.

Any such issues would surface more frequently and without any suspend /
resume involved. What is still missing is that connection, and only then
it'll (or really: may) become clear what needs adjusting. If you've seen
the issue exactly once, then I'm afraid there's not much we can do unless
someone can come up with a plausible explanation of something being
broken on any of the involved code paths. More information will need to
be gathered out of the next occurrence of this, whenever that's going to
be. One of the things we will want to know, as mentioned before, is the
value that per_cpu(vector_irq, cpu)[old_vector] has when the assertion
triggers. Iirc Roger did suggest another piece of data you'd want to log.

> --- a/xen/arch/x86/irq.c
> +++ b/xen/arch/x86/irq.c
> @@ -112,16 +112,11 @@ static void release_old_vec(struct irq_desc *desc)
>  {
>      unsigned int vector = desc->arch.old_vector;
>  
> -    desc->arch.old_vector = IRQ_VECTOR_UNASSIGNED;
>      cpumask_clear(desc->arch.old_cpu_mask);
> +    desc->arch.old_vector = IRQ_VECTOR_UNASSIGNED;
>  
> -    if ( !valid_irq_vector(vector) )
> -        ASSERT_UNREACHABLE();
> -    else if ( desc->arch.used_vectors )
> -    {
> -        ASSERT(test_bit(vector, desc->arch.used_vectors));
> +    if ( desc->arch.used_vectors )
>          clear_bit(vector, desc->arch.used_vectors);
> -    }
>  }
>  
>  static void _trace_irq_mask(uint32_t event, int irq, int vector,
> @@ -230,9 +225,11 @@ static void _clear_irq_vector(struct irq_desc *desc)
>  
>          for_each_cpu(cpu, tmp_mask)
>          {
> -            ASSERT(per_cpu(vector_irq, cpu)[old_vector] == irq);
> -            TRACE_3D(TRC_HW_IRQ_MOVE_FINISH, irq, old_vector, cpu);
> -            per_cpu(vector_irq, cpu)[old_vector] = ~irq;
> +            if ( per_cpu(vector_irq, cpu)[old_vector] == irq )
> +            {
> +                TRACE_3D(TRC_HW_IRQ_MOVE_FINISH, irq, old_vector, cpu);
> +                per_cpu(vector_irq, cpu)[old_vector] = ~irq;
> +            }
>          }

As said before - replacing ASSERT() by a respective if() cannot really
be done without discussing the "else" in the description. Except of
course in trivial/obvious cases, but I think we agree here we don't
have such a case.

Jan
Stefano Stabellini May 25, 2023, 7:54 p.m. UTC | #6
On Thu, 25 May 2023, Jan Beulich wrote:
> On 25.05.2023 01:51, Stefano Stabellini wrote:
> > xen/irq: fix races between send_cleanup_vector and _clear_irq_vector
> 
> This title is, I'm afraid, already misleading. No such race can occur
> afaict, as both callers of _clear_irq_vector() acquire the IRQ
> descriptor lock first, and irq_complete_move() (the sole caller of
> send_cleanup_vector()) is only ever invoked as or by an ->ack()
> hook, which in turn is only invoked with, again, the descriptor lock
> held.

Yes I see that you are right about the locking, and thank you for taking
the time to look into it.

One last question: could it be that a second interrupt arrives while
->ack() is being handled?  do_IRQ() is running with interrupts disabled?


> > It is possible that send_cleanup_vector and _clear_irq_vector are
> > running at the same time on different CPUs. In that case we have a race
> > as both _clear_irq_vector and irq_move_cleanup_interrupt are trying to
> > clear old_vector.
> > 
> > This patch fixes 3 races:
> > 
> > 1) As irq_move_cleanup_interrupt is running on multiple CPUs at the
> > same time, and also _clear_irq_vector is running, it is possible that
> > only some per_cpu(vector_irq, cpu)[old_vector] are valid but not all.
> > So, turn the ASSERT in _clear_irq_vector into an if.
> 
> Note again the locking which is in effect.
> 
> > 2) It is possible that _clear_irq_vector is running at the same time as
> > release_old_vec, called from irq_move_cleanup_interrupt. At the moment,
> > it is possible for _clear_irq_vector to read a valid old_cpu_mask but an
> > invalid old_vector (because it is being set to invalid by
> > release_old_vec). To avoid this problem in release_old_vec move clearing
> > old_cpu_mask before setting old_vector to invalid. This way, we know that
> > in _clear_irq_vector if old_vector is invalid also old_cpu_mask is zero
> > and we don't enter the loop.
> 
> All invocations of release_old_vec() are similarly inside suitably
> locked regions.
> 
> > 3) It is possible that release_old_vec is running twice at the same time
> > for the same old_vector. Change the code in release_old_vec to make it
> > OK to call it twice. Remove both ASSERTs. With those gone, it should be
> > possible now to call release_old_vec twice in a row for the same
> > old_vector.
> 
> Same here.
> 
> Any such issues would surface more frequently and without any suspend /
> resume involved. What is still missing is that connection, and only then
> it'll (or really: may) become clear what needs adjusting. If you've seen
> the issue exactly once, then I'm afraid there's not much we can do unless
> someone can come up with a plausible explanation of something being
> broken on any of the involved code paths. More information will need to
> be gathered out of the next occurrence of this, whenever that's going to
> be. One of the things we will want to know, as mentioned before, is the
> value that per_cpu(vector_irq, cpu)[old_vector] has when the assertion
> triggers. Iirc Roger did suggest another piece of data you'd want to log.

Understood, thanks for the explanation.


> > --- a/xen/arch/x86/irq.c
> > +++ b/xen/arch/x86/irq.c
> > @@ -112,16 +112,11 @@ static void release_old_vec(struct irq_desc *desc)
> >  {
> >      unsigned int vector = desc->arch.old_vector;
> >  
> > -    desc->arch.old_vector = IRQ_VECTOR_UNASSIGNED;
> >      cpumask_clear(desc->arch.old_cpu_mask);
> > +    desc->arch.old_vector = IRQ_VECTOR_UNASSIGNED;
> >  
> > -    if ( !valid_irq_vector(vector) )
> > -        ASSERT_UNREACHABLE();
> > -    else if ( desc->arch.used_vectors )
> > -    {
> > -        ASSERT(test_bit(vector, desc->arch.used_vectors));
> > +    if ( desc->arch.used_vectors )
> >          clear_bit(vector, desc->arch.used_vectors);
> > -    }
> >  }
> >  
> >  static void _trace_irq_mask(uint32_t event, int irq, int vector,
> > @@ -230,9 +225,11 @@ static void _clear_irq_vector(struct irq_desc *desc)
> >  
> >          for_each_cpu(cpu, tmp_mask)
> >          {
> > -            ASSERT(per_cpu(vector_irq, cpu)[old_vector] == irq);
> > -            TRACE_3D(TRC_HW_IRQ_MOVE_FINISH, irq, old_vector, cpu);
> > -            per_cpu(vector_irq, cpu)[old_vector] = ~irq;
> > +            if ( per_cpu(vector_irq, cpu)[old_vector] == irq )
> > +            {
> > +                TRACE_3D(TRC_HW_IRQ_MOVE_FINISH, irq, old_vector, cpu);
> > +                per_cpu(vector_irq, cpu)[old_vector] = ~irq;
> > +            }
> >          }
> 
> As said before - replacing ASSERT() by a respective if() cannot really
> be done without discussing the "else" in the description. Except of
> course in trivial/obvious cases, but I think we agree here we don't
> have such a case.
Jan Beulich May 26, 2023, 7:03 a.m. UTC | #7
On 25.05.2023 21:54, Stefano Stabellini wrote:
> On Thu, 25 May 2023, Jan Beulich wrote:
>> On 25.05.2023 01:51, Stefano Stabellini wrote:
>>> xen/irq: fix races between send_cleanup_vector and _clear_irq_vector
>>
>> This title is, I'm afraid, already misleading. No such race can occur
>> afaict, as both callers of _clear_irq_vector() acquire the IRQ
>> descriptor lock first, and irq_complete_move() (the sole caller of
>> send_cleanup_vector()) is only ever invoked as or by an ->ack()
>> hook, which in turn is only invoked with, again, the descriptor lock
>> held.
> 
> Yes I see that you are right about the locking, and thank you for taking
> the time to look into it.
> 
> One last question: could it be that a second interrupt arrives while
> ->ack() is being handled?  do_IRQ() is running with interrupts disabled?

It is, at least as far as the invocation of ->ack() is concerned. Else
the locking scheme would be broken. You may not that around ->handler()
invocation we enable interrupts.

Jan
Stefano Stabellini May 31, 2023, 12:32 a.m. UTC | #8
On Fri, 26 May 2023, Jan Beulich wrote:
> On 25.05.2023 21:54, Stefano Stabellini wrote:
> > On Thu, 25 May 2023, Jan Beulich wrote:
> >> On 25.05.2023 01:51, Stefano Stabellini wrote:
> >>> xen/irq: fix races between send_cleanup_vector and _clear_irq_vector
> >>
> >> This title is, I'm afraid, already misleading. No such race can occur
> >> afaict, as both callers of _clear_irq_vector() acquire the IRQ
> >> descriptor lock first, and irq_complete_move() (the sole caller of
> >> send_cleanup_vector()) is only ever invoked as or by an ->ack()
> >> hook, which in turn is only invoked with, again, the descriptor lock
> >> held.
> > 
> > Yes I see that you are right about the locking, and thank you for taking
> > the time to look into it.
> > 
> > One last question: could it be that a second interrupt arrives while
> > ->ack() is being handled?  do_IRQ() is running with interrupts disabled?
> 
> It is, at least as far as the invocation of ->ack() is concerned. Else
> the locking scheme would be broken. You may not that around ->handler()
> invocation we enable interrupts.

OK. FYI, we were able to repro a problem after 250+ suspend/resume Dom0
cycles with my patch applied. So unfortunately there is no extra
information as my patch removes the ASSERTs.

However I can tell you that the symptom is the below. I am not sure if
it tells you anything but FYI. So clearly my patch makes the problem
harder to repro but doesn't fix it.


May 23 22:47:31 amd-saravana-crater kernel: [17881.744986] INFO: task kworker/u8:1:45 blocked for more than 120 seconds.
May 23 22:47:31 amd-saravana-crater kernel: [17881.745048]       Not tainted 6.1.0-rtc-s3 #1
May 23 22:47:31 amd-saravana-crater kernel: [17881.745089] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 23 22:47:31 amd-saravana-crater kernel: [17881.745144] task:kworker/u8:1    state:D stack:0     pid:45    ppid:2      flags:0x00004000
May 23 22:47:31 amd-saravana-crater kernel: [17881.745154] Workqueue: writeback wb_workfn (flush-259:0)
May 23 22:47:31 amd-saravana-crater kernel: [17881.745170] Call Trace:
May 23 22:47:31 amd-saravana-crater kernel: [17881.745174]  <TASK>
May 23 22:47:31 amd-saravana-crater kernel: [17881.745182]  __schedule+0x2d5/0x920
May 23 22:47:31 amd-saravana-crater kernel: [17881.745192]  ? preempt_count_add+0x7c/0xc0
May 23 22:47:31 amd-saravana-crater kernel: [17881.745202]  schedule+0x63/0xd0
May 23 22:47:31 amd-saravana-crater kernel: [17881.745208]  __bio_queue_enter+0xeb/0x230
May 23 22:47:31 amd-saravana-crater kernel: [17881.745217]  ? prepare_to_wait_event+0x130/0x130
May 23 22:47:31 amd-saravana-crater kernel: [17881.745226]  blk_mq_submit_bio+0x358/0x570
May 23 22:47:31 amd-saravana-crater kernel: [17881.745237]  __submit_bio+0xfa/0x170
May 23 22:47:31 amd-saravana-crater kernel: [17881.745243]  submit_bio_noacct_nocheck+0x229/0x2b0
May 23 22:47:31 amd-saravana-crater kernel: [17881.745250]  ? ktime_get+0x47/0xb0
May 23 22:47:31 amd-saravana-crater kernel: [17881.745256]  submit_bio_noacct+0x1e4/0x5a0
May 23 22:47:31 amd-saravana-crater kernel: [17881.745261]  ? submit_bio_noacct+0x1e4/0x5a0
May 23 22:47:31 amd-saravana-crater kernel: [17881.745268]  submit_bio+0x47/0x80
May 23 22:47:31 amd-saravana-crater kernel: [17881.745273]  ext4_io_submit+0x24/0x40
May 23 22:47:31 amd-saravana-crater kernel: [17881.745282]  ext4_writepages+0x57f/0xdd0
May 23 22:47:31 amd-saravana-crater kernel: [17881.745288]  ? _raw_read_lock_bh+0x20/0x40
May 23 22:47:31 amd-saravana-crater kernel: [17881.745296]  ? update_sd_lb_stats.constprop.148+0x11e/0x960
May 23 22:47:31 amd-saravana-crater kernel: [17881.745308]  do_writepages+0xbf/0x1a0
May 23 22:47:31 amd-saravana-crater kernel: [17881.745314]  ? __enqueue_entity+0x6c/0x80
May 23 22:47:31 amd-saravana-crater kernel: [17881.745321]  ? enqueue_entity+0x1a9/0x370
May 23 22:47:31 amd-saravana-crater kernel: [17881.745327]  __writeback_single_inode+0x44/0x360
May 23 22:47:31 amd-saravana-crater kernel: [17881.745332]  ? _raw_spin_unlock+0x19/0x40
May 23 22:47:31 amd-saravana-crater kernel: [17881.745339]  writeback_sb_inodes+0x203/0x4e0
May 23 22:47:31 amd-saravana-crater kernel: [17881.745350]  __writeback_inodes_wb+0x66/0xd0
May 23 22:47:31 amd-saravana-crater kernel: [17881.745358]  wb_writeback+0x23d/0x2d0
May 23 22:47:31 amd-saravana-crater kernel: [17881.745366]  wb_workfn+0x20b/0x490
May 23 22:47:31 amd-saravana-crater kernel: [17881.745372]  ? _raw_spin_unlock+0x19/0x40
May 23 22:47:31 amd-saravana-crater kernel: [17881.745381]  process_one_work+0x227/0x440
May 23 22:47:31 amd-saravana-crater kernel: [17881.745389]  worker_thread+0x31/0x3e0
May 23 22:47:31 amd-saravana-crater kernel: [17881.745395]  ? process_one_work+0x440/0x440
May 23 22:47:31 amd-saravana-crater kernel: [17881.745400]  kthread+0xfe/0x130
May 23 22:47:31 amd-saravana-crater kernel: [17881.745406]  ? kthread_complete_and_exit+0x20/0x20
May 23 22:47:31 amd-saravana-crater kernel: [17881.745413]  ret_from_fork+0x22/0x30
May 23 22:47:31 amd-saravana-crater kernel: [17881.745425]  </TASK>
diff mbox series

Patch

diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index 20150b1c7f..c82c6b350a 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -230,9 +230,11 @@  static void _clear_irq_vector(struct irq_desc *desc)
 
         for_each_cpu(cpu, tmp_mask)
         {
-            ASSERT(per_cpu(vector_irq, cpu)[old_vector] == irq);
-            TRACE_3D(TRC_HW_IRQ_MOVE_FINISH, irq, old_vector, cpu);
-            per_cpu(vector_irq, cpu)[old_vector] = ~irq;
+            if ( per_cpu(vector_irq, cpu)[old_vector] == irq )
+            {
+                TRACE_3D(TRC_HW_IRQ_MOVE_FINISH, irq, old_vector, cpu);
+                per_cpu(vector_irq, cpu)[old_vector] = ~irq;
+            }
         }
 
         release_old_vec(desc);