KVM: arm64: vgic: Fix soft lockup during VM teardown

Message ID	20230118022348.4137094-1-sdonthineni@nvidia.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org> Received-SPF: Pass (protection.outlook.com: domain of nvidia.com designates 216.228.117.160 as permitted sender) receiver=protection.outlook.com; client-ip=216.228.117.160; helo=mail.nvidia.com; pr=C From: Shanker Donthineni <sdonthineni@nvidia.com> To: Marc Zyngier <maz@kernel.org>, James Morse <james.morse@arm.com> CC: Catalin Marinas <catalin.marinas@arm.com>, Will Deacon <will@kernel.org>, Shanker Donthineni <sdonthineni@nvidia.com>, <linux-arm-kernel@lists.infradead.org>, <kvmarm@lists.linux.dev>, <kvmarm@lists.cs.columbia.edu>, <linux-kernel@vger.kernel.org>, Vikram Sethi <vsethi@nvidia.com> Subject: [PATCH] KVM: arm64: vgic: Fix soft lockup during VM teardown Date: Tue, 17 Jan 2023 20:23:48 -0600 Message-ID: <20230118022348.4137094-1-sdonthineni@nvidia.com> MIME-Version: 1.0 Precedence: list Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org
Series	KVM: arm64: vgic: Fix soft lockup during VM teardown \| expand KVM: arm64: vgic: Fix soft lockup during VM teardown

Shanker Donthineni Jan. 18, 2023, 2:23 a.m. UTC

Getting intermittent CPU soft lockups during the virtual machines
teardown on a system with GICv4 features enabled. The function
__synchronize_hardirq() has been waiting for IRQD_IRQ_INPROGRESS
to be cleared forever as per the current implementation.

CPU stuck here for a long time leads to soft lockup:
  while (irqd_irq_inprogress(&desc->irq_data))
	cpu_relax();

Call trace from the lockup CPU:
 [   87.238866] watchdog: BUG: soft lockup - CPU#37 stuck for 23s!
 [   87.250025] CPU: 37 PID: 1031 Comm: qemu-system-aarch64
 [   87.358397] Call trace:
 [   87.360891]  __synchronize_hardirq+0x48/0x140
 [   87.365343]  free_irq+0x138/0x424
 [   87.368727]  vgic_v4_teardown+0xa4/0xe0
 [   87.372649]  __kvm_vgic_destroy+0x18c/0x194
 [   87.376922]  kvm_vgic_destroy+0x28/0x3c
 [   87.380839]  kvm_arch_destroy_vm+0x24/0x44
 [   87.385024]  kvm_destroy_vm+0x158/0x2c4
 [   87.388943]  kvm_vm_release+0x6c/0x98
 [   87.392681]  __fput+0x70/0x220
 [   87.395800]  ____fput+0x10/0x20
 [   87.399005]  task_work_run+0xb4/0x23c
 [   87.402746]  do_exit+0x2bc/0x8a4
 [   87.406042]  do_group_exit+0x34/0xb0
 [   87.409693]  get_signal+0x878/0x8a0
 [   87.413254]  do_notify_resume+0x138/0x1530
 [   87.417440]  el0_svc+0xdc/0xf0
 [   87.420559]  el0t_64_sync_handler+0xf0/0x11c
 [   87.424919]  el0t_64_sync+0x18c/0x190

The state of the IRQD_IRQ_INPROGRESS information is lost inside
irq_domain_activate_irq() which happens before calling free_irq().
Instrumented the code and confirmed, the IRQD state is changed from
0x10401400 to 0x10441600 instead of 0x10401600 causing problem.

Call trace from irqd_set_activated():
 [   78.983544] irqd_set_activated: lost IRQD_IRQ_INPROGRESS
                old=0x10401400, new=0x10441600
 [   78.992093] CPU: 19 PID: 1511 Comm: qemu-system-aarch64
 [   79.008461] Call trace:
 [   79.010956]  dump_backtrace.part.0+0xc8/0xe0
 [   79.015328]  show_stack+0x18/0x54
 [   79.018713]  dump_stack_lvl+0x64/0x7c
 [   79.022459]  dump_stack+0x18/0x30
 [   79.025842]  irq_domain_activate_irq+0x88/0x94
 [   79.030385]  vgic_v3_save_pending_tables+0x260/0x29c
 [   79.035463]  vgic_set_common_attr+0xac/0x23c
 [   79.039826]  vgic_v3_set_attr+0x48/0x60
 [   79.043742]  kvm_device_ioctl+0x120/0x19c
 [   79.047840]  __arm64_sys_ioctl+0x42c/0xe00
 [   79.052027]  invoke_syscall.constprop.0+0x50/0xe0
 [   79.056835]  do_el0_svc+0x58/0x180
 [   79.060308]  el0_svc+0x38/0xf0
 [   79.063425]  el0t_64_sync_handler+0xf0/0x11c
 [   79.067785]  el0t_64_sync+0x18c/0x190

irqreturn_t handle_irq_event(struct irq_desc *desc)
{
    irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
    raw_spin_unlock(&desc->lock);

    ret = handle_irq_event_percpu(desc);

    raw_spin_lock(&desc->lock);
    irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
}

In this particular failed case and based on traces, the two functions
irqd_set_activated() and handle_irq_event() are concurrently modifying
IRQD state without both holding desc->lock. The irqd_set_activated()
execution path is reading memory 'state_use_accessors' in between set
& clear of IRQD_IRQ_INPROGRESS state change and writing the modified
data after executing 'irqd_clear(desc->irq_data, IRQD_IRQ_INPROGRESS)'.

To fix the lockup issue, hold desc->lock when calling functions
irq_domain_activate_irq() and irq_domain_deactivate_irq).

Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
---
 arch/arm64/kvm/vgic/vgic-v3.c | 6 ++++++
 arch/arm64/kvm/vgic/vgic-v4.c | 4 ++++
 2 files changed, 10 insertions(+)

Marc Zyngier Jan. 18, 2023, 11:54 a.m. UTC | #1

Shanker,

Please Cc all the KVM/arm64 reviewers when sending KVM/arm64 patches.

On Wed, 18 Jan 2023 02:23:48 +0000,
Shanker Donthineni <sdonthineni@nvidia.com> wrote:
> 
> Getting intermittent CPU soft lockups during the virtual machines
> teardown on a system with GICv4 features enabled. The function
> __synchronize_hardirq() has been waiting for IRQD_IRQ_INPROGRESS
> to be cleared forever as per the current implementation.
> 
> CPU stuck here for a long time leads to soft lockup:
>   while (irqd_irq_inprogress(&desc->irq_data))
> 	cpu_relax();

Is it a soft-lockup from which the system recovers? or a livelock
which leaves the system dead?

What kernel version is that? Please provide all the relevant context
that could help analysing the issue.

> 
> Call trace from the lockup CPU:
>  [   87.238866] watchdog: BUG: soft lockup - CPU#37 stuck for 23s!
>  [   87.250025] CPU: 37 PID: 1031 Comm: qemu-system-aarch64
>  [   87.358397] Call trace:
>  [   87.360891]  __synchronize_hardirq+0x48/0x140
>  [   87.365343]  free_irq+0x138/0x424
>  [   87.368727]  vgic_v4_teardown+0xa4/0xe0
>  [   87.372649]  __kvm_vgic_destroy+0x18c/0x194
>  [   87.376922]  kvm_vgic_destroy+0x28/0x3c
>  [   87.380839]  kvm_arch_destroy_vm+0x24/0x44
>  [   87.385024]  kvm_destroy_vm+0x158/0x2c4
>  [   87.388943]  kvm_vm_release+0x6c/0x98
>  [   87.392681]  __fput+0x70/0x220
>  [   87.395800]  ____fput+0x10/0x20
>  [   87.399005]  task_work_run+0xb4/0x23c
>  [   87.402746]  do_exit+0x2bc/0x8a4
>  [   87.406042]  do_group_exit+0x34/0xb0
>  [   87.409693]  get_signal+0x878/0x8a0
>  [   87.413254]  do_notify_resume+0x138/0x1530
>  [   87.417440]  el0_svc+0xdc/0xf0
>  [   87.420559]  el0t_64_sync_handler+0xf0/0x11c
>  [   87.424919]  el0t_64_sync+0x18c/0x190
> 
> The state of the IRQD_IRQ_INPROGRESS information is lost inside
> irq_domain_activate_irq() which happens before calling free_irq().
> Instrumented the code and confirmed, the IRQD state is changed from
> 0x10401400 to 0x10441600 instead of 0x10401600 causing problem.
> 
> Call trace from irqd_set_activated():
>  [   78.983544] irqd_set_activated: lost IRQD_IRQ_INPROGRESS
>                 old=0x10401400, new=0x10441600
>  [   78.992093] CPU: 19 PID: 1511 Comm: qemu-system-aarch64
>  [   79.008461] Call trace:
>  [   79.010956]  dump_backtrace.part.0+0xc8/0xe0
>  [   79.015328]  show_stack+0x18/0x54
>  [   79.018713]  dump_stack_lvl+0x64/0x7c
>  [   79.022459]  dump_stack+0x18/0x30
>  [   79.025842]  irq_domain_activate_irq+0x88/0x94
>  [   79.030385]  vgic_v3_save_pending_tables+0x260/0x29c
>  [   79.035463]  vgic_set_common_attr+0xac/0x23c
>  [   79.039826]  vgic_v3_set_attr+0x48/0x60
>  [   79.043742]  kvm_device_ioctl+0x120/0x19c
>  [   79.047840]  __arm64_sys_ioctl+0x42c/0xe00
>  [   79.052027]  invoke_syscall.constprop.0+0x50/0xe0
>  [   79.056835]  do_el0_svc+0x58/0x180
>  [   79.060308]  el0_svc+0x38/0xf0
>  [   79.063425]  el0t_64_sync_handler+0xf0/0x11c
>  [   79.067785]  el0t_64_sync+0x18c/0x190

Are these two traces an indication of concurrent events? Or are they
far apart?

> 
> irqreturn_t handle_irq_event(struct irq_desc *desc)
> {
>     irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
>     raw_spin_unlock(&desc->lock);
> 
>     ret = handle_irq_event_percpu(desc);
> 
>     raw_spin_lock(&desc->lock);
>     irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
> }

How is that relevant to this trace? Do you see this function running
concurrently with the teardown? If it matters here, it must be a VPE
doorbell, right? But you claim that this is on a GICv4 platform, while
this would only affect GICv4.1... Or are you using GICv4.1?

> 
> In this particular failed case and based on traces, the two functions
> irqd_set_activated() and handle_irq_event() are concurrently modifying
> IRQD state without both holding desc->lock. The irqd_set_activated()
> execution path is reading memory 'state_use_accessors' in between set
> & clear of IRQD_IRQ_INPROGRESS state change and writing the modified
> data after executing 'irqd_clear(desc->irq_data, IRQD_IRQ_INPROGRESS)'.
> 
> To fix the lockup issue, hold desc->lock when calling functions
> irq_domain_activate_irq() and irq_domain_deactivate_irq).

For that particular issue, the main problem is that we are abusing the
interrupt startup/teardown code. The locking is only a consequence of
this.

> 
> Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
> ---
>  arch/arm64/kvm/vgic/vgic-v3.c | 6 ++++++
>  arch/arm64/kvm/vgic/vgic-v4.c | 4 ++++
>  2 files changed, 10 insertions(+)
> 
> diff --git a/arch/arm64/kvm/vgic/vgic-v3.c b/arch/arm64/kvm/vgic/vgic-v3.c
> index 2074521d4a8c..e6aa909fcbe2 100644
> --- a/arch/arm64/kvm/vgic/vgic-v3.c
> +++ b/arch/arm64/kvm/vgic/vgic-v3.c
> @@ -353,22 +353,28 @@ int vgic_v3_lpi_sync_pending_status(struct kvm *kvm, struct vgic_irq *irq)
>  static void unmap_all_vpes(struct vgic_dist *dist)
>  {
>  	struct irq_desc *desc;
> +	unsigned long flags;
>  	int i;
>  
>  	for (i = 0; i < dist->its_vm.nr_vpes; i++) {
>  		desc = irq_to_desc(dist->its_vm.vpes[i]->irq);
> +		raw_spin_lock_irqsave(&desc->lock, flags);
>  		irq_domain_deactivate_irq(irq_desc_get_irq_data(desc));
> +		raw_spin_unlock_irqrestore(&desc->lock, flags);

I guess this is the guilty one, based on your analysis, and assuming
this is a v4.1 issue, not v4.

>  	}
>  }
>  
>  static void map_all_vpes(struct vgic_dist *dist)
>  {
>  	struct irq_desc *desc;
> +	unsigned long flags;
>  	int i;
>  
>  	for (i = 0; i < dist->its_vm.nr_vpes; i++) {
>  		desc = irq_to_desc(dist->its_vm.vpes[i]->irq);
> +		raw_spin_lock_irqsave(&desc->lock, flags);
>  		irq_domain_activate_irq(irq_desc_get_irq_data(desc), false);
> +		raw_spin_unlock_irqrestore(&desc->lock, flags);
>  	}
>  }
>  
> diff --git a/arch/arm64/kvm/vgic/vgic-v4.c b/arch/arm64/kvm/vgic/vgic-v4.c
> index ad06ba6c9b00..a01b8313e82c 100644
> --- a/arch/arm64/kvm/vgic/vgic-v4.c
> +++ b/arch/arm64/kvm/vgic/vgic-v4.c
> @@ -139,8 +139,10 @@ static void vgic_v4_enable_vsgis(struct kvm_vcpu *vcpu)
>  		/* Transfer the full irq state to the vPE */
>  		vgic_v4_sync_sgi_config(vpe, irq);
>  		desc = irq_to_desc(irq->host_irq);
> +		raw_spin_lock(&desc->lock);
>  		ret = irq_domain_activate_irq(irq_desc_get_irq_data(desc),
>  					      false);
> +		raw_spin_unlock(&desc->lock);

This one looks wrong. The interrupt never fires on the host (that's
the whole point of this stuff). There isn't even a handler attached to
it. How can that result in a problem?

My take on the whole thing is that we should stop playing games with
the underlying interrupt infrastructure. How about the following hack,
which is only compile-tested. Let me know if that helps.

	M.

diff --git a/arch/arm64/kvm/vgic/vgic-v3.c b/arch/arm64/kvm/vgic/vgic-v3.c
index 2074521d4a8c..2624963cb95b 100644
--- a/arch/arm64/kvm/vgic/vgic-v3.c
+++ b/arch/arm64/kvm/vgic/vgic-v3.c
@@ -350,26 +350,23 @@ int vgic_v3_lpi_sync_pending_status(struct kvm *kvm, struct vgic_irq *irq)
  * The deactivation of the doorbell interrupt will trigger the
  * unmapping of the associated vPE.
  */
-static void unmap_all_vpes(struct vgic_dist *dist)
+static void unmap_all_vpes(struct kvm *kvm)
 {
-	struct irq_desc *desc;
+	struct vgic_dist *dist = &kvm->arch.vgic;
 	int i;
 
-	for (i = 0; i < dist->its_vm.nr_vpes; i++) {
-		desc = irq_to_desc(dist->its_vm.vpes[i]->irq);
-		irq_domain_deactivate_irq(irq_desc_get_irq_data(desc));
-	}
+	for (i = 0; i < dist->its_vm.nr_vpes; i++)
+		free_irq(dist->its_vm.vpes[i]->irq, kvm_get_vcpu(kvm, i));
 }
 
-static void map_all_vpes(struct vgic_dist *dist)
+static void map_all_vpes(struct kvm *kvm)
 {
-	struct irq_desc *desc;
+	struct vgic_dist *dist = &kvm->arch.vgic;
 	int i;
 
-	for (i = 0; i < dist->its_vm.nr_vpes; i++) {
-		desc = irq_to_desc(dist->its_vm.vpes[i]->irq);
-		irq_domain_activate_irq(irq_desc_get_irq_data(desc), false);
-	}
+	for (i = 0; i < dist->its_vm.nr_vpes; i++)
+		WARN_ON(vgic_v4_request_vpe_irq(kvm_get_vcpu(kvm, i),
+						dist->its_vm.vpes[i]->irq));
 }
 
 /**
@@ -394,7 +391,7 @@ int vgic_v3_save_pending_tables(struct kvm *kvm)
 	 * and enabling of the doorbells have already been done.
 	 */
 	if (kvm_vgic_global_state.has_gicv4_1) {
-		unmap_all_vpes(dist);
+		unmap_all_vpes(kvm);
 		vlpi_avail = true;
 	}
 
@@ -444,7 +441,7 @@ int vgic_v3_save_pending_tables(struct kvm *kvm)
 
 out:
 	if (vlpi_avail)
-		map_all_vpes(dist);
+		map_all_vpes(kvm);
 
 	return ret;
 }
diff --git a/arch/arm64/kvm/vgic/vgic-v4.c b/arch/arm64/kvm/vgic/vgic-v4.c
index ad06ba6c9b00..a413718be92b 100644
--- a/arch/arm64/kvm/vgic/vgic-v4.c
+++ b/arch/arm64/kvm/vgic/vgic-v4.c
@@ -222,6 +222,11 @@ void vgic_v4_get_vlpi_state(struct vgic_irq *irq, bool *val)
 	*val = !!(*ptr & mask);
 }
 
+int vgic_v4_request_vpe_irq(struct kvm_vcpu *vcpu, int irq)
+{
+	return request_irq(irq, vgic_v4_doorbell_handler, 0, "vcpu", vcpu);
+}
+
 /**
  * vgic_v4_init - Initialize the GICv4 data structures
  * @kvm:	Pointer to the VM being initialized
@@ -283,8 +288,7 @@ int vgic_v4_init(struct kvm *kvm)
 			irq_flags &= ~IRQ_NOAUTOEN;
 		irq_set_status_flags(irq, irq_flags);
 
-		ret = request_irq(irq, vgic_v4_doorbell_handler,
-				  0, "vcpu", vcpu);
+		ret = vgic_v4_request_vpe_irq(vcpu, irq);
 		if (ret) {
 			kvm_err("failed to allocate vcpu IRQ%d\n", irq);
 			/*
diff --git a/arch/arm64/kvm/vgic/vgic.h b/arch/arm64/kvm/vgic/vgic.h
index 0c8da72953f0..23e280fa0a16 100644
--- a/arch/arm64/kvm/vgic/vgic.h
+++ b/arch/arm64/kvm/vgic/vgic.h
@@ -331,5 +331,6 @@ int vgic_v4_init(struct kvm *kvm);
 void vgic_v4_teardown(struct kvm *kvm);
 void vgic_v4_configure_vsgis(struct kvm *kvm);
 void vgic_v4_get_vlpi_state(struct vgic_irq *irq, bool *val);
+int vgic_v4_request_vpe_irq(struct kvm_vcpu *vcpu, int irq);
 
 #endif

Shanker Donthineni Jan. 18, 2023, 7:24 p.m. UTC | #2

On 1/18/23 05:54, Marc Zyngier wrote:
> External email: Use caution opening links or attachments
> 
> 
> Shanker,
> 
> Please Cc all the KVM/arm64 reviewers when sending KVM/arm64 patches.
> 
> On Wed, 18 Jan 2023 02:23:48 +0000,
> Shanker Donthineni <sdonthineni@nvidia.com> wrote:
>>
>> Getting intermittent CPU soft lockups during the virtual machines
>> teardown on a system with GICv4 features enabled. The function
>> __synchronize_hardirq() has been waiting for IRQD_IRQ_INPROGRESS
>> to be cleared forever as per the current implementation.
>>
>> CPU stuck here for a long time leads to soft lockup:
>>    while (irqd_irq_inprogress(&desc->irq_data))
>>        cpu_relax();
> 
> Is it a soft-lockup from which the system recovers? or a livelock
> which leaves the system dead?
> 
The system is not recovering, did a power cycle to recover.

[  211.018214] watchdog: BUG: soft lockup - CPU#8 stuck for 45s! [qemu-system-aar:1261]
[  211.026164] Modules linked in:
[  211.029283] CPU: 8 PID: 1261 Comm: qemu-system-aar Kdump: loaded Not tainted 6.0.0 #7
[  211.037338] Hardware name:  /th500_silicon, BIOS v1.1.1-4638c6c9 01/17/2023
[  211.044453] pstate: 62400009 (nZCv daif +PAN -UAO +TCO -DIT -SSBS BTYPE=--)
[  211.051586] pc : __synchronize_hardirq+0x54/0x140
[  211.056421] lr : free_irq+0x138/0x424
[  211.060160] sp : ffff80001504f990
[  211.063551] x29: ffff80001504f990 x28: 0000000000000008 x27: ffff00036c880000
[  211.070850] x26: ffff00036f888c00 x25: 0000000000000000 x24: 0000000000000001
[  211.078219] x23: ffff0000835659d0 x22: 0000000000000001 x21: 0000000000000000
[  211.085513] x20: ffff00008356591c x19: ffff0000835658a0 x18: 000000000000003f
[  211.092807] x17: ffff001feafabf28 x16: ffffffc0012c3388 x15: ffff80001504fab0
[  211.100103] x14: 0000000000000008 x13: ffff800008b41718 x12: 00000000f5257d14
[  211.107489] x11: 0000000000000040 x10: ffff8000091f1680 x9 : ffff8000091f1678
[  211.114783] x8 : 0000000000000001 x7 : 0000000000000000 x6 : 0000000000000001
[  211.122115] x5 : ffff8000083524e4 x4 : ffffffc000db5e20 x3 : 0000000081550146
[  211.129409] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000010071200
[  211.136736] Call trace:
[  211.139232]  __synchronize_hardirq+0x54/0x140
[  211.143746]  free_irq+0x138/0x424
[  211.147139]  vgic_v4_teardown+0xa4/0xe0
[  211.151072]  __kvm_vgic_destroy+0x18c/0x194
[  211.155351]  kvm_vgic_destroy+0x28/0x3c
[  211.159310]  kvm_arch_destroy_vm+0x24/0x44
[  211.163579]  kvm_destroy_vm+0x158/0x2c4
[  211.167495]  kvm_vm_release+0x6c/0x98
[  211.171257]  __fput+0x70/0x220
[  211.174373]  ____fput+0x10/0x20
[  211.177588]  task_work_run+0xb4/0x23c
[  211.181327]  do_exit+0x2bc/0x8a4
[  211.184622]  do_group_exit+0x34/0xb0
[  211.188271]  get_signal+0x878/0x8a0
[  211.191856]  do_notify_resume+0x138/0x1530
[  211.196075]  el0_svc+0xdc/0xf0
[  211.199211]  el0t_64_sync_handler+0xf0/0x11c
[  211.203635]  el0t_64_sync+0x18c/0x190
[  211.207374] Kernel panic - not syncing: softlockup: hung tasks
[  211.213335] CPU: 8 PID: 1261 Comm: qemu-system-aar Kdump: loaded Tainted: G             L     6.0.0 #7
[  211.222851] Hardware name:  /th500_silicon, BIOS v1.1.1-4638c6c9 01/17/2023
[  211.229993] Call trace:
[  211.232498]  dump_backtrace.part.0+0xc8/0xe0
[  211.236890]  show_stack+0x18/0x54
[  211.240330]  dump_stack_lvl+0x64/0x7c
[  211.244097]  dump_stack+0x18/0x30
[  211.247479]  panic+0x160/0x328
[  211.250622]  watchdog_nmi_enable+0x0/0x14
[  211.254818]  __hrtimer_run_queues+0x118/0x244
[  211.259288]  hrtimer_interrupt+0x114/0x300
[  211.263471]  arch_timer_handler_phys+0x2c/0x40
[  211.268034]  handle_percpu_devid_irq+0x84/0x138
[  211.272671]  generic_handle_domain_irq+0x2c/0x40
[  211.277394]  gic_handle_irq+0x50/0x128
[  211.281221]  call_on_irq_stack+0x2c/0x38
[  211.285226]  do_interrupt_handler+0x7c/0x88
[  211.289552]  el1_interrupt+0x34/0x54
[  211.293239]  el1h_64_irq_handler+0x18/0x20
[  211.297433]  el1h_64_irq+0x64/0x68
[  211.300935]  __synchronize_hardirq+0x54/0x140
[  211.305385]  free_irq+0x138/0x424
[  211.308767]  vgic_v4_teardown+0xa4/0xe0
[  211.312709]  __kvm_vgic_destroy+0x18c/0x194
[  211.317007]  kvm_vgic_destroy+0x28/0x3c
[  211.320927]  kvm_arch_destroy_vm+0x24/0x44
[  211.325204]  kvm_destroy_vm+0x158/0x2c4
[  211.329125]  kvm_vm_release+0x6c/0x98
[  211.332872]  __fput+0x70/0x220
[  211.335988]  ____fput+0x10/0x20
[  211.339192]  task_work_run+0xb4/0x23c
[  211.342980]  do_exit+0x2bc/0x8a4
[  211.346310]  do_group_exit+0x34/0xb0
[  211.349960]  get_signal+0x878/0x8a0
[  211.353520]  do_notify_resume+0x138/0x1530
[  211.357703]  el0_svc+0xdc/0xf0
[  211.360818]  el0t_64_sync_handler+0xf0/0x11c
[  211.365197]  el0t_64_sync+0x18c/0x190
[  211.368947] SMP: stopping secondary CPUs


> What kernel version is that? Please provide all the relevant context
> that could help analysing the issue.
> 
Tested with v6.0+ and v6.2.rc4.

>>
>> Call trace from the lockup CPU:
>>   [   87.238866] watchdog: BUG: soft lockup - CPU#37 stuck for 23s!
>>   [   87.250025] CPU: 37 PID: 1031 Comm: qemu-system-aarch64
>>   [   87.358397] Call trace:
>>   [   87.360891]  __synchronize_hardirq+0x48/0x140
>>   [   87.365343]  free_irq+0x138/0x424
>>   [   87.368727]  vgic_v4_teardown+0xa4/0xe0
>>   [   87.372649]  __kvm_vgic_destroy+0x18c/0x194
>>   [   87.376922]  kvm_vgic_destroy+0x28/0x3c
>>   [   87.380839]  kvm_arch_destroy_vm+0x24/0x44
>>   [   87.385024]  kvm_destroy_vm+0x158/0x2c4
>>   [   87.388943]  kvm_vm_release+0x6c/0x98
>>   [   87.392681]  __fput+0x70/0x220
>>   [   87.395800]  ____fput+0x10/0x20
>>   [   87.399005]  task_work_run+0xb4/0x23c
>>   [   87.402746]  do_exit+0x2bc/0x8a4
>>   [   87.406042]  do_group_exit+0x34/0xb0
>>   [   87.409693]  get_signal+0x878/0x8a0
>>   [   87.413254]  do_notify_resume+0x138/0x1530
>>   [   87.417440]  el0_svc+0xdc/0xf0
>>   [   87.420559]  el0t_64_sync_handler+0xf0/0x11c
>>   [   87.424919]  el0t_64_sync+0x18c/0x190
>>
>> The state of the IRQD_IRQ_INPROGRESS information is lost inside
>> irq_domain_activate_irq() which happens before calling free_irq().
>> Instrumented the code and confirmed, the IRQD state is changed from
>> 0x10401400 to 0x10441600 instead of 0x10401600 causing problem.
>>
>> Call trace from irqd_set_activated():
>>   [   78.983544] irqd_set_activated: lost IRQD_IRQ_INPROGRESS
>>                  old=0x10401400, new=0x10441600
>>   [   78.992093] CPU: 19 PID: 1511 Comm: qemu-system-aarch64
>>   [   79.008461] Call trace:
>>   [   79.010956]  dump_backtrace.part.0+0xc8/0xe0
>>   [   79.015328]  show_stack+0x18/0x54
>>   [   79.018713]  dump_stack_lvl+0x64/0x7c
>>   [   79.022459]  dump_stack+0x18/0x30
>>   [   79.025842]  irq_domain_activate_irq+0x88/0x94
>>   [   79.030385]  vgic_v3_save_pending_tables+0x260/0x29c
>>   [   79.035463]  vgic_set_common_attr+0xac/0x23c
>>   [   79.039826]  vgic_v3_set_attr+0x48/0x60
>>   [   79.043742]  kvm_device_ioctl+0x120/0x19c
>>   [   79.047840]  __arm64_sys_ioctl+0x42c/0xe00
>>   [   79.052027]  invoke_syscall.constprop.0+0x50/0xe0
>>   [   79.056835]  do_el0_svc+0x58/0x180
>>   [   79.060308]  el0_svc+0x38/0xf0
>>   [   79.063425]  el0t_64_sync_handler+0xf0/0x11c
>>   [   79.067785]  el0t_64_sync+0x18c/0x190
> 
> Are these two traces an indication of concurrent events? Or are they
> far apart?
> 
Concurrent.

>>
>> irqreturn_t handle_irq_event(struct irq_desc *desc)
>> {
>>      irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
>>      raw_spin_unlock(&desc->lock);
>>
>>      ret = handle_irq_event_percpu(desc);
>>
>>      raw_spin_lock(&desc->lock);
>>      irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
>> }
> 
> How is that relevant to this trace? Do you see this function running
> concurrently with the teardown? If it matters here, it must be a VPE
> doorbell, right? But you claim that this is on a GICv4 platform, while
> this would only affect GICv4.1... Or are you using GICv4.1?
> 
handle_irq_event() is running concurrently with irq_domain_activate_irq()
which happens before free_irq() called. Corruption at [78.983544] and
teardown started at [87.360891].

[   78.983544] irqd_set_activated: lost IRQD_IRQ_INPROGRESS old=0x10401400, new=0x10441600

[   87.360891]  __synchronize_hardirq+0x48/0x140

Yes, I'm using GICv4.1, used these 2 functions to trace the issue.

static inline void irqd_set_activated(struct irq_data *d)
{
	atomic_t *state = (atomic_t *)(&__irqd_to_state(d));
	unsigned int new, old = __irqd_to_state(d);

	new = atomic_xchg(state, old | IRQD_ACTIVATED);	

	if ((old ^ new) & IRQD_IRQ_INPROGRESS) {
		pr_err("%s: lost IRQD_IRQ_INPROGRESS old=0x%x new=0x%x\n",
		       __func__, old, old | IRQD_ACTIVATED);
		dump_stack();
	}
}

static inline void irqd_clr_activated(struct irq_data *d)
{
	atomic_t *state = (atomic_t *)(&__irqd_to_state(d));
	unsigned int new, old = __irqd_to_state(d);

	new = atomic_xchg(state, old & ~IRQD_ACTIVATED);	

	if ((old ^ new) & IRQD_IRQ_INPROGRESS) {
		pr_err("%s: lost IRQD_IRQ_INPROGRESS old=0x%x new=0x%x\n",
		       __func__, old, old & ~IRQD_ACTIVATED);
		dump_stack();
	}
}

>>
>> In this particular failed case and based on traces, the two functions
>> irqd_set_activated() and handle_irq_event() are concurrently modifying
>> IRQD state without both holding desc->lock. The irqd_set_activated()
>> execution path is reading memory 'state_use_accessors' in between set
>> & clear of IRQD_IRQ_INPROGRESS state change and writing the modified
>> data after executing 'irqd_clear(desc->irq_data, IRQD_IRQ_INPROGRESS)'.
>>
>> To fix the lockup issue, hold desc->lock when calling functions
>> irq_domain_activate_irq() and irq_domain_deactivate_irq).
> 
> For that particular issue, the main problem is that we are abusing the
> interrupt startup/teardown code. The locking is only a consequence of
> this.
> 
>>
>> Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
>> ---
>>   arch/arm64/kvm/vgic/vgic-v3.c | 6 ++++++
>>   arch/arm64/kvm/vgic/vgic-v4.c | 4 ++++
>>   2 files changed, 10 insertions(+)
>>
>> diff --git a/arch/arm64/kvm/vgic/vgic-v3.c b/arch/arm64/kvm/vgic/vgic-v3.c
>> index 2074521d4a8c..e6aa909fcbe2 100644
>> --- a/arch/arm64/kvm/vgic/vgic-v3.c
>> +++ b/arch/arm64/kvm/vgic/vgic-v3.c
>> @@ -353,22 +353,28 @@ int vgic_v3_lpi_sync_pending_status(struct kvm *kvm, struct vgic_irq *irq)
>>   static void unmap_all_vpes(struct vgic_dist *dist)
>>   {
>>        struct irq_desc *desc;
>> +     unsigned long flags;
>>        int i;
>>
>>        for (i = 0; i < dist->its_vm.nr_vpes; i++) {
>>                desc = irq_to_desc(dist->its_vm.vpes[i]->irq);
>> +             raw_spin_lock_irqsave(&desc->lock, flags);
>>                irq_domain_deactivate_irq(irq_desc_get_irq_data(desc));
>> +             raw_spin_unlock_irqrestore(&desc->lock, flags);
> 
> I guess this is the guilty one, based on your analysis, and assuming
> this is a v4.1 issue, not v4.
> 
>>        }
>>   }
>>
>>   static void map_all_vpes(struct vgic_dist *dist)
>>   {
>>        struct irq_desc *desc;
>> +     unsigned long flags;
>>        int i;
>>
>>        for (i = 0; i < dist->its_vm.nr_vpes; i++) {
>>                desc = irq_to_desc(dist->its_vm.vpes[i]->irq);
>> +             raw_spin_lock_irqsave(&desc->lock, flags);
>>                irq_domain_activate_irq(irq_desc_get_irq_data(desc), false);
>> +             raw_spin_unlock_irqrestore(&desc->lock, flags);
>>        }
>>   }
>>
>> diff --git a/arch/arm64/kvm/vgic/vgic-v4.c b/arch/arm64/kvm/vgic/vgic-v4.c
>> index ad06ba6c9b00..a01b8313e82c 100644
>> --- a/arch/arm64/kvm/vgic/vgic-v4.c
>> +++ b/arch/arm64/kvm/vgic/vgic-v4.c
>> @@ -139,8 +139,10 @@ static void vgic_v4_enable_vsgis(struct kvm_vcpu *vcpu)
>>                /* Transfer the full irq state to the vPE */
>>                vgic_v4_sync_sgi_config(vpe, irq);
>>                desc = irq_to_desc(irq->host_irq);
>> +             raw_spin_lock(&desc->lock);
>>                ret = irq_domain_activate_irq(irq_desc_get_irq_data(desc),
>>                                              false);
>> +             raw_spin_unlock(&desc->lock);
> 
> This one looks wrong. The interrupt never fires on the host (that's
> the whole point of this stuff). There isn't even a handler attached to
> it. How can that result in a problem?

Thanks for catching mistake.

> 
> My take on the whole thing is that we should stop playing games with
> the underlying interrupt infrastructure. How about the following hack,
> which is only compile-tested. Let me know if that helps.
> 
>          M.
> 
> diff --git a/arch/arm64/kvm/vgic/vgic-v3.c b/arch/arm64/kvm/vgic/vgic-v3.c
> index 2074521d4a8c..2624963cb95b 100644
> --- a/arch/arm64/kvm/vgic/vgic-v3.c
> +++ b/arch/arm64/kvm/vgic/vgic-v3.c
> @@ -350,26 +350,23 @@ int vgic_v3_lpi_sync_pending_status(struct kvm *kvm, struct vgic_irq *irq)
>    * The deactivation of the doorbell interrupt will trigger the
>    * unmapping of the associated vPE.
>    */
> -static void unmap_all_vpes(struct vgic_dist *dist)
> +static void unmap_all_vpes(struct kvm *kvm)
>   {
> -       struct irq_desc *desc;
> +       struct vgic_dist *dist = &kvm->arch.vgic;
>          int i;
> 
> -       for (i = 0; i < dist->its_vm.nr_vpes; i++) {
> -               desc = irq_to_desc(dist->its_vm.vpes[i]->irq);
> -               irq_domain_deactivate_irq(irq_desc_get_irq_data(desc));
> -       }
> +       for (i = 0; i < dist->its_vm.nr_vpes; i++)
> +               free_irq(dist->its_vm.vpes[i]->irq, kvm_get_vcpu(kvm, i));
>   }
> 
> -static void map_all_vpes(struct vgic_dist *dist)
> +static void map_all_vpes(struct kvm *kvm)
>   {
> -       struct irq_desc *desc;
> +       struct vgic_dist *dist = &kvm->arch.vgic;
>          int i;
> 
> -       for (i = 0; i < dist->its_vm.nr_vpes; i++) {
> -               desc = irq_to_desc(dist->its_vm.vpes[i]->irq);
> -               irq_domain_activate_irq(irq_desc_get_irq_data(desc), false);
> -       }
> +       for (i = 0; i < dist->its_vm.nr_vpes; i++)
> +               WARN_ON(vgic_v4_request_vpe_irq(kvm_get_vcpu(kvm, i),
> +                                               dist->its_vm.vpes[i]->irq));
>   }
> 
>   /**
> @@ -394,7 +391,7 @@ int vgic_v3_save_pending_tables(struct kvm *kvm)
>           * and enabling of the doorbells have already been done.
>           */
>          if (kvm_vgic_global_state.has_gicv4_1) {
> -               unmap_all_vpes(dist);
> +               unmap_all_vpes(kvm);
>                  vlpi_avail = true;
>          }
> 
> @@ -444,7 +441,7 @@ int vgic_v3_save_pending_tables(struct kvm *kvm)
> 
>   out:
>          if (vlpi_avail)
> -               map_all_vpes(dist);
> +               map_all_vpes(kvm);
> 
>          return ret;
>   }
> diff --git a/arch/arm64/kvm/vgic/vgic-v4.c b/arch/arm64/kvm/vgic/vgic-v4.c
> index ad06ba6c9b00..a413718be92b 100644
> --- a/arch/arm64/kvm/vgic/vgic-v4.c
> +++ b/arch/arm64/kvm/vgic/vgic-v4.c
> @@ -222,6 +222,11 @@ void vgic_v4_get_vlpi_state(struct vgic_irq *irq, bool *val)
>          *val = !!(*ptr & mask);
>   }
> 
> +int vgic_v4_request_vpe_irq(struct kvm_vcpu *vcpu, int irq)
> +{
> +       return request_irq(irq, vgic_v4_doorbell_handler, 0, "vcpu", vcpu);
> +}
> +
>   /**
>    * vgic_v4_init - Initialize the GICv4 data structures
>    * @kvm:       Pointer to the VM being initialized
> @@ -283,8 +288,7 @@ int vgic_v4_init(struct kvm *kvm)
>                          irq_flags &= ~IRQ_NOAUTOEN;
>                  irq_set_status_flags(irq, irq_flags);
> 
> -               ret = request_irq(irq, vgic_v4_doorbell_handler,
> -                                 0, "vcpu", vcpu);
> +               ret = vgic_v4_request_vpe_irq(vcpu, irq);
>                  if (ret) {
>                          kvm_err("failed to allocate vcpu IRQ%d\n", irq);
>                          /*
> diff --git a/arch/arm64/kvm/vgic/vgic.h b/arch/arm64/kvm/vgic/vgic.h
> index 0c8da72953f0..23e280fa0a16 100644
> --- a/arch/arm64/kvm/vgic/vgic.h
> +++ b/arch/arm64/kvm/vgic/vgic.h
> @@ -331,5 +331,6 @@ int vgic_v4_init(struct kvm *kvm);
>   void vgic_v4_teardown(struct kvm *kvm);
>   void vgic_v4_configure_vsgis(struct kvm *kvm);
>   void vgic_v4_get_vlpi_state(struct vgic_irq *irq, bool *val);
> +int vgic_v4_request_vpe_irq(struct kvm_vcpu *vcpu, int irq);
> 
>   #endif
> 

I ran stress test launch/teardown multiple VMs for 3hrs. The issue
is not reproducible. The same test fails in 10-30min without code
changes.

> --
> Without deviation from the norm, progress is not possible.

Marc Zyngier Jan. 19, 2023, 7:11 a.m. UTC | #3

[dropping the now dead old kvmarm list]

On Wed, 18 Jan 2023 19:24:01 +0000,
Shanker Donthineni <sdonthineni@nvidia.com> wrote:
> 
> 
> 
> On 1/18/23 05:54, Marc Zyngier wrote:
> > External email: Use caution opening links or attachments
> > 
> > 
> > Shanker,
> > 
> > Please Cc all the KVM/arm64 reviewers when sending KVM/arm64 patches.
> > 
> > On Wed, 18 Jan 2023 02:23:48 +0000,
> > Shanker Donthineni <sdonthineni@nvidia.com> wrote:
> >> 
> >> Getting intermittent CPU soft lockups during the virtual machines
> >> teardown on a system with GICv4 features enabled. The function
> >> __synchronize_hardirq() has been waiting for IRQD_IRQ_INPROGRESS
> >> to be cleared forever as per the current implementation.
> >> 
> >> CPU stuck here for a long time leads to soft lockup:
> >>    while (irqd_irq_inprogress(&desc->irq_data))
> >>        cpu_relax();
> > 
> > Is it a soft-lockup from which the system recovers? or a livelock
> > which leaves the system dead?
> > 
> The system is not recovering, did a power cycle to recover.

This isn't a soft-lockup then. This is at best a livelock.

> > Are these two traces an indication of concurrent events? Or are they
> > far apart?
> > 
> Concurrent.

So you can see the VM being torn down while the vgic save sequence is
still in progress?

If you can actually see that, then this is a much bigger bug than the
simple race you are describing, and we're missing a reference on the
kvm structure. This would be a *MAJOR* bug.

Please post the full traces, not snippets. The absolutely full kernel
log, the configuration, what you run, how you run it, *EVERYTHING*. I
need to be able to reproduce this.

> 
> >> 
> >> irqreturn_t handle_irq_event(struct irq_desc *desc)
> >> {
> >>      irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
> >>      raw_spin_unlock(&desc->lock);
> >> 
> >>      ret = handle_irq_event_percpu(desc);
> >> 
> >>      raw_spin_lock(&desc->lock);
> >>      irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
> >> }
> > 
> > How is that relevant to this trace? Do you see this function running
> > concurrently with the teardown? If it matters here, it must be a VPE
> > doorbell, right? But you claim that this is on a GICv4 platform, while
> > this would only affect GICv4.1... Or are you using GICv4.1?
> > 
> handle_irq_event() is running concurrently with irq_domain_activate_irq()
> which happens before free_irq() called. Corruption at [78.983544] and
> teardown started at [87.360891].

But that doesn't match the description you made of concurrent
events. Does it take more than 9 seconds for the vgic state to be
saved to memory?

> 
> [   78.983544] irqd_set_activated: lost IRQD_IRQ_INPROGRESS old=0x10401400, new=0x10441600
> 
> [   87.360891]  __synchronize_hardirq+0x48/0x140
> 
> Yes, I'm using GICv4.1, used these 2 functions to trace the issue.

Then *please* be precise in your descriptions. You send people in the
wrong direction.

[...]

> I ran stress test launch/teardown multiple VMs for 3hrs. The issue
> is not reproducible. The same test fails in 10-30min without code
> changes.

That doesn't add up with the earlier description of concurrent events,
with the VM teardown and the vgic saving running in parallel. My patch
doesn't prevent this.

So either your testing is insufficient, or your description of
concurrent events is inaccurate.

	M.

Shanker Donthineni Jan. 19, 2023, 1 p.m. UTC | #4

On 1/19/23 01:11, Marc Zyngier wrote:
> External email: Use caution opening links or attachments
> 
> 
> [dropping the now dead old kvmarm list]
> 
> On Wed, 18 Jan 2023 19:24:01 +0000,
> Shanker Donthineni <sdonthineni@nvidia.com> wrote:
>>
>>
>>
>> On 1/18/23 05:54, Marc Zyngier wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> Shanker,
>>>
>>> Please Cc all the KVM/arm64 reviewers when sending KVM/arm64 patches.
>>>
>>> On Wed, 18 Jan 2023 02:23:48 +0000,
>>> Shanker Donthineni <sdonthineni@nvidia.com> wrote:
>>>>
>>>> Getting intermittent CPU soft lockups during the virtual machines
>>>> teardown on a system with GICv4 features enabled. The function
>>>> __synchronize_hardirq() has been waiting for IRQD_IRQ_INPROGRESS
>>>> to be cleared forever as per the current implementation.
>>>>
>>>> CPU stuck here for a long time leads to soft lockup:
>>>>     while (irqd_irq_inprogress(&desc->irq_data))
>>>>         cpu_relax();
>>>
>>> Is it a soft-lockup from which the system recovers? or a livelock
>>> which leaves the system dead?
>>>
>> The system is not recovering, did a power cycle to recover.
> 
> This isn't a soft-lockup then. This is at best a livelock.
> 
>>> Are these two traces an indication of concurrent events? Or are they
>>> far apart?
>>>
>> Concurrent.
> 
> So you can see the VM being torn down while the vgic save sequence is
> still in progress?
> 
> If you can actually see that, then this is a much bigger bug than the
> simple race you are describing, and we're missing a reference on the
> kvm structure. This would be a *MAJOR* bug.
> 
How do we know vGIC save sequence is in progress while VM is being teardown?
I'm launching/terminating ~32 VMs in a loop to reproduce the issue.
  
> Please post the full traces, not snippets. The absolutely full kernel
> log, the configuration, what you run, how you run it, *EVERYTHING*. I
> need to be able to reproduce this.
Sure, I'll share the complete boot log messages of host kernel next run.
  
> 
>>
>>>>
>>>> irqreturn_t handle_irq_event(struct irq_desc *desc)
>>>> {
>>>>       irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
>>>>       raw_spin_unlock(&desc->lock);
>>>>
>>>>       ret = handle_irq_event_percpu(desc);
>>>>
>>>>       raw_spin_lock(&desc->lock);
>>>>       irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
>>>> }
>>>
>>> How is that relevant to this trace? Do you see this function running
>>> concurrently with the teardown? If it matters here, it must be a VPE
>>> doorbell, right? But you claim that this is on a GICv4 platform, while
>>> this would only affect GICv4.1... Or are you using GICv4.1?
>>>
>> handle_irq_event() is running concurrently with irq_domain_activate_irq()
>> which happens before free_irq() called. Corruption at [78.983544] and
>> teardown started at [87.360891].
> 
> But that doesn't match the description you made of concurrent
> events. Does it take more than 9 seconds for the vgic state to be
> saved to memory?

Are there any other possibilities of corrupting IRQD_IRQ_INPROGRESS
state bit other than concurrent accesses?

I did another experiment, changed the bit manipulation of the IRQD
state to atomic operations, the issue is not reproducible.

struct irq_common_data {
-       unsigned  int		__private state_use_accessors;
+	atomic_t		__private state_use_accessors;

+#define __irqd_to_state(d)       ACCESS_PRIVATE((d)->common, state_use_accessors)
+#define __irqd_get_state(d)      atomic_read(&__irqd_to_state(d))
+#define __irqd_set_state(d, m)   atomic_or((m), &__irqd_to_state(d))
+#define __irqd_clear_state(d, m) atomic_andnot((m), &__irqd_to_state(d))
> 
>>
>> [   78.983544] irqd_set_activated: lost IRQD_IRQ_INPROGRESS old=0x10401400, new=0x10441600
>>
>> [   87.360891]  __synchronize_hardirq+0x48/0x140
>>
>> Yes, I'm using GICv4.1, used these 2 functions to trace the issue.
> 
> Then *please* be precise in your descriptions. You send people in the
> wrong direction.
> 
It was my mistake not to mention the exact GIC version.

> [...]
> 
>> I ran stress test launch/teardown multiple VMs for 3hrs. The issue
>> is not reproducible. The same test fails in 10-30min without code
>> changes.
> 
> That doesn't add up with the earlier description of concurrent events,
> with the VM teardown and the vgic saving running in parallel. My patch
> doesn't prevent this.
> 
> So either your testing is insufficient, or your description of
> concurrent events is inaccurate.
> 

I'm using the unmodified 6.2.rc3/v6.0 from kernel.org and QEMU to
reproduce the issue.

Using the below steps for launching/terminating 32 VMs in loop. The
failure is intermittent. The same issue is reproducible with KVMTOOL
also.

numvms=32
socketcnt=1
iterations=100
while [ ${iterations} -ne 0 ]; do
  iterations=$(( iterations - 1 ))
  for idx in `seq 1 ${numvms}`; do
    vmname=vm${idx}
    cpunode=$((idx % socketcnt))
    cpuset=`lscpu  | grep "NUMA node${cpunode} CPU(s)" | awk -F' ' '{ print $4 }'`
    qemu-img create -f qcow2 -F raw -b /var/lib/libvirt/images/cloudimg.raw /var/lib/libvirt/images/${vmname}.qcow2
    virt-sysprep -a /var/lib/libvirt/images/${vmname}.qcow2 --hostname ${vmname}
    virt-install \
      --name ${vmname} \
      --numatune ${cpunode},mode=preferred \
      --vcpus vcpus=16,maxvcpus=16,cpuset=${cpuset} \
      --memory 4096 \
      --import \
      --noautoconsole \
      --os-variant name=ubuntu22.04 \
      --disk path=/var/lib/libvirt/images/${vmname}.qcow2 \
      --boot loader=/usr/share/AAVMF/AAVMF_CODE.fd,loader_ro=yes,loader_type=pflash
    if [ $? -ne 0 ]; then
      echo "Failed to create vm${idx}"
      while true ; do  sleep 1; done
    fi
    sleep 5
  done

  for idx in `seq 1 ${numvms}`; do
    echo -n "Checking vm${idx} boot: "
    while true; do
      ghostname=`virsh guestinfo vm${idx} --hostname 2>/dev/null | awk '{ print $3 }'`
      [ "${ghostname}" == "vm${idx}" ] && echo "done" && break
      sleep 1
    done
  done

  for idx in `seq 1 ${numvms}`; do
   virsh domstate vm${idx} >/dev/null 2>&1 || continue
   echo "Started shutdown & undefine vm${idx}..."
   virsh shutdown "vm${idx}" --mode agent >/dev/null 2>&1
   while true; do
     virsh domstate vm${idx} 2>/dev/null 2>&1 | grep -w "^shut off" >/dev/null 2>&1
     [ $? -eq 0 ] && break
     sleep 1
     virsh shutdown "vm${idx}" --mode agent >/dev/null 2>&1
   done
   virsh undefine --nvram --remove-all-storage vm${idx} 2>/dev/null
  done
done


>          M.
> 
> --
> Without deviation from the norm, progress is not possible.

Marc Zyngier Jan. 19, 2023, 2:01 p.m. UTC | #5

On Thu, 19 Jan 2023 13:00:49 +0000,
Shanker Donthineni <sdonthineni@nvidia.com> wrote:
> 
> 
> 
> On 1/19/23 01:11, Marc Zyngier wrote:
> > So you can see the VM being torn down while the vgic save sequence is
> > still in progress?
> > 
> > If you can actually see that, then this is a much bigger bug than the
> > simple race you are describing, and we're missing a reference on the
> > kvm structure. This would be a *MAJOR* bug.
> > 
> How do we know vGIC save sequence is in progress while VM is being
> teardown?  I'm launching/terminating ~32 VMs in a loop to reproduce
> the issue.

Errr... *you* know when you are issuing the save ioctl, right? You
also know when you are terminating the VM (closing its fd or killing
the VMM).

>  
> > Please post the full traces, not snippets. The absolutely full kernel
> > log, the configuration, what you run, how you run it, *EVERYTHING*. I
> > need to be able to reproduce this.
> Sure, I'll share the complete boot log messages of host kernel next run.
>  
> > 
> >> 
> >>>> 
> >>>> irqreturn_t handle_irq_event(struct irq_desc *desc)
> >>>> {
> >>>>       irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
> >>>>       raw_spin_unlock(&desc->lock);
> >>>> 
> >>>>       ret = handle_irq_event_percpu(desc);
> >>>> 
> >>>>       raw_spin_lock(&desc->lock);
> >>>>       irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
> >>>> }
> >>> 
> >>> How is that relevant to this trace? Do you see this function running
> >>> concurrently with the teardown? If it matters here, it must be a VPE
> >>> doorbell, right? But you claim that this is on a GICv4 platform, while
> >>> this would only affect GICv4.1... Or are you using GICv4.1?
> >>> 
> >> handle_irq_event() is running concurrently with irq_domain_activate_irq()
> >> which happens before free_irq() called. Corruption at [78.983544] and
> >> teardown started at [87.360891].
> > 
> > But that doesn't match the description you made of concurrent
> > events. Does it take more than 9 seconds for the vgic state to be
> > saved to memory?
> 
> Are there any other possibilities of corrupting IRQD_IRQ_INPROGRESS
> state bit other than concurrent accesses?

Forget about this bit. You said that we could see the VM teardown
happening *at the same time* as the vgic state saving, despite the
vgic device holding a reference on the kvm structure. If that's the
case, this bit is the least of our worries. Think of the consequences
for a second...

[...]

> Using the below steps for launching/terminating 32 VMs in loop. The
> failure is intermittent. The same issue is reproducible with KVMTOOL
> also.

kvmtool never issue a KVM_DEV_ARM_VGIC_GRP_CTRL with the
KVM_DEV_ARM_ITS_SAVE_TABLES argument, so the code path we discussed is
never used. What is the exact problem you're observing with kvmtool
as the VMM?

	M.

Shanker Donthineni Jan. 19, 2023, 2:16 p.m. UTC | #6

Hi Marc,

On 1/19/23 08:01, Marc Zyngier wrote:
> External email: Use caution opening links or attachments
> 
> 
> On Thu, 19 Jan 2023 13:00:49 +0000,
> Shanker Donthineni <sdonthineni@nvidia.com> wrote:
>>
>>
>>
>> On 1/19/23 01:11, Marc Zyngier wrote:
>>> So you can see the VM being torn down while the vgic save sequence is
>>> still in progress?
>>>
>>> If you can actually see that, then this is a much bigger bug than the
>>> simple race you are describing, and we're missing a reference on the
>>> kvm structure. This would be a *MAJOR* bug.
>>>
>> How do we know vGIC save sequence is in progress while VM is being
>> teardown?  I'm launching/terminating ~32 VMs in a loop to reproduce
>> the issue.
> 
> Errr... *you* know when you are issuing the save ioctl, right? You
> also know when you are terminating the VM (closing its fd or killing
> the VMM).
> 
>>
>>> Please post the full traces, not snippets. The absolutely full kernel
>>> log, the configuration, what you run, how you run it, *EVERYTHING*. I
>>> need to be able to reproduce this.
>> Sure, I'll share the complete boot log messages of host kernel next run.
>>
>>>
>>>>
>>>>>>
>>>>>> irqreturn_t handle_irq_event(struct irq_desc *desc)
>>>>>> {
>>>>>>        irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
>>>>>>        raw_spin_unlock(&desc->lock);
>>>>>>
>>>>>>        ret = handle_irq_event_percpu(desc);
>>>>>>
>>>>>>        raw_spin_lock(&desc->lock);
>>>>>>        irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
>>>>>> }
>>>>>
>>>>> How is that relevant to this trace? Do you see this function running
>>>>> concurrently with the teardown? If it matters here, it must be a VPE
>>>>> doorbell, right? But you claim that this is on a GICv4 platform, while
>>>>> this would only affect GICv4.1... Or are you using GICv4.1?
>>>>>
>>>> handle_irq_event() is running concurrently with irq_domain_activate_irq()
>>>> which happens before free_irq() called. Corruption at [78.983544] and
>>>> teardown started at [87.360891].
>>>
>>> But that doesn't match the description you made of concurrent
>>> events. Does it take more than 9 seconds for the vgic state to be
>>> saved to memory?
>>
>> Are there any other possibilities of corrupting IRQD_IRQ_INPROGRESS
>> state bit other than concurrent accesses?
> 
> Forget about this bit. You said that we could see the VM teardown
> happening *at the same time* as the vgic state saving, despite the
> vgic device holding a reference on the kvm structure. If that's the
> case, this bit is the least of our worries. Think of the consequences
> for a second...
> 
> [...]
> 
>> Using the below steps for launching/terminating 32 VMs in loop. The
>> failure is intermittent. The same issue is reproducible with KVMTOOL
>> also.
> 
> kvmtool never issue a KVM_DEV_ARM_VGIC_GRP_CTRL with the
> KVM_DEV_ARM_ITS_SAVE_TABLES argument, so the code path we discussed is
> never used. What is the exact problem you're observing with kvmtool
> as the VMM?
> 
Unfortunately I didn't capture the log messages of kvmtool based VM boot.
Saw once kernel crash. Now I'm trying to reproduce the issue with QEMU tool
and share complete log messages.

>          M.
> 
> --
> Without deviation from the norm, progress is not possible.

Marc Zyngier Jan. 20, 2023, noon UTC | #7

On Fri, 20 Jan 2023 05:02:15 +0000,
Shanker Donthineni <sdonthineni@nvidia.com> wrote:
> 
> [1  <text/plain; UTF-8 (7bit)>]
> Hi Marc,
> 
> On 1/19/23 08:01, Marc Zyngier wrote:
> >> On 1/19/23 01:11, Marc Zyngier wrote:
> >>> So you can see the VM being torn down while the vgic save sequence is
> >>> still in progress?
> >>> 
> >>> If you can actually see that, then this is a much bigger bug than the
> >>> simple race you are describing, and we're missing a reference on the
> >>> kvm structure. This would be a*MAJOR*  bug.
> >>> 
> >> How do we know vGIC save sequence is in progress while VM is being
> >> teardown?  I'm launching/terminating ~32 VMs in a loop to reproduce
> >> the issue.
> > Errr...*you*  know when you are issuing the save ioctl, right? You
> > also know when you are terminating the VM (closing its fd or killing
> > the VMM).
> > 
> 
> Added debug statements to trace the code patch, and tagged each log message
> with 'struct kvm *'. Attached the complete kernel log messages including
> debug messages.
> 
> All 32 VMs launched, time 258s to 291s
>  [  258.519837] kvm_create_vm(1236) called kvm=ffff8000303e0000  --> 1st VM
>  ...
>  [  291.801179] kvm_create_vm(1236) called kvm=ffff800057a60000   --> 32nd VM
> 
> Test script inside VM issues poweroff command after sleeping 200sec.
> 
> Working case kvm=ffff8000303e0000:
> 
>  $ cat gicv4-debug.txt | grep ffff8000303e0000
>   [  258.519837] kvm_create_vm(1236) called kvm=ffff8000303e0000
>   [  258.667101] vgic_v4_init(267) called kvm=ffff8000303e0000 doorbell=140(64)
>   [  517.942167] vgic_set_common_attr(263) called kvm=ffff8000303e0000
>   [  517.948415] vgic_v3_save_pending_tables(397) called kvm=ffff8000303e0000
>   [  517.955602] vgic_v3_save_pending_tables(448) called kvm=ffff8000303e0000
>   [  518.099696] kvm_vm_release(1374) called kvm=ffff8000303e0000
>   [  518.126833] vgic_v4_teardown(323) started kvm=ffff8000303e0000 doorbell=140(64)
>   [  518.134677] vgic_v4_teardown(333) finished kvm=ffff8000303e0000 doorbell=140(64)
> 
> Not working case kvm=ffff80001e0a0000:
> 
>  $ cat gicv4-debug.txt | grep ffff80001e0a0000
>   [  277.684981] kvm_create_vm(1236) called kvm=ffff80001e0a0000
>   [  278.158511] vgic_v4_init(267) called kvm=ffff80001e0a0000 doorbell=20812(64)
>   [  545.079117] vgic_set_common_attr(263) called kvm=ffff80001e0a0000
>   [  545.085358] vgic_v3_save_pending_tables(397) called kvm=ffff80001e0a0000
>   [  545.092580] vgic_v3_save_pending_tables(448) called kvm=ffff80001e0a0000
>   [  545.099562] irq: irqd_set_activated: CPU49 IRQ20821 lost IRQD_IRQ_INPROGRESS old=0x10401400 new=0x10401600, expected=0x10441600 kvm=ffff80001e0a0000^M
>   [  545.113177] irq: irqd_set_activated: IRQD_IRQ_INPROGRESS set time [545.099561]^M
>   [  545.121454] irq: irqd_set_activated: IRQD_IRQ_INPROGRESS clr time [545.099562]^M
>   [  545.129755] irq: irqd_set_activated: CPU49 IRQ20826 lost IRQD_IRQ_INPROGRESS old=0x10441400 new=0x10441600, expected=0x10401600 kvm=ffff80001e0a0000^M
>   [  545.143365] irq: irqd_set_activated: IRQD_IRQ_INPROGRESS set time [545.129754]^M
>   [  545.151654] irq: irqd_set_activated: IRQD_IRQ_INPROGRESS clr time [545.129755]^M
>   [  545.163250] kvm_vm_release(1374) called kvm=ffff80001e0a0000
>   [  545.169204] vgic_v4_teardown(323) started kvm=ffff80001e0a0000 doorbell=20812(64)
> 
> IRQD_IRQ_INPROGRESS is corrupted before calling kvm_vm_release(),

You keep missing my point. Yes, we have established the interrupt race
and have a way to fix it, let's move on...

What I am asking agin is: is there any overlap between any vgic ioctl
and the teardown of the VM? Do you ever see kvm_vm_release() being
called before kvm_device_release()?

Because that's the overlap I've been talking all along.

	M.

Shanker Donthineni Jan. 21, 2023, 3:28 p.m. UTC | #8

On 1/20/23 06:00, Marc Zyngier wrote:
> What I am asking agin is: is there any overlap between any vgic ioctl
> and the teardown of the VM? Do you ever see kvm_vm_release() being
> called before kvm_device_release()?
> 
> Because that's the overlap I've been talking all along.

I've not observed overlap of kvm_vm_release() with ioctls kvm_device_ioctl(),
kvm_vcpu_ioctl() and kvm_vm_ioctl() for 3hrs. Test created hundreds of VMs and
teardowns.


Debug code:
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4f26b244f6d0..9c8409a894f5 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -800,7 +800,7 @@ struct kvm {
         bool dirty_ring_with_bitmap;
         bool vm_bugged;
         bool vm_dead;
-
+       atomic_t checkoverlap;
  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
         struct notifier_block pm_notifier;
  #endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9c60384b5ae0..a5a9071aeb19 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1370,9 +1370,13 @@ static int kvm_vm_release(struct inode *inode, struct file *filp)
  {
         struct kvm *kvm = filp->private_data;

+       atomic_inc(&kvm->checkoverlap);
+       pr_err("%s(%d) start kvm=%px\n",__func__,__LINE__, kvm);
+
         kvm_irqfd_release(kvm);

         kvm_put_kvm(kvm);
+       pr_err("%s(%d) end kvm=%px\n",__func__,__LINE__, kvm);
         return 0;
  }

@@ -4073,6 +4077,8 @@ static long kvm_vcpu_ioctl(struct file *filp,
         if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
                 return -EINVAL;

+if (atomic_read(&vcpu->kvm->checkoverlap)) pr_err("%s(%d) overlap kvm=%px\n", __func__, __LINE__, vcpu->kvm);
+
         /*
          * Some architectures have vcpu ioctls that are asynchronous to vcpu
          * execution; mutex_lock() would break them.
@@ -4346,6 +4352,8 @@ static long kvm_device_ioctl(struct file *filp, unsigned int ioctl,
         if (dev->kvm->mm != current->mm || dev->kvm->vm_dead)
                 return -EIO;

+       if (atomic_read(&dev->kvm->checkoverlap)) pr_err("%s(%d) overlap kvm=%px\n", __func__, __LINE__, dev->kvm);
+
         switch (ioctl) {
         case KVM_SET_DEVICE_ATTR:
                 return kvm_device_ioctl_attr(dev, dev->ops->set_attr, arg);
@@ -4731,6 +4739,8 @@ static long kvm_vm_ioctl(struct file *filp,

         if (kvm->mm != current->mm || kvm->vm_dead)
                 return -EIO;
+if (atomic_read(&kvm->checkoverlap)) pr_err("%s(%d) overlap kvm=%px\n", __func__, __LINE__, kvm);
+
         switch (ioctl) {
         case KVM_CREATE_VCPU:
                 r = kvm_vm_ioctl_create_vcpu(kvm, arg);

Shanker Donthineni Jan. 21, 2023, 3:35 p.m. UTC | #9

On 1/20/23 06:00, Marc Zyngier wrote:
> What I am asking agin is: is there any overlap between any vgic ioctl
> and the teardown of the VM? Do you ever see kvm_vm_release() being
> called before kvm_device_release()?
> 
> Because that's the overlap I've been talking all along.

I've not observed overlap of kvm_vm_release() with ioctls kvm_device_ioctl(),
kvm_vcpu_ioctl() and kvm_vm_ioctl() for 3hrs. Test created hundreds of VMs and
teardowns.


Debug code:
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4f26b244f6d0..9c8409a894f5 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -800,7 +800,7 @@ struct kvm {
         bool dirty_ring_with_bitmap;
         bool vm_bugged;
         bool vm_dead;
-
+       atomic_t checkoverlap;
  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
         struct notifier_block pm_notifier;
  #endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9c60384b5ae0..a5a9071aeb19 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1370,9 +1370,13 @@ static int kvm_vm_release(struct inode *inode, struct file *filp)
  {
         struct kvm *kvm = filp->private_data;

+       atomic_inc(&kvm->checkoverlap);
+       pr_err("%s(%d) start kvm=%px\n",__func__,__LINE__, kvm);
+
         kvm_irqfd_release(kvm);

         kvm_put_kvm(kvm);
+       pr_err("%s(%d) end kvm=%px\n",__func__,__LINE__, kvm);
         return 0;
  }

@@ -4073,6 +4077,8 @@ static long kvm_vcpu_ioctl(struct file *filp,
         if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
                 return -EINVAL;

+if (atomic_read(&vcpu->kvm->checkoverlap)) pr_err("%s(%d) overlap kvm=%px\n", __func__, __LINE__, vcpu->kvm);
+
         /*
          * Some architectures have vcpu ioctls that are asynchronous to vcpu
          * execution; mutex_lock() would break them.
@@ -4346,6 +4352,8 @@ static long kvm_device_ioctl(struct file *filp, unsigned int ioctl,
         if (dev->kvm->mm != current->mm || dev->kvm->vm_dead)
                 return -EIO;

+       if (atomic_read(&dev->kvm->checkoverlap)) pr_err("%s(%d) overlap kvm=%px\n", __func__, __LINE__, dev->kvm);
+
         switch (ioctl) {
         case KVM_SET_DEVICE_ATTR:
                 return kvm_device_ioctl_attr(dev, dev->ops->set_attr, arg);
@@ -4731,6 +4739,8 @@ static long kvm_vm_ioctl(struct file *filp,

         if (kvm->mm != current->mm || kvm->vm_dead)
                 return -EIO;
+if (atomic_read(&kvm->checkoverlap)) pr_err("%s(%d) overlap kvm=%px\n", __func__, __LINE__, kvm);
+
         switch (ioctl) {
         case KVM_CREATE_VCPU:
                 r = kvm_vm_ioctl_create_vcpu(kvm, arg);

Marc Zyngier Jan. 23, 2023, 11:23 a.m. UTC | #10

On Sat, 21 Jan 2023 15:35:48 +0000,
Shanker Donthineni <sdonthineni@nvidia.com> wrote:
> 
> 
> 
> On 1/20/23 06:00, Marc Zyngier wrote:
> > What I am asking agin is: is there any overlap between any vgic ioctl
> > and the teardown of the VM? Do you ever see kvm_vm_release() being
> > called before kvm_device_release()?
> > 
> > Because that's the overlap I've been talking all along.
> 
> I've not observed overlap of kvm_vm_release() with ioctls
> kvm_device_ioctl(), kvm_vcpu_ioctl() and kvm_vm_ioctl() for
> 3hrs. Test created hundreds of VMs and teardowns.

OK, that's reassuring. Thanks for looking into this.

	M.

KVM: arm64: vgic: Fix soft lockup during VM teardown

Commit Message

Comments

Patch