
[v2] KVM: s390: fix gisa destroy operation might lead to cpu stalls

Message ID 20230824130932.3573866-1-mimu@linux.ibm.com (mailing list archive)
State New, archived
Series [v2] KVM: s390: fix gisa destroy operation might lead to cpu stalls

Commit Message

Michael Mueller Aug. 24, 2023, 1:09 p.m. UTC
A GISA cannot be destroyed as long as it is linked in the GIB alert list
as this would breake the alert list. Just waiting for its removal from
the list, triggered by another vm, is not sufficient as it might be the
only vm. The cpu stall situation shown below might occur when GIB alerts
are delayed and is fixed by calling process_gib_alert_list() instead of
waiting.

At this time the vcpus of the vm are already destroyed and thus
no vcpu can be kicked to enter the SIE again if for some reason an
interrupt is pending for that vm.

Additianally the IAM restore value ist set to 0x00 if that was not the
case. That would be a bug introduced by incomplete device de-registration,
i.e. missing kvm_s390_gisc_unregister() call.

Setting this value guarantees that late interrupts don't bring the GISA
back into the alert list.
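The stall mechanism and the fix can be modeled in a few lines of plain C. This is a simplified sketch: the structure, list handling, and function names below are illustrative stand-ins for the kernel's, not the real implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; only the list mechanics relevant to the stall
 * are modeled here. */
struct gisa {
	struct gisa *next_alert;	/* link in the GIB alert list */
	int on_alert_list;
};

static struct gisa *gib_alert_list;	/* global alert list head */

static int gisa_in_alert_list(struct gisa *g)
{
	return g->on_alert_list;
}

/* Stand-in for process_gib_alert_list(): unlink every GISA. */
static void process_alert_list(void)
{
	while (gib_alert_list) {
		struct gisa *g = gib_alert_list;

		gib_alert_list = g->next_alert;
		g->next_alert = NULL;
		g->on_alert_list = 0;
	}
}

/*
 * Old destroy path was: while (gisa_in_alert_list(g)) cpu_relax();
 * If the dying VM is the only VM, nobody else ever processes the
 * list and the loop spins forever -- the reported stall.
 *
 * New destroy path: process the alert list ourselves.
 */
static void gisa_destroy(struct gisa *g)
{
	if (gisa_in_alert_list(g))
		process_alert_list();
}
```

With a single VM on the list, `gisa_destroy()` terminates immediately, whereas the old busy-wait would never return.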

Both situations can now be observed in the kvm-trace:

 00 01692880424:653210 3 - 0004 000003ff80136b58  vm 0x000000008e588000 created by pid 3019
 00 01692880472:159783 3 - 0002 000003ff80143c06  vm 0x000000008e588000 has unexpected restore iam 0x02
 00 01692880472:159784 3 - 0002 000003ff80143c24  vm 0x000000008e588000 gisa in alert list during destroy
 00 01692880472:229846 3 - 0004 000003ff8013319a  vm 0x000000008e588000 destroyed

CPU stall caused by kvm_s390_gisa_destroy():

 [ 4915.311372] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 14-.... } 24533 jiffies s: 5269 root: 0x1/.
 [ 4915.311390] rcu: blocking rcu_node structures (internal RCU debug): l=1:0-15:0x4000/.
 [ 4915.311394] Task dump for CPU 14:
 [ 4915.311395] task:qemu-system-s39 state:R  running task     stack:0     pid:217198 ppid:1      flags:0x00000045
 [ 4915.311399] Call Trace:
 [ 4915.311401]  [<0000038003a33a10>] 0x38003a33a10
 [ 4933.861321] rcu: INFO: rcu_sched self-detected stall on CPU
 [ 4933.861332] rcu: 	14-....: (42008 ticks this GP) idle=53f4/1/0x4000000000000000 softirq=61530/61530 fqs=14031
 [ 4933.861353] rcu: 	(t=42008 jiffies g=238109 q=100360 ncpus=18)
 [ 4933.861357] CPU: 14 PID: 217198 Comm: qemu-system-s39 Not tainted 6.5.0-20230816.rc6.git26.a9d17c5d8813.300.fc38.s390x #1
 [ 4933.861360] Hardware name: IBM 8561 T01 703 (LPAR)
 [ 4933.861361] Krnl PSW : 0704e00180000000 000003ff804bfc66 (kvm_s390_gisa_destroy+0x3e/0xe0 [kvm])
 [ 4933.861414]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
 [ 4933.861416] Krnl GPRS: 0000000000000000 00000372000000fc 00000002134f8000 000000000d5f5900
 [ 4933.861419]            00000002f5ea1d18 00000002f5ea1d18 0000000000000000 0000000000000000
 [ 4933.861420]            00000002134fa890 00000002134f8958 000000000d5f5900 00000002134f8000
 [ 4933.861422]            000003ffa06acf98 000003ffa06858b0 0000038003a33c20 0000038003a33bc8
 [ 4933.861430] Krnl Code: 000003ff804bfc58: ec66002b007e	cij	%r6,0,6,000003ff804bfcae
                           000003ff804bfc5e: b904003a		lgr	%r3,%r10
                          #000003ff804bfc62: a7f40005		brc	15,000003ff804bfc6c
                          >000003ff804bfc66: e330b7300204	lg	%r3,10032(%r11)
                           000003ff804bfc6c: 58003000		l	%r0,0(%r3)
                           000003ff804bfc70: ec03fffb6076	crj	%r0,%r3,6,000003ff804bfc66
                           000003ff804bfc76: e320b7600271	lay	%r2,10080(%r11)
                           000003ff804bfc7c: c0e5fffea339	brasl	%r14,000003ff804942ee
 [ 4933.861444] Call Trace:
 [ 4933.861445]  [<000003ff804bfc66>] kvm_s390_gisa_destroy+0x3e/0xe0 [kvm]
 [ 4933.861460] ([<00000002623523de>] free_unref_page+0xee/0x148)
 [ 4933.861507]  [<000003ff804aea98>] kvm_arch_destroy_vm+0x50/0x120 [kvm]
 [ 4933.861521]  [<000003ff8049d374>] kvm_destroy_vm+0x174/0x288 [kvm]
 [ 4933.861532]  [<000003ff8049d4fe>] kvm_vm_release+0x36/0x48 [kvm]
 [ 4933.861542]  [<00000002623cd04a>] __fput+0xea/0x2a8
 [ 4933.861547]  [<00000002620d5bf8>] task_work_run+0x88/0xf0
 [ 4933.861551]  [<00000002620b0aa6>] do_exit+0x2c6/0x528
 [ 4933.861556]  [<00000002620b0f00>] do_group_exit+0x40/0xb8
 [ 4933.861557]  [<00000002620b0fa6>] __s390x_sys_exit_group+0x2e/0x30
 [ 4933.861559]  [<0000000262d481f4>] __do_syscall+0x1d4/0x200
 [ 4933.861563]  [<0000000262d59028>] system_call+0x70/0x98
 [ 4933.861565] Last Breaking-Event-Address:
 [ 4933.861566]  [<0000038003a33b60>] 0x38003a33b60

Fixes: 9f30f6216378 ("KVM: s390: add gib_alert_irq_handler()")
Signed-off-by: Michael Mueller <mimu@linux.ibm.com>
---
 arch/s390/kvm/interrupt.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

Comments

Matthew Rosato Aug. 24, 2023, 7:17 p.m. UTC | #1
On 8/24/23 9:09 AM, Michael Mueller wrote:
> A GISA cannot be destroyed as long it is linked in the GIB alert list
> as this would breake the alert list. Just waiting for its removal from

Hi Michael,

Nit: s/breake/break/

> the list triggered by another vm is not sufficient as it might be the
> only vm. The below shown cpu stall situation might occur when GIB alerts
> are delayed and is fixed by calling process_gib_alert_list() instead of
> waiting.
> 
> At this time the vcpus of the vm are already destroyed and thus
> no vcpu can be kicked to enter the SIE again if for some reason an
> interrupt is pending for that vm.
> 
> Additianally the IAM restore value ist set to 0x00 if that was not the

Nits: s/Additianally/Additionally/  as well as s/ist/is/ 

> case. That would be a bug introduced by incomplete device de-registration,
> i.e. missing kvm_s390_gisc_unregister() call.
If this implies a bug, maybe it should be a WARN_ON instead of a KVM_EVENT?  Because if we missed a call to kvm_s390_gisc_unregister() then we're also leaking refcounts (one for each gisc that we didn't unregister).

> 
> Setting this value guarantees that late interrupts don't bring the GISA
> back into the alert list.

Just to make sure I understand -- The idea is that once you set the alert mask to 0x00 then it should be impossible for millicode to deliver further alerts associated with this gisa right?  Thus making it OK to do one last process_gib_alert_list() after that point in time.

But I guess my question is: will millicode actually see this gi->alert.mask change soon enough to prevent further alerts?  Don't you need to also cmpxchg the mask update into the contents of kvm_s390_gisa (via gisa_set_iam?) in order to ensure an alert can't still be delivered some time after you check gisa_in_alert_list(gi->origin)?  That matches up with what is done per-gisc in kvm_s390_gisc_unregister() today.

...  That said, now that I'm looking closer at kvm_s390_gisc_unregister() and gisa_set_iam():  it seems strange that nobody checks the return code from gisa_set_iam today.  AFAICT, even if the device driver(s) call kvm_s390_gisc_unregister correctly for all associated gisc, if gisa_set_iam manages to return -EBUSY because the gisa is already in the alert list then wouldn't the gisc refcounts be decremented but the relevant alert bit left enabled for that gisc until the next time we call gisa_set_iam or gisa_get_ipm_or_restore_iam?

Similar strangeness for kvm_s390_gisc_register() - AFAICT if gisa_set_iam returns -EBUSY then we would increment the gisc refcounts but never actually enable the alert bit for that gisc until the next time we call gisa_set_iam or gisa_get_ipm_or_restore_iam.
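The refcount/IAM desync described above can be sketched as a toy model (not kernel code; `0x80 >> gisc` mirrors the MSB-first bit numbering used for ISC masks, and the alert-list state is reduced to a flag):

```c
#include <assert.h>
#include <errno.h>

static unsigned char iam_shadow;	/* IAM as "millicode" would see it */
static int gisa_on_alert_list;		/* when set, gisa_set_iam() fails */
static int gisc_refcount[8];		/* per-gisc registration count */

/* Stand-in for gisa_set_iam(): refuses while the GISA is alerted. */
static int gisa_set_iam(unsigned char mask)
{
	if (gisa_on_alert_list)
		return -EBUSY;		/* mask is NOT written */
	iam_shadow = mask;
	return 0;
}

/*
 * Register as done today: the refcount is bumped even when
 * gisa_set_iam() fails with -EBUSY, and callers ignore the return
 * code, so the alert bit for the gisc is never actually enabled.
 */
static int gisc_register(int gisc)
{
	int rc = 0;

	if (gisc_refcount[gisc]++ == 0)
		rc = gisa_set_iam(iam_shadow | (0x80 >> gisc));
	return rc;
}
```

Registering a gisc while the (modeled) GISA sits on the alert list leaves the refcount taken but the corresponding IAM bit clear, which is exactly the inconsistency described in the paragraph above.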

Thanks,
Matt

> 
> Both situation can now be observed in the kvm-trace:
> 
>  00 01692880424:653210 3 - 0004 000003ff80136b58  vm 0x000000008e588000 created by pid 3019
>  00 01692880472:159783 3 - 0002 000003ff80143c06  vm 0x000000008e588000 has unexpected restore iam 0x02
>  00 01692880472:159784 3 - 0002 000003ff80143c24  vm 0x000000008e588000 gisa in alert list during destroy
>  00 01692880472:229846 3 - 0004 000003ff8013319a  vm 0x000000008e588000 destroyed
> 
> CPU stall caused by kvm_s390_gisa_destroy():
> 
>  [ 4915.311372] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 14-.... } 24533 jiffies s: 5269 root: 0x1/.
>  [ 4915.311390] rcu: blocking rcu_node structures (internal RCU debug): l=1:0-15:0x4000/.
>  [ 4915.311394] Task dump for CPU 14:
>  [ 4915.311395] task:qemu-system-s39 state:R  running task     stack:0     pid:217198 ppid:1      flags:0x00000045
>  [ 4915.311399] Call Trace:
>  [ 4915.311401]  [<0000038003a33a10>] 0x38003a33a10
>  [ 4933.861321] rcu: INFO: rcu_sched self-detected stall on CPU
>  [ 4933.861332] rcu: 	14-....: (42008 ticks this GP) idle=53f4/1/0x4000000000000000 softirq=61530/61530 fqs=14031
>  [ 4933.861353] rcu: 	(t=42008 jiffies g=238109 q=100360 ncpus=18)
>  [ 4933.861357] CPU: 14 PID: 217198 Comm: qemu-system-s39 Not tainted 6.5.0-20230816.rc6.git26.a9d17c5d8813.300.fc38.s390x #1
>  [ 4933.861360] Hardware name: IBM 8561 T01 703 (LPAR)
>  [ 4933.861361] Krnl PSW : 0704e00180000000 000003ff804bfc66 (kvm_s390_gisa_destroy+0x3e/0xe0 [kvm])
>  [ 4933.861414]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
>  [ 4933.861416] Krnl GPRS: 0000000000000000 00000372000000fc 00000002134f8000 000000000d5f5900
>  [ 4933.861419]            00000002f5ea1d18 00000002f5ea1d18 0000000000000000 0000000000000000
>  [ 4933.861420]            00000002134fa890 00000002134f8958 000000000d5f5900 00000002134f8000
>  [ 4933.861422]            000003ffa06acf98 000003ffa06858b0 0000038003a33c20 0000038003a33bc8
>  [ 4933.861430] Krnl Code: 000003ff804bfc58: ec66002b007e	cij	%r6,0,6,000003ff804bfcae
>                            000003ff804bfc5e: b904003a		lgr	%r3,%r10
>                           #000003ff804bfc62: a7f40005		brc	15,000003ff804bfc6c
>                           >000003ff804bfc66: e330b7300204	lg	%r3,10032(%r11)
>                            000003ff804bfc6c: 58003000		l	%r0,0(%r3)
>                            000003ff804bfc70: ec03fffb6076	crj	%r0,%r3,6,000003ff804bfc66
>                            000003ff804bfc76: e320b7600271	lay	%r2,10080(%r11)
>                            000003ff804bfc7c: c0e5fffea339	brasl	%r14,000003ff804942ee
>  [ 4933.861444] Call Trace:
>  [ 4933.861445]  [<000003ff804bfc66>] kvm_s390_gisa_destroy+0x3e/0xe0 [kvm]
>  [ 4933.861460] ([<00000002623523de>] free_unref_page+0xee/0x148)
>  [ 4933.861507]  [<000003ff804aea98>] kvm_arch_destroy_vm+0x50/0x120 [kvm]
>  [ 4933.861521]  [<000003ff8049d374>] kvm_destroy_vm+0x174/0x288 [kvm]
>  [ 4933.861532]  [<000003ff8049d4fe>] kvm_vm_release+0x36/0x48 [kvm]
>  [ 4933.861542]  [<00000002623cd04a>] __fput+0xea/0x2a8
>  [ 4933.861547]  [<00000002620d5bf8>] task_work_run+0x88/0xf0
>  [ 4933.861551]  [<00000002620b0aa6>] do_exit+0x2c6/0x528
>  [ 4933.861556]  [<00000002620b0f00>] do_group_exit+0x40/0xb8
>  [ 4933.861557]  [<00000002620b0fa6>] __s390x_sys_exit_group+0x2e/0x30
>  [ 4933.861559]  [<0000000262d481f4>] __do_syscall+0x1d4/0x200
>  [ 4933.861563]  [<0000000262d59028>] system_call+0x70/0x98
>  [ 4933.861565] Last Breaking-Event-Address:
>  [ 4933.861566]  [<0000038003a33b60>] 0x38003a33b60
> 
> Fixes: 9f30f6216378 ("KVM: s390: add gib_alert_irq_handler()")
> Signed-off-by: Michael Mueller <mimu@linux.ibm.com>
> ---
>  arch/s390/kvm/interrupt.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c
> index 85e39f472bb4..06890a58d001 100644
> --- a/arch/s390/kvm/interrupt.c
> +++ b/arch/s390/kvm/interrupt.c
> @@ -3216,11 +3216,15 @@ void kvm_s390_gisa_destroy(struct kvm *kvm)
>  
>  	if (!gi->origin)
>  		return;
> -	if (gi->alert.mask)
> -		KVM_EVENT(3, "vm 0x%pK has unexpected iam 0x%02x",
> +	if (gi->alert.mask) {
> +		KVM_EVENT(3, "vm 0x%pK has unexpected restore iam 0x%02x",
>  			  kvm, gi->alert.mask);
> -	while (gisa_in_alert_list(gi->origin))
> -		cpu_relax();
> +		gi->alert.mask = 0x00;
> +	}
> +	if (gisa_in_alert_list(gi->origin)) {
> +		KVM_EVENT(3, "vm 0x%pK gisa in alert list during destroy", kvm);
> +		process_gib_alert_list();
> +	}
>  	hrtimer_cancel(&gi->timer);
>  	gi->origin = NULL;
>  	VM_EVENT(kvm, 3, "gisa 0x%pK destroyed", gisa);
Michael Mueller Aug. 24, 2023, 8:36 p.m. UTC | #2
On 24.08.23 21:17, Matthew Rosato wrote:
> On 8/24/23 9:09 AM, Michael Mueller wrote:
>> A GISA cannot be destroyed as long it is linked in the GIB alert list
>> as this would breake the alert list. Just waiting for its removal from
> 
> Hi Michael,
> 
> Nit: s/breake/break/
> 
>> the list triggered by another vm is not sufficient as it might be the
>> only vm. The below shown cpu stall situation might occur when GIB alerts
>> are delayed and is fixed by calling process_gib_alert_list() instead of
>> waiting.
>>
>> At this time the vcpus of the vm are already destroyed and thus
>> no vcpu can be kicked to enter the SIE again if for some reason an
>> interrupt is pending for that vm.
>>
>> Additianally the IAM restore value ist set to 0x00 if that was not the
> 
> Nits: s/Additianally/Additionally/  as well as s/ist/is/
> 

Thanks a lot, Matt. I will of course address all these typos ;)

>> case. That would be a bug introduced by incomplete device de-registration,
>> i.e. missing kvm_s390_gisc_unregister() call.
> If this implies a bug, maybe it should be a WARN_ON instead of a KVM_EVENT?  Because if we missed a call to kvm_s390_gisc_unregister() then we're also leaking refcounts (one for each gisc that we didn't unregister).

I was thinking of a WARN_ON() as well and will most probably add it 
because it is much more visible.

> 
>>
>> Setting this value guarantees that late interrupts don't bring the GISA
>> back into the alert list.
> 
> Just to make sure I understand -- The idea is that once you set the alert mask to 0x00 then it should be impossible for millicode to deliver further alerts associated with this gisa right?  Thus making it OK to do one last process_gib_alert_list() after that point in time.
> 
> But I guess my question is: will millicode actually see this gi->alert.mask change soon enough to prevent further alerts?  Don't you need to also cmpxchg the mask update into the contents of kvm_s390_gisa (via gisa_set_iam?) 

It is not the IAM directly that I set to 0x00 but gi->alert.mask. It is 
used to restore the IAM in the gisa by means of 
gisa_get_ipm_or_restore_iam() under cmpxchg() conditions, which is called 
by process_gib_alert_list() and by the hrtimer function gisa_vcpu_kicker() 
that it triggers. When the gisa is in the alert list, the IAM is always 
0x00. It's set by millicode. I just need to ensure that it is not 
changed to anything else.
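A rough model of that cmpxchg-based restore may help; the field layout and widths here are illustrative, not the architected GISA format:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* One 16-bit word standing in for the GISA header: high byte models
 * the IPM (pending-interrupt mask), low byte the IAM. */
static _Atomic uint16_t gisa_word;

/*
 * Sketch of the gisa_get_ipm_or_restore_iam() semantics: if
 * interrupts are pending, report the IPM and leave the word alone;
 * otherwise atomically install the restore mask as the new IAM.
 * With a restore mask of 0x00 (gi->alert.mask after the fix) the IAM
 * stays zero, so "millicode" has no reason to put the GISA back on
 * the alert list.
 */
static uint8_t get_ipm_or_restore_iam(uint8_t restore_mask)
{
	uint16_t old, new;

	do {
		old = atomic_load(&gisa_word);
		if (old >> 8)
			return (uint8_t)(old >> 8);	/* pending IPM */
		new = restore_mask;			/* IPM clear */
	} while (!atomic_compare_exchange_weak(&gisa_word, &old, new));
	return 0;
}
```

The cmpxchg loop is what makes a concurrent update by the other side safe: if the word changes between the load and the store, the restore is retried against the fresh value.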

> in order to ensure an alert can't still be delivered some time after you 
> check gisa_in_alert_list(gi->origin)?  That matches up with what is done 
> per-gisc in kvm_s390_gisc_unregister() today.

right

> 
> ...  That said, now that I'm looking closer at kvm_s390_gisc_unregister() and gisa_set_iam():  it seems strange that nobody checks the return code from gisa_set_iam today.  AFAICT, even if the device driver(s) call kvm_s390_gisc_unregister correctly for all associated gisc, if gisa_set_iam manages to return -EBUSY because the gisa is already in the alert list then wouldn't the gisc refcounts be decremented but the relevant alert bit left enabled for that gisc until the next time we call gisa_set_iam or gisa_get_ipm_or_restore_iam?

you are right, that should be retried in kvm_s390_gisc_register() and 
kvm_s390_gisc_unregister() until the rc is 0, but that would lead to a 
CPU stall as well when GAL interrupts are not delivered in the host.

> 
> Similar strangeness for kvm_s390_gisc_register() - AFAICT if gisa_set_iam returns -EBUSY then we would increment the gisc refcounts but never actually enable the alert bit for that gisc until the next time we call gisa_set_iam or gisa_get_ipm_or_restore_iam.

I have to think and play around with process_gib_alert_list() being 
called as well in these situations.

BTW the pci and the vfio_ap device drivers currently also ignore the 
return codes of kvm_s390_gisc_unregister().

Thanks a lot for your thoughts!
Michael

> 
> Thanks,
> Matt
> 
>>
>> Both situation can now be observed in the kvm-trace:
>>
>>   00 01692880424:653210 3 - 0004 000003ff80136b58  vm 0x000000008e588000 created by pid 3019
>>   00 01692880472:159783 3 - 0002 000003ff80143c06  vm 0x000000008e588000 has unexpected restore iam 0x02
>>   00 01692880472:159784 3 - 0002 000003ff80143c24  vm 0x000000008e588000 gisa in alert list during destroy
>>   00 01692880472:229846 3 - 0004 000003ff8013319a  vm 0x000000008e588000 destroyed
>>
>> CPU stall caused by kvm_s390_gisa_destroy():
>>
>>   [ 4915.311372] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 14-.... } 24533 jiffies s: 5269 root: 0x1/.
>>   [ 4915.311390] rcu: blocking rcu_node structures (internal RCU debug): l=1:0-15:0x4000/.
>>   [ 4915.311394] Task dump for CPU 14:
>>   [ 4915.311395] task:qemu-system-s39 state:R  running task     stack:0     pid:217198 ppid:1      flags:0x00000045
>>   [ 4915.311399] Call Trace:
>>   [ 4915.311401]  [<0000038003a33a10>] 0x38003a33a10
>>   [ 4933.861321] rcu: INFO: rcu_sched self-detected stall on CPU
>>   [ 4933.861332] rcu: 	14-....: (42008 ticks this GP) idle=53f4/1/0x4000000000000000 softirq=61530/61530 fqs=14031
>>   [ 4933.861353] rcu: 	(t=42008 jiffies g=238109 q=100360 ncpus=18)
>>   [ 4933.861357] CPU: 14 PID: 217198 Comm: qemu-system-s39 Not tainted 6.5.0-20230816.rc6.git26.a9d17c5d8813.300.fc38.s390x #1
>>   [ 4933.861360] Hardware name: IBM 8561 T01 703 (LPAR)
>>   [ 4933.861361] Krnl PSW : 0704e00180000000 000003ff804bfc66 (kvm_s390_gisa_destroy+0x3e/0xe0 [kvm])
>>   [ 4933.861414]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
>>   [ 4933.861416] Krnl GPRS: 0000000000000000 00000372000000fc 00000002134f8000 000000000d5f5900
>>   [ 4933.861419]            00000002f5ea1d18 00000002f5ea1d18 0000000000000000 0000000000000000
>>   [ 4933.861420]            00000002134fa890 00000002134f8958 000000000d5f5900 00000002134f8000
>>   [ 4933.861422]            000003ffa06acf98 000003ffa06858b0 0000038003a33c20 0000038003a33bc8
>>   [ 4933.861430] Krnl Code: 000003ff804bfc58: ec66002b007e	cij	%r6,0,6,000003ff804bfcae
>>                             000003ff804bfc5e: b904003a		lgr	%r3,%r10
>>                            #000003ff804bfc62: a7f40005		brc	15,000003ff804bfc6c
>>                            >000003ff804bfc66: e330b7300204	lg	%r3,10032(%r11)
>>                             000003ff804bfc6c: 58003000		l	%r0,0(%r3)
>>                             000003ff804bfc70: ec03fffb6076	crj	%r0,%r3,6,000003ff804bfc66
>>                             000003ff804bfc76: e320b7600271	lay	%r2,10080(%r11)
>>                             000003ff804bfc7c: c0e5fffea339	brasl	%r14,000003ff804942ee
>>   [ 4933.861444] Call Trace:
>>   [ 4933.861445]  [<000003ff804bfc66>] kvm_s390_gisa_destroy+0x3e/0xe0 [kvm]
>>   [ 4933.861460] ([<00000002623523de>] free_unref_page+0xee/0x148)
>>   [ 4933.861507]  [<000003ff804aea98>] kvm_arch_destroy_vm+0x50/0x120 [kvm]
>>   [ 4933.861521]  [<000003ff8049d374>] kvm_destroy_vm+0x174/0x288 [kvm]
>>   [ 4933.861532]  [<000003ff8049d4fe>] kvm_vm_release+0x36/0x48 [kvm]
>>   [ 4933.861542]  [<00000002623cd04a>] __fput+0xea/0x2a8
>>   [ 4933.861547]  [<00000002620d5bf8>] task_work_run+0x88/0xf0
>>   [ 4933.861551]  [<00000002620b0aa6>] do_exit+0x2c6/0x528
>>   [ 4933.861556]  [<00000002620b0f00>] do_group_exit+0x40/0xb8
>>   [ 4933.861557]  [<00000002620b0fa6>] __s390x_sys_exit_group+0x2e/0x30
>>   [ 4933.861559]  [<0000000262d481f4>] __do_syscall+0x1d4/0x200
>>   [ 4933.861563]  [<0000000262d59028>] system_call+0x70/0x98
>>   [ 4933.861565] Last Breaking-Event-Address:
>>   [ 4933.861566]  [<0000038003a33b60>] 0x38003a33b60
>>
>> Fixes: 9f30f6216378 ("KVM: s390: add gib_alert_irq_handler()")
>> Signed-off-by: Michael Mueller <mimu@linux.ibm.com>
>> ---
>>   arch/s390/kvm/interrupt.c | 12 ++++++++----
>>   1 file changed, 8 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c
>> index 85e39f472bb4..06890a58d001 100644
>> --- a/arch/s390/kvm/interrupt.c
>> +++ b/arch/s390/kvm/interrupt.c
>> @@ -3216,11 +3216,15 @@ void kvm_s390_gisa_destroy(struct kvm *kvm)
>>   
>>   	if (!gi->origin)
>>   		return;
>> -	if (gi->alert.mask)
>> -		KVM_EVENT(3, "vm 0x%pK has unexpected iam 0x%02x",
>> +	if (gi->alert.mask) {
>> +		KVM_EVENT(3, "vm 0x%pK has unexpected restore iam 0x%02x",
>>   			  kvm, gi->alert.mask);
>> -	while (gisa_in_alert_list(gi->origin))
>> -		cpu_relax();
>> +		gi->alert.mask = 0x00;
>> +	}
>> +	if (gisa_in_alert_list(gi->origin)) {
>> +		KVM_EVENT(3, "vm 0x%pK gisa in alert list during destroy", kvm);
>> +		process_gib_alert_list();
>> +	}
>>   	hrtimer_cancel(&gi->timer);
>>   	gi->origin = NULL;
>>   	VM_EVENT(kvm, 3, "gisa 0x%pK destroyed", gisa);
>
Matthew Rosato Aug. 25, 2023, 2:56 a.m. UTC | #3
On 8/24/23 4:36 PM, Michael Mueller wrote:
> 
> 
> On 24.08.23 21:17, Matthew Rosato wrote:
>> On 8/24/23 9:09 AM, Michael Mueller wrote:
>>> A GISA cannot be destroyed as long it is linked in the GIB alert list
>>> as this would breake the alert list. Just waiting for its removal from
>>
>> Hi Michael,
>>
>> Nit: s/breake/break/
>>
>>> the list triggered by another vm is not sufficient as it might be the
>>> only vm. The below shown cpu stall situation might occur when GIB alerts
>>> are delayed and is fixed by calling process_gib_alert_list() instead of
>>> waiting.
>>>
>>> At this time the vcpus of the vm are already destroyed and thus
>>> no vcpu can be kicked to enter the SIE again if for some reason an
>>> interrupt is pending for that vm.
>>>
>>> Additianally the IAM restore value ist set to 0x00 if that was not the
>>
>> Nits: s/Additianally/Additionally/  as well as s/ist/is/
>>
> 
> Thanks a lot, Matt. I will address of course all these typos ;)
> 
>>> case. That would be a bug introduced by incomplete device de-registration,
>>> i.e. missing kvm_s390_gisc_unregister() call.
>> If this implies a bug, maybe it should be a WARN_ON instead of a KVM_EVENT?  Because if we missed a call to kvm_s390_gisc_unregister() then we're also leaking refcounts (one for each gisc that we didn't unregister).
> 
> I was thinking of a WARN_ON() as well and will most probaly add it because it is much better visible.
> 
>>
>>>
>>> Setting this value guarantees that late interrupts don't bring the GISA
>>> back into the alert list.
>>
>> Just to make sure I understand -- The idea is that once you set the alert mask to 0x00 then it should be impossible for millicode to deliver further alerts associated with this gisa right?  Thus making it OK to do one last process_gib_alert_list() after that point in time.
>>
>> But I guess my question is: will millicode actually see this gi->alert.mask change soon enough to prevent further alerts?  Don't you need to also cmpxchg the mask update into the contents of kvm_s390_gisa (via gisa_set_iam?) 
> 
> It is not the IAM directly that I set to 0x00 but gi->alert.mask. It is used the restore the IAM in the gisa by means of gisa_get_ipm_or_restore_iam() under cmpxchg() conditions which is called by process_gib_alert_list() and the hr_timer function gisa_vcpu_kicker() that it triggers. When the gisa is in the alert list, the IAM is always 0x00. It's set by millicode. I just need to ensure that it is not changed to anything else.

Besides zeroing it while on the alert list and restoring the IAM to re-enable millicode alerts, we also change the IAM to enable a gisc (kvm_s390_gisc_register) and disable a gisc (kvm_s390_gisc_unregister) for alerts via a call to gisa_set_iam().  AFAIU the IAM is telling millicode what giscs host alerts should happen for, and the point of the gisa_set_iam() call during gisc_unregister is to tell millicode to stop alerts from being delivered for that gisc at that point.

Now for this patch, my understanding is that you are basically cleaning up after a driver that did not handle their gisc refcounts properly, right?  Otherwise by the time you reach gisa_destroy the alert.mask would already be 0x00.  Then in that case, wouldn't you want to force the unregistration of any gisc still in the IAM at gisa_destroy time -- meaning shouldn't we do the equivalent gisa_set_iam that would have previously been done during gisc_unregister, had it been called properly? 

For example, rather than just setting gi->alert.mask = 0x00 in kvm_s390_gisa_destroy(), what if you instead did:
1) issue the warning that gi->alert.mask was nonzero upon entry to gisa_destroy
2) perform the equivalent of calling kvm_s390_gisc_unregister() for every bit that is still on in the gi->alert.mask, performing the same actions as though the refcount were reaching 0 for each gisc (remove the bit from the alert mask, call gisa_set_iam).
3) Finally, process the alert list one more time if gisa_in_alert_list(gi->origin).  At this point, since we already set IAM to 0x00, millicode would have no further reason to deliver more alerts, so doing one last check should be safe.

That would be the same chain of events (minus the warning) that would occur if a driver actually called kvm_s390_gisc_unregister() the correct number of times.  Of course you could also just collapse step #2 -- doing that gets us _very_ close to this patch; you could just set gi->alert.mask directly to 0x00 like you do here but then you would also need a gisa_set_iam call to tell millicode to stop sending alerts for all of the giscs you just removed from the alert.mask.  In either approach, the -EBUSY return from gisa_set_iam would be an issue that needs to be handled.

Overall I guess until the IAM visible to millicode is set to 0x00 I'm not sure I understand what would prevent millicode from delivering another alert to any gisc still enabled in the IAM.  You say above it will be cmpxchg()'d during process_gib_alert_list() via gisa_get_ipm_or_restore_iam() but if we first check gisa_in_alert_list(gi->origin) with this new patch and the gisa is not yet in the alert list then we would skip the call to process_gib_alert_list() and instead just cancel the timer -- I could very well be misunderstanding something, but my concern is that you are shrinking but not eliminating the window here.  Let me try an example -- With this patch, isn't the following chain of events still possible:

1) enter kvm_s390_gisa_destroy.  Let's say gi->alert.mask = 0x80.
2) set gi->alert.mask = 0x00
3) check if(gisa_in_alert_list(gi->origin)) -- it returns false
4) Since the IAM still had a bit on at this point, millicode now delivers an alert for the gisc associated with bit 0x80 and sets IAM to 0x00 to indicate the gisa in the alert list
5) call hrtimer_cancel (since we already checked gisa_in_alert_list, we don't notice that last alert delivered)
6) set gi->origin = NULL, return from kvm_s390_gisa_destroy

Assuming that series of events is possible, wouldn't a solution be to replace step #3 above with something along the lines of this (untested diff on top of this patch):

diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c
index 06890a58d001..ab99c9ec1282 100644
--- a/arch/s390/kvm/interrupt.c
+++ b/arch/s390/kvm/interrupt.c
@@ -3220,6 +3220,10 @@ void kvm_s390_gisa_destroy(struct kvm *kvm)
                KVM_EVENT(3, "vm 0x%pK has unexpected restore iam 0x%02x",
                          kvm, gi->alert.mask);
                gi->alert.mask = 0x00;
+               while (gisa_set_iam(gi->origin, gi->alert.mask)) {
+                       KVM_EVENT(3, "vm 0x%pK alert while clearing iam", kvm);
+                       process_gib_alert_list();
+               }
        }
        if (gisa_in_alert_list(gi->origin)) {
                KVM_EVENT(3, "vm 0x%pK gisa in alert list during destroy", kvm);

> 
> in order to ensure an alert can't still be delivered some time after you check gisa_in_alert_list(gi->origin)?  That matches up with what is done per-gisc in kvm_s390_gisc_unregister() today.
> 
> right
> 
>>
>> ...  That said, now that I'm looking closer at kvm_s390_gisc_unregister() and gisa_set_iam():  it seems strange that nobody checks the return code from gisa_set_iam today.  AFAICT, even if the device driver(s) call kvm_s390_gisc_unregister correctly for all associated gisc, if gisa_set_iam manages to return -EBUSY because the gisa is already in the alert list then wouldn't the gisc refcounts be decremented but the relevant alert bit left enabled for that gisc until the next time we call gisa_set_iam or gisa_get_ipm_or_restore_iam?
> 
> you are right, that should retried in kvm_s390_gisc_register() and kvm_s390_gisc_unregister() until the rc is 0 but that would lead to a CPU stall as well under the condition where GAL interrupts are not delivered in the host.
> 
>>
>> Similar strangeness for kvm_s390_gisc_register() - AFAICT if gisa_set_iam returns -EBUSY then we would increment the gisc refcounts but never actually enable the alert bit for that gisc until the next time we call gisa_set_iam or gisa_get_ipm_or_restore_iam.
> 
> I have to think and play around with process_gib_alert_list() being called as well in these situations.
> 
> BTW the pci and the vfip_ap device drivers currently also ignore the return codes of kvm_s390_gisc_unregister().
> 

Hmm, good point.  You're right, we should probably do something there.  I think the 3 reasons kvm_s390_gisc_unregister() could give a nonzero RC today would all be strange, likely implementation bugs...

-ENODEV we also would have never been able to register, or something odd happened to gisa after registration
-ERANGE we also would have never been able to register, or the gisc got clobbered sometime after registration
-EINVAL either we never registered, unregistered too many times or gisa was destroyed on us somehow

I think for these cases the best pci/ap can do would be to WARN_ON(_ONCE) and then proceed just assuming that the gisc was unregistered or never properly registered.
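A sketch of that suggested driver-side handling; `warn_on_once()` is a userspace stand-in for the kernel's WARN_ON_ONCE(), and the failing unregister is simulated rather than real:

```c
#include <assert.h>
#include <errno.h>

static int warn_count;			/* how many warnings fired */

/* Userspace stand-in for WARN_ON_ONCE(): warn on the first hit only. */
static void warn_on_once(int cond)
{
	static int warned;

	if (cond && !warned) {
		warned = 1;
		warn_count++;
	}
}

/* Simulated kvm_s390_gisc_unregister() that always fails with
 * -EINVAL, modeling e.g. a double unregister. */
static int gisc_unregister(int gisc)
{
	(void)gisc;
	return -EINVAL;
}

/*
 * Suggested teardown: surface the implementation bug once, then
 * proceed as if the gisc were unregistered instead of retrying,
 * since all three error cases indicate a bug rather than a
 * transient condition.
 */
static void driver_teardown(int gisc)
{
	int rc = gisc_unregister(gisc);

	warn_on_once(rc != 0);
	/* continue teardown regardless */
}
```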

Thanks,
Matt
Michael Mueller Aug. 28, 2023, 10:39 a.m. UTC | #4
On 25.08.23 04:56, Matthew Rosato wrote:
> On 8/24/23 4:36 PM, Michael Mueller wrote:
>>
>>
>> On 24.08.23 21:17, Matthew Rosato wrote:
>>> On 8/24/23 9:09 AM, Michael Mueller wrote:
>>>> A GISA cannot be destroyed as long it is linked in the GIB alert list
>>>> as this would breake the alert list. Just waiting for its removal from
>>>
>>> Hi Michael,
>>>
>>> Nit: s/breake/break/
>>>
>>>> the list triggered by another vm is not sufficient as it might be the
>>>> only vm. The below shown cpu stall situation might occur when GIB alerts
>>>> are delayed and is fixed by calling process_gib_alert_list() instead of
>>>> waiting.
>>>>
>>>> At this time the vcpus of the vm are already destroyed and thus
>>>> no vcpu can be kicked to enter the SIE again if for some reason an
>>>> interrupt is pending for that vm.
>>>>
>>>> Additianally the IAM restore value ist set to 0x00 if that was not the
>>>
>>> Nits: s/Additianally/Additionally/  as well as s/ist/is/
>>>
>>
>> Thanks a lot, Matt. I will address of course all these typos ;)
>>
>>>> case. That would be a bug introduced by incomplete device de-registration,
>>>> i.e. missing kvm_s390_gisc_unregister() call.
>>> If this implies a bug, maybe it should be a WARN_ON instead of a KVM_EVENT?  Because if we missed a call to kvm_s390_gisc_unregister() then we're also leaking refcounts (one for each gisc that we didn't unregister).
>>
>> I was thinking of a WARN_ON() as well and will most probaly add it because it is much better visible.
>>
>>>
>>>>
>>>> Setting this value guarantees that late interrupts don't bring the GISA
>>>> back into the alert list.
>>>
>>> Just to make sure I understand -- The idea is that once you set the alert mask to 0x00 then it should be impossible for millicode to deliver further alerts associated with this gisa right?  Thus making it OK to do one last process_gib_alert_list() after that point in time.
>>>
>>> But I guess my question is: will millicode actually see this gi->alert.mask change soon enough to prevent further alerts?  Don't you need to also cmpxchg the mask update into the contents of kvm_s390_gisa (via gisa_set_iam?)
>>
>> It is not the IAM directly that I set to 0x00 but gi->alert.mask. It is used to restore the IAM in the gisa by means of gisa_get_ipm_or_restore_iam() under cmpxchg() conditions, which is called by process_gib_alert_list() and the hrtimer function gisa_vcpu_kicker() that it triggers. When the gisa is in the alert list, the IAM is always 0x00. It's set by millicode. I just need to ensure that it is not changed to anything else.
> 
> Besides zeroing it while on the alert list and restoring the IAM to re-enable millicode alerts, we also change the IAM to enable a gisc (kvm_s390_gisc_register) and disable a gisc (kvm_s390_gisc_unregister) for alerts via a call to gisa_set_iam().  AFAIU the IAM is telling millicode what giscs host alerts should happen for, and the point of the gisa_set_iam() call during gisc_unregister is to tell millicode to stop alerts from being delivered for that gisc at that point.

Yes, the kvm_s390_gisc_register() function manages the alert.mask to be 
restored into the IAM when a gisa is processed in the alert list, as 
well as the IAM itself when a first/additional guest ISC has to be 
handled; that's the case for a first/second AP or PCI adapter. In case 
it's the very first GISC, e.g. AP_ISC, the gisa cannot be in the alert 
list, hence gisa_set_iam() will be successful. When a second AP_ISC is 
registered, the alert.mask is not changed and thus gisa_set_iam() is 
NOT called. If a different guest ISC is registered, e.g. PCI_ISC, the 
alert.mask is updated and gisa_set_iam() tries to set the new IAM. In 
case the gisa is in the alert list, the gisa_set_iam() call returns 
-EBUSY, but we don't need to care because the correct IAM will be 
restored during the processing of the alert list. If the gisa is not 
in the alert list, the call will be successful and the new IAM is set.
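The register flow above can be sketched as a small userspace model (an illustration only, not the kernel code; the `0x80 >> gisc` bit layout and the ISC numbers used below are assumptions for the example):

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

struct model {
	unsigned char alert_mask;	/* gi->alert.mask: IAM restore value */
	unsigned char iam;		/* IAM field in the GISA */
	int in_alert_list;		/* gisa linked in the GIB alert list? */
	unsigned int ref_count[8];	/* per-GISC reference counts */
};

/* gisa_set_iam() analogue: fails with -EBUSY while the gisa is linked
 * in the alert list, because millicode has cleared the IAM and owns it
 * until the alert is processed. */
static int set_iam(struct model *m, unsigned char iam)
{
	if (m->in_alert_list)
		return -EBUSY;
	m->iam = iam;
	return 0;
}

/* kvm_s390_gisc_register() analogue: only the first registration of a
 * gisc changes alert.mask, and only then is an IAM update attempted. */
static int gisc_register(struct model *m, unsigned int gisc)
{
	if (gisc > 7)
		return -ERANGE;
	if (m->ref_count[gisc]++ == 0) {
		m->alert_mask |= 0x80 >> gisc;
		set_iam(m, m->alert_mask);	/* may fail with -EBUSY */
	}
	return 0;
}
```

Registering the same gisc twice leaves the IAM untouched the second time, and registering a new gisc while the gisa sits on the alert list updates only alert.mask, exactly as described above.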

The situation for kvm_s390_gisc_unregister() is similar, let's walk 
through it. gisa_set_iam() needs to be called whenever the last 
reference to a specific guest ISC is deregistered, because that also 
changes the alert.mask. The condition is that ref_count[gisc] is 
decreased by 1 and becomes 0. Then the respective bit is cleared in 
the alert.mask and gisa_set_iam() tries to update the IAM. In case the 
gisa is in the alert list, the IAM will be restored with the current 
alert.mask, which already has the bit cleared. In case the gisa is not 
in the alert list, the IAM will be set immediately.
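The unregister side, with the same caveats (a userspace model, not kernel code, and the bit layout is an assumption), would look like:

```c
#include <assert.h>
#include <errno.h>

struct model {
	unsigned char alert_mask;	/* gi->alert.mask: IAM restore value */
	unsigned char iam;		/* IAM field in the GISA */
	int in_alert_list;		/* gisa linked in the GIB alert list? */
	unsigned int ref_count[8];	/* per-GISC reference counts */
};

/* gisa_set_iam() analogue: -EBUSY while on the alert list. */
static int set_iam(struct model *m, unsigned char iam)
{
	if (m->in_alert_list)
		return -EBUSY;
	m->iam = iam;
	return 0;
}

/* kvm_s390_gisc_unregister() analogue: dropping the last reference to
 * a gisc clears its bit in alert.mask and attempts an IAM update. */
static int gisc_unregister(struct model *m, unsigned int gisc)
{
	if (gisc > 7 || !m->ref_count[gisc])
		return -EINVAL;
	if (--m->ref_count[gisc] == 0) {
		m->alert_mask &= ~(0x80 >> gisc);
		set_iam(m, m->alert_mask);	/* may fail with -EBUSY */
	}
	return 0;
}

/* Alert-list processing restores the IAM from the current alert.mask,
 * which already has the unregistered bit cleared. */
static void process_alert(struct model *m)
{
	m->in_alert_list = 0;
	m->iam = m->alert_mask;
}
```

The -EBUSY case is harmless here for the reason given above: once the alert list is processed, the restored IAM no longer contains the unregistered gisc.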

> 
> Now for this patch, my understanding is that you are basically cleaning up after a driver that did not handle their gisc refcounts properly, right?  Otherwise by the time you reach gisa_destroy the alert.mask would already be 0x00.  Then in that case, wouldn't you want to force the unregistration of any gisc still in the IAM at gisa_destroy time -- meaning shouldn't we do the equivalent gisa_set_iam that would have previously been done during gisc_unregister, had it been called properly?

The problem with the patch that I currently see is that it tries to 
solve two different bug scenarios that I have not sketched properly.

Yes, I want to do a cleanup a) for the situation that the alert.mask 
is NOT 0x00 during gisa destruction, which is caused by a missing 
kvm_s390_gisc_unregister() call, and that's a bug.

And b) for the situation that a gisa is still in the alert list during 
gisa destruction. That happened due to a bug in this case as well. The 
patch from Benjamin in devel, 88a096a7a460, accidentally switched off 
the GAL interrupt processing in the host. Thus kvm_s390_gisc_unregister() 
was called by the AP driver, but the alert was not processed (and never 
could have been processed). My code then ran into that endless loop, 
which is a bug as well:

  while (gisa_in_alert_list(gi->origin))
    cpu_relax();

> 
> For example, rather than just setting gi->alert.mask = 0x00 in kvm_s390_gisa_destroy(), what if you instead did:
> 1) issue the warning that gi->alert.mask was nonzero upon entry to gisa_destroy
> 2) perform the equivalent of calling kvm_s390_gisc_unregister() for every bit that is still on in the gi->alert.mask, performing the same actions as though the refcount were reaching 0 for each gisc (remove the bit from the alert mask, call gisa_set_iam).
> 3) Finally, process the alert list one more time if gisa_in_alert_list(gi->origin).  At this point, since we already set IAM to 0x00, millicode would have no further reason to deliver more alerts, so doing one last check should be safe.

yes, that should work

> 
> That would be the same chain of events (minus the warning) that would occur if a driver actually called kvm_s390_gisc_unregister() the correct number of times.  Of course you could also just collapse step #2 -- doing that gets us _very_ close to this patch; you could just set gi->alert.mask directly to 0x00 like you do here but then you would also need a gisa_set_iam call to tell millicode to stop sending alerts for all of the giscs you just removed from the alert.mask.  In either approach, the -EBUSY return from gisa_set_iam would be an issue that needs to be handled.

The -EBUSY does not need special treatment in this case, because it 
means the gisa is in the alert list and no additional alerts for the 
same gisa are triggered by the millicode, not even for other guest ISC 
bits. The IAM is cleared to 0x00 by millicode as well.

Either the gisa is not in the alert list, in which case the IAM was 
already set to 0x00 by the previous gisa_set_iam() and the gisa cannot 
be brought back into the alert list by millicode, or the gisa is in 
the alert list, in which case I call process_gib_alert_list() once; 
then the IAM is 0x00 (set by millicode) and will be restored to 0x00 
by gisa_vcpu_kicker()/gisa_get_ipm_or_restore_iam().

I don't want to run process_gib_alert_list() unconditionally because 
it would also touch other guests even when not required.
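That destroy-time case analysis can be condensed into a small model (again a sketch in userspace, not the kernel code; process_gib_alert_list() here only models the IAM-restore effect relevant to this discussion):

```c
#include <assert.h>

struct model {
	unsigned char alert_mask;	/* gi->alert.mask: IAM restore value */
	unsigned char iam;		/* IAM field in the GISA */
	int in_alert_list;		/* gisa linked in the GIB alert list? */
};

/* Model of the effect of process_gib_alert_list() on this gisa: take
 * it off the list and restore the IAM from alert.mask (the
 * gisa_get_ipm_or_restore_iam() path). */
static void process_gib_alert_list(struct model *m)
{
	m->in_alert_list = 0;
	m->iam = m->alert_mask;
}

/* Destroy path as in the patch: zero a leftover alert.mask (a missing
 * gisc_unregister is a bug), then process the alert list once instead
 * of busy-waiting. */
static void gisa_destroy(struct model *m)
{
	if (m->alert_mask)
		m->alert_mask = 0x00;
	if (m->in_alert_list)
		process_gib_alert_list(m);
}
```

In both cases the gisa ends up off the alert list with IAM 0x00, so millicode has no reason to re-link it, which is what makes the single pass safe.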

I will send a v3 today.

Thanks a lot!

> 
> Overall I guess until the IAM visible to millicode is set to 0x00 I'm not sure I understand what would prevent millicode from delivering another alert to any gisc still enabled in the IAM.  You say above it will be cmpxchg()'d during process_gib_alert_list() via gisa_get_ipm_or_restore_iam() but if we first check gisa_in_alert_list(gi->origin) with this new patch and the gisa is not yet in the alert list then we would skip the call to process_gib_alert_list() and instead just cancel the timer -- I could very well be misunderstanding something, but my concern is that you are shrinking but not eliminating the window here.  Let me try an example -- With this patch, isn't the following chain of events still possible:
> 
> 1) enter kvm_s390_gisa_destroy.  Let's say gi->alert.mask = 0x80.
> 2) set gi->alert.mask = 0x00
> 3) check if(gisa_in_alert_list(gi->origin)) -- it returns false
> 4) Since the IAM still had a bit on at this point, millicode now delivers an alert for the gisc associated with bit 0x80 and sets IAM to 0x00 to indicate the gisa in the alert list
> 5) call hrtimer_cancel (since we already checked gisa_in_alert_list, we don't notice that last alert delivered)
> 6) set gi->origin = NULL, return from kvm_s390_gisa_destroy
> 
> Assuming that series of events is possible, wouldn't a solution be to replace step #3 above with something along the lines of this (untested diff on top of this patch):
> 
> diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c
> index 06890a58d001..ab99c9ec1282 100644
> --- a/arch/s390/kvm/interrupt.c
> +++ b/arch/s390/kvm/interrupt.c
> @@ -3220,6 +3220,10 @@ void kvm_s390_gisa_destroy(struct kvm *kvm)
>                  KVM_EVENT(3, "vm 0x%pK has unexpected restore iam 0x%02x",
>                            kvm, gi->alert.mask);
>                  gi->alert.mask = 0x00;
> +               while (gisa_set_iam(gi->origin, gi->alert.mask)) {
> +                       KVM_EVENT(3, "vm 0x%pK alert while clearing iam", kvm);
> +                       process_gib_alert_list();
> +               }
>          }
>          if (gisa_in_alert_list(gi->origin)) {
>                  KVM_EVENT(3, "vm 0x%pK gisa in alert list during destroy", kvm);
> 
>>
>> in order to ensure an alert can't still be delivered some time after you check gisa_in_alert_list(gi->origin)?  That matches up with what is done per-gisc in kvm_s390_gisc_unregister() today.
>>
>> right
>>
>>>
>>> ...  That said, now that I'm looking closer at kvm_s390_gisc_unregister() and gisa_set_iam():  it seems strange that nobody checks the return code from gisa_set_iam today.  AFAICT, even if the device driver(s) call kvm_s390_gisc_unregister correctly for all associated gisc, if gisa_set_iam manages to return -EBUSY because the gisa is already in the alert list then wouldn't the gisc refcounts be decremented but the relevant alert bit left enabled for that gisc until the next time we call gisa_set_iam or gisa_get_ipm_or_restore_iam?
>>
>> you are right, that should be retried in kvm_s390_gisc_register() and kvm_s390_gisc_unregister() until the rc is 0, but that would lead to a CPU stall as well under the condition where GAL interrupts are not delivered in the host.
>>
>>>
>>> Similar strangeness for kvm_s390_gisc_register() - AFAICT if gisa_set_iam returns -EBUSY then we would increment the gisc refcounts but never actually enable the alert bit for that gisc until the next time we call gisa_set_iam or gisa_get_ipm_or_restore_iam.
>>
>> I have to think and play around with process_gib_alert_list() being called as well in these situations.
>>
>> BTW the pci and the vfio_ap device drivers currently also ignore the return codes of kvm_s390_gisc_unregister().
>>
> 
> Hmm, good point.  You're right, we should probably do something there.  I think the 3 reasons kvm_s390_gisc_unregister() could give a nonzero RC today would all be strange, likely implementation bugs...
> 
> -ENODEV we also would have never been able to register, or something odd happened to gisa after registration
> -ERANGE we also would have never been able to register, or the gisc got clobbered sometime after registration
> -EINVAL either we never registered, unregistered too many times or gisa was destroyed on us somehow
> 
> I think for these cases the best pci/ap can do would be to WARN_ON(_ONCE) and then proceed just assuming that the gisc was unregistered or never properly registered.
> 
> Thanks,
> Matt
diff mbox series

Patch

diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c
index 85e39f472bb4..06890a58d001 100644
--- a/arch/s390/kvm/interrupt.c
+++ b/arch/s390/kvm/interrupt.c
@@ -3216,11 +3216,15 @@  void kvm_s390_gisa_destroy(struct kvm *kvm)
 
 	if (!gi->origin)
 		return;
-	if (gi->alert.mask)
-		KVM_EVENT(3, "vm 0x%pK has unexpected iam 0x%02x",
+	if (gi->alert.mask) {
+		KVM_EVENT(3, "vm 0x%pK has unexpected restore iam 0x%02x",
 			  kvm, gi->alert.mask);
-	while (gisa_in_alert_list(gi->origin))
-		cpu_relax();
+		gi->alert.mask = 0x00;
+	}
+	if (gisa_in_alert_list(gi->origin)) {
+		KVM_EVENT(3, "vm 0x%pK gisa in alert list during destroy", kvm);
+		process_gib_alert_list();
+	}
 	hrtimer_cancel(&gi->timer);
 	gi->origin = NULL;
 	VM_EVENT(kvm, 3, "gisa 0x%pK destroyed", gisa);