[RFC,v3,2/2] KVM: s390: Extend the USER_SIGP capability

Message ID	20211110203322.1374925-3-farman@linux.ibm.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <kvm-owner@kernel.org> From: Eric Farman <farman@linux.ibm.com> To: Christian Borntraeger <borntraeger@de.ibm.com>, Janosch Frank <frankja@linux.ibm.com>, David Hildenbrand <david@redhat.com>, Claudio Imbrenda <imbrenda@linux.ibm.com>, Thomas Huth <thuth@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com>, Vasily Gorbik <gor@linux.ibm.com>, kvm@vger.kernel.org, linux-s390@vger.kernel.org, Eric Farman <farman@linux.ibm.com> Subject: [RFC PATCH v3 2/2] KVM: s390: Extend the USER_SIGP capability Date: Wed, 10 Nov 2021 21:33:22 +0100 Message-Id: <20211110203322.1374925-3-farman@linux.ibm.com> In-Reply-To: <20211110203322.1374925-1-farman@linux.ibm.com> References: <20211110203322.1374925-1-farman@linux.ibm.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	s390x: Improvements to SIGP handling [KVM] \| expand [RFC,v3,0/2] s390x: Improvements to SIGP handling [KVM] [RFC,v3,1/2] Capability/IOCTL/Documentation [RFC,v3,2/2] KVM: s390: Extend the USER_SIGP capability

Eric Farman Nov. 10, 2021, 8:33 p.m. UTC

With commit 2444b352c3ac ("KVM: s390: forward most SIGP orders to user
space") we have a capability that allows the "fast" SIGP orders (as
defined by the Programming Notes for the SIGNAL PROCESSOR instruction in
the Principles of Operation) to be handled in-kernel, while all others are
sent to userspace for processing.

This works fine but it creates a situation when, for example, a SIGP SENSE
might return CC1 (STATUS STORED, and status bits indicating the vcpu is
stopped), when in actuality userspace is still processing a SIGP STOP AND
STORE STATUS order, and the vcpu is not yet actually stopped. Thus, the
SIGP SENSE should actually be returning CC2 (busy) instead of CC1.

To fix this, add another CPU capability, dependent on the USER_SIGP one,
and two associated IOCTLs. One IOCTL will be used by userspace to mark a
vcpu "busy" processing a SIGP order, and cause concurrent orders handled
in-kernel to be returned with CC2 (busy). Another IOCTL will be used by
userspace to mark the SIGP "finished", and the vcpu free to process
additional orders.

Signed-off-by: Eric Farman <farman@linux.ibm.com>
---
 arch/s390/include/asm/kvm_host.h |  2 ++
 arch/s390/kvm/kvm-s390.c         | 29 +++++++++++++++++++++++++++++
 arch/s390/kvm/kvm-s390.h         | 16 ++++++++++++++++
 arch/s390/kvm/sigp.c             | 10 ++++++++++
 4 files changed, 57 insertions(+)

David Hildenbrand Nov. 11, 2021, 9:15 a.m. UTC | #1

On 10.11.21 21:33, Eric Farman wrote:
> With commit 2444b352c3ac ("KVM: s390: forward most SIGP orders to user
> space") we have a capability that allows the "fast" SIGP orders (as
> defined by the Programming Notes for the SIGNAL PROCESSOR instruction in
> the Principles of Operation) to be handled in-kernel, while all others are
> sent to userspace for processing.
> 
> This works fine but it creates a situation when, for example, a SIGP SENSE
> might return CC1 (STATUS STORED, and status bits indicating the vcpu is
> stopped), when in actuality userspace is still processing a SIGP STOP AND
> STORE STATUS order, and the vcpu is not yet actually stopped. Thus, the
> SIGP SENSE should actually be returning CC2 (busy) instead of CC1.
> 
> To fix this, add another CPU capability, dependent on the USER_SIGP one,
> and two associated IOCTLs. One IOCTL will be used by userspace to mark a
> vcpu "busy" processing a SIGP order, and cause concurrent orders handled
> in-kernel to be returned with CC2 (busy). Another IOCTL will be used by
> userspace to mark the SIGP "finished", and the vcpu free to process
> additional orders.
> 

This looks much cleaner to me, thanks!

[...]

> diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
> index c07a050d757d..54371cede485 100644
> --- a/arch/s390/kvm/kvm-s390.h
> +++ b/arch/s390/kvm/kvm-s390.h
> @@ -82,6 +82,22 @@ static inline int is_vcpu_idle(struct kvm_vcpu *vcpu)
>  	return test_bit(vcpu->vcpu_idx, vcpu->kvm->arch.idle_mask);
>  }
>  
> +static inline bool kvm_s390_vcpu_is_sigp_busy(struct kvm_vcpu *vcpu)
> +{
> +	return (atomic_read(&vcpu->arch.sigp_busy) == 1);

You can drop ()

> +}
> +
> +static inline bool kvm_s390_vcpu_set_sigp_busy(struct kvm_vcpu *vcpu)
> +{
> +	/* Return zero for success, or -EBUSY if another vcpu won */
> +	return (atomic_cmpxchg(&vcpu->arch.sigp_busy, 0, 1) == 0) ? 0 : -EBUSY;

You can drop () as well.

We might not need the -EBUSY semantics after all. User space can just
track if it was set, because it's in charge of setting it.

> +}
> +
> +static inline void kvm_s390_vcpu_clear_sigp_busy(struct kvm_vcpu *vcpu)
> +{
> +	atomic_set(&vcpu->arch.sigp_busy, 0);
> +}
> +
>  static inline int kvm_is_ucontrol(struct kvm *kvm)
>  {
>  #ifdef CONFIG_KVM_S390_UCONTROL
> diff --git a/arch/s390/kvm/sigp.c b/arch/s390/kvm/sigp.c
> index 5ad3fb4619f1..a37496ea6dfa 100644
> --- a/arch/s390/kvm/sigp.c
> +++ b/arch/s390/kvm/sigp.c
> @@ -276,6 +276,10 @@ static int handle_sigp_dst(struct kvm_vcpu *vcpu, u8 order_code,
>  	if (!dst_vcpu)
>  		return SIGP_CC_NOT_OPERATIONAL;
>  
> +	if (kvm_s390_vcpu_is_sigp_busy(dst_vcpu)) {
> +		return SIGP_CC_BUSY;
> +	}

You can drop {}

> +
>  	switch (order_code) {
>  	case SIGP_SENSE:
>  		vcpu->stat.instruction_sigp_sense++;
> @@ -411,6 +415,12 @@ int kvm_s390_handle_sigp(struct kvm_vcpu *vcpu)
>  	if (handle_sigp_order_in_user_space(vcpu, order_code, cpu_addr))
>  		return -EOPNOTSUPP;
>  
> +	/* Check the current vcpu, if it was a target from another vcpu */
> +	if (kvm_s390_vcpu_is_sigp_busy(vcpu)) {
> +		kvm_s390_set_psw_cc(vcpu, SIGP_CC_BUSY);
> +		return 0;
> +	}


I don't think we need this. I think the above (checking the target of a
SIGP order) is sufficient. Or which situation do you have in mind?



I do wonder if we want to make this a kvm_arch_vcpu_ioctl() instead,
essentially just providing a KVM_S390_SET_SIGP_BUSY *and* providing the
order. "order == 0" sets it to !busy. Not that we would need the value
right now, but who knows for what we might reuse that interface in the
future.

Thanks!

Eric Farman Nov. 11, 2021, 3:03 p.m. UTC | #2

On Thu, 2021-11-11 at 10:15 +0100, David Hildenbrand wrote:
> On 10.11.21 21:33, Eric Farman wrote:
> > With commit 2444b352c3ac ("KVM: s390: forward most SIGP orders to
> > user
> > space") we have a capability that allows the "fast" SIGP orders (as
> > defined by the Programming Notes for the SIGNAL PROCESSOR
> > instruction in
> > the Principles of Operation) to be handled in-kernel, while all
> > others are
> > sent to userspace for processing.
> > 
> > This works fine but it creates a situation when, for example, a
> > SIGP SENSE
> > might return CC1 (STATUS STORED, and status bits indicating the
> > vcpu is
> > stopped), when in actuality userspace is still processing a SIGP
> > STOP AND
> > STORE STATUS order, and the vcpu is not yet actually stopped. Thus,
> > the
> > SIGP SENSE should actually be returning CC2 (busy) instead of CC1.
> > 
> > To fix this, add another CPU capability, dependent on the USER_SIGP
> > one,
> > and two associated IOCTLs. One IOCTL will be used by userspace to
> > mark a
> > vcpu "busy" processing a SIGP order, and cause concurrent orders
> > handled
> > in-kernel to be returned with CC2 (busy). Another IOCTL will be
> > used by
> > userspace to mark the SIGP "finished", and the vcpu free to process
> > additional orders.
> > 
> 
> This looks much cleaner to me, thanks!
> 
> [...]
> 
> > diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
> > index c07a050d757d..54371cede485 100644
> > --- a/arch/s390/kvm/kvm-s390.h
> > +++ b/arch/s390/kvm/kvm-s390.h
> > @@ -82,6 +82,22 @@ static inline int is_vcpu_idle(struct kvm_vcpu
> > *vcpu)
> >  	return test_bit(vcpu->vcpu_idx, vcpu->kvm->arch.idle_mask);
> >  }
> >  
> > +static inline bool kvm_s390_vcpu_is_sigp_busy(struct kvm_vcpu
> > *vcpu)
> > +{
> > +	return (atomic_read(&vcpu->arch.sigp_busy) == 1);
> 
> You can drop ()
> 
> > +}
> > +
> > +static inline bool kvm_s390_vcpu_set_sigp_busy(struct kvm_vcpu
> > *vcpu)
> > +{
> > +	/* Return zero for success, or -EBUSY if another vcpu won */
> > +	return (atomic_cmpxchg(&vcpu->arch.sigp_busy, 0, 1) == 0) ? 0 :
> > -EBUSY;
> 
> You can drop () as well.
> 
> We might not need the -EBUSY semantics after all. User space can just
> track if it was set, because it's in charge of setting it.

Hrm, I added this to distinguish a newer kernel with an older QEMU, but
of course an older QEMU won't know the difference either. I'll
doublecheck that this is works fine in the different permutations.

> 
> > +}
> > +
> > +static inline void kvm_s390_vcpu_clear_sigp_busy(struct kvm_vcpu
> > *vcpu)
> > +{
> > +	atomic_set(&vcpu->arch.sigp_busy, 0);
> > +}
> > +
> >  static inline int kvm_is_ucontrol(struct kvm *kvm)
> >  {
> >  #ifdef CONFIG_KVM_S390_UCONTROL
> > diff --git a/arch/s390/kvm/sigp.c b/arch/s390/kvm/sigp.c
> > index 5ad3fb4619f1..a37496ea6dfa 100644
> > --- a/arch/s390/kvm/sigp.c
> > +++ b/arch/s390/kvm/sigp.c
> > @@ -276,6 +276,10 @@ static int handle_sigp_dst(struct kvm_vcpu
> > *vcpu, u8 order_code,
> >  	if (!dst_vcpu)
> >  		return SIGP_CC_NOT_OPERATIONAL;
> >  
> > +	if (kvm_s390_vcpu_is_sigp_busy(dst_vcpu)) {
> > +		return SIGP_CC_BUSY;
> > +	}
> 
> You can drop {}

Arg, I had some debug in there which needed the braces, and of course
it's unnecessary now. Thanks.

> 
> > +
> >  	switch (order_code) {
> >  	case SIGP_SENSE:
> >  		vcpu->stat.instruction_sigp_sense++;
> > @@ -411,6 +415,12 @@ int kvm_s390_handle_sigp(struct kvm_vcpu
> > *vcpu)
> >  	if (handle_sigp_order_in_user_space(vcpu, order_code,
> > cpu_addr))
> >  		return -EOPNOTSUPP;
> >  
> > +	/* Check the current vcpu, if it was a target from another vcpu
> > */
> > +	if (kvm_s390_vcpu_is_sigp_busy(vcpu)) {
> > +		kvm_s390_set_psw_cc(vcpu, SIGP_CC_BUSY);
> > +		return 0;
> > +	}
> 
> I don't think we need this. I think the above (checking the target of
> a
> SIGP order) is sufficient. Or which situation do you have in mind?
> 

Hrm... I think you're right. I was thinking of this:

VCPU 1 - SIGP STOP CPU 2
VCPU 2 - SIGP SENSE CPU 1

But of course either CPU2 is going to be marked "busy" first, and the
sense doesn't get processed until it's reset, or the sense arrives
first, and the busy/notbusy doesn't matter. Let me doublecheck my tests
for the non-RFC version.

> 
> 
> I do wonder if we want to make this a kvm_arch_vcpu_ioctl() instead,

In one of my original attempts between v1 and v2, I had put this there.
This reliably deadlocks my guest, because the caller (kvm_vcpu_ioctl())
tries to acquire vcpu->mutex, and racing SIGPs (via KVM_RUN) might
already be holding it. Thus, it's an async ioctl. I could fold it into
the existing interrupt ioctl, but as those are architected structs it
seems more natural do it this way. Or I have mis-understood something
along the way?

> essentially just providing a KVM_S390_SET_SIGP_BUSY *and* providing
> the
> order. "order == 0" sets it to !busy. 

I'd tried this too, since it provided some nice debug-ability.
Unfortunately, I have a testcase (which I'll eventually get folded into
kvm-unit-tests :)) that picks a random order between 0-255, knowing
that there's only a couple handfuls of valid orders, to check the
response. Zero is valid architecturally (POPS figure 4-29), even if
it's unassigned. The likelihood of it becoming assigned is probably
quite low, but I'm not sure that I like special-casing an order of zero
in this way.

> Not that we would need the value
> right now, but who knows for what we might reuse that interface in
> the
> future.
> 
> Thanks!
>

Janosch Frank Nov. 11, 2021, 4:13 p.m. UTC | #3

On 11/11/21 16:03, Eric Farman wrote:
> On Thu, 2021-11-11 at 10:15 +0100, David Hildenbrand wrote:
>> On 10.11.21 21:33, Eric Farman wrote:
>>> With commit 2444b352c3ac ("KVM: s390: forward most SIGP orders to
>>> user
>>> space") we have a capability that allows the "fast" SIGP orders (as
>>> defined by the Programming Notes for the SIGNAL PROCESSOR
>>> instruction in
>>> the Principles of Operation) to be handled in-kernel, while all
>>> others are
>>> sent to userspace for processing.
>>>
>>> This works fine but it creates a situation when, for example, a
>>> SIGP SENSE
>>> might return CC1 (STATUS STORED, and status bits indicating the
>>> vcpu is
>>> stopped), when in actuality userspace is still processing a SIGP
>>> STOP AND
>>> STORE STATUS order, and the vcpu is not yet actually stopped. Thus,
>>> the
>>> SIGP SENSE should actually be returning CC2 (busy) instead of CC1.
>>>
>>> To fix this, add another CPU capability, dependent on the USER_SIGP
>>> one,
>>> and two associated IOCTLs. One IOCTL will be used by userspace to
>>> mark a
>>> vcpu "busy" processing a SIGP order, and cause concurrent orders
>>> handled
>>> in-kernel to be returned with CC2 (busy). Another IOCTL will be
>>> used by
>>> userspace to mark the SIGP "finished", and the vcpu free to process
>>> additional orders.
>>>
>>
>> This looks much cleaner to me, thanks!
>>
>> [...]
>>
>>> diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
>>> index c07a050d757d..54371cede485 100644
>>> --- a/arch/s390/kvm/kvm-s390.h
>>> +++ b/arch/s390/kvm/kvm-s390.h
>>> @@ -82,6 +82,22 @@ static inline int is_vcpu_idle(struct kvm_vcpu
>>> *vcpu)
>>>   	return test_bit(vcpu->vcpu_idx, vcpu->kvm->arch.idle_mask);
>>>   }
>>>   
>>> +static inline bool kvm_s390_vcpu_is_sigp_busy(struct kvm_vcpu
>>> *vcpu)
>>> +{
>>> +	return (atomic_read(&vcpu->arch.sigp_busy) == 1);
>>
>> You can drop ()
>>
>>> +}
>>> +
>>> +static inline bool kvm_s390_vcpu_set_sigp_busy(struct kvm_vcpu
>>> *vcpu)
>>> +{
>>> +	/* Return zero for success, or -EBUSY if another vcpu won */
>>> +	return (atomic_cmpxchg(&vcpu->arch.sigp_busy, 0, 1) == 0) ? 0 :
>>> -EBUSY;
>>
>> You can drop () as well.
>>
>> We might not need the -EBUSY semantics after all. User space can just
>> track if it was set, because it's in charge of setting it.
> 
> Hrm, I added this to distinguish a newer kernel with an older QEMU, but
> of course an older QEMU won't know the difference either. I'll
> doublecheck that this is works fine in the different permutations.
> 
>>
>>> +}
>>> +
>>> +static inline void kvm_s390_vcpu_clear_sigp_busy(struct kvm_vcpu
>>> *vcpu)
>>> +{
>>> +	atomic_set(&vcpu->arch.sigp_busy, 0);
>>> +}
>>> +
>>>   static inline int kvm_is_ucontrol(struct kvm *kvm)
>>>   {
>>>   #ifdef CONFIG_KVM_S390_UCONTROL
>>> diff --git a/arch/s390/kvm/sigp.c b/arch/s390/kvm/sigp.c
>>> index 5ad3fb4619f1..a37496ea6dfa 100644
>>> --- a/arch/s390/kvm/sigp.c
>>> +++ b/arch/s390/kvm/sigp.c
>>> @@ -276,6 +276,10 @@ static int handle_sigp_dst(struct kvm_vcpu
>>> *vcpu, u8 order_code,
>>>   	if (!dst_vcpu)
>>>   		return SIGP_CC_NOT_OPERATIONAL;
>>>   
>>> +	if (kvm_s390_vcpu_is_sigp_busy(dst_vcpu)) {
>>> +		return SIGP_CC_BUSY;
>>> +	}
>>
>> You can drop {}
> 
> Arg, I had some debug in there which needed the braces, and of course
> it's unnecessary now. Thanks.
> 
>>
>>> +
>>>   	switch (order_code) {
>>>   	case SIGP_SENSE:
>>>   		vcpu->stat.instruction_sigp_sense++;
>>> @@ -411,6 +415,12 @@ int kvm_s390_handle_sigp(struct kvm_vcpu
>>> *vcpu)
>>>   	if (handle_sigp_order_in_user_space(vcpu, order_code,
>>> cpu_addr))
>>>   		return -EOPNOTSUPP;
>>>   
>>> +	/* Check the current vcpu, if it was a target from another vcpu
>>> */
>>> +	if (kvm_s390_vcpu_is_sigp_busy(vcpu)) {
>>> +		kvm_s390_set_psw_cc(vcpu, SIGP_CC_BUSY);
>>> +		return 0;
>>> +	}
>>
>> I don't think we need this. I think the above (checking the target of
>> a
>> SIGP order) is sufficient. Or which situation do you have in mind?
>>
> 
> Hrm... I think you're right. I was thinking of this:
> 
> VCPU 1 - SIGP STOP CPU 2
> VCPU 2 - SIGP SENSE CPU 1
> 
> But of course either CPU2 is going to be marked "busy" first, and the
> sense doesn't get processed until it's reset, or the sense arrives
> first, and the busy/notbusy doesn't matter. Let me doublecheck my tests
> for the non-RFC version.
> 
>>
>>
>> I do wonder if we want to make this a kvm_arch_vcpu_ioctl() instead,
> 
> In one of my original attempts between v1 and v2, I had put this there.
> This reliably deadlocks my guest, because the caller (kvm_vcpu_ioctl())
> tries to acquire vcpu->mutex, and racing SIGPs (via KVM_RUN) might
> already be holding it. Thus, it's an async ioctl. I could fold it into
> the existing interrupt ioctl, but as those are architected structs it
> seems more natural do it this way. Or I have mis-understood something
> along the way?
> 
>> essentially just providing a KVM_S390_SET_SIGP_BUSY *and* providing
>> the
>> order. "order == 0" sets it to !busy.
> 
> I'd tried this too, since it provided some nice debug-ability.
> Unfortunately, I have a testcase (which I'll eventually get folded into
> kvm-unit-tests :)) that picks a random order between 0-255, knowing
> that there's only a couple handfuls of valid orders, to check the
> response. Zero is valid architecturally (POPS figure 4-29), even if
> it's unassigned. The likelihood of it becoming assigned is probably
> quite low, but I'm not sure that I like special-casing an order of zero
> in this way.
> 

Looking at the API I'd like to avoid having two IOCTLs and I'd love to 
see some way to extend this without the need for a whole new IOCTL.



>> Not that we would need the value
>> right now, but who knows for what we might reuse that interface in
>> the
>> future.
>>
>> Thanks!
>>
>

Janosch Frank Nov. 11, 2021, 4:16 p.m. UTC | #4

On 11/10/21 21:33, Eric Farman wrote:
> With commit 2444b352c3ac ("KVM: s390: forward most SIGP orders to user
> space") we have a capability that allows the "fast" SIGP orders (as
> defined by the Programming Notes for the SIGNAL PROCESSOR instruction in
> the Principles of Operation) to be handled in-kernel, while all others are
> sent to userspace for processing.
> 
> This works fine but it creates a situation when, for example, a SIGP SENSE
> might return CC1 (STATUS STORED, and status bits indicating the vcpu is
> stopped), when in actuality userspace is still processing a SIGP STOP AND
> STORE STATUS order, and the vcpu is not yet actually stopped. Thus, the
> SIGP SENSE should actually be returning CC2 (busy) instead of CC1.
> 
> To fix this, add another CPU capability, dependent on the USER_SIGP one,
> and two associated IOCTLs. One IOCTL will be used by userspace to mark a
> vcpu "busy" processing a SIGP order, and cause concurrent orders handled
> in-kernel to be returned with CC2 (busy). Another IOCTL will be used by
> userspace to mark the SIGP "finished", and the vcpu free to process
> additional orders.
> 
> Signed-off-by: Eric Farman <farman@linux.ibm.com>
> ---
>   arch/s390/include/asm/kvm_host.h |  2 ++
>   arch/s390/kvm/kvm-s390.c         | 29 +++++++++++++++++++++++++++++
>   arch/s390/kvm/kvm-s390.h         | 16 ++++++++++++++++
>   arch/s390/kvm/sigp.c             | 10 ++++++++++
>   4 files changed, 57 insertions(+)
> 
> diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
> index a604d51acfc8..c93271557de3 100644
> --- a/arch/s390/include/asm/kvm_host.h
> +++ b/arch/s390/include/asm/kvm_host.h
> @@ -748,6 +748,7 @@ struct kvm_vcpu_arch {
>   	bool skey_enabled;
>   	struct kvm_s390_pv_vcpu pv;
>   	union diag318_info diag318_info;
> +	atomic_t sigp_busy;
>   };
>   
>   struct kvm_vm_stat {
> @@ -941,6 +942,7 @@ struct kvm_arch{
>   	int user_sigp;
>   	int user_stsi;
>   	int user_instr0;
> +	int user_sigp_busy;
>   	struct s390_io_adapter *adapters[MAX_S390_IO_ADAPTERS];
>   	wait_queue_head_t ipte_wq;
>   	int ipte_lock_count;
> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index 5f52e7eec02f..06d188dd2c89 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -564,6 +564,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>   	case KVM_CAP_S390_VCPU_RESETS:
>   	case KVM_CAP_SET_GUEST_DEBUG:
>   	case KVM_CAP_S390_DIAG318:
> +	case KVM_CAP_S390_USER_SIGP_BUSY:
>   		r = 1;
>   		break;
>   	case KVM_CAP_SET_GUEST_DEBUG2:
> @@ -706,6 +707,15 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
>   		kvm->arch.user_sigp = 1;
>   		r = 0;
>   		break;
> +	case KVM_CAP_S390_USER_SIGP_BUSY:
> +		r = -EINVAL;
> +		if (kvm->arch.user_sigp) {
> +			kvm->arch.user_sigp_busy = 1;
> +			r = 0;
> +		}
> +		VM_EVENT(kvm, 3, "ENABLE: CAP_S390_USER_SIGP_BUSY %s",
> +			 r ? "(not available)" : "(success)");
> +		break;
>   	case KVM_CAP_S390_VECTOR_REGISTERS:
>   		mutex_lock(&kvm->lock);
>   		if (kvm->created_vcpus) {
> @@ -4825,6 +4835,25 @@ long kvm_arch_vcpu_async_ioctl(struct file *filp,
>   			return -EINVAL;
>   		return kvm_s390_inject_vcpu(vcpu, &s390irq);
>   	}
> +	case KVM_S390_VCPU_SET_SIGP_BUSY: {
> +		int rc;
> +
> +		if (!vcpu->kvm->arch.user_sigp_busy)
> +			return -EFAULT;

Huh?
This should be EINVAL, no?

> +
> +		rc = kvm_s390_vcpu_set_sigp_busy(vcpu);
> +		VCPU_EVENT(vcpu, 3, "SIGP: CPU %x set busy rc %x", vcpu->vcpu_id, rc);
> +
> +		return rc;
> +	}
> +	case KVM_S390_VCPU_RESET_SIGP_BUSY: {
> +		if (!vcpu->kvm->arch.user_sigp_busy)
> +			return -EFAULT;

Same

> +
> +		VCPU_EVENT(vcpu, 3, "SIGP: CPU %x reset busy", vcpu->vcpu_id);
> +		kvm_s390_vcpu_clear_sigp_busy(vcpu);
> +		return 0;
> +	}
>   	}
>   	return -ENOIOCTLCMD;
>   }
> diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
> index c07a050d757d..54371cede485 100644
> --- a/arch/s390/kvm/kvm-s390.h
> +++ b/arch/s390/kvm/kvm-s390.h
> @@ -82,6 +82,22 @@ static inline int is_vcpu_idle(struct kvm_vcpu *vcpu)
>   	return test_bit(vcpu->vcpu_idx, vcpu->kvm->arch.idle_mask);
>   }
>   
> +static inline bool kvm_s390_vcpu_is_sigp_busy(struct kvm_vcpu *vcpu)
> +{
> +	return (atomic_read(&vcpu->arch.sigp_busy) == 1);
> +}
> +
> +static inline bool kvm_s390_vcpu_set_sigp_busy(struct kvm_vcpu *vcpu)
> +{
> +	/* Return zero for success, or -EBUSY if another vcpu won */
> +	return (atomic_cmpxchg(&vcpu->arch.sigp_busy, 0, 1) == 0) ? 0 : -EBUSY;
> +}
> +
> +static inline void kvm_s390_vcpu_clear_sigp_busy(struct kvm_vcpu *vcpu)
> +{
> +	atomic_set(&vcpu->arch.sigp_busy, 0);
> +}
> +
>   static inline int kvm_is_ucontrol(struct kvm *kvm)
>   {
>   #ifdef CONFIG_KVM_S390_UCONTROL
> diff --git a/arch/s390/kvm/sigp.c b/arch/s390/kvm/sigp.c
> index 5ad3fb4619f1..a37496ea6dfa 100644
> --- a/arch/s390/kvm/sigp.c
> +++ b/arch/s390/kvm/sigp.c
> @@ -276,6 +276,10 @@ static int handle_sigp_dst(struct kvm_vcpu *vcpu, u8 order_code,
>   	if (!dst_vcpu)
>   		return SIGP_CC_NOT_OPERATIONAL;
>   
> +	if (kvm_s390_vcpu_is_sigp_busy(dst_vcpu)) {
> +		return SIGP_CC_BUSY;
> +	}
> +
>   	switch (order_code) {
>   	case SIGP_SENSE:
>   		vcpu->stat.instruction_sigp_sense++;
> @@ -411,6 +415,12 @@ int kvm_s390_handle_sigp(struct kvm_vcpu *vcpu)
>   	if (handle_sigp_order_in_user_space(vcpu, order_code, cpu_addr))
>   		return -EOPNOTSUPP;
>   
> +	/* Check the current vcpu, if it was a target from another vcpu */
> +	if (kvm_s390_vcpu_is_sigp_busy(vcpu)) {
> +		kvm_s390_set_psw_cc(vcpu, SIGP_CC_BUSY);
> +		return 0;
> +	}
> +
>   	if (r1 % 2)
>   		parameter = vcpu->run->s.regs.gprs[r1];
>   	else
>

Eric Farman Nov. 11, 2021, 5:48 p.m. UTC | #5

On Thu, 2021-11-11 at 17:13 +0100, Janosch Frank wrote:
> On 11/11/21 16:03, Eric Farman wrote:
> > On Thu, 2021-11-11 at 10:15 +0100, David Hildenbrand wrote:
> > > On 10.11.21 21:33, Eric Farman wrote:
> > > > With commit 2444b352c3ac ("KVM: s390: forward most SIGP orders
> > > > to
> > > > user
> > > > space") we have a capability that allows the "fast" SIGP orders
> > > > (as
> > > > defined by the Programming Notes for the SIGNAL PROCESSOR
> > > > instruction in
> > > > the Principles of Operation) to be handled in-kernel, while all
> > > > others are
> > > > sent to userspace for processing.
> > > > 
> > > > This works fine but it creates a situation when, for example, a
> > > > SIGP SENSE
> > > > might return CC1 (STATUS STORED, and status bits indicating the
> > > > vcpu is
> > > > stopped), when in actuality userspace is still processing a
> > > > SIGP
> > > > STOP AND
> > > > STORE STATUS order, and the vcpu is not yet actually stopped.
> > > > Thus,
> > > > the
> > > > SIGP SENSE should actually be returning CC2 (busy) instead of
> > > > CC1.
> > > > 
> > > > To fix this, add another CPU capability, dependent on the
> > > > USER_SIGP
> > > > one,
> > > > and two associated IOCTLs. One IOCTL will be used by userspace
> > > > to
> > > > mark a
> > > > vcpu "busy" processing a SIGP order, and cause concurrent
> > > > orders
> > > > handled
> > > > in-kernel to be returned with CC2 (busy). Another IOCTL will be
> > > > used by
> > > > userspace to mark the SIGP "finished", and the vcpu free to
> > > > process
> > > > additional orders.
> > > > 
> > > 
> > > This looks much cleaner to me, thanks!
> > > 
> > > [...]
> > > 
> > > > diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-
> > > > s390.h
> > > > index c07a050d757d..54371cede485 100644
> > > > --- a/arch/s390/kvm/kvm-s390.h
> > > > +++ b/arch/s390/kvm/kvm-s390.h
> > > > @@ -82,6 +82,22 @@ static inline int is_vcpu_idle(struct
> > > > kvm_vcpu
> > > > *vcpu)
> > > >   	return test_bit(vcpu->vcpu_idx, vcpu->kvm-
> > > > >arch.idle_mask);
> > > >   }
> > > >   
> > > > +static inline bool kvm_s390_vcpu_is_sigp_busy(struct kvm_vcpu
> > > > *vcpu)
> > > > +{
> > > > +	return (atomic_read(&vcpu->arch.sigp_busy) == 1);
> > > 
> > > You can drop ()
> > > 
> > > > +}
> > > > +
> > > > +static inline bool kvm_s390_vcpu_set_sigp_busy(struct kvm_vcpu
> > > > *vcpu)
> > > > +{
> > > > +	/* Return zero for success, or -EBUSY if another vcpu
> > > > won */
> > > > +	return (atomic_cmpxchg(&vcpu->arch.sigp_busy, 0, 1) ==
> > > > 0) ? 0 :
> > > > -EBUSY;
> > > 
> > > You can drop () as well.
> > > 
> > > We might not need the -EBUSY semantics after all. User space can
> > > just
> > > track if it was set, because it's in charge of setting it.
> > 
> > Hrm, I added this to distinguish a newer kernel with an older QEMU,
> > but
> > of course an older QEMU won't know the difference either. I'll
> > doublecheck that this is works fine in the different permutations.
> > 
> > > > +}
> > > > +
> > > > +static inline void kvm_s390_vcpu_clear_sigp_busy(struct
> > > > kvm_vcpu
> > > > *vcpu)
> > > > +{
> > > > +	atomic_set(&vcpu->arch.sigp_busy, 0);
> > > > +}
> > > > +
> > > >   static inline int kvm_is_ucontrol(struct kvm *kvm)
> > > >   {
> > > >   #ifdef CONFIG_KVM_S390_UCONTROL
> > > > diff --git a/arch/s390/kvm/sigp.c b/arch/s390/kvm/sigp.c
> > > > index 5ad3fb4619f1..a37496ea6dfa 100644
> > > > --- a/arch/s390/kvm/sigp.c
> > > > +++ b/arch/s390/kvm/sigp.c
> > > > @@ -276,6 +276,10 @@ static int handle_sigp_dst(struct kvm_vcpu
> > > > *vcpu, u8 order_code,
> > > >   	if (!dst_vcpu)
> > > >   		return SIGP_CC_NOT_OPERATIONAL;
> > > >   
> > > > +	if (kvm_s390_vcpu_is_sigp_busy(dst_vcpu)) {
> > > > +		return SIGP_CC_BUSY;
> > > > +	}
> > > 
> > > You can drop {}
> > 
> > Arg, I had some debug in there which needed the braces, and of
> > course
> > it's unnecessary now. Thanks.
> > 
> > > > +
> > > >   	switch (order_code) {
> > > >   	case SIGP_SENSE:
> > > >   		vcpu->stat.instruction_sigp_sense++;
> > > > @@ -411,6 +415,12 @@ int kvm_s390_handle_sigp(struct kvm_vcpu
> > > > *vcpu)
> > > >   	if (handle_sigp_order_in_user_space(vcpu, order_code,
> > > > cpu_addr))
> > > >   		return -EOPNOTSUPP;
> > > >   
> > > > +	/* Check the current vcpu, if it was a target from
> > > > another vcpu
> > > > */
> > > > +	if (kvm_s390_vcpu_is_sigp_busy(vcpu)) {
> > > > +		kvm_s390_set_psw_cc(vcpu, SIGP_CC_BUSY);
> > > > +		return 0;
> > > > +	}
> > > 
> > > I don't think we need this. I think the above (checking the
> > > target of
> > > a
> > > SIGP order) is sufficient. Or which situation do you have in
> > > mind?
> > > 
> > 
> > Hrm... I think you're right. I was thinking of this:
> > 
> > VCPU 1 - SIGP STOP CPU 2
> > VCPU 2 - SIGP SENSE CPU 1
> > 
> > But of course either CPU2 is going to be marked "busy" first, and
> > the
> > sense doesn't get processed until it's reset, or the sense arrives
> > first, and the busy/notbusy doesn't matter. Let me doublecheck my
> > tests
> > for the non-RFC version.
> > 
> > > 
> > > I do wonder if we want to make this a kvm_arch_vcpu_ioctl()
> > > instead,
> > 
> > In one of my original attempts between v1 and v2, I had put this
> > there.
> > This reliably deadlocks my guest, because the caller
> > (kvm_vcpu_ioctl())
> > tries to acquire vcpu->mutex, and racing SIGPs (via KVM_RUN) might
> > already be holding it. Thus, it's an async ioctl. I could fold it
> > into
> > the existing interrupt ioctl, but as those are architected structs
> > it
> > seems more natural do it this way. Or I have mis-understood
> > something
> > along the way?
> > 
> > > essentially just providing a KVM_S390_SET_SIGP_BUSY *and*
> > > providing
> > > the
> > > order. "order == 0" sets it to !busy.
> > 
> > I'd tried this too, since it provided some nice debug-ability.
> > Unfortunately, I have a testcase (which I'll eventually get folded
> > into
> > kvm-unit-tests :)) that picks a random order between 0-255, knowing
> > that there's only a couple handfuls of valid orders, to check the
> > response. Zero is valid architecturally (POPS figure 4-29), even if
> > it's unassigned. The likelihood of it becoming assigned is probably
> > quite low, but I'm not sure that I like special-casing an order of
> > zero
> > in this way.
> > 
> 
> Looking at the API I'd like to avoid having two IOCTLs 

Since the order is a single byte, we could have the payload of an ioctl
say "0-255 is an order that we're busy processing, anything higher than
that resets the busy" or something. That would remove the need for a
second IOCTL.

> and I'd love to 
> see some way to extend this without the need for a whole new IOCTL.
> 

Do you mean zero IOCTLs? Because... I think the only way we can do that
is to get rid of USER_SIGP altogether, and handle everything in-kernel. 
Some weeks ago I played with QEMU not enabling USER_SIGP, but I can't
say I've tried it since we went down this "mark a vcpu busy" path. If I
do that busy/not-busy tagging in the kernel for !USER_SIGP, that might
not be a bad thing anyway. But I don't know how we get the behavior
straightened out for USER_SIGP without some type of handshake.

> 
> 
> > > Not that we would need the value
> > > right now, but who knows for what we might reuse that interface
> > > in
> > > the
> > > future.
> > > 
> > > Thanks!
> > >

Eric Farman Nov. 11, 2021, 5:50 p.m. UTC | #6

On Thu, 2021-11-11 at 17:16 +0100, Janosch Frank wrote:
> On 11/10/21 21:33, Eric Farman wrote:

...snip...

> > +	case KVM_S390_VCPU_SET_SIGP_BUSY: {
> > +		int rc;
> > +
> > +		if (!vcpu->kvm->arch.user_sigp_busy)
> > +			return -EFAULT;
> 
> Huh?
> This should be EINVAL, no?

Of course; my mistake.

> 
> > +
> > +		rc = kvm_s390_vcpu_set_sigp_busy(vcpu);
> > +		VCPU_EVENT(vcpu, 3, "SIGP: CPU %x set busy rc %x",
> > vcpu->vcpu_id, rc);
> > +
> > +		return rc;
> > +	}
> > +	case KVM_S390_VCPU_RESET_SIGP_BUSY: {
> > +		if (!vcpu->kvm->arch.user_sigp_busy)
> > +			return -EFAULT;
> 
> Same
> 
>

David Hildenbrand Nov. 11, 2021, 6:29 p.m. UTC | #7

On 11.11.21 18:48, Eric Farman wrote:
> On Thu, 2021-11-11 at 17:13 +0100, Janosch Frank wrote:
>> On 11/11/21 16:03, Eric Farman wrote:
>>> On Thu, 2021-11-11 at 10:15 +0100, David Hildenbrand wrote:
>>>> On 10.11.21 21:33, Eric Farman wrote:
>>>>> With commit 2444b352c3ac ("KVM: s390: forward most SIGP orders
>>>>> to
>>>>> user
>>>>> space") we have a capability that allows the "fast" SIGP orders
>>>>> (as
>>>>> defined by the Programming Notes for the SIGNAL PROCESSOR
>>>>> instruction in
>>>>> the Principles of Operation) to be handled in-kernel, while all
>>>>> others are
>>>>> sent to userspace for processing.
>>>>>
>>>>> This works fine but it creates a situation when, for example, a
>>>>> SIGP SENSE
>>>>> might return CC1 (STATUS STORED, and status bits indicating the
>>>>> vcpu is
>>>>> stopped), when in actuality userspace is still processing a
>>>>> SIGP
>>>>> STOP AND
>>>>> STORE STATUS order, and the vcpu is not yet actually stopped.
>>>>> Thus,
>>>>> the
>>>>> SIGP SENSE should actually be returning CC2 (busy) instead of
>>>>> CC1.
>>>>>
>>>>> To fix this, add another CPU capability, dependent on the
>>>>> USER_SIGP
>>>>> one,
>>>>> and two associated IOCTLs. One IOCTL will be used by userspace
>>>>> to
>>>>> mark a
>>>>> vcpu "busy" processing a SIGP order, and cause concurrent
>>>>> orders
>>>>> handled
>>>>> in-kernel to be returned with CC2 (busy). Another IOCTL will be
>>>>> used by
>>>>> userspace to mark the SIGP "finished", and the vcpu free to
>>>>> process
>>>>> additional orders.
>>>>>
>>>>
>>>> This looks much cleaner to me, thanks!
>>>>
>>>> [...]
>>>>
>>>>> diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-
>>>>> s390.h
>>>>> index c07a050d757d..54371cede485 100644
>>>>> --- a/arch/s390/kvm/kvm-s390.h
>>>>> +++ b/arch/s390/kvm/kvm-s390.h
>>>>> @@ -82,6 +82,22 @@ static inline int is_vcpu_idle(struct
>>>>> kvm_vcpu
>>>>> *vcpu)
>>>>>   	return test_bit(vcpu->vcpu_idx, vcpu->kvm-
>>>>>> arch.idle_mask);
>>>>>   }
>>>>>   
>>>>> +static inline bool kvm_s390_vcpu_is_sigp_busy(struct kvm_vcpu
>>>>> *vcpu)
>>>>> +{
>>>>> +	return (atomic_read(&vcpu->arch.sigp_busy) == 1);
>>>>
>>>> You can drop ()
>>>>
>>>>> +}
>>>>> +
>>>>> +static inline bool kvm_s390_vcpu_set_sigp_busy(struct kvm_vcpu
>>>>> *vcpu)
>>>>> +{
>>>>> +	/* Return zero for success, or -EBUSY if another vcpu
>>>>> won */
>>>>> +	return (atomic_cmpxchg(&vcpu->arch.sigp_busy, 0, 1) ==
>>>>> 0) ? 0 :
>>>>> -EBUSY;
>>>>
>>>> You can drop () as well.
>>>>
>>>> We might not need the -EBUSY semantics after all. User space can
>>>> just
>>>> track if it was set, because it's in charge of setting it.
>>>
>>> Hrm, I added this to distinguish a newer kernel with an older QEMU,
>>> but
>>> of course an older QEMU won't know the difference either. I'll
>>> doublecheck that this is works fine in the different permutations.
>>>
>>>>> +}
>>>>> +
>>>>> +static inline void kvm_s390_vcpu_clear_sigp_busy(struct
>>>>> kvm_vcpu
>>>>> *vcpu)
>>>>> +{
>>>>> +	atomic_set(&vcpu->arch.sigp_busy, 0);
>>>>> +}
>>>>> +
>>>>>   static inline int kvm_is_ucontrol(struct kvm *kvm)
>>>>>   {
>>>>>   #ifdef CONFIG_KVM_S390_UCONTROL
>>>>> diff --git a/arch/s390/kvm/sigp.c b/arch/s390/kvm/sigp.c
>>>>> index 5ad3fb4619f1..a37496ea6dfa 100644
>>>>> --- a/arch/s390/kvm/sigp.c
>>>>> +++ b/arch/s390/kvm/sigp.c
>>>>> @@ -276,6 +276,10 @@ static int handle_sigp_dst(struct kvm_vcpu
>>>>> *vcpu, u8 order_code,
>>>>>   	if (!dst_vcpu)
>>>>>   		return SIGP_CC_NOT_OPERATIONAL;
>>>>>   
>>>>> +	if (kvm_s390_vcpu_is_sigp_busy(dst_vcpu)) {
>>>>> +		return SIGP_CC_BUSY;
>>>>> +	}
>>>>
>>>> You can drop {}
>>>
>>> Arg, I had some debug in there which needed the braces, and of
>>> course
>>> it's unnecessary now. Thanks.
>>>
>>>>> +
>>>>>   	switch (order_code) {
>>>>>   	case SIGP_SENSE:
>>>>>   		vcpu->stat.instruction_sigp_sense++;
>>>>> @@ -411,6 +415,12 @@ int kvm_s390_handle_sigp(struct kvm_vcpu
>>>>> *vcpu)
>>>>>   	if (handle_sigp_order_in_user_space(vcpu, order_code,
>>>>> cpu_addr))
>>>>>   		return -EOPNOTSUPP;
>>>>>   
>>>>> +	/* Check the current vcpu, if it was a target from
>>>>> another vcpu
>>>>> */
>>>>> +	if (kvm_s390_vcpu_is_sigp_busy(vcpu)) {
>>>>> +		kvm_s390_set_psw_cc(vcpu, SIGP_CC_BUSY);
>>>>> +		return 0;
>>>>> +	}
>>>>
>>>> I don't think we need this. I think the above (checking the
>>>> target of
>>>> a
>>>> SIGP order) is sufficient. Or which situation do you have in
>>>> mind?
>>>>
>>>
>>> Hrm... I think you're right. I was thinking of this:
>>>
>>> VCPU 1 - SIGP STOP CPU 2
>>> VCPU 2 - SIGP SENSE CPU 1
>>>
>>> But of course either CPU2 is going to be marked "busy" first, and
>>> the
>>> sense doesn't get processed until it's reset, or the sense arrives
>>> first, and the busy/notbusy doesn't matter. Let me doublecheck my
>>> tests
>>> for the non-RFC version.
>>>
>>>>
>>>> I do wonder if we want to make this a kvm_arch_vcpu_ioctl()
>>>> instead,
>>>
>>> In one of my original attempts between v1 and v2, I had put this
>>> there.
>>> This reliably deadlocks my guest, because the caller
>>> (kvm_vcpu_ioctl())
>>> tries to acquire vcpu->mutex, and racing SIGPs (via KVM_RUN) might
>>> already be holding it. Thus, it's an async ioctl. I could fold it
>>> into
>>> the existing interrupt ioctl, but as those are architected structs
>>> it
>>> seems more natural do it this way. Or I have mis-understood
>>> something
>>> along the way?
>>>
>>>> essentially just providing a KVM_S390_SET_SIGP_BUSY *and*
>>>> providing
>>>> the
>>>> order. "order == 0" sets it to !busy.
>>>
>>> I'd tried this too, since it provided some nice debug-ability.
>>> Unfortunately, I have a testcase (which I'll eventually get folded
>>> into
>>> kvm-unit-tests :)) that picks a random order between 0-255, knowing
>>> that there's only a couple handfuls of valid orders, to check the
>>> response. Zero is valid architecturally (POPS figure 4-29), even if
>>> it's unassigned. The likelihood of it becoming assigned is probably
>>> quite low, but I'm not sure that I like special-casing an order of
>>> zero
>>> in this way.
>>>
>>
>> Looking at the API I'd like to avoid having two IOCTLs 
> 
> Since the order is a single byte, we could have the payload of an ioctl
> say "0-255 is an order that we're busy processing, anything higher than
> that resets the busy" or something. That would remove the need for a
> second IOCTL.

Maybe just pass an int and treat a negative (or just -1) value as
clearing the order.

Eric Farman Nov. 11, 2021, 7:05 p.m. UTC | #8

On Thu, 2021-11-11 at 19:29 +0100, David Hildenbrand wrote:
> On 11.11.21 18:48, Eric Farman wrote:
> > On Thu, 2021-11-11 at 17:13 +0100, Janosch Frank wrote:
> > > On 11/11/21 16:03, Eric Farman wrote:
> > > > On Thu, 2021-11-11 at 10:15 +0100, David Hildenbrand wrote:
> > > > > On 10.11.21 21:33, Eric Farman wrote:
> > > > > > With commit 2444b352c3ac ("KVM: s390: forward most SIGP
> > > > > > orders
> > > > > > to
> > > > > > user
> > > > > > space") we have a capability that allows the "fast" SIGP
> > > > > > orders
> > > > > > (as
> > > > > > defined by the Programming Notes for the SIGNAL PROCESSOR
> > > > > > instruction in
> > > > > > the Principles of Operation) to be handled in-kernel, while
> > > > > > all
> > > > > > others are
> > > > > > sent to userspace for processing.
> > > > > > 
> > > > > > This works fine but it creates a situation when, for
> > > > > > example, a
> > > > > > SIGP SENSE
> > > > > > might return CC1 (STATUS STORED, and status bits indicating
> > > > > > the
> > > > > > vcpu is
> > > > > > stopped), when in actuality userspace is still processing a
> > > > > > SIGP
> > > > > > STOP AND
> > > > > > STORE STATUS order, and the vcpu is not yet actually
> > > > > > stopped.
> > > > > > Thus,
> > > > > > the
> > > > > > SIGP SENSE should actually be returning CC2 (busy) instead
> > > > > > of
> > > > > > CC1.
> > > > > > 
> > > > > > To fix this, add another CPU capability, dependent on the
> > > > > > USER_SIGP
> > > > > > one,
> > > > > > and two associated IOCTLs. One IOCTL will be used by
> > > > > > userspace
> > > > > > to
> > > > > > mark a
> > > > > > vcpu "busy" processing a SIGP order, and cause concurrent
> > > > > > orders
> > > > > > handled
> > > > > > in-kernel to be returned with CC2 (busy). Another IOCTL
> > > > > > will be
> > > > > > used by
> > > > > > userspace to mark the SIGP "finished", and the vcpu free to
> > > > > > process
> > > > > > additional orders.
> > > > > > 
> > > > > 
> > > > > This looks much cleaner to me, thanks!
> > > > > 
> > > > > [...]
> > > > > 
> > > > > > diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-
> > > > > > s390.h
> > > > > > index c07a050d757d..54371cede485 100644
> > > > > > --- a/arch/s390/kvm/kvm-s390.h
> > > > > > +++ b/arch/s390/kvm/kvm-s390.h
> > > > > > @@ -82,6 +82,22 @@ static inline int is_vcpu_idle(struct
> > > > > > kvm_vcpu
> > > > > > *vcpu)
> > > > > >   	return test_bit(vcpu->vcpu_idx, vcpu->kvm-
> > > > > > > arch.idle_mask);
> > > > > >   }
> > > > > >   
> > > > > > +static inline bool kvm_s390_vcpu_is_sigp_busy(struct
> > > > > > kvm_vcpu
> > > > > > *vcpu)
> > > > > > +{
> > > > > > +	return (atomic_read(&vcpu->arch.sigp_busy) == 1);
> > > > > 
> > > > > You can drop ()
> > > > > 
> > > > > > +}
> > > > > > +
> > > > > > +static inline bool kvm_s390_vcpu_set_sigp_busy(struct
> > > > > > kvm_vcpu
> > > > > > *vcpu)
> > > > > > +{
> > > > > > +	/* Return zero for success, or -EBUSY if another vcpu
> > > > > > won */
> > > > > > +	return (atomic_cmpxchg(&vcpu->arch.sigp_busy, 0, 1) ==
> > > > > > 0) ? 0 :
> > > > > > -EBUSY;
> > > > > 
> > > > > You can drop () as well.
> > > > > 
> > > > > We might not need the -EBUSY semantics after all. User space
> > > > > can
> > > > > just
> > > > > track if it was set, because it's in charge of setting it.
> > > > 
> > > > Hrm, I added this to distinguish a newer kernel with an older
> > > > QEMU,
> > > > but
> > > > of course an older QEMU won't know the difference either. I'll
> > > > doublecheck that this is works fine in the different
> > > > permutations.
> > > > 
> > > > > > +}
> > > > > > +
> > > > > > +static inline void kvm_s390_vcpu_clear_sigp_busy(struct
> > > > > > kvm_vcpu
> > > > > > *vcpu)
> > > > > > +{
> > > > > > +	atomic_set(&vcpu->arch.sigp_busy, 0);
> > > > > > +}
> > > > > > +
> > > > > >   static inline int kvm_is_ucontrol(struct kvm *kvm)
> > > > > >   {
> > > > > >   #ifdef CONFIG_KVM_S390_UCONTROL
> > > > > > diff --git a/arch/s390/kvm/sigp.c b/arch/s390/kvm/sigp.c
> > > > > > index 5ad3fb4619f1..a37496ea6dfa 100644
> > > > > > --- a/arch/s390/kvm/sigp.c
> > > > > > +++ b/arch/s390/kvm/sigp.c
> > > > > > @@ -276,6 +276,10 @@ static int handle_sigp_dst(struct
> > > > > > kvm_vcpu
> > > > > > *vcpu, u8 order_code,
> > > > > >   	if (!dst_vcpu)
> > > > > >   		return SIGP_CC_NOT_OPERATIONAL;
> > > > > >   
> > > > > > +	if (kvm_s390_vcpu_is_sigp_busy(dst_vcpu)) {
> > > > > > +		return SIGP_CC_BUSY;
> > > > > > +	}
> > > > > 
> > > > > You can drop {}
> > > > 
> > > > Arg, I had some debug in there which needed the braces, and of
> > > > course
> > > > it's unnecessary now. Thanks.
> > > > 
> > > > > > +
> > > > > >   	switch (order_code) {
> > > > > >   	case SIGP_SENSE:
> > > > > >   		vcpu->stat.instruction_sigp_sense++;
> > > > > > @@ -411,6 +415,12 @@ int kvm_s390_handle_sigp(struct
> > > > > > kvm_vcpu
> > > > > > *vcpu)
> > > > > >   	if (handle_sigp_order_in_user_space(vcpu, order_code,
> > > > > > cpu_addr))
> > > > > >   		return -EOPNOTSUPP;
> > > > > >   
> > > > > > +	/* Check the current vcpu, if it was a target from
> > > > > > another vcpu
> > > > > > */
> > > > > > +	if (kvm_s390_vcpu_is_sigp_busy(vcpu)) {
> > > > > > +		kvm_s390_set_psw_cc(vcpu, SIGP_CC_BUSY);
> > > > > > +		return 0;
> > > > > > +	}
> > > > > 
> > > > > I don't think we need this. I think the above (checking the
> > > > > target of
> > > > > a
> > > > > SIGP order) is sufficient. Or which situation do you have in
> > > > > mind?
> > > > > 
> > > > 
> > > > Hrm... I think you're right. I was thinking of this:
> > > > 
> > > > VCPU 1 - SIGP STOP CPU 2
> > > > VCPU 2 - SIGP SENSE CPU 1
> > > > 
> > > > But of course either CPU2 is going to be marked "busy" first,
> > > > and
> > > > the
> > > > sense doesn't get processed until it's reset, or the sense
> > > > arrives
> > > > first, and the busy/notbusy doesn't matter. Let me doublecheck
> > > > my
> > > > tests
> > > > for the non-RFC version.
> > > > 
> > > > > I do wonder if we want to make this a kvm_arch_vcpu_ioctl()
> > > > > instead,
> > > > 
> > > > In one of my original attempts between v1 and v2, I had put
> > > > this
> > > > there.
> > > > This reliably deadlocks my guest, because the caller
> > > > (kvm_vcpu_ioctl())
> > > > tries to acquire vcpu->mutex, and racing SIGPs (via KVM_RUN)
> > > > might
> > > > already be holding it. Thus, it's an async ioctl. I could fold
> > > > it
> > > > into
> > > > the existing interrupt ioctl, but as those are architected
> > > > structs
> > > > it
> > > > seems more natural do it this way. Or I have mis-understood
> > > > something
> > > > along the way?
> > > > 
> > > > > essentially just providing a KVM_S390_SET_SIGP_BUSY *and*
> > > > > providing
> > > > > the
> > > > > order. "order == 0" sets it to !busy.
> > > > 
> > > > I'd tried this too, since it provided some nice debug-ability.
> > > > Unfortunately, I have a testcase (which I'll eventually get
> > > > folded
> > > > into
> > > > kvm-unit-tests :)) that picks a random order between 0-255,
> > > > knowing
> > > > that there's only a couple handfuls of valid orders, to check
> > > > the
> > > > response. Zero is valid architecturally (POPS figure 4-29),
> > > > even if
> > > > it's unassigned. The likelihood of it becoming assigned is
> > > > probably
> > > > quite low, but I'm not sure that I like special-casing an order
> > > > of
> > > > zero
> > > > in this way.
> > > > 
> > > 
> > > Looking at the API I'd like to avoid having two IOCTLs 
> > 
> > Since the order is a single byte, we could have the payload of an
> > ioctl
> > say "0-255 is an order that we're busy processing, anything higher
> > than
> > that resets the busy" or something. That would remove the need for
> > a
> > second IOCTL.
> 
> Maybe just pass an int and treat a negative (or just -1) value as
> clearing the order.
> 

Right, that's exactly what I had at one point. I thought it was too
cumbersome, but maybe not. Will dust it off, pending my question to
Janosch about 0-vs-1 IOCTLs.

David Hildenbrand Nov. 11, 2021, 7:15 p.m. UTC | #9

On 11.11.21 20:05, Eric Farman wrote:
> On Thu, 2021-11-11 at 19:29 +0100, David Hildenbrand wrote:
>> On 11.11.21 18:48, Eric Farman wrote:
>>> On Thu, 2021-11-11 at 17:13 +0100, Janosch Frank wrote:
>>>> On 11/11/21 16:03, Eric Farman wrote:
>>>>> On Thu, 2021-11-11 at 10:15 +0100, David Hildenbrand wrote:
>>>>>> On 10.11.21 21:33, Eric Farman wrote:
>>>>>>> With commit 2444b352c3ac ("KVM: s390: forward most SIGP
>>>>>>> orders
>>>>>>> to
>>>>>>> user
>>>>>>> space") we have a capability that allows the "fast" SIGP
>>>>>>> orders
>>>>>>> (as
>>>>>>> defined by the Programming Notes for the SIGNAL PROCESSOR
>>>>>>> instruction in
>>>>>>> the Principles of Operation) to be handled in-kernel, while
>>>>>>> all
>>>>>>> others are
>>>>>>> sent to userspace for processing.
>>>>>>>
>>>>>>> This works fine but it creates a situation when, for
>>>>>>> example, a
>>>>>>> SIGP SENSE
>>>>>>> might return CC1 (STATUS STORED, and status bits indicating
>>>>>>> the
>>>>>>> vcpu is
>>>>>>> stopped), when in actuality userspace is still processing a
>>>>>>> SIGP
>>>>>>> STOP AND
>>>>>>> STORE STATUS order, and the vcpu is not yet actually
>>>>>>> stopped.
>>>>>>> Thus,
>>>>>>> the
>>>>>>> SIGP SENSE should actually be returning CC2 (busy) instead
>>>>>>> of
>>>>>>> CC1.
>>>>>>>
>>>>>>> To fix this, add another CPU capability, dependent on the
>>>>>>> USER_SIGP
>>>>>>> one,
>>>>>>> and two associated IOCTLs. One IOCTL will be used by
>>>>>>> userspace
>>>>>>> to
>>>>>>> mark a
>>>>>>> vcpu "busy" processing a SIGP order, and cause concurrent
>>>>>>> orders
>>>>>>> handled
>>>>>>> in-kernel to be returned with CC2 (busy). Another IOCTL
>>>>>>> will be
>>>>>>> used by
>>>>>>> userspace to mark the SIGP "finished", and the vcpu free to
>>>>>>> process
>>>>>>> additional orders.
>>>>>>>
>>>>>>
>>>>>> This looks much cleaner to me, thanks!
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>>> diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-
>>>>>>> s390.h
>>>>>>> index c07a050d757d..54371cede485 100644
>>>>>>> --- a/arch/s390/kvm/kvm-s390.h
>>>>>>> +++ b/arch/s390/kvm/kvm-s390.h
>>>>>>> @@ -82,6 +82,22 @@ static inline int is_vcpu_idle(struct
>>>>>>> kvm_vcpu
>>>>>>> *vcpu)
>>>>>>>   	return test_bit(vcpu->vcpu_idx, vcpu->kvm-
>>>>>>>> arch.idle_mask);
>>>>>>>   }
>>>>>>>   
>>>>>>> +static inline bool kvm_s390_vcpu_is_sigp_busy(struct
>>>>>>> kvm_vcpu
>>>>>>> *vcpu)
>>>>>>> +{
>>>>>>> +	return (atomic_read(&vcpu->arch.sigp_busy) == 1);
>>>>>>
>>>>>> You can drop ()
>>>>>>
>>>>>>> +}
>>>>>>> +
>>>>>>> +static inline bool kvm_s390_vcpu_set_sigp_busy(struct
>>>>>>> kvm_vcpu
>>>>>>> *vcpu)
>>>>>>> +{
>>>>>>> +	/* Return zero for success, or -EBUSY if another vcpu
>>>>>>> won */
>>>>>>> +	return (atomic_cmpxchg(&vcpu->arch.sigp_busy, 0, 1) ==
>>>>>>> 0) ? 0 :
>>>>>>> -EBUSY;
>>>>>>
>>>>>> You can drop () as well.
>>>>>>
>>>>>> We might not need the -EBUSY semantics after all. User space
>>>>>> can
>>>>>> just
>>>>>> track if it was set, because it's in charge of setting it.
>>>>>
>>>>> Hrm, I added this to distinguish a newer kernel with an older
>>>>> QEMU,
>>>>> but
>>>>> of course an older QEMU won't know the difference either. I'll
>>>>> doublecheck that this is works fine in the different
>>>>> permutations.
>>>>>
>>>>>>> +}
>>>>>>> +
>>>>>>> +static inline void kvm_s390_vcpu_clear_sigp_busy(struct
>>>>>>> kvm_vcpu
>>>>>>> *vcpu)
>>>>>>> +{
>>>>>>> +	atomic_set(&vcpu->arch.sigp_busy, 0);
>>>>>>> +}
>>>>>>> +
>>>>>>>   static inline int kvm_is_ucontrol(struct kvm *kvm)
>>>>>>>   {
>>>>>>>   #ifdef CONFIG_KVM_S390_UCONTROL
>>>>>>> diff --git a/arch/s390/kvm/sigp.c b/arch/s390/kvm/sigp.c
>>>>>>> index 5ad3fb4619f1..a37496ea6dfa 100644
>>>>>>> --- a/arch/s390/kvm/sigp.c
>>>>>>> +++ b/arch/s390/kvm/sigp.c
>>>>>>> @@ -276,6 +276,10 @@ static int handle_sigp_dst(struct
>>>>>>> kvm_vcpu
>>>>>>> *vcpu, u8 order_code,
>>>>>>>   	if (!dst_vcpu)
>>>>>>>   		return SIGP_CC_NOT_OPERATIONAL;
>>>>>>>   
>>>>>>> +	if (kvm_s390_vcpu_is_sigp_busy(dst_vcpu)) {
>>>>>>> +		return SIGP_CC_BUSY;
>>>>>>> +	}
>>>>>>
>>>>>> You can drop {}
>>>>>
>>>>> Arg, I had some debug in there which needed the braces, and of
>>>>> course
>>>>> it's unnecessary now. Thanks.
>>>>>
>>>>>>> +
>>>>>>>   	switch (order_code) {
>>>>>>>   	case SIGP_SENSE:
>>>>>>>   		vcpu->stat.instruction_sigp_sense++;
>>>>>>> @@ -411,6 +415,12 @@ int kvm_s390_handle_sigp(struct
>>>>>>> kvm_vcpu
>>>>>>> *vcpu)
>>>>>>>   	if (handle_sigp_order_in_user_space(vcpu, order_code,
>>>>>>> cpu_addr))
>>>>>>>   		return -EOPNOTSUPP;
>>>>>>>   
>>>>>>> +	/* Check the current vcpu, if it was a target from
>>>>>>> another vcpu
>>>>>>> */
>>>>>>> +	if (kvm_s390_vcpu_is_sigp_busy(vcpu)) {
>>>>>>> +		kvm_s390_set_psw_cc(vcpu, SIGP_CC_BUSY);
>>>>>>> +		return 0;
>>>>>>> +	}
>>>>>>
>>>>>> I don't think we need this. I think the above (checking the
>>>>>> target of
>>>>>> a
>>>>>> SIGP order) is sufficient. Or which situation do you have in
>>>>>> mind?
>>>>>>
>>>>>
>>>>> Hrm... I think you're right. I was thinking of this:
>>>>>
>>>>> VCPU 1 - SIGP STOP CPU 2
>>>>> VCPU 2 - SIGP SENSE CPU 1
>>>>>
>>>>> But of course either CPU2 is going to be marked "busy" first,
>>>>> and
>>>>> the
>>>>> sense doesn't get processed until it's reset, or the sense
>>>>> arrives
>>>>> first, and the busy/notbusy doesn't matter. Let me doublecheck
>>>>> my
>>>>> tests
>>>>> for the non-RFC version.
>>>>>
>>>>>> I do wonder if we want to make this a kvm_arch_vcpu_ioctl()
>>>>>> instead,
>>>>>
>>>>> In one of my original attempts between v1 and v2, I had put
>>>>> this
>>>>> there.
>>>>> This reliably deadlocks my guest, because the caller
>>>>> (kvm_vcpu_ioctl())
>>>>> tries to acquire vcpu->mutex, and racing SIGPs (via KVM_RUN)
>>>>> might
>>>>> already be holding it. Thus, it's an async ioctl. I could fold
>>>>> it
>>>>> into
>>>>> the existing interrupt ioctl, but as those are architected
>>>>> structs
>>>>> it
>>>>> seems more natural do it this way. Or I have mis-understood
>>>>> something
>>>>> along the way?
>>>>>
>>>>>> essentially just providing a KVM_S390_SET_SIGP_BUSY *and*
>>>>>> providing
>>>>>> the
>>>>>> order. "order == 0" sets it to !busy.
>>>>>
>>>>> I'd tried this too, since it provided some nice debug-ability.
>>>>> Unfortunately, I have a testcase (which I'll eventually get
>>>>> folded
>>>>> into
>>>>> kvm-unit-tests :)) that picks a random order between 0-255,
>>>>> knowing
>>>>> that there's only a couple handfuls of valid orders, to check
>>>>> the
>>>>> response. Zero is valid architecturally (POPS figure 4-29),
>>>>> even if
>>>>> it's unassigned. The likelihood of it becoming assigned is
>>>>> probably
>>>>> quite low, but I'm not sure that I like special-casing an order
>>>>> of
>>>>> zero
>>>>> in this way.
>>>>>
>>>>
>>>> Looking at the API I'd like to avoid having two IOCTLs 
>>>
>>> Since the order is a single byte, we could have the payload of an
>>> ioctl
>>> say "0-255 is an order that we're busy processing, anything higher
>>> than
>>> that resets the busy" or something. That would remove the need for
>>> a
>>> second IOCTL.
>>
>> Maybe just pass an int and treat a negative (or just -1) value as
>> clearing the order.
>>
> 
> Right, that's exactly what I had at one point. I thought it was too
> cumbersome, but maybe not. Will dust it off, pending my question to
> Janosch about 0-vs-1 IOCTLs.

As we really only care about the SIGP STOP case, you could theoretically
bury it into kvm_arch_vcpu_ioctl_set_mpstate(), using a new state
"KVM_MP_STATE_STOPPING".

Eric Farman Nov. 11, 2021, 7:44 p.m. UTC | #10

On Thu, 2021-11-11 at 20:15 +0100, David Hildenbrand wrote:
> On 11.11.21 20:05, Eric Farman wrote:
> > On Thu, 2021-11-11 at 19:29 +0100, David Hildenbrand wrote:
> > > On 11.11.21 18:48, Eric Farman wrote:
> > > > On Thu, 2021-11-11 at 17:13 +0100, Janosch Frank wrote:
> > > > > On 11/11/21 16:03, Eric Farman wrote:
> > > > > > On Thu, 2021-11-11 at 10:15 +0100, David Hildenbrand wrote:
> > > > > > > On 10.11.21 21:33, Eric Farman wrote:
> > > > > > > > With commit 2444b352c3ac ("KVM: s390: forward most SIGP
> > > > > > > > orders
> > > > > > > > to
> > > > > > > > user
> > > > > > > > space") we have a capability that allows the "fast"
> > > > > > > > SIGP
> > > > > > > > orders
> > > > > > > > (as
> > > > > > > > defined by the Programming Notes for the SIGNAL
> > > > > > > > PROCESSOR
> > > > > > > > instruction in
> > > > > > > > the Principles of Operation) to be handled in-kernel,
> > > > > > > > while
> > > > > > > > all
> > > > > > > > others are
> > > > > > > > sent to userspace for processing.
> > > > > > > > 
> > > > > > > > This works fine but it creates a situation when, for
> > > > > > > > example, a
> > > > > > > > SIGP SENSE
> > > > > > > > might return CC1 (STATUS STORED, and status bits
> > > > > > > > indicating
> > > > > > > > the
> > > > > > > > vcpu is
> > > > > > > > stopped), when in actuality userspace is still
> > > > > > > > processing a
> > > > > > > > SIGP
> > > > > > > > STOP AND
> > > > > > > > STORE STATUS order, and the vcpu is not yet actually
> > > > > > > > stopped.
> > > > > > > > Thus,
> > > > > > > > the
> > > > > > > > SIGP SENSE should actually be returning CC2 (busy)
> > > > > > > > instead
> > > > > > > > of
> > > > > > > > CC1.
> > > > > > > > 
> > > > > > > > To fix this, add another CPU capability, dependent on
> > > > > > > > the
> > > > > > > > USER_SIGP
> > > > > > > > one,
> > > > > > > > and two associated IOCTLs. One IOCTL will be used by
> > > > > > > > userspace
> > > > > > > > to
> > > > > > > > mark a
> > > > > > > > vcpu "busy" processing a SIGP order, and cause
> > > > > > > > concurrent
> > > > > > > > orders
> > > > > > > > handled
> > > > > > > > in-kernel to be returned with CC2 (busy). Another IOCTL
> > > > > > > > will be
> > > > > > > > used by
> > > > > > > > userspace to mark the SIGP "finished", and the vcpu
> > > > > > > > free to
> > > > > > > > process
> > > > > > > > additional orders.
> > > > > > > > 
> > > > > > > 
> > > > > > > This looks much cleaner to me, thanks!
> > > > > > > 
> > > > > > > [...]
> > > > > > > 
> > > > > > > > diff --git a/arch/s390/kvm/kvm-s390.h
> > > > > > > > b/arch/s390/kvm/kvm-
> > > > > > > > s390.h
> > > > > > > > index c07a050d757d..54371cede485 100644
> > > > > > > > --- a/arch/s390/kvm/kvm-s390.h
> > > > > > > > +++ b/arch/s390/kvm/kvm-s390.h
> > > > > > > > @@ -82,6 +82,22 @@ static inline int
> > > > > > > > is_vcpu_idle(struct
> > > > > > > > kvm_vcpu
> > > > > > > > *vcpu)
> > > > > > > >   	return test_bit(vcpu->vcpu_idx, vcpu->kvm-
> > > > > > > > > arch.idle_mask);
> > > > > > > >   }
> > > > > > > >   
> > > > > > > > +static inline bool kvm_s390_vcpu_is_sigp_busy(struct
> > > > > > > > kvm_vcpu
> > > > > > > > *vcpu)
> > > > > > > > +{
> > > > > > > > +	return (atomic_read(&vcpu->arch.sigp_busy) ==
> > > > > > > > 1);
> > > > > > > 
> > > > > > > You can drop ()
> > > > > > > 
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +static inline bool kvm_s390_vcpu_set_sigp_busy(struct
> > > > > > > > kvm_vcpu
> > > > > > > > *vcpu)
> > > > > > > > +{
> > > > > > > > +	/* Return zero for success, or -EBUSY if
> > > > > > > > another vcpu
> > > > > > > > won */
> > > > > > > > +	return (atomic_cmpxchg(&vcpu->arch.sigp_busy,
> > > > > > > > 0, 1) ==
> > > > > > > > 0) ? 0 :
> > > > > > > > -EBUSY;
> > > > > > > 
> > > > > > > You can drop () as well.
> > > > > > > 
> > > > > > > We might not need the -EBUSY semantics after all. User
> > > > > > > space
> > > > > > > can
> > > > > > > just
> > > > > > > track if it was set, because it's in charge of setting
> > > > > > > it.
> > > > > > 
> > > > > > Hrm, I added this to distinguish a newer kernel with an
> > > > > > older
> > > > > > QEMU,
> > > > > > but
> > > > > > of course an older QEMU won't know the difference either.
> > > > > > I'll
> > > > > > doublecheck that this is works fine in the different
> > > > > > permutations.
> > > > > > 
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +static inline void
> > > > > > > > kvm_s390_vcpu_clear_sigp_busy(struct
> > > > > > > > kvm_vcpu
> > > > > > > > *vcpu)
> > > > > > > > +{
> > > > > > > > +	atomic_set(&vcpu->arch.sigp_busy, 0);
> > > > > > > > +}
> > > > > > > > +
> > > > > > > >   static inline int kvm_is_ucontrol(struct kvm *kvm)
> > > > > > > >   {
> > > > > > > >   #ifdef CONFIG_KVM_S390_UCONTROL
> > > > > > > > diff --git a/arch/s390/kvm/sigp.c
> > > > > > > > b/arch/s390/kvm/sigp.c
> > > > > > > > index 5ad3fb4619f1..a37496ea6dfa 100644
> > > > > > > > --- a/arch/s390/kvm/sigp.c
> > > > > > > > +++ b/arch/s390/kvm/sigp.c
> > > > > > > > @@ -276,6 +276,10 @@ static int handle_sigp_dst(struct
> > > > > > > > kvm_vcpu
> > > > > > > > *vcpu, u8 order_code,
> > > > > > > >   	if (!dst_vcpu)
> > > > > > > >   		return SIGP_CC_NOT_OPERATIONAL;
> > > > > > > >   
> > > > > > > > +	if (kvm_s390_vcpu_is_sigp_busy(dst_vcpu)) {
> > > > > > > > +		return SIGP_CC_BUSY;
> > > > > > > > +	}
> > > > > > > 
> > > > > > > You can drop {}
> > > > > > 
> > > > > > Arg, I had some debug in there which needed the braces, and
> > > > > > of
> > > > > > course
> > > > > > it's unnecessary now. Thanks.
> > > > > > 
> > > > > > > > +
> > > > > > > >   	switch (order_code) {
> > > > > > > >   	case SIGP_SENSE:
> > > > > > > >   		vcpu->stat.instruction_sigp_sense++;
> > > > > > > > @@ -411,6 +415,12 @@ int kvm_s390_handle_sigp(struct
> > > > > > > > kvm_vcpu
> > > > > > > > *vcpu)
> > > > > > > >   	if (handle_sigp_order_in_user_space(vcpu,
> > > > > > > > order_code,
> > > > > > > > cpu_addr))
> > > > > > > >   		return -EOPNOTSUPP;
> > > > > > > >   
> > > > > > > > +	/* Check the current vcpu, if it was a target
> > > > > > > > from
> > > > > > > > another vcpu
> > > > > > > > */
> > > > > > > > +	if (kvm_s390_vcpu_is_sigp_busy(vcpu)) {
> > > > > > > > +		kvm_s390_set_psw_cc(vcpu,
> > > > > > > > SIGP_CC_BUSY);
> > > > > > > > +		return 0;
> > > > > > > > +	}
> > > > > > > 
> > > > > > > I don't think we need this. I think the above (checking
> > > > > > > the
> > > > > > > target of
> > > > > > > a
> > > > > > > SIGP order) is sufficient. Or which situation do you have
> > > > > > > in
> > > > > > > mind?
> > > > > > > 
> > > > > > 
> > > > > > Hrm... I think you're right. I was thinking of this:
> > > > > > 
> > > > > > VCPU 1 - SIGP STOP CPU 2
> > > > > > VCPU 2 - SIGP SENSE CPU 1
> > > > > > 
> > > > > > But of course either CPU2 is going to be marked "busy"
> > > > > > first,
> > > > > > and
> > > > > > the
> > > > > > sense doesn't get processed until it's reset, or the sense
> > > > > > arrives
> > > > > > first, and the busy/notbusy doesn't matter. Let me
> > > > > > doublecheck
> > > > > > my
> > > > > > tests
> > > > > > for the non-RFC version.
> > > > > > 
> > > > > > > I do wonder if we want to make this a
> > > > > > > kvm_arch_vcpu_ioctl()
> > > > > > > instead,
> > > > > > 
> > > > > > In one of my original attempts between v1 and v2, I had put
> > > > > > this
> > > > > > there.
> > > > > > This reliably deadlocks my guest, because the caller
> > > > > > (kvm_vcpu_ioctl())
> > > > > > tries to acquire vcpu->mutex, and racing SIGPs (via
> > > > > > KVM_RUN)
> > > > > > might
> > > > > > already be holding it. Thus, it's an async ioctl. I could
> > > > > > fold
> > > > > > it
> > > > > > into
> > > > > > the existing interrupt ioctl, but as those are architected
> > > > > > structs
> > > > > > it
> > > > > > seems more natural do it this way. Or I have mis-understood
> > > > > > something
> > > > > > along the way?
> > > > > > 
> > > > > > > essentially just providing a KVM_S390_SET_SIGP_BUSY *and*
> > > > > > > providing
> > > > > > > the
> > > > > > > order. "order == 0" sets it to !busy.
> > > > > > 
> > > > > > I'd tried this too, since it provided some nice debug-
> > > > > > ability.
> > > > > > Unfortunately, I have a testcase (which I'll eventually get
> > > > > > folded
> > > > > > into
> > > > > > kvm-unit-tests :)) that picks a random order between 0-255,
> > > > > > knowing
> > > > > > that there's only a couple handfuls of valid orders, to
> > > > > > check
> > > > > > the
> > > > > > response. Zero is valid architecturally (POPS figure 4-29),
> > > > > > even if
> > > > > > it's unassigned. The likelihood of it becoming assigned is
> > > > > > probably
> > > > > > quite low, but I'm not sure that I like special-casing an
> > > > > > order
> > > > > > of
> > > > > > zero
> > > > > > in this way.
> > > > > > 
> > > > > 
> > > > > Looking at the API I'd like to avoid having two IOCTLs 
> > > > 
> > > > Since the order is a single byte, we could have the payload of
> > > > an
> > > > ioctl
> > > > say "0-255 is an order that we're busy processing, anything
> > > > higher
> > > > than
> > > > that resets the busy" or something. That would remove the need
> > > > for
> > > > a
> > > > second IOCTL.
> > > 
> > > Maybe just pass an int and treat a negative (or just -1) value as
> > > clearing the order.
> > > 
> > 
> > Right, that's exactly what I had at one point. I thought it was too
> > cumbersome, but maybe not. Will dust it off, pending my question to
> > Janosch about 0-vs-1 IOCTLs.
> 
> As we really only care about the SIGP STOP case, 

Is that really true? SIGP RESTART does an inject back to KVM, ditto the
(INITIAL) CPU RESET orders. It's true that SIGP SENSE is getting
tripped up on whether a vcpu is actually stopped or not, but I believe
that SIGP SENSE saying "everything's fine" when a vcpu is still busy processing an order isn't great either.

> you could theoretically
> bury it into kvm_arch_vcpu_ioctl_set_mpstate(), using a new state
> "KVM_MP_STATE_STOPPING".
>

Janosch Frank Nov. 12, 2021, 8:49 a.m. UTC | #11

On 11/11/21 18:48, Eric Farman wrote:
> On Thu, 2021-11-11 at 17:13 +0100, Janosch Frank wrote:
>> On 11/11/21 16:03, Eric Farman wrote:
>>> On Thu, 2021-11-11 at 10:15 +0100, David Hildenbrand wrote:
>>>> On 10.11.21 21:33, Eric Farman wrote:
>>>>> With commit 2444b352c3ac ("KVM: s390: forward most SIGP orders
>>>>> to
>>>>> user
>>>>> space") we have a capability that allows the "fast" SIGP orders
>>>>> (as
>>>>> defined by the Programming Notes for the SIGNAL PROCESSOR
>>>>> instruction in
>>>>> the Principles of Operation) to be handled in-kernel, while all
>>>>> others are
>>>>> sent to userspace for processing.
>>>>>
>>>>> This works fine but it creates a situation when, for example, a
>>>>> SIGP SENSE
>>>>> might return CC1 (STATUS STORED, and status bits indicating the
>>>>> vcpu is
>>>>> stopped), when in actuality userspace is still processing a
>>>>> SIGP
>>>>> STOP AND
>>>>> STORE STATUS order, and the vcpu is not yet actually stopped.
>>>>> Thus,
>>>>> the
>>>>> SIGP SENSE should actually be returning CC2 (busy) instead of
>>>>> CC1.
>>>>>
>>>>> To fix this, add another CPU capability, dependent on the
>>>>> USER_SIGP
>>>>> one,
>>>>> and two associated IOCTLs. One IOCTL will be used by userspace
>>>>> to
>>>>> mark a
>>>>> vcpu "busy" processing a SIGP order, and cause concurrent
>>>>> orders
>>>>> handled
>>>>> in-kernel to be returned with CC2 (busy). Another IOCTL will be
>>>>> used by
>>>>> userspace to mark the SIGP "finished", and the vcpu free to
>>>>> process
>>>>> additional orders.
>>>>>
>>>>
>>>> This looks much cleaner to me, thanks!
>>>>
>>>> [...]
>>>>
>>>>> diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-
>>>>> s390.h
>>>>> index c07a050d757d..54371cede485 100644
>>>>> --- a/arch/s390/kvm/kvm-s390.h
>>>>> +++ b/arch/s390/kvm/kvm-s390.h
>>>>> @@ -82,6 +82,22 @@ static inline int is_vcpu_idle(struct
>>>>> kvm_vcpu
>>>>> *vcpu)
>>>>>    	return test_bit(vcpu->vcpu_idx, vcpu->kvm-
>>>>>> arch.idle_mask);
>>>>>    }
>>>>>    
>>>>> +static inline bool kvm_s390_vcpu_is_sigp_busy(struct kvm_vcpu
>>>>> *vcpu)
>>>>> +{
>>>>> +	return (atomic_read(&vcpu->arch.sigp_busy) == 1);
>>>>
>>>> You can drop ()
>>>>
>>>>> +}
>>>>> +
>>>>> +static inline bool kvm_s390_vcpu_set_sigp_busy(struct kvm_vcpu
>>>>> *vcpu)
>>>>> +{
>>>>> +	/* Return zero for success, or -EBUSY if another vcpu
>>>>> won */
>>>>> +	return (atomic_cmpxchg(&vcpu->arch.sigp_busy, 0, 1) ==
>>>>> 0) ? 0 :
>>>>> -EBUSY;
>>>>
>>>> You can drop () as well.
>>>>
>>>> We might not need the -EBUSY semantics after all. User space can
>>>> just
>>>> track if it was set, because it's in charge of setting it.
>>>
>>> Hrm, I added this to distinguish a newer kernel with an older QEMU,
>>> but
>>> of course an older QEMU won't know the difference either. I'll
>>> doublecheck that this is works fine in the different permutations.
>>>
>>>>> +}
>>>>> +
>>>>> +static inline void kvm_s390_vcpu_clear_sigp_busy(struct
>>>>> kvm_vcpu
>>>>> *vcpu)
>>>>> +{
>>>>> +	atomic_set(&vcpu->arch.sigp_busy, 0);
>>>>> +}
>>>>> +
>>>>>    static inline int kvm_is_ucontrol(struct kvm *kvm)
>>>>>    {
>>>>>    #ifdef CONFIG_KVM_S390_UCONTROL
>>>>> diff --git a/arch/s390/kvm/sigp.c b/arch/s390/kvm/sigp.c
>>>>> index 5ad3fb4619f1..a37496ea6dfa 100644
>>>>> --- a/arch/s390/kvm/sigp.c
>>>>> +++ b/arch/s390/kvm/sigp.c
>>>>> @@ -276,6 +276,10 @@ static int handle_sigp_dst(struct kvm_vcpu
>>>>> *vcpu, u8 order_code,
>>>>>    	if (!dst_vcpu)
>>>>>    		return SIGP_CC_NOT_OPERATIONAL;
>>>>>    
>>>>> +	if (kvm_s390_vcpu_is_sigp_busy(dst_vcpu)) {
>>>>> +		return SIGP_CC_BUSY;
>>>>> +	}
>>>>
>>>> You can drop {}
>>>
>>> Arg, I had some debug in there which needed the braces, and of
>>> course
>>> it's unnecessary now. Thanks.
>>>
>>>>> +
>>>>>    	switch (order_code) {
>>>>>    	case SIGP_SENSE:
>>>>>    		vcpu->stat.instruction_sigp_sense++;
>>>>> @@ -411,6 +415,12 @@ int kvm_s390_handle_sigp(struct kvm_vcpu
>>>>> *vcpu)
>>>>>    	if (handle_sigp_order_in_user_space(vcpu, order_code,
>>>>> cpu_addr))
>>>>>    		return -EOPNOTSUPP;
>>>>>    
>>>>> +	/* Check the current vcpu, if it was a target from
>>>>> another vcpu
>>>>> */
>>>>> +	if (kvm_s390_vcpu_is_sigp_busy(vcpu)) {
>>>>> +		kvm_s390_set_psw_cc(vcpu, SIGP_CC_BUSY);
>>>>> +		return 0;
>>>>> +	}
>>>>
>>>> I don't think we need this. I think the above (checking the
>>>> target of
>>>> a
>>>> SIGP order) is sufficient. Or which situation do you have in
>>>> mind?
>>>>
>>>
>>> Hrm... I think you're right. I was thinking of this:
>>>
>>> VCPU 1 - SIGP STOP CPU 2
>>> VCPU 2 - SIGP SENSE CPU 1
>>>
>>> But of course either CPU2 is going to be marked "busy" first, and
>>> the
>>> sense doesn't get processed until it's reset, or the sense arrives
>>> first, and the busy/notbusy doesn't matter. Let me doublecheck my
>>> tests
>>> for the non-RFC version.
>>>
>>>>
>>>> I do wonder if we want to make this a kvm_arch_vcpu_ioctl()
>>>> instead,
>>>
>>> In one of my original attempts between v1 and v2, I had put this
>>> there.
>>> This reliably deadlocks my guest, because the caller
>>> (kvm_vcpu_ioctl())
>>> tries to acquire vcpu->mutex, and racing SIGPs (via KVM_RUN) might
>>> already be holding it. Thus, it's an async ioctl. I could fold it
>>> into
>>> the existing interrupt ioctl, but as those are architected structs
>>> it
>>> seems more natural do it this way. Or I have mis-understood
>>> something
>>> along the way?
>>>
>>>> essentially just providing a KVM_S390_SET_SIGP_BUSY *and*
>>>> providing
>>>> the
>>>> order. "order == 0" sets it to !busy.
>>>
>>> I'd tried this too, since it provided some nice debug-ability.
>>> Unfortunately, I have a testcase (which I'll eventually get folded
>>> into
>>> kvm-unit-tests :)) that picks a random order between 0-255, knowing
>>> that there's only a couple handfuls of valid orders, to check the
>>> response. Zero is valid architecturally (POPS figure 4-29), even if
>>> it's unassigned. The likelihood of it becoming assigned is probably
>>> quite low, but I'm not sure that I like special-casing an order of
>>> zero
>>> in this way.
>>>
>>
>> Looking at the API I'd like to avoid having two IOCTLs
> 
> Since the order is a single byte, we could have the payload of an ioctl
> say "0-255 is an order that we're busy processing, anything higher than
> that resets the busy" or something. That would remove the need for a
> second IOCTL.
> 
>> and I'd love to
>> see some way to extend this without the need for a whole new IOCTL.
>>
> 
> Do you mean zero IOCTLs? Because... I think the only way we can do that
> is to get rid of USER_SIGP altogether, and handle everything in-kernel.
> Some weeks ago I played with QEMU not enabling USER_SIGP, but I can't
> say I've tried it since we went down this "mark a vcpu busy" path. If I
> do that busy/not-busy tagging in the kernel for !USER_SIGP, that might
> not be a bad thing anyway. But I don't know how we get the behavior
> straightened out for USER_SIGP without some type of handshake.

I'd move over to a very small struct argument with a command and a flags 
field so we can extend the IOCTL at a later time without the need to 
introduce a new IOCTL.

IMHO there's no real need to make this IOCTL as small as possible and 
only handle an int as the argument with < 0 shenanigans. We should 
rather focus on making this a nice and sane API if we have the option to 
do so.

> 
>>
>>
>>>> Not that we would need the value
>>>> right now, but who knows for what we might reuse that interface
>>>> in
>>>> the
>>>> future.
>>>>
>>>> Thanks!
>>>>
>

David Hildenbrand Nov. 12, 2021, 9:34 a.m. UTC | #12

> > > Right, that's exactly what I had at one point. I thought it was too
> > > cumbersome, but maybe not. Will dust it off, pending my question to
> > > Janosch about 0-vs-1 IOCTLs.
> >
> > As we really only care about the SIGP STOP case,
>
> Is that really true? SIGP RESTART does an inject back to KVM, ditto the
> (INITIAL) CPU RESET orders. It's true that SIGP SENSE is getting
> tripped up on whether a vcpu is actually stopped or not, but I believe
> that SIGP SENSE saying "everything's fine" when a vcpu is still busy processing an order isn't great either.

The general rule is "if the guest can detect that it violates the
spec, it needs fixing".

That is true right now when a single VCPU does:

#1: SIGP STOP AND STORE #2
#1: SIGP SENSE #2

Because according to the spec, the SIGP SENSE has to return either
BUSY or indicated STOPPED. And if it indicates STOPPED, the STORE has
to be fully processed.


Can you construct something similar with SIGP INITIAL CPU RESET and
e.g., SIGP SENSE? From a single CPU not:

#1: SIGP INITIAL CPU RESET #2
#1: SIGP SENSE #2

As the SIGP INITIAL CPU RESET is processed fully asynchronous.

Can you come up with an example where using another VCPU we could
reliably detect a difference between

#1: SIGP INITIAL CPU RESET #2
#3: SIGP SENSE #2

and

#3: SIGP SENSE #2
#1: SIGP INITIAL CPU RESET #2

and

#3: SIGP SENSE #2 [and concurrent] #1: SIGP INITIAL CPU RESET #2

If yes, it needs fixing, if not, we can happily ignore it.

David Hildenbrand Nov. 12, 2021, 9:35 a.m. UTC | #13

>
> #1: SIGP INITIAL CPU RESET #2
> #1: SIGP SENSE #2
>
> As the SIGP INITIAL CPU RESET is processed fully asynchronous.

I meant to say "fully synchronous".

Eric Farman Nov. 12, 2021, 4:09 p.m. UTC | #14

On Fri, 2021-11-12 at 09:49 +0100, Janosch Frank wrote:
> On 11/11/21 18:48, Eric Farman wrote:
> > On Thu, 2021-11-11 at 17:13 +0100, Janosch Frank wrote:
> > > On 11/11/21 16:03, Eric Farman wrote:
> > > > On Thu, 2021-11-11 at 10:15 +0100, David Hildenbrand wrote:
> > > > > On 10.11.21 21:33, Eric Farman wrote:
> > > > > > With commit 2444b352c3ac ("KVM: s390: forward most SIGP
> > > > > > orders
> > > > > > to
> > > > > > user
> > > > > > space") we have a capability that allows the "fast" SIGP
> > > > > > orders
> > > > > > (as
> > > > > > defined by the Programming Notes for the SIGNAL PROCESSOR
> > > > > > instruction in
> > > > > > the Principles of Operation) to be handled in-kernel, while
> > > > > > all
> > > > > > others are
> > > > > > sent to userspace for processing.
> > > > > > 
> > > > > > This works fine but it creates a situation when, for
> > > > > > example, a
> > > > > > SIGP SENSE
> > > > > > might return CC1 (STATUS STORED, and status bits indicating
> > > > > > the
> > > > > > vcpu is
> > > > > > stopped), when in actuality userspace is still processing a
> > > > > > SIGP
> > > > > > STOP AND
> > > > > > STORE STATUS order, and the vcpu is not yet actually
> > > > > > stopped.
> > > > > > Thus,
> > > > > > the
> > > > > > SIGP SENSE should actually be returning CC2 (busy) instead
> > > > > > of
> > > > > > CC1.
> > > > > > 
> > > > > > To fix this, add another CPU capability, dependent on the
> > > > > > USER_SIGP
> > > > > > one,
> > > > > > and two associated IOCTLs. One IOCTL will be used by
> > > > > > userspace
> > > > > > to
> > > > > > mark a
> > > > > > vcpu "busy" processing a SIGP order, and cause concurrent
> > > > > > orders
> > > > > > handled
> > > > > > in-kernel to be returned with CC2 (busy). Another IOCTL
> > > > > > will be
> > > > > > used by
> > > > > > userspace to mark the SIGP "finished", and the vcpu free to
> > > > > > process
> > > > > > additional orders.
> > > > > > 
> > > > > 
> > > > > This looks much cleaner to me, thanks!
> > > > > 
> > > > > [...]
> > > > > 
> > > > > > diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-
> > > > > > s390.h
> > > > > > index c07a050d757d..54371cede485 100644
> > > > > > --- a/arch/s390/kvm/kvm-s390.h
> > > > > > +++ b/arch/s390/kvm/kvm-s390.h
> > > > > > @@ -82,6 +82,22 @@ static inline int is_vcpu_idle(struct
> > > > > > kvm_vcpu
> > > > > > *vcpu)
> > > > > >    	return test_bit(vcpu->vcpu_idx, vcpu->kvm-
> > > > > > > arch.idle_mask);
> > > > > >    }
> > > > > >    
> > > > > > +static inline bool kvm_s390_vcpu_is_sigp_busy(struct
> > > > > > kvm_vcpu
> > > > > > *vcpu)
> > > > > > +{
> > > > > > +	return (atomic_read(&vcpu->arch.sigp_busy) == 1);
> > > > > 
> > > > > You can drop ()
> > > > > 
> > > > > > +}
> > > > > > +
> > > > > > +static inline bool kvm_s390_vcpu_set_sigp_busy(struct
> > > > > > kvm_vcpu
> > > > > > *vcpu)
> > > > > > +{
> > > > > > +	/* Return zero for success, or -EBUSY if another vcpu
> > > > > > won */
> > > > > > +	return (atomic_cmpxchg(&vcpu->arch.sigp_busy, 0, 1) ==
> > > > > > 0) ? 0 :
> > > > > > -EBUSY;
> > > > > 
> > > > > You can drop () as well.
> > > > > 
> > > > > We might not need the -EBUSY semantics after all. User space
> > > > > can
> > > > > just
> > > > > track if it was set, because it's in charge of setting it.
> > > > 
> > > > Hrm, I added this to distinguish a newer kernel with an older
> > > > QEMU,
> > > > but
> > > > of course an older QEMU won't know the difference either. I'll
> > > > doublecheck that this is works fine in the different
> > > > permutations.
> > > > 
> > > > > > +}
> > > > > > +
> > > > > > +static inline void kvm_s390_vcpu_clear_sigp_busy(struct
> > > > > > kvm_vcpu
> > > > > > *vcpu)
> > > > > > +{
> > > > > > +	atomic_set(&vcpu->arch.sigp_busy, 0);
> > > > > > +}
> > > > > > +
> > > > > >    static inline int kvm_is_ucontrol(struct kvm *kvm)
> > > > > >    {
> > > > > >    #ifdef CONFIG_KVM_S390_UCONTROL
> > > > > > diff --git a/arch/s390/kvm/sigp.c b/arch/s390/kvm/sigp.c
> > > > > > index 5ad3fb4619f1..a37496ea6dfa 100644
> > > > > > --- a/arch/s390/kvm/sigp.c
> > > > > > +++ b/arch/s390/kvm/sigp.c
> > > > > > @@ -276,6 +276,10 @@ static int handle_sigp_dst(struct
> > > > > > kvm_vcpu
> > > > > > *vcpu, u8 order_code,
> > > > > >    	if (!dst_vcpu)
> > > > > >    		return SIGP_CC_NOT_OPERATIONAL;
> > > > > >    
> > > > > > +	if (kvm_s390_vcpu_is_sigp_busy(dst_vcpu)) {
> > > > > > +		return SIGP_CC_BUSY;
> > > > > > +	}
> > > > > 
> > > > > You can drop {}
> > > > 
> > > > Arg, I had some debug in there which needed the braces, and of
> > > > course
> > > > it's unnecessary now. Thanks.
> > > > 
> > > > > > +
> > > > > >    	switch (order_code) {
> > > > > >    	case SIGP_SENSE:
> > > > > >    		vcpu->stat.instruction_sigp_sense++;
> > > > > > @@ -411,6 +415,12 @@ int kvm_s390_handle_sigp(struct
> > > > > > kvm_vcpu
> > > > > > *vcpu)
> > > > > >    	if (handle_sigp_order_in_user_space(vcpu, order_code,
> > > > > > cpu_addr))
> > > > > >    		return -EOPNOTSUPP;
> > > > > >    
> > > > > > +	/* Check the current vcpu, if it was a target from
> > > > > > another vcpu
> > > > > > */
> > > > > > +	if (kvm_s390_vcpu_is_sigp_busy(vcpu)) {
> > > > > > +		kvm_s390_set_psw_cc(vcpu, SIGP_CC_BUSY);
> > > > > > +		return 0;
> > > > > > +	}
> > > > > 
> > > > > I don't think we need this. I think the above (checking the
> > > > > target of
> > > > > a
> > > > > SIGP order) is sufficient. Or which situation do you have in
> > > > > mind?
> > > > > 
> > > > 
> > > > Hrm... I think you're right. I was thinking of this:
> > > > 
> > > > VCPU 1 - SIGP STOP CPU 2
> > > > VCPU 2 - SIGP SENSE CPU 1
> > > > 
> > > > But of course either CPU2 is going to be marked "busy" first,
> > > > and
> > > > the
> > > > sense doesn't get processed until it's reset, or the sense
> > > > arrives
> > > > first, and the busy/notbusy doesn't matter. Let me doublecheck
> > > > my
> > > > tests
> > > > for the non-RFC version.
> > > > 
> > > > > I do wonder if we want to make this a kvm_arch_vcpu_ioctl()
> > > > > instead,
> > > > 
> > > > In one of my original attempts between v1 and v2, I had put
> > > > this
> > > > there.
> > > > This reliably deadlocks my guest, because the caller
> > > > (kvm_vcpu_ioctl())
> > > > tries to acquire vcpu->mutex, and racing SIGPs (via KVM_RUN)
> > > > might
> > > > already be holding it. Thus, it's an async ioctl. I could fold
> > > > it
> > > > into
> > > > the existing interrupt ioctl, but as those are architected
> > > > structs
> > > > it
> > > > seems more natural do it this way. Or I have mis-understood
> > > > something
> > > > along the way?
> > > > 
> > > > > essentially just providing a KVM_S390_SET_SIGP_BUSY *and*
> > > > > providing
> > > > > the
> > > > > order. "order == 0" sets it to !busy.
> > > > 
> > > > I'd tried this too, since it provided some nice debug-ability.
> > > > Unfortunately, I have a testcase (which I'll eventually get
> > > > folded
> > > > into
> > > > kvm-unit-tests :)) that picks a random order between 0-255,
> > > > knowing
> > > > that there's only a couple handfuls of valid orders, to check
> > > > the
> > > > response. Zero is valid architecturally (POPS figure 4-29),
> > > > even if
> > > > it's unassigned. The likelihood of it becoming assigned is
> > > > probably
> > > > quite low, but I'm not sure that I like special-casing an order
> > > > of
> > > > zero
> > > > in this way.
> > > > 
> > > 
> > > Looking at the API I'd like to avoid having two IOCTLs
> > 
> > Since the order is a single byte, we could have the payload of an
> > ioctl
> > say "0-255 is an order that we're busy processing, anything higher
> > than
> > that resets the busy" or something. That would remove the need for
> > a
> > second IOCTL.
> > 
> > > and I'd love to
> > > see some way to extend this without the need for a whole new
> > > IOCTL.
> > > 
> > 
> > Do you mean zero IOCTLs? Because... I think the only way we can do
> > that
> > is to get rid of USER_SIGP altogether, and handle everything in-
> > kernel.
> > Some weeks ago I played with QEMU not enabling USER_SIGP, but I
> > can't
> > say I've tried it since we went down this "mark a vcpu busy" path.
> > If I
> > do that busy/not-busy tagging in the kernel for !USER_SIGP, that
> > might
> > not be a bad thing anyway. But I don't know how we get the behavior
> > straightened out for USER_SIGP without some type of handshake.
> 
> I'd move over to a very small struct argument with a command and a
> flags 
> field so we can extend the IOCTL at a later time without the need to 
> introduce a new IOCTL.

> IMHO there's no real need to make this IOCTL as small as possible
> and 
> only handle an int as the argument with < 0 shenanigans. We should 
> rather focus on making this a nice and sane API if we have the option
> to 
> do so.
> 

So then naming this as "USER_SIGP_BUSY" is too narrow. What about just
"USER_BUSY" and something like this?

enum user_busy_function {
	SET,
	RESET,
}

enum user_busy_reason {
	SIGP,
}

struct user_busy {
	int function;
	int reason;
	long payload;		// Optional? In our case, SIGP order
}

> > > 
> > > > > Not that we would need the value
> > > > > right now, but who knows for what we might reuse that
> > > > > interface
> > > > > in
> > > > > the
> > > > > future.
> > > > > 
> > > > > Thanks!
> > > > >

Eric Farman Nov. 12, 2021, 8:30 p.m. UTC | #15

On Fri, 2021-11-12 at 11:09 -0500, Eric Farman wrote:
> On Fri, 2021-11-12 at 09:49 +0100, Janosch Frank wrote:
> > On 11/11/21 18:48, Eric Farman wrote:
> > > On Thu, 2021-11-11 at 17:13 +0100, Janosch Frank wrote:
> > > > On 11/11/21 16:03, Eric Farman wrote:
> > > > > On Thu, 2021-11-11 at 10:15 +0100, David Hildenbrand wrote:
> > > > > > On 10.11.21 21:33, Eric Farman wrote:
> > > > > > > With commit 2444b352c3ac ("KVM: s390: forward most SIGP
> > > > > > > orders
> > > > > > > to
> > > > > > > user
> > > > > > > space") we have a capability that allows the "fast" SIGP
> > > > > > > orders
> > > > > > > (as
> > > > > > > defined by the Programming Notes for the SIGNAL PROCESSOR
> > > > > > > instruction in
> > > > > > > the Principles of Operation) to be handled in-kernel,
> > > > > > > while
> > > > > > > all
> > > > > > > others are
> > > > > > > sent to userspace for processing.
> > > > > > > 
> > > > > > > This works fine but it creates a situation when, for
> > > > > > > example, a
> > > > > > > SIGP SENSE
> > > > > > > might return CC1 (STATUS STORED, and status bits
> > > > > > > indicating
> > > > > > > the
> > > > > > > vcpu is
> > > > > > > stopped), when in actuality userspace is still processing
> > > > > > > a
> > > > > > > SIGP
> > > > > > > STOP AND
> > > > > > > STORE STATUS order, and the vcpu is not yet actually
> > > > > > > stopped.
> > > > > > > Thus,
> > > > > > > the
> > > > > > > SIGP SENSE should actually be returning CC2 (busy)
> > > > > > > instead
> > > > > > > of
> > > > > > > CC1.
> > > > > > > 
> > > > > > > To fix this, add another CPU capability, dependent on the
> > > > > > > USER_SIGP
> > > > > > > one,
> > > > > > > and two associated IOCTLs. One IOCTL will be used by
> > > > > > > userspace
> > > > > > > to
> > > > > > > mark a
> > > > > > > vcpu "busy" processing a SIGP order, and cause concurrent
> > > > > > > orders
> > > > > > > handled
> > > > > > > in-kernel to be returned with CC2 (busy). Another IOCTL
> > > > > > > will be
> > > > > > > used by
> > > > > > > userspace to mark the SIGP "finished", and the vcpu free
> > > > > > > to
> > > > > > > process
> > > > > > > additional orders.
> > > > > > > 
> > > > > > 
> > > > > > This looks much cleaner to me, thanks!
> > > > > > 
> > > > > > [...]
> > > > > > 
> > > > > > > diff --git a/arch/s390/kvm/kvm-s390.h
> > > > > > > b/arch/s390/kvm/kvm-
> > > > > > > s390.h
> > > > > > > index c07a050d757d..54371cede485 100644
> > > > > > > --- a/arch/s390/kvm/kvm-s390.h
> > > > > > > +++ b/arch/s390/kvm/kvm-s390.h
> > > > > > > @@ -82,6 +82,22 @@ static inline int is_vcpu_idle(struct
> > > > > > > kvm_vcpu
> > > > > > > *vcpu)
> > > > > > >    	return test_bit(vcpu->vcpu_idx, vcpu->kvm-
> > > > > > > > arch.idle_mask);
> > > > > > >    }
> > > > > > >    
> > > > > > > +static inline bool kvm_s390_vcpu_is_sigp_busy(struct
> > > > > > > kvm_vcpu
> > > > > > > *vcpu)
> > > > > > > +{
> > > > > > > +	return (atomic_read(&vcpu->arch.sigp_busy) == 1);
> > > > > > 
> > > > > > You can drop ()
> > > > > > 
> > > > > > > +}
> > > > > > > +
> > > > > > > +static inline bool kvm_s390_vcpu_set_sigp_busy(struct
> > > > > > > kvm_vcpu
> > > > > > > *vcpu)
> > > > > > > +{
> > > > > > > +	/* Return zero for success, or -EBUSY if another vcpu
> > > > > > > won */
> > > > > > > +	return (atomic_cmpxchg(&vcpu->arch.sigp_busy, 0, 1) ==
> > > > > > > 0) ? 0 :
> > > > > > > -EBUSY;
> > > > > > 
> > > > > > You can drop () as well.
> > > > > > 
> > > > > > We might not need the -EBUSY semantics after all. User
> > > > > > space
> > > > > > can
> > > > > > just
> > > > > > track if it was set, because it's in charge of setting it.
> > > > > 
> > > > > Hrm, I added this to distinguish a newer kernel with an older
> > > > > QEMU,
> > > > > but
> > > > > of course an older QEMU won't know the difference either.
> > > > > I'll
> > > > > doublecheck that this is works fine in the different
> > > > > permutations.
> > > > > 
> > > > > > > +}
> > > > > > > +
> > > > > > > +static inline void kvm_s390_vcpu_clear_sigp_busy(struct
> > > > > > > kvm_vcpu
> > > > > > > *vcpu)
> > > > > > > +{
> > > > > > > +	atomic_set(&vcpu->arch.sigp_busy, 0);
> > > > > > > +}
> > > > > > > +
> > > > > > >    static inline int kvm_is_ucontrol(struct kvm *kvm)
> > > > > > >    {
> > > > > > >    #ifdef CONFIG_KVM_S390_UCONTROL
> > > > > > > diff --git a/arch/s390/kvm/sigp.c b/arch/s390/kvm/sigp.c
> > > > > > > index 5ad3fb4619f1..a37496ea6dfa 100644
> > > > > > > --- a/arch/s390/kvm/sigp.c
> > > > > > > +++ b/arch/s390/kvm/sigp.c
> > > > > > > @@ -276,6 +276,10 @@ static int handle_sigp_dst(struct
> > > > > > > kvm_vcpu
> > > > > > > *vcpu, u8 order_code,
> > > > > > >    	if (!dst_vcpu)
> > > > > > >    		return SIGP_CC_NOT_OPERATIONAL;
> > > > > > >    
> > > > > > > +	if (kvm_s390_vcpu_is_sigp_busy(dst_vcpu)) {
> > > > > > > +		return SIGP_CC_BUSY;
> > > > > > > +	}
> > > > > > 
> > > > > > You can drop {}
> > > > > 
> > > > > Arg, I had some debug in there which needed the braces, and
> > > > > of
> > > > > course
> > > > > it's unnecessary now. Thanks.
> > > > > 
> > > > > > > +
> > > > > > >    	switch (order_code) {
> > > > > > >    	case SIGP_SENSE:
> > > > > > >    		vcpu->stat.instruction_sigp_sense++;
> > > > > > > @@ -411,6 +415,12 @@ int kvm_s390_handle_sigp(struct
> > > > > > > kvm_vcpu
> > > > > > > *vcpu)
> > > > > > >    	if (handle_sigp_order_in_user_space(vcpu,
> > > > > > > order_code,
> > > > > > > cpu_addr))
> > > > > > >    		return -EOPNOTSUPP;
> > > > > > >    
> > > > > > > +	/* Check the current vcpu, if it was a target from
> > > > > > > another vcpu
> > > > > > > */
> > > > > > > +	if (kvm_s390_vcpu_is_sigp_busy(vcpu)) {
> > > > > > > +		kvm_s390_set_psw_cc(vcpu, SIGP_CC_BUSY);
> > > > > > > +		return 0;
> > > > > > > +	}
> > > > > > 
> > > > > > I don't think we need this. I think the above (checking the
> > > > > > target of
> > > > > > a
> > > > > > SIGP order) is sufficient. Or which situation do you have
> > > > > > in
> > > > > > mind?
> > > > > > 
> > > > > 
> > > > > Hrm... I think you're right. I was thinking of this:
> > > > > 
> > > > > VCPU 1 - SIGP STOP CPU 2
> > > > > VCPU 2 - SIGP SENSE CPU 1
> > > > > 
> > > > > But of course either CPU2 is going to be marked "busy" first,
> > > > > and
> > > > > the
> > > > > sense doesn't get processed until it's reset, or the sense
> > > > > arrives
> > > > > first, and the busy/notbusy doesn't matter. Let me
> > > > > doublecheck
> > > > > my
> > > > > tests
> > > > > for the non-RFC version.
> > > > > 
> > > > > > I do wonder if we want to make this a kvm_arch_vcpu_ioctl()
> > > > > > instead,
> > > > > 
> > > > > In one of my original attempts between v1 and v2, I had put
> > > > > this
> > > > > there.
> > > > > This reliably deadlocks my guest, because the caller
> > > > > (kvm_vcpu_ioctl())
> > > > > tries to acquire vcpu->mutex, and racing SIGPs (via KVM_RUN)
> > > > > might
> > > > > already be holding it. Thus, it's an async ioctl. I could
> > > > > fold
> > > > > it
> > > > > into
> > > > > the existing interrupt ioctl, but as those are architected
> > > > > structs
> > > > > it
> > > > > seems more natural do it this way. Or I have mis-understood
> > > > > something
> > > > > along the way?
> > > > > 
> > > > > > essentially just providing a KVM_S390_SET_SIGP_BUSY *and*
> > > > > > providing
> > > > > > the
> > > > > > order. "order == 0" sets it to !busy.
> > > > > 
> > > > > I'd tried this too, since it provided some nice debug-
> > > > > ability.
> > > > > Unfortunately, I have a testcase (which I'll eventually get
> > > > > folded
> > > > > into
> > > > > kvm-unit-tests :)) that picks a random order between 0-255,
> > > > > knowing
> > > > > that there's only a couple handfuls of valid orders, to check
> > > > > the
> > > > > response. Zero is valid architecturally (POPS figure 4-29),
> > > > > even if
> > > > > it's unassigned. The likelihood of it becoming assigned is
> > > > > probably
> > > > > quite low, but I'm not sure that I like special-casing an
> > > > > order
> > > > > of
> > > > > zero
> > > > > in this way.
> > > > > 
> > > > 
> > > > Looking at the API I'd like to avoid having two IOCTLs
> > > 
> > > Since the order is a single byte, we could have the payload of an
> > > ioctl
> > > say "0-255 is an order that we're busy processing, anything
> > > higher
> > > than
> > > that resets the busy" or something. That would remove the need
> > > for
> > > a
> > > second IOCTL.
> > > 
> > > > and I'd love to
> > > > see some way to extend this without the need for a whole new
> > > > IOCTL.
> > > > 
> > > 
> > > Do you mean zero IOCTLs? Because... I think the only way we can
> > > do
> > > that
> > > is to get rid of USER_SIGP altogether, and handle everything in-
> > > kernel.
> > > Some weeks ago I played with QEMU not enabling USER_SIGP, but I
> > > can't
> > > say I've tried it since we went down this "mark a vcpu busy"
> > > path.
> > > If I
> > > do that busy/not-busy tagging in the kernel for !USER_SIGP, that
> > > might
> > > not be a bad thing anyway. But I don't know how we get the
> > > behavior
> > > straightened out for USER_SIGP without some type of handshake.
> > 
> > I'd move over to a very small struct argument with a command and a
> > flags 
> > field so we can extend the IOCTL at a later time without the need
> > to 
> > introduce a new IOCTL.
> > IMHO there's no real need to make this IOCTL as small as possible
> > and 
> > only handle an int as the argument with < 0 shenanigans. We should 
> > rather focus on making this a nice and sane API if we have the
> > option
> > to 
> > do so.
> > 
> 
> So then naming this as "USER_SIGP_BUSY" is too narrow. What about
> just
> "USER_BUSY" and something like this?
> 
> enum user_busy_function {
> 	SET,
> 	RESET,
> }
> 
> enum user_busy_reason {
> 	SIGP,
> }
> 
> struct user_busy {
> 	int function;
> 	int reason;
> 	long payload;		// Optional? In our case, SIGP order
> }
> 

Looking at this further, rather than pile on as its own thing, why
couldn't this be added as an IRQ type and embed such a structure into
kvm_s390_irq? Then the ioctl is moot.

> > > > > > Not that we would need the value
> > > > > > right now, but who knows for what we might reuse that
> > > > > > interface
> > > > > > in
> > > > > > the
> > > > > > future.
> > > > > > 
> > > > > > Thanks!
> > > > > >

Christian Borntraeger Nov. 17, 2021, 7:54 a.m. UTC | #16

Am 11.11.21 um 20:05 schrieb Eric Farman:
> On Thu, 2021-11-11 at 19:29 +0100, David Hildenbrand wrote:
>> On 11.11.21 18:48, Eric Farman wrote:
>>> On Thu, 2021-11-11 at 17:13 +0100, Janosch Frank wrote:
>>>> On 11/11/21 16:03, Eric Farman wrote:
>>>>> On Thu, 2021-11-11 at 10:15 +0100, David Hildenbrand wrote:
>>>>>> On 10.11.21 21:33, Eric Farman wrote:
>>>>>>> With commit 2444b352c3ac ("KVM: s390: forward most SIGP
>>>>>>> orders
>>>>>>> to
>>>>>>> user
>>>>>>> space") we have a capability that allows the "fast" SIGP
>>>>>>> orders
>>>>>>> (as
>>>>>>> defined by the Programming Notes for the SIGNAL PROCESSOR
>>>>>>> instruction in
>>>>>>> the Principles of Operation) to be handled in-kernel, while
>>>>>>> all
>>>>>>> others are
>>>>>>> sent to userspace for processing.
>>>>>>>
>>>>>>> This works fine but it creates a situation when, for
>>>>>>> example, a
>>>>>>> SIGP SENSE
>>>>>>> might return CC1 (STATUS STORED, and status bits indicating
>>>>>>> the
>>>>>>> vcpu is
>>>>>>> stopped), when in actuality userspace is still processing a
>>>>>>> SIGP
>>>>>>> STOP AND
>>>>>>> STORE STATUS order, and the vcpu is not yet actually
>>>>>>> stopped.
>>>>>>> Thus,
>>>>>>> the
>>>>>>> SIGP SENSE should actually be returning CC2 (busy) instead
>>>>>>> of
>>>>>>> CC1.
>>>>>>>
>>>>>>> To fix this, add another CPU capability, dependent on the
>>>>>>> USER_SIGP
>>>>>>> one,
>>>>>>> and two associated IOCTLs. One IOCTL will be used by
>>>>>>> userspace
>>>>>>> to
>>>>>>> mark a
>>>>>>> vcpu "busy" processing a SIGP order, and cause concurrent
>>>>>>> orders
>>>>>>> handled
>>>>>>> in-kernel to be returned with CC2 (busy). Another IOCTL
>>>>>>> will be
>>>>>>> used by
>>>>>>> userspace to mark the SIGP "finished", and the vcpu free to
>>>>>>> process
>>>>>>> additional orders.
>>>>>>>
>>>>>>
>>>>>> This looks much cleaner to me, thanks!
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>>> diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-
>>>>>>> s390.h
>>>>>>> index c07a050d757d..54371cede485 100644
>>>>>>> --- a/arch/s390/kvm/kvm-s390.h
>>>>>>> +++ b/arch/s390/kvm/kvm-s390.h
>>>>>>> @@ -82,6 +82,22 @@ static inline int is_vcpu_idle(struct
>>>>>>> kvm_vcpu
>>>>>>> *vcpu)
>>>>>>>    	return test_bit(vcpu->vcpu_idx, vcpu->kvm-
>>>>>>>> arch.idle_mask);
>>>>>>>    }
>>>>>>>    
>>>>>>> +static inline bool kvm_s390_vcpu_is_sigp_busy(struct
>>>>>>> kvm_vcpu
>>>>>>> *vcpu)
>>>>>>> +{
>>>>>>> +	return (atomic_read(&vcpu->arch.sigp_busy) == 1);
>>>>>>
>>>>>> You can drop ()
>>>>>>
>>>>>>> +}
>>>>>>> +
>>>>>>> +static inline bool kvm_s390_vcpu_set_sigp_busy(struct
>>>>>>> kvm_vcpu
>>>>>>> *vcpu)
>>>>>>> +{
>>>>>>> +	/* Return zero for success, or -EBUSY if another vcpu
>>>>>>> won */
>>>>>>> +	return (atomic_cmpxchg(&vcpu->arch.sigp_busy, 0, 1) ==
>>>>>>> 0) ? 0 :
>>>>>>> -EBUSY;
>>>>>>
>>>>>> You can drop () as well.
>>>>>>
>>>>>> We might not need the -EBUSY semantics after all. User space
>>>>>> can
>>>>>> just
>>>>>> track if it was set, because it's in charge of setting it.
>>>>>
>>>>> Hrm, I added this to distinguish a newer kernel with an older
>>>>> QEMU,
>>>>> but
>>>>> of course an older QEMU won't know the difference either. I'll
>>>>> doublecheck that this is works fine in the different
>>>>> permutations.
>>>>>
>>>>>>> +}
>>>>>>> +
>>>>>>> +static inline void kvm_s390_vcpu_clear_sigp_busy(struct
>>>>>>> kvm_vcpu
>>>>>>> *vcpu)
>>>>>>> +{
>>>>>>> +	atomic_set(&vcpu->arch.sigp_busy, 0);
>>>>>>> +}
>>>>>>> +
>>>>>>>    static inline int kvm_is_ucontrol(struct kvm *kvm)
>>>>>>>    {
>>>>>>>    #ifdef CONFIG_KVM_S390_UCONTROL
>>>>>>> diff --git a/arch/s390/kvm/sigp.c b/arch/s390/kvm/sigp.c
>>>>>>> index 5ad3fb4619f1..a37496ea6dfa 100644
>>>>>>> --- a/arch/s390/kvm/sigp.c
>>>>>>> +++ b/arch/s390/kvm/sigp.c
>>>>>>> @@ -276,6 +276,10 @@ static int handle_sigp_dst(struct
>>>>>>> kvm_vcpu
>>>>>>> *vcpu, u8 order_code,
>>>>>>>    	if (!dst_vcpu)
>>>>>>>    		return SIGP_CC_NOT_OPERATIONAL;
>>>>>>>    
>>>>>>> +	if (kvm_s390_vcpu_is_sigp_busy(dst_vcpu)) {
>>>>>>> +		return SIGP_CC_BUSY;
>>>>>>> +	}
>>>>>>
>>>>>> You can drop {}
>>>>>
>>>>> Arg, I had some debug in there which needed the braces, and of
>>>>> course
>>>>> it's unnecessary now. Thanks.
>>>>>
>>>>>>> +
>>>>>>>    	switch (order_code) {
>>>>>>>    	case SIGP_SENSE:
>>>>>>>    		vcpu->stat.instruction_sigp_sense++;
>>>>>>> @@ -411,6 +415,12 @@ int kvm_s390_handle_sigp(struct
>>>>>>> kvm_vcpu
>>>>>>> *vcpu)
>>>>>>>    	if (handle_sigp_order_in_user_space(vcpu, order_code,
>>>>>>> cpu_addr))
>>>>>>>    		return -EOPNOTSUPP;
>>>>>>>    
>>>>>>> +	/* Check the current vcpu, if it was a target from
>>>>>>> another vcpu
>>>>>>> */
>>>>>>> +	if (kvm_s390_vcpu_is_sigp_busy(vcpu)) {
>>>>>>> +		kvm_s390_set_psw_cc(vcpu, SIGP_CC_BUSY);
>>>>>>> +		return 0;
>>>>>>> +	}
>>>>>>
>>>>>> I don't think we need this. I think the above (checking the
>>>>>> target of
>>>>>> a
>>>>>> SIGP order) is sufficient. Or which situation do you have in
>>>>>> mind?
>>>>>>
>>>>>
>>>>> Hrm... I think you're right. I was thinking of this:
>>>>>
>>>>> VCPU 1 - SIGP STOP CPU 2
>>>>> VCPU 2 - SIGP SENSE CPU 1
>>>>>
>>>>> But of course either CPU2 is going to be marked "busy" first,
>>>>> and
>>>>> the
>>>>> sense doesn't get processed until it's reset, or the sense
>>>>> arrives
>>>>> first, and the busy/notbusy doesn't matter. Let me doublecheck
>>>>> my
>>>>> tests
>>>>> for the non-RFC version.
>>>>>
>>>>>> I do wonder if we want to make this a kvm_arch_vcpu_ioctl()
>>>>>> instead,
>>>>>
>>>>> In one of my original attempts between v1 and v2, I had put
>>>>> this
>>>>> there.
>>>>> This reliably deadlocks my guest, because the caller
>>>>> (kvm_vcpu_ioctl())
>>>>> tries to acquire vcpu->mutex, and racing SIGPs (via KVM_RUN)
>>>>> might
>>>>> already be holding it. Thus, it's an async ioctl. I could fold
>>>>> it
>>>>> into
>>>>> the existing interrupt ioctl, but as those are architected
>>>>> structs
>>>>> it
>>>>> seems more natural do it this way. Or I have mis-understood
>>>>> something
>>>>> along the way?
>>>>>
>>>>>> essentially just providing a KVM_S390_SET_SIGP_BUSY *and*
>>>>>> providing
>>>>>> the
>>>>>> order. "order == 0" sets it to !busy.
>>>>>
>>>>> I'd tried this too, since it provided some nice debug-ability.
>>>>> Unfortunately, I have a testcase (which I'll eventually get
>>>>> folded
>>>>> into
>>>>> kvm-unit-tests :)) that picks a random order between 0-255,
>>>>> knowing
>>>>> that there's only a couple handfuls of valid orders, to check
>>>>> the
>>>>> response. Zero is valid architecturally (POPS figure 4-29),
>>>>> even if
>>>>> it's unassigned. The likelihood of it becoming assigned is
>>>>> probably
>>>>> quite low, but I'm not sure that I like special-casing an order
>>>>> of
>>>>> zero
>>>>> in this way.
>>>>>
>>>>
>>>> Looking at the API I'd like to avoid having two IOCTLs
>>>
>>> Since the order is a single byte, we could have the payload of an
>>> ioctl
>>> say "0-255 is an order that we're busy processing, anything higher
>>> than
>>> that resets the busy" or something. That would remove the need for
>>> a
>>> second IOCTL.
>>
>> Maybe just pass an int and treat a negative (or just -1) value as
>> clearing the order.
>>
> 
> Right, that's exactly what I had at one point. I thought it was too
> cumbersome, but maybe not. Will dust it off, pending my question to
> Janosch about 0-vs-1 IOCTLs.

As a totally different idea. Would a sync_reg value called SIGP_BUSY
work as well?

Eric Farman Nov. 19, 2021, 8:20 p.m. UTC | #17

On Wed, 2021-11-17 at 08:54 +0100, Christian Borntraeger wrote:
> Am 11.11.21 um 20:05 schrieb Eric Farman:
> > On Thu, 2021-11-11 at 19:29 +0100, David Hildenbrand wrote:
> > > On 11.11.21 18:48, Eric Farman wrote:
> > > > On Thu, 2021-11-11 at 17:13 +0100, Janosch Frank wrote:
> > > > > 
> > > > > Looking at the API I'd like to avoid having two IOCTLs
> > > > 
> > > > Since the order is a single byte, we could have the payload of
> > > > an
> > > > ioctl
> > > > say "0-255 is an order that we're busy processing, anything
> > > > higher
> > > > than
> > > > that resets the busy" or something. That would remove the need
> > > > for
> > > > a
> > > > second IOCTL.
> > > 
> > > Maybe just pass an int and treat a negative (or just -1) value as
> > > clearing the order.
> > > 
> > 
> > Right, that's exactly what I had at one point. I thought it was too
> > cumbersome, but maybe not. Will dust it off, pending my question to
> > Janosch about 0-vs-1 IOCTLs.
> 
> As a totally different idea. Would a sync_reg value called SIGP_BUSY
> work as well?
> 

Hrm... I'm not sure. I played with it a bit, and it's not looking
great. I'm almost certainly missing some serialization, because I was
frequently "losing" one of the toggles (busy/not-busy) when hammering
CPUs with various SIGP orders on this interface and thus getting
incorrect responses from the in-kernel orders.

I also took a stab at David's idea of tying it to KVM_MP_STATE [1]. I
still think it's a little odd, considering the existing states are all
z/Arch-defined CPU states, but it does sound like the sort of thing
we're trying to do (letting userspace announce what the CPU is up to).
One flaw is that most of the rest of QEMU uses s390_cpu_set_state() for
this, which returns the number of running CPUs instead of the return
code from the MP_STATE ioctl (via kvm_s390_set_cpu_state()) that SIGP
would be interested in. Even if I made the ioctl call directly, I still
encounter some system problems that smell like ones I've addressed in
v2 and v3. Possibly fixable, but I didn't pursue them far enough to be
certain.

I ALSO took a stab at folding this into the S390 IRQ paths [2], similar
to what was done with kvm_s390_stop_info. This worked reasonably well,
except the QEMU interface kvm_s390_vcpu_interrupt() returns a void, and
so wouldn't notice an error sent back by KVM. Not a deal breaker, but
having not heard anything to this idea, I didn't go much farther.

Next week is a short week due to the US holiday, so rather than flesh
out any of the above possibilities, I'm going to send a new RFC as v4.
This will be back to a single IOCTL, with a small payload, per Janosch'
feedback. We can have a discussion on that, but if any of the above
alternatives sound more appealing I can try getting one of them working
with more consistency.

[1] 
https://lore.kernel.org/r/ff344676-0c37-610b-eafb-b1477db0f6a1@redhat.com/
[2] 
https://lore.kernel.org/all/b206e7b73696907328bc4338664dea1ef572e8aa.camel@linux.ibm.com/

David Hildenbrand Nov. 22, 2021, 10:52 a.m. UTC | #18

On 19.11.21 21:20, Eric Farman wrote:
> On Wed, 2021-11-17 at 08:54 +0100, Christian Borntraeger wrote:
>> Am 11.11.21 um 20:05 schrieb Eric Farman:
>>> On Thu, 2021-11-11 at 19:29 +0100, David Hildenbrand wrote:
>>>> On 11.11.21 18:48, Eric Farman wrote:
>>>>> On Thu, 2021-11-11 at 17:13 +0100, Janosch Frank wrote:
>>>>>>
>>>>>> Looking at the API I'd like to avoid having two IOCTLs
>>>>>
>>>>> Since the order is a single byte, we could have the payload of
>>>>> an
>>>>> ioctl
>>>>> say "0-255 is an order that we're busy processing, anything
>>>>> higher
>>>>> than
>>>>> that resets the busy" or something. That would remove the need
>>>>> for
>>>>> a
>>>>> second IOCTL.
>>>>
>>>> Maybe just pass an int and treat a negative (or just -1) value as
>>>> clearing the order.
>>>>
>>>
>>> Right, that's exactly what I had at one point. I thought it was too
>>> cumbersome, but maybe not. Will dust it off, pending my question to
>>> Janosch about 0-vs-1 IOCTLs.
>>
>> As a totally different idea. Would a sync_reg value called SIGP_BUSY
>> work as well?
>>
> 
> Hrm... I'm not sure. I played with it a bit, and it's not looking
> great. I'm almost certainly missing some serialization, because I was
> frequently "losing" one of the toggles (busy/not-busy) when hammering
> CPUs with various SIGP orders on this interface and thus getting
> incorrect responses from the in-kernel orders.

You can only modify the destination CPU from the destination CPU thread,
after synchronizing the CPU state. This would be trivial with my QEMU proposal.

> 
> I also took a stab at David's idea of tying it to KVM_MP_STATE [1]. I
> still think it's a little odd, considering the existing states are all
> z/Arch-defined CPU states, but it does sound like the sort of thing
> we're trying to do (letting userspace announce what the CPU is up to).
> One flaw is that most of the rest of QEMU uses s390_cpu_set_state() for
> this, which returns the number of running CPUs instead of the return
> code from the MP_STATE ioctl (via kvm_s390_set_cpu_state()) that SIGP
> would be interested in. Even if I made the ioctl call directly, I still
> encounter some system problems that smell like ones I've addressed in
> v2 and v3. Possibly fixable, but I didn't pursue them far enough to be
> certain.

Well, we can essentially observe this special state of that CPU ("stopping"), so
it's not that weird. STOPPING is essentially OPERATING with the notion of
"the CPU is blocked for some actions.".

> 
> I ALSO took a stab at folding this into the S390 IRQ paths [2], similar
> to what was done with kvm_s390_stop_info. This worked reasonably well,
> except the QEMU interface kvm_s390_vcpu_interrupt() returns a void, and
> so wouldn't notice an error sent back by KVM. Not a deal breaker, but
> having not heard anything to this idea, I didn't go much farther.

We only care about SIGP STOP* handling so far, if anybody is aware of other issues
that need fixing, it would be helpful  to spell them out. I'll keep assuming that
only SIGP STOP*  needs fixing, as I already explained.

Whenever QEMU tells a CPU to stop asynchronously, it does so via a STOP IRQ from
the destination CPU thread via KVM_S390_SIGP_STOP. Then, we know the CPU is busy
... until we clear that interrupt, which happens via kvm_s390_clear_stop_irq().

Interestingly, we clear that interrupt via two paths:

1) kvm_s390_clear_local_irqs(), via KVM_S390_INITIAL_RESET and 
   KVM_S390_NORMAL_RESET. Here, we expect that new user space also sets  
   the CPU to stopped via KVM_MP_STATE_STOPPED. In fact, modern QEMU 
   properly sets the CPU stopped before triggering clearing of the 
   interrupts (s390_cpu_reset()).
2) kvm_s390_clear_stop_irq() via kvm_s390_vcpu_stop(), which is 
   triggered via:

a) STOP intercept (handle_stop()), KVM_S390_INITIAL_RESET and 
   KVM_S390_NORMAL_RESET with old user space -- 
   !kvm_s390_user_cpu_state_ctrl().

b) KVM_MP_STATE_STOPPED for modern user space.



Would the following solve our SIGP STOP * issue w.o. uapi changes?


a) Kernel

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index c6257f625929..bd7ee1ea8aa8 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -4643,10 +4643,15 @@ int kvm_s390_vcpu_stop(struct kvm_vcpu *vcpu)
                }
        }
 
-       /* SIGP STOP and SIGP STOP AND STORE STATUS has been fully processed */
+       /*
+        * SIGP STOP and SIGP STOP AND STORE STATUS have been fully
+        * processed. Clear the interrupt after setting the VCPU stopped,
+        * such that the VCPU remains busy for most SIGP orders until fully
+        * stopped.
+        */
+       kvm_s390_set_cpuflags(vcpu, CPUSTAT_STOPPED);
        kvm_s390_clear_stop_irq(vcpu);
 
-       kvm_s390_set_cpuflags(vcpu, CPUSTAT_STOPPED);
        __disable_ibs_on_vcpu(vcpu);
 
        for (i = 0; i < online_vcpus; i++) {
diff --git a/arch/s390/kvm/sigp.c b/arch/s390/kvm/sigp.c
index cf4de80bd541..e40f6412106d 100644
--- a/arch/s390/kvm/sigp.c
+++ b/arch/s390/kvm/sigp.c
@@ -276,6 +276,38 @@ static int handle_sigp_dst(struct kvm_vcpu *vcpu, u8 order_code,
        if (!dst_vcpu)
                return SIGP_CC_NOT_OPERATIONAL;
 
+       /*
+        * SIGP STOP * orders are the only SIGP orders that are processed
+        * asynchronously, and can theoretically, never complete.
+        *
+        * Until the destination VCPU is stopped via kvm_s390_vcpu_stop(), we
+        * have a stop interrupt pending. While we have a pending stop
+        * interrupt, that CPU is busy for most SIGP orders.
+        *
+        * This is important, because otherwise a single VCPU could issue on an
+        * operating destination VCPU:
+        * 1) SIGP STOP $DEST
+        * 2) SIGP SENSE $DEST
+        * And have 2) not rejected with BUSY although the destination is still
+        * processing the pending SIGP STOP * order.
+        *
+        * Relevant code has to make sure to complete the SIGP STOP * action
+        * (e.g., setting the CPU stopped, storing the status) before clearing
+        * the STOP interrupt.
+        */
+       if (order_code != SIGP_INITIAL_CPU_RESET &&
+           order_code != SIGP_CPU_RESET) {
+               /*
+                * Lockless check. SIGP STOP / SIGP RE(START) properly
+                * synchronizes when processing these orders. In any other case,
+                * we don't particularly care about races, as the guest cannot
+                * observe the difference really when issuing orders from two
+                * differing VCPUs.
+                */
+               if (kvm_s390_is_stop_irq_pending(dst_vcpu))
+                       return SIGP_CC_BUSY;
+       }
+
        switch (order_code) {
        case SIGP_SENSE:
                vcpu->stat.instruction_sigp_sense++;

b) QEMU

diff --git a/target/s390x/sigp.c b/target/s390x/sigp.c
index 51c727834c..e97e3a60fd 100644
--- a/target/s390x/sigp.c
+++ b/target/s390x/sigp.c
@@ -479,13 +479,17 @@ void do_stop_interrupt(CPUS390XState *env)
 {
     S390CPU *cpu = env_archcpu(env);
 
-    if (s390_cpu_set_state(S390_CPU_STATE_STOPPED, cpu) == 0) {
-        qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
-    }
+    /*
+     * Complete the STOP operation before exposing the CPU as STOPPED to
+     * the system.
+     */
     if (cpu->env.sigp_order == SIGP_STOP_STORE_STATUS) {
         s390_store_status(cpu, S390_STORE_STATUS_DEF_ADDR, true);
     }
     env->sigp_order = 0;
+    if (s390_cpu_set_state(S390_CPU_STATE_STOPPED, cpu) == 0) {
+        qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
+    }
     env->pending_int &= ~INTERRUPT_STOP;
 }

Eric Farman Nov. 23, 2021, 5:42 p.m. UTC | #19

On Mon, 2021-11-22 at 11:52 +0100, David Hildenbrand wrote:
> On 19.11.21 21:20, Eric Farman wrote:
> > On Wed, 2021-11-17 at 08:54 +0100, Christian Borntraeger wrote:
> > > Am 11.11.21 um 20:05 schrieb Eric Farman:
> > > > On Thu, 2021-11-11 at 19:29 +0100, David Hildenbrand wrote:
> > > > > On 11.11.21 18:48, Eric Farman wrote:
> > > > > > On Thu, 2021-11-11 at 17:13 +0100, Janosch Frank wrote:
> > > > > > > Looking at the API I'd like to avoid having two IOCTLs
> > > > > > 
> > > > > > Since the order is a single byte, we could have the payload
> > > > > > of
> > > > > > an
> > > > > > ioctl
> > > > > > say "0-255 is an order that we're busy processing, anything
> > > > > > higher
> > > > > > than
> > > > > > that resets the busy" or something. That would remove the
> > > > > > need
> > > > > > for
> > > > > > a
> > > > > > second IOCTL.
> > > > > 
> > > > > Maybe just pass an int and treat a negative (or just -1)
> > > > > value as
> > > > > clearing the order.
> > > > > 
> > > > 
> > > > Right, that's exactly what I had at one point. I thought it was
> > > > too
> > > > cumbersome, but maybe not. Will dust it off, pending my
> > > > question to
> > > > Janosch about 0-vs-1 IOCTLs.
> > > 
> > > As a totally different idea. Would a sync_reg value called
> > > SIGP_BUSY
> > > work as well?
> > > 
> > 
> > Hrm... I'm not sure. I played with it a bit, and it's not looking
> > great. I'm almost certainly missing some serialization, because I
> > was
> > frequently "losing" one of the toggles (busy/not-busy) when
> > hammering
> > CPUs with various SIGP orders on this interface and thus getting
> > incorrect responses from the in-kernel orders.
> 
> You can only modify the destination CPU from the destination CPU
> thread,
> after synchronizing the CPU state. 

Correct, but that was not missing from my quick pass down that rabbit
hole.

> This would be trivial with my QEMU proposal.
> 
> > I also took a stab at David's idea of tying it to KVM_MP_STATE [1].
> > I
> > still think it's a little odd, considering the existing states are
> > all
> > z/Arch-defined CPU states, but it does sound like the sort of thing
> > we're trying to do (letting userspace announce what the CPU is up
> > to).
> > One flaw is that most of the rest of QEMU uses s390_cpu_set_state()
> > for
> > this, which returns the number of running CPUs instead of the
> > return
> > code from the MP_STATE ioctl (via kvm_s390_set_cpu_state()) that
> > SIGP
> > would be interested in. Even if I made the ioctl call directly, I
> > still
> > encounter some system problems that smell like ones I've addressed
> > in
> > v2 and v3. Possibly fixable, but I didn't pursue them far enough to
> > be
> > certain.
> 
> Well, we can essentially observe this special state of that CPU
> ("stopping"), so
> it's not that weird. STOPPING is essentially OPERATING with the
> notion of
> "the CPU is blocked for some actions.".

My point is that any state we define here (STOPPING, BLOCKED, BUSY,
whatever) is something that we are inventing and not something that is
in POPS, which is what the existing states (STOPPED, OPERATING, LOAD,
CHECK-STOP) are. That's a little weird.

I'm not against it. The fact that MP_STATE is already a path back into
KVM without requiring a new uapi is really pleasant.

> 
> > I ALSO took a stab at folding this into the S390 IRQ paths [2],
> > similar
> > to what was done with kvm_s390_stop_info. This worked reasonably
> > well,
> > except the QEMU interface kvm_s390_vcpu_interrupt() returns a void,
> > and
> > so wouldn't notice an error sent back by KVM. Not a deal breaker,
> > but
> > having not heard anything to this idea, I didn't go much farther.
> 
> We only care about SIGP STOP* handling so far, if anybody is aware of
> other issues
> that need fixing, it would be helpful  to spell them out. 

Yes, I keep using SIGP STOP* as an example because it offers (A) a
clear miscommunication with the status bits returned by SIGP SENSE, and
(B) has some special handling in QEMU that to my experience is still
incomplete. But I'm seeing issues with other orders, like SIGP RESTART
[1] [2] (where QEMU does a kvm_s390_vcpu_interrupt() with
KVM_S390_RESTART, and thus adds to pending_irq) and SIGP (INITIAL) CPU
RESET [2] (admittedly not greatly researched).

The reason for why I have no spent a lot of time in the latter is that
I have also mentioned that POPS has lists of orders that will return
busy [3], and so something more general is perhaps warranted. The QEMU
RFC's don't handle anything further than SIGP STOP*, on account of it
makes sense to get the interface right first.

[1]
https://lore.kernel.org/kvm/4c733158506497972d5b04b34a169c054fca4ba5.camel@linux.ibm.com/
[2]
https://lore.kernel.org/kvm/006980fd7d0344b0258aa87128891fcd81c005b7.camel@linux.ibm.com/
[3]
https://lore.kernel.org/kvm/2ad9bef6b39a5a6c9b634cab7d70d110064d8f04.camel@linux.ibm.com/


> I'll keep assuming that
> only SIGP STOP*  needs fixing, as I already explained.
> 
> Whenever QEMU tells a CPU to stop asynchronously, it does so via a
> STOP IRQ from
> the destination CPU thread via KVM_S390_SIGP_STOP. Then, we know the
> CPU is busy
> ... until we clear that interrupt, which happens via
> kvm_s390_clear_stop_irq().
> 
> Interestingly, we clear that interrupt via two paths:
> 
> 1) kvm_s390_clear_local_irqs(), via KVM_S390_INITIAL_RESET and 
>    KVM_S390_NORMAL_RESET. Here, we expect that new user space also
> sets  
>    the CPU to stopped via KVM_MP_STATE_STOPPED. In fact, modern QEMU 
>    properly sets the CPU stopped before triggering clearing of the 
>    interrupts (s390_cpu_reset()).
> 2) kvm_s390_clear_stop_irq() via kvm_s390_vcpu_stop(), which is 
>    triggered via:
> 
> a) STOP intercept (handle_stop()), KVM_S390_INITIAL_RESET and 
>    KVM_S390_NORMAL_RESET with old user space -- 
>    !kvm_s390_user_cpu_state_ctrl().
> 
> b) KVM_MP_STATE_STOPPED for modern user space.
> 
> 
> 
> Would the following solve our SIGP STOP * issue w.o. uapi changes?
> 

A quick pass shows some promise, but I haven't the bandwidth to throw
the battery of stuff at it. I'll have to take a closer look after the
US Holiday to give a better answer. (Example: looking for
IRQ_PEND_SIGP_STOP || IRQ_PEND_RESTART is trivial.)

Eric

> 
> a) Kernel
> 
> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index c6257f625929..bd7ee1ea8aa8 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -4643,10 +4643,15 @@ int kvm_s390_vcpu_stop(struct kvm_vcpu *vcpu)
>                 }
>         }
>  
> -       /* SIGP STOP and SIGP STOP AND STORE STATUS has been fully
> processed */
> +       /*
> +        * SIGP STOP and SIGP STOP AND STORE STATUS have been fully
> +        * processed. Clear the interrupt after setting the VCPU
> stopped,
> +        * such that the VCPU remains busy for most SIGP orders until
> fully
> +        * stopped.
> +        */
> +       kvm_s390_set_cpuflags(vcpu, CPUSTAT_STOPPED);
>         kvm_s390_clear_stop_irq(vcpu);
>  
> -       kvm_s390_set_cpuflags(vcpu, CPUSTAT_STOPPED);
>         __disable_ibs_on_vcpu(vcpu);
>  
>         for (i = 0; i < online_vcpus; i++) {
> diff --git a/arch/s390/kvm/sigp.c b/arch/s390/kvm/sigp.c
> index cf4de80bd541..e40f6412106d 100644
> --- a/arch/s390/kvm/sigp.c
> +++ b/arch/s390/kvm/sigp.c
> @@ -276,6 +276,38 @@ static int handle_sigp_dst(struct kvm_vcpu
> *vcpu, u8 order_code,
>         if (!dst_vcpu)
>                 return SIGP_CC_NOT_OPERATIONAL;
>  
> +       /*
> +        * SIGP STOP * orders are the only SIGP orders that are
> processed
> +        * asynchronously, and can theoretically, never complete.
> +        *
> +        * Until the destination VCPU is stopped via
> kvm_s390_vcpu_stop(), we
> +        * have a stop interrupt pending. While we have a pending
> stop
> +        * interrupt, that CPU is busy for most SIGP orders.
> +        *
> +        * This is important, because otherwise a single VCPU could
> issue on an
> +        * operating destination VCPU:
> +        * 1) SIGP STOP $DEST
> +        * 2) SIGP SENSE $DEST
> +        * And have 2) not rejected with BUSY although the
> destination is still
> +        * processing the pending SIGP STOP * order.
> +        *
> +        * Relevant code has to make sure to complete the SIGP STOP *
> action
> +        * (e.g., setting the CPU stopped, storing the status) before
> clearing
> +        * the STOP interrupt.
> +        */
> +       if (order_code != SIGP_INITIAL_CPU_RESET &&
> +           order_code != SIGP_CPU_RESET) {
> +               /*
> +                * Lockless check. SIGP STOP / SIGP RE(START)
> properly
> +                * synchronizes when processing these orders. In any
> other case,
> +                * we don't particularly care about races, as the
> guest cannot
> +                * observe the difference really when issuing orders
> from two
> +                * differing VCPUs.
> +                */
> +               if (kvm_s390_is_stop_irq_pending(dst_vcpu))
> +                       return SIGP_CC_BUSY;
> +       }
> +
>         switch (order_code) {
>         case SIGP_SENSE:
>                 vcpu->stat.instruction_sigp_sense++;
> 
> b) QEMU
> 
> diff --git a/target/s390x/sigp.c b/target/s390x/sigp.c
> index 51c727834c..e97e3a60fd 100644
> --- a/target/s390x/sigp.c
> +++ b/target/s390x/sigp.c
> @@ -479,13 +479,17 @@ void do_stop_interrupt(CPUS390XState *env)
>  {
>      S390CPU *cpu = env_archcpu(env);
>  
> -    if (s390_cpu_set_state(S390_CPU_STATE_STOPPED, cpu) == 0) {
> -        qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
> -    }
> +    /*
> +     * Complete the STOP operation before exposing the CPU as
> STOPPED to
> +     * the system.
> +     */
>      if (cpu->env.sigp_order == SIGP_STOP_STORE_STATUS) {
>          s390_store_status(cpu, S390_STORE_STATUS_DEF_ADDR, true);
>      }
>      env->sigp_order = 0;
> +    if (s390_cpu_set_state(S390_CPU_STATE_STOPPED, cpu) == 0) {
> +        qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
> +    }
>      env->pending_int &= ~INTERRUPT_STOP;
>  }
>  
> 
>

David Hildenbrand Nov. 23, 2021, 6:44 p.m. UTC | #20

>>
>>> I ALSO took a stab at folding this into the S390 IRQ paths [2],
>>> similar
>>> to what was done with kvm_s390_stop_info. This worked reasonably
>>> well,
>>> except the QEMU interface kvm_s390_vcpu_interrupt() returns a void,
>>> and
>>> so wouldn't notice an error sent back by KVM. Not a deal breaker,
>>> but
>>> having not heard anything to this idea, I didn't go much farther.
>>
>> We only care about SIGP STOP* handling so far, if anybody is aware of
>> other issues
>> that need fixing, it would be helpful  to spell them out. 
> 
> Yes, I keep using SIGP STOP* as an example because it offers (A) a
> clear miscommunication with the status bits returned by SIGP SENSE, and
> (B) has some special handling in QEMU that to my experience is still
> incomplete. But I'm seeing issues with other orders, like SIGP RESTART
> [1] [2] (where QEMU does a kvm_s390_vcpu_interrupt() with
> KVM_S390_RESTART, and thus adds to pending_irq) and SIGP (INITIAL) CPU
> RESET [2] (admittedly not greatly researched).

Sorry I missed that discussion previously. Summarizing what the PoP says:

Once we have a pending:
* start (-> synchronous)
* stop (-> asynchronous)
* restart (-> asynchronous)
* stop-and-store-status (-> asynchronous)
* set-prefix (-> synchronous)
* store-status-at-address (-> synchronous)
* store-additional-status-at-address (-> synchronous)
* initial-CPU-reset (-> synchronous)
* CPU-reset order (-> synchronous)

The following orders have to return busy (my take: only a must if the
guest could observe the difference):
* sense
* external call
* emergency signal
* start
* stop
* restart
* stop and store status
* set prefix
* store status at address
* set architecture
* set multithreading
* store additional status at address

We have to ask ourselves

a) How could a single VCPU observe the difference if executing any other
instruction immediately afterwards. My take would be that for
synchronous orders we can't really. So we're left with:
* stop (-> asynchronous)
* restart (-> asynchronous)
* stop-and-store-status (-> asynchronous)

b) How could multiple VCPUs observe the difference that a single VCPU
can't observe. That will require more thought, but I think it will be
hardly possible.


We know that SIGP STOP * needs care.

SIGP RESTART is interesting. We inject it only for OPERATING cpus and it
will only change the LC psw. What if we execute immediately afterwards:

* sense -> does not affect any bits
* external call -> has higher IRQ priority. There could be a difference
  if injected before or after the restart was delivered. Could be fixed
  in the kernel (check IRQ_PEND_RESTART).
* emergency signal -> has higher IRQ priority. There could be a
  difference if injected before or after the restart was delivered.
  Could be fixed in the kernel (check IRQ_PEND_RESTART).
* start -> CPU is already operating
* stop -> restart is delivered first
* restart -> I think the lowcore will look different if we inject two
  RESTARTs immediately afterwards instead of only a single
  one. Could be fixed in the kernel (double-deliver the interrupt).
* stop and store status -> restart is delivered first
* set prefix -> CPU is operating, not possible
* store status at address -> CPU is operating, not possible
* set architecture -> don't care
* set multithreading -> don't care
* store additional status at address -> CPU is operating, not possible
* initial-CPU-reset -> clears local IRQ. LC will look different if
  RESTART was delivered or not. Could be fixed in the kernel quite
  easily (deliver restart first before clearing interrupts).
* CPU-reset -> clears local IRQs. LC will look different if
  injected before vs. after. Could be fixed in the kernel quite
  easily (deliver restart first before clearing interrupts)..

external call as handled by the SIGP interpretation facility will
already also violate that description. We don't know that a SIGP restart
is pending. We'd have to disable the SIGP interpretation facility
temporarily.

/me shivers

This sounds like the kind of things we should happily not be fixing
because nobody might really care :)



> 
> The reason for why I have no spent a lot of time in the latter is that
> I have also mentioned that POPS has lists of orders that will return
> busy [3], and so something more general is perhaps warranted. The QEMU
> RFC's don't handle anything further than SIGP STOP*, on account of it
> makes sense to get the interface right first.

Right. My take is to have a look what we actually have to fix -- where
the guest can actually observe the difference. If the guest can't
observe the difference there is no need to actually implement BUSY logic
as instructed in the PoP.

> 
>> I'll keep assuming that
>> only SIGP STOP*  needs fixing, as I already explained.
>>
>> Whenever QEMU tells a CPU to stop asynchronously, it does so via a
>> STOP IRQ from
>> the destination CPU thread via KVM_S390_SIGP_STOP. Then, we know the
>> CPU is busy
>> ... until we clear that interrupt, which happens via
>> kvm_s390_clear_stop_irq().
>>
>> Interestingly, we clear that interrupt via two paths:
>>
>> 1) kvm_s390_clear_local_irqs(), via KVM_S390_INITIAL_RESET and 
>>    KVM_S390_NORMAL_RESET. Here, we expect that new user space also
>> sets  
>>    the CPU to stopped via KVM_MP_STATE_STOPPED. In fact, modern QEMU 
>>    properly sets the CPU stopped before triggering clearing of the 
>>    interrupts (s390_cpu_reset()).
>> 2) kvm_s390_clear_stop_irq() via kvm_s390_vcpu_stop(), which is 
>>    triggered via:
>>
>> a) STOP intercept (handle_stop()), KVM_S390_INITIAL_RESET and 
>>    KVM_S390_NORMAL_RESET with old user space -- 
>>    !kvm_s390_user_cpu_state_ctrl().
>>
>> b) KVM_MP_STATE_STOPPED for modern user space.
>>
>>
>>
>> Would the following solve our SIGP STOP * issue w.o. uapi changes?
>>
> 
> A quick pass shows some promise, but I haven't the bandwidth to throw
> the battery of stuff at it. I'll have to take a closer look after the
> US Holiday to give a better answer. (Example: looking for
> IRQ_PEND_SIGP_STOP || IRQ_PEND_RESTART is trivial.)

Yes, extending to IRQ_PEND_RESTART would make sense.

Eric Farman Nov. 30, 2021, 8:11 p.m. UTC | #21

On Tue, 2021-11-23 at 19:44 +0100, David Hildenbrand wrote:
> > > > I ALSO took a stab at folding this into the S390 IRQ paths [2],
> > > > similar
> > > > to what was done with kvm_s390_stop_info. This worked
> > > > reasonably
> > > > well,
> > > > except the QEMU interface kvm_s390_vcpu_interrupt() returns a
> > > > void,
> > > > and
> > > > so wouldn't notice an error sent back by KVM. Not a deal
> > > > breaker,
> > > > but
> > > > having not heard anything to this idea, I didn't go much
> > > > farther.
> > > 
> > > We only care about SIGP STOP* handling so far, if anybody is
> > > aware of
> > > other issues
> > > that need fixing, it would be helpful  to spell them out. 
> > 
> > Yes, I keep using SIGP STOP* as an example because it offers (A) a
> > clear miscommunication with the status bits returned by SIGP SENSE,
> > and
> > (B) has some special handling in QEMU that to my experience is
> > still
> > incomplete. But I'm seeing issues with other orders, like SIGP
> > RESTART
> > [1] [2] (where QEMU does a kvm_s390_vcpu_interrupt() with
> > KVM_S390_RESTART, and thus adds to pending_irq) and SIGP (INITIAL)
> > CPU
> > RESET [2] (admittedly not greatly researched).
> 
> Sorry I missed that discussion previously. Summarizing what the PoP
> says:
> 
> Once we have a pending:
> * start (-> synchronous)
> * stop (-> asynchronous)
> * restart (-> asynchronous)
> * stop-and-store-status (-> asynchronous)
> * set-prefix (-> synchronous)
> * store-status-at-address (-> synchronous)
> * store-additional-status-at-address (-> synchronous)
> * initial-CPU-reset (-> synchronous)
> * CPU-reset order (-> synchronous)

I would argue that these last two are not necessarily synchronous.
Looking at QEMU...

handle_sigp_single_dst()
sigp_(initial_)cpu_reset()
s390_cpu_reset()
>> If kvm_enabled(), call
   kvm_s390_reset_vcpu_initial()
   -or-
   kvm_s390_reset_vcpu_normal()
kvm_s390_reset_vcpu()
kvm_vcpu_ioctl()

(Switch to KVM code)
kvm_vcpu_ioctl()
 -- Acquire vcpu->mutex
kvm_arch_vcpu_ioctl()
kvm_arch_vcpu_ioctl_initial_reset()
 -and/or-
kvm_arch_vcpu_ioctl_normal_reset()

So, given that, couldn't it be possible for a SIGP SENSE to be sent to
a CPU that is currently processing one of the RESET orders? The mutex
acquired as part of the ioctl would end up gating one of them, when
according to POPS a subsequent SENSE should get CC2 until the reset
completes. Unlike STOP*/RESTART, there's no IRQ that we can key off of
to know that the RESET is still in process, unless we define another
IOCTL as in v2-v4 here.

> 
> The following orders have to return busy (my take: only a must if the
> guest could observe the difference):
> * sense
> * external call
> * emergency signal

These ones:

> * start
> * stop
> * restart
> * stop and store status
> * set prefix
> * store status at address
> * set architecture
> * set multithreading
> * store additional status at address

... are handled in userspace, which QEMU serializes in handle_sigp()
and returns CC2 today:

    if (qemu_mutex_trylock(&qemu_sigp_mutex)) {
        ret = SIGP_CC_BUSY;
        goto out;
    }

> 
> We have to ask ourselves
> 
> a) How could a single VCPU observe the difference if executing any
> other
> instruction immediately afterwards. My take would be that for
> synchronous orders we can't really. So we're left with:
> * stop (-> asynchronous)
> * restart (-> asynchronous)
> * stop-and-store-status (-> asynchronous)
> 
> b) How could multiple VCPUs observe the difference that a single VCPU
> can't observe. That will require more thought, but I think it will be
> hardly possible.
> 
> 
> We know that SIGP STOP * needs care.
> 
> SIGP RESTART is interesting. We inject it only for OPERATING cpus and
> it
> will only change the LC psw. What if we execute immediately
> afterwards:
> 
> * sense -> does not affect any bits
> * external call -> has higher IRQ priority. There could be a
> difference
>   if injected before or after the restart was delivered. Could be
> fixed
>   in the kernel (check IRQ_PEND_RESTART).
> * emergency signal -> has higher IRQ priority. There could be a
>   difference if injected before or after the restart was delivered.
>   Could be fixed in the kernel (check IRQ_PEND_RESTART).
> * start -> CPU is already operating
> * stop -> restart is delivered first
> * restart -> I think the lowcore will look different if we inject two
>   RESTARTs immediately afterwards instead of only a single
>   one. Could be fixed in the kernel (double-deliver the interrupt).
> * stop and store status -> restart is delivered first
> * set prefix -> CPU is operating, not possible
> * store status at address -> CPU is operating, not possible
> * set architecture -> don't care
> * set multithreading -> don't care
> * store additional status at address -> CPU is operating, not
> possible
> * initial-CPU-reset -> clears local IRQ. LC will look different if
>   RESTART was delivered or not. Could be fixed in the kernel quite
>   easily (deliver restart first before clearing interrupts).
> * CPU-reset -> clears local IRQs. LC will look different if
>   injected before vs. after. Could be fixed in the kernel quite
>   easily (deliver restart first before clearing interrupts)..

These might be of value. I don't yet have a clear order of events in
these scenarios, but will keep this in mind as I am seeing some
oddities there.

> 
> external call as handled by the SIGP interpretation facility will
> already also violate that description. We don't know that a SIGP
> restart
> is pending. We'd have to disable the SIGP interpretation facility
> temporarily.
> 
> /me shivers

/me too

> 
> This sounds like the kind of things we should happily not be fixing
> because nobody might really care :)
> 
> 

Hi, me again, really hoping I don't care about this aspect of it. :)

> 
> > The reason for why I have no spent a lot of time in the latter is
> > that
> > I have also mentioned that POPS has lists of orders that will
> > return
> > busy [3], and so something more general is perhaps warranted. The
> > QEMU
> > RFC's don't handle anything further than SIGP STOP*, on account of
> > it
> > makes sense to get the interface right first.
> 
> Right. My take is to have a look what we actually have to fix --
> where
> the guest can actually observe the difference. If the guest can't
> observe the difference there is no need to actually implement BUSY
> logic
> as instructed in the PoP.

My concern is largely with SIGP SENSE giving a clear answer for the
state of the cpu. Since most of the SIGP orders have some variation of
"may not occur/be completed during the execution of SIGNAL PROCESSOR"
in POPs, the SIGP SENSE is a quick mechanism (being defined as a "fast"
order) to determine if another order has completed or is still in-
process, and if the cpu was left in the expected state at the
completion of the order.

It does suggest that those synchronous orders could offer a misleading
answer, but since there's no (obvious) loss of control in those paths,
it's not as big a concern for me.

> 
> > > I'll keep assuming that
> > > only SIGP STOP*  needs fixing, as I already explained.
> > > 
> > > Whenever QEMU tells a CPU to stop asynchronously, it does so via
> > > a
> > > STOP IRQ from
> > > the destination CPU thread via KVM_S390_SIGP_STOP. Then, we know
> > > the
> > > CPU is busy
> > > ... until we clear that interrupt, which happens via
> > > kvm_s390_clear_stop_irq().
> > > 
> > > Interestingly, we clear that interrupt via two paths:
> > > 
> > > 1) kvm_s390_clear_local_irqs(), via KVM_S390_INITIAL_RESET and 
> > >    KVM_S390_NORMAL_RESET. Here, we expect that new user space
> > > also
> > > sets  
> > >    the CPU to stopped via KVM_MP_STATE_STOPPED. In fact, modern
> > > QEMU 
> > >    properly sets the CPU stopped before triggering clearing of
> > > the 
> > >    interrupts (s390_cpu_reset()).
> > > 2) kvm_s390_clear_stop_irq() via kvm_s390_vcpu_stop(), which is 
> > >    triggered via:
> > > 
> > > a) STOP intercept (handle_stop()), KVM_S390_INITIAL_RESET and 
> > >    KVM_S390_NORMAL_RESET with old user space -- 
> > >    !kvm_s390_user_cpu_state_ctrl().
> > > 
> > > b) KVM_MP_STATE_STOPPED for modern user space.
> > > 
> > > 
> > > 
> > > Would the following solve our SIGP STOP * issue w.o. uapi
> > > changes?
> > > 
> > 
> > A quick pass shows some promise, but I haven't the bandwidth to
> > throw
> > the battery of stuff at it. I'll have to take a closer look after
> > the
> > US Holiday to give a better answer. (Example: looking for
> > IRQ_PEND_SIGP_STOP || IRQ_PEND_RESTART is trivial.)
> 
> Yes, extending to IRQ_PEND_RESTART would make sense.
> 

Running my stressors at this combined patch goes well. Going to work on
some additional ones this week, with some debug on the reset orders.

Thanks,
Eric

[RFC,v3,2/2] KVM: s390: Extend the USER_SIGP capability

Commit Message

Comments

Patch