
[4/5] KVM: add __kvm_request_needs_mb

Message ID 20170216160449.13094-5-rkrcmar@redhat.com (mailing list archive)
State New, archived

Commit Message

Radim Krčmář Feb. 16, 2017, 4:04 p.m. UTC
A macro to optimize requests that do not need a memory barrier because
they have no dependencies.  An architecture can implement a function
that says which requests do not need memory barriers when handling them.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
---
 include/linux/kvm_host.h | 41 +++++++++++++++++++++++++++++++++++++----
 virt/kvm/kvm_main.c      |  3 ++-
 2 files changed, 39 insertions(+), 5 deletions(-)

Comments

David Hildenbrand Feb. 16, 2017, 7:49 p.m. UTC | #1
Am 16.02.2017 um 17:04 schrieb Radim Krčmář:
> A macro to optimize requests that do not need a memory barrier because
> they have no dependencies.  An architecture can implement a function
> that says which requests do not need memory barriers when handling them.
> 
> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
> ---
>  include/linux/kvm_host.h | 41 +++++++++++++++++++++++++++++++++++++----
>  virt/kvm/kvm_main.c      |  3 ++-
>  2 files changed, 39 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index d899473859d3..2cc438685af8 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1097,8 +1097,8 @@ static inline int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
>   *  2) remote request with no data (= kick)
>   *  3) remote request with data (= kick + mb)
>   *
> - * TODO: the API is inconsistent -- a request doesn't call kvm_vcpu_kick(), but
> - * forces smp_wmb() for all requests.
> + * TODO: the API does not distinguish local and remote requests -- remote
> + * should contain kvm_vcpu_kick().
>   */

Just for your info, kvm_vcpu_kick() and kvm_make_all_cpus_request() do
not work on s390x (and in their current form never will). I tried to make
them work once, but I gave up.

s390x uses kvm_s390_sync_request()->kvm_s390_vcpu_request() to kick a
guest out of guest mode. A special bit in the SIE control block is used
to perform the kick (exit_sie(), STOP request), and another bit to
prevent the guest from reentering the SIE, until the request has been
handled (to avoid races).

This is really complicated stuff, and the basic reason for it (if I
remember correctly) is that s390x does reenable all interrupts when
entering the sie (see kvm-s390.c:__vcpu_run()). So the fancy smp-based
kicks don't work (as it is otherwise just racy), and if I remember
correctly, SMP reschedule signals (s390x external calls) would be
slower. (Christian, please correct me if I'm wrong)

So this statement is, at least from an s390x point of view, wrong. The
kvm_vcpu_kick() function would have to be rerouted to an appropriate
s390x implementation (or that whole smp and OUTSIDE_GUEST_MODE stuff
would have to be factored out).
Radim Krčmář Feb. 16, 2017, 9:31 p.m. UTC | #2
2017-02-16 20:49+0100, David Hildenbrand:
> Am 16.02.2017 um 17:04 schrieb Radim Krčmář:
>> A macro to optimize requests that do not need a memory barrier because
>> they have no dependencies.  An architecture can implement a function
>> that says which requests do not need memory barriers when handling them.
>> 
>> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
>> ---
>>  include/linux/kvm_host.h | 41 +++++++++++++++++++++++++++++++++++++----
>>  virt/kvm/kvm_main.c      |  3 ++-
>>  2 files changed, 39 insertions(+), 5 deletions(-)
>> 
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index d899473859d3..2cc438685af8 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -1097,8 +1097,8 @@ static inline int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
>>   *  2) remote request with no data (= kick)
>>   *  3) remote request with data (= kick + mb)
>>   *
>> - * TODO: the API is inconsistent -- a request doesn't call kvm_vcpu_kick(), but
>> - * forces smp_wmb() for all requests.
>> + * TODO: the API does not distinguish local and remote requests -- remote
>> + * should contain kvm_vcpu_kick().
>>   */
> 
> Just for your info, kvm_vcpu_kick() and kvm_make_all_cpus_request() do
> not work on s390x (and in its current form never will). I tried to make
> it work once, but I gave up.
> 
> s390x uses kvm_s390_sync_request()->kvm_s390_vcpu_request() to kick a
> guest out of guest mode. A special bit in the SIE control block is used
> to perform the kick (exit_sie(), STOP request), and another bit to
> prevent the guest from reentering the SIE, until the request has been
> handled (to avoid races).

Hm, kvm_s390_vcpu_wakeup() looks more suitable as the s390
implementation of kvm_vcpu_kick() (which is what we want to be connected
with kvm_request).

I think that kvm_s390_sync_request() is a different idea as it does not
call swait_active(), so the request is delayed if the VCPU is halted.
And kvm_s390_sync_request() also waits for the VCPU to actually exit,
which pushes it even further away from what other requests do. :)

I would rather use bitops/barriers/kicks directly if the use of
kvm_request helpers is too diverse.

> This is really complicated stuff, and the basic reason for it (if I
> remember correctly) is that s390x does reenable all interrupts when
> entering the sie (see kvm-s390.c:__vcpu_run()). So the fancy smp-based
> kicks don't work (as it is otherwise just racy), and if I remember
> correctly, SMP reschedule signals (s390x external calls) would be
> slower. (Christian, please correct me if I'm wrong)

Having a different mechanism for the exit itself is ok, just the general
behavior has to stay the same -- kvm_vcpu_kick() does whatever is needed
to let the VCPU notice possible changes in state as soon as possible and
doesn't care about the VCPU any further.

If other architectures had a fast mechanism that forced an immediate
VCPU exit, they would gladly use it as well.

> So this statement, is at least from a s390x point of view wrong. The
> kvm_vcpu_kick() function would have to be rerouted to an appropriate
> s390x implementation (or that whole smp and OUTSIDE_GUEST_MODE stuff
> would have to be factored out).

I agree.  What about starting by adding __weak on functions that are
currently "#ifndef CONFIG_S390" and letting s390 completely reimplement
them?

Thanks for the info!
Christian Borntraeger Feb. 17, 2017, 8:46 a.m. UTC | #3
On 02/16/2017 08:49 PM, David Hildenbrand wrote:
> Am 16.02.2017 um 17:04 schrieb Radim Krčmář:
>> A macro to optimize requests that do not need a memory barrier because
>> they have no dependencies.  An architecture can implement a function
>> that says which requests do not need memory barriers when handling them.
>>
>> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
>> ---
>>  include/linux/kvm_host.h | 41 +++++++++++++++++++++++++++++++++++++----
>>  virt/kvm/kvm_main.c      |  3 ++-
>>  2 files changed, 39 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index d899473859d3..2cc438685af8 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -1097,8 +1097,8 @@ static inline int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
>>   *  2) remote request with no data (= kick)
>>   *  3) remote request with data (= kick + mb)
>>   *
>> - * TODO: the API is inconsistent -- a request doesn't call kvm_vcpu_kick(), but
>> - * forces smp_wmb() for all requests.
>> + * TODO: the API does not distinguish local and remote requests -- remote
>> + * should contain kvm_vcpu_kick().
>>   */
> 
> Just for your info, kvm_vcpu_kick() and kvm_make_all_cpus_request() do
> not work on s390x (and in its current form never will). I tried to make
> it work once, but I gave up.
> 
> s390x uses kvm_s390_sync_request()->kvm_s390_vcpu_request() to kick a
> guest out of guest mode. A special bit in the SIE control block is used
> to perform the kick (exit_sie(), STOP request), and another bit to
> prevent the guest from reentering the SIE, until the request has been
> handled (to avoid races).
> 
> This is really complicated stuff, and the basic reason for it (if I
> remember correctly) is that s390x does reenable all interrupts when
> entering the sie (see kvm-s390.c:__vcpu_run()). So the fancy smp-based
> kicks don't work (as it is otherwise just racy), and if I remember
> correctly, SMP reschedule signals (s390x external calls) would be
> slower. (Christian, please correct me if I'm wrong)

No, the reason was that there are some requests that need to be handled
outside of SIE. For example, one reason was the guest prefix page.
This must be mapped read/write ALL THE TIME while a guest is running,
otherwise the host might crash. So we have to exit SIE and make sure that
it does not reenter; therefore we use the RELOAD_MMU request from a notifier
that is called from page table functions whenever memory management decides
to unmap/write-protect (dirty page tracking, reference tracking, page migration
or compaction...).

SMP-based requests will kick the guest out, but for some things like the
one above that would be too late.

 
> So this statement, is at least from a s390x point of view wrong. The
> kvm_vcpu_kick() function would have to be rerouted to an appropriate
> s390x implementation (or that whole smp and OUTSIDE_GUEST_MODE stuff
> would have to be factored out).
>
David Hildenbrand Feb. 17, 2017, 10:13 a.m. UTC | #4
>> This is really complicated stuff, and the basic reason for it (if I
>> remember correctly) is that s390x does reenable all interrupts when
>> entering the sie (see kvm-s390.c:__vcpu_run()). So the fancy smp-based
>> kicks don't work (as it is otherwise just racy), and if I remember
>> correctly, SMP reschedule signals (s390x external calls) would be
>> slower. (Christian, please correct me if I'm wrong)
> 
> No, the reason was that there are some requests that need to be handled
> outside of SIE. For example, one reason was the guest prefix page.
> This must be mapped read/write ALL THE TIME while a guest is running,
> otherwise the host might crash. So we have to exit SIE and make sure that
> it does not reenter; therefore we use the RELOAD_MMU request from a notifier
> that is called from page table functions whenever memory management decides
> to unmap/write-protect (dirty page tracking, reference tracking, page migration
> or compaction...).
> 
> SMP-based requests will kick the guest out, but for some things like the
> one above that would be too late.

While what you said is 100% correct, I had something else in mind that
hindered using vcpu_kick() and especially kvm_make_all_cpus_request().
And I remember that being related to how preemption and
OUTSIDE_GUEST_MODE is handled. I think this boils down to what would
have to be implemented in kvm_arch_vcpu_should_kick().

x86 can track the guest state using vcpu->mode, because they can be sure
that the guest can't reschedule while in the critical guest entry/exit
section. This is not true for s390x, as preemption is enabled. That's
why vcpu->mode cannot be used in its current form to track if a VCPU is
inside/outside/exiting guest mode. And kvm_make_all_cpus_request() currently
relies on this setting.

For now, calling vcpu_kick() on s390x will result in a BUG().


On s390x, there are 3 use cases I see for requests:

1. Remote requests that need a sync

Make a request, wait until SIE has been left and make sure the request
will be processed before re-entering the SIE, e.g. the KVM_REQ_RELOAD_MMU
request from the mmu notifier you mentioned. Also KVM_REQ_DISABLE_IBS is a
candidate.

2. Remote requests that don't need a sync

E.g. KVM_REQ_ENABLE_IBS doesn't strictly need it, while
KVM_REQ_DISABLE_IBS does.

3. local requests

E.g. KVM_REQ_TLB_FLUSH from kvm_s390_set_prefix()


Of course, having a unified interface would be better.

/* set the request and kick the CPU out of guest mode */
kvm_set_request(req, vcpu);

/* set the request, kick the CPU out of guest mode, wait until guest
mode has been left and make sure the request will be handled before
reentering guest mode */
kvm_set_sync_request(req, vcpu);


Same maybe even for multiple VCPUs (as there are then ways to speed it
up, e.g. first kick all, then wait for all)

This would require arch specific callbacks to
1. pre announce the request (e.g. set PROG_REQUEST on s390x)
2. kick the cpu (e.g. CPUSTAT_STOP_INT and later
kvm_s390_vsie_kick(vcpu) on s390x)
3. check if still executing the guest (e.g. PROG_IN_SIE on s390x)

This would only make sense if there are other use cases for sync
requests. At least I remember that Power also has a faster way for
kicking VCPUs, not involving SMP rescheds. I can't judge if this is a
s390x-only thing that is better left as is :)

At least vcpu_kick() could be quite easily made to work on s390x.

Radim, are there also other users that need something like sync requests?

> 
>  
>> So this statement, is at least from a s390x point of view wrong. The
>> kvm_vcpu_kick() function would have to be rerouted to an appropriate
>> s390x implementation (or that whole smp and OUTSIDE_GUEST_MODE stuff
>> would have to be factored out).
>>
>
Christian Borntraeger Feb. 17, 2017, 10:19 a.m. UTC | #5
On 02/17/2017 11:13 AM, David Hildenbrand wrote:
> 
>>> This is really complicated stuff, and the basic reason for it (if I
>>> remember correctly) is that s390x does reenable all interrupts when
>>> entering the sie (see kvm-s390.c:__vcpu_run()). So the fancy smp-based
>>> kicks don't work (as it is otherwise just racy), and if I remember
>>> correctly, SMP reschedule signals (s390x external calls) would be
>>> slower. (Christian, please correct me if I'm wrong)
>>
>> No, the reason was that there are some requests that need to be handled
>> outside of SIE. For example, one reason was the guest prefix page.
>> This must be mapped read/write ALL THE TIME while a guest is running,
>> otherwise the host might crash. So we have to exit SIE and make sure that
>> it does not reenter; therefore we use the RELOAD_MMU request from a notifier
>> that is called from page table functions whenever memory management decides
>> to unmap/write-protect (dirty page tracking, reference tracking, page migration
>> or compaction...).
>>
>> SMP-based requests will kick the guest out, but for some things like the
>> one above that would be too late.
> 
> While what you said is 100% correct, I had something else in mind that
> hindered using vcpu_kick() and especially kvm_make_all_cpus_request().
> And I remember that being related to how preemption and
> OUTSIDE_GUEST_MODE is handled. I think this boils down to what would
> have to be implemented in kvm_arch_vcpu_should_kick().
> 
> x86 can track the guest state using vcpu->mode, because they can be sure
> that the guest can't reschedule while in the critical guest entry/exit
> section. This is not true for s390x, as preemption is enabled. That's
> why vcpu->mode cannot be used in its current form to track if a VCPU is
> inside/outside/exiting guest mode. And kvm_make_all_cpus_request() currently
> relies on this setting.
> 
> For now, calling vcpu_kick() on s390x will result in a BUG().

Yes, you are right, preemption and interrupt handling certainly made it
hard to switch to the common code function. The prefix page oddity was
the original reason for implementing the s390 special code since
we realized that even mlocking and pinning did not guarantee that
the page stays writable in the page table.


> 
> 
> On s390x, there are 3 use cases I see for requests:
> 
> 1. Remote requests that need a sync
> 
> Make a request, wait until SIE has been left and make sure the request
> will be processed before re-entering the SIE. e.g. KVM_REQ_RELOAD_MMU
> notifier in mmu notifier you mentioned. Also KVM_REQ_DISABLE_IBS is a
> candidate.
> 
> 2. Remote requests that don't need a sync
> 
> E.g. KVM_REQ_ENABLE_IBS doesn't strictly need it, while
> KVM_REQ_DISABLE_IBS does.
> 
> 3. local requests
> 
> E.g. KVM_REQ_TLB_FLUSH from kvm_s390_set_prefix()
> 
> 
> Of course, having a unified interface would be better.

Yes, we tried to be as common as possible. This enabled us to 
use things like halt polling. 
Radim's idea of a weak function could work out; I have not checked yet.

> 
> /* set the request and kick the CPU out of guest mode */
> kvm_set_request(req, vcpu);
> 
> /* set the request, kick the CPU out of guest mode, wait until guest
> mode has been left and make sure the request will be handled before
> reentering guest mode */
> kvm_set_sync_request(req, vcpu);
> 
> 
> Same maybe even for multiple VCPUs (as there are then ways to speed it
> up, e.g. first kick all, then wait for all)
> 
> This would require arch specific callbacks to
> 1. pre announce the request (e.g. set PROG_REQUEST on s390x)
> 2. kick the cpu (e.g. CPUSTAT_STOP_INT and later
> kvm_s390_vsie_kick(vcpu) on s390x)
> 3. check if still executing the guest (e.g. PROG_IN_SIE on s390x)
> 
> This would only make sense if there are other use cases for sync
> requests. At least I remember that Power also has a faster way for
> kicking VCPUs, not involving SMP rescheds. I can't judge if this is a
> s390x only thing and is better be left as is :)
> 
> At least vcpu_kick() could be quite easily made to work on s390x.
> 
> Radim, are there also other users that need something like sync requests?
> 
>>
>>  
>>> So this statement, is at least from a s390x point of view wrong. The
>>> kvm_vcpu_kick() function would have to be rerouted to an appropriate
>>> s390x implementation (or that whole smp and OUTSIDE_GUEST_MODE stuff
>>> would have to be factored out).
>>>
>>
> 
>
Christian Borntraeger Feb. 17, 2017, 11:28 a.m. UTC | #6
On 02/17/2017 11:13 AM, David Hildenbrand wrote:
> 
>>> This is really complicated stuff, and the basic reason for it (if I
>>> remember correctly) is that s390x does reenable all interrupts when
>>> entering the sie (see kvm-s390.c:__vcpu_run()). So the fancy smp-based
>>> kicks don't work (as it is otherwise just racy), and if I remember
>>> correctly, SMP reschedule signals (s390x external calls) would be
>>> slower. (Christian, please correct me if I'm wrong)
>>
>> No, the reason was that there are some requests that need to be handled
>> outside of SIE. For example, one reason was the guest prefix page.
>> This must be mapped read/write ALL THE TIME while a guest is running,
>> otherwise the host might crash. So we have to exit SIE and make sure that
>> it does not reenter; therefore we use the RELOAD_MMU request from a notifier
>> that is called from page table functions whenever memory management decides
>> to unmap/write-protect (dirty page tracking, reference tracking, page migration
>> or compaction...).
>>
>> SMP-based requests will kick the guest out, but for some things like the
>> one above that would be too late.
> 
> While what you said is 100% correct, I had something else in mind that
> hindered using vcpu_kick() and especially kvm_make_all_cpus_request().
> And I remember that being related to how preemption and
> OUTSIDE_GUEST_MODE is handled. I think this boils down to what would
> have to be implemented in kvm_arch_vcpu_should_kick().
> 
> x86 can track the guest state using vcpu->mode, because they can be sure
> that the guest can't reschedule while in the critical guest entry/exit
> section. This is not true for s390x, as preemption is enabled. That's
> why vcpu->mode cannot be used in its current form to track if a VCPU is
> inside/outside/exiting guest mode. And kvm_make_all_cpus_request() currently
> relies on this setting.
> 
> For now, calling vcpu_kick() on s390x will result in a BUG().
> 
> 
> On s390x, there are 3 use cases I see for requests:
> 
> 1. Remote requests that need a sync
> 
> Make a request, wait until SIE has been left and make sure the request
> will be processed before re-entering the SIE. e.g. KVM_REQ_RELOAD_MMU
> notifier in mmu notifier you mentioned. Also KVM_REQ_DISABLE_IBS is a
> candidate.
> 
> 2. Remote requests that don't need a sync
> 
> E.g. KVM_REQ_ENABLE_IBS doesn't strictly need it, while
> KVM_REQ_DISABLE_IBS does.
> 
> 3. local requests
> 
> E.g. KVM_REQ_TLB_FLUSH from kvm_s390_set_prefix()
> 
> 
> Of course, having a unified interface would be better.
> 
> /* set the request and kick the CPU out of guest mode */
> kvm_set_request(req, vcpu);
> 
> /* set the request, kick the CPU out of guest mode, wait until guest
> mode has been left and make sure the request will be handled before
> reentering guest mode */
> kvm_set_sync_request(req, vcpu);
> 
> 
> Same maybe even for multiple VCPUs (as there are then ways to speed it
> up, e.g. first kick all, then wait for all)
> 
> This would require arch specific callbacks to
> 1. pre announce the request (e.g. set PROG_REQUEST on s390x)
> 2. kick the cpu (e.g. CPUSTAT_STOP_INT and later
> kvm_s390_vsie_kick(vcpu) on s390x)
> 3. check if still executing the guest (e.g. PROG_IN_SIE on s390x)
> 
> This would only make sense if there are other use cases for sync
> requests. At least I remember that Power also has a faster way for
> kicking VCPUs, not involving SMP rescheds. I can't judge if this is a
> s390x only thing and is better be left as is :)


Hmmm, maybe we should simply enable kvm_vcpu_kick and kvm_vcpu_wake_up in
common code. They should work for the cases used from common code. We must still
keep the s390-specific functions and call those from within s390 code
when necessary. Back then you tried to replace our functions and that had
issues (functionality-wise and speed-wise); maybe just keeping those
is the easiest solution.

For kvm_make_all_cpus_request we already have the slow path of waking all CPUs,
so that would be ok. So why not have vcpu->mode = IN_GUEST_MODE for the whole
inner loop of s390?

I will try to come up with a patch.

Christian
Radim Krčmář Feb. 22, 2017, 3:17 p.m. UTC | #7
[Oops, the end of this thread got dragged into a mark-as-read spree ...]

2017-02-17 11:13+0100, David Hildenbrand:
>>> This is really complicated stuff, and the basic reason for it (if I
>>> remember correctly) is that s390x does reenable all interrupts when
>>> entering the sie (see kvm-s390.c:__vcpu_run()). So the fancy smp-based
>>> kicks don't work (as it is otherwise just racy), and if I remember
>>> correctly, SMP reschedule signals (s390x external calls) would be
>>> slower. (Christian, please correct me if I'm wrong)
>> 
>> No, the reason was that there are some requests that need to be handled
>> outside of SIE. For example, one reason was the guest prefix page.
>> This must be mapped read/write ALL THE TIME while a guest is running,
>> otherwise the host might crash. So we have to exit SIE and make sure that
>> it does not reenter; therefore we use the RELOAD_MMU request from a notifier
>> that is called from page table functions whenever memory management decides
>> to unmap/write-protect (dirty page tracking, reference tracking, page migration
>> or compaction...).
>>
>> SMP-based requests will kick the guest out, but for some things like the
>> one above that would be too late.
> 
> While what you said is 100% correct, I had something else in mind that
> hindered using vcpu_kick() and especially kvm_make_all_cpus_request().
> And I remember that being related to how preemption and
> OUTSIDE_GUEST_MODE is handled. I think this boils down to what would
> have to be implemented in kvm_arch_vcpu_should_kick().
> 
> x86 can track the guest state using vcpu->mode, because they can be sure
> that the guest can't reschedule while in the critical guest entry/exit
> section. This is not true for s390x, as preemption is enabled. That's
> why vcpu->mode cannot be used in its current form to track if a VCPU is
> inside/outside/exiting guest mode. And kvm_make_all_cpus_request() currently
> relies on this setting.
> 
> For now, calling vcpu_kick() on s390x will result in a BUG().
> 
> 
> On s390x, there are 3 use cases I see for requests:
> 
> 1. Remote requests that need a sync
> 
> Make a request, wait until SIE has been left and make sure the request
> will be processed before re-entering the SIE. e.g. KVM_REQ_RELOAD_MMU
> notifier in mmu notifier you mentioned. Also KVM_REQ_DISABLE_IBS is a
> candidate.

Btw. aren't those requests racy?

    void exit_sie(struct kvm_vcpu *vcpu)
    {
    	atomic_or(CPUSTAT_STOP_INT, &vcpu->arch.sie_block->cpuflags);

If you get stalled here and the target VCPU handles the request and
reenters SIE in the meantime, then you'll wait until its next exit.
(And miss an unbounded number of exits in the worst case.)

    	while (vcpu->arch.sie_block->prog0c & PROG_IN_SIE)
    		cpu_relax();
    }

And out of curiosity -- how many cycles does this loop usually take?

> 2. Remote requests that don't need a sync
> 
> E.g. KVM_REQ_ENABLE_IBS doesn't strictly need it, while
> KVM_REQ_DISABLE_IBS does.

A usual KVM request would kick the VCPU out of nested virt as well.
Shouldn't it be done for these as well?

> 3. local requests
> 
> E.g. KVM_REQ_TLB_FLUSH from kvm_s390_set_prefix()
> 
> 
> Of course, having a unified interface would be better.
> 
> /* set the request and kick the CPU out of guest mode */
> kvm_set_request(req, vcpu);
> 
> /* set the request, kick the CPU out of guest mode, wait until guest
> mode has been left and make sure the request will be handled before
> reentering guest mode */
> kvm_set_sync_request(req, vcpu);

Sounds good, I'll also add

  kvm_set_self_request(req, vcpu);

> Same maybe even for multiple VCPUs (as there are then ways to speed it
> up, e.g. first kick all, then wait for all)
> 
> This would require arch specific callbacks to
> 1. pre announce the request (e.g. set PROG_REQUEST on s390x)
> 2. kick the cpu (e.g. CPUSTAT_STOP_INT and later
> kvm_s390_vsie_kick(vcpu) on s390x)
> 3. check if still executing the guest (e.g. PROG_IN_SIE on s390x)
> 
> This would only make sense if there are other use cases for sync
> requests. At least I remember that Power also has a faster way for
> kicking VCPUs, not involving SMP rescheds. I can't judge if this is a
> s390x only thing and is better be left as is :)
> 
> At least vcpu_kick() could be quite easily made to work on s390x.
> 
> Radim, are there also other users that need something like sync requests?

I think that ARM has a similar need when updating vgic, but relies on an
assumption that VCPUs are going to be out after kicking them with
kvm_make_all_cpus_request().
(vgic_change_active_prepare in virt/kvm/arm/vgic/vgic-mmio.c)

Having synchronous requests in a common API should probably wait for the
completion of the request, not just for the kick, which would make race
handling simpler.

I'm not going to worry about them in this pass, though.

Thanks.
Christian Borntraeger Feb. 22, 2017, 7:23 p.m. UTC | #8
On 02/22/2017 04:17 PM, Radim Krčmář wrote:
> [Oops, the end of this thread got dragged into a mark-as-read spree ...]
> 
> 2017-02-17 11:13+0100, David Hildenbrand:
>>>> This is really complicated stuff, and the basic reason for it (if I
>>>> remember correctly) is that s390x does reenable all interrupts when
>>>> entering the sie (see kvm-s390.c:__vcpu_run()). So the fancy smp-based
>>>> kicks don't work (as it is otherwise just racy), and if I remember
>>>> correctly, SMP reschedule signals (s390x external calls) would be
>>>> slower. (Christian, please correct me if I'm wrong)
>>>
>>> No, the reason was that there are some requests that need to be handled
>>> outside of SIE. For example, one reason was the guest prefix page.
>>> This must be mapped read/write ALL THE TIME while a guest is running,
>>> otherwise the host might crash. So we have to exit SIE and make sure that
>>> it does not reenter; therefore we use the RELOAD_MMU request from a notifier
>>> that is called from page table functions whenever memory management decides
>>> to unmap/write-protect (dirty page tracking, reference tracking, page migration
>>> or compaction...).
>>>
>>> SMP-based requests will kick the guest out, but for some things like the
>>> one above that would be too late.
>>
>> While what you said is 100% correct, I had something else in mind that
>> hindered using vcpu_kick() and especially kvm_make_all_cpus_request().
>> And I remember that being related to how preemption and
>> OUTSIDE_GUEST_MODE is handled. I think this boils down to what would
>> have to be implemented in kvm_arch_vcpu_should_kick().
>>
>> x86 can track the guest state using vcpu->mode, because they can be sure
>> that the guest can't reschedule while in the critical guest entry/exit
>> section. This is not true for s390x, as preemption is enabled. That's
>> why vcpu->mode cannot be used in its current form to track if a VCPU is
>> in/oustide/exiting guest mode. And kvm_make_all_cpus_request() currently
>> relies on this setting.
>>
>> For now, calling vcpu_kick() on s390x will result in a BUG().
>>
>>
>> On s390x, there are 3 use cases I see for requests:
>>
>> 1. Remote requests that need a sync
>>
>> Make a request, wait until SIE has been left and make sure the request
>> will be processed before re-entering the SIE. e.g. KVM_REQ_RELOAD_MMU
>> notifier in mmu notifier you mentioned. Also KVM_REQ_DISABLE_IBS is a
>> candidate.
> 
> Btw. aren't those requests racy?
> 
>     void exit_sie(struct kvm_vcpu *vcpu)
>     {
>     	atomic_or(CPUSTAT_STOP_INT, &vcpu->arch.sie_block->cpuflags);
> 
> If you get stalled here and the target VCPU handles the request and
> reenters SIE in the meantime, then you'll wait until its next exit.
> (And miss an unbounded amount of exits in the worst case.)
> 
>     	while (vcpu->arch.sie_block->prog0c & PROG_IN_SIE)
>     		cpu_relax();
>     }
> 

It's not racy for the purpose it was originally made for (getting the vcpu
out of SIE before we unmap a guest prefix page), as the MMU_RELOAD handler
will wait for the pte lock, which is held by the code that called
kvm_s390_sync_request(KVM_REQ_MMU_RELOAD, vcpu).

We also have the guarantee that after returning from kvm_s390_sync_request
that request will have been handled before we reenter the guest, which is
all we need for DISABLE_IBS.

But yes, all non-MMU_RELOAD users might wait longer, possibly several guest
exits. We never noticed that, as requests are really a rare event: basically
unmapping of the guest prefix page due to paging and migration, switching
between 1 and more guest CPUs, and some other rare events.

> And out of curiosity -- how many cycles does this loop usually take?
> 
>> 2. Remote requests that don't need a sync
>>
>> E.g. KVM_REQ_ENABLE_IBS doesn't strictly need it, while
>> KVM_REQ_DISABLE_IBS does.
> 
> A usual KVM request would kick the VCPU out of nested virt as well.
> Shouldn't it be done for these as well?
> 
>> 3. local requests
>>
>> E.g. KVM_REQ_TLB_FLUSH from kvm_s390_set_prefix()
>>
>>
>> Of course, having a unified interface would be better.
>>
>> /* set the request and kick the CPU out of guest mode */
>> kvm_set_request(req, vcpu);
>>
>> /* set the request, kick the CPU out of guest mode, wait until guest
>> mode has been left and make sure the request will be handled before
>> reentering guest mode */
>> kvm_set_sync_request(req, vcpu);
> 
> Sounds good, I'll also add
> 
>   kvm_set_self_request(req, vcpu);
> 
>> Same maybe even for multiple VCPUs (as there are then ways to speed it
>> up, e.g. first kick all, then wait for all)
>>
>> This would require arch specific callbacks to
>> 1. pre announce the request (e.g. set PROG_REQUEST on s390x)
>> 2. kick the cpu (e.g. CPUSTAT_STOP_INT and later
>> kvm_s390_vsie_kick(vcpu) on s390x)
>> 3. check if still executing the guest (e.g. PROG_IN_SIE on s390x)
>>
>> This would only make sense if there are other use cases for sync
>> requests. At least I remember that Power also has a faster way for
>> kicking VCPUs, not involving SMP rescheds. I can't judge if this is a
>> s390x only thing and is better be left as is :)
>>
>> At least vcpu_kick() could be quite easily made to work on s390x.
>>
>> Radim, are there also other users that need something like sync requests?
> 
> I think that ARM has a similar need when updating vgic, but relies on an
> assumption that VCPUs are going to be out after kicking them with
> kvm_make_all_cpus_request().
> (vgic_change_active_prepare in virt/kvm/arm/vgic/vgic-mmio.c)
> 
> Having synchronous requests in a common API should probably wait for the
> completion of the request, not just for the kick, which would make race
> handling simpler.
> 
> I'm not going to worry about them in this pass, though.
> 
> Thanks.
>
Christian Borntraeger Feb. 22, 2017, 7:57 p.m. UTC | #9
On 02/22/2017 04:17 PM, Radim Krčmář wrote:
> 
[...] 
>    	while (vcpu->arch.sie_block->prog0c & PROG_IN_SIE)
>   		cpu_relax();
>    }

> And out of curiosity -- how many cycles does this loop usually take?

A quick hack indicates something between 3 and 700ns.

>> 2. Remote requests that don't need a sync
>>
>> E.g. KVM_REQ_ENABLE_IBS doesn't strictly need it, while
>> KVM_REQ_DISABLE_IBS does.
> 
> A usual KVM request would kick the VCPU out of nested virt as well.
> Shouldn't it be done for these as well?

A common code function probably should. For some of the cases (again
prefix page handling) we do not need it. For example if we unmap
the guest prefix page, but guest^2 is running this causes no trouble
as long as we handle the request before reentering guest^1. So
not an easy answer.

> 
>> 3. local requests
>>
>> E.g. KVM_REQ_TLB_FLUSH from kvm_s390_set_prefix()
>>
>>
>> Of course, having a unified interface would be better.
>>
>> /* set the request and kick the CPU out of guest mode */
>> kvm_set_request(req, vcpu);
>>
>> /* set the request, kick the CPU out of guest mode, wait until guest
>> mode has been left and make sure the request will be handled before
>> reentering guest mode */
>> kvm_set_sync_request(req, vcpu);
> 
> Sounds good, I'll also add
> 
>   kvm_set_self_request(req, vcpu);
> 
>> Same maybe even for multiple VCPUs (as there are then ways to speed it
>> up, e.g. first kick all, then wait for all)
>>
>> This would require arch specific callbacks to
>> 1. pre announce the request (e.g. set PROG_REQUEST on s390x)
>> 2. kick the cpu (e.g. CPUSTAT_STOP_INT and later
>> kvm_s390_vsie_kick(vcpu) on s390x)
>> 3. check if still executing the guest (e.g. PROG_IN_SIE on s390x)
>>
>> This would only make sense if there are other use cases for sync
>> requests. At least I remember that Power also has a faster way for
>> kicking VCPUs, not involving SMP rescheds. I can't judge if this is an
>> s390x-only thing and is better left as is :)
>>
>> At least vcpu_kick() could be quite easily made to work on s390x.
>>
>> Radim, are there also other users that need something like sync requests?
> 
> I think that ARM has a similar need when updating vgic, but relies on an
> assumption that VCPUs are going to be out after kicking them with
> kvm_make_all_cpus_request().
> (vgic_change_active_prepare in virt/kvm/arm/vgic/vgic-mmio.c)
> 
> Having synchronous requests in a common API should probably wait for the
> completion of the request, not just for the kick, which would make race
> handling simpler.

This would be problematic for our prefix page handling due to locking.
David Hildenbrand Feb. 23, 2017, 10:20 a.m. UTC | #10
Am 22.02.2017 um 20:57 schrieb Christian Borntraeger:
> On 02/22/2017 04:17 PM, Radim Krčmář wrote:
>>
> [...] 
>>    	while (vcpu->arch.sie_block->prog0c & PROG_IN_SIE)
>>   		cpu_relax();
>>    }
> 
>> And out of curiosity -- how many cycles does this loop usually take?
> 
> A quick hack indicates something between 3 and 700ns.
> 
>>> 2. Remote requests that don't need a sync
>>>
>>> E.g. KVM_REQ_ENABLE_IBS doesn't strictly need it, while
>>> KVM_REQ_DISABLE_IBS does.
>>
>> A usual KVM request would kick the VCPU out of nested virt as well.
>> Shouldn't it be done for these as well?
> 
> A common code function probably should. For some of the cases (again
> prefix page handling) we do not need it. For example if we unmap
> the guest prefix page, but guest^2 is running this causes no trouble
> as long as we handle the request before reentering guest^1. So
> not an easy answer.

The problem is, that doing synchronous requests on two sie control
blocks (vsie and "ordinary") is really really hard to implement. I had a
prototype, but it was just ugly. And there was no reason to do it,
because all current requests can live with nested guest being executed
(as it is really all just a matter of coordinating SIE control block
changes for our guest, not involving nested guests).

Also note that VSIE uses these special request bits in the SIE control
block for its own purposes (to catch unmappings of the prefix page, but
this time on the nested address space). We don't want to replace this by
an ordinary request bit (because then we have to exit the VSIE loop much
more often).

I think the VSIE code should not care too much about request bits until
there is really a need for it (meaning: VCPU requests that cannot
tolerate the VSIE running).

Kicking the VSIE from time to time cannot harm. But there are no
guarantees about requests.

> 
>>
>>> 3. local requests
>>>
>>> E.g. KVM_REQ_TLB_FLUSH from kvm_s390_set_prefix()
>>>
>>>
>>> Of course, having a unified interface would be better.
>>>
>>> /* set the request and kick the CPU out of guest mode */
>>> kvm_set_request(req, vcpu);
>>>
>>> /* set the request, kick the CPU out of guest mode, wait until guest
>>> mode has been left and make sure the request will be handled before
>>> reentering guest mode */
>>> kvm_set_sync_request(req, vcpu);
>>
>> Sounds good, I'll also add
>>
>>   kvm_set_self_request(req, vcpu);

or introduce some lightweight check (e.g. against vcpu->cpu)
internally that simply skips the kick and other remote work in case the
vcpu is currently loaded.

>>
>>> Same maybe even for multiple VCPUs (as there are then ways to speed it
>>> up, e.g. first kick all, then wait for all)
>>>
>>> This would require arch specific callbacks to
>>> 1. pre announce the request (e.g. set PROG_REQUEST on s390x)
>>> 2. kick the cpu (e.g. CPUSTAT_STOP_INT and later
>>> kvm_s390_vsie_kick(vcpu) on s390x)
>>> 3. check if still executing the guest (e.g. PROG_IN_SIE on s390x)
>>>
>>> This would only make sense if there are other use cases for sync
>>> requests. At least I remember that Power also has a faster way for
>>> kicking VCPUs, not involving SMP rescheds. I can't judge if this is an
>>> s390x-only thing and is better left as is :)
>>>
>>> At least vcpu_kick() could be quite easily made to work on s390x.
>>>
>>> Radim, are there also other users that need something like sync requests?
>>
>> I think that ARM has a similar need when updating vgic, but relies on an
>> assumption that VCPUs are going to be out after kicking them with
>> kvm_make_all_cpus_request().
>> (vgic_change_active_prepare in virt/kvm/arm/vgic/vgic-mmio.c)
>>
>> Having synchronous requests in a common API should probably wait for the
>> completion of the request, not just for the kick, which would make race
>> handling simpler.
> 
> This would be problematic for our prefix page handling due to locking.
> 

And if I am not wrong, waiting for a request to be handled might take
forever (thinking about VCPUs in user space - maybe even stopped ones
that won't run again).

I think the general notion of synchronous VCPU requests should be: Make
sure the VCPU does not go into guest mode before handling the request.
This includes waiting for it to exit guest mode.
Paolo Bonzini Feb. 23, 2017, 11:01 a.m. UTC | #11
On 16/02/2017 17:04, Radim Krčmář wrote:
> A macro to optimize requests that do not need a memory barrier because
> they have no dependencies.  An architecture can implement a function
> that says which requests do not need memory barriers when handling them.
> 
> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>

I would leave this for a separate series, otherwise looks nice (though I
was skeptical at first ;)).

Paolo

> ---
>  include/linux/kvm_host.h | 41 +++++++++++++++++++++++++++++++++++++----
>  virt/kvm/kvm_main.c      |  3 ++-
>  2 files changed, 39 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index d899473859d3..2cc438685af8 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1097,8 +1097,8 @@ static inline int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
>   *  2) remote request with no data (= kick)
>   *  3) remote request with data (= kick + mb)
>   *
> - * TODO: the API is inconsistent -- a request doesn't call kvm_vcpu_kick(), but
> - * forces smp_wmb() for all requests.
> + * TODO: the API does not distinguish local and remote requests -- remote
> + * should contain kvm_vcpu_kick().
>   */
>  
>  static inline void __kvm_request_set(unsigned req, struct kvm_vcpu *vcpu)
> @@ -1106,6 +1106,37 @@ static inline void __kvm_request_set(unsigned req, struct kvm_vcpu *vcpu)
>  	set_bit(req, &vcpu->requests);
>  }
>  
> +/*
> + * __kvm_request_needs_mb is used to improve performance, so it should have no
> + * runtime overhead.
> + */
> +static inline bool __kvm_request_needs_mb(int req)
> +{
> +	/*
> +	 * This barrier lets callers avoid the following pattern:
> +	 *   if (__kvm_request_needs_mb(req))
> +	 *      ...
> +	 *   else
> +	 *      barrier();
> +	 */
> +	barrier();
> +
> +	if (!__builtin_constant_p(req))
> +		return true;
> +
> +#ifdef kvm_arch_request_needs_mb
> +	/*
> +	 * GCC optimizes pure kvm_arch_request_needs_mb() with a constant input
> +	 * into a constant, but __builtin_constant_p() is not so clever, so we
> +	 * cannot ensure that with:
> +	 * BUILD_BUG_ON(!__builtin_constant_p(kvm_arch_request_needs_mb(req)));
> +	 */
> +	return kvm_arch_request_needs_mb(req);
> +#else
> +	return true;
> +#endif
> +}
> +
>  static inline void kvm_request_set(unsigned req, struct kvm_vcpu *vcpu)
>  {
>  	/*
> @@ -1113,7 +1144,8 @@ static inline void kvm_request_set(unsigned req, struct kvm_vcpu *vcpu)
>  	 * kvm_request_test_and_clear's caller.
>  	 * Paired with the smp_mb__after_atomic in kvm_request_test_and_clear.
>  	 */
> -	smp_wmb();
> +	if (__kvm_request_needs_mb(req))
> +		smp_wmb();
>  	__kvm_request_set(req, vcpu);
>  }
>  
> @@ -1137,7 +1169,8 @@ static inline bool kvm_request_test_and_clear(unsigned req, struct kvm_vcpu *vcp
>  		 * kvm_request_test_and_clear's caller.
>  		 * Paired with the smp_wmb in kvm_request_set.
>  		 */
> -		smp_mb__after_atomic();
> +		if (__kvm_request_needs_mb(req))
> +			smp_mb__after_atomic();
>  		return true;
>  	} else {
>  		return false;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 2250920ec965..ced3e4cb1df0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -179,7 +179,8 @@ bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req)
>  	me = get_cpu();
>  
>  	/* Paired with the smp_mb__after_atomic in kvm_request_test_and_clear. */
> -	smp_wmb();
> +	if (__kvm_request_needs_mb(req))
> +		smp_wmb();
>  
>  	kvm_for_each_vcpu(i, vcpu, kvm) {
>  		__kvm_request_set(req, vcpu);
>
Radim Krčmář Feb. 23, 2017, 3:39 p.m. UTC | #12
2017-02-23 11:20+0100, David Hildenbrand:
> Am 22.02.2017 um 20:57 schrieb Christian Borntraeger:
>> On 02/22/2017 04:17 PM, Radim Krčmář wrote:
>>>
>> [...] 
>>>    	while (vcpu->arch.sie_block->prog0c & PROG_IN_SIE)
>>>   		cpu_relax();
>>>    }
>> 
>>> And out of curiosity -- how many cycles does this loop usually take?
>> 
>> A quick hack indicates something between 3 and 700ns.
>> 
>>>> 2. Remote requests that don't need a sync
>>>>
>>>> E.g. KVM_REQ_ENABLE_IBS doesn't strictly need it, while
>>>> KVM_REQ_DISABLE_IBS does.
>>>
>>> A usual KVM request would kick the VCPU out of nested virt as well.
>>> Shouldn't it be done for these as well?
>> 
>> A common code function probably should. For some of the cases (again
>> prefix page handling) we do not need it. For example if we unmap
>> the guest prefix page, but guest^2 is running this causes no trouble
>> as long as we handle the request before reentering guest^1. So
>> not an easy answer.
> 
> The problem is, that doing synchronous requests on two sie control
> blocks (vsie and "ordinary") is really really hard to implement. I had a
> prototype, but it was just ugly. And there was no reason to do it,
> because all current requests can live with nested guest being executed
> (as it is really all just a matter of coordinating SIE control block
> changes for our guest, not involving nested guests).
> 
> Also note, that VSIE uses these special request bits in the SIE control
> block for its own purposes (to catch unmappings of the prefix page, but
> this time on the nested address space). We don't want to replace this by
> an ordinary request bit (because then we have to exit the VSIE loop much
> more often).
> 
> I think the VSIE code should not care too much about request bits until
> there is really a need for it (meaning: VCPU requests that cannot
> tolerate the VSIE running).

Sure, request-based changes to L1 can be delayed if they cannot affect
L2.

> Kicking the VSIE from time to time cannot harm. But there are no
> guarantees about requests.
> 
>> 
>>>
>>>> 3. local requests
>>>>
>>>> E.g. KVM_REQ_TLB_FLUSH from kvm_s390_set_prefix()
>>>>
>>>>
>>>> Of course, having a unified interface would be better.
>>>>
>>>> /* set the request and kick the CPU out of guest mode */
>>>> kvm_set_request(req, vcpu);
>>>>
>>>> /* set the request, kick the CPU out of guest mode, wait until guest
>>>> mode has been left and make sure the request will be handled before
>>>> reentering guest mode */
>>>> kvm_set_sync_request(req, vcpu);
>>>
>>> Sounds good, I'll also add
>>>
>>>   kvm_set_self_request(req, vcpu);
> 
> or introduce some lightweight check (e.g. against (vcpu->cpu))
> internally, that simply skips kicking/other stuff in case the vcpu is
> currently loaded.

Sounds interesting, I'll benchmark it on x86.  I was planning on adding
a check inside a CONFIG_DEBUG-guarded WARN_ON_ONCE, but the check might be
cheap enough, and it definitely simplifies the API.

The check is already performed in kvm_vcpu_kick() and I hope that it
won't need a big refactoring to avoid copy-paste.

>>>> Same maybe even for multiple VCPUs (as there are then ways to speed it
>>>> up, e.g. first kick all, then wait for all)
>>>>
>>>> This would require arch specific callbacks to
>>>> 1. pre announce the request (e.g. set PROG_REQUEST on s390x)
>>>> 2. kick the cpu (e.g. CPUSTAT_STOP_INT and later
>>>> kvm_s390_vsie_kick(vcpu) on s390x)
>>>> 3. check if still executing the guest (e.g. PROG_IN_SIE on s390x)
>>>>
>>>> This would only make sense if there are other use cases for sync
>>>> requests. At least I remember that Power also has a faster way for
>>>> kicking VCPUs, not involving SMP rescheds. I can't judge if this is an
>>>> s390x-only thing and is better left as is :)
>>>>
>>>> At least vcpu_kick() could be quite easily made to work on s390x.
>>>>
>>>> Radim, are there also other users that need something like sync requests?
>>>
>>> I think that ARM has a similar need when updating vgic, but relies on an
>>> assumption that VCPUs are going to be out after kicking them with
>>> kvm_make_all_cpus_request().
>>> (vgic_change_active_prepare in virt/kvm/arm/vgic/vgic-mmio.c)
>>>
>>> Having synchronous requests in a common API should probably wait for the
>>> completion of the request, not just for the kick, which would make race
>>> handling simpler.
>> 
>> This would be problematic for our prefix page handling due to locking.
>> 
> 
> And if I am not wrong, waiting for a request to be handled might take
> forever (thinking about VCPUs in user space - maybe even stopped ones
> that won't run again).

I agree.

> I think the general notion of synchronous VCPU requests should be: Make
> sure that VCPU does not go into guest mode before handling the request.
> This includes waiting for it to exit guest mode.

I'll keep synchronous requests out for the moment.  There is nothing
similar in other architectures and the idea is very specific even in
s390.
Radim Krčmář Feb. 23, 2017, 3:43 p.m. UTC | #13
2017-02-22 20:23+0100, Christian Borntraeger:
> On 02/22/2017 04:17 PM, Radim Krčmář wrote:
>> [Oops, the end of this thread got dragged into a mark-as-read spree ...]
>> 
>> 2017-02-17 11:13+0100, David Hildenbrand:
>>>>> This is really complicated stuff, and the basic reason for it (if I
>>>>> remember correctly) is that s390x does reenable all interrupts when
>>>>> entering the sie (see kvm-s390.c:__vcpu_run()). So the fancy smp-based
>>>>> kicks don't work (as it is otherwise just racy), and if I remember
>>>>> correctly, SMP reschedule signals (s390x external calls) would be
>>>>> slower. (Christian, please correct me if I'm wrong)
>>>>
>>>> No, the reason was that there are some requests that need to be handled
>>>> outside of the SIE run. For example, one reason was the guest prefix page.
>>>> This must be mapped read/write ALL THE TIME when a guest is running,
>>>> otherwise the host might crash. So we have to exit SIE and make sure that
>>>> it does not reenter, therefore we use the RELOAD_MMU request from a notifier
>>>> that is called from page table functions, whenever memory management decides
>>>> to unmap/write protect (dirty pages tracking, reference tracking, page migration
>>>> or compaction...)
>>>>
>>>> SMP-based requests will kick out the guest, but for something like the
>>>> one above it will be too late.
>>>
>>> While what you said is 100% correct, I had something else in mind that
>>> hindered using vcpu_kick() and especially kvm_make_all_cpus_request().
>>> And I remember that being related to how preemption and
>>> OUTSIDE_GUEST_MODE is handled. I think this boils down to what would
>>> have to be implemented in kvm_arch_vcpu_should_kick().
>>>
>>> x86 can track the guest state using vcpu->mode, because they can be sure
>>> that the guest can't reschedule while in the critical guest entry/exit
>>> section. This is not true for s390x, as preemption is enabled. That's
>>> why vcpu->mode cannot be used in its current form to track if a VCPU is
>>> in/outside/exiting guest mode. And kvm_make_all_cpus_request() currently
>>> relies on this setting.
>>>
>>> For now, calling vcpu_kick() on s390x will result in a BUG().
>>>
>>>
>>> On s390x, there are 3 use cases I see for requests:
>>>
>>> 1. Remote requests that need a sync
>>>
>>> Make a request, wait until SIE has been left and make sure the request
>>> will be processed before re-entering the SIE, e.g. the KVM_REQ_RELOAD_MMU
>>> request in the mmu notifier you mentioned. Also KVM_REQ_DISABLE_IBS is a
>>> candidate.
>> 
>> Btw. aren't those requests racy?
>> 
>>     void exit_sie(struct kvm_vcpu *vcpu)
>>     {
>>     	atomic_or(CPUSTAT_STOP_INT, &vcpu->arch.sie_block->cpuflags);
>> 
>> If you get stalled here and the target VCPU handles the request and
>> reenters SIE in the meantime, then you'll wait until its next exit.
>> (And miss an unbounded amount of exits in the worst case.)
>> 
>>     	while (vcpu->arch.sie_block->prog0c & PROG_IN_SIE)
>>     		cpu_relax();
>>     }
>> 
> 
> It's not racy for the purpose it was originally made for (get the vcpu 
> out of SIE before we unmap a guest prefix page) as the MMU_RELOAD handler 
> will wait for the pte lock which is held by the code that called
> kvm_s390_sync_request(KVM_REQ_MMU_RELOAD, vcpu).
> 
> We also have the guarantee that, after returning from kvm_s390_sync_request,
> the request will be handled before we reenter the guest, which is
> all we need for DISABLE_IBS.
> 
> But yes, all non-MMU_RELOAD users might wait longer, possibly several guest
> exits. We never noticed that, as requests are really rare events: basically
> unmapping of the guest prefix page due to paging and migration, switching
> between one and more guest CPUs, and a few other rare events.

Ok, thanks for the info.

I don't think that we'll find too many use-cases to demand inclusion
into a generic kick/request API, so having a function that waits until a
VCPU is out of guest mode would be more suited for generic code.
Radim Krčmář Feb. 23, 2017, 3:52 p.m. UTC | #14
2017-02-23 12:01+0100, Paolo Bonzini:
> On 16/02/2017 17:04, Radim Krčmář wrote:
>> A macro to optimize requests that do not need a memory barrier because
>> they have no dependencies.  An architecture can implement a function
>> that says which requests do not need memory barriers when handling them.
>> 
>> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
> 
> I would leave this for a separate series, otherwise looks nice (though I
> was skeptical at first ;)).

Ok, I'll first post just the renaming and base other ideas from related
thread on it.
Christoffer Dall Feb. 24, 2017, 11:34 a.m. UTC | #15
On Wed, Feb 22, 2017 at 04:17:05PM +0100, Radim Krčmář wrote:
> [Oops, the end of this thread got dragged into a mark-as-read spree ...]
> 
> 2017-02-17 11:13+0100, David Hildenbrand:
> >>> This is really complicated stuff, and the basic reason for it (if I
> >>> remember correctly) is that s390x does reenable all interrupts when
> >>> entering the sie (see kvm-s390.c:__vcpu_run()). So the fancy smp-based
> >>> kicks don't work (as it is otherwise just racy), and if I remember
> >>> correctly, SMP reschedule signals (s390x external calls) would be
> >>> slower. (Christian, please correct me if I'm wrong)
> >> 
> >> No, the reason was that there are some requests that need to be handled
> >> outside of the SIE run. For example, one reason was the guest prefix page.
> >> This must be mapped read/write ALL THE TIME when a guest is running,
> >> otherwise the host might crash. So we have to exit SIE and make sure that
> >> it does not reenter, therefore we use the RELOAD_MMU request from a notifier
> >> that is called from page table functions, whenever memory management decides
> >> to unmap/write protect (dirty pages tracking, reference tracking, page migration
> >> or compaction...)
> >> 
> >> SMP-based requests will kick out the guest, but for something like the
> >> one above it will be too late.
> > 
> > While what you said is 100% correct, I had something else in mind that
> > hindered using vcpu_kick() and especially kvm_make_all_cpus_request().
> > And I remember that being related to how preemption and
> > OUTSIDE_GUEST_MODE is handled. I think this boils down to what would
> > have to be implemented in kvm_arch_vcpu_should_kick().
> > 
> > x86 can track the guest state using vcpu->mode, because they can be sure
> > that the guest can't reschedule while in the critical guest entry/exit
> > section. This is not true for s390x, as preemption is enabled. That's
> > why vcpu->mode cannot be used in its current form to track if a VCPU is
> > in/outside/exiting guest mode. And kvm_make_all_cpus_request() currently
> > relies on this setting.
> > 
> > For now, calling vcpu_kick() on s390x will result in a BUG().
> > 
> > 
> > On s390x, there are 3 use cases I see for requests:
> > 
> > 1. Remote requests that need a sync
> > 
> > Make a request, wait until SIE has been left and make sure the request
> > will be processed before re-entering the SIE, e.g. the KVM_REQ_RELOAD_MMU
> > request in the mmu notifier you mentioned. Also KVM_REQ_DISABLE_IBS is a
> > candidate.
> 
> Btw. aren't those requests racy?
> 
>     void exit_sie(struct kvm_vcpu *vcpu)
>     {
>     	atomic_or(CPUSTAT_STOP_INT, &vcpu->arch.sie_block->cpuflags);
> 
> If you get stalled here and the target VCPU handles the request and
> reenters SIE in the meantime, then you'll wait until its next exit.
> (And miss an unbounded amount of exits in the worst case.)
> 
>     	while (vcpu->arch.sie_block->prog0c & PROG_IN_SIE)
>     		cpu_relax();
>     }
> 
> And out of curiosity -- how many cycles does this loop usually take?
> 
> > 2. Remote requests that don't need a sync
> > 
> > E.g. KVM_REQ_ENABLE_IBS doesn't strictly need it, while
> > KVM_REQ_DISABLE_IBS does.
> 
> A usual KVM request would kick the VCPU out of nested virt as well.
> Shouldn't it be done for these as well?
> 
> > 3. local requests
> > 
> > E.g. KVM_REQ_TLB_FLUSH from kvm_s390_set_prefix()
> > 
> > 
> > Of course, having a unified interface would be better.
> > 
> > /* set the request and kick the CPU out of guest mode */
> > kvm_set_request(req, vcpu);
> > 
> > /* set the request, kick the CPU out of guest mode, wait until guest
> > mode has been left and make sure the request will be handled before
> > reentering guest mode */
> > kvm_set_sync_request(req, vcpu);
> 
> Sounds good, I'll also add
> 
>   kvm_set_self_request(req, vcpu);
> 
> > Same maybe even for multiple VCPUs (as there are then ways to speed it
> > up, e.g. first kick all, then wait for all)
> > 
> > This would require arch specific callbacks to
> > 1. pre announce the request (e.g. set PROG_REQUEST on s390x)
> > 2. kick the cpu (e.g. CPUSTAT_STOP_INT and later
> > kvm_s390_vsie_kick(vcpu) on s390x)
> > 3. check if still executing the guest (e.g. PROG_IN_SIE on s390x)
> > 
> > This would only make sense if there are other use cases for sync
> > requests. At least I remember that Power also has a faster way for
> > kicking VCPUs, not involving SMP rescheds. I can't judge if this is an
> > s390x-only thing and is better left as is :)
> > 
> > At least vcpu_kick() could be quite easily made to work on s390x.
> > 
> > Radim, are there also other users that need something like sync requests?
> 
> I think that ARM has a similar need when updating vgic, but relies on an
> assumption that VCPUs are going to be out after kicking them with
> kvm_make_all_cpus_request().
> (vgic_change_active_prepare in virt/kvm/arm/vgic/vgic-mmio.c)

Yes, we have similar needs.  We don't actually use the requests
infrastructure at the moment (although I have plans to move to that
following a long series of optimization patches I have stashed on my
machine), but we reuse the kvm_make_all_cpus_request function to figure
out which CPUs need a kick, and which don't, instead of duplicating this
logic in the ARM tree.

> 
> Having synchronous requests in a common API should probably wait for the
> completion of the request, not just for the kick, which would make race
> handling simpler.
> 
> I'm not going to worry about them in this pass, though.
> 

I'll be happy to help working on this or at least reviewing stuff to
move our home-baked "stop all VCPUs and wait for something before
entering the guest again" functionality to common functionality that
uses requests.

Thanks,
-Christoffer
Andrew Jones Feb. 24, 2017, 12:46 p.m. UTC | #16
On Fri, Feb 24, 2017 at 12:34:07PM +0100, Christoffer Dall wrote:
> On Wed, Feb 22, 2017 at 04:17:05PM +0100, Radim Krčmář wrote:
> > [Oops, the end of this thread got dragged into a mark-as-read spree ...]
> > 
> > 2017-02-17 11:13+0100, David Hildenbrand:
> > >>> This is really complicated stuff, and the basic reason for it (if I
> > >>> remember correctly) is that s390x does reenable all interrupts when
> > >>> entering the sie (see kvm-s390.c:__vcpu_run()). So the fancy smp-based
> > >>> kicks don't work (as it is otherwise just racy), and if I remember
> > >>> correctly, SMP reschedule signals (s390x external calls) would be
> > >>> slower. (Christian, please correct me if I'm wrong)
> > >> 
> > >> No, the reason was that there are some requests that need to be handled
> > >> outside of the SIE run. For example, one reason was the guest prefix page.
> > >> This must be mapped read/write ALL THE TIME when a guest is running,
> > >> otherwise the host might crash. So we have to exit SIE and make sure that
> > >> it does not reenter, therefore we use the RELOAD_MMU request from a notifier
> > >> that is called from page table functions, whenever memory management decides
> > >> to unmap/write protect (dirty pages tracking, reference tracking, page migration
> > >> or compaction...)
> > >> 
> > >> SMP-based requests will kick out the guest, but for something like the
> > >> one above it will be too late.
> > > 
> > > While what you said is 100% correct, I had something else in mind that
> > > hindered using vcpu_kick() and especially kvm_make_all_cpus_request().
> > > And I remember that being related to how preemption and
> > > OUTSIDE_GUEST_MODE is handled. I think this boils down to what would
> > > have to be implemented in kvm_arch_vcpu_should_kick().
> > > 
> > > x86 can track the guest state using vcpu->mode, because they can be sure
> > > that the guest can't reschedule while in the critical guest entry/exit
> > > section. This is not true for s390x, as preemption is enabled. That's
> > > why vcpu->mode cannot be used in its current form to track if a VCPU is
> > > in/outside/exiting guest mode. And kvm_make_all_cpus_request() currently
> > > relies on this setting.
> > > 
> > > For now, calling vcpu_kick() on s390x will result in a BUG().
> > > 
> > > 
> > > On s390x, there are 3 use cases I see for requests:
> > > 
> > > 1. Remote requests that need a sync
> > > 
> > > Make a request, wait until SIE has been left and make sure the request
> > > will be processed before re-entering the SIE, e.g. the KVM_REQ_RELOAD_MMU
> > > request in the mmu notifier you mentioned. Also KVM_REQ_DISABLE_IBS is a
> > > candidate.
> > 
> > Btw. aren't those requests racy?
> > 
> >     void exit_sie(struct kvm_vcpu *vcpu)
> >     {
> >     	atomic_or(CPUSTAT_STOP_INT, &vcpu->arch.sie_block->cpuflags);
> > 
> > If you get stalled here and the target VCPU handles the request and
> > reenters SIE in the meantime, then you'll wait until its next exit.
> > (And miss an unbounded amount of exits in the worst case.)
> > 
> >     	while (vcpu->arch.sie_block->prog0c & PROG_IN_SIE)
> >     		cpu_relax();
> >     }
> > 
> > And out of curiosity -- how many cycles does this loop usually take?
> > 
> > > 2. Remote requests that don't need a sync
> > > 
> > > E.g. KVM_REQ_ENABLE_IBS doesn't strictly need it, while
> > > KVM_REQ_DISABLE_IBS does.
> > 
> > A usual KVM request would kick the VCPU out of nested virt as well.
> > Shouldn't it be done for these as well?
> > 
> > > 3. local requests
> > > 
> > > E.g. KVM_REQ_TLB_FLUSH from kvm_s390_set_prefix()
> > > 
> > > 
> > > Of course, having a unified interface would be better.
> > > 
> > > /* set the request and kick the CPU out of guest mode */
> > > kvm_set_request(req, vcpu);
> > > 
> > > /* set the request, kick the CPU out of guest mode, wait until guest
> > > mode has been left and make sure the request will be handled before
> > > reentering guest mode */
> > > kvm_set_sync_request(req, vcpu);
> > 
> > Sounds good, I'll also add
> > 
> >   kvm_set_self_request(req, vcpu);
> > 
> > > Same maybe even for multiple VCPUs (as there are then ways to speed it
> > > up, e.g. first kick all, then wait for all)
> > > 
> > > This would require arch specific callbacks to
> > > 1. pre announce the request (e.g. set PROG_REQUEST on s390x)
> > > 2. kick the cpu (e.g. CPUSTAT_STOP_INT and later
> > > kvm_s390_vsie_kick(vcpu) on s390x)
> > > 3. check if still executing the guest (e.g. PROG_IN_SIE on s390x)
> > > 
> > > This would only make sense if there are other use cases for sync
> > > requests. At least I remember that Power also has a faster way for
> > > kicking VCPUs, not involving SMP rescheds. I can't judge if this is an
> > > s390x-only thing and is better left as is :)
> > > 
> > > At least vcpu_kick() could be quite easily made to work on s390x.
> > > 
> > > Radim, are there also other users that need something like sync requests?
> > 
> > I think that ARM has a similar need when updating vgic, but relies on an
> > assumption that VCPUs are going to be out after kicking them with
> > kvm_make_all_cpus_request().
> > (vgic_change_active_prepare in virt/kvm/arm/vgic/vgic-mmio.c)
> 
> Yes, we have similar needs.  We don't actually use the requests
> infrastructure in the moment (although I have plans to move to that
> following a long series of optimization patches I have stashed on my
> machine), but we reuse the kvm_make_all_cpus_request function to figure
> out which CPUs need a kick, and which don't, instead of duplicating this
> logic in the ARM tree.

I also have patches (that I was going to post a week ago, but then decided
to base on Radim's work) that bring vcpu-requests to arm. The patches I
have are to plug a few races, most notably ones in PSCI emulation. You may
recall an off-list discussion we had about those races that took place
roughly one million years ago. Indeed it was your suggestion at the time
to bring vcpu-requests to arm.

I'll still post those patches shortly. I think Radim plans to post a v2
soon, so I'll rebase on that.

> 
> > 
> > Having synchronous requests in a common API should probably wait for the
> > completion of the request, not just for the kick, which would make race
> > handling simpler.
> > 
> > I'm not going to worry about them in this pass, though.
> > 
> 
> I'll be happy to help working on this or at least reviewing stuff to
> move our home-baked "stop all VCPUs and wait for something before
> entering the guest again" functionality to common functionality that
> uses requests.
> 

Thanks,
drew
Patch

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d899473859d3..2cc438685af8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1097,8 +1097,8 @@  static inline int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
  *  2) remote request with no data (= kick)
  *  3) remote request with data (= kick + mb)
  *
- * TODO: the API is inconsistent -- a request doesn't call kvm_vcpu_kick(), but
- * forces smp_wmb() for all requests.
+ * TODO: the API does not distinguish local and remote requests -- remote
+ * should contain kvm_vcpu_kick().
  */
 
 static inline void __kvm_request_set(unsigned req, struct kvm_vcpu *vcpu)
@@ -1106,6 +1106,37 @@  static inline void __kvm_request_set(unsigned req, struct kvm_vcpu *vcpu)
 	set_bit(req, &vcpu->requests);
 }
 
+/*
+ * __kvm_request_needs_mb is used to improve performance, so it should have no
+ * runtime overhead.
+ */
+static inline bool __kvm_request_needs_mb(int req)
+{
+	/*
+	 * This barrier lets callers avoid the following pattern:
+	 *   if (__kvm_request_needs_mb(req))
+	 *      ...
+	 *   else
+	 *      barrier();
+	 */
+	barrier();
+
+	if (!__builtin_constant_p(req))
+		return true;
+
+#ifdef kvm_arch_request_needs_mb
+	/*
+	 * GCC optimizes pure kvm_arch_request_needs_mb() with a constant input
+	 * into a constant, but __builtin_constant_p() is not so clever, so we
+	 * cannot ensure that with:
+	 * BUILD_BUG_ON(!__builtin_constant_p(kvm_arch_request_needs_mb(req)));
+	 */
+	return kvm_arch_request_needs_mb(req);
+#else
+	return true;
+#endif
+}
+
 static inline void kvm_request_set(unsigned req, struct kvm_vcpu *vcpu)
 {
 	/*
@@ -1113,7 +1144,8 @@  static inline void kvm_request_set(unsigned req, struct kvm_vcpu *vcpu)
 	 * kvm_request_test_and_clear's caller.
 	 * Paired with the smp_mb__after_atomic in kvm_request_test_and_clear.
 	 */
-	smp_wmb();
+	if (__kvm_request_needs_mb(req))
+		smp_wmb();
 	__kvm_request_set(req, vcpu);
 }
 
@@ -1137,7 +1169,8 @@  static inline bool kvm_request_test_and_clear(unsigned req, struct kvm_vcpu *vcp
 		 * kvm_request_test_and_clear's caller.
 		 * Paired with the smp_wmb in kvm_request_set.
 		 */
-		smp_mb__after_atomic();
+		if (__kvm_request_needs_mb(req))
+			smp_mb__after_atomic();
 		return true;
 	} else {
 		return false;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2250920ec965..ced3e4cb1df0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -179,7 +179,8 @@  bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req)
 	me = get_cpu();
 
 	/* Paired with the smp_mb__after_atomic in kvm_request_test_and_clear. */
-	smp_wmb();
+	if (__kvm_request_needs_mb(req))
+		smp_wmb();
 
 	kvm_for_each_vcpu(i, vcpu, kvm) {
 		__kvm_request_set(req, vcpu);