
[2/5] x86/kvm: Handle task_work on VMENTER/EXIT

Message ID 20190801143657.887648487@linutronix.de (mailing list archive)
State New, archived
Series posix-cpu-timers: Move expiry into task work context

Commit Message

Thomas Gleixner Aug. 1, 2019, 2:32 p.m. UTC
TIF_NOTIFY_RESUME is evaluated on return to user space along with other TIF
flags.

From the kernel's point of view a VMENTER is more or less equivalent to a
return to user space, which means that at least a subset of the TIF flags
needs to be evaluated and handled.

Currently KVM handles only TIF_SIGPENDING and TIF_NEED_RESCHED, but
TIF_NOTIFY_RESUME is ignored. As a consequence, pending task_work etc. is
not processed until the vCPU thread actually goes all the way back to
userspace/qemu.

Use the newly provided notify_resume_pending() and
tracehook_handle_notify_resume() to solve this, similar to the existing
handling of TIF_SIGPENDING.
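
For reference, the two helpers come from an earlier patch in this series;
they are presumably thin wrappers along these lines (just a sketch, the
actual definitions may differ):

	static inline bool notify_resume_pending(void)
	{
		return test_thread_flag(TIF_NOTIFY_RESUME);
	}

	static inline void tracehook_handle_notify_resume(void)
	{
		clear_thread_flag(TIF_NOTIFY_RESUME);
		/* runs pending task_work, among other things */
		tracehook_notify_resume(task_pt_regs(current));
	}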

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm@vger.kernel.org
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/x86.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Oleg Nesterov Aug. 1, 2019, 4:24 p.m. UTC | #1
On 08/01, Thomas Gleixner wrote:
>
> @@ -8172,6 +8174,10 @@ static int vcpu_run(struct kvm_vcpu *vcp
>  			++vcpu->stat.signal_exits;
>  			break;
>  		}
> +
> +		if (notify_resume_pending())
> +			tracehook_handle_notify_resume();

shouldn't you drop kvm->srcu before tracehook_handle_notify_resume() ?

I don't understand this code at all, but vcpu_run() does this even before
cond_resched().

Oleg.
Thomas Gleixner Aug. 1, 2019, 6:34 p.m. UTC | #2
On Thu, 1 Aug 2019, Oleg Nesterov wrote:
> On 08/01, Thomas Gleixner wrote:
> >
> > @@ -8172,6 +8174,10 @@ static int vcpu_run(struct kvm_vcpu *vcp
> >  			++vcpu->stat.signal_exits;
> >  			break;
> >  		}
> > +
> > +		if (notify_resume_pending())
> > +			tracehook_handle_notify_resume();
> 
> shouldn't you drop kvm->srcu before tracehook_handle_notify_resume() ?
> 
> I don't understand this code at all, but vcpu_run() does this even before
> cond_resched().

Yeah, I noticed that it's dropped around cond_resched().

My understanding is that for voluntarily giving up the CPU via cond_resched()
it needs to be dropped.

For involuntary preemption (CONFIG_PREEMPT=y) it's not required as the
whole code section after preempt_enable() is fully preemptible.

Now the 1Mio$ question is whether any of the notify functions invokes
cond_resched() and whether that really matters. Paolo?

Thanks,

	tglx
Sean Christopherson Aug. 1, 2019, 9:35 p.m. UTC | #3
On Thu, Aug 01, 2019 at 08:34:53PM +0200, Thomas Gleixner wrote:
> On Thu, 1 Aug 2019, Oleg Nesterov wrote:
> > On 08/01, Thomas Gleixner wrote:
> > >
> > > @@ -8172,6 +8174,10 @@ static int vcpu_run(struct kvm_vcpu *vcp
> > >  			++vcpu->stat.signal_exits;
> > >  			break;
> > >  		}
> > > +
> > > +		if (notify_resume_pending())
> > > +			tracehook_handle_notify_resume();
> > 
> > shouldn't you drop kvm->srcu before tracehook_handle_notify_resume() ?
> > 
> > I don't understand this code at all, but vcpu_run() does this even before
> > cond_resched().
> 
> Yeah, I noticed that it's dropped around cond_resched().
> 
> My understanding is that for voluntarily giving up the CPU via cond_resched()
> it needs to be dropped.
> 
> For involuntary preemption (CONFIG_PREEMPT=y) it's not required as the
> whole code section after preempt_enable() is fully preemptible.
> 
> Now the 1Mio$ question is whether any of the notify functions invokes
> cond_resched() and whether that really matters. Paolo?

cond_resched() is called via tracehook_notify_resume()->task_work_run(),
and "kernel code can only call cond_resched() in places where it ...
cannot hold references to any RCU-protected data structures" according to
https://lwn.net/Articles/603252/.
Peter Zijlstra Aug. 1, 2019, 9:44 p.m. UTC | #4
On Thu, Aug 01, 2019 at 02:35:50PM -0700, Sean Christopherson wrote:
> On Thu, Aug 01, 2019 at 08:34:53PM +0200, Thomas Gleixner wrote:
> > On Thu, 1 Aug 2019, Oleg Nesterov wrote:
> > > On 08/01, Thomas Gleixner wrote:
> > > >
> > > > @@ -8172,6 +8174,10 @@ static int vcpu_run(struct kvm_vcpu *vcp
> > > >  			++vcpu->stat.signal_exits;
> > > >  			break;
> > > >  		}
> > > > +
> > > > +		if (notify_resume_pending())
> > > > +			tracehook_handle_notify_resume();
> > > 
> > > shouldn't you drop kvm->srcu before tracehook_handle_notify_resume() ?
> > > 
> > > I don't understand this code at all, but vcpu_run() does this even before
> > > cond_resched().
> > 
> > Yeah, I noticed that it's dropped around cond_resched().
> > 
> > My understanding is that for voluntarily giving up the CPU via cond_resched()
> > it needs to be dropped.
> > 
> > For involuntary preemption (CONFIG_PREEMPT=y) it's not required as the
> > whole code section after preempt_enable() is fully preemptible.
> > 
> > Now the 1Mio$ question is whether any of the notify functions invokes
> > cond_resched() and whether that really matters. Paolo?
> 
> cond_resched() is called via tracehook_notify_resume()->task_work_run(),
> and "kernel code can only call cond_resched() in places where it ...
> cannot hold references to any RCU-protected data structures" according to
> https://lwn.net/Articles/603252/.

This is SRCU; you can reschedule while holding it just fine. It will
just delay some kvm operations, like the memslot stuff. I don't think it
is terrible to keep it, but people more versed in KVM might know of a
good reason to drop it anyway.
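
To illustrate (not code from this series), the read side may legitimately do:

	int idx = srcu_read_lock(&kvm->srcu);

	/* Sleeping inside an SRCU read section is fine, unlike with plain RCU... */
	cond_resched();

	/* ...it merely extends the grace period seen by synchronize_srcu(&kvm->srcu). */
	srcu_read_unlock(&kvm->srcu, idx);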
Thomas Gleixner Aug. 1, 2019, 9:44 p.m. UTC | #5
On Thu, 1 Aug 2019, Sean Christopherson wrote:
> On Thu, Aug 01, 2019 at 08:34:53PM +0200, Thomas Gleixner wrote:
> > On Thu, 1 Aug 2019, Oleg Nesterov wrote:
> > > On 08/01, Thomas Gleixner wrote:
> > > >
> > > > @@ -8172,6 +8174,10 @@ static int vcpu_run(struct kvm_vcpu *vcp
> > > >  			++vcpu->stat.signal_exits;
> > > >  			break;
> > > >  		}
> > > > +
> > > > +		if (notify_resume_pending())
> > > > +			tracehook_handle_notify_resume();
> > > 
> > > shouldn't you drop kvm->srcu before tracehook_handle_notify_resume() ?
> > > 
> > > I don't understand this code at all, but vcpu_run() does this even before
> > > cond_resched().
> > 
> > Yeah, I noticed that it's dropped around cond_resched().
> > 
> > My understanding is that for voluntarily giving up the CPU via cond_resched()
> > it needs to be dropped.
> > 
> > For involuntary preemption (CONFIG_PREEMPT=y) it's not required as the
> > whole code section after preempt_enable() is fully preemptible.
> > 
> > Now the 1Mio$ question is whether any of the notify functions invokes
> > cond_resched() and whether that really matters. Paolo?
> 
> cond_resched() is called via tracehook_notify_resume()->task_work_run(),
> and "kernel code can only call cond_resched() in places where it ...
> cannot hold references to any RCU-protected data structures" according to
> https://lwn.net/Articles/603252/.

Right you are.
Thomas Gleixner Aug. 1, 2019, 9:47 p.m. UTC | #6
On Thu, 1 Aug 2019, Thomas Gleixner wrote:

> On Thu, 1 Aug 2019, Sean Christopherson wrote:
> > On Thu, Aug 01, 2019 at 08:34:53PM +0200, Thomas Gleixner wrote:
> > > On Thu, 1 Aug 2019, Oleg Nesterov wrote:
> > > > On 08/01, Thomas Gleixner wrote:
> > > > >
> > > > > @@ -8172,6 +8174,10 @@ static int vcpu_run(struct kvm_vcpu *vcp
> > > > >  			++vcpu->stat.signal_exits;
> > > > >  			break;
> > > > >  		}
> > > > > +
> > > > > +		if (notify_resume_pending())
> > > > > +			tracehook_handle_notify_resume();
> > > > 
> > > > shouldn't you drop kvm->srcu before tracehook_handle_notify_resume() ?
> > > > 
> > > > I don't understand this code at all, but vcpu_run() does this even before
> > > > cond_resched().
> > > 
> > > Yeah, I noticed that it's dropped around cond_resched().
> > > 
> > > My understanding is that for voluntarily giving up the CPU via cond_resched()
> > > it needs to be dropped.
> > > 
> > > For involuntary preemption (CONFIG_PREEMPT=y) it's not required as the
> > > whole code section after preempt_enable() is fully preemptible.
> > > 
> > > Now the 1Mio$ question is whether any of the notify functions invokes
> > > cond_resched() and whether that really matters. Paolo?
> > 
> > cond_resched() is called via tracehook_notify_resume()->task_work_run(),
> > and "kernel code can only call cond_resched() in places where it ...
> > cannot hold references to any RCU-protected data structures" according to
> > https://lwn.net/Articles/603252/.
> 
> Right you are.

Bah. Hit send too fast.

Right you are about cond_resched() being called, but for SRCU this does not
matter unless there is some way to do a synchronize operation on that SRCU
entity. It might have some other performance side effect though.

Thanks,

	tglx
Oleg Nesterov Aug. 2, 2019, 12:04 p.m. UTC | #7
On 08/01, Thomas Gleixner wrote:
>
> On Thu, 1 Aug 2019, Oleg Nesterov wrote:
> > On 08/01, Thomas Gleixner wrote:
> > >
> > > @@ -8172,6 +8174,10 @@ static int vcpu_run(struct kvm_vcpu *vcp
> > >  			++vcpu->stat.signal_exits;
> > >  			break;
> > >  		}
> > > +
> > > +		if (notify_resume_pending())
> > > +			tracehook_handle_notify_resume();
> >
> > shouldn't you drop kvm->srcu before tracehook_handle_notify_resume() ?
> >
> > I don't understand this code at all, but vcpu_run() does this even before
> > cond_resched().
>
> Yeah, I noticed that it's dropped around cond_resched().
>
> My understanding is that for voluntarily giving up the CPU via cond_resched()
> it needs to be dropped.

I am not sure it really needs to be, but this doesn't matter.

tracehook_handle_notify_resume() can do "anything", say it can run the
work queued by systemtap. I don't think it should delay synchronize_srcu().
And maybe this is simply unsafe, even if I don't think a task_work can
ever call synchronize_srcu(kvm->srcu) directly.

Oleg.
Paolo Bonzini Aug. 2, 2019, 9:35 p.m. UTC | #8
On 01/08/19 23:47, Thomas Gleixner wrote:
> Right you are about cond_resched() being called, but for SRCU this does not
> matter unless there is some way to do a synchronize operation on that SRCU
> entity. It might have some other performance side effect though.

I would use srcu_read_unlock/lock around the call.
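
I.e. mirror what the need_resched() branch in vcpu_run() already does,
something like this (untested sketch):

	if (notify_resume_pending()) {
		srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
		tracehook_handle_notify_resume();
		vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
	}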

However, I'm wondering if the API can be improved because basically we
have six functions for three checks of TIF flags.  Does it make sense to
have something like task_has_request_flags and task_do_requests (names
are horrible I know) that can be used like

	if (task_has_request_flags()) {
		int err;
		...srcu_read_unlock...
		// return -EINTR if signal_pending
		err = task_do_requests();
		if (err < 0)
			goto exit_no_srcu_read_unlock;
		...srcu_read_lock...
	}

taking care of all three cases with a single hook?  This is important
especially because all these checks are done by all KVM architectures in
slightly different ways, and a unified API would be a good reason to
make all architectures look the same.
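
For the three flags in question the helpers could be as simple as this
(again just a sketch, using the placeholder names from above):

	static inline bool task_has_request_flags(void)
	{
		return test_thread_flag(TIF_SIGPENDING) ||
		       test_thread_flag(TIF_NEED_RESCHED) ||
		       test_thread_flag(TIF_NOTIFY_RESUME);
	}

	static inline int task_do_requests(void)
	{
		if (signal_pending(current))
			return -EINTR;
		if (test_thread_flag(TIF_NOTIFY_RESUME))
			tracehook_handle_notify_resume();
		if (need_resched())
			cond_resched();
		return 0;
	}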

(Of course I could also define this unified API in virt/kvm/kvm_main.c,
so this is not blocking the series in any way!).

Paolo
Thomas Gleixner Aug. 2, 2019, 10:22 p.m. UTC | #9
On Fri, 2 Aug 2019, Paolo Bonzini wrote:
> On 01/08/19 23:47, Thomas Gleixner wrote:
> > Right you are about cond_resched() being called, but for SRCU this does not
> > matter unless there is some way to do a synchronize operation on that SRCU
> > entity. It might have some other performance side effect though.
> 
> I would use srcu_read_unlock/lock around the call.
> 
> However, I'm wondering if the API can be improved because basically we
> have six functions for three checks of TIF flags.  Does it make sense to
> have something like task_has_request_flags and task_do_requests (names
> are horrible I know) that can be used like
> 
> 	if (task_has_request_flags()) {
> 		int err;
> 		...srcu_read_unlock...
> 		// return -EINTR if signal_pending
> 		err = task_do_requests();
> 		if (err < 0)
> 			goto exit_no_srcu_read_unlock;
> 		...srcu_read_lock...
> 	}
> 
> taking care of all three cases with a single hook?  This is important
> especially because all these checks are done by all KVM architectures in
> slightly different ways, and a unified API would be a good reason to
> make all architectures look the same.
> 
> (Of course I could also define this unified API in virt/kvm/kvm_main.c,
> so this is not blocking the series in any way!).

You're not holding up anything. Having a common function for this is
definitely the right approach.

As this is virt specific, but only checks non arch specific bits
(TIF_NOTIFY_RESUME should be available for all KVM archs) which are the
subset of TIF bits that makes sense there, this really should be a common
function for KVM. That way all the other archs, which obviously lack a
TIF_NOTIFY_RESUME check, can be fixed up and consolidated, and if we add
another TIF check later we only have to do it in one place.

Thanks,

	tglx
Andy Lutomirski Aug. 2, 2019, 10:39 p.m. UTC | #10
> On Aug 2, 2019, at 3:22 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
>> On Fri, 2 Aug 2019, Paolo Bonzini wrote:
>>> On 01/08/19 23:47, Thomas Gleixner wrote:
>>> Right you are about cond_resched() being called, but for SRCU this does not
>>> matter unless there is some way to do a synchronize operation on that SRCU
>>> entity. It might have some other performance side effect though.
>> 
>> I would use srcu_read_unlock/lock around the call.
>> 
>> However, I'm wondering if the API can be improved because basically we
>> have six functions for three checks of TIF flags.  Does it make sense to
>> have something like task_has_request_flags and task_do_requests (names
>> are horrible I know) that can be used like
>> 
>>    if (task_has_request_flags()) {
>>        int err;
>>        ...srcu_read_unlock...
>>        // return -EINTR if signal_pending
>>        err = task_do_requests();
>>        if (err < 0)
>>            goto exit_no_srcu_read_unlock;
>>        ...srcu_read_lock...
>>    }
>> 
>> taking care of all three cases with a single hook?  This is important
>> especially because all these checks are done by all KVM architectures in
>> slightly different ways, and a unified API would be a good reason to
>> make all architectures look the same.
>> 
>> (Of course I could also define this unified API in virt/kvm/kvm_main.c,
>> so this is not blocking the series in any way!).
> 
> You're not holding up anything. Having a common function for this is
> definitely the right approach.
> 
> As this is virt specific, but only checks non arch specific bits
> (TIF_NOTIFY_RESUME should be available for all KVM archs) which are the
> subset of TIF bits that makes sense there, this really should be a common
> function for KVM. That way all the other archs, which obviously lack a
> TIF_NOTIFY_RESUME check, can be fixed up and consolidated, and if we add
> another TIF check later we only have to do it in one place.
> 
> 

If we add a real API for this, can we make it, or a very similar API, work for exit_to_usermode_loop() too?  Maybe:

bool usermode_work_pending();
bool guestmode_work_pending();

void do_usermode_work();
void do_guestmode_work();

The first two are called with IRQs off.  The latter two are called with IRQs on.
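
The x86 exit path could then look roughly like this (illustrative only,
not actual code from this series):

	/* exit_to_usermode_loop(), simplified */
	local_irq_disable();
	while (usermode_work_pending()) {	/* checked with IRQs off */
		local_irq_enable();
		do_usermode_work();		/* handled with IRQs on */
		local_irq_disable();
	}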

Patch

--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -52,6 +52,7 @@ 
 #include <linux/irqbypass.h>
 #include <linux/sched/stat.h>
 #include <linux/sched/isolation.h>
+#include <linux/tracehook.h>
 #include <linux/mem_encrypt.h>
 
 #include <trace/events/kvm.h>
@@ -7972,7 +7973,8 @@  static int vcpu_enter_guest(struct kvm_v
 		kvm_x86_ops->sync_pir_to_irr(vcpu);
 
 	if (vcpu->mode == EXITING_GUEST_MODE || kvm_request_pending(vcpu)
-	    || need_resched() || signal_pending(current)) {
+	    || need_resched() || signal_pending(current)
+	    || notify_resume_pending()) {
 		vcpu->mode = OUTSIDE_GUEST_MODE;
 		smp_wmb();
 		local_irq_enable();
@@ -8172,6 +8174,10 @@  static int vcpu_run(struct kvm_vcpu *vcp
 			++vcpu->stat.signal_exits;
 			break;
 		}
+
+		if (notify_resume_pending())
+			tracehook_handle_notify_resume();
+
 		if (need_resched()) {
 			srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
 			cond_resched();