
[v2] kvm/x86: Handle async PF in RCU read-side critical sections

Message ID 20171001013140.21325-1-boqun.feng@gmail.com (mailing list archive)
State New, archived

Commit Message

Boqun Feng Oct. 1, 2017, 1:31 a.m. UTC
Sasha Levin reported a WARNING:

| WARNING: CPU: 0 PID: 6974 at kernel/rcu/tree_plugin.h:329
| rcu_preempt_note_context_switch kernel/rcu/tree_plugin.h:329 [inline]
| WARNING: CPU: 0 PID: 6974 at kernel/rcu/tree_plugin.h:329
| rcu_note_context_switch+0x16c/0x2210 kernel/rcu/tree.c:458
...
| CPU: 0 PID: 6974 Comm: syz-fuzzer Not tainted 4.13.0-next-20170908+ #246
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
| 1.10.1-1ubuntu1 04/01/2014
| Call Trace:
...
| RIP: 0010:rcu_preempt_note_context_switch kernel/rcu/tree_plugin.h:329 [inline]
| RIP: 0010:rcu_note_context_switch+0x16c/0x2210 kernel/rcu/tree.c:458
| RSP: 0018:ffff88003b2debc8 EFLAGS: 00010002
| RAX: 0000000000000001 RBX: 1ffff1000765bd85 RCX: 0000000000000000
| RDX: 1ffff100075d7882 RSI: ffffffffb5c7da20 RDI: ffff88003aebc410
| RBP: ffff88003b2def30 R08: dffffc0000000000 R09: 0000000000000001
| R10: 0000000000000000 R11: 0000000000000000 R12: ffff88003b2def08
| R13: 0000000000000000 R14: ffff88003aebc040 R15: ffff88003aebc040
| __schedule+0x201/0x2240 kernel/sched/core.c:3292
| schedule+0x113/0x460 kernel/sched/core.c:3421
| kvm_async_pf_task_wait+0x43f/0x940 arch/x86/kernel/kvm.c:158
| do_async_page_fault+0x72/0x90 arch/x86/kernel/kvm.c:271
| async_page_fault+0x22/0x30 arch/x86/entry/entry_64.S:1069
| RIP: 0010:format_decode+0x240/0x830 lib/vsprintf.c:1996
| RSP: 0018:ffff88003b2df520 EFLAGS: 00010283
| RAX: 000000000000003f RBX: ffffffffb5d1e141 RCX: ffff88003b2df670
| RDX: 0000000000000001 RSI: dffffc0000000000 RDI: ffffffffb5d1e140
| RBP: ffff88003b2df560 R08: dffffc0000000000 R09: 0000000000000000
| R10: ffff88003b2df718 R11: 0000000000000000 R12: ffff88003b2df5d8
| R13: 0000000000000064 R14: ffffffffb5d1e140 R15: 0000000000000000
| vsnprintf+0x173/0x1700 lib/vsprintf.c:2136
| sprintf+0xbe/0xf0 lib/vsprintf.c:2386
| proc_self_get_link+0xfb/0x1c0 fs/proc/self.c:23
| get_link fs/namei.c:1047 [inline]
| link_path_walk+0x1041/0x1490 fs/namei.c:2127
...

This happened when the host hit a page fault and delivered it as an
async page fault while the guest was in an RCU read-side critical
section.  The guest then tried to reschedule in kvm_async_pf_task_wait(),
but rcu_preempt_note_context_switch() treated the reschedule as a
sleep in an RCU read-side critical section, which is not allowed (even in
preemptible RCU).  Hence the WARN.

To cure this, we need to make kvm_async_pf_task_wait() take the halt
path (instead of the schedule path) if the PF happens in an RCU read-side
critical section.

In a PREEMPT=y kernel this is simple, as we record the current RCU
read-side critical section nesting level in rcu_preempt_depth(). But in a
PREEMPT=n kernel, rcu_read_lock()/rcu_read_unlock() may be no-ops, so we
don't know whether we are in an RCU read-side critical section or not. We
resolve this by always choosing the halt path in a PREEMPT=n kernel,
unless the guest got the async PF while running in user mode.

Reported-by: Sasha Levin <levinsasha928@gmail.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
[The explanation of async PF was contributed by Paolo Bonzini]
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
v1 --> v2:

*	Add more accurate explanation of async PF from Paolo in the
	commit message.

*	Extend kvm_async_pf_task_wait() to take a second parameter
	@user, indicating whether the fault happened while a user program
	was running in the guest.

Wanpeng, the callsite of kvm_async_pf_task_wait() in
kvm_handle_page_fault() is for the nested scenario, right? I take it we
should handle it as if the fault happened while the L1 guest was running
in kernel mode, so @user should be 0, right?

 arch/x86/include/asm/kvm_para.h | 4 ++--
 arch/x86/kernel/kvm.c           | 9 ++++++---
 arch/x86/kvm/mmu.c              | 2 +-
 3 files changed, 9 insertions(+), 6 deletions(-)

Comments

Paolo Bonzini Oct. 2, 2017, 1:41 p.m. UTC | #1
On 01/10/2017 03:31, Boqun Feng wrote:
> Sasha Levin reported a WARNING:
> 
> | WARNING: CPU: 0 PID: 6974 at kernel/rcu/tree_plugin.h:329
> | rcu_preempt_note_context_switch kernel/rcu/tree_plugin.h:329 [inline]
> | WARNING: CPU: 0 PID: 6974 at kernel/rcu/tree_plugin.h:329
> | rcu_note_context_switch+0x16c/0x2210 kernel/rcu/tree.c:458
> ...
> | CPU: 0 PID: 6974 Comm: syz-fuzzer Not tainted 4.13.0-next-20170908+ #246
> | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> | 1.10.1-1ubuntu1 04/01/2014
> | Call Trace:
> ...
> | RIP: 0010:rcu_preempt_note_context_switch kernel/rcu/tree_plugin.h:329 [inline]
> | RIP: 0010:rcu_note_context_switch+0x16c/0x2210 kernel/rcu/tree.c:458
> | RSP: 0018:ffff88003b2debc8 EFLAGS: 00010002
> | RAX: 0000000000000001 RBX: 1ffff1000765bd85 RCX: 0000000000000000
> | RDX: 1ffff100075d7882 RSI: ffffffffb5c7da20 RDI: ffff88003aebc410
> | RBP: ffff88003b2def30 R08: dffffc0000000000 R09: 0000000000000001
> | R10: 0000000000000000 R11: 0000000000000000 R12: ffff88003b2def08
> | R13: 0000000000000000 R14: ffff88003aebc040 R15: ffff88003aebc040
> | __schedule+0x201/0x2240 kernel/sched/core.c:3292
> | schedule+0x113/0x460 kernel/sched/core.c:3421
> | kvm_async_pf_task_wait+0x43f/0x940 arch/x86/kernel/kvm.c:158
> | do_async_page_fault+0x72/0x90 arch/x86/kernel/kvm.c:271
> | async_page_fault+0x22/0x30 arch/x86/entry/entry_64.S:1069
> | RIP: 0010:format_decode+0x240/0x830 lib/vsprintf.c:1996
> | RSP: 0018:ffff88003b2df520 EFLAGS: 00010283
> | RAX: 000000000000003f RBX: ffffffffb5d1e141 RCX: ffff88003b2df670
> | RDX: 0000000000000001 RSI: dffffc0000000000 RDI: ffffffffb5d1e140
> | RBP: ffff88003b2df560 R08: dffffc0000000000 R09: 0000000000000000
> | R10: ffff88003b2df718 R11: 0000000000000000 R12: ffff88003b2df5d8
> | R13: 0000000000000064 R14: ffffffffb5d1e140 R15: 0000000000000000
> | vsnprintf+0x173/0x1700 lib/vsprintf.c:2136
> | sprintf+0xbe/0xf0 lib/vsprintf.c:2386
> | proc_self_get_link+0xfb/0x1c0 fs/proc/self.c:23
> | get_link fs/namei.c:1047 [inline]
> | link_path_walk+0x1041/0x1490 fs/namei.c:2127
> ...
> 
> This happened when the host hit a page fault, and delivered it as in an
> async page fault, while the guest was in an RCU read-side critical
> section.  The guest then tries to reschedule in kvm_async_pf_task_wait(),
> but rcu_preempt_note_context_switch() would treat the reschedule as a
> sleep in RCU read-side critical section, which is not allowed (even in
> preemptible RCU).  Thus the WARN.
> 
> To cure this, we need to make kvm_async_pf_task_wait() go to the halt
> path(instead of the schedule path) if the PF happens in a RCU read-side
> critical section.
> 
> In PREEMPT=y kernel, this is simple, as we record current RCU read-side
> critical section nested level in rcu_preempt_depth(). But for PREEMPT=n
> kernel rcu_read_lock/unlock() may be no-ops, so we don't whether we are
> in a RCU read-side critical section or not. We resolve this by always
> choosing the halt path in PREEMPT=n kernel unless the guest gets the
> async PF while running in user mode.
> 
> Reported-by: Sasha Levin <levinsasha928@gmail.com>
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Wanpeng Li <wanpeng.li@hotmail.com>
> [The explanation for async PF is contributed by Paolo Bonzini]
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> ---
> v1 --> v2:
> 
> *	Add more accurate explanation of async PF from Paolo in the
> 	commit message.
> 
> *	Extend the kvm_async_pf_task_wait() to have a second parameter
> 	@user to indicate whether the fault happens while a user program
> 	running the guest.
> 
> Wanpeng, the callsite of kvm_async_pf_task_wait() in
> kvm_handle_page_fault() is for nested scenario, right? I take it we
> should handle it as if the fault happens when l1 guest is running in
> kernel mode, so @user should be 0, right?

In that case we can schedule, actually.  The guest will let another
process run.

In fact we could schedule safely most of the time in the
!user_mode(regs) case, it's just that with PREEMPT=n there's no
knowledge of whether we can do so.  This explains why we have never seen
the bug before.

I had already applied v1, can you rebase and resend please?  Thanks,

Paolo

>  arch/x86/include/asm/kvm_para.h | 4 ++--
>  arch/x86/kernel/kvm.c           | 9 ++++++---
>  arch/x86/kvm/mmu.c              | 2 +-
>  3 files changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index bc62e7cbf1b1..0a5ae6bb128b 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -88,7 +88,7 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
>  bool kvm_para_available(void);
>  unsigned int kvm_arch_para_features(void);
>  void __init kvm_guest_init(void);
> -void kvm_async_pf_task_wait(u32 token);
> +void kvm_async_pf_task_wait(u32 token, int user);
>  void kvm_async_pf_task_wake(u32 token);
>  u32 kvm_read_and_reset_pf_reason(void);
>  extern void kvm_disable_steal_time(void);
> @@ -103,7 +103,7 @@ static inline void kvm_spinlock_init(void)
>  
>  #else /* CONFIG_KVM_GUEST */
>  #define kvm_guest_init() do {} while (0)
> -#define kvm_async_pf_task_wait(T) do {} while(0)
> +#define kvm_async_pf_task_wait(T, U) do {} while(0)
>  #define kvm_async_pf_task_wake(T) do {} while(0)
>  
>  static inline bool kvm_para_available(void)
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index aa60a08b65b1..916f519e54c9 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -117,7 +117,7 @@ static struct kvm_task_sleep_node *_find_apf_task(struct kvm_task_sleep_head *b,
>  	return NULL;
>  }
>  
> -void kvm_async_pf_task_wait(u32 token)
> +void kvm_async_pf_task_wait(u32 token, int user)
>  {
>  	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
>  	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
> @@ -140,7 +140,10 @@ void kvm_async_pf_task_wait(u32 token)
>  
>  	n.token = token;
>  	n.cpu = smp_processor_id();
> -	n.halted = is_idle_task(current) || preempt_count() > 1;
> +	n.halted = is_idle_task(current) ||
> +		   preempt_count() > 1 ||
> +		   (!IS_ENABLED(CONFIG_PREEMPT) && !user) ||
> +		   rcu_preempt_depth();
>  	init_swait_queue_head(&n.wq);
>  	hlist_add_head(&n.link, &b->list);
>  	raw_spin_unlock(&b->lock);
> @@ -268,7 +271,7 @@ do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
>  	case KVM_PV_REASON_PAGE_NOT_PRESENT:
>  		/* page is swapped out by the host. */
>  		prev_state = exception_enter();
> -		kvm_async_pf_task_wait((u32)read_cr2());
> +		kvm_async_pf_task_wait((u32)read_cr2(), user_mode(regs));
>  		exception_exit(prev_state);
>  		break;
>  	case KVM_PV_REASON_PAGE_READY:
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index eca30c1eb1d9..106d4a029a8a 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -3837,7 +3837,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>  	case KVM_PV_REASON_PAGE_NOT_PRESENT:
>  		vcpu->arch.apf.host_apf_reason = 0;
>  		local_irq_disable();
> -		kvm_async_pf_task_wait(fault_address);
> +		kvm_async_pf_task_wait(fault_address, 0);
>  		local_irq_enable();
>  		break;
>  	case KVM_PV_REASON_PAGE_READY:
>
Boqun Feng Oct. 2, 2017, 2:43 p.m. UTC | #2
On Mon, Oct 02, 2017 at 01:41:03PM +0000, Paolo Bonzini wrote:
[...]
> > 
> > Wanpeng, the callsite of kvm_async_pf_task_wait() in
> > kvm_handle_page_fault() is for nested scenario, right? I take it we
> > should handle it as if the fault happens when l1 guest is running in
> > kernel mode, so @user should be 0, right?
> 
> In that case we can schedule, actually.  The guest will let another
> process run.
> 
> In fact we could schedule safely most of the time in the
> !user_mode(regs) case, it's just that with PREEMPT=n there's no
> knowledge of whether we can do so.  This explains why we have never seen
> the bug before.
> 

Thanks, looks like I confused myself a little bit here. You are right.
So in a PREEMPT=n kernel, we can't schedule only when the async PF
*interrupts* the *kernel*; in kvm_handle_page_fault() we didn't
actually interrupt the kernel, so it's fine.

> I had already applied v1, can you rebase and resend please?  Thanks,
> 

Sure, I'm going to rename that parameter to "interrupt_kernel" (probably
a bad name), indicating whether the async PF interrupted the kernel.

But it's a little bit late today, will do that tomorrow.

Regards,
Boqun


> Paolo
> 
> >  arch/x86/include/asm/kvm_para.h | 4 ++--
> >  arch/x86/kernel/kvm.c           | 9 ++++++---
> >  arch/x86/kvm/mmu.c              | 2 +-
> >  3 files changed, 9 insertions(+), 6 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> > index bc62e7cbf1b1..0a5ae6bb128b 100644
> > --- a/arch/x86/include/asm/kvm_para.h
> > +++ b/arch/x86/include/asm/kvm_para.h
> > @@ -88,7 +88,7 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
> >  bool kvm_para_available(void);
> >  unsigned int kvm_arch_para_features(void);
> >  void __init kvm_guest_init(void);
> > -void kvm_async_pf_task_wait(u32 token);
> > +void kvm_async_pf_task_wait(u32 token, int user);
> >  void kvm_async_pf_task_wake(u32 token);
> >  u32 kvm_read_and_reset_pf_reason(void);
> >  extern void kvm_disable_steal_time(void);
> > @@ -103,7 +103,7 @@ static inline void kvm_spinlock_init(void)
> >  
> >  #else /* CONFIG_KVM_GUEST */
> >  #define kvm_guest_init() do {} while (0)
> > -#define kvm_async_pf_task_wait(T) do {} while(0)
> > +#define kvm_async_pf_task_wait(T, U) do {} while(0)
> >  #define kvm_async_pf_task_wake(T) do {} while(0)
> >  
> >  static inline bool kvm_para_available(void)
> > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > index aa60a08b65b1..916f519e54c9 100644
> > --- a/arch/x86/kernel/kvm.c
> > +++ b/arch/x86/kernel/kvm.c
> > @@ -117,7 +117,7 @@ static struct kvm_task_sleep_node *_find_apf_task(struct kvm_task_sleep_head *b,
> >  	return NULL;
> >  }
> >  
> > -void kvm_async_pf_task_wait(u32 token)
> > +void kvm_async_pf_task_wait(u32 token, int user)
> >  {
> >  	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
> >  	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
> > @@ -140,7 +140,10 @@ void kvm_async_pf_task_wait(u32 token)
> >  
> >  	n.token = token;
> >  	n.cpu = smp_processor_id();
> > -	n.halted = is_idle_task(current) || preempt_count() > 1;
> > +	n.halted = is_idle_task(current) ||
> > +		   preempt_count() > 1 ||
> > +		   (!IS_ENABLED(CONFIG_PREEMPT) && !user) ||
> > +		   rcu_preempt_depth();
> >  	init_swait_queue_head(&n.wq);
> >  	hlist_add_head(&n.link, &b->list);
> >  	raw_spin_unlock(&b->lock);
> > @@ -268,7 +271,7 @@ do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
> >  	case KVM_PV_REASON_PAGE_NOT_PRESENT:
> >  		/* page is swapped out by the host. */
> >  		prev_state = exception_enter();
> > -		kvm_async_pf_task_wait((u32)read_cr2());
> > +		kvm_async_pf_task_wait((u32)read_cr2(), user_mode(regs));
> >  		exception_exit(prev_state);
> >  		break;
> >  	case KVM_PV_REASON_PAGE_READY:
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index eca30c1eb1d9..106d4a029a8a 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -3837,7 +3837,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
> >  	case KVM_PV_REASON_PAGE_NOT_PRESENT:
> >  		vcpu->arch.apf.host_apf_reason = 0;
> >  		local_irq_disable();
> > -		kvm_async_pf_task_wait(fault_address);
> > +		kvm_async_pf_task_wait(fault_address, 0);
> >  		local_irq_enable();
> >  		break;
> >  	case KVM_PV_REASON_PAGE_READY:
> > 
>
Paul E. McKenney Oct. 2, 2017, 3:34 p.m. UTC | #3
On Mon, Oct 02, 2017 at 10:43:00PM +0800, Boqun Feng wrote:
> On Mon, Oct 02, 2017 at 01:41:03PM +0000, Paolo Bonzini wrote:
> [...]
> > > 
> > > Wanpeng, the callsite of kvm_async_pf_task_wait() in
> > > kvm_handle_page_fault() is for nested scenario, right? I take it we
> > > should handle it as if the fault happens when l1 guest is running in
> > > kernel mode, so @user should be 0, right?
> > 
> > In that case we can schedule, actually.  The guest will let another
> > process run.
> > 
> > In fact we could schedule safely most of the time in the
> > !user_mode(regs) case, it's just that with PREEMPT=n there's no
> > knowledge of whether we can do so.  This explains why we have never seen
> > the bug before.
> 
> Thanks, looks like I confused myself a little bit here. You are right.
> So in PREEMPT=n kernel, we only couldn't schedule when the async PF
> *interrupts* the *kernel*, while in the kvm_handle_page_fault(), we
> actually didn't interrupt the kernel, so it's fine.

Actually, we should be able to do a little bit better than that.
If PREEMPT=n but PREEMPT_COUNT=y, then preempt_count() will know
about RCU read-side critical sections via preempt_disable().

So maybe something like this?

	n.halted = is_idle_task(current) ||
		   preempt_count() > 1 ||
		   (!IS_ENABLED(CONFIG_PREEMPT) && !IS_ENABLED(CONFIG_PREEMPT_COUNT) && !user) ||
		   rcu_preempt_depth();

							Thanx, Paul

> > I had already applied v1, can you rebase and resend please?  Thanks,
> > 
> 
> Sure, I'm going to rename that parameter to "interrupt_kernel"(probably
> a bad name"), indicating whether the async PF interrupts the kernel.
> 
> But it's a little bit late today, will do that tomorrow.
> 
> Regards,
> Boqun
> 
> 
> > Paolo
> > 
> > >  arch/x86/include/asm/kvm_para.h | 4 ++--
> > >  arch/x86/kernel/kvm.c           | 9 ++++++---
> > >  arch/x86/kvm/mmu.c              | 2 +-
> > >  3 files changed, 9 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> > > index bc62e7cbf1b1..0a5ae6bb128b 100644
> > > --- a/arch/x86/include/asm/kvm_para.h
> > > +++ b/arch/x86/include/asm/kvm_para.h
> > > @@ -88,7 +88,7 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
> > >  bool kvm_para_available(void);
> > >  unsigned int kvm_arch_para_features(void);
> > >  void __init kvm_guest_init(void);
> > > -void kvm_async_pf_task_wait(u32 token);
> > > +void kvm_async_pf_task_wait(u32 token, int user);
> > >  void kvm_async_pf_task_wake(u32 token);
> > >  u32 kvm_read_and_reset_pf_reason(void);
> > >  extern void kvm_disable_steal_time(void);
> > > @@ -103,7 +103,7 @@ static inline void kvm_spinlock_init(void)
> > >  
> > >  #else /* CONFIG_KVM_GUEST */
> > >  #define kvm_guest_init() do {} while (0)
> > > -#define kvm_async_pf_task_wait(T) do {} while(0)
> > > +#define kvm_async_pf_task_wait(T, U) do {} while(0)
> > >  #define kvm_async_pf_task_wake(T) do {} while(0)
> > >  
> > >  static inline bool kvm_para_available(void)
> > > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > > index aa60a08b65b1..916f519e54c9 100644
> > > --- a/arch/x86/kernel/kvm.c
> > > +++ b/arch/x86/kernel/kvm.c
> > > @@ -117,7 +117,7 @@ static struct kvm_task_sleep_node *_find_apf_task(struct kvm_task_sleep_head *b,
> > >  	return NULL;
> > >  }
> > >  
> > > -void kvm_async_pf_task_wait(u32 token)
> > > +void kvm_async_pf_task_wait(u32 token, int user)
> > >  {
> > >  	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
> > >  	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
> > > @@ -140,7 +140,10 @@ void kvm_async_pf_task_wait(u32 token)
> > >  
> > >  	n.token = token;
> > >  	n.cpu = smp_processor_id();
> > > -	n.halted = is_idle_task(current) || preempt_count() > 1;
> > > +	n.halted = is_idle_task(current) ||
> > > +		   preempt_count() > 1 ||
> > > +		   (!IS_ENABLED(CONFIG_PREEMPT) && !user) ||
> > > +		   rcu_preempt_depth();
> > >  	init_swait_queue_head(&n.wq);
> > >  	hlist_add_head(&n.link, &b->list);
> > >  	raw_spin_unlock(&b->lock);
> > > @@ -268,7 +271,7 @@ do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
> > >  	case KVM_PV_REASON_PAGE_NOT_PRESENT:
> > >  		/* page is swapped out by the host. */
> > >  		prev_state = exception_enter();
> > > -		kvm_async_pf_task_wait((u32)read_cr2());
> > > +		kvm_async_pf_task_wait((u32)read_cr2(), user_mode(regs));
> > >  		exception_exit(prev_state);
> > >  		break;
> > >  	case KVM_PV_REASON_PAGE_READY:
> > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > > index eca30c1eb1d9..106d4a029a8a 100644
> > > --- a/arch/x86/kvm/mmu.c
> > > +++ b/arch/x86/kvm/mmu.c
> > > @@ -3837,7 +3837,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
> > >  	case KVM_PV_REASON_PAGE_NOT_PRESENT:
> > >  		vcpu->arch.apf.host_apf_reason = 0;
> > >  		local_irq_disable();
> > > -		kvm_async_pf_task_wait(fault_address);
> > > +		kvm_async_pf_task_wait(fault_address, 0);
> > >  		local_irq_enable();
> > >  		break;
> > >  	case KVM_PV_REASON_PAGE_READY:
> > > 
> >
Paolo Bonzini Oct. 2, 2017, 9:11 p.m. UTC | #4
----- Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> Actually, we should be able to do a little bit better than that.
> If PREEMPT=n but PREEMPT_COUNT=y, then preempt_count() will know
> about RCU read-side critical sections via preempt_disable().
> 
> So maybe something like this?
> 
> 	n.halted = is_idle_task(current) ||
> 		   preempt_count() > 1 ||
> 		   (!IS_ENABLED(CONFIG_PREEMPT) && !IS_ENABLED(CONFIG_PREEMPT_COUNT) && !user) ||
> 		   rcu_preempt_depth();

Or even

 is_idle_task(current) ||
 (IS_ENABLED(CONFIG_PREEMPT_COUNT)
  ? preempt_count() > 1 || rcu_preempt_depth()
  : !user)

since PREEMPT selects PREEMPT_COUNT.

Paolo

> 							Thanx, Paul
> 
> > > I had already applied v1, can you rebase and resend please?  Thanks,
> > > 
> > 
> > Sure, I'm going to rename that parameter to "interrupt_kernel"(probably
> > a bad name"), indicating whether the async PF interrupts the kernel.
> > 
> > But it's a little bit late today, will do that tomorrow.
> > 
> > Regards,
> > Boqun
> > 
> > 
> > > Paolo
> > > 
> > > >  arch/x86/include/asm/kvm_para.h | 4 ++--
> > > >  arch/x86/kernel/kvm.c           | 9 ++++++---
> > > >  arch/x86/kvm/mmu.c              | 2 +-
> > > >  3 files changed, 9 insertions(+), 6 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> > > > index bc62e7cbf1b1..0a5ae6bb128b 100644
> > > > --- a/arch/x86/include/asm/kvm_para.h
> > > > +++ b/arch/x86/include/asm/kvm_para.h
> > > > @@ -88,7 +88,7 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
> > > >  bool kvm_para_available(void);
> > > >  unsigned int kvm_arch_para_features(void);
> > > >  void __init kvm_guest_init(void);
> > > > -void kvm_async_pf_task_wait(u32 token);
> > > > +void kvm_async_pf_task_wait(u32 token, int user);
> > > >  void kvm_async_pf_task_wake(u32 token);
> > > >  u32 kvm_read_and_reset_pf_reason(void);
> > > >  extern void kvm_disable_steal_time(void);
> > > > @@ -103,7 +103,7 @@ static inline void kvm_spinlock_init(void)
> > > >  
> > > >  #else /* CONFIG_KVM_GUEST */
> > > >  #define kvm_guest_init() do {} while (0)
> > > > -#define kvm_async_pf_task_wait(T) do {} while(0)
> > > > +#define kvm_async_pf_task_wait(T, U) do {} while(0)
> > > >  #define kvm_async_pf_task_wake(T) do {} while(0)
> > > >  
> > > >  static inline bool kvm_para_available(void)
> > > > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > > > index aa60a08b65b1..916f519e54c9 100644
> > > > --- a/arch/x86/kernel/kvm.c
> > > > +++ b/arch/x86/kernel/kvm.c
> > > > @@ -117,7 +117,7 @@ static struct kvm_task_sleep_node *_find_apf_task(struct kvm_task_sleep_head *b,
> > > >  	return NULL;
> > > >  }
> > > >  
> > > > -void kvm_async_pf_task_wait(u32 token)
> > > > +void kvm_async_pf_task_wait(u32 token, int user)
> > > >  {
> > > >  	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
> > > >  	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
> > > > @@ -140,7 +140,10 @@ void kvm_async_pf_task_wait(u32 token)
> > > >  
> > > >  	n.token = token;
> > > >  	n.cpu = smp_processor_id();
> > > > -	n.halted = is_idle_task(current) || preempt_count() > 1;
> > > > +	n.halted = is_idle_task(current) ||
> > > > +		   preempt_count() > 1 ||
> > > > +		   (!IS_ENABLED(CONFIG_PREEMPT) && !user) ||
> > > > +		   rcu_preempt_depth();
> > > >  	init_swait_queue_head(&n.wq);
> > > >  	hlist_add_head(&n.link, &b->list);
> > > >  	raw_spin_unlock(&b->lock);
> > > > @@ -268,7 +271,7 @@ do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
> > > >  	case KVM_PV_REASON_PAGE_NOT_PRESENT:
> > > >  		/* page is swapped out by the host. */
> > > >  		prev_state = exception_enter();
> > > > -		kvm_async_pf_task_wait((u32)read_cr2());
> > > > +		kvm_async_pf_task_wait((u32)read_cr2(), user_mode(regs));
> > > >  		exception_exit(prev_state);
> > > >  		break;
> > > >  	case KVM_PV_REASON_PAGE_READY:
> > > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > > > index eca30c1eb1d9..106d4a029a8a 100644
> > > > --- a/arch/x86/kvm/mmu.c
> > > > +++ b/arch/x86/kvm/mmu.c
> > > > @@ -3837,7 +3837,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
> > > >  	case KVM_PV_REASON_PAGE_NOT_PRESENT:
> > > >  		vcpu->arch.apf.host_apf_reason = 0;
> > > >  		local_irq_disable();
> > > > -		kvm_async_pf_task_wait(fault_address);
> > > > +		kvm_async_pf_task_wait(fault_address, 0);
> > > >  		local_irq_enable();
> > > >  		break;
> > > >  	case KVM_PV_REASON_PAGE_READY:
> > > > 
> > > 
> 
>
Paul E. McKenney Oct. 2, 2017, 9:41 p.m. UTC | #5
On Mon, Oct 02, 2017 at 05:11:07PM -0400, Paolo Bonzini wrote:
> 
> ----- Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > Actually, we should be able to do a little bit better than that.
> > If PREEMPT=n but PREEMPT_COUNT=y, then preempt_count() will know
> > about RCU read-side critical sections via preempt_disable().
> > 
> > So maybe something like this?
> > 
> > 	n.halted = is_idle_task(current) ||
> > 		   preempt_count() > 1 ||
> > 		   (!IS_ENABLED(CONFIG_PREEMPT) && !IS_ENABLED(CONFIG_PREEMPT_COUNT) && !user) ||
> > 		   rcu_preempt_depth();
> 
> Or even
> 
>  is_idle_task(current) ||
>  (IS_ENABLED(CONFIG_PREEMPT_COUNT)
>   ? preempt_count() > 1 || rcu_preempt_depth()
>   : !user)
> 
> since PREEMPT selects PREEMPT_COUNT.

Makes sense to me!

							Thanx, Paul

> Paolo
> 
> > 							Thanx, Paul
> > 
> > > > I had already applied v1, can you rebase and resend please?  Thanks,
> > > > 
> > > 
> > > Sure, I'm going to rename that parameter to "interrupt_kernel"(probably
> > > a bad name"), indicating whether the async PF interrupts the kernel.
> > > 
> > > But it's a little bit late today, will do that tomorrow.
> > > 
> > > Regards,
> > > Boqun
> > > 
> > > 
> > > > Paolo
> > > > 
> > > > >  arch/x86/include/asm/kvm_para.h | 4 ++--
> > > > >  arch/x86/kernel/kvm.c           | 9 ++++++---
> > > > >  arch/x86/kvm/mmu.c              | 2 +-
> > > > >  3 files changed, 9 insertions(+), 6 deletions(-)
> > > > > 
> > > > > diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> > > > > index bc62e7cbf1b1..0a5ae6bb128b 100644
> > > > > --- a/arch/x86/include/asm/kvm_para.h
> > > > > +++ b/arch/x86/include/asm/kvm_para.h
> > > > > @@ -88,7 +88,7 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
> > > > >  bool kvm_para_available(void);
> > > > >  unsigned int kvm_arch_para_features(void);
> > > > >  void __init kvm_guest_init(void);
> > > > > -void kvm_async_pf_task_wait(u32 token);
> > > > > +void kvm_async_pf_task_wait(u32 token, int user);
> > > > >  void kvm_async_pf_task_wake(u32 token);
> > > > >  u32 kvm_read_and_reset_pf_reason(void);
> > > > >  extern void kvm_disable_steal_time(void);
> > > > > @@ -103,7 +103,7 @@ static inline void kvm_spinlock_init(void)
> > > > >  
> > > > >  #else /* CONFIG_KVM_GUEST */
> > > > >  #define kvm_guest_init() do {} while (0)
> > > > > -#define kvm_async_pf_task_wait(T) do {} while(0)
> > > > > +#define kvm_async_pf_task_wait(T, U) do {} while(0)
> > > > >  #define kvm_async_pf_task_wake(T) do {} while(0)
> > > > >  
> > > > >  static inline bool kvm_para_available(void)
> > > > > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > > > > index aa60a08b65b1..916f519e54c9 100644
> > > > > --- a/arch/x86/kernel/kvm.c
> > > > > +++ b/arch/x86/kernel/kvm.c
> > > > > @@ -117,7 +117,7 @@ static struct kvm_task_sleep_node *_find_apf_task(struct kvm_task_sleep_head *b,
> > > > >  	return NULL;
> > > > >  }
> > > > >  
> > > > > -void kvm_async_pf_task_wait(u32 token)
> > > > > +void kvm_async_pf_task_wait(u32 token, int user)
> > > > >  {
> > > > >  	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
> > > > >  	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
> > > > > @@ -140,7 +140,10 @@ void kvm_async_pf_task_wait(u32 token)
> > > > >  
> > > > >  	n.token = token;
> > > > >  	n.cpu = smp_processor_id();
> > > > > -	n.halted = is_idle_task(current) || preempt_count() > 1;
> > > > > +	n.halted = is_idle_task(current) ||
> > > > > +		   preempt_count() > 1 ||
> > > > > +		   (!IS_ENABLED(CONFIG_PREEMPT) && !user) ||
> > > > > +		   rcu_preempt_depth();
> > > > >  	init_swait_queue_head(&n.wq);
> > > > >  	hlist_add_head(&n.link, &b->list);
> > > > >  	raw_spin_unlock(&b->lock);
> > > > > @@ -268,7 +271,7 @@ do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
> > > > >  	case KVM_PV_REASON_PAGE_NOT_PRESENT:
> > > > >  		/* page is swapped out by the host. */
> > > > >  		prev_state = exception_enter();
> > > > > -		kvm_async_pf_task_wait((u32)read_cr2());
> > > > > +		kvm_async_pf_task_wait((u32)read_cr2(), user_mode(regs));
> > > > >  		exception_exit(prev_state);
> > > > >  		break;
> > > > >  	case KVM_PV_REASON_PAGE_READY:
> > > > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > > > > index eca30c1eb1d9..106d4a029a8a 100644
> > > > > --- a/arch/x86/kvm/mmu.c
> > > > > +++ b/arch/x86/kvm/mmu.c
> > > > > @@ -3837,7 +3837,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
> > > > >  	case KVM_PV_REASON_PAGE_NOT_PRESENT:
> > > > >  		vcpu->arch.apf.host_apf_reason = 0;
> > > > >  		local_irq_disable();
> > > > > -		kvm_async_pf_task_wait(fault_address);
> > > > > +		kvm_async_pf_task_wait(fault_address, 0);
> > > > >  		local_irq_enable();
> > > > >  		break;
> > > > >  	case KVM_PV_REASON_PAGE_READY:
> > > > > 
> > > > 
> > 
> > 
>

Patch

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index bc62e7cbf1b1..0a5ae6bb128b 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -88,7 +88,7 @@  static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
 bool kvm_para_available(void);
 unsigned int kvm_arch_para_features(void);
 void __init kvm_guest_init(void);
-void kvm_async_pf_task_wait(u32 token);
+void kvm_async_pf_task_wait(u32 token, int user);
 void kvm_async_pf_task_wake(u32 token);
 u32 kvm_read_and_reset_pf_reason(void);
 extern void kvm_disable_steal_time(void);
@@ -103,7 +103,7 @@  static inline void kvm_spinlock_init(void)
 
 #else /* CONFIG_KVM_GUEST */
 #define kvm_guest_init() do {} while (0)
-#define kvm_async_pf_task_wait(T) do {} while(0)
+#define kvm_async_pf_task_wait(T, U) do {} while(0)
 #define kvm_async_pf_task_wake(T) do {} while(0)
 
 static inline bool kvm_para_available(void)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index aa60a08b65b1..916f519e54c9 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -117,7 +117,7 @@  static struct kvm_task_sleep_node *_find_apf_task(struct kvm_task_sleep_head *b,
 	return NULL;
 }
 
-void kvm_async_pf_task_wait(u32 token)
+void kvm_async_pf_task_wait(u32 token, int user)
 {
 	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
 	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
@@ -140,7 +140,10 @@  void kvm_async_pf_task_wait(u32 token)
 
 	n.token = token;
 	n.cpu = smp_processor_id();
-	n.halted = is_idle_task(current) || preempt_count() > 1;
+	n.halted = is_idle_task(current) ||
+		   preempt_count() > 1 ||
+		   (!IS_ENABLED(CONFIG_PREEMPT) && !user) ||
+		   rcu_preempt_depth();
 	init_swait_queue_head(&n.wq);
 	hlist_add_head(&n.link, &b->list);
 	raw_spin_unlock(&b->lock);
@@ -268,7 +271,7 @@  do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
 	case KVM_PV_REASON_PAGE_NOT_PRESENT:
 		/* page is swapped out by the host. */
 		prev_state = exception_enter();
-		kvm_async_pf_task_wait((u32)read_cr2());
+		kvm_async_pf_task_wait((u32)read_cr2(), user_mode(regs));
 		exception_exit(prev_state);
 		break;
 	case KVM_PV_REASON_PAGE_READY:
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index eca30c1eb1d9..106d4a029a8a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3837,7 +3837,7 @@  int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
 	case KVM_PV_REASON_PAGE_NOT_PRESENT:
 		vcpu->arch.apf.host_apf_reason = 0;
 		local_irq_disable();
-		kvm_async_pf_task_wait(fault_address);
+		kvm_async_pf_task_wait(fault_address, 0);
 		local_irq_enable();
 		break;
 	case KVM_PV_REASON_PAGE_READY: