
kvm/x86: Avoid async PF to end RCU read-side critical section early in PREEMPT=n kernel

Message ID 20171003133653.1178-1-boqun.feng@gmail.com (mailing list archive)
State New, archived

Commit Message

Boqun Feng Oct. 3, 2017, 1:36 p.m. UTC
Currently, in a PREEMPT=n kernel, kvm_async_pf_task_wait() could call
schedule() to reschedule in some cases, which could accidentally end
the current RCU read-side critical section early. This can result in
random memory corruption in the guest.

The difficulty in handling this well is that we don't know whether an
async PF was delivered inside an RCU read-side critical section in a
PREEMPT_COUNT=n kernel, since rcu_read_lock()/unlock() are just no-ops
in that case.

To cure this, we treat any async PF interrupting a kernel context as one
delivered in an RCU read-side critical section, and we don't allow
kvm_async_pf_task_wait() to choose the schedule() path in that case in a
PREEMPT_COUNT=n kernel, because that would introduce involuntary
context switches and break the assumptions RCU relies on to work properly.

To do so, a second parameter for kvm_async_pf_task_wait() is introduced,
so that we know whether it's called from a context interrupting the
kernel, and we set that parameter properly in all the callsites.

Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 arch/x86/include/asm/kvm_para.h |  4 ++--
 arch/x86/kernel/kvm.c           | 14 ++++++++++----
 arch/x86/kvm/mmu.c              |  2 +-
 3 files changed, 13 insertions(+), 7 deletions(-)
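
To illustrate the problem, here is a minimal sketch (not part of the
patch; my_data, gp and use_data() are hypothetical names).  With
CONFIG_PREEMPT_COUNT=n, rcu_read_lock()/rcu_read_unlock() maintain no
nesting counter, so the async PF handler cannot tell that it
interrupted a read-side critical section:

#include <linux/rcupdate.h>

struct my_data;                                 /* hypothetical */
void use_data(struct my_data *p);               /* hypothetical */
struct my_data __rcu *gp;                       /* hypothetical */

static void reader(void)
{
	struct my_data *p;

	rcu_read_lock();                /* no-op when !CONFIG_PREEMPT_COUNT */
	p = rcu_dereference(gp);
	/*
	 * An async PF delivered here may reach kvm_async_pf_task_wait()
	 * and schedule(): an involuntary context switch inside what is
	 * logically still an RCU read-side critical section.
	 */
	use_data(p);
	rcu_read_unlock();              /* also a no-op; nothing flags the violation */
}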

Comments

Paolo Bonzini Oct. 3, 2017, 2:11 p.m. UTC | #1
I'd prefer a slight change in subject and topic:

------- 8< --------
Subject: [PATCH] kvm/x86: Avoid async PF preempting the kernel incorrectly

Currently, in a PREEMPT_COUNT=n kernel, kvm_async_pf_task_wait() could call
schedule() to reschedule in some cases.  This could result in
accidentally ending the current RCU read-side critical section early,
causing random memory corruption in the guest, or otherwise preempting
the currently running task between preempt_disable() and
preempt_enable().

The difficulty in handling this well is that we don't know whether an
async PF was delivered in a preemptible section or an RCU read-side
critical section when PREEMPT_COUNT=n, since preempt_disable()/enable()
and rcu_read_lock()/unlock() are both no-ops in that case.

To cure this, we treat any async PF interrupting a kernel context as one
that cannot be preempted, preventing kvm_async_pf_task_wait() from choosing
the schedule() path in that case.

To do so, a second parameter for kvm_async_pf_task_wait() is introduced,
so that we know whether it's called from a context interrupting the
kernel, and the parameter is set properly in all the callsites.

Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
------- 8< --------
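
For reference, the two callsites in the patch derive the new argument
as follows (the comments here are explanatory and not part of the
patch):

/* Guest #PF path (arch/x86/kernel/kvm.c): the fault interrupted the
 * kernel unless the saved registers show user mode. */
kvm_async_pf_task_wait((u32)read_cr2(), !user_mode(regs));

/* Host-side async PF path (arch/x86/kvm/mmu.c): runs in the vCPU
 * thread's own task context rather than in an exception that
 * interrupted arbitrary kernel code, so it passes 0. */
kvm_async_pf_task_wait(fault_address, 0);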

I'll let Radim pick up the patch.

Paolo

On 03/10/2017 15:36, Boqun Feng wrote:
> Currently, in a PREEMPT=n kernel, kvm_async_pf_task_wait() could call
> schedule() to reschedule in some cases, which could accidentally end
> the current RCU read-side critical section early. This can result in
> random memory corruption in the guest.
> 
> The difficulty in handling this well is that we don't know whether an
> async PF was delivered inside an RCU read-side critical section in a
> PREEMPT_COUNT=n kernel, since rcu_read_lock()/unlock() are just no-ops
> in that case.
> 
> To cure this, we treat any async PF interrupting a kernel context as one
> delivered in an RCU read-side critical section, and we don't allow
> kvm_async_pf_task_wait() to choose the schedule() path in that case in a
> PREEMPT_COUNT=n kernel, because that would introduce involuntary
> context switches and break the assumptions RCU relies on to work properly.
> 
> To do so, a second parameter for kvm_async_pf_task_wait() is introduced,
> so that we know whether it's called from a context interrupting the
> kernel, and we set that parameter properly in all the callsites.
> 
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Wanpeng Li <wanpeng.li@hotmail.com>
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> ---
>  arch/x86/include/asm/kvm_para.h |  4 ++--
>  arch/x86/kernel/kvm.c           | 14 ++++++++++----
>  arch/x86/kvm/mmu.c              |  2 +-
>  3 files changed, 13 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index bc62e7cbf1b1..59ad3d132353 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -88,7 +88,7 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
>  bool kvm_para_available(void);
>  unsigned int kvm_arch_para_features(void);
>  void __init kvm_guest_init(void);
> -void kvm_async_pf_task_wait(u32 token);
> +void kvm_async_pf_task_wait(u32 token, int interrupt_kernel);
>  void kvm_async_pf_task_wake(u32 token);
>  u32 kvm_read_and_reset_pf_reason(void);
>  extern void kvm_disable_steal_time(void);
> @@ -103,7 +103,7 @@ static inline void kvm_spinlock_init(void)
>  
>  #else /* CONFIG_KVM_GUEST */
>  #define kvm_guest_init() do {} while (0)
> -#define kvm_async_pf_task_wait(T) do {} while(0)
> +#define kvm_async_pf_task_wait(T, I) do {} while(0)
>  #define kvm_async_pf_task_wake(T) do {} while(0)
>  
>  static inline bool kvm_para_available(void)
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index e675704fa6f7..8bb9594d0761 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -117,7 +117,11 @@ static struct kvm_task_sleep_node *_find_apf_task(struct kvm_task_sleep_head *b,
>  	return NULL;
>  }
>  
> -void kvm_async_pf_task_wait(u32 token)
> +/*
> + * @interrupt_kernel: Is this called from a routine which interrupts the kernel
> + * 		      (other than user space)?
> + */
> +void kvm_async_pf_task_wait(u32 token, int interrupt_kernel)
>  {
>  	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
>  	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
> @@ -140,8 +144,10 @@ void kvm_async_pf_task_wait(u32 token)
>  
>  	n.token = token;
>  	n.cpu = smp_processor_id();
> -	n.halted = is_idle_task(current) || preempt_count() > 1 ||
> -		   rcu_preempt_depth();
> +	n.halted = is_idle_task(current) ||
> +		   (IS_ENABLED(CONFIG_PREEMPT_COUNT)
> +		    ? preempt_count() > 1 || rcu_preempt_depth()
> +		    : interrupt_kernel);
>  	init_swait_queue_head(&n.wq);
>  	hlist_add_head(&n.link, &b->list);
>  	raw_spin_unlock(&b->lock);
> @@ -269,7 +275,7 @@ do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
>  	case KVM_PV_REASON_PAGE_NOT_PRESENT:
>  		/* page is swapped out by the host. */
>  		prev_state = exception_enter();
> -		kvm_async_pf_task_wait((u32)read_cr2());
> +		kvm_async_pf_task_wait((u32)read_cr2(), !user_mode(regs));
>  		exception_exit(prev_state);
>  		break;
>  	case KVM_PV_REASON_PAGE_READY:
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index eca30c1eb1d9..106d4a029a8a 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -3837,7 +3837,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>  	case KVM_PV_REASON_PAGE_NOT_PRESENT:
>  		vcpu->arch.apf.host_apf_reason = 0;
>  		local_irq_disable();
> -		kvm_async_pf_task_wait(fault_address);
> +		kvm_async_pf_task_wait(fault_address, 0);
>  		local_irq_enable();
>  		break;
>  	case KVM_PV_REASON_PAGE_READY:
>
Boqun Feng Oct. 6, 2017, 1:33 a.m. UTC | #2
On Tue, Oct 03, 2017 at 02:11:08PM +0000, Paolo Bonzini wrote:
> I'd prefer a slight change in subject and topic:
> 
> ------- 8< --------
> Subject: [PATCH] kvm/x86: Avoid async PF preempting the kernel incorrectly
> 
> Currently, in a PREEMPT_COUNT=n kernel, kvm_async_pf_task_wait() could call
> schedule() to reschedule in some cases.  This could result in
> accidentally ending the current RCU read-side critical section early,
> causing random memory corruption in the guest, or otherwise preempting
> the currently running task between preempt_disable() and
> preempt_enable().
> 
> The difficulty in handling this well is that we don't know whether an
> async PF was delivered in a preemptible section or an RCU read-side
> critical section when PREEMPT_COUNT=n, since preempt_disable()/enable()
> and rcu_read_lock()/unlock() are both no-ops in that case.
> 
> To cure this, we treat any async PF interrupting a kernel context as one
> that cannot be preempted, preventing kvm_async_pf_task_wait() from choosing
> the schedule() path in that case.
> 
> To do so, a second parameter for kvm_async_pf_task_wait() is introduced,
> so that we know whether it's called from a context interrupting the
> kernel, and the parameter is set properly in all the callsites.
> 
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Wanpeng Li <wanpeng.li@hotmail.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> ------- 8< --------
> 

It's more concise and accurate now!

Learned a lot from your modifications to the commit message, thanks!

Regards,
Boqun

> I'll let Radim pick up the patch.
> 
> Paolo
> 
> On 03/10/2017 15:36, Boqun Feng wrote:
> > Currently, in a PREEMPT=n kernel, kvm_async_pf_task_wait() could call
> > schedule() to reschedule in some cases, which could accidentally end
> > the current RCU read-side critical section early. This can result in
> > random memory corruption in the guest.
> > 
> > The difficulty in handling this well is that we don't know whether an
> > async PF was delivered inside an RCU read-side critical section in a
> > PREEMPT_COUNT=n kernel, since rcu_read_lock()/unlock() are just no-ops
> > in that case.
> > 
> > To cure this, we treat any async PF interrupting a kernel context as one
> > delivered in an RCU read-side critical section, and we don't allow
> > kvm_async_pf_task_wait() to choose the schedule() path in that case in a
> > PREEMPT_COUNT=n kernel, because that would introduce involuntary
> > context switches and break the assumptions RCU relies on to work properly.
> > 
> > To do so, a second parameter for kvm_async_pf_task_wait() is introduced,
> > so that we know whether it's called from a context interrupting the
> > kernel, and we set that parameter properly in all the callsites.
> > 
> > Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Wanpeng Li <wanpeng.li@hotmail.com>
> > Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> > ---
> >  arch/x86/include/asm/kvm_para.h |  4 ++--
> >  arch/x86/kernel/kvm.c           | 14 ++++++++++----
> >  arch/x86/kvm/mmu.c              |  2 +-
> >  3 files changed, 13 insertions(+), 7 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> > index bc62e7cbf1b1..59ad3d132353 100644
> > --- a/arch/x86/include/asm/kvm_para.h
> > +++ b/arch/x86/include/asm/kvm_para.h
> > @@ -88,7 +88,7 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
> >  bool kvm_para_available(void);
> >  unsigned int kvm_arch_para_features(void);
> >  void __init kvm_guest_init(void);
> > -void kvm_async_pf_task_wait(u32 token);
> > +void kvm_async_pf_task_wait(u32 token, int interrupt_kernel);
> >  void kvm_async_pf_task_wake(u32 token);
> >  u32 kvm_read_and_reset_pf_reason(void);
> >  extern void kvm_disable_steal_time(void);
> > @@ -103,7 +103,7 @@ static inline void kvm_spinlock_init(void)
> >  
> >  #else /* CONFIG_KVM_GUEST */
> >  #define kvm_guest_init() do {} while (0)
> > -#define kvm_async_pf_task_wait(T) do {} while(0)
> > +#define kvm_async_pf_task_wait(T, I) do {} while(0)
> >  #define kvm_async_pf_task_wake(T) do {} while(0)
> >  
> >  static inline bool kvm_para_available(void)
> > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > index e675704fa6f7..8bb9594d0761 100644
> > --- a/arch/x86/kernel/kvm.c
> > +++ b/arch/x86/kernel/kvm.c
> > @@ -117,7 +117,11 @@ static struct kvm_task_sleep_node *_find_apf_task(struct kvm_task_sleep_head *b,
> >  	return NULL;
> >  }
> >  
> > -void kvm_async_pf_task_wait(u32 token)
> > +/*
> > + * @interrupt_kernel: Is this called from a routine which interrupts the kernel
> > + * 		      (other than user space)?
> > + */
> > +void kvm_async_pf_task_wait(u32 token, int interrupt_kernel)
> >  {
> >  	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
> >  	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
> > @@ -140,8 +144,10 @@ void kvm_async_pf_task_wait(u32 token)
> >  
> >  	n.token = token;
> >  	n.cpu = smp_processor_id();
> > -	n.halted = is_idle_task(current) || preempt_count() > 1 ||
> > -		   rcu_preempt_depth();
> > +	n.halted = is_idle_task(current) ||
> > +		   (IS_ENABLED(CONFIG_PREEMPT_COUNT)
> > +		    ? preempt_count() > 1 || rcu_preempt_depth()
> > +		    : interrupt_kernel);
> >  	init_swait_queue_head(&n.wq);
> >  	hlist_add_head(&n.link, &b->list);
> >  	raw_spin_unlock(&b->lock);
> > @@ -269,7 +275,7 @@ do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
> >  	case KVM_PV_REASON_PAGE_NOT_PRESENT:
> >  		/* page is swapped out by the host. */
> >  		prev_state = exception_enter();
> > -		kvm_async_pf_task_wait((u32)read_cr2());
> > +		kvm_async_pf_task_wait((u32)read_cr2(), !user_mode(regs));
> >  		exception_exit(prev_state);
> >  		break;
> >  	case KVM_PV_REASON_PAGE_READY:
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index eca30c1eb1d9..106d4a029a8a 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -3837,7 +3837,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
> >  	case KVM_PV_REASON_PAGE_NOT_PRESENT:
> >  		vcpu->arch.apf.host_apf_reason = 0;
> >  		local_irq_disable();
> > -		kvm_async_pf_task_wait(fault_address);
> > +		kvm_async_pf_task_wait(fault_address, 0);
> >  		local_irq_enable();
> >  		break;
> >  	case KVM_PV_REASON_PAGE_READY:
> > 
>
Radim Krčmář Oct. 6, 2017, 12:41 p.m. UTC | #3
2017-10-06 09:33+0800, Boqun Feng:
> On Tue, Oct 03, 2017 at 02:11:08PM +0000, Paolo Bonzini wrote:
> > I'd prefer a slight change in subject and topic:
> > 
> > ------- 8< --------
> > Subject: [PATCH] kvm/x86: Avoid async PF preempting the kernel incorrectly
> > 
> > Currently, in a PREEMPT_COUNT=n kernel, kvm_async_pf_task_wait() could call
> > schedule() to reschedule in some cases.  This could result in
> > accidentally ending the current RCU read-side critical section early,
> > causing random memory corruption in the guest, or otherwise preempting
> > the currently running task between preempt_disable() and
> > preempt_enable().
> > 
> > The difficulty in handling this well is that we don't know whether an
> > async PF was delivered in a preemptible section or an RCU read-side
> > critical section when PREEMPT_COUNT=n, since preempt_disable()/enable()
> > and rcu_read_lock()/unlock() are both no-ops in that case.
> > 
> > To cure this, we treat any async PF interrupting a kernel context as one
> > that cannot be preempted, preventing kvm_async_pf_task_wait() from choosing
> > the schedule() path in that case.
> > 
> > To do so, a second parameter for kvm_async_pf_task_wait() is introduced,
> > so that we know whether it's called from a context interrupting the
> > kernel, and the parameter is set properly in all the callsites.
> > 
> > Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Wanpeng Li <wanpeng.li@hotmail.com>
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> > ------- 8< --------
> > 
> 
> It's more concise and accurate now!
> 
> Learned a lot from your modification of commit messages, thanks!

Applied with the updated commit message, thanks.
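
The crux of the applied change is the halt-versus-schedule decision in
kvm_async_pf_task_wait(); the excerpt below is taken from the patch that
follows, with an added explanatory comment:

/*
 * With CONFIG_PREEMPT_COUNT=n, preempt_disable() and rcu_read_lock()
 * leave no trace in preempt_count(), so any async PF that interrupted
 * the kernel is forced onto the halt path instead of being allowed to
 * schedule().
 */
n.halted = is_idle_task(current) ||
	   (IS_ENABLED(CONFIG_PREEMPT_COUNT)
	    ? preempt_count() > 1 || rcu_preempt_depth()
	    : interrupt_kernel);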

Patch

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index bc62e7cbf1b1..59ad3d132353 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -88,7 +88,7 @@  static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
 bool kvm_para_available(void);
 unsigned int kvm_arch_para_features(void);
 void __init kvm_guest_init(void);
-void kvm_async_pf_task_wait(u32 token);
+void kvm_async_pf_task_wait(u32 token, int interrupt_kernel);
 void kvm_async_pf_task_wake(u32 token);
 u32 kvm_read_and_reset_pf_reason(void);
 extern void kvm_disable_steal_time(void);
@@ -103,7 +103,7 @@  static inline void kvm_spinlock_init(void)
 
 #else /* CONFIG_KVM_GUEST */
 #define kvm_guest_init() do {} while (0)
-#define kvm_async_pf_task_wait(T) do {} while(0)
+#define kvm_async_pf_task_wait(T, I) do {} while(0)
 #define kvm_async_pf_task_wake(T) do {} while(0)
 
 static inline bool kvm_para_available(void)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index e675704fa6f7..8bb9594d0761 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -117,7 +117,11 @@  static struct kvm_task_sleep_node *_find_apf_task(struct kvm_task_sleep_head *b,
 	return NULL;
 }
 
-void kvm_async_pf_task_wait(u32 token)
+/*
+ * @interrupt_kernel: Is this called from a routine which interrupts the kernel
+ * 		      (other than user space)?
+ */
+void kvm_async_pf_task_wait(u32 token, int interrupt_kernel)
 {
 	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
 	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
@@ -140,8 +144,10 @@  void kvm_async_pf_task_wait(u32 token)
 
 	n.token = token;
 	n.cpu = smp_processor_id();
-	n.halted = is_idle_task(current) || preempt_count() > 1 ||
-		   rcu_preempt_depth();
+	n.halted = is_idle_task(current) ||
+		   (IS_ENABLED(CONFIG_PREEMPT_COUNT)
+		    ? preempt_count() > 1 || rcu_preempt_depth()
+		    : interrupt_kernel);
 	init_swait_queue_head(&n.wq);
 	hlist_add_head(&n.link, &b->list);
 	raw_spin_unlock(&b->lock);
@@ -269,7 +275,7 @@  do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
 	case KVM_PV_REASON_PAGE_NOT_PRESENT:
 		/* page is swapped out by the host. */
 		prev_state = exception_enter();
-		kvm_async_pf_task_wait((u32)read_cr2());
+		kvm_async_pf_task_wait((u32)read_cr2(), !user_mode(regs));
 		exception_exit(prev_state);
 		break;
 	case KVM_PV_REASON_PAGE_READY:
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index eca30c1eb1d9..106d4a029a8a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3837,7 +3837,7 @@  int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
 	case KVM_PV_REASON_PAGE_NOT_PRESENT:
 		vcpu->arch.apf.host_apf_reason = 0;
 		local_irq_disable();
-		kvm_async_pf_task_wait(fault_address);
+		kvm_async_pf_task_wait(fault_address, 0);
 		local_irq_enable();
 		break;
 	case KVM_PV_REASON_PAGE_READY: