kvm: fix waitqueue_active without memory barrier in virt/kvm/async_pf.c
diff mbox

Message ID 17EC94B0A072C34B8DCF0D30AD16044A028747C1@BPXM09GP.gisp.nec.co.jp
State New
Headers show

Commit Message

Kosuke Tatsukawa Oct. 9, 2015, 12:35 a.m. UTC
async_pf_execute() seems to be missing a memory barrier which might
cause the waker to not notice the waiter and miss sending a wake_up as
in the following figure.

        async_pf_execute                    kvm_vcpu_block
------------------------------------------------------------------------
spin_lock(&vcpu->async_pf.lock);
if (waitqueue_active(&vcpu->wq))
/* The CPU might reorder the test for
   the waitqueue up here, before
   prior writes complete */
                                    prepare_to_wait(&vcpu->wq, &wait,
                                      TASK_INTERRUPTIBLE);
                                    /*if (kvm_vcpu_check_block(vcpu) < 0) */
                                     /*if (kvm_arch_vcpu_runnable(vcpu)) { */
                                      ...
                                      return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
                                        !vcpu->arch.apf.halted)
                                        || !list_empty_careful(&vcpu->async_pf.done)
                                     ...
                                     return 0;
list_add_tail(&apf->link,
  &vcpu->async_pf.done);
spin_unlock(&vcpu->async_pf.lock);
                                    waited = true;
                                    schedule();
------------------------------------------------------------------------

The attached patch adds the missing memory barrier.

I found this issue when I was looking through the linux source code
for places calling waitqueue_active() before wake_up*(), but without
preceding memory barriers, after sending a patch to fix a similar
issue in drivers/tty/n_tty.c  (Details about the original issue can be
found here: https://lkml.org/lkml/2015/9/28/849).

Signed-off-by: Kosuke Tatsukawa <tatsu@ab.jp.nec.com>
---
 virt/kvm/async_pf.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

Comments

Paolo Bonzini Oct. 9, 2015, 8:45 a.m. UTC | #1
On 09/10/2015 02:35, Kosuke Tatsukawa wrote:
>         async_pf_execute                    kvm_vcpu_block
> ------------------------------------------------------------------------
> spin_lock(&vcpu->async_pf.lock);
> if (waitqueue_active(&vcpu->wq))
> /* The CPU might reorder the test for
>    the waitqueue up here, before
>    prior writes complete */
>                                     prepare_to_wait(&vcpu->wq, &wait,
>                                       TASK_INTERRUPTIBLE);
>                                     /*if (kvm_vcpu_check_block(vcpu) < 0) */
>                                      /*if (kvm_arch_vcpu_runnable(vcpu)) { */
>                                       ...
>                                       return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
>                                         !vcpu->arch.apf.halted)
>                                         || !list_empty_careful(&vcpu->async_pf.done)
>                                      ...

The new memory barrier isn't "paired" with any other, and in
fact I think that the same issue exists on the other side: 
list_empty_careful(&vcpu->async_pf.done) may be reordered up,
before the prepare_to_wait:

spin_lock(&vcpu->async_pf.lock);
                                    (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
                                            !vcpu->arch.apf.halted)
                                            || !list_empty_careful(&vcpu->async_pf.done)
                                    ...
                                    prepare_to_wait(&vcpu->wq, &wait,
                                      TASK_INTERRUPTIBLE);
                                    /*if (kvm_vcpu_check_block(vcpu) < 0) */
                                     /*if (kvm_arch_vcpu_runnable(vcpu)) { */
                                      ...
                                     return 0;
list_add_tail(&apf->link,
  &vcpu->async_pf.done);
spin_unlock(&vcpu->async_pf.lock);
                                    waited = true;
                                    schedule();
if (waitqueue_active(&vcpu->wq))

So you need another smp_mb() after prepare_to_wait().  I'm not sure
if it's needed also for your original tty report, but I think it is
for https://lkml.org/lkml/2015/10/8/989 ("mei: fix waitqueue_active
without memory barrier in mei drivers").

I wonder if it makes sense to introduce smp_mb__before_spin_lock()
and smp_mb__after_spin_unlock().  On x86 the former could be a
simple compiler barrier, and on s390 both of them could.  But that
should be a separate patch.

Thanks,

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Zijlstra Oct. 9, 2015, 8:50 a.m. UTC | #2
On Fri, Oct 09, 2015 at 10:45:32AM +0200, Paolo Bonzini wrote:
> So you need another smp_mb() after prepare_to_wait().  I'm not sure
> if it's needed also for your original tty report, but I think it is
> for https://lkml.org/lkml/2015/10/8/989 ("mei: fix waitqueue_active
> without memory barrier in mei drivers").
> 
> I wonder if it makes sense to introduce smp_mb__before_spin_lock()
> and smp_mb__after_spin_unlock().  On x86 the former could be a
> simple compiler barrier, and on s390 both of them could.  But that
> should be a separate patch.

Not having actually read or thought about the issue at hand, its
perfectly valid to pair an smp_mb() with either spin_lock() or
spin_unlock().

IOW. MB <-> {ACQUIRE, RELEASE} is a valid pairing.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Kosuke Tatsukawa Oct. 9, 2015, 9:04 a.m. UTC | #3
Paolo Bonzini wrote:
> On 09/10/2015 02:35, Kosuke Tatsukawa wrote:
>>         async_pf_execute                    kvm_vcpu_block
>> ------------------------------------------------------------------------
>> spin_lock(&vcpu->async_pf.lock);
>> if (waitqueue_active(&vcpu->wq))
>> /* The CPU might reorder the test for
>>    the waitqueue up here, before
>>    prior writes complete */
>>                                     prepare_to_wait(&vcpu->wq, &wait,
>>                                       TASK_INTERRUPTIBLE);
>>                                     /*if (kvm_vcpu_check_block(vcpu) < 0) */
>>                                      /*if (kvm_arch_vcpu_runnable(vcpu)) { */
>>                                       ...
>>                                       return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
>>                                         !vcpu->arch.apf.halted)
>>                                         || !list_empty_careful(&vcpu->async_pf.done)
>>                                      ...
> 
> The new memory barrier isn't "paired" with any other, and in
> fact I think that the same issue exists on the other side: 
> list_empty_careful(&vcpu->async_pf.done) may be reordered up,
> before the prepare_to_wait:

smp_store_mb() called from set_current_state(), which is called from
prepare_to_wait() should prevent reordering such as below from
happening.  wait_event*() also calls set_current_state() inside.


> spin_lock(&vcpu->async_pf.lock);
>                                     (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
>                                             !vcpu->arch.apf.halted)
>                                             || !list_empty_careful(&vcpu->async_pf.done)
>                                     ...
>                                     prepare_to_wait(&vcpu->wq, &wait,
>                                       TASK_INTERRUPTIBLE);
>                                     /*if (kvm_vcpu_check_block(vcpu) < 0) */
>                                      /*if (kvm_arch_vcpu_runnable(vcpu)) { */
>                                       ...
>                                      return 0;
> list_add_tail(&apf->link,
>   &vcpu->async_pf.done);
> spin_unlock(&vcpu->async_pf.lock);
>                                     waited = true;
>                                     schedule();
> if (waitqueue_active(&vcpu->wq))
> 
> So you need another smp_mb() after prepare_to_wait().  I'm not sure
> if it's needed also for your original tty report, but I think it is
> for https://lkml.org/lkml/2015/10/8/989 ("mei: fix waitqueue_active
> without memory barrier in mei drivers").
> 
> I wonder if it makes sense to introduce smp_mb__before_spin_lock()
> and smp_mb__after_spin_unlock().  On x86 the former could be a
> simple compiler barrier, and on s390 both of them could.  But that
> should be a separate patch.

The problem on the waiter side occurs if add_wait_queue() or
__add_wait_queue() is directly called without memory barriers nor locks
protecting it.


> Thanks,
> 
> Paolo
---
Kosuke TATSUKAWA  | 3rd IT Platform Department
                  | IT Platform Division, NEC Corporation
                  | tatsu@ab.jp.nec.com
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paolo Bonzini Oct. 9, 2015, 10:24 a.m. UTC | #4
On 09/10/2015 10:50, Peter Zijlstra wrote:
> Not having actually read or thought about the issue at hand, its
> perfectly valid to pair an smp_mb() with either spin_lock() or
> spin_unlock().
> 
> IOW. MB <-> {ACQUIRE, RELEASE} is a valid pairing.

In this case it's an smp_mb() (store-load barrier) being paired with
another smp_mb(), so spin_unlock()'s release is not enough.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paolo Bonzini Oct. 9, 2015, 10:26 a.m. UTC | #5
On 09/10/2015 11:04, Kosuke Tatsukawa wrote:
> smp_store_mb() called from set_current_state(), which is called from
> prepare_to_wait() should prevent reordering such as below from
> happening.  wait_event*() also calls set_current_state() inside.

Ah, I missed that set_current_state has a memory barrier in it.  The
patch is okay, but please expand the comment to say that this memory
barrier pairs with prepare_to_wait's set_current_state().

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Kosuke Tatsukawa Oct. 9, 2015, 12:21 p.m. UTC | #6
Paolo Bonzini wrote:
> On 09/10/2015 11:04, Kosuke Tatsukawa wrote:
>> smp_store_mb() called from set_current_state(), which is called from
>> prepare_to_wait() should prevent reordering such as below from
>> happening.  wait_event*() also calls set_current_state() inside.
>
> Ah, I missed that set_current_state has a memory barrier in it.  The
> patch is okay, but please expand the comment to say that this memory
> barrier pairs with prepare_to_wait's set_current_state().

thank you for the comment.  I'll send a new version of the patch with
the modification.
---
Kosuke TATSUKAWA  | 3rd IT Platform Department
                  | IT Platform Division, NEC Corporation
                  | tatsu@ab.jp.nec.com
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index 44660ae..303fc7f 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -94,6 +94,11 @@  static void async_pf_execute(struct work_struct *work)
 
 	trace_kvm_async_pf_completed(addr, gva);
 
+	/*
+	 * Memory barrier is required here to make sure change to
+	 * vcpu->async_pf.done is visible from other CPUs
+	 */
+	smp_mb();
 	if (waitqueue_active(&vcpu->wq))
 		wake_up_interruptible(&vcpu->wq);