mm/oom_kill: Ensure MMU notifier range_end() is paired with range_start()

Message ID 20210310213117.1444147-1-seanjc@google.com
State New, archived

Commit Message

Sean Christopherson March 10, 2021, 9:31 p.m. UTC
Invoke the MMU notifier's .invalidate_range_end() callbacks even if one
of the .invalidate_range_start() callbacks failed.  If there are multiple
notifiers, the notifier that did not fail may have performed actions in
its ...start() that it expects to unwind via ...end().  Per the
mmu_notifier_ops documentation, ...start() and ...end() must be paired.
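
For reference, the two callbacks look roughly like this in struct
mmu_notifier_ops (paraphrased from include/linux/mmu_notifier.h; exact
parameter names vary across kernel versions):

int (*invalidate_range_start)(struct mmu_notifier *subscription,
			      const struct mmu_notifier_range *range);
void (*invalidate_range_end)(struct mmu_notifier *subscription,
			     const struct mmu_notifier_range *range);

Only ...start() can fail, and ...end() returns nothing, so the pairing
contract has to be upheld by whoever walks the notifier list.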

The only in-kernel usage that is fatally broken is the SGI UV GRU driver,
which effectively blocks and sleeps fault handlers during ...start(), and
unblocks/wakes the handlers during ...end().  But, the only users that
can fail ...start() are the i915 and Nouveau drivers, which are unlikely
to collide with the SGI driver.
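
As a sketch of that block/wake pattern (illustrative only, not the actual
GRU code; the struct and field names here are invented):

struct gms_demo {
	struct mmu_notifier mn;
	atomic_t range_active;
	wait_queue_head_t wait_queue;
};

static int gru_style_range_start(struct mmu_notifier *mn,
				 const struct mmu_notifier_range *range)
{
	struct gms_demo *gms = container_of(mn, struct gms_demo, mn);

	/* Fault handlers sleep while this count is elevated. */
	atomic_inc(&gms->range_active);
	return 0;
}

static void gru_style_range_end(struct mmu_notifier *mn,
				const struct mmu_notifier_range *range)
{
	struct gms_demo *gms = container_of(mn, struct gms_demo, mn);

	/* Wake the fault handlers that went to sleep in ...start(). */
	if (atomic_dec_return(&gms->range_active) == 0)
		wake_up_all(&gms->wait_queue);
}

If ...end() is skipped, range_active never returns to zero and the sleeping
fault handlers hang forever.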

KVM is the only other user of ...end(), and while KVM also blocks fault
handlers in ...start(), the fault handlers do not sleep and originate in
killable ioctl() calls.  So while it's possible for the i915 and Nouveau
drivers to collide with KVM, the bug is benign for KVM since the process
is dying and KVM's guest is about to be terminated.

So, as of today, the bug is likely benign.  But, that may not always be
true, e.g. there is a potential use case for blocking memslot updates in
KVM while an invalidation is in-progress, and failure to unblock would
result in said updates being blocked indefinitely and hanging.

Found by inspection.  Verified by adding a second notifier in KVM that
periodically returns -EAGAIN on non-blockable ranges, triggering OOM,
and observing that KVM exits with an elevated notifier count.
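
A debug notifier along those lines might look like the following
(hypothetical sketch; the actual test code is not part of this patch):

/* Hypothetical debug-only notifier that forces the failing path. */
static int test_range_start(struct mmu_notifier *mn,
			    const struct mmu_notifier_range *range)
{
	static atomic_t calls = ATOMIC_INIT(0);

	/* Periodically refuse non-blockable (OOM reaper) invalidations. */
	if (!mmu_notifier_range_blockable(range) &&
	    (atomic_inc_return(&calls) & 3) == 0)
		return -EAGAIN;
	return 0;
}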

Fixes: 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers")
Cc: stable@vger.kernel.org
Cc: David Rientjes <rientjes@google.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 mm/oom_kill.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

Comments

Ben Gardon March 10, 2021, 10:08 p.m. UTC | #1
On Wed, Mar 10, 2021 at 1:31 PM Sean Christopherson <seanjc@google.com>
wrote:

> Invoke the MMU notifier's .invalidate_range_end() callbacks even if one
> of the .invalidate_range_start() callbacks failed.  If there are multiple
> notifiers, the notifier that did not fail may have performed actions in
> its ...start() that it expects to unwind via ...end().  Per the
> mmu_notifier_ops documentation, ...start() and ...end() must be paired.
>
> The only in-kernel usage that is fatally broken is the SGI UV GRU driver,
> which effectively blocks and sleeps fault handlers during ...start(), and
> unblocks/wakes the handlers during ...end().  But, the only users that
> can fail ...start() are the i915 and Nouveau drivers, which are unlikely
> to collide with the SGI driver.
>
> KVM is the only other user of ...end(), and while KVM also blocks fault
> handlers in ...start(), the fault handlers do not sleep and originate in
> killable ioctl() calls.  So while it's possible for the i915 and Nouveau
> drivers to collide with KVM, the bug is benign for KVM since the process
> is dying and KVM's guest is about to be terminated.
>
> So, as of today, the bug is likely benign.  But, that may not always be
> true, e.g. there is a potential use case for blocking memslot updates in
> KVM while an invalidation is in-progress, and failure to unblock would
> result in said updates being blocked indefinitely and hanging.
>
> Found by inspection.  Verified by adding a second notifier in KVM that
> periodically returns -EAGAIN on non-blockable ranges, triggering OOM,
> and observing that KVM exits with an elevated notifier count.
>
> Fixes: 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu
> notifiers")
> Cc: stable@vger.kernel.org
> Cc: David Rientjes <rientjes@google.com>
> Cc: Ben Gardon <bgardon@google.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
>

Reviewed-by: Ben Gardon <bgardon@google.com>

Thanks for catching and fixing this.


> ---
>  mm/oom_kill.c | 8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index bc65ba4f5192..acc3ba8b2ed7 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -546,12 +546,10 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
>                                                 vma, mm, vma->vm_start,
>                                                 vma->vm_end);
>                         tlb_gather_mmu(&tlb, mm);
> -                       if (mmu_notifier_invalidate_range_start_nonblock(&range)) {
> -                               tlb_finish_mmu(&tlb);
> +                       if (!mmu_notifier_invalidate_range_start_nonblock(&range))
> +                               unmap_page_range(&tlb, vma, range.start, range.end, NULL);
> +                       else
>                                 ret = false;
> -                               continue;
> -                       }
> -                       unmap_page_range(&tlb, vma, range.start, range.end, NULL);
>                         mmu_notifier_invalidate_range_end(&range);
>                         tlb_finish_mmu(&tlb);
>                 }
> --
> 2.30.1.766.gb4fecdf3b7-goog
>
>
Andrew Morton March 11, 2021, 12:06 a.m. UTC | #2
On Wed, 10 Mar 2021 13:31:17 -0800 Sean Christopherson <seanjc@google.com> wrote:

> Invoke the MMU notifier's .invalidate_range_end() callbacks even if one
> of the .invalidate_range_start() callbacks failed.  If there are multiple
> notifiers, the notifier that did not fail may have performed actions in
> its ...start() that it expects to unwind via ...end().  Per the
> mmu_notifier_ops documentation, ...start() and ...end() must be paired.
> 
> The only in-kernel usage that is fatally broken is the SGI UV GRU driver,
> which effectively blocks and sleeps fault handlers during ...start(), and
> unblocks/wakes the handlers during ...end().  But, the only users that
> can fail ...start() are the i915 and Nouveau drivers, which are unlikely
> to collide with the SGI driver.
> 
> KVM is the only other user of ...end(), and while KVM also blocks fault
> handlers in ...start(), the fault handlers do not sleep and originate in
> killable ioctl() calls.  So while it's possible for the i915 and Nouveau
> drivers to collide with KVM, the bug is benign for KVM since the process
> is dying and KVM's guest is about to be terminated.
> 
> So, as of today, the bug is likely benign.  But, that may not always be
> true, e.g. there is a potential use case for blocking memslot updates in
> KVM while an invalidation is in-progress, and failure to unblock would
> result in said updates being blocked indefinitely and hanging.
> 
> Found by inspection.  Verified by adding a second notifier in KVM that
> periodically returns -EAGAIN on non-blockable ranges, triggering OOM,
> and observing that KVM exits with an elevated notifier count.

Given that there's no way known of triggering this in 5.11 or earlier,
I'd normally question the need to backport into -stable kernels.

But I guess that people might later develop modules which they want to
load into earlier kernels, in which case this might bite them.  So I
agree on the cc:stable thing.
Jason Gunthorpe March 11, 2021, 12:28 a.m. UTC | #3
On Wed, Mar 10, 2021 at 01:31:17PM -0800, Sean Christopherson wrote:
> Invoke the MMU notifier's .invalidate_range_end() callbacks even if one
> of the .invalidate_range_start() callbacks failed.  If there are multiple
> notifiers, the notifier that did not fail may have performed actions in
> its ...start() that it expects to unwind via ...end().  Per the
> mmu_notifier_ops documentation, ...start() and ...end() must be paired.

No, this is not OK: if invalidate_start returns EBUSY, invalidate_end
should *not* be called.

As you observed:
 
> The only in-kernel usage that is fatally broken is the SGI UV GRU driver,
> which effectively blocks and sleeps fault handlers during ...start(), and
> unblocks/wakes the handlers during ...end().  But, the only users that
> can fail ...start() are the i915 and Nouveau drivers, which are unlikely
> to collide with the SGI driver.

It used to be worse, but I've since moved most of the other problematic
users to the itree notifier, which doesn't have the problem.
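
For readers unfamiliar with the itree flavor: an interval-tree notifier has
a single invalidate() callback and no end(), so there is nothing to
mis-pair. A minimal sketch of the usual pattern, with invented names:

struct demo {
	struct mmu_interval_notifier notifier;
	struct mutex lock;
};

static bool demo_invalidate(struct mmu_interval_notifier *mni,
			    const struct mmu_notifier_range *range,
			    unsigned long cur_seq)
{
	struct demo *d = container_of(mni, struct demo, notifier);

	if (mmu_notifier_range_blockable(range))
		mutex_lock(&d->lock);
	else if (!mutex_trylock(&d->lock))
		return false;	/* caller retries; nothing to unwind */

	mmu_interval_set_seq(mni, cur_seq);
	/* ... tear down the driver's mappings for this range ... */
	mutex_unlock(&d->lock);
	return true;
}

A non-blockable failure returns false before doing any work, so there is
nothing to unwind.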

> KVM is the only other user of ...end(), and while KVM also blocks fault
> handlers in ...start(), the fault handlers do not sleep and originate in

KVM will have its mmu_notifier_count become imbalanced:

static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
                                        const struct mmu_notifier_range *range)
{
        kvm->mmu_notifier_count++;

static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
                                        const struct mmu_notifier_range *range)
{
        kvm->mmu_notifier_count--;

Which I believe is fatal to kvm? These notifiers certainly do not only
happen at process exit.
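
As context for "fatal": KVM's page fault path bails out and retries while
it believes an invalidation is in flight. Paraphrasing mmu_notifier_retry()
from include/linux/kvm_host.h of this era:

static inline int mmu_notifier_retry(struct kvm *kvm, unsigned long mmu_seq)
{
	if (unlikely(kvm->mmu_notifier_count))
		return 1;
	/* Pairs with the barriers in the notifier callbacks. */
	smp_rmb();
	if (kvm->mmu_notifier_seq != mmu_seq)
		return 1;
	return 0;
}

With mmu_notifier_count stuck above zero, every guest page fault retries
indefinitely.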

So, both of the remaining _end users become corrupted with this patch!

I've tried to fix this before; the only thing that seems like it will
work is to sort the hlist and only call the ends whose starts succeeded,
comparing pointers with <.

This is because the hlist can have items removed concurrently under
SRCU so there is no easy way to compute the subset that succeeded in
calling start.
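
A rough sketch of that unwind (hypothetical, reusing the locals of
mn_hlist_invalidate_range_start(), and assuming the list were kept sorted
by subscription address at insertion time):

/*
 * Every subscription that sorts before the one whose
 * ->invalidate_range_start() failed had a successful start, so those,
 * and only those, get ->invalidate_range_end().
 */
hlist_for_each_entry_rcu(subscription, &subscriptions->list, hlist,
			 srcu_read_lock_held(&srcu)) {
	if (subscription >= failed_subscription)
		break;
	if (subscription->ops->invalidate_range_end)
		subscription->ops->invalidate_range_end(subscription, range);
}

Here failed_subscription is whichever entry returned the error; entries
removed concurrently under SRCU remain safe to touch because they are not
freed until the grace period ends.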

I had a prior effort to just ban more than 1 hlist notifier with end,
but it turns out kvm on ARM uses two all the time (IIRC)

> Found by inspection.  Verified by adding a second notifier in KVM
> that

AFAIK it is a non-problem in real life because kvm is not mixed with
notifier_start's that fail (and GRU is dead?). Everything else was
fixed by moving to itree.

Jason
Sean Christopherson March 11, 2021, 1:20 a.m. UTC | #4
On Wed, Mar 10, 2021, Jason Gunthorpe wrote:
> On Wed, Mar 10, 2021 at 01:31:17PM -0800, Sean Christopherson wrote:
> > Invoke the MMU notifier's .invalidate_range_end() callbacks even if one
> > of the .invalidate_range_start() callbacks failed.  If there are multiple
> > notifiers, the notifier that did not fail may have performed actions in
> > its ...start() that it expects to unwind via ...end().  Per the
> > mmu_notifier_ops documentation, ...start() and ...end() must be paired.
> 
> No, this is not OK: if invalidate_start returns EBUSY, invalidate_end
> should *not* be called.
> 
> As you observed:
>  
> > The only in-kernel usage that is fatally broken is the SGI UV GRU driver,
> > which effectively blocks and sleeps fault handlers during ...start(), and
> > unblocks/wakes the handlers during ...end().  But, the only users that
> > can fail ...start() are the i915 and Nouveau drivers, which are unlikely
> > to collide with the SGI driver.
> 
> It used to be worse, but I've since moved most of the other problematic
> users to the itree notifier, which doesn't have the problem.
> 
> > KVM is the only other user of ...end(), and while KVM also blocks fault
> > handlers in ...start(), the fault handlers do not sleep and originate in
> 
> KVM will have its mmu_notifier_count become imbalanced:
> 
> static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>                                         const struct mmu_notifier_range *range)
> {
>         kvm->mmu_notifier_count++;
> 
> static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
>                                         const struct mmu_notifier_range *range)
> {
>         kvm->mmu_notifier_count--;
> 
> Which I believe is fatal to kvm? These notifiers certainly do not only
> happen at process exit.

My point about the process dying is that the existing bug that causes
mmu_notifier_count to become imbalanced is benign only because the process is
being killed, and thus KVM will stop running its vCPUs.

> So, both of the remaining _end users become corrupted with this patch!

I don't follow.  mn_hlist_invalidate_range_start() iterates over all notifiers,
even if a notifier earlier in the chain failed.  How will KVM become imbalanced?

The existing _end users never fail their _start.  If KVM started failing its
start, then yes, it could get corrupted.  But my assumption/expectation is that,
if KVM were to ever reject _start, it would be responsible for knowing that it
must also skip _end.  I'm happy to kick that one down the road though, as I
can't think of a scenario where KVM would _need_ to sleep.

> I've tried to fix this before; the only thing that seems like it will
> work is to sort the hlist and only call the ends whose starts succeeded,
> comparing pointers with <.
> 
> This is because the hlist can have items removed concurrently under
> SRCU so there is no easy way to compute the subset that succeeded in
> calling start.
> 
> I had a prior effort to just ban more than 1 hlist notifier with end,
> but it turns out kvm on ARM uses two all the time (IIRC)
> 
> > Found by inspection.  Verified by adding a second notifier in KVM
> > that
> 
> AFAIK it is a non-problem in real life because kvm is not mixed with
> notifier_start's that fail (and GRU is dead?). Everything else was
> fixed by moving to itree.
> 
> Jason
Jason Gunthorpe March 11, 2021, 1:50 a.m. UTC | #5
On Wed, Mar 10, 2021 at 05:20:01PM -0800, Sean Christopherson wrote:

> > Which I believe is fatal to kvm? These notifiers certainly do not only
> > happen at process exit.
> 
> My point about the process dying is that the existing bug that causes
> mmu_notifier_count to become imbalanced is benign only because the process is
> being killed, and thus KVM will stop running its vCPUs.

Are you saying we only call non-blocking invalidate during a process
exit event?? 
 
> > So, both of the remaining _end users become corrupted with this patch!
> 
> I don't follow.  mn_hlist_invalidate_range_start() iterates over all
> notifiers, even if a notifier earlier in the chain failed.  How will
> KVM become imbalanced?

Er, ok, that got left in a weird way. There is another "bug" where end
is not supposed to be called if the start failed.

> The existing _end users never fail their _start.  If KVM started failing its
> start, then yes, it could get corrupted.  

Well, maybe that is the way out of this now. If we don't permit a
start to fail when there is an end, then we have no problem unwinding,
as we can continue to call everything. This can't be backported too
far, though; the itree notifier conversions are what made the WARN_ON
safe today.

Something very approximately like this is closer to my preference:

diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 61ee40ed804ee5..6d5cd20f81dadc 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -501,10 +501,25 @@ static int mn_hlist_invalidate_range_start(
 						"");
 				WARN_ON(mmu_notifier_range_blockable(range) ||
 					_ret != -EAGAIN);
+				/*
+				 * We call all the notifiers on any EAGAIN,
+				 * there is no way for a notifier to know if
+				 * its start method failed, thus a start that
+				 * does EAGAIN can't also do end.
+				 */
+				WARN_ON(ops->invalidate_range_end);
 				ret = _ret;
 			}
 		}
 	}
+
+	if (ret) {
+				/* Must be non-blocking to get here */
+		hlist_for_each_entry_rcu (subscription, &subscriptions->list,
+					  hlist, srcu_read_lock_held(&srcu))
+			subscription->ops->invalidate_range_end(subscription,
+								range);
+	}
 	srcu_read_unlock(&srcu, id);
 
 	return ret;
Sean Christopherson March 11, 2021, 7:22 a.m. UTC | #6
On Wed, Mar 10, 2021, Jason Gunthorpe wrote:
> On Wed, Mar 10, 2021 at 05:20:01PM -0800, Sean Christopherson wrote:
> 
> > > Which I believe is fatal to kvm? These notifiers certainly do not only
> > > happen at process exit.
> > 
> > My point about the process dying is that the existing bug that causes
> > mmu_notifier_count to become imbalanced is benign only because the process is
> > being killed, and thus KVM will stop running its vCPUs.
> 
> Are you saying we only call non-blocking invalidate during a process
> exit event?? 

Yes?  __oom_reap_task_mm() is the only user of _nonblock(), if that's what you're
asking.

> > > So, both of the remaining _end users become corrupted with this patch!
> > 
> > I don't follow.  mn_hlist_invalidate_range_start() iterates over all
> > notifiers, even if a notifier earlier in the chain failed.  How will
> > KVM become imbalanced?
> 
> Er, ok, that got left in a weird way. There is another "bug" where end
> is not supposed to be called if the start failed.

Ha, the best kind of feature.  :-)

> > The existing _end users never fail their _start.  If KVM started failing its
> > start, then yes, it could get corrupted.  
> 
> Well, maybe that is the way out of this now. If we don't permit a
> start to fail when there is an end, then we have no problem unwinding,
> as we can continue to call everything. This can't be backported too
> far, though; the itree notifier conversions are what made the WARN_ON
> safe today.
> 
> Something very approximately like this is closer to my preference:

Makes sense.  I don't have a strong preference; I'll give this a spin tomorrow.

> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 61ee40ed804ee5..6d5cd20f81dadc 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -501,10 +501,25 @@ static int mn_hlist_invalidate_range_start(
>  						"");
>  				WARN_ON(mmu_notifier_range_blockable(range) ||
>  					_ret != -EAGAIN);
> +				/*
> +				 * We call all the notifiers on any EAGAIN,
> +				 * there is no way for a notifier to know if
> +				 * its start method failed, thus a start that
> +				 * does EAGAIN can't also do end.
> +				 */
> +				WARN_ON(ops->invalidate_range_end);
>  				ret = _ret;
>  			}
>  		}
>  	}
> +
> +	if (ret) {
> +				/* Must be non-blocking to get here */
> +		hlist_for_each_entry_rcu (subscription, &subscriptions->list,
> +					  hlist, srcu_read_lock_held(&srcu))
> +			subscription->ops->invalidate_range_end(subscription,
> +								range);
> +	}
>  	srcu_read_unlock(&srcu, id);
>  
>  	return ret;
Michal Hocko March 11, 2021, 4:20 p.m. UTC | #7
On Wed 10-03-21 20:28:07, Jason Gunthorpe wrote:
> On Wed, Mar 10, 2021 at 01:31:17PM -0800, Sean Christopherson wrote:
> > Invoke the MMU notifier's .invalidate_range_end() callbacks even if one
> > of the .invalidate_range_start() callbacks failed.  If there are multiple
> > notifiers, the notifier that did not fail may have performed actions in
> > its ...start() that it expects to unwind via ...end().  Per the
> > mmu_notifier_ops documentation, ...start() and ...end() must be paired.
> 
> No, this is not OK: if invalidate_start returns EBUSY, invalidate_end
> should *not* be called.

Yes, this is what I remember from when the nonblock interface was
introduced. So I agree with Jason that this patch is not correct. The
interface is subtle, but I remember we couldn't come up with anything
more robust that would still allow memory covered by notifiers to be
reaped.

Patch

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index bc65ba4f5192..acc3ba8b2ed7 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -546,12 +546,10 @@  bool __oom_reap_task_mm(struct mm_struct *mm)
 						vma, mm, vma->vm_start,
 						vma->vm_end);
 			tlb_gather_mmu(&tlb, mm);
-			if (mmu_notifier_invalidate_range_start_nonblock(&range)) {
-				tlb_finish_mmu(&tlb);
+			if (!mmu_notifier_invalidate_range_start_nonblock(&range))
+				unmap_page_range(&tlb, vma, range.start, range.end, NULL);
+			else
 				ret = false;
-				continue;
-			}
-			unmap_page_range(&tlb, vma, range.start, range.end, NULL);
 			mmu_notifier_invalidate_range_end(&range);
 			tlb_finish_mmu(&tlb);
 		}