diff mbox series

[6.15] mm/vma: add give_up_on_oom option on modify/merge, use in uffd release

Message ID 20250321100937.46634-1-lorenzo.stoakes@oracle.com (mailing list archive)
State New
Headers show
Series [6.15] mm/vma: add give_up_on_oom option on modify/merge, use in uffd release | expand

Commit Message

Lorenzo Stoakes March 21, 2025, 10:09 a.m. UTC
Currently, if a VMA merge fails due to an OOM condition arising on commit
merge or a failure to duplicate anon_vma's, we report this so the caller
can handle it.

However there are cases where the caller is only ostensibly trying a
merge, and doesn't mind if it fails due to this condition.

Since we do not want to introduce an implicit assumption that we only
actually modify VMAs after OOM conditions might arise, add a 'give up on
oom' option and make an explicit contract that, should this flag be set, we
absolutely will not modify any VMAs should OOM arise and just bail out.

Since it'd be very unusual for a user to try to vma_modify() with this flag
set but be specifying a range within a VMA which ends up being split (which
can fail due to rlimit issues, not only OOM), we add a debug warning for
this condition.

The motivating reason for this is uffd release - syzkaller (and Pedro
Falcato's VERY astute analysis) found a way in which an injected fault on
allocation, triggering an OOM condition on commit merge, would result in
uffd code becoming confused and treating an error value as if it were a VMA
pointer.

To avoid this, we make use of this new VMG flag to ensure that this never
occurs, utilising the fact that, should we be clearing entire VMAs, we do
not wish an OOM event to be reported to us.

Many thanks to Pedro Falcato for his excellent analysis and Jann Horn for
his insightful and intelligent analysis of the situation, both of whom were
instrumental in this fix.

Reported-by: syzbot+20ed41006cf9d842c2b5@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/67dc67f0.050a0220.25ae54.001e.GAE@google.com/
Fixes: 47b16d0462a4 ("mm: abort vma_modify() on merge out of memory failure")
Suggested-by: Pedro Falcato <pfalcato@suse.de>
Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/userfaultfd.c | 13 ++++++++++--
 mm/vma.c         | 51 ++++++++++++++++++++++++++++++++++++++++++++----
 mm/vma.h         |  9 ++++++++-
 3 files changed, 66 insertions(+), 7 deletions(-)

--
2.48.1

Comments

Pedro Falcato March 21, 2025, 11:26 a.m. UTC | #1
On Fri, Mar 21, 2025 at 10:09:37AM +0000, Lorenzo Stoakes wrote:
> Currently, if a VMA merge fails due to an OOM condition arising on commit
> merge or a failure to duplicate anon_vma's, we report this so the caller
> can handle it.
> 
> However there are cases where the caller is only ostensibly trying a
> merge, and doesn't mind if it fails due to this condition.
>

Ok, so here's my problem with your idea: I don't think merge should be exposed
to vma_modify() callers. Right now (at least AIUI), you want to modify a given
VMA, you call vma_modify(), and it gives you a vma you can straight up modify
without any problems. Essentially breaks down any VMAs necessary. This feels
contractually simple and easy to use, and I don't think leaking details about
merging is the correct approach here.

> Since we do not want to introduce an implicit assumption that we only
> actually modify VMAs after OOM conditions might arise, add a 'give up on
> oom' option and make an explicit contract that, should this flag be set, we
> absolutely will not modify any VMAs should OOM arise and just bail out.
>

Thus, to me the most natural solution is still mine. Do you think it places too
many constraints on vma_modify()? vma_modify() on a single VMA, without
splitting, Just Working(tm) is a sensible expectation (and vma_merge being fully
best-effort). Things like mprotect() failing due to OOM are also pretty disastrous,
so if we could limit that it'd be great.

In any case, your solution looks palatable to me, but I want to make
sure we're not making this excessively complicated.
Liam R. Howlett March 21, 2025, 3:27 p.m. UTC | #2
+cc Peter due to uffd interests

* Pedro Falcato <pfalcato@suse.de> [250321 07:26]:
> On Fri, Mar 21, 2025 at 10:09:37AM +0000, Lorenzo Stoakes wrote:
> > Currently, if a VMA merge fails due to an OOM condition arising on commit
> > merge or a failure to duplicate anon_vma's, we report this so the caller
> > can handle it.
> > 
> > However there are cases where the caller is only ostensibly trying a
> > merge, and doesn't mind if it fails due to this condition.
> >
> 
> Ok, so here's my problem with your idea: I don't think merge should be exposed
> to vma_modify() callers. Right now (at least AIUI), you want to modify a given
> VMA, you call vma_modify(), and it gives you a vma you can straight up modify
> without any problems. Essentially breaks down any VMAs necessary. This feels
> contractually simple and easy to use, and I don't think leaking details about
> merging is the correct approach here.
> 
> > Since we do not want to introduce an implicit assumption that we only
> > actually modify VMAs after OOM conditions might arise, add a 'give up on
> > oom' option and make an explicit contract that, should this flag be set, we
> > absolutely will not modify any VMAs should OOM arise and just bail out.
> >
> 
> Thus, to me the most natural solution is still mine. Do you think it places too
> many constraints on vma_modify()? vma_modify() on a single VMA, without
> splitting, Just Working(tm) is a sensible expectation (and vma_merge being fully
> best-effort). Things like mprotect() failing due to OOM are also pretty disastrous,
> so if we could limit that it'd be great.
> 
> In any case, your solution looks palatable to me, but I want to make
> sure we're not making this excessively complicated.

Either solution is fine with me, but...

I hate both of them, and I (mostly) blame uffd.  Some blame is on syzbot
for exposing me to this unreachable failure. ;-)

I think Lorenzo's patch points out that we should have a better way to
deal with a return and a vma pointer and we basically have that a lot of
other places with the structures that we thread through to deal with
mostly unreachable errors that syzbot injects.  I dislike flags passed
in to treat things differently.  We are adding an 8th argument to the
function (of type boolean) to work around this..

I think Pedro's patch is working around the same issue of return value
vs return pointers.  Other places we pass in &prev and just do what we
need to in the caller and just check the return value - which I also
hate, especially when you look at the assembly generated to deal with a
"non-problem" problem.

I have no better solution and need to spend more time looking into it,
but the number of times we have syzbot failing a random allocation and
finding issues.. well, it's basically a class of bugs we handle very
poorly throughout our stack.

Realistically, I *hope* that Lorenzo's additional argument will make
code authors think twice about the failure scenario when calling the
function.

Pedro's code fixes this caller, but leaves the possibility of someone
missing this again in the future, I think?  This could be made more
obvious with some clever naming of functions, perhaps try_<something>?

We are essentially avoiding the compiler from catching the error for us
by returning that ERR_PTR(), which (keeping with the theme of my email)
I hate.  It's fine for little functions but we've made a mess of it too
often.

Reality will probably not align with the realistic view and people will
just copy/paste from wherever they saw it called... so we should think
twice about the failure scenarios on code review and I think a flag
(or a function name change?) might make this more obvious.

Thanks,
Liam
Lorenzo Stoakes March 21, 2025, 5:04 p.m. UTC | #3
On Fri, Mar 21, 2025 at 11:26:18AM +0000, Pedro Falcato wrote:
> On Fri, Mar 21, 2025 at 10:09:37AM +0000, Lorenzo Stoakes wrote:
> > Currently, if a VMA merge fails due to an OOM condition arising on commit
> > merge or a failure to duplicate anon_vma's, we report this so the caller
> > can handle it.
> >
> > However there are cases where the caller is only ostensibly trying a
> > merge, and doesn't mind if it fails due to this condition.
> >
>
> Ok, so here's my problem with your idea: I don't think merge should be exposed
> to vma_modify() callers. Right now (at least AIUI), you want to modify a given
> VMA, you call vma_modify(), and it gives you a vma you can straight up modify
> without any problems. Essentially breaks down any VMAs necessary. This feels
> contractually simple and easy to use, and I don't think leaking details about
> merging is the correct approach here.

The vmg passed is already 'exposing' merge and has flags you can change. There's
no contract that vma_modify() abstracts this, you're saying you want to modify,
maybe merge if we can.

Uffd is actually calling it in a purely special case - one where you would never
split.

I mean an alternative is to have something not-vma_modify() do it, but then we
end up with code duplication, which is why I made the unregister + clear paths
do the same thing here.

>
> > Since we do not want to introduce an implicit assumption that we only
> > actually modify VMAs after OOM conditions might arise, add a 'give up on
> > oom' option and make an explicit contract that, should this flag be set, we
> > absolutely will not modify any VMAs should OOM arise and just bail out.
> >
>
> Thus, to me the most natural solution is still mine. Do you think it places too
> many constraints on vma_modify()? vma_modify() on a single VMA, without
> splitting, Just Working(tm) is a sensible expectation (and vma_merge being fully
> best-effort). Things like mprotect() failing due to OOM are also pretty disastrous,
> so if we could limit that it'd be great.

I disagree, again for the same reason as stated before, you are making an
implicit assumption that an OOM error means the VMA is not deleted. This only
happens to be true _now_.

Having this implicit assumption there, which later might change, is _precisely_
the kind of thing that led us to this issue in the first place.

So that's just not workable.

A version of your thing that would work is where vma_modify() itself sets the
flag so we -establish this contract-.

But I don't want to infect all other callers who don't have uffd's problem with
this.

Also, again, this is a -uffd- problem. Uffd is calling a function that can
return an error, assuming it must not return an error, and breaking if it does.

We have to on some level have uffd say 'actually we rely on this'.

So, ugly as it is, I feel my approach is the best for now.

But to be revisited!

>
> In any case, your solution looks palatable to me, but I want to make
> sure we're not making this excessively complicated.

Thanks!

I don't think this is any more complicated than it needs to be, as Liam alludes
to in his reply.

But rethinking this whole thing on a deeper level is on my agenda now.

Error paths are an ugly pain anywwhere :(

>
> --
> Pedro
Lorenzo Stoakes March 21, 2025, 5:16 p.m. UTC | #4
On Fri, Mar 21, 2025 at 11:27:34AM -0400, Liam R. Howlett wrote:
> +cc Peter due to uffd interests

Gentle nudge for Peter to make himself uffd maintainer :) I am not a fan of
this 'happen to know person X often touches Y' stuff, this is what
MAINTAINERS is for. If you're not there, good chance I won't cc you...

I also strongly feel we need somebody to take overall responsibility for
uffd at this point.

Pints will be bought for this person in Montreal ;)

>
> * Pedro Falcato <pfalcato@suse.de> [250321 07:26]:
> > On Fri, Mar 21, 2025 at 10:09:37AM +0000, Lorenzo Stoakes wrote:
> > > Currently, if a VMA merge fails due to an OOM condition arising on commit
> > > merge or a failure to duplicate anon_vma's, we report this so the caller
> > > can handle it.
> > >
> > > However there are cases where the caller is only ostensibly trying a
> > > merge, and doesn't mind if it fails due to this condition.
> > >
> >
> > Ok, so here's my problem with your idea: I don't think merge should be exposed
> > to vma_modify() callers. Right now (at least AIUI), you want to modify a given
> > VMA, you call vma_modify(), and it gives you a vma you can straight up modify
> > without any problems. Essentially breaks down any VMAs necessary. This feels
> > contractually simple and easy to use, and I don't think leaking details about
> > merging is the correct approach here.
> >
> > > Since we do not want to introduce an implicit assumption that we only
> > > actually modify VMAs after OOM conditions might arise, add a 'give up on
> > > oom' option and make an explicit contract that, should this flag be set, we
> > > absolutely will not modify any VMAs should OOM arise and just bail out.
> > >
> >
> > Thus, to me the most natural solution is still mine. Do you think it places too
> > many constraints on vma_modify()? vma_modify() on a single VMA, without
> > splitting, Just Working(tm) is a sensible expectation (and vma_merge being fully
> > best-effort). Things like mprotect() failing due to OOM are also pretty disastrous,
> > so if we could limit that it'd be great.
> >
> > In any case, your solution looks palatable to me, but I want to make
> > sure we're not making this excessively complicated.
>
> Either solution is fine with me, but...
>

Thanks.

> I hate both of them, and I (mostly) blame uffd.  Some blame is on syzbot
> for exposing me to this unreachable failure. ;-)

So do I.

But as I said to Pedro, we -must- express this contract that we won't
remove VMAs before returning, otherwise later somebody might change logic
which changes this, not realise they just broke these stupid horrible edge
cases (that 99.9999-recurring-9% will never happen in reality) and that's
just not something I want to happen.

>
> I think Lorenzo's patch points out that we should have a better way to
> deal with a return and a vma pointer and we basically have that a lot of
> other places with the structures that we thread through to deal with
> mostly unreachable errors that syzbot injects.  I dislike flags passed
> in to treat things differently.  We are adding an 8th argument to the
> function (of type boolean) to work around this..

The top of this make sa very good point, but in reference to params, flags,
etc... well I have to say something :)

Yeah but it's a convenience wrapper function, literally build explicitly
for uffd. I think it's not quite as terrible a situation as you think.

And it seems to me having a flag/mode to explicitly request something is
fine? Was it more palatable as a bit field prior to Vlastimil's request to
move to bit fields?... Because you must object to a lot of the kernel if
you dislike that kind of thing :)

A lot of the point of having threaded state is that we can modify things
for specific cases.

>
> I think Pedro's patch is working around the same issue of return value
> vs return pointers.  Other places we pass in &prev and just do what we
> need to in the caller and just check the return value - which I also
> hate, especially when you look at the assembly generated to deal with a
> "non-problem" problem.

Let me group a reply to all of these as you're making a very good point
that I think has a deeper solution...

>
> I have no better solution and need to spend more time looking into it,
> but the number of times we have syzbot failing a random allocation and
> finding issues.. well, it's basically a class of bugs we handle very
> poorly throughout our stack.
>
> Realistically, I *hope* that Lorenzo's additional argument will make
> code authors think twice about the failure scenario when calling the
> function.
>
> Pedro's code fixes this caller, but leaves the possibility of someone
> missing this again in the future, I think?  This could be made more
> obvious with some clever naming of functions, perhaps try_<something>?

Renaming like that will not affect changes to the core merge/functions
called by merge.

We need that explicit contract. So...

>
> We are essentially avoiding the compiler from catching the error for us
> by returning that ERR_PTR(), which (keeping with the theme of my email)
> I hate.  It's fine for little functions but we've made a mess of it too
> often.
>
> Reality will probably not align with the realistic view and people will
> just copy/paste from wherever they saw it called... so we should think
> twice about the failure scenarios on code review and I think a flag
> (or a function name change?) might make this more obvious.

OK so what I think we have have is a break in abstraction, where we are
trying to do an 'iteration through VMAs where it's convenient to keep track
of prev' and then people duplicating the code, making subtly false
assumptions (yes pointer being returned with the obnoxious ERR_PTR() stuff
possible and -god knows what happens to the state if not present-) and
etc. etc.

Don't we just need a new kind of vma iterator that handles the prev bits
for us that can just do away with this crap?

How we get it to interact with everything else is left as a fun one for
thinking about in hot baths, aeroplanes or at trampoline parks (it'll be me
who ends up thinking about his won't it?)

Anyway it's all horrid, but I still think (in, of course, an entirely
unbiased fashion) my solution is the way forward, but I agree with and
Pedro that it's vom-inducing yes.

>
> Thanks,
> Liam
Liam R. Howlett March 21, 2025, 6:11 p.m. UTC | #5
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250321 13:16]:
> On Fri, Mar 21, 2025 at 11:27:34AM -0400, Liam R. Howlett wrote:
> > +cc Peter due to uffd interests
> 
> Gentle nudge for Peter to make himself uffd maintainer :) I am not a fan of
> this 'happen to know person X often touches Y' stuff, this is what
> MAINTAINERS is for. If you're not there, good chance I won't cc you...
> 
> I also strongly feel we need somebody to take overall responsibility for
> uffd at this point.

Yes, uffd isn't well represented today and Peter seems to be doing the
work of R:, if not M:.

...

> 
> >
> > We are essentially avoiding the compiler from catching the error for us
> > by returning that ERR_PTR(), which (keeping with the theme of my email)
> > I hate.  It's fine for little functions but we've made a mess of it too
> > often.
> >
> > Reality will probably not align with the realistic view and people will
> > just copy/paste from wherever they saw it called... so we should think
> > twice about the failure scenarios on code review and I think a flag
> > (or a function name change?) might make this more obvious.
> 
> OK so what I think we have have is a break in abstraction, where we are
> trying to do an 'iteration through VMAs where it's convenient to keep track
> of prev' and then people duplicating the code, making subtly false
> assumptions (yes pointer being returned with the obnoxious ERR_PTR() stuff
> possible and -god knows what happens to the state if not present-) and
> etc. etc.
> 
> Don't we just need a new kind of vma iterator that handles the prev bits
> for us that can just do away with this crap?

I've been thinking about the iterator and the prev/next stuff for a
while.

I am not entirely sure on pulling it into the iterator.  My hesitation
is that a lot of the time we don't really care about prev, except
merging.  Merging only matters if the vma is touching the start of the
vma being modified, and if that's the case then we are very likely to be
in the same maple tree node and the previous slot.  This should be in
the cpu cache, almost always.

So I'm wondering if we want to have an iterator do some fancy "this is
prev" or just ask "what's the previous slot?" - aka mas_peek_prev() or
something (that doesn't exist today).

We also have the users of contiguous iterations which wants to fail if
there is a gap anywhere before the end, and detect that error after the
iterator too.. ie, did we reach the end (or is the end a gap?), or did
we find an intermediate gap?

So there are a few common scenarios, and maybe we are getting to the
point of having a clear view of specific users for each that would
result in less bugs with common patterns?

The patters are also not entirely clear as the regular number of vmas
iterated - many are written this way to work on many vmas, but in
practice there is only one vma checked the majority of the time.  So we
may be over-complicating things by keeping prev around and up to date in
the first place.  Perhaps clever iterator coding could avoid this as
well.

Thanks,
Liam
Lorenzo Stoakes March 21, 2025, 9:24 p.m. UTC | #6
On Fri, Mar 21, 2025 at 02:11:06PM -0400, Liam R. Howlett wrote:
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250321 13:16]:
> > On Fri, Mar 21, 2025 at 11:27:34AM -0400, Liam R. Howlett wrote:
> > > +cc Peter due to uffd interests
> >
> > Gentle nudge for Peter to make himself uffd maintainer :) I am not a fan of
> > this 'happen to know person X often touches Y' stuff, this is what
> > MAINTAINERS is for. If you're not there, good chance I won't cc you...
> >
> > I also strongly feel we need somebody to take overall responsibility for
> > uffd at this point.
>
> Yes, uffd isn't well represented today and Peter seems to be doing the
> work of R:, if not M:.
>
> ...
>
> >
> > >
> > > We are essentially avoiding the compiler from catching the error for us
> > > by returning that ERR_PTR(), which (keeping with the theme of my email)
> > > I hate.  It's fine for little functions but we've made a mess of it too
> > > often.
> > >
> > > Reality will probably not align with the realistic view and people will
> > > just copy/paste from wherever they saw it called... so we should think
> > > twice about the failure scenarios on code review and I think a flag
> > > (or a function name change?) might make this more obvious.
> >
> > OK so what I think we have have is a break in abstraction, where we are
> > trying to do an 'iteration through VMAs where it's convenient to keep track
> > of prev' and then people duplicating the code, making subtly false
> > assumptions (yes pointer being returned with the obnoxious ERR_PTR() stuff
> > possible and -god knows what happens to the state if not present-) and
> > etc. etc.
> >
> > Don't we just need a new kind of vma iterator that handles the prev bits
> > for us that can just do away with this crap?
>
> I've been thinking about the iterator and the prev/next stuff for a
> while.
>
> I am not entirely sure on pulling it into the iterator.  My hesitation
> is that a lot of the time we don't really care about prev, except
> merging.  Merging only matters if the vma is touching the start of the
> vma being modified, and if that's the case then we are very likely to be
> in the same maple tree node and the previous slot.  This should be in
> the cpu cache, almost always.
>
> So I'm wondering if we want to have an iterator do some fancy "this is
> prev" or just ask "what's the previous slot?" - aka mas_peek_prev() or
> something (that doesn't exist today).
>
> We also have the users of contiguous iterations which wants to fail if
> there is a gap anywhere before the end, and detect that error after the
> iterator too.. ie, did we reach the end (or is the end a gap?), or did
> we find an intermediate gap?
>
> So there are a few common scenarios, and maybe we are getting to the
> point of having a clear view of specific users for each that would
> result in less bugs with common patterns?
>
> The patters are also not entirely clear as the regular number of vmas
> iterated - many are written this way to work on many vmas, but in
> practice there is only one vma checked the majority of the time.  So we
> may be over-complicating things by keeping prev around and up to date in
> the first place.  Perhaps clever iterator coding could avoid this as
> well.

I will look at actually analysing this stuff.

I absolutely HATE having vmg->prev, next etc.

Anyway tomorrow (Heathrow cooperating) we'll be co-located so maybe a light
night beer and discussion tomorrow night eh? ;)

>
> Thanks,
> Liam
Peter Xu March 22, 2025, 12:30 a.m. UTC | #7
On Fri, Mar 21, 2025 at 05:16:44PM +0000, Lorenzo Stoakes wrote:
> On Fri, Mar 21, 2025 at 11:27:34AM -0400, Liam R. Howlett wrote:
> > +cc Peter due to uffd interests
> 
> Gentle nudge for Peter to make himself uffd maintainer :) I am not a fan of
> this 'happen to know person X often touches Y' stuff, this is what
> MAINTAINERS is for. If you're not there, good chance I won't cc you...
> 
> I also strongly feel we need somebody to take overall responsibility for
> uffd at this point.
> 
> Pints will be bought for this person in Montreal ;)

Thanks for the offer, though I will be absent from lsfmm this year.  I sent
a maintainer file change here, though:

https://lore.kernel.org/linux-mm/20250322002124.131736-1-peterx@redhat.com/

Maybe someone would like to be the 2nd reviewer, and if he/she would be at
the conference maybe the pints will still count? :)

[...]

> > I hate both of them, and I (mostly) blame uffd.  Some blame is on syzbot
> > for exposing me to this unreachable failure. ;-)
> 
> So do I.

I agree with Jann; AFAIU, userfaultfd is innocent, if it used to work.

I don't think I have followed much on the recent vma mgmt changes.  So I am
almost of no use here.. I'll leave that to you experts.  Beers indeed might
help.

Said that, it'll definitely be very appreciated if the old behavior can be
fully recovered, so release() will never fail at such -ENOMEM.

Thanks,
diff mbox series

Patch

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index fbf2cf62ab9f..7d5d709cc838 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1902,6 +1902,14 @@  struct vm_area_struct *userfaultfd_clear_vma(struct vma_iterator *vmi,
 					     unsigned long end)
 {
 	struct vm_area_struct *ret;
+	bool give_up_on_oom = false;
+
+	/*
+	 * If we are modifying only and not splitting, just give up on the merge
+	 * if OOM prevents us from merging successfully.
+	 */
+	if (start == vma->vm_start && end == vma->vm_end)
+		give_up_on_oom = true;

 	/* Reset ptes for the whole vma range if wr-protected */
 	if (userfaultfd_wp(vma))
@@ -1909,7 +1917,7 @@  struct vm_area_struct *userfaultfd_clear_vma(struct vma_iterator *vmi,

 	ret = vma_modify_flags_uffd(vmi, prev, vma, start, end,
 				    vma->vm_flags & ~__VM_UFFD_FLAGS,
-				    NULL_VM_UFFD_CTX);
+				    NULL_VM_UFFD_CTX, give_up_on_oom);

 	/*
 	 * In the vma_merge() successful mprotect-like case 8:
@@ -1960,7 +1968,8 @@  int userfaultfd_register_range(struct userfaultfd_ctx *ctx,
 		new_flags = (vma->vm_flags & ~__VM_UFFD_FLAGS) | vm_flags;
 		vma = vma_modify_flags_uffd(&vmi, prev, vma, start, vma_end,
 					    new_flags,
-					    (struct vm_userfaultfd_ctx){ctx});
+					    (struct vm_userfaultfd_ctx){ctx},
+					    /* give_up_on_oom = */false);
 		if (IS_ERR(vma))
 			return PTR_ERR(vma);

diff --git a/mm/vma.c b/mm/vma.c
index 5cdc5612bfc1..839d12f02c88 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -666,6 +666,9 @@  static void vmg_adjust_set_range(struct vma_merge_struct *vmg)
 /*
  * Actually perform the VMA merge operation.
  *
+ * IMPORTANT: We guarantee that, should vmg->give_up_on_oom is set, to not
+ * modify any VMAs or cause inconsistent state should an OOM condition arise.
+ *
  * Returns 0 on success, or an error value on failure.
  */
 static int commit_merge(struct vma_merge_struct *vmg)
@@ -685,6 +688,12 @@  static int commit_merge(struct vma_merge_struct *vmg)

 	init_multi_vma_prep(&vp, vma, vmg);

+	/*
+	 * If vmg->give_up_on_oom is set, we're safe, because we don't actually
+	 * manipulate any VMAs until we succeed at preallocation.
+	 *
+	 * Past this point, we will not return an error.
+	 */
 	if (vma_iter_prealloc(vmg->vmi, vma))
 		return -ENOMEM;

@@ -915,7 +924,13 @@  static __must_check struct vm_area_struct *vma_merge_existing_range(
 		if (anon_dup)
 			unlink_anon_vmas(anon_dup);

-		vmg->state = VMA_MERGE_ERROR_NOMEM;
+		/*
+		 * We've cleaned up any cloned anon_vma's, no VMAs have been
+		 * modified, no harm no foul if the user requests that we not
+		 * report this and just give up, leaving the VMAs unmerged.
+		 */
+		if (!vmg->give_up_on_oom)
+			vmg->state = VMA_MERGE_ERROR_NOMEM;
 		return NULL;
 	}

@@ -926,7 +941,15 @@  static __must_check struct vm_area_struct *vma_merge_existing_range(
 abort:
 	vma_iter_set(vmg->vmi, start);
 	vma_iter_load(vmg->vmi);
-	vmg->state = VMA_MERGE_ERROR_NOMEM;
+
+	/*
+	 * This means we have failed to clone anon_vma's correctly, but no
+	 * actual changes to VMAs have occurred, so no harm no foul - if the
+	 * user doesn't want this reported and instead just wants to give up on
+	 * the merge, allow it.
+	 */
+	if (!vmg->give_up_on_oom)
+		vmg->state = VMA_MERGE_ERROR_NOMEM;
 	return NULL;
 }

@@ -1068,6 +1091,10 @@  int vma_expand(struct vma_merge_struct *vmg)
 		/* This should already have been checked by this point. */
 		VM_WARN_ON_VMG(!can_merge_remove_vma(next), vmg);
 		vma_start_write(next);
+		/*
+		 * In this case we don't report OOM, so vmg->give_up_on_mm is
+		 * safe.
+		 */
 		ret = dup_anon_vma(middle, next, &anon_dup);
 		if (ret)
 			return ret;
@@ -1090,9 +1117,15 @@  int vma_expand(struct vma_merge_struct *vmg)
 	return 0;

 nomem:
-	vmg->state = VMA_MERGE_ERROR_NOMEM;
 	if (anon_dup)
 		unlink_anon_vmas(anon_dup);
+	/*
+	 * If the user requests that we just give upon OOM, we are safe to do so
+	 * here, as commit merge provides this contract to us. Nothing has been
+	 * changed - no harm no foul, just don't report it.
+	 */
+	if (!vmg->give_up_on_oom)
+		vmg->state = VMA_MERGE_ERROR_NOMEM;
 	return -ENOMEM;
 }

@@ -1534,6 +1567,13 @@  static struct vm_area_struct *vma_modify(struct vma_merge_struct *vmg)
 	if (vmg_nomem(vmg))
 		return ERR_PTR(-ENOMEM);

+	/*
+	 * Split can fail for reasons other than OOM, so if the user requests
+	 * this it's probably a mistake.
+	 */
+	VM_WARN_ON(vmg->give_up_on_oom &&
+		   (vma->vm_start != start || vma->vm_end != end));
+
 	/* Split any preceding portion of the VMA. */
 	if (vma->vm_start < start) {
 		int err = split_vma(vmg->vmi, vma, start, 1);
@@ -1602,12 +1642,15 @@  struct vm_area_struct
 		       struct vm_area_struct *vma,
 		       unsigned long start, unsigned long end,
 		       unsigned long new_flags,
-		       struct vm_userfaultfd_ctx new_ctx)
+		       struct vm_userfaultfd_ctx new_ctx,
+		       bool give_up_on_oom)
 {
 	VMG_VMA_STATE(vmg, vmi, prev, vma, start, end);

 	vmg.flags = new_flags;
 	vmg.uffd_ctx = new_ctx;
+	if (give_up_on_oom)
+		vmg.give_up_on_oom = true;

 	return vma_modify(&vmg);
 }
diff --git a/mm/vma.h b/mm/vma.h
index 7356ca5a22d3..149926e8a6d1 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -114,6 +114,12 @@  struct vma_merge_struct {
 	 */
 	bool just_expand :1;

+	/*
+	 * If a merge is possible, but an OOM error occurs, give up and don't
+	 * execute the merge, returning NULL.
+	 */
+	bool give_up_on_oom :1;
+
 	/* Internal flags set during merge process: */

 	/*
@@ -255,7 +261,8 @@  __must_check struct vm_area_struct
 		       struct vm_area_struct *vma,
 		       unsigned long start, unsigned long end,
 		       unsigned long new_flags,
-		       struct vm_userfaultfd_ctx new_ctx);
+		       struct vm_userfaultfd_ctx new_ctx,
+		       bool give_up_on_oom);

 __must_check struct vm_area_struct
 *vma_merge_new_range(struct vma_merge_struct *vmg);