mm,mremap: Bail out earlier in mremap_to under map pressure

Message ID 20190226091314.18446-1-osalvador@suse.de
State New

Commit Message

Oscar Salvador Feb. 26, 2019, 9:13 a.m. UTC
When the mremap() syscall is used with the MREMAP_FIXED flag,
mremap() calls mremap_to(), which does the following:

1) unmaps the destination region where we are going to move the map
2) If the new region is going to be smaller, we unmap the last part
   of the old region

Then, we will eventually call move_vma() to do the actual move.

move_vma() checks whether we are at least 4 maps below max_map_count
before going further, otherwise it bails out with -ENOMEM.
The problem is that we might have already unmapped the vma's in steps
1) and 2), so it is not possible for userspace to figure out the state
of the vma's after it gets -ENOMEM, and it gets tricky for userspace
to clean up properly on error path.
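
For illustration, a minimal userspace sketch of the call in question. The
helper name move_mapping_fixed() is hypothetical and not part of this patch;
it only wraps the documented mremap() interface to show where the ambiguous
-ENOMEM reaches the caller.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (not part of the patch): move the mapping at
 * [old_addr, old_addr + old_len) onto dst using MREMAP_FIXED.
 * Returns 0 on success or a negative errno value on failure.
 */
static int move_mapping_fixed(void *old_addr, size_t old_len,
			      void *dst, size_t new_len)
{
	/* MREMAP_FIXED is only valid together with MREMAP_MAYMOVE. */
	void *ret = mremap(old_addr, old_len, new_len,
			   MREMAP_MAYMOVE | MREMAP_FIXED, dst);

	if (ret == MAP_FAILED) {
		/*
		 * If this is -ENOMEM from the map count check in move_vma(),
		 * the destination (and the tail of a shrunk source) may
		 * already be unmapped, so the caller cannot safely touch
		 * either range during cleanup.
		 */
		return -errno;
	}

	/* With MREMAP_FIXED, success means the mapping now lives at dst. */
	assert(ret == dst);
	return 0;
}
```

A caller that gets -ENOMEM back from this helper has no direct way to tell
which of the two ranges are still mapped, which is the situation described
above.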

While it is true that we can return -ENOMEM for other reasons
(e.g. see may_expand_vm() or move_page_tables()), I think that we can
avoid this particular scenario if we check early in mremap_to() whether
the operation is likely to succeed map-wise.

If it is not, we can bail out before we even try to unmap anything, so
the vma's are left untouched whenever we are likely to be short of maps.

The rule of thumb is to assume the worst case we can have.
That is when both vma's (old region and new region) are going to be split
in 3, so we get two more maps on top of the ones we already hold (one for
each vma).
If the current map count + 2 still leaves us 4 maps below the threshold,
we are going to pass the check in move_vma().

Of course, this is not free: it can produce false positives when we are
tight map-wise but the unmap operation would have released several vma's,
leaving us in a good state.
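
The arithmetic behind the rule of thumb and its false-positive cost can be
modelled in plain C. This is a simplified sketch, not kernel code: the
function names are hypothetical, the map counts are passed in as plain
integers, and 65530 is assumed as the limit because it is the usual default
of vm.max_map_count.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed limit for the model; the usual default of vm.max_map_count. */
#define MODEL_MAX_MAP_COUNT 65530

/*
 * Model of the existing check at the top of move_vma(): bail out
 * unless we are at least 4 maps below the limit.
 */
static bool move_vma_bails(int map_count, int max_map_count)
{
	return map_count >= max_map_count - 3;
}

/*
 * Model of the early check added to mremap_to(): assume the worst
 * case, where splitting the old and new vma's costs 2 extra maps
 * before move_vma() gets to run.
 */
static bool mremap_to_bails_early(int map_count, int max_map_count)
{
	assert(max_map_count > 3);
	return (map_count + 2) >= max_map_count - 3;
}

/*
 * A false positive: the early check refuses the operation although
 * the map count left after the unmaps would have passed move_vma().
 */
static bool is_false_positive(int count_before, int count_after_unmap,
			      int max_map_count)
{
	return mremap_to_bails_early(count_before, max_map_count) &&
	       !move_vma_bails(count_after_unmap, max_map_count);
}
```

With the assumed limit, a map count of 65524 still passes the early check,
and even after gaining 2 maps (65526) it passes move_vma(); 65525 is the
first count that is rejected early. A process at 65526 maps whose
destination unmap would have dropped it to 65520 is an example of the false
positive.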

Another approach was also investigated [1], but it may be too much hassle
for what it brings.

[1] https://lore.kernel.org/lkml/20190219155320.tkfkwvqk53tfdojt@d104.suse.de/

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/mremap.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

Comments

Andrew Morton Feb. 26, 2019, 10:04 p.m. UTC | #1
On Tue, 26 Feb 2019 10:13:14 +0100 Oscar Salvador <osalvador@suse.de> wrote:

> When using mremap() syscall in addition to MREMAP_FIXED flag,
> mremap() calls mremap_to() which does the following:
> 
> 1) unmaps the destination region where we are going to move the map
> 2) If the new region is going to be smaller, we unmap the last part
>    of the old region
> 
> Then, we will eventually call move_vma() to do the actual move.
> 
> move_vma() checks whether we are at least 4 maps below max_map_count
> before going further, otherwise it bails out with -ENOMEM.
> The problem is that we might have already unmapped the vma's in steps
> 1) and 2), so it is not possible for userspace to figure out the state
> of the vma's after it gets -ENOMEM, and it gets tricky for userspace
> to clean up properly on error path.
> 
> While it is true that we can return -ENOMEM for more reasons
> (e.g: see may_expand_vm() or move_page_tables()), I think that we can
> avoid this scenario in concret if we check early in mremap_to() if the
> operation has high chances to succeed map-wise.
> 
> Should not be that the case, we can bail out before we even try to unmap
> anything, so we make sure the vma's are left untouched in case we are likely
> to be short of maps.
> 
> The thumb-rule now is to rely on the worst-scenario case we can have.
> That is when both vma's (old region and new region) are going to be split
> in 3, so we get two more maps to the ones we already hold (one per each).
> If current map count + 2 maps still leads us to 4 maps below the threshold,
> we are going to pass the check in move_vma().
> 
> Of course, this is not free, as it might generate false positives when it is
> true that we are tight map-wise, but the unmap operation can release several
> vma's leading us to a good state.
> 
> Another approach was also investigated [1], but it may be too much hassle
> for what it brings.
> 

How is this going to affect existing userspace which is aware of the
current behaviour?

And how does it affect your existing cleanup code, come to that?  Does
it work as well or better after this change?
Oscar Salvador Feb. 27, 2019, 9:32 p.m. UTC | #2
On Tue, Feb 26, 2019 at 02:04:28PM -0800, Andrew Morton wrote:
> How is this going to affect existing userspace which is aware of the
> current behaviour?

Well, current behavior is not really predictable.
Our customer was "surprised" that the call to mremap() failed, but the regions
got unmapped nevertheless.
They found out the hard way, when they got a segfault while trying to write
to those regions during cleanup.

As I said in the changelog, the possibility for false positives exists, due to
the fact that we might get rid of several vma's when unmapping, but I do not
expect existing userspace applications to start failing.
If that turns out to be the case, we can revert the patch; it does not add
a lot of churn.

> And how does it affect your existing cleanup code, come to that?  Does
> it work as well or better after this change?

I guess the customer can more reliably trust that the maps were left untouched.
I still have my reservations, though.

We can still get as far as move_vma(), where copy_vma() can fail and return
-ENOMEM. (Or maybe not, due to the "too small to fail" rule?)
Vlastimil Babka Feb. 28, 2019, 8:06 a.m. UTC | #3
On 2/27/19 10:32 PM, Oscar Salvador wrote:
> On Tue, Feb 26, 2019 at 02:04:28PM -0800, Andrew Morton wrote:
>> How is this going to affect existing userspace which is aware of the
>> current behaviour?
> 
> Well, current behavior is not really predictable.
> Our customer was "surprised" that the call to mremap() failed, but the regions
> got unmapped nevertheless.
> They found it the hard way when they got a segfault when trying to write to those
> regions when cleaning up. 
> 
> As I said in the changelog, the possibility for false positives exists, due to
> the fact that we might get rid of several vma's when unmapping, but I do not
> expect existing userspace applications to start failing.
> Should be that the case, we can revert the patch, it is not that it adds a lot
> of churn.

Hopefully the only program that would start failing would be an LTP test
testing the current behavior near the limit (if such a test exists). And
that can be adjusted.

>> And how does it affect your existing cleanup code, come to that?  Does
>> it work as well or better after this change?
> 
> I guess the customer can trust more reliable that the maps were left untouched.
> I still have my reserves though.
> 
> We can get as far as move_vma(), and copy_vma() can fail returning -ENOMEM.
> (Or not due to the "too small to fail" ?)
>
Joel Fernandes Feb. 28, 2019, 8:44 p.m. UTC | #4
On Thu, Feb 28, 2019 at 12:06 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 2/27/19 10:32 PM, Oscar Salvador wrote:
> > On Tue, Feb 26, 2019 at 02:04:28PM -0800, Andrew Morton wrote:
> >> How is this going to affect existing userspace which is aware of the
> >> current behaviour?
> >
> > Well, current behavior is not really predictable.
> > Our customer was "surprised" that the call to mremap() failed, but the regions
> > got unmapped nevertheless.
> > They found it the hard way when they got a segfault when trying to write to those
> > regions when cleaning up.
> >
> > As I said in the changelog, the possibility for false positives exists, due to
> > the fact that we might get rid of several vma's when unmapping, but I do not
> > expect existing userspace applications to start failing.
> > Should be that the case, we can revert the patch, it is not that it adds a lot
> > of churn.
>
> Hopefully the only program that would start failing would be a LTP test
> testing the current behavior near the limit (if such test exists). And
> that can be adjusted.
>

IMO the original behavior is itself probably not a big issue because
if userspace wanted to mremap over something, it was prepared to lose
the "over something" mapping anyway. So it does seem to be a stretch
to call the behavior a "bug". Still, I agree with the patch that mremap
should not leave any side effects after returning an error.

thanks,

 - Joel
Cyril Hrubis March 1, 2019, 3:25 p.m. UTC | #5
Hi!
> Hopefully the only program that would start failing would be a LTP test
> testing the current behavior near the limit (if such test exists). And
> that can be adjusted.

There does not seem to be a mremap() test that would do such a thing, so
we should be safe :-).

BTW there was a similar fix for mmap() with MAP_FIXED that caused an LTP
test to fail and was fixed in:

commit e8420a8ece80b3fe810415ecf061d54ca7fab266
Author: Cyril Hrubis <chrubis@suse.cz>
Date:   Mon Apr 29 15:08:33 2013 -0700

    mm/mmap: check for RLIMIT_AS before unmapping

And I haven't heard of any breakages so far, so I guess that this is a very
similar situation and that the possibility of breaking real world
applications here is really low.

Patch

diff --git a/mm/mremap.c b/mm/mremap.c
index 3320616ed93f..e3edef6b7a12 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -516,6 +516,23 @@  static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
 	if (addr + old_len > new_addr && new_addr + new_len > addr)
 		goto out;
 
+	/*
+	 * move_vma() needs us to stay 4 maps below the threshold, otherwise
+	 * it will bail out at the very beginning.
+	 * That is a problem if we have already unmapped the regions here
+	 * (new_addr, and old_addr), because userspace will not know the
+	 * state of the vma's after it gets -ENOMEM.
+	 * So, to avoid such a scenario we can pre-compute whether the whole
+	 * operation has a high chance to succeed map-wise.
+	 * The worst case is when both vma's (new_addr and old_addr) get
+	 * split in 3 before unmapping them.
+	 * That means 2 more maps (1 for each) to the ones we already hold.
+	 * Check whether current map count plus 2 still leads us to 4 maps below
+	 * the threshold, otherwise return -ENOMEM here to be more safe.
+	 */
+	if ((mm->map_count + 2) >= sysctl_max_map_count - 3)
+		return -ENOMEM;
+
 	ret = do_munmap(mm, new_addr, new_len, uf_unmap_early);
 	if (ret)
 		goto out;