mm,mremap: Bail out earlier in mremap_to under map pressure

Message ID	20190226091314.18446-1-osalvador@suse.de (mailing list archive)
State	New, archived
From: Oscar Salvador <osalvador@suse.de>
To: akpm@linux-foundation.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-api@vger.kernel.org, hughd@google.com, kirill@shutemov.name, vbabka@suse.cz, joel@joelfernandes.org, jglisse@redhat.com, yang.shi@linux.alibaba.com, mgorman@techsingularity.net, Oscar Salvador <osalvador@suse.de>
Subject: [PATCH] mm,mremap: Bail out earlier in mremap_to under map pressure
Date: Tue, 26 Feb 2019 10:13:14 +0100

Oscar Salvador Feb. 26, 2019, 9:13 a.m. UTC

When the mremap() syscall is used with the MREMAP_FIXED flag,
mremap() calls mremap_to(), which does the following:

1) unmaps the destination region where we are going to move the map
2) If the new region is going to be smaller, we unmap the last part
   of the old region

Then, we will eventually call move_vma() to do the actual move.

move_vma() checks whether we are at least 4 maps below max_map_count
before going further, otherwise it bails out with -ENOMEM.
The problem is that we might have already unmapped the vma's in steps
1) and 2), so it is not possible for userspace to figure out the state
of the vma's after it gets -ENOMEM, and it gets tricky for userspace
to clean up properly on error path.
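
For illustration only, the userspace side of this scenario might look like
the sketch below. The helper name move_mapping_fixed() is hypothetical and
not part of the patch; it simply wraps an mremap() call that uses
MREMAP_FIXED (which must be combined with MREMAP_MAYMOVE) and hands the
error back to the caller.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* exposes the MREMAP_* flags in <sys/mman.h> on glibc */
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (not from the patch): move the mapping at
 * [old_addr, old_addr + old_len) onto new_addr.
 * Returns 0 on success or a negative errno value on failure.
 *
 * With the behaviour described above, a -ENOMEM return does not tell the
 * caller whether the old or the destination region is still mapped, so
 * the caller cannot safely touch either of them while cleaning up.
 */
static int move_mapping_fixed(void *old_addr, size_t old_len,
			      void *new_addr, size_t new_len)
{
	void *res = mremap(old_addr, old_len, new_len,
			   MREMAP_MAYMOVE | MREMAP_FIXED, new_addr);

	if (res == MAP_FAILED)
		return -errno;
	return 0;
}
```

A caller that gets -ENOMEM back from such a helper is in exactly the
position described here: it has to decide how to clean up without knowing
which of the two regions survived.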

While it is true that we can return -ENOMEM for other reasons
(e.g. see may_expand_vm() or move_page_tables()), I think that we can
avoid this particular scenario if we check early in mremap_to() whether
the operation has a good chance of succeeding map-wise.

If that is not the case, we can bail out before we even try to unmap
anything, so that the vma's are left untouched when we are likely
to be short of maps.

The rule of thumb is to rely on the worst-case scenario.
That is when both vma's (old region and new region) are going to be split
in 3, so we get two more maps on top of the ones we already hold (one for each).
If the current map count + 2 maps still leaves us at least 4 maps below the
threshold, we are going to pass the check in move_vma().

Of course, this is not free, as it might generate false positives: we may be
tight map-wise, yet the unmap operation could release several vma's and
leave us in a good state.
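
The arithmetic of the worst-case check, and the false-positive case, can be
modeled in plain userspace C. This is only a sketch built from the
description in this changelog; the function names are hypothetical, the real
check in mm/mremap.c may be written differently, and 65530 is used purely as
an example limit.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the existing gate in move_vma(), as described
 * above: we must be at least 4 maps below the limit to go ahead.
 */
static bool move_vma_gate_passes(int map_count, int max_map_count)
{
	return map_count <= max_map_count - 4;
}

/*
 * Hypothetical model of the early check in mremap_to(): assume the
 * worst case, where splitting the old and the new vma adds 2 maps,
 * and ask whether the move_vma() gate would still pass afterwards.
 */
static bool early_check_passes(int map_count, int max_map_count)
{
	return move_vma_gate_passes(map_count + 2, max_map_count);
}
```

With an example limit of 65530, a map count of 65524 passes the early check
(65524 + 2 = 65526, which is 4 below the limit), while 65525 does not. The
false positive looks like this: at 65525 the early check refuses the
operation, yet if unmapping the destination had released three whole vma's,
the count would have dropped to 65522 and the gate in move_vma() would have
passed.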

Another approach was also investigated [1], but it may be too much hassle
for what it brings.

[1] https://lore.kernel.org/lkml/20190219155320.tkfkwvqk53tfdojt@d104.suse.de/

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/mremap.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

Andrew Morton Feb. 26, 2019, 10:04 p.m. UTC | #1

On Tue, 26 Feb 2019 10:13:14 +0100 Oscar Salvador <osalvador@suse.de> wrote:

> When using mremap() syscall in addition to MREMAP_FIXED flag,
> mremap() calls mremap_to() which does the following:
> 
> 1) unmaps the destination region where we are going to move the map
> 2) If the new region is going to be smaller, we unmap the last part
>    of the old region
> 
> Then, we will eventually call move_vma() to do the actual move.
> 
> move_vma() checks whether we are at least 4 maps below max_map_count
> before going further, otherwise it bails out with -ENOMEM.
> The problem is that we might have already unmapped the vma's in steps
> 1) and 2), so it is not possible for userspace to figure out the state
> of the vma's after it gets -ENOMEM, and it gets tricky for userspace
> to clean up properly on error path.
> 
> While it is true that we can return -ENOMEM for more reasons
> (e.g: see may_expand_vm() or move_page_tables()), I think that we can
> avoid this scenario in concret if we check early in mremap_to() if the
> operation has high chances to succeed map-wise.
> 
> Should not be that the case, we can bail out before we even try to unmap
> anything, so we make sure the vma's are left untouched in case we are likely
> to be short of maps.
> 
> The thumb-rule now is to rely on the worst-scenario case we can have.
> That is when both vma's (old region and new region) are going to be split
> in 3, so we get two more maps to the ones we already hold (one per each).
> If current map count + 2 maps still leads us to 4 maps below the threshold,
> we are going to pass the check in move_vma().
> 
> Of course, this is not free, as it might generate false positives when it is
> true that we are tight map-wise, but the unmap operation can release several
> vma's leading us to a good state.
> 
> Another approach was also investigated [1], but it may be too much hassle
> for what it brings.
> 

How is this going to affect existing userspace which is aware of the
current behaviour?

And how does it affect your existing cleanup code, come to that?  Does
it work as well or better after this change?

Oscar Salvador Feb. 27, 2019, 9:32 p.m. UTC | #2

On Tue, Feb 26, 2019 at 02:04:28PM -0800, Andrew Morton wrote:
> How is this going to affect existing userspace which is aware of the
> current behaviour?

Well, current behavior is not really predictable.
Our customer was "surprised" that the call to mremap() failed, but the regions
got unmapped nevertheless.
They found out the hard way, when they got a segfault while trying to write to
those regions during cleanup.

As I said in the changelog, the possibility for false positives exists, due to
the fact that we might get rid of several vma's when unmapping, but I do not
expect existing userspace applications to start failing.
If that turns out to be the case, we can revert the patch; it does not add a
lot of churn.

> And how does it affect your existing cleanup code, come to that?  Does
> it work as well or better after this change?

I guess the customer can rely more confidently on the maps having been left
untouched. I still have my reservations, though.

We can still get as far as move_vma(), where copy_vma() can fail and return
-ENOMEM. (Or maybe not, due to "too small to fail"?)

Vlastimil Babka Feb. 28, 2019, 8:06 a.m. UTC | #3

On 2/27/19 10:32 PM, Oscar Salvador wrote:
> On Tue, Feb 26, 2019 at 02:04:28PM -0800, Andrew Morton wrote:
>> How is this going to affect existing userspace which is aware of the
>> current behaviour?
> 
> Well, current behavior is not really predictable.
> Our customer was "surprised" that the call to mremap() failed, but the regions
> got unmapped nevertheless.
> They found it the hard way when they got a segfault when trying to write to those
> regions when cleaning up. 
> 
> As I said in the changelog, the possibility for false positives exists, due to
> the fact that we might get rid of several vma's when unmapping, but I do not
> expect existing userspace applications to start failing.
> Should be that the case, we can revert the patch, it is not that it adds a lot
> of churn.

Hopefully the only program that would start failing would be an LTP test
testing the current behavior near the limit (if such a test exists). And
that can be adjusted.

>> And how does it affect your existing cleanup code, come to that?  Does
>> it work as well or better after this change?
> 
> I guess the customer can trust more reliable that the maps were left untouched.
> I still have my reserves though.
> 
> We can get as far as move_vma(), and copy_vma() can fail returning -ENOMEM.
> (Or not due to the "too small to fail" ?)
>

Joel Fernandes Feb. 28, 2019, 8:44 p.m. UTC | #4

On Thu, Feb 28, 2019 at 12:06 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 2/27/19 10:32 PM, Oscar Salvador wrote:
> > On Tue, Feb 26, 2019 at 02:04:28PM -0800, Andrew Morton wrote:
> >> How is this going to affect existing userspace which is aware of the
> >> current behaviour?
> >
> > Well, current behavior is not really predictable.
> > Our customer was "surprised" that the call to mremap() failed, but the regions
> > got unmapped nevertheless.
> > They found it the hard way when they got a segfault when trying to write to those
> > regions when cleaning up.
> >
> > As I said in the changelog, the possibility for false positives exists, due to
> > the fact that we might get rid of several vma's when unmapping, but I do not
> > expect existing userspace applications to start failing.
> > Should be that the case, we can revert the patch, it is not that it adds a lot
> > of churn.
>
> Hopefully the only program that would start failing would be a LTP test
> testing the current behavior near the limit (if such test exists). And
> that can be adjusted.
>

IMO the original behavior is itself probably not a big issue because
if userspace wanted to mremap over something, it was prepared to lose
the "over something" mapping anyway. So it does seem to be a stretch
to call the behavior a "bug". Still I agree with the patch that mremap
should not leave any side effects after returning error.

thanks,

 - Joel

Cyril Hrubis March 1, 2019, 3:25 p.m. UTC | #5

Hi!
> Hopefully the only program that would start failing would be a LTP test
> testing the current behavior near the limit (if such test exists). And
> that can be adjusted.

There does not seem to be a mremap() test that would do such a thing, so
we should be safe :-).

BTW there was a similar fix for mmap() with MAP_FIXED that caused an LTP
test to fail and was fixed in:

commit e8420a8ece80b3fe810415ecf061d54ca7fab266
Author: Cyril Hrubis <chrubis@suse.cz>
Date:   Mon Apr 29 15:08:33 2013 -0700

    mm/mmap: check for RLIMIT_AS before unmapping

And I haven't heard of any breakages so far, so I guess that this is a very
similar situation and that the possibility of breaking real-world
applications here is really low.
