[PATCHv2] mm: Fix warning in move_normal_pmd()
diff mbox series

Message ID 20200715135011.42743-1-kirill.shutemov@linux.intel.com
State New
Headers show
Series
  • [PATCHv2] mm: Fix warning in move_normal_pmd()
Related show

Commit Message

Kirill A. Shutemov July 15, 2020, 1:50 p.m. UTC
mremap(2) does not allow source and destination regions to overlap, but
shift_arg_pages() calls move_page_tables() directly and in this case the
source and destination overlap often. It confuses move_normal_pmd():

  WARNING: CPU: 3 PID: 27091 at mm/mremap.c:211 move_page_tables+0x6ef/0x720

move_normal_pmd() expects the destination PMD to be empty, but when
ranges overlap nobody removes PTE page tables on source side.
move_ptes() only removes PTE entries, leaving tables behind.
When the source PMD becomes destination and alignment/size is right we
step onto the warning.

The warning is harmless: kernel correctly fallbacks to handle entries on
per-entry basis.

The fix is to avoid move_normal_pmd() if we see that source and
destination ranges overlap.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Fixes: 2c91bd4a4e2e ("mm: speed up mremap by 20x on large regions")
Link: https://lore.kernel.org/lkml/20200713025354.GB3644504@google.com/
Reported-and-tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: William Kucharski <william.kucharski@oracle.com>
---
 mm/mremap.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

Comments

Linus Torvalds July 15, 2020, 6:36 p.m. UTC | #1
On Wed, Jul 15, 2020 at 6:50 AM Kirill A. Shutemov
<kirill.shutemov@linux.intel.com> wrote:
>
> mremap(2) does not allow source and destination regions to overlap, but
> shift_arg_pages() calls move_page_tables() directly and in this case the
> source and destination overlap often. It

Actually, before we do this patch (which I think is a good idea), I'd
like Naresh to test the attached patch.

And Kirill, Joel, mind looking it over too.

IT IS ENTIRELY UNTESTED.

But I hope the concept (and the code) is fairly obvious: it makes
move_page_tables() try to align the range to be moved, if possible.

That *should* actually remove the warning that Naresh sees, for the
simple reason that now the PMD moving case will *always* trigger when
the stack movement is PMD-aligned, and you never see the "first do a
few small pages, then do the PMD optimization" case.

And that should mean that we don't ever hit the "oh, we already have a
target pmd" thing, because the "move the whole pmd" case will clear
the ones it has already taken care of, and you never have that "oh,
there's an empty pmd where the pages were moved away one by one".

And that means that for this case, it's _ok_ to overlap (as long as we
copy downwards).

What do you think?

Ok, so I could easily have screwed up the patch, even if it's
conceptually fairly simple. The details are small, but they needed
some care. The thing _looks_ fine to me, both on a source level and
when looking at the generated code, and I made sure to try to use a
lot of small helper functions and couple of helper macros to make each
individual piece look clear and obvious.

But obviously a single "Oh, that test is the wrong way around", or a
missing mask inversion, or whatever, could completely break the code.
So no guarantees, but it looks fairly straightforward to me.

Naresh - this patch does *NOT* make sense to test together with
Kirill's (or Joel's) patch that says "don't do the PMD optimization at
all for overlapping cases". Those patches will hide the issue, and
make the new code only work for mremap(), not for the initial stack
setup that caused the original warning.

Joel - I think this patch makes sense _regardless_ of the special
execve() usage case, in that it extends your "let's just copy things
at a whole PMD level" logic to even the case where the beginning
wasn't PMD-aligned (or the length wasn't sufficient).

Obviously it only triggers when the source and destination are
mutually aligned, and if there are no other vma's around those
first/last PMD entries. Which might mean that it almost never triggers
in practice, but looking at the generated code, it sure looks like
it's cheap enough to test.

Didn't you have some test-cases for your pmd optimized movement cases,
at least for timing?

               Linus
Joel Fernandes July 15, 2020, 8:54 p.m. UTC | #2
Hi Linus,

On Wed, Jul 15, 2020 at 11:36:59AM -0700, Linus Torvalds wrote:
> On Wed, Jul 15, 2020 at 6:50 AM Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > mremap(2) does not allow source and destination regions to overlap, but
> > shift_arg_pages() calls move_page_tables() directly and in this case the
> > source and destination overlap often. It
> 
> Actually, before we do this patch (which I think is a good idea), I'd
> like Naresh to test the attached patch.
> 
> And Kirill, Joel, mind looking it over too.
> 
> IT IS ENTIRELY UNTESTED.

Seems a clever idea and was something I wanted as well. Thanks. Some comments
below:

> But I hope the concept (and the code) is fairly obvious: it makes
> move_page_tables() try to align the range to be moved, if possible.
> 
> That *should* actually remove the warning that Naresh sees, for the
> simple reason that now the PMD moving case will *always* trigger when
> the stack movement is PMD-aligned, and you never see the "first do a
> few small pages, then do the PMD optimization" case.
> 
> And that should mean that we don't ever hit the "oh, we already have a
> target pmd" thing, because the "move the whole pmd" case will clear
> the ones it has already taken care of, and you never have that "oh,
> there's an empty pmd where the pages were moved away one by one".
> 
> And that means that for this case, it's _ok_ to overlap (as long as we
> copy downwards).
> 
> What do you think?

I might not have fully understand the code but I get the basic idea it is
aiming for. Basically you want to align the cases where the address space is
free from both sides so that you can trigger the PMD-moving optimization.

Regarding the ADDR_AFTER_NEXT checks, shouldn't you check for:

	if (ADDR_AFTER_NEXT(ALIGN(*old_addr + *len, PMD_SIZE), old))
		return;

and for the len calculation, I did not follow what you did, but I think you
meant something like this? Does the following reduce to what you did? At
least this is a bit more readable I think:

	*len += (ALIGN(*new_addr + *len, PMD_SIZE) - (*new_addr + *len));

	which reduces to:

	*len = ALIGN(*new_addr + *len, PMD_SIZE) - *new_addr;

Also you did "len +=", it should be "*len +=" in this function.

In try_to_align_start(), everything looks correct to me.

> Ok, so I could easily have screwed up the patch, even if it's
> conceptually fairly simple. The details are small, but they needed
> some care. The thing _looks_ fine to me, both on a source level and
> when looking at the generated code, and I made sure to try to use a
> lot of small helper functions and couple of helper macros to make each
> individual piece look clear and obvious.
> 
> But obviously a single "Oh, that test is the wrong way around", or a
> missing mask inversion, or whatever, could completely break the code.
> So no guarantees, but it looks fairly straightforward to me.
> 
> Naresh - this patch does *NOT* make sense to test together with
> Kirill's (or Joel's) patch that says "don't do the PMD optimization at
> all for overlapping cases". Those patches will hide the issue, and
> make the new code only work for mremap(), not for the initial stack
> setup that caused the original warning.
> 
> Joel - I think this patch makes sense _regardless_ of the special
> execve() usage case, in that it extends your "let's just copy things
> at a whole PMD level" logic to even the case where the beginning
> wasn't PMD-aligned (or the length wasn't sufficient).
> 
> Obviously it only triggers when the source and destination are
> mutually aligned, and if there are no other vma's around those
> first/last PMD entries. Which might mean that it almost never triggers
> in practice, but looking at the generated code, it sure looks like
> it's cheap enough to test.

Oh ok, we had some requirements in my old job about moving large address
spaces which were aligned so it triggered a lot in those. Also folks in the
ChromeOS team tell me it is helping them.

> Didn't you have some test-cases for your pmd optimized movement cases,
> at least for timing?

Yes, I had a simple mremap() test which was moving a 1GB map and measuring
performance. Once I get a chance I can try it out with this optimization and
trigger it with unaligned addresses as well.

thanks,

 - Joel
Kirill A. Shutemov July 15, 2020, 8:55 p.m. UTC | #3
On Wed, Jul 15, 2020 at 11:36:59AM -0700, Linus Torvalds wrote:
> On Wed, Jul 15, 2020 at 6:50 AM Kirill A. Shutemov
> <kirill.shutemov@linux.intel.com> wrote:
> >
> > mremap(2) does not allow source and destination regions to overlap, but
> > shift_arg_pages() calls move_page_tables() directly and in this case the
> > source and destination overlap often. It
> 
> Actually, before we do this patch (which I think is a good idea), I'd
> like Naresh to test the attached patch.
> 
> And Kirill, Joel, mind looking it over too.

I don't understand 'len' calculation in try_to_align_end().

IIUC, it increases 'len' by PMD_SIZE if 'new_addr+len' is not aligned to
PMD_SIZE. It doesn't make sense to me.

Maybe
	*len = roundup_up(*new_addr + *len, PMD_SIZE) - *new_addr;

or something?

BUT

I *think* there's a bigger problem with the patch:

For stack relocation case both VMAs are the same and always(?) the only
VMA around at the time. It means none of ADDR_BEFORE_PREV and
ADDR_AFTER_NEXT are going to stop us.

Consider the following case, before and after try_to_align_start():

		before		after
old_addr:	0x0123000	0x0000000
new_addr:	0x1123000	0x1000000
len:		0x1000000	0x1123000

(4k PAGE_SIZE, 2M PMD_SIZE)

On the first iteration we would attempt to move 0x0-0x200000 to 
0x1000000-0x1200000 and step onto the same WARN(), no?
Kirill A. Shutemov July 15, 2020, 9:13 p.m. UTC | #4
On Wed, Jul 15, 2020 at 04:54:28PM -0400, Joel Fernandes wrote:
> Hi Linus,
> 
> On Wed, Jul 15, 2020 at 11:36:59AM -0700, Linus Torvalds wrote:
> Regarding the ADDR_AFTER_NEXT checks, shouldn't you check for:
> 
> 	if (ADDR_AFTER_NEXT(ALIGN(*old_addr + *len, PMD_SIZE), old))
> 		return;

See comment to ADDR_AFTER_NEXT().
Linus Torvalds July 15, 2020, 9:31 p.m. UTC | #5
On Wed, Jul 15, 2020 at 1:54 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> Regarding the ADDR_AFTER_NEXT checks, shouldn't you check for:
>
>         if (ADDR_AFTER_NEXT(ALIGN(*old_addr + *len, PMD_SIZE), old))
>                 return;

No, there's even a comment to the effect.

Instead, that ADDR_AFTER_NEXT() aligns the next address _down_ to the
PMD boundary.

Because otherwise, what can happen is:

 - you're on an architecture that has a separate address space for users

 - you're the next-to-last VMA in that address space,

 - you're in the last PMD.

And now "ALIGN(*old_addr + *len, PMD_SIZE)" will wrap, and become 0,
and you think it's ok to move the whole PMD, because it's now smaller
than the start address of the next VMA.

It's _not_ ok, because you'd be moving that next-vma data too.

> and for the len calculation, I did not follow what you did, but I think you
> meant something like this? Does the following reduce to what you did? At
> least this is a bit more readable I think:
>
>         *len += (ALIGN(*new_addr + *len, PMD_SIZE) - (*new_addr + *len));

Yes, right you are.

I actually wrote that first (except I added a helper variable for that
"*new_addr + *len" thing), and then I decided it can be simplified.

And simplified it wrong ;)

> Also you did "len +=", it should be "*len +=" in this function.

That's indeed a plain stupid bug ;)

Naresh - don't test that version. The bugs Joel found just make the
math wrong, so it won't work.

The concept was solid, the implementation not so much ;)

                 Linus
Linus Torvalds July 15, 2020, 9:35 p.m. UTC | #6
On Wed, Jul 15, 2020 at 1:55 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> I don't understand 'len' calculation in try_to_align_end().

Yeah, Joel found the same thing. You don't understand it, because it's garbage.

> BUT
>
> I *think* there's a bigger problem with the patch:
>
> For stack relocation case both VMAs are the same and always(?) the only
> VMA around at the time. It means none of ADDR_BEFORE_PREV and
> ADDR_AFTER_NEXT are going to stop us.

But for the stack relocation case, that should actually be fine. We
are moving the whole thing.

Or maybe I missed something.

> Consider the following case, before and after try_to_align_start():
>
>                 before          after
> old_addr:       0x0123000       0x0000000
> new_addr:       0x1123000       0x1000000
> len:            0x1000000       0x1123000

That's the "move up" case that fundamentally doesn't work anyway
because it will corrupt the data as it moves it.

The stack relocation only moves down.

               Linus
Linus Torvalds July 15, 2020, 9:43 p.m. UTC | #7
On Wed, Jul 15, 2020 at 2:31 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Naresh - don't test that version. The bugs Joel found just make the
> math wrong, so it won't work.

Here's a new version with the thing that Joel and Kirill both noticed
hopefully fixed.

But I probably screwed it up again. I guess I should test it, but I
don't have any really relevant environment (the plain mremap() case
should have shown the obvious bugs, though, so that's just an excuse
for my laziness)

            Linus
Linus Torvalds July 15, 2020, 9:51 p.m. UTC | #8
On Wed, Jul 15, 2020 at 2:35 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The stack relocation only moves down.

.. and that may not be true on a grow-up stack. Christ, that code is
hard to read. But I think Kirill is right, and I'm wrong.

So that "try to expand" code is only ok when non-overlapping, or when
moving down.

Naresh - from a testing standpoint, that doesn't make any difference:
I'd still like you to test on x86. We always shift the stack down
there.

          Linus
Kirill A. Shutemov July 15, 2020, 10:22 p.m. UTC | #9
On Wed, Jul 15, 2020 at 02:43:00PM -0700, Linus Torvalds wrote:
> On Wed, Jul 15, 2020 at 2:31 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Naresh - don't test that version. The bugs Joel found just make the
> > math wrong, so it won't work.
> 
> Here's a new version with the thing that Joel and Kirill both noticed
> hopefully fixed.
> 
> But I probably screwed it up again. I guess I should test it, but I
> don't have any really relevant environment (the plain mremap() case
> should have shown the obvious bugs, though, so that's just an excuse
> for my laziness)

Sorry, but the patch is broken.

It overcopies entires in some cases. It doesn't handles correctly if you
try to move PTEs from the middle of VMA.

The test case below rightly sigfaults without the patch when trying access
DST_ADDR[0], but not with the patch. It creates PTE entry at DST_ADDR that
doesn't belong to any VMA. :/

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define SRC_ADDR ((char *)(4UL << 30))
#define DST_ADDR ((char *)(5UL << 30))


int main(void)
{
	mmap(SRC_ADDR, 2UL << 20, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	madvise(SRC_ADDR, 2UL << 20, MADV_NOHUGEPAGE);
	memset(SRC_ADDR, 0x33, 2UL << 20);

	mremap(SRC_ADDR + 4096, 4096, 4096, MREMAP_MAYMOVE | MREMAP_FIXED,
		     DST_ADDR + 4096);
	printf("0: %#x 4096: %#x\n", DST_ADDR[0], DST_ADDR[4096]);
	return 0;
}
Linus Torvalds July 15, 2020, 10:36 p.m. UTC | #10
On Wed, Jul 15, 2020 at 3:22 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> Sorry, but the patch is broken.

I should take up some other hobby. Maybe knitting.

But then I'd probably end up with sweaters with three arms, and
there's a very limited market for that.

Oh well.

           Linus
Linus Torvalds July 15, 2020, 10:57 p.m. UTC | #11
On Wed, Jul 15, 2020 at 3:22 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> Sorry, but the patch is broken.

.. instead of taking up knitting - which I'd invariably also screw up
- I took a look.

Yeah, in addition to checking the vm_prev/vm_next vma's, we need to
check the limits of the 'vma' itself. Because we may not be moving the
whole vma.

So the "extend upwards" can only happen if the end address matches the
end address of the current vma (and vice versa for the "extend down"
case).

Then it would hopefully work.

But now I've screwed it up twice, and have a splitting headache, so
rather than stare at this cross-eyed, I'll take a break and hope that
somebody more competent than me looks at the code.

               Linus
Linus Torvalds July 15, 2020, 11:04 p.m. UTC | #12
On Wed, Jul 15, 2020 at 3:57 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But now I've screwed it up twice, and have a splitting headache, so
> rather than stare at this cross-eyed, I'll take a break and hope that
> somebody more competent than me looks at the code.

I lied. I had a couple of pending pulls, so I couldn't go lie down
anyway, so I tried to take another look.

It *might* be as simple as this incremetal thing on top, but then I
think that code will no longer trigger on the stack movement case at
all, since there we don't work with the whole vma.

So if we want that to work, we'd have to fix that up too.

And this might be broken anyway. My track record isn't looking so hot right now.

                  Linus
Linus Torvalds July 15, 2020, 11:18 p.m. UTC | #13
On Wed, Jul 15, 2020 at 4:04 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> It *might* be as simple as this incremental thing on top

No, it needs to be

+       if (*old_addr + *len < old->vm_end)
+               return;

in try_to_align_end(), of course.

Now I'm going for a lie-down, because this cross-eyed thing isn't working.

              Linus
Naresh Kamboju July 16, 2020, 6:37 a.m. UTC | #14
On Thu, 16 Jul 2020 at 04:49, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, Jul 15, 2020 at 4:04 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > It *might* be as simple as this incremental thing on top
>
> No, it needs to be
>
> +       if (*old_addr + *len < old->vm_end)
> +               return;
>
> in try_to_align_end(), of course.
>
> Now I'm going for a lie-down, because this cross-eyed thing isn't working.


Just want to double check.
Here is the diff after those two patches applied. Please correct me if
it is wrong.
This patch applied on top of Linus mainline master branch.
I am starting my test cycles.

---
 mm/mremap.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)

diff --git a/mm/mremap.c b/mm/mremap.c
index 6b153dc05fe4..4961c18d2008 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -254,6 +254,77 @@ static bool move_normal_pmd(struct vm_area_struct
*vma, unsigned long old_addr,

  return true;
 }
+
+#define ADDR_BEFORE_PREV(addr, vma) \
+ ((vma)->vm_prev && (addr) < (vma)->vm_prev->vm_end)
+
+static inline void try_to_align_start(unsigned long *len,
+ struct vm_area_struct *old, unsigned long *old_addr,
+ struct vm_area_struct *new, unsigned long *new_addr)
+{
+ if (*old_addr > old->vm_start)
+ return;
+
+ if (ADDR_BEFORE_PREV(*old_addr & PMD_MASK, old))
+ return;
+
+ if (ADDR_BEFORE_PREV(*new_addr & PMD_MASK, new))
+ return;
+
+ /* Bingo! */
+ *len += *new_addr & ~PMD_MASK;
+ *old_addr &= PMD_MASK;
+ *new_addr &= PMD_MASK;
+}
+
+/*
+ * When aligning the end, avoid ALIGN() (which can overflow
+ * if the user space is the full address space, and overshoot
+ * the vm_start of the next vma).
+ *
+ * Align the upper limit down instead, and check that it's not
+ * in the same PMD as the end.
+ */
+#define ADDR_AFTER_NEXT(addr, vma) \
+ ((vma)->vm_next && (addr) > (PMD_MASK & (vma)->vm_next->vm_start))
+
+static inline void try_to_align_end(unsigned long *len,
+ struct vm_area_struct *old, unsigned long *old_addr,
+ struct vm_area_struct *new, unsigned long *new_addr)
+{
+ if (*old_addr < old->vm_end)
+ return;
+
+ if (ADDR_AFTER_NEXT(*old_addr + *len, old))
+ return;
+
+ if (ADDR_AFTER_NEXT(*new_addr + *len, new))
+ return;
+
+ /* Mutual alignment means this is same for new/old addr */
+ *len = ALIGN(*new_addr + *len, PMD_SIZE) - *new_addr;
+}
+
+/*
+ * The PMD move case is much more efficient, so if we have the
+ * mutually aligned case, try to see if we can extend the
+ * beginning and end to be aligned too.
+ *
+ * The pointer dereferences look bad, but with inlining, the
+ * compiler will sort it out.
+ */
+static inline void try_to_align_range(unsigned long *len,
+ struct vm_area_struct *old, unsigned long *old_addr,
+ struct vm_area_struct *new, unsigned long *new_addr)
+{
+ if ((*old_addr ^ *new_addr) & ~PMD_MASK)
+ return;
+
+ try_to_align_start(len, old, old_addr, new, new_addr);
+ try_to_align_end(len, old, old_addr, new, new_addr);
+}
+#else
+#define try_to_align_range(len,old,olda,new,newa) do { } while(0);
 #endif

 unsigned long move_page_tables(struct vm_area_struct *vma,
@@ -272,6 +343,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
  old_addr, old_end);
  mmu_notifier_invalidate_range_start(&range);

+ try_to_align_range(&len, vma, &old_addr, new_vma, &new_addr);
+
  for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
  cond_resched();
  next = (old_addr + PMD_SIZE) & PMD_MASK;
Naresh Kamboju July 16, 2020, 7:23 a.m. UTC | #15
On Thu, 16 Jul 2020 at 12:07, Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
>
> On Thu, 16 Jul 2020 at 04:49, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Wed, Jul 15, 2020 at 4:04 PM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > It *might* be as simple as this incremental thing on top
> >
> > No, it needs to be
> >
> > +       if (*old_addr + *len < old->vm_end)
> > +               return;
> >
> > in try_to_align_end(), of course.
> >
> > Now I'm going for a lie-down, because this cross-eyed thing isn't working.
>
>
> Just want to double check.
> Here is the diff after those two patches applied. Please correct me if
> it is wrong.
> This patch applied on top of Linus mainline master branch.
> I am starting my test cycles.

Sorry this patch (the below pasted ) did not solve the reported problem.
I still notice warning

[  760.004318] ------------[ cut here ]------------
[  760.009039] WARNING: CPU: 3 PID: 461 at mm/mremap.c:230
move_page_tables+0x818/0x840
[  760.016848] Modules linked in: x86_pkg_temp_thermal fuse
[  760.022220] CPU: 3 PID: 461 Comm: true Not tainted 5.8.0-rc5 #1
[  760.028198] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
2.2 05/23/2018
[  760.035651] EIP: move_page_tables+0x818/0x840

ref:
https://lkft.validation.linaro.org/scheduler/job/1574277#L12221

>
> ---
>  mm/mremap.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 73 insertions(+)
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 6b153dc05fe4..4961c18d2008 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -254,6 +254,77 @@ static bool move_normal_pmd(struct vm_area_struct
> *vma, unsigned long old_addr,
>
>   return true;
>  }
> +
> +#define ADDR_BEFORE_PREV(addr, vma) \
> + ((vma)->vm_prev && (addr) < (vma)->vm_prev->vm_end)
> +
> +static inline void try_to_align_start(unsigned long *len,
> + struct vm_area_struct *old, unsigned long *old_addr,
> + struct vm_area_struct *new, unsigned long *new_addr)
> +{
> + if (*old_addr > old->vm_start)
> + return;
> +
> + if (ADDR_BEFORE_PREV(*old_addr & PMD_MASK, old))
> + return;
> +
> + if (ADDR_BEFORE_PREV(*new_addr & PMD_MASK, new))
> + return;
> +
> + /* Bingo! */
> + *len += *new_addr & ~PMD_MASK;
> + *old_addr &= PMD_MASK;
> + *new_addr &= PMD_MASK;
> +}
> +
> +/*
> + * When aligning the end, avoid ALIGN() (which can overflow
> + * if the user space is the full address space, and overshoot
> + * the vm_start of the next vma).
> + *
> + * Align the upper limit down instead, and check that it's not
> + * in the same PMD as the end.
> + */
> +#define ADDR_AFTER_NEXT(addr, vma) \
> + ((vma)->vm_next && (addr) > (PMD_MASK & (vma)->vm_next->vm_start))
> +
> +static inline void try_to_align_end(unsigned long *len,
> + struct vm_area_struct *old, unsigned long *old_addr,
> + struct vm_area_struct *new, unsigned long *new_addr)
> +{
> + if (*old_addr < old->vm_end)
> + return;
> +
> + if (ADDR_AFTER_NEXT(*old_addr + *len, old))
> + return;
> +
> + if (ADDR_AFTER_NEXT(*new_addr + *len, new))
> + return;
> +
> + /* Mutual alignment means this is same for new/old addr */
> + *len = ALIGN(*new_addr + *len, PMD_SIZE) - *new_addr;
> +}
> +
> +/*
> + * The PMD move case is much more efficient, so if we have the
> + * mutually aligned case, try to see if we can extend the
> + * beginning and end to be aligned too.
> + *
> + * The pointer dereferences look bad, but with inlining, the
> + * compiler will sort it out.
> + */
> +static inline void try_to_align_range(unsigned long *len,
> + struct vm_area_struct *old, unsigned long *old_addr,
> + struct vm_area_struct *new, unsigned long *new_addr)
> +{
> + if ((*old_addr ^ *new_addr) & ~PMD_MASK)
> + return;
> +
> + try_to_align_start(len, old, old_addr, new, new_addr);
> + try_to_align_end(len, old, old_addr, new, new_addr);
> +}
> +#else
> +#define try_to_align_range(len,old,olda,new,newa) do { } while(0);
>  #endif
>
>  unsigned long move_page_tables(struct vm_area_struct *vma,
> @@ -272,6 +343,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>   old_addr, old_end);
>   mmu_notifier_invalidate_range_start(&range);
>
> + try_to_align_range(&len, vma, &old_addr, new_vma, &new_addr);
> +
>   for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
>   cond_resched();
>   next = (old_addr + PMD_SIZE) & PMD_MASK;
> --
> 2.27.0
>
> >
> >               Linus
>
> - Naresh
Naresh Kamboju July 16, 2020, 8:32 a.m. UTC | #16
On Thu, 16 Jul 2020 at 04:49, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, Jul 15, 2020 at 4:04 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > It *might* be as simple as this incremental thing on top
>
> No, it needs to be
>
> +       if (*old_addr + *len < old->vm_end)
> +               return;
>
> in try_to_align_end(), of course.

I have modified the patch with the above change.
Sorry the proposed patch [1] did not solve the reported problem.

Link to warning on i386,
https://lkft.validation.linaro.org/scheduler/job/1574295#L2055

Link to the patch which applied on top of Linus master and tested.
[1] https://pastebin.com/2gTxi3pj

I am happy to test the next version patch.

- Naresh
Kirill A. Shutemov July 16, 2020, 8:46 a.m. UTC | #17
On Thu, Jul 16, 2020 at 12:53:23PM +0530, Naresh Kamboju wrote:
> On Thu, 16 Jul 2020 at 12:07, Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
> >
> > On Thu, 16 Jul 2020 at 04:49, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > On Wed, Jul 15, 2020 at 4:04 PM Linus Torvalds
> > > <torvalds@linux-foundation.org> wrote:
> > > >
> > > > It *might* be as simple as this incremental thing on top
> > >
> > > No, it needs to be
> > >
> > > +       if (*old_addr + *len < old->vm_end)
> > > +               return;
> > >
> > > in try_to_align_end(), of course.
> > >
> > > Now I'm going for a lie-down, because this cross-eyed thing isn't working.
> >
> >
> > Just want to double check.
> > Here is the diff after those two patches applied. Please correct me if
> > it is wrong.
> > This patch applied on top of Linus mainline master branch.
> > I am starting my test cycles.
> 
> Sorry this patch (the below pasted ) did not solve the reported problem.

As Linus said, it does not trigger on the stack movement anymore and it is
not going to fix the issue, but may help to increase coverage of the
optimization.
Kirill A. Shutemov July 16, 2020, 1:16 p.m. UTC | #18
On Wed, Jul 15, 2020 at 04:18:44PM -0700, Linus Torvalds wrote:
> On Wed, Jul 15, 2020 at 4:04 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > It *might* be as simple as this incremental thing on top
> 
> No, it needs to be
> 
> +       if (*old_addr + *len < old->vm_end)
> +               return;
> 
> in try_to_align_end(), of course.

Okay, this should work, but I'm not convinced that it gives much: number
of cases covered by the optimization not going to be high.

It can also lead to performance regression: for small mremap() if only one
side of the range got aligned and there's no PMD_SIZE range to move,
kernel will still iterate over PTEs, but it would need to handle more
pte_none()s than without the patch.
Linus Torvalds July 16, 2020, 5:54 p.m. UTC | #19
On Thu, Jul 16, 2020 at 6:16 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> It can also lead to performance regression: for small mremap() if only one
> side of the range got aligned and there's no PMD_SIZE range to move,
> kernel will still iterate over PTEs, but it would need to handle more
> pte_none()s than without the patch.

Ack, I've dropped the patch from my queue of experiments, because it
doesn't work for the case I wanted to do, and the other cases could
regress, as you say.

Plus considering how many problems that patch had, I decided it wasn't
as simple as I initially thought it would be anyway ;)

Joel - while it's gone from my mind, if you're still interested in
this, maybe you can do something _similar_ that patch, except perhaps
also start out checking that the initial size is large enough for this
to make sense even when one of the sides doesn't align, for example.

(It might be as simple as checking that the initial 'len' is at least
PMD_SIZE - then you're guaranteed that whichever side gets aligned,
it's not doing extra work because the other side didn't).

               Linus
Joel Fernandes July 16, 2020, 6:47 p.m. UTC | #20
On Thu, Jul 16, 2020 at 1:55 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, Jul 16, 2020 at 6:16 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >
> > It can also lead to performance regression: for small mremap() if only one
> > side of the range got aligned and there's no PMD_SIZE range to move,
> > kernel will still iterate over PTEs, but it would need to handle more
> > pte_none()s than without the patch.
>
> Ack, I've dropped the patch from my queue of experiments, because it
> doesn't work for the case I wanted to do, and the other cases could
> regress, as you say.
>
> Plus considering how many problems that patch had, I decided it wasn't
> as simple as I initially thought it would be anyway ;)
>
> Joel - while it's gone from my mind, if you're still interested in
> this, maybe you can do something _similar_ that patch, except perhaps
> also start out checking that the initial size is large enough for this
> to make sense even when one of the sides doesn't align, for example.
>
> (It might be as simple as checking that the initial 'len' is at least
> PMD_SIZE - then you're guaranteed that whichever side gets aligned,
> it's not doing extra work because the other side didn't).

Hi Linus,
Yes I'm quite interested in doing a similar patch and also adding some
test cases to make sure it always works properly. Probably I can add
some kselftest cases doing mremap on various ranges.

I think it will be a great optimization to trigger the PMD move more
often. I'll work on it and get back hopefully soon on this, Thanks!

 - Joel

Patch
diff mbox series

diff --git a/mm/mremap.c b/mm/mremap.c
index 5dd572d57ca9..340a96a29cbb 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -245,6 +245,26 @@  unsigned long move_page_tables(struct vm_area_struct *vma,
 	unsigned long extent, next, old_end;
 	struct mmu_notifier_range range;
 	pmd_t *old_pmd, *new_pmd;
+	bool overlaps;
+
+	/*
+	 * shift_arg_pages() can call move_page_tables() on overlapping ranges.
+	 * In this case we cannot use move_normal_pmd() because destination pmd
+	 * might be established page table: move_ptes() doesn't free page
+	 * table.
+	 */
+	if (old_addr > new_addr) {
+		overlaps = old_addr - new_addr < len;
+	} else {
+		overlaps = new_addr - old_addr < len;
+
+		/*
+		 * We are interating over ranges forward. It means we cannot
+		 * handle overlapping ranges with new_addr > old_addr without
+		 * risking data corruption. Don't do this.
+		 */
+		WARN_ON(overlaps);
+	}
 
 	old_end = old_addr + len;
 	flush_cache_range(vma, old_addr, old_end);
@@ -282,7 +302,7 @@  unsigned long move_page_tables(struct vm_area_struct *vma,
 			split_huge_pmd(vma, old_pmd, old_addr);
 			if (pmd_trans_unstable(old_pmd))
 				continue;
-		} else if (extent == PMD_SIZE) {
+		} else if (!overlaps && extent == PMD_SIZE) {
 #ifdef CONFIG_HAVE_MOVE_PMD
 			/*
 			 * If the extent is PMD-sized, try to speed the move by