diff mbox series

[v3,1/6] mm/mremap: Optimize the start addresses in move_page_tables()

Message ID 20230524153239.3036507-2-joel@joelfernandes.org (mailing list archive)
State New
Series Optimize mremap during mutual alignment within PMD

Commit Message

Joel Fernandes May 24, 2023, 3:32 p.m. UTC
Recently, we have seen reports [1] of a warning triggered by
move_page_tables() doing a downward, overlapping move at a
mutually aligned offset within a PMD. By mutual alignment, I
mean that the source and destination addresses of the mremap are
at the same offset within a PMD.
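
To make "mutual alignment" concrete, here is a minimal userspace sketch.
It assumes a 2 MiB PMD (as on x86-64 with 4 KiB pages); the names
SKETCH_PMD_SIZE, SKETCH_PMD_MASK, offset_in_pmd() and mutually_aligned()
are hypothetical and are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 2 MiB PMD, as on x86-64 with 4 KiB pages. */
#define SKETCH_PMD_SIZE (1UL << 21)
#define SKETCH_PMD_MASK (~(SKETCH_PMD_SIZE - 1))

/* Offset of addr within the PMD-sized region that contains it. */
static unsigned long offset_in_pmd(unsigned long addr)
{
	return addr & ~SKETCH_PMD_MASK;
}

/* Source and destination are mutually aligned if those offsets match. */
static bool mutually_aligned(unsigned long old_addr, unsigned long new_addr)
{
	return offset_in_pmd(old_addr) == offset_in_pmd(new_addr);
}
```

For example, 0x40201000 and 0x40001000 both sit 0x1000 bytes into their
PMD, so a move between them is mutually aligned; 0x40201000 and
0x40002000 are not.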

This mutual alignment along with the fact that the move is downward is
sufficient to cause a warning related to having an allocated PMD that
does not have PTEs in it.

This warning will only trigger when there is mutual alignment in the
move operation. A solution, as suggested by Linus Torvalds [2], is to
start the copy at the PMD level whenever such alignment is present.
This not only prevents the warning but should also speed up the copy
whenever it can start at the PMD level.

Some more points:
a. The optimization can be done only when neither the source nor the
destination of the mremap has anything mapped below it up to a PMD
boundary. I add support to detect that.

b. Point (a) is not a problem for the call to move_page_tables() from
exec.c, as nothing is expected to be mapped below the source. However,
for non-overlapping mutually aligned moves as triggered by mremap(2), I
added support for checking such cases.

c. I currently only optimize for PMD moves. In the future I/we can build
on this work and do PUD moves as well if there is a need for this, but I
want to take it one step at a time.

d. We need to be careful about mremap of ranges within the VMA itself.
For this purpose, I added checks to determine whether the address after
alignment falls within the same VMA.

[1] https://lore.kernel.org/all/ZB2GTBD%2FLWTrkOiO@dhcp22.suse.cz/
[2] https://lore.kernel.org/all/CAHk-=whd7msp8reJPfeGNyt0LiySMT0egExx3TVZSX3Ok6X=9g@mail.gmail.com/

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 mm/mremap.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 63 insertions(+)

Comments

Linus Torvalds May 24, 2023, 11:23 p.m. UTC | #1
Hmm. I'm still quite unhappy about your can_align_down().

On Wed, May 24, 2023 at 8:32 AM Joel Fernandes (Google)
<joel@joelfernandes.org> wrote:
>
> +       /* If the masked address is within vma, we cannot align the address down. */
> +       if (vma->vm_start <= addr_masked)
> +               return false;

I don't think this test is right.

The test should not be "is the mapping still there at the point we
aligned down to".

No, the test should be whether there is any part of the mapping below
the point we're starting with:

        if (vma->vm_start < addr_to_align)
                return false;

because we can do the "expand the move down" *only* if it's the
beginning of the vma (because otherwise we'd be moving part of the vma
that precedes the address!)

(Alternatively, just make that "<" be "!=" - we're basically saying
that we can expand moving ptes to a pmd boundary *only* if this vma
starts at that point. No?).

> +       cur = find_vma_prev(vma->vm_mm, vma->vm_start, &prev);
> +       if (!cur || cur != vma || !prev)
> +               return false;

I've mentioned this test before, and I still find it actively misleading.

First off, the "!cur || cur != vma" test is clearly redundant. We know
'vma' isn't NULL (we just dereferenced it!). So "cur != vma" already
includes the "!cur" test.

So that "!cur" part of the test simply *cannot* be sensible.

And the "!prev" test still makes no sense to me. You tried to explain
it to me earlier, and I clearly didn't get it. It seems actively
wrong. I still think "!prev" should return true.

You seemed to think that "!prev" couldn't actually happen and would
be a sign of some VM problem, but that doesn't make any sense to me.
Of course !prev can happen - if "vma" is the first vma in the VM and
there is no previous.

It may be *rare*, but I still don't understand why you'd make that
"there is no vma below us" mean "we cannot expand the move below us
because there's something there".

So I continue to think that this test should just be

        if (WARN_ON_ONCE(cur != vma))
                return false;

because if it ever returns something that *isn't* the same as vma,
then we do indeed have serious problems. But that WARN_ON_ONCE() shows
that that's a "cannot happen" thing, not some kind of "if this happens
then don't do it" test.

and then the *real* test for "can we align down" should just be

        return !prev || prev->vm_end <= addr_masked;

Because while I think your code _works_, it really doesn't seem to
make much sense as it stands in your patch. The tests are actively
misleading. No?

                 Linus
Joel Fernandes May 25, 2023, 7:51 p.m. UTC | #2
Hi Linus,

On Wed, May 24, 2023 at 7:23 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Hmm. I'm still quite unhappy about your can_align_down().
>
> On Wed, May 24, 2023 at 8:32 AM Joel Fernandes (Google)
> <joel@joelfernandes.org> wrote:
> >
> > +       /* If the masked address is within vma, we cannot align the address down. */
> > +       if (vma->vm_start <= addr_masked)
> > +               return false;
>
> I don't think this test is right.
>
> The test should not be "is the mapping still there at the point we
> aligned down to".
>
> No, the test should be whether there is any part of the mapping below
> the point we're starting with:
>
>         if (vma->vm_start < addr_to_align)
>                 return false;
>
> because we can do the "expand the move down" *only* if it's the
> beginning of the vma (because otherwise we'd be moving part of the vma
> that precedes the address!)

You are right, I missed that. Funny I did think about this case you
mentioned. I will fix it in the next revision, thanks.

> (Alternatively, just make that "<" be "!=" - we're basically saying
> that we can expand moving ptes to a pmd boundary *only* if this vma
> starts at that point. No?).

Yes, I prefer the "!=" check. I will use that.

>
> > +       cur = find_vma_prev(vma->vm_mm, vma->vm_start, &prev);
> > +       if (!cur || cur != vma || !prev)
> > +               return false;
>
> I've mentioned this test before, and I still find it actively misleading.
>
> First off, the "!cur || cur != vma" test is clearly redundant. We know
> 'vma' isn't NULL (we just dereferenced it!). So "cur != vma" already
> includes the "!cur" test.
>
> So that "!cur" part of the test simply *cannot* be sensible.

Ok, I agree with you now.

> And the "!prev" test still makes no sense to me. You tried to explain
> it to me earlier, and I clearly didn't get it. It seems actively
> wrong. I still think "!prev" should return true.

Yes, ok. Sounds good.

> You seemed to think that "!prev" couldn't actually happen and would
> be a sign of some VM problem, but that doesn't make any sense to me.
> Of course !prev can happen - if "vma" is the first vma in the VM and
> there is no previous.
>
> It may be *rare*, but I still don't understand why you'd make that
> "there is no vma below us" mean "we cannot expand the move below us
> because there's something there".
>
> So I continue to think that this test should just be
>
>         if (WARN_ON_ONCE(cur != vma))
>                 return false;

I agree with this now.

>
> because if it ever returns something that *isn't* the same as vma,
> then we do indeed have serious problems. But that WARN_ON_ONCE() shows
> that that's a "cannot happen" thing, not some kind of "if this happens
> then don't do it" test.
>
> and then the *real* test for "can we align down" should just be
>
>         return !prev || prev->vm_end <= addr_masked;

Agreed, that's cleaner.

> Because while I think your code _works_, it really doesn't seem to
> make much sense as it stands in your patch. The tests are actively
> misleading. No?

True, your approach makes me want to write cleaner code rather than
being excessively paranoid. So thank you for that.

These patches have been tricky to get right so thank you for your
continued input and quick feedback.

I will add a test for the case you mentioned above, where the address
to realign is not at the VMA's beginning.

thanks,

- Joel
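
For reference, the checks agreed on in this exchange can be modeled in
userspace roughly as follows. This is only a sketch under stated
assumptions: struct vma_model and can_align_down_model() are
hypothetical names, the "prev" pointer stands in for the
find_vma_prev() lookup, and the WARN_ON_ONCE(cur != vma) sanity check
is omitted because the model performs no lookup. It is not the code of
any posted revision.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct vm_area_struct: only the fields used here. */
struct vma_model {
	unsigned long vm_start;
	unsigned long vm_end;           /* exclusive end address */
	const struct vma_model *prev;   /* VMA below this one, or NULL if none */
};

static bool can_align_down_model(const struct vma_model *vma,
				 unsigned long addr_to_align,
				 unsigned long mask)
{
	unsigned long addr_masked = addr_to_align & mask;

	/* Extend the move downward only if it starts at the VMA's beginning. */
	if (vma->vm_start != addr_to_align)
		return false;

	/* Fine if nothing is below, or the VMA below ends at or before the boundary. */
	return !vma->prev || vma->prev->vm_end <= addr_masked;
}
```

With a 2 MiB mask, a VMA starting at 0x40201000 with no VMA below it can
be aligned down to 0x40200000; the same VMA with a neighbor ending at
0x40201000 cannot, and neither can a move that starts in the middle of
the VMA.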

Patch

diff --git a/mm/mremap.c b/mm/mremap.c
index 411a85682b58..184d52f83b19 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -478,6 +478,53 @@  static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma,
 	return moved;
 }
 
+/*
+ * A helper to check if a previous mapping exists. Required for
+ * move_page_tables() and realign_addr() to determine if a previous mapping
+ * exists before we can do realignment optimizations.
+ */
+static bool can_align_down(struct vm_area_struct *vma, unsigned long addr_to_align,
+			       unsigned long mask)
+{
+	unsigned long addr_masked = addr_to_align & mask;
+	struct vm_area_struct *prev = NULL, *cur = NULL;
+
+	/* If the masked address is within vma, we cannot align the address down. */
+	if (vma->vm_start <= addr_masked)
+		return false;
+
+	/*
+	 * Attempt to find VMA before prev that contains the address.
+	 * On any issue finding prev, assume there is a mapping and return false
+	 * which will turn off any optimizations. Yes, we're conservative!
+	 * The mmap write lock is held here, so the lookup is safe.
+	 */
+	cur = find_vma_prev(vma->vm_mm, vma->vm_start, &prev);
+	if (!cur || cur != vma || !prev)
+		return false;
+
+	/* The masked address fell within some previous mapping. */
+	if (prev->vm_end > addr_masked)
+		return false;
+
+	return true;
+}
+
+/* Opportunistically realign to specified boundary for faster copy. */
+static void realign_addr(unsigned long *old_addr, struct vm_area_struct *old_vma,
+			 unsigned long *new_addr, struct vm_area_struct *new_vma,
+			 unsigned long mask)
+{
+	bool mutually_aligned = (*old_addr & ~mask) == (*new_addr & ~mask);
+
+	if ((*old_addr & ~mask) && mutually_aligned
+	    && can_align_down(old_vma, *old_addr, mask)
+	    && can_align_down(new_vma, *new_addr, mask)) {
+		*old_addr = *old_addr & mask;
+		*new_addr = *new_addr & mask;
+	}
+}
+
 unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len,
@@ -493,6 +540,15 @@  unsigned long move_page_tables(struct vm_area_struct *vma,
 
 	old_end = old_addr + len;
 
+	/*
+	 * If possible, realign addresses to PMD boundary for faster copy.
+	 * Don't align for intra-VMA moves as we may destroy existing mappings.
+	 */
+	if ((vma != new_vma)
+		&& (len >= PMD_SIZE - (old_addr & ~PMD_MASK))) {
+		realign_addr(&old_addr, vma, &new_addr, new_vma, PMD_MASK);
+	}
+
 	if (is_vm_hugetlb_page(vma))
 		return move_hugetlb_page_tables(vma, new_vma, old_addr,
 						new_addr, len);
@@ -565,6 +621,13 @@  unsigned long move_page_tables(struct vm_area_struct *vma,
 
 	mmu_notifier_invalidate_range_end(&range);
 
+	/*
+	 * Prevent negative return values when {old,new}_addr was realigned
+	 * but we broke out of the above loop for the first PMD itself.
+	 */
+	if (len + old_addr < old_end)
+		return 0;
+
 	return len + old_addr - old_end;	/* how much done */
 }
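
The final hunk guards against unsigned wraparound: old_end is computed
before realignment, so if the loop breaks out before advancing,
len + old_addr is smaller than old_end. The arithmetic can be checked
with a small userspace sketch; bytes_done() is a hypothetical helper
that mirrors only the return expression above, and the addresses in the
note below are made up for illustration.

```c
#include <assert.h>

/*
 * Models the tail of move_page_tables(): old_addr_now is where the loop
 * stopped, old_end is the end of the originally requested range, and len
 * is the originally requested length.
 */
static unsigned long bytes_done(unsigned long old_addr_now,
				unsigned long old_end,
				unsigned long len)
{
	/* Stopped below the original start: report nothing moved. */
	if (len + old_addr_now < old_end)
		return 0;

	return len + old_addr_now - old_end;
}
```

Take an original old_addr of 0x40201000 and len of 0x400000, so old_end
is 0x40601000, with old_addr realigned down to 0x40200000. Stopping
immediately yields 0 instead of a wrapped value, stopping after one PMD
(at 0x40400000) yields 0x1ff000, and running to old_end yields the full
0x400000.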