
[RFC,1/8] mm/madvise: propagate vma->vm_end changes

Message ID 20210926161259.238054-2-namit@vmware.com (mailing list archive)
State New
Series mm/madvise: support process_madvise(MADV_DONTNEED)

Commit Message

Nadav Amit Sept. 26, 2021, 4:12 p.m. UTC
From: Nadav Amit <namit@vmware.com>

The comment in madvise_dontneed_free() says that vma splits that occur
while the mmap-lock is dropped, during userfaultfd_remove(), should be
handled correctly, but nothing in the code indicates that it is so: prev
is invalidated, and do_madvise() will therefore continue to update VMAs
from the "obsolete" end (i.e., the one before the split).

Propagate the changes to end from madvise_dontneed_free() back to
do_madvise() and continue the updates from the new end accordingly.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Colin Cross <ccross@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Fixes: 70ccb92fdd90 ("userfaultfd: non-cooperative: userfaultfd_remove revalidate vma in MADV_DONTNEED")
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 mm/madvise.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)
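
For orientation, the key call-site change (taken from the do_madvise() hunk
below) is that the sub-range end is now passed by reference, so a clamp done
inside madvise_dontneed_free() after revalidation is visible when the loop
advances (the trailing comment is an annotation, not part of the patch):

-		error = madvise_vma(vma, &prev, start, tmp, behavior);
+		error = madvise_vma(vma, &prev, start, &tmp, behavior);
 		if (error)
 			goto out;
 		start = tmp;	/* may now reflect a clamp to the revalidated vma->vm_end */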

Comments

Kirill A. Shutemov Sept. 27, 2021, 9:08 a.m. UTC | #1
On Sun, Sep 26, 2021 at 09:12:52AM -0700, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> The comment in madvise_dontneed_free() says that vma splits that occur
> while the mmap-lock is dropped, during userfaultfd_remove(), should be
> handled correctly, but nothing in the code indicates that it is so: prev
> is invalidated, and do_madvise() will therefore continue to update VMAs
> from the "obsolete" end (i.e., the one before the split).
> 
> Propagate the changes to end from madvise_dontneed_free() back to
> do_madvise() and continue the updates from the new end accordingly.

Could you describe in detail a race that would lead to wrong behaviour?

If the mmap lock was dropped, any change to the VMA layout can appear. We can
have a totally unrelated VMA there.

Either way, if userspace changes the VMA layout for a region that is under
madvise(MADV_DONTNEED), it is totally broken. I don't see a valid reason to
do this.

The current behaviour looks reasonable to me. Yes, we can miss VMAs, but
these VMAs can also be created just after madvise() is finished.
Nadav Amit Sept. 27, 2021, 10:11 a.m. UTC | #2
> On Sep 27, 2021, at 2:08 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> On Sun, Sep 26, 2021 at 09:12:52AM -0700, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> The comment in madvise_dontneed_free() says that vma splits that occur
>> while the mmap-lock is dropped, during userfaultfd_remove(), should be
>> handled correctly, but nothing in the code indicates that it is so: prev
>> is invalidated, and do_madvise() will therefore continue to update VMAs
>> from the "obsolete" end (i.e., the one before the split).
>> 
>> Propagate the changes to end from madvise_dontneed_free() back to
>> do_madvise() and continue the updates from the new end accordingly.
> 
> Could you describe in detail a race that would lead to wrong behaviour?

Thanks for the quick response.

For instance, madvise(MADV_DONTNEED) can race with mprotect() and cause
the VMA to split.

Something like:

  CPU0				CPU1
  ----				----
  madvise(0x10000, 0x2000, MADV_DONTNEED)
  -> userfaultfd_remove()
   [ mmap-lock dropped ]
				mprotect(0x11000, 0x1000, PROT_READ)
				[splitting the VMA]

				read(uffd)
				[unblocking userfaultfd_remove()]

   [ resuming ]
   end = vma->vm_end
   [end == 0x11000]

   madvise_dontneed_single_vma(vma, 0x10000, 0x11000)

  Following this operation, 0x11000-0x12000 would not be zapped.
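
In userspace terms, this interleaving could be driven by something like the
following untested sketch (not part of the patch; the uffd setup details, the
0xaa marker and the helper names are illustrative). It assumes a kernel with
UFFD_FEATURE_EVENT_REMOVE, permission to create a userfaultfd, and is built
with gcc -pthread. With the current code the second page should keep its
marker; with the fix it should read back as zero:

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

static long page;
static int uffd;
static char *p;

static void *monitor(void *arg)
{
	struct uffd_msg msg;

	sleep(1);				/* let madvise() block in userfaultfd_remove() */
	mprotect(p + page, page, PROT_READ);	/* split the VMA in the middle */

	/*
	 * Drain events: a fixed kernel re-walks the split range and queues a
	 * second UFFD_EVENT_REMOVE, a buggy one queues only the first.
	 */
	while (read(uffd, &msg, sizeof(msg)) > 0)
		;
	return NULL;
}

int main(void)
{
	struct uffdio_api api = { .api = UFFD_API,
				  .features = UFFD_FEATURE_EVENT_REMOVE };
	struct uffdio_register reg;
	struct uffdio_range unreg;
	pthread_t t;

	page = sysconf(_SC_PAGESIZE);
	p = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	memset(p, 0xaa, 2 * page);		/* marker to detect which pages get zapped */

	uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
	ioctl(uffd, UFFDIO_API, &api);

	reg.range.start = (unsigned long)p;
	reg.range.len = 2 * page;
	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	pthread_create(&t, NULL, monitor, NULL);

	/*
	 * Blocks in userfaultfd_remove() until the monitor reads the event;
	 * meanwhile the monitor splits the VMA with mprotect().
	 */
	madvise(p, 2 * page, MADV_DONTNEED);

	/* Unregister so that touching the pages below cannot block on uffd. */
	unreg.start = (unsigned long)p;
	unreg.len = 2 * page;
	ioctl(uffd, UFFDIO_UNREGISTER, &unreg);

	printf("second page %s zapped\n",
	       p[page] == (char)0xaa ? "was NOT" : "was");
	return 0;		/* process exit also tears down the monitor thread */
}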


> If the mmap lock was dropped, any change to the VMA layout can appear. We can
> have a totally unrelated VMA there.

Yes, but we are not talking about completely unrelated VMAs. If
userspace registered a region to be monitored using userfaultfd,
it expects this region to be handled like any other region. This is
a change of behavior that only affects regions with uffd.

The comment in the code explicitly says that this scenario should be
handled:

                        /*
                         * Don't fail if end > vma->vm_end. If the old
                         * vma was split while the mmap_lock was
                         * released the effect of the concurrent
                         * operation may not cause madvise() to
                         * have an undefined result. There may be an
                         * adjacent next vma that we'll walk
                         * next. userfaultfd_remove() will generate an
                         * UFFD_EVENT_REMOVE repetition on the
                         * end-vma->vm_end range, but the manager can
                         * handle a repetition fine.
                         */

Unless I am missing something, this does not happen in the current
code.

> 
> Either way, if userspace changes the VMA layout for a region that is under
> madvise(MADV_DONTNEED), it is totally broken. I don't see a valid reason to
> do this.
> 
> The current behaviour looks reasonable to me. Yes, we can miss VMAs, but
> these VMAs can also be created just after madvise() is finished.

Again, we are not talking about newly created VMAs.

Alternatively, this comment should be removed and perhaps the
documentation should be updated.
Kirill A. Shutemov Sept. 27, 2021, 11:55 a.m. UTC | #3
On Mon, Sep 27, 2021 at 03:11:20AM -0700, Nadav Amit wrote:
> 
> > On Sep 27, 2021, at 2:08 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > 
> > On Sun, Sep 26, 2021 at 09:12:52AM -0700, Nadav Amit wrote:
> >> From: Nadav Amit <namit@vmware.com>
> >> 
> >> The comment in madvise_dontneed_free() says that vma splits that occur
> >> while the mmap-lock is dropped, during userfaultfd_remove(), should be
> >> handled correctly, but nothing in the code indicates that it is so: prev
> >> is invalidated, and do_madvise() will therefore continue to update VMAs
> >> from the "obsolete" end (i.e., the one before the split).
> >> 
> >> Propagate the changes to end from madvise_dontneed_free() back to
> >> do_madvise() and continue the updates from the new end accordingly.
> > 
> > Could you describe in detail a race that would lead to wrong behaviour?
> 
> Thanks for the quick response.
> 
> For instance, madvise(MADV_DONTNEED) can race with mprotect() and cause
> the VMA to split.
> 
> Something like:
> 
>   CPU0				CPU1
>   ----				----
>   madvise(0x10000, 0x2000, MADV_DONTNEED)
>   -> userfaultfd_remove()
>    [ mmap-lock dropped ]
> 				mprotect(0x11000, 0x1000, PROT_READ)
> 				[splitting the VMA]
> 
> 				read(uffd)
> 				[unblocking userfaultfd_remove()]
> 
>    [ resuming ]
>    end = vma->vm_end
>    [end == 0x11000]
> 
>    madvise_dontneed_single_vma(vma, 0x10000, 0x11000)
> 
>   Following this operation, 0x11000-0x12000 would not be zapped.

Okay, fair enough.

Wouldn't something like this work too:

diff --git a/mm/madvise.c b/mm/madvise.c
index 0734db8d53a7..0898120c5c04 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -796,6 +796,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 			 */
 			return -ENOMEM;
 		}
+		*prev = vma;
 		if (!can_madv_lru_vma(vma))
 			return -EINVAL;
 		if (end > vma->vm_end) {
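
For context, the suggested line would sit in the revalidation path that runs
after userfaultfd_remove() has dropped and re-taken the mmap lock, roughly
(paraphrased from mm/madvise.c of that period, with the new line marked):

	if (!userfaultfd_remove(vma, start, end)) {
		*prev = NULL;	/* mmap_lock has been dropped, prev is stale */

		mmap_read_lock(mm);
		vma = find_vma(mm, start);
		if (!vma)
			return -ENOMEM;
		if (start < vma->vm_start) {
			/* a hole materialized in the passed range */
			return -ENOMEM;
		}
		*prev = vma;	/* suggested: re-point prev at the revalidated vma */
		if (!can_madv_lru_vma(vma))
			return -EINVAL;
		...
	}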
Nadav Amit Sept. 27, 2021, 12:33 p.m. UTC | #4
> On Sep 27, 2021, at 4:55 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> On Mon, Sep 27, 2021 at 03:11:20AM -0700, Nadav Amit wrote:
>> 
>>> On Sep 27, 2021, at 2:08 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>>> 
>>> On Sun, Sep 26, 2021 at 09:12:52AM -0700, Nadav Amit wrote:
>>>> From: Nadav Amit <namit@vmware.com>
>>>> 
>>>> The comment in madvise_dontneed_free() says that vma splits that occur
>>>> while the mmap-lock is dropped, during userfaultfd_remove(), should be
>>>> handled correctly, but nothing in the code indicates that it is so: prev
>>>> is invalidated, and do_madvise() will therefore continue to update VMAs
>>>> from the "obsolete" end (i.e., the one before the split).
>>>> 
>>>> Propagate the changes to end from madvise_dontneed_free() back to
>>>> do_madvise() and continue the updates from the new end accordingly.
>>> 
>>> Could you describe in detail a race that would lead to wrong behaviour?
>> 
>> Thanks for the quick response.
>> 
>> For instance, madvise(MADV_DONTNEED) can race with mprotect() and cause
>> the VMA to split.
>> 
>> Something like:
>> 
>>  CPU0				CPU1
>>  ----				----
>>  madvise(0x10000, 0x2000, MADV_DONTNEED)
>>  -> userfaultfd_remove()
>>   [ mmap-lock dropped ]
>> 				mprotect(0x11000, 0x1000, PROT_READ)
>> 				[splitting the VMA]
>> 
>> 				read(uffd)
>> 				[unblocking userfaultfd_remove()]
>> 
>>   [ resuming ]
>>   end = vma->vm_end
>>   [end == 0x11000]
>> 
>>   madvise_dontneed_single_vma(vma, 0x10000, 0x11000)
>> 
>>  Following this operation, 0x11000-0x12000 would not be zapped.
> 
> Okay, fair enough.
> 
> Wouldn't something like this work too:
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 0734db8d53a7..0898120c5c04 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -796,6 +796,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> 			 */
> 			return -ENOMEM;
> 		}
> +		*prev = vma;
> 		if (!can_madv_lru_vma(vma))
> 			return -EINVAL;
> 		if (end > vma->vm_end) {

Admittedly (embarrassingly?) I didn’t even consider it since all the
comments say that once the lock is dropped prev should be invalidated.

Let’s see, considering the aforementioned scenario and that there is
initially one VMA between 0x10000-0x12000.

Looking at the code from do_madvise():

[ end == 0x12000 ]

                tmp = vma->vm_end;

[ tmp == 0x12000 ]

                if (end < tmp)
                        tmp = end;

                /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */

                error = madvise_vma(vma, &prev, start, tmp, behavior);

[ prev->vm_end == 0x11000 after the split]

                if (error)
                        goto out;
                start = tmp;

[ start == 0x12000 ]

                if (prev && start < prev->vm_end)
                        start = prev->vm_end;

[ The condition (start < prev->vm_end) is false, start not updated ]

                error = unmapped_error;
                if (start >= end)
                        goto out;

[ start >= end; so we end without updating the second part of the split ]

So it does not work.

Perhaps adding this one on top of yours? I can test it when I wake up.
It is cleaner, but I am not sure if I am missing something.

diff --git a/mm/madvise.c b/mm/madvise.c
index 0734db8d53a7..548fc106e70b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1203,7 +1203,7 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
                if (error)
                        goto out;
                start = tmp;
-               if (prev && start < prev->vm_end)
+               if (prev)
                        start = prev->vm_end;
                error = unmapped_error;
                if (start >= end)
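
For reference, re-running the 0x10000-0x12000 scenario with both changes
applied (the *prev = vma line plus the hunk above) would, as far as I can
tell, advance like this (untested sketch against the same do_madvise() loop):

                error = madvise_vma(vma, &prev, start, tmp, behavior);

[ prev now points at the revalidated first half; prev->vm_end == 0x11000 ]

                if (error)
                        goto out;
                start = tmp;

[ start == 0x12000 ]

                if (prev)
                        start = prev->vm_end;

[ start == 0x11000 ]

                error = unmapped_error;
                if (start >= end)
                        goto out;

[ start < end == 0x12000, so the loop continues ]

                if (prev)
                        vma = prev->vm_next;

[ vma is now the second half, 0x11000-0x12000, and is zapped on the next pass ]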
Kirill A. Shutemov Sept. 27, 2021, 12:45 p.m. UTC | #5
On Mon, Sep 27, 2021 at 05:33:39AM -0700, Nadav Amit wrote:
> 
> 
> > On Sep 27, 2021, at 4:55 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > 
> > On Mon, Sep 27, 2021 at 03:11:20AM -0700, Nadav Amit wrote:
> >> 
> >>> On Sep 27, 2021, at 2:08 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >>> 
> >>> On Sun, Sep 26, 2021 at 09:12:52AM -0700, Nadav Amit wrote:
> >>>> From: Nadav Amit <namit@vmware.com>
> >>>> 
> >>>> The comment in madvise_dontneed_free() says that vma splits that occur
> >>>> while the mmap-lock is dropped, during userfaultfd_remove(), should be
> >>>> handled correctly, but nothing in the code indicates that it is so: prev
> >>>> is invalidated, and do_madvise() will therefore continue to update VMAs
> >>>> from the "obsolete" end (i.e., the one before the split).
> >>>> 
> >>>> Propagate the changes to end from madvise_dontneed_free() back to
> >>>> do_madvise() and continue the updates from the new end accordingly.
> >>> 
> >>> Could you describe in detail a race that would lead to wrong behaviour?
> >> 
> >> Thanks for the quick response.
> >> 
> >> For instance, madvise(MADV_DONTNEED) can race with mprotect() and cause
> >> the VMA to split.
> >> 
> >> Something like:
> >> 
> >>  CPU0				CPU1
> >>  ----				----
> >>  madvise(0x10000, 0x2000, MADV_DONTNEED)
> >>  -> userfaultfd_remove()
> >>   [ mmap-lock dropped ]
> >> 				mprotect(0x11000, 0x1000, PROT_READ)
> >> 				[splitting the VMA]
> >> 
> >> 				read(uffd)
> >> 				[unblocking userfaultfd_remove()]
> >> 
> >>   [ resuming ]
> >>   end = vma->vm_end
> >>   [end == 0x11000]
> >> 
> >>   madvise_dontneed_single_vma(vma, 0x10000, 0x11000)
> >> 
> >>  Following this operation, 0x11000-0x12000 would not be zapped.
> > 
> > Okay, fair enough.
> > 
> > Wouldn't something like this work too:
> > 
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 0734db8d53a7..0898120c5c04 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -796,6 +796,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> > 			 */
> > 			return -ENOMEM;
> > 		}
> > +		*prev = vma;
> > 		if (!can_madv_lru_vma(vma))
> > 			return -EINVAL;
> > 		if (end > vma->vm_end) {
> 
> Admittedly (embarrassingly?) I didn’t even consider it since all the
> comments say that once the lock is dropped prev should be invalidated.
> 
> Let’s see, considering the aforementioned scenario and that there is
> initially one VMA between 0x10000-0x12000.
> 
> Looking at the code from do_madvise():
> 
> [ end == 0x12000 ]
> 
>                 tmp = vma->vm_end;
> 
> [ tmp == 0x12000 ]
> 
>                 if (end < tmp)
>                         tmp = end;
> 
>                 /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> 
>                 error = madvise_vma(vma, &prev, start, tmp, behavior);
> 
> [ prev->vm_end == 0x11000 after the split]
> 
>                 if (error)
>                         goto out;
>                 start = tmp;
> 
> [ start == 0x12000 ]
> 
>                 if (prev && start < prev->vm_end)
>                         start = prev->vm_end;
> 
> [ The condition (start < prev->vm_end) is false, start not updated ]
> 
>                 error = unmapped_error;
>                 if (start >= end)
>                         goto out;
> 
> [ start >= end; so we end without updating the second part of the split ]
> 
> So it does not work.
> 
> Perhaps adding this one on top of yours? I can test it when I wake up.
> It is cleaner, but I am not sure if I am missing something.

It should work.

BTW, shouldn't we bring madvise_willneed() and madvise_remove() to the
same scheme?
Nadav Amit Sept. 27, 2021, 12:59 p.m. UTC | #6
> On Sep 27, 2021, at 5:45 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> On Mon, Sep 27, 2021 at 05:33:39AM -0700, Nadav Amit wrote:
>> 
>> 
>>> On Sep 27, 2021, at 4:55 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>>> 
>>> On Mon, Sep 27, 2021 at 03:11:20AM -0700, Nadav Amit wrote:
>>>> 
>>>>> On Sep 27, 2021, at 2:08 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>>>>> 
>>>>> On Sun, Sep 26, 2021 at 09:12:52AM -0700, Nadav Amit wrote:
>>>>>> From: Nadav Amit <namit@vmware.com>
>>>>>> 
>>>>>> The comment in madvise_dontneed_free() says that vma splits that occur
>>>>>> while the mmap-lock is dropped, during userfaultfd_remove(), should be
>>>>>> handled correctly, but nothing in the code indicates that it is so: prev
>>>>>> is invalidated, and do_madvise() will therefore continue to update VMAs
>>>>>> from the "obsolete" end (i.e., the one before the split).
>>>>>> 
>> 

[snip]


>> Perhaps adding this one on top of yours? I can test it when I wake up.
>> It is cleaner, but I am not sure if I am missing something.
> 
> It should work.
> 
> BTW, shouldn't we bring madvise_willneed() and madvise_remove() to the
> same scheme?

You are right, if only for the sake of consistency. My only problem is that
I am afraid to backport such a change. For MADV_DONTNEED, I saw an explicit
assumption in the code. I can do it all in one patch if we agree that none of
it goes into stable (which I clumsily forgot to cc, but which might still find
the patch and backport it).

Patch

diff --git a/mm/madvise.c b/mm/madvise.c
index 0734db8d53a7..a2b05352ebfe 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -768,10 +768,11 @@  static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
 
 static long madvise_dontneed_free(struct vm_area_struct *vma,
 				  struct vm_area_struct **prev,
-				  unsigned long start, unsigned long end,
+				  unsigned long start, unsigned long *pend,
 				  int behavior)
 {
 	struct mm_struct *mm = vma->vm_mm;
+	unsigned long end = *pend;
 
 	*prev = vma;
 	if (!can_madv_lru_vma(vma))
@@ -811,6 +812,7 @@  static long madvise_dontneed_free(struct vm_area_struct *vma,
 			 * end-vma->vm_end range, but the manager can
 			 * handle a repetition fine.
 			 */
 			end = vma->vm_end;
+			*pend = end;
 		}
 		VM_WARN_ON(start >= end);
@@ -980,8 +982,10 @@  static int madvise_inject_error(int behavior,
 
 static long
 madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
-		unsigned long start, unsigned long end, int behavior)
+		unsigned long start, unsigned long *pend, int behavior)
 {
+	unsigned long end = *pend;
+
 	switch (behavior) {
 	case MADV_REMOVE:
 		return madvise_remove(vma, prev, start, end);
@@ -993,7 +997,7 @@  madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		return madvise_pageout(vma, prev, start, end);
 	case MADV_FREE:
 	case MADV_DONTNEED:
-		return madvise_dontneed_free(vma, prev, start, end, behavior);
+		return madvise_dontneed_free(vma, prev, start, pend, behavior);
 	case MADV_POPULATE_READ:
 	case MADV_POPULATE_WRITE:
 		return madvise_populate(vma, prev, start, end, behavior);
@@ -1199,7 +1203,7 @@  int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
 			tmp = end;
 
 		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
-		error = madvise_vma(vma, &prev, start, tmp, behavior);
+		error = madvise_vma(vma, &prev, start, &tmp, behavior);
 		if (error)
 			goto out;
 		start = tmp;