tools/testing/selftests/vm/mlock2-tests: fix mlock2 false-negative errors

Message ID 20200322013525.1095493-1-aquini@redhat.com
State New
Series
  • tools/testing/selftests/vm/mlock2-tests: fix mlock2 false-negative errors

Commit Message

Rafael Aquini March 22, 2020, 1:35 a.m. UTC
Changes from commit 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs")
break this test's expectation that the mlock syscall family immediately
inserts recently faulted pages into the UNEVICTABLE_LRU when MCL_ONFAULT is
passed to the syscall as part of its flag set.

There is no functional error introduced by the aforementioned commit,
but it opens up a time window where the recently faulted and locked pages
might not yet have been moved to the UNEVICTABLE_LRU, thus causing a
subsequent and immediate PFN flag check for the UNEVICTABLE bit
to trip on false-negative errors, as happens with this test.

This patch fixes the false negatives by forcing a code path that
calls a CPU pagevec drain right after the fault but before the PFN flag
check takes place, sorting out the race that way.

Fixes: 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs")
Signed-off-by: Rafael Aquini <aquini@redhat.com>
---
 tools/testing/selftests/vm/mlock2-tests.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

Comments

Andrew Morton March 22, 2020, 1:43 a.m. UTC | #1
On Sat, 21 Mar 2020 21:35:25 -0400 Rafael Aquini <aquini@redhat.com> wrote:

> Changes for commit 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs")
> break this test expectations on the behavior of mlock syscall family immediately
> inserting the recently faulted pages into the UNEVICTABLE_LRU, when MCL_ONFAULT is
> passed to the syscall as part of its flag-set.
> 
> There is no functional error introduced by the aforementioned commit,
> but it opens up a time window where the recently faulted and locked pages
> might yet not be put back into the UNEVICTABLE_LRU, thus causing a
> subsequent and immediate PFN flag check for the UNEVICTABLE bit
> to trip on false-negative errors, as it happens with this test.
> 
> This patch fix the false negative by forcefully resorting to a code path that
> will call a CPU pagevec drain right after the fault but before the PFN flag
> check takes place, sorting out the race that way.
> 
>  
> +/*
> + * After commit 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs")
> + * changes made by calls to mlock* family might not be immediately reflected
> + * on the LRUs, thus checking the PFN flags might race against pagevec drain.
> + *
> + * In order to sort out that race, and get the after fault checks consistent,
> + * the "quick and dirty" trick below is required in order to force a call to
> + * lru_add_drain_all() to get the recently MLOCK_ONFAULT pages moved to
> + * the unevictable LRU, as expected by the checks in this selftest.
> + */
> +static void force_lru_add_drain_all(void)
> +{
> +	sched_yield();
> +	system("echo 1 > /proc/sys/vm/compact_memory");
> +}

What is the sched_yield() for?

>  static int onfault_check(char *map)
>  {
>  	unsigned long page_size = getpagesize();
> @@ -343,6 +360,9 @@ static int onfault_check(char *map)
>  	}
>  
>  	*map = 'a';
> +
> +	force_lru_add_drain_all();
> +
>  	page1_flags = get_pageflags((unsigned long)map);
>  	page2_flags = get_pageflags((unsigned long)map + page_size);
>  
> @@ -465,6 +485,8 @@ static int test_lock_onfault_of_present()
>  		goto unmap;
>  	}
>  
> +	force_lru_add_drain_all();
> +
>  	page1_flags = get_pageflags((unsigned long)map);
>  	page2_flags = get_pageflags((unsigned long)map + page_size);
>  	page1_flags = get_kpageflags(page1_flags & PFN_MASK);
Rafael Aquini March 22, 2020, 2:03 a.m. UTC | #2
On Sat, Mar 21, 2020 at 06:43:52PM -0700, Andrew Morton wrote:
> On Sat, 21 Mar 2020 21:35:25 -0400 Rafael Aquini <aquini@redhat.com> wrote:
> 
> > Changes for commit 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs")
> > break this test expectations on the behavior of mlock syscall family immediately
> > inserting the recently faulted pages into the UNEVICTABLE_LRU, when MCL_ONFAULT is
> > passed to the syscall as part of its flag-set.
> > 
> > There is no functional error introduced by the aforementioned commit,
> > but it opens up a time window where the recently faulted and locked pages
> > might yet not be put back into the UNEVICTABLE_LRU, thus causing a
> > subsequent and immediate PFN flag check for the UNEVICTABLE bit
> > to trip on false-negative errors, as it happens with this test.
> > 
> > This patch fix the false negative by forcefully resorting to a code path that
> > will call a CPU pagevec drain right after the fault but before the PFN flag
> > check takes place, sorting out the race that way.
> > 
> >  
> > +/*
> > + * After commit 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs")
> > + * changes made by calls to mlock* family might not be immediately reflected
> > + * on the LRUs, thus checking the PFN flags might race against pagevec drain.
> > + *
> > + * In order to sort out that race, and get the after fault checks consistent,
> > + * the "quick and dirty" trick below is required in order to force a call to
> > + * lru_add_drain_all() to get the recently MLOCK_ONFAULT pages moved to
> > + * the unevictable LRU, as expected by the checks in this selftest.
> > + */
> > +static void force_lru_add_drain_all(void)
> > +{
> > +	sched_yield();
> > +	system("echo 1 > /proc/sys/vm/compact_memory");
> > +}
> 
> What is the sched_yield() for?
>

Mostly it's there to provide a sleeping gap after the fault, without 
actually adding an arbitrary delay with usleep(). 

It's not a hard requirement, but in some of the tests I performed 
(without that sleeping gap) I would still see around a 1% chance 
of hitting the false negative. After adding it, I could not hit
the issue anymore.

 
> >  static int onfault_check(char *map)
> >  {
> >  	unsigned long page_size = getpagesize();
> > @@ -343,6 +360,9 @@ static int onfault_check(char *map)
> >  	}
> >  
> >  	*map = 'a';
> > +
> > +	force_lru_add_drain_all();
> > +
> >  	page1_flags = get_pageflags((unsigned long)map);
> >  	page2_flags = get_pageflags((unsigned long)map + page_size);
> >  
> > @@ -465,6 +485,8 @@ static int test_lock_onfault_of_present()
> >  		goto unmap;
> >  	}
> >  
> > +	force_lru_add_drain_all();
> > +
> >  	page1_flags = get_pageflags((unsigned long)map);
> >  	page2_flags = get_pageflags((unsigned long)map + page_size);
> >  	page1_flags = get_kpageflags(page1_flags & PFN_MASK);
>
Andrew Morton March 22, 2020, 4:31 a.m. UTC | #3
On Sat, 21 Mar 2020 22:03:26 -0400 Rafael Aquini <aquini@redhat.com> wrote:

> > > + * In order to sort out that race, and get the after fault checks consistent,
> > > + * the "quick and dirty" trick below is required in order to force a call to
> > > + * lru_add_drain_all() to get the recently MLOCK_ONFAULT pages moved to
> > > + * the unevictable LRU, as expected by the checks in this selftest.
> > > + */
> > > +static void force_lru_add_drain_all(void)
> > > +{
> > > +	sched_yield();
> > > +	system("echo 1 > /proc/sys/vm/compact_memory");
> > > +}
> > 
> > What is the sched_yield() for?
> >
> 
> Mostly it's there to provide a sleeping gap after the fault, whithout 
> actually adding an arbitrary value with usleep(). 
> 
> It's not a hard requirement, but, in some of the tests I performed 
> (whithout that sleeping gap) I would still see around 1% chance 
> of hitting the false-negative. After adding it I could not hit
> the issue anymore.

It's concerning that such deep machinery as pagevec draining is visible
to userspace.

I suppose that for consistency and correctness we should perform a
drain prior to each read from /proc/*/pagemap.  Presumably this would
be far too expensive.

Is there any other way?  One such might be to make the MLOCK_ONFAULT
pages bypass the lru_add_pvecs?
Rafael Aquini March 22, 2020, 5:41 a.m. UTC | #4
On Sat, Mar 21, 2020 at 09:31:42PM -0700, Andrew Morton wrote:
> On Sat, 21 Mar 2020 22:03:26 -0400 Rafael Aquini <aquini@redhat.com> wrote:
> 
> > > > + * In order to sort out that race, and get the after fault checks consistent,
> > > > + * the "quick and dirty" trick below is required in order to force a call to
> > > > + * lru_add_drain_all() to get the recently MLOCK_ONFAULT pages moved to
> > > > + * the unevictable LRU, as expected by the checks in this selftest.
> > > > + */
> > > > +static void force_lru_add_drain_all(void)
> > > > +{
> > > > +	sched_yield();
> > > > +	system("echo 1 > /proc/sys/vm/compact_memory");
> > > > +}
> > > 
> > > What is the sched_yield() for?
> > >
> > 
> > Mostly it's there to provide a sleeping gap after the fault, whithout 
> > actually adding an arbitrary value with usleep(). 
> > 
> > It's not a hard requirement, but, in some of the tests I performed 
> > (whithout that sleeping gap) I would still see around 1% chance 
> > of hitting the false-negative. After adding it I could not hit
> > the issue anymore.
> 
> It's concerning that such deep machinery as pagevec draining is visible
> to userspace.
> 
> I suppose that for consistency and correctness we should perform a
> drain prior to each read from /proc/*/pagemap.  Presumably this would
> be far too expensive.
> 
> Is there any other way?  One such might be to make the MLOCK_ONFAULT
> pages bypass the lru_add_pvecs?
>

Well,

I admit I wasn't taking the approach of changing the kernel because I was 
thinking it would require a partial, or even full, revert of commit 
9c4e6b1a7027f, and that would increase complexity; but on 
second thought, it seems that we might just be missing:

diff --git a/mm/swap.c b/mm/swap.c
index cf39d24ada2a..b1601228ded4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -473,6 +473,7 @@ void lru_cache_add_active_or_unevictable(struct page *page,
                __mod_zone_page_state(page_zone(page), NR_MLOCK,
                                    hpage_nr_pages(page));
                count_vm_event(UNEVICTABLE_PGMLOCKED);
+               SetPageUnevictable(page);
        }
        lru_cache_add(page);
 }


I'll take a closer look into it, as well as test it properly, tomorrow. 
Thanks for the heads up, Andrew.
-- Rafael
Shakeel Butt March 22, 2020, 4:31 p.m. UTC | #5
On Sat, Mar 21, 2020 at 6:35 PM Rafael Aquini <aquini@redhat.com> wrote:
>
> Changes for commit 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs")
> break this test expectations on the behavior of mlock syscall family immediately
> inserting the recently faulted pages into the UNEVICTABLE_LRU, when MCL_ONFAULT is
> passed to the syscall as part of its flag-set.

mlock* syscalls do not provide any guarantee that the pages will be in
unevictable LRU, only that the pages will not be paged-out. The test
is checking something very internal to the kernel and this is expected
to break.

>
> There is no functional error introduced by the aforementioned commit,
> but it opens up a time window where the recently faulted and locked pages
> might yet not be put back into the UNEVICTABLE_LRU, thus causing a
> subsequent and immediate PFN flag check for the UNEVICTABLE bit
> to trip on false-negative errors, as it happens with this test.
>
> This patch fix the false negative by forcefully resorting to a code path that
> will call a CPU pagevec drain right after the fault but before the PFN flag
> check takes place, sorting out the race that way.
>
> Fixes: 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs")

This fixes the test itself, not the mentioned patch, so this
Fixes: line is not needed.
Shakeel Butt March 22, 2020, 4:36 p.m. UTC | #6
On Sat, Mar 21, 2020 at 9:31 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Sat, 21 Mar 2020 22:03:26 -0400 Rafael Aquini <aquini@redhat.com> wrote:
>
> > > > + * In order to sort out that race, and get the after fault checks consistent,
> > > > + * the "quick and dirty" trick below is required in order to force a call to
> > > > + * lru_add_drain_all() to get the recently MLOCK_ONFAULT pages moved to
> > > > + * the unevictable LRU, as expected by the checks in this selftest.
> > > > + */
> > > > +static void force_lru_add_drain_all(void)
> > > > +{
> > > > + sched_yield();
> > > > + system("echo 1 > /proc/sys/vm/compact_memory");
> > > > +}
> > >
> > > What is the sched_yield() for?
> > >
> >
> > Mostly it's there to provide a sleeping gap after the fault, whithout
> > actually adding an arbitrary value with usleep().
> >
> > It's not a hard requirement, but, in some of the tests I performed
> > (whithout that sleeping gap) I would still see around 1% chance
> > of hitting the false-negative. After adding it I could not hit
> > the issue anymore.
>
> It's concerning that such deep machinery as pagevec draining is visible
> to userspace.
>

We already have other examples like memcg stats where the
optimizations like batching per-cpu stats collection exposes
differences to the userspace. I would not be that worried here.

> I suppose that for consistency and correctness we should perform a
> drain prior to each read from /proc/*/pagemap.  Presumably this would
> be far too expensive.
>
> Is there any other way?  One such might be to make the MLOCK_ONFAULT
> pages bypass the lru_add_pvecs?
>

I would rather prefer to have something similar to
/proc/sys/vm/stat_refresh which drains the pagevecs.
Shakeel Butt March 22, 2020, 4:40 p.m. UTC | #7
On Sat, Mar 21, 2020 at 10:41 PM Rafael Aquini <aquini@redhat.com> wrote:
>
> On Sat, Mar 21, 2020 at 09:31:42PM -0700, Andrew Morton wrote:
> > On Sat, 21 Mar 2020 22:03:26 -0400 Rafael Aquini <aquini@redhat.com> wrote:
> >
> > > > > + * In order to sort out that race, and get the after fault checks consistent,
> > > > > + * the "quick and dirty" trick below is required in order to force a call to
> > > > > + * lru_add_drain_all() to get the recently MLOCK_ONFAULT pages moved to
> > > > > + * the unevictable LRU, as expected by the checks in this selftest.
> > > > > + */
> > > > > +static void force_lru_add_drain_all(void)
> > > > > +{
> > > > > +       sched_yield();
> > > > > +       system("echo 1 > /proc/sys/vm/compact_memory");
> > > > > +}
> > > >
> > > > What is the sched_yield() for?
> > > >
> > >
> > > Mostly it's there to provide a sleeping gap after the fault, whithout
> > > actually adding an arbitrary value with usleep().
> > >
> > > It's not a hard requirement, but, in some of the tests I performed
> > > (whithout that sleeping gap) I would still see around 1% chance
> > > of hitting the false-negative. After adding it I could not hit
> > > the issue anymore.
> >
> > It's concerning that such deep machinery as pagevec draining is visible
> > to userspace.
> >
> > I suppose that for consistency and correctness we should perform a
> > drain prior to each read from /proc/*/pagemap.  Presumably this would
> > be far too expensive.
> >
> > Is there any other way?  One such might be to make the MLOCK_ONFAULT
> > pages bypass the lru_add_pvecs?
> >
>
> Well,
>
> I admit I wasn't taking the approach of changing the kernel because I was
> thinking it would require a partial, or even full, revert of commit
> 9c4e6b1a7027f, and that would be increasing complexity, but on a
> second thought, it seems that we might just be missing:
>
> diff --git a/mm/swap.c b/mm/swap.c
> index cf39d24ada2a..b1601228ded4 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -473,6 +473,7 @@ void lru_cache_add_active_or_unevictable(struct page *page,
>                 __mod_zone_page_state(page_zone(page), NR_MLOCK,
>                                     hpage_nr_pages(page));
>                 count_vm_event(UNEVICTABLE_PGMLOCKED);
> +               SetPageUnevictable(page);

No, this is not correct. Check __pagevec_lru_add_fn() for
TestClearPageUnevictable().

As I mentioned in the other email, I think the better solution would
be to introduce a sysctl to drain the pagevecs. That way there will not
be any dependency on compaction, as there was in your original patch.

Shakeel
Michal Hocko March 23, 2020, 7:52 a.m. UTC | #8
On Sun 22-03-20 09:36:49, Shakeel Butt wrote:
> On Sat, Mar 21, 2020 at 9:31 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Sat, 21 Mar 2020 22:03:26 -0400 Rafael Aquini <aquini@redhat.com> wrote:
> >
> > > > > + * In order to sort out that race, and get the after fault checks consistent,
> > > > > + * the "quick and dirty" trick below is required in order to force a call to
> > > > > + * lru_add_drain_all() to get the recently MLOCK_ONFAULT pages moved to
> > > > > + * the unevictable LRU, as expected by the checks in this selftest.
> > > > > + */
> > > > > +static void force_lru_add_drain_all(void)
> > > > > +{
> > > > > + sched_yield();
> > > > > + system("echo 1 > /proc/sys/vm/compact_memory");
> > > > > +}
> > > >
> > > > What is the sched_yield() for?
> > > >
> > >
> > > Mostly it's there to provide a sleeping gap after the fault, whithout
> > > actually adding an arbitrary value with usleep().
> > >
> > > It's not a hard requirement, but, in some of the tests I performed
> > > (whithout that sleeping gap) I would still see around 1% chance
> > > of hitting the false-negative. After adding it I could not hit
> > > the issue anymore.
> >
> > It's concerning that such deep machinery as pagevec draining is visible
> > to userspace.
> >
> 
> We already have other examples like memcg stats where the
> optimizations like batching per-cpu stats collection exposes
> differences to the userspace. I would not be that worried here.

Agreed! Tests should be more tolerant of counter imprecision.
Unevictable LRU is an optimization and transition to that list is a
matter of an internal implementation detail.
 
> > I suppose that for consistency and correctness we should perform a
> > drain prior to each read from /proc/*/pagemap.  Presumably this would
> > be far too expensive.
> >
> > Is there any other way?  One such might be to make the MLOCK_ONFAULT
> > pages bypass the lru_add_pvecs?
> >
> 
> I would rather prefer to have something similar to
> /proc/sys/vm/stat_refresh which drains the pagevecs.

No, please don't. Pagevec draining is by far not the only batching
scheme we use, and an interface like this would promise users the
ability to effectively force flushing all of them.

Can we simply update the test to be more tolerant of imprecision
instead?
Rafael Aquini March 23, 2020, 2:16 p.m. UTC | #9
On Sun, Mar 22, 2020 at 09:31:04AM -0700, Shakeel Butt wrote:
> On Sat, Mar 21, 2020 at 6:35 PM Rafael Aquini <aquini@redhat.com> wrote:
> >
> > Changes for commit 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs")
> > break this test expectations on the behavior of mlock syscall family immediately
> > inserting the recently faulted pages into the UNEVICTABLE_LRU, when MCL_ONFAULT is
> > passed to the syscall as part of its flag-set.
> 
> mlock* syscalls do not provide any guarantee that the pages will be in
> unevictable LRU, only that the pages will not be paged-out. The test
> is checking something very internal to the kernel and this is expected
> to break.

It was a check expected to be satisfied before the commit, though. 
Getting mlocked pages inserted directly into the unevictable LRU,
skipping the pagevec, was established behavior before the aforementioned
commit, and even though one could argue that userspace should not be aware
of, or care about, such kernel internals, the program in question is not an 
ordinary userspace app but a kernel selftest that is supposed to check
the functionality's correctness.

> >
> > There is no functional error introduced by the aforementioned commit,
> > but it opens up a time window where the recently faulted and locked pages
> > might yet not be put back into the UNEVICTABLE_LRU, thus causing a
> > subsequent and immediate PFN flag check for the UNEVICTABLE bit
> > to trip on false-negative errors, as it happens with this test.
> >
> > This patch fix the false negative by forcefully resorting to a code path that
> > will call a CPU pagevec drain right after the fault but before the PFN flag
> > check takes place, sorting out the race that way.
> >
> > Fixes: 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs")
> 
> This is fixing the actual test and not about fixing the mentioned
> patch. So, this Fixes line is not needed.
>

If one bisects the kernel looking for the patch that causes the selftest
to fail, that commit is going to show up as the issue; thus the reference.
Michal Hocko March 23, 2020, 2:29 p.m. UTC | #10
On Mon 23-03-20 10:16:59, Rafael Aquini wrote:
> On Sun, Mar 22, 2020 at 09:31:04AM -0700, Shakeel Butt wrote:
> > On Sat, Mar 21, 2020 at 6:35 PM Rafael Aquini <aquini@redhat.com> wrote:
> > >
> > > Changes for commit 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs")
> > > break this test expectations on the behavior of mlock syscall family immediately
> > > inserting the recently faulted pages into the UNEVICTABLE_LRU, when MCL_ONFAULT is
> > > passed to the syscall as part of its flag-set.
> > 
> > mlock* syscalls do not provide any guarantee that the pages will be in
> > unevictable LRU, only that the pages will not be paged-out. The test
> > is checking something very internal to the kernel and this is expected
> > to break.
> 
> It was a check expected to be satisfied before the commit, though. 
> Getting the mlocked pages inserted directly into the unevictable LRU,
> skipping the pagevec, was established behavior before the aforementioned
> commit, and even though one could argue userspace should not be aware,
> or care, about such inner kernel circles the program in question is not an 
> ordinary userspace app, but a kernel selftest that is supposed to check
> for the functionality correctness.

But mlock (in neither mode) is reall forced to put pages to the
UNEVICTABLE_LRU for correctness. If the test is really depending on it
then the selftest is bogus. A real MCL_ONFAULT test should focus on the
user observable contract of this api. And that is that a new mapping
doesn't fault in the page during the mlock call but the memory is locked
after the memory is faulted in. You can use different methods to observe
locked memory - e.g. try to reclaim it and check or check /proc/<pid>/smaps
Rafael Aquini March 23, 2020, 2:42 p.m. UTC | #11
On Mon, Mar 23, 2020 at 08:52:08AM +0100, Michal Hocko wrote:
> On Sun 22-03-20 09:36:49, Shakeel Butt wrote:
> > On Sat, Mar 21, 2020 at 9:31 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > On Sat, 21 Mar 2020 22:03:26 -0400 Rafael Aquini <aquini@redhat.com> wrote:
> > >
> > > > > > + * In order to sort out that race, and get the after fault checks consistent,
> > > > > > + * the "quick and dirty" trick below is required in order to force a call to
> > > > > > + * lru_add_drain_all() to get the recently MLOCK_ONFAULT pages moved to
> > > > > > + * the unevictable LRU, as expected by the checks in this selftest.
> > > > > > + */
> > > > > > +static void force_lru_add_drain_all(void)
> > > > > > +{
> > > > > > + sched_yield();
> > > > > > + system("echo 1 > /proc/sys/vm/compact_memory");
> > > > > > +}
> > > > >
> > > > > What is the sched_yield() for?
> > > > >
> > > >
> > > > Mostly it's there to provide a sleeping gap after the fault, whithout
> > > > actually adding an arbitrary value with usleep().
> > > >
> > > > It's not a hard requirement, but, in some of the tests I performed
> > > > (whithout that sleeping gap) I would still see around 1% chance
> > > > of hitting the false-negative. After adding it I could not hit
> > > > the issue anymore.
> > >
> > > It's concerning that such deep machinery as pagevec draining is visible
> > > to userspace.
> > >
> > 
> > We already have other examples like memcg stats where the
> > optimizations like batching per-cpu stats collection exposes
> > differences to the userspace. I would not be that worried here.
> 
> Agreed! Tests should be more tolerant for counters imprecision.
> Unevictable LRU is an optimization and transition to that list is a
> matter of an internal implementation detail.
>
> > > I suppose that for consistency and correctness we should perform a
> > > drain prior to each read from /proc/*/pagemap.  Presumably this would
> > > be far too expensive.
> > >
> > > Is there any other way?  One such might be to make the MLOCK_ONFAULT
> > > pages bypass the lru_add_pvecs?
> > >
> > 
> > I would rather prefer to have something similar to
> > /proc/sys/vm/stat_refresh which drains the pagevecs.
> 
> No, please don't. Pagevecs draining is by far not the only batching
> scheme we use and an interface like this would promise users to
> effectivelly force flushing all of them.
> 
> Can we simply update the test to be more tolerant to imprecisions
> instead?
> 

I don't think, though, that this particular test case can be entirely
reduced to "counter imprecision".

The reason I think this is a different beast is that having the page
flagged as PG_unevictable is an expected part of the aftermath of
an mlock* call. This selftest is, IMO, correctly verifying that fact,
as it checks the functionality's correctness.

The problem boils down to the fact that the page used to be flagged
as PG_unevictable immediately after the mlock (under MCL_FUTURE|MCL_ONFAULT
semantics) call, the test was expecting that, and commit 9c4e6b1a7027f
changed it by "delaying" the flag setting.

As I mentioned, too, there's nothing wrong with the delayed setting of 
PG_unevictable; we just need to compensate for that fact in this test,
which is what this patch is suggesting to do.
  
> -- 
> Michal Hocko
> SUSE Labs
>
Michal Hocko March 23, 2020, 2:51 p.m. UTC | #12
On Mon 23-03-20 10:42:40, Rafael Aquini wrote:
> On Mon, Mar 23, 2020 at 08:52:08AM +0100, Michal Hocko wrote:
> > On Sun 22-03-20 09:36:49, Shakeel Butt wrote:
> > > On Sat, Mar 21, 2020 at 9:31 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > > >
> > > > On Sat, 21 Mar 2020 22:03:26 -0400 Rafael Aquini <aquini@redhat.com> wrote:
> > > >
> > > > > > > + * In order to sort out that race, and get the after fault checks consistent,
> > > > > > > + * the "quick and dirty" trick below is required in order to force a call to
> > > > > > > + * lru_add_drain_all() to get the recently MLOCK_ONFAULT pages moved to
> > > > > > > + * the unevictable LRU, as expected by the checks in this selftest.
> > > > > > > + */
> > > > > > > +static void force_lru_add_drain_all(void)
> > > > > > > +{
> > > > > > > + sched_yield();
> > > > > > > + system("echo 1 > /proc/sys/vm/compact_memory");
> > > > > > > +}
> > > > > >
> > > > > > What is the sched_yield() for?
> > > > > >
> > > > >
> > > > > Mostly it's there to provide a sleeping gap after the fault, whithout
> > > > > actually adding an arbitrary value with usleep().
> > > > >
> > > > > It's not a hard requirement, but, in some of the tests I performed
> > > > > (whithout that sleeping gap) I would still see around 1% chance
> > > > > of hitting the false-negative. After adding it I could not hit
> > > > > the issue anymore.
> > > >
> > > > It's concerning that such deep machinery as pagevec draining is visible
> > > > to userspace.
> > > >
> > > 
> > > We already have other examples like memcg stats where the
> > > optimizations like batching per-cpu stats collection exposes
> > > differences to the userspace. I would not be that worried here.
> > 
> > Agreed! Tests should be more tolerant for counters imprecision.
> > Unevictable LRU is an optimization and transition to that list is a
> > matter of an internal implementation detail.
> >
> > > > I suppose that for consistency and correctness we should perform a
> > > > drain prior to each read from /proc/*/pagemap.  Presumably this would
> > > > be far too expensive.
> > > >
> > > > Is there any other way?  One such might be to make the MLOCK_ONFAULT
> > > > pages bypass the lru_add_pvecs?
> > > >
> > > 
> > > I would rather prefer to have something similar to
> > > /proc/sys/vm/stat_refresh which drains the pagevecs.
> > 
> > No, please don't. Pagevecs draining is by far not the only batching
> > scheme we use and an interface like this would promise users to
> > effectivelly force flushing all of them.
> > 
> > Can we simply update the test to be more tolerant to imprecisions
> > instead?
> > 
> 
> I don't think, thouhg, that this particular test case can be entirely
> reduced as "counter imprecison".
> 
> The reason I think this is a different beast, is that having the page
> being flagged as PG_unevictable is expected part of the aftermath of
> a mlock* call. This selftest is, IMO, correctly verifying that fact,
> as it checks the functionality correctness.
> 
> The problem boils down to the fact that the page would immediately
> be flagged as PG_unevictable after the mlock (under MCL_FUTURE|MCL_ONFAULT
> semantics) call, and the test was expecting it, and commit 9c4e6b1a7027f
> changed that by "delaying" that flag setting.

As I've tried to explain in another email in this thread, the test
was exploiting a certain user-visible side effect. The unevictable flag
and placement on the unevictable LRU list are not really needed
for the correctness of the user contract. That means that the test is not
really correct. Working around that by trying to force the kernel to
comply with the test's expectations is just plain wrong, for at least two
reasons: 1) you cannot expect, or even want, userspace to do the
same, because the behavior might change in the future; 2) the test is not
really testing for correctness in the first place.
Rafael Aquini March 23, 2020, 2:55 p.m. UTC | #13
On Mon, Mar 23, 2020 at 03:29:41PM +0100, Michal Hocko wrote:
> On Mon 23-03-20 10:16:59, Rafael Aquini wrote:
> > On Sun, Mar 22, 2020 at 09:31:04AM -0700, Shakeel Butt wrote:
> > > On Sat, Mar 21, 2020 at 6:35 PM Rafael Aquini <aquini@redhat.com> wrote:
> > > >
> > > > Changes for commit 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs")
> > > > break this test expectations on the behavior of mlock syscall family immediately
> > > > inserting the recently faulted pages into the UNEVICTABLE_LRU, when MCL_ONFAULT is
> > > > passed to the syscall as part of its flag-set.
> > > 
> > > mlock* syscalls do not provide any guarantee that the pages will be in
> > > unevictable LRU, only that the pages will not be paged-out. The test
> > > is checking something very internal to the kernel and this is expected
> > > to break.
> > 
> > It was a check expected to be satisfied before the commit, though. 
> > Getting the mlocked pages inserted directly into the unevictable LRU,
> > skipping the pagevec, was established behavior before the aforementioned
> > commit, and even though one could argue userspace should not be aware,
> > or care, about such inner kernel circles the program in question is not an 
> > ordinary userspace app, but a kernel selftest that is supposed to check
> > for the functionality correctness.
> 
> But mlock (in neither mode) is reall forced to put pages to the
> UNEVICTABLE_LRU for correctness. If the test is really depending on it
> then the selftest is bogus. A real MCL_ONFAULT test should focus on the
> user observable contract of this api. And that is that a new mapping
> doesn't fault in the page during the mlock call but the memory is locked
> after the memory is faulted in. You can use different methods to observe
> locked memory - e.g. try to reclaim it and check or check /proc/<pid>/smaps
>

Again, I don't think the test is bogus, although it's (now) expecting
something that is not guaranteed after the referred commit.
The check for PG_unevictable set on the page backing the mapping
seems reasonable, as the flag is supposed to be there if everything 
went fine after the mlock call. 

-- Rafael
Michal Hocko March 23, 2020, 3:01 p.m. UTC | #14
On Mon 23-03-20 15:29:43, Michal Hocko wrote:
> On Mon 23-03-20 10:16:59, Rafael Aquini wrote:
> > On Sun, Mar 22, 2020 at 09:31:04AM -0700, Shakeel Butt wrote:
> > > On Sat, Mar 21, 2020 at 6:35 PM Rafael Aquini <aquini@redhat.com> wrote:
> > > >
> > > > The changes in commit 9c4e6b1a7027f ("mm, mlock, vmscan: no more
> > > > skipping pagevecs") break this test's expectation that the mlock
> > > > syscall family immediately inserts the recently faulted pages into
> > > > the UNEVICTABLE_LRU when MCL_ONFAULT is passed to the syscall as
> > > > part of its flag set.
> > > 
> > > mlock* syscalls do not provide any guarantee that the pages will be in
> > > the unevictable LRU, only that the pages will not be paged-out. The test
> > > is checking something very internal to the kernel and this is expected
> > > to break.
> > 
> > It was a check expected to be satisfied before the commit, though. 
> > Getting the mlocked pages inserted directly into the unevictable LRU,
> > skipping the pagevec, was established behavior before the aforementioned
> > commit, and even though one could argue userspace should not be aware,
> > or care, about such internal kernel details, the program in question
> > is not an ordinary userspace app, but a kernel selftest that is
> > supposed to check for functional correctness.
> 
> But mlock (in neither mode) is reall forced to put pages to the

I meant to say "is not really forced"

> UNEVICTABLE_LRU for correctness. If the test is really depending on it
> then the selftest is bogus. A real MCL_ONFAULT test should focus on the
> user observable contract of this api. And that is that a new mapping
> doesn't fault in the page during the mlock call but the memory is locked
> after the memory is faulted in. You can use different methods to observe
> locked memory - e.g. try to reclaim it and check or check /proc/<pid>/smaps

I have just checked the testcase and I believe it is really dubious to
check for page flags. Those are really an internal implementation
detail. All the needed information is available in the
/proc/<pid>/smaps file, which is already parsed in the test, so the test
is easily fixable.
Rafael Aquini March 23, 2020, 3:02 p.m. UTC | #15
On Mon, Mar 23, 2020 at 03:51:06PM +0100, Michal Hocko wrote:
> On Mon 23-03-20 10:42:40, Rafael Aquini wrote:
> > On Mon, Mar 23, 2020 at 08:52:08AM +0100, Michal Hocko wrote:
> > > On Sun 22-03-20 09:36:49, Shakeel Butt wrote:
> > > > On Sat, Mar 21, 2020 at 9:31 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > > > >
> > > > > On Sat, 21 Mar 2020 22:03:26 -0400 Rafael Aquini <aquini@redhat.com> wrote:
> > > > >
> > > > > > > > + * In order to sort out that race, and get the after fault checks consistent,
> > > > > > > > + * the "quick and dirty" trick below is required in order to force a call to
> > > > > > > > + * lru_add_drain_all() to get the recently MLOCK_ONFAULT pages moved to
> > > > > > > > + * the unevictable LRU, as expected by the checks in this selftest.
> > > > > > > > + */
> > > > > > > > +static void force_lru_add_drain_all(void)
> > > > > > > > +{
> > > > > > > > + sched_yield();
> > > > > > > > + system("echo 1 > /proc/sys/vm/compact_memory");
> > > > > > > > +}
> > > > > > >
> > > > > > > What is the sched_yield() for?
> > > > > > >
> > > > > >
> > > > > > Mostly it's there to provide a sleeping gap after the fault, without
> > > > > > actually adding an arbitrary value with usleep().
> > > > > >
> > > > > > It's not a hard requirement, but, in some of the tests I performed
> > > > > > (without that sleeping gap) I would still see around 1% chance
> > > > > > of hitting the false-negative. After adding it I could not hit
> > > > > > the issue anymore.
> > > > >
> > > > > It's concerning that such deep machinery as pagevec draining is visible
> > > > > to userspace.
> > > > >
> > > > 
> > > > We already have other examples like memcg stats where the
> > > > optimizations like batching per-cpu stats collection exposes
> > > > differences to the userspace. I would not be that worried here.
> > > 
> > > Agreed! Tests should be more tolerant of counter imprecision.
> > > The unevictable LRU is an optimization, and the transition to that
> > > list is an internal implementation detail.
> > >
> > > > > I suppose that for consistency and correctness we should perform a
> > > > > drain prior to each read from /proc/*/pagemap.  Presumably this would
> > > > > be far too expensive.
> > > > >
> > > > > Is there any other way?  One such might be to make the MLOCK_ONFAULT
> > > > > pages bypass the lru_add_pvecs?
> > > > >
> > > > 
> > > > I would rather prefer to have something similar to
> > > > /proc/sys/vm/stat_refresh which drains the pagevecs.
> > > 
> > > No, please don't. Pagevecs draining is by far not the only batching
> > > scheme we use and an interface like this would promise users to
> > > effectively force flushing all of them.
> > > 
> > > Can we simply update the test to be more tolerant to imprecisions
> > > instead?
> > > 
> > 
> > I don't think, though, that this particular test case can be entirely
> > reduced to "counter imprecision".
> > 
> > The reason I think this is a different beast, is that having the page
> > being flagged as PG_unevictable is expected part of the aftermath of
> > a mlock* call. This selftest is, IMO, correctly verifying that fact,
> > as it checks the functionality correctness.
> > 
> > The problem boils down to the fact that the page would immediately
> > be flagged as PG_unevictable after the mlock (under MCL_FUTURE|MCL_ONFAULT
> > semantics) call, and the test was expecting it, and commit 9c4e6b1a7027f
> > changed that by "delaying" that flag setting.
> 
> As I've tried to explain in other email in this email thread. The test
> was exploiting a certain user visible side effect. The unevictable flag
> or the placement on the unevictable LRU list are not really needed
> for the user contract correctness. That means that the test is not
> really correct. Working around that by trying to enforce kernel to
> comply with the test expectations is just plain wrong at least for two
> reasons 1) you cannot expect or even do not want userspace to do the
> same because the behavior might change in the future 2) the test is not
> really testing for correctness in the first place.
>

Sorry, Michal, it seems we keep going back and forth (I just replied to
your comment on the other thread)

The selftest also checks the kernel visible effect, via
/proc/kpageflags, and that's where it fails after 9c4e6b1a7027f.

As I mentioned before, I think it is a reasonable check, given this
is a kernel selftest, although we need to compensate it for the
differences between its expectations and what the kernel is doing
currently.
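
For reference, the check being defended here decodes a
/proc/<pid>/pagemap entry (present bit 63, PFN in bits 0-54, per
Documentation/admin-guide/mm/pagemap.rst) and then tests bit 18
(KPF_UNEVICTABLE) of the matching /proc/kpageflags word. A minimal
sketch of just that decoding logic, mirroring the masks the selftest
uses (helper names invented for the illustration):

```c
#include <stdbool.h>
#include <stdint.h>

/* /proc/<pid>/pagemap entry layout (pagemap.rst) */
#define PM_PRESENT	(1ULL << 63)
#define PM_PFN_MASK	((1ULL << 55) - 1)	/* bits 0-54 hold the PFN */

/* /proc/kpageflags bit number for KPF_UNEVICTABLE */
#define KPF_UNEVICTABLE	18

static bool pagemap_present(uint64_t entry)
{
	return entry & PM_PRESENT;
}

static uint64_t pagemap_pfn(uint64_t entry)
{
	return entry & PM_PFN_MASK;
}

static bool kpf_unevictable(uint64_t kpageflags_word)
{
	return kpageflags_word & (1ULL << KPF_UNEVICTABLE);
}
```

The encoding itself is stable; what 9c4e6b1a7027f changed is only when
KPF_UNEVICTABLE becomes observable after mlock, which is exactly the
window the test trips over.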

-- Rafael
Rafael Aquini March 23, 2020, 3:07 p.m. UTC | #16
On Mon, Mar 23, 2020 at 04:01:34PM +0100, Michal Hocko wrote:
> On Mon 23-03-20 15:29:43, Michal Hocko wrote:
> > On Mon 23-03-20 10:16:59, Rafael Aquini wrote:
> > > On Sun, Mar 22, 2020 at 09:31:04AM -0700, Shakeel Butt wrote:
> > > > On Sat, Mar 21, 2020 at 6:35 PM Rafael Aquini <aquini@redhat.com> wrote:
> > > > >
> > > > > The changes in commit 9c4e6b1a7027f ("mm, mlock, vmscan: no more
> > > > > skipping pagevecs") break this test's expectation that the mlock
> > > > > syscall family immediately inserts the recently faulted pages into
> > > > > the UNEVICTABLE_LRU when MCL_ONFAULT is passed to the syscall as
> > > > > part of its flag set.
> > > > 
> > > > mlock* syscalls do not provide any guarantee that the pages will be in
> > > > the unevictable LRU, only that the pages will not be paged-out. The test
> > > > is checking something very internal to the kernel and this is expected
> > > > to break.
> > > 
> > > It was a check expected to be satisfied before the commit, though. 
> > > Getting the mlocked pages inserted directly into the unevictable LRU,
> > > skipping the pagevec, was established behavior before the aforementioned
> > > commit, and even though one could argue userspace should not be aware,
> > > or care, about such internal kernel details, the program in question
> > > is not an ordinary userspace app, but a kernel selftest that is
> > > supposed to check for functional correctness.
> > 
> > But mlock (in neither mode) is reall forced to put pages to the
> 
> I meant to say "is not really forced"
> 
> > UNEVICTABLE_LRU for correctness. If the test is really depending on it
> > then the selftest is bogus. A real MCL_ONFAULT test should focus on the
> > user observable contract of this api. And that is that a new mapping
> > doesn't fault in the page during the mlock call but the memory is locked
> > after the memory is faulted in. You can use different methods to observe
> > locked memory - e.g. try to reclaim it and check or check /proc/<pid>/smaps
> 
> I have just checked the testcase and I believe it is really dubious to
> check for page flags. Those are really an internal implementation
> detail. All the needed information is available in the
> /proc/<pid>/smaps file, which is already parsed in the test, so the test
> is easily fixable.

That surely is another take on this test.
I still think the test is reasonable when it checks what was
expected to be there from the kernel's standpoint; it just needs
to be adjusted for the current state of affairs.

-- Rafael
Michal Hocko March 23, 2020, 3:12 p.m. UTC | #17
On Mon 23-03-20 11:02:59, Rafael Aquini wrote:
> On Mon, Mar 23, 2020 at 03:51:06PM +0100, Michal Hocko wrote:
> > On Mon 23-03-20 10:42:40, Rafael Aquini wrote:
> > > On Mon, Mar 23, 2020 at 08:52:08AM +0100, Michal Hocko wrote:
> > > > On Sun 22-03-20 09:36:49, Shakeel Butt wrote:
> > > > > On Sat, Mar 21, 2020 at 9:31 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > > > > >
> > > > > > On Sat, 21 Mar 2020 22:03:26 -0400 Rafael Aquini <aquini@redhat.com> wrote:
> > > > > >
> > > > > > > > > + * In order to sort out that race, and get the after fault checks consistent,
> > > > > > > > > + * the "quick and dirty" trick below is required in order to force a call to
> > > > > > > > > + * lru_add_drain_all() to get the recently MLOCK_ONFAULT pages moved to
> > > > > > > > > + * the unevictable LRU, as expected by the checks in this selftest.
> > > > > > > > > + */
> > > > > > > > > +static void force_lru_add_drain_all(void)
> > > > > > > > > +{
> > > > > > > > > + sched_yield();
> > > > > > > > > + system("echo 1 > /proc/sys/vm/compact_memory");
> > > > > > > > > +}
> > > > > > > >
> > > > > > > > What is the sched_yield() for?
> > > > > > > >
> > > > > > >
> > > > > > > Mostly it's there to provide a sleeping gap after the fault, without
> > > > > > > actually adding an arbitrary value with usleep().
> > > > > > >
> > > > > > > It's not a hard requirement, but, in some of the tests I performed
> > > > > > > (without that sleeping gap) I would still see around 1% chance
> > > > > > > of hitting the false-negative. After adding it I could not hit
> > > > > > > the issue anymore.
> > > > > >
> > > > > > It's concerning that such deep machinery as pagevec draining is visible
> > > > > > to userspace.
> > > > > >
> > > > > 
> > > > > We already have other examples like memcg stats where the
> > > > > optimizations like batching per-cpu stats collection exposes
> > > > > differences to the userspace. I would not be that worried here.
> > > > 
> > > > Agreed! Tests should be more tolerant of counter imprecision.
> > > > The unevictable LRU is an optimization, and the transition to that
> > > > list is an internal implementation detail.
> > > >
> > > > > > I suppose that for consistency and correctness we should perform a
> > > > > > drain prior to each read from /proc/*/pagemap.  Presumably this would
> > > > > > be far too expensive.
> > > > > >
> > > > > > Is there any other way?  One such might be to make the MLOCK_ONFAULT
> > > > > > pages bypass the lru_add_pvecs?
> > > > > >
> > > > > 
> > > > > I would rather prefer to have something similar to
> > > > > /proc/sys/vm/stat_refresh which drains the pagevecs.
> > > > 
> > > > No, please don't. Pagevecs draining is by far not the only batching
> > > > scheme we use and an interface like this would promise users to
> > > > effectively force flushing all of them.
> > > > 
> > > > Can we simply update the test to be more tolerant to imprecisions
> > > > instead?
> > > > 
> > > 
> > > I don't think, though, that this particular test case can be entirely
> > > reduced to "counter imprecision".
> > > 
> > > The reason I think this is a different beast, is that having the page
> > > being flagged as PG_unevictable is expected part of the aftermath of
> > > a mlock* call. This selftest is, IMO, correctly verifying that fact,
> > > as it checks the functionality correctness.
> > > 
> > > The problem boils down to the fact that the page would immediately
> > > be flagged as PG_unevictable after the mlock (under MCL_FUTURE|MCL_ONFAULT
> > > semantics) call, and the test was expecting it, and commit 9c4e6b1a7027f
> > > changed that by "delaying" that flag setting.
> > 
> > As I've tried to explain in other email in this email thread. The test
> > was exploiting a certain user visible side effect. The unevictable flag
> > or the placement on the unevictable LRU list are not really needed
> > for the user contract correctness. That means that the test is not
> > really correct. Working around that by trying to enforce kernel to
> > comply with the test expectations is just plain wrong at least for two
> > reasons 1) you cannot expect or even do not want userspace to do the
> > same because the behavior might change in the future 2) the test is not
> > really testing for correctness in the first place.
> >
> 
> Sorry, Michal, it seems we keep going back and forth (I just replied to
> your comment on the other thread)
> 
> The selftest also checks the kernel visible effect, via
> /proc/kpageflags, and that's where it fails after 9c4e6b1a7027f.

I really fail to see your point. Even if you are right that the self
test is somehow evaluating the kernel implementation (and I am not sure
that is the scope of a selftest), the mere fact that the test fails on
a perfectly valid change should just suggest that the test is producing
false positives and therefore should be fixed.
Your proposed fix is simply suboptimal because it relies on yet another
side effect which might change anytime in the future and still lead to a
correctly behaving kernel. See my point?
Rafael Aquini March 23, 2020, 3:41 p.m. UTC | #18
On Mon, Mar 23, 2020 at 04:12:56PM +0100, Michal Hocko wrote:
> On Mon 23-03-20 11:02:59, Rafael Aquini wrote:
> > On Mon, Mar 23, 2020 at 03:51:06PM +0100, Michal Hocko wrote:
> > > On Mon 23-03-20 10:42:40, Rafael Aquini wrote:
> > > > On Mon, Mar 23, 2020 at 08:52:08AM +0100, Michal Hocko wrote:
> > > > > On Sun 22-03-20 09:36:49, Shakeel Butt wrote:
> > > > > > On Sat, Mar 21, 2020 at 9:31 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > > > > > >
> > > > > > > On Sat, 21 Mar 2020 22:03:26 -0400 Rafael Aquini <aquini@redhat.com> wrote:
> > > > > > >
> > > > > > > > > > + * In order to sort out that race, and get the after fault checks consistent,
> > > > > > > > > > + * the "quick and dirty" trick below is required in order to force a call to
> > > > > > > > > > + * lru_add_drain_all() to get the recently MLOCK_ONFAULT pages moved to
> > > > > > > > > > + * the unevictable LRU, as expected by the checks in this selftest.
> > > > > > > > > > + */
> > > > > > > > > > +static void force_lru_add_drain_all(void)
> > > > > > > > > > +{
> > > > > > > > > > + sched_yield();
> > > > > > > > > > + system("echo 1 > /proc/sys/vm/compact_memory");
> > > > > > > > > > +}
> > > > > > > > >
> > > > > > > > > What is the sched_yield() for?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Mostly it's there to provide a sleeping gap after the fault, without
> > > > > > > > actually adding an arbitrary value with usleep().
> > > > > > > >
> > > > > > > > It's not a hard requirement, but, in some of the tests I performed
> > > > > > > > (without that sleeping gap) I would still see around 1% chance
> > > > > > > > of hitting the false-negative. After adding it I could not hit
> > > > > > > > the issue anymore.
> > > > > > >
> > > > > > > It's concerning that such deep machinery as pagevec draining is visible
> > > > > > > to userspace.
> > > > > > >
> > > > > > 
> > > > > > We already have other examples like memcg stats where the
> > > > > > optimizations like batching per-cpu stats collection exposes
> > > > > > differences to the userspace. I would not be that worried here.
> > > > > 
> > > > > Agreed! Tests should be more tolerant of counter imprecision.
> > > > > The unevictable LRU is an optimization, and the transition to that
> > > > > list is an internal implementation detail.
> > > > >
> > > > > > > I suppose that for consistency and correctness we should perform a
> > > > > > > drain prior to each read from /proc/*/pagemap.  Presumably this would
> > > > > > > be far too expensive.
> > > > > > >
> > > > > > > Is there any other way?  One such might be to make the MLOCK_ONFAULT
> > > > > > > pages bypass the lru_add_pvecs?
> > > > > > >
> > > > > > 
> > > > > > I would rather prefer to have something similar to
> > > > > > /proc/sys/vm/stat_refresh which drains the pagevecs.
> > > > > 
> > > > > No, please don't. Pagevecs draining is by far not the only batching
> > > > > scheme we use and an interface like this would promise users to
> > > > > effectively force flushing all of them.
> > > > > 
> > > > > Can we simply update the test to be more tolerant to imprecisions
> > > > > instead?
> > > > > 
> > > > 
> > > > I don't think, though, that this particular test case can be entirely
> > > > reduced to "counter imprecision".
> > > > 
> > > > The reason I think this is a different beast, is that having the page
> > > > being flagged as PG_unevictable is expected part of the aftermath of
> > > > a mlock* call. This selftest is, IMO, correctly verifying that fact,
> > > > as it checks the functionality correctness.
> > > > 
> > > > The problem boils down to the fact that the page would immediately
> > > > be flagged as PG_unevictable after the mlock (under MCL_FUTURE|MCL_ONFAULT
> > > > semantics) call, and the test was expecting it, and commit 9c4e6b1a7027f
> > > > changed that by "delaying" that flag setting.
> > > 
> > > As I've tried to explain in other email in this email thread. The test
> > > was exploiting a certain user visible side effect. The unevictable flag
> > > or the placement on the unevictable LRU list are not really needed
> > > for the user contract correctness. That means that the test is not
> > > really correct. Working around that by trying to enforce kernel to
> > > comply with the test expectations is just plain wrong at least for two
> > > reasons 1) you cannot expect or even do not want userspace to do the
> > > same because the behavior might change in the future 2) the test is not
> > > really testing for correctness in the first place.
> > >
> > 
> > Sorry, Michal, it seems we keep going back and forth (I just replied to
> > your comment on the other thread)
> > 
> > The selftest also checks the kernel visible effect, via
> > /proc/kpageflags, and that's where it fails after 9c4e6b1a7027f.
> 
> I really fail to see your point. Even if you are right that the self
> test is somehow evaluating the kernel implementation (and I am not sure
> that is the scope of a selftest), the mere fact that the test fails on
> a perfectly valid change should just suggest that the test is producing
> false positives and therefore should be fixed.
> Your proposed fix is simply suboptimal because it relies on yet another
> side effect which might change anytime in the future and still lead to a
> correctly behaving kernel. See my point?
>

OK, I concede your point on the bogusness of checking the page flags in
this particular test and expecting certain values there, given that no
other selftest seems to be doing that level of inner-kernel detail
scrutiny.

I'll repost this fix suggestion getting rid of those related
checkpoints.

Cheers!
-- Rafael
Michal Hocko March 23, 2020, 3:51 p.m. UTC | #19
On Mon 23-03-20 11:41:59, Rafael Aquini wrote:
> On Mon, Mar 23, 2020 at 04:12:56PM +0100, Michal Hocko wrote:
> > On Mon 23-03-20 11:02:59, Rafael Aquini wrote:
[...]
> > > The selftest also checks the kernel visible effect, via
> > > /proc/kpageflags, and that's where it fails after 9c4e6b1a7027f.
> > 
> > I really fail to see your point. Even if you are right that the self
> > test is somehow evaluating the kernel implementation (and I am not sure
> > that is the scope of a selftest), the mere fact that the test fails on
> > a perfectly valid change should just suggest that the test is producing
> > false positives and therefore should be fixed.
> > Your proposed fix is simply suboptimal because it relies on yet another
> > side effect which might change anytime in the future and still lead to a
> > correctly behaving kernel. See my point?
> >
> 
> OK, I concede your point on the bogusness of checking the page flags in
> this particular test and expecting certain values there, given that no
> other selftest seems to be doing that level of inner-kernel detail
> scrutiny.
> 
> I'll repost this fix suggestion getting rid of those related
> checkpoints.

Here is what I have after I had to context switch to something else
before finishing it. Feel free to reuse if you feel like. It is likely
to not even compile.

diff --git a/tools/testing/selftests/vm/mlock2-tests.c b/tools/testing/selftests/vm/mlock2-tests.c
index 637b6d0ac0d0..9fb1638a4d33 100644
--- a/tools/testing/selftests/vm/mlock2-tests.c
+++ b/tools/testing/selftests/vm/mlock2-tests.c
@@ -67,59 +67,6 @@ static int get_vm_area(unsigned long addr, struct vm_boundaries *area)
 	return ret;
 }
 
-static uint64_t get_pageflags(unsigned long addr)
-{
-	FILE *file;
-	uint64_t pfn;
-	unsigned long offset;
-
-	file = fopen("/proc/self/pagemap", "r");
-	if (!file) {
-		perror("fopen pagemap");
-		_exit(1);
-	}
-
-	offset = addr / getpagesize() * sizeof(pfn);
-
-	if (fseek(file, offset, SEEK_SET)) {
-		perror("fseek pagemap");
-		_exit(1);
-	}
-
-	if (fread(&pfn, sizeof(pfn), 1, file) != 1) {
-		perror("fread pagemap");
-		_exit(1);
-	}
-
-	fclose(file);
-	return pfn;
-}
-
-static uint64_t get_kpageflags(unsigned long pfn)
-{
-	uint64_t flags;
-	FILE *file;
-
-	file = fopen("/proc/kpageflags", "r");
-	if (!file) {
-		perror("fopen kpageflags");
-		_exit(1);
-	}
-
-	if (fseek(file, pfn * sizeof(flags), SEEK_SET)) {
-		perror("fseek kpageflags");
-		_exit(1);
-	}
-
-	if (fread(&flags, sizeof(flags), 1, file) != 1) {
-		perror("fread kpageflags");
-		_exit(1);
-	}
-
-	fclose(file);
-	return flags;
-}
-
 #define VMFLAGS "VmFlags:"
 
 static bool is_vmflag_set(unsigned long addr, const char *vmflag)
@@ -159,19 +106,13 @@ static bool is_vmflag_set(unsigned long addr, const char *vmflag)
 #define RSS  "Rss:"
 #define LOCKED "lo"
 
-static bool is_vma_lock_on_fault(unsigned long addr)
+static unsigned long get_value_for_name(unsigned long addr, const char *name)
 {
-	bool ret = false;
-	bool locked;
-	FILE *smaps = NULL;
-	unsigned long vma_size, vma_rss;
 	char *line = NULL;
-	char *value;
 	size_t size = 0;
-
-	locked = is_vmflag_set(addr, LOCKED);
-	if (!locked)
-		goto out;
+	char *value_ptr;
+	FILE *smaps = NULL;
+	unsigned long value = -1UL;
 
 	smaps = seek_to_smaps_entry(addr);
 	if (!smaps) {
@@ -179,44 +120,43 @@ static bool is_vma_lock_on_fault(unsigned long addr)
 		goto out;
 	}
 
-	while (getline(&line, &size, smaps) > 0) {
-		if (!strstr(line, SIZE)) {
+	while (getline(&line, &size, smaps) > 0) {
+		if (!strstr(line, name)) {
 			free(line);
 			line = NULL;
 			size = 0;
 			continue;
 		}
 
-		value = line + strlen(SIZE);
-		if (sscanf(value, "%lu kB", &vma_size) < 1) {
+		value_ptr = line + strlen(name);
+		if (sscanf(value_ptr, "%lu kB", &value) < 1) {
 			printf("Unable to parse smaps entry for Size\n");
 			goto out;
 		}
 		break;
 	}
 
-	while (getline(&line, &size, smaps) > 0) {
-		if (!strstr(line, RSS)) {
-			free(line);
-			line = NULL;
-			size = 0;
-			continue;
-		}
-
-		value = line + strlen(RSS);
-		if (sscanf(value, "%lu kB", &vma_rss) < 1) {
-			printf("Unable to parse smaps entry for Rss\n");
-			goto out;
-		}
-		break;
-	}
-
-	ret = locked && (vma_rss < vma_size);
 out:
-	free(line);
 	if (smaps)
 		fclose(smaps);
-	return ret;
+	free(line);
+	return value;
+}
+
+static bool is_vma_lock_on_fault(unsigned long addr)
+{
+	bool locked;
+	unsigned long vma_size, vma_rss;
+
+	locked = is_vmflag_set(addr, LOCKED);
+	if (!locked)
+		return false;
+
+	vma_size = get_value_for_name(addr, SIZE);
+	vma_rss = get_value_for_name(addr, RSS);
+
+	/* only one page is faulted in */
+	return (vma_rss < vma_size);
 }
 
 #define PRESENT_BIT     0x8000000000000000ULL
@@ -225,40 +165,18 @@ static bool is_vma_lock_on_fault(unsigned long addr)
 
 static int lock_check(char *map)
 {
-	unsigned long page_size = getpagesize();
-	uint64_t page1_flags, page2_flags;
-
-	page1_flags = get_pageflags((unsigned long)map);
-	page2_flags = get_pageflags((unsigned long)map + page_size);
-
-	/* Both pages should be present */
-	if (((page1_flags & PRESENT_BIT) == 0) ||
-	    ((page2_flags & PRESENT_BIT) == 0)) {
-		printf("Failed to make both pages present\n");
-		return 1;
-	}
-
-	page1_flags = get_kpageflags(page1_flags & PFN_MASK);
-	page2_flags = get_kpageflags(page2_flags & PFN_MASK);
-
-	/* Both pages should be unevictable */
-	if (((page1_flags & UNEVICTABLE_BIT) == 0) ||
-	    ((page2_flags & UNEVICTABLE_BIT) == 0)) {
-		printf("Failed to make both pages unevictable\n");
-		return 1;
-	}
+	bool locked;
+	FILE *smaps = NULL;
+	unsigned long vma_size, vma_rss;
 
-	if (!is_vmflag_set((unsigned long)map, LOCKED)) {
-		printf("VMA flag %s is missing on page 1\n", LOCKED);
-		return 1;
-	}
+	locked = is_vmflag_set((unsigned long)map, LOCKED);
+	if (!locked)
+		return false;
 
-	if (!is_vmflag_set((unsigned long)map + page_size, LOCKED)) {
-		printf("VMA flag %s is missing on page 2\n", LOCKED);
-		return 1;
-	}
+	vma_size = get_value_for_name((unsigned long)map, SIZE);
+	vma_rss = get_value_for_name((unsigned long)map, RSS);
 
-	return 0;
+	return (vma_rss == vma_size);
 }
 
 static int unlock_lock_check(char *map)
@@ -266,16 +184,6 @@ static int unlock_lock_check(char *map)
 	unsigned long page_size = getpagesize();
 	uint64_t page1_flags, page2_flags;
 
-	page1_flags = get_pageflags((unsigned long)map);
-	page2_flags = get_pageflags((unsigned long)map + page_size);
-	page1_flags = get_kpageflags(page1_flags & PFN_MASK);
-	page2_flags = get_kpageflags(page2_flags & PFN_MASK);
-
-	if ((page1_flags & UNEVICTABLE_BIT) || (page2_flags & UNEVICTABLE_BIT)) {
-		printf("A page is still marked unevictable after unlock\n");
-		return 1;
-	}
-
 	if (is_vmflag_set((unsigned long)map, LOCKED)) {
 		printf("VMA flag %s is present on page 1 after unlock\n", LOCKED);
 		return 1;
@@ -333,36 +241,7 @@ static int onfault_check(char *map)
 	unsigned long page_size = getpagesize();
 	uint64_t page1_flags, page2_flags;
 
-	page1_flags = get_pageflags((unsigned long)map);
-	page2_flags = get_pageflags((unsigned long)map + page_size);
-
-	/* Neither page should be present */
-	if ((page1_flags & PRESENT_BIT) || (page2_flags & PRESENT_BIT)) {
-		printf("Pages were made present by MLOCK_ONFAULT\n");
-		return 1;
-	}
-
 	*map = 'a';
-	page1_flags = get_pageflags((unsigned long)map);
-	page2_flags = get_pageflags((unsigned long)map + page_size);
-
-	/* Only page 1 should be present */
-	if ((page1_flags & PRESENT_BIT) == 0) {
-		printf("Page 1 is not present after fault\n");
-		return 1;
-	} else if (page2_flags & PRESENT_BIT) {
-		printf("Page 2 was made present\n");
-		return 1;
-	}
-
-	page1_flags = get_kpageflags(page1_flags & PFN_MASK);
-
-	/* Page 1 should be unevictable */
-	if ((page1_flags & UNEVICTABLE_BIT) == 0) {
-		printf("Failed to make faulted page unevictable\n");
-		return 1;
-	}
-
 	if (!is_vma_lock_on_fault((unsigned long)map)) {
 		printf("VMA is not marked for lock on fault\n");
 		return 1;
@@ -381,14 +260,6 @@ static int unlock_onfault_check(char *map)
 	unsigned long page_size = getpagesize();
 	uint64_t page1_flags;
 
-	page1_flags = get_pageflags((unsigned long)map);
-	page1_flags = get_kpageflags(page1_flags & PFN_MASK);
-
-	if (page1_flags & UNEVICTABLE_BIT) {
-		printf("Page 1 is still marked unevictable after unlock\n");
-		return 1;
-	}
-
 	if (is_vma_lock_on_fault((unsigned long)map) ||
 	    is_vma_lock_on_fault((unsigned long)map + page_size)) {
 		printf("VMA is still lock on fault after unlock\n");
@@ -465,17 +336,6 @@ static int test_lock_onfault_of_present()
 		goto unmap;
 	}
 
-	page1_flags = get_pageflags((unsigned long)map);
-	page2_flags = get_pageflags((unsigned long)map + page_size);
-	page1_flags = get_kpageflags(page1_flags & PFN_MASK);
-	page2_flags = get_kpageflags(page2_flags & PFN_MASK);
-
-	/* Page 1 should be unevictable */
-	if ((page1_flags & UNEVICTABLE_BIT) == 0) {
-		printf("Failed to make present page unevictable\n");
-		goto unmap;
-	}
-
 	if (!is_vma_lock_on_fault((unsigned long)map) ||
 	    !is_vma_lock_on_fault((unsigned long)map + page_size)) {
 		printf("VMA with present pages is not marked lock on fault\n");
Rafael Aquini March 23, 2020, 3:54 p.m. UTC | #20
On Mon, Mar 23, 2020 at 04:51:11PM +0100, Michal Hocko wrote:
> On Mon 23-03-20 11:41:59, Rafael Aquini wrote:
> > On Mon, Mar 23, 2020 at 04:12:56PM +0100, Michal Hocko wrote:
> > > On Mon 23-03-20 11:02:59, Rafael Aquini wrote:
> [...]
> > > > The selftest also checks the kernel visible effect, via
> > > > /proc/kpageflags, and that's where it fails after 9c4e6b1a7027f.
> > > 
> > > I really fail to see your point. Even if you are right that the self
> > > test is somehow evaluating the kernel implementation (and I am not sure
> > > that is the scope of a selftest), the mere fact that the test fails on
> > > a perfectly valid change should just suggest that the test is producing
> > > false positives and therefore should be fixed.
> > > Your proposed fix is simply suboptimal because it relies on yet another
> > > side effect which might change anytime in the future and still lead to a
> > > correctly behaving kernel. See my point?
> > >
> > 
> > OK, I concede your point on the bogusness of checking the page flags in
> > this particular test and expecting certain values there, given that no
> > other selftest seems to be doing that level of inner-kernel detail
> > scrutiny.
> > 
> > I'll repost this fix suggestion getting rid of those related
> > checkpoints.
> 
> Here is what I have after I had to context switch to something else
> before finishing it. Feel free to reuse if you feel like. It is likely
> to not even compile.
>

I'm OK with it, if you want to go ahead and do the kill.

Thanks 
-- Rafael
Michal Hocko March 24, 2020, 3:42 p.m. UTC | #21
[Cc Eric - the email thread starts http://lkml.kernel.org/r/20200322013525.1095493-1-aquini@redhat.com]

On Mon 23-03-20 11:54:49, Rafael Aquini wrote:
[...]
> I'm OK with it, if you want to go ahead and do the kill.

See the patch below

From 07c08f387d036c70239d4060ffd30534cf77a0a5 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Mon, 23 Mar 2020 17:33:46 +0100
Subject: [PATCH] selftests: vm: drop dependencies on page flags from mlock2
 tests

It was noticed that mlock2 tests are failing after 9c4e6b1a7027f
("mm, mlock, vmscan: no more skipping pagevecs") because the patch has
changed the timing of when the page is added to the unevictable LRU list
and thus gains the unevictable page flag.

The test was just too dependent on implementation details which were
true at the time it was introduced. Page flags, and the timing of when
they are set, are something no userspace should ever depend on. The test
should be testing only for the user-observable contract of the tested
syscalls. Those are defined pretty well for mlock, and there are
other means for testing them. In fact this is already done, and testing
for page flags can be safely dropped to achieve the aimed purpose.
Present bits can be checked via the Rss field of /proc/<pid>/smaps and
the locking state via VmFlags, although I would argue that the Locked:
field would be more appropriate.

Drop all the page flag machinery and considerably simplify the test.
This should be more robust for future kernel changes while checking the
promised contract is still valid.

Reported-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 tools/testing/selftests/vm/mlock2-tests.c | 233 ++++------------------
 1 file changed, 37 insertions(+), 196 deletions(-)

diff --git a/tools/testing/selftests/vm/mlock2-tests.c b/tools/testing/selftests/vm/mlock2-tests.c
index 637b6d0ac0d0..11b2301f3aa3 100644
--- a/tools/testing/selftests/vm/mlock2-tests.c
+++ b/tools/testing/selftests/vm/mlock2-tests.c
@@ -67,59 +67,6 @@ static int get_vm_area(unsigned long addr, struct vm_boundaries *area)
 	return ret;
 }
 
-static uint64_t get_pageflags(unsigned long addr)
-{
-	FILE *file;
-	uint64_t pfn;
-	unsigned long offset;
-
-	file = fopen("/proc/self/pagemap", "r");
-	if (!file) {
-		perror("fopen pagemap");
-		_exit(1);
-	}
-
-	offset = addr / getpagesize() * sizeof(pfn);
-
-	if (fseek(file, offset, SEEK_SET)) {
-		perror("fseek pagemap");
-		_exit(1);
-	}
-
-	if (fread(&pfn, sizeof(pfn), 1, file) != 1) {
-		perror("fread pagemap");
-		_exit(1);
-	}
-
-	fclose(file);
-	return pfn;
-}
-
-static uint64_t get_kpageflags(unsigned long pfn)
-{
-	uint64_t flags;
-	FILE *file;
-
-	file = fopen("/proc/kpageflags", "r");
-	if (!file) {
-		perror("fopen kpageflags");
-		_exit(1);
-	}
-
-	if (fseek(file, pfn * sizeof(flags), SEEK_SET)) {
-		perror("fseek kpageflags");
-		_exit(1);
-	}
-
-	if (fread(&flags, sizeof(flags), 1, file) != 1) {
-		perror("fread kpageflags");
-		_exit(1);
-	}
-
-	fclose(file);
-	return flags;
-}
-
 #define VMFLAGS "VmFlags:"
 
 static bool is_vmflag_set(unsigned long addr, const char *vmflag)
@@ -159,19 +106,13 @@ static bool is_vmflag_set(unsigned long addr, const char *vmflag)
 #define RSS  "Rss:"
 #define LOCKED "lo"
 
-static bool is_vma_lock_on_fault(unsigned long addr)
+static unsigned long get_value_for_name(unsigned long addr, const char *name)
 {
-	bool ret = false;
-	bool locked;
-	FILE *smaps = NULL;
-	unsigned long vma_size, vma_rss;
 	char *line = NULL;
-	char *value;
 	size_t size = 0;
-
-	locked = is_vmflag_set(addr, LOCKED);
-	if (!locked)
-		goto out;
+	char *value_ptr;
+	FILE *smaps = NULL;
+	unsigned long value = -1UL;
 
 	smaps = seek_to_smaps_entry(addr);
 	if (!smaps) {
@@ -180,112 +121,70 @@ static bool is_vma_lock_on_fault(unsigned long addr)
 	}
 
 	while (getline(&line, &size, smaps) > 0) {
-		if (!strstr(line, SIZE)) {
+		if (!strstr(line, name)) {
 			free(line);
 			line = NULL;
 			size = 0;
 			continue;
 		}
 
-		value = line + strlen(SIZE);
-		if (sscanf(value, "%lu kB", &vma_size) < 1) {
+		value_ptr = line + strlen(name);
+		if (sscanf(value_ptr, "%lu kB", &value) < 1) {
 			printf("Unable to parse smaps entry for Size\n");
 			goto out;
 		}
 		break;
 	}
 
-	while (getline(&line, &size, smaps) > 0) {
-		if (!strstr(line, RSS)) {
-			free(line);
-			line = NULL;
-			size = 0;
-			continue;
-		}
-
-		value = line + strlen(RSS);
-		if (sscanf(value, "%lu kB", &vma_rss) < 1) {
-			printf("Unable to parse smaps entry for Rss\n");
-			goto out;
-		}
-		break;
-	}
-
-	ret = locked && (vma_rss < vma_size);
 out:
-	free(line);
 	if (smaps)
 		fclose(smaps);
-	return ret;
+	free(line);
+	return value;
 }
 
-#define PRESENT_BIT     0x8000000000000000ULL
-#define PFN_MASK        0x007FFFFFFFFFFFFFULL
-#define UNEVICTABLE_BIT (1UL << 18)
-
-static int lock_check(char *map)
+static bool is_vma_lock_on_fault(unsigned long addr)
 {
-	unsigned long page_size = getpagesize();
-	uint64_t page1_flags, page2_flags;
+	bool locked;
+	unsigned long vma_size, vma_rss;
 
-	page1_flags = get_pageflags((unsigned long)map);
-	page2_flags = get_pageflags((unsigned long)map + page_size);
+	locked = is_vmflag_set(addr, LOCKED);
+	if (!locked)
+		return false;
 
-	/* Both pages should be present */
-	if (((page1_flags & PRESENT_BIT) == 0) ||
-	    ((page2_flags & PRESENT_BIT) == 0)) {
-		printf("Failed to make both pages present\n");
-		return 1;
-	}
+	vma_size = get_value_for_name(addr, SIZE);
+	vma_rss = get_value_for_name(addr, RSS);
 
-	page1_flags = get_kpageflags(page1_flags & PFN_MASK);
-	page2_flags = get_kpageflags(page2_flags & PFN_MASK);
+	/* only one page is faulted in */
+	return (vma_rss < vma_size);
+}
 
-	/* Both pages should be unevictable */
-	if (((page1_flags & UNEVICTABLE_BIT) == 0) ||
-	    ((page2_flags & UNEVICTABLE_BIT) == 0)) {
-		printf("Failed to make both pages unevictable\n");
-		return 1;
-	}
+#define PRESENT_BIT     0x8000000000000000ULL
+#define PFN_MASK        0x007FFFFFFFFFFFFFULL
+#define UNEVICTABLE_BIT (1UL << 18)
 
-	if (!is_vmflag_set((unsigned long)map, LOCKED)) {
-		printf("VMA flag %s is missing on page 1\n", LOCKED);
-		return 1;
-	}
+static int lock_check(unsigned long addr)
+{
+	bool locked;
+	unsigned long vma_size, vma_rss;
 
-	if (!is_vmflag_set((unsigned long)map + page_size, LOCKED)) {
-		printf("VMA flag %s is missing on page 2\n", LOCKED);
-		return 1;
-	}
+	locked = is_vmflag_set(addr, LOCKED);
+	if (!locked)
+		return false;
 
-	return 0;
+	vma_size = get_value_for_name(addr, SIZE);
+	vma_rss = get_value_for_name(addr, RSS);
+
+	return (vma_rss == vma_size);
 }
 
 static int unlock_lock_check(char *map)
 {
-	unsigned long page_size = getpagesize();
-	uint64_t page1_flags, page2_flags;
-
-	page1_flags = get_pageflags((unsigned long)map);
-	page2_flags = get_pageflags((unsigned long)map + page_size);
-	page1_flags = get_kpageflags(page1_flags & PFN_MASK);
-	page2_flags = get_kpageflags(page2_flags & PFN_MASK);
-
-	if ((page1_flags & UNEVICTABLE_BIT) || (page2_flags & UNEVICTABLE_BIT)) {
-		printf("A page is still marked unevictable after unlock\n");
-		return 1;
-	}
-
 	if (is_vmflag_set((unsigned long)map, LOCKED)) {
 		printf("VMA flag %s is present on page 1 after unlock\n", LOCKED);
 		return 1;
 	}
 
-	if (is_vmflag_set((unsigned long)map + page_size, LOCKED)) {
-		printf("VMA flag %s is present on page 2 after unlock\n", LOCKED);
-		return 1;
-	}
-
 	return 0;
 }
 
@@ -311,7 +210,7 @@ static int test_mlock_lock()
 		goto unmap;
 	}
 
-	if (lock_check(map))
+	if (!lock_check((unsigned long)map))
 		goto unmap;
 
 	/* Now unlock and recheck attributes */
@@ -330,64 +229,18 @@ static int test_mlock_lock()
 
 static int onfault_check(char *map)
 {
-	unsigned long page_size = getpagesize();
-	uint64_t page1_flags, page2_flags;
-
-	page1_flags = get_pageflags((unsigned long)map);
-	page2_flags = get_pageflags((unsigned long)map + page_size);
-
-	/* Neither page should be present */
-	if ((page1_flags & PRESENT_BIT) || (page2_flags & PRESENT_BIT)) {
-		printf("Pages were made present by MLOCK_ONFAULT\n");
-		return 1;
-	}
-
 	*map = 'a';
-	page1_flags = get_pageflags((unsigned long)map);
-	page2_flags = get_pageflags((unsigned long)map + page_size);
-
-	/* Only page 1 should be present */
-	if ((page1_flags & PRESENT_BIT) == 0) {
-		printf("Page 1 is not present after fault\n");
-		return 1;
-	} else if (page2_flags & PRESENT_BIT) {
-		printf("Page 2 was made present\n");
-		return 1;
-	}
-
-	page1_flags = get_kpageflags(page1_flags & PFN_MASK);
-
-	/* Page 1 should be unevictable */
-	if ((page1_flags & UNEVICTABLE_BIT) == 0) {
-		printf("Failed to make faulted page unevictable\n");
-		return 1;
-	}
-
 	if (!is_vma_lock_on_fault((unsigned long)map)) {
 		printf("VMA is not marked for lock on fault\n");
 		return 1;
 	}
 
-	if (!is_vma_lock_on_fault((unsigned long)map + page_size)) {
-		printf("VMA is not marked for lock on fault\n");
-		return 1;
-	}
-
 	return 0;
 }
 
 static int unlock_onfault_check(char *map)
 {
 	unsigned long page_size = getpagesize();
-	uint64_t page1_flags;
-
-	page1_flags = get_pageflags((unsigned long)map);
-	page1_flags = get_kpageflags(page1_flags & PFN_MASK);
-
-	if (page1_flags & UNEVICTABLE_BIT) {
-		printf("Page 1 is still marked unevictable after unlock\n");
-		return 1;
-	}
 
 	if (is_vma_lock_on_fault((unsigned long)map) ||
 	    is_vma_lock_on_fault((unsigned long)map + page_size)) {
@@ -445,7 +298,6 @@ static int test_lock_onfault_of_present()
 	char *map;
 	int ret = 1;
 	unsigned long page_size = getpagesize();
-	uint64_t page1_flags, page2_flags;
 
 	map = mmap(NULL, 2 * page_size, PROT_READ | PROT_WRITE,
 		   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
@@ -465,17 +317,6 @@ static int test_lock_onfault_of_present()
 		goto unmap;
 	}
 
-	page1_flags = get_pageflags((unsigned long)map);
-	page2_flags = get_pageflags((unsigned long)map + page_size);
-	page1_flags = get_kpageflags(page1_flags & PFN_MASK);
-	page2_flags = get_kpageflags(page2_flags & PFN_MASK);
-
-	/* Page 1 should be unevictable */
-	if ((page1_flags & UNEVICTABLE_BIT) == 0) {
-		printf("Failed to make present page unevictable\n");
-		goto unmap;
-	}
-
 	if (!is_vma_lock_on_fault((unsigned long)map) ||
 	    !is_vma_lock_on_fault((unsigned long)map + page_size)) {
 		printf("VMA with present pages is not marked lock on fault\n");
@@ -507,7 +348,7 @@ static int test_munlockall()
 		goto out;
 	}
 
-	if (lock_check(map))
+	if (!lock_check((unsigned long)map))
 		goto unmap;
 
 	if (munlockall()) {
@@ -549,7 +390,7 @@ static int test_munlockall()
 		goto out;
 	}
 
-	if (lock_check(map))
+	if (!lock_check((unsigned long)map))
 		goto unmap;
 
 	if (munlockall()) {
Rafael Aquini March 24, 2020, 3:49 p.m. UTC | #22
On Tue, Mar 24, 2020 at 04:42:18PM +0100, Michal Hocko wrote:
> [Cc Eric - the email thread starts http://lkml.kernel.org/r/20200322013525.1095493-1-aquini@redhat.com]
> 
> On Mon 23-03-20 11:54:49, Rafael Aquini wrote:
> [...]
> > I'm OK with it, if you want to go ahead and do the kill.
> 
> See the patch below
> 
> From 07c08f387d036c70239d4060ffd30534cf77a0a5 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Mon, 23 Mar 2020 17:33:46 +0100
> Subject: [PATCH] selftests: vm: drop dependencies on page flags from mlock2
>  tests
> 
> [...]

Thanks Michal!

 
Acked-by: Rafael Aquini <aquini@redhat.com>
Andrew Morton March 26, 2020, 12:49 a.m. UTC | #23
On Tue, 24 Mar 2020 11:49:10 -0400 Rafael Aquini <aquini@redhat.com> wrote:

> Thanks Michal!
> 
>  
> Acked-by: Rafael Aquini <aquini@redhat.com>

I'll add
Fixes: 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs")
and cc:stable to this.  Any kernel which has 9c4e6b1a7027f will benefit
from this change.

We're getting quite late in the cycle now so I think I'll hold off
merging this up until post-5.7-rc1.  It will still get into the stable
trees, but a bit later.
Rafael Aquini March 26, 2020, 2:22 a.m. UTC | #24
On Wed, Mar 25, 2020 at 05:49:49PM -0700, Andrew Morton wrote:
> On Tue, 24 Mar 2020 11:49:10 -0400 Rafael Aquini <aquini@redhat.com> wrote:
> 
> > Thanks Michal!
> > 
> >  
> > Acked-by: Rafael Aquini <aquini@redhat.com>
> 
> I'll add
> Fixes: 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs")
> and cc:stable to this.  Any kernel which has 9c4e6b1a7027f will benefit
> from this change.
> 
> We're getting quite late in the cycle now so I think I'll hold off
> merging this up until post-5.7-rc1.  It will still get into the stable
> trees, but a bit later.
>

Thank you, Andrew

-- Rafael
Michal Hocko March 26, 2020, 6:49 a.m. UTC | #25
On Wed 25-03-20 17:49:49, Andrew Morton wrote:
> On Tue, 24 Mar 2020 11:49:10 -0400 Rafael Aquini <aquini@redhat.com> wrote:
> 
> > Thanks Michal!
> > 
> >  
> > Acked-by: Rafael Aquini <aquini@redhat.com>
> 
> I'll add
> Fixes: 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs")

Wouldn't this be misleading? It would suggest that this commit is
somehow incorrect. I would consider b3b0d09c7a23 ("selftests: vm: add
tests for lock on fault") to be a better fit.

> and cc:stable to this.  Any kernel which has 9c4e6b1a7027f will benefit
> from this change.

OK, makes sense with
Cc: stable # with 9c4e6b1a7027f applied
 
> We're getting quite late in the cycle now so I think I'll hold off
> merging this up until post-5.7-rc1.  It will still get into the stable
> trees, but a bit later.

No problem. Nothing really urgent. Coincidentally we've just had a report
from our partner about this very specific test failure for our
enterprise kernel as well. I will just backport the patch as it seems
nobody really objects to it.

Thanks!
Andrew Morton March 26, 2020, 7:58 p.m. UTC | #26
On Thu, 26 Mar 2020 07:49:09 +0100 Michal Hocko <mhocko@kernel.org> wrote:

> On Wed 25-03-20 17:49:49, Andrew Morton wrote:
> > On Tue, 24 Mar 2020 11:49:10 -0400 Rafael Aquini <aquini@redhat.com> wrote:
> > 
> > > Thanks Michal!
> > > 
> > >  
> > > Acked-by: Rafael Aquini <aquini@redhat.com>
> > 
> > I'll add
> > Fixes: 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs")
> 
> Wouldn't this be misleading? It would suggest that this commit is
> somehow incorrect. I would consider b3b0d09c7a23 ("selftests: vm: add
> tests for lock on fault") to be a better fit.

Yes, it's a bit misleading.

Or maybe not.  b3b0d09c7a23 was merged in 2015 and worked OK (afaik)
until 9c4e6b1a7027f came along in 2020.  So arguably, 2020's
9c4e6b1a7027f was correct but incomplete, since it failed to repair the
impact upon the test code.

I don't think that kernels prior to 2020's 9c4e6b1a7027f require this
change(?), so we only need to backport this into 5.6.x, which is what
my proposed Fixes: and cc:stable expresses.

> > and cc:stable to this.  Any kernel which has 9c4e6b1a7027f will benefit
> > from this change.
> 
> OK, makes sense with
> Cc: stable # with 9c4e6b1a7027f applied
>  
> > We're getting quite late in the cycle now so I think I'll hold off
> > merging this up until post-5.7-rc1.  It will still get into the stable
> > trees, but a bit later.
> 
> No problem. Nothing really urget. Coincidentaly we've just had a report
> from our partner about this very specific test failure for our
> enterprise kernel as well.

I assume that your kernel has 2020's 9c4e6b1a7027f?

> I will just backport the patch as it seems
> nobody really objects to it.
Rafael Aquini March 26, 2020, 8:16 p.m. UTC | #27
On Thu, Mar 26, 2020 at 12:58:09PM -0700, Andrew Morton wrote:
> On Thu, 26 Mar 2020 07:49:09 +0100 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > On Wed 25-03-20 17:49:49, Andrew Morton wrote:
> > > On Tue, 24 Mar 2020 11:49:10 -0400 Rafael Aquini <aquini@redhat.com> wrote:
> > > 
> > > > Thanks Michal!
> > > > 
> > > >  
> > > > Acked-by: Rafael Aquini <aquini@redhat.com>
> > > 
> > > I'll add
> > > Fixes: 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs")
> > 
> > Wouldn't this be misleading? It would suggest that this commit is
> > somehow incorrect. I would consider b3b0d09c7a23 ("selftests: vm: add
> > tests for lock on fault") to be a better fit.
> 
> Yes, it's a bit misleading.
> 
> Or maybe not.  b3b0d09c7a23 was merged in 2015 and worked OK (afaik)
> until 9c4e6b1a7027f came along in 2020.  So arguably, 2020's
> 9c4e6b1a7027f was correct but incomplete, since it failed to repair the
> impact upon the test code.
>

That's correct, and it was my reading, as well.
 
> I don't think that kernels prior to 2020's 9c4e6b1a7027f require this
> change(?), so we only need to backport this into 5.6.x, which is what
> my proposed Fixes: and cc:stable expresses.
> 

Agreed

-- Rafael
Rafael Aquini March 26, 2020, 8:19 p.m. UTC | #28
On Thu, Mar 26, 2020 at 12:58:09PM -0700, Andrew Morton wrote:
> On Thu, 26 Mar 2020 07:49:09 +0100 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > On Wed 25-03-20 17:49:49, Andrew Morton wrote:
> > > On Tue, 24 Mar 2020 11:49:10 -0400 Rafael Aquini <aquini@redhat.com> wrote:
> > > 
> > > > Thanks Michal!
> > > > 
> > > >  
> > > > Acked-by: Rafael Aquini <aquini@redhat.com>
> > > 
> > > I'll add
> > > Fixes: 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs")
> > 
> > Wouldn't this be misleading? It would suggest that this commit is
> > somehow incorrect. I would consider b3b0d09c7a23 ("selftests: vm: add
> > tests for lock on fault") to be a better fit.
> 
> Yes, it's a bit misleading.
> 
> Or maybe not.  b3b0d09c7a23 was merged in 2015 and worked OK (afaik)
> until 9c4e6b1a7027f came along in 2020.  So arguably, 2020's
> 9c4e6b1a7027f was correct but incomplete, since it failed to repair the
> impact upon the test code.
> 
> I don't think that kernels prior to 2020's 9c4e6b1a7027f require this
> change(?), so we only need to backport this into 5.6.x, which is what

Only one note here: 9c4e6b1a7027f is from 2018, and was merged in 4.16,
AFAICS.

-- Rafael

Patch
diff mbox series

diff --git a/tools/testing/selftests/vm/mlock2-tests.c b/tools/testing/selftests/vm/mlock2-tests.c
index 637b6d0ac0d0..26dc320ca3c9 100644
--- a/tools/testing/selftests/vm/mlock2-tests.c
+++ b/tools/testing/selftests/vm/mlock2-tests.c
@@ -7,6 +7,7 @@ 
 #include <sys/time.h>
 #include <sys/resource.h>
 #include <stdbool.h>
+#include <sched.h>
 #include "mlock2.h"
 
 #include "../kselftest.h"
@@ -328,6 +329,22 @@  static int test_mlock_lock()
 	return ret;
 }
 
+/*
+ * After commit 9c4e6b1a7027f ("mm, mlock, vmscan: no more skipping pagevecs")
+ * changes made by calls to mlock* family might not be immediately reflected
+ * on the LRUs, thus checking the PFN flags might race against pagevec drain.
+ *
+ * In order to sort out that race, and get the after fault checks consistent,
+ * the "quick and dirty" trick below is required in order to force a call to
+ * lru_add_drain_all() to get the recently MLOCK_ONFAULT pages moved to
+ * the unevictable LRU, as expected by the checks in this selftest.
+ */
+static void force_lru_add_drain_all(void)
+{
+	sched_yield();
+	system("echo 1 > /proc/sys/vm/compact_memory");
+}
+
 static int onfault_check(char *map)
 {
 	unsigned long page_size = getpagesize();
@@ -343,6 +360,9 @@  static int onfault_check(char *map)
 	}
 
 	*map = 'a';
+
+	force_lru_add_drain_all();
+
 	page1_flags = get_pageflags((unsigned long)map);
 	page2_flags = get_pageflags((unsigned long)map + page_size);
 
@@ -465,6 +485,8 @@  static int test_lock_onfault_of_present()
 		goto unmap;
 	}
 
+	force_lru_add_drain_all();
+
 	page1_flags = get_pageflags((unsigned long)map);
 	page2_flags = get_pageflags((unsigned long)map + page_size);
 	page1_flags = get_kpageflags(page1_flags & PFN_MASK);