
[v3] mm: fix race between MADV_FREE reclaim and blkdev direct IO read

Message ID 20220131230255.789059-1-mfo@canonical.com (mailing list archive)
State New, archived
Series [v3] mm: fix race between MADV_FREE reclaim and blkdev direct IO read

Commit Message

Mauricio Faria de Oliveira Jan. 31, 2022, 11:02 p.m. UTC
Problem:
=======

Userspace might read the zero-page instead of actual data from a
direct IO read on a block device if the buffers have previously
been passed to madvise(MADV_FREE) (discussed below), due to a
race between page reclaim on MADV_FREE and blkdev direct IO read.

- Race condition:
  ==============

During page reclaim, the MADV_FREE page check in try_to_unmap_one()
checks if the page is not dirty, then discards its rmap PTE(s) (vs.
remap back if the page is dirty).

However, after try_to_unmap_one() returns to shrink_page_list(), it
might keep the page _anyway_ if page_ref_freeze() fails (it expects
exactly _one_ page reference, from the isolation for page reclaim).

Well, blkdev_direct_IO() gets references for all pages, and on READ
operations it only sets them dirty _later_.

So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for
direct IO read from block devices, and page reclaim happens during
__blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages()
returns, but BEFORE the pages are set dirty, the problem occurs.

The direct IO read eventually completes. Now, when userspace reads
the buffers, the PTE is no longer there and the page fault handler
do_anonymous_page() services that with the zero-page, NOT the data!

A synthetic reproducer is provided.

- Page faults:
  ===========

If page reclaim happens BEFORE bio_iov_iter_get_pages(), the issue
doesn't happen, because that faults in all pages as writable, so
do_anonymous_page() sets up a new page/rmap/PTE, and that is used
by direct IO. The userspace reads don't fault, as the PTE is there
(thus the zero-page is not used/set up).

But if page reclaim happens AFTER it / BEFORE setting pages dirty,
the PTE is no longer there; the subsequent page faults can't help:

The data read from the block device usually won't generate faults,
as it is done via DMA (which bypasses the MMU); and even without
DMA, the reads happen through different virtual addresses (kernel
mappings, not user-mapped addresses), because `struct bio_vec`
stores `struct page` pointers to derive the addresses for the read.

Thus userspace reads (to user-mapped addresses) still fault, then
do_anonymous_page() gets another `struct page` that would address/
map to other memory than the `struct page` used by `struct bio_vec`
for the read.  (The original `struct page` is not available, since
it wasn't freed, as page_ref_freeze() failed due to more page refs.
And even if it were available, its data cannot be trusted anymore.)

Solution:
========

One solution is to check for the expected page reference count
in try_to_unmap_one().

There should be one reference from the isolation (that is also
checked in shrink_page_list() with page_ref_freeze()) plus one
or more references from page mapping(s) (put in discard: label).
Further references mean that rmap/PTE cannot be unmapped/nuked.

(Note: there might be more than one reference from mapping due
to fork()/clone() without CLONE_VM, which use the same `struct
page` for references, until the copy-on-write page gets copied.)

So, additional page references (e.g., from a direct IO read) now
prevent the rmap/PTE from being unmapped/dropped, just as they
prevent the page from being freed in shrink_page_list() when
page_ref_freeze() fails.

- Races and Barriers:
  ==================

The new check in try_to_unmap_one() should be safe against races
with bio_iov_iter_get_pages(), in both the get_user_pages() fast
and slow paths, as it's done under the PTE lock.

The fast path doesn't take the lock, but it checks if the PTE has
changed and if so, it drops the reference and leaves the page for
the slow path (which does take that lock).

The fast path requires synchronization w/ full memory barrier: it
writes the page reference count first then it reads the PTE later,
while try_to_unmap() writes PTE first then it reads page refcount.

And a second barrier is needed, as the page dirty flag should not
be read before the page reference count (as in __remove_mapping()).
(This can be a load memory barrier only; no writes are involved.)

Call stack/comments:

- try_to_unmap_one()
  - page_vma_mapped_walk()
    - map_pte()			# see pte_offset_map_lock():
        pte_offset_map()
        spin_lock()

  - ptep_get_and_clear()	# write PTE
  - smp_mb()			# (new barrier) GUP fast path
  - page_ref_count()		# (new check) read refcount

  - page_vma_mapped_walk_done()	# see pte_unmap_unlock():
      pte_unmap()
      spin_unlock()

- bio_iov_iter_get_pages()
  - __bio_iov_iter_get_pages()
    - iov_iter_get_pages()
      - get_user_pages_fast()
        - internal_get_user_pages_fast()

          # fast path
          - lockless_pages_from_mm()
            - gup_{pgd,p4d,pud,pmd,pte}_range()
                ptep = pte_offset_map()		# not _lock()
                pte = ptep_get_lockless(ptep)

                page = pte_page(pte)
                try_grab_compound_head(page)	# inc refcount
                                            	# (RMW/barrier
                                             	#  on success)

                if (pte_val(pte) != pte_val(*ptep)) # read PTE
                        put_compound_head(page) # dec refcount
                        			# go slow path

          # slow path
          - __gup_longterm_unlocked()
            - get_user_pages_unlocked()
              - __get_user_pages_locked()
                - __get_user_pages()
                  - follow_{page,p4d,pud,pmd}_mask()
                    - follow_page_pte()
                        ptep = pte_offset_map_lock()
                        pte = *ptep
                        page = vm_normal_page(pte)
                        try_grab_page(page)	# inc refcount
                        pte_unmap_unlock()

- Huge Pages:
  ==========

Regarding transparent hugepages, that logic shouldn't change, as
MADV_FREE (aka lazyfree) pages are PageAnon() && !PageSwapBacked()
(madvise_free_pte_range() -> mark_page_lazyfree() -> lru_lazyfree_fn())
thus should reach shrink_page_list() -> split_huge_page_to_list()
before try_to_unmap[_one](), so it deals with normal pages only.

(And in case the unlikely TTU_SPLIT_HUGE_PMD/split_huge_pmd_address()
happens, which should not occur or at least be rare, the page refcount
should be greater than mapcount: the head page is referenced by tail
pages. That also prevents checking the head `page` and then incorrectly
calling page_remove_rmap(subpage) for a tail page that isn't even in
shrink_page_list()'s page_list (an effect of split huge pmd/pvmw),
as might happen today in this unlikely scenario.)

MADV_FREE'd buffers:
===================

So, back to the "if MADV_FREE pages are used as buffers" note.
The case is arguable, and subject to multiple interpretations.

The madvise(2) manual page on the MADV_FREE advice value says:

1) 'After a successful MADV_FREE ... data will be lost when
   the kernel frees the pages.'
2) 'the free operation will be canceled if the caller writes
   into the page' / 'subsequent writes ... will succeed and
   then [the] kernel cannot free those dirtied pages'
3) 'If there is no subsequent write, the kernel can free the
   pages at any time.'

Thoughts, questions, considerations... respectively:

1) Since the kernel didn't actually free the page (page_ref_freeze()
   failed), should the data not have been lost? (on userspace read.)
2) Should writes performed by the direct IO read be able to cancel
   the free operation?
   - Should the direct IO read be considered as 'the caller' too,
     as it's been requested by 'the caller'?
   - Should the bio technique to dirty pages on return to userspace
     (bio_check_pages_dirty() is called/used by __blkdev_direct_IO())
     be considered in another/special way here?
3) Should an upcoming write from a previously requested direct IO
   read be considered as a subsequent write, so the kernel should
   not free the pages? (as it's known at the time of page reclaim.)

At last:

Technically, the last point seems a reasonable consideration and
balance, as the madvise(2) manual page apparently (and fairly)
assumes that 'writes' are memory accesses from the userspace
process (not explicitly considering writes from the kernel or its
corner cases; again, fairly). Plus, the kernel fix for the corner
case of the largely 'non-atomic write' encompassed by a direct IO
read operation is relatively simple, and it helps.

Reproducer:
==========

@ test.c (simplified, but works)

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>
	#include <sys/mman.h>

	int main() {
		int fd, i;
		char *buf;

		fd = open(DEV, O_RDONLY | O_DIRECT);

		buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                	   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
			buf[i] = 1; // init to non-zero

		madvise(buf, BUF_SIZE, MADV_FREE);

		read(fd, buf, BUF_SIZE);

		for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
			printf("%p: 0x%x\n", &buf[i], buf[i]);

		return 0;
	}

@ block/fops.c (formerly fs/block_dev.c)

	+#include <linux/swap.h>
	...
	... __blkdev_direct_IO[_simple](...)
	{
	...
	+	if (!strcmp(current->comm, "good"))
	+		shrink_all_memory(ULONG_MAX);
	+
         	ret = bio_iov_iter_get_pages(...);
	+
	+	if (!strcmp(current->comm, "bad"))
	+		shrink_all_memory(ULONG_MAX);
	...
	}

@ shell

        # NUM_PAGES=4
        # PAGE_SIZE=$(getconf PAGE_SIZE)

        # yes | dd of=test.img bs=${PAGE_SIZE} count=${NUM_PAGES}
        # DEV=$(losetup -f --show test.img)

        # gcc -DDEV=\"$DEV\" \
              -DBUF_SIZE=$((PAGE_SIZE * NUM_PAGES)) \
              -DPAGE_SIZE=${PAGE_SIZE} \
               test.c -o test

        # od -tx1 $DEV
        0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a
        *
        0040000

        # mv test good
        # ./good
        0x7f7c10418000: 0x79
        0x7f7c10419000: 0x79
        0x7f7c1041a000: 0x79
        0x7f7c1041b000: 0x79

        # mv good bad
        # ./bad
        0x7fa1b8050000: 0x0
        0x7fa1b8051000: 0x0
        0x7fa1b8052000: 0x0
        0x7fa1b8053000: 0x0

Ceph/TCMalloc:
=============

For documentation purposes, the use case driving the analysis/fix
is Ceph on Ubuntu 18.04, as the TCMalloc library there still uses
MADV_FREE to release unused memory to the system from the mmap'ed
page heap (might be committed back/used again; it's not munmap'ed.)
- PageHeap::DecommitSpan() -> TCMalloc_SystemRelease() -> madvise()
- PageHeap::CommitSpan() -> TCMalloc_SystemCommit() -> do nothing.

Note: TCMalloc switched back to MADV_DONTNEED a few commits after
the release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so
the issue just 'disappeared' on Ceph on later Ubuntu releases but
is still present in the kernel, and can be hit by other use cases.

The observed issue seems to be the old Ceph bug #22464 [1], where
checksum mismatches are observed (and instrumentation with buffer
dumps shows zero-pages read from mmap'ed/MADV_FREE'd page ranges).

The issue in Ceph was reasonably deemed a kernel bug (comment #50)
and mostly worked around with a retry mechanism, but other parts
of Ceph could still hit that (rocksdb). Anyway, it's less likely
to be hit again as TCMalloc switched out of MADV_FREE by default.

(Some kernel versions/reports from the Ceph bug, and relation with
the MADV_FREE introduction/changes; TCMalloc versions not checked.)
- 4.4 good
- 4.5 (madv_free: introduction)
- 4.9 bad
- 4.10 good? maybe a swapless system
- 4.12 (madv_free: no longer free instantly on swapless systems)
- 4.13 bad

[1] https://tracker.ceph.com/issues/22464

Thanks:
======

Several people contributed to analysis/discussions/tests/reproducers
in the first stages when drilling down on ceph/tcmalloc/linux kernel:

- Dan Hill <daniel.hill@canonical.com>
- Dan Streetman <dan.streetman@canonical.com>
- Dongdong Tao <dongdong.tao@canonical.com>
- Gavin Guo <gavin.guo@canonical.com>
- Gerald Yang <gerald.yang@canonical.com>
- Heitor Alves de Siqueira <halves@canonical.com>
- Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com>
- Jay Vosburgh <jay.vosburgh@canonical.com>
- Matthew Ruffell <matthew.ruffell@canonical.com>
- Ponnuvel Palaniyappan <ponnuvel.palaniyappan@canonical.com>

Fixes: 802a3a92ad7a ("mm: reclaim MADV_FREE pages")
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
---

changelog:

v3: - add full memory barrier to sync against GUP fast path;
      update comments. (Thanks: Yu Zhao <yuzhao@google.com>)
    - add Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      (no significant changes from v2).
    - add Fixes: tag.
    - minor changes/corrections to the commit message.
    - tested on v5.17-rc2.

v2: - check refcount against mapcount rather than a static 2.
      (Thanks: Minchan Kim <minchan@kernel.org>)

 mm/rmap.c   | 25 ++++++++++++++++++++++++-
 mm/vmscan.c |  2 +-
 2 files changed, 25 insertions(+), 2 deletions(-)

Comments

Andrew Morton Jan. 31, 2022, 11:43 p.m. UTC | #1
On Mon, 31 Jan 2022 20:02:55 -0300 Mauricio Faria de Oliveira <mfo@canonical.com> wrote:

> Problem:
> =======
> 
> Userspace might read the zero-page instead of actual data from a
> direct IO read on a block device if the buffers have been called
> madvise(MADV_FREE) on earlier (this is discussed below) due to a
> race between page reclaim on MADV_FREE and blkdev direct IO read.
> 
> ...
>
> Fixes: 802a3a92ad7a ("mm: reclaim MADV_FREE pages")

Five years ago.  As it doesn't seem urgent I targeted 5.18-rc1 for
this, and added a cc:stable so it will trickle back into earlier trees.

How does this plan sound?
Mauricio Faria de Oliveira Feb. 1, 2022, 2:23 a.m. UTC | #2
On Mon, Jan 31, 2022 at 8:43 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon, 31 Jan 2022 20:02:55 -0300 Mauricio Faria de Oliveira <mfo@canonical.com> wrote:
>
> > Problem:
> > =======
> >
> > Userspace might read the zero-page instead of actual data from a
> > direct IO read on a block device if the buffers have been called
> > madvise(MADV_FREE) on earlier (this is discussed below) due to a
> > race between page reclaim on MADV_FREE and blkdev direct IO read.
> >
> > ...
> >
> > Fixes: 802a3a92ad7a ("mm: reclaim MADV_FREE pages")
>
> Five years ago.  As it doesn't seem urgent I targeted 5.18-rc1 for
> this, and added a cc:stable so it will trickle back into earlier trees.
>
> How does this plan sound?

That sounds good; it's not urgent, indeed. Thanks!
Christoph Hellwig Feb. 2, 2022, 2:03 p.m. UTC | #3
On Mon, Jan 31, 2022 at 08:02:55PM -0300, Mauricio Faria de Oliveira wrote:
> Well, blkdev_direct_IO() gets references for all pages, and on READ
> operations it only sets them dirty _later_.
> 
> So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for
> direct IO read from block devices, and page reclaim happens during
> __blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages()
> returns, but BEFORE the pages are set dirty, the situation happens.
> 
> The direct IO read eventually completes. Now, when userspace reads
> the buffers, the PTE is no longer there and the page fault handler
> do_anonymous_page() services that with the zero-page, NOT the data!

So why not just set the pages dirty early like the other direct I/O
implementations?  Or if this is fine with the patch should we remove
the early dirtying elsewhere?

> Reproducer:
> ==========
> 
> @ test.c (simplified, but works)

Can you add this to blktests or some other regularly run regression
test suite?

> +				smp_rmb();
> +
> +				/*
> +				 * The only page refs must be from the isolation
> +				 * plus one or more rmap's (dropped by discard:).

Overly long line.

> +				 */
> +				if ((ref_count == 1 + map_count) &&

No need for the inner braces.
Mauricio Faria de Oliveira Feb. 2, 2022, 4:29 p.m. UTC | #4
On Wed, Feb 2, 2022 at 11:03 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Mon, Jan 31, 2022 at 08:02:55PM -0300, Mauricio Faria de Oliveira wrote:
> > Well, blkdev_direct_IO() gets references for all pages, and on READ
> > operations it only sets them dirty _later_.
> >
> > So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for
> > direct IO read from block devices, and page reclaim happens during
> > __blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages()
> > returns, but BEFORE the pages are set dirty, the situation happens.
> >
> > The direct IO read eventually completes. Now, when userspace reads
> > the buffers, the PTE is no longer there and the page fault handler
> > do_anonymous_page() services that with the zero-page, NOT the data!
>
> So why not just set the pages dirty early like the other direct I/O
> implementations?  Or if this is fine with the patch should we remove
> the early dirtying elsewhere?

In general, since this particular problem is specific to MADV_FREE,
it seemed about right to go for a more contained/particular solution
(than changes with broader impact/risk to things that aren't broken).

This isn't to say either approach shouldn't be pursued, but just that
the larger changes aren't strictly needed to actually fix _this_ issue
(and might complicate landing the fix into the stable/distro kernels.)

Now, specifically on the 2 suggestions you mentioned, I'm not very
familiar with other implementations, thus I can't speak to that, sorry.

However, on the 1st suggestion (set pages dirty early), John noted
[1] there might be issues with that and advised not going there.

>
> > Reproducer:
> > ==========
> >
> > @ test.c (simplified, but works)
>
> Can you add this to blktests or some other regularly run regression
> test suite?

Sure.

The test also needs the kernel-side change (to trigger memory reclaim),
which can probably be wired for blktests with a fault-injection capability.

Does that sound good? Maybe there's a better way to do it.

>
> > +                             smp_rmb();
> > +
> > +                             /*
> > +                              * The only page refs must be from the isolation
> > +                              * plus one or more rmap's (dropped by discard:).
>
> Overly long line.

Hmm, checkpatch.pl didn't complain about it. Ah, it checks for 100 chars.
Ok; v4.

>
> > +                              */
> > +                             if ((ref_count == 1 + map_count) &&
>
> No need for the inner braces.
>

Ok; v4.

I'll wait a bit in case more changes are needed, and send v4 w/ the above.

Thanks!

[1] https://lore.kernel.org/linux-mm/7094dbd6-de0c-9909-e657-e358e14dc6c3@nvidia.com/
Yu Zhao Feb. 2, 2022, 7:56 p.m. UTC | #5
On Mon, Jan 31, 2022 at 08:02:55PM -0300, Mauricio Faria de Oliveira wrote:
> Problem:
> =======

Thanks for the update. A couple of quick questions:

> Userspace might read the zero-page instead of actual data from a
> direct IO read on a block device if the buffers have been called
> madvise(MADV_FREE) on earlier (this is discussed below) due to a
> race between page reclaim on MADV_FREE and blkdev direct IO read.

1) would page migration be affected as well?

> @@ -1599,7 +1599,30 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  
>  			/* MADV_FREE page check */
>  			if (!PageSwapBacked(page)) {
> -				if (!PageDirty(page)) {
> +				int ref_count, map_count;
> +
> +				/*
> +				 * Synchronize with gup_pte_range():
> +				 * - clear PTE; barrier; read refcount
> +				 * - inc refcount; barrier; read PTE
> +				 */
> +				smp_mb();
> +
> +				ref_count = page_count(page);
> +				map_count = page_mapcount(page);
> +
> +				/*
> +				 * Order reads for page refcount and dirty flag;
> +				 * see __remove_mapping().
> +				 */
> +				smp_rmb();

2) why does it need to order against __remove_mapping()? It seems to
   me that here (called from the reclaim path) it can't race with
   __remove_mapping() because both lock the page.

> +				/*
> +				 * The only page refs must be from the isolation
> +				 * plus one or more rmap's (dropped by discard:).
> +				 */
> +				if ((ref_count == 1 + map_count) &&
> +				    !PageDirty(page)) {
>  					/* Invalidate as we cleared the pte */
>  					mmu_notifier_invalidate_range(mm,
>  						address, address + PAGE_SIZE);
Mauricio Faria de Oliveira Feb. 2, 2022, 9:27 p.m. UTC | #6
On Wed, Feb 2, 2022 at 4:56 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Mon, Jan 31, 2022 at 08:02:55PM -0300, Mauricio Faria de Oliveira wrote:
> > Problem:
> > =======
>
> Thanks for the update. A couple of quick questions:
>
> > Userspace might read the zero-page instead of actual data from a
> > direct IO read on a block device if the buffers have been called
> > madvise(MADV_FREE) on earlier (this is discussed below) due to a
> > race between page reclaim on MADV_FREE and blkdev direct IO read.
>
> 1) would page migration be affected as well?

Could you please elaborate on the potential problem you considered?

I checked migrate_pages() -> try_to_migrate() holds the page lock,
thus shouldn't race with shrink_page_list() -> with try_to_unmap()
(where the issue with MADV_FREE is), but maybe I didn't get you
correctly.

>
> > @@ -1599,7 +1599,30 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >
> >                       /* MADV_FREE page check */
> >                       if (!PageSwapBacked(page)) {
> > -                             if (!PageDirty(page)) {
> > +                             int ref_count, map_count;
> > +
> > +                             /*
> > +                              * Synchronize with gup_pte_range():
> > +                              * - clear PTE; barrier; read refcount
> > +                              * - inc refcount; barrier; read PTE
> > +                              */
> > +                             smp_mb();
> > +
> > +                             ref_count = page_count(page);
> > +                             map_count = page_mapcount(page);
> > +
> > +                             /*
> > +                              * Order reads for page refcount and dirty flag;
> > +                              * see __remove_mapping().
> > +                              */
> > +                             smp_rmb();
>
> 2) why does it need to order against __remove_mapping()? It seems to
>    me that here (called from the reclaim path) it can't race with
>    __remove_mapping() because both lock the page.

I'll improve that comment in v4.  The ordering isn't against __remove_mapping(),
but actually because of an issue described in __remove_mapping()'s comments
(something else that doesn't hold the page lock, just has a page reference, that
may clear the page dirty flag then drop the reference; thus check ref,
then dirty).

Hope this clarifies the question.

Thanks!

>
> > +                             /*
> > +                              * The only page refs must be from the isolation
> > +                              * plus one or more rmap's (dropped by discard:).
> > +                              */
> > +                             if ((ref_count == 1 + map_count) &&
> > +                                 !PageDirty(page)) {
> >                                       /* Invalidate as we cleared the pte */
> >                                       mmu_notifier_invalidate_range(mm,
> >                                               address, address + PAGE_SIZE);
Yu Zhao Feb. 2, 2022, 9:53 p.m. UTC | #7
On Wed, Feb 02, 2022 at 06:27:47PM -0300, Mauricio Faria de Oliveira wrote:
> On Wed, Feb 2, 2022 at 4:56 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Mon, Jan 31, 2022 at 08:02:55PM -0300, Mauricio Faria de Oliveira wrote:
> > > Problem:
> > > =======
> >
> > Thanks for the update. A couple of quick questions:
> >
> > > Userspace might read the zero-page instead of actual data from a
> > > direct IO read on a block device if the buffers have been called
> > > madvise(MADV_FREE) on earlier (this is discussed below) due to a
> > > race between page reclaim on MADV_FREE and blkdev direct IO read.
> >
> > 1) would page migration be affected as well?
> 
> Could you please elaborate on the potential problem you considered?
> 
> I checked migrate_pages() -> try_to_migrate() holds the page lock,
> thus shouldn't race with shrink_page_list() -> with try_to_unmap()
> (where the issue with MADV_FREE is), but maybe I didn't get you
> correctly.

Could the race exist between DIO and migration? While DIO is writing
to a page, could migration unmap it and copy the data from this page
to a new page?

> > > @@ -1599,7 +1599,30 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > >
> > >                       /* MADV_FREE page check */
> > >                       if (!PageSwapBacked(page)) {
> > > -                             if (!PageDirty(page)) {
> > > +                             int ref_count, map_count;
> > > +
> > > +                             /*
> > > +                              * Synchronize with gup_pte_range():
> > > +                              * - clear PTE; barrier; read refcount
> > > +                              * - inc refcount; barrier; read PTE
> > > +                              */
> > > +                             smp_mb();
> > > +
> > > +                             ref_count = page_count(page);
> > > +                             map_count = page_mapcount(page);
> > > +
> > > +                             /*
> > > +                              * Order reads for page refcount and dirty flag;
> > > +                              * see __remove_mapping().
> > > +                              */
> > > +                             smp_rmb();
> >
> > 2) why does it need to order against __remove_mapping()? It seems to
> >    me that here (called from the reclaim path) it can't race with
> >    __remove_mapping() because both lock the page.
> 
> I'll improve that comment in v4.  The ordering isn't against __remove_mapping(),
> but actually because of an issue described in __remove_mapping()'s comments
> (something else that doesn't hold the page lock, just has a page reference, that
> may clear the page dirty flag then drop the reference; thus check ref,
> then dirty).

Got it. IIRC, get_user_pages() doesn't imply a write barrier. If so,
there should be a smp_wmb() on the other side:

	 * get_user_pages(&page);

	smp_wmb()

	 * SetPageDirty(page);
	 * put_page(page);

(__remove_mapping() doesn't need smp_[rw]mb() on either side because
it relies on page refcnt freeze and retesting.)

Thanks.
Mauricio Faria de Oliveira Feb. 3, 2022, 10:17 p.m. UTC | #8
On Wed, Feb 2, 2022 at 6:53 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Wed, Feb 02, 2022 at 06:27:47PM -0300, Mauricio Faria de Oliveira wrote:
> > On Wed, Feb 2, 2022 at 4:56 PM Yu Zhao <yuzhao@google.com> wrote:
> > >
> > > On Mon, Jan 31, 2022 at 08:02:55PM -0300, Mauricio Faria de Oliveira wrote:
> > > > Problem:
> > > > =======
> > >
> > > Thanks for the update. A couple of quick questions:
> > >
> > > > Userspace might read the zero-page instead of actual data from a
> > > > direct IO read on a block device if the buffers have been called
> > > > madvise(MADV_FREE) on earlier (this is discussed below) due to a
> > > > race between page reclaim on MADV_FREE and blkdev direct IO read.
> > >
> > > 1) would page migration be affected as well?
> >
> > Could you please elaborate on the potential problem you considered?
> >
> > I checked migrate_pages() -> try_to_migrate() holds the page lock,
> > thus shouldn't race with shrink_page_list() -> with try_to_unmap()
> > (where the issue with MADV_FREE is), but maybe I didn't get you
> > correctly.
>
> Could the race exist between DIO and migration? While DIO is writing
> to a page, could migration unmap it and copy the data from this page
> to a new page?
>

Thanks for clarifying. I started looking into this, but since it's unrelated
to MADV_FREE (which doesn't apply to page migration), I guess this
shouldn't block this patch, if at all possible.  Is that OK with you?


> > > > @@ -1599,7 +1599,30 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > >
> > > >                       /* MADV_FREE page check */
> > > >                       if (!PageSwapBacked(page)) {
> > > > -                             if (!PageDirty(page)) {
> > > > +                             int ref_count, map_count;
> > > > +
> > > > +                             /*
> > > > +                              * Synchronize with gup_pte_range():
> > > > +                              * - clear PTE; barrier; read refcount
> > > > +                              * - inc refcount; barrier; read PTE
> > > > +                              */
> > > > +                             smp_mb();
> > > > +
> > > > +                             ref_count = page_count(page);
> > > > +                             map_count = page_mapcount(page);
> > > > +
> > > > +                             /*
> > > > +                              * Order reads for page refcount and dirty flag;
> > > > +                              * see __remove_mapping().
> > > > +                              */
> > > > +                             smp_rmb();
> > >
> > > 2) why does it need to order against __remove_mapping()? It seems to
> > >    me that here (called from the reclaim path) it can't race with
> > >    __remove_mapping() because both lock the page.
> >
> > I'll improve that comment in v4.  The ordering isn't against __remove_mapping(),
> > but actually because of an issue described in __remove_mapping()'s comments
> > (something else that doesn't hold the page lock, just has a page reference, that
> > may clear the page dirty flag then drop the reference; thus check ref,
> > then dirty).
>
> Got it. IIRC, get_user_pages() doesn't imply a write barrier. If so,
> there should be a smp_wmb() on the other side:

If I understand it correctly, it actually implies a full memory
barrier, doesn't it?

Because... gup_pte_range() (fast path) calls try_grab_compound_head(),
which eventually calls* atomic_add_unless(), an atomic conditional RMW
operation with return value, thus fully ordered on success (atomic_t.rst);
(on failure gup_pte_range() falls back to the slow path, below.)

And follow_page_pte() (slow path) calls try_grab_page(), which also calls
into try_grab_compound_head(), as the above.

(* on CONFIG_TINY_RCU, it calls just atomic_add(), which isn't ordered,
but that option is targeted for UP/!SMP, thus not a problem for this race.)

Looking at the implementation of arch_atomic_fetch_add_unless() on
more relaxed/weakly ordered archs (arm, powerpc, if I got that right),
there are barriers like 'smp_mb()' and 'sync' instruction if 'old != unless',
so that seems to be OK.

And the set_page_dirty() calls occur after get_user_pages() / that point.

Does that make sense?

Thanks!


>
>          * get_user_pages(&page);
>
>         smp_wmb()
>
>          * SetPageDirty(page);
>          * put_page(page);
>
> (__remove_mapping() doesn't need smp_[rw]mb() on either side because
> it relies on page refcnt freeze and retesting.)
>
> Thanks.
Yu Zhao Feb. 4, 2022, 5:56 a.m. UTC | #9
On Thu, Feb 3, 2022 at 3:17 PM Mauricio Faria de Oliveira
<mfo@canonical.com> wrote:
>
> On Wed, Feb 2, 2022 at 6:53 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Wed, Feb 02, 2022 at 06:27:47PM -0300, Mauricio Faria de Oliveira wrote:
> > > On Wed, Feb 2, 2022 at 4:56 PM Yu Zhao <yuzhao@google.com> wrote:
> > > >
> > > > On Mon, Jan 31, 2022 at 08:02:55PM -0300, Mauricio Faria de Oliveira wrote:
> > > > > Problem:
> > > > > =======
> > > >
> > > > Thanks for the update. A couple of quick questions:
> > > >
> > > > > Userspace might read the zero-page instead of actual data from a
> > > > > direct IO read on a block device if the buffers have been called
> > > > > madvise(MADV_FREE) on earlier (this is discussed below) due to a
> > > > > race between page reclaim on MADV_FREE and blkdev direct IO read.
> > > >
> > > > 1) would page migration be affected as well?
> > >
> > > Could you please elaborate on the potential problem you considered?
> > >
> > > I checked migrate_pages() -> try_to_migrate() holds the page lock,
> > > thus shouldn't race with shrink_page_list() -> with try_to_unmap()
> > > (where the issue with MADV_FREE is), but maybe I didn't get you
> > > correctly.
> >
> > Could the race exist between DIO and migration? While DIO is writing
> > to a page, could migration unmap it and copy the data from this page
> > to a new page?
> >
>
> Thanks for clarifying. I started looking into this, but since it's unrelated
> to MADV_FREE (which doesn't apply to page migration), I guess this
> shouldn't block this patch, if at all possible.  Is that OK with you?
>
>
> > > > > @@ -1599,7 +1599,30 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > >
> > > > >                       /* MADV_FREE page check */
> > > > >                       if (!PageSwapBacked(page)) {
> > > > > -                             if (!PageDirty(page)) {
> > > > > +                             int ref_count, map_count;
> > > > > +
> > > > > +                             /*
> > > > > +                              * Synchronize with gup_pte_range():
> > > > > +                              * - clear PTE; barrier; read refcount
> > > > > +                              * - inc refcount; barrier; read PTE
> > > > > +                              */
> > > > > +                             smp_mb();
> > > > > +
> > > > > +                             ref_count = page_count(page);
> > > > > +                             map_count = page_mapcount(page);
> > > > > +
> > > > > +                             /*
> > > > > +                              * Order reads for page refcount and dirty flag;
> > > > > +                              * see __remove_mapping().
> > > > > +                              */
> > > > > +                             smp_rmb();
> > > >
> > > > 2) why does it need to order against __remove_mapping()? It seems to
> > > >    me that here (called from the reclaim path) it can't race with
> > > >    __remove_mapping() because both lock the page.
> > >
> > > I'll improve that comment in v4.  The ordering isn't against __remove_mapping(),
> > > but actually because of an issue described in __remove_mapping()'s comments
> > > (something else that doesn't hold the page lock, just has a page reference, that
> > > may clear the page dirty flag then drop the reference; thus check ref,
> > > then dirty).
> >
> > Got it. IIRC, get_user_pages() doesn't imply a write barrier. If so,
> > there should be a smp_wmb() on the other side:
>
> If I understand it correctly, it actually implies a full memory
> barrier, doesn't it?
>
> Because... gup_pte_range() (fast path) calls try_grab_compound_head(),
> which eventually calls* atomic_add_unless(), an atomic conditional RMW
> operation with return value, thus fully ordered on success (atomic_t.rst);
> (on failure gup_pte_range() falls back to the slow path, below.)
>
> And follow_page_pte() (slow path) calls try_grab_page(), which also calls
> into try_grab_compound_head(), as the above.
>
> (* on CONFIG_TINY_RCU, it calls just atomic_add(), which isn't ordered,
> but that option is targeted for UP/!SMP, thus not a problem for this race.)
>
> Looking at the implementation of arch_atomic_fetch_add_unless() on
> more relaxed/weakly ordered archs (arm, powerpc, if I got that right),
> there are barriers like 'smp_mb()' and 'sync' instruction if 'old != unless',
> so that seems to be OK.
>
> And the set_page_dirty() calls occur after get_user_pages() / that point.
>
> Does that make sense?

Yes, it does, thanks. I was thinking along the lines of whether there
is an actual contract. The reason get_user_pages() currently works as
a full barrier is not intentional but a side effect of this recent
cleanup patch:
commit 54d516b1d6 ("mm/gup: small refactoring: simplify try_grab_page()")
But I agree your fix works as is.
John Hubbard Feb. 4, 2022, 7:03 a.m. UTC | #10
On 2/3/22 21:56, Yu Zhao wrote:
...
>>> Got it. IIRC, get_user_pages() doesn't imply a write barrier. If so,
>>> there should be a smp_wmb() on the other side:
>>
>> If I understand it correctly, it actually implies a full memory
>> barrier, doesn't it?
>>
>> Because... gup_pte_range() (fast path) calls try_grab_compound_head(),
>> which eventually calls* atomic_add_unless(), an atomic conditional RMW
>> operation with return value, thus fully ordered on success (atomic_t.rst);
>> (on failure gup_pte_range() falls back to the slow path, below.)
>>
>> And follow_page_pte() (slow path) calls try_grab_page(), which also calls
>> into try_grab_compound_head(), as the above.

Well, doing so was a mistake, actually. I've recently reverted it, via:
commit c36c04c2e132 ("Revert "mm/gup: small refactoring: simplify
try_grab_page()""). Details are in the commit log.

Apologies for the confusion that this may have created.

thanks,
Mauricio Faria de Oliveira Feb. 4, 2022, 6:58 p.m. UTC | #11
On Fri, Feb 4, 2022 at 2:57 AM Yu Zhao <yuzhao@google.com> wrote:
>
> On Thu, Feb 3, 2022 at 3:17 PM Mauricio Faria de Oliveira
> <mfo@canonical.com> wrote:
> >
> > On Wed, Feb 2, 2022 at 6:53 PM Yu Zhao <yuzhao@google.com> wrote:
[...]
> > > Got it. IIRC, get_user_pages() doesn't imply a write barrier. If so,
> > > there should be a smp_wmb() on the other side:
> >
> > If I understand it correctly, it actually implies a full memory
> > barrier, doesn't it?
> >
> > Because... gup_pte_range() (fast path) calls try_grab_compound_head(),
> > which eventually calls* atomic_add_unless(), an atomic conditional RMW
> > operation with return value, thus fully ordered on success (atomic_t.rst);
> > (on failure gup_pte_range() falls back to the slow path, below.)
> >
> > And follow_page_pte() (slow path) calls try_grab_page(), which also calls
> > into try_grab_compound_head(), as the above.
> >
> > (* on CONFIG_TINY_RCU, it calls just atomic_add(), which isn't ordered,
> > but that option is targeted for UP/!SMP, thus not a problem for this race.)
> >
> > Looking at the implementation of arch_atomic_fetch_add_unless() on
> > more relaxed/weakly ordered archs (arm, powerpc, if I got that right),
> > there are barriers like 'smp_mb()' and 'sync' instruction if 'old != unless',
> > so that seems to be OK.
> >
> > And the set_page_dirty() calls occur after get_user_pages() / that point.
> >
> > Does that make sense?
>
> Yes, it does, thanks. I was thinking along the lines of whether there
> is an actual contract. [...]

Ok, got you.

> [...] The reason get_user_pages() currently works as
> a full barrier is not intentional but a side effect of this recent
> cleanup patch:
> commit 54d516b1d6 ("mm/gup: small refactoring: simplify try_grab_page()")
> But I agree your fix works as is.

Thanks for bringing it up!

That commit and its revert [1] (that John mentioned in his reply)
change only try_grab_page() / not try_grab_compound_head(),
thus should affect only the slow path / not the fast path.

So, with either the change or the revert, the slow path should still be
okay: both it and try_to_unmap_one() take the page table lock, thus
they shouldn't race, and the spinlock's barriers carry the values through.


Thanks,

[1] commit c36c04c2e132 ("Revert "mm/gup: small refactoring: simplify
try_grab_page()"")
Mauricio Faria de Oliveira Feb. 4, 2022, 6:59 p.m. UTC | #12
On Fri, Feb 4, 2022 at 4:03 AM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 2/3/22 21:56, Yu Zhao wrote:
> ...
> >>> Got it. IIRC, get_user_pages() doesn't imply a write barrier. If so,
> >>> there should be a smp_wmb() on the other side:
> >>
> >> If I understand it correctly, it actually implies a full memory
> >> barrier, doesn't it?
> >>
> >> Because... gup_pte_range() (fast path) calls try_grab_compound_head(),
> >> which eventually calls* atomic_add_unless(), an atomic conditional RMW
> >> operation with return value, thus fully ordered on success (atomic_t.rst);
> >> (on failure gup_pte_range() falls back to the slow path, below.)
> >>
> >> And follow_page_pte() (slow path) calls try_grab_page(), which also calls
> >> into try_grab_compound_head(), as the above.
>
> Well, doing so was a mistake, actually. I've recently reverted it, via:
> commit c36c04c2e132 ("Revert "mm/gup: small refactoring: simplify
> try_grab_page()""). Details are in the commit log.
>
> Apologies for the confusion that this may have created.

No worries; thanks for the pointers / commit log.

>
> thanks,
> --
> John Hubbard
> NVIDIA
> [...]
Huang, Ying Feb. 16, 2022, 6:48 a.m. UTC | #13
Yu Zhao <yuzhao@google.com> writes:

> On Wed, Feb 02, 2022 at 06:27:47PM -0300, Mauricio Faria de Oliveira wrote:
>> On Wed, Feb 2, 2022 at 4:56 PM Yu Zhao <yuzhao@google.com> wrote:
>> >
>> > On Mon, Jan 31, 2022 at 08:02:55PM -0300, Mauricio Faria de Oliveira wrote:
>> > > Problem:
>> > > =======
>> >
>> > Thanks for the update. A couple of quick questions:
>> >
>> > > Userspace might read the zero-page instead of actual data from a
>> > > direct IO read on a block device if the buffers have been called
>> > > madvise(MADV_FREE) on earlier (this is discussed below) due to a
>> > > race between page reclaim on MADV_FREE and blkdev direct IO read.
>> >
>> > 1) would page migration be affected as well?
>> 
>> Could you please elaborate on the potential problem you considered?
>> 
>> I checked migrate_pages() -> try_to_migrate() holds the page lock,
>> thus shouldn't race with shrink_page_list() -> with try_to_unmap()
>> (where the issue with MADV_FREE is), but maybe I didn't get you
>> correctly.
>
> Could the race exist between DIO and migration? While DIO is writing
> to a page, could migration unmap it and copy the data from this page
> to a new page?

Check the migrate_pages() code,

  migrate_pages
    unmap_and_move
      __unmap_and_move
        try_to_migrate // set PTE to swap entry with PTL
        move_to_new_page
          migrate_page
            folio_migrate_mapping
              folio_ref_count(folio) != expected_count // check page ref count
            folio_migrate_copy

The page ref count is checked after unmapping and before copying.  This
is good, but it appears that we need a memory barrier between checking
page ref count and copying the page.

Best Regards,
Huang, Ying
Yu Zhao Feb. 16, 2022, 9:58 p.m. UTC | #14
On Wed, Feb 16, 2022 at 02:48:19PM +0800, Huang, Ying wrote:
> Yu Zhao <yuzhao@google.com> writes:
> 
> > On Wed, Feb 02, 2022 at 06:27:47PM -0300, Mauricio Faria de Oliveira wrote:
> >> On Wed, Feb 2, 2022 at 4:56 PM Yu Zhao <yuzhao@google.com> wrote:
> >> >
> >> > On Mon, Jan 31, 2022 at 08:02:55PM -0300, Mauricio Faria de Oliveira wrote:
> >> > > Problem:
> >> > > =======
> >> >
> >> > Thanks for the update. A couple of quick questions:
> >> >
> >> > > Userspace might read the zero-page instead of actual data from a
> >> > > direct IO read on a block device if the buffers have been called
> >> > > madvise(MADV_FREE) on earlier (this is discussed below) due to a
> >> > > race between page reclaim on MADV_FREE and blkdev direct IO read.
> >> >
> >> > 1) would page migration be affected as well?
> >> 
> >> Could you please elaborate on the potential problem you considered?
> >> 
> >> I checked migrate_pages() -> try_to_migrate() holds the page lock,
> >> thus shouldn't race with shrink_page_list() -> with try_to_unmap()
> >> (where the issue with MADV_FREE is), but maybe I didn't get you
> >> correctly.
> >
> > Could the race exist between DIO and migration? While DIO is writing
> > to a page, could migration unmap it and copy the data from this page
> > to a new page?
> 
> Check the migrate_pages() code,
> 
>   migrate_pages
>     unmap_and_move
>       __unmap_and_move
>         try_to_migrate // set PTE to swap entry with PTL
>         move_to_new_page
>           migrate_page
>             folio_migrate_mapping
>               folio_ref_count(folio) != expected_count // check page ref count
>             folio_migrate_copy
> 
> The page ref count is checked after unmapping and before copying.  This
> is good, but it appears that we need a memory barrier between checking
> page ref count and copying page.

I didn't look into this but, off the top of my head, this should be
similar if not identical to the DIO case. Therefore, it requires two
barriers -- before and after the refcnt check (which may or may not
exist).
Yu Zhao Feb. 16, 2022, 10 p.m. UTC | #15
On Wed, Feb 16, 2022 at 02:58:36PM -0700, Yu Zhao wrote:
> On Wed, Feb 16, 2022 at 02:48:19PM +0800, Huang, Ying wrote:
> > Yu Zhao <yuzhao@google.com> writes:
> > 
> > > On Wed, Feb 02, 2022 at 06:27:47PM -0300, Mauricio Faria de Oliveira wrote:
> > >> On Wed, Feb 2, 2022 at 4:56 PM Yu Zhao <yuzhao@google.com> wrote:
> > >> >
> > >> > On Mon, Jan 31, 2022 at 08:02:55PM -0300, Mauricio Faria de Oliveira wrote:
> > >> > > Problem:
> > >> > > =======
> > >> >
> > >> > Thanks for the update. A couple of quick questions:
> > >> >
> > >> > > Userspace might read the zero-page instead of actual data from a
> > >> > > direct IO read on a block device if the buffers have been called
> > >> > > madvise(MADV_FREE) on earlier (this is discussed below) due to a
> > >> > > race between page reclaim on MADV_FREE and blkdev direct IO read.
> > >> >
> > >> > 1) would page migration be affected as well?
> > >> 
> > >> Could you please elaborate on the potential problem you considered?
> > >> 
> > >> I checked migrate_pages() -> try_to_migrate() holds the page lock,
> > >> thus shouldn't race with shrink_page_list() -> with try_to_unmap()
> > >> (where the issue with MADV_FREE is), but maybe I didn't get you
> > >> correctly.
> > >
> > > Could the race exist between DIO and migration? While DIO is writing
> > > to a page, could migration unmap it and copy the data from this page
> > > to a new page?
> > 
> > Check the migrate_pages() code,
> > 
> >   migrate_pages
> >     unmap_and_move
> >       __unmap_and_move
> >         try_to_migrate // set PTE to swap entry with PTL
> >         move_to_new_page
> >           migrate_page
> >             folio_migrate_mapping
> >               folio_ref_count(folio) != expected_count // check page ref count
> >             folio_migrate_copy
> > 
> > The page ref count is checked after unmapping and before copying.  This
> > is good, but it appears that we need a memory barrier between checking
> > page ref count and copying the page.
> 
> I didn't look into this but, off the top of my head, this should be
> similar if not identical to the DIO case. Therefore, it requires two
                                  ^^^ reclaim

> barriers -- before and after the refcnt check (which may or may not
> exist).
Huang, Ying Feb. 17, 2022, 6:08 a.m. UTC | #16
Yu Zhao <yuzhao@google.com> writes:

> On Wed, Feb 16, 2022 at 02:48:19PM +0800, Huang, Ying wrote:
>> Yu Zhao <yuzhao@google.com> writes:
>> 
>> > On Wed, Feb 02, 2022 at 06:27:47PM -0300, Mauricio Faria de Oliveira wrote:
>> >> On Wed, Feb 2, 2022 at 4:56 PM Yu Zhao <yuzhao@google.com> wrote:
>> >> >
>> >> > On Mon, Jan 31, 2022 at 08:02:55PM -0300, Mauricio Faria de Oliveira wrote:
>> >> > > Problem:
>> >> > > =======
>> >> >
>> >> > Thanks for the update. A couple of quick questions:
>> >> >
>> >> > > Userspace might read the zero-page instead of actual data from a
>> >> > > direct IO read on a block device if the buffers have been called
>> >> > > madvise(MADV_FREE) on earlier (this is discussed below) due to a
>> >> > > race between page reclaim on MADV_FREE and blkdev direct IO read.
>> >> >
>> >> > 1) would page migration be affected as well?
>> >> 
>> >> Could you please elaborate on the potential problem you considered?
>> >> 
>> >> I checked migrate_pages() -> try_to_migrate() holds the page lock,
>> >> thus shouldn't race with shrink_page_list() -> with try_to_unmap()
>> >> (where the issue with MADV_FREE is), but maybe I didn't get you
>> >> correctly.
>> >
>> > Could the race exist between DIO and migration? While DIO is writing
>> > to a page, could migration unmap it and copy the data from this page
>> > to a new page?
>> 
>> Check the migrate_pages() code,
>> 
>>   migrate_pages
>>     unmap_and_move
>>       __unmap_and_move
>>         try_to_migrate // set PTE to swap entry with PTL
>>         move_to_new_page
>>           migrate_page
>>             folio_migrate_mapping
>>               folio_ref_count(folio) != expected_count // check page ref count
>>             folio_migrate_copy
>> 
>> The page ref count is checked after unmapping and before copying.  This
>> is good, but it appears that we need a memory barrier between checking
>> page ref count and copying the page.
>
> I didn't look into this but, off the top of my head, this should be
> similar if not identical to the DIO case. Therefore, it requires two
> barriers -- before and after the refcnt check (which may or may not
> exist).

Yes.  I think so too.

Best Regards,
Huang, Ying

Patch

diff --git a/mm/rmap.c b/mm/rmap.c
index 6a1e8c7f6213..b7ae45724378 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1599,7 +1599,30 @@  static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 
 			/* MADV_FREE page check */
 			if (!PageSwapBacked(page)) {
-				if (!PageDirty(page)) {
+				int ref_count, map_count;
+
+				/*
+				 * Synchronize with gup_pte_range():
+				 * - clear PTE; barrier; read refcount
+				 * - inc refcount; barrier; read PTE
+				 */
+				smp_mb();
+
+				ref_count = page_count(page);
+				map_count = page_mapcount(page);
+
+				/*
+				 * Order reads for page refcount and dirty flag;
+				 * see __remove_mapping().
+				 */
+				smp_rmb();
+
+				/*
+				 * The only page refs must be from the isolation
+				 * plus one or more rmap's (dropped by discard:).
+				 */
+				if ((ref_count == 1 + map_count) &&
+				    !PageDirty(page)) {
 					/* Invalidate as we cleared the pte */
 					mmu_notifier_invalidate_range(mm,
 						address, address + PAGE_SIZE);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 090bfb605ecf..0dbfa3a69567 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1729,7 +1729,7 @@  static unsigned int shrink_page_list(struct list_head *page_list,
 				mapping = page_mapping(page);
 			}
 		} else if (unlikely(PageTransHuge(page))) {
-			/* Split file THP */
+			/* Split file/lazyfree THP */
 			if (split_huge_page_to_list(page, page_list))
 				goto keep_locked;
 		}