
[v2] mm: fix race between MADV_FREE reclaim and blkdev direct IO read

Message ID 20220105233440.63361-1-mfo@canonical.com (mailing list archive)
State New, archived

Commit Message

Mauricio Faria de Oliveira Jan. 5, 2022, 11:34 p.m. UTC
Problem:
=======

Userspace might read the zero-page instead of actual data from a
direct IO read on a block device if the buffers have previously had
madvise(MADV_FREE) called on them (discussed below), due to a race
between page reclaim of MADV_FREE pages and blkdev direct IO reads.

Race condition:
==============

During page reclaim, the MADV_FREE page check in try_to_unmap_one()
checks whether the page is dirty; if not, it discards its PTE (versus
remapping it back if the page is dirty).

However, after try_to_unmap_one() returns to shrink_page_list(), it
might keep the page _anyway_ if page_ref_freeze() fails (it expects
a single page ref from the isolation).

However, blkdev_direct_IO() takes references on all pages, and on READ
operations it only sets them dirty later.

So, if MADV_FREE pages (i.e., not dirty) are used as buffers (more on
this later) for a direct IO read from a block device, and page reclaim
runs during __blkdev_direct_IO[_simple]() AFTER bio_iov_iter_get_pages()
but BEFORE the pages are set dirty, that race window is hit.

The direct IO read eventually completes. Now, when userspace reads
the buffers, the PTE is no longer there and the page fault handler
do_anonymous_page() services that with the zero-page, NOT the data!

A synthetic reproducer is provided.
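
For clarity, the interleaving looks roughly like this (a simplified
sketch of the sequence described above, not exact code):

	CPU A: direct IO read                 CPU B: page reclaim
	---------------------                 -------------------
	__blkdev_direct_IO[_simple]()
	  bio_iov_iter_get_pages()
	  # page references taken
	                                      shrink_page_list()
	                                        try_to_unmap_one()
	                                        # page is clean: PTE discarded
	                                        page_ref_freeze() fails
	                                        # page kept; PTE already gone
	  # read completes; pages set dirty
	  # (too late)
	userspace reads the buffer
	  # page fault: do_anonymous_page()
	  # maps the zero-page, not the data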

Page faults:
===========

The data read from the block device usually won't generate page
faults: with DMA there is no MMU involved, and even without DMA the
kernel writes the data through different virtual addresses (not the
user-mapped addresses), because `struct bio_vec` stores the `struct
page` and derives the addresses for the data read from it (these are
different from/unrelated to the user-mapped addresses).

Thus userspace reads (from the user-mapped addresses) still fault, and
do_anonymous_page() then picks another `struct page`, mapping to memory
other than the `struct page` used by `struct bio_vec` for the read
(which completes correctly, as the page wasn't freed thanks to the
failed page_ref_freeze(), and is reclaimed later) -- but even if the
pages did match, on a read fault that handler maps the zero-page into
the PTE, not that page's memory.

If page reclaim happens BEFORE bio_iov_iter_get_pages(), the issue
doesn't happen, because that call faults in all pages as writable, so
do_anonymous_page() sets up a new page/rmap/PTE, and that is what
direct IO uses. The userspace reads then don't fault, as the PTE is
present (thus the zero-page is not used).

Solution:
========

One solution is to check the page reference count in try_to_unmap_one()
too: the only expected references are the one from the isolation
(checked by shrink_page_list()) and one per rmap mapping (the rmap
reference is dropped at the discard: label), i.e., refcount - 1 ==
mapcount. If that doesn't hold, remap the PTE back, just like the
page-dirty case does.
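
In code, a minimal sketch of that check (the exact mm/rmap.c hunk,
which compares the refcount against the mapcount, appears in the diff
quoted in the replies below):

	/* MADV_FREE page check (sketch only) */
	if (!PageSwapBacked(page)) {
		int ref_count = page_ref_count(page);
		int map_count = page_mapcount(page);

		/* Expected refs: isolation + rmap(s); check refcount before dirty. */
		smp_rmb();
		if ((ref_count - 1 == map_count) && !PageDirty(page)) {
			/* clean and no extra references: discard the PTE */
		} else {
			/* extra reference (e.g. GUP) or dirty: remap the PTE back */
		}
	}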

The new check in try_to_unmap_one() should be safe against races with
bio_iov_iter_get_pages() in both the get_user_pages() fast and slow
paths, as it's done under the PTE lock. The fast path doesn't take that
lock, but it checks whether the PTE has changed; if it has, it drops
the reference and leaves the page for the slow path (which does take
that lock).

- try_to_unmap_one()
  - page_vma_mapped_walk()
    - map_pte() # see pte_offset_map_lock():
        pte_offset_map()
        spin_lock()
  - page_ref_count() # new check
  - page_vma_mapped_walk_done() # see pte_unmap_unlock():
      pte_unmap()
      spin_unlock()

- bio_iov_iter_get_pages()
  - __bio_iov_iter_get_pages()
    - iov_iter_get_pages()
      - get_user_pages_fast()
        - internal_get_user_pages_fast()

          # fast path
          - lockless_pages_from_mm()
            - gup_{pgd,p4d,pud,pmd,pte}_range()
                ptep = pte_offset_map() # not _lock()
                pte = ptep_get_lockless(ptep)
                page = pte_page(pte)
                try_grab_compound_head(page) # get ref
                if (pte_val(pte) != pte_val(*ptep))
                        put_compound_head(page) # put ref
                        # leave page for slow path
          # slow path
          - __gup_longterm_unlocked()
            - get_user_pages_unlocked()
              - __get_user_pages_locked()
                - __get_user_pages()
                  - follow_{page,p4d,pud,pmd}_mask()
                    - follow_page_pte()
                        ptep = pte_offset_map_lock()
                        pte = *ptep
                        page = vm_normal_page(pte)
                        try_grab_page(page) # get ref
                        pte_unmap_unlock()

Regarding transparent hugepages, the expected reference count shouldn't
be affected, as MADV_FREE (aka lazyfree) pages are PageAnon() &&
!PageSwapBacked() (madvise_free_pte_range() -> mark_page_lazyfree() ->
lru_lazyfree_fn()), so they should reach shrink_page_list() ->
split_huge_page_to_list() before try_to_unmap[_one](), which therefore
deals with normal pages only.

(And in the unlikely case that TTU_SPLIT_HUGE_PMD /
split_huge_pmd_address() happens -- which it shouldn't, or only rarely
-- the page refcount doesn't match, as the head page holds references
for the tail pages and the tail pages hold zero. That also prevents
passing the check on the head `page` and then incorrectly calling
page_remove_rmap(subpage) on a tail page that isn't even in
shrink_page_list()'s page_list (an effect of the huge pmd split / pvmw),
as could happen today in this unlikely scenario.)

MADV_FREE'd buffers:
===================

So, back to the "if MADV_FREE pages are used as buffers" note.
The case is arguable, and subject to multiple interpretations.

The madvise(2) manual page on the MADV_FREE advice value says:
- 'After a successful MADV_FREE ... data will be lost when
   the kernel frees the pages.'
- 'the free operation will be canceled if the caller writes
   into the page' / 'subsequent writes ... will succeed and
   then [the] kernel cannot free those dirtied pages'
- 'If there is no subsequent write, the kernel can free the
   pages at any time.'

Thoughts, questions, considerations...
- Since the kernel didn't actually free the page (page_ref_freeze()
  failed), should the data not have been lost? (on userspace read.)
- Should writes performed by the direct IO read be able to cancel
  the free operation?
  - Should the direct IO read be considered as 'the caller' too,
    as it's been requested by 'the caller'?
  - Should the bio technique to dirty pages on return to userspace
    (bio_check_pages_dirty() is called/used by __blkdev_direct_IO())
    be considered in another/special way here?
- Should an upcoming write from a previously requested direct IO
  read be considered as a subsequent write, so the kernel should
  not free the pages? (as it's known at the time of page reclaim.)

Technically, the last point seems a reasonable consideration and
balance, as the madvise(2) manual page apparently (and fairly) assumes
that 'writes' are memory accesses from the userspace process (it does
not explicitly consider writes from the kernel or its corner cases;
again, fairly). Plus, the kernel fix for the corner case of the largely
'non-atomic write' encompassed by a direct IO read operation is
relatively simple, and it helps.

Reproducer:
==========

@ test.c (simplified, but works)

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>
	#include <sys/mman.h>

	int main() {
		int fd, i;
		char *buf;

		fd = open(DEV, O_RDONLY | O_DIRECT);

		buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                	   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
			buf[i] = 1; // init to non-zero

		madvise(buf, BUF_SIZE, MADV_FREE);

		read(fd, buf, BUF_SIZE);

		for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
			printf("%p: 0x%x\n", &buf[i], buf[i]);

		return 0;
	}

@ block/fops.c (formerly fs/block_dev.c)

	+#include <linux/swap.h>
	...
	... __blkdev_direct_IO[_simple](...)
	{
	...
	+	if (!strcmp(current->comm, "good"))
	+		shrink_all_memory(ULONG_MAX);
	+
         	ret = bio_iov_iter_get_pages(...);
	+
	+	if (!strcmp(current->comm, "bad"))
	+		shrink_all_memory(ULONG_MAX);
	...
	}

@ shell

	# yes | dd of=test.img bs=1k count=16
	# DEV=$(losetup -f --show test.img)
	# gcc -DDEV=\"$DEV\" -DBUF_SIZE=16384 -DPAGE_SIZE=4096 test.c -o test

	# od -tx1 $DEV
	0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a
	*
	0040000

	# mv test good
	# ./good
	0x7f1509206000: 0x79
	0x7f1509207000: 0x79
	0x7f1509208000: 0x79
	0x7f1509209000: 0x79

	# mv good bad
	# ./bad
	0x7fd87272f000: 0x0
	0x7fd872730000: 0x0
	0x7fd872731000: 0x0
	0x7fd872732000: 0x0

Ceph/TCMalloc:
=============

For documentation purposes, the use case driving the analysis/fix
is Ceph on Ubuntu 18.04, as the TCMalloc library there still uses
MADV_FREE to release unused memory to the system from the mmap'ed
page heap (might be committed back/used again; it's not munmap'ed.)
- PageHeap::DecommitSpan() -> TCMalloc_SystemRelease() -> madvise()
- PageHeap::CommitSpan() -> TCMalloc_SystemCommit() -> do nothing.

Note: TCMalloc switched back to MADV_DONTNEED a few commits after
the release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so
the issue just 'disappeared' on Ceph on later Ubuntu releases but
is still present in the kernel, and can be hit by other use cases.

The observed issue seems to be the old Ceph bug #22464 [1], where
checksum mismatches are observed (and instrumentation with buffer
dumps shows zero-pages read from mmap'ed/MADV_FREE'd page ranges).

The issue in Ceph was reasonably deemed a kernel bug (comment #50)
and mostly worked around with a retry mechanism, but other parts
of Ceph could still hit that (rocksdb). Anyway, it's less likely
to be hit again as TCMalloc switched out of MADV_FREE by default.

(Some kernel versions/reports from the Ceph bug, and relation with
the MADV_FREE introduction/changes; TCMalloc versions not checked.)
- 4.4 good
- 4.5 (madv_free: introduction)
- 4.9 bad
- 4.10 good? maybe a swapless system
- 4.12 (madv_free: no longer free instantly on swapless systems)
- 4.13 bad

[1] https://tracker.ceph.com/issues/22464

Thanks:
======

Several people contributed to analysis/discussions/tests/reproducers
in the first stages when drilling down on ceph/tcmalloc/linux kernel:

- Dan Hill <daniel.hill@canonical.com>
- Dan Streetman <dan.streetman@canonical.com>
- Dongdong Tao <dongdong.tao@canonical.com>
- Gavin Guo <gavin.guo@canonical.com>
- Gerald Yang <gerald.yang@canonical.com>
- Heitor Alves de Siqueira <halves@canonical.com>
- Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com>
- Jay Vosburgh <jay.vosburgh@canonical.com>
- Matthew Ruffell <matthew.ruffell@canonical.com>
- Ponnuvel Palaniyappan <ponnuvel.palaniyappan@canonical.com>

v2: check refcount against mapcount rather than a static 2.
    Thanks: Minchan Kim <minchan@kernel.org>

Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
---
 mm/rmap.c   | 15 ++++++++++++++-
 mm/vmscan.c |  2 +-
 2 files changed, 15 insertions(+), 2 deletions(-)

Comments

Minchan Kim Jan. 6, 2022, 11:15 p.m. UTC | #1
On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote:
> [ ... full patch description quoted; identical to the message above, snipped ... ]
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 163ac4e6bcee..8671de473c25 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  
>  			/* MADV_FREE page check */
>  			if (!PageSwapBacked(page)) {
> -				if (!PageDirty(page)) {
> +				int ref_count = page_ref_count(page);
> +				int map_count = page_mapcount(page);
> +
> +				/*
> +				 * The only page refs must be from the isolation
> +				 * (checked by the caller shrink_page_list() too)
> +				 * and one or more rmap's (dropped by discard:).
> +				 *
> +				 * Check the reference count before dirty flag
> +				 * with memory barrier; see __remove_mapping().
> +				 */
> +				smp_rmb();
> +				if ((ref_count - 1 == map_count) &&
> +				    !PageDirty(page)) {

Thanks. Looks good to me.

I'd like to invite more folks for better eyes since this could
be a rather subtle issue.
Yang Shi Jan. 7, 2022, 12:11 a.m. UTC | #2
On Thu, Jan 6, 2022 at 3:15 PM Minchan Kim <minchan@kernel.org> wrote:
>
> On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote:
> > [ ... full patch description and diff quoted; snipped ... ]
>
> Thanks. Looks good to me.
>
> I'd like to invite more folks for better eyes since this could
> be a rather subtle issue.

I don't spot anything wrong either. But I'm wondering if we could
remove the PageDirty check or not.

It seems racy to have MADV_FREE solely rely on PageDirty bit to tell
if the page is going to be redirtied or re-accessed or not. The page
refcount itself should be enough. As long as an elevated refcount is
observed we know this page is going to be or has been re-accessed by
someone (particularly, fast GUP). The PageDirty bit may be set much
later, just like what the commit log said.

But I'm not quite sure if there is any path that sets the PageDirty bit
then elevates the refcount; TBH I didn't look very hard.
Yang Shi Jan. 7, 2022, 1:08 a.m. UTC | #3
On Thu, Jan 6, 2022 at 4:11 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Jan 6, 2022 at 3:15 PM Minchan Kim <minchan@kernel.org> wrote:
> >
> > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote:
> > > [ ... full patch description and diff quoted; snipped ... ]
> >
> > Thanks. Looks good to me.
> >
> > I'd like to invite more folks for better eyes since this could
> > be a rather subtle issue.
>
> I don't spot anything wrong either. But I'm wondering if we could
> remove the PageDirty check or not.

Sorry, take this back. The PageDirty check is still needed since the
page could be redirtied even though the refcount is not elevated.

>
> It seems racy to have MADV_FREE solely rely on PageDirty bit to tell
> if the page is going to be redirtied or re-accessed or not. The page
> refcount itself should be enough. As long as an elevated refcount is
> observed we know this page is going to be or has been re-accessed by
> someone (particularly, fast GUP). The PageDirty bit may be set much
> later, just like what the commit log said.
>
> But I'm not quite sure if there is any path that sets the PageDirty bit
> then elevates the refcount; TBH I didn't look very hard.
Huang, Ying Jan. 11, 2022, 1:34 a.m. UTC | #4
Minchan Kim <minchan@kernel.org> writes:

> On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote:
>> [ ... patch description quoted; snipped ... ]
>> 
>> Thoughts, questions, considerations...
>> - Since the kernel didn't actually free the page (page_ref_freeze()
>>   failed), should the data not have been lost? (on userspace read.)
>> - Should writes performed by the direct IO read be able to cancel
>>   the free operation?
>>   - Should the direct IO read be considered as 'the caller' too,
>>     as it's been requested by 'the caller'?
>>   - Should the bio technique to dirty pages on return to userspace
>>     (bio_check_pages_dirty() is called/used by __blkdev_direct_IO())
>>     be considered in another/special way here?
>> - Should an upcoming write from a previously requested direct IO
>>   read be considered as a subsequent write, so the kernel should
>>   not free the pages? (as it's known at the time of page reclaim.)
>> 
>> Technically, the last point would seem a reasonable consideration
>> and balance, as the madvise(2) manual page apparently (and fairly)
>> seem to assume that 'writes' are memory access from the userspace
>> process (not explicitly considering writes from the kernel or its
>> corner cases; again, fairly).. plus the kernel fix implementation
>> for the corner case of the largely 'non-atomic write' encompassed
>> by a direct IO read operation, is relatively simple; and it helps.
>> 
>> Reproducer:
>> ==========
>> 
>> @ test.c (simplified, but works)
>> 
>> 	#define _GNU_SOURCE
>> 	#include <fcntl.h>
>> 	#include <stdio.h>
>> 	#include <unistd.h>
>> 	#include <sys/mman.h>
>> 
>> 	int main() {
>> 		int fd, i;
>> 		char *buf;
>> 
>> 		fd = open(DEV, O_RDONLY | O_DIRECT);
>> 
>> 		buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
>>                 	   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>> 
>> 		for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
>> 			buf[i] = 1; // init to non-zero
>> 
>> 		madvise(buf, BUF_SIZE, MADV_FREE);
>> 
>> 		read(fd, buf, BUF_SIZE);
>> 
>> 		for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
>> 			printf("%p: 0x%x\n", &buf[i], buf[i]);
>> 
>> 		return 0;
>> 	}
>> 
>> @ block/fops.c (formerly fs/block_dev.c)
>> 
>> 	+#include <linux/swap.h>
>> 	...
>> 	... __blkdev_direct_IO[_simple](...)
>> 	{
>> 	...
>> 	+	if (!strcmp(current->comm, "good"))
>> 	+		shrink_all_memory(ULONG_MAX);
>> 	+
>>          	ret = bio_iov_iter_get_pages(...);
>> 	+
>> 	+	if (!strcmp(current->comm, "bad"))
>> 	+		shrink_all_memory(ULONG_MAX);
>> 	...
>> 	}
>> 
>> @ shell
>> 
>> 	# yes | dd of=test.img bs=1k count=16
>> 	# DEV=$(losetup -f --show test.img)
>> 	# gcc -DDEV=\"$DEV\" -DBUF_SIZE=16384 -DPAGE_SIZE=4096 test.c -o test
>> 
>> 	# od -tx1 $DEV
>> 	0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a
>> 	*
>> 	0040000
>> 
>> 	# mv test good
>> 	# ./good
>> 	0x7f1509206000: 0x79
>> 	0x7f1509207000: 0x79
>> 	0x7f1509208000: 0x79
>> 	0x7f1509209000: 0x79
>> 
>> 	# mv good bad
>> 	# ./bad
>> 	0x7fd87272f000: 0x0
>> 	0x7fd872730000: 0x0
>> 	0x7fd872731000: 0x0
>> 	0x7fd872732000: 0x0
>> 
>> Ceph/TCMalloc:
>> =============
>> 
>> For documentation purposes, the use case driving the analysis/fix
>> is Ceph on Ubuntu 18.04, as the TCMalloc library there still uses
>> MADV_FREE to release unused memory to the system from the mmap'ed
>> page heap (might be committed back/used again; it's not munmap'ed.)
>> - PageHeap::DecommitSpan() -> TCMalloc_SystemRelease() -> madvise()
>> - PageHeap::CommitSpan() -> TCMalloc_SystemCommit() -> do nothing.
>> 
>> Note: TCMalloc switched back to MADV_DONTNEED a few commits after
>> the release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so
>> the issue just 'disappeared' on Ceph on later Ubuntu releases but
>> is still present in the kernel, and can be hit by other use cases.
>> 
>> The observed issue seems to be the old Ceph bug #22464 [1], where
>> checksum mismatches are observed (and instrumentation with buffer
>> dumps shows zero-pages read from mmap'ed/MADV_FREE'd page ranges).
>> 
>> The issue in Ceph was reasonably deemed a kernel bug (comment #50)
>> and mostly worked around with a retry mechanism, but other parts
>> of Ceph could still hit that (rocksdb). Anyway, it's less likely
>> to be hit again as TCMalloc switched out of MADV_FREE by default.
>> 
>> (Some kernel versions/reports from the Ceph bug, and relation with
>> the MADV_FREE introduction/changes; TCMalloc versions not checked.)
>> - 4.4 good
>> - 4.5 (madv_free: introduction)
>> - 4.9 bad
>> - 4.10 good? maybe a swapless system
>> - 4.12 (madv_free: no longer free instantly on swapless systems)
>> - 4.13 bad
>> 
>> [1] https://tracker.ceph.com/issues/22464
>> 
>> Thanks:
>> ======
>> 
>> Several people contributed to analysis/discussions/tests/reproducers
>> in the first stages when drilling down on ceph/tcmalloc/linux kernel:
>> 
>> - Dan Hill <daniel.hill@canonical.com>
>> - Dan Streetman <dan.streetman@canonical.com>
>> - Dongdong Tao <dongdong.tao@canonical.com>
>> - Gavin Guo <gavin.guo@canonical.com>
>> - Gerald Yang <gerald.yang@canonical.com>
>> - Heitor Alves de Siqueira <halves@canonical.com>
>> - Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com>
>> - Jay Vosburgh <jay.vosburgh@canonical.com>
>> - Matthew Ruffell <matthew.ruffell@canonical.com>
>> - Ponnuvel Palaniyappan <ponnuvel.palaniyappan@canonical.com>
>> 
>> v2: check refcount against mapcount rather than a static 2.
>>     Thanks: Minchan Kim <minchan@kernel.org>
>> 
>> Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
>> ---
>>  mm/rmap.c   | 15 ++++++++++++++-
>>  mm/vmscan.c |  2 +-
>>  2 files changed, 15 insertions(+), 2 deletions(-)
>> 
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 163ac4e6bcee..8671de473c25 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>  
>>  			/* MADV_FREE page check */
>>  			if (!PageSwapBacked(page)) {
>> -				if (!PageDirty(page)) {
>> +				int ref_count = page_ref_count(page);
>> +				int map_count = page_mapcount(page);
>> +
>> +				/*
>> +				 * The only page refs must be from the isolation
>> +				 * (checked by the caller shrink_page_list() too)
>> +				 * and one or more rmap's (dropped by discard:).
>> +				 *
>> +				 * Check the reference count before dirty flag
>> +				 * with memory barrier; see __remove_mapping().
>> +				 */
>> +				smp_rmb();
>> +				if ((ref_count - 1 == map_count) &&
>> +				    !PageDirty(page)) {
>
> Thanks. Looks good to me.
>
> I'd like to invite more folks for better eyes since this could
> be a rather subtle issue.

Looks good to me too!  Please feel free to add my

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

Best Regards,
Huang, Ying
Yu Zhao Jan. 11, 2022, 6:48 a.m. UTC | #5
On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote:
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 163ac4e6bcee..8671de473c25 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  
>  			/* MADV_FREE page check */
>  			if (!PageSwapBacked(page)) {
> -				if (!PageDirty(page)) {
> +				int ref_count = page_ref_count(page);
> +				int map_count = page_mapcount(page);
> +
> +				/*
> +				 * The only page refs must be from the isolation
> +				 * (checked by the caller shrink_page_list() too)
> +				 * and one or more rmap's (dropped by discard:).
> +				 *
> +				 * Check the reference count before dirty flag
> +				 * with memory barrier; see __remove_mapping().
> +				 */
> +				smp_rmb();
> +				if ((ref_count - 1 == map_count) &&
> +				    !PageDirty(page)) {
>  					/* Invalidate as we cleared the pte */
>  					mmu_notifier_invalidate_range(mm,
>  						address, address + PAGE_SIZE);

Out of curiosity, how does it work with COW in terms of reordering?
Specifically, it seems to me get_page() and page_dup_rmap() in
copy_present_pte() can happen in any order, and if page_dup_rmap()
is seen first, and direct io is holding a refcnt, this check can still
pass?
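
For reference, a rough paraphrase of the copy_present_pte() steps in
question (illustrative, not verbatim kernel code):

	get_page(page);			/* atomic inc of page->_refcount */
	page_dup_rmap(page, false);	/* atomic inc of page->_mapcount */
	rss[mm_counter(page)]++;

	/*
	 * The two increments are independent atomics with no ordering
	 * between them, so another CPU may observe the new _mapcount
	 * before the new _refcount. With direct IO holding an extra
	 * reference, "ref_count - 1 == map_count" can then still hold.
	 */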
Minchan Kim Jan. 11, 2022, 6:54 p.m. UTC | #6
On Mon, Jan 10, 2022 at 11:48:13PM -0700, Yu Zhao wrote:
> On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote:
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 163ac4e6bcee..8671de473c25 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  
> >  			/* MADV_FREE page check */
> >  			if (!PageSwapBacked(page)) {
> > -				if (!PageDirty(page)) {
> > +				int ref_count = page_ref_count(page);
> > +				int map_count = page_mapcount(page);
> > +
> > +				/*
> > +				 * The only page refs must be from the isolation
> > +				 * (checked by the caller shrink_page_list() too)
> > +				 * and one or more rmap's (dropped by discard:).
> > +				 *
> > +				 * Check the reference count before dirty flag
> > +				 * with memory barrier; see __remove_mapping().
> > +				 */
> > +				smp_rmb();
> > +				if ((ref_count - 1 == map_count) &&
> > +				    !PageDirty(page)) {
> >  					/* Invalidate as we cleared the pte */
> >  					mmu_notifier_invalidate_range(mm,
> >  						address, address + PAGE_SIZE);
> 
> Out of curiosity, how does it work with COW in terms of reordering?
> Specifically, it seems to me get_page() and page_dup_rmap() in
> copy_present_pte() can happen in any order, and if page_dup_rmap()
> is seen first, and direct io is holding a refcnt, this check can still
> pass?
> 

Hi Yu,

I think you're correct. I don't think we'd like a memory barrier
there in page_dup_rmap. Then, how about making gup_fast aware
of FOLL_TOUCH?

FOLL_TOUCH means it's going to write something, so the page
should be dirty. Currently, get_user_pages works like that.
However, the problem is get_user_pages_fast, since it looks
like lockless_pages_from_mm doesn't support FOLL_TOUCH.

So the idea is: if the flags passed to internal_get_user_pages_fast
include FOLL_TOUCH, gup_{pmd,pte}_range tries to make the
page dirty under trylock_page() (if the lock fails, it falls
back to the slow path with __gup_longterm_unlocked and sets
the pages dirty there).
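
A rough sketch of that idea (hypothetical, not actual kernel code; the
dirtying is gated on FOLL_WRITE too, and the exact placement inside
gup_pte_range() and the error handling are only illustrative):

	/*
	 * Hypothetical addition to gup_pte_range(), after the page ref
	 * has been taken and the pte re-checked:
	 */
	if ((flags & FOLL_TOUCH) && (flags & FOLL_WRITE) &&
	    !pte_dirty(pte) && !PageDirty(page)) {
		if (!trylock_page(page)) {
			/* can't sleep here: drop the ref and leave the
			 * page for the slow path, which can dirty it */
			put_page(page);
			goto pte_unmap;
		}
		set_page_dirty(page);
		unlock_page(page);
	}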

This approach would also solve other cases where we map userspace
pages into kernel space and then write to them. Since the write
doesn't go through the process's page table, we lose the dirty
bit in the process's page table and end up with the same problem.
That's why I'd like to approach it this way.

If that doesn't work, the other option to fix this specific
case is: can't we make the pages dirty in advance in the DIO
read case?

When I look at the DIO code, it's already doing that in the
async case. Couldn't we do the same thing for the other cases?
I guess the worst case we'd see would be more page writeback,
since the pages become dirty unnecessarily.
John Hubbard Jan. 11, 2022, 7:29 p.m. UTC | #7
On 1/11/22 10:54, Minchan Kim wrote:
...
> Hi Yu,
> 
> I think you're correct. I think we don't like memory barrier
> there in page_dup_rmap. Then, how about make gup_fast is aware
> of FOLL_TOUCH?
> 
> FOLL_TOUCH means it's going to write something so the page

Actually, my understanding of FOLL_TOUCH is that it does *not* mean that
data will be written to the page. That is what FOLL_WRITE is for.
FOLL_TOUCH means: update the "accessed" metadata, without actually
writing to the memory that the page represents.


> should be dirty. Currently, get_user_pages works like that.
> Howver, problem is get_user_pages_fast since it looks like
> that lockless_pages_from_mm doesn't support FOLL_TOUCH.
> 
> So the idea is if param in internal_get_user_pages_fast
> includes FOLL_TOUCH, gup_{pmd,pte}_range try to make the
> page dirty under trylock_page(If the lock fails, it goes

Marking a page dirty solely because FOLL_TOUCH is specified would
be an API-level mistake. That's why it isn't "supported". Or at least,
that's how I'm reading things.

Hope that helps!

> slow path with __gup_longterm_unlocked and set_dirty_pages
> for them).
> 
> This approach would solve other cases where map userspace
> pages into kernel space and then write. Since the write
> didn't go through with the process's page table, we will
> lose the dirty bit in the page table of the process and
> it turns out same problem. That's why I'd like to approach
> this.
> 
> If it doesn't work, the other option to fix this specific
> case is can't we make pages dirty in advance in DIO read-case?
> 
> When I look at DIO code, it's already doing in async case.
> Could't we do the same thing for the other cases?
> I guess the worst case we will see would be more page
> writeback since the page becomes dirty unnecessary.

Marking pages dirty after pinning them is a pre-existing area of
problems. See the long-running LWN articles about get_user_pages() [1].


[1] https://lwn.net/Kernel/Index/#Memory_management-get_user_pages

thanks,
Minchan Kim Jan. 11, 2022, 8:20 p.m. UTC | #8
On Tue, Jan 11, 2022 at 11:29:36AM -0800, John Hubbard wrote:
> On 1/11/22 10:54, Minchan Kim wrote:
> ...
> > Hi Yu,
> > 
> > I think you're correct. I think we don't like memory barrier
> > there in page_dup_rmap. Then, how about make gup_fast is aware
> > of FOLL_TOUCH?
> > 
> > FOLL_TOUCH means it's going to write something so the page
> 
> Actually, my understanding of FOLL_TOUCH is that it does *not* mean that
> data will be written to the page. That is what FOLL_WRITE is for.
> FOLL_TOUCH means: update the "accessed" metadata, without actually
> writing to the memory that the page represents.

Exactly. I should have mentioned FOLL_TOUCH together with FOLL_WRITE.
What I wanted to hit with FOLL_TOUCH was:

follow_page_pte:

        if (flags & FOLL_TOUCH) {
                if ((flags & FOLL_WRITE) &&
                    !pte_dirty(pte) && !PageDirty(page))
                        set_page_dirty(page);
                mark_page_accessed(page);
        }

> 
> 
> > should be dirty. Currently, get_user_pages works like that.
> > Howver, problem is get_user_pages_fast since it looks like
> > that lockless_pages_from_mm doesn't support FOLL_TOUCH.
> > 
> > So the idea is if param in internal_get_user_pages_fast
> > includes FOLL_TOUCH, gup_{pmd,pte}_range try to make the
> > page dirty under trylock_page(If the lock fails, it goes
> 
> Marking a page dirty solely because FOLL_TOUCH is specified would
> be an API-level mistake. That's why it isn't "supported". Or at least,
> that's how I'm reading things.
> 
> Hope that helps!
> 
> > slow path with __gup_longterm_unlocked and set_dirty_pages
> > for them).
> > 
> > This approach would solve other cases where map userspace
> > pages into kernel space and then write. Since the write
> > didn't go through with the process's page table, we will
> > lose the dirty bit in the page table of the process and
> > it turns out same problem. That's why I'd like to approach
> > this.
> > 
> > If it doesn't work, the other option to fix this specific
> > case is can't we make pages dirty in advance in DIO read-case?
> > 
> > When I look at DIO code, it's already doing in async case.
> > Could't we do the same thing for the other cases?
> > I guess the worst case we will see would be more page
> > writeback since the page becomes dirty unnecessary.
> 
> Marking pages dirty after pinning them is a pre-existing area of
> problems. See the long-running LWN articles about get_user_pages() [1].

Oh, do you mean marking the page dirty in the DIO path is already a problem?
Let me read the pages in the link.
Thanks!

> 
> 
> [1] https://lwn.net/Kernel/Index/#Memory_management-get_user_pages
> 
> thanks,
> -- 
> John Hubbard
> NVIDIA
>
Minchan Kim Jan. 11, 2022, 8:21 p.m. UTC | #9
On Tue, Jan 11, 2022 at 12:20:13PM -0800, Minchan Kim wrote:
< snip >
> > > slow path with __gup_longterm_unlocked and set_dirty_pages
> > > for them).
> > > 
> > > This approach would solve other cases where map userspace
> > > pages into kernel space and then write. Since the write
> > > didn't go through with the process's page table, we will
> > > lose the dirty bit in the page table of the process and
> > > it turns out same problem. That's why I'd like to approach
> > > this.
> > > 
> > > If it doesn't work, the other option to fix this specific
> > > case is can't we make pages dirty in advance in DIO read-case?
> > > 
> > > When I look at DIO code, it's already doing in async case.
> > > Could't we do the same thing for the other cases?
> > > I guess the worst case we will see would be more page
> > > writeback since the page becomes dirty unnecessary.
> > 
> > Marking pages dirty after pinning them is a pre-existing area of
> > problems. See the long-running LWN articles about get_user_pages() [1].
> 
> Oh, Do you mean marking page dirty in DIO path is already problems?

                  ^ marking page dirty too late in DIO path

Typo fix.
Minchan Kim Jan. 11, 2022, 9:59 p.m. UTC | #10
On Tue, Jan 11, 2022 at 12:21:59PM -0800, Minchan Kim wrote:
> On Tue, Jan 11, 2022 at 12:20:13PM -0800, Minchan Kim wrote:
> < snip >
> > > > slow path with __gup_longterm_unlocked and set_dirty_pages
> > > > for them).
> > > > 
> > > > This approach would solve other cases where map userspace
> > > > pages into kernel space and then write. Since the write
> > > > didn't go through with the process's page table, we will
> > > > lose the dirty bit in the page table of the process and
> > > > it turns out same problem. That's why I'd like to approach
> > > > this.
> > > > 
> > > > If it doesn't work, the other option to fix this specific
> > > > case is can't we make pages dirty in advance in DIO read-case?
> > > > 
> > > > When I look at DIO code, it's already doing in async case.
> > > > Could't we do the same thing for the other cases?
> > > > I guess the worst case we will see would be more page
> > > > writeback since the page becomes dirty unnecessary.
> > > 
> > > Marking pages dirty after pinning them is a pre-existing area of
> > > problems. See the long-running LWN articles about get_user_pages() [1].
> > 
> > Oh, Do you mean marking page dirty in DIO path is already problems?
> 
>                   ^ marking page dirty too late in DIO path
> 
> Typo fix.

I looked through the articles but couldn't find the dots to connect
those issues with this MADV_FREE issue. However, the man page shows
a clue why it's fine.

```
       O_DIRECT I/Os should never be run concurrently with the fork(2) system call, if the memory buffer is a private
       mapping (i.e., any mapping created with the mmap(2) MAP_PRIVATE flag; this includes memory allocated on the heap
       and statically allocated buffers). Any such I/Os, whether submitted via an asynchronous I/O interface or from
       another thread in the process, should be completed before fork(2) is called. Failure to do so can result in data
       corruption and undefined behavior in parent and child processes.

```

I think it would make the copy_present_pte's page_dup_rmap safe.
John Hubbard Jan. 11, 2022, 11:38 p.m. UTC | #11
On 1/11/22 13:59, Minchan Kim wrote:
...
>>>> Marking pages dirty after pinning them is a pre-existing area of
>>>> problems. See the long-running LWN articles about get_user_pages() [1].
>>>
>>> Oh, Do you mean marking page dirty in DIO path is already problems?
>>
>>                    ^ marking page dirty too late in DIO path
>>
>> Typo fix.
> 
> I looked though the articles but couldn't find dots to connetct
> issues with this MADV_FREE issue. However, man page shows a clue

The area covered in those articles is the fact that the file system
and block layers do not interact safely with pinned memory, even today.
So I'm trying to make sure you're aware of that before you go too far
in that direction.

> why it's fine.
> 
> ```
>         O_DIRECT  I/Os should never be run concurrently with the fork(2) system call, if the memory buffer is a private map‐
>         ping (i.e., any mapping created with the mmap(2) MAP_PRIVATE flag; this includes memory allocated on  the  heap  and
>         statically  allocated  buffers).  Any such I/Os, whether submitted via an asynchronous I/O interface or from another
>         thread in the process, should be completed before fork(2) is called.  Failure to do so can result in data corruption
>         and  undefined  behavior  in parent and child processes.
> 
> ```
> 
> I think it would make the copy_present_pte's page_dup_rmap safe.

I'd have to see this in patch form, because I'm not quite able to visualize it yet.

thanks,
Minchan Kim Jan. 12, 2022, 12:01 a.m. UTC | #12
On Tue, Jan 11, 2022 at 03:38:24PM -0800, John Hubbard wrote:
> On 1/11/22 13:59, Minchan Kim wrote:
> ...
> > > > > Marking pages dirty after pinning them is a pre-existing area of
> > > > > problems. See the long-running LWN articles about get_user_pages() [1].
> > > > 
> > > > Oh, Do you mean marking page dirty in DIO path is already problems?
> > > 
> > >                    ^ marking page dirty too late in DIO path
> > > 
> > > Typo fix.
> > 
> > I looked though the articles but couldn't find dots to connetct
> > issues with this MADV_FREE issue. However, man page shows a clue
> 
> The area covered in those articles is about the fact that file system
> and block are not safely interacting with pinned memory. Even today.
> So I'm trying to make sure you're aware of that before you go too far
> in that direction.
> 
> > why it's fine.
> > 
> > ```
> >         O_DIRECT  I/Os should never be run concurrently with the fork(2) system call, if the memory buffer is a private map‐
> >         ping (i.e., any mapping created with the mmap(2) MAP_PRIVATE flag; this includes memory allocated on  the  heap  and
> >         statically  allocated  buffers).  Any such I/Os, whether submitted via an asynchronous I/O interface or from another
> >         thread in the process, should be completed before fork(2) is called.  Failure to do so can result in data corruption
> >         and  undefined  behavior  in parent and child processes.
> > 
> > ```
> > 
> > I think it would make the copy_present_pte's page_dup_rmap safe.
> 
> I'd have to see this in patch form, because I'm not quite able to visualize it yet.

It would be great if you read through the original patch
description. Since v2 had a small change to consider
multiple mappings among parent and child, it introduces a little
mismatch with the description, but it's still quite good for
explaining the current problem.

https://lore.kernel.org/all/20220105233440.63361-1-mfo@canonical.com/T/#u

The problem is that MADV_FREE'd anonymous memory is supposed to work
based on dirtiness, coming from either the user process's page table
bit or PageDirty. Since the VM can't see the dirtiness, it just
discards the anonymous memory instead of swapping it out. Thus, the
dirtiness is key for this to work correctly.

However, DIO doesn't make the page dirty until the IO is completed,
and at the same time, the store operation doesn't go through the
user process's page table, regardless of DMA or any other mechanism.
That makes the VM decide to just drop the page, since it doesn't see
any dirtiness on it. So it turns out the end user would be surprised
because the read syscall with DIO completed but the data is zero
rather than the latest up-to-date data.

To prevent the problem, the patch suggests comparing page_mapcount
with page_count, since it expects that any additional reference on
the page means someone is accessing the memory, so in that case the
page is not discarded. However, Yu pointed out that the page_count
and page_mapcount updates could be reordered, in copy_page_range for
example. So I am looking for a solution (one would be adding a memory
barrier right before page_dup_rmap, but I'd like to avoid it if we
have another idea). And then the man page says forking with DIO in
flight is already prohibited, so the concern raised would be void,
IIUC.

Hope this helps your understanding.
Thanks!

Huang, Ying Jan. 12, 2022, 1:46 a.m. UTC | #13
Yu Zhao <yuzhao@google.com> writes:

> On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote:
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 163ac4e6bcee..8671de473c25 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>  
>>  			/* MADV_FREE page check */
>>  			if (!PageSwapBacked(page)) {
>> -				if (!PageDirty(page)) {
>> +				int ref_count = page_ref_count(page);
>> +				int map_count = page_mapcount(page);
>> +
>> +				/*
>> +				 * The only page refs must be from the isolation
>> +				 * (checked by the caller shrink_page_list() too)
>> +				 * and one or more rmap's (dropped by discard:).
>> +				 *
>> +				 * Check the reference count before dirty flag
>> +				 * with memory barrier; see __remove_mapping().
>> +				 */
>> +				smp_rmb();
>> +				if ((ref_count - 1 == map_count) &&
>> +				    !PageDirty(page)) {
>>  					/* Invalidate as we cleared the pte */
>>  					mmu_notifier_invalidate_range(mm,
>>  						address, address + PAGE_SIZE);
>
> Out of curiosity, how does it work with COW in terms of reordering?
> Specifically, it seems to me get_page() and page_dup_rmap() in
> copy_present_pte() can happen in any order, and if page_dup_rmap()
> is seen first, and direct io is holding a refcnt, this check can still
> pass?

I think that you are correct.

After more thought, it appears very tricky to compare the page count
and map count.  Even if we add smp_rmb() between page_ref_count() and
page_mapcount(), an interrupt may happen between them.  During the
interrupt, the page count and map count may be changed, for example
by unmapping, or by do_swap_page().

Best Regards,
Huang, Ying
Minchan Kim Jan. 12, 2022, 5:33 p.m. UTC | #14
On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote:
> Yu Zhao <yuzhao@google.com> writes:
> 
> > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote:
> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> index 163ac4e6bcee..8671de473c25 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >>  
> >>  			/* MADV_FREE page check */
> >>  			if (!PageSwapBacked(page)) {
> >> -				if (!PageDirty(page)) {
> >> +				int ref_count = page_ref_count(page);
> >> +				int map_count = page_mapcount(page);
> >> +
> >> +				/*
> >> +				 * The only page refs must be from the isolation
> >> +				 * (checked by the caller shrink_page_list() too)
> >> +				 * and one or more rmap's (dropped by discard:).
> >> +				 *
> >> +				 * Check the reference count before dirty flag
> >> +				 * with memory barrier; see __remove_mapping().
> >> +				 */
> >> +				smp_rmb();
> >> +				if ((ref_count - 1 == map_count) &&
> >> +				    !PageDirty(page)) {
> >>  					/* Invalidate as we cleared the pte */
> >>  					mmu_notifier_invalidate_range(mm,
> >>  						address, address + PAGE_SIZE);
> >
> > Out of curiosity, how does it work with COW in terms of reordering?
> > Specifically, it seems to me get_page() and page_dup_rmap() in
> > copy_present_pte() can happen in any order, and if page_dup_rmap()
> > is seen first, and direct io is holding a refcnt, this check can still
> > pass?
> 
> I think that you are correct.
> 
> After more thoughts, it appears very tricky to compare page count and
> map count.  Even if we have added smp_rmb() between page_ref_count() and
> page_mapcount(), an interrupt may happen between them.  During the
> interrupt, the page count and map count may be changed, for example,
> unmapped, or do_swap_page().

Yeah, that happens, but what specific problem do you see from the
count change under the race? The fork case Yu pointed out was already
known to break DIO, so users should take care not to fork during DIO
(please look at the O_DIRECT section in man 2 open). If you could give
a specific example, it would be great to think over the issue.

I agree it's a little tricky, but it seems to be the way other places
have used for a long time (please look at write_protect_page in ksm.c).
So, what we're missing here is a TLB flush before the check.

Something like this.

diff --git a/mm/rmap.c b/mm/rmap.c
index b0fd9dc19eba..b4ad9faa17b2 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1599,18 +1599,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,

                        /* MADV_FREE page check */
                        if (!PageSwapBacked(page)) {
-                               int refcount = page_ref_count(page);
-
-                               /*
-                                * The only page refs must be from the isolation
-                                * (checked by the caller shrink_page_list() too)
-                                * and the (single) rmap (dropped by discard:).
-                                *
-                                * Check the reference count before dirty flag
-                                * with memory barrier; see __remove_mapping().
-                                */
-                               smp_rmb();
-                               if (refcount == 2 && !PageDirty(page)) {
+                               if (!PageDirty(page) &&
+                                       page_mapcount(page) + 1 == page_count(page)) {
                                        /* Invalidate as we cleared the pte */
                                        mmu_notifier_invalidate_range(mm,
                                                address, address + PAGE_SIZE);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f3162a5724de..6454ff5c576f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1754,6 +1754,9 @@ static unsigned int shrink_page_list(struct list_head *page_list,
                        enum ttu_flags flags = TTU_BATCH_FLUSH;
                        bool was_swapbacked = PageSwapBacked(page);

+                       if (!was_swapbacked && PageAnon(page))
+                               flags &= ~TTU_BATCH_FLUSH;
+
                        if (unlikely(PageTransHuge(page)))
                                flags |= TTU_SPLIT_HUGE_PMD;
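
For reference, the write_protect_page() check in mm/ksm.c mentioned
above looks roughly like this (paraphrased, not a verbatim quote):

	swapped = PageSwapCache(page);
	...
	entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
	/*
	 * Check that no O_DIRECT or similar I/O is in progress on the
	 * page: the only references expected are the caller's, one per
	 * mapping, and the swap cache entry (if any).
	 */
	if (page_mapcount(page) + 1 + swapped != page_count(page)) {
		/* restore the pte and bail out */
		set_pte_at(mm, pvmw.address, pvmw.pte, entry);
		goto out_unlock;
	}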
Mauricio Faria de Oliveira Jan. 12, 2022, 9:53 p.m. UTC | #15
Hi Minchan Kim,

Thanks for handling the hard questions! :)

On Wed, Jan 12, 2022 at 2:33 PM Minchan Kim <minchan@kernel.org> wrote:
>
> On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote:
> > Yu Zhao <yuzhao@google.com> writes:
> >
> > > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote:
> > >> diff --git a/mm/rmap.c b/mm/rmap.c
> > >> index 163ac4e6bcee..8671de473c25 100644
> > >> --- a/mm/rmap.c
> > >> +++ b/mm/rmap.c
> > >> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > >>
> > >>                    /* MADV_FREE page check */
> > >>                    if (!PageSwapBacked(page)) {
> > >> -                          if (!PageDirty(page)) {
> > >> +                          int ref_count = page_ref_count(page);
> > >> +                          int map_count = page_mapcount(page);
> > >> +
> > >> +                          /*
> > >> +                           * The only page refs must be from the isolation
> > >> +                           * (checked by the caller shrink_page_list() too)
> > >> +                           * and one or more rmap's (dropped by discard:).
> > >> +                           *
> > >> +                           * Check the reference count before dirty flag
> > >> +                           * with memory barrier; see __remove_mapping().
> > >> +                           */
> > >> +                          smp_rmb();
> > >> +                          if ((ref_count - 1 == map_count) &&
> > >> +                              !PageDirty(page)) {
> > >>                                    /* Invalidate as we cleared the pte */
> > >>                                    mmu_notifier_invalidate_range(mm,
> > >>                                            address, address + PAGE_SIZE);
> > >
> > > Out of curiosity, how does it work with COW in terms of reordering?
> > > Specifically, it seems to me get_page() and page_dup_rmap() in
> > > copy_present_pte() can happen in any order, and if page_dup_rmap()
> > > is seen first, and direct io is holding a refcnt, this check can still
> > > pass?
> >
> > I think that you are correct.
> >
> > After more thoughts, it appears very tricky to compare page count and
> > map count.  Even if we have added smp_rmb() between page_ref_count() and
> > page_mapcount(), an interrupt may happen between them.  During the
> > interrupt, the page count and map count may be changed, for example,
> > unmapped, or do_swap_page().
>
> Yeah, it happens but what specific problem are you concerning from the
> count change under race? The fork case Yu pointed out was already known
> for breaking DIO so user should take care not to fork under DIO(Please
> look at O_DIRECT section in man 2 open). If you could give a specific
> example, it would be great to think over the issue.
>
> I agree it's little tricky but it seems to be way other place has used
> for a long time(Please look at write_protect_page in ksm.c).

Ah, that's great to see it's being used elsewhere, for DIO particularly!

> So, here what we missing is tlb flush before the checking.

That shouldn't be required for this particular issue/case, IIUIC.
One of the things we checked early on was disabling deferred TLB flush
(similarly to what you've done), and it didn't help with the issue; also, the
issue happens on uniprocessor mode too (thus no remote CPU involved.)


>
> Something like this.
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index b0fd9dc19eba..b4ad9faa17b2 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1599,18 +1599,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>
>                         /* MADV_FREE page check */
>                         if (!PageSwapBacked(page)) {
> -                               int refcount = page_ref_count(page);
> -
> -                               /*
> -                                * The only page refs must be from the isolation
> -                                * (checked by the caller shrink_page_list() too)
> -                                * and the (single) rmap (dropped by discard:).
> -                                *
> -                                * Check the reference count before dirty flag
> -                                * with memory barrier; see __remove_mapping().
> -                                */
> -                               smp_rmb();
> -                               if (refcount == 2 && !PageDirty(page)) {
> +                               if (!PageDirty(page) &&
> +                                       page_mapcount(page) + 1 == page_count(page)) {

In the interest of avoiding a different race/bug, it seemed worth following the
suggestion outlined in __remove_mapping(), i.e., checking PageDirty()
after the page's reference count, with a memory barrier in between.

I'm not familiar with the details of the original issue behind that code change,
but it seemed to be possible here too, particularly as writes from user-space
can happen asynchronously / after try_to_unmap_one() checked PTE clean
and didn't set PageDirty, and if the page's PTE is present, there's no fault?
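
For context, the ordering pattern referenced in __remove_mapping() is
roughly the following (paraphrased from mm/vmscan.c):

	refcount = 1 + compound_nr(page);
	if (!page_ref_freeze(page, refcount))
		goto cannot_free;
	/* note: atomic_cmpxchg in page_ref_freeze provides the smp_rmb */
	if (unlikely(PageDirty(page))) {
		/* a late dirtier slipped in; undo the freeze, keep page */
		page_ref_unfreeze(page, refcount);
		goto cannot_free;
	}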

Thanks again,
Mauricio

>                                         /* Invalidate as we cleared the pte */
>                                         mmu_notifier_invalidate_range(mm,
>                                                 address, address + PAGE_SIZE);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f3162a5724de..6454ff5c576f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1754,6 +1754,9 @@ static unsigned int shrink_page_list(struct list_head *page_list,
>                         enum ttu_flags flags = TTU_BATCH_FLUSH;
>                         bool was_swapbacked = PageSwapBacked(page);
>
> +                       if (!was_swapbacked && PageAnon(page))
> +                               flags &= ~TTU_BATCH_FLUSH;
> +
>                         if (unlikely(PageTransHuge(page)))
>                                 flags |= TTU_SPLIT_HUGE_PMD;
>
>
>
>
>
>
Minchan Kim Jan. 12, 2022, 10:37 p.m. UTC | #16
On Wed, Jan 12, 2022 at 06:53:07PM -0300, Mauricio Faria de Oliveira wrote:
> Hi Minchan Kim,
> 
> Thanks for handling the hard questions! :)
> 
> On Wed, Jan 12, 2022 at 2:33 PM Minchan Kim <minchan@kernel.org> wrote:
> >
> > On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote:
> > > Yu Zhao <yuzhao@google.com> writes:
> > >
> > > > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote:
> > > >> diff --git a/mm/rmap.c b/mm/rmap.c
> > > >> index 163ac4e6bcee..8671de473c25 100644
> > > >> --- a/mm/rmap.c
> > > >> +++ b/mm/rmap.c
> > > >> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > >>
> > > >>                    /* MADV_FREE page check */
> > > >>                    if (!PageSwapBacked(page)) {
> > > >> -                          if (!PageDirty(page)) {
> > > >> +                          int ref_count = page_ref_count(page);
> > > >> +                          int map_count = page_mapcount(page);
> > > >> +
> > > >> +                          /*
> > > >> +                           * The only page refs must be from the isolation
> > > >> +                           * (checked by the caller shrink_page_list() too)
> > > >> +                           * and one or more rmap's (dropped by discard:).
> > > >> +                           *
> > > >> +                           * Check the reference count before dirty flag
> > > >> +                           * with memory barrier; see __remove_mapping().
> > > >> +                           */
> > > >> +                          smp_rmb();
> > > >> +                          if ((ref_count - 1 == map_count) &&
> > > >> +                              !PageDirty(page)) {
> > > >>                                    /* Invalidate as we cleared the pte */
> > > >>                                    mmu_notifier_invalidate_range(mm,
> > > >>                                            address, address + PAGE_SIZE);
> > > >
> > > > Out of curiosity, how does it work with COW in terms of reordering?
> > > > Specifically, it seems to me get_page() and page_dup_rmap() in
> > > > copy_present_pte() can happen in any order, and if page_dup_rmap()
> > > > is seen first, and direct io is holding a refcnt, this check can still
> > > > pass?
> > >
> > > I think that you are correct.
> > >
> > > After more thoughts, it appears very tricky to compare page count and
> > > map count.  Even if we have added smp_rmb() between page_ref_count() and
> > > page_mapcount(), an interrupt may happen between them.  During the
> > > interrupt, the page count and map count may be changed, for example,
> > > unmapped, or do_swap_page().
> >
> > Yeah, it happens but what specific problem are you concerning from the
> > count change under race? The fork case Yu pointed out was already known
> > for breaking DIO so user should take care not to fork under DIO(Please
> > look at O_DIRECT section in man 2 open). If you could give a specific
> > example, it would be great to think over the issue.
> >
> > I agree it's little tricky but it seems to be way other place has used
> > for a long time(Please look at write_protect_page in ksm.c).
> 
> Ah, that's great to see it's being used elsewhere, for DIO particularly!
> 
> > So, here what we missing is tlb flush before the checking.
> 
> That shouldn't be required for this particular issue/case, IIUIC.
> One of the things we checked early on was disabling deferred TLB flush
> (similarly to what you've done), and it didn't help with the issue; also, the
> issue happens on uniprocessor mode too (thus no remote CPU involved.)

I guess you didn't try it with page_mapcount + 1 == page_count at that
time?  Anyway, I agree we don't need a TLB flush here, unlike KSM.
I think the reason KSM does the TLB flush before the check is to
make sure a trap triggers on a write from the user process on another
core. However, in this MADV_FREE case, HW already guarantees the trap.
Please see below.

> 
> 
> >
> > Something like this.
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index b0fd9dc19eba..b4ad9faa17b2 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1599,18 +1599,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >
> >                         /* MADV_FREE page check */
> >                         if (!PageSwapBacked(page)) {
> > -                               int refcount = page_ref_count(page);
> > -
> > -                               /*
> > -                                * The only page refs must be from the isolation
> > -                                * (checked by the caller shrink_page_list() too)
> > -                                * and the (single) rmap (dropped by discard:).
> > -                                *
> > -                                * Check the reference count before dirty flag
> > -                                * with memory barrier; see __remove_mapping().
> > -                                */
> > -                               smp_rmb();
> > -                               if (refcount == 2 && !PageDirty(page)) {
> > +                               if (!PageDirty(page) &&
> > +                                       page_mapcount(page) + 1 == page_count(page)) {
> 
> In the interest of avoiding a different race/bug, it seemed worth following the
> suggestion outlined in __remove_mapping(), i.e., checking PageDirty()
> after the page's reference count, with a memory barrier in between.

True, so it means your patch as-is is good for me.

> 
> I'm not familiar with the details of the original issue behind that code change,
> but it seemed to be possible here too, particularly as writes from user-space
> can happen asynchronously / after try_to_unmap_one() checked PTE clean
> and didn't set PageDirty, and if the page's PTE is present, there's no fault?

Yeah, it was discussed. 

For a clean pte, the CPU has to fetch and update the actual pte entry,
not the TLB, so a trap triggers for the MADV_FREE page.

https://lkml.org/lkml/2015/4/15/565
https://lkml.org/lkml/2015/4/16/136
Huang, Ying Jan. 13, 2022, 5:47 a.m. UTC | #17
Minchan Kim <minchan@kernel.org> writes:

> On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote:
>> Yu Zhao <yuzhao@google.com> writes:
>> 
>> > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote:
>> >> diff --git a/mm/rmap.c b/mm/rmap.c
>> >> index 163ac4e6bcee..8671de473c25 100644
>> >> --- a/mm/rmap.c
>> >> +++ b/mm/rmap.c
>> >> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> >>  
>> >>  			/* MADV_FREE page check */
>> >>  			if (!PageSwapBacked(page)) {
>> >> -				if (!PageDirty(page)) {
>> >> +				int ref_count = page_ref_count(page);
>> >> +				int map_count = page_mapcount(page);
>> >> +
>> >> +				/*
>> >> +				 * The only page refs must be from the isolation
>> >> +				 * (checked by the caller shrink_page_list() too)
>> >> +				 * and one or more rmap's (dropped by discard:).
>> >> +				 *
>> >> +				 * Check the reference count before dirty flag
>> >> +				 * with memory barrier; see __remove_mapping().
>> >> +				 */
>> >> +				smp_rmb();
>> >> +				if ((ref_count - 1 == map_count) &&
>> >> +				    !PageDirty(page)) {
>> >>  					/* Invalidate as we cleared the pte */
>> >>  					mmu_notifier_invalidate_range(mm,
>> >>  						address, address + PAGE_SIZE);
>> >
>> > Out of curiosity, how does it work with COW in terms of reordering?
>> > Specifically, it seems to me get_page() and page_dup_rmap() in
>> > copy_present_pte() can happen in any order, and if page_dup_rmap()
>> > is seen first, and direct io is holding a refcnt, this check can still
>> > pass?
>> 
>> I think that you are correct.
>> 
>> After more thoughts, it appears very tricky to compare page count and
>> map count.  Even if we have added smp_rmb() between page_ref_count() and
>> page_mapcount(), an interrupt may happen between them.  During the
>> interrupt, the page count and map count may be changed, for example,
>> unmapped, or do_swap_page().
>
> Yeah, it happens but what specific problem are you concerning from the
> count change under race? The fork case Yu pointed out was already known
> for breaking DIO so user should take care not to fork under DIO(Please
> look at O_DIRECT section in man 2 open). If you could give a specific
> example, it would be great to think over the issue.

Is the following race possible?

CPU0/Process A                  CPU1/Process B
--------------                  --------------
try_to_unmap_one
  page_mapcount()
                                zap_pte_range()
                                  page_remove_rmap()
                                    atomic_add_negative(-1, &page->_mapcount)
                                  tlb_flush_mmu()
                                    ...
                                      put_page_testzero()
  page_count()

Previously I thought that there's a similar race in do_swap_page().
But after more thought, I found that the page is locked in
do_swap_page(), so do_swap_page() is safe.  Per my understanding,
except during fork() as Yu pointed out, the anonymous page must be
locked before increasing its mapcount.

So, if the above race is possible, we need to guarantee that
page_count() is read before page_mapcount().  That is, something
like the following:

        count = page_count();
        smp_rmb();
        mapcount = page_mapcount();
        if (!PageDirty(page) && mapcount + 1 == count) {
                ...
        }

Best Regards,
Huang, Ying

> I agree it's little tricky but it seems to be way other place has used
> for a long time(Please look at write_protect_page in ksm.c).
> So, here what we missing is tlb flush before the checking.
>
> Something like this.
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index b0fd9dc19eba..b4ad9faa17b2 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1599,18 +1599,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>
>                         /* MADV_FREE page check */
>                         if (!PageSwapBacked(page)) {
> -                               int refcount = page_ref_count(page);
> -
> -                               /*
> -                                * The only page refs must be from the isolation
> -                                * (checked by the caller shrink_page_list() too)
> -                                * and the (single) rmap (dropped by discard:).
> -                                *
> -                                * Check the reference count before dirty flag
> -                                * with memory barrier; see __remove_mapping().
> -                                */
> -                               smp_rmb();
> -                               if (refcount == 2 && !PageDirty(page)) {
> +                               if (!PageDirty(page) &&
> +                                       page_mapcount(page) + 1 == page_count(page)) {
>                                         /* Invalidate as we cleared the pte */
>                                         mmu_notifier_invalidate_range(mm,
>                                                 address, address + PAGE_SIZE);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f3162a5724de..6454ff5c576f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1754,6 +1754,9 @@ static unsigned int shrink_page_list(struct list_head *page_list,
>                         enum ttu_flags flags = TTU_BATCH_FLUSH;
>                         bool was_swapbacked = PageSwapBacked(page);
>
> +                       if (!was_swapbacked && PageAnon(page))
> +                               flags &= ~TTU_BATCH_FLUSH;
> +
>                         if (unlikely(PageTransHuge(page)))
>                                 flags |= TTU_SPLIT_HUGE_PMD;
Miaohe Lin Jan. 13, 2022, 6:37 a.m. UTC | #18
On 2022/1/13 13:47, Huang, Ying wrote:
> Minchan Kim <minchan@kernel.org> writes:
> 
>> On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote:
>>> Yu Zhao <yuzhao@google.com> writes:
>>>
>>>> On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote:
>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>> index 163ac4e6bcee..8671de473c25 100644
>>>>> --- a/mm/rmap.c
>>>>> +++ b/mm/rmap.c
>>>>> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>>>  
>>>>>  			/* MADV_FREE page check */
>>>>>  			if (!PageSwapBacked(page)) {
>>>>> -				if (!PageDirty(page)) {
>>>>> +				int ref_count = page_ref_count(page);
>>>>> +				int map_count = page_mapcount(page);
>>>>> +
>>>>> +				/*
>>>>> +				 * The only page refs must be from the isolation
>>>>> +				 * (checked by the caller shrink_page_list() too)
>>>>> +				 * and one or more rmap's (dropped by discard:).
>>>>> +				 *
>>>>> +				 * Check the reference count before dirty flag
>>>>> +				 * with memory barrier; see __remove_mapping().
>>>>> +				 */
>>>>> +				smp_rmb();
>>>>> +				if ((ref_count - 1 == map_count) &&
>>>>> +				    !PageDirty(page)) {
>>>>>  					/* Invalidate as we cleared the pte */
>>>>>  					mmu_notifier_invalidate_range(mm,
>>>>>  						address, address + PAGE_SIZE);
>>>>
>>>> Out of curiosity, how does it work with COW in terms of reordering?
>>>> Specifically, it seems to me get_page() and page_dup_rmap() in
>>>> copy_present_pte() can happen in any order, and if page_dup_rmap()
>>>> is seen first, and direct io is holding a refcnt, this check can still
>>>> pass?
>>>
>>> I think that you are correct.
>>>
>>> After more thoughts, it appears very tricky to compare page count and
>>> map count.  Even if we have added smp_rmb() between page_ref_count() and
>>> page_mapcount(), an interrupt may happen between them.  During the
>>> interrupt, the page count and map count may be changed, for example,
>>> unmapped, or do_swap_page().
>>
>> Yeah, it happens but what specific problem are you concerning from the
>> count change under race? The fork case Yu pointed out was already known
>> for breaking DIO so user should take care not to fork under DIO(Please
>> look at O_DIRECT section in man 2 open). If you could give a specific
>> example, it would be great to think over the issue.
> 
> Whether is the following race possible?
> 
> CPU0/Process A                  CPU1/Process B
> --------------                  --------------
> try_to_unmap_one
>   page_mapcount()
>                                 zap_pte_range()
>                                   page_remove_rmap()
>                                     atomic_add_negative(-1, &page->_mapcount)
>                                   tlb_flush_mmu()
>                                     ...
>                                       put_page_testzero()
>   page_count()
> 

It seems they're under the same page table lock.

Thanks.

> Previously I thought that there's similar race in do_swap_page().  But
> after more thoughts, I found that the page is locked in do_swap_page().
> So do_swap_page() is safe.  Per my understanding, except during fork()
> as Yu pointed out, the anonymous page must be locked before increasing
> its mapcount.
> 
> So, if the above race is possible, we need to guarantee to read
> page_count() before page_mapcount().  That is, something as follows,
> 
>         count = page_count();
>         smp_rmb();
>         mapcount = page_mapcount();
>         if (!PageDirty(page) && mapcount + 1 == count) {
>                 ...
>         }
> 
> Best Regards,
> Huang, Ying
> 
>> I agree it's little tricky but it seems to be way other place has used
>> for a long time(Please look at write_protect_page in ksm.c).
>> So, here what we missing is tlb flush before the checking.
>>
>> Something like this.
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index b0fd9dc19eba..b4ad9faa17b2 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1599,18 +1599,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>
>>                         /* MADV_FREE page check */
>>                         if (!PageSwapBacked(page)) {
>> -                               int refcount = page_ref_count(page);
>> -
>> -                               /*
>> -                                * The only page refs must be from the isolation
>> -                                * (checked by the caller shrink_page_list() too)
>> -                                * and the (single) rmap (dropped by discard:).
>> -                                *
>> -                                * Check the reference count before dirty flag
>> -                                * with memory barrier; see __remove_mapping().
>> -                                */
>> -                               smp_rmb();
>> -                               if (refcount == 2 && !PageDirty(page)) {
>> +                               if (!PageDirty(page) &&
>> +                                       page_mapcount(page) + 1 == page_count(page)) {
>>                                         /* Invalidate as we cleared the pte */
>>                                         mmu_notifier_invalidate_range(mm,
>>                                                 address, address + PAGE_SIZE);
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index f3162a5724de..6454ff5c576f 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1754,6 +1754,9 @@ static unsigned int shrink_page_list(struct list_head *page_list,
>>                         enum ttu_flags flags = TTU_BATCH_FLUSH;
>>                         bool was_swapbacked = PageSwapBacked(page);
>>
>> +                       if (!was_swapbacked && PageAnon(page))
>> +                               flags &= ~TTU_BATCH_FLUSH;
>> +
>>                         if (unlikely(PageTransHuge(page)))
>>                                 flags |= TTU_SPLIT_HUGE_PMD;
> .
>
Yu Zhao Jan. 13, 2022, 7:29 a.m. UTC | #19
On Wed, Jan 12, 2022 at 2:53 PM Mauricio Faria de Oliveira
<mfo@canonical.com> wrote:
>
> Hi Minchan Kim,
>
> Thanks for handling the hard questions! :)
>
> On Wed, Jan 12, 2022 at 2:33 PM Minchan Kim <minchan@kernel.org> wrote:
> >
> > On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote:
> > > Yu Zhao <yuzhao@google.com> writes:
> > >
> > > > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote:
> > > >> diff --git a/mm/rmap.c b/mm/rmap.c
> > > >> index 163ac4e6bcee..8671de473c25 100644
> > > >> --- a/mm/rmap.c
> > > >> +++ b/mm/rmap.c
> > > >> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > >>
> > > >>                    /* MADV_FREE page check */
> > > >>                    if (!PageSwapBacked(page)) {
> > > >> -                          if (!PageDirty(page)) {
> > > >> +                          int ref_count = page_ref_count(page);
> > > >> +                          int map_count = page_mapcount(page);
> > > >> +
> > > >> +                          /*
> > > >> +                           * The only page refs must be from the isolation
> > > >> +                           * (checked by the caller shrink_page_list() too)
> > > >> +                           * and one or more rmap's (dropped by discard:).
> > > >> +                           *
> > > >> +                           * Check the reference count before dirty flag
> > > >> +                           * with memory barrier; see __remove_mapping().
> > > >> +                           */
> > > >> +                          smp_rmb();
> > > >> +                          if ((ref_count - 1 == map_count) &&
> > > >> +                              !PageDirty(page)) {
> > > >>                                    /* Invalidate as we cleared the pte */
> > > >>                                    mmu_notifier_invalidate_range(mm,
> > > >>                                            address, address + PAGE_SIZE);
> > > >
> > > > Out of curiosity, how does it work with COW in terms of reordering?
> > > > Specifically, it seems to me get_page() and page_dup_rmap() in
> > > > copy_present_pte() can happen in any order, and if page_dup_rmap()
> > > > is seen first, and direct io is holding a refcnt, this check can still
> > > > pass?
> > >
> > > I think that you are correct.
> > >
> > > After more thoughts, it appears very tricky to compare page count and
> > > map count.  Even if we have added smp_rmb() between page_ref_count() and
> > > page_mapcount(), an interrupt may happen between them.  During the
> > > interrupt, the page count and map count may be changed, for example,
> > > unmapped, or do_swap_page().
> >
> > Yeah, it happens but what specific problem are you concerning from the
> > count change under race? The fork case Yu pointed out was already known
> > for breaking DIO so user should take care not to fork under DIO(Please
> > look at O_DIRECT section in man 2 open). If you could give a specific
> > example, it would be great to think over the issue.
> >
> > I agree it's little tricky but it seems to be way other place has used
> > for a long time(Please look at write_protect_page in ksm.c).
>
> Ah, that's great to see it's being used elsewhere, for DIO particularly!
>
> > So, here what we missing is tlb flush before the checking.
>
> That shouldn't be required for this particular issue/case, IIUIC.
> One of the things we checked early on was disabling deferred TLB flush
> (similarly to what you've done), and it didn't help with the issue; also, the
> issue happens on uniprocessor mode too (thus no remote CPU involved.)

Fast gup doesn't block TLB flushes; it only blocks the IPIs used when
freeing page tables. So it's expected that forcing a TLB flush doesn't
fix the problem.

But it still seems to me the fix is missing smp_mb(). IIUC, a proper
fix would require, on the DIO side:

  inc page refcnt
  smp_mb()
  read pte

and on the rmap side:

  clear pte
  smp_mb()
  read page refcnt

try_grab_compound_head() implies smp_mb(), but I don't think
ptep_get_and_clear() does.
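
For illustration only -- this is a rough userspace analogue of that
ordering requirement, not kernel code. It's the classic store-buffering
shape: each side does a store, smp_mb(), then a load, and with both
barriers in place at least one side must observe the other's store
(either GUP sees the cleared pte, or reclaim sees the raised refcount).
The names dio_side()/rmap_side() and the refcnt/pte variables are made
up for the sketch; C11 seq_cst fences stand in for smp_mb().

#include <stdatomic.h>
#include <stdbool.h>

static atomic_int refcnt = 1;   /* ref held by the mapping/isolation */
static atomic_int pte = 1;      /* 1: pte present, 0: pte cleared    */

/* DIO / gup side: inc page refcnt; smp_mb(); read pte */
static bool dio_side(void)
{
	atomic_fetch_add_explicit(&refcnt, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);      /* smp_mb() */
	return atomic_load_explicit(&pte, memory_order_relaxed);
}

/* rmap / reclaim side: clear pte; smp_mb(); read page refcnt */
static int rmap_side(void)
{
	atomic_store_explicit(&pte, 0, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);      /* smp_mb() */
	return atomic_load_explicit(&refcnt, memory_order_relaxed);
}

int main(void)
{
	/* Single-threaded call just to keep the sketch compilable; the
	 * interesting case is the two sides racing on different CPUs. */
	dio_side();
	return rmap_side() == 2 ? 0 : 1;
}

Drop either fence and both sides can read the stale value, which is the
window being discussed here.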

The mapcount, as Minchan said, is probably irrelevant given that DIO is
already known to be broken with fork.

I glanced at the thread and thought it might be worth mentioning.
Huang, Ying Jan. 13, 2022, 8:04 a.m. UTC | #20
Miaohe Lin <linmiaohe@huawei.com> writes:

> On 2022/1/13 13:47, Huang, Ying wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>> 
>>> On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote:
>>>> Yu Zhao <yuzhao@google.com> writes:
>>>>
>>>>> On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote:
>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>> index 163ac4e6bcee..8671de473c25 100644
>>>>>> --- a/mm/rmap.c
>>>>>> +++ b/mm/rmap.c
>>>>>> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>>>>  
>>>>>>  			/* MADV_FREE page check */
>>>>>>  			if (!PageSwapBacked(page)) {
>>>>>> -				if (!PageDirty(page)) {
>>>>>> +				int ref_count = page_ref_count(page);
>>>>>> +				int map_count = page_mapcount(page);
>>>>>> +
>>>>>> +				/*
>>>>>> +				 * The only page refs must be from the isolation
>>>>>> +				 * (checked by the caller shrink_page_list() too)
>>>>>> +				 * and one or more rmap's (dropped by discard:).
>>>>>> +				 *
>>>>>> +				 * Check the reference count before dirty flag
>>>>>> +				 * with memory barrier; see __remove_mapping().
>>>>>> +				 */
>>>>>> +				smp_rmb();
>>>>>> +				if ((ref_count - 1 == map_count) &&
>>>>>> +				    !PageDirty(page)) {
>>>>>>  					/* Invalidate as we cleared the pte */
>>>>>>  					mmu_notifier_invalidate_range(mm,
>>>>>>  						address, address + PAGE_SIZE);
>>>>>
>>>>> Out of curiosity, how does it work with COW in terms of reordering?
>>>>> Specifically, it seems to me get_page() and page_dup_rmap() in
>>>>> copy_present_pte() can happen in any order, and if page_dup_rmap()
>>>>> is seen first, and direct io is holding a refcnt, this check can still
>>>>> pass?
>>>>
>>>> I think that you are correct.
>>>>
>>>> After more thoughts, it appears very tricky to compare page count and
>>>> map count.  Even if we have added smp_rmb() between page_ref_count() and
>>>> page_mapcount(), an interrupt may happen between them.  During the
>>>> interrupt, the page count and map count may be changed, for example,
>>>> unmapped, or do_swap_page().
>>>
>>> Yeah, it happens but what specific problem are you concerning from the
>>> count change under race? The fork case Yu pointed out was already known
>>> for breaking DIO so user should take care not to fork under DIO(Please
>>> look at O_DIRECT section in man 2 open). If you could give a specific
>>> example, it would be great to think over the issue.
>> 
>> Whether is the following race possible?
>> 
>> CPU0/Process A                  CPU1/Process B
>> --------------                  --------------
>> try_to_unmap_one
>>   page_mapcount()
>>                                 zap_pte_range()
>>                                   page_remove_rmap()
>>                                     atomic_add_negative(-1, &page->_mapcount)
>>                                   tlb_flush_mmu()
>>                                     ...
>>                                       put_page_testzero()
>>   page_count()
>> 
>
> It seems they're under the same page table Lock.

This is for a page shared by 2 processes (Process A/B above).  But you
reminded me that an anonymous page cannot be shared between multiple
processes after direct IO: a direct IO read un-shares the page, fork()
isn't allowed, and I guess KSM isn't allowed either.

So the above race isn't possible.  Sorry for the confusion.

Best Regards,
Huang, Ying

> Thanks.
>
>> Previously I thought that there's similar race in do_swap_page().  But
>> after more thoughts, I found that the page is locked in do_swap_page().
>> So do_swap_page() is safe.  Per my understanding, except during fork()
>> as Yu pointed out, the anonymous page must be locked before increasing
>> its mapcount.
>> 
>> So, if the above race is possible, we need to guarantee to read
>> page_count() before page_mapcount().  That is, something as follows,
>> 
>>         count = page_count();
>>         smp_rmb();
>>         mapcount = page_mapcount();
>>         if (!PageDirty(page) && mapcount + 1 == count) {
>>                 ...
>>         }
>> 
>> Best Regards,
>> Huang, Ying
>> 
>>> I agree it's little tricky but it seems to be way other place has used
>>> for a long time(Please look at write_protect_page in ksm.c).
>>> So, here what we missing is tlb flush before the checking.
>>>
>>> Something like this.
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index b0fd9dc19eba..b4ad9faa17b2 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1599,18 +1599,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>
>>>                         /* MADV_FREE page check */
>>>                         if (!PageSwapBacked(page)) {
>>> -                               int refcount = page_ref_count(page);
>>> -
>>> -                               /*
>>> -                                * The only page refs must be from the isolation
>>> -                                * (checked by the caller shrink_page_list() too)
>>> -                                * and the (single) rmap (dropped by discard:).
>>> -                                *
>>> -                                * Check the reference count before dirty flag
>>> -                                * with memory barrier; see __remove_mapping().
>>> -                                */
>>> -                               smp_rmb();
>>> -                               if (refcount == 2 && !PageDirty(page)) {
>>> +                               if (!PageDirty(page) &&
>>> +                                       page_mapcount(page) + 1 == page_count(page)) {
>>>                                         /* Invalidate as we cleared the pte */
>>>                                         mmu_notifier_invalidate_range(mm,
>>>                                                 address, address + PAGE_SIZE);
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index f3162a5724de..6454ff5c576f 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -1754,6 +1754,9 @@ static unsigned int shrink_page_list(struct list_head *page_list,
>>>                         enum ttu_flags flags = TTU_BATCH_FLUSH;
>>>                         bool was_swapbacked = PageSwapBacked(page);
>>>
>>> +                       if (!was_swapbacked && PageAnon(page))
>>> +                               flags &= ~TTU_BATCH_FLUSH;
>>> +
>>>                         if (unlikely(PageTransHuge(page)))
>>>                                 flags |= TTU_SPLIT_HUGE_PMD;
>> .
>>
Huang, Ying Jan. 13, 2022, 8:54 a.m. UTC | #21
Minchan Kim <minchan@kernel.org> writes:

> On Wed, Jan 12, 2022 at 06:53:07PM -0300, Mauricio Faria de Oliveira wrote:
>> Hi Minchan Kim,
>> 
>> Thanks for handling the hard questions! :)
>> 
>> On Wed, Jan 12, 2022 at 2:33 PM Minchan Kim <minchan@kernel.org> wrote:
>> >
>> > On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote:
>> > > Yu Zhao <yuzhao@google.com> writes:
>> > >
>> > > > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote:
>> > > >> diff --git a/mm/rmap.c b/mm/rmap.c
>> > > >> index 163ac4e6bcee..8671de473c25 100644
>> > > >> --- a/mm/rmap.c
>> > > >> +++ b/mm/rmap.c
>> > > >> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> > > >>
>> > > >>                    /* MADV_FREE page check */
>> > > >>                    if (!PageSwapBacked(page)) {
>> > > >> -                          if (!PageDirty(page)) {
>> > > >> +                          int ref_count = page_ref_count(page);
>> > > >> +                          int map_count = page_mapcount(page);
>> > > >> +
>> > > >> +                          /*
>> > > >> +                           * The only page refs must be from the isolation
>> > > >> +                           * (checked by the caller shrink_page_list() too)
>> > > >> +                           * and one or more rmap's (dropped by discard:).
>> > > >> +                           *
>> > > >> +                           * Check the reference count before dirty flag
>> > > >> +                           * with memory barrier; see __remove_mapping().
>> > > >> +                           */
>> > > >> +                          smp_rmb();
>> > > >> +                          if ((ref_count - 1 == map_count) &&
>> > > >> +                              !PageDirty(page)) {
>> > > >>                                    /* Invalidate as we cleared the pte */
>> > > >>                                    mmu_notifier_invalidate_range(mm,
>> > > >>                                            address, address + PAGE_SIZE);
>> > > >
>> > > > Out of curiosity, how does it work with COW in terms of reordering?
>> > > > Specifically, it seems to me get_page() and page_dup_rmap() in
>> > > > copy_present_pte() can happen in any order, and if page_dup_rmap()
>> > > > is seen first, and direct io is holding a refcnt, this check can still
>> > > > pass?
>> > >
>> > > I think that you are correct.
>> > >
>> > > After more thoughts, it appears very tricky to compare page count and
>> > > map count.  Even if we have added smp_rmb() between page_ref_count() and
>> > > page_mapcount(), an interrupt may happen between them.  During the
>> > > interrupt, the page count and map count may be changed, for example,
>> > > unmapped, or do_swap_page().
>> >
>> > Yeah, it happens but what specific problem are you concerning from the
>> > count change under race? The fork case Yu pointed out was already known
>> > for breaking DIO so user should take care not to fork under DIO(Please
>> > look at O_DIRECT section in man 2 open). If you could give a specific
>> > example, it would be great to think over the issue.
>> >
>> > I agree it's little tricky but it seems to be way other place has used
>> > for a long time(Please look at write_protect_page in ksm.c).
>> 
>> Ah, that's great to see it's being used elsewhere, for DIO particularly!
>> 
>> > So, here what we missing is tlb flush before the checking.
>> 
>> That shouldn't be required for this particular issue/case, IIUIC.
>> One of the things we checked early on was disabling deferred TLB flush
>> (similarly to what you've done), and it didn't help with the issue; also, the
>> issue happens on uniprocessor mode too (thus no remote CPU involved.)
>
> I guess you didn't try it with page_mapcount + 1 == page_count at tha
> time?  Anyway, I agree we don't need TLB flush here like KSM.
> I think the reason KSM is doing TLB flush before the check it to
> make sure trap trigger on the write from userprocess in other core.
> However, this MADV_FREE case, HW already gaurantees the trap.
> Please see below.
>
>> 
>> 
>> >
>> > Something like this.
>> >
>> > diff --git a/mm/rmap.c b/mm/rmap.c
>> > index b0fd9dc19eba..b4ad9faa17b2 100644
>> > --- a/mm/rmap.c
>> > +++ b/mm/rmap.c
>> > @@ -1599,18 +1599,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> >
>> >                         /* MADV_FREE page check */
>> >                         if (!PageSwapBacked(page)) {
>> > -                               int refcount = page_ref_count(page);
>> > -
>> > -                               /*
>> > -                                * The only page refs must be from the isolation
>> > -                                * (checked by the caller shrink_page_list() too)
>> > -                                * and the (single) rmap (dropped by discard:).
>> > -                                *
>> > -                                * Check the reference count before dirty flag
>> > -                                * with memory barrier; see __remove_mapping().
>> > -                                */
>> > -                               smp_rmb();
>> > -                               if (refcount == 2 && !PageDirty(page)) {
>> > +                               if (!PageDirty(page) &&
>> > +                                       page_mapcount(page) + 1 == page_count(page)) {
>> 
>> In the interest of avoiding a different race/bug, it seemed worth following the
>> suggestion outlined in __remove_mapping(), i.e., checking PageDirty()
>> after the page's reference count, with a memory barrier in between.
>
> True so it means your patch as-is is good for me.

If my understanding is correct, a shared anonymous page will be mapped
read-only.  If so, can SetPageDirty() be called concurrently on a
private anonymous page after the direct IO case has been dealt with
via comparing page_count()/page_mapcount()?

Best Regards,
Huang, Ying
Huang, Ying Jan. 13, 2022, 12:30 p.m. UTC | #22
"Huang, Ying" <ying.huang@intel.com> writes:

> Minchan Kim <minchan@kernel.org> writes:
>
>> On Wed, Jan 12, 2022 at 06:53:07PM -0300, Mauricio Faria de Oliveira wrote:
>>> Hi Minchan Kim,
>>> 
>>> Thanks for handling the hard questions! :)
>>> 
>>> On Wed, Jan 12, 2022 at 2:33 PM Minchan Kim <minchan@kernel.org> wrote:
>>> >
>>> > On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote:
>>> > > Yu Zhao <yuzhao@google.com> writes:
>>> > >
>>> > > > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote:
>>> > > >> diff --git a/mm/rmap.c b/mm/rmap.c
>>> > > >> index 163ac4e6bcee..8671de473c25 100644
>>> > > >> --- a/mm/rmap.c
>>> > > >> +++ b/mm/rmap.c
>>> > > >> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>> > > >>
>>> > > >>                    /* MADV_FREE page check */
>>> > > >>                    if (!PageSwapBacked(page)) {
>>> > > >> -                          if (!PageDirty(page)) {
>>> > > >> +                          int ref_count = page_ref_count(page);
>>> > > >> +                          int map_count = page_mapcount(page);
>>> > > >> +
>>> > > >> +                          /*
>>> > > >> +                           * The only page refs must be from the isolation
>>> > > >> +                           * (checked by the caller shrink_page_list() too)
>>> > > >> +                           * and one or more rmap's (dropped by discard:).
>>> > > >> +                           *
>>> > > >> +                           * Check the reference count before dirty flag
>>> > > >> +                           * with memory barrier; see __remove_mapping().
>>> > > >> +                           */
>>> > > >> +                          smp_rmb();
>>> > > >> +                          if ((ref_count - 1 == map_count) &&
>>> > > >> +                              !PageDirty(page)) {
>>> > > >>                                    /* Invalidate as we cleared the pte */
>>> > > >>                                    mmu_notifier_invalidate_range(mm,
>>> > > >>                                            address, address + PAGE_SIZE);
>>> > > >
>>> > > > Out of curiosity, how does it work with COW in terms of reordering?
>>> > > > Specifically, it seems to me get_page() and page_dup_rmap() in
>>> > > > copy_present_pte() can happen in any order, and if page_dup_rmap()
>>> > > > is seen first, and direct io is holding a refcnt, this check can still
>>> > > > pass?
>>> > >
>>> > > I think that you are correct.
>>> > >
>>> > > After more thoughts, it appears very tricky to compare page count and
>>> > > map count.  Even if we have added smp_rmb() between page_ref_count() and
>>> > > page_mapcount(), an interrupt may happen between them.  During the
>>> > > interrupt, the page count and map count may be changed, for example,
>>> > > unmapped, or do_swap_page().
>>> >
>>> > Yeah, it happens but what specific problem are you concerning from the
>>> > count change under race? The fork case Yu pointed out was already known
>>> > for breaking DIO so user should take care not to fork under DIO(Please
>>> > look at O_DIRECT section in man 2 open). If you could give a specific
>>> > example, it would be great to think over the issue.
>>> >
>>> > I agree it's little tricky but it seems to be way other place has used
>>> > for a long time(Please look at write_protect_page in ksm.c).
>>> 
>>> Ah, that's great to see it's being used elsewhere, for DIO particularly!
>>> 
>>> > So, here what we missing is tlb flush before the checking.
>>> 
>>> That shouldn't be required for this particular issue/case, IIUIC.
>>> One of the things we checked early on was disabling deferred TLB flush
>>> (similarly to what you've done), and it didn't help with the issue; also, the
>>> issue happens on uniprocessor mode too (thus no remote CPU involved.)
>>
>> I guess you didn't try it with page_mapcount + 1 == page_count at tha
>> time?  Anyway, I agree we don't need TLB flush here like KSM.
>> I think the reason KSM is doing TLB flush before the check it to
>> make sure trap trigger on the write from userprocess in other core.
>> However, this MADV_FREE case, HW already gaurantees the trap.
>> Please see below.
>>
>>> 
>>> 
>>> >
>>> > Something like this.
>>> >
>>> > diff --git a/mm/rmap.c b/mm/rmap.c
>>> > index b0fd9dc19eba..b4ad9faa17b2 100644
>>> > --- a/mm/rmap.c
>>> > +++ b/mm/rmap.c
>>> > @@ -1599,18 +1599,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>> >
>>> >                         /* MADV_FREE page check */
>>> >                         if (!PageSwapBacked(page)) {
>>> > -                               int refcount = page_ref_count(page);
>>> > -
>>> > -                               /*
>>> > -                                * The only page refs must be from the isolation
>>> > -                                * (checked by the caller shrink_page_list() too)
>>> > -                                * and the (single) rmap (dropped by discard:).
>>> > -                                *
>>> > -                                * Check the reference count before dirty flag
>>> > -                                * with memory barrier; see __remove_mapping().
>>> > -                                */
>>> > -                               smp_rmb();
>>> > -                               if (refcount == 2 && !PageDirty(page)) {
>>> > +                               if (!PageDirty(page) &&
>>> > +                                       page_mapcount(page) + 1 == page_count(page)) {
>>> 
>>> In the interest of avoiding a different race/bug, it seemed worth following the
>>> suggestion outlined in __remove_mapping(), i.e., checking PageDirty()
>>> after the page's reference count, with a memory barrier in between.
>>
>> True so it means your patch as-is is good for me.
>
> If my understanding were correct, a shared anonymous page will be mapped
> read-only.  If so, will a private anonymous page be called
> SetPageDirty() concurrently after direct IO case has been dealt with
> via comparing page_count()/page_mapcount()?

Sorry, I found that I am not quite right here.  When a direct IO read
completes, it will finally call SetPageDirty() and put_page().  And the
MADV_FREE handling in try_to_unmap_one() needs to deal with that too.

Checking the direct IO code, it appears that set_page_dirty_lock() is
used to set the page dirty, which will use lock_page().

  dio_bio_complete
    bio_check_pages_dirty
      bio_dirty_fn  /* through workqueue */
        bio_release_pages
          set_page_dirty_lock
    bio_release_pages
      set_page_dirty_lock

So in theory, for direct IO, the memory barrier may be unnecessary.  But
I don't think it's a good idea to depend on this specific behavior of
direct IO.  The original code with the memory barrier looks more generic,
and I don't think it will introduce visible overhead.

Best Regards,
Huang, Ying
Mauricio Faria de Oliveira Jan. 13, 2022, 2:30 p.m. UTC | #23
On Wed, Jan 12, 2022 at 7:37 PM Minchan Kim <minchan@kernel.org> wrote:
>
> On Wed, Jan 12, 2022 at 06:53:07PM -0300, Mauricio Faria de Oliveira wrote:
> > Hi Minchan Kim,
> >
> > Thanks for handling the hard questions! :)
> >
> > On Wed, Jan 12, 2022 at 2:33 PM Minchan Kim <minchan@kernel.org> wrote:
> > >
> > > On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote:
> > > > Yu Zhao <yuzhao@google.com> writes:
> > > >
> > > > > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote:
> > > > >> diff --git a/mm/rmap.c b/mm/rmap.c
> > > > >> index 163ac4e6bcee..8671de473c25 100644
> > > > >> --- a/mm/rmap.c
> > > > >> +++ b/mm/rmap.c
> > > > >> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > >>
> > > > >>                    /* MADV_FREE page check */
> > > > >>                    if (!PageSwapBacked(page)) {
> > > > >> -                          if (!PageDirty(page)) {
> > > > >> +                          int ref_count = page_ref_count(page);
> > > > >> +                          int map_count = page_mapcount(page);
> > > > >> +
> > > > >> +                          /*
> > > > >> +                           * The only page refs must be from the isolation
> > > > >> +                           * (checked by the caller shrink_page_list() too)
> > > > >> +                           * and one or more rmap's (dropped by discard:).
> > > > >> +                           *
> > > > >> +                           * Check the reference count before dirty flag
> > > > >> +                           * with memory barrier; see __remove_mapping().
> > > > >> +                           */
> > > > >> +                          smp_rmb();
> > > > >> +                          if ((ref_count - 1 == map_count) &&
> > > > >> +                              !PageDirty(page)) {
> > > > >>                                    /* Invalidate as we cleared the pte */
> > > > >>                                    mmu_notifier_invalidate_range(mm,
> > > > >>                                            address, address + PAGE_SIZE);
> > > > >
> > > > > Out of curiosity, how does it work with COW in terms of reordering?
> > > > > Specifically, it seems to me get_page() and page_dup_rmap() in
> > > > > copy_present_pte() can happen in any order, and if page_dup_rmap()
> > > > > is seen first, and direct io is holding a refcnt, this check can still
> > > > > pass?
> > > >
> > > > I think that you are correct.
> > > >
> > > > After more thoughts, it appears very tricky to compare page count and
> > > > map count.  Even if we have added smp_rmb() between page_ref_count() and
> > > > page_mapcount(), an interrupt may happen between them.  During the
> > > > interrupt, the page count and map count may be changed, for example,
> > > > unmapped, or do_swap_page().
> > >
> > > Yeah, it happens but what specific problem are you concerning from the
> > > count change under race? The fork case Yu pointed out was already known
> > > for breaking DIO so user should take care not to fork under DIO(Please
> > > look at O_DIRECT section in man 2 open). If you could give a specific
> > > example, it would be great to think over the issue.
> > >
> > > I agree it's little tricky but it seems to be way other place has used
> > > for a long time(Please look at write_protect_page in ksm.c).
> >
> > Ah, that's great to see it's being used elsewhere, for DIO particularly!
> >
> > > So, here what we missing is tlb flush before the checking.
> >
> > That shouldn't be required for this particular issue/case, IIUIC.
> > One of the things we checked early on was disabling deferred TLB flush
> > (similarly to what you've done), and it didn't help with the issue; also, the
> > issue happens on uniprocessor mode too (thus no remote CPU involved.)
>
> I guess you didn't try it with page_mapcount + 1 == page_count at tha
> time?  Anyway, I agree we don't need TLB flush here like KSM.

Sorry, I fail to understand how the page (map) count and the TLB flush
would be related.
(I realize you and Yu Zhao already confirmed the TLB flush is not
needed/expected to fix the issue too; but just for my own education, if
you have a chance.)

> I think the reason KSM is doing TLB flush before the check it to
> make sure trap trigger on the write from userprocess in other core.
> However, this MADV_FREE case, HW already gaurantees the trap.

Understood.

> Please see below.
>
> >
> >
> > >
> > > Something like this.
> > >
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index b0fd9dc19eba..b4ad9faa17b2 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -1599,18 +1599,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > >
> > >                         /* MADV_FREE page check */
> > >                         if (!PageSwapBacked(page)) {
> > > -                               int refcount = page_ref_count(page);
> > > -
> > > -                               /*
> > > -                                * The only page refs must be from the isolation
> > > -                                * (checked by the caller shrink_page_list() too)
> > > -                                * and the (single) rmap (dropped by discard:).
> > > -                                *
> > > -                                * Check the reference count before dirty flag
> > > -                                * with memory barrier; see __remove_mapping().
> > > -                                */
> > > -                               smp_rmb();
> > > -                               if (refcount == 2 && !PageDirty(page)) {
> > > +                               if (!PageDirty(page) &&
> > > +                                       page_mapcount(page) + 1 == page_count(page)) {
> >
> > In the interest of avoiding a different race/bug, it seemed worth following the
> > suggestion outlined in __remove_mapping(), i.e., checking PageDirty()
> > after the page's reference count, with a memory barrier in between.
>
> True so it means your patch as-is is good for me.

That's good news! Thanks for all your help, review, and discussion so
far; it's been very educational.

I see Yu Zhao mentioned a possible concern/suggestion with additional
memory barriers elsewhere.
I'll try and dig to understand/check that in more detail and follow up.

>
> >
> > I'm not familiar with the details of the original issue behind that code change,
> > but it seemed to be possible here too, particularly as writes from user-space
> > can happen asynchronously / after try_to_unmap_one() checked PTE clean
> > and didn't set PageDirty, and if the page's PTE is present, there's no fault?
>
> Yeah, it was discussed.
>
> For clean pte, CPU has to fetch and update the actual pte entry, not TLB
> so trap triggers for MADV_FREE page.
>
> https://lkml.org/lkml/2015/4/15/565
> https://lkml.org/lkml/2015/4/16/136

Thanks for the pointers; great reading.

cheers,
Mauricio Faria de Oliveira Jan. 13, 2022, 2:54 p.m. UTC | #24
On Thu, Jan 13, 2022 at 9:30 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> "Huang, Ying" <ying.huang@intel.com> writes:
>
> > Minchan Kim <minchan@kernel.org> writes:
> >
> >> On Wed, Jan 12, 2022 at 06:53:07PM -0300, Mauricio Faria de Oliveira wrote:
> >>> Hi Minchan Kim,
> >>>
> >>> Thanks for handling the hard questions! :)
> >>>
> >>> On Wed, Jan 12, 2022 at 2:33 PM Minchan Kim <minchan@kernel.org> wrote:
> >>> >
> >>> > On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote:
> >>> > > Yu Zhao <yuzhao@google.com> writes:
> >>> > >
> >>> > > > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote:
> >>> > > >> diff --git a/mm/rmap.c b/mm/rmap.c
> >>> > > >> index 163ac4e6bcee..8671de473c25 100644
> >>> > > >> --- a/mm/rmap.c
> >>> > > >> +++ b/mm/rmap.c
> >>> > > >> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >>> > > >>
> >>> > > >>                    /* MADV_FREE page check */
> >>> > > >>                    if (!PageSwapBacked(page)) {
> >>> > > >> -                          if (!PageDirty(page)) {
> >>> > > >> +                          int ref_count = page_ref_count(page);
> >>> > > >> +                          int map_count = page_mapcount(page);
> >>> > > >> +
> >>> > > >> +                          /*
> >>> > > >> +                           * The only page refs must be from the isolation
> >>> > > >> +                           * (checked by the caller shrink_page_list() too)
> >>> > > >> +                           * and one or more rmap's (dropped by discard:).
> >>> > > >> +                           *
> >>> > > >> +                           * Check the reference count before dirty flag
> >>> > > >> +                           * with memory barrier; see __remove_mapping().
> >>> > > >> +                           */
> >>> > > >> +                          smp_rmb();
> >>> > > >> +                          if ((ref_count - 1 == map_count) &&
> >>> > > >> +                              !PageDirty(page)) {
> >>> > > >>                                    /* Invalidate as we cleared the pte */
> >>> > > >>                                    mmu_notifier_invalidate_range(mm,
> >>> > > >>                                            address, address + PAGE_SIZE);
> >>> > > >
> >>> > > > Out of curiosity, how does it work with COW in terms of reordering?
> >>> > > > Specifically, it seems to me get_page() and page_dup_rmap() in
> >>> > > > copy_present_pte() can happen in any order, and if page_dup_rmap()
> >>> > > > is seen first, and direct io is holding a refcnt, this check can still
> >>> > > > pass?
> >>> > >
> >>> > > I think that you are correct.
> >>> > >
> >>> > > After more thoughts, it appears very tricky to compare page count and
> >>> > > map count.  Even if we have added smp_rmb() between page_ref_count() and
> >>> > > page_mapcount(), an interrupt may happen between them.  During the
> >>> > > interrupt, the page count and map count may be changed, for example,
> >>> > > unmapped, or do_swap_page().
> >>> >
> >>> > Yeah, it happens but what specific problem are you concerning from the
> >>> > count change under race? The fork case Yu pointed out was already known
> >>> > for breaking DIO so user should take care not to fork under DIO(Please
> >>> > look at O_DIRECT section in man 2 open). If you could give a specific
> >>> > example, it would be great to think over the issue.
> >>> >
> >>> > I agree it's little tricky but it seems to be way other place has used
> >>> > for a long time(Please look at write_protect_page in ksm.c).
> >>>
> >>> Ah, that's great to see it's being used elsewhere, for DIO particularly!
> >>>
> >>> > So, here what we missing is tlb flush before the checking.
> >>>
> >>> That shouldn't be required for this particular issue/case, IIUIC.
> >>> One of the things we checked early on was disabling deferred TLB flush
> >>> (similarly to what you've done), and it didn't help with the issue; also, the
> >>> issue happens on uniprocessor mode too (thus no remote CPU involved.)
> >>
> >> I guess you didn't try it with page_mapcount + 1 == page_count at tha
> >> time?  Anyway, I agree we don't need TLB flush here like KSM.
> >> I think the reason KSM is doing TLB flush before the check it to
> >> make sure trap trigger on the write from userprocess in other core.
> >> However, this MADV_FREE case, HW already gaurantees the trap.
> >> Please see below.
> >>
> >>>
> >>>
> >>> >
> >>> > Something like this.
> >>> >
> >>> > diff --git a/mm/rmap.c b/mm/rmap.c
> >>> > index b0fd9dc19eba..b4ad9faa17b2 100644
> >>> > --- a/mm/rmap.c
> >>> > +++ b/mm/rmap.c
> >>> > @@ -1599,18 +1599,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >>> >
> >>> >                         /* MADV_FREE page check */
> >>> >                         if (!PageSwapBacked(page)) {
> >>> > -                               int refcount = page_ref_count(page);
> >>> > -
> >>> > -                               /*
> >>> > -                                * The only page refs must be from the isolation
> >>> > -                                * (checked by the caller shrink_page_list() too)
> >>> > -                                * and the (single) rmap (dropped by discard:).
> >>> > -                                *
> >>> > -                                * Check the reference count before dirty flag
> >>> > -                                * with memory barrier; see __remove_mapping().
> >>> > -                                */
> >>> > -                               smp_rmb();
> >>> > -                               if (refcount == 2 && !PageDirty(page)) {
> >>> > +                               if (!PageDirty(page) &&
> >>> > +                                       page_mapcount(page) + 1 == page_count(page)) {
> >>>
> >>> In the interest of avoiding a different race/bug, it seemed worth following the
> >>> suggestion outlined in __remove_mapping(), i.e., checking PageDirty()
> >>> after the page's reference count, with a memory barrier in between.
> >>
> >> True so it means your patch as-is is good for me.
> >
> > If my understanding were correct, a shared anonymous page will be mapped
> > read-only.  If so, will a private anonymous page be called
> > SetPageDirty() concurrently after direct IO case has been dealt with
> > via comparing page_count()/page_mapcount()?
>
> Sorry, I found that I am not quite right here.  When direct IO read
> completes, it will call SetPageDirty() and put_page() finally.  And
> MADV_FREE in try_to_unmap_one() needs to deal with that too.
>
> Checking direct IO code, it appears that set_page_dirty_lock() is used
> to set page dirty, which will use lock_page().
>
>   dio_bio_complete
>     bio_check_pages_dirty
>       bio_dirty_fn  /* through workqueue */
>         bio_release_pages
>           set_page_dirty_lock
>     bio_release_pages
>       set_page_dirty_lock
>
> So in theory, for direct IO, the memory barrier may be unnecessary.  But
> I don't think it's a good idea to depend on this specific behavior of
> direct IO.  The original code with memory barrier looks more generic and
> I don't think it will introduce visible overhead.
>

Thanks for all the considerations/thought process with potential corner cases!

Regarding the overhead, agreed; and this is in memory reclaim, which isn't
a fast path (and even if it's under direct reclaim, things have slowed
down already), so that would seem to be fine.

cheers,

> Best Regards,
> Huang, Ying
Minchan Kim Jan. 14, 2022, 12:35 a.m. UTC | #25
On Thu, Jan 13, 2022 at 12:29:51AM -0700, Yu Zhao wrote:
> On Wed, Jan 12, 2022 at 2:53 PM Mauricio Faria de Oliveira
> <mfo@canonical.com> wrote:
> >
> > Hi Minchan Kim,
> >
> > Thanks for handling the hard questions! :)
> >
> > On Wed, Jan 12, 2022 at 2:33 PM Minchan Kim <minchan@kernel.org> wrote:
> > >
> > > On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote:
> > > > Yu Zhao <yuzhao@google.com> writes:
> > > >
> > > > > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote:
> > > > >> diff --git a/mm/rmap.c b/mm/rmap.c
> > > > >> index 163ac4e6bcee..8671de473c25 100644
> > > > >> --- a/mm/rmap.c
> > > > >> +++ b/mm/rmap.c
> > > > >> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > >>
> > > > >>                    /* MADV_FREE page check */
> > > > >>                    if (!PageSwapBacked(page)) {
> > > > >> -                          if (!PageDirty(page)) {
> > > > >> +                          int ref_count = page_ref_count(page);
> > > > >> +                          int map_count = page_mapcount(page);
> > > > >> +
> > > > >> +                          /*
> > > > >> +                           * The only page refs must be from the isolation
> > > > >> +                           * (checked by the caller shrink_page_list() too)
> > > > >> +                           * and one or more rmap's (dropped by discard:).
> > > > >> +                           *
> > > > >> +                           * Check the reference count before dirty flag
> > > > >> +                           * with memory barrier; see __remove_mapping().
> > > > >> +                           */
> > > > >> +                          smp_rmb();
> > > > >> +                          if ((ref_count - 1 == map_count) &&
> > > > >> +                              !PageDirty(page)) {
> > > > >>                                    /* Invalidate as we cleared the pte */
> > > > >>                                    mmu_notifier_invalidate_range(mm,
> > > > >>                                            address, address + PAGE_SIZE);
> > > > >
> > > > > Out of curiosity, how does it work with COW in terms of reordering?
> > > > > Specifically, it seems to me get_page() and page_dup_rmap() in
> > > > > copy_present_pte() can happen in any order, and if page_dup_rmap()
> > > > > is seen first, and direct io is holding a refcnt, this check can still
> > > > > pass?
> > > >
> > > > I think that you are correct.
> > > >
> > > > After more thoughts, it appears very tricky to compare page count and
> > > > map count.  Even if we have added smp_rmb() between page_ref_count() and
> > > > page_mapcount(), an interrupt may happen between them.  During the
> > > > interrupt, the page count and map count may be changed, for example,
> > > > unmapped, or do_swap_page().
> > >
> > > Yeah, it happens but what specific problem are you concerning from the
> > > count change under race? The fork case Yu pointed out was already known
> > > for breaking DIO so user should take care not to fork under DIO(Please
> > > look at O_DIRECT section in man 2 open). If you could give a specific
> > > example, it would be great to think over the issue.
> > >
> > > I agree it's little tricky but it seems to be way other place has used
> > > for a long time(Please look at write_protect_page in ksm.c).
> >
> > Ah, that's great to see it's being used elsewhere, for DIO particularly!
> >
> > > So, here what we missing is tlb flush before the checking.
> >
> > That shouldn't be required for this particular issue/case, IIUIC.
> > One of the things we checked early on was disabling deferred TLB flush
> > (similarly to what you've done), and it didn't help with the issue; also, the
> > issue happens on uniprocessor mode too (thus no remote CPU involved.)
> 
> Fast gup doesn't block tlb flush; it only blocks IPI used when freeing
> page tables. So it's expected that forcing a tlb flush doesn't fix the
> problem.
> 
> But it still seems to me the fix is missing smp_mb(). IIUC, a proper
> fix would require, on the dio side
> inc page refcnt
> smp_mb()
> read pte
> 
> and on the rmap side
> clear pte
> smp_mb()
> read page refcnt
> 
> try_grab_compound_head() implies smp_mb, but i don't think
> ptep_get_and_clear() does.
> 
> mapcount, as Minchan said, probably is irrelevant given dio is already
> known to be broken with fork.
> 
> I glanced at the thread and thought it might be worth menthing.

If the madv_free'd page is shared among processes, it means the PTEs
pointing to the page are in CoW state. If DIO is about to work with the
page, gup_fast will fall back to the slow path and then break the CoW
using faultin_page before submitting the bio. Thus, the page is no
longer shared and the PTE was already marked as dirty during fault
handling. So I think there is no race there.

The only problem is the race between DIO and reclaim on an exclusive
private madv_free page. In that case, only page_count would be racy.
If ptep_get_and_clear is unordered, yeah, we need a barrier there.
(It looks unordered, since ARM uses xchg_relaxed.)
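
Purely to picture that exclusive-private case, a hypothetical userspace
sketch might look like the below. This is not the reproducer shipped
with the patch: the device path is a placeholder, error handling is
minimal, and actually hitting the window also needs reclaim to process
the lazyfree page between GUP and the read marking it dirty (the real
reproducer forces that; this sketch just shows the access pattern).

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 4096;
	void *buf;
	int fd;

	if (posix_memalign(&buf, len, len))
		return 1;
	memset(buf, 0xff, len);          /* fault in and dirty the page  */
	madvise(buf, len, MADV_FREE);    /* page is now clean (lazyfree) */

	/* placeholder device; pre-filled with a known non-zero pattern */
	fd = open("/dev/sdX", O_RDONLY | O_DIRECT);
	if (fd < 0)
		return 1;

	/* If reclaim discards the clean pte after
	 * bio_iov_iter_get_pages() but before the pages are set dirty,
	 * the data lands in the page while the mapping is gone. */
	if (pread(fd, buf, len, 0) != (ssize_t)len)
		return 1;

	/* This load may then fault in the zero-page instead of data. */
	return ((unsigned char *)buf)[0] ? 0 : 1;
}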
Mauricio Faria de Oliveira Jan. 31, 2022, 11:10 p.m. UTC | #26
On Thu, Jan 13, 2022 at 9:35 PM Minchan Kim <minchan@kernel.org> wrote:
>
> On Thu, Jan 13, 2022 at 12:29:51AM -0700, Yu Zhao wrote:
> > On Wed, Jan 12, 2022 at 2:53 PM Mauricio Faria de Oliveira
> > <mfo@canonical.com> wrote:
> > >
> > > Hi Minchan Kim,
> > >
> > > Thanks for handling the hard questions! :)
> > >
> > > On Wed, Jan 12, 2022 at 2:33 PM Minchan Kim <minchan@kernel.org> wrote:
> > > >
> > > > On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote:
> > > > > Yu Zhao <yuzhao@google.com> writes:
> > > > >
> > > > > > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote:
> > > > > >> diff --git a/mm/rmap.c b/mm/rmap.c
> > > > > >> index 163ac4e6bcee..8671de473c25 100644
> > > > > >> --- a/mm/rmap.c
> > > > > >> +++ b/mm/rmap.c
> > > > > >> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > > >>
> > > > > >>                    /* MADV_FREE page check */
> > > > > >>                    if (!PageSwapBacked(page)) {
> > > > > >> -                          if (!PageDirty(page)) {
> > > > > >> +                          int ref_count = page_ref_count(page);
> > > > > >> +                          int map_count = page_mapcount(page);
> > > > > >> +
> > > > > >> +                          /*
> > > > > >> +                           * The only page refs must be from the isolation
> > > > > >> +                           * (checked by the caller shrink_page_list() too)
> > > > > >> +                           * and one or more rmap's (dropped by discard:).
> > > > > >> +                           *
> > > > > >> +                           * Check the reference count before dirty flag
> > > > > >> +                           * with memory barrier; see __remove_mapping().
> > > > > >> +                           */
> > > > > >> +                          smp_rmb();
> > > > > >> +                          if ((ref_count - 1 == map_count) &&
> > > > > >> +                              !PageDirty(page)) {
> > > > > >>                                    /* Invalidate as we cleared the pte */
> > > > > >>                                    mmu_notifier_invalidate_range(mm,
> > > > > >>                                            address, address + PAGE_SIZE);
> > > > > >
> > > > > > Out of curiosity, how does it work with COW in terms of reordering?
> > > > > > Specifically, it seems to me get_page() and page_dup_rmap() in
> > > > > > copy_present_pte() can happen in any order, and if page_dup_rmap()
> > > > > > is seen first, and direct io is holding a refcnt, this check can still
> > > > > > pass?
> > > > >
> > > > > I think that you are correct.
> > > > >
> > > > > After more thoughts, it appears very tricky to compare page count and
> > > > > map count.  Even if we have added smp_rmb() between page_ref_count() and
> > > > > page_mapcount(), an interrupt may happen between them.  During the
> > > > > interrupt, the page count and map count may be changed, for example,
> > > > > unmapped, or do_swap_page().
> > > >
> > > > Yeah, it happens but what specific problem are you concerning from the
> > > > count change under race? The fork case Yu pointed out was already known
> > > > for breaking DIO so user should take care not to fork under DIO(Please
> > > > look at O_DIRECT section in man 2 open). If you could give a specific
> > > > example, it would be great to think over the issue.
> > > >
> > > > I agree it's little tricky but it seems to be way other place has used
> > > > for a long time(Please look at write_protect_page in ksm.c).
> > >
> > > Ah, that's great to see it's being used elsewhere, for DIO particularly!
> > >
> > > > So, here what we missing is tlb flush before the checking.
> > >
> > > That shouldn't be required for this particular issue/case, IIUIC.
> > > One of the things we checked early on was disabling deferred TLB flush
> > > (similarly to what you've done), and it didn't help with the issue; also, the
> > > issue happens on uniprocessor mode too (thus no remote CPU involved.)
> >
> > Fast gup doesn't block tlb flush; it only blocks IPI used when freeing
> > page tables. So it's expected that forcing a tlb flush doesn't fix the
> > problem.
> >
> > But it still seems to me the fix is missing smp_mb(). IIUC, a proper
> > fix would require, on the dio side
> > inc page refcnt
> > smp_mb()
> > read pte
> >
> > and on the rmap side
> > clear pte
> > smp_mb()
> > read page refcnt
> >
> > try_grab_compound_head() implies smp_mb, but i don't think
> > ptep_get_and_clear() does.
> >
> > mapcount, as Minchan said, probably is irrelevant given dio is already
> > known to be broken with fork.
> >
> > I glanced at the thread and thought it might be worth menthing.
>
> If the madv_freed page is shared among processes, it means the ptes
> pointing the page are CoW state. If DIO is about to work with the
> page, gup_fast will fallback to slow path and then break the CoW
> using faultin_page before the submit bio. Thus, the page is not
> shared any longer and the pte was alrady marked as dirty on fault
> handling. Thus, I think there is no race.
>
> Only, problem is race between DIO and reclaim on exclusive private
> madv_free page. In the case, page_count would be only racy.
> If ptep_get_and_clear is unordered, yeah, we need barrier there.
> (Looks like unorder since ARM uses xchg_relaxed).

Sorry for the long delay in getting back to this.

Thanks to both of you for bringing this up and checking further!
Agree; I missed that this barrier was needed for the gup() fast path.

Just sent PATCH v3.
diff mbox series

Patch

diff --git a/mm/rmap.c b/mm/rmap.c
index 163ac4e6bcee..8671de473c25 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1570,7 +1570,20 @@  static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 
 			/* MADV_FREE page check */
 			if (!PageSwapBacked(page)) {
-				if (!PageDirty(page)) {
+				int ref_count = page_ref_count(page);
+				int map_count = page_mapcount(page);
+
+				/*
+				 * The only page refs must be from the isolation
+				 * (checked by the caller shrink_page_list() too)
+				 * and one or more rmap's (dropped by discard:).
+				 *
+				 * Check the reference count before dirty flag
+				 * with memory barrier; see __remove_mapping().
+				 */
+				smp_rmb();
+				if ((ref_count - 1 == map_count) &&
+				    !PageDirty(page)) {
 					/* Invalidate as we cleared the pte */
 					mmu_notifier_invalidate_range(mm,
 						address, address + PAGE_SIZE);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fb9584641ac7..c1ea4e14f510 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1688,7 +1688,7 @@  static unsigned int shrink_page_list(struct list_head *page_list,
 				mapping = page_mapping(page);
 			}
 		} else if (unlikely(PageTransHuge(page))) {
-			/* Split file THP */
+			/* Split file/lazyfree THP */
 			if (split_huge_page_to_list(page, page_list))
 				goto keep_locked;
 		}