Message ID | 20220105233440.63361-1-mfo@canonical.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v2] mm: fix race between MADV_FREE reclaim and blkdev direct IO read | expand |
On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote: > Problem: > ======= > > Userspace might read the zero-page instead of actual data from a > direct IO read on a block device if the buffers have been called > madvise(MADV_FREE) on earlier (this is discussed below) due to a > race between page reclaim on MADV_FREE and blkdev direct IO read. > > Race condition: > ============== > > During page reclaim, the MADV_FREE page check in try_to_unmap_one() > checks if the page is not dirty, then discards its PTE (vs remap it > back if the page is dirty). > > However, after try_to_unmap_one() returns to shrink_page_list(), it > might keep the page _anyway_ if page_ref_freeze() fails (it expects > a single page ref from the isolation). > > Well, blkdev_direct_IO() gets references for all pages, and on READ > operations it sets them dirty later. > > So, if MADV_FREE pages (i.e., not dirty) are used as buffers (more > later) for direct IO read from block devices and page reclaim runs > during __blkdev_direct_IO[_simple]() AFTER bio_iov_iter_get_pages() > but BEFORE it sets pages dirty, that situation happens. > > The direct IO read eventually completes. Now, when userspace reads > the buffers, the PTE is no longer there and the page fault handler > do_anonymous_page() services that with the zero-page, NOT the data! > > A synthetic reproducer is provided. > > Page faults: > =========== > > The data read from the block device probably won't generate faults > due to DMA (no MMU) but even in the case it wouldn't use DMA, that > happens on different virtual addresses (not user-mapped addresses) > because `struct bio_vec` stores `struct page` to figure addresses > out (which are different from/unrelated to user-mapped addresses) > for the data read. > > Thus userspace reads (to user-mapped addresses) still fault, then > do_anonymous_page() gets another `struct page` that would address/ > map to other memory than the `struct page` used by `struct bio_vec` > for the read (which runs correctly as the page wasn't freed due to > page_ref_freeze(), and is reclaimed later) -- but even if the page > addresses matched, that handler maps the zero-page in the PTE, not > that page's memory (on read faults.) > > If page reclaim happens BEFORE bio_iov_iter_get_pages() the issue > doesn't happen, because that faults-in all pages as writeable, so > do_anonymous_page() sets up a new page/rmap/PTE, and that is used > by direct IO. The userspace reads don't fault as the PTE is there > (thus zero-page is not used.) > > Solution: > ======== > > One solution is to check for the expected page reference count in > try_to_unmap_one() too, which should be exactly two: one from the > isolation (checked by shrink_page_list()), and the other from the > rmap (dropped by the discard: label). If that doesn't match, then > remap the PTE back, just like page dirty does. > > The new check in try_to_unmap_one() should be safe in races with > bio_iov_iter_get_pages() in get_user_pages() fast and slow paths, > as it's done under the PTE lock. The fast path doesn't take that > lock but it checks the PTE has changed, then drops the reference > and leaves the page for the slow path (which does take that lock). > > - try_to_unmap_one() > - page_vma_mapped_walk() > - map_pte() # see pte_offset_map_lock(): > pte_offset_map() > spin_lock() > - page_ref_count() # new check > - page_vma_mapped_walk_done() # see pte_unmap_unlock(): > pte_unmap() > spin_unlock() > > - bio_iov_iter_get_pages() > - __bio_iov_iter_get_pages() > - iov_iter_get_pages() > - get_user_pages_fast() > - internal_get_user_pages_fast() > > # fast path > - lockless_pages_from_mm() > - gup_{pgd,p4d,pud,pmd,pte}_range() > ptep = pte_offset_map() # not _lock() > pte = ptep_get_lockless(ptep) > page = pte_page(pte) > try_grab_compound_head(page) # get ref > if (pte_val(pte) != pte_val(*ptep)) > put_compound_head(page) # put ref > # leave page for slow path > # slow path > - __gup_longterm_unlocked() > - get_user_pages_unlocked() > - __get_user_pages_locked() > - __get_user_pages() > - follow_{page,p4d,pud,pmd}_mask() > - follow_page_pte() > ptep = pte_offset_map_lock() > pte = *ptep > page = vm_normal_page(pte) > try_grab_page(page) # get ref > pte_unmap_unlock() > > Regarding transparent hugepages, that number shouldn't change, as > MADV_FREE (aka lazyfree) pages are PageAnon() && !PageSwapBacked() > (madvise_free_pte_range() -> mark_page_lazyfree() -> lru_lazyfree_fn()) > thus should reach shrink_page_list() -> split_huge_page_to_list() > before try_to_unmap[_one](), so it deals with normal pages only. > > (And in case unlikely/TTU_SPLIT_HUGE_PMD/split_huge_pmd_address() > happens, which it should not or be rare, the page refcount is not > two, as the head page counts tail pages, and tail pages have zero. > That also prevents checking the head `page` then incorrectly call > page_remove_rmap(subpage) for a tail page, that isn't even in the > shrink_page_list()'s page_list (an effect of split huge pmd/pmvw), > as it might happen today in this unlikely scenario.) > > MADV_FREE'd buffers: > =================== > > So, back to the "if MADV_FREE pages are used as buffers" note. > The case is arguable, and subject to multiple interpretations. > > The madvise(2) manual page on the MADV_FREE advice value says: > - 'After a successful MADV_FREE ... data will be lost when > the kernel frees the pages.' > - 'the free operation will be canceled if the caller writes > into the page' / 'subsequent writes ... will succeed and > then [the] kernel cannot free those dirtied pages' > - 'If there is no subsequent write, the kernel can free the > pages at any time.' > > Thoughts, questions, considerations... > - Since the kernel didn't actually free the page (page_ref_freeze() > failed), should the data not have been lost? (on userspace read.) > - Should writes performed by the direct IO read be able to cancel > the free operation? > - Should the direct IO read be considered as 'the caller' too, > as it's been requested by 'the caller'? > - Should the bio technique to dirty pages on return to userspace > (bio_check_pages_dirty() is called/used by __blkdev_direct_IO()) > be considered in another/special way here? > - Should an upcoming write from a previously requested direct IO > read be considered as a subsequent write, so the kernel should > not free the pages? (as it's known at the time of page reclaim.) > > Technically, the last point would seem a reasonable consideration > and balance, as the madvise(2) manual page apparently (and fairly) > seem to assume that 'writes' are memory access from the userspace > process (not explicitly considering writes from the kernel or its > corner cases; again, fairly).. plus the kernel fix implementation > for the corner case of the largely 'non-atomic write' encompassed > by a direct IO read operation, is relatively simple; and it helps. > > Reproducer: > ========== > > @ test.c (simplified, but works) > > #define _GNU_SOURCE > #include <fcntl.h> > #include <stdio.h> > #include <unistd.h> > #include <sys/mman.h> > > int main() { > int fd, i; > char *buf; > > fd = open(DEV, O_RDONLY | O_DIRECT); > > buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE, > MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); > > for (i = 0; i < BUF_SIZE; i += PAGE_SIZE) > buf[i] = 1; // init to non-zero > > madvise(buf, BUF_SIZE, MADV_FREE); > > read(fd, buf, BUF_SIZE); > > for (i = 0; i < BUF_SIZE; i += PAGE_SIZE) > printf("%p: 0x%x\n", &buf[i], buf[i]); > > return 0; > } > > @ block/fops.c (formerly fs/block_dev.c) > > +#include <linux/swap.h> > ... > ... __blkdev_direct_IO[_simple](...) > { > ... > + if (!strcmp(current->comm, "good")) > + shrink_all_memory(ULONG_MAX); > + > ret = bio_iov_iter_get_pages(...); > + > + if (!strcmp(current->comm, "bad")) > + shrink_all_memory(ULONG_MAX); > ... > } > > @ shell > > # yes | dd of=test.img bs=1k count=16 > # DEV=$(losetup -f --show test.img) > # gcc -DDEV=\"$DEV\" -DBUF_SIZE=16384 -DPAGE_SIZE=4096 test.c -o test > > # od -tx1 $DEV > 0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a > * > 0040000 > > # mv test good > # ./good > 0x7f1509206000: 0x79 > 0x7f1509207000: 0x79 > 0x7f1509208000: 0x79 > 0x7f1509209000: 0x79 > > # mv good bad > # ./bad > 0x7fd87272f000: 0x0 > 0x7fd872730000: 0x0 > 0x7fd872731000: 0x0 > 0x7fd872732000: 0x0 > > Ceph/TCMalloc: > ============= > > For documentation purposes, the use case driving the analysis/fix > is Ceph on Ubuntu 18.04, as the TCMalloc library there still uses > MADV_FREE to release unused memory to the system from the mmap'ed > page heap (might be committed back/used again; it's not munmap'ed.) > - PageHeap::DecommitSpan() -> TCMalloc_SystemRelease() -> madvise() > - PageHeap::CommitSpan() -> TCMalloc_SystemCommit() -> do nothing. > > Note: TCMalloc switched back to MADV_DONTNEED a few commits after > the release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so > the issue just 'disappeared' on Ceph on later Ubuntu releases but > is still present in the kernel, and can be hit by other use cases. > > The observed issue seems to be the old Ceph bug #22464 [1], where > checksum mismatches are observed (and instrumentation with buffer > dumps shows zero-pages read from mmap'ed/MADV_FREE'd page ranges). > > The issue in Ceph was reasonably deemed a kernel bug (comment #50) > and mostly worked around with a retry mechanism, but other parts > of Ceph could still hit that (rocksdb). Anyway, it's less likely > to be hit again as TCMalloc switched out of MADV_FREE by default. > > (Some kernel versions/reports from the Ceph bug, and relation with > the MADV_FREE introduction/changes; TCMalloc versions not checked.) > - 4.4 good > - 4.5 (madv_free: introduction) > - 4.9 bad > - 4.10 good? maybe a swapless system > - 4.12 (madv_free: no longer free instantly on swapless systems) > - 4.13 bad > > [1] https://tracker.ceph.com/issues/22464 > > Thanks: > ====== > > Several people contributed to analysis/discussions/tests/reproducers > in the first stages when drilling down on ceph/tcmalloc/linux kernel: > > - Dan Hill <daniel.hill@canonical.com> > - Dan Streetman <dan.streetman@canonical.com> > - Dongdong Tao <dongdong.tao@canonical.com> > - Gavin Guo <gavin.guo@canonical.com> > - Gerald Yang <gerald.yang@canonical.com> > - Heitor Alves de Siqueira <halves@canonical.com> > - Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com> > - Jay Vosburgh <jay.vosburgh@canonical.com> > - Matthew Ruffell <matthew.ruffell@canonical.com> > - Ponnuvel Palaniyappan <ponnuvel.palaniyappan@canonical.com> > > v2: check refcount against mapcount rather than a static 2. > Thanks: Minchan Kim <minchan@kernel.org> > > Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> > --- > mm/rmap.c | 15 ++++++++++++++- > mm/vmscan.c | 2 +- > 2 files changed, 15 insertions(+), 2 deletions(-) > > diff --git a/mm/rmap.c b/mm/rmap.c > index 163ac4e6bcee..8671de473c25 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > /* MADV_FREE page check */ > if (!PageSwapBacked(page)) { > - if (!PageDirty(page)) { > + int ref_count = page_ref_count(page); > + int map_count = page_mapcount(page); > + > + /* > + * The only page refs must be from the isolation > + * (checked by the caller shrink_page_list() too) > + * and one or more rmap's (dropped by discard:). > + * > + * Check the reference count before dirty flag > + * with memory barrier; see __remove_mapping(). > + */ > + smp_rmb(); > + if ((ref_count - 1 == map_count) && > + !PageDirty(page)) { Thanks. Looks good to me. I'd like to invite more folks for better eyes since this could be a rather subtle issue.
On Thu, Jan 6, 2022 at 3:15 PM Minchan Kim <minchan@kernel.org> wrote: > > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote: > > Problem: > > ======= > > > > Userspace might read the zero-page instead of actual data from a > > direct IO read on a block device if the buffers have been called > > madvise(MADV_FREE) on earlier (this is discussed below) due to a > > race between page reclaim on MADV_FREE and blkdev direct IO read. > > > > Race condition: > > ============== > > > > During page reclaim, the MADV_FREE page check in try_to_unmap_one() > > checks if the page is not dirty, then discards its PTE (vs remap it > > back if the page is dirty). > > > > However, after try_to_unmap_one() returns to shrink_page_list(), it > > might keep the page _anyway_ if page_ref_freeze() fails (it expects > > a single page ref from the isolation). > > > > Well, blkdev_direct_IO() gets references for all pages, and on READ > > operations it sets them dirty later. > > > > So, if MADV_FREE pages (i.e., not dirty) are used as buffers (more > > later) for direct IO read from block devices and page reclaim runs > > during __blkdev_direct_IO[_simple]() AFTER bio_iov_iter_get_pages() > > but BEFORE it sets pages dirty, that situation happens. > > > > The direct IO read eventually completes. Now, when userspace reads > > the buffers, the PTE is no longer there and the page fault handler > > do_anonymous_page() services that with the zero-page, NOT the data! > > > > A synthetic reproducer is provided. > > > > Page faults: > > =========== > > > > The data read from the block device probably won't generate faults > > due to DMA (no MMU) but even in the case it wouldn't use DMA, that > > happens on different virtual addresses (not user-mapped addresses) > > because `struct bio_vec` stores `struct page` to figure addresses > > out (which are different from/unrelated to user-mapped addresses) > > for the data read. > > > > Thus userspace reads (to user-mapped addresses) still fault, then > > do_anonymous_page() gets another `struct page` that would address/ > > map to other memory than the `struct page` used by `struct bio_vec` > > for the read (which runs correctly as the page wasn't freed due to > > page_ref_freeze(), and is reclaimed later) -- but even if the page > > addresses matched, that handler maps the zero-page in the PTE, not > > that page's memory (on read faults.) > > > > If page reclaim happens BEFORE bio_iov_iter_get_pages() the issue > > doesn't happen, because that faults-in all pages as writeable, so > > do_anonymous_page() sets up a new page/rmap/PTE, and that is used > > by direct IO. The userspace reads don't fault as the PTE is there > > (thus zero-page is not used.) > > > > Solution: > > ======== > > > > One solution is to check for the expected page reference count in > > try_to_unmap_one() too, which should be exactly two: one from the > > isolation (checked by shrink_page_list()), and the other from the > > rmap (dropped by the discard: label). If that doesn't match, then > > remap the PTE back, just like page dirty does. > > > > The new check in try_to_unmap_one() should be safe in races with > > bio_iov_iter_get_pages() in get_user_pages() fast and slow paths, > > as it's done under the PTE lock. The fast path doesn't take that > > lock but it checks the PTE has changed, then drops the reference > > and leaves the page for the slow path (which does take that lock). > > > > - try_to_unmap_one() > > - page_vma_mapped_walk() > > - map_pte() # see pte_offset_map_lock(): > > pte_offset_map() > > spin_lock() > > - page_ref_count() # new check > > - page_vma_mapped_walk_done() # see pte_unmap_unlock(): > > pte_unmap() > > spin_unlock() > > > > - bio_iov_iter_get_pages() > > - __bio_iov_iter_get_pages() > > - iov_iter_get_pages() > > - get_user_pages_fast() > > - internal_get_user_pages_fast() > > > > # fast path > > - lockless_pages_from_mm() > > - gup_{pgd,p4d,pud,pmd,pte}_range() > > ptep = pte_offset_map() # not _lock() > > pte = ptep_get_lockless(ptep) > > page = pte_page(pte) > > try_grab_compound_head(page) # get ref > > if (pte_val(pte) != pte_val(*ptep)) > > put_compound_head(page) # put ref > > # leave page for slow path > > # slow path > > - __gup_longterm_unlocked() > > - get_user_pages_unlocked() > > - __get_user_pages_locked() > > - __get_user_pages() > > - follow_{page,p4d,pud,pmd}_mask() > > - follow_page_pte() > > ptep = pte_offset_map_lock() > > pte = *ptep > > page = vm_normal_page(pte) > > try_grab_page(page) # get ref > > pte_unmap_unlock() > > > > Regarding transparent hugepages, that number shouldn't change, as > > MADV_FREE (aka lazyfree) pages are PageAnon() && !PageSwapBacked() > > (madvise_free_pte_range() -> mark_page_lazyfree() -> lru_lazyfree_fn()) > > thus should reach shrink_page_list() -> split_huge_page_to_list() > > before try_to_unmap[_one](), so it deals with normal pages only. > > > > (And in case unlikely/TTU_SPLIT_HUGE_PMD/split_huge_pmd_address() > > happens, which it should not or be rare, the page refcount is not > > two, as the head page counts tail pages, and tail pages have zero. > > That also prevents checking the head `page` then incorrectly call > > page_remove_rmap(subpage) for a tail page, that isn't even in the > > shrink_page_list()'s page_list (an effect of split huge pmd/pmvw), > > as it might happen today in this unlikely scenario.) > > > > MADV_FREE'd buffers: > > =================== > > > > So, back to the "if MADV_FREE pages are used as buffers" note. > > The case is arguable, and subject to multiple interpretations. > > > > The madvise(2) manual page on the MADV_FREE advice value says: > > - 'After a successful MADV_FREE ... data will be lost when > > the kernel frees the pages.' > > - 'the free operation will be canceled if the caller writes > > into the page' / 'subsequent writes ... will succeed and > > then [the] kernel cannot free those dirtied pages' > > - 'If there is no subsequent write, the kernel can free the > > pages at any time.' > > > > Thoughts, questions, considerations... > > - Since the kernel didn't actually free the page (page_ref_freeze() > > failed), should the data not have been lost? (on userspace read.) > > - Should writes performed by the direct IO read be able to cancel > > the free operation? > > - Should the direct IO read be considered as 'the caller' too, > > as it's been requested by 'the caller'? > > - Should the bio technique to dirty pages on return to userspace > > (bio_check_pages_dirty() is called/used by __blkdev_direct_IO()) > > be considered in another/special way here? > > - Should an upcoming write from a previously requested direct IO > > read be considered as a subsequent write, so the kernel should > > not free the pages? (as it's known at the time of page reclaim.) > > > > Technically, the last point would seem a reasonable consideration > > and balance, as the madvise(2) manual page apparently (and fairly) > > seem to assume that 'writes' are memory access from the userspace > > process (not explicitly considering writes from the kernel or its > > corner cases; again, fairly).. plus the kernel fix implementation > > for the corner case of the largely 'non-atomic write' encompassed > > by a direct IO read operation, is relatively simple; and it helps. > > > > Reproducer: > > ========== > > > > @ test.c (simplified, but works) > > > > #define _GNU_SOURCE > > #include <fcntl.h> > > #include <stdio.h> > > #include <unistd.h> > > #include <sys/mman.h> > > > > int main() { > > int fd, i; > > char *buf; > > > > fd = open(DEV, O_RDONLY | O_DIRECT); > > > > buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE, > > MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); > > > > for (i = 0; i < BUF_SIZE; i += PAGE_SIZE) > > buf[i] = 1; // init to non-zero > > > > madvise(buf, BUF_SIZE, MADV_FREE); > > > > read(fd, buf, BUF_SIZE); > > > > for (i = 0; i < BUF_SIZE; i += PAGE_SIZE) > > printf("%p: 0x%x\n", &buf[i], buf[i]); > > > > return 0; > > } > > > > @ block/fops.c (formerly fs/block_dev.c) > > > > +#include <linux/swap.h> > > ... > > ... __blkdev_direct_IO[_simple](...) > > { > > ... > > + if (!strcmp(current->comm, "good")) > > + shrink_all_memory(ULONG_MAX); > > + > > ret = bio_iov_iter_get_pages(...); > > + > > + if (!strcmp(current->comm, "bad")) > > + shrink_all_memory(ULONG_MAX); > > ... > > } > > > > @ shell > > > > # yes | dd of=test.img bs=1k count=16 > > # DEV=$(losetup -f --show test.img) > > # gcc -DDEV=\"$DEV\" -DBUF_SIZE=16384 -DPAGE_SIZE=4096 test.c -o test > > > > # od -tx1 $DEV > > 0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a > > * > > 0040000 > > > > # mv test good > > # ./good > > 0x7f1509206000: 0x79 > > 0x7f1509207000: 0x79 > > 0x7f1509208000: 0x79 > > 0x7f1509209000: 0x79 > > > > # mv good bad > > # ./bad > > 0x7fd87272f000: 0x0 > > 0x7fd872730000: 0x0 > > 0x7fd872731000: 0x0 > > 0x7fd872732000: 0x0 > > > > Ceph/TCMalloc: > > ============= > > > > For documentation purposes, the use case driving the analysis/fix > > is Ceph on Ubuntu 18.04, as the TCMalloc library there still uses > > MADV_FREE to release unused memory to the system from the mmap'ed > > page heap (might be committed back/used again; it's not munmap'ed.) > > - PageHeap::DecommitSpan() -> TCMalloc_SystemRelease() -> madvise() > > - PageHeap::CommitSpan() -> TCMalloc_SystemCommit() -> do nothing. > > > > Note: TCMalloc switched back to MADV_DONTNEED a few commits after > > the release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so > > the issue just 'disappeared' on Ceph on later Ubuntu releases but > > is still present in the kernel, and can be hit by other use cases. > > > > The observed issue seems to be the old Ceph bug #22464 [1], where > > checksum mismatches are observed (and instrumentation with buffer > > dumps shows zero-pages read from mmap'ed/MADV_FREE'd page ranges). > > > > The issue in Ceph was reasonably deemed a kernel bug (comment #50) > > and mostly worked around with a retry mechanism, but other parts > > of Ceph could still hit that (rocksdb). Anyway, it's less likely > > to be hit again as TCMalloc switched out of MADV_FREE by default. > > > > (Some kernel versions/reports from the Ceph bug, and relation with > > the MADV_FREE introduction/changes; TCMalloc versions not checked.) > > - 4.4 good > > - 4.5 (madv_free: introduction) > > - 4.9 bad > > - 4.10 good? maybe a swapless system > > - 4.12 (madv_free: no longer free instantly on swapless systems) > > - 4.13 bad > > > > [1] https://tracker.ceph.com/issues/22464 > > > > Thanks: > > ====== > > > > Several people contributed to analysis/discussions/tests/reproducers > > in the first stages when drilling down on ceph/tcmalloc/linux kernel: > > > > - Dan Hill <daniel.hill@canonical.com> > > - Dan Streetman <dan.streetman@canonical.com> > > - Dongdong Tao <dongdong.tao@canonical.com> > > - Gavin Guo <gavin.guo@canonical.com> > > - Gerald Yang <gerald.yang@canonical.com> > > - Heitor Alves de Siqueira <halves@canonical.com> > > - Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com> > > - Jay Vosburgh <jay.vosburgh@canonical.com> > > - Matthew Ruffell <matthew.ruffell@canonical.com> > > - Ponnuvel Palaniyappan <ponnuvel.palaniyappan@canonical.com> > > > > v2: check refcount against mapcount rather than a static 2. > > Thanks: Minchan Kim <minchan@kernel.org> > > > > Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> > > --- > > mm/rmap.c | 15 ++++++++++++++- > > mm/vmscan.c | 2 +- > > 2 files changed, 15 insertions(+), 2 deletions(-) > > > > diff --git a/mm/rmap.c b/mm/rmap.c > > index 163ac4e6bcee..8671de473c25 100644 > > --- a/mm/rmap.c > > +++ b/mm/rmap.c > > @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > /* MADV_FREE page check */ > > if (!PageSwapBacked(page)) { > > - if (!PageDirty(page)) { > > + int ref_count = page_ref_count(page); > > + int map_count = page_mapcount(page); > > + > > + /* > > + * The only page refs must be from the isolation > > + * (checked by the caller shrink_page_list() too) > > + * and one or more rmap's (dropped by discard:). > > + * > > + * Check the reference count before dirty flag > > + * with memory barrier; see __remove_mapping(). > > + */ > > + smp_rmb(); > > + if ((ref_count - 1 == map_count) && > > + !PageDirty(page)) { > > Thanks. Looks good to me. > > I'd like to invite more folks for better eyes since this could > be a rather subtle issue. I don't spot anything wrong either. But I'm wondering if we could remove the PageDirty check or not. It seems racy to have MADV_FREE solely rely on PageDirty bit to tell if the page is going to be redirtied or re-accessed or not. The pge refcount itself should be enough. As long as elevated refcount is observed we know this page is going to be or has been re-accessed by someone (particularly, fast GUP). The PageDirty bit may be set much later, just like what the commit log said. But I'm not quite sure if there is any path that sets PageDirty bit then elevates refcount, TBH I don't look very hard.
On Thu, Jan 6, 2022 at 4:11 PM Yang Shi <shy828301@gmail.com> wrote: > > On Thu, Jan 6, 2022 at 3:15 PM Minchan Kim <minchan@kernel.org> wrote: > > > > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote: > > > Problem: > > > ======= > > > > > > Userspace might read the zero-page instead of actual data from a > > > direct IO read on a block device if the buffers have been called > > > madvise(MADV_FREE) on earlier (this is discussed below) due to a > > > race between page reclaim on MADV_FREE and blkdev direct IO read. > > > > > > Race condition: > > > ============== > > > > > > During page reclaim, the MADV_FREE page check in try_to_unmap_one() > > > checks if the page is not dirty, then discards its PTE (vs remap it > > > back if the page is dirty). > > > > > > However, after try_to_unmap_one() returns to shrink_page_list(), it > > > might keep the page _anyway_ if page_ref_freeze() fails (it expects > > > a single page ref from the isolation). > > > > > > Well, blkdev_direct_IO() gets references for all pages, and on READ > > > operations it sets them dirty later. > > > > > > So, if MADV_FREE pages (i.e., not dirty) are used as buffers (more > > > later) for direct IO read from block devices and page reclaim runs > > > during __blkdev_direct_IO[_simple]() AFTER bio_iov_iter_get_pages() > > > but BEFORE it sets pages dirty, that situation happens. > > > > > > The direct IO read eventually completes. Now, when userspace reads > > > the buffers, the PTE is no longer there and the page fault handler > > > do_anonymous_page() services that with the zero-page, NOT the data! > > > > > > A synthetic reproducer is provided. > > > > > > Page faults: > > > =========== > > > > > > The data read from the block device probably won't generate faults > > > due to DMA (no MMU) but even in the case it wouldn't use DMA, that > > > happens on different virtual addresses (not user-mapped addresses) > > > because `struct bio_vec` stores `struct page` to figure addresses > > > out (which are different from/unrelated to user-mapped addresses) > > > for the data read. > > > > > > Thus userspace reads (to user-mapped addresses) still fault, then > > > do_anonymous_page() gets another `struct page` that would address/ > > > map to other memory than the `struct page` used by `struct bio_vec` > > > for the read (which runs correctly as the page wasn't freed due to > > > page_ref_freeze(), and is reclaimed later) -- but even if the page > > > addresses matched, that handler maps the zero-page in the PTE, not > > > that page's memory (on read faults.) > > > > > > If page reclaim happens BEFORE bio_iov_iter_get_pages() the issue > > > doesn't happen, because that faults-in all pages as writeable, so > > > do_anonymous_page() sets up a new page/rmap/PTE, and that is used > > > by direct IO. The userspace reads don't fault as the PTE is there > > > (thus zero-page is not used.) > > > > > > Solution: > > > ======== > > > > > > One solution is to check for the expected page reference count in > > > try_to_unmap_one() too, which should be exactly two: one from the > > > isolation (checked by shrink_page_list()), and the other from the > > > rmap (dropped by the discard: label). If that doesn't match, then > > > remap the PTE back, just like page dirty does. > > > > > > The new check in try_to_unmap_one() should be safe in races with > > > bio_iov_iter_get_pages() in get_user_pages() fast and slow paths, > > > as it's done under the PTE lock. The fast path doesn't take that > > > lock but it checks the PTE has changed, then drops the reference > > > and leaves the page for the slow path (which does take that lock). > > > > > > - try_to_unmap_one() > > > - page_vma_mapped_walk() > > > - map_pte() # see pte_offset_map_lock(): > > > pte_offset_map() > > > spin_lock() > > > - page_ref_count() # new check > > > - page_vma_mapped_walk_done() # see pte_unmap_unlock(): > > > pte_unmap() > > > spin_unlock() > > > > > > - bio_iov_iter_get_pages() > > > - __bio_iov_iter_get_pages() > > > - iov_iter_get_pages() > > > - get_user_pages_fast() > > > - internal_get_user_pages_fast() > > > > > > # fast path > > > - lockless_pages_from_mm() > > > - gup_{pgd,p4d,pud,pmd,pte}_range() > > > ptep = pte_offset_map() # not _lock() > > > pte = ptep_get_lockless(ptep) > > > page = pte_page(pte) > > > try_grab_compound_head(page) # get ref > > > if (pte_val(pte) != pte_val(*ptep)) > > > put_compound_head(page) # put ref > > > # leave page for slow path > > > # slow path > > > - __gup_longterm_unlocked() > > > - get_user_pages_unlocked() > > > - __get_user_pages_locked() > > > - __get_user_pages() > > > - follow_{page,p4d,pud,pmd}_mask() > > > - follow_page_pte() > > > ptep = pte_offset_map_lock() > > > pte = *ptep > > > page = vm_normal_page(pte) > > > try_grab_page(page) # get ref > > > pte_unmap_unlock() > > > > > > Regarding transparent hugepages, that number shouldn't change, as > > > MADV_FREE (aka lazyfree) pages are PageAnon() && !PageSwapBacked() > > > (madvise_free_pte_range() -> mark_page_lazyfree() -> lru_lazyfree_fn()) > > > thus should reach shrink_page_list() -> split_huge_page_to_list() > > > before try_to_unmap[_one](), so it deals with normal pages only. > > > > > > (And in case unlikely/TTU_SPLIT_HUGE_PMD/split_huge_pmd_address() > > > happens, which it should not or be rare, the page refcount is not > > > two, as the head page counts tail pages, and tail pages have zero. > > > That also prevents checking the head `page` then incorrectly call > > > page_remove_rmap(subpage) for a tail page, that isn't even in the > > > shrink_page_list()'s page_list (an effect of split huge pmd/pmvw), > > > as it might happen today in this unlikely scenario.) > > > > > > MADV_FREE'd buffers: > > > =================== > > > > > > So, back to the "if MADV_FREE pages are used as buffers" note. > > > The case is arguable, and subject to multiple interpretations. > > > > > > The madvise(2) manual page on the MADV_FREE advice value says: > > > - 'After a successful MADV_FREE ... data will be lost when > > > the kernel frees the pages.' > > > - 'the free operation will be canceled if the caller writes > > > into the page' / 'subsequent writes ... will succeed and > > > then [the] kernel cannot free those dirtied pages' > > > - 'If there is no subsequent write, the kernel can free the > > > pages at any time.' > > > > > > Thoughts, questions, considerations... > > > - Since the kernel didn't actually free the page (page_ref_freeze() > > > failed), should the data not have been lost? (on userspace read.) > > > - Should writes performed by the direct IO read be able to cancel > > > the free operation? > > > - Should the direct IO read be considered as 'the caller' too, > > > as it's been requested by 'the caller'? > > > - Should the bio technique to dirty pages on return to userspace > > > (bio_check_pages_dirty() is called/used by __blkdev_direct_IO()) > > > be considered in another/special way here? > > > - Should an upcoming write from a previously requested direct IO > > > read be considered as a subsequent write, so the kernel should > > > not free the pages? (as it's known at the time of page reclaim.) > > > > > > Technically, the last point would seem a reasonable consideration > > > and balance, as the madvise(2) manual page apparently (and fairly) > > > seem to assume that 'writes' are memory access from the userspace > > > process (not explicitly considering writes from the kernel or its > > > corner cases; again, fairly).. plus the kernel fix implementation > > > for the corner case of the largely 'non-atomic write' encompassed > > > by a direct IO read operation, is relatively simple; and it helps. > > > > > > Reproducer: > > > ========== > > > > > > @ test.c (simplified, but works) > > > > > > #define _GNU_SOURCE > > > #include <fcntl.h> > > > #include <stdio.h> > > > #include <unistd.h> > > > #include <sys/mman.h> > > > > > > int main() { > > > int fd, i; > > > char *buf; > > > > > > fd = open(DEV, O_RDONLY | O_DIRECT); > > > > > > buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE, > > > MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); > > > > > > for (i = 0; i < BUF_SIZE; i += PAGE_SIZE) > > > buf[i] = 1; // init to non-zero > > > > > > madvise(buf, BUF_SIZE, MADV_FREE); > > > > > > read(fd, buf, BUF_SIZE); > > > > > > for (i = 0; i < BUF_SIZE; i += PAGE_SIZE) > > > printf("%p: 0x%x\n", &buf[i], buf[i]); > > > > > > return 0; > > > } > > > > > > @ block/fops.c (formerly fs/block_dev.c) > > > > > > +#include <linux/swap.h> > > > ... > > > ... __blkdev_direct_IO[_simple](...) > > > { > > > ... > > > + if (!strcmp(current->comm, "good")) > > > + shrink_all_memory(ULONG_MAX); > > > + > > > ret = bio_iov_iter_get_pages(...); > > > + > > > + if (!strcmp(current->comm, "bad")) > > > + shrink_all_memory(ULONG_MAX); > > > ... > > > } > > > > > > @ shell > > > > > > # yes | dd of=test.img bs=1k count=16 > > > # DEV=$(losetup -f --show test.img) > > > # gcc -DDEV=\"$DEV\" -DBUF_SIZE=16384 -DPAGE_SIZE=4096 test.c -o test > > > > > > # od -tx1 $DEV > > > 0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a > > > * > > > 0040000 > > > > > > # mv test good > > > # ./good > > > 0x7f1509206000: 0x79 > > > 0x7f1509207000: 0x79 > > > 0x7f1509208000: 0x79 > > > 0x7f1509209000: 0x79 > > > > > > # mv good bad > > > # ./bad > > > 0x7fd87272f000: 0x0 > > > 0x7fd872730000: 0x0 > > > 0x7fd872731000: 0x0 > > > 0x7fd872732000: 0x0 > > > > > > Ceph/TCMalloc: > > > ============= > > > > > > For documentation purposes, the use case driving the analysis/fix > > > is Ceph on Ubuntu 18.04, as the TCMalloc library there still uses > > > MADV_FREE to release unused memory to the system from the mmap'ed > > > page heap (might be committed back/used again; it's not munmap'ed.) > > > - PageHeap::DecommitSpan() -> TCMalloc_SystemRelease() -> madvise() > > > - PageHeap::CommitSpan() -> TCMalloc_SystemCommit() -> do nothing. > > > > > > Note: TCMalloc switched back to MADV_DONTNEED a few commits after > > > the release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so > > > the issue just 'disappeared' on Ceph on later Ubuntu releases but > > > is still present in the kernel, and can be hit by other use cases. > > > > > > The observed issue seems to be the old Ceph bug #22464 [1], where > > > checksum mismatches are observed (and instrumentation with buffer > > > dumps shows zero-pages read from mmap'ed/MADV_FREE'd page ranges). > > > > > > The issue in Ceph was reasonably deemed a kernel bug (comment #50) > > > and mostly worked around with a retry mechanism, but other parts > > > of Ceph could still hit that (rocksdb). Anyway, it's less likely > > > to be hit again as TCMalloc switched out of MADV_FREE by default. > > > > > > (Some kernel versions/reports from the Ceph bug, and relation with > > > the MADV_FREE introduction/changes; TCMalloc versions not checked.) > > > - 4.4 good > > > - 4.5 (madv_free: introduction) > > > - 4.9 bad > > > - 4.10 good? maybe a swapless system > > > - 4.12 (madv_free: no longer free instantly on swapless systems) > > > - 4.13 bad > > > > > > [1] https://tracker.ceph.com/issues/22464 > > > > > > Thanks: > > > ====== > > > > > > Several people contributed to analysis/discussions/tests/reproducers > > > in the first stages when drilling down on ceph/tcmalloc/linux kernel: > > > > > > - Dan Hill <daniel.hill@canonical.com> > > > - Dan Streetman <dan.streetman@canonical.com> > > > - Dongdong Tao <dongdong.tao@canonical.com> > > > - Gavin Guo <gavin.guo@canonical.com> > > > - Gerald Yang <gerald.yang@canonical.com> > > > - Heitor Alves de Siqueira <halves@canonical.com> > > > - Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com> > > > - Jay Vosburgh <jay.vosburgh@canonical.com> > > > - Matthew Ruffell <matthew.ruffell@canonical.com> > > > - Ponnuvel Palaniyappan <ponnuvel.palaniyappan@canonical.com> > > > > > > v2: check refcount against mapcount rather than a static 2. > > > Thanks: Minchan Kim <minchan@kernel.org> > > > > > > Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> > > > --- > > > mm/rmap.c | 15 ++++++++++++++- > > > mm/vmscan.c | 2 +- > > > 2 files changed, 15 insertions(+), 2 deletions(-) > > > > > > diff --git a/mm/rmap.c b/mm/rmap.c > > > index 163ac4e6bcee..8671de473c25 100644 > > > --- a/mm/rmap.c > > > +++ b/mm/rmap.c > > > @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > > /* MADV_FREE page check */ > > > if (!PageSwapBacked(page)) { > > > - if (!PageDirty(page)) { > > > + int ref_count = page_ref_count(page); > > > + int map_count = page_mapcount(page); > > > + > > > + /* > > > + * The only page refs must be from the isolation > > > + * (checked by the caller shrink_page_list() too) > > > + * and one or more rmap's (dropped by discard:). > > > + * > > > + * Check the reference count before dirty flag > > > + * with memory barrier; see __remove_mapping(). > > > + */ > > > + smp_rmb(); > > > + if ((ref_count - 1 == map_count) && > > > + !PageDirty(page)) { > > > > Thanks. Looks good to me. > > > > I'd like to invite more folks for better eyes since this could > > be a rather subtle issue. > > I don't spot anything wrong either. But I'm wondering if we could > remove the PageDirty check or not. Sorry, take this back. PageDirty check is still needed since the page could be redirtied even though the refcount is not elevated. > > It seems racy to have MADV_FREE solely rely on PageDirty bit to tell > if the page is going to be redirtied or re-accessed or not. The pge > refcount itself should be enough. As long as elevated refcount is > observed we know this page is going to be or has been re-accessed by > someone (particularly, fast GUP). The PageDirty bit may be set much > later, just like what the commit log said. > > But I'm not quite sure if there is any path that sets PageDirty bit > then elevates refcount, TBH I don't look very hard.
Minchan Kim <minchan@kernel.org> writes: > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote: >> Problem: >> ======= >> >> Userspace might read the zero-page instead of actual data from a >> direct IO read on a block device if the buffers have been called >> madvise(MADV_FREE) on earlier (this is discussed below) due to a >> race between page reclaim on MADV_FREE and blkdev direct IO read. >> >> Race condition: >> ============== >> >> During page reclaim, the MADV_FREE page check in try_to_unmap_one() >> checks if the page is not dirty, then discards its PTE (vs remap it >> back if the page is dirty). >> >> However, after try_to_unmap_one() returns to shrink_page_list(), it >> might keep the page _anyway_ if page_ref_freeze() fails (it expects >> a single page ref from the isolation). >> >> Well, blkdev_direct_IO() gets references for all pages, and on READ >> operations it sets them dirty later. >> >> So, if MADV_FREE pages (i.e., not dirty) are used as buffers (more >> later) for direct IO read from block devices and page reclaim runs >> during __blkdev_direct_IO[_simple]() AFTER bio_iov_iter_get_pages() >> but BEFORE it sets pages dirty, that situation happens. >> >> The direct IO read eventually completes. Now, when userspace reads >> the buffers, the PTE is no longer there and the page fault handler >> do_anonymous_page() services that with the zero-page, NOT the data! >> >> A synthetic reproducer is provided. >> >> Page faults: >> =========== >> >> The data read from the block device probably won't generate faults >> due to DMA (no MMU) but even in the case it wouldn't use DMA, that >> happens on different virtual addresses (not user-mapped addresses) >> because `struct bio_vec` stores `struct page` to figure addresses >> out (which are different from/unrelated to user-mapped addresses) >> for the data read. >> >> Thus userspace reads (to user-mapped addresses) still fault, then >> do_anonymous_page() gets another `struct page` that would address/ >> map to other memory than the `struct page` used by `struct bio_vec` >> for the read (which runs correctly as the page wasn't freed due to >> page_ref_freeze(), and is reclaimed later) -- but even if the page >> addresses matched, that handler maps the zero-page in the PTE, not >> that page's memory (on read faults.) >> >> If page reclaim happens BEFORE bio_iov_iter_get_pages() the issue >> doesn't happen, because that faults-in all pages as writeable, so >> do_anonymous_page() sets up a new page/rmap/PTE, and that is used >> by direct IO. The userspace reads don't fault as the PTE is there >> (thus zero-page is not used.) >> >> Solution: >> ======== >> >> One solution is to check for the expected page reference count in >> try_to_unmap_one() too, which should be exactly two: one from the >> isolation (checked by shrink_page_list()), and the other from the >> rmap (dropped by the discard: label). If that doesn't match, then >> remap the PTE back, just like page dirty does. >> >> The new check in try_to_unmap_one() should be safe in races with >> bio_iov_iter_get_pages() in get_user_pages() fast and slow paths, >> as it's done under the PTE lock. The fast path doesn't take that >> lock but it checks the PTE has changed, then drops the reference >> and leaves the page for the slow path (which does take that lock). >> >> - try_to_unmap_one() >> - page_vma_mapped_walk() >> - map_pte() # see pte_offset_map_lock(): >> pte_offset_map() >> spin_lock() >> - page_ref_count() # new check >> - page_vma_mapped_walk_done() # see pte_unmap_unlock(): >> pte_unmap() >> spin_unlock() >> >> - bio_iov_iter_get_pages() >> - __bio_iov_iter_get_pages() >> - iov_iter_get_pages() >> - get_user_pages_fast() >> - internal_get_user_pages_fast() >> >> # fast path >> - lockless_pages_from_mm() >> - gup_{pgd,p4d,pud,pmd,pte}_range() >> ptep = pte_offset_map() # not _lock() >> pte = ptep_get_lockless(ptep) >> page = pte_page(pte) >> try_grab_compound_head(page) # get ref >> if (pte_val(pte) != pte_val(*ptep)) >> put_compound_head(page) # put ref >> # leave page for slow path >> # slow path >> - __gup_longterm_unlocked() >> - get_user_pages_unlocked() >> - __get_user_pages_locked() >> - __get_user_pages() >> - follow_{page,p4d,pud,pmd}_mask() >> - follow_page_pte() >> ptep = pte_offset_map_lock() >> pte = *ptep >> page = vm_normal_page(pte) >> try_grab_page(page) # get ref >> pte_unmap_unlock() >> >> Regarding transparent hugepages, that number shouldn't change, as >> MADV_FREE (aka lazyfree) pages are PageAnon() && !PageSwapBacked() >> (madvise_free_pte_range() -> mark_page_lazyfree() -> lru_lazyfree_fn()) >> thus should reach shrink_page_list() -> split_huge_page_to_list() >> before try_to_unmap[_one](), so it deals with normal pages only. >> >> (And in case unlikely/TTU_SPLIT_HUGE_PMD/split_huge_pmd_address() >> happens, which it should not or be rare, the page refcount is not >> two, as the head page counts tail pages, and tail pages have zero. >> That also prevents checking the head `page` then incorrectly call >> page_remove_rmap(subpage) for a tail page, that isn't even in the >> shrink_page_list()'s page_list (an effect of split huge pmd/pmvw), >> as it might happen today in this unlikely scenario.) >> >> MADV_FREE'd buffers: >> =================== >> >> So, back to the "if MADV_FREE pages are used as buffers" note. >> The case is arguable, and subject to multiple interpretations. >> >> The madvise(2) manual page on the MADV_FREE advice value says: >> - 'After a successful MADV_FREE ... data will be lost when >> the kernel frees the pages.' >> - 'the free operation will be canceled if the caller writes >> into the page' / 'subsequent writes ... will succeed and >> then [the] kernel cannot free those dirtied pages' >> - 'If there is no subsequent write, the kernel can free the >> pages at any time.' >> >> Thoughts, questions, considerations... >> - Since the kernel didn't actually free the page (page_ref_freeze() >> failed), should the data not have been lost? (on userspace read.) >> - Should writes performed by the direct IO read be able to cancel >> the free operation? >> - Should the direct IO read be considered as 'the caller' too, >> as it's been requested by 'the caller'? >> - Should the bio technique to dirty pages on return to userspace >> (bio_check_pages_dirty() is called/used by __blkdev_direct_IO()) >> be considered in another/special way here? >> - Should an upcoming write from a previously requested direct IO >> read be considered as a subsequent write, so the kernel should >> not free the pages? (as it's known at the time of page reclaim.) >> >> Technically, the last point would seem a reasonable consideration >> and balance, as the madvise(2) manual page apparently (and fairly) >> seem to assume that 'writes' are memory access from the userspace >> process (not explicitly considering writes from the kernel or its >> corner cases; again, fairly).. plus the kernel fix implementation >> for the corner case of the largely 'non-atomic write' encompassed >> by a direct IO read operation, is relatively simple; and it helps. >> >> Reproducer: >> ========== >> >> @ test.c (simplified, but works) >> >> #define _GNU_SOURCE >> #include <fcntl.h> >> #include <stdio.h> >> #include <unistd.h> >> #include <sys/mman.h> >> >> int main() { >> int fd, i; >> char *buf; >> >> fd = open(DEV, O_RDONLY | O_DIRECT); >> >> buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE, >> MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); >> >> for (i = 0; i < BUF_SIZE; i += PAGE_SIZE) >> buf[i] = 1; // init to non-zero >> >> madvise(buf, BUF_SIZE, MADV_FREE); >> >> read(fd, buf, BUF_SIZE); >> >> for (i = 0; i < BUF_SIZE; i += PAGE_SIZE) >> printf("%p: 0x%x\n", &buf[i], buf[i]); >> >> return 0; >> } >> >> @ block/fops.c (formerly fs/block_dev.c) >> >> +#include <linux/swap.h> >> ... >> ... __blkdev_direct_IO[_simple](...) >> { >> ... >> + if (!strcmp(current->comm, "good")) >> + shrink_all_memory(ULONG_MAX); >> + >> ret = bio_iov_iter_get_pages(...); >> + >> + if (!strcmp(current->comm, "bad")) >> + shrink_all_memory(ULONG_MAX); >> ... >> } >> >> @ shell >> >> # yes | dd of=test.img bs=1k count=16 >> # DEV=$(losetup -f --show test.img) >> # gcc -DDEV=\"$DEV\" -DBUF_SIZE=16384 -DPAGE_SIZE=4096 test.c -o test >> >> # od -tx1 $DEV >> 0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a >> * >> 0040000 >> >> # mv test good >> # ./good >> 0x7f1509206000: 0x79 >> 0x7f1509207000: 0x79 >> 0x7f1509208000: 0x79 >> 0x7f1509209000: 0x79 >> >> # mv good bad >> # ./bad >> 0x7fd87272f000: 0x0 >> 0x7fd872730000: 0x0 >> 0x7fd872731000: 0x0 >> 0x7fd872732000: 0x0 >> >> Ceph/TCMalloc: >> ============= >> >> For documentation purposes, the use case driving the analysis/fix >> is Ceph on Ubuntu 18.04, as the TCMalloc library there still uses >> MADV_FREE to release unused memory to the system from the mmap'ed >> page heap (might be committed back/used again; it's not munmap'ed.) >> - PageHeap::DecommitSpan() -> TCMalloc_SystemRelease() -> madvise() >> - PageHeap::CommitSpan() -> TCMalloc_SystemCommit() -> do nothing. >> >> Note: TCMalloc switched back to MADV_DONTNEED a few commits after >> the release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so >> the issue just 'disappeared' on Ceph on later Ubuntu releases but >> is still present in the kernel, and can be hit by other use cases. >> >> The observed issue seems to be the old Ceph bug #22464 [1], where >> checksum mismatches are observed (and instrumentation with buffer >> dumps shows zero-pages read from mmap'ed/MADV_FREE'd page ranges). >> >> The issue in Ceph was reasonably deemed a kernel bug (comment #50) >> and mostly worked around with a retry mechanism, but other parts >> of Ceph could still hit that (rocksdb). Anyway, it's less likely >> to be hit again as TCMalloc switched out of MADV_FREE by default. >> >> (Some kernel versions/reports from the Ceph bug, and relation with >> the MADV_FREE introduction/changes; TCMalloc versions not checked.) >> - 4.4 good >> - 4.5 (madv_free: introduction) >> - 4.9 bad >> - 4.10 good? maybe a swapless system >> - 4.12 (madv_free: no longer free instantly on swapless systems) >> - 4.13 bad >> >> [1] https://tracker.ceph.com/issues/22464 >> >> Thanks: >> ====== >> >> Several people contributed to analysis/discussions/tests/reproducers >> in the first stages when drilling down on ceph/tcmalloc/linux kernel: >> >> - Dan Hill <daniel.hill@canonical.com> >> - Dan Streetman <dan.streetman@canonical.com> >> - Dongdong Tao <dongdong.tao@canonical.com> >> - Gavin Guo <gavin.guo@canonical.com> >> - Gerald Yang <gerald.yang@canonical.com> >> - Heitor Alves de Siqueira <halves@canonical.com> >> - Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com> >> - Jay Vosburgh <jay.vosburgh@canonical.com> >> - Matthew Ruffell <matthew.ruffell@canonical.com> >> - Ponnuvel Palaniyappan <ponnuvel.palaniyappan@canonical.com> >> >> v2: check refcount against mapcount rather than a static 2. >> Thanks: Minchan Kim <minchan@kernel.org> >> >> Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> >> --- >> mm/rmap.c | 15 ++++++++++++++- >> mm/vmscan.c | 2 +- >> 2 files changed, 15 insertions(+), 2 deletions(-) >> >> diff --git a/mm/rmap.c b/mm/rmap.c >> index 163ac4e6bcee..8671de473c25 100644 >> --- a/mm/rmap.c >> +++ b/mm/rmap.c >> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, >> >> /* MADV_FREE page check */ >> if (!PageSwapBacked(page)) { >> - if (!PageDirty(page)) { >> + int ref_count = page_ref_count(page); >> + int map_count = page_mapcount(page); >> + >> + /* >> + * The only page refs must be from the isolation >> + * (checked by the caller shrink_page_list() too) >> + * and one or more rmap's (dropped by discard:). >> + * >> + * Check the reference count before dirty flag >> + * with memory barrier; see __remove_mapping(). >> + */ >> + smp_rmb(); >> + if ((ref_count - 1 == map_count) && >> + !PageDirty(page)) { > > Thanks. Looks good to me. > > I'd like to invite more folks for better eyes since this could > be a rather subtle issue. Looks good to me too! Please feel free to add my Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Best Regards, Huang, Ying
On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote: > diff --git a/mm/rmap.c b/mm/rmap.c > index 163ac4e6bcee..8671de473c25 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > /* MADV_FREE page check */ > if (!PageSwapBacked(page)) { > - if (!PageDirty(page)) { > + int ref_count = page_ref_count(page); > + int map_count = page_mapcount(page); > + > + /* > + * The only page refs must be from the isolation > + * (checked by the caller shrink_page_list() too) > + * and one or more rmap's (dropped by discard:). > + * > + * Check the reference count before dirty flag > + * with memory barrier; see __remove_mapping(). > + */ > + smp_rmb(); > + if ((ref_count - 1 == map_count) && > + !PageDirty(page)) { > /* Invalidate as we cleared the pte */ > mmu_notifier_invalidate_range(mm, > address, address + PAGE_SIZE); Out of curiosity, how does it work with COW in terms of reordering? Specifically, it seems to me get_page() and page_dup_rmap() in copy_present_pte() can happen in any order, and if page_dup_rmap() is seen first, and direct io is holding a refcnt, this check can still pass?
On Mon, Jan 10, 2022 at 11:48:13PM -0700, Yu Zhao wrote: > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote: > > diff --git a/mm/rmap.c b/mm/rmap.c > > index 163ac4e6bcee..8671de473c25 100644 > > --- a/mm/rmap.c > > +++ b/mm/rmap.c > > @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > /* MADV_FREE page check */ > > if (!PageSwapBacked(page)) { > > - if (!PageDirty(page)) { > > + int ref_count = page_ref_count(page); > > + int map_count = page_mapcount(page); > > + > > + /* > > + * The only page refs must be from the isolation > > + * (checked by the caller shrink_page_list() too) > > + * and one or more rmap's (dropped by discard:). > > + * > > + * Check the reference count before dirty flag > > + * with memory barrier; see __remove_mapping(). > > + */ > > + smp_rmb(); > > + if ((ref_count - 1 == map_count) && > > + !PageDirty(page)) { > > /* Invalidate as we cleared the pte */ > > mmu_notifier_invalidate_range(mm, > > address, address + PAGE_SIZE); > > Out of curiosity, how does it work with COW in terms of reordering? > Specifically, it seems to me get_page() and page_dup_rmap() in > copy_present_pte() can happen in any order, and if page_dup_rmap() > is seen first, and direct io is holding a refcnt, this check can still > pass? > Hi Yu, I think you're correct. I think we don't like memory barrier there in page_dup_rmap. Then, how about make gup_fast is aware of FOLL_TOUCH? FOLL_TOUCH means it's going to write something so the page should be dirty. Currently, get_user_pages works like that. Howver, problem is get_user_pages_fast since it looks like that lockless_pages_from_mm doesn't support FOLL_TOUCH. So the idea is if param in internal_get_user_pages_fast includes FOLL_TOUCH, gup_{pmd,pte}_range try to make the page dirty under trylock_page(If the lock fails, it goes slow path with __gup_longterm_unlocked and set_dirty_pages for them). This approach would solve other cases where map userspace pages into kernel space and then write. Since the write didn't go through with the process's page table, we will lose the dirty bit in the page table of the process and it turns out same problem. That's why I'd like to approach this. If it doesn't work, the other option to fix this specific case is can't we make pages dirty in advance in DIO read-case? When I look at DIO code, it's already doing in async case. Could't we do the same thing for the other cases? I guess the worst case we will see would be more page writeback since the page becomes dirty unnecessary.
On 1/11/22 10:54, Minchan Kim wrote: ... > Hi Yu, > > I think you're correct. I think we don't like memory barrier > there in page_dup_rmap. Then, how about make gup_fast is aware > of FOLL_TOUCH? > > FOLL_TOUCH means it's going to write something so the page Actually, my understanding of FOLL_TOUCH is that it does *not* mean that data will be written to the page. That is what FOLL_WRITE is for. FOLL_TOUCH means: update the "accessed" metadata, without actually writing to the memory that the page represents. > should be dirty. Currently, get_user_pages works like that. > Howver, problem is get_user_pages_fast since it looks like > that lockless_pages_from_mm doesn't support FOLL_TOUCH. > > So the idea is if param in internal_get_user_pages_fast > includes FOLL_TOUCH, gup_{pmd,pte}_range try to make the > page dirty under trylock_page(If the lock fails, it goes Marking a page dirty solely because FOLL_TOUCH is specified would be an API-level mistake. That's why it isn't "supported". Or at least, that's how I'm reading things. Hope that helps! > slow path with __gup_longterm_unlocked and set_dirty_pages > for them). > > This approach would solve other cases where map userspace > pages into kernel space and then write. Since the write > didn't go through with the process's page table, we will > lose the dirty bit in the page table of the process and > it turns out same problem. That's why I'd like to approach > this. > > If it doesn't work, the other option to fix this specific > case is can't we make pages dirty in advance in DIO read-case? > > When I look at DIO code, it's already doing in async case. > Could't we do the same thing for the other cases? > I guess the worst case we will see would be more page > writeback since the page becomes dirty unnecessary. Marking pages dirty after pinning them is a pre-existing area of problems. See the long-running LWN articles about get_user_pages() [1]. [1] https://lwn.net/Kernel/Index/#Memory_management-get_user_pages thanks,
On Tue, Jan 11, 2022 at 11:29:36AM -0800, John Hubbard wrote: > On 1/11/22 10:54, Minchan Kim wrote: > ... > > Hi Yu, > > > > I think you're correct. I think we don't like memory barrier > > there in page_dup_rmap. Then, how about make gup_fast is aware > > of FOLL_TOUCH? > > > > FOLL_TOUCH means it's going to write something so the page > > Actually, my understanding of FOLL_TOUCH is that it does *not* mean that > data will be written to the page. That is what FOLL_WRITE is for. > FOLL_TOUCH means: update the "accessed" metadata, without actually > writing to the memory that the page represents. Exactly. I should have mentioned the FOLL_TOUCH with FOLL_WRITE. What I wanted to hit with FOLL_TOUCH was follow_page_pte: if (flags & FOLL_TOUCH) { if ((flags & FOLL_WRITE) && !pte_dirty(pte) && !PageDirty(page)) set_page_dirty(page); mark_page_accessed(page); } > > > > should be dirty. Currently, get_user_pages works like that. > > Howver, problem is get_user_pages_fast since it looks like > > that lockless_pages_from_mm doesn't support FOLL_TOUCH. > > > > So the idea is if param in internal_get_user_pages_fast > > includes FOLL_TOUCH, gup_{pmd,pte}_range try to make the > > page dirty under trylock_page(If the lock fails, it goes > > Marking a page dirty solely because FOLL_TOUCH is specified would > be an API-level mistake. That's why it isn't "supported". Or at least, > that's how I'm reading things. > > Hope that helps! > > > slow path with __gup_longterm_unlocked and set_dirty_pages > > for them). > > > > This approach would solve other cases where map userspace > > pages into kernel space and then write. Since the write > > didn't go through with the process's page table, we will > > lose the dirty bit in the page table of the process and > > it turns out same problem. That's why I'd like to approach > > this. > > > > If it doesn't work, the other option to fix this specific > > case is can't we make pages dirty in advance in DIO read-case? > > > > When I look at DIO code, it's already doing in async case. > > Could't we do the same thing for the other cases? > > I guess the worst case we will see would be more page > > writeback since the page becomes dirty unnecessary. > > Marking pages dirty after pinning them is a pre-existing area of > problems. See the long-running LWN articles about get_user_pages() [1]. Oh, Do you mean marking page dirty in DIO path is already problems? Let me read the pages in the link. Thanks! > > > [1] https://lwn.net/Kernel/Index/#Memory_management-get_user_pages > > thanks, > -- > John Hubbard > NVIDIA >
On Tue, Jan 11, 2022 at 12:20:13PM -0800, Minchan Kim wrote: < snip > > > > slow path with __gup_longterm_unlocked and set_dirty_pages > > > for them). > > > > > > This approach would solve other cases where map userspace > > > pages into kernel space and then write. Since the write > > > didn't go through with the process's page table, we will > > > lose the dirty bit in the page table of the process and > > > it turns out same problem. That's why I'd like to approach > > > this. > > > > > > If it doesn't work, the other option to fix this specific > > > case is can't we make pages dirty in advance in DIO read-case? > > > > > > When I look at DIO code, it's already doing in async case. > > > Could't we do the same thing for the other cases? > > > I guess the worst case we will see would be more page > > > writeback since the page becomes dirty unnecessary. > > > > Marking pages dirty after pinning them is a pre-existing area of > > problems. See the long-running LWN articles about get_user_pages() [1]. > > Oh, Do you mean marking page dirty in DIO path is already problems? ^ marking page dirty too late in DIO path Typo fix.
On Tue, Jan 11, 2022 at 12:21:59PM -0800, Minchan Kim wrote: > On Tue, Jan 11, 2022 at 12:20:13PM -0800, Minchan Kim wrote: > < snip > > > > > slow path with __gup_longterm_unlocked and set_dirty_pages > > > > for them). > > > > > > > > This approach would solve other cases where map userspace > > > > pages into kernel space and then write. Since the write > > > > didn't go through with the process's page table, we will > > > > lose the dirty bit in the page table of the process and > > > > it turns out same problem. That's why I'd like to approach > > > > this. > > > > > > > > If it doesn't work, the other option to fix this specific > > > > case is can't we make pages dirty in advance in DIO read-case? > > > > > > > > When I look at DIO code, it's already doing in async case. > > > > Could't we do the same thing for the other cases? > > > > I guess the worst case we will see would be more page > > > > writeback since the page becomes dirty unnecessary. > > > > > > Marking pages dirty after pinning them is a pre-existing area of > > > problems. See the long-running LWN articles about get_user_pages() [1]. > > > > Oh, Do you mean marking page dirty in DIO path is already problems? > > ^ marking page dirty too late in DIO path > > Typo fix. I looked though the articles but couldn't find dots to connetct issues with this MADV_FREE issue. However, man page shows a clue why it's fine. ``` O_DIRECT I/Os should never be run concurrently with the fork(2) system call, if the memory buffer is a private map‐ ping (i.e., any mapping created with the mmap(2) MAP_PRIVATE flag; this includes memory allocated on the heap and statically allocated buffers). Any such I/Os, whether submitted via an asynchronous I/O interface or from another thread in the process, should be completed before fork(2) is called. Failure to do so can result in data corruption and undefined behavior in parent and child processes. ``` I think it would make the copy_present_pte's page_dup_rmap safe.
On 1/11/22 13:59, Minchan Kim wrote: ... >>>> Marking pages dirty after pinning them is a pre-existing area of >>>> problems. See the long-running LWN articles about get_user_pages() [1]. >>> >>> Oh, Do you mean marking page dirty in DIO path is already problems? >> >> ^ marking page dirty too late in DIO path >> >> Typo fix. > > I looked though the articles but couldn't find dots to connetct > issues with this MADV_FREE issue. However, man page shows a clue The area covered in those articles is about the fact that file system and block are not safely interacting with pinned memory. Even today. So I'm trying to make sure you're aware of that before you go too far in that direction. > why it's fine. > > ``` > O_DIRECT I/Os should never be run concurrently with the fork(2) system call, if the memory buffer is a private map‐ > ping (i.e., any mapping created with the mmap(2) MAP_PRIVATE flag; this includes memory allocated on the heap and > statically allocated buffers). Any such I/Os, whether submitted via an asynchronous I/O interface or from another > thread in the process, should be completed before fork(2) is called. Failure to do so can result in data corruption > and undefined behavior in parent and child processes. > > ``` > > I think it would make the copy_present_pte's page_dup_rmap safe. I'd have to see this in patch form, because I'm not quite able to visualize it yet. thanks,
On Tue, Jan 11, 2022 at 03:38:24PM -0800, John Hubbard wrote: > On 1/11/22 13:59, Minchan Kim wrote: > ... > > > > > Marking pages dirty after pinning them is a pre-existing area of > > > > > problems. See the long-running LWN articles about get_user_pages() [1]. > > > > > > > > Oh, Do you mean marking page dirty in DIO path is already problems? > > > > > > ^ marking page dirty too late in DIO path > > > > > > Typo fix. > > > > I looked though the articles but couldn't find dots to connetct > > issues with this MADV_FREE issue. However, man page shows a clue > > The area covered in those articles is about the fact that file system > and block are not safely interacting with pinned memory. Even today. > So I'm trying to make sure you're aware of that before you go too far > in that direction. > > > why it's fine. > > > > ``` > > O_DIRECT I/Os should never be run concurrently with the fork(2) system call, if the memory buffer is a private map‐ > > ping (i.e., any mapping created with the mmap(2) MAP_PRIVATE flag; this includes memory allocated on the heap and > > statically allocated buffers). Any such I/Os, whether submitted via an asynchronous I/O interface or from another > > thread in the process, should be completed before fork(2) is called. Failure to do so can result in data corruption > > and undefined behavior in parent and child processes. > > > > ``` > > > > I think it would make the copy_present_pte's page_dup_rmap safe. > > I'd have to see this in patch form, because I'm not quite able to visualize it yet. It would be great if you read though the original patch description. Since v2 had a little change to consider mutiple maps among parent and child, it would introduce a little mistmatch with the description but it's still quite good to explain current problem. https://lore.kernel.org/all/20220105233440.63361-1-mfo@canonical.com/T/#u Problem is MADV_FREEed anonymous memory is supposed to work based on dirtiness came from the user process's page table bit or PageDirty. Since VM can't see the dirty, it just discards the anonymous memory instead of swappoing out. Thus, the dirtiness is key to work correctly. However, DIO didn't make the page Dirty yet until IO is completed and at the same time, the store operation didn't go though via user process's page table regardless of DMA or other way. It makes VM could decide just drop the page since it didn't see any dirtiness from the page. So it turns out enduser would be surprised because the read syscall with DIO was completed but the data was zero rather than latest uptodate data. To prevent the problem, the patch suggested to compare page_mapcount with page_count since it expects any additional reference of the page means someone is doing accessing the memory so in this case, not discarding the page. However, Yu pointed out page_count and page_mapcount could be reordered in copy_page_range, for example. So I am looking for the solution(one would be adding memory barrier right before page_dup_rmap but I'd like to avoid it if we have other idea). And then man page says forking under going DIO would be already prohibited so the concern raised would be void, IIUC. Hope this helps your understanding. Thanks! work
Yu Zhao <yuzhao@google.com> writes: > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote: >> diff --git a/mm/rmap.c b/mm/rmap.c >> index 163ac4e6bcee..8671de473c25 100644 >> --- a/mm/rmap.c >> +++ b/mm/rmap.c >> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, >> >> /* MADV_FREE page check */ >> if (!PageSwapBacked(page)) { >> - if (!PageDirty(page)) { >> + int ref_count = page_ref_count(page); >> + int map_count = page_mapcount(page); >> + >> + /* >> + * The only page refs must be from the isolation >> + * (checked by the caller shrink_page_list() too) >> + * and one or more rmap's (dropped by discard:). >> + * >> + * Check the reference count before dirty flag >> + * with memory barrier; see __remove_mapping(). >> + */ >> + smp_rmb(); >> + if ((ref_count - 1 == map_count) && >> + !PageDirty(page)) { >> /* Invalidate as we cleared the pte */ >> mmu_notifier_invalidate_range(mm, >> address, address + PAGE_SIZE); > > Out of curiosity, how does it work with COW in terms of reordering? > Specifically, it seems to me get_page() and page_dup_rmap() in > copy_present_pte() can happen in any order, and if page_dup_rmap() > is seen first, and direct io is holding a refcnt, this check can still > pass? I think that you are correct. After more thoughts, it appears very tricky to compare page count and map count. Even if we have added smp_rmb() between page_ref_count() and page_mapcount(), an interrupt may happen between them. During the interrupt, the page count and map count may be changed, for example, unmapped, or do_swap_page(). Best Regards, Huang, Ying
On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote: > Yu Zhao <yuzhao@google.com> writes: > > > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote: > >> diff --git a/mm/rmap.c b/mm/rmap.c > >> index 163ac4e6bcee..8671de473c25 100644 > >> --- a/mm/rmap.c > >> +++ b/mm/rmap.c > >> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > >> > >> /* MADV_FREE page check */ > >> if (!PageSwapBacked(page)) { > >> - if (!PageDirty(page)) { > >> + int ref_count = page_ref_count(page); > >> + int map_count = page_mapcount(page); > >> + > >> + /* > >> + * The only page refs must be from the isolation > >> + * (checked by the caller shrink_page_list() too) > >> + * and one or more rmap's (dropped by discard:). > >> + * > >> + * Check the reference count before dirty flag > >> + * with memory barrier; see __remove_mapping(). > >> + */ > >> + smp_rmb(); > >> + if ((ref_count - 1 == map_count) && > >> + !PageDirty(page)) { > >> /* Invalidate as we cleared the pte */ > >> mmu_notifier_invalidate_range(mm, > >> address, address + PAGE_SIZE); > > > > Out of curiosity, how does it work with COW in terms of reordering? > > Specifically, it seems to me get_page() and page_dup_rmap() in > > copy_present_pte() can happen in any order, and if page_dup_rmap() > > is seen first, and direct io is holding a refcnt, this check can still > > pass? > > I think that you are correct. > > After more thoughts, it appears very tricky to compare page count and > map count. Even if we have added smp_rmb() between page_ref_count() and > page_mapcount(), an interrupt may happen between them. During the > interrupt, the page count and map count may be changed, for example, > unmapped, or do_swap_page(). Yeah, it happens but what specific problem are you concerning from the count change under race? The fork case Yu pointed out was already known for breaking DIO so user should take care not to fork under DIO(Please look at O_DIRECT section in man 2 open). If you could give a specific example, it would be great to think over the issue. I agree it's little tricky but it seems to be way other place has used for a long time(Please look at write_protect_page in ksm.c). So, here what we missing is tlb flush before the checking. Something like this. diff --git a/mm/rmap.c b/mm/rmap.c index b0fd9dc19eba..b4ad9faa17b2 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1599,18 +1599,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, /* MADV_FREE page check */ if (!PageSwapBacked(page)) { - int refcount = page_ref_count(page); - - /* - * The only page refs must be from the isolation - * (checked by the caller shrink_page_list() too) - * and the (single) rmap (dropped by discard:). - * - * Check the reference count before dirty flag - * with memory barrier; see __remove_mapping(). - */ - smp_rmb(); - if (refcount == 2 && !PageDirty(page)) { + if (!PageDirty(page) && + page_mapcount(page) + 1 == page_count(page)) { /* Invalidate as we cleared the pte */ mmu_notifier_invalidate_range(mm, address, address + PAGE_SIZE); diff --git a/mm/vmscan.c b/mm/vmscan.c index f3162a5724de..6454ff5c576f 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1754,6 +1754,9 @@ static unsigned int shrink_page_list(struct list_head *page_list, enum ttu_flags flags = TTU_BATCH_FLUSH; bool was_swapbacked = PageSwapBacked(page); + if (!was_swapbacked && PageAnon(page)) + flags &= ~TTU_BATCH_FLUSH; + if (unlikely(PageTransHuge(page))) flags |= TTU_SPLIT_HUGE_PMD;
Hi Minchan Kim, Thanks for handling the hard questions! :) On Wed, Jan 12, 2022 at 2:33 PM Minchan Kim <minchan@kernel.org> wrote: > > On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote: > > Yu Zhao <yuzhao@google.com> writes: > > > > > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote: > > >> diff --git a/mm/rmap.c b/mm/rmap.c > > >> index 163ac4e6bcee..8671de473c25 100644 > > >> --- a/mm/rmap.c > > >> +++ b/mm/rmap.c > > >> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > >> > > >> /* MADV_FREE page check */ > > >> if (!PageSwapBacked(page)) { > > >> - if (!PageDirty(page)) { > > >> + int ref_count = page_ref_count(page); > > >> + int map_count = page_mapcount(page); > > >> + > > >> + /* > > >> + * The only page refs must be from the isolation > > >> + * (checked by the caller shrink_page_list() too) > > >> + * and one or more rmap's (dropped by discard:). > > >> + * > > >> + * Check the reference count before dirty flag > > >> + * with memory barrier; see __remove_mapping(). > > >> + */ > > >> + smp_rmb(); > > >> + if ((ref_count - 1 == map_count) && > > >> + !PageDirty(page)) { > > >> /* Invalidate as we cleared the pte */ > > >> mmu_notifier_invalidate_range(mm, > > >> address, address + PAGE_SIZE); > > > > > > Out of curiosity, how does it work with COW in terms of reordering? > > > Specifically, it seems to me get_page() and page_dup_rmap() in > > > copy_present_pte() can happen in any order, and if page_dup_rmap() > > > is seen first, and direct io is holding a refcnt, this check can still > > > pass? > > > > I think that you are correct. > > > > After more thoughts, it appears very tricky to compare page count and > > map count. Even if we have added smp_rmb() between page_ref_count() and > > page_mapcount(), an interrupt may happen between them. During the > > interrupt, the page count and map count may be changed, for example, > > unmapped, or do_swap_page(). > > Yeah, it happens but what specific problem are you concerning from the > count change under race? The fork case Yu pointed out was already known > for breaking DIO so user should take care not to fork under DIO(Please > look at O_DIRECT section in man 2 open). If you could give a specific > example, it would be great to think over the issue. > > I agree it's little tricky but it seems to be way other place has used > for a long time(Please look at write_protect_page in ksm.c). Ah, that's great to see it's being used elsewhere, for DIO particularly! > So, here what we missing is tlb flush before the checking. That shouldn't be required for this particular issue/case, IIUIC. One of the things we checked early on was disabling deferred TLB flush (similarly to what you've done), and it didn't help with the issue; also, the issue happens on uniprocessor mode too (thus no remote CPU involved.) > > Something like this. > > diff --git a/mm/rmap.c b/mm/rmap.c > index b0fd9dc19eba..b4ad9faa17b2 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1599,18 +1599,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > /* MADV_FREE page check */ > if (!PageSwapBacked(page)) { > - int refcount = page_ref_count(page); > - > - /* > - * The only page refs must be from the isolation > - * (checked by the caller shrink_page_list() too) > - * and the (single) rmap (dropped by discard:). > - * > - * Check the reference count before dirty flag > - * with memory barrier; see __remove_mapping(). > - */ > - smp_rmb(); > - if (refcount == 2 && !PageDirty(page)) { > + if (!PageDirty(page) && > + page_mapcount(page) + 1 == page_count(page)) { In the interest of avoiding a different race/bug, it seemed worth following the suggestion outlined in __remove_mapping(), i.e., checking PageDirty() after the page's reference count, with a memory barrier in between. I'm not familiar with the details of the original issue behind that code change, but it seemed to be possible here too, particularly as writes from user-space can happen asynchronously / after try_to_unmap_one() checked PTE clean and didn't set PageDirty, and if the page's PTE is present, there's no fault? Thanks again, Mauricio > /* Invalidate as we cleared the pte */ > mmu_notifier_invalidate_range(mm, > address, address + PAGE_SIZE); > diff --git a/mm/vmscan.c b/mm/vmscan.c > index f3162a5724de..6454ff5c576f 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1754,6 +1754,9 @@ static unsigned int shrink_page_list(struct list_head *page_list, > enum ttu_flags flags = TTU_BATCH_FLUSH; > bool was_swapbacked = PageSwapBacked(page); > > + if (!was_swapbacked && PageAnon(page)) > + flags &= ~TTU_BATCH_FLUSH; > + > if (unlikely(PageTransHuge(page))) > flags |= TTU_SPLIT_HUGE_PMD; > > > > > >
On Wed, Jan 12, 2022 at 06:53:07PM -0300, Mauricio Faria de Oliveira wrote: > Hi Minchan Kim, > > Thanks for handling the hard questions! :) > > On Wed, Jan 12, 2022 at 2:33 PM Minchan Kim <minchan@kernel.org> wrote: > > > > On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote: > > > Yu Zhao <yuzhao@google.com> writes: > > > > > > > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote: > > > >> diff --git a/mm/rmap.c b/mm/rmap.c > > > >> index 163ac4e6bcee..8671de473c25 100644 > > > >> --- a/mm/rmap.c > > > >> +++ b/mm/rmap.c > > > >> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > >> > > > >> /* MADV_FREE page check */ > > > >> if (!PageSwapBacked(page)) { > > > >> - if (!PageDirty(page)) { > > > >> + int ref_count = page_ref_count(page); > > > >> + int map_count = page_mapcount(page); > > > >> + > > > >> + /* > > > >> + * The only page refs must be from the isolation > > > >> + * (checked by the caller shrink_page_list() too) > > > >> + * and one or more rmap's (dropped by discard:). > > > >> + * > > > >> + * Check the reference count before dirty flag > > > >> + * with memory barrier; see __remove_mapping(). > > > >> + */ > > > >> + smp_rmb(); > > > >> + if ((ref_count - 1 == map_count) && > > > >> + !PageDirty(page)) { > > > >> /* Invalidate as we cleared the pte */ > > > >> mmu_notifier_invalidate_range(mm, > > > >> address, address + PAGE_SIZE); > > > > > > > > Out of curiosity, how does it work with COW in terms of reordering? > > > > Specifically, it seems to me get_page() and page_dup_rmap() in > > > > copy_present_pte() can happen in any order, and if page_dup_rmap() > > > > is seen first, and direct io is holding a refcnt, this check can still > > > > pass? > > > > > > I think that you are correct. > > > > > > After more thoughts, it appears very tricky to compare page count and > > > map count. Even if we have added smp_rmb() between page_ref_count() and > > > page_mapcount(), an interrupt may happen between them. During the > > > interrupt, the page count and map count may be changed, for example, > > > unmapped, or do_swap_page(). > > > > Yeah, it happens but what specific problem are you concerning from the > > count change under race? The fork case Yu pointed out was already known > > for breaking DIO so user should take care not to fork under DIO(Please > > look at O_DIRECT section in man 2 open). If you could give a specific > > example, it would be great to think over the issue. > > > > I agree it's little tricky but it seems to be way other place has used > > for a long time(Please look at write_protect_page in ksm.c). > > Ah, that's great to see it's being used elsewhere, for DIO particularly! > > > So, here what we missing is tlb flush before the checking. > > That shouldn't be required for this particular issue/case, IIUIC. > One of the things we checked early on was disabling deferred TLB flush > (similarly to what you've done), and it didn't help with the issue; also, the > issue happens on uniprocessor mode too (thus no remote CPU involved.) I guess you didn't try it with page_mapcount + 1 == page_count at tha time? Anyway, I agree we don't need TLB flush here like KSM. I think the reason KSM is doing TLB flush before the check it to make sure trap trigger on the write from userprocess in other core. However, this MADV_FREE case, HW already gaurantees the trap. Please see below. > > > > > > Something like this. > > > > diff --git a/mm/rmap.c b/mm/rmap.c > > index b0fd9dc19eba..b4ad9faa17b2 100644 > > --- a/mm/rmap.c > > +++ b/mm/rmap.c > > @@ -1599,18 +1599,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > /* MADV_FREE page check */ > > if (!PageSwapBacked(page)) { > > - int refcount = page_ref_count(page); > > - > > - /* > > - * The only page refs must be from the isolation > > - * (checked by the caller shrink_page_list() too) > > - * and the (single) rmap (dropped by discard:). > > - * > > - * Check the reference count before dirty flag > > - * with memory barrier; see __remove_mapping(). > > - */ > > - smp_rmb(); > > - if (refcount == 2 && !PageDirty(page)) { > > + if (!PageDirty(page) && > > + page_mapcount(page) + 1 == page_count(page)) { > > In the interest of avoiding a different race/bug, it seemed worth following the > suggestion outlined in __remove_mapping(), i.e., checking PageDirty() > after the page's reference count, with a memory barrier in between. True so it means your patch as-is is good for me. > > I'm not familiar with the details of the original issue behind that code change, > but it seemed to be possible here too, particularly as writes from user-space > can happen asynchronously / after try_to_unmap_one() checked PTE clean > and didn't set PageDirty, and if the page's PTE is present, there's no fault? Yeah, it was discussed. For clean pte, CPU has to fetch and update the actual pte entry, not TLB so trap triggers for MADV_FREE page. https://lkml.org/lkml/2015/4/15/565 https://lkml.org/lkml/2015/4/16/136
Minchan Kim <minchan@kernel.org> writes: > On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote: >> Yu Zhao <yuzhao@google.com> writes: >> >> > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote: >> >> diff --git a/mm/rmap.c b/mm/rmap.c >> >> index 163ac4e6bcee..8671de473c25 100644 >> >> --- a/mm/rmap.c >> >> +++ b/mm/rmap.c >> >> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, >> >> >> >> /* MADV_FREE page check */ >> >> if (!PageSwapBacked(page)) { >> >> - if (!PageDirty(page)) { >> >> + int ref_count = page_ref_count(page); >> >> + int map_count = page_mapcount(page); >> >> + >> >> + /* >> >> + * The only page refs must be from the isolation >> >> + * (checked by the caller shrink_page_list() too) >> >> + * and one or more rmap's (dropped by discard:). >> >> + * >> >> + * Check the reference count before dirty flag >> >> + * with memory barrier; see __remove_mapping(). >> >> + */ >> >> + smp_rmb(); >> >> + if ((ref_count - 1 == map_count) && >> >> + !PageDirty(page)) { >> >> /* Invalidate as we cleared the pte */ >> >> mmu_notifier_invalidate_range(mm, >> >> address, address + PAGE_SIZE); >> > >> > Out of curiosity, how does it work with COW in terms of reordering? >> > Specifically, it seems to me get_page() and page_dup_rmap() in >> > copy_present_pte() can happen in any order, and if page_dup_rmap() >> > is seen first, and direct io is holding a refcnt, this check can still >> > pass? >> >> I think that you are correct. >> >> After more thoughts, it appears very tricky to compare page count and >> map count. Even if we have added smp_rmb() between page_ref_count() and >> page_mapcount(), an interrupt may happen between them. During the >> interrupt, the page count and map count may be changed, for example, >> unmapped, or do_swap_page(). > > Yeah, it happens but what specific problem are you concerning from the > count change under race? The fork case Yu pointed out was already known > for breaking DIO so user should take care not to fork under DIO(Please > look at O_DIRECT section in man 2 open). If you could give a specific > example, it would be great to think over the issue. Whether is the following race possible? CPU0/Process A CPU1/Process B -------------- -------------- try_to_unmap_one page_mapcount() zap_pte_range() page_remove_rmap() atomic_add_negative(-1, &page->_mapcount) tlb_flush_mmu() ... put_page_testzero() page_count() Previously I thought that there's similar race in do_swap_page(). But after more thoughts, I found that the page is locked in do_swap_page(). So do_swap_page() is safe. Per my understanding, except during fork() as Yu pointed out, the anonymous page must be locked before increasing its mapcount. So, if the above race is possible, we need to guarantee to read page_count() before page_mapcount(). That is, something as follows, count = page_count(); smp_rmb(); mapcount = page_mapcount(); if (!PageDirty(page) && mapcount + 1 == count) { ... } Best Regards, Huang, Ying > I agree it's little tricky but it seems to be way other place has used > for a long time(Please look at write_protect_page in ksm.c). > So, here what we missing is tlb flush before the checking. > > Something like this. > > diff --git a/mm/rmap.c b/mm/rmap.c > index b0fd9dc19eba..b4ad9faa17b2 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1599,18 +1599,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > /* MADV_FREE page check */ > if (!PageSwapBacked(page)) { > - int refcount = page_ref_count(page); > - > - /* > - * The only page refs must be from the isolation > - * (checked by the caller shrink_page_list() too) > - * and the (single) rmap (dropped by discard:). > - * > - * Check the reference count before dirty flag > - * with memory barrier; see __remove_mapping(). > - */ > - smp_rmb(); > - if (refcount == 2 && !PageDirty(page)) { > + if (!PageDirty(page) && > + page_mapcount(page) + 1 == page_count(page)) { > /* Invalidate as we cleared the pte */ > mmu_notifier_invalidate_range(mm, > address, address + PAGE_SIZE); > diff --git a/mm/vmscan.c b/mm/vmscan.c > index f3162a5724de..6454ff5c576f 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1754,6 +1754,9 @@ static unsigned int shrink_page_list(struct list_head *page_list, > enum ttu_flags flags = TTU_BATCH_FLUSH; > bool was_swapbacked = PageSwapBacked(page); > > + if (!was_swapbacked && PageAnon(page)) > + flags &= ~TTU_BATCH_FLUSH; > + > if (unlikely(PageTransHuge(page))) > flags |= TTU_SPLIT_HUGE_PMD;
On 2022/1/13 13:47, Huang, Ying wrote: > Minchan Kim <minchan@kernel.org> writes: > >> On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote: >>> Yu Zhao <yuzhao@google.com> writes: >>> >>>> On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote: >>>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>>> index 163ac4e6bcee..8671de473c25 100644 >>>>> --- a/mm/rmap.c >>>>> +++ b/mm/rmap.c >>>>> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, >>>>> >>>>> /* MADV_FREE page check */ >>>>> if (!PageSwapBacked(page)) { >>>>> - if (!PageDirty(page)) { >>>>> + int ref_count = page_ref_count(page); >>>>> + int map_count = page_mapcount(page); >>>>> + >>>>> + /* >>>>> + * The only page refs must be from the isolation >>>>> + * (checked by the caller shrink_page_list() too) >>>>> + * and one or more rmap's (dropped by discard:). >>>>> + * >>>>> + * Check the reference count before dirty flag >>>>> + * with memory barrier; see __remove_mapping(). >>>>> + */ >>>>> + smp_rmb(); >>>>> + if ((ref_count - 1 == map_count) && >>>>> + !PageDirty(page)) { >>>>> /* Invalidate as we cleared the pte */ >>>>> mmu_notifier_invalidate_range(mm, >>>>> address, address + PAGE_SIZE); >>>> >>>> Out of curiosity, how does it work with COW in terms of reordering? >>>> Specifically, it seems to me get_page() and page_dup_rmap() in >>>> copy_present_pte() can happen in any order, and if page_dup_rmap() >>>> is seen first, and direct io is holding a refcnt, this check can still >>>> pass? >>> >>> I think that you are correct. >>> >>> After more thoughts, it appears very tricky to compare page count and >>> map count. Even if we have added smp_rmb() between page_ref_count() and >>> page_mapcount(), an interrupt may happen between them. During the >>> interrupt, the page count and map count may be changed, for example, >>> unmapped, or do_swap_page(). >> >> Yeah, it happens but what specific problem are you concerning from the >> count change under race? The fork case Yu pointed out was already known >> for breaking DIO so user should take care not to fork under DIO(Please >> look at O_DIRECT section in man 2 open). If you could give a specific >> example, it would be great to think over the issue. > > Whether is the following race possible? > > CPU0/Process A CPU1/Process B > -------------- -------------- > try_to_unmap_one > page_mapcount() > zap_pte_range() > page_remove_rmap() > atomic_add_negative(-1, &page->_mapcount) > tlb_flush_mmu() > ... > put_page_testzero() > page_count() > It seems they're under the same page table Lock. Thanks. > Previously I thought that there's similar race in do_swap_page(). But > after more thoughts, I found that the page is locked in do_swap_page(). > So do_swap_page() is safe. Per my understanding, except during fork() > as Yu pointed out, the anonymous page must be locked before increasing > its mapcount. > > So, if the above race is possible, we need to guarantee to read > page_count() before page_mapcount(). That is, something as follows, > > count = page_count(); > smp_rmb(); > mapcount = page_mapcount(); > if (!PageDirty(page) && mapcount + 1 == count) { > ... > } > > Best Regards, > Huang, Ying > >> I agree it's little tricky but it seems to be way other place has used >> for a long time(Please look at write_protect_page in ksm.c). >> So, here what we missing is tlb flush before the checking. >> >> Something like this. >> >> diff --git a/mm/rmap.c b/mm/rmap.c >> index b0fd9dc19eba..b4ad9faa17b2 100644 >> --- a/mm/rmap.c >> +++ b/mm/rmap.c >> @@ -1599,18 +1599,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, >> >> /* MADV_FREE page check */ >> if (!PageSwapBacked(page)) { >> - int refcount = page_ref_count(page); >> - >> - /* >> - * The only page refs must be from the isolation >> - * (checked by the caller shrink_page_list() too) >> - * and the (single) rmap (dropped by discard:). >> - * >> - * Check the reference count before dirty flag >> - * with memory barrier; see __remove_mapping(). >> - */ >> - smp_rmb(); >> - if (refcount == 2 && !PageDirty(page)) { >> + if (!PageDirty(page) && >> + page_mapcount(page) + 1 == page_count(page)) { >> /* Invalidate as we cleared the pte */ >> mmu_notifier_invalidate_range(mm, >> address, address + PAGE_SIZE); >> diff --git a/mm/vmscan.c b/mm/vmscan.c >> index f3162a5724de..6454ff5c576f 100644 >> --- a/mm/vmscan.c >> +++ b/mm/vmscan.c >> @@ -1754,6 +1754,9 @@ static unsigned int shrink_page_list(struct list_head *page_list, >> enum ttu_flags flags = TTU_BATCH_FLUSH; >> bool was_swapbacked = PageSwapBacked(page); >> >> + if (!was_swapbacked && PageAnon(page)) >> + flags &= ~TTU_BATCH_FLUSH; >> + >> if (unlikely(PageTransHuge(page))) >> flags |= TTU_SPLIT_HUGE_PMD; > . >
On Wed, Jan 12, 2022 at 2:53 PM Mauricio Faria de Oliveira <mfo@canonical.com> wrote: > > Hi Minchan Kim, > > Thanks for handling the hard questions! :) > > On Wed, Jan 12, 2022 at 2:33 PM Minchan Kim <minchan@kernel.org> wrote: > > > > On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote: > > > Yu Zhao <yuzhao@google.com> writes: > > > > > > > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote: > > > >> diff --git a/mm/rmap.c b/mm/rmap.c > > > >> index 163ac4e6bcee..8671de473c25 100644 > > > >> --- a/mm/rmap.c > > > >> +++ b/mm/rmap.c > > > >> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > >> > > > >> /* MADV_FREE page check */ > > > >> if (!PageSwapBacked(page)) { > > > >> - if (!PageDirty(page)) { > > > >> + int ref_count = page_ref_count(page); > > > >> + int map_count = page_mapcount(page); > > > >> + > > > >> + /* > > > >> + * The only page refs must be from the isolation > > > >> + * (checked by the caller shrink_page_list() too) > > > >> + * and one or more rmap's (dropped by discard:). > > > >> + * > > > >> + * Check the reference count before dirty flag > > > >> + * with memory barrier; see __remove_mapping(). > > > >> + */ > > > >> + smp_rmb(); > > > >> + if ((ref_count - 1 == map_count) && > > > >> + !PageDirty(page)) { > > > >> /* Invalidate as we cleared the pte */ > > > >> mmu_notifier_invalidate_range(mm, > > > >> address, address + PAGE_SIZE); > > > > > > > > Out of curiosity, how does it work with COW in terms of reordering? > > > > Specifically, it seems to me get_page() and page_dup_rmap() in > > > > copy_present_pte() can happen in any order, and if page_dup_rmap() > > > > is seen first, and direct io is holding a refcnt, this check can still > > > > pass? > > > > > > I think that you are correct. > > > > > > After more thoughts, it appears very tricky to compare page count and > > > map count. Even if we have added smp_rmb() between page_ref_count() and > > > page_mapcount(), an interrupt may happen between them. During the > > > interrupt, the page count and map count may be changed, for example, > > > unmapped, or do_swap_page(). > > > > Yeah, it happens but what specific problem are you concerning from the > > count change under race? The fork case Yu pointed out was already known > > for breaking DIO so user should take care not to fork under DIO(Please > > look at O_DIRECT section in man 2 open). If you could give a specific > > example, it would be great to think over the issue. > > > > I agree it's little tricky but it seems to be way other place has used > > for a long time(Please look at write_protect_page in ksm.c). > > Ah, that's great to see it's being used elsewhere, for DIO particularly! > > > So, here what we missing is tlb flush before the checking. > > That shouldn't be required for this particular issue/case, IIUIC. > One of the things we checked early on was disabling deferred TLB flush > (similarly to what you've done), and it didn't help with the issue; also, the > issue happens on uniprocessor mode too (thus no remote CPU involved.) Fast gup doesn't block tlb flush; it only blocks IPI used when freeing page tables. So it's expected that forcing a tlb flush doesn't fix the problem. But it still seems to me the fix is missing smp_mb(). IIUC, a proper fix would require, on the dio side inc page refcnt smp_mb() read pte and on the rmap side clear pte smp_mb() read page refcnt try_grab_compound_head() implies smp_mb, but i don't think ptep_get_and_clear() does. mapcount, as Minchan said, probably is irrelevant given dio is already known to be broken with fork. I glanced at the thread and thought it might be worth menthing.
Miaohe Lin <linmiaohe@huawei.com> writes: > On 2022/1/13 13:47, Huang, Ying wrote: >> Minchan Kim <minchan@kernel.org> writes: >> >>> On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote: >>>> Yu Zhao <yuzhao@google.com> writes: >>>> >>>>> On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote: >>>>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>>>> index 163ac4e6bcee..8671de473c25 100644 >>>>>> --- a/mm/rmap.c >>>>>> +++ b/mm/rmap.c >>>>>> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, >>>>>> >>>>>> /* MADV_FREE page check */ >>>>>> if (!PageSwapBacked(page)) { >>>>>> - if (!PageDirty(page)) { >>>>>> + int ref_count = page_ref_count(page); >>>>>> + int map_count = page_mapcount(page); >>>>>> + >>>>>> + /* >>>>>> + * The only page refs must be from the isolation >>>>>> + * (checked by the caller shrink_page_list() too) >>>>>> + * and one or more rmap's (dropped by discard:). >>>>>> + * >>>>>> + * Check the reference count before dirty flag >>>>>> + * with memory barrier; see __remove_mapping(). >>>>>> + */ >>>>>> + smp_rmb(); >>>>>> + if ((ref_count - 1 == map_count) && >>>>>> + !PageDirty(page)) { >>>>>> /* Invalidate as we cleared the pte */ >>>>>> mmu_notifier_invalidate_range(mm, >>>>>> address, address + PAGE_SIZE); >>>>> >>>>> Out of curiosity, how does it work with COW in terms of reordering? >>>>> Specifically, it seems to me get_page() and page_dup_rmap() in >>>>> copy_present_pte() can happen in any order, and if page_dup_rmap() >>>>> is seen first, and direct io is holding a refcnt, this check can still >>>>> pass? >>>> >>>> I think that you are correct. >>>> >>>> After more thoughts, it appears very tricky to compare page count and >>>> map count. Even if we have added smp_rmb() between page_ref_count() and >>>> page_mapcount(), an interrupt may happen between them. During the >>>> interrupt, the page count and map count may be changed, for example, >>>> unmapped, or do_swap_page(). >>> >>> Yeah, it happens but what specific problem are you concerning from the >>> count change under race? The fork case Yu pointed out was already known >>> for breaking DIO so user should take care not to fork under DIO(Please >>> look at O_DIRECT section in man 2 open). If you could give a specific >>> example, it would be great to think over the issue. >> >> Whether is the following race possible? >> >> CPU0/Process A CPU1/Process B >> -------------- -------------- >> try_to_unmap_one >> page_mapcount() >> zap_pte_range() >> page_remove_rmap() >> atomic_add_negative(-1, &page->_mapcount) >> tlb_flush_mmu() >> ... >> put_page_testzero() >> page_count() >> > > It seems they're under the same page table Lock. This is for a page shared by 2 processes (Process A/B above). But you reminded me that an anonymous page cannot be shared between multiple processes after direct IO. Because direct IO read un-shares the page, and fork() isn't allowed, I guess ksm isn't allowed too. So the above race isn't possible. Sorry for confusing. Best Regards, Huang, Ying > Thanks. > >> Previously I thought that there's similar race in do_swap_page(). But >> after more thoughts, I found that the page is locked in do_swap_page(). >> So do_swap_page() is safe. Per my understanding, except during fork() >> as Yu pointed out, the anonymous page must be locked before increasing >> its mapcount. >> >> So, if the above race is possible, we need to guarantee to read >> page_count() before page_mapcount(). That is, something as follows, >> >> count = page_count(); >> smp_rmb(); >> mapcount = page_mapcount(); >> if (!PageDirty(page) && mapcount + 1 == count) { >> ... >> } >> >> Best Regards, >> Huang, Ying >> >>> I agree it's little tricky but it seems to be way other place has used >>> for a long time(Please look at write_protect_page in ksm.c). >>> So, here what we missing is tlb flush before the checking. >>> >>> Something like this. >>> >>> diff --git a/mm/rmap.c b/mm/rmap.c >>> index b0fd9dc19eba..b4ad9faa17b2 100644 >>> --- a/mm/rmap.c >>> +++ b/mm/rmap.c >>> @@ -1599,18 +1599,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, >>> >>> /* MADV_FREE page check */ >>> if (!PageSwapBacked(page)) { >>> - int refcount = page_ref_count(page); >>> - >>> - /* >>> - * The only page refs must be from the isolation >>> - * (checked by the caller shrink_page_list() too) >>> - * and the (single) rmap (dropped by discard:). >>> - * >>> - * Check the reference count before dirty flag >>> - * with memory barrier; see __remove_mapping(). >>> - */ >>> - smp_rmb(); >>> - if (refcount == 2 && !PageDirty(page)) { >>> + if (!PageDirty(page) && >>> + page_mapcount(page) + 1 == page_count(page)) { >>> /* Invalidate as we cleared the pte */ >>> mmu_notifier_invalidate_range(mm, >>> address, address + PAGE_SIZE); >>> diff --git a/mm/vmscan.c b/mm/vmscan.c >>> index f3162a5724de..6454ff5c576f 100644 >>> --- a/mm/vmscan.c >>> +++ b/mm/vmscan.c >>> @@ -1754,6 +1754,9 @@ static unsigned int shrink_page_list(struct list_head *page_list, >>> enum ttu_flags flags = TTU_BATCH_FLUSH; >>> bool was_swapbacked = PageSwapBacked(page); >>> >>> + if (!was_swapbacked && PageAnon(page)) >>> + flags &= ~TTU_BATCH_FLUSH; >>> + >>> if (unlikely(PageTransHuge(page))) >>> flags |= TTU_SPLIT_HUGE_PMD; >> . >>
Minchan Kim <minchan@kernel.org> writes: > On Wed, Jan 12, 2022 at 06:53:07PM -0300, Mauricio Faria de Oliveira wrote: >> Hi Minchan Kim, >> >> Thanks for handling the hard questions! :) >> >> On Wed, Jan 12, 2022 at 2:33 PM Minchan Kim <minchan@kernel.org> wrote: >> > >> > On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote: >> > > Yu Zhao <yuzhao@google.com> writes: >> > > >> > > > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote: >> > > >> diff --git a/mm/rmap.c b/mm/rmap.c >> > > >> index 163ac4e6bcee..8671de473c25 100644 >> > > >> --- a/mm/rmap.c >> > > >> +++ b/mm/rmap.c >> > > >> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, >> > > >> >> > > >> /* MADV_FREE page check */ >> > > >> if (!PageSwapBacked(page)) { >> > > >> - if (!PageDirty(page)) { >> > > >> + int ref_count = page_ref_count(page); >> > > >> + int map_count = page_mapcount(page); >> > > >> + >> > > >> + /* >> > > >> + * The only page refs must be from the isolation >> > > >> + * (checked by the caller shrink_page_list() too) >> > > >> + * and one or more rmap's (dropped by discard:). >> > > >> + * >> > > >> + * Check the reference count before dirty flag >> > > >> + * with memory barrier; see __remove_mapping(). >> > > >> + */ >> > > >> + smp_rmb(); >> > > >> + if ((ref_count - 1 == map_count) && >> > > >> + !PageDirty(page)) { >> > > >> /* Invalidate as we cleared the pte */ >> > > >> mmu_notifier_invalidate_range(mm, >> > > >> address, address + PAGE_SIZE); >> > > > >> > > > Out of curiosity, how does it work with COW in terms of reordering? >> > > > Specifically, it seems to me get_page() and page_dup_rmap() in >> > > > copy_present_pte() can happen in any order, and if page_dup_rmap() >> > > > is seen first, and direct io is holding a refcnt, this check can still >> > > > pass? >> > > >> > > I think that you are correct. >> > > >> > > After more thoughts, it appears very tricky to compare page count and >> > > map count. Even if we have added smp_rmb() between page_ref_count() and >> > > page_mapcount(), an interrupt may happen between them. During the >> > > interrupt, the page count and map count may be changed, for example, >> > > unmapped, or do_swap_page(). >> > >> > Yeah, it happens but what specific problem are you concerning from the >> > count change under race? The fork case Yu pointed out was already known >> > for breaking DIO so user should take care not to fork under DIO(Please >> > look at O_DIRECT section in man 2 open). If you could give a specific >> > example, it would be great to think over the issue. >> > >> > I agree it's little tricky but it seems to be way other place has used >> > for a long time(Please look at write_protect_page in ksm.c). >> >> Ah, that's great to see it's being used elsewhere, for DIO particularly! >> >> > So, here what we missing is tlb flush before the checking. >> >> That shouldn't be required for this particular issue/case, IIUIC. >> One of the things we checked early on was disabling deferred TLB flush >> (similarly to what you've done), and it didn't help with the issue; also, the >> issue happens on uniprocessor mode too (thus no remote CPU involved.) > > I guess you didn't try it with page_mapcount + 1 == page_count at tha > time? Anyway, I agree we don't need TLB flush here like KSM. > I think the reason KSM is doing TLB flush before the check it to > make sure trap trigger on the write from userprocess in other core. > However, this MADV_FREE case, HW already gaurantees the trap. > Please see below. > >> >> >> > >> > Something like this. >> > >> > diff --git a/mm/rmap.c b/mm/rmap.c >> > index b0fd9dc19eba..b4ad9faa17b2 100644 >> > --- a/mm/rmap.c >> > +++ b/mm/rmap.c >> > @@ -1599,18 +1599,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, >> > >> > /* MADV_FREE page check */ >> > if (!PageSwapBacked(page)) { >> > - int refcount = page_ref_count(page); >> > - >> > - /* >> > - * The only page refs must be from the isolation >> > - * (checked by the caller shrink_page_list() too) >> > - * and the (single) rmap (dropped by discard:). >> > - * >> > - * Check the reference count before dirty flag >> > - * with memory barrier; see __remove_mapping(). >> > - */ >> > - smp_rmb(); >> > - if (refcount == 2 && !PageDirty(page)) { >> > + if (!PageDirty(page) && >> > + page_mapcount(page) + 1 == page_count(page)) { >> >> In the interest of avoiding a different race/bug, it seemed worth following the >> suggestion outlined in __remove_mapping(), i.e., checking PageDirty() >> after the page's reference count, with a memory barrier in between. > > True so it means your patch as-is is good for me. If my understanding were correct, a shared anonymous page will be mapped read-only. If so, will a private anonymous page be called SetPageDirty() concurrently after direct IO case has been dealt with via comparing page_count()/page_mapcount()? Best Regards, Huang, Ying
"Huang, Ying" <ying.huang@intel.com> writes: > Minchan Kim <minchan@kernel.org> writes: > >> On Wed, Jan 12, 2022 at 06:53:07PM -0300, Mauricio Faria de Oliveira wrote: >>> Hi Minchan Kim, >>> >>> Thanks for handling the hard questions! :) >>> >>> On Wed, Jan 12, 2022 at 2:33 PM Minchan Kim <minchan@kernel.org> wrote: >>> > >>> > On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote: >>> > > Yu Zhao <yuzhao@google.com> writes: >>> > > >>> > > > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote: >>> > > >> diff --git a/mm/rmap.c b/mm/rmap.c >>> > > >> index 163ac4e6bcee..8671de473c25 100644 >>> > > >> --- a/mm/rmap.c >>> > > >> +++ b/mm/rmap.c >>> > > >> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, >>> > > >> >>> > > >> /* MADV_FREE page check */ >>> > > >> if (!PageSwapBacked(page)) { >>> > > >> - if (!PageDirty(page)) { >>> > > >> + int ref_count = page_ref_count(page); >>> > > >> + int map_count = page_mapcount(page); >>> > > >> + >>> > > >> + /* >>> > > >> + * The only page refs must be from the isolation >>> > > >> + * (checked by the caller shrink_page_list() too) >>> > > >> + * and one or more rmap's (dropped by discard:). >>> > > >> + * >>> > > >> + * Check the reference count before dirty flag >>> > > >> + * with memory barrier; see __remove_mapping(). >>> > > >> + */ >>> > > >> + smp_rmb(); >>> > > >> + if ((ref_count - 1 == map_count) && >>> > > >> + !PageDirty(page)) { >>> > > >> /* Invalidate as we cleared the pte */ >>> > > >> mmu_notifier_invalidate_range(mm, >>> > > >> address, address + PAGE_SIZE); >>> > > > >>> > > > Out of curiosity, how does it work with COW in terms of reordering? >>> > > > Specifically, it seems to me get_page() and page_dup_rmap() in >>> > > > copy_present_pte() can happen in any order, and if page_dup_rmap() >>> > > > is seen first, and direct io is holding a refcnt, this check can still >>> > > > pass? >>> > > >>> > > I think that you are correct. >>> > > >>> > > After more thoughts, it appears very tricky to compare page count and >>> > > map count. Even if we have added smp_rmb() between page_ref_count() and >>> > > page_mapcount(), an interrupt may happen between them. During the >>> > > interrupt, the page count and map count may be changed, for example, >>> > > unmapped, or do_swap_page(). >>> > >>> > Yeah, it happens but what specific problem are you concerning from the >>> > count change under race? The fork case Yu pointed out was already known >>> > for breaking DIO so user should take care not to fork under DIO(Please >>> > look at O_DIRECT section in man 2 open). If you could give a specific >>> > example, it would be great to think over the issue. >>> > >>> > I agree it's little tricky but it seems to be way other place has used >>> > for a long time(Please look at write_protect_page in ksm.c). >>> >>> Ah, that's great to see it's being used elsewhere, for DIO particularly! >>> >>> > So, here what we missing is tlb flush before the checking. >>> >>> That shouldn't be required for this particular issue/case, IIUIC. >>> One of the things we checked early on was disabling deferred TLB flush >>> (similarly to what you've done), and it didn't help with the issue; also, the >>> issue happens on uniprocessor mode too (thus no remote CPU involved.) >> >> I guess you didn't try it with page_mapcount + 1 == page_count at tha >> time? Anyway, I agree we don't need TLB flush here like KSM. >> I think the reason KSM is doing TLB flush before the check it to >> make sure trap trigger on the write from userprocess in other core. >> However, this MADV_FREE case, HW already gaurantees the trap. >> Please see below. >> >>> >>> >>> > >>> > Something like this. >>> > >>> > diff --git a/mm/rmap.c b/mm/rmap.c >>> > index b0fd9dc19eba..b4ad9faa17b2 100644 >>> > --- a/mm/rmap.c >>> > +++ b/mm/rmap.c >>> > @@ -1599,18 +1599,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, >>> > >>> > /* MADV_FREE page check */ >>> > if (!PageSwapBacked(page)) { >>> > - int refcount = page_ref_count(page); >>> > - >>> > - /* >>> > - * The only page refs must be from the isolation >>> > - * (checked by the caller shrink_page_list() too) >>> > - * and the (single) rmap (dropped by discard:). >>> > - * >>> > - * Check the reference count before dirty flag >>> > - * with memory barrier; see __remove_mapping(). >>> > - */ >>> > - smp_rmb(); >>> > - if (refcount == 2 && !PageDirty(page)) { >>> > + if (!PageDirty(page) && >>> > + page_mapcount(page) + 1 == page_count(page)) { >>> >>> In the interest of avoiding a different race/bug, it seemed worth following the >>> suggestion outlined in __remove_mapping(), i.e., checking PageDirty() >>> after the page's reference count, with a memory barrier in between. >> >> True so it means your patch as-is is good for me. > > If my understanding were correct, a shared anonymous page will be mapped > read-only. If so, will a private anonymous page be called > SetPageDirty() concurrently after direct IO case has been dealt with > via comparing page_count()/page_mapcount()? Sorry, I found that I am not quite right here. When direct IO read completes, it will call SetPageDirty() and put_page() finally. And MADV_FREE in try_to_unmap_one() needs to deal with that too. Checking direct IO code, it appears that set_page_dirty_lock() is used to set page dirty, which will use lock_page(). dio_bio_complete bio_check_pages_dirty bio_dirty_fn /* through workqueue */ bio_release_pages set_page_dirty_lock bio_release_pages set_page_dirty_lock So in theory, for direct IO, the memory barrier may be unnecessary. But I don't think it's a good idea to depend on this specific behavior of direct IO. The original code with memory barrier looks more generic and I don't think it will introduce visible overhead. Best Regards, Huang, Ying
On Wed, Jan 12, 2022 at 7:37 PM Minchan Kim <minchan@kernel.org> wrote: > > On Wed, Jan 12, 2022 at 06:53:07PM -0300, Mauricio Faria de Oliveira wrote: > > Hi Minchan Kim, > > > > Thanks for handling the hard questions! :) > > > > On Wed, Jan 12, 2022 at 2:33 PM Minchan Kim <minchan@kernel.org> wrote: > > > > > > On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote: > > > > Yu Zhao <yuzhao@google.com> writes: > > > > > > > > > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote: > > > > >> diff --git a/mm/rmap.c b/mm/rmap.c > > > > >> index 163ac4e6bcee..8671de473c25 100644 > > > > >> --- a/mm/rmap.c > > > > >> +++ b/mm/rmap.c > > > > >> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > >> > > > > >> /* MADV_FREE page check */ > > > > >> if (!PageSwapBacked(page)) { > > > > >> - if (!PageDirty(page)) { > > > > >> + int ref_count = page_ref_count(page); > > > > >> + int map_count = page_mapcount(page); > > > > >> + > > > > >> + /* > > > > >> + * The only page refs must be from the isolation > > > > >> + * (checked by the caller shrink_page_list() too) > > > > >> + * and one or more rmap's (dropped by discard:). > > > > >> + * > > > > >> + * Check the reference count before dirty flag > > > > >> + * with memory barrier; see __remove_mapping(). > > > > >> + */ > > > > >> + smp_rmb(); > > > > >> + if ((ref_count - 1 == map_count) && > > > > >> + !PageDirty(page)) { > > > > >> /* Invalidate as we cleared the pte */ > > > > >> mmu_notifier_invalidate_range(mm, > > > > >> address, address + PAGE_SIZE); > > > > > > > > > > Out of curiosity, how does it work with COW in terms of reordering? > > > > > Specifically, it seems to me get_page() and page_dup_rmap() in > > > > > copy_present_pte() can happen in any order, and if page_dup_rmap() > > > > > is seen first, and direct io is holding a refcnt, this check can still > > > > > pass? > > > > > > > > I think that you are correct. > > > > > > > > After more thoughts, it appears very tricky to compare page count and > > > > map count. Even if we have added smp_rmb() between page_ref_count() and > > > > page_mapcount(), an interrupt may happen between them. During the > > > > interrupt, the page count and map count may be changed, for example, > > > > unmapped, or do_swap_page(). > > > > > > Yeah, it happens but what specific problem are you concerning from the > > > count change under race? The fork case Yu pointed out was already known > > > for breaking DIO so user should take care not to fork under DIO(Please > > > look at O_DIRECT section in man 2 open). If you could give a specific > > > example, it would be great to think over the issue. > > > > > > I agree it's little tricky but it seems to be way other place has used > > > for a long time(Please look at write_protect_page in ksm.c). > > > > Ah, that's great to see it's being used elsewhere, for DIO particularly! > > > > > So, here what we missing is tlb flush before the checking. > > > > That shouldn't be required for this particular issue/case, IIUIC. > > One of the things we checked early on was disabling deferred TLB flush > > (similarly to what you've done), and it didn't help with the issue; also, the > > issue happens on uniprocessor mode too (thus no remote CPU involved.) > > I guess you didn't try it with page_mapcount + 1 == page_count at tha > time? Anyway, I agree we don't need TLB flush here like KSM. Sorry, I fail to understand how the page (map) count and TLB flush would be related. (I realize you and Yu Zhao already confirmed the TLB flush is not needed/expected to fix the issue too; but just for my own education, if you have a chance.) > I think the reason KSM is doing TLB flush before the check it to > make sure trap trigger on the write from userprocess in other core. > However, this MADV_FREE case, HW already gaurantees the trap. Understood. > Please see below. > > > > > > > > > > > Something like this. > > > > > > diff --git a/mm/rmap.c b/mm/rmap.c > > > index b0fd9dc19eba..b4ad9faa17b2 100644 > > > --- a/mm/rmap.c > > > +++ b/mm/rmap.c > > > @@ -1599,18 +1599,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > > /* MADV_FREE page check */ > > > if (!PageSwapBacked(page)) { > > > - int refcount = page_ref_count(page); > > > - > > > - /* > > > - * The only page refs must be from the isolation > > > - * (checked by the caller shrink_page_list() too) > > > - * and the (single) rmap (dropped by discard:). > > > - * > > > - * Check the reference count before dirty flag > > > - * with memory barrier; see __remove_mapping(). > > > - */ > > > - smp_rmb(); > > > - if (refcount == 2 && !PageDirty(page)) { > > > + if (!PageDirty(page) && > > > + page_mapcount(page) + 1 == page_count(page)) { > > > > In the interest of avoiding a different race/bug, it seemed worth following the > > suggestion outlined in __remove_mapping(), i.e., checking PageDirty() > > after the page's reference count, with a memory barrier in between. > > True so it means your patch as-is is good for me. That's good news! Thanks for all your help, review, and discussion so far; it's been very educational. I see Yu Zhao mentioned a possible concern/suggestion with additional memory barriers elsewhere. I'll try and dig to understand/check that in more detail and follow up. > > > > > I'm not familiar with the details of the original issue behind that code change, > > but it seemed to be possible here too, particularly as writes from user-space > > can happen asynchronously / after try_to_unmap_one() checked PTE clean > > and didn't set PageDirty, and if the page's PTE is present, there's no fault? > > Yeah, it was discussed. > > For clean pte, CPU has to fetch and update the actual pte entry, not TLB > so trap triggers for MADV_FREE page. > > https://lkml.org/lkml/2015/4/15/565 > https://lkml.org/lkml/2015/4/16/136 Thanks for the pointers; great reading. cheers,
On Thu, Jan 13, 2022 at 9:30 AM Huang, Ying <ying.huang@intel.com> wrote: > > "Huang, Ying" <ying.huang@intel.com> writes: > > > Minchan Kim <minchan@kernel.org> writes: > > > >> On Wed, Jan 12, 2022 at 06:53:07PM -0300, Mauricio Faria de Oliveira wrote: > >>> Hi Minchan Kim, > >>> > >>> Thanks for handling the hard questions! :) > >>> > >>> On Wed, Jan 12, 2022 at 2:33 PM Minchan Kim <minchan@kernel.org> wrote: > >>> > > >>> > On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote: > >>> > > Yu Zhao <yuzhao@google.com> writes: > >>> > > > >>> > > > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote: > >>> > > >> diff --git a/mm/rmap.c b/mm/rmap.c > >>> > > >> index 163ac4e6bcee..8671de473c25 100644 > >>> > > >> --- a/mm/rmap.c > >>> > > >> +++ b/mm/rmap.c > >>> > > >> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > >>> > > >> > >>> > > >> /* MADV_FREE page check */ > >>> > > >> if (!PageSwapBacked(page)) { > >>> > > >> - if (!PageDirty(page)) { > >>> > > >> + int ref_count = page_ref_count(page); > >>> > > >> + int map_count = page_mapcount(page); > >>> > > >> + > >>> > > >> + /* > >>> > > >> + * The only page refs must be from the isolation > >>> > > >> + * (checked by the caller shrink_page_list() too) > >>> > > >> + * and one or more rmap's (dropped by discard:). > >>> > > >> + * > >>> > > >> + * Check the reference count before dirty flag > >>> > > >> + * with memory barrier; see __remove_mapping(). > >>> > > >> + */ > >>> > > >> + smp_rmb(); > >>> > > >> + if ((ref_count - 1 == map_count) && > >>> > > >> + !PageDirty(page)) { > >>> > > >> /* Invalidate as we cleared the pte */ > >>> > > >> mmu_notifier_invalidate_range(mm, > >>> > > >> address, address + PAGE_SIZE); > >>> > > > > >>> > > > Out of curiosity, how does it work with COW in terms of reordering? > >>> > > > Specifically, it seems to me get_page() and page_dup_rmap() in > >>> > > > copy_present_pte() can happen in any order, and if page_dup_rmap() > >>> > > > is seen first, and direct io is holding a refcnt, this check can still > >>> > > > pass? > >>> > > > >>> > > I think that you are correct. > >>> > > > >>> > > After more thoughts, it appears very tricky to compare page count and > >>> > > map count. Even if we have added smp_rmb() between page_ref_count() and > >>> > > page_mapcount(), an interrupt may happen between them. During the > >>> > > interrupt, the page count and map count may be changed, for example, > >>> > > unmapped, or do_swap_page(). > >>> > > >>> > Yeah, it happens but what specific problem are you concerning from the > >>> > count change under race? The fork case Yu pointed out was already known > >>> > for breaking DIO so user should take care not to fork under DIO(Please > >>> > look at O_DIRECT section in man 2 open). If you could give a specific > >>> > example, it would be great to think over the issue. > >>> > > >>> > I agree it's little tricky but it seems to be way other place has used > >>> > for a long time(Please look at write_protect_page in ksm.c). > >>> > >>> Ah, that's great to see it's being used elsewhere, for DIO particularly! > >>> > >>> > So, here what we missing is tlb flush before the checking. > >>> > >>> That shouldn't be required for this particular issue/case, IIUIC. > >>> One of the things we checked early on was disabling deferred TLB flush > >>> (similarly to what you've done), and it didn't help with the issue; also, the > >>> issue happens on uniprocessor mode too (thus no remote CPU involved.) > >> > >> I guess you didn't try it with page_mapcount + 1 == page_count at tha > >> time? Anyway, I agree we don't need TLB flush here like KSM. > >> I think the reason KSM is doing TLB flush before the check it to > >> make sure trap trigger on the write from userprocess in other core. > >> However, this MADV_FREE case, HW already gaurantees the trap. > >> Please see below. > >> > >>> > >>> > >>> > > >>> > Something like this. > >>> > > >>> > diff --git a/mm/rmap.c b/mm/rmap.c > >>> > index b0fd9dc19eba..b4ad9faa17b2 100644 > >>> > --- a/mm/rmap.c > >>> > +++ b/mm/rmap.c > >>> > @@ -1599,18 +1599,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > >>> > > >>> > /* MADV_FREE page check */ > >>> > if (!PageSwapBacked(page)) { > >>> > - int refcount = page_ref_count(page); > >>> > - > >>> > - /* > >>> > - * The only page refs must be from the isolation > >>> > - * (checked by the caller shrink_page_list() too) > >>> > - * and the (single) rmap (dropped by discard:). > >>> > - * > >>> > - * Check the reference count before dirty flag > >>> > - * with memory barrier; see __remove_mapping(). > >>> > - */ > >>> > - smp_rmb(); > >>> > - if (refcount == 2 && !PageDirty(page)) { > >>> > + if (!PageDirty(page) && > >>> > + page_mapcount(page) + 1 == page_count(page)) { > >>> > >>> In the interest of avoiding a different race/bug, it seemed worth following the > >>> suggestion outlined in __remove_mapping(), i.e., checking PageDirty() > >>> after the page's reference count, with a memory barrier in between. > >> > >> True so it means your patch as-is is good for me. > > > > If my understanding were correct, a shared anonymous page will be mapped > > read-only. If so, will a private anonymous page be called > > SetPageDirty() concurrently after direct IO case has been dealt with > > via comparing page_count()/page_mapcount()? > > Sorry, I found that I am not quite right here. When direct IO read > completes, it will call SetPageDirty() and put_page() finally. And > MADV_FREE in try_to_unmap_one() needs to deal with that too. > > Checking direct IO code, it appears that set_page_dirty_lock() is used > to set page dirty, which will use lock_page(). > > dio_bio_complete > bio_check_pages_dirty > bio_dirty_fn /* through workqueue */ > bio_release_pages > set_page_dirty_lock > bio_release_pages > set_page_dirty_lock > > So in theory, for direct IO, the memory barrier may be unnecessary. But > I don't think it's a good idea to depend on this specific behavior of > direct IO. The original code with memory barrier looks more generic and > I don't think it will introduce visible overhead. > Thanks for all the considerations/thought process with potential corner cases! Regarding the overhead, agreed; and this is in memory reclaim which isn't a fast path (and even if it's under direct reclaim, things have slowed down already), so that would seem to be fine. cheers, > Best Regards, > Huang, Ying
On Thu, Jan 13, 2022 at 12:29:51AM -0700, Yu Zhao wrote: > On Wed, Jan 12, 2022 at 2:53 PM Mauricio Faria de Oliveira > <mfo@canonical.com> wrote: > > > > Hi Minchan Kim, > > > > Thanks for handling the hard questions! :) > > > > On Wed, Jan 12, 2022 at 2:33 PM Minchan Kim <minchan@kernel.org> wrote: > > > > > > On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote: > > > > Yu Zhao <yuzhao@google.com> writes: > > > > > > > > > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote: > > > > >> diff --git a/mm/rmap.c b/mm/rmap.c > > > > >> index 163ac4e6bcee..8671de473c25 100644 > > > > >> --- a/mm/rmap.c > > > > >> +++ b/mm/rmap.c > > > > >> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > >> > > > > >> /* MADV_FREE page check */ > > > > >> if (!PageSwapBacked(page)) { > > > > >> - if (!PageDirty(page)) { > > > > >> + int ref_count = page_ref_count(page); > > > > >> + int map_count = page_mapcount(page); > > > > >> + > > > > >> + /* > > > > >> + * The only page refs must be from the isolation > > > > >> + * (checked by the caller shrink_page_list() too) > > > > >> + * and one or more rmap's (dropped by discard:). > > > > >> + * > > > > >> + * Check the reference count before dirty flag > > > > >> + * with memory barrier; see __remove_mapping(). > > > > >> + */ > > > > >> + smp_rmb(); > > > > >> + if ((ref_count - 1 == map_count) && > > > > >> + !PageDirty(page)) { > > > > >> /* Invalidate as we cleared the pte */ > > > > >> mmu_notifier_invalidate_range(mm, > > > > >> address, address + PAGE_SIZE); > > > > > > > > > > Out of curiosity, how does it work with COW in terms of reordering? > > > > > Specifically, it seems to me get_page() and page_dup_rmap() in > > > > > copy_present_pte() can happen in any order, and if page_dup_rmap() > > > > > is seen first, and direct io is holding a refcnt, this check can still > > > > > pass? > > > > > > > > I think that you are correct. > > > > > > > > After more thoughts, it appears very tricky to compare page count and > > > > map count. Even if we have added smp_rmb() between page_ref_count() and > > > > page_mapcount(), an interrupt may happen between them. During the > > > > interrupt, the page count and map count may be changed, for example, > > > > unmapped, or do_swap_page(). > > > > > > Yeah, it happens but what specific problem are you concerning from the > > > count change under race? The fork case Yu pointed out was already known > > > for breaking DIO so user should take care not to fork under DIO(Please > > > look at O_DIRECT section in man 2 open). If you could give a specific > > > example, it would be great to think over the issue. > > > > > > I agree it's little tricky but it seems to be way other place has used > > > for a long time(Please look at write_protect_page in ksm.c). > > > > Ah, that's great to see it's being used elsewhere, for DIO particularly! > > > > > So, here what we missing is tlb flush before the checking. > > > > That shouldn't be required for this particular issue/case, IIUIC. > > One of the things we checked early on was disabling deferred TLB flush > > (similarly to what you've done), and it didn't help with the issue; also, the > > issue happens on uniprocessor mode too (thus no remote CPU involved.) > > Fast gup doesn't block tlb flush; it only blocks IPI used when freeing > page tables. So it's expected that forcing a tlb flush doesn't fix the > problem. > > But it still seems to me the fix is missing smp_mb(). IIUC, a proper > fix would require, on the dio side > inc page refcnt > smp_mb() > read pte > > and on the rmap side > clear pte > smp_mb() > read page refcnt > > try_grab_compound_head() implies smp_mb, but i don't think > ptep_get_and_clear() does. > > mapcount, as Minchan said, probably is irrelevant given dio is already > known to be broken with fork. > > I glanced at the thread and thought it might be worth menthing. If the madv_freed page is shared among processes, it means the ptes pointing the page are CoW state. If DIO is about to work with the page, gup_fast will fallback to slow path and then break the CoW using faultin_page before the submit bio. Thus, the page is not shared any longer and the pte was alrady marked as dirty on fault handling. Thus, I think there is no race. Only, problem is race between DIO and reclaim on exclusive private madv_free page. In the case, page_count would be only racy. If ptep_get_and_clear is unordered, yeah, we need barrier there. (Looks like unorder since ARM uses xchg_relaxed).
On Thu, Jan 13, 2022 at 9:35 PM Minchan Kim <minchan@kernel.org> wrote: > > On Thu, Jan 13, 2022 at 12:29:51AM -0700, Yu Zhao wrote: > > On Wed, Jan 12, 2022 at 2:53 PM Mauricio Faria de Oliveira > > <mfo@canonical.com> wrote: > > > > > > Hi Minchan Kim, > > > > > > Thanks for handling the hard questions! :) > > > > > > On Wed, Jan 12, 2022 at 2:33 PM Minchan Kim <minchan@kernel.org> wrote: > > > > > > > > On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote: > > > > > Yu Zhao <yuzhao@google.com> writes: > > > > > > > > > > > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote: > > > > > >> diff --git a/mm/rmap.c b/mm/rmap.c > > > > > >> index 163ac4e6bcee..8671de473c25 100644 > > > > > >> --- a/mm/rmap.c > > > > > >> +++ b/mm/rmap.c > > > > > >> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > >> > > > > > >> /* MADV_FREE page check */ > > > > > >> if (!PageSwapBacked(page)) { > > > > > >> - if (!PageDirty(page)) { > > > > > >> + int ref_count = page_ref_count(page); > > > > > >> + int map_count = page_mapcount(page); > > > > > >> + > > > > > >> + /* > > > > > >> + * The only page refs must be from the isolation > > > > > >> + * (checked by the caller shrink_page_list() too) > > > > > >> + * and one or more rmap's (dropped by discard:). > > > > > >> + * > > > > > >> + * Check the reference count before dirty flag > > > > > >> + * with memory barrier; see __remove_mapping(). > > > > > >> + */ > > > > > >> + smp_rmb(); > > > > > >> + if ((ref_count - 1 == map_count) && > > > > > >> + !PageDirty(page)) { > > > > > >> /* Invalidate as we cleared the pte */ > > > > > >> mmu_notifier_invalidate_range(mm, > > > > > >> address, address + PAGE_SIZE); > > > > > > > > > > > > Out of curiosity, how does it work with COW in terms of reordering? > > > > > > Specifically, it seems to me get_page() and page_dup_rmap() in > > > > > > copy_present_pte() can happen in any order, and if page_dup_rmap() > > > > > > is seen first, and direct io is holding a refcnt, this check can still > > > > > > pass? > > > > > > > > > > I think that you are correct. > > > > > > > > > > After more thoughts, it appears very tricky to compare page count and > > > > > map count. Even if we have added smp_rmb() between page_ref_count() and > > > > > page_mapcount(), an interrupt may happen between them. During the > > > > > interrupt, the page count and map count may be changed, for example, > > > > > unmapped, or do_swap_page(). > > > > > > > > Yeah, it happens but what specific problem are you concerning from the > > > > count change under race? The fork case Yu pointed out was already known > > > > for breaking DIO so user should take care not to fork under DIO(Please > > > > look at O_DIRECT section in man 2 open). If you could give a specific > > > > example, it would be great to think over the issue. > > > > > > > > I agree it's little tricky but it seems to be way other place has used > > > > for a long time(Please look at write_protect_page in ksm.c). > > > > > > Ah, that's great to see it's being used elsewhere, for DIO particularly! > > > > > > > So, here what we missing is tlb flush before the checking. > > > > > > That shouldn't be required for this particular issue/case, IIUIC. > > > One of the things we checked early on was disabling deferred TLB flush > > > (similarly to what you've done), and it didn't help with the issue; also, the > > > issue happens on uniprocessor mode too (thus no remote CPU involved.) > > > > Fast gup doesn't block tlb flush; it only blocks IPI used when freeing > > page tables. So it's expected that forcing a tlb flush doesn't fix the > > problem. > > > > But it still seems to me the fix is missing smp_mb(). IIUC, a proper > > fix would require, on the dio side > > inc page refcnt > > smp_mb() > > read pte > > > > and on the rmap side > > clear pte > > smp_mb() > > read page refcnt > > > > try_grab_compound_head() implies smp_mb, but i don't think > > ptep_get_and_clear() does. > > > > mapcount, as Minchan said, probably is irrelevant given dio is already > > known to be broken with fork. > > > > I glanced at the thread and thought it might be worth menthing. > > If the madv_freed page is shared among processes, it means the ptes > pointing the page are CoW state. If DIO is about to work with the > page, gup_fast will fallback to slow path and then break the CoW > using faultin_page before the submit bio. Thus, the page is not > shared any longer and the pte was alrady marked as dirty on fault > handling. Thus, I think there is no race. > > Only, problem is race between DIO and reclaim on exclusive private > madv_free page. In the case, page_count would be only racy. > If ptep_get_and_clear is unordered, yeah, we need barrier there. > (Looks like unorder since ARM uses xchg_relaxed). Sorry for the long delay in getting back to this. Thanks to both of you for bringing this up and checking further! Agree; I missed this barrier was needed for the gup() fast path. Just sent PATCH v3.
diff --git a/mm/rmap.c b/mm/rmap.c index 163ac4e6bcee..8671de473c25 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, /* MADV_FREE page check */ if (!PageSwapBacked(page)) { - if (!PageDirty(page)) { + int ref_count = page_ref_count(page); + int map_count = page_mapcount(page); + + /* + * The only page refs must be from the isolation + * (checked by the caller shrink_page_list() too) + * and one or more rmap's (dropped by discard:). + * + * Check the reference count before dirty flag + * with memory barrier; see __remove_mapping(). + */ + smp_rmb(); + if ((ref_count - 1 == map_count) && + !PageDirty(page)) { /* Invalidate as we cleared the pte */ mmu_notifier_invalidate_range(mm, address, address + PAGE_SIZE); diff --git a/mm/vmscan.c b/mm/vmscan.c index fb9584641ac7..c1ea4e14f510 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1688,7 +1688,7 @@ static unsigned int shrink_page_list(struct list_head *page_list, mapping = page_mapping(page); } } else if (unlikely(PageTransHuge(page))) { - /* Split file THP */ + /* Split file/lazyfree THP */ if (split_huge_page_to_list(page, page_list)) goto keep_locked; }
Problem: ======= Userspace might read the zero-page instead of actual data from a direct IO read on a block device if the buffers have been called madvise(MADV_FREE) on earlier (this is discussed below) due to a race between page reclaim on MADV_FREE and blkdev direct IO read. Race condition: ============== During page reclaim, the MADV_FREE page check in try_to_unmap_one() checks if the page is not dirty, then discards its PTE (vs remap it back if the page is dirty). However, after try_to_unmap_one() returns to shrink_page_list(), it might keep the page _anyway_ if page_ref_freeze() fails (it expects a single page ref from the isolation). Well, blkdev_direct_IO() gets references for all pages, and on READ operations it sets them dirty later. So, if MADV_FREE pages (i.e., not dirty) are used as buffers (more later) for direct IO read from block devices and page reclaim runs during __blkdev_direct_IO[_simple]() AFTER bio_iov_iter_get_pages() but BEFORE it sets pages dirty, that situation happens. The direct IO read eventually completes. Now, when userspace reads the buffers, the PTE is no longer there and the page fault handler do_anonymous_page() services that with the zero-page, NOT the data! A synthetic reproducer is provided. Page faults: =========== The data read from the block device probably won't generate faults due to DMA (no MMU) but even in the case it wouldn't use DMA, that happens on different virtual addresses (not user-mapped addresses) because `struct bio_vec` stores `struct page` to figure addresses out (which are different from/unrelated to user-mapped addresses) for the data read. Thus userspace reads (to user-mapped addresses) still fault, then do_anonymous_page() gets another `struct page` that would address/ map to other memory than the `struct page` used by `struct bio_vec` for the read (which runs correctly as the page wasn't freed due to page_ref_freeze(), and is reclaimed later) -- but even if the page addresses matched, that handler maps the zero-page in the PTE, not that page's memory (on read faults.) If page reclaim happens BEFORE bio_iov_iter_get_pages() the issue doesn't happen, because that faults-in all pages as writeable, so do_anonymous_page() sets up a new page/rmap/PTE, and that is used by direct IO. The userspace reads don't fault as the PTE is there (thus zero-page is not used.) Solution: ======== One solution is to check for the expected page reference count in try_to_unmap_one() too, which should be exactly two: one from the isolation (checked by shrink_page_list()), and the other from the rmap (dropped by the discard: label). If that doesn't match, then remap the PTE back, just like page dirty does. The new check in try_to_unmap_one() should be safe in races with bio_iov_iter_get_pages() in get_user_pages() fast and slow paths, as it's done under the PTE lock. The fast path doesn't take that lock but it checks the PTE has changed, then drops the reference and leaves the page for the slow path (which does take that lock). - try_to_unmap_one() - page_vma_mapped_walk() - map_pte() # see pte_offset_map_lock(): pte_offset_map() spin_lock() - page_ref_count() # new check - page_vma_mapped_walk_done() # see pte_unmap_unlock(): pte_unmap() spin_unlock() - bio_iov_iter_get_pages() - __bio_iov_iter_get_pages() - iov_iter_get_pages() - get_user_pages_fast() - internal_get_user_pages_fast() # fast path - lockless_pages_from_mm() - gup_{pgd,p4d,pud,pmd,pte}_range() ptep = pte_offset_map() # not _lock() pte = ptep_get_lockless(ptep) page = pte_page(pte) try_grab_compound_head(page) # get ref if (pte_val(pte) != pte_val(*ptep)) put_compound_head(page) # put ref # leave page for slow path # slow path - __gup_longterm_unlocked() - get_user_pages_unlocked() - __get_user_pages_locked() - __get_user_pages() - follow_{page,p4d,pud,pmd}_mask() - follow_page_pte() ptep = pte_offset_map_lock() pte = *ptep page = vm_normal_page(pte) try_grab_page(page) # get ref pte_unmap_unlock() Regarding transparent hugepages, that number shouldn't change, as MADV_FREE (aka lazyfree) pages are PageAnon() && !PageSwapBacked() (madvise_free_pte_range() -> mark_page_lazyfree() -> lru_lazyfree_fn()) thus should reach shrink_page_list() -> split_huge_page_to_list() before try_to_unmap[_one](), so it deals with normal pages only. (And in case unlikely/TTU_SPLIT_HUGE_PMD/split_huge_pmd_address() happens, which it should not or be rare, the page refcount is not two, as the head page counts tail pages, and tail pages have zero. That also prevents checking the head `page` then incorrectly call page_remove_rmap(subpage) for a tail page, that isn't even in the shrink_page_list()'s page_list (an effect of split huge pmd/pmvw), as it might happen today in this unlikely scenario.) MADV_FREE'd buffers: =================== So, back to the "if MADV_FREE pages are used as buffers" note. The case is arguable, and subject to multiple interpretations. The madvise(2) manual page on the MADV_FREE advice value says: - 'After a successful MADV_FREE ... data will be lost when the kernel frees the pages.' - 'the free operation will be canceled if the caller writes into the page' / 'subsequent writes ... will succeed and then [the] kernel cannot free those dirtied pages' - 'If there is no subsequent write, the kernel can free the pages at any time.' Thoughts, questions, considerations... - Since the kernel didn't actually free the page (page_ref_freeze() failed), should the data not have been lost? (on userspace read.) - Should writes performed by the direct IO read be able to cancel the free operation? - Should the direct IO read be considered as 'the caller' too, as it's been requested by 'the caller'? - Should the bio technique to dirty pages on return to userspace (bio_check_pages_dirty() is called/used by __blkdev_direct_IO()) be considered in another/special way here? - Should an upcoming write from a previously requested direct IO read be considered as a subsequent write, so the kernel should not free the pages? (as it's known at the time of page reclaim.) Technically, the last point would seem a reasonable consideration and balance, as the madvise(2) manual page apparently (and fairly) seem to assume that 'writes' are memory access from the userspace process (not explicitly considering writes from the kernel or its corner cases; again, fairly).. plus the kernel fix implementation for the corner case of the largely 'non-atomic write' encompassed by a direct IO read operation, is relatively simple; and it helps. Reproducer: ========== @ test.c (simplified, but works) #define _GNU_SOURCE #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <sys/mman.h> int main() { int fd, i; char *buf; fd = open(DEV, O_RDONLY | O_DIRECT); buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); for (i = 0; i < BUF_SIZE; i += PAGE_SIZE) buf[i] = 1; // init to non-zero madvise(buf, BUF_SIZE, MADV_FREE); read(fd, buf, BUF_SIZE); for (i = 0; i < BUF_SIZE; i += PAGE_SIZE) printf("%p: 0x%x\n", &buf[i], buf[i]); return 0; } @ block/fops.c (formerly fs/block_dev.c) +#include <linux/swap.h> ... ... __blkdev_direct_IO[_simple](...) { ... + if (!strcmp(current->comm, "good")) + shrink_all_memory(ULONG_MAX); + ret = bio_iov_iter_get_pages(...); + + if (!strcmp(current->comm, "bad")) + shrink_all_memory(ULONG_MAX); ... } @ shell # yes | dd of=test.img bs=1k count=16 # DEV=$(losetup -f --show test.img) # gcc -DDEV=\"$DEV\" -DBUF_SIZE=16384 -DPAGE_SIZE=4096 test.c -o test # od -tx1 $DEV 0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a * 0040000 # mv test good # ./good 0x7f1509206000: 0x79 0x7f1509207000: 0x79 0x7f1509208000: 0x79 0x7f1509209000: 0x79 # mv good bad # ./bad 0x7fd87272f000: 0x0 0x7fd872730000: 0x0 0x7fd872731000: 0x0 0x7fd872732000: 0x0 Ceph/TCMalloc: ============= For documentation purposes, the use case driving the analysis/fix is Ceph on Ubuntu 18.04, as the TCMalloc library there still uses MADV_FREE to release unused memory to the system from the mmap'ed page heap (might be committed back/used again; it's not munmap'ed.) - PageHeap::DecommitSpan() -> TCMalloc_SystemRelease() -> madvise() - PageHeap::CommitSpan() -> TCMalloc_SystemCommit() -> do nothing. Note: TCMalloc switched back to MADV_DONTNEED a few commits after the release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so the issue just 'disappeared' on Ceph on later Ubuntu releases but is still present in the kernel, and can be hit by other use cases. The observed issue seems to be the old Ceph bug #22464 [1], where checksum mismatches are observed (and instrumentation with buffer dumps shows zero-pages read from mmap'ed/MADV_FREE'd page ranges). The issue in Ceph was reasonably deemed a kernel bug (comment #50) and mostly worked around with a retry mechanism, but other parts of Ceph could still hit that (rocksdb). Anyway, it's less likely to be hit again as TCMalloc switched out of MADV_FREE by default. (Some kernel versions/reports from the Ceph bug, and relation with the MADV_FREE introduction/changes; TCMalloc versions not checked.) - 4.4 good - 4.5 (madv_free: introduction) - 4.9 bad - 4.10 good? maybe a swapless system - 4.12 (madv_free: no longer free instantly on swapless systems) - 4.13 bad [1] https://tracker.ceph.com/issues/22464 Thanks: ====== Several people contributed to analysis/discussions/tests/reproducers in the first stages when drilling down on ceph/tcmalloc/linux kernel: - Dan Hill <daniel.hill@canonical.com> - Dan Streetman <dan.streetman@canonical.com> - Dongdong Tao <dongdong.tao@canonical.com> - Gavin Guo <gavin.guo@canonical.com> - Gerald Yang <gerald.yang@canonical.com> - Heitor Alves de Siqueira <halves@canonical.com> - Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com> - Jay Vosburgh <jay.vosburgh@canonical.com> - Matthew Ruffell <matthew.ruffell@canonical.com> - Ponnuvel Palaniyappan <ponnuvel.palaniyappan@canonical.com> v2: check refcount against mapcount rather than a static 2. Thanks: Minchan Kim <minchan@kernel.org> Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> --- mm/rmap.c | 15 ++++++++++++++- mm/vmscan.c | 2 +- 2 files changed, 15 insertions(+), 2 deletions(-)