Message ID | 20220131230255.789059-1-mfo@canonical.com (mailing list archive)
---|---
State | New, archived
Series | [v3] mm: fix race between MADV_FREE reclaim and blkdev direct IO read
On Mon, 31 Jan 2022 20:02:55 -0300 Mauricio Faria de Oliveira <mfo@canonical.com> wrote: > Problem: > ======= > > Userspace might read the zero-page instead of actual data from a > direct IO read on a block device if the buffers have been called > madvise(MADV_FREE) on earlier (this is discussed below) due to a > race between page reclaim on MADV_FREE and blkdev direct IO read. > > ... > > Fixes: 802a3a92ad7a ("mm: reclaim MADV_FREE pages") Five years ago. As it doesn't seem urgent I targeted 5.18-rc1 for this, and added a cc:stable so it will trickle back into earlier trees. How does this plan sound?
On Mon, Jan 31, 2022 at 8:43 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > On Mon, 31 Jan 2022 20:02:55 -0300 Mauricio Faria de Oliveira <mfo@canonical.com> wrote: > > > Problem: > > ======= > > > > Userspace might read the zero-page instead of actual data from a > > direct IO read on a block device if the buffers have been called > > madvise(MADV_FREE) on earlier (this is discussed below) due to a > > race between page reclaim on MADV_FREE and blkdev direct IO read. > > > > ... > > > > Fixes: 802a3a92ad7a ("mm: reclaim MADV_FREE pages") > > Five years ago. As it doesn't seem urgent I targeted 5.18-rc1 for > this, and added a cc:stable so it will trickle back into earlier trees. > > How does this plan sound? That sounds good; it's not urgent, indeed. Thanks!
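For readers joining the thread here, the interleaving described in the quoted problem statement can be laid out as a timeline. This is only an illustration of the race as the patch description explains it, not code from the kernel:

```c
/*
 * Illustration of the race as described above (not kernel code):
 *
 *   CPU 0 (process issuing DIO read)        CPU 1 (memory reclaim)
 *   --------------------------------        ----------------------
 *   __blkdev_direct_IO[_simple]()
 *     bio_iov_iter_get_pages()
 *       -> grabs page references
 *                                           shrink_page_list()
 *                                             try_to_unmap_one()
 *                                               -> page is clean MADV_FREE:
 *                                                  PTE cleared, rmap dropped
 *                                                  (DIO's reference keeps the
 *                                                  struct page alive, but the
 *                                                  process no longer maps it)
 *   ... read completes ...
 *   pages are set dirty -- too late, the
 *   page is already gone from the
 *   process' page tables
 *
 *   Userspace then touches the buffer: do_anonymous_page() finds no PTE
 *   and no swap entry, so it maps the zero-page -- the data read from
 *   disk never reaches the application.
 */
```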
On Mon, Jan 31, 2022 at 08:02:55PM -0300, Mauricio Faria de Oliveira wrote: > Well, blkdev_direct_IO() gets references for all pages, and on READ > operations it only sets them dirty _later_. > > So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for > direct IO read from block devices, and page reclaim happens during > __blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages() > returns, but BEFORE the pages are set dirty, the situation happens. > > The direct IO read eventually completes. Now, when userspace reads > the buffers, the PTE is no longer there and the page fault handler > do_anonymous_page() services that with the zero-page, NOT the data! So why not just set the pages dirty early like the other direct I/O implementations? Or if this is fine with the patch should we remove the early dirtying elsewhere? > Reproducer: > ========== > > @ test.c (simplified, but works) Can you add this to blktests or some other regularly run regression test suite? > + smp_rmb(); > + > + /* > + * The only page refs must be from the isolation > + * plus one or more rmap's (dropped by discard:). Overly long line. > + */ > + if ((ref_count == 1 + map_count) && No need for the inner braces.
On Wed, Feb 2, 2022 at 11:03 AM Christoph Hellwig <hch@infradead.org> wrote: > > On Mon, Jan 31, 2022 at 08:02:55PM -0300, Mauricio Faria de Oliveira wrote: > > Well, blkdev_direct_IO() gets references for all pages, and on READ > > operations it only sets them dirty _later_. > > > > So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for > > direct IO read from block devices, and page reclaim happens during > > __blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages() > > returns, but BEFORE the pages are set dirty, the situation happens. > > > > The direct IO read eventually completes. Now, when userspace reads > > the buffers, the PTE is no longer there and the page fault handler > > do_anonymous_page() services that with the zero-page, NOT the data! > > So why not just set the pages dirty early like the other direct I/O > implementations? Or if this is fine with the patch should we remove > the early dirtying elsewhere? In general, since this particular problem is specific to MADV_FREE, it seemed about right to go for a more contained/particular solution (than changes with broader impact/risk to things that aren't broken). This isn't to say either approach shouldn't be pursued, but just that the larger changes aren't strictly needed to actually fix _this_ issue (and might complicate landing the fix into the stable/distro kernels.) Now, specifically on the 2 suggestions you mentioned, I'm not very familiar with other implementations, thus I can't speak to that, sorry. However, on the 1st suggestion (set pages dirty early), John noted [1] there might be issues with that and advised not going there. > > > Reproducer: > > ========== > > > > @ test.c (simplified, but works) > > Can you add this to blktests or some other regularly run regression > test suite? Sure. The test also needs the kernel-side change (to trigger memory reclaim), which can probably be wired for blktests with a fault-injection capability. Does that sound good? Maybe there's a better way to do it. > > > + smp_rmb(); > > + > > + /* > > + * The only page refs must be from the isolation > > + * plus one or more rmap's (dropped by discard:). > > Overly long line. Hmm, checkpatch.pl didn't complain about it. Ah, it checks for 100 chars. Ok; v4. > > > + */ > > + if ((ref_count == 1 + map_count) && > > No need for the inner braces. > Ok; v4. I'll wait a bit in case more changes are needed, and send v4 w/ the above. Thanks! [1] https://lore.kernel.org/linux-mm/7094dbd6-de0c-9909-e657-e358e14dc6c3@nvidia.com/
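The reproducer itself is elided above ("@ test.c (simplified, but works)"). A minimal sketch of the userspace pattern it exercises might look like the code below. This is not the original test.c: the device path is a placeholder, error handling is trimmed, and, as Mauricio notes, the race window is tiny, so the real test pairs this with a kernel-side change to force reclaim between GUP and the completion that dirties the page.

```c
/*
 * Minimal sketch of the racy userspace pattern discussed above -- NOT
 * the original (elided) test.c. "/dev/sdX" is a placeholder device.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const size_t len = 4096;			/* one page */
	char *buf;
	int fd;

	fd = open("/dev/sdX", O_RDONLY | O_DIRECT);	/* placeholder device */
	if (fd < 0) {
		perror("open");
		return 1;
	}

	buf = aligned_alloc(len, len);			/* O_DIRECT alignment */
	if (!buf)
		return 1;

	memset(buf, 1, len);				/* fault in, make dirty */
	madvise(buf, len, MADV_FREE);			/* now lazily freeable */

	/*
	 * Racy window: if reclaim discards the clean MADV_FREE page after
	 * the DIO path grabs its reference but before the page is set
	 * dirty, the PTE is gone by the time the buffer is read below.
	 */
	if (pread(fd, buf, len, 0) != (ssize_t)len) {
		perror("pread");
		return 1;
	}

	/* With the race hit, this prints 0 (zero-page), not device data. */
	printf("first byte: %#x\n", (unsigned char)buf[0]);

	close(fd);
	return 0;
}
```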
On Mon, Jan 31, 2022 at 08:02:55PM -0300, Mauricio Faria de Oliveira wrote:
> Problem:
> =======

Thanks for the update. A couple of quick questions:

> Userspace might read the zero-page instead of actual data from a
> direct IO read on a block device if the buffers have been called
> madvise(MADV_FREE) on earlier (this is discussed below) due to a
> race between page reclaim on MADV_FREE and blkdev direct IO read.

1) would page migration be affected as well?

> @@ -1599,7 +1599,30 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> 
>  			/* MADV_FREE page check */
>  			if (!PageSwapBacked(page)) {
> -				if (!PageDirty(page)) {
> +				int ref_count, map_count;
> +
> +				/*
> +				 * Synchronize with gup_pte_range():
> +				 * - clear PTE; barrier; read refcount
> +				 * - inc refcount; barrier; read PTE
> +				 */
> +				smp_mb();
> +
> +				ref_count = page_count(page);
> +				map_count = page_mapcount(page);
> +
> +				/*
> +				 * Order reads for page refcount and dirty flag;
> +				 * see __remove_mapping().
> +				 */
> +				smp_rmb();

2) why does it need to order against __remove_mapping()? It seems to
me that here (called from the reclaim path) it can't race with
__remove_mapping() because both lock the page.

> +				/*
> +				 * The only page refs must be from the isolation
> +				 * plus one or more rmap's (dropped by discard:).
> +				 */
> +				if ((ref_count == 1 + map_count) &&
> +				    !PageDirty(page)) {
>  					/* Invalidate as we cleared the pte */
>  					mmu_notifier_invalidate_range(mm,
>  							address, address + PAGE_SIZE);
On Wed, Feb 2, 2022 at 4:56 PM Yu Zhao <yuzhao@google.com> wrote: > > On Mon, Jan 31, 2022 at 08:02:55PM -0300, Mauricio Faria de Oliveira wrote: > > Problem: > > ======= > > Thanks for the update. A couple of quick questions: > > > Userspace might read the zero-page instead of actual data from a > > direct IO read on a block device if the buffers have been called > > madvise(MADV_FREE) on earlier (this is discussed below) due to a > > race between page reclaim on MADV_FREE and blkdev direct IO read. > > 1) would page migration be affected as well? Could you please elaborate on the potential problem you considered? I checked migrate_pages() -> try_to_migrate() holds the page lock, thus shouldn't race with shrink_page_list() -> with try_to_unmap() (where the issue with MADV_FREE is), but maybe I didn't get you correctly. > > > @@ -1599,7 +1599,30 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > /* MADV_FREE page check */ > > if (!PageSwapBacked(page)) { > > - if (!PageDirty(page)) { > > + int ref_count, map_count; > > + > > + /* > > + * Synchronize with gup_pte_range(): > > + * - clear PTE; barrier; read refcount > > + * - inc refcount; barrier; read PTE > > + */ > > + smp_mb(); > > + > > + ref_count = page_count(page); > > + map_count = page_mapcount(page); > > + > > + /* > > + * Order reads for page refcount and dirty flag; > > + * see __remove_mapping(). > > + */ > > + smp_rmb(); > > 2) why does it need to order against __remove_mapping()? It seems to > me that here (called from the reclaim path) it can't race with > __remove_mapping() because both lock the page. I'll improve that comment in v4. The ordering isn't against __remove_mapping(), but actually because of an issue described in __remove_mapping()'s comments (something else that doesn't hold the page lock, just has a page reference, that may clear the page dirty flag then drop the reference; thus check ref, then dirty). Hope this clarifies the question. Thanks! > > > + /* > > + * The only page refs must be from the isolation > > + * plus one or more rmap's (dropped by discard:). > > + */ > > + if ((ref_count == 1 + map_count) && > > + !PageDirty(page)) { > > /* Invalidate as we cleared the pte */ > > mmu_notifier_invalidate_range(mm, > > address, address + PAGE_SIZE);
On Wed, Feb 02, 2022 at 06:27:47PM -0300, Mauricio Faria de Oliveira wrote: > On Wed, Feb 2, 2022 at 4:56 PM Yu Zhao <yuzhao@google.com> wrote: > > > > On Mon, Jan 31, 2022 at 08:02:55PM -0300, Mauricio Faria de Oliveira wrote: > > > Problem: > > > ======= > > > > Thanks for the update. A couple of quick questions: > > > > > Userspace might read the zero-page instead of actual data from a > > > direct IO read on a block device if the buffers have been called > > > madvise(MADV_FREE) on earlier (this is discussed below) due to a > > > race between page reclaim on MADV_FREE and blkdev direct IO read. > > > > 1) would page migration be affected as well? > > Could you please elaborate on the potential problem you considered? > > I checked migrate_pages() -> try_to_migrate() holds the page lock, > thus shouldn't race with shrink_page_list() -> with try_to_unmap() > (where the issue with MADV_FREE is), but maybe I didn't get you > correctly. Could the race exist between DIO and migration? While DIO is writing to a page, could migration unmap it and copy the data from this page to a new page? > > > @@ -1599,7 +1599,30 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > > /* MADV_FREE page check */ > > > if (!PageSwapBacked(page)) { > > > - if (!PageDirty(page)) { > > > + int ref_count, map_count; > > > + > > > + /* > > > + * Synchronize with gup_pte_range(): > > > + * - clear PTE; barrier; read refcount > > > + * - inc refcount; barrier; read PTE > > > + */ > > > + smp_mb(); > > > + > > > + ref_count = page_count(page); > > > + map_count = page_mapcount(page); > > > + > > > + /* > > > + * Order reads for page refcount and dirty flag; > > > + * see __remove_mapping(). > > > + */ > > > + smp_rmb(); > > > > 2) why does it need to order against __remove_mapping()? It seems to > > me that here (called from the reclaim path) it can't race with > > __remove_mapping() because both lock the page. > > I'll improve that comment in v4. The ordering isn't against __remove_mapping(), > but actually because of an issue described in __remove_mapping()'s comments > (something else that doesn't hold the page lock, just has a page reference, that > may clear the page dirty flag then drop the reference; thus check ref, > then dirty). Got it. IIRC, get_user_pages() doesn't imply a write barrier. If so, there should be a smp_wmb() on the other side: * get_user_pages(&page); smp_wmb() * SetPageDirty(page); * put_page(page); (__remove_mapping() doesn't need smp_[rw]mb() on either side because it relies on page refcnt freeze and retesting.) Thanks.
On Wed, Feb 2, 2022 at 6:53 PM Yu Zhao <yuzhao@google.com> wrote: > > On Wed, Feb 02, 2022 at 06:27:47PM -0300, Mauricio Faria de Oliveira wrote: > > On Wed, Feb 2, 2022 at 4:56 PM Yu Zhao <yuzhao@google.com> wrote: > > > > > > On Mon, Jan 31, 2022 at 08:02:55PM -0300, Mauricio Faria de Oliveira wrote: > > > > Problem: > > > > ======= > > > > > > Thanks for the update. A couple of quick questions: > > > > > > > Userspace might read the zero-page instead of actual data from a > > > > direct IO read on a block device if the buffers have been called > > > > madvise(MADV_FREE) on earlier (this is discussed below) due to a > > > > race between page reclaim on MADV_FREE and blkdev direct IO read. > > > > > > 1) would page migration be affected as well? > > > > Could you please elaborate on the potential problem you considered? > > > > I checked migrate_pages() -> try_to_migrate() holds the page lock, > > thus shouldn't race with shrink_page_list() -> with try_to_unmap() > > (where the issue with MADV_FREE is), but maybe I didn't get you > > correctly. > > Could the race exist between DIO and migration? While DIO is writing > to a page, could migration unmap it and copy the data from this page > to a new page? > Thanks for clarifying. I started looking into this, but since it's unrelated to MADV_FREE (which doesn't apply to page migration), I guess this shouldn't block this patch, if at all possible. Is that OK with you? > > > > @@ -1599,7 +1599,30 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > > > > /* MADV_FREE page check */ > > > > if (!PageSwapBacked(page)) { > > > > - if (!PageDirty(page)) { > > > > + int ref_count, map_count; > > > > + > > > > + /* > > > > + * Synchronize with gup_pte_range(): > > > > + * - clear PTE; barrier; read refcount > > > > + * - inc refcount; barrier; read PTE > > > > + */ > > > > + smp_mb(); > > > > + > > > > + ref_count = page_count(page); > > > > + map_count = page_mapcount(page); > > > > + > > > > + /* > > > > + * Order reads for page refcount and dirty flag; > > > > + * see __remove_mapping(). > > > > + */ > > > > + smp_rmb(); > > > > > > 2) why does it need to order against __remove_mapping()? It seems to > > > me that here (called from the reclaim path) it can't race with > > > __remove_mapping() because both lock the page. > > > > I'll improve that comment in v4. The ordering isn't against __remove_mapping(), > > but actually because of an issue described in __remove_mapping()'s comments > > (something else that doesn't hold the page lock, just has a page reference, that > > may clear the page dirty flag then drop the reference; thus check ref, > > then dirty). > > Got it. IIRC, get_user_pages() doesn't imply a write barrier. If so, > there should be a smp_wmb() on the other side: If I understand it correctly, it actually implies a full memory barrier, doesn't it? Because... gup_pte_range() (fast path) calls try_grab_compound_head(), which eventually calls* atomic_add_unless(), an atomic conditional RMW operation with return value, thus fully ordered on success (atomic_t.rst); (on failure gup_pte_range() falls back to the slow path, below.) And follow_page_pte() (slow path) calls try_grab_page(), which also calls into try_grab_compound_head(), as the above. (* on CONFIG_TINY_RCU, it calls just atomic_add(), which isn't ordered, but that option is targeted for UP/!SMP, thus not a problem for this race.) 
Looking at the implementation of arch_atomic_fetch_add_unless() on more relaxed/weakly ordered archs (arm, powerpc, if I got that right), there are barriers like 'smp_mb()' and 'sync' instruction if 'old != unless', so that seems to be OK. And the set_page_dirty() calls occur after get_user_pages() / that point. Does that make sense? Thanks! > > * get_user_pages(&page); > > smp_wmb() > > * SetPageDirty(page); > * put_page(page); > > (__remove_mapping() doesn't need smp_[rw]mb() on either side because > it relies on page refcnt freeze and retesting.) > > Thanks.
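To make the argument above concrete, here is a small kernel-style sketch of the GUP side of the pairing. The helper name gup_like_pin() is hypothetical (it does not exist in mm/gup.c), and this is an illustration of the ordering being discussed, not a copy of the real gup_pte_range()/try_grab_compound_head() code:

```c
/*
 * Kernel-style sketch; gup_like_pin() is a hypothetical helper, not a
 * function that exists in mm/gup.c.
 */
#include <linux/mm.h>
#include <linux/page_ref.h>
#include <linux/pgtable.h>

static bool gup_like_pin(struct page *page, pte_t *ptep, pte_t orig_pte)
{
	/*
	 * Conditional RMW that returns a value: fully ordered on success
	 * (Documentation/atomic_t.txt), so it acts as the GUP-side smp_mb().
	 * It fails if the refcount already dropped to zero.
	 */
	if (!page_ref_add_unless(page, 1, 0))
		return false;

	/* This re-read cannot be reordered before the increment above. */
	if (!pte_same(ptep_get(ptep), orig_pte)) {
		put_page(page);		/* PTE changed under us: back off */
		return false;
	}

	return true;	/* reclaim's recheck will now see the extra reference */
}
```

On the reclaim side, the pairing barrier is the smp_mb() added after the PTE clear in the patch, and the subsequent smp_rmb() keeps the refcount read ordered before the PageDirty() test, for the scenario described in __remove_mapping()'s comments.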
On Thu, Feb 3, 2022 at 3:17 PM Mauricio Faria de Oliveira <mfo@canonical.com> wrote: > > On Wed, Feb 2, 2022 at 6:53 PM Yu Zhao <yuzhao@google.com> wrote: > > > > On Wed, Feb 02, 2022 at 06:27:47PM -0300, Mauricio Faria de Oliveira wrote: > > > On Wed, Feb 2, 2022 at 4:56 PM Yu Zhao <yuzhao@google.com> wrote: > > > > > > > > On Mon, Jan 31, 2022 at 08:02:55PM -0300, Mauricio Faria de Oliveira wrote: > > > > > Problem: > > > > > ======= > > > > > > > > Thanks for the update. A couple of quick questions: > > > > > > > > > Userspace might read the zero-page instead of actual data from a > > > > > direct IO read on a block device if the buffers have been called > > > > > madvise(MADV_FREE) on earlier (this is discussed below) due to a > > > > > race between page reclaim on MADV_FREE and blkdev direct IO read. > > > > > > > > 1) would page migration be affected as well? > > > > > > Could you please elaborate on the potential problem you considered? > > > > > > I checked migrate_pages() -> try_to_migrate() holds the page lock, > > > thus shouldn't race with shrink_page_list() -> with try_to_unmap() > > > (where the issue with MADV_FREE is), but maybe I didn't get you > > > correctly. > > > > Could the race exist between DIO and migration? While DIO is writing > > to a page, could migration unmap it and copy the data from this page > > to a new page? > > > > Thanks for clarifying. I started looking into this, but since it's unrelated > to MADV_FREE (which doesn't apply to page migration), I guess this > shouldn't block this patch, if at all possible. Is that OK with you? > > > > > > > @@ -1599,7 +1599,30 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > > > > > > > > > > /* MADV_FREE page check */ > > > > > if (!PageSwapBacked(page)) { > > > > > - if (!PageDirty(page)) { > > > > > + int ref_count, map_count; > > > > > + > > > > > + /* > > > > > + * Synchronize with gup_pte_range(): > > > > > + * - clear PTE; barrier; read refcount > > > > > + * - inc refcount; barrier; read PTE > > > > > + */ > > > > > + smp_mb(); > > > > > + > > > > > + ref_count = page_count(page); > > > > > + map_count = page_mapcount(page); > > > > > + > > > > > + /* > > > > > + * Order reads for page refcount and dirty flag; > > > > > + * see __remove_mapping(). > > > > > + */ > > > > > + smp_rmb(); > > > > > > > > 2) why does it need to order against __remove_mapping()? It seems to > > > > me that here (called from the reclaim path) it can't race with > > > > __remove_mapping() because both lock the page. > > > > > > I'll improve that comment in v4. The ordering isn't against __remove_mapping(), > > > but actually because of an issue described in __remove_mapping()'s comments > > > (something else that doesn't hold the page lock, just has a page reference, that > > > may clear the page dirty flag then drop the reference; thus check ref, > > > then dirty). > > > > Got it. IIRC, get_user_pages() doesn't imply a write barrier. If so, > > there should be a smp_wmb() on the other side: > > If I understand it correctly, it actually implies a full memory > barrier, doesn't it? > > Because... gup_pte_range() (fast path) calls try_grab_compound_head(), > which eventually calls* atomic_add_unless(), an atomic conditional RMW > operation with return value, thus fully ordered on success (atomic_t.rst); > (on failure gup_pte_range() falls back to the slow path, below.) > > And follow_page_pte() (slow path) calls try_grab_page(), which also calls > into try_grab_compound_head(), as the above. 
> > (* on CONFIG_TINY_RCU, it calls just atomic_add(), which isn't ordered, > but that option is targeted for UP/!SMP, thus not a problem for this race.) > > Looking at the implementation of arch_atomic_fetch_add_unless() on > more relaxed/weakly ordered archs (arm, powerpc, if I got that right), > there are barriers like 'smp_mb()' and 'sync' instruction if 'old != unless', > so that seems to be OK. > > And the set_page_dirty() calls occur after get_user_pages() / that point. > > Does that make sense? Yes, it does, thanks. I was thinking along the lines of whether there is an actual contract. The reason get_user_pages() currently works as a full barrier is not intentional but a side effect of this recent cleanup patch: commit 54d516b1d6 ("mm/gup: small refactoring: simplify try_grab_page()") But I agree your fix works as is.
On 2/3/22 21:56, Yu Zhao wrote: ... >>> Got it. IIRC, get_user_pages() doesn't imply a write barrier. If so, >>> there should be a smp_wmb() on the other side: >> >> If I understand it correctly, it actually implies a full memory >> barrier, doesn't it? >> >> Because... gup_pte_range() (fast path) calls try_grab_compound_head(), >> which eventually calls* atomic_add_unless(), an atomic conditional RMW >> operation with return value, thus fully ordered on success (atomic_t.rst); >> (on failure gup_pte_range() falls back to the slow path, below.) >> >> And follow_page_pte() (slow path) calls try_grab_page(), which also calls >> into try_grab_compound_head(), as the above. Well, doing so was a mistake, actually. I've recently reverted it, via: commit c36c04c2e132 ("Revert "mm/gup: small refactoring: simplify try_grab_page()""). Details are in the commit log. Apologies for the confusion that this may have created. thanks,
On Fri, Feb 4, 2022 at 2:57 AM Yu Zhao <yuzhao@google.com> wrote: > > On Thu, Feb 3, 2022 at 3:17 PM Mauricio Faria de Oliveira > <mfo@canonical.com> wrote: > > > > On Wed, Feb 2, 2022 at 6:53 PM Yu Zhao <yuzhao@google.com> wrote: [...] > > > Got it. IIRC, get_user_pages() doesn't imply a write barrier. If so, > > > there should be a smp_wmb() on the other side: > > > > If I understand it correctly, it actually implies a full memory > > barrier, doesn't it? > > > > Because... gup_pte_range() (fast path) calls try_grab_compound_head(), > > which eventually calls* atomic_add_unless(), an atomic conditional RMW > > operation with return value, thus fully ordered on success (atomic_t.rst); > > (on failure gup_pte_range() falls back to the slow path, below.) > > > > And follow_page_pte() (slow path) calls try_grab_page(), which also calls > > into try_grab_compound_head(), as the above. > > > > (* on CONFIG_TINY_RCU, it calls just atomic_add(), which isn't ordered, > > but that option is targeted for UP/!SMP, thus not a problem for this race.) > > > > Looking at the implementation of arch_atomic_fetch_add_unless() on > > more relaxed/weakly ordered archs (arm, powerpc, if I got that right), > > there are barriers like 'smp_mb()' and 'sync' instruction if 'old != unless', > > so that seems to be OK. > > > > And the set_page_dirty() calls occur after get_user_pages() / that point. > > > > Does that make sense? > > Yes, it does, thanks. I was thinking along the lines of whether there > is an actual contract. [...] Ok, got you. > [...] The reason get_user_pages() currently works as > a full barrier is not intentional but a side effect of this recent > cleanup patch: > commit 54d516b1d6 ("mm/gup: small refactoring: simplify try_grab_page()") > But I agree your fix works as is. Thanks for bringing it up! That commit and its revert [1] (that John mentioned in his reply) change only try_grab_page() / not try_grab_compound_head(), thus should affect only the slow path / not the fast path. So, with either change or revert, the slow path should still be okay, as it takes the page table lock, and try_to_unmap_one() too, thus they shouldn't race. And the spinlock barriers get values through. Thanks, [1] commit c36c04c2e132 ("Revert "mm/gup: small refactoring: simplify try_grab_page()"")
On Fri, Feb 4, 2022 at 4:03 AM John Hubbard <jhubbard@nvidia.com> wrote: > > On 2/3/22 21:56, Yu Zhao wrote: > ... > >>> Got it. IIRC, get_user_pages() doesn't imply a write barrier. If so, > >>> there should be a smp_wmb() on the other side: > >> > >> If I understand it correctly, it actually implies a full memory > >> barrier, doesn't it? > >> > >> Because... gup_pte_range() (fast path) calls try_grab_compound_head(), > >> which eventually calls* atomic_add_unless(), an atomic conditional RMW > >> operation with return value, thus fully ordered on success (atomic_t.rst); > >> (on failure gup_pte_range() falls back to the slow path, below.) > >> > >> And follow_page_pte() (slow path) calls try_grab_page(), which also calls > >> into try_grab_compound_head(), as the above. > > Well, doing so was a mistake, actually. I've recently reverted it, via: > commit c36c04c2e132 ("Revert "mm/gup: small refactoring: simplify > try_grab_page()""). Details are in the commit log. > > Apologies for the confusion that this may have created. No worries; thanks for the pointers / commit log. > > thanks, > -- > John Hubbard > NVIDIA > [...]
Yu Zhao <yuzhao@google.com> writes:

> On Wed, Feb 02, 2022 at 06:27:47PM -0300, Mauricio Faria de Oliveira wrote:
>> On Wed, Feb 2, 2022 at 4:56 PM Yu Zhao <yuzhao@google.com> wrote:
>> >
>> > On Mon, Jan 31, 2022 at 08:02:55PM -0300, Mauricio Faria de Oliveira wrote:
>> > > Problem:
>> > > =======
>> >
>> > Thanks for the update. A couple of quick questions:
>> >
>> > > Userspace might read the zero-page instead of actual data from a
>> > > direct IO read on a block device if the buffers have been called
>> > > madvise(MADV_FREE) on earlier (this is discussed below) due to a
>> > > race between page reclaim on MADV_FREE and blkdev direct IO read.
>> >
>> > 1) would page migration be affected as well?
>>
>> Could you please elaborate on the potential problem you considered?
>>
>> I checked migrate_pages() -> try_to_migrate() holds the page lock,
>> thus shouldn't race with shrink_page_list() -> with try_to_unmap()
>> (where the issue with MADV_FREE is), but maybe I didn't get you
>> correctly.
>
> Could the race exist between DIO and migration? While DIO is writing
> to a page, could migration unmap it and copy the data from this page
> to a new page?

Check the migrate_pages() code,

  migrate_pages
    unmap_and_move
      __unmap_and_move
        try_to_migrate			// set PTE to swap entry with PTL
        move_to_new_page
          migrate_page
            folio_migrate_mapping
              folio_ref_count(folio) != expected_count	// check page ref count
          folio_migrate_copy

The page ref count is checked after unmapping and before copying.  This
is good, but it appears that we need a memory barrier between checking
page ref count and copying page.

Best Regards,
Huang, Ying
On Wed, Feb 16, 2022 at 02:48:19PM +0800, Huang, Ying wrote: > Yu Zhao <yuzhao@google.com> writes: > > > On Wed, Feb 02, 2022 at 06:27:47PM -0300, Mauricio Faria de Oliveira wrote: > >> On Wed, Feb 2, 2022 at 4:56 PM Yu Zhao <yuzhao@google.com> wrote: > >> > > >> > On Mon, Jan 31, 2022 at 08:02:55PM -0300, Mauricio Faria de Oliveira wrote: > >> > > Problem: > >> > > ======= > >> > > >> > Thanks for the update. A couple of quick questions: > >> > > >> > > Userspace might read the zero-page instead of actual data from a > >> > > direct IO read on a block device if the buffers have been called > >> > > madvise(MADV_FREE) on earlier (this is discussed below) due to a > >> > > race between page reclaim on MADV_FREE and blkdev direct IO read. > >> > > >> > 1) would page migration be affected as well? > >> > >> Could you please elaborate on the potential problem you considered? > >> > >> I checked migrate_pages() -> try_to_migrate() holds the page lock, > >> thus shouldn't race with shrink_page_list() -> with try_to_unmap() > >> (where the issue with MADV_FREE is), but maybe I didn't get you > >> correctly. > > > > Could the race exist between DIO and migration? While DIO is writing > > to a page, could migration unmap it and copy the data from this page > > to a new page? > > Check the migrate_pages() code, > > migrate_pages > unmap_and_move > __unmap_and_move > try_to_migrate // set PTE to swap entry with PTL > move_to_new_page > migrate_page > folio_migrate_mapping > folio_ref_count(folio) != expected_count // check page ref count > folio_migrate_copy > > The page ref count is checked after unmapping and before copying. This > is good, but it appears that we need a memory barrier between checking > page ref count and copying page. I didn't look into this but, off the top of head, this should be similar if not identical to the DIO case. Therefore, it requires two barriers -- before and after the refcnt check (which may or may not exist).
On Wed, Feb 16, 2022 at 02:58:36PM -0700, Yu Zhao wrote:
> On Wed, Feb 16, 2022 at 02:48:19PM +0800, Huang, Ying wrote:
> > Yu Zhao <yuzhao@google.com> writes:
> >
> > > On Wed, Feb 02, 2022 at 06:27:47PM -0300, Mauricio Faria de Oliveira wrote:
> > >> On Wed, Feb 2, 2022 at 4:56 PM Yu Zhao <yuzhao@google.com> wrote:
> > >> >
> > >> > On Mon, Jan 31, 2022 at 08:02:55PM -0300, Mauricio Faria de Oliveira wrote:
> > >> > > Problem:
> > >> > > =======
> > >> >
> > >> > Thanks for the update. A couple of quick questions:
> > >> >
> > >> > > Userspace might read the zero-page instead of actual data from a
> > >> > > direct IO read on a block device if the buffers have been called
> > >> > > madvise(MADV_FREE) on earlier (this is discussed below) due to a
> > >> > > race between page reclaim on MADV_FREE and blkdev direct IO read.
> > >> >
> > >> > 1) would page migration be affected as well?
> > >>
> > >> Could you please elaborate on the potential problem you considered?
> > >>
> > >> I checked migrate_pages() -> try_to_migrate() holds the page lock,
> > >> thus shouldn't race with shrink_page_list() -> with try_to_unmap()
> > >> (where the issue with MADV_FREE is), but maybe I didn't get you
> > >> correctly.
> > >
> > > Could the race exist between DIO and migration? While DIO is writing
> > > to a page, could migration unmap it and copy the data from this page
> > > to a new page?
> >
> > Check the migrate_pages() code,
> >
> >   migrate_pages
> >     unmap_and_move
> >       __unmap_and_move
> >         try_to_migrate			// set PTE to swap entry with PTL
> >         move_to_new_page
> >           migrate_page
> >             folio_migrate_mapping
> >               folio_ref_count(folio) != expected_count	// check page ref count
> >           folio_migrate_copy
> >
> > The page ref count is checked after unmapping and before copying.  This
> > is good, but it appears that we need a memory barrier between checking
> > page ref count and copying page.
>
> I didn't look into this but, off the top of head, this should be
> similar if not identical to the DIO case. Therefore, it requires two
                                  ^^^ reclaim
> barriers -- before and after the refcnt check (which may or may not
> exist).
Yu Zhao <yuzhao@google.com> writes: > On Wed, Feb 16, 2022 at 02:48:19PM +0800, Huang, Ying wrote: >> Yu Zhao <yuzhao@google.com> writes: >> >> > On Wed, Feb 02, 2022 at 06:27:47PM -0300, Mauricio Faria de Oliveira wrote: >> >> On Wed, Feb 2, 2022 at 4:56 PM Yu Zhao <yuzhao@google.com> wrote: >> >> > >> >> > On Mon, Jan 31, 2022 at 08:02:55PM -0300, Mauricio Faria de Oliveira wrote: >> >> > > Problem: >> >> > > ======= >> >> > >> >> > Thanks for the update. A couple of quick questions: >> >> > >> >> > > Userspace might read the zero-page instead of actual data from a >> >> > > direct IO read on a block device if the buffers have been called >> >> > > madvise(MADV_FREE) on earlier (this is discussed below) due to a >> >> > > race between page reclaim on MADV_FREE and blkdev direct IO read. >> >> > >> >> > 1) would page migration be affected as well? >> >> >> >> Could you please elaborate on the potential problem you considered? >> >> >> >> I checked migrate_pages() -> try_to_migrate() holds the page lock, >> >> thus shouldn't race with shrink_page_list() -> with try_to_unmap() >> >> (where the issue with MADV_FREE is), but maybe I didn't get you >> >> correctly. >> > >> > Could the race exist between DIO and migration? While DIO is writing >> > to a page, could migration unmap it and copy the data from this page >> > to a new page? >> >> Check the migrate_pages() code, >> >> migrate_pages >> unmap_and_move >> __unmap_and_move >> try_to_migrate // set PTE to swap entry with PTL >> move_to_new_page >> migrate_page >> folio_migrate_mapping >> folio_ref_count(folio) != expected_count // check page ref count >> folio_migrate_copy >> >> The page ref count is checked after unmapping and before copying. This >> is good, but it appears that we need a memory barrier between checking >> page ref count and copying page. > > I didn't look into this but, off the top of head, this should be > similar if not identical to the DIO case. Therefore, it requires two > barriers -- before and after the refcnt check (which may or may not > exist). Yes. I think so too. Best Regards, Huang, Ying
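To make the last point concrete, a rough sketch of where those two barriers would sit in the migration path follows. This is only an illustration of the concern raised above, not a proposed or tested patch; the calls mirror folio_migrate_mapping()/folio_migrate_copy(), and the barrier placement is simply the "before and after the refcnt check" ordering mentioned in the previous messages, by analogy with the try_to_unmap_one() change below.

```c
/*
 * Illustration only -- not a patch. The ordering the thread asks about:
 *
 *	try_to_migrate()		// PTEs -> migration entries, with PTL
 *
 *	smp_mb();			// barrier before the check, pairing with
 *					// the fully ordered refcount increment
 *					// on the GUP/DIO side
 *
 *	if (folio_ref_count(folio) != expected_count)
 *		return -EAGAIN;		// e.g. DIO still holds a reference
 *
 *	smp_rmb();			// barrier after the check, before the
 *					// copy reads the old page's contents
 *
 *	folio_migrate_copy(dst, folio);	// safe only if no extra references
 */
```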
```diff
diff --git a/mm/rmap.c b/mm/rmap.c
index 6a1e8c7f6213..b7ae45724378 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1599,7 +1599,30 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 
 			/* MADV_FREE page check */
 			if (!PageSwapBacked(page)) {
-				if (!PageDirty(page)) {
+				int ref_count, map_count;
+
+				/*
+				 * Synchronize with gup_pte_range():
+				 * - clear PTE; barrier; read refcount
+				 * - inc refcount; barrier; read PTE
+				 */
+				smp_mb();
+
+				ref_count = page_count(page);
+				map_count = page_mapcount(page);
+
+				/*
+				 * Order reads for page refcount and dirty flag;
+				 * see __remove_mapping().
+				 */
+				smp_rmb();
+
+				/*
+				 * The only page refs must be from the isolation
+				 * plus one or more rmap's (dropped by discard:).
+				 */
+				if ((ref_count == 1 + map_count) &&
+				    !PageDirty(page)) {
 					/* Invalidate as we cleared the pte */
 					mmu_notifier_invalidate_range(mm,
 							address, address + PAGE_SIZE);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 090bfb605ecf..0dbfa3a69567 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1729,7 +1729,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 				mapping = page_mapping(page);
 			}
 		} else if (unlikely(PageTransHuge(page))) {
-			/* Split file THP */
+			/* Split file/lazyfree THP */
 			if (split_huge_page_to_list(page, page_list))
 				goto keep_locked;
 		}
```