diff mbox series

[1/2] mm/migrate_device.c: Copy pte dirty bit to page

Message ID a9daea363991c023d0364be22a762405b6c6f5c4.1660281458.git-series.apopple@nvidia.com (mailing list archive)
State New
Headers show
Series [1/2] mm/migrate_device.c: Copy pte dirty bit to page | expand

Commit Message

Alistair Popple Aug. 12, 2022, 5:22 a.m. UTC
migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that
installs migration entries directly if it can lock the migrating page.
When removing a dirty pte the dirty bit is supposed to be carried over
to the underlying page to prevent it being lost.

Currently migrate_vma_*() can only be used for private anonymous
mappings. That means loss of the dirty bit usually doesn't result in
data loss because these pages are typically not file-backed. However
pages may be backed by swap storage which can result in data loss if an
attempt is made to migrate a dirty page that doesn't yet have the
PageDirty flag set.

In this case migration will fail due to unexpected references but the
dirty pte bit will be lost. If the page is subsequently reclaimed data
won't be written back to swap storage as it is considered uptodate,
resulting in data loss if the page is subsequently accessed.

Prevent this by copying the dirty bit to the page when removing the pte
to match what try_to_migrate_one() does.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reported-by: Peter Xu <peterx@redhat.com>
---
 mm/migrate_device.c | 5 +++++
 1 file changed, 5 insertions(+)


base-commit: ffcf9c5700e49c0aee42dcba9a12ba21338e8136

Comments

Peter Xu Aug. 15, 2022, 8:29 p.m. UTC | #1
Hi, Alistair,

On Fri, Aug 12, 2022 at 03:22:30PM +1000, Alistair Popple wrote:
> migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that
> installs migration entries directly if it can lock the migrating page.
> When removing a dirty pte the dirty bit is supposed to be carried over
> to the underlying page to prevent it being lost.
> 
> Currently migrate_vma_*() can only be used for private anonymous
> mappings. That means loss of the dirty bit usually doesn't result in
> data loss because these pages are typically not file-backed. However
> pages may be backed by swap storage which can result in data loss if an
> attempt is made to migrate a dirty page that doesn't yet have the
> PageDirty flag set.
> 
> In this case migration will fail due to unexpected references but the
> dirty pte bit will be lost. If the page is subsequently reclaimed data
> won't be written back to swap storage as it is considered uptodate,
> resulting in data loss if the page is subsequently accessed.
> 
> Prevent this by copying the dirty bit to the page when removing the pte
> to match what try_to_migrate_one() does.
> 
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Reported-by: Peter Xu <peterx@redhat.com>

This line should be:

Reported-by: Huang Ying <ying.huang@intel.com>

Please also feel free to add:

Acked-by: Peter Xu <peterx@redhat.com>

Thanks,
Alistair Popple Aug. 16, 2022, 12:51 a.m. UTC | #2
Peter Xu <peterx@redhat.com> writes:

> Hi, Alistair,
>
> On Fri, Aug 12, 2022 at 03:22:30PM +1000, Alistair Popple wrote:
>> migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that
>> installs migration entries directly if it can lock the migrating page.
>> When removing a dirty pte the dirty bit is supposed to be carried over
>> to the underlying page to prevent it being lost.
>>
>> Currently migrate_vma_*() can only be used for private anonymous
>> mappings. That means loss of the dirty bit usually doesn't result in
>> data loss because these pages are typically not file-backed. However
>> pages may be backed by swap storage which can result in data loss if an
>> attempt is made to migrate a dirty page that doesn't yet have the
>> PageDirty flag set.
>>
>> In this case migration will fail due to unexpected references but the
>> dirty pte bit will be lost. If the page is subsequently reclaimed data
>> won't be written back to swap storage as it is considered uptodate,
>> resulting in data loss if the page is subsequently accessed.
>>
>> Prevent this by copying the dirty bit to the page when removing the pte
>> to match what try_to_migrate_one() does.
>>
>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>> Reported-by: Peter Xu <peterx@redhat.com>
>
> This line should be:
>
> Reported-by: Huang Ying <ying.huang@intel.com>
>
> Please also feel free to add:
>
> Acked-by: Peter Xu <peterx@redhat.com>

Thanks Peter, my bad. I assume Andrew can fix up the tags if I don't
need to re-spin this series.

> Thanks,
huang ying Aug. 16, 2022, 1:39 a.m. UTC | #3
Hi, Alistair,

On Fri, Aug 12, 2022 at 1:23 PM Alistair Popple <apopple@nvidia.com> wrote:
>
> migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that
> installs migration entries directly if it can lock the migrating page.
> When removing a dirty pte the dirty bit is supposed to be carried over
> to the underlying page to prevent it being lost.
>
> Currently migrate_vma_*() can only be used for private anonymous
> mappings. That means loss of the dirty bit usually doesn't result in
> data loss because these pages are typically not file-backed. However
> pages may be backed by swap storage which can result in data loss if an
> attempt is made to migrate a dirty page that doesn't yet have the
> PageDirty flag set.
>
> In this case migration will fail due to unexpected references but the
> dirty pte bit will be lost. If the page is subsequently reclaimed data
> won't be written back to swap storage as it is considered uptodate,
> resulting in data loss if the page is subsequently accessed.
>
> Prevent this by copying the dirty bit to the page when removing the pte
> to match what try_to_migrate_one() does.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Reported-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/migrate_device.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 27fb37d..d38f8a6 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -7,6 +7,7 @@
>  #include <linux/export.h>
>  #include <linux/memremap.h>
>  #include <linux/migrate.h>
> +#include <linux/mm.h>
>  #include <linux/mm_inline.h>
>  #include <linux/mmu_notifier.h>
>  #include <linux/oom.h>
> @@ -211,6 +212,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>
>                         migrate->cpages++;
>
> +                       /* Set the dirty flag on the folio now the pte is gone. */
> +                       if (pte_dirty(pte))
> +                               folio_mark_dirty(page_folio(page));
> +

I think that this isn't sufficient to fix all issues.  Firstly, "pte"
is assigned at the begin of the loop, before the PTE is cleared via
ptep_clear_flush() or ptep_get_and_clear().  That is, the pte isn't
changed atomically.  Between "pte" assignment and PTE clear, the PTE
may become dirty.  That is, we need to update pte when we clear the
PTE.

And I don't know why we use ptep_get_and_clear() to clear PTE if
(!anon_exclusive).  Why don't we need to flush the TLB?

Best Regards,
Huang, Ying

>                         /* Setup special migration page table entry */
>                         if (mpfn & MIGRATE_PFN_WRITE)
>                                 entry = make_writable_migration_entry(
>
> base-commit: ffcf9c5700e49c0aee42dcba9a12ba21338e8136
> --
> git-series 0.9.1
>
Alistair Popple Aug. 16, 2022, 2:28 a.m. UTC | #4
huang ying <huang.ying.caritas@gmail.com> writes:

> Hi, Alistair,
>
> On Fri, Aug 12, 2022 at 1:23 PM Alistair Popple <apopple@nvidia.com> wrote:
>>
>> migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that
>> installs migration entries directly if it can lock the migrating page.
>> When removing a dirty pte the dirty bit is supposed to be carried over
>> to the underlying page to prevent it being lost.
>>
>> Currently migrate_vma_*() can only be used for private anonymous
>> mappings. That means loss of the dirty bit usually doesn't result in
>> data loss because these pages are typically not file-backed. However
>> pages may be backed by swap storage which can result in data loss if an
>> attempt is made to migrate a dirty page that doesn't yet have the
>> PageDirty flag set.
>>
>> In this case migration will fail due to unexpected references but the
>> dirty pte bit will be lost. If the page is subsequently reclaimed data
>> won't be written back to swap storage as it is considered uptodate,
>> resulting in data loss if the page is subsequently accessed.
>>
>> Prevent this by copying the dirty bit to the page when removing the pte
>> to match what try_to_migrate_one() does.
>>
>> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>> Reported-by: Peter Xu <peterx@redhat.com>
>> ---
>>  mm/migrate_device.c | 5 +++++
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index 27fb37d..d38f8a6 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -7,6 +7,7 @@
>>  #include <linux/export.h>
>>  #include <linux/memremap.h>
>>  #include <linux/migrate.h>
>> +#include <linux/mm.h>
>>  #include <linux/mm_inline.h>
>>  #include <linux/mmu_notifier.h>
>>  #include <linux/oom.h>
>> @@ -211,6 +212,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>
>>                         migrate->cpages++;
>>
>> +                       /* Set the dirty flag on the folio now the pte is gone. */
>> +                       if (pte_dirty(pte))
>> +                               folio_mark_dirty(page_folio(page));
>> +
>
> I think that this isn't sufficient to fix all issues.  Firstly, "pte"
> is assigned at the begin of the loop, before the PTE is cleared via
> ptep_clear_flush() or ptep_get_and_clear().  That is, the pte isn't
> changed atomically.  Between "pte" assignment and PTE clear, the PTE
> may become dirty.  That is, we need to update pte when we clear the
> PTE.

Oh good catch, thanks. Will fix.

> And I don't know why we use ptep_get_and_clear() to clear PTE if
> (!anon_exclusive).  Why don't we need to flush the TLB?

We do the TLB flush at the end if anything was modified:

	/* Only flush the TLB if we actually modified any entries */
	if (unmapped)
		flush_tlb_range(walk->vma, start, end);

Obviously I don't think that will work correctly now given we have to
read the dirty bits and clear the PTE atomically. I assume it was
originally written this way for some sort of performance reason.

 - Alistair

> Best Regards,
> Huang, Ying
>
>>                         /* Setup special migration page table entry */
>>                         if (mpfn & MIGRATE_PFN_WRITE)
>>                                 entry = make_writable_migration_entry(
>>
>> base-commit: ffcf9c5700e49c0aee42dcba9a12ba21338e8136
>> --
>> git-series 0.9.1
>>
huang ying Aug. 16, 2022, 6:37 a.m. UTC | #5
On Tue, Aug 16, 2022 at 10:33 AM Alistair Popple <apopple@nvidia.com> wrote:
>
>
> huang ying <huang.ying.caritas@gmail.com> writes:
>
> > Hi, Alistair,
> >
> > On Fri, Aug 12, 2022 at 1:23 PM Alistair Popple <apopple@nvidia.com> wrote:

[snip]

>
> > And I don't know why we use ptep_get_and_clear() to clear PTE if
> > (!anon_exclusive).  Why don't we need to flush the TLB?
>
> We do the TLB flush at the end if anything was modified:
>
>         /* Only flush the TLB if we actually modified any entries */
>         if (unmapped)
>                 flush_tlb_range(walk->vma, start, end);
>
> Obviously I don't think that will work correctly now given we have to
> read the dirty bits and clear the PTE atomically. I assume it was
> originally written this way for some sort of performance reason.

Thanks for pointing this out.  If there were parallel page table
operations such as mprotect() or munmap(), the delayed TLB flushing
mechanism here may have some problem.  Please take a look at the
comments of flush_tlb_batched_pending() and TLB flush batching
implementation in try_to_unmap_one().  We may need to flush TLB with
page table lock held or use a mechanism similar to that in
try_to_unmap_one().

Best Regards,
Huang, Ying
Alistair Popple Aug. 17, 2022, 1:27 a.m. UTC | #6
huang ying <huang.ying.caritas@gmail.com> writes:

> On Tue, Aug 16, 2022 at 10:33 AM Alistair Popple <apopple@nvidia.com> wrote:
>>
>>
>> huang ying <huang.ying.caritas@gmail.com> writes:
>>
>> > Hi, Alistair,
>> >
>> > On Fri, Aug 12, 2022 at 1:23 PM Alistair Popple <apopple@nvidia.com> wrote:
>
> [snip]
>
>>
>> > And I don't know why we use ptep_get_and_clear() to clear PTE if
>> > (!anon_exclusive).  Why don't we need to flush the TLB?
>>
>> We do the TLB flush at the end if anything was modified:
>>
>>         /* Only flush the TLB if we actually modified any entries */
>>         if (unmapped)
>>                 flush_tlb_range(walk->vma, start, end);
>>
>> Obviously I don't think that will work correctly now given we have to
>> read the dirty bits and clear the PTE atomically. I assume it was
>> originally written this way for some sort of performance reason.
>
> Thanks for pointing this out.  If there were parallel page table
> operations such as mprotect() or munmap(), the delayed TLB flushing
> mechanism here may have some problem.  Please take a look at the
> comments of flush_tlb_batched_pending() and TLB flush batching
> implementation in try_to_unmap_one().  We may need to flush TLB with
> page table lock held or use a mechanism similar to that in
> try_to_unmap_one().

Thanks for the pointers. I agree there is likely also a problem here
with the delayed TLB flushing. v2 of this patch deals with this by
always flushing the TLB using ptep_flush_clear(), similar to how
try_to_migrate_one() works. It looks like it could be worth
investigating using batched TLB flushing for both this and
try_to_migrate(), but I will leave that for a future optimisation.

> Best Regards,
> Huang, Ying
diff mbox series

Patch

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 27fb37d..d38f8a6 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -7,6 +7,7 @@ 
 #include <linux/export.h>
 #include <linux/memremap.h>
 #include <linux/migrate.h>
+#include <linux/mm.h>
 #include <linux/mm_inline.h>
 #include <linux/mmu_notifier.h>
 #include <linux/oom.h>
@@ -211,6 +212,10 @@  static int migrate_vma_collect_pmd(pmd_t *pmdp,
 
 			migrate->cpages++;
 
+			/* Set the dirty flag on the folio now the pte is gone. */
+			if (pte_dirty(pte))
+				folio_mark_dirty(page_folio(page));
+
 			/* Setup special migration page table entry */
 			if (mpfn & MIGRATE_PFN_WRITE)
 				entry = make_writable_migration_entry(