Message ID | 20220809220100.20033-6-peterx@redhat.com (mailing list archive) |
---|---|
State | New |
Series | mm: Remember a/d bits for migration entries |
Peter Xu <peterx@redhat.com> writes:

> When page migration happens, we always ignore the young/dirty bit settings
> in the old pgtable, and marking the page as old in the new page table using
> either pte_mkold() or pmd_mkold(), and keeping the pte clean.
>
> That's fine from functional-wise, but that's not friendly to page reclaim
> because the moving page can be actively accessed within the procedure.  Not
> to mention hardware setting the young bit can bring quite some overhead on
> some systems, e.g. x86_64 needs a few hundreds nanoseconds to set the bit.
> The same slowdown problem to dirty bits when the memory is first written
> after page migration happened.
>
> Actually we can easily remember the A/D bit configuration and recover the
> information after the page is migrated.  To achieve it, define a new set of
> bits in the migration swap offset field to cache the A/D bits for old pte.
> Then when removing/recovering the migration entry, we can recover the A/D
> bits even if the page changed.
>
> One thing to mention is that here we used max_swapfile_size() to detect how
> many swp offset bits we have, and we'll only enable this feature if we know
> the swp offset can be big enough to store both the PFN value and the young
                                                                       ~~~~~
Nitpick: A/D

> bit.  Otherwise the A/D bits are dropped like before.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/swapops.h | 99 +++++++++++++++++++++++++++++++++++++++++
>  mm/huge_memory.c        | 18 +++++++-
>  mm/migrate.c            |  6 ++-
>  mm/migrate_device.c     |  4 ++
>  mm/rmap.c               |  5 ++-
>  5 files changed, 128 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index e1accbcd1136..0e9579b90659 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -8,6 +8,10 @@
>
>  #ifdef CONFIG_MMU
>
> +#ifdef CONFIG_SWAP
> +#include <linux/swapfile.h>
> +#endif	/* CONFIG_SWAP */

I don't think we need the comment here.  The #ifdef is too near.  But
this isn't a big deal.

Best Regards,
Huang, Ying
On Wed, Aug 10, 2022 at 02:30:33PM +0800, Huang, Ying wrote: > Peter Xu <peterx@redhat.com> writes: > > > When page migration happens, we always ignore the young/dirty bit settings > > in the old pgtable, and marking the page as old in the new page table using > > either pte_mkold() or pmd_mkold(), and keeping the pte clean. > > > > That's fine from functional-wise, but that's not friendly to page reclaim > > because the moving page can be actively accessed within the procedure. Not > > to mention hardware setting the young bit can bring quite some overhead on > > some systems, e.g. x86_64 needs a few hundreds nanoseconds to set the bit. > > The same slowdown problem to dirty bits when the memory is first written > > after page migration happened. > > > > Actually we can easily remember the A/D bit configuration and recover the > > information after the page is migrated. To achieve it, define a new set of > > bits in the migration swap offset field to cache the A/D bits for old pte. > > Then when removing/recovering the migration entry, we can recover the A/D > > bits even if the page changed. > > > > One thing to mention is that here we used max_swapfile_size() to detect how > > many swp offset bits we have, and we'll only enable this feature if we know > > the swp offset can be big enough to store both the PFN value and the young > ~~~~~ > Nitpick: A/D Fixed. > > > bit. Otherwise the A/D bits are dropped like before. 
> > > > Signed-off-by: Peter Xu <peterx@redhat.com> > > --- > > include/linux/swapops.h | 99 +++++++++++++++++++++++++++++++++++++++++ > > mm/huge_memory.c | 18 +++++++- > > mm/migrate.c | 6 ++- > > mm/migrate_device.c | 4 ++ > > mm/rmap.c | 5 ++- > > 5 files changed, 128 insertions(+), 4 deletions(-) > > > > diff --git a/include/linux/swapops.h b/include/linux/swapops.h > > index e1accbcd1136..0e9579b90659 100644 > > --- a/include/linux/swapops.h > > +++ b/include/linux/swapops.h > > @@ -8,6 +8,10 @@ > > > > #ifdef CONFIG_MMU > > > > +#ifdef CONFIG_SWAP > > +#include <linux/swapfile.h> > > +#endif /* CONFIG_SWAP */ > > I don't think we need the comment here. The #ifdef is too near. But > this isn't a big deal. I'd slightly prefer keeping it (especially Nadav used to complain on missing comments on ifdefs in previous versions..) since any ifdef can grow by adding code into it. Then it'll be hard to justify how to define "near" or not, so hard to define who should be adding that if I'm not the one. Thanks,
On Tue, Aug 09, 2022 at 06:00:58PM -0400, Peter Xu wrote:
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 27fb37d65476..699f821b8443 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -221,6 +221,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  			else
>  				entry = make_readable_migration_entry(
>  							page_to_pfn(page));
> +			if (pte_young(pte))
> +				entry = make_migration_entry_young(entry);
> +			if (pte_dirty(pte))
> +				entry = make_migration_entry_dirty(entry);
>  			swp_pte = swp_entry_to_pte(entry);
>  			if (pte_present(pte)) {
>  				if (pte_soft_dirty(pte))

This change needs to be wrapped with pte_present() at least..

I also just noticed that this change probably won't help anyway because:

  (1) When ram->device, the pte will finally be replaced with a device
      private entry, and device private entry does not yet support A/D, it
      means A/D will be dropped again,

  (2) When device->ram, we are missing information on either A/D bits, or
      even if device private entries start to support A/D, it's still not
      clear whether we should take device read/write into considerations
      too on the page A/D bits to be accurate.

I think I'll probably keep the code there for completeness, but I think it
won't really help much until more things are done.
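The pte_present() point above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct model_pte`, `struct model_entry` and `model_collect_ad()` are hypothetical names invented here, not kernel APIs, and the premise that a non-present pte (such as a device private swap entry) carries no meaningful hardware A/D bits is my reading of the remark rather than something the thread spells out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a pte: only the fields the sketch needs. */
struct model_pte {
	bool present;	/* models pte_present() */
	bool young;	/* models pte_young(), meaningful only if present */
	bool dirty;	/* models pte_dirty(), meaningful only if present */
};

/* Hypothetical stand-in for the A/D state cached in a migration entry. */
struct model_entry {
	bool young;
	bool dirty;
};

/*
 * Carry A/D over to the migration entry only when the source pte is
 * present.  For a non-present pte the young/dirty fields are treated
 * as undefined and are ignored, so the entry stays old and clean.
 */
static struct model_entry model_collect_ad(struct model_pte pte)
{
	struct model_entry entry = { false, false };

	if (pte.present) {
		entry.young = pte.young;
		entry.dirty = pte.dirty;
	}
	return entry;
}
```

In the real function the equivalent guard would presumably sit inside or next to the existing `if (pte_present(pte))` block that already handles soft-dirty.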
Peter Xu <peterx@redhat.com> writes: > On Tue, Aug 09, 2022 at 06:00:58PM -0400, Peter Xu wrote: >> diff --git a/mm/migrate_device.c b/mm/migrate_device.c >> index 27fb37d65476..699f821b8443 100644 >> --- a/mm/migrate_device.c >> +++ b/mm/migrate_device.c >> @@ -221,6 +221,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, >> else >> entry = make_readable_migration_entry( >> page_to_pfn(page)); >> + if (pte_young(pte)) >> + entry = make_migration_entry_young(entry); >> + if (pte_dirty(pte)) >> + entry = make_migration_entry_dirty(entry); >> swp_pte = swp_entry_to_pte(entry); >> if (pte_present(pte)) { >> if (pte_soft_dirty(pte)) > > This change needs to be wrapped with pte_present() at least.. > > I also just noticed that this change probably won't help anyway because: > > (1) When ram->device, the pte will finally be replaced with a device > private entry, and device private entry does not yet support A/D, it > means A/D will be dropped again, > > (2) When device->ram, we are missing information on either A/D bits, or > even if device private entries start to suport A/D, it's still not > clear whether we should take device read/write into considerations > too on the page A/D bits to be accurate. > > I think I'll probably keep the code there for completeness, but I think it > won't really help much until more things are done. It appears that there are more issues. Between "pte = *ptep" and pte clear, CPU may set A/D bit in PTE, so we may need to update pte when clearing PTE. And I don't find the TLB is flushed in some cases after PTE is cleared. Best Regards, Huang, Ying
On Fri, Aug 12, 2022 at 10:32:48AM +0800, Huang, Ying wrote: > Peter Xu <peterx@redhat.com> writes: > > > On Tue, Aug 09, 2022 at 06:00:58PM -0400, Peter Xu wrote: > >> diff --git a/mm/migrate_device.c b/mm/migrate_device.c > >> index 27fb37d65476..699f821b8443 100644 > >> --- a/mm/migrate_device.c > >> +++ b/mm/migrate_device.c > >> @@ -221,6 +221,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > >> else > >> entry = make_readable_migration_entry( > >> page_to_pfn(page)); > >> + if (pte_young(pte)) > >> + entry = make_migration_entry_young(entry); > >> + if (pte_dirty(pte)) > >> + entry = make_migration_entry_dirty(entry); > >> swp_pte = swp_entry_to_pte(entry); > >> if (pte_present(pte)) { > >> if (pte_soft_dirty(pte)) > > > > This change needs to be wrapped with pte_present() at least.. > > > > I also just noticed that this change probably won't help anyway because: > > > > (1) When ram->device, the pte will finally be replaced with a device > > private entry, and device private entry does not yet support A/D, it > > means A/D will be dropped again, > > > > (2) When device->ram, we are missing information on either A/D bits, or > > even if device private entries start to suport A/D, it's still not > > clear whether we should take device read/write into considerations > > too on the page A/D bits to be accurate. > > > > I think I'll probably keep the code there for completeness, but I think it > > won't really help much until more things are done. > > It appears that there are more issues. Between "pte = *ptep" and pte > clear, CPU may set A/D bit in PTE, so we may need to update pte when > clearing PTE. Agreed, I didn't see it a huge problem with current code, but it should be better in that way. > And I don't find the TLB is flushed in some cases after PTE is cleared. I think it's okay to not flush tlb if pte not present. But maybe you're talking about something else? Thanks,
On Aug 15, 2022, at 12:18 PM, Peter Xu <peterx@redhat.com> wrote: > On Fri, Aug 12, 2022 at 10:32:48AM +0800, Huang, Ying wrote: >> Peter Xu <peterx@redhat.com> writes: >> >>> On Tue, Aug 09, 2022 at 06:00:58PM -0400, Peter Xu wrote: >>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c >>>> index 27fb37d65476..699f821b8443 100644 >>>> --- a/mm/migrate_device.c >>>> +++ b/mm/migrate_device.c >>>> @@ -221,6 +221,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, >>>> else >>>> entry = make_readable_migration_entry( >>>> page_to_pfn(page)); >>>> + if (pte_young(pte)) >>>> + entry = make_migration_entry_young(entry); >>>> + if (pte_dirty(pte)) >>>> + entry = make_migration_entry_dirty(entry); >>>> swp_pte = swp_entry_to_pte(entry); >>>> if (pte_present(pte)) { >>>> if (pte_soft_dirty(pte)) >>> >>> This change needs to be wrapped with pte_present() at least.. >>> >>> I also just noticed that this change probably won't help anyway because: >>> >>> (1) When ram->device, the pte will finally be replaced with a device >>> private entry, and device private entry does not yet support A/D, it >>> means A/D will be dropped again, >>> >>> (2) When device->ram, we are missing information on either A/D bits, or >>> even if device private entries start to suport A/D, it's still not >>> clear whether we should take device read/write into considerations >>> too on the page A/D bits to be accurate. >>> >>> I think I'll probably keep the code there for completeness, but I think it >>> won't really help much until more things are done. >> >> It appears that there are more issues. Between "pte = *ptep" and pte >> clear, CPU may set A/D bit in PTE, so we may need to update pte when >> clearing PTE. > > Agreed, I didn't see it a huge problem with current code, but it should be > better in that way. > >> And I don't find the TLB is flushed in some cases after PTE is cleared. > > I think it's okay to not flush tlb if pte not present. But maybe you're > talking about something else? 
I think Huang refers to the situation in which the PTE is cleared but not yet flushed, and then A/D is set by the hardware. At least on x86, the hardware is not supposed to do so. The only case I remember (and sometimes misremember) is the KNL erratum, which perhaps needs to be considered: https://lore.kernel.org/all/20160708001911.9A3FD2B6@viggo.jf.intel.com/
On Aug 15, 2022, at 1:52 PM, Nadav Amit <nadav.amit@gmail.com> wrote: > On Aug 15, 2022, at 12:18 PM, Peter Xu <peterx@redhat.com> wrote: > >> On Fri, Aug 12, 2022 at 10:32:48AM +0800, Huang, Ying wrote: >>> Peter Xu <peterx@redhat.com> writes: >>> >>>> On Tue, Aug 09, 2022 at 06:00:58PM -0400, Peter Xu wrote: >>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c >>>>> index 27fb37d65476..699f821b8443 100644 >>>>> --- a/mm/migrate_device.c >>>>> +++ b/mm/migrate_device.c >>>>> @@ -221,6 +221,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, >>>>> else >>>>> entry = make_readable_migration_entry( >>>>> page_to_pfn(page)); >>>>> + if (pte_young(pte)) >>>>> + entry = make_migration_entry_young(entry); >>>>> + if (pte_dirty(pte)) >>>>> + entry = make_migration_entry_dirty(entry); >>>>> swp_pte = swp_entry_to_pte(entry); >>>>> if (pte_present(pte)) { >>>>> if (pte_soft_dirty(pte)) >>>> >>>> This change needs to be wrapped with pte_present() at least.. >>>> >>>> I also just noticed that this change probably won't help anyway because: >>>> >>>> (1) When ram->device, the pte will finally be replaced with a device >>>> private entry, and device private entry does not yet support A/D, it >>>> means A/D will be dropped again, >>>> >>>> (2) When device->ram, we are missing information on either A/D bits, or >>>> even if device private entries start to suport A/D, it's still not >>>> clear whether we should take device read/write into considerations >>>> too on the page A/D bits to be accurate. >>>> >>>> I think I'll probably keep the code there for completeness, but I think it >>>> won't really help much until more things are done. >>> >>> It appears that there are more issues. Between "pte = *ptep" and pte >>> clear, CPU may set A/D bit in PTE, so we may need to update pte when >>> clearing PTE. >> >> Agreed, I didn't see it a huge problem with current code, but it should be >> better in that way. 
>> >>> And I don't find the TLB is flushed in some cases after PTE is cleared. >> >> I think it's okay to not flush tlb if pte not present. But maybe you're >> talking about something else? > > I think Huang refers to situation in which the PTE is cleared, still not > flushed, and then A/D is being set by the hardware. > > At least on x86, the hardware is not supposed to do so. The only case I > remember (and sometimes misremembers) is with KNL erratum, which perhaps > needs to be considered: > > https://lore.kernel.org/all/20160708001911.9A3FD2B6@viggo.jf.intel.com/ I keep not remembering this erratum correctly. IIRC, the erratum says that the access/dirty might be set, but it does not mean that a write is possible after the PTE is cleared (i.e., the dirty/access might be set on the non-present PTE, but the access itself would fail). So it is not an issue in this case - losing A/D would not impact correctness since the access should fail. Dave Hansen hates when I get confused with this one, but I cc him if he wants to confirm. [ Having said all of that, in general the lack of regard to mm->tlb_flush_pending is always concerning in such functions. ]
Nadav Amit <nadav.amit@gmail.com> writes: > On Aug 15, 2022, at 12:18 PM, Peter Xu <peterx@redhat.com> wrote: > >> On Fri, Aug 12, 2022 at 10:32:48AM +0800, Huang, Ying wrote: >>> Peter Xu <peterx@redhat.com> writes: >>> >>>> On Tue, Aug 09, 2022 at 06:00:58PM -0400, Peter Xu wrote: >>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c >>>>> index 27fb37d65476..699f821b8443 100644 >>>>> --- a/mm/migrate_device.c >>>>> +++ b/mm/migrate_device.c >>>>> @@ -221,6 +221,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, >>>>> else >>>>> entry = make_readable_migration_entry( >>>>> page_to_pfn(page)); >>>>> + if (pte_young(pte)) >>>>> + entry = make_migration_entry_young(entry); >>>>> + if (pte_dirty(pte)) >>>>> + entry = make_migration_entry_dirty(entry); >>>>> swp_pte = swp_entry_to_pte(entry); >>>>> if (pte_present(pte)) { >>>>> if (pte_soft_dirty(pte)) >>>> >>>> This change needs to be wrapped with pte_present() at least.. >>>> >>>> I also just noticed that this change probably won't help anyway because: >>>> >>>> (1) When ram->device, the pte will finally be replaced with a device >>>> private entry, and device private entry does not yet support A/D, it >>>> means A/D will be dropped again, >>>> >>>> (2) When device->ram, we are missing information on either A/D bits, or >>>> even if device private entries start to suport A/D, it's still not >>>> clear whether we should take device read/write into considerations >>>> too on the page A/D bits to be accurate. >>>> >>>> I think I'll probably keep the code there for completeness, but I think it >>>> won't really help much until more things are done. >>> >>> It appears that there are more issues. Between "pte = *ptep" and pte >>> clear, CPU may set A/D bit in PTE, so we may need to update pte when >>> clearing PTE. >> >> Agreed, I didn't see it a huge problem with current code, but it should be >> better in that way. >> >>> And I don't find the TLB is flushed in some cases after PTE is cleared. 
>> >> I think it's okay to not flush tlb if pte not present. But maybe you're >> talking about something else? > > I think Huang refers to situation in which the PTE is cleared, still not > flushed, and then A/D is being set by the hardware. No. The situation in my mind is PTE with A/D set is cleared, not flushed. Then a parallel mprotect or munmap may cause race conditions. As Alistair pointed out in another thread [1], there is TLB flushing after PTL unlocked. But I think we need to flush TLB before unlock. This has been fixed in Alistair's latest version [2]. [1] https://lore.kernel.org/lkml/87r11gvrx6.fsf@nvdebian.thelocal/ [2] https://lore.kernel.org/lkml/6e77914685ede036c419fa65b6adc27f25a6c3e9.1660635033.git-series.apopple@nvidia.com/ Best Regards, Huang, Ying
On 8/15/22 14:03, Nadav Amit wrote:
>>
>> At least on x86, the hardware is not supposed to do so. The only case I
>> remember (and sometimes misremembers) is with KNL erratum, which perhaps
>> needs to be considered:
>>
>> https://lore.kernel.org/all/20160708001911.9A3FD2B6@viggo.jf.intel.com/
> I keep not remembering this erratum correctly. IIRC, the erratum says that
> the access/dirty might be set, but it does not mean that a write is possible
> after the PTE is cleared (i.e., the dirty/access might be set on the
> non-present PTE, but the access itself would fail). So it is not an issue in
> this case - losing A/D would not impact correctness since the access should
> fail.
>
> Dave Hansen hates when I get confused with this one, but I cc him if he
> wants to confirm.

Right.  The issue is strictly with the page walker setting Accessed/Dirty
in a racy way.  The TLB still has accurate contents at all times.
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index e1accbcd1136..0e9579b90659 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -8,6 +8,10 @@
 
 #ifdef CONFIG_MMU
 
+#ifdef CONFIG_SWAP
+#include <linux/swapfile.h>
+#endif	/* CONFIG_SWAP */
+
 /*
  * swapcache pages are stored in the swapper_space radix tree.  We want to
  * get good packing density in that tree, so the index should be dense in
@@ -35,6 +39,31 @@
 #endif	/* MAX_PHYSMEM_BITS */
 #define SWP_PFN_MASK			((1UL << SWP_PFN_BITS) - 1)
 
+/**
+ * Migration swap entry specific bitfield definitions.  Layout:
+ *
+ *   |----------+--------------------|
+ *   | swp_type | swp_offset         |
+ *   |----------+--------+-+-+-------|
+ *   |          | resv   |D|A|  PFN  |
+ *   |----------+--------+-+-+-------|
+ *
+ * @SWP_MIG_YOUNG_BIT: Whether the page used to have young bit set (bit A)
+ * @SWP_MIG_DIRTY_BIT: Whether the page used to have dirty bit set (bit D)
+ *
+ * Note: A/D bits will be stored in migration entries iff there're enough
+ * free bits in arch specific swp offset.  By default we'll ignore A/D bits
+ * when migrating a page.  Please refer to migration_entry_supports_ad()
+ * for more information.  If there're more bits besides PFN and A/D bits,
+ * they should be reserved and always be zeros.
+ */
+#define SWP_MIG_YOUNG_BIT		(SWP_PFN_BITS)
+#define SWP_MIG_DIRTY_BIT		(SWP_PFN_BITS + 1)
+#define SWP_MIG_TOTAL_BITS		(SWP_PFN_BITS + 2)
+
+#define SWP_MIG_YOUNG			BIT(SWP_MIG_YOUNG_BIT)
+#define SWP_MIG_DIRTY			BIT(SWP_MIG_DIRTY_BIT)
+
 static inline bool is_pfn_swap_entry(swp_entry_t entry);
 
 /* Clear all flags but only keep swp_entry_t related information */
@@ -265,6 +294,57 @@ static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
 	return swp_entry(SWP_MIGRATION_WRITE, offset);
 }
 
+/*
+ * Returns whether the host has large enough swap offset field to support
+ * carrying over pgtable A/D bits for page migrations.  The result is
+ * pretty much arch specific.
+ */
+static inline bool migration_entry_supports_ad(void)
+{
+	/*
+	 * max_swapfile_size() returns the max supported swp-offset plus 1.
+	 * We can support the migration A/D bits iff the pfn swap entry has
+	 * the offset large enough to cover all of them (PFN, A & D bits).
+	 */
+#ifdef CONFIG_SWAP
+	return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);
+#else  /* CONFIG_SWAP */
+	return false;
+#endif	/* CONFIG_SWAP */
+}
+
+static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
+{
+	if (migration_entry_supports_ad())
+		return swp_entry(swp_type(entry),
+				 swp_offset(entry) | SWP_MIG_YOUNG);
+	return entry;
+}
+
+static inline bool is_migration_entry_young(swp_entry_t entry)
+{
+	if (migration_entry_supports_ad())
+		return swp_offset(entry) & SWP_MIG_YOUNG;
+	/* Keep the old behavior of aging page after migration */
+	return false;
+}
+
+static inline swp_entry_t make_migration_entry_dirty(swp_entry_t entry)
+{
+	if (migration_entry_supports_ad())
+		return swp_entry(swp_type(entry),
+				 swp_offset(entry) | SWP_MIG_DIRTY);
+	return entry;
+}
+
+static inline bool is_migration_entry_dirty(swp_entry_t entry)
+{
+	if (migration_entry_supports_ad())
+		return swp_offset(entry) & SWP_MIG_DIRTY;
+	/* Keep the old behavior of clean page after migration */
+	return false;
+}
+
 extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
 					spinlock_t *ptl);
 extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
@@ -311,6 +391,25 @@ static inline int is_readable_migration_entry(swp_entry_t entry)
 	return 0;
 }
 
+static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
+{
+	return entry;
+}
+
+static inline bool is_migration_entry_young(swp_entry_t entry)
+{
+	return false;
+}
+
+static inline swp_entry_t make_migration_entry_dirty(swp_entry_t entry)
+{
+	return entry;
+}
+
+static inline bool is_migration_entry_dirty(swp_entry_t entry)
+{
+	return false;
+}
 #endif	/* CONFIG_MIGRATION */
 
 typedef unsigned long pte_marker;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e8e78d1bac5f..1644e9f59d73 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2089,7 +2089,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		write = is_writable_migration_entry(entry);
 		if (PageAnon(page))
 			anon_exclusive = is_readable_exclusive_migration_entry(entry);
-		young = false;
+		young = is_migration_entry_young(entry);
+		dirty = is_migration_entry_dirty(entry);
 		soft_dirty = pmd_swp_soft_dirty(old_pmd);
 		uffd_wp = pmd_swp_uffd_wp(old_pmd);
 	} else {
@@ -2148,6 +2149,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			else
 				swp_entry = make_readable_migration_entry(
 							page_to_pfn(page + i));
+			if (young)
+				swp_entry = make_migration_entry_young(swp_entry);
+			if (dirty)
+				swp_entry = make_migration_entry_dirty(swp_entry);
 			entry = swp_entry_to_pte(swp_entry);
 			if (soft_dirty)
 				entry = pte_swp_mksoft_dirty(entry);
@@ -3157,6 +3162,10 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 		entry = make_readable_exclusive_migration_entry(page_to_pfn(page));
 	else
 		entry = make_readable_migration_entry(page_to_pfn(page));
+	if (pmd_young(pmdval))
+		entry = make_migration_entry_young(entry);
+	if (pmd_dirty(pmdval))
+		entry = make_migration_entry_dirty(entry);
 	pmdswp = swp_entry_to_pmd(entry);
 	if (pmd_soft_dirty(pmdval))
 		pmdswp = pmd_swp_mksoft_dirty(pmdswp);
@@ -3182,13 +3191,18 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 
 	entry = pmd_to_swp_entry(*pvmw->pmd);
 	get_page(new);
-	pmde = pmd_mkold(mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot)));
+	pmde = mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot));
 	if (pmd_swp_soft_dirty(*pvmw->pmd))
 		pmde = pmd_mksoft_dirty(pmde);
 	if (is_writable_migration_entry(entry))
 		pmde = maybe_pmd_mkwrite(pmde, vma);
 	if (pmd_swp_uffd_wp(*pvmw->pmd))
 		pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde));
+	if (!is_migration_entry_young(entry))
+		pmde = pmd_mkold(pmde);
+	/* NOTE: this may contain setting soft-dirty on some archs */
+	if (PageDirty(new) && is_migration_entry_dirty(entry))
+		pmde = pmd_mkdirty(pmde);
 
 	if (PageAnon(new)) {
 		rmap_t rmap_flags = RMAP_COMPOUND;
diff --git a/mm/migrate.c b/mm/migrate.c
index 6a1597c92261..0433a71d2bee 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -198,7 +198,7 @@ static bool remove_migration_pte(struct folio *folio,
 #endif
 
 		folio_get(folio);
-		pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot)));
+		pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
 		if (pte_swp_soft_dirty(*pvmw.pte))
 			pte = pte_mksoft_dirty(pte);
 
@@ -206,6 +206,10 @@ static bool remove_migration_pte(struct folio *folio,
 		 * Recheck VMA as permissions can change since migration started
 		 */
 		entry = pte_to_swp_entry(*pvmw.pte);
+		if (!is_migration_entry_young(entry))
+			pte = pte_mkold(pte);
+		if (folio_test_dirty(folio) && is_migration_entry_dirty(entry))
+			pte = pte_mkdirty(pte);
 		if (is_writable_migration_entry(entry))
 			pte = maybe_mkwrite(pte, vma);
 		else if (pte_swp_uffd_wp(*pvmw.pte))
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 27fb37d65476..699f821b8443 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -221,6 +221,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			else
 				entry = make_readable_migration_entry(
 							page_to_pfn(page));
+			if (pte_young(pte))
+				entry = make_migration_entry_young(entry);
+			if (pte_dirty(pte))
+				entry = make_migration_entry_dirty(entry);
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_present(pte)) {
 				if (pte_soft_dirty(pte))
diff --git a/mm/rmap.c b/mm/rmap.c
index af775855e58f..28aef434ea41 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2065,7 +2065,10 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			else
 				entry = make_readable_migration_entry(
 							page_to_pfn(subpage));
-
+			if (pte_young(pteval))
+				entry = make_migration_entry_young(entry);
+			if (pte_dirty(pteval))
+				entry = make_migration_entry_dirty(entry);
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
When page migration happens, we always ignore the young/dirty bit settings
in the old pgtable, mark the page as old in the new page table using
either pte_mkold() or pmd_mkold(), and keep the pte clean.

That's fine functionally, but it is not friendly to page reclaim
because the moving page can be actively accessed during the procedure.  Not
to mention that hardware setting the young bit can bring quite some overhead
on some systems, e.g. x86_64 needs a few hundred nanoseconds to set the bit.
The same slowdown applies to the dirty bit when the memory is first written
after page migration has happened.

Actually we can easily remember the A/D bit configuration and recover the
information after the page is migrated.  To achieve it, define a new set of
bits in the migration swap offset field to cache the A/D bits of the old pte.
Then when removing/recovering the migration entry, we can recover the A/D
bits even if the page changed.

One thing to mention is that here we use max_swapfile_size() to detect how
many swp offset bits we have, and we only enable this feature if we know
the swp offset is big enough to store both the PFN value and the A/D
bits.  Otherwise the A/D bits are dropped like before.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/swapops.h | 99 +++++++++++++++++++++++++++++++++++++++++
 mm/huge_memory.c        | 18 +++++++-
 mm/migrate.c            |  6 ++-
 mm/migrate_device.c     |  4 ++
 mm/rmap.c               |  5 ++-
 5 files changed, 128 insertions(+), 4 deletions(-)
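As a rough illustration of the offset layout this patch describes, the user-space C sketch below packs a PFN plus the A and D bits into one 64-bit offset and falls back to dropping A/D when the offset field is too small. The 40-bit PFN width, every `MODEL_`/`model_` name, and passing the "max offset plus one" value as a parameter are assumptions made for the example; in the kernel SWP_PFN_BITS is derived per architecture and the limit comes from max_swapfile_size().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed PFN width, for illustration only (not the kernel's value). */
#define MODEL_PFN_BITS		40
#define MODEL_YOUNG_BIT		(MODEL_PFN_BITS)	/* bit A, just above the PFN */
#define MODEL_DIRTY_BIT		(MODEL_PFN_BITS + 1)	/* bit D, above bit A */
#define MODEL_TOTAL_BITS	(MODEL_PFN_BITS + 2)

#define MODEL_PFN_MASK		((UINT64_C(1) << MODEL_PFN_BITS) - 1)
#define MODEL_YOUNG		(UINT64_C(1) << MODEL_YOUNG_BIT)
#define MODEL_DIRTY		(UINT64_C(1) << MODEL_DIRTY_BIT)

/*
 * Models migration_entry_supports_ad(): max_plus_one plays the role of
 * max_swapfile_size(), i.e. the largest supported offset plus 1.
 */
static bool model_supports_ad(uint64_t max_plus_one)
{
	return max_plus_one >= (UINT64_C(1) << MODEL_TOTAL_BITS);
}

/* Build an offset from a PFN, adding A/D only when there is room. */
static uint64_t model_make_offset(uint64_t pfn, bool young, bool dirty,
				  uint64_t max_plus_one)
{
	uint64_t offset = pfn;

	assert((pfn & ~MODEL_PFN_MASK) == 0);	/* PFN must fit its field */
	if (!model_supports_ad(max_plus_one))
		return offset;			/* A/D dropped, as before */
	if (young)
		offset |= MODEL_YOUNG;
	if (dirty)
		offset |= MODEL_DIRTY;
	return offset;
}

static uint64_t model_offset_pfn(uint64_t offset)
{
	return offset & MODEL_PFN_MASK;
}

static bool model_offset_young(uint64_t offset, uint64_t max_plus_one)
{
	return model_supports_ad(max_plus_one) && (offset & MODEL_YOUNG) != 0;
}

static bool model_offset_dirty(uint64_t offset, uint64_t max_plus_one)
{
	return model_supports_ad(max_plus_one) && (offset & MODEL_DIRTY) != 0;
}
```

With a limit of 1 << 50 the sketch keeps the bits, so an offset built from PFN 0x1234 with young set reads back as young and not dirty; with a limit of 1 << 40 the same call returns the bare PFN, which mirrors the "dropped like before" fallback.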