[v3,5/7] mm: Remember young/dirty bit for page migrations

Message ID 20220809220100.20033-6-peterx@redhat.com (mailing list archive)
State New
Series mm: Remember a/d bits for migration entries

Commit Message

Peter Xu Aug. 9, 2022, 10 p.m. UTC
When page migration happens, we always ignore the young/dirty bit settings
in the old pgtable, marking the page as old in the new page table using
either pte_mkold() or pmd_mkold(), and keeping the pte clean.

That's fine functionally, but it's not friendly to page reclaim because the
page being moved can be actively accessed during the procedure.  Not to
mention that having hardware set the young bit can bring quite some overhead
on some systems, e.g. x86_64 needs a few hundred nanoseconds to set the bit.
The same slowdown applies to dirty bits when the memory is first written
after page migration has happened.

Actually we can easily remember the A/D bit configuration and recover the
information after the page is migrated.  To achieve it, define a new set of
bits in the migration swap offset field to cache the A/D bits for old pte.
Then when removing/recovering the migration entry, we can recover the A/D
bits even if the page changed.

One thing to mention is that here we used max_swapfile_size() to detect how
many swp offset bits we have, and we'll only enable this feature if we know
the swp offset can be big enough to store both the PFN value and the young
bit.  Otherwise the A/D bits are dropped like before.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/swapops.h | 99 +++++++++++++++++++++++++++++++++++++++++
 mm/huge_memory.c        | 18 +++++++-
 mm/migrate.c            |  6 ++-
 mm/migrate_device.c     |  4 ++
 mm/rmap.c               |  5 ++-
 5 files changed, 128 insertions(+), 4 deletions(-)
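
The encoding described in the commit message can be modelled in plain userspace C. This is only a sketch, not the kernel code: the `demo_*` names and the 40-bit PFN width are assumptions made for illustration, and `max_off_plus_one` merely stands in for what max_swapfile_size() returns. It shows the A/D bits sitting directly above the PFN and being silently dropped when the offset field is too small.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical PFN width, chosen only for this demo.  The kernel derives
 * SWP_PFN_BITS from the architecture's physical address width.
 */
#define DEMO_PFN_BITS	40
#define DEMO_PFN_MASK	((UINT64_C(1) << DEMO_PFN_BITS) - 1)
#define DEMO_YOUNG	(UINT64_C(1) << DEMO_PFN_BITS)		/* bit A */
#define DEMO_DIRTY	(UINT64_C(1) << (DEMO_PFN_BITS + 1))	/* bit D */
#define DEMO_TOTAL_BITS	(DEMO_PFN_BITS + 2)

/*
 * max_off_plus_one stands in for max_swapfile_size(): the largest
 * supported offset plus one.  A/D fit only if it covers PFN + 2 bits.
 */
static bool demo_supports_ad(uint64_t max_off_plus_one)
{
	return max_off_plus_one >= (UINT64_C(1) << DEMO_TOTAL_BITS);
}

/* Build an offset; drop A/D when the offset field is too small. */
static uint64_t demo_make_offset(uint64_t pfn, bool young, bool dirty,
				 uint64_t max_off_plus_one)
{
	uint64_t off = pfn;

	assert(pfn <= DEMO_PFN_MASK);
	if (!demo_supports_ad(max_off_plus_one))
		return off;
	if (young)
		off |= DEMO_YOUNG;
	if (dirty)
		off |= DEMO_DIRTY;
	return off;
}

/* Recover the PFN regardless of whether A/D bits were stored. */
static uint64_t demo_offset_pfn(uint64_t off)
{
	return off & DEMO_PFN_MASK;
}

/* Without A/D support, report "old", matching the previous behavior. */
static bool demo_offset_young(uint64_t off, uint64_t max_off_plus_one)
{
	return demo_supports_ad(max_off_plus_one) && (off & DEMO_YOUNG);
}

/* Without A/D support, report "clean", matching the previous behavior. */
static bool demo_offset_dirty(uint64_t off, uint64_t max_off_plus_one)
{
	return demo_supports_ad(max_off_plus_one) && (off & DEMO_DIRTY);
}
```

With these assumed widths, an offset field of 42 bits or more keeps A/D, while a 41-bit field falls back to the old "old and clean" result.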

Comments

Huang, Ying Aug. 10, 2022, 6:30 a.m. UTC | #1
Peter Xu <peterx@redhat.com> writes:

> When page migration happens, we always ignore the young/dirty bit settings
> in the old pgtable, and marking the page as old in the new page table using
> either pte_mkold() or pmd_mkold(), and keeping the pte clean.
>
> That's fine from functional-wise, but that's not friendly to page reclaim
> because the moving page can be actively accessed within the procedure.  Not
> to mention hardware setting the young bit can bring quite some overhead on
> some systems, e.g. x86_64 needs a few hundreds nanoseconds to set the bit.
> The same slowdown problem to dirty bits when the memory is first written
> after page migration happened.
>
> Actually we can easily remember the A/D bit configuration and recover the
> information after the page is migrated.  To achieve it, define a new set of
> bits in the migration swap offset field to cache the A/D bits for old pte.
> Then when removing/recovering the migration entry, we can recover the A/D
> bits even if the page changed.
>
> One thing to mention is that here we used max_swapfile_size() to detect how
> many swp offset bits we have, and we'll only enable this feature if we know
> the swp offset can be big enough to store both the PFN value and the young
                                                                       ~~~~~
Nitpick: A/D

> bit.  Otherwise the A/D bits are dropped like before.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/swapops.h | 99 +++++++++++++++++++++++++++++++++++++++++
>  mm/huge_memory.c        | 18 +++++++-
>  mm/migrate.c            |  6 ++-
>  mm/migrate_device.c     |  4 ++
>  mm/rmap.c               |  5 ++-
>  5 files changed, 128 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index e1accbcd1136..0e9579b90659 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -8,6 +8,10 @@
>  
>  #ifdef CONFIG_MMU
>  
> +#ifdef CONFIG_SWAP
> +#include <linux/swapfile.h>
> +#endif	/* CONFIG_SWAP */

I don't think we need the comment here; the #ifdef is right above it.  But
this isn't a big deal.

Best Regards,
Huang, Ying
Peter Xu Aug. 10, 2022, 3:19 p.m. UTC | #2
On Wed, Aug 10, 2022 at 02:30:33PM +0800, Huang, Ying wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > When page migration happens, we always ignore the young/dirty bit settings
> > in the old pgtable, and marking the page as old in the new page table using
> > either pte_mkold() or pmd_mkold(), and keeping the pte clean.
> >
> > That's fine from functional-wise, but that's not friendly to page reclaim
> > because the moving page can be actively accessed within the procedure.  Not
> > to mention hardware setting the young bit can bring quite some overhead on
> > some systems, e.g. x86_64 needs a few hundreds nanoseconds to set the bit.
> > The same slowdown problem to dirty bits when the memory is first written
> > after page migration happened.
> >
> > Actually we can easily remember the A/D bit configuration and recover the
> > information after the page is migrated.  To achieve it, define a new set of
> > bits in the migration swap offset field to cache the A/D bits for old pte.
> > Then when removing/recovering the migration entry, we can recover the A/D
> > bits even if the page changed.
> >
> > One thing to mention is that here we used max_swapfile_size() to detect how
> > many swp offset bits we have, and we'll only enable this feature if we know
> > the swp offset can be big enough to store both the PFN value and the young
>                                                                        ~~~~~
> Nitpick: A/D

Fixed.

> 
> > bit.  Otherwise the A/D bits are dropped like before.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  include/linux/swapops.h | 99 +++++++++++++++++++++++++++++++++++++++++
> >  mm/huge_memory.c        | 18 +++++++-
> >  mm/migrate.c            |  6 ++-
> >  mm/migrate_device.c     |  4 ++
> >  mm/rmap.c               |  5 ++-
> >  5 files changed, 128 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> > index e1accbcd1136..0e9579b90659 100644
> > --- a/include/linux/swapops.h
> > +++ b/include/linux/swapops.h
> > @@ -8,6 +8,10 @@
> >  
> >  #ifdef CONFIG_MMU
> >  
> > +#ifdef CONFIG_SWAP
> > +#include <linux/swapfile.h>
> > +#endif	/* CONFIG_SWAP */
> 
> I don't think we need the comment here.  The #ifdef is too near.  But
> this isn't a big deal.

I'd slightly prefer keeping it (especially since Nadav complained about
missing comments on ifdefs in previous versions), because any ifdef can grow
as code is added to it.  Then it becomes hard to define what counts as
"near", and hard to say who should add the comment if I'm not the one.

Thanks,
Peter Xu Aug. 11, 2022, 3:19 p.m. UTC | #3
On Tue, Aug 09, 2022 at 06:00:58PM -0400, Peter Xu wrote:
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 27fb37d65476..699f821b8443 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -221,6 +221,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  			else
>  				entry = make_readable_migration_entry(
>  							page_to_pfn(page));
> +			if (pte_young(pte))
> +				entry = make_migration_entry_young(entry);
> +			if (pte_dirty(pte))
> +				entry = make_migration_entry_dirty(entry);
>  			swp_pte = swp_entry_to_pte(entry);
>  			if (pte_present(pte)) {
>  				if (pte_soft_dirty(pte))

This change needs to be wrapped with pte_present() at least.

I also just noticed that this change probably won't help anyway because:

  (1) When ram->device, the pte will finally be replaced with a device
      private entry, and device private entries do not yet support A/D, so
      A/D will be dropped again,

  (2) When device->ram, we are missing information on the A/D bits, and
      even if device private entries start to support A/D, it's still not
      clear whether we should also take device reads/writes into
      consideration for the page A/D bits to be accurate.

I think I'll probably keep the code there for completeness, but I think it
won't really help much until more things are done.
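
The pte_present() point above can be illustrated with a small userspace model. This is a sketch under assumptions, not the eventual fix: the `demo_*` names are hypothetical, and the bit positions simply mirror the x86 layout (present = bit 0, accessed = bit 5, dirty = bit 6). The idea is that for a non-present PTE those bit positions belong to the swap entry encoding, so they must not be read as A/D.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit positions, mirroring the x86 PTE layout. */
#define DEMO_PTE_PRESENT	(UINT64_C(1) << 0)
#define DEMO_PTE_ACCESSED	(UINT64_C(1) << 5)
#define DEMO_PTE_DIRTY		(UINT64_C(1) << 6)

/* Hypothetical flags to carry into the migration entry. */
#define DEMO_MIG_YOUNG	0x1u
#define DEMO_MIG_DIRTY	0x2u

/*
 * Return the A/D flags to remember.  A non-present PTE yields no flags,
 * because its bits 5 and 6 are part of the swap entry, not A/D state.
 */
static unsigned int demo_collect_ad(uint64_t pte)
{
	unsigned int flags = 0;

	if (!(pte & DEMO_PTE_PRESENT))
		return 0;
	if (pte & DEMO_PTE_ACCESSED)
		flags |= DEMO_MIG_YOUNG;
	if (pte & DEMO_PTE_DIRTY)
		flags |= DEMO_MIG_DIRTY;
	return flags;
}

/* Usage example: the same raw bits mean different things when non-present. */
static void demo_collect_ad_selfcheck(void)
{
	uint64_t raw = DEMO_PTE_ACCESSED | DEMO_PTE_DIRTY;

	assert(demo_collect_ad(raw) == 0);
	assert(demo_collect_ad(raw | DEMO_PTE_PRESENT) ==
	       (DEMO_MIG_YOUNG | DEMO_MIG_DIRTY));
}
```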
Huang, Ying Aug. 12, 2022, 2:32 a.m. UTC | #4
Peter Xu <peterx@redhat.com> writes:

> On Tue, Aug 09, 2022 at 06:00:58PM -0400, Peter Xu wrote:
>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index 27fb37d65476..699f821b8443 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -221,6 +221,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>  			else
>>  				entry = make_readable_migration_entry(
>>  							page_to_pfn(page));
>> +			if (pte_young(pte))
>> +				entry = make_migration_entry_young(entry);
>> +			if (pte_dirty(pte))
>> +				entry = make_migration_entry_dirty(entry);
>>  			swp_pte = swp_entry_to_pte(entry);
>>  			if (pte_present(pte)) {
>>  				if (pte_soft_dirty(pte))
>
> This change needs to be wrapped with pte_present() at least..
>
> I also just noticed that this change probably won't help anyway because:
>
>   (1) When ram->device, the pte will finally be replaced with a device
>       private entry, and device private entry does not yet support A/D, it
>       means A/D will be dropped again,
>
>   (2) When device->ram, we are missing information on either A/D bits, or
>       even if device private entries start to suport A/D, it's still not
>       clear whether we should take device read/write into considerations
>       too on the page A/D bits to be accurate.
>
> I think I'll probably keep the code there for completeness, but I think it
> won't really help much until more things are done.

It appears that there are more issues.  Between "pte = *ptep" and the pte
clear, the CPU may set the A/D bits in the PTE, so we may need to update pte
when clearing the PTE.  Also, I don't see the TLB being flushed in some
cases after the PTE is cleared.

Best Regards,
Huang, Ying
Peter Xu Aug. 15, 2022, 7:18 p.m. UTC | #5
On Fri, Aug 12, 2022 at 10:32:48AM +0800, Huang, Ying wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Tue, Aug 09, 2022 at 06:00:58PM -0400, Peter Xu wrote:
> >> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> >> index 27fb37d65476..699f821b8443 100644
> >> --- a/mm/migrate_device.c
> >> +++ b/mm/migrate_device.c
> >> @@ -221,6 +221,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >>  			else
> >>  				entry = make_readable_migration_entry(
> >>  							page_to_pfn(page));
> >> +			if (pte_young(pte))
> >> +				entry = make_migration_entry_young(entry);
> >> +			if (pte_dirty(pte))
> >> +				entry = make_migration_entry_dirty(entry);
> >>  			swp_pte = swp_entry_to_pte(entry);
> >>  			if (pte_present(pte)) {
> >>  				if (pte_soft_dirty(pte))
> >
> > This change needs to be wrapped with pte_present() at least..
> >
> > I also just noticed that this change probably won't help anyway because:
> >
> >   (1) When ram->device, the pte will finally be replaced with a device
> >       private entry, and device private entry does not yet support A/D, it
> >       means A/D will be dropped again,
> >
> >   (2) When device->ram, we are missing information on either A/D bits, or
> >       even if device private entries start to suport A/D, it's still not
> >       clear whether we should take device read/write into considerations
> >       too on the page A/D bits to be accurate.
> >
> > I think I'll probably keep the code there for completeness, but I think it
> > won't really help much until more things are done.
> 
> It appears that there are more issues.  Between "pte = *ptep" and pte
> clear, CPU may set A/D bit in PTE, so we may need to update pte when
> clearing PTE.

Agreed.  I didn't see it as a huge problem with the current code, but it
would be better that way.

> And I don't find the TLB is flushed in some cases after PTE is cleared.

I think it's okay not to flush the TLB if the pte is not present.  But maybe
you're talking about something else?

Thanks,
Nadav Amit Aug. 15, 2022, 8:52 p.m. UTC | #6
On Aug 15, 2022, at 12:18 PM, Peter Xu <peterx@redhat.com> wrote:

> On Fri, Aug 12, 2022 at 10:32:48AM +0800, Huang, Ying wrote:
>> Peter Xu <peterx@redhat.com> writes:
>> 
>>> On Tue, Aug 09, 2022 at 06:00:58PM -0400, Peter Xu wrote:
>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>> index 27fb37d65476..699f821b8443 100644
>>>> --- a/mm/migrate_device.c
>>>> +++ b/mm/migrate_device.c
>>>> @@ -221,6 +221,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>> 			else
>>>> 				entry = make_readable_migration_entry(
>>>> 							page_to_pfn(page));
>>>> +			if (pte_young(pte))
>>>> +				entry = make_migration_entry_young(entry);
>>>> +			if (pte_dirty(pte))
>>>> +				entry = make_migration_entry_dirty(entry);
>>>> 			swp_pte = swp_entry_to_pte(entry);
>>>> 			if (pte_present(pte)) {
>>>> 				if (pte_soft_dirty(pte))
>>> 
>>> This change needs to be wrapped with pte_present() at least..
>>> 
>>> I also just noticed that this change probably won't help anyway because:
>>> 
>>>  (1) When ram->device, the pte will finally be replaced with a device
>>>      private entry, and device private entry does not yet support A/D, it
>>>      means A/D will be dropped again,
>>> 
>>>  (2) When device->ram, we are missing information on either A/D bits, or
>>>      even if device private entries start to suport A/D, it's still not
>>>      clear whether we should take device read/write into considerations
>>>      too on the page A/D bits to be accurate.
>>> 
>>> I think I'll probably keep the code there for completeness, but I think it
>>> won't really help much until more things are done.
>> 
>> It appears that there are more issues.  Between "pte = *ptep" and pte
>> clear, CPU may set A/D bit in PTE, so we may need to update pte when
>> clearing PTE.
> 
> Agreed, I didn't see it a huge problem with current code, but it should be
> better in that way.
> 
>> And I don't find the TLB is flushed in some cases after PTE is cleared.
> 
> I think it's okay to not flush tlb if pte not present.  But maybe you're
> talking about something else?

I think Huang refers to the situation in which the PTE is cleared but not yet
flushed, and then A/D is set by the hardware.

At least on x86, the hardware is not supposed to do so. The only case I
remember (and sometimes misremember) is the KNL erratum, which perhaps
needs to be considered:

https://lore.kernel.org/all/20160708001911.9A3FD2B6@viggo.jf.intel.com/
Nadav Amit Aug. 15, 2022, 9:03 p.m. UTC | #7
On Aug 15, 2022, at 1:52 PM, Nadav Amit <nadav.amit@gmail.com> wrote:

> On Aug 15, 2022, at 12:18 PM, Peter Xu <peterx@redhat.com> wrote:
> 
>> On Fri, Aug 12, 2022 at 10:32:48AM +0800, Huang, Ying wrote:
>>> Peter Xu <peterx@redhat.com> writes:
>>> 
>>>> On Tue, Aug 09, 2022 at 06:00:58PM -0400, Peter Xu wrote:
>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>> index 27fb37d65476..699f821b8443 100644
>>>>> --- a/mm/migrate_device.c
>>>>> +++ b/mm/migrate_device.c
>>>>> @@ -221,6 +221,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>> 			else
>>>>> 				entry = make_readable_migration_entry(
>>>>> 							page_to_pfn(page));
>>>>> +			if (pte_young(pte))
>>>>> +				entry = make_migration_entry_young(entry);
>>>>> +			if (pte_dirty(pte))
>>>>> +				entry = make_migration_entry_dirty(entry);
>>>>> 			swp_pte = swp_entry_to_pte(entry);
>>>>> 			if (pte_present(pte)) {
>>>>> 				if (pte_soft_dirty(pte))
>>>> 
>>>> This change needs to be wrapped with pte_present() at least..
>>>> 
>>>> I also just noticed that this change probably won't help anyway because:
>>>> 
>>>> (1) When ram->device, the pte will finally be replaced with a device
>>>>     private entry, and device private entry does not yet support A/D, it
>>>>     means A/D will be dropped again,
>>>> 
>>>> (2) When device->ram, we are missing information on either A/D bits, or
>>>>     even if device private entries start to suport A/D, it's still not
>>>>     clear whether we should take device read/write into considerations
>>>>     too on the page A/D bits to be accurate.
>>>> 
>>>> I think I'll probably keep the code there for completeness, but I think it
>>>> won't really help much until more things are done.
>>> 
>>> It appears that there are more issues.  Between "pte = *ptep" and pte
>>> clear, CPU may set A/D bit in PTE, so we may need to update pte when
>>> clearing PTE.
>> 
>> Agreed, I didn't see it a huge problem with current code, but it should be
>> better in that way.
>> 
>>> And I don't find the TLB is flushed in some cases after PTE is cleared.
>> 
>> I think it's okay to not flush tlb if pte not present.  But maybe you're
>> talking about something else?
> 
> I think Huang refers to situation in which the PTE is cleared, still not
> flushed, and then A/D is being set by the hardware.
> 
> At least on x86, the hardware is not supposed to do so. The only case I
> remember (and sometimes misremembers) is with KNL erratum, which perhaps
> needs to be considered:
> 
> https://lore.kernel.org/all/20160708001911.9A3FD2B6@viggo.jf.intel.com/

I keep misremembering this erratum. IIRC, the erratum says that the
accessed/dirty bits might be set, but that does not mean a write is possible
after the PTE is cleared (i.e., accessed/dirty might be set on the
non-present PTE, but the access itself would fail). So it is not an issue in
this case: losing A/D would not affect correctness, since the access should
fail.

Dave Hansen hates it when I get confused about this one, so I'm cc'ing him
in case he wants to confirm.

[ Having said all of that, in general the lack of regard for
  mm->tlb_flush_pending is always concerning in such functions. ]
Huang, Ying Aug. 17, 2022, 1:49 a.m. UTC | #8
Nadav Amit <nadav.amit@gmail.com> writes:

> On Aug 15, 2022, at 12:18 PM, Peter Xu <peterx@redhat.com> wrote:
>
>> On Fri, Aug 12, 2022 at 10:32:48AM +0800, Huang, Ying wrote:
>>> Peter Xu <peterx@redhat.com> writes:
>>> 
>>>> On Tue, Aug 09, 2022 at 06:00:58PM -0400, Peter Xu wrote:
>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>> index 27fb37d65476..699f821b8443 100644
>>>>> --- a/mm/migrate_device.c
>>>>> +++ b/mm/migrate_device.c
>>>>> @@ -221,6 +221,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>> 			else
>>>>> 				entry = make_readable_migration_entry(
>>>>> 							page_to_pfn(page));
>>>>> +			if (pte_young(pte))
>>>>> +				entry = make_migration_entry_young(entry);
>>>>> +			if (pte_dirty(pte))
>>>>> +				entry = make_migration_entry_dirty(entry);
>>>>> 			swp_pte = swp_entry_to_pte(entry);
>>>>> 			if (pte_present(pte)) {
>>>>> 				if (pte_soft_dirty(pte))
>>>> 
>>>> This change needs to be wrapped with pte_present() at least..
>>>> 
>>>> I also just noticed that this change probably won't help anyway because:
>>>> 
>>>>  (1) When ram->device, the pte will finally be replaced with a device
>>>>      private entry, and device private entry does not yet support A/D, it
>>>>      means A/D will be dropped again,
>>>> 
>>>>  (2) When device->ram, we are missing information on either A/D bits, or
>>>>      even if device private entries start to suport A/D, it's still not
>>>>      clear whether we should take device read/write into considerations
>>>>      too on the page A/D bits to be accurate.
>>>> 
>>>> I think I'll probably keep the code there for completeness, but I think it
>>>> won't really help much until more things are done.
>>> 
>>> It appears that there are more issues.  Between "pte = *ptep" and pte
>>> clear, CPU may set A/D bit in PTE, so we may need to update pte when
>>> clearing PTE.
>> 
>> Agreed, I didn't see it a huge problem with current code, but it should be
>> better in that way.
>> 
>>> And I don't find the TLB is flushed in some cases after PTE is cleared.
>> 
>> I think it's okay to not flush tlb if pte not present.  But maybe you're
>> talking about something else?
>
> I think Huang refers to situation in which the PTE is cleared, still not
> flushed, and then A/D is being set by the hardware.

No.  The situation I have in mind is that a PTE with A/D set is cleared but
not flushed.  Then a parallel mprotect or munmap may cause race conditions.
As Alistair pointed out in another thread [1], the TLB is flushed after the
PTL is unlocked.  But I think we need to flush the TLB before unlocking.
This has been fixed in Alistair's latest version [2].

[1] https://lore.kernel.org/lkml/87r11gvrx6.fsf@nvdebian.thelocal/
[2] https://lore.kernel.org/lkml/6e77914685ede036c419fa65b6adc27f25a6c3e9.1660635033.git-series.apopple@nvidia.com/

Best Regards,
Huang, Ying
Dave Hansen Aug. 18, 2022, 4:39 p.m. UTC | #9
On 8/15/22 14:03, Nadav Amit wrote:
>>
>> At least on x86, the hardware is not supposed to do so. The only case I
>> remember (and sometimes misremembers) is with KNL erratum, which perhaps
>> needs to be considered:
>>
>> https://lore.kernel.org/all/20160708001911.9A3FD2B6@viggo.jf.intel.com/
> I keep not remembering this erratum correctly. IIRC, the erratum says that
> the access/dirty might be set, but it does not mean that a write is possible
> after the PTE is cleared (i.e., the dirty/access might be set on the
> non-present PTE, but the access itself would fail). So it is not an issue in
> this case - losing A/D would not impact correctness since the access should
> fail.
> 
> Dave Hansen hates when I get confused with this one, but I cc him if he
> wants to confirm.

Right.

The issue is strictly with the page walker setting Accessed/Dirty in a
racy way.  The TLB still has accurate contents at all times.

Patch

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index e1accbcd1136..0e9579b90659 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -8,6 +8,10 @@ 
 
 #ifdef CONFIG_MMU
 
+#ifdef CONFIG_SWAP
+#include <linux/swapfile.h>
+#endif	/* CONFIG_SWAP */
+
 /*
  * swapcache pages are stored in the swapper_space radix tree.  We want to
  * get good packing density in that tree, so the index should be dense in
@@ -35,6 +39,31 @@ 
 #endif	/* MAX_PHYSMEM_BITS */
 #define SWP_PFN_MASK			((1UL << SWP_PFN_BITS) - 1)
 
+/**
+ * Migration swap entry specific bitfield definitions.  Layout:
+ *
+ *   |----------+--------------------|
+ *   | swp_type | swp_offset         |
+ *   |----------+--------+-+-+-------|
+ *   |          | resv   |D|A|  PFN  |
+ *   |----------+--------+-+-+-------|
+ *
+ * @SWP_MIG_YOUNG_BIT: Whether the page used to have young bit set (bit A)
+ * @SWP_MIG_DIRTY_BIT: Whether the page used to have dirty bit set (bit D)
+ *
+ * Note: A/D bits will be stored in migration entries iff there're enough
+ * free bits in arch specific swp offset.  By default we'll ignore A/D bits
+ * when migrating a page.  Please refer to migration_entry_supports_ad()
+ * for more information.  If there're more bits besides PFN and A/D bits,
+ * they should be reserved and always be zeros.
+ */
+#define SWP_MIG_YOUNG_BIT		(SWP_PFN_BITS)
+#define SWP_MIG_DIRTY_BIT		(SWP_PFN_BITS + 1)
+#define SWP_MIG_TOTAL_BITS		(SWP_PFN_BITS + 2)
+
+#define SWP_MIG_YOUNG			BIT(SWP_MIG_YOUNG_BIT)
+#define SWP_MIG_DIRTY			BIT(SWP_MIG_DIRTY_BIT)
+
 static inline bool is_pfn_swap_entry(swp_entry_t entry);
 
 /* Clear all flags but only keep swp_entry_t related information */
@@ -265,6 +294,57 @@  static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
 	return swp_entry(SWP_MIGRATION_WRITE, offset);
 }
 
+/*
+ * Returns whether the host has large enough swap offset field to support
+ * carrying over pgtable A/D bits for page migrations.  The result is
+ * pretty much arch specific.
+ */
+static inline bool migration_entry_supports_ad(void)
+{
+	/*
+	 * max_swapfile_size() returns the max supported swp-offset plus 1.
+	 * We can support the migration A/D bits iff the pfn swap entry has
+	 * the offset large enough to cover all of them (PFN, A & D bits).
+	 */
+#ifdef CONFIG_SWAP
+	return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);
+#else  /* CONFIG_SWAP */
+	return false;
+#endif	/* CONFIG_SWAP */
+}
+
+static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
+{
+	if (migration_entry_supports_ad())
+		return swp_entry(swp_type(entry),
+				 swp_offset(entry) | SWP_MIG_YOUNG);
+	return entry;
+}
+
+static inline bool is_migration_entry_young(swp_entry_t entry)
+{
+	if (migration_entry_supports_ad())
+		return swp_offset(entry) & SWP_MIG_YOUNG;
+	/* Keep the old behavior of aging page after migration */
+	return false;
+}
+
+static inline swp_entry_t make_migration_entry_dirty(swp_entry_t entry)
+{
+	if (migration_entry_supports_ad())
+		return swp_entry(swp_type(entry),
+				 swp_offset(entry) | SWP_MIG_DIRTY);
+	return entry;
+}
+
+static inline bool is_migration_entry_dirty(swp_entry_t entry)
+{
+	if (migration_entry_supports_ad())
+		return swp_offset(entry) & SWP_MIG_DIRTY;
+	/* Keep the old behavior of clean page after migration */
+	return false;
+}
+
 extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
 					spinlock_t *ptl);
 extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
@@ -311,6 +391,25 @@  static inline int is_readable_migration_entry(swp_entry_t entry)
 	return 0;
 }
 
+static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
+{
+	return entry;
+}
+
+static inline bool is_migration_entry_young(swp_entry_t entry)
+{
+	return false;
+}
+
+static inline swp_entry_t make_migration_entry_dirty(swp_entry_t entry)
+{
+	return entry;
+}
+
+static inline bool is_migration_entry_dirty(swp_entry_t entry)
+{
+	return false;
+}
 #endif	/* CONFIG_MIGRATION */
 
 typedef unsigned long pte_marker;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e8e78d1bac5f..1644e9f59d73 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2089,7 +2089,8 @@  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		write = is_writable_migration_entry(entry);
 		if (PageAnon(page))
 			anon_exclusive = is_readable_exclusive_migration_entry(entry);
-		young = false;
+		young = is_migration_entry_young(entry);
+		dirty = is_migration_entry_dirty(entry);
 		soft_dirty = pmd_swp_soft_dirty(old_pmd);
 		uffd_wp = pmd_swp_uffd_wp(old_pmd);
 	} else {
@@ -2148,6 +2149,10 @@  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			else
 				swp_entry = make_readable_migration_entry(
 							page_to_pfn(page + i));
+			if (young)
+				swp_entry = make_migration_entry_young(swp_entry);
+			if (dirty)
+				swp_entry = make_migration_entry_dirty(swp_entry);
 			entry = swp_entry_to_pte(swp_entry);
 			if (soft_dirty)
 				entry = pte_swp_mksoft_dirty(entry);
@@ -3157,6 +3162,10 @@  int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 		entry = make_readable_exclusive_migration_entry(page_to_pfn(page));
 	else
 		entry = make_readable_migration_entry(page_to_pfn(page));
+	if (pmd_young(pmdval))
+		entry = make_migration_entry_young(entry);
+	if (pmd_dirty(pmdval))
+		entry = make_migration_entry_dirty(entry);
 	pmdswp = swp_entry_to_pmd(entry);
 	if (pmd_soft_dirty(pmdval))
 		pmdswp = pmd_swp_mksoft_dirty(pmdswp);
@@ -3182,13 +3191,18 @@  void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 
 	entry = pmd_to_swp_entry(*pvmw->pmd);
 	get_page(new);
-	pmde = pmd_mkold(mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot)));
+	pmde = mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot));
 	if (pmd_swp_soft_dirty(*pvmw->pmd))
 		pmde = pmd_mksoft_dirty(pmde);
 	if (is_writable_migration_entry(entry))
 		pmde = maybe_pmd_mkwrite(pmde, vma);
 	if (pmd_swp_uffd_wp(*pvmw->pmd))
 		pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde));
+	if (!is_migration_entry_young(entry))
+		pmde = pmd_mkold(pmde);
+	/* NOTE: this may contain setting soft-dirty on some archs */
+	if (PageDirty(new) && is_migration_entry_dirty(entry))
+		pmde = pmd_mkdirty(pmde);
 
 	if (PageAnon(new)) {
 		rmap_t rmap_flags = RMAP_COMPOUND;
diff --git a/mm/migrate.c b/mm/migrate.c
index 6a1597c92261..0433a71d2bee 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -198,7 +198,7 @@  static bool remove_migration_pte(struct folio *folio,
 #endif
 
 		folio_get(folio);
-		pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot)));
+		pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
 		if (pte_swp_soft_dirty(*pvmw.pte))
 			pte = pte_mksoft_dirty(pte);
 
@@ -206,6 +206,10 @@  static bool remove_migration_pte(struct folio *folio,
 		 * Recheck VMA as permissions can change since migration started
 		 */
 		entry = pte_to_swp_entry(*pvmw.pte);
+		if (!is_migration_entry_young(entry))
+			pte = pte_mkold(pte);
+		if (folio_test_dirty(folio) && is_migration_entry_dirty(entry))
+			pte = pte_mkdirty(pte);
 		if (is_writable_migration_entry(entry))
 			pte = maybe_mkwrite(pte, vma);
 		else if (pte_swp_uffd_wp(*pvmw.pte))
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 27fb37d65476..699f821b8443 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -221,6 +221,10 @@  static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			else
 				entry = make_readable_migration_entry(
 							page_to_pfn(page));
+			if (pte_young(pte))
+				entry = make_migration_entry_young(entry);
+			if (pte_dirty(pte))
+				entry = make_migration_entry_dirty(entry);
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_present(pte)) {
 				if (pte_soft_dirty(pte))
diff --git a/mm/rmap.c b/mm/rmap.c
index af775855e58f..28aef434ea41 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2065,7 +2065,10 @@  static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			else
 				entry = make_readable_migration_entry(
 							page_to_pfn(subpage));
-
+			if (pte_young(pteval))
+				entry = make_migration_entry_young(entry);
+			if (pte_dirty(pteval))
+				entry = make_migration_entry_dirty(entry);
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);