
[v2,2/2] mm: Remember young/dirty bit for page migrations

Message ID 20220804203952.53665-3-peterx@redhat.com (mailing list archive)
State New
Series mm: Remember a/d bits for migration entries

Commit Message

Peter Xu Aug. 4, 2022, 8:39 p.m. UTC
When page migration happens, we always ignore the young/dirty bit settings
in the old pgtable, mark the page as old in the new page table using
either pte_mkold() or pmd_mkold(), and keep the pte clean.

That's fine functionally, but it's not friendly to page reclaim because the
page being moved can be actively accessed during the procedure.  Not to
mention that having hardware set the young bit can bring quite some overhead
on some systems, e.g. x86_64 needs a few hundred nanoseconds to set the bit.
The same slowdown applies to the dirty bit when the memory is first written
after page migration.

Actually we can easily remember the A/D bit configuration and recover the
information after the page is migrated.  To achieve it, define a new set of
bits in the migration swap offset field to cache the A/D bits of the old pte.
Then when removing/recovering the migration entry, we can recover the A/D
bits even if the underlying page changed.

One thing to mention is that here we use max_swapfile_size() to detect how
many swp offset bits we have, and we'll only enable this feature if we know
the swp offset is big enough to store both the PFN value and the A/D bits.
Otherwise the A/D bits are dropped like before.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/swapops.h | 91 +++++++++++++++++++++++++++++++++++++++++
 mm/huge_memory.c        | 26 +++++++++++-
 mm/migrate.c            |  6 ++-
 mm/migrate_device.c     |  4 ++
 mm/rmap.c               |  5 ++-
 5 files changed, 128 insertions(+), 4 deletions(-)

Comments

Nadav Amit Aug. 4, 2022, 10:40 p.m. UTC | #1
On Aug 4, 2022, at 1:39 PM, Peter Xu <peterx@redhat.com> wrote:

> When page migration happens, we always ignore the young/dirty bit settings
> in the old pgtable, and marking the page as old in the new page table using
> either pte_mkold() or pmd_mkold(), and keeping the pte clean.
> 
> That's fine from functional-wise, but that's not friendly to page reclaim
> because the moving page can be actively accessed within the procedure.  Not
> to mention hardware setting the young bit can bring quite some overhead on
> some systems, e.g. x86_64 needs a few hundreds nanoseconds to set the bit.
> The same slowdown problem to dirty bits when the memory is first written
> after page migration happened.
> 
> Actually we can easily remember the A/D bit configuration and recover the
> information after the page is migrated.  To achieve it, define a new set of
> bits in the migration swap offset field to cache the A/D bits for old pte.
> Then when removing/recovering the migration entry, we can recover the A/D
> bits even if the page changed.
> 
> One thing to mention is that here we used max_swapfile_size() to detect how
> many swp offset bits we have, and we'll only enable this feature if we know
> the swp offset can be big enough to store both the PFN value and the young
> bit.  Otherwise the A/D bits are dropped like before.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> include/linux/swapops.h | 91 +++++++++++++++++++++++++++++++++++++++++
> mm/huge_memory.c        | 26 +++++++++++-
> mm/migrate.c            |  6 ++-
> mm/migrate_device.c     |  4 ++
> mm/rmap.c               |  5 ++-
> 5 files changed, 128 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 1d17e4bb3d2f..34aa448ac6ee 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -8,6 +8,8 @@
> 
> #ifdef CONFIG_MMU
> 
> +#include <linux/swapfile.h>

Shouldn’t the ifdef go into linux/swapfile.h if that’s the right thing to do
to prevent others from mistakenly including it?

> +
> /*
>  * swapcache pages are stored in the swapper_space radix tree.  We want to
>  * get good packing density in that tree, so the index should be dense in
> @@ -35,6 +37,24 @@
> #endif
> #define SWP_PFN_MASK			((1UL << SWP_PFN_BITS) - 1)
> 
> +/**
> + * Migration swap entry specific bitfield definitions.
> + *
> + * @SWP_MIG_YOUNG_BIT: Whether the page used to have young bit set
> + * @SWP_MIG_DIRTY_BIT: Whether the page used to have dirty bit set
> + *
> + * Note: these bits will be stored in migration entries iff there're enough
> + * free bits in arch specific swp offset.  By default we'll ignore A/D bits
> + * when migrating a page.  Please refer to migration_entry_supports_ad()
> + * for more information.
> + */
> +#define SWP_MIG_YOUNG_BIT		(SWP_PFN_BITS)
> +#define SWP_MIG_DIRTY_BIT		(SWP_PFN_BITS + 1)
> +#define SWP_MIG_TOTAL_BITS		(SWP_PFN_BITS + 2)
> +
> +#define SWP_MIG_YOUNG			(1UL << SWP_MIG_YOUNG_BIT)
> +#define SWP_MIG_DIRTY			(1UL << SWP_MIG_DIRTY_BIT)

Any reason not to use BIT(x) ?
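
A minimal sketch of what the BIT()-based spelling could look like (just an
illustration, assuming <linux/bits.h> is already available here):

/* Hypothetical alternative: BIT(x) expands to (1UL << (x)) */
#define SWP_MIG_YOUNG			BIT(SWP_MIG_YOUNG_BIT)
#define SWP_MIG_DIRTY			BIT(SWP_MIG_DIRTY_BIT)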

> +
> static inline bool is_pfn_swap_entry(swp_entry_t entry);
> 
> /* Clear all flags but only keep swp_entry_t related information */
> @@ -265,6 +285,57 @@ static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
> 	return swp_entry(SWP_MIGRATION_WRITE, offset);
> }
> 
> +/*
> + * Returns whether the host has large enough swap offset field to support
> + * carrying over pgtable A/D bits for page migrations.  The result is
> + * pretty much arch specific.
> + */
> +static inline bool migration_entry_supports_ad(void)
> +{
> +	/*
> +	 * max_swapfile_size() returns the max supported swp-offset plus 1.
> +	 * We can support the migration A/D bits iff the pfn swap entry has
> +	 * the offset large enough to cover all of them (PFN, A & D bits).
> +	 */
> +#ifdef CONFIG_SWAP
> +	return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);

This is an actual function call (unless LTO has some trick). A bit of a
shame it cannot at least be memoized.

Or at least mark max_swapfile_size() as __attribute_const__ so it would not
be called twice for make_migration_entry_young() and
make_migration_entry_dirty().
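
As a rough illustration only (a sketch, assuming the prototype stays in
linux/swapfile.h), the annotation could look like:

/* Hypothetical: let the compiler assume repeated calls return the same value */
extern unsigned long max_swapfile_size(void) __attribute_const__;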

> +#else
> +	return false;
> +#endif
> +}
> +
> +static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
> +{
> +	if (migration_entry_supports_ad())
> +		return swp_entry(swp_type(entry),
> +				 swp_offset(entry) | SWP_MIG_YOUNG);
> +	return entry;
> +}
> +
> +static inline bool is_migration_entry_young(swp_entry_t entry)
> +{
> +	if (migration_entry_supports_ad())
> +		return swp_offset(entry) & SWP_MIG_YOUNG;
> +	/* Keep the old behavior of aging page after migration */
> +	return false;
> +}
> +
> +static inline swp_entry_t make_migration_entry_dirty(swp_entry_t entry)
> +{
> +	if (migration_entry_supports_ad())
> +		return swp_entry(swp_type(entry),
> +				 swp_offset(entry) | SWP_MIG_DIRTY);
> +	return entry;
> +}
> +
> +static inline bool is_migration_entry_dirty(swp_entry_t entry)
> +{
> +	if (migration_entry_supports_ad())
> +		return swp_offset(entry) & SWP_MIG_YOUNG_BIT;

Shouldn’t it be SWP_MIG_DIRTY ?

> +	/* Keep the old behavior of clean page after migration */
> +	return false;
> +}
> +
> extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
> 					spinlock_t *ptl);
> extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> @@ -311,6 +382,26 @@ static inline int is_readable_migration_entry(swp_entry_t entry)
> 	return 0;
> }
> 
> +static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
> +{
> +	return entry;
> +}
> +
> +static inline bool is_migration_entry_young(swp_entry_t entry)
> +{
> +	return false;
> +}
> +
> +static inline swp_entry_t make_migration_entry_dirty(swp_entry_t entry)
> +{
> +	return entry;
> +}
> +
> +static inline bool is_migration_entry_dirty(swp_entry_t entry)
> +{
> +	return false;
> +}
> +
> #endif

While at it, can you change to:

#endif /* CONFIG_MIGRATION */

[ these ifdefs burn my eyes ]

Other than that looks good.

Thanks,
Nadav
David Hildenbrand Aug. 5, 2022, 12:17 p.m. UTC | #2
On 04.08.22 22:39, Peter Xu wrote:
> When page migration happens, we always ignore the young/dirty bit settings
> in the old pgtable, and marking the page as old in the new page table using
> either pte_mkold() or pmd_mkold(), and keeping the pte clean.
> 
> That's fine from functional-wise, but that's not friendly to page reclaim
> because the moving page can be actively accessed within the procedure.  Not
> to mention hardware setting the young bit can bring quite some overhead on
> some systems, e.g. x86_64 needs a few hundreds nanoseconds to set the bit.
> The same slowdown problem to dirty bits when the memory is first written
> after page migration happened.
> 
> Actually we can easily remember the A/D bit configuration and recover the
> information after the page is migrated.  To achieve it, define a new set of
> bits in the migration swap offset field to cache the A/D bits for old pte.
> Then when removing/recovering the migration entry, we can recover the A/D
> bits even if the page changed.
> 
> One thing to mention is that here we used max_swapfile_size() to detect how
> many swp offset bits we have, and we'll only enable this feature if we know
> the swp offset can be big enough to store both the PFN value and the young
> bit.  Otherwise the A/D bits are dropped like before.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/swapops.h | 91 +++++++++++++++++++++++++++++++++++++++++
>  mm/huge_memory.c        | 26 +++++++++++-
>  mm/migrate.c            |  6 ++-
>  mm/migrate_device.c     |  4 ++
>  mm/rmap.c               |  5 ++-
>  5 files changed, 128 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 1d17e4bb3d2f..34aa448ac6ee 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -8,6 +8,8 @@
>  
>  #ifdef CONFIG_MMU
>  
> +#include <linux/swapfile.h>
> +
>  /*
>   * swapcache pages are stored in the swapper_space radix tree.  We want to
>   * get good packing density in that tree, so the index should be dense in
> @@ -35,6 +37,24 @@
>  #endif
>  #define SWP_PFN_MASK			((1UL << SWP_PFN_BITS) - 1)
>  
> +/**
> + * Migration swap entry specific bitfield definitions.
> + *
> + * @SWP_MIG_YOUNG_BIT: Whether the page used to have young bit set
> + * @SWP_MIG_DIRTY_BIT: Whether the page used to have dirty bit set
> + *
> + * Note: these bits will be stored in migration entries iff there're enough
> + * free bits in arch specific swp offset.  By default we'll ignore A/D bits
> + * when migrating a page.  Please refer to migration_entry_supports_ad()
> + * for more information.
> + */
> +#define SWP_MIG_YOUNG_BIT		(SWP_PFN_BITS)
> +#define SWP_MIG_DIRTY_BIT		(SWP_PFN_BITS + 1)
> +#define SWP_MIG_TOTAL_BITS		(SWP_PFN_BITS + 2)
> +
> +#define SWP_MIG_YOUNG			(1UL << SWP_MIG_YOUNG_BIT)
> +#define SWP_MIG_DIRTY			(1UL << SWP_MIG_DIRTY_BIT)
> +
>  static inline bool is_pfn_swap_entry(swp_entry_t entry);
>  
>  /* Clear all flags but only keep swp_entry_t related information */
> @@ -265,6 +285,57 @@ static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
>  	return swp_entry(SWP_MIGRATION_WRITE, offset);
>  }
>  
> +/*
> + * Returns whether the host has large enough swap offset field to support
> + * carrying over pgtable A/D bits for page migrations.  The result is
> + * pretty much arch specific.
> + */
> +static inline bool migration_entry_supports_ad(void)
> +{
> +	/*
> +	 * max_swapfile_size() returns the max supported swp-offset plus 1.
> +	 * We can support the migration A/D bits iff the pfn swap entry has
> +	 * the offset large enough to cover all of them (PFN, A & D bits).
> +	 */
> +#ifdef CONFIG_SWAP
> +	return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);
> +#else
> +	return false;
> +#endif
> +}


This looks much cleaner to me. It might be helpful to draw an ascii
picture of where exactly these bits reside inside the offset.
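
For example, something along these lines (an illustrative sketch only; the
actual number of PFN bits is arch specific):

/*
 * Hypothetical layout of a migration swp offset (low bits on the right):
 *
 *  | SWP_MIG_DIRTY_BIT | SWP_MIG_YOUNG_BIT |       pfn bits        |
 *  |<----- 1 bit ----->|<----- 1 bit ----->|<--- SWP_PFN_BITS ---->|
 */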

> +
> +static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
> +{
> +	if (migration_entry_supports_ad())

Do we maybe want to turn that into a static key and enable it once and
for all? As Nadav says, the repeated max_swapfile_size() calls/checks
might be worth optimizing out.
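
Roughly what that could look like (a sketch under the assumption that the key
can be flipped once during boot; the names are made up):

/* Hypothetical: defined in one .c file, declared in the header */
DEFINE_STATIC_KEY_FALSE(migration_ad_supported);

static int __init migration_ad_init(void)
{
#ifdef CONFIG_SWAP
	if (max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS))
		static_branch_enable(&migration_ad_supported);
#endif
	return 0;
}
subsys_initcall(migration_ad_init);

/* In the header (with DECLARE_STATIC_KEY_FALSE(migration_ad_supported)),
 * the runtime check then collapses into a jump label test: */
static inline bool migration_entry_supports_ad(void)
{
	return static_branch_likely(&migration_ad_supported);
}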

[...]
Peter Xu Aug. 5, 2022, 4:30 p.m. UTC | #3
On Thu, Aug 04, 2022 at 03:40:57PM -0700, Nadav Amit wrote:
> On Aug 4, 2022, at 1:39 PM, Peter Xu <peterx@redhat.com> wrote:
> 
> > When page migration happens, we always ignore the young/dirty bit settings
> > in the old pgtable, and marking the page as old in the new page table using
> > either pte_mkold() or pmd_mkold(), and keeping the pte clean.
> > 
> > That's fine from functional-wise, but that's not friendly to page reclaim
> > because the moving page can be actively accessed within the procedure.  Not
> > to mention hardware setting the young bit can bring quite some overhead on
> > some systems, e.g. x86_64 needs a few hundreds nanoseconds to set the bit.
> > The same slowdown problem to dirty bits when the memory is first written
> > after page migration happened.
> > 
> > Actually we can easily remember the A/D bit configuration and recover the
> > information after the page is migrated.  To achieve it, define a new set of
> > bits in the migration swap offset field to cache the A/D bits for old pte.
> > Then when removing/recovering the migration entry, we can recover the A/D
> > bits even if the page changed.
> > 
> > One thing to mention is that here we used max_swapfile_size() to detect how
> > many swp offset bits we have, and we'll only enable this feature if we know
> > the swp offset can be big enough to store both the PFN value and the young
> > bit.  Otherwise the A/D bits are dropped like before.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > include/linux/swapops.h | 91 +++++++++++++++++++++++++++++++++++++++++
> > mm/huge_memory.c        | 26 +++++++++++-
> > mm/migrate.c            |  6 ++-
> > mm/migrate_device.c     |  4 ++
> > mm/rmap.c               |  5 ++-
> > 5 files changed, 128 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> > index 1d17e4bb3d2f..34aa448ac6ee 100644
> > --- a/include/linux/swapops.h
> > +++ b/include/linux/swapops.h
> > @@ -8,6 +8,8 @@
> > 
> > #ifdef CONFIG_MMU
> > 
> > +#include <linux/swapfile.h>
> 
> Shouldn’t the ifdef go into linux/swapfile.h if that’s the right thing to do
> to prevent others from mistakenly including it?

swapfile.h is for max_swapfile_size(), which is referenced in the new code. If
!CONFIG_MMU then there shouldn't be a reference to max_swapfile_size() even
if swapops.h is included, so logically it should be after CONFIG_MMU?

> 
> > +
> > /*
> >  * swapcache pages are stored in the swapper_space radix tree.  We want to
> >  * get good packing density in that tree, so the index should be dense in
> > @@ -35,6 +37,24 @@
> > #endif
> > #define SWP_PFN_MASK			((1UL << SWP_PFN_BITS) - 1)
> > 
> > +/**
> > + * Migration swap entry specific bitfield definitions.
> > + *
> > + * @SWP_MIG_YOUNG_BIT: Whether the page used to have young bit set
> > + * @SWP_MIG_DIRTY_BIT: Whether the page used to have dirty bit set
> > + *
> > + * Note: these bits will be stored in migration entries iff there're enough
> > + * free bits in arch specific swp offset.  By default we'll ignore A/D bits
> > + * when migrating a page.  Please refer to migration_entry_supports_ad()
> > + * for more information.
> > + */
> > +#define SWP_MIG_YOUNG_BIT		(SWP_PFN_BITS)
> > +#define SWP_MIG_DIRTY_BIT		(SWP_PFN_BITS + 1)
> > +#define SWP_MIG_TOTAL_BITS		(SWP_PFN_BITS + 2)
> > +
> > +#define SWP_MIG_YOUNG			(1UL << SWP_MIG_YOUNG_BIT)
> > +#define SWP_MIG_DIRTY			(1UL << SWP_MIG_DIRTY_BIT)
> 
> Any reason not to use BIT(x) ?

Yeh why not..

> 
> > +
> > static inline bool is_pfn_swap_entry(swp_entry_t entry);
> > 
> > /* Clear all flags but only keep swp_entry_t related information */
> > @@ -265,6 +285,57 @@ static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
> > 	return swp_entry(SWP_MIGRATION_WRITE, offset);
> > }
> > 
> > +/*
> > + * Returns whether the host has large enough swap offset field to support
> > + * carrying over pgtable A/D bits for page migrations.  The result is
> > + * pretty much arch specific.
> > + */
> > +static inline bool migration_entry_supports_ad(void)
> > +{
> > +	/*
> > +	 * max_swapfile_size() returns the max supported swp-offset plus 1.
> > +	 * We can support the migration A/D bits iff the pfn swap entry has
> > +	 * the offset large enough to cover all of them (PFN, A & D bits).
> > +	 */
> > +#ifdef CONFIG_SWAP
> > +	return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);
> 
> This is an actual a function call (unless LTO has some trick). A bit of a
> shame it cannot be at least memoized.
> 
> Or at least mark max_swapfile_size() as __attribute_const__ so it would not
> be called twice for make_migration_entry_young() and
> make_migration_entry_dirty().

I didn't spend too much effort on this one since we're on the swap path and I
assumed that's not a super hot path.  But __attribute_const__ sounds good
and easy to get, thanks.

Perhaps I should mark it on migration_entry_supports_ad() as a whole?  Note
that unfortunately SWP_MIG_TOTAL_BITS may not be a const either (see how
that define roots back to MAX_PHYSMEM_BITS, where on x86_64 it needs to
check 5-lvl).

> 
> > +#else
> > +	return false;
> > +#endif
> > +}
> > +
> > +static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
> > +{
> > +	if (migration_entry_supports_ad())
> > +		return swp_entry(swp_type(entry),
> > +				 swp_offset(entry) | SWP_MIG_YOUNG);
> > +	return entry;
> > +}
> > +
> > +static inline bool is_migration_entry_young(swp_entry_t entry)
> > +{
> > +	if (migration_entry_supports_ad())
> > +		return swp_offset(entry) & SWP_MIG_YOUNG;
> > +	/* Keep the old behavior of aging page after migration */
> > +	return false;
> > +}
> > +
> > +static inline swp_entry_t make_migration_entry_dirty(swp_entry_t entry)
> > +{
> > +	if (migration_entry_supports_ad())
> > +		return swp_entry(swp_type(entry),
> > +				 swp_offset(entry) | SWP_MIG_DIRTY);
> > +	return entry;
> > +}
> > +
> > +static inline bool is_migration_entry_dirty(swp_entry_t entry)
> > +{
> > +	if (migration_entry_supports_ad())
> > +		return swp_offset(entry) & SWP_MIG_YOUNG_BIT;
> 
> Shouldn’t it be SWP_MIG_DIRTY ?

Oops, yes.

> 
> > +	/* Keep the old behavior of clean page after migration */
> > +	return false;
> > +}
> > +
> > extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
> > 					spinlock_t *ptl);
> > extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> > @@ -311,6 +382,26 @@ static inline int is_readable_migration_entry(swp_entry_t entry)
> > 	return 0;
> > }
> > 
> > +static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
> > +{
> > +	return entry;
> > +}
> > +
> > +static inline bool is_migration_entry_young(swp_entry_t entry)
> > +{
> > +	return false;
> > +}
> > +
> > +static inline swp_entry_t make_migration_entry_dirty(swp_entry_t entry)
> > +{
> > +	return entry;
> > +}
> > +
> > +static inline bool is_migration_entry_dirty(swp_entry_t entry)
> > +{
> > +	return false;
> > +}
> > +
> > #endif
> 
> While at it, can you change to:
> 
> #endif /* CONFIG_MIGRATION */
> 
> [ these ifdefs burn my eyes ]

Ok, but there're a bunch of "ifdef"s in this header that are missing those.
Maybe I'll add a separate patch for all of them just to do the commenting.

> 
> Other than that looks good.

Thanks,
Peter Xu Aug. 5, 2022, 4:36 p.m. UTC | #4
On Fri, Aug 05, 2022 at 02:17:55PM +0200, David Hildenbrand wrote:
> On 04.08.22 22:39, Peter Xu wrote:
> > When page migration happens, we always ignore the young/dirty bit settings
> > in the old pgtable, and marking the page as old in the new page table using
> > either pte_mkold() or pmd_mkold(), and keeping the pte clean.
> > 
> > That's fine from functional-wise, but that's not friendly to page reclaim
> > because the moving page can be actively accessed within the procedure.  Not
> > to mention hardware setting the young bit can bring quite some overhead on
> > some systems, e.g. x86_64 needs a few hundreds nanoseconds to set the bit.
> > The same slowdown problem to dirty bits when the memory is first written
> > after page migration happened.
> > 
> > Actually we can easily remember the A/D bit configuration and recover the
> > information after the page is migrated.  To achieve it, define a new set of
> > bits in the migration swap offset field to cache the A/D bits for old pte.
> > Then when removing/recovering the migration entry, we can recover the A/D
> > bits even if the page changed.
> > 
> > One thing to mention is that here we used max_swapfile_size() to detect how
> > many swp offset bits we have, and we'll only enable this feature if we know
> > the swp offset can be big enough to store both the PFN value and the young
> > bit.  Otherwise the A/D bits are dropped like before.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  include/linux/swapops.h | 91 +++++++++++++++++++++++++++++++++++++++++
> >  mm/huge_memory.c        | 26 +++++++++++-
> >  mm/migrate.c            |  6 ++-
> >  mm/migrate_device.c     |  4 ++
> >  mm/rmap.c               |  5 ++-
> >  5 files changed, 128 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> > index 1d17e4bb3d2f..34aa448ac6ee 100644
> > --- a/include/linux/swapops.h
> > +++ b/include/linux/swapops.h
> > @@ -8,6 +8,8 @@
> >  
> >  #ifdef CONFIG_MMU
> >  
> > +#include <linux/swapfile.h>
> > +
> >  /*
> >   * swapcache pages are stored in the swapper_space radix tree.  We want to
> >   * get good packing density in that tree, so the index should be dense in
> > @@ -35,6 +37,24 @@
> >  #endif
> >  #define SWP_PFN_MASK			((1UL << SWP_PFN_BITS) - 1)
> >  
> > +/**
> > + * Migration swap entry specific bitfield definitions.
> > + *
> > + * @SWP_MIG_YOUNG_BIT: Whether the page used to have young bit set
> > + * @SWP_MIG_DIRTY_BIT: Whether the page used to have dirty bit set
> > + *
> > + * Note: these bits will be stored in migration entries iff there're enough
> > + * free bits in arch specific swp offset.  By default we'll ignore A/D bits
> > + * when migrating a page.  Please refer to migration_entry_supports_ad()
> > + * for more information.
> > + */
> > +#define SWP_MIG_YOUNG_BIT		(SWP_PFN_BITS)
> > +#define SWP_MIG_DIRTY_BIT		(SWP_PFN_BITS + 1)
> > +#define SWP_MIG_TOTAL_BITS		(SWP_PFN_BITS + 2)
> > +
> > +#define SWP_MIG_YOUNG			(1UL << SWP_MIG_YOUNG_BIT)
> > +#define SWP_MIG_DIRTY			(1UL << SWP_MIG_DIRTY_BIT)
> > +
> >  static inline bool is_pfn_swap_entry(swp_entry_t entry);
> >  
> >  /* Clear all flags but only keep swp_entry_t related information */
> > @@ -265,6 +285,57 @@ static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
> >  	return swp_entry(SWP_MIGRATION_WRITE, offset);
> >  }
> >  
> > +/*
> > + * Returns whether the host has large enough swap offset field to support
> > + * carrying over pgtable A/D bits for page migrations.  The result is
> > + * pretty much arch specific.
> > + */
> > +static inline bool migration_entry_supports_ad(void)
> > +{
> > +	/*
> > +	 * max_swapfile_size() returns the max supported swp-offset plus 1.
> > +	 * We can support the migration A/D bits iff the pfn swap entry has
> > +	 * the offset large enough to cover all of them (PFN, A & D bits).
> > +	 */
> > +#ifdef CONFIG_SWAP
> > +	return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);
> > +#else
> > +	return false;
> > +#endif
> > +}
> 
> 
> This looks much cleaner to me. It might be helpful to draw an ascii
> picture where exatcly these bits reside isnide the offset.

Yes that'll be helpful especially when more bits are defined.  Not sure how
much it'll help for now but I can definitely do that.

> 
> > +
> > +static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
> > +{
> > +	if (migration_entry_supports_ad())
> 
> Do we maybe want to turn that into a static key and enable it once and
> for all? As Nadav says, the repeated max_swapfile_size() calls/checks
> might be worth optimizing out.

Since there're a few arch related issues to answer (as replied to Nadav -
both max_swapfile_size and SWP_MIG_TOTAL_BITS may not be constant), my
current plan is to first attach the const attribute, then leave the other
optimizations for later.

If this is a super hot path, I probably need to do this in reverse order,
but hopefully it's fine for this case.

Thanks,
Huang, Ying Aug. 9, 2022, 8:40 a.m. UTC | #5
Peter Xu <peterx@redhat.com> writes:

> When page migration happens, we always ignore the young/dirty bit settings
> in the old pgtable, and marking the page as old in the new page table using
> either pte_mkold() or pmd_mkold(), and keeping the pte clean.
>
> That's fine from functional-wise, but that's not friendly to page reclaim
> because the moving page can be actively accessed within the procedure.  Not
> to mention hardware setting the young bit can bring quite some overhead on
> some systems, e.g. x86_64 needs a few hundreds nanoseconds to set the bit.
> The same slowdown problem to dirty bits when the memory is first written
> after page migration happened.
>
> Actually we can easily remember the A/D bit configuration and recover the
> information after the page is migrated.  To achieve it, define a new set of
> bits in the migration swap offset field to cache the A/D bits for old pte.
> Then when removing/recovering the migration entry, we can recover the A/D
> bits even if the page changed.
>
> One thing to mention is that here we used max_swapfile_size() to detect how
> many swp offset bits we have, and we'll only enable this feature if we know
> the swp offset can be big enough to store both the PFN value and the young
> bit.  Otherwise the A/D bits are dropped like before.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/swapops.h | 91 +++++++++++++++++++++++++++++++++++++++++
>  mm/huge_memory.c        | 26 +++++++++++-
>  mm/migrate.c            |  6 ++-
>  mm/migrate_device.c     |  4 ++
>  mm/rmap.c               |  5 ++-
>  5 files changed, 128 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 1d17e4bb3d2f..34aa448ac6ee 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -8,6 +8,8 @@
>  
>  #ifdef CONFIG_MMU
>  
> +#include <linux/swapfile.h>
> +
>  /*
>   * swapcache pages are stored in the swapper_space radix tree.  We want to
>   * get good packing density in that tree, so the index should be dense in
> @@ -35,6 +37,24 @@
>  #endif
>  #define SWP_PFN_MASK			((1UL << SWP_PFN_BITS) - 1)
>  
> +/**
> + * Migration swap entry specific bitfield definitions.
> + *
> + * @SWP_MIG_YOUNG_BIT: Whether the page used to have young bit set
> + * @SWP_MIG_DIRTY_BIT: Whether the page used to have dirty bit set
> + *
> + * Note: these bits will be stored in migration entries iff there're enough
> + * free bits in arch specific swp offset.  By default we'll ignore A/D bits
> + * when migrating a page.  Please refer to migration_entry_supports_ad()
> + * for more information.
> + */
> +#define SWP_MIG_YOUNG_BIT		(SWP_PFN_BITS)
> +#define SWP_MIG_DIRTY_BIT		(SWP_PFN_BITS + 1)
> +#define SWP_MIG_TOTAL_BITS		(SWP_PFN_BITS + 2)
> +
> +#define SWP_MIG_YOUNG			(1UL << SWP_MIG_YOUNG_BIT)
> +#define SWP_MIG_DIRTY			(1UL << SWP_MIG_DIRTY_BIT)
> +
>  static inline bool is_pfn_swap_entry(swp_entry_t entry);
>  
>  /* Clear all flags but only keep swp_entry_t related information */
> @@ -265,6 +285,57 @@ static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
>  	return swp_entry(SWP_MIGRATION_WRITE, offset);
>  }
>  
> +/*
> + * Returns whether the host has large enough swap offset field to support
> + * carrying over pgtable A/D bits for page migrations.  The result is
> + * pretty much arch specific.
> + */
> +static inline bool migration_entry_supports_ad(void)
> +{
> +	/*
> +	 * max_swapfile_size() returns the max supported swp-offset plus 1.
> +	 * We can support the migration A/D bits iff the pfn swap entry has
> +	 * the offset large enough to cover all of them (PFN, A & D bits).
> +	 */
> +#ifdef CONFIG_SWAP
> +	return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);
> +#else
> +	return false;
> +#endif
> +}
> +
> +static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
> +{
> +	if (migration_entry_supports_ad())
> +		return swp_entry(swp_type(entry),
> +				 swp_offset(entry) | SWP_MIG_YOUNG);
> +	return entry;
> +}
> +
> +static inline bool is_migration_entry_young(swp_entry_t entry)
> +{
> +	if (migration_entry_supports_ad())
> +		return swp_offset(entry) & SWP_MIG_YOUNG;
> +	/* Keep the old behavior of aging page after migration */
> +	return false;
> +}
> +
> +static inline swp_entry_t make_migration_entry_dirty(swp_entry_t entry)
> +{
> +	if (migration_entry_supports_ad())
> +		return swp_entry(swp_type(entry),
> +				 swp_offset(entry) | SWP_MIG_DIRTY);
> +	return entry;
> +}
> +
> +static inline bool is_migration_entry_dirty(swp_entry_t entry)
> +{
> +	if (migration_entry_supports_ad())
> +		return swp_offset(entry) & SWP_MIG_YOUNG_BIT;
> +	/* Keep the old behavior of clean page after migration */
> +	return false;
> +}
> +
>  extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
>  					spinlock_t *ptl);
>  extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> @@ -311,6 +382,26 @@ static inline int is_readable_migration_entry(swp_entry_t entry)
>  	return 0;
>  }
>  
> +static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
> +{
> +	return entry;
> +}
> +
> +static inline bool is_migration_entry_young(swp_entry_t entry)
> +{
> +	return false;
> +}
> +
> +static inline swp_entry_t make_migration_entry_dirty(swp_entry_t entry)
> +{
> +	return entry;
> +}
> +
> +static inline bool is_migration_entry_dirty(swp_entry_t entry)
> +{
> +	return false;
> +}
> +
>  #endif
>  
>  typedef unsigned long pte_marker;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 29e3628687a6..7d79db4bd76a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2005,6 +2005,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  	pmd_t old_pmd, _pmd;
>  	bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
>  	bool anon_exclusive = false;
> +	bool dirty = false;
>  	unsigned long addr;
>  	int i;
>  
> @@ -2088,7 +2089,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  		write = is_writable_migration_entry(entry);
>  		if (PageAnon(page))
>  			anon_exclusive = is_readable_exclusive_migration_entry(entry);
> -		young = false;
> +		young = is_migration_entry_young(entry);
> +		dirty = is_migration_entry_dirty(entry);
>  		soft_dirty = pmd_swp_soft_dirty(old_pmd);
>  		uffd_wp = pmd_swp_uffd_wp(old_pmd);
>  	} else {
> @@ -2097,6 +2099,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  			SetPageDirty(page);
>  		write = pmd_write(old_pmd);
>  		young = pmd_young(old_pmd);
> +		dirty = pmd_dirty(old_pmd);
>  		soft_dirty = pmd_soft_dirty(old_pmd);
>  		uffd_wp = pmd_uffd_wp(old_pmd);
>  
> @@ -2146,6 +2149,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  			else
>  				swp_entry = make_readable_migration_entry(
>  							page_to_pfn(page + i));
> +			if (young)
> +				swp_entry = make_migration_entry_young(swp_entry);
> +			if (dirty)
> +				swp_entry = make_migration_entry_dirty(swp_entry);
>  			entry = swp_entry_to_pte(swp_entry);
>  			if (soft_dirty)
>  				entry = pte_swp_mksoft_dirty(entry);
> @@ -2160,6 +2167,12 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  				entry = pte_wrprotect(entry);
>  			if (!young)
>  				entry = pte_mkold(entry);
> +			if (dirty)
> +				/*
> +				 * NOTE: this may contains setting soft
> +				 * dirty too on some archs like x86.
> +				 */

Personally, I prefer to put comments above "if (dirty)".  But you can
choose your favorite way unless it violates coding style.

> +				entry = pte_mkdirty(entry);

We didn't track the dirty flag even for normal PTEs before.  So I think we
should separate the dirty flag tracking for normal PTEs into a separate
patch.

>  			if (soft_dirty)
>  				entry = pte_mksoft_dirty(entry);
>  			if (uffd_wp)
> @@ -3148,6 +3161,10 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
>  		entry = make_readable_exclusive_migration_entry(page_to_pfn(page));
>  	else
>  		entry = make_readable_migration_entry(page_to_pfn(page));
> +	if (pmd_young(pmdval))
> +		entry = make_migration_entry_young(entry);
> +	if (pmd_dirty(pmdval))
> +		entry = make_migration_entry_dirty(entry);
>  	pmdswp = swp_entry_to_pmd(entry);
>  	if (pmd_soft_dirty(pmdval))
>  		pmdswp = pmd_swp_mksoft_dirty(pmdswp);
> @@ -3173,13 +3190,18 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
>  
>  	entry = pmd_to_swp_entry(*pvmw->pmd);
>  	get_page(new);
> -	pmde = pmd_mkold(mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot)));
> +	pmde = mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot));
>  	if (pmd_swp_soft_dirty(*pvmw->pmd))
>  		pmde = pmd_mksoft_dirty(pmde);
>  	if (is_writable_migration_entry(entry))
>  		pmde = maybe_pmd_mkwrite(pmde, vma);
>  	if (pmd_swp_uffd_wp(*pvmw->pmd))
>  		pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde));
> +	if (!is_migration_entry_young(entry))
> +		pmde = pmd_mkold(pmde);
> +	if (PageDirty(new) && is_migration_entry_dirty(entry))
> +		/* NOTE: this may contain setting soft-dirty on some archs */
> +		pmde = pmd_mkdirty(pmde);
>  
>  	if (PageAnon(new)) {
>  		rmap_t rmap_flags = RMAP_COMPOUND;
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 1649270bc1a7..957e8e6ee28e 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -199,7 +199,7 @@ static bool remove_migration_pte(struct folio *folio,
>  #endif
>  
>  		folio_get(folio);
> -		pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot)));
> +		pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
>  		if (pte_swp_soft_dirty(*pvmw.pte))
>  			pte = pte_mksoft_dirty(pte);
>  
> @@ -207,6 +207,10 @@ static bool remove_migration_pte(struct folio *folio,
>  		 * Recheck VMA as permissions can change since migration started
>  		 */
>  		entry = pte_to_swp_entry(*pvmw.pte);
> +		if (!is_migration_entry_young(entry))
> +			pte = pte_mkold(pte);
> +		if (folio_test_dirty(folio) && is_migration_entry_dirty(entry))
> +			pte = pte_mkdirty(pte);
>  		if (is_writable_migration_entry(entry))
>  			pte = maybe_mkwrite(pte, vma);
>  		else if (pte_swp_uffd_wp(*pvmw.pte))
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 7feeb447e3b9..47d90df04cc6 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -221,6 +221,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  			else
>  				entry = make_readable_migration_entry(
>  							page_to_pfn(page));
> +			if (pte_young(pte))
> +				entry = make_migration_entry_young(entry);
> +			if (pte_dirty(pte))
> +				entry = make_migration_entry_dirty(entry);

I don't find pte_dirty() synced to PageDirty() as in
try_to_migrate_one().  Is it an issue in the original code?

Best Regards,
Huang, Ying

>  			swp_pte = swp_entry_to_pte(entry);
>  			if (pte_present(pte)) {
>  				if (pte_soft_dirty(pte))
> diff --git a/mm/rmap.c b/mm/rmap.c
> index af775855e58f..28aef434ea41 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2065,7 +2065,10 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>  			else
>  				entry = make_readable_migration_entry(
>  							page_to_pfn(subpage));
> -
> +			if (pte_young(pteval))
> +				entry = make_migration_entry_young(entry);
> +			if (pte_dirty(pteval))
> +				entry = make_migration_entry_dirty(entry);
>  			swp_pte = swp_entry_to_pte(entry);
>  			if (pte_soft_dirty(pteval))
>  				swp_pte = pte_swp_mksoft_dirty(swp_pte);
Huang, Ying Aug. 9, 2022, 8:45 a.m. UTC | #6
Peter Xu <peterx@redhat.com> writes:

> On Thu, Aug 04, 2022 at 03:40:57PM -0700, Nadav Amit wrote:
>> On Aug 4, 2022, at 1:39 PM, Peter Xu <peterx@redhat.com> wrote:
>> > +
>> > static inline bool is_pfn_swap_entry(swp_entry_t entry);
>> > 
>> > /* Clear all flags but only keep swp_entry_t related information */
>> > @@ -265,6 +285,57 @@ static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
>> > 	return swp_entry(SWP_MIGRATION_WRITE, offset);
>> > }
>> > 
>> > +/*
>> > + * Returns whether the host has large enough swap offset field to support
>> > + * carrying over pgtable A/D bits for page migrations.  The result is
>> > + * pretty much arch specific.
>> > + */
>> > +static inline bool migration_entry_supports_ad(void)
>> > +{
>> > +	/*
>> > +	 * max_swapfile_size() returns the max supported swp-offset plus 1.
>> > +	 * We can support the migration A/D bits iff the pfn swap entry has
>> > +	 * the offset large enough to cover all of them (PFN, A & D bits).
>> > +	 */
>> > +#ifdef CONFIG_SWAP
>> > +	return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);
>> 
>> This is an actual a function call (unless LTO has some trick). A bit of a
>> shame it cannot be at least memoized.
>> 
>> Or at least mark max_swapfile_size() as __attribute_const__ so it would not
>> be called twice for make_migration_entry_young() and
>> make_migration_entry_dirty().
>
> I didn't take too much effort on this one since we're on swap path and I
> assumed that's not a super hot path.  But __attribute_const__ sounds good
> and easy to get, thanks.
>
> Perhaps I should mark it on migration_entry_supports_ad() as a whole?  Note
> that unfortunately SWP_MIG_TOTAL_BITS may not be a const either (see how
> that define roots back to MAX_PHYSMEM_BITS, where on x86_64 it needs to
> check 5-lvl).

I think it's possible to memoize max_swapfile_size() or
migration_entry_supports_ad().  Although they are not constant, they are
not changed after initialization.  The challenge is to find a clean way to
initialize it.

Best Regards,
Huang, Ying
David Hildenbrand Aug. 9, 2022, 8:47 a.m. UTC | #7
On 09.08.22 10:45, Huang, Ying wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
>> On Thu, Aug 04, 2022 at 03:40:57PM -0700, Nadav Amit wrote:
>>> On Aug 4, 2022, at 1:39 PM, Peter Xu <peterx@redhat.com> wrote:
>>>> +
>>>> static inline bool is_pfn_swap_entry(swp_entry_t entry);
>>>>
>>>> /* Clear all flags but only keep swp_entry_t related information */
>>>> @@ -265,6 +285,57 @@ static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
>>>> 	return swp_entry(SWP_MIGRATION_WRITE, offset);
>>>> }
>>>>
>>>> +/*
>>>> + * Returns whether the host has large enough swap offset field to support
>>>> + * carrying over pgtable A/D bits for page migrations.  The result is
>>>> + * pretty much arch specific.
>>>> + */
>>>> +static inline bool migration_entry_supports_ad(void)
>>>> +{
>>>> +	/*
>>>> +	 * max_swapfile_size() returns the max supported swp-offset plus 1.
>>>> +	 * We can support the migration A/D bits iff the pfn swap entry has
>>>> +	 * the offset large enough to cover all of them (PFN, A & D bits).
>>>> +	 */
>>>> +#ifdef CONFIG_SWAP
>>>> +	return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);
>>>
>>> This is an actual a function call (unless LTO has some trick). A bit of a
>>> shame it cannot be at least memoized.
>>>
>>> Or at least mark max_swapfile_size() as __attribute_const__ so it would not
>>> be called twice for make_migration_entry_young() and
>>> make_migration_entry_dirty().
>>
>> I didn't take too much effort on this one since we're on swap path and I
>> assumed that's not a super hot path.  But __attribute_const__ sounds good
>> and easy to get, thanks.
>>
>> Perhaps I should mark it on migration_entry_supports_ad() as a whole?  Note
>> that unfortunately SWP_MIG_TOTAL_BITS may not be a const either (see how
>> that define roots back to MAX_PHYSMEM_BITS, where on x86_64 it needs to
>> check 5-lvl).
> 
> I think it's possible to memorize max_swapfile_size() or
> migration_entry_supports_ad().  Although they are not constant, they are
> not changed after initialized.  The challenge is to find a clean way to
> initialize it.

We could max_swapfile_size()->__max_swapfile_size()

and then simply have a new max_swapfile_size() that caches the value in
a static variable. If that turns out to be worth the trouble.
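
A rough sketch of that idea (illustrative only; __max_swapfile_size() here
just stands for the renamed arch-dependent helper):

/* Hypothetical: compute the limit on first use and cache it afterwards */
static inline unsigned long max_swapfile_size(void)
{
	static unsigned long cached;

	if (unlikely(!cached))
		cached = __max_swapfile_size();
	return cached;
}
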
Peter Xu Aug. 9, 2022, 2:59 p.m. UTC | #8
On Tue, Aug 09, 2022 at 04:45:32PM +0800, Huang, Ying wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Thu, Aug 04, 2022 at 03:40:57PM -0700, Nadav Amit wrote:
> >> On Aug 4, 2022, at 1:39 PM, Peter Xu <peterx@redhat.com> wrote:
> >> > +
> >> > static inline bool is_pfn_swap_entry(swp_entry_t entry);
> >> > 
> >> > /* Clear all flags but only keep swp_entry_t related information */
> >> > @@ -265,6 +285,57 @@ static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
> >> > 	return swp_entry(SWP_MIGRATION_WRITE, offset);
> >> > }
> >> > 
> >> > +/*
> >> > + * Returns whether the host has large enough swap offset field to support
> >> > + * carrying over pgtable A/D bits for page migrations.  The result is
> >> > + * pretty much arch specific.
> >> > + */
> >> > +static inline bool migration_entry_supports_ad(void)
> >> > +{
> >> > +	/*
> >> > +	 * max_swapfile_size() returns the max supported swp-offset plus 1.
> >> > +	 * We can support the migration A/D bits iff the pfn swap entry has
> >> > +	 * the offset large enough to cover all of them (PFN, A & D bits).
> >> > +	 */
> >> > +#ifdef CONFIG_SWAP
> >> > +	return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);
> >> 
> >> This is an actual a function call (unless LTO has some trick). A bit of a
> >> shame it cannot be at least memoized.
> >> 
> >> Or at least mark max_swapfile_size() as __attribute_const__ so it would not
> >> be called twice for make_migration_entry_young() and
> >> make_migration_entry_dirty().
> >
> > I didn't take too much effort on this one since we're on swap path and I
> > assumed that's not a super hot path.  But __attribute_const__ sounds good
> > and easy to get, thanks.
> >
> > Perhaps I should mark it on migration_entry_supports_ad() as a whole?  Note
> > that unfortunately SWP_MIG_TOTAL_BITS may not be a const either (see how
> > that define roots back to MAX_PHYSMEM_BITS, where on x86_64 it needs to
> > check 5-lvl).
> 
> I think it's possible to memorize max_swapfile_size() or
> migration_entry_supports_ad().  Although they are not constant, they are
> not changed after initialized.  The challenge is to find a clean way to
> initialize it.

I checked it today; the conclusion is that I think it's safe to initialize the
A/D bits for migration in swapfile_init(), and I'll do that in the next version.

Longer proof material below.

Generic max_swapfile_size() is pretty much a constant except on x86.  x86 has
two dependencies: (1) X86_BUG_L1TF, and (2) l1tf_mitigation.

The other challenge is the reference to MAX_PHYSMEM_BITS, which is also
only complicated on x86 with 5 level page tables.

Luckily the cpu bits are all set up within early_cpu_init().  The setup of the
l1tf_mitigation variable is later, but still earlier than most of the init
calls (swapfile_init belongs to initcall level 4).

A full graph for reference:

- start_kernel
  - setup_arch
    - early_cpu_init
      - get_cpu_cap --> fetch from CPUID (including X86_FEATURE_LA57)
      - early_identify_cpu --> setup X86_BUG_L1TF
      - early_identify_cpu --> clear X86_FEATURE_LA57 (if early lvl5 not enabled (USE_EARLY_PGTABLE_L5))
  - parse_early_param
    - l1tf_cmdline --> set l1tf_mitigation
  - check_bugs
    - l1tf_select_mitigation --> set l1tf_mitigation
  - arch_call_rest_init
    - rest_init
      - kernel_init
        - kernel_init_freeable
          - do_basic_setup
            - do_initcalls --> calls swapfile_init() (initcall level 4)

I'll add one (or maybe >1?) patch on top in the next post to optimize the
fetching of these states.
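
As a rough sketch of that direction (not the actual follow-up patch; the
variable name is made up):

/* Hypothetical: latch the answer once, from the existing initcall */
bool migration_ad_supported __read_mostly;

static int __init swapfile_init(void)
{
	/* ... existing swapfile_init() work ... */
	if (max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS))
		migration_ad_supported = true;
	return 0;
}
subsys_initcall(swapfile_init);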

Thanks,
Peter Xu Aug. 9, 2022, 5:59 p.m. UTC | #9
On Tue, Aug 09, 2022 at 04:40:12PM +0800, Huang, Ying wrote:
> > @@ -2160,6 +2167,12 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> >  				entry = pte_wrprotect(entry);
> >  			if (!young)
> >  				entry = pte_mkold(entry);
> > +			if (dirty)
> > +				/*
> > +				 * NOTE: this may contains setting soft
> > +				 * dirty too on some archs like x86.
> > +				 */
> 
> Personally, I prefer to put comments above "if (dirty)".  But you can
> choose your favorite way unless it violates coding style.

Sure.

> 
> > +				entry = pte_mkdirty(entry);
> 
> We don't track dirty flag even for normal PTE before.  So I think we
> should separate the dirty flag tracking for normal PTE in a separate
> patch.

It's kinda convenient to touch that up, but for sure I can split that into
a tiny but separate patch too.

[...]

> I don't find pte_dirty() is synced to PageDirty() as in
> try_to_migrate_one().  Is it a issue in the original code?

I think it has?  There is:

		/* Set the dirty flag on the folio now the pte is gone. */
		if (pte_dirty(pteval))
			folio_mark_dirty(folio);

?

Thanks,
Huang, Ying Aug. 10, 2022, 12:53 a.m. UTC | #10
Peter Xu <peterx@redhat.com> writes:

> On Tue, Aug 09, 2022 at 04:40:12PM +0800, Huang, Ying wrote:
[snip]
>
>> I don't find pte_dirty() is synced to PageDirty() as in
>> try_to_migrate_one().  Is it a issue in the original code?
>
> I think it has?  There is:
>
> 		/* Set the dirty flag on the folio now the pte is gone. */
> 		if (pte_dirty(pteval))
> 			folio_mark_dirty(folio);
>

Sorry, my original words are confusing.  Yes, there's dirty bit syncing
in try_to_migrate_one().  But I don't find that in migrate_device.c

 $ grep dirty mm/migrate_device.c
				if (pte_soft_dirty(pte))
					swp_pte = pte_swp_mksoft_dirty(swp_pte);
				if (pte_swp_soft_dirty(pte))
					swp_pte = pte_swp_mksoft_dirty(swp_pte);
			entry = pte_mkwrite(pte_mkdirty(entry));

I guess that migrate_device.c is used to migrate between CPU-visible pages
and CPU-invisible (device-visible) pages, so the rule is different?

Best Regards,
Huang, Ying
Peter Xu Aug. 10, 2022, 7:21 p.m. UTC | #11
On Wed, Aug 10, 2022 at 08:53:49AM +0800, Huang, Ying wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Tue, Aug 09, 2022 at 04:40:12PM +0800, Huang, Ying wrote:
> [snip]
> >
> >> I don't find pte_dirty() is synced to PageDirty() as in
> >> try_to_migrate_one().  Is it a issue in the original code?
> >
> > I think it has?  There is:
> >
> > 		/* Set the dirty flag on the folio now the pte is gone. */
> > 		if (pte_dirty(pteval))
> > 			folio_mark_dirty(folio);
> >
> 
> Sorry, my original words are confusing.  Yes, there's dirty bit syncing
> in try_to_migrate_one().  But I don't find that in migrate_device.c
> 
>  $ grep dirty mm/migrate_device.c
> 				if (pte_soft_dirty(pte))
> 					swp_pte = pte_swp_mksoft_dirty(swp_pte);
> 				if (pte_swp_soft_dirty(pte))
> 					swp_pte = pte_swp_mksoft_dirty(swp_pte);
> 			entry = pte_mkwrite(pte_mkdirty(entry));
> 
> I guess that migrate_device.c is used to migrate between CPU visible
> page to CPU un-visible page (device visible), so the rule is different?

IIUC migrate_vma_collect() handles migrations for both directions (RAM <->
device mem).

Yeah, indeed I also didn't see how migrate_vma_collect_pmd() handles the
carry-over of pte dirty to page dirty, which looks a bit odd.  I also don't
see why the dirty bit doesn't need to be maintained, e.g. when a previous
page was dirty then after migration of ram->dev->ram it seems to be clean
with current code.

Another scenario is: even if the page was clean, the page could be migrated
to device mem, the device could DMA to the page, and then the page could be
migrated back to RAM.  I also didn't see how we could detect the DMAs and set
the pte/page dirty properly after migrating back.

Copy Alistair and Jason..
Alistair Popple Aug. 11, 2022, 5:44 a.m. UTC | #12
Peter Xu <peterx@redhat.com> writes:

> On Wed, Aug 10, 2022 at 08:53:49AM +0800, Huang, Ying wrote:
>> Peter Xu <peterx@redhat.com> writes:
>>
>> > On Tue, Aug 09, 2022 at 04:40:12PM +0800, Huang, Ying wrote:
>> [snip]
>> >
>> >> I don't find pte_dirty() is synced to PageDirty() as in
>> >> try_to_migrate_one().  Is it a issue in the original code?
>> >
>> > I think it has?  There is:
>> >
>> > 		/* Set the dirty flag on the folio now the pte is gone. */
>> > 		if (pte_dirty(pteval))
>> > 			folio_mark_dirty(folio);
>> >
>>
>> Sorry, my original words are confusing.  Yes, there's dirty bit syncing
>> in try_to_migrate_one().  But I don't find that in migrate_device.c
>>
>>  $ grep dirty mm/migrate_device.c
>> 				if (pte_soft_dirty(pte))
>> 					swp_pte = pte_swp_mksoft_dirty(swp_pte);
>> 				if (pte_swp_soft_dirty(pte))
>> 					swp_pte = pte_swp_mksoft_dirty(swp_pte);
>> 			entry = pte_mkwrite(pte_mkdirty(entry));
>>
>> I guess that migrate_device.c is used to migrate between CPU visible
>> page to CPU un-visible page (device visible), so the rule is different?
>
> IIUC migrate_vma_collect() handles migrations for both directions (RAM <->
> device mem).

That's correct.

> Yeah, indeed I also didn't see how migrate_vma_collect_pmd() handles the
> carry-over of pte dirty to page dirty, which looks a bit odd.  I also don't
> see why the dirty bit doesn't need to be maintained, e.g. when a previous
> page was dirty then after migration of ram->dev->ram it seems to be clean
> with current code.

That's a bug - it does need to be maintained. migrate_vma_*() currently
only works with anonymous private mappings. We could still lose data if
we attempt (but fail) to migrate a page that has been swapped in from
disk though, depending on the precise sequence.

Will post a fix for this, thanks for pointing it out.
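
For context, the kind of change being discussed would roughly be (a sketch
only, not the actual fix to be posted):

/* Sketch for migrate_vma_collect_pmd(): carry pte dirty over to the page */
if (pte_present(pte) && pte_dirty(pte))
	folio_mark_dirty(page_folio(page));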

> Another scenario is, even if the page was clean, as long as page migrated
> to device mem, device DMAed to the page, then page migrated back to RAM.  I
> also didn't see how we could detect the DMAs and set pte/page dirty
> properly after migrated back.

That would be up to the driver, unless we assume the page is always
dirty which is probably not a bad default. In practice I don't think
this will currently be a problem as any pages migrated to the device
won't have pages allocated in swap and this only works with private
anonymous mappings. But I think we should fix it anyway so will include
it in the fix.

> Copy Alistair and Jason..

Thanks. I will take a look at this series too, but probably won't get to
it until next week.

 - Alistair

Patch

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 1d17e4bb3d2f..34aa448ac6ee 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -8,6 +8,8 @@ 
 
 #ifdef CONFIG_MMU
 
+#include <linux/swapfile.h>
+
 /*
  * swapcache pages are stored in the swapper_space radix tree.  We want to
  * get good packing density in that tree, so the index should be dense in
@@ -35,6 +37,24 @@ 
 #endif
 #define SWP_PFN_MASK			((1UL << SWP_PFN_BITS) - 1)
 
+/**
+ * Migration swap entry specific bitfield definitions.
+ *
+ * @SWP_MIG_YOUNG_BIT: Whether the page used to have young bit set
+ * @SWP_MIG_DIRTY_BIT: Whether the page used to have dirty bit set
+ *
+ * Note: these bits will be stored in migration entries iff there're enough
+ * free bits in arch specific swp offset.  By default we'll ignore A/D bits
+ * when migrating a page.  Please refer to migration_entry_supports_ad()
+ * for more information.
+ */
+#define SWP_MIG_YOUNG_BIT		(SWP_PFN_BITS)
+#define SWP_MIG_DIRTY_BIT		(SWP_PFN_BITS + 1)
+#define SWP_MIG_TOTAL_BITS		(SWP_PFN_BITS + 2)
+
+#define SWP_MIG_YOUNG			(1UL << SWP_MIG_YOUNG_BIT)
+#define SWP_MIG_DIRTY			(1UL << SWP_MIG_DIRTY_BIT)
+
 static inline bool is_pfn_swap_entry(swp_entry_t entry);
 
 /* Clear all flags but only keep swp_entry_t related information */
@@ -265,6 +285,57 @@  static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
 	return swp_entry(SWP_MIGRATION_WRITE, offset);
 }
 
+/*
+ * Returns whether the host has large enough swap offset field to support
+ * carrying over pgtable A/D bits for page migrations.  The result is
+ * pretty much arch specific.
+ */
+static inline bool migration_entry_supports_ad(void)
+{
+	/*
+	 * max_swapfile_size() returns the max supported swp-offset plus 1.
+	 * We can support the migration A/D bits iff the pfn swap entry has
+	 * the offset large enough to cover all of them (PFN, A & D bits).
+	 */
+#ifdef CONFIG_SWAP
+	return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);
+#else
+	return false;
+#endif
+}
+
+static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
+{
+	if (migration_entry_supports_ad())
+		return swp_entry(swp_type(entry),
+				 swp_offset(entry) | SWP_MIG_YOUNG);
+	return entry;
+}
+
+static inline bool is_migration_entry_young(swp_entry_t entry)
+{
+	if (migration_entry_supports_ad())
+		return swp_offset(entry) & SWP_MIG_YOUNG;
+	/* Keep the old behavior of aging page after migration */
+	return false;
+}
+
+static inline swp_entry_t make_migration_entry_dirty(swp_entry_t entry)
+{
+	if (migration_entry_supports_ad())
+		return swp_entry(swp_type(entry),
+				 swp_offset(entry) | SWP_MIG_DIRTY);
+	return entry;
+}
+
+static inline bool is_migration_entry_dirty(swp_entry_t entry)
+{
+	if (migration_entry_supports_ad())
+		return swp_offset(entry) & SWP_MIG_YOUNG_BIT;
+	/* Keep the old behavior of clean page after migration */
+	return false;
+}
+
 extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
 					spinlock_t *ptl);
 extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
@@ -311,6 +382,26 @@  static inline int is_readable_migration_entry(swp_entry_t entry)
 	return 0;
 }
 
+static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
+{
+	return entry;
+}
+
+static inline bool is_migration_entry_young(swp_entry_t entry)
+{
+	return false;
+}
+
+static inline swp_entry_t make_migration_entry_dirty(swp_entry_t entry)
+{
+	return entry;
+}
+
+static inline bool is_migration_entry_dirty(swp_entry_t entry)
+{
+	return false;
+}
+
 #endif
 
 typedef unsigned long pte_marker;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 29e3628687a6..7d79db4bd76a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2005,6 +2005,7 @@  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	pmd_t old_pmd, _pmd;
 	bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
 	bool anon_exclusive = false;
+	bool dirty = false;
 	unsigned long addr;
 	int i;
 
@@ -2088,7 +2089,8 @@  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		write = is_writable_migration_entry(entry);
 		if (PageAnon(page))
 			anon_exclusive = is_readable_exclusive_migration_entry(entry);
-		young = false;
+		young = is_migration_entry_young(entry);
+		dirty = is_migration_entry_dirty(entry);
 		soft_dirty = pmd_swp_soft_dirty(old_pmd);
 		uffd_wp = pmd_swp_uffd_wp(old_pmd);
 	} else {
@@ -2097,6 +2099,7 @@  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			SetPageDirty(page);
 		write = pmd_write(old_pmd);
 		young = pmd_young(old_pmd);
+		dirty = pmd_dirty(old_pmd);
 		soft_dirty = pmd_soft_dirty(old_pmd);
 		uffd_wp = pmd_uffd_wp(old_pmd);
 
@@ -2146,6 +2149,10 @@  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			else
 				swp_entry = make_readable_migration_entry(
 							page_to_pfn(page + i));
+			if (young)
+				swp_entry = make_migration_entry_young(swp_entry);
+			if (dirty)
+				swp_entry = make_migration_entry_dirty(swp_entry);
 			entry = swp_entry_to_pte(swp_entry);
 			if (soft_dirty)
 				entry = pte_swp_mksoft_dirty(entry);
@@ -2160,6 +2167,12 @@  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 				entry = pte_wrprotect(entry);
 			if (!young)
 				entry = pte_mkold(entry);
+			if (dirty)
+				/*
+				 * NOTE: this may contains setting soft
+				 * dirty too on some archs like x86.
+				 */
+				entry = pte_mkdirty(entry);
 			if (soft_dirty)
 				entry = pte_mksoft_dirty(entry);
 			if (uffd_wp)
@@ -3148,6 +3161,10 @@  int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 		entry = make_readable_exclusive_migration_entry(page_to_pfn(page));
 	else
 		entry = make_readable_migration_entry(page_to_pfn(page));
+	if (pmd_young(pmdval))
+		entry = make_migration_entry_young(entry);
+	if (pmd_dirty(pmdval))
+		entry = make_migration_entry_dirty(entry);
 	pmdswp = swp_entry_to_pmd(entry);
 	if (pmd_soft_dirty(pmdval))
 		pmdswp = pmd_swp_mksoft_dirty(pmdswp);
@@ -3173,13 +3190,18 @@  void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 
 	entry = pmd_to_swp_entry(*pvmw->pmd);
 	get_page(new);
-	pmde = pmd_mkold(mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot)));
+	pmde = mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot));
 	if (pmd_swp_soft_dirty(*pvmw->pmd))
 		pmde = pmd_mksoft_dirty(pmde);
 	if (is_writable_migration_entry(entry))
 		pmde = maybe_pmd_mkwrite(pmde, vma);
 	if (pmd_swp_uffd_wp(*pvmw->pmd))
 		pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde));
+	if (!is_migration_entry_young(entry))
+		pmde = pmd_mkold(pmde);
+	if (PageDirty(new) && is_migration_entry_dirty(entry))
+		/* NOTE: this may contain setting soft-dirty on some archs */
+		pmde = pmd_mkdirty(pmde);
 
 	if (PageAnon(new)) {
 		rmap_t rmap_flags = RMAP_COMPOUND;
diff --git a/mm/migrate.c b/mm/migrate.c
index 1649270bc1a7..957e8e6ee28e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -199,7 +199,7 @@  static bool remove_migration_pte(struct folio *folio,
 #endif
 
 		folio_get(folio);
-		pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot)));
+		pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
 		if (pte_swp_soft_dirty(*pvmw.pte))
 			pte = pte_mksoft_dirty(pte);
 
@@ -207,6 +207,10 @@  static bool remove_migration_pte(struct folio *folio,
 		 * Recheck VMA as permissions can change since migration started
 		 */
 		entry = pte_to_swp_entry(*pvmw.pte);
+		if (!is_migration_entry_young(entry))
+			pte = pte_mkold(pte);
+		if (folio_test_dirty(folio) && is_migration_entry_dirty(entry))
+			pte = pte_mkdirty(pte);
 		if (is_writable_migration_entry(entry))
 			pte = maybe_mkwrite(pte, vma);
 		else if (pte_swp_uffd_wp(*pvmw.pte))
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 7feeb447e3b9..47d90df04cc6 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -221,6 +221,10 @@  static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			else
 				entry = make_readable_migration_entry(
 							page_to_pfn(page));
+			if (pte_young(pte))
+				entry = make_migration_entry_young(entry);
+			if (pte_dirty(pte))
+				entry = make_migration_entry_dirty(entry);
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_present(pte)) {
 				if (pte_soft_dirty(pte))
diff --git a/mm/rmap.c b/mm/rmap.c
index af775855e58f..28aef434ea41 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2065,7 +2065,10 @@  static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			else
 				entry = make_readable_migration_entry(
 							page_to_pfn(subpage));
-
+			if (pte_young(pteval))
+				entry = make_migration_entry_young(entry);
+			if (pte_dirty(pteval))
+				entry = make_migration_entry_dirty(entry);
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);