
[RFC,0/4] mm: Remember young bit for migration entries

Message ID 20220729014041.21292-1-peterx@redhat.com (mailing list archive)

Message

Peter Xu July 29, 2022, 1:40 a.m. UTC
[Marking as RFC; only x86 is supported for now, I plan to add a few more
 archs for the formal version]

Problem
=======

When migrating a page, right now we always mark the migrated page as old.
The reason is probably that we don't really know whether the page is hot or
cold, so we default to the conservative side assuming that's safer.

However, that can lead to at least two problems:

  (1) We lose the real hot/cold information even though we could have
      preserved it.  That information shouldn't change just because the
      backing page is replaced after the migration.

  (2) There is always extra overhead on the immediate next access to any
      migrated page, because the hardware MMU needs extra cycles to set
      the young bit again (on MMUs that support setting it in hardware).

Much of the recent upstream work has shown that (2) is not trivial and is
actually very measurable.  In my test case, reading a 1G chunk of memory -
jumping in page-size intervals - took 99ms on a generic x86_64 system just
because of the extra work to set the young bit, compared to 4ms when the
young bit was already set.

This issue was originally reported by Andrea Arcangeli.

Solution
========

To solve this problem, this patchset tries to remember the young bit in the
migration entries and carry it over when recovering the ptes.

We have the chance to do so because on many systems the swap offset is not
fully used.  Migration entries use the swp offset to store the PFN only,
and the PFN normally needs fewer bits than the swp offset can hold.  That
means we have some free bits in the swp offset that can be used to store
things like the young bit, and that's how this series approaches the
problem.

One tricky thing here is that even though we're embedding the information
into the swap entry, which looks like a very generic data structure, the
number of free bits is still arch dependent - not only because the size of
swp_entry_t differs, but also because of the different layouts of swap ptes
on different archs.

This series therefore requires each arch to define an extra macro,
__ARCH_SWP_OFFSET_BITS, representing the width of the swp offset.  With
that information, the swap logic knows whether there are spare bits to use
and, if so, remembers the young bit whenever possible.  By default it keeps
the old behavior of marking all migrated pages cold.
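
To make the idea concrete, below is a minimal sketch of how a spare bit
above the PFN inside the swp offset could carry the young bit.  This is
illustrative only, not the actual patch; the macro and helper names are
made up for the example:

/* Illustrative sketch only -- names below are not the real ones. */
#define SWP_PFN_BITS        (MAX_PHYSMEM_BITS - PAGE_SHIFT)
#define SWP_PFN_MASK        ((1UL << SWP_PFN_BITS) - 1)

#ifdef __ARCH_SWP_OFFSET_BITS
/* True when the arch's swp offset is wider than a PFN needs to be. */
#define SWP_MIG_HAS_YOUNG   (__ARCH_SWP_OFFSET_BITS > SWP_PFN_BITS)
#else
#define SWP_MIG_HAS_YOUNG   false
#endif
#define SWP_MIG_YOUNG_BIT   (1UL << SWP_PFN_BITS)

/* The swp offset is no longer a bare PFN, so always mask when reading. */
static inline unsigned long swp_offset_pfn(swp_entry_t entry)
{
        return swp_offset(entry) & SWP_PFN_MASK;
}

/* Called while installing the migration entry, if the old pte was young. */
static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
{
        if (SWP_MIG_HAS_YOUNG)
                return swp_entry(swp_type(entry),
                                 swp_offset(entry) | SWP_MIG_YOUNG_BIT);
        return entry;   /* no spare bits: keep the old (cold) behavior */
}

/* Checked when restoring the pte, e.g. from remove_migration_pte(). */
static inline bool is_migration_entry_young(swp_entry_t entry)
{
        return SWP_MIG_HAS_YOUNG &&
               (swp_offset(entry) & SWP_MIG_YOUNG_BIT);
}

On the restore path the caller would then call pte_mkyoung() on the new pte
whenever is_migration_entry_young() returns true.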

Tests
=====

With the patchset applied, the immediate read access test [1] on the same
1G chunk after migration shrinks from 99ms to 4ms.  The test moves 1G worth
of pages from node 0 to node 1 and back to node 0, then reads the chunk in
page-size jumps.

In this series __ARCH_SWP_OFFSET_BITS is currently only defined on x86, and
it has only been tested on x86_64 with an Intel(R) Xeon(R) CPU E5-2630 v4 @
2.20GHz.
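
For reference, the measurement loop is roughly like the sketch below.  It
is a simplification of what [1] does, not the exact program; error checking
is omitted and move_pages() is the wrapper from libnuma's <numaif.h>:

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define SIZE    (1UL << 30)     /* 1G */

/* Migrate every page of the buffer to the target node. */
static void move_to_node(char *buf, size_t size, size_t psize, int node)
{
        size_t npages = size / psize, i;
        void **pages = malloc(npages * sizeof(void *));
        int *nodes = malloc(npages * sizeof(int));
        int *status = malloc(npages * sizeof(int));

        for (i = 0; i < npages; i++) {
                pages[i] = buf + i * psize;
                nodes[i] = node;
        }
        move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE);
        free(pages);
        free(nodes);
        free(status);
}

int main(void)
{
        size_t psize = getpagesize(), i;
        char *buf = malloc(SIZE);
        struct timespec t1, t2;

        memset(buf, 1, SIZE);                   /* populate the pages */
        move_to_node(buf, SIZE, psize, 1);      /* node 0 -> 1 */
        move_to_node(buf, SIZE, psize, 0);      /* node 1 -> 0 */

        /* Time the first read access after migration, one per page. */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        for (i = 0; i < SIZE; i += psize)
                *(volatile char *)(buf + i);
        clock_gettime(CLOCK_MONOTONIC, &t2);

        printf("read took %ld us\n",
               (t2.tv_sec - t1.tv_sec) * 1000000 +
               (t2.tv_nsec - t1.tv_nsec) / 1000);
        return 0;
}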

Patch Layout
============

Patch 1:  Add swp_offset_pfn() and use it for all pfn swap entries; stop
          treating swp_offset() as a PFN, because starting from the next
          patch it can contain more information.
Patch 2:  The core patch to remember young bit in swap offsets.
Patch 3:  A cleanup for x86 32 bits pgtable.h.
Patch 4:  Define __ARCH_SWP_OFFSET_BITS on x86, enabling the young bit for
          migration entries.

Please review, thanks.

[1] https://github.com/xzpeter/clibs/blob/master/misc/swap-young.c

Peter Xu (4):
  mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry
  mm: Remember young bit for page migrations
  mm/x86: Use SWP_TYPE_BITS in 3-level swap macros
  mm/x86: Define __ARCH_SWP_OFFSET_BITS

 arch/arm64/mm/hugetlbpage.c           |  2 +-
 arch/x86/include/asm/pgtable-2level.h |  6 ++
 arch/x86/include/asm/pgtable-3level.h | 15 +++--
 arch/x86/include/asm/pgtable_64.h     |  5 ++
 include/linux/swapops.h               | 85 +++++++++++++++++++++++++--
 mm/hmm.c                              |  2 +-
 mm/huge_memory.c                      | 10 +++-
 mm/memory-failure.c                   |  2 +-
 mm/migrate.c                          |  4 +-
 mm/migrate_device.c                   |  2 +
 mm/page_vma_mapped.c                  |  6 +-
 mm/rmap.c                             |  3 +-
 12 files changed, 122 insertions(+), 20 deletions(-)

Comments

Nadav Amit July 29, 2022, 5:07 p.m. UTC | #1
On Jul 28, 2022, at 6:40 PM, Peter Xu <peterx@redhat.com> wrote:

> [Marking as RFC; only x86 is supported for now, plan to add a few more
> archs when there's a formal version]
> 
> Problem
> =======
> 
> When migrate a page, right now we always mark the migrated page as old.
> The reason could be that we don't really know whether the page is hot or
> cold, so we could have taken it a default negative assuming that's safer.

Looks good to me.

I just wonder whether the order of the patches should be different. I always
understood that separating the “enabling” patch from the others is not a
good practice, since it complicates bisection. I guess it is more of a minor
issue for such a small patch-set…
Peter Xu July 29, 2022, 10:43 p.m. UTC | #2
On Fri, Jul 29, 2022 at 10:07:02AM -0700, Nadav Amit wrote:
> On Jul 28, 2022, at 6:40 PM, Peter Xu <peterx@redhat.com> wrote:
> 
> > [Marking as RFC; only x86 is supported for now, plan to add a few more
> > archs when there's a formal version]
> > 
> > Problem
> > =======
> > 
> > When migrate a page, right now we always mark the migrated page as old.
> > The reason could be that we don't really know whether the page is hot or
> > cold, so we could have taken it a default negative assuming that's safer.
> 
> Looks good to me.

Thanks for the quick review comment, Nadav.

> 
> I just wonder whether the order of the patches should be different. I always
> understood that separating the “enabling” patch from the others is not a
> good practice, since it complicates bisection. I guess it is more of a minor
> issue for such a small patch-set…

Yeah, I'd guess you mean that when a bunch of patches form one feature, we
may want to be able to know which part of the feature breaks something.
But as you mentioned, this feature is mostly implemented in patch 2 only.

I can squash the enablement patch into the core patch, but when it comes to
more archs that also means squashing all the archs into the same patch.
I'm just afraid it'll complicate that patch too much - I expect the swp
offset calculation for each arch may not be entirely straightforward, so it
would be good if they can be reviewed separately and carefully.
Huang, Ying Aug. 1, 2022, 3:20 a.m. UTC | #3
Peter Xu <peterx@redhat.com> writes:

> [Marking as RFC; only x86 is supported for now, plan to add a few more
>  archs when there's a formal version]
>
> Problem
> =======
>
> When migrate a page, right now we always mark the migrated page as old.
> The reason could be that we don't really know whether the page is hot or
> cold, so we could have taken it a default negative assuming that's safer.
>
> However that could lead to at least two problems:
>
>   (1) We lost the real hot/cold information while we could have persisted.
>       That information shouldn't change even if the backing page is changed
>       after the migration,
>
>   (2) There can be always extra overhead on the immediate next access to
>       any migrated page, because hardware MMU needs cycles to set the young
>       bit again (as long as the MMU supports).
>
> Many of the recent upstream works showed that (2) is not something trivial
> and actually very measurable.  In my test case, reading 1G chunk of memory
> - jumping in page size intervals - could take 99ms just because of the
> extra setting on the young bit on a generic x86_64 system, comparing to 4ms
> if young set.

LKP has observed that before too, as in the following reports and
discussion.

https://lore.kernel.org/all/87bn35zcko.fsf@yhuang-dev.intel.com/t/

Best Regards,
Huang, Ying

Huang, Ying Aug. 1, 2022, 5:33 a.m. UTC | #4
Peter Xu <peterx@redhat.com> writes:

> [Marking as RFC; only x86 is supported for now, plan to add a few more
>  archs when there's a formal version]
>
> Problem
> =======
>
> When migrate a page, right now we always mark the migrated page as old.
> The reason could be that we don't really know whether the page is hot or
> cold, so we could have taken it a default negative assuming that's safer.
>
> However that could lead to at least two problems:
>
>   (1) We lost the real hot/cold information while we could have persisted.
>       That information shouldn't change even if the backing page is changed
>       after the migration,
>
>   (2) There can be always extra overhead on the immediate next access to
>       any migrated page, because hardware MMU needs cycles to set the young
>       bit again (as long as the MMU supports).
>
> Many of the recent upstream works showed that (2) is not something trivial
> and actually very measurable.  In my test case, reading 1G chunk of memory
> - jumping in page size intervals - could take 99ms just because of the
> extra setting on the young bit on a generic x86_64 system, comparing to 4ms
> if young set.
>
> This issue is originally reported by Andrea Arcangeli.
>
> Solution
> ========
>
> To solve this problem, this patchset tries to remember the young bit in the
> migration entries and carry it over when recovering the ptes.
>
> We have the chance to do so because in many systems the swap offset is not
> really fully used.  Migration entries use swp offset to store PFN only,
> while the PFN is normally not as large as swp offset and normally smaller.
> It means we do have some free bits in swp offset that we can use to store
> things like young, and that's how this series tried to approach this
> problem.
>
> One tricky thing here is even though we're embedding the information into
> swap entry which seems to be a very generic data structure, the number of
> bits that are free is still arch dependent.  Not only because the size of
> swp_entry_t differs, but also due to the different layouts of swap ptes on
> different archs.

If my understanding is correct, max_swapfile_size() provides a mechanism
to identify the available bits with both swp_entry_t and the swap PTE
considered.  We may take advantage of that?

And according to commit 377eeaa8e11f ("x86/speculation/l1tf: Limit swap
file size to MAX_PA/2"), the highest bit of swap offset needs to be 0 if
L1TF mitigation is enabled.

Cced Andi for confirmation.

Best Regards,
Huang, Ying

David Hildenbrand Aug. 1, 2022, 8:21 a.m. UTC | #5
On 29.07.22 03:40, Peter Xu wrote:
> [Marking as RFC; only x86 is supported for now, plan to add a few more
>  archs when there's a formal version]
> 
> Problem
> =======
> 
> When migrate a page, right now we always mark the migrated page as old.
> The reason could be that we don't really know whether the page is hot or
> cold, so we could have taken it a default negative assuming that's safer.
> 
> However that could lead to at least two problems:
> 
>   (1) We lost the real hot/cold information while we could have persisted.
>       That information shouldn't change even if the backing page is changed
>       after the migration,
> 
>   (2) There can be always extra overhead on the immediate next access to
>       any migrated page, because hardware MMU needs cycles to set the young
>       bit again (as long as the MMU supports).
> 
> Many of the recent upstream works showed that (2) is not something trivial
> and actually very measurable.  In my test case, reading 1G chunk of memory
> - jumping in page size intervals - could take 99ms just because of the
> extra setting on the young bit on a generic x86_64 system, comparing to 4ms
> if young set.
> 
> This issue is originally reported by Andrea Arcangeli.
> 
> Solution
> ========
> 
> To solve this problem, this patchset tries to remember the young bit in the
> migration entries and carry it over when recovering the ptes.
> 
> We have the chance to do so because in many systems the swap offset is not
> really fully used.  Migration entries use swp offset to store PFN only,
> while the PFN is normally not as large as swp offset and normally smaller.
> It means we do have some free bits in swp offset that we can use to store
> things like young, and that's how this series tried to approach this
> problem.
> 
> One tricky thing here is even though we're embedding the information into
> swap entry which seems to be a very generic data structure, the number of
> bits that are free is still arch dependent.  Not only because the size of
> swp_entry_t differs, but also due to the different layouts of swap ptes on
> different archs.
> 
> Here, this series requires specific arch to define an extra macro called
> __ARCH_SWP_OFFSET_BITS represents the size of swp offset.  With this
> information, the swap logic can know whether there's extra bits to use,
> then it'll remember the young bits when possible.  By default, it'll keep
> the old behavior of keeping all migrated pages cold.
> 


I played with a similar idea when working on pte_swp_exclusive() but
gave up, because it ended up looking too hacky. Looking at patch #2, I
get the same feeling again. Kind of hacky.


If we mostly only care about x86_64, and it's a performance improvement
after all, why not simply do it like
pte_swp_mkexclusive/pte_swp_exclusive/ ... and reuse a spare PTE bit?
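
For illustration, such a spare-bit approach could look roughly like the
following on x86-64, mirroring how pte_swp_mkexclusive()/pte_swp_exclusive()
are wired up.  This is a sketch only; _PAGE_BIT_SWP_MIG_YOUNG stands for
whichever free software bit an arch would dedicate and is not an existing
macro:

/* Hypothetical spare software bit in the x86-64 swap pte layout. */
#define _PAGE_SWP_MIG_YOUNG     (_AT(pteval_t, 1) << _PAGE_BIT_SWP_MIG_YOUNG)

static inline pte_t pte_swp_mkmig_young(pte_t pte)
{
        return pte_set_flags(pte, _PAGE_SWP_MIG_YOUNG);
}

static inline int pte_swp_mig_young(pte_t pte)
{
        return pte_flags(pte) & _PAGE_SWP_MIG_YOUNG;
}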
Peter Xu Aug. 1, 2022, 10:25 p.m. UTC | #6
On Mon, Aug 01, 2022 at 01:33:28PM +0800, Huang, Ying wrote:
> If my understanding were correct, max_swapfile_size() provides a
> mechanism to identify the available bits with swp_entry_t and swap PTE
> considered.  We may take advantage of that?

That's an interesting trick, I'll have a closer look, thanks for the
pointer!

> 
> And according to commit 377eeaa8e11f ("x86/speculation/l1tf: Limit swap
> file size to MAX_PA/2"), the highest bit of swap offset needs to be 0 if
> L1TF mitigation is enabled.
> 
> Cced Andi for confirmation.

Yeah it'll be great to have a confirmation, hopefully max_swapfile_size()
should have covered that case.
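
For illustration, a rough sketch of how that could replace the per-arch
macro (the helper below is hypothetical, not something this series
contains, and assumes the arch's limit is a power of two):

/* Derive the usable swp-offset width from max_swapfile_size(), which
 * already accounts for the arch swap pte layout (and the L1TF clamp
 * on x86). */
static unsigned int swp_offset_usable_bits(void)
{
        return ilog2(max_swapfile_size());
}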
Peter Xu Aug. 1, 2022, 10:35 p.m. UTC | #7
On Mon, Aug 01, 2022 at 10:21:32AM +0200, David Hildenbrand wrote:
> On 29.07.22 03:40, Peter Xu wrote:
> > [Marking as RFC; only x86 is supported for now, plan to add a few more
> >  archs when there's a formal version]
> > 
> > Problem
> > =======
> > 
> > When migrate a page, right now we always mark the migrated page as old.
> > The reason could be that we don't really know whether the page is hot or
> > cold, so we could have taken it a default negative assuming that's safer.
> > 
> > However that could lead to at least two problems:
> > 
> >   (1) We lost the real hot/cold information while we could have persisted.
> >       That information shouldn't change even if the backing page is changed
> >       after the migration,
> > 
> >   (2) There can be always extra overhead on the immediate next access to
> >       any migrated page, because hardware MMU needs cycles to set the young
> >       bit again (as long as the MMU supports).
> > 
> > Many of the recent upstream works showed that (2) is not something trivial
> > and actually very measurable.  In my test case, reading 1G chunk of memory
> > - jumping in page size intervals - could take 99ms just because of the
> > extra setting on the young bit on a generic x86_64 system, comparing to 4ms
> > if young set.
> > 
> > This issue is originally reported by Andrea Arcangeli.
> > 
> > Solution
> > ========
> > 
> > To solve this problem, this patchset tries to remember the young bit in the
> > migration entries and carry it over when recovering the ptes.
> > 
> > We have the chance to do so because in many systems the swap offset is not
> > really fully used.  Migration entries use swp offset to store PFN only,
> > while the PFN is normally not as large as swp offset and normally smaller.
> > It means we do have some free bits in swp offset that we can use to store
> > things like young, and that's how this series tried to approach this
> > problem.
> > 
> > One tricky thing here is even though we're embedding the information into
> > swap entry which seems to be a very generic data structure, the number of
> > bits that are free is still arch dependent.  Not only because the size of
> > swp_entry_t differs, but also due to the different layouts of swap ptes on
> > different archs.
> > 
> > Here, this series requires specific arch to define an extra macro called
> > __ARCH_SWP_OFFSET_BITS represents the size of swp offset.  With this
> > information, the swap logic can know whether there's extra bits to use,
> > then it'll remember the young bits when possible.  By default, it'll keep
> > the old behavior of keeping all migrated pages cold.
> > 
> 
> 
> I played with a similar idea when working on pte_swp_exclusive() but
> gave up, because it ended up looking too hacky. Looking at patch #2, I
> get the same feeling again. Kind of hacky.

Could you explain what's the "hacky" part you mentioned?

I used the swap entry to avoid per-arch operations.  I failed to figure out
a common way to know the swp offset length myself, so unfortunately in this
RFC I still needed one macro per arch.  Ying's suggestion seems to be a
good fit here to remove the last arch-specific dependency.

> 
> 
> If we mostly only care about x86_64, and it's a performance improvement
> after all, why not simply do it like
> pte_swp_mkexclusive/pte_swp_exclusive/ ... and reuse a spare PTE bit?

Page migration works on most archs; I want to have it work for all archs
that can easily benefit from it.

Besides, I actually have a question on the anon exclusive bit in the swap
pte: since we have that anyway, why do we need a specific migration type
for anon exclusive pages?  Can they simply be read migration entries with
the anon exclusive bit set?

Thanks,
David Hildenbrand Aug. 2, 2022, 12:06 p.m. UTC | #8
On 02.08.22 00:35, Peter Xu wrote:
> On Mon, Aug 01, 2022 at 10:21:32AM +0200, David Hildenbrand wrote:
>> On 29.07.22 03:40, Peter Xu wrote:
>>> [Marking as RFC; only x86 is supported for now, plan to add a few more
>>>  archs when there's a formal version]
>>>
>>> Problem
>>> =======
>>>
>>> When migrate a page, right now we always mark the migrated page as old.
>>> The reason could be that we don't really know whether the page is hot or
>>> cold, so we could have taken it a default negative assuming that's safer.
>>>
>>> However that could lead to at least two problems:
>>>
>>>   (1) We lost the real hot/cold information while we could have persisted.
>>>       That information shouldn't change even if the backing page is changed
>>>       after the migration,
>>>
>>>   (2) There can be always extra overhead on the immediate next access to
>>>       any migrated page, because hardware MMU needs cycles to set the young
>>>       bit again (as long as the MMU supports).
>>>
>>> Many of the recent upstream works showed that (2) is not something trivial
>>> and actually very measurable.  In my test case, reading 1G chunk of memory
>>> - jumping in page size intervals - could take 99ms just because of the
>>> extra setting on the young bit on a generic x86_64 system, comparing to 4ms
>>> if young set.
>>>
>>> This issue is originally reported by Andrea Arcangeli.
>>>
>>> Solution
>>> ========
>>>
>>> To solve this problem, this patchset tries to remember the young bit in the
>>> migration entries and carry it over when recovering the ptes.
>>>
>>> We have the chance to do so because in many systems the swap offset is not
>>> really fully used.  Migration entries use swp offset to store PFN only,
>>> while the PFN is normally not as large as swp offset and normally smaller.
>>> It means we do have some free bits in swp offset that we can use to store
>>> things like young, and that's how this series tried to approach this
>>> problem.
>>>
>>> One tricky thing here is even though we're embedding the information into
>>> swap entry which seems to be a very generic data structure, the number of
>>> bits that are free is still arch dependent.  Not only because the size of
>>> swp_entry_t differs, but also due to the different layouts of swap ptes on
>>> different archs.
>>>
>>> Here, this series requires specific arch to define an extra macro called
>>> __ARCH_SWP_OFFSET_BITS represents the size of swp offset.  With this
>>> information, the swap logic can know whether there's extra bits to use,
>>> then it'll remember the young bits when possible.  By default, it'll keep
>>> the old behavior of keeping all migrated pages cold.
>>>
>>
>>
>> I played with a similar idea when working on pte_swp_exclusive() but
>> gave up, because it ended up looking too hacky. Looking at patch #2, I
>> get the same feeling again. Kind of hacky.
> 
> Could you explain what's the "hacky" part you mentioned?

SWP_PFN_OFFSET_FREE_BITS :)

It's a PFN offset and we're mangling in random other bits. That's hacky
IMHO.

I played with the idea of converting all code to store bits in addition
to the type + offset. But that requires digging through a lot of arch
code to teach that code about additional flags, so I discarded that idea
when working on the COW fixes.

> 
> I used swap entry to avoid per-arch operations. I failed to figure out a
> common way to know swp offset length myself so unluckily in this RFC I
> still needed one macro per-arch.  Ying's suggestion seems to be a good fit
> here to me to remove the last arch-specific dependency.

Instead of mangling this into the PFN offset and letting the arch tell you
which bits of the PFN offset are unused ... rather remove the bits from
the offset and define them manually to have a certain meaning. That's
exactly how pte_swp_mkexclusive/pte_swp_exclusive/ is supposed to be
handled on architectures that want to support it.

I hope I could make it clearer what the hacky part is IMHO :)

> 
>>
>>
>> If we mostly only care about x86_64, and it's a performance improvement
>> after all, why not simply do it like
>> pte_swp_mkexclusive/pte_swp_exclusive/ ... and reuse a spare PTE bit?
> 
> Page migration works for most archs, I want to have it work for all archs
> that can easily benefit from it.

Yet we only care about x86-64 IIUC regarding performance, just the way
the dirty bit is handled?

> 
> Besides I actually have a question on the anon exclusive bit in the swap
> pte: since we have that anyway, why we need a specific migration type for
> anon exclusive pages?  Can it be simply read migration entries with anon
> exclusive bit set?

Not before all archs support pte_swp_mkexclusive/pte_swp_exclusive/.

As pte_swp_mkexclusive/pte_swp_exclusive/ only applies to actual swap
PTEs, you could even reuse that bit for migration entries and get at
least the most relevant 64-bit architectures supported easily.
Peter Xu Aug. 2, 2022, 8:14 p.m. UTC | #9
On Tue, Aug 02, 2022 at 02:06:37PM +0200, David Hildenbrand wrote:
> On 02.08.22 00:35, Peter Xu wrote:
> > On Mon, Aug 01, 2022 at 10:21:32AM +0200, David Hildenbrand wrote:
> >> On 29.07.22 03:40, Peter Xu wrote:
> >>> [Marking as RFC; only x86 is supported for now, plan to add a few more
> >>>  archs when there's a formal version]
> >>>
> >>> Problem
> >>> =======
> >>>
> >>> When migrate a page, right now we always mark the migrated page as old.
> >>> The reason could be that we don't really know whether the page is hot or
> >>> cold, so we could have taken it a default negative assuming that's safer.
> >>>
> >>> However that could lead to at least two problems:
> >>>
> >>>   (1) We lost the real hot/cold information while we could have persisted.
> >>>       That information shouldn't change even if the backing page is changed
> >>>       after the migration,
> >>>
> >>>   (2) There can be always extra overhead on the immediate next access to
> >>>       any migrated page, because hardware MMU needs cycles to set the young
> >>>       bit again (as long as the MMU supports).
> >>>
> >>> Many of the recent upstream works showed that (2) is not something trivial
> >>> and actually very measurable.  In my test case, reading 1G chunk of memory
> >>> - jumping in page size intervals - could take 99ms just because of the
> >>> extra setting on the young bit on a generic x86_64 system, comparing to 4ms
> >>> if young set.
> >>>
> >>> This issue is originally reported by Andrea Arcangeli.
> >>>
> >>> Solution
> >>> ========
> >>>
> >>> To solve this problem, this patchset tries to remember the young bit in the
> >>> migration entries and carry it over when recovering the ptes.
> >>>
> >>> We have the chance to do so because in many systems the swap offset is not
> >>> really fully used.  Migration entries use swp offset to store PFN only,
> >>> while the PFN is normally not as large as swp offset and normally smaller.
> >>> It means we do have some free bits in swp offset that we can use to store
> >>> things like young, and that's how this series tried to approach this
> >>> problem.
> >>>
> >>> One tricky thing here is even though we're embedding the information into
> >>> swap entry which seems to be a very generic data structure, the number of
> >>> bits that are free is still arch dependent.  Not only because the size of
> >>> swp_entry_t differs, but also due to the different layouts of swap ptes on
> >>> different archs.
> >>>
> >>> Here, this series requires specific arch to define an extra macro called
> >>> __ARCH_SWP_OFFSET_BITS represents the size of swp offset.  With this
> >>> information, the swap logic can know whether there's extra bits to use,
> >>> then it'll remember the young bits when possible.  By default, it'll keep
> >>> the old behavior of keeping all migrated pages cold.
> >>>
> >>
> >>
> >> I played with a similar idea when working on pte_swp_exclusive() but
> >> gave up, because it ended up looking too hacky. Looking at patch #2, I
> >> get the same feeling again. Kind of hacky.
> > 
> > Could you explain what's the "hacky" part you mentioned?
> 
> SWP_PFN_OFFSET_FREE_BITS :)
> 
> It's a PFN offset and we're mangling in random other bits. That's hacky
> IMHO.
> 
> I played with the idea of converting all code to store bits in addition
> to the type + offset. But that requires digging through a lot of arch
> code to teach that code about additional flags, so I discarded that idea
> when working on the COW fixes.

Having SWP_PFN_OFFSET_FREE_BITS was the cleanest approach I could think of
before I knew about the max_swapfile_size() trick.  It only needs the arch
to define this one macro, and swapops.h will take action accordingly.

OTOH, I don't want to use swap pte bits, not only because there aren't a
lot left on some archs (x86_64 only has bit 4 left, afaict), but also
because this is a page migration issue, so it would be nicer to me if it
can be solved at the swp entry level, not the pte level.

> 
> > 
> > I used swap entry to avoid per-arch operations. I failed to figure out a
> > common way to know swp offset length myself so unluckily in this RFC I
> > still needed one macro per-arch.  Ying's suggestion seems to be a good fit
> > here to me to remove the last arch-specific dependency.
> 
> Instead of mangling this into the PFN offset and let the arch tell you
> which bits of the PFN offset are unused ... rather remove the bits from
> the offset and define them manually to have a certain meaning. That's
> exactly how pte_swp_mkexclusive/pte_swp_exclusive/ is supposed to be
> handled on architectures that want to support it.
> 
> I hope I could make it clearer what the hacky part is IMHO :)
> 
> > 
> >>
> >>
> >> If we mostly only care about x86_64, and it's a performance improvement
> >> after all, why not simply do it like
> >> pte_swp_mkexclusive/pte_swp_exclusive/ ... and reuse a spare PTE bit?
> > 
> > Page migration works for most archs, I want to have it work for all archs
> > that can easily benefit from it.
> 
> Yet we only care about x86-64 IIUC regarding performance, just the way
> the dirty bit is handled?

I don't think we only care about x86_64?  Other archs should have the same
issue as long as there's a hardware young bit.

Even without it, it'll affect the page reclaim logic too, and that's also
not x86 only.

> 
> > 
> > Besides I actually have a question on the anon exclusive bit in the swap
> > pte: since we have that anyway, why we need a specific migration type for
> > anon exclusive pages?  Can it be simply read migration entries with anon
> > exclusive bit set?
> 
> Not before all arch support pte_swp_mkexclusive/pte_swp_exclusive/.
> 
> As pte_swp_mkexclusive/pte_swp_exclusive/ only applies to actual swap
> PTEs, you could even reuse that bit for migration entries and get at
> alteast the most relevant 64bit architectures supported easily.

Yes, but I think having two mechanisms for a single problem can confuse
people.

IIUC the swap pte bit is already defined on the major archs anyway, and
since the anon exclusive bit is best-effort (or am I wrong?..), I won't
worry too much about archs outside x86/arm/ppc/s390 losing the anon
exclusive bit during migrations, because afaict the whole swap type of
ANON_EXCLUSIVE_READ only services that very minority.. which makes it seem
a pity to spend a swp type on all archs even when an arch has swp pte bits
defined just for anon exclusive.

Thanks,
David Hildenbrand Aug. 2, 2022, 8:23 p.m. UTC | #10
> I don't think we only care about x86_64?  Should other archs have the same
> issue as long as there's the hardware young bit?
> 
> Even without it, it'll affect page reclaim logic too, and that's also not
> x86 only.

Okay, reading the cover letter and looking at the code my understanding
was that x86-64 is the real focus.

>>
>>>
>>> Besides I actually have a question on the anon exclusive bit in the swap
>>> pte: since we have that anyway, why we need a specific migration type for
>>> anon exclusive pages?  Can it be simply read migration entries with anon
>>> exclusive bit set?
>>
>> Not before all arch support pte_swp_mkexclusive/pte_swp_exclusive/.
>>
>> As pte_swp_mkexclusive/pte_swp_exclusive/ only applies to actual swap
>> PTEs, you could even reuse that bit for migration entries and get at
>> alteast the most relevant 64bit architectures supported easily.
> 
> Yes, but I think having two mechanisms for the single problem can confuse
> people.
> 

It would be one bit with two different meanings depending on the swp type.

> IIUC the swap bit is already defined in major archs anyway, and since anon
> exclusive bit is best-effort (or am I wrong?..), I won't worry too much on

It kind-of is best effort, but the goal is to have all archs support it.

... just like the young bit here?

> archs outside x86/arm/ppc/s390 on having anon exclusive bit lost during
> migrations, because afaict the whole swap type of ANON_EXCLUSIVE_READ is
> only servicing that very minority.. which seems to be a pity to waste the

I have a big item on my todo list to support all, but I have different
priorities right now.

If there is no free bit, simply steal one from the offset ... which is
the same thing your approach would do, just in a different way, no?

> swp type on all archs even if the archs defined swp pte bits just for anon
> exclusive.

Why do we care? We talk about one type, not one bit.
Peter Xu Aug. 2, 2022, 8:35 p.m. UTC | #11
On Tue, Aug 02, 2022 at 10:23:49PM +0200, David Hildenbrand wrote:
> > I don't think we only care about x86_64?  Should other archs have the same
> > issue as long as there's the hardware young bit?
> > 
> > Even without it, it'll affect page reclaim logic too, and that's also not
> > x86 only.
> 
> Okay, reading the cover letter and looking at the code my understanding
> was that x86-64 is the real focus.
> 
> >>
> >>>
> >>> Besides I actually have a question on the anon exclusive bit in the swap
> >>> pte: since we have that anyway, why we need a specific migration type for
> >>> anon exclusive pages?  Can it be simply read migration entries with anon
> >>> exclusive bit set?
> >>
> >> Not before all arch support pte_swp_mkexclusive/pte_swp_exclusive/.
> >>
> >> As pte_swp_mkexclusive/pte_swp_exclusive/ only applies to actual swap
> >> PTEs, you could even reuse that bit for migration entries and get at
> >> alteast the most relevant 64bit architectures supported easily.
> > 
> > Yes, but I think having two mechanisms for the single problem can confuse
> > people.
> > 
> 
> It would be one bit with two different meanings depending on the swp type.
> 
> > IIUC the swap bit is already defined in major archs anyway, and since anon
> > exclusive bit is best-effort (or am I wrong?..), I won't worry too much on
> 
> It kind-of is best effort, but the goal is to have all archs support it.
> 
> ... just like the young bit here?

Exactly, so I'm also wondering whether we can move the swp pte anon
exclusive bit into swp entry.  It just sounds weird to have them defined in
two ways.

> 
> > archs outside x86/arm/ppc/s390 on having anon exclusive bit lost during
> > migrations, because afaict the whole swap type of ANON_EXCLUSIVE_READ is
> > only servicing that very minority.. which seems to be a pity to waste the
> 
> I have a big item on my todo list to support all, but I have different
> priorities right now.
> 
> If there is no free bit, simply steal one from the offset ... which is
> the same thing your approach would do, just in a different way, no?
> 
> > swp type on all archs even if the archs defined swp pte bits just for anon
> > exclusive.
> 
> Why do we care? We walk about one type not one bit.

The swap type address space is still limited; I'd say we should save types
when possible.  I believe people who care about swapping also care about
the limit on swap devices.  If the offset can keep it, I think it's better
than using the swap type.  De-duplicating either the type or the swap pte
bit would be nicer, imho.
David Hildenbrand Aug. 2, 2022, 8:59 p.m. UTC | #12
On 02.08.22 22:35, Peter Xu wrote:
> On Tue, Aug 02, 2022 at 10:23:49PM +0200, David Hildenbrand wrote:
>>> I don't think we only care about x86_64?  Should other archs have the same
>>> issue as long as there's the hardware young bit?
>>>
>>> Even without it, it'll affect page reclaim logic too, and that's also not
>>> x86 only.
>>
>> Okay, reading the cover letter and looking at the code my understanding
>> was that x86-64 is the real focus.
>>
>>>>
>>>>>
>>>>> Besides I actually have a question on the anon exclusive bit in the swap
>>>>> pte: since we have that anyway, why we need a specific migration type for
>>>>> anon exclusive pages?  Can it be simply read migration entries with anon
>>>>> exclusive bit set?
>>>>
>>>> Not before all arch support pte_swp_mkexclusive/pte_swp_exclusive/.
>>>>
>>>> As pte_swp_mkexclusive/pte_swp_exclusive/ only applies to actual swap
>>>> PTEs, you could even reuse that bit for migration entries and get at
>>>> alteast the most relevant 64bit architectures supported easily.
>>>
>>> Yes, but I think having two mechanisms for the single problem can confuse
>>> people.
>>>
>>
>> It would be one bit with two different meanings depending on the swp type.
>>
>>> IIUC the swap bit is already defined in major archs anyway, and since anon
>>> exclusive bit is best-effort (or am I wrong?..), I won't worry too much on
>>
>> It kind-of is best effort, but the goal is to have all archs support it.
>>
>> ... just like the young bit here?
> 
> Exactly, so I'm also wondering whether we can move the swp pte anon
> exclusive bit into swp entry.  It just sounds weird to have them defined in
> two ways.

I'd argue it's just the swp vs. nonswp difference, which are in fact two
different concepts (device+offset vs. type+pte).  And some dirty details
about how swp entries are actually used.

With swp entries you have to be very careful; for example, take a look
at radix_to_swp_entry() and swp_to_radix_entry().  That made me refrain
from touching anything inside actual swp entries and store the bit in
the pte instead.

> 
>>
>>> archs outside x86/arm/ppc/s390 on having anon exclusive bit lost during
>>> migrations, because afaict the whole swap type of ANON_EXCLUSIVE_READ is
>>> only servicing that very minority.. which seems to be a pity to waste the
>>
>> I have a big item on my todo list to support all, but I have different
>> priorities right now.
>>
>> If there is no free bit, simply steal one from the offset ... which is
>> the same thing your approach would do, just in a different way, no?
>>
>>> swp type on all archs even if the archs defined swp pte bits just for anon
>>> exclusive.
>>
>> Why do we care? We walk about one type not one bit.
> 
> The swap type address space is still limited, I'd say we should save when
> possible.  I believe people caring about swapping care about the limit of
> swap devices too.  If the offset can keep it, I think it's better than the

Ehm, last time I did the math I came to the conclusion that nobody
cares. Let me redo the math:

MAX_SWAPFILES = (1 << 5) - 1 - 1 - 4 - 3 - 1 = 22

Which is the worst case right now with all kinds of oddities compiled in
(sorry, CONFIG_DEVICE_PRIVATE).

So far nobody has complained.

> swap type.  De-dup either the type or the swap pte bit would be nicer, imho.
> 

If you manage bits in the pte manually, you might be able to get a
better packing density, if bits are scattered around. Just take a look
at the x86_64 location of _PAGE_SWP_EXCLUSIVE.

What I'm rooting for is something like

#define pte_nonswp_mkyoung pte_swp_mkexclusive

Eventually with some VM_BUG_ONs to make sure people call it on the right
swp ptes.

If we ever want to get rid of SWP_MIGRATION_READ_EXCLUSIVE (so people
can have 23 swap devices), we can eventually have separate bits for both.
For now it's not necessary.
Peter Xu Aug. 2, 2022, 10:15 p.m. UTC | #13
On Tue, Aug 02, 2022 at 10:59:42PM +0200, David Hildenbrand wrote:
> On 02.08.22 22:35, Peter Xu wrote:
> > On Tue, Aug 02, 2022 at 10:23:49PM +0200, David Hildenbrand wrote:
> >>> I don't think we only care about x86_64?  Should other archs have the same
> >>> issue as long as there's the hardware young bit?
> >>>
> >>> Even without it, it'll affect page reclaim logic too, and that's also not
> >>> x86 only.
> >>
> >> Okay, reading the cover letter and looking at the code my understanding
> >> was that x86-64 is the real focus.
> >>
> >>>>
> >>>>>
> >>>>> Besides I actually have a question on the anon exclusive bit in the swap
> >>>>> pte: since we have that anyway, why we need a specific migration type for
> >>>>> anon exclusive pages?  Can it be simply read migration entries with anon
> >>>>> exclusive bit set?
> >>>>
> >>>> Not before all arch support pte_swp_mkexclusive/pte_swp_exclusive/.
> >>>>
> >>>> As pte_swp_mkexclusive/pte_swp_exclusive/ only applies to actual swap
> >>>> PTEs, you could even reuse that bit for migration entries and get at
> >>>> alteast the most relevant 64bit architectures supported easily.
> >>>
> >>> Yes, but I think having two mechanisms for the single problem can confuse
> >>> people.
> >>>
> >>
> >> It would be one bit with two different meanings depending on the swp type.
> >>
> >>> IIUC the swap bit is already defined in major archs anyway, and since anon
> >>> exclusive bit is best-effort (or am I wrong?..), I won't worry too much on
> >>
> >> It kind-of is best effort, but the goal is to have all archs support it.
> >>
> >> ... just like the young bit here?
> > 
> > Exactly, so I'm also wondering whether we can move the swp pte anon
> > exclusive bit into swp entry.  It just sounds weird to have them defined in
> > two ways.
> 
> I'd argue it's just the swp vs. nonswp difference that are in fact two
> different concepts (device+offset vs. type+pte). And some dirty details
> how swp entries are actually used.
> 
> With swp entries you have to be very careful, for example, take a look
> at radix_to_swp_entry() and swp_to_radix_entry(). That made me refrain
> from touching anything inside actual swp entries and instead store it in
> the pte.

I don't really see any risk - it neither touches the swp entry nor does
anything complicated around it (shift by 1 and set bit 0 to 1).  Please
feel free to share your concern in case I overlooked something, though.

> 
> > 
> >>
> >>> archs outside x86/arm/ppc/s390 on having anon exclusive bit lost during
> >>> migrations, because afaict the whole swap type of ANON_EXCLUSIVE_READ is
> >>> only servicing that very minority.. which seems to be a pity to waste the
> >>
> >> I have a big item on my todo list to support all, but I have different
> >> priorities right now.
> >>
> >> If there is no free bit, simply steal one from the offset ... which is
> >> the same thing your approach would do, just in a different way, no?
> >>
> >>> swp type on all archs even if the archs defined swp pte bits just for anon
> >>> exclusive.
> >>
> >> Why do we care? We walk about one type not one bit.
> > 
> > The swap type address space is still limited, I'd say we should save when
> > possible.  I believe people caring about swapping care about the limit of
> > swap devices too.  If the offset can keep it, I think it's better than the
> 
> Ehm, last time I did the math I came to the conclusion that nobody
> cares. Let me redo the math:
> 
> MAX_SWAPFILES = 1<<5 - 1 - 1 - 4 - 3 - 1 = 22
> 
> Which is the worst case right now with all kinds of oddity compiled in
> (sorry CONFIG_DEVICE_PRIVATE).
> 
> So far nobody complaint.

Yeah.  To me, using one bit of it is fine, especially if that's the best we
can do.  What confuses me here is that we have two ways to represent
"whether this page is anon exclusive" in a swap pte: either we encode it in
the type or we use the bit.  The tricky part is that whether the swap pte
bit makes sense depends on the swp type in the first place, while the swap
type can itself be "anon exclusive read migration".

> 
> > swap type.  De-dup either the type or the swap pte bit would be nicer, imho.
> > 
> 
> If you manage bits in the pte manually, you might be able to get a
> better packing density, if bits are scattered around. Just take a look
> at the x86_64 location of _PAGE_SWP_EXCLUSIVE.
> 
> What I'm rooting for is something like
> 
> #define pte_nonswp_mkyoung pte_swp_mkexclusive
> 
> Eventually with some VM_BUG_ONs to make sure people call it on the right
> swp ptes.
> 
> If we ever want to get rid of SWP_MIGRATION_READ_EXCLUSIVE (so people
> can have 23 swap devices), and eventually have separate bits for both.
> For now it's not necessary.

Okay, but that's probably the last thing I'd like to try - I don't want to
further complicate the anon exclusive bit in the swap pte..  I think
cleaning that up somehow would be nice, but as you mentioned, if no one
complains I have no strong opinion either.

Thanks,