
[2/2] mm: Remember young bit for page migrations

Message ID 20220803012159.36551-3-peterx@redhat.com (mailing list archive)
State New
Series mm: Remember young bit for migration entries

Commit Message

Peter Xu Aug. 3, 2022, 1:21 a.m. UTC
When page migration happens, we always ignore the young bit settings in the
old pgtable, and mark the page as old in the new page table using either
pte_mkold() or pmd_mkold().

That's fine functionally, but it's not friendly to page reclaim because the
page being moved can be actively accessed during the procedure.  Not to
mention that hardware setting the young bit can bring quite some overhead on
some systems, e.g. x86_64 needs a few hundred nanoseconds to set the bit.

Actually we can easily remember the young bit configuration and recover the
information after the page is migrated.  To achieve it, define a new bit in
the migration swap offset field to record whether the old pte had the young
bit set.  Then when removing/recovering the migration entry, we can recover
the young bit even though the page changed.
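
As a standalone illustration of that encoding (a hypothetical userspace
sketch, not kernel code; the real SWP_PFN_BITS is arch-specific, 58 below is
only an example):

#include <inttypes.h>
#include <stdio.h>

/* Stand-ins for the kernel's definitions; SWP_PFN_BITS is arch-dependent. */
#define SWP_PFN_BITS		58
#define SWP_PFN_MASK		((1ULL << SWP_PFN_BITS) - 1)
#define SWP_MIG_YOUNG_BIT	(1ULL << SWP_PFN_BITS)

int main(void)
{
	uint64_t offset = 0x12345;	/* the swp offset normally encodes the PFN */

	offset |= SWP_MIG_YOUNG_BIT;	/* remember "young" above the PFN bits */

	printf("pfn   = %#" PRIx64 "\n", offset & SWP_PFN_MASK);
	printf("young = %d\n", !!(offset & SWP_MIG_YOUNG_BIT));
	return 0;
}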

One thing to mention is that we use max_swapfile_size() to detect how many
swp offset bits we have, and we only enable this feature if we know the swp
offset is big enough to store both the PFN value and the young bit.
Otherwise the young bit is dropped as before.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/swapops.h | 49 +++++++++++++++++++++++++++++++++++++++++
 mm/huge_memory.c        | 10 +++++++--
 mm/migrate.c            |  4 +++-
 mm/migrate_device.c     |  2 ++
 mm/rmap.c               |  3 ++-
 5 files changed, 64 insertions(+), 4 deletions(-)

Comments

Nadav Amit Aug. 3, 2022, 7:42 a.m. UTC | #1
On Aug 2, 2022, at 6:21 PM, Peter Xu <peterx@redhat.com> wrote:

> When page migration happens, we always ignore the young bit settings in the
> old pgtable, and mark the page as old in the new page table using either
> pte_mkold() or pmd_mkold().
> 
> That's fine functionally, but it's not friendly to page reclaim because the
> page being moved can be actively accessed during the procedure.  Not to
> mention that hardware setting the young bit can bring quite some overhead on
> some systems, e.g. x86_64 needs a few hundred nanoseconds to set the bit.
> 
> Actually we can easily remember the young bit configuration and recover the
> information after the page is migrated.  To achieve it, define a new bit in
> the migration swap offset field to record whether the old pte had the young
> bit set.  Then when removing/recovering the migration entry, we can recover
> the young bit even though the page changed.
> 
> One thing to mention is that we use max_swapfile_size() to detect how many
> swp offset bits we have, and we only enable this feature if we know the swp
> offset is big enough to store both the PFN value and the young bit.
> Otherwise the young bit is dropped as before.

I gave it some more thought and I am less confident that this is the best
solution. I am not sure it is not, either, so I am raising an alternative
with its pros and cons.

An alternative would be to propagate the access bit into the page (i.e.,
using folio_set_young()) and then set it back into the PTE later (i.e.,
based on folio_test_young()). It might even seem that in general it is
better to always set the page access bit if folio_test_young().
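
A minimal sketch of what that could look like (illustrative only: the helper
names below are made up, and folio_set_young()/folio_test_young() are only
available with CONFIG_PAGE_IDLE_FLAG):

/* Sketch: fold the pte young bit into the folio before the migration
 * entry is installed (e.g. from try_to_migrate_one()). */
static void migrate_save_young(struct folio *folio, pte_t pteval)
{
	if (pte_young(pteval))
		folio_set_young(folio);
}

/* Sketch: recover young from the folio when the migration entry is
 * removed (e.g. from remove_migration_pte()). */
static pte_t migrate_restore_young(struct folio *folio, pte_t pte)
{
	if (!folio_test_young(folio))
		pte = pte_mkold(pte);
	return pte;
}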

This can be simpler and more performant. Setting the access-bit would not
impact reclaim decisions (as the page is already considered young), would
not induce overheads on clearing the access-bit (no TLB flush is needed at
least on x86), and would save the time the CPU takes to set the access bit
if the page is ever accessed (on x86).

It may also improve the precision of page-idle mechanisms and the
interaction with them. IIUC, page-idle does not consider migration entries,
so the user would not get an indication that pages under migration are not
idle. When page-idle is reset, migrated pages might later be reinstated as
“accessed”, giving the wrong indication that the pages are not idle, when in
fact they are.

On the negative side, I am not sure whether, on other archs that might
require a TLB flush for resetting the access-bit, that flush and the atomic
operation to clear the access-bit would not induce more overhead than they
would save.

One more unrelated point - note that remove_migration_pte() would always set
a clean PTE even when the old one was dirty…
kernel test robot Aug. 3, 2022, 12:21 p.m. UTC | #2
Hi Peter,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Peter-Xu/mm-Remember-young-bit-for-migration-entries/20220803-092311
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
config: openrisc-randconfig-r016-20220803 (https://download.01.org/0day-ci/archive/20220803/202208032031.PVcMB0Hr-lkp@intel.com/config)
compiler: or1k-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/2fca6cb25745d1404fc34e0ec2ea89b6195a8c27
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Peter-Xu/mm-Remember-young-bit-for-migration-entries/20220803-092311
        git checkout 2fca6cb25745d1404fc34e0ec2ea89b6195a8c27
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=openrisc SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   or1k-linux-ld: mm/rmap.o: in function `migration_entry_supports_young':
>> include/linux/swapops.h:288: undefined reference to `max_swapfile_size'
   include/linux/swapops.h:288:(.text+0x31a0): relocation truncated to fit: R_OR1K_INSN_REL_26 against undefined symbol `max_swapfile_size'
   or1k-linux-ld: mm/migrate.o: in function `migration_entry_supports_young':
>> include/linux/swapops.h:288: undefined reference to `max_swapfile_size'
   include/linux/swapops.h:288:(.text+0x158): relocation truncated to fit: R_OR1K_INSN_REL_26 against undefined symbol `max_swapfile_size'


vim +288 include/linux/swapops.h

   279	
   280	static inline bool migration_entry_supports_young(void)
   281	{
   282		/*
   283		 * max_swapfile_size() returns the max supported swp-offset plus 1.
   284		 * We can support the migration young bit only if the pfn swap
   285		 * entry has the offset larger than storing the PFN value, then it
   286		 * means there's extra bit(s) where we can store the young bit.
   287		 */
 > 288		return max_swapfile_size() > SWP_MIG_YOUNG_BIT;
   289	}
   290
kernel test robot Aug. 3, 2022, 1:53 p.m. UTC | #3
Hi Peter,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Peter-Xu/mm-Remember-young-bit-for-migration-entries/20220803-092311
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
config: arc-randconfig-r043-20220803 (https://download.01.org/0day-ci/archive/20220803/202208032100.RedT7779-lkp@intel.com/config)
compiler: arc-elf-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/2fca6cb25745d1404fc34e0ec2ea89b6195a8c27
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Peter-Xu/mm-Remember-young-bit-for-migration-entries/20220803-092311
        git checkout 2fca6cb25745d1404fc34e0ec2ea89b6195a8c27
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arc SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   arc-elf-ld: mm/rmap.o: in function `migration_entry_supports_young':
   include/linux/swapops.h:288: undefined reference to `max_swapfile_size'
>> arc-elf-ld: include/linux/swapops.h:288: undefined reference to `max_swapfile_size'
   arc-elf-ld: mm/migrate.o: in function `migration_entry_supports_young':
   include/linux/swapops.h:288: undefined reference to `max_swapfile_size'
>> arc-elf-ld: include/linux/swapops.h:288: undefined reference to `max_swapfile_size'
Peter Xu Aug. 3, 2022, 4:45 p.m. UTC | #4
On Wed, Aug 03, 2022 at 12:42:54AM -0700, Nadav Amit wrote:
> On Aug 2, 2022, at 6:21 PM, Peter Xu <peterx@redhat.com> wrote:
> 
> > When page migration happens, we always ignore the young bit settings in the
> > old pgtable, and mark the page as old in the new page table using either
> > pte_mkold() or pmd_mkold().
> > 
> > That's fine functionally, but it's not friendly to page reclaim because the
> > page being moved can be actively accessed during the procedure.  Not to
> > mention that hardware setting the young bit can bring quite some overhead on
> > some systems, e.g. x86_64 needs a few hundred nanoseconds to set the bit.
> > 
> > Actually we can easily remember the young bit configuration and recover the
> > information after the page is migrated.  To achieve it, define a new bit in
> > the migration swap offset field to record whether the old pte had the young
> > bit set.  Then when removing/recovering the migration entry, we can recover
> > the young bit even though the page changed.
> > 
> > One thing to mention is that we use max_swapfile_size() to detect how many
> > swp offset bits we have, and we only enable this feature if we know the swp
> > offset is big enough to store both the PFN value and the young bit.
> > Otherwise the young bit is dropped as before.
> 
> I gave it some more thought and I am less confident that this is the best
> solution. I am not sure it is not, either, so I am raising an alternative
> with its pros and cons.
> 
> An alternative would be to propagate the access bit into the page (i.e.,
> using folio_set_young()) and then set it back into the PTE later (i.e.,
> based on folio_test_young()). It might even seem that in general it is
> better to always set the page access bit if folio_test_young().

That's indeed an option.  It's just that the Young bit (along with the Idle
bit) is only defined with the PAGE_IDLE feature enabled; otherwise they're
both no-ops.

Another thing is that even though using page flags looks clean, it loses the
per-virtual-address-space granularity when the page can be mapped in
multiple mm/vmas.  I don't think there's a major difference here since page
reclaim will collect either the pte young bit or the page young bit (as long
as page idle is defined), so it'll be the same.  But there will be other
side effects related to the virtual address space, e.g. the extra TLB flush
needed as you said, even for mappings that never accessed the pages at all,
so they become false-positive "young" pages after the migration entries are
recovered.

In short, it'll be slightly less accurate than storing it in the pgtables.

> 
> This can be simpler and more performant. Setting the access-bit would not
> impact reclaim decisions (as the page is already considered young), would
> not induce overheads on clearing the access-bit (no TLB flush is needed at
> least on x86), and would save the time the CPU takes to set the access bit
> if the page is ever accessed (on x86).

Agreed.  These benefits should be shared between both the pgtable approach
and the PageYoung approach you mentioned.

> 
> It may also improve the precision of page-idle mechanisms and the
> interaction with them.

IIUC this may need extra work for page idle.

Currently when doing rmap walks we either only look at migration entries or
never look at them, according to the PVMW_MIGRATION flag.  We'll need to
teach the page idle code and the rmap walker to walk both present ptes and
migration entries at the same time to achieve that precision.

> IIUC, page-idle does not consider migration entries, so the user would not
> get an indication that pages under migration are not idle. When page-idle is
> reset, migrated pages might later be reinstated as “accessed”, giving the
> wrong indication that the pages are not idle, when in fact they are.

Before this patchset, we constantly lose that young bit, hence it's a false
negative after the migration entries are recovered.

After this patchset, we'll have a possible race iff page idle is triggered
while migrating some pages, but it's only a race condition and the young bit
will be correctly collected after migration completes.

So I agree it's a problem as you mentioned, but it's probably still better
with the current patch than without it.

If we ultimately decide to go with the page flags approach as you proposed,
just to mention that setting PageYoung is not enough - that bit is only
there to preserve the young information for page reclaim so that page idle
doesn't disturb it.  IMO what we really need is to clear PageIdle instead.
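
Roughly, a folio-flag variant would likely need to touch both flags
(hypothetical helper, assuming CONFIG_PAGE_IDLE_FLAG):

static void migrate_mark_accessed(struct folio *folio)
{
	folio_set_young(folio);		/* preserve "referenced" for reclaim */
	folio_clear_idle(folio);	/* keep page-idle reporting accurate */
}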

> 
> On the negative side, I am not sure whether, on other archs that might
> require a TLB flush for resetting the access-bit, that flush and the atomic
> operation to clear the access-bit would not induce more overhead than they
> would save.

I think your proposal makes sense and looks clean, maybe even cleaner than
the new max_swapfile_size() approach (and definitely nicer than my old one).
It's just that I still want this to happen even without page idle enabled -
at least Fedora doesn't have page idle enabled by default.  I'm not sure
whether it'll be worth defining the Young bit just for this (note: iiuc we
don't need the Idle bit in this case, only the Young bit).

The other thing is whether there's any other side effect of losing the
pte-level granularity of the young bit, since once we merge it into the page
flags that granularity is lost.  So far I don't worry a lot about the TLB
flush overhead, but hopefully there's nothing else we missed.

> 
> One more unrelated point - note that remove_migration_pte() would always set
> a clean PTE even when the old one was dirty…

Correct.  Put another way, at least initial-write performance will still
suffer after migration on x86.

The dirty bit is kind of different in this case so I didn't yet try to cover
it.  E.g., we won't lose it even without this patchset, because it's already
consolidated into PageDirty - or that would be a bug.

I think PageDirty could be cleared during the migration procedure, and if so
we could wrongly apply the dirty bit to the recovered pte.  I thought about
this before posting this series, but I hesitated to add the dirty bit along
with it at least in these initial versions, since the dirty bit may need
some more justification.

Please feel free to share any further thoughts on the dirty bit.

Thanks,
Peter Xu Aug. 3, 2022, 9:47 p.m. UTC | #5
On Wed, Aug 03, 2022 at 08:21:43PM +0800, kernel test robot wrote:
> Hi Peter,
> 
> Thank you for the patch! Yet something to improve:
> 
> [auto build test ERROR on akpm-mm/mm-everything]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Peter-Xu/mm-Remember-young-bit-for-migration-entries/20220803-092311
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> config: openrisc-randconfig-r016-20220803 (https://download.01.org/0day-ci/archive/20220803/202208032031.PVcMB0Hr-lkp@intel.com/config)
> compiler: or1k-linux-gcc (GCC) 12.1.0
> reproduce (this is a W=1 build):
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # https://github.com/intel-lab-lkp/linux/commit/2fca6cb25745d1404fc34e0ec2ea89b6195a8c27
>         git remote add linux-review https://github.com/intel-lab-lkp/linux
>         git fetch --no-tags linux-review Peter-Xu/mm-Remember-young-bit-for-migration-entries/20220803-092311
>         git checkout 2fca6cb25745d1404fc34e0ec2ea89b6195a8c27
>         # save the config file
>         mkdir build_dir && cp config build_dir/.config
>         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=openrisc SHELL=/bin/bash
> 
> If you fix the issue, kindly add following tag where applicable
> Reported-by: kernel test robot <lkp@intel.com>
> 
> All errors (new ones prefixed by >>):
> 
>    or1k-linux-ld: mm/rmap.o: in function `migration_entry_supports_young':
> >> include/linux/swapops.h:288: undefined reference to `max_swapfile_size'
>    include/linux/swapops.h:288:(.text+0x31a0): relocation truncated to fit: R_OR1K_INSN_REL_26 against undefined symbol `max_swapfile_size'
>    or1k-linux-ld: mm/migrate.o: in function `migration_entry_supports_young':
> >> include/linux/swapops.h:288: undefined reference to `max_swapfile_size'
>    include/linux/swapops.h:288:(.text+0x158): relocation truncated to fit: R_OR1K_INSN_REL_26 against undefined symbol `max_swapfile_size'

Hmm, I'm a bit surprised that swapops.h can be used without CONFIG_SWAP..
I'll squash this into the next version (if not going with a page-flag based
approach):

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 9ddede3790a4..d689f59479c3 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -285,7 +285,11 @@ static inline bool migration_entry_supports_young(void)
         * entry has the offset larger than storing the PFN value, then it
         * means there's extra bit(s) where we can store the young bit.
         */
+#ifdef CONFIG_SWAP
        return max_swapfile_size() > SWP_MIG_YOUNG_BIT;
+#else
+       return false;
+#endif
 }
 
 static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
Nadav Amit Aug. 4, 2022, 6:42 a.m. UTC | #6
On Aug 3, 2022, at 9:45 AM, Peter Xu <peterx@redhat.com> wrote:

> On Wed, Aug 03, 2022 at 12:42:54AM -0700, Nadav Amit wrote:
>> On Aug 2, 2022, at 6:21 PM, Peter Xu <peterx@redhat.com> wrote:
>> 
>> On the negative side, I am not sure whether, on other archs that might
>> require a TLB flush for resetting the access-bit, that flush and the atomic
>> operation to clear the access-bit would not induce more overhead than they
>> would save.
> 
> I think your proposal makes sense and looks clean, maybe even cleaner than
> the new max_swapfile_size() approach (and definitely nicer than my old one).
> It's just that I still want this to happen even without page idle enabled -
> at least Fedora doesn't have page idle enabled by default.  I'm not sure
> whether it'll be worth defining the Young bit just for this (note: iiuc we
> don't need the Idle bit in this case, only the Young bit).
> 
> The other thing is whether there's any other side effect of losing the
> pte-level granularity of the young bit, since once we merge it into the page
> flags that granularity is lost.  So far I don't worry a lot about the TLB
> flush overhead, but hopefully there's nothing else we missed.

I agree with your arguments. I missed the fact that the page young bit is
only defined if PAGE_IDLE is defined. So unless it makes sense in general to
have all pages marked as accessed if the page is young, adding the bit would
cause additional overheads, especially on 32-bit systems.

I also agree that the solution you provided would improve page-idle
behavior. However, while not being wrong, users who try to clear page-idle
indications would not succeed in doing so for pages that are undergoing
migration.

There are some additional implications that I do not remember you or anyone
else (including me) mentioning, specifically for MADV_COLD and MADV_FREE.
You may want to teach madvise_cold_or_pageout_pte_range() and
madvise_free_pte_range() to deal with this new type of entry.

On the other hand, madvise is already “broken” today in that regard, since
IIUC it does not even clear PageReferenced (on MADV_COLD) for migrated
pages.

>> One more unrelated point - note that remove_migration_pte() would always set
>> a clean PTE even when the old one was dirty…
> 
> Correct.  Put another way, at least initial-write performance will still
> suffer after migration on x86.
> 
> The dirty bit is kind of different in this case so I didn't yet try to cover
> it.  E.g., we won't lose it even without this patchset, because it's already
> consolidated into PageDirty - or that would be a bug.
> 
> I think PageDirty could be cleared during the migration procedure, and if so
> we could wrongly apply the dirty bit to the recovered pte.  I thought about
> this before posting this series, but I hesitated to add the dirty bit along
> with it at least in these initial versions, since the dirty bit may need
> some more justification.
> 
> Please feel free to share any further thoughts on the dirty bit.

I fully understand that the dirty-bit can wait for a different patch(-set).
But I do not see a problem in setting the dirty-bit if the PTE is mapped as
writable, since the page can anyhow be dirtied at any given moment
afterwards without kernel involvement.

If you are concerned that the page will be written back in between and the
PTE would be marked dirty unnecessarily, you can keep the bit only if both
PageDirty and a new “swap entry dirty-bit” are set.
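
For reference, such a “swap entry dirty-bit” could mirror the young-bit
helpers from this patch, one bit higher (a hypothetical sketch, not part of
this series):

#define SWP_MIG_DIRTY_BIT	(1UL << (SWP_PFN_BITS + 1))

static inline bool migration_entry_supports_dirty(void)
{
	/* Same capacity check as for the young bit, one bit higher. */
	return max_swapfile_size() > SWP_MIG_DIRTY_BIT;
}

static inline swp_entry_t make_migration_entry_dirty(swp_entry_t entry)
{
	if (migration_entry_supports_dirty())
		return swp_entry(swp_type(entry),
				 swp_offset(entry) | SWP_MIG_DIRTY_BIT);
	return entry;
}

static inline bool is_migration_entry_dirty(swp_entry_t entry)
{
	if (migration_entry_supports_dirty())
		return swp_offset(entry) & SWP_MIG_DIRTY_BIT;
	return false;
}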
Peter Xu Aug. 4, 2022, 5:07 p.m. UTC | #7
On Wed, Aug 03, 2022 at 11:42:39PM -0700, Nadav Amit wrote:
> On Aug 3, 2022, at 9:45 AM, Peter Xu <peterx@redhat.com> wrote:
> 
> > On Wed, Aug 03, 2022 at 12:42:54AM -0700, Nadav Amit wrote:
> >> On Aug 2, 2022, at 6:21 PM, Peter Xu <peterx@redhat.com> wrote:
> >> 
> >> On the negative side, I am not sure whether, on other archs that might
> >> require a TLB flush for resetting the access-bit, that flush and the atomic
> >> operation to clear the access-bit would not induce more overhead than they
> >> would save.
> > 
> > I think your proposal makes sense and looks clean, maybe even cleaner than
> > the new max_swapfile_size() approach (and definitely nicer than my old one).
> > It's just that I still want this to happen even without page idle enabled -
> > at least Fedora doesn't have page idle enabled by default.  I'm not sure
> > whether it'll be worth defining the Young bit just for this (note: iiuc we
> > don't need the Idle bit in this case, only the Young bit).
> > 
> > The other thing is whether there's any other side effect of losing the
> > pte-level granularity of the young bit, since once we merge it into the page
> > flags that granularity is lost.  So far I don't worry a lot about the TLB
> > flush overhead, but hopefully there's nothing else we missed.
> 
> I agree with your arguments. I missed the fact that the page young bit is
> only defined if PAGE_IDLE is defined. So unless it makes sense in general to
> have all pages marked as accessed if the page is young, adding the bit would
> cause additional overheads, especially on 32-bit systems.
> 
> I also agree that the solution you provided would improve page-idle
> behavior. However, while not being wrong, users who try to clear page-idle
> indications would not succeed in doing so for pages that are undergoing
> migration.

Right.

Since I'm not yet convinced about reusing PageYoung here unconditionally, I
think I'll still stick with the current approach if nothing else jumps in.
I still see page idle tracking on migration entries as a long-standing and
relatively separate issue, so IMHO we can move one step at a time: solve
"page idle tracking for migrated pages" now, and leave "page idle reset
during page migration" for later.

> 
> There are some additional implications that I do not remember you or anyone
> else (including me) mentioning, specifically for MADV_COLD and MADV_FREE.
> You may want to teach madvise_cold_or_pageout_pte_range() and
> madvise_free_pte_range() to deal with this new type of entry.
> 
> On the other hand, madvise is already “broken” today in that regard, since
> IIUC it does not even clear PageReferenced (on MADV_COLD) for migrated
> pages.

Yeah, afaict we don't handle migration pages in either madvise.

Maybe it's because it's racy to access the page knowing that it's being
operated upon by the thread doing the migration (without holding the page
lock)?  For real migrations (not THP splits), we'd also be changing the old
page, not the new one.  So maybe migration entries are just not yet a major
target for either of the madvises.

For this series, I can think more about dropping the young bit for a
migration entry during these madvises (which should be relatively safe with
the pgtable lock held, since I don't need to touch the page but just modify
the swap entry within), but probably that's not really the major problem
here, so I'm not sure whether it matters a whole lot (e.g., for FREE we
should really drop the whole entry?).
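
As a rough illustration (hypothetical, not part of this series; a real
version would also need to carry over the soft-dirty/uffd-wp swp pte bits),
such aging inside madvise_cold_or_pageout_pte_range(), using that function's
locals, could look like:

		if (!pte_present(ptent)) {
			swp_entry_t entry;

			if (pte_none(ptent))
				continue;

			entry = pte_to_swp_entry(ptent);
			/* Age a young migration entry in place: only the
			 * swap pte is rewritten, the page isn't touched. */
			if (is_migration_entry(entry) &&
			    is_migration_entry_young(entry)) {
				entry = swp_entry(swp_type(entry),
						  swp_offset(entry) &
						  ~SWP_MIG_YOUNG_BIT);
				set_pte_at(mm, addr, pte,
					   swp_entry_to_pte(entry));
			}
			continue;
		}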

Copying Minchan too.

> 
> >> One more unrelated point - note that remove_migration_pte() would always set
> >> a clean PTE even when the old one was dirty…
> > 
> > Correct.  Put another way, at least initial-write performance will still
> > suffer after migration on x86.
> > 
> > The dirty bit is kind of different in this case so I didn't yet try to cover
> > it.  E.g., we won't lose it even without this patchset, because it's already
> > consolidated into PageDirty - or that would be a bug.
> > 
> > I think PageDirty could be cleared during the migration procedure, and if so
> > we could wrongly apply the dirty bit to the recovered pte.  I thought about
> > this before posting this series, but I hesitated to add the dirty bit along
> > with it at least in these initial versions, since the dirty bit may need
> > some more justification.
> > 
> > Please feel free to share any further thoughts on the dirty bit.
> 
> I fully understand that the dirty-bit can wait for a different patch(-set).
> But I do not see a problem in setting the dirty-bit if the PTE is mapped as
> writable, since the page can anyhow be dirtied at any given moment
> afterwards without kernel involvement.
> 
> If you are concerned that the page will be written back in between and the
> PTE would be marked dirty unnecessarily, you can keep the bit only if both
> PageDirty and a new “swap entry dirty-bit” are set.

Sounds good, I'll think more about it and see whether I'll cover that too.

Thanks,
Nadav Amit Aug. 4, 2022, 5:16 p.m. UTC | #8
On Aug 4, 2022, at 10:07 AM, Peter Xu <peterx@redhat.com> wrote:

> On Wed, Aug 03, 2022 at 11:42:39PM -0700, Nadav Amit wrote:
>> On Aug 3, 2022, at 9:45 AM, Peter Xu <peterx@redhat.com> wrote:
> 
> For this series, I can think more about dropping the young bit for a
> migration entry during these madvises (which should be relatively safe with
> the pgtable lock held, since I don't need to touch the page but just modify
> the swap entry within), but probably that's not really the major problem
> here, so I'm not sure whether it matters a whole lot (e.g., for FREE we
> should really drop the whole entry?).

Sounds good to me. I just had a look at whether the swap entry should be
checked and hit these additional cases, so I thought they were worth
considering. But as these paths already don't handle some cases, which is
valid for madvise's hints, leaving it for later sounds fine.

Handling migration entries anyhow might not be worth the risk of breaking
something.

Patch

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 1d17e4bb3d2f..9ddede3790a4 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -8,6 +8,8 @@ 
 
 #ifdef CONFIG_MMU
 
+#include <linux/swapfile.h>
+
 /*
  * swapcache pages are stored in the swapper_space radix tree.  We want to
  * get good packing density in that tree, so the index should be dense in
@@ -35,6 +37,16 @@ 
 #endif
 #define SWP_PFN_MASK			((1UL << SWP_PFN_BITS) - 1)
 
+/**
+ * Migration swap entry specific bitfield definitions.
+ *
+ * @SWP_MIG_YOUNG_BIT: Whether the page used to have young bit set
+ *
+ * Note: these bits will be used only if there're free bits in arch
+ * specific swp offset field.
+ */
+#define SWP_MIG_YOUNG_BIT		(1UL << SWP_PFN_BITS)
+
 static inline bool is_pfn_swap_entry(swp_entry_t entry);
 
 /* Clear all flags but only keep swp_entry_t related information */
@@ -265,6 +277,33 @@  static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
 	return swp_entry(SWP_MIGRATION_WRITE, offset);
 }
 
+static inline bool migration_entry_supports_young(void)
+{
+	/*
+	 * max_swapfile_size() returns the max supported swp-offset plus 1.
+	 * We can support the migration young bit only if the pfn swap
+	 * entry has the offset larger than storing the PFN value, then it
+	 * means there's extra bit(s) where we can store the young bit.
+	 */
+	return max_swapfile_size() > SWP_MIG_YOUNG_BIT;
+}
+
+static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
+{
+	if (migration_entry_supports_young())
+		return swp_entry(swp_type(entry),
+				 swp_offset(entry) | SWP_MIG_YOUNG_BIT);
+	return entry;
+}
+
+static inline bool is_migration_entry_young(swp_entry_t entry)
+{
+	if (migration_entry_supports_young())
+		return swp_offset(entry) & SWP_MIG_YOUNG_BIT;
+	/* Keep the old behavior of aging page after migration */
+	return false;
+}
+
 extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
 					spinlock_t *ptl);
 extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
@@ -311,6 +350,16 @@  static inline int is_readable_migration_entry(swp_entry_t entry)
 	return 0;
 }
 
+static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
+{
+	return entry;
+}
+
+static inline bool is_migration_entry_young(swp_entry_t entry)
+{
+	return false;
+}
+
 #endif
 
 typedef unsigned long pte_marker;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 29e3628687a6..131fe5754d8f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2088,7 +2088,7 @@  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		write = is_writable_migration_entry(entry);
 		if (PageAnon(page))
 			anon_exclusive = is_readable_exclusive_migration_entry(entry);
-		young = false;
+		young = is_migration_entry_young(entry);
 		soft_dirty = pmd_swp_soft_dirty(old_pmd);
 		uffd_wp = pmd_swp_uffd_wp(old_pmd);
 	} else {
@@ -2146,6 +2146,8 @@  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			else
 				swp_entry = make_readable_migration_entry(
 							page_to_pfn(page + i));
+			if (young)
+				swp_entry = make_migration_entry_young(swp_entry);
 			entry = swp_entry_to_pte(swp_entry);
 			if (soft_dirty)
 				entry = pte_swp_mksoft_dirty(entry);
@@ -3148,6 +3150,8 @@  int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 		entry = make_readable_exclusive_migration_entry(page_to_pfn(page));
 	else
 		entry = make_readable_migration_entry(page_to_pfn(page));
+	if (pmd_young(pmdval))
+		entry = make_migration_entry_young(entry);
 	pmdswp = swp_entry_to_pmd(entry);
 	if (pmd_soft_dirty(pmdval))
 		pmdswp = pmd_swp_mksoft_dirty(pmdswp);
@@ -3173,13 +3177,15 @@  void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 
 	entry = pmd_to_swp_entry(*pvmw->pmd);
 	get_page(new);
-	pmde = pmd_mkold(mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot)));
+	pmde = mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot));
 	if (pmd_swp_soft_dirty(*pvmw->pmd))
 		pmde = pmd_mksoft_dirty(pmde);
 	if (is_writable_migration_entry(entry))
 		pmde = maybe_pmd_mkwrite(pmde, vma);
 	if (pmd_swp_uffd_wp(*pvmw->pmd))
 		pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde));
+	if (!is_migration_entry_young(entry))
+		pmde = pmd_mkold(pmde);
 
 	if (PageAnon(new)) {
 		rmap_t rmap_flags = RMAP_COMPOUND;
diff --git a/mm/migrate.c b/mm/migrate.c
index 1649270bc1a7..62cb3a9451de 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -199,7 +199,7 @@  static bool remove_migration_pte(struct folio *folio,
 #endif
 
 		folio_get(folio);
-		pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot)));
+		pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
 		if (pte_swp_soft_dirty(*pvmw.pte))
 			pte = pte_mksoft_dirty(pte);
 
@@ -207,6 +207,8 @@  static bool remove_migration_pte(struct folio *folio,
 		 * Recheck VMA as permissions can change since migration started
 		 */
 		entry = pte_to_swp_entry(*pvmw.pte);
+		if (!is_migration_entry_young(entry))
+			pte = pte_mkold(pte);
 		if (is_writable_migration_entry(entry))
 			pte = maybe_mkwrite(pte, vma);
 		else if (pte_swp_uffd_wp(*pvmw.pte))
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 7feeb447e3b9..fd8daf45c1a6 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -221,6 +221,8 @@  static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			else
 				entry = make_readable_migration_entry(
 							page_to_pfn(page));
+			if (pte_young(pte))
+				entry = make_migration_entry_young(entry);
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_present(pte)) {
 				if (pte_soft_dirty(pte))
diff --git a/mm/rmap.c b/mm/rmap.c
index af775855e58f..605fb37ae95e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2065,7 +2065,8 @@  static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			else
 				entry = make_readable_migration_entry(
 							page_to_pfn(subpage));
-
+			if (pte_young(pteval))
+				entry = make_migration_entry_young(entry);
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);