
[RFC,0/2] remove SWAP_MAP_SHMEM

Message ID 20240923231142.4155415-1-nphamcs@gmail.com (mailing list archive)

Message

Nhat Pham Sept. 23, 2024, 11:11 p.m. UTC
The SWAP_MAP_SHMEM state was originally introduced in commit
aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM") to quickly determine if a
swap entry belongs to shmem during swapoff.

However, swapoff has since been rewritten drastically in commit
b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now,
having swap count == SWAP_MAP_SHMEM is basically the same as having
swap count == 1, and swap_shmem_alloc() behaves analogously to
swap_duplicate().

This RFC proposes removing this state and the associated helper to
simplify the state machine (both mentally and code-wise). We will also
free up an extra state/special value that can be repurposed (for swap
entries that never get re-duplicated).

Another motivation (albeit a bit premature at the moment) is the new swap
abstraction I am currently working on, that would allow for swap/zswap
decoupling, swapoff optimization, etc. The fewer states and swap API
functions there are, the simpler the conversion will be.

I am sending this series first as an RFC, just in case I missed something
or misunderstood this state, or if someone has a swap optimization in mind
for shmem that would require this special state.

Swap experts, let me know if I'm mistaken :) Otherwise, if there is no
objection, I will resend this patch series for merging.

Nhat Pham (2):
  swapfile: add a batched variant for swap_duplicate()
  swap: shmem: remove SWAP_MAP_SHMEM

 include/linux/swap.h | 16 ++++++++--------
 mm/shmem.c           |  2 +-
 mm/swapfile.c        | 28 +++++++++-------------------
 3 files changed, 18 insertions(+), 28 deletions(-)


base-commit: acfabf7e197f7a5bedf4749dac1f39551417b049

Comments

Yosry Ahmed Sept. 24, 2024, 12:20 a.m. UTC | #1
On Mon, Sep 23, 2024 at 4:11 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> The SWAP_MAP_SHMEM state was originally introduced in the commit
> aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a
> swap entry belongs to shmem during swapoff.
>
> However, swapoff has since been rewritten drastically in the commit
> b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now
> having swap count == SWAP_MAP_SHMEM value is basically the same as having
> swap count == 1, and swap_shmem_alloc() behaves analogously to
> swap_duplicate()
>
> This RFC proposes the removal of this state and the associated helper to
> simplify the state machine (both mentally and code-wise). We will also
> have an extra state/special value that can be repurposed (for swap entries
> that never gets re-duplicated).
>
> Another motivation (albeit a bit premature at the moment) is the new swap
> abstraction I am currently working on, that would allow for swap/zswap
> decoupling, swapoff optimization, etc. The fewer states and swap API
> functions there are, the simpler the conversion will be.
>
> I am sending this series first as an RFC, just in case I missed something
> or misunderstood this state, or if someone has a swap optimization in mind
> for shmem that would require this special state.

I have the same patch sitting in a tree somewhere from when I tried
working on swap abstraction, except then swap_shmem_alloc() did not
take a 'nr' argument so I did not need swap_duplicate_nr(). I was
going to send it out with other swap code cleanups I had, but I ended
up deciding to do nothing.

So for what it's worth I think this is correct:
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>

>
> Swap experts, let me know if I'm mistaken :) Otherwise if there is no
> objection I will resend this patch series again for merging.
>
> Nhat Pham (2):
>   swapfile: add a batched variant for swap_duplicate()
>   swap: shmem: remove SWAP_MAP_SHMEM
>
>  include/linux/swap.h | 16 ++++++++--------
>  mm/shmem.c           |  2 +-
>  mm/swapfile.c        | 28 +++++++++-------------------
>  3 files changed, 18 insertions(+), 28 deletions(-)
>
>
> base-commit: acfabf7e197f7a5bedf4749dac1f39551417b049
> --
> 2.43.5
Baolin Wang Sept. 24, 2024, 1:55 a.m. UTC | #2
On 2024/9/24 07:11, Nhat Pham wrote:
> The SWAP_MAP_SHMEM state was originally introduced in the commit
> aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a
> swap entry belongs to shmem during swapoff.
> 
> However, swapoff has since been rewritten drastically in the commit
> b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now
> having swap count == SWAP_MAP_SHMEM value is basically the same as having
> swap count == 1, and swap_shmem_alloc() behaves analogously to
> swap_duplicate()
>      
> This RFC proposes the removal of this state and the associated helper to
> simplify the state machine (both mentally and code-wise). We will also
> have an extra state/special value that can be repurposed (for swap entries
> that never gets re-duplicated).
> 
> Another motivation (albeit a bit premature at the moment) is the new swap
> abstraction I am currently working on, that would allow for swap/zswap
> decoupling, swapoff optimization, etc. The fewer states and swap API
> functions there are, the simpler the conversion will be.
> 
> I am sending this series first as an RFC, just in case I missed something
> or misunderstood this state, or if someone has a swap optimization in mind
> for shmem that would require this special state.

The idea makes sense to me. I did a quick test with shmem mTHP, and 
encountered the following warning which is triggered by 
'VM_WARN_ON(usage == 1 && nr > 1)' in __swap_duplicate().

[   81.064967] ------------[ cut here ]------------
[   81.064968] WARNING: CPU: 4 PID: 6852 at mm/swapfile.c:3623 
__swap_duplicate+0x1d0/0x2e0
[   81.064994] pstate: 23400005 (nzCv daif +PAN -UAO +TCO +DIT -SSBS 
BTYPE=--)
[   81.064995] pc : __swap_duplicate+0x1d0/0x2e0
[   81.064997] lr : swap_duplicate_nr+0x30/0x70
[......]
[   81.065019] Call trace:
[   81.065019]  __swap_duplicate+0x1d0/0x2e0
[   81.065021]  swap_duplicate_nr+0x30/0x70
[   81.065022]  shmem_writepage+0x24c/0x438
[   81.065024]  pageout+0x104/0x2e0
[   81.065026]  shrink_folio_list+0x7f0/0xe60
[   81.065027]  reclaim_folio_list+0x90/0x178
[   81.065029]  reclaim_pages+0x128/0x1a8
[   81.065030]  madvise_cold_or_pageout_pte_range+0x80c/0xd10
[   81.065031]  walk_pmd_range.isra.0+0x1b8/0x3a0
[   81.065033]  walk_pud_range+0x120/0x1b0
[   81.065035]  walk_pgd_range+0x150/0x1a8
[   81.065036]  __walk_page_range+0xa4/0xb8
[   81.065038]  walk_page_range+0x1c8/0x250
[   81.065039]  madvise_pageout+0xf4/0x280
[   81.065041]  madvise_vma_behavior+0x268/0x3f0
[   81.065042]  madvise_walk_vmas.constprop.0+0xb8/0x128
[   81.065043]  do_madvise.part.0+0xe8/0x2a0
[   81.065044]  __arm64_sys_madvise+0x64/0x78
[   81.065046]  invoke_syscall.constprop.0+0x54/0xe8
[   81.065048]  do_el0_svc+0xa4/0xc0
[   81.065050]  el0_svc+0x2c/0xb0
[   81.065052]  el0t_64_sync_handler+0xb8/0xc0
[   81.065054]  el0t_64_sync+0x14c/0x150

> Swap experts, let me know if I'm mistaken :) Otherwise if there is no
> objection I will resend this patch series again for merging.
> 
> Nhat Pham (2):
>    swapfile: add a batched variant for swap_duplicate()
>    swap: shmem: remove SWAP_MAP_SHMEM
> 
>   include/linux/swap.h | 16 ++++++++--------
>   mm/shmem.c           |  2 +-
>   mm/swapfile.c        | 28 +++++++++-------------------
>   3 files changed, 18 insertions(+), 28 deletions(-)
> 
> 
> base-commit: acfabf7e197f7a5bedf4749dac1f39551417b049
Yosry Ahmed Sept. 24, 2024, 2:15 a.m. UTC | #3
On Mon, Sep 23, 2024 at 6:55 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2024/9/24 07:11, Nhat Pham wrote:
> > The SWAP_MAP_SHMEM state was originally introduced in the commit
> > aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a
> > swap entry belongs to shmem during swapoff.
> >
> > However, swapoff has since been rewritten drastically in the commit
> > b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now
> > having swap count == SWAP_MAP_SHMEM value is basically the same as having
> > swap count == 1, and swap_shmem_alloc() behaves analogously to
> > swap_duplicate()
> >
> > This RFC proposes the removal of this state and the associated helper to
> > simplify the state machine (both mentally and code-wise). We will also
> > have an extra state/special value that can be repurposed (for swap entries
> > that never gets re-duplicated).
> >
> > Another motivation (albeit a bit premature at the moment) is the new swap
> > abstraction I am currently working on, that would allow for swap/zswap
> > decoupling, swapoff optimization, etc. The fewer states and swap API
> > functions there are, the simpler the conversion will be.
> >
> > I am sending this series first as an RFC, just in case I missed something
> > or misunderstood this state, or if someone has a swap optimization in mind
> > for shmem that would require this special state.
>
> The idea makes sense to me. I did a quick test with shmem mTHP, and
> encountered the following warning which is triggered by
> 'VM_WARN_ON(usage == 1 && nr > 1)' in __swap_duplicate().

Apparently __swap_duplicate() does not currently handle increasing the
swap count for multiple swap entries by 1 (i.e. usage == 1) because it
does not handle rolling back count increases when
swap_count_continued() fails.

I guess this voids my Reviewed-by until we sort this out. Technically
swap_count_continued() won't ever be called for shmem because we only
ever increment the count by 1, but there is no way to know this in
__swap_duplicate() without SWAP_MAP_SHMEM.

>
> [   81.064967] ------------[ cut here ]------------
> [   81.064968] WARNING: CPU: 4 PID: 6852 at mm/swapfile.c:3623
> __swap_duplicate+0x1d0/0x2e0
> [   81.064994] pstate: 23400005 (nzCv daif +PAN -UAO +TCO +DIT -SSBS
> BTYPE=--)
> [   81.064995] pc : __swap_duplicate+0x1d0/0x2e0
> [   81.064997] lr : swap_duplicate_nr+0x30/0x70
> [......]
> [   81.065019] Call trace:
> [   81.065019]  __swap_duplicate+0x1d0/0x2e0
> [   81.065021]  swap_duplicate_nr+0x30/0x70
> [   81.065022]  shmem_writepage+0x24c/0x438
> [   81.065024]  pageout+0x104/0x2e0
> [   81.065026]  shrink_folio_list+0x7f0/0xe60
> [   81.065027]  reclaim_folio_list+0x90/0x178
> [   81.065029]  reclaim_pages+0x128/0x1a8
> [   81.065030]  madvise_cold_or_pageout_pte_range+0x80c/0xd10
> [   81.065031]  walk_pmd_range.isra.0+0x1b8/0x3a0
> [   81.065033]  walk_pud_range+0x120/0x1b0
> [   81.065035]  walk_pgd_range+0x150/0x1a8
> [   81.065036]  __walk_page_range+0xa4/0xb8
> [   81.065038]  walk_page_range+0x1c8/0x250
> [   81.065039]  madvise_pageout+0xf4/0x280
> [   81.065041]  madvise_vma_behavior+0x268/0x3f0
> [   81.065042]  madvise_walk_vmas.constprop.0+0xb8/0x128
> [   81.065043]  do_madvise.part.0+0xe8/0x2a0
> [   81.065044]  __arm64_sys_madvise+0x64/0x78
> [   81.065046]  invoke_syscall.constprop.0+0x54/0xe8
> [   81.065048]  do_el0_svc+0xa4/0xc0
> [   81.065050]  el0_svc+0x2c/0xb0
> [   81.065052]  el0t_64_sync_handler+0xb8/0xc0
> [   81.065054]  el0t_64_sync+0x14c/0x150
>
> > Swap experts, let me know if I'm mistaken :) Otherwise if there is no
> > objection I will resend this patch series again for merging.
> >
> > Nhat Pham (2):
> >    swapfile: add a batched variant for swap_duplicate()
> >    swap: shmem: remove SWAP_MAP_SHMEM
> >
> >   include/linux/swap.h | 16 ++++++++--------
> >   mm/shmem.c           |  2 +-
> >   mm/swapfile.c        | 28 +++++++++-------------------
> >   3 files changed, 18 insertions(+), 28 deletions(-)
> >
> >
> > base-commit: acfabf7e197f7a5bedf4749dac1f39551417b049
Baolin Wang Sept. 24, 2024, 3:25 a.m. UTC | #4
On 2024/9/24 10:15, Yosry Ahmed wrote:
> On Mon, Sep 23, 2024 at 6:55 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>>
>>
>> On 2024/9/24 07:11, Nhat Pham wrote:
>>> The SWAP_MAP_SHMEM state was originally introduced in the commit
>>> aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a
>>> swap entry belongs to shmem during swapoff.
>>>
>>> However, swapoff has since been rewritten drastically in the commit
>>> b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now
>>> having swap count == SWAP_MAP_SHMEM value is basically the same as having
>>> swap count == 1, and swap_shmem_alloc() behaves analogously to
>>> swap_duplicate()
>>>
>>> This RFC proposes the removal of this state and the associated helper to
>>> simplify the state machine (both mentally and code-wise). We will also
>>> have an extra state/special value that can be repurposed (for swap entries
>>> that never gets re-duplicated).
>>>
>>> Another motivation (albeit a bit premature at the moment) is the new swap
>>> abstraction I am currently working on, that would allow for swap/zswap
>>> decoupling, swapoff optimization, etc. The fewer states and swap API
>>> functions there are, the simpler the conversion will be.
>>>
>>> I am sending this series first as an RFC, just in case I missed something
>>> or misunderstood this state, or if someone has a swap optimization in mind
>>> for shmem that would require this special state.
>>
>> The idea makes sense to me. I did a quick test with shmem mTHP, and
>> encountered the following warning which is triggered by
>> 'VM_WARN_ON(usage == 1 && nr > 1)' in __swap_duplicate().
> 
> Apparently __swap_duplicate() does not currently handle increasing the
> swap count for multiple swap entries by 1 (i.e. usage == 1) because it
> does not handle rolling back count increases when
> swap_count_continued() fails.
> 
> I guess this voids my Reviewed-by until we sort this out. Technically
> swap_count_continued() won't ever be called for shmem because we only
> ever increment the count by 1, but there is no way to know this in
> __swap_duplicate() without SWAP_HAS_SHMEM.

Agreed. An easy solution might be to add a new boolean parameter
indicating whether we are incrementing the count of a shmem swap entry?

diff --git a/mm/swapfile.c b/mm/swapfile.c
index cebc244ee60f..21f1eec2c30a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3607,7 +3607,7 @@ void si_swapinfo(struct sysinfo *val)
  * - swap-cache reference is requested but the entry is not used. -> ENOENT
  * - swap-mapped reference requested but needs continued swap count. -> ENOMEM
  */
-static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
+static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr, bool shmem)
 {
        struct swap_info_struct *si;
        struct swap_cluster_info *ci;
@@ -3620,7 +3620,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
 
        offset = swp_offset(entry);
        VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
-       VM_WARN_ON(usage == 1 && nr > 1);
+       VM_WARN_ON(usage == 1 && nr > 1 && !shmem);
        ci = lock_cluster_or_swap_info(si, offset);
 
        err = 0;
@@ -3661,7 +3661,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
                        has_cache = SWAP_HAS_CACHE;
                else if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
                        count += usage;
-               else if (swap_count_continued(si, offset + i, count))
+               else if (!shmem && swap_count_continued(si, offset + i, count))
                        count = COUNT_CONTINUED;
                else {
                        /*
Nhat Pham Sept. 24, 2024, 2:32 p.m. UTC | #5
On Mon, Sep 23, 2024 at 8:25 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
> On 2024/9/24 10:15, Yosry Ahmed wrote:
> > On Mon, Sep 23, 2024 at 6:55 PM Baolin Wang
> > <baolin.wang@linux.alibaba.com> wrote:
> >>
> >>
> >>
> >> On 2024/9/24 07:11, Nhat Pham wrote:
> >>> The SWAP_MAP_SHMEM state was originally introduced in the commit
> >>> aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a
> >>> swap entry belongs to shmem during swapoff.
> >>>
> >>> However, swapoff has since been rewritten drastically in the commit
> >>> b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now
> >>> having swap count == SWAP_MAP_SHMEM value is basically the same as having
> >>> swap count == 1, and swap_shmem_alloc() behaves analogously to
> >>> swap_duplicate()
> >>>
> >>> This RFC proposes the removal of this state and the associated helper to
> >>> simplify the state machine (both mentally and code-wise). We will also
> >>> have an extra state/special value that can be repurposed (for swap entries
> >>> that never gets re-duplicated).
> >>>
> >>> Another motivation (albeit a bit premature at the moment) is the new swap
> >>> abstraction I am currently working on, that would allow for swap/zswap
> >>> decoupling, swapoff optimization, etc. The fewer states and swap API
> >>> functions there are, the simpler the conversion will be.
> >>>
> >>> I am sending this series first as an RFC, just in case I missed something
> >>> or misunderstood this state, or if someone has a swap optimization in mind
> >>> for shmem that would require this special state.
> >>
> >> The idea makes sense to me. I did a quick test with shmem mTHP, and
> >> encountered the following warning which is triggered by
> >> 'VM_WARN_ON(usage == 1 && nr > 1)' in __swap_duplicate().
> >
> > Apparently __swap_duplicate() does not currently handle increasing the
> > swap count for multiple swap entries by 1 (i.e. usage == 1) because it
> > does not handle rolling back count increases when
> > swap_count_continued() fails.
> >
> > I guess this voids my Reviewed-by until we sort this out. Technically
> > swap_count_continued() won't ever be called for shmem because we only
> > ever increment the count by 1, but there is no way to know this in
> > __swap_duplicate() without SWAP_HAS_SHMEM.

Ah this is my bad. I compiled with CONFIG_THP_SWAP, but forgot to
remove the swapfile check (that's another can of worms, but I need
data before submitting the patch to remove it...)

One thing we can do: instead of warning here, we can handle it in
the for-loop check, where we have access to count - that's the point
of having that for-loop check anyway? :)

There's a couple of ways to go about it:

1. VM_WARN_ON(usage == 1 && nr > 1 && count != 0);
(or, more accurately, (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX)

2. Alternatively, instead of warning here, we can simply return
-ENOMEM. Then, at shmem callsite, have a VM_WARN_ON/VM_BUG_ON(), since
this MUST succeed.

Either solution should come with careful documentation to make the
expectations/guarantees of the new API clear.

Yosry, Baolin, how do you two feel about this? Would something like
this work? I need to test it first, but let me know if I'm missing
something.

If this does not work, we can do what Baolin is suggesting, and
perhaps maintain the swap_shmem_alloc() helper. It's less than ideal,
but at least we still lose a state...
Yosry Ahmed Sept. 24, 2024, 3:07 p.m. UTC | #6
On Tue, Sep 24, 2024 at 7:32 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, Sep 23, 2024 at 8:25 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
> >
> >
> > On 2024/9/24 10:15, Yosry Ahmed wrote:
> > > On Mon, Sep 23, 2024 at 6:55 PM Baolin Wang
> > > <baolin.wang@linux.alibaba.com> wrote:
> > >>
> > >>
> > >>
> > >> On 2024/9/24 07:11, Nhat Pham wrote:
> > >>> The SWAP_MAP_SHMEM state was originally introduced in the commit
> > >>> aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a
> > >>> swap entry belongs to shmem during swapoff.
> > >>>
> > >>> However, swapoff has since been rewritten drastically in the commit
> > >>> b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now
> > >>> having swap count == SWAP_MAP_SHMEM value is basically the same as having
> > >>> swap count == 1, and swap_shmem_alloc() behaves analogously to
> > >>> swap_duplicate()
> > >>>
> > >>> This RFC proposes the removal of this state and the associated helper to
> > >>> simplify the state machine (both mentally and code-wise). We will also
> > >>> have an extra state/special value that can be repurposed (for swap entries
> > >>> that never gets re-duplicated).
> > >>>
> > >>> Another motivation (albeit a bit premature at the moment) is the new swap
> > >>> abstraction I am currently working on, that would allow for swap/zswap
> > >>> decoupling, swapoff optimization, etc. The fewer states and swap API
> > >>> functions there are, the simpler the conversion will be.
> > >>>
> > >>> I am sending this series first as an RFC, just in case I missed something
> > >>> or misunderstood this state, or if someone has a swap optimization in mind
> > >>> for shmem that would require this special state.
> > >>
> > >> The idea makes sense to me. I did a quick test with shmem mTHP, and
> > >> encountered the following warning which is triggered by
> > >> 'VM_WARN_ON(usage == 1 && nr > 1)' in __swap_duplicate().
> > >
> > > Apparently __swap_duplicate() does not currently handle increasing the
> > > swap count for multiple swap entries by 1 (i.e. usage == 1) because it
> > > does not handle rolling back count increases when
> > > swap_count_continued() fails.
> > >
> > > I guess this voids my Reviewed-by until we sort this out. Technically
> > > swap_count_continued() won't ever be called for shmem because we only
> > > ever increment the count by 1, but there is no way to know this in
> > > __swap_duplicate() without SWAP_HAS_SHMEM.
>
> Ah this is my bad. I compiled with CONFIG_THP_SWAP, but forgot to
> remove the swapfile check (that's another can of worms, but I need
> data before submitting the patch to remove it...)
>
> One thing we can do is instead of warning here, we can handle it in
> the for loop check, where we have access to count - that's the point
> of having that for-loop check anyway? :)
>
> There's a couple of ways to go about it:
>
> 1. VM_WARN_ON(usage == 1 && nr > 1 && count != 0 );

Hmm that should work, although it's a bit complicated tbh.

> (or more accurately, (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX))

I think this will make the warning very hard to hit if there's a
misuse of __swap_duplicate(). It will only be hit when an entry needs
count continuation, which I am not sure is very common. If there's a
bug, the warning will potentially catch it too late, if ever.

The side effect here is failing to decrement the swap count of some
swap entries which will lead to them never being freed, essentially
leaking swap capacity slowly over time. I am not sure if there are
more detrimental effects.

>
> 2. Alternatively, instead of warning here, we can simply return
> -ENOMEM. Then, at shmem callsite, have a VM_WARN_ON/VM_BUG_ON(), since
> this MUST succeed.

We still fail to rollback incremented counts though when we return
-ENOMEM, right? Maybe I didn't get what you mean.

>
> Either solutions should follow with careful documentation to make it
> clear the expectation/guarantee of the new API.
>
> Yosry, Baolin, how do you two feel about this? Would something like
> this work? I need to test it first, but let me know if I'm missing
> something.
>
> If this does not work, we can do what Baolin is suggesting, and
> perhaps maintain the swap_shmem_alloc() helper. It's less than ideal,
> but at least we still lose a state...

Depending on the complexity, tbh. Right now removing SWAP_MAP_SHMEM is
just a cleanup with small wins, so if it's too complicated to remove,
it may not be worth it. I am assuming that with your ongoing work it
becomes much more valuable, so maybe if it's too complicated we can
defer it until the benefits are realizable?
Nhat Pham Sept. 24, 2024, 3:48 p.m. UTC | #7
On Tue, Sep 24, 2024 at 8:08 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Tue, Sep 24, 2024 at 7:32 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Mon, Sep 23, 2024 at 8:25 PM Baolin Wang
> > <baolin.wang@linux.alibaba.com> wrote:
> > >
> > >
> > > On 2024/9/24 10:15, Yosry Ahmed wrote:
> > > > On Mon, Sep 23, 2024 at 6:55 PM Baolin Wang
> > > > <baolin.wang@linux.alibaba.com> wrote:
> > > >>
> > > >>
> > > >>
> > > >> On 2024/9/24 07:11, Nhat Pham wrote:
> > > >>> The SWAP_MAP_SHMEM state was originally introduced in the commit
> > > >>> aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a
> > > >>> swap entry belongs to shmem during swapoff.
> > > >>>
> > > >>> However, swapoff has since been rewritten drastically in the commit
> > > >>> b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now
> > > >>> having swap count == SWAP_MAP_SHMEM value is basically the same as having
> > > >>> swap count == 1, and swap_shmem_alloc() behaves analogously to
> > > >>> swap_duplicate()
> > > >>>
> > > >>> This RFC proposes the removal of this state and the associated helper to
> > > >>> simplify the state machine (both mentally and code-wise). We will also
> > > >>> have an extra state/special value that can be repurposed (for swap entries
> > > >>> that never gets re-duplicated).
> > > >>>
> > > >>> Another motivation (albeit a bit premature at the moment) is the new swap
> > > >>> abstraction I am currently working on, that would allow for swap/zswap
> > > >>> decoupling, swapoff optimization, etc. The fewer states and swap API
> > > >>> functions there are, the simpler the conversion will be.
> > > >>>
> > > >>> I am sending this series first as an RFC, just in case I missed something
> > > >>> or misunderstood this state, or if someone has a swap optimization in mind
> > > >>> for shmem that would require this special state.
> > > >>
> > > >> The idea makes sense to me. I did a quick test with shmem mTHP, and
> > > >> encountered the following warning which is triggered by
> > > >> 'VM_WARN_ON(usage == 1 && nr > 1)' in __swap_duplicate().
> > > >
> > > > Apparently __swap_duplicate() does not currently handle increasing the
> > > > swap count for multiple swap entries by 1 (i.e. usage == 1) because it
> > > > does not handle rolling back count increases when
> > > > swap_count_continued() fails.
> > > >
> > > > I guess this voids my Reviewed-by until we sort this out. Technically
> > > > swap_count_continued() won't ever be called for shmem because we only
> > > > ever increment the count by 1, but there is no way to know this in
> > > > __swap_duplicate() without SWAP_HAS_SHMEM.
> >
> > Ah this is my bad. I compiled with CONFIG_THP_SWAP, but forgot to
> > remove the swapfile check (that's another can of worms, but I need
> > data before submitting the patch to remove it...)
> >
> > One thing we can do is instead of warning here, we can handle it in
> > the for loop check, where we have access to count - that's the point
> > of having that for-loop check anyway? :)
> >
> > There's a couple of ways to go about it:
> >
> > 1. VM_WARN_ON(usage == 1 && nr > 1 && count != 0 );
>
> Hmm that should work, although it's a bit complicated tbh.
>
> > (or more accurately, (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX))
>
> I think this will make the warning very hard to hit if there's a
> misuse of __swap_duplicate(). It will only be hit when an entry needs
> count continuation, which I am not sure is very common. If there's a
> bug, the warning will potentially catch it too late, if ever.
>
> The side effect here is failing to decrement the swap count of some
> swap entries which will lead to them never being freed, essentially
> leaking swap capacity slowly over time. I am not sure if there are
> more detrimental effects.
>
> >
> > 2. Alternatively, instead of warning here, we can simply return
> > -ENOMEM. Then, at shmem callsite, have a VM_WARN_ON/VM_BUG_ON(), since
> > this MUST succeed.
>
> We still fail to rollback incremented counts though when we return
> -ENOMEM, right? Maybe I didn't get what you mean.

My understanding now is that there are two for loops. One for loop
that checks the entry's states, and one for loop that does the actual
incrementing work (or state modification).

We can check in the first for loop, if it is safe to proceed:

if (!count && !has_cache) {
        err = -ENOENT;
} else if (usage == SWAP_HAS_CACHE) {
        if (has_cache)
                err = -EEXIST;
} else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) {
        err = -EINVAL;
} else if (usage == 1 && nr > 1 &&
           (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX) {
        /* the batched variants currently do not support rollback */
        err = -ENOMEM;
}

At this point, IIUC, we have not done any incrementing, so no rollback
needed? :)

>
> >
> > Either solutions should follow with careful documentation to make it
> > clear the expectation/guarantee of the new API.
> >
> > Yosry, Baolin, how do you two feel about this? Would something like
> > this work? I need to test it first, but let me know if I'm missing
> > something.
> >
> > If this does not work, we can do what Baolin is suggesting, and
> > perhaps maintain the swap_shmem_alloc() helper. It's less than ideal,
> > but at least we still lose a state...
>
> Depending on the complexity tbh, right now removing SWAP_MAP_SHMEM is
> just a cleanup with small wins, so if it's too complicated to remove
> it it may not be worth it. I am assuming with your ongoing work, it
> becomes much more valuable, so maybe if it's too complicated we can
> defer it until the benefits are realizable?

I agree :)
Yosry Ahmed Sept. 24, 2024, 6:11 p.m. UTC | #8
[..]
> > > > > Apparently __swap_duplicate() does not currently handle increasing the
> > > > > swap count for multiple swap entries by 1 (i.e. usage == 1) because it
> > > > > does not handle rolling back count increases when
> > > > > swap_count_continued() fails.
> > > > >
> > > > > I guess this voids my Reviewed-by until we sort this out. Technically
> > > > > swap_count_continued() won't ever be called for shmem because we only
> > > > > ever increment the count by 1, but there is no way to know this in
> > > > > __swap_duplicate() without SWAP_HAS_SHMEM.
> > >
> > > Ah this is my bad. I compiled with CONFIG_THP_SWAP, but forgot to
> > > remove the swapfile check (that's another can of worms, but I need
> > > data before submitting the patch to remove it...)
> > >
> > > One thing we can do is instead of warning here, we can handle it in
> > > the for loop check, where we have access to count - that's the point
> > > of having that for-loop check anyway? :)
> > >
> > > There's a couple of ways to go about it:
> > >
> > > 1. VM_WARN_ON(usage == 1 && nr > 1 && count != 0 );
> >
> > Hmm that should work, although it's a bit complicated tbh.
> >
> > > (or more accurately, (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX))
> >
> > I think this will make the warning very hard to hit if there's a
> > misuse of __swap_duplicate(). It will only be hit when an entry needs
> > count continuation, which I am not sure is very common. If there's a
> > bug, the warning will potentially catch it too late, if ever.
> >
> > The side effect here is failing to decrement the swap count of some
> > swap entries which will lead to them never being freed, essentially
> > leaking swap capacity slowly over time. I am not sure if there are
> > more detrimental effects.
> >
> > >
> > > 2. Alternatively, instead of warning here, we can simply return
> > > -ENOMEM. Then, at shmem callsite, have a VM_WARN_ON/VM_BUG_ON(), since
> > > this MUST succeed.
> >
> > We still fail to rollback incremented counts though when we return
> > -ENOMEM, right? Maybe I didn't get what you mean.
>
> My understanding now is that there are two for loops. One for loop
> that checks the entry's states, and one for loop that does the actual
> incrementing work (or state modification).
>
> We can check in the first for loop, if it is safe to proceed:
>
> if (!count && !has_cache) {
>     err = -ENOENT;
> } else if (usage == SWAP_HAS_CACHE) {
> if (has_cache)
>     err = -EEXIST;
> } else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) {
>     err = -EINVAL;
> } else if (usage == 1 && nr > 1 && (count & ~COUNT_CONTINUED) >=
> SWAP_MAP_MAX)) {
>     /* the batched variants currently do not support rollback */
>     err = -ENOMEM;
> }

Hmm yeah I think something like this should work and is arguably
better than just warning, although this needs cleaning up:
- We already know usage != SWAP_HAS_CACHE, so no need to check if usage == 1.
- We already know (count & ~COUNT_CONTINUED) is not larger than
SWAP_MAP_MAX at that point, so we should check if it's equal to SWAP_MAP_MAX.
- We should probably just calculate count & ~COUNT_CONTINUED above the
if conditions at this point.
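
Taken together, those three cleanups could be folded into a precheck along
these lines (a standalone toy sketch: the constant values are copied from
swap.h, errno values are hard-coded as plain ints, and the function name and
shape are illustrative, not the actual patch):

```c
#include <assert.h>
#include <stdbool.h>

/* Values mirror include/linux/swap.h; purely illustrative here. */
#define SWAP_HAS_CACHE	0x40
#define COUNT_CONTINUED	0x80
#define SWAP_MAP_MAX	0x3e

/*
 * Sketch of the first-loop precheck in __swap_duplicate() after the
 * suggested cleanup: compute the bare count once, and reject a batched
 * (nr > 1) duplicate that would need count continuation, since the
 * batched path cannot roll back a partial swap_count_continued().
 * Return values stand in for 0 / -ENOENT / -EEXIST / -EINVAL / -ENOMEM.
 */
static int dup_precheck(unsigned char map_entry, unsigned char usage, int nr)
{
	bool has_cache = map_entry & SWAP_HAS_CACHE;
	unsigned char count = map_entry & ~SWAP_HAS_CACHE;
	unsigned char bare = count & ~COUNT_CONTINUED;

	if (!count && !has_cache)
		return -2;	/* -ENOENT: slot is free */
	if (usage == SWAP_HAS_CACHE)
		return has_cache ? -17 : 0;	/* -EEXIST if cached */
	if (bare > SWAP_MAP_MAX)
		return -22;	/* -EINVAL: bad count */
	if (nr > 1 && bare == SWAP_MAP_MAX)
		return -12;	/* -ENOMEM: would need continuation */
	return 0;
}
```

Note that, per the cleanup, the last branch no longer tests usage == 1: by
that point usage != SWAP_HAS_CACHE is already known.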

I would also like to hear what Barry thinks since he added this (and I
just realized he is not CC'd).
Chris Li Sept. 24, 2024, 8:15 p.m. UTC | #9
Hi Nhat,

On Mon, Sep 23, 2024 at 4:11 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> The SWAP_MAP_SHMEM state was originally introduced in the commit
> aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a
> swap entry belongs to shmem during swapoff.
>
> However, swapoff has since been rewritten drastically in the commit
> b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now
> having swap count == SWAP_MAP_SHMEM value is basically the same as having
> swap count == 1, and swap_shmem_alloc() behaves analogously to
> swap_duplicate()
>
> This RFC proposes the removal of this state and the associated helper to
> simplify the state machine (both mentally and code-wise). We will also
> have an extra state/special value that can be repurposed (for swap entries
> that never gets re-duplicated).

Please help me understand. After this patch, the shmem swap entry
will also use swap continuation as the only way to count shmem swap
entries, am I missing something?

That seems to have some performance hit for shmem, because unlike
anonymous memory, shmem can easily have a very high swap count.
I would expect there to be some performance hit for the highly shared
swap count usage case.
Do you have any test numbers for the high shared swap count usage case
showing that removing it is negligible?

Also, if you remove SWAP_MAP_SHMEM, wouldn't you need to remove
the counter in shmem as well? Because the shmem counter is no longer
connected to the swap count any more.

Huge, do you have any feedback on this change?

Chris


> Another motivation (albeit a bit premature at the moment) is the new swap
> abstraction I am currently working on, that would allow for swap/zswap
> decoupling, swapoff optimization, etc. The fewer states and swap API
> functions there are, the simpler the conversion will be.
>
> I am sending this series first as an RFC, just in case I missed something
> or misunderstood this state, or if someone has a swap optimization in mind
> for shmem that would require this special state.
>
> Swap experts, let me know if I'm mistaken :) Otherwise if there is no
> objection I will resend this patch series again for merging.
>
> Nhat Pham (2):
>   swapfile: add a batched variant for swap_duplicate()
>   swap: shmem: remove SWAP_MAP_SHMEM
>
>  include/linux/swap.h | 16 ++++++++--------
>  mm/shmem.c           |  2 +-
>  mm/swapfile.c        | 28 +++++++++-------------------
>  3 files changed, 18 insertions(+), 28 deletions(-)
>
>
> base-commit: acfabf7e197f7a5bedf4749dac1f39551417b049
> --
> 2.43.5
Yosry Ahmed Sept. 24, 2024, 9:30 p.m. UTC | #10
On Tue, Sep 24, 2024 at 1:16 PM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Nhat,
>
> On Mon, Sep 23, 2024 at 4:11 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > The SWAP_MAP_SHMEM state was originally introduced in the commit
> > aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a
> > swap entry belongs to shmem during swapoff.
> >
> > However, swapoff has since been rewritten drastically in the commit
> > b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now
> > having swap count == SWAP_MAP_SHMEM value is basically the same as having
> > swap count == 1, and swap_shmem_alloc() behaves analogously to
> > swap_duplicate()
> >
> > This RFC proposes the removal of this state and the associated helper to
> > simplify the state machine (both mentally and code-wise). We will also
> > have an extra state/special value that can be repurposed (for swap entries
> > that never gets re-duplicated).
>
> Please help me understand. After having this patch, the shmem swap
> entry will also use the swap continue as the only way to count shmem
> swap entries, am I missing something?
>
> That seems to have some performance hit for the shmem. Because unlike
> anonymous memory, The shmem can easily have a very high swap count.
> I would expect there to be some performance hit for the high share
> swap count usage case.
> Do you have any test number on high shared swap count usage that says
> it is negligible to remove it?

Shmem only calls swap_duplicate() once in shmem_writepage() when we
add the swap entry to the page cache. We do not increment the swap
count once for each user like anon memory. IOW, the page cache is the
only user of the shmem swap entry, so when we remove SWAP_MAP_SHMEM
the swap count of shmem pages will either be 0 or 1 (disregarding
SWAP_HAS_CACHE). So the swap continuation code is not used here.

>
> Also if you remove the  in the SWAP_MAP_SHMEM, wouldn't you need
> remove the counter in the shmem as well. Because the shmem counter is
> no longer connected to the swap count any more.

Not sure what you mean here.

>
> Huge, do you have any feedback on this change?

Hugh*

>
> Chris
Baolin Wang Sept. 25, 2024, 1:53 a.m. UTC | #11
On 2024/9/24 23:48, Nhat Pham wrote:
> On Tue, Sep 24, 2024 at 8:08 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>>
>> On Tue, Sep 24, 2024 at 7:32 AM Nhat Pham <nphamcs@gmail.com> wrote:
>>>
>>> On Mon, Sep 23, 2024 at 8:25 PM Baolin Wang
>>> <baolin.wang@linux.alibaba.com> wrote:
>>>>
>>>>
>>>> On 2024/9/24 10:15, Yosry Ahmed wrote:
>>>>> On Mon, Sep 23, 2024 at 6:55 PM Baolin Wang
>>>>> <baolin.wang@linux.alibaba.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2024/9/24 07:11, Nhat Pham wrote:
>>>>>>> The SWAP_MAP_SHMEM state was originally introduced in the commit
>>>>>>> aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a
>>>>>>> swap entry belongs to shmem during swapoff.
>>>>>>>
>>>>>>> However, swapoff has since been rewritten drastically in the commit
>>>>>>> b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now
>>>>>>> having swap count == SWAP_MAP_SHMEM value is basically the same as having
>>>>>>> swap count == 1, and swap_shmem_alloc() behaves analogously to
>>>>>>> swap_duplicate()
>>>>>>>
>>>>>>> This RFC proposes the removal of this state and the associated helper to
>>>>>>> simplify the state machine (both mentally and code-wise). We will also
>>>>>>> have an extra state/special value that can be repurposed (for swap entries
>>>>>>> that never gets re-duplicated).
>>>>>>>
>>>>>>> Another motivation (albeit a bit premature at the moment) is the new swap
>>>>>>> abstraction I am currently working on, that would allow for swap/zswap
>>>>>>> decoupling, swapoff optimization, etc. The fewer states and swap API
>>>>>>> functions there are, the simpler the conversion will be.
>>>>>>>
>>>>>>> I am sending this series first as an RFC, just in case I missed something
>>>>>>> or misunderstood this state, or if someone has a swap optimization in mind
>>>>>>> for shmem that would require this special state.
>>>>>>
>>>>>> The idea makes sense to me. I did a quick test with shmem mTHP, and
>>>>>> encountered the following warning which is triggered by
>>>>>> 'VM_WARN_ON(usage == 1 && nr > 1)' in __swap_duplicate().
>>>>>
>>>>> Apparently __swap_duplicate() does not currently handle increasing the
>>>>> swap count for multiple swap entries by 1 (i.e. usage == 1) because it
>>>>> does not handle rolling back count increases when
>>>>> swap_count_continued() fails.
>>>>>
>>>>> I guess this voids my Reviewed-by until we sort this out. Technically
>>>>> swap_count_continued() won't ever be called for shmem because we only
>>>>> ever increment the count by 1, but there is no way to know this in
>>>>> __swap_duplicate() without SWAP_HAS_SHMEM.
>>>
>>> Ah this is my bad. I compiled with CONFIG_THP_SWAP, but forgot to
>>> remove the swapfile check (that's another can of worms, but I need
>>> data before submitting the patch to remove it...)
>>>
>>> One thing we can do is instead of warning here, we can handle it in
>>> the for loop check, where we have access to count - that's the point
>>> of having that for-loop check anyway? :)
>>>
>>> There's a couple of ways to go about it:
>>>
>>> 1. VM_WARN_ON(usage == 1 && nr > 1 && count != 0 );
>>
>> Hmm that should work, although it's a bit complicated tbh.
>>
>>> (or more accurately, (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX))
>>
>> I think this will make the warning very hard to hit if there's a
>> misuse of __swap_duplicate(). It will only be hit when an entry needs
>> count continuation, which I am not sure is very common. If there's a
>> bug, the warning will potentially catch it too late, if ever.
>>
>> The side effect here is failing to decrement the swap count of some
>> swap entries which will lead to them never being freed, essentially
>> leaking swap capacity slowly over time. I am not sure if there are
>> more detrimental effects.
>>
>>>
>>> 2. Alternatively, instead of warning here, we can simply return
>>> -ENOMEM. Then, at shmem callsite, have a VM_WARN_ON/VM_BUG_ON(), since
>>> this MUST succeed.
>>
>> We still fail to rollback incremented counts though when we return
>> -ENOMEM, right? Maybe I didn't get what you mean.
> 
> My understanding now is that there are two for loops. One for loop
> that checks the entry's states, and one for loop that does the actual
> incrementing work (or state modification).
> 
> We can check in the first for loop, if it is safe to proceed:
> 
> if (!count && !has_cache) {
>      err = -ENOENT;
> } else if (usage == SWAP_HAS_CACHE) {
> if (has_cache)
>      err = -EEXIST;
> } else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) {
>      err = -EINVAL;
> } else if (usage == 1 && nr > 1 && (count & ~COUNT_CONTINUED) >=
> SWAP_MAP_MAX)) {
>      /* the batched variants currently do not support rollback */
>      err = -ENOMEM;
> }
> 
> At this point, IIUC, we have not done any incrementing, so no rollback
> needed? :)

Right, looks good (although it needs some of the cleanup pointed out by Yosry).

>>>
>>> Either solutions should follow with careful documentation to make it
>>> clear the expectation/guarantee of the new API.
>>>
>>> Yosry, Baolin, how do you two feel about this? Would something like
>>> this work? I need to test it first, but let me know if I'm missing
>>> something.
>>>
>>> If this does not work, we can do what Baolin is suggesting, and
>>> perhaps maintain the swap_shmem_alloc() helper. It's less than ideal,
>>> but at least we still lose a state...
>>
>> Depending on the complexity tbh, right now removing SWAP_MAP_SHMEM is
>> just a cleanup with small wins, so if it's too complicated to remove
>> it it may not be worth it. I am assuming with your ongoing work, it
>> becomes much more valuable, so maybe if it's too complicated we can
>> defer it until the benefits are realizable?
> 
> I agree :)

One benefit I can mention is that removing 'SWAP_MAP_SHMEM' can help to 
batch free shmem swap entries in __swap_entries_free(), similar to what 
commit bea67dcc5eea ("mm: attempt to batch free swap entries for 
zap_pte_range()") did, which can improve the performance of the shmem 
mTHP munmap() function based on my testing.

Without this patch set, I need to make the following changes to batch 
free shmem swap entries:
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0cded32414a1..94e28cd60c52 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -163,7 +163,7 @@ static bool swap_is_last_map(struct swap_info_struct 
*si,
         unsigned char *map_end = map + nr_pages;
         unsigned char count = *map;

-       if (swap_count(count) != 1)
+       if (swap_count(count) != 1 && swap_count(count) != SWAP_MAP_SHMEM)
                 return false;

         while (++map < map_end) {
@@ -1503,10 +1503,10 @@ static bool __swap_entries_free(struct 
swap_info_struct *si,
         unsigned int type = swp_type(entry);
         struct swap_cluster_info *ci;
         bool has_cache = false;
-       unsigned char count;
+       unsigned char count = swap_count(data_race(si->swap_map[offset]));
         int i;

-       if (nr <= 1 || swap_count(data_race(si->swap_map[offset])) != 1)
+       if (nr <= 1 || (count != 1 && count != SWAP_MAP_SHMEM))
                 goto fallback;
         /* cross into another cluster */
         if (nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER)
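
Baolin's point about the batch-free fast path can be seen in a toy version
of the count scan: the batched path applies only when every entry's count is
exactly 1, so the special SWAP_MAP_SHMEM value forces shmem entries onto the
one-at-a-time fallback (the constant values mirror swap.h; the function
names and harness here are illustrative, not the kernel code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Values mirror include/linux/swap.h; purely illustrative here. */
#define SWAP_HAS_CACHE	0x40
#define SWAP_MAP_SHMEM	0xbf	/* the special state being removed */

static unsigned char swap_count(unsigned char ent)
{
	return ent & ~SWAP_HAS_CACHE;
}

/*
 * Toy version of the swap_is_last_map()-style scan: batched freeing is
 * only taken when every entry's count is exactly 1. While SWAP_MAP_SHMEM
 * exists, shmem entries fail this check; once shmem uses a plain count
 * of 1, the batched path covers shmem mTHP as well.
 */
static bool all_counts_are_one(const unsigned char *map, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		if (swap_count(map[i]) != 1)
			return false;
	return true;
}
```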
Barry Song Sept. 25, 2024, 6:26 a.m. UTC | #12
On Wed, Sep 25, 2024 at 2:12 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> [..]
> > > > > > Apparently __swap_duplicate() does not currently handle increasing the
> > > > > > swap count for multiple swap entries by 1 (i.e. usage == 1) because it
> > > > > > does not handle rolling back count increases when
> > > > > > swap_count_continued() fails.
> > > > > >
> > > > > > I guess this voids my Reviewed-by until we sort this out. Technically
> > > > > > swap_count_continued() won't ever be called for shmem because we only
> > > > > > ever increment the count by 1, but there is no way to know this in
> > > > > > __swap_duplicate() without SWAP_HAS_SHMEM.
> > > >
> > > > Ah this is my bad. I compiled with CONFIG_THP_SWAP, but forgot to
> > > > remove the swapfile check (that's another can of worms, but I need
> > > > data before submitting the patch to remove it...)
> > > >
> > > > One thing we can do is instead of warning here, we can handle it in
> > > > the for loop check, where we have access to count - that's the point
> > > > of having that for-loop check anyway? :)
> > > >
> > > > There's a couple of ways to go about it:
> > > >
> > > > 1. VM_WARN_ON(usage == 1 && nr > 1 && count != 0 );
> > >
> > > Hmm that should work, although it's a bit complicated tbh.
> > >
> > > > (or more accurately, (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX))
> > >
> > > I think this will make the warning very hard to hit if there's a
> > > misuse of __swap_duplicate(). It will only be hit when an entry needs
> > > count continuation, which I am not sure is very common. If there's a
> > > bug, the warning will potentially catch it too late, if ever.
> > >
> > > The side effect here is failing to decrement the swap count of some
> > > swap entries which will lead to them never being freed, essentially
> > > leaking swap capacity slowly over time. I am not sure if there are
> > > more detrimental effects.
> > >
> > > >
> > > > 2. Alternatively, instead of warning here, we can simply return
> > > > -ENOMEM. Then, at shmem callsite, have a VM_WARN_ON/VM_BUG_ON(), since
> > > > this MUST succeed.
> > >
> > > We still fail to rollback incremented counts though when we return
> > > -ENOMEM, right? Maybe I didn't get what you mean.
> >
> > My understanding now is that there are two for loops. One for loop
> > that checks the entry's states, and one for loop that does the actual
> > incrementing work (or state modification).
> >
> > We can check in the first for loop, if it is safe to proceed:
> >
> > if (!count && !has_cache) {
> >     err = -ENOENT;
> > } else if (usage == SWAP_HAS_CACHE) {
> > if (has_cache)
> >     err = -EEXIST;
> > } else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) {
> >     err = -EINVAL;
> > } else if (usage == 1 && nr > 1 && (count & ~COUNT_CONTINUED) >=
> > SWAP_MAP_MAX)) {
> >     /* the batched variants currently do not support rollback */
> >     err = -ENOMEM;
> > }
>
> Hmm yeah I think something like this should work and is arguably
> better than just warning, although this needs cleaning up:
> - We already know usage != SWAP_HAS_CACHE, so no need to check if usage == 1.
> - We already know (count & ~COUNT_CONTINUED) is not larger than
> SWAP_MAP_MAX, so we should check if it's equal to SWAP_MAP_MAX.
> - We should probably just calculate count & ~COUNT_CONTINUED above the
> if conditions at this point.
>
> I would also like to hear what Barry thinks since he added this (and I
> just realized he is not CC'd).

I am perfectly fine with the approach: in the first loop, if we find that
none of the entries need CONTINUED, we can run the second loop even for
usage == 1 and nr > 1. This is almost always true for a real product, where
anon folios are unlikely to be fork-shared by so many processes.

but we need to fall back to iterating nr times if this really happens:

int swap_duplicate_nr(swp_entry_t entry, int nr)
{
   ...
   if (nr > 1 && err == -ENOMEM) {
      for (each of the nr entries) {
         __swap_duplicate(entry, 1, 1);
         entry = next_entry;
      }
   }
}

Seems a bit ugly?

Maybe we can keep the non-batched swap_duplicate(swp_entry_t entry)
around, and then avoid the __swap_duplicate(entry, 1, 1) calls?

Thanks
Barry
Huang, Ying Sept. 25, 2024, 7:19 a.m. UTC | #13
Nhat Pham <nphamcs@gmail.com> writes:

[snip]

>
> My understanding now is that there are two for loops. One for loop
> that checks the entry's states, and one for loop that does the actual
> incrementing work (or state modification).
>
> We can check in the first for loop, if it is safe to proceed:
>
> if (!count && !has_cache) {
>     err = -ENOENT;
> } else if (usage == SWAP_HAS_CACHE) {
> if (has_cache)
>     err = -EEXIST;
> } else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) {
>     err = -EINVAL;
> } else if (usage == 1 && nr > 1 && (count & ~COUNT_CONTINUED) >=
> SWAP_MAP_MAX)) {
>     /* the batched variants currently do not support rollback */
>     err = -ENOMEM;
> }
>
> At this point, IIUC, we have not done any incrementing, so no rollback
> needed? :)

I think that it's better to add a VM_WARN_ONCE() here. If someone
enables 'nr > 1' for __swap_duplicate(), the issue will be more
explicit.

[snip]

--
Best Regards,
Huang, Ying
Huang, Ying Sept. 25, 2024, 7:24 a.m. UTC | #14
Barry Song <21cnbao@gmail.com> writes:

[snip]

> I am perfectly fine with the approach, in the first loop, if we find all entries
> don't need CONTINUED, we can run the 2nd loop even for usage==1
> and nr > 1. this is almost always true for a real product where anon folios
> are unlikely to be fork-shared by so many processes.

One possible use case is ksm, where the map count could be large.

--
Best Regards,
Huang, Ying
Barry Song Sept. 25, 2024, 7:38 a.m. UTC | #15
On Wed, Sep 25, 2024 at 3:28 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> [snip]
>
> > I am perfectly fine with the approach, in the first loop, if we find all entries
> > don't need CONTINUED, we can run the 2nd loop even for usage==1
> > and nr > 1. this is almost always true for a real product where anon folios
> > are unlikely to be fork-shared by so many processes.
>
> One possible use case is ksm.  Where the map count could be large.

Sorry, I overlooked the KSM case, but it seems this doesn't significantly
change the overall direction. :-)

Since we can fall back to swap_duplicate() and handle each entry one
by one after the first loop returns -ENOMEM, KSM isn't a concern
either.

>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry
Nhat Pham Sept. 25, 2024, 2:21 p.m. UTC | #16
On Wed, Sep 25, 2024 at 12:33 AM Barry Song <baohua@kernel.org> wrote:
>
> On Wed, Sep 25, 2024 at 3:23 PM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Nhat Pham <nphamcs@gmail.com> writes:
> >
> > [snip]
> >
> > >
> > > My understanding now is that there are two for loops. One for loop
> > > that checks the entry's states, and one for loop that does the actual
> > > incrementing work (or state modification).
> > >
> > > We can check in the first for loop, if it is safe to proceed:
> > >
> > > if (!count && !has_cache) {
> > >     err = -ENOENT;
> > > } else if (usage == SWAP_HAS_CACHE) {
> > > if (has_cache)
> > >     err = -EEXIST;
> > > } else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) {
> > >     err = -EINVAL;
> > > } else if (usage == 1 && nr > 1 && (count & ~COUNT_CONTINUED) >=
> > > SWAP_MAP_MAX)) {
> > >     /* the batched variants currently do not support rollback */
> > >     err = -ENOMEM;
> > > }
> > >
> > > At this point, IIUC, we have not done any incrementing, so no rollback
> > > needed? :)
> >
> > I think that it's better to add a VM_WARN_ONCE() here.  If someone
> > enabled 'nr > 1' for __swap_duplicate(), the issue will be more
> > explicit.
>
> ying, i guess you missed this is the exact case Nhat is enabling
>  'nr > 1' for __swap_duplicate(). and this warning is functioning.
> and he is trying to support the nr>1 case.
>
> https://lore.kernel.org/linux-mm/20240923231142.4155415-2-nphamcs@gmail.com/

I'm only supporting the case nr > 1, when there is no need to add swap
continuations :) That's the only current use case right now (shmem) :)

1. Keep the non-batched variant:

int swap_duplicate(swp_entry_t entry)
{
    int err = 0;

    while (!err && __swap_duplicate(entry, 1, 1) == -ENOMEM)
        err = add_swap_count_continuation(entry, GFP_ATOMIC);
    return err;
}

2. Implement the batched variant:

int swap_duplicate_nr(swp_entry_t entry, int nr)
{
    swp_entry_t cur_entry;
    int i, err;

    if (nr == 1)
        return swap_duplicate(entry);

    err = __swap_duplicate(entry, 1, nr);
    if (err == -ENOMEM) {
        /* fallback to non-batched version */
        for (i = 0; i < nr; i++) {
            cur_entry = (swp_entry_t){entry.val + i};
            if (swap_duplicate(cur_entry)) {
                /* rollback */
                while (--i >= 0) {
                     cur_entry = (swp_entry_t){entry.val + i};
                     swap_free(cur_entry);
                }
            }
        }
    }
   return err;
}

How does this look? My concern is that there is not really a use for
the fallback logic. Basically dead code.

I can keep it in if you guys have a use for it soon, but otherwise I
lean towards just adding a WARN etc. there, or return -ENOMEM, and
WARN at shmem's callsite (because it cannot get -ENOMEM).
Nhat Pham Sept. 25, 2024, 2:24 p.m. UTC | #17
On Wed, Sep 25, 2024 at 7:21 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
>
> I'm only supporting the case nr > 1, when there is no need to add swap
> continuations :) That's the only current use case right now (shmem) :)

Sorry, I forgot to say - but to fully support a batched variant, we
can do something like this:

>
> 1. Keep the non-batched variant:
>
> int swap_duplicate(swp_entry_t entry)
> {
>     int err = 0;
>
>     while (!err && __swap_duplicate(entry, 1, 1) == -ENOMEM)
>         err = add_swap_count_continuation(entry, GFP_ATOMIC);
>     return err;
> }
>
> 2. Implement the batched variant:
>
> int swap_duplicate_nr(swp_entry_t entry, int nr)
> {
>     swp_entry_t cur_entry;
>     int i, err;
>
>     if (nr == 1)
>         return swap_duplicate(entry);
>
>     err = __swap_duplicate(entry, 1, nr);
>     if (err == -ENOMEM) {
>         /* fallback to non-batched version */
>         for (i = 0; i < nr; i++) {
>             cur_entry = (swp_entry_t){entry.val + i};
>             if (swap_duplicate(cur_entry)) {
>                 /* rollback */
>                 while (--i >= 0) {
>                      cur_entry = (swp_entry_t){entry.val + i};
>                      swap_free(cur_entry);
>                 }
missing a "return err;" here. Not my best idea to write (pseudo) code
before caffeine in the morning :)
>             }
>         }
>     }
>    return err;
> }
>
> How does this look? My concern is that there is not really a use for
> the fallback logic. Basically dead code.
>
> I can keep it in if you guys have a use for it soon, but otherwise I
> lean towards just adding a WARN etc. there, or return -ENOMEM, and
> WARN at shmem's callsite (because it cannot get -ENOMEM).
Nhat Pham Sept. 25, 2024, 2:28 p.m. UTC | #18
On Wed, Sep 25, 2024 at 7:21 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Wed, Sep 25, 2024 at 12:33 AM Barry Song <baohua@kernel.org> wrote:
> How does this look? My concern is that there is not really a use for
> the fallback logic. Basically dead code.
>
> I can keep it in if you guys have a use for it soon, but otherwise I
> lean towards just adding a WARN etc. there, or return -ENOMEM, and
> WARN at shmem's callsite (because it cannot get -ENOMEM).

Oh and another point - I plan to rewrite this swap entity lifetime
logic in the new abstraction layer. The swap continuation part will go
away with it - I don't quite like the way we're currently doing
things. So this added complexity might be unnecessary.
Nhat Pham Sept. 25, 2024, 2:37 p.m. UTC | #19
On Tue, Sep 24, 2024 at 6:53 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
> One benefit I can mention is that removing 'SWAP_MAP_SHMEM' can help to
> batch free shmem swap entries in __swap_entries_free(), similar to the
> commit bea67dcc5eea ("mm: attempt to batch free swap entries for
> zap_pte_range()") did, which can improve the performance of shmem mTHP
> munmap() function based on my testing.

Yeah, the problem with having an extraneous state is you have to
constantly check for it in code, and/or keep it in mind when you
develop things. I've been constantly having to check for this state
when I develop code around this area, and it gets old fast.

If we can use it to optimize something, I can understand keeping it.
But it just seems like dead code to me :)

My preference is to do this as simply as possible - add another case
(usage == 1, nr > 1, and we need to add swap continuations) in the
check in __swap_duplicate()'s first loop, and just WARN right there.

That case CANNOT happen UNLESS we introduce a bug, or have a new use
case. When we actually have a use case, we can always introduce
handling/fallback logic for that case.

Barry, Yosry, Baolin, Ying, how do you feel about this?
Huang, Ying Sept. 26, 2024, 1:59 a.m. UTC | #20
Nhat Pham <nphamcs@gmail.com> writes:

> On Tue, Sep 24, 2024 at 6:53 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>>
>> One benefit I can mention is that removing 'SWAP_MAP_SHMEM' can help to
>> batch free shmem swap entries in __swap_entries_free(), similar to the
>> commit bea67dcc5eea ("mm: attempt to batch free swap entries for
>> zap_pte_range()") did, which can improve the performance of shmem mTHP
>> munmap() function based on my testing.
>
> Yeah, the problem with having an extraneous state is you have to
> constantly check for it in code, and/or keep it in mind when you
> develop things. I've been constantly having to check for this state
> when I develop code around this area, and it gets old fast.
>
> If we can use it to optimize something, I can understand keeping it.
> But it just seems like dead code to me :)
>
> My preference is to do this as simply as possible - add another case
> (usage == 1, nr > 1, and we need to add swap continuations) in the
> check in __swap_duplicate()'s first loop, and just WARN right there.
>
> That case CANNOT happen UNLESS we introduce a bug, or have a new use
> case. When we actually have a use case, we can always introduce
> handling/fallback logic for that case.
>
> Barry, Yosry, Baolin, Ying, how do you feel about this?

Sounds good to me to just WARN now.  We can always improve when it's
necessary.

--
Best Regards,
Huang, Ying
Baolin Wang Sept. 26, 2024, 3:30 a.m. UTC | #21
On 2024/9/26 09:59, Huang, Ying wrote:
> Nhat Pham <nphamcs@gmail.com> writes:
> 
>> On Tue, Sep 24, 2024 at 6:53 PM Baolin Wang
>> <baolin.wang@linux.alibaba.com> wrote:
>>>
>>>
>>> One benefit I can mention is that removing 'SWAP_MAP_SHMEM' can help to
>>> batch free shmem swap entries in __swap_entries_free(), similar to the
>>> commit bea67dcc5eea ("mm: attempt to batch free swap entries for
>>> zap_pte_range()") did, which can improve the performance of shmem mTHP
>>> munmap() function based on my testing.
>>
>> Yeah, the problem with having an extraneous state is you have to
>> constantly check for it in code, and/or keep it in mind when you
>> develop things. I've been constantly having to check for this state
>> when I develop code around this area, and it gets old fast.
>>
>> If we can use it to optimize something, I can understand keeping it.
>> But it just seems like dead code to me :)
>>
>> My preference is to do this as simply as possible - add another case
>> (usage == 1, nr > 1, and we need to add swap continuations) in the
>> check in __swap_duplicate()'s first loop, and just WARN right there.
>>
>> That case CANNOT happen UNLESS we introduce a bug, or have a new use
>> case. When we actually have a use case, we can always introduce
>> handling/fallback logic for that case.
>>
>> Barry, Yosry, Baolin, Ying, how do you feel about this?
> 
> Sounds good to me to just WARN now.  We can always improve when it's
> necessary.

+1. Agreed.
Barry Song Sept. 26, 2024, 4 a.m. UTC | #22
On Thu, Sep 26, 2024 at 2:37 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Tue, Sep 24, 2024 at 6:53 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
> >
> >
> > One benefit I can mention is that removing 'SWAP_MAP_SHMEM' can help to
> > batch free shmem swap entries in __swap_entries_free(), similar to the
> > commit bea67dcc5eea ("mm: attempt to batch free swap entries for
> > zap_pte_range()") did, which can improve the performance of shmem mTHP
> > munmap() function based on my testing.
>
> Yeah, the problem with having an extraneous state is you have to
> constantly check for it in code, and/or keep it in mind when you
> develop things. I've been constantly having to check for this state
> when I develop code around this area, and it gets old fast.
>
> If we can use it to optimize something, I can understand keeping it.
> But it just seems like dead code to me :)
>
> My preference is to do this as simply as possible - add another case
> (usage == 1, nr > 1, and we need to add swap continuations) in the
> check in __swap_duplicate()'s first loop, and just WARN right there.
>
> That case CANNOT happen UNLESS we introduce a bug, or have a new use
> case. When we actually have a use case, we can always introduce
> handling/fallback logic for that case.
>
> Barry, Yosry, Baolin, Ying, how do you feel about this?

Hi Nhat,

I’m not entirely clear on your point. If your proposal is to support the
case where usage == 1 and nr > 1 only when we don’t require
CONTINUED, and to issue a warning once we determine that
CONTINUED is needed, then I’m completely on board with that
approach.

It seems that your intention is to simply relocate the existing warning
to the scenario where CONTINUED is actually required, rather than
maintaining a warning for the case where usage == 1 and nr > 1 at
all times?

I wasn't actually suggesting a rollback as you posted:
     err = __swap_duplicate(entry, 1, nr);
     if (err == -ENOMEM) {
         /* fallback to non-batched version */
         for (i = 0; i < nr; i++) {
             cur_entry = (swp_entry_t){entry.val + i};
             if (swap_duplicate(cur_entry)) {
                 /* rollback */
                 while (--i >= 0) {
                      cur_entry = (swp_entry_t){entry.val + i};
                      swap_free(cur_entry);
                 }

I suggested checking for all errors except -ENOMEM in the first loop. If all
conditions indicate that only CONTINUED is necessary (with no other
issues like ENOENT, EEXIST, etc., for any entry), we can proceed
with a for loop for swap_duplicate(), eliminating the need for a rollback.

However, I agree that we can proceed with that only when there is actually
a user involved. (There's no need to address it right now.)

Thanks
Barry
Nhat Pham Sept. 26, 2024, 10:50 p.m. UTC | #23
On Wed, Sep 25, 2024 at 8:59 PM Barry Song <baohua@kernel.org> wrote:
>
> On Thu, Sep 26, 2024 at 2:37 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> I’m not entirely clear on your point. If your proposal is to support the
> case where usage == 1 and nr > 1 only when we don’t require
> CONTINUED, and to issue a warning once we determine that
> CONTINUED is needed, then I’m completely on board with that
> approach.
>
> It seems that your intention is to simply relocate the existing warning
> to the scenario where CONTINUED is actually required, rather than
> maintaining a warning for the case where usage == 1 and nr > 1 at
> all times?

Ohhh yeah we definitely agreed on intentions, but I think I
misunderstood your request :) The code below was an attempt to satisfy
that request...

Please ignore it. I'll submit an actual patch taking into account our
discussions :) Hopefully I won't forget to actually test with thp
swaps this time...

>
> I wasn't actually suggesting a rollback as you posted:
>      err = __swap_duplicate(entry, 1, nr);
>      if (err == -ENOMEM) {
>          /* fallback to non-batched version */
>          for (i = 0; i < nr; i++) {
>              cur_entry = (swp_entry_t){entry.val + i};
>              if (swap_duplicate(cur_entry)) {
>                  /* rollback */
>                  while (--i >= 0) {
>                       cur_entry = (swp_entry_t){entry.val + i};
>                       swap_free(cur_entry);
>                  }
>