
[RFC,0/2] remove SWAP_MAP_SHMEM

Message ID 20240923231142.4155415-1-nphamcs@gmail.com (mailing list archive)

Message

Nhat Pham Sept. 23, 2024, 11:11 p.m. UTC
The SWAP_MAP_SHMEM state was originally introduced in the commit
aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a
swap entry belongs to shmem during swapoff.

However, swapoff has since been rewritten drastically in commit
b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now a swap
count of SWAP_MAP_SHMEM is basically the same as a swap count of 1, and
swap_shmem_alloc() behaves analogously to swap_duplicate().
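
To illustrate the equivalence claimed above, here is a minimal user-space
sketch of how one swap_map byte is interpreted. The constant values are
quoted from memory of include/linux/swap.h around this base commit and
should be treated as assumptions; model_swap_count() and
model_single_owner() are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values, as recalled from include/linux/swap.h; verify before relying on them. */
#define SWAP_HAS_CACHE  0x40 /* entry also holds a swap cache reference */
#define COUNT_CONTINUED 0x80 /* count overflows into a continuation page */
#define SWAP_MAP_MAX    0x3e /* largest count stored inline */
#define SWAP_MAP_BAD    0x3f /* bad page marker */
#define SWAP_MAP_SHMEM  0xbf /* "owned by shmem/tmpfs" special value */

/* Hypothetical helper: drop the cache flag and keep the count bits. */
static unsigned char model_swap_count(unsigned char ent)
{
	return (unsigned char)(ent & ~SWAP_HAS_CACHE);
}

/*
 * Hypothetical helper: does this byte describe exactly one owner?
 * Before the series there are two encodings for that (1 and
 * SWAP_MAP_SHMEM); the series proposes keeping only the first.
 */
static bool model_single_owner(unsigned char ent)
{
	unsigned char count = model_swap_count(ent);

	/* Sanity check of the model: the special value sits outside the inline range. */
	assert(SWAP_MAP_SHMEM > SWAP_MAP_MAX);
	return count == 1 || count == SWAP_MAP_SHMEM;
}
```

With these assumed values SWAP_MAP_SHMEM equals
COUNT_CONTINUED | SWAP_MAP_BAD, so dropping the state would free that
encoding for another use.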
    
This RFC proposes the removal of this state and the associated helper to
simplify the state machine (both mentally and code-wise). We will also
have an extra state/special value that can be repurposed (for swap entries
that never get re-duplicated).

Another motivation (albeit a bit premature at the moment) is the new swap
abstraction I am currently working on, that would allow for swap/zswap
decoupling, swapoff optimization, etc. The fewer states and swap API
functions there are, the simpler the conversion will be.

I am sending this series first as an RFC, just in case I missed something
or misunderstood this state, or if someone has a swap optimization in mind
for shmem that would require this special state.

Swap experts, let me know if I'm mistaken :) Otherwise, if there is no
objection, I will resend this patch series for merging.

Nhat Pham (2):
  swapfile: add a batched variant for swap_duplicate()
  swap: shmem: remove SWAP_MAP_SHMEM

 include/linux/swap.h | 16 ++++++++--------
 mm/shmem.c           |  2 +-
 mm/swapfile.c        | 28 +++++++++-------------------
 3 files changed, 18 insertions(+), 28 deletions(-)


base-commit: acfabf7e197f7a5bedf4749dac1f39551417b049

Comments

Yosry Ahmed Sept. 24, 2024, 12:20 a.m. UTC | #1
On Mon, Sep 23, 2024 at 4:11 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> The SWAP_MAP_SHMEM state was originally introduced in the commit
> aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a
> swap entry belongs to shmem during swapoff.
>
> However, swapoff has since been rewritten drastically in the commit
> b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now
> having swap count == SWAP_MAP_SHMEM value is basically the same as having
> swap count == 1, and swap_shmem_alloc() behaves analogously to
> swap_duplicate()
>
> This RFC proposes the removal of this state and the associated helper to
> simplify the state machine (both mentally and code-wise). We will also
> have an extra state/special value that can be repurposed (for swap entries
> that never gets re-duplicated).
>
> Another motivation (albeit a bit premature at the moment) is the new swap
> abstraction I am currently working on, that would allow for swap/zswap
> decoupling, swapoff optimization, etc. The fewer states and swap API
> functions there are, the simpler the conversion will be.
>
> I am sending this series first as an RFC, just in case I missed something
> or misunderstood this state, or if someone has a swap optimization in mind
> for shmem that would require this special state.

I have the same patch sitting in a tree somewhere from when I tried
working on swap abstraction, except then swap_shmem_alloc() did not
take a 'nr' argument so I did not need swap_duplicate_nr(). I was
going to send it out with other swap code cleanups I had, but I ended
up deciding to do nothing.

So for what it's worth I think this is correct:
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>

Baolin Wang Sept. 24, 2024, 1:55 a.m. UTC | #2
On 2024/9/24 07:11, Nhat Pham wrote:
> The SWAP_MAP_SHMEM state was originally introduced in the commit
> aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a
> swap entry belongs to shmem during swapoff.
> 
> However, swapoff has since been rewritten drastically in the commit
> b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now
> having swap count == SWAP_MAP_SHMEM value is basically the same as having
> swap count == 1, and swap_shmem_alloc() behaves analogously to
> swap_duplicate()
>      
> This RFC proposes the removal of this state and the associated helper to
> simplify the state machine (both mentally and code-wise). We will also
> have an extra state/special value that can be repurposed (for swap entries
> that never gets re-duplicated).
> 
> Another motivation (albeit a bit premature at the moment) is the new swap
> abstraction I am currently working on, that would allow for swap/zswap
> decoupling, swapoff optimization, etc. The fewer states and swap API
> functions there are, the simpler the conversion will be.
> 
> I am sending this series first as an RFC, just in case I missed something
> or misunderstood this state, or if someone has a swap optimization in mind
> for shmem that would require this special state.

The idea makes sense to me. I did a quick test with shmem mTHP, and 
encountered the following warning which is triggered by 
'VM_WARN_ON(usage == 1 && nr > 1)' in __swap_duplicate().

[   81.064967] ------------[ cut here ]------------
[   81.064968] WARNING: CPU: 4 PID: 6852 at mm/swapfile.c:3623 __swap_duplicate+0x1d0/0x2e0
[   81.064994] pstate: 23400005 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[   81.064995] pc : __swap_duplicate+0x1d0/0x2e0
[   81.064997] lr : swap_duplicate_nr+0x30/0x70
[......]
[   81.065019] Call trace:
[   81.065019]  __swap_duplicate+0x1d0/0x2e0
[   81.065021]  swap_duplicate_nr+0x30/0x70
[   81.065022]  shmem_writepage+0x24c/0x438
[   81.065024]  pageout+0x104/0x2e0
[   81.065026]  shrink_folio_list+0x7f0/0xe60
[   81.065027]  reclaim_folio_list+0x90/0x178
[   81.065029]  reclaim_pages+0x128/0x1a8
[   81.065030]  madvise_cold_or_pageout_pte_range+0x80c/0xd10
[   81.065031]  walk_pmd_range.isra.0+0x1b8/0x3a0
[   81.065033]  walk_pud_range+0x120/0x1b0
[   81.065035]  walk_pgd_range+0x150/0x1a8
[   81.065036]  __walk_page_range+0xa4/0xb8
[   81.065038]  walk_page_range+0x1c8/0x250
[   81.065039]  madvise_pageout+0xf4/0x280
[   81.065041]  madvise_vma_behavior+0x268/0x3f0
[   81.065042]  madvise_walk_vmas.constprop.0+0xb8/0x128
[   81.065043]  do_madvise.part.0+0xe8/0x2a0
[   81.065044]  __arm64_sys_madvise+0x64/0x78
[   81.065046]  invoke_syscall.constprop.0+0x54/0xe8
[   81.065048]  do_el0_svc+0xa4/0xc0
[   81.065050]  el0_svc+0x2c/0xb0
[   81.065052]  el0t_64_sync_handler+0xb8/0xc0
[   81.065054]  el0t_64_sync+0x14c/0x150

Yosry Ahmed Sept. 24, 2024, 2:15 a.m. UTC | #3
On Mon, Sep 23, 2024 at 6:55 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2024/9/24 07:11, Nhat Pham wrote:
> > The SWAP_MAP_SHMEM state was originally introduced in the commit
> > aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a
> > swap entry belongs to shmem during swapoff.
> >
> > However, swapoff has since been rewritten drastically in the commit
> > b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now
> > having swap count == SWAP_MAP_SHMEM value is basically the same as having
> > swap count == 1, and swap_shmem_alloc() behaves analogously to
> > swap_duplicate()
> >
> > This RFC proposes the removal of this state and the associated helper to
> > simplify the state machine (both mentally and code-wise). We will also
> > have an extra state/special value that can be repurposed (for swap entries
> > that never gets re-duplicated).
> >
> > Another motivation (albeit a bit premature at the moment) is the new swap
> > abstraction I am currently working on, that would allow for swap/zswap
> > decoupling, swapoff optimization, etc. The fewer states and swap API
> > functions there are, the simpler the conversion will be.
> >
> > I am sending this series first as an RFC, just in case I missed something
> > or misunderstood this state, or if someone has a swap optimization in mind
> > for shmem that would require this special state.
>
> The idea makes sense to me. I did a quick test with shmem mTHP, and
> encountered the following warning which is triggered by
> 'VM_WARN_ON(usage == 1 && nr > 1)' in __swap_duplicate().

Apparently __swap_duplicate() does not currently handle increasing the
swap count for multiple swap entries by 1 (i.e. usage == 1) because it
does not handle rolling back count increases when
swap_count_continued() fails.

I guess this voids my Reviewed-by until we sort this out. Technically
swap_count_continued() won't ever be called for shmem because we only
ever increment the count by 1, but there is no way to know this in
__swap_duplicate() without SWAP_MAP_SHMEM.
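
The rollback concern can be sketched in user space as follows. This is a
simplified model under stated assumptions, not the kernel code:
model_duplicate_nr() and model_count_continued() are hypothetical
stand-ins for the batched increment and for swap_count_continued(), and
MODEL_MAP_MAX only mirrors the role of SWAP_MAP_MAX.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAP_MAX   0x3e /* assumed inline-count ceiling */
#define MODEL_CONTINUED 0x80 /* assumed "count continues elsewhere" marker */

/* Hypothetical stand-in for swap_count_continued(): can fail, e.g. out of memory. */
static bool model_count_continued(bool can_allocate)
{
	return can_allocate;
}

/*
 * Bump nr adjacent counts by one, without any rollback.
 * Returns how many entries were updated; a result below nr means the
 * batch was left half-applied, which is the case the warning guards.
 */
static size_t model_duplicate_nr(unsigned char *map, size_t nr, bool can_allocate)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (map[i] < MODEL_MAP_MAX)
			map[i]++;
		else if (model_count_continued(can_allocate))
			map[i] = MODEL_CONTINUED;
		else
			break; /* entries [0, i) stay incremented */
	}
	return i;
}
```

In this model a batch whose counts all start at 0, as in the shmem case
described above, never reaches the continuation branch, so the partial
state cannot occur there; the function itself just has no way to know
that about its caller.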

Baolin Wang Sept. 24, 2024, 3:25 a.m. UTC | #4
On 2024/9/24 10:15, Yosry Ahmed wrote:
> On Mon, Sep 23, 2024 at 6:55 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>>
>>
>> On 2024/9/24 07:11, Nhat Pham wrote:
>>> The SWAP_MAP_SHMEM state was originally introduced in the commit
>>> aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a
>>> swap entry belongs to shmem during swapoff.
>>>
>>> However, swapoff has since been rewritten drastically in the commit
>>> b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now
>>> having swap count == SWAP_MAP_SHMEM value is basically the same as having
>>> swap count == 1, and swap_shmem_alloc() behaves analogously to
>>> swap_duplicate()
>>>
>>> This RFC proposes the removal of this state and the associated helper to
>>> simplify the state machine (both mentally and code-wise). We will also
>>> have an extra state/special value that can be repurposed (for swap entries
>>> that never gets re-duplicated).
>>>
>>> Another motivation (albeit a bit premature at the moment) is the new swap
>>> abstraction I am currently working on, that would allow for swap/zswap
>>> decoupling, swapoff optimization, etc. The fewer states and swap API
>>> functions there are, the simpler the conversion will be.
>>>
>>> I am sending this series first as an RFC, just in case I missed something
>>> or misunderstood this state, or if someone has a swap optimization in mind
>>> for shmem that would require this special state.
>>
>> The idea makes sense to me. I did a quick test with shmem mTHP, and
>> encountered the following warning which is triggered by
>> 'VM_WARN_ON(usage == 1 && nr > 1)' in __swap_duplicate().
> 
> Apparently __swap_duplicate() does not currently handle increasing the
> swap count for multiple swap entries by 1 (i.e. usage == 1) because it
> does not handle rolling back count increases when
> swap_count_continued() fails.
> 
> I guess this voids my Reviewed-by until we sort this out. Technically
> swap_count_continued() won't ever be called for shmem because we only
> ever increment the count by 1, but there is no way to know this in
> __swap_duplicate() without SWAP_MAP_SHMEM.

Agreed. An easy solution might be to add a new boolean parameter to 
indicate whether the SHMEM swap entry count is increasing?

diff --git a/mm/swapfile.c b/mm/swapfile.c
index cebc244ee60f..21f1eec2c30a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3607,7 +3607,7 @@ void si_swapinfo(struct sysinfo *val)
   * - swap-cache reference is requested but the entry is not used. -> ENOENT
   * - swap-mapped reference requested but needs continued swap count. -> ENOMEM
   */
-static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
+static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr, bool shmem)
  {
         struct swap_info_struct *si;
         struct swap_cluster_info *ci;
@@ -3620,7 +3620,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)

         offset = swp_offset(entry);
         VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
-       VM_WARN_ON(usage == 1 && nr > 1);
+       VM_WARN_ON(usage == 1 && nr > 1 && !shmem);
         ci = lock_cluster_or_swap_info(si, offset);

         err = 0;
@@ -3661,7 +3661,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
                         has_cache = SWAP_HAS_CACHE;
                 else if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
                         count += usage;
-               else if (swap_count_continued(si, offset + i, count))
+               else if (!shmem && swap_count_continued(si, offset + i, count))
                         count = COUNT_CONTINUED;
                 else {
                         /*