Message ID | 20240923231142.4155415-1-nphamcs@gmail.com (mailing list archive)
---|---
Series | remove SWAP_MAP_SHMEM
On Mon, Sep 23, 2024 at 4:11 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> The SWAP_MAP_SHMEM state was originally introduced in the commit
> aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a
> swap entry belongs to shmem during swapoff.
>
> However, swapoff has since been rewritten drastically in the commit
> b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now having
> swap count == SWAP_MAP_SHMEM is basically the same as having swap
> count == 1, and swap_shmem_alloc() behaves analogously to
> swap_duplicate().
>
> This RFC proposes the removal of this state and the associated helper to
> simplify the state machine (both mentally and code-wise). We will also
> have an extra state/special value that can be repurposed (for swap
> entries that never get re-duplicated).
>
> Another motivation (albeit a bit premature at the moment) is the new swap
> abstraction I am currently working on, that would allow for swap/zswap
> decoupling, swapoff optimization, etc. The fewer states and swap API
> functions there are, the simpler the conversion will be.
>
> I am sending this series first as an RFC, just in case I missed something
> or misunderstood this state, or if someone has a swap optimization in
> mind for shmem that would require this special state.

I have the same patch sitting in a tree somewhere from when I tried
working on swap abstraction, except back then swap_shmem_alloc() did not
take an 'nr' argument, so I did not need swap_duplicate_nr(). I was
going to send it out with other swap code cleanups I had, but ended up
not doing so. So for what it's worth, I think this is correct:

Reviewed-by: Yosry Ahmed <yosryahmed@google.com>

>
> Swap experts, let me know if I'm mistaken :) Otherwise, if there is no
> objection, I will resend this patch series again for merging.
>
> Nhat Pham (2):
>   swapfile: add a batched variant for swap_duplicate()
>   swap: shmem: remove SWAP_MAP_SHMEM
>
>  include/linux/swap.h | 16 ++++++++--------
>  mm/shmem.c           |  2 +-
>  mm/swapfile.c        | 28 +++++++++-------------------
>  3 files changed, 18 insertions(+), 28 deletions(-)
>
>
> base-commit: acfabf7e197f7a5bedf4749dac1f39551417b049
> --
> 2.43.5
On 2024/9/24 07:11, Nhat Pham wrote:
> The SWAP_MAP_SHMEM state was originally introduced in the commit
> aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a
> swap entry belongs to shmem during swapoff.
>
> However, swapoff has since been rewritten drastically in the commit
> b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now having
> swap count == SWAP_MAP_SHMEM is basically the same as having swap
> count == 1, and swap_shmem_alloc() behaves analogously to
> swap_duplicate().
>
> This RFC proposes the removal of this state and the associated helper to
> simplify the state machine (both mentally and code-wise). We will also
> have an extra state/special value that can be repurposed (for swap
> entries that never get re-duplicated).
>
> Another motivation (albeit a bit premature at the moment) is the new swap
> abstraction I am currently working on, that would allow for swap/zswap
> decoupling, swapoff optimization, etc. The fewer states and swap API
> functions there are, the simpler the conversion will be.
>
> I am sending this series first as an RFC, just in case I missed something
> or misunderstood this state, or if someone has a swap optimization in
> mind for shmem that would require this special state.

The idea makes sense to me. I did a quick test with shmem mTHP, and
encountered the following warning, which is triggered by
'VM_WARN_ON(usage == 1 && nr > 1)' in __swap_duplicate().

[ 81.064967] ------------[ cut here ]------------
[ 81.064968] WARNING: CPU: 4 PID: 6852 at mm/swapfile.c:3623 __swap_duplicate+0x1d0/0x2e0
[ 81.064994] pstate: 23400005 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 81.064995] pc : __swap_duplicate+0x1d0/0x2e0
[ 81.064997] lr : swap_duplicate_nr+0x30/0x70
[......]
[ 81.065019] Call trace:
[ 81.065019]  __swap_duplicate+0x1d0/0x2e0
[ 81.065021]  swap_duplicate_nr+0x30/0x70
[ 81.065022]  shmem_writepage+0x24c/0x438
[ 81.065024]  pageout+0x104/0x2e0
[ 81.065026]  shrink_folio_list+0x7f0/0xe60
[ 81.065027]  reclaim_folio_list+0x90/0x178
[ 81.065029]  reclaim_pages+0x128/0x1a8
[ 81.065030]  madvise_cold_or_pageout_pte_range+0x80c/0xd10
[ 81.065031]  walk_pmd_range.isra.0+0x1b8/0x3a0
[ 81.065033]  walk_pud_range+0x120/0x1b0
[ 81.065035]  walk_pgd_range+0x150/0x1a8
[ 81.065036]  __walk_page_range+0xa4/0xb8
[ 81.065038]  walk_page_range+0x1c8/0x250
[ 81.065039]  madvise_pageout+0xf4/0x280
[ 81.065041]  madvise_vma_behavior+0x268/0x3f0
[ 81.065042]  madvise_walk_vmas.constprop.0+0xb8/0x128
[ 81.065043]  do_madvise.part.0+0xe8/0x2a0
[ 81.065044]  __arm64_sys_madvise+0x64/0x78
[ 81.065046]  invoke_syscall.constprop.0+0x54/0xe8
[ 81.065048]  do_el0_svc+0xa4/0xc0
[ 81.065050]  el0_svc+0x2c/0xb0
[ 81.065052]  el0t_64_sync_handler+0xb8/0xc0
[ 81.065054]  el0t_64_sync+0x14c/0x150

> Swap experts, let me know if I'm mistaken :) Otherwise if there is no
> objection I will resend this patch series again for merging.
>
> Nhat Pham (2):
>   swapfile: add a batched variant for swap_duplicate()
>   swap: shmem: remove SWAP_MAP_SHMEM
>
>  include/linux/swap.h | 16 ++++++++--------
>  mm/shmem.c           |  2 +-
>  mm/swapfile.c        | 28 +++++++++-------------------
>  3 files changed, 18 insertions(+), 28 deletions(-)
>
>
> base-commit: acfabf7e197f7a5bedf4749dac1f39551417b049
On Mon, Sep 23, 2024 at 6:55 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2024/9/24 07:11, Nhat Pham wrote:
> > The SWAP_MAP_SHMEM state was originally introduced in the commit
> > aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a
> > swap entry belongs to shmem during swapoff.
> >
> > However, swapoff has since been rewritten drastically in the commit
> > b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now having
> > swap count == SWAP_MAP_SHMEM is basically the same as having swap
> > count == 1, and swap_shmem_alloc() behaves analogously to
> > swap_duplicate().
> >
> > This RFC proposes the removal of this state and the associated helper to
> > simplify the state machine (both mentally and code-wise). We will also
> > have an extra state/special value that can be repurposed (for swap
> > entries that never get re-duplicated).
> >
> > Another motivation (albeit a bit premature at the moment) is the new swap
> > abstraction I am currently working on, that would allow for swap/zswap
> > decoupling, swapoff optimization, etc. The fewer states and swap API
> > functions there are, the simpler the conversion will be.
> >
> > I am sending this series first as an RFC, just in case I missed something
> > or misunderstood this state, or if someone has a swap optimization in
> > mind for shmem that would require this special state.
>
> The idea makes sense to me. I did a quick test with shmem mTHP, and
> encountered the following warning which is triggered by
> 'VM_WARN_ON(usage == 1 && nr > 1)' in __swap_duplicate().

Apparently __swap_duplicate() does not currently handle increasing the
swap count for multiple swap entries by 1 (i.e. usage == 1) because it
does not handle rolling back count increases when
swap_count_continued() fails.

I guess this voids my Reviewed-by until we sort this out.
Technically, swap_count_continued() won't ever be called for shmem
because we only ever increment the count by 1, but there is no way to
know this in __swap_duplicate() without SWAP_MAP_SHMEM.

>
> [ 81.064967] ------------[ cut here ]------------
> [ 81.064968] WARNING: CPU: 4 PID: 6852 at mm/swapfile.c:3623 __swap_duplicate+0x1d0/0x2e0
> [ 81.064994] pstate: 23400005 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
> [ 81.064995] pc : __swap_duplicate+0x1d0/0x2e0
> [ 81.064997] lr : swap_duplicate_nr+0x30/0x70
> [......]
> [ 81.065019] Call trace:
> [ 81.065019]  __swap_duplicate+0x1d0/0x2e0
> [ 81.065021]  swap_duplicate_nr+0x30/0x70
> [ 81.065022]  shmem_writepage+0x24c/0x438
> [ 81.065024]  pageout+0x104/0x2e0
> [ 81.065026]  shrink_folio_list+0x7f0/0xe60
> [ 81.065027]  reclaim_folio_list+0x90/0x178
> [ 81.065029]  reclaim_pages+0x128/0x1a8
> [ 81.065030]  madvise_cold_or_pageout_pte_range+0x80c/0xd10
> [ 81.065031]  walk_pmd_range.isra.0+0x1b8/0x3a0
> [ 81.065033]  walk_pud_range+0x120/0x1b0
> [ 81.065035]  walk_pgd_range+0x150/0x1a8
> [ 81.065036]  __walk_page_range+0xa4/0xb8
> [ 81.065038]  walk_page_range+0x1c8/0x250
> [ 81.065039]  madvise_pageout+0xf4/0x280
> [ 81.065041]  madvise_vma_behavior+0x268/0x3f0
> [ 81.065042]  madvise_walk_vmas.constprop.0+0xb8/0x128
> [ 81.065043]  do_madvise.part.0+0xe8/0x2a0
> [ 81.065044]  __arm64_sys_madvise+0x64/0x78
> [ 81.065046]  invoke_syscall.constprop.0+0x54/0xe8
> [ 81.065048]  do_el0_svc+0xa4/0xc0
> [ 81.065050]  el0_svc+0x2c/0xb0
> [ 81.065052]  el0t_64_sync_handler+0xb8/0xc0
> [ 81.065054]  el0t_64_sync+0x14c/0x150
>
> > Swap experts, let me know if I'm mistaken :) Otherwise if there is no
> > objection I will resend this patch series again for merging.
> >
> > Nhat Pham (2):
> >   swapfile: add a batched variant for swap_duplicate()
> >   swap: shmem: remove SWAP_MAP_SHMEM
> >
> >  include/linux/swap.h | 16 ++++++++--------
> >  mm/shmem.c           |  2 +-
> >  mm/swapfile.c        | 28 +++++++++-------------------
> >  3 files changed, 18 insertions(+), 28 deletions(-)
> >
> >
> > base-commit: acfabf7e197f7a5bedf4749dac1f39551417b049
On 2024/9/24 10:15, Yosry Ahmed wrote:
> On Mon, Sep 23, 2024 at 6:55 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>>
>>
>> On 2024/9/24 07:11, Nhat Pham wrote:
>>> The SWAP_MAP_SHMEM state was originally introduced in the commit
>>> aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a
>>> swap entry belongs to shmem during swapoff.
>>>
>>> However, swapoff has since been rewritten drastically in the commit
>>> b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now having
>>> swap count == SWAP_MAP_SHMEM is basically the same as having swap
>>> count == 1, and swap_shmem_alloc() behaves analogously to
>>> swap_duplicate().
>>>
>>> This RFC proposes the removal of this state and the associated helper to
>>> simplify the state machine (both mentally and code-wise). We will also
>>> have an extra state/special value that can be repurposed (for swap
>>> entries that never get re-duplicated).
>>>
>>> Another motivation (albeit a bit premature at the moment) is the new swap
>>> abstraction I am currently working on, that would allow for swap/zswap
>>> decoupling, swapoff optimization, etc. The fewer states and swap API
>>> functions there are, the simpler the conversion will be.
>>>
>>> I am sending this series first as an RFC, just in case I missed something
>>> or misunderstood this state, or if someone has a swap optimization in
>>> mind for shmem that would require this special state.
>>
>> The idea makes sense to me. I did a quick test with shmem mTHP, and
>> encountered the following warning which is triggered by
>> 'VM_WARN_ON(usage == 1 && nr > 1)' in __swap_duplicate().
>
> Apparently __swap_duplicate() does not currently handle increasing the
> swap count for multiple swap entries by 1 (i.e. usage == 1) because it
> does not handle rolling back count increases when
> swap_count_continued() fails.
>
> I guess this voids my Reviewed-by until we sort this out.
> Technically,
> swap_count_continued() won't ever be called for shmem because we only
> ever increment the count by 1, but there is no way to know this in
> __swap_duplicate() without SWAP_MAP_SHMEM.

Agreed. An easy solution might be to add a new boolean parameter to
indicate whether it is the shmem swap entry count that is being
increased:

diff --git a/mm/swapfile.c b/mm/swapfile.c
index cebc244ee60f..21f1eec2c30a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3607,7 +3607,7 @@ void si_swapinfo(struct sysinfo *val)
  * - swap-cache reference is requested but the entry is not used. -> ENOENT
  * - swap-mapped reference requested but needs continued swap count. -> ENOMEM
  */
-static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
+static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr, bool shmem)
 {
        struct swap_info_struct *si;
        struct swap_cluster_info *ci;
@@ -3620,7 +3620,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
        offset = swp_offset(entry);

        VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
-       VM_WARN_ON(usage == 1 && nr > 1);
+       VM_WARN_ON(usage == 1 && nr > 1 && !shmem);

        ci = lock_cluster_or_swap_info(si, offset);

        err = 0;
@@ -3661,7 +3661,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
                        has_cache = SWAP_HAS_CACHE;
                else if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
                        count += usage;
-               else if (swap_count_continued(si, offset + i, count))
+               else if (!shmem && swap_count_continued(si, offset + i, count))
                        count = COUNT_CONTINUED;
                else {
                        /*