Message ID | 20230404154716.23058-1-rongwei.wang@linux.alibaba.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v2] mm/swap: fix swap_info_struct race between swapoff and get_swap_pages() | expand |
Hello I have fix up some stuff base on Patch v1. And in order to help all readers and reviewers to reproduce this bug, share a reproducer here: swap_bomb.sh #!/usr/bin/env bash stress-ng -a 1 --class vm -t 12h --metrics --times -x bigheap,stackmmap,mlock,vm-splice,mmapaddr,mmapfixed,mmapfork,mmaphuge,mmapmany,mprotect,mremap,msync,msyncmany,physpage,tmpfs,vm-addr,vm-rw,brk,vm-segv,userfaultfd,malloc,stack,munmap,dev-shm,bad-altstack,shm-sysv,pageswap,madvise,vm,shm,env,mmap --verify -v & stress-ng -a 1 --class vm -t 12h --metrics --times -x bigheap,stackmmap,mlock,vm-splice,mmapaddr,mmapfixed,mmapfork,mmaphuge,mmapmany,mprotect,mremap,msync,msyncmany,physpage,tmpfs,vm-addr,vm-rw,brk,vm-segv,userfaultfd,malloc,stack,munmap,dev-shm,bad-altstack,shm-sysv,pageswap,madvise,vm,shm,env,mmap --verify -v & stress-ng -a 1 --class vm -t 12h --metrics --times -x bigheap,stackmmap,mlock,vm-splice,mmapaddr,mmapfixed,mmapfork,mmaphuge,mmapmany,mprotect,mremap,msync,msyncmany,physpage,tmpfs,vm-addr,vm-rw,brk,vm-segv,userfaultfd,malloc,stack,munmap,dev-shm,bad-altstack,shm-sysv,pageswap,madvise,vm,shm,env,mmap --verify -v & stress-ng -a 1 --class vm -t 12h --metrics --times -x bigheap,stackmmap,mlock,vm-splice,mmapaddr,mmapfixed,mmapfork,mmaphuge,mmapmany,mprotect,mremap,msync,msyncmany,physpage,tmpfs,vm-addr,vm-rw,brk,vm-segv,userfaultfd,malloc,stack,munmap,dev-shm,bad-altstack,shm-sysv,pageswap,madvise,vm,shm,env,mmap --verify -v madvise_shared.c #include <stdio.h> #include <stdlib.h> #include <sys/mman.h> #include <unistd.h> #define MSIZE (1024 * 1024 * 2) int main() { char *shm_addr; unsigned long i; while (1) { // Map shared memory segment shm_addr = mmap(NULL, MSIZE, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0); if (shm_addr == MAP_FAILED) { perror("Failed to map shared memory segment"); exit(EXIT_FAILURE); } for (i = 0; i < MSIZE; i++) { shm_addr[i] = 1; } // Advise kernel on usage pattern of shared memory if (madvise(shm_addr, MSIZE, MADV_PAGEOUT) == -1) { perror ("Failed to advise kernel on shared memory usage"); exit(EXIT_FAILURE); } for (i = 0; i < MSIZE; i++) { shm_addr[i] = 1; } // Advise kernel on usage pattern of shared memory if (madvise(shm_addr, MSIZE, MADV_PAGEOUT) == -1) { perror ("Failed to advise kernel on shared memory usage"); exit(EXIT_FAILURE); } // Use shared memory printf("Hello, shared memory: 0x%lx\n", shm_addr); // Unmap shared memory segment if (munmap(shm_addr, MSIZE) == -1) { perror("Failed to unmap shared memory segment"); exit(EXIT_FAILURE); } } return 0; } The bug will reproduce more quickly (about 2~5 minutes) if concurrent more swap_bomb.sh and madvise_shared. Thanks. change log: v1 -> v2 * fix up some commits and add assert_spin_locked(&p->lock) inside __delete_from_avail_list() (suggested by Matthew Wilcox and Bagas Sanjaya) On 4/4/23 11:47 PM, Rongwei Wang wrote: > The si->lock must be held when deleting the si from > the available list. Otherwise, another thread can > re-add the si to the available list, which can lead > to memory corruption. The only place we have found > where this happens is in the swapoff path. This case > can be described as below: > > core 0 core 1 > swapoff > > del_from_avail_list(si) waiting > > try lock si->lock acquire swap_avail_lock > and re-add si into > swap_avail_head > > acquire si->lock but > missing si already be > added again, and continuing > to clear SWP_WRITEOK, etc. > > It can be easily found a massive warning messages can > be triggered inside get_swap_pages() by some special > cases, for example, we call madvise(MADV_PAGEOUT) on > blocks of touched memory concurrently, meanwhile, run > much swapon-swapoff operations (e.g. stress-ng-swap). > > However, in the worst case, panic can be caused by the > above scene. In swapoff(), the memory used by si could > be kept in swap_info[] after turning off a swap. This > means memory corruption will not be caused immediately > until allocated and reset for a new swap in the swapon > path. A panic message caused: > (with CONFIG_PLIST_DEBUG enabled) > > ------------[ cut here ]------------ > top: 00000000e58a3003, n: 0000000013e75cda, p: 000000008cd4451a > prev: 0000000035b1e58a, n: 000000008cd4451a, p: 000000002150ee8d > next: 000000008cd4451a, n: 000000008cd4451a, p: 000000008cd4451a > WARNING: CPU: 21 PID: 1843 at lib/plist.c:60 plist_check_prev_next_node+0x50/0x70 > Modules linked in: rfkill(E) crct10dif_ce(E)... > CPU: 21 PID: 1843 Comm: stress-ng Kdump: ... 5.10.134+ > Hardware name: Alibaba Cloud ECS, BIOS 0.0.0 02/06/2015 > pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--) > pc : plist_check_prev_next_node+0x50/0x70 > lr : plist_check_prev_next_node+0x50/0x70 > sp : ffff0018009d3c30 > x29: ffff0018009d3c40 x28: ffff800011b32a98 > x27: 0000000000000000 x26: ffff001803908000 > x25: ffff8000128ea088 x24: ffff800011b32a48 > x23: 0000000000000028 x22: ffff001800875c00 > x21: ffff800010f9e520 x20: ffff001800875c00 > x19: ffff001800fdc6e0 x18: 0000000000000030 > x17: 0000000000000000 x16: 0000000000000000 > x15: 0736076307640766 x14: 0730073007380731 > x13: 0736076307640766 x12: 0730073007380731 > x11: 000000000004058d x10: 0000000085a85b76 > x9 : ffff8000101436e4 x8 : ffff800011c8ce08 > x7 : 0000000000000000 x6 : 0000000000000001 > x5 : ffff0017df9ed338 x4 : 0000000000000001 > x3 : ffff8017ce62a000 x2 : ffff0017df9ed340 > x1 : 0000000000000000 x0 : 0000000000000000 > Call trace: > plist_check_prev_next_node+0x50/0x70 > plist_check_head+0x80/0xf0 > plist_add+0x28/0x140 > add_to_avail_list+0x9c/0xf0 > _enable_swap_info+0x78/0xb4 > __do_sys_swapon+0x918/0xa10 > __arm64_sys_swapon+0x20/0x30 > el0_svc_common+0x8c/0x220 > do_el0_svc+0x2c/0x90 > el0_svc+0x1c/0x30 > el0_sync_handler+0xa8/0xb0 > el0_sync+0x148/0x180 > irq event stamp: 2082270 > > Now, si->lock locked before calling 'del_from_avail_list()' > to make sure other thread see the si had been deleted > and SWP_WRITEOK cleared together, will not reinsert again. > > This problem exists in versions after stable 5.10.y. > > Cc: stable@vger.kernel.org > Tested-by: Yongchen Yin <wb-yyc939293@alibaba-inc.com> > Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com> > --- > mm/swapfile.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/mm/swapfile.c b/mm/swapfile.c > index 62ba2bf577d7..2c718f45745f 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -679,6 +679,7 @@ static void __del_from_avail_list(struct swap_info_struct *p) > { > int nid; > > + assert_spin_locked(&p->lock); > for_each_node(nid) > plist_del(&p->avail_lists[nid], &swap_avail_heads[nid]); > } > @@ -2434,8 +2435,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile) > spin_unlock(&swap_lock); > goto out_dput; > } > - del_from_avail_list(p); > spin_lock(&p->lock); > + del_from_avail_list(p); > if (p->prio < 0) { > struct swap_info_struct *si = p; > int nid;
On Tue, 4 Apr 2023 23:47:16 +0800 Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote: > The si->lock must be held when deleting the si from > the available list. > > ... > > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -679,6 +679,7 @@ static void __del_from_avail_list(struct swap_info_struct *p) > { > int nid; > > + assert_spin_locked(&p->lock); > for_each_node(nid) > plist_del(&p->avail_lists[nid], &swap_avail_heads[nid]); > } > @@ -2434,8 +2435,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile) > spin_unlock(&swap_lock); > goto out_dput; > } > - del_from_avail_list(p); > spin_lock(&p->lock); > + del_from_avail_list(p); > if (p->prio < 0) { > struct swap_info_struct *si = p; > int nid; So we have swap_avail_lock swap_info_struct.lock swap_cluster_info.lock Is the ranking of these three clearly documented somewhere? Did you test this with lockdep fully enabled? I'm thinking that Aaron's a2468cc9bfdff ("swap: choose swap device according to numa node") is the appropriate Fixes: target - do you agree? These functions use identifier `p' for the swap_info_struct*, whereas most other code uses the much more sensible `si'. That's just rude. But we shouldn't change that within this fix.
Hi Andrew On 4/5/23 3:26 AM, Andrew Morton wrote: > On Tue, 4 Apr 2023 23:47:16 +0800 Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote: > >> The si->lock must be held when deleting the si from >> the available list. >> >> ... >> >> --- a/mm/swapfile.c >> +++ b/mm/swapfile.c >> @@ -679,6 +679,7 @@ static void __del_from_avail_list(struct swap_info_struct *p) >> { >> int nid; >> >> + assert_spin_locked(&p->lock); >> for_each_node(nid) >> plist_del(&p->avail_lists[nid], &swap_avail_heads[nid]); >> } >> @@ -2434,8 +2435,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile) >> spin_unlock(&swap_lock); >> goto out_dput; >> } >> - del_from_avail_list(p); >> spin_lock(&p->lock); >> + del_from_avail_list(p); >> if (p->prio < 0) { >> struct swap_info_struct *si = p; >> int nid; > So we have > > swap_avail_lock > swap_info_struct.lock > swap_cluster_info.lock > > Is the ranking of these three clearly documented somewhere? It seems have swap_lock swap_info_struct.lock swap_avail_lock I just summary the ranking of these three locks by reading code, not find any documents (maybe have). > > > Did you test this with lockdep fully enabled? > > > I'm thinking that Aaron's a2468cc9bfdff ("swap: choose swap device > according to numa node") is the appropriate Fixes: target - do you > agree? Yes, I'm sure my latest test version has included Aaron's a2468cc9bfdff, and my test .config has enabled CONFIG as below: CONFIG_LOCK_DEBUGGING_SUPPORT=y CONFIG_PROVE_LOCKING=y CONFIG_DEBUG_SPINLOCK=y CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_LOCKDEP=y CONFIG_DEBUG_LOCKDEP=y CONFIG_DEBUG_ATOMIC_SLEEP=y > > > These functions use identifier `p' for the swap_info_struct*, whereas > most other code uses the much more sensible `si'. That's just rude. > But we shouldn't change that within this fix. Indeed, It's confusing more or less to use both 'si' and 'p'. I can ready for another patch to replace 'p' with 'si'. Thanks.
Hi Andrew, Sorry for replying a little late, it's holiday here yesterday. On Tue, Apr 04, 2023 at 12:26:00PM -0700, Andrew Morton wrote: > On Tue, 4 Apr 2023 23:47:16 +0800 Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote: > > > The si->lock must be held when deleting the si from > > the available list. > > > > ... > > > > --- a/mm/swapfile.c > > +++ b/mm/swapfile.c > > @@ -679,6 +679,7 @@ static void __del_from_avail_list(struct swap_info_struct *p) > > { > > int nid; > > > > + assert_spin_locked(&p->lock); > > for_each_node(nid) > > plist_del(&p->avail_lists[nid], &swap_avail_heads[nid]); > > } > > @@ -2434,8 +2435,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile) > > spin_unlock(&swap_lock); > > goto out_dput; > > } > > - del_from_avail_list(p); > > spin_lock(&p->lock); > > + del_from_avail_list(p); > > if (p->prio < 0) { > > struct swap_info_struct *si = p; > > int nid; > > So we have > > swap_avail_lock > swap_info_struct.lock > swap_cluster_info.lock > > Is the ranking of these three clearly documented somewhere? > I see some comments in swapfile.c mentioned something related, e.g. above the definition of swap_avail_heads, the comment mentioned swap_lock has to be taken before si->lock and swap_avail_lock can be taken after si->lock is held, but I'm not aware of a place documenting these things. Documenting these things is useful information I think, let me see if I can come up with something later. > > Did you test this with lockdep fully enabled? > > > I'm thinking that Aaron's a2468cc9bfdff ("swap: choose swap device > according to numa node") is the appropriate Fixes: target - do you > agree? It doesn't appear to be the case. For one thing, the problematic code that removes the swap device from the avail list without acquiring si->lock was there before my commit and my commit didn't change that behaviour. For another, I wanted to see if the problem is still there without my commit(just to make sure). I followed Rongwei's description and used stress-ng/swap test together with some test progs that does memory allocation then MADVISE(pageout) in a loop to reproduce this problem and I can also see the warning like below using Linus' master branch as of today, I believe this is the problem Rongwei described: [ 1914.518786] ------------[ cut here ]------------ [ 1914.519049] swap_info 9 in list but !SWP_WRITEOK [ 1914.519274] WARNING: CPU: 14 PID: 14307 at mm/swapfile.c:1085 get_swap_pages+0x3b3/0x440 [ 1914.519660] Modules linked in: [ 1914.519811] CPU: 14 PID: 14307 Comm: swap Tainted: G W 6.3.0-rc5-00032-g99ddf2254feb #5 [ 1914.520238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc36 04/01/2014 [ 1914.520641] RIP: 0010:get_swap_pages+0x3b3/0x440 [ 1914.520860] Code: 48 8b 4c 24 30 48 c1 e0 3a 4c 09 e0 48 89 01 e8 43 79 96 00 e9 b2 fd ff ff 41 0f be 77 48 48 c7 c78 [ 1914.521709] RSP: 0018:ffffc9000ba0f838 EFLAGS: 00010282 [ 1914.521950] RAX: 0000000000000000 RBX: ffff888154411400 RCX: 0000000000000000 [ 1914.522273] RDX: 0000000000000004 RSI: ffffffff824035cb RDI: 0000000000000001 [ 1914.522601] RBP: ffff888100d95f68 R08: 0000000000000001 R09: 0000000000000003 [ 1914.522926] R10: ffffffff82a7a420 R11: ffffffff82a7a420 R12: 0000000000000350 [ 1914.523249] R13: ffff888100d95da8 R14: ffff888100d95f50 R15: ffff888100d95c00 [ 1914.523576] FS: 00007f23abea2600(0000) GS:ffff88823b600000(0000) knlGS:0000000000000000 [ 1914.523942] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1914.524206] CR2: 00007f23abbff000 CR3: 0000000104b86004 CR4: 0000000000770ee0 [ 1914.524534] PKRU: 55555554 [ 1914.524661] Call Trace: [ 1914.524782] <TASK> [ 1914.524889] folio_alloc_swap+0xde/0x230 [ 1914.525076] add_to_swap+0x36/0xb0 [ 1914.525242] shrink_folio_list+0x9ab/0xef0 [ 1914.525445] reclaim_folio_list+0x70/0x130 [ 1914.525644] reclaim_pages+0x9c/0x1c0 [ 1914.525819] madvise_cold_or_pageout_pte_range+0x79f/0xc80 [ 1914.526073] walk_pgd_range+0x4d8/0x940 [ 1914.526255] ? mt_find+0x15b/0x490 [ 1914.526426] __walk_page_range+0x211/0x230 [ 1914.526619] walk_page_range+0x17a/0x1e0 [ 1914.526807] madvise_pageout+0xef/0x250 And when I reverted my commit on the same branch(needs some manual edits), the problem is still there. Another thing is, I noticed Rongwei mentioned "This problem exists in versions after stable 5.10.y." in the changelog while my commit entered mainline in v4.14. So either this problem is always there, i.e. earlier than my commit; or this problem is indeed only there after v5.10, then it should be something else that triggered it. My qemu refuses to boot v4.14 kernel so I can not verify the former yet.
On Wed, Apr 05, 2023 at 12:08:47AM +0800, Rongwei Wang wrote: > Hello > > I have fix up some stuff base on Patch v1. And in order to help all readers > and reviewers to > > reproduce this bug, share a reproducer here: I reproduced this problem under a VM this way: $ sudo ./stress-ng --swap 1 // on another terminal $ for i in `seq 8`; do ./swap & done Looks simpler than yours :-) (Didn't realize you have posted your reproducer here since I'm not CCed and just found it after invented mine) Then the warning message normally appear within a few seconds. Here is the code for the above swap prog: #include <stdio.h> #include <stddef.h> #include <sys/mman.h> #define SIZE 0x100000 int main(void) { int i, ret; void *p; p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0); if (p == MAP_FAILED) { perror("mmap"); return -1; } ret = 0; while (1) { for (i = 0; i < SIZE; i += 0x1000) ((char *)p)[i] = 1; ret = madvise(p, SIZE, MADV_PAGEOUT); if (ret != 0) { perror("madvise"); break; } } return ret; } Unfortunately, this test prog did not work on kernels before v5.4 because MADV_PAGEOUT is introduced in v5.4. I tested on v5.4 and the problem is also there. Haven't found a way to trigger swap with swap device come and go on kernels before v5.4; tried putting the test prog in a memcg with memory limit but then the prog is easily killed due to nowhere to swap out. > > swap_bomb.sh > > #!/usr/bin/env bash > > stress-ng -a 1 --class vm -t 12h --metrics --times -x bigheap,stackmmap,mlock,vm-splice,mmapaddr,mmapfixed,mmapfork,mmaphuge,mmapmany,mprotect,mremap,msync,msyncmany,physpage,tmpfs,vm-addr,vm-rw,brk,vm-segv,userfaultfd,malloc,stack,munmap,dev-shm,bad-altstack,shm-sysv,pageswap,madvise,vm,shm,env,mmap > --verify -v & > stress-ng -a 1 --class vm -t 12h --metrics --times -x bigheap,stackmmap,mlock,vm-splice,mmapaddr,mmapfixed,mmapfork,mmaphuge,mmapmany,mprotect,mremap,msync,msyncmany,physpage,tmpfs,vm-addr,vm-rw,brk,vm-segv,userfaultfd,malloc,stack,munmap,dev-shm,bad-altstack,shm-sysv,pageswap,madvise,vm,shm,env,mmap > --verify -v & > stress-ng -a 1 --class vm -t 12h --metrics --times -x bigheap,stackmmap,mlock,vm-splice,mmapaddr,mmapfixed,mmapfork,mmaphuge,mmapmany,mprotect,mremap,msync,msyncmany,physpage,tmpfs,vm-addr,vm-rw,brk,vm-segv,userfaultfd,malloc,stack,munmap,dev-shm,bad-altstack,shm-sysv,pageswap,madvise,vm,shm,env,mmap > --verify -v & > stress-ng -a 1 --class vm -t 12h --metrics --times -x bigheap,stackmmap,mlock,vm-splice,mmapaddr,mmapfixed,mmapfork,mmaphuge,mmapmany,mprotect,mremap,msync,msyncmany,physpage,tmpfs,vm-addr,vm-rw,brk,vm-segv,userfaultfd,malloc,stack,munmap,dev-shm,bad-altstack,shm-sysv,pageswap,madvise,vm,shm,env,mmap > --verify -v > > > madvise_shared.c > > #include <stdio.h> > #include <stdlib.h> > #include <sys/mman.h> > #include <unistd.h> > > #define MSIZE (1024 * 1024 * 2) > > int main() > { > char *shm_addr; > unsigned long i; > > while (1) { > // Map shared memory segment > shm_addr = > mmap(NULL, MSIZE, PROT_READ | PROT_WRITE, > MAP_SHARED | MAP_ANONYMOUS, -1, 0); > if (shm_addr == MAP_FAILED) { > perror("Failed to map shared memory segment"); > exit(EXIT_FAILURE); > } > > for (i = 0; i < MSIZE; i++) { > shm_addr[i] = 1; > } > > // Advise kernel on usage pattern of shared memory > if (madvise(shm_addr, MSIZE, MADV_PAGEOUT) == -1) { > perror > ("Failed to advise kernel on shared memory > usage"); > exit(EXIT_FAILURE); > } > > for (i = 0; i < MSIZE; i++) { > shm_addr[i] = 1; > } > > // Advise kernel on usage pattern of shared memory > if (madvise(shm_addr, MSIZE, MADV_PAGEOUT) == -1) { > perror > ("Failed to advise kernel on shared memory > usage"); > exit(EXIT_FAILURE); > } > // Use shared memory > printf("Hello, shared memory: 0x%lx\n", shm_addr); > > // Unmap shared memory segment > if (munmap(shm_addr, MSIZE) == -1) { > perror("Failed to unmap shared memory segment"); > exit(EXIT_FAILURE); > } > } > > return 0; > } > > The bug will reproduce more quickly (about 2~5 minutes) if concurrent more > swap_bomb.sh and madvise_shared. > > Thanks.
> It doesn't appear to be the case. For one thing, the problematic code > that removes the swap device from the avail list without acquiring > si->lock was there before my commit and my commit didn't change that > behaviour. For another, I wanted to see if the problem is still there > without my commit(just to make sure). > > I followed Rongwei's description and used stress-ng/swap test together > with some test progs that does memory allocation then MADVISE(pageout) > in a loop to reproduce this problem and I can also see the warning like > below using Linus' master branch as of today, I believe this is the > problem Rongwei described: Hi, Aaron, I can sure this is that bug, and the panic will happen when CONFIG_PLIST_DEBUG enabled (I'm not sure whether you have enabled it). > > [ 1914.518786] ------------[ cut here ]------------ > [ 1914.519049] swap_info 9 in list but !SWP_WRITEOK > [ 1914.519274] WARNING: CPU: 14 PID: 14307 at mm/swapfile.c:1085 get_swap_pages+0x3b3/0x440 > [ 1914.519660] Modules linked in: > [ 1914.519811] CPU: 14 PID: 14307 Comm: swap Tainted: G W 6.3.0-rc5-00032-g99ddf2254feb #5 > [ 1914.520238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc36 04/01/2014 > [ 1914.520641] RIP: 0010:get_swap_pages+0x3b3/0x440 > [ 1914.520860] Code: 48 8b 4c 24 30 48 c1 e0 3a 4c 09 e0 48 89 01 e8 43 79 96 00 e9 b2 fd ff ff 41 0f be 77 48 48 c7 c78 > [ 1914.521709] RSP: 0018:ffffc9000ba0f838 EFLAGS: 00010282 > [ 1914.521950] RAX: 0000000000000000 RBX: ffff888154411400 RCX: 0000000000000000 > [ 1914.522273] RDX: 0000000000000004 RSI: ffffffff824035cb RDI: 0000000000000001 > [ 1914.522601] RBP: ffff888100d95f68 R08: 0000000000000001 R09: 0000000000000003 > [ 1914.522926] R10: ffffffff82a7a420 R11: ffffffff82a7a420 R12: 0000000000000350 > [ 1914.523249] R13: ffff888100d95da8 R14: ffff888100d95f50 R15: ffff888100d95c00 > [ 1914.523576] FS: 00007f23abea2600(0000) GS:ffff88823b600000(0000) knlGS:0000000000000000 > [ 1914.523942] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 1914.524206] CR2: 00007f23abbff000 CR3: 0000000104b86004 CR4: 0000000000770ee0 > [ 1914.524534] PKRU: 55555554 > [ 1914.524661] Call Trace: > [ 1914.524782] <TASK> > [ 1914.524889] folio_alloc_swap+0xde/0x230 > [ 1914.525076] add_to_swap+0x36/0xb0 > [ 1914.525242] shrink_folio_list+0x9ab/0xef0 > [ 1914.525445] reclaim_folio_list+0x70/0x130 > [ 1914.525644] reclaim_pages+0x9c/0x1c0 > [ 1914.525819] madvise_cold_or_pageout_pte_range+0x79f/0xc80 > [ 1914.526073] walk_pgd_range+0x4d8/0x940 > [ 1914.526255] ? mt_find+0x15b/0x490 > [ 1914.526426] __walk_page_range+0x211/0x230 > [ 1914.526619] walk_page_range+0x17a/0x1e0 > [ 1914.526807] madvise_pageout+0xef/0x250 > > And when I reverted my commit on the same branch(needs some manual edits), > the problem is still there. > > Another thing is, I noticed Rongwei mentioned "This problem exists in > versions after stable 5.10.y." in the changelog while my commit entered > mainline in v4.14. > > So either this problem is always there, i.e. earlier than my commit; or > this problem is indeed only there after v5.10, then it should be something > else that triggered it. My qemu refuses to boot v4.14 kernel so I can > not verify the former yet. Me too. The oldest kernel that my qemu can run is 4.19. BTW, I try to replace 'p' with 'si' today, and find there are many areas need to be modified, especially inside swapoff() and swapon(). So many modifications maybe affect future tracking of code modifications and will cost some time to test. So I wanna to ensure whether need I to do this. If need, I can continue to do this. Thanks.
Oh, sorry, I miss this email just now, that because of I'm also replying your previous email. On 2023/4/6 20:12, Aaron Lu wrote: > On Wed, Apr 05, 2023 at 12:08:47AM +0800, Rongwei Wang wrote: >> Hello >> >> I have fix up some stuff base on Patch v1. And in order to help all readers >> and reviewers to >> >> reproduce this bug, share a reproducer here: > I reproduced this problem under a VM this way: > > $ sudo ./stress-ng --swap 1 > // on another terminal > $ for i in `seq 8`; do ./swap & done > Looks simpler than yours :-) Cool, indeed become simpler. > (Didn't realize you have posted your reproducer here since I'm not CCed > and just found it after invented mine) > Then the warning message normally appear within a few seconds. > > Here is the code for the above swap prog: > > #include <stdio.h> > #include <stddef.h> > #include <sys/mman.h> > > #define SIZE 0x100000 > > int main(void) > { > int i, ret; > void *p; > > p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0); > if (p == MAP_FAILED) { > perror("mmap"); > return -1; > } > > ret = 0; > while (1) { > for (i = 0; i < SIZE; i += 0x1000) > ((char *)p)[i] = 1; > ret = madvise(p, SIZE, MADV_PAGEOUT); > if (ret != 0) { > perror("madvise"); > break; > } > } > > return ret; > } > > Unfortunately, this test prog did not work on kernels before v5.4 because > MADV_PAGEOUT is introduced in v5.4. I tested on v5.4 and the problem is > also there. Maybe that is this bug can not be found since now. And we found this is triggered by stress-ng-swap and stress-ng-madvise (PAGEOUT) firstly. It seems this is that reason. It seems MADV_COLD is also introduced together with MADV_PAGEOUT. I have no idea and have to depend on you.:-) > > Haven't found a way to trigger swap with swap device come and go on > kernels before v5.4; tried putting the test prog in a memcg with memory > limit but then the prog is easily killed due to nowhere to swap out. > Personally, I do not intend to continuing searching for the method to reproduce before v5.4. Of course, if you have idea, I can try. Thanks:-)
On Tue, Apr 04, 2023 at 11:47:16PM +0800, Rongwei Wang wrote: > The si->lock must be held when deleting the si from > the available list. Otherwise, another thread can > re-add the si to the available list, which can lead > to memory corruption. The only place we have found > where this happens is in the swapoff path. This case > can be described as below: > > core 0 core 1 > swapoff > > del_from_avail_list(si) waiting > > try lock si->lock acquire swap_avail_lock > and re-add si into > swap_avail_head confused here. If del_from_avail_list(si) finished in swaoff path, then this si should not exist in any of the per-node avail list and core 1 should not be able to re-add it. I stared at the code for a while and couldn't figure out how this happened, will continue to look at this tomorrow. Thanks, Aaron > > acquire si->lock but > missing si already be > added again, and continuing > to clear SWP_WRITEOK, etc. > > It can be easily found a massive warning messages can > be triggered inside get_swap_pages() by some special > cases, for example, we call madvise(MADV_PAGEOUT) on > blocks of touched memory concurrently, meanwhile, run > much swapon-swapoff operations (e.g. stress-ng-swap). > > However, in the worst case, panic can be caused by the > above scene. In swapoff(), the memory used by si could > be kept in swap_info[] after turning off a swap. This > means memory corruption will not be caused immediately > until allocated and reset for a new swap in the swapon > path. A panic message caused: > (with CONFIG_PLIST_DEBUG enabled) > > ------------[ cut here ]------------ > top: 00000000e58a3003, n: 0000000013e75cda, p: 000000008cd4451a > prev: 0000000035b1e58a, n: 000000008cd4451a, p: 000000002150ee8d > next: 000000008cd4451a, n: 000000008cd4451a, p: 000000008cd4451a > WARNING: CPU: 21 PID: 1843 at lib/plist.c:60 plist_check_prev_next_node+0x50/0x70 > Modules linked in: rfkill(E) crct10dif_ce(E)... > CPU: 21 PID: 1843 Comm: stress-ng Kdump: ... 5.10.134+ > Hardware name: Alibaba Cloud ECS, BIOS 0.0.0 02/06/2015 > pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--) > pc : plist_check_prev_next_node+0x50/0x70 > lr : plist_check_prev_next_node+0x50/0x70 > sp : ffff0018009d3c30 > x29: ffff0018009d3c40 x28: ffff800011b32a98 > x27: 0000000000000000 x26: ffff001803908000 > x25: ffff8000128ea088 x24: ffff800011b32a48 > x23: 0000000000000028 x22: ffff001800875c00 > x21: ffff800010f9e520 x20: ffff001800875c00 > x19: ffff001800fdc6e0 x18: 0000000000000030 > x17: 0000000000000000 x16: 0000000000000000 > x15: 0736076307640766 x14: 0730073007380731 > x13: 0736076307640766 x12: 0730073007380731 > x11: 000000000004058d x10: 0000000085a85b76 > x9 : ffff8000101436e4 x8 : ffff800011c8ce08 > x7 : 0000000000000000 x6 : 0000000000000001 > x5 : ffff0017df9ed338 x4 : 0000000000000001 > x3 : ffff8017ce62a000 x2 : ffff0017df9ed340 > x1 : 0000000000000000 x0 : 0000000000000000 > Call trace: > plist_check_prev_next_node+0x50/0x70 > plist_check_head+0x80/0xf0 > plist_add+0x28/0x140 > add_to_avail_list+0x9c/0xf0 > _enable_swap_info+0x78/0xb4 > __do_sys_swapon+0x918/0xa10 > __arm64_sys_swapon+0x20/0x30 > el0_svc_common+0x8c/0x220 > do_el0_svc+0x2c/0x90 > el0_svc+0x1c/0x30 > el0_sync_handler+0xa8/0xb0 > el0_sync+0x148/0x180 > irq event stamp: 2082270 > > Now, si->lock locked before calling 'del_from_avail_list()' > to make sure other thread see the si had been deleted > and SWP_WRITEOK cleared together, will not reinsert again. > > This problem exists in versions after stable 5.10.y. > > Cc: stable@vger.kernel.org > Tested-by: Yongchen Yin <wb-yyc939293@alibaba-inc.com> > Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com> > --- > mm/swapfile.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/mm/swapfile.c b/mm/swapfile.c > index 62ba2bf577d7..2c718f45745f 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -679,6 +679,7 @@ static void __del_from_avail_list(struct swap_info_struct *p) > { > int nid; > > + assert_spin_locked(&p->lock); > for_each_node(nid) > plist_del(&p->avail_lists[nid], &swap_avail_heads[nid]); > } > @@ -2434,8 +2435,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile) > spin_unlock(&swap_lock); > goto out_dput; > } > - del_from_avail_list(p); > spin_lock(&p->lock); > + del_from_avail_list(p); > if (p->prio < 0) { > struct swap_info_struct *si = p; > int nid; > -- > 2.27.0 >
On Thu, Apr 06, 2023 at 10:04:16PM +0800, Aaron Lu wrote: > On Tue, Apr 04, 2023 at 11:47:16PM +0800, Rongwei Wang wrote: > > The si->lock must be held when deleting the si from > > the available list. Otherwise, another thread can > > re-add the si to the available list, which can lead > > to memory corruption. The only place we have found > > where this happens is in the swapoff path. This case > > can be described as below: > > > > core 0 core 1 > > swapoff > > > > del_from_avail_list(si) waiting > > > > try lock si->lock acquire swap_avail_lock > > and re-add si into > > swap_avail_head > > confused here. > > If del_from_avail_list(si) finished in swaoff path, then this si should > not exist in any of the per-node avail list and core 1 should not be > able to re-add it. I think a possible sequence could be like this: cpuX cpuY swapoff put_swap_folio() del_from_avail_list(si) taken si->lock spin_lock(&si->lock); swap_range_free() was_full && SWP_WRITEOK -> re-add! drop si->lock taken si->lock proceed removing si End result: si left on avail_list after being swapped off. The problem is, in add_to_avail_list(), it has no idea this si is being swapped off and taking si->lock then del_from_avail_list() could avoid this problem, so I think this patch did the right thing but the changelog about how this happened needs updating and after that: Reviewed-by: Aaron Lu <aaron.lu@intel.com> Thanks, Aaron > > I stared at the code for a while and couldn't figure out how this > happened, will continue to look at this tomorrow. > > > > acquire si->lock but > > missing si already be > > added again, and continuing > > to clear SWP_WRITEOK, etc. > > > > It can be easily found a massive warning messages can > > be triggered inside get_swap_pages() by some special > > cases, for example, we call madvise(MADV_PAGEOUT) on > > blocks of touched memory concurrently, meanwhile, run > > much swapon-swapoff operations (e.g. stress-ng-swap). > > > > However, in the worst case, panic can be caused by the > > above scene. In swapoff(), the memory used by si could > > be kept in swap_info[] after turning off a swap. This > > means memory corruption will not be caused immediately > > until allocated and reset for a new swap in the swapon > > path. A panic message caused: > > (with CONFIG_PLIST_DEBUG enabled) > > > > ------------[ cut here ]------------ > > top: 00000000e58a3003, n: 0000000013e75cda, p: 000000008cd4451a > > prev: 0000000035b1e58a, n: 000000008cd4451a, p: 000000002150ee8d > > next: 000000008cd4451a, n: 000000008cd4451a, p: 000000008cd4451a > > WARNING: CPU: 21 PID: 1843 at lib/plist.c:60 plist_check_prev_next_node+0x50/0x70 > > Modules linked in: rfkill(E) crct10dif_ce(E)... > > CPU: 21 PID: 1843 Comm: stress-ng Kdump: ... 5.10.134+ > > Hardware name: Alibaba Cloud ECS, BIOS 0.0.0 02/06/2015 > > pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--) > > pc : plist_check_prev_next_node+0x50/0x70 > > lr : plist_check_prev_next_node+0x50/0x70 > > sp : ffff0018009d3c30 > > x29: ffff0018009d3c40 x28: ffff800011b32a98 > > x27: 0000000000000000 x26: ffff001803908000 > > x25: ffff8000128ea088 x24: ffff800011b32a48 > > x23: 0000000000000028 x22: ffff001800875c00 > > x21: ffff800010f9e520 x20: ffff001800875c00 > > x19: ffff001800fdc6e0 x18: 0000000000000030 > > x17: 0000000000000000 x16: 0000000000000000 > > x15: 0736076307640766 x14: 0730073007380731 > > x13: 0736076307640766 x12: 0730073007380731 > > x11: 000000000004058d x10: 0000000085a85b76 > > x9 : ffff8000101436e4 x8 : ffff800011c8ce08 > > x7 : 0000000000000000 x6 : 0000000000000001 > > x5 : ffff0017df9ed338 x4 : 0000000000000001 > > x3 : ffff8017ce62a000 x2 : ffff0017df9ed340 > > x1 : 0000000000000000 x0 : 0000000000000000 > > Call trace: > > plist_check_prev_next_node+0x50/0x70 > > plist_check_head+0x80/0xf0 > > plist_add+0x28/0x140 > > add_to_avail_list+0x9c/0xf0 > > _enable_swap_info+0x78/0xb4 > > __do_sys_swapon+0x918/0xa10 > > __arm64_sys_swapon+0x20/0x30 > > el0_svc_common+0x8c/0x220 > > do_el0_svc+0x2c/0x90 > > el0_svc+0x1c/0x30 > > el0_sync_handler+0xa8/0xb0 > > el0_sync+0x148/0x180 > > irq event stamp: 2082270 > > > > Now, si->lock locked before calling 'del_from_avail_list()' > > to make sure other thread see the si had been deleted > > and SWP_WRITEOK cleared together, will not reinsert again. > > > > This problem exists in versions after stable 5.10.y. > > > > Cc: stable@vger.kernel.org > > Tested-by: Yongchen Yin <wb-yyc939293@alibaba-inc.com> > > Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com> > > --- > > mm/swapfile.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/mm/swapfile.c b/mm/swapfile.c > > index 62ba2bf577d7..2c718f45745f 100644 > > --- a/mm/swapfile.c > > +++ b/mm/swapfile.c > > @@ -679,6 +679,7 @@ static void __del_from_avail_list(struct swap_info_struct *p) > > { > > int nid; > > > > + assert_spin_locked(&p->lock); > > for_each_node(nid) > > plist_del(&p->avail_lists[nid], &swap_avail_heads[nid]); > > } > > @@ -2434,8 +2435,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile) > > spin_unlock(&swap_lock); > > goto out_dput; > > } > > - del_from_avail_list(p); > > spin_lock(&p->lock); > > + del_from_avail_list(p); > > if (p->prio < 0) { > > struct swap_info_struct *si = p; > > int nid; > > -- > > 2.27.0 > >
On 2023/4/6 22:57, Aaron Lu wrote: > On Thu, Apr 06, 2023 at 10:04:16PM +0800, Aaron Lu wrote: >> On Tue, Apr 04, 2023 at 11:47:16PM +0800, Rongwei Wang wrote: >>> The si->lock must be held when deleting the si from >>> the available list. Otherwise, another thread can >>> re-add the si to the available list, which can lead >>> to memory corruption. The only place we have found >>> where this happens is in the swapoff path. This case >>> can be described as below: >>> >>> core 0 core 1 >>> swapoff >>> >>> del_from_avail_list(si) waiting >>> >>> try lock si->lock acquire swap_avail_lock >>> and re-add si into >>> swap_avail_head >> confused here. >> >> If del_from_avail_list(si) finished in swaoff path, then this si should >> not exist in any of the per-node avail list and core 1 should not be >> able to re-add it. > I think a possible sequence could be like this: > > cpuX cpuY > swapoff put_swap_folio() > > del_from_avail_list(si) > taken si->lock > spin_lock(&si->lock); > > swap_range_free() > was_full && SWP_WRITEOK -> re-add! > drop si->lock > > taken si->lock > proceed removing si > > End result: si left on avail_list after being swapped off. > > The problem is, in add_to_avail_list(), it has no idea this si is being > swapped off and taking si->lock then del_from_avail_list() could avoid > this problem, so I think this patch did the right thing but the > changelog about how this happened needs updating and after that: Hi Aaron That's my fault. Actually, I don't refers specifically to swap_range_free() path in commit, mainly because cpuY can stand all threads which is waiting swap_avail_lock, then to call add_to_avail_list(), not only swap_range_free(), e.g. maybe swapon(). > > Reviewed-by: Aaron Lu <aaron.lu@intel.com> Thanks for your time. -wrw > > Thanks, > Aaron >
diff --git a/mm/swapfile.c b/mm/swapfile.c index 62ba2bf577d7..2c718f45745f 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -679,6 +679,7 @@ static void __del_from_avail_list(struct swap_info_struct *p) { int nid; + assert_spin_locked(&p->lock); for_each_node(nid) plist_del(&p->avail_lists[nid], &swap_avail_heads[nid]); } @@ -2434,8 +2435,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile) spin_unlock(&swap_lock); goto out_dput; } - del_from_avail_list(p); spin_lock(&p->lock); + del_from_avail_list(p); if (p->prio < 0) { struct swap_info_struct *si = p; int nid;