Message ID | 20190129072154.63783-1-jiufei.xue@linux.alibaba.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | mm: fix sleeping function warning in alloc_swap_info | expand |
On 2019/1/29 15:21, Jiufei Xue wrote: > Trinity reports BUG: > > sleeping function called from invalid context at mm/vmalloc.c:1477 > in_atomic(): 1, irqs_disabled(): 0, pid: 12269, name: trinity-c1 > > [ 2748.573460] Call Trace: > [ 2748.575935] dump_stack+0x91/0xeb > [ 2748.578512] ___might_sleep+0x21c/0x250 > [ 2748.581090] remove_vm_area+0x1d/0x90 > [ 2748.583637] __vunmap+0x76/0x100 > [ 2748.586120] __se_sys_swapon+0xb9a/0x1220 > [ 2748.598973] do_syscall_64+0x60/0x210 > [ 2748.601439] entry_SYSCALL_64_after_hwframe+0x49/0xbe > > This is triggered by calling kvfree() inside spinlock() section in > function alloc_swap_info(). > Fix this by moving the kvfree() after spin_unlock(). The fix looks good to me. BTW, swap_info_struct's size has been reduced to its original size: 272 bytes by commit 66f71da9dd38("mm/swap: use nr_node_ids for avail_lists in swap_info_struct"). I didn't use back kzalloc/kfree in that commit since I don't see any any harm by keep using kvzalloc/kvfree, but now looks like they're causing some trouble. So what about using back kzalloc/kfree for swap_info_struct instead? Can save one local variable and using kvzalloc/kvfree for a struct that is 272 bytes doesn't really have any benefit. Thanks, Aaron > > Fixes: 873d7bcfd066 ("mm/swapfile.c: use kvzalloc for swap_info_struct allocation") > Cc: <stable@vger.kernel.org> > Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> > Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> > --- > mm/swapfile.c | 6 ++++-- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/mm/swapfile.c b/mm/swapfile.c > index dbac1d49469d..d26c9eac3d64 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -2810,7 +2810,7 @@ late_initcall(max_swapfiles_check); > > static struct swap_info_struct *alloc_swap_info(void) > { > - struct swap_info_struct *p; > + struct swap_info_struct *p, *tmp = NULL; > unsigned int type; > int i; > int size = sizeof(*p) + nr_node_ids * sizeof(struct plist_node); > @@ -2840,7 +2840,7 @@ static struct swap_info_struct *alloc_swap_info(void) > smp_wmb(); > nr_swapfiles++; > } else { > - kvfree(p); > + tmp = p; > p = swap_info[type]; > /* > * Do not memset this entry: a racing procfs swap_next() > @@ -2853,6 +2853,8 @@ static struct swap_info_struct *alloc_swap_info(void) > plist_node_init(&p->avail_lists[i], 0); > p->flags = SWP_USED; > spin_unlock(&swap_lock); > + kvfree(tmp); > + > spin_lock_init(&p->lock); > spin_lock_init(&p->cont_lock); > >
Hi, On 19/1/29 16:53, Aaron Lu wrote: > On 2019/1/29 15:21, Jiufei Xue wrote: >> Trinity reports BUG: >> >> sleeping function called from invalid context at mm/vmalloc.c:1477 >> in_atomic(): 1, irqs_disabled(): 0, pid: 12269, name: trinity-c1 >> >> [ 2748.573460] Call Trace: >> [ 2748.575935] dump_stack+0x91/0xeb >> [ 2748.578512] ___might_sleep+0x21c/0x250 >> [ 2748.581090] remove_vm_area+0x1d/0x90 >> [ 2748.583637] __vunmap+0x76/0x100 >> [ 2748.586120] __se_sys_swapon+0xb9a/0x1220 >> [ 2748.598973] do_syscall_64+0x60/0x210 >> [ 2748.601439] entry_SYSCALL_64_after_hwframe+0x49/0xbe >> >> This is triggered by calling kvfree() inside spinlock() section in >> function alloc_swap_info(). >> Fix this by moving the kvfree() after spin_unlock(). > > The fix looks good to me. > > BTW, swap_info_struct's size has been reduced to its original size: > 272 bytes by commit 66f71da9dd38("mm/swap: use nr_node_ids for > avail_lists in swap_info_struct"). I didn't use back kzalloc/kfree > in that commit since I don't see any any harm by keep using > kvzalloc/kvfree, but now looks like they're causing some trouble. > > So what about using back kzalloc/kfree for swap_info_struct instead? > Can save one local variable and using kvzalloc/kvfree for a struct > that is 272 bytes doesn't really have any benefit. > avail_lists in swap_info_struct is dynamic allocated. So if we use back kzalloc/kfree, how to deal with the case that nr_node_ids is big? Thanks, Joseph > Thanks, > Aaron > >> >> Fixes: 873d7bcfd066 ("mm/swapfile.c: use kvzalloc for swap_info_struct allocation") >> Cc: <stable@vger.kernel.org> >> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> >> Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> >> --- >> mm/swapfile.c | 6 ++++-- >> 1 file changed, 4 insertions(+), 2 deletions(-) >> >> diff --git a/mm/swapfile.c b/mm/swapfile.c >> index dbac1d49469d..d26c9eac3d64 100644 >> --- a/mm/swapfile.c >> +++ b/mm/swapfile.c >> @@ -2810,7 +2810,7 @@ late_initcall(max_swapfiles_check); >> >> static struct swap_info_struct *alloc_swap_info(void) >> { >> - struct swap_info_struct *p; >> + struct swap_info_struct *p, *tmp = NULL; >> unsigned int type; >> int i; >> int size = sizeof(*p) + nr_node_ids * sizeof(struct plist_node); >> @@ -2840,7 +2840,7 @@ static struct swap_info_struct *alloc_swap_info(void) >> smp_wmb(); >> nr_swapfiles++; >> } else { >> - kvfree(p); >> + tmp = p; >> p = swap_info[type]; >> /* >> * Do not memset this entry: a racing procfs swap_next() >> @@ -2853,6 +2853,8 @@ static struct swap_info_struct *alloc_swap_info(void) >> plist_node_init(&p->avail_lists[i], 0); >> p->flags = SWP_USED; >> spin_unlock(&swap_lock); >> + kvfree(tmp); >> + >> spin_lock_init(&p->lock); >> spin_lock_init(&p->cont_lock); >> >>
On Tue, Jan 29, 2019 at 06:43:53PM +0800, Joseph Qi wrote: > Hi, > > On 19/1/29 16:53, Aaron Lu wrote: > > On 2019/1/29 15:21, Jiufei Xue wrote: > >> Trinity reports BUG: > >> > >> sleeping function called from invalid context at mm/vmalloc.c:1477 > >> in_atomic(): 1, irqs_disabled(): 0, pid: 12269, name: trinity-c1 > >> > >> [ 2748.573460] Call Trace: > >> [ 2748.575935] dump_stack+0x91/0xeb > >> [ 2748.578512] ___might_sleep+0x21c/0x250 > >> [ 2748.581090] remove_vm_area+0x1d/0x90 > >> [ 2748.583637] __vunmap+0x76/0x100 > >> [ 2748.586120] __se_sys_swapon+0xb9a/0x1220 > >> [ 2748.598973] do_syscall_64+0x60/0x210 > >> [ 2748.601439] entry_SYSCALL_64_after_hwframe+0x49/0xbe > >> > >> This is triggered by calling kvfree() inside spinlock() section in > >> function alloc_swap_info(). > >> Fix this by moving the kvfree() after spin_unlock(). > > > > The fix looks good to me. > > > > BTW, swap_info_struct's size has been reduced to its original size: > > 272 bytes by commit 66f71da9dd38("mm/swap: use nr_node_ids for > > avail_lists in swap_info_struct"). I didn't use back kzalloc/kfree > > in that commit since I don't see any any harm by keep using > > kvzalloc/kvfree, but now looks like they're causing some trouble. > > > > So what about using back kzalloc/kfree for swap_info_struct instead? > > Can save one local variable and using kvzalloc/kvfree for a struct > > that is 272 bytes doesn't really have any benefit. > > > avail_lists in swap_info_struct is dynamic allocated. > So if we use back kzalloc/kfree, how to deal with the case that > nr_node_ids is big? Oh right, I missed that. Acked-by: Aaron Lu <aaron.lu@linux.alibaba.com> Thanks, Aaron > >> > >> Fixes: 873d7bcfd066 ("mm/swapfile.c: use kvzalloc for swap_info_struct allocation") > >> Cc: <stable@vger.kernel.org> > >> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> > >> Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> > >> --- > >> mm/swapfile.c | 6 ++++-- > >> 1 file changed, 4 insertions(+), 2 deletions(-) > >> > >> diff --git a/mm/swapfile.c b/mm/swapfile.c > >> index dbac1d49469d..d26c9eac3d64 100644 > >> --- a/mm/swapfile.c > >> +++ b/mm/swapfile.c > >> @@ -2810,7 +2810,7 @@ late_initcall(max_swapfiles_check); > >> > >> static struct swap_info_struct *alloc_swap_info(void) > >> { > >> - struct swap_info_struct *p; > >> + struct swap_info_struct *p, *tmp = NULL; > >> unsigned int type; > >> int i; > >> int size = sizeof(*p) + nr_node_ids * sizeof(struct plist_node); > >> @@ -2840,7 +2840,7 @@ static struct swap_info_struct *alloc_swap_info(void) > >> smp_wmb(); > >> nr_swapfiles++; > >> } else { > >> - kvfree(p); > >> + tmp = p; > >> p = swap_info[type]; > >> /* > >> * Do not memset this entry: a racing procfs swap_next() > >> @@ -2853,6 +2853,8 @@ static struct swap_info_struct *alloc_swap_info(void) > >> plist_node_init(&p->avail_lists[i], 0); > >> p->flags = SWP_USED; > >> spin_unlock(&swap_lock); > >> + kvfree(tmp); > >> + > >> spin_lock_init(&p->lock); > >> spin_lock_init(&p->cont_lock); > >> > >>
On 2019/01/29 16:21, Jiufei Xue wrote: > Trinity reports BUG: > > sleeping function called from invalid context at mm/vmalloc.c:1477 > in_atomic(): 1, irqs_disabled(): 0, pid: 12269, name: trinity-c1 > > [ 2748.573460] Call Trace: > [ 2748.575935] dump_stack+0x91/0xeb > [ 2748.578512] ___might_sleep+0x21c/0x250 > [ 2748.581090] remove_vm_area+0x1d/0x90 > [ 2748.583637] __vunmap+0x76/0x100 > [ 2748.586120] __se_sys_swapon+0xb9a/0x1220 > [ 2748.598973] do_syscall_64+0x60/0x210 > [ 2748.601439] entry_SYSCALL_64_after_hwframe+0x49/0xbe > > This is triggered by calling kvfree() inside spinlock() section in > function alloc_swap_info(). > Fix this by moving the kvfree() after spin_unlock(). > Excuse me? But isn't kvfree() safe to be called with spinlock held? There was no context explanation regarding kvfree() for 4.18.20. 4.19.18 says Context: Any context except NMI. and 4.20.5 says Context: Either preemptible task context or not-NMI interrupt. . There might be users who didn't notice this change.
On Tue, 29 Jan 2019 20:43:20 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote: > On 2019/01/29 16:21, Jiufei Xue wrote: > > Trinity reports BUG: > > > > sleeping function called from invalid context at mm/vmalloc.c:1477 > > in_atomic(): 1, irqs_disabled(): 0, pid: 12269, name: trinity-c1 > > > > [ 2748.573460] Call Trace: > > [ 2748.575935] dump_stack+0x91/0xeb > > [ 2748.578512] ___might_sleep+0x21c/0x250 > > [ 2748.581090] remove_vm_area+0x1d/0x90 > > [ 2748.583637] __vunmap+0x76/0x100 > > [ 2748.586120] __se_sys_swapon+0xb9a/0x1220 > > [ 2748.598973] do_syscall_64+0x60/0x210 > > [ 2748.601439] entry_SYSCALL_64_after_hwframe+0x49/0xbe > > > > This is triggered by calling kvfree() inside spinlock() section in > > function alloc_swap_info(). > > Fix this by moving the kvfree() after spin_unlock(). > > > > Excuse me? But isn't kvfree() safe to be called with spinlock held? Yes, I'm having trouble spotting where kvfree() can sleep. Perhaps it *used* to sleep on mutex_lock(vmap_purge_lock), but try_purge_vmap_area_lazy() is using mutex_trylock(). Confused. kvfree() darn well *shouldn't* sleep!
On 2019/01/30 4:13, Andrew Morton wrote: > On Tue, 29 Jan 2019 20:43:20 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote: > >> On 2019/01/29 16:21, Jiufei Xue wrote: >>> Trinity reports BUG: >>> >>> sleeping function called from invalid context at mm/vmalloc.c:1477 >>> in_atomic(): 1, irqs_disabled(): 0, pid: 12269, name: trinity-c1 >>> >>> [ 2748.573460] Call Trace: >>> [ 2748.575935] dump_stack+0x91/0xeb >>> [ 2748.578512] ___might_sleep+0x21c/0x250 >>> [ 2748.581090] remove_vm_area+0x1d/0x90 >>> [ 2748.583637] __vunmap+0x76/0x100 >>> [ 2748.586120] __se_sys_swapon+0xb9a/0x1220 >>> [ 2748.598973] do_syscall_64+0x60/0x210 >>> [ 2748.601439] entry_SYSCALL_64_after_hwframe+0x49/0xbe >>> >>> This is triggered by calling kvfree() inside spinlock() section in >>> function alloc_swap_info(). >>> Fix this by moving the kvfree() after spin_unlock(). >>> >> >> Excuse me? But isn't kvfree() safe to be called with spinlock held? > > Yes, I'm having trouble spotting where kvfree() can sleep. Perhaps it > *used* to sleep on mutex_lock(vmap_purge_lock), but > try_purge_vmap_area_lazy() is using mutex_trylock(). Confused. > > kvfree() darn well *shouldn't* sleep! > If I recall correctly, there was an attempt to allow vfree() to sleep but that attempt failed, and the change to allow vfree() to sleep was reverted. Thus, vfree() had been "Context: Any context except NMI.". If we want to allow vfree() to sleep, at least we need to test with kvmalloc() == vmalloc() (i.e. force kvmalloc()/kvfree() users to use vmalloc()/vfree() path). For now, reverting the "Context: Either preemptible task context or not-NMI interrupt." change will be needed for stable kernels.
On Tue, Jan 29, 2019 at 1:12 PM Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote: > > On 2019/01/30 4:13, Andrew Morton wrote: > > On Tue, 29 Jan 2019 20:43:20 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote: > > > >> On 2019/01/29 16:21, Jiufei Xue wrote: > >>> Trinity reports BUG: > >>> > >>> sleeping function called from invalid context at mm/vmalloc.c:1477 > >>> in_atomic(): 1, irqs_disabled(): 0, pid: 12269, name: trinity-c1 > >>> > >>> [ 2748.573460] Call Trace: > >>> [ 2748.575935] dump_stack+0x91/0xeb > >>> [ 2748.578512] ___might_sleep+0x21c/0x250 > >>> [ 2748.581090] remove_vm_area+0x1d/0x90 > >>> [ 2748.583637] __vunmap+0x76/0x100 > >>> [ 2748.586120] __se_sys_swapon+0xb9a/0x1220 > >>> [ 2748.598973] do_syscall_64+0x60/0x210 > >>> [ 2748.601439] entry_SYSCALL_64_after_hwframe+0x49/0xbe > >>> > >>> This is triggered by calling kvfree() inside spinlock() section in > >>> function alloc_swap_info(). > >>> Fix this by moving the kvfree() after spin_unlock(). > >>> > >> > >> Excuse me? But isn't kvfree() safe to be called with spinlock held? > > > > Yes, I'm having trouble spotting where kvfree() can sleep. Perhaps it > > *used* to sleep on mutex_lock(vmap_purge_lock), but > > try_purge_vmap_area_lazy() is using mutex_trylock(). Confused. > > > > kvfree() darn well *shouldn't* sleep! > > > > If I recall correctly, there was an attempt to allow vfree() to sleep > but that attempt failed, and the change to allow vfree() to sleep was > reverted. Thus, vfree() had been "Context: Any context except NMI.". > > If we want to allow vfree() to sleep, at least we need to test with > kvmalloc() == vmalloc() (i.e. force kvmalloc()/kvfree() users to use > vmalloc()/vfree() path). For now, reverting the > "Context: Either preemptible task context or not-NMI interrupt." change > will be needed for stable kernels. So, the comment for vfree "May sleep if called *not* from interrupt context." is wrong? Thanks, Yang >
Yang Shi wrote: > On Tue, Jan 29, 2019 at 1:12 PM Tetsuo Handa > <penguin-kernel@i-love.sakura.ne.jp> wrote: > > > > On 2019/01/30 4:13, Andrew Morton wrote: > > > On Tue, 29 Jan 2019 20:43:20 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote: > > > > > >> On 2019/01/29 16:21, Jiufei Xue wrote: > > >>> Trinity reports BUG: > > >>> > > >>> sleeping function called from invalid context at mm/vmalloc.c:1477 > > >>> in_atomic(): 1, irqs_disabled(): 0, pid: 12269, name: trinity-c1 > > >>> > > >>> [ 2748.573460] Call Trace: > > >>> [ 2748.575935] dump_stack+0x91/0xeb > > >>> [ 2748.578512] ___might_sleep+0x21c/0x250 > > >>> [ 2748.581090] remove_vm_area+0x1d/0x90 > > >>> [ 2748.583637] __vunmap+0x76/0x100 > > >>> [ 2748.586120] __se_sys_swapon+0xb9a/0x1220 > > >>> [ 2748.598973] do_syscall_64+0x60/0x210 > > >>> [ 2748.601439] entry_SYSCALL_64_after_hwframe+0x49/0xbe > > >>> > > >>> This is triggered by calling kvfree() inside spinlock() section in > > >>> function alloc_swap_info(). > > >>> Fix this by moving the kvfree() after spin_unlock(). > > >>> > > >> > > >> Excuse me? But isn't kvfree() safe to be called with spinlock held? > > > > > > Yes, I'm having trouble spotting where kvfree() can sleep. Perhaps it > > > *used* to sleep on mutex_lock(vmap_purge_lock), but > > > try_purge_vmap_area_lazy() is using mutex_trylock(). Confused. > > > > > > kvfree() darn well *shouldn't* sleep! > > > > > > > If I recall correctly, there was an attempt to allow vfree() to sleep > > but that attempt failed, and the change to allow vfree() to sleep was > > reverted. Thus, vfree() had been "Context: Any context except NMI.". That attempt was not reverted. Instead vfree_atomic() was added. > > > > If we want to allow vfree() to sleep, at least we need to test with > > kvmalloc() == vmalloc() (i.e. force kvmalloc()/kvfree() users to use > > vmalloc()/vfree() path). For now, reverting the > > "Context: Either preemptible task context or not-NMI interrupt." change > > will be needed for stable kernels. > > So, the comment for vfree "May sleep if called *not* from interrupt > context." is wrong? Commit bf22e37a641327e3 ("mm: add vfree_atomic()") says We are going to use sleeping lock for freeing vmap. However some vfree() users want to free memory from atomic (but not from interrupt) context. For this we add vfree_atomic() - deferred variation of vfree() which can be used in any atomic context (except NMIs). and commit 52414d3302577bb6 ("kvfree(): fix misleading comment") made - * Context: Any context except NMI. + * Context: Either preemptible task context or not-NMI interrupt. change. But I think that we converted kmalloc() to kvmalloc() without checking context of kvfree() callers. Therefore, I think that kvfree() needs to use vfree_atomic() rather than just saying "vfree() might sleep if called not in interrupt context."...
On Wed, 30 Jan 2019 09:42:06 +0900 Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote: > > > > > > If we want to allow vfree() to sleep, at least we need to test with > > > kvmalloc() == vmalloc() (i.e. force kvmalloc()/kvfree() users to use > > > vmalloc()/vfree() path). For now, reverting the > > > "Context: Either preemptible task context or not-NMI interrupt." change > > > will be needed for stable kernels. > > > > So, the comment for vfree "May sleep if called *not* from interrupt > > context." is wrong? > > Commit bf22e37a641327e3 ("mm: add vfree_atomic()") says > > We are going to use sleeping lock for freeing vmap. However some > vfree() users want to free memory from atomic (but not from interrupt) > context. For this we add vfree_atomic() - deferred variation of vfree() > which can be used in any atomic context (except NMIs). > > and commit 52414d3302577bb6 ("kvfree(): fix misleading comment") made > > - * Context: Any context except NMI. > + * Context: Either preemptible task context or not-NMI interrupt. > > change. But I think that we converted kmalloc() to kvmalloc() without checking > context of kvfree() callers. Therefore, I think that kvfree() needs to use > vfree_atomic() rather than just saying "vfree() might sleep if called not in > interrupt context."... Whereabouts in the vfree() path can the kernel sleep?
On Tue, Jan 29, 2019 at 5:01 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > > > > - * Context: Any context except NMI. > > + * Context: Either preemptible task context or not-NMI interrupt. > > Whereabouts in the vfree() path can the kernel sleep? Note that it's not necessarily about *sleeping*. One thing that vfree() really fundamentally should do is to flush TLB's. And you must not do a cross-TLB flush with interrupts disabled. NOTE! Right now, I think we do lazy TLB flushing, so the flush actually is delayed until the vmalloc() when the address rolls around in the vmalloc address space. But there really are very real and obvious reasons why we might want to do it at vfree time. So I'd honestly be a whole lot happier with vmalloc/vfree being process context only. Or at least with with interrupts enabled (so swirq/BH context would be fine, but an actual interrupt not so). Again, this is not about sleeping. But the end result is almost the same: we really should strive to not do vfree() in interrupt context. Linus
On Tue, Jan 29, 2019 at 5:11 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Again, this is not about sleeping. But the end result is almost the > same: we really should strive to not do vfree() in interrupt context. Note that we currently test the wrong thing for this: we actually check "in_interrupt()" for the deferred case, which certainly works in practice, and protects from deadlocks (we need vmap_area_lock for the free-area handling) But it doesn't actually end up testing the "oops, interrupts are disabled in process context" issue. The "might_sleep()" check _does_ check that, iirc. Which - as mentioned - is fine because we currently don't actually do the TLB flush synchronously, but it's worth noting again. "vfree()" really is a *lot* different from "kfree()". It's unsafe in all kinds of special ways, and the locking difference is just part of it. So whatever might_sleep() has found might be a potential real issue at some point... Linus
Andrew Morton wrote: > > change. But I think that we converted kmalloc() to kvmalloc() without checking > > context of kvfree() callers. Therefore, I think that kvfree() needs to use > > vfree_atomic() rather than just saying "vfree() might sleep if called not in > > interrupt context."... > > Whereabouts in the vfree() path can the kernel sleep? Indeed. Although __vunmap() must not be called from interrupt context because mutex_trylock()/mutex_unlock() from try_purge_vmap_area_lazy() from free_vmap_area_noflush() from free_unmap_vmap_area() from remove_vm_area() from __vunmap() cannot be called from interrupt context, it seems that there is no location that does sleeping operation. Linus Torvalds wrote: > Which - as mentioned - is fine because we currently don't actually do > the TLB flush synchronously, but it's worth noting again. "vfree()" > really is a *lot* different from "kfree()". It's unsafe in all kinds > of special ways, and the locking difference is just part of it. > > So whatever might_sleep() has found might be a potential real issue at > some point... Then, do we automatically defer vfree() to mm_percpu_wq context?
On Tue, Jan 29, 2019 at 6:54 PM Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote: > > Then, do we automatically defer vfree() to mm_percpu_wq context? We might do that, and say "if you call vfree with interrupts disabled, it gets deferred". That said, the deferred case should generally not be a common case either. It has real issues, one of which is simply that particularly on 32-bit architectures we can run out of vmalloc space even normally, and if there are loads that do a lot of allocation and then deferred frees, that problem could become really bad. So I'd almost be happier having a warning if we end up doing the TLB flush and defer. At least to find *what* people do. And I do wonder if we should just always warn, and have that "might_sleep()", and simply say "if you do vfree from interrupts or with interrupts disabled, you really should be aware of these kinds of issues, and you really should *show* that you are aware by using vfree_atomic()". Linus
On Wed, 30 Jan 2019 09:18:20 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, Jan 29, 2019 at 6:54 PM Tetsuo Handa > <penguin-kernel@i-love.sakura.ne.jp> wrote: > > > > Then, do we automatically defer vfree() to mm_percpu_wq context? > > We might do that, and say "if you call vfree with interrupts disabled, > it gets deferred". > > That said, the deferred case should generally not be a common case > either. It has real issues, one of which is simply that particularly > on 32-bit architectures we can run out of vmalloc space even normally, > and if there are loads that do a lot of allocation and then deferred > frees, that problem could become really bad. > > So I'd almost be happier having a warning if we end up doing the TLB > flush and defer. At least to find *what* people do. > > And I do wonder if we should just always warn, and have that > "might_sleep()", and simply say "if you do vfree from interrupts or > with interrupts disabled, you really should be aware of these kinds of > issues, and you really should *show* that you are aware by using > vfree_atomic()". > Agree. if (irqs_disabled()) {warn_once; punt_to_workqueue} is the way to go here. vfree() should be callable from spinlocked code and might_sleep() is an inappropriate check.
On Tue, Jan 29, 2019 at 05:01:50PM -0800, Andrew Morton wrote: > On Wed, 30 Jan 2019 09:42:06 +0900 Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote: > > > > > > > > > If we want to allow vfree() to sleep, at least we need to test with > > > > kvmalloc() == vmalloc() (i.e. force kvmalloc()/kvfree() users to use > > > > vmalloc()/vfree() path). For now, reverting the > > > > "Context: Either preemptible task context or not-NMI interrupt." change > > > > will be needed for stable kernels. > > > > > > So, the comment for vfree "May sleep if called *not* from interrupt > > > context." is wrong? > > > > Commit bf22e37a641327e3 ("mm: add vfree_atomic()") says > > > > We are going to use sleeping lock for freeing vmap. However some > > vfree() users want to free memory from atomic (but not from interrupt) > > context. For this we add vfree_atomic() - deferred variation of vfree() > > which can be used in any atomic context (except NMIs). > > > > and commit 52414d3302577bb6 ("kvfree(): fix misleading comment") made > > > > - * Context: Any context except NMI. > > + * Context: Either preemptible task context or not-NMI interrupt. > > > > change. But I think that we converted kmalloc() to kvmalloc() without checking > > context of kvfree() callers. Therefore, I think that kvfree() needs to use > > vfree_atomic() rather than just saying "vfree() might sleep if called not in > > interrupt context."... > > Whereabouts in the vfree() path can the kernel sleep? (Sorry for the late reply.) Adding Andrey Ryabinin, author of commit 52414d3302577bb6 ("kvfree(): fix misleading comment"), maybe Andrey remembers where vfree() can sleep. In the meantime, does "cond_resched_lock(&vmap_area_lock);" in __purge_vmap_area_lazy() count as a sleep point? __purge_vmap_area_lazy() can be called if mutex_trylock on vmap_purge_lock succeed.
On 3/7/19 5:43 PM, Aaron Lu wrote: > On Tue, Jan 29, 2019 at 05:01:50PM -0800, Andrew Morton wrote: >> On Wed, 30 Jan 2019 09:42:06 +0900 Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote: >> >>>>> >>>>> If we want to allow vfree() to sleep, at least we need to test with >>>>> kvmalloc() == vmalloc() (i.e. force kvmalloc()/kvfree() users to use >>>>> vmalloc()/vfree() path). For now, reverting the >>>>> "Context: Either preemptible task context or not-NMI interrupt." change >>>>> will be needed for stable kernels. >>>> >>>> So, the comment for vfree "May sleep if called *not* from interrupt >>>> context." is wrong? >>> >>> Commit bf22e37a641327e3 ("mm: add vfree_atomic()") says >>> >>> We are going to use sleeping lock for freeing vmap. However some >>> vfree() users want to free memory from atomic (but not from interrupt) >>> context. For this we add vfree_atomic() - deferred variation of vfree() >>> which can be used in any atomic context (except NMIs). >>> >>> and commit 52414d3302577bb6 ("kvfree(): fix misleading comment") made >>> >>> - * Context: Any context except NMI. >>> + * Context: Either preemptible task context or not-NMI interrupt. >>> >>> change. But I think that we converted kmalloc() to kvmalloc() without checking >>> context of kvfree() callers. Therefore, I think that kvfree() needs to use >>> vfree_atomic() rather than just saying "vfree() might sleep if called not in >>> interrupt context."... >> >> Whereabouts in the vfree() path can the kernel sleep? > > (Sorry for the late reply.) > > Adding Andrey Ryabinin, author of commit 52414d3302577bb6 > ("kvfree(): fix misleading comment"), maybe Andrey remembers > where vfree() can sleep. > > In the meantime, does "cond_resched_lock(&vmap_area_lock);" in > __purge_vmap_area_lazy() count as a sleep point? Yes, this is the place (the only one) where vfree() can sleep.
On Thu, Mar 07, 2019 at 05:47:13PM +0300, Andrey Ryabinin wrote: > > > On 3/7/19 5:43 PM, Aaron Lu wrote: > > On Tue, Jan 29, 2019 at 05:01:50PM -0800, Andrew Morton wrote: > >> On Wed, 30 Jan 2019 09:42:06 +0900 Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote: > >> > >>>>> > >>>>> If we want to allow vfree() to sleep, at least we need to test with > >>>>> kvmalloc() == vmalloc() (i.e. force kvmalloc()/kvfree() users to use > >>>>> vmalloc()/vfree() path). For now, reverting the > >>>>> "Context: Either preemptible task context or not-NMI interrupt." change > >>>>> will be needed for stable kernels. > >>>> > >>>> So, the comment for vfree "May sleep if called *not* from interrupt > >>>> context." is wrong? > >>> > >>> Commit bf22e37a641327e3 ("mm: add vfree_atomic()") says > >>> > >>> We are going to use sleeping lock for freeing vmap. However some > >>> vfree() users want to free memory from atomic (but not from interrupt) > >>> context. For this we add vfree_atomic() - deferred variation of vfree() > >>> which can be used in any atomic context (except NMIs). > >>> > >>> and commit 52414d3302577bb6 ("kvfree(): fix misleading comment") made > >>> > >>> - * Context: Any context except NMI. > >>> + * Context: Either preemptible task context or not-NMI interrupt. > >>> > >>> change. But I think that we converted kmalloc() to kvmalloc() without checking > >>> context of kvfree() callers. Therefore, I think that kvfree() needs to use > >>> vfree_atomic() rather than just saying "vfree() might sleep if called not in > >>> interrupt context."... > >> > >> Whereabouts in the vfree() path can the kernel sleep? > > > > (Sorry for the late reply.) > > > > Adding Andrey Ryabinin, author of commit 52414d3302577bb6 > > ("kvfree(): fix misleading comment"), maybe Andrey remembers > > where vfree() can sleep. > > > > In the meantime, does "cond_resched_lock(&vmap_area_lock);" in > > __purge_vmap_area_lazy() count as a sleep point? > > Yes, this is the place (the only one) where vfree() can sleep. OK, thanks for the quick confirm. So what about this: use __vfree_deferred() when: - in_interrupt(), because we can't use mutex_trylock() as pointed out by Tetsuo; - in_atomic(), because cond_resched_lock(); - irqs_disabled(), as smp_call_function_many() will deadlock. An untested diff to show the idea(not sure if warn is needed): diff --git a/mm/vmalloc.c b/mm/vmalloc.c index e86ba6e74b50..28d200f054b0 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -1578,7 +1578,7 @@ void vfree_atomic(const void *addr) static void __vfree(const void *addr) { - if (unlikely(in_interrupt())) + if (unlikely(in_interrupt() || in_atomic() || irqs_disabled())) __vfree_deferred(addr); else __vunmap(addr, 1); @@ -1606,8 +1606,6 @@ void vfree(const void *addr) kmemleak_free(addr); - might_sleep_if(!in_interrupt()); - if (!addr) return;
On 3/7/19 6:24 PM, Aaron Lu wrote: > On Thu, Mar 07, 2019 at 05:47:13PM +0300, Andrey Ryabinin wrote: >> >> >> On 3/7/19 5:43 PM, Aaron Lu wrote: >>> On Tue, Jan 29, 2019 at 05:01:50PM -0800, Andrew Morton wrote: >>>> On Wed, 30 Jan 2019 09:42:06 +0900 Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote: >>>> >>>>>>> >>>>>>> If we want to allow vfree() to sleep, at least we need to test with >>>>>>> kvmalloc() == vmalloc() (i.e. force kvmalloc()/kvfree() users to use >>>>>>> vmalloc()/vfree() path). For now, reverting the >>>>>>> "Context: Either preemptible task context or not-NMI interrupt." change >>>>>>> will be needed for stable kernels. >>>>>> >>>>>> So, the comment for vfree "May sleep if called *not* from interrupt >>>>>> context." is wrong? >>>>> >>>>> Commit bf22e37a641327e3 ("mm: add vfree_atomic()") says >>>>> >>>>> We are going to use sleeping lock for freeing vmap. However some >>>>> vfree() users want to free memory from atomic (but not from interrupt) >>>>> context. For this we add vfree_atomic() - deferred variation of vfree() >>>>> which can be used in any atomic context (except NMIs). >>>>> >>>>> and commit 52414d3302577bb6 ("kvfree(): fix misleading comment") made >>>>> >>>>> - * Context: Any context except NMI. >>>>> + * Context: Either preemptible task context or not-NMI interrupt. >>>>> >>>>> change. But I think that we converted kmalloc() to kvmalloc() without checking >>>>> context of kvfree() callers. Therefore, I think that kvfree() needs to use >>>>> vfree_atomic() rather than just saying "vfree() might sleep if called not in >>>>> interrupt context."... >>>> >>>> Whereabouts in the vfree() path can the kernel sleep? >>> >>> (Sorry for the late reply.) >>> >>> Adding Andrey Ryabinin, author of commit 52414d3302577bb6 >>> ("kvfree(): fix misleading comment"), maybe Andrey remembers >>> where vfree() can sleep. >>> >>> In the meantime, does "cond_resched_lock(&vmap_area_lock);" in >>> __purge_vmap_area_lazy() count as a sleep point? >> >> Yes, this is the place (the only one) where vfree() can sleep. > > OK, thanks for the quick confirm. > > So what about this: use __vfree_deferred() when: > - in_interrupt(), because we can't use mutex_trylock() as pointed out > by Tetsuo; > - in_atomic(), because cond_resched_lock(); > - irqs_disabled(), as smp_call_function_many() will deadlock. > > An untested diff to show the idea(not sure if warn is needed): > It was discussed before. You're not the first one suggesting something like this. There is the comment near in_atomic() explaining well why and when your patch won't work. The easiest way of making vfree() to be safe in atomic contexts is this patch: http://lkml.kernel.org/r/20170330102719.13119-1-aryabinin@virtuozzo.com But the final decision at that time was to fix caller so the call vfree from sleepable context instead: http://lkml.kernel.org/r/20170330152229.f2108e718114ed77acae7405@linux-foundation.org > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > index e86ba6e74b50..28d200f054b0 100644 > --- a/mm/vmalloc.c > +++ b/mm/vmalloc.c > @@ -1578,7 +1578,7 @@ void vfree_atomic(const void *addr) > > static void __vfree(const void *addr) > { > - if (unlikely(in_interrupt())) > + if (unlikely(in_interrupt() || in_atomic() || irqs_disabled())) > __vfree_deferred(addr); > else > __vunmap(addr, 1); > @@ -1606,8 +1606,6 @@ void vfree(const void *addr) > > kmemleak_free(addr); > > - might_sleep_if(!in_interrupt()); > - > if (!addr) > return; > >
On 2019/3/8 0:33, Andrey Ryabinin wrote: > > > On 3/7/19 6:24 PM, Aaron Lu wrote: >> On Thu, Mar 07, 2019 at 05:47:13PM +0300, Andrey Ryabinin wrote: >>> >>> >>> On 3/7/19 5:43 PM, Aaron Lu wrote: >>>> On Tue, Jan 29, 2019 at 05:01:50PM -0800, Andrew Morton wrote: >>>>> On Wed, 30 Jan 2019 09:42:06 +0900 Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote: >>>>> >>>>>>>> >>>>>>>> If we want to allow vfree() to sleep, at least we need to test with >>>>>>>> kvmalloc() == vmalloc() (i.e. force kvmalloc()/kvfree() users to use >>>>>>>> vmalloc()/vfree() path). For now, reverting the >>>>>>>> "Context: Either preemptible task context or not-NMI interrupt." change >>>>>>>> will be needed for stable kernels. >>>>>>> >>>>>>> So, the comment for vfree "May sleep if called *not* from interrupt >>>>>>> context." is wrong? >>>>>> >>>>>> Commit bf22e37a641327e3 ("mm: add vfree_atomic()") says >>>>>> >>>>>> We are going to use sleeping lock for freeing vmap. However some >>>>>> vfree() users want to free memory from atomic (but not from interrupt) >>>>>> context. For this we add vfree_atomic() - deferred variation of vfree() >>>>>> which can be used in any atomic context (except NMIs). >>>>>> >>>>>> and commit 52414d3302577bb6 ("kvfree(): fix misleading comment") made >>>>>> >>>>>> - * Context: Any context except NMI. >>>>>> + * Context: Either preemptible task context or not-NMI interrupt. >>>>>> >>>>>> change. But I think that we converted kmalloc() to kvmalloc() without checking >>>>>> context of kvfree() callers. Therefore, I think that kvfree() needs to use >>>>>> vfree_atomic() rather than just saying "vfree() might sleep if called not in >>>>>> interrupt context."... >>>>> >>>>> Whereabouts in the vfree() path can the kernel sleep? >>>> >>>> (Sorry for the late reply.) >>>> >>>> Adding Andrey Ryabinin, author of commit 52414d3302577bb6 >>>> ("kvfree(): fix misleading comment"), maybe Andrey remembers >>>> where vfree() can sleep. >>>> >>>> In the meantime, does "cond_resched_lock(&vmap_area_lock);" in >>>> __purge_vmap_area_lazy() count as a sleep point? >>> >>> Yes, this is the place (the only one) where vfree() can sleep. >> >> OK, thanks for the quick confirm. >> >> So what about this: use __vfree_deferred() when: >> - in_interrupt(), because we can't use mutex_trylock() as pointed out >> by Tetsuo; >> - in_atomic(), because cond_resched_lock(); >> - irqs_disabled(), as smp_call_function_many() will deadlock. >> >> An untested diff to show the idea(not sure if warn is needed): >> > > It was discussed before. You're not the first one suggesting something like this. > There is the comment near in_atomic() explaining well why and when your patch won't work. Thanks for the info. > The easiest way of making vfree() to be safe in atomic contexts is this patch: > http://lkml.kernel.org/r/20170330102719.13119-1-aryabinin@virtuozzo.com > > But the final decision at that time was to fix caller so the call vfree from sleepable context instead: > http://lkml.kernel.org/r/20170330152229.f2108e718114ed77acae7405@linux-foundation.org OK, if that is the final decision, then I think Jiufei's patch that moves kvfree() out of the locked region is the right thing to do for this issue here.
Hi Andrew, >> It was discussed before. You're not the first one suggesting something like this. >> There is the comment near in_atomic() explaining well why and when your patch won't work. > Thanks for the info. > >> The easiest way of making vfree() to be safe in atomic contexts is this patch: >> http://lkml.kernel.org/r/20170330102719.13119-1-aryabinin@virtuozzo.com >> >> But the final decision at that time was to fix caller so the call vfree from sleepable context instead: >> http://lkml.kernel.org/r/20170330152229.f2108e718114ed77acae7405@linux-foundation.org > OK, if that is the final decision, then I think Jiufei's patch that > moves kvfree() out of the locked region is the right thing to do for > this issue here. > Is that the final decision we have made? Could you please look into my patch again and give the decision? Thanks, Jiufei
diff --git a/mm/swapfile.c b/mm/swapfile.c index dbac1d49469d..d26c9eac3d64 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -2810,7 +2810,7 @@ late_initcall(max_swapfiles_check); static struct swap_info_struct *alloc_swap_info(void) { - struct swap_info_struct *p; + struct swap_info_struct *p, *tmp = NULL; unsigned int type; int i; int size = sizeof(*p) + nr_node_ids * sizeof(struct plist_node); @@ -2840,7 +2840,7 @@ static struct swap_info_struct *alloc_swap_info(void) smp_wmb(); nr_swapfiles++; } else { - kvfree(p); + tmp = p; p = swap_info[type]; /* * Do not memset this entry: a racing procfs swap_next() @@ -2853,6 +2853,8 @@ static struct swap_info_struct *alloc_swap_info(void) plist_node_init(&p->avail_lists[i], 0); p->flags = SWP_USED; spin_unlock(&swap_lock); + kvfree(tmp); + spin_lock_init(&p->lock); spin_lock_init(&p->cont_lock);