diff mbox series

mm: fix sleeping function warning in alloc_swap_info

Message ID 20190129072154.63783-1-jiufei.xue@linux.alibaba.com (mailing list archive)
State New, archived
Headers show
Series mm: fix sleeping function warning in alloc_swap_info | expand

Commit Message

Jiufei Xue Jan. 29, 2019, 7:21 a.m. UTC
Trinity reports BUG:

sleeping function called from invalid context at mm/vmalloc.c:1477
in_atomic(): 1, irqs_disabled(): 0, pid: 12269, name: trinity-c1

[ 2748.573460] Call Trace:
[ 2748.575935]  dump_stack+0x91/0xeb
[ 2748.578512]  ___might_sleep+0x21c/0x250
[ 2748.581090]  remove_vm_area+0x1d/0x90
[ 2748.583637]  __vunmap+0x76/0x100
[ 2748.586120]  __se_sys_swapon+0xb9a/0x1220
[ 2748.598973]  do_syscall_64+0x60/0x210
[ 2748.601439]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

This is triggered by calling kvfree() inside spinlock() section in
function alloc_swap_info().
Fix this by moving the kvfree() after spin_unlock().

Fixes: 873d7bcfd066 ("mm/swapfile.c: use kvzalloc for swap_info_struct allocation")
Cc: <stable@vger.kernel.org>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
---
 mm/swapfile.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Comments

Aaron Lu Jan. 29, 2019, 8:53 a.m. UTC | #1
On 2019/1/29 15:21, Jiufei Xue wrote:
> Trinity reports BUG:
> 
> sleeping function called from invalid context at mm/vmalloc.c:1477
> in_atomic(): 1, irqs_disabled(): 0, pid: 12269, name: trinity-c1
> 
> [ 2748.573460] Call Trace:
> [ 2748.575935]  dump_stack+0x91/0xeb
> [ 2748.578512]  ___might_sleep+0x21c/0x250
> [ 2748.581090]  remove_vm_area+0x1d/0x90
> [ 2748.583637]  __vunmap+0x76/0x100
> [ 2748.586120]  __se_sys_swapon+0xb9a/0x1220
> [ 2748.598973]  do_syscall_64+0x60/0x210
> [ 2748.601439]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> This is triggered by calling kvfree() inside spinlock() section in
> function alloc_swap_info().
> Fix this by moving the kvfree() after spin_unlock().

The fix looks good to me.

BTW, swap_info_struct's size has been reduced to its original size:
272 bytes by commit 66f71da9dd38("mm/swap: use nr_node_ids for
avail_lists in swap_info_struct"). I didn't use back kzalloc/kfree
in that commit since I don't see any any harm by keep using
kvzalloc/kvfree, but now looks like they're causing some trouble.

So what about using back kzalloc/kfree for swap_info_struct instead?
Can save one local variable and using kvzalloc/kvfree for a struct
that is 272 bytes doesn't really have any benefit.

Thanks,
Aaron

> 
> Fixes: 873d7bcfd066 ("mm/swapfile.c: use kvzalloc for swap_info_struct allocation")
> Cc: <stable@vger.kernel.org>
> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
> Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
> ---
>  mm/swapfile.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index dbac1d49469d..d26c9eac3d64 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -2810,7 +2810,7 @@ late_initcall(max_swapfiles_check);
>  
>  static struct swap_info_struct *alloc_swap_info(void)
>  {
> -	struct swap_info_struct *p;
> +	struct swap_info_struct *p, *tmp = NULL;
>  	unsigned int type;
>  	int i;
>  	int size = sizeof(*p) + nr_node_ids * sizeof(struct plist_node);
> @@ -2840,7 +2840,7 @@ static struct swap_info_struct *alloc_swap_info(void)
>  		smp_wmb();
>  		nr_swapfiles++;
>  	} else {
> -		kvfree(p);
> +		tmp = p;
>  		p = swap_info[type];
>  		/*
>  		 * Do not memset this entry: a racing procfs swap_next()
> @@ -2853,6 +2853,8 @@ static struct swap_info_struct *alloc_swap_info(void)
>  		plist_node_init(&p->avail_lists[i], 0);
>  	p->flags = SWP_USED;
>  	spin_unlock(&swap_lock);
> +	kvfree(tmp);
> +
>  	spin_lock_init(&p->lock);
>  	spin_lock_init(&p->cont_lock);
>  
>
Joseph Qi Jan. 29, 2019, 10:43 a.m. UTC | #2
Hi,

On 19/1/29 16:53, Aaron Lu wrote:
> On 2019/1/29 15:21, Jiufei Xue wrote:
>> Trinity reports BUG:
>>
>> sleeping function called from invalid context at mm/vmalloc.c:1477
>> in_atomic(): 1, irqs_disabled(): 0, pid: 12269, name: trinity-c1
>>
>> [ 2748.573460] Call Trace:
>> [ 2748.575935]  dump_stack+0x91/0xeb
>> [ 2748.578512]  ___might_sleep+0x21c/0x250
>> [ 2748.581090]  remove_vm_area+0x1d/0x90
>> [ 2748.583637]  __vunmap+0x76/0x100
>> [ 2748.586120]  __se_sys_swapon+0xb9a/0x1220
>> [ 2748.598973]  do_syscall_64+0x60/0x210
>> [ 2748.601439]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>
>> This is triggered by calling kvfree() inside spinlock() section in
>> function alloc_swap_info().
>> Fix this by moving the kvfree() after spin_unlock().
> 
> The fix looks good to me.
> 
> BTW, swap_info_struct's size has been reduced to its original size:
> 272 bytes by commit 66f71da9dd38("mm/swap: use nr_node_ids for
> avail_lists in swap_info_struct"). I didn't use back kzalloc/kfree
> in that commit since I don't see any any harm by keep using
> kvzalloc/kvfree, but now looks like they're causing some trouble.
> 
> So what about using back kzalloc/kfree for swap_info_struct instead?
> Can save one local variable and using kvzalloc/kvfree for a struct
> that is 272 bytes doesn't really have any benefit.
> 
avail_lists in swap_info_struct is dynamic allocated.
So if we use back kzalloc/kfree, how to deal with the case that
nr_node_ids is big?

Thanks,
Joseph

> Thanks,
> Aaron
> 
>>
>> Fixes: 873d7bcfd066 ("mm/swapfile.c: use kvzalloc for swap_info_struct allocation")
>> Cc: <stable@vger.kernel.org>
>> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
>> Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
>> ---
>>  mm/swapfile.c | 6 ++++--
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index dbac1d49469d..d26c9eac3d64 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -2810,7 +2810,7 @@ late_initcall(max_swapfiles_check);
>>  
>>  static struct swap_info_struct *alloc_swap_info(void)
>>  {
>> -	struct swap_info_struct *p;
>> +	struct swap_info_struct *p, *tmp = NULL;
>>  	unsigned int type;
>>  	int i;
>>  	int size = sizeof(*p) + nr_node_ids * sizeof(struct plist_node);
>> @@ -2840,7 +2840,7 @@ static struct swap_info_struct *alloc_swap_info(void)
>>  		smp_wmb();
>>  		nr_swapfiles++;
>>  	} else {
>> -		kvfree(p);
>> +		tmp = p;
>>  		p = swap_info[type];
>>  		/*
>>  		 * Do not memset this entry: a racing procfs swap_next()
>> @@ -2853,6 +2853,8 @@ static struct swap_info_struct *alloc_swap_info(void)
>>  		plist_node_init(&p->avail_lists[i], 0);
>>  	p->flags = SWP_USED;
>>  	spin_unlock(&swap_lock);
>> +	kvfree(tmp);
>> +
>>  	spin_lock_init(&p->lock);
>>  	spin_lock_init(&p->cont_lock);
>>  
>>
Aaron Lu Jan. 29, 2019, 11:19 a.m. UTC | #3
On Tue, Jan 29, 2019 at 06:43:53PM +0800, Joseph Qi wrote:
> Hi,
> 
> On 19/1/29 16:53, Aaron Lu wrote:
> > On 2019/1/29 15:21, Jiufei Xue wrote:
> >> Trinity reports BUG:
> >>
> >> sleeping function called from invalid context at mm/vmalloc.c:1477
> >> in_atomic(): 1, irqs_disabled(): 0, pid: 12269, name: trinity-c1
> >>
> >> [ 2748.573460] Call Trace:
> >> [ 2748.575935]  dump_stack+0x91/0xeb
> >> [ 2748.578512]  ___might_sleep+0x21c/0x250
> >> [ 2748.581090]  remove_vm_area+0x1d/0x90
> >> [ 2748.583637]  __vunmap+0x76/0x100
> >> [ 2748.586120]  __se_sys_swapon+0xb9a/0x1220
> >> [ 2748.598973]  do_syscall_64+0x60/0x210
> >> [ 2748.601439]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >>
> >> This is triggered by calling kvfree() inside spinlock() section in
> >> function alloc_swap_info().
> >> Fix this by moving the kvfree() after spin_unlock().
> > 
> > The fix looks good to me.
> > 
> > BTW, swap_info_struct's size has been reduced to its original size:
> > 272 bytes by commit 66f71da9dd38("mm/swap: use nr_node_ids for
> > avail_lists in swap_info_struct"). I didn't use back kzalloc/kfree
> > in that commit since I don't see any any harm by keep using
> > kvzalloc/kvfree, but now looks like they're causing some trouble.
> > 
> > So what about using back kzalloc/kfree for swap_info_struct instead?
> > Can save one local variable and using kvzalloc/kvfree for a struct
> > that is 272 bytes doesn't really have any benefit.
> > 
> avail_lists in swap_info_struct is dynamic allocated.
> So if we use back kzalloc/kfree, how to deal with the case that
> nr_node_ids is big?

Oh right, I missed that.

Acked-by: Aaron Lu <aaron.lu@linux.alibaba.com>
 
Thanks,
Aaron
 
> >>
> >> Fixes: 873d7bcfd066 ("mm/swapfile.c: use kvzalloc for swap_info_struct allocation")
> >> Cc: <stable@vger.kernel.org>
> >> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
> >> Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
> >> ---
> >>  mm/swapfile.c | 6 ++++--
> >>  1 file changed, 4 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> index dbac1d49469d..d26c9eac3d64 100644
> >> --- a/mm/swapfile.c
> >> +++ b/mm/swapfile.c
> >> @@ -2810,7 +2810,7 @@ late_initcall(max_swapfiles_check);
> >>  
> >>  static struct swap_info_struct *alloc_swap_info(void)
> >>  {
> >> -	struct swap_info_struct *p;
> >> +	struct swap_info_struct *p, *tmp = NULL;
> >>  	unsigned int type;
> >>  	int i;
> >>  	int size = sizeof(*p) + nr_node_ids * sizeof(struct plist_node);
> >> @@ -2840,7 +2840,7 @@ static struct swap_info_struct *alloc_swap_info(void)
> >>  		smp_wmb();
> >>  		nr_swapfiles++;
> >>  	} else {
> >> -		kvfree(p);
> >> +		tmp = p;
> >>  		p = swap_info[type];
> >>  		/*
> >>  		 * Do not memset this entry: a racing procfs swap_next()
> >> @@ -2853,6 +2853,8 @@ static struct swap_info_struct *alloc_swap_info(void)
> >>  		plist_node_init(&p->avail_lists[i], 0);
> >>  	p->flags = SWP_USED;
> >>  	spin_unlock(&swap_lock);
> >> +	kvfree(tmp);
> >> +
> >>  	spin_lock_init(&p->lock);
> >>  	spin_lock_init(&p->cont_lock);
> >>  
> >>
Tetsuo Handa Jan. 29, 2019, 11:43 a.m. UTC | #4
On 2019/01/29 16:21, Jiufei Xue wrote:
> Trinity reports BUG:
> 
> sleeping function called from invalid context at mm/vmalloc.c:1477
> in_atomic(): 1, irqs_disabled(): 0, pid: 12269, name: trinity-c1
> 
> [ 2748.573460] Call Trace:
> [ 2748.575935]  dump_stack+0x91/0xeb
> [ 2748.578512]  ___might_sleep+0x21c/0x250
> [ 2748.581090]  remove_vm_area+0x1d/0x90
> [ 2748.583637]  __vunmap+0x76/0x100
> [ 2748.586120]  __se_sys_swapon+0xb9a/0x1220
> [ 2748.598973]  do_syscall_64+0x60/0x210
> [ 2748.601439]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> This is triggered by calling kvfree() inside spinlock() section in
> function alloc_swap_info().
> Fix this by moving the kvfree() after spin_unlock().
> 

Excuse me? But isn't kvfree() safe to be called with spinlock held?

There was no context explanation regarding kvfree() for 4.18.20.
4.19.18 says

  Context: Any context except NMI.

and 4.20.5 says

  Context: Either preemptible task context or not-NMI interrupt.

. There might be users who didn't notice this change.
Andrew Morton Jan. 29, 2019, 7:13 p.m. UTC | #5
On Tue, 29 Jan 2019 20:43:20 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:

> On 2019/01/29 16:21, Jiufei Xue wrote:
> > Trinity reports BUG:
> > 
> > sleeping function called from invalid context at mm/vmalloc.c:1477
> > in_atomic(): 1, irqs_disabled(): 0, pid: 12269, name: trinity-c1
> > 
> > [ 2748.573460] Call Trace:
> > [ 2748.575935]  dump_stack+0x91/0xeb
> > [ 2748.578512]  ___might_sleep+0x21c/0x250
> > [ 2748.581090]  remove_vm_area+0x1d/0x90
> > [ 2748.583637]  __vunmap+0x76/0x100
> > [ 2748.586120]  __se_sys_swapon+0xb9a/0x1220
> > [ 2748.598973]  do_syscall_64+0x60/0x210
> > [ 2748.601439]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > 
> > This is triggered by calling kvfree() inside spinlock() section in
> > function alloc_swap_info().
> > Fix this by moving the kvfree() after spin_unlock().
> > 
> 
> Excuse me? But isn't kvfree() safe to be called with spinlock held?

Yes, I'm having trouble spotting where kvfree() can sleep.  Perhaps it
*used* to sleep on mutex_lock(vmap_purge_lock), but
try_purge_vmap_area_lazy() is using mutex_trylock().  Confused.

kvfree() darn well *shouldn't* sleep!
Tetsuo Handa Jan. 29, 2019, 9:12 p.m. UTC | #6
On 2019/01/30 4:13, Andrew Morton wrote:
> On Tue, 29 Jan 2019 20:43:20 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> 
>> On 2019/01/29 16:21, Jiufei Xue wrote:
>>> Trinity reports BUG:
>>>
>>> sleeping function called from invalid context at mm/vmalloc.c:1477
>>> in_atomic(): 1, irqs_disabled(): 0, pid: 12269, name: trinity-c1
>>>
>>> [ 2748.573460] Call Trace:
>>> [ 2748.575935]  dump_stack+0x91/0xeb
>>> [ 2748.578512]  ___might_sleep+0x21c/0x250
>>> [ 2748.581090]  remove_vm_area+0x1d/0x90
>>> [ 2748.583637]  __vunmap+0x76/0x100
>>> [ 2748.586120]  __se_sys_swapon+0xb9a/0x1220
>>> [ 2748.598973]  do_syscall_64+0x60/0x210
>>> [ 2748.601439]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>
>>> This is triggered by calling kvfree() inside spinlock() section in
>>> function alloc_swap_info().
>>> Fix this by moving the kvfree() after spin_unlock().
>>>
>>
>> Excuse me? But isn't kvfree() safe to be called with spinlock held?
> 
> Yes, I'm having trouble spotting where kvfree() can sleep.  Perhaps it
> *used* to sleep on mutex_lock(vmap_purge_lock), but
> try_purge_vmap_area_lazy() is using mutex_trylock().  Confused.
> 
> kvfree() darn well *shouldn't* sleep!
> 

If I recall correctly, there was an attempt to allow vfree() to sleep
but that attempt failed, and the change to allow vfree() to sleep was
reverted. Thus, vfree() had been "Context: Any context except NMI.".

If we want to allow vfree() to sleep, at least we need to test with
kvmalloc() == vmalloc() (i.e. force kvmalloc()/kvfree() users to use
vmalloc()/vfree() path). For now, reverting the 
"Context: Either preemptible task context or not-NMI interrupt." change
will be needed for stable kernels.
Yang Shi Jan. 29, 2019, 9:51 p.m. UTC | #7
On Tue, Jan 29, 2019 at 1:12 PM Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> On 2019/01/30 4:13, Andrew Morton wrote:
> > On Tue, 29 Jan 2019 20:43:20 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> >
> >> On 2019/01/29 16:21, Jiufei Xue wrote:
> >>> Trinity reports BUG:
> >>>
> >>> sleeping function called from invalid context at mm/vmalloc.c:1477
> >>> in_atomic(): 1, irqs_disabled(): 0, pid: 12269, name: trinity-c1
> >>>
> >>> [ 2748.573460] Call Trace:
> >>> [ 2748.575935]  dump_stack+0x91/0xeb
> >>> [ 2748.578512]  ___might_sleep+0x21c/0x250
> >>> [ 2748.581090]  remove_vm_area+0x1d/0x90
> >>> [ 2748.583637]  __vunmap+0x76/0x100
> >>> [ 2748.586120]  __se_sys_swapon+0xb9a/0x1220
> >>> [ 2748.598973]  do_syscall_64+0x60/0x210
> >>> [ 2748.601439]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >>>
> >>> This is triggered by calling kvfree() inside spinlock() section in
> >>> function alloc_swap_info().
> >>> Fix this by moving the kvfree() after spin_unlock().
> >>>
> >>
> >> Excuse me? But isn't kvfree() safe to be called with spinlock held?
> >
> > Yes, I'm having trouble spotting where kvfree() can sleep.  Perhaps it
> > *used* to sleep on mutex_lock(vmap_purge_lock), but
> > try_purge_vmap_area_lazy() is using mutex_trylock().  Confused.
> >
> > kvfree() darn well *shouldn't* sleep!
> >
>
> If I recall correctly, there was an attempt to allow vfree() to sleep
> but that attempt failed, and the change to allow vfree() to sleep was
> reverted. Thus, vfree() had been "Context: Any context except NMI.".
>
> If we want to allow vfree() to sleep, at least we need to test with
> kvmalloc() == vmalloc() (i.e. force kvmalloc()/kvfree() users to use
> vmalloc()/vfree() path). For now, reverting the
> "Context: Either preemptible task context or not-NMI interrupt." change
> will be needed for stable kernels.

So, the comment for vfree "May sleep if called *not* from interrupt
context." is wrong?

Thanks,
Yang

>
Tetsuo Handa Jan. 30, 2019, 12:42 a.m. UTC | #8
Yang Shi wrote:
> On Tue, Jan 29, 2019 at 1:12 PM Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
> >
> > On 2019/01/30 4:13, Andrew Morton wrote:
> > > On Tue, 29 Jan 2019 20:43:20 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> > >
> > >> On 2019/01/29 16:21, Jiufei Xue wrote:
> > >>> Trinity reports BUG:
> > >>>
> > >>> sleeping function called from invalid context at mm/vmalloc.c:1477
> > >>> in_atomic(): 1, irqs_disabled(): 0, pid: 12269, name: trinity-c1
> > >>>
> > >>> [ 2748.573460] Call Trace:
> > >>> [ 2748.575935]  dump_stack+0x91/0xeb
> > >>> [ 2748.578512]  ___might_sleep+0x21c/0x250
> > >>> [ 2748.581090]  remove_vm_area+0x1d/0x90
> > >>> [ 2748.583637]  __vunmap+0x76/0x100
> > >>> [ 2748.586120]  __se_sys_swapon+0xb9a/0x1220
> > >>> [ 2748.598973]  do_syscall_64+0x60/0x210
> > >>> [ 2748.601439]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > >>>
> > >>> This is triggered by calling kvfree() inside spinlock() section in
> > >>> function alloc_swap_info().
> > >>> Fix this by moving the kvfree() after spin_unlock().
> > >>>
> > >>
> > >> Excuse me? But isn't kvfree() safe to be called with spinlock held?
> > >
> > > Yes, I'm having trouble spotting where kvfree() can sleep.  Perhaps it
> > > *used* to sleep on mutex_lock(vmap_purge_lock), but
> > > try_purge_vmap_area_lazy() is using mutex_trylock().  Confused.
> > >
> > > kvfree() darn well *shouldn't* sleep!
> > >
> >
> > If I recall correctly, there was an attempt to allow vfree() to sleep
> > but that attempt failed, and the change to allow vfree() to sleep was
> > reverted. Thus, vfree() had been "Context: Any context except NMI.".

That attempt was not reverted. Instead vfree_atomic() was added.

> >
> > If we want to allow vfree() to sleep, at least we need to test with
> > kvmalloc() == vmalloc() (i.e. force kvmalloc()/kvfree() users to use
> > vmalloc()/vfree() path). For now, reverting the
> > "Context: Either preemptible task context or not-NMI interrupt." change
> > will be needed for stable kernels.
> 
> So, the comment for vfree "May sleep if called *not* from interrupt
> context." is wrong?

Commit bf22e37a641327e3 ("mm: add vfree_atomic()") says

    We are going to use sleeping lock for freeing vmap.  However some
    vfree() users want to free memory from atomic (but not from interrupt)
    context.  For this we add vfree_atomic() - deferred variation of vfree()
    which can be used in any atomic context (except NMIs).

and commit 52414d3302577bb6 ("kvfree(): fix misleading comment") made

    - * Context: Any context except NMI.
    + * Context: Either preemptible task context or not-NMI interrupt.

change. But I think that we converted kmalloc() to kvmalloc() without checking
context of kvfree() callers. Therefore, I think that kvfree() needs to use
vfree_atomic() rather than just saying "vfree() might sleep if called not in
interrupt context."...
Andrew Morton Jan. 30, 2019, 1:01 a.m. UTC | #9
On Wed, 30 Jan 2019 09:42:06 +0900 Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote:

> > >
> > > If we want to allow vfree() to sleep, at least we need to test with
> > > kvmalloc() == vmalloc() (i.e. force kvmalloc()/kvfree() users to use
> > > vmalloc()/vfree() path). For now, reverting the
> > > "Context: Either preemptible task context or not-NMI interrupt." change
> > > will be needed for stable kernels.
> > 
> > So, the comment for vfree "May sleep if called *not* from interrupt
> > context." is wrong?
> 
> Commit bf22e37a641327e3 ("mm: add vfree_atomic()") says
> 
>     We are going to use sleeping lock for freeing vmap.  However some
>     vfree() users want to free memory from atomic (but not from interrupt)
>     context.  For this we add vfree_atomic() - deferred variation of vfree()
>     which can be used in any atomic context (except NMIs).
> 
> and commit 52414d3302577bb6 ("kvfree(): fix misleading comment") made
> 
>     - * Context: Any context except NMI.
>     + * Context: Either preemptible task context or not-NMI interrupt.
> 
> change. But I think that we converted kmalloc() to kvmalloc() without checking
> context of kvfree() callers. Therefore, I think that kvfree() needs to use
> vfree_atomic() rather than just saying "vfree() might sleep if called not in
> interrupt context."...

Whereabouts in the vfree() path can the kernel sleep?
Linus Torvalds Jan. 30, 2019, 1:11 a.m. UTC | #10
On Tue, Jan 29, 2019 at 5:01 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> >
> >     - * Context: Any context except NMI.
> >     + * Context: Either preemptible task context or not-NMI interrupt.
>
> Whereabouts in the vfree() path can the kernel sleep?

Note that it's not necessarily about *sleeping*.

One thing that vfree() really fundamentally should do is to flush
TLB's. And you must not do a cross-TLB flush with interrupts disabled.

NOTE! Right now, I think we do lazy TLB flushing, so the flush
actually is delayed until the vmalloc() when the address rolls around
in the vmalloc address space. But there really are very real and
obvious reasons why we might want to do it at vfree time.

So I'd honestly be a whole lot happier with vmalloc/vfree being
process context only. Or at least with with interrupts enabled (so
swirq/BH context would be fine, but an actual interrupt not so).

Again, this is not about sleeping. But the end result is almost the
same: we really should strive to not do vfree() in interrupt context.

                Linus
Linus Torvalds Jan. 30, 2019, 1:23 a.m. UTC | #11
On Tue, Jan 29, 2019 at 5:11 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Again, this is not about sleeping. But the end result is almost the
> same: we really should strive to not do vfree() in interrupt context.

Note that we currently test the wrong thing for this: we actually
check "in_interrupt()" for the deferred case, which certainly works in
practice, and protects from deadlocks (we need vmap_area_lock for the
free-area handling)

But it doesn't actually end up testing the "oops, interrupts are
disabled in process context" issue. The "might_sleep()" check _does_
check that, iirc.

Which - as mentioned - is fine because we currently don't actually do
the TLB flush synchronously, but it's worth noting again. "vfree()"
really is a *lot* different from "kfree()". It's unsafe in all kinds
of special ways, and the locking difference is just part of it.

So whatever might_sleep() has found might be a potential real issue at
some point...

              Linus
Tetsuo Handa Jan. 30, 2019, 2:54 a.m. UTC | #12
Andrew Morton wrote:
> > change. But I think that we converted kmalloc() to kvmalloc() without checking
> > context of kvfree() callers. Therefore, I think that kvfree() needs to use
> > vfree_atomic() rather than just saying "vfree() might sleep if called not in
> > interrupt context."...
> 
> Whereabouts in the vfree() path can the kernel sleep?

Indeed. Although __vunmap() must not be called from interrupt context because
mutex_trylock()/mutex_unlock() from try_purge_vmap_area_lazy() from
free_vmap_area_noflush() from free_unmap_vmap_area() from remove_vm_area() from
__vunmap() cannot be called from interrupt context, it seems that there is no
location that does sleeping operation.

Linus Torvalds wrote:
> Which - as mentioned - is fine because we currently don't actually do
> the TLB flush synchronously, but it's worth noting again. "vfree()"
> really is a *lot* different from "kfree()". It's unsafe in all kinds
> of special ways, and the locking difference is just part of it.
> 
> So whatever might_sleep() has found might be a potential real issue at
> some point...

Then, do we automatically defer vfree() to mm_percpu_wq context?
Linus Torvalds Jan. 30, 2019, 5:18 p.m. UTC | #13
On Tue, Jan 29, 2019 at 6:54 PM Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> Then, do we automatically defer vfree() to mm_percpu_wq context?

We might do that, and say "if you call vfree with interrupts disabled,
it gets deferred".

That said, the deferred case should generally not be a common case
either. It has real issues, one of which is simply that particularly
on 32-bit architectures we can run out of vmalloc space even normally,
and if there are loads that do a lot of allocation and then deferred
frees, that problem could become really bad.

So I'd almost be happier having a warning if we end up doing the TLB
flush and defer. At least to find *what* people do.

And I do wonder if we should just always warn, and have that
"might_sleep()", and simply say "if you do vfree from interrupts or
with interrupts disabled, you really should be aware of these kinds of
issues, and you really should *show* that you are aware by using
vfree_atomic()".

                     Linus
Andrew Morton Jan. 30, 2019, 10:13 p.m. UTC | #14
On Wed, 30 Jan 2019 09:18:20 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, Jan 29, 2019 at 6:54 PM Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
> >
> > Then, do we automatically defer vfree() to mm_percpu_wq context?
> 
> We might do that, and say "if you call vfree with interrupts disabled,
> it gets deferred".
> 
> That said, the deferred case should generally not be a common case
> either. It has real issues, one of which is simply that particularly
> on 32-bit architectures we can run out of vmalloc space even normally,
> and if there are loads that do a lot of allocation and then deferred
> frees, that problem could become really bad.
> 
> So I'd almost be happier having a warning if we end up doing the TLB
> flush and defer. At least to find *what* people do.
> 
> And I do wonder if we should just always warn, and have that
> "might_sleep()", and simply say "if you do vfree from interrupts or
> with interrupts disabled, you really should be aware of these kinds of
> issues, and you really should *show* that you are aware by using
> vfree_atomic()".
> 

Agree.  if (irqs_disabled()) {warn_once; punt_to_workqueue} is the way
to go here.  vfree() should be callable from spinlocked code and
might_sleep() is an inappropriate check.
Aaron Lu March 7, 2019, 2:43 p.m. UTC | #15
On Tue, Jan 29, 2019 at 05:01:50PM -0800, Andrew Morton wrote:
> On Wed, 30 Jan 2019 09:42:06 +0900 Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote:
> 
> > > >
> > > > If we want to allow vfree() to sleep, at least we need to test with
> > > > kvmalloc() == vmalloc() (i.e. force kvmalloc()/kvfree() users to use
> > > > vmalloc()/vfree() path). For now, reverting the
> > > > "Context: Either preemptible task context or not-NMI interrupt." change
> > > > will be needed for stable kernels.
> > > 
> > > So, the comment for vfree "May sleep if called *not* from interrupt
> > > context." is wrong?
> > 
> > Commit bf22e37a641327e3 ("mm: add vfree_atomic()") says
> > 
> >     We are going to use sleeping lock for freeing vmap.  However some
> >     vfree() users want to free memory from atomic (but not from interrupt)
> >     context.  For this we add vfree_atomic() - deferred variation of vfree()
> >     which can be used in any atomic context (except NMIs).
> > 
> > and commit 52414d3302577bb6 ("kvfree(): fix misleading comment") made
> > 
> >     - * Context: Any context except NMI.
> >     + * Context: Either preemptible task context or not-NMI interrupt.
> > 
> > change. But I think that we converted kmalloc() to kvmalloc() without checking
> > context of kvfree() callers. Therefore, I think that kvfree() needs to use
> > vfree_atomic() rather than just saying "vfree() might sleep if called not in
> > interrupt context."...
> 
> Whereabouts in the vfree() path can the kernel sleep?

(Sorry for the late reply.)

Adding Andrey Ryabinin, author of commit 52414d3302577bb6
("kvfree(): fix misleading comment"), maybe Andrey remembers
where vfree() can sleep.

In the meantime, does "cond_resched_lock(&vmap_area_lock);" in
__purge_vmap_area_lazy() count as a sleep point?
__purge_vmap_area_lazy() can be called if mutex_trylock on
vmap_purge_lock succeed.
Andrey Ryabinin March 7, 2019, 2:47 p.m. UTC | #16
On 3/7/19 5:43 PM, Aaron Lu wrote:
> On Tue, Jan 29, 2019 at 05:01:50PM -0800, Andrew Morton wrote:
>> On Wed, 30 Jan 2019 09:42:06 +0900 Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote:
>>
>>>>>
>>>>> If we want to allow vfree() to sleep, at least we need to test with
>>>>> kvmalloc() == vmalloc() (i.e. force kvmalloc()/kvfree() users to use
>>>>> vmalloc()/vfree() path). For now, reverting the
>>>>> "Context: Either preemptible task context or not-NMI interrupt." change
>>>>> will be needed for stable kernels.
>>>>
>>>> So, the comment for vfree "May sleep if called *not* from interrupt
>>>> context." is wrong?
>>>
>>> Commit bf22e37a641327e3 ("mm: add vfree_atomic()") says
>>>
>>>     We are going to use sleeping lock for freeing vmap.  However some
>>>     vfree() users want to free memory from atomic (but not from interrupt)
>>>     context.  For this we add vfree_atomic() - deferred variation of vfree()
>>>     which can be used in any atomic context (except NMIs).
>>>
>>> and commit 52414d3302577bb6 ("kvfree(): fix misleading comment") made
>>>
>>>     - * Context: Any context except NMI.
>>>     + * Context: Either preemptible task context or not-NMI interrupt.
>>>
>>> change. But I think that we converted kmalloc() to kvmalloc() without checking
>>> context of kvfree() callers. Therefore, I think that kvfree() needs to use
>>> vfree_atomic() rather than just saying "vfree() might sleep if called not in
>>> interrupt context."...
>>
>> Whereabouts in the vfree() path can the kernel sleep?
> 
> (Sorry for the late reply.)
> 
> Adding Andrey Ryabinin, author of commit 52414d3302577bb6
> ("kvfree(): fix misleading comment"), maybe Andrey remembers
> where vfree() can sleep.
> 
> In the meantime, does "cond_resched_lock(&vmap_area_lock);" in
> __purge_vmap_area_lazy() count as a sleep point?

Yes, this is the place (the only one) where vfree() can sleep.
Aaron Lu March 7, 2019, 3:24 p.m. UTC | #17
On Thu, Mar 07, 2019 at 05:47:13PM +0300, Andrey Ryabinin wrote:
> 
> 
> On 3/7/19 5:43 PM, Aaron Lu wrote:
> > On Tue, Jan 29, 2019 at 05:01:50PM -0800, Andrew Morton wrote:
> >> On Wed, 30 Jan 2019 09:42:06 +0900 Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote:
> >>
> >>>>>
> >>>>> If we want to allow vfree() to sleep, at least we need to test with
> >>>>> kvmalloc() == vmalloc() (i.e. force kvmalloc()/kvfree() users to use
> >>>>> vmalloc()/vfree() path). For now, reverting the
> >>>>> "Context: Either preemptible task context or not-NMI interrupt." change
> >>>>> will be needed for stable kernels.
> >>>>
> >>>> So, the comment for vfree "May sleep if called *not* from interrupt
> >>>> context." is wrong?
> >>>
> >>> Commit bf22e37a641327e3 ("mm: add vfree_atomic()") says
> >>>
> >>>     We are going to use sleeping lock for freeing vmap.  However some
> >>>     vfree() users want to free memory from atomic (but not from interrupt)
> >>>     context.  For this we add vfree_atomic() - deferred variation of vfree()
> >>>     which can be used in any atomic context (except NMIs).
> >>>
> >>> and commit 52414d3302577bb6 ("kvfree(): fix misleading comment") made
> >>>
> >>>     - * Context: Any context except NMI.
> >>>     + * Context: Either preemptible task context or not-NMI interrupt.
> >>>
> >>> change. But I think that we converted kmalloc() to kvmalloc() without checking
> >>> context of kvfree() callers. Therefore, I think that kvfree() needs to use
> >>> vfree_atomic() rather than just saying "vfree() might sleep if called not in
> >>> interrupt context."...
> >>
> >> Whereabouts in the vfree() path can the kernel sleep?
> > 
> > (Sorry for the late reply.)
> > 
> > Adding Andrey Ryabinin, author of commit 52414d3302577bb6
> > ("kvfree(): fix misleading comment"), maybe Andrey remembers
> > where vfree() can sleep.
> > 
> > In the meantime, does "cond_resched_lock(&vmap_area_lock);" in
> > __purge_vmap_area_lazy() count as a sleep point?
> 
> Yes, this is the place (the only one) where vfree() can sleep.

OK, thanks for the quick confirm.

So what about this: use __vfree_deferred() when:
 - in_interrupt(), because we can't use mutex_trylock() as pointed out
   by Tetsuo;
 - in_atomic(), because cond_resched_lock();
 - irqs_disabled(), as smp_call_function_many() will deadlock.

An untested diff to show the idea(not sure if warn is needed):

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index e86ba6e74b50..28d200f054b0 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1578,7 +1578,7 @@ void vfree_atomic(const void *addr)
 
 static void __vfree(const void *addr)
 {
-	if (unlikely(in_interrupt()))
+	if (unlikely(in_interrupt() || in_atomic() || irqs_disabled()))
 		__vfree_deferred(addr);
 	else
 		__vunmap(addr, 1);
@@ -1606,8 +1606,6 @@ void vfree(const void *addr)
 
 	kmemleak_free(addr);
 
-	might_sleep_if(!in_interrupt());
-
 	if (!addr)
 		return;
Andrey Ryabinin March 7, 2019, 4:33 p.m. UTC | #18
On 3/7/19 6:24 PM, Aaron Lu wrote:
> On Thu, Mar 07, 2019 at 05:47:13PM +0300, Andrey Ryabinin wrote:
>>
>>
>> On 3/7/19 5:43 PM, Aaron Lu wrote:
>>> On Tue, Jan 29, 2019 at 05:01:50PM -0800, Andrew Morton wrote:
>>>> On Wed, 30 Jan 2019 09:42:06 +0900 Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote:
>>>>
>>>>>>>
>>>>>>> If we want to allow vfree() to sleep, at least we need to test with
>>>>>>> kvmalloc() == vmalloc() (i.e. force kvmalloc()/kvfree() users to use
>>>>>>> vmalloc()/vfree() path). For now, reverting the
>>>>>>> "Context: Either preemptible task context or not-NMI interrupt." change
>>>>>>> will be needed for stable kernels.
>>>>>>
>>>>>> So, the comment for vfree "May sleep if called *not* from interrupt
>>>>>> context." is wrong?
>>>>>
>>>>> Commit bf22e37a641327e3 ("mm: add vfree_atomic()") says
>>>>>
>>>>>     We are going to use sleeping lock for freeing vmap.  However some
>>>>>     vfree() users want to free memory from atomic (but not from interrupt)
>>>>>     context.  For this we add vfree_atomic() - deferred variation of vfree()
>>>>>     which can be used in any atomic context (except NMIs).
>>>>>
>>>>> and commit 52414d3302577bb6 ("kvfree(): fix misleading comment") made
>>>>>
>>>>>     - * Context: Any context except NMI.
>>>>>     + * Context: Either preemptible task context or not-NMI interrupt.
>>>>>
>>>>> change. But I think that we converted kmalloc() to kvmalloc() without checking
>>>>> context of kvfree() callers. Therefore, I think that kvfree() needs to use
>>>>> vfree_atomic() rather than just saying "vfree() might sleep if called not in
>>>>> interrupt context."...
>>>>
>>>> Whereabouts in the vfree() path can the kernel sleep?
>>>
>>> (Sorry for the late reply.)
>>>
>>> Adding Andrey Ryabinin, author of commit 52414d3302577bb6
>>> ("kvfree(): fix misleading comment"), maybe Andrey remembers
>>> where vfree() can sleep.
>>>
>>> In the meantime, does "cond_resched_lock(&vmap_area_lock);" in
>>> __purge_vmap_area_lazy() count as a sleep point?
>>
>> Yes, this is the place (the only one) where vfree() can sleep.
> 
> OK, thanks for the quick confirm.
> 
> So what about this: use __vfree_deferred() when:
>  - in_interrupt(), because we can't use mutex_trylock() as pointed out
>    by Tetsuo;
>  - in_atomic(), because cond_resched_lock();
>  - irqs_disabled(), as smp_call_function_many() will deadlock.
> 
> An untested diff to show the idea(not sure if warn is needed):
> 

It was discussed before. You're not the first one suggesting something like this.
There is the comment near in_atomic() explaining  well why and when your patch won't work.

The easiest way of making vfree() to be safe in atomic contexts is this patch:
http://lkml.kernel.org/r/20170330102719.13119-1-aryabinin@virtuozzo.com

But the final decision at that time was to fix caller so the call vfree from sleepable context instead:
 http://lkml.kernel.org/r/20170330152229.f2108e718114ed77acae7405@linux-foundation.org

 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index e86ba6e74b50..28d200f054b0 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1578,7 +1578,7 @@ void vfree_atomic(const void *addr)
>  
>  static void __vfree(const void *addr)
>  {
> -	if (unlikely(in_interrupt()))
> +	if (unlikely(in_interrupt() || in_atomic() || irqs_disabled()))
>  		__vfree_deferred(addr);
>  	else
>  		__vunmap(addr, 1);
> @@ -1606,8 +1606,6 @@ void vfree(const void *addr)
>  
>  	kmemleak_free(addr);
>  
> -	might_sleep_if(!in_interrupt());
> -
>  	if (!addr)
>  		return;
>  
>
Aaron Lu March 8, 2019, 2:41 a.m. UTC | #19
On 2019/3/8 0:33, Andrey Ryabinin wrote:
> 
> 
> On 3/7/19 6:24 PM, Aaron Lu wrote:
>> On Thu, Mar 07, 2019 at 05:47:13PM +0300, Andrey Ryabinin wrote:
>>>
>>>
>>> On 3/7/19 5:43 PM, Aaron Lu wrote:
>>>> On Tue, Jan 29, 2019 at 05:01:50PM -0800, Andrew Morton wrote:
>>>>> On Wed, 30 Jan 2019 09:42:06 +0900 Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote:
>>>>>
>>>>>>>>
>>>>>>>> If we want to allow vfree() to sleep, at least we need to test with
>>>>>>>> kvmalloc() == vmalloc() (i.e. force kvmalloc()/kvfree() users to use
>>>>>>>> vmalloc()/vfree() path). For now, reverting the
>>>>>>>> "Context: Either preemptible task context or not-NMI interrupt." change
>>>>>>>> will be needed for stable kernels.
>>>>>>>
>>>>>>> So, the comment for vfree "May sleep if called *not* from interrupt
>>>>>>> context." is wrong?
>>>>>>
>>>>>> Commit bf22e37a641327e3 ("mm: add vfree_atomic()") says
>>>>>>
>>>>>>     We are going to use sleeping lock for freeing vmap.  However some
>>>>>>     vfree() users want to free memory from atomic (but not from interrupt)
>>>>>>     context.  For this we add vfree_atomic() - deferred variation of vfree()
>>>>>>     which can be used in any atomic context (except NMIs).
>>>>>>
>>>>>> and commit 52414d3302577bb6 ("kvfree(): fix misleading comment") made
>>>>>>
>>>>>>     - * Context: Any context except NMI.
>>>>>>     + * Context: Either preemptible task context or not-NMI interrupt.
>>>>>>
>>>>>> change. But I think that we converted kmalloc() to kvmalloc() without checking
>>>>>> context of kvfree() callers. Therefore, I think that kvfree() needs to use
>>>>>> vfree_atomic() rather than just saying "vfree() might sleep if called not in
>>>>>> interrupt context."...
>>>>>
>>>>> Whereabouts in the vfree() path can the kernel sleep?
>>>>
>>>> (Sorry for the late reply.)
>>>>
>>>> Adding Andrey Ryabinin, author of commit 52414d3302577bb6
>>>> ("kvfree(): fix misleading comment"), maybe Andrey remembers
>>>> where vfree() can sleep.
>>>>
>>>> In the meantime, does "cond_resched_lock(&vmap_area_lock);" in
>>>> __purge_vmap_area_lazy() count as a sleep point?
>>>
>>> Yes, this is the place (the only one) where vfree() can sleep.
>>
>> OK, thanks for the quick confirm.
>>
>> So what about this: use __vfree_deferred() when:
>>  - in_interrupt(), because we can't use mutex_trylock() as pointed out
>>    by Tetsuo;
>>  - in_atomic(), because cond_resched_lock();
>>  - irqs_disabled(), as smp_call_function_many() will deadlock.
>>
>> An untested diff to show the idea(not sure if warn is needed):
>>
> 
> It was discussed before. You're not the first one suggesting something like this.
> There is the comment near in_atomic() explaining  well why and when your patch won't work.

Thanks for the info.

> The easiest way of making vfree() to be safe in atomic contexts is this patch:
> http://lkml.kernel.org/r/20170330102719.13119-1-aryabinin@virtuozzo.com
> 
> But the final decision at that time was to fix caller so the call vfree from sleepable context instead:
>  http://lkml.kernel.org/r/20170330152229.f2108e718114ed77acae7405@linux-foundation.org

OK, if that is the final decision, then I think Jiufei's patch that
moves kvfree() out of the locked region is the right thing to do for
this issue here.
Jiufei Xue March 11, 2019, 1:43 a.m. UTC | #20
Hi Andrew,
>> It was discussed before. You're not the first one suggesting something like this.
>> There is the comment near in_atomic() explaining  well why and when your patch won't work.
> Thanks for the info.
>
>> The easiest way of making vfree() to be safe in atomic contexts is this patch:
>> http://lkml.kernel.org/r/20170330102719.13119-1-aryabinin@virtuozzo.com
>>
>> But the final decision at that time was to fix caller so the call vfree from sleepable context instead:
>>  http://lkml.kernel.org/r/20170330152229.f2108e718114ed77acae7405@linux-foundation.org
> OK, if that is the final decision, then I think Jiufei's patch that
> moves kvfree() out of the locked region is the right thing to do for
> this issue here.
>
Is that the final decision we have made? Could you please look into
my patch again and give the decision?


Thanks,

Jiufei
diff mbox series

Patch

diff --git a/mm/swapfile.c b/mm/swapfile.c
index dbac1d49469d..d26c9eac3d64 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2810,7 +2810,7 @@  late_initcall(max_swapfiles_check);
 
 static struct swap_info_struct *alloc_swap_info(void)
 {
-	struct swap_info_struct *p;
+	struct swap_info_struct *p, *tmp = NULL;
 	unsigned int type;
 	int i;
 	int size = sizeof(*p) + nr_node_ids * sizeof(struct plist_node);
@@ -2840,7 +2840,7 @@  static struct swap_info_struct *alloc_swap_info(void)
 		smp_wmb();
 		nr_swapfiles++;
 	} else {
-		kvfree(p);
+		tmp = p;
 		p = swap_info[type];
 		/*
 		 * Do not memset this entry: a racing procfs swap_next()
@@ -2853,6 +2853,8 @@  static struct swap_info_struct *alloc_swap_info(void)
 		plist_node_init(&p->avail_lists[i], 0);
 	p->flags = SWP_USED;
 	spin_unlock(&swap_lock);
+	kvfree(tmp);
+
 	spin_lock_init(&p->lock);
 	spin_lock_init(&p->cont_lock);