diff mbox series

[RFC] mm/vmalloc: fix vmalloc which may return null if called with __GFP_NOFAIL

Message ID 20240508125808.28882-1-hailong.liu@oppo.com (mailing list archive)
State New
Headers show
Series [RFC] mm/vmalloc: fix vmalloc which may return null if called with __GFP_NOFAIL | expand

Commit Message

Hailong Liu May 8, 2024, 12:58 p.m. UTC
From: "Hailong.Liu" <hailong.liu@oppo.com>

Commit a421ef303008 ("mm: allow !GFP_KERNEL allocations for kvmalloc")
includes support for __GFP_NOFAIL, but it presents a conflict with
commit dd544141b9eb ("vmalloc: back off when the current task is
OOM-killed"). A possible scenario is as belows:

process-a
kvcalloc(n, m, GFP_KERNEL | __GFP_NOFAIL)
    __vmalloc_node_range()
	__vmalloc_area_node()
	    vm_area_alloc_pages()
            --> oom-killer send SIGKILL to process-a
            if (fatal_signal_pending(current)) break;
--> return NULL;

to fix this, do not check fatal_signal_pending() in vm_area_alloc_pages()
if __GFP_NOFAIL set.

Reported-by: Oven <liyangouwen1@oppo.com>
Signed-off-by: Hailong.Liu <hailong.liu@oppo.com>
---
 mm/vmalloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

---
This issue occurred during OPLUS KASAN test. Below is part of the log

-> send signal
[65731.222840] [ T1308] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/apps/uid_10198,task=gs.intelligence,pid=32454,uid=10198

[65731.259685] [T32454] Call trace:
[65731.259698] [T32454]  dump_backtrace+0xf4/0x118
[65731.259734] [T32454]  show_stack+0x18/0x24
[65731.259756] [T32454]  dump_stack_lvl+0x60/0x7c
[65731.259781] [T32454]  dump_stack+0x18/0x38
[65731.259800] [T32454]  mrdump_common_die+0x250/0x39c [mrdump]
[65731.259936] [T32454]  ipanic_die+0x20/0x34 [mrdump]
[65731.260019] [T32454]  atomic_notifier_call_chain+0xb4/0xfc
[65731.260047] [T32454]  notify_die+0x114/0x198
[65731.260073] [T32454]  die+0xf4/0x5b4
[65731.260098] [T32454]  die_kernel_fault+0x80/0x98
[65731.260124] [T32454]  __do_kernel_fault+0x160/0x2a8
[65731.260146] [T32454]  do_bad_area+0x68/0x148
[65731.260174] [T32454]  do_mem_abort+0x151c/0x1b34
[65731.260204] [T32454]  el1_abort+0x3c/0x5c
[65731.260227] [T32454]  el1h_64_sync_handler+0x54/0x90
[65731.260248] [T32454]  el1h_64_sync+0x68/0x6c
[65731.260269] [T32454]  z_erofs_decompress_queue+0x7f0/0x2258
--> be->decompressed_pages = kvcalloc(be->nr_pages, sizeof(struct page *), GFP_KERNEL | __GFP_NOFAIL);
	kernel panic by NULL pointer dereference.
	erofs assume kvmalloc with __GFP_NOFAIL never return NULL.

[65731.260293] [T32454]  z_erofs_runqueue+0xf30/0x104c
[65731.260314] [T32454]  z_erofs_readahead+0x4f0/0x968
[65731.260339] [T32454]  read_pages+0x170/0xadc
[65731.260364] [T32454]  page_cache_ra_unbounded+0x874/0xf30
[65731.260388] [T32454]  page_cache_ra_order+0x24c/0x714
[65731.260411] [T32454]  filemap_fault+0xbf0/0x1a74
[65731.260437] [T32454]  __do_fault+0xd0/0x33c
[65731.260462] [T32454]  handle_mm_fault+0xf74/0x3fe0
[65731.260486] [T32454]  do_mem_abort+0x54c/0x1b34
[65731.260509] [T32454]  el0_da+0x44/0x94
[65731.260531] [T32454]  el0t_64_sync_handler+0x98/0xb4
[65731.260553] [T32454]  el0t_64_sync+0x198/0x19c

--
2.34.1

Comments

Gao Xiang May 8, 2024, 1:41 p.m. UTC | #1
+Cc Michal,

On 2024/5/8 20:58, hailong.liu@oppo.com wrote:
> From: "Hailong.Liu" <hailong.liu@oppo.com>
> 
> Commit a421ef303008 ("mm: allow !GFP_KERNEL allocations for kvmalloc")
> includes support for __GFP_NOFAIL, but it presents a conflict with
> commit dd544141b9eb ("vmalloc: back off when the current task is
> OOM-killed"). A possible scenario is as belows:
> 
> process-a
> kvcalloc(n, m, GFP_KERNEL | __GFP_NOFAIL)
>      __vmalloc_node_range()
> 	__vmalloc_area_node()
> 	    vm_area_alloc_pages()
>              --> oom-killer send SIGKILL to process-a
>              if (fatal_signal_pending(current)) break;
> --> return NULL;
> 
> to fix this, do not check fatal_signal_pending() in vm_area_alloc_pages()
> if __GFP_NOFAIL set.
> 
> Reported-by: Oven <liyangouwen1@oppo.com>
> Signed-off-by: Hailong.Liu <hailong.liu@oppo.com>

Why taging this as RFC here?  It seems a corner-case fix of
commit a421ef303008

Thanks,
Gao Xiang

> ---
>   mm/vmalloc.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 6641be0ca80b..2f359d08bf8d 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3560,7 +3560,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> 
>   	/* High-order pages or fallback path if "bulk" fails. */
>   	while (nr_allocated < nr_pages) {
> -		if (fatal_signal_pending(current))
> +		if (!(gfp & __GFP_NOFAIL) && fatal_signal_pending(current))
>   			break;
> 
>   		if (nid == NUMA_NO_NODE)
> ---
> This issue occurred during OPLUS KASAN test. Below is part of the log
> 
> -> send signal
> [65731.222840] [ T1308] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/apps/uid_10198,task=gs.intelligence,pid=32454,uid=10198
> 
> [65731.259685] [T32454] Call trace:
> [65731.259698] [T32454]  dump_backtrace+0xf4/0x118
> [65731.259734] [T32454]  show_stack+0x18/0x24
> [65731.259756] [T32454]  dump_stack_lvl+0x60/0x7c
> [65731.259781] [T32454]  dump_stack+0x18/0x38
> [65731.259800] [T32454]  mrdump_common_die+0x250/0x39c [mrdump]
> [65731.259936] [T32454]  ipanic_die+0x20/0x34 [mrdump]
> [65731.260019] [T32454]  atomic_notifier_call_chain+0xb4/0xfc
> [65731.260047] [T32454]  notify_die+0x114/0x198
> [65731.260073] [T32454]  die+0xf4/0x5b4
> [65731.260098] [T32454]  die_kernel_fault+0x80/0x98
> [65731.260124] [T32454]  __do_kernel_fault+0x160/0x2a8
> [65731.260146] [T32454]  do_bad_area+0x68/0x148
> [65731.260174] [T32454]  do_mem_abort+0x151c/0x1b34
> [65731.260204] [T32454]  el1_abort+0x3c/0x5c
> [65731.260227] [T32454]  el1h_64_sync_handler+0x54/0x90
> [65731.260248] [T32454]  el1h_64_sync+0x68/0x6c
> [65731.260269] [T32454]  z_erofs_decompress_queue+0x7f0/0x2258
> --> be->decompressed_pages = kvcalloc(be->nr_pages, sizeof(struct page *), GFP_KERNEL | __GFP_NOFAIL);
> 	kernel panic by NULL pointer dereference.
> 	erofs assume kvmalloc with __GFP_NOFAIL never return NULL.
> 
> [65731.260293] [T32454]  z_erofs_runqueue+0xf30/0x104c
> [65731.260314] [T32454]  z_erofs_readahead+0x4f0/0x968
> [65731.260339] [T32454]  read_pages+0x170/0xadc
> [65731.260364] [T32454]  page_cache_ra_unbounded+0x874/0xf30
> [65731.260388] [T32454]  page_cache_ra_order+0x24c/0x714
> [65731.260411] [T32454]  filemap_fault+0xbf0/0x1a74
> [65731.260437] [T32454]  __do_fault+0xd0/0x33c
> [65731.260462] [T32454]  handle_mm_fault+0xf74/0x3fe0
> [65731.260486] [T32454]  do_mem_abort+0x54c/0x1b34
> [65731.260509] [T32454]  el0_da+0x44/0x94
> [65731.260531] [T32454]  el0t_64_sync_handler+0x98/0xb4
> [65731.260553] [T32454]  el0t_64_sync+0x198/0x19c
> 
> --
> 2.34.1
Gao Xiang May 8, 2024, 2:13 p.m. UTC | #2
On 2024/5/8 21:41, Gao Xiang wrote:
> 
> +Cc Michal,
> 
> On 2024/5/8 20:58, hailong.liu@oppo.com wrote:
>> From: "Hailong.Liu" <hailong.liu@oppo.com>
>>
>> Commit a421ef303008 ("mm: allow !GFP_KERNEL allocations for kvmalloc")
>> includes support for __GFP_NOFAIL, but it presents a conflict with
>> commit dd544141b9eb ("vmalloc: back off when the current task is
>> OOM-killed"). A possible scenario is as belows:
>>
>> process-a
>> kvcalloc(n, m, GFP_KERNEL | __GFP_NOFAIL)
>>      __vmalloc_node_range()
>>     __vmalloc_area_node()
>>         vm_area_alloc_pages()
>>              --> oom-killer send SIGKILL to process-a
>>              if (fatal_signal_pending(current)) break;
>> --> return NULL;
>>
>> to fix this, do not check fatal_signal_pending() in vm_area_alloc_pages()
>> if __GFP_NOFAIL set.
>>
>> Reported-by: Oven <liyangouwen1@oppo.com>
>> Signed-off-by: Hailong.Liu <hailong.liu@oppo.com>
> 
> Why taging this as RFC here?  It seems a corner-case fix of
> commit a421ef303008

BTW, I guess maybe commit e9c3cda4d86e ("mm, vmalloc: fix high order
__GFP_NOFAIL allocations") misses fatal_signal_pending() handling
anyway..

> 
> Thanks,
> Gao Xiang
>
Hailong Liu May 8, 2024, 2:43 p.m. UTC | #3
On Wed, 08. May 21:41, Gao Xiang wrote:
>
> +Cc Michal,
>
> On 2024/5/8 20:58, hailong.liu@oppo.com wrote:
> > From: "Hailong.Liu" <hailong.liu@oppo.com>
> >
> > Commit a421ef303008 ("mm: allow !GFP_KERNEL allocations for kvmalloc")
> > includes support for __GFP_NOFAIL, but it presents a conflict with
> > commit dd544141b9eb ("vmalloc: back off when the current task is
> > OOM-killed"). A possible scenario is as belows:
> >
> > process-a
> > kvcalloc(n, m, GFP_KERNEL | __GFP_NOFAIL)
> >      __vmalloc_node_range()
> > 	__vmalloc_area_node()
> > 	    vm_area_alloc_pages()
> >              --> oom-killer send SIGKILL to process-a
> >              if (fatal_signal_pending(current)) break;
> > --> return NULL;
> >
> > to fix this, do not check fatal_signal_pending() in vm_area_alloc_pages()
> > if __GFP_NOFAIL set.
> >
> > Reported-by: Oven <liyangouwen1@oppo.com>
> > Signed-off-by: Hailong.Liu <hailong.liu@oppo.com>
>
> Why taging this as RFC here?  It seems a corner-case fix of
> commit a421ef303008
>
> Thanks,
> Gao Xiang
>

Hi Gao Xiang:

RFC here to wait for a better way to handle this case :).
IMO, if vmalloc support __GFP_NOFAIL it should not return
null even system is deadlock on memory.

--

Best Regards,
Hailong.
Gao Xiang May 8, 2024, 3:10 p.m. UTC | #4
Hi,

On 2024/5/8 22:43, Hailong Liu wrote:
> On Wed, 08. May 21:41, Gao Xiang wrote:
>>
>> +Cc Michal,
>>
>> On 2024/5/8 20:58, hailong.liu@oppo.com wrote:
>>> From: "Hailong.Liu" <hailong.liu@oppo.com>
>>>
>>> Commit a421ef303008 ("mm: allow !GFP_KERNEL allocations for kvmalloc")
>>> includes support for __GFP_NOFAIL, but it presents a conflict with
>>> commit dd544141b9eb ("vmalloc: back off when the current task is
>>> OOM-killed"). A possible scenario is as belows:
>>>
>>> process-a
>>> kvcalloc(n, m, GFP_KERNEL | __GFP_NOFAIL)
>>>       __vmalloc_node_range()
>>> 	__vmalloc_area_node()
>>> 	    vm_area_alloc_pages()
>>>               --> oom-killer send SIGKILL to process-a
>>>               if (fatal_signal_pending(current)) break;
>>> --> return NULL;
>>>
>>> to fix this, do not check fatal_signal_pending() in vm_area_alloc_pages()
>>> if __GFP_NOFAIL set.
>>>
>>> Reported-by: Oven <liyangouwen1@oppo.com>
>>> Signed-off-by: Hailong.Liu <hailong.liu@oppo.com>
>>
>> Why taging this as RFC here?  It seems a corner-case fix of
>> commit a421ef303008
>>
>> Thanks,
>> Gao Xiang
>>
> 
> Hi Gao Xiang:
> 
> RFC here to wait for a better way to handle this case :).
> IMO, if vmalloc support __GFP_NOFAIL it should not return
> null even system is deadlock on memory.

The starting point is that kmalloc doesn't support __GFP_NOFAIL
if order > 1 (even for very short temporary uses), see:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/page_alloc.c?h=v6.8#n2896

but it is possible if we have such page pointer array (since two
(order-1) pages can only keep 1024 8-byte entries, it can happen
if compression ratios are high), and kvmalloc(__GFP_NOFAIL) has
already been supported for almost two years, it will fallback to
order-0 allocation as described in commit e9c3cda4d86e
("mm, vmalloc: fix high order __GFP_NOFAIL allocations").

With my limited understanding, I'm not sure why it can cause
deadlock here since it will fallback to order-0 allocation then,
and such allocation is just for short temporary uses again
because kmalloc doesn't support order > 1 short memory
allocation strictly.

Thanks,
Gao Xiang

> 
> --
> 
> Best Regards,
> Hailong.
Hailong Liu May 8, 2024, 3:31 p.m. UTC | #5
On Wed, 08. May 23:10, Gao Xiang wrote:
> Hi,
>
> On 2024/5/8 22:43, Hailong Liu wrote:
> > On Wed, 08. May 21:41, Gao Xiang wrote:
> > >
> > > +Cc Michal,
> > >
> > > On 2024/5/8 20:58, hailong.liu@oppo.com wrote:
> > > > From: "Hailong.Liu" <hailong.liu@oppo.com>
> > > >
> > > > Commit a421ef303008 ("mm: allow !GFP_KERNEL allocations for kvmalloc")
> > > > includes support for __GFP_NOFAIL, but it presents a conflict with
> > > > commit dd544141b9eb ("vmalloc: back off when the current task is
> > > > OOM-killed"). A possible scenario is as belows:
> > > >
> > > > process-a
> > > > kvcalloc(n, m, GFP_KERNEL | __GFP_NOFAIL)
> > > >       __vmalloc_node_range()
> > > > 	__vmalloc_area_node()
> > > > 	    vm_area_alloc_pages()
> > > >               --> oom-killer send SIGKILL to process-a
> > > >               if (fatal_signal_pending(current)) break;
> > > > --> return NULL;
> > > >
> > > > to fix this, do not check fatal_signal_pending() in vm_area_alloc_pages()
> > > > if __GFP_NOFAIL set.
> > > >
> > > > Reported-by: Oven <liyangouwen1@oppo.com>
> > > > Signed-off-by: Hailong.Liu <hailong.liu@oppo.com>
> > >
> > > Why taging this as RFC here?  It seems a corner-case fix of
> > > commit a421ef303008
> > >
> > > Thanks,
> > > Gao Xiang
> > >
> >
> > Hi Gao Xiang:
> >
> > RFC here to wait for a better way to handle this case :).
> > IMO, if vmalloc support __GFP_NOFAIL it should not return
> > null even system is deadlock on memory.
>
> The starting point is that kmalloc doesn't support __GFP_NOFAIL
> if order > 1 (even for very short temporary uses), see:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/page_alloc.c?h=v6.8#n2896
>
> but it is possible if we have such page pointer array (since two
> (order-1) pages can only keep 1024 8-byte entries, it can happen
> if compression ratios are high), and kvmalloc(__GFP_NOFAIL) has
> already been supported for almost two years, it will fallback to
> order-0 allocation as described in commit e9c3cda4d86e
> ("mm, vmalloc: fix high order __GFP_NOFAIL allocations").
>
> With my limited understanding, I'm not sure why it can cause
> deadlock here since it will fallback to order-0 allocation then,
> and such allocation is just for short temporary uses again
> because kmalloc doesn't support order > 1 short memory
> allocation strictly.
>

deadlock on memory meands there is a memory leak causing
system to be unable to allocate memory not actual
*deadlock*.

> Thanks,
> Gao Xiang
>

--

Best Regards,
Hailong.
Gao Xiang May 8, 2024, 3:40 p.m. UTC | #6
On 2024/5/8 23:31, Hailong Liu wrote:
> On Wed, 08. May 23:10, Gao Xiang wrote:
>> Hi,
>>
>> On 2024/5/8 22:43, Hailong Liu wrote:
>>> On Wed, 08. May 21:41, Gao Xiang wrote:
>>>>
>>>> +Cc Michal,
>>>>
>>>> On 2024/5/8 20:58, hailong.liu@oppo.com wrote:
>>>>> From: "Hailong.Liu" <hailong.liu@oppo.com>
>>>>>
>>>>> Commit a421ef303008 ("mm: allow !GFP_KERNEL allocations for kvmalloc")
>>>>> includes support for __GFP_NOFAIL, but it presents a conflict with
>>>>> commit dd544141b9eb ("vmalloc: back off when the current task is
>>>>> OOM-killed"). A possible scenario is as belows:
>>>>>
>>>>> process-a
>>>>> kvcalloc(n, m, GFP_KERNEL | __GFP_NOFAIL)
>>>>>        __vmalloc_node_range()
>>>>> 	__vmalloc_area_node()
>>>>> 	    vm_area_alloc_pages()
>>>>>                --> oom-killer send SIGKILL to process-a
>>>>>                if (fatal_signal_pending(current)) break;
>>>>> --> return NULL;
>>>>>
>>>>> to fix this, do not check fatal_signal_pending() in vm_area_alloc_pages()
>>>>> if __GFP_NOFAIL set.
>>>>>
>>>>> Reported-by: Oven <liyangouwen1@oppo.com>
>>>>> Signed-off-by: Hailong.Liu <hailong.liu@oppo.com>
>>>>
>>>> Why taging this as RFC here?  It seems a corner-case fix of
>>>> commit a421ef303008
>>>>
>>>> Thanks,
>>>> Gao Xiang
>>>>
>>>
>>> Hi Gao Xiang:
>>>
>>> RFC here to wait for a better way to handle this case :).
>>> IMO, if vmalloc support __GFP_NOFAIL it should not return
>>> null even system is deadlock on memory.
>>
>> The starting point is that kmalloc doesn't support __GFP_NOFAIL
>> if order > 1 (even for very short temporary uses), see:
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/page_alloc.c?h=v6.8#n2896
>>
>> but it is possible if we have such page pointer array (since two
>> (order-1) pages can only keep 1024 8-byte entries, it can happen
>> if compression ratios are high), and kvmalloc(__GFP_NOFAIL) has
>> already been supported for almost two years, it will fallback to
>> order-0 allocation as described in commit e9c3cda4d86e
>> ("mm, vmalloc: fix high order __GFP_NOFAIL allocations").
>>
>> With my limited understanding, I'm not sure why it can cause
>> deadlock here since it will fallback to order-0 allocation then,
>> and such allocation is just for short temporary uses again
>> because kmalloc doesn't support order > 1 short memory
>> allocation strictly.
>>
> 
> deadlock on memory meands there is a memory leak causing
> system to be unable to allocate memory not actual
> *deadlock*.

Where is memory leak? If it's caused by kvmalloc(__GFP_NOFAIL)
callers, then it's bugs of callers and we should fix the callers.

Also why kmalloc(__GFP_NOFAIL) (for example, also order-0
allocation) differs?

Thanks,
Gao Xiang


> 
>> Thanks,
>> Gao Xiang
>>
> 
> --
> 
> Best Regards,
> Hailong.
Hailong Liu May 9, 2024, 1:30 a.m. UTC | #7
On Wed, 08. May 23:40, Gao Xiang wrote:
>
>
> On 2024/5/8 23:31, Hailong Liu wrote:
> > On Wed, 08. May 23:10, Gao Xiang wrote:
> > > Hi,
> > >
> > > On 2024/5/8 22:43, Hailong Liu wrote:
> > > > On Wed, 08. May 21:41, Gao Xiang wrote:
> > > > >
> > > > > +Cc Michal,
> > > > >
> > > > > On 2024/5/8 20:58, hailong.liu@oppo.com wrote:
> > > > > > From: "Hailong.Liu" <hailong.liu@oppo.com>
> > > > > >
> > > > > > Commit a421ef303008 ("mm: allow !GFP_KERNEL allocations for kvmalloc")
> > > > > > includes support for __GFP_NOFAIL, but it presents a conflict with
> > > > > > commit dd544141b9eb ("vmalloc: back off when the current task is
> > > > > > OOM-killed"). A possible scenario is as belows:
> > > > > >
> > > > > > process-a
> > > > > > kvcalloc(n, m, GFP_KERNEL | __GFP_NOFAIL)
> > > > > >        __vmalloc_node_range()
> > > > > > 	__vmalloc_area_node()
> > > > > > 	    vm_area_alloc_pages()
> > > > > >                --> oom-killer send SIGKILL to process-a
> > > > > >                if (fatal_signal_pending(current)) break;
> > > > > > --> return NULL;
> > > > > >
> > > > > > to fix this, do not check fatal_signal_pending() in vm_area_alloc_pages()
> > > > > > if __GFP_NOFAIL set.
> > > > > >
> > > > > > Reported-by: Oven <liyangouwen1@oppo.com>
> > > > > > Signed-off-by: Hailong.Liu <hailong.liu@oppo.com>
> > > > >
> > > > > Why taging this as RFC here?  It seems a corner-case fix of
> > > > > commit a421ef303008
> > > > >
> > > > > Thanks,
> > > > > Gao Xiang
> > > > >
> > > >
> > > > Hi Gao Xiang:
> > > >
> > > > RFC here to wait for a better way to handle this case :).
> > > > IMO, if vmalloc support __GFP_NOFAIL it should not return
> > > > null even system is deadlock on memory.
> > >
> > > The starting point is that kmalloc doesn't support __GFP_NOFAIL
> > > if order > 1 (even for very short temporary uses), see:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/page_alloc.c?h=v6.8#n2896
> > >
> > > but it is possible if we have such page pointer array (since two
> > > (order-1) pages can only keep 1024 8-byte entries, it can happen
> > > if compression ratios are high), and kvmalloc(__GFP_NOFAIL) has
> > > already been supported for almost two years, it will fallback to
> > > order-0 allocation as described in commit e9c3cda4d86e
> > > ("mm, vmalloc: fix high order __GFP_NOFAIL allocations").
> > >
> > > With my limited understanding, I'm not sure why it can cause
> > > deadlock here since it will fallback to order-0 allocation then,
> > > and such allocation is just for short temporary uses again
> > > because kmalloc doesn't support order > 1 short memory
> > > allocation strictly.
> > >
> >
> > deadlock on memory meands there is a memory leak causing
> > system to be unable to allocate memory not actual
> > *deadlock*.
>
> Where is memory leak? If it's caused by kvmalloc(__GFP_NOFAIL)
> callers, then it's bugs of callers and we should fix the callers.
>
> Also why kmalloc(__GFP_NOFAIL) (for example, also order-0
> allocation) differs?
>
> Thanks,
> Gao Xiang
>

I’m not suggesting that erofs would cause a memleak. What I mean is
that if kvmalloc is invoked with __GFP_NOFAIL, it must ensure a non-NULL
return, even in scenarios where memory leaks caused by other processes
result in the inability to allocate a page. In such a situation, it
should result in “Kernel panic - not syncing: System is deadlocked
on memory”.
>
> >
> > > Thanks,
> > > Gao Xiang
> > >
> >
> > --
> >
> > Best Regards,
> > Hailong.

--
Best Regards,
Hailong.
Barry Song May 9, 2024, 2:20 a.m. UTC | #8
On Thu, May 9, 2024 at 12:58 AM <hailong.liu@oppo.com> wrote:
>
> From: "Hailong.Liu" <hailong.liu@oppo.com>
>
> Commit a421ef303008 ("mm: allow !GFP_KERNEL allocations for kvmalloc")
> includes support for __GFP_NOFAIL, but it presents a conflict with
> commit dd544141b9eb ("vmalloc: back off when the current task is
> OOM-killed"). A possible scenario is as belows:
>
> process-a
> kvcalloc(n, m, GFP_KERNEL | __GFP_NOFAIL)
>     __vmalloc_node_range()
>         __vmalloc_area_node()
>             vm_area_alloc_pages()
>             --> oom-killer send SIGKILL to process-a
>             if (fatal_signal_pending(current)) break;
> --> return NULL;
>
> to fix this, do not check fatal_signal_pending() in vm_area_alloc_pages()
> if __GFP_NOFAIL set.
>
> Reported-by: Oven <liyangouwen1@oppo.com>
> Signed-off-by: Hailong.Liu <hailong.liu@oppo.com>
> ---
>  mm/vmalloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 6641be0ca80b..2f359d08bf8d 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3560,7 +3560,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>
>         /* High-order pages or fallback path if "bulk" fails. */
>         while (nr_allocated < nr_pages) {
> -               if (fatal_signal_pending(current))
> +               if (!(gfp & __GFP_NOFAIL) && fatal_signal_pending(current))
>                         break;

why not !nofail ?

This seems a correct fix, but it undermines the assumption made in
commit dd544141b9eb
 ("vmalloc: back off when the current task is OOM-killed")

"
    This may trigger some hidden problems, when caller does not handle
    vmalloc failures, or when rollaback after failed vmalloc calls own
    vmallocs inside.  However all of these scenarios are incorrect: vmalloc
    does not guarantee successful allocation, it has never been called with
    __GFP_NOFAIL and threfore either should not be used for any rollbacks or
    should handle such errors correctly and not lead to critical failures.
"

If a significant kvmalloc operation is performed with the NOFAIL flag, it risks
reverting the fix intended to address the OOM-killer issue in commit
dd544141b9eb.
Should we indeed permit the NOFAIL flag for large kvmalloc allocations?

>
>                 if (nid == NUMA_NO_NODE)
> ---
> This issue occurred during OPLUS KASAN test. Below is part of the log
>
> -> send signal
> [65731.222840] [ T1308] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/apps/uid_10198,task=gs.intelligence,pid=32454,uid=10198
>
> [65731.259685] [T32454] Call trace:
> [65731.259698] [T32454]  dump_backtrace+0xf4/0x118
> [65731.259734] [T32454]  show_stack+0x18/0x24
> [65731.259756] [T32454]  dump_stack_lvl+0x60/0x7c
> [65731.259781] [T32454]  dump_stack+0x18/0x38
> [65731.259800] [T32454]  mrdump_common_die+0x250/0x39c [mrdump]
> [65731.259936] [T32454]  ipanic_die+0x20/0x34 [mrdump]
> [65731.260019] [T32454]  atomic_notifier_call_chain+0xb4/0xfc
> [65731.260047] [T32454]  notify_die+0x114/0x198
> [65731.260073] [T32454]  die+0xf4/0x5b4
> [65731.260098] [T32454]  die_kernel_fault+0x80/0x98
> [65731.260124] [T32454]  __do_kernel_fault+0x160/0x2a8
> [65731.260146] [T32454]  do_bad_area+0x68/0x148
> [65731.260174] [T32454]  do_mem_abort+0x151c/0x1b34
> [65731.260204] [T32454]  el1_abort+0x3c/0x5c
> [65731.260227] [T32454]  el1h_64_sync_handler+0x54/0x90
> [65731.260248] [T32454]  el1h_64_sync+0x68/0x6c
> [65731.260269] [T32454]  z_erofs_decompress_queue+0x7f0/0x2258
> --> be->decompressed_pages = kvcalloc(be->nr_pages, sizeof(struct page *), GFP_KERNEL | __GFP_NOFAIL);
>         kernel panic by NULL pointer dereference.
>         erofs assume kvmalloc with __GFP_NOFAIL never return NULL.
>
> [65731.260293] [T32454]  z_erofs_runqueue+0xf30/0x104c
> [65731.260314] [T32454]  z_erofs_readahead+0x4f0/0x968
> [65731.260339] [T32454]  read_pages+0x170/0xadc
> [65731.260364] [T32454]  page_cache_ra_unbounded+0x874/0xf30
> [65731.260388] [T32454]  page_cache_ra_order+0x24c/0x714
> [65731.260411] [T32454]  filemap_fault+0xbf0/0x1a74
> [65731.260437] [T32454]  __do_fault+0xd0/0x33c
> [65731.260462] [T32454]  handle_mm_fault+0xf74/0x3fe0
> [65731.260486] [T32454]  do_mem_abort+0x54c/0x1b34
> [65731.260509] [T32454]  el0_da+0x44/0x94
> [65731.260531] [T32454]  el0t_64_sync_handler+0x98/0xb4
> [65731.260553] [T32454]  el0t_64_sync+0x198/0x19c
>
> --
> 2.34.1

Thanks
Barry
Barry Song May 9, 2024, 2:26 a.m. UTC | #9
On Thu, May 9, 2024 at 2:20 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, May 9, 2024 at 12:58 AM <hailong.liu@oppo.com> wrote:
> >
> > From: "Hailong.Liu" <hailong.liu@oppo.com>
> >
> > Commit a421ef303008 ("mm: allow !GFP_KERNEL allocations for kvmalloc")
> > includes support for __GFP_NOFAIL, but it presents a conflict with
> > commit dd544141b9eb ("vmalloc: back off when the current task is
> > OOM-killed"). A possible scenario is as belows:
> >
> > process-a
> > kvcalloc(n, m, GFP_KERNEL | __GFP_NOFAIL)
> >     __vmalloc_node_range()
> >         __vmalloc_area_node()
> >             vm_area_alloc_pages()
> >             --> oom-killer send SIGKILL to process-a
> >             if (fatal_signal_pending(current)) break;
> > --> return NULL;
> >
> > to fix this, do not check fatal_signal_pending() in vm_area_alloc_pages()
> > if __GFP_NOFAIL set.
> >
> > Reported-by: Oven <liyangouwen1@oppo.com>
> > Signed-off-by: Hailong.Liu <hailong.liu@oppo.com>
> > ---
> >  mm/vmalloc.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 6641be0ca80b..2f359d08bf8d 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -3560,7 +3560,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> >
> >         /* High-order pages or fallback path if "bulk" fails. */
> >         while (nr_allocated < nr_pages) {
> > -               if (fatal_signal_pending(current))
> > +               if (!(gfp & __GFP_NOFAIL) && fatal_signal_pending(current))
> >                         break;
>
> why not !nofail ?
>
> This seems a correct fix, but it undermines the assumption made in
> commit dd544141b9eb
>  ("vmalloc: back off when the current task is OOM-killed")
>
> "
>     This may trigger some hidden problems, when caller does not handle
>     vmalloc failures, or when rollaback after failed vmalloc calls own
>     vmallocs inside.  However all of these scenarios are incorrect: vmalloc
>     does not guarantee successful allocation, it has never been called with
>     __GFP_NOFAIL and threfore either should not be used for any rollbacks or
>     should handle such errors correctly and not lead to critical failures.
> "
>
> If a significant kvmalloc operation is performed with the NOFAIL flag, it risks
> reverting the fix intended to address the OOM-killer issue in commit
> dd544141b9eb.
> Should we indeed permit the NOFAIL flag for large kvmalloc allocations?

+ Vasily, Michal.

>
> >
> >                 if (nid == NUMA_NO_NODE)
> > ---
> > This issue occurred during OPLUS KASAN test. Below is part of the log
> >
> > -> send signal
> > [65731.222840] [ T1308] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/apps/uid_10198,task=gs.intelligence,pid=32454,uid=10198
> >
> > [65731.259685] [T32454] Call trace:
> > [65731.259698] [T32454]  dump_backtrace+0xf4/0x118
> > [65731.259734] [T32454]  show_stack+0x18/0x24
> > [65731.259756] [T32454]  dump_stack_lvl+0x60/0x7c
> > [65731.259781] [T32454]  dump_stack+0x18/0x38
> > [65731.259800] [T32454]  mrdump_common_die+0x250/0x39c [mrdump]
> > [65731.259936] [T32454]  ipanic_die+0x20/0x34 [mrdump]
> > [65731.260019] [T32454]  atomic_notifier_call_chain+0xb4/0xfc
> > [65731.260047] [T32454]  notify_die+0x114/0x198
> > [65731.260073] [T32454]  die+0xf4/0x5b4
> > [65731.260098] [T32454]  die_kernel_fault+0x80/0x98
> > [65731.260124] [T32454]  __do_kernel_fault+0x160/0x2a8
> > [65731.260146] [T32454]  do_bad_area+0x68/0x148
> > [65731.260174] [T32454]  do_mem_abort+0x151c/0x1b34
> > [65731.260204] [T32454]  el1_abort+0x3c/0x5c
> > [65731.260227] [T32454]  el1h_64_sync_handler+0x54/0x90
> > [65731.260248] [T32454]  el1h_64_sync+0x68/0x6c
> > [65731.260269] [T32454]  z_erofs_decompress_queue+0x7f0/0x2258
> > --> be->decompressed_pages = kvcalloc(be->nr_pages, sizeof(struct page *), GFP_KERNEL | __GFP_NOFAIL);
> >         kernel panic by NULL pointer dereference.
> >         erofs assume kvmalloc with __GFP_NOFAIL never return NULL.
> >
> > [65731.260293] [T32454]  z_erofs_runqueue+0xf30/0x104c
> > [65731.260314] [T32454]  z_erofs_readahead+0x4f0/0x968
> > [65731.260339] [T32454]  read_pages+0x170/0xadc
> > [65731.260364] [T32454]  page_cache_ra_unbounded+0x874/0xf30
> > [65731.260388] [T32454]  page_cache_ra_order+0x24c/0x714
> > [65731.260411] [T32454]  filemap_fault+0xbf0/0x1a74
> > [65731.260437] [T32454]  __do_fault+0xd0/0x33c
> > [65731.260462] [T32454]  handle_mm_fault+0xf74/0x3fe0
> > [65731.260486] [T32454]  do_mem_abort+0x54c/0x1b34
> > [65731.260509] [T32454]  el0_da+0x44/0x94
> > [65731.260531] [T32454]  el0t_64_sync_handler+0x98/0xb4
> > [65731.260553] [T32454]  el0t_64_sync+0x198/0x19c
> >
> > --
> > 2.34.1
>
> Thanks
> Barry
Barry Song May 9, 2024, 2:30 a.m. UTC | #10
On Thu, May 9, 2024 at 2:26 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, May 9, 2024 at 2:20 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Thu, May 9, 2024 at 12:58 AM <hailong.liu@oppo.com> wrote:
> > >
> > > From: "Hailong.Liu" <hailong.liu@oppo.com>
> > >
> > > Commit a421ef303008 ("mm: allow !GFP_KERNEL allocations for kvmalloc")
> > > includes support for __GFP_NOFAIL, but it presents a conflict with
> > > commit dd544141b9eb ("vmalloc: back off when the current task is
> > > OOM-killed"). A possible scenario is as belows:
> > >
> > > process-a
> > > kvcalloc(n, m, GFP_KERNEL | __GFP_NOFAIL)
> > >     __vmalloc_node_range()
> > >         __vmalloc_area_node()
> > >             vm_area_alloc_pages()
> > >             --> oom-killer send SIGKILL to process-a
> > >             if (fatal_signal_pending(current)) break;
> > > --> return NULL;
> > >
> > > to fix this, do not check fatal_signal_pending() in vm_area_alloc_pages()
> > > if __GFP_NOFAIL set.
> > >
> > > Reported-by: Oven <liyangouwen1@oppo.com>
> > > Signed-off-by: Hailong.Liu <hailong.liu@oppo.com>
> > > ---
> > >  mm/vmalloc.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index 6641be0ca80b..2f359d08bf8d 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -3560,7 +3560,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> > >
> > >         /* High-order pages or fallback path if "bulk" fails. */
> > >         while (nr_allocated < nr_pages) {
> > > -               if (fatal_signal_pending(current))
> > > +               if (!(gfp & __GFP_NOFAIL) && fatal_signal_pending(current))
> > >                         break;
> >
> > why not !nofail ?
> >
> > This seems a correct fix, but it undermines the assumption made in
> > commit dd544141b9eb
> >  ("vmalloc: back off when the current task is OOM-killed")
> >
> > "
> >     This may trigger some hidden problems, when caller does not handle
> >     vmalloc failures, or when rollaback after failed vmalloc calls own
> >     vmallocs inside.  However all of these scenarios are incorrect: vmalloc
> >     does not guarantee successful allocation, it has never been called with
> >     __GFP_NOFAIL and threfore either should not be used for any rollbacks or
> >     should handle such errors correctly and not lead to critical failures.
> > "
> >
> > If a significant kvmalloc operation is performed with the NOFAIL flag, it risks
> > reverting the fix intended to address the OOM-killer issue in commit
> > dd544141b9eb.
> > Should we indeed permit the NOFAIL flag for large kvmalloc allocations?
>
> + Vasily, Michal.

Sorry for my oversight. Fixed the email of Vasily.

>
> >
> > >
> > >                 if (nid == NUMA_NO_NODE)
> > > ---
> > > This issue occurred during OPLUS KASAN test. Below is part of the log
> > >
> > > -> send signal
> > > [65731.222840] [ T1308] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/apps/uid_10198,task=gs.intelligence,pid=32454,uid=10198
> > >
> > > [65731.259685] [T32454] Call trace:
> > > [65731.259698] [T32454]  dump_backtrace+0xf4/0x118
> > > [65731.259734] [T32454]  show_stack+0x18/0x24
> > > [65731.259756] [T32454]  dump_stack_lvl+0x60/0x7c
> > > [65731.259781] [T32454]  dump_stack+0x18/0x38
> > > [65731.259800] [T32454]  mrdump_common_die+0x250/0x39c [mrdump]
> > > [65731.259936] [T32454]  ipanic_die+0x20/0x34 [mrdump]
> > > [65731.260019] [T32454]  atomic_notifier_call_chain+0xb4/0xfc
> > > [65731.260047] [T32454]  notify_die+0x114/0x198
> > > [65731.260073] [T32454]  die+0xf4/0x5b4
> > > [65731.260098] [T32454]  die_kernel_fault+0x80/0x98
> > > [65731.260124] [T32454]  __do_kernel_fault+0x160/0x2a8
> > > [65731.260146] [T32454]  do_bad_area+0x68/0x148
> > > [65731.260174] [T32454]  do_mem_abort+0x151c/0x1b34
> > > [65731.260204] [T32454]  el1_abort+0x3c/0x5c
> > > [65731.260227] [T32454]  el1h_64_sync_handler+0x54/0x90
> > > [65731.260248] [T32454]  el1h_64_sync+0x68/0x6c
> > > [65731.260269] [T32454]  z_erofs_decompress_queue+0x7f0/0x2258
> > > --> be->decompressed_pages = kvcalloc(be->nr_pages, sizeof(struct page *), GFP_KERNEL | __GFP_NOFAIL);
> > >         kernel panic by NULL pointer dereference.
> > >         erofs assume kvmalloc with __GFP_NOFAIL never return NULL.
> > >
> > > [65731.260293] [T32454]  z_erofs_runqueue+0xf30/0x104c
> > > [65731.260314] [T32454]  z_erofs_readahead+0x4f0/0x968
> > > [65731.260339] [T32454]  read_pages+0x170/0xadc
> > > [65731.260364] [T32454]  page_cache_ra_unbounded+0x874/0xf30
> > > [65731.260388] [T32454]  page_cache_ra_order+0x24c/0x714
> > > [65731.260411] [T32454]  filemap_fault+0xbf0/0x1a74
> > > [65731.260437] [T32454]  __do_fault+0xd0/0x33c
> > > [65731.260462] [T32454]  handle_mm_fault+0xf74/0x3fe0
> > > [65731.260486] [T32454]  do_mem_abort+0x54c/0x1b34
> > > [65731.260509] [T32454]  el0_da+0x44/0x94
> > > [65731.260531] [T32454]  el0t_64_sync_handler+0x98/0xb4
> > > [65731.260553] [T32454]  el0t_64_sync+0x198/0x19c
> > >
> > > --
> > > 2.34.1
> >
> > Thanks
> > Barry
Gao Xiang May 9, 2024, 2:39 a.m. UTC | #11
Hi,

On 2024/5/9 10:20, Barry Song wrote:
> On Thu, May 9, 2024 at 12:58 AM <hailong.liu@oppo.com> wrote:
>>
>> From: "Hailong.Liu" <hailong.liu@oppo.com>
>>
>> Commit a421ef303008 ("mm: allow !GFP_KERNEL allocations for kvmalloc")
>> includes support for __GFP_NOFAIL, but it presents a conflict with
>> commit dd544141b9eb ("vmalloc: back off when the current task is
>> OOM-killed"). A possible scenario is as belows:
>>
>> process-a
>> kvcalloc(n, m, GFP_KERNEL | __GFP_NOFAIL)
>>      __vmalloc_node_range()
>>          __vmalloc_area_node()
>>              vm_area_alloc_pages()
>>              --> oom-killer send SIGKILL to process-a
>>              if (fatal_signal_pending(current)) break;
>> --> return NULL;
>>
>> to fix this, do not check fatal_signal_pending() in vm_area_alloc_pages()
>> if __GFP_NOFAIL set.
>>
>> Reported-by: Oven <liyangouwen1@oppo.com>
>> Signed-off-by: Hailong.Liu <hailong.liu@oppo.com>
>> ---
>>   mm/vmalloc.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index 6641be0ca80b..2f359d08bf8d 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -3560,7 +3560,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>>
>>          /* High-order pages or fallback path if "bulk" fails. */
>>          while (nr_allocated < nr_pages) {
>> -               if (fatal_signal_pending(current))
>> +               if (!(gfp & __GFP_NOFAIL) && fatal_signal_pending(current))
>>                          break;
> 
> why not !nofail ?
> 
> This seems a correct fix, but it undermines the assumption made in
> commit dd544141b9eb
>   ("vmalloc: back off when the current task is OOM-killed")
> 
> "
>      This may trigger some hidden problems, when caller does not handle
>      vmalloc failures, or when rollaback after failed vmalloc calls own
>      vmallocs inside.  However all of these scenarios are incorrect: vmalloc
>      does not guarantee successful allocation, it has never been called with
>      __GFP_NOFAIL and threfore either should not be used for any rollbacks or
>      should handle such errors correctly and not lead to critical failures.
> "
> 
> If a significant kvmalloc operation is performed with the NOFAIL flag, it risks
> reverting the fix intended to address the OOM-killer issue in commit
> dd544141b9eb.
> Should we indeed permit the NOFAIL flag for large kvmalloc allocations?

Just from my perspective, I don't really care about kmalloc, vmalloc
or kvmalloc (__GFP_NOFAIL).  I even don't care if it returns three
order-0 pages or a high-order page.   I just would like to need a
virtual consecutive buffer (even it works slowly.) with __GFP_NOFAIL.

Because in some cases, writing fallback code may be tough and hard to
test if such fallback path is correct since it only triggers in extreme
workloads, and even such buffers are just used in a very short lifetime.
Also see other FS discussion of __GFP_NOFAIL, e.g.
https://lore.kernel.org/all/ZcUQfzfQ9R8X0s47@tiehlicka/

In the worst cases, it usually just needs < 5 order-0 pages (for many
cases it only needs one page), but with kmalloc it will trigger WARN
if it occurs to > order-1 allocation. as I mentioned before.

With my limited understanding I don't see why it could any problem with
kvmalloc(__GFP_NOFAIL) since it has no difference of kmalloc(GFP_NOFAIL)
with order-0 allocation.


Thanks,
Gao XIang
Barry Song May 9, 2024, 3:09 a.m. UTC | #12
On Thu, May 9, 2024 at 2:39 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
> Hi,
>
> On 2024/5/9 10:20, Barry Song wrote:
> > On Thu, May 9, 2024 at 12:58 AM <hailong.liu@oppo.com> wrote:
> >>
> >> From: "Hailong.Liu" <hailong.liu@oppo.com>
> >>
> >> Commit a421ef303008 ("mm: allow !GFP_KERNEL allocations for kvmalloc")
> >> includes support for __GFP_NOFAIL, but it presents a conflict with
> >> commit dd544141b9eb ("vmalloc: back off when the current task is
> >> OOM-killed"). A possible scenario is as belows:
> >>
> >> process-a
> >> kvcalloc(n, m, GFP_KERNEL | __GFP_NOFAIL)
> >>      __vmalloc_node_range()
> >>          __vmalloc_area_node()
> >>              vm_area_alloc_pages()
> >>              --> oom-killer send SIGKILL to process-a
> >>              if (fatal_signal_pending(current)) break;
> >> --> return NULL;
> >>
> >> to fix this, do not check fatal_signal_pending() in vm_area_alloc_pages()
> >> if __GFP_NOFAIL set.
> >>
> >> Reported-by: Oven <liyangouwen1@oppo.com>
> >> Signed-off-by: Hailong.Liu <hailong.liu@oppo.com>
> >> ---
> >>   mm/vmalloc.c | 2 +-
> >>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> >> index 6641be0ca80b..2f359d08bf8d 100644
> >> --- a/mm/vmalloc.c
> >> +++ b/mm/vmalloc.c
> >> @@ -3560,7 +3560,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> >>
> >>          /* High-order pages or fallback path if "bulk" fails. */
> >>          while (nr_allocated < nr_pages) {
> >> -               if (fatal_signal_pending(current))
> >> +               if (!(gfp & __GFP_NOFAIL) && fatal_signal_pending(current))
> >>                          break;
> >
> > why not !nofail ?
> >
> > This seems a correct fix, but it undermines the assumption made in
> > commit dd544141b9eb
> >   ("vmalloc: back off when the current task is OOM-killed")
> >
> > "
> >      This may trigger some hidden problems, when caller does not handle
> >      vmalloc failures, or when rollaback after failed vmalloc calls own
> >      vmallocs inside.  However all of these scenarios are incorrect: vmalloc
> >      does not guarantee successful allocation, it has never been called with
> >      __GFP_NOFAIL and threfore either should not be used for any rollbacks or
> >      should handle such errors correctly and not lead to critical failures.
> > "
> >
> > If a significant kvmalloc operation is performed with the NOFAIL flag, it risks
> > reverting the fix intended to address the OOM-killer issue in commit
> > dd544141b9eb.
> > Should we indeed permit the NOFAIL flag for large kvmalloc allocations?
>
> Just from my perspective, I don't really care about kmalloc, vmalloc
> or kvmalloc (__GFP_NOFAIL).  I even don't care if it returns three
> order-0 pages or a high-order page.   I just would like to need a
> virtual consecutive buffer (even it works slowly.) with __GFP_NOFAIL.
>
> Because in some cases, writing fallback code may be tough and hard to
> test if such fallback path is correct since it only triggers in extreme
> workloads, and even such buffers are just used in a very short lifetime.
> Also see other FS discussion of __GFP_NOFAIL, e.g.
> https://lore.kernel.org/all/ZcUQfzfQ9R8X0s47@tiehlicka/
>
> In the worst cases, it usually just needs < 5 order-0 pages (for many
> cases it only needs one page), but with kmalloc it will trigger WARN
> if it occurs to > order-1 allocation. as I mentioned before.
>
> With my limited understanding I don't see why it could any problem with
> kvmalloc(__GFP_NOFAIL) since it has no difference of kmalloc(GFP_NOFAIL)
> with order-0 allocation.

I completely understand that you're not concerned about the origin of
the memory,
such as whether it's organized by all zero-order pages. However, in the event
that someone else allocates a large memory, like several megabytes with the
NOFAIL flag, commit dd544141b9eb aims to halt the allocation before success
if the process being allocated is targeted for termination of OOM-killer.

With the current patch, we miss the opportunity for early allocation
termination.
However, if the size of the kvmalloc() is small, as in your case, I
believe it should
be perfectly fine. but do we have any way to prevent large size allocation with
NOFAIL?

>
>
> Thanks,
> Gao XIang

Thanks
Barry
Gao Xiang May 9, 2024, 3:11 a.m. UTC | #13
On 2024/5/9 10:39, Gao Xiang wrote:
> Hi,
> 
> On 2024/5/9 10:20, Barry Song wrote:
>> On Thu, May 9, 2024 at 12:58 AM <hailong.liu@oppo.com> wrote:
>>>
>>> From: "Hailong.Liu" <hailong.liu@oppo.com>
>>>
>>> Commit a421ef303008 ("mm: allow !GFP_KERNEL allocations for kvmalloc")
>>> includes support for __GFP_NOFAIL, but it presents a conflict with
>>> commit dd544141b9eb ("vmalloc: back off when the current task is
>>> OOM-killed"). A possible scenario is as belows:
>>>
>>> process-a
>>> kvcalloc(n, m, GFP_KERNEL | __GFP_NOFAIL)
>>>      __vmalloc_node_range()
>>>          __vmalloc_area_node()
>>>              vm_area_alloc_pages()
>>>              --> oom-killer send SIGKILL to process-a
>>>              if (fatal_signal_pending(current)) break;
>>> --> return NULL;
>>>
>>> to fix this, do not check fatal_signal_pending() in vm_area_alloc_pages()
>>> if __GFP_NOFAIL set.
>>>
>>> Reported-by: Oven <liyangouwen1@oppo.com>
>>> Signed-off-by: Hailong.Liu <hailong.liu@oppo.com>
>>> ---
>>>   mm/vmalloc.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>>> index 6641be0ca80b..2f359d08bf8d 100644
>>> --- a/mm/vmalloc.c
>>> +++ b/mm/vmalloc.c
>>> @@ -3560,7 +3560,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>>>
>>>          /* High-order pages or fallback path if "bulk" fails. */
>>>          while (nr_allocated < nr_pages) {
>>> -               if (fatal_signal_pending(current))
>>> +               if (!(gfp & __GFP_NOFAIL) && fatal_signal_pending(current))
>>>                          break;
>>
>> why not !nofail ?
>>
>> This seems a correct fix, but it undermines the assumption made in
>> commit dd544141b9eb
>>   ("vmalloc: back off when the current task is OOM-killed")
>>
>> "
>>      This may trigger some hidden problems, when caller does not handle
>>      vmalloc failures, or when rollaback after failed vmalloc calls own
>>      vmallocs inside.  However all of these scenarios are incorrect: vmalloc
>>      does not guarantee successful allocation, it has never been called with
>>      __GFP_NOFAIL and threfore either should not be used for any rollbacks or
>>      should handle such errors correctly and not lead to critical failures.
>> "
>>
>> If a significant kvmalloc operation is performed with the NOFAIL flag, it risks
>> reverting the fix intended to address the OOM-killer issue in commit
>> dd544141b9eb.
>> Should we indeed permit the NOFAIL flag for large kvmalloc allocations?
> 
> Just from my perspective, I don't really care about kmalloc, vmalloc
> or kvmalloc (__GFP_NOFAIL).  I even don't care if it returns three
> order-0 pages or a high-order page.   I just would like to need a
> virtual consecutive buffer (even it works slowly.) with __GFP_NOFAIL.
> 
> Because in some cases, writing fallback code may be tough and hard to
> test if such fallback path is correct since it only triggers in extreme
> workloads, and even such buffers are just used in a very short lifetime.

add some words...

    ^ here extreme cases were mostly just generated by syzkaller fuzzing
tests, but if real users try to use some configuration to compress more,
I still think it needs to be handled (even such kvmalloc may be slow if
falling back to order-0 allocations due to memory pressures)

> Also see other FS discussion of __GFP_NOFAIL, e.g.
> https://lore.kernel.org/all/ZcUQfzfQ9R8X0s47@tiehlicka/
> 
> In the worst cases, it usually just needs < 5 order-0 pages (for many
> cases it only needs one page), but with kmalloc it will trigger WARN
> if it occurs to > order-1 allocation. as I mentioned before.
> 
> With my limited understanding I don't see why it could any problem with
> kvmalloc(__GFP_NOFAIL) since it has no difference of kmalloc(GFP_NOFAIL)
> with order-0 allocation.

.. kvmalloc with order-0 pages to form a virtual consecutive buffer
just like several kmalloc(__GFP_NOFAIL) allocations together in the
callers, I don't see any difference of memory pressure here.

Thanks,
Gao Xiang

> 
> 
> Thanks,
> Gao XIang
Gao Xiang May 9, 2024, 3:17 a.m. UTC | #14
On 2024/5/9 11:09, Barry Song wrote:
> On Thu, May 9, 2024 at 2:39 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>
>> Hi,
>>
>> On 2024/5/9 10:20, Barry Song wrote:
>>> On Thu, May 9, 2024 at 12:58 AM <hailong.liu@oppo.com> wrote:
>>>>
>>>> From: "Hailong.Liu" <hailong.liu@oppo.com>
>>>>
>>>> Commit a421ef303008 ("mm: allow !GFP_KERNEL allocations for kvmalloc")
>>>> includes support for __GFP_NOFAIL, but it presents a conflict with
>>>> commit dd544141b9eb ("vmalloc: back off when the current task is
>>>> OOM-killed"). A possible scenario is as belows:
>>>>
>>>> process-a
>>>> kvcalloc(n, m, GFP_KERNEL | __GFP_NOFAIL)
>>>>       __vmalloc_node_range()
>>>>           __vmalloc_area_node()
>>>>               vm_area_alloc_pages()
>>>>               --> oom-killer send SIGKILL to process-a
>>>>               if (fatal_signal_pending(current)) break;
>>>> --> return NULL;
>>>>
>>>> to fix this, do not check fatal_signal_pending() in vm_area_alloc_pages()
>>>> if __GFP_NOFAIL set.
>>>>
>>>> Reported-by: Oven <liyangouwen1@oppo.com>
>>>> Signed-off-by: Hailong.Liu <hailong.liu@oppo.com>
>>>> ---
>>>>    mm/vmalloc.c | 2 +-
>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>>>> index 6641be0ca80b..2f359d08bf8d 100644
>>>> --- a/mm/vmalloc.c
>>>> +++ b/mm/vmalloc.c
>>>> @@ -3560,7 +3560,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>>>>
>>>>           /* High-order pages or fallback path if "bulk" fails. */
>>>>           while (nr_allocated < nr_pages) {
>>>> -               if (fatal_signal_pending(current))
>>>> +               if (!(gfp & __GFP_NOFAIL) && fatal_signal_pending(current))
>>>>                           break;
>>>
>>> why not !nofail ?
>>>
>>> This seems a correct fix, but it undermines the assumption made in
>>> commit dd544141b9eb
>>>    ("vmalloc: back off when the current task is OOM-killed")
>>>
>>> "
>>>       This may trigger some hidden problems, when caller does not handle
>>>       vmalloc failures, or when rollaback after failed vmalloc calls own
>>>       vmallocs inside.  However all of these scenarios are incorrect: vmalloc
>>>       does not guarantee successful allocation, it has never been called with
>>>       __GFP_NOFAIL and threfore either should not be used for any rollbacks or
>>>       should handle such errors correctly and not lead to critical failures.
>>> "
>>>
>>> If a significant kvmalloc operation is performed with the NOFAIL flag, it risks
>>> reverting the fix intended to address the OOM-killer issue in commit
>>> dd544141b9eb.
>>> Should we indeed permit the NOFAIL flag for large kvmalloc allocations?
>>
>> Just from my perspective, I don't really care about kmalloc, vmalloc
>> or kvmalloc (__GFP_NOFAIL).  I even don't care if it returns three
>> order-0 pages or a high-order page.   I just would like to need a
>> virtual consecutive buffer (even it works slowly.) with __GFP_NOFAIL.
>>
>> Because in some cases, writing fallback code may be tough and hard to
>> test if such fallback path is correct since it only triggers in extreme
>> workloads, and even such buffers are just used in a very short lifetime.
>> Also see other FS discussion of __GFP_NOFAIL, e.g.
>> https://lore.kernel.org/all/ZcUQfzfQ9R8X0s47@tiehlicka/
>>
>> In the worst cases, it usually just needs < 5 order-0 pages (for many
>> cases it only needs one page), but with kmalloc it will trigger WARN
>> if it occurs to > order-1 allocation. as I mentioned before.
>>
>> With my limited understanding I don't see why it could any problem with
>> kvmalloc(__GFP_NOFAIL) since it has no difference of kmalloc(GFP_NOFAIL)
>> with order-0 allocation.
> 
> I completely understand that you're not concerned about the origin of
> the memory,
> such as whether it's organized by all zero-order pages. However, in the event
> that someone else allocates a large memory, like several megabytes with the
> NOFAIL flag, commit dd544141b9eb aims to halt the allocation before success
> if the process being allocated is targeted for termination of OOM-killer.
> 
> With the current patch, we miss the opportunity for early allocation
> termination.
> However, if the size of the kvmalloc() is small, as in your case, I
> believe it should
> be perfectly fine. but do we have any way to prevent large size allocation with
> NOFAIL?

I think large size order-0 virtual consecutive allocation prevention
should be done by the callers (e.g. EROFS) other than memory subsystem,
since it totally sounds like caller bugs.

It doesn't make sense for me to block getting a virtual consecutive buffer
which only needs more than two (e.g. three or five) order-0 pages in the
extreme cases.

Yes, I need to prevent very insane allocations but it's an on-disk hard
limitation of a filesystem itself.

Thanks,
Gao Xiang

> 
>>
>>
>> Thanks,
>> Gao XIang
> 
> Thanks
> Barry
Hailong Liu May 9, 2024, 3:22 a.m. UTC | #15
On Thu, 09. May 10:39, Gao Xiang wrote:
> Hi,
>
> On 2024/5/9 10:20, Barry Song wrote:
> > On Thu, May 9, 2024 at 12:58 AM <hailong.liu@oppo.com> wrote:
> > >
> > > From: "Hailong.Liu" <hailong.liu@oppo.com>
> > >
> > > Commit a421ef303008 ("mm: allow !GFP_KERNEL allocations for kvmalloc")
> > > includes support for __GFP_NOFAIL, but it presents a conflict with
> > > commit dd544141b9eb ("vmalloc: back off when the current task is
> > > OOM-killed"). A possible scenario is as belows:
> > >
> > > process-a
> > > kvcalloc(n, m, GFP_KERNEL | __GFP_NOFAIL)
> > >      __vmalloc_node_range()
> > >          __vmalloc_area_node()
> > >              vm_area_alloc_pages()
> > >              --> oom-killer send SIGKILL to process-a
> > >              if (fatal_signal_pending(current)) break;
> > > --> return NULL;
> > >
> > > to fix this, do not check fatal_signal_pending() in vm_area_alloc_pages()
> > > if __GFP_NOFAIL set.
> > >
> > > Reported-by: Oven <liyangouwen1@oppo.com>
> > > Signed-off-by: Hailong.Liu <hailong.liu@oppo.com>
> > > ---
> > >   mm/vmalloc.c | 2 +-
> > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index 6641be0ca80b..2f359d08bf8d 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -3560,7 +3560,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> > >
> > >          /* High-order pages or fallback path if "bulk" fails. */
> > >          while (nr_allocated < nr_pages) {
> > > -               if (fatal_signal_pending(current))
> > > +               if (!(gfp & __GFP_NOFAIL) && fatal_signal_pending(current))
> > >                          break;
> >
> > why not !nofail ?
> >
> > This seems a correct fix, but it undermines the assumption made in
> > commit dd544141b9eb
> >   ("vmalloc: back off when the current task is OOM-killed")
> >
> > "
> >      This may trigger some hidden problems, when caller does not handle
> >      vmalloc failures, or when rollaback after failed vmalloc calls own
> >      vmallocs inside.  However all of these scenarios are incorrect: vmalloc
> >      does not guarantee successful allocation, it has never been called with
> >      __GFP_NOFAIL and threfore either should not be used for any rollbacks or
> >      should handle such errors correctly and not lead to critical failures.
> > "
j> >
> > If a significant kvmalloc operation is performed with the NOFAIL flag, it risks
> > reverting the fix intended to address the OOM-killer issue in commit
> > dd544141b9eb.
> > Should we indeed permit the NOFAIL flag for large kvmalloc allocations?
>
> Just from my perspective, I don't really care about kmalloc, vmalloc
> or kvmalloc (__GFP_NOFAIL).  I even don't care if it returns three
> order-0 pages or a high-order page.   I just would like to need a
> virtual consecutive buffer (even it works slowly.) with __GFP_NOFAIL.
>
> Because in some cases, writing fallback code may be tough and hard to
> test if such fallback path is correct since it only triggers in extreme
> workloads, and even such buffers are just used in a very short lifetime.
> Also see other FS discussion of __GFP_NOFAIL, e.g.
> https://lore.kernel.org/all/ZcUQfzfQ9R8X0s47@tiehlicka/
>
> In the worst cases, it usually just needs < 5 order-0 pages (for many
> cases it only needs one page), but with kmalloc it will trigger WARN
> if it occurs to > order-1 allocation. as I mentioned before.
>
> With my limited understanding I don't see why it could any problem with
> kvmalloc(__GFP_NOFAIL) since it has no difference of kmalloc(GFP_NOFAIL)
> with order-0 allocation.
>
I totally agree with you. so I fixed this in vmalloc or not in erofs. However,
there are still some extreme scenarios that cause vmalloc to return NULL,
such as vmap failure. Let's see what the mm experts think.

>
> Thanks,
> Gao XIang
>

--

Best Regards,
Hailong.
Hailong Liu May 9, 2024, 3:33 a.m. UTC | #16
On Thu, 09. May 14:20, Barry Song wrote:
> On Thu, May 9, 2024 at 12:58 AM <hailong.liu@oppo.com> wrote:
> >
> > From: "Hailong.Liu" <hailong.liu@oppo.com>
> >
> > Commit a421ef303008 ("mm: allow !GFP_KERNEL allocations for kvmalloc")
> > includes support for __GFP_NOFAIL, but it presents a conflict with
> > commit dd544141b9eb ("vmalloc: back off when the current task is
> > OOM-killed"). A possible scenario is as belows:
> >
> > process-a
> > kvcalloc(n, m, GFP_KERNEL | __GFP_NOFAIL)
> >     __vmalloc_node_range()
> >         __vmalloc_area_node()
> >             vm_area_alloc_pages()
> >             --> oom-killer send SIGKILL to process-a
> >             if (fatal_signal_pending(current)) break;
> > --> return NULL;
> >
> > to fix this, do not check fatal_signal_pending() in vm_area_alloc_pages()
> > if __GFP_NOFAIL set.
> >
> > Reported-by: Oven <liyangouwen1@oppo.com>
> > Signed-off-by: Hailong.Liu <hailong.liu@oppo.com>
> > ---
> >  mm/vmalloc.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 6641be0ca80b..2f359d08bf8d 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -3560,7 +3560,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> >
> >         /* High-order pages or fallback path if "bulk" fails. */
> >         while (nr_allocated < nr_pages) {
> > -               if (fatal_signal_pending(current))
> > +               if (!(gfp & __GFP_NOFAIL) && fatal_signal_pending(current))
> >                         break;
>
> why not !nofail ?

if order = 0, nofail would not be set true in bulk allocator. in such a case,
it is still possible to break early

>
> This seems a correct fix, but it undermines the assumption made in
> commit dd544141b9eb
>  ("vmalloc: back off when the current task is OOM-killed")
>
> "
>     This may trigger some hidden problems, when caller does not handle
>     vmalloc failures, or when rollaback after failed vmalloc calls own
>     vmallocs inside.  However all of these scenarios are incorrect: vmalloc
>     does not guarantee successful allocation, it has never been called with
>     __GFP_NOFAIL and threfore either should not be used for any rollbacks or
>     should handle such errors correctly and not lead to critical failures.
> "
>
> If a significant kvmalloc operation is performed with the NOFAIL flag, it risks
> reverting the fix intended to address the OOM-killer issue in commit
> dd544141b9eb.
> Should we indeed permit the NOFAIL flag for large kvmalloc allocations?

IMO, if we encounter this issue, it should be fixed by the
caller, not here.
>

>
> Thanks
> Barry

--

Best Regards,
Hailong.
Barry Song May 9, 2024, 3:48 a.m. UTC | #17
On Thu, May 9, 2024 at 3:33 PM Hailong Liu <hailong.liu@oppo.com> wrote:
>
> On Thu, 09. May 14:20, Barry Song wrote:
> > On Thu, May 9, 2024 at 12:58 AM <hailong.liu@oppo.com> wrote:
> > >
> > > From: "Hailong.Liu" <hailong.liu@oppo.com>
> > >
> > > Commit a421ef303008 ("mm: allow !GFP_KERNEL allocations for kvmalloc")
> > > includes support for __GFP_NOFAIL, but it presents a conflict with
> > > commit dd544141b9eb ("vmalloc: back off when the current task is
> > > OOM-killed"). A possible scenario is as belows:
> > >
> > > process-a
> > > kvcalloc(n, m, GFP_KERNEL | __GFP_NOFAIL)
> > >     __vmalloc_node_range()
> > >         __vmalloc_area_node()
> > >             vm_area_alloc_pages()
> > >             --> oom-killer send SIGKILL to process-a
> > >             if (fatal_signal_pending(current)) break;
> > > --> return NULL;
> > >
> > > to fix this, do not check fatal_signal_pending() in vm_area_alloc_pages()
> > > if __GFP_NOFAIL set.
> > >
> > > Reported-by: Oven <liyangouwen1@oppo.com>
> > > Signed-off-by: Hailong.Liu <hailong.liu@oppo.com>
> > > ---
> > >  mm/vmalloc.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index 6641be0ca80b..2f359d08bf8d 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -3560,7 +3560,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> > >
> > >         /* High-order pages or fallback path if "bulk" fails. */
> > >         while (nr_allocated < nr_pages) {
> > > -               if (fatal_signal_pending(current))
> > > +               if (!(gfp & __GFP_NOFAIL) && fatal_signal_pending(current))
> > >                         break;
> >
> > why not !nofail ?
>
> if order = 0, nofail would not be set true in bulk allocator. in such a case,
> it is still possible to break early
>
> >
> > This seems a correct fix, but it undermines the assumption made in
> > commit dd544141b9eb
> >  ("vmalloc: back off when the current task is OOM-killed")
> >
> > "
> >     This may trigger some hidden problems, when caller does not handle
> >     vmalloc failures, or when rollaback after failed vmalloc calls own
> >     vmallocs inside.  However all of these scenarios are incorrect: vmalloc
> >     does not guarantee successful allocation, it has never been called with
> >     __GFP_NOFAIL and threfore either should not be used for any rollbacks or
> >     should handle such errors correctly and not lead to critical failures.
> > "
> >
> > If a significant kvmalloc operation is performed with the NOFAIL flag, it risks
> > reverting the fix intended to address the OOM-killer issue in commit
> > dd544141b9eb.
> > Should we indeed permit the NOFAIL flag for large kvmalloc allocations?
>
> IMO, if we encounter this issue, it should be fixed by the
> caller, not here.

I agree. but could we WARN_ON a large kvmalloc(NOFAIL) allocation?

> >
>
> >
> > Thanks
> > Barry
>
> --
>
> Best Regards,
> Hailong.
Gao Xiang May 9, 2024, 4:19 a.m. UTC | #18
On 2024/5/9 11:48, Barry Song wrote:
> On Thu, May 9, 2024 at 3:33 PM Hailong Liu <hailong.liu@oppo.com> wrote:
>>
>> On Thu, 09. May 14:20, Barry Song wrote:
>>> On Thu, May 9, 2024 at 12:58 AM <hailong.liu@oppo.com> wrote:
>>>>
>>>> From: "Hailong.Liu" <hailong.liu@oppo.com>
>>>>
>>>> Commit a421ef303008 ("mm: allow !GFP_KERNEL allocations for kvmalloc")
>>>> includes support for __GFP_NOFAIL, but it presents a conflict with
>>>> commit dd544141b9eb ("vmalloc: back off when the current task is
>>>> OOM-killed"). A possible scenario is as belows:
>>>>
>>>> process-a
>>>> kvcalloc(n, m, GFP_KERNEL | __GFP_NOFAIL)
>>>>      __vmalloc_node_range()
>>>>          __vmalloc_area_node()
>>>>              vm_area_alloc_pages()
>>>>              --> oom-killer send SIGKILL to process-a
>>>>              if (fatal_signal_pending(current)) break;
>>>> --> return NULL;
>>>>
>>>> to fix this, do not check fatal_signal_pending() in vm_area_alloc_pages()
>>>> if __GFP_NOFAIL set.
>>>>
>>>> Reported-by: Oven <liyangouwen1@oppo.com>
>>>> Signed-off-by: Hailong.Liu <hailong.liu@oppo.com>
>>>> ---
>>>>   mm/vmalloc.c | 2 +-
>>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>>>> index 6641be0ca80b..2f359d08bf8d 100644
>>>> --- a/mm/vmalloc.c
>>>> +++ b/mm/vmalloc.c
>>>> @@ -3560,7 +3560,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>>>>
>>>>          /* High-order pages or fallback path if "bulk" fails. */
>>>>          while (nr_allocated < nr_pages) {
>>>> -               if (fatal_signal_pending(current))
>>>> +               if (!(gfp & __GFP_NOFAIL) && fatal_signal_pending(current))
>>>>                          break;
>>>
>>> why not !nofail ?
>>
>> if order = 0, nofail would not be set true in bulk allocator. in such a case,
>> it is still possible to break early
>>
>>>
>>> This seems a correct fix, but it undermines the assumption made in
>>> commit dd544141b9eb
>>>   ("vmalloc: back off when the current task is OOM-killed")
>>>
>>> "
>>>      This may trigger some hidden problems, when caller does not handle
>>>      vmalloc failures, or when rollaback after failed vmalloc calls own
>>>      vmallocs inside.  However all of these scenarios are incorrect: vmalloc
>>>      does not guarantee successful allocation, it has never been called with
>>>      __GFP_NOFAIL and threfore either should not be used for any rollbacks or
>>>      should handle such errors correctly and not lead to critical failures.
>>> "
>>>
>>> If a significant kvmalloc operation is performed with the NOFAIL flag, it risks
>>> reverting the fix intended to address the OOM-killer issue in commit
>>> dd544141b9eb.
>>> Should we indeed permit the NOFAIL flag for large kvmalloc allocations?
>>
>> IMO, if we encounter this issue, it should be fixed by the
>> caller, not here.
> 
> I agree. but could we WARN_ON a large kvmalloc(NOFAIL) allocation?

For me, kvmalloc(~24k) sounds good to handle extreme cases (for extreme
compression), only specific users for file archiving (and fuzzers) are
expected to use this configuration.

Anyway, it's just my own ideas.

Thanks,
Gao Xiang

> 
>>>
>>
>>>
>>> Thanks
>>> Barry
>>
>> --
>>
>> Best Regards,
>> Hailong.
Christoph Hellwig May 9, 2024, 4:51 a.m. UTC | #19
On Thu, May 09, 2024 at 09:30:59AM +0800, Hailong Liu wrote:
> I’m not suggesting that erofs would cause a memleak. What I mean is
> that if kvmalloc is invoked with __GFP_NOFAIL, it must ensure a non-NULL
> return, even in scenarios where memory leaks caused by other processes
> result in the inability to allocate a page. In such a situation, it
> should result in “Kernel panic - not syncing: System is deadlocked
> on memory”.

Yes.  __GFP_NOFAIL is a contract that says never ever return NULL.
The callers will generally not handle a NULL return and blindly
dereference it, leading to all kinds of nasty security issues.

Note that deadlocking would be nice, but at least it is just a
denial of service and not a possible privilege escalation.
Christoph Hellwig May 9, 2024, 4:52 a.m. UTC | #20
On Thu, May 09, 2024 at 02:20:03PM +1200, Barry Song wrote:
> reverting the fix intended to address the OOM-killer issue in commit
> dd544141b9eb.
> Should we indeed permit the NOFAIL flag for large kvmalloc allocations?

What is large?  When you don't allow actually use cases people will
just reimplement it poorly.  E.g. we'd probably have to add back the
XFS kmem_ wrappers.
Barry Song May 9, 2024, 6:12 a.m. UTC | #21
On Thu, May 9, 2024 at 4:52 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, May 09, 2024 at 02:20:03PM +1200, Barry Song wrote:
> > reverting the fix intended to address the OOM-killer issue in commit
> > dd544141b9eb.
> > Should we indeed permit the NOFAIL flag for large kvmalloc allocations?
>
> What is large?  When you don't allow actually use cases people will
> just reimplement it poorly.  E.g. we'd probably have to add back the
> XFS kmem_ wrappers.

Xiang gave his number 24KiB in erofs, but probably we still have some
"naughty" users allocating much more than 24KiB with NOFAIL in other
subsystems. We should never return NULL for NOFAIL.  So, in any case,
we require Hailong's patch in some form.

However, commit dd544141b9eb ("vmalloc: back off when the current task is
OOM-killed") is also tackling an issue. If we can find a way to preserve its
benefits even in the NOFAIL scenario, it would be preferable, though it
seems improbable. Thus, I'm considering if we could at least include a
WARN_ON_ONCE((gfp & NOFAIL) && size > LARGE) to aid in debugging
potential issues we might encounter. Furthermore, compelling a large
allocation with NOFAIL also appears pointless, as it could lead to
unpredictable long latency.

But I don't know what the proper "LARGE" value is.

Thanks
Barry
Michal Hocko May 9, 2024, 7:48 a.m. UTC | #22
On Wed 08-05-24 20:58:08, hailong.liu@oppo.com wrote:
> From: "Hailong.Liu" <hailong.liu@oppo.com>
> 
> Commit a421ef303008 ("mm: allow !GFP_KERNEL allocations for kvmalloc")
> includes support for __GFP_NOFAIL, but it presents a conflict with
> commit dd544141b9eb ("vmalloc: back off when the current task is
> OOM-killed"). A possible scenario is as belows:
> 
> process-a
> kvcalloc(n, m, GFP_KERNEL | __GFP_NOFAIL)
>     __vmalloc_node_range()
> 	__vmalloc_area_node()
> 	    vm_area_alloc_pages()
>             --> oom-killer send SIGKILL to process-a
>             if (fatal_signal_pending(current)) break;
> --> return NULL;
> 
> to fix this, do not check fatal_signal_pending() in vm_area_alloc_pages()
> if __GFP_NOFAIL set.
> 
> Reported-by: Oven <liyangouwen1@oppo.com>
> Signed-off-by: Hailong.Liu <hailong.liu@oppo.com>
> ---
>  mm/vmalloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 6641be0ca80b..2f359d08bf8d 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3560,7 +3560,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> 
>  	/* High-order pages or fallback path if "bulk" fails. */
>  	while (nr_allocated < nr_pages) {
> -		if (fatal_signal_pending(current))
> +		if (!(gfp & __GFP_NOFAIL) && fatal_signal_pending(current))

Use nofail instead of gfp & __GFP_NOFAIL.

Other than that looks good to me. After that is fixed, please feel free
to add Acked-by: Michal Hocko <mhocko@suse.com>

I believe this should also have Fixes: 9376130c390a ("mm/vmalloc: add support for __GFP_NOFAIL")
Hailong Liu May 9, 2024, 8:06 a.m. UTC | #23
On Thu, 09. May 09:48, Michal Hocko wrote:
> On Wed 08-05-24 20:58:08, hailong.liu@oppo.com wrote:
> > From: "Hailong.Liu" <hailong.liu@oppo.com>
> >
> > Commit a421ef303008 ("mm: allow !GFP_KERNEL allocations for kvmalloc")
> > includes support for __GFP_NOFAIL, but it presents a conflict with
> > commit dd544141b9eb ("vmalloc: back off when the current task is
> > OOM-killed"). A possible scenario is as belows:
> >
> > process-a
> > kvcalloc(n, m, GFP_KERNEL | __GFP_NOFAIL)
> >     __vmalloc_node_range()
> > 	__vmalloc_area_node()
> > 	    vm_area_alloc_pages()
> >             --> oom-killer send SIGKILL to process-a
> >             if (fatal_signal_pending(current)) break;
> > --> return NULL;
> >
> > to fix this, do not check fatal_signal_pending() in vm_area_alloc_pages()
> > if __GFP_NOFAIL set.
> >
> > Reported-by: Oven <liyangouwen1@oppo.com>
> > Signed-off-by: Hailong.Liu <hailong.liu@oppo.com>
> > ---
> >  mm/vmalloc.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 6641be0ca80b..2f359d08bf8d 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -3560,7 +3560,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> >
> >  	/* High-order pages or fallback path if "bulk" fails. */
> >  	while (nr_allocated < nr_pages) {
> > -		if (fatal_signal_pending(current))
> > +		if (!(gfp & __GFP_NOFAIL) && fatal_signal_pending(current))
>
> Use nofail instead of gfp & __GFP_NOFAIL.
>
> Other than that looks good to me. After that is fixed, please feel free
> to add Acked-by: Michal Hocko <mhocko@suse.com>
>
> I believe this should also have Fixes: 9376130c390a ("mm/vmalloc: add support for __GFP_NOFAIL")
> --
> Michal Hocko
> SUSE Labs

Thanks for the review and the Ack!

Add Fixes in V2 patch.

IIUC, nofail could not used for this case.

	/*
	 * For order-0 pages we make use of bulk allocator, if
	 * the page array is partly or not at all populated due
	 * to fails, fallback to a single page allocator that is
	 * more permissive.
	 */
	if (!order) {
		/* bulk allocator doesn't support nofail req. officially */
		xxx
-> nofail = false;
	} else if (gfp & __GFP_NOFAIL) {
		/*
		 * Higher order nofail allocations are really expensive and
		 * potentially dangerous (pre-mature OOM, disruptive reclaim
		 * and compaction etc.
		 */
		alloc_gfp &= ~__GFP_NOFAIL;
		nofail = true;
	}

	/* High-order pages or fallback path if "bulk" fails. */
	while (nr_allocated < nr_pages) {

-> nofail is false here if bulk allocator fails.
		if (fatal_signal_pending(current))
			break;

--

Best Regards,
Hailong.
Barry Song May 9, 2024, 8:32 a.m. UTC | #24
On Thu, May 9, 2024 at 8:21 PM Hailong Liu <hailong.liu@oppo.com> wrote:
>
> On Thu, 09. May 09:48, Michal Hocko wrote:
> > On Wed 08-05-24 20:58:08, hailong.liu@oppo.com wrote:
> > > From: "Hailong.Liu" <hailong.liu@oppo.com>
> > >
> > > Commit a421ef303008 ("mm: allow !GFP_KERNEL allocations for kvmalloc")
> > > includes support for __GFP_NOFAIL, but it presents a conflict with
> > > commit dd544141b9eb ("vmalloc: back off when the current task is
> > > OOM-killed"). A possible scenario is as belows:
> > >
> > > process-a
> > > kvcalloc(n, m, GFP_KERNEL | __GFP_NOFAIL)
> > >     __vmalloc_node_range()
> > >     __vmalloc_area_node()
> > >         vm_area_alloc_pages()
> > >             --> oom-killer send SIGKILL to process-a
> > >             if (fatal_signal_pending(current)) break;
> > > --> return NULL;
> > >
> > > to fix this, do not check fatal_signal_pending() in vm_area_alloc_pages()
> > > if __GFP_NOFAIL set.
> > >
> > > Reported-by: Oven <liyangouwen1@oppo.com>
> > > Signed-off-by: Hailong.Liu <hailong.liu@oppo.com>
> > > ---
> > >  mm/vmalloc.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index 6641be0ca80b..2f359d08bf8d 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -3560,7 +3560,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> > >
> > >     /* High-order pages or fallback path if "bulk" fails. */
> > >     while (nr_allocated < nr_pages) {
> > > -           if (fatal_signal_pending(current))
> > > +           if (!(gfp & __GFP_NOFAIL) && fatal_signal_pending(current))
> >
> > Use nofail instead of gfp & __GFP_NOFAIL.
> >
> > Other than that looks good to me. After that is fixed, please feel free
> > to add Acked-by: Michal Hocko <mhocko@suse.com>
> >
> > I believe this should also have Fixes: 9376130c390a ("mm/vmalloc: add support for __GFP_NOFAIL")
> > --
> > Michal Hocko
> > SUSE Labs
>
> Thanks for the review and the Ack!
>
> Add Fixes in V2 patch.
>
> IIUC, nofail could not used for this case.
>
>         /*
>          * For order-0 pages we make use of bulk allocator, if
>          * the page array is partly or not at all populated due
>          * to fails, fallback to a single page allocator that is
>          * more permissive.
>          */
>         if (!order) {
>                 /* bulk allocator doesn't support nofail req. officially */
>                 xxx
> -> nofail = false;

isn't it another bug that needs a fix?

>         } else if (gfp & __GFP_NOFAIL) {
>                 /*
>                  * Higher order nofail allocations are really expensive and
>                  * potentially dangerous (pre-mature OOM, disruptive reclaim
>                  * and compaction etc.
>                  */
>                 alloc_gfp &= ~__GFP_NOFAIL;
>                 nofail = true;
>         }
>
>         /* High-order pages or fallback path if "bulk" fails. */
>         while (nr_allocated < nr_pages) {
>
> -> nofail is false here if bulk allocator fails.
>                 if (fatal_signal_pending(current))
>                         break;
>
> --
>
> Best Regards,
> Hailong.
Barry Song May 9, 2024, 8:57 a.m. UTC | #25
On Thu, May 9, 2024 at 8:32 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, May 9, 2024 at 8:21 PM Hailong Liu <hailong.liu@oppo.com> wrote:
> >
> > On Thu, 09. May 09:48, Michal Hocko wrote:
> > > On Wed 08-05-24 20:58:08, hailong.liu@oppo.com wrote:
> > > > From: "Hailong.Liu" <hailong.liu@oppo.com>
> > > >
> > > > Commit a421ef303008 ("mm: allow !GFP_KERNEL allocations for kvmalloc")
> > > > includes support for __GFP_NOFAIL, but it presents a conflict with
> > > > commit dd544141b9eb ("vmalloc: back off when the current task is
> > > > OOM-killed"). A possible scenario is as belows:
> > > >
> > > > process-a
> > > > kvcalloc(n, m, GFP_KERNEL | __GFP_NOFAIL)
> > > >     __vmalloc_node_range()
> > > >     __vmalloc_area_node()
> > > >         vm_area_alloc_pages()
> > > >             --> oom-killer send SIGKILL to process-a
> > > >             if (fatal_signal_pending(current)) break;
> > > > --> return NULL;
> > > >
> > > > to fix this, do not check fatal_signal_pending() in vm_area_alloc_pages()
> > > > if __GFP_NOFAIL set.
> > > >
> > > > Reported-by: Oven <liyangouwen1@oppo.com>
> > > > Signed-off-by: Hailong.Liu <hailong.liu@oppo.com>
> > > > ---
> > > >  mm/vmalloc.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > index 6641be0ca80b..2f359d08bf8d 100644
> > > > --- a/mm/vmalloc.c
> > > > +++ b/mm/vmalloc.c
> > > > @@ -3560,7 +3560,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> > > >
> > > >     /* High-order pages or fallback path if "bulk" fails. */
> > > >     while (nr_allocated < nr_pages) {
> > > > -           if (fatal_signal_pending(current))
> > > > +           if (!(gfp & __GFP_NOFAIL) && fatal_signal_pending(current))
> > >
> > > Use nofail instead of gfp & __GFP_NOFAIL.
> > >
> > > Other than that looks good to me. After that is fixed, please feel free
> > > to add Acked-by: Michal Hocko <mhocko@suse.com>
> > >
> > > I believe this should also have Fixes: 9376130c390a ("mm/vmalloc: add support for __GFP_NOFAIL")
> > > --
> > > Michal Hocko
> > > SUSE Labs
> >
> > Thanks for the review and the Ack!
> >
> > Add Fixes in V2 patch.
> >
> > IIUC, nofail could not used for this case.
> >
> >         /*
> >          * For order-0 pages we make use of bulk allocator, if
> >          * the page array is partly or not at all populated due
> >          * to fails, fallback to a single page allocator that is
> >          * more permissive.
> >          */
> >         if (!order) {
> >                 /* bulk allocator doesn't support nofail req. officially */
> >                 xxx
> > -> nofail = false;
>
> isn't it another bug that needs a fix?

Upon further examination, it's not a bug, but we can still utilize 'nofail'.
The current code is very hard to read about gfp and "nofail" :-)

maybe:

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 6641be0ca80b..7c66fe16c2ad 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3498,7 +3498,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 {
        unsigned int nr_allocated = 0;
        gfp_t alloc_gfp = gfp;
-       bool nofail = false;
+       bool nofail = !!(gfp & __GFP_NOFAIL);
        struct page *page;
        int i;

@@ -3555,7 +3555,6 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
                 * and compaction etc.
                 */
                alloc_gfp &= ~__GFP_NOFAIL;
-               nofail = true;
        }

        /* High-order pages or fallback path if "bulk" fails. */

>
> >         } else if (gfp & __GFP_NOFAIL) {
> >                 /*
> >                  * Higher order nofail allocations are really expensive and
> >                  * potentially dangerous (pre-mature OOM, disruptive reclaim
> >                  * and compaction etc.
> >                  */
> >                 alloc_gfp &= ~__GFP_NOFAIL;
> >                 nofail = true;
> >         }
> >
> >         /* High-order pages or fallback path if "bulk" fails. */
> >         while (nr_allocated < nr_pages) {
> >
> > -> nofail is false here if bulk allocator fails.
> >                 if (fatal_signal_pending(current))
> >                         break;
> >
> > --
> >
> > Best Regards,
> > Hailong.
Hailong Liu May 9, 2024, 9:50 a.m. UTC | #26
On Thu, 09. May 20:57, Barry Song wrote:
>
> Upon further examination, it's not a bug, but we can still utilize 'nofail'.
> The current code is very hard to read about gfp and "nofail" :-)
>
> maybe:
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 6641be0ca80b..7c66fe16c2ad 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3498,7 +3498,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>  {
>         unsigned int nr_allocated = 0;
>         gfp_t alloc_gfp = gfp;
> -       bool nofail = false;
> +       bool nofail = !!(gfp & __GFP_NOFAIL);
>         struct page *page;
>         int i;
>
> @@ -3555,7 +3555,6 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>                  * and compaction etc.
>                  */
>                 alloc_gfp &= ~__GFP_NOFAIL;
> -               nofail = true;
>         }

Thanks for suggestion. I think that makes more clearly. Will
try it in next version.

--

Best Regards,
Hailong.
diff mbox series

Patch

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 6641be0ca80b..2f359d08bf8d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3560,7 +3560,7 @@  vm_area_alloc_pages(gfp_t gfp, int nid,

 	/* High-order pages or fallback path if "bulk" fails. */
 	while (nr_allocated < nr_pages) {
-		if (fatal_signal_pending(current))
+		if (!(gfp & __GFP_NOFAIL) && fatal_signal_pending(current))
 			break;

 		if (nid == NUMA_NO_NODE)