[mm] vmalloc: back off when the current task is OOM-killed

Message ID d07a5540-3e07-44ba-1e59-067500f024d9@virtuozzo.com (mailing list archive)
State New
Series [mm] vmalloc: back off when the current task is OOM-killed

Commit Message

Vasily Averin Sept. 17, 2021, 8:06 a.m. UTC
A huge vmalloc allocation on a heavily loaded node can lead to a global
memory shortage. A task that called vmalloc can have the worst badness
and be chosen by the OOM-killer; however, neither the received fatal
signal nor the oom victim mark interrupts the allocation cycle. Vmalloc
will continue allocating pages over and over again, exacerbating the
crisis and consuming the memory freed up by other killed tasks.

This patch allows the OOM-killer to break the vmalloc cycle, which makes
OOM handling more effective and avoids a host panic.

Unfortunately, this is not 100% safe. A previous attempt to break the
vmalloc cycle was reverted by commit b8c8a338f75e ("Revert "vmalloc: back
off when the current task is killed"") because some vmalloc callers did
not handle failures properly. The issues found then were resolved;
however, there may be other similar places.

Such failures may be acceptable for emergencies, such as OOM. On the
other hand, we would like to detect them earlier. However, they are
quite rare and will be hidden by OOM messages, so I'm afraid they will
have quite a small chance of being noticed and reported.

To improve the detection of such places, this patch also interrupts the
vmalloc allocation cycle for all fatal signals. The checks are hidden
behind the DEBUG_VM config option so as not to break unaware production
kernels.

Vmalloc uses the new alloc_pages_bulk subsystem, so the newly added
checks can affect other users of this subsystem.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 mm/page_alloc.c | 5 +++++
 mm/vmalloc.c    | 6 ++++++
 2 files changed, 11 insertions(+)

Comments

Andrew Morton Sept. 19, 2021, 11:31 p.m. UTC | #1
On Fri, 17 Sep 2021 11:06:49 +0300 Vasily Averin <vvs@virtuozzo.com> wrote:

> A huge vmalloc allocation on a heavily loaded node can lead to a global
> memory shortage. A task that called vmalloc can have the worst badness
> and be chosen by the OOM-killer; however, neither the received fatal
> signal nor the oom victim mark interrupts the allocation cycle. Vmalloc
> will continue allocating pages over and over again, exacerbating the
> crisis and consuming the memory freed up by other killed tasks.
> 
> This patch allows the OOM-killer to break the vmalloc cycle, which makes
> OOM handling more effective and avoids a host panic.
> 
> Unfortunately, this is not 100% safe. A previous attempt to break the
> vmalloc cycle was reverted by commit b8c8a338f75e ("Revert "vmalloc: back
> off when the current task is killed"") because some vmalloc callers did
> not handle failures properly. The issues found then were resolved;
> however, there may be other similar places.

Well that was lame of us.

I believe that at least one of the kernel testbots can utilize fault
injection.  If we were to wire up vmalloc (as we have done with slab
and pagealloc), it would help to locate such buggy vmalloc callers.

> Such failures may be acceptable for emergencies, such as OOM. On the
> other hand, we would like to detect them earlier. However, they are
> quite rare and will be hidden by OOM messages, so I'm afraid they will
> have quite a small chance of being noticed and reported.
> 
> To improve the detection of such places, this patch also interrupts the
> vmalloc allocation cycle for all fatal signals. The checks are hidden
> behind the DEBUG_VM config option so as not to break unaware production
> kernels.

This sounds like a pretty sad half-measure?
Tetsuo Handa Sept. 20, 2021, 1:22 a.m. UTC | #2
On 2021/09/20 8:31, Andrew Morton wrote:
> On Fri, 17 Sep 2021 11:06:49 +0300 Vasily Averin <vvs@virtuozzo.com> wrote:
> 
>> A huge vmalloc allocation on a heavily loaded node can lead to a global
>> memory shortage. A task that called vmalloc can have the worst badness
>> and be chosen by the OOM-killer; however, neither the received fatal
>> signal nor the oom victim mark interrupts the allocation cycle. Vmalloc
>> will continue allocating pages over and over again, exacerbating the
>> crisis and consuming the memory freed up by other killed tasks.
>>
>> This patch allows the OOM-killer to break the vmalloc cycle, which makes
>> OOM handling more effective and avoids a host panic.
>>
>> Unfortunately, this is not 100% safe. A previous attempt to break the
>> vmalloc cycle was reverted by commit b8c8a338f75e ("Revert "vmalloc: back
>> off when the current task is killed"") because some vmalloc callers did
>> not handle failures properly. The issues found then were resolved;
>> however, there may be other similar places.
> 
> Well that was lame of us.
> 
> I believe that at least one of the kernel testbots can utilize fault
> injection.  If we were to wire up vmalloc (as we have done with slab
> and pagealloc), it would help to locate such buggy vmalloc callers.

__alloc_pages_bulk() has three callers.

  alloc_pages_bulk_list() => No in-tree users.

  alloc_pages_bulk_array() => Used by xfs_buf_alloc_pages(), __page_pool_alloc_pages_slow(), svc_alloc_arg().

    xfs_buf_alloc_pages() => Might retry forever until all pages are allocated (i.e. effectively __GFP_NOFAIL). This patch can cause an infinite-loop problem (see the sketch after this list).

    __page_pool_alloc_pages_slow() => Will not retry if allocation failed. This patch might help.

    svc_alloc_arg() => Will not retry if signal pending. This patch might help only if allocating a lot of pages.

  alloc_pages_bulk_array_node() => Used by vm_area_alloc_pages().

vm_area_alloc_pages() => Used by __vmalloc_area_node() from __vmalloc_node_range() from vmalloc functions. Needs !__GFP_NOFAIL check?
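
A minimal sketch of the retry pattern in question, assuming a caller shaped
like xfs_buf_alloc_pages(); struct example_buf and example_alloc_all_pages()
are hypothetical names, not the actual xfs code:

#include <linux/gfp.h>
#include <linux/backing-dev.h>

/* Hypothetical stand-in for the caller's buffer object. */
struct example_buf {
	struct page **b_pages;
};

/*
 * Loop until the bulk allocator has populated every slot -- effectively
 * __GFP_NOFAIL.  If __alloc_pages_bulk() bails out early for an OOM
 * victim and never makes forward progress, this loop never terminates.
 */
static int example_alloc_all_pages(struct example_buf *bp, long nr_pages)
{
	long filled = 0;

	for (;;) {
		filled = alloc_pages_bulk_array(GFP_KERNEL, nr_pages,
						bp->b_pages);
		if (filled == nr_pages)
			return 0;	/* all slots populated */

		/* No forward progress yet: wait briefly and retry. */
		congestion_wait(BLK_RW_ASYNC, HZ / 50);
	}
}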
Vasily Averin Sept. 20, 2021, 10:59 a.m. UTC | #3
On 9/20/21 4:22 AM, Tetsuo Handa wrote:
> On 2021/09/20 8:31, Andrew Morton wrote:
>> On Fri, 17 Sep 2021 11:06:49 +0300 Vasily Averin <vvs@virtuozzo.com> wrote:
>>
>>> A huge vmalloc allocation on a heavily loaded node can lead to a global
>>> memory shortage. A task that called vmalloc can have the worst badness
>>> and be chosen by the OOM-killer; however, neither the received fatal
>>> signal nor the oom victim mark interrupts the allocation cycle. Vmalloc
>>> will continue allocating pages over and over again, exacerbating the
>>> crisis and consuming the memory freed up by other killed tasks.
>>>
>>> This patch allows the OOM-killer to break the vmalloc cycle, which makes
>>> OOM handling more effective and avoids a host panic.
>>>
>>> Unfortunately, this is not 100% safe. A previous attempt to break the
>>> vmalloc cycle was reverted by commit b8c8a338f75e ("Revert "vmalloc: back
>>> off when the current task is killed"") because some vmalloc callers did
>>> not handle failures properly. The issues found then were resolved;
>>> however, there may be other similar places.
>>
>> Well that was lame of us.
>>
>> I believe that at least one of the kernel testbots can utilize fault
>> injection.  If we were to wire up vmalloc (as we have done with slab
>> and pagealloc), it would help to locate such buggy vmalloc callers.

Andrew, could you please clarify how we can do it?
Do you mean we can use the existing allocation fault injection
infrastructure to trigger this kind of issue? Unfortunately, I found no
way to reach this goal. It allows emulating single faults with a small
probability; however, that is not enough: we need to completely disable
all vmalloc allocations.
I've tried to extend the fault injection infrastructure, but found that
it is not trivial.

That's why I've added a direct fatal_signal_pending() check into my patch.
 
> __alloc_pages_bulk() has three callers.
> 
>   alloc_pages_bulk_list() => No in-tree users.
> 
>   alloc_pages_bulk_array() => Used by xfs_buf_alloc_pages(), __page_pool_alloc_pages_slow(), svc_alloc_arg().
> 
>     xfs_buf_alloc_pages() => Might retry forever until all pages are allocated (i.e. effectively __GFP_NOFAIL). This patch can cause an infinite-loop problem.

You are right, I missed it.
However, __alloc_pages_bulk() can return no new pages even without my patch:
- due to fault injection inside prepare_alloc_pages(),
- if __rmqueue_pcplist() returns NULL and the array already had some assigned pages,
- if both __rmqueue_pcplist() and the following __alloc_pages(0) cannot get any page.
On the other hand, I cannot say that it is a 100% xfs-related issue; it looks
strange, but they have some chance to get a page after a few attempts.

So I think I can change the 'break' to 'goto failed_irq', call __alloc_pages(0),
and return one page. That seems to be handled correctly in all callers too.
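
A rough, untested sketch of that alternative against the hunk in this patch;
the failed_irq label and the single-page fallback already exist at the end of
__alloc_pages_bulk() in mm/page_alloc.c of this era:

		if (tsk_is_oom_victim(current) ||
		    (IS_ENABLED(CONFIG_DEBUG_VM) &&
		     fatal_signal_pending(current)))
			goto failed_irq;	/* was: break */

	/* ... while the existing labels then do: */

failed_irq:
	local_unlock_irqrestore(&pagesets.lock, flags);
failed:
	/* Single-page attempt, so the caller still gets at least one page. */
	page = __alloc_pages(gfp, 0, preferred_nid, nodemask);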

>     __page_pool_alloc_pages_slow() => Will not retry if allocation failed. This patch might help.
> 
>     svc_alloc_arg() => Will not retry if signal pending. This patch might help only if allocating a lot of pages.
> 
>   alloc_pages_bulk_array_node() => Used by vm_area_alloc_pages().
> 
> vm_area_alloc_pages() => Used by __vmalloc_area_node() from __vmalloc_node_range() from vmalloc functions. Needs !__GFP_NOFAIL check?

The comments describing __vmalloc_node() and kvmalloc() state that
__GFP_NOFAIL is not supported, and I did not find any callers using this flag.
Andrew Morton Sept. 21, 2021, 6:55 p.m. UTC | #4
On Mon, 20 Sep 2021 13:59:35 +0300 Vasily Averin <vvs@virtuozzo.com> wrote:

> On 9/20/21 4:22 AM, Tetsuo Handa wrote:
> > On 2021/09/20 8:31, Andrew Morton wrote:
> >> On Fri, 17 Sep 2021 11:06:49 +0300 Vasily Averin <vvs@virtuozzo.com> wrote:
> >>
> >>> A huge vmalloc allocation on a heavily loaded node can lead to a global
> >>> memory shortage. A task that called vmalloc can have the worst badness
> >>> and be chosen by the OOM-killer; however, neither the received fatal
> >>> signal nor the oom victim mark interrupts the allocation cycle. Vmalloc
> >>> will continue allocating pages over and over again, exacerbating the
> >>> crisis and consuming the memory freed up by other killed tasks.
> >>>
> >>> This patch allows the OOM-killer to break the vmalloc cycle, which makes
> >>> OOM handling more effective and avoids a host panic.
> >>>
> >>> Unfortunately, this is not 100% safe. A previous attempt to break the
> >>> vmalloc cycle was reverted by commit b8c8a338f75e ("Revert "vmalloc: back
> >>> off when the current task is killed"") because some vmalloc callers did
> >>> not handle failures properly. The issues found then were resolved;
> >>> however, there may be other similar places.
> >>
> >> Well that was lame of us.
> >>
> >> I believe that at least one of the kernel testbots can utilize fault
> >> injection.  If we were to wire up vmalloc (as we have done with slab
> >> and pagealloc), it would help to locate such buggy vmalloc callers.
> 
> Andrew, could you please clarify how we can do it?
> Do you mean we can use the existing allocation fault injection
> infrastructure to trigger this kind of issue? Unfortunately, I found no
> way to reach this goal. It allows emulating single faults with a small
> probability; however, that is not enough: we need to completely disable
> all vmalloc allocations.

I don't see why there's a problem?  You're saying "there might still be
vmalloc() callers which don't correctly handle allocation failures",
yes?

I'm suggesting that we use fault injection to cause a small proportion
of vmalloc() calls to artificially fail, so such buggy callers will
eventually be found and fixed.  Why does such a scheme require that
*all* vmalloc() calls fail?
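
For illustration, wiring vmalloc into fault injection could look roughly like
the sketch below, modeled on should_failslab(); the fail_vmalloc attribute and
should_fail_vmalloc() are hypothetical, nothing like them exists today:

#include <linux/fault-inject.h>

static DECLARE_FAULT_ATTR(fail_vmalloc);

/* Hypothetical hook, analogous to should_failslab()/fail_page_alloc. */
static bool should_fail_vmalloc(unsigned long size)
{
	return should_fail(&fail_vmalloc, size);
}

/* ... then early in __vmalloc_node_range(): */
	if (should_fail_vmalloc(size))
		return NULL;	/* artificial failure to exercise callers */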
Vasily Averin Sept. 22, 2021, 6:18 a.m. UTC | #5
On 9/21/21 9:55 PM, Andrew Morton wrote:
> On Mon, 20 Sep 2021 13:59:35 +0300 Vasily Averin <vvs@virtuozzo.com> wrote:
> 
>> On 9/20/21 4:22 AM, Tetsuo Handa wrote:
>>> On 2021/09/20 8:31, Andrew Morton wrote:
>>>> On Fri, 17 Sep 2021 11:06:49 +0300 Vasily Averin <vvs@virtuozzo.com> wrote:
>>>>
>>>>> A huge vmalloc allocation on a heavily loaded node can lead to a global
>>>>> memory shortage. A task that called vmalloc can have the worst badness
>>>>> and be chosen by the OOM-killer; however, neither the received fatal
>>>>> signal nor the oom victim mark interrupts the allocation cycle. Vmalloc
>>>>> will continue allocating pages over and over again, exacerbating the
>>>>> crisis and consuming the memory freed up by other killed tasks.
>>>>>
>>>>> This patch allows the OOM-killer to break the vmalloc cycle, which makes
>>>>> OOM handling more effective and avoids a host panic.
>>>>>
>>>>> Unfortunately, this is not 100% safe. A previous attempt to break the
>>>>> vmalloc cycle was reverted by commit b8c8a338f75e ("Revert "vmalloc: back
>>>>> off when the current task is killed"") because some vmalloc callers did
>>>>> not handle failures properly. The issues found then were resolved;
>>>>> however, there may be other similar places.
>>>>
>>>> Well that was lame of us.
>>>>
>>>> I believe that at least one of the kernel testbots can utilize fault
>>>> injection.  If we were to wire up vmalloc (as we have done with slab
>>>> and pagealloc), it would help to locate such buggy vmalloc callers.
>>
>> Andrew, could you please clarify how we can do it?
>> Do you mean we can use the existing allocation fault injection
>> infrastructure to trigger this kind of issue? Unfortunately, I found no
>> way to reach this goal. It allows emulating single faults with a small
>> probability; however, that is not enough: we need to completely disable
>> all vmalloc allocations.
> 
> I don't see why there's a problem?  You're saying "there might still be
> vmalloc() callers which don't correctly handle allocation failures",
> yes?
> 
> I'm suggesting that we use fault injection to cause a small proportion
> of vmalloc() calls to artificially fail, so such buggy callers will
> eventually be found and fixed.  Why does such a scheme require that
> *all* vmalloc() calls fail?

Let me explain.
1) It is not trivial to use the current allocation fault injection to cause
a small proportion of vmalloc() calls to artificially fail.

vmalloc
 __vmalloc_node
  __vmalloc_node_range
   __vmalloc_area_node
    vm_area_alloc_pages
 
vm_area_alloc_pages() uses the new __alloc_pages_bulk subsystem, requesting
up to 100 pages per cycle. __alloc_pages_bulk() can be interrupted by
allocation fault injection; however, in this case vm_area_alloc_pages()
falls back to the old-style page allocation cycle. In the general case it
successfully finishes the allocation, and vmalloc itself will not fail.

To fail vmalloc we need to fail both alloc_pages_bulk_array_node() and alloc_pages_node() together.
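
A condensed sketch of vm_area_alloc_pages(), simplified from mm/vmalloc.c
(the order > 0 path and accounting are omitted), showing why a single
injected bulk failure merely drops it into the single-page fallback:

static unsigned int
vm_area_alloc_pages(gfp_t gfp, int nid, unsigned int nr_pages,
		    struct page **pages)
{
	unsigned int nr_allocated = 0;

	/* Stage 1: bulk allocation, up to 100 pages per request. */
	while (nr_allocated < nr_pages) {
		unsigned int nr, req = min(100U, nr_pages - nr_allocated);

		nr = alloc_pages_bulk_array_node(gfp, nid, req,
						 pages + nr_allocated);
		nr_allocated += nr;
		if (nr != req)		/* e.g. a single injected fault */
			break;
	}

	/* Stage 2: old-style fallback, one page at a time. */
	while (nr_allocated < nr_pages) {
		struct page *page = alloc_pages_node(nid, gfp, 0);

		if (!page)
			break;
		pages[nr_allocated++] = page;
	}
	return nr_allocated;
}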

2) Failing a single vmalloc is not enough.
I would remind you that we want to emulate the reaction to a fatal signal.
However, I'm afraid a dying task can execute quite a complex rollback
procedure. This rollback can call another vmalloc, which will fail again
on the fatal_signal_pending check.

To emulate this behavior with fault injection, we need to disable all
subsequent vmalloc calls of our victim, the pseudo-"dying" task.

I doubt both of these goals can be reached with the current allocation
fault injection subsystem; I do not understand how to configure it
accordingly.

Thank you,
	Vasily Averin
Michal Hocko Sept. 22, 2021, 12:27 p.m. UTC | #6
On Fri 17-09-21 11:06:49, Vasily Averin wrote:
> A huge vmalloc allocation on a heavily loaded node can lead to a global
> memory shortage. A task that called vmalloc can have the worst badness
> and be chosen by the OOM-killer; however, neither the received fatal
> signal nor the oom victim mark interrupts the allocation cycle. Vmalloc
> will continue allocating pages over and over again, exacerbating the
> crisis and consuming the memory freed up by other killed tasks.
> 
> This patch allows the OOM-killer to break the vmalloc cycle, which makes
> OOM handling more effective and avoids a host panic.
> 
> Unfortunately, this is not 100% safe. A previous attempt to break the
> vmalloc cycle was reverted by commit b8c8a338f75e ("Revert "vmalloc: back
> off when the current task is killed"") because some vmalloc callers did
> not handle failures properly. The issues found then were resolved;
> however, there may be other similar places.
> 
> Such failures may be acceptable for emergencies, such as OOM. On the
> other hand, we would like to detect them earlier. However, they are
> quite rare and will be hidden by OOM messages, so I'm afraid they will
> have quite a small chance of being noticed and reported.
> 
> To improve the detection of such places, this patch also interrupts the
> vmalloc allocation cycle for all fatal signals. The checks are hidden
> behind the DEBUG_VM config option so as not to break unaware production
> kernels.

I really dislike this. We shouldn't have a semantically different
behavior for a debugging kernel.

Is there any technical reason to not do fatal_signal_pending bailout
unconditionally? OOM victim based check will make it less likely and
therefore any potential bugs are just hidden more. So I think we should
really go with fatal_signal_pending check here.
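
A minimal sketch of that suggestion, assuming the check lives only in
vm_area_alloc_pages() (see the discussion of the bulk allocator below):

	while (nr_allocated < nr_pages) {
		struct page *page;

		/*
		 * Unconditional bail-out: any fatal signal (OOM victims
		 * included) aborts the loop on every kernel, not only on
		 * DEBUG_VM builds.
		 */
		if (fatal_signal_pending(current))
			break;

		page = alloc_pages_node(nid, gfp, order);
		if (unlikely(!page))
			break;
		pages[nr_allocated++] = page;
	}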

> Vmalloc uses the new alloc_pages_bulk subsystem, so the newly added
> checks can affect other users of this subsystem.
> 
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
>  mm/page_alloc.c | 5 +++++
>  mm/vmalloc.c    | 6 ++++++
>  2 files changed, 11 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b37435c274cf..133d52e507ff 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5288,6 +5288,11 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>  			continue;
>  		}
>  
> +		if (tsk_is_oom_victim(current) ||
> +		    (IS_ENABLED(CONFIG_DEBUG_VM) &&
> +		     fatal_signal_pending(current)))
> +			break;

This allocator interface is used in some real hot paths. It is also
meant to be a fail-fast interface (e.g. it only allocates from the pcp
allocator), so it shouldn't bring any additional risk of memory depletion
under heavy memory pressure.

In other words, I do not see any reason to bail out in this code path.
Vasily Averin Sept. 23, 2021, 6:49 a.m. UTC | #7
On 9/22/21 3:27 PM, Michal Hocko wrote:
> On Fri 17-09-21 11:06:49, Vasily Averin wrote:
>> A huge vmalloc allocation on a heavily loaded node can lead to a global
>> memory shortage. A task that called vmalloc can have the worst badness
>> and be chosen by the OOM-killer; however, neither the received fatal
>> signal nor the oom victim mark interrupts the allocation cycle. Vmalloc
>> will continue allocating pages over and over again, exacerbating the
>> crisis and consuming the memory freed up by other killed tasks.
>>
>> This patch allows the OOM-killer to break the vmalloc cycle, which makes
>> OOM handling more effective and avoids a host panic.
>>
>> Unfortunately, this is not 100% safe. A previous attempt to break the
>> vmalloc cycle was reverted by commit b8c8a338f75e ("Revert "vmalloc: back
>> off when the current task is killed"") because some vmalloc callers did
>> not handle failures properly. The issues found then were resolved;
>> however, there may be other similar places.
>>
>> Such failures may be acceptable for emergencies, such as OOM. On the
>> other hand, we would like to detect them earlier. However, they are
>> quite rare and will be hidden by OOM messages, so I'm afraid they will
>> have quite a small chance of being noticed and reported.
>>
>> To improve the detection of such places, this patch also interrupts the
>> vmalloc allocation cycle for all fatal signals. The checks are hidden
>> behind the DEBUG_VM config option so as not to break unaware production
>> kernels.
> 
> I really dislike this. We shouldn't have a semantically different
> behavior for a debugging kernel.

Yes, you're right, thank you.

> Is there any technical reason to not do fatal_signal_pending bailout
> unconditionally? OOM victim based check will make it less likely and
> therefore any potential bugs are just hidden more. So I think we should
> really go with fatal_signal_pending check here.

I agree: oom_victim == fatal_signal_pending.
I agree that vmalloc callers should expect and handle single vmalloc failures.
I think it is acceptable to enable the fatal_signal_pending check to quickly
detect this kind of issue.
However, the fatal_signal_pending check can cause serial vmalloc failures,
and I doubt that is acceptable.

A rollback after a failed vmalloc can issue new vmalloc calls that will fail
too; even when properly handled, such serial failures can cause trouble.

Hypothetically, a cancelled vmalloc called inside some filesystem's
transaction forces its rollback, which in turn can call its own vmalloc.
Any failure on this path can break the filesystem.
I doubt that is acceptable, especially for non-OOM fatal signals.
On the other hand, I cannot say that it is 100% a bug.

Another scenario:
as you know, a failed vmalloc calls pr_warn. The corresponding message
should be sent to a remote terminal or netconsole. I'm not sure about the
execution context; however, if this is done in task context, it may call
vmalloc in either the terminal or network subsystems. Even when handled,
such failures are not fatal, but this behaviour is at least unexpected.

Should we perhaps interrupt the first vmalloc only?

>> Vmalloc uses the new alloc_pages_bulk subsystem, so the newly added
>> checks can affect other users of this subsystem.
>>
>> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
>> ---
>>  mm/page_alloc.c | 5 +++++
>>  mm/vmalloc.c    | 6 ++++++
>>  2 files changed, 11 insertions(+)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index b37435c274cf..133d52e507ff 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -5288,6 +5288,11 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>>  			continue;
>>  		}
>>  
>> +		if (tsk_is_oom_victim(current) ||
>> +		    (IS_ENABLED(CONFIG_DEBUG_VM) &&
>> +		     fatal_signal_pending(current)))
>> +			break;
> 
> This allocator interface is used in some real hot paths. It is also
> meant to be a fail-fast interface (e.g. it only allocates from the pcp
> allocator), so it shouldn't bring any additional risk of memory depletion
> under heavy memory pressure.
> 
> In other words, I do not see any reason to bail out in this code path.

Thank you for the explanation; let's drop this check altogether.

Thank you,
	Vasily Averin
Michal Hocko Sept. 24, 2021, 7:55 a.m. UTC | #8
On Thu 23-09-21 09:49:57, Vasily Averin wrote:
[...]
> I agree that vmalloc callers should expect and handle single vmalloc failures.
> I think it is acceptable to enable the fatal_signal_pending check to quickly
> detect this kind of issue.
> However, the fatal_signal_pending check can cause serial vmalloc failures,
> and I doubt that is acceptable.
> 
> A rollback after a failed vmalloc can issue new vmalloc calls that will fail
> too; even when properly handled, such serial failures can cause trouble.

Could you be more specific? Also, how would this be any different from
similar failures for an oom victim? Except that the latter is less likely,
so (as already mentioned) any potential bugs would just be lurking there
for a longer time.

> Hypothetically, a cancelled vmalloc called inside some filesystem's
> transaction forces its rollback, which in turn can call its own vmalloc.

Do you have any specific example?

> Any failure on this path can break the filesystem.
> I doubt that is acceptable, especially for non-OOM fatal signals.
> On the other hand, I cannot say that it is 100% a bug.
> 
> Another scenario:
> as you know, a failed vmalloc calls pr_warn. The corresponding message
> should be sent to a remote terminal or netconsole. I'm not sure about the
> execution context; however, if this is done in task context, it may call
> vmalloc in either the terminal or network subsystems. Even when handled,
> such failures are not fatal, but this behaviour is at least unexpected.

I do not think we want to shape the vmalloc behavior based on
printk/console behavior.

> Should we perhaps interrupt the first vmalloc only?

This doesn't make much sense to me TBH. It doesn't address the very
problem you are describing in the changelog.
Vasily Averin Sept. 27, 2021, 9:36 a.m. UTC | #9
On 9/24/21 10:55 AM, Michal Hocko wrote:
> On Thu 23-09-21 09:49:57, Vasily Averin wrote:
> [...]
>> I agree that vmalloc callers should expect and handle single vmalloc failures.
>> I think it is acceptable to enable the fatal_signal_pending check to quickly
>> detect this kind of issue.
>> However, the fatal_signal_pending check can cause serial vmalloc failures,
>> and I doubt that is acceptable.
>>
>> A rollback after a failed vmalloc can issue new vmalloc calls that will fail
>> too; even when properly handled, such serial failures can cause trouble.
> 
> Could you be more specific? Also, how would this be any different from
> similar failures for an oom victim? Except that the latter is less likely,
> so (as already mentioned) any potential bugs would just be lurking there
> for a longer time.
> 
>> Hypothetically, a cancelled vmalloc called inside some filesystem's
>> transaction forces its rollback, which in turn can call its own vmalloc.
> 
> Do you have any specific example?

No, it was a purely hypothetical assumption.
I was thinking about it over the weekend, and decided that:
a) this kind of issue (i.e. a vmalloc call in a rollback after a failed
   vmalloc) is very unlikely;
b) if it still exists, it must have its own fallback with kmalloc(NOFAIL),
   or just accept/ignore such a failure, which should not lead to critical
   failures. If this still happens, it is a bug; we should detect and fix
   it ASAP (see the sketch below).
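
For illustration, point (b) amounts to something like the hypothetical helper
below (rollback_buf_alloc() is an invented name, not an existing API):

/*
 * A rollback path that truly cannot tolerate a vmalloc failure must
 * carry its own fallback rather than assume vmalloc succeeds.
 */
static void *rollback_buf_alloc(size_t size)
{
	void *buf = vmalloc(size);

	/* __GFP_NOFAIL is only reasonable for small requests. */
	if (!buf && size <= PAGE_SIZE)
		buf = kmalloc(size, GFP_KERNEL | __GFP_NOFAIL);

	return buf;	/* may still be NULL; the caller must handle it */
}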

>> Should we perhaps interrupt the first vmalloc only?
> 
> This doesn't make much sense to me TBH. It doesn't address the very
> problem you are describing in the changelog.

Last question:
what do you think: should we perhaps, instead, detect such vmallocs
(called in a rollback after a failed vmalloc) and generate a warning,
to prevent this kind of problem in the future?

Thank you,
	Vasily Averin
Michal Hocko Sept. 27, 2021, 11:08 a.m. UTC | #10
On Mon 27-09-21 12:36:15, Vasily Averin wrote:
> On 9/24/21 10:55 AM, Michal Hocko wrote:
> > On Thu 23-09-21 09:49:57, Vasily Averin wrote:
[...]
> >> Hypothetically, a cancelled vmalloc called inside some filesystem's
> >> transaction forces its rollback, which in turn can call its own vmalloc.
> > 
> > Do you have any specific example?
> 
> No, it was a purely hypothetical assumption.
> I was thinking about it over the weekend, and decided that:
> a) this kind of issue (i.e. a vmalloc call in a rollback after a failed
>    vmalloc) is very unlikely;
> b) if it still exists, it must have its own fallback with kmalloc(NOFAIL),
>    or just accept/ignore such a failure, which should not lead to critical
>    failures. If this still happens, it is a bug; we should detect and fix
>    it ASAP.

I would even argue that nobody should rely on vmalloc succeeding. The
purpose of the allocator is to allow larger allocations, and we do not
guarantee anything even for small requests.

> >> Should we perhaps interrupt the first vmalloc only?
> > 
> > This doesn't make much sense to me TBH. It doesn't address the very
> > problem you are describing in the changelog.
> 
> Last question:
> what do you think: should we perhaps, instead, detect such vmallocs
> (called in a rollback after a failed vmalloc) and generate a warning,
> to prevent this kind of problem in the future?

We do provide an allocation failure splat unless the request explicitly
passes __GFP_NOWARN, IIRC.
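
For reference, a caller that wants to probe quietly passes the flag
explicitly, e.g.:

	/* Opportunistic allocation; suppress the failure warning. */
	buf = __vmalloc(size, GFP_KERNEL | __GFP_NOWARN);
	if (!buf)
		return -ENOMEM;	/* handled silently, no splat */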

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b37435c274cf..133d52e507ff 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5288,6 +5288,11 @@  unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 			continue;
 		}
 
+		if (tsk_is_oom_victim(current) ||
+		    (IS_ENABLED(CONFIG_DEBUG_VM) &&
+		     fatal_signal_pending(current)))
+			break;
+
 		page = __rmqueue_pcplist(zone, 0, ac.migratetype, alloc_flags,
 								pcp, pcp_list);
 		if (unlikely(!page)) {
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index c3b8e3e5cfc5..04b291076726 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -38,6 +38,7 @@ 
 #include <linux/pgtable.h>
 #include <linux/uaccess.h>
 #include <linux/hugetlb.h>
+#include <linux/oom.h>
 #include <asm/tlbflush.h>
 #include <asm/shmparam.h>
 
@@ -2860,6 +2861,11 @@  vm_area_alloc_pages(gfp_t gfp, int nid,
 		struct page *page;
 		int i;
 
+		if (tsk_is_oom_victim(current) ||
+		    (IS_ENABLED(CONFIG_DEBUG_VM) &&
+		     fatal_signal_pending(current)))
+			break;
+
 		page = alloc_pages_node(nid, gfp, order);
 		if (unlikely(!page))
 			break;