[RFC,0/4] static key support for error injection functions

Message ID	20240531-fault-injection-statickeys-v1-0-a513fd0a9614@suse.cz (mailing list archive)
Headers	show Return-Path: <owner-linux-mm@kvack.org> From: Vlastimil Babka <vbabka@suse.cz> Subject: [PATCH RFC 0/4] static key support for error injection functions Date: Fri, 31 May 2024 11:33:31 +0200 Message-Id: <20240531-fault-injection-statickeys-v1-0-a513fd0a9614@suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 8bit To: Akinobu Mita <akinobu.mita@gmail.com>, Christoph Lameter <cl@linux.com>, David Rientjes <rientjes@google.com>, Alexei Starovoitov <ast@kernel.org>, Daniel Borkmann <daniel@iogearbox.net>, Andrii Nakryiko <andrii@kernel.org>, "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>, Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>, "David S. Miller" <davem@davemloft.net>, Masami Hiramatsu <mhiramat@kernel.org>, Steven Rostedt <rostedt@goodmis.org>, Mark Rutland <mark.rutland@arm.com> Cc: Jiri Olsa <jolsa@kernel.org>, Roman Gushchin <roman.gushchin@linux.dev>, Hyeonggon Yoo <42.hyeyoo@gmail.com>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, bpf@vger.kernel.org, linux-trace-kernel@vger.kernel.org, Vlastimil Babka <vbabka@suse.cz> Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	static key support for error injection functions \| expand [RFC,0/4] static key support for error injection functions [RFC,1/4] fault-inject: add support for static keys around fault injection sites [RFC,2/4] error-injection: support static keys around injectable functions [RFC,3/4] mm, slab: add static key for should_failslab() [RFC,4/4] mm, page_alloc: add static key for should_fail_alloc_page()

Vlastimil Babka May 31, 2024, 9:33 a.m. UTC

Incomplete, help needed from ftrace/kprobe and bpf folks.

As previously mentioned by myself [1] and others [2] the functions
designed for error injection can bring visible overhead in fastpaths
such as slab or page allocation, because even if nothing hooks into them
at a given moment, they are noninline function calls regardless of
CONFIG_ options since commits 4f6923fbb352 ("mm: make should_failslab
always available for fault injection") and af3b854492f3
("mm/page_alloc.c: allow error injection").

Live patching their callsites has been also suggested in both [1] and
[2] threads, and this is an attempt to do that with static keys that
guard the call sites. When disabled, the error injection functions still
exist and are noinline, but are not being called. Any of the existing
mechanisms that can inject errors should make sure to enable the
respective static key. I have added that support to some of them but
need help with the others.

- the legacy fault injection, i.e. CONFIG_FAILSLAB and
  CONFIG_FAIL_PAGE_ALLOC is handled in Patch 1, and can be passed the
  address of the static key if it exists. The key will be activated if the
  fault injection probability becomes non-zero, and deactivated in the
  opposite transition. This also removes the overhead of the evaluation
  (on top of the noninline function call) when these mechanisms are
  configured in the kernel but unused at the moment.

- the generic error injection using kretprobes with
  override_function_with_return is handled in patch 2. The
  ALLOW_ERROR_INJECTION() annotation is extended so that static key
  address can be passed, and the framework controls it when error
  injection is enabled or disabled in debugfs for the function.

There are two more users I know of but am not familiar enough to fix up
myself. I hope people that are more familiar can help me here.

- ftrace seems to be using override_function_with_return from
  #define ftrace_override_function_with_return but I found no place
  where the latter is used. I assume it might be hidden behind more
  macro magic? But the point is if ftrace can be instructed to act like
  an error injection, it would also have to use some form of metadata
  (from patch 2 presumably?) to get to the static key and control it.

  If ftrace can only observe the function being called, maybe it
  wouldn't be wrong to just observe nothing if the static key isn't
  enabled because nobody is doing the fault injection?

- bpftrace, as can be seen from the example in commit 4f6923fbb352
  description. I suppose bpf is already aware what functions the
  currently loaded bpf programs hook into, so that it could look up the
  static key and control it. Maybe using again the metadata from patch 2,
  or extending its own, as I've noticed there's e.g. BTF_ID(func,
  should_failslab)

Now I realize maybe handling this at the k(ret)probe level would be
sufficient for all cases except the legacy fault injection from Patch 1?
Also wanted to note that by AFAIU by using the static_key_slow_dec/inc
API (as done in patches 1/2) should allow all mechanisms to coexist
naturally without fighting each other on the static key state, and also
handle the reference count for e.g. active probes or bpf programs if
there's no similar internal mechanism.

Patches 3 and 4 implement the static keys for the two mm fault injection
sites in slab and page allocators. For a quick demonstration I've run a
VM and the simple test from [1] that stresses the slab allocator and got
this time before the series:

real    0m8.349s
user    0m0.694s
sys     0m7.648s

with perf showing

   0.61%  nonexistent  [kernel.kallsyms]  [k] should_failslab.constprop.0
   0.00%  nonexistent  [kernel.kallsyms]  [k] should_fail_alloc_page                                                                                                                                                                                        ▒

And after the series

real    0m7.924s
user    0m0.727s
sys     0m7.191s

and the functions gone from perf report.

There might be other such fault injection callsites in hotpaths of other
subsystems but I didn't search for them at this point.

[1] https://lore.kernel.org/all/6d5bb852-8703-4abf-a52b-90816bccbd7f@suse.cz/
[2] https://lore.kernel.org/all/3j5d3p22ssv7xoaghzraa7crcfih3h2qqjlhmjppbp6f42pg2t@kg7qoicog5ye/

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
Vlastimil Babka (4):
      fault-inject: add support for static keys around fault injection sites
      error-injection: support static keys around injectable functions
      mm, slab: add static key for should_failslab()
      mm, page_alloc: add static key for should_fail_alloc_page()

 include/asm-generic/error-injection.h | 13 ++++++++++-
 include/asm-generic/vmlinux.lds.h     |  2 +-
 include/linux/error-injection.h       |  9 +++++---
 include/linux/fault-inject.h          |  7 +++++-
 kernel/fail_function.c                | 22 +++++++++++++++---
 lib/error-inject.c                    |  6 ++++-
 lib/fault-inject.c                    | 43 ++++++++++++++++++++++++++++++++++-
 mm/fail_page_alloc.c                  |  3 ++-
 mm/failslab.c                         |  2 +-
 mm/internal.h                         |  2 ++
 mm/page_alloc.c                       | 11 ++++++---
 mm/slab.h                             |  3 +++
 mm/slub.c                             | 10 +++++---
 13 files changed, 114 insertions(+), 19 deletions(-)
---
base-commit: 1613e604df0cd359cf2a7fbd9be7a0bcfacfabd0
change-id: 20240530-fault-injection-statickeys-66b7222e91b7

Best regards,

Mark Rutland May 31, 2024, 3:31 p.m. UTC | #1

Hi,

On Fri, May 31, 2024 at 11:33:31AM +0200, Vlastimil Babka wrote:
> Incomplete, help needed from ftrace/kprobe and bpf folks.

> - the generic error injection using kretprobes with
>   override_function_with_return is handled in patch 2. The
>   ALLOW_ERROR_INJECTION() annotation is extended so that static key
>   address can be passed, and the framework controls it when error
>   injection is enabled or disabled in debugfs for the function.
> 
> There are two more users I know of but am not familiar enough to fix up
> myself. I hope people that are more familiar can help me here.
> 
> - ftrace seems to be using override_function_with_return from
>   #define ftrace_override_function_with_return but I found no place
>   where the latter is used. I assume it might be hidden behind more
>   macro magic? But the point is if ftrace can be instructed to act like
>   an error injection, it would also have to use some form of metadata
>   (from patch 2 presumably?) to get to the static key and control it.

I don't think you've missed anything; nothing currently uses
ftrace_override_function_with_return(). I added that in commit:

  94d095ffa0e16bb7 ("ftrace: abstract DYNAMIC_FTRACE_WITH_ARGS accesses")

... so that it was possible to do anything that was possible with
FTRACE_WITH_REGS and/or kprobes, under the expectation that we might
want to move fault injection and BPF probes over to fprobes in future,
as ftrace/fprobes is generally faster than kprobes (e.g. for
architectures that can't do KPROBES_ON_FTRACE or OPTPROBES).

That's just the mechanism for the handler to use; I'd expect whatever
registered the handler to be responsible for flipping the static key,
and I don't think anything needs to change within ftrace itself.

>   If ftrace can only observe the function being called, maybe it
>   wouldn't be wrong to just observe nothing if the static key isn't
>   enabled because nobody is doing the fault injection?

Yep, that sounds right to me.

Mark.

Roman Gushchin May 31, 2024, 11:39 p.m. UTC | #2

On Fri, May 31, 2024 at 11:33:31AM +0200, Vlastimil Babka wrote:
> Incomplete, help needed from ftrace/kprobe and bpf folks.
> 
> As previously mentioned by myself [1] and others [2] the functions
> designed for error injection can bring visible overhead in fastpaths
> such as slab or page allocation, because even if nothing hooks into them
> at a given moment, they are noninline function calls regardless of
> CONFIG_ options since commits 4f6923fbb352 ("mm: make should_failslab
> always available for fault injection") and af3b854492f3
> ("mm/page_alloc.c: allow error injection").
> 
> Live patching their callsites has been also suggested in both [1] and
> [2] threads, and this is an attempt to do that with static keys that
> guard the call sites. When disabled, the error injection functions still
> exist and are noinline, but are not being called. Any of the existing
> mechanisms that can inject errors should make sure to enable the
> respective static key. I have added that support to some of them but
> need help with the others.

I think it's a clever idea and makes total sense!

> 
> - the legacy fault injection, i.e. CONFIG_FAILSLAB and
>   CONFIG_FAIL_PAGE_ALLOC is handled in Patch 1, and can be passed the
>   address of the static key if it exists. The key will be activated if the
>   fault injection probability becomes non-zero, and deactivated in the
>   opposite transition. This also removes the overhead of the evaluation
>   (on top of the noninline function call) when these mechanisms are
>   configured in the kernel but unused at the moment.
> 
> - the generic error injection using kretprobes with
>   override_function_with_return is handled in patch 2. The
>   ALLOW_ERROR_INJECTION() annotation is extended so that static key
>   address can be passed, and the framework controls it when error
>   injection is enabled or disabled in debugfs for the function.
> 
> There are two more users I know of but am not familiar enough to fix up
> myself. I hope people that are more familiar can help me here.
> 
> - ftrace seems to be using override_function_with_return from
>   #define ftrace_override_function_with_return but I found no place
>   where the latter is used. I assume it might be hidden behind more
>   macro magic? But the point is if ftrace can be instructed to act like
>   an error injection, it would also have to use some form of metadata
>   (from patch 2 presumably?) to get to the static key and control it.
> 
>   If ftrace can only observe the function being called, maybe it
>   wouldn't be wrong to just observe nothing if the static key isn't
>   enabled because nobody is doing the fault injection?
> 
> - bpftrace, as can be seen from the example in commit 4f6923fbb352
>   description. I suppose bpf is already aware what functions the
>   currently loaded bpf programs hook into, so that it could look up the
>   static key and control it. Maybe using again the metadata from patch 2,
>   or extending its own, as I've noticed there's e.g. BTF_ID(func,
>   should_failslab)
> 
> Now I realize maybe handling this at the k(ret)probe level would be
> sufficient for all cases except the legacy fault injection from Patch 1?
> Also wanted to note that by AFAIU by using the static_key_slow_dec/inc
> API (as done in patches 1/2) should allow all mechanisms to coexist
> naturally without fighting each other on the static key state, and also
> handle the reference count for e.g. active probes or bpf programs if
> there's no similar internal mechanism.
> 
> Patches 3 and 4 implement the static keys for the two mm fault injection
> sites in slab and page allocators. For a quick demonstration I've run a
> VM and the simple test from [1] that stresses the slab allocator and got
> this time before the series:
> 
> real    0m8.349s
> user    0m0.694s
> sys     0m7.648s
> 
> with perf showing
> 
>    0.61%  nonexistent  [kernel.kallsyms]  [k] should_failslab.constprop.0
>    0.00%  nonexistent  [kernel.kallsyms]  [k] should_fail_alloc_page                                                                                                                                                                                        ▒
> 
> And after the series
> 
> real    0m7.924s
> user    0m0.727s
> sys     0m7.191s

Is "user" increase a measurement error or it's real?

Otherwise, nice savings!

Vlastimil Babka June 1, 2024, 8:48 p.m. UTC | #3

On 5/31/24 5:31 PM, Mark Rutland wrote:
> Hi,
> 
> On Fri, May 31, 2024 at 11:33:31AM +0200, Vlastimil Babka wrote:
>> Incomplete, help needed from ftrace/kprobe and bpf folks.
> 
>> - the generic error injection using kretprobes with
>>   override_function_with_return is handled in patch 2. The
>>   ALLOW_ERROR_INJECTION() annotation is extended so that static key
>>   address can be passed, and the framework controls it when error
>>   injection is enabled or disabled in debugfs for the function.
>> 
>> There are two more users I know of but am not familiar enough to fix up
>> myself. I hope people that are more familiar can help me here.
>> 
>> - ftrace seems to be using override_function_with_return from
>>   #define ftrace_override_function_with_return but I found no place
>>   where the latter is used. I assume it might be hidden behind more
>>   macro magic? But the point is if ftrace can be instructed to act like
>>   an error injection, it would also have to use some form of metadata
>>   (from patch 2 presumably?) to get to the static key and control it.
> 
> I don't think you've missed anything; nothing currently uses
> ftrace_override_function_with_return(). I added that in commit:

Ah, great, thanks for confirming that.

>   94d095ffa0e16bb7 ("ftrace: abstract DYNAMIC_FTRACE_WITH_ARGS accesses")
> 
> ... so that it was possible to do anything that was possible with
> FTRACE_WITH_REGS and/or kprobes, under the expectation that we might
> want to move fault injection and BPF probes over to fprobes in future,
> as ftrace/fprobes is generally faster than kprobes (e.g. for
> architectures that can't do KPROBES_ON_FTRACE or OPTPROBES).
> 
> That's just the mechanism for the handler to use; I'd expect whatever
> registered the handler to be responsible for flipping the static key,
> and I don't think anything needs to change within ftrace itself.
> 
>>   If ftrace can only observe the function being called, maybe it
>>   wouldn't be wrong to just observe nothing if the static key isn't
>>   enabled because nobody is doing the fault injection?
> 
> Yep, that sounds right to me.

Good.

> Mark.

Vlastimil Babka June 1, 2024, 8:53 p.m. UTC | #4

On 6/1/24 1:39 AM, Roman Gushchin wrote:
> On Fri, May 31, 2024 at 11:33:31AM +0200, Vlastimil Babka wrote:
>> Incomplete, help needed from ftrace/kprobe and bpf folks.
>> 
>> As previously mentioned by myself [1] and others [2] the functions
>> designed for error injection can bring visible overhead in fastpaths
>> such as slab or page allocation, because even if nothing hooks into them
>> at a given moment, they are noninline function calls regardless of
>> CONFIG_ options since commits 4f6923fbb352 ("mm: make should_failslab
>> always available for fault injection") and af3b854492f3
>> ("mm/page_alloc.c: allow error injection").
>> 
>> Live patching their callsites has been also suggested in both [1] and
>> [2] threads, and this is an attempt to do that with static keys that
>> guard the call sites. When disabled, the error injection functions still
>> exist and are noinline, but are not being called. Any of the existing
>> mechanisms that can inject errors should make sure to enable the
>> respective static key. I have added that support to some of them but
>> need help with the others.
> 
> I think it's a clever idea and makes total sense!

Thanks!

>> 
>> Patches 3 and 4 implement the static keys for the two mm fault injection
>> sites in slab and page allocators. For a quick demonstration I've run a
>> VM and the simple test from [1] that stresses the slab allocator and got
>> this time before the series:
>> 
>> real    0m8.349s
>> user    0m0.694s
>> sys     0m7.648s
>> 
>> with perf showing
>> 
>>    0.61%  nonexistent  [kernel.kallsyms]  [k] should_failslab.constprop.0
>>    0.00%  nonexistent  [kernel.kallsyms]  [k] should_fail_alloc_page                                                                                                                                                                                        ▒
>> 
>> And after the series
>> 
>> real    0m7.924s
>> user    0m0.727s
>> sys     0m7.191s
> 
> Is "user" increase a measurement error or it's real?

Hm interesting, I have actually did the measurement 3 times even though I
pasted just one, and it's consistent. But could be just artifact of where
things landed in the cache, and might change a bit with every kernel
build/boot. Will see. There's no reason why this should affect user time.

> Otherwise, nice savings!

Wei Yang June 2, 2024, 11:36 a.m. UTC | #5

On Fri, May 31, 2024 at 11:33:31AM +0200, Vlastimil Babka wrote:
>Incomplete, help needed from ftrace/kprobe and bpf folks.
>
>As previously mentioned by myself [1] and others [2] the functions
>designed for error injection can bring visible overhead in fastpaths
>such as slab or page allocation, because even if nothing hooks into them
>at a given moment, they are noninline function calls regardless of
>CONFIG_ options since commits 4f6923fbb352 ("mm: make should_failslab
>always available for fault injection") and af3b854492f3
>("mm/page_alloc.c: allow error injection").
>
>Live patching their callsites has been also suggested in both [1] and
>[2] threads, and this is an attempt to do that with static keys that
>guard the call sites. When disabled, the error injection functions still
>exist and are noinline, but are not being called. Any of the existing
>mechanisms that can inject errors should make sure to enable the
>respective static key. I have added that support to some of them but
>need help with the others.
>
>- the legacy fault injection, i.e. CONFIG_FAILSLAB and
>  CONFIG_FAIL_PAGE_ALLOC is handled in Patch 1, and can be passed the
>  address of the static key if it exists. The key will be activated if the
>  fault injection probability becomes non-zero, and deactivated in the
>  opposite transition. This also removes the overhead of the evaluation
>  (on top of the noninline function call) when these mechanisms are
>  configured in the kernel but unused at the moment.
>
>- the generic error injection using kretprobes with
>  override_function_with_return is handled in patch 2. The
>  ALLOW_ERROR_INJECTION() annotation is extended so that static key
>  address can be passed, and the framework controls it when error
>  injection is enabled or disabled in debugfs for the function.
>
>There are two more users I know of but am not familiar enough to fix up
>myself. I hope people that are more familiar can help me here.
>
>- ftrace seems to be using override_function_with_return from
>  #define ftrace_override_function_with_return but I found no place
>  where the latter is used. I assume it might be hidden behind more
>  macro magic? But the point is if ftrace can be instructed to act like
>  an error injection, it would also have to use some form of metadata
>  (from patch 2 presumably?) to get to the static key and control it.
>
>  If ftrace can only observe the function being called, maybe it
>  wouldn't be wrong to just observe nothing if the static key isn't
>  enabled because nobody is doing the fault injection?
>
>- bpftrace, as can be seen from the example in commit 4f6923fbb352
>  description. I suppose bpf is already aware what functions the
>  currently loaded bpf programs hook into, so that it could look up the
>  static key and control it. Maybe using again the metadata from patch 2,
>  or extending its own, as I've noticed there's e.g. BTF_ID(func,
>  should_failslab)
>
>Now I realize maybe handling this at the k(ret)probe level would be
>sufficient for all cases except the legacy fault injection from Patch 1?
>Also wanted to note that by AFAIU by using the static_key_slow_dec/inc
>API (as done in patches 1/2) should allow all mechanisms to coexist
>naturally without fighting each other on the static key state, and also
>handle the reference count for e.g. active probes or bpf programs if
>there's no similar internal mechanism.
>
>Patches 3 and 4 implement the static keys for the two mm fault injection
>sites in slab and page allocators. For a quick demonstration I've run a
>VM and the simple test from [1] that stresses the slab allocator and got

I took a look into [1] and I see some data like "1.43% plus the overhead in
its caller", but not clearly find which test cases are.

Sorry for my unfamiliarity, would you mind giving more words on the cases?

>this time before the series:
>
>real    0m8.349s
>user    0m0.694s
>sys     0m7.648s
>
>with perf showing
>
>   0.61%  nonexistent  [kernel.kallsyms]  [k] should_failslab.constprop.0
>   0.00%  nonexistent  [kernel.kallsyms]  [k] should_fail_alloc_page                                                                                                                                                                                        ▒
>
>And after the series
>
>real    0m7.924s
>user    0m0.727s
>sys     0m7.191s
>

Maybe add the percentage here would be more helpful.

>and the functions gone from perf report.
>
>There might be other such fault injection callsites in hotpaths of other
>subsystems but I didn't search for them at this point.
>
>[1] https://lore.kernel.org/all/6d5bb852-8703-4abf-a52b-90816bccbd7f@suse.cz/
>[2] https://lore.kernel.org/all/3j5d3p22ssv7xoaghzraa7crcfih3h2qqjlhmjppbp6f42pg2t@kg7qoicog5ye/
>
>Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>---
>Vlastimil Babka (4):
>      fault-inject: add support for static keys around fault injection sites
>      error-injection: support static keys around injectable functions
>      mm, slab: add static key for should_failslab()
>      mm, page_alloc: add static key for should_fail_alloc_page()
>
> include/asm-generic/error-injection.h | 13 ++++++++++-
> include/asm-generic/vmlinux.lds.h     |  2 +-
> include/linux/error-injection.h       |  9 +++++---
> include/linux/fault-inject.h          |  7 +++++-
> kernel/fail_function.c                | 22 +++++++++++++++---
> lib/error-inject.c                    |  6 ++++-
> lib/fault-inject.c                    | 43 ++++++++++++++++++++++++++++++++++-
> mm/fail_page_alloc.c                  |  3 ++-
> mm/failslab.c                         |  2 +-
> mm/internal.h                         |  2 ++
> mm/page_alloc.c                       | 11 ++++++---
> mm/slab.h                             |  3 +++
> mm/slub.c                             | 10 +++++---
> 13 files changed, 114 insertions(+), 19 deletions(-)
>---
>base-commit: 1613e604df0cd359cf2a7fbd9be7a0bcfacfabd0
>change-id: 20240530-fault-injection-statickeys-66b7222e91b7
>
>Best regards,
>-- 
>Vlastimil Babka <vbabka@suse.cz>
>

David Rientjes June 2, 2024, 8:47 p.m. UTC | #6

On Fri, 31 May 2024, Vlastimil Babka wrote:

> Patches 3 and 4 implement the static keys for the two mm fault injection
> sites in slab and page allocators. For a quick demonstration I've run a
> VM and the simple test from [1] that stresses the slab allocator and got
> this time before the series:
> 
> real    0m8.349s
> user    0m0.694s
> sys     0m7.648s
> 
> with perf showing
> 
>    0.61%  nonexistent  [kernel.kallsyms]  [k] should_failslab.constprop.0
>    0.00%  nonexistent  [kernel.kallsyms]  [k] should_fail_alloc_page                                                                                                                                                                                        ▒
> 
> And after the series
> 
> real    0m7.924s
> user    0m0.727s
> sys     0m7.191s
> 
> and the functions gone from perf report.
> 

Impressive results that will no doubt be a win for kernels that enable 
these options.

Both CONFIG_FAILSLAB and CONFIG_FAIL_PAGE_ALLOC go out of their way to 
have no overhead, both in performance and kernel text overhead, when the 
.config options are disabled.

Do we have any insight into the distros or users that enable either of 
these options and are expecting optimal performance?  I would have assumed 
that while CONFIG_FAULT_INJECTION may be enabled that any users who would 
care deeply about this would have disabled both of these debug options.

> There might be other such fault injection callsites in hotpaths of other
> subsystems but I didn't search for them at this point.
> 
> [1] https://lore.kernel.org/all/6d5bb852-8703-4abf-a52b-90816bccbd7f@suse.cz/
> [2] https://lore.kernel.org/all/3j5d3p22ssv7xoaghzraa7crcfih3h2qqjlhmjppbp6f42pg2t@kg7qoicog5ye/
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> Vlastimil Babka (4):
>       fault-inject: add support for static keys around fault injection sites
>       error-injection: support static keys around injectable functions
>       mm, slab: add static key for should_failslab()
>       mm, page_alloc: add static key for should_fail_alloc_page()
> 
>  include/asm-generic/error-injection.h | 13 ++++++++++-
>  include/asm-generic/vmlinux.lds.h     |  2 +-
>  include/linux/error-injection.h       |  9 +++++---
>  include/linux/fault-inject.h          |  7 +++++-
>  kernel/fail_function.c                | 22 +++++++++++++++---
>  lib/error-inject.c                    |  6 ++++-
>  lib/fault-inject.c                    | 43 ++++++++++++++++++++++++++++++++++-
>  mm/fail_page_alloc.c                  |  3 ++-
>  mm/failslab.c                         |  2 +-
>  mm/internal.h                         |  2 ++
>  mm/page_alloc.c                       | 11 ++++++---
>  mm/slab.h                             |  3 +++
>  mm/slub.c                             | 10 +++++---
>  13 files changed, 114 insertions(+), 19 deletions(-)
> ---
> base-commit: 1613e604df0cd359cf2a7fbd9be7a0bcfacfabd0
> change-id: 20240530-fault-injection-statickeys-66b7222e91b7
> 
> Best regards,
> -- 
> Vlastimil Babka <vbabka@suse.cz>
> 
>

Vlastimil Babka June 2, 2024, 9:08 p.m. UTC | #7

On 6/2/24 10:47 PM, David Rientjes wrote:
> On Fri, 31 May 2024, Vlastimil Babka wrote:
> 
>> Patches 3 and 4 implement the static keys for the two mm fault injection
>> sites in slab and page allocators. For a quick demonstration I've run a
>> VM and the simple test from [1] that stresses the slab allocator and got
>> this time before the series:
>> 
>> real    0m8.349s
>> user    0m0.694s
>> sys     0m7.648s
>> 
>> with perf showing
>> 
>>    0.61%  nonexistent  [kernel.kallsyms]  [k] should_failslab.constprop.0
>>    0.00%  nonexistent  [kernel.kallsyms]  [k] should_fail_alloc_page                                                                                                                                                                                        ▒
>> 
>> And after the series
>> 
>> real    0m7.924s
>> user    0m0.727s
>> sys     0m7.191s
>> 
>> and the functions gone from perf report.
>> 
> 
> Impressive results that will no doubt be a win for kernels that enable 
> these options.

Oh, I should have been more clear about it. This was done with both of these
options disabled. It's measuring just removing the overhead of calling the
empty noninline function, otherwise the difference would be larger. CPU
mitigations (defaults) were enabled though, which affects the cost of
function calls (this was in KVM guest on Ryzen 2700).

> Both CONFIG_FAILSLAB and CONFIG_FAIL_PAGE_ALLOC go out of their way to 
> have no overhead, both in performance and kernel text overhead, when the 
> .config options are disabled.

Except the unavoidable function call overhead since commits 4f6923fbb352 and
af3b854492f3.

> Do we have any insight into the distros or users that enable either of 
> these options and are expecting optimal performance?  I would have assumed 
> that while CONFIG_FAULT_INJECTION may be enabled that any users who would 
> care deeply about this would have disabled both of these debug options.

Eliminating the empty function call overhead, which is currently not
possible to configure out in any way, was my primary goal. For our distro we
disable the options and they are enabled only in a debug kernel option. So
the additional benefit of the static key is we could enable them with no
cost, and have them available for when needed, without the need to change
kernel. This is great for debugging functionality in general
(debug_pagealloc, page_owner), maybe this would be less likely to be useful,
but one never knows.

>> There might be other such fault injection callsites in hotpaths of other
>> subsystems but I didn't search for them at this point.
>> 
>> [1] https://lore.kernel.org/all/6d5bb852-8703-4abf-a52b-90816bccbd7f@suse.cz/
>> [2] https://lore.kernel.org/all/3j5d3p22ssv7xoaghzraa7crcfih3h2qqjlhmjppbp6f42pg2t@kg7qoicog5ye/
>> 
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> ---
>> Vlastimil Babka (4):
>>       fault-inject: add support for static keys around fault injection sites
>>       error-injection: support static keys around injectable functions
>>       mm, slab: add static key for should_failslab()
>>       mm, page_alloc: add static key for should_fail_alloc_page()
>> 
>>  include/asm-generic/error-injection.h | 13 ++++++++++-
>>  include/asm-generic/vmlinux.lds.h     |  2 +-
>>  include/linux/error-injection.h       |  9 +++++---
>>  include/linux/fault-inject.h          |  7 +++++-
>>  kernel/fail_function.c                | 22 +++++++++++++++---
>>  lib/error-inject.c                    |  6 ++++-
>>  lib/fault-inject.c                    | 43 ++++++++++++++++++++++++++++++++++-
>>  mm/fail_page_alloc.c                  |  3 ++-
>>  mm/failslab.c                         |  2 +-
>>  mm/internal.h                         |  2 ++
>>  mm/page_alloc.c                       | 11 ++++++---
>>  mm/slab.h                             |  3 +++
>>  mm/slub.c                             | 10 +++++---
>>  13 files changed, 114 insertions(+), 19 deletions(-)
>> ---
>> base-commit: 1613e604df0cd359cf2a7fbd9be7a0bcfacfabd0
>> change-id: 20240530-fault-injection-statickeys-66b7222e91b7
>> 
>> Best regards,
>> -- 
>> Vlastimil Babka <vbabka@suse.cz>
>> 
>>

[RFC,0/4] static key support for error injection functions

Message

Comments