diff mbox series

[2/2] moderr: add module error injection tool

Message ID 20250122-modules-error-injection-v1-2-910590a04fd5@samsung.com (mailing list archive)
State New
Headers show
Series [1/2] module: allow for module error injection | expand

Checks

Context Check Description
mcgrof/vmtest-modules-next-VM_Test-0 success Logs for Run CI tests
mcgrof/vmtest-modules-next-VM_Test-1 success Logs for Run CI tests
mcgrof/vmtest-modules-next-VM_Test-5 fail Logs for setup / Setup kdevops environment
mcgrof/vmtest-modules-next-VM_Test-4 fail Logs for setup / Setup kdevops environment
mcgrof/vmtest-modules-next-VM_Test-2 fail Logs for cleanup / Archive results and cleanup
mcgrof/vmtest-modules-next-VM_Test-3 fail Logs for cleanup / Archive results and cleanup
mcgrof/vmtest-modules-next-PR fail PR summary

Commit Message

Daniel Gomez Jan. 22, 2025, 1:11 p.m. UTC
Add support for a module error injection tool. The tool
can inject errors in the annotated module kernel functions
such as complete_formation(), do_init_module() and
module_enable_rodata_after_init(). Module name and module function are
required parameters to have control over the error injection.

Example: Inject error -22 to module_enable_rodata_ro_after_init for
brd module:

sudo moderr --modname=brd --modfunc=module_enable_rodata_ro_after_init \
--error=-22 --trace
Monitoring module error injection... Hit Ctrl-C to end.
MODULE     ERROR FUNCTION
brd        -22   module_enable_rodata_after_init()

Kernel messages:
[   89.463690] brd: module loaded
[   89.463855] brd: module_enable_rodata_ro_after_init() returned -22,
ro_after_init data might still be writable

Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
---
 tools/bpf/Makefile            |  13 ++-
 tools/bpf/moderr/.gitignore   |   2 +
 tools/bpf/moderr/Makefile     |  95 +++++++++++++++++
 tools/bpf/moderr/moderr.bpf.c | 127 +++++++++++++++++++++++
 tools/bpf/moderr/moderr.c     | 236 ++++++++++++++++++++++++++++++++++++++++++
 tools/bpf/moderr/moderr.h     |  40 +++++++
 6 files changed, 510 insertions(+), 3 deletions(-)

Comments

Alexei Starovoitov Jan. 22, 2025, 5:02 p.m. UTC | #1
On Wed, Jan 22, 2025 at 5:12 AM Daniel Gomez <da.gomez@samsung.com> wrote:
>
> Add support for a module error injection tool. The tool
> can inject errors in the annotated module kernel functions
> such as complete_formation(), do_init_module() and
> module_enable_rodata_after_init(). Module name and module function are
> required parameters to have control over the error injection.
>
> Example: Inject error -22 to module_enable_rodata_ro_after_init for
> brd module:
>
> sudo moderr --modname=brd --modfunc=module_enable_rodata_ro_after_init \
> --error=-22 --trace
> Monitoring module error injection... Hit Ctrl-C to end.
> MODULE     ERROR FUNCTION
> brd        -22   module_enable_rodata_after_init()
>
> Kernel messages:
> [   89.463690] brd: module loaded
> [   89.463855] brd: module_enable_rodata_ro_after_init() returned -22,
> ro_after_init data might still be writable
>
> Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
> ---
>  tools/bpf/Makefile            |  13 ++-
>  tools/bpf/moderr/.gitignore   |   2 +
>  tools/bpf/moderr/Makefile     |  95 +++++++++++++++++
>  tools/bpf/moderr/moderr.bpf.c | 127 +++++++++++++++++++++++
>  tools/bpf/moderr/moderr.c     | 236 ++++++++++++++++++++++++++++++++++++++++++
>  tools/bpf/moderr/moderr.h     |  40 +++++++
>  6 files changed, 510 insertions(+), 3 deletions(-)

The tool looks useful, but we don't add tools to the kernel repo.
It has to stay out of tree.

The value of error injection is not clear to me.
Other places in the kernel use it to test paths in the kernel
that are difficult to do otherwise.
These 3 functions don't seem to be in this category.
Luis Chamberlain Jan. 28, 2025, 8:57 p.m. UTC | #2
On Wed, Jan 22, 2025 at 09:02:19AM -0800, Alexei Starovoitov wrote:
> On Wed, Jan 22, 2025 at 5:12 AM Daniel Gomez <da.gomez@samsung.com> wrote:
> >
> > Add support for a module error injection tool. The tool
> > can inject errors in the annotated module kernel functions
> > such as complete_formation(), do_init_module() and
> > module_enable_rodata_after_init(). Module name and module function are
> > required parameters to have control over the error injection.
> >
> > Example: Inject error -22 to module_enable_rodata_ro_after_init for
> > brd module:
> >
> > sudo moderr --modname=brd --modfunc=module_enable_rodata_ro_after_init \
> > --error=-22 --trace
> > Monitoring module error injection... Hit Ctrl-C to end.
> > MODULE     ERROR FUNCTION
> > brd        -22   module_enable_rodata_after_init()
> >
> > Kernel messages:
> > [   89.463690] brd: module loaded
> > [   89.463855] brd: module_enable_rodata_ro_after_init() returned -22,
> > ro_after_init data might still be writable
> >
> > Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
> > ---
> >  tools/bpf/Makefile            |  13 ++-
> >  tools/bpf/moderr/.gitignore   |   2 +
> >  tools/bpf/moderr/Makefile     |  95 +++++++++++++++++
> >  tools/bpf/moderr/moderr.bpf.c | 127 +++++++++++++++++++++++
> >  tools/bpf/moderr/moderr.c     | 236 ++++++++++++++++++++++++++++++++++++++++++
> >  tools/bpf/moderr/moderr.h     |  40 +++++++
> >  6 files changed, 510 insertions(+), 3 deletions(-)
> 
> The tool looks useful, but we don't add tools to the kernel repo.
> It has to stay out of tree.

For selftests we do add random tools.

> The value of error injection is not clear to me.

It is of great value, since it deals with corner cases which are
otherwise hard to reproduce in places which a real error can be
catostrophic.

> Other places in the kernel use it to test paths in the kernel
> that are difficult to do otherwise.

Right.

> These 3 functions don't seem to be in this category.

That's the key here we should focus on. The problem is when a maintainer
*does* agree that adding an error injection entry is useful for testing,
and we have a developer willing to do the work to help test / validate
it. In this case, this error case is rare but we do want to strive to
test this as we ramp up and extend our modules selftests.

Then there is the aspect of how to mitigate how instrusive code changes
to allow error injection are. In 2021 we evaluated the prospect of error
injection in-kernel long ago for other areas like the block layer for
add_disk() failures [0] but the minimal interface to enable this from
userspace with debugfs was considered just too intrusive.

This effort tried to evaluate what this could look like with eBPF to
mitigate the required in-kernel code, and I believe the light weight
nature of it by just requiring a sprinkle with ALLOW_ERROR_INJECTION()
suffices to my taste.

So, perhaps the tools aspect can just go in:

tools/testing/selftests/module/

[0] https://www.spinics.net/lists/linux-block/msg68159.html

  Luis
Daniel Gomez Feb. 4, 2025, 1:30 p.m. UTC | #3
On Wed, Jan 22, 2025 at 09:02:19AM +0100, Alexei Starovoitov wrote:
> On Wed, Jan 22, 2025 at 5:12 AM Daniel Gomez <da.gomez@samsung.com> wrote:
> >
> > Add support for a module error injection tool. The tool
> > can inject errors in the annotated module kernel functions
> > such as complete_formation(), do_init_module() and
> > module_enable_rodata_after_init(). Module name and module function are
> > required parameters to have control over the error injection.
> >
> > Example: Inject error -22 to module_enable_rodata_ro_after_init for
> > brd module:
> >
> > sudo moderr --modname=brd --modfunc=module_enable_rodata_ro_after_init \
> > --error=-22 --trace
> > Monitoring module error injection... Hit Ctrl-C to end.
> > MODULE     ERROR FUNCTION
> > brd        -22   module_enable_rodata_after_init()
> >
> > Kernel messages:
> > [   89.463690] brd: module loaded
> > [   89.463855] brd: module_enable_rodata_ro_after_init() returned -22,
> > ro_after_init data might still be writable
> >
> > Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
> > ---
> >  tools/bpf/Makefile            |  13 ++-
> >  tools/bpf/moderr/.gitignore   |   2 +
> >  tools/bpf/moderr/Makefile     |  95 +++++++++++++++++
> >  tools/bpf/moderr/moderr.bpf.c | 127 +++++++++++++++++++++++
> >  tools/bpf/moderr/moderr.c     | 236 ++++++++++++++++++++++++++++++++++++++++++
> >  tools/bpf/moderr/moderr.h     |  40 +++++++
> >  6 files changed, 510 insertions(+), 3 deletions(-)
> 
> The tool looks useful, but we don't add tools to the kernel repo.
> It has to stay out of tree.

Can you clarify what do you mean? There are other tools under tools/ and tools/
bpf [1].

[1] https://lore.kernel.org/bpf/20200114184230.GA204154@krava/

I will anyway move the tool to the suggested location in the other thread.

> 
> The value of error injection is not clear to me.
> Other places in the kernel use it to test paths in the kernel
> that are difficult to do otherwise.

By any chance do you know which tool are they using?

> These 3 functions don't seem to be in this category.

Luis already answered this.
Alexei Starovoitov Feb. 4, 2025, 3:28 p.m. UTC | #4
On Tue, Feb 4, 2025 at 1:30 PM Daniel Gomez <da.gomez@kernel.org> wrote:
>
> On Wed, Jan 22, 2025 at 09:02:19AM +0100, Alexei Starovoitov wrote:
> > On Wed, Jan 22, 2025 at 5:12 AM Daniel Gomez <da.gomez@samsung.com> wrote:
> > >
> > > Add support for a module error injection tool. The tool
> > > can inject errors in the annotated module kernel functions
> > > such as complete_formation(), do_init_module() and
> > > module_enable_rodata_after_init(). Module name and module function are
> > > required parameters to have control over the error injection.
> > >
> > > Example: Inject error -22 to module_enable_rodata_ro_after_init for
> > > brd module:
> > >
> > > sudo moderr --modname=brd --modfunc=module_enable_rodata_ro_after_init \
> > > --error=-22 --trace
> > > Monitoring module error injection... Hit Ctrl-C to end.
> > > MODULE     ERROR FUNCTION
> > > brd        -22   module_enable_rodata_after_init()
> > >
> > > Kernel messages:
> > > [   89.463690] brd: module loaded
> > > [   89.463855] brd: module_enable_rodata_ro_after_init() returned -22,
> > > ro_after_init data might still be writable
> > >
> > > Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
> > > ---
> > >  tools/bpf/Makefile            |  13 ++-
> > >  tools/bpf/moderr/.gitignore   |   2 +
> > >  tools/bpf/moderr/Makefile     |  95 +++++++++++++++++
> > >  tools/bpf/moderr/moderr.bpf.c | 127 +++++++++++++++++++++++
> > >  tools/bpf/moderr/moderr.c     | 236 ++++++++++++++++++++++++++++++++++++++++++
> > >  tools/bpf/moderr/moderr.h     |  40 +++++++
> > >  6 files changed, 510 insertions(+), 3 deletions(-)
> >
> > The tool looks useful, but we don't add tools to the kernel repo.
> > It has to stay out of tree.
>
> Can you clarify what do you mean? There are other tools under tools/ and tools/
> bpf [1].
>
> [1] https://lore.kernel.org/bpf/20200114184230.GA204154@krava/

As you noticed we added only one tool out of many and recently
discussed removing it, since the value of keeping it in the tree
is minimal.

> I will anyway move the tool to the suggested location in the other thread.

I don't think it's a good idea.
Keep it outside. There is no reason for it to live in the tree.
Lucas De Marchi Feb. 19, 2025, 8:17 p.m. UTC | #5
On Tue, Jan 28, 2025 at 12:57:05PM -0800, Luis Chamberlain wrote:
>On Wed, Jan 22, 2025 at 09:02:19AM -0800, Alexei Starovoitov wrote:
>> On Wed, Jan 22, 2025 at 5:12 AM Daniel Gomez <da.gomez@samsung.com> wrote:
>> >
>> > Add support for a module error injection tool. The tool
>> > can inject errors in the annotated module kernel functions
>> > such as complete_formation(), do_init_module() and
>> > module_enable_rodata_after_init(). Module name and module function are
>> > required parameters to have control over the error injection.
>> >
>> > Example: Inject error -22 to module_enable_rodata_ro_after_init for
>> > brd module:
>> >
>> > sudo moderr --modname=brd --modfunc=module_enable_rodata_ro_after_init \
>> > --error=-22 --trace
>> > Monitoring module error injection... Hit Ctrl-C to end.
>> > MODULE     ERROR FUNCTION
>> > brd        -22   module_enable_rodata_after_init()
>> >
>> > Kernel messages:
>> > [   89.463690] brd: module loaded
>> > [   89.463855] brd: module_enable_rodata_ro_after_init() returned -22,
>> > ro_after_init data might still be writable
>> >
>> > Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
>> > ---
>> >  tools/bpf/Makefile            |  13 ++-
>> >  tools/bpf/moderr/.gitignore   |   2 +
>> >  tools/bpf/moderr/Makefile     |  95 +++++++++++++++++
>> >  tools/bpf/moderr/moderr.bpf.c | 127 +++++++++++++++++++++++
>> >  tools/bpf/moderr/moderr.c     | 236 ++++++++++++++++++++++++++++++++++++++++++
>> >  tools/bpf/moderr/moderr.h     |  40 +++++++
>> >  6 files changed, 510 insertions(+), 3 deletions(-)
>>
>> The tool looks useful, but we don't add tools to the kernel repo.
>> It has to stay out of tree.
>
>For selftests we do add random tools.
>
>> The value of error injection is not clear to me.
>
>It is of great value, since it deals with corner cases which are
>otherwise hard to reproduce in places which a real error can be
>catostrophic.
>
>> Other places in the kernel use it to test paths in the kernel
>> that are difficult to do otherwise.
>
>Right.
>
>> These 3 functions don't seem to be in this category.
>
>That's the key here we should focus on. The problem is when a maintainer
>*does* agree that adding an error injection entry is useful for testing,
>and we have a developer willing to do the work to help test / validate
>it. In this case, this error case is rare but we do want to strive to
>test this as we ramp up and extend our modules selftests.
>
>Then there is the aspect of how to mitigate how instrusive code changes
>to allow error injection are. In 2021 we evaluated the prospect of error
>injection in-kernel long ago for other areas like the block layer for
>add_disk() failures [0] but the minimal interface to enable this from
>userspace with debugfs was considered just too intrusive.
>
>This effort tried to evaluate what this could look like with eBPF to
>mitigate the required in-kernel code, and I believe the light weight
>nature of it by just requiring a sprinkle with ALLOW_ERROR_INJECTION()
>suffices to my taste.
>
>So, perhaps the tools aspect can just go in:
>
>tools/testing/selftests/module/

but why would it be module-specific? Based on its current implementation
and discussion about inject.py it seems to be generic enough to be
useful to test any function annotated with ALLOW_ERROR_INJECTION().

As xe driver maintainer, it may be interesting to use such a tool:

	$ git grep ALLOW_ERROR_INJECT -- drivers/gpu/drm/xe | wc -l  
	23

How does this approach compare to writing the function name on debugfs
(the current approach in xe's testsuite)?

	fail_function @ https://docs.kernel.org/fault-injection/fault-injection.html#fault-injection-capabilities-infrastructure
	https://gitlab.freedesktop.org/drm/igt-gpu-tools/-/blob/master/tests/intel/xe_fault_injection.c?ref_type=heads#L108

If you decide to have the tool to live somewhere else, then kmod repo
could be a candidate. Although I think having it in kernel tree is
simpler maintenance-wise.

Lucas De Marchi

>
>[0] https://www.spinics.net/lists/linux-block/msg68159.html
>
>  Luis
Luis Chamberlain Feb. 21, 2025, 8:15 p.m. UTC | #6
On Wed, Feb 19, 2025 at 02:17:48PM -0600, Lucas De Marchi wrote:
> On Tue, Jan 28, 2025 at 12:57:05PM -0800, Luis Chamberlain wrote:
> > On Wed, Jan 22, 2025 at 09:02:19AM -0800, Alexei Starovoitov wrote:
> > > On Wed, Jan 22, 2025 at 5:12 AM Daniel Gomez <da.gomez@samsung.com> wrote:
> > > >
> > > > Add support for a module error injection tool. The tool
> > > > can inject errors in the annotated module kernel functions
> > > > such as complete_formation(), do_init_module() and
> > > > module_enable_rodata_after_init(). Module name and module function are
> > > > required parameters to have control over the error injection.
> > > >
> > > > Example: Inject error -22 to module_enable_rodata_ro_after_init for
> > > > brd module:
> > > >
> > > > sudo moderr --modname=brd --modfunc=module_enable_rodata_ro_after_init \
> > > > --error=-22 --trace
> > > > Monitoring module error injection... Hit Ctrl-C to end.
> > > > MODULE     ERROR FUNCTION
> > > > brd        -22   module_enable_rodata_after_init()
> > > >
> > > > Kernel messages:
> > > > [   89.463690] brd: module loaded
> > > > [   89.463855] brd: module_enable_rodata_ro_after_init() returned -22,
> > > > ro_after_init data might still be writable
> > > >
> > > > Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
> > > > ---
> > > >  tools/bpf/Makefile            |  13 ++-
> > > >  tools/bpf/moderr/.gitignore   |   2 +
> > > >  tools/bpf/moderr/Makefile     |  95 +++++++++++++++++
> > > >  tools/bpf/moderr/moderr.bpf.c | 127 +++++++++++++++++++++++
> > > >  tools/bpf/moderr/moderr.c     | 236 ++++++++++++++++++++++++++++++++++++++++++
> > > >  tools/bpf/moderr/moderr.h     |  40 +++++++
> > > >  6 files changed, 510 insertions(+), 3 deletions(-)
> > > 
> > > The tool looks useful, but we don't add tools to the kernel repo.
> > > It has to stay out of tree.
> > 
> > For selftests we do add random tools.
> > 
> > > The value of error injection is not clear to me.
> > 
> > It is of great value, since it deals with corner cases which are
> > otherwise hard to reproduce in places which a real error can be
> > catostrophic.
> > 
> > > Other places in the kernel use it to test paths in the kernel
> > > that are difficult to do otherwise.
> > 
> > Right.
> > 
> > > These 3 functions don't seem to be in this category.
> > 
> > That's the key here we should focus on. The problem is when a maintainer
> > *does* agree that adding an error injection entry is useful for testing,
> > and we have a developer willing to do the work to help test / validate
> > it. In this case, this error case is rare but we do want to strive to
> > test this as we ramp up and extend our modules selftests.
> > 
> > Then there is the aspect of how to mitigate how instrusive code changes
> > to allow error injection are. In 2021 we evaluated the prospect of error
> > injection in-kernel long ago for other areas like the block layer for
> > add_disk() failures [0] but the minimal interface to enable this from
> > userspace with debugfs was considered just too intrusive.
> > 
> > This effort tried to evaluate what this could look like with eBPF to
> > mitigate the required in-kernel code, and I believe the light weight
> > nature of it by just requiring a sprinkle with ALLOW_ERROR_INJECTION()
> > suffices to my taste.
> > 
> > So, perhaps the tools aspect can just go in:
> > 
> > tools/testing/selftests/module/
> 
> but why would it be module-specific?

Gotta start somewhere.

> Based on its current implementation
> and discussion about inject.py it seems to be generic enough to be
> useful to test any function annotated with ALLOW_ERROR_INJECTION().
> 
> As xe driver maintainer, it may be interesting to use such a tool:
> 
> 	$ git grep ALLOW_ERROR_INJECT -- drivers/gpu/drm/xe | wc -l  	23
> 
> How does this approach compare to writing the function name on debugfs
> (the current approach in xe's testsuite)?
> 
> 	fail_function @ https://docs.kernel.org/fault-injection/fault-injection.html#fault-injection-capabilities-infrastructure
> 	https://gitlab.freedesktop.org/drm/igt-gpu-tools/-/blob/master/tests/intel/xe_fault_injection.c?ref_type=heads#L108
> 
> If you decide to have the tool to live somewhere else, then kmod repo
> could be a candidate.

Would we install this upon install target?

Danny can decide on this :)

> Although I think having it in kernel tree is
> simpler maintenance-wise.

I think we have at least two users upstream who can make use of it. If
we end up going through tools/testing/selftests/module/ first, can't
you make use of it later?

  Luis
Daniel Gomez Feb. 22, 2025, 9:29 p.m. UTC | #7
On Wed, Feb 19, 2025 at 02:17:48PM +0100, Lucas De Marchi wrote:
> On Tue, Jan 28, 2025 at 12:57:05PM -0800, Luis Chamberlain wrote:
> > On Wed, Jan 22, 2025 at 09:02:19AM -0800, Alexei Starovoitov wrote:
> > > On Wed, Jan 22, 2025 at 5:12 AM Daniel Gomez <da.gomez@samsung.com> wrote:
> > > >
> > > > Add support for a module error injection tool. The tool
> > > > can inject errors in the annotated module kernel functions
> > > > such as complete_formation(), do_init_module() and
> > > > module_enable_rodata_after_init(). Module name and module function are
> > > > required parameters to have control over the error injection.
> > > >
> > > > Example: Inject error -22 to module_enable_rodata_ro_after_init for
> > > > brd module:
> > > >
> > > > sudo moderr --modname=brd --modfunc=module_enable_rodata_ro_after_init \
> > > > --error=-22 --trace
> > > > Monitoring module error injection... Hit Ctrl-C to end.
> > > > MODULE     ERROR FUNCTION
> > > > brd        -22   module_enable_rodata_after_init()
> > > >
> > > > Kernel messages:
> > > > [   89.463690] brd: module loaded
> > > > [   89.463855] brd: module_enable_rodata_ro_after_init() returned -22,
> > > > ro_after_init data might still be writable
> > > >
> > > > Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
> > > > ---
> > > >  tools/bpf/Makefile            |  13 ++-
> > > >  tools/bpf/moderr/.gitignore   |   2 +
> > > >  tools/bpf/moderr/Makefile     |  95 +++++++++++++++++
> > > >  tools/bpf/moderr/moderr.bpf.c | 127 +++++++++++++++++++++++
> > > >  tools/bpf/moderr/moderr.c     | 236 ++++++++++++++++++++++++++++++++++++++++++
> > > >  tools/bpf/moderr/moderr.h     |  40 +++++++
> > > >  6 files changed, 510 insertions(+), 3 deletions(-)
> > > 
> > > The tool looks useful, but we don't add tools to the kernel repo.
> > > It has to stay out of tree.
> > 
> > For selftests we do add random tools.
> > 
> > > The value of error injection is not clear to me.
> > 
> > It is of great value, since it deals with corner cases which are
> > otherwise hard to reproduce in places which a real error can be
> > catostrophic.
> > 
> > > Other places in the kernel use it to test paths in the kernel
> > > that are difficult to do otherwise.
> > 
> > Right.
> > 
> > > These 3 functions don't seem to be in this category.
> > 
> > That's the key here we should focus on. The problem is when a maintainer
> > *does* agree that adding an error injection entry is useful for testing,
> > and we have a developer willing to do the work to help test / validate
> > it. In this case, this error case is rare but we do want to strive to
> > test this as we ramp up and extend our modules selftests.
> > 
> > Then there is the aspect of how to mitigate how instrusive code changes
> > to allow error injection are. In 2021 we evaluated the prospect of error
> > injection in-kernel long ago for other areas like the block layer for
> > add_disk() failures [0] but the minimal interface to enable this from
> > userspace with debugfs was considered just too intrusive.
> > 
> > This effort tried to evaluate what this could look like with eBPF to
> > mitigate the required in-kernel code, and I believe the light weight
> > nature of it by just requiring a sprinkle with ALLOW_ERROR_INJECTION()
> > suffices to my taste.
> > 
> > So, perhaps the tools aspect can just go in:
> > 
> > tools/testing/selftests/module/
> 
> but why would it be module-specific? Based on its current implementation
> and discussion about inject.py it seems to be generic enough to be
> useful to test any function annotated with ALLOW_ERROR_INJECTION().

Right, but inject.py is based on the old/deprecated Python eBPF/BCC
infrastructure (although I think it's still a working and maintained tool based
on commit history). However, as I noted in the cover letter, I think it would be
useful to port it to use libbpf instead.

> 
> As xe driver maintainer, it may be interesting to use such a tool:
> 
> 	$ git grep ALLOW_ERROR_INJECT -- drivers/gpu/drm/xe | wc -l  	23

I was wondering if users of ALLOW_ERROR_INJECTION() still depend on inject.py
tool, or other tools were used. Or perhaps all are using debugfs?

> 
> How does this approach compare to writing the function name on debugfs
> (the current approach in xe's testsuite)?
> 
> 	fail_function @ https://docs.kernel.org/fault-injection/fault-injection.html#fault-injection-capabilities-infrastructure
> 	https://gitlab.freedesktop.org/drm/igt-gpu-tools/-/blob/master/tests/intel/xe_fault_injection.c?ref_type=heads#L108

IMHO and looking at the references above, looks "simpler" to handle these cases
in eBPF. Specially because the error injection logic does not have to live along
with the code but in a separate tool. Link from Luis [0] (also [1]) is a good
example of when debugfs may be seen a bit too intrusive and how eBPF may resolve
the problems maintainers have.

[1] https://lore.kernel.org/all/20210512064629.13899-1-mcgrof@kernel.org/

In summary, a function annotated with the error injection tag can modify
its return value in eBPF code by using the bpf_override_return() helper. The
logic for deciding when to inject the error is likely similar to the current
implementation in debugfs. This can be seen in patch 2, file tools/bpf/
moderr/moderr.bpf.c where error is injected based on module name and target
function.

> 
> If you decide to have the tool to live somewhere else, then kmod repo
> could be a candidate. Although I think having it in kernel tree is

Thanks Lucas.

> simpler maintenance-wise.

I agree with you that having the tool (or maybe tools if we decide to split) in
tree it may be easier and may be a way to showcase its usage more effectively.

> 
> Lucas De Marchi
> 
> > 
> > [0] https://www.spinics.net/lists/linux-block/msg68159.html
> > 
> >  Luis
Daniel Gomez Feb. 22, 2025, 9:35 p.m. UTC | #8
On Fri, Feb 21, 2025 at 12:15:40PM +0100, Luis Chamberlain wrote:
> On Wed, Feb 19, 2025 at 02:17:48PM -0600, Lucas De Marchi wrote:
> > On Tue, Jan 28, 2025 at 12:57:05PM -0800, Luis Chamberlain wrote:
> > > On Wed, Jan 22, 2025 at 09:02:19AM -0800, Alexei Starovoitov wrote:
> > > > On Wed, Jan 22, 2025 at 5:12 AM Daniel Gomez <da.gomez@samsung.com> wrote:
> > > > >
> > > > > Add support for a module error injection tool. The tool
> > > > > can inject errors in the annotated module kernel functions
> > > > > such as complete_formation(), do_init_module() and
> > > > > module_enable_rodata_after_init(). Module name and module function are
> > > > > required parameters to have control over the error injection.
> > > > >
> > > > > Example: Inject error -22 to module_enable_rodata_ro_after_init for
> > > > > brd module:
> > > > >
> > > > > sudo moderr --modname=brd --modfunc=module_enable_rodata_ro_after_init \
> > > > > --error=-22 --trace
> > > > > Monitoring module error injection... Hit Ctrl-C to end.
> > > > > MODULE     ERROR FUNCTION
> > > > > brd        -22   module_enable_rodata_after_init()
> > > > >
> > > > > Kernel messages:
> > > > > [   89.463690] brd: module loaded
> > > > > [   89.463855] brd: module_enable_rodata_ro_after_init() returned -22,
> > > > > ro_after_init data might still be writable
> > > > >
> > > > > Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
> > > > > ---
> > > > >  tools/bpf/Makefile            |  13 ++-
> > > > >  tools/bpf/moderr/.gitignore   |   2 +
> > > > >  tools/bpf/moderr/Makefile     |  95 +++++++++++++++++
> > > > >  tools/bpf/moderr/moderr.bpf.c | 127 +++++++++++++++++++++++
> > > > >  tools/bpf/moderr/moderr.c     | 236 ++++++++++++++++++++++++++++++++++++++++++
> > > > >  tools/bpf/moderr/moderr.h     |  40 +++++++
> > > > >  6 files changed, 510 insertions(+), 3 deletions(-)
> > > > 
> > > > The tool looks useful, but we don't add tools to the kernel repo.
> > > > It has to stay out of tree.
> > > 
> > > For selftests we do add random tools.
> > > 
> > > > The value of error injection is not clear to me.
> > > 
> > > It is of great value, since it deals with corner cases which are
> > > otherwise hard to reproduce in places which a real error can be
> > > catostrophic.
> > > 
> > > > Other places in the kernel use it to test paths in the kernel
> > > > that are difficult to do otherwise.
> > > 
> > > Right.
> > > 
> > > > These 3 functions don't seem to be in this category.
> > > 
> > > That's the key here we should focus on. The problem is when a maintainer
> > > *does* agree that adding an error injection entry is useful for testing,
> > > and we have a developer willing to do the work to help test / validate
> > > it. In this case, this error case is rare but we do want to strive to
> > > test this as we ramp up and extend our modules selftests.
> > > 
> > > Then there is the aspect of how to mitigate how instrusive code changes
> > > to allow error injection are. In 2021 we evaluated the prospect of error
> > > injection in-kernel long ago for other areas like the block layer for
> > > add_disk() failures [0] but the minimal interface to enable this from
> > > userspace with debugfs was considered just too intrusive.
> > > 
> > > This effort tried to evaluate what this could look like with eBPF to
> > > mitigate the required in-kernel code, and I believe the light weight
> > > nature of it by just requiring a sprinkle with ALLOW_ERROR_INJECTION()
> > > suffices to my taste.
> > > 
> > > So, perhaps the tools aspect can just go in:
> > > 
> > > tools/testing/selftests/module/
> > 
> > but why would it be module-specific?
> 
> Gotta start somewhere.
> 
> > Based on its current implementation
> > and discussion about inject.py it seems to be generic enough to be
> > useful to test any function annotated with ALLOW_ERROR_INJECTION().
> > 
> > As xe driver maintainer, it may be interesting to use such a tool:
> > 
> > 	$ git grep ALLOW_ERROR_INJECT -- drivers/gpu/drm/xe | wc -l  	23
> > 
> > How does this approach compare to writing the function name on debugfs
> > (the current approach in xe's testsuite)?
> > 
> > 	fail_function @ https://docs.kernel.org/fault-injection/fault-injection.html#fault-injection-capabilities-infrastructure
> > 	https://gitlab.freedesktop.org/drm/igt-gpu-tools/-/blob/master/tests/intel/xe_fault_injection.c?ref_type=heads#L108
> > 
> > If you decide to have the tool to live somewhere else, then kmod repo
> > could be a candidate.
> 
> Would we install this upon install target?
> 
> Danny can decide on this :)
> 
> > Although I think having it in kernel tree is
> > simpler maintenance-wise.
> 
> I think we have at least two users upstream who can make use of it. If
> we end up going through tools/testing/selftests/module/ first, can't
> you make use of it later?

What are the features in debugfs required to be useful for xe that we can
port to an eBPF version? I see from the link provided the use of probability,
interval, times and space but these are configured to allways trigger the error.
Is that right?

> 
>   Luis
Lucas De Marchi Feb. 24, 2025, 2:43 p.m. UTC | #9
On Sat, Feb 22, 2025 at 10:35:07PM +0100, Daniel Gomez wrote:
>On Fri, Feb 21, 2025 at 12:15:40PM +0100, Luis Chamberlain wrote:
>> On Wed, Feb 19, 2025 at 02:17:48PM -0600, Lucas De Marchi wrote:
>> > On Tue, Jan 28, 2025 at 12:57:05PM -0800, Luis Chamberlain wrote:
>> > > On Wed, Jan 22, 2025 at 09:02:19AM -0800, Alexei Starovoitov wrote:
>> > > > On Wed, Jan 22, 2025 at 5:12 AM Daniel Gomez <da.gomez@samsung.com> wrote:
>> > > > >
>> > > > > Add support for a module error injection tool. The tool
>> > > > > can inject errors in the annotated module kernel functions
>> > > > > such as complete_formation(), do_init_module() and
>> > > > > module_enable_rodata_after_init(). Module name and module function are
>> > > > > required parameters to have control over the error injection.
>> > > > >
>> > > > > Example: Inject error -22 to module_enable_rodata_ro_after_init for
>> > > > > brd module:
>> > > > >
>> > > > > sudo moderr --modname=brd --modfunc=module_enable_rodata_ro_after_init \
>> > > > > --error=-22 --trace
>> > > > > Monitoring module error injection... Hit Ctrl-C to end.
>> > > > > MODULE     ERROR FUNCTION
>> > > > > brd        -22   module_enable_rodata_after_init()
>> > > > >
>> > > > > Kernel messages:
>> > > > > [   89.463690] brd: module loaded
>> > > > > [   89.463855] brd: module_enable_rodata_ro_after_init() returned -22,
>> > > > > ro_after_init data might still be writable
>> > > > >
>> > > > > Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
>> > > > > ---
>> > > > >  tools/bpf/Makefile            |  13 ++-
>> > > > >  tools/bpf/moderr/.gitignore   |   2 +
>> > > > >  tools/bpf/moderr/Makefile     |  95 +++++++++++++++++
>> > > > >  tools/bpf/moderr/moderr.bpf.c | 127 +++++++++++++++++++++++
>> > > > >  tools/bpf/moderr/moderr.c     | 236 ++++++++++++++++++++++++++++++++++++++++++
>> > > > >  tools/bpf/moderr/moderr.h     |  40 +++++++
>> > > > >  6 files changed, 510 insertions(+), 3 deletions(-)
>> > > >
>> > > > The tool looks useful, but we don't add tools to the kernel repo.
>> > > > It has to stay out of tree.
>> > >
>> > > For selftests we do add random tools.
>> > >
>> > > > The value of error injection is not clear to me.
>> > >
>> > > It is of great value, since it deals with corner cases which are
>> > > otherwise hard to reproduce in places which a real error can be
>> > > catostrophic.
>> > >
>> > > > Other places in the kernel use it to test paths in the kernel
>> > > > that are difficult to do otherwise.
>> > >
>> > > Right.
>> > >
>> > > > These 3 functions don't seem to be in this category.
>> > >
>> > > That's the key here we should focus on. The problem is when a maintainer
>> > > *does* agree that adding an error injection entry is useful for testing,
>> > > and we have a developer willing to do the work to help test / validate
>> > > it. In this case, this error case is rare but we do want to strive to
>> > > test this as we ramp up and extend our modules selftests.
>> > >
>> > > Then there is the aspect of how to mitigate how instrusive code changes
>> > > to allow error injection are. In 2021 we evaluated the prospect of error
>> > > injection in-kernel long ago for other areas like the block layer for
>> > > add_disk() failures [0] but the minimal interface to enable this from
>> > > userspace with debugfs was considered just too intrusive.
>> > >
>> > > This effort tried to evaluate what this could look like with eBPF to
>> > > mitigate the required in-kernel code, and I believe the light weight
>> > > nature of it by just requiring a sprinkle with ALLOW_ERROR_INJECTION()
>> > > suffices to my taste.
>> > >
>> > > So, perhaps the tools aspect can just go in:
>> > >
>> > > tools/testing/selftests/module/
>> >
>> > but why would it be module-specific?
>>
>> Gotta start somewhere.
>>
>> > Based on its current implementation
>> > and discussion about inject.py it seems to be generic enough to be
>> > useful to test any function annotated with ALLOW_ERROR_INJECTION().
>> >
>> > As xe driver maintainer, it may be interesting to use such a tool:
>> >
>> > 	$ git grep ALLOW_ERROR_INJECT -- drivers/gpu/drm/xe | wc -l  	23
>> >
>> > How does this approach compare to writing the function name on debugfs
>> > (the current approach in xe's testsuite)?
>> >
>> > 	fail_function @ https://docs.kernel.org/fault-injection/fault-injection.html#fault-injection-capabilities-infrastructure
>> > 	https://gitlab.freedesktop.org/drm/igt-gpu-tools/-/blob/master/tests/intel/xe_fault_injection.c?ref_type=heads#L108
>> >
>> > If you decide to have the tool to live somewhere else, then kmod repo
>> > could be a candidate.
>>
>> Would we install this upon install target?
>>
>> Danny can decide on this :)
>>
>> > Although I think having it in kernel tree is
>> > simpler maintenance-wise.
>>
>> I think we have at least two users upstream who can make use of it. If
>> we end up going through tools/testing/selftests/module/ first, can't
>> you make use of it later?
>
>What are the features in debugfs required to be useful for xe that we can
>port to an eBPF version? I see from the link provided the use of probability,
>interval, times and space but these are configured to allways trigger the error.
>Is that right?

I don't think we use them... we just set them to "always trigger" and
then create the conditions for that to happen.  But my question still
remains:  what is the benefit of using the bpf approach over writing
these files?

Lucas De Marchi

>
>>
>>   Luis
Daniel Gomez Feb. 28, 2025, 9:27 a.m. UTC | #10
On Mon, Feb 24, 2025 at 08:43:45AM +0100, Lucas De Marchi wrote:
> On Sat, Feb 22, 2025 at 10:35:07PM +0100, Daniel Gomez wrote:
> > On Fri, Feb 21, 2025 at 12:15:40PM +0100, Luis Chamberlain wrote:
> > > On Wed, Feb 19, 2025 at 02:17:48PM -0600, Lucas De Marchi wrote:
> > > > On Tue, Jan 28, 2025 at 12:57:05PM -0800, Luis Chamberlain wrote:
> > > > > On Wed, Jan 22, 2025 at 09:02:19AM -0800, Alexei Starovoitov wrote:
> > > > > > On Wed, Jan 22, 2025 at 5:12 AM Daniel Gomez <da.gomez@samsung.com> wrote:
> > > > > > >
> > > > > > > Add support for a module error injection tool. The tool
> > > > > > > can inject errors in the annotated module kernel functions
> > > > > > > such as complete_formation(), do_init_module() and
> > > > > > > module_enable_rodata_after_init(). Module name and module function are
> > > > > > > required parameters to have control over the error injection.
> > > > > > >
> > > > > > > Example: Inject error -22 to module_enable_rodata_ro_after_init for
> > > > > > > brd module:
> > > > > > >
> > > > > > > sudo moderr --modname=brd --modfunc=module_enable_rodata_ro_after_init \
> > > > > > > --error=-22 --trace
> > > > > > > Monitoring module error injection... Hit Ctrl-C to end.
> > > > > > > MODULE     ERROR FUNCTION
> > > > > > > brd        -22   module_enable_rodata_after_init()
> > > > > > >
> > > > > > > Kernel messages:
> > > > > > > [   89.463690] brd: module loaded
> > > > > > > [   89.463855] brd: module_enable_rodata_ro_after_init() returned -22,
> > > > > > > ro_after_init data might still be writable
> > > > > > >
> > > > > > > Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
> > > > > > > ---
> > > > > > >  tools/bpf/Makefile            |  13 ++-
> > > > > > >  tools/bpf/moderr/.gitignore   |   2 +
> > > > > > >  tools/bpf/moderr/Makefile     |  95 +++++++++++++++++
> > > > > > >  tools/bpf/moderr/moderr.bpf.c | 127 +++++++++++++++++++++++
> > > > > > >  tools/bpf/moderr/moderr.c     | 236 ++++++++++++++++++++++++++++++++++++++++++
> > > > > > >  tools/bpf/moderr/moderr.h     |  40 +++++++
> > > > > > >  6 files changed, 510 insertions(+), 3 deletions(-)
> > > > > >
> > > > > > The tool looks useful, but we don't add tools to the kernel repo.
> > > > > > It has to stay out of tree.
> > > > >
> > > > > For selftests we do add random tools.
> > > > >
> > > > > > The value of error injection is not clear to me.
> > > > >
> > > > > It is of great value, since it deals with corner cases which are
> > > > > otherwise hard to reproduce in places which a real error can be
> > > > > catostrophic.
> > > > >
> > > > > > Other places in the kernel use it to test paths in the kernel
> > > > > > that are difficult to do otherwise.
> > > > >
> > > > > Right.
> > > > >
> > > > > > These 3 functions don't seem to be in this category.
> > > > >
> > > > > That's the key here we should focus on. The problem is when a maintainer
> > > > > *does* agree that adding an error injection entry is useful for testing,
> > > > > and we have a developer willing to do the work to help test / validate
> > > > > it. In this case, this error case is rare but we do want to strive to
> > > > > test this as we ramp up and extend our modules selftests.
> > > > >
> > > > > Then there is the aspect of how to mitigate how instrusive code changes
> > > > > to allow error injection are. In 2021 we evaluated the prospect of error
> > > > > injection in-kernel long ago for other areas like the block layer for
> > > > > add_disk() failures [0] but the minimal interface to enable this from
> > > > > userspace with debugfs was considered just too intrusive.
> > > > >
> > > > > This effort tried to evaluate what this could look like with eBPF to
> > > > > mitigate the required in-kernel code, and I believe the light weight
> > > > > nature of it by just requiring a sprinkle with ALLOW_ERROR_INJECTION()
> > > > > suffices to my taste.
> > > > >
> > > > > So, perhaps the tools aspect can just go in:
> > > > >
> > > > > tools/testing/selftests/module/
> > > >
> > > > but why would it be module-specific?
> > > 
> > > Gotta start somewhere.
> > > 
> > > > Based on its current implementation
> > > > and discussion about inject.py it seems to be generic enough to be
> > > > useful to test any function annotated with ALLOW_ERROR_INJECTION().
> > > >
> > > > As xe driver maintainer, it may be interesting to use such a tool:
> > > >
> > > > 	$ git grep ALLOW_ERROR_INJECT -- drivers/gpu/drm/xe | wc -l  	23
> > > >
> > > > How does this approach compare to writing the function name on debugfs
> > > > (the current approach in xe's testsuite)?
> > > >
> > > > 	fail_function @ https://docs.kernel.org/fault-injection/fault-injection.html#fault-injection-capabilities-infrastructure
> > > > 	https://gitlab.freedesktop.org/drm/igt-gpu-tools/-/blob/master/tests/intel/xe_fault_injection.c?ref_type=heads#L108
> > > >
> > > > If you decide to have the tool to live somewhere else, then kmod repo
> > > > could be a candidate.
> > > 
> > > Would we install this upon install target?
> > > 
> > > Danny can decide on this :)
> > > 
> > > > Although I think having it in kernel tree is
> > > > simpler maintenance-wise.
> > > 
> > > I think we have at least two users upstream who can make use of it. If
> > > we end up going through tools/testing/selftests/module/ first, can't
> > > you make use of it later?
> > 
> > What are the features in debugfs required to be useful for xe that we can
> > port to an eBPF version? I see from the link provided the use of probability,
> > interval, times and space but these are configured to allways trigger the error.
> > Is that right?
> 
> I don't think we use them... we just set them to "always trigger" and
> then create the conditions for that to happen.  But my question still
> remains:  what is the benefit of using the bpf approach over writing
> these files?

The code to trigger the error injection still needs to exist with both
approaches. My understanding from debugfs and the comment from Luis earlier in
the thread is that with eBPF you can mitigate how intrusive in-kernel code
changes are to allow error injection. Luis added the example here [1] for
debugfs.

[1] https://lore.kernel.org/all/20210512064629.13899-9-mcgrof@kernel.org/

To compare patch 8 in [1] with eBPF approach: the patch describes
all the necessary changes required to allow error injection on the
add_disk() path. With eBPF one would simply annotate the function(s) with
ALLOW_ERROR_INJECTION(), e.g. device_add() and replace the return value
in eBPF code with bpf_override_return() as implemented in moderr tool for
module_enable_rdata_after_init() for example.

New error injection users such modules or block/disk would benefit of the eBPF
approach with having a more simpler in-kernel code for the same purpose. Current
users of debugfs/errorinj would have to remove all the upstream intrusive error
injection related code if they want to adopt the eBPF approach.

Daniel

> 
> Lucas De Marchi
> 
> > 
> > > 
> > >   Luis
Lucas De Marchi Feb. 28, 2025, 6:48 p.m. UTC | #11
On Fri, Feb 28, 2025 at 10:27:17AM +0100, Daniel Gomez wrote:
>On Mon, Feb 24, 2025 at 08:43:45AM +0100, Lucas De Marchi wrote:
>> On Sat, Feb 22, 2025 at 10:35:07PM +0100, Daniel Gomez wrote:
>> > On Fri, Feb 21, 2025 at 12:15:40PM +0100, Luis Chamberlain wrote:
>> > > On Wed, Feb 19, 2025 at 02:17:48PM -0600, Lucas De Marchi wrote:
>> > > > On Tue, Jan 28, 2025 at 12:57:05PM -0800, Luis Chamberlain wrote:
>> > > > > On Wed, Jan 22, 2025 at 09:02:19AM -0800, Alexei Starovoitov wrote:
>> > > > > > On Wed, Jan 22, 2025 at 5:12 AM Daniel Gomez <da.gomez@samsung.com> wrote:
>> > > > > > >
>> > > > > > > Add support for a module error injection tool. The tool
>> > > > > > > can inject errors in the annotated module kernel functions
>> > > > > > > such as complete_formation(), do_init_module() and
>> > > > > > > module_enable_rodata_after_init(). Module name and module function are
>> > > > > > > required parameters to have control over the error injection.
>> > > > > > >
>> > > > > > > Example: Inject error -22 to module_enable_rodata_ro_after_init for
>> > > > > > > brd module:
>> > > > > > >
>> > > > > > > sudo moderr --modname=brd --modfunc=module_enable_rodata_ro_after_init \
>> > > > > > > --error=-22 --trace
>> > > > > > > Monitoring module error injection... Hit Ctrl-C to end.
>> > > > > > > MODULE     ERROR FUNCTION
>> > > > > > > brd        -22   module_enable_rodata_after_init()
>> > > > > > >
>> > > > > > > Kernel messages:
>> > > > > > > [   89.463690] brd: module loaded
>> > > > > > > [   89.463855] brd: module_enable_rodata_ro_after_init() returned -22,
>> > > > > > > ro_after_init data might still be writable
>> > > > > > >
>> > > > > > > Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
>> > > > > > > ---
>> > > > > > >  tools/bpf/Makefile            |  13 ++-
>> > > > > > >  tools/bpf/moderr/.gitignore   |   2 +
>> > > > > > >  tools/bpf/moderr/Makefile     |  95 +++++++++++++++++
>> > > > > > >  tools/bpf/moderr/moderr.bpf.c | 127 +++++++++++++++++++++++
>> > > > > > >  tools/bpf/moderr/moderr.c     | 236 ++++++++++++++++++++++++++++++++++++++++++
>> > > > > > >  tools/bpf/moderr/moderr.h     |  40 +++++++
>> > > > > > >  6 files changed, 510 insertions(+), 3 deletions(-)
>> > > > > >
>> > > > > > The tool looks useful, but we don't add tools to the kernel repo.
>> > > > > > It has to stay out of tree.
>> > > > >
>> > > > > For selftests we do add random tools.
>> > > > >
>> > > > > > The value of error injection is not clear to me.
>> > > > >
>> > > > > It is of great value, since it deals with corner cases which are
>> > > > > otherwise hard to reproduce in places which a real error can be
>> > > > > catostrophic.
>> > > > >
>> > > > > > Other places in the kernel use it to test paths in the kernel
>> > > > > > that are difficult to do otherwise.
>> > > > >
>> > > > > Right.
>> > > > >
>> > > > > > These 3 functions don't seem to be in this category.
>> > > > >
>> > > > > That's the key here we should focus on. The problem is when a maintainer
>> > > > > *does* agree that adding an error injection entry is useful for testing,
>> > > > > and we have a developer willing to do the work to help test / validate
>> > > > > it. In this case, this error case is rare but we do want to strive to
>> > > > > test this as we ramp up and extend our modules selftests.
>> > > > >
>> > > > > Then there is the aspect of how to mitigate how instrusive code changes
>> > > > > to allow error injection are. In 2021 we evaluated the prospect of error
>> > > > > injection in-kernel long ago for other areas like the block layer for
>> > > > > add_disk() failures [0] but the minimal interface to enable this from
>> > > > > userspace with debugfs was considered just too intrusive.
>> > > > >
>> > > > > This effort tried to evaluate what this could look like with eBPF to
>> > > > > mitigate the required in-kernel code, and I believe the light weight
>> > > > > nature of it by just requiring a sprinkle with ALLOW_ERROR_INJECTION()
>> > > > > suffices to my taste.
>> > > > >
>> > > > > So, perhaps the tools aspect can just go in:
>> > > > >
>> > > > > tools/testing/selftests/module/
>> > > >
>> > > > but why would it be module-specific?
>> > >
>> > > Gotta start somewhere.
>> > >
>> > > > Based on its current implementation
>> > > > and discussion about inject.py it seems to be generic enough to be
>> > > > useful to test any function annotated with ALLOW_ERROR_INJECTION().
>> > > >
>> > > > As xe driver maintainer, it may be interesting to use such a tool:
>> > > >
>> > > > 	$ git grep ALLOW_ERROR_INJECT -- drivers/gpu/drm/xe | wc -l  	23
>> > > >
>> > > > How does this approach compare to writing the function name on debugfs
>> > > > (the current approach in xe's testsuite)?
>> > > >
>> > > > 	fail_function @ https://docs.kernel.org/fault-injection/fault-injection.html#fault-injection-capabilities-infrastructure
>> > > > 	https://gitlab.freedesktop.org/drm/igt-gpu-tools/-/blob/master/tests/intel/xe_fault_injection.c?ref_type=heads#L108
>> > > >
>> > > > If you decide to have the tool to live somewhere else, then kmod repo
>> > > > could be a candidate.
>> > >
>> > > Would we install this upon install target?
>> > >
>> > > Danny can decide on this :)
>> > >
>> > > > Although I think having it in kernel tree is
>> > > > simpler maintenance-wise.
>> > >
>> > > I think we have at least two users upstream who can make use of it. If
>> > > we end up going through tools/testing/selftests/module/ first, can't
>> > > you make use of it later?
>> >
>> > What are the features in debugfs required to be useful for xe that we can
>> > port to an eBPF version? I see from the link provided the use of probability,
>> > interval, times and space but these are configured to allways trigger the error.
>> > Is that right?
>>
>> I don't think we use them... we just set them to "always trigger" and
>> then create the conditions for that to happen.  But my question still
>> remains:  what is the benefit of using the bpf approach over writing
>> these files?
>
>The code to trigger the error injection still needs to exist with both
>approaches. My understanding from debugfs and the comment from Luis earlier in
>the thread is that with eBPF you can mitigate how intrusive in-kernel code
>changes are to allow error injection. Luis added the example here [1] for
>debugfs.
>
>[1] https://lore.kernel.org/all/20210512064629.13899-9-mcgrof@kernel.org/
>
>To compare patch 8 in [1] with eBPF approach: the patch describes
>all the necessary changes required to allow error injection on the
>add_disk() path. With eBPF one would simply annotate the function(s) with
>ALLOW_ERROR_INJECTION(), e.g. device_add() and replace the return value
>in eBPF code with bpf_override_return() as implemented in moderr tool for
>module_enable_rdata_after_init() for example.

but that is all that we need with the fail_function in debugfs too:

$ git grep ALLOW_ERROR_INJECTION -- drivers/gpu/drm/xe
drivers/gpu/drm/xe/xe_device.c:ALLOW_ERROR_INJECTION(xe_device_create, ERRNO); /* See xe_pci_probe() */
drivers/gpu/drm/xe/xe_device.c:ALLOW_ERROR_INJECTION(wait_for_lmem_ready, ERRNO); /* See xe_pci_probe() */
drivers/gpu/drm/xe/xe_exec_queue.c:ALLOW_ERROR_INJECTION(xe_exec_queue_create_bind, ERRNO);
drivers/gpu/drm/xe/xe_ggtt.c:ALLOW_ERROR_INJECTION(xe_ggtt_init_early, ERRNO); /* See xe_pci_probe() */
drivers/gpu/drm/xe/xe_guc_ads.c:ALLOW_ERROR_INJECTION(xe_guc_ads_init, ERRNO); /* See xe_pci_probe() */
drivers/gpu/drm/xe/xe_guc_ct.c:ALLOW_ERROR_INJECTION(xe_guc_ct_init, ERRNO); /* See xe_pci_probe() */
drivers/gpu/drm/xe/xe_guc_log.c:ALLOW_ERROR_INJECTION(xe_guc_log_init, ERRNO); /* See xe_pci_probe() */
drivers/gpu/drm/xe/xe_guc_relay.c:ALLOW_ERROR_INJECTION(xe_guc_relay_init, ERRNO); /* See xe_pci_probe() */
drivers/gpu/drm/xe/xe_pci.c: * ALLOW_ERROR_INJECTION() is used to conditionally skip function execution
drivers/gpu/drm/xe/xe_pci.c: * ALLOW_ERROR_INJECTION() macro but this is acceptable because for those
drivers/gpu/drm/xe/xe_pm.c:ALLOW_ERROR_INJECTION(xe_pm_init_early, ERRNO); /* See xe_pci_probe() */
drivers/gpu/drm/xe/xe_pt.c:ALLOW_ERROR_INJECTION(xe_pt_create, ERRNO);
drivers/gpu/drm/xe/xe_pt.c:ALLOW_ERROR_INJECTION(xe_pt_update_ops_prepare, ERRNO);
drivers/gpu/drm/xe/xe_pt.c:ALLOW_ERROR_INJECTION(xe_pt_update_ops_run, ERRNO);
drivers/gpu/drm/xe/xe_sriov.c:ALLOW_ERROR_INJECTION(xe_sriov_init, ERRNO); /* See xe_pci_probe() */
drivers/gpu/drm/xe/xe_sync.c:ALLOW_ERROR_INJECTION(xe_sync_entry_parse, ERRNO);
drivers/gpu/drm/xe/xe_tile.c:ALLOW_ERROR_INJECTION(xe_tile_init_early, ERRNO); /* See xe_pci_probe() */
drivers/gpu/drm/xe/xe_uc_fw.c:ALLOW_ERROR_INJECTION(xe_uc_fw_init, ERRNO); /* See xe_pci_probe() */
drivers/gpu/drm/xe/xe_vm.c:ALLOW_ERROR_INJECTION(xe_vma_ops_alloc, ERRNO);
drivers/gpu/drm/xe/xe_vm.c:ALLOW_ERROR_INJECTION(xe_vm_create_scratch, ERRNO);
drivers/gpu/drm/xe/xe_vm.c:ALLOW_ERROR_INJECTION(vm_bind_ioctl_ops_create, ERRNO);
drivers/gpu/drm/xe/xe_vm.c:ALLOW_ERROR_INJECTION(vm_bind_ioctl_ops_execute, ERRNO);
drivers/gpu/drm/xe/xe_wa.c:ALLOW_ERROR_INJECTION(xe_wa_init, ERRNO); /* See xe_pci_probe() */
drivers/gpu/drm/xe/xe_wopcm.c:ALLOW_ERROR_INJECTION(xe_wopcm_init, ERRNO); /* See xe_pci_probe() */

That is different from the patch you are pointing to because that patch
is trying to add arbitrary/named error injection points throughout the
code. However via debugfs it's still possible to add error injection to
the beginning of a function by annotating that function with
ALLOW_ERROR_INJECTION. If a function is annotated with that, then if you
do e.g.

	echo xe_device_create > /sys/kernel/debug/fail_function/inject

it will cause that function to fail. There are some additional files to
control _when_ that function should fail, but I'm failing to see a clear
benefit. See this example in the docs:

	Documentation/fault-injection/fault-injection.rst:- Inject open_ctree error while btrfs mount::

Lucas De Marchi

>
>New error injection users such modules or block/disk would benefit of the eBPF
>approach with having a more simpler in-kernel code for the same purpose. Current
>users of debugfs/errorinj would have to remove all the upstream intrusive error
>injection related code if they want to adopt the eBPF approach.
>
>Daniel
>
>>
>> Lucas De Marchi
>>
>> >
>> > >
>> > >   Luis
Daniel Gomez March 5, 2025, 10:06 a.m. UTC | #12
On Fri, Feb 28, 2025 at 12:48:38PM +0100, Lucas De Marchi wrote:
> On Fri, Feb 28, 2025 at 10:27:17AM +0100, Daniel Gomez wrote:
> > On Mon, Feb 24, 2025 at 08:43:45AM +0100, Lucas De Marchi wrote:
> > > On Sat, Feb 22, 2025 at 10:35:07PM +0100, Daniel Gomez wrote:
> > > > On Fri, Feb 21, 2025 at 12:15:40PM +0100, Luis Chamberlain wrote:
> > > > > On Wed, Feb 19, 2025 at 02:17:48PM -0600, Lucas De Marchi wrote:
> > > > > > On Tue, Jan 28, 2025 at 12:57:05PM -0800, Luis Chamberlain wrote:
> > > > > > > On Wed, Jan 22, 2025 at 09:02:19AM -0800, Alexei Starovoitov wrote:
> > > > > > > > On Wed, Jan 22, 2025 at 5:12 AM Daniel Gomez <da.gomez@samsung.com> wrote:
> > > > > > > > >
> > > > > > > > > Add support for a module error injection tool. The tool
> > > > > > > > > can inject errors in the annotated module kernel functions
> > > > > > > > > such as complete_formation(), do_init_module() and
> > > > > > > > > module_enable_rodata_after_init(). Module name and module function are
> > > > > > > > > required parameters to have control over the error injection.
> > > > > > > > >
> > > > > > > > > Example: Inject error -22 to module_enable_rodata_ro_after_init for
> > > > > > > > > brd module:
> > > > > > > > >
> > > > > > > > > sudo moderr --modname=brd --modfunc=module_enable_rodata_ro_after_init \
> > > > > > > > > --error=-22 --trace
> > > > > > > > > Monitoring module error injection... Hit Ctrl-C to end.
> > > > > > > > > MODULE     ERROR FUNCTION
> > > > > > > > > brd        -22   module_enable_rodata_after_init()
> > > > > > > > >
> > > > > > > > > Kernel messages:
> > > > > > > > > [   89.463690] brd: module loaded
> > > > > > > > > [   89.463855] brd: module_enable_rodata_ro_after_init() returned -22,
> > > > > > > > > ro_after_init data might still be writable
> > > > > > > > >
> > > > > > > > > Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
> > > > > > > > > ---
> > > > > > > > >  tools/bpf/Makefile            |  13 ++-
> > > > > > > > >  tools/bpf/moderr/.gitignore   |   2 +
> > > > > > > > >  tools/bpf/moderr/Makefile     |  95 +++++++++++++++++
> > > > > > > > >  tools/bpf/moderr/moderr.bpf.c | 127 +++++++++++++++++++++++
> > > > > > > > >  tools/bpf/moderr/moderr.c     | 236 ++++++++++++++++++++++++++++++++++++++++++
> > > > > > > > >  tools/bpf/moderr/moderr.h     |  40 +++++++
> > > > > > > > >  6 files changed, 510 insertions(+), 3 deletions(-)
> > > > > > > >
> > > > > > > > The tool looks useful, but we don't add tools to the kernel repo.
> > > > > > > > It has to stay out of tree.
> > > > > > >
> > > > > > > For selftests we do add random tools.
> > > > > > >
> > > > > > > > The value of error injection is not clear to me.
> > > > > > >
> > > > > > > It is of great value, since it deals with corner cases which are
> > > > > > > otherwise hard to reproduce in places which a real error can be
> > > > > > > catostrophic.
> > > > > > >
> > > > > > > > Other places in the kernel use it to test paths in the kernel
> > > > > > > > that are difficult to do otherwise.
> > > > > > >
> > > > > > > Right.
> > > > > > >
> > > > > > > > These 3 functions don't seem to be in this category.
> > > > > > >
> > > > > > > That's the key here we should focus on. The problem is when a maintainer
> > > > > > > *does* agree that adding an error injection entry is useful for testing,
> > > > > > > and we have a developer willing to do the work to help test / validate
> > > > > > > it. In this case, this error case is rare but we do want to strive to
> > > > > > > test this as we ramp up and extend our modules selftests.
> > > > > > >
> > > > > > > Then there is the aspect of how to mitigate how instrusive code changes
> > > > > > > to allow error injection are. In 2021 we evaluated the prospect of error
> > > > > > > injection in-kernel long ago for other areas like the block layer for
> > > > > > > add_disk() failures [0] but the minimal interface to enable this from
> > > > > > > userspace with debugfs was considered just too intrusive.
> > > > > > >
> > > > > > > This effort tried to evaluate what this could look like with eBPF to
> > > > > > > mitigate the required in-kernel code, and I believe the light weight
> > > > > > > nature of it by just requiring a sprinkle with ALLOW_ERROR_INJECTION()
> > > > > > > suffices to my taste.
> > > > > > >
> > > > > > > So, perhaps the tools aspect can just go in:
> > > > > > >
> > > > > > > tools/testing/selftests/module/
> > > > > >
> > > > > > but why would it be module-specific?
> > > > >
> > > > > Gotta start somewhere.
> > > > >
> > > > > > Based on its current implementation
> > > > > > and discussion about inject.py it seems to be generic enough to be
> > > > > > useful to test any function annotated with ALLOW_ERROR_INJECTION().
> > > > > >
> > > > > > As xe driver maintainer, it may be interesting to use such a tool:
> > > > > >
> > > > > > 	$ git grep ALLOW_ERROR_INJECT -- drivers/gpu/drm/xe | wc -l  	23
> > > > > >
> > > > > > How does this approach compare to writing the function name on debugfs
> > > > > > (the current approach in xe's testsuite)?
> > > > > >
> > > > > > 	fail_function @ https://docs.kernel.org/fault-injection/fault-injection.html#fault-injection-capabilities-infrastructure
> > > > > > 	https://gitlab.freedesktop.org/drm/igt-gpu-tools/-/blob/master/tests/intel/xe_fault_injection.c?ref_type=heads#L108
> > > > > >
> > > > > > If you decide to have the tool to live somewhere else, then kmod repo
> > > > > > could be a candidate.
> > > > >
> > > > > Would we install this upon install target?
> > > > >
> > > > > Danny can decide on this :)
> > > > >
> > > > > > Although I think having it in kernel tree is
> > > > > > simpler maintenance-wise.
> > > > >
> > > > > I think we have at least two users upstream who can make use of it. If
> > > > > we end up going through tools/testing/selftests/module/ first, can't
> > > > > you make use of it later?
> > > >
> > > > What are the features in debugfs required to be useful for xe that we can
> > > > port to an eBPF version? I see from the link provided the use of probability,
> > > > interval, times and space but these are configured to allways trigger the error.
> > > > Is that right?
> > > 
> > > I don't think we use them... we just set them to "always trigger" and
> > > then create the conditions for that to happen.  But my question still
> > > remains:  what is the benefit of using the bpf approach over writing
> > > these files?
> > 
> > The code to trigger the error injection still needs to exist with both
> > approaches. My understanding from debugfs and the comment from Luis earlier in
> > the thread is that with eBPF you can mitigate how intrusive in-kernel code
> > changes are to allow error injection. Luis added the example here [1] for
> > debugfs.
> > 
> > [1] https://lore.kernel.org/all/20210512064629.13899-9-mcgrof@kernel.org/
> > 
> > To compare patch 8 in [1] with eBPF approach: the patch describes
> > all the necessary changes required to allow error injection on the
> > add_disk() path. With eBPF one would simply annotate the function(s) with
> > ALLOW_ERROR_INJECTION(), e.g. device_add() and replace the return value
> > in eBPF code with bpf_override_return() as implemented in moderr tool for
> > module_enable_rdata_after_init() for example.
> 
> but that is all that we need with the fail_function in debugfs too:
> 
> $ git grep ALLOW_ERROR_INJECTION -- drivers/gpu/drm/xe
> drivers/gpu/drm/xe/xe_device.c:ALLOW_ERROR_INJECTION(xe_device_create, ERRNO); /* See xe_pci_probe() */
> drivers/gpu/drm/xe/xe_device.c:ALLOW_ERROR_INJECTION(wait_for_lmem_ready, ERRNO); /* See xe_pci_probe() */
> drivers/gpu/drm/xe/xe_exec_queue.c:ALLOW_ERROR_INJECTION(xe_exec_queue_create_bind, ERRNO);
> drivers/gpu/drm/xe/xe_ggtt.c:ALLOW_ERROR_INJECTION(xe_ggtt_init_early, ERRNO); /* See xe_pci_probe() */
> drivers/gpu/drm/xe/xe_guc_ads.c:ALLOW_ERROR_INJECTION(xe_guc_ads_init, ERRNO); /* See xe_pci_probe() */
> drivers/gpu/drm/xe/xe_guc_ct.c:ALLOW_ERROR_INJECTION(xe_guc_ct_init, ERRNO); /* See xe_pci_probe() */
> drivers/gpu/drm/xe/xe_guc_log.c:ALLOW_ERROR_INJECTION(xe_guc_log_init, ERRNO); /* See xe_pci_probe() */
> drivers/gpu/drm/xe/xe_guc_relay.c:ALLOW_ERROR_INJECTION(xe_guc_relay_init, ERRNO); /* See xe_pci_probe() */
> drivers/gpu/drm/xe/xe_pci.c: * ALLOW_ERROR_INJECTION() is used to conditionally skip function execution
> drivers/gpu/drm/xe/xe_pci.c: * ALLOW_ERROR_INJECTION() macro but this is acceptable because for those
> drivers/gpu/drm/xe/xe_pm.c:ALLOW_ERROR_INJECTION(xe_pm_init_early, ERRNO); /* See xe_pci_probe() */
> drivers/gpu/drm/xe/xe_pt.c:ALLOW_ERROR_INJECTION(xe_pt_create, ERRNO);
> drivers/gpu/drm/xe/xe_pt.c:ALLOW_ERROR_INJECTION(xe_pt_update_ops_prepare, ERRNO);
> drivers/gpu/drm/xe/xe_pt.c:ALLOW_ERROR_INJECTION(xe_pt_update_ops_run, ERRNO);
> drivers/gpu/drm/xe/xe_sriov.c:ALLOW_ERROR_INJECTION(xe_sriov_init, ERRNO); /* See xe_pci_probe() */
> drivers/gpu/drm/xe/xe_sync.c:ALLOW_ERROR_INJECTION(xe_sync_entry_parse, ERRNO);
> drivers/gpu/drm/xe/xe_tile.c:ALLOW_ERROR_INJECTION(xe_tile_init_early, ERRNO); /* See xe_pci_probe() */
> drivers/gpu/drm/xe/xe_uc_fw.c:ALLOW_ERROR_INJECTION(xe_uc_fw_init, ERRNO); /* See xe_pci_probe() */
> drivers/gpu/drm/xe/xe_vm.c:ALLOW_ERROR_INJECTION(xe_vma_ops_alloc, ERRNO);
> drivers/gpu/drm/xe/xe_vm.c:ALLOW_ERROR_INJECTION(xe_vm_create_scratch, ERRNO);
> drivers/gpu/drm/xe/xe_vm.c:ALLOW_ERROR_INJECTION(vm_bind_ioctl_ops_create, ERRNO);
> drivers/gpu/drm/xe/xe_vm.c:ALLOW_ERROR_INJECTION(vm_bind_ioctl_ops_execute, ERRNO);
> drivers/gpu/drm/xe/xe_wa.c:ALLOW_ERROR_INJECTION(xe_wa_init, ERRNO); /* See xe_pci_probe() */
> drivers/gpu/drm/xe/xe_wopcm.c:ALLOW_ERROR_INJECTION(xe_wopcm_init, ERRNO); /* See xe_pci_probe() */
> 
> That is different from the patch you are pointing to because that patch
> is trying to add arbitrary/named error injection points throughout the
> code. However via debugfs it's still possible to add error injection to

When reading the patch I assumed the block/failure-injection.c was needed for
the knobs in sysfs/debugfs. But I see I was wrong and these are only needed for
the arbitrary error injection points?

I see mm/fail_page_alloc.c has a similar approach with should_fail_alloc_page().

> the beginning of a function by annotating that function with
> ALLOW_ERROR_INJECTION. If a function is annotated with that, then if you
> do e.g.
> 
> 	echo xe_device_create > /sys/kernel/debug/fail_function/inject
> 
> it will cause that function to fail. There are some additional files to
> control _when_ that function should fail, but I'm failing to see a clear
> benefit. See this example in the docs:

Can you clarify if _when_ (in debugfs) allows you to access function arguments
of a given annotated function with ALLOW_ERROR_INJECTION()? It seems that might
be the only part that can be moved out of the kernel and handled in eBPF. Other
than that, I don't see either a benefit of using one approach over the other.

> 
> 	Documentation/fault-injection/fault-injection.rst:- Inject open_ctree error while btrfs mount::
> 
> Lucas De Marchi
Lucas De Marchi March 5, 2025, 9:08 p.m. UTC | #13
+Francois who added most of the error injection points in xe

On Wed, Mar 05, 2025 at 11:06:57AM +0100, Daniel Gomez wrote:
>On Fri, Feb 28, 2025 at 12:48:38PM +0100, Lucas De Marchi wrote:
>> On Fri, Feb 28, 2025 at 10:27:17AM +0100, Daniel Gomez wrote:
>> > On Mon, Feb 24, 2025 at 08:43:45AM +0100, Lucas De Marchi wrote:
>> > > On Sat, Feb 22, 2025 at 10:35:07PM +0100, Daniel Gomez wrote:
>> > > > On Fri, Feb 21, 2025 at 12:15:40PM +0100, Luis Chamberlain wrote:
>> > > > > On Wed, Feb 19, 2025 at 02:17:48PM -0600, Lucas De Marchi wrote:
>> > > > > > On Tue, Jan 28, 2025 at 12:57:05PM -0800, Luis Chamberlain wrote:
>> > > > > > > On Wed, Jan 22, 2025 at 09:02:19AM -0800, Alexei Starovoitov wrote:
>> > > > > > > > On Wed, Jan 22, 2025 at 5:12 AM Daniel Gomez <da.gomez@samsung.com> wrote:
>> > > > > > > > >
>> > > > > > > > > Add support for a module error injection tool. The tool
>> > > > > > > > > can inject errors in the annotated module kernel functions
>> > > > > > > > > such as complete_formation(), do_init_module() and
>> > > > > > > > > module_enable_rodata_after_init(). Module name and module function are
>> > > > > > > > > required parameters to have control over the error injection.
>> > > > > > > > >
>> > > > > > > > > Example: Inject error -22 to module_enable_rodata_ro_after_init for
>> > > > > > > > > brd module:
>> > > > > > > > >
>> > > > > > > > > sudo moderr --modname=brd --modfunc=module_enable_rodata_ro_after_init \
>> > > > > > > > > --error=-22 --trace
>> > > > > > > > > Monitoring module error injection... Hit Ctrl-C to end.
>> > > > > > > > > MODULE     ERROR FUNCTION
>> > > > > > > > > brd        -22   module_enable_rodata_after_init()
>> > > > > > > > >
>> > > > > > > > > Kernel messages:
>> > > > > > > > > [   89.463690] brd: module loaded
>> > > > > > > > > [   89.463855] brd: module_enable_rodata_ro_after_init() returned -22,
>> > > > > > > > > ro_after_init data might still be writable
>> > > > > > > > >
>> > > > > > > > > Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
>> > > > > > > > > ---
>> > > > > > > > >  tools/bpf/Makefile            |  13 ++-
>> > > > > > > > >  tools/bpf/moderr/.gitignore   |   2 +
>> > > > > > > > >  tools/bpf/moderr/Makefile     |  95 +++++++++++++++++
>> > > > > > > > >  tools/bpf/moderr/moderr.bpf.c | 127 +++++++++++++++++++++++
>> > > > > > > > >  tools/bpf/moderr/moderr.c     | 236 ++++++++++++++++++++++++++++++++++++++++++
>> > > > > > > > >  tools/bpf/moderr/moderr.h     |  40 +++++++
>> > > > > > > > >  6 files changed, 510 insertions(+), 3 deletions(-)
>> > > > > > > >
>> > > > > > > > The tool looks useful, but we don't add tools to the kernel repo.
>> > > > > > > > It has to stay out of tree.
>> > > > > > >
>> > > > > > > For selftests we do add random tools.
>> > > > > > >
>> > > > > > > > The value of error injection is not clear to me.
>> > > > > > >
>> > > > > > > It is of great value, since it deals with corner cases which are
>> > > > > > > otherwise hard to reproduce in places which a real error can be
>> > > > > > > catostrophic.
>> > > > > > >
>> > > > > > > > Other places in the kernel use it to test paths in the kernel
>> > > > > > > > that are difficult to do otherwise.
>> > > > > > >
>> > > > > > > Right.
>> > > > > > >
>> > > > > > > > These 3 functions don't seem to be in this category.
>> > > > > > >
>> > > > > > > That's the key here we should focus on. The problem is when a maintainer
>> > > > > > > *does* agree that adding an error injection entry is useful for testing,
>> > > > > > > and we have a developer willing to do the work to help test / validate
>> > > > > > > it. In this case, this error case is rare but we do want to strive to
>> > > > > > > test this as we ramp up and extend our modules selftests.
>> > > > > > >
>> > > > > > > Then there is the aspect of how to mitigate how instrusive code changes
>> > > > > > > to allow error injection are. In 2021 we evaluated the prospect of error
>> > > > > > > injection in-kernel long ago for other areas like the block layer for
>> > > > > > > add_disk() failures [0] but the minimal interface to enable this from
>> > > > > > > userspace with debugfs was considered just too intrusive.
>> > > > > > >
>> > > > > > > This effort tried to evaluate what this could look like with eBPF to
>> > > > > > > mitigate the required in-kernel code, and I believe the light weight
>> > > > > > > nature of it by just requiring a sprinkle with ALLOW_ERROR_INJECTION()
>> > > > > > > suffices to my taste.
>> > > > > > >
>> > > > > > > So, perhaps the tools aspect can just go in:
>> > > > > > >
>> > > > > > > tools/testing/selftests/module/
>> > > > > >
>> > > > > > but why would it be module-specific?
>> > > > >
>> > > > > Gotta start somewhere.
>> > > > >
>> > > > > > Based on its current implementation
>> > > > > > and discussion about inject.py it seems to be generic enough to be
>> > > > > > useful to test any function annotated with ALLOW_ERROR_INJECTION().
>> > > > > >
>> > > > > > As xe driver maintainer, it may be interesting to use such a tool:
>> > > > > >
>> > > > > > 	$ git grep ALLOW_ERROR_INJECT -- drivers/gpu/drm/xe | wc -l  	23
>> > > > > >
>> > > > > > How does this approach compare to writing the function name on debugfs
>> > > > > > (the current approach in xe's testsuite)?
>> > > > > >
>> > > > > > 	fail_function @ https://docs.kernel.org/fault-injection/fault-injection.html#fault-injection-capabilities-infrastructure
>> > > > > > 	https://gitlab.freedesktop.org/drm/igt-gpu-tools/-/blob/master/tests/intel/xe_fault_injection.c?ref_type=heads#L108
>> > > > > >
>> > > > > > If you decide to have the tool to live somewhere else, then kmod repo
>> > > > > > could be a candidate.
>> > > > >
>> > > > > Would we install this upon install target?
>> > > > >
>> > > > > Danny can decide on this :)
>> > > > >
>> > > > > > Although I think having it in kernel tree is
>> > > > > > simpler maintenance-wise.
>> > > > >
>> > > > > I think we have at least two users upstream who can make use of it. If
>> > > > > we end up going through tools/testing/selftests/module/ first, can't
>> > > > > you make use of it later?
>> > > >
>> > > > What are the features in debugfs required to be useful for xe that we can
>> > > > port to an eBPF version? I see from the link provided the use of probability,
>> > > > interval, times and space but these are configured to allways trigger the error.
>> > > > Is that right?
>> > >
>> > > I don't think we use them... we just set them to "always trigger" and
>> > > then create the conditions for that to happen.  But my question still
>> > > remains:  what is the benefit of using the bpf approach over writing
>> > > these files?
>> >
>> > The code to trigger the error injection still needs to exist with both
>> > approaches. My understanding from debugfs and the comment from Luis earlier in
>> > the thread is that with eBPF you can mitigate how intrusive in-kernel code
>> > changes are to allow error injection. Luis added the example here [1] for
>> > debugfs.
>> >
>> > [1] https://lore.kernel.org/all/20210512064629.13899-9-mcgrof@kernel.org/
>> >
>> > To compare patch 8 in [1] with eBPF approach: the patch describes
>> > all the necessary changes required to allow error injection on the
>> > add_disk() path. With eBPF one would simply annotate the function(s) with
>> > ALLOW_ERROR_INJECTION(), e.g. device_add() and replace the return value
>> > in eBPF code with bpf_override_return() as implemented in moderr tool for
>> > module_enable_rdata_after_init() for example.
>>
>> but that is all that we need with the fail_function in debugfs too:
>>
>> $ git grep ALLOW_ERROR_INJECTION -- drivers/gpu/drm/xe
>> drivers/gpu/drm/xe/xe_device.c:ALLOW_ERROR_INJECTION(xe_device_create, ERRNO); /* See xe_pci_probe() */
>> drivers/gpu/drm/xe/xe_device.c:ALLOW_ERROR_INJECTION(wait_for_lmem_ready, ERRNO); /* See xe_pci_probe() */
>> drivers/gpu/drm/xe/xe_exec_queue.c:ALLOW_ERROR_INJECTION(xe_exec_queue_create_bind, ERRNO);
>> drivers/gpu/drm/xe/xe_ggtt.c:ALLOW_ERROR_INJECTION(xe_ggtt_init_early, ERRNO); /* See xe_pci_probe() */
>> drivers/gpu/drm/xe/xe_guc_ads.c:ALLOW_ERROR_INJECTION(xe_guc_ads_init, ERRNO); /* See xe_pci_probe() */
>> drivers/gpu/drm/xe/xe_guc_ct.c:ALLOW_ERROR_INJECTION(xe_guc_ct_init, ERRNO); /* See xe_pci_probe() */
>> drivers/gpu/drm/xe/xe_guc_log.c:ALLOW_ERROR_INJECTION(xe_guc_log_init, ERRNO); /* See xe_pci_probe() */
>> drivers/gpu/drm/xe/xe_guc_relay.c:ALLOW_ERROR_INJECTION(xe_guc_relay_init, ERRNO); /* See xe_pci_probe() */
>> drivers/gpu/drm/xe/xe_pci.c: * ALLOW_ERROR_INJECTION() is used to conditionally skip function execution
>> drivers/gpu/drm/xe/xe_pci.c: * ALLOW_ERROR_INJECTION() macro but this is acceptable because for those
>> drivers/gpu/drm/xe/xe_pm.c:ALLOW_ERROR_INJECTION(xe_pm_init_early, ERRNO); /* See xe_pci_probe() */
>> drivers/gpu/drm/xe/xe_pt.c:ALLOW_ERROR_INJECTION(xe_pt_create, ERRNO);
>> drivers/gpu/drm/xe/xe_pt.c:ALLOW_ERROR_INJECTION(xe_pt_update_ops_prepare, ERRNO);
>> drivers/gpu/drm/xe/xe_pt.c:ALLOW_ERROR_INJECTION(xe_pt_update_ops_run, ERRNO);
>> drivers/gpu/drm/xe/xe_sriov.c:ALLOW_ERROR_INJECTION(xe_sriov_init, ERRNO); /* See xe_pci_probe() */
>> drivers/gpu/drm/xe/xe_sync.c:ALLOW_ERROR_INJECTION(xe_sync_entry_parse, ERRNO);
>> drivers/gpu/drm/xe/xe_tile.c:ALLOW_ERROR_INJECTION(xe_tile_init_early, ERRNO); /* See xe_pci_probe() */
>> drivers/gpu/drm/xe/xe_uc_fw.c:ALLOW_ERROR_INJECTION(xe_uc_fw_init, ERRNO); /* See xe_pci_probe() */
>> drivers/gpu/drm/xe/xe_vm.c:ALLOW_ERROR_INJECTION(xe_vma_ops_alloc, ERRNO);
>> drivers/gpu/drm/xe/xe_vm.c:ALLOW_ERROR_INJECTION(xe_vm_create_scratch, ERRNO);
>> drivers/gpu/drm/xe/xe_vm.c:ALLOW_ERRORALLOW_ERROR_INJECTION_INJECTION(vm_bind_ioctl_ops_create, ERRNO);
>> drivers/gpu/drm/xe/xe_vm.c:ALLOW_ERROR_INJECTION(vm_bind_ioctl_ops_execute, ERRNO);
>> drivers/gpu/drm/xe/xe_wa.c:ALLOW_ERROR_INJECTION(xe_wa_init, ERRNO); /* See xe_pci_probe() */
>> drivers/gpu/drm/xe/xe_wopcm.c:ALLOW_ERROR_INJECTION(xe_wopcm_init, ERRNO); /* See xe_pci_probe() */
>>
>> That is different from the patch you are pointing to because that patch
>> is trying to add arbitrary/named error injection points throughout the
>> code. However via debugfs it's still possible to add error injection to
>
>When reading the patch I assumed the block/failure-injection.c was needed for
>the knobs in sysfs/debugfs. But I see I was wrong and these are only needed for
>the arbitrary error injection points?

yeah. For working with ALLOW_ERROR_INJECTION() we have to refactor the
code so the functions follow its requirements. When that is true, then
we can simply use the fail_function/inject to trigger it.

>
>I see mm/fail_page_alloc.c has a similar approach with should_fail_alloc_page().
>
>> the beginning of a function by annotating that function with
>> ALLOW_ERROR_INJECTION. If a function is annotated with that, then if you
>> do e.g.
>>
>> 	echo xe_device_create > /sys/kernel/debug/fail_function/inject
>>
>> it will cause that function to fail. There are some additional files to
>> control _when_ that function should fail, but I'm failing to see a clear
>> benefit. See this example in the docs:
>
>Can you clarify if _when_ (in debugfs) allows you to access function arguments
>of a given annotated function with ALLOW_ERROR_INJECTION()? It seems that might
>be the only part that can be moved out of the kernel and handled in eBPF. Other
>than that, I don't see either a benefit of using one approach over the other.

afaik we can't change the behavior based on arguments when using the
debugfs approach.

Lucas De Marchi

>
>>
>> 	Documentation/fault-injection/fault-injection.rst:- Inject open_ctree error while btrfs mount::
>>
>> Lucas De Marchi
>
Francois Dugast March 6, 2025, 1:30 p.m. UTC | #14
On Wed, Mar 05, 2025 at 03:08:25PM -0600, Lucas De Marchi wrote:
> +Francois who added most of the error injection points in xe
> 
> On Wed, Mar 05, 2025 at 11:06:57AM +0100, Daniel Gomez wrote:
> > On Fri, Feb 28, 2025 at 12:48:38PM +0100, Lucas De Marchi wrote:
> > > On Fri, Feb 28, 2025 at 10:27:17AM +0100, Daniel Gomez wrote:
> > > > On Mon, Feb 24, 2025 at 08:43:45AM +0100, Lucas De Marchi wrote:
> > > > > On Sat, Feb 22, 2025 at 10:35:07PM +0100, Daniel Gomez wrote:
> > > > > > On Fri, Feb 21, 2025 at 12:15:40PM +0100, Luis Chamberlain wrote:
> > > > > > > On Wed, Feb 19, 2025 at 02:17:48PM -0600, Lucas De Marchi wrote:
> > > > > > > > On Tue, Jan 28, 2025 at 12:57:05PM -0800, Luis Chamberlain wrote:
> > > > > > > > > On Wed, Jan 22, 2025 at 09:02:19AM -0800, Alexei Starovoitov wrote:
> > > > > > > > > > On Wed, Jan 22, 2025 at 5:12 AM Daniel Gomez <da.gomez@samsung.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Add support for a module error injection tool. The tool
> > > > > > > > > > > can inject errors in the annotated module kernel functions
> > > > > > > > > > > such as complete_formation(), do_init_module() and
> > > > > > > > > > > module_enable_rodata_after_init(). Module name and module function are
> > > > > > > > > > > required parameters to have control over the error injection.
> > > > > > > > > > >
> > > > > > > > > > > Example: Inject error -22 to module_enable_rodata_ro_after_init for
> > > > > > > > > > > brd module:
> > > > > > > > > > >
> > > > > > > > > > > sudo moderr --modname=brd --modfunc=module_enable_rodata_ro_after_init \
> > > > > > > > > > > --error=-22 --trace
> > > > > > > > > > > Monitoring module error injection... Hit Ctrl-C to end.
> > > > > > > > > > > MODULE     ERROR FUNCTION
> > > > > > > > > > > brd        -22   module_enable_rodata_after_init()
> > > > > > > > > > >
> > > > > > > > > > > Kernel messages:
> > > > > > > > > > > [   89.463690] brd: module loaded
> > > > > > > > > > > [   89.463855] brd: module_enable_rodata_ro_after_init() returned -22,
> > > > > > > > > > > ro_after_init data might still be writable
> > > > > > > > > > >
> > > > > > > > > > > Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
> > > > > > > > > > > ---
> > > > > > > > > > >  tools/bpf/Makefile            |  13 ++-
> > > > > > > > > > >  tools/bpf/moderr/.gitignore   |   2 +
> > > > > > > > > > >  tools/bpf/moderr/Makefile     |  95 +++++++++++++++++
> > > > > > > > > > >  tools/bpf/moderr/moderr.bpf.c | 127 +++++++++++++++++++++++
> > > > > > > > > > >  tools/bpf/moderr/moderr.c     | 236 ++++++++++++++++++++++++++++++++++++++++++
> > > > > > > > > > >  tools/bpf/moderr/moderr.h     |  40 +++++++
> > > > > > > > > > >  6 files changed, 510 insertions(+), 3 deletions(-)
> > > > > > > > > >
> > > > > > > > > > The tool looks useful, but we don't add tools to the kernel repo.
> > > > > > > > > > It has to stay out of tree.
> > > > > > > > >
> > > > > > > > > For selftests we do add random tools.
> > > > > > > > >
> > > > > > > > > > The value of error injection is not clear to me.
> > > > > > > > >
> > > > > > > > > It is of great value, since it deals with corner cases which are
> > > > > > > > > otherwise hard to reproduce in places which a real error can be
> > > > > > > > > catostrophic.
> > > > > > > > >
> > > > > > > > > > Other places in the kernel use it to test paths in the kernel
> > > > > > > > > > that are difficult to do otherwise.
> > > > > > > > >
> > > > > > > > > Right.
> > > > > > > > >
> > > > > > > > > > These 3 functions don't seem to be in this category.
> > > > > > > > >
> > > > > > > > > That's the key here we should focus on. The problem is when a maintainer
> > > > > > > > > *does* agree that adding an error injection entry is useful for testing,
> > > > > > > > > and we have a developer willing to do the work to help test / validate
> > > > > > > > > it. In this case, this error case is rare but we do want to strive to
> > > > > > > > > test this as we ramp up and extend our modules selftests.
> > > > > > > > >
> > > > > > > > > Then there is the aspect of how to mitigate how instrusive code changes
> > > > > > > > > to allow error injection are. In 2021 we evaluated the prospect of error
> > > > > > > > > injection in-kernel long ago for other areas like the block layer for
> > > > > > > > > add_disk() failures [0] but the minimal interface to enable this from
> > > > > > > > > userspace with debugfs was considered just too intrusive.
> > > > > > > > >
> > > > > > > > > This effort tried to evaluate what this could look like with eBPF to
> > > > > > > > > mitigate the required in-kernel code, and I believe the light weight
> > > > > > > > > nature of it by just requiring a sprinkle with ALLOW_ERROR_INJECTION()
> > > > > > > > > suffices to my taste.
> > > > > > > > >
> > > > > > > > > So, perhaps the tools aspect can just go in:
> > > > > > > > >
> > > > > > > > > tools/testing/selftests/module/
> > > > > > > >
> > > > > > > > but why would it be module-specific?
> > > > > > >
> > > > > > > Gotta start somewhere.
> > > > > > >
> > > > > > > > Based on its current implementation
> > > > > > > > and discussion about inject.py it seems to be generic enough to be
> > > > > > > > useful to test any function annotated with ALLOW_ERROR_INJECTION().
> > > > > > > >
> > > > > > > > As xe driver maintainer, it may be interesting to use such a tool:
> > > > > > > >
> > > > > > > > 	$ git grep ALLOW_ERROR_INJECT -- drivers/gpu/drm/xe | wc -l  	23
> > > > > > > >
> > > > > > > > How does this approach compare to writing the function name on debugfs
> > > > > > > > (the current approach in xe's testsuite)?
> > > > > > > >
> > > > > > > > 	fail_function @ https://docs.kernel.org/fault-injection/fault-injection.html#fault-injection-capabilities-infrastructure
> > > > > > > > 	https://gitlab.freedesktop.org/drm/igt-gpu-tools/-/blob/master/tests/intel/xe_fault_injection.c?ref_type=heads#L108
> > > > > > > >
> > > > > > > > If you decide to have the tool to live somewhere else, then kmod repo
> > > > > > > > could be a candidate.
> > > > > > >
> > > > > > > Would we install this upon install target?
> > > > > > >
> > > > > > > Danny can decide on this :)
> > > > > > >
> > > > > > > > Although I think having it in kernel tree is
> > > > > > > > simpler maintenance-wise.
> > > > > > >
> > > > > > > I think we have at least two users upstream who can make use of it. If
> > > > > > > we end up going through tools/testing/selftests/module/ first, can't
> > > > > > > you make use of it later?
> > > > > >
> > > > > > What are the features in debugfs required to be useful for xe that we can
> > > > > > port to an eBPF version? I see from the link provided the use of probability,
> > > > > > interval, times and space but these are configured to allways trigger the error.
> > > > > > Is that right?
> > > > >
> > > > > I don't think we use them... we just set them to "always trigger" and
> > > > > then create the conditions for that to happen.  But my question still
> > > > > remains:  what is the benefit of using the bpf approach over writing
> > > > > these files?
> > > >
> > > > The code to trigger the error injection still needs to exist with both
> > > > approaches. My understanding from debugfs and the comment from Luis earlier in
> > > > the thread is that with eBPF you can mitigate how intrusive in-kernel code
> > > > changes are to allow error injection. Luis added the example here [1] for
> > > > debugfs.
> > > >
> > > > [1] https://lore.kernel.org/all/20210512064629.13899-9-mcgrof@kernel.org/
> > > >
> > > > To compare patch 8 in [1] with eBPF approach: the patch describes
> > > > all the necessary changes required to allow error injection on the
> > > > add_disk() path. With eBPF one would simply annotate the function(s) with
> > > > ALLOW_ERROR_INJECTION(), e.g. device_add() and replace the return value
> > > > in eBPF code with bpf_override_return() as implemented in moderr tool for
> > > > module_enable_rdata_after_init() for example.
> > > 
> > > but that is all that we need with the fail_function in debugfs too:
> > > 
> > > $ git grep ALLOW_ERROR_INJECTION -- drivers/gpu/drm/xe
> > > drivers/gpu/drm/xe/xe_device.c:ALLOW_ERROR_INJECTION(xe_device_create, ERRNO); /* See xe_pci_probe() */
> > > drivers/gpu/drm/xe/xe_device.c:ALLOW_ERROR_INJECTION(wait_for_lmem_ready, ERRNO); /* See xe_pci_probe() */
> > > drivers/gpu/drm/xe/xe_exec_queue.c:ALLOW_ERROR_INJECTION(xe_exec_queue_create_bind, ERRNO);
> > > drivers/gpu/drm/xe/xe_ggtt.c:ALLOW_ERROR_INJECTION(xe_ggtt_init_early, ERRNO); /* See xe_pci_probe() */
> > > drivers/gpu/drm/xe/xe_guc_ads.c:ALLOW_ERROR_INJECTION(xe_guc_ads_init, ERRNO); /* See xe_pci_probe() */
> > > drivers/gpu/drm/xe/xe_guc_ct.c:ALLOW_ERROR_INJECTION(xe_guc_ct_init, ERRNO); /* See xe_pci_probe() */
> > > drivers/gpu/drm/xe/xe_guc_log.c:ALLOW_ERROR_INJECTION(xe_guc_log_init, ERRNO); /* See xe_pci_probe() */
> > > drivers/gpu/drm/xe/xe_guc_relay.c:ALLOW_ERROR_INJECTION(xe_guc_relay_init, ERRNO); /* See xe_pci_probe() */
> > > drivers/gpu/drm/xe/xe_pci.c: * ALLOW_ERROR_INJECTION() is used to conditionally skip function execution
> > > drivers/gpu/drm/xe/xe_pci.c: * ALLOW_ERROR_INJECTION() macro but this is acceptable because for those
> > > drivers/gpu/drm/xe/xe_pm.c:ALLOW_ERROR_INJECTION(xe_pm_init_early, ERRNO); /* See xe_pci_probe() */
> > > drivers/gpu/drm/xe/xe_pt.c:ALLOW_ERROR_INJECTION(xe_pt_create, ERRNO);
> > > drivers/gpu/drm/xe/xe_pt.c:ALLOW_ERROR_INJECTION(xe_pt_update_ops_prepare, ERRNO);
> > > drivers/gpu/drm/xe/xe_pt.c:ALLOW_ERROR_INJECTION(xe_pt_update_ops_run, ERRNO);
> > > drivers/gpu/drm/xe/xe_sriov.c:ALLOW_ERROR_INJECTION(xe_sriov_init, ERRNO); /* See xe_pci_probe() */
> > > drivers/gpu/drm/xe/xe_sync.c:ALLOW_ERROR_INJECTION(xe_sync_entry_parse, ERRNO);
> > > drivers/gpu/drm/xe/xe_tile.c:ALLOW_ERROR_INJECTION(xe_tile_init_early, ERRNO); /* See xe_pci_probe() */
> > > drivers/gpu/drm/xe/xe_uc_fw.c:ALLOW_ERROR_INJECTION(xe_uc_fw_init, ERRNO); /* See xe_pci_probe() */
> > > drivers/gpu/drm/xe/xe_vm.c:ALLOW_ERROR_INJECTION(xe_vma_ops_alloc, ERRNO);
> > > drivers/gpu/drm/xe/xe_vm.c:ALLOW_ERROR_INJECTION(xe_vm_create_scratch, ERRNO);
> > > drivers/gpu/drm/xe/xe_vm.c:ALLOW_ERRORALLOW_ERROR_INJECTION_INJECTION(vm_bind_ioctl_ops_create, ERRNO);
> > > drivers/gpu/drm/xe/xe_vm.c:ALLOW_ERROR_INJECTION(vm_bind_ioctl_ops_execute, ERRNO);
> > > drivers/gpu/drm/xe/xe_wa.c:ALLOW_ERROR_INJECTION(xe_wa_init, ERRNO); /* See xe_pci_probe() */
> > > drivers/gpu/drm/xe/xe_wopcm.c:ALLOW_ERROR_INJECTION(xe_wopcm_init, ERRNO); /* See xe_pci_probe() */
> > > 
> > > That is different from the patch you are pointing to because that patch
> > > is trying to add arbitrary/named error injection points throughout the
> > > code. However via debugfs it's still possible to add error injection to
> > 
> > When reading the patch I assumed the block/failure-injection.c was needed for
> > the knobs in sysfs/debugfs. But I see I was wrong and these are only needed for
> > the arbitrary error injection points?

Correct, ALLOW_ERROR_INJECTION() is sufficient to make a function error
injectable and controllable with the debugfs knobs. Arbitrary error injection
points might also be added using only ALLOW_ERROR_INJECTION() and extra
functions:

    static int should_X_fail(void)
    {
      return 0;
    }
    ALLOW_ERROR_INJECTION(should_X_fail, ERRNO);

    int foo(...)
    {
      ...
      if (should_X_fail()) {
        // error
      }
      ...
    }

But this would introduce a lot of dummy code, which the
block/failure-injection.c approach prevents.

Until now, fault injection in xe has mostly been used to test error handling
during probe and during IOCTLs calls. Errors are unconditionnally injected in
each function one by one during tests. This is why probability, interval,
times and space are configured to always trigger.

Besides, function-level error injection points have been a good fit in the
driver, hence the absence of arbitrary error injection points.

Francois

> 
> yeah. For working with ALLOW_ERROR_INJECTION() we have to refactor the
> code so the functions follow its requirements. When that is true, then
> we can simply use the fail_function/inject to trigger it.
> 
> > 
> > I see mm/fail_page_alloc.c has a similar approach with should_fail_alloc_page().
> > 
> > > the beginning of a function by annotating that function with
> > > ALLOW_ERROR_INJECTION. If a function is annotated with that, then if you
> > > do e.g.
> > > 
> > > 	echo xe_device_create > /sys/kernel/debug/fail_function/inject
> > > 
> > > it will cause that function to fail. There are some additional files to
> > > control _when_ that function should fail, but I'm failing to see a clear
> > > benefit. See this example in the docs:
> > 
> > Can you clarify if _when_ (in debugfs) allows you to access function arguments
> > of a given annotated function with ALLOW_ERROR_INJECTION()? It seems that might
> > be the only part that can be moved out of the kernel and handled in eBPF. Other
> > than that, I don't see either a benefit of using one approach over the other.
> 
> afaik we can't change the behavior based on arguments when using the
> debugfs approach.
> 
> Lucas De Marchi
> 
> > 
> > > 
> > > 	Documentation/fault-injection/fault-injection.rst:- Inject open_ctree error while btrfs mount::
> > > 
> > > Lucas De Marchi
> >
diff mbox series

Patch

diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile
index 243b79f2b451e52ca196f79dc46befd1b3dab458..018cab5102e7e42b8b7b2749f4f463bf55c5119b 100644
--- a/tools/bpf/Makefile
+++ b/tools/bpf/Makefile
@@ -38,7 +38,7 @@  FEATURE_TESTS = libbfd disassembler-four-args disassembler-init-styled
 FEATURE_DISPLAY = libbfd
 
 check_feat := 1
-NON_CHECK_FEAT_TARGETS := clean bpftool_clean runqslower_clean resolve_btfids_clean
+NON_CHECK_FEAT_TARGETS := clean bpftool_clean moderr_clean runqslower_clean resolve_btfids_clean
 ifdef MAKECMDGOALS
 ifeq ($(filter-out $(NON_CHECK_FEAT_TARGETS),$(MAKECMDGOALS)),)
   check_feat := 0
@@ -76,7 +76,7 @@  $(OUTPUT)%.lex.o: $(OUTPUT)%.lex.c
 
 PROGS = $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg $(OUTPUT)bpf_asm
 
-all: $(PROGS) bpftool runqslower
+all: $(PROGS) bpftool moderr runqslower
 
 $(OUTPUT)bpf_jit_disasm: CFLAGS += -DPACKAGE='bpf_jit_disasm'
 $(OUTPUT)bpf_jit_disasm: $(OUTPUT)bpf_jit_disasm.o
@@ -92,7 +92,7 @@  $(OUTPUT)bpf_exp.lex.c: $(OUTPUT)bpf_exp.yacc.c
 $(OUTPUT)bpf_exp.yacc.o: $(OUTPUT)bpf_exp.yacc.c
 $(OUTPUT)bpf_exp.lex.o: $(OUTPUT)bpf_exp.lex.c
 
-clean: bpftool_clean runqslower_clean resolve_btfids_clean
+clean: bpftool_clean moderr_clean runqslower_clean resolve_btfids_clean
 	$(call QUIET_CLEAN, bpf-progs)
 	$(Q)$(RM) -r -- $(OUTPUT)*.o $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg \
 	       $(OUTPUT)bpf_asm $(OUTPUT)bpf_exp.yacc.* $(OUTPUT)bpf_exp.lex.*
@@ -118,6 +118,12 @@  bpftool_install:
 bpftool_clean:
 	$(call descend,bpftool,clean)
 
+moderr:
+	$(call descend,moderr)
+
+moderr_clean:
+	$(call descend,moderr,clean)
+
 runqslower:
 	$(call descend,runqslower)
 
@@ -131,5 +137,6 @@  resolve_btfids_clean:
 	$(call descend,resolve_btfids,clean)
 
 .PHONY: all install clean bpftool bpftool_install bpftool_clean \
+	moderr moderr_clean \
 	runqslower runqslower_clean \
 	resolve_btfids resolve_btfids_clean
diff --git a/tools/bpf/moderr/.gitignore b/tools/bpf/moderr/.gitignore
new file mode 100644
index 0000000000000000000000000000000000000000..ffdb70230c8bc308efcc8b7d2084856e2225da91
--- /dev/null
+++ b/tools/bpf/moderr/.gitignore
@@ -0,0 +1,2 @@ 
+# SPDX-License-Identifier: GPL-2.0-only
+/.output
diff --git a/tools/bpf/moderr/Makefile b/tools/bpf/moderr/Makefile
new file mode 100644
index 0000000000000000000000000000000000000000..e6331179f7800e6c1d1945ca713e34f74f7d805d
--- /dev/null
+++ b/tools/bpf/moderr/Makefile
@@ -0,0 +1,95 @@ 
+# SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
+include ../../scripts/Makefile.include
+include ../../scripts/Makefile.arch
+
+OUTPUT ?= $(abspath .output)/
+
+BPFTOOL_OUTPUT := $(OUTPUT)bpftool/
+DEFAULT_BPFTOOL := $(BPFTOOL_OUTPUT)bootstrap/bpftool
+BPFTOOL ?= $(DEFAULT_BPFTOOL)
+LIBBPF_SRC := $(abspath ../../lib/bpf)
+BPFOBJ_OUTPUT := $(OUTPUT)libbpf/
+BPFOBJ := $(BPFOBJ_OUTPUT)libbpf.a
+BPF_DESTDIR := $(BPFOBJ_OUTPUT)
+BPF_INCLUDE := $(BPF_DESTDIR)include
+INCLUDES := -I$(OUTPUT) -I$(BPF_INCLUDE) -I$(abspath ../../include/uapi)
+CFLAGS := -g -Wall $(CLANG_CROSS_FLAGS)
+CFLAGS += $(EXTRA_CFLAGS)
+LDFLAGS += $(EXTRA_LDFLAGS)
+LDLIBS += -lelf -lz
+
+# Try to detect best kernel BTF source
+KERNEL_REL := $(shell uname -r)
+VMLINUX_BTF_PATHS := $(if $(O),$(O)/vmlinux)		\
+	$(if $(KBUILD_OUTPUT),$(KBUILD_OUTPUT)/vmlinux) \
+	../../../vmlinux /sys/kernel/btf/vmlinux	\
+	/boot/vmlinux-$(KERNEL_REL)
+VMLINUX_BTF_PATH := $(or $(VMLINUX_BTF),$(firstword			       \
+					  $(wildcard $(VMLINUX_BTF_PATHS))))
+
+ifeq ($(V),1)
+Q =
+else
+Q = @
+MAKEFLAGS += --no-print-directory
+submake_extras := feature_display=0
+endif
+
+.DELETE_ON_ERROR:
+
+.PHONY: all clean moderr libbpf_hdrs
+all: moderr
+
+moderr: $(OUTPUT)moderr
+
+clean:
+	$(call QUIET_CLEAN, moderr)
+	$(Q)$(RM) -r $(BPFOBJ_OUTPUT) $(BPFTOOL_OUTPUT)
+	$(Q)$(RM) $(OUTPUT)*.o $(OUTPUT)*.d
+	$(Q)$(RM) $(OUTPUT)*.skel.h $(OUTPUT)vmlinux.h
+	$(Q)$(RM) $(OUTPUT)moderr
+	$(Q)$(RM) -r .output
+
+libbpf_hdrs: $(BPFOBJ)
+
+$(OUTPUT)moderr: $(OUTPUT)moderr.o $(BPFOBJ)
+	$(QUIET_LINK)$(CC) $(CFLAGS) $(LDFLAGS) $^ $(LDLIBS) -o $@
+
+$(OUTPUT)moderr.o: moderr.h $(OUTPUT)moderr.skel.h	      \
+			$(OUTPUT)moderr.bpf.o | libbpf_hdrs
+
+$(OUTPUT)moderr.bpf.o: $(OUTPUT)vmlinux.h moderr.h | libbpf_hdrs
+
+$(OUTPUT)%.skel.h: $(OUTPUT)%.bpf.o | $(BPFTOOL)
+	$(QUIET_GEN)$(BPFTOOL) gen skeleton $< > $@
+
+$(OUTPUT)%.bpf.o: %.bpf.c $(BPFOBJ) | $(OUTPUT)
+	$(QUIET_GEN)$(CLANG) -g -O2 --target=bpf $(INCLUDES) \
+		 -D__TARGET_ARCH_$(SRCARCH)						 \
+		 -c $(filter %.c,$^) -o $@ &&				     \
+	$(LLVM_STRIP) -g $@
+
+$(OUTPUT)%.o: %.c | $(OUTPUT)
+	$(QUIET_CC)$(CC) $(CFLAGS) $(INCLUDES) -c $(filter %.c,$^) -o $@
+
+$(OUTPUT) $(BPFOBJ_OUTPUT) $(BPFTOOL_OUTPUT):
+	$(QUIET_MKDIR)mkdir -p $@
+
+$(OUTPUT)vmlinux.h: $(VMLINUX_BTF_PATH) | $(OUTPUT) $(BPFTOOL)
+ifeq ($(VMLINUX_H),)
+	$(Q)if [ ! -e "$(VMLINUX_BTF_PATH)" ] ; then \
+		echo "Couldn't find kernel BTF; set VMLINUX_BTF to"	       \
+			"specify its location." >&2;			       \
+		exit 1;\
+	fi
+	$(QUIET_GEN)$(BPFTOOL) btf dump file $(VMLINUX_BTF_PATH) format c > $@
+else
+	$(Q)cp "$(VMLINUX_H)" $@
+endif
+
+$(BPFOBJ): $(wildcard $(LIBBPF_SRC)/*.[ch] $(LIBBPF_SRC)/Makefile) | $(BPFOBJ_OUTPUT)
+	$(Q)$(MAKE) $(submake_extras) -C $(LIBBPF_SRC) OUTPUT=$(BPFOBJ_OUTPUT) \
+		    DESTDIR=$(BPFOBJ_OUTPUT) prefix= $(abspath $@) install_headers
+
+$(DEFAULT_BPFTOOL): | $(BPFTOOL_OUTPUT)
+	$(Q)$(MAKE) $(submake_extras) -C ../bpftool OUTPUT=$(BPFTOOL_OUTPUT) bootstrap
diff --git a/tools/bpf/moderr/moderr.bpf.c b/tools/bpf/moderr/moderr.bpf.c
new file mode 100644
index 0000000000000000000000000000000000000000..1c5d03336dd87a2f065ef6b608f077a8b988e5cf
--- /dev/null
+++ b/tools/bpf/moderr/moderr.bpf.c
@@ -0,0 +1,127 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Samsung */
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_core_read.h>
+#include "moderr.h"
+
+const volatile bool filter_modname = false;
+const volatile char targ_modname[MODULE_NAME_LEN];
+const volatile bool set_errinj = false;
+const volatile int targ_errinj = 0;
+const volatile bool filter_modfunc = false;
+const volatile int targ_modfunc = 0;
+
+char LICENSE[] SEC("license") = "GPL";
+
+struct {
+	__uint(type, BPF_MAP_TYPE_RINGBUF);
+	__uint(max_entries, 2097152);
+} rb SEC(".maps");
+
+static __always_inline bool filter_module_name(struct module *mod)
+{
+	char modname[MODULE_NAME_LEN];
+
+	bpf_probe_read_str(modname, sizeof(modname), mod->name);
+
+	if (!filter_modname ||
+	    filter_modname && bpf_strncmp(modname, MODULE_NAME_LEN,
+					  (const char *)targ_modname) != 0)
+		return false;
+
+	return true;
+}
+
+static __always_inline bool filter_module_func(enum modfunc fc)
+{
+	if (!filter_modfunc || filter_modfunc && targ_modfunc != fc)
+		return false;
+
+	return true;
+}
+
+static __always_inline bool
+generate_errinj_event(struct pt_regs *ctx, struct module *mod, enum modfunc fc)
+{
+	struct event *e;
+
+	e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
+	if (!e)
+		return false;
+
+	e->err = 0;
+	e->func = fc;
+	bpf_probe_read_str(e->modname, sizeof(e->modname), mod->name);
+
+	if (set_errinj) {
+		bpf_override_return(ctx, targ_errinj);
+		e->err = targ_errinj;
+	}
+
+	bpf_ringbuf_submit(e, 0);
+	return true;
+}
+
+static __always_inline bool generate_debug_event(struct pt_regs *ctx,
+						 struct module *mod,
+						 enum modfunc fc,
+						 const char *fmt)
+{
+	struct event *e;
+
+	e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
+	if (!e)
+		return false;
+
+	e->dbg = BPF_SNPRINTF(e->msg, sizeof(e->msg), "[%s:%s]: %s", mod->name,
+			      modfunc_to_string(fc), fmt);
+
+	bpf_ringbuf_submit(e, 0);
+	return true;
+}
+
+static __always_inline int
+module_error_injection(struct pt_regs *ctx, struct module *mod, enum modfunc fc)
+{
+	if (!filter_module_name(mod)) {
+		generate_debug_event(ctx, mod, fc,
+				     "Target module does not match");
+		return 0;
+	}
+
+	if (!filter_module_func(fc)) {
+		generate_debug_event(ctx, mod, fc,
+				     "Target function does not match");
+		return 0;
+	}
+
+	if (!generate_errinj_event(ctx, mod, fc)) {
+		generate_debug_event(
+			ctx, mod, fc,
+			"Error injection event cannot be generated");
+		return 0;
+	}
+
+	return 0;
+}
+
+SEC("kprobe/complete_formation")
+int BPF_KPROBE(complete_formation, struct module *mod, struct load_info *info)
+{
+	return module_error_injection(ctx, mod, COMPLETE_FORMATION);
+}
+
+SEC("kprobe/do_init_module")
+int BPF_KPROBE(do_init_module, struct module *mod, struct load_info *info)
+{
+	return module_error_injection(ctx, mod, DO_INIT_MODULE);
+}
+
+SEC("kprobe/module_enable_rodata_ro_after_init")
+int BPF_KPROBE(module_enable_rodata_ro_after_init, struct module *mod)
+{
+	return module_error_injection(ctx, mod,
+				      MODULE_ENABLE_RODATA_AFTER_INIT);
+}
diff --git a/tools/bpf/moderr/moderr.c b/tools/bpf/moderr/moderr.c
new file mode 100644
index 0000000000000000000000000000000000000000..dce18b02b55d1ad1f7e304cb49985d570b115aa4
--- /dev/null
+++ b/tools/bpf/moderr/moderr.c
@@ -0,0 +1,236 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Samsung */
+#include <argp.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <bpf/libbpf.h>
+#include <bpf/bpf.h>
+#include <stdlib.h>
+#include <string.h>
+#include "moderr.h"
+#include "moderr.skel.h"
+
+static struct env {
+	bool verbose;
+	char modname[MODULE_NAME_LEN];
+	enum modfunc func;
+	bool trace;
+	int errinj;
+} env;
+
+const char *argp_program_version = "moderr 0.1";
+const char *argp_program_bug_address = "<da.gomez@samsung.com>";
+const char argp_program_doc[] =
+"BPF moderr application.\n"
+"\n"
+"It injects errors in module initialization\n"
+"\nUSAGE: "
+"moderr [-m <module_name>] [-f <function_name>] [-e <errno>]\n";
+
+static volatile bool exiting = false;
+
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
+
+static const struct argp_option opts[] = {
+	{ "verbose", 'v', NULL, 0, "Verbose debug output" },
+	{ "trace", 't', NULL, 0, "Enable trace output", 0 },
+	{ "modname", 'm', "MODNAME", 0, "Trace this module name only", 0 },
+	{ "modfunc", 'f', "MODFUNC", 0, "Trace this module function only", 0 },
+	{ "list", 'l', NULL, 0, "List available module functions", 0 },
+	{ "error", 'e', "ERROR", 0, "Inject this error", 0 },
+	{ NULL, 'h', NULL, OPTION_HIDDEN, "Show the full help", 0 },
+	{},
+};
+
+static void help_modfunc(void)
+{
+	printf("\nAvailable modfunc options are:\n"
+	       "- complete_formation\n"
+	       "- do_init_module\n"
+	       "- module_enable_rodata_ro_after_init\n\n");
+}
+
+static enum modfunc string_to_modfunc(char *arg)
+{
+	if (strncmp(arg, "complete_formation", strlen(arg)) == 0)
+		return COMPLETE_FORMATION;
+
+	if (strncmp(arg, "do_init_module", strlen(arg)) == 0)
+		return DO_INIT_MODULE;
+
+	if (strncmp(arg, "module_enable_rodata_ro_after_init", strlen(arg)) ==
+	    0)
+		return MODULE_ENABLE_RODATA_AFTER_INIT;
+
+	return UNKNOWN;
+}
+
+static error_t parse_arg(int key, char *arg, struct argp_state *state)
+{
+	switch (key) {
+	case 'h':
+		argp_state_help(state, stderr, ARGP_HELP_STD_HELP);
+		break;
+	case 'l':
+		help_modfunc();
+		argp_usage(state);
+		break;
+	case 'v':
+		env.verbose = true;
+		break;
+	case 'm':
+		if (strlen(arg) + 1 > MODULE_NAME_LEN) {
+			fprintf(stderr, "module name error\n");
+			argp_usage(state);
+		}
+		strncpy(env.modname, arg, sizeof(env.modname) - 1);
+		break;
+	case 'f':
+		if (strlen(arg) + 1 > MODULE_FUNC_LEN) {
+			fprintf(stderr, "module function too long\n");
+			argp_usage(state);
+		}
+		env.func = string_to_modfunc(arg);
+		if (!env.func) {
+			fprintf(stderr, "invalid module function\n");
+			help_modfunc();
+			argp_usage(state);
+		}
+		break;
+	case 'e':
+		env.errinj = atoi(arg);
+		break;
+	case 't':
+		env.trace = true;
+		break;
+	case ARGP_KEY_ARG:
+		argp_usage(state);
+		break;
+	default:
+		return ARGP_ERR_UNKNOWN;
+	}
+	return 0;
+}
+
+static const struct argp argp = {
+	.options = opts,
+	.parser = parse_arg,
+	.doc = argp_program_doc,
+};
+
+static int libbpf_print_fn(enum libbpf_print_level level, const char *format,
+			   va_list args)
+{
+	if (level == LIBBPF_DEBUG && !env.verbose)
+		return 0;
+	return vfprintf(stderr, format, args);
+}
+
+static void sig_handler(int sig)
+{
+	exiting = true;
+}
+
+static int handle_event(void *ctx, void *data, size_t data_sz)
+{
+	const struct event *e = data;
+
+	if (!env.trace)
+		return 0;
+
+	if (e->dbg) {
+		if (env.verbose)
+			printf("%s\n", e->msg);
+		return 0;
+	}
+
+	printf("%-10s %-5d %-20s\n", e->modname, e->err,
+	       modfunc_to_string(e->func));
+
+	return 0;
+}
+
+int main(int argc, char **argv)
+{
+	struct ring_buffer *rb = NULL;
+	struct moderr_bpf *obj;
+	int err;
+
+	err = argp_parse(&argp, argc, argv, 0, NULL, NULL);
+	if (err)
+		return err;
+
+	if (!strlen(env.modname) || !env.func) {
+		fprintf(stderr, "missing arguments\n");
+		return EXIT_FAILURE;
+	}
+
+	libbpf_set_print(libbpf_print_fn);
+
+	signal(SIGINT, sig_handler);
+	signal(SIGTERM, sig_handler);
+
+	obj = moderr_bpf__open();
+	if (!obj) {
+		fprintf(stderr, "failed to open and load BPF object\n");
+		return 1;
+	}
+
+	obj->rodata->filter_modname = true;
+	strncpy(obj->rodata->targ_modname, env.modname, MODULE_NAME_LEN - 1);
+	obj->rodata->targ_modname[MODULE_NAME_LEN - 1] = '\0';
+
+	obj->rodata->filter_modfunc = true;
+	obj->rodata->targ_modfunc = env.func;
+
+	if (env.errinj) {
+		obj->rodata->set_errinj = true;
+		obj->rodata->targ_errinj = env.errinj;
+	}
+
+	err = moderr_bpf__load(obj);
+	if (err) {
+		fprintf(stderr, "failed to load and verify BPF object\n");
+		goto cleanup;
+	}
+
+	err = moderr_bpf__attach(obj);
+	if (err) {
+		fprintf(stderr, "failed to attach BPF object\n");
+		goto cleanup;
+	}
+
+	printf("Monitoring module error injection... Hit Ctrl-C to end.\n");
+
+	rb = ring_buffer__new(bpf_map__fd(obj->maps.rb), handle_event, NULL,
+			      NULL);
+	if (!rb) {
+		err = -1;
+		fprintf(stderr, "failed to create ring buffer\n");
+		goto cleanup;
+	}
+
+	if (env.trace)
+		printf("%-10s %-5s %-20s\n", "MODULE", "ERROR", "FUNCTION");
+
+	while (!exiting) {
+		err = ring_buffer__poll(rb, 100);
+		if (err == -EINTR) {
+			err = 0;
+			break;
+		}
+		if (err < 0) {
+			fprintf(stderr, "error polling ring buffer: %d\n", err);
+			break;
+		}
+	}
+
+	printf("\n");
+
+cleanup:
+	ring_buffer__free(rb);
+	moderr_bpf__destroy(obj);
+
+	return err < 0 ? -err : 0;
+}
diff --git a/tools/bpf/moderr/moderr.h b/tools/bpf/moderr/moderr.h
new file mode 100644
index 0000000000000000000000000000000000000000..e17440cf4bd5fe09b927cb83807a88f66861bba5
--- /dev/null
+++ b/tools/bpf/moderr/moderr.h
@@ -0,0 +1,40 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2025 Samsung */
+#ifndef __MODERR_H
+#define __MODERR_H
+
+#define MAX_PARAM_PREFIX_LEN (64 - sizeof(unsigned long))
+#define MODULE_NAME_LEN MAX_PARAM_PREFIX_LEN
+#define MODULE_FUNC_LEN 128
+#define MESSAGE_LEN 128
+
+enum modfunc {
+	UNKNOWN,
+	COMPLETE_FORMATION = 1,
+	DO_INIT_MODULE,
+	MODULE_ENABLE_RODATA_AFTER_INIT,
+};
+
+struct event {
+	char modname[MODULE_NAME_LEN];
+	int err;
+	int func;
+	char msg[MESSAGE_LEN];
+	int dbg;
+};
+
+static inline const char *modfunc_to_string(enum modfunc fc)
+{
+	switch (fc) {
+	case COMPLETE_FORMATION:
+		return "complete_formation()";
+	case DO_INIT_MODULE:
+		return "do_init_module()";
+	case MODULE_ENABLE_RODATA_AFTER_INIT:
+		return "module_enable_rodata_after_init()";
+	default:
+		return "unknown";
+	}
+}
+
+#endif /* __MODERR_H */