mbox series

[bpf-next,v3,0/4] bpf/crib: Add open-coded style process file iterator and file related CRIB kfuncs

Message ID AM6PR03MB58488FD29EB0D0B89D52AABB99532@AM6PR03MB5848.eurprd03.prod.outlook.com (mailing list archive)
Headers show
Series bpf/crib: Add open-coded style process file iterator and file related CRIB kfuncs | expand

Message

Juntong Deng Nov. 6, 2024, 7:31 p.m. UTC
This patch series adds open-coded style process file iterator
bpf_iter_task_file and file related kfuncs bpf_fget_task(),
bpf_get_file_ops_type(), and corresponding selftests test cases.

Known future merge conflict: In linux-next task_lookup_next_fdget_rcu()
has been removed and replaced with fget_task_next() [0], but that has
not happened yet in bpf-next, so I still
use task_lookup_next_fdget_rcu() in bpf_iter_task_file_next().

[0]: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=8fd3395ec9051a52828fcca2328cb50a69dea8ef

Although iter/task_file already exists, for CRIB we still need the
open-coded iterator style process file iterator, and the same is true
for other bpf iterators such as iter/tcp, iter/udp, etc.

The traditional bpf iterator is more like a bpf version of procfs, but
similar to procfs, it is not suitable for CRIB scenarios that need to
obtain large amounts of complex, multi-level in-kernel information.

The following is from previous discussions [1].

[1]: https://lore.kernel.org/bpf/AM6PR03MB5848CA34B5B68C90F210285E99B12@AM6PR03MB5848.eurprd03.prod.outlook.com/

This is because the context of bpf iterators is fixed and bpf iterators
cannot be nested. This means that a bpf iterator program can only
complete a specific small iterative dump task, and cannot dump
multi-level data.

An example, when we need to dump all the sockets of a process, we need
to iterate over all the files (sockets) of the process, and iterate over
the all packets in the queue of each socket, and iterate over all data
in each packet.

If we use bpf iterator, since the iterator can not be nested, we need to
use socket iterator program to get all the basic information of all
sockets (pass pid as filter), and then use packet iterator program to
get the basic information of all packets of a specific socket (pass pid,
fd as filter), and then use packet data iterator program to get all the
data of a specific packet (pass pid, fd, packet index as filter).

This would be complicated and require a lot of (each iteration)
bpf program startup and exit (leading to poor performance).

By comparison, open coded iterator is much more flexible, we can iterate
in any context, at any time, and iteration can be nested, so we can
achieve more flexible and more elegant dumping through open coded
iterators.

With open coded iterators, all of the above can be done in a single
bpf program, and with nested iterators, everything becomes compact
and simple.

Also, bpf iterators transmit data to user space through seq_file,
which involves a lot of open (bpf_iter_create), read, close syscalls,
context switching, memory copying, and cannot achieve the performance
of using ringbuf.

Discussion
----------

1. Do we need bpf_iter_task_file_get_fd()?

Andrii suggested that next() should return a pointer to
a bpf_iter_task_file_item, which contains *file and fd.

This is feasible, but it might compromise iterator encapsulation?

More detailed discussion can be found at [3] [4]

[3]: https://lore.kernel.org/bpf/CAEf4Bzbt0kh53xYZL57Nc9AWcYUKga_NQ6uUrTeU4bj8qyTLng@mail.gmail.com/
[4]: https://lore.kernel.org/bpf/AM6PR03MB584814D93FE3680635DE61A199562@AM6PR03MB5848.eurprd03.prod.outlook.com/

What should we do? Maybe more discussion is needed?

2. Where should we put CRIB related kfuncs?

I totally agree that most of the CRIB related kfuncs are not
CRIB specific.

The goal of CRIB is to collect all relevant information about a process,
which means we need to add kfuncs involving several different kernel
subsystems (though these kfuncs are not complex and many just help the
bpf program reach a certain data structure).

But here is a question, where should these CRIB kfuncs be placed?
There doesn't seem to be a suitable file to put them in.

My current idea is to create a crib folder and then create new files for
the relevant subsystems, e.g. crib/files.c, crib/socket.c, crib/mount.c
etc. Putting them in the same folder makes it easier to maintain
them centrally.

If anyone else wants to use CRIB kfuncs, welcome to use them.

Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
---
v2 -> v3:
1. Move task_file open-coded iterator to kernel/bpf/helpers.c.

2. Fix duplicate error code 7 in test_bpf_iter_task_file().

3. Add comment for case when bpf_iter_task_file_get_fd() returns -1.

4. Add future plans in commit message of "Add struct file related
CRIB kfuncs".

5. Add Discussion section to cover letter.

v1 -> v2:
Fix a type definition error in the fd parameter of
bpf_fget_task() at crib_common.h.

Juntong Deng (4):
  bpf/crib: Introduce task_file open-coded iterator kfuncs
  selftests/bpf: Add tests for open-coded style process file iterator
  bpf/crib: Add struct file related CRIB kfuncs
  selftests/bpf: Add tests for struct file related CRIB kfuncs

 kernel/bpf/Makefile                           |   1 +
 kernel/bpf/crib/Makefile                      |   3 +
 kernel/bpf/crib/crib.c                        |  28 ++++
 kernel/bpf/crib/files.c                       |  54 ++++++++
 kernel/bpf/helpers.c                          |   4 +
 kernel/bpf/task_iter.c                        |  96 +++++++++++++
 tools/testing/selftests/bpf/prog_tests/crib.c | 126 ++++++++++++++++++
 .../testing/selftests/bpf/progs/crib_common.h |  25 ++++
 .../selftests/bpf/progs/crib_files_failure.c  | 108 +++++++++++++++
 .../selftests/bpf/progs/crib_files_success.c  | 119 +++++++++++++++++
 10 files changed, 564 insertions(+)
 create mode 100644 kernel/bpf/crib/Makefile
 create mode 100644 kernel/bpf/crib/crib.c
 create mode 100644 kernel/bpf/crib/files.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/crib.c
 create mode 100644 tools/testing/selftests/bpf/progs/crib_common.h
 create mode 100644 tools/testing/selftests/bpf/progs/crib_files_failure.c
 create mode 100644 tools/testing/selftests/bpf/progs/crib_files_success.c

Comments

Andrii Nakryiko Nov. 6, 2024, 10:15 p.m. UTC | #1
On Wed, Nov 6, 2024 at 11:35 AM Juntong Deng <juntong.deng@outlook.com> wrote:
>
> This patch series adds open-coded style process file iterator
> bpf_iter_task_file and file related kfuncs bpf_fget_task(),
> bpf_get_file_ops_type(), and corresponding selftests test cases.
>
> Known future merge conflict: In linux-next task_lookup_next_fdget_rcu()
> has been removed and replaced with fget_task_next() [0], but that has
> not happened yet in bpf-next, so I still
> use task_lookup_next_fdget_rcu() in bpf_iter_task_file_next().
>
> [0]: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=8fd3395ec9051a52828fcca2328cb50a69dea8ef
>
> Although iter/task_file already exists, for CRIB we still need the
> open-coded iterator style process file iterator, and the same is true
> for other bpf iterators such as iter/tcp, iter/udp, etc.
>
> The traditional bpf iterator is more like a bpf version of procfs, but
> similar to procfs, it is not suitable for CRIB scenarios that need to
> obtain large amounts of complex, multi-level in-kernel information.
>
> The following is from previous discussions [1].
>
> [1]: https://lore.kernel.org/bpf/AM6PR03MB5848CA34B5B68C90F210285E99B12@AM6PR03MB5848.eurprd03.prod.outlook.com/
>
> This is because the context of bpf iterators is fixed and bpf iterators
> cannot be nested. This means that a bpf iterator program can only
> complete a specific small iterative dump task, and cannot dump
> multi-level data.
>
> An example, when we need to dump all the sockets of a process, we need
> to iterate over all the files (sockets) of the process, and iterate over
> the all packets in the queue of each socket, and iterate over all data
> in each packet.
>
> If we use bpf iterator, since the iterator can not be nested, we need to
> use socket iterator program to get all the basic information of all
> sockets (pass pid as filter), and then use packet iterator program to
> get the basic information of all packets of a specific socket (pass pid,
> fd as filter), and then use packet data iterator program to get all the
> data of a specific packet (pass pid, fd, packet index as filter).
>
> This would be complicated and require a lot of (each iteration)
> bpf program startup and exit (leading to poor performance).
>
> By comparison, open coded iterator is much more flexible, we can iterate
> in any context, at any time, and iteration can be nested, so we can
> achieve more flexible and more elegant dumping through open coded
> iterators.
>
> With open coded iterators, all of the above can be done in a single
> bpf program, and with nested iterators, everything becomes compact
> and simple.
>
> Also, bpf iterators transmit data to user space through seq_file,
> which involves a lot of open (bpf_iter_create), read, close syscalls,
> context switching, memory copying, and cannot achieve the performance
> of using ringbuf.
>
> Discussion
> ----------
>
> 1. Do we need bpf_iter_task_file_get_fd()?
>
> Andrii suggested that next() should return a pointer to
> a bpf_iter_task_file_item, which contains *file and fd.
>
> This is feasible, but it might compromise iterator encapsulation?

I don't think so, replied on v2 ([0]). I know you saw that, I'm just
linking it for others.

  [0] https://lore.kernel.org/bpf/CAEf4Bzba2N7pxPQh8_BDrVgupZdeow_3S7xSjDmsdhL19eXb3A@mail.gmail.com/


>
> More detailed discussion can be found at [3] [4]
>
> [3]: https://lore.kernel.org/bpf/CAEf4Bzbt0kh53xYZL57Nc9AWcYUKga_NQ6uUrTeU4bj8qyTLng@mail.gmail.com/
> [4]: https://lore.kernel.org/bpf/AM6PR03MB584814D93FE3680635DE61A199562@AM6PR03MB5848.eurprd03.prod.outlook.com/
>
> What should we do? Maybe more discussion is needed?
>
> 2. Where should we put CRIB related kfuncs?
>
> I totally agree that most of the CRIB related kfuncs are not
> CRIB specific.
>
> The goal of CRIB is to collect all relevant information about a process,
> which means we need to add kfuncs involving several different kernel
> subsystems (though these kfuncs are not complex and many just help the
> bpf program reach a certain data structure).
>
> But here is a question, where should these CRIB kfuncs be placed?
> There doesn't seem to be a suitable file to put them in.
>
> My current idea is to create a crib folder and then create new files for
> the relevant subsystems, e.g. crib/files.c, crib/socket.c, crib/mount.c
> etc. Putting them in the same folder makes it easier to maintain
> them centrally.
>
> If anyone else wants to use CRIB kfuncs, welcome to use them.
>

CRIB is just one of possible applications of such kfuncs, so I'd steer
away from over-specifying it as CRIB.

task_file open-coded iterator is generic, and should stay close to
other task iterator code, as you do in this revision.

bpf_get_file_ops_type() is unnecessary, as we already discussed on v2,
__ksym and comparison is the way to go here.

bpf_fget_task(), if VFS folks agree to add it, probably will have to
stay close to other similar VFS helpers.

> Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
> ---
> v2 -> v3:
> 1. Move task_file open-coded iterator to kernel/bpf/helpers.c.
>
> 2. Fix duplicate error code 7 in test_bpf_iter_task_file().
>
> 3. Add comment for case when bpf_iter_task_file_get_fd() returns -1.
>
> 4. Add future plans in commit message of "Add struct file related
> CRIB kfuncs".
>
> 5. Add Discussion section to cover letter.
>
> v1 -> v2:
> Fix a type definition error in the fd parameter of
> bpf_fget_task() at crib_common.h.
>
> Juntong Deng (4):
>   bpf/crib: Introduce task_file open-coded iterator kfuncs
>   selftests/bpf: Add tests for open-coded style process file iterator
>   bpf/crib: Add struct file related CRIB kfuncs
>   selftests/bpf: Add tests for struct file related CRIB kfuncs
>
>  kernel/bpf/Makefile                           |   1 +
>  kernel/bpf/crib/Makefile                      |   3 +
>  kernel/bpf/crib/crib.c                        |  28 ++++
>  kernel/bpf/crib/files.c                       |  54 ++++++++
>  kernel/bpf/helpers.c                          |   4 +
>  kernel/bpf/task_iter.c                        |  96 +++++++++++++
>  tools/testing/selftests/bpf/prog_tests/crib.c | 126 ++++++++++++++++++
>  .../testing/selftests/bpf/progs/crib_common.h |  25 ++++
>  .../selftests/bpf/progs/crib_files_failure.c  | 108 +++++++++++++++
>  .../selftests/bpf/progs/crib_files_success.c  | 119 +++++++++++++++++
>  10 files changed, 564 insertions(+)
>  create mode 100644 kernel/bpf/crib/Makefile
>  create mode 100644 kernel/bpf/crib/crib.c
>  create mode 100644 kernel/bpf/crib/files.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/crib.c
>  create mode 100644 tools/testing/selftests/bpf/progs/crib_common.h
>  create mode 100644 tools/testing/selftests/bpf/progs/crib_files_failure.c
>  create mode 100644 tools/testing/selftests/bpf/progs/crib_files_success.c
>
> --
> 2.39.5
>
Juntong Deng Nov. 6, 2024, 10:49 p.m. UTC | #2
On 2024/11/6 22:15, Andrii Nakryiko wrote:
> On Wed, Nov 6, 2024 at 11:35 AM Juntong Deng <juntong.deng@outlook.com> wrote:
>>
>> This patch series adds open-coded style process file iterator
>> bpf_iter_task_file and file related kfuncs bpf_fget_task(),
>> bpf_get_file_ops_type(), and corresponding selftests test cases.
>>
>> Known future merge conflict: In linux-next task_lookup_next_fdget_rcu()
>> has been removed and replaced with fget_task_next() [0], but that has
>> not happened yet in bpf-next, so I still
>> use task_lookup_next_fdget_rcu() in bpf_iter_task_file_next().
>>
>> [0]: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=8fd3395ec9051a52828fcca2328cb50a69dea8ef
>>
>> Although iter/task_file already exists, for CRIB we still need the
>> open-coded iterator style process file iterator, and the same is true
>> for other bpf iterators such as iter/tcp, iter/udp, etc.
>>
>> The traditional bpf iterator is more like a bpf version of procfs, but
>> similar to procfs, it is not suitable for CRIB scenarios that need to
>> obtain large amounts of complex, multi-level in-kernel information.
>>
>> The following is from previous discussions [1].
>>
>> [1]: https://lore.kernel.org/bpf/AM6PR03MB5848CA34B5B68C90F210285E99B12@AM6PR03MB5848.eurprd03.prod.outlook.com/
>>
>> This is because the context of bpf iterators is fixed and bpf iterators
>> cannot be nested. This means that a bpf iterator program can only
>> complete a specific small iterative dump task, and cannot dump
>> multi-level data.
>>
>> An example, when we need to dump all the sockets of a process, we need
>> to iterate over all the files (sockets) of the process, and iterate over
>> the all packets in the queue of each socket, and iterate over all data
>> in each packet.
>>
>> If we use bpf iterator, since the iterator can not be nested, we need to
>> use socket iterator program to get all the basic information of all
>> sockets (pass pid as filter), and then use packet iterator program to
>> get the basic information of all packets of a specific socket (pass pid,
>> fd as filter), and then use packet data iterator program to get all the
>> data of a specific packet (pass pid, fd, packet index as filter).
>>
>> This would be complicated and require a lot of (each iteration)
>> bpf program startup and exit (leading to poor performance).
>>
>> By comparison, open coded iterator is much more flexible, we can iterate
>> in any context, at any time, and iteration can be nested, so we can
>> achieve more flexible and more elegant dumping through open coded
>> iterators.
>>
>> With open coded iterators, all of the above can be done in a single
>> bpf program, and with nested iterators, everything becomes compact
>> and simple.
>>
>> Also, bpf iterators transmit data to user space through seq_file,
>> which involves a lot of open (bpf_iter_create), read, close syscalls,
>> context switching, memory copying, and cannot achieve the performance
>> of using ringbuf.
>>
>> Discussion
>> ----------
>>
>> 1. Do we need bpf_iter_task_file_get_fd()?
>>
>> Andrii suggested that next() should return a pointer to
>> a bpf_iter_task_file_item, which contains *file and fd.
>>
>> This is feasible, but it might compromise iterator encapsulation?
> 
> I don't think so, replied on v2 ([0]). I know you saw that, I'm just
> linking it for others.
> 
>    [0] https://lore.kernel.org/bpf/CAEf4Bzba2N7pxPQh8_BDrVgupZdeow_3S7xSjDmsdhL19eXb3A@mail.gmail.com/
> 
> 
>>
>> More detailed discussion can be found at [3] [4]
>>
>> [3]: https://lore.kernel.org/bpf/CAEf4Bzbt0kh53xYZL57Nc9AWcYUKga_NQ6uUrTeU4bj8qyTLng@mail.gmail.com/
>> [4]: https://lore.kernel.org/bpf/AM6PR03MB584814D93FE3680635DE61A199562@AM6PR03MB5848.eurprd03.prod.outlook.com/
>>
>> What should we do? Maybe more discussion is needed?
>>
>> 2. Where should we put CRIB related kfuncs?
>>
>> I totally agree that most of the CRIB related kfuncs are not
>> CRIB specific.
>>
>> The goal of CRIB is to collect all relevant information about a process,
>> which means we need to add kfuncs involving several different kernel
>> subsystems (though these kfuncs are not complex and many just help the
>> bpf program reach a certain data structure).
>>
>> But here is a question, where should these CRIB kfuncs be placed?
>> There doesn't seem to be a suitable file to put them in.
>>
>> My current idea is to create a crib folder and then create new files for
>> the relevant subsystems, e.g. crib/files.c, crib/socket.c, crib/mount.c
>> etc. Putting them in the same folder makes it easier to maintain
>> them centrally.
>>
>> If anyone else wants to use CRIB kfuncs, welcome to use them.
>>
> 
> CRIB is just one of possible applications of such kfuncs, so I'd steer
> away from over-specifying it as CRIB.
> 
> task_file open-coded iterator is generic, and should stay close to
> other task iterator code, as you do in this revision.
> 
> bpf_get_file_ops_type() is unnecessary, as we already discussed on v2,
> __ksym and comparison is the way to go here.
> 
> bpf_fget_task(), if VFS folks agree to add it, probably will have to
> stay close to other similar VFS helpers.
> 

Yes, I agree.

Maybe we should put it in fs/bpf_fs_kfuncs.c?

>> Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
>> ---
>> v2 -> v3:
>> 1. Move task_file open-coded iterator to kernel/bpf/helpers.c.
>>
>> 2. Fix duplicate error code 7 in test_bpf_iter_task_file().
>>
>> 3. Add comment for case when bpf_iter_task_file_get_fd() returns -1.
>>
>> 4. Add future plans in commit message of "Add struct file related
>> CRIB kfuncs".
>>
>> 5. Add Discussion section to cover letter.
>>
>> v1 -> v2:
>> Fix a type definition error in the fd parameter of
>> bpf_fget_task() at crib_common.h.
>>
>> Juntong Deng (4):
>>    bpf/crib: Introduce task_file open-coded iterator kfuncs
>>    selftests/bpf: Add tests for open-coded style process file iterator
>>    bpf/crib: Add struct file related CRIB kfuncs
>>    selftests/bpf: Add tests for struct file related CRIB kfuncs
>>
>>   kernel/bpf/Makefile                           |   1 +
>>   kernel/bpf/crib/Makefile                      |   3 +
>>   kernel/bpf/crib/crib.c                        |  28 ++++
>>   kernel/bpf/crib/files.c                       |  54 ++++++++
>>   kernel/bpf/helpers.c                          |   4 +
>>   kernel/bpf/task_iter.c                        |  96 +++++++++++++
>>   tools/testing/selftests/bpf/prog_tests/crib.c | 126 ++++++++++++++++++
>>   .../testing/selftests/bpf/progs/crib_common.h |  25 ++++
>>   .../selftests/bpf/progs/crib_files_failure.c  | 108 +++++++++++++++
>>   .../selftests/bpf/progs/crib_files_success.c  | 119 +++++++++++++++++
>>   10 files changed, 564 insertions(+)
>>   create mode 100644 kernel/bpf/crib/Makefile
>>   create mode 100644 kernel/bpf/crib/crib.c
>>   create mode 100644 kernel/bpf/crib/files.c
>>   create mode 100644 tools/testing/selftests/bpf/prog_tests/crib.c
>>   create mode 100644 tools/testing/selftests/bpf/progs/crib_common.h
>>   create mode 100644 tools/testing/selftests/bpf/progs/crib_files_failure.c
>>   create mode 100644 tools/testing/selftests/bpf/progs/crib_files_success.c
>>
>> --
>> 2.39.5
>>
Song Liu Nov. 6, 2024, 11:04 p.m. UTC | #3
On Wed, Nov 6, 2024 at 2:49 PM Juntong Deng <juntong.deng@outlook.com> wrote:
>
[...]
> >
> > CRIB is just one of possible applications of such kfuncs, so I'd steer
> > away from over-specifying it as CRIB.
> >
> > task_file open-coded iterator is generic, and should stay close to
> > other task iterator code, as you do in this revision.
> >
> > bpf_get_file_ops_type() is unnecessary, as we already discussed on v2,
> > __ksym and comparison is the way to go here.
> >
> > bpf_fget_task(), if VFS folks agree to add it, probably will have to
> > stay close to other similar VFS helpers.
> >
>
> Yes, I agree.
>
> Maybe we should put it in fs/bpf_fs_kfuncs.c?

fs/bpf_fs_kfuncs.c is a good place for bpf_fget_task().

Please note that currently all kfuncs in fs/bpf_fs_kfuncs.c are only
available to BPF LSM programs. We need to make some of them,
including bpf_fget_task, available to tracing functions.

Thanks,
Song