diff mbox series

[v1,bpf-next,1/2,RFC] bpf: Introduce BPF_F_VMA_NEXT flag for bpf_find_vma helper

Message ID 20230801145414.418145-1-davemarchevsky@fb.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Series [v1,bpf-next,1/2,RFC] bpf: Introduce BPF_F_VMA_NEXT flag for bpf_find_vma helper

Checks

Context Check Description
bpf/vmtest-bpf-next-PR fail PR summary
bpf/vmtest-bpf-next-VM_Test-8 success Logs for veristat
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-2 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-4 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-5 success Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-6 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-3 success Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-12 pending Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-16 pending Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-25 success Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-26 pending Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-28 success Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-29 success Logs for veristat
bpf/vmtest-bpf-next-VM_Test-7 success Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-9 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-11 success Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-13 success Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-14 success Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-15 success Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-18 success Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-19 success Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-20 success Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-21 success Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-22 success Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24 success Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-27 success Logs for test_verifier on x86_64 with gcc
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 3078 this patch: 3078
netdev/cc_maintainers warning 8 maintainers not CCed: kpsingh@kernel.org martin.lau@linux.dev john.fastabend@gmail.com sdf@google.com song@kernel.org yonghong.song@linux.dev jolsa@kernel.org haoluo@google.com
netdev/build_clang success Errors and warnings before: 1539 this patch: 1539
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 3100 this patch: 3100
netdev/checkpatch warning WARNING: line length of 85 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Dave Marchevsky Aug. 1, 2023, 2:54 p.m. UTC
At Meta we have a profiling daemon which periodically collects
information on many hosts. This collection usually involves grabbing
stacks (user and kernel) using perf_event BPF progs and later symbolicating
them. For user stacks we try to use BPF_F_USER_BUILD_ID and rely on
remote symbolication, but BPF_F_USER_BUILD_ID doesn't always succeed. In
those cases we must fall back to digging around in /proc/PID/maps to map
virtual address to (binary, offset). The /proc/PID/maps digging does not
occur synchronously with stack collection, so the process might already
be gone, in which case it won't have /proc/PID/maps and we will fail to
symbolicate.

This 'exited process problem' doesn't occur very often, as
most of the prod services we care to profile are long-lived daemons,
but there are enough usecases to warrant a workaround: a BPF program
which can be optionally loaded at data collection time and which
essentially walks /proc/PID/maps. Currently this is done by walking
the vma list:

  struct vm_area_struct *mmap = BPF_CORE_READ(mm, mmap);
  mmap_next = BPF_CORE_READ(mmap, vm_next); /* in a loop */

Since commit 763ecb035029 ("mm: remove the vma linked list") there's no
longer a vma linked list to walk. Walking the vma maple tree is not as
simple as hopping struct vm_area_struct->vm_next. That commit replaces
vm_next hopping with calls to find_vma(mm, addr) helper function, which
returns the vma containing addr, or if no vma contains addr,
the closest vma with higher start addr.

The BPF helper bpf_find_vma is unsurprisingly a thin wrapper around
find_vma, with the major difference that no 'closest vma' is returned if
there is no VMA containing a particular address. This prevents BPF
programs from being able to use bpf_find_vma to iterate all vmas in a
task in a reasonable way.

This patch adds a BPF_F_VMA_NEXT flag to bpf_find_vma which restores
'closest vma' behavior when used. Because this is find_vma's default
behavior, it's as straightforward as relaxing the 'vma contains addr'
check on find_vma's return value.

Also, rename bpf_find_vma's address parameter from 'start' to 'addr'.
The latter is already used in the documentation and more accurately
describes the parameter.

[
  RFC: This isn't an ideal solution for iteration of all vmas in a task
       in the long term for a few reasons:

     * In NMI context, a second call to bpf_find_vma will fail because
       irq_work is busy, so not all vmas can be iterated
     * The mmap_read lock is repeatedly taken and released, when a
       dedicated iterate_all_vmas(task) kfunc could take it once and
       hold it for all vmas

    My specific usecase doesn't do vma iteration in nmi context and I
    think the 'closest vma' behavior can be useful here despite locking
    inefficiencies.

    When Alexei and I discussed this offline, two alternatives to
    provide similar functionality while addressing above issues seemed
    reasonable:

      * open-coded iterator for task vma. Similar to the existing
        task_vma bpf_iter, but with no need to create a bpf_link and
        read a bpf_iter fd from userspace.
      * New kfunc taking a callback similar to bpf_find_vma, but
        iterating over all vmas in one go

     I think this patch is useful on its own since it's a fairly minimal
     change and fixes my usecase. Sending for early feedback and to
     solicit further thought about whether this should be dropped in
     favor of one of the above options.
]

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Cc: Nathan Slingerland <slinger@meta.com>
---
 include/uapi/linux/bpf.h       | 14 ++++++++++++--
 kernel/bpf/task_iter.c         | 12 ++++++++----
 tools/include/uapi/linux/bpf.h | 14 ++++++++++++--
 3 files changed, 32 insertions(+), 8 deletions(-)

Comments

Alexei Starovoitov Aug. 1, 2023, 8:41 p.m. UTC | #1
On Tue, Aug 1, 2023 at 7:54 AM Dave Marchevsky <davemarchevsky@fb.com> wrote:
>
> At Meta we have a profiling daemon which periodically collects
> information on many hosts. This collection usually involves grabbing
> stacks (user and kernel) using perf_event BPF progs and later symbolicating
> them. For user stacks we try to use BPF_F_USER_BUILD_ID and rely on
> remote symbolication, but BPF_F_USER_BUILD_ID doesn't always succeed. In
> those cases we must fall back to digging around in /proc/PID/maps to map
> virtual address to (binary, offset). The /proc/PID/maps digging does not
> occur synchronously with stack collection, so the process might already
> be gone, in which case it won't have /proc/PID/maps and we will fail to
> symbolicate.
>
> This 'exited process problem' doesn't occur very often as
> most of the prod services we care to profile are long-lived daemons,
> there are enough usecases to warrant a workaround: a BPF program which
> can be optionally loaded at data collection time and essentially walks
> /proc/PID/maps. Currently this is done by walking the vma list:
>
>   struct vm_area_struct* mmap = BPF_CORE_READ(mm, mmap);
>   mmap_next = BPF_CORE_READ(rmap, vm_next); /* in a loop */
>
> Since commit 763ecb035029 ("mm: remove the vma linked list") there's no
> longer a vma linked list to walk. Walking the vma maple tree is not as
> simple as hopping struct vm_area_struct->vm_next. That commit replaces
> vm_next hopping with calls to find_vma(mm, addr) helper function, which
> returns the vma containing addr, or if no vma contains addr,
> the closest vma with higher start addr.
>
> The BPF helper bpf_find_vma is unsurprisingly a thin wrapper around
> find_vma, with the major difference that no 'closest vma' is returned if
> there is no VMA containing a particular address. This prevents BPF
> programs from being able to use bpf_find_vma to iterate all vmas in a
> task in a reasonable way.
>
> This patch adds a BPF_F_VMA_NEXT flag to bpf_find_vma which restores
> 'closest vma' behavior when used. Because this is find_vma's default
> behavior it's as straightforward as nerfing a 'vma contains addr' check
> on find_vma retval.
>
> Also, change bpf_find_vma's address parameter to 'addr' instead of
> 'start'. The former is used in documentation and more accurately
> describes the param.
>
> [
>   RFC: This isn't an ideal solution for iteration of all vmas in a task
>        in the long term for a few reasons:
>
>      * In nmi context, second call to bpf_find_vma will fail because
>        irq_work is busy, so can't iterate all vmas
>      * Repeatedly taking and releasing mmap_read lock when a dedicated
>        iterate_all_vmas(task) kfunc could just take it once and hold for
>        all vmas
>
>     My specific usecase doesn't do vma iteration in nmi context and I
>     think the 'closest vma' behavior can be useful here despite locking
>     inefficiencies.
>
>     When Alexei and I discussed this offline, two alternatives to
>     provide similar functionality while addressing above issues seemed
>     reasonable:
>
>       * open-coded iterator for task vma. Similar to existing
>         task_vma bpf_iter, but no need to create a bpf_link and read
>         bpf_iter fd from userspace.
>       * New kfunc taking callback similar bpf_find_vma, but iterating
>         over all vmas in one go
>
>      I think this patch is useful on its own since it's a fairly minimal
>      change and fixes my usecase. Sending for early feedback and to
>      solicit further thought about whether this should be dropped in
>      favor of one of the above options.

- In theory this patch can work, but patch 2 didn't attempt to actually
use it in a loop to iterate all vmas, which is a bit of a red flag as to
whether such iteration is practical (either via bpf_loop or bpf_for).

- This behavior of bpf_find_vma() feels like too much implementation
detail. find_vma will probably stay this way, since different parts of
the kernel rely on it, but exposing it via BPF_F_VMA_NEXT leaks the
implementation too much.

- Looking at task_vma_seq_get_next()... that's how vma iteration should
be done, and I don't think a bpf prog can do it on its own. Because with
bpf_find_vma() the lock is dropped at every step, the problems described
in that large comment will be hit sooner or later.

All concerns combined, I feel we'd better provide a new kfunc that
iterates vmas and drops the lock before invoking the callback.
It can be much simpler than task_vma_seq_get_next() if we don't drop
the lock. Maybe that's ok.
Doing it open-coded-iterator style is likely better:
a bpf_iter_vma_new() kfunc would do
bpf_mmap_unlock_get_irq_work + mmap_read_trylock,
while bpf_iter_vma_destroy() would call bpf_mmap_unlock_mm.

I'd try the open-coded iterator first. It's a good test for the iter
infra. bpf_iter_testmod_seq_new is an example of how to add a new iter.

Another issue with bpf_find_vma is .arg1_type = ARG_PTR_TO_BTF_ID.
It's not a trusted arg, and we'd better move away from this legacy
pointer. bpf_iter_vma_new() should accept only a trusted ptr to
task_struct. FWIW bpf_get_current_task_btf_proto has
.ret_type = RET_PTR_TO_BTF_ID_TRUSTED, and that matters here.
The bpf prog might look like:

  task = bpf_get_current_task_btf();
  err = bpf_iter_vma_new(&it, task);
  while ((vma = bpf_iter_vma_next(&it))) ...;

assuming the lock is not dropped by _next.
David Marchevsky Aug. 4, 2023, 6:59 a.m. UTC | #2
On 8/1/23 4:41 PM, Alexei Starovoitov wrote:
> On Tue, Aug 1, 2023 at 7:54 AM Dave Marchevsky <davemarchevsky@fb.com> wrote:
>>
>> At Meta we have a profiling daemon which periodically collects
>> information on many hosts. This collection usually involves grabbing
>> stacks (user and kernel) using perf_event BPF progs and later symbolicating
>> them. For user stacks we try to use BPF_F_USER_BUILD_ID and rely on
>> remote symbolication, but BPF_F_USER_BUILD_ID doesn't always succeed. In
>> those cases we must fall back to digging around in /proc/PID/maps to map
>> virtual address to (binary, offset). The /proc/PID/maps digging does not
>> occur synchronously with stack collection, so the process might already
>> be gone, in which case it won't have /proc/PID/maps and we will fail to
>> symbolicate.
>>
>> This 'exited process problem' doesn't occur very often as
>> most of the prod services we care to profile are long-lived daemons,
>> there are enough usecases to warrant a workaround: a BPF program which
>> can be optionally loaded at data collection time and essentially walks
>> /proc/PID/maps. Currently this is done by walking the vma list:
>>
>>   struct vm_area_struct* mmap = BPF_CORE_READ(mm, mmap);
>>   mmap_next = BPF_CORE_READ(rmap, vm_next); /* in a loop */
>>
>> Since commit 763ecb035029 ("mm: remove the vma linked list") there's no
>> longer a vma linked list to walk. Walking the vma maple tree is not as
>> simple as hopping struct vm_area_struct->vm_next. That commit replaces
>> vm_next hopping with calls to find_vma(mm, addr) helper function, which
>> returns the vma containing addr, or if no vma contains addr,
>> the closest vma with higher start addr.
>>
>> The BPF helper bpf_find_vma is unsurprisingly a thin wrapper around
>> find_vma, with the major difference that no 'closest vma' is returned if
>> there is no VMA containing a particular address. This prevents BPF
>> programs from being able to use bpf_find_vma to iterate all vmas in a
>> task in a reasonable way.
>>
>> This patch adds a BPF_F_VMA_NEXT flag to bpf_find_vma which restores
>> 'closest vma' behavior when used. Because this is find_vma's default
>> behavior it's as straightforward as nerfing a 'vma contains addr' check
>> on find_vma retval.
>>
>> Also, change bpf_find_vma's address parameter to 'addr' instead of
>> 'start'. The former is used in documentation and more accurately
>> describes the param.
>>
>> [
>>   RFC: This isn't an ideal solution for iteration of all vmas in a task
>>        in the long term for a few reasons:
>>
>>      * In nmi context, second call to bpf_find_vma will fail because
>>        irq_work is busy, so can't iterate all vmas
>>      * Repeatedly taking and releasing mmap_read lock when a dedicated
>>        iterate_all_vmas(task) kfunc could just take it once and hold for
>>        all vmas
>>
>>     My specific usecase doesn't do vma iteration in nmi context and I
>>     think the 'closest vma' behavior can be useful here despite locking
>>     inefficiencies.
>>
>>     When Alexei and I discussed this offline, two alternatives to
>>     provide similar functionality while addressing above issues seemed
>>     reasonable:
>>
>>       * open-coded iterator for task vma. Similar to existing
>>         task_vma bpf_iter, but no need to create a bpf_link and read
>>         bpf_iter fd from userspace.
>>       * New kfunc taking callback similar bpf_find_vma, but iterating
>>         over all vmas in one go
>>
>>      I think this patch is useful on its own since it's a fairly minimal
>>      change and fixes my usecase. Sending for early feedback and to
>>      solicit further thought about whether this should be dropped in
>>      favor of one of the above options.
> 
> - In theory this patch can work, but patch 2 didn't attempt to actually
> use it in a loop to iterate all vma-s.
> Which is a bit of red flag whether such iteration is practical
> (either via bpf_loop or bpf_for).
> 
> - This behavior of bpf_find_vma() feels too much implementation detail.
> find_vma will probably stay this way, since different parts of the kernel
> rely on it, but exposing it like BPF_F_VMA_NEXT leaks implementation too much.
> 
> - Looking at task_vma_seq_get_next().. that's how vma iter should be done and
> I don't think bpf prog can do it on its own.
> Because with bpf_find_vma() the lock will drop at every step the problems
> described at that large comment will be hit sooner or later.
> 
> All concerns combined I feel we better provide a new kfunc that iterates vma
> and drops the lock before invoking callback.
> It can be much simpler than task_vma_seq_get_next() if we don't drop the lock.
> Maybe it's ok.
> Doing it open coded iterators style is likely better.
> bpf_iter_vma_new() kfunc will do
> bpf_mmap_unlock_get_irq_work+mmap_read_trylock
> while bpf_iter_vma_destroy() will bpf_mmap_unlock_mm.
> 
> I'd try to do open-code-iter first. It's a good test for the iter infra.
> bpf_iter_testmod_seq_new is an example of how to add a new iter.
> 
> Another issue with bpf_find_vma is .arg1_type = ARG_PTR_TO_BTF_ID.
> It's not a trusted arg. We better move away from this legacy pointer.
> bpf_iter_vma_new() should accept only trusted ptr to task_struct.
> fwiw bpf_get_current_task_btf_proto has
> .ret_type = RET_PTR_TO_BTF_ID_TRUSTED and it matters here.
> The bpf prog might look like:
> task = bpf_get_current_task_btf();
> err = bpf_iter_vma_new(&it, task);
> while ((vma = bpf_iter_vma_next(&it))) ...;
> assuming lock is not dropped by _next.

The only concern here that doesn't seem reasonable to me is the
"too much implementation detail" one. I agree with the rest, though,
so I will send a different series with a new implementation and point
back to this discussion.
Yonghong Song Aug. 4, 2023, 3:52 p.m. UTC | #3
On 8/3/23 11:59 PM, David Marchevsky wrote:
> On 8/1/23 4:41 PM, Alexei Starovoitov wrote:
>> On Tue, Aug 1, 2023 at 7:54 AM Dave Marchevsky <davemarchevsky@fb.com> wrote:
>>>
>>> At Meta we have a profiling daemon which periodically collects
>>> information on many hosts. This collection usually involves grabbing
>>> stacks (user and kernel) using perf_event BPF progs and later symbolicating
>>> them. For user stacks we try to use BPF_F_USER_BUILD_ID and rely on
>>> remote symbolication, but BPF_F_USER_BUILD_ID doesn't always succeed. In
>>> those cases we must fall back to digging around in /proc/PID/maps to map
>>> virtual address to (binary, offset). The /proc/PID/maps digging does not
>>> occur synchronously with stack collection, so the process might already
>>> be gone, in which case it won't have /proc/PID/maps and we will fail to
>>> symbolicate.
>>>
>>> This 'exited process problem' doesn't occur very often as
>>> most of the prod services we care to profile are long-lived daemons,
>>> there are enough usecases to warrant a workaround: a BPF program which
>>> can be optionally loaded at data collection time and essentially walks
>>> /proc/PID/maps. Currently this is done by walking the vma list:
>>>
>>>    struct vm_area_struct* mmap = BPF_CORE_READ(mm, mmap);
>>>    mmap_next = BPF_CORE_READ(rmap, vm_next); /* in a loop */
>>>
>>> Since commit 763ecb035029 ("mm: remove the vma linked list") there's no
>>> longer a vma linked list to walk. Walking the vma maple tree is not as
>>> simple as hopping struct vm_area_struct->vm_next. That commit replaces
>>> vm_next hopping with calls to find_vma(mm, addr) helper function, which
>>> returns the vma containing addr, or if no vma contains addr,
>>> the closest vma with higher start addr.
>>>
>>> The BPF helper bpf_find_vma is unsurprisingly a thin wrapper around
>>> find_vma, with the major difference that no 'closest vma' is returned if
>>> there is no VMA containing a particular address. This prevents BPF
>>> programs from being able to use bpf_find_vma to iterate all vmas in a
>>> task in a reasonable way.
>>>
>>> This patch adds a BPF_F_VMA_NEXT flag to bpf_find_vma which restores
>>> 'closest vma' behavior when used. Because this is find_vma's default
>>> behavior it's as straightforward as nerfing a 'vma contains addr' check
>>> on find_vma retval.
>>>
>>> Also, change bpf_find_vma's address parameter to 'addr' instead of
>>> 'start'. The former is used in documentation and more accurately
>>> describes the param.
>>>
>>> [
>>>    RFC: This isn't an ideal solution for iteration of all vmas in a task
>>>         in the long term for a few reasons:
>>>
>>>       * In nmi context, second call to bpf_find_vma will fail because
>>>         irq_work is busy, so can't iterate all vmas
>>>       * Repeatedly taking and releasing mmap_read lock when a dedicated
>>>         iterate_all_vmas(task) kfunc could just take it once and hold for
>>>         all vmas
>>>
>>>      My specific usecase doesn't do vma iteration in nmi context and I
>>>      think the 'closest vma' behavior can be useful here despite locking
>>>      inefficiencies.
>>>
>>>      When Alexei and I discussed this offline, two alternatives to
>>>      provide similar functionality while addressing above issues seemed
>>>      reasonable:
>>>
>>>        * open-coded iterator for task vma. Similar to existing
>>>          task_vma bpf_iter, but no need to create a bpf_link and read
>>>          bpf_iter fd from userspace.
>>>        * New kfunc taking callback similar bpf_find_vma, but iterating
>>>          over all vmas in one go
>>>
>>>       I think this patch is useful on its own since it's a fairly minimal
>>>       change and fixes my usecase. Sending for early feedback and to
>>>       solicit further thought about whether this should be dropped in
>>>       favor of one of the above options.
>>
>> - In theory this patch can work, but patch 2 didn't attempt to actually
>> use it in a loop to iterate all vma-s.
>> Which is a bit of red flag whether such iteration is practical
>> (either via bpf_loop or bpf_for).
>>
>> - This behavior of bpf_find_vma() feels too much implementation detail.
>> find_vma will probably stay this way, since different parts of the kernel
>> rely on it, but exposing it like BPF_F_VMA_NEXT leaks implementation too much.
>>
>> - Looking at task_vma_seq_get_next().. that's how vma iter should be done and
>> I don't think bpf prog can do it on its own.
>> Because with bpf_find_vma() the lock will drop at every step the problems
>> described at that large comment will be hit sooner or later.
>>
>> All concerns combined I feel we better provide a new kfunc that iterates vma
>> and drops the lock before invoking callback.
>> It can be much simpler than task_vma_seq_get_next() if we don't drop the lock.
>> Maybe it's ok.
>> Doing it open coded iterators style is likely better.
>> bpf_iter_vma_new() kfunc will do
>> bpf_mmap_unlock_get_irq_work+mmap_read_trylock
>> while bpf_iter_vma_destroy() will bpf_mmap_unlock_mm.
>>
>> I'd try to do open-code-iter first. It's a good test for the iter infra.
>> bpf_iter_testmod_seq_new is an example of how to add a new iter.
>>
>> Another issue with bpf_find_vma is .arg1_type = ARG_PTR_TO_BTF_ID.
>> It's not a trusted arg. We better move away from this legacy pointer.
>> bpf_iter_vma_new() should accept only trusted ptr to task_struct.
>> fwiw bpf_get_current_task_btf_proto has
>> .ret_type = RET_PTR_TO_BTF_ID_TRUSTED and it matters here.
>> The bpf prog might look like:
>> task = bpf_get_current_task_btf();
>> err = bpf_iter_vma_new(&it, task);
>> while ((vma = bpf_iter_vma_next(&it))) ...;
>> assuming lock is not dropped by _next.
> 
> The only concern here that doesn't seem reasonable to me is the
> "too much implementation detail". I agree with the rest, though,
> so will send a different series with new implementation and point
>   to this discussion.

For reference, here is another use case for traversing vmas in a bpf
program, reported on the bcc mailing list:
   https://github.com/iovisor/bcc/pull/4679

The use case is that the application may not have frame pointers, so
the bpf program will just scan the stack, find potential user text
region pointers, and report them. This is similar to how current arch
(e.g., x86) code reports crash stacks.

Patch

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 70da85200695..947187d76ebc 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5169,8 +5169,13 @@  union bpf_attr {
  *		function with *task*, *vma*, and *callback_ctx*.
  *		The *callback_fn* should be a static function and
  *		the *callback_ctx* should be a pointer to the stack.
- *		The *flags* is used to control certain aspects of the helper.
- *		Currently, the *flags* must be 0.
+ *		The *flags* is used to control certain aspects of the helper and
+ *		may be one of the following:
+ *
+ *		**BPF_F_VMA_NEXT**
+ *			If no vma contains *addr*, call *callback_fn* with the next vma,
+ *			i.e. the vma with lowest vm_start that is higher than *addr*.
+ *			This replicates behavior of kernel's find_vma helper.
  *
  *		The expected callback signature is
  *
@@ -6026,6 +6031,11 @@  enum {
 	BPF_F_EXCLUDE_INGRESS	= (1ULL << 4),
 };
 
+/* Flags for bpf_find_vma helper */
+enum {
+	BPF_F_VMA_NEXT		= (1ULL << 0),
+};
+
 #define __bpf_md_ptr(type, name)	\
 union {					\
 	type name;			\
diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
index c4ab9d6cdbe9..a8c87dcf36ad 100644
--- a/kernel/bpf/task_iter.c
+++ b/kernel/bpf/task_iter.c
@@ -777,7 +777,7 @@  static struct bpf_iter_reg task_vma_reg_info = {
 	.show_fdinfo		= bpf_iter_task_show_fdinfo,
 };
 
-BPF_CALL_5(bpf_find_vma, struct task_struct *, task, u64, start,
+BPF_CALL_5(bpf_find_vma, struct task_struct *, task, u64, addr,
 	   bpf_callback_t, callback_fn, void *, callback_ctx, u64, flags)
 {
 	struct mmap_unlock_irq_work *work = NULL;
@@ -785,10 +785,13 @@  BPF_CALL_5(bpf_find_vma, struct task_struct *, task, u64, start,
 	bool irq_work_busy = false;
 	struct mm_struct *mm;
 	int ret = -ENOENT;
+	bool vma_next;
 
-	if (flags)
+	if (flags & ~BPF_F_VMA_NEXT)
 		return -EINVAL;
 
+	vma_next = flags & BPF_F_VMA_NEXT;
+
 	if (!task)
 		return -ENOENT;
 
@@ -801,9 +804,10 @@  BPF_CALL_5(bpf_find_vma, struct task_struct *, task, u64, start,
 	if (irq_work_busy || !mmap_read_trylock(mm))
 		return -EBUSY;
 
-	vma = find_vma(mm, start);
+	vma = find_vma(mm, addr);
 
-	if (vma && vma->vm_start <= start && vma->vm_end > start) {
+	if (vma &&
+	    ((vma->vm_start <= addr && vma->vm_end > addr) || vma_next)) {
 		callback_fn((u64)(long)task, (u64)(long)vma,
 			    (u64)(long)callback_ctx, 0, 0);
 		ret = 0;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 70da85200695..947187d76ebc 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5169,8 +5169,13 @@  union bpf_attr {
  *		function with *task*, *vma*, and *callback_ctx*.
  *		The *callback_fn* should be a static function and
  *		the *callback_ctx* should be a pointer to the stack.
- *		The *flags* is used to control certain aspects of the helper.
- *		Currently, the *flags* must be 0.
+ *		The *flags* is used to control certain aspects of the helper and
+ *		may be one of the following:
+ *
+ *		**BPF_F_VMA_NEXT**
+ *			If no vma contains *addr*, call *callback_fn* with the next vma,
+ *			i.e. the vma with lowest vm_start that is higher than *addr*.
+ *			This replicates behavior of kernel's find_vma helper.
  *
  *		The expected callback signature is
  *
@@ -6026,6 +6031,11 @@  enum {
 	BPF_F_EXCLUDE_INGRESS	= (1ULL << 4),
 };
 
+/* Flags for bpf_find_vma helper */
+enum {
+	BPF_F_VMA_NEXT		= (1ULL << 0),
+};
+
 #define __bpf_md_ptr(type, name)	\
 union {					\
 	type name;			\