mbox series

[RFC,bpf-next,0/9] allow HID-BPF to do device IOs

Message ID 20240209-hid-bpf-sleepable-v1-0-4cc895b5adbd@kernel.org (mailing list archive)
Headers show
Series allow HID-BPF to do device IOs | expand

Message

Benjamin Tissoires Feb. 9, 2024, 1:26 p.m. UTC
[Putting this as a RFC because I'm pretty sure I'm not doing the things
correctly at the BPF level.]
[Also using bpf-next as the base tree as there will be conflicting
changes otherwise]

Ideally I'd like to have something similar to bpf_timers, but not
in soft IRQ context. So I'm emulating this with a sleepable
bpf_tail_call() (see "HID: bpf: allow to defer work in a delayed
workqueue").

Basically, I need to be able to defer a HID-BPF program for the
following reasons (from the aforementioned patch):
1. defer an event:
   Sometimes we receive an out of proximity event, but the device can not
   be trusted enough, and we need to ensure that we won't receive another
   one in the following n milliseconds. So we need to wait those n
   milliseconds, and eventually re-inject that event in the stack.

2. inject new events in reaction to one given event:
   We might want to transform one given event into several. This is the
   case for macro keys where a single key press is supposed to send
   a sequence of key presses. But this could also be used to patch a
   faulty behavior, if a device forgets to send a release event.

3. communicate with the device in reaction to one event:
   We might want to communicate back to the device after a given event.
   For example a device might send us an event saying that it came back
   from sleeping state and needs to be re-initialized.

Currently we can achieve that by keeping a userspace program around,
raise a bpf event, and let that userspace program inject the events and
commands.
However, we are just keeping that program alive as a daemon for just
scheduling commands. There is no logic in it, so it doesn't really justify
an actual userspace wakeup. So a kernel workqueue seems simpler to handle.

Besides the "this is insane to reimplement the wheel", the other part I'm
not sure is whether we can say that BPF maps of type prog_array and of
type queue/stack can be used in sleepable context.
I don't see any warning when running the test programs, but that's probably
not a guarantee I'm doing the things properly :)

Cheers,
Benjamin

Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
---
Benjamin Tissoires (9):
      bpf: allow more maps in sleepable bpf programs
      HID: bpf/dispatch: regroup kfuncs definitions
      HID: bpf: export hid_hw_output_report as a BPF kfunc
      selftests/hid: Add test for hid_bpf_hw_output_report
      HID: bpf: allow to inject HID event from BPF
      selftests/hid: add tests for hid_bpf_input_report
      HID: bpf: allow to defer work in a delayed workqueue
      selftests/hid: add test for hid_bpf_schedule_delayed_work
      selftests/hid: add another set of delayed work tests

 Documentation/hid/hid-bpf.rst                      |  26 +-
 drivers/hid/bpf/entrypoints/entrypoints.bpf.c      |   8 +
 drivers/hid/bpf/entrypoints/entrypoints.lskel.h    | 236 +++++++++------
 drivers/hid/bpf/hid_bpf_dispatch.c                 | 315 ++++++++++++++++-----
 drivers/hid/bpf/hid_bpf_jmp_table.c                |  34 ++-
 drivers/hid/hid-core.c                             |   2 +
 include/linux/hid_bpf.h                            |  10 +
 kernel/bpf/verifier.c                              |   3 +
 tools/testing/selftests/hid/hid_bpf.c              | 286 ++++++++++++++++++-
 tools/testing/selftests/hid/progs/hid.c            | 205 ++++++++++++++
 .../testing/selftests/hid/progs/hid_bpf_helpers.h  |   8 +
 11 files changed, 962 insertions(+), 171 deletions(-)
---
base-commit: 4f7a05917237b006ceae760507b3d15305769ade
change-id: 20240205-hid-bpf-sleepable-c01260fd91c4

Best regards,

Comments

Toke Høiland-Jørgensen Feb. 9, 2024, 3:42 p.m. UTC | #1
Benjamin Tissoires <bentiss@kernel.org> writes:

> [Putting this as a RFC because I'm pretty sure I'm not doing the things
> correctly at the BPF level.]
> [Also using bpf-next as the base tree as there will be conflicting
> changes otherwise]
>
> Ideally I'd like to have something similar to bpf_timers, but not
> in soft IRQ context. So I'm emulating this with a sleepable
> bpf_tail_call() (see "HID: bpf: allow to defer work in a delayed
> workqueue").

Why implement a new mechanism? Sounds like what you need is essentially
the bpf_timer functionality, just running in a different context, right?
So why not just add a flag to the timer setup that controls the callback
context? I've been toying with something similar for restarting XDP TX
for my queueing patch series (though I'm not sure if this will actually
end up being needed in the end):

https://git.kernel.org/pub/scm/linux/kernel/git/toke/linux.git/commit/?h=xdp-queueing-08&id=54bc201a358d1ac6ebfe900099315bbd0a76e862

-Toke
Benjamin Tissoires Feb. 9, 2024, 4:26 p.m. UTC | #2
On Fri, Feb 9, 2024 at 4:42 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Benjamin Tissoires <bentiss@kernel.org> writes:
>
> > [Putting this as a RFC because I'm pretty sure I'm not doing the things
> > correctly at the BPF level.]
> > [Also using bpf-next as the base tree as there will be conflicting
> > changes otherwise]
> >
> > Ideally I'd like to have something similar to bpf_timers, but not
> > in soft IRQ context. So I'm emulating this with a sleepable
> > bpf_tail_call() (see "HID: bpf: allow to defer work in a delayed
> > workqueue").
>
> Why implement a new mechanism? Sounds like what you need is essentially
> the bpf_timer functionality, just running in a different context, right?

Heh, that's exactly why I put in a RFC :)

So yes, the bpf_timer approach is cleaner, but I need it in a
workqueue, as a hrtimer in a softIRQ would prevent me to kzalloc and
wait for the device.

> So why not just add a flag to the timer setup that controls the callback
> context? I've been toying with something similar for restarting XDP TX
> for my queueing patch series (though I'm not sure if this will actually
> end up being needed in the end):
>
> https://git.kernel.org/pub/scm/linux/kernel/git/toke/linux.git/commit/?h=xdp-queueing-08&id=54bc201a358d1ac6ebfe900099315bbd0a76e862
>

Oh, nice. Good idea. But would it be OK to have a "timer-like" where
it actually defers the job in a workqueue instead of using an hrtimer?

I thought I would have to rewrite the entire bpf_timer approach
without the softIRQ, but if I can just add a new flag, that will make
things way simpler for me.

This however raises another issue if I were to use the bpf_timers: now
the HID-BPF kfuncs will not be available as they are only available to
tracing prog types. And when I tried to call them from a bpf_timer (in
softIRQ) they were not available.

Cheers,
Benjamin
Toke Høiland-Jørgensen Feb. 9, 2024, 5:05 p.m. UTC | #3
Benjamin Tissoires <benjamin.tissoires@redhat.com> writes:

> On Fri, Feb 9, 2024 at 4:42 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Benjamin Tissoires <bentiss@kernel.org> writes:
>>
>> > [Putting this as a RFC because I'm pretty sure I'm not doing the things
>> > correctly at the BPF level.]
>> > [Also using bpf-next as the base tree as there will be conflicting
>> > changes otherwise]
>> >
>> > Ideally I'd like to have something similar to bpf_timers, but not
>> > in soft IRQ context. So I'm emulating this with a sleepable
>> > bpf_tail_call() (see "HID: bpf: allow to defer work in a delayed
>> > workqueue").
>>
>> Why implement a new mechanism? Sounds like what you need is essentially
>> the bpf_timer functionality, just running in a different context, right?
>
> Heh, that's exactly why I put in a RFC :)
>
> So yes, the bpf_timer approach is cleaner, but I need it in a
> workqueue, as a hrtimer in a softIRQ would prevent me to kzalloc and
> wait for the device.

Right, makes sense.

>> So why not just add a flag to the timer setup that controls the callback
>> context? I've been toying with something similar for restarting XDP TX
>> for my queueing patch series (though I'm not sure if this will actually
>> end up being needed in the end):
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/toke/linux.git/commit/?h=xdp-queueing-08&id=54bc201a358d1ac6ebfe900099315bbd0a76e862
>>
>
> Oh, nice. Good idea. But would it be OK to have a "timer-like" where
> it actually defers the job in a workqueue instead of using an hrtimer?

That's conceptually still a timer, though, isn't it? I.e., it's a
mechanism whereby you specify a callback and a delay, and bpf_timer
ensures that your callback is called after that delay. IMO it's totally
congruent with that API to be able to specify a different execution
context as part of the timer setup.

As for how to implement it, I suspect the easiest may be something
similar to what the patch I linked above does: keep the hrtimer, and
just have a different (kernel) callback function when the timer fires
which does an immediate schedule_work() (without the _delayed) and then
runs the BPF callback in that workqueue. I.e., keep the delay handling
the way the existing bpf_timer implementation does it, and just add an
indirection to start the workqueue in the kernel dispatch code.

> I thought I would have to rewrite the entire bpf_timer approach
> without the softIRQ, but if I can just add a new flag, that will make
> things way simpler for me.

IMO that would be fine. You may want to wait for the maintainers to
chime in before going down this route, though :)

> This however raises another issue if I were to use the bpf_timers: now
> the HID-BPF kfuncs will not be available as they are only available to
> tracing prog types. And when I tried to call them from a bpf_timer (in
> softIRQ) they were not available.

IIUC, the bpf_timer callback is just a function (subprog) from the
verifier PoV, so it is verified as whatever program type is creating the
timer. So in other words, as long as you setup the timer from inside a
tracing prog type, you should have access to all the same kfuncs, I
think?

-Toke
Benjamin Tissoires Feb. 12, 2024, 4:47 p.m. UTC | #4
On Fri, Feb 9, 2024 at 6:05 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Benjamin Tissoires <benjamin.tissoires@redhat.com> writes:
>
> > On Fri, Feb 9, 2024 at 4:42 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Benjamin Tissoires <bentiss@kernel.org> writes:
> >>
> >> > [Putting this as a RFC because I'm pretty sure I'm not doing the things
> >> > correctly at the BPF level.]
> >> > [Also using bpf-next as the base tree as there will be conflicting
> >> > changes otherwise]
> >> >
> >> > Ideally I'd like to have something similar to bpf_timers, but not
> >> > in soft IRQ context. So I'm emulating this with a sleepable
> >> > bpf_tail_call() (see "HID: bpf: allow to defer work in a delayed
> >> > workqueue").
> >>
> >> Why implement a new mechanism? Sounds like what you need is essentially
> >> the bpf_timer functionality, just running in a different context, right?
> >
> > Heh, that's exactly why I put in a RFC :)
> >
> > So yes, the bpf_timer approach is cleaner, but I need it in a
> > workqueue, as a hrtimer in a softIRQ would prevent me to kzalloc and
> > wait for the device.
>
> Right, makes sense.
>
> >> So why not just add a flag to the timer setup that controls the callback
> >> context? I've been toying with something similar for restarting XDP TX
> >> for my queueing patch series (though I'm not sure if this will actually
> >> end up being needed in the end):
> >>
> >> https://git.kernel.org/pub/scm/linux/kernel/git/toke/linux.git/commit/?h=xdp-queueing-08&id=54bc201a358d1ac6ebfe900099315bbd0a76e862
> >>
> >
> > Oh, nice. Good idea. But would it be OK to have a "timer-like" where
> > it actually defers the job in a workqueue instead of using an hrtimer?
>
> That's conceptually still a timer, though, isn't it? I.e., it's a
> mechanism whereby you specify a callback and a delay, and bpf_timer
> ensures that your callback is called after that delay. IMO it's totally
> congruent with that API to be able to specify a different execution
> context as part of the timer setup.

Yep :)

There is still a problem I wasn't able to fix over the week end and
today. How can I tell the verifier that the callback is sleepable,
when the tracing function that called the timer_start() function is
not?
(more on that below).

>
> As for how to implement it, I suspect the easiest may be something
> similar to what the patch I linked above does: keep the hrtimer, and
> just have a different (kernel) callback function when the timer fires
> which does an immediate schedule_work() (without the _delayed) and then
> runs the BPF callback in that workqueue. I.e., keep the delay handling
> the way the existing bpf_timer implementation does it, and just add an
> indirection to start the workqueue in the kernel dispatch code.

Sounds good, especially given that's roughly how the delayed_timers
are implemented.

>
> > I thought I would have to rewrite the entire bpf_timer approach
> > without the softIRQ, but if I can just add a new flag, that will make
> > things way simpler for me.
>
> IMO that would be fine. You may want to wait for the maintainers to
> chime in before going down this route, though :)
>
> > This however raises another issue if I were to use the bpf_timers: now
> > the HID-BPF kfuncs will not be available as they are only available to
> > tracing prog types. And when I tried to call them from a bpf_timer (in
> > softIRQ) they were not available.
>
> IIUC, the bpf_timer callback is just a function (subprog) from the
> verifier PoV, so it is verified as whatever program type is creating the
> timer. So in other words, as long as you setup the timer from inside a
> tracing prog type, you should have access to all the same kfuncs, I
> think?

Yep, you are correct. But as mentioned above, I am now in trouble with
the sleepable state:
- I need to call timer_start() from a non sleepable tracing function
(I'm in hard IRQ when dealing with a physical device)
- but then, ideally, the callback function needs to be tagged as a
sleepable one, so I can export my kfuncs which are doing kzalloc and
device IO as such.

However, I can not really teach the BPF verifier to do so:
- it seems to check for the callback first when it is loaded, and
there is no SEC() equivalent for static functions
- libbpf doesn't have access to the callback as a prog as it has to be
a static function, and thus isn't exported as a full-blown prog.
- the verifier only checks for the callback when dealing with
BPF_FUNC_timer_set_callback, which doesn't have a "flag" argument
(though the validation of the callback has already been done while
checking it first, so we are already too late to change the sleppable
state of the callback)

Right now, the only OK-ish version I have is declaring the kfunc as
non-sleepable, but checking that we are in a different context than
the IRQ of the initial event. This way, I am not crashing if this
function is called from the initial IRQ, but will still crash if used
outside of the hid context.

This is not satisfactory, but I feel like it's going to be hard to
teach the verifier that the callback function is sleepable in that
case (maybe we could suffix the callback name, like we do for
arguments, but this is not very clean either).

Cheers,
Benjamin
Toke Høiland-Jørgensen Feb. 12, 2024, 5:46 p.m. UTC | #5
Benjamin Tissoires <benjamin.tissoires@redhat.com> writes:

[...]
>> IIUC, the bpf_timer callback is just a function (subprog) from the
>> verifier PoV, so it is verified as whatever program type is creating the
>> timer. So in other words, as long as you setup the timer from inside a
>> tracing prog type, you should have access to all the same kfuncs, I
>> think?
>
> Yep, you are correct. But as mentioned above, I am now in trouble with
> the sleepable state:
> - I need to call timer_start() from a non sleepable tracing function
> (I'm in hard IRQ when dealing with a physical device)
> - but then, ideally, the callback function needs to be tagged as a
> sleepable one, so I can export my kfuncs which are doing kzalloc and
> device IO as such.
>
> However, I can not really teach the BPF verifier to do so:
> - it seems to check for the callback first when it is loaded, and
> there is no SEC() equivalent for static functions
> - libbpf doesn't have access to the callback as a prog as it has to be
> a static function, and thus isn't exported as a full-blown prog.
> - the verifier only checks for the callback when dealing with
> BPF_FUNC_timer_set_callback, which doesn't have a "flag" argument
> (though the validation of the callback has already been done while
> checking it first, so we are already too late to change the sleppable
> state of the callback)
>
> Right now, the only OK-ish version I have is declaring the kfunc as
> non-sleepable, but checking that we are in a different context than
> the IRQ of the initial event. This way, I am not crashing if this
> function is called from the initial IRQ, but will still crash if used
> outside of the hid context.
>
> This is not satisfactory, but I feel like it's going to be hard to
> teach the verifier that the callback function is sleepable in that
> case (maybe we could suffix the callback name, like we do for
> arguments, but this is not very clean either).

The callback is only set once when the timer is first setup; I *think*
it works to do the setup (bpf_timer_init() and bpf_timer_set_callback())
in the context you need (from a sleepable prog), but do the arming
(bpf_timer_start()) from a different program that is not itself sleepable?

-Toke
Benjamin Tissoires Feb. 12, 2024, 6:20 p.m. UTC | #6
On Mon, Feb 12, 2024 at 6:46 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Benjamin Tissoires <benjamin.tissoires@redhat.com> writes:
>
> [...]
> >> IIUC, the bpf_timer callback is just a function (subprog) from the
> >> verifier PoV, so it is verified as whatever program type is creating the
> >> timer. So in other words, as long as you setup the timer from inside a
> >> tracing prog type, you should have access to all the same kfuncs, I
> >> think?
> >
> > Yep, you are correct. But as mentioned above, I am now in trouble with
> > the sleepable state:
> > - I need to call timer_start() from a non sleepable tracing function
> > (I'm in hard IRQ when dealing with a physical device)
> > - but then, ideally, the callback function needs to be tagged as a
> > sleepable one, so I can export my kfuncs which are doing kzalloc and
> > device IO as such.
> >
> > However, I can not really teach the BPF verifier to do so:
> > - it seems to check for the callback first when it is loaded, and
> > there is no SEC() equivalent for static functions
> > - libbpf doesn't have access to the callback as a prog as it has to be
> > a static function, and thus isn't exported as a full-blown prog.
> > - the verifier only checks for the callback when dealing with
> > BPF_FUNC_timer_set_callback, which doesn't have a "flag" argument
> > (though the validation of the callback has already been done while
> > checking it first, so we are already too late to change the sleppable
> > state of the callback)
> >
> > Right now, the only OK-ish version I have is declaring the kfunc as
> > non-sleepable, but checking that we are in a different context than
> > the IRQ of the initial event. This way, I am not crashing if this
> > function is called from the initial IRQ, but will still crash if used
> > outside of the hid context.
> >
> > This is not satisfactory, but I feel like it's going to be hard to
> > teach the verifier that the callback function is sleepable in that
> > case (maybe we could suffix the callback name, like we do for
> > arguments, but this is not very clean either).
>
> The callback is only set once when the timer is first setup; I *think*
> it works to do the setup (bpf_timer_init() and bpf_timer_set_callback())
> in the context you need (from a sleepable prog), but do the arming
> (bpf_timer_start()) from a different program that is not itself sleepable?
>

Genius! It works, and I can just keep having them declared as a
syscall kfunc, not as a tracing kfunc.

But isn't this an issue outside of my use case? I mean, if the
callback is assuming the environment for when it is set up but can be
called from any context there seems to be a problem when 2 contexts
are not equivalent, no?
Alexei Starovoitov Feb. 12, 2024, 9:24 p.m. UTC | #7
On Mon, Feb 12, 2024 at 10:21 AM Benjamin Tissoires
<benjamin.tissoires@redhat.com> wrote:
>
> On Mon, Feb 12, 2024 at 6:46 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >
> > Benjamin Tissoires <benjamin.tissoires@redhat.com> writes:
> >
> > [...]
> > >> IIUC, the bpf_timer callback is just a function (subprog) from the
> > >> verifier PoV, so it is verified as whatever program type is creating the
> > >> timer. So in other words, as long as you setup the timer from inside a
> > >> tracing prog type, you should have access to all the same kfuncs, I
> > >> think?
> > >
> > > Yep, you are correct. But as mentioned above, I am now in trouble with
> > > the sleepable state:
> > > - I need to call timer_start() from a non sleepable tracing function
> > > (I'm in hard IRQ when dealing with a physical device)
> > > - but then, ideally, the callback function needs to be tagged as a
> > > sleepable one, so I can export my kfuncs which are doing kzalloc and
> > > device IO as such.
> > >
> > > However, I can not really teach the BPF verifier to do so:
> > > - it seems to check for the callback first when it is loaded, and
> > > there is no SEC() equivalent for static functions
> > > - libbpf doesn't have access to the callback as a prog as it has to be
> > > a static function, and thus isn't exported as a full-blown prog.
> > > - the verifier only checks for the callback when dealing with
> > > BPF_FUNC_timer_set_callback, which doesn't have a "flag" argument
> > > (though the validation of the callback has already been done while
> > > checking it first, so we are already too late to change the sleppable
> > > state of the callback)
> > >
> > > Right now, the only OK-ish version I have is declaring the kfunc as
> > > non-sleepable, but checking that we are in a different context than
> > > the IRQ of the initial event. This way, I am not crashing if this
> > > function is called from the initial IRQ, but will still crash if used
> > > outside of the hid context.
> > >
> > > This is not satisfactory, but I feel like it's going to be hard to
> > > teach the verifier that the callback function is sleepable in that
> > > case (maybe we could suffix the callback name, like we do for
> > > arguments, but this is not very clean either).
> >
> > The callback is only set once when the timer is first setup; I *think*
> > it works to do the setup (bpf_timer_init() and bpf_timer_set_callback())
> > in the context you need (from a sleepable prog), but do the arming
> > (bpf_timer_start()) from a different program that is not itself sleepable?
> >
>
> Genius! It works, and I can just keep having them declared as a
> syscall kfunc, not as a tracing kfunc.
>
> But isn't this an issue outside of my use case? I mean, if the
> callback is assuming the environment for when it is set up but can be
> called from any context there seems to be a problem when 2 contexts
> are not equivalent, no?

I agree that workqueue delegation fits into the bpf_timer concept and
a lot of code can and should be shared.
All the lessons(bugs) learned with bpf_timer don't need to be re-discovered :)
Too bad, bpf_timer_set_callback() doesn't have a flag argument,
so we need a new kfunc to set a sleepable callback.
Maybe
bpf_timer_set_sleepable_cb() ?
The verifier will set is_async_cb = true for it (like it does for regular cb-s).
And since prog->aux->sleepable is kinda "global" we need another
per subprog flag:
bool is_sleepable: 1;

We can factor out a check "if (prog->aux->sleepable)" into a helper
that will check that "global" flag and another env->cur_state->in_sleepable
flag that will work similar to active_rcu_lock.
Once the verifier starts processing subprog->is_sleepable
it will set cur_state->in_sleepable = true;
to make all subprogs called from that cb to be recognized as sleepable too.

A bit of a challenge is what to do with global subprogs,
since they're verified lazily. They can be called from
sleepable and non-sleepable contex. Should be solvable.

Overall I think this feature is needed urgently,
so if you don't have cycles to work on this soon,
I can prioritize it right after bpf_arena work.
Benjamin Tissoires Feb. 13, 2024, 5:46 p.m. UTC | #8
On Feb 12 2024, Alexei Starovoitov wrote:
> On Mon, Feb 12, 2024 at 10:21 AM Benjamin Tissoires
> <benjamin.tissoires@redhat.com> wrote:
> >
> > On Mon, Feb 12, 2024 at 6:46 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > >
> > > Benjamin Tissoires <benjamin.tissoires@redhat.com> writes:
> > >
[...]
> I agree that workqueue delegation fits into the bpf_timer concept and
> a lot of code can and should be shared.

Thanks Alexei for the detailed answer. I've given it an attempt but still can not
figure it out entirely.

> All the lessons(bugs) learned with bpf_timer don't need to be re-discovered :)
> Too bad, bpf_timer_set_callback() doesn't have a flag argument,
> so we need a new kfunc to set a sleepable callback.
> Maybe
> bpf_timer_set_sleepable_cb() ?

OK. So I guess I should drop Toke's suggestion with the bpf_timer_ini() flag?

> The verifier will set is_async_cb = true for it (like it does for regular cb-s).
> And since prog->aux->sleepable is kinda "global" we need another
> per subprog flag:
> bool is_sleepable: 1;

done (in push_callback_call())

> 
> We can factor out a check "if (prog->aux->sleepable)" into a helper
> that will check that "global" flag and another env->cur_state->in_sleepable
> flag that will work similar to active_rcu_lock.

done (I think), cf patch 2 below

> Once the verifier starts processing subprog->is_sleepable
> it will set cur_state->in_sleepable = true;
> to make all subprogs called from that cb to be recognized as sleepable too.

That's the point I don't know where to put the new code.

It seems the best place would be in do_check(), but I am under the impression
that the code of the callback is added at the end of the instruction list, meaning
that I do not know where it starts, and which subprog index it corresponds to.

> 
> A bit of a challenge is what to do with global subprogs,
> since they're verified lazily. They can be called from
> sleepable and non-sleepable contex. Should be solvable.

I must confess this is way over me (and given that I didn't even managed to make
the "easy" case working, that might explain things a little :-P )

> 
> Overall I think this feature is needed urgently,
> so if you don't have cycles to work on this soon,
> I can prioritize it right after bpf_arena work.

I can try to spare a few cycles on it. Even if your instructions were on
spot, I still can't make the subprogs recognized as sleepable.

For reference, this is where I am (probably bogus, but seems to be
working when timer_set_sleepable_cb() is called from a sleepable context
as mentioned by Toke):

---
>From d4aa3d969fa9a89c6447d843dad338fde2ac0155 Mon Sep 17 00:00:00 2001
From: Benjamin Tissoires <bentiss@kernel.org>
Date: Tue, 13 Feb 2024 18:40:01 +0100
Subject: [PATCH RFC bpf-next v2 01/11] Sleepable timers
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Message-Id: <20240213-hid-bpf-sleepable-v2-1-6c2d6b49865c@kernel.org>

---
 include/linux/bpf_verifier.h   |  2 +
 include/uapi/linux/bpf.h       | 13 ++++++
 kernel/bpf/helpers.c           | 91 +++++++++++++++++++++++++++++++++++++++---
 kernel/bpf/verifier.c          | 20 ++++++++--
 tools/include/uapi/linux/bpf.h | 13 ++++++
 5 files changed, 130 insertions(+), 9 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 84365e6dd85d..789ef5fec547 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -426,6 +426,7 @@ struct bpf_verifier_state {
 	 * while they are still in use.
 	 */
 	bool used_as_loop_entry;
+	bool in_sleepable;
 
 	/* first and last insn idx of this verifier state */
 	u32 first_insn_idx;
@@ -626,6 +627,7 @@ struct bpf_subprog_info {
 	bool is_async_cb: 1;
 	bool is_exception_cb: 1;
 	bool args_cached: 1;
+	bool is_sleepable: 1;
 
 	u8 arg_cnt;
 	struct bpf_subprog_arg_info args[MAX_BPF_FUNC_REG_ARGS];
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d96708380e52..ef1f2be4cfef 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5742,6 +5742,18 @@ union bpf_attr {
  *		0 on success.
  *
  *		**-ENOENT** if the bpf_local_storage cannot be found.
+ *
+ * long bpf_timer_set_sleepable_cb(struct bpf_timer *timer, void *callback_fn)
+ *	Description
+ *		Configure the timer to call *callback_fn* static function in a
+ *		sleepable context.
+ *	Return
+ *		0 on success.
+ *		**-EINVAL** if *timer* was not initialized with bpf_timer_init() earlier.
+ *		**-EPERM** if *timer* is in a map that doesn't have any user references.
+ *		The user space should either hold a file descriptor to a map with timers
+ *		or pin such map in bpffs. When map is unpinned or file descriptor is
+ *		closed all timers in the map will be cancelled and freed.
  */
 #define ___BPF_FUNC_MAPPER(FN, ctx...)			\
 	FN(unspec, 0, ##ctx)				\
@@ -5956,6 +5968,7 @@ union bpf_attr {
 	FN(user_ringbuf_drain, 209, ##ctx)		\
 	FN(cgrp_storage_get, 210, ##ctx)		\
 	FN(cgrp_storage_delete, 211, ##ctx)		\
+	FN(timer_set_sleepable_cb, 212, ##ctx)		\
 	/* */
 
 /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 4db1c658254c..e3b83d27b1b6 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1097,9 +1097,11 @@ const struct bpf_func_proto bpf_snprintf_proto = {
  */
 struct bpf_hrtimer {
 	struct hrtimer timer;
+	struct work_struct work;
 	struct bpf_map *map;
 	struct bpf_prog *prog;
 	void __rcu *callback_fn;
+	void __rcu *sleepable_cb_fn;
 	void *value;
 };
 
@@ -1113,18 +1115,64 @@ struct bpf_timer_kern {
 	struct bpf_spin_lock lock;
 } __attribute__((aligned(8)));
 
+static void bpf_timer_work_cb(struct work_struct *work)
+{
+	struct bpf_hrtimer *t = container_of(work, struct bpf_hrtimer, work);
+	struct bpf_map *map = t->map;
+	void *value = t->value;
+	bpf_callback_t callback_fn;
+	void *key;
+	u32 idx;
+
+	BTF_TYPE_EMIT(struct bpf_timer);
+
+	rcu_read_lock();
+	callback_fn = rcu_dereference(t->sleepable_cb_fn);
+	rcu_read_unlock();
+	if (!callback_fn)
+		return;
+
+	// /* bpf_timer_work_cb() runs in hrtimer_run_softirq. It doesn't migrate and
+	//  * cannot be preempted by another bpf_timer_cb() on the same cpu.
+	//  * Remember the timer this callback is servicing to prevent
+	//  * deadlock if callback_fn() calls bpf_timer_cancel() or
+	//  * bpf_map_delete_elem() on the same timer.
+	//  */
+	// this_cpu_write(hrtimer_running, t);
+	if (map->map_type == BPF_MAP_TYPE_ARRAY) {
+		struct bpf_array *array = container_of(map, struct bpf_array, map);
+
+		/* compute the key */
+		idx = ((char *)value - array->value) / array->elem_size;
+		key = &idx;
+	} else { /* hash or lru */
+		key = value - round_up(map->key_size, 8);
+	}
+
+	callback_fn((u64)(long)map, (u64)(long)key, (u64)(long)value, 0, 0);
+	/* The verifier checked that return value is zero. */
+
+	// this_cpu_write(hrtimer_running, NULL);
+}
+
 static DEFINE_PER_CPU(struct bpf_hrtimer *, hrtimer_running);
 
 static enum hrtimer_restart bpf_timer_cb(struct hrtimer *hrtimer)
 {
 	struct bpf_hrtimer *t = container_of(hrtimer, struct bpf_hrtimer, timer);
+	bpf_callback_t callback_fn, sleepable_cb_fn;
 	struct bpf_map *map = t->map;
 	void *value = t->value;
-	bpf_callback_t callback_fn;
 	void *key;
 	u32 idx;
 
 	BTF_TYPE_EMIT(struct bpf_timer);
+	sleepable_cb_fn = rcu_dereference_check(t->sleepable_cb_fn, rcu_read_lock_bh_held());
+	if (sleepable_cb_fn) {
+		schedule_work(&t->work);
+		goto out;
+	}
+
 	callback_fn = rcu_dereference_check(t->callback_fn, rcu_read_lock_bh_held());
 	if (!callback_fn)
 		goto out;
@@ -1190,7 +1238,9 @@ BPF_CALL_3(bpf_timer_init, struct bpf_timer_kern *, timer, struct bpf_map *, map
 	t->map = map;
 	t->prog = NULL;
 	rcu_assign_pointer(t->callback_fn, NULL);
+	rcu_assign_pointer(t->sleepable_cb_fn, NULL);
 	hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
+	INIT_WORK(&t->work, bpf_timer_work_cb);
 	t->timer.function = bpf_timer_cb;
 	WRITE_ONCE(timer->timer, t);
 	/* Guarantee the order between timer->timer and map->usercnt. So
@@ -1221,8 +1271,8 @@ static const struct bpf_func_proto bpf_timer_init_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
-BPF_CALL_3(bpf_timer_set_callback, struct bpf_timer_kern *, timer, void *, callback_fn,
-	   struct bpf_prog_aux *, aux)
+static int __bpf_timer_set_callback(struct bpf_timer_kern *timer, void *callback_fn,
+				    struct bpf_prog_aux *aux, bool is_sleepable)
 {
 	struct bpf_prog *prev, *prog = aux->prog;
 	struct bpf_hrtimer *t;
@@ -1260,12 +1310,24 @@ BPF_CALL_3(bpf_timer_set_callback, struct bpf_timer_kern *, timer, void *, callb
 			bpf_prog_put(prev);
 		t->prog = prog;
 	}
-	rcu_assign_pointer(t->callback_fn, callback_fn);
+	if (is_sleepable) {
+		rcu_assign_pointer(t->sleepable_cb_fn, callback_fn);
+		rcu_assign_pointer(t->callback_fn, NULL);
+	} else {
+		rcu_assign_pointer(t->callback_fn, callback_fn);
+		rcu_assign_pointer(t->sleepable_cb_fn, NULL);
+	}
 out:
 	__bpf_spin_unlock_irqrestore(&timer->lock);
 	return ret;
 }
 
+BPF_CALL_3(bpf_timer_set_callback, struct bpf_timer_kern *, timer, void *, callback_fn,
+	   struct bpf_prog_aux *, aux)
+{
+	return __bpf_timer_set_callback(timer, callback_fn, aux, false);
+}
+
 static const struct bpf_func_proto bpf_timer_set_callback_proto = {
 	.func		= bpf_timer_set_callback,
 	.gpl_only	= true,
@@ -1274,6 +1336,20 @@ static const struct bpf_func_proto bpf_timer_set_callback_proto = {
 	.arg2_type	= ARG_PTR_TO_FUNC,
 };
 
+BPF_CALL_3(bpf_timer_set_sleepable_cb, struct bpf_timer_kern *, timer, void *, callback_fn,
+	   struct bpf_prog_aux *, aux)
+{
+	return __bpf_timer_set_callback(timer, callback_fn, aux, true);
+}
+
+static const struct bpf_func_proto bpf_timer_set_sleepable_cb_proto = {
+	.func		= bpf_timer_set_sleepable_cb,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_TIMER,
+	.arg2_type	= ARG_PTR_TO_FUNC,
+};
+
 BPF_CALL_3(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs, u64, flags)
 {
 	struct bpf_hrtimer *t;
@@ -1353,6 +1429,7 @@ BPF_CALL_1(bpf_timer_cancel, struct bpf_timer_kern *, timer)
 	 * if it was running.
 	 */
 	ret = ret ?: hrtimer_cancel(&t->timer);
+	ret = ret ?: cancel_work_sync(&t->work);
 	return ret;
 }
 
@@ -1405,8 +1482,10 @@ void bpf_timer_cancel_and_free(void *val)
 	 * effectively cancelled because bpf_timer_cb() will return
 	 * HRTIMER_NORESTART.
 	 */
-	if (this_cpu_read(hrtimer_running) != t)
+	if (this_cpu_read(hrtimer_running) != t) {
 		hrtimer_cancel(&t->timer);
+	}
+	cancel_work_sync(&t->work);
 	kfree(t);
 }
 
@@ -1749,6 +1828,8 @@ bpf_base_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_timer_init_proto;
 	case BPF_FUNC_timer_set_callback:
 		return &bpf_timer_set_callback_proto;
+	case BPF_FUNC_timer_set_sleepable_cb:
+		return &bpf_timer_set_sleepable_cb_proto;
 	case BPF_FUNC_timer_start:
 		return &bpf_timer_start_proto;
 	case BPF_FUNC_timer_cancel:
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 64fa188d00ad..400c625efe22 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -513,7 +513,8 @@ static bool is_sync_callback_calling_function(enum bpf_func_id func_id)
 
 static bool is_async_callback_calling_function(enum bpf_func_id func_id)
 {
-	return func_id == BPF_FUNC_timer_set_callback;
+	return func_id == BPF_FUNC_timer_set_callback ||
+	       func_id == BPF_FUNC_timer_set_sleepable_cb;
 }
 
 static bool is_callback_calling_function(enum bpf_func_id func_id)
@@ -1414,6 +1415,7 @@ static int copy_verifier_state(struct bpf_verifier_state *dst_state,
 	}
 	dst_state->speculative = src->speculative;
 	dst_state->active_rcu_lock = src->active_rcu_lock;
+	dst_state->in_sleepable = src->in_sleepable;
 	dst_state->curframe = src->curframe;
 	dst_state->active_lock.ptr = src->active_lock.ptr;
 	dst_state->active_lock.id = src->active_lock.id;
@@ -9434,11 +9436,13 @@ static int push_callback_call(struct bpf_verifier_env *env, struct bpf_insn *ins
 
 	if (insn->code == (BPF_JMP | BPF_CALL) &&
 	    insn->src_reg == 0 &&
-	    insn->imm == BPF_FUNC_timer_set_callback) {
+	    (insn->imm == BPF_FUNC_timer_set_callback ||
+	     insn->imm == BPF_FUNC_timer_set_sleepable_cb)) {
 		struct bpf_verifier_state *async_cb;
 
 		/* there is no real recursion here. timer callbacks are async */
 		env->subprog_info[subprog].is_async_cb = true;
+		env->subprog_info[subprog].is_sleepable = insn->imm == BPF_FUNC_timer_set_sleepable_cb;
 		async_cb = push_async_cb(env, env->subprog_info[subprog].start,
 					 insn_idx, subprog);
 		if (!async_cb)
@@ -10280,6 +10284,8 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 					 set_map_elem_callback_state);
 		break;
 	case BPF_FUNC_timer_set_callback:
+		fallthrough;
+	case BPF_FUNC_timer_set_sleepable_cb:
 		err = push_callback_call(env, insn, insn_idx, meta.subprogno,
 					 set_timer_callback_state);
 		break;
@@ -15586,7 +15592,9 @@ static int visit_insn(int t, struct bpf_verifier_env *env)
 		return DONE_EXPLORING;
 
 	case BPF_CALL:
-		if (insn->src_reg == 0 && insn->imm == BPF_FUNC_timer_set_callback)
+		if (insn->src_reg == 0 &&
+		    (insn->imm == BPF_FUNC_timer_set_callback ||
+		     insn->imm == BPF_FUNC_timer_set_sleepable_cb))
 			/* Mark this call insn as a prune point to trigger
 			 * is_state_visited() check before call itself is
 			 * processed by __check_func_call(). Otherwise new
@@ -16762,6 +16770,9 @@ static bool states_equal(struct bpf_verifier_env *env,
 	if (old->active_rcu_lock != cur->active_rcu_lock)
 		return false;
 
+	if (old->in_sleepable != cur->in_sleepable)
+		return false;
+
 	/* for states to be equal callsites have to be the same
 	 * and all frame states need to be equivalent
 	 */
@@ -19639,7 +19650,8 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
 			continue;
 		}
 
-		if (insn->imm == BPF_FUNC_timer_set_callback) {
+		if (insn->imm == BPF_FUNC_timer_set_callback ||
+		    insn->imm == BPF_FUNC_timer_set_sleepable_cb) {
 			/* The verifier will process callback_fn as many times as necessary
 			 * with different maps and the register states prepared by
 			 * set_timer_callback_state will be accurate.
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index d96708380e52..ef1f2be4cfef 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5742,6 +5742,18 @@ union bpf_attr {
  *		0 on success.
  *
  *		**-ENOENT** if the bpf_local_storage cannot be found.
+ *
+ * long bpf_timer_set_sleepable_cb(struct bpf_timer *timer, void *callback_fn)
+ *	Description
+ *		Configure the timer to call *callback_fn* static function in a
+ *		sleepable context.
+ *	Return
+ *		0 on success.
+ *		**-EINVAL** if *timer* was not initialized with bpf_timer_init() earlier.
+ *		**-EPERM** if *timer* is in a map that doesn't have any user references.
+ *		The user space should either hold a file descriptor to a map with timers
+ *		or pin such map in bpffs. When map is unpinned or file descriptor is
+ *		closed all timers in the map will be cancelled and freed.
  */
 #define ___BPF_FUNC_MAPPER(FN, ctx...)			\
 	FN(unspec, 0, ##ctx)				\
@@ -5956,6 +5968,7 @@ union bpf_attr {
 	FN(user_ringbuf_drain, 209, ##ctx)		\
 	FN(cgrp_storage_get, 210, ##ctx)		\
 	FN(cgrp_storage_delete, 211, ##ctx)		\
+	FN(timer_set_sleepable_cb, 212, ##ctx)		\
 	/* */
 
 /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
Kumar Kartikeya Dwivedi Feb. 13, 2024, 7:23 p.m. UTC | #9
On Tue, 13 Feb 2024 at 18:46, Benjamin Tissoires <bentiss@kernel.org> wrote:
>
> On Feb 12 2024, Alexei Starovoitov wrote:
> > On Mon, Feb 12, 2024 at 10:21 AM Benjamin Tissoires
> > <benjamin.tissoires@redhat.com> wrote:
> > >
> > > On Mon, Feb 12, 2024 at 6:46 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > > >
> > > > Benjamin Tissoires <benjamin.tissoires@redhat.com> writes:
> > > >
> [...]
> > I agree that workqueue delegation fits into the bpf_timer concept and
> > a lot of code can and should be shared.
>
> Thanks Alexei for the detailed answer. I've given it an attempt but still can not
> figure it out entirely.
>
> > All the lessons(bugs) learned with bpf_timer don't need to be re-discovered :)
> > Too bad, bpf_timer_set_callback() doesn't have a flag argument,
> > so we need a new kfunc to set a sleepable callback.
> > Maybe
> > bpf_timer_set_sleepable_cb() ?
>
> OK. So I guess I should drop Toke's suggestion with the bpf_timer_ini() flag?
>
> > The verifier will set is_async_cb = true for it (like it does for regular cb-s).
> > And since prog->aux->sleepable is kinda "global" we need another
> > per subprog flag:
> > bool is_sleepable: 1;
>
> done (in push_callback_call())
>
> >
> > We can factor out a check "if (prog->aux->sleepable)" into a helper
> > that will check that "global" flag and another env->cur_state->in_sleepable
> > flag that will work similar to active_rcu_lock.
>
> done (I think), cf patch 2 below
>
> > Once the verifier starts processing subprog->is_sleepable
> > it will set cur_state->in_sleepable = true;
> > to make all subprogs called from that cb to be recognized as sleepable too.
>
> That's the point I don't know where to put the new code.
>

I think that would go in the already existing special case for
push_async_cb where you get the verifier state of the async callback.
You can make setting the boolean in that verifier state conditional on
whether it's your kfunc/helper you're processing taking a sleepable
callback.

> It seems the best place would be in do_check(), but I am under the impression
> that the code of the callback is added at the end of the instruction list, meaning
> that I do not know where it starts, and which subprog index it corresponds to.
>
> >
> > A bit of a challenge is what to do with global subprogs,
> > since they're verified lazily. They can be called from
> > sleepable and non-sleepable contex. Should be solvable.
>
> I must confess this is way over me (and given that I didn't even managed to make
> the "easy" case working, that might explain things a little :-P )
>

I think it will be solvable but made somewhat difficult by the fact
that even if we mark subprog_info of some global_func A as
in_sleepable, so that we explore it as sleepable during its
verification, we might encounter later another global_func that calls
a global func, already explored as non-sleepable, in sleepable
context. In this case I think we need to redo the verification of that
global func as sleepable once again. It could be that it is called
from both non-sleepable and sleepable contexts, so both paths
(in_sleepable = true, and in_sleepable = false) need to be explored,
or we could reject such cases, but it might be a little restrictive.

Some common helper global func unrelated to caller context doing some
auxiliary work, called from sleepable timer callback and normal main
subprog might be an example where rejection will be prohibitive.

An approach might be to explore main and global subprogs once as we do
now, and then keep a list of global subprogs that need to be revisited
as in_sleepable (due to being called from a sleepable context) and
trigger do_check_common for them again, this might have to be repeated
as the list grows on each iteration, but eventually we will have
explored all of them as in_sleepable if need be, and the loop will
end. Surely, this trades off logical simplicity of verifier code with
redoing verification of global subprogs again.

To add items to such a list, for each global subprog we encounter that
needs to be analyzed as in_sleepable, we will also collect all its
callee global subprogs by walking its instructions (a bit like
check_max_stack_depth does).

> >
> > Overall I think this feature is needed urgently,
> > so if you don't have cycles to work on this soon,
> > I can prioritize it right after bpf_arena work.
>
> I can try to spare a few cycles on it. Even if your instructions were on
> spot, I still can't make the subprogs recognized as sleepable.
>
> For reference, this is where I am (probably bogus, but seems to be
> working when timer_set_sleepable_cb() is called from a sleepable context
> as mentioned by Toke):
>

I just skimmed the patch but I think it's already 90% there. The only
other change I would suggest is switching from helper to kfunc, as
originally proposed by Alexei.

> [...]
Toke Høiland-Jørgensen Feb. 13, 2024, 7:51 p.m. UTC | #10
Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:

> On Tue, 13 Feb 2024 at 18:46, Benjamin Tissoires <bentiss@kernel.org> wrote:
>>
>> On Feb 12 2024, Alexei Starovoitov wrote:
>> > On Mon, Feb 12, 2024 at 10:21 AM Benjamin Tissoires
>> > <benjamin.tissoires@redhat.com> wrote:
>> > >
>> > > On Mon, Feb 12, 2024 at 6:46 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> > > >
>> > > > Benjamin Tissoires <benjamin.tissoires@redhat.com> writes:
>> > > >
>> [...]
>> > I agree that workqueue delegation fits into the bpf_timer concept and
>> > a lot of code can and should be shared.
>>
>> Thanks Alexei for the detailed answer. I've given it an attempt but still can not
>> figure it out entirely.
>>
>> > All the lessons(bugs) learned with bpf_timer don't need to be re-discovered :)
>> > Too bad, bpf_timer_set_callback() doesn't have a flag argument,
>> > so we need a new kfunc to set a sleepable callback.
>> > Maybe
>> > bpf_timer_set_sleepable_cb() ?
>>
>> OK. So I guess I should drop Toke's suggestion with the bpf_timer_ini() flag?
>>
>> > The verifier will set is_async_cb = true for it (like it does for regular cb-s).
>> > And since prog->aux->sleepable is kinda "global" we need another
>> > per subprog flag:
>> > bool is_sleepable: 1;
>>
>> done (in push_callback_call())
>>
>> >
>> > We can factor out a check "if (prog->aux->sleepable)" into a helper
>> > that will check that "global" flag and another env->cur_state->in_sleepable
>> > flag that will work similar to active_rcu_lock.
>>
>> done (I think), cf patch 2 below
>>
>> > Once the verifier starts processing subprog->is_sleepable
>> > it will set cur_state->in_sleepable = true;
>> > to make all subprogs called from that cb to be recognized as sleepable too.
>>
>> That's the point I don't know where to put the new code.
>>
>
> I think that would go in the already existing special case for
> push_async_cb where you get the verifier state of the async callback.
> You can make setting the boolean in that verifier state conditional on
> whether it's your kfunc/helper you're processing taking a sleepable
> callback.
>
>> It seems the best place would be in do_check(), but I am under the impression
>> that the code of the callback is added at the end of the instruction list, meaning
>> that I do not know where it starts, and which subprog index it corresponds to.
>>
>> >
>> > A bit of a challenge is what to do with global subprogs,
>> > since they're verified lazily. They can be called from
>> > sleepable and non-sleepable contex. Should be solvable.
>>
>> I must confess this is way over me (and given that I didn't even managed to make
>> the "easy" case working, that might explain things a little :-P )
>>
>
> I think it will be solvable but made somewhat difficult by the fact
> that even if we mark subprog_info of some global_func A as
> in_sleepable, so that we explore it as sleepable during its
> verification, we might encounter later another global_func that calls
> a global func, already explored as non-sleepable, in sleepable
> context. In this case I think we need to redo the verification of that
> global func as sleepable once again. It could be that it is called
> from both non-sleepable and sleepable contexts, so both paths
> (in_sleepable = true, and in_sleepable = false) need to be explored,
> or we could reject such cases, but it might be a little restrictive.
>
> Some common helper global func unrelated to caller context doing some
> auxiliary work, called from sleepable timer callback and normal main
> subprog might be an example where rejection will be prohibitive.
>
> An approach might be to explore main and global subprogs once as we do
> now, and then keep a list of global subprogs that need to be revisited
> as in_sleepable (due to being called from a sleepable context) and
> trigger do_check_common for them again, this might have to be repeated
> as the list grows on each iteration, but eventually we will have
> explored all of them as in_sleepable if need be, and the loop will
> end. Surely, this trades off logical simplicity of verifier code with
> redoing verification of global subprogs again.
>
> To add items to such a list, for each global subprog we encounter that
> needs to be analyzed as in_sleepable, we will also collect all its
> callee global subprogs by walking its instructions (a bit like
> check_max_stack_depth does).

Sorry if I'm being dense, but why is all this needed if it's already
possible to just define the timer callback from a program type that
allows sleeping, and then set the actual timeout from a different
program that is not sleepable? Isn't the set_sleepable_cb() kfunc just a
convenience then? Or did I misunderstand and it's not actually possible
to mix callback/timer arming from different program types?

-Toke
Alexei Starovoitov Feb. 13, 2024, 8:48 p.m. UTC | #11
On Tue, Feb 13, 2024 at 08:23:17PM +0100, Kumar Kartikeya Dwivedi wrote:
> On Tue, 13 Feb 2024 at 18:46, Benjamin Tissoires <bentiss@kernel.org> wrote:
> >
> > On Feb 12 2024, Alexei Starovoitov wrote:
> > > On Mon, Feb 12, 2024 at 10:21 AM Benjamin Tissoires
> > > <benjamin.tissoires@redhat.com> wrote:
> > > >
> > > > On Mon, Feb 12, 2024 at 6:46 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > > > >
> > > > > Benjamin Tissoires <benjamin.tissoires@redhat.com> writes:
> > > > >
> > [...]
> > > I agree that workqueue delegation fits into the bpf_timer concept and
> > > a lot of code can and should be shared.
> >
> > Thanks Alexei for the detailed answer. I've given it an attempt but still can not
> > figure it out entirely.
> >
> > > All the lessons(bugs) learned with bpf_timer don't need to be re-discovered :)
> > > Too bad, bpf_timer_set_callback() doesn't have a flag argument,
> > > so we need a new kfunc to set a sleepable callback.
> > > Maybe
> > > bpf_timer_set_sleepable_cb() ?
> >
> > OK. So I guess I should drop Toke's suggestion with the bpf_timer_ini() flag?

I think the flag for bpf_timer_init() is still needed.
Since we probably shouldn't combine hrtimer with workqueue.
bpf_timer_start() should do schedule_work() directly.
The latency matters in some cases.
We will use flags to specify which workqueue strategy to use.
Fingers crossed, WQ_BH will be merged in the next merge window and we'd need to expose it
as an option to bpf progs.

> >
> > > The verifier will set is_async_cb = true for it (like it does for regular cb-s).
> > > And since prog->aux->sleepable is kinda "global" we need another
> > > per subprog flag:
> > > bool is_sleepable: 1;
> >
> > done (in push_callback_call())
> >
> > >
> > > We can factor out a check "if (prog->aux->sleepable)" into a helper
> > > that will check that "global" flag and another env->cur_state->in_sleepable
> > > flag that will work similar to active_rcu_lock.
> >
> > done (I think), cf patch 2 below
> >
> > > Once the verifier starts processing subprog->is_sleepable
> > > it will set cur_state->in_sleepable = true;
> > > to make all subprogs called from that cb to be recognized as sleepable too.
> >
> > That's the point I don't know where to put the new code.
> >
> 
> I think that would go in the already existing special case for
> push_async_cb where you get the verifier state of the async callback.
> You can make setting the boolean in that verifier state conditional on
> whether it's your kfunc/helper you're processing taking a sleepable
> callback.
> 
> > It seems the best place would be in do_check(), but I am under the impression
> > that the code of the callback is added at the end of the instruction list, meaning
> > that I do not know where it starts, and which subprog index it corresponds to.
> >
> > >
> > > A bit of a challenge is what to do with global subprogs,
> > > since they're verified lazily. They can be called from
> > > sleepable and non-sleepable contex. Should be solvable.
> >
> > I must confess this is way over me (and given that I didn't even managed to make
> > the "easy" case working, that might explain things a little :-P )
> >
> 
> I think it will be solvable but made somewhat difficult by the fact
> that even if we mark subprog_info of some global_func A as
> in_sleepable, so that we explore it as sleepable during its
> verification, we might encounter later another global_func that calls
> a global func, already explored as non-sleepable, in sleepable
> context. In this case I think we need to redo the verification of that
> global func as sleepable once again. It could be that it is called
> from both non-sleepable and sleepable contexts, so both paths
> (in_sleepable = true, and in_sleepable = false) need to be explored,
> or we could reject such cases, but it might be a little restrictive.
> 
> Some common helper global func unrelated to caller context doing some
> auxiliary work, called from sleepable timer callback and normal main
> subprog might be an example where rejection will be prohibitive.
> 
> An approach might be to explore main and global subprogs once as we do
> now, and then keep a list of global subprogs that need to be revisited
> as in_sleepable (due to being called from a sleepable context) and
> trigger do_check_common for them again, this might have to be repeated
> as the list grows on each iteration, but eventually we will have
> explored all of them as in_sleepable if need be, and the loop will
> end. Surely, this trades off logical simplicity of verifier code with
> redoing verification of global subprogs again.
> 
> To add items to such a list, for each global subprog we encounter that
> needs to be analyzed as in_sleepable, we will also collect all its
> callee global subprogs by walking its instructions (a bit like
> check_max_stack_depth does).

imo that would be a serious increase in the verifier complexity.
It's not clear whether this is needed at this point.
For now I would drop cur_state->in_sleepable to false before verifying global subprogs.

> > >
> > > Overall I think this feature is needed urgently,
> > > so if you don't have cycles to work on this soon,
> > > I can prioritize it right after bpf_arena work.
> >
> > I can try to spare a few cycles on it. Even if your instructions were on
> > spot, I still can't make the subprogs recognized as sleepable.
> >
> > For reference, this is where I am (probably bogus, but seems to be
> > working when timer_set_sleepable_cb() is called from a sleepable context
> > as mentioned by Toke):
> >
> 
> I just skimmed the patch but I think it's already 90% there. The only
> other change I would suggest is switching from helper to kfunc, as
> originally proposed by Alexei.

+1. I quickly looked through the patches and it does look like 90% there.
Alexei Starovoitov Feb. 13, 2024, 8:52 p.m. UTC | #12
On Tue, Feb 13, 2024 at 08:51:26PM +0100, Toke Høiland-Jørgensen wrote:
> Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
> 
> > On Tue, 13 Feb 2024 at 18:46, Benjamin Tissoires <bentiss@kernel.org> wrote:
> >>
> >> On Feb 12 2024, Alexei Starovoitov wrote:
> >> > On Mon, Feb 12, 2024 at 10:21 AM Benjamin Tissoires
> >> > <benjamin.tissoires@redhat.com> wrote:
> >> > >
> >> > > On Mon, Feb 12, 2024 at 6:46 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >> > > >
> >> > > > Benjamin Tissoires <benjamin.tissoires@redhat.com> writes:
> >> > > >
> >> [...]
> >> > I agree that workqueue delegation fits into the bpf_timer concept and
> >> > a lot of code can and should be shared.
> >>
> >> Thanks Alexei for the detailed answer. I've given it an attempt but still can not
> >> figure it out entirely.
> >>
> >> > All the lessons(bugs) learned with bpf_timer don't need to be re-discovered :)
> >> > Too bad, bpf_timer_set_callback() doesn't have a flag argument,
> >> > so we need a new kfunc to set a sleepable callback.
> >> > Maybe
> >> > bpf_timer_set_sleepable_cb() ?
> >>
> >> OK. So I guess I should drop Toke's suggestion with the bpf_timer_ini() flag?
> >>
> >> > The verifier will set is_async_cb = true for it (like it does for regular cb-s).
> >> > And since prog->aux->sleepable is kinda "global" we need another
> >> > per subprog flag:
> >> > bool is_sleepable: 1;
> >>
> >> done (in push_callback_call())
> >>
> >> >
> >> > We can factor out a check "if (prog->aux->sleepable)" into a helper
> >> > that will check that "global" flag and another env->cur_state->in_sleepable
> >> > flag that will work similar to active_rcu_lock.
> >>
> >> done (I think), cf patch 2 below
> >>
> >> > Once the verifier starts processing subprog->is_sleepable
> >> > it will set cur_state->in_sleepable = true;
> >> > to make all subprogs called from that cb to be recognized as sleepable too.
> >>
> >> That's the point I don't know where to put the new code.
> >>
> >
> > I think that would go in the already existing special case for
> > push_async_cb where you get the verifier state of the async callback.
> > You can make setting the boolean in that verifier state conditional on
> > whether it's your kfunc/helper you're processing taking a sleepable
> > callback.
> >
> >> It seems the best place would be in do_check(), but I am under the impression
> >> that the code of the callback is added at the end of the instruction list, meaning
> >> that I do not know where it starts, and which subprog index it corresponds to.
> >>
> >> >
> >> > A bit of a challenge is what to do with global subprogs,
> >> > since they're verified lazily. They can be called from
> >> > sleepable and non-sleepable contex. Should be solvable.
> >>
> >> I must confess this is way over me (and given that I didn't even managed to make
> >> the "easy" case working, that might explain things a little :-P )
> >>
> >
> > I think it will be solvable but made somewhat difficult by the fact
> > that even if we mark subprog_info of some global_func A as
> > in_sleepable, so that we explore it as sleepable during its
> > verification, we might encounter later another global_func that calls
> > a global func, already explored as non-sleepable, in sleepable
> > context. In this case I think we need to redo the verification of that
> > global func as sleepable once again. It could be that it is called
> > from both non-sleepable and sleepable contexts, so both paths
> > (in_sleepable = true, and in_sleepable = false) need to be explored,
> > or we could reject such cases, but it might be a little restrictive.
> >
> > Some common helper global func unrelated to caller context doing some
> > auxiliary work, called from sleepable timer callback and normal main
> > subprog might be an example where rejection will be prohibitive.
> >
> > An approach might be to explore main and global subprogs once as we do
> > now, and then keep a list of global subprogs that need to be revisited
> > as in_sleepable (due to being called from a sleepable context) and
> > trigger do_check_common for them again, this might have to be repeated
> > as the list grows on each iteration, but eventually we will have
> > explored all of them as in_sleepable if need be, and the loop will
> > end. Surely, this trades off logical simplicity of verifier code with
> > redoing verification of global subprogs again.
> >
> > To add items to such a list, for each global subprog we encounter that
> > needs to be analyzed as in_sleepable, we will also collect all its
> > callee global subprogs by walking its instructions (a bit like
> > check_max_stack_depth does).
> 
> Sorry if I'm being dense, but why is all this needed if it's already
> possible to just define the timer callback from a program type that
> allows sleeping, and then set the actual timeout from a different
> program that is not sleepable? Isn't the set_sleepable_cb() kfunc just a
> convenience then? Or did I misunderstand and it's not actually possible
> to mix callback/timer arming from different program types?

More than just convience.
bpf_set_sleepable_cb() might need to be called from non-sleepable and
there could be no way to hack it around with fake sleepable entry.
bpf_timer_cancel() clears callback_fn.
So if prog wants to bpf_timer_start() and later bpf_timer_cancel()
it would need to bpf_set_sleepable_cb() every time before bpf_timer_start().
And at that time it might be in non-sleepable ctx.
Toke Høiland-Jørgensen Feb. 14, 2024, 12:56 p.m. UTC | #13
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Tue, Feb 13, 2024 at 08:51:26PM +0100, Toke Høiland-Jørgensen wrote:
>> Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
>> 
>> > On Tue, 13 Feb 2024 at 18:46, Benjamin Tissoires <bentiss@kernel.org> wrote:
>> >>
>> >> On Feb 12 2024, Alexei Starovoitov wrote:
>> >> > On Mon, Feb 12, 2024 at 10:21 AM Benjamin Tissoires
>> >> > <benjamin.tissoires@redhat.com> wrote:
>> >> > >
>> >> > > On Mon, Feb 12, 2024 at 6:46 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >> > > >
>> >> > > > Benjamin Tissoires <benjamin.tissoires@redhat.com> writes:
>> >> > > >
>> >> [...]
>> >> > I agree that workqueue delegation fits into the bpf_timer concept and
>> >> > a lot of code can and should be shared.
>> >>
>> >> Thanks Alexei for the detailed answer. I've given it an attempt but still can not
>> >> figure it out entirely.
>> >>
>> >> > All the lessons(bugs) learned with bpf_timer don't need to be re-discovered :)
>> >> > Too bad, bpf_timer_set_callback() doesn't have a flag argument,
>> >> > so we need a new kfunc to set a sleepable callback.
>> >> > Maybe
>> >> > bpf_timer_set_sleepable_cb() ?
>> >>
>> >> OK. So I guess I should drop Toke's suggestion with the bpf_timer_ini() flag?
>> >>
>> >> > The verifier will set is_async_cb = true for it (like it does for regular cb-s).
>> >> > And since prog->aux->sleepable is kinda "global" we need another
>> >> > per subprog flag:
>> >> > bool is_sleepable: 1;
>> >>
>> >> done (in push_callback_call())
>> >>
>> >> >
>> >> > We can factor out a check "if (prog->aux->sleepable)" into a helper
>> >> > that will check that "global" flag and another env->cur_state->in_sleepable
>> >> > flag that will work similar to active_rcu_lock.
>> >>
>> >> done (I think), cf patch 2 below
>> >>
>> >> > Once the verifier starts processing subprog->is_sleepable
>> >> > it will set cur_state->in_sleepable = true;
>> >> > to make all subprogs called from that cb to be recognized as sleepable too.
>> >>
>> >> That's the point I don't know where to put the new code.
>> >>
>> >
>> > I think that would go in the already existing special case for
>> > push_async_cb where you get the verifier state of the async callback.
>> > You can make setting the boolean in that verifier state conditional on
>> > whether it's your kfunc/helper you're processing taking a sleepable
>> > callback.
>> >
>> >> It seems the best place would be in do_check(), but I am under the impression
>> >> that the code of the callback is added at the end of the instruction list, meaning
>> >> that I do not know where it starts, and which subprog index it corresponds to.
>> >>
>> >> >
>> >> > A bit of a challenge is what to do with global subprogs,
>> >> > since they're verified lazily. They can be called from
>> >> > sleepable and non-sleepable contex. Should be solvable.
>> >>
>> >> I must confess this is way over me (and given that I didn't even managed to make
>> >> the "easy" case working, that might explain things a little :-P )
>> >>
>> >
>> > I think it will be solvable but made somewhat difficult by the fact
>> > that even if we mark subprog_info of some global_func A as
>> > in_sleepable, so that we explore it as sleepable during its
>> > verification, we might encounter later another global_func that calls
>> > a global func, already explored as non-sleepable, in sleepable
>> > context. In this case I think we need to redo the verification of that
>> > global func as sleepable once again. It could be that it is called
>> > from both non-sleepable and sleepable contexts, so both paths
>> > (in_sleepable = true, and in_sleepable = false) need to be explored,
>> > or we could reject such cases, but it might be a little restrictive.
>> >
>> > Some common helper global func unrelated to caller context doing some
>> > auxiliary work, called from sleepable timer callback and normal main
>> > subprog might be an example where rejection will be prohibitive.
>> >
>> > An approach might be to explore main and global subprogs once as we do
>> > now, and then keep a list of global subprogs that need to be revisited
>> > as in_sleepable (due to being called from a sleepable context) and
>> > trigger do_check_common for them again, this might have to be repeated
>> > as the list grows on each iteration, but eventually we will have
>> > explored all of them as in_sleepable if need be, and the loop will
>> > end. Surely, this trades off logical simplicity of verifier code with
>> > redoing verification of global subprogs again.
>> >
>> > To add items to such a list, for each global subprog we encounter that
>> > needs to be analyzed as in_sleepable, we will also collect all its
>> > callee global subprogs by walking its instructions (a bit like
>> > check_max_stack_depth does).
>> 
>> Sorry if I'm being dense, but why is all this needed if it's already
>> possible to just define the timer callback from a program type that
>> allows sleeping, and then set the actual timeout from a different
>> program that is not sleepable? Isn't the set_sleepable_cb() kfunc just a
>> convenience then? Or did I misunderstand and it's not actually possible
>> to mix callback/timer arming from different program types?
>
> More than just convience.
> bpf_set_sleepable_cb() might need to be called from non-sleepable and
> there could be no way to hack it around with fake sleepable entry.
> bpf_timer_cancel() clears callback_fn.
> So if prog wants to bpf_timer_start() and later bpf_timer_cancel()
> it would need to bpf_set_sleepable_cb() every time before bpf_timer_start().
> And at that time it might be in non-sleepable ctx.

Ah, right, makes sense; didn't think about bpf_timer_cancel(). Thanks
for the explanation :)

-Toke
Benjamin Tissoires Feb. 14, 2024, 5:10 p.m. UTC | #14
On Feb 13 2024, Kumar Kartikeya Dwivedi wrote:
> On Tue, 13 Feb 2024 at 18:46, Benjamin Tissoires <bentiss@kernel.org> wrote:
> >
> > On Feb 12 2024, Alexei Starovoitov wrote:
> > > On Mon, Feb 12, 2024 at 10:21 AM Benjamin Tissoires
> > > <benjamin.tissoires@redhat.com> wrote:
> > > >
> > > > On Mon, Feb 12, 2024 at 6:46 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > > > >
> > > > > Benjamin Tissoires <benjamin.tissoires@redhat.com> writes:
> > > > >
> > [...]
> > > I agree that workqueue delegation fits into the bpf_timer concept and
> > > a lot of code can and should be shared.
> >
> > Thanks Alexei for the detailed answer. I've given it an attempt but still can not
> > figure it out entirely.
> >
> > > All the lessons(bugs) learned with bpf_timer don't need to be re-discovered :)
> > > Too bad, bpf_timer_set_callback() doesn't have a flag argument,
> > > so we need a new kfunc to set a sleepable callback.
> > > Maybe
> > > bpf_timer_set_sleepable_cb() ?
> >
> > OK. So I guess I should drop Toke's suggestion with the bpf_timer_ini() flag?
> >
> > > The verifier will set is_async_cb = true for it (like it does for regular cb-s).
> > > And since prog->aux->sleepable is kinda "global" we need another
> > > per subprog flag:
> > > bool is_sleepable: 1;
> >
> > done (in push_callback_call())
> >
> > >
> > > We can factor out a check "if (prog->aux->sleepable)" into a helper
> > > that will check that "global" flag and another env->cur_state->in_sleepable
> > > flag that will work similar to active_rcu_lock.
> >
> > done (I think), cf patch 2 below
> >
> > > Once the verifier starts processing subprog->is_sleepable
> > > it will set cur_state->in_sleepable = true;
> > > to make all subprogs called from that cb to be recognized as sleepable too.
> >
> > That's the point I don't know where to put the new code.
> >
> 
> I think that would go in the already existing special case for
> push_async_cb where you get the verifier state of the async callback.
> You can make setting the boolean in that verifier state conditional on
> whether it's your kfunc/helper you're processing taking a sleepable
> callback.

Hehe, thanks a lot. Indeed, it was a simple fix. I tried to put this
everywhere but here, and with your help got it working in 2 mins :)

> 
> > It seems the best place would be in do_check(), but I am under the impression
> > that the code of the callback is added at the end of the instruction list, meaning
> > that I do not know where it starts, and which subprog index it corresponds to.
> >
> > >
> > > A bit of a challenge is what to do with global subprogs,
> > > since they're verified lazily. They can be called from
> > > sleepable and non-sleepable contex. Should be solvable.
> >
> > I must confess this is way over me (and given that I didn't even managed to make
> > the "easy" case working, that might explain things a little :-P )
> >
> 
> I think it will be solvable but made somewhat difficult by the fact
> that even if we mark subprog_info of some global_func A as
> in_sleepable, so that we explore it as sleepable during its
> verification, we might encounter later another global_func that calls
> a global func, already explored as non-sleepable, in sleepable
> context. In this case I think we need to redo the verification of that
> global func as sleepable once again. It could be that it is called
> from both non-sleepable and sleepable contexts, so both paths
> (in_sleepable = true, and in_sleepable = false) need to be explored,
> or we could reject such cases, but it might be a little restrictive.
> 
> Some common helper global func unrelated to caller context doing some
> auxiliary work, called from sleepable timer callback and normal main
> subprog might be an example where rejection will be prohibitive.
> 
> An approach might be to explore main and global subprogs once as we do
> now, and then keep a list of global subprogs that need to be revisited
> as in_sleepable (due to being called from a sleepable context) and
> trigger do_check_common for them again, this might have to be repeated
> as the list grows on each iteration, but eventually we will have
> explored all of them as in_sleepable if need be, and the loop will
> end. Surely, this trades off logical simplicity of verifier code with
> redoing verification of global subprogs again.
> 
> To add items to such a list, for each global subprog we encounter that
> needs to be analyzed as in_sleepable, we will also collect all its
> callee global subprogs by walking its instructions (a bit like
> check_max_stack_depth does).

FWIW, this (or Alexei's suggestion) is still not implemented in v2

> 
> > >
> > > Overall I think this feature is needed urgently,
> > > so if you don't have cycles to work on this soon,
> > > I can prioritize it right after bpf_arena work.
> >
> > I can try to spare a few cycles on it. Even if your instructions were on
> > spot, I still can't make the subprogs recognized as sleepable.
> >
> > For reference, this is where I am (probably bogus, but seems to be
> > working when timer_set_sleepable_cb() is called from a sleepable context
> > as mentioned by Toke):
> >
> 
> I just skimmed the patch but I think it's already 90% there. The only
> other change I would suggest is switching from helper to kfunc, as
> originally proposed by Alexei.

the kfunc was a rabbit hole:
- I needed to teach the verifier about BPF_TIMER in kfunc
- I needed to teach the verifier about the kfunc itself
- I'm failing at calling the callback :(

Anyway, I'm about to send a second RFC so we can discuss on the code and
see where my monkey patching capabilities are reaching their limits.

Cheers,
Benjamin