
[RFC,bpf-next,v1,3/9] bpf: Implement bpf_throw kfunc

Message ID 20230405004239.1375399-4-memxor@gmail.com (mailing list archive)
State RFC
Delegated to: BPF
Headers show
Series Exceptions - 1/2

Checks

Context Check Description
bpf/vmtest-bpf-next-PR success PR summary
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-4 pending Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-7 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-2 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-3 success Logs for build for aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-5 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-6 success Logs for build for x86_64 with llvm-16
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 18485 this patch: 18485
netdev/cc_maintainers warning 21 maintainers not CCed: mgorman@suse.de mykolal@fb.com sdf@google.com mingo@redhat.com yhs@fb.com john.fastabend@gmail.com dietmar.eggemann@arm.com jolsa@kernel.org martin.lau@linux.dev linux-kselftest@vger.kernel.org bsegall@google.com vschneid@redhat.com peterz@infradead.org vincent.guittot@linaro.org kpsingh@kernel.org bristot@redhat.com shuah@kernel.org haoluo@google.com song@kernel.org rostedt@goodmis.org juri.lelli@redhat.com
netdev/build_clang success Errors and warnings before: 3988 this patch: 3988
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 19432 this patch: 19432
netdev/checkpatch fail CHECK: Please don't use multiple blank lines CHECK: Unnecessary parentheses around 'prog->expected_attach_type == BPF_TRACE_FEXIT' CHECK: extern prototypes should be avoided in .h files CHECK: multiple assignments should be avoided ERROR: open brace '{' following function definitions go on the next line WARNING: Prefer 'fallthrough;' over fallthrough comment WARNING: Prefer __noreturn over __attribute__((noreturn)) WARNING: line length of 100 exceeds 80 columns WARNING: line length of 104 exceeds 80 columns WARNING: line length of 110 exceeds 80 columns WARNING: line length of 82 exceeds 80 columns WARNING: line length of 83 exceeds 80 columns WARNING: line length of 84 exceeds 80 columns WARNING: line length of 85 exceeds 80 columns WARNING: line length of 86 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns WARNING: line length of 89 exceeds 80 columns WARNING: line length of 90 exceeds 80 columns WARNING: line length of 91 exceeds 80 columns WARNING: line length of 93 exceeds 80 columns WARNING: line length of 94 exceeds 80 columns WARNING: line length of 96 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Kumar Kartikeya Dwivedi April 5, 2023, 12:42 a.m. UTC
Introduce support for exceptions in the BPF runtime. The bpf_throw
kfunc is the main helper to be used by programs to throw an exception.
In the verifier domain, it will be processed like an immediate exit
from the program, ending exploration of that path.

This proves to be a powerful property: a program can perform a
condition check and call bpf_throw for a case that will never occur at
runtime, but that it would otherwise have to prove to the verifier. The
unwinding of the program stack and of any held resources will
automatically be performed by the BPF runtime.
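
For illustration, a minimal sketch of the intended usage from a BPF
program (not part of this patch; the map, section and program names are
made up, and the extern mirrors the bpf_experimental.h declaration added
at the end of the diff):

	#include <linux/bpf.h>
	#include <bpf/bpf_helpers.h>

	extern void bpf_throw(void) __attribute__((noreturn)) __ksym;

	struct {
		__uint(type, BPF_MAP_TYPE_ARRAY);
		__uint(max_entries, 1);
		__type(key, __u32);
		__type(value, __u64);
	} counters SEC(".maps");

	SEC("tc")
	int assert_example(struct __sk_buff *ctx)
	{
		__u32 key = 0;
		__u64 *val;

		val = bpf_map_lookup_elem(&counters, &key);
		/* The lookup of an in-bounds array key never fails at runtime,
		 * yet the verifier still requires the NULL case to be handled;
		 * throwing on the impossible branch prunes it instead of
		 * threading an error path through the rest of the program.
		 */
		if (!val)
			bpf_throw();
		__sync_fetch_and_add(val, 1);
		return 0;
	}

	char LICENSE[] SEC("license") = "GPL";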

For now, we fail if we see lingering references, locks, etc., but a
future patch will extend the current infrastructure to generate the
cleanup code for those too.

Programs that do not use exceptions see no change in their .text or in
their performance, as all the extra code generated to throw, propagate,
and unwind the stack applies only to programs that use this new
facility.

- The exception state is represented using four booleans in the
  task_struct of the current task. Each boolean corresponds to the
  exception state for one kernel context. This allows BPF programs to
  be interrupted and still not clobber each other's exception state.
- The other vexing case is recursion. If a program calls into another
  program (e.g. a call into a helper which eventually invokes a tracing
  program), the nested program may throw and clobber the current
  exception state. To avoid this, an invariant is maintained across the
  implementation:
	Exception state is always cleared on entry and exit of the main
	BPF program.
  This implies that if recursion occurs, the BPF program will clear the
  current exception state on entry and exit. Callbacks, however, do not
  do the same, because they are subprograms. The case of propagating
  exceptions of callbacks invoked by the kernel back to the BPF program
  is handled in the next commit. This is also the main reason to clear
  exception state on entry: asynchronous callbacks can clobber exception
  state even though we make sure it's always set to 0 within the
  kernel.
  The only other thing to keep in mind is to never allow a BPF program
  to execute while the program is being unwound. This implies that every
  function involved in this path must be notrace, which is the case for
  bpf_throw, bpf_get_exception and bpf_reset_exception.
- Rewrites happen for bpf_throw and call instructions to subprogs.
  Instructions executed in the main frame of the main program (thus,
  not in global functions and extension programs, which end up
  executing in frame > 0) need to be rewritten differently. This is
  tracked using BPF_THROW_OUTER vs BPF_THROW_INNER (see the
  illustration after this list). If this distinction were not made, a
  recursing tracing program could set exception state which the main
  program is instrumented to handle eventually, causing it to unwind
  when it shouldn't.
- Callsite-specific marking is done. Marking callsites makes it
  possible to reduce the amount of instrumentation needed: only calls
  to global subprogs must always be rewritten to handle thrown
  exceptions, while for each callsite to a static subprog, the
  verifier's path awareness allows us to skip the handling if no
  possible path taken through that callsite ever throws. This
  propagates up into all callers, and the prog may end up with
  throws_exception as false. Typically this reduces the amount of
  instrumentation when throwing subprogs are deeply nested and only
  throw under specific conditions.
- BPF_PROG_TYPE_EXT is special in that it replaces global functions in
  other BPF programs. A check is added once the exception specification
  of a prog (throws_exception) is known, to ensure we don't attach a
  throwing extension to a program not instrumented to handle it, or to
  the main subprog, which has BPF_THROW_OUTER handling as opposed to
  the extension prog's BPF_THROW_INNER handling.
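
As an illustration of the two rewrite types, rewrite_bpf_throw_call()
(see the diff below) turns a bpf_throw call site into the following BPF
assembly; the caller-side handling of BPF_THROW_INNER throws is only
sketched conceptually here:

	/* BPF_THROW_OUTER: the throw happens in frame 0, where the exit
	 * itself is the unwind, so the call instruction is not needed
	 * and is dropped.
	 */
	r0 = 0
	exit

	/* BPF_THROW_INNER: the throw happens in frame > 0; the call is
	 * kept so that the per-context exception state is set, the
	 * current frame exits with the default return value, and
	 * instrumented callers detect the exception and keep unwinding.
	 */
	call bpf_throw
	r0 = 0
	exit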

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/bpf.h                           |   9 +-
 include/linux/bpf_verifier.h                  |  13 +
 include/linux/sched.h                         |   1 +
 kernel/bpf/arraymap.c                         |   9 +-
 kernel/bpf/helpers.c                          |  22 ++
 kernel/bpf/syscall.c                          |  10 +
 kernel/bpf/trampoline.c                       |   4 +-
 kernel/bpf/verifier.c                         | 241 ++++++++++++++++--
 .../testing/selftests/bpf/bpf_experimental.h  |   9 +
 9 files changed, 299 insertions(+), 19 deletions(-)

Comments

Alexei Starovoitov April 6, 2023, 2:16 a.m. UTC | #1
On Wed, Apr 05, 2023 at 02:42:33AM +0200, Kumar Kartikeya Dwivedi wrote:
> 
> - The exception state is represented using four booleans in the
>   task_struct of current task. Each boolean corresponds to the exception
>   state for each kernel context. This allows BPF programs to be
>   interrupted and still not clobber the other's exception state.

that doesn't work for sleepable bpf progs and in RT for regular progs too.

> - The other vexing case is of recursion. If a program calls into another
>   program (e.g. call into helper which invokes tracing program
>   eventually), it may throw and clobber the current exception state. To
>   avoid this, an invariant is maintained across the implementation:
> 	Exception state is always cleared on entry and exit of the main
> 	BPF program.
>   This implies that if recursion occurs, the BPF program will clear the
>   current exception state on entry and exit. However, callbacks do not
>   do the same, because they are subprograms. The case for propagating
>   exceptions of callbacks invoked by the kernel back to the BPF program
>   is handled in the next commit. This is also the main reason to clear
>   exception state on entry, asynchronous callbacks can clobber exception
>   state even though we make sure it's always set to be 0 within the
>   kernel.
>   Anyhow, the only other thing to be kept in mind is to never allow a
>   BPF program to execute when the program is being unwinded. This
>   implies that every function involved in this path must be notrace,
>   which is the case for bpf_throw, bpf_get_exception and
>   bpf_reset_exception.

...

> +			struct bpf_insn entry_insns[] = {
> +				BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
> +				BPF_EMIT_CALL(bpf_reset_exception),
> +				BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
> +				insn[i],
> +			};

Is not correct in global bpf progs that take more than 1 argument.

How about using a scratch space in prog->aux->exception[] instead of current task?

> +notrace u64 bpf_get_exception(void)
> +{
> +	int i = interrupt_context_level();
> +
> +	return current->bpf_exception_thrown[i];
> +}

this is too slow to be acceptable.
it needs to be single load plus branch.
with prog->aux->exception approach we can achieve that.
Instead of inserting a call to bpf_get_exception() we can do load+cmp.
We probably should pass prog->aux into exception callback, so it
can know where throw came from.
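
For the sake of discussion, such a rewrite at a check site might look
roughly like this (the exception scratch field in prog->aux is
hypothetical here, and preserving the caller's r0 and scratch registers
around the check is glossed over):

	struct bpf_insn check_insns[] = {
		/* address of the (hypothetical) prog->aux->exception
		 * scratch area can be baked in at rewrite time, since
		 * prog->aux stays stable across bpf_patch_insn_data()
		 */
		BPF_LD_IMM64(BPF_REG_0, (long)prog->aux->exception),
		BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 0),	/* single load */
		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),		/* plus branch */
		BPF_MOV64_IMM(BPF_REG_0, 0),			/* default retval */
		BPF_EXIT_INSN(),				/* unwind frame */
	};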

> - Rewrites happen for bpf_throw and call instructions to subprogs.
>   The instructions which are executed in the main frame of the main
>   program (thus, not global functions and extension programs, which end
>   up executing in frame > 0) need to be rewritten differently. This is
>   tracked using BPF_THROW_OUTER vs BPF_THROW_INNER. If not done, a

how about BPF_THROW_OUTER vs BPF_THROW_ANY_INNER ?
would it be more precise ?

> +__bpf_kfunc notrace void bpf_throw(void)
> +{
> +	int i = interrupt_context_level();
> +
> +	current->bpf_exception_thrown[i] = true;
> +}

I think this needs to take u64 or couple u64 args and store them
in the scratch area.
bpf_assert* macros also need a way for bpf prog to specify
the reason for the assertion.
Otherwise there won't be any way to debug what happened.
Kumar Kartikeya Dwivedi April 6, 2023, 11:54 p.m. UTC | #2
On Thu, Apr 06, 2023 at 04:16:22AM CEST, Alexei Starovoitov wrote:
> On Wed, Apr 05, 2023 at 02:42:33AM +0200, Kumar Kartikeya Dwivedi wrote:
> >
> > - The exception state is represented using four booleans in the
> >   task_struct of current task. Each boolean corresponds to the exception
> >   state for each kernel context. This allows BPF programs to be
> >   interrupted and still not clobber the other's exception state.
>
> that doesn't work for sleepable bpf progs and in RT for regular progs too.
>

Can you elaborate? If a sleepable program blocks, that means the task is
scheduled out, so the next program will use the other task's task_struct.
Same for preemption for normal progs (under RT or not).

Is there something else that I'm missing?

> > - The other vexing case is of recursion. If a program calls into another
> >   program (e.g. call into helper which invokes tracing program
> >   eventually), it may throw and clobber the current exception state. To
> >   avoid this, an invariant is maintained across the implementation:
> > 	Exception state is always cleared on entry and exit of the main
> > 	BPF program.
> >   This implies that if recursion occurs, the BPF program will clear the
> >   current exception state on entry and exit. However, callbacks do not
> >   do the same, because they are subprograms. The case for propagating
> >   exceptions of callbacks invoked by the kernel back to the BPF program
> >   is handled in the next commit. This is also the main reason to clear
> >   exception state on entry, asynchronous callbacks can clobber exception
> >   state even though we make sure it's always set to be 0 within the
> >   kernel.
> >   Anyhow, the only other thing to be kept in mind is to never allow a
> >   BPF program to execute when the program is being unwinded. This
> >   implies that every function involved in this path must be notrace,
> >   which is the case for bpf_throw, bpf_get_exception and
> >   bpf_reset_exception.
>
> ...
>
> > +			struct bpf_insn entry_insns[] = {
> > +				BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
> > +				BPF_EMIT_CALL(bpf_reset_exception),
> > +				BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
> > +				insn[i],
> > +			};
>
> Is not correct in global bpf progs that take more than 1 argument.
>

But this is not done for global subprogs, only for the main subprog; it only
needs to be done when we enter the program from the kernel.

> How about using a scratch space in prog->aux->exception[] instead of current task?
>

I actually had this thought. It's even better because we can hardcode the
address of the exception state right in the program (since prog->aux remains
stable during bpf_patch_insn_data). However, concurrent program invocations on
multiple CPUs doesn't work well with this. It's like, one program sets the state
while the other tries to check it. It can be per-CPU but then we have to disable
preemption (which cannot be done).

Unfortunately per-task state seemed like the only choice which works without
complicating things too much.

> > +notrace u64 bpf_get_exception(void)
> > +{
> > +	int i = interrupt_context_level();
> > +
> > +	return current->bpf_exception_thrown[i];
> > +}
>
> this is too slow to be acceptable.

I agree, also partly why I still marked this an RFC.

> it needs to be single load plus branch.
> with prog->aux->exception approach we can achieve that.
> Instead of inserting a call to bpf_get_exception() we can do load+cmp.
> We probably should pass prog->aux into exception callback, so it
> can know where throw came from.
>

IMO prog->aux->exception won't work either (unless I'm missing some way which
you can point out). The other option would be to spill a pointer to the
per-task exception state to the program's stack on entry, and then have every
check load the value and perform the check. It would be a load from the stack,
a load from memory, and then a jump instruction. Still not as good as a direct
load, which we'd have with prog->aux, but much better than the current state.
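
Spelled out, each check in that scheme would be something like the
following (the stack slot and register choice are only illustrative,
and preserving the caller's r0 around the check is again glossed over):

	struct bpf_insn check_insns[] = {
		/* pointer to current->bpf_exception_thrown[ctx], spilled
		 * to a reserved stack slot in the prologue
		 */
		BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_10, -8),
		BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_0, 0),	/* exception boolean */
		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),		/* not thrown */
		BPF_MOV64_IMM(BPF_REG_0, 0),
		BPF_EXIT_INSN(),
	};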

> > - Rewrites happen for bpf_throw and call instructions to subprogs.
> >   The instructions which are executed in the main frame of the main
> >   program (thus, not global functions and extension programs, which end
> >   up executing in frame > 0) need to be rewritten differently. This is
> >   tracked using BPF_THROW_OUTER vs BPF_THROW_INNER. If not done, a
>
> how about BPF_THROW_OUTER vs BPF_THROW_ANY_INNER ?
> would it be more precise ?

I'm fine with the renaming. The only thing the type signifies is if we need to
do the rewrite for frame 0 vs frame N.

>
> > +__bpf_kfunc notrace void bpf_throw(void)
> > +{
> > +	int i = interrupt_context_level();
> > +
> > +	current->bpf_exception_thrown[i] = true;
> > +}
>
> I think this needs to take u64 or couple u64 args and store them
> in the scratch area.
> bpf_assert* macros also need a way for bpf prog to specify
> the reason for the assertion.
> Otherwise there won't be any way to debug what happened.

I agree. Should we force it to be a constant value? Then we can hardcode it in
the .text without having to save and restore it, but that might end up being a
little too restrictive?
Alexei Starovoitov April 7, 2023, 2:11 a.m. UTC | #3
On Fri, Apr 07, 2023 at 01:54:03AM +0200, Kumar Kartikeya Dwivedi wrote:
> On Thu, Apr 06, 2023 at 04:16:22AM CEST, Alexei Starovoitov wrote:
> > On Wed, Apr 05, 2023 at 02:42:33AM +0200, Kumar Kartikeya Dwivedi wrote:
> > >
> > > - The exception state is represented using four booleans in the
> > >   task_struct of current task. Each boolean corresponds to the exception
> > >   state for each kernel context. This allows BPF programs to be
> > >   interrupted and still not clobber the other's exception state.
> >
> > that doesn't work for sleepable bpf progs and in RT for regular progs too.
> >
> 
> Can you elaborate? If a sleepable program blocks, that means the task is
> scheduled out, so the next program will use the other task's task_struct.
> Same for preemption for normal progs (under RT or not).

I was worried about the case of the same task but different code paths
in the kernel with tracing prog stepping on preempted lsm.s prog.
I think you point that in this case they gotta be in different interrupt_context_level.
I need to think it through a bit more.

> > How about using a scratch space in prog->aux->exception[] instead of current task?
> >
> 
> I actually had this thought. It's even better because we can hardcode the
> address of the exception state right in the program (since prog->aux remains
> stable during bpf_patch_insn_data). 

exactly.

> However, concurrent program invocations on
> multiple CPUs doesn't work well with this. It's like, one program sets the state
> while the other tries to check it. 

Right. If it asserts on one cpu all other cpus will unwind as well,
since we're saying bpf_assert is for exceptions when user cannot convince
the verifier that the program is correct.
So it doesn't matter that it aborted everywhere. It's probably a good thing too.

> It can be per-CPU but then we have to disable
> preemption (which cannot be done).

I was thinking to propose a per-cpu prog->aux->exception[] area.
Sleepable and not are in migrate_disable(). bpf progs never migrate.
So we can do it, but we'd need new pseudo this_cpu_ptr instruction and
corresponding JIT support which felt like overkill.

Another idea I contemplated is to do preempt_disable() and local_irq_save()
into special field in prog->aux->exception first thing in bpf_throw()
and then re-enable everything before entering exception cb.
To avoid races with other progs, but NMI can still happen, so it's pointless.
Just non-per-cpu prog->aux->exception seems good enough.

> > > - Rewrites happen for bpf_throw and call instructions to subprogs.
> > >   The instructions which are executed in the main frame of the main
> > >   program (thus, not global functions and extension programs, which end
> > >   up executing in frame > 0) need to be rewritten differently. This is
> > >   tracked using BPF_THROW_OUTER vs BPF_THROW_INNER. If not done, a
> >
> > how about BPF_THROW_OUTER vs BPF_THROW_ANY_INNER ?
> > would it be more precise ?
> 
> I'm fine with the renaming. The only thing the type signifies is if we need to
> do the rewrite for frame 0 vs frame N.

Sure. BPF_THROW_FRAME_ZERO and BPF_THROW_FRAME_NON_ZERO also works.
Or any name that shows that 2nd includes multiple cases.

> >
> > > +__bpf_kfunc notrace void bpf_throw(void)
> > > +{
> > > +	int i = interrupt_context_level();
> > > +
> > > +	current->bpf_exception_thrown[i] = true;
> > > +}
> >
> > I think this needs to take u64 or couple u64 args and store them
> > in the scratch area.
> > bpf_assert* macros also need a way for bpf prog to specify
> > the reason for the assertion.
> > Otherwise there won't be any way to debug what happened.
> 
> I agree. Should we force it to be a constant value? Then we can hardcode it in
> the .text without having to save and restore it, but that might end up being a
> little too restrictive?

with prog->aux->exception approach run-time values works too.
Kumar Kartikeya Dwivedi April 7, 2023, 2:46 a.m. UTC | #4
On Fri, Apr 07, 2023 at 04:11:36AM CEST, Alexei Starovoitov wrote:
> On Fri, Apr 07, 2023 at 01:54:03AM +0200, Kumar Kartikeya Dwivedi wrote:
> > On Thu, Apr 06, 2023 at 04:16:22AM CEST, Alexei Starovoitov wrote:
> > > On Wed, Apr 05, 2023 at 02:42:33AM +0200, Kumar Kartikeya Dwivedi wrote:
> > > >
> > > > - The exception state is represented using four booleans in the
> > > >   task_struct of current task. Each boolean corresponds to the exception
> > > >   state for each kernel context. This allows BPF programs to be
> > > >   interrupted and still not clobber the other's exception state.
> > >
> > > that doesn't work for sleepable bpf progs and in RT for regular progs too.
> > >
> >
> > Can you elaborate? If a sleepable program blocks, that means the task is
> > scheduled out, so the next program will use the other task's task_struct.
> > Same for preemption for normal progs (under RT or not).
>
> I was worried about the case of the same task but different code paths
> in the kernel with tracing prog stepping on preempted lsm.s prog.
> I think you point that in this case they gotta be in different interrupt_context_level.
> I need to think it through a bit more.
>

If there is nesting, the programs always clear their exception state on exit, so
the prog that calls into the kernel which then calls into the tracing prog etc.
won't see its exception state on return. The only path where attaching programs
would screw things up is when we see a thrown exception and start unwinding
(where clearing would be a problem since it's done frame-by-frame). For that, I
already prevent _throwing_ fexit programs from attaching to subprogs in this
series (normal ones are still ok and supported, because fentry/fexit is
important for stats etc.). There might be some other corner cases I missed but
ensuring this property alone in general should make things work correctly.

> > > How about using a scratch space in prog->aux->exception[] instead of current task?
> > >
> >
> > I actually had this thought. It's even better because we can hardcode the
> > address of the exception state right in the program (since prog->aux remains
> > stable during bpf_patch_insn_data).
>
> exactly.
>
> > However, concurrent program invocations on
> > multiple CPUs doesn't work well with this. It's like, one program sets the state
> > while the other tries to check it.
>
> Right. If it asserts on one cpu all other cpus will unwind as well,
> since we're saying bpf_assert is for exceptions when user cannot convince
> the verifier that the program is correct.
> So it doesn't matter that it aborted everywhere. It's probably a good thing too.
>

We can discuss the semantics (this makes bpf_assert stronger and basically
poisons the program globally in some sense), but implementation-wise it's going
to be a lot more tricky to reason about correctness.

Right now, the verifier follows paths and knows what resources are held when we
throw from a nested call chain (to complain on leaks). Callers doing the check
for exception state at runtime expect only certain throwing points to trigger
the check and rely on that for leak freedom.

With a global prog->aux->exception, things will be ok for the CPU on which the
exception was thrown, but some other CPU will see the check returning true in a
caller even if the callee subprog for it did not throw and was possibly
transferring its acquired references to the caller after completing execution,
which now causes leaks (because subprogs are allowed to acquire and return to
their caller).

The way to handle this would be that we assume every callee which throws may
also notionally throw right when returning (due to some other CPU's throw which
we may see). Then every exit from throwing callees may be processed as throwing
if we see the global state as set.

However, this completely prevents subprogs from transferring some acquired
resource to their caller (which I think is too restrictive). If I'm acquiring
memory from static subprog and returning to my caller, I can't any longer since
I notionally throw when exiting and holding resources when doing bpf_throw is
disallowed, so transfer is out of the question.

In the current scenario it would work, because I threw early on in my subprog
when checking some condition (or proving something to the verifier) and after
that just chugged along and did my work.
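
As a concrete (hypothetical) instance of the transfer pattern described
above, assuming the usual selftest includes (vmlinux.h, bpf_helpers.h,
bpf_experimental.h):

	struct foo { int data; };

	static __noinline struct foo *alloc_foo(void)
	{
		/* reference acquired here, transferred to the caller */
		return bpf_obj_new(struct foo);
	}

	SEC("tc")
	int use_alloc(struct __sk_buff *ctx)
	{
		struct foo *f = alloc_foo();

		if (!f)
			return 0;
		f->data = ctx->len;
		bpf_obj_drop(f);	/* caller releases the transferred ref */
		return 0;
	}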

> > It can be per-CPU but then we have to disable
> > preemption (which cannot be done).
>
> I was thinking to propose a per-cpu prog->aux->exception[] area.
> Sleepable and not are in migrate_disable(). bpf progs never migrate.
> So we can do it, but we'd need new pseudo this_cpu_ptr instruction and
> corresponding JIT support which felt like overkill.
>
> Another idea I contemplated is to do preempt_disable() and local_irq_save()
> into special field in prog->aux->exception first thing in bpf_throw()
> and then re-enable everything before entering exception cb.
> To avoid races with other progs, but NMI can still happen, so it's pointless.
> Just non-per-cpu prog->aux->exception seems good enough.
>
> > > > - Rewrites happen for bpf_throw and call instructions to subprogs.
> > > >   The instructions which are executed in the main frame of the main
> > > >   program (thus, not global functions and extension programs, which end
> > > >   up executing in frame > 0) need to be rewritten differently. This is
> > > >   tracked using BPF_THROW_OUTER vs BPF_THROW_INNER. If not done, a
> > >
> > > how about BPF_THROW_OUTER vs BPF_THROW_ANY_INNER ?
> > > would it be more precise ?
> >
> > I'm fine with the renaming. The only thing the type signifies is if we need to
> > do the rewrite for frame 0 vs frame N.
>
> Sure. BPF_THROW_FRAME_ZERO and BPF_THROW_FRAME_NON_ZERO also works.
> Or any name that shows that 2nd includes multiple cases.
>

Ack.

> > >
> > > > +__bpf_kfunc notrace void bpf_throw(void)
> > > > +{
> > > > +	int i = interrupt_context_level();
> > > > +
> > > > +	current->bpf_exception_thrown[i] = true;
> > > > +}
> > >
> > > I think this needs to take u64 or couple u64 args and store them
> > > in the scratch area.
> > > bpf_assert* macros also need a way for bpf prog to specify
> > > the reason for the assertion.
> > > Otherwise there won't be any way to debug what happened.
> >
> > I agree. Should we force it to be a constant value? Then we can hardcode it in
> > the .text without having to save and restore it, but that might end up being a
> > little too restrictive?
>
> with prog->aux->exception approach run-time values works too.
Alexei Starovoitov April 12, 2023, 7:36 p.m. UTC | #5
On Fri, Apr 07, 2023 at 04:46:55AM +0200, Kumar Kartikeya Dwivedi wrote:
> On Fri, Apr 07, 2023 at 04:11:36AM CEST, Alexei Starovoitov wrote:
> > On Fri, Apr 07, 2023 at 01:54:03AM +0200, Kumar Kartikeya Dwivedi wrote:
> > > On Thu, Apr 06, 2023 at 04:16:22AM CEST, Alexei Starovoitov wrote:
> > > > On Wed, Apr 05, 2023 at 02:42:33AM +0200, Kumar Kartikeya Dwivedi wrote:
> > > > >
> > > > > - The exception state is represented using four booleans in the
> > > > >   task_struct of current task. Each boolean corresponds to the exception
> > > > >   state for each kernel context. This allows BPF programs to be
> > > > >   interrupted and still not clobber the other's exception state.
> > > >
> > > > that doesn't work for sleepable bpf progs and in RT for regular progs too.
> > > >
> > >
> > > Can you elaborate? If a sleepable program blocks, that means the task is
> > > scheduled out, so the next program will use the other task's task_struct.
> > > Same for preemption for normal progs (under RT or not).
> >
> > I was worried about the case of the same task but different code paths
> > in the kernel with tracing prog stepping on preempted lsm.s prog.
> > I think you point that in this case they gotta be in different interrupt_context_level.
> > I need to think it through a bit more.
> >
> 
> If there is nesting, the programs always clear their exception state on exit, so
> the prog that calls into the kernel which then calls into the tracing prog etc.
> won't see its exception state on return. The only path where attaching programs
> would screw things up is when we see a thrown exception and start unwinding
> (where clearing would be a problem since its done frame-by-frame). For that, I
> already prevent _throwing_ fexit programs from attaching to subprogs in this
> series (normal ones are still ok and supported, because fentry/fexit is
> important for stats etc.). There might be some other corner cases I missed but
> ensuring this property alone in general should make things work correctly.
> 
> > > > How about using a scratch space in prog->aux->exception[] instead of current task?
> > > >
> > >
> > > I actually had this thought. It's even better because we can hardcode the
> > > address of the exception state right in the program (since prog->aux remains
> > > stable during bpf_patch_insn_data).
> >
> > exactly.
> >
> > > However, concurrent program invocations on
> > > multiple CPUs doesn't work well with this. It's like, one program sets the state
> > > while the other tries to check it.
> >
> > Right. If it asserts on one cpu all other cpus will unwind as well,
> > since we're saying bpf_assert is for exceptions when user cannot convince
> > the verifier that the program is correct.
> > So it doesn't matter that it aborted everywhere. It's probably a good thing too.
> >
> 
> We can discuss the semantics (this makes bpf_assert more stronger and basically
> poisons the program globally in some sense), but implementation wise it's going
> to be a lot more tricky to reason about correctness.
> 
> Right now, the verifier follows paths and knows what resources are held when we
> throw from a nested call chain (to complain on leaks). Callers doing the check
> for exception state at runtime expect only certain throwing points to trigger
> the check and rely on that for leak freedom.
> 
> With a global prog->aux->exception, things will be ok for the CPU on which the
> exception was thrown, but some other CPU will see the check returning true in a
> caller even if the callee subprog for it did not throw and was possibly
> transferring its acquired references to the caller after completing execution,
> which now causes leaks (because subprogs are allowed to acquire and return to
> their caller).
> 
> The way to handle this would be that we assume every callee which throws may
> also notionally throw right when returning (due to some other CPU's throw which
> we may see). Then every exit from throwing callees may be processed as throwing
> if we see the global state as set.
> 
> However, this completely prevents subprogs from transferring some acquired
> resource to their caller (which I think is too restrictive). If I'm acquiring
> memory from static subprog and returning to my caller, I can't any longer since
> I notionally throw when exiting and holding resources when doing bpf_throw is
> disallowed, so transfer is out of the question.

I was under impression that subprogs cannot acquire refs and transfer them
to caller.
Looks like your commit 9d9d00ac29d0 ("bpf: Fix reference state management for synchronous callbacks")
allowed too much.
I don't think it's a good idea to support coding patterns like:
void my_alloc_foo(struct foo **ptr)
{
  struct foo *p = bpf_obj_new(typeof(*p));
  *ptr = p;
}

It's a correct C, of course, but do we really want to support such code?
I don't think the verifier can fully support it anyway.
That commit of yours allowed some of it in theory, but above example probably won't work,
since 'transfer' isn't understood by the verifier.

Regardless whether we tighten the verifier now or later such subprogs shouldn't be throwing.
So I don't see an issue doing global prog->aux->exception.
Kumar Kartikeya Dwivedi April 13, 2023, 5:05 p.m. UTC | #6
On Wed, Apr 12, 2023 at 09:36:12PM CEST, Alexei Starovoitov wrote:
> On Fri, Apr 07, 2023 at 04:46:55AM +0200, Kumar Kartikeya Dwivedi wrote:
> > On Fri, Apr 07, 2023 at 04:11:36AM CEST, Alexei Starovoitov wrote:
> > > On Fri, Apr 07, 2023 at 01:54:03AM +0200, Kumar Kartikeya Dwivedi wrote:
> > > > On Thu, Apr 06, 2023 at 04:16:22AM CEST, Alexei Starovoitov wrote:
> > > > > On Wed, Apr 05, 2023 at 02:42:33AM +0200, Kumar Kartikeya Dwivedi wrote:
> > > > > >
> > > > > > - The exception state is represented using four booleans in the
> > > > > >   task_struct of current task. Each boolean corresponds to the exception
> > > > > >   state for each kernel context. This allows BPF programs to be
> > > > > >   interrupted and still not clobber the other's exception state.
> > > > >
> > > > > that doesn't work for sleepable bpf progs and in RT for regular progs too.
> > > > >
> > > >
> > > > Can you elaborate? If a sleepable program blocks, that means the task is
> > > > scheduled out, so the next program will use the other task's task_struct.
> > > > Same for preemption for normal progs (under RT or not).
> > >
> > > I was worried about the case of the same task but different code paths
> > > in the kernel with tracing prog stepping on preempted lsm.s prog.
> > > I think you point that in this case they gotta be in different interrupt_context_level.
> > > I need to think it through a bit more.
> > >
> >
> > If there is nesting, the programs always clear their exception state on exit, so
> > the prog that calls into the kernel which then calls into the tracing prog etc.
> > won't see its exception state on return. The only path where attaching programs
> > would screw things up is when we see a thrown exception and start unwinding
> > (where clearing would be a problem since its done frame-by-frame). For that, I
> > already prevent _throwing_ fexit programs from attaching to subprogs in this
> > series (normal ones are still ok and supported, because fentry/fexit is
> > important for stats etc.). There might be some other corner cases I missed but
> > ensuring this property alone in general should make things work correctly.
> >
> > > > > How about using a scratch space in prog->aux->exception[] instead of current task?
> > > > >
> > > >
> > > > I actually had this thought. It's even better because we can hardcode the
> > > > address of the exception state right in the program (since prog->aux remains
> > > > stable during bpf_patch_insn_data).
> > >
> > > exactly.
> > >
> > > > However, concurrent program invocations on
> > > > multiple CPUs doesn't work well with this. It's like, one program sets the state
> > > > while the other tries to check it.
> > >
> > > Right. If it asserts on one cpu all other cpus will unwind as well,
> > > since we're saying bpf_assert is for exceptions when user cannot convince
> > > the verifier that the program is correct.
> > > So it doesn't matter that it aborted everywhere. It's probably a good thing too.
> > >
> >
> > We can discuss the semantics (this makes bpf_assert more stronger and basically
> > poisons the program globally in some sense), but implementation wise it's going
> > to be a lot more tricky to reason about correctness.
> >
> > Right now, the verifier follows paths and knows what resources are held when we
> > throw from a nested call chain (to complain on leaks). Callers doing the check
> > for exception state at runtime expect only certain throwing points to trigger
> > the check and rely on that for leak freedom.
> >
> > With a global prog->aux->exception, things will be ok for the CPU on which the
> > exception was thrown, but some other CPU will see the check returning true in a
> > caller even if the callee subprog for it did not throw and was possibly
> > transferring its acquired references to the caller after completing execution,
> > which now causes leaks (because subprogs are allowed to acquire and return to
> > their caller).
> >
> > The way to handle this would be that we assume every callee which throws may
> > also notionally throw right when returning (due to some other CPU's throw which
> > we may see). Then every exit from throwing callees may be processed as throwing
> > if we see the global state as set.
> >
> > However, this completely prevents subprogs from transferring some acquired
> > resource to their caller (which I think is too restrictive). If I'm acquiring
> > memory from static subprog and returning to my caller, I can't any longer since
> > I notionally throw when exiting and holding resources when doing bpf_throw is
> > disallowed, so transfer is out of the question.
>
> I was under impression that subprogs cannot acquire refs and transfer them
> to caller.
> Looks like your commit 9d9d00ac29d0 ("bpf: Fix reference state management for synchronous callbacks")
> allowed too much.

I think you misunderstood the change in that commit. It was about restricting
callback functions from acquiring references and not releasing them before their
BPF_EXIT (since our handling is not completely correct for more than one
iteration). The verifier has always allowed acquiring references and
transferring them to the caller for subprog calls.

> I don't think it's a good idea to support coding patterns like:
> void my_alloc_foo(struct foo **ptr)
> {
>   struct foo *p = bpf_obj_new(typeof(*p));
>   *ptr = p;
> }
>
> It's a correct C, of course, but do we really want to support such code?
> I don't think the verifier can fully support it anyway.
> That commit of yours allowed some of it in theory, but above example probably won't work,
> since 'transfer' isn't understood by the verifier.

I have no strong opinions about restricting (I think the code for handling
transfers is sane and correct, we just transfer the modified reference state,
and it's a natural valid form of writing programs), especially since static
subprogs do not have the same limitations as global subprogs, and you really don't
want to inline everything all the time.
But I think we may end up breaking existing code/programs if we do. Whether that
fallout will be small or not, I have no data yet to predict.

>
> Regardless whether we tighten the verifier now or later such subprogs shouldn't be throwing.
> So I don't see an issue doing global prog->aux->exception.

That's certainly an option, but I still think we need to be a bit careful. The
reason is that during analysis, we need to determine whether, whenever a subprog
call exits, we are in a state where we can safely unwind. It might end up
restricting a large set of use cases, but I can only say with certainty after I
try it out.

Right now, I heavily rely on the assumption that the checks only become true
when something throws (to also minimize rewrites, but that's a minor reason).
The core reason is being able to argue about correctness. With global exception
state, they can become true anytime, so we need to be a lot more conservative
even if we e.g. didn't see a subprog as throwing from all callsites.

call subprog(A) // will be rewritten, as using R1=A can throw
call subprog(B) // not rewritten, as using R1=B does not throw

Patch

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 002a811b6b90..04b81f5fe809 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1287,6 +1287,7 @@  static inline bool bpf_prog_has_trampoline(const struct bpf_prog *prog)
 struct bpf_func_info_aux {
 	u16 linkage;
 	bool unreliable;
+	bool throws_exception;
 };
 
 enum bpf_jit_poke_reason {
@@ -1430,7 +1431,8 @@  struct bpf_prog {
 				enforce_expected_attach_type:1, /* Enforce expected_attach_type checking at attach time */
 				call_get_stack:1, /* Do we call bpf_get_stack() or bpf_get_stackid() */
 				call_get_func_ip:1, /* Do we call get_func_ip() */
-				tstamp_type_access:1; /* Accessed __sk_buff->tstamp_type */
+				tstamp_type_access:1, /* Accessed __sk_buff->tstamp_type */
+				throws_exception:1; /* Does this program throw exceptions? */
 	enum bpf_prog_type	type;		/* Type of BPF program */
 	enum bpf_attach_type	expected_attach_type; /* For some prog types */
 	u32			len;		/* Number of filter blocks */
@@ -3035,4 +3037,9 @@  static inline gfp_t bpf_memcg_flags(gfp_t flags)
 	return flags;
 }
 
+/* BPF Exception helpers */
+void bpf_reset_exception(void);
+u64 bpf_get_exception(void);
+void bpf_throw(void);
+
 #endif /* _LINUX_BPF_H */
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 81d525d057c7..bc067223d3ee 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -430,6 +430,17 @@  struct bpf_loop_inline_state {
 	u32 callback_subprogno; /* valid when fit_for_inline is true */
 };
 
+enum {
+	BPF_THROW_NONE,
+	BPF_THROW_OUTER,
+	BPF_THROW_INNER,
+};
+
+struct bpf_throw_state {
+	int type;
+	bool check_helper_ret_code;
+};
+
 /* Possible states for alu_state member. */
 #define BPF_ALU_SANITIZE_SRC		(1U << 0)
 #define BPF_ALU_SANITIZE_DST		(1U << 1)
@@ -464,6 +475,7 @@  struct bpf_insn_aux_data {
 		 */
 		struct bpf_loop_inline_state loop_inline_state;
 	};
+	struct bpf_throw_state throw_state;
 	u64 obj_new_size; /* remember the size of type passed to bpf_obj_new to rewrite R1 */
 	struct btf_struct_meta *kptr_struct_meta;
 	u64 map_key_state; /* constant (32 bit) key tracking for maps */
@@ -537,6 +549,7 @@  struct bpf_subprog_info {
 	bool tail_call_reachable;
 	bool has_ld_abs;
 	bool is_async_cb;
+	bool can_throw;
 };
 
 /* single container for all structs
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b11b4517760f..a568245b59a2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1480,6 +1480,7 @@  struct task_struct {
 	struct bpf_local_storage __rcu	*bpf_storage;
 	/* Used for BPF run context */
 	struct bpf_run_ctx		*bpf_ctx;
+	bool				bpf_exception_thrown[4];
 #endif
 
 #ifdef CONFIG_GCC_PLUGIN_STACKLEAK
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 2058e89b5ddd..de0eadf8706f 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -905,7 +905,14 @@  static void *prog_fd_array_get_ptr(struct bpf_map *map,
 	if (IS_ERR(prog))
 		return prog;
 
-	if (!bpf_prog_map_compatible(map, prog)) {
+	/* Programs which throw exceptions are not allowed to be tail call
+	 * targets. This is because it forces us to be conservative for each
+	 * bpf_tail_call invocation and assume it may throw, since we do not
+	 * know what the target program may do, thus causing us to propagate the
+	 * exception and mark calling prog as potentially throwing. Just be
+	 * restrictive for now and disallow this.
+	 */
+	if (prog->throws_exception || !bpf_prog_map_compatible(map, prog)) {
 		bpf_prog_put(prog);
 		return ERR_PTR(-EINVAL);
 	}
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 6be16db9f188..89e70907257c 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1879,6 +1879,20 @@  void bpf_rb_root_free(const struct btf_field *field, void *rb_root,
 	}
 }
 
+notrace void bpf_reset_exception(void)
+{
+	int i = interrupt_context_level();
+
+	current->bpf_exception_thrown[i] = false;
+}
+
+notrace u64 bpf_get_exception(void)
+{
+	int i = interrupt_context_level();
+
+	return current->bpf_exception_thrown[i];
+}
+
 __diag_push();
 __diag_ignore_all("-Wmissing-prototypes",
 		  "Global functions as their definitions will be in vmlinux BTF");
@@ -2295,6 +2309,13 @@  __bpf_kfunc void bpf_rcu_read_unlock(void)
 	rcu_read_unlock();
 }
 
+__bpf_kfunc notrace void bpf_throw(void)
+{
+	int i = interrupt_context_level();
+
+	current->bpf_exception_thrown[i] = true;
+}
+
 __diag_pop();
 
 BTF_SET8_START(generic_btf_ids)
@@ -2321,6 +2342,7 @@  BTF_ID_FLAGS(func, bpf_cgroup_ancestor, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_cgroup_from_id, KF_ACQUIRE | KF_RET_NULL)
 #endif
 BTF_ID_FLAGS(func, bpf_task_from_pid, KF_ACQUIRE | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_throw)
 BTF_SET8_END(generic_btf_ids)
 
 static const struct btf_kfunc_id_set generic_kfunc_set = {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index e18ac7fdc210..f82e7a174d6a 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -3144,6 +3144,16 @@  static int bpf_tracing_prog_attach(struct bpf_prog *prog,
 		tgt_prog = prog->aux->dst_prog;
 	}
 
+	/* Don't allow tracing programs to attach to fexit and clear exception
+	 * state when we are unwinding the program.
+	 */
+	if (prog->type == BPF_PROG_TYPE_TRACING &&
+	    (prog->expected_attach_type == BPF_TRACE_FEXIT) &&
+	    tgt_prog && tgt_prog->throws_exception && prog->throws_exception) {
+		err = -EINVAL;
+		goto out_unlock;
+	}
+
 	err = bpf_link_prime(&link->link.link, &link_primer);
 	if (err)
 		goto out_unlock;
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index f61d5138b12b..e9f9dd52f16c 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -514,7 +514,9 @@  static int __bpf_trampoline_link_prog(struct bpf_tramp_link *link, struct bpf_tr
 	kind = bpf_attach_type_to_tramp(link->link.prog);
 	if (tr->extension_prog)
 		/* cannot attach fentry/fexit if extension prog is attached.
-		 * cannot overwrite extension prog either.
+		 * cannot overwrite extension prog either. We rely on this to
+		 * not check extension prog's exception specification (since
+		 * throwing extension may not replace non-throwing).
 		 */
 		return -EBUSY;
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 8ecd5df73b07..6981d8817c71 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2787,6 +2787,8 @@  static int add_subprog_and_kfunc(struct bpf_verifier_env *env)
 	return 0;
 }
 
+static bool is_bpf_throw_call(struct bpf_insn *insn);
+
 static int check_subprogs(struct bpf_verifier_env *env)
 {
 	int i, subprog_start, subprog_end, off, cur_subprog = 0;
@@ -2820,11 +2822,12 @@  static int check_subprogs(struct bpf_verifier_env *env)
 		if (i == subprog_end - 1) {
 			/* to avoid fall-through from one subprog into another
 			 * the last insn of the subprog should be either exit
-			 * or unconditional jump back
+			 * or unconditional jump back or bpf_throw call
 			 */
 			if (code != (BPF_JMP | BPF_EXIT) &&
-			    code != (BPF_JMP | BPF_JA)) {
-				verbose(env, "last insn is not an exit or jmp\n");
+			    code != (BPF_JMP | BPF_JA) &&
+			    !is_bpf_throw_call(insn + i)) {
+				verbose(env, "last insn is not an exit or jmp or bpf_throw call\n");
 				return -EINVAL;
 			}
 			subprog_start = subprog_end;
@@ -8200,6 +8203,7 @@  static int set_callee_state(struct bpf_verifier_env *env,
 			    struct bpf_func_state *callee, int insn_idx);
 
 static bool is_callback_calling_kfunc(u32 btf_id);
+static int mark_chain_throw(struct bpf_verifier_env *env, int insn_idx);
 
 static int __check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 			     int *insn_idx, int subprog,
@@ -8247,6 +8251,12 @@  static int __check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 			caller->regs[BPF_REG_0].subreg_def = DEF_NOT_SUBREG;
 
 			/* continue with next insn after call */
+
+			/* We don't explore the global function, but if it
+			 * throws, mark the callchain as throwing.
+			 */
+			if (env->subprog_info[subprog].can_throw)
+				return mark_chain_throw(env, *insn_idx);
 			return 0;
 		}
 	}
@@ -8382,6 +8392,53 @@  static int set_callee_state(struct bpf_verifier_env *env,
 	return 0;
 }
 
+static int set_throw_state_type(struct bpf_verifier_env *env, int insn_idx,
+				int frame, int subprog)
+{
+	struct bpf_throw_state *ts = &env->insn_aux_data[insn_idx].throw_state;
+	int type;
+
+	if (!frame && !subprog && env->prog->type != BPF_PROG_TYPE_EXT)
+		type = BPF_THROW_OUTER;
+	else
+		type = BPF_THROW_INNER;
+	if (ts->type != BPF_THROW_NONE) {
+		if (ts->type != type) {
+			verbose(env,
+				"conflicting rewrite type for throwing call insn %d: %d and %d\n",
+				insn_idx, ts->type, type);
+			return -EINVAL;
+		}
+	}
+	ts->type = type;
+	return 0;
+}
+
+static int mark_chain_throw(struct bpf_verifier_env *env, int insn_idx) {
+	struct bpf_func_info_aux *func_info_aux = env->prog->aux->func_info_aux;
+	struct bpf_subprog_info *subprog = env->subprog_info;
+	struct bpf_verifier_state *state = env->cur_state;
+	struct bpf_func_state **frame = state->frame;
+	u32 cur_subprogno;
+	int ret;
+
+	/* Mark all callsites leading up to this throw and their corresponding
+	 * subprogs and update their func_info_aux table.
+	 */
+	for (int i = 1; i <= state->curframe; i++) {
+		u32 subprogno = frame[i - 1]->subprogno;
+
+		func_info_aux[subprogno].throws_exception = subprog[subprogno].can_throw = true;
+		ret = set_throw_state_type(env, frame[i]->callsite, i - 1, subprogno);
+		if (ret < 0)
+			return ret;
+	}
+	/* Now mark actual instruction which caused the throw */
+	cur_subprogno = frame[state->curframe]->subprogno;
+	func_info_aux[cur_subprogno].throws_exception = subprog[cur_subprogno].can_throw = true;
+	return set_throw_state_type(env, insn_idx, state->curframe, cur_subprogno);
+}
+
 static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 			   int *insn_idx)
 {
@@ -8394,7 +8451,6 @@  static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 			target_insn);
 		return -EFAULT;
 	}
-
 	return __check_func_call(env, insn, insn_idx, subprog, set_callee_state);
 }
 
@@ -8755,17 +8811,17 @@  record_func_key(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta,
 	return 0;
 }
 
-static int check_reference_leak(struct bpf_verifier_env *env)
+static int check_reference_leak(struct bpf_verifier_env *env, bool exception_exit)
 {
 	struct bpf_func_state *state = cur_func(env);
 	bool refs_lingering = false;
 	int i;
 
-	if (state->frameno && !state->in_callback_fn)
+	if (!exception_exit && state->frameno && !state->in_callback_fn)
 		return 0;
 
 	for (i = 0; i < state->acquired_refs; i++) {
-		if (state->in_callback_fn && state->refs[i].callback_ref != state->frameno)
+		if (!exception_exit && state->in_callback_fn && state->refs[i].callback_ref != state->frameno)
 			continue;
 		verbose(env, "Unreleased reference id=%d alloc_insn=%d\n",
 			state->refs[i].id, state->refs[i].insn_idx);
@@ -8999,7 +9055,7 @@  static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 
 	switch (func_id) {
 	case BPF_FUNC_tail_call:
-		err = check_reference_leak(env);
+		err = check_reference_leak(env, false);
 		if (err) {
 			verbose(env, "tail_call would lead to reference leak\n");
 			return err;
@@ -9615,6 +9671,7 @@  enum special_kfunc_type {
 	KF_bpf_dynptr_from_xdp,
 	KF_bpf_dynptr_slice,
 	KF_bpf_dynptr_slice_rdwr,
+	KF_bpf_throw,
 };
 
 BTF_SET_START(special_kfunc_set)
@@ -9633,6 +9690,7 @@  BTF_ID(func, bpf_dynptr_from_skb)
 BTF_ID(func, bpf_dynptr_from_xdp)
 BTF_ID(func, bpf_dynptr_slice)
 BTF_ID(func, bpf_dynptr_slice_rdwr)
+BTF_ID(func, bpf_throw)
 BTF_SET_END(special_kfunc_set)
 
 BTF_ID_LIST(special_kfunc_list)
@@ -9653,6 +9711,7 @@  BTF_ID(func, bpf_dynptr_from_skb)
 BTF_ID(func, bpf_dynptr_from_xdp)
 BTF_ID(func, bpf_dynptr_slice)
 BTF_ID(func, bpf_dynptr_slice_rdwr)
+BTF_ID(func, bpf_throw)
 
 static bool is_kfunc_bpf_rcu_read_lock(struct bpf_kfunc_call_arg_meta *meta)
 {
@@ -10736,6 +10795,13 @@  static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 		}
 	}
 
+	if (meta.btf == btf_vmlinux && meta.func_id == special_kfunc_list[KF_bpf_throw]) {
+		err = mark_chain_throw(env, insn_idx);
+		if (err < 0)
+			return err;
+		return 1;
+	}
+
 	for (i = 0; i < CALLER_SAVED_REGS; i++)
 		mark_reg_not_init(env, regs, caller_saved[i]);
 
@@ -13670,7 +13736,7 @@  static int check_ld_abs(struct bpf_verifier_env *env, struct bpf_insn *insn)
 	 * gen_ld_abs() may terminate the program at runtime, leading to
 	 * reference leak.
 	 */
-	err = check_reference_leak(env);
+	err = check_reference_leak(env, false);
 	if (err) {
 		verbose(env, "BPF_LD_[ABS|IND] cannot be mixed with socket references\n");
 		return err;
@@ -14075,6 +14141,10 @@  static int visit_insn(int t, struct bpf_verifier_env *env)
 		if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL) {
 			struct bpf_kfunc_call_arg_meta meta;
 
+			/* 'call bpf_throw' has no fallthrough edge, same as BPF_EXIT */
+			if (is_bpf_throw_call(insn))
+				return DONE_EXPLORING;
+
 			ret = fetch_kfunc_meta(env, insn, &meta, NULL);
 			if (ret == 0 && is_iter_next_kfunc(&meta)) {
 				mark_prune_point(env, t);
@@ -14738,7 +14808,7 @@  static bool regs_exact(const struct bpf_reg_state *rold,
 		       const struct bpf_reg_state *rcur,
 		       struct bpf_id_pair *idmap)
 {
-	return memcmp(rold, rcur, offsetof(struct bpf_reg_state, id)) == 0 && 
+	return memcmp(rold, rcur, offsetof(struct bpf_reg_state, id)) == 0 &&
 	       check_ids(rold->id, rcur->id, idmap) &&
 	       check_ids(rold->ref_obj_id, rcur->ref_obj_id, idmap);
 }
@@ -15617,6 +15687,7 @@  static int do_check(struct bpf_verifier_env *env)
 	int prev_insn_idx = -1;
 
 	for (;;) {
+		bool exception_exit = false;
 		struct bpf_insn *insn;
 		u8 class;
 		int err;
@@ -15830,12 +15901,18 @@  static int do_check(struct bpf_verifier_env *env)
 						return -EINVAL;
 					}
 				}
-				if (insn->src_reg == BPF_PSEUDO_CALL)
+				if (insn->src_reg == BPF_PSEUDO_CALL) {
 					err = check_func_call(env, insn, &env->insn_idx);
-				else if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL)
+				} else if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL) {
 					err = check_kfunc_call(env, insn, &env->insn_idx);
-				else
+					if (err == 1) {
+						err = 0;
+						exception_exit = true;
+						goto process_bpf_exit_full;
+					}
+				} else {
 					err = check_helper_call(env, insn, &env->insn_idx);
+				}
 				if (err)
 					return err;
 
@@ -15863,6 +15940,7 @@  static int do_check(struct bpf_verifier_env *env)
 					return -EINVAL;
 				}
 
+process_bpf_exit_full:
 				if (env->cur_state->active_lock.ptr &&
 				    !in_rbtree_lock_required_cb(env)) {
 					verbose(env, "bpf_spin_unlock is missing\n");
@@ -15880,10 +15958,23 @@  static int do_check(struct bpf_verifier_env *env)
 				 * function, for which reference_state must
 				 * match caller reference state when it exits.
 				 */
-				err = check_reference_leak(env);
+				err = check_reference_leak(env, exception_exit);
 				if (err)
 					return err;
 
+				/* The side effect of the prepare_func_exit
+				 * which is being skipped is that it frees
+				 * bpf_func_state. Typically, process_bpf_exit
+				 * will only be hit with outermost exit.
+				 * copy_verifier_state in pop_stack will handle
+				 * freeing of any extra bpf_func_state left over
+				 * from not processing all nested function
+				 * exits. We also skip return code checks as
+				 * they are not needed for exceptional exits.
+				 */
+				if (exception_exit)
+					goto process_bpf_exit;
+
 				if (state->curframe) {
 					/* exit from nested function */
 					err = prepare_func_exit(env, &env->insn_idx);
@@ -17438,6 +17529,33 @@  static int do_misc_fixups(struct bpf_verifier_env *env)
 	int i, ret, cnt, delta = 0;
 
 	for (i = 0; i < insn_cnt; i++, insn++) {
+		/* Typically, exception state is always cleared on entry and we
+		 * ensure to clear it before exiting, but in some cases, our
+		 * invocation can occur after a BPF callback has been executed
+		 * asynchronously in the context of the current task, which may
+		 * clobber the state (think of BPF timer callbacks). Callbacks
+		 * never reset exception state (as they may be called from
+		 * within a program). Thus, if we rely on seeing the exception
+		 * state, always clear it on entry.
+		 */
+		if (i == 0 && prog->throws_exception) {
+			struct bpf_insn entry_insns[] = {
+				BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+				BPF_EMIT_CALL(bpf_reset_exception),
+				BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+				insn[i],
+			};
+
+			cnt = ARRAY_SIZE(entry_insns);
+			new_prog = bpf_patch_insn_data(env, i + delta, entry_insns, cnt);
+			if (!new_prog)
+				return -ENOMEM;
+
+			delta    += cnt - 1;
+			env->prog = new_prog;
+			insn      = new_prog->insnsi + i + delta;
+		}
+
 		/* Make divide-by-zero exceptions impossible. */
 		if (insn->code == (BPF_ALU64 | BPF_MOD | BPF_X) ||
 		    insn->code == (BPF_ALU64 | BPF_DIV | BPF_X) ||
@@ -18030,7 +18148,33 @@  static bool is_inlineable_bpf_loop_call(struct bpf_insn *insn,
 	return insn->code == (BPF_JMP | BPF_CALL) &&
 		insn->src_reg == 0 &&
 		insn->imm == BPF_FUNC_loop &&
-		aux->loop_inline_state.fit_for_inline;
+		aux->loop_inline_state.fit_for_inline &&
+		aux->throw_state.type == BPF_THROW_NONE;
+}
+
+static struct bpf_prog *rewrite_bpf_throw_call(struct bpf_verifier_env *env,
+					       int position,
+					       struct bpf_throw_state *tstate,
+					       u32 *cnt)
+{
+	struct bpf_insn insn_buf[] = {
+		env->prog->insnsi[position],
+		BPF_MOV64_IMM(BPF_REG_0, 0),
+		BPF_EXIT_INSN(),
+	};
+
+	*cnt = ARRAY_SIZE(insn_buf);
+	/* We don't need the call instruction for throws in frame 0 */
+	if (tstate->type == BPF_THROW_OUTER)
+		return bpf_patch_insn_data(env, position, insn_buf + 1, *cnt - 1);
+	return bpf_patch_insn_data(env, position, insn_buf, *cnt);
+}
+
+static bool is_bpf_throw_call(struct bpf_insn *insn)
+{
+	return insn->code == (BPF_JMP | BPF_CALL) &&
+	       insn->src_reg == BPF_PSEUDO_KFUNC_CALL &&
+	       insn->off == 0 && insn->imm == special_kfunc_list[KF_bpf_throw];
 }
 
 /* For all sub-programs in the program (including main) check
@@ -18069,8 +18213,24 @@  static int do_misc_rewrites(struct bpf_verifier_env *env)
 						   &cnt);
 			if (!new_prog)
 				return -ENOMEM;
+		} else if (is_bpf_throw_call(insn)) {
+			struct bpf_throw_state *throw_state = &insn_aux->throw_state;
+
+			/* The verifier was able to prove that the bpf_throw
+			 * call was unreachable, hence it must have not been
+			 * seen and will be removed by opt_remove_dead_code.
+			 */
+			if (throw_state->type == BPF_THROW_NONE) {
+				WARN_ON_ONCE(insn_aux->seen);
+				goto skip;
+			}
+
+			new_prog = rewrite_bpf_throw_call(env, i + delta, throw_state, &cnt);
+			if (!new_prog)
+				return -ENOMEM;
 		}
 
+skip:
 		if (new_prog) {
 			delta     += cnt - 1;
 			env->prog  = new_prog;
@@ -18240,6 +18400,12 @@  static int do_check_subprogs(struct bpf_verifier_env *env)
 				"Func#%d is safe for any args that match its prototype\n",
 				i);
 		}
+		/* Only reliable functions from BTF PoV can be extended, hence
+		 * remember their exception specification to check that we don't
+		 * replace non-throwing subprog with throwing subprog. The
+		 * opposite is fine though.
+		 */
+		aux->func_info_aux[i].throws_exception = env->subprog_info[i].can_throw;
 	}
 	return 0;
 }
@@ -18250,8 +18416,12 @@  static int do_check_main(struct bpf_verifier_env *env)
 
 	env->insn_idx = 0;
 	ret = do_check_common(env, 0);
-	if (!ret)
+	if (!ret) {
 		env->prog->aux->stack_depth = env->subprog_info[0].stack_depth;
+		env->prog->throws_exception = env->subprog_info[0].can_throw;
+		if (env->prog->aux->func_info)
+			env->prog->aux->func_info_aux[0].throws_exception = env->prog->throws_exception;
+	}
 	return ret;
 }
 
@@ -18753,6 +18923,42 @@  struct btf *bpf_get_btf_vmlinux(void)
 	return btf_vmlinux;
 }
 
+static int check_ext_prog(struct bpf_verifier_env *env)
+{
+	struct bpf_prog *tgt_prog = env->prog->aux->dst_prog;
+	u32 btf_id = env->prog->aux->attach_btf_id;
+	struct bpf_prog *prog = env->prog;
+	int subprog = -1;
+
+	if (prog->type != BPF_PROG_TYPE_EXT)
+		return 0;
+	for (int i = 0; i < tgt_prog->aux->func_info_cnt; i++) {
+		if (tgt_prog->aux->func_info[i].type_id == btf_id) {
+			subprog = i;
+			break;
+		}
+	}
+	if (subprog == -1) {
+		verbose(env, "verifier internal error: extension prog's subprog not found\n");
+		return -EFAULT;
+	}
+	/* BPF_THROW_OUTER rewrites won't match BPF_PROG_TYPE_EXT's
+	 * BPF_THROW_INNER rewrites.
+	 */
+	if (!subprog && prog->throws_exception) {
+		verbose(env, "Cannot attach throwing extension to main subprog\n");
+		return -EINVAL;
+	}
+	/* Overwriting extensions is not allowed, so we can simply check
+	 * the specification of the subprog we are replacing.
+	 */
+	if (!tgt_prog->aux->func_info_aux[subprog].throws_exception && prog->throws_exception) {
+		verbose(env, "Cannot attach throwing extension to non-throwing subprog\n");
+		return -EINVAL;
+	}
+	return 0;
+}
+
 int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr)
 {
 	u64 start_time = ktime_get_ns();
@@ -18871,6 +19077,9 @@  int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr)
 	ret = do_check_subprogs(env);
 	ret = ret ?: do_check_main(env);
 
+	ret = ret ?: check_ext_prog(env);
+
+
 	if (ret == 0 && bpf_prog_is_offloaded(env->prog->aux))
 		ret = bpf_prog_offload_finalize(env);
 
diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
index dbd2c729781a..d5de9251e775 100644
--- a/tools/testing/selftests/bpf/bpf_experimental.h
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -89,4 +89,13 @@  extern void bpf_rbtree_add(struct bpf_rb_root *root, struct bpf_rb_node *node,
  */
 extern struct bpf_rb_node *bpf_rbtree_first(struct bpf_rb_root *root) __ksym;
 
+/* Description
+ *  Throw an exception, terminating the execution of the program immediately.
+ *  The eBPF runtime unwinds the stack automatically and exits the program with
+ *  the default return value of 0.
+ * Returns
+ *  This function never returns.
+ */
+extern void bpf_throw(void) __attribute__((noreturn)) __ksym;
+
 #endif