[RFC,bpf-next,v3,4/5] bpf: Add a new dynptr type for CGROUP_SOCKOPT.

Message ID 20230815174712.660956-5-thinker.li@gmail.com (mailing list archive)
State Superseded
Delegated to: BPF
Series Sleepable BPF programs on cgroup {get,set}sockopt

Checks

Context Check Description
bpf/vmtest-bpf-next-PR pending PR summary
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-2 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-4 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-5 success Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-6 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-3 success Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-7 success Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-8 pending Logs for test_maps on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-9 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-12 pending Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-13 success Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-14 success Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-15 success Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-16 pending Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-18 success Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-19 success Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-20 success Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-21 success Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-22 success Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24 success Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-25 success Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-26 success Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-27 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-28 success Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-29 success Logs for veristat
bpf/vmtest-bpf-next-VM_Test-11 success Logs for test_progs on aarch64 with gcc
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit fail Errors and warnings before: 2830 this patch: 2838
netdev/cc_maintainers warning 5 maintainers not CCed: daniel@iogearbox.net kpsingh@kernel.org john.fastabend@gmail.com jolsa@kernel.org haoluo@google.com
netdev/build_clang fail Errors and warnings before: 1526 this patch: 1527
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn fail Errors and warnings before: 2858 this patch: 2866
netdev/checkpatch warning WARNING: line length of 103 exceeds 80 columns WARNING: line length of 86 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns WARNING: line length of 90 exceeds 80 columns WARNING: line length of 91 exceeds 80 columns WARNING: line length of 96 exceeds 80 columns WARNING: line length of 97 exceeds 80 columns
netdev/kdoc fail Errors and warnings before: 12 this patch: 14
netdev/source_inline success Was 0 now: 0

Commit Message

Kui-Feng Lee Aug. 15, 2023, 5:47 p.m. UTC
From: Kui-Feng Lee <thinker.li@gmail.com>

The new dynptr type (BPF_DYNPTR_TYPE_CGROUP_SOCKOPT) will be used by BPF
programs to create a buffer that can be installed in the ctx to replace
the existing optval or user_optval.

Installation is only allowed if ctx->flags &
BPF_SOCKOPT_FLAG_OPTVAL_REPLACE is true. It is enabled only for sleepable
programs on the cgroup/setsockopt hook.  BPF programs can install a new
buffer held by a dynptr to increase the size of the optval passed to
setsockopt().

Installation is not enabled for cgroup/getsockopt since a buffer created
by the user program cannot be enlarged to return data from
getsockopt().

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
---
 include/linux/bpf.h    |   7 +-
 include/linux/filter.h |   4 +
 kernel/bpf/btf.c       |   3 +
 kernel/bpf/cgroup.c    |   5 +-
 kernel/bpf/helpers.c   | 197 +++++++++++++++++++++++++++++++++++++++++
 kernel/bpf/verifier.c  |  47 +++++++++-
 6 files changed, 259 insertions(+), 4 deletions(-)
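
For illustration, a minimal sleepable cgroup/setsockopt program using the
kfuncs added by this patch might look like the sketch below. The kfunc
signatures are taken from the diff; the sleepable section name, the
MY_SOCKOPT optname and the 8KB size are assumptions made for the example:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

#define MY_SOCKOPT 1000			/* hypothetical optname */

extern int bpf_sockopt_dynptr_alloc(struct bpf_sockopt *sopt, int size,
				    struct bpf_dynptr *ptr__uninit) __ksym;
extern int bpf_sockopt_dynptr_install(struct bpf_sockopt *sopt,
				      struct bpf_dynptr *ptr) __ksym;
extern int bpf_sockopt_dynptr_release(struct bpf_sockopt *sopt,
				      struct bpf_dynptr *ptr) __ksym;

SEC("cgroup/setsockopt.s")		/* assumed sleepable section name */
int extend_optval(struct bpf_sockopt *ctx)
{
	struct bpf_dynptr buf;

	if (ctx->optname != MY_SOCKOPT)
		return 1;

	/* Allocate a kernel buffer larger than one page ... */
	if (bpf_sockopt_dynptr_alloc(ctx, 8192, &buf) < 0)
		return 0;

	/* ... fill it with bpf_dynptr_write() ... */

	/* ... and install it as the new optval. On success the dynptr is
	 * consumed; on failure it still has to be released explicitly.
	 */
	if (bpf_sockopt_dynptr_install(ctx, &buf))
		bpf_sockopt_dynptr_release(ctx, &buf);
	return 1;
}

char LICENSE[] SEC("license") = "GPL";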

Comments

Alexei Starovoitov Aug. 17, 2023, 1:25 a.m. UTC | #1
On Tue, Aug 15, 2023 at 10:47:11AM -0700, thinker.li@gmail.com wrote:
>  
> +BTF_SET8_START(cgroup_common_btf_ids)
> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_copy_to, KF_SLEEPABLE)
> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_alloc, KF_SLEEPABLE)
> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_install, KF_SLEEPABLE)
> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_release, KF_SLEEPABLE)
> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_from, KF_SLEEPABLE)
> +BTF_SET8_END(cgroup_common_btf_ids)

These shouldn't be sockopt specific.
If we want dynptr to represent a pointer to contiguous user memory
we should use a generic kfunc that does so.

I suspect a single new kfunc: bpf_dynptr_from_user_mem() would do.
New dynptr type can be hidden in the kernel and all existing
kfuncs dynptr_slice, dynptr_data, dynptr_write could be made to work
with user memory.

But I think we have to step back. Why do we need this whole thing in the first place?
_why_ do sockopt bpf progs need to read and write user memory?

Yes there is one page limit, but what is the use case to actually read and write
beyond that? iptables sockopt was mentioned, but I don't think a bpf prog can do
anything useful with iptables binary blobs. They are hard enough for the kernel to parse.
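
For concreteness, the generic kfunc suggested above might look roughly like
the sketch below. Only the name comes from the suggestion; the
BPF_DYNPTR_TYPE_USER_MEM type and the body are assumptions, not code from
this series:

/* Hypothetical sketch of a generic dynptr over user memory. Existing
 * kfuncs such as bpf_dynptr_slice() and bpf_dynptr_write() would
 * dispatch on the new (kernel-internal) type and use
 * copy_{from,to}_user() when running in a sleepable context.
 */
__bpf_kfunc int bpf_dynptr_from_user_mem(void __user *umem, u32 size,
					 struct bpf_dynptr_kern *ptr__uninit)
{
	int err;

	bpf_dynptr_set_null(ptr__uninit);

	err = bpf_dynptr_check_size(size);
	if (err)
		return err;

	bpf_dynptr_init(ptr__uninit, (__force void *)umem,
			BPF_DYNPTR_TYPE_USER_MEM, 0, size);
	return 0;
}
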
Kui-Feng Lee Aug. 17, 2023, 7 p.m. UTC | #2
On 8/16/23 18:25, Alexei Starovoitov wrote:
> On Tue, Aug 15, 2023 at 10:47:11AM -0700, thinker.li@gmail.com wrote:
>>   
>> +BTF_SET8_START(cgroup_common_btf_ids)
>> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_copy_to, KF_SLEEPABLE)
>> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_alloc, KF_SLEEPABLE)
>> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_install, KF_SLEEPABLE)
>> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_release, KF_SLEEPABLE)
>> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_from, KF_SLEEPABLE)
>> +BTF_SET8_END(cgroup_common_btf_ids)
> 
> These shouldn't be sockopt specific.
> If we want dynptr to represent a pointer to contiguous user memory
> we should use a generic kfunc that does so.
> 
> I suspect a single new kfunc: bpf_dynptr_from_user_mem() would do.
> New dynptr type can be hidden in the kernel and all existing
> kfuncs dynptr_slice, dynptr_data, dynptr_write could be made to work
> with user memory.
> 
> But I think we have to step back. Why do we need this whole thing in the first place?
> _why_ do sockopt bpf progs need to read and write user memory?
> 
> Yes there is one page limit, but what is the use case to actually read and write
> beyond that? iptables sockopt was mentioned, but I don't think a bpf prog can do
> anything useful with iptables binary blobs. They are hard enough for the kernel to parse.

The idea behind the design is to let the developers of filters decide
when to replace the existing buffer, and to access the content of
buffers just like accessing raw pointers. However, it seems almost everyone
loves to use *_read() & *_write(). I will move in that direction.
Alexei Starovoitov Aug. 17, 2023, 7:43 p.m. UTC | #3
On Thu, Aug 17, 2023 at 12:00 PM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>
>
>
> On 8/16/23 18:25, Alexei Starovoitov wrote:
> > On Tue, Aug 15, 2023 at 10:47:11AM -0700, thinker.li@gmail.com wrote:
> >>
> >> +BTF_SET8_START(cgroup_common_btf_ids)
> >> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_copy_to, KF_SLEEPABLE)
> >> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_alloc, KF_SLEEPABLE)
> >> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_install, KF_SLEEPABLE)
> >> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_release, KF_SLEEPABLE)
> >> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_from, KF_SLEEPABLE)
> >> +BTF_SET8_END(cgroup_common_btf_ids)
> >
> > These shouldn't be sockopt specific.
> > If we want dynptr to represent a pointer to contiguous user memory
> > we should use a generic kfunc that does so.
> >
> > I suspect a single new kfunc: bpf_dynptr_from_user_mem() would do.
> > New dynptr type can be hidden in the kernel and all existing
> > kfuncs dynptr_slice, dynptr_data, dynptr_write could be made to work
> > with user memory.
> >
> > But I think we have to step back. Why do we need this whole thing in the first place?
> > _why_ do sockopt bpf progs need to read and write user memory?
> >
> > Yes there is one page limit, but what is the use case to actually read and write
> > beyond that? iptables sockopt was mentioned, but I don't think a bpf prog can do
> > anything useful with iptables binary blobs. They are hard enough for the kernel to parse.
>
> The idea behind the design is to let the developers of filters decide
> when to replace the existing buffer, and to access the content of
> buffers just like accessing raw pointers. However, it seems almost everyone
> loves to use *_read() & *_write(). I will move in that direction.

That doesn't answer my question about the use case.
What kind of filters want to parse more than 4k of sockopt data?
Martin KaFai Lau Aug. 17, 2023, 8:41 p.m. UTC | #4
On 8/16/23 6:25 PM, Alexei Starovoitov wrote:
> But I think we have to step back. Why do we need this whole thing in the first place?
> _why_ do sockopt bpf progs need to read and write user memory?
> 
> Yes there is one page limit, but what is the use case to actually read and write
> beyond that? iptables sockopt was mentioned, but I don't think a bpf prog can do
> anything useful with iptables binary blobs. They are hard enough for the kernel to parse.

Usually the bpf prog is only interested in a very small number of optnames and
has no need to read the optval at all for most cases. The max size for our use
cases is 16 bytes. The kernel currently is kind of doing the opposite and always
assumes the bpf prog needs to use the optval, so it allocates kernel memory and
does copy_from_user such that the non-sleepable bpf program can read/write it.

The bpf prog usually checks the optname and then just returns for most cases:

SEC("cgroup/getsockopt")
int get_internal_sockopt(struct bpf_sockopt *ctx)
{
	if (ctx->optname != MY_INTERNAL_SOCKOPT)
		return 1;

	/* change the ctx->optval and return to user space ... */
}

When the optlen is > PAGE_SIZE, the kernel only allocates PAGE_SIZE memory and
copies the first PAGE_SIZE bytes from the user optval. We used to ask the bpf prog
to explicitly set the optlen to 0 for the > PAGE_SIZE case even if it had not
looked at the optval. Otherwise, the kernel used to conclude that the bpf prog
had set an invalid optlen because optlen is larger than optval_end - optval and
returned -EFAULT incorrectly to the end-user.

The bpf prog started doing this > PAGE_SIZE check and set optlen = 0 due to an 
internal kernel PAGE_SIZE limitation:

SEC("cgroup/getsockopt")
int get_internal_sockopt(struct bpf_sockopt *ctx)
{
	if (ctx->optname != MY_INTERNAL_SOCKOPT) {
		/* only do that for ctx->optlen > PAGE_SIZE.
		 * otherwise, the following cgroup bpf prog will
		 * not be able to use the optval that it may
		 * be interested.
		 */
		if (ctx->optlen > PAGE_SIZE)
			ctx->optlen = 0;
		return 1;
	}

	/* change the ctx->optval and return to user space ... */
}

The above has been worked around in commit 29ebbba7d461 ("bpf: Don't EFAULT for 
{g,s}setsockopt with wrong optlen").

Later, it was reported that for an optname (NETLINK_LIST_MEMBERSHIPS) the
kernel allows a user to pass a NULL optval and use the optlen returned by
getsockopt to learn the buffer space required. The bpf prog then needs to remove
the optlen > PAGE_SIZE check and set optlen to 0 for _all_ optnames that it is
not interested in, while risking that the following cgroup prog may not be able
to use some of the optval:

SEC("cgroup/getsockopt")
int get_internal_sockopt(struct bpf_sockopt *ctx)
{
	if (ctx->optname != MY_INTERNAL_SOCKOPT) {

		/* Do that for all optnames that you are not interested in.
		 * The latter cgroup bpf prog will not be able to use the optval.
		 */
	 	ctx->optlen = 0;
		return 1;
	}

	/* change the ctx->optval and return to user space ... */
}

The above case has been addressed in commit 00e74ae08638 ("bpf: Don't EFAULT for 
getsockopt with optval=NULL").

To avoid other potential optname cases that may run into a similar situation and
require the bpf prog to work around it again like the above, it needs a way to
track whether a bpf prog has changed the optval at runtime. If it is not
changed, use the result from the kernel set/getsockopt. If reading/writing to
optval is done through a kfunc, this can be tracked. The kfunc can also directly
read/write the user memory in optval, avoiding the pre-allocated kernel memory and
the PAGE_SIZE limit, but this is a minor point.
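
A minimal sketch of that tracking idea, assuming a hypothetical
BPF_SOCKOPT_FLAG_OPTVAL_MODIFIED flag and a bpf_sockopt_write() kfunc
(neither is part of this series):

__bpf_kfunc int bpf_sockopt_write(struct bpf_sockopt *sopt, u32 offset,
				  const void *src, u32 len)
{
	struct bpf_sockopt_kern *k = (struct bpf_sockopt_kern *)sopt;

	if (offset + len > k->optval_end - k->optval)
		return -EINVAL;

	if (k->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER) {
		/* Sleepable context: write the user optval directly. */
		if (copy_to_user((void __user *)k->optval + offset, src, len))
			return -EFAULT;
	} else {
		memcpy(k->optval + offset, src, len);
	}

	/* Record that a prog changed optval, so the kernel knows not to
	 * simply reuse the plain {g,s}etsockopt result afterwards.
	 */
	k->flags |= BPF_SOCKOPT_FLAG_OPTVAL_MODIFIED;	/* assumed flag */
	return 0;
}
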
Yonghong Song Aug. 17, 2023, 9:37 p.m. UTC | #5
On 8/17/23 1:41 PM, Martin KaFai Lau wrote:
> On 8/16/23 6:25 PM, Alexei Starovoitov wrote:
>> But I think we have to step back. Why do we need this whole thing in 
>> the first place?
>> _why_ do sockopt bpf progs need to read and write user memory?
>>
>> Yes there is one page limit, but what is the use case to actually read 
>> and write
>> beyond that? iptables sockopt was mentioned, but I don't think a bpf 
>> prog can do
>> anything useful with iptables binary blobs. They are hard enough for 
>> the kernel to parse.
> 
> Usually the bpf prog is only interested in a very small number of 
> optnames and has no need to read the optval at all for most cases. The max 
> size for our use cases is 16 bytes. The kernel currently is kind of 
> doing the opposite and always assumes the bpf prog needs to use the 
> optval, so it allocates kernel memory and does copy_from_user such that the 
> non-sleepable bpf program can read/write it.
> 
> The bpf prog usually checks the optname and then just returns for most 
> cases:
> 
> SEC("cgroup/getsockopt")
> int get_internal_sockopt(struct bpf_sockopt *ctx)
> {
>      if (ctx->optname != MY_INTERNAL_SOCKOPT)
>          return 1;
> 
>      /* change the ctx->optval and return to user space ... */
> }
> 
> When the optlen is > PAGE_SIZE, the kernel only allocates PAGE_SIZE 
> memory and copies the first PAGE_SIZE bytes from the user optval. We used 
> to ask the bpf prog to explicitly set the optlen to 0 for the > PAGE_SIZE 
> case even if it had not looked at the optval. Otherwise, the kernel used to 
> conclude that the bpf prog had set an invalid optlen because optlen is 
> larger than optval_end - optval and returned -EFAULT incorrectly to 
> the end-user.
> 
> The bpf prog started doing this > PAGE_SIZE check and set optlen = 0 due 
> to an internal kernel PAGE_SIZE limitation:
> 
> SEC("cgroup/getsockopt")
> int get_internal_sockopt(struct bpf_sockopt *ctx)
> {
>      if (ctx->optname != MY_INTERNAL_SOCKOPT) {
>          /* only do that for ctx->optlen > PAGE_SIZE.
>           * otherwise, the following cgroup bpf prog will
>           * not be able to use the optval that it may
>           * be interested.
>           */
>          if (ctx->optlen > PAGE_SIZE)
>              ctx->optlen = 0;
>          return 1;
>      }
> 
>      /* change the ctx->optval and return to user space ... */
> }
> 
> The above has been worked around in commit 29ebbba7d461 ("bpf: Don't 
> EFAULT for {g,s}setsockopt with wrong optlen").
> 
> Later, it was reported that for an optname (NETLINK_LIST_MEMBERSHIPS) 
> the kernel allows a user to pass a NULL optval and use the optlen 
> returned by getsockopt to learn the buffer space required. The bpf prog 
> then needs to remove the optlen > PAGE_SIZE check and set optlen to 0 
> for _all_ optnames that it is not interested in, while risking that the 
> following cgroup prog may not be able to use some of the optval:
> 
> SEC("cgroup/getsockopt")
> int get_internal_sockopt(struct bpf_sockopt *ctx)
> {
>      if (ctx->optname != MY_INTERNAL_SOCKOPT) {
> 
>          /* Do that for all optnames that you are not interested in.
>           * The latter cgroup bpf prog will not be able to use the optval.
>           */
>           ctx->optlen = 0;
>          return 1;
>      }
> 
>      /* change the ctx->optval and return to user space ... */
> }
> 
> The above case has been addressed in commit 00e74ae08638 ("bpf: Don't 
> EFAULT for getsockopt with optval=NULL").
> 
> To avoid other potential optname cases that may run into a similar 
> situation and require the bpf prog to work around it again like the above, 
> it needs a way to track whether a bpf prog has changed the optval at 
> runtime. If it is not changed, use the result from the kernel 

Can we add a field in the bpf_sockopt uapi struct so the bpf prog can set it
if optval is changed?

struct bpf_sockopt {
         __bpf_md_ptr(struct bpf_sock *, sk);
         __bpf_md_ptr(void *, optval);
         __bpf_md_ptr(void *, optval_end);

         __s32   level;
         __s32   optname;
         __s32   optlen;
         __s32   retval;
};

> set/getsockopt. If reading/writing to optval is done through a kfunc, 
> this can be tracked. The kfunc can also directly read/write the user 
> memory in optval, avoiding the pre-allocated kernel memory and the PAGE_SIZE 
> limit, but this is a minor point.
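
Under the proposal above, a prog would flag its own write explicitly. A
sketch, assuming a hypothetical optval_changed field appended to the uapi
struct:

#define MY_INTERNAL_SOCKOPT 1000	/* hypothetical optname */

SEC("cgroup/getsockopt")
int get_internal_sockopt(struct bpf_sockopt *ctx)
{
	__u32 *optval = ctx->optval;

	if (ctx->optname != MY_INTERNAL_SOCKOPT)
		return 1;

	if (ctx->optval + sizeof(*optval) > ctx->optval_end)
		return 0;

	*optval = 1;			/* the reply payload */
	ctx->optlen = sizeof(*optval);
	ctx->optval_changed = 1;	/* hypothetical new field */
	return 1;
}
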
Alexei Starovoitov Aug. 17, 2023, 9:46 p.m. UTC | #6
On Thu, Aug 17, 2023 at 1:41 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 8/16/23 6:25 PM, Alexei Starovoitov wrote:
> > But I think we have to step back. Why do we need this whole thing in the first place?
> > _why_ do sockopt bpf progs need to read and write user memory?
> >
> > Yes there is one page limit, but what is the use case to actually read and write
> > beyond that? iptables sockopt was mentioned, but I don't think a bpf prog can do
> > anything useful with iptables binary blobs. They are hard enough for the kernel to parse.
>
> Usually the bpf prog is only interested in a very small number of optnames and
> has no need to read the optval at all for most cases. The max size for our use
> cases is 16 bytes. The kernel currently is kind of doing the opposite and always
> assumes the bpf prog needs to use the optval, so it allocates kernel memory and
> does copy_from_user such that the non-sleepable bpf program can read/write it.
>
> The bpf prog usually checks the optname and then just returns for most cases:
>
> SEC("cgroup/getsockopt")
> int get_internal_sockopt(struct bpf_sockopt *ctx)
> {
>         if (ctx->optname != MY_INTERNAL_SOCKOPT)
>                 return 1;
>
>         /* change the ctx->optval and return to user space ... */
> }
>
> When the optlen is > PAGE_SIZE, the kernel only allocates PAGE_SIZE memory and
> copies the first PAGE_SIZE bytes from the user optval. We used to ask the bpf prog
> to explicitly set the optlen to 0 for the > PAGE_SIZE case even if it had not looked
> at the optval. Otherwise, the kernel used to conclude that the bpf prog had set an
> invalid optlen because optlen is larger than optval_end - optval and
> returned -EFAULT incorrectly to the end-user.
>
> The bpf prog started doing this > PAGE_SIZE check and set optlen = 0 due to an
> internal kernel PAGE_SIZE limitation:
>
> SEC("cgroup/getsockopt")
> int get_internal_sockopt(struct bpf_sockopt *ctx)
> {
>         if (ctx->optname != MY_INTERNAL_SOCKOPT) {
>                 /* only do that for ctx->optlen > PAGE_SIZE.
>                  * otherwise, the following cgroup bpf prog will
>                  * not be able to use the optval that it may
>                  * be interested.
>                  */
>                 if (ctx->optlen > PAGE_SIZE)
>                         ctx->optlen = 0;
>                 return 1;
>         }
>
>         /* change the ctx->optval and return to user space ... */
> }
>
> The above has been worked around in commit 29ebbba7d461 ("bpf: Don't EFAULT for
> {g,s}setsockopt with wrong optlen").
>
> Later, it was reported that for an optname (NETLINK_LIST_MEMBERSHIPS) the
> kernel allows a user to pass a NULL optval and use the optlen returned by
> getsockopt to learn the buffer space required. The bpf prog then needs to remove
> the optlen > PAGE_SIZE check and set optlen to 0 for _all_ optnames that it is
> not interested in, while risking that the following cgroup prog may not be able
> to use some of the optval:
>
> SEC("cgroup/getsockopt")
> int get_internal_sockopt(struct bpf_sockopt *ctx)
> {
>         if (ctx->optname != MY_INTERNAL_SOCKOPT) {
>
>                 /* Do that for all optnames that you are not interested in.
>                  * The latter cgroup bpf prog will not be able to use the optval.
>                  */
>                 ctx->optlen = 0;
>                 return 1;
>         }
>
>         /* change the ctx->optval and return to user space ... */
> }
>
> The above case has been addressed in commit 00e74ae08638 ("bpf: Don't EFAULT for
> getsockopt with optval=NULL").

Agree with all of the above.
The existing bpf sockopt interface was problematic,
but with these workarounds we fixed all known issues.

> To avoid other potential optname cases that may run into a similar situation and
> require the bpf prog to work around it again like the above, it needs a way to
> track whether a bpf prog has changed the optval at runtime. If it is not
> changed, use the result from the kernel set/getsockopt. If reading/writing to
> optval is done through a kfunc, this can be tracked. The kfunc can also directly
> read/write the user memory in optval, avoiding the pre-allocated kernel memory and
> the PAGE_SIZE limit, but this is a minor point.

so I'm still not following how sleepable progs that can access everything
would help the existing situation.
I agree that sleepable bpf sockopt should be free from old mistakes,
but people might still write old-style non-sleepable bpf sockopt progs and
might repeat the same mistakes.
I'm missing the value of the new interface. It's better, sure, but why?
Do we expect to rewrite existing sockopt progs in a sleepable way?
It might not be easy due to sleepable limitations like maps and whatever else.
Martin KaFai Lau Aug. 17, 2023, 10:45 p.m. UTC | #7
On 8/17/23 2:46 PM, Alexei Starovoitov wrote:
>> To avoid other potential optname cases that may run into a similar situation and
>> require the bpf prog to work around it again like the above, it needs a way to
>> track whether a bpf prog has changed the optval at runtime. If it is not
>> changed, use the result from the kernel set/getsockopt. If reading/writing to
>> optval is done through a kfunc, this can be tracked. The kfunc can also directly
>> read/write the user memory in optval, avoiding the pre-allocated kernel memory and
>> the PAGE_SIZE limit, but this is a minor point.
> so I'm still not following how sleepable progs that can access everything
> would help the existing situation.
> I agree that sleepable bpf sockopt should be free from old mistakes,
> but people might still write old-style non-sleepable bpf sockopt progs and
> might repeat the same mistakes.
> I'm missing the value of the new interface. It's better, sure, but why?
> Do we expect to rewrite existing sockopt progs in a sleepable way?
> It might not be easy due to sleepable limitations like maps and whatever else.

So far our sockopt progs only use sk local storage, which can support sleepable,
and we can all move to the sleepable way to avoid any future quirk.

Agree that others may have sockopt progs that have a hard dependency on
non-sleepable. If the existing non-sleepable and sleepable progs are interleaved
together, it would end up hitting a similar issue.

Let's drop the idea of this set.
Martin KaFai Lau Aug. 17, 2023, 10:56 p.m. UTC | #8
On 8/17/23 2:37 PM, Yonghong Song wrote:
> 
> 
> On 8/17/23 1:41 PM, Martin KaFai Lau wrote:
>> On 8/16/23 6:25 PM, Alexei Starovoitov wrote:
>>> But I think we have to step back. Why do we need this whole thing in the 
>>> first place?
>>> _why_ do sockopt bpf progs need to read and write user memory?
>>>
>>> Yes there is one page limit, but what is the use case to actually read and write
>>> beyond that? iptables sockopt was mentioned, but I don't think a bpf prog can do
>>> anything useful with iptables binary blobs. They are hard enough for the kernel 
>>> to parse.
>>
>> Usually the bpf prog is only interested in a very small number of optnames and 
>> has no need to read the optval at all for most cases. The max size for our use 
>> cases is 16 bytes. The kernel currently is kind of doing the opposite and 
>> always assumes the bpf prog needs to use the optval, so it allocates kernel memory 
>> and does copy_from_user such that the non-sleepable bpf program can read/write it.
>>
>> The bpf prog usually checks the optname and then just returns for most cases:
>>
>> SEC("cgroup/getsockopt")
>> int get_internal_sockopt(struct bpf_sockopt *ctx)
>> {
>>      if (ctx->optname != MY_INTERNAL_SOCKOPT)
>>          return 1;
>>
>>      /* change the ctx->optval and return to user space ... */
>> }
>>
>> When the optlen is > PAGE_SIZE, the kernel only allocates PAGE_SIZE memory and 
>> copies the first PAGE_SIZE bytes from the user optval. We used to ask the bpf 
>> prog to explicitly set the optlen to 0 for the > PAGE_SIZE case even if it had not 
>> looked at the optval. Otherwise, the kernel used to conclude that the bpf prog 
>> had set an invalid optlen because optlen is larger than optval_end - 
>> optval and returned -EFAULT incorrectly to the end-user.
>>
>> The bpf prog started doing this > PAGE_SIZE check and set optlen = 0 due to an 
>> internal kernel PAGE_SIZE limitation:
>>
>> SEC("cgroup/getsockopt")
>> int get_internal_sockopt(struct bpf_sockopt *ctx)
>> {
>>      if (ctx->optname != MY_INTERNAL_SOCKOPT) {
>>          /* only do that for ctx->optlen > PAGE_SIZE.
>>           * otherwise, the following cgroup bpf prog will
>>           * not be able to use the optval that it may
>>           * be interested.
>>           */
>>          if (ctx->optlen > PAGE_SIZE)
>>              ctx->optlen = 0;
>>          return 1;
>>      }
>>
>>      /* change the ctx->optval and return to user space ... */
>> }
>>
>> The above has been worked around in commit 29ebbba7d461 ("bpf: Don't EFAULT 
>> for {g,s}setsockopt with wrong optlen").
>>
>> Later, it was reported that for an optname (NETLINK_LIST_MEMBERSHIPS) the 
>> kernel allows a user to pass a NULL optval and use the optlen returned by 
>> getsockopt to learn the buffer space required. The bpf prog then needs to 
>> remove the optlen > PAGE_SIZE check and set optlen to 0 for _all_ optnames 
>> that it is not interested in, while risking that the following cgroup prog 
>> may not be able to use some of the optval:
>>
>> SEC("cgroup/getsockopt")
>> int get_internal_sockopt(struct bpf_sockopt *ctx)
>> {
>>      if (ctx->optname != MY_INTERNAL_SOCKOPT) {
>>
>>          /* Do that for all optnames that you are not interested in.
>>           * The latter cgroup bpf prog will not be able to use the optval.
>>           */
>>           ctx->optlen = 0;
>>          return 1;
>>      }
>>
>>      /* change the ctx->optval and return to user space ... */
>> }
>>
>> The above case has been addressed in commit 00e74ae08638 ("bpf: Don't EFAULT 
>> for getsockopt with optval=NULL").
>>
>> To avoid other potential optname cases that may run into a similar situation and 
>> require the bpf prog to work around it again like the above, it needs a way to 
>> track whether a bpf prog has changed the optval at runtime. If it is not 
>> changed, use the result from the kernel 
> 
> Can we add a field in the bpf_sockopt uapi struct so the bpf prog can set it
> if optval is changed?

This new interface should work. If there is an old existing prog staying with
the old interface (didn't set this bool but changed the optval) in the cgroup
prog array, it probably ends up not improving anything either?

Or the verifier can enforce setting this bool at runtime when writing to optval?
I do not know how demanding the verifier change is. I am not sure if this would
be overkill for the verifier.

> 
> struct bpf_sockopt {
>          __bpf_md_ptr(struct bpf_sock *, sk);
>          __bpf_md_ptr(void *, optval);
>          __bpf_md_ptr(void *, optval_end);
> 
>          __s32   level;
>          __s32   optname;
>          __s32   optlen;
>          __s32   retval;
> };
> 
>> set/getsockopt. If reading/writing to optval is done through a kfunc, this can 
>> be tracked. The kfunc can also directly read/write the user memory in optval, 
>> avoiding the pre-allocated kernel memory and the PAGE_SIZE limit, but this is a 
>> minor point.
Kui-Feng Lee Aug. 18, 2023, 12:14 a.m. UTC | #9
On 8/17/23 12:43, Alexei Starovoitov wrote:
> On Thu, Aug 17, 2023 at 12:00 PM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>>
>>
>>
>> On 8/16/23 18:25, Alexei Starovoitov wrote:
>>> On Tue, Aug 15, 2023 at 10:47:11AM -0700, thinker.li@gmail.com wrote:
>>>>
>>>> +BTF_SET8_START(cgroup_common_btf_ids)
>>>> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_copy_to, KF_SLEEPABLE)
>>>> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_alloc, KF_SLEEPABLE)
>>>> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_install, KF_SLEEPABLE)
>>>> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_release, KF_SLEEPABLE)
>>>> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_from, KF_SLEEPABLE)
>>>> +BTF_SET8_END(cgroup_common_btf_ids)
>>>
>>> These shouldn't be sockopt specific.
>>> If we want dynptr to represent a pointer to contiguous user memory
>>> we should use a generic kfunc that does so.
>>>
>>> I suspect a single new kfunc: bpf_dynptr_from_user_mem() would do.
>>> New dynptr type can be hidden in the kernel and all existing
>>> kfuncs dynptr_slice, dynptr_data, dynptr_write could be made to work
>>> with user memory.
>>>
>>> But I think we have to step back. Why do we need this whole thing in the first place?
>>> _why_ do sockopt bpf progs need to read and write user memory?
>>>
>>> Yes there is one page limit, but what is the use case to actually read and write
>>> beyond that? iptables sockopt was mentioned, but I don't think a bpf prog can do
>>> anything useful with iptables binary blobs. They are hard enough for the kernel to parse.
>>
>> The idea behind the design is to let the developers of filters decide
>> when to replace the existing buffer, and to access the content of
>> buffers just like accessing raw pointers. However, it seems almost everyone
>> loves to use *_read() & *_write(). I will move in that direction.
> 
> That doesn't answer my question about the use case.
> What kind of filters want to parse more than 4k of sockopt data?

I don't have any existing use case for this. It is just something that
can be allowed to happen.

Patch

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index edb35bcfa548..b9e4d7752555 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -663,12 +663,15 @@  enum bpf_type_flag {
 	/* DYNPTR points to xdp_buff */
 	DYNPTR_TYPE_XDP		= BIT(16 + BPF_BASE_TYPE_BITS),
 
+	/* DYNPTR points to optval buffer of bpf_sockopt */
+	DYNPTR_TYPE_CGROUP_SOCKOPT = BIT(17 + BPF_BASE_TYPE_BITS),
+
 	__BPF_TYPE_FLAG_MAX,
 	__BPF_TYPE_LAST_FLAG	= __BPF_TYPE_FLAG_MAX - 1,
 };
 
 #define DYNPTR_TYPE_FLAG_MASK	(DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF | DYNPTR_TYPE_SKB \
-				 | DYNPTR_TYPE_XDP)
+				 | DYNPTR_TYPE_XDP | DYNPTR_TYPE_CGROUP_SOCKOPT)
 
 /* Max number of base types. */
 #define BPF_BASE_TYPE_LIMIT	(1UL << BPF_BASE_TYPE_BITS)
@@ -1206,6 +1209,8 @@  enum bpf_dynptr_type {
 	BPF_DYNPTR_TYPE_SKB,
 	/* Underlying data is a xdp_buff */
 	BPF_DYNPTR_TYPE_XDP,
+	/* Underlying data is for the optval of a cgroup sock */
+	BPF_DYNPTR_TYPE_CGROUP_SOCKOPT,
 };
 
 int bpf_dynptr_check_size(u32 size);
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 2aa2a96526de..df12fddd2f21 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1347,6 +1347,10 @@  struct bpf_sockopt_kern {
 enum bpf_sockopt_kern_flags {
 	/* optval is a pointer to user space memory */
 	BPF_SOCKOPT_FLAG_OPTVAL_USER    = (1U << 0),
+	/* able to install new optval */
+	BPF_SOCKOPT_FLAG_OPTVAL_REPLACE	= (1U << 1),
+	/* optval is referenced by a dynptr */
+	BPF_SOCKOPT_FLAG_OPTVAL_DYNPTR	= (1U << 2),
 };
 
 int copy_bpf_fprog_from_user(struct sock_fprog *dst, sockptr_t src, int len);
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 249657c466dd..6d6a040688be 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -217,6 +217,7 @@  enum btf_kfunc_hook {
 	BTF_KFUNC_HOOK_SOCKET_FILTER,
 	BTF_KFUNC_HOOK_LWT,
 	BTF_KFUNC_HOOK_NETFILTER,
+	BTF_KFUNC_HOOK_CGROUP_SOCKOPT,
 	BTF_KFUNC_HOOK_MAX,
 };
 
@@ -7846,6 +7847,8 @@  static int bpf_prog_type_to_kfunc_hook(enum bpf_prog_type prog_type)
 		return BTF_KFUNC_HOOK_LWT;
 	case BPF_PROG_TYPE_NETFILTER:
 		return BTF_KFUNC_HOOK_NETFILTER;
+	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
+		return BTF_KFUNC_HOOK_CGROUP_SOCKOPT;
 	default:
 		return BTF_KFUNC_HOOK_MAX;
 	}
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 425094e071ba..164dee8753cf 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -1865,6 +1865,8 @@  static int filter_setsockopt_progs_cb(void *arg,
 	if (max_optlen < 0)
 		return max_optlen;
 
+	ctx->flags = BPF_SOCKOPT_FLAG_OPTVAL_REPLACE;
+
 	if (copy_from_user(ctx->optval, optval,
 			   min(ctx->optlen, max_optlen)) != 0)
 		return -EFAULT;
@@ -1893,7 +1895,8 @@  int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
 	ctx.optlen = *optlen;
 	ctx.optval = optval;
 	ctx.optval_end = optval + *optlen;
-	ctx.flags = BPF_SOCKOPT_FLAG_OPTVAL_USER;
+	ctx.flags = BPF_SOCKOPT_FLAG_OPTVAL_USER |
+		BPF_SOCKOPT_FLAG_OPTVAL_REPLACE;
 
 	lock_sock(sk);
 	ret = bpf_prog_run_array_cg_cb(&cgrp->bpf, CGROUP_SETSOCKOPT,
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index eb91cae0612a..fc38aff02654 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1537,6 +1537,7 @@  BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, const struct bpf_dynptr_kern
 	switch (type) {
 	case BPF_DYNPTR_TYPE_LOCAL:
 	case BPF_DYNPTR_TYPE_RINGBUF:
+	case BPF_DYNPTR_TYPE_CGROUP_SOCKOPT:
 		/* Source and destination may possibly overlap, hence use memmove to
 		 * copy the data. E.g. bpf_dynptr_from_mem may create two dynptr
 		 * pointing to overlapping PTR_TO_MAP_VALUE regions.
@@ -1582,6 +1583,7 @@  BPF_CALL_5(bpf_dynptr_write, const struct bpf_dynptr_kern *, dst, u32, offset, v
 	switch (type) {
 	case BPF_DYNPTR_TYPE_LOCAL:
 	case BPF_DYNPTR_TYPE_RINGBUF:
+	case BPF_DYNPTR_TYPE_CGROUP_SOCKOPT:
 		if (flags)
 			return -EINVAL;
 		/* Source and destination may possibly overlap, hence use memmove to
@@ -1634,6 +1636,7 @@  BPF_CALL_3(bpf_dynptr_data, const struct bpf_dynptr_kern *, ptr, u32, offset, u3
 	switch (type) {
 	case BPF_DYNPTR_TYPE_LOCAL:
 	case BPF_DYNPTR_TYPE_RINGBUF:
+	case BPF_DYNPTR_TYPE_CGROUP_SOCKOPT:
 		return (unsigned long)(ptr->data + ptr->offset + offset);
 	case BPF_DYNPTR_TYPE_SKB:
 	case BPF_DYNPTR_TYPE_XDP:
@@ -2261,6 +2264,7 @@  __bpf_kfunc void *bpf_dynptr_slice(const struct bpf_dynptr_kern *ptr, u32 offset
 	switch (type) {
 	case BPF_DYNPTR_TYPE_LOCAL:
 	case BPF_DYNPTR_TYPE_RINGBUF:
+	case BPF_DYNPTR_TYPE_CGROUP_SOCKOPT:
 		return ptr->data + ptr->offset + offset;
 	case BPF_DYNPTR_TYPE_SKB:
 		if (buffer__opt)
@@ -2429,6 +2433,185 @@  __bpf_kfunc void bpf_rcu_read_unlock(void)
 	rcu_read_unlock();
 }
 
+/* Create a buffer of the given size for a {set,get}sockopt BPF filter.
+ *
+ * This kfunc is only available for sleepable contexts. The dynptr should be
+ * released by bpf_sockopt_dynptr_install() or bpf_sockopt_dynptr_release().
+ */
+__bpf_kfunc int bpf_sockopt_dynptr_alloc(struct bpf_sockopt *sopt, int size,
+					 struct bpf_dynptr_kern *ptr__uninit)
+{
+	void *optval;
+	int err;
+
+	bpf_dynptr_set_null(ptr__uninit);
+
+	err = bpf_dynptr_check_size(size);
+	if (err)
+		return err;
+
+	optval = kzalloc(size, GFP_KERNEL);
+	if (!optval)
+		return -ENOMEM;
+
+	bpf_dynptr_init(ptr__uninit, optval,
+			BPF_DYNPTR_TYPE_CGROUP_SOCKOPT, 0, size);
+
+	return size;
+}
+
+/* Install the buffer of the dynptr into the sockopt context.
+ *
+ * This kfunc is only available for sleepable contexts. The dynptr should be
+ * allocated by bpf_sockopt_dynptr_alloc().  The dynptr is invalid after
+ * returning from this function successfully.
+ */
+__bpf_kfunc int bpf_sockopt_dynptr_install(struct bpf_sockopt *sopt,
+					   struct bpf_dynptr_kern *ptr)
+{
+	struct bpf_sockopt_kern *sopt_kern = (struct bpf_sockopt_kern *)sopt;
+
+	if (!(sopt_kern->flags & BPF_SOCKOPT_FLAG_OPTVAL_REPLACE) ||
+	    bpf_dynptr_get_type(ptr) != BPF_DYNPTR_TYPE_CGROUP_SOCKOPT ||
+	    !ptr->data)
+		return -EINVAL;
+
+	if (sopt_kern->optval == ptr->data &&
+	    !(sopt_kern->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER)) {
+		/* This dynptr is initialized by bpf_sockopt_dynptr_from()
+		 * and the optval is not overwritten by
+		 * bpf_sockopt_dynptr_install() yet.
+		 */
+		bpf_dynptr_set_null(ptr);
+		sopt_kern->flags &= ~BPF_SOCKOPT_FLAG_OPTVAL_DYNPTR;
+		return 0;
+	}
+
+	if (sopt_kern->optval &&
+	    !(sopt_kern->flags & (BPF_SOCKOPT_FLAG_OPTVAL_USER |
+				  BPF_SOCKOPT_FLAG_OPTVAL_DYNPTR)))
+		kfree(sopt_kern->optval);
+
+	sopt_kern->optval = ptr->data;
+	sopt_kern->optval_end = ptr->data + __bpf_dynptr_size(ptr);
+	sopt_kern->optlen = __bpf_dynptr_size(ptr);
+	sopt_kern->flags &= ~(BPF_SOCKOPT_FLAG_OPTVAL_USER |
+			      BPF_SOCKOPT_FLAG_OPTVAL_DYNPTR);
+
+	bpf_dynptr_set_null(ptr);
+
+	return 0;
+}
+
+__bpf_kfunc int bpf_sockopt_dynptr_release(struct bpf_sockopt *sopt,
+					   struct bpf_dynptr_kern *ptr)
+{
+	struct bpf_sockopt_kern *sopt_kern = (struct bpf_sockopt_kern *)sopt;
+
+	if (bpf_dynptr_get_type(ptr) != BPF_DYNPTR_TYPE_CGROUP_SOCKOPT ||
+	    !ptr->data)
+		return -EINVAL;
+
+	if (sopt_kern->optval == ptr->data &&
+	    !(sopt_kern->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER))
+		/* This dynptr is initialized by bpf_sockopt_dynptr_from()
+		 * and the optval is not overwritten by
+		 * bpf_sockopt_dynptr_install() yet.
+		 */
+		sopt_kern->flags &= ~BPF_SOCKOPT_FLAG_OPTVAL_DYNPTR;
+	else
+		kfree(ptr->data);
+	bpf_dynptr_set_null(ptr);
+
+	return 0;
+}
+
+/* Initialize a sockopt dynptr from a user or installed optval pointer.
+ *
+ * sopt->optval can be a user pointer or a kernel pointer. A kernel pointer
+ * can be a buffer allocated by the caller of the BPF program or a buffer
+ * installed by other BPF programs through bpf_sockopt_dynptr_install().
+ *
+ * At most one dynptr shall be created by this function at any moment, or
+ * it will return -EINVAL. You can create another dynptr with this function
+ * after releasing the previous one with bpf_sockopt_dynptr_release().
+ *
+ * A dynptr that is initialized when optval is a user pointer is an
+ * exception. In this case, the dynptr will point to a kernel buffer with
+ * the same content as the user buffer. To simplify the code, users should
+ * always make sure to have only one dynptr initialized by this function at
+ * any moment.
+ */
+__bpf_kfunc int bpf_sockopt_dynptr_from(struct bpf_sockopt *sopt,
+					struct bpf_dynptr_kern *ptr__uninit,
+					unsigned int size)
+{
+	struct bpf_sockopt_kern *sopt_kern = (struct bpf_sockopt_kern *)sopt;
+	int err;
+
+	bpf_dynptr_set_null(ptr__uninit);
+
+	if (size > (sopt_kern->optval_end - sopt_kern->optval))
+		return -EINVAL;
+
+	if (size == 0)
+		size = min(sopt_kern->optlen,
+			   (int)(sopt_kern->optval_end - sopt_kern->optval));
+
+	if (sopt_kern->flags & BPF_SOCKOPT_FLAG_OPTVAL_DYNPTR)
+		return -EINVAL;
+
+	if (sopt_kern->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER) {
+		err = bpf_sockopt_dynptr_alloc(sopt, sopt_kern->optlen,
+					       ptr__uninit);
+		if (err >= 0)
+			err = copy_from_user(ptr__uninit->data,
+					     sopt_kern->optval,
+					     size);
+		return err;
+	}
+
+	bpf_dynptr_init(ptr__uninit, sopt_kern->optval,
+			BPF_DYNPTR_TYPE_CGROUP_SOCKOPT, 0,
+			size);
+	sopt_kern->flags |= BPF_SOCKOPT_FLAG_OPTVAL_DYNPTR;
+
+	return size;
+}
+
+/**
+ * int bpf_sockopt_dynptr_copy_to(struct bpf_sockopt *sopt,
+ *				  struct bpf_dynptr_kern *ptr)
+ *     Description
+ *             Copy data from *ptr* to *sopt->optval*.
+ *     Return
+ *             >= 0 on success, or a negative error in case of failure.
+ */
+__bpf_kfunc int bpf_sockopt_dynptr_copy_to(struct bpf_sockopt *sopt,
+					   struct bpf_dynptr_kern *ptr)
+{
+	__u32 size = bpf_dynptr_size(ptr);
+
+	struct bpf_sockopt_kern *sopt_kern = (struct bpf_sockopt_kern *)sopt;
+	int ret;
+
+	if (size > (sopt_kern->optval_end - sopt_kern->optval))
+		return -EINVAL;
+
+	if (sopt_kern->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER) {
+		ret = copy_to_user(sopt_kern->optval, ptr->data,
+				   size);
+		if (unlikely(ret))
+			return -EFAULT;
+	} else {
+		/* Use memmove() in case of optval & ptr overlap. */
+		memmove(sopt_kern->optval, ptr->data, size);
+		ret = size;
+	}
+
+	return ret;
+}
+
 __diag_pop();
 
 BTF_SET8_START(generic_btf_ids)
@@ -2494,6 +2677,19 @@  static const struct btf_kfunc_id_set common_kfunc_set = {
 	.set   = &common_btf_ids,
 };
 
+BTF_SET8_START(cgroup_common_btf_ids)
+BTF_ID_FLAGS(func, bpf_sockopt_dynptr_copy_to, KF_SLEEPABLE)
+BTF_ID_FLAGS(func, bpf_sockopt_dynptr_alloc, KF_SLEEPABLE)
+BTF_ID_FLAGS(func, bpf_sockopt_dynptr_install, KF_SLEEPABLE)
+BTF_ID_FLAGS(func, bpf_sockopt_dynptr_release, KF_SLEEPABLE)
+BTF_ID_FLAGS(func, bpf_sockopt_dynptr_from, KF_SLEEPABLE)
+BTF_SET8_END(cgroup_common_btf_ids)
+
+static const struct btf_kfunc_id_set cgroup_kfunc_set = {
+	.owner	= THIS_MODULE,
+	.set	= &cgroup_common_btf_ids,
+};
+
 static int __init kfunc_init(void)
 {
 	int ret;
@@ -2513,6 +2709,7 @@  static int __init kfunc_init(void)
 	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &generic_kfunc_set);
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &generic_kfunc_set);
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &generic_kfunc_set);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCKOPT, &cgroup_kfunc_set);
 	ret = ret ?: register_btf_id_dtor_kfuncs(generic_dtors,
 						  ARRAY_SIZE(generic_dtors),
 						  THIS_MODULE);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 936a171ea976..83d65a6e1309 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -745,6 +745,8 @@  static const char *dynptr_type_str(enum bpf_dynptr_type type)
 		return "skb";
 	case BPF_DYNPTR_TYPE_XDP:
 		return "xdp";
+	case BPF_DYNPTR_TYPE_CGROUP_SOCKOPT:
+		return "cgroup_sockopt";
 	case BPF_DYNPTR_TYPE_INVALID:
 		return "<invalid>";
 	default:
@@ -826,6 +828,8 @@  static enum bpf_dynptr_type arg_to_dynptr_type(enum bpf_arg_type arg_type)
 		return BPF_DYNPTR_TYPE_SKB;
 	case DYNPTR_TYPE_XDP:
 		return BPF_DYNPTR_TYPE_XDP;
+	case DYNPTR_TYPE_CGROUP_SOCKOPT:
+		return BPF_DYNPTR_TYPE_CGROUP_SOCKOPT;
 	default:
 		return BPF_DYNPTR_TYPE_INVALID;
 	}
@@ -842,6 +846,8 @@  static enum bpf_type_flag get_dynptr_type_flag(enum bpf_dynptr_type type)
 		return DYNPTR_TYPE_SKB;
 	case BPF_DYNPTR_TYPE_XDP:
 		return DYNPTR_TYPE_XDP;
+	case BPF_DYNPTR_TYPE_CGROUP_SOCKOPT:
+		return DYNPTR_TYPE_CGROUP_SOCKOPT;
 	default:
 		return 0;
 	}
@@ -849,7 +855,8 @@  static enum bpf_type_flag get_dynptr_type_flag(enum bpf_dynptr_type type)
 
 static bool dynptr_type_refcounted(enum bpf_dynptr_type type)
 {
-	return type == BPF_DYNPTR_TYPE_RINGBUF;
+	return type == BPF_DYNPTR_TYPE_RINGBUF ||
+		type == BPF_DYNPTR_TYPE_CGROUP_SOCKOPT;
 }
 
 static void __mark_dynptr_reg(struct bpf_reg_state *reg,
@@ -10271,6 +10278,10 @@  enum special_kfunc_type {
 	KF_bpf_dynptr_slice,
 	KF_bpf_dynptr_slice_rdwr,
 	KF_bpf_dynptr_clone,
+	KF_bpf_sockopt_dynptr_alloc,
+	KF_bpf_sockopt_dynptr_install,
+	KF_bpf_sockopt_dynptr_release,
+	KF_bpf_sockopt_dynptr_from,
 };
 
 BTF_SET_START(special_kfunc_set)
@@ -10291,6 +10302,10 @@  BTF_ID(func, bpf_dynptr_from_xdp)
 BTF_ID(func, bpf_dynptr_slice)
 BTF_ID(func, bpf_dynptr_slice_rdwr)
 BTF_ID(func, bpf_dynptr_clone)
+BTF_ID(func, bpf_sockopt_dynptr_alloc)
+BTF_ID(func, bpf_sockopt_dynptr_install)
+BTF_ID(func, bpf_sockopt_dynptr_release)
+BTF_ID(func, bpf_sockopt_dynptr_from)
 BTF_SET_END(special_kfunc_set)
 
 BTF_ID_LIST(special_kfunc_list)
@@ -10313,6 +10328,10 @@  BTF_ID(func, bpf_dynptr_from_xdp)
 BTF_ID(func, bpf_dynptr_slice)
 BTF_ID(func, bpf_dynptr_slice_rdwr)
 BTF_ID(func, bpf_dynptr_clone)
+BTF_ID(func, bpf_sockopt_dynptr_alloc)
+BTF_ID(func, bpf_sockopt_dynptr_install)
+BTF_ID(func, bpf_sockopt_dynptr_release)
+BTF_ID(func, bpf_sockopt_dynptr_from)
 
 static bool is_kfunc_ret_null(struct bpf_kfunc_call_arg_meta *meta)
 {
@@ -10966,6 +10985,20 @@  static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 			arg_type |= OBJ_RELEASE;
 			break;
 		case KF_ARG_PTR_TO_DYNPTR:
+			if (meta->func_id == special_kfunc_list[KF_bpf_sockopt_dynptr_install] ||
+			    meta->func_id == special_kfunc_list[KF_bpf_sockopt_dynptr_release]) {
+				int ref_obj_id = dynptr_ref_obj_id(env, reg);
+
+				if (ref_obj_id < 0) {
+					verbose(env, "R%d is not a valid dynptr\n", regno);
+					return -EINVAL;
+				}
+
+				/* Required by check_func_arg_reg_off() */
+				arg_type |= ARG_PTR_TO_DYNPTR | OBJ_RELEASE;
+				meta->release_regno = regno;
+			}
+			break;
 		case KF_ARG_PTR_TO_ITER:
 		case KF_ARG_PTR_TO_LIST_HEAD:
 		case KF_ARG_PTR_TO_LIST_NODE:
@@ -11053,6 +11086,10 @@  static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 					verbose(env, "verifier internal error: missing ref obj id for parent of clone\n");
 					return -EFAULT;
 				}
+			} else if ((meta->func_id == special_kfunc_list[KF_bpf_sockopt_dynptr_alloc] ||
+				    meta->func_id == special_kfunc_list[KF_bpf_sockopt_dynptr_from]) &&
+				   (dynptr_arg_type & MEM_UNINIT)) {
+				dynptr_arg_type |= DYNPTR_TYPE_CGROUP_SOCKOPT;
 			}
 
 			ret = process_dynptr_func(env, regno, insn_idx, dynptr_arg_type, clone_ref_obj_id);
@@ -11361,7 +11398,13 @@  static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 	 * PTR_TO_BTF_ID in bpf_kfunc_arg_meta, do the release now.
 	 */
 	if (meta.release_regno) {
-		err = release_reference(env, regs[meta.release_regno].ref_obj_id);
+		verbose(env, "release refcounted PTR_TO_BTF_ID %s\n",
+			meta.func_name);
+		if (meta.func_id == special_kfunc_list[KF_bpf_sockopt_dynptr_install] ||
+		    meta.func_id == special_kfunc_list[KF_bpf_sockopt_dynptr_release])
+			err = unmark_stack_slots_dynptr(env, &regs[meta.release_regno]);
+		else
+			err = release_reference(env, regs[meta.release_regno].ref_obj_id);
 		if (err) {
 			verbose(env, "kfunc %s#%d reference has not been acquired before\n",
 				func_name, meta.func_id);