
[RFC,bpf-next,1/5] bpf: enable sleepable BPF programs attached to cgroup/{get,set}sockopt.

Message ID 20230722052248.1062582-2-kuifeng@meta.com (mailing list archive)
State RFC
Delegated to: BPF
Series Sleepable BPF programs on cgroup {get,set}sockopt

Checks

Context Check Description
bpf/vmtest-bpf-next-PR success PR summary
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-4 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-5 success Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-6 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-2 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-3 success Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-7 success Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-8 pending Logs for test_maps on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-9 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-13 fail Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-14 fail Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-17 fail Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-18 fail Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-19 success Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-20 success Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-21 success Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-22 success Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24 pending Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-25 fail Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-27 fail Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-28 fail Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-29 fail Logs for veristat
bpf/vmtest-bpf-next-VM_Test-11 fail Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-15 fail Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-26 fail Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-16 fail Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-12 fail Logs for test_progs on s390x with gcc
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit fail Errors and warnings before: 3076 this patch: 3092
netdev/cc_maintainers warning 7 maintainers not CCed: daniel@iogearbox.net yhs@fb.com kpsingh@kernel.org john.fastabend@gmail.com sdf@google.com jolsa@kernel.org haoluo@google.com
netdev/build_clang success Errors and warnings before: 1539 this patch: 1539
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn fail Errors and warnings before: 3098 this patch: 3114
netdev/checkpatch warning CHECK: Alignment should match open parenthesis CHECK: Unbalanced braces around else statement CHECK: braces {} should be used on all arms of this statement WARNING: line length of 111 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns WARNING: quoted string split across lines
netdev/kdoc success Errors and warnings before: 12 this patch: 12
netdev/source_inline success Was 0 now: 0

Commit Message

Kui-Feng Lee July 22, 2023, 5:22 a.m. UTC
From: Kui-Feng Lee <kuifeng@meta.com>

Enable sleepable cgroup/{get,set}sockopt hooks.

The sleepable BPF programs attached to cgroup/{get,set}sockopt hooks may
receive a pointer to the optval in user space instead of a kernel
copy. ctx->user_optval and ctx->user_optval_end are pointers to the
beginning and end of the user space buffer if the program receives a
user space buffer, while ctx->optval and ctx->optval_end point to a
kernel copy if it receives a kernel space buffer.

A program receives a user space buffer if ctx->flags &
BPF_SOCKOPT_FLAG_OPTVAL_USER is set; otherwise it receives a kernel
space buffer.  BPF programs should not read from or write to a user
space buffer directly.  They should access the buffer through
bpf_copy_from_user() and bpf_copy_to_user(), provided in the following
patches.
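
For a concrete picture, a minimal sketch of a sleepable hook under this
scheme (illustrative only: the program name and buffer size are made up;
bpf_copy_from_user() is the existing helper, while bpf_copy_to_user() is
only added later in this series):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

char _license[] SEC("license") = "GPL";

SEC("cgroup/setsockopt.s")
int sleepable_setsockopt(struct bpf_sockopt *ctx)
{
        char buf[16] = {};
        int len = ctx->optlen;

        /* Sleepable progs may be handed the raw user buffer. */
        if (!(ctx->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER))
                return 1;       /* kernel copy: legacy direct-access path */

        if (len <= 0)
                return 1;
        if (len > (int)sizeof(buf))
                len = sizeof(buf);

        /* Sleepable context, so faulting in the user page is allowed. */
        if (bpf_copy_from_user(buf, len, ctx->user_optval))
                return 0;

        /* ... inspect or rewrite buf here, then copy it back with
         * bpf_copy_to_user() from the follow-up patches ...
         */
        return 1;
}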

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
---
 include/linux/filter.h         |   3 +
 include/uapi/linux/bpf.h       |   9 ++
 kernel/bpf/cgroup.c            | 189 ++++++++++++++++++++++++++-------
 kernel/bpf/verifier.c          |   7 +-
 tools/include/uapi/linux/bpf.h |   9 ++
 tools/lib/bpf/libbpf.c         |   2 +
 6 files changed, 176 insertions(+), 43 deletions(-)

Comments

Stanislav Fomichev July 24, 2023, 6:36 p.m. UTC | #1
On 07/21, kuifeng@meta.com wrote:
> From: Kui-Feng Lee <kuifeng@meta.com>
> 
> Enable sleepable cgroup/{get,set}sockopt hooks.
> 
> The sleepable BPF programs attached to cgroup/{get,set}sockopt hooks may
> receive a pointer to the optval in user space instead of a kernel
> copy. ctx->user_optval and ctx->user_optval_end are pointers to the
> beginning and end of the user space buffer if the program receives a
> user space buffer, while ctx->optval and ctx->optval_end point to a
> kernel copy if it receives a kernel space buffer.
>
> A program receives a user space buffer if ctx->flags &
> BPF_SOCKOPT_FLAG_OPTVAL_USER is set; otherwise it receives a kernel
> space buffer.  BPF programs should not read from or write to a user
> space buffer directly.  They should access the buffer through
> bpf_copy_from_user() and bpf_copy_to_user(), provided in the following
> patches.
> 
> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
> ---
>  include/linux/filter.h         |   3 +
>  include/uapi/linux/bpf.h       |   9 ++
>  kernel/bpf/cgroup.c            | 189 ++++++++++++++++++++++++++-------
>  kernel/bpf/verifier.c          |   7 +-
>  tools/include/uapi/linux/bpf.h |   9 ++
>  tools/lib/bpf/libbpf.c         |   2 +
>  6 files changed, 176 insertions(+), 43 deletions(-)
> 
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index f69114083ec7..301dd1ba0de1 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1345,6 +1345,9 @@ struct bpf_sockopt_kern {
>  	s32		level;
>  	s32		optname;
>  	s32		optlen;
> +	u32		flags;
> +	u8		*user_optval;
> +	u8		*user_optval_end;
>  	/* for retval in struct bpf_cg_run_ctx */
>  	struct task_struct *current_task;
>  	/* Temporary "register" for indirect stores to ppos. */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 739c15906a65..b2f81193f97b 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -7135,6 +7135,15 @@ struct bpf_sockopt {
>  	__s32	optname;
>  	__s32	optlen;
>  	__s32	retval;
> +
> +	__bpf_md_ptr(void *, user_optval);
> +	__bpf_md_ptr(void *, user_optval_end);

Can we re-purpose existing optval/optval_end pointers
for the sleepable programs? IOW, when the prog is sleepable,
pass user pointers via optval/optval_end and require the programs
to do copy_to/from on this buffer (even if the backing pointer might be
in kernel memory - we can handle that in the kfuncs?).

The fact that the program now needs to look at the flag
(BPF_SOCKOPT_FLAG_OPTVAL_USER) and decide which buffer to
use makes the handling even more complicated; and we already have a
bunch of hairy stuff in these hooks. (or am I misreading the change?)

Also, regarding sleepable and non-sleepable co-existence: do we really need
that? Can we say that all the programs have to be sleepable
or non-sleepable? Mixing them complicates the sharing of that buffer.
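
To make that concrete, a sketch of a sleepable prog if ctx->optval itself
carried the user pointer and the copy helpers hid the backing memory
(nothing here is implemented; includes and license as in a normal
selftest):

SEC("cgroup/getsockopt.s")
int getopt_single_buffer(struct bpf_sockopt *ctx)
{
        char buf[16];
        int len = ctx->optlen;

        if (len <= 0)
                return 1;
        if (len > (int)sizeof(buf))
                len = sizeof(buf);

        /* One buffer, no flag check: the helper/kfunc would resolve
         * whether ctx->optval is backed by user or kernel memory.
         */
        if (bpf_copy_from_user(buf, len, ctx->optval))
                return 0;
        return 1;
}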
Kuifeng Lee July 31, 2023, 10:02 p.m. UTC | #2
Sorry for the late reply! I just got back from vacation.


On 7/24/23 11:36, Stanislav Fomichev wrote:
> On 07/21, kuifeng@meta.com wrote:
>> From: Kui-Feng Lee <kuifeng@meta.com>
>>
>> Enable sleepable cgroup/{get,set}sockopt hooks.
>>
>> The sleepable BPF programs attached to cgroup/{get,set}sockopt hooks may
>> receive a pointer to the optval in user space instead of a kernel
>> copy. ctx->user_optval and ctx->user_optval_end are pointers to the
>> beginning and end of the user space buffer if the program receives a
>> user space buffer, while ctx->optval and ctx->optval_end point to a
>> kernel copy if it receives a kernel space buffer.
>>
>> A program receives a user space buffer if ctx->flags &
>> BPF_SOCKOPT_FLAG_OPTVAL_USER is set; otherwise it receives a kernel
>> space buffer.  BPF programs should not read from or write to a user
>> space buffer directly.  They should access the buffer through
>> bpf_copy_from_user() and bpf_copy_to_user(), provided in the following
>> patches.
>>
>> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
>> ---
>>   include/linux/filter.h         |   3 +
>>   include/uapi/linux/bpf.h       |   9 ++
>>   kernel/bpf/cgroup.c            | 189 ++++++++++++++++++++++++++-------
>>   kernel/bpf/verifier.c          |   7 +-
>>   tools/include/uapi/linux/bpf.h |   9 ++
>>   tools/lib/bpf/libbpf.c         |   2 +
>>   6 files changed, 176 insertions(+), 43 deletions(-)
>>
>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>> index f69114083ec7..301dd1ba0de1 100644
>> --- a/include/linux/filter.h
>> +++ b/include/linux/filter.h
>> @@ -1345,6 +1345,9 @@ struct bpf_sockopt_kern {
>>   	s32		level;
>>   	s32		optname;
>>   	s32		optlen;
>> +	u32		flags;
>> +	u8		*user_optval;
>> +	u8		*user_optval_end;
>>   	/* for retval in struct bpf_cg_run_ctx */
>>   	struct task_struct *current_task;
>>   	/* Temporary "register" for indirect stores to ppos. */
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 739c15906a65..b2f81193f97b 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -7135,6 +7135,15 @@ struct bpf_sockopt {
>>   	__s32	optname;
>>   	__s32	optlen;
>>   	__s32	retval;
>> +
>> +	__bpf_md_ptr(void *, user_optval);
>> +	__bpf_md_ptr(void *, user_optval_end);
> 
> Can we re-purpose existing optval/optval_end pointers
> for the sleepable programs? IOW, when the prog is sleepable,
> pass user pointers via optval/optval_end and require the programs
> to do copy_to/from on this buffer (even if the backing pointer might be
> in kernel memory - we can handle that in the kfuncs?).
> 
> The fact that the program now needs to look at the flag
> (BPF_SOCKOPT_FLAG_OPTVAL_USER) and decide which buffer to
> use makes the handling even more complicated; and we already have a
> bunch of hairy stuff in these hooks. (or am I misreading the change?)
> 
> Also, regarding sleepable and non-sleepable co-existence: do we really need
> that? Can we say that all the programs have to be sleepable
> or non-sleepable? Mixing them complicates the sharing of that buffer.

I considered this approach as well. This is an open question for me.
If we go this way, it means we cannot attach a BPF program of one type
if any program of the other type is already installed.
Stanislav Fomichev July 31, 2023, 11:35 p.m. UTC | #3
On Mon, Jul 31, 2023 at 3:02 PM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>
> Sorry for the late reply! I just got back from vacation.
>
>
> On 7/24/23 11:36, Stanislav Fomichev wrote:
> > On 07/21, kuifeng@meta.com wrote:
> >> From: Kui-Feng Lee <kuifeng@meta.com>
> >>
> >> Enable sleepable cgroup/{get,set}sockopt hooks.
> >>
> >> The sleepable BPF programs attached to cgroup/{get,set}sockopt hooks may
> >> receive a pointer to the optval in user space instead of a kernel
> >> copy. ctx->user_optval and ctx->user_optval_end are pointers to the
> >> beginning and end of the user space buffer if the program receives a
> >> user space buffer, while ctx->optval and ctx->optval_end point to a
> >> kernel copy if it receives a kernel space buffer.
> >>
> >> A program receives a user space buffer if ctx->flags &
> >> BPF_SOCKOPT_FLAG_OPTVAL_USER is set; otherwise it receives a kernel
> >> space buffer.  BPF programs should not read from or write to a user
> >> space buffer directly.  They should access the buffer through
> >> bpf_copy_from_user() and bpf_copy_to_user(), provided in the following
> >> patches.
> >>
> >> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
> >> ---
> >>   include/linux/filter.h         |   3 +
> >>   include/uapi/linux/bpf.h       |   9 ++
> >>   kernel/bpf/cgroup.c            | 189 ++++++++++++++++++++++++++-------
> >>   kernel/bpf/verifier.c          |   7 +-
> >>   tools/include/uapi/linux/bpf.h |   9 ++
> >>   tools/lib/bpf/libbpf.c         |   2 +
> >>   6 files changed, 176 insertions(+), 43 deletions(-)
> >>
> >> diff --git a/include/linux/filter.h b/include/linux/filter.h
> >> index f69114083ec7..301dd1ba0de1 100644
> >> --- a/include/linux/filter.h
> >> +++ b/include/linux/filter.h
> >> @@ -1345,6 +1345,9 @@ struct bpf_sockopt_kern {
> >>      s32             level;
> >>      s32             optname;
> >>      s32             optlen;
> >> +    u32             flags;
> >> +    u8              *user_optval;
> >> +    u8              *user_optval_end;
> >>      /* for retval in struct bpf_cg_run_ctx */
> >>      struct task_struct *current_task;
> >>      /* Temporary "register" for indirect stores to ppos. */
> >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >> index 739c15906a65..b2f81193f97b 100644
> >> --- a/include/uapi/linux/bpf.h
> >> +++ b/include/uapi/linux/bpf.h
> >> @@ -7135,6 +7135,15 @@ struct bpf_sockopt {
> >>      __s32   optname;
> >>      __s32   optlen;
> >>      __s32   retval;
> >> +
> >> +    __bpf_md_ptr(void *, user_optval);
> >> +    __bpf_md_ptr(void *, user_optval_end);
> >
> > Can we re-purpose existing optval/optval_end pointers
> > for the sleepable programs? IOW, when the prog is sleepable,
> > pass user pointers via optval/optval_end and require the programs
> > to do copy_to/from on this buffer (even if the backing pointer might be
> > in kernel memory - we can handle that in the kfuncs?).
> >
> > The fact that the program now needs to look at the flag
> > (BPF_SOCKOPT_FLAG_OPTVAL_USER) and decide which buffer to
> > use makes the handling even more complicated; and we already have a
> > bunch of hairy stuff in these hooks. (or am I misreading the change?)
> >
> > Also, regarding sleepable and non-sleepable co-existence: do we really need
> > that? Can we say that all the programs have to be sleepable
> > or non-sleepable? Mixing them complicates the sharing of that buffer.
>
> I considered this approach as well. This is an open question for me.
> If we go this way, it means we cannot attach a BPF program of one type
> if any program of the other type is already installed.

If we pass two pointers (kernel copy buffer + real user mem) to the
sleepable program, we'll make it even more complicated by inheriting
all existing warts of the non-sleepable version :-(
IOW, feels like we should try to see if we can have some
copy_to/from_user kfuncs in the sleepable version that transparently
support either kernel or user memory (and prohibit direct access to
user_optval in the sleepable version).
And then, if we have one non-sleepable program in the chain, we can
fall back to the kernel buffer for everything (maybe).
This way seems like we can support both versions in the same chain and
have a more sane api?
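
A rough kernel-side sketch of such a transparent kfunc (the name and
placement are hypothetical, assuming it sits next to the sockopt code in
kernel/bpf/cgroup.c):

__bpf_kfunc long bpf_sockopt_copy_from(struct bpf_sockopt_kern *ctx,
                                       void *dst, u32 size)
{
        if (ctx->optlen < 0)
                return -EINVAL;
        if (size > ctx->optlen)
                size = ctx->optlen;

        /* Sleepable path: the buffer is real user memory and may fault. */
        if (ctx->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER)
                return copy_from_user(dst, (void __user *)ctx->user_optval,
                                      size) ? -EFAULT : size;

        /* Non-sleepable fallback: data was already copied into the kernel. */
        memcpy(dst, ctx->optval, size);
        return size;
}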
Kuifeng Lee Aug. 1, 2023, 5:31 p.m. UTC | #4
On 7/31/23 16:35, Stanislav Fomichev wrote:
> On Mon, Jul 31, 2023 at 3:02 PM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>>
>> Sorry for the late reply! I just got back from vacation.
>>
>>
>> On 7/24/23 11:36, Stanislav Fomichev wrote:
>>> On 07/21, kuifeng@meta.com wrote:
>>>> From: Kui-Feng Lee <kuifeng@meta.com>
>>>>
>>>> Enable sleepable cgroup/{get,set}sockopt hooks.
>>>>
>>>> The sleepable BPF programs attached to cgroup/{get,set}sockopt hooks may
>>>> receive a pointer to the optval in user space instead of a kernel
>>>> copy. ctx->user_optval and ctx->user_optval_end are pointers to the
>>>> beginning and end of the user space buffer if the program receives a
>>>> user space buffer, while ctx->optval and ctx->optval_end point to a
>>>> kernel copy if it receives a kernel space buffer.
>>>>
>>>> A program receives a user space buffer if ctx->flags &
>>>> BPF_SOCKOPT_FLAG_OPTVAL_USER is set; otherwise it receives a kernel
>>>> space buffer.  BPF programs should not read from or write to a user
>>>> space buffer directly.  They should access the buffer through
>>>> bpf_copy_from_user() and bpf_copy_to_user(), provided in the following
>>>> patches.
>>>>
>>>> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
>>>> ---
>>>>    include/linux/filter.h         |   3 +
>>>>    include/uapi/linux/bpf.h       |   9 ++
>>>>    kernel/bpf/cgroup.c            | 189 ++++++++++++++++++++++++++-------
>>>>    kernel/bpf/verifier.c          |   7 +-
>>>>    tools/include/uapi/linux/bpf.h |   9 ++
>>>>    tools/lib/bpf/libbpf.c         |   2 +
>>>>    6 files changed, 176 insertions(+), 43 deletions(-)
>>>>
>>>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>>>> index f69114083ec7..301dd1ba0de1 100644
>>>> --- a/include/linux/filter.h
>>>> +++ b/include/linux/filter.h
>>>> @@ -1345,6 +1345,9 @@ struct bpf_sockopt_kern {
>>>>       s32             level;
>>>>       s32             optname;
>>>>       s32             optlen;
>>>> +    u32             flags;
>>>> +    u8              *user_optval;
>>>> +    u8              *user_optval_end;
>>>>       /* for retval in struct bpf_cg_run_ctx */
>>>>       struct task_struct *current_task;
>>>>       /* Temporary "register" for indirect stores to ppos. */
>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>>> index 739c15906a65..b2f81193f97b 100644
>>>> --- a/include/uapi/linux/bpf.h
>>>> +++ b/include/uapi/linux/bpf.h
>>>> @@ -7135,6 +7135,15 @@ struct bpf_sockopt {
>>>>       __s32   optname;
>>>>       __s32   optlen;
>>>>       __s32   retval;
>>>> +
>>>> +    __bpf_md_ptr(void *, user_optval);
>>>> +    __bpf_md_ptr(void *, user_optval_end);
>>>
>>> Can we re-purpose existing optval/optval_end pointers
>>> for the sleepable programs? IOW, when the prog is sleepable,
>>> pass user pointers via optval/optval_end and require the programs
>>> to do copy_to/from on this buffer (even if the backing pointer might be
>>> in kernel memory - we can handle that in the kfuncs?).
>>>
>>> The fact that the program now needs to look at the flag
>>> (BPF_SOCKOPT_FLAG_OPTVAL_USER) and decide which buffer to
>>> use makes the handling even more complicated; and we already have a
>>> bunch of hairy stuff in these hooks. (or am I misreading the change?)
>>>
>>> Also, regarding sleepable and non-sleepable co-existence: do we really need
>>> that? Can we say that all the programs have to be sleepable
>>> or non-sleepable? Mixing them complicates the sharing of that buffer.
>>
>> I considered this approach as well. This is an open question for me.
>> If we go this way, it means we cannot attach a BPF program of one type
>> if any program of the other type is already installed.
> 
> If we pass two pointers (kernel copy buffer + real user mem) to the
> sleepable program, we'll make it even more complicated by inheriting
> all existing warts of the non-sleepable version :-(
> IOW, feels like we should try to see if we can have some
> copy_to/from_user kfuncs in the sleepable version that transparently
> support either kernel or user memory (and prohibit direct access to
> user_optval in the sleepable version).
> > And then, if we have one non-sleepable program in the chain, we can
> > fall back to the kernel buffer for everything (maybe).
> This way seems like we can support both versions in the same chain and
> have a more sane api?

Basically, you are saying to move cp_from_optval() and cp_to_optval() in
the testcase to kfuncs. That can cause an unnecessary copy. We can add
an API that makes a dynptr from the ctx to avoid unnecessary copies.
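
For example (bpf_sockopt_dynptr_from() is a made-up kfunc name for such
an API; includes and license as in the earlier sketch):

extern int bpf_sockopt_dynptr_from(struct bpf_sockopt *ctx,
                                   struct bpf_dynptr *ptr) __ksym;

SEC("cgroup/setsockopt.s")
int setopt_dynptr(struct bpf_sockopt *ctx)
{
        struct bpf_dynptr optval;
        char cc[16];

        /* Hypothetical kfunc: wrap the option buffer, user- or
         * kernel-backed, in a dynptr without an eager copy.
         */
        if (bpf_sockopt_dynptr_from(ctx, &optval))
                return 1;

        /* The read/write then does the user/kernel copy on demand,
         * only for progs that actually touch the value.
         */
        if (bpf_dynptr_read(cc, sizeof(cc), &optval, 0, 0))
                return 0;
        return 1;
}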
Stanislav Fomichev Aug. 1, 2023, 6:08 p.m. UTC | #5
On Tue, Aug 1, 2023 at 10:31 AM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>
>
>
> On 7/31/23 16:35, Stanislav Fomichev wrote:
> > On Mon, Jul 31, 2023 at 3:02 PM Kui-Feng Lee <sinquersw@gmail.com> wrote:
> >>
> >> Sorry for the late reply! I just got back from vacation.
> >>
> >>
> >> On 7/24/23 11:36, Stanislav Fomichev wrote:
> >>> On 07/21, kuifeng@meta.com wrote:
> >>>> From: Kui-Feng Lee <kuifeng@meta.com>
> >>>>
> >>>> Enable sleepable cgroup/{get,set}sockopt hooks.
> >>>>
> >>>> The sleepable BPF programs attached to cgroup/{get,set}sockopt hooks may
> >>>> receive a pointer to the optval in user space instead of a kernel
> >>>> copy. ctx->user_optval and ctx->user_optval_end are pointers to the
> >>>> beginning and end of the user space buffer if the program receives a
> >>>> user space buffer, while ctx->optval and ctx->optval_end point to a
> >>>> kernel copy if it receives a kernel space buffer.
> >>>>
> >>>> A program receives a user space buffer if ctx->flags &
> >>>> BPF_SOCKOPT_FLAG_OPTVAL_USER is set; otherwise it receives a kernel
> >>>> space buffer.  BPF programs should not read from or write to a user
> >>>> space buffer directly.  They should access the buffer through
> >>>> bpf_copy_from_user() and bpf_copy_to_user(), provided in the following
> >>>> patches.
> >>>>
> >>>> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
> >>>> ---
> >>>>    include/linux/filter.h         |   3 +
> >>>>    include/uapi/linux/bpf.h       |   9 ++
> >>>>    kernel/bpf/cgroup.c            | 189 ++++++++++++++++++++++++++-------
> >>>>    kernel/bpf/verifier.c          |   7 +-
> >>>>    tools/include/uapi/linux/bpf.h |   9 ++
> >>>>    tools/lib/bpf/libbpf.c         |   2 +
> >>>>    6 files changed, 176 insertions(+), 43 deletions(-)
> >>>>
> >>>> diff --git a/include/linux/filter.h b/include/linux/filter.h
> >>>> index f69114083ec7..301dd1ba0de1 100644
> >>>> --- a/include/linux/filter.h
> >>>> +++ b/include/linux/filter.h
> >>>> @@ -1345,6 +1345,9 @@ struct bpf_sockopt_kern {
> >>>>       s32             level;
> >>>>       s32             optname;
> >>>>       s32             optlen;
> >>>> +    u32             flags;
> >>>> +    u8              *user_optval;
> >>>> +    u8              *user_optval_end;
> >>>>       /* for retval in struct bpf_cg_run_ctx */
> >>>>       struct task_struct *current_task;
> >>>>       /* Temporary "register" for indirect stores to ppos. */
> >>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >>>> index 739c15906a65..b2f81193f97b 100644
> >>>> --- a/include/uapi/linux/bpf.h
> >>>> +++ b/include/uapi/linux/bpf.h
> >>>> @@ -7135,6 +7135,15 @@ struct bpf_sockopt {
> >>>>       __s32   optname;
> >>>>       __s32   optlen;
> >>>>       __s32   retval;
> >>>> +
> >>>> +    __bpf_md_ptr(void *, user_optval);
> >>>> +    __bpf_md_ptr(void *, user_optval_end);
> >>>
> >>> Can we re-purpose existing optval/optval_end pointers
> >>> for the sleepable programs? IOW, when the prog is sleepable,
> >>> pass user pointers via optval/optval_end and require the programs
> >>> to do copy_to/from on this buffer (even if the backing pointer might be
> >>> in kernel memory - we can handle that in the kfuncs?).
> >>>
> >>> The fact that the program now needs to look at the flag
> >>> (BPF_SOCKOPT_FLAG_OPTVAL_USER) and decide which buffer to
> >>> use makes the handling even more complicated; and we already have a
> >>> bunch of hairy stuff in these hooks. (or am I misreading the change?)
> >>>
> >>> Also, regarding sleepable and non-sleepable co-existence: do we really need
> >>> that? Can we say that all the programs have to be sleepable
> >>> or non-sleepable? Mixing them complicates the sharing of that buffer.
> >>
> >> I considered this approach as well. This is an open question for me.
> >> If we go this way, it means we cannot attach a BPF program of one type
> >> if any program of the other type is already installed.
> >
> > If we pass two pointers (kernel copy buffer + real user mem) to the
> > sleepable program, we'll make it even more complicated by inheriting
> > all existing warts of the non-sleepable version :-(
> > IOW, feels like we should try to see if we can have some
> > copy_to/from_user kfuncs in the sleepable version that transparently
> > support either kernel or user memory (and prohibit direct access to
> > user_optval in the sleepable version).
> > > And then, if we have one non-sleepable program in the chain, we can
> > > fall back to the kernel buffer for everything (maybe).
> > This way seems like we can support both versions in the same chain and
> > have a more sane api?
>
> Basically, you are saying to move cp_from_optval() and cp_to_optval() in
> the testcase to kfuncs. That can cause an unnecessary copy. We can add
> an API that makes a dynptr from the ctx to avoid unnecessary copies.

Yeah, handle this transparently in the kfunc or via dynptr, whatever works.
I'm not too worried about the extra copy tbh, this is a slow path; I'm
more concerned about improving the bpf program / user experience.
Martin KaFai Lau Aug. 2, 2023, 7:25 p.m. UTC | #6
On 7/21/23 10:22 PM, kuifeng@meta.com wrote:
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 739c15906a65..b2f81193f97b 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -7135,6 +7135,15 @@ struct bpf_sockopt {
>   	__s32	optname;
>   	__s32	optlen;
>   	__s32	retval;
> +
> +	__bpf_md_ptr(void *, user_optval);
> +	__bpf_md_ptr(void *, user_optval_end);
> +	__u32	flags;

I will follow up on the UAPI discussion in the other existing thread.

> +};
> +
> +enum bpf_sockopt_flags {
> +	/* optval is a pointer to user space memory */
> +	BPF_SOCKOPT_FLAG_OPTVAL_USER	= (1U << 0),
>   };
>   
>   struct bpf_pidns_info {
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index 5b2741aa0d9b..b268bbfa6c53 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -38,15 +38,26 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp,
>   	const struct bpf_prog_array *array;
>   	struct bpf_run_ctx *old_run_ctx;
>   	struct bpf_cg_run_ctx run_ctx;
> +	bool do_sleepable;
>   	u32 func_ret;
>   
> +	do_sleepable =
> +		atype == CGROUP_SETSOCKOPT || atype == CGROUP_GETSOCKOPT;
> +
>   	run_ctx.retval = retval;
>   	migrate_disable();
> -	rcu_read_lock();
> +	if (do_sleepable) {
> +		might_fault();
> +		rcu_read_lock_trace();
> +	} else
> +		rcu_read_lock();
>   	array = rcu_dereference(cgrp->effective[atype]);

array is now under rcu_read_lock_trace for the "do_sleepable" case.

The array free side should also be changed to wait for the tasks_trace gp.
Please check whether any bpf_prog_array_free() usages in cgroup.c should be
replaced with bpf_prog_array_free_sleepable().
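
Something along these lines (illustrative diff only; the actual call
sites in cgroup.c would need to be audited one by one):

--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ ... @@
-	bpf_prog_array_free(old_array);
+	/* The array may still be walked under rcu_read_lock_trace() by a
+	 * sleepable sockopt run; wait for the tasks-trace grace period
+	 * before freeing.
+	 */
+	bpf_prog_array_free_sleepable(old_array);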

Another high level comment: other cgroup hooks may get sleepable support in
the future, in particular SEC("lsm_cgroup"), considering that other lsm hooks
in SEC("lsm.s") already have sleepable support. Just bringing it up here for
awareness; this set can focus on get/setsockopt for now.

>   	item = &array->items[0];
>   	old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);
>   	while ((prog = READ_ONCE(item->prog))) {
> +		if (do_sleepable && !prog->aux->sleepable)
> +			rcu_read_lock();
> +
>   		run_ctx.prog_item = item;
>   		func_ret = run_prog(prog, ctx);
>   		if (ret_flags) {
> @@ -56,13 +67,43 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp,
>   		if (!func_ret && !IS_ERR_VALUE((long)run_ctx.retval))
>   			run_ctx.retval = -EPERM;
>   		item++;
> +
> +		if (do_sleepable && !prog->aux->sleepable)
> +			rcu_read_unlock();
>   	}
>   	bpf_reset_run_ctx(old_run_ctx);
> -	rcu_read_unlock();
> +	if (do_sleepable)
> +		rcu_read_unlock_trace();
> +	else
> +		rcu_read_unlock();
>   	migrate_enable();
>   	return run_ctx.retval;
>   }
>   
> +static __always_inline bool
> +has_only_sleepable_prog_cg(const struct cgroup_bpf *cgrp,
> +			 enum cgroup_bpf_attach_type atype)
> +{
> +	const struct bpf_prog_array_item *item;
> +	const struct bpf_prog *prog;
> +	int cnt = 0;
> +	bool ret = true;
> +
> +	rcu_read_lock();
> +	item = &rcu_dereference(cgrp->effective[atype])->items[0];
> +	while (ret && (prog = READ_ONCE(item->prog))) {
> +		if (!prog->aux->sleepable)
> +			ret = false;
> +		item++;
> +		cnt++;
> +	}
> +	rcu_read_unlock();
> +	if (cnt == 0)
> +		ret = false;
> +
> +	return ret;
> +}
> +
>   unsigned int __cgroup_bpf_run_lsm_sock(const void *ctx,
>   				       const struct bpf_insn *insn)
>   {
> @@ -1773,7 +1814,8 @@ static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen,
>   static void sockopt_free_buf(struct bpf_sockopt_kern *ctx,
>   			     struct bpf_sockopt_buf *buf)
>   {
> -	if (ctx->optval == buf->data)
> +	if (ctx->optval == buf->data ||
> +	    ctx->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER)
>   		return;
>   	kfree(ctx->optval);
>   }
> @@ -1781,7 +1823,8 @@ static void sockopt_free_buf(struct bpf_sockopt_kern *ctx,
>   static bool sockopt_buf_allocated(struct bpf_sockopt_kern *ctx,
>   				  struct bpf_sockopt_buf *buf)
>   {
> -	return ctx->optval != buf->data;
> +	return ctx->optval != buf->data &&
> +		!(ctx->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER);
>   }
>   
>   int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
> @@ -1796,21 +1839,31 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
>   		.optname = *optname,
>   	};
>   	int ret, max_optlen;
> +	bool alloc_mem;
> +
> +	alloc_mem = !has_only_sleepable_prog_cg(&cgrp->bpf, CGROUP_SETSOCKOPT);

hmm... I am not sure this would work. The cgroup->effective[atype] could have
been changed after this has_only_sleepable_prog_cg() check. For example, a
non-sleepable prog could be attached after this check. The latter
bpf_prog_run_array_cg() may then end up with an incorrect ctx.

A quick thought is that this alloc decision should be postponed into
bpf_prog_run_array_cg(). It may be better to have a different
bpf_prog_run_array_cg() for set/getsockopt here; not sure.
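
One way to sketch that (illustrative only): scan the array pointer that
bpf_prog_run_array_cg() has already dereferenced, inside the same RCU
section, instead of taking a separate rcu_read_lock() earlier:

static bool prog_array_all_sleepable(const struct bpf_prog_array *array)
{
	const struct bpf_prog_array_item *item = &array->items[0];
	const struct bpf_prog *prog;
	bool found = false;

	while ((prog = READ_ONCE(item->prog))) {
		if (!prog->aux->sleepable)
			return false;
		found = true;
		item++;
	}
	/* An empty array keeps the kernel-buffer path, as today. */
	return found;
}

Called right after array = rcu_dereference(cgrp->effective[atype]), the
check and the run would then see the same array and could not disagree.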

> +	if (!alloc_mem) {
> +		max_optlen = *optlen;
> +		ctx.optlen = *optlen;
> +		ctx.user_optval = optval;
> +		ctx.user_optval_end = optval + *optlen;
> +		ctx.flags = BPF_SOCKOPT_FLAG_OPTVAL_USER;
> +	} else {
> +		/* Allocate a bit more than the initial user buffer for
> +		 * BPF program. The canonical use case is overriding
> +		 * TCP_CONGESTION(nv) to TCP_CONGESTION(cubic).
> +		 */
> +		max_optlen = max_t(int, 16, *optlen);
> +		max_optlen = sockopt_alloc_buf(&ctx, max_optlen, &buf);
> +		if (max_optlen < 0)
> +			return max_optlen;
>   
> -	/* Allocate a bit more than the initial user buffer for
> -	 * BPF program. The canonical use case is overriding
> -	 * TCP_CONGESTION(nv) to TCP_CONGESTION(cubic).
> -	 */
> -	max_optlen = max_t(int, 16, *optlen);
> -	max_optlen = sockopt_alloc_buf(&ctx, max_optlen, &buf);
> -	if (max_optlen < 0)
> -		return max_optlen;
> -
> -	ctx.optlen = *optlen;
> +		ctx.optlen = *optlen;
>   
> -	if (copy_from_user(ctx.optval, optval, min(*optlen, max_optlen)) != 0) {
> -		ret = -EFAULT;
> -		goto out;
> +		if (copy_from_user(ctx.optval, optval, min(*optlen, max_optlen)) != 0) {
> +			ret = -EFAULT;
> +			goto out;
> +		}
>   	}
Martin KaFai Lau Aug. 2, 2023, 10:28 p.m. UTC | #7
On 8/1/23 11:08 AM, Stanislav Fomichev wrote:
> On Tue, Aug 1, 2023 at 10:31 AM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>>
>>
>>
>> On 7/31/23 16:35, Stanislav Fomichev wrote:
>>> On Mon, Jul 31, 2023 at 3:02 PM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>>>>
>>>> Sorry for the late reply! I just got back from vacation.
>>>>
>>>>
>>>> On 7/24/23 11:36, Stanislav Fomichev wrote:
>>>>> On 07/21, kuifeng@meta.com wrote:
>>>>>> From: Kui-Feng Lee <kuifeng@meta.com>
>>>>>>
>>>>>> Enable sleepable cgroup/{get,set}sockopt hooks.
>>>>>>
>>>>>> The sleepable BPF programs attached to cgroup/{get,set}sockopt hooks may
>>>>>> receive a pointer to the optval in user space instead of a kernel
>>>>>> copy. ctx->user_optval and ctx->user_optval_end are pointers to the
>>>>>> beginning and end of the user space buffer if the program receives a
>>>>>> user space buffer, while ctx->optval and ctx->optval_end point to a
>>>>>> kernel copy if it receives a kernel space buffer.
>>>>>>
>>>>>> A program receives a user space buffer if ctx->flags &
>>>>>> BPF_SOCKOPT_FLAG_OPTVAL_USER is set; otherwise it receives a kernel
>>>>>> space buffer.  BPF programs should not read from or write to a user
>>>>>> space buffer directly.  They should access the buffer through
>>>>>> bpf_copy_from_user() and bpf_copy_to_user(), provided in the following
>>>>>> patches.
>>>>>>
>>>>>> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
>>>>>> ---
>>>>>>     include/linux/filter.h         |   3 +
>>>>>>     include/uapi/linux/bpf.h       |   9 ++
>>>>>>     kernel/bpf/cgroup.c            | 189 ++++++++++++++++++++++++++-------
>>>>>>     kernel/bpf/verifier.c          |   7 +-
>>>>>>     tools/include/uapi/linux/bpf.h |   9 ++
>>>>>>     tools/lib/bpf/libbpf.c         |   2 +
>>>>>>     6 files changed, 176 insertions(+), 43 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>>>>>> index f69114083ec7..301dd1ba0de1 100644
>>>>>> --- a/include/linux/filter.h
>>>>>> +++ b/include/linux/filter.h
>>>>>> @@ -1345,6 +1345,9 @@ struct bpf_sockopt_kern {
>>>>>>        s32             level;
>>>>>>        s32             optname;
>>>>>>        s32             optlen;
>>>>>> +    u32             flags;
>>>>>> +    u8              *user_optval;
>>>>>> +    u8              *user_optval_end;
>>>>>>        /* for retval in struct bpf_cg_run_ctx */
>>>>>>        struct task_struct *current_task;
>>>>>>        /* Temporary "register" for indirect stores to ppos. */
>>>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>>>>> index 739c15906a65..b2f81193f97b 100644
>>>>>> --- a/include/uapi/linux/bpf.h
>>>>>> +++ b/include/uapi/linux/bpf.h
>>>>>> @@ -7135,6 +7135,15 @@ struct bpf_sockopt {
>>>>>>        __s32   optname;
>>>>>>        __s32   optlen;
>>>>>>        __s32   retval;
>>>>>> +
>>>>>> +    __bpf_md_ptr(void *, user_optval);
>>>>>> +    __bpf_md_ptr(void *, user_optval_end);
>>>>>
>>>>> Can we re-purpose existing optval/optval_end pointers
>>>>> for the sleepable programs? IOW, when the prog is sleepable,
>>>>> pass user pointers via optval/optval_end and require the programs
>>>>> to do copy_to/from on this buffer (even if the backing pointer might be
>>>>> in kernel memory - we can handle that in the kfuncs?).
>>>>>
>>>>> The fact that the program now needs to look at the flag
>>>>> (BPF_SOCKOPT_FLAG_OPTVAL_USER) and decide which buffer to
>>>>> use makes the handling even more complicated; and we already have a
>>>>> bunch of hairy stuff in these hooks. (or am I misreading the change?)
>>>>>
>>>>> Also, regarding sleepable and non-sleepable co-existence: do we really need
>>>>> that? Can we say that all the programs have to be sleepable
>>>>> or non-sleepable? Mixing them complicates the sharing of that buffer.
>>>>
>>>> I considered this approach as well. This is an open question for me.
>>>> If we go this way, it means we cannot attach a BPF program of one type
>>>> if any program of the other type is already installed.
>>>
>>> If we pass two pointers (kernel copy buffer + real user mem) to the
>>> sleepable program, we'll make it even more complicated by inheriting
>>> all existing warts of the non-sleepable version :-(
>>> IOW, feels like we should try to see if we can have some
>>> copy_to/from_user kfuncs in the sleepable version that transparently
>>> support either kernel or user memory (and prohibit direct access to
>>> user_optval in the sleepable version).

From looking at the patch 5 selftest, I also think exposing user_optval_* and
flags is not ideal. For example (correct me if I am wrong), in patch 3 dynptr
is only used for setsockopt to alloc. Intuitively, when developing a bpf prog,
I would expect to use bpf_dynptr_write() to write a new sockopt and be done.
However, it still needs an "install" step (by calling
bpf_sockopt_install_optval). I think the "install" part leaks too many
internal details.

Besides, adding both the new 'ctx->user_optval + len > ctx->user_optval_end'
pattern and the dynptr usage pattern together is counterproductive,
considering dynptr exists to avoid the length comparison. Saving an
unnecessary "copy_from_user(ctx.optval, optval, ...)" is more important than
being able to read directly from ctx->user_optval. A bpf prog is usually only
interested in a few optnames and returns directly, without even looking at
the optval, for the uninteresting optnames. The current
__cgroup_bpf_run_filter_{get,set}sockopt always does a
"copy_from_user(ctx.optval, optval, ...)".

>>> And then, if we have one non-sleepable program in the chain, we can
>>> fall back to the kernel buffer for everything (maybe).
>>> This way seems like we can support both versions in the same chain and
>>> have a more sane api?
>>
>> Basically, you are saying to move cp_from_optval() and cp_to_optval() in
>> the testcase to kfuncs. That can cause an unnecessary copy. We can add
>> an API that makes a dynptr from the ctx to avoid unnecessary copies.
> 
> Yeah, handle this transparently in the kfunc or via dynptr, whatever works.
> I'm not too worried about the extra copy tbh, this is a slow path; I'm
> more concerned about improving the bpf program / user experience.

+1. It would be great if everything could be done with two kfuncs
(/dynptr_{write,read}). I would disallow sleepable progs from using the
optval if that makes things simpler. If it goes with dynptr,
bpf_dynptr_slice() needs to be supported as well, which I think should be
doable after a quick thought.

The test needs to include a cgrp->effective array that has interleaved sleepable 
and non-sleepable bpf progs.
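
Roughly this shape on the userspace side (the "sockopt_mix" skeleton and
its prog names are made up for this sketch):

#include <bpf/bpf.h>
#include <bpf/libbpf.h>

static int attach_interleaved(struct sockopt_mix *skel, int cgroup_fd)
{
	int err;

	/* A non-sleepable prog first ... */
	err = bpf_prog_attach(bpf_program__fd(skel->progs.nonsleepable_cb),
			      cgroup_fd, BPF_CGROUP_SETSOCKOPT,
			      BPF_F_ALLOW_MULTI);
	if (err)
		return err;

	/* ... then a sleepable one, so both land interleaved in the
	 * same effective array.
	 */
	return bpf_prog_attach(bpf_program__fd(skel->progs.sleepable_cb),
			       cgroup_fd, BPF_CGROUP_SETSOCKOPT,
			       BPF_F_ALLOW_MULTI);
}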

Patch

diff --git a/include/linux/filter.h b/include/linux/filter.h
index f69114083ec7..301dd1ba0de1 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1345,6 +1345,9 @@  struct bpf_sockopt_kern {
 	s32		level;
 	s32		optname;
 	s32		optlen;
+	u32		flags;
+	u8		*user_optval;
+	u8		*user_optval_end;
 	/* for retval in struct bpf_cg_run_ctx */
 	struct task_struct *current_task;
 	/* Temporary "register" for indirect stores to ppos. */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 739c15906a65..b2f81193f97b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -7135,6 +7135,15 @@  struct bpf_sockopt {
 	__s32	optname;
 	__s32	optlen;
 	__s32	retval;
+
+	__bpf_md_ptr(void *, user_optval);
+	__bpf_md_ptr(void *, user_optval_end);
+	__u32	flags;
+};
+
+enum bpf_sockopt_flags {
+	/* optval is a pointer to user space memory */
+	BPF_SOCKOPT_FLAG_OPTVAL_USER	= (1U << 0),
 };
 
 struct bpf_pidns_info {
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 5b2741aa0d9b..b268bbfa6c53 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -38,15 +38,26 @@  bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp,
 	const struct bpf_prog_array *array;
 	struct bpf_run_ctx *old_run_ctx;
 	struct bpf_cg_run_ctx run_ctx;
+	bool do_sleepable;
 	u32 func_ret;
 
+	do_sleepable =
+		atype == CGROUP_SETSOCKOPT || atype == CGROUP_GETSOCKOPT;
+
 	run_ctx.retval = retval;
 	migrate_disable();
-	rcu_read_lock();
+	if (do_sleepable) {
+		might_fault();
+		rcu_read_lock_trace();
+	} else
+		rcu_read_lock();
 	array = rcu_dereference(cgrp->effective[atype]);
 	item = &array->items[0];
 	old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);
 	while ((prog = READ_ONCE(item->prog))) {
+		if (do_sleepable && !prog->aux->sleepable)
+			rcu_read_lock();
+
 		run_ctx.prog_item = item;
 		func_ret = run_prog(prog, ctx);
 		if (ret_flags) {
@@ -56,13 +67,43 @@  bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp,
 		if (!func_ret && !IS_ERR_VALUE((long)run_ctx.retval))
 			run_ctx.retval = -EPERM;
 		item++;
+
+		if (do_sleepable && !prog->aux->sleepable)
+			rcu_read_unlock();
 	}
 	bpf_reset_run_ctx(old_run_ctx);
-	rcu_read_unlock();
+	if (do_sleepable)
+		rcu_read_unlock_trace();
+	else
+		rcu_read_unlock();
 	migrate_enable();
 	return run_ctx.retval;
 }
 
+static __always_inline bool
+has_only_sleepable_prog_cg(const struct cgroup_bpf *cgrp,
+			 enum cgroup_bpf_attach_type atype)
+{
+	const struct bpf_prog_array_item *item;
+	const struct bpf_prog *prog;
+	int cnt = 0;
+	bool ret = true;
+
+	rcu_read_lock();
+	item = &rcu_dereference(cgrp->effective[atype])->items[0];
+	while (ret && (prog = READ_ONCE(item->prog))) {
+		if (!prog->aux->sleepable)
+			ret = false;
+		item++;
+		cnt++;
+	}
+	rcu_read_unlock();
+	if (cnt == 0)
+		ret = false;
+
+	return ret;
+}
+
 unsigned int __cgroup_bpf_run_lsm_sock(const void *ctx,
 				       const struct bpf_insn *insn)
 {
@@ -1773,7 +1814,8 @@  static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen,
 static void sockopt_free_buf(struct bpf_sockopt_kern *ctx,
 			     struct bpf_sockopt_buf *buf)
 {
-	if (ctx->optval == buf->data)
+	if (ctx->optval == buf->data ||
+	    ctx->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER)
 		return;
 	kfree(ctx->optval);
 }
@@ -1781,7 +1823,8 @@  static void sockopt_free_buf(struct bpf_sockopt_kern *ctx,
 static bool sockopt_buf_allocated(struct bpf_sockopt_kern *ctx,
 				  struct bpf_sockopt_buf *buf)
 {
-	return ctx->optval != buf->data;
+	return ctx->optval != buf->data &&
+		!(ctx->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER);
 }
 
 int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
@@ -1796,21 +1839,31 @@  int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
 		.optname = *optname,
 	};
 	int ret, max_optlen;
+	bool alloc_mem;
+
+	alloc_mem = !has_only_sleepable_prog_cg(&cgrp->bpf, CGROUP_SETSOCKOPT);
+	if (!alloc_mem) {
+		max_optlen = *optlen;
+		ctx.optlen = *optlen;
+		ctx.user_optval = optval;
+		ctx.user_optval_end = optval + *optlen;
+		ctx.flags = BPF_SOCKOPT_FLAG_OPTVAL_USER;
+	} else {
+		/* Allocate a bit more than the initial user buffer for
+		 * BPF program. The canonical use case is overriding
+		 * TCP_CONGESTION(nv) to TCP_CONGESTION(cubic).
+		 */
+		max_optlen = max_t(int, 16, *optlen);
+		max_optlen = sockopt_alloc_buf(&ctx, max_optlen, &buf);
+		if (max_optlen < 0)
+			return max_optlen;
 
-	/* Allocate a bit more than the initial user buffer for
-	 * BPF program. The canonical use case is overriding
-	 * TCP_CONGESTION(nv) to TCP_CONGESTION(cubic).
-	 */
-	max_optlen = max_t(int, 16, *optlen);
-	max_optlen = sockopt_alloc_buf(&ctx, max_optlen, &buf);
-	if (max_optlen < 0)
-		return max_optlen;
-
-	ctx.optlen = *optlen;
+		ctx.optlen = *optlen;
 
-	if (copy_from_user(ctx.optval, optval, min(*optlen, max_optlen)) != 0) {
-		ret = -EFAULT;
-		goto out;
+		if (copy_from_user(ctx.optval, optval, min(*optlen, max_optlen)) != 0) {
+			ret = -EFAULT;
+			goto out;
+		}
 	}
 
 	lock_sock(sk);
@@ -1824,7 +1877,8 @@  int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
 	if (ctx.optlen == -1) {
 		/* optlen set to -1, bypass kernel */
 		ret = 1;
-	} else if (ctx.optlen > max_optlen || ctx.optlen < -1) {
+	} else if (alloc_mem &&
+		   (ctx.optlen > max_optlen || ctx.optlen < -1)) {
 		/* optlen is out of bounds */
 		if (*optlen > PAGE_SIZE && ctx.optlen >= 0) {
 			pr_info_once("bpf setsockopt: ignoring program buffer with optlen=%d (max_optlen=%d)\n",
@@ -1846,6 +1900,8 @@  int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
 		 */
 		if (ctx.optlen != 0) {
 			*optlen = ctx.optlen;
+			if (ctx.flags & BPF_SOCKOPT_FLAG_OPTVAL_USER)
+				return 0;
 			/* We've used bpf_sockopt_kern->buf as an intermediary
 			 * storage, but the BPF program indicates that we need
 			 * to pass this data to the kernel setsockopt handler.
@@ -1892,33 +1948,59 @@  int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level,
 
 	orig_optlen = max_optlen;
 	ctx.optlen = max_optlen;
-	max_optlen = sockopt_alloc_buf(&ctx, max_optlen, &buf);
-	if (max_optlen < 0)
-		return max_optlen;
+	if (has_only_sleepable_prog_cg(&cgrp->bpf, CGROUP_GETSOCKOPT)) {
+		if (!retval) {
+			/* If kernel getsockopt finished successfully,
+			 * copy whatever was returned to the user back
+			 * into our temporary buffer. Set optlen to the
+			 * one that kernel returned as well to let
+			 * BPF programs inspect the value.
+			 */
 
-	if (!retval) {
-		/* If kernel getsockopt finished successfully,
-		 * copy whatever was returned to the user back
-		 * into our temporary buffer. Set optlen to the
-		 * one that kernel returned as well to let
-		 * BPF programs inspect the value.
-		 */
+			if (get_user(ctx.optlen, optlen)) {
+				ret = -EFAULT;
+				goto out;
+			}
 
-		if (get_user(ctx.optlen, optlen)) {
-			ret = -EFAULT;
-			goto out;
+			if (ctx.optlen < 0) {
+				ret = -EFAULT;
+				goto out;
+			}
+			orig_optlen = ctx.optlen;
 		}
 
-		if (ctx.optlen < 0) {
-			ret = -EFAULT;
-			goto out;
-		}
-		orig_optlen = ctx.optlen;
+		ctx.user_optval = optval;
+		ctx.user_optval_end = optval + *optlen;
+		ctx.flags = BPF_SOCKOPT_FLAG_OPTVAL_USER;
+	} else {
+		max_optlen = sockopt_alloc_buf(&ctx, max_optlen, &buf);
+		if (max_optlen < 0)
+			return max_optlen;
+
+		if (!retval) {
+			/* If kernel getsockopt finished successfully,
+			 * copy whatever was returned to the user back
+			 * into our temporary buffer. Set optlen to the
+			 * one that kernel returned as well to let
+			 * BPF programs inspect the value.
+			 */
 
-		if (copy_from_user(ctx.optval, optval,
-				   min(ctx.optlen, max_optlen)) != 0) {
-			ret = -EFAULT;
-			goto out;
+			if (get_user(ctx.optlen, optlen)) {
+				ret = -EFAULT;
+				goto out;
+			}
+
+			if (ctx.optlen < 0) {
+				ret = -EFAULT;
+				goto out;
+			}
+			orig_optlen = ctx.optlen;
+
+			if (copy_from_user(ctx.optval, optval,
+					   min(ctx.optlen, max_optlen)) != 0) {
+				ret = -EFAULT;
+				goto out;
+			}
 		}
 	}
 
@@ -1942,7 +2024,9 @@  int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level,
 	}
 
 	if (ctx.optlen != 0) {
-		if (optval && copy_to_user(optval, ctx.optval, ctx.optlen)) {
+		if (optval &&
+		    !(ctx.flags & BPF_SOCKOPT_FLAG_OPTVAL_USER) &&
+		    copy_to_user(optval, ctx.optval, ctx.optlen)) {
 			ret = -EFAULT;
 			goto out;
 		}
@@ -2388,6 +2472,20 @@  static bool cg_sockopt_is_valid_access(int off, int size,
 		if (size != size_default)
 			return false;
 		return prog->expected_attach_type == BPF_CGROUP_GETSOCKOPT;
+	case offsetof(struct bpf_sockopt, user_optval):
+		if (size != sizeof(__u64))
+			return false;
+		info->reg_type = PTR_TO_PACKET;
+		break;
+	case offsetof(struct bpf_sockopt, user_optval_end):
+		if (size != sizeof(__u64))
+			return false;
+		info->reg_type = PTR_TO_PACKET_END;
+		break;
+	case offsetof(struct bpf_sockopt, flags):
+		if (size != sizeof(__u32))
+			return false;
+		break;
 	default:
 		if (size != size_default)
 			return false;
@@ -2481,6 +2579,15 @@  static u32 cg_sockopt_convert_ctx_access(enum bpf_access_type type,
 	case offsetof(struct bpf_sockopt, optval_end):
 		*insn++ = CG_SOCKOPT_READ_FIELD(optval_end);
 		break;
+	case offsetof(struct bpf_sockopt, user_optval):
+		*insn++ = CG_SOCKOPT_READ_FIELD(user_optval);
+		break;
+	case offsetof(struct bpf_sockopt, user_optval_end):
+		*insn++ = CG_SOCKOPT_READ_FIELD(user_optval_end);
+		break;
+	case offsetof(struct bpf_sockopt, flags):
+		*insn++ = CG_SOCKOPT_READ_FIELD(flags);
+		break;
 	}
 
 	return insn - insn_buf;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 803b91135ca0..8cb25a775119 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -19299,9 +19299,11 @@  static bool can_be_sleepable(struct bpf_prog *prog)
 			return false;
 		}
 	}
+
 	return prog->type == BPF_PROG_TYPE_LSM ||
 	       prog->type == BPF_PROG_TYPE_KPROBE /* only for uprobes */ ||
-	       prog->type == BPF_PROG_TYPE_STRUCT_OPS;
+	       prog->type == BPF_PROG_TYPE_STRUCT_OPS ||
+	       prog->type == BPF_PROG_TYPE_CGROUP_SOCKOPT;
 }
 
 static int check_attach_btf_id(struct bpf_verifier_env *env)
@@ -19323,7 +19325,8 @@  static int check_attach_btf_id(struct bpf_verifier_env *env)
 	}
 
 	if (prog->aux->sleepable && !can_be_sleepable(prog)) {
-		verbose(env, "Only fentry/fexit/fmod_ret, lsm, iter, uprobe, and struct_ops programs can be sleepable\n");
+		verbose(env, "Only fentry/fexit/fmod_ret, lsm, iter, uprobe,"
+		verbose(env, "Only fentry/fexit/fmod_ret, lsm, iter, uprobe, "
 		return -EINVAL;
 	}
 
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 739c15906a65..b2f81193f97b 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -7135,6 +7135,15 @@  struct bpf_sockopt {
 	__s32	optname;
 	__s32	optlen;
 	__s32	retval;
+
+	__bpf_md_ptr(void *, user_optval);
+	__bpf_md_ptr(void *, user_optval_end);
+	__u32	flags;
+};
+
+enum bpf_sockopt_flags {
+	/* optval is a pointer to user space memory */
+	BPF_SOCKOPT_FLAG_OPTVAL_USER	= (1U << 0),
 };
 
 struct bpf_pidns_info {
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 17883f5a44b9..3be9270bbc33 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -8766,7 +8766,9 @@  static const struct bpf_sec_def section_defs[] = {
 	SEC_DEF("cgroup/getsockname6",	CGROUP_SOCK_ADDR, BPF_CGROUP_INET6_GETSOCKNAME, SEC_ATTACHABLE),
 	SEC_DEF("cgroup/sysctl",	CGROUP_SYSCTL, BPF_CGROUP_SYSCTL, SEC_ATTACHABLE),
 	SEC_DEF("cgroup/getsockopt",	CGROUP_SOCKOPT, BPF_CGROUP_GETSOCKOPT, SEC_ATTACHABLE),
+	SEC_DEF("cgroup/getsockopt.s",	CGROUP_SOCKOPT, BPF_CGROUP_GETSOCKOPT, SEC_ATTACHABLE | SEC_SLEEPABLE),
 	SEC_DEF("cgroup/setsockopt",	CGROUP_SOCKOPT, BPF_CGROUP_SETSOCKOPT, SEC_ATTACHABLE),
+	SEC_DEF("cgroup/setsockopt.s",	CGROUP_SOCKOPT, BPF_CGROUP_SETSOCKOPT, SEC_ATTACHABLE | SEC_SLEEPABLE),
 	SEC_DEF("cgroup/dev",		CGROUP_DEVICE, BPF_CGROUP_DEVICE, SEC_ATTACHABLE_OPT),
 	SEC_DEF("struct_ops+",		STRUCT_OPS, 0, SEC_NONE),
 	SEC_DEF("struct_ops.s+",	STRUCT_OPS, 0, SEC_SLEEPABLE),