Message ID | 20230815174712.660956-5-thinker.li@gmail.com (mailing list archive)
---|---
State | Superseded
Delegated to: | BPF
Series | Sleepable BPF programs on cgroup {get,set}sockopt
On Tue, Aug 15, 2023 at 10:47:11AM -0700, thinker.li@gmail.com wrote:
>
> +BTF_SET8_START(cgroup_common_btf_ids)
> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_copy_to, KF_SLEEPABLE)
> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_alloc, KF_SLEEPABLE)
> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_install, KF_SLEEPABLE)
> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_release, KF_SLEEPABLE)
> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_from, KF_SLEEPABLE)
> +BTF_SET8_END(cgroup_common_btf_ids)

These shouldn't be sockopt specific.
If we want a dynptr to represent a pointer to contiguous user memory
we should use a generic kfunc that does so.

I suspect a single new kfunc, bpf_dynptr_from_user_mem(), would do.
The new dynptr type can be hidden in the kernel, and all existing
kfuncs (dynptr_slice, dynptr_data, dynptr_write) could be made to work
with user memory.

But I think we have to step back. Why do we need this whole thing in the first place?
_why_ do sockopt bpf progs need to read and write user memory?

Yes, there is the one-page limit, but what is the use case for actually reading
and writing beyond that? iptables sockopt was mentioned, but I don't think a bpf
prog can do anything useful with iptables binary blobs. They are hard enough for
the kernel to parse.
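For reference, a minimal sketch of the generic kfunc suggested above. The name bpf_dynptr_from_user_mem() comes from the mail; the signature, the BPF_DYNPTR_TYPE_USER type, and the copy_{from,to}_user() dispatch idea are assumptions, not existing kernel API:

/* Hypothetical sketch only: a generic dynptr over user memory.
 * BPF_DYNPTR_TYPE_USER and the exact signature are assumptions;
 * no such kfunc or dynptr type exists in the tree at this point.
 */
__bpf_kfunc int bpf_dynptr_from_user_mem(void __user *umem, u32 size,
					 struct bpf_dynptr_kern *ptr__uninit)
{
	int err;

	err = bpf_dynptr_check_size(size);
	if (err)
		return err;

	/* Only record the user pointer; dynptr_read/write/slice would
	 * then dispatch to copy_{from,to}_user() for this dynptr type.
	 */
	bpf_dynptr_init(ptr__uninit, (__force void *)umem,
			BPF_DYNPTR_TYPE_USER, 0, size);
	return 0;
}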
On 8/16/23 18:25, Alexei Starovoitov wrote:
> On Tue, Aug 15, 2023 at 10:47:11AM -0700, thinker.li@gmail.com wrote:
>>
>> +BTF_SET8_START(cgroup_common_btf_ids)
>> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_copy_to, KF_SLEEPABLE)
>> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_alloc, KF_SLEEPABLE)
>> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_install, KF_SLEEPABLE)
>> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_release, KF_SLEEPABLE)
>> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_from, KF_SLEEPABLE)
>> +BTF_SET8_END(cgroup_common_btf_ids)
>
> These shouldn't be sockopt specific.
> If we want a dynptr to represent a pointer to contiguous user memory
> we should use a generic kfunc that does so.
>
> I suspect a single new kfunc, bpf_dynptr_from_user_mem(), would do.
> The new dynptr type can be hidden in the kernel, and all existing
> kfuncs (dynptr_slice, dynptr_data, dynptr_write) could be made to work
> with user memory.
>
> But I think we have to step back. Why do we need this whole thing in the first place?
> _why_ do sockopt bpf progs need to read and write user memory?
>
> Yes, there is the one-page limit, but what is the use case for actually reading
> and writing beyond that? iptables sockopt was mentioned, but I don't think a bpf
> prog can do anything useful with iptables binary blobs. They are hard enough for
> the kernel to parse.

The idea behind the design is to let the developers of filters decide
when to replace the existing buffer, and to access the content of
buffers just like accessing raw pointers. However, it seems almost everyone
prefers to use *_read() & *_write(). I will move in that direction.
On Thu, Aug 17, 2023 at 12:00 PM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>
> On 8/16/23 18:25, Alexei Starovoitov wrote:
> > On Tue, Aug 15, 2023 at 10:47:11AM -0700, thinker.li@gmail.com wrote:
> >>
> >> +BTF_SET8_START(cgroup_common_btf_ids)
> >> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_copy_to, KF_SLEEPABLE)
> >> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_alloc, KF_SLEEPABLE)
> >> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_install, KF_SLEEPABLE)
> >> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_release, KF_SLEEPABLE)
> >> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_from, KF_SLEEPABLE)
> >> +BTF_SET8_END(cgroup_common_btf_ids)
> >
> > These shouldn't be sockopt specific.
> > If we want a dynptr to represent a pointer to contiguous user memory
> > we should use a generic kfunc that does so.
> >
> > I suspect a single new kfunc, bpf_dynptr_from_user_mem(), would do.
> > The new dynptr type can be hidden in the kernel, and all existing
> > kfuncs (dynptr_slice, dynptr_data, dynptr_write) could be made to work
> > with user memory.
> >
> > But I think we have to step back. Why do we need this whole thing in the first place?
> > _why_ do sockopt bpf progs need to read and write user memory?
> >
> > Yes, there is the one-page limit, but what is the use case for actually reading
> > and writing beyond that? iptables sockopt was mentioned, but I don't think a bpf
> > prog can do anything useful with iptables binary blobs. They are hard enough for
> > the kernel to parse.
>
> The idea behind the design is to let the developers of filters decide
> when to replace the existing buffer, and to access the content of
> buffers just like accessing raw pointers. However, it seems almost everyone
> prefers to use *_read() & *_write(). I will move in that direction.

That doesn't answer my question about the use case.
What kind of filters want to parse more than 4k of sockopt data?
On 8/16/23 6:25 PM, Alexei Starovoitov wrote:
> But I think we have to step back. Why do we need this whole thing in the first place?
> _why_ do sockopt bpf progs need to read and write user memory?
>
> Yes, there is the one-page limit, but what is the use case for actually reading
> and writing beyond that? iptables sockopt was mentioned, but I don't think a bpf
> prog can do anything useful with iptables binary blobs. They are hard enough for
> the kernel to parse.

Usually the bpf prog is only interested in a very small number of optnames
and has no need to read the optval at all for most cases. The max size for
our use cases is 16 bytes. The kernel currently is kind of doing the
opposite: it always assumes the bpf prog needs to use the optval, allocates
kernel memory and does copy_from_user such that the non-sleepable bpf
program can read/write it.

The bpf prog usually checks the optname and then just returns for most cases:

SEC("cgroup/getsockopt")
int get_internal_sockopt(struct bpf_sockopt *ctx)
{
	if (ctx->optname != MY_INTERNAL_SOCKOPT)
		return 1;

	/* change the ctx->optval and return to user space ... */
}

When the optlen is > PAGE_SIZE, the kernel only allocates PAGE_SIZE memory
and copies the first PAGE_SIZE bytes from the user optval. We used to ask
the bpf prog to explicitly set the optlen to 0 for the > PAGE_SIZE case
even if it had not looked at the optval. Otherwise, the kernel used to
conclude that the bpf prog had set an invalid optlen, because optlen was
larger than optval_end - optval, and incorrectly returned -EFAULT to the
end user.

The bpf prog started doing this > PAGE_SIZE check and setting optlen = 0
due to an internal kernel PAGE_SIZE limitation:

SEC("cgroup/getsockopt")
int get_internal_sockopt(struct bpf_sockopt *ctx)
{
	if (ctx->optname != MY_INTERNAL_SOCKOPT) {
		/* only do that for ctx->optlen > PAGE_SIZE.
		 * otherwise, the following cgroup bpf prog will
		 * not be able to use the optval that it may
		 * be interested in.
		 */
		if (ctx->optlen > PAGE_SIZE)
			ctx->optlen = 0;
		return 1;
	}

	/* change the ctx->optval and return to user space ... */
}

The above has been worked around in commit 29ebbba7d461 ("bpf: Don't
EFAULT for {g,s}setsockopt with wrong optlen").

Later, it was reported that for an optname (NETLINK_LIST_MEMBERSHIPS) the
kernel allows a user to pass a NULL optval and use the optlen returned by
getsockopt to learn the buffer space required. The bpf prog then needs to
remove the optlen > PAGE_SIZE check and set optlen to 0 for _all_ optnames
that it is not interested in, while risking that the following cgroup prog
may not be able to use some of the optval:

SEC("cgroup/getsockopt")
int get_internal_sockopt(struct bpf_sockopt *ctx)
{
	if (ctx->optname != MY_INTERNAL_SOCKOPT) {

		/* Do that for all optnames that you are not interested in.
		 * The latter cgroup bpf will not be able to use the optval.
		 */
		ctx->optlen = 0;
		return 1;
	}

	/* change the ctx->optval and return to user space ... */
}

The above case has been addressed in commit 00e74ae08638 ("bpf: Don't
EFAULT for getsockopt with optval=NULL").

To avoid other potential optname cases that may run into a similar
situation and require the bpf prog to work around it again like the above,
it needs a way to track whether a bpf prog has changed the optval at
runtime. If it is not changed, use the result from the kernel
set/getsockopt. If reading/writing to optval is done through a kfunc, this
can be tracked. The kfunc can also directly read/write the user memory in
optval, avoiding the pre-allocated kernel memory and the PAGE_SIZE limit,
but this is a minor point.
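For context, the NULL-optval probing pattern mentioned above looks roughly like this from userspace (a minimal sketch; the helper name is illustrative):

#include <stdio.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/* Ask the kernel how much buffer space NETLINK_LIST_MEMBERSHIPS needs:
 * a NULL optval with optlen 0 makes getsockopt() report the required
 * size in optlen instead of filling a buffer.
 */
static int membership_buf_size(int fd)
{
	socklen_t optlen = 0;

	if (getsockopt(fd, SOL_NETLINK, NETLINK_LIST_MEMBERSHIPS,
		       NULL, &optlen) < 0) {
		perror("getsockopt");
		return -1;
	}
	return optlen;
}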
On 8/17/23 1:41 PM, Martin KaFai Lau wrote:
> On 8/16/23 6:25 PM, Alexei Starovoitov wrote:
>> But I think we have to step back. Why do we need this whole thing in
>> the first place?
>> _why_ do sockopt bpf progs need to read and write user memory?
>>
>> Yes, there is the one-page limit, but what is the use case for actually
>> reading and writing beyond that? iptables sockopt was mentioned, but I
>> don't think a bpf prog can do anything useful with iptables binary
>> blobs. They are hard enough for the kernel to parse.
>
> Usually the bpf prog is only interested in a very small number of
> optnames and has no need to read the optval at all for most cases. The
> max size for our use cases is 16 bytes. The kernel currently is kind of
> doing the opposite: it always assumes the bpf prog needs to use the
> optval, allocates kernel memory and does copy_from_user such that the
> non-sleepable bpf program can read/write it.
>
> The bpf prog usually checks the optname and then just returns for most
> cases:
>
> SEC("cgroup/getsockopt")
> int get_internal_sockopt(struct bpf_sockopt *ctx)
> {
>     if (ctx->optname != MY_INTERNAL_SOCKOPT)
>         return 1;
>
>     /* change the ctx->optval and return to user space ... */
> }
>
> When the optlen is > PAGE_SIZE, the kernel only allocates PAGE_SIZE
> memory and copies the first PAGE_SIZE bytes from the user optval. We
> used to ask the bpf prog to explicitly set the optlen to 0 for the
> > PAGE_SIZE case even if it had not looked at the optval. Otherwise, the
> kernel used to conclude that the bpf prog had set an invalid optlen,
> because optlen was larger than optval_end - optval, and incorrectly
> returned -EFAULT to the end user.
>
> The bpf prog started doing this > PAGE_SIZE check and setting optlen = 0
> due to an internal kernel PAGE_SIZE limitation:
>
> SEC("cgroup/getsockopt")
> int get_internal_sockopt(struct bpf_sockopt *ctx)
> {
>     if (ctx->optname != MY_INTERNAL_SOCKOPT) {
>         /* only do that for ctx->optlen > PAGE_SIZE.
>          * otherwise, the following cgroup bpf prog will
>          * not be able to use the optval that it may
>          * be interested in.
>          */
>         if (ctx->optlen > PAGE_SIZE)
>             ctx->optlen = 0;
>         return 1;
>     }
>
>     /* change the ctx->optval and return to user space ... */
> }
>
> The above has been worked around in commit 29ebbba7d461 ("bpf: Don't
> EFAULT for {g,s}setsockopt with wrong optlen").
>
> Later, it was reported that for an optname (NETLINK_LIST_MEMBERSHIPS)
> the kernel allows a user to pass a NULL optval and use the optlen
> returned by getsockopt to learn the buffer space required. The bpf prog
> then needs to remove the optlen > PAGE_SIZE check and set optlen to 0
> for _all_ optnames that it is not interested in, while risking that the
> following cgroup prog may not be able to use some of the optval:
>
> SEC("cgroup/getsockopt")
> int get_internal_sockopt(struct bpf_sockopt *ctx)
> {
>     if (ctx->optname != MY_INTERNAL_SOCKOPT) {
>
>         /* Do that for all optnames that you are not interested in.
>          * The latter cgroup bpf will not be able to use the optval.
>          */
>         ctx->optlen = 0;
>         return 1;
>     }
>
>     /* change the ctx->optval and return to user space ... */
> }
>
> The above case has been addressed in commit 00e74ae08638 ("bpf: Don't
> EFAULT for getsockopt with optval=NULL").
>
> To avoid other potential optname cases that may run into a similar
> situation and require the bpf prog to work around it again like the
> above, it needs a way to track whether a bpf prog has changed the optval
> at runtime.
> If it is not changed, use the result from the kernel

Can we add a field in the bpf_sockopt uapi struct so the bpf_prog can set it
if optval is changed?

struct bpf_sockopt {
        __bpf_md_ptr(struct bpf_sock *, sk);
        __bpf_md_ptr(void *, optval);
        __bpf_md_ptr(void *, optval_end);

        __s32   level;
        __s32   optname;
        __s32   optlen;
        __s32   retval;
};

> set/getsockopt. If reading/writing to optval is done through a kfunc,
> this can be tracked. The kfunc can also directly read/write the user
> memory in optval, avoiding the pre-allocated kernel memory and the
> PAGE_SIZE limit, but this is a minor point.
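A minimal sketch of how this proposal might look. The optval_changed field and the prog below are hypothetical; neither is part of the current uapi:

/* Hypothetical: an extra uapi field the prog sets after writing optval. */
struct bpf_sockopt {
        __bpf_md_ptr(struct bpf_sock *, sk);
        __bpf_md_ptr(void *, optval);
        __bpf_md_ptr(void *, optval_end);

        __s32   level;
        __s32   optname;
        __s32   optlen;
        __s32   retval;
        __s32   optval_changed; /* hypothetical, not in the current uapi */
};

SEC("cgroup/getsockopt")
int get_internal_sockopt(struct bpf_sockopt *ctx)
{
        if (ctx->optname != MY_INTERNAL_SOCKOPT)
                return 1;

        /* ... change ctx->optval ... */
        ctx->optval_changed = 1; /* tell the kernel optval was rewritten */
        return 1;
}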
On Thu, Aug 17, 2023 at 1:41 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 8/16/23 6:25 PM, Alexei Starovoitov wrote:
> > But I think we have to step back. Why do we need this whole thing in the first place?
> > _why_ do sockopt bpf progs need to read and write user memory?
> >
> > Yes, there is the one-page limit, but what is the use case for actually reading
> > and writing beyond that? iptables sockopt was mentioned, but I don't think a bpf
> > prog can do anything useful with iptables binary blobs. They are hard enough for
> > the kernel to parse.
>
> Usually the bpf prog is only interested in a very small number of optnames and
> has no need to read the optval at all for most cases. The max size for our use
> cases is 16 bytes. The kernel currently is kind of doing the opposite: it always
> assumes the bpf prog needs to use the optval, allocates kernel memory and does
> copy_from_user such that the non-sleepable bpf program can read/write it.
>
> The bpf prog usually checks the optname and then just returns for most cases:
>
> SEC("cgroup/getsockopt")
> int get_internal_sockopt(struct bpf_sockopt *ctx)
> {
>     if (ctx->optname != MY_INTERNAL_SOCKOPT)
>         return 1;
>
>     /* change the ctx->optval and return to user space ... */
> }
>
> When the optlen is > PAGE_SIZE, the kernel only allocates PAGE_SIZE memory and
> copies the first PAGE_SIZE bytes from the user optval. We used to ask the bpf
> prog to explicitly set the optlen to 0 for the > PAGE_SIZE case even if it had
> not looked at the optval. Otherwise, the kernel used to conclude that the bpf
> prog had set an invalid optlen, because optlen was larger than optval_end -
> optval, and incorrectly returned -EFAULT to the end user.
>
> The bpf prog started doing this > PAGE_SIZE check and setting optlen = 0 due to
> an internal kernel PAGE_SIZE limitation:
>
> SEC("cgroup/getsockopt")
> int get_internal_sockopt(struct bpf_sockopt *ctx)
> {
>     if (ctx->optname != MY_INTERNAL_SOCKOPT) {
>         /* only do that for ctx->optlen > PAGE_SIZE.
>          * otherwise, the following cgroup bpf prog will
>          * not be able to use the optval that it may
>          * be interested in.
>          */
>         if (ctx->optlen > PAGE_SIZE)
>             ctx->optlen = 0;
>         return 1;
>     }
>
>     /* change the ctx->optval and return to user space ... */
> }
>
> The above has been worked around in commit 29ebbba7d461 ("bpf: Don't EFAULT for
> {g,s}setsockopt with wrong optlen").
>
> Later, it was reported that for an optname (NETLINK_LIST_MEMBERSHIPS) the
> kernel allows a user to pass a NULL optval and use the optlen returned by
> getsockopt to learn the buffer space required. The bpf prog then needs to remove
> the optlen > PAGE_SIZE check and set optlen to 0 for _all_ optnames that it is
> not interested in, while risking that the following cgroup prog may not be able
> to use some of the optval:
>
> SEC("cgroup/getsockopt")
> int get_internal_sockopt(struct bpf_sockopt *ctx)
> {
>     if (ctx->optname != MY_INTERNAL_SOCKOPT) {
>
>         /* Do that for all optnames that you are not interested in.
>          * The latter cgroup bpf will not be able to use the optval.
>          */
>         ctx->optlen = 0;
>         return 1;
>     }
>
>     /* change the ctx->optval and return to user space ... */
> }
>
> The above case has been addressed in commit 00e74ae08638 ("bpf: Don't EFAULT for
> getsockopt with optval=NULL").

Agree with all of the above. The existing bpf sockopt interface was
problematic, but with these workarounds we fixed all known issues.
> To avoid other potential optname cases that may run into a similar situation
> and require the bpf prog to work around it again like the above, it needs a way
> to track whether a bpf prog has changed the optval at runtime. If it is not
> changed, use the result from the kernel set/getsockopt. If reading/writing to
> optval is done through a kfunc, this can be tracked. The kfunc can also directly
> read/write the user memory in optval, avoiding the pre-allocated kernel memory
> and the PAGE_SIZE limit, but this is a minor point.

so I'm still not following how sleepable progs that can access everything
would help the existing situation.
I agree that sleepable bpf sockopt should be free from old mistakes,
but people might still write old-style non-sleepable bpf sockopt progs and
might repeat the same mistakes.
I'm missing the value of the new interface. It's better, sure, but why?
Do we expect to rewrite existing sockopt progs in a sleepable way?
It might not be easy due to sleepable limitations like maps and whatever else.
On 8/17/23 2:46 PM, Alexei Starovoitov wrote:
>> To avoid other potential optname cases that may run into a similar situation
>> and require the bpf prog to work around it again like the above, it needs a way
>> to track whether a bpf prog has changed the optval at runtime. If it is not
>> changed, use the result from the kernel set/getsockopt. If reading/writing to
>> optval is done through a kfunc, this can be tracked. The kfunc can also directly
>> read/write the user memory in optval, avoiding the pre-allocated kernel memory
>> and the PAGE_SIZE limit, but this is a minor point.
> so I'm still not following how sleepable progs that can access everything
> would help the existing situation.
> I agree that sleepable bpf sockopt should be free from old mistakes,
> but people might still write old-style non-sleepable bpf sockopt progs and
> might repeat the same mistakes.
> I'm missing the value of the new interface. It's better, sure, but why?
> Do we expect to rewrite existing sockopt progs in a sleepable way?
> It might not be easy due to sleepable limitations like maps and whatever else.

So far our sockopt progs only use sk local storage, which can support
sleepable progs, so we could move everything to the sleepable way to avoid
any future quirks. Agreed that others may have sockopt progs with a hard
dependency on non-sleepable. If existing non-sleepable and sleepable progs
are interleaved together, it would end up hitting similar issues. Let's
drop the idea of this set.
On 8/17/23 2:37 PM, Yonghong Song wrote:
>
>
> On 8/17/23 1:41 PM, Martin KaFai Lau wrote:
>> On 8/16/23 6:25 PM, Alexei Starovoitov wrote:
>>> But I think we have to step back. Why do we need this whole thing in the
>>> first place?
>>> _why_ do sockopt bpf progs need to read and write user memory?
>>>
>>> Yes, there is the one-page limit, but what is the use case for actually
>>> reading and writing beyond that? iptables sockopt was mentioned, but I
>>> don't think a bpf prog can do anything useful with iptables binary blobs.
>>> They are hard enough for the kernel to parse.
>>
>> Usually the bpf prog is only interested in a very small number of optnames
>> and has no need to read the optval at all for most cases. The max size for
>> our use cases is 16 bytes. The kernel currently is kind of doing the
>> opposite: it always assumes the bpf prog needs to use the optval, allocates
>> kernel memory and does copy_from_user such that the non-sleepable bpf
>> program can read/write it.
>>
>> The bpf prog usually checks the optname and then just returns for most cases:
>>
>> SEC("cgroup/getsockopt")
>> int get_internal_sockopt(struct bpf_sockopt *ctx)
>> {
>>     if (ctx->optname != MY_INTERNAL_SOCKOPT)
>>         return 1;
>>
>>     /* change the ctx->optval and return to user space ... */
>> }
>>
>> When the optlen is > PAGE_SIZE, the kernel only allocates PAGE_SIZE memory
>> and copies the first PAGE_SIZE bytes from the user optval. We used to ask
>> the bpf prog to explicitly set the optlen to 0 for the > PAGE_SIZE case
>> even if it had not looked at the optval. Otherwise, the kernel used to
>> conclude that the bpf prog had set an invalid optlen, because optlen was
>> larger than optval_end - optval, and incorrectly returned -EFAULT to the
>> end user.
>>
>> The bpf prog started doing this > PAGE_SIZE check and setting optlen = 0
>> due to an internal kernel PAGE_SIZE limitation:
>>
>> SEC("cgroup/getsockopt")
>> int get_internal_sockopt(struct bpf_sockopt *ctx)
>> {
>>     if (ctx->optname != MY_INTERNAL_SOCKOPT) {
>>         /* only do that for ctx->optlen > PAGE_SIZE.
>>          * otherwise, the following cgroup bpf prog will
>>          * not be able to use the optval that it may
>>          * be interested in.
>>          */
>>         if (ctx->optlen > PAGE_SIZE)
>>             ctx->optlen = 0;
>>         return 1;
>>     }
>>
>>     /* change the ctx->optval and return to user space ... */
>> }
>>
>> The above has been worked around in commit 29ebbba7d461 ("bpf: Don't
>> EFAULT for {g,s}setsockopt with wrong optlen").
>>
>> Later, it was reported that for an optname (NETLINK_LIST_MEMBERSHIPS) the
>> kernel allows a user to pass a NULL optval and use the optlen returned by
>> getsockopt to learn the buffer space required. The bpf prog then needs to
>> remove the optlen > PAGE_SIZE check and set optlen to 0 for _all_ optnames
>> that it is not interested in, while risking that the following cgroup prog
>> may not be able to use some of the optval:
>>
>> SEC("cgroup/getsockopt")
>> int get_internal_sockopt(struct bpf_sockopt *ctx)
>> {
>>     if (ctx->optname != MY_INTERNAL_SOCKOPT) {
>>
>>         /* Do that for all optnames that you are not interested in.
>>          * The latter cgroup bpf will not be able to use the optval.
>>          */
>>         ctx->optlen = 0;
>>         return 1;
>>     }
>>
>>     /* change the ctx->optval and return to user space ... */
>> }
>>
>> The above case has been addressed in commit 00e74ae08638 ("bpf: Don't
>> EFAULT for getsockopt with optval=NULL").
>>
>> To avoid other potential optname cases that may run into a similar
>> situation and require the bpf prog to work around it again like the above,
>> it needs a way to track whether a bpf prog has changed the optval at
>> runtime. If it is not changed, use the result from the kernel
>
> Can we add a field in the bpf_sockopt uapi struct so the bpf_prog can set it
> if optval is changed?

This new interface should work. If there is an old existing prog staying
with the old interface (didn't set this bool but changed the optval) in the
cgroup prog array, it probably ends up not improving anything either? Or
the verifier could enforce setting this bool at runtime when writing to
optval? I don't know how demanding that verifier change would be. I am not
sure if this would be an overkill for the verifier.

>
> struct bpf_sockopt {
>         __bpf_md_ptr(struct bpf_sock *, sk);
>         __bpf_md_ptr(void *, optval);
>         __bpf_md_ptr(void *, optval_end);
>
>         __s32   level;
>         __s32   optname;
>         __s32   optlen;
>         __s32   retval;
> };
>
>> set/getsockopt. If reading/writing to optval is done through a kfunc, this
>> can be tracked. The kfunc can also directly read/write the user memory in
>> optval, avoiding the pre-allocated kernel memory and the PAGE_SIZE limit,
>> but this is a minor point.
On 8/17/23 12:43, Alexei Starovoitov wrote:
> On Thu, Aug 17, 2023 at 12:00 PM Kui-Feng Lee <sinquersw@gmail.com> wrote:
>>
>> On 8/16/23 18:25, Alexei Starovoitov wrote:
>>> On Tue, Aug 15, 2023 at 10:47:11AM -0700, thinker.li@gmail.com wrote:
>>>>
>>>> +BTF_SET8_START(cgroup_common_btf_ids)
>>>> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_copy_to, KF_SLEEPABLE)
>>>> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_alloc, KF_SLEEPABLE)
>>>> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_install, KF_SLEEPABLE)
>>>> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_release, KF_SLEEPABLE)
>>>> +BTF_ID_FLAGS(func, bpf_sockopt_dynptr_from, KF_SLEEPABLE)
>>>> +BTF_SET8_END(cgroup_common_btf_ids)
>>>
>>> These shouldn't be sockopt specific.
>>> If we want a dynptr to represent a pointer to contiguous user memory
>>> we should use a generic kfunc that does so.
>>>
>>> I suspect a single new kfunc, bpf_dynptr_from_user_mem(), would do.
>>> The new dynptr type can be hidden in the kernel, and all existing
>>> kfuncs (dynptr_slice, dynptr_data, dynptr_write) could be made to work
>>> with user memory.
>>>
>>> But I think we have to step back. Why do we need this whole thing in the first place?
>>> _why_ do sockopt bpf progs need to read and write user memory?
>>>
>>> Yes, there is the one-page limit, but what is the use case for actually reading
>>> and writing beyond that? iptables sockopt was mentioned, but I don't think a bpf
>>> prog can do anything useful with iptables binary blobs. They are hard enough for
>>> the kernel to parse.
>>
>> The idea behind the design is to let the developers of filters decide
>> when to replace the existing buffer, and to access the content of
>> buffers just like accessing raw pointers. However, it seems almost everyone
>> prefers to use *_read() & *_write(). I will move in that direction.
>
> That doesn't answer my question about the use case.
> What kind of filters want to parse more than 4k of sockopt data?

I don't have any existing use case for this. It is just something that
can be allowed to happen.
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index edb35bcfa548..b9e4d7752555 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -663,12 +663,15 @@ enum bpf_type_flag {
 	/* DYNPTR points to xdp_buff */
 	DYNPTR_TYPE_XDP = BIT(16 + BPF_BASE_TYPE_BITS),
 
+	/* DYNPTR points to optval buffer of bpf_sockopt */
+	DYNPTR_TYPE_CGROUP_SOCKOPT = BIT(17 + BPF_BASE_TYPE_BITS),
+
 	__BPF_TYPE_FLAG_MAX,
 	__BPF_TYPE_LAST_FLAG = __BPF_TYPE_FLAG_MAX - 1,
 };
 
 #define DYNPTR_TYPE_FLAG_MASK	(DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF | DYNPTR_TYPE_SKB \
-				 | DYNPTR_TYPE_XDP)
+				 | DYNPTR_TYPE_XDP | DYNPTR_TYPE_CGROUP_SOCKOPT)
 
 /* Max number of base types. */
 #define BPF_BASE_TYPE_LIMIT	(1UL << BPF_BASE_TYPE_BITS)
@@ -1206,6 +1209,8 @@ enum bpf_dynptr_type {
 	BPF_DYNPTR_TYPE_SKB,
 	/* Underlying data is a xdp_buff */
 	BPF_DYNPTR_TYPE_XDP,
+	/* Underlying data is for the optval of a cgroup sock */
+	BPF_DYNPTR_TYPE_CGROUP_SOCKOPT,
 };
 
 int bpf_dynptr_check_size(u32 size);
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 2aa2a96526de..df12fddd2f21 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1347,6 +1347,10 @@ struct bpf_sockopt_kern {
 enum bpf_sockopt_kern_flags {
 	/* optval is a pointer to user space memory */
 	BPF_SOCKOPT_FLAG_OPTVAL_USER	= (1U << 0),
+	/* able to install new optval */
+	BPF_SOCKOPT_FLAG_OPTVAL_REPLACE	= (1U << 1),
+	/* optval is referenced by a dynptr */
+	BPF_SOCKOPT_FLAG_OPTVAL_DYNPTR	= (1U << 2),
 };
 
 int copy_bpf_fprog_from_user(struct sock_fprog *dst, sockptr_t src, int len);
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 249657c466dd..6d6a040688be 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -217,6 +217,7 @@ enum btf_kfunc_hook {
 	BTF_KFUNC_HOOK_SOCKET_FILTER,
 	BTF_KFUNC_HOOK_LWT,
 	BTF_KFUNC_HOOK_NETFILTER,
+	BTF_KFUNC_HOOK_CGROUP_SOCKOPT,
 	BTF_KFUNC_HOOK_MAX,
 };
 
@@ -7846,6 +7847,8 @@ static int bpf_prog_type_to_kfunc_hook(enum bpf_prog_type prog_type)
 		return BTF_KFUNC_HOOK_LWT;
 	case BPF_PROG_TYPE_NETFILTER:
 		return BTF_KFUNC_HOOK_NETFILTER;
+	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
+		return BTF_KFUNC_HOOK_CGROUP_SOCKOPT;
 	default:
 		return BTF_KFUNC_HOOK_MAX;
 	}
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 425094e071ba..164dee8753cf 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -1865,6 +1865,8 @@ static int filter_setsockopt_progs_cb(void *arg,
 	if (max_optlen < 0)
 		return max_optlen;
 
+	ctx->flags = BPF_SOCKOPT_FLAG_OPTVAL_REPLACE;
+
 	if (copy_from_user(ctx->optval, optval,
 			   min(ctx->optlen, max_optlen)) != 0)
 		return -EFAULT;
@@ -1893,7 +1895,8 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
 	ctx.optlen = *optlen;
 	ctx.optval = optval;
 	ctx.optval_end = optval + *optlen;
-	ctx.flags = BPF_SOCKOPT_FLAG_OPTVAL_USER;
+	ctx.flags = BPF_SOCKOPT_FLAG_OPTVAL_USER |
+		BPF_SOCKOPT_FLAG_OPTVAL_REPLACE;
 
 	lock_sock(sk);
 	ret = bpf_prog_run_array_cg_cb(&cgrp->bpf, CGROUP_SETSOCKOPT,
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index eb91cae0612a..fc38aff02654 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1537,6 +1537,7 @@ BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, const struct bpf_dynptr_kern
 	switch (type) {
 	case BPF_DYNPTR_TYPE_LOCAL:
 	case BPF_DYNPTR_TYPE_RINGBUF:
+	case BPF_DYNPTR_TYPE_CGROUP_SOCKOPT:
 		/* Source and destination may possibly overlap, hence use memmove to
 		 * copy the data. E.g. bpf_dynptr_from_mem may create two dynptr
 		 * pointing to overlapping PTR_TO_MAP_VALUE regions.
@@ -1582,6 +1583,7 @@ BPF_CALL_5(bpf_dynptr_write, const struct bpf_dynptr_kern *, dst, u32, offset, v
 	switch (type) {
 	case BPF_DYNPTR_TYPE_LOCAL:
 	case BPF_DYNPTR_TYPE_RINGBUF:
+	case BPF_DYNPTR_TYPE_CGROUP_SOCKOPT:
 		if (flags)
 			return -EINVAL;
 		/* Source and destination may possibly overlap, hence use memmove to
@@ -1634,6 +1636,7 @@ BPF_CALL_3(bpf_dynptr_data, const struct bpf_dynptr_kern *, ptr, u32, offset, u3
 	switch (type) {
 	case BPF_DYNPTR_TYPE_LOCAL:
 	case BPF_DYNPTR_TYPE_RINGBUF:
+	case BPF_DYNPTR_TYPE_CGROUP_SOCKOPT:
 		return (unsigned long)(ptr->data + ptr->offset + offset);
 	case BPF_DYNPTR_TYPE_SKB:
 	case BPF_DYNPTR_TYPE_XDP:
@@ -2261,6 +2264,7 @@ __bpf_kfunc void *bpf_dynptr_slice(const struct bpf_dynptr_kern *ptr, u32 offset
 	switch (type) {
 	case BPF_DYNPTR_TYPE_LOCAL:
 	case BPF_DYNPTR_TYPE_RINGBUF:
+	case BPF_DYNPTR_TYPE_CGROUP_SOCKOPT:
 		return ptr->data + ptr->offset + offset;
 	case BPF_DYNPTR_TYPE_SKB:
 		if (buffer__opt)
@@ -2429,6 +2433,185 @@ __bpf_kfunc void bpf_rcu_read_unlock(void)
 	rcu_read_unlock();
 }
 
+/* Create a buffer of the given size for a {set,get}sockopt BPF filter.
+ *
+ * This kfunc is only available in sleepable contexts. The dynptr should be
+ * released by bpf_sockopt_dynptr_install() or bpf_sockopt_dynptr_release().
+ */
+__bpf_kfunc int bpf_sockopt_dynptr_alloc(struct bpf_sockopt *sopt, int size,
+					 struct bpf_dynptr_kern *ptr__uninit)
+{
+	void *optval;
+	int err;
+
+	bpf_dynptr_set_null(ptr__uninit);
+
+	err = bpf_dynptr_check_size(size);
+	if (err)
+		return err;
+
+	optval = kzalloc(size, GFP_KERNEL);
+	if (!optval)
+		return -ENOMEM;
+
+	bpf_dynptr_init(ptr__uninit, optval,
+			BPF_DYNPTR_TYPE_CGROUP_SOCKOPT, 0, size);
+
+	return size;
+}
+
+/* Install the buffer of the dynptr into the sockopt context.
+ *
+ * This kfunc is only available in sleepable contexts. The dynptr should be
+ * allocated by bpf_sockopt_dynptr_alloc(). The dynptr is invalid after
+ * returning from this function successfully.
+ */
+__bpf_kfunc int bpf_sockopt_dynptr_install(struct bpf_sockopt *sopt,
+					   struct bpf_dynptr_kern *ptr)
+{
+	struct bpf_sockopt_kern *sopt_kern = (struct bpf_sockopt_kern *)sopt;
+
+	if (!(sopt_kern->flags & BPF_SOCKOPT_FLAG_OPTVAL_REPLACE) ||
+	    bpf_dynptr_get_type(ptr) != BPF_DYNPTR_TYPE_CGROUP_SOCKOPT ||
+	    !ptr->data)
+		return -EINVAL;
+
+	if (sopt_kern->optval == ptr->data &&
+	    !(sopt_kern->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER)) {
+		/* This dynptr is initialized by bpf_sockopt_dynptr_from()
+		 * and the optval is not overwritten by
+		 * bpf_sockopt_dynptr_install() yet.
+		 */
+		bpf_dynptr_set_null(ptr);
+		sopt_kern->flags &= ~BPF_SOCKOPT_FLAG_OPTVAL_DYNPTR;
+		return 0;
+	}
+
+	if (sopt_kern->optval &&
+	    !(sopt_kern->flags & (BPF_SOCKOPT_FLAG_OPTVAL_USER |
+				  BPF_SOCKOPT_FLAG_OPTVAL_DYNPTR)))
+		kfree(sopt_kern->optval);
+
+	sopt_kern->optval = ptr->data;
+	sopt_kern->optval_end = ptr->data + __bpf_dynptr_size(ptr);
+	sopt_kern->optlen = __bpf_dynptr_size(ptr);
+	sopt_kern->flags &= ~(BPF_SOCKOPT_FLAG_OPTVAL_USER |
+			      BPF_SOCKOPT_FLAG_OPTVAL_DYNPTR);
+
+	bpf_dynptr_set_null(ptr);
+
+	return 0;
+}
+
+__bpf_kfunc int bpf_sockopt_dynptr_release(struct bpf_sockopt *sopt,
+					   struct bpf_dynptr_kern *ptr)
+{
+	struct bpf_sockopt_kern *sopt_kern = (struct bpf_sockopt_kern *)sopt;
+
+	if (bpf_dynptr_get_type(ptr) != BPF_DYNPTR_TYPE_CGROUP_SOCKOPT ||
+	    !ptr->data)
+		return -EINVAL;
+
+	if (sopt_kern->optval == ptr->data &&
+	    !(sopt_kern->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER))
+		/* This dynptr is initialized by bpf_sockopt_dynptr_from()
+		 * and the optval is not overwritten by
+		 * bpf_sockopt_dynptr_install() yet.
+		 */
+		sopt_kern->flags &= ~BPF_SOCKOPT_FLAG_OPTVAL_DYNPTR;
+	else
+		kfree(ptr->data);
+	bpf_dynptr_set_null(ptr);
+
+	return 0;
+}
+
+/* Initialize a sockopt dynptr from a user or installed optval pointer.
+ *
+ * sopt->optval can be a user pointer or a kernel pointer. A kernel pointer
+ * can be a buffer allocated by the caller of the BPF program or a buffer
+ * installed by other BPF programs through bpf_sockopt_dynptr_install().
+ *
+ * At most one dynptr shall be created by this function at any moment, or
+ * it will return -EINVAL. You can create another dynptr with this function
+ * after releasing the previous one with bpf_sockopt_dynptr_release().
+ *
+ * A dynptr that is initialized when optval is a user pointer is an
+ * exception. In this case, the dynptr will point to a kernel buffer with
+ * the same content as the user buffer. To simplify the code, users should
+ * always make sure they have only one dynptr initialized by this function
+ * at any moment.
+ */
+__bpf_kfunc int bpf_sockopt_dynptr_from(struct bpf_sockopt *sopt,
+					struct bpf_dynptr_kern *ptr__uninit,
+					unsigned int size)
+{
+	struct bpf_sockopt_kern *sopt_kern = (struct bpf_sockopt_kern *)sopt;
+	int err;
+
+	bpf_dynptr_set_null(ptr__uninit);
+
+	if (size > (sopt_kern->optval_end - sopt_kern->optval))
+		return -EINVAL;
+
+	if (size == 0)
+		size = min(sopt_kern->optlen,
+			   (int)(sopt_kern->optval_end - sopt_kern->optval));
+
+	if (sopt_kern->flags & BPF_SOCKOPT_FLAG_OPTVAL_DYNPTR)
+		return -EINVAL;
+
+	if (sopt_kern->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER) {
+		err = bpf_sockopt_dynptr_alloc(sopt, sopt_kern->optlen,
+					       ptr__uninit);
+		if (err >= 0)
+			err = copy_from_user(ptr__uninit->data,
+					     sopt_kern->optval,
+					     size);
+		return err;
+	}
+
+	bpf_dynptr_init(ptr__uninit, sopt_kern->optval,
+			BPF_DYNPTR_TYPE_CGROUP_SOCKOPT, 0,
+			size);
+	sopt_kern->flags |= BPF_SOCKOPT_FLAG_OPTVAL_DYNPTR;
+
+	return size;
+}
+
+/**
+ * int bpf_sockopt_dynptr_copy_to(struct bpf_sockopt *sopt,
+ *				  struct bpf_dynptr_kern *ptr)
+ *	Description
+ *		Copy data from *ptr* to *sopt->optval*.
+ *	Return
+ *		>= 0 on success, or a negative error in case of failure.
+ */
+__bpf_kfunc int bpf_sockopt_dynptr_copy_to(struct bpf_sockopt *sopt,
+					   struct bpf_dynptr_kern *ptr)
+{
+	__u32 size = bpf_dynptr_size(ptr);
+
+	struct bpf_sockopt_kern *sopt_kern = (struct bpf_sockopt_kern *)sopt;
+	int ret;
+
+	if (size > (sopt_kern->optval_end - sopt_kern->optval))
+		return -EINVAL;
+
+	if (sopt_kern->flags & BPF_SOCKOPT_FLAG_OPTVAL_USER) {
+		ret = copy_to_user(sopt_kern->optval, ptr->data,
+				   size);
+		if (unlikely(ret))
+			return -EFAULT;
+	} else {
+		/* Use memmove() in case of optval & ptr overlap. */
+		memmove(sopt_kern->optval, ptr->data, size);
+		ret = size;
+	}
+
+	return ret;
+}
+
 __diag_pop();
 
 BTF_SET8_START(generic_btf_ids)
@@ -2494,6 +2677,19 @@ static const struct btf_kfunc_id_set common_kfunc_set = {
 	.set   = &common_btf_ids,
 };
 
+BTF_SET8_START(cgroup_common_btf_ids)
+BTF_ID_FLAGS(func, bpf_sockopt_dynptr_copy_to, KF_SLEEPABLE)
+BTF_ID_FLAGS(func, bpf_sockopt_dynptr_alloc, KF_SLEEPABLE)
+BTF_ID_FLAGS(func, bpf_sockopt_dynptr_install, KF_SLEEPABLE)
+BTF_ID_FLAGS(func, bpf_sockopt_dynptr_release, KF_SLEEPABLE)
+BTF_ID_FLAGS(func, bpf_sockopt_dynptr_from, KF_SLEEPABLE)
+BTF_SET8_END(cgroup_common_btf_ids)
+
+static const struct btf_kfunc_id_set cgroup_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &cgroup_common_btf_ids,
+};
+
 static int __init kfunc_init(void)
 {
 	int ret;
@@ -2513,6 +2709,7 @@ static int __init kfunc_init(void)
 	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &generic_kfunc_set);
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &generic_kfunc_set);
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &generic_kfunc_set);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCKOPT, &cgroup_kfunc_set);
 	ret = ret ?: register_btf_id_dtor_kfuncs(generic_dtors,
 						 ARRAY_SIZE(generic_dtors),
 						 THIS_MODULE);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 936a171ea976..83d65a6e1309 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -745,6 +745,8 @@ static const char *dynptr_type_str(enum bpf_dynptr_type type)
 		return "skb";
 	case BPF_DYNPTR_TYPE_XDP:
 		return "xdp";
+	case BPF_DYNPTR_TYPE_CGROUP_SOCKOPT:
+		return "cgroup_sockopt";
 	case BPF_DYNPTR_TYPE_INVALID:
 		return "<invalid>";
 	default:
@@ -826,6 +828,8 @@ static enum bpf_dynptr_type arg_to_dynptr_type(enum bpf_arg_type arg_type)
 		return BPF_DYNPTR_TYPE_SKB;
 	case DYNPTR_TYPE_XDP:
 		return BPF_DYNPTR_TYPE_XDP;
+	case DYNPTR_TYPE_CGROUP_SOCKOPT:
+		return BPF_DYNPTR_TYPE_CGROUP_SOCKOPT;
 	default:
 		return BPF_DYNPTR_TYPE_INVALID;
 	}
@@ -842,6 +846,8 @@ static enum bpf_type_flag get_dynptr_type_flag(enum bpf_dynptr_type type)
 		return DYNPTR_TYPE_SKB;
 	case BPF_DYNPTR_TYPE_XDP:
 		return DYNPTR_TYPE_XDP;
+	case BPF_DYNPTR_TYPE_CGROUP_SOCKOPT:
+		return DYNPTR_TYPE_CGROUP_SOCKOPT;
 	default:
 		return 0;
 	}
@@ -849,7 +855,8 @@ static enum bpf_type_flag get_dynptr_type_flag(enum bpf_dynptr_type type)
 
 static bool dynptr_type_refcounted(enum bpf_dynptr_type type)
 {
-	return type == BPF_DYNPTR_TYPE_RINGBUF;
+	return type == BPF_DYNPTR_TYPE_RINGBUF ||
+		type == BPF_DYNPTR_TYPE_CGROUP_SOCKOPT;
 }
 
 static void __mark_dynptr_reg(struct bpf_reg_state *reg,
@@ -10271,6 +10278,10 @@ enum special_kfunc_type {
 	KF_bpf_dynptr_slice,
 	KF_bpf_dynptr_slice_rdwr,
 	KF_bpf_dynptr_clone,
+	KF_bpf_sockopt_dynptr_alloc,
+	KF_bpf_sockopt_dynptr_install,
+	KF_bpf_sockopt_dynptr_release,
+	KF_bpf_sockopt_dynptr_from,
 };
 
 BTF_SET_START(special_kfunc_set)
@@ -10291,6 +10302,10 @@ BTF_ID(func, bpf_dynptr_from_xdp)
 BTF_ID(func, bpf_dynptr_slice)
 BTF_ID(func, bpf_dynptr_slice_rdwr)
 BTF_ID(func, bpf_dynptr_clone)
+BTF_ID(func, bpf_sockopt_dynptr_alloc)
+BTF_ID(func, bpf_sockopt_dynptr_install)
+BTF_ID(func, bpf_sockopt_dynptr_release)
+BTF_ID(func, bpf_sockopt_dynptr_from)
 BTF_SET_END(special_kfunc_set)
 
 BTF_ID_LIST(special_kfunc_list)
@@ -10313,6 +10328,10 @@ BTF_ID(func, bpf_dynptr_from_xdp)
 BTF_ID(func, bpf_dynptr_slice)
 BTF_ID(func, bpf_dynptr_slice_rdwr)
 BTF_ID(func, bpf_dynptr_clone)
+BTF_ID(func, bpf_sockopt_dynptr_alloc)
+BTF_ID(func, bpf_sockopt_dynptr_install)
+BTF_ID(func, bpf_sockopt_dynptr_release)
+BTF_ID(func, bpf_sockopt_dynptr_from)
 
 static bool is_kfunc_ret_null(struct bpf_kfunc_call_arg_meta *meta)
 {
@@ -10966,6 +10985,20 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 			arg_type |= OBJ_RELEASE;
 			break;
 		case KF_ARG_PTR_TO_DYNPTR:
+			if (meta->func_id == special_kfunc_list[KF_bpf_sockopt_dynptr_install] ||
+			    meta->func_id == special_kfunc_list[KF_bpf_sockopt_dynptr_release]) {
+				int ref_obj_id = dynptr_ref_obj_id(env, reg);
+
+				if (ref_obj_id < 0) {
+					verbose(env, "R%d is not a valid dynptr\n", regno);
+					return -EINVAL;
+				}
+
+				/* Required by check_func_arg_reg_off() */
+				arg_type |= ARG_PTR_TO_DYNPTR | OBJ_RELEASE;
+				meta->release_regno = regno;
+			}
+			break;
 		case KF_ARG_PTR_TO_ITER:
 		case KF_ARG_PTR_TO_LIST_HEAD:
 		case KF_ARG_PTR_TO_LIST_NODE:
@@ -11053,6 +11086,10 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 				verbose(env, "verifier internal error: missing ref obj id for parent of clone\n");
 				return -EFAULT;
 			}
+		} else if ((meta->func_id == special_kfunc_list[KF_bpf_sockopt_dynptr_alloc] ||
+			    meta->func_id == special_kfunc_list[KF_bpf_sockopt_dynptr_from]) &&
+			   (dynptr_arg_type & MEM_UNINIT)) {
+			dynptr_arg_type |= DYNPTR_TYPE_CGROUP_SOCKOPT;
 		}
 
 		ret = process_dynptr_func(env, regno, insn_idx, dynptr_arg_type, clone_ref_obj_id);
@@ -11361,7 +11398,13 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 	 * PTR_TO_BTF_ID in bpf_kfunc_arg_meta, do the release now.
 	 */
 	if (meta.release_regno) {
+		verbose(env, "release refcounted PTR_TO_BTF_ID %s\n",
+			meta.func_name);
+		if (meta.func_id == special_kfunc_list[KF_bpf_sockopt_dynptr_install] ||
+		    meta.func_id == special_kfunc_list[KF_bpf_sockopt_dynptr_release])
+			err = unmark_stack_slots_dynptr(env, &regs[meta.release_regno]);
+		else
+			err = release_reference(env, regs[meta.release_regno].ref_obj_id);
-		err = release_reference(env, regs[meta.release_regno].ref_obj_id);
 		if (err) {
 			verbose(env, "kfunc %s#%d reference has not been acquired before\n",
 				func_name, meta.func_id);
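For context, a rough usage sketch of how a sleepable setsockopt prog might drive the kfuncs added by this patch. The prog, the optname, and the "cgroup.s/setsockopt" section name are illustrative assumptions based on this series, not a tested program:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

#define MY_INTERNAL_SOCKOPT 166	/* illustrative optname */

extern int bpf_sockopt_dynptr_from(struct bpf_sockopt *sopt,
				   struct bpf_dynptr *ptr__uninit,
				   unsigned int size) __ksym;
extern int bpf_sockopt_dynptr_release(struct bpf_sockopt *sopt,
				      struct bpf_dynptr *ptr) __ksym;
extern int bpf_sockopt_dynptr_alloc(struct bpf_sockopt *sopt, int size,
				    struct bpf_dynptr *ptr__uninit) __ksym;
extern int bpf_sockopt_dynptr_install(struct bpf_sockopt *sopt,
				      struct bpf_dynptr *ptr) __ksym;

SEC("cgroup.s/setsockopt")	/* sleepable cgroup sockopt prog */
int replace_sockopt(struct bpf_sockopt *ctx)
{
	struct bpf_dynptr optval, new_optval;
	__u32 old_val = 0, new_val;

	if (ctx->optname != MY_INTERNAL_SOCKOPT)
		return 1;

	/* Wrap the current optval in a dynptr (copied in from user memory
	 * when needed), read it, then release the dynptr.
	 */
	if (bpf_sockopt_dynptr_from(ctx, &optval, 0) < 0)
		return 0;
	bpf_dynptr_read(&old_val, sizeof(old_val), &optval, 0, 0);
	bpf_sockopt_dynptr_release(ctx, &optval);

	/* Allocate a replacement buffer and install it as the new optval. */
	new_val = old_val + 1;
	if (bpf_sockopt_dynptr_alloc(ctx, sizeof(new_val), &new_optval) < 0)
		return 0;
	bpf_dynptr_write(&new_optval, 0, &new_val, sizeof(new_val), 0);
	bpf_sockopt_dynptr_install(ctx, &new_optval);
	return 1;
}

char _license[] SEC("license") = "GPL";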