Message ID | 20230516001348.286414-2-andrii@kernel.org (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | BPF |
Headers | show |
Series | Add O_PATH-based BPF_OBJ_PIN and BPF_OBJ_GET support | expand |
On Mon, May 15, 2023 at 05:13:46PM -0700, Andrii Nakryiko wrote: > Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall > forces users to specify pinning location as a string-based absolute or > relative (to current working directory) path. This has various > implications related to security (e.g., symlink-based attacks), forces > BPF FS to be exposed in the file system, which can cause races with > other applications. > > One of the feedbacks we got from folks working with containers heavily > was that inability to use purely FD-based location specification was an > unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET > commands. This patch closes this oversight, adding path_fd field to > BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by > *at() syscalls for dirfd + pathname combinations. > > This now allows interesting possibilities like working with detached BPF > FS mount (e.g., to perform multiple pinnings without running a risk of > someone interfering with them), and generally making pinning/getting > more secure and not prone to any races and/or security attacks. > > This is demonstrated by a selftest added in subsequent patch that takes > advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate > creating detached BPF FS mount, pinning, and then getting BPF map out of > it, all while never exposing this private instance of BPF FS to outside > worlds. > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > --- > include/linux/bpf.h | 4 ++-- > include/uapi/linux/bpf.h | 5 +++++ > kernel/bpf/inode.c | 16 ++++++++-------- > kernel/bpf/syscall.c | 8 +++++--- > tools/include/uapi/linux/bpf.h | 5 +++++ > 5 files changed, 25 insertions(+), 13 deletions(-) > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > index 36e4b2d8cca2..f58895830ada 100644 > --- a/include/linux/bpf.h > +++ b/include/linux/bpf.h > @@ -2077,8 +2077,8 @@ struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd); > struct bpf_link *bpf_link_get_from_fd(u32 ufd); > struct bpf_link *bpf_link_get_curr_or_next(u32 *id); > > -int bpf_obj_pin_user(u32 ufd, const char __user *pathname); > -int bpf_obj_get_user(const char __user *pathname, int flags); > +int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname); > +int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags); > > #define BPF_ITER_FUNC_PREFIX "bpf_iter_" > #define DEFINE_BPF_ITER_FUNC(target, args...) \ > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index 1bb11a6ee667..db2870a52ce0 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -1420,6 +1420,11 @@ union bpf_attr { > __aligned_u64 pathname; > __u32 bpf_fd; > __u32 file_flags; > + /* same as dirfd in openat() syscall; see openat(2) > + * manpage for details of dirfd/path_fd and pathname semantics; > + * zero path_fd implies AT_FDCWD behavior > + */ > + __u32 path_fd; I'd probably call it dir_fd to emphasize the similarity, but I don't mind path_fd as well I have a note that you suggested to introduce this for uprobe multi link as well, so I'll do something similar lgtm Acked-by: Jiri Olsa <jolsa@kernel.org> jirka > }; > > struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */ > diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c > index 9948b542a470..13bb54f6bd17 100644 > --- a/kernel/bpf/inode.c > +++ b/kernel/bpf/inode.c > @@ -435,7 +435,7 @@ static int bpf_iter_link_pin_kernel(struct dentry *parent, > return ret; > } > > -static int bpf_obj_do_pin(const char __user *pathname, void *raw, > +static int bpf_obj_do_pin(int path_fd, const char __user *pathname, void *raw, > enum bpf_type type) > { > struct dentry *dentry; > @@ -444,7 +444,7 @@ static int bpf_obj_do_pin(const char __user *pathname, void *raw, > umode_t mode; > int ret; > > - dentry = user_path_create(AT_FDCWD, pathname, &path, 0); > + dentry = user_path_create(path_fd, pathname, &path, 0); > if (IS_ERR(dentry)) > return PTR_ERR(dentry); > > @@ -478,7 +478,7 @@ static int bpf_obj_do_pin(const char __user *pathname, void *raw, > return ret; > } > > -int bpf_obj_pin_user(u32 ufd, const char __user *pathname) > +int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname) > { > enum bpf_type type; > void *raw; > @@ -488,14 +488,14 @@ int bpf_obj_pin_user(u32 ufd, const char __user *pathname) > if (IS_ERR(raw)) > return PTR_ERR(raw); > > - ret = bpf_obj_do_pin(pathname, raw, type); > + ret = bpf_obj_do_pin(path_fd, pathname, raw, type); > if (ret != 0) > bpf_any_put(raw, type); > > return ret; > } > > -static void *bpf_obj_do_get(const char __user *pathname, > +static void *bpf_obj_do_get(int path_fd, const char __user *pathname, > enum bpf_type *type, int flags) > { > struct inode *inode; > @@ -503,7 +503,7 @@ static void *bpf_obj_do_get(const char __user *pathname, > void *raw; > int ret; > > - ret = user_path_at(AT_FDCWD, pathname, LOOKUP_FOLLOW, &path); > + ret = user_path_at(path_fd, pathname, LOOKUP_FOLLOW, &path); > if (ret) > return ERR_PTR(ret); > > @@ -527,7 +527,7 @@ static void *bpf_obj_do_get(const char __user *pathname, > return ERR_PTR(ret); > } > > -int bpf_obj_get_user(const char __user *pathname, int flags) > +int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags) > { > enum bpf_type type = BPF_TYPE_UNSPEC; > int f_flags; > @@ -538,7 +538,7 @@ int bpf_obj_get_user(const char __user *pathname, int flags) > if (f_flags < 0) > return f_flags; > > - raw = bpf_obj_do_get(pathname, &type, f_flags); > + raw = bpf_obj_do_get(path_fd, pathname, &type, f_flags); > if (IS_ERR(raw)) > return PTR_ERR(raw); > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > index 909c112ef537..65a46f6d4be0 100644 > --- a/kernel/bpf/syscall.c > +++ b/kernel/bpf/syscall.c > @@ -2697,14 +2697,15 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size) > return err; > } > > -#define BPF_OBJ_LAST_FIELD file_flags > +#define BPF_OBJ_LAST_FIELD path_fd > > static int bpf_obj_pin(const union bpf_attr *attr) > { > if (CHECK_ATTR(BPF_OBJ) || attr->file_flags != 0) > return -EINVAL; > > - return bpf_obj_pin_user(attr->bpf_fd, u64_to_user_ptr(attr->pathname)); > + return bpf_obj_pin_user(attr->bpf_fd, attr->path_fd ?: AT_FDCWD, > + u64_to_user_ptr(attr->pathname)); > } > > static int bpf_obj_get(const union bpf_attr *attr) > @@ -2713,7 +2714,8 @@ static int bpf_obj_get(const union bpf_attr *attr) > attr->file_flags & ~BPF_OBJ_FLAG_MASK) > return -EINVAL; > > - return bpf_obj_get_user(u64_to_user_ptr(attr->pathname), > + return bpf_obj_get_user(attr->path_fd ?: AT_FDCWD, > + u64_to_user_ptr(attr->pathname), > attr->file_flags); > } > > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h > index 1bb11a6ee667..db2870a52ce0 100644 > --- a/tools/include/uapi/linux/bpf.h > +++ b/tools/include/uapi/linux/bpf.h > @@ -1420,6 +1420,11 @@ union bpf_attr { > __aligned_u64 pathname; > __u32 bpf_fd; > __u32 file_flags; > + /* same as dirfd in openat() syscall; see openat(2) > + * manpage for details of dirfd/path_fd and pathname semantics; > + * zero path_fd implies AT_FDCWD behavior > + */ > + __u32 path_fd; > }; > > struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */ > -- > 2.34.1 > >
On Mon, May 15, 2023 at 05:13:46PM -0700, Andrii Nakryiko wrote: > Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall > forces users to specify pinning location as a string-based absolute or > relative (to current working directory) path. This has various > implications related to security (e.g., symlink-based attacks), forces > BPF FS to be exposed in the file system, which can cause races with > other applications. > > One of the feedbacks we got from folks working with containers heavily > was that inability to use purely FD-based location specification was an > unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET > commands. This patch closes this oversight, adding path_fd field to Cool! > BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by > *at() syscalls for dirfd + pathname combinations. > > This now allows interesting possibilities like working with detached BPF > FS mount (e.g., to perform multiple pinnings without running a risk of > someone interfering with them), and generally making pinning/getting > more secure and not prone to any races and/or security attacks. > > This is demonstrated by a selftest added in subsequent patch that takes > advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate > creating detached BPF FS mount, pinning, and then getting BPF map out of > it, all while never exposing this private instance of BPF FS to outside > worlds. > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > --- > include/linux/bpf.h | 4 ++-- > include/uapi/linux/bpf.h | 5 +++++ > kernel/bpf/inode.c | 16 ++++++++-------- > kernel/bpf/syscall.c | 8 +++++--- > tools/include/uapi/linux/bpf.h | 5 +++++ > 5 files changed, 25 insertions(+), 13 deletions(-) > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > index 36e4b2d8cca2..f58895830ada 100644 > --- a/include/linux/bpf.h > +++ b/include/linux/bpf.h > @@ -2077,8 +2077,8 @@ struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd); > struct bpf_link *bpf_link_get_from_fd(u32 ufd); > struct bpf_link *bpf_link_get_curr_or_next(u32 *id); > > -int bpf_obj_pin_user(u32 ufd, const char __user *pathname); > -int bpf_obj_get_user(const char __user *pathname, int flags); > +int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname); > +int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags); > > #define BPF_ITER_FUNC_PREFIX "bpf_iter_" > #define DEFINE_BPF_ITER_FUNC(target, args...) \ > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index 1bb11a6ee667..db2870a52ce0 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -1420,6 +1420,11 @@ union bpf_attr { > __aligned_u64 pathname; > __u32 bpf_fd; > __u32 file_flags; > + /* same as dirfd in openat() syscall; see openat(2) > + * manpage for details of dirfd/path_fd and pathname semantics; > + * zero path_fd implies AT_FDCWD behavior > + */ > + __u32 path_fd; > }; So 0 is a valid file descriptor and can trivially be created and made to refer to any file. Is this a conscious decision to have a zero value imply AT_FDCWD and have you done this somewhere else in bpf already? Because that's contrary to how any file descriptor based apis work. How this is usually solved for extensible structs is to have a flag field that raises a flag to indicate that the fd fiel is set and thus 0 can be used as a valid value. The way you're doing it right now is very counterintuitive to userspace and pretty much guaranteed to cause subtle bugs.
On Tue, May 16, 2023 at 2:07 AM Christian Brauner <brauner@kernel.org> wrote: > > On Mon, May 15, 2023 at 05:13:46PM -0700, Andrii Nakryiko wrote: > > Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall > > forces users to specify pinning location as a string-based absolute or > > relative (to current working directory) path. This has various > > implications related to security (e.g., symlink-based attacks), forces > > BPF FS to be exposed in the file system, which can cause races with > > other applications. > > > > One of the feedbacks we got from folks working with containers heavily > > was that inability to use purely FD-based location specification was an > > unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET > > commands. This patch closes this oversight, adding path_fd field to > > Cool! > > > BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by > > *at() syscalls for dirfd + pathname combinations. > > > > This now allows interesting possibilities like working with detached BPF > > FS mount (e.g., to perform multiple pinnings without running a risk of > > someone interfering with them), and generally making pinning/getting > > more secure and not prone to any races and/or security attacks. > > > > This is demonstrated by a selftest added in subsequent patch that takes > > advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate > > creating detached BPF FS mount, pinning, and then getting BPF map out of > > it, all while never exposing this private instance of BPF FS to outside > > worlds. > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > --- > > include/linux/bpf.h | 4 ++-- > > include/uapi/linux/bpf.h | 5 +++++ > > kernel/bpf/inode.c | 16 ++++++++-------- > > kernel/bpf/syscall.c | 8 +++++--- > > tools/include/uapi/linux/bpf.h | 5 +++++ > > 5 files changed, 25 insertions(+), 13 deletions(-) > > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > > index 36e4b2d8cca2..f58895830ada 100644 > > --- a/include/linux/bpf.h > > +++ b/include/linux/bpf.h > > @@ -2077,8 +2077,8 @@ struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd); > > struct bpf_link *bpf_link_get_from_fd(u32 ufd); > > struct bpf_link *bpf_link_get_curr_or_next(u32 *id); > > > > -int bpf_obj_pin_user(u32 ufd, const char __user *pathname); > > -int bpf_obj_get_user(const char __user *pathname, int flags); > > +int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname); > > +int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags); > > > > #define BPF_ITER_FUNC_PREFIX "bpf_iter_" > > #define DEFINE_BPF_ITER_FUNC(target, args...) \ > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > index 1bb11a6ee667..db2870a52ce0 100644 > > --- a/include/uapi/linux/bpf.h > > +++ b/include/uapi/linux/bpf.h > > @@ -1420,6 +1420,11 @@ union bpf_attr { > > __aligned_u64 pathname; > > __u32 bpf_fd; > > __u32 file_flags; > > + /* same as dirfd in openat() syscall; see openat(2) > > + * manpage for details of dirfd/path_fd and pathname semantics; > > + * zero path_fd implies AT_FDCWD behavior > > + */ > > + __u32 path_fd; > > }; > > So 0 is a valid file descriptor and can trivially be created and made to > refer to any file. Is this a conscious decision to have a zero value > imply AT_FDCWD and have you done this somewhere else in bpf already? > Because that's contrary to how any file descriptor based apis work. > > How this is usually solved for extensible structs is to have a flag > field that raises a flag to indicate that the fd fiel is set and thus 0 > can be used as a valid value. > > The way you're doing it right now is very counterintuitive to userspace > and pretty much guaranteed to cause subtle bugs. Yes, it's a very bpf()-specific convention we've settled on a while ago. It allows a cleaner and simpler backwards compatibility story without having to introduce new flags every single time. Most of BPF UAPI by now dictates that (otherwise valid) FD 0 can't be used to pass it to bpf() syscall. Most of the time users will be blissfully unaware because libbpf and other BPF libraries are checking for fd == 0 and dup()'ing them to avoid ever returning FD 0 to the user. tl;dr, a conscious decision consistent with the rest of BPF UAPI. It is a bpf() peculiarity, yes.
On Tue, May 16, 2023 at 1:52 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > On Mon, May 15, 2023 at 05:13:46PM -0700, Andrii Nakryiko wrote: > > Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall > > forces users to specify pinning location as a string-based absolute or > > relative (to current working directory) path. This has various > > implications related to security (e.g., symlink-based attacks), forces > > BPF FS to be exposed in the file system, which can cause races with > > other applications. > > > > One of the feedbacks we got from folks working with containers heavily > > was that inability to use purely FD-based location specification was an > > unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET > > commands. This patch closes this oversight, adding path_fd field to > > BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by > > *at() syscalls for dirfd + pathname combinations. > > > > This now allows interesting possibilities like working with detached BPF > > FS mount (e.g., to perform multiple pinnings without running a risk of > > someone interfering with them), and generally making pinning/getting > > more secure and not prone to any races and/or security attacks. > > > > This is demonstrated by a selftest added in subsequent patch that takes > > advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate > > creating detached BPF FS mount, pinning, and then getting BPF map out of > > it, all while never exposing this private instance of BPF FS to outside > > worlds. > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > --- > > include/linux/bpf.h | 4 ++-- > > include/uapi/linux/bpf.h | 5 +++++ > > kernel/bpf/inode.c | 16 ++++++++-------- > > kernel/bpf/syscall.c | 8 +++++--- > > tools/include/uapi/linux/bpf.h | 5 +++++ > > 5 files changed, 25 insertions(+), 13 deletions(-) > > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > > index 36e4b2d8cca2..f58895830ada 100644 > > --- a/include/linux/bpf.h > > +++ b/include/linux/bpf.h > > @@ -2077,8 +2077,8 @@ struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd); > > struct bpf_link *bpf_link_get_from_fd(u32 ufd); > > struct bpf_link *bpf_link_get_curr_or_next(u32 *id); > > > > -int bpf_obj_pin_user(u32 ufd, const char __user *pathname); > > -int bpf_obj_get_user(const char __user *pathname, int flags); > > +int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname); > > +int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags); > > > > #define BPF_ITER_FUNC_PREFIX "bpf_iter_" > > #define DEFINE_BPF_ITER_FUNC(target, args...) \ > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > index 1bb11a6ee667..db2870a52ce0 100644 > > --- a/include/uapi/linux/bpf.h > > +++ b/include/uapi/linux/bpf.h > > @@ -1420,6 +1420,11 @@ union bpf_attr { > > __aligned_u64 pathname; > > __u32 bpf_fd; > > __u32 file_flags; > > + /* same as dirfd in openat() syscall; see openat(2) > > + * manpage for details of dirfd/path_fd and pathname semantics; > > + * zero path_fd implies AT_FDCWD behavior > > + */ > > + __u32 path_fd; > > I'd probably call it dir_fd to emphasize the similarity, > but I don't mind path_fd as well I considered that, but it's really not necessarily a directory, it could be a specific file location (with O_PATH), so I felt like a more generic "path_fd" would be better (plus we have *path*name to combine with). It's minor, I can be convinced if others feel strongly about this. > > I have a note that you suggested to introduce this for uprobe > multi link as well, so I'll do something similar > > lgtm > > Acked-by: Jiri Olsa <jolsa@kernel.org> > > jirka > > > }; > > > > struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */ > > diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c > > index 9948b542a470..13bb54f6bd17 100644 > > --- a/kernel/bpf/inode.c > > +++ b/kernel/bpf/inode.c [...]
On Tue, May 16, 2023 at 11:02:42AM -0700, Andrii Nakryiko wrote: > On Tue, May 16, 2023 at 2:07 AM Christian Brauner <brauner@kernel.org> wrote: > > > > On Mon, May 15, 2023 at 05:13:46PM -0700, Andrii Nakryiko wrote: > > > Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall > > > forces users to specify pinning location as a string-based absolute or > > > relative (to current working directory) path. This has various > > > implications related to security (e.g., symlink-based attacks), forces > > > BPF FS to be exposed in the file system, which can cause races with > > > other applications. > > > > > > One of the feedbacks we got from folks working with containers heavily > > > was that inability to use purely FD-based location specification was an > > > unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET > > > commands. This patch closes this oversight, adding path_fd field to > > > > Cool! > > > > > BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by > > > *at() syscalls for dirfd + pathname combinations. > > > > > > This now allows interesting possibilities like working with detached BPF > > > FS mount (e.g., to perform multiple pinnings without running a risk of > > > someone interfering with them), and generally making pinning/getting > > > more secure and not prone to any races and/or security attacks. > > > > > > This is demonstrated by a selftest added in subsequent patch that takes > > > advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate > > > creating detached BPF FS mount, pinning, and then getting BPF map out of > > > it, all while never exposing this private instance of BPF FS to outside > > > worlds. > > > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > > --- > > > include/linux/bpf.h | 4 ++-- > > > include/uapi/linux/bpf.h | 5 +++++ > > > kernel/bpf/inode.c | 16 ++++++++-------- > > > kernel/bpf/syscall.c | 8 +++++--- > > > tools/include/uapi/linux/bpf.h | 5 +++++ > > > 5 files changed, 25 insertions(+), 13 deletions(-) > > > > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > > > index 36e4b2d8cca2..f58895830ada 100644 > > > --- a/include/linux/bpf.h > > > +++ b/include/linux/bpf.h > > > @@ -2077,8 +2077,8 @@ struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd); > > > struct bpf_link *bpf_link_get_from_fd(u32 ufd); > > > struct bpf_link *bpf_link_get_curr_or_next(u32 *id); > > > > > > -int bpf_obj_pin_user(u32 ufd, const char __user *pathname); > > > -int bpf_obj_get_user(const char __user *pathname, int flags); > > > +int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname); > > > +int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags); > > > > > > #define BPF_ITER_FUNC_PREFIX "bpf_iter_" > > > #define DEFINE_BPF_ITER_FUNC(target, args...) \ > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > > index 1bb11a6ee667..db2870a52ce0 100644 > > > --- a/include/uapi/linux/bpf.h > > > +++ b/include/uapi/linux/bpf.h > > > @@ -1420,6 +1420,11 @@ union bpf_attr { > > > __aligned_u64 pathname; > > > __u32 bpf_fd; > > > __u32 file_flags; > > > + /* same as dirfd in openat() syscall; see openat(2) > > > + * manpage for details of dirfd/path_fd and pathname semantics; > > > + * zero path_fd implies AT_FDCWD behavior > > > + */ > > > + __u32 path_fd; > > > }; > > > > So 0 is a valid file descriptor and can trivially be created and made to > > refer to any file. Is this a conscious decision to have a zero value > > imply AT_FDCWD and have you done this somewhere else in bpf already? > > Because that's contrary to how any file descriptor based apis work. > > > > How this is usually solved for extensible structs is to have a flag > > field that raises a flag to indicate that the fd fiel is set and thus 0 > > can be used as a valid value. > > > > The way you're doing it right now is very counterintuitive to userspace > > and pretty much guaranteed to cause subtle bugs. > > Yes, it's a very bpf()-specific convention we've settled on a while > ago. It allows a cleaner and simpler backwards compatibility story > without having to introduce new flags every single time. Most of BPF > UAPI by now dictates that (otherwise valid) FD 0 can't be used to pass > it to bpf() syscall. Most of the time users will be blissfully unaware > because libbpf and other BPF libraries are checking for fd == 0 and > dup()'ing them to avoid ever returning FD 0 to the user. > > tl;dr, a conscious decision consistent with the rest of BPF UAPI. It > is a bpf() peculiarity, yes. Adding fsdevel so we're aware of this quirk. So I'm not sure whether this was ever discussed on fsdevel when you took the decision to treat fd 0 as AT_FDCWD or in general treat fd 0 as an invalid value. If it was discussed then great but if not then I would like to make it very clear that if in the future you decide to introduce custom semantics for vfs provided infrastructure - especially when exposed to userspace - that you please Cc us. You often make it very clear on the list that you don't like it when anything that touches bpf code doesn't end up getting sent to the bpf mailing list. It is exactly the same for us. This is not a rant I'm really just trying to make sure that we agree on common ground when it comes to touching each others code or semantic assumptions. I personally find this extremely weird to treat fd 0 as anything other than a random fd number as it goes against any userspace assumptions and drastically deviates from basically every file descriptor interface we have. I mean, you're not just saying fd 0 is invalid you're even saying it means AT_FDCWD. For every other interface, including those that pass fds in structs whose extensibility is premised on unknown fields being set to zero, have ways to make fd 0 work just fine. You could've done that to without inventing custom fd semantics.
On Wed, May 17, 2023 at 11:11:24AM +0200, Christian Brauner wrote: > Adding fsdevel so we're aware of this quirk. > > So I'm not sure whether this was ever discussed on fsdevel when you took > the decision to treat fd 0 as AT_FDCWD or in general treat fd 0 as an > invalid value. I've never heard of this before, and I think it is compltely unacceptable. 0 ist just a normal FD, although one that happens to have specific meaning in userspace as stdin. > > If it was discussed then great but if not then I would like to make it > very clear that if in the future you decide to introduce custom > semantics for vfs provided infrastructure - especially when exposed to > userspace - that you please Cc us. I don't think it's just the future. We really need to undo this ASAP.
On Wed, May 17, 2023 at 5:05 AM Christoph Hellwig <hch@lst.de> wrote: > > On Wed, May 17, 2023 at 11:11:24AM +0200, Christian Brauner wrote: > > Adding fsdevel so we're aware of this quirk. > > > > So I'm not sure whether this was ever discussed on fsdevel when you took > > the decision to treat fd 0 as AT_FDCWD or in general treat fd 0 as an > > invalid value. > > I've never heard of this before, and I think it is compltely > unacceptable. 0 ist just a normal FD, although one that happens to > have specific meaning in userspace as stdin. > > > > > If it was discussed then great but if not then I would like to make it > > very clear that if in the future you decide to introduce custom > > semantics for vfs provided infrastructure - especially when exposed to > > userspace - that you please Cc us. > > I don't think it's just the future. We really need to undo this ASAP. Christian is not correct in stating that treatment of fd==0 as invalid bpf object applies to vfs fd-s. The path_fd addition in this patch is really the very first one of this kind. At the same time bpf anon fd-s (progs, maps, links, btfs) with fd == 0 are invalid and this is not going to change. It's been uapi for a long time. More so fd-s 0,1,2 are not "normal FDs". Unix has made two mistakes: 1. fd==0 being valid fd 2. establishing convention that fd-s 0,1,2 are stdin, stdout, stderr. The first mistake makes it hard to pass FD without an extra flag. The 2nd mistake is just awful. We've seen plenty of severe datacenter wide issues because some library or piece of software assumes stdin/out/err. Various services have been hurt badly by this "convention". In libbpf we added ensure_good_fd() to make sure none of bpf objects (progs, maps, etc) are ever seen with fd=0,1,2. Other pieces of datacenter software enforce the same. In other words fds=0,1,2 are taken. They must not be anything but stdin/out/err or gutted to /dev/null. Otherwise expect horrible bugs and multi day debugging. Because of that, several years ago, we've decided to fix unix mistake #1 when it comes to bpf objects and started reserving fd=0 as invalid. This patch is proposing to do the same for path_fd (normal vfs fd) when it is passed to bpf syscall. I think it's a good trade-off and fits the rest of bpf uapi. Everyone who's hiding behind statements: but POSIX is a standard.. or this is how we've been doing things... are ignoring the practical situation at hand. fd-s 0,1,2 are taken. Make sure your sw never produces them.
On Wed, May 17, 2023 at 9:17 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Wed, May 17, 2023 at 5:05 AM Christoph Hellwig <hch@lst.de> wrote: > > > > On Wed, May 17, 2023 at 11:11:24AM +0200, Christian Brauner wrote: > > > Adding fsdevel so we're aware of this quirk. > > > > > > So I'm not sure whether this was ever discussed on fsdevel when you took > > > the decision to treat fd 0 as AT_FDCWD or in general treat fd 0 as an > > > invalid value. > > > > I've never heard of this before, and I think it is compltely > > unacceptable. 0 ist just a normal FD, although one that happens to > > have specific meaning in userspace as stdin. > > > > > > > > If it was discussed then great but if not then I would like to make it > > > very clear that if in the future you decide to introduce custom > > > semantics for vfs provided infrastructure - especially when exposed to > > > userspace - that you please Cc us. > > > > I don't think it's just the future. We really need to undo this ASAP. > > Christian is not correct in stating that treatment of fd==0 as invalid > bpf object applies to vfs fd-s. > The path_fd addition in this patch is really the very first one of this kind. > At the same time bpf anon fd-s (progs, maps, links, btfs) with fd == 0 > are invalid and this is not going to change. It's been uapi for a long time. > > More so fd-s 0,1,2 are not "normal FDs". > Unix has made two mistakes: > 1. fd==0 being valid fd > 2. establishing convention that fd-s 0,1,2 are stdin, stdout, stderr. > > The first mistake makes it hard to pass FD without an extra flag. > The 2nd mistake is just awful. > We've seen plenty of severe datacenter wide issues because some > library or piece of software assumes stdin/out/err. > Various services have been hurt badly by this "convention". > In libbpf we added ensure_good_fd() to make sure none of bpf objects > (progs, maps, etc) are ever seen with fd=0,1,2. > Other pieces of datacenter software enforce the same. > > In other words fds=0,1,2 are taken. They must not be anything but > stdin/out/err or gutted to /dev/null. > Otherwise expect horrible bugs and multi day debugging. > > Because of that, several years ago, we've decided to fix unix mistake #1 > when it comes to bpf objects and started reserving fd=0 as invalid. > This patch is proposing to do the same for path_fd (normal vfs fd) when > it is passed to bpf syscall. I think it's a good trade-off and fits > the rest of bpf uapi. > > Everyone who's hiding behind statements: but POSIX is a standard.. > or this is how we've been doing things... are ignoring the practical > situation at hand. fd-s 0,1,2 are taken. Make sure your sw never produces them. Summarizing an offlist discussion with Christian and Andrii. The key issue is that fd == 0 must not mean AT_FDCWD and that's clear. We'll respin with an extra flag to indicate that path_fd should be used.
On Wed, May 17, 2023 at 09:17:36AM -0700, Alexei Starovoitov wrote: > On Wed, May 17, 2023 at 5:05 AM Christoph Hellwig <hch@lst.de> wrote: > > > > On Wed, May 17, 2023 at 11:11:24AM +0200, Christian Brauner wrote: > > > Adding fsdevel so we're aware of this quirk. > > > > > > So I'm not sure whether this was ever discussed on fsdevel when you took > > > the decision to treat fd 0 as AT_FDCWD or in general treat fd 0 as an > > > invalid value. > > > > I've never heard of this before, and I think it is compltely > > unacceptable. 0 ist just a normal FD, although one that happens to > > have specific meaning in userspace as stdin. > > > > > > > > If it was discussed then great but if not then I would like to make it > > > very clear that if in the future you decide to introduce custom > > > semantics for vfs provided infrastructure - especially when exposed to > > > userspace - that you please Cc us. > > > > I don't think it's just the future. We really need to undo this ASAP. > > Christian is not correct in stating that treatment of fd==0 as invalid > bpf object applies to vfs fd-s. > The path_fd addition in this patch is really the very first one of this kind. > At the same time bpf anon fd-s (progs, maps, links, btfs) with fd == 0 > are invalid and this is not going to change. It's been uapi for a long time. > > More so fd-s 0,1,2 are not "normal FDs". > Unix has made two mistakes: > 1. fd==0 being valid fd > 2. establishing convention that fd-s 0,1,2 are stdin, stdout, stderr. > > The first mistake makes it hard to pass FD without an extra flag. > The 2nd mistake is just awful. > We've seen plenty of severe datacenter wide issues because some > library or piece of software assumes stdin/out/err. > Various services have been hurt badly by this "convention". > In libbpf we added ensure_good_fd() to make sure none of bpf objects > (progs, maps, etc) are ever seen with fd=0,1,2. > Other pieces of datacenter software enforce the same. > > In other words fds=0,1,2 are taken. They must not be anything but > stdin/out/err or gutted to /dev/null. > Otherwise expect horrible bugs and multi day debugging. > > Because of that, several years ago, we've decided to fix unix mistake #1 > when it comes to bpf objects and started reserving fd=0 as invalid. > This patch is proposing to do the same for path_fd (normal vfs fd) when It isn't as you now realized but I'm glad we cleared that up off-list. > it is passed to bpf syscall. I think it's a good trade-off and fits > the rest of bpf uapi. > > Everyone who's hiding behind statements: but POSIX is a standard.. > or this is how we've been doing things... are ignoring the practical > situation at hand. fd-s 0,1,2 are taken. Make sure your sw never produces them. (Prefix: Imagine me calmly writing this and in a relaxed tone.) Just to clarify. I do think that deciding that 0 is an invalid file descriptor number is weird and I wish you'd have discussed this with us before you took that decision. You've seen the reaction that other low-level userspace people you talked to had to these news... I'm not sure what to make of the POSIX excursion. I think that it is a complete sideshow to the issue here in a way. But fwiw... We don't follow arbitrary conventions such as 0, 1, and 2 because we all have sworn allegiance to The Secret Order of the POSIX but because the alternative is that one subsystem finds it neat to use fd 0 to refer to AT_FDCWD and another one to AT_MY_CUSTOM_FD0_MEANING. Which is exactly what would've happened if this patch would have made it unnoticed. This doesn't scale and our interfaces aren't designed around Shakespeare's dictum "What's in a name?". This will quickly devolve into a situation similar to letting a step on a staircase be off by a few millimeters. See those users falling. I'm glad we cleared this up. My main issue is indeed that fd 0 now isn't just forbidden it would be given an entirely different meaning which is not acceptable from the vfs perspective.
The other thing to note is that while the *convention* may be that 0, 1, and 2 are for stdin, stdout, and stderr, this is a *userspae* convention. After all, system daemons like getty, gnome-terminal, et. al, need to be able to open file descriptors for stdin, stdout, and stderr, and it would be.....highly undesirable for the kernel to have to special case those processes from being able to open those file descriptors. So in the eyes of Kernel to Userspace API's we should not specially privilege the meaning of file descriptors 0, 1, and 2. Besides, we have a perfectly good way of expressing "not a FD" and that is negative values! File descriptors, after all, are signed integers. Finally, by having some kernel subsystem have a different meaning for fd 0 means that there are potential security vulernabilities. It may be the case that userspace *SHOULD* not use fd 0 for anythingn other than stdin, and that should be something which should be handed to it by its parent process. However, consider what might happen if a malicious program where to exec a process, perhaps a setuid process, with fd 0 closed. Now the first file opened by that program will be assigned fd 0, and if that gets passed to BPF, something surprising and wonderous --- but hopefully not something that can be leveraged to be a high severity security vulnerability --- may very well happen. So if there is anyway to that we can change the BPF API's to change to use negative values for special case meanings, we should do it. Certainly for any new API's, and even for old API's, Linus has always said that there are some special case times when we can break the userspace ABI --- and security vulnerabilites are certainly one of them. Best regards, - Ted
On Thu, May 18, 2023 at 10:38:46AM +0200, Christian Brauner wrote: > On Wed, May 17, 2023 at 09:17:36AM -0700, Alexei Starovoitov wrote: > > On Wed, May 17, 2023 at 5:05 AM Christoph Hellwig <hch@lst.de> wrote: > > > > > > On Wed, May 17, 2023 at 11:11:24AM +0200, Christian Brauner wrote: > > > > Adding fsdevel so we're aware of this quirk. > > > > > > > > So I'm not sure whether this was ever discussed on fsdevel when you took > > > > the decision to treat fd 0 as AT_FDCWD or in general treat fd 0 as an > > > > invalid value. > > > > > > I've never heard of this before, and I think it is compltely > > > unacceptable. 0 ist just a normal FD, although one that happens to > > > have specific meaning in userspace as stdin. > > > > > > > > > > > If it was discussed then great but if not then I would like to make it > > > > very clear that if in the future you decide to introduce custom > > > > semantics for vfs provided infrastructure - especially when exposed to > > > > userspace - that you please Cc us. > > > > > > I don't think it's just the future. We really need to undo this ASAP. > > > > Christian is not correct in stating that treatment of fd==0 as invalid > > bpf object applies to vfs fd-s. > > The path_fd addition in this patch is really the very first one of this kind. > > At the same time bpf anon fd-s (progs, maps, links, btfs) with fd == 0 > > are invalid and this is not going to change. It's been uapi for a long time. > > > > More so fd-s 0,1,2 are not "normal FDs". > > Unix has made two mistakes: > > 1. fd==0 being valid fd > > 2. establishing convention that fd-s 0,1,2 are stdin, stdout, stderr. > > > > The first mistake makes it hard to pass FD without an extra flag. > > The 2nd mistake is just awful. > > We've seen plenty of severe datacenter wide issues because some > > library or piece of software assumes stdin/out/err. > > Various services have been hurt badly by this "convention". > > In libbpf we added ensure_good_fd() to make sure none of bpf objects > > (progs, maps, etc) are ever seen with fd=0,1,2. > > Other pieces of datacenter software enforce the same. > > > > In other words fds=0,1,2 are taken. They must not be anything but > > stdin/out/err or gutted to /dev/null. > > Otherwise expect horrible bugs and multi day debugging. > > > > Because of that, several years ago, we've decided to fix unix mistake #1 > > when it comes to bpf objects and started reserving fd=0 as invalid. > > This patch is proposing to do the same for path_fd (normal vfs fd) when > > It isn't as you now realized but I'm glad we cleared that up off-list. > > > it is passed to bpf syscall. I think it's a good trade-off and fits > > the rest of bpf uapi. > > > > Everyone who's hiding behind statements: but POSIX is a standard.. > > or this is how we've been doing things... are ignoring the practical > > situation at hand. fd-s 0,1,2 are taken. Make sure your sw never produces them. > > (Prefix: Imagine me calmly writing this and in a relaxed tone.) > > Just to clarify. I do think that deciding that 0 is an invalid file We're still talking past each other. 0 is an invalid bpf object. Not file. There is a difference. The kernel is breaking user space by returning non-file FDs in 0,1,2. Especially as fd = 1 and 2. ensure_good_fd() in libbpf is a library workaround to make sure bpf objects are not the reason for user app brekage. I firmly believe that making kernel return socket FDs and other special FDs with fd >=3 (under new sysctl, for example) will prevent user space breakage. And to answer Ted's question.. Yes. It's a security issue, but it's the other way around. The kernel returning non vfs file FD in [0,1,2] range is a security issue. I'm proposing to fix it with new sysctl or boot flag.
On Thu, May 18, 2023 at 09:25:08AM -0700, Alexei Starovoitov wrote: > We're still talking past each other. > 0 is an invalid bpf object. Not file. > There is a difference. > The kernel is breaking user space by returning non-file FDs in 0,1,2. > Especially as fd = 1 and 2. > ensure_good_fd() in libbpf is a library workaround to make sure bpf objects > are not the reason for user app brekage. > I firmly believe that making kernel return socket FDs and other special FDs with fd >=3 > (under new sysctl, for example) will prevent user space breakage. Wait, why are socket FDs special? I shouldn't be able to have anything but chardev fds, pipes and regular files as fd 0,1,2? I agree that having directory fds and blockdev fds as fd 0,1,2 are confusing and pointless, but I see the value in having a TCP socket as stdin/stdout/stderr. If a fd shouldn't be used for stdio, having an ioctl to enable it and read/write return errors until/unless it's enabled makes sense. But now we have to label each fd as safe/not-safe for stdio, which we can as easily do by setting up our fops appropriately. So I'm not sure what you're trying to accomplish here.
On Thu, May 18, 2023 at 09:25:08AM -0700, Alexei Starovoitov wrote: > On Thu, May 18, 2023 at 10:38:46AM +0200, Christian Brauner wrote: > > On Wed, May 17, 2023 at 09:17:36AM -0700, Alexei Starovoitov wrote: > > > On Wed, May 17, 2023 at 5:05 AM Christoph Hellwig <hch@lst.de> wrote: > > > > > > > > On Wed, May 17, 2023 at 11:11:24AM +0200, Christian Brauner wrote: > > > > > Adding fsdevel so we're aware of this quirk. > > > > > > > > > > So I'm not sure whether this was ever discussed on fsdevel when you took > > > > > the decision to treat fd 0 as AT_FDCWD or in general treat fd 0 as an > > > > > invalid value. > > > > > > > > I've never heard of this before, and I think it is compltely > > > > unacceptable. 0 ist just a normal FD, although one that happens to > > > > have specific meaning in userspace as stdin. > > > > > > > > > > > > > > If it was discussed then great but if not then I would like to make it > > > > > very clear that if in the future you decide to introduce custom > > > > > semantics for vfs provided infrastructure - especially when exposed to > > > > > userspace - that you please Cc us. > > > > > > > > I don't think it's just the future. We really need to undo this ASAP. > > > > > > Christian is not correct in stating that treatment of fd==0 as invalid > > > bpf object applies to vfs fd-s. > > > The path_fd addition in this patch is really the very first one of this kind. > > > At the same time bpf anon fd-s (progs, maps, links, btfs) with fd == 0 > > > are invalid and this is not going to change. It's been uapi for a long time. > > > > > > More so fd-s 0,1,2 are not "normal FDs". > > > Unix has made two mistakes: > > > 1. fd==0 being valid fd > > > 2. establishing convention that fd-s 0,1,2 are stdin, stdout, stderr. > > > > > > The first mistake makes it hard to pass FD without an extra flag. > > > The 2nd mistake is just awful. > > > We've seen plenty of severe datacenter wide issues because some > > > library or piece of software assumes stdin/out/err. > > > Various services have been hurt badly by this "convention". > > > In libbpf we added ensure_good_fd() to make sure none of bpf objects > > > (progs, maps, etc) are ever seen with fd=0,1,2. > > > Other pieces of datacenter software enforce the same. > > > > > > In other words fds=0,1,2 are taken. They must not be anything but > > > stdin/out/err or gutted to /dev/null. > > > Otherwise expect horrible bugs and multi day debugging. > > > > > > Because of that, several years ago, we've decided to fix unix mistake #1 > > > when it comes to bpf objects and started reserving fd=0 as invalid. > > > This patch is proposing to do the same for path_fd (normal vfs fd) when > > > > It isn't as you now realized but I'm glad we cleared that up off-list. > > > > > it is passed to bpf syscall. I think it's a good trade-off and fits > > > the rest of bpf uapi. > > > > > > Everyone who's hiding behind statements: but POSIX is a standard.. > > > or this is how we've been doing things... are ignoring the practical > > > situation at hand. fd-s 0,1,2 are taken. Make sure your sw never produces them. > > > > (Prefix: Imagine me calmly writing this and in a relaxed tone.) > > > > Just to clarify. I do think that deciding that 0 is an invalid file descriptor > > We're still talking past each other. > 0 is an invalid bpf object. Not file. > There is a difference. You cut of a word above. I can't follow your argument. File descriptor numbers are free to refer to whatever we want. They don't care about what type of object they refer to and they better not. > The kernel is breaking user space by returning non-file FDs in 0,1,2. > Especially as fd = 1 and 2. This has a strong aura of a strawman argument. ;) > ensure_good_fd() in libbpf is a library workaround to make sure bpf objects > are not the reason for user app brekage. > I firmly believe that making kernel return socket FDs and other special FDs with fd >=3 > (under new sysctl, for example) will prevent user space breakage. > > And to answer Ted's question.. > Yes. It's a security issue, but it's the other way around. > The kernel returning non vfs file FD in [0,1,2] range is a security issue. > I'm proposing to fix it with new sysctl or boot flag. That's just completely weird. We can see what Linus thinks but I think that's a somewhat outlandish proposal that I wouldn't support.
On Thu, May 18, 2023 at 05:33:32PM +0100, Matthew Wilcox wrote: > On Thu, May 18, 2023 at 09:25:08AM -0700, Alexei Starovoitov wrote: > > We're still talking past each other. > > 0 is an invalid bpf object. Not file. > > There is a difference. > > The kernel is breaking user space by returning non-file FDs in 0,1,2. > > Especially as fd = 1 and 2. > > ensure_good_fd() in libbpf is a library workaround to make sure bpf objects > > are not the reason for user app brekage. > > I firmly believe that making kernel return socket FDs and other special FDs with fd >=3 > > (under new sysctl, for example) will prevent user space breakage. > > Wait, why are socket FDs special? I shouldn't be able to have anything > but chardev fds, pipes and regular files as fd 0,1,2? I agree that having > directory fds and blockdev fds as fd 0,1,2 are confusing and pointless, > but I see the value in having a TCP socket as stdin/stdout/stderr. > > If a fd shouldn't be used for stdio, having an ioctl to enable it > and read/write return errors until/unless it's enabled makes sense. > But now we have to label each fd as safe/not-safe for stdio, which we > can as easily do by setting up our fops appropriately. So I'm not sure > what you're trying to accomplish here. Yeah, I don't think we want weird ioctl()s to restrict file descriptor ranges in any way. This all sounds pretty weird to me and I don't even want to imagine the semantical oddness of suddenly restricting the kernels ability to return some fds. Honestly, most of the time sysctls such as this are the equivalent of throwing the hands up in the air and leaving the room.
On Thu, May 18, 2023 at 10:20 AM Christian Brauner <brauner@kernel.org> wrote: > > That's just completely weird. We can see what Linus thinks but I think > that's a somewhat outlandish proposal that I wouldn't support. I have no idea of the background here. But fd 0 is in absolutely no way special. Anything that thinks that a zero fd is invalid or in any way different from (say) fd 5 is completely and utterly buggy by definition. Now, fd 0 can obviously be invalid in the sense that it may not be open, exactly the same way fd 100 may not be open. So in *that* sense we can have an invalid fd 0, and system calls might return EBADF for trying to access it if somebody has closed it. If bpf thinks that 0 is not a file descriptor, then bpf is simply wrong. No ifs, buts or maybes about it. It's like saying "1 is not a number". It's nonsensical garbage. But maybe I misunderstand the issue. Linus
On Thu, May 18, 2023 at 10:33:30AM -0700, Linus Torvalds wrote: > On Thu, May 18, 2023 at 10:20 AM Christian Brauner <brauner@kernel.org> wrote: > > > > That's just completely weird. We can see what Linus thinks but I think > > that's a somewhat outlandish proposal that I wouldn't support. > > I have no idea of the background here. > > But fd 0 is in absolutely no way special. Anything that thinks that a > zero fd is invalid or in any way different from (say) fd 5 is > completely and utterly buggy by definition. > > Now, fd 0 can obviously be invalid in the sense that it may not be > open, exactly the same way fd 100 may not be open. So in *that* sense > we can have an invalid fd 0, and system calls might return EBADF for > trying to access it if somebody has closed it. > > If bpf thinks that 0 is not a file descriptor, then bpf is simply > wrong. No ifs, buts or maybes about it. It's like saying "1 is not a > number". It's nonsensical garbage. > > But maybe I misunderstand the issue. TL;DR bpf has considerd fd 0 to be an invalid fd value for a long time. We can't exactly follow the motiviation: https://lore.kernel.org/bpf/CAADnVQLitLUc1SozzKjBgq6HGTchE1cO+e4j8eDgtE0zFn5VEw@mail.gmail.com and it's probably to late to change this. Yes, passing fds in extensible structs is inconvenient because you need to pass an indicator alongside an fd to indicate that the zero value is anactual file descriptor. Because unknown fields need to be initialzied to zero so that old kernels can ignore larger structs. So what we usually have to do: mount_setattr() attr.set |= MOUNT_ATTR_IDMAP; attr.userns_fd = 0; We just found out about fd 0 being invalid because a new bpf feature proposal now tried to reuse fd 0 to mean AT_FDCWD when passed through a new bpf feature; which I said isn't acceptable.
On Thu, May 18, 2023 at 07:20:29PM +0200, Christian Brauner wrote: > On Thu, May 18, 2023 at 09:25:08AM -0700, Alexei Starovoitov wrote: > > On Thu, May 18, 2023 at 10:38:46AM +0200, Christian Brauner wrote: > > > On Wed, May 17, 2023 at 09:17:36AM -0700, Alexei Starovoitov wrote: > > > > On Wed, May 17, 2023 at 5:05 AM Christoph Hellwig <hch@lst.de> wrote: > > > > > > > > > > On Wed, May 17, 2023 at 11:11:24AM +0200, Christian Brauner wrote: > > > > > > Adding fsdevel so we're aware of this quirk. > > > > > > > > > > > > So I'm not sure whether this was ever discussed on fsdevel when you took > > > > > > the decision to treat fd 0 as AT_FDCWD or in general treat fd 0 as an > > > > > > invalid value. > > > > > > > > > > I've never heard of this before, and I think it is compltely > > > > > unacceptable. 0 ist just a normal FD, although one that happens to > > > > > have specific meaning in userspace as stdin. > > > > > > > > > > > > > > > > > If it was discussed then great but if not then I would like to make it > > > > > > very clear that if in the future you decide to introduce custom > > > > > > semantics for vfs provided infrastructure - especially when exposed to > > > > > > userspace - that you please Cc us. > > > > > > > > > > I don't think it's just the future. We really need to undo this ASAP. > > > > > > > > Christian is not correct in stating that treatment of fd==0 as invalid > > > > bpf object applies to vfs fd-s. > > > > The path_fd addition in this patch is really the very first one of this kind. > > > > At the same time bpf anon fd-s (progs, maps, links, btfs) with fd == 0 > > > > are invalid and this is not going to change. It's been uapi for a long time. > > > > > > > > More so fd-s 0,1,2 are not "normal FDs". > > > > Unix has made two mistakes: > > > > 1. fd==0 being valid fd > > > > 2. establishing convention that fd-s 0,1,2 are stdin, stdout, stderr. > > > > > > > > The first mistake makes it hard to pass FD without an extra flag. > > > > The 2nd mistake is just awful. > > > > We've seen plenty of severe datacenter wide issues because some > > > > library or piece of software assumes stdin/out/err. > > > > Various services have been hurt badly by this "convention". > > > > In libbpf we added ensure_good_fd() to make sure none of bpf objects > > > > (progs, maps, etc) are ever seen with fd=0,1,2. > > > > Other pieces of datacenter software enforce the same. > > > > > > > > In other words fds=0,1,2 are taken. They must not be anything but > > > > stdin/out/err or gutted to /dev/null. > > > > Otherwise expect horrible bugs and multi day debugging. > > > > > > > > Because of that, several years ago, we've decided to fix unix mistake #1 > > > > when it comes to bpf objects and started reserving fd=0 as invalid. > > > > This patch is proposing to do the same for path_fd (normal vfs fd) when > > > > > > It isn't as you now realized but I'm glad we cleared that up off-list. > > > > > > > it is passed to bpf syscall. I think it's a good trade-off and fits > > > > the rest of bpf uapi. > > > > > > > > Everyone who's hiding behind statements: but POSIX is a standard.. > > > > or this is how we've been doing things... are ignoring the practical > > > > situation at hand. fd-s 0,1,2 are taken. Make sure your sw never produces them. > > > > > > (Prefix: Imagine me calmly writing this and in a relaxed tone.) > > > > > > Just to clarify. I do think that deciding that 0 is an invalid file > > descriptor > > > > > We're still talking past each other. > > 0 is an invalid bpf object. Not file. > > There is a difference. > > You cut of a word above. I can't follow your argument. > File descriptor numbers are free to refer to whatever we want. > They don't care about what type of object they refer to and they > better not. User space cares what they refer to. Unix made integer FD into broken abstraction. regular files need one set of syscalls to work with such 'integer FDs'. sockets need another set of syscalls. All other anon-inodes need another set of syscalls. While user space sees an integer without type and that a root cause of the bugs. And that's particularly bad for integers 0,1,2 where user space assumes that regular files are behind them. Imagine C++ project where base class is called 'FD' while children implement their own access methods that don't overlap with each other. Now one application passes this 'FD' class to another. That other app can only do trial and error to figure out what it got. Any user space developer would reject such class hierarchy design, but kernel folks are so proud of this broken abstraction. FDs are not special.. POSIX is the standard... Right. That's why user space keeps their workarounds, because kernel folks are not empathic to user space needs.
[ Out walking for health - sorry for html crud ] On Thu, May 18, 2023, 11:26 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > User space cares what they refer to. > Unix made integer FD into broken abstraction. > Stop trying to blame anybody else. The unix people came up with a good abstraction, and you've screwed up. That is nobody's fault but your own, and you should just admit it rather than trying to double down on being wrong. Integer file descriptions are fine. Negative ones are invalid, and people who wanted to use "root" and "cwd" have already done so using them for decades. Look up openat(). You just picked completely the wrong invalid number. That's on you. Nobody else. regular files need one set of syscalls to work with such 'integer FDs'. > sockets need another set of syscalls. > None of that has anything to do with file descriptors, which are a unified format. The reason sockets have extra system calls is that you're can just do extra things with them. That's no different from ioctl's, and they could have been done that way, but the extra system calls are cleaner. While user space sees an integer without type and that a root cause of the > bugs. > And that's particularly bad for integers 0,1,2 where user space assumes > that > regular files are behind them. > Stop making yourself look any more stupid than necessary. The 0/1/2 file descriptors are not at all special. They are a shell pipeline default, nothing more. They are not the argument your think they are, and you should stop trying to make them an argument. It only makes you look bad. Imagine C++ project where base class is called 'FD' while children > Ok, now you've just gone totally off the rails. Stop this silliness. You did something bad from ignorance. That's fine. But making excuses for it turns ignorance to willful stupidity. Stop digging yourself deeper in that hole. The system call interface is not some C++ object interface. And we should all be grateful that it isn't. Linus
On Wed, May 17, 2023 at 2:11 AM Christian Brauner <brauner@kernel.org> wrote: > > On Tue, May 16, 2023 at 11:02:42AM -0700, Andrii Nakryiko wrote: > > On Tue, May 16, 2023 at 2:07 AM Christian Brauner <brauner@kernel.org> wrote: > > > > > > On Mon, May 15, 2023 at 05:13:46PM -0700, Andrii Nakryiko wrote: > > > > Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall > > > > forces users to specify pinning location as a string-based absolute or > > > > relative (to current working directory) path. This has various > > > > implications related to security (e.g., symlink-based attacks), forces > > > > BPF FS to be exposed in the file system, which can cause races with > > > > other applications. > > > > > > > > One of the feedbacks we got from folks working with containers heavily > > > > was that inability to use purely FD-based location specification was an > > > > unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET > > > > commands. This patch closes this oversight, adding path_fd field to > > > > > > Cool! > > > > > > > BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by > > > > *at() syscalls for dirfd + pathname combinations. > > > > > > > > This now allows interesting possibilities like working with detached BPF > > > > FS mount (e.g., to perform multiple pinnings without running a risk of > > > > someone interfering with them), and generally making pinning/getting > > > > more secure and not prone to any races and/or security attacks. > > > > > > > > This is demonstrated by a selftest added in subsequent patch that takes > > > > advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate > > > > creating detached BPF FS mount, pinning, and then getting BPF map out of > > > > it, all while never exposing this private instance of BPF FS to outside > > > > worlds. > > > > > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > > > --- > > > > include/linux/bpf.h | 4 ++-- > > > > include/uapi/linux/bpf.h | 5 +++++ > > > > kernel/bpf/inode.c | 16 ++++++++-------- > > > > kernel/bpf/syscall.c | 8 +++++--- > > > > tools/include/uapi/linux/bpf.h | 5 +++++ > > > > 5 files changed, 25 insertions(+), 13 deletions(-) > > > > > > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > > > > index 36e4b2d8cca2..f58895830ada 100644 > > > > --- a/include/linux/bpf.h > > > > +++ b/include/linux/bpf.h > > > > @@ -2077,8 +2077,8 @@ struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd); > > > > struct bpf_link *bpf_link_get_from_fd(u32 ufd); > > > > struct bpf_link *bpf_link_get_curr_or_next(u32 *id); > > > > > > > > -int bpf_obj_pin_user(u32 ufd, const char __user *pathname); > > > > -int bpf_obj_get_user(const char __user *pathname, int flags); > > > > +int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname); > > > > +int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags); > > > > > > > > #define BPF_ITER_FUNC_PREFIX "bpf_iter_" > > > > #define DEFINE_BPF_ITER_FUNC(target, args...) \ > > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > > > index 1bb11a6ee667..db2870a52ce0 100644 > > > > --- a/include/uapi/linux/bpf.h > > > > +++ b/include/uapi/linux/bpf.h > > > > @@ -1420,6 +1420,11 @@ union bpf_attr { > > > > __aligned_u64 pathname; > > > > __u32 bpf_fd; > > > > __u32 file_flags; > > > > + /* same as dirfd in openat() syscall; see openat(2) > > > > + * manpage for details of dirfd/path_fd and pathname semantics; > > > > + * zero path_fd implies AT_FDCWD behavior > > > > + */ > > > > + __u32 path_fd; > > > > }; > > > > > > So 0 is a valid file descriptor and can trivially be created and made to > > > refer to any file. Is this a conscious decision to have a zero value > > > imply AT_FDCWD and have you done this somewhere else in bpf already? > > > Because that's contrary to how any file descriptor based apis work. > > > > > > How this is usually solved for extensible structs is to have a flag > > > field that raises a flag to indicate that the fd fiel is set and thus 0 > > > can be used as a valid value. > > > > > > The way you're doing it right now is very counterintuitive to userspace > > > and pretty much guaranteed to cause subtle bugs. > > > > Yes, it's a very bpf()-specific convention we've settled on a while > > ago. It allows a cleaner and simpler backwards compatibility story > > without having to introduce new flags every single time. Most of BPF > > UAPI by now dictates that (otherwise valid) FD 0 can't be used to pass > > it to bpf() syscall. Most of the time users will be blissfully unaware > > because libbpf and other BPF libraries are checking for fd == 0 and > > dup()'ing them to avoid ever returning FD 0 to the user. > > > > tl;dr, a conscious decision consistent with the rest of BPF UAPI. It > > is a bpf() peculiarity, yes. > > Adding fsdevel so we're aware of this quirk. > > So I'm not sure whether this was ever discussed on fsdevel when you took > the decision to treat fd 0 as AT_FDCWD or in general treat fd 0 as an > invalid value. > > If it was discussed then great but if not then I would like to make it > very clear that if in the future you decide to introduce custom > semantics for vfs provided infrastructure - especially when exposed to > userspace - that you please Cc us. Yep, I'll remember to cc linux-fsdevel@vger.kernel.org for future patches touching on vfs-related concepts, no problem. I wasn't trying to sneak it in or anything, it just never occurred to me, sorry about that. > > You often make it very clear on the list that you don't like it when > anything that touches bpf code doesn't end up getting sent to the bpf > mailing list. It is exactly the same for us. That's a fair request, ack. > > This is not a rant I'm really just trying to make sure that we agree on > common ground when it comes to touching each others code or semantic > assumptions. > > I personally find this extremely weird to treat fd 0 as anything other > than a random fd number as it goes against any userspace assumptions and > drastically deviates from basically every file descriptor interface we > have. I mean, you're not just saying fd 0 is invalid you're even saying > it means AT_FDCWD. Agreed, I can see how this could have undesirable (not just surprising) implications if, say, open(O_PATH) returned fd=0. I just sent a v2 with a new flag that needs to be specified to be able to use path FD (and falling back to backwards-compatible AT_FDCWD behavior otherwise). > > For every other interface, including those that pass fds in structs > whose extensibility is premised on unknown fields being set to zero, > have ways to make fd 0 work just fine. You could've done that to without > inventing custom fd semantics.
On Thu, May 18, 2023 at 11:57:14AM -0700, Linus Torvalds wrote: > That is nobody's fault but your own, and you should just admit it rather > than trying to double down on being wrong. You're correct. I was indeed doubling down on that. Thanks for putting it straight like that. > The 0/1/2 file descriptors are not at all special. They are a shell > pipeline default, nothing more. They are not the argument your think they > are, and you should stop trying to make them an argument. I'm well aware that any file type is allowed to be in FDs 0,1,2 and some user space is using it that way, like old inetd: https://github.com/guillemj/inetutils/blob/master/src/inetd.c#L428 That puts the same socket into 0,1,2 before exec-ing new process. My point that the kernel has to assist user space instead of stubbornly sticking to POSIX and saying all FDs are equal. Most user space developers know that care should be taken with FDs 0,1,2, but it's still easy to make a mistake. To explain the motivation a bit of background: "folly" is a core C++ library for fb apps. Like libstdc++ and a lot more. Until this commit in 2021: https://github.com/facebook/folly/commit/cc9032a0e41a0cba9aa93240c483cfceb0ff44ea the user could launch a new process with flag "folly::Subprocess::CLOSE". It's useful for the cases when child doesn't want to inherit stdin/out/err. There is also GLOG. google's logging library that can be configured to log to stderr. Both libraries are well written with the high code quality. In a big app multiple people use different pieces and may not be aware how all pieces are put together. You can guess the rest... Important service used a library that used another library that started a process with folly::Subprocess::CLOSE. That process opened network connections and used glog. It was "working" for some time, because sys_write() to a socket is a valid command, but when TCP buffers got full synchronous innocuous logging prevented parent from making progress. That footgun was removed from folly in 2021, but we still see this issue from time to time. My point that the kernel can help here. Since folks don't like sysctl to control FD assignment how about something like this: diff --git a/fs/file.c b/fs/file.c index 7893ea161d77..896e79433f61 100644 --- a/fs/file.c +++ b/fs/file.c @@ -554,9 +554,15 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags) return error; } +__weak noinline u32 get_start_fd(void) +{ + return 0; +} +/* mark it as BPF_MODIFY_RETURN to let bpf progs adjust return value */ + int __get_unused_fd_flags(unsigned flags, unsigned long nofile) { - return alloc_fd(0, nofile, flags); + return alloc_fd(get_start_fd(), nofile, flags); } Then we can enforce fd >= 3 for a certain container or for a particular app.
On Thu, May 18, 2023 at 09:44:33PM -0700, Alexei Starovoitov wrote: > On Thu, May 18, 2023 at 11:57:14AM -0700, Linus Torvalds wrote: > > That is nobody's fault but your own, and you should just admit it rather > > than trying to double down on being wrong. > > You're correct. I was indeed doubling down on that. > Thanks for putting it straight like that. > > > The 0/1/2 file descriptors are not at all special. They are a shell > > pipeline default, nothing more. They are not the argument your think they > > are, and you should stop trying to make them an argument. > > I'm well aware that any file type is allowed to be in FDs 0,1,2 and > some user space is using it that way, like old inetd: > https://github.com/guillemj/inetutils/blob/master/src/inetd.c#L428 > That puts the same socket into 0,1,2 before exec-ing new process. > > My point that the kernel has to assist user space instead of > stubbornly sticking to POSIX and saying all FDs are equal. > > Most user space developers know that care should be taken with FDs 0,1,2, > but it's still easy to make a mistake. > > To explain the motivation a bit of background: > "folly" is a core C++ library for fb apps. Like libstdc++ and a lot more. > Until this commit in 2021: > https://github.com/facebook/folly/commit/cc9032a0e41a0cba9aa93240c483cfceb0ff44ea > the user could launch a new process with flag "folly::Subprocess::CLOSE". > It's useful for the cases when child doesn't want to inherit stdin/out/err. > There is also GLOG. google's logging library that can be configured to log to stderr. > Both libraries are well written with the high code quality. > In a big app multiple people use different pieces and may not be aware > how all pieces are put together. You can guess the rest... > Important service used a library that used another library that started a > process with folly::Subprocess::CLOSE. That process opened network connections > and used glog. It was "working" for some time, because sys_write() to a socket > is a valid command, but when TCP buffers got full synchronous innocuous logging > prevented parent from making progress. > > That footgun was removed from folly in 2021, but we still see this issue from time to time. > My point that the kernel can help here. > Since folks don't like sysctl to control FD assignment how about something like this: > > diff --git a/fs/file.c b/fs/file.c > index 7893ea161d77..896e79433f61 100644 > --- a/fs/file.c > +++ b/fs/file.c > @@ -554,9 +554,15 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags) > return error; > } > > +__weak noinline u32 get_start_fd(void) > +{ > + return 0; > +} > +/* mark it as BPF_MODIFY_RETURN to let bpf progs adjust return value */ > + > int __get_unused_fd_flags(unsigned flags, unsigned long nofile) > { > - return alloc_fd(0, nofile, flags); > + return alloc_fd(get_start_fd(), nofile, flags); I'm sorry but I really don't think this is a good idea. We're not going to run BPF programs in core file code. That stuff is sensitive and complex enough as it is without having to take into account that a bpf program can modify behavior. It's also completely unclear whether that's safe to do as this would allow to change fd allocation across the whole kernel. This idea that fd 0, 1, and 2 or any other fd deserve special treatment by the kernel needs to die; and quickly at that.
On Fri, May 19, 2023 at 10:13:09AM +0200, Christian Brauner wrote: > > I'm well aware that any file type is allowed to be in FDs 0,1,2 and > > some user space is using it that way, like old inetd: > > https://github.com/guillemj/inetutils/blob/master/src/inetd.c#L428 > > That puts the same socket into 0,1,2 before exec-ing new process. This is a *feature*. I've seen, and actually written shell scripts which have been wired into /etc/inetd.conf. amd so the fact that shell script can send stdout out to a incoming TCP connection. It should be possible to implement the finger protocol (RFC 1288) as a shell or python script, *precisely* because having inetd connect a socket to FDs 0, 1, and 2 is a good and useful thing to do. > > My point that the kernel has to assist user space instead of > > stubbornly sticking to POSIX and saying all FDs are equal. This is not a matter of adhering to Posix. It's about the fundamental Unix philosophy. Not everything needs to be implemented in a complicated C++ program.... > > To explain the motivation a bit of background: > > "folly" is a core C++ library for fb apps. Like libstdc++ and a lot more. > > Until this commit in 2021: > > https://github.com/facebook/folly/commit/cc9032a0e41a0cba9aa93240c483cfceb0ff44ea > > the user could launch a new process with flag "folly::Subprocess::CLOSE". > > It's useful for the cases when child doesn't want to inherit stdin/out/err. Yeah, sorry, that's just simple bug in the Folly library (which I guess was well named). Closing all of the file descriptors and then opening 0, 1, and 2 using /dev/null is a pretty basic. In fact, there's a convenient daemon(3) will do this for you. No muss, no fuss, no dirty dishes. > I'm sorry but I really don't think this is a good idea. We're not going > to run BPF programs in core file code. That stuff is sensitive and > complex enough as it is without having to take into account that a bpf > program can modify behavior. It's also completely unclear whether that's > safe to do as this would allow to change fd allocation across the whole > kernel. > > This idea that fd 0, 1, and 2 or any other fd deserve special treatment > by the kernel needs to die; and quickly at that. +1. Making fundamentally violent changes to core Unix design and philosophy just to accomodate incompetent user space programmers is IMHO a really bad idea. - Ted
On Thu, May 18, 2023 at 9:44 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > My point that the kernel can help here. > Since folks don't like sysctl to control FD assignment how about something like this: No. Really. The only thing that needs to happen is that people need to *know* that fd's 0/1/2 are not at all special. The thing is, it's even *traditional* to do something like close(0); close(1); close(2); and fork() twice (exiting the first child after the second fork). Usually you'd also make sure to re-set umask, and do a setsid() to make sure you're not part of the controlling terminal any more. Now, some people would then redirect /dev/null to those file descriptors, but not always: file descriptors used to be a fairly limited resource. So people would *want* to use all the file descriptors they could for whatever server connections they implemented. Including, very much, fd 0. So really. There is not a way in hell we will *EVER* consider fd 0 to be special. It isn't, it never has been, and it never shall be. Instead, just spread the word that fd 0 isn't special. The fact that you think that some completely broken C++ code was "written with high quality", and you think that is an excuse for garbage is just sad. Those C++ libraries WERE INCREDIBLE CRAP. They were buggy garbage. And no, they are in no way going to affect the kernel and make the kernel do stupid and wrong things. Linus
On Do, 18.05.23 21:44, Alexei Starovoitov (alexei.starovoitov@gmail.com) wrote: > > The 0/1/2 file descriptors are not at all special. They are a shell > > pipeline default, nothing more. They are not the argument your think they > > are, and you should stop trying to make them an argument. > > I'm well aware that any file type is allowed to be in FDs 0,1,2 and > some user space is using it that way, like old inetd: > https://github.com/guillemj/inetutils/blob/master/src/inetd.c#L428 > That puts the same socket into 0,1,2 before exec-ing new process. > > My point that the kernel has to assist user space instead of > stubbornly sticking to POSIX and saying all FDs are equal. > > Most user space developers know that care should be taken with FDs 0,1,2, > but it's still easy to make a mistake. If I look at libbpf, which supposedly gets the fd handling right I can't find any hint it actually moves the fds it gets from open() to an fd > 2, though? i.e. the code that invokes open() calls in the libbpf codebase happily just accepts an fd < 2, including fd == 0, and this is then later passed back into the kernel in various bpf() syscall invocations, which should refuse it, no? So what's going on there? I did find this though: <snip> new_fd = open("/", O_RDONLY | O_CLOEXEC); if (new_fd < 0) { err = -errno; goto err_free_new_name; } new_fd = dup3(fd, new_fd, O_CLOEXEC); if (new_fd < 0) { err = -errno; goto err_close_new_fd; } </snip> (This is from libbpf.c, bpf_map__reuse_fd(), i.e. https://github.com/libbpf/libbpf/blob/master/src/libbpf.c) Not sure what's going on here, what is this about? you allocate an fd you then immediately replace? Is this done to move the fd away from fd=0? but that doesn't work that way, in case fd 0 is closed when entering this function. Or is this about dup'ping with O_CLOEXEC? Please be aware that F_DUPFD_CLOEXEC exists, which allows you to easily move some fd above some treshold, *and* set O_CLOEXEC at the same time. In the systemd codebase we call this frequently for code that ends up being loaded in arbitrary processes (glibc NSS modules, PAM modules), in order to ensure we get out of the fd < 3 territory quickly. (btw, if you do care about O_CLOEXEC – which is great – then you also want to replace a bunch of fopen(…, "r") with fopen(…, "re") in your codebase) Lennart
On Tue, May 23, 2023 at 12:49 AM Lennart Poettering <lennart@poettering.net> wrote: > > On Do, 18.05.23 21:44, Alexei Starovoitov (alexei.starovoitov@gmail.com) wrote: > > > > The 0/1/2 file descriptors are not at all special. They are a shell > > > pipeline default, nothing more. They are not the argument your think they > > > are, and you should stop trying to make them an argument. > > > > I'm well aware that any file type is allowed to be in FDs 0,1,2 and > > some user space is using it that way, like old inetd: > > https://github.com/guillemj/inetutils/blob/master/src/inetd.c#L428 > > That puts the same socket into 0,1,2 before exec-ing new process. > > > > My point that the kernel has to assist user space instead of > > stubbornly sticking to POSIX and saying all FDs are equal. > > > > Most user space developers know that care should be taken with FDs 0,1,2, > > but it's still easy to make a mistake. > > If I look at libbpf, which supposedly gets the fd handling right I > can't find any hint it actually moves the fds it gets from open() to > an fd > 2, though? > > i.e. the code that invokes open() calls in the libbpf codebase happily > just accepts an fd < 2, including fd == 0, and this is then later > passed back into the kernel in various bpf() syscall invocations, > which should refuse it, no? So what's going on there? libbpf's attempt to ensure that fd>2 applies mostly to BPF objects: maps, progs, btfs, links, which are always returned from bpf() syscall. That's where we use ensure_good_fd(). The snippet you found in bpf_map__reuse_fd() is problematic and slipped through the cracks, I'll fix it, thanks for bringing this to my attention. I'll also check if there are any other places where we should use ensure_good_fd() as well. > > I did find this though: > > <snip> > new_fd = open("/", O_RDONLY | O_CLOEXEC); > if (new_fd < 0) { > err = -errno; > goto err_free_new_name; > } > > new_fd = dup3(fd, new_fd, O_CLOEXEC); > if (new_fd < 0) { > err = -errno; > goto err_close_new_fd; > } > </snip> > > (This is from libbpf.c, bpf_map__reuse_fd(), i.e. https://github.com/libbpf/libbpf/blob/master/src/libbpf.c) > > Not sure what's going on here, what is this about? you allocate an fd > you then immediately replace? Is this done to move the fd away from > fd=0? but that doesn't work that way, in case fd 0 is closed when > entering this function. > > Or is this about dup'ping with O_CLOEXEC? This code predates me, so I don't know the exact reasons why it's implemented exactly like this. It seems like instead of doing open()+dup3() we should do dup3()+ensure_good_fd(). I need to look at this carefully, this is not the code I work with regularly. > > Please be aware that F_DUPFD_CLOEXEC exists, which allows you to > easily move some fd above some treshold, *and* set O_CLOEXEC at the > same time. In the systemd codebase we call this frequently for code > that ends up being loaded in arbitrary processes (glibc NSS modules, > PAM modules), in order to ensure we get out of the fd < 3 territory > quickly. nice, thanks > > (btw, if you do care about O_CLOEXEC – which is great – then you also > want to replace a bunch of fopen(…, "r") with fopen(…, "re") in your > codebase) I will check this as well, thanks for the hints :) > > Lennart
On Thu, May 18, 2023 at 09:44:33PM -0700, Alexei Starovoitov wrote: > That footgun was removed from folly in 2021, but we still see this issue from time to time. > My point that the kernel can help here. > Since folks don't like sysctl to control FD assignment how about something like this: > > diff --git a/fs/file.c b/fs/file.c > index 7893ea161d77..896e79433f61 100644 > --- a/fs/file.c > +++ b/fs/file.c > @@ -554,9 +554,15 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags) > return error; > } > > +__weak noinline u32 get_start_fd(void) > +{ > + return 0; > +} > +/* mark it as BPF_MODIFY_RETURN to let bpf progs adjust return value */ > + > int __get_unused_fd_flags(unsigned flags, unsigned long nofile) > { > - return alloc_fd(0, nofile, flags); > + return alloc_fd(get_start_fd(), nofile, flags); > } > > Then we can enforce fd >= 3 for a certain container or for a particular app. [an extremely belated reply - had been net.dead since mid-May, just got to that thread] As far as I'm concerned, the main conclusion is that BPF handling of file descriptors needs a fairly hostile code review, regarding the interactions with dup2(), shared descriptor tables, SCM_RIGHTS, etc. Rationale: demonstrated utter lack of clue about the nature of file descriptors, along with a weird mental model of how they are used, complete with "if they are used not in the way we expect, let's shove a hook somewhere to enforce The Right Way(tm)". Will do...
diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 36e4b2d8cca2..f58895830ada 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -2077,8 +2077,8 @@ struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd); struct bpf_link *bpf_link_get_from_fd(u32 ufd); struct bpf_link *bpf_link_get_curr_or_next(u32 *id); -int bpf_obj_pin_user(u32 ufd, const char __user *pathname); -int bpf_obj_get_user(const char __user *pathname, int flags); +int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname); +int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags); #define BPF_ITER_FUNC_PREFIX "bpf_iter_" #define DEFINE_BPF_ITER_FUNC(target, args...) \ diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 1bb11a6ee667..db2870a52ce0 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1420,6 +1420,11 @@ union bpf_attr { __aligned_u64 pathname; __u32 bpf_fd; __u32 file_flags; + /* same as dirfd in openat() syscall; see openat(2) + * manpage for details of dirfd/path_fd and pathname semantics; + * zero path_fd implies AT_FDCWD behavior + */ + __u32 path_fd; }; struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */ diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c index 9948b542a470..13bb54f6bd17 100644 --- a/kernel/bpf/inode.c +++ b/kernel/bpf/inode.c @@ -435,7 +435,7 @@ static int bpf_iter_link_pin_kernel(struct dentry *parent, return ret; } -static int bpf_obj_do_pin(const char __user *pathname, void *raw, +static int bpf_obj_do_pin(int path_fd, const char __user *pathname, void *raw, enum bpf_type type) { struct dentry *dentry; @@ -444,7 +444,7 @@ static int bpf_obj_do_pin(const char __user *pathname, void *raw, umode_t mode; int ret; - dentry = user_path_create(AT_FDCWD, pathname, &path, 0); + dentry = user_path_create(path_fd, pathname, &path, 0); if (IS_ERR(dentry)) return PTR_ERR(dentry); @@ -478,7 +478,7 @@ static int bpf_obj_do_pin(const char __user *pathname, void *raw, return ret; } -int bpf_obj_pin_user(u32 ufd, const char __user *pathname) +int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname) { enum bpf_type type; void *raw; @@ -488,14 +488,14 @@ int bpf_obj_pin_user(u32 ufd, const char __user *pathname) if (IS_ERR(raw)) return PTR_ERR(raw); - ret = bpf_obj_do_pin(pathname, raw, type); + ret = bpf_obj_do_pin(path_fd, pathname, raw, type); if (ret != 0) bpf_any_put(raw, type); return ret; } -static void *bpf_obj_do_get(const char __user *pathname, +static void *bpf_obj_do_get(int path_fd, const char __user *pathname, enum bpf_type *type, int flags) { struct inode *inode; @@ -503,7 +503,7 @@ static void *bpf_obj_do_get(const char __user *pathname, void *raw; int ret; - ret = user_path_at(AT_FDCWD, pathname, LOOKUP_FOLLOW, &path); + ret = user_path_at(path_fd, pathname, LOOKUP_FOLLOW, &path); if (ret) return ERR_PTR(ret); @@ -527,7 +527,7 @@ static void *bpf_obj_do_get(const char __user *pathname, return ERR_PTR(ret); } -int bpf_obj_get_user(const char __user *pathname, int flags) +int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags) { enum bpf_type type = BPF_TYPE_UNSPEC; int f_flags; @@ -538,7 +538,7 @@ int bpf_obj_get_user(const char __user *pathname, int flags) if (f_flags < 0) return f_flags; - raw = bpf_obj_do_get(pathname, &type, f_flags); + raw = bpf_obj_do_get(path_fd, pathname, &type, f_flags); if (IS_ERR(raw)) return PTR_ERR(raw); diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 909c112ef537..65a46f6d4be0 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -2697,14 +2697,15 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size) return err; } -#define BPF_OBJ_LAST_FIELD file_flags +#define BPF_OBJ_LAST_FIELD path_fd static int bpf_obj_pin(const union bpf_attr *attr) { if (CHECK_ATTR(BPF_OBJ) || attr->file_flags != 0) return -EINVAL; - return bpf_obj_pin_user(attr->bpf_fd, u64_to_user_ptr(attr->pathname)); + return bpf_obj_pin_user(attr->bpf_fd, attr->path_fd ?: AT_FDCWD, + u64_to_user_ptr(attr->pathname)); } static int bpf_obj_get(const union bpf_attr *attr) @@ -2713,7 +2714,8 @@ static int bpf_obj_get(const union bpf_attr *attr) attr->file_flags & ~BPF_OBJ_FLAG_MASK) return -EINVAL; - return bpf_obj_get_user(u64_to_user_ptr(attr->pathname), + return bpf_obj_get_user(attr->path_fd ?: AT_FDCWD, + u64_to_user_ptr(attr->pathname), attr->file_flags); } diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 1bb11a6ee667..db2870a52ce0 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -1420,6 +1420,11 @@ union bpf_attr { __aligned_u64 pathname; __u32 bpf_fd; __u32 file_flags; + /* same as dirfd in openat() syscall; see openat(2) + * manpage for details of dirfd/path_fd and pathname semantics; + * zero path_fd implies AT_FDCWD behavior + */ + __u32 path_fd; }; struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall forces users to specify pinning location as a string-based absolute or relative (to current working directory) path. This has various implications related to security (e.g., symlink-based attacks), forces BPF FS to be exposed in the file system, which can cause races with other applications. One of the feedbacks we got from folks working with containers heavily was that inability to use purely FD-based location specification was an unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET commands. This patch closes this oversight, adding path_fd field to BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by *at() syscalls for dirfd + pathname combinations. This now allows interesting possibilities like working with detached BPF FS mount (e.g., to perform multiple pinnings without running a risk of someone interfering with them), and generally making pinning/getting more secure and not prone to any races and/or security attacks. This is demonstrated by a selftest added in subsequent patch that takes advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate creating detached BPF FS mount, pinning, and then getting BPF map out of it, all while never exposing this private instance of BPF FS to outside worlds. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> --- include/linux/bpf.h | 4 ++-- include/uapi/linux/bpf.h | 5 +++++ kernel/bpf/inode.c | 16 ++++++++-------- kernel/bpf/syscall.c | 8 +++++--- tools/include/uapi/linux/bpf.h | 5 +++++ 5 files changed, 25 insertions(+), 13 deletions(-)