diff mbox series

[RFC] getvalues(2) prototype

Message ID 20220322192712.709170-1-mszeredi@redhat.com (mailing list archive)
State New, archived
Headers show
Series [RFC] getvalues(2) prototype | expand

Commit Message

Miklos Szeredi March 22, 2022, 7:27 p.m. UTC
Add a new userspace API that allows getting multiple short values in a
single syscall.

This would be useful for the following reasons:

- Calling open/read/close for many small files is inefficient.  E.g. on my
  desktop invoking lsof(1) results in ~60k open + read + close calls under
  /proc and 90% of those are 128 bytes or less.

- Interfaces for getting various attributes and statistics are fragmented.
  For files we have basic stat, statx, extended attributes, file attributes
  (for which there are two overlapping ioctl interfaces).  For mounts and
  superblocks we have stat*fs as well as /proc/$PID/{mountinfo,mountstats}.
  The latter also has the problem on not allowing queries on a specific
  mount.

- Some attributes are cheap to generate, some are expensive.  Allowing
  userspace to select which ones it needs should allow optimizing queries.

- Adding an ascii namespace should allow easy extension and self
  description.

- The values can be text or binary, whichever is fits best.

The interface definition is:

struct name_val {
	const char *name;	/* in */
	struct iovec value_in;	/* in */
	struct iovec value_out;	/* out */
	uint32_t error;		/* out */
	uint32_t reserved;
};

int getvalues(int dfd, const char *path, struct name_val *vec, size_t num,
	      unsigned int flags);

@dfd and @path are used to lookup object $ORIGIN.  @vec contains @num
name/value descriptors.  @flags contains lookup flags for @path.

The syscall returns the number of values filled or an error.

A single name/value descriptor has the following fields:

@name describes the object whose value is to be returned.  E.g.

mnt                    - list of mount parameters
mnt:mountpoint         - the mountpoint of the mount of $ORIGIN
mntns                  - list of mount ID's reachable from the current root
mntns:21:parentid      - parent ID of the mount with ID of 21
xattr:security.selinux - the security.selinux extended attribute
data:foo/bar           - the data contained in file $ORIGIN/foo/bar

If the name starts with the separator, then it is taken to have the same
prefix as the previous name/value descriptor.  E.g. in the following
sequence of names the second one is equivalent to mnt:parentid:

mnt:mountpoint
:parentid

@value_in supplies the buffer to store value(s) in.  If a subsequent
name/value descriptor has NULL value of value_in.iov_base, then the buffer
from the previous name/value descriptor will be used.  This way it's
possible to use a shared buffer for multiple values.

The starting address and length of the actual value will be stored in
@value_out, unless an error has occurred in which case @error will be set to
the positive errno value.

Multiple names starting with the same prefix (including the shorthand form)
may also be batched together under the same lock, so the order of the names
can determine atomicity.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 fs/Makefile                            |   2 +-
 fs/mount.h                             |   8 +
 fs/namespace.c                         |  42 ++
 fs/proc_namespace.c                    |   2 +-
 fs/values.c                            | 524 +++++++++++++++++++++++++
 6 files changed, 577 insertions(+), 2 deletions(-)
 create mode 100644 fs/values.c

Comments

Miklos Szeredi March 22, 2022, 7:30 p.m. UTC | #1
On Tue, Mar 22, 2022 at 8:27 PM Miklos Szeredi <mszeredi@redhat.com> wrote:
>
> Add a new userspace API that allows getting multiple short values in a
> single syscall.

Attaching a test program that allows arbitrary queries.

Thanks,
Miklos
Casey Schaufler March 22, 2022, 8:36 p.m. UTC | #2
On 3/22/2022 12:27 PM, Miklos Szeredi wrote:
> Add a new userspace API that allows getting multiple short values in a
> single syscall.
>
> This would be useful for the following reasons:
>
> - Calling open/read/close for many small files is inefficient.  E.g. on my
>    desktop invoking lsof(1) results in ~60k open + read + close calls under
>    /proc and 90% of those are 128 bytes or less.

You don't need the generality below to address this issue.

int openandread(const char *path, char *buffer, size_t size);

would address this case swimmingly.

> - Interfaces for getting various attributes and statistics are fragmented.
>    For files we have basic stat, statx, extended attributes, file attributes
>    (for which there are two overlapping ioctl interfaces).  For mounts and
>    superblocks we have stat*fs as well as /proc/$PID/{mountinfo,mountstats}.
>    The latter also has the problem on not allowing queries on a specific
>    mount.
>
> - Some attributes are cheap to generate, some are expensive.  Allowing
>    userspace to select which ones it needs should allow optimizing queries.
>
> - Adding an ascii namespace should allow easy extension and self
>    description.
>
> - The values can be text or binary, whichever is fits best.
>
> The interface definition is:
>
> struct name_val {
> 	const char *name;	/* in */
> 	struct iovec value_in;	/* in */
> 	struct iovec value_out;	/* out */
> 	uint32_t error;		/* out */
> 	uint32_t reserved;
> };
>
> int getvalues(int dfd, const char *path, struct name_val *vec, size_t num,
> 	      unsigned int flags);
>
> @dfd and @path are used to lookup object $ORIGIN.

To be conventional you should have

int getvalues(const char *path, struct name_val *vec, size_t num,
	      unsigned int flags);

and

int fgetvalues(int dfd, struct name_val *vec, size_t num,
	       unsigned int flags);

>    @vec contains @num
> name/value descriptors.  @flags contains lookup flags for @path.
>
> The syscall returns the number of values filled or an error.
>
> A single name/value descriptor has the following fields:
>
> @name describes the object whose value is to be returned.  E.g.
>
> mnt                    - list of mount parameters
> mnt:mountpoint         - the mountpoint of the mount of $ORIGIN
> mntns                  - list of mount ID's reachable from the current root
> mntns:21:parentid      - parent ID of the mount with ID of 21
> xattr:security.selinux - the security.selinux extended attribute
> data:foo/bar           - the data contained in file $ORIGIN/foo/bar
>
> If the name starts with the separator, then it is taken to have the same
> prefix as the previous name/value descriptor.  E.g. in the following
> sequence of names the second one is equivalent to mnt:parentid:
>
> mnt:mountpoint
> :parentid

I would consider this a clever optimization that is likely to
cause confusion and result in lots of bugs. Yes, you'll save some
parsing time, but the debugging headaches it would introduce would
more than make up for it.

> @value_in supplies the buffer to store value(s) in.  If a subsequent
> name/value descriptor has NULL value of value_in.iov_base, then the buffer
> from the previous name/value descriptor will be used.  This way it's
> possible to use a shared buffer for multiple values.

I would not trust very many application developers to use the NULL
value_in.iov_base correctly. In fact, I can't think of a way it could
be used sensibly. Sure, you could put two things of known size into
one buffer and use the known offset, but again that's a clever
optimization that will result in more bugs than it is worth.

> The starting address and length of the actual value will be stored in
> @value_out, unless an error has occurred in which case @error will be set to
> the positive errno value.

You only need to return the address if you do the multi-value in a buffer.
Which I've already expressed distaste for.

If the application asks for 6 attributes and is denied access to the
3rd (by an LSM let's say) what is returned? Are the 1st two buffers
filled? How can I tell which attribute was unavailable?

> Multiple names starting with the same prefix (including the shorthand form)
> may also be batched together under the same lock, so the order of the names
> can determine atomicity.

I will believe you, but it's hardly obvious why this is true.

>
> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> ---
>   arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>   fs/Makefile                            |   2 +-
>   fs/mount.h                             |   8 +
>   fs/namespace.c                         |  42 ++
>   fs/proc_namespace.c                    |   2 +-
>   fs/values.c                            | 524 +++++++++++++++++++++++++
>   6 files changed, 577 insertions(+), 2 deletions(-)
>   create mode 100644 fs/values.c
>
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index c84d12608cd2..c72668001b39 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -372,6 +372,7 @@
>   448	common	process_mrelease	sys_process_mrelease
>   449	common	futex_waitv		sys_futex_waitv
>   450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
> +451	common	getvalues		sys_getvalues
>   
>   #
>   # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/fs/Makefile b/fs/Makefile
> index 208a74e0b00e..f00d6bcd1178 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -16,7 +16,7 @@ obj-y :=	open.o read_write.o file_table.o super.o \
>   		pnode.o splice.o sync.o utimes.o d_path.o \
>   		stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
>   		fs_types.o fs_context.o fs_parser.o fsopen.o init.o \
> -		kernel_read_file.o remap_range.o
> +		kernel_read_file.o remap_range.o values.o
>   
>   ifeq ($(CONFIG_BLOCK),y)
>   obj-y +=	buffer.o direct-io.o mpage.o
> diff --git a/fs/mount.h b/fs/mount.h
> index 0b6e08cf8afb..a3ca5233e481 100644
> --- a/fs/mount.h
> +++ b/fs/mount.h
> @@ -148,3 +148,11 @@ static inline bool is_anon_ns(struct mnt_namespace *ns)
>   }
>   
>   extern void mnt_cursor_del(struct mnt_namespace *ns, struct mount *cursor);
> +
> +extern void namespace_lock_read(void);
> +extern void namespace_unlock_read(void);
> +extern void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt);
> +extern void seq_mnt_list(struct seq_file *seq, struct mnt_namespace *ns,
> +			 struct path *root);
> +extern struct vfsmount *mnt_lookup_by_id(struct mnt_namespace *ns,
> +					 struct path *root, int id);
> diff --git a/fs/namespace.c b/fs/namespace.c
> index de6fae84f1a1..52b15c17251f 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1405,6 +1405,38 @@ void mnt_cursor_del(struct mnt_namespace *ns, struct mount *cursor)
>   }
>   #endif  /* CONFIG_PROC_FS */
>   
> +void seq_mnt_list(struct seq_file *seq, struct mnt_namespace *ns,
> +		  struct path *root)
> +{
> +	struct mount *m;
> +
> +	down_read(&namespace_sem);
> +	for (m = mnt_list_next(ns, &ns->list); m; m = mnt_list_next(ns, &m->mnt_list)) {
> +		if (is_path_reachable(m, m->mnt.mnt_root, root)) {
> +			seq_printf(seq, "%i", m->mnt_id);
> +			seq_putc(seq, '\0');
> +		}
> +	}
> +	up_read(&namespace_sem);
> +}
> +
> +/* called with namespace_sem held for read */
> +struct vfsmount *mnt_lookup_by_id(struct mnt_namespace *ns, struct path *root,
> +				  int id)
> +{
> +	struct mount *m;
> +
> +	for (m = mnt_list_next(ns, &ns->list); m; m = mnt_list_next(ns, &m->mnt_list)) {
> +		if (m->mnt_id == id) {
> +			if (is_path_reachable(m, m->mnt.mnt_root, root))
> +				return mntget(&m->mnt);
> +			else
> +				return NULL;
> +		}
> +	}
> +	return NULL;
> +}
> +
>   /**
>    * may_umount_tree - check if a mount tree is busy
>    * @m: root of mount tree
> @@ -1494,6 +1526,16 @@ static inline void namespace_lock(void)
>   	down_write(&namespace_sem);
>   }
>   
> +void namespace_lock_read(void)
> +{
> +	down_read(&namespace_sem);
> +}
> +
> +void namespace_unlock_read(void)
> +{
> +	up_read(&namespace_sem);
> +}
> +
>   enum umount_tree_flags {
>   	UMOUNT_SYNC = 1,
>   	UMOUNT_PROPAGATE = 2,
> diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
> index 49650e54d2f8..fa6dc2c20578 100644
> --- a/fs/proc_namespace.c
> +++ b/fs/proc_namespace.c
> @@ -61,7 +61,7 @@ static int show_sb_opts(struct seq_file *m, struct super_block *sb)
>   	return security_sb_show_options(m, sb);
>   }
>   
> -static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
> +void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
>   {
>   	static const struct proc_fs_opts mnt_opts[] = {
>   		{ MNT_NOSUID, ",nosuid" },
> diff --git a/fs/values.c b/fs/values.c
> new file mode 100644
> index 000000000000..618fa9bf48a1
> --- /dev/null
> +++ b/fs/values.c
> @@ -0,0 +1,524 @@
> +#include <linux/syscalls.h>
> +#include <linux/printk.h>
> +#include <linux/namei.h>
> +#include <linux/fs_struct.h>
> +#include <linux/posix_acl_xattr.h>
> +#include <linux/xattr.h>
> +#include "pnode.h"
> +#include "internal.h"
> +
> +#define VAL_GRSEP ':'
> +
> +struct name_val {
> +	const char __user *name;	/* in */
> +	struct iovec value_in;		/* in */
> +	struct iovec value_out;		/* out */
> +	__u32 error;			/* out */
> +	__u32 reserved;
> +};
> +
> +struct val_iter {
> +	struct name_val __user *curr;
> +	size_t num;
> +	struct iovec vec;
> +	char name[256];
> +	size_t bufsize;
> +	struct seq_file seq;
> +	const char *prefix;
> +	const char *sub;
> +};
> +
> +struct val_desc {
> +	const char *name;
> +	union {
> +		int idx;
> +		int (*get)(struct val_iter *vi, const struct path *path);
> +	};
> +};
> +
> +static int val_get(struct val_iter *vi)
> +{
> +	struct name_val nameval;
> +	long err;
> +
> +	if (copy_from_user(&nameval, vi->curr, sizeof(nameval)))
> +		return -EFAULT;
> +
> +	err = strncpy_from_user(vi->name, nameval.name, sizeof(vi->name));
> +	if (err < 0)
> +		return err;
> +	if (err == sizeof(vi->name))
> +		return -ERANGE;
> +
> +	if (nameval.value_in.iov_base)
> +		vi->vec = nameval.value_in;
> +
> +	vi->seq.size = min(vi->vec.iov_len, vi->bufsize);
> +	vi->seq.count = 0;
> +
> +	return 0;
> +}
> +
> +static int val_next(struct val_iter *vi)
> +{
> +	vi->curr++;
> +	vi->num--;
> +
> +	return vi->num ? val_get(vi) : 0;
> +}
> +
> +static int val_end(struct val_iter *vi, size_t count)
> +{
> +	struct iovec iov = {
> +		.iov_base = vi->vec.iov_base,
> +		.iov_len = count,
> +	};
> +
> +	if (copy_to_user(&vi->curr->value_out, &iov, sizeof(iov)))
> +		return -EFAULT;
> +
> +	vi->vec.iov_base += count;
> +	vi->vec.iov_len -= count;
> +
> +	return val_next(vi);
> +}
> +
> +static int val_err(struct val_iter *vi, int err)
> +{
> +	if (put_user(-err, &vi->curr->error))
> +		return -EFAULT;
> +
> +	return val_next(vi);
> +}
> +
> +static int val_end_seq(struct val_iter *vi, int err)
> +{
> +	size_t count = vi->seq.count;
> +
> +	if (err)
> +		return val_err(vi, err);
> +
> +	if (count == vi->seq.size)
> +		return -EOVERFLOW;
> +
> +	if (copy_to_user(vi->vec.iov_base, vi->seq.buf, count))
> +		return -EFAULT;
> +
> +	return val_end(vi, count);
> +}
> +
> +static struct val_desc *val_lookup(struct val_iter *vi, struct val_desc *vd)
> +{
> +	const char *name = vi->name;
> +	const char *prefix = vi->prefix;
> +	size_t prefixlen = strlen(prefix);
> +
> +	if (prefixlen) {
> +		/*
> +		 * Name beggining with a group separator is a shorthand for
> +		 * previously prefix.
> +		 */
> +		if (name[0] == VAL_GRSEP) {
> +			name++;
> +		} else  {
> +			if (strncmp(name, prefix, prefixlen) != 0)
> +				return NULL;
> +			name += prefixlen;
> +		}
> +	}
> +
> +	vi->sub = NULL;
> +	for (; vd->name; vd++) {
> +		if (strcmp(name, vd->name) == 0)
> +			break;
> +		else {
> +			size_t grlen = strlen(vd->name);
> +
> +			if (strncmp(vd->name, name, grlen) == 0 &&
> +			    name[grlen] == VAL_GRSEP) {
> +				vi->sub = name + grlen + 1;
> +				break;
> +			}
> +		}
> +	}
> +	return vd;
> +}
> +
> +static int val_get_group(struct val_iter *vi, struct val_desc *vd)
> +{
> +	for (; vd->name; vd++)
> +		seq_write(&vi->seq, vd->name, strlen(vd->name) + 1);
> +
> +	return val_end_seq(vi, 0);
> +}
> +
> +static bool val_push_prefix(struct val_iter *vi, const char **oldprefix)
> +{
> +	char *newprefix;
> +
> +	newprefix = kmemdup_nul(vi->name, vi->sub - vi->name, GFP_KERNEL);
> +	if (newprefix) {
> +		*oldprefix = vi->prefix;
> +		vi->prefix = newprefix;
> +	}
> +
> +	return newprefix;
> +}
> +
> +static void val_pop_prefix(struct val_iter *vi, const char *oldprefix)
> +{
> +	kfree(vi->prefix);
> +	vi->prefix = oldprefix;
> +}
> +
> +enum {
> +	VAL_MNT_ID,
> +	VAL_MNT_PARENTID,
> +	VAL_MNT_ROOT,
> +	VAL_MNT_MOUNTPOINT,
> +	VAL_MNT_OPTIONS,
> +	VAL_MNT_SHARED,
> +	VAL_MNT_MASTER,
> +	VAL_MNT_PROPAGATE_FROM,
> +	VAL_MNT_UNBINDABLE,
> +	VAL_MNT_NOTFOUND,
> +};
> +
> +static struct val_desc val_mnt_group[] = {
> +	{ .name = "id",			.idx = VAL_MNT_ID		},
> +	{ .name = "parentid",		.idx = VAL_MNT_PARENTID,	},
> +	{ .name = "root",		.idx = VAL_MNT_ROOT,		},
> +	{ .name = "mountpoint",		.idx = VAL_MNT_MOUNTPOINT,	},
> +	{ .name = "options",		.idx = VAL_MNT_OPTIONS,		},
> +	{ .name = "shared",		.idx = VAL_MNT_SHARED,		},
> +	{ .name = "master",		.idx = VAL_MNT_MASTER,		},
> +	{ .name = "propagate_from",	.idx = VAL_MNT_PROPAGATE_FROM,	},
> +	{ .name = "unbindable",		.idx = VAL_MNT_UNBINDABLE,	},
> +	{ .name = NULL,			.idx = VAL_MNT_NOTFOUND		},
> +};
> +
> +static int seq_mnt_root(struct seq_file *seq, struct vfsmount *mnt)
> +{
> +	struct super_block *sb = mnt->mnt_sb;
> +	int err = 0;
> +
> +	if (sb->s_op->show_path) {
> +		err = sb->s_op->show_path(seq, mnt->mnt_root);
> +		if (!err) {
> +			seq_putc(seq, '\0');
> +			if (seq->count < seq->size)
> +				seq->count = string_unescape(seq->buf, seq->buf, seq->size, UNESCAPE_OCTAL);
> +		}
> +	} else {
> +		seq_dentry(seq, mnt->mnt_root, "");
> +	}
> +
> +	return err;
> +}
> +
> +static int val_mnt_show(struct val_iter *vi, struct vfsmount *mnt)
> +{
> +	struct mount *m = real_mount(mnt);
> +	struct path root, mnt_path;
> +	struct val_desc *vd;
> +	const char *oldprefix;
> +	int err = 0;
> +
> +	if (!val_push_prefix(vi, &oldprefix))
> +		return -ENOMEM;
> +
> +	while (!err && vi->num) {
> +		vd = val_lookup(vi, val_mnt_group);
> +		if (!vd)
> +			break;
> +
> +		switch(vd->idx) {
> +		case VAL_MNT_ID:
> +			seq_printf(&vi->seq, "%i", m->mnt_id);
> +			break;
> +		case VAL_MNT_PARENTID:
> +			seq_printf(&vi->seq, "%i", m->mnt_parent->mnt_id);
> +			break;
> +		case VAL_MNT_ROOT:
> +			seq_mnt_root(&vi->seq, mnt);
> +			break;
> +		case VAL_MNT_MOUNTPOINT:
> +			get_fs_root(current->fs, &root);
> +			mnt_path.dentry = mnt->mnt_root;
> +			mnt_path.mnt = mnt;
> +			err = seq_path_root(&vi->seq, &mnt_path, &root, "");
> +			path_put(&root);
> +			break;
> +		case VAL_MNT_OPTIONS:
> +			seq_puts(&vi->seq, mnt->mnt_flags & MNT_READONLY ? "ro" : "rw");
> +			show_mnt_opts(&vi->seq, mnt);
> +			break;
> +		case VAL_MNT_SHARED:
> +			if (IS_MNT_SHARED(m))
> +				seq_printf(&vi->seq, "%i,", m->mnt_group_id);
> +			break;
> +		case VAL_MNT_MASTER:
> +			if (IS_MNT_SLAVE(m))
> +				seq_printf(&vi->seq, "%i,",
> +					   m->mnt_master->mnt_group_id);
> +			break;
> +		case VAL_MNT_PROPAGATE_FROM:
> +			if (IS_MNT_SLAVE(m)) {
> +				int dom;
> +
> +				get_fs_root(current->fs, &root);
> +				dom = get_dominating_id(m, &root);
> +				path_put(&root);
> +				if (dom)
> +					seq_printf(&vi->seq, "%i,", dom);
> +			}
> +			break;
> +		case VAL_MNT_UNBINDABLE:
> +			if (IS_MNT_UNBINDABLE(m))
> +				seq_puts(&vi->seq, "yes");
> +			break;
> +		default:
> +			err = -ENOENT;
> +			break;
> +		}
> +		err = val_end_seq(vi, err);
> +	}
> +	val_pop_prefix(vi, oldprefix);
> +
> +	return err;
> +}
> +
> +static int val_mnt_get(struct val_iter *vi, const struct path *path)
> +{
> +	int err;
> +
> +	if (!vi->sub)
> +		return val_get_group(vi, val_mnt_group);
> +
> +	namespace_lock_read();
> +	err = val_mnt_show(vi, path->mnt);
> +	namespace_unlock_read();
> +
> +	return err;
> +}
> +
> +static int val_mntns_get(struct val_iter *vi, const struct path *path)
> +{
> +	struct mnt_namespace *mnt_ns = current->nsproxy->mnt_ns;
> +	struct vfsmount *mnt;
> +	struct path root;
> +	char *end;
> +	int mnt_id;
> +	int err;
> +
> +	if (!vi->sub) {
> +		get_fs_root(current->fs, &root);
> +		seq_mnt_list(&vi->seq, mnt_ns, &root);
> +		path_put(&root);
> +		return val_end_seq(vi, 0);
> +	}
> +
> +	end = strchr(vi->sub, VAL_GRSEP);
> +	if (end)
> +		*end = '\0';
> +	err = kstrtoint(vi->sub, 10, &mnt_id);
> +	if (err)
> +		return val_err(vi, err);
> +	vi->sub = NULL;
> +	if (end) {
> +		*end = VAL_GRSEP;
> +		vi->sub = end + 1;
> +	}
> +
> +	namespace_lock_read();
> +	get_fs_root(current->fs, &root);
> +	mnt = mnt_lookup_by_id(mnt_ns, &root, mnt_id);
> +	path_put(&root);
> +	if (!mnt) {
> +		namespace_unlock_read();
> +		return val_err(vi, -ENOENT);
> +	}
> +	if (vi->sub)
> +		err = val_mnt_show(vi, mnt);
> +	else
> +		err = val_get_group(vi, val_mnt_group);
> +
> +	namespace_unlock_read();
> +	mntput(mnt);
> +
> +	return err;
> +}
> +
> +static ssize_t val_do_read(struct val_iter *vi, struct path *path)
> +{
> +	ssize_t ret;
> +	struct file *file;
> +	struct open_flags op = {
> +		.open_flag = O_RDONLY,
> +		.acc_mode = MAY_READ,
> +		.intent = LOOKUP_OPEN,
> +	};
> +
> +	file = do_file_open_root(path, "", &op);
> +	if (IS_ERR(file))
> +		return PTR_ERR(file);
> +
> +	ret = vfs_read(file, vi->vec.iov_base, vi->vec.iov_len, NULL);
> +	fput(file);
> +
> +	return ret;
> +}
> +
> +static ssize_t val_do_readlink(struct val_iter *vi, struct path *path)
> +{
> +	int ret;
> +
> +	ret = security_inode_readlink(path->dentry);
> +	if (ret)
> +		return ret;
> +
> +	return vfs_readlink(path->dentry, vi->vec.iov_base, vi->vec.iov_len);
> +}
> +
> +static inline bool dot_or_dotdot(const char *s)
> +{
> +	return s[0] == '.' &&
> +		(s[1] == '/' || s[1] == '\0' ||
> +		 (s[1] == '.' && (s[2] == '/' || s[2] == '\0')));
> +}
> +
> +/*
> + * - empty path is okay
> + * - must not begin or end with slash or have a double slash anywhere
> + * - must not have . or .. components
> + */
> +static bool val_verify_path(const char *subpath)
> +{
> +	const char *s = subpath;
> +
> +	if (s[0] == '\0')
> +		return true;
> +
> +	if (s[0] == '/' || s[strlen(s) - 1] == '/' || strstr(s, "//"))
> +		return false;
> +
> +	for (s--; s; s = strstr(s + 3, "/."))
> +		if (dot_or_dotdot(s + 1))
> +			return false;
> +
> +	return true;
> +}
> +
> +static int val_data_get(struct val_iter *vi, const struct path *path)
> +{
> +	struct path this;
> +	ssize_t ret;
> +
> +	if (!vi->sub)
> +		return val_err(vi, -ENOENT);
> +
> +	if (!val_verify_path(vi->sub))
> +		return val_err(vi, -EINVAL);
> +
> +	ret = vfs_path_lookup(path->dentry, path->mnt, vi->sub,
> +			      LOOKUP_NO_XDEV | LOOKUP_BENEATH |
> +			      LOOKUP_IN_ROOT, &this);
> +	if (ret)
> +		return val_err(vi, ret);
> +
> +	ret = -ENODATA;
> +	if (d_is_reg(this.dentry) || d_is_symlink(this.dentry)) {
> +		if (d_is_reg(this.dentry))
> +			ret = val_do_read(vi, &this);
> +		else
> +			ret = val_do_readlink(vi, &this);
> +	}
> +	path_put(&this);
> +	if (ret == -EFAULT)
> +		return ret;
> +	if (ret < 0)
> +		return val_err(vi, ret);
> +	if (ret == vi->vec.iov_len)
> +		return -EOVERFLOW;
> +
> +	return val_end(vi, ret);
> +}
> +
> +static int val_xattr_get(struct val_iter *vi, const struct path *path)
> +{
> +	ssize_t ret;
> +	struct user_namespace *mnt_userns = mnt_user_ns(path->mnt);
> +	void *value = vi->seq.buf + vi->seq.count;
> +	size_t size = min_t(size_t, vi->seq.size - vi->seq.count,
> +			    XATTR_SIZE_MAX);
> +
> +	if (!vi->sub)
> +		return val_err(vi, -ENOENT);
> +
> +	ret = vfs_getxattr(mnt_userns, path->dentry, vi->sub, value, size);
> +	if (ret < 0)
> +		return val_err(vi, ret);
> +
> +	if ((strcmp(vi->sub, XATTR_NAME_POSIX_ACL_ACCESS) == 0) ||
> +	    (strcmp(vi->sub, XATTR_NAME_POSIX_ACL_DEFAULT) == 0))
> +		posix_acl_fix_xattr_to_user(mnt_userns, value, ret);
> +
> +	vi->seq.count += ret;
> +
> +	return val_end_seq(vi, 0);
> +}
> +
> +
> +static struct val_desc val_toplevel_group[] = {
> +	{ .name = "mnt",	.get = val_mnt_get,	},
> +	{ .name = "mntns",	.get = val_mntns_get,	},
> +	{ .name = "xattr",	.get = val_xattr_get,	},
> +	{ .name = "data",	.get = val_data_get,	},
> +	{ .name = NULL },
> +};
> +
> +SYSCALL_DEFINE5(getvalues,
> +		int, dfd,
> +		const char __user *, pathname,
> +		struct name_val __user *, vec,
> +		size_t, num,
> +		unsigned int, flags)
> +{
> +	char vals[1024];
> +	struct val_iter vi = {
> +		.curr = vec,
> +		.num = num,
> +		.seq.buf = vals,
> +		.bufsize = sizeof(vals),
> +		.prefix = "",
> +	};
> +	struct val_desc *vd;
> +	struct path path = {};
> +	ssize_t err;
> +
> +	err = user_path_at(dfd, pathname, 0, &path);
> +	if (err)
> +		return err;
> +
> +	err = val_get(&vi);
> +	if (err)
> +		goto out;
> +
> +	if (!strlen(vi.name)) {
> +		err = val_get_group(&vi, val_toplevel_group);
> +		goto out;
> +	}
> +	while (!err && vi.num) {
> +		vd = val_lookup(&vi, val_toplevel_group);
> +		if (!vd->name)
> +			err = val_err(&vi, -ENOENT);
> +		else
> +			err = vd->get(&vi, &path);
> +	}
> +out:
> +	if (err == -EOVERFLOW)
> +		err = 0;
> +
> +	path_put(&path);
> +	return err < 0 ? err : num - vi.num;
> +}
Casey Schaufler March 22, 2022, 8:53 p.m. UTC | #3
On 3/22/2022 1:36 PM, Casey Schaufler wrote:
> On 3/22/2022 12:27 PM, Miklos Szeredi wrote:
>> Add a new userspace API that allows getting multiple short values in a
>> single syscall.
>>
>> This would be useful for the following reasons:
>>
>> - Calling open/read/close for many small files is inefficient. E.g. on my
>>    desktop invoking lsof(1) results in ~60k open + read + close calls under
>>    /proc and 90% of those are 128 bytes or less.
>
> You don't need the generality below to address this issue.
>
> int openandread(const char *path, char *buffer, size_t size);
>
> would address this case swimmingly.
>
>> - Interfaces for getting various attributes and statistics are fragmented.
>>    For files we have basic stat, statx, extended attributes, file attributes
>>    (for which there are two overlapping ioctl interfaces).  For mounts and
>>    superblocks we have stat*fs as well as /proc/$PID/{mountinfo,mountstats}.
>>    The latter also has the problem on not allowing queries on a specific
>>    mount.
>>
>> - Some attributes are cheap to generate, some are expensive. Allowing
>>    userspace to select which ones it needs should allow optimizing queries.
>>
>> - Adding an ascii namespace should allow easy extension and self
>>    description.

... I forgot to mention that without a mechanism to ask what attributes
are available on the file this is completely pointless. If the application
asks for xattr:security.selinux on a system using Smack, which would have
xattr:security.SMACK64 and might have xattr:security.SMACK64EXEC or
xattr:security.SMACK64TRANSMUTE, it won't be happy. Or on a system with
no LSM for that matter.

>>
>> - The values can be text or binary, whichever is fits best.
>>
>> The interface definition is:
>>
>> struct name_val {
>>     const char *name;    /* in */
>>     struct iovec value_in;    /* in */
>>     struct iovec value_out;    /* out */
>>     uint32_t error;        /* out */
>>     uint32_t reserved;
>> };
>>
>> int getvalues(int dfd, const char *path, struct name_val *vec, size_t num,
>>           unsigned int flags);
>>
>> @dfd and @path are used to lookup object $ORIGIN.
>
> To be conventional you should have
>
> int getvalues(const char *path, struct name_val *vec, size_t num,
>           unsigned int flags);
>
> and
>
> int fgetvalues(int dfd, struct name_val *vec, size_t num,
>            unsigned int flags);
>
>>    @vec contains @num
>> name/value descriptors.  @flags contains lookup flags for @path.
>>
>> The syscall returns the number of values filled or an error.
>>
>> A single name/value descriptor has the following fields:
>>
>> @name describes the object whose value is to be returned.  E.g.
>>
>> mnt                    - list of mount parameters
>> mnt:mountpoint         - the mountpoint of the mount of $ORIGIN
>> mntns                  - list of mount ID's reachable from the current root
>> mntns:21:parentid      - parent ID of the mount with ID of 21
>> xattr:security.selinux - the security.selinux extended attribute
>> data:foo/bar           - the data contained in file $ORIGIN/foo/bar
>>
>> If the name starts with the separator, then it is taken to have the same
>> prefix as the previous name/value descriptor.  E.g. in the following
>> sequence of names the second one is equivalent to mnt:parentid:
>>
>> mnt:mountpoint
>> :parentid
>
> I would consider this a clever optimization that is likely to
> cause confusion and result in lots of bugs. Yes, you'll save some
> parsing time, but the debugging headaches it would introduce would
> more than make up for it.
>
>> @value_in supplies the buffer to store value(s) in.  If a subsequent
>> name/value descriptor has NULL value of value_in.iov_base, then the buffer
>> from the previous name/value descriptor will be used.  This way it's
>> possible to use a shared buffer for multiple values.
>
> I would not trust very many application developers to use the NULL
> value_in.iov_base correctly. In fact, I can't think of a way it could
> be used sensibly. Sure, you could put two things of known size into
> one buffer and use the known offset, but again that's a clever
> optimization that will result in more bugs than it is worth.
>
>> The starting address and length of the actual value will be stored in
>> @value_out, unless an error has occurred in which case @error will be set to
>> the positive errno value.
>
> You only need to return the address if you do the multi-value in a buffer.
> Which I've already expressed distaste for.
>
> If the application asks for 6 attributes and is denied access to the
> 3rd (by an LSM let's say) what is returned? Are the 1st two buffers
> filled? How can I tell which attribute was unavailable?
>
>> Multiple names starting with the same prefix (including the shorthand form)
>> may also be batched together under the same lock, so the order of the names
>> can determine atomicity.
>
> I will believe you, but it's hardly obvious why this is true.
>
>>
>> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
>> ---
>>   arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>>   fs/Makefile                            |   2 +-
>>   fs/mount.h                             |   8 +
>>   fs/namespace.c                         |  42 ++
>>   fs/proc_namespace.c                    |   2 +-
>>   fs/values.c                            | 524 +++++++++++++++++++++++++
>>   6 files changed, 577 insertions(+), 2 deletions(-)
>>   create mode 100644 fs/values.c
>>
>> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
>> index c84d12608cd2..c72668001b39 100644
>> --- a/arch/x86/entry/syscalls/syscall_64.tbl
>> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
>> @@ -372,6 +372,7 @@
>>   448    common    process_mrelease    sys_process_mrelease
>>   449    common    futex_waitv        sys_futex_waitv
>>   450    common    set_mempolicy_home_node sys_set_mempolicy_home_node
>> +451    common    getvalues        sys_getvalues
>>     #
>>   # Due to a historical design error, certain syscalls are numbered differently
>> diff --git a/fs/Makefile b/fs/Makefile
>> index 208a74e0b00e..f00d6bcd1178 100644
>> --- a/fs/Makefile
>> +++ b/fs/Makefile
>> @@ -16,7 +16,7 @@ obj-y :=    open.o read_write.o file_table.o super.o \
>>           pnode.o splice.o sync.o utimes.o d_path.o \
>>           stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
>>           fs_types.o fs_context.o fs_parser.o fsopen.o init.o \
>> -        kernel_read_file.o remap_range.o
>> +        kernel_read_file.o remap_range.o values.o
>>     ifeq ($(CONFIG_BLOCK),y)
>>   obj-y +=    buffer.o direct-io.o mpage.o
>> diff --git a/fs/mount.h b/fs/mount.h
>> index 0b6e08cf8afb..a3ca5233e481 100644
>> --- a/fs/mount.h
>> +++ b/fs/mount.h
>> @@ -148,3 +148,11 @@ static inline bool is_anon_ns(struct mnt_namespace *ns)
>>   }
>>     extern void mnt_cursor_del(struct mnt_namespace *ns, struct mount *cursor);
>> +
>> +extern void namespace_lock_read(void);
>> +extern void namespace_unlock_read(void);
>> +extern void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt);
>> +extern void seq_mnt_list(struct seq_file *seq, struct mnt_namespace *ns,
>> +             struct path *root);
>> +extern struct vfsmount *mnt_lookup_by_id(struct mnt_namespace *ns,
>> +                     struct path *root, int id);
>> diff --git a/fs/namespace.c b/fs/namespace.c
>> index de6fae84f1a1..52b15c17251f 100644
>> --- a/fs/namespace.c
>> +++ b/fs/namespace.c
>> @@ -1405,6 +1405,38 @@ void mnt_cursor_del(struct mnt_namespace *ns, struct mount *cursor)
>>   }
>>   #endif  /* CONFIG_PROC_FS */
>>   +void seq_mnt_list(struct seq_file *seq, struct mnt_namespace *ns,
>> +          struct path *root)
>> +{
>> +    struct mount *m;
>> +
>> +    down_read(&namespace_sem);
>> +    for (m = mnt_list_next(ns, &ns->list); m; m = mnt_list_next(ns, &m->mnt_list)) {
>> +        if (is_path_reachable(m, m->mnt.mnt_root, root)) {
>> +            seq_printf(seq, "%i", m->mnt_id);
>> +            seq_putc(seq, '\0');
>> +        }
>> +    }
>> +    up_read(&namespace_sem);
>> +}
>> +
>> +/* called with namespace_sem held for read */
>> +struct vfsmount *mnt_lookup_by_id(struct mnt_namespace *ns, struct path *root,
>> +                  int id)
>> +{
>> +    struct mount *m;
>> +
>> +    for (m = mnt_list_next(ns, &ns->list); m; m = mnt_list_next(ns, &m->mnt_list)) {
>> +        if (m->mnt_id == id) {
>> +            if (is_path_reachable(m, m->mnt.mnt_root, root))
>> +                return mntget(&m->mnt);
>> +            else
>> +                return NULL;
>> +        }
>> +    }
>> +    return NULL;
>> +}
>> +
>>   /**
>>    * may_umount_tree - check if a mount tree is busy
>>    * @m: root of mount tree
>> @@ -1494,6 +1526,16 @@ static inline void namespace_lock(void)
>>       down_write(&namespace_sem);
>>   }
>>   +void namespace_lock_read(void)
>> +{
>> +    down_read(&namespace_sem);
>> +}
>> +
>> +void namespace_unlock_read(void)
>> +{
>> +    up_read(&namespace_sem);
>> +}
>> +
>>   enum umount_tree_flags {
>>       UMOUNT_SYNC = 1,
>>       UMOUNT_PROPAGATE = 2,
>> diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
>> index 49650e54d2f8..fa6dc2c20578 100644
>> --- a/fs/proc_namespace.c
>> +++ b/fs/proc_namespace.c
>> @@ -61,7 +61,7 @@ static int show_sb_opts(struct seq_file *m, struct super_block *sb)
>>       return security_sb_show_options(m, sb);
>>   }
>>   -static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
>> +void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
>>   {
>>       static const struct proc_fs_opts mnt_opts[] = {
>>           { MNT_NOSUID, ",nosuid" },
>> diff --git a/fs/values.c b/fs/values.c
>> new file mode 100644
>> index 000000000000..618fa9bf48a1
>> --- /dev/null
>> +++ b/fs/values.c
>> @@ -0,0 +1,524 @@
>> +#include <linux/syscalls.h>
>> +#include <linux/printk.h>
>> +#include <linux/namei.h>
>> +#include <linux/fs_struct.h>
>> +#include <linux/posix_acl_xattr.h>
>> +#include <linux/xattr.h>
>> +#include "pnode.h"
>> +#include "internal.h"
>> +
>> +#define VAL_GRSEP ':'
>> +
>> +struct name_val {
>> +    const char __user *name;    /* in */
>> +    struct iovec value_in;        /* in */
>> +    struct iovec value_out;        /* out */
>> +    __u32 error;            /* out */
>> +    __u32 reserved;
>> +};
>> +
>> +struct val_iter {
>> +    struct name_val __user *curr;
>> +    size_t num;
>> +    struct iovec vec;
>> +    char name[256];
>> +    size_t bufsize;
>> +    struct seq_file seq;
>> +    const char *prefix;
>> +    const char *sub;
>> +};
>> +
>> +struct val_desc {
>> +    const char *name;
>> +    union {
>> +        int idx;
>> +        int (*get)(struct val_iter *vi, const struct path *path);
>> +    };
>> +};
>> +
>> +static int val_get(struct val_iter *vi)
>> +{
>> +    struct name_val nameval;
>> +    long err;
>> +
>> +    if (copy_from_user(&nameval, vi->curr, sizeof(nameval)))
>> +        return -EFAULT;
>> +
>> +    err = strncpy_from_user(vi->name, nameval.name, sizeof(vi->name));
>> +    if (err < 0)
>> +        return err;
>> +    if (err == sizeof(vi->name))
>> +        return -ERANGE;
>> +
>> +    if (nameval.value_in.iov_base)
>> +        vi->vec = nameval.value_in;
>> +
>> +    vi->seq.size = min(vi->vec.iov_len, vi->bufsize);
>> +    vi->seq.count = 0;
>> +
>> +    return 0;
>> +}
>> +
>> +static int val_next(struct val_iter *vi)
>> +{
>> +    vi->curr++;
>> +    vi->num--;
>> +
>> +    return vi->num ? val_get(vi) : 0;
>> +}
>> +
>> +static int val_end(struct val_iter *vi, size_t count)
>> +{
>> +    struct iovec iov = {
>> +        .iov_base = vi->vec.iov_base,
>> +        .iov_len = count,
>> +    };
>> +
>> +    if (copy_to_user(&vi->curr->value_out, &iov, sizeof(iov)))
>> +        return -EFAULT;
>> +
>> +    vi->vec.iov_base += count;
>> +    vi->vec.iov_len -= count;
>> +
>> +    return val_next(vi);
>> +}
>> +
>> +static int val_err(struct val_iter *vi, int err)
>> +{
>> +    if (put_user(-err, &vi->curr->error))
>> +        return -EFAULT;
>> +
>> +    return val_next(vi);
>> +}
>> +
>> +static int val_end_seq(struct val_iter *vi, int err)
>> +{
>> +    size_t count = vi->seq.count;
>> +
>> +    if (err)
>> +        return val_err(vi, err);
>> +
>> +    if (count == vi->seq.size)
>> +        return -EOVERFLOW;
>> +
>> +    if (copy_to_user(vi->vec.iov_base, vi->seq.buf, count))
>> +        return -EFAULT;
>> +
>> +    return val_end(vi, count);
>> +}
>> +
>> +static struct val_desc *val_lookup(struct val_iter *vi, struct val_desc *vd)
>> +{
>> +    const char *name = vi->name;
>> +    const char *prefix = vi->prefix;
>> +    size_t prefixlen = strlen(prefix);
>> +
>> +    if (prefixlen) {
>> +        /*
>> +         * Name beggining with a group separator is a shorthand for
>> +         * previously prefix.
>> +         */
>> +        if (name[0] == VAL_GRSEP) {
>> +            name++;
>> +        } else  {
>> +            if (strncmp(name, prefix, prefixlen) != 0)
>> +                return NULL;
>> +            name += prefixlen;
>> +        }
>> +    }
>> +
>> +    vi->sub = NULL;
>> +    for (; vd->name; vd++) {
>> +        if (strcmp(name, vd->name) == 0)
>> +            break;
>> +        else {
>> +            size_t grlen = strlen(vd->name);
>> +
>> +            if (strncmp(vd->name, name, grlen) == 0 &&
>> +                name[grlen] == VAL_GRSEP) {
>> +                vi->sub = name + grlen + 1;
>> +                break;
>> +            }
>> +        }
>> +    }
>> +    return vd;
>> +}
>> +
>> +static int val_get_group(struct val_iter *vi, struct val_desc *vd)
>> +{
>> +    for (; vd->name; vd++)
>> +        seq_write(&vi->seq, vd->name, strlen(vd->name) + 1);
>> +
>> +    return val_end_seq(vi, 0);
>> +}
>> +
>> +static bool val_push_prefix(struct val_iter *vi, const char **oldprefix)
>> +{
>> +    char *newprefix;
>> +
>> +    newprefix = kmemdup_nul(vi->name, vi->sub - vi->name, GFP_KERNEL);
>> +    if (newprefix) {
>> +        *oldprefix = vi->prefix;
>> +        vi->prefix = newprefix;
>> +    }
>> +
>> +    return newprefix;
>> +}
>> +
>> +static void val_pop_prefix(struct val_iter *vi, const char *oldprefix)
>> +{
>> +    kfree(vi->prefix);
>> +    vi->prefix = oldprefix;
>> +}
>> +
>> +enum {
>> +    VAL_MNT_ID,
>> +    VAL_MNT_PARENTID,
>> +    VAL_MNT_ROOT,
>> +    VAL_MNT_MOUNTPOINT,
>> +    VAL_MNT_OPTIONS,
>> +    VAL_MNT_SHARED,
>> +    VAL_MNT_MASTER,
>> +    VAL_MNT_PROPAGATE_FROM,
>> +    VAL_MNT_UNBINDABLE,
>> +    VAL_MNT_NOTFOUND,
>> +};
>> +
>> +static struct val_desc val_mnt_group[] = {
>> +    { .name = "id",            .idx = VAL_MNT_ID        },
>> +    { .name = "parentid",        .idx = VAL_MNT_PARENTID,    },
>> +    { .name = "root",        .idx = VAL_MNT_ROOT,        },
>> +    { .name = "mountpoint",        .idx = VAL_MNT_MOUNTPOINT,    },
>> +    { .name = "options",        .idx = VAL_MNT_OPTIONS, },
>> +    { .name = "shared",        .idx = VAL_MNT_SHARED,        },
>> +    { .name = "master",        .idx = VAL_MNT_MASTER,        },
>> +    { .name = "propagate_from",    .idx = VAL_MNT_PROPAGATE_FROM,    },
>> +    { .name = "unbindable",        .idx = VAL_MNT_UNBINDABLE,    },
>> +    { .name = NULL,            .idx = VAL_MNT_NOTFOUND },
>> +};
>> +
>> +static int seq_mnt_root(struct seq_file *seq, struct vfsmount *mnt)
>> +{
>> +    struct super_block *sb = mnt->mnt_sb;
>> +    int err = 0;
>> +
>> +    if (sb->s_op->show_path) {
>> +        err = sb->s_op->show_path(seq, mnt->mnt_root);
>> +        if (!err) {
>> +            seq_putc(seq, '\0');
>> +            if (seq->count < seq->size)
>> +                seq->count = string_unescape(seq->buf, seq->buf, seq->size, UNESCAPE_OCTAL);
>> +        }
>> +    } else {
>> +        seq_dentry(seq, mnt->mnt_root, "");
>> +    }
>> +
>> +    return err;
>> +}
>> +
>> +static int val_mnt_show(struct val_iter *vi, struct vfsmount *mnt)
>> +{
>> +    struct mount *m = real_mount(mnt);
>> +    struct path root, mnt_path;
>> +    struct val_desc *vd;
>> +    const char *oldprefix;
>> +    int err = 0;
>> +
>> +    if (!val_push_prefix(vi, &oldprefix))
>> +        return -ENOMEM;
>> +
>> +    while (!err && vi->num) {
>> +        vd = val_lookup(vi, val_mnt_group);
>> +        if (!vd)
>> +            break;
>> +
>> +        switch(vd->idx) {
>> +        case VAL_MNT_ID:
>> +            seq_printf(&vi->seq, "%i", m->mnt_id);
>> +            break;
>> +        case VAL_MNT_PARENTID:
>> +            seq_printf(&vi->seq, "%i", m->mnt_parent->mnt_id);
>> +            break;
>> +        case VAL_MNT_ROOT:
>> +            seq_mnt_root(&vi->seq, mnt);
>> +            break;
>> +        case VAL_MNT_MOUNTPOINT:
>> +            get_fs_root(current->fs, &root);
>> +            mnt_path.dentry = mnt->mnt_root;
>> +            mnt_path.mnt = mnt;
>> +            err = seq_path_root(&vi->seq, &mnt_path, &root, "");
>> +            path_put(&root);
>> +            break;
>> +        case VAL_MNT_OPTIONS:
>> +            seq_puts(&vi->seq, mnt->mnt_flags & MNT_READONLY ? "ro" : "rw");
>> +            show_mnt_opts(&vi->seq, mnt);
>> +            break;
>> +        case VAL_MNT_SHARED:
>> +            if (IS_MNT_SHARED(m))
>> +                seq_printf(&vi->seq, "%i,", m->mnt_group_id);
>> +            break;
>> +        case VAL_MNT_MASTER:
>> +            if (IS_MNT_SLAVE(m))
>> +                seq_printf(&vi->seq, "%i,",
>> +                       m->mnt_master->mnt_group_id);
>> +            break;
>> +        case VAL_MNT_PROPAGATE_FROM:
>> +            if (IS_MNT_SLAVE(m)) {
>> +                int dom;
>> +
>> +                get_fs_root(current->fs, &root);
>> +                dom = get_dominating_id(m, &root);
>> +                path_put(&root);
>> +                if (dom)
>> +                    seq_printf(&vi->seq, "%i,", dom);
>> +            }
>> +            break;
>> +        case VAL_MNT_UNBINDABLE:
>> +            if (IS_MNT_UNBINDABLE(m))
>> +                seq_puts(&vi->seq, "yes");
>> +            break;
>> +        default:
>> +            err = -ENOENT;
>> +            break;
>> +        }
>> +        err = val_end_seq(vi, err);
>> +    }
>> +    val_pop_prefix(vi, oldprefix);
>> +
>> +    return err;
>> +}
>> +
>> +static int val_mnt_get(struct val_iter *vi, const struct path *path)
>> +{
>> +    int err;
>> +
>> +    if (!vi->sub)
>> +        return val_get_group(vi, val_mnt_group);
>> +
>> +    namespace_lock_read();
>> +    err = val_mnt_show(vi, path->mnt);
>> +    namespace_unlock_read();
>> +
>> +    return err;
>> +}
>> +
>> +static int val_mntns_get(struct val_iter *vi, const struct path *path)
>> +{
>> +    struct mnt_namespace *mnt_ns = current->nsproxy->mnt_ns;
>> +    struct vfsmount *mnt;
>> +    struct path root;
>> +    char *end;
>> +    int mnt_id;
>> +    int err;
>> +
>> +    if (!vi->sub) {
>> +        get_fs_root(current->fs, &root);
>> +        seq_mnt_list(&vi->seq, mnt_ns, &root);
>> +        path_put(&root);
>> +        return val_end_seq(vi, 0);
>> +    }
>> +
>> +    end = strchr(vi->sub, VAL_GRSEP);
>> +    if (end)
>> +        *end = '\0';
>> +    err = kstrtoint(vi->sub, 10, &mnt_id);
>> +    if (err)
>> +        return val_err(vi, err);
>> +    vi->sub = NULL;
>> +    if (end) {
>> +        *end = VAL_GRSEP;
>> +        vi->sub = end + 1;
>> +    }
>> +
>> +    namespace_lock_read();
>> +    get_fs_root(current->fs, &root);
>> +    mnt = mnt_lookup_by_id(mnt_ns, &root, mnt_id);
>> +    path_put(&root);
>> +    if (!mnt) {
>> +        namespace_unlock_read();
>> +        return val_err(vi, -ENOENT);
>> +    }
>> +    if (vi->sub)
>> +        err = val_mnt_show(vi, mnt);
>> +    else
>> +        err = val_get_group(vi, val_mnt_group);
>> +
>> +    namespace_unlock_read();
>> +    mntput(mnt);
>> +
>> +    return err;
>> +}
>> +
>> +static ssize_t val_do_read(struct val_iter *vi, struct path *path)
>> +{
>> +    ssize_t ret;
>> +    struct file *file;
>> +    struct open_flags op = {
>> +        .open_flag = O_RDONLY,
>> +        .acc_mode = MAY_READ,
>> +        .intent = LOOKUP_OPEN,
>> +    };
>> +
>> +    file = do_file_open_root(path, "", &op);
>> +    if (IS_ERR(file))
>> +        return PTR_ERR(file);
>> +
>> +    ret = vfs_read(file, vi->vec.iov_base, vi->vec.iov_len, NULL);
>> +    fput(file);
>> +
>> +    return ret;
>> +}
>> +
>> +static ssize_t val_do_readlink(struct val_iter *vi, struct path *path)
>> +{
>> +    int ret;
>> +
>> +    ret = security_inode_readlink(path->dentry);
>> +    if (ret)
>> +        return ret;
>> +
>> +    return vfs_readlink(path->dentry, vi->vec.iov_base, vi->vec.iov_len);
>> +}
>> +
>> +static inline bool dot_or_dotdot(const char *s)
>> +{
>> +    return s[0] == '.' &&
>> +        (s[1] == '/' || s[1] == '\0' ||
>> +         (s[1] == '.' && (s[2] == '/' || s[2] == '\0')));
>> +}
>> +
>> +/*
>> + * - empty path is okay
>> + * - must not begin or end with slash or have a double slash anywhere
>> + * - must not have . or .. components
>> + */
>> +static bool val_verify_path(const char *subpath)
>> +{
>> +    const char *s = subpath;
>> +
>> +    if (s[0] == '\0')
>> +        return true;
>> +
>> +    if (s[0] == '/' || s[strlen(s) - 1] == '/' || strstr(s, "//"))
>> +        return false;
>> +
>> +    for (s--; s; s = strstr(s + 3, "/."))
>> +        if (dot_or_dotdot(s + 1))
>> +            return false;
>> +
>> +    return true;
>> +}
>> +
>> +static int val_data_get(struct val_iter *vi, const struct path *path)
>> +{
>> +    struct path this;
>> +    ssize_t ret;
>> +
>> +    if (!vi->sub)
>> +        return val_err(vi, -ENOENT);
>> +
>> +    if (!val_verify_path(vi->sub))
>> +        return val_err(vi, -EINVAL);
>> +
>> +    ret = vfs_path_lookup(path->dentry, path->mnt, vi->sub,
>> +                  LOOKUP_NO_XDEV | LOOKUP_BENEATH |
>> +                  LOOKUP_IN_ROOT, &this);
>> +    if (ret)
>> +        return val_err(vi, ret);
>> +
>> +    ret = -ENODATA;
>> +    if (d_is_reg(this.dentry) || d_is_symlink(this.dentry)) {
>> +        if (d_is_reg(this.dentry))
>> +            ret = val_do_read(vi, &this);
>> +        else
>> +            ret = val_do_readlink(vi, &this);
>> +    }
>> +    path_put(&this);
>> +    if (ret == -EFAULT)
>> +        return ret;
>> +    if (ret < 0)
>> +        return val_err(vi, ret);
>> +    if (ret == vi->vec.iov_len)
>> +        return -EOVERFLOW;
>> +
>> +    return val_end(vi, ret);
>> +}
>> +
>> +static int val_xattr_get(struct val_iter *vi, const struct path *path)
>> +{
>> +    ssize_t ret;
>> +    struct user_namespace *mnt_userns = mnt_user_ns(path->mnt);
>> +    void *value = vi->seq.buf + vi->seq.count;
>> +    size_t size = min_t(size_t, vi->seq.size - vi->seq.count,
>> +                XATTR_SIZE_MAX);
>> +
>> +    if (!vi->sub)
>> +        return val_err(vi, -ENOENT);
>> +
>> +    ret = vfs_getxattr(mnt_userns, path->dentry, vi->sub, value, size);
>> +    if (ret < 0)
>> +        return val_err(vi, ret);
>> +
>> +    if ((strcmp(vi->sub, XATTR_NAME_POSIX_ACL_ACCESS) == 0) ||
>> +        (strcmp(vi->sub, XATTR_NAME_POSIX_ACL_DEFAULT) == 0))
>> +        posix_acl_fix_xattr_to_user(mnt_userns, value, ret);
>> +
>> +    vi->seq.count += ret;
>> +
>> +    return val_end_seq(vi, 0);
>> +}
>> +
>> +
>> +static struct val_desc val_toplevel_group[] = {
>> +    { .name = "mnt",    .get = val_mnt_get,    },
>> +    { .name = "mntns",    .get = val_mntns_get,    },
>> +    { .name = "xattr",    .get = val_xattr_get,    },
>> +    { .name = "data",    .get = val_data_get,    },
>> +    { .name = NULL },
>> +};
>> +
>> +SYSCALL_DEFINE5(getvalues,
>> +        int, dfd,
>> +        const char __user *, pathname,
>> +        struct name_val __user *, vec,
>> +        size_t, num,
>> +        unsigned int, flags)
>> +{
>> +    char vals[1024];
>> +    struct val_iter vi = {
>> +        .curr = vec,
>> +        .num = num,
>> +        .seq.buf = vals,
>> +        .bufsize = sizeof(vals),
>> +        .prefix = "",
>> +    };
>> +    struct val_desc *vd;
>> +    struct path path = {};
>> +    ssize_t err;
>> +
>> +    err = user_path_at(dfd, pathname, 0, &path);
>> +    if (err)
>> +        return err;
>> +
>> +    err = val_get(&vi);
>> +    if (err)
>> +        goto out;
>> +
>> +    if (!strlen(vi.name)) {
>> +        err = val_get_group(&vi, val_toplevel_group);
>> +        goto out;
>> +    }
>> +    while (!err && vi.num) {
>> +        vd = val_lookup(&vi, val_toplevel_group);
>> +        if (!vd->name)
>> +            err = val_err(&vi, -ENOENT);
>> +        else
>> +            err = vd->get(&vi, &path);
>> +    }
>> +out:
>> +    if (err == -EOVERFLOW)
>> +        err = 0;
>> +
>> +    path_put(&path);
>> +    return err < 0 ? err : num - vi.num;
>> +}
Greg Kroah-Hartman March 23, 2022, 7:14 a.m. UTC | #4
On Tue, Mar 22, 2022 at 01:36:26PM -0700, Casey Schaufler wrote:
> On 3/22/2022 12:27 PM, Miklos Szeredi wrote:
> > Add a new userspace API that allows getting multiple short values in a
> > single syscall.
> > 
> > This would be useful for the following reasons:
> > 
> > - Calling open/read/close for many small files is inefficient.  E.g. on my
> >    desktop invoking lsof(1) results in ~60k open + read + close calls under
> >    /proc and 90% of those are 128 bytes or less.
> 
> You don't need the generality below to address this issue.
> 
> int openandread(const char *path, char *buffer, size_t size);
> 
> would address this case swimmingly.

Or you can use my readfile(2) proposal:
	https://lore.kernel.org/r/20200704140250.423345-1-gregkh@linuxfoundation.org

But you had better actually benchmark the thing.  Turns out that I could
not find a real-world use that shows improvements in anything.

Miklos, what userspace tool will use this new syscall and how will it be
faster than readfile() was?

I should rebase that against 5.17 again and see if anything is different
due to the new spectre-bhb slowdowns.

thanks,

greg k-h
Greg Kroah-Hartman March 23, 2022, 7:16 a.m. UTC | #5
On Tue, Mar 22, 2022 at 08:27:12PM +0100, Miklos Szeredi wrote:
> Add a new userspace API that allows getting multiple short values in a
> single syscall.
> 
> This would be useful for the following reasons:
> 
> - Calling open/read/close for many small files is inefficient.  E.g. on my
>   desktop invoking lsof(1) results in ~60k open + read + close calls under
>   /proc and 90% of those are 128 bytes or less.

As I found out in testing readfile():
	https://lore.kernel.org/r/20200704140250.423345-1-gregkh@linuxfoundation.org

microbenchmarks do show a tiny improvement in doing something like this,
but that's not a real-world application.

Do you have anything real that can use this that shows a speedup?

thanks,

greg k-h
Bernd Schubert March 23, 2022, 10:26 a.m. UTC | #6
On 3/23/22 08:16, Greg KH wrote:
> On Tue, Mar 22, 2022 at 08:27:12PM +0100, Miklos Szeredi wrote:
>> Add a new userspace API that allows getting multiple short values in a
>> single syscall.
>>
>> This would be useful for the following reasons:
>>
>> - Calling open/read/close for many small files is inefficient.  E.g. on my
>>    desktop invoking lsof(1) results in ~60k open + read + close calls under
>>    /proc and 90% of those are 128 bytes or less.
> 
> As I found out in testing readfile():
> 	https://lore.kernel.org/r/20200704140250.423345-1-gregkh@linuxfoundation.org
> 
> microbenchmarks do show a tiny improvement in doing something like this,
> but that's not a real-world application.
> 
> Do you have anything real that can use this that shows a speedup?

Add in network file systems. Demonstrating that this is useful locally 
and with micro benchmarks - yeah, helps a bit to make it locally faster. 
But the real case is when thousands of clients are handled by a few 
network servers. Even reducing wire latency for a single client would 
make a difference here.

There is a bit of chicken-egg problem - it is a bit of work to add to 
file systems like NFS (or others that are not the kernel), but the work 
won't be made there before there is no syscall for it. To demonstrate it 
on NFS one also needs a an official protocol change first. And then 
applications also need to support that new syscall first.
I had a hard time explaining weather physicist back in 2009 that it is 
not a good idea to have millions of 512B files on  Lustre. With recent 
AI workload this gets even worse.

This is the same issue in fact with the fuse patches we are creating 
(https://lwn.net/Articles/888877/). Miklos asked for benchmark numbers - 
we can only demonstrate slight effects locally, but out goal is in fact 
to reduce network latencies and server load.

- Bernd
Christian Brauner March 23, 2022, 11:42 a.m. UTC | #7
On Tue, Mar 22, 2022 at 08:27:12PM +0100, Miklos Szeredi wrote:
> Add a new userspace API that allows getting multiple short values in a
> single syscall.
> 
> This would be useful for the following reasons:
> 
> - Calling open/read/close for many small files is inefficient.  E.g. on my
>   desktop invoking lsof(1) results in ~60k open + read + close calls under
>   /proc and 90% of those are 128 bytes or less.
> 
> - Interfaces for getting various attributes and statistics are fragmented.
>   For files we have basic stat, statx, extended attributes, file attributes
>   (for which there are two overlapping ioctl interfaces).  For mounts and
>   superblocks we have stat*fs as well as /proc/$PID/{mountinfo,mountstats}.
>   The latter also has the problem on not allowing queries on a specific
>   mount.
> 
> - Some attributes are cheap to generate, some are expensive.  Allowing
>   userspace to select which ones it needs should allow optimizing queries.
> 
> - Adding an ascii namespace should allow easy extension and self
>   description.
> 
> - The values can be text or binary, whichever is fits best.
> 
> The interface definition is:
> 
> struct name_val {
> 	const char *name;	/* in */
> 	struct iovec value_in;	/* in */
> 	struct iovec value_out;	/* out */
> 	uint32_t error;		/* out */
> 	uint32_t reserved;
> };

Yes, we really need a way to query for various fs information. I'm a bit
torn about the details of this interface though. I would really like if
we had interfaces that are really easy to use from userspace comparable
to statx for example. I know having this generic as possible was the
goal but I'm just a bit uneasy with such interfaces. They become
cumbersome to use in userspace. I'm not sure if the data: part for
example should be in this at all. That seems a bit out of place to me.

Would it be really that bad if we added multiple syscalls for different
types of info? For example, querying mount information could reasonably
be a more focussed separate system call allowing to retrieve detailed
mount propagation info, flags, idmappings and so on. Prior approaches to
solve this in a completely generic way have gotten us not very far too
so I'm a bit worried about this aspect too.

For performance concerns, once those multiple system calls are in place
they will naturally be made available in io_uring and so if people do
really have performance issues they can rely on io_uring (to e.g.,
offload system call cost).

> 
> int getvalues(int dfd, const char *path, struct name_val *vec, size_t num,
> 	      unsigned int flags);
> 
> @dfd and @path are used to lookup object $ORIGIN.  @vec contains @num
> name/value descriptors.  @flags contains lookup flags for @path.
> 
> The syscall returns the number of values filled or an error.
> 
> A single name/value descriptor has the following fields:
> 
> @name describes the object whose value is to be returned.  E.g.
> 
> mnt                    - list of mount parameters
> mnt:mountpoint         - the mountpoint of the mount of $ORIGIN
> mntns                  - list of mount ID's reachable from the current root
> mntns:21:parentid      - parent ID of the mount with ID of 21
> xattr:security.selinux - the security.selinux extended attribute
> data:foo/bar           - the data contained in file $ORIGIN/foo/bar
> 
> If the name starts with the separator, then it is taken to have the same
> prefix as the previous name/value descriptor.  E.g. in the following
> sequence of names the second one is equivalent to mnt:parentid:
> 
> mnt:mountpoint
> :parentid
> 
> @value_in supplies the buffer to store value(s) in.  If a subsequent
> name/value descriptor has NULL value of value_in.iov_base, then the buffer
> from the previous name/value descriptor will be used.  This way it's
> possible to use a shared buffer for multiple values.
> 
> The starting address and length of the actual value will be stored in
> @value_out, unless an error has occurred in which case @error will be set to
> the positive errno value.
> 
> Multiple names starting with the same prefix (including the shorthand form)
> may also be batched together under the same lock, so the order of the names
> can determine atomicity.
> 
> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> ---
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  fs/Makefile                            |   2 +-
>  fs/mount.h                             |   8 +
>  fs/namespace.c                         |  42 ++
>  fs/proc_namespace.c                    |   2 +-
>  fs/values.c                            | 524 +++++++++++++++++++++++++
>  6 files changed, 577 insertions(+), 2 deletions(-)
>  create mode 100644 fs/values.c
> 
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index c84d12608cd2..c72668001b39 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -372,6 +372,7 @@
>  448	common	process_mrelease	sys_process_mrelease
>  449	common	futex_waitv		sys_futex_waitv
>  450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
> +451	common	getvalues		sys_getvalues
>  
>  #
>  # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/fs/Makefile b/fs/Makefile
> index 208a74e0b00e..f00d6bcd1178 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -16,7 +16,7 @@ obj-y :=	open.o read_write.o file_table.o super.o \
>  		pnode.o splice.o sync.o utimes.o d_path.o \
>  		stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
>  		fs_types.o fs_context.o fs_parser.o fsopen.o init.o \
> -		kernel_read_file.o remap_range.o
> +		kernel_read_file.o remap_range.o values.o
>  
>  ifeq ($(CONFIG_BLOCK),y)
>  obj-y +=	buffer.o direct-io.o mpage.o
> diff --git a/fs/mount.h b/fs/mount.h
> index 0b6e08cf8afb..a3ca5233e481 100644
> --- a/fs/mount.h
> +++ b/fs/mount.h
> @@ -148,3 +148,11 @@ static inline bool is_anon_ns(struct mnt_namespace *ns)
>  }
>  
>  extern void mnt_cursor_del(struct mnt_namespace *ns, struct mount *cursor);
> +
> +extern void namespace_lock_read(void);
> +extern void namespace_unlock_read(void);
> +extern void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt);
> +extern void seq_mnt_list(struct seq_file *seq, struct mnt_namespace *ns,
> +			 struct path *root);
> +extern struct vfsmount *mnt_lookup_by_id(struct mnt_namespace *ns,
> +					 struct path *root, int id);
> diff --git a/fs/namespace.c b/fs/namespace.c
> index de6fae84f1a1..52b15c17251f 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1405,6 +1405,38 @@ void mnt_cursor_del(struct mnt_namespace *ns, struct mount *cursor)
>  }
>  #endif  /* CONFIG_PROC_FS */
>  
> +void seq_mnt_list(struct seq_file *seq, struct mnt_namespace *ns,
> +		  struct path *root)
> +{
> +	struct mount *m;
> +
> +	down_read(&namespace_sem);
> +	for (m = mnt_list_next(ns, &ns->list); m; m = mnt_list_next(ns, &m->mnt_list)) {
> +		if (is_path_reachable(m, m->mnt.mnt_root, root)) {
> +			seq_printf(seq, "%i", m->mnt_id);
> +			seq_putc(seq, '\0');
> +		}
> +	}
> +	up_read(&namespace_sem);
> +}
> +
> +/* called with namespace_sem held for read */
> +struct vfsmount *mnt_lookup_by_id(struct mnt_namespace *ns, struct path *root,
> +				  int id)
> +{
> +	struct mount *m;
> +
> +	for (m = mnt_list_next(ns, &ns->list); m; m = mnt_list_next(ns, &m->mnt_list)) {
> +		if (m->mnt_id == id) {
> +			if (is_path_reachable(m, m->mnt.mnt_root, root))
> +				return mntget(&m->mnt);
> +			else
> +				return NULL;
> +		}
> +	}
> +	return NULL;
> +}
> +
>  /**
>   * may_umount_tree - check if a mount tree is busy
>   * @m: root of mount tree
> @@ -1494,6 +1526,16 @@ static inline void namespace_lock(void)
>  	down_write(&namespace_sem);
>  }
>  
> +void namespace_lock_read(void)
> +{
> +	down_read(&namespace_sem);
> +}
> +
> +void namespace_unlock_read(void)
> +{
> +	up_read(&namespace_sem);
> +}
> +
>  enum umount_tree_flags {
>  	UMOUNT_SYNC = 1,
>  	UMOUNT_PROPAGATE = 2,
> diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
> index 49650e54d2f8..fa6dc2c20578 100644
> --- a/fs/proc_namespace.c
> +++ b/fs/proc_namespace.c
> @@ -61,7 +61,7 @@ static int show_sb_opts(struct seq_file *m, struct super_block *sb)
>  	return security_sb_show_options(m, sb);
>  }
>  
> -static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
> +void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
>  {
>  	static const struct proc_fs_opts mnt_opts[] = {
>  		{ MNT_NOSUID, ",nosuid" },
> diff --git a/fs/values.c b/fs/values.c
> new file mode 100644
> index 000000000000..618fa9bf48a1
> --- /dev/null
> +++ b/fs/values.c
> @@ -0,0 +1,524 @@
> +#include <linux/syscalls.h>
> +#include <linux/printk.h>
> +#include <linux/namei.h>
> +#include <linux/fs_struct.h>
> +#include <linux/posix_acl_xattr.h>
> +#include <linux/xattr.h>
> +#include "pnode.h"
> +#include "internal.h"
> +
> +#define VAL_GRSEP ':'
> +
> +struct name_val {
> +	const char __user *name;	/* in */
> +	struct iovec value_in;		/* in */
> +	struct iovec value_out;		/* out */
> +	__u32 error;			/* out */
> +	__u32 reserved;
> +};
> +
> +struct val_iter {
> +	struct name_val __user *curr;
> +	size_t num;
> +	struct iovec vec;
> +	char name[256];
> +	size_t bufsize;
> +	struct seq_file seq;
> +	const char *prefix;
> +	const char *sub;
> +};
> +
> +struct val_desc {
> +	const char *name;
> +	union {
> +		int idx;
> +		int (*get)(struct val_iter *vi, const struct path *path);
> +	};
> +};
> +
> +static int val_get(struct val_iter *vi)
> +{
> +	struct name_val nameval;
> +	long err;
> +
> +	if (copy_from_user(&nameval, vi->curr, sizeof(nameval)))
> +		return -EFAULT;
> +
> +	err = strncpy_from_user(vi->name, nameval.name, sizeof(vi->name));
> +	if (err < 0)
> +		return err;
> +	if (err == sizeof(vi->name))
> +		return -ERANGE;
> +
> +	if (nameval.value_in.iov_base)
> +		vi->vec = nameval.value_in;
> +
> +	vi->seq.size = min(vi->vec.iov_len, vi->bufsize);
> +	vi->seq.count = 0;
> +
> +	return 0;
> +}
> +
> +static int val_next(struct val_iter *vi)
> +{
> +	vi->curr++;
> +	vi->num--;
> +
> +	return vi->num ? val_get(vi) : 0;
> +}
> +
> +static int val_end(struct val_iter *vi, size_t count)
> +{
> +	struct iovec iov = {
> +		.iov_base = vi->vec.iov_base,
> +		.iov_len = count,
> +	};
> +
> +	if (copy_to_user(&vi->curr->value_out, &iov, sizeof(iov)))
> +		return -EFAULT;
> +
> +	vi->vec.iov_base += count;
> +	vi->vec.iov_len -= count;
> +
> +	return val_next(vi);
> +}
> +
> +static int val_err(struct val_iter *vi, int err)
> +{
> +	if (put_user(-err, &vi->curr->error))
> +		return -EFAULT;
> +
> +	return val_next(vi);
> +}
> +
> +static int val_end_seq(struct val_iter *vi, int err)
> +{
> +	size_t count = vi->seq.count;
> +
> +	if (err)
> +		return val_err(vi, err);
> +
> +	if (count == vi->seq.size)
> +		return -EOVERFLOW;
> +
> +	if (copy_to_user(vi->vec.iov_base, vi->seq.buf, count))
> +		return -EFAULT;
> +
> +	return val_end(vi, count);
> +}
> +
> +static struct val_desc *val_lookup(struct val_iter *vi, struct val_desc *vd)
> +{
> +	const char *name = vi->name;
> +	const char *prefix = vi->prefix;
> +	size_t prefixlen = strlen(prefix);
> +
> +	if (prefixlen) {
> +		/*
> +		 * Name beggining with a group separator is a shorthand for
> +		 * previously prefix.
> +		 */
> +		if (name[0] == VAL_GRSEP) {
> +			name++;
> +		} else  {
> +			if (strncmp(name, prefix, prefixlen) != 0)
> +				return NULL;
> +			name += prefixlen;
> +		}
> +	}
> +
> +	vi->sub = NULL;
> +	for (; vd->name; vd++) {
> +		if (strcmp(name, vd->name) == 0)
> +			break;
> +		else {
> +			size_t grlen = strlen(vd->name);
> +
> +			if (strncmp(vd->name, name, grlen) == 0 &&
> +			    name[grlen] == VAL_GRSEP) {
> +				vi->sub = name + grlen + 1;
> +				break;
> +			}
> +		}
> +	}
> +	return vd;
> +}
> +
> +static int val_get_group(struct val_iter *vi, struct val_desc *vd)
> +{
> +	for (; vd->name; vd++)
> +		seq_write(&vi->seq, vd->name, strlen(vd->name) + 1);
> +
> +	return val_end_seq(vi, 0);
> +}
> +
> +static bool val_push_prefix(struct val_iter *vi, const char **oldprefix)
> +{
> +	char *newprefix;
> +
> +	newprefix = kmemdup_nul(vi->name, vi->sub - vi->name, GFP_KERNEL);
> +	if (newprefix) {
> +		*oldprefix = vi->prefix;
> +		vi->prefix = newprefix;
> +	}
> +
> +	return newprefix;
> +}
> +
> +static void val_pop_prefix(struct val_iter *vi, const char *oldprefix)
> +{
> +	kfree(vi->prefix);
> +	vi->prefix = oldprefix;
> +}
> +
> +enum {
> +	VAL_MNT_ID,
> +	VAL_MNT_PARENTID,
> +	VAL_MNT_ROOT,
> +	VAL_MNT_MOUNTPOINT,
> +	VAL_MNT_OPTIONS,
> +	VAL_MNT_SHARED,
> +	VAL_MNT_MASTER,
> +	VAL_MNT_PROPAGATE_FROM,
> +	VAL_MNT_UNBINDABLE,
> +	VAL_MNT_NOTFOUND,
> +};
> +
> +static struct val_desc val_mnt_group[] = {
> +	{ .name = "id",			.idx = VAL_MNT_ID		},
> +	{ .name = "parentid",		.idx = VAL_MNT_PARENTID,	},
> +	{ .name = "root",		.idx = VAL_MNT_ROOT,		},
> +	{ .name = "mountpoint",		.idx = VAL_MNT_MOUNTPOINT,	},
> +	{ .name = "options",		.idx = VAL_MNT_OPTIONS,		},
> +	{ .name = "shared",		.idx = VAL_MNT_SHARED,		},
> +	{ .name = "master",		.idx = VAL_MNT_MASTER,		},
> +	{ .name = "propagate_from",	.idx = VAL_MNT_PROPAGATE_FROM,	},
> +	{ .name = "unbindable",		.idx = VAL_MNT_UNBINDABLE,	},
> +	{ .name = NULL,			.idx = VAL_MNT_NOTFOUND		},
> +};
> +
> +static int seq_mnt_root(struct seq_file *seq, struct vfsmount *mnt)
> +{
> +	struct super_block *sb = mnt->mnt_sb;
> +	int err = 0;
> +
> +	if (sb->s_op->show_path) {
> +		err = sb->s_op->show_path(seq, mnt->mnt_root);
> +		if (!err) {
> +			seq_putc(seq, '\0');
> +			if (seq->count < seq->size)
> +				seq->count = string_unescape(seq->buf, seq->buf, seq->size, UNESCAPE_OCTAL);
> +		}
> +	} else {
> +		seq_dentry(seq, mnt->mnt_root, "");
> +	}
> +
> +	return err;
> +}
> +
> +static int val_mnt_show(struct val_iter *vi, struct vfsmount *mnt)
> +{
> +	struct mount *m = real_mount(mnt);
> +	struct path root, mnt_path;
> +	struct val_desc *vd;
> +	const char *oldprefix;
> +	int err = 0;
> +
> +	if (!val_push_prefix(vi, &oldprefix))
> +		return -ENOMEM;
> +
> +	while (!err && vi->num) {
> +		vd = val_lookup(vi, val_mnt_group);
> +		if (!vd)
> +			break;
> +
> +		switch(vd->idx) {
> +		case VAL_MNT_ID:
> +			seq_printf(&vi->seq, "%i", m->mnt_id);
> +			break;
> +		case VAL_MNT_PARENTID:
> +			seq_printf(&vi->seq, "%i", m->mnt_parent->mnt_id);
> +			break;
> +		case VAL_MNT_ROOT:
> +			seq_mnt_root(&vi->seq, mnt);
> +			break;
> +		case VAL_MNT_MOUNTPOINT:
> +			get_fs_root(current->fs, &root);
> +			mnt_path.dentry = mnt->mnt_root;
> +			mnt_path.mnt = mnt;
> +			err = seq_path_root(&vi->seq, &mnt_path, &root, "");
> +			path_put(&root);
> +			break;
> +		case VAL_MNT_OPTIONS:
> +			seq_puts(&vi->seq, mnt->mnt_flags & MNT_READONLY ? "ro" : "rw");
> +			show_mnt_opts(&vi->seq, mnt);
> +			break;
> +		case VAL_MNT_SHARED:
> +			if (IS_MNT_SHARED(m))
> +				seq_printf(&vi->seq, "%i,", m->mnt_group_id);
> +			break;
> +		case VAL_MNT_MASTER:
> +			if (IS_MNT_SLAVE(m))
> +				seq_printf(&vi->seq, "%i,",
> +					   m->mnt_master->mnt_group_id);
> +			break;
> +		case VAL_MNT_PROPAGATE_FROM:
> +			if (IS_MNT_SLAVE(m)) {
> +				int dom;
> +
> +				get_fs_root(current->fs, &root);
> +				dom = get_dominating_id(m, &root);
> +				path_put(&root);
> +				if (dom)
> +					seq_printf(&vi->seq, "%i,", dom);
> +			}
> +			break;
> +		case VAL_MNT_UNBINDABLE:
> +			if (IS_MNT_UNBINDABLE(m))
> +				seq_puts(&vi->seq, "yes");
> +			break;
> +		default:
> +			err = -ENOENT;
> +			break;
> +		}
> +		err = val_end_seq(vi, err);
> +	}
> +	val_pop_prefix(vi, oldprefix);
> +
> +	return err;
> +}
> +
> +static int val_mnt_get(struct val_iter *vi, const struct path *path)
> +{
> +	int err;
> +
> +	if (!vi->sub)
> +		return val_get_group(vi, val_mnt_group);
> +
> +	namespace_lock_read();
> +	err = val_mnt_show(vi, path->mnt);
> +	namespace_unlock_read();
> +
> +	return err;
> +}
> +
> +static int val_mntns_get(struct val_iter *vi, const struct path *path)
> +{
> +	struct mnt_namespace *mnt_ns = current->nsproxy->mnt_ns;
> +	struct vfsmount *mnt;
> +	struct path root;
> +	char *end;
> +	int mnt_id;
> +	int err;
> +
> +	if (!vi->sub) {
> +		get_fs_root(current->fs, &root);
> +		seq_mnt_list(&vi->seq, mnt_ns, &root);
> +		path_put(&root);
> +		return val_end_seq(vi, 0);
> +	}
> +
> +	end = strchr(vi->sub, VAL_GRSEP);
> +	if (end)
> +		*end = '\0';
> +	err = kstrtoint(vi->sub, 10, &mnt_id);
> +	if (err)
> +		return val_err(vi, err);
> +	vi->sub = NULL;
> +	if (end) {
> +		*end = VAL_GRSEP;
> +		vi->sub = end + 1;
> +	}
> +
> +	namespace_lock_read();
> +	get_fs_root(current->fs, &root);
> +	mnt = mnt_lookup_by_id(mnt_ns, &root, mnt_id);
> +	path_put(&root);
> +	if (!mnt) {
> +		namespace_unlock_read();
> +		return val_err(vi, -ENOENT);
> +	}
> +	if (vi->sub)
> +		err = val_mnt_show(vi, mnt);
> +	else
> +		err = val_get_group(vi, val_mnt_group);
> +
> +	namespace_unlock_read();
> +	mntput(mnt);
> +
> +	return err;
> +}
> +
> +static ssize_t val_do_read(struct val_iter *vi, struct path *path)
> +{
> +	ssize_t ret;
> +	struct file *file;
> +	struct open_flags op = {
> +		.open_flag = O_RDONLY,
> +		.acc_mode = MAY_READ,
> +		.intent = LOOKUP_OPEN,
> +	};
> +
> +	file = do_file_open_root(path, "", &op);
> +	if (IS_ERR(file))
> +		return PTR_ERR(file);
> +
> +	ret = vfs_read(file, vi->vec.iov_base, vi->vec.iov_len, NULL);
> +	fput(file);
> +
> +	return ret;
> +}
> +
> +static ssize_t val_do_readlink(struct val_iter *vi, struct path *path)
> +{
> +	int ret;
> +
> +	ret = security_inode_readlink(path->dentry);
> +	if (ret)
> +		return ret;
> +
> +	return vfs_readlink(path->dentry, vi->vec.iov_base, vi->vec.iov_len);
> +}
> +
> +static inline bool dot_or_dotdot(const char *s)
> +{
> +	return s[0] == '.' &&
> +		(s[1] == '/' || s[1] == '\0' ||
> +		 (s[1] == '.' && (s[2] == '/' || s[2] == '\0')));
> +}
> +
> +/*
> + * - empty path is okay
> + * - must not begin or end with slash or have a double slash anywhere
> + * - must not have . or .. components
> + */
> +static bool val_verify_path(const char *subpath)
> +{
> +	const char *s = subpath;
> +
> +	if (s[0] == '\0')
> +		return true;
> +
> +	if (s[0] == '/' || s[strlen(s) - 1] == '/' || strstr(s, "//"))
> +		return false;
> +
> +	for (s--; s; s = strstr(s + 3, "/."))
> +		if (dot_or_dotdot(s + 1))
> +			return false;
> +
> +	return true;
> +}
> +
> +static int val_data_get(struct val_iter *vi, const struct path *path)
> +{
> +	struct path this;
> +	ssize_t ret;
> +
> +	if (!vi->sub)
> +		return val_err(vi, -ENOENT);
> +
> +	if (!val_verify_path(vi->sub))
> +		return val_err(vi, -EINVAL);
> +
> +	ret = vfs_path_lookup(path->dentry, path->mnt, vi->sub,
> +			      LOOKUP_NO_XDEV | LOOKUP_BENEATH |
> +			      LOOKUP_IN_ROOT, &this);
> +	if (ret)
> +		return val_err(vi, ret);
> +
> +	ret = -ENODATA;
> +	if (d_is_reg(this.dentry) || d_is_symlink(this.dentry)) {
> +		if (d_is_reg(this.dentry))
> +			ret = val_do_read(vi, &this);
> +		else
> +			ret = val_do_readlink(vi, &this);
> +	}
> +	path_put(&this);
> +	if (ret == -EFAULT)
> +		return ret;
> +	if (ret < 0)
> +		return val_err(vi, ret);
> +	if (ret == vi->vec.iov_len)
> +		return -EOVERFLOW;
> +
> +	return val_end(vi, ret);
> +}
> +
> +static int val_xattr_get(struct val_iter *vi, const struct path *path)
> +{
> +	ssize_t ret;
> +	struct user_namespace *mnt_userns = mnt_user_ns(path->mnt);
> +	void *value = vi->seq.buf + vi->seq.count;
> +	size_t size = min_t(size_t, vi->seq.size - vi->seq.count,
> +			    XATTR_SIZE_MAX);
> +
> +	if (!vi->sub)
> +		return val_err(vi, -ENOENT);
> +
> +	ret = vfs_getxattr(mnt_userns, path->dentry, vi->sub, value, size);
> +	if (ret < 0)
> +		return val_err(vi, ret);
> +
> +	if ((strcmp(vi->sub, XATTR_NAME_POSIX_ACL_ACCESS) == 0) ||
> +	    (strcmp(vi->sub, XATTR_NAME_POSIX_ACL_DEFAULT) == 0))
> +		posix_acl_fix_xattr_to_user(mnt_userns, value, ret);
> +
> +	vi->seq.count += ret;
> +
> +	return val_end_seq(vi, 0);
> +}
> +
> +
> +static struct val_desc val_toplevel_group[] = {
> +	{ .name = "mnt",	.get = val_mnt_get,	},
> +	{ .name = "mntns",	.get = val_mntns_get,	},
> +	{ .name = "xattr",	.get = val_xattr_get,	},
> +	{ .name = "data",	.get = val_data_get,	},
> +	{ .name = NULL },
> +};
> +
> +SYSCALL_DEFINE5(getvalues,
> +		int, dfd,
> +		const char __user *, pathname,
> +		struct name_val __user *, vec,
> +		size_t, num,
> +		unsigned int, flags)
> +{
> +	char vals[1024];
> +	struct val_iter vi = {
> +		.curr = vec,
> +		.num = num,
> +		.seq.buf = vals,
> +		.bufsize = sizeof(vals),
> +		.prefix = "",
> +	};
> +	struct val_desc *vd;
> +	struct path path = {};
> +	ssize_t err;
> +
> +	err = user_path_at(dfd, pathname, 0, &path);
> +	if (err)
> +		return err;
> +
> +	err = val_get(&vi);
> +	if (err)
> +		goto out;
> +
> +	if (!strlen(vi.name)) {
> +		err = val_get_group(&vi, val_toplevel_group);
> +		goto out;
> +	}
> +	while (!err && vi.num) {
> +		vd = val_lookup(&vi, val_toplevel_group);
> +		if (!vd->name)
> +			err = val_err(&vi, -ENOENT);
> +		else
> +			err = vd->get(&vi, &path);
> +	}
> +out:
> +	if (err == -EOVERFLOW)
> +		err = 0;
> +
> +	path_put(&path);
> +	return err < 0 ? err : num - vi.num;
> +}
> -- 
> 2.35.1
>
Greg Kroah-Hartman March 23, 2022, 11:42 a.m. UTC | #8
On Wed, Mar 23, 2022 at 11:26:11AM +0100, Bernd Schubert wrote:
> On 3/23/22 08:16, Greg KH wrote:
> > On Tue, Mar 22, 2022 at 08:27:12PM +0100, Miklos Szeredi wrote:
> > > Add a new userspace API that allows getting multiple short values in a
> > > single syscall.
> > > 
> > > This would be useful for the following reasons:
> > > 
> > > - Calling open/read/close for many small files is inefficient.  E.g. on my
> > >    desktop invoking lsof(1) results in ~60k open + read + close calls under
> > >    /proc and 90% of those are 128 bytes or less.
> > 
> > As I found out in testing readfile():
> > 	https://lore.kernel.org/r/20200704140250.423345-1-gregkh@linuxfoundation.org
> > 
> > microbenchmarks do show a tiny improvement in doing something like this,
> > but that's not a real-world application.
> > 
> > Do you have anything real that can use this that shows a speedup?
> 
> Add in network file systems. Demonstrating that this is useful locally and
> with micro benchmarks - yeah, helps a bit to make it locally faster. But the
> real case is when thousands of clients are handled by a few network servers.
> Even reducing wire latency for a single client would make a difference here.

I think I tried running readfile on NFS.  Didn't see any improvements.
But please, try it again.  Also note that this proposal isn't for NFS,
or any other "real" filesystem :)

> There is a bit of chicken-egg problem - it is a bit of work to add to file
> systems like NFS (or others that are not the kernel), but the work won't be
> made there before there is no syscall for it. To demonstrate it on NFS one
> also needs a an official protocol change first. And then applications also
> need to support that new syscall first.
> I had a hard time explaining weather physicist back in 2009 that it is not a
> good idea to have millions of 512B files on  Lustre. With recent AI workload
> this gets even worse.

Can you try using the readfile() patch to see if that helps you all out
on Lustre?  If so, that's a good reason to consider it.  But again, has
nothing to do with this getvalues(2) api.

thanks,

greg k-h
Bernd Schubert March 23, 2022, 12:06 p.m. UTC | #9
On 3/23/22 12:42, Greg KH wrote:
> On Wed, Mar 23, 2022 at 11:26:11AM +0100, Bernd Schubert wrote:
>> On 3/23/22 08:16, Greg KH wrote:
>>> On Tue, Mar 22, 2022 at 08:27:12PM +0100, Miklos Szeredi wrote:
>>>> Add a new userspace API that allows getting multiple short values in a
>>>> single syscall.
>>>>
>>>> This would be useful for the following reasons:
>>>>
>>>> - Calling open/read/close for many small files is inefficient.  E.g. on my
>>>>     desktop invoking lsof(1) results in ~60k open + read + close calls under
>>>>     /proc and 90% of those are 128 bytes or less.
>>>
>>> As I found out in testing readfile():
>>> 	https://lore.kernel.org/r/20200704140250.423345-1-gregkh@linuxfoundation.org
>>>
>>> microbenchmarks do show a tiny improvement in doing something like this,
>>> but that's not a real-world application.
>>>
>>> Do you have anything real that can use this that shows a speedup?
>>
>> Add in network file systems. Demonstrating that this is useful locally and
>> with micro benchmarks - yeah, helps a bit to make it locally faster. But the
>> real case is when thousands of clients are handled by a few network servers.
>> Even reducing wire latency for a single client would make a difference here.
> 
> I think I tried running readfile on NFS.  Didn't see any improvements.
> But please, try it again.  Also note that this proposal isn't for NFS,
> or any other "real" filesystem :)

How did you run it on NFS? To get real benefit you would need to add a 
READ_FILE rpc to the NFS protocol and code? Just having it locally won't 
avoid the expensive wire calls?

> 
>> There is a bit of chicken-egg problem - it is a bit of work to add to file
>> systems like NFS (or others that are not the kernel), but the work won't be
>> made there before there is no syscall for it. To demonstrate it on NFS one
>> also needs a an official protocol change first. And then applications also
>> need to support that new syscall first.
>> I had a hard time explaining weather physicist back in 2009 that it is not a
>> good idea to have millions of 512B files on  Lustre. With recent AI workload
>> this gets even worse.
> 
> Can you try using the readfile() patch to see if that helps you all out
> on Lustre?  If so, that's a good reason to consider it.  But again, has
> nothing to do with this getvalues(2) api.

I don't have a Lustre system to easily play with (I'm working on another 
network file system). But unless Lustre would implement aggressive 
prefetch of data on stat, I don't see how either approach would work 
without a protocol addition. For Lustre it probably would be helpful 
only when small data are inlined into the inode.
In end this is exactly the chicken-egg problem - Lustre (or anything 
else) won't implement it before the kernel does not support it. But then 
the new syscall won't be added before it is proven that it helps.


  - Bernd
Greg Kroah-Hartman March 23, 2022, 12:13 p.m. UTC | #10
On Wed, Mar 23, 2022 at 01:06:33PM +0100, Bernd Schubert wrote:
> 
> 
> On 3/23/22 12:42, Greg KH wrote:
> > On Wed, Mar 23, 2022 at 11:26:11AM +0100, Bernd Schubert wrote:
> > > On 3/23/22 08:16, Greg KH wrote:
> > > > On Tue, Mar 22, 2022 at 08:27:12PM +0100, Miklos Szeredi wrote:
> > > > > Add a new userspace API that allows getting multiple short values in a
> > > > > single syscall.
> > > > > 
> > > > > This would be useful for the following reasons:
> > > > > 
> > > > > - Calling open/read/close for many small files is inefficient.  E.g. on my
> > > > >     desktop invoking lsof(1) results in ~60k open + read + close calls under
> > > > >     /proc and 90% of those are 128 bytes or less.
> > > > 
> > > > As I found out in testing readfile():
> > > > 	https://lore.kernel.org/r/20200704140250.423345-1-gregkh@linuxfoundation.org
> > > > 
> > > > microbenchmarks do show a tiny improvement in doing something like this,
> > > > but that's not a real-world application.
> > > > 
> > > > Do you have anything real that can use this that shows a speedup?
> > > 
> > > Add in network file systems. Demonstrating that this is useful locally and
> > > with micro benchmarks - yeah, helps a bit to make it locally faster. But the
> > > real case is when thousands of clients are handled by a few network servers.
> > > Even reducing wire latency for a single client would make a difference here.
> > 
> > I think I tried running readfile on NFS.  Didn't see any improvements.
> > But please, try it again.  Also note that this proposal isn't for NFS,
> > or any other "real" filesystem :)
> 
> How did you run it on NFS? To get real benefit you would need to add a
> READ_FILE rpc to the NFS protocol and code? Just having it locally won't
> avoid the expensive wire calls?

I did not touch anything related to NFS code, which is perhaps why I did
not notice any difference :)

thanks,

greg k-h
Miklos Szeredi March 23, 2022, 1:24 p.m. UTC | #11
On Wed, 23 Mar 2022 at 12:43, Christian Brauner <brauner@kernel.org> wrote:

> Yes, we really need a way to query for various fs information. I'm a bit
> torn about the details of this interface though. I would really like if
> we had interfaces that are really easy to use from userspace comparable
> to statx for example.

The reason I stated thinking about this is that Amir wanted a per-sb
iostat interface and dumped it into /proc/PID/mountstats.  And that is
definitely not the right way to go about this.

So we could add a statfsx() and start filling in new stuff, and that's
what Linus suggested.  But then we might need to add stuff that is not
representable in a flat structure (like for example the stuff that
nfs_show_stats does) and that again needs new infrastructure.

Another example is task info in /proc.  Utilities are doing a crazy
number of syscalls to get trivial information.  Why don't we have a
procx(2) syscall?  I guess because lots of that is difficult to
represent in a flat structure.  Just take the lsof example: tt's doing
hundreds of thousands of syscalls on a desktop computer with just a
few hundred processes.

So I'm trying to look beyond fsinfo and about how we could better
retrieve attributes, statistics, small bits and pieces within a
unified framework.

The ease of use argument does not really come into the picture here,
because (unlike stat and friends) most of this info is specialized and
will be either consumed by libraries, specialized utilities
(util-linux, procos) or with a generic utility application that can
query any information about anything that is exported through such an
interface.    That applies to plain stat(2) as well: most users will
not switch to statx() simply because that's too generic.  And that's
fine, for info as common as struct stat a syscall is warranted.  If
the info is more specialized, then I think a truly generic interface
is a much better choice.

>  I know having this generic as possible was the
> goal but I'm just a bit uneasy with such interfaces. They become
> cumbersome to use in userspace. I'm not sure if the data: part for
> example should be in this at all. That seems a bit out of place to me.

Good point, reduction of scope may help.

> Would it be really that bad if we added multiple syscalls for different
> types of info? For example, querying mount information could reasonably
> be a more focussed separate system call allowing to retrieve detailed
> mount propagation info, flags, idmappings and so on. Prior approaches to
> solve this in a completely generic way have gotten us not very far too
> so I'm a bit worried about this aspect too.

And I fear that this will just result in more and more ad-hoc
interfaces being added, because a new feature didn't quite fit the old
API.  You can see the history of this happening all over the place
with multiple new syscall versions being added as the old one turns
out to be not generic enough.

I think a new interface needs to

  - be uniform (a single utility can be used to retrieve various
attributes and statistics, contrast this with e.g. stat(1),
getfattr(1), lsattr(1) not to mention various fs specific tools).

 - have a hierarchical namespace (the unix path lookup is an example
of this that stood the test of time)

 - allow retrieving arbitrary text or binary data

And whatever form it takes, I'm sure it will be easier to use than the
mess we currently have in various interfaces like the mount or process
stats.

Thanks,
Miklos
Greg Kroah-Hartman March 23, 2022, 1:38 p.m. UTC | #12
On Wed, Mar 23, 2022 at 02:24:40PM +0100, Miklos Szeredi wrote:
> On Wed, 23 Mar 2022 at 12:43, Christian Brauner <brauner@kernel.org> wrote:
> 
> > Yes, we really need a way to query for various fs information. I'm a bit
> > torn about the details of this interface though. I would really like if
> > we had interfaces that are really easy to use from userspace comparable
> > to statx for example.
> 
> The reason I stated thinking about this is that Amir wanted a per-sb
> iostat interface and dumped it into /proc/PID/mountstats.  And that is
> definitely not the right way to go about this.
> 
> So we could add a statfsx() and start filling in new stuff, and that's
> what Linus suggested.  But then we might need to add stuff that is not
> representable in a flat structure (like for example the stuff that
> nfs_show_stats does) and that again needs new infrastructure.
> 
> Another example is task info in /proc.  Utilities are doing a crazy
> number of syscalls to get trivial information.  Why don't we have a
> procx(2) syscall?  I guess because lots of that is difficult to
> represent in a flat structure.  Just take the lsof example: tt's doing
> hundreds of thousands of syscalls on a desktop computer with just a
> few hundred processes.
> 
> So I'm trying to look beyond fsinfo and about how we could better
> retrieve attributes, statistics, small bits and pieces within a
> unified framework.
> 
> The ease of use argument does not really come into the picture here,
> because (unlike stat and friends) most of this info is specialized and
> will be either consumed by libraries, specialized utilities
> (util-linux, procos) or with a generic utility application that can
> query any information about anything that is exported through such an
> interface.    That applies to plain stat(2) as well: most users will
> not switch to statx() simply because that's too generic.  And that's
> fine, for info as common as struct stat a syscall is warranted.  If
> the info is more specialized, then I think a truly generic interface
> is a much better choice.
> 
> >  I know having this generic as possible was the
> > goal but I'm just a bit uneasy with such interfaces. They become
> > cumbersome to use in userspace. I'm not sure if the data: part for
> > example should be in this at all. That seems a bit out of place to me.
> 
> Good point, reduction of scope may help.
> 
> > Would it be really that bad if we added multiple syscalls for different
> > types of info? For example, querying mount information could reasonably
> > be a more focussed separate system call allowing to retrieve detailed
> > mount propagation info, flags, idmappings and so on. Prior approaches to
> > solve this in a completely generic way have gotten us not very far too
> > so I'm a bit worried about this aspect too.
> 
> And I fear that this will just result in more and more ad-hoc
> interfaces being added, because a new feature didn't quite fit the old
> API.  You can see the history of this happening all over the place
> with multiple new syscall versions being added as the old one turns
> out to be not generic enough.
> 
> I think a new interface needs to
> 
>   - be uniform (a single utility can be used to retrieve various
> attributes and statistics, contrast this with e.g. stat(1),
> getfattr(1), lsattr(1) not to mention various fs specific tools).
> 
>  - have a hierarchical namespace (the unix path lookup is an example
> of this that stood the test of time)
> 
>  - allow retrieving arbitrary text or binary data
> 
> And whatever form it takes, I'm sure it will be easier to use than the
> mess we currently have in various interfaces like the mount or process
> stats.

This has been proposed in the past a few times.  Most recently by the
KVM developers, which tried to create a "generic" api, but ended up just
making something to work for KVM as they got tired of people ignoring
their more intrusive patch sets.  See virt/kvm/binary_stats.c for what
they ended up with, and perhaps you can just use that same type of
interface here as well?

thanks,

greg k-h
Casey Schaufler March 23, 2022, 1:51 p.m. UTC | #13
On 3/23/2022 6:24 AM, Miklos Szeredi wrote:
> On Wed, 23 Mar 2022 at 12:43, Christian Brauner <brauner@kernel.org> wrote:
>
>> Yes, we really need a way to query for various fs information. I'm a bit
>> torn about the details of this interface though. I would really like if
>> we had interfaces that are really easy to use from userspace comparable
>> to statx for example.
> The reason I stated thinking about this is that Amir wanted a per-sb
> iostat interface and dumped it into /proc/PID/mountstats.  And that is
> definitely not the right way to go about this.
>
> So we could add a statfsx() and start filling in new stuff, and that's
> what Linus suggested.  But then we might need to add stuff that is not
> representable in a flat structure (like for example the stuff that
> nfs_show_stats does) and that again needs new infrastructure.
>
> Another example is task info in /proc.  Utilities are doing a crazy
> number of syscalls to get trivial information.  Why don't we have a
> procx(2) syscall?  I guess because lots of that is difficult to
> represent in a flat structure.  Just take the lsof example: tt's doing
> hundreds of thousands of syscalls on a desktop computer with just a
> few hundred processes.
>
> So I'm trying to look beyond fsinfo and about how we could better
> retrieve attributes, statistics, small bits and pieces within a
> unified framework.
>
> The ease of use argument does not really come into the picture here,
> because (unlike stat and friends) most of this info is specialized and
> will be either consumed by libraries, specialized utilities
> (util-linux, procos) or with a generic utility application that can
> query any information about anything that is exported through such an
> interface.    That applies to plain stat(2) as well: most users will
> not switch to statx() simply because that's too generic.  And that's
> fine, for info as common as struct stat a syscall is warranted.  If
> the info is more specialized, then I think a truly generic interface
> is a much better choice.
>
>>   I know having this generic as possible was the
>> goal but I'm just a bit uneasy with such interfaces. They become
>> cumbersome to use in userspace. I'm not sure if the data: part for
>> example should be in this at all. That seems a bit out of place to me.
> Good point, reduction of scope may help.
>
>> Would it be really that bad if we added multiple syscalls for different
>> types of info? For example, querying mount information could reasonably
>> be a more focussed separate system call allowing to retrieve detailed
>> mount propagation info, flags, idmappings and so on. Prior approaches to
>> solve this in a completely generic way have gotten us not very far too
>> so I'm a bit worried about this aspect too.
> And I fear that this will just result in more and more ad-hoc
> interfaces being added, because a new feature didn't quite fit the old
> API.  You can see the history of this happening all over the place
> with multiple new syscall versions being added as the old one turns
> out to be not generic enough.
>
> I think a new interface needs to
>
>    - be uniform (a single utility can be used to retrieve various
> attributes and statistics, contrast this with e.g. stat(1),
> getfattr(1), lsattr(1) not to mention various fs specific tools).
>
>   - have a hierarchical namespace (the unix path lookup is an example
> of this that stood the test of time)
>
>   - allow retrieving arbitrary text or binary data

You also need a way to get a list off what attributes are available
and/or a way to get all available attributes. Applications and especially
libraries shouldn't have to guess what information is relevant. If the
attributes change depending on the filesystem and/or LSM involved, and
they do, how can a general purpose library function know what data to
ask for?

>
> And whatever form it takes, I'm sure it will be easier to use than the
> mess we currently have in various interfaces like the mount or process
> stats.
>
> Thanks,
> Miklos
Miklos Szeredi March 23, 2022, 2 p.m. UTC | #14
On Wed, 23 Mar 2022 at 14:51, Casey Schaufler <casey@schaufler-ca.com> wrote:

> You also need a way to get a list off what attributes are available
> and/or a way to get all available attributes. Applications and especially
> libraries shouldn't have to guess what information is relevant. If the
> attributes change depending on the filesystem and/or LSM involved, and
> they do, how can a general purpose library function know what data to
> ask for?

Oh, yes.  Even the current prototype does that:

# ~/getvalues / ""
[] = "mnt" "mntns" "xattr" "data" (len=21)
# ~/getvalues / "mnt"
[mnt] = "id" "parentid" "root" "mountpoint" "options" "shared"
"master" "propagate_from" "unbindable" (len=76)
# ~/getvalues / "mntns"
[mntns] = "21" "22" "24" "25" "23" "26" "27" "28" "29" "30" "31" "32" (len=36)
 ~/getvalues / "mntns:21"
[mntns:21] = "id" "parentid" "root" "mountpoint" "options" "shared"
"master" "propagate_from" "unbindable" (len=76)

I didn't implement enumeration for "data" and "xattr" but that is
certainly possible and not even difficult to do.

Thanks,
Miklos
Miklos Szeredi March 23, 2022, 3:23 p.m. UTC | #15
On Wed, 23 Mar 2022 at 14:38, Greg KH <gregkh@linuxfoundation.org> wrote:

> This has been proposed in the past a few times.  Most recently by the
> KVM developers, which tried to create a "generic" api, but ended up just
> making something to work for KVM as they got tired of people ignoring
> their more intrusive patch sets.  See virt/kvm/binary_stats.c for what
> they ended up with, and perhaps you can just use that same type of
> interface here as well?

So this looks like a fixed set of statistics where each one has a
descriptor (a name, size, offset, flags, ...) that tells about the
piece of data to be exported.  The stats are kept up to date in kernel
memory and copied to userspace on read.  The copy can be selective,
since the read can specify the offset and size of data it would like
to retrieve.

The interface is self descriptive and selective, but its structure is
fixed for a specific object type, there's no way this could be
extended to look up things like extended attributes.  Maybe that's not
a problem, but the lack of a hierarchical namespace could turn out to
be a major drawback.

I think people underestimate the usefulness of hierarchical
namespaces, even though we use them extensively in lots of well
established interfaces.

Thanks,
Miklos
J. Bruce Fields March 23, 2022, 7:29 p.m. UTC | #16
On Wed, Mar 23, 2022 at 11:26:11AM +0100, Bernd Schubert wrote:
> Add in network file systems. Demonstrating that this is useful
> locally and with micro benchmarks - yeah, helps a bit to make it
> locally faster. But the real case is when thousands of clients are
> handled by a few network servers. Even reducing wire latency for a
> single client would make a difference here.
> 
> There is a bit of chicken-egg problem - it is a bit of work to add
> to file systems like NFS (or others that are not the kernel), but
> the work won't be made there before there is no syscall for it. To
> demonstrate it on NFS one also needs a an official protocol change
> first.

I wouldn't assume that.  NFSv4 already supports compound rpc operations,
so you can do OPEN+READ+CLOSE in a single round trip.  The client's
never done that, but there are pynfs tests that can testify to the fact
that our server supports it.

It's not something anyone's used much outside of artificial tests, so
there may well turn out be issues, but the protocol's definitely
sufficient to prototype this at least.

I'm not volunteering, but it doesn't seem too difficult in theory if
someone's interested.

--b.

> And then applications also need to support that new syscall
> first.
> I had a hard time explaining weather physicist back in 2009 that it
> is not a good idea to have millions of 512B files on  Lustre. With
> recent AI workload this gets even worse.
> 
> This is the same issue in fact with the fuse patches we are creating
> (https://lwn.net/Articles/888877/). Miklos asked for benchmark
> numbers - we can only demonstrate slight effects locally, but out
> goal is in fact to reduce network latencies and server load.
> 
> - Bernd
Theodore Ts'o March 23, 2022, 10:19 p.m. UTC | #17
On Wed, Mar 23, 2022 at 02:24:40PM +0100, Miklos Szeredi wrote:
> The reason I stated thinking about this is that Amir wanted a per-sb
> iostat interface and dumped it into /proc/PID/mountstats.  And that is
> definitely not the right way to go about this.
> 
> So we could add a statfsx() and start filling in new stuff, and that's
> what Linus suggested.  But then we might need to add stuff that is not
> representable in a flat structure (like for example the stuff that
> nfs_show_stats does) and that again needs new infrastructure.
> 
> Another example is task info in /proc.  Utilities are doing a crazy
> number of syscalls to get trivial information.  Why don't we have a
> procx(2) syscall?  I guess because lots of that is difficult to
> represent in a flat structure.  Just take the lsof example: tt's doing
> hundreds of thousands of syscalls on a desktop computer with just a
> few hundred processes.

I'm still a bit puzzled about the reason for getvalues(2) beyond,
"reduce the number of system calls".  Is this a performance argument?
If so, have you benchmarked lsof using this new interface?

I did a quickie run on my laptop, which currently had 444 process.
"lsof /home/tytso > /tmp/foo" didn't take long:

% time lsof /home/tytso >& /tmp/foo
real    0m0.144s
user    0m0.039s
sys     0m0.087s

And an strace of that same lsof command indicated had 67,889 lines.
So yeah, lots of system calls.  But is this new system call really
going to speed up things by all that much?

If the argument is "make it easier to use", what's wrong the solution
of creating userspace libraries which abstract away calls to
open/read/close a whole bunch of procfs files to make life easier for
application programmers?

In short, what problem is this new system call going to solve?  Each
new system call, especially with all of the parsing that this one is
going to use, is going to be an additional attack surface, and an
additional new system call that we have to maintain --- and for the
first 7-10 years, userspace programs are going to have to use the
existing open/read/close interface since enterprise kernels stick a
wrong for a L-O-N-G time, so any kind of ease-of-use argument isn't
really going to help application programs until RHEL 10 becomes
obsolete.  (Unless you plan to backport this into RHEL 9 beta, but
still, waiting for RHEL 9 to become completely EOL is going to be... a
while.)  So whatever the benefits of this new interface is going to
be, I suggest we should be sure that it's really worth it.

Cheers,

					- Ted
Casey Schaufler March 23, 2022, 10:39 p.m. UTC | #18
On 3/23/2022 7:00 AM, Miklos Szeredi wrote:
> On Wed, 23 Mar 2022 at 14:51, Casey Schaufler <casey@schaufler-ca.com> wrote:
>
>> You also need a way to get a list off what attributes are available
>> and/or a way to get all available attributes. Applications and especially
>> libraries shouldn't have to guess what information is relevant. If the
>> attributes change depending on the filesystem and/or LSM involved, and
>> they do, how can a general purpose library function know what data to
>> ask for?
> Oh, yes.  Even the current prototype does that:
>
> # ~/getvalues / ""
> [] = "mnt" "mntns" "xattr" "data" (len=21)
> # ~/getvalues / "mnt"
> [mnt] = "id" "parentid" "root" "mountpoint" "options" "shared"
> "master" "propagate_from" "unbindable" (len=76)
> # ~/getvalues / "mntns"
> [mntns] = "21" "22" "24" "25" "23" "26" "27" "28" "29" "30" "31" "32" (len=36)
>   ~/getvalues / "mntns:21"
> [mntns:21] = "id" "parentid" "root" "mountpoint" "options" "shared"
> "master" "propagate_from" "unbindable" (len=76)

That requires multiple calls and hierarchy tracking by the caller.
Not to mention that in this case the caller needs to understand
how mount namespaces are being used. I don't see that you've made
anything cleaner. You have discarded the type checking provided
by the "classic" APIs. Elsewhere in this thread the claims of
improved performance have been questioned, but I can't say boo
about that. Is this interface targeted for languages other than C
for which the paradigm might provide (more?) value?

>
> I didn't implement enumeration for "data" and "xattr" but that is
> certainly possible and not even difficult to do.
>
> Thanks,
> Miklos
Dave Chinner March 23, 2022, 10:58 p.m. UTC | #19
On Tue, Mar 22, 2022 at 08:27:12PM +0100, Miklos Szeredi wrote:
> Add a new userspace API that allows getting multiple short values in a
> single syscall.
> 
> This would be useful for the following reasons:
> 
> - Calling open/read/close for many small files is inefficient.  E.g. on my
>   desktop invoking lsof(1) results in ~60k open + read + close calls under
>   /proc and 90% of those are 128 bytes or less.

How does doing the open/read/close in a single syscall make this any
more efficient? All it saves is the overhead of a couple of
syscalls, it doesn't reduce any of the setup or teardown overhead
needed to read the data itself....

> - Interfaces for getting various attributes and statistics are fragmented.
>   For files we have basic stat, statx, extended attributes, file attributes
>   (for which there are two overlapping ioctl interfaces).  For mounts and
>   superblocks we have stat*fs as well as /proc/$PID/{mountinfo,mountstats}.
>   The latter also has the problem on not allowing queries on a specific
>   mount.

https://xkcd.com/927/

> - Some attributes are cheap to generate, some are expensive.  Allowing
>   userspace to select which ones it needs should allow optimizing queries.
> 
> - Adding an ascii namespace should allow easy extension and self
>   description.
> 
> - The values can be text or binary, whichever is fits best.
> 
> The interface definition is:
> 
> struct name_val {
> 	const char *name;	/* in */
> 	struct iovec value_in;	/* in */
> 	struct iovec value_out;	/* out */
> 	uint32_t error;		/* out */
> 	uint32_t reserved;
> };

Ahhh, XFS_IOC_ATTRMULTI_BY_HANDLE reborn. This is how xfsdump gets
and sets attributes efficiently when dumping and restoring files -
it's an interface that allows batches of xattr operations to be run
on a file in a single syscall.

I've said in the past when discussing things like statx() that maybe
everything should be addressable via the xattr namespace and
set/queried via xattr names regardless of how the filesystem stores
the data. The VFS/filesystem simply translates the name to the
storage location of the information. It might be held in xattrs, but
it could just be a flag bit in an inode field.

Then we just get named xattrs in batches from an open fd.

> int getvalues(int dfd, const char *path, struct name_val *vec, size_t num,
> 	      unsigned int flags);
> 
> @dfd and @path are used to lookup object $ORIGIN.  @vec contains @num
> name/value descriptors.  @flags contains lookup flags for @path.
> 
> The syscall returns the number of values filled or an error.
> 
> A single name/value descriptor has the following fields:
> 
> @name describes the object whose value is to be returned.  E.g.
> 
> mnt                    - list of mount parameters
> mnt:mountpoint         - the mountpoint of the mount of $ORIGIN
> mntns                  - list of mount ID's reachable from the current root
> mntns:21:parentid      - parent ID of the mount with ID of 21
> xattr:security.selinux - the security.selinux extended attribute
> data:foo/bar           - the data contained in file $ORIGIN/foo/bar

How are these different from just declaring new xattr namespaces for
these things. e.g. open any file and list the xattrs in the
xattr:mount.mnt namespace to get the list of mount parameters for
that mount.

Why do we need a new "xattr in everything but name" interface when
we could just extend the one we've already got and formalise a new,
cleaner version of xattr batch APIs that have been around for 20-odd
years already?

Cheers,

Dave.
Casey Schaufler March 23, 2022, 11:17 p.m. UTC | #20
On 3/23/2022 3:58 PM, Dave Chinner wrote:
> On Tue, Mar 22, 2022 at 08:27:12PM +0100, Miklos Szeredi wrote:
>> Add a new userspace API that allows getting multiple short values in a
>> single syscall.
>>
>> This would be useful for the following reasons:
>>
>> - Calling open/read/close for many small files is inefficient.  E.g. on my
>>    desktop invoking lsof(1) results in ~60k open + read + close calls under
>>    /proc and 90% of those are 128 bytes or less.
> How does doing the open/read/close in a single syscall make this any
> more efficient? All it saves is the overhead of a couple of
> syscalls, it doesn't reduce any of the setup or teardown overhead
> needed to read the data itself....
>
>> - Interfaces for getting various attributes and statistics are fragmented.
>>    For files we have basic stat, statx, extended attributes, file attributes
>>    (for which there are two overlapping ioctl interfaces).  For mounts and
>>    superblocks we have stat*fs as well as /proc/$PID/{mountinfo,mountstats}.
>>    The latter also has the problem on not allowing queries on a specific
>>    mount.
> https://xkcd.com/927/
>
>> - Some attributes are cheap to generate, some are expensive.  Allowing
>>    userspace to select which ones it needs should allow optimizing queries.
>>
>> - Adding an ascii namespace should allow easy extension and self
>>    description.
>>
>> - The values can be text or binary, whichever is fits best.
>>
>> The interface definition is:
>>
>> struct name_val {
>> 	const char *name;	/* in */
>> 	struct iovec value_in;	/* in */
>> 	struct iovec value_out;	/* out */
>> 	uint32_t error;		/* out */
>> 	uint32_t reserved;
>> };
> Ahhh, XFS_IOC_ATTRMULTI_BY_HANDLE reborn. This is how xfsdump gets
> and sets attributes efficiently when dumping and restoring files -
> it's an interface that allows batches of xattr operations to be run
> on a file in a single syscall.
>
> I've said in the past when discussing things like statx() that maybe
> everything should be addressable via the xattr namespace and
> set/queried via xattr names regardless of how the filesystem stores
> the data. The VFS/filesystem simply translates the name to the
> storage location of the information. It might be held in xattrs, but
> it could just be a flag bit in an inode field.
>
> Then we just get named xattrs in batches from an open fd.
>
>> int getvalues(int dfd, const char *path, struct name_val *vec, size_t num,
>> 	      unsigned int flags);
>>
>> @dfd and @path are used to lookup object $ORIGIN.  @vec contains @num
>> name/value descriptors.  @flags contains lookup flags for @path.
>>
>> The syscall returns the number of values filled or an error.
>>
>> A single name/value descriptor has the following fields:
>>
>> @name describes the object whose value is to be returned.  E.g.
>>
>> mnt                    - list of mount parameters
>> mnt:mountpoint         - the mountpoint of the mount of $ORIGIN
>> mntns                  - list of mount ID's reachable from the current root
>> mntns:21:parentid      - parent ID of the mount with ID of 21
>> xattr:security.selinux - the security.selinux extended attribute
>> data:foo/bar           - the data contained in file $ORIGIN/foo/bar
> How are these different from just declaring new xattr namespaces for
> these things. e.g. open any file and list the xattrs in the
> xattr:mount.mnt namespace to get the list of mount parameters for
> that mount.

There is a significant and vocal set of people who dislike xattrs
passionately. I often hear them whinging whenever someone proposes
using them. I think that your suggestion has all the advantages of
the getvalues(2) interface while also addressing its shortcomings.
If we could get it past the anti-xattr crowd we might have something.
You could even provide getvalues() on top of it.

>
> Why do we need a new "xattr in everything but name" interface when
> we could just extend the one we've already got and formalise a new,
> cleaner version of xattr batch APIs that have been around for 20-odd
> years already?
>
> Cheers,
>
> Dave.
>
Christoph Hellwig March 24, 2022, 6:34 a.m. UTC | #21
On Wed, Mar 23, 2022 at 06:19:51PM -0400, Theodore Ts'o wrote:
> I'm still a bit puzzled about the reason for getvalues(2) beyond,
> "reduce the number of system calls".  Is this a performance argument?
> If so, have you benchmarked lsof using this new interface?

Yeah.  Even if open + read + close is a bnottle neck for fuse or
network file systems I think a io_uring op for just that is a much
better choice instead of this crazy multi-value operation.

And even on that I need to be sold first.
Greg Kroah-Hartman March 24, 2022, 6:56 a.m. UTC | #22
On Wed, Mar 23, 2022 at 04:23:34PM +0100, Miklos Szeredi wrote:
> On Wed, 23 Mar 2022 at 14:38, Greg KH <gregkh@linuxfoundation.org> wrote:
> 
> > This has been proposed in the past a few times.  Most recently by the
> > KVM developers, which tried to create a "generic" api, but ended up just
> > making something to work for KVM as they got tired of people ignoring
> > their more intrusive patch sets.  See virt/kvm/binary_stats.c for what
> > they ended up with, and perhaps you can just use that same type of
> > interface here as well?
> 
> So this looks like a fixed set of statistics where each one has a
> descriptor (a name, size, offset, flags, ...) that tells about the
> piece of data to be exported.  The stats are kept up to date in kernel
> memory and copied to userspace on read.  The copy can be selective,
> since the read can specify the offset and size of data it would like
> to retrieve.
> 
> The interface is self descriptive and selective, but its structure is
> fixed for a specific object type, there's no way this could be
> extended to look up things like extended attributes.  Maybe that's not
> a problem, but the lack of a hierarchical namespace could turn out to
> be a major drawback.
> 
> I think people underestimate the usefulness of hierarchical
> namespaces, even though we use them extensively in lots of well
> established interfaces.

I like the namespaces, they work well.  If you want self-describing
interfaces (which I think your patch does), then why not just use the
varlink protocol?  It's been implemented for the kernel already many
years ago:
	https://github.com/varlink
and specifically:
	https://github.com/varlink/linux-varlink

It doesn't need a new syscall.

thanks,

greg k-h
Miklos Szeredi March 24, 2022, 8:44 a.m. UTC | #23
On Wed, 23 Mar 2022 at 23:20, Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Wed, Mar 23, 2022 at 02:24:40PM +0100, Miklos Szeredi wrote:
> > The reason I stated thinking about this is that Amir wanted a per-sb
> > iostat interface and dumped it into /proc/PID/mountstats.  And that is
> > definitely not the right way to go about this.
> >
> > So we could add a statfsx() and start filling in new stuff, and that's
> > what Linus suggested.  But then we might need to add stuff that is not
> > representable in a flat structure (like for example the stuff that
> > nfs_show_stats does) and that again needs new infrastructure.
> >
> > Another example is task info in /proc.  Utilities are doing a crazy
> > number of syscalls to get trivial information.  Why don't we have a
> > procx(2) syscall?  I guess because lots of that is difficult to
> > represent in a flat structure.  Just take the lsof example: tt's doing
> > hundreds of thousands of syscalls on a desktop computer with just a
> > few hundred processes.
>
> I'm still a bit puzzled about the reason for getvalues(2) beyond,
> "reduce the number of system calls".  Is this a performance argument?

One argument that can't be worked around without batchingis atomicity.
Not sure how important that is, but IIRC it was one of the
requirements relating to the proposed fsinfo syscall, which this API
is meant to supersede.   Performance was also oft repeated regarding
the fsinfo API, but I'm less bought into that.

> If so, have you benchmarked lsof using this new interface?

Not yet.  Looked yesterday at both lsof and procps source code, and
both are pretty complex and not easy to plug in a new interface.   But
I've not yet given up...

> I did a quickie run on my laptop, which currently had 444 process.
> "lsof /home/tytso > /tmp/foo" didn't take long:
>
> % time lsof /home/tytso >& /tmp/foo
> real    0m0.144s
> user    0m0.039s
> sys     0m0.087s
>
> And an strace of that same lsof command indicated had 67,889 lines.
> So yeah, lots of system calls.  But is this new system call really
> going to speed up things by all that much?

$ ps uax | wc -l
335
$ time lsof > /dev/null

real 0m3.011s
user 0m1.257s
sys 0m1.249s
$ strace -o /tmp/strace lsof > /dev/null
$ wc -l /tmp/strace
638523 /tmp/strace

That's an order of magnitude higher than in your case; don't know what
could cause this.

Thanks,
Millos
Miklos Szeredi March 24, 2022, 8:57 a.m. UTC | #24
On Wed, 23 Mar 2022 at 23:58, Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Mar 22, 2022 at 08:27:12PM +0100, Miklos Szeredi wrote:

> > - Interfaces for getting various attributes and statistics are fragmented.
> >   For files we have basic stat, statx, extended attributes, file attributes
> >   (for which there are two overlapping ioctl interfaces).  For mounts and
> >   superblocks we have stat*fs as well as /proc/$PID/{mountinfo,mountstats}.
> >   The latter also has the problem on not allowing queries on a specific
> >   mount.
>
> https://xkcd.com/927/

Haha!

> I've said in the past when discussing things like statx() that maybe
> everything should be addressable via the xattr namespace and
> set/queried via xattr names regardless of how the filesystem stores
> the data. The VFS/filesystem simply translates the name to the
> storage location of the information. It might be held in xattrs, but
> it could just be a flag bit in an inode field.

Right, that would definitely make sense for inode attributes.

What about other objects' attributes, statistics?   Remember this
started out as a way to replace /proc/self/mountinfo with something
that can query individual mount.

> > mnt                    - list of mount parameters
> > mnt:mountpoint         - the mountpoint of the mount of $ORIGIN
> > mntns                  - list of mount ID's reachable from the current root
> > mntns:21:parentid      - parent ID of the mount with ID of 21
> > xattr:security.selinux - the security.selinux extended attribute
> > data:foo/bar           - the data contained in file $ORIGIN/foo/bar
>
> How are these different from just declaring new xattr namespaces for
> these things. e.g. open any file and list the xattrs in the
> xattr:mount.mnt namespace to get the list of mount parameters for
> that mount.

Okay.

> Why do we need a new "xattr in everything but name" interface when
> we could just extend the one we've already got and formalise a new,
> cleaner version of xattr batch APIs that have been around for 20-odd
> years already?

Seems to make sense. But...will listxattr list everyting recursively?
I guess that won't work, better just list traditional xattrs,
otherwise we'll likely get regressions, and anyway the point of a
hierarchical namespace is to be able to list nodes on each level.  We
can use getxattr() for this purpose, just like getvalues() does in the
above example.

Thanks,
Miklos
Amir Goldstein March 24, 2022, 10:34 a.m. UTC | #25
> > I've said in the past when discussing things like statx() that maybe
> > everything should be addressable via the xattr namespace and
> > set/queried via xattr names regardless of how the filesystem stores
> > the data. The VFS/filesystem simply translates the name to the
> > storage location of the information. It might be held in xattrs, but
> > it could just be a flag bit in an inode field.
>
> Right, that would definitely make sense for inode attributes.

Why limit to inode attributes?
The argument of getxattr()/fgetxattr() is exactly the same as
the argument for statfs()fstatfs() and the latter returns the attributes
of the sb and the mnt (i.e. calculate_f_flags()).

I don't see a problem with querying attributes of a mount/sb the same
way as long as the namespace is clear about what is the object that
is being queried (e.g. getxattr(path, "fsinfo.sbiostats.rchar",...).

>
> What about other objects' attributes, statistics?   Remember this
> started out as a way to replace /proc/self/mountinfo with something
> that can query individual mount.
>
> > > mnt                    - list of mount parameters
> > > mnt:mountpoint         - the mountpoint of the mount of $ORIGIN
> > > mntns                  - list of mount ID's reachable from the current root
> > > mntns:21:parentid      - parent ID of the mount with ID of 21
> > > xattr:security.selinux - the security.selinux extended attribute
> > > data:foo/bar           - the data contained in file $ORIGIN/foo/bar
> >
> > How are these different from just declaring new xattr namespaces for
> > these things. e.g. open any file and list the xattrs in the
> > xattr:mount.mnt namespace to get the list of mount parameters for
> > that mount.
>
> Okay.
>
> > Why do we need a new "xattr in everything but name" interface when
> > we could just extend the one we've already got and formalise a new,
> > cleaner version of xattr batch APIs that have been around for 20-odd
> > years already?
>
> Seems to make sense. But...will listxattr list everyting recursively?
> I guess that won't work, better just list traditional xattrs,
> otherwise we'll likely get regressions, and anyway the point of a
> hierarchical namespace is to be able to list nodes on each level.  We
> can use getxattr() for this purpose, just like getvalues() does in the
> above example.
>

FYI, there are already precedents for "virtual" xattrs, see the user.smb3.*
family in fs/cifs/xattr.c for example.

Those cifs "virtual" (or "remote") xattrs are not listed by listxattr, even
though they ARE properties of the file which are very relevant for backup.

Currently, they use the user.* namespace, but the values could be
also exposed via a more generic fsinfo.* namespace that is dedicated
to these sort of things and then, as you suggest, getxattr(path, "fsinfo",...)
can list "smb3" for cifs.

I like where this is going :)

Thanks,
Amir.
Eric W. Biederman March 24, 2022, 4:15 p.m. UTC | #26
Miklos Szeredi <miklos@szeredi.hu> writes:

> On Wed, 23 Mar 2022 at 23:20, Theodore Ts'o <tytso@mit.edu> wrote:
>>
>> On Wed, Mar 23, 2022 at 02:24:40PM +0100, Miklos Szeredi wrote:
>> > The reason I stated thinking about this is that Amir wanted a per-sb
>> > iostat interface and dumped it into /proc/PID/mountstats.  And that is
>> > definitely not the right way to go about this.
>> >
>> > So we could add a statfsx() and start filling in new stuff, and that's
>> > what Linus suggested.  But then we might need to add stuff that is not
>> > representable in a flat structure (like for example the stuff that
>> > nfs_show_stats does) and that again needs new infrastructure.
>> >
>> > Another example is task info in /proc.  Utilities are doing a crazy
>> > number of syscalls to get trivial information.  Why don't we have a
>> > procx(2) syscall?  I guess because lots of that is difficult to
>> > represent in a flat structure.  Just take the lsof example: tt's doing
>> > hundreds of thousands of syscalls on a desktop computer with just a
>> > few hundred processes.
>>
>> I'm still a bit puzzled about the reason for getvalues(2) beyond,
>> "reduce the number of system calls".  Is this a performance argument?
>
> One argument that can't be worked around without batchingis atomicity.
> Not sure how important that is, but IIRC it was one of the
> requirements relating to the proposed fsinfo syscall, which this API
> is meant to supersede.   Performance was also oft repeated regarding
> the fsinfo API, but I'm less bought into that.

A silly question.  Have you looked to see if you can perform this work
with io_uring?

I know io_uring does all of the batching already, so I think io_uring is
as ready as anything is to solve the performance issues, and the general
small file problem.  There is also the bpf information extractor (Sorry
I forget what it's proper name is) that also can solve many of the small
read problems.

I am very confused you mention atomicity but I don't see any new
filesystem hooks or anyway you could implement atomicity for reads
much less writes in the patch you posted.

If the real target is something like fsinfo that is returning
information that is not currently available except by possibly
processing /proc/self/mountinfo perhaps a more targeted name would
help.

I certainly did not get the impression when skimming your introduction
to this that you were trying to solve anything except reading a number
of small files.

Eric
Dave Chinner March 24, 2022, 8:31 p.m. UTC | #27
On Thu, Mar 24, 2022 at 09:57:26AM +0100, Miklos Szeredi wrote:
> On Wed, 23 Mar 2022 at 23:58, Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Tue, Mar 22, 2022 at 08:27:12PM +0100, Miklos Szeredi wrote:
> 
> > > - Interfaces for getting various attributes and statistics are fragmented.
> > >   For files we have basic stat, statx, extended attributes, file attributes
> > >   (for which there are two overlapping ioctl interfaces).  For mounts and
> > >   superblocks we have stat*fs as well as /proc/$PID/{mountinfo,mountstats}.
> > >   The latter also has the problem on not allowing queries on a specific
> > >   mount.
> >
> > https://xkcd.com/927/
> 
> Haha!
> 
> > I've said in the past when discussing things like statx() that maybe
> > everything should be addressable via the xattr namespace and
> > set/queried via xattr names regardless of how the filesystem stores
> > the data. The VFS/filesystem simply translates the name to the
> > storage location of the information. It might be held in xattrs, but
> > it could just be a flag bit in an inode field.
> 
> Right, that would definitely make sense for inode attributes.
> 
> What about other objects' attributes, statistics?   Remember this
> started out as a way to replace /proc/self/mountinfo with something
> that can query individual mount.

For individual mount info, why do we even need to query something in
/proc? I mean, every open file in the mount has access to the mount
and the underlying superblock, so why not just make the query
namespace accessable from any open fd on that mount?

e.g. /proc/self/mountinfo tells you where the mounts are, then you
can just open(O_PATH) the mount point you want the info from and
retrieve the relevant xattrs from that fd. The information itself
does not need to be in /proc, nor only accessible from /proc, nor be
limited to proc infrastructure, nor be constrained by proc's
arbitrary "one value per file" presentation....

Indeed, we don't have to centralise all the information in one place
- all we need is to have a well defined, consistent method for
indexing that information and all the shenanigans for accessing
common stuff can be wrapped up in a common userspace library
(similar to how iterating the mount table is generic C library
functionality).

> > > mnt                    - list of mount parameters
> > > mnt:mountpoint         - the mountpoint of the mount of $ORIGIN
> > > mntns                  - list of mount ID's reachable from the current root
> > > mntns:21:parentid      - parent ID of the mount with ID of 21
> > > xattr:security.selinux - the security.selinux extended attribute
> > > data:foo/bar           - the data contained in file $ORIGIN/foo/bar
> >
> > How are these different from just declaring new xattr namespaces for
> > these things. e.g. open any file and list the xattrs in the
> > xattr:mount.mnt namespace to get the list of mount parameters for
> > that mount.
> 
> Okay.
> 
> > Why do we need a new "xattr in everything but name" interface when
> > we could just extend the one we've already got and formalise a new,
> > cleaner version of xattr batch APIs that have been around for 20-odd
> > years already?
> 
> Seems to make sense. But...will listxattr list everyting recursively?
> I guess that won't work, better just list traditional xattrs,
> otherwise we'll likely get regressions,

*nod*

> and anyway the point of a
> hierarchical namespace is to be able to list nodes on each level.  We
> can use getxattr() for this purpose, just like getvalues() does in the
> above example.

Yup, and like Casey suggests, you could implement a generic
getvalues()-like user library on top of it so users don't even need
to know where and how the values are located or retrieved.

The other advantage of an xattr interface is that is also provides a
symmetrical API for -changing- values. No need for some special
configfs or configfd thingy for setting parameters - just change the
value of the parameter or mount option with a simple setxattr call.
That retains the simplicity of proc and sysfs attributes in that you
can change them just by writing a new value to the file....

Cheers,

Dave.
Karel Zak March 25, 2022, 8:46 a.m. UTC | #28
On Thu, Mar 24, 2022 at 09:44:38AM +0100, Miklos Szeredi wrote:
> > If so, have you benchmarked lsof using this new interface?
> 
> Not yet.  Looked yesterday at both lsof and procps source code, and
> both are pretty complex and not easy to plug in a new interface.   But
> I've not yet given up...

I can imagine something like getvalues(2) in lsblk (based on /sys) or
in lsfd (based on /proc; lsof replacement). The tools have defined set
of information to read from kernel, so gather all the requests to the
one syscall for each process or block device makes sense and it will
dramatically reduce number of open+read+close syscalls.

    Karel
Greg Kroah-Hartman March 25, 2022, 8:54 a.m. UTC | #29
On Fri, Mar 25, 2022 at 09:46:46AM +0100, Karel Zak wrote:
> On Thu, Mar 24, 2022 at 09:44:38AM +0100, Miklos Szeredi wrote:
> > > If so, have you benchmarked lsof using this new interface?
> > 
> > Not yet.  Looked yesterday at both lsof and procps source code, and
> > both are pretty complex and not easy to plug in a new interface.   But
> > I've not yet given up...
> 
> I can imagine something like getvalues(2) in lsblk (based on /sys) or
> in lsfd (based on /proc; lsof replacement). The tools have defined set
> of information to read from kernel, so gather all the requests to the
> one syscall for each process or block device makes sense and it will
> dramatically reduce number of open+read+close syscalls.

And do those open+read+close syscalls actually show up in measurements?

Again, I tried to find a real-world application that turning those 3
into 1 would matter, and I couldn't.  procps had no decreased system
load that I could notice.  I'll mess with lsof but that's really just a
stress-test, not anything that is run all the time, right?

And as others have said, using io_uring() would also solve the 3 syscall
issue, but no one seems to want to convert these tools to use that,
which implies that it's not really an issue for anyone :)

thanks,

greg k-h
Karel Zak March 25, 2022, 9:10 a.m. UTC | #30
On Fri, Mar 25, 2022 at 07:31:16AM +1100, Dave Chinner wrote:
> > What about other objects' attributes, statistics?   Remember this
> > started out as a way to replace /proc/self/mountinfo with something
> > that can query individual mount.
> 
> For individual mount info, why do we even need to query something in
> /proc? I mean, every open file in the mount has access to the mount
> and the underlying superblock, so why not just make the query
> namespace accessable from any open fd on that mount?


The current most problematic situation is in systemd. We get
generic notification (poll() on mountinfo) that something has been
modified in the mount table, and then we need to parse all the file
to get details.

So, the ideal solution would be notification that points to the FS
and interface to read information (e.g. getvalues()) about the FS.

    Karel
Karel Zak March 25, 2022, 9:25 a.m. UTC | #31
On Fri, Mar 25, 2022 at 09:54:21AM +0100, Greg KH wrote:
> On Fri, Mar 25, 2022 at 09:46:46AM +0100, Karel Zak wrote:
> > On Thu, Mar 24, 2022 at 09:44:38AM +0100, Miklos Szeredi wrote:
> > > > If so, have you benchmarked lsof using this new interface?
> > > 
> > > Not yet.  Looked yesterday at both lsof and procps source code, and
> > > both are pretty complex and not easy to plug in a new interface.   But
> > > I've not yet given up...
> > 
> > I can imagine something like getvalues(2) in lsblk (based on /sys) or
> > in lsfd (based on /proc; lsof replacement). The tools have defined set
> > of information to read from kernel, so gather all the requests to the
> > one syscall for each process or block device makes sense and it will
> > dramatically reduce number of open+read+close syscalls.
> 
> And do those open+read+close syscalls actually show up in measurements?
> 
> Again, I tried to find a real-world application that turning those 3
> into 1 would matter, and I couldn't.  procps had no decreased system
> load that I could notice.  I'll mess with lsof but that's really just a
> stress-test, not anything that is run all the time, right?


Right, the speed of ps(1) or lsof(1) is not important. IMHO the current
discussion about getvalues() goes in wrong direction :-)  

I guess the primary motivation is not to replace open+read+close, but
provide to userspace something usable to get information from mount
table, because the current /proc/#/mountinfo and notification by
poll() is horrible.

Don't forget that the previous attempt was fsinfo() from David Howells
(unfortunately, it was too complex and rejected by Linus).

> And as others have said, using io_uring() would also solve the 3 syscall
> issue, but no one seems to want to convert these tools to use that,
> which implies that it's not really an issue for anyone :)

OK, I'll think about it :-)

    Karel
Cyril Hrubis March 25, 2022, 11:02 a.m. UTC | #32
Hi!
> > If so, have you benchmarked lsof using this new interface?
> 
> Not yet.  Looked yesterday at both lsof and procps source code, and
> both are pretty complex and not easy to plug in a new interface.   But
> I've not yet given up...

Looking at lsof it seems to use fopen() and fgets() to parse various
proc files. I doubt that we can make the parsing singificantly faster
without completely rewriting the internals.

As for procps the readproc.c has file2str() function that does copy
whole proc files into a buffer with open() - read() - close(). It may be
reasonably easy to hook the new systall there and it will probably make
ps and top slightly faster.
Trond Myklebust March 25, 2022, 4:42 p.m. UTC | #33
On Fri, 2022-03-25 at 07:31 +1100, Dave Chinner wrote:
> On Thu, Mar 24, 2022 at 09:57:26AM +0100, Miklos Szeredi wrote:
> > On Wed, 23 Mar 2022 at 23:58, Dave Chinner <david@fromorbit.com>
> > wrote:
> > > 
> > > On Tue, Mar 22, 2022 at 08:27:12PM +0100, Miklos Szeredi wrote:
> > 
> > > > - Interfaces for getting various attributes and statistics are
> > > > fragmented.
> > > >   For files we have basic stat, statx, extended attributes,
> > > > file attributes
> > > >   (for which there are two overlapping ioctl interfaces).  For
> > > > mounts and
> > > >   superblocks we have stat*fs as well as
> > > > /proc/$PID/{mountinfo,mountstats}.
> > > >   The latter also has the problem on not allowing queries on a
> > > > specific
> > > >   mount.
> > > 
> > > https://xkcd.com/927/
> > 
> > Haha!
> > 
> > > I've said in the past when discussing things like statx() that
> > > maybe
> > > everything should be addressable via the xattr namespace and
> > > set/queried via xattr names regardless of how the filesystem
> > > stores
> > > the data. The VFS/filesystem simply translates the name to the
> > > storage location of the information. It might be held in xattrs,
> > > but
> > > it could just be a flag bit in an inode field.
> > 
> > Right, that would definitely make sense for inode attributes.
> > 
> > What about other objects' attributes, statistics?   Remember this
> > started out as a way to replace /proc/self/mountinfo with something
> > that can query individual mount.
> 
> For individual mount info, why do we even need to query something in
> /proc? I mean, every open file in the mount has access to the mount
> and the underlying superblock, so why not just make the query
> namespace accessable from any open fd on that mount?
> 
> e.g. /proc/self/mountinfo tells you where the mounts are, then you
> can just open(O_PATH) the mount point you want the info from and
> retrieve the relevant xattrs from that fd. The information itself
> does not need to be in /proc, nor only accessible from /proc, nor be
> limited to proc infrastructure, nor be constrained by proc's
> arbitrary "one value per file" presentation....
> 
> Indeed, we don't have to centralise all the information in one place
> - all we need is to have a well defined, consistent method for
> indexing that information and all the shenanigans for accessing
> common stuff can be wrapped up in a common userspace library
> (similar to how iterating the mount table is generic C library
> functionality).
> 
> > > > mnt                    - list of mount parameters
> > > > mnt:mountpoint         - the mountpoint of the mount of $ORIGIN
> > > > mntns                  - list of mount ID's reachable from the
> > > > current root
> > > > mntns:21:parentid      - parent ID of the mount with ID of 21
> > > > xattr:security.selinux - the security.selinux extended
> > > > attribute
> > > > data:foo/bar           - the data contained in file
> > > > $ORIGIN/foo/bar
> > > 
> > > How are these different from just declaring new xattr namespaces
> > > for
> > > these things. e.g. open any file and list the xattrs in the
> > > xattr:mount.mnt namespace to get the list of mount parameters for
> > > that mount.
> > 
> > Okay.
> > 
> > > Why do we need a new "xattr in everything but name" interface
> > > when
> > > we could just extend the one we've already got and formalise a
> > > new,
> > > cleaner version of xattr batch APIs that have been around for 20-
> > > odd
> > > years already?
> > 
> > Seems to make sense. But...will listxattr list everyting
> > recursively?
> > I guess that won't work, better just list traditional xattrs,
> > otherwise we'll likely get regressions,
> 
> *nod*
> 
> > and anyway the point of a
> > hierarchical namespace is to be able to list nodes on each level. 
> > We
> > can use getxattr() for this purpose, just like getvalues() does in
> > the
> > above example.
> 
> Yup, and like Casey suggests, you could implement a generic
> getvalues()-like user library on top of it so users don't even need
> to know where and how the values are located or retrieved.
> 
> The other advantage of an xattr interface is that is also provides a
> symmetrical API for -changing- values. No need for some special
> configfs or configfd thingy for setting parameters - just change the
> value of the parameter or mount option with a simple setxattr call.
> That retains the simplicity of proc and sysfs attributes in that you
> can change them just by writing a new value to the file....

The downsides are, however, that the current interface provides little
in the way of atomicity if you want to read or write to multiple
attributes at the same time. Something like a backup program might want
to be able to atomically retrieve the ctime when it is backing up the
attributes.
Also, when setting attributes, I'd like to avoid multiple syscalls when
I'm changing multiple related attributes.

IOW: Adding a batching interface that is akin to what Miklos was
proposing would be a helpful change if we want to go down this path.
Linus Torvalds March 25, 2022, 6:40 p.m. UTC | #34
On Fri, Mar 25, 2022 at 1:46 AM Karel Zak <kzak@redhat.com> wrote:
>
> I can imagine something like getvalues(2) in lsblk (based on /sys) or
> in lsfd (based on /proc; lsof replacement).

I really would be very hesitant to add new interfaces for completely
specialty purposes.

As others have mentioned, this has been tried for much more
fundamental reasons (that whole "open-and-read" thing), and it hasn't
been an obvious improvement.

It *could* possibly be an improvement if it would allow us to take
advantage of special server-side operations in a networked filesystem
(ie like the whole "copy_file_range" kinds of interfaces that allow
server-side things) where you need to transfer less data, or need
fewer back-and-forth operations.

And even that is clearly questionable, with some of those network file
interfaces basically not having been used in real world situations in
the past, so..

(Also, with "copy_file_range" we not only had others actively doing
it, but the wins were "several orders of manitude", so even if it was
fairly rare, it was _so_ big that it was worth doing anyway).

With the "open-and-read" thing, the wins aren't that enormous.

And getvalues() isn't even that. It's literally a speciality interface
for a very special thing. Yes, I'm sure it avoids several system
calls. Yes, I'm sure it avoids parsing strings etc. But I really don't
think this is something we want to do unless people can show enormous
and real-world examples of where it makes such a huge difference that
we absolutely have to do it.

             Linus
Theodore Ts'o March 26, 2022, 4:19 a.m. UTC | #35
On Fri, Mar 25, 2022 at 10:25:53AM +0100, Karel Zak wrote:
> 
> Right, the speed of ps(1) or lsof(1) is not important. IMHO the current
> discussion about getvalues() goes in wrong direction :-)
> 
> I guess the primary motivation is not to replace open+read+close, but
> provide to userspace something usable to get information from mount
> table, because the current /proc/#/mountinfo and notification by
> poll() is horrible.

I think that's because the getvalues(2) prototype *only* optimizes
away open+read+close, and doesn't do a *thing* with respect to
/proc/<pid>/mountinfo.

> Don't forget that the previous attempt was fsinfo() from David Howells
> (unfortunately, it was too complex and rejected by Linus).

fsinfo() tried to do a lot more than solving the /proc/<pid>/mountinfo
problem; perhaps that was the cause of the complexity.

Ignoring the notification problem (which I suspect we could solve with
an extension of fsnotify), if the goal is to find a cleaner way to
fetch information about a process's mount namespace and the mounts in
that namespace, why not trying to export that information via sysfs?
Information about devices are just as complex, after all.

We could make mount namespaces to be their own first class object, so
there would be an entry in /proc/<pid> which returns the mount
namespace id used by a particular process.  Similarly, let each
mounted file system be its own first class object.  Information about
each mount namespace would be in /sys/mnt_ns, and information about
each mounted file system would be in /sys/superblock.  Then in
/sys/mnt_ns there would be a directory for each (superblock,
mountpoint) pair.

Given how quickly programs like lsof can open tens of thousands of
small files, and typically there are't that many mounted file systems
in a particular mount namespace, performance really shouldn't be a
problem.

If it works well enough for other kernel objects that are accessed via
sysfs, and fsinfo() is way to complex, why don't we try a pattern
which has worked and is "native" to Linux?

					- Ted
Dave Chinner March 27, 2022, 9:03 p.m. UTC | #36
On Fri, Mar 25, 2022 at 04:42:27PM +0000, Trond Myklebust wrote:
> On Fri, 2022-03-25 at 07:31 +1100, Dave Chinner wrote:
> > > and anyway the point of a
> > > hierarchical namespace is to be able to list nodes on each level. 
> > > We
> > > can use getxattr() for this purpose, just like getvalues() does in
> > > the
> > > above example.
> > 
> > Yup, and like Casey suggests, you could implement a generic
> > getvalues()-like user library on top of it so users don't even need
> > to know where and how the values are located or retrieved.
> > 
> > The other advantage of an xattr interface is that is also provides a
> > symmetrical API for -changing- values. No need for some special
> > configfs or configfd thingy for setting parameters - just change the
> > value of the parameter or mount option with a simple setxattr call.
> > That retains the simplicity of proc and sysfs attributes in that you
> > can change them just by writing a new value to the file....
> 
> The downsides are, however, that the current interface provides little
> in the way of atomicity if you want to read or write to multiple
> attributes at the same time. Something like a backup program might want
> to be able to atomically retrieve the ctime when it is backing up the
> attributes.

I assumed that batched updates were implied and understood after
my earlier comments about XFS_IOC_ATTRMULTI_BY_HANDLE as used
by xfsdump/restore for the past 20+ years.

> Also, when setting attributes, I'd like to avoid multiple syscalls when
> I'm changing multiple related attributes.
>
> IOW: Adding a batching interface that is akin to what Miklos was
> proposing would be a helpful change if we want to go down this path.

Yup, that's exactly what XFS_IOC_ATTRMULTI_BY_HANDLE provides and
I'm assuming that would also be provided by whatever formalised
generic syscall API we come up with here...

Cheers,

Dave.
diff mbox series

Patch

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c84d12608cd2..c72668001b39 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -372,6 +372,7 @@ 
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv		sys_futex_waitv
 450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
+451	common	getvalues		sys_getvalues
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/fs/Makefile b/fs/Makefile
index 208a74e0b00e..f00d6bcd1178 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -16,7 +16,7 @@  obj-y :=	open.o read_write.o file_table.o super.o \
 		pnode.o splice.o sync.o utimes.o d_path.o \
 		stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
 		fs_types.o fs_context.o fs_parser.o fsopen.o init.o \
-		kernel_read_file.o remap_range.o
+		kernel_read_file.o remap_range.o values.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=	buffer.o direct-io.o mpage.o
diff --git a/fs/mount.h b/fs/mount.h
index 0b6e08cf8afb..a3ca5233e481 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -148,3 +148,11 @@  static inline bool is_anon_ns(struct mnt_namespace *ns)
 }
 
 extern void mnt_cursor_del(struct mnt_namespace *ns, struct mount *cursor);
+
+extern void namespace_lock_read(void);
+extern void namespace_unlock_read(void);
+extern void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt);
+extern void seq_mnt_list(struct seq_file *seq, struct mnt_namespace *ns,
+			 struct path *root);
+extern struct vfsmount *mnt_lookup_by_id(struct mnt_namespace *ns,
+					 struct path *root, int id);
diff --git a/fs/namespace.c b/fs/namespace.c
index de6fae84f1a1..52b15c17251f 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1405,6 +1405,38 @@  void mnt_cursor_del(struct mnt_namespace *ns, struct mount *cursor)
 }
 #endif  /* CONFIG_PROC_FS */
 
+void seq_mnt_list(struct seq_file *seq, struct mnt_namespace *ns,
+		  struct path *root)
+{
+	struct mount *m;
+
+	down_read(&namespace_sem);
+	for (m = mnt_list_next(ns, &ns->list); m; m = mnt_list_next(ns, &m->mnt_list)) {
+		if (is_path_reachable(m, m->mnt.mnt_root, root)) {
+			seq_printf(seq, "%i", m->mnt_id);
+			seq_putc(seq, '\0');
+		}
+	}
+	up_read(&namespace_sem);
+}
+
+/* called with namespace_sem held for read */
+struct vfsmount *mnt_lookup_by_id(struct mnt_namespace *ns, struct path *root,
+				  int id)
+{
+	struct mount *m;
+
+	for (m = mnt_list_next(ns, &ns->list); m; m = mnt_list_next(ns, &m->mnt_list)) {
+		if (m->mnt_id == id) {
+			if (is_path_reachable(m, m->mnt.mnt_root, root))
+				return mntget(&m->mnt);
+			else
+				return NULL;
+		}
+	}
+	return NULL;
+}
+
 /**
  * may_umount_tree - check if a mount tree is busy
  * @m: root of mount tree
@@ -1494,6 +1526,16 @@  static inline void namespace_lock(void)
 	down_write(&namespace_sem);
 }
 
+void namespace_lock_read(void)
+{
+	down_read(&namespace_sem);
+}
+
+void namespace_unlock_read(void)
+{
+	up_read(&namespace_sem);
+}
+
 enum umount_tree_flags {
 	UMOUNT_SYNC = 1,
 	UMOUNT_PROPAGATE = 2,
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 49650e54d2f8..fa6dc2c20578 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -61,7 +61,7 @@  static int show_sb_opts(struct seq_file *m, struct super_block *sb)
 	return security_sb_show_options(m, sb);
 }
 
-static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
+void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
 {
 	static const struct proc_fs_opts mnt_opts[] = {
 		{ MNT_NOSUID, ",nosuid" },
diff --git a/fs/values.c b/fs/values.c
new file mode 100644
index 000000000000..618fa9bf48a1
--- /dev/null
+++ b/fs/values.c
@@ -0,0 +1,524 @@ 
+#include <linux/syscalls.h>
+#include <linux/printk.h>
+#include <linux/namei.h>
+#include <linux/fs_struct.h>
+#include <linux/posix_acl_xattr.h>
+#include <linux/xattr.h>
+#include "pnode.h"
+#include "internal.h"
+
+#define VAL_GRSEP ':'
+
+struct name_val {
+	const char __user *name;	/* in */
+	struct iovec value_in;		/* in */
+	struct iovec value_out;		/* out */
+	__u32 error;			/* out */
+	__u32 reserved;
+};
+
+struct val_iter {
+	struct name_val __user *curr;
+	size_t num;
+	struct iovec vec;
+	char name[256];
+	size_t bufsize;
+	struct seq_file seq;
+	const char *prefix;
+	const char *sub;
+};
+
+struct val_desc {
+	const char *name;
+	union {
+		int idx;
+		int (*get)(struct val_iter *vi, const struct path *path);
+	};
+};
+
+static int val_get(struct val_iter *vi)
+{
+	struct name_val nameval;
+	long err;
+
+	if (copy_from_user(&nameval, vi->curr, sizeof(nameval)))
+		return -EFAULT;
+
+	err = strncpy_from_user(vi->name, nameval.name, sizeof(vi->name));
+	if (err < 0)
+		return err;
+	if (err == sizeof(vi->name))
+		return -ERANGE;
+
+	if (nameval.value_in.iov_base)
+		vi->vec = nameval.value_in;
+
+	vi->seq.size = min(vi->vec.iov_len, vi->bufsize);
+	vi->seq.count = 0;
+
+	return 0;
+}
+
+static int val_next(struct val_iter *vi)
+{
+	vi->curr++;
+	vi->num--;
+
+	return vi->num ? val_get(vi) : 0;
+}
+
+static int val_end(struct val_iter *vi, size_t count)
+{
+	struct iovec iov = {
+		.iov_base = vi->vec.iov_base,
+		.iov_len = count,
+	};
+
+	if (copy_to_user(&vi->curr->value_out, &iov, sizeof(iov)))
+		return -EFAULT;
+
+	vi->vec.iov_base += count;
+	vi->vec.iov_len -= count;
+
+	return val_next(vi);
+}
+
+static int val_err(struct val_iter *vi, int err)
+{
+	if (put_user(-err, &vi->curr->error))
+		return -EFAULT;
+
+	return val_next(vi);
+}
+
+static int val_end_seq(struct val_iter *vi, int err)
+{
+	size_t count = vi->seq.count;
+
+	if (err)
+		return val_err(vi, err);
+
+	if (count == vi->seq.size)
+		return -EOVERFLOW;
+
+	if (copy_to_user(vi->vec.iov_base, vi->seq.buf, count))
+		return -EFAULT;
+
+	return val_end(vi, count);
+}
+
+static struct val_desc *val_lookup(struct val_iter *vi, struct val_desc *vd)
+{
+	const char *name = vi->name;
+	const char *prefix = vi->prefix;
+	size_t prefixlen = strlen(prefix);
+
+	if (prefixlen) {
+		/*
+		 * Name beggining with a group separator is a shorthand for
+		 * previously prefix.
+		 */
+		if (name[0] == VAL_GRSEP) {
+			name++;
+		} else  {
+			if (strncmp(name, prefix, prefixlen) != 0)
+				return NULL;
+			name += prefixlen;
+		}
+	}
+
+	vi->sub = NULL;
+	for (; vd->name; vd++) {
+		if (strcmp(name, vd->name) == 0)
+			break;
+		else {
+			size_t grlen = strlen(vd->name);
+
+			if (strncmp(vd->name, name, grlen) == 0 &&
+			    name[grlen] == VAL_GRSEP) {
+				vi->sub = name + grlen + 1;
+				break;
+			}
+		}
+	}
+	return vd;
+}
+
+static int val_get_group(struct val_iter *vi, struct val_desc *vd)
+{
+	for (; vd->name; vd++)
+		seq_write(&vi->seq, vd->name, strlen(vd->name) + 1);
+
+	return val_end_seq(vi, 0);
+}
+
+static bool val_push_prefix(struct val_iter *vi, const char **oldprefix)
+{
+	char *newprefix;
+
+	newprefix = kmemdup_nul(vi->name, vi->sub - vi->name, GFP_KERNEL);
+	if (newprefix) {
+		*oldprefix = vi->prefix;
+		vi->prefix = newprefix;
+	}
+
+	return newprefix;
+}
+
+static void val_pop_prefix(struct val_iter *vi, const char *oldprefix)
+{
+	kfree(vi->prefix);
+	vi->prefix = oldprefix;
+}
+
+enum {
+	VAL_MNT_ID,
+	VAL_MNT_PARENTID,
+	VAL_MNT_ROOT,
+	VAL_MNT_MOUNTPOINT,
+	VAL_MNT_OPTIONS,
+	VAL_MNT_SHARED,
+	VAL_MNT_MASTER,
+	VAL_MNT_PROPAGATE_FROM,
+	VAL_MNT_UNBINDABLE,
+	VAL_MNT_NOTFOUND,
+};
+
+static struct val_desc val_mnt_group[] = {
+	{ .name = "id",			.idx = VAL_MNT_ID		},
+	{ .name = "parentid",		.idx = VAL_MNT_PARENTID,	},
+	{ .name = "root",		.idx = VAL_MNT_ROOT,		},
+	{ .name = "mountpoint",		.idx = VAL_MNT_MOUNTPOINT,	},
+	{ .name = "options",		.idx = VAL_MNT_OPTIONS,		},
+	{ .name = "shared",		.idx = VAL_MNT_SHARED,		},
+	{ .name = "master",		.idx = VAL_MNT_MASTER,		},
+	{ .name = "propagate_from",	.idx = VAL_MNT_PROPAGATE_FROM,	},
+	{ .name = "unbindable",		.idx = VAL_MNT_UNBINDABLE,	},
+	{ .name = NULL,			.idx = VAL_MNT_NOTFOUND		},
+};
+
+static int seq_mnt_root(struct seq_file *seq, struct vfsmount *mnt)
+{
+	struct super_block *sb = mnt->mnt_sb;
+	int err = 0;
+
+	if (sb->s_op->show_path) {
+		err = sb->s_op->show_path(seq, mnt->mnt_root);
+		if (!err) {
+			seq_putc(seq, '\0');
+			if (seq->count < seq->size)
+				seq->count = string_unescape(seq->buf, seq->buf, seq->size, UNESCAPE_OCTAL);
+		}
+	} else {
+		seq_dentry(seq, mnt->mnt_root, "");
+	}
+
+	return err;
+}
+
+static int val_mnt_show(struct val_iter *vi, struct vfsmount *mnt)
+{
+	struct mount *m = real_mount(mnt);
+	struct path root, mnt_path;
+	struct val_desc *vd;
+	const char *oldprefix;
+	int err = 0;
+
+	if (!val_push_prefix(vi, &oldprefix))
+		return -ENOMEM;
+
+	while (!err && vi->num) {
+		vd = val_lookup(vi, val_mnt_group);
+		if (!vd)
+			break;
+
+		switch(vd->idx) {
+		case VAL_MNT_ID:
+			seq_printf(&vi->seq, "%i", m->mnt_id);
+			break;
+		case VAL_MNT_PARENTID:
+			seq_printf(&vi->seq, "%i", m->mnt_parent->mnt_id);
+			break;
+		case VAL_MNT_ROOT:
+			seq_mnt_root(&vi->seq, mnt);
+			break;
+		case VAL_MNT_MOUNTPOINT:
+			get_fs_root(current->fs, &root);
+			mnt_path.dentry = mnt->mnt_root;
+			mnt_path.mnt = mnt;
+			err = seq_path_root(&vi->seq, &mnt_path, &root, "");
+			path_put(&root);
+			break;
+		case VAL_MNT_OPTIONS:
+			seq_puts(&vi->seq, mnt->mnt_flags & MNT_READONLY ? "ro" : "rw");
+			show_mnt_opts(&vi->seq, mnt);
+			break;
+		case VAL_MNT_SHARED:
+			if (IS_MNT_SHARED(m))
+				seq_printf(&vi->seq, "%i,", m->mnt_group_id);
+			break;
+		case VAL_MNT_MASTER:
+			if (IS_MNT_SLAVE(m))
+				seq_printf(&vi->seq, "%i,",
+					   m->mnt_master->mnt_group_id);
+			break;
+		case VAL_MNT_PROPAGATE_FROM:
+			if (IS_MNT_SLAVE(m)) {
+				int dom;
+
+				get_fs_root(current->fs, &root);
+				dom = get_dominating_id(m, &root);
+				path_put(&root);
+				if (dom)
+					seq_printf(&vi->seq, "%i,", dom);
+			}
+			break;
+		case VAL_MNT_UNBINDABLE:
+			if (IS_MNT_UNBINDABLE(m))
+				seq_puts(&vi->seq, "yes");
+			break;
+		default:
+			err = -ENOENT;
+			break;
+		}
+		err = val_end_seq(vi, err);
+	}
+	val_pop_prefix(vi, oldprefix);
+
+	return err;
+}
+
+static int val_mnt_get(struct val_iter *vi, const struct path *path)
+{
+	int err;
+
+	if (!vi->sub)
+		return val_get_group(vi, val_mnt_group);
+
+	namespace_lock_read();
+	err = val_mnt_show(vi, path->mnt);
+	namespace_unlock_read();
+
+	return err;
+}
+
+static int val_mntns_get(struct val_iter *vi, const struct path *path)
+{
+	struct mnt_namespace *mnt_ns = current->nsproxy->mnt_ns;
+	struct vfsmount *mnt;
+	struct path root;
+	char *end;
+	int mnt_id;
+	int err;
+
+	if (!vi->sub) {
+		get_fs_root(current->fs, &root);
+		seq_mnt_list(&vi->seq, mnt_ns, &root);
+		path_put(&root);
+		return val_end_seq(vi, 0);
+	}
+
+	end = strchr(vi->sub, VAL_GRSEP);
+	if (end)
+		*end = '\0';
+	err = kstrtoint(vi->sub, 10, &mnt_id);
+	if (err)
+		return val_err(vi, err);
+	vi->sub = NULL;
+	if (end) {
+		*end = VAL_GRSEP;
+		vi->sub = end + 1;
+	}
+
+	namespace_lock_read();
+	get_fs_root(current->fs, &root);
+	mnt = mnt_lookup_by_id(mnt_ns, &root, mnt_id);
+	path_put(&root);
+	if (!mnt) {
+		namespace_unlock_read();
+		return val_err(vi, -ENOENT);
+	}
+	if (vi->sub)
+		err = val_mnt_show(vi, mnt);
+	else
+		err = val_get_group(vi, val_mnt_group);
+
+	namespace_unlock_read();
+	mntput(mnt);
+
+	return err;
+}
+
+static ssize_t val_do_read(struct val_iter *vi, struct path *path)
+{
+	ssize_t ret;
+	struct file *file;
+	struct open_flags op = {
+		.open_flag = O_RDONLY,
+		.acc_mode = MAY_READ,
+		.intent = LOOKUP_OPEN,
+	};
+
+	file = do_file_open_root(path, "", &op);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	ret = vfs_read(file, vi->vec.iov_base, vi->vec.iov_len, NULL);
+	fput(file);
+
+	return ret;
+}
+
+static ssize_t val_do_readlink(struct val_iter *vi, struct path *path)
+{
+	int ret;
+
+	ret = security_inode_readlink(path->dentry);
+	if (ret)
+		return ret;
+
+	return vfs_readlink(path->dentry, vi->vec.iov_base, vi->vec.iov_len);
+}
+
+static inline bool dot_or_dotdot(const char *s)
+{
+	return s[0] == '.' &&
+		(s[1] == '/' || s[1] == '\0' ||
+		 (s[1] == '.' && (s[2] == '/' || s[2] == '\0')));
+}
+
+/*
+ * - empty path is okay
+ * - must not begin or end with slash or have a double slash anywhere
+ * - must not have . or .. components
+ */
+static bool val_verify_path(const char *subpath)
+{
+	const char *s = subpath;
+
+	if (s[0] == '\0')
+		return true;
+
+	if (s[0] == '/' || s[strlen(s) - 1] == '/' || strstr(s, "//"))
+		return false;
+
+	for (s--; s; s = strstr(s + 3, "/."))
+		if (dot_or_dotdot(s + 1))
+			return false;
+
+	return true;
+}
+
+static int val_data_get(struct val_iter *vi, const struct path *path)
+{
+	struct path this;
+	ssize_t ret;
+
+	if (!vi->sub)
+		return val_err(vi, -ENOENT);
+
+	if (!val_verify_path(vi->sub))
+		return val_err(vi, -EINVAL);
+
+	ret = vfs_path_lookup(path->dentry, path->mnt, vi->sub,
+			      LOOKUP_NO_XDEV | LOOKUP_BENEATH |
+			      LOOKUP_IN_ROOT, &this);
+	if (ret)
+		return val_err(vi, ret);
+
+	ret = -ENODATA;
+	if (d_is_reg(this.dentry) || d_is_symlink(this.dentry)) {
+		if (d_is_reg(this.dentry))
+			ret = val_do_read(vi, &this);
+		else
+			ret = val_do_readlink(vi, &this);
+	}
+	path_put(&this);
+	if (ret == -EFAULT)
+		return ret;
+	if (ret < 0)
+		return val_err(vi, ret);
+	if (ret == vi->vec.iov_len)
+		return -EOVERFLOW;
+
+	return val_end(vi, ret);
+}
+
+static int val_xattr_get(struct val_iter *vi, const struct path *path)
+{
+	ssize_t ret;
+	struct user_namespace *mnt_userns = mnt_user_ns(path->mnt);
+	void *value = vi->seq.buf + vi->seq.count;
+	size_t size = min_t(size_t, vi->seq.size - vi->seq.count,
+			    XATTR_SIZE_MAX);
+
+	if (!vi->sub)
+		return val_err(vi, -ENOENT);
+
+	ret = vfs_getxattr(mnt_userns, path->dentry, vi->sub, value, size);
+	if (ret < 0)
+		return val_err(vi, ret);
+
+	if ((strcmp(vi->sub, XATTR_NAME_POSIX_ACL_ACCESS) == 0) ||
+	    (strcmp(vi->sub, XATTR_NAME_POSIX_ACL_DEFAULT) == 0))
+		posix_acl_fix_xattr_to_user(mnt_userns, value, ret);
+
+	vi->seq.count += ret;
+
+	return val_end_seq(vi, 0);
+}
+
+
+static struct val_desc val_toplevel_group[] = {
+	{ .name = "mnt",	.get = val_mnt_get,	},
+	{ .name = "mntns",	.get = val_mntns_get,	},
+	{ .name = "xattr",	.get = val_xattr_get,	},
+	{ .name = "data",	.get = val_data_get,	},
+	{ .name = NULL },
+};
+
+SYSCALL_DEFINE5(getvalues,
+		int, dfd,
+		const char __user *, pathname,
+		struct name_val __user *, vec,
+		size_t, num,
+		unsigned int, flags)
+{
+	char vals[1024];
+	struct val_iter vi = {
+		.curr = vec,
+		.num = num,
+		.seq.buf = vals,
+		.bufsize = sizeof(vals),
+		.prefix = "",
+	};
+	struct val_desc *vd;
+	struct path path = {};
+	ssize_t err;
+
+	err = user_path_at(dfd, pathname, 0, &path);
+	if (err)
+		return err;
+
+	err = val_get(&vi);
+	if (err)
+		goto out;
+
+	if (!strlen(vi.name)) {
+		err = val_get_group(&vi, val_toplevel_group);
+		goto out;
+	}
+	while (!err && vi.num) {
+		vd = val_lookup(&vi, val_toplevel_group);
+		if (!vd->name)
+			err = val_err(&vi, -ENOENT);
+		else
+			err = vd->get(&vi, &path);
+	}
+out:
+	if (err == -EOVERFLOW)
+		err = 0;
+
+	path_put(&path);
+	return err < 0 ? err : num - vi.num;
+}