
[RFC] Change calling conventions for filldir_t

Message ID YvvBs+7YUcrzwV1a@ZenIV (mailing list archive)
State New, archived
Series [RFC] Change calling conventions for filldir_t

Commit Message

Al Viro Aug. 16, 2022, 4:11 p.m. UTC
filldir_t instances (directory iterator callbacks) used to return 0 for
"OK, keep going" or -E... for "stop".  Note that it's *NOT* how the
error values are reported - the rules for those are callback-dependent
and ->iterate{,_shared}() instances only care about zero vs. non-zero
(look at dir_emit() and friends).

So let's just return bool ("should we keep going?") - it's less confusing
that way.  The choice between "true means keep going" and "true means
stop" is bikesheddable; we have two groups of callbacks -
    do something for everything in the directory, until we run into a problem
and
    find an entry in the directory and do something to it.

The former tended to use the 0/-E... convention - -E<something> on failure.
The latter tended to use 0/1, 1 being "stop, we are done".
The callers treated anything non-zero as "stop", ignoring which
non-zero value they got.

"true means stop" would be more natural for the second group; "true
means keep going" - for the first one.  I tried both variants and
the things like
	if allocation failed
		something = -ENOMEM;
		return true;
just looked unnatural and asking for trouble.
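
To make that concrete, the two styles under the new convention look
roughly like this (an illustrative sketch - struct my_buf and the
helper functions are made up, not code from this patch):

	/* group 1: act on every entry, stop only on error */
	static bool fill_all(struct dir_context *ctx, const char *name,
			     int namlen, loff_t offset, u64 ino,
			     unsigned int d_type)
	{
		struct my_buf *buf = container_of(ctx, struct my_buf, ctx);

		if (!record_entry(buf, name, namlen)) {	/* hypothetical */
			buf->error = -ENOMEM;	/* reported via the buffer */
			return false;		/* stop */
		}
		return true;			/* keep going */
	}

	/* group 2: find one entry, stop as soon as it's found */
	static bool find_one(struct dir_context *ctx, const char *name,
			     int namlen, loff_t offset, u64 ino,
			     unsigned int d_type)
	{
		struct my_buf *buf = container_of(ctx, struct my_buf, ctx);

		if (!name_matches(buf, name, namlen))	/* hypothetical */
			return true;		/* keep looking */
		buf->found = 1;
		return false;			/* done */
	}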
    
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---

Comments

Matthew Wilcox Aug. 16, 2022, 4:24 p.m. UTC | #1
On Tue, Aug 16, 2022 at 05:11:31PM +0100, Al Viro wrote:
> filldir_t instances (directory iterator callbacks) used to return 0 for
> "OK, keep going" or -E... for "stop".  Note that it's *NOT* how the
> error values are reported - the rules for those are callback-dependent
> and ->iterate{,_shared}() instances only care about zero vs. non-zero
> (look at dir_emit() and friends).
> 
> So let's just return bool ("should we keep going?") - it's less confusing
> that way.  The choice between "true means keep going" and "true means
> stop" is bikesheddable; we have two groups of callbacks -
>     do something for everything in the directory, until we run into a problem
> and
>     find an entry in the directory and do something to it.
> 
> The former tended to use the 0/-E... convention - -E<something> on failure.
> The latter tended to use 0/1, 1 being "stop, we are done".
> The callers treated anything non-zero as "stop", ignoring which
> non-zero value they got.
> 
> "true means stop" would be more natural for the second group; "true
> means keep going" - for the first one.  I tried both variants and
> the things like
> 	if allocation failed
> 		something = -ENOMEM;
> 		return true;
> just looked unnatural and asking for trouble.

I like it the way you have it.  My only suggestion is:

+++ b/include/linux/fs.h
@@ -2039,6 +2039,7 @@ extern bool may_open_dev(const struct path *path);
  * the kernel specify what kind of dirent layout it wants to have.
  * This allows the kernel to read directories into kernel space or
  * to have different dirent layouts depending on the binary type.
+ * Return 'true' to keep going and 'false' if there are no more entries.
  */
 struct dir_context;
 typedef int (*filldir_t)(struct dir_context *, const char *, int, loff_t, u64,

> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 9eced4cc286e..8b8c0c11afec 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2040,7 +2040,7 @@ umode_t mode_strip_sgid(struct user_namespace *mnt_userns,
>   * to have different dirent layouts depending on the binary type.
>   */
>  struct dir_context;
> -typedef int (*filldir_t)(struct dir_context *, const char *, int, loff_t, u64,
> +typedef bool (*filldir_t)(struct dir_context *, const char *, int, loff_t, u64,
>  			 unsigned);
>  
>  struct dir_context {
Al Viro Aug. 16, 2022, 4:32 p.m. UTC | #2
On Tue, Aug 16, 2022 at 05:24:16PM +0100, Matthew Wilcox wrote:
> > The former tended to use the 0/-E... convention - -E<something> on failure.
> > The latter tended to use 0/1, 1 being "stop, we are done".
> > The callers treated anything non-zero as "stop", ignoring which
> > non-zero value they got.
> > 
> > "true means stop" would be more natural for the second group; "true
> > means keep going" - for the first one.  I tried both variants and
> > the things like
> > 	if allocation failed
> > 		something = -ENOMEM;
> > 		return true;
> > just looked unnatural and asking for trouble.
> 
> I like it the way you have it.  My only suggestion is:
> 
> +++ b/include/linux/fs.h
> @@ -2039,6 +2039,7 @@ extern bool may_open_dev(const struct path *path);
>   * the kernel specify what kind of dirent layout it wants to have.
>   * This allows the kernel to read directories into kernel space or
>   * to have different dirent layouts depending on the binary type.
> + * Return 'true' to keep going and 'false' if there are no more entries.
>   */
>  struct dir_context;
>  typedef int (*filldir_t)(struct dir_context *, const char *, int, loff_t, u64,

OK...  FWIW, an additional argument is that ->iterate() instances are
not calling those directly - they are calling dir_emit(), which returns
true for "keep going" and false for "stop".  This makes the calling
conventions for callbacks match that...
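
To illustrate, an ->iterate_shared() instance is conceptually (a
sketch - the myfs_* names and types are invented):

	static int myfs_iterate_shared(struct file *file,
				       struct dir_context *ctx)
	{
		struct myfs_dirent *de;

		for (de = myfs_first(file, ctx->pos); de; de = myfs_next(de)) {
			/* dir_emit() returns false when the callback says "stop" */
			if (!dir_emit(ctx, de->name, de->namelen,
				      de->ino, de->type))
				return 0;
			ctx->pos = de->pos + 1;
		}
		return 0;
	}

... and with bool-returning callbacks dir_emit() can simply pass the
callback's return value through instead of comparing it with 0.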
Christian Brauner Aug. 16, 2022, 4:57 p.m. UTC | #3
On Tue, Aug 16, 2022 at 05:11:31PM +0100, Al Viro wrote:
> filldir_t instances (directory iterator callbacks) used to return 0 for
> "OK, keep going" or -E... for "stop".  Note that it's *NOT* how the
> error values are reported - the rules for those are callback-dependent
> and ->iterate{,_shared}() instances only care about zero vs. non-zero
> (look at dir_emit() and friends).
> 
> So let's just return bool ("should we keep going?") - it's less confusing
> that way.  The choice between "true means keep going" and "true means
> stop" is bikesheddable; we have two groups of callbacks -
>     do something for everything in the directory, until we run into a problem
> and
>     find an entry in the directory and do something to it.
> 
> The former tended to use the 0/-E... convention - -E<something> on failure.
> The latter tended to use 0/1, 1 being "stop, we are done".
> The callers treated anything non-zero as "stop", ignoring which
> non-zero value they got.
> 
> "true means stop" would be more natural for the second group; "true
> means keep going" - for the first one.  I tried both variants and
> the things like
> 	if allocation failed
> 		something = -ENOMEM;
> 		return true;
> just looked unnatural and asking for trouble.
>     
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
> diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
> index aee9aaf9f3df..60df72f6dd37 100644
> --- a/Documentation/filesystems/porting.rst
> +++ b/Documentation/filesystems/porting.rst
> @@ -922,3 +922,14 @@ is provided - file_open_root_mnt().  In-tree users adjusted.
>  no_llseek is gone; don't set .llseek to that - just leave it NULL instead.
>  Checks for "does that file have llseek(2), or should it fail with ESPIPE"
>  should be done by looking at FMODE_LSEEK in file->f_mode.
> +
> +---
> +
> +*mandatory*
> +
> +filldir_t (readdir callbacks) calling conventions have changed.  Instead of
> +returning 0 or -E... it returns bool now.  false means "no more" (as -E... used
> +to to) and true - "keep going" (as 0 in old calling conventions).  Rationale:

s/to to/to/ ?
otherwise looks good to me,
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>

> +callers never looked at specific -E... values anyway.  ->iterate() and
> +->iterate_shared() instances require no changes at all, all filldir_t ones in
> +the tree converted.
> diff --git a/arch/alpha/kernel/osf_sys.c b/arch/alpha/kernel/osf_sys.c
> index d257293401e2..097d42cbd540 100644
> --- a/arch/alpha/kernel/osf_sys.c
> +++ b/arch/alpha/kernel/osf_sys.c
> @@ -108,7 +108,7 @@ struct osf_dirent_callback {
>  	int error;
>  };
>  
> -static int
> +static bool
>  osf_filldir(struct dir_context *ctx, const char *name, int namlen,
>  	    loff_t offset, u64 ino, unsigned int d_type)
>  {
> @@ -120,11 +120,11 @@ osf_filldir(struct dir_context *ctx, const char *name, int namlen,
>  
>  	buf->error = -EINVAL;	/* only used if we fail */
>  	if (reclen > buf->count)
> -		return -EINVAL;
> +		return false;
>  	d_ino = ino;
>  	if (sizeof(d_ino) < sizeof(ino) && d_ino != ino) {
>  		buf->error = -EOVERFLOW;
> -		return -EOVERFLOW;
> +		return false;
>  	}
>  	if (buf->basep) {
>  		if (put_user(offset, buf->basep))
> @@ -141,10 +141,10 @@ osf_filldir(struct dir_context *ctx, const char *name, int namlen,
>  	dirent = (void __user *)dirent + reclen;
>  	buf->dirent = dirent;
>  	buf->count -= reclen;
> -	return 0;
> +	return true;
>  Efault:
>  	buf->error = -EFAULT;
> -	return -EFAULT;
> +	return false;
>  }
>  
>  SYSCALL_DEFINE4(osf_getdirentries, unsigned int, fd,
> diff --git a/fs/afs/dir.c b/fs/afs/dir.c
> index 56ae5cd5184f..230c2d19116d 100644
> --- a/fs/afs/dir.c
> +++ b/fs/afs/dir.c
> @@ -24,9 +24,9 @@ static int afs_readdir(struct file *file, struct dir_context *ctx);
>  static int afs_d_revalidate(struct dentry *dentry, unsigned int flags);
>  static int afs_d_delete(const struct dentry *dentry);
>  static void afs_d_iput(struct dentry *dentry, struct inode *inode);
> -static int afs_lookup_one_filldir(struct dir_context *ctx, const char *name, int nlen,
> +static bool afs_lookup_one_filldir(struct dir_context *ctx, const char *name, int nlen,
>  				  loff_t fpos, u64 ino, unsigned dtype);
> -static int afs_lookup_filldir(struct dir_context *ctx, const char *name, int nlen,
> +static bool afs_lookup_filldir(struct dir_context *ctx, const char *name, int nlen,
>  			      loff_t fpos, u64 ino, unsigned dtype);
>  static int afs_create(struct user_namespace *mnt_userns, struct inode *dir,
>  		      struct dentry *dentry, umode_t mode, bool excl);
> @@ -568,7 +568,7 @@ static int afs_readdir(struct file *file, struct dir_context *ctx)
>   * - if afs_dir_iterate_block() spots this function, it'll pass the FID
>   *   uniquifier through dtype
>   */
> -static int afs_lookup_one_filldir(struct dir_context *ctx, const char *name,
> +static bool afs_lookup_one_filldir(struct dir_context *ctx, const char *name,
>  				  int nlen, loff_t fpos, u64 ino, unsigned dtype)
>  {
>  	struct afs_lookup_one_cookie *cookie =
> @@ -584,16 +584,16 @@ static int afs_lookup_one_filldir(struct dir_context *ctx, const char *name,
>  
>  	if (cookie->name.len != nlen ||
>  	    memcmp(cookie->name.name, name, nlen) != 0) {
> -		_leave(" = 0 [no]");
> -		return 0;
> +		_leave(" = true [keep looking]");
> +		return true;
>  	}
>  
>  	cookie->fid.vnode = ino;
>  	cookie->fid.unique = dtype;
>  	cookie->found = 1;
>  
> -	_leave(" = -1 [found]");
> -	return -1;
> +	_leave(" = false [found]");
> +	return false;
>  }
>  
>  /*
> @@ -636,12 +636,11 @@ static int afs_do_lookup_one(struct inode *dir, struct dentry *dentry,
>   * - if afs_dir_iterate_block() spots this function, it'll pass the FID
>   *   uniquifier through dtype
>   */
> -static int afs_lookup_filldir(struct dir_context *ctx, const char *name,
> +static bool afs_lookup_filldir(struct dir_context *ctx, const char *name,
>  			      int nlen, loff_t fpos, u64 ino, unsigned dtype)
>  {
>  	struct afs_lookup_cookie *cookie =
>  		container_of(ctx, struct afs_lookup_cookie, ctx);
> -	int ret;
>  
>  	_enter("{%s,%u},%s,%u,,%llu,%u",
>  	       cookie->name.name, cookie->name.len, name, nlen,
> @@ -663,12 +662,10 @@ static int afs_lookup_filldir(struct dir_context *ctx, const char *name,
>  		cookie->fids[1].unique	= dtype;
>  		cookie->found = 1;
>  		if (cookie->one_only)
> -			return -1;
> +			return false;
>  	}
>  
> -	ret = cookie->nr_fids >= 50 ? -1 : 0;
> -	_leave(" = %d", ret);
> -	return ret;
> +	return cookie->nr_fids < 50;
>  }
>  
>  /*
> diff --git a/fs/ecryptfs/file.c b/fs/ecryptfs/file.c
> index 18d5b91cb573..c29814a66c5b 100644
> --- a/fs/ecryptfs/file.c
> +++ b/fs/ecryptfs/file.c
> @@ -53,7 +53,7 @@ struct ecryptfs_getdents_callback {
>  };
>  
>  /* Inspired by generic filldir in fs/readdir.c */
> -static int
> +static bool
>  ecryptfs_filldir(struct dir_context *ctx, const char *lower_name,
>  		 int lower_namelen, loff_t offset, u64 ino, unsigned int d_type)
>  {
> @@ -61,18 +61,19 @@ ecryptfs_filldir(struct dir_context *ctx, const char *lower_name,
>  		container_of(ctx, struct ecryptfs_getdents_callback, ctx);
>  	size_t name_size;
>  	char *name;
> -	int rc;
> +	int err;
> +	bool res;
>  
>  	buf->filldir_called++;
> -	rc = ecryptfs_decode_and_decrypt_filename(&name, &name_size,
> -						  buf->sb, lower_name,
> -						  lower_namelen);
> -	if (rc) {
> -		if (rc != -EINVAL) {
> +	err = ecryptfs_decode_and_decrypt_filename(&name, &name_size,
> +						   buf->sb, lower_name,
> +						   lower_namelen);
> +	if (err) {
> +		if (err != -EINVAL) {
>  			ecryptfs_printk(KERN_DEBUG,
>  					"%s: Error attempting to decode and decrypt filename [%s]; rc = [%d]\n",
> -					__func__, lower_name, rc);
> -			return rc;
> +					__func__, lower_name, err);
> +			return false;
>  		}
>  
>  		/* Mask -EINVAL errors as these are most likely due a plaintext
> @@ -81,16 +82,15 @@ ecryptfs_filldir(struct dir_context *ctx, const char *lower_name,
>  		 * the "lost+found" dentry in the root directory of an Ext4
>  		 * filesystem.
>  		 */
> -		return 0;
> +		return true;
>  	}
>  
>  	buf->caller->pos = buf->ctx.pos;
> -	rc = !dir_emit(buf->caller, name, name_size, ino, d_type);
> +	res = dir_emit(buf->caller, name, name_size, ino, d_type);
>  	kfree(name);
> -	if (!rc)
> +	if (res)
>  		buf->entries_written++;
> -
> -	return rc;
> +	return res;
>  }
>  
>  /**
> @@ -111,14 +111,8 @@ static int ecryptfs_readdir(struct file *file, struct dir_context *ctx)
>  	lower_file = ecryptfs_file_to_lower(file);
>  	rc = iterate_dir(lower_file, &buf.ctx);
>  	ctx->pos = buf.ctx.pos;
> -	if (rc < 0)
> -		goto out;
> -	if (buf.filldir_called && !buf.entries_written)
> -		goto out;
> -	if (rc >= 0)
> -		fsstack_copy_attr_atime(inode,
> -					file_inode(lower_file));
> -out:
> +	if (rc >= 0 && (buf.entries_written || !buf.filldir_called))
> +		fsstack_copy_attr_atime(inode, file_inode(lower_file));
>  	return rc;
>  }
>  
> diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
> index 3ef80d000e13..c648a493faf2 100644
> --- a/fs/exportfs/expfs.c
> +++ b/fs/exportfs/expfs.c
> @@ -248,21 +248,20 @@ struct getdents_callback {
>   * A rather strange filldir function to capture
>   * the name matching the specified inode number.
>   */
> -static int filldir_one(struct dir_context *ctx, const char *name, int len,
> +static bool filldir_one(struct dir_context *ctx, const char *name, int len,
>  			loff_t pos, u64 ino, unsigned int d_type)
>  {
>  	struct getdents_callback *buf =
>  		container_of(ctx, struct getdents_callback, ctx);
> -	int result = 0;
>  
>  	buf->sequence++;
>  	if (buf->ino == ino && len <= NAME_MAX) {
>  		memcpy(buf->name, name, len);
>  		buf->name[len] = '\0';
>  		buf->found = 1;
> -		result = -1;
> +		return false;	// no more
>  	}
> -	return result;
> +	return true;
>  }
>  
>  /**
> diff --git a/fs/fat/dir.c b/fs/fat/dir.c
> index 249825017da7..00235b8a1823 100644
> --- a/fs/fat/dir.c
> +++ b/fs/fat/dir.c
> @@ -705,7 +705,7 @@ static int fat_readdir(struct file *file, struct dir_context *ctx)
>  }
>  
>  #define FAT_IOCTL_FILLDIR_FUNC(func, dirent_type)			   \
> -static int func(struct dir_context *ctx, const char *name, int name_len,   \
> +static bool func(struct dir_context *ctx, const char *name, int name_len,  \
>  			     loff_t offset, u64 ino, unsigned int d_type)  \
>  {									   \
>  	struct fat_ioctl_filldir_callback *buf =			   \
> @@ -714,7 +714,7 @@ static int func(struct dir_context *ctx, const char *name, int name_len,   \
>  	struct dirent_type __user *d2 = d1 + 1;				   \
>  									   \
>  	if (buf->result)						   \
> -		return -EINVAL;						   \
> +		return false;						   \
>  	buf->result++;							   \
>  									   \
>  	if (name != NULL) {						   \
> @@ -750,10 +750,10 @@ static int func(struct dir_context *ctx, const char *name, int name_len,   \
>  		    put_user(short_len, &d1->d_reclen))			   \
>  			goto efault;					   \
>  	}								   \
> -	return 0;							   \
> +	return true;							   \
>  efault:									   \
>  	buf->result = -EFAULT;						   \
> -	return -EFAULT;							   \
> +	return false;							   \
>  }
>  
>  FAT_IOCTL_FILLDIR_FUNC(fat_ioctl_filldir, __fat_dirent)
> diff --git a/fs/gfs2/export.c b/fs/gfs2/export.c
> index 756d05779200..cf40895233f5 100644
> --- a/fs/gfs2/export.c
> +++ b/fs/gfs2/export.c
> @@ -66,7 +66,7 @@ struct get_name_filldir {
>  	char *name;
>  };
>  
> -static int get_name_filldir(struct dir_context *ctx, const char *name,
> +static bool get_name_filldir(struct dir_context *ctx, const char *name,
>  			    int length, loff_t offset, u64 inum,
>  			    unsigned int type)
>  {
> @@ -74,12 +74,12 @@ static int get_name_filldir(struct dir_context *ctx, const char *name,
>  		container_of(ctx, struct get_name_filldir, ctx);
>  
>  	if (inum != gnfd->inum.no_addr)
> -		return 0;
> +		return true;
>  
>  	memcpy(gnfd->name, name, length);
>  	gnfd->name[length] = 0;
>  
> -	return 1;
> +	return false;
>  }
>  
>  static int gfs2_get_name(struct dentry *parent, char *name,
> diff --git a/fs/ksmbd/smb2pdu.c b/fs/ksmbd/smb2pdu.c
> index 9751cc92c111..6785a9cc9ee1 100644
> --- a/fs/ksmbd/smb2pdu.c
> +++ b/fs/ksmbd/smb2pdu.c
> @@ -3779,7 +3779,7 @@ static int reserve_populate_dentry(struct ksmbd_dir_info *d_info,
>  	return 0;
>  }
>  
> -static int __query_dir(struct dir_context *ctx, const char *name, int namlen,
> +static bool __query_dir(struct dir_context *ctx, const char *name, int namlen,
>  		       loff_t offset, u64 ino, unsigned int d_type)
>  {
>  	struct ksmbd_readdir_data	*buf;
> @@ -3793,22 +3793,20 @@ static int __query_dir(struct dir_context *ctx, const char *name, int namlen,
>  
>  	/* dot and dotdot entries are already reserved */
>  	if (!strcmp(".", name) || !strcmp("..", name))
> -		return 0;
> +		return true;
>  	if (ksmbd_share_veto_filename(priv->work->tcon->share_conf, name))
> -		return 0;
> +		return true;
>  	if (!match_pattern(name, namlen, priv->search_pattern))
> -		return 0;
> +		return true;
>  
>  	d_info->name		= name;
>  	d_info->name_len	= namlen;
>  	rc = reserve_populate_dentry(d_info, priv->info_level);
>  	if (rc)
> -		return rc;
> -	if (d_info->flags & SMB2_RETURN_SINGLE_ENTRY) {
> +		return false;
> +	if (d_info->flags & SMB2_RETURN_SINGLE_ENTRY)
>  		d_info->out_buf_len = 0;
> -		return 0;
> -	}
> -	return 0;
> +	return true;
>  }
>  
>  static void restart_ctx(struct dir_context *ctx)
> diff --git a/fs/ksmbd/vfs.c b/fs/ksmbd/vfs.c
> index 78d01033604c..48b2b901f6e5 100644
> --- a/fs/ksmbd/vfs.c
> +++ b/fs/ksmbd/vfs.c
> @@ -1105,7 +1105,7 @@ int ksmbd_vfs_unlink(struct user_namespace *user_ns,
>  	return err;
>  }
>  
> -static int __dir_empty(struct dir_context *ctx, const char *name, int namlen,
> +static bool __dir_empty(struct dir_context *ctx, const char *name, int namlen,
>  		       loff_t offset, u64 ino, unsigned int d_type)
>  {
>  	struct ksmbd_readdir_data *buf;
> @@ -1113,9 +1113,7 @@ static int __dir_empty(struct dir_context *ctx, const char *name, int namlen,
>  	buf = container_of(ctx, struct ksmbd_readdir_data, ctx);
>  	buf->dirent_count++;
>  
> -	if (buf->dirent_count > 2)
> -		return -ENOTEMPTY;
> -	return 0;
> +	return buf->dirent_count <= 2;
>  }
>  
>  /**
> @@ -1142,7 +1140,7 @@ int ksmbd_vfs_empty_dir(struct ksmbd_file *fp)
>  	return err;
>  }
>  
> -static int __caseless_lookup(struct dir_context *ctx, const char *name,
> +static bool __caseless_lookup(struct dir_context *ctx, const char *name,
>  			     int namlen, loff_t offset, u64 ino,
>  			     unsigned int d_type)
>  {
> @@ -1151,13 +1149,13 @@ static int __caseless_lookup(struct dir_context *ctx, const char *name,
>  	buf = container_of(ctx, struct ksmbd_readdir_data, ctx);
>  
>  	if (buf->used != namlen)
> -		return 0;
> +		return true;
>  	if (!strncasecmp((char *)buf->private, name, namlen)) {
>  		memcpy((char *)buf->private, name, namlen);
>  		buf->dirent_count = 1;
> -		return -EEXIST;
> +		return false;
>  	}
> -	return 0;
> +	return true;
>  }
>  
>  /**
> diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
> index c634483d85d2..b29d27eaa8a6 100644
> --- a/fs/nfsd/nfs4recover.c
> +++ b/fs/nfsd/nfs4recover.c
> @@ -266,7 +266,7 @@ struct nfs4_dir_ctx {
>  	struct list_head names;
>  };
>  
> -static int
> +static bool
>  nfsd4_build_namelist(struct dir_context *__ctx, const char *name, int namlen,
>  		loff_t offset, u64 ino, unsigned int d_type)
>  {
> @@ -275,14 +275,14 @@ nfsd4_build_namelist(struct dir_context *__ctx, const char *name, int namlen,
>  	struct name_list *entry;
>  
>  	if (namlen != HEXDIR_LEN - 1)
> -		return 0;
> +		return true;
>  	entry = kmalloc(sizeof(struct name_list), GFP_KERNEL);
>  	if (entry == NULL)
> -		return -ENOMEM;
> +		return false;
>  	memcpy(entry->name, name, HEXDIR_LEN - 1);
>  	entry->name[HEXDIR_LEN - 1] = '\0';
>  	list_add(&entry->list, &ctx->names);
> -	return 0;
> +	return true;
>  }
>  
>  static int
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 9f486b788ed0..4b0015706e98 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1811,7 +1811,7 @@ struct readdir_data {
>  	int		full;
>  };
>  
> -static int nfsd_buffered_filldir(struct dir_context *ctx, const char *name,
> +static bool nfsd_buffered_filldir(struct dir_context *ctx, const char *name,
>  				 int namlen, loff_t offset, u64 ino,
>  				 unsigned int d_type)
>  {
> @@ -1823,7 +1823,7 @@ static int nfsd_buffered_filldir(struct dir_context *ctx, const char *name,
>  	reclen = ALIGN(sizeof(struct buffered_dirent) + namlen, sizeof(u64));
>  	if (buf->used + reclen > PAGE_SIZE) {
>  		buf->full = 1;
> -		return -EINVAL;
> +		return false;
>  	}
>  
>  	de->namlen = namlen;
> @@ -1833,7 +1833,7 @@ static int nfsd_buffered_filldir(struct dir_context *ctx, const char *name,
>  	memcpy(de->name, name, namlen);
>  	buf->used += reclen;
>  
> -	return 0;
> +	return true;
>  }
>  
>  static __be32 nfsd_buffered_readdir(struct file *file, struct svc_fh *fhp,
> diff --git a/fs/ocfs2/dir.c b/fs/ocfs2/dir.c
> index 81c3d65d68fe..694471fc46b8 100644
> --- a/fs/ocfs2/dir.c
> +++ b/fs/ocfs2/dir.c
> @@ -2032,7 +2032,7 @@ struct ocfs2_empty_dir_priv {
>  	unsigned seen_other;
>  	unsigned dx_dir;
>  };
> -static int ocfs2_empty_dir_filldir(struct dir_context *ctx, const char *name,
> +static bool ocfs2_empty_dir_filldir(struct dir_context *ctx, const char *name,
>  				   int name_len, loff_t pos, u64 ino,
>  				   unsigned type)
>  {
> @@ -2052,7 +2052,7 @@ static int ocfs2_empty_dir_filldir(struct dir_context *ctx, const char *name,
>  	 */
>  	if (name_len == 1 && !strncmp(".", name, 1) && pos == 0) {
>  		p->seen_dot = 1;
> -		return 0;
> +		return true;
>  	}
>  
>  	if (name_len == 2 && !strncmp("..", name, 2) &&
> @@ -2060,13 +2060,13 @@ static int ocfs2_empty_dir_filldir(struct dir_context *ctx, const char *name,
>  		p->seen_dot_dot = 1;
>  
>  		if (p->dx_dir && p->seen_dot)
> -			return 1;
> +			return false;
>  
> -		return 0;
> +		return true;
>  	}
>  
>  	p->seen_other = 1;
> -	return 1;
> +	return false;
>  }
>  
>  static int ocfs2_empty_dir_dx(struct inode *inode,
> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
> index fa87d89cf754..126671e6caed 100644
> --- a/fs/ocfs2/journal.c
> +++ b/fs/ocfs2/journal.c
> @@ -2057,7 +2057,7 @@ struct ocfs2_orphan_filldir_priv {
>  	enum ocfs2_orphan_reco_type orphan_reco_type;
>  };
>  
> -static int ocfs2_orphan_filldir(struct dir_context *ctx, const char *name,
> +static bool ocfs2_orphan_filldir(struct dir_context *ctx, const char *name,
>  				int name_len, loff_t pos, u64 ino,
>  				unsigned type)
>  {
> @@ -2066,21 +2066,21 @@ static int ocfs2_orphan_filldir(struct dir_context *ctx, const char *name,
>  	struct inode *iter;
>  
>  	if (name_len == 1 && !strncmp(".", name, 1))
> -		return 0;
> +		return true;
>  	if (name_len == 2 && !strncmp("..", name, 2))
> -		return 0;
> +		return true;
>  
>  	/* do not include dio entry in case of orphan scan */
>  	if ((p->orphan_reco_type == ORPHAN_NO_NEED_TRUNCATE) &&
>  			(!strncmp(name, OCFS2_DIO_ORPHAN_PREFIX,
>  			OCFS2_DIO_ORPHAN_PREFIX_LEN)))
> -		return 0;
> +		return true;
>  
>  	/* Skip bad inodes so that recovery can continue */
>  	iter = ocfs2_iget(p->osb, ino,
>  			  OCFS2_FI_FLAG_ORPHAN_RECOVERY, 0);
>  	if (IS_ERR(iter))
> -		return 0;
> +		return true;
>  
>  	if (!strncmp(name, OCFS2_DIO_ORPHAN_PREFIX,
>  			OCFS2_DIO_ORPHAN_PREFIX_LEN))
> @@ -2090,7 +2090,7 @@ static int ocfs2_orphan_filldir(struct dir_context *ctx, const char *name,
>  	 * happen concurrently with unlink/rename */
>  	if (OCFS2_I(iter)->ip_next_orphan) {
>  		iput(iter);
> -		return 0;
> +		return true;
>  	}
>  
>  	trace_ocfs2_orphan_filldir((unsigned long long)OCFS2_I(iter)->ip_blkno);
> @@ -2099,7 +2099,7 @@ static int ocfs2_orphan_filldir(struct dir_context *ctx, const char *name,
>  	OCFS2_I(iter)->ip_next_orphan = p->head;
>  	p->head = iter;
>  
> -	return 0;
> +	return true;
>  }
>  
>  static int ocfs2_queue_orphans(struct ocfs2_super *osb,
> diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> index 78f62cc1797b..8c25d185cdc0 100644
> --- a/fs/overlayfs/readdir.c
> +++ b/fs/overlayfs/readdir.c
> @@ -170,7 +170,7 @@ static struct ovl_cache_entry *ovl_cache_entry_new(struct ovl_readdir_data *rdd,
>  	return p;
>  }
>  
> -static int ovl_cache_entry_add_rb(struct ovl_readdir_data *rdd,
> +static bool ovl_cache_entry_add_rb(struct ovl_readdir_data *rdd,
>  				  const char *name, int len, u64 ino,
>  				  unsigned int d_type)
>  {
> @@ -179,22 +179,22 @@ static int ovl_cache_entry_add_rb(struct ovl_readdir_data *rdd,
>  	struct ovl_cache_entry *p;
>  
>  	if (ovl_cache_entry_find_link(name, len, &newp, &parent))
> -		return 0;
> +		return true;
>  
>  	p = ovl_cache_entry_new(rdd, name, len, ino, d_type);
>  	if (p == NULL) {
>  		rdd->err = -ENOMEM;
> -		return -ENOMEM;
> +		return false;
>  	}
>  
>  	list_add_tail(&p->l_node, rdd->list);
>  	rb_link_node(&p->node, parent, newp);
>  	rb_insert_color(&p->node, rdd->root);
>  
> -	return 0;
> +	return true;
>  }
>  
> -static int ovl_fill_lowest(struct ovl_readdir_data *rdd,
> +static bool ovl_fill_lowest(struct ovl_readdir_data *rdd,
>  			   const char *name, int namelen,
>  			   loff_t offset, u64 ino, unsigned int d_type)
>  {
> @@ -211,7 +211,7 @@ static int ovl_fill_lowest(struct ovl_readdir_data *rdd,
>  			list_add_tail(&p->l_node, &rdd->middle);
>  	}
>  
> -	return rdd->err;
> +	return rdd->err == 0;
>  }
>  
>  void ovl_cache_free(struct list_head *list)
> @@ -250,7 +250,7 @@ static void ovl_cache_put(struct ovl_dir_file *od, struct dentry *dentry)
>  	}
>  }
>  
> -static int ovl_fill_merge(struct dir_context *ctx, const char *name,
> +static bool ovl_fill_merge(struct dir_context *ctx, const char *name,
>  			  int namelen, loff_t offset, u64 ino,
>  			  unsigned int d_type)
>  {
> @@ -528,7 +528,7 @@ static int ovl_cache_update_ino(struct path *path, struct ovl_cache_entry *p)
>  	goto out;
>  }
>  
> -static int ovl_fill_plain(struct dir_context *ctx, const char *name,
> +static bool ovl_fill_plain(struct dir_context *ctx, const char *name,
>  			  int namelen, loff_t offset, u64 ino,
>  			  unsigned int d_type)
>  {
> @@ -540,11 +540,11 @@ static int ovl_fill_plain(struct dir_context *ctx, const char *name,
>  	p = ovl_cache_entry_new(rdd, name, namelen, ino, d_type);
>  	if (p == NULL) {
>  		rdd->err = -ENOMEM;
> -		return -ENOMEM;
> +		return false;
>  	}
>  	list_add_tail(&p->l_node, rdd->list);
>  
> -	return 0;
> +	return true;
>  }
>  
>  static int ovl_dir_read_impure(struct path *path,  struct list_head *list,
> @@ -648,7 +648,7 @@ struct ovl_readdir_translate {
>  	bool xinowarn;
>  };
>  
> -static int ovl_fill_real(struct dir_context *ctx, const char *name,
> +static bool ovl_fill_real(struct dir_context *ctx, const char *name,
>  			   int namelen, loff_t offset, u64 ino,
>  			   unsigned int d_type)
>  {
> @@ -1027,7 +1027,7 @@ void ovl_cleanup_whiteouts(struct ovl_fs *ofs, struct dentry *upper,
>  	inode_unlock(upper->d_inode);
>  }
>  
> -static int ovl_check_d_type(struct dir_context *ctx, const char *name,
> +static bool ovl_check_d_type(struct dir_context *ctx, const char *name,
>  			  int namelen, loff_t offset, u64 ino,
>  			  unsigned int d_type)
>  {
> @@ -1036,12 +1036,12 @@ static int ovl_check_d_type(struct dir_context *ctx, const char *name,
>  
>  	/* Even if d_type is not supported, DT_DIR is returned for . and .. */
>  	if (!strncmp(name, ".", namelen) || !strncmp(name, "..", namelen))
> -		return 0;
> +		return true;
>  
>  	if (d_type != DT_UNKNOWN)
>  		rdd->d_type_supported = true;
>  
> -	return 0;
> +	return true;
>  }
>  
>  /*
> diff --git a/fs/readdir.c b/fs/readdir.c
> index 09e8ed7d4161..9c53edb60c03 100644
> --- a/fs/readdir.c
> +++ b/fs/readdir.c
> @@ -140,7 +140,7 @@ struct readdir_callback {
>  	int result;
>  };
>  
> -static int fillonedir(struct dir_context *ctx, const char *name, int namlen,
> +static bool fillonedir(struct dir_context *ctx, const char *name, int namlen,
>  		      loff_t offset, u64 ino, unsigned int d_type)
>  {
>  	struct readdir_callback *buf =
> @@ -149,14 +149,14 @@ static int fillonedir(struct dir_context *ctx, const char *name, int namlen,
>  	unsigned long d_ino;
>  
>  	if (buf->result)
> -		return -EINVAL;
> +		return false;
>  	buf->result = verify_dirent_name(name, namlen);
> -	if (buf->result < 0)
> -		return buf->result;
> +	if (buf->result)
> +		return false;

Probably pretty obvious, but just to make sure: buf->result being 0 is
supposed to return false in the two locations in this patch?

>  	d_ino = ino;
>  	if (sizeof(d_ino) < sizeof(ino) && d_ino != ino) {
>  		buf->result = -EOVERFLOW;
> -		return -EOVERFLOW;
> +		return false;
>  	}
>  	buf->result++;
>  	dirent = buf->dirent;
> @@ -169,12 +169,12 @@ static int fillonedir(struct dir_context *ctx, const char *name, int namlen,
>  	unsafe_put_user(namlen, &dirent->d_namlen, efault_end);
>  	unsafe_copy_dirent_name(dirent->d_name, name, namlen, efault_end);
>  	user_write_access_end();
> -	return 0;
> +	return true;
>  efault_end:
>  	user_write_access_end();
>  efault:
>  	buf->result = -EFAULT;
> -	return -EFAULT;
> +	return false;
>  }
>  
>  SYSCALL_DEFINE3(old_readdir, unsigned int, fd,
> @@ -219,7 +219,7 @@ struct getdents_callback {
>  	int error;
>  };
>  
> -static int filldir(struct dir_context *ctx, const char *name, int namlen,
> +static bool filldir(struct dir_context *ctx, const char *name, int namlen,
>  		   loff_t offset, u64 ino, unsigned int d_type)
>  {
>  	struct linux_dirent __user *dirent, *prev;
> @@ -232,18 +232,18 @@ static int filldir(struct dir_context *ctx, const char *name, int namlen,
>  
>  	buf->error = verify_dirent_name(name, namlen);
>  	if (unlikely(buf->error))
> -		return buf->error;
> +		return false;
>  	buf->error = -EINVAL;	/* only used if we fail.. */
>  	if (reclen > buf->count)
> -		return -EINVAL;
> +		return false;
>  	d_ino = ino;
>  	if (sizeof(d_ino) < sizeof(ino) && d_ino != ino) {
>  		buf->error = -EOVERFLOW;
> -		return -EOVERFLOW;
> +		return false;
>  	}
>  	prev_reclen = buf->prev_reclen;
>  	if (prev_reclen && signal_pending(current))
> -		return -EINTR;
> +		return false;
>  	dirent = buf->current_dir;
>  	prev = (void __user *) dirent - prev_reclen;
>  	if (!user_write_access_begin(prev, reclen + prev_reclen))
> @@ -260,12 +260,12 @@ static int filldir(struct dir_context *ctx, const char *name, int namlen,
>  	buf->current_dir = (void __user *)dirent + reclen;
>  	buf->prev_reclen = reclen;
>  	buf->count -= reclen;
> -	return 0;
> +	return true;
>  efault_end:
>  	user_write_access_end();
>  efault:
>  	buf->error = -EFAULT;
> -	return -EFAULT;
> +	return false;
>  }
>  
>  SYSCALL_DEFINE3(getdents, unsigned int, fd,
> @@ -307,7 +307,7 @@ struct getdents_callback64 {
>  	int error;
>  };
>  
> -static int filldir64(struct dir_context *ctx, const char *name, int namlen,
> +static bool filldir64(struct dir_context *ctx, const char *name, int namlen,
>  		     loff_t offset, u64 ino, unsigned int d_type)
>  {
>  	struct linux_dirent64 __user *dirent, *prev;
> @@ -319,13 +319,13 @@ static int filldir64(struct dir_context *ctx, const char *name, int namlen,
>  
>  	buf->error = verify_dirent_name(name, namlen);
>  	if (unlikely(buf->error))
> -		return buf->error;
> +		return false;
>  	buf->error = -EINVAL;	/* only used if we fail.. */
>  	if (reclen > buf->count)
> -		return -EINVAL;
> +		return false;
>  	prev_reclen = buf->prev_reclen;
>  	if (prev_reclen && signal_pending(current))
> -		return -EINTR;
> +		return false;
>  	dirent = buf->current_dir;
>  	prev = (void __user *)dirent - prev_reclen;
>  	if (!user_write_access_begin(prev, reclen + prev_reclen))
> @@ -342,13 +342,13 @@ static int filldir64(struct dir_context *ctx, const char *name, int namlen,
>  	buf->prev_reclen = reclen;
>  	buf->current_dir = (void __user *)dirent + reclen;
>  	buf->count -= reclen;
> -	return 0;
> +	return true;
>  
>  efault_end:
>  	user_write_access_end();
>  efault:
>  	buf->error = -EFAULT;
> -	return -EFAULT;
> +	return false;
>  }
>  
>  SYSCALL_DEFINE3(getdents64, unsigned int, fd,
> @@ -397,7 +397,7 @@ struct compat_readdir_callback {
>  	int result;
>  };
>  
> -static int compat_fillonedir(struct dir_context *ctx, const char *name,
> +static bool compat_fillonedir(struct dir_context *ctx, const char *name,
>  			     int namlen, loff_t offset, u64 ino,
>  			     unsigned int d_type)
>  {
> @@ -407,14 +407,14 @@ static int compat_fillonedir(struct dir_context *ctx, const char *name,
>  	compat_ulong_t d_ino;
>  
>  	if (buf->result)
> -		return -EINVAL;
> +		return false;
>  	buf->result = verify_dirent_name(name, namlen);
> -	if (buf->result < 0)
> -		return buf->result;
> +	if (buf->result)
> +		return false;
Linus Torvalds Aug. 16, 2022, 6:58 p.m. UTC | #4
On Tue, Aug 16, 2022 at 9:11 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> filldir_t instances (directory iterators callbacks) used to return 0 for
> "OK, keep going" or -E... for "stop".

Yeah, and it was really confusing wrt what the reported user space error was.

> So let's just return bool ("should we keep going?") - it's less confusing
> that way.

Ack. And the "true means keep going" that you picked is, I think, the
more natural model with the conceptual model being

  while (dir_emit()) ..

and this makes the filldir return value match that dir_emit() one.

That said, our filldir code is still confusing as hell. And I would
really like to see that "shared vs non-shared" iterator thing go away,
with everybody using the shared one - and filesystems that can't deal
with it using their own lock.

But that's a completely independent wart in our complicated filldir saga.

But if somebody were to look at that iterate-vs-iterate_shared, that
would be lovely. A quick grep shows that we don't have *that* many of
the non-shared cases left:

      git grep '\.iterate\>.*='

seems to imply that converting them to a "use my own lock" wouldn't be
_too_ bad.

And some of them might actually be perfectly ok with the shared
semantics (ie inode->i_rwsem held just for reading) and they just were
never converted originally.

Anybody?

              Linus
Matthew Wilcox Aug. 16, 2022, 7:11 p.m. UTC | #5
On Tue, Aug 16, 2022 at 11:58:36AM -0700, Linus Torvalds wrote:
> That said, our filldir code is still confusing as hell. And I would
> really like to see that "shared vs non-shared" iterator thing go away,
> with everybody using the shared one - and filesystems that can't deal
> with it using their own lock.
> 
> But that's a completely independent wart in our complicated filldir saga.
> 
> But if somebody were to look at that iterate-vs-iterate_shared, that
> would be lovely. A quick grep shows that we don't have *that* many of
> the non-shared cases left:
> 
>       git grep '\.iterate\>.*='
> 
> seems to imply that converting them to a "use my own lock" wouldn't be
> _too_ bad.
> 
> And some of them might actually be perfectly ok with the shared
> semantics (ie inode->i_rwsem held just for reading) and they just were
> never converted originally.

What's depressing is that some of these are newly added.  It'd be
great if we could attach something _like_ __deprecated to things
that checkpatch could pick up on.

fs/adfs/dir_f.c:        .iterate        = adfs_f_iterate,
fs/adfs/dir_fplus.c:    .iterate        = adfs_fplus_iterate,

ADFS is read-only, so must be safe?

fs/ceph/dir.c:  .iterate = ceph_readdir,
fs/ceph/dir.c:  .iterate = ceph_readdir,

At least CEPH has active maintainers, cc'd

fs/coda/dir.c:  .iterate        = coda_readdir,

Would anyone notice if we broke CODA?  Maintainers cc'd anyway.

fs/exfat/dir.c: .iterate        = exfat_iterate,

Exfat is a new addition, but has active maintainers.

fs/jfs/namei.c: .iterate        = jfs_readdir,

Maintainer cc'd

fs/ntfs/dir.c:  .iterate        = ntfs_readdir,         /* Read directory contents. */

Maybe we can get rid of ntfs soon.

fs/ocfs2/file.c:        .iterate        = ocfs2_readdir,
fs/ocfs2/file.c:        .iterate        = ocfs2_readdir,

maintainers cc'd

fs/orangefs/dir.c:      .iterate = orangefs_dir_iterate,

New; maintainer cc'd

fs/overlayfs/readdir.c: .iterate        = ovl_iterate,

Active maintainer, cc'd

fs/proc/base.c: .iterate        = proc_##LSM##_attr_dir_iterate, \

Hmm.  We need both SMACK and Apparmor to agree to this ... cc's added.

fs/vboxsf/dir.c:        .iterate = vboxsf_dir_iterate,

Also newly added.  Maintainer cc'd.
Jan Harkes Aug. 16, 2022, 8:14 p.m. UTC | #6
On Tue, Aug 16, 2022 at 03:35:46PM -0400, Matthew Wilcox wrote:
> fs/coda/dir.c:  .iterate        = coda_readdir,
> 
> Would anyone notice if we broke CODA?  Maintainers cc'd anyway.

Ha, yes I think I would notice, but probably not until after the changes
got released and trickled down to the distributions ;)

So good to know in advance a change like this is coming. I'll have to
educate myself on this shared vs non-shared filldir.

Jan
Matthew Wilcox Aug. 16, 2022, 9:20 p.m. UTC | #7
On Tue, Aug 16, 2022 at 04:14:38PM -0400, Jan Harkes wrote:
> On Tue, Aug 16, 2022 at 03:35:46PM -0400, Matthew Wilcox wrote:
> > fs/coda/dir.c:  .iterate        = coda_readdir,
> > 
> > Would anyone notice if we broke CODA?  Maintainers cc'd anyway.
> 
> Ha, yes I think I would notice, but probably not until after the changes
> got released and trickled down to the distributions ;)

I'm sorry about that.  I got confused with Intermezzo, which we already
deleted in 2004.

> So good to know in advance a change like this is coming. I'll have to
> educate myself on this shared vs non-shared filldir.

From Documentation/filesystems/porting.rst:

->iterate_shared() is added; it's a parallel variant of ->iterate().
Exclusion on struct file level is still provided (as well as that
between it and lseek on the same struct file), but if your directory
has been opened several times, you can get these called in parallel.
Exclusion between that method and all directory-modifying ones is
still provided, of course.

Often enough ->iterate() can serve as ->iterate_shared() without any
changes - it is a read-only operation, after all.  If you have any
per-inode or per-dentry in-core data structures modified by ->iterate(),
you might need something to serialize the access to them.  If you
do dcache pre-seeding, you'll need to switch to d_alloc_parallel() for
that; look for in-tree examples.

Old method is only used if the new one is absent; eventually it will
be removed.  Switch while you still can; the old one won't stay.

---

That's probably the best documentation we have about it.
Casey Schaufler Aug. 16, 2022, 10:30 p.m. UTC | #8
On 8/16/2022 12:11 PM, Matthew Wilcox wrote:
> On Tue, Aug 16, 2022 at 11:58:36AM -0700, Linus Torvalds wrote:
>> That said, our filldir code is still confusing as hell. And I would
>> really like to see that "shared vs non-shared" iterator thing go away,
>> with everybody using the shared one - and filesystems that can't deal
>> with it using their own lock.
>>
>> But that's a completely independent wart in our complicated filldir saga.
>>
>> But if somebody were to look at that iterate-vs-iterate_shared, that
>> would be lovely. A quick grep shows that we don't have *that* many of
>> the non-shared cases left:
>>
>>       git grep '\.iterate\>.*='
>>
>> seems to imply that converting them to a "use my own lock" wouldn't be
>> _too_ bad.
>>
>> And some of them might actually be perfectly ok with the shared
>> semantics (ie inode->i_rwsem held just for reading) and they just were
>> never converted originally.
> What's depressing is that some of these are newly added.  It'd be
> great if we could attach something _like_ __deprecated to things
> that checkpatch could pick up on.
>
> fs/adfs/dir_f.c:        .iterate        = adfs_f_iterate,
> fs/adfs/dir_fplus.c:    .iterate        = adfs_fplus_iterate,
>
> ADFS is read-only, so must be safe?
>
> fs/ceph/dir.c:  .iterate = ceph_readdir,
> fs/ceph/dir.c:  .iterate = ceph_readdir,
>
> At least CEPH has active maintainers, cc'd
>
> fs/coda/dir.c:  .iterate        = coda_readdir,
>
> Would anyone notice if we broke CODA?  Maintainers cc'd anyway.
>
> fs/exfat/dir.c: .iterate        = exfat_iterate,
>
> Exfat is a new addition, but has active maintainers.
>
> fs/jfs/namei.c: .iterate        = jfs_readdir,
>
> Maintainer cc'd
>
> fs/ntfs/dir.c:  .iterate        = ntfs_readdir,         /* Read directory contents. */
>
> Maybe we can get rid of ntfs soon.
>
> fs/ocfs2/file.c:        .iterate        = ocfs2_readdir,
> fs/ocfs2/file.c:        .iterate        = ocfs2_readdir,
>
> maintainers cc'd
>
> fs/orangefs/dir.c:      .iterate = orangefs_dir_iterate,
>
> New; maintainer cc'd
>
> fs/overlayfs/readdir.c: .iterate        = ovl_iterate,
>
> Active maintainer, cc'd
>
> fs/proc/base.c: .iterate        = proc_##LSM##_attr_dir_iterate, \
>
> Hmm.  We need both SMACK and Apparmor to agree to this ... cc's added.

Smack passes all tests and seems perfectly content with the change.
I can't say that the tests stress this interface.

>
> fs/vboxsf/dir.c:        .iterate = vboxsf_dir_iterate,
>
> Also newly added.  Maintainer cc'd.
Linus Torvalds Aug. 16, 2022, 10:58 p.m. UTC | #9
On Tue, Aug 16, 2022 at 1:14 PM Jan Harkes <jaharkes@cs.cmu.edu> wrote:
>
> So good to know in advance a change like this is coming. I'll have to
> educate myself on this shared vs non-shared filldir.

Well, that change isn't necessarily "coming" - we've had this horrid
duality for years. "iterate_shared" goes back to 2016, and has been
marked "recommended" since then.

But the plain old "iterate" does continue to work, and having both is
only an ugly wart - I can't honestly claim that it's a big and
pressing problem.

But it would be *really* nice if filesystems did switch to it. For
most filesystems, it is probably trivial. The only real difference is
that the "iterate_shared" directory walker is called with the
directory inode->i_rwsem held for reading, rather than for writing.

End result: there can be multiple concurrent "readdir" iterators
running on the same directory at the same time. But there's still
exclusion wrt things like file creation, so a filesystem doesn't have
to worry about that.

Also, the concurrency is only between different 'struct file'
entities. The file position lock still serializes all getdents() calls
on any individual open directory 'struct file'.

End result: most filesystems probably can move to the 'iterate_shared'
version with no real issues.

Looking at coda in particular, I don't see any reason to not just
switch over to 'iterate_shared' with no code modifications at all.
Nothing in there seems to have any "I rely on the directory inode
being exclusively locked".
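
If that's all it takes, the switch would literally be (untested sketch):

	--- a/fs/coda/dir.c
	+++ b/fs/coda/dir.c
	-	.iterate	= coda_readdir,
	+	.iterate_shared	= coda_readdir,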

In fact, for coda in particular, it would be really nice if we could
just get rid of the old "iterate" function for the host filesystem
entirely. Which honestly I'd expect you could _also_ do, because
almost all serious local filesystems have long since been converted.

So coda looks like it could trivially move over. But I might be
missing something.

I suspect the same is true of most other filesystems, but it requires
people go check and go think about it.

               Linus
Linus Torvalds Aug. 16, 2022, 11:09 p.m. UTC | #10
On Tue, Aug 16, 2022 at 3:30 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>
> Smack passes all tests and seems perfectly content with the change.
> I can't say that the tests stress this interface.

All the security filesystems really seem to boil down to just calling
that 'proc_pident_readdir()' function with different sets of 'const
struct pid_entry' arrays.

And all that does is to make sure the pidents are filled in by that
proc_fill_cache(), which basically does a filename lookup.

And a filename lookup *already* has to be able to handle being called
in parallel, because that's how filename lookup works:

  [.. miss in dcache ..]
  lookup_slow ->
      inode_lock_shared(dir);
      __lookup_slow -> does the actual lookup
      inode_unlock_shared(dir);

so as long as the proc_fill_cache() handles the d_in_lookup()
situation correctly (where we serialize on one single _name_ in the
directory), that should all be good.

And proc_fill_cache() does indeed seem to handle it right - and if it
didn't, it would be fundamentally racy with regular lookups - so I
think all those security layer proc_##LSM##_attr_dir_iterate cases can
be moved over to iterate_shared with no code change.

But again, maybe there's something really subtle I'm overlooking. Or
maybe not something subtle at all, and I'm just missing a big honking
issue.

            Linus
Amir Goldstein Aug. 17, 2022, 7:25 a.m. UTC | #11
On Wed, Aug 17, 2022 at 2:02 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, Aug 16, 2022 at 1:14 PM Jan Harkes <jaharkes@cs.cmu.edu> wrote:
> >
> > So good to know in advance a change like this is coming. I'll have to
> > educate myself on this shared vs non-shared filldir.
>
> Well, that change isn't necessarily "coming" - we've had this horrid
> duality for years. "iterate_shared" goes back to 2016, and has been
> marked "recommended" since then.
>
> But the plain old "iterate" does continue to work, and having both is
> only an ugly wart - I can't honestly claim that it's a big and
> pressing problem.
>

All the "complexity" is really very tidy and neat inside iterate_dir()
I honestly don't see what the big fuss is about.

If filesystems do need to synchronize creates/deletes with readdir
they would need to add a new internal lock and take it for every
dir operation - it seems like a big hassle for little good reason.

Perhaps it would have been more elegant to replace the
iterate_shared/iterate duality with an fs_type flag.
And it would be very simple to make that change.

> But it would be *really* nice if filesystems did switch to it. For
> most filesystems, it is probably trivial. The only real difference is
> that the "iterate_shared" directory walker is called with the
> directory inode->i_rwsem held for reading, rather than for writing.
>
> End result: there can be multiple concurrent "readdir" iterators
> running on the same directory at the same time. But there's still
> exclusion wrt things like file creation, so a filesystem doesn't have
> to worry about that.
>
> Also, the concurrency is only between different 'struct file'
> entities. The file position lock still serializes all getdents() calls
> on any individual open directory 'struct file'.
>

Hmm, I wonder. Can getdents() proceed without f_pos_lock on a
single f_count, and then a cloned fd be created and another getdents()
race with the first one?

A very niche corner case, but perhaps we can close this hole
for filesystems that opt in with an fs_type flag, because... (see below)

> End result: most filesystems probably can move to the 'iterate_shared'
> version with no real issues.
>
> Looking at coda in particular, I don't see any reason to not just
> switch over to 'iterate_shared' with no code modifications at all.
> Nothing in there seems to have any "I rely on the directory inode
> being exclusively locked".
>
> In fact, for coda in particular, it would be really nice if we could
> just get rid of the old "iterate" function for the host filesystem
> entirely. Which honestly I'd expect you could _also_ do, because
> almost all serious local filesystems have long since been converted.
>
> So coda looks like it could trivially move over. But I might be
> missing something.
>
> I suspect the same is true of most other filesystems, but it requires
> people go check and go think about it.
>

I tried to look at ovl_iterate(). The subtlety is regarding the internal
readdir cache.

Compared to fuse_readdir(), which is already "iterate_shared"
and also implements internal readdir cache, fuse adds an internal
mutex (quote) "against (crazy) parallel readdir on same open file".

Considering this protection is already provided by f_pos_lock
99% of the time, perhaps an opt-in flag to waive the single
f_count optimization would be enough to make ovl_iterate()
shared as well?

Thanks,
Amir.
Linus Torvalds Aug. 17, 2022, 5:05 p.m. UTC | #12
On Wed, Aug 17, 2022 at 12:25 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> All the "complexity" is really very tidy and neat inside iterate_dir()
> I honestly don't see what the big fuss is about.

The worst part of it is conceptual, and the conditional locking.

No, it's not a huge maintenance burden, but it's quite the ugly wart
in our vfs layer that generally has pretty clean abstractions.

The non-shared version can also be a real performance problem on some
loads, but that's mainly a "real local filesystem" issue. Samba used
to (and presumably still does) do a *lot* of readdir calls for all the
name mangling.

We used to have similar horrors wrt ioctl's with the big kernel lock -
you can still see signs of that in the naming (ie the 'ioctl' function
pointer in the file operations structure is called 'unlocked_ioctl' -
even though it's been over a decade since we got rid of the 'locked'
one).

So that thing is just nasty.

> If filesystems do need to synchronize creates/deletes with readdir
> they would need to add a new internal lock and take it for every
> dir operation - it seems like a big hassle for little good reason.

Most of the time, they have that lock *anyway* - or, as I looked at
two cases, don't need any locking at all.

The pathname lookup parts of a filesystem need to be able to handle
concurrency within a directory anyway, because of how '->lookup()'
works.  Yes, our dentry layer helps a lot and serializes individual
name lookups, but basically no filesystem can rely on _just_ i_rwsem
anyway.

Of course, those locks are often about things like "read page X from
directory Y", so the locking can be about things like block allocation
etc. The bigger conceptual races - filename creation, lookup of the
same entry concurrently - are already handled by the VFS layer.

So those locks generally already exist. They just need to be checked
to make sure they cover readdir too.

Normally readdir iteration is very similar to lookup().

The *one* exception tends to be directory entry caches, although
sometimes those too are shared with lookup.

> Perhaps it would have been more elegant to replace the
> iterate_shared/iterate duality with an fs_type flag.
> And it would be very simple to make that change.

No.

You're missing the point. It's not the *flag* and "we have two
different iterate functions" that is the ugly part.

Making it a filesystem flag would not have made any difference - it
would still be the same horrendous wart at the vfs level, and it would
be in some sense worse because now it's split into two different
independent things that are tied together.

> Hmm, I wonder. Can getdents() proceed without f_pos_lock on a
> single f_count, and then a cloned fd be created and another getdents()
> race with the first one?

That would break much more important system calls than getdents (ie
basic read/write).

So if fdget_pos() were to be broken, we have much bigger issues.

In case you care, the thing that protects us against another thread
using the same file descriptor is

  fdget_pos ->
    __fdget_pos ->
      __fdget ->
        __fdget_light ->
          check current->files->count
          do full __fdget() if it's > 1

and then __fdget_pos() does that

                if (file_count(file) > 1) {
                        v |= FDPUT_POS_UNLOCK;
                        mutex_lock(&file->f_pos_lock);
                }

check afterwards.

IOW, this code is carefully written to avoid both the file pointer
reference increment *and* the f_pos_lock in the common case of a
single-threaded process that only has the file open once.

But if 'current->files->count' is elevated - iow some other thread has
access to the file array and might be open/close/dup'ing files, we
take the file reference.

And then if the file is open through multiple file descriptors (which
might be in another file table entirely), _or_ if we just took the
file reference due to that threading issue - then in either of those
cases we will also now take the file pos lock.

End result: either this system call is the only one that can access
that file, or we have taken the file pos lock.

Either way, getdents() ends up being serialized wrt that particular open file.

Now, you can actually screw this up if you really try: the file pos
lock obviously also is conditional on FMODE_ATOMIC_POS.

So if you have a filesystem that does "stream_open()" to clear
FMODE_ATOMIC_POS, that can avoid the file pos lock entirely.

So don't do that. I'm looking at fuse and FOPEN_STREAM, and that looks
a bit iffy.

(Of course, for a virtual filesystem, if you want to have a getdents()
that doesn't allow lseek, and that doesn't use f_pos for getdents but
just some internal other logic, clearing FMODE_ATOMIC_POS might
actually be entirely intentional. It's not like the file pos lock is
really _required_).
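
A minimal sketch of such an intentional opt-out (the myfs_* name is
made up, but stream_open() is the real helper):

	static int myfs_dir_open(struct inode *inode, struct file *file)
	{
		/* makes the file non-seekable and clears FMODE_ATOMIC_POS,
		 * so getdents() on it is no longer serialized by f_pos_lock */
		return stream_open(inode, file);
	}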

> I tried to look at ovl_iterate(). The subtlety is regarding the internal
> readdir cache.

Ok, from a quick look, you do have

        cache = ovl_dir_cache(d_inode(dentry));

so yeah, the cache is per-inode and thus different 'struct file'
entries for the same directory will need locking.

> Compared to fuse_readdir(), which is already "iterate_shared"
> and also implements internal readdir cache, fuse adds an internal
> mutex (quote) "against (crazy) parallel readdir on same open file".

Maybe the FUSE lock is *because* fuse allows that FOPEN_STREAM?

> Considering this protection is already provided by f_pos_lock
> 99% of the time,

It's not 99% of the time.

f_pos_lock is reliable. You have to explicitly turn it off. FUSE is
either confused and does pointless locking, or has done that
stream_open() thing to explicitly turn off the vfs locking.

Or, probably more realistically, the fuse code goes back to before we
did f_pos_lock.

                Linus
Amir Goldstein Aug. 17, 2022, 8:13 p.m. UTC | #13
> Normally readdir iteration is very similar to lookup().
>
> The *one* exception tends to be directory entry caches, although
> sometimes those too are shared with lookup.
>
[...]
>
> > Hmm, I wonder. Can getdents() proceed without f_pos_lock on a
> > single f_count, and then a cloned fd be created and another getdents()
> > race with the first one?
>
> That would break much more important system calls than getdents (ie
> basic read/write).
>
> So if fdget_pos() were to be broken, we have much bigger issues.
>
> In case you care, the thing that protects us against another thread
> using the same file descriptor is
>
>   fdget_pos ->
>     __fdget_pos ->
>       __fdget ->
>         __fdget_light ->
>           check current->files->count
>           do full __fdget() if it's > 1
>
> and then __fdget_pos() does that
>
>                 if (file_count(file) > 1) {
>                         v |= FDPUT_POS_UNLOCK;
>                         mutex_lock(&file->f_pos_lock);
>                 }
>
> check afterwards.
>
> IOW, this code is carefully written to avoid both the file pointer
> reference increment *and* the f_pos_lock in the common case of a
> single-threaded process that only has the file open once.
>
> But if 'current->files->count' is elevated - iow some other thread has
> access to the file array and might be open/close/dup'ing files, we
> take the file reference.
>
> And then if the file is open through multiple file descriptors (which
> might be in another file table entirely), _or_ if we just took the
> file reference due to that threading issue - then in either of those
> cases we will also now take the file pos lock.
>
> End result: either this system call is the only one that can access
> that file, or we have taken the file pos lock.
>
> Either way, getdents() ends up being serialized wrt that particular open file.
>

Yes, I see it now.

> Now, you can actually screw this up if you really try: the file pos
> lock is obviously also conditional on FMODE_ATOMIC_POS.
>
> So if you have a filesystem that does "stream_open()" to clear
> FMODE_ATOMIC_POS, that can avoid the file pos lock entirely.
>
> So don't do that. I'm looking at fuse and FOPEN_STREAM, and that looks
> a bit iffy.
>
> (Of course, for a virtual filesystem, if you want to have a getdents()
> that doesn't allow lseek, and that doesn't use f_pos for getdents but
> just some internal other logic, clearing FMODE_ATOMIC_POS might
> actually be entirely intentional. It's not like the file pos lock is
> really _required_).
>
> > I tried to look at ovl_iterate(). The subtlety is regarding the internal
> > readdir cache.
>
> Ok, from a quick look, you do have
>
>         cache = ovl_dir_cache(d_inode(dentry));
>
> so yeah, the cache is per-inode and thus different 'struct file'
> entries for the same directory will need locking.
>

The cache in ovl_cache_get() is a refcounted object, so I think
we can make grabbing an existing instance and instantiating a new one safe.
The cache in ovl_cache_get_impure() is more problematic.
I'll see what we can do about it.

> > Compared to fuse_readdir(), which is already "iterate_shared"
> > and also implements internal readdir cache, fuse adds an internal
> > mutex (quote) "against (crazy) parallel readdir on same open file".
>
> Maybe the FUSE lock is *because* fuse allows that FOPEN_STREAM?
>
> > Considering this protection is already provided by f_pos_lock
> > 99% of the time,
>
> It's not 99% of the time.
>
> f_pos_lock is reliable. You have to explicitly turn it off. FUSE is
> either confused and does pointless locking, or has done that
> stream_open() thing to explicitly turn off the vfs locking.
>
> Or, probably more realistically, the fuse code goes back to before we
> did f_pos_lock.

I'll leave that riddle for Miklos to solve.

Thanks,
Amir.
Matthew Wilcox Aug. 18, 2022, 2:35 p.m. UTC | #14
On Tue, Aug 16, 2022 at 08:11:31PM +0100, Matthew Wilcox wrote:
> fs/adfs/dir_f.c:        .iterate        = adfs_f_iterate,
> fs/adfs/dir_fplus.c:    .iterate        = adfs_fplus_iterate,
> 
> ADFS is read-only, so must be safe?

I just checked ADFS.  These aren't f_ops ->iterate instances; they are
entries in the special adfs_dir_ops table.  ADFS already uses
f_ops->iterate_shared.
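
The directory f_ops there look roughly like this (quoting from memory,
so treat the exact fields as an assumption):

	const struct file_operations adfs_dir_operations = {
		.read		= generic_read_dir,
		.llseek		= generic_file_llseek,
		.iterate_shared	= adfs_readdir,
		.fsync		= generic_file_fsync,
	};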
John Johansen Aug. 18, 2022, 4:14 p.m. UTC | #15
On 8/16/22 12:11, Matthew Wilcox wrote:
> On Tue, Aug 16, 2022 at 11:58:36AM -0700, Linus Torvalds wrote:
>> That said, our filldir code is still confusing as hell. And I would
>> really like to see that "shared vs non-shared" iterator thing go away,
>> with everybody using the shared one - and filesystems that can't deal
>> with it using their own lock.
>>
>> But that's a completely independent wart in our complicated filldir saga.
>>
>> But if somebody were to look at that iterate-vs-iterate_shared, that
>> would be lovely. A quick grep shows that we don't have *that* many of
>> the non-shared cases left:
>>
>>        git grep '\.iterate\>.*='
>>
>> seems to imply that converting them to a "use my own lock" wouldn't be
>> _too_ bad.
>>
>> And some of them might actually be perfectly ok with the shared
>> semantics (ie inode->i_rwsem held just for reading) and they just were
>> never converted originally.
> 
> What's depressing is that some of these are newly added.  It'd be
> great if we could attach something _like_ __deprecated to things
> that checkpatch could pick up on.
> 
> fs/adfs/dir_f.c:        .iterate        = adfs_f_iterate,
> fs/adfs/dir_fplus.c:    .iterate        = adfs_fplus_iterate,
> 
> ADFS is read-only, so must be safe?
> 
> fs/ceph/dir.c:  .iterate = ceph_readdir,
> fs/ceph/dir.c:  .iterate = ceph_readdir,
> 
> At least CEPH has active maintainers, cc'd
> 
> fs/coda/dir.c:  .iterate        = coda_readdir,
> 
> Would anyone notice if we broke CODA?  Maintainers cc'd anyway.
> 
> fs/exfat/dir.c: .iterate        = exfat_iterate,
> 
> Exfat is a new addition, but has active maintainers.
> 
> fs/jfs/namei.c: .iterate        = jfs_readdir,
> 
> Maintainer cc'd
> 
> fs/ntfs/dir.c:  .iterate        = ntfs_readdir,         /* Read directory contents. */
> 
> Maybe we can get rid of ntfs soon.
> 
> fs/ocfs2/file.c:        .iterate        = ocfs2_readdir,
> fs/ocfs2/file.c:        .iterate        = ocfs2_readdir,
> 
> maintainers cc'd
> 
> fs/orangefs/dir.c:      .iterate = orangefs_dir_iterate,
> 
> New; maintainer cc'd
> 
> fs/overlayfs/readdir.c: .iterate        = ovl_iterate,
> 
> Active maintainer, cc'd
> 
> fs/proc/base.c: .iterate        = proc_##LSM##_attr_dir_iterate, \
> 
> Hmm.  We need both SMACK and Apparmor to agree to this ... cc's added.

This is fine for AppArmor


> 
> fs/vboxsf/dir.c:        .iterate = vboxsf_dir_iterate,
> 
> Also newly added.  Maintainer cc'd.
>
Mike Marshall Sept. 21, 2022, 7:25 p.m. UTC | #16
>> At least CEPH has active maintainers...

Well that's just mean :-) ...

Anywho...

>> in many cases, the existing iterate() implementation works just fine as iterate_shared().
>> https://lwn.net/Articles/686943/

I changed the orangefs .iterate to .iterate_shared... ls still works,
find still works,
xfstests shows no regressions...
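
The change itself is just the fops hookup - roughly, assuming the
existing orangefs_dir_operations layout:

	const struct file_operations orangefs_dir_operations = {
		.llseek = orangefs_dir_llseek,
		.read = generic_read_dir,
		.iterate_shared = orangefs_dir_iterate,	/* was .iterate */
		.open = orangefs_dir_open,
		.release = orangefs_dir_release
	};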

I found and ran my getdents program from back when we were working to
go upstream...
it works the same with .iterate as it does with .iterate_shared...

-Mike


Patch

diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index aee9aaf9f3df..60df72f6dd37 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -922,3 +922,14 @@  is provided - file_open_root_mnt().  In-tree users adjusted.
 no_llseek is gone; don't set .llseek to that - just leave it NULL instead.
 Checks for "does that file have llseek(2), or should it fail with ESPIPE"
 should be done by looking at FMODE_LSEEK in file->f_mode.
+
+---
+
+*mandatory*
+
+filldir_t (readdir callbacks) calling conventions have changed.  Instead of
+returning 0 or -E... it returns bool now.  false means "no more" (as -E... used
+to) and true means "keep going" (as 0 did in the old calling conventions).
+Rationale: callers never looked at the specific -E... values anyway.
+->iterate() and ->iterate_shared() instances require no changes at all; all
+in-tree filldir_t instances have been converted.
diff --git a/arch/alpha/kernel/osf_sys.c b/arch/alpha/kernel/osf_sys.c
index d257293401e2..097d42cbd540 100644
--- a/arch/alpha/kernel/osf_sys.c
+++ b/arch/alpha/kernel/osf_sys.c
@@ -108,7 +108,7 @@  struct osf_dirent_callback {
 	int error;
 };
 
-static int
+static bool
 osf_filldir(struct dir_context *ctx, const char *name, int namlen,
 	    loff_t offset, u64 ino, unsigned int d_type)
 {
@@ -120,11 +120,11 @@  osf_filldir(struct dir_context *ctx, const char *name, int namlen,
 
 	buf->error = -EINVAL;	/* only used if we fail */
 	if (reclen > buf->count)
-		return -EINVAL;
+		return false;
 	d_ino = ino;
 	if (sizeof(d_ino) < sizeof(ino) && d_ino != ino) {
 		buf->error = -EOVERFLOW;
-		return -EOVERFLOW;
+		return false;
 	}
 	if (buf->basep) {
 		if (put_user(offset, buf->basep))
@@ -141,10 +141,10 @@  osf_filldir(struct dir_context *ctx, const char *name, int namlen,
 	dirent = (void __user *)dirent + reclen;
 	buf->dirent = dirent;
 	buf->count -= reclen;
-	return 0;
+	return true;
 Efault:
 	buf->error = -EFAULT;
-	return -EFAULT;
+	return false;
 }
 
 SYSCALL_DEFINE4(osf_getdirentries, unsigned int, fd,
diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index 56ae5cd5184f..230c2d19116d 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -24,9 +24,9 @@  static int afs_readdir(struct file *file, struct dir_context *ctx);
 static int afs_d_revalidate(struct dentry *dentry, unsigned int flags);
 static int afs_d_delete(const struct dentry *dentry);
 static void afs_d_iput(struct dentry *dentry, struct inode *inode);
-static int afs_lookup_one_filldir(struct dir_context *ctx, const char *name, int nlen,
+static bool afs_lookup_one_filldir(struct dir_context *ctx, const char *name, int nlen,
 				  loff_t fpos, u64 ino, unsigned dtype);
-static int afs_lookup_filldir(struct dir_context *ctx, const char *name, int nlen,
+static bool afs_lookup_filldir(struct dir_context *ctx, const char *name, int nlen,
 			      loff_t fpos, u64 ino, unsigned dtype);
 static int afs_create(struct user_namespace *mnt_userns, struct inode *dir,
 		      struct dentry *dentry, umode_t mode, bool excl);
@@ -568,7 +568,7 @@  static int afs_readdir(struct file *file, struct dir_context *ctx)
  * - if afs_dir_iterate_block() spots this function, it'll pass the FID
  *   uniquifier through dtype
  */
-static int afs_lookup_one_filldir(struct dir_context *ctx, const char *name,
+static bool afs_lookup_one_filldir(struct dir_context *ctx, const char *name,
 				  int nlen, loff_t fpos, u64 ino, unsigned dtype)
 {
 	struct afs_lookup_one_cookie *cookie =
@@ -584,16 +584,16 @@  static int afs_lookup_one_filldir(struct dir_context *ctx, const char *name,
 
 	if (cookie->name.len != nlen ||
 	    memcmp(cookie->name.name, name, nlen) != 0) {
-		_leave(" = 0 [no]");
-		return 0;
+		_leave(" = true [keep looking]");
+		return true;
 	}
 
 	cookie->fid.vnode = ino;
 	cookie->fid.unique = dtype;
 	cookie->found = 1;
 
-	_leave(" = -1 [found]");
-	return -1;
+	_leave(" = false [found]");
+	return false;
 }
 
 /*
@@ -636,12 +636,11 @@  static int afs_do_lookup_one(struct inode *dir, struct dentry *dentry,
  * - if afs_dir_iterate_block() spots this function, it'll pass the FID
  *   uniquifier through dtype
  */
-static int afs_lookup_filldir(struct dir_context *ctx, const char *name,
+static bool afs_lookup_filldir(struct dir_context *ctx, const char *name,
 			      int nlen, loff_t fpos, u64 ino, unsigned dtype)
 {
 	struct afs_lookup_cookie *cookie =
 		container_of(ctx, struct afs_lookup_cookie, ctx);
-	int ret;
 
 	_enter("{%s,%u},%s,%u,,%llu,%u",
 	       cookie->name.name, cookie->name.len, name, nlen,
@@ -663,12 +662,10 @@  static int afs_lookup_filldir(struct dir_context *ctx, const char *name,
 		cookie->fids[1].unique	= dtype;
 		cookie->found = 1;
 		if (cookie->one_only)
-			return -1;
+			return false;
 	}
 
-	ret = cookie->nr_fids >= 50 ? -1 : 0;
-	_leave(" = %d", ret);
-	return ret;
+	return cookie->nr_fids < 50;
 }
 
 /*
diff --git a/fs/ecryptfs/file.c b/fs/ecryptfs/file.c
index 18d5b91cb573..c29814a66c5b 100644
--- a/fs/ecryptfs/file.c
+++ b/fs/ecryptfs/file.c
@@ -53,7 +53,7 @@  struct ecryptfs_getdents_callback {
 };
 
 /* Inspired by generic filldir in fs/readdir.c */
-static int
+static bool
 ecryptfs_filldir(struct dir_context *ctx, const char *lower_name,
 		 int lower_namelen, loff_t offset, u64 ino, unsigned int d_type)
 {
@@ -61,18 +61,19 @@  ecryptfs_filldir(struct dir_context *ctx, const char *lower_name,
 		container_of(ctx, struct ecryptfs_getdents_callback, ctx);
 	size_t name_size;
 	char *name;
-	int rc;
+	int err;
+	bool res;
 
 	buf->filldir_called++;
-	rc = ecryptfs_decode_and_decrypt_filename(&name, &name_size,
-						  buf->sb, lower_name,
-						  lower_namelen);
-	if (rc) {
-		if (rc != -EINVAL) {
+	err = ecryptfs_decode_and_decrypt_filename(&name, &name_size,
+						   buf->sb, lower_name,
+						   lower_namelen);
+	if (err) {
+		if (err != -EINVAL) {
 			ecryptfs_printk(KERN_DEBUG,
 					"%s: Error attempting to decode and decrypt filename [%s]; rc = [%d]\n",
-					__func__, lower_name, rc);
-			return rc;
+					__func__, lower_name, err);
+			return false;
 		}
 
 		/* Mask -EINVAL errors as these are most likely due a plaintext
@@ -81,16 +82,15 @@  ecryptfs_filldir(struct dir_context *ctx, const char *lower_name,
 		 * the "lost+found" dentry in the root directory of an Ext4
 		 * filesystem.
 		 */
-		return 0;
+		return true;
 	}
 
 	buf->caller->pos = buf->ctx.pos;
-	rc = !dir_emit(buf->caller, name, name_size, ino, d_type);
+	res = dir_emit(buf->caller, name, name_size, ino, d_type);
 	kfree(name);
-	if (!rc)
+	if (res)
 		buf->entries_written++;
-
-	return rc;
+	return res;
 }
 
 /**
@@ -111,14 +111,8 @@  static int ecryptfs_readdir(struct file *file, struct dir_context *ctx)
 	lower_file = ecryptfs_file_to_lower(file);
 	rc = iterate_dir(lower_file, &buf.ctx);
 	ctx->pos = buf.ctx.pos;
-	if (rc < 0)
-		goto out;
-	if (buf.filldir_called && !buf.entries_written)
-		goto out;
-	if (rc >= 0)
-		fsstack_copy_attr_atime(inode,
-					file_inode(lower_file));
-out:
+	if (rc >= 0 && (buf.entries_written || !buf.filldir_called))
+		fsstack_copy_attr_atime(inode, file_inode(lower_file));
 	return rc;
 }
 
diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index 3ef80d000e13..c648a493faf2 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -248,21 +248,20 @@  struct getdents_callback {
  * A rather strange filldir function to capture
  * the name matching the specified inode number.
  */
-static int filldir_one(struct dir_context *ctx, const char *name, int len,
+static bool filldir_one(struct dir_context *ctx, const char *name, int len,
 			loff_t pos, u64 ino, unsigned int d_type)
 {
 	struct getdents_callback *buf =
 		container_of(ctx, struct getdents_callback, ctx);
-	int result = 0;
 
 	buf->sequence++;
 	if (buf->ino == ino && len <= NAME_MAX) {
 		memcpy(buf->name, name, len);
 		buf->name[len] = '\0';
 		buf->found = 1;
-		result = -1;
+		return false;	// no more
 	}
-	return result;
+	return true;
 }
 
 /**
diff --git a/fs/fat/dir.c b/fs/fat/dir.c
index 249825017da7..00235b8a1823 100644
--- a/fs/fat/dir.c
+++ b/fs/fat/dir.c
@@ -705,7 +705,7 @@  static int fat_readdir(struct file *file, struct dir_context *ctx)
 }
 
 #define FAT_IOCTL_FILLDIR_FUNC(func, dirent_type)			   \
-static int func(struct dir_context *ctx, const char *name, int name_len,   \
+static bool func(struct dir_context *ctx, const char *name, int name_len,  \
 			     loff_t offset, u64 ino, unsigned int d_type)  \
 {									   \
 	struct fat_ioctl_filldir_callback *buf =			   \
@@ -714,7 +714,7 @@  static int func(struct dir_context *ctx, const char *name, int name_len,   \
 	struct dirent_type __user *d2 = d1 + 1;				   \
 									   \
 	if (buf->result)						   \
-		return -EINVAL;						   \
+		return false;						   \
 	buf->result++;							   \
 									   \
 	if (name != NULL) {						   \
@@ -750,10 +750,10 @@  static int func(struct dir_context *ctx, const char *name, int name_len,   \
 		    put_user(short_len, &d1->d_reclen))			   \
 			goto efault;					   \
 	}								   \
-	return 0;							   \
+	return true;							   \
 efault:									   \
 	buf->result = -EFAULT;						   \
-	return -EFAULT;							   \
+	return false;							   \
 }
 
 FAT_IOCTL_FILLDIR_FUNC(fat_ioctl_filldir, __fat_dirent)
diff --git a/fs/gfs2/export.c b/fs/gfs2/export.c
index 756d05779200..cf40895233f5 100644
--- a/fs/gfs2/export.c
+++ b/fs/gfs2/export.c
@@ -66,7 +66,7 @@  struct get_name_filldir {
 	char *name;
 };
 
-static int get_name_filldir(struct dir_context *ctx, const char *name,
+static bool get_name_filldir(struct dir_context *ctx, const char *name,
 			    int length, loff_t offset, u64 inum,
 			    unsigned int type)
 {
@@ -74,12 +74,12 @@  static int get_name_filldir(struct dir_context *ctx, const char *name,
 		container_of(ctx, struct get_name_filldir, ctx);
 
 	if (inum != gnfd->inum.no_addr)
-		return 0;
+		return true;
 
 	memcpy(gnfd->name, name, length);
 	gnfd->name[length] = 0;
 
-	return 1;
+	return false;
 }
 
 static int gfs2_get_name(struct dentry *parent, char *name,
diff --git a/fs/ksmbd/smb2pdu.c b/fs/ksmbd/smb2pdu.c
index 9751cc92c111..6785a9cc9ee1 100644
--- a/fs/ksmbd/smb2pdu.c
+++ b/fs/ksmbd/smb2pdu.c
@@ -3779,7 +3779,7 @@  static int reserve_populate_dentry(struct ksmbd_dir_info *d_info,
 	return 0;
 }
 
-static int __query_dir(struct dir_context *ctx, const char *name, int namlen,
+static bool __query_dir(struct dir_context *ctx, const char *name, int namlen,
 		       loff_t offset, u64 ino, unsigned int d_type)
 {
 	struct ksmbd_readdir_data	*buf;
@@ -3793,22 +3793,20 @@  static int __query_dir(struct dir_context *ctx, const char *name, int namlen,
 
 	/* dot and dotdot entries are already reserved */
 	if (!strcmp(".", name) || !strcmp("..", name))
-		return 0;
+		return true;
 	if (ksmbd_share_veto_filename(priv->work->tcon->share_conf, name))
-		return 0;
+		return true;
 	if (!match_pattern(name, namlen, priv->search_pattern))
-		return 0;
+		return true;
 
 	d_info->name		= name;
 	d_info->name_len	= namlen;
 	rc = reserve_populate_dentry(d_info, priv->info_level);
 	if (rc)
-		return rc;
-	if (d_info->flags & SMB2_RETURN_SINGLE_ENTRY) {
+		return false;
+	if (d_info->flags & SMB2_RETURN_SINGLE_ENTRY)
 		d_info->out_buf_len = 0;
-		return 0;
-	}
-	return 0;
+	return true;
 }
 
 static void restart_ctx(struct dir_context *ctx)
diff --git a/fs/ksmbd/vfs.c b/fs/ksmbd/vfs.c
index 78d01033604c..48b2b901f6e5 100644
--- a/fs/ksmbd/vfs.c
+++ b/fs/ksmbd/vfs.c
@@ -1105,7 +1105,7 @@  int ksmbd_vfs_unlink(struct user_namespace *user_ns,
 	return err;
 }
 
-static int __dir_empty(struct dir_context *ctx, const char *name, int namlen,
+static bool __dir_empty(struct dir_context *ctx, const char *name, int namlen,
 		       loff_t offset, u64 ino, unsigned int d_type)
 {
 	struct ksmbd_readdir_data *buf;
@@ -1113,9 +1113,7 @@  static int __dir_empty(struct dir_context *ctx, const char *name, int namlen,
 	buf = container_of(ctx, struct ksmbd_readdir_data, ctx);
 	buf->dirent_count++;
 
-	if (buf->dirent_count > 2)
-		return -ENOTEMPTY;
-	return 0;
+	return buf->dirent_count <= 2;
 }
 
 /**
@@ -1142,7 +1140,7 @@  int ksmbd_vfs_empty_dir(struct ksmbd_file *fp)
 	return err;
 }
 
-static int __caseless_lookup(struct dir_context *ctx, const char *name,
+static bool __caseless_lookup(struct dir_context *ctx, const char *name,
 			     int namlen, loff_t offset, u64 ino,
 			     unsigned int d_type)
 {
@@ -1151,13 +1149,13 @@  static int __caseless_lookup(struct dir_context *ctx, const char *name,
 	buf = container_of(ctx, struct ksmbd_readdir_data, ctx);
 
 	if (buf->used != namlen)
-		return 0;
+		return true;
 	if (!strncasecmp((char *)buf->private, name, namlen)) {
 		memcpy((char *)buf->private, name, namlen);
 		buf->dirent_count = 1;
-		return -EEXIST;
+		return false;
 	}
-	return 0;
+	return true;
 }
 
 /**
diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
index c634483d85d2..b29d27eaa8a6 100644
--- a/fs/nfsd/nfs4recover.c
+++ b/fs/nfsd/nfs4recover.c
@@ -266,7 +266,7 @@  struct nfs4_dir_ctx {
 	struct list_head names;
 };
 
-static int
+static bool
 nfsd4_build_namelist(struct dir_context *__ctx, const char *name, int namlen,
 		loff_t offset, u64 ino, unsigned int d_type)
 {
@@ -275,14 +275,14 @@  nfsd4_build_namelist(struct dir_context *__ctx, const char *name, int namlen,
 	struct name_list *entry;
 
 	if (namlen != HEXDIR_LEN - 1)
-		return 0;
+		return true;
 	entry = kmalloc(sizeof(struct name_list), GFP_KERNEL);
 	if (entry == NULL)
-		return -ENOMEM;
+		return false;
 	memcpy(entry->name, name, HEXDIR_LEN - 1);
 	entry->name[HEXDIR_LEN - 1] = '\0';
 	list_add(&entry->list, &ctx->names);
-	return 0;
+	return true;
 }
 
 static int
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 9f486b788ed0..4b0015706e98 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1811,7 +1811,7 @@  struct readdir_data {
 	int		full;
 };
 
-static int nfsd_buffered_filldir(struct dir_context *ctx, const char *name,
+static bool nfsd_buffered_filldir(struct dir_context *ctx, const char *name,
 				 int namlen, loff_t offset, u64 ino,
 				 unsigned int d_type)
 {
@@ -1823,7 +1823,7 @@  static int nfsd_buffered_filldir(struct dir_context *ctx, const char *name,
 	reclen = ALIGN(sizeof(struct buffered_dirent) + namlen, sizeof(u64));
 	if (buf->used + reclen > PAGE_SIZE) {
 		buf->full = 1;
-		return -EINVAL;
+		return false;
 	}
 
 	de->namlen = namlen;
@@ -1833,7 +1833,7 @@  static int nfsd_buffered_filldir(struct dir_context *ctx, const char *name,
 	memcpy(de->name, name, namlen);
 	buf->used += reclen;
 
-	return 0;
+	return true;
 }
 
 static __be32 nfsd_buffered_readdir(struct file *file, struct svc_fh *fhp,
diff --git a/fs/ocfs2/dir.c b/fs/ocfs2/dir.c
index 81c3d65d68fe..694471fc46b8 100644
--- a/fs/ocfs2/dir.c
+++ b/fs/ocfs2/dir.c
@@ -2032,7 +2032,7 @@  struct ocfs2_empty_dir_priv {
 	unsigned seen_other;
 	unsigned dx_dir;
 };
-static int ocfs2_empty_dir_filldir(struct dir_context *ctx, const char *name,
+static bool ocfs2_empty_dir_filldir(struct dir_context *ctx, const char *name,
 				   int name_len, loff_t pos, u64 ino,
 				   unsigned type)
 {
@@ -2052,7 +2052,7 @@  static int ocfs2_empty_dir_filldir(struct dir_context *ctx, const char *name,
 	 */
 	if (name_len == 1 && !strncmp(".", name, 1) && pos == 0) {
 		p->seen_dot = 1;
-		return 0;
+		return true;
 	}
 
 	if (name_len == 2 && !strncmp("..", name, 2) &&
@@ -2060,13 +2060,13 @@  static int ocfs2_empty_dir_filldir(struct dir_context *ctx, const char *name,
 		p->seen_dot_dot = 1;
 
 		if (p->dx_dir && p->seen_dot)
-			return 1;
+			return false;
 
-		return 0;
+		return true;
 	}
 
 	p->seen_other = 1;
-	return 1;
+	return false;
 }
 
 static int ocfs2_empty_dir_dx(struct inode *inode,
diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index fa87d89cf754..126671e6caed 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -2057,7 +2057,7 @@  struct ocfs2_orphan_filldir_priv {
 	enum ocfs2_orphan_reco_type orphan_reco_type;
 };
 
-static int ocfs2_orphan_filldir(struct dir_context *ctx, const char *name,
+static bool ocfs2_orphan_filldir(struct dir_context *ctx, const char *name,
 				int name_len, loff_t pos, u64 ino,
 				unsigned type)
 {
@@ -2066,21 +2066,21 @@  static int ocfs2_orphan_filldir(struct dir_context *ctx, const char *name,
 	struct inode *iter;
 
 	if (name_len == 1 && !strncmp(".", name, 1))
-		return 0;
+		return true;
 	if (name_len == 2 && !strncmp("..", name, 2))
-		return 0;
+		return true;
 
 	/* do not include dio entry in case of orphan scan */
 	if ((p->orphan_reco_type == ORPHAN_NO_NEED_TRUNCATE) &&
 			(!strncmp(name, OCFS2_DIO_ORPHAN_PREFIX,
 			OCFS2_DIO_ORPHAN_PREFIX_LEN)))
-		return 0;
+		return true;
 
 	/* Skip bad inodes so that recovery can continue */
 	iter = ocfs2_iget(p->osb, ino,
 			  OCFS2_FI_FLAG_ORPHAN_RECOVERY, 0);
 	if (IS_ERR(iter))
-		return 0;
+		return true;
 
 	if (!strncmp(name, OCFS2_DIO_ORPHAN_PREFIX,
 			OCFS2_DIO_ORPHAN_PREFIX_LEN))
@@ -2090,7 +2090,7 @@  static int ocfs2_orphan_filldir(struct dir_context *ctx, const char *name,
 	 * happen concurrently with unlink/rename */
 	if (OCFS2_I(iter)->ip_next_orphan) {
 		iput(iter);
-		return 0;
+		return true;
 	}
 
 	trace_ocfs2_orphan_filldir((unsigned long long)OCFS2_I(iter)->ip_blkno);
@@ -2099,7 +2099,7 @@  static int ocfs2_orphan_filldir(struct dir_context *ctx, const char *name,
 	OCFS2_I(iter)->ip_next_orphan = p->head;
 	p->head = iter;
 
-	return 0;
+	return true;
 }
 
 static int ocfs2_queue_orphans(struct ocfs2_super *osb,
diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
index 78f62cc1797b..8c25d185cdc0 100644
--- a/fs/overlayfs/readdir.c
+++ b/fs/overlayfs/readdir.c
@@ -170,7 +170,7 @@  static struct ovl_cache_entry *ovl_cache_entry_new(struct ovl_readdir_data *rdd,
 	return p;
 }
 
-static int ovl_cache_entry_add_rb(struct ovl_readdir_data *rdd,
+static bool ovl_cache_entry_add_rb(struct ovl_readdir_data *rdd,
 				  const char *name, int len, u64 ino,
 				  unsigned int d_type)
 {
@@ -179,22 +179,22 @@  static int ovl_cache_entry_add_rb(struct ovl_readdir_data *rdd,
 	struct ovl_cache_entry *p;
 
 	if (ovl_cache_entry_find_link(name, len, &newp, &parent))
-		return 0;
+		return true;
 
 	p = ovl_cache_entry_new(rdd, name, len, ino, d_type);
 	if (p == NULL) {
 		rdd->err = -ENOMEM;
-		return -ENOMEM;
+		return false;
 	}
 
 	list_add_tail(&p->l_node, rdd->list);
 	rb_link_node(&p->node, parent, newp);
 	rb_insert_color(&p->node, rdd->root);
 
-	return 0;
+	return true;
 }
 
-static int ovl_fill_lowest(struct ovl_readdir_data *rdd,
+static bool ovl_fill_lowest(struct ovl_readdir_data *rdd,
 			   const char *name, int namelen,
 			   loff_t offset, u64 ino, unsigned int d_type)
 {
@@ -211,7 +211,7 @@  static int ovl_fill_lowest(struct ovl_readdir_data *rdd,
 			list_add_tail(&p->l_node, &rdd->middle);
 	}
 
-	return rdd->err;
+	return rdd->err == 0;
 }
 
 void ovl_cache_free(struct list_head *list)
@@ -250,7 +250,7 @@  static void ovl_cache_put(struct ovl_dir_file *od, struct dentry *dentry)
 	}
 }
 
-static int ovl_fill_merge(struct dir_context *ctx, const char *name,
+static bool ovl_fill_merge(struct dir_context *ctx, const char *name,
 			  int namelen, loff_t offset, u64 ino,
 			  unsigned int d_type)
 {
@@ -528,7 +528,7 @@  static int ovl_cache_update_ino(struct path *path, struct ovl_cache_entry *p)
 	goto out;
 }
 
-static int ovl_fill_plain(struct dir_context *ctx, const char *name,
+static bool ovl_fill_plain(struct dir_context *ctx, const char *name,
 			  int namelen, loff_t offset, u64 ino,
 			  unsigned int d_type)
 {
@@ -540,11 +540,11 @@  static int ovl_fill_plain(struct dir_context *ctx, const char *name,
 	p = ovl_cache_entry_new(rdd, name, namelen, ino, d_type);
 	if (p == NULL) {
 		rdd->err = -ENOMEM;
-		return -ENOMEM;
+		return false;
 	}
 	list_add_tail(&p->l_node, rdd->list);
 
-	return 0;
+	return true;
 }
 
 static int ovl_dir_read_impure(struct path *path,  struct list_head *list,
@@ -648,7 +648,7 @@  struct ovl_readdir_translate {
 	bool xinowarn;
 };
 
-static int ovl_fill_real(struct dir_context *ctx, const char *name,
+static bool ovl_fill_real(struct dir_context *ctx, const char *name,
 			   int namelen, loff_t offset, u64 ino,
 			   unsigned int d_type)
 {
@@ -1027,7 +1027,7 @@  void ovl_cleanup_whiteouts(struct ovl_fs *ofs, struct dentry *upper,
 	inode_unlock(upper->d_inode);
 }
 
-static int ovl_check_d_type(struct dir_context *ctx, const char *name,
+static bool ovl_check_d_type(struct dir_context *ctx, const char *name,
 			  int namelen, loff_t offset, u64 ino,
 			  unsigned int d_type)
 {
@@ -1036,12 +1036,12 @@  static int ovl_check_d_type(struct dir_context *ctx, const char *name,
 
 	/* Even if d_type is not supported, DT_DIR is returned for . and .. */
 	if (!strncmp(name, ".", namelen) || !strncmp(name, "..", namelen))
-		return 0;
+		return true;
 
 	if (d_type != DT_UNKNOWN)
 		rdd->d_type_supported = true;
 
-	return 0;
+	return true;
 }
 
 /*
diff --git a/fs/readdir.c b/fs/readdir.c
index 09e8ed7d4161..9c53edb60c03 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -140,7 +140,7 @@  struct readdir_callback {
 	int result;
 };
 
-static int fillonedir(struct dir_context *ctx, const char *name, int namlen,
+static bool fillonedir(struct dir_context *ctx, const char *name, int namlen,
 		      loff_t offset, u64 ino, unsigned int d_type)
 {
 	struct readdir_callback *buf =
@@ -149,14 +149,14 @@  static int fillonedir(struct dir_context *ctx, const char *name, int namlen,
 	unsigned long d_ino;
 
 	if (buf->result)
-		return -EINVAL;
+		return false;
 	buf->result = verify_dirent_name(name, namlen);
-	if (buf->result < 0)
-		return buf->result;
+	if (buf->result)
+		return false;
 	d_ino = ino;
 	if (sizeof(d_ino) < sizeof(ino) && d_ino != ino) {
 		buf->result = -EOVERFLOW;
-		return -EOVERFLOW;
+		return false;
 	}
 	buf->result++;
 	dirent = buf->dirent;
@@ -169,12 +169,12 @@  static int fillonedir(struct dir_context *ctx, const char *name, int namlen,
 	unsafe_put_user(namlen, &dirent->d_namlen, efault_end);
 	unsafe_copy_dirent_name(dirent->d_name, name, namlen, efault_end);
 	user_write_access_end();
-	return 0;
+	return true;
 efault_end:
 	user_write_access_end();
 efault:
 	buf->result = -EFAULT;
-	return -EFAULT;
+	return false;
 }
 
 SYSCALL_DEFINE3(old_readdir, unsigned int, fd,
@@ -219,7 +219,7 @@  struct getdents_callback {
 	int error;
 };
 
-static int filldir(struct dir_context *ctx, const char *name, int namlen,
+static bool filldir(struct dir_context *ctx, const char *name, int namlen,
 		   loff_t offset, u64 ino, unsigned int d_type)
 {
 	struct linux_dirent __user *dirent, *prev;
@@ -232,18 +232,18 @@  static int filldir(struct dir_context *ctx, const char *name, int namlen,
 
 	buf->error = verify_dirent_name(name, namlen);
 	if (unlikely(buf->error))
-		return buf->error;
+		return false;
 	buf->error = -EINVAL;	/* only used if we fail.. */
 	if (reclen > buf->count)
-		return -EINVAL;
+		return false;
 	d_ino = ino;
 	if (sizeof(d_ino) < sizeof(ino) && d_ino != ino) {
 		buf->error = -EOVERFLOW;
-		return -EOVERFLOW;
+		return false;
 	}
 	prev_reclen = buf->prev_reclen;
 	if (prev_reclen && signal_pending(current))
-		return -EINTR;
+		return false;
 	dirent = buf->current_dir;
 	prev = (void __user *) dirent - prev_reclen;
 	if (!user_write_access_begin(prev, reclen + prev_reclen))
@@ -260,12 +260,12 @@  static int filldir(struct dir_context *ctx, const char *name, int namlen,
 	buf->current_dir = (void __user *)dirent + reclen;
 	buf->prev_reclen = reclen;
 	buf->count -= reclen;
-	return 0;
+	return true;
 efault_end:
 	user_write_access_end();
 efault:
 	buf->error = -EFAULT;
-	return -EFAULT;
+	return false;
 }
 
 SYSCALL_DEFINE3(getdents, unsigned int, fd,
@@ -307,7 +307,7 @@  struct getdents_callback64 {
 	int error;
 };
 
-static int filldir64(struct dir_context *ctx, const char *name, int namlen,
+static bool filldir64(struct dir_context *ctx, const char *name, int namlen,
 		     loff_t offset, u64 ino, unsigned int d_type)
 {
 	struct linux_dirent64 __user *dirent, *prev;
@@ -319,13 +319,13 @@  static int filldir64(struct dir_context *ctx, const char *name, int namlen,
 
 	buf->error = verify_dirent_name(name, namlen);
 	if (unlikely(buf->error))
-		return buf->error;
+		return false;
 	buf->error = -EINVAL;	/* only used if we fail.. */
 	if (reclen > buf->count)
-		return -EINVAL;
+		return false;
 	prev_reclen = buf->prev_reclen;
 	if (prev_reclen && signal_pending(current))
-		return -EINTR;
+		return false;
 	dirent = buf->current_dir;
 	prev = (void __user *)dirent - prev_reclen;
 	if (!user_write_access_begin(prev, reclen + prev_reclen))
@@ -342,13 +342,13 @@  static int filldir64(struct dir_context *ctx, const char *name, int namlen,
 	buf->prev_reclen = reclen;
 	buf->current_dir = (void __user *)dirent + reclen;
 	buf->count -= reclen;
-	return 0;
+	return true;
 
 efault_end:
 	user_write_access_end();
 efault:
 	buf->error = -EFAULT;
-	return -EFAULT;
+	return false;
 }
 
 SYSCALL_DEFINE3(getdents64, unsigned int, fd,
@@ -397,7 +397,7 @@  struct compat_readdir_callback {
 	int result;
 };
 
-static int compat_fillonedir(struct dir_context *ctx, const char *name,
+static bool compat_fillonedir(struct dir_context *ctx, const char *name,
 			     int namlen, loff_t offset, u64 ino,
 			     unsigned int d_type)
 {
@@ -407,14 +407,14 @@  static int compat_fillonedir(struct dir_context *ctx, const char *name,
 	compat_ulong_t d_ino;
 
 	if (buf->result)
-		return -EINVAL;
+		return false;
 	buf->result = verify_dirent_name(name, namlen);
-	if (buf->result < 0)
-		return buf->result;
+	if (buf->result)
+		return false;
 	d_ino = ino;
 	if (sizeof(d_ino) < sizeof(ino) && d_ino != ino) {
 		buf->result = -EOVERFLOW;
-		return -EOVERFLOW;
+		return false;
 	}
 	buf->result++;
 	dirent = buf->dirent;
@@ -427,12 +427,12 @@  static int compat_fillonedir(struct dir_context *ctx, const char *name,
 	unsafe_put_user(namlen, &dirent->d_namlen, efault_end);
 	unsafe_copy_dirent_name(dirent->d_name, name, namlen, efault_end);
 	user_write_access_end();
-	return 0;
+	return true;
 efault_end:
 	user_write_access_end();
 efault:
 	buf->result = -EFAULT;
-	return -EFAULT;
+	return false;
 }
 
 COMPAT_SYSCALL_DEFINE3(old_readdir, unsigned int, fd,
@@ -471,7 +471,7 @@  struct compat_getdents_callback {
 	int error;
 };
 
-static int compat_filldir(struct dir_context *ctx, const char *name, int namlen,
+static bool compat_filldir(struct dir_context *ctx, const char *name, int namlen,
 		loff_t offset, u64 ino, unsigned int d_type)
 {
 	struct compat_linux_dirent __user *dirent, *prev;
@@ -484,18 +484,18 @@  static int compat_filldir(struct dir_context *ctx, const char *name, int namlen,
 
 	buf->error = verify_dirent_name(name, namlen);
 	if (unlikely(buf->error))
-		return buf->error;
+		return false;
 	buf->error = -EINVAL;	/* only used if we fail.. */
 	if (reclen > buf->count)
-		return -EINVAL;
+		return false;
 	d_ino = ino;
 	if (sizeof(d_ino) < sizeof(ino) && d_ino != ino) {
 		buf->error = -EOVERFLOW;
-		return -EOVERFLOW;
+		return false;
 	}
 	prev_reclen = buf->prev_reclen;
 	if (prev_reclen && signal_pending(current))
-		return -EINTR;
+		return false;
 	dirent = buf->current_dir;
 	prev = (void __user *) dirent - prev_reclen;
 	if (!user_write_access_begin(prev, reclen + prev_reclen))
@@ -511,12 +511,12 @@  static int compat_filldir(struct dir_context *ctx, const char *name, int namlen,
 	buf->prev_reclen = reclen;
 	buf->current_dir = (void __user *)dirent + reclen;
 	buf->count -= reclen;
-	return 0;
+	return true;
 efault_end:
 	user_write_access_end();
 efault:
 	buf->error = -EFAULT;
-	return -EFAULT;
+	return false;
 }
 
 COMPAT_SYSCALL_DEFINE3(getdents, unsigned int, fd,
diff --git a/fs/reiserfs/xattr.c b/fs/reiserfs/xattr.c
index 436641369283..8b2d52443f41 100644
--- a/fs/reiserfs/xattr.c
+++ b/fs/reiserfs/xattr.c
@@ -189,7 +189,7 @@  struct reiserfs_dentry_buf {
 	struct dentry *dentries[8];
 };
 
-static int
+static bool
 fill_with_dentries(struct dir_context *ctx, const char *name, int namelen,
 		   loff_t offset, u64 ino, unsigned int d_type)
 {
@@ -200,16 +200,16 @@  fill_with_dentries(struct dir_context *ctx, const char *name, int namelen,
 	WARN_ON_ONCE(!inode_is_locked(d_inode(dbuf->xadir)));
 
 	if (dbuf->count == ARRAY_SIZE(dbuf->dentries))
-		return -ENOSPC;
+		return false;
 
 	if (name[0] == '.' && (namelen < 2 ||
 			       (namelen == 2 && name[1] == '.')))
-		return 0;
+		return true;
 
 	dentry = lookup_one_len(name, dbuf->xadir, namelen);
 	if (IS_ERR(dentry)) {
 		dbuf->err = PTR_ERR(dentry);
-		return PTR_ERR(dentry);
+		return false;
 	} else if (d_really_is_negative(dentry)) {
 		/* A directory entry exists, but no file? */
 		reiserfs_error(dentry->d_sb, "xattr-20003",
@@ -218,11 +218,11 @@  fill_with_dentries(struct dir_context *ctx, const char *name, int namelen,
 			       dentry, dbuf->xadir);
 		dput(dentry);
 		dbuf->err = -EIO;
-		return -EIO;
+		return false;
 	}
 
 	dbuf->dentries[dbuf->count++] = dentry;
-	return 0;
+	return true;
 }
 
 static void
@@ -797,7 +797,7 @@  struct listxattr_buf {
 	struct dentry *dentry;
 };
 
-static int listxattr_filler(struct dir_context *ctx, const char *name,
+static bool listxattr_filler(struct dir_context *ctx, const char *name,
 			    int namelen, loff_t offset, u64 ino,
 			    unsigned int d_type)
 {
@@ -813,19 +813,19 @@  static int listxattr_filler(struct dir_context *ctx, const char *name,
 						    name);
 		if (!handler /* Unsupported xattr name */ ||
 		    (handler->list && !handler->list(b->dentry)))
-			return 0;
+			return true;
 		size = namelen + 1;
 		if (b->buf) {
 			if (b->pos + size > b->size) {
 				b->pos = -ERANGE;
-				return -ERANGE;
+				return false;
 			}
 			memcpy(b->buf + b->pos, name, namelen);
 			b->buf[b->pos + namelen] = 0;
 		}
 		b->pos += size;
 	}
-	return 0;
+	return true;
 }
 
 /*
diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c
index 5abb5fdb71d9..b594f02a52c4 100644
--- a/fs/xfs/scrub/dir.c
+++ b/fs/xfs/scrub/dir.c
@@ -99,7 +99,7 @@  xchk_dir_check_ftype(
  * we check the inode number to make sure it's sane, then we check that
  * we can look up this filename.  Finally, we check the ftype.
  */
-STATIC int
+STATIC bool
 xchk_dir_actor(
 	struct dir_context	*dir_iter,
 	const char		*name,
@@ -124,7 +124,7 @@  xchk_dir_actor(
 			xfs_dir2_dataptr_to_db(mp->m_dir_geo, pos));
 
 	if (xchk_should_terminate(sdc->sc, &error))
-		return error;
+		return !error;
 
 	/* Does this inode number make sense? */
 	if (!xfs_verify_dir_ino(mp, ino)) {
@@ -191,8 +191,8 @@  xchk_dir_actor(
 	 * and return zero to xchk_directory.
 	 */
 	if (error == 0 && sdc->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
-		return -EFSCORRUPTED;
-	return error;
+		return false;
+	return !error;
 }
 
 /* Scrub a directory btree record. */
diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c
index ab182a5cd0c0..d8dff3fd8053 100644
--- a/fs/xfs/scrub/parent.c
+++ b/fs/xfs/scrub/parent.c
@@ -38,7 +38,7 @@  struct xchk_parent_ctx {
 };
 
 /* Look for a single entry in a directory pointing to an inode. */
-STATIC int
+STATIC bool
 xchk_parent_actor(
 	struct dir_context	*dc,
 	const char		*name,
@@ -62,7 +62,7 @@  xchk_parent_actor(
 	if (xchk_should_terminate(spc->sc, &error))
 		spc->cancelled = true;
 
-	return error;
+	return !error;
 }
 
 /* Count the number of dentries in the parent dir that point to this inode. */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9eced4cc286e..8b8c0c11afec 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2040,7 +2040,7 @@  umode_t mode_strip_sgid(struct user_namespace *mnt_userns,
  * to have different dirent layouts depending on the binary type.
  */
 struct dir_context;
-typedef int (*filldir_t)(struct dir_context *, const char *, int, loff_t, u64,
+typedef bool (*filldir_t)(struct dir_context *, const char *, int, loff_t, u64,
 			 unsigned);
 
 struct dir_context {
@@ -3540,17 +3540,17 @@  static inline bool dir_emit(struct dir_context *ctx,
 			    const char *name, int namelen,
 			    u64 ino, unsigned type)
 {
-	return ctx->actor(ctx, name, namelen, ctx->pos, ino, type) == 0;
+	return ctx->actor(ctx, name, namelen, ctx->pos, ino, type);
 }
 static inline bool dir_emit_dot(struct file *file, struct dir_context *ctx)
 {
 	return ctx->actor(ctx, ".", 1, ctx->pos,
-			  file->f_path.dentry->d_inode->i_ino, DT_DIR) == 0;
+			  file->f_path.dentry->d_inode->i_ino, DT_DIR);
 }
 static inline bool dir_emit_dotdot(struct file *file, struct dir_context *ctx)
 {
 	return ctx->actor(ctx, "..", 2, ctx->pos,
-			  parent_ino(file->f_path.dentry), DT_DIR) == 0;
+			  parent_ino(file->f_path.dentry), DT_DIR);
 }
 static inline bool dir_emit_dots(struct file *file, struct dir_context *ctx)
 {