diff mbox series

[v5,3/4] fs: strip file's S_ISGID mode on vfs instead of on underlying filesystem

Message ID 1650527658-2218-3-git-send-email-xuyang2018.jy@fujitsu.com (mailing list archive)
State New, archived
Headers show
Series [v5,1/4] fs: move sgid strip operation from inode_init_owner into inode_sgid_strip | expand

Commit Message

Yang Xu (Fujitsu) April 21, 2022, 7:54 a.m. UTC
Currently, vfs only passes mode argument to filesystem, then use inode_init_owner()
to strip S_ISGID. Some filesystem(ie ext4/btrfs) will call inode_init_owner
firstly, then posxi acl setup, but xfs uses the contrary order. It will
affect S_ISGID clear especially we filter S_IXGRP by umask or acl.

Regardless of which filesystem is in use, failure to strip the SGID correctly
is considered a security failure that needs to be fixed. The current VFS
infrastructure requires the filesystem to do everything right and not step on
any landmines to strip the SGID bit, when in fact it can easily be done at the
VFS and the filesystems then don't even need to be aware that the SGID needs
to be (or has been stripped) by the operation the user asked to be done.

Vfs has all the info it needs - it doesn't need the filesystems to do everything
correctly with the mode and ensuring that they order things like posix acl setup
functions correctly with inode_init_owner() to strip the SGID bit.

Just strip the SGID bit at the VFS, and then the filesystem can't get it wrong.

Also, the inode_sgid_strip() api should be used before IS_POSIXACL() because
this api may change mode.

Only the following places use inode_init_owner
"
arch/powerpc/platforms/cell/spufs/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode | S_IFDIR);
arch/powerpc/platforms/cell/spufs/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode | S_IFDIR);
fs/9p/vfs_inode.c:      inode_init_owner(&init_user_ns, inode, NULL, mode);
fs/bfs/dir.c:   inode_init_owner(&init_user_ns, inode, dir, mode);
fs/btrfs/inode.c:       inode_init_owner(mnt_userns, inode, dir, mode);
fs/btrfs/tests/btrfs-tests.c:   inode_init_owner(&init_user_ns, inode, NULL, S_IFREG);
fs/ext2/ialloc.c:               inode_init_owner(&init_user_ns, inode, dir, mode);
fs/ext4/ialloc.c:               inode_init_owner(mnt_userns, inode, dir, mode);
fs/f2fs/namei.c:        inode_init_owner(mnt_userns, inode, dir, mode);
fs/hfsplus/inode.c:     inode_init_owner(&init_user_ns, inode, dir, mode);
fs/hugetlbfs/inode.c:           inode_init_owner(&init_user_ns, inode, dir, mode);
fs/jfs/jfs_inode.c:     inode_init_owner(&init_user_ns, inode, parent, mode);
fs/minix/bitmap.c:      inode_init_owner(&init_user_ns, inode, dir, mode);
fs/nilfs2/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode);
fs/ntfs3/inode.c:       inode_init_owner(mnt_userns, inode, dir, mode);
fs/ocfs2/dlmfs/dlmfs.c:         inode_init_owner(&init_user_ns, inode, NULL, mode);
fs/ocfs2/dlmfs/dlmfs.c: inode_init_owner(&init_user_ns, inode, parent, mode);
fs/ocfs2/namei.c:       inode_init_owner(&init_user_ns, inode, dir, mode);
fs/omfs/inode.c:        inode_init_owner(&init_user_ns, inode, NULL, mode);
fs/overlayfs/dir.c:     inode_init_owner(&init_user_ns, inode, dentry->d_parent->d_inode, mode);
fs/ramfs/inode.c:               inode_init_owner(&init_user_ns, inode, dir, mode);
fs/reiserfs/namei.c:    inode_init_owner(&init_user_ns, inode, dir, mode);
fs/sysv/ialloc.c:       inode_init_owner(&init_user_ns, inode, dir, mode);
fs/ubifs/dir.c: inode_init_owner(&init_user_ns, inode, dir, mode);
fs/udf/ialloc.c:        inode_init_owner(&init_user_ns, inode, dir, mode);
fs/ufs/ialloc.c:        inode_init_owner(&init_user_ns, inode, dir, mode);
fs/xfs/xfs_inode.c:             inode_init_owner(mnt_userns, inode, dir, mode);
fs/zonefs/super.c:      inode_init_owner(&init_user_ns, inode, parent, S_IFDIR | 0555);
kernel/bpf/inode.c:     inode_init_owner(&init_user_ns, inode, dir, mode);
mm/shmem.c:             inode_init_owner(&init_user_ns, inode, dir, mode);
"

They are used in filesystem to init new inode function and these init inode
functions are used by following operations:
mkdir
symlink
mknod
create
tmpfile
rename

We don't care about mkdir because we don't strip SGID bit for directory except
fs.xfs.irix_sgid_inherit. But we even call prepare_mode() in do_mkdirat() since
inode_sgid_strip() will skip directories anyway. This will enforce the same
ordering for all relevant operations and it will make the code more uniform and
easier to understand by using new helper prepare_mode().

symlink and rename only use valid mode that doesn't have SGID bit.

We have added inode_sgid_strip api for the remaining operations.

In addition to the above six operations, four filesystems has a little difference
1) btrfs has btrfs_create_subvol_root to create new inode but used non SGID bit
   mode and can ignore
2) ocfs2 reflink function should add inode_sgid_strip api manually because we
   don't add it in vfs
3) spufs which doesn't really go hrough the regular VFS callpath because it has
   separate system call spu_create, but it t only allows the creation of
   directories and only allows bits in 0777 and can ignore
4) bpf use vfs_mkobj in bpf_obj_do_pin with
   "S_IFREG | ((S_IRUSR | S_IWUSR) & ~current_umask()) mode and
   use bpf_mkobj_ops in bpf_iter_link_pin_kernel with S_IFREG | S_IRUSR mode,
   so bpf is also not affected

This patch also changed grpid behaviour for ext4/xfs because the mode passed to
them may been changed by inode_sgid_strip.

Also as Christian Brauner said"
The patch itself is useful as it would move a security sensitive operation that is
currently burried in individual filesystems into the vfs layer. But it has a decent
regression  potential since it might strip filesystems that have so far relied on
getting the S_ISGID bit with a mode argument. So this needs a lot of testing and
long exposure in -next for at least one full kernel cycle."

Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
---
v4->v5:
put inode_sgid_strip before the inode_init_owner in ocfs2 filesystem
because the inode->i_mode's assignment is in inode_init_owner
 fs/inode.c         |  2 --
 fs/namei.c         | 22 +++++++++-------------
 fs/ocfs2/namei.c   |  1 +
 include/linux/fs.h | 11 +++++++++++
 4 files changed, 21 insertions(+), 15 deletions(-)

Comments

Christian Brauner April 21, 2022, 8:35 a.m. UTC | #1
On Thu, Apr 21, 2022 at 03:54:17PM +0800, Yang Xu wrote:
> Currently, vfs only passes mode argument to filesystem, then use inode_init_owner()
> to strip S_ISGID. Some filesystem(ie ext4/btrfs) will call inode_init_owner
> firstly, then posxi acl setup, but xfs uses the contrary order. It will
> affect S_ISGID clear especially we filter S_IXGRP by umask or acl.
> 
> Regardless of which filesystem is in use, failure to strip the SGID correctly
> is considered a security failure that needs to be fixed. The current VFS
> infrastructure requires the filesystem to do everything right and not step on
> any landmines to strip the SGID bit, when in fact it can easily be done at the
> VFS and the filesystems then don't even need to be aware that the SGID needs
> to be (or has been stripped) by the operation the user asked to be done.
> 
> Vfs has all the info it needs - it doesn't need the filesystems to do everything
> correctly with the mode and ensuring that they order things like posix acl setup
> functions correctly with inode_init_owner() to strip the SGID bit.
> 
> Just strip the SGID bit at the VFS, and then the filesystem can't get it wrong.
> 
> Also, the inode_sgid_strip() api should be used before IS_POSIXACL() because
> this api may change mode.
> 
> Only the following places use inode_init_owner
> "
> arch/powerpc/platforms/cell/spufs/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode | S_IFDIR);
> arch/powerpc/platforms/cell/spufs/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode | S_IFDIR);
> fs/9p/vfs_inode.c:      inode_init_owner(&init_user_ns, inode, NULL, mode);
> fs/bfs/dir.c:   inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/btrfs/inode.c:       inode_init_owner(mnt_userns, inode, dir, mode);
> fs/btrfs/tests/btrfs-tests.c:   inode_init_owner(&init_user_ns, inode, NULL, S_IFREG);
> fs/ext2/ialloc.c:               inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/ext4/ialloc.c:               inode_init_owner(mnt_userns, inode, dir, mode);
> fs/f2fs/namei.c:        inode_init_owner(mnt_userns, inode, dir, mode);
> fs/hfsplus/inode.c:     inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/hugetlbfs/inode.c:           inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/jfs/jfs_inode.c:     inode_init_owner(&init_user_ns, inode, parent, mode);
> fs/minix/bitmap.c:      inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/nilfs2/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/ntfs3/inode.c:       inode_init_owner(mnt_userns, inode, dir, mode);
> fs/ocfs2/dlmfs/dlmfs.c:         inode_init_owner(&init_user_ns, inode, NULL, mode);
> fs/ocfs2/dlmfs/dlmfs.c: inode_init_owner(&init_user_ns, inode, parent, mode);
> fs/ocfs2/namei.c:       inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/omfs/inode.c:        inode_init_owner(&init_user_ns, inode, NULL, mode);
> fs/overlayfs/dir.c:     inode_init_owner(&init_user_ns, inode, dentry->d_parent->d_inode, mode);
> fs/ramfs/inode.c:               inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/reiserfs/namei.c:    inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/sysv/ialloc.c:       inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/ubifs/dir.c: inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/udf/ialloc.c:        inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/ufs/ialloc.c:        inode_init_owner(&init_user_ns, inode, dir, mode);
> fs/xfs/xfs_inode.c:             inode_init_owner(mnt_userns, inode, dir, mode);
> fs/zonefs/super.c:      inode_init_owner(&init_user_ns, inode, parent, S_IFDIR | 0555);
> kernel/bpf/inode.c:     inode_init_owner(&init_user_ns, inode, dir, mode);
> mm/shmem.c:             inode_init_owner(&init_user_ns, inode, dir, mode);
> "
> 
> They are used in filesystem to init new inode function and these init inode
> functions are used by following operations:
> mkdir
> symlink
> mknod
> create
> tmpfile
> rename
> 
> We don't care about mkdir because we don't strip SGID bit for directory except
> fs.xfs.irix_sgid_inherit. But we even call prepare_mode() in do_mkdirat() since
> inode_sgid_strip() will skip directories anyway. This will enforce the same
> ordering for all relevant operations and it will make the code more uniform and
> easier to understand by using new helper prepare_mode().
> 
> symlink and rename only use valid mode that doesn't have SGID bit.
> 
> We have added inode_sgid_strip api for the remaining operations.
> 
> In addition to the above six operations, four filesystems has a little difference
> 1) btrfs has btrfs_create_subvol_root to create new inode but used non SGID bit
>    mode and can ignore
> 2) ocfs2 reflink function should add inode_sgid_strip api manually because we
>    don't add it in vfs
> 3) spufs which doesn't really go hrough the regular VFS callpath because it has
>    separate system call spu_create, but it t only allows the creation of
>    directories and only allows bits in 0777 and can ignore
> 4) bpf use vfs_mkobj in bpf_obj_do_pin with
>    "S_IFREG | ((S_IRUSR | S_IWUSR) & ~current_umask()) mode and
>    use bpf_mkobj_ops in bpf_iter_link_pin_kernel with S_IFREG | S_IRUSR mode,
>    so bpf is also not affected
> 
> This patch also changed grpid behaviour for ext4/xfs because the mode passed to
> them may been changed by inode_sgid_strip.
> 
> Also as Christian Brauner said"
> The patch itself is useful as it would move a security sensitive operation that is
> currently burried in individual filesystems into the vfs layer. But it has a decent
> regression  potential since it might strip filesystems that have so far relied on
> getting the S_ISGID bit with a mode argument. So this needs a lot of testing and
> long exposure in -next for at least one full kernel cycle."
> 
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
> ---
> v4->v5:
> put inode_sgid_strip before the inode_init_owner in ocfs2 filesystem
> because the inode->i_mode's assignment is in inode_init_owner
>  fs/inode.c         |  2 --
>  fs/namei.c         | 22 +++++++++-------------
>  fs/ocfs2/namei.c   |  1 +
>  include/linux/fs.h | 11 +++++++++++
>  4 files changed, 21 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 57130e4ef8b4..95667e634bd4 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -2246,8 +2246,6 @@ void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
>  		/* Directories are special, and always inherit S_ISGID */
>  		if (S_ISDIR(mode))
>  			mode |= S_ISGID;
> -		else
> -			mode = inode_sgid_strip(mnt_userns, dir, mode);
>  	} else
>  		inode_fsgid_set(inode, mnt_userns);
>  	inode->i_mode = mode;
> diff --git a/fs/namei.c b/fs/namei.c
> index 73646e28fae0..5b8e6288d503 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3287,8 +3287,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
>  	if (open_flag & O_CREAT) {
>  		if (open_flag & O_EXCL)
>  			open_flag &= ~O_TRUNC;
> -		if (!IS_POSIXACL(dir->d_inode))
> -			mode &= ~current_umask();
> +		mode = prepare_mode(mnt_userns, dir->d_inode, mode);
>  		if (likely(got_write))
>  			create_error = may_o_create(mnt_userns, &nd->path,
>  						    dentry, mode);
> @@ -3521,8 +3520,7 @@ struct dentry *vfs_tmpfile(struct user_namespace *mnt_userns,
>  	child = d_alloc(dentry, &slash_name);
>  	if (unlikely(!child))
>  		goto out_err;
> -	if (!IS_POSIXACL(dir))
> -		mode &= ~current_umask();
> +	mode = prepare_mode(mnt_userns, dir, mode);
>  	error = dir->i_op->tmpfile(mnt_userns, dir, child, mode);
>  	if (error)
>  		goto out_err;
> @@ -3850,13 +3848,12 @@ static int do_mknodat(int dfd, struct filename *name, umode_t mode,
>  	if (IS_ERR(dentry))
>  		goto out1;
>  
> -	if (!IS_POSIXACL(path.dentry->d_inode))
> -		mode &= ~current_umask();
> +	mnt_userns = mnt_user_ns(path.mnt);
> +	mode = prepare_mode(mnt_userns, path.dentry->d_inode, mode);
>  	error = security_path_mknod(&path, dentry, mode, dev);
>  	if (error)
>  		goto out2;
>  
> -	mnt_userns = mnt_user_ns(path.mnt);
>  	switch (mode & S_IFMT) {
>  		case 0: case S_IFREG:
>  			error = vfs_create(mnt_userns, path.dentry->d_inode,
> @@ -3943,6 +3940,7 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
>  	struct path path;
>  	int error;
>  	unsigned int lookup_flags = LOOKUP_DIRECTORY;
> +	struct user_namespace *mnt_userns;
>  
>  retry:
>  	dentry = filename_create(dfd, name, &path, lookup_flags);
> @@ -3950,15 +3948,13 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
>  	if (IS_ERR(dentry))
>  		goto out_putname;
>  
> -	if (!IS_POSIXACL(path.dentry->d_inode))
> -		mode &= ~current_umask();
> +	mnt_userns = mnt_user_ns(path.mnt);
> +	mode = prepare_mode(mnt_userns, path.dentry->d_inode, mode);
>  	error = security_path_mkdir(&path, dentry, mode);
> -	if (!error) {
> -		struct user_namespace *mnt_userns;
> -		mnt_userns = mnt_user_ns(path.mnt);
> +	if (!error)
>  		error = vfs_mkdir(mnt_userns, path.dentry->d_inode, dentry,
>  				  mode);
> -	}
> +
>  	done_path_create(&path, dentry);
>  	if (retry_estale(error, lookup_flags)) {
>  		lookup_flags |= LOOKUP_REVAL;
> diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
> index c75fd54b9185..21f3da2e66c9 100644
> --- a/fs/ocfs2/namei.c
> +++ b/fs/ocfs2/namei.c
> @@ -197,6 +197,7 @@ static struct inode *ocfs2_get_init_inode(struct inode *dir, umode_t mode)
>  	 * callers. */
>  	if (S_ISDIR(mode))
>  		set_nlink(inode, 2);
> +	mode = inode_sgid_strip(&init_user_ns, dir, mode);
>  	inode_init_owner(&init_user_ns, inode, dir, mode);

For the record, I'm not too fond of this separate invocation of
inode_sgid_strip() but since it's only one location this might be fine.
If there's more than one location a separate helper should exist for
this that abstracts this away for the filesystem.

Two questions:
- Sould this call prepare_mode(), i.e. should we honor umasks here too?
- How is the sgid bit handled when creating reflinks on other reflink
  supporting filesystems such as xfs and btrfs?

>  	status = dquot_initialize(inode);
>  	if (status)
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 532de76c9b91..0bf81ab71619 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3459,6 +3459,17 @@ static inline bool dir_relax_shared(struct inode *inode)
>  	return !IS_DEADDIR(inode);
>  }
>  
> +static inline umode_t prepare_mode(struct user_namespace *mnt_userns,
> +				   const struct inode *dir, umode_t mode)
> +{
> +	mode = inode_sgid_strip(mnt_userns, dir, mode);
> +
> +	if (!IS_POSIXACL(dir))
> +		mode &= current_umask();

You're missing a "~". I assume you meant:

mode &= ~current_umask();
Yang Xu (Fujitsu) April 22, 2022, 6:03 a.m. UTC | #2
on 2022/4/21 16:35, Christian Brauner wrote:
> On Thu, Apr 21, 2022 at 03:54:17PM +0800, Yang Xu wrote:
>> Currently, vfs only passes mode argument to filesystem, then use inode_init_owner()
>> to strip S_ISGID. Some filesystem(ie ext4/btrfs) will call inode_init_owner
>> firstly, then posxi acl setup, but xfs uses the contrary order. It will
>> affect S_ISGID clear especially we filter S_IXGRP by umask or acl.
>>
>> Regardless of which filesystem is in use, failure to strip the SGID correctly
>> is considered a security failure that needs to be fixed. The current VFS
>> infrastructure requires the filesystem to do everything right and not step on
>> any landmines to strip the SGID bit, when in fact it can easily be done at the
>> VFS and the filesystems then don't even need to be aware that the SGID needs
>> to be (or has been stripped) by the operation the user asked to be done.
>>
>> Vfs has all the info it needs - it doesn't need the filesystems to do everything
>> correctly with the mode and ensuring that they order things like posix acl setup
>> functions correctly with inode_init_owner() to strip the SGID bit.
>>
>> Just strip the SGID bit at the VFS, and then the filesystem can't get it wrong.
>>
>> Also, the inode_sgid_strip() api should be used before IS_POSIXACL() because
>> this api may change mode.
>>
>> Only the following places use inode_init_owner
>> "
>> arch/powerpc/platforms/cell/spufs/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode | S_IFDIR);
>> arch/powerpc/platforms/cell/spufs/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode | S_IFDIR);
>> fs/9p/vfs_inode.c:      inode_init_owner(&init_user_ns, inode, NULL, mode);
>> fs/bfs/dir.c:   inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/btrfs/inode.c:       inode_init_owner(mnt_userns, inode, dir, mode);
>> fs/btrfs/tests/btrfs-tests.c:   inode_init_owner(&init_user_ns, inode, NULL, S_IFREG);
>> fs/ext2/ialloc.c:               inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/ext4/ialloc.c:               inode_init_owner(mnt_userns, inode, dir, mode);
>> fs/f2fs/namei.c:        inode_init_owner(mnt_userns, inode, dir, mode);
>> fs/hfsplus/inode.c:     inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/hugetlbfs/inode.c:           inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/jfs/jfs_inode.c:     inode_init_owner(&init_user_ns, inode, parent, mode);
>> fs/minix/bitmap.c:      inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/nilfs2/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/ntfs3/inode.c:       inode_init_owner(mnt_userns, inode, dir, mode);
>> fs/ocfs2/dlmfs/dlmfs.c:         inode_init_owner(&init_user_ns, inode, NULL, mode);
>> fs/ocfs2/dlmfs/dlmfs.c: inode_init_owner(&init_user_ns, inode, parent, mode);
>> fs/ocfs2/namei.c:       inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/omfs/inode.c:        inode_init_owner(&init_user_ns, inode, NULL, mode);
>> fs/overlayfs/dir.c:     inode_init_owner(&init_user_ns, inode, dentry->d_parent->d_inode, mode);
>> fs/ramfs/inode.c:               inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/reiserfs/namei.c:    inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/sysv/ialloc.c:       inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/ubifs/dir.c: inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/udf/ialloc.c:        inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/ufs/ialloc.c:        inode_init_owner(&init_user_ns, inode, dir, mode);
>> fs/xfs/xfs_inode.c:             inode_init_owner(mnt_userns, inode, dir, mode);
>> fs/zonefs/super.c:      inode_init_owner(&init_user_ns, inode, parent, S_IFDIR | 0555);
>> kernel/bpf/inode.c:     inode_init_owner(&init_user_ns, inode, dir, mode);
>> mm/shmem.c:             inode_init_owner(&init_user_ns, inode, dir, mode);
>> "
>>
>> They are used in filesystem to init new inode function and these init inode
>> functions are used by following operations:
>> mkdir
>> symlink
>> mknod
>> create
>> tmpfile
>> rename
>>
>> We don't care about mkdir because we don't strip SGID bit for directory except
>> fs.xfs.irix_sgid_inherit. But we even call prepare_mode() in do_mkdirat() since
>> inode_sgid_strip() will skip directories anyway. This will enforce the same
>> ordering for all relevant operations and it will make the code more uniform and
>> easier to understand by using new helper prepare_mode().
>>
>> symlink and rename only use valid mode that doesn't have SGID bit.
>>
>> We have added inode_sgid_strip api for the remaining operations.
>>
>> In addition to the above six operations, four filesystems has a little difference
>> 1) btrfs has btrfs_create_subvol_root to create new inode but used non SGID bit
>>     mode and can ignore
>> 2) ocfs2 reflink function should add inode_sgid_strip api manually because we
>>     don't add it in vfs
>> 3) spufs which doesn't really go hrough the regular VFS callpath because it has
>>     separate system call spu_create, but it t only allows the creation of
>>     directories and only allows bits in 0777 and can ignore
>> 4) bpf use vfs_mkobj in bpf_obj_do_pin with
>>     "S_IFREG | ((S_IRUSR | S_IWUSR)&  ~current_umask()) mode and
>>     use bpf_mkobj_ops in bpf_iter_link_pin_kernel with S_IFREG | S_IRUSR mode,
>>     so bpf is also not affected
>>
>> This patch also changed grpid behaviour for ext4/xfs because the mode passed to
>> them may been changed by inode_sgid_strip.
>>
>> Also as Christian Brauner said"
>> The patch itself is useful as it would move a security sensitive operation that is
>> currently burried in individual filesystems into the vfs layer. But it has a decent
>> regression  potential since it might strip filesystems that have so far relied on
>> getting the S_ISGID bit with a mode argument. So this needs a lot of testing and
>> long exposure in -next for at least one full kernel cycle."
>>
>> Suggested-by: Dave Chinner<david@fromorbit.com>
>> Signed-off-by: Yang Xu<xuyang2018.jy@fujitsu.com>
>> ---
>> v4->v5:
>> put inode_sgid_strip before the inode_init_owner in ocfs2 filesystem
>> because the inode->i_mode's assignment is in inode_init_owner
>>   fs/inode.c         |  2 --
>>   fs/namei.c         | 22 +++++++++-------------
>>   fs/ocfs2/namei.c   |  1 +
>>   include/linux/fs.h | 11 +++++++++++
>>   4 files changed, 21 insertions(+), 15 deletions(-)
>>
>> diff --git a/fs/inode.c b/fs/inode.c
>> index 57130e4ef8b4..95667e634bd4 100644
>> --- a/fs/inode.c
>> +++ b/fs/inode.c
>> @@ -2246,8 +2246,6 @@ void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
>>   		/* Directories are special, and always inherit S_ISGID */
>>   		if (S_ISDIR(mode))
>>   			mode |= S_ISGID;
>> -		else
>> -			mode = inode_sgid_strip(mnt_userns, dir, mode);
>>   	} else
>>   		inode_fsgid_set(inode, mnt_userns);
>>   	inode->i_mode = mode;
>> diff --git a/fs/namei.c b/fs/namei.c
>> index 73646e28fae0..5b8e6288d503 100644
>> --- a/fs/namei.c
>> +++ b/fs/namei.c
>> @@ -3287,8 +3287,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
>>   	if (open_flag&  O_CREAT) {
>>   		if (open_flag&  O_EXCL)
>>   			open_flag&= ~O_TRUNC;
>> -		if (!IS_POSIXACL(dir->d_inode))
>> -			mode&= ~current_umask();
>> +		mode = prepare_mode(mnt_userns, dir->d_inode, mode);
>>   		if (likely(got_write))
>>   			create_error = may_o_create(mnt_userns,&nd->path,
>>   						    dentry, mode);
>> @@ -3521,8 +3520,7 @@ struct dentry *vfs_tmpfile(struct user_namespace *mnt_userns,
>>   	child = d_alloc(dentry,&slash_name);
>>   	if (unlikely(!child))
>>   		goto out_err;
>> -	if (!IS_POSIXACL(dir))
>> -		mode&= ~current_umask();
>> +	mode = prepare_mode(mnt_userns, dir, mode);
>>   	error = dir->i_op->tmpfile(mnt_userns, dir, child, mode);
>>   	if (error)
>>   		goto out_err;
>> @@ -3850,13 +3848,12 @@ static int do_mknodat(int dfd, struct filename *name, umode_t mode,
>>   	if (IS_ERR(dentry))
>>   		goto out1;
>>
>> -	if (!IS_POSIXACL(path.dentry->d_inode))
>> -		mode&= ~current_umask();
>> +	mnt_userns = mnt_user_ns(path.mnt);
>> +	mode = prepare_mode(mnt_userns, path.dentry->d_inode, mode);
>>   	error = security_path_mknod(&path, dentry, mode, dev);
>>   	if (error)
>>   		goto out2;
>>
>> -	mnt_userns = mnt_user_ns(path.mnt);
>>   	switch (mode&  S_IFMT) {
>>   		case 0: case S_IFREG:
>>   			error = vfs_create(mnt_userns, path.dentry->d_inode,
>> @@ -3943,6 +3940,7 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
>>   	struct path path;
>>   	int error;
>>   	unsigned int lookup_flags = LOOKUP_DIRECTORY;
>> +	struct user_namespace *mnt_userns;
>>
>>   retry:
>>   	dentry = filename_create(dfd, name,&path, lookup_flags);
>> @@ -3950,15 +3948,13 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
>>   	if (IS_ERR(dentry))
>>   		goto out_putname;
>>
>> -	if (!IS_POSIXACL(path.dentry->d_inode))
>> -		mode&= ~current_umask();
>> +	mnt_userns = mnt_user_ns(path.mnt);
>> +	mode = prepare_mode(mnt_userns, path.dentry->d_inode, mode);
>>   	error = security_path_mkdir(&path, dentry, mode);
>> -	if (!error) {
>> -		struct user_namespace *mnt_userns;
>> -		mnt_userns = mnt_user_ns(path.mnt);
>> +	if (!error)
>>   		error = vfs_mkdir(mnt_userns, path.dentry->d_inode, dentry,
>>   				  mode);
>> -	}
>> +
>>   	done_path_create(&path, dentry);
>>   	if (retry_estale(error, lookup_flags)) {
>>   		lookup_flags |= LOOKUP_REVAL;
>> diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
>> index c75fd54b9185..21f3da2e66c9 100644
>> --- a/fs/ocfs2/namei.c
>> +++ b/fs/ocfs2/namei.c
>> @@ -197,6 +197,7 @@ static struct inode *ocfs2_get_init_inode(struct inode *dir, umode_t mode)
>>   	 * callers. */
>>   	if (S_ISDIR(mode))
>>   		set_nlink(inode, 2);
>> +	mode = inode_sgid_strip(&init_user_ns, dir, mode);
>>   	inode_init_owner(&init_user_ns, inode, dir, mode);
>
> For the record, I'm not too fond of this separate invocation of
> inode_sgid_strip() but since it's only one location this might be fine.
> If there's more than one location a separate helper should exist for
> this that abstracts this away for the filesystem.
Agree. This case only be found when using OCFS2_IOC_REFLINK ioctl. And 
other support reflink filesystem(xfs, btrfs) they use  FICLONE or 
FICLONERANGE ioctl.

Since ocfs2 has supported reflink by using it remap_file_range, should 
we still need this ioctl?

commit bd50873dc725a9fa72592ecc986c58805e823051
Author: Tao Ma <tao.ma@oracle.com>
Date:   Mon Sep 21 11:25:14 2009 +0800

     ocfs2: Add ioctl for reflink.

     The ioctl will take 3 parameters: old_path, new_path and
     preserve and call vfs_reflink. It is useful when we backport
     reflink features to old kernels.

     Signed-off-by: Tao Ma <tao.ma@oracle.com>

Of course, this is a problem doesn't belong to this series.

>
> Two questions:
> - Sould this call prepare_mode(), i.e. should we honor umasks here too?
IMO, it desn't need to honor umask. Because reflink only will update 
inode_imode by setattr to strip S_ISGID and S_ISUID instead of creating 
a file.
> - How is the sgid bit handled when creating reflinks on other reflink
>    supporting filesystems such as xfs and btrfs?
xfstests has a test case generic/673 for this, so btrfs and xfs should 
have the same behavior.
I look into xfs code.

Firstly

If we don't have CAP_FSETID and it is a regulre file,also have sgid bit, 
then should_remove_suid will give attr a ATTR_KILL_SGID mask.

ATTR_KILL_SGID -> clear S_ISGID if executable

int should_remove_suid(struct dentry *dentry)
{
         umode_t mode = d_inode(dentry)->i_mode;
         int kill = 0;

         /* suid always must be killed */
         if (unlikely(mode & S_ISUID))
                 kill = ATTR_KILL_SUID;

         /*
          * sgid without any exec bits is just a mandatory locking mark; 
leave
          * it alone.  If some exec bits are set, it's a real sgid; kill it.
          */
         if (unlikely((mode & S_ISGID) && (mode & S_IXGRP)))
                 kill |= ATTR_KILL_SGID;

         if (unlikely(kill && !capable(CAP_FSETID) && S_ISREG(mode)))
                 return kill;

         return 0;
}

Then it will call notify_change to modify inode mode by using setattr 
hook as below:
fs/attr.c notfiy_change
	{
	...
	  if (ia_valid & ATTR_KILL_SUID) {
                 if (mode & S_ISUID) {
                         ia_valid = attr->ia_valid |= ATTR_MODE;
                         attr->ia_mode = (inode->i_mode & ~S_ISUID);
                 }
         }
         if (ia_valid & ATTR_KILL_SGID) {
                 if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)) {
                         if (!(ia_valid & ATTR_MODE)) {
                                 ia_valid = attr->ia_valid |= ATTR_MODE;
                                 attr->ia_mode = inode->i_mode;
                         }
                         attr->ia_mode &= ~S_ISGID;
                 }
         }
	...
Here it will strip attr->ia_mode by check inode->i_mode whether have 
sgid bit and group-execute bit.

then setattr_prepare and setattr_copy has the remaining sgid stripping 
rule just neither are in the group of the current file nor have 
CAP_FSETID in their user namespace.

         if (ia_valid & ATTR_MODE) {
                 umode_t mode = attr->ia_mode;
                 kgid_t kgid = i_gid_into_mnt(mnt_userns, inode); // 
this code seems unnecessary can be used directly in in_group_p
                 if (!in_group_p(kgid) &&
                     !capable_wrt_inode_uidgid(mnt_userns, inode, 
CAP_FSETID))
                         mode &= ~S_ISGID;
                 inode->i_mode = mode;

It seems this way has similar logic but not include umask.

>
>>   	status = dquot_initialize(inode);
>>   	if (status)
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 532de76c9b91..0bf81ab71619 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -3459,6 +3459,17 @@ static inline bool dir_relax_shared(struct inode *inode)
>>   	return !IS_DEADDIR(inode);
>>   }
>>
>> +static inline umode_t prepare_mode(struct user_namespace *mnt_userns,
>> +				   const struct inode *dir, umode_t mode)
>> +{
>> +	mode = inode_sgid_strip(mnt_userns, dir, mode);
>> +
>> +	if (!IS_POSIXACL(dir))
>> +		mode&= current_umask();
>
> You're missing a "~". I assume you meant:
>
> mode&= ~current_umask();
Yes, sorry for this.
Christian Brauner April 22, 2022, 9:47 a.m. UTC | #3
On Fri, Apr 22, 2022 at 06:03:09AM +0000, xuyang2018.jy@fujitsu.com wrote:
> on 2022/4/21 16:35, Christian Brauner wrote:
> > On Thu, Apr 21, 2022 at 03:54:17PM +0800, Yang Xu wrote:
> >> Currently, vfs only passes mode argument to filesystem, then use inode_init_owner()
> >> to strip S_ISGID. Some filesystem(ie ext4/btrfs) will call inode_init_owner
> >> firstly, then posxi acl setup, but xfs uses the contrary order. It will
> >> affect S_ISGID clear especially we filter S_IXGRP by umask or acl.
> >>
> >> Regardless of which filesystem is in use, failure to strip the SGID correctly
> >> is considered a security failure that needs to be fixed. The current VFS
> >> infrastructure requires the filesystem to do everything right and not step on
> >> any landmines to strip the SGID bit, when in fact it can easily be done at the
> >> VFS and the filesystems then don't even need to be aware that the SGID needs
> >> to be (or has been stripped) by the operation the user asked to be done.
> >>
> >> Vfs has all the info it needs - it doesn't need the filesystems to do everything
> >> correctly with the mode and ensuring that they order things like posix acl setup
> >> functions correctly with inode_init_owner() to strip the SGID bit.
> >>
> >> Just strip the SGID bit at the VFS, and then the filesystem can't get it wrong.
> >>
> >> Also, the inode_sgid_strip() api should be used before IS_POSIXACL() because
> >> this api may change mode.
> >>
> >> Only the following places use inode_init_owner
> >> "
> >> arch/powerpc/platforms/cell/spufs/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode | S_IFDIR);
> >> arch/powerpc/platforms/cell/spufs/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode | S_IFDIR);
> >> fs/9p/vfs_inode.c:      inode_init_owner(&init_user_ns, inode, NULL, mode);
> >> fs/bfs/dir.c:   inode_init_owner(&init_user_ns, inode, dir, mode);
> >> fs/btrfs/inode.c:       inode_init_owner(mnt_userns, inode, dir, mode);
> >> fs/btrfs/tests/btrfs-tests.c:   inode_init_owner(&init_user_ns, inode, NULL, S_IFREG);
> >> fs/ext2/ialloc.c:               inode_init_owner(&init_user_ns, inode, dir, mode);
> >> fs/ext4/ialloc.c:               inode_init_owner(mnt_userns, inode, dir, mode);
> >> fs/f2fs/namei.c:        inode_init_owner(mnt_userns, inode, dir, mode);
> >> fs/hfsplus/inode.c:     inode_init_owner(&init_user_ns, inode, dir, mode);
> >> fs/hugetlbfs/inode.c:           inode_init_owner(&init_user_ns, inode, dir, mode);
> >> fs/jfs/jfs_inode.c:     inode_init_owner(&init_user_ns, inode, parent, mode);
> >> fs/minix/bitmap.c:      inode_init_owner(&init_user_ns, inode, dir, mode);
> >> fs/nilfs2/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode);
> >> fs/ntfs3/inode.c:       inode_init_owner(mnt_userns, inode, dir, mode);
> >> fs/ocfs2/dlmfs/dlmfs.c:         inode_init_owner(&init_user_ns, inode, NULL, mode);
> >> fs/ocfs2/dlmfs/dlmfs.c: inode_init_owner(&init_user_ns, inode, parent, mode);
> >> fs/ocfs2/namei.c:       inode_init_owner(&init_user_ns, inode, dir, mode);
> >> fs/omfs/inode.c:        inode_init_owner(&init_user_ns, inode, NULL, mode);
> >> fs/overlayfs/dir.c:     inode_init_owner(&init_user_ns, inode, dentry->d_parent->d_inode, mode);
> >> fs/ramfs/inode.c:               inode_init_owner(&init_user_ns, inode, dir, mode);
> >> fs/reiserfs/namei.c:    inode_init_owner(&init_user_ns, inode, dir, mode);
> >> fs/sysv/ialloc.c:       inode_init_owner(&init_user_ns, inode, dir, mode);
> >> fs/ubifs/dir.c: inode_init_owner(&init_user_ns, inode, dir, mode);
> >> fs/udf/ialloc.c:        inode_init_owner(&init_user_ns, inode, dir, mode);
> >> fs/ufs/ialloc.c:        inode_init_owner(&init_user_ns, inode, dir, mode);
> >> fs/xfs/xfs_inode.c:             inode_init_owner(mnt_userns, inode, dir, mode);
> >> fs/zonefs/super.c:      inode_init_owner(&init_user_ns, inode, parent, S_IFDIR | 0555);
> >> kernel/bpf/inode.c:     inode_init_owner(&init_user_ns, inode, dir, mode);
> >> mm/shmem.c:             inode_init_owner(&init_user_ns, inode, dir, mode);
> >> "
> >>
> >> They are used in filesystem to init new inode function and these init inode
> >> functions are used by following operations:
> >> mkdir
> >> symlink
> >> mknod
> >> create
> >> tmpfile
> >> rename
> >>
> >> We don't care about mkdir because we don't strip SGID bit for directory except
> >> fs.xfs.irix_sgid_inherit. But we even call prepare_mode() in do_mkdirat() since
> >> inode_sgid_strip() will skip directories anyway. This will enforce the same
> >> ordering for all relevant operations and it will make the code more uniform and
> >> easier to understand by using new helper prepare_mode().
> >>
> >> symlink and rename only use valid mode that doesn't have SGID bit.
> >>
> >> We have added inode_sgid_strip api for the remaining operations.
> >>
> >> In addition to the above six operations, four filesystems has a little difference
> >> 1) btrfs has btrfs_create_subvol_root to create new inode but used non SGID bit
> >>     mode and can ignore
> >> 2) ocfs2 reflink function should add inode_sgid_strip api manually because we
> >>     don't add it in vfs
> >> 3) spufs which doesn't really go hrough the regular VFS callpath because it has
> >>     separate system call spu_create, but it t only allows the creation of
> >>     directories and only allows bits in 0777 and can ignore
> >> 4) bpf use vfs_mkobj in bpf_obj_do_pin with
> >>     "S_IFREG | ((S_IRUSR | S_IWUSR)&  ~current_umask()) mode and
> >>     use bpf_mkobj_ops in bpf_iter_link_pin_kernel with S_IFREG | S_IRUSR mode,
> >>     so bpf is also not affected
> >>
> >> This patch also changed grpid behaviour for ext4/xfs because the mode passed to
> >> them may been changed by inode_sgid_strip.
> >>
> >> Also as Christian Brauner said"
> >> The patch itself is useful as it would move a security sensitive operation that is
> >> currently burried in individual filesystems into the vfs layer. But it has a decent
> >> regression  potential since it might strip filesystems that have so far relied on
> >> getting the S_ISGID bit with a mode argument. So this needs a lot of testing and
> >> long exposure in -next for at least one full kernel cycle."
> >>
> >> Suggested-by: Dave Chinner<david@fromorbit.com>
> >> Signed-off-by: Yang Xu<xuyang2018.jy@fujitsu.com>
> >> ---
> >> v4->v5:
> >> put inode_sgid_strip before the inode_init_owner in ocfs2 filesystem
> >> because the inode->i_mode's assignment is in inode_init_owner
> >>   fs/inode.c         |  2 --
> >>   fs/namei.c         | 22 +++++++++-------------
> >>   fs/ocfs2/namei.c   |  1 +
> >>   include/linux/fs.h | 11 +++++++++++
> >>   4 files changed, 21 insertions(+), 15 deletions(-)
> >>
> >> diff --git a/fs/inode.c b/fs/inode.c
> >> index 57130e4ef8b4..95667e634bd4 100644
> >> --- a/fs/inode.c
> >> +++ b/fs/inode.c
> >> @@ -2246,8 +2246,6 @@ void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
> >>   		/* Directories are special, and always inherit S_ISGID */
> >>   		if (S_ISDIR(mode))
> >>   			mode |= S_ISGID;
> >> -		else
> >> -			mode = inode_sgid_strip(mnt_userns, dir, mode);
> >>   	} else
> >>   		inode_fsgid_set(inode, mnt_userns);
> >>   	inode->i_mode = mode;
> >> diff --git a/fs/namei.c b/fs/namei.c
> >> index 73646e28fae0..5b8e6288d503 100644
> >> --- a/fs/namei.c
> >> +++ b/fs/namei.c
> >> @@ -3287,8 +3287,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
> >>   	if (open_flag&  O_CREAT) {
> >>   		if (open_flag&  O_EXCL)
> >>   			open_flag&= ~O_TRUNC;
> >> -		if (!IS_POSIXACL(dir->d_inode))
> >> -			mode&= ~current_umask();
> >> +		mode = prepare_mode(mnt_userns, dir->d_inode, mode);
> >>   		if (likely(got_write))
> >>   			create_error = may_o_create(mnt_userns,&nd->path,
> >>   						    dentry, mode);
> >> @@ -3521,8 +3520,7 @@ struct dentry *vfs_tmpfile(struct user_namespace *mnt_userns,
> >>   	child = d_alloc(dentry,&slash_name);
> >>   	if (unlikely(!child))
> >>   		goto out_err;
> >> -	if (!IS_POSIXACL(dir))
> >> -		mode&= ~current_umask();
> >> +	mode = prepare_mode(mnt_userns, dir, mode);
> >>   	error = dir->i_op->tmpfile(mnt_userns, dir, child, mode);
> >>   	if (error)
> >>   		goto out_err;
> >> @@ -3850,13 +3848,12 @@ static int do_mknodat(int dfd, struct filename *name, umode_t mode,
> >>   	if (IS_ERR(dentry))
> >>   		goto out1;
> >>
> >> -	if (!IS_POSIXACL(path.dentry->d_inode))
> >> -		mode&= ~current_umask();
> >> +	mnt_userns = mnt_user_ns(path.mnt);
> >> +	mode = prepare_mode(mnt_userns, path.dentry->d_inode, mode);
> >>   	error = security_path_mknod(&path, dentry, mode, dev);
> >>   	if (error)
> >>   		goto out2;
> >>
> >> -	mnt_userns = mnt_user_ns(path.mnt);
> >>   	switch (mode&  S_IFMT) {
> >>   		case 0: case S_IFREG:
> >>   			error = vfs_create(mnt_userns, path.dentry->d_inode,
> >> @@ -3943,6 +3940,7 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
> >>   	struct path path;
> >>   	int error;
> >>   	unsigned int lookup_flags = LOOKUP_DIRECTORY;
> >> +	struct user_namespace *mnt_userns;
> >>
> >>   retry:
> >>   	dentry = filename_create(dfd, name,&path, lookup_flags);
> >> @@ -3950,15 +3948,13 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
> >>   	if (IS_ERR(dentry))
> >>   		goto out_putname;
> >>
> >> -	if (!IS_POSIXACL(path.dentry->d_inode))
> >> -		mode&= ~current_umask();
> >> +	mnt_userns = mnt_user_ns(path.mnt);
> >> +	mode = prepare_mode(mnt_userns, path.dentry->d_inode, mode);
> >>   	error = security_path_mkdir(&path, dentry, mode);
> >> -	if (!error) {
> >> -		struct user_namespace *mnt_userns;
> >> -		mnt_userns = mnt_user_ns(path.mnt);
> >> +	if (!error)
> >>   		error = vfs_mkdir(mnt_userns, path.dentry->d_inode, dentry,
> >>   				  mode);
> >> -	}
> >> +
> >>   	done_path_create(&path, dentry);
> >>   	if (retry_estale(error, lookup_flags)) {
> >>   		lookup_flags |= LOOKUP_REVAL;
> >> diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
> >> index c75fd54b9185..21f3da2e66c9 100644
> >> --- a/fs/ocfs2/namei.c
> >> +++ b/fs/ocfs2/namei.c
> >> @@ -197,6 +197,7 @@ static struct inode *ocfs2_get_init_inode(struct inode *dir, umode_t mode)
> >>   	 * callers. */
> >>   	if (S_ISDIR(mode))
> >>   		set_nlink(inode, 2);
> >> +	mode = inode_sgid_strip(&init_user_ns, dir, mode);
> >>   	inode_init_owner(&init_user_ns, inode, dir, mode);
> >
> > For the record, I'm not too fond of this separate invocation of
> > inode_sgid_strip() but since it's only one location this might be fine.
> > If there's more than one location a separate helper should exist for
> > this that abstracts this away for the filesystem.
> Agree. This case only be found when using OCFS2_IOC_REFLINK ioctl. And 

(I have one very minor non-technical ask: can you please make sure to
leave an empty line between the text you're citing and your reply? It
would make reading your replies a lot easier. :))

> other support reflink filesystem(xfs, btrfs) they use  FICLONE or 
> FICLONERANGE ioctl.
> 
> Since ocfs2 has supported reflink by using it remap_file_range, should 
> we still need this ioctl?
> 
> commit bd50873dc725a9fa72592ecc986c58805e823051
> Author: Tao Ma <tao.ma@oracle.com>
> Date:   Mon Sep 21 11:25:14 2009 +0800
> 
>      ocfs2: Add ioctl for reflink.
> 
>      The ioctl will take 3 parameters: old_path, new_path and
>      preserve and call vfs_reflink. It is useful when we backport
>      reflink features to old kernels.
> 
>      Signed-off-by: Tao Ma <tao.ma@oracle.com>
> 
> Of course, this is a problem doesn't belong to this series.
> 
> >
> > Two questions:
> > - Sould this call prepare_mode(), i.e. should we honor umasks here too?
> IMO, it desn't need to honor umask. Because reflink only will update 
> inode_imode by setattr to strip S_ISGID and S_ISUID instead of creating 
> a file.

I had a misconception here because I got confused by ocfs2.
While the OCFS2_IOC_REFLINK ioctl creates the target file itself and the
reflink the FICLONE and FICLONERANGE _only_ create the reflink.

So the target file itself must've been created prior to
FICLONE/FICLONERANGE which means basic setgid stripping should've been
done when the file was created.

Since ocfs2 reflink callpaths work very differently we can't switch it
to FICLONE/FICLONERANGE as this would regress ocfs2 users.

> > - How is the sgid bit handled when creating reflinks on other reflink
> >    supporting filesystems such as xfs and btrfs?
> xfstests has a test case generic/673 for this, so btrfs and xfs should 
> have the same behavior.
> I look into xfs code.
> 
> Firstly
> 
> If we don't have CAP_FSETID and it is a regulre file,also have sgid bit, 
> then should_remove_suid will give attr a ATTR_KILL_SGID mask.

What you're referring to below is privilege stripping when _modifying
the content_ of set{g,u}id binaries.

That happens e.g. during write() or indeed a reflink creation via
FICLONE/FICLONERANGE. For the latter privilege stripping happens when
only some extents are reflinked and not the whole file. So that's:

* ioctl_file_clone()
  * vfs_clone_file_range()
    * do_clone_file_range()
      * generic_file_rw_checks()
      * remap_verify_area()
        * security_file_permission()
      * ->remap_file_range() [1]

[1]:
btrfs:
* btrfs_remap_file_range()
  * generic_remap_file_range_prep()
    * file_modified() // privilege stripping

ocfs2:
* ocfs2_remap_file_range()
  * generic_remap_file_range_prep()
    * file_modified() // privilege stripping

xfs:
* xfs_remap_file_range()
  * generic_remap_file_range_prep()
    * file_modified() // privilege stripping

// This is stacked, i.e. ovl filesystems will rely on the filesystem
// used for the upper mount and it's .remap_file_range() implementation.
overlayfs:
* ovl_remap_file_range()
  * vfs_clone_file_range()

The other two implementers are:

cifs:
* cifs_remap_file_range()

nfs:
* nfs42_remap_file_range()

both of which don't call into generic_remap_file_range_prep() and so
file_modified() isn't called. But they are netfses and there's a server
involved. In general, this isn't really a concern for this patchset.
(But something to potentially to look into in the future.)

And it wasn't what I was worried about.

The crucial information that we needed was whether reflink callpaths
other than ocfs2 create files themselves and therefore might rely
implicitly on setgid stripping in inode_init_owner().

Now that we looked at all callers we can be confident that this isn't
the case. _Apart from ocfs2_ but which you handle in the patchset.
Yang Xu (Fujitsu) April 22, 2022, 10:13 a.m. UTC | #4
on 2022/4/22 17:47, Christian Brauner wrote:
> On Fri, Apr 22, 2022 at 06:03:09AM +0000, xuyang2018.jy@fujitsu.com wrote:
>> on 2022/4/21 16:35, Christian Brauner wrote:
>>> On Thu, Apr 21, 2022 at 03:54:17PM +0800, Yang Xu wrote:
>>>> Currently, vfs only passes mode argument to filesystem, then use inode_init_owner()
>>>> to strip S_ISGID. Some filesystem(ie ext4/btrfs) will call inode_init_owner
>>>> firstly, then posxi acl setup, but xfs uses the contrary order. It will
>>>> affect S_ISGID clear especially we filter S_IXGRP by umask or acl.
>>>>
>>>> Regardless of which filesystem is in use, failure to strip the SGID correctly
>>>> is considered a security failure that needs to be fixed. The current VFS
>>>> infrastructure requires the filesystem to do everything right and not step on
>>>> any landmines to strip the SGID bit, when in fact it can easily be done at the
>>>> VFS and the filesystems then don't even need to be aware that the SGID needs
>>>> to be (or has been stripped) by the operation the user asked to be done.
>>>>
>>>> Vfs has all the info it needs - it doesn't need the filesystems to do everything
>>>> correctly with the mode and ensuring that they order things like posix acl setup
>>>> functions correctly with inode_init_owner() to strip the SGID bit.
>>>>
>>>> Just strip the SGID bit at the VFS, and then the filesystem can't get it wrong.
>>>>
>>>> Also, the inode_sgid_strip() api should be used before IS_POSIXACL() because
>>>> this api may change mode.
>>>>
>>>> Only the following places use inode_init_owner
>>>> "
>>>> arch/powerpc/platforms/cell/spufs/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode | S_IFDIR);
>>>> arch/powerpc/platforms/cell/spufs/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode | S_IFDIR);
>>>> fs/9p/vfs_inode.c:      inode_init_owner(&init_user_ns, inode, NULL, mode);
>>>> fs/bfs/dir.c:   inode_init_owner(&init_user_ns, inode, dir, mode);
>>>> fs/btrfs/inode.c:       inode_init_owner(mnt_userns, inode, dir, mode);
>>>> fs/btrfs/tests/btrfs-tests.c:   inode_init_owner(&init_user_ns, inode, NULL, S_IFREG);
>>>> fs/ext2/ialloc.c:               inode_init_owner(&init_user_ns, inode, dir, mode);
>>>> fs/ext4/ialloc.c:               inode_init_owner(mnt_userns, inode, dir, mode);
>>>> fs/f2fs/namei.c:        inode_init_owner(mnt_userns, inode, dir, mode);
>>>> fs/hfsplus/inode.c:     inode_init_owner(&init_user_ns, inode, dir, mode);
>>>> fs/hugetlbfs/inode.c:           inode_init_owner(&init_user_ns, inode, dir, mode);
>>>> fs/jfs/jfs_inode.c:     inode_init_owner(&init_user_ns, inode, parent, mode);
>>>> fs/minix/bitmap.c:      inode_init_owner(&init_user_ns, inode, dir, mode);
>>>> fs/nilfs2/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode);
>>>> fs/ntfs3/inode.c:       inode_init_owner(mnt_userns, inode, dir, mode);
>>>> fs/ocfs2/dlmfs/dlmfs.c:         inode_init_owner(&init_user_ns, inode, NULL, mode);
>>>> fs/ocfs2/dlmfs/dlmfs.c: inode_init_owner(&init_user_ns, inode, parent, mode);
>>>> fs/ocfs2/namei.c:       inode_init_owner(&init_user_ns, inode, dir, mode);
>>>> fs/omfs/inode.c:        inode_init_owner(&init_user_ns, inode, NULL, mode);
>>>> fs/overlayfs/dir.c:     inode_init_owner(&init_user_ns, inode, dentry->d_parent->d_inode, mode);
>>>> fs/ramfs/inode.c:               inode_init_owner(&init_user_ns, inode, dir, mode);
>>>> fs/reiserfs/namei.c:    inode_init_owner(&init_user_ns, inode, dir, mode);
>>>> fs/sysv/ialloc.c:       inode_init_owner(&init_user_ns, inode, dir, mode);
>>>> fs/ubifs/dir.c: inode_init_owner(&init_user_ns, inode, dir, mode);
>>>> fs/udf/ialloc.c:        inode_init_owner(&init_user_ns, inode, dir, mode);
>>>> fs/ufs/ialloc.c:        inode_init_owner(&init_user_ns, inode, dir, mode);
>>>> fs/xfs/xfs_inode.c:             inode_init_owner(mnt_userns, inode, dir, mode);
>>>> fs/zonefs/super.c:      inode_init_owner(&init_user_ns, inode, parent, S_IFDIR | 0555);
>>>> kernel/bpf/inode.c:     inode_init_owner(&init_user_ns, inode, dir, mode);
>>>> mm/shmem.c:             inode_init_owner(&init_user_ns, inode, dir, mode);
>>>> "
>>>>
>>>> They are used in filesystem to init new inode function and these init inode
>>>> functions are used by following operations:
>>>> mkdir
>>>> symlink
>>>> mknod
>>>> create
>>>> tmpfile
>>>> rename
>>>>
>>>> We don't care about mkdir because we don't strip SGID bit for directory except
>>>> fs.xfs.irix_sgid_inherit. But we even call prepare_mode() in do_mkdirat() since
>>>> inode_sgid_strip() will skip directories anyway. This will enforce the same
>>>> ordering for all relevant operations and it will make the code more uniform and
>>>> easier to understand by using new helper prepare_mode().
>>>>
>>>> symlink and rename only use valid mode that doesn't have SGID bit.
>>>>
>>>> We have added inode_sgid_strip api for the remaining operations.
>>>>
>>>> In addition to the above six operations, four filesystems has a little difference
>>>> 1) btrfs has btrfs_create_subvol_root to create new inode but used non SGID bit
>>>>      mode and can ignore
>>>> 2) ocfs2 reflink function should add inode_sgid_strip api manually because we
>>>>      don't add it in vfs
>>>> 3) spufs which doesn't really go hrough the regular VFS callpath because it has
>>>>      separate system call spu_create, but it t only allows the creation of
>>>>      directories and only allows bits in 0777 and can ignore
>>>> 4) bpf use vfs_mkobj in bpf_obj_do_pin with
>>>>      "S_IFREG | ((S_IRUSR | S_IWUSR)&   ~current_umask()) mode and
>>>>      use bpf_mkobj_ops in bpf_iter_link_pin_kernel with S_IFREG | S_IRUSR mode,
>>>>      so bpf is also not affected
>>>>
>>>> This patch also changed grpid behaviour for ext4/xfs because the mode passed to
>>>> them may been changed by inode_sgid_strip.
>>>>
>>>> Also as Christian Brauner said"
>>>> The patch itself is useful as it would move a security sensitive operation that is
>>>> currently burried in individual filesystems into the vfs layer. But it has a decent
>>>> regression  potential since it might strip filesystems that have so far relied on
>>>> getting the S_ISGID bit with a mode argument. So this needs a lot of testing and
>>>> long exposure in -next for at least one full kernel cycle."
>>>>
>>>> Suggested-by: Dave Chinner<david@fromorbit.com>
>>>> Signed-off-by: Yang Xu<xuyang2018.jy@fujitsu.com>
>>>> ---
>>>> v4->v5:
>>>> put inode_sgid_strip before the inode_init_owner in ocfs2 filesystem
>>>> because the inode->i_mode's assignment is in inode_init_owner
>>>>    fs/inode.c         |  2 --
>>>>    fs/namei.c         | 22 +++++++++-------------
>>>>    fs/ocfs2/namei.c   |  1 +
>>>>    include/linux/fs.h | 11 +++++++++++
>>>>    4 files changed, 21 insertions(+), 15 deletions(-)
>>>>
>>>> diff --git a/fs/inode.c b/fs/inode.c
>>>> index 57130e4ef8b4..95667e634bd4 100644
>>>> --- a/fs/inode.c
>>>> +++ b/fs/inode.c
>>>> @@ -2246,8 +2246,6 @@ void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
>>>>    		/* Directories are special, and always inherit S_ISGID */
>>>>    		if (S_ISDIR(mode))
>>>>    			mode |= S_ISGID;
>>>> -		else
>>>> -			mode = inode_sgid_strip(mnt_userns, dir, mode);
>>>>    	} else
>>>>    		inode_fsgid_set(inode, mnt_userns);
>>>>    	inode->i_mode = mode;
>>>> diff --git a/fs/namei.c b/fs/namei.c
>>>> index 73646e28fae0..5b8e6288d503 100644
>>>> --- a/fs/namei.c
>>>> +++ b/fs/namei.c
>>>> @@ -3287,8 +3287,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
>>>>    	if (open_flag&   O_CREAT) {
>>>>    		if (open_flag&   O_EXCL)
>>>>    			open_flag&= ~O_TRUNC;
>>>> -		if (!IS_POSIXACL(dir->d_inode))
>>>> -			mode&= ~current_umask();
>>>> +		mode = prepare_mode(mnt_userns, dir->d_inode, mode);
>>>>    		if (likely(got_write))
>>>>    			create_error = may_o_create(mnt_userns,&nd->path,
>>>>    						    dentry, mode);
>>>> @@ -3521,8 +3520,7 @@ struct dentry *vfs_tmpfile(struct user_namespace *mnt_userns,
>>>>    	child = d_alloc(dentry,&slash_name);
>>>>    	if (unlikely(!child))
>>>>    		goto out_err;
>>>> -	if (!IS_POSIXACL(dir))
>>>> -		mode&= ~current_umask();
>>>> +	mode = prepare_mode(mnt_userns, dir, mode);
>>>>    	error = dir->i_op->tmpfile(mnt_userns, dir, child, mode);
>>>>    	if (error)
>>>>    		goto out_err;
>>>> @@ -3850,13 +3848,12 @@ static int do_mknodat(int dfd, struct filename *name, umode_t mode,
>>>>    	if (IS_ERR(dentry))
>>>>    		goto out1;
>>>>
>>>> -	if (!IS_POSIXACL(path.dentry->d_inode))
>>>> -		mode&= ~current_umask();
>>>> +	mnt_userns = mnt_user_ns(path.mnt);
>>>> +	mode = prepare_mode(mnt_userns, path.dentry->d_inode, mode);
>>>>    	error = security_path_mknod(&path, dentry, mode, dev);
>>>>    	if (error)
>>>>    		goto out2;
>>>>
>>>> -	mnt_userns = mnt_user_ns(path.mnt);
>>>>    	switch (mode&   S_IFMT) {
>>>>    		case 0: case S_IFREG:
>>>>    			error = vfs_create(mnt_userns, path.dentry->d_inode,
>>>> @@ -3943,6 +3940,7 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
>>>>    	struct path path;
>>>>    	int error;
>>>>    	unsigned int lookup_flags = LOOKUP_DIRECTORY;
>>>> +	struct user_namespace *mnt_userns;
>>>>
>>>>    retry:
>>>>    	dentry = filename_create(dfd, name,&path, lookup_flags);
>>>> @@ -3950,15 +3948,13 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
>>>>    	if (IS_ERR(dentry))
>>>>    		goto out_putname;
>>>>
>>>> -	if (!IS_POSIXACL(path.dentry->d_inode))
>>>> -		mode&= ~current_umask();
>>>> +	mnt_userns = mnt_user_ns(path.mnt);
>>>> +	mode = prepare_mode(mnt_userns, path.dentry->d_inode, mode);
>>>>    	error = security_path_mkdir(&path, dentry, mode);
>>>> -	if (!error) {
>>>> -		struct user_namespace *mnt_userns;
>>>> -		mnt_userns = mnt_user_ns(path.mnt);
>>>> +	if (!error)
>>>>    		error = vfs_mkdir(mnt_userns, path.dentry->d_inode, dentry,
>>>>    				  mode);
>>>> -	}
>>>> +
>>>>    	done_path_create(&path, dentry);
>>>>    	if (retry_estale(error, lookup_flags)) {
>>>>    		lookup_flags |= LOOKUP_REVAL;
>>>> diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
>>>> index c75fd54b9185..21f3da2e66c9 100644
>>>> --- a/fs/ocfs2/namei.c
>>>> +++ b/fs/ocfs2/namei.c
>>>> @@ -197,6 +197,7 @@ static struct inode *ocfs2_get_init_inode(struct inode *dir, umode_t mode)
>>>>    	 * callers. */
>>>>    	if (S_ISDIR(mode))
>>>>    		set_nlink(inode, 2);
>>>> +	mode = inode_sgid_strip(&init_user_ns, dir, mode);
>>>>    	inode_init_owner(&init_user_ns, inode, dir, mode);
>>>
>>> For the record, I'm not too fond of this separate invocation of
>>> inode_sgid_strip() but since it's only one location this might be fine.
>>> If there's more than one location a separate helper should exist for
>>> this that abstracts this away for the filesystem.
>> Agree. This case only be found when using OCFS2_IOC_REFLINK ioctl. And
>
> (I have one very minor non-technical ask: can you please make sure to
> leave an empty line between the text you're citing and your reply? It
> would make reading your replies a lot easier. :))

Of course.

>
>> other support reflink filesystem(xfs, btrfs) they use  FICLONE or
>> FICLONERANGE ioctl.
>>
>> Since ocfs2 has supported reflink by using it remap_file_range, should
>> we still need this ioctl?
>>
>> commit bd50873dc725a9fa72592ecc986c58805e823051
>> Author: Tao Ma<tao.ma@oracle.com>
>> Date:   Mon Sep 21 11:25:14 2009 +0800
>>
>>       ocfs2: Add ioctl for reflink.
>>
>>       The ioctl will take 3 parameters: old_path, new_path and
>>       preserve and call vfs_reflink. It is useful when we backport
>>       reflink features to old kernels.
>>
>>       Signed-off-by: Tao Ma<tao.ma@oracle.com>
>>
>> Of course, this is a problem doesn't belong to this series.
>>
>>>
>>> Two questions:
>>> - Sould this call prepare_mode(), i.e. should we honor umasks here too?
>> IMO, it desn't need to honor umask. Because reflink only will update
>> inode_imode by setattr to strip S_ISGID and S_ISUID instead of creating
>> a file.
>
> I had a misconception here because I got confused by ocfs2.
> While the OCFS2_IOC_REFLINK ioctl creates the target file itself and the
> reflink the FICLONE and FICLONERANGE _only_ create the reflink.
>
> So the target file itself must've been created prior to
> FICLONE/FICLONERANGE which means basic setgid stripping should've been
> done when the file was created.

Yes.

>
> Since ocfs2 reflink callpaths work very differently we can't switch it
> to FICLONE/FICLONERANGE as this would regress ocfs2 users.
>
>>> - How is the sgid bit handled when creating reflinks on other reflink
>>>     supporting filesystems such as xfs and btrfs?
>> xfstests has a test case generic/673 for this, so btrfs and xfs should
>> have the same behavior.
>> I look into xfs code.
>>
>> Firstly
>>
>> If we don't have CAP_FSETID and it is a regulre file,also have sgid bit,
>> then should_remove_suid will give attr a ATTR_KILL_SGID mask.
>
> What you're referring to below is privilege stripping when _modifying
> the content_ of set{g,u}id binaries.
>
> That happens e.g. during write() or indeed a reflink creation via
> FICLONE/FICLONERANGE. For the latter privilege stripping happens when
> only some extents are reflinked and not the whole file. So that's:
>
> * ioctl_file_clone()
>    * vfs_clone_file_range()
>      * do_clone_file_range()
>        * generic_file_rw_checks()
>        * remap_verify_area()
>          * security_file_permission()
>        * ->remap_file_range() [1]
>
> [1]:
> btrfs:
> * btrfs_remap_file_range()
>    * generic_remap_file_range_prep()
>      * file_modified() // privilege stripping
>
> ocfs2:
> * ocfs2_remap_file_range()
>    * generic_remap_file_range_prep()
>      * file_modified() // privilege stripping
>
> xfs:
> * xfs_remap_file_range()
>    * generic_remap_file_range_prep()
>      * file_modified() // privilege stripping
>
> // This is stacked, i.e. ovl filesystems will rely on the filesystem
> // used for the upper mount and it's .remap_file_range() implementation.
> overlayfs:
> * ovl_remap_file_range()
>    * vfs_clone_file_range()
>
> The other two implementers are:
>
> cifs:
> * cifs_remap_file_range()
>
> nfs:
> * nfs42_remap_file_range()
>
> both of which don't call into generic_remap_file_range_prep() and so
> file_modified() isn't called. But they are netfses and there's a server
> involved. In general, this isn't really a concern for this patchset.
> (But something to potentially to look into in the future.)
>
> And it wasn't what I was worried about.
>
> The crucial information that we needed was whether reflink callpaths
> other than ocfs2 create files themselves and therefore might rely
> implicitly on setgid stripping in inode_init_owner().
>
> Now that we looked at all callers we can be confident that this isn't
> the case. _Apart from ocfs2_ but which you handle in the patchset.

If I understand correctly, this stripping code is ok. So can I send a v6 
patch?
Christian Brauner April 22, 2022, 10:49 a.m. UTC | #5
On Fri, Apr 22, 2022 at 10:13:52AM +0000, xuyang2018.jy@fujitsu.com wrote:
> on 2022/4/22 17:47, Christian Brauner wrote:
> > On Fri, Apr 22, 2022 at 06:03:09AM +0000, xuyang2018.jy@fujitsu.com wrote:
> >> on 2022/4/21 16:35, Christian Brauner wrote:
> >>> On Thu, Apr 21, 2022 at 03:54:17PM +0800, Yang Xu wrote:
> >>>> Currently, vfs only passes mode argument to filesystem, then use inode_init_owner()
> >>>> to strip S_ISGID. Some filesystem(ie ext4/btrfs) will call inode_init_owner
> >>>> firstly, then posxi acl setup, but xfs uses the contrary order. It will
> >>>> affect S_ISGID clear especially we filter S_IXGRP by umask or acl.
> >>>>
> >>>> Regardless of which filesystem is in use, failure to strip the SGID correctly
> >>>> is considered a security failure that needs to be fixed. The current VFS
> >>>> infrastructure requires the filesystem to do everything right and not step on
> >>>> any landmines to strip the SGID bit, when in fact it can easily be done at the
> >>>> VFS and the filesystems then don't even need to be aware that the SGID needs
> >>>> to be (or has been stripped) by the operation the user asked to be done.
> >>>>
> >>>> Vfs has all the info it needs - it doesn't need the filesystems to do everything
> >>>> correctly with the mode and ensuring that they order things like posix acl setup
> >>>> functions correctly with inode_init_owner() to strip the SGID bit.
> >>>>
> >>>> Just strip the SGID bit at the VFS, and then the filesystem can't get it wrong.
> >>>>
> >>>> Also, the inode_sgid_strip() api should be used before IS_POSIXACL() because
> >>>> this api may change mode.
> >>>>
> >>>> Only the following places use inode_init_owner
> >>>> "
> >>>> arch/powerpc/platforms/cell/spufs/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode | S_IFDIR);
> >>>> arch/powerpc/platforms/cell/spufs/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode | S_IFDIR);
> >>>> fs/9p/vfs_inode.c:      inode_init_owner(&init_user_ns, inode, NULL, mode);
> >>>> fs/bfs/dir.c:   inode_init_owner(&init_user_ns, inode, dir, mode);
> >>>> fs/btrfs/inode.c:       inode_init_owner(mnt_userns, inode, dir, mode);
> >>>> fs/btrfs/tests/btrfs-tests.c:   inode_init_owner(&init_user_ns, inode, NULL, S_IFREG);
> >>>> fs/ext2/ialloc.c:               inode_init_owner(&init_user_ns, inode, dir, mode);
> >>>> fs/ext4/ialloc.c:               inode_init_owner(mnt_userns, inode, dir, mode);
> >>>> fs/f2fs/namei.c:        inode_init_owner(mnt_userns, inode, dir, mode);
> >>>> fs/hfsplus/inode.c:     inode_init_owner(&init_user_ns, inode, dir, mode);
> >>>> fs/hugetlbfs/inode.c:           inode_init_owner(&init_user_ns, inode, dir, mode);
> >>>> fs/jfs/jfs_inode.c:     inode_init_owner(&init_user_ns, inode, parent, mode);
> >>>> fs/minix/bitmap.c:      inode_init_owner(&init_user_ns, inode, dir, mode);
> >>>> fs/nilfs2/inode.c:      inode_init_owner(&init_user_ns, inode, dir, mode);
> >>>> fs/ntfs3/inode.c:       inode_init_owner(mnt_userns, inode, dir, mode);
> >>>> fs/ocfs2/dlmfs/dlmfs.c:         inode_init_owner(&init_user_ns, inode, NULL, mode);
> >>>> fs/ocfs2/dlmfs/dlmfs.c: inode_init_owner(&init_user_ns, inode, parent, mode);
> >>>> fs/ocfs2/namei.c:       inode_init_owner(&init_user_ns, inode, dir, mode);
> >>>> fs/omfs/inode.c:        inode_init_owner(&init_user_ns, inode, NULL, mode);
> >>>> fs/overlayfs/dir.c:     inode_init_owner(&init_user_ns, inode, dentry->d_parent->d_inode, mode);
> >>>> fs/ramfs/inode.c:               inode_init_owner(&init_user_ns, inode, dir, mode);
> >>>> fs/reiserfs/namei.c:    inode_init_owner(&init_user_ns, inode, dir, mode);
> >>>> fs/sysv/ialloc.c:       inode_init_owner(&init_user_ns, inode, dir, mode);
> >>>> fs/ubifs/dir.c: inode_init_owner(&init_user_ns, inode, dir, mode);
> >>>> fs/udf/ialloc.c:        inode_init_owner(&init_user_ns, inode, dir, mode);
> >>>> fs/ufs/ialloc.c:        inode_init_owner(&init_user_ns, inode, dir, mode);
> >>>> fs/xfs/xfs_inode.c:             inode_init_owner(mnt_userns, inode, dir, mode);
> >>>> fs/zonefs/super.c:      inode_init_owner(&init_user_ns, inode, parent, S_IFDIR | 0555);
> >>>> kernel/bpf/inode.c:     inode_init_owner(&init_user_ns, inode, dir, mode);
> >>>> mm/shmem.c:             inode_init_owner(&init_user_ns, inode, dir, mode);
> >>>> "
> >>>>
> >>>> They are used in filesystem to init new inode function and these init inode
> >>>> functions are used by following operations:
> >>>> mkdir
> >>>> symlink
> >>>> mknod
> >>>> create
> >>>> tmpfile
> >>>> rename
> >>>>
> >>>> We don't care about mkdir because we don't strip SGID bit for directory except
> >>>> fs.xfs.irix_sgid_inherit. But we even call prepare_mode() in do_mkdirat() since
> >>>> inode_sgid_strip() will skip directories anyway. This will enforce the same
> >>>> ordering for all relevant operations and it will make the code more uniform and
> >>>> easier to understand by using new helper prepare_mode().
> >>>>
> >>>> symlink and rename only use valid mode that doesn't have SGID bit.
> >>>>
> >>>> We have added inode_sgid_strip api for the remaining operations.
> >>>>
> >>>> In addition to the above six operations, four filesystems has a little difference
> >>>> 1) btrfs has btrfs_create_subvol_root to create new inode but used non SGID bit
> >>>>      mode and can ignore
> >>>> 2) ocfs2 reflink function should add inode_sgid_strip api manually because we
> >>>>      don't add it in vfs
> >>>> 3) spufs which doesn't really go hrough the regular VFS callpath because it has
> >>>>      separate system call spu_create, but it t only allows the creation of
> >>>>      directories and only allows bits in 0777 and can ignore
> >>>> 4) bpf use vfs_mkobj in bpf_obj_do_pin with
> >>>>      "S_IFREG | ((S_IRUSR | S_IWUSR)&   ~current_umask()) mode and
> >>>>      use bpf_mkobj_ops in bpf_iter_link_pin_kernel with S_IFREG | S_IRUSR mode,
> >>>>      so bpf is also not affected
> >>>>
> >>>> This patch also changed grpid behaviour for ext4/xfs because the mode passed to
> >>>> them may been changed by inode_sgid_strip.
> >>>>
> >>>> Also as Christian Brauner said"
> >>>> The patch itself is useful as it would move a security sensitive operation that is
> >>>> currently burried in individual filesystems into the vfs layer. But it has a decent
> >>>> regression  potential since it might strip filesystems that have so far relied on
> >>>> getting the S_ISGID bit with a mode argument. So this needs a lot of testing and
> >>>> long exposure in -next for at least one full kernel cycle."
> >>>>
> >>>> Suggested-by: Dave Chinner<david@fromorbit.com>
> >>>> Signed-off-by: Yang Xu<xuyang2018.jy@fujitsu.com>
> >>>> ---
> >>>> v4->v5:
> >>>> put inode_sgid_strip before the inode_init_owner in ocfs2 filesystem
> >>>> because the inode->i_mode's assignment is in inode_init_owner
> >>>>    fs/inode.c         |  2 --
> >>>>    fs/namei.c         | 22 +++++++++-------------
> >>>>    fs/ocfs2/namei.c   |  1 +
> >>>>    include/linux/fs.h | 11 +++++++++++
> >>>>    4 files changed, 21 insertions(+), 15 deletions(-)
> >>>>
> >>>> diff --git a/fs/inode.c b/fs/inode.c
> >>>> index 57130e4ef8b4..95667e634bd4 100644
> >>>> --- a/fs/inode.c
> >>>> +++ b/fs/inode.c
> >>>> @@ -2246,8 +2246,6 @@ void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
> >>>>    		/* Directories are special, and always inherit S_ISGID */
> >>>>    		if (S_ISDIR(mode))
> >>>>    			mode |= S_ISGID;
> >>>> -		else
> >>>> -			mode = inode_sgid_strip(mnt_userns, dir, mode);
> >>>>    	} else
> >>>>    		inode_fsgid_set(inode, mnt_userns);
> >>>>    	inode->i_mode = mode;
> >>>> diff --git a/fs/namei.c b/fs/namei.c
> >>>> index 73646e28fae0..5b8e6288d503 100644
> >>>> --- a/fs/namei.c
> >>>> +++ b/fs/namei.c
> >>>> @@ -3287,8 +3287,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
> >>>>    	if (open_flag&   O_CREAT) {
> >>>>    		if (open_flag&   O_EXCL)
> >>>>    			open_flag&= ~O_TRUNC;
> >>>> -		if (!IS_POSIXACL(dir->d_inode))
> >>>> -			mode&= ~current_umask();
> >>>> +		mode = prepare_mode(mnt_userns, dir->d_inode, mode);
> >>>>    		if (likely(got_write))
> >>>>    			create_error = may_o_create(mnt_userns,&nd->path,
> >>>>    						    dentry, mode);
> >>>> @@ -3521,8 +3520,7 @@ struct dentry *vfs_tmpfile(struct user_namespace *mnt_userns,
> >>>>    	child = d_alloc(dentry,&slash_name);
> >>>>    	if (unlikely(!child))
> >>>>    		goto out_err;
> >>>> -	if (!IS_POSIXACL(dir))
> >>>> -		mode&= ~current_umask();
> >>>> +	mode = prepare_mode(mnt_userns, dir, mode);
> >>>>    	error = dir->i_op->tmpfile(mnt_userns, dir, child, mode);
> >>>>    	if (error)
> >>>>    		goto out_err;
> >>>> @@ -3850,13 +3848,12 @@ static int do_mknodat(int dfd, struct filename *name, umode_t mode,
> >>>>    	if (IS_ERR(dentry))
> >>>>    		goto out1;
> >>>>
> >>>> -	if (!IS_POSIXACL(path.dentry->d_inode))
> >>>> -		mode&= ~current_umask();
> >>>> +	mnt_userns = mnt_user_ns(path.mnt);
> >>>> +	mode = prepare_mode(mnt_userns, path.dentry->d_inode, mode);
> >>>>    	error = security_path_mknod(&path, dentry, mode, dev);
> >>>>    	if (error)
> >>>>    		goto out2;
> >>>>
> >>>> -	mnt_userns = mnt_user_ns(path.mnt);
> >>>>    	switch (mode&   S_IFMT) {
> >>>>    		case 0: case S_IFREG:
> >>>>    			error = vfs_create(mnt_userns, path.dentry->d_inode,
> >>>> @@ -3943,6 +3940,7 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
> >>>>    	struct path path;
> >>>>    	int error;
> >>>>    	unsigned int lookup_flags = LOOKUP_DIRECTORY;
> >>>> +	struct user_namespace *mnt_userns;
> >>>>
> >>>>    retry:
> >>>>    	dentry = filename_create(dfd, name,&path, lookup_flags);
> >>>> @@ -3950,15 +3948,13 @@ int do_mkdirat(int dfd, struct filename *name, umode_t mode)
> >>>>    	if (IS_ERR(dentry))
> >>>>    		goto out_putname;
> >>>>
> >>>> -	if (!IS_POSIXACL(path.dentry->d_inode))
> >>>> -		mode&= ~current_umask();
> >>>> +	mnt_userns = mnt_user_ns(path.mnt);
> >>>> +	mode = prepare_mode(mnt_userns, path.dentry->d_inode, mode);
> >>>>    	error = security_path_mkdir(&path, dentry, mode);
> >>>> -	if (!error) {
> >>>> -		struct user_namespace *mnt_userns;
> >>>> -		mnt_userns = mnt_user_ns(path.mnt);
> >>>> +	if (!error)
> >>>>    		error = vfs_mkdir(mnt_userns, path.dentry->d_inode, dentry,
> >>>>    				  mode);
> >>>> -	}
> >>>> +
> >>>>    	done_path_create(&path, dentry);
> >>>>    	if (retry_estale(error, lookup_flags)) {
> >>>>    		lookup_flags |= LOOKUP_REVAL;
> >>>> diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
> >>>> index c75fd54b9185..21f3da2e66c9 100644
> >>>> --- a/fs/ocfs2/namei.c
> >>>> +++ b/fs/ocfs2/namei.c
> >>>> @@ -197,6 +197,7 @@ static struct inode *ocfs2_get_init_inode(struct inode *dir, umode_t mode)
> >>>>    	 * callers. */
> >>>>    	if (S_ISDIR(mode))
> >>>>    		set_nlink(inode, 2);
> >>>> +	mode = inode_sgid_strip(&init_user_ns, dir, mode);
> >>>>    	inode_init_owner(&init_user_ns, inode, dir, mode);
> >>>
> >>> For the record, I'm not too fond of this separate invocation of
> >>> inode_sgid_strip() but since it's only one location this might be fine.
> >>> If there's more than one location a separate helper should exist for
> >>> this that abstracts this away for the filesystem.
> >> Agree. This case only be found when using OCFS2_IOC_REFLINK ioctl. And
> >
> > (I have one very minor non-technical ask: can you please make sure to
> > leave an empty line between the text you're citing and your reply? It
> > would make reading your replies a lot easier. :))
> 
> Of course.
> 
> >
> >> other support reflink filesystem(xfs, btrfs) they use  FICLONE or
> >> FICLONERANGE ioctl.
> >>
> >> Since ocfs2 has supported reflink by using it remap_file_range, should
> >> we still need this ioctl?
> >>
> >> commit bd50873dc725a9fa72592ecc986c58805e823051
> >> Author: Tao Ma<tao.ma@oracle.com>
> >> Date:   Mon Sep 21 11:25:14 2009 +0800
> >>
> >>       ocfs2: Add ioctl for reflink.
> >>
> >>       The ioctl will take 3 parameters: old_path, new_path and
> >>       preserve and call vfs_reflink. It is useful when we backport
> >>       reflink features to old kernels.
> >>
> >>       Signed-off-by: Tao Ma<tao.ma@oracle.com>
> >>
> >> Of course, this is a problem doesn't belong to this series.
> >>
> >>>
> >>> Two questions:
> >>> - Sould this call prepare_mode(), i.e. should we honor umasks here too?
> >> IMO, it desn't need to honor umask. Because reflink only will update
> >> inode_imode by setattr to strip S_ISGID and S_ISUID instead of creating
> >> a file.
> >
> > I had a misconception here because I got confused by ocfs2.
> > While the OCFS2_IOC_REFLINK ioctl creates the target file itself and the
> > reflink the FICLONE and FICLONERANGE _only_ create the reflink.
> >
> > So the target file itself must've been created prior to
> > FICLONE/FICLONERANGE which means basic setgid stripping should've been
> > done when the file was created.
> 
> Yes.
> 
> >
> > Since ocfs2 reflink callpaths work very differently we can't switch it
> > to FICLONE/FICLONERANGE as this would regress ocfs2 users.
> >
> >>> - How is the sgid bit handled when creating reflinks on other reflink
> >>>     supporting filesystems such as xfs and btrfs?
> >> xfstests has a test case generic/673 for this, so btrfs and xfs should
> >> have the same behavior.
> >> I look into xfs code.
> >>
> >> Firstly
> >>
> >> If we don't have CAP_FSETID and it is a regulre file,also have sgid bit,
> >> then should_remove_suid will give attr a ATTR_KILL_SGID mask.
> >
> > What you're referring to below is privilege stripping when _modifying
> > the content_ of set{g,u}id binaries.
> >
> > That happens e.g. during write() or indeed a reflink creation via
> > FICLONE/FICLONERANGE. For the latter privilege stripping happens when
> > only some extents are reflinked and not the whole file. So that's:
> >
> > * ioctl_file_clone()
> >    * vfs_clone_file_range()
> >      * do_clone_file_range()
> >        * generic_file_rw_checks()
> >        * remap_verify_area()
> >          * security_file_permission()
> >        * ->remap_file_range() [1]
> >
> > [1]:
> > btrfs:
> > * btrfs_remap_file_range()
> >    * generic_remap_file_range_prep()
> >      * file_modified() // privilege stripping
> >
> > ocfs2:
> > * ocfs2_remap_file_range()
> >    * generic_remap_file_range_prep()
> >      * file_modified() // privilege stripping
> >
> > xfs:
> > * xfs_remap_file_range()
> >    * generic_remap_file_range_prep()
> >      * file_modified() // privilege stripping
> >
> > // This is stacked, i.e. ovl filesystems will rely on the filesystem
> > // used for the upper mount and it's .remap_file_range() implementation.
> > overlayfs:
> > * ovl_remap_file_range()
> >    * vfs_clone_file_range()
> >
> > The other two implementers are:
> >
> > cifs:
> > * cifs_remap_file_range()
> >
> > nfs:
> > * nfs42_remap_file_range()
> >
> > both of which don't call into generic_remap_file_range_prep() and so
> > file_modified() isn't called. But they are netfses and there's a server
> > involved. In general, this isn't really a concern for this patchset.
> > (But something to potentially to look into in the future.)
> >
> > And it wasn't what I was worried about.
> >
> > The crucial information that we needed was whether reflink callpaths
> > other than ocfs2 create files themselves and therefore might rely
> > implicitly on setgid stripping in inode_init_owner().
> >
> > Now that we looked at all callers we can be confident that this isn't
> > the case. _Apart from ocfs2_ but which you handle in the patchset.
> 
> If I understand correctly, this stripping code is ok. So can I send a v6 
> patch?

Yes.
diff mbox series

Patch

diff --git a/fs/inode.c b/fs/inode.c
index 57130e4ef8b4..95667e634bd4 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2246,8 +2246,6 @@  void inode_init_owner(struct user_namespace *mnt_userns, struct inode *inode,
 		/* Directories are special, and always inherit S_ISGID */
 		if (S_ISDIR(mode))
 			mode |= S_ISGID;
-		else
-			mode = inode_sgid_strip(mnt_userns, dir, mode);
 	} else
 		inode_fsgid_set(inode, mnt_userns);
 	inode->i_mode = mode;
diff --git a/fs/namei.c b/fs/namei.c
index 73646e28fae0..5b8e6288d503 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3287,8 +3287,7 @@  static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 	if (open_flag & O_CREAT) {
 		if (open_flag & O_EXCL)
 			open_flag &= ~O_TRUNC;
-		if (!IS_POSIXACL(dir->d_inode))
-			mode &= ~current_umask();
+		mode = prepare_mode(mnt_userns, dir->d_inode, mode);
 		if (likely(got_write))
 			create_error = may_o_create(mnt_userns, &nd->path,
 						    dentry, mode);
@@ -3521,8 +3520,7 @@  struct dentry *vfs_tmpfile(struct user_namespace *mnt_userns,
 	child = d_alloc(dentry, &slash_name);
 	if (unlikely(!child))
 		goto out_err;
-	if (!IS_POSIXACL(dir))
-		mode &= ~current_umask();
+	mode = prepare_mode(mnt_userns, dir, mode);
 	error = dir->i_op->tmpfile(mnt_userns, dir, child, mode);
 	if (error)
 		goto out_err;
@@ -3850,13 +3848,12 @@  static int do_mknodat(int dfd, struct filename *name, umode_t mode,
 	if (IS_ERR(dentry))
 		goto out1;
 
-	if (!IS_POSIXACL(path.dentry->d_inode))
-		mode &= ~current_umask();
+	mnt_userns = mnt_user_ns(path.mnt);
+	mode = prepare_mode(mnt_userns, path.dentry->d_inode, mode);
 	error = security_path_mknod(&path, dentry, mode, dev);
 	if (error)
 		goto out2;
 
-	mnt_userns = mnt_user_ns(path.mnt);
 	switch (mode & S_IFMT) {
 		case 0: case S_IFREG:
 			error = vfs_create(mnt_userns, path.dentry->d_inode,
@@ -3943,6 +3940,7 @@  int do_mkdirat(int dfd, struct filename *name, umode_t mode)
 	struct path path;
 	int error;
 	unsigned int lookup_flags = LOOKUP_DIRECTORY;
+	struct user_namespace *mnt_userns;
 
 retry:
 	dentry = filename_create(dfd, name, &path, lookup_flags);
@@ -3950,15 +3948,13 @@  int do_mkdirat(int dfd, struct filename *name, umode_t mode)
 	if (IS_ERR(dentry))
 		goto out_putname;
 
-	if (!IS_POSIXACL(path.dentry->d_inode))
-		mode &= ~current_umask();
+	mnt_userns = mnt_user_ns(path.mnt);
+	mode = prepare_mode(mnt_userns, path.dentry->d_inode, mode);
 	error = security_path_mkdir(&path, dentry, mode);
-	if (!error) {
-		struct user_namespace *mnt_userns;
-		mnt_userns = mnt_user_ns(path.mnt);
+	if (!error)
 		error = vfs_mkdir(mnt_userns, path.dentry->d_inode, dentry,
 				  mode);
-	}
+
 	done_path_create(&path, dentry);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index c75fd54b9185..21f3da2e66c9 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -197,6 +197,7 @@  static struct inode *ocfs2_get_init_inode(struct inode *dir, umode_t mode)
 	 * callers. */
 	if (S_ISDIR(mode))
 		set_nlink(inode, 2);
+	mode = inode_sgid_strip(&init_user_ns, dir, mode);
 	inode_init_owner(&init_user_ns, inode, dir, mode);
 	status = dquot_initialize(inode);
 	if (status)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 532de76c9b91..0bf81ab71619 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3459,6 +3459,17 @@  static inline bool dir_relax_shared(struct inode *inode)
 	return !IS_DEADDIR(inode);
 }
 
+static inline umode_t prepare_mode(struct user_namespace *mnt_userns,
+				   const struct inode *dir, umode_t mode)
+{
+	mode = inode_sgid_strip(mnt_userns, dir, mode);
+
+	if (!IS_POSIXACL(dir))
+		mode &= current_umask();
+
+	return mode;
+}
+
 extern bool path_noexec(const struct path *path);
 extern void inode_nohighmem(struct inode *inode);