[1/3] fs, xfs: introduce S_IOMAP_IMMUTABLE

Message ID	150135741519.35318.16765137368329971936.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-nvdimm-bounces@lists.01.org> Subject: [PATCH 1/3] fs, xfs: introduce S_IOMAP_IMMUTABLE From: Dan Williams <dan.j.williams@intel.com> To: darrick.wong@oracle.com Date: Sat, 29 Jul 2017 12:43:35 -0700 Message-ID: <150135741519.35318.16765137368329971936.stgit@dwillia2-desk3.amr.corp.intel.com> In-Reply-To: <150135740948.35318.10730072114996910905.stgit@dwillia2-desk3.amr.corp.intel.com> References: <150135740948.35318.10730072114996910905.stgit@dwillia2-desk3.amr.corp.intel.com> User-Agent: StGit/0.17.1-9-g687f MIME-Version: 1.0 Precedence: list Cc: Jan Kara <jack@suse.cz>, linux-nvdimm@lists.01.org, Dave Chinner <david@fromorbit.com>, linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org, Alexander Viro <viro@zeniv.linux.org.uk>, luto@kernel.org, linux-fsdevel@vger.kernel.org, Christoph Hellwig <hch@lst.de> Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: linux-nvdimm-bounces@lists.01.org Sender: "Linux-nvdimm" <linux-nvdimm-bounces@lists.01.org>

Dan Williams July 29, 2017, 7:43 p.m. UTC

An inode with this flag set indicates that the file's block map cannot
be changed, no size change, deletion, hole-punch, range collapse, or
reflink.

The implementation of toggling the flag and sealing the state of the
extent map is saved for a later patch. The functionality provided by
S_IOMAP_IMMUTABLE, once toggle support is added, will be a superset of
that provided by S_SWAPFILE, and it is targeted to replace it.

For now, only xfs and the core vfs are updated to consider the new flag.

The additional check that is added for this flag, beyond what we are
already doing for swapfiles, is to truncate or abort writes that try to
extend the file size.

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Suggested-by: Dave Chinner <david@fromorbit.com>
Suggested-by: "Darrick J. Wong" <darrick.wong@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/attr.c          |   10 ++++++++++
 fs/namei.c         |    3 ++-
 fs/open.c          |    6 ++++++
 fs/read_write.c    |    3 +++
 fs/xfs/xfs_ioctl.c |    6 ++++++
 include/linux/fs.h |    2 ++
 mm/filemap.c       |    9 +++++++++
 7 files changed, 38 insertions(+), 1 deletion(-)

Colin Walters July 31, 2017, 4:02 p.m. UTC | #1

On Sat, Jul 29, 2017, at 03:43 PM, Dan Williams wrote:
> An inode with this flag set indicates that the file's block map cannot
> be changed, no size change, deletion, hole-punch, range collapse, or
> reflink.
> 
> The implementation of toggling the flag and sealing the state of the
> extent map is saved for a later patch. The functionality provided by
> S_IOMAP_IMMUTABLE, once toggle support is added, will be a superset of
> that provided by S_SWAPFILE, and it is targeted to replace it.
> 
> For now, only xfs and the core vfs are updated to consider the new flag.

Quite a while ago I started a request for O_OBJECT:
http://www.spinics.net/lists/linux-fsdevel/msg75085.html
A few months ago I was thinking about that more and realized
it'd likely be more palatable to land as an inode flag, like
you're doing here.

Now, S_IOMAP_IMMUTABLE would be quite close to the semantics
I want for ostree, except we also want to disallow
changes to the inode uid, gid or mode.  (Extended attributes are 
a whole other story; but I'd like to at least disallow changes to the
security. namespace).

The goal here is mostly about resilience to *accidental* changes;
think an admin doing `cp /path/to/binary /usr/bin/bash` which
does open(O_TRUNC), which would hence corrupt all hardlinks.

S_IOMAP_IMMUTABLE would give a lot of great protection against
those types of accidental changes - most of them are either going
to be open(O_TRUNC) or O_APPEND.   Since you're touching various
write paths here, perhaps we can also add
S_CONTENTS_IMMUTABLE or something at the same time?

If this lands as is - I'm quite likely to change ostree to use it;
any objections to that?  As mentioned in the thread, there are several
other cases of "content immutable" files in userspace, such as
QEMU "qcow2", git objects.  And really the most classic example is
/etc/sudoers and the need for a special "visudo" program to really
ensure that editors don't do in-place overwrites.

But it'd be great if we can use this push to also land "content immutabilty"
or however we decide to call it.

Dan Williams July 31, 2017, 4:29 p.m. UTC | #2

On Mon, Jul 31, 2017 at 9:02 AM, Colin Walters <walters@verbum.org> wrote:
> On Sat, Jul 29, 2017, at 03:43 PM, Dan Williams wrote:
>> An inode with this flag set indicates that the file's block map cannot
>> be changed, no size change, deletion, hole-punch, range collapse, or
>> reflink.
>>
>> The implementation of toggling the flag and sealing the state of the
>> extent map is saved for a later patch. The functionality provided by
>> S_IOMAP_IMMUTABLE, once toggle support is added, will be a superset of
>> that provided by S_SWAPFILE, and it is targeted to replace it.
>>
>> For now, only xfs and the core vfs are updated to consider the new flag.
>
> Quite a while ago I started a request for O_OBJECT:
> http://www.spinics.net/lists/linux-fsdevel/msg75085.html
> A few months ago I was thinking about that more and realized
> it'd likely be more palatable to land as an inode flag, like
> you're doing here.
>
> Now, S_IOMAP_IMMUTABLE would be quite close to the semantics
> I want for ostree, except we also want to disallow
> changes to the inode uid, gid or mode.  (Extended attributes are
> a whole other story; but I'd like to at least disallow changes to the
> security. namespace).
>
> The goal here is mostly about resilience to *accidental* changes;
> think an admin doing `cp /path/to/binary /usr/bin/bash` which
> does open(O_TRUNC), which would hence corrupt all hardlinks.
>
> S_IOMAP_IMMUTABLE would give a lot of great protection against
> those types of accidental changes - most of them are either going
> to be open(O_TRUNC) or O_APPEND.   Since you're touching various
> write paths here, perhaps we can also add
> S_CONTENTS_IMMUTABLE or something at the same time?
>
> If this lands as is - I'm quite likely to change ostree to use it;
> any objections to that?  As mentioned in the thread, there are several
> other cases of "content immutable" files in userspace, such as
> QEMU "qcow2", git objects.  And really the most classic example is
> /etc/sudoers and the need for a special "visudo" program to really
> ensure that editors don't do in-place overwrites.
>
> But it'd be great if we can use this push to also land "content immutabilty"
> or however we decide to call it.

How is S_CONTENTS_IMMUTABLE different than S_IMMUTABLE?

Colin Walters July 31, 2017, 4:32 p.m. UTC | #3

On Mon, Jul 31, 2017, at 12:29 PM, Dan Williams wrote:
> 
> How is S_CONTENTS_IMMUTABLE different than S_IMMUTABLE?

We still want the ability to make hardlinks.

Darrick J. Wong July 31, 2017, 4:46 p.m. UTC | #4

On Sat, Jul 29, 2017 at 12:43:35PM -0700, Dan Williams wrote:
> An inode with this flag set indicates that the file's block map cannot
> be changed, no size change, deletion, hole-punch, range collapse, or
> reflink.
> 
> The implementation of toggling the flag and sealing the state of the
> extent map is saved for a later patch. The functionality provided by
> S_IOMAP_IMMUTABLE, once toggle support is added, will be a superset of
> that provided by S_SWAPFILE, and it is targeted to replace it.
> 
> For now, only xfs and the core vfs are updated to consider the new flag.
> 
> The additional check that is added for this flag, beyond what we are
> already doing for swapfiles, is to truncate or abort writes that try to
> extend the file size.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Suggested-by: "Darrick J. Wong" <darrick.wong@oracle.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/attr.c          |   10 ++++++++++
>  fs/namei.c         |    3 ++-
>  fs/open.c          |    6 ++++++
>  fs/read_write.c    |    3 +++
>  fs/xfs/xfs_ioctl.c |    6 ++++++
>  include/linux/fs.h |    2 ++
>  mm/filemap.c       |    9 +++++++++
>  7 files changed, 38 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/attr.c b/fs/attr.c
> index 135304146120..8573e364bd06 100644
> --- a/fs/attr.c
> +++ b/fs/attr.c
> @@ -112,6 +112,16 @@ EXPORT_SYMBOL(setattr_prepare);
>   */
>  int inode_newsize_ok(const struct inode *inode, loff_t offset)
>  {
> +	if (IS_IOMAP_IMMUTABLE(inode)) {
> +		/*
> +		 * Any size change is disallowed. Size increases may
> +		 * dirty metadata that an application is not prepared to
> +		 * sync, and a size decrease may expose free blocks to
> +		 * in-flight DMA.
> +		 */
> +		return -ETXTBSY;
> +	}
> +
>  	if (inode->i_size < offset) {
>  		unsigned long limit;
>  
> diff --git a/fs/namei.c b/fs/namei.c
> index ddb6a7c2b3d4..588f1135c42c 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2793,7 +2793,8 @@ static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
>  		return -EPERM;
>  
>  	if (check_sticky(dir, inode) || IS_APPEND(inode) ||
> -	    IS_IMMUTABLE(inode) || IS_SWAPFILE(inode) || HAS_UNMAPPED_ID(inode))
> +	    IS_IMMUTABLE(inode) || IS_SWAPFILE(inode) || HAS_UNMAPPED_ID(inode)
> +	    || IS_IOMAP_IMMUTABLE(inode))

This caught my eye because why wouldn't we be able to unlink a file
whose block map is immutable?  Link count has "nothing" to do with that.

:)

Hmm... I guess the reasoning here is that if we're IOMAP_IMMUTABLE'd,
we're not allowed to touch the link count since (we assume) nobody's
who's interested in *inode is going to call fsync on it to flush the
metadata to disk?

If that's the case, then shouldn't there be a corresponding may_create
check to prevent us from linking a file into a directory?

(I don't really see why link/unlink shouldn't be allowed...)

>  		return -EPERM;
>  	if (isdir) {
>  		if (!d_is_dir(victim))
> diff --git a/fs/open.c b/fs/open.c
> index 35bb784763a4..7395860d7164 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -292,6 +292,12 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>  		return -ETXTBSY;
>  
>  	/*
> +	 * We cannot allow any allocation changes on an iomap immutable file

If it's a (regular allocation) or (unshare blocks) fallocate call, we
could return zero immediately since no writes will be able to hit
ENOSPC because we know that the block mappings are there and no CoW is
required.

> +	 */
> +	if (IS_IOMAP_IMMUTABLE(inode))
> +		return -ETXTBSY;

I've been wondering in the back of my head if we could allow a
size-extending FALLOC_FL_ZERO_RANGE here that would fsync at the end?
The use case I had in mind was extending files without having to turn
off the IOMAP_IMMUTABLE flag, since (in theory, anyway) we'd bzero() the
memory and then increase i_size, so userspace should never be able to
access storage that isn't totally finalized.

OTOH that also seems like a huge weird loophole in IOMAP_IMMUTABLE, so
maybe everyone would prefer to shoot down this use case now?

> +
> +	/*
>  	 * Revalidate the write permissions, in case security policy has
>  	 * changed since the files were opened.
>  	 */
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 0cc7033aa413..dc673be7c7cb 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1706,6 +1706,9 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
>  	if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out))
>  		return -ETXTBSY;
>  
> +	if (IS_IOMAP_IMMUTABLE(inode_in) || IS_IOMAP_IMMUTABLE(inode_out))
> +		return -ETXTBSY;
> +
>  	/* Don't reflink dirs, pipes, sockets... */
>  	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
>  		return -EISDIR;
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index e75c40a47b7d..2e64488bc4de 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -1755,6 +1755,12 @@ xfs_ioc_swapext(
>  		goto out_put_tmp_file;
>  	}
>  
> +	if (IS_IOMAP_IMMUTABLE(file_inode(f.file)) ||
> +	    IS_IOMAP_IMMUTABLE(file_inode(tmp.file))) {
> +		error = -EINVAL;
> +		goto out_put_tmp_file;
> +	}

Someone's going to have to audit more of the XFS ioctls to make sure
that none of them can touch the block mapping, but as this is an RFC it
doesn't need to be done right this instant.

> +
>  	/*
>  	 * We need to ensure that the fds passed in point to XFS inodes
>  	 * before we cast and access them as XFS structures as we have no
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 6e1fd5d21248..0a254b768855 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1829,6 +1829,7 @@ struct super_operations {
>  #else
>  #define S_DAX		0	/* Make all the DAX code disappear */
>  #endif
> +#define S_IOMAP_IMMUTABLE 16384 /* logical-to-physical extent map is fixed */
>  
>  /*
>   * Note that nosuid etc flags are inode-specific: setting some file-system
> @@ -1867,6 +1868,7 @@ struct super_operations {
>  #define IS_AUTOMOUNT(inode)	((inode)->i_flags & S_AUTOMOUNT)
>  #define IS_NOSEC(inode)		((inode)->i_flags & S_NOSEC)
>  #define IS_DAX(inode)		((inode)->i_flags & S_DAX)
> +#define IS_IOMAP_IMMUTABLE(inode) ((inode)->i_flags & S_IOMAP_IMMUTABLE)
>  
>  #define IS_WHITEOUT(inode)	(S_ISCHR(inode->i_mode) && \
>  				 (inode)->i_rdev == WHITEOUT_DEV)
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a49702445ce0..e4a6529da2bd 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2806,6 +2806,15 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
>  	if (unlikely(pos >= inode->i_sb->s_maxbytes))
>  		return -EFBIG;
>  
> +	/* Are we about to mutate the block map on an immutable file? */
> +	if (IS_IOMAP_IMMUTABLE(inode)
> +			&& (pos + iov_iter_count(from) > i_size_read(inode))) {
> +		if (pos < i_size_read(inode))
> +			iov_iter_truncate(from, i_size_read(inode) - pos);

Writes past the end get truncated to stop at EOF?  I'd have thought that
would be an error since userspace is asking for metadata updates it
previously said it would never need...

--D

> +		else
> +			return -ETXTBSY;
> +	}
> +
>  	iov_iter_truncate(from, inode->i_sb->s_maxbytes - pos);
>  	return iov_iter_count(from);
>  }
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Dan Williams July 31, 2017, 5:32 p.m. UTC | #5

On Mon, Jul 31, 2017 at 9:46 AM, Darrick J. Wong
<darrick.wong@oracle.com> wrote:
> On Sat, Jul 29, 2017 at 12:43:35PM -0700, Dan Williams wrote:
>> An inode with this flag set indicates that the file's block map cannot
>> be changed, no size change, deletion, hole-punch, range collapse, or
>> reflink.
>>
>> The implementation of toggling the flag and sealing the state of the
>> extent map is saved for a later patch. The functionality provided by
>> S_IOMAP_IMMUTABLE, once toggle support is added, will be a superset of
>> that provided by S_SWAPFILE, and it is targeted to replace it.
>>
>> For now, only xfs and the core vfs are updated to consider the new flag.
>>
>> The additional check that is added for this flag, beyond what we are
>> already doing for swapfiles, is to truncate or abort writes that try to
>> extend the file size.
>>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>> Suggested-by: Dave Chinner <david@fromorbit.com>
>> Suggested-by: "Darrick J. Wong" <darrick.wong@oracle.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  fs/attr.c          |   10 ++++++++++
>>  fs/namei.c         |    3 ++-
>>  fs/open.c          |    6 ++++++
>>  fs/read_write.c    |    3 +++
>>  fs/xfs/xfs_ioctl.c |    6 ++++++
>>  include/linux/fs.h |    2 ++
>>  mm/filemap.c       |    9 +++++++++
>>  7 files changed, 38 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/attr.c b/fs/attr.c
>> index 135304146120..8573e364bd06 100644
>> --- a/fs/attr.c
>> +++ b/fs/attr.c
>> @@ -112,6 +112,16 @@ EXPORT_SYMBOL(setattr_prepare);
>>   */
>>  int inode_newsize_ok(const struct inode *inode, loff_t offset)
>>  {
>> +     if (IS_IOMAP_IMMUTABLE(inode)) {
>> +             /*
>> +              * Any size change is disallowed. Size increases may
>> +              * dirty metadata that an application is not prepared to
>> +              * sync, and a size decrease may expose free blocks to
>> +              * in-flight DMA.
>> +              */
>> +             return -ETXTBSY;
>> +     }
>> +
>>       if (inode->i_size < offset) {
>>               unsigned long limit;
>>
>> diff --git a/fs/namei.c b/fs/namei.c
>> index ddb6a7c2b3d4..588f1135c42c 100644
>> --- a/fs/namei.c
>> +++ b/fs/namei.c
>> @@ -2793,7 +2793,8 @@ static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
>>               return -EPERM;
>>
>>       if (check_sticky(dir, inode) || IS_APPEND(inode) ||
>> -         IS_IMMUTABLE(inode) || IS_SWAPFILE(inode) || HAS_UNMAPPED_ID(inode))
>> +         IS_IMMUTABLE(inode) || IS_SWAPFILE(inode) || HAS_UNMAPPED_ID(inode)
>> +         || IS_IOMAP_IMMUTABLE(inode))
>
> This caught my eye because why wouldn't we be able to unlink a file
> whose block map is immutable?  Link count has "nothing" to do with that.
>
> :)
>
> Hmm... I guess the reasoning here is that if we're IOMAP_IMMUTABLE'd,
> we're not allowed to touch the link count since (we assume) nobody's
> who's interested in *inode is going to call fsync on it to flush the
> metadata to disk?
>
> If that's the case, then shouldn't there be a corresponding may_create
> check to prevent us from linking a file into a directory?
>
> (I don't really see why link/unlink shouldn't be allowed...)

True, unlink should be allowed, the blocks will still be immutable as
long as the file is open so I think we're good.

>>               return -EPERM;
>>       if (isdir) {
>>               if (!d_is_dir(victim))
>> diff --git a/fs/open.c b/fs/open.c
>> index 35bb784763a4..7395860d7164 100644
>> --- a/fs/open.c
>> +++ b/fs/open.c
>> @@ -292,6 +292,12 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>>               return -ETXTBSY;
>>
>>       /*
>> +      * We cannot allow any allocation changes on an iomap immutable file
>
> If it's a (regular allocation) or (unshare blocks) fallocate call, we
> could return zero immediately since no writes will be able to hit
> ENOSPC because we know that the block mappings are there and no CoW is
> required.

Sounds good assuming the regular allocation is staying within the
range of the current file size, and fail otherwise.

>
>> +      */
>> +     if (IS_IOMAP_IMMUTABLE(inode))
>> +             return -ETXTBSY;
>
> I've been wondering in the back of my head if we could allow a
> size-extending FALLOC_FL_ZERO_RANGE here that would fsync at the end?
> The use case I had in mind was extending files without having to turn
> off the IOMAP_IMMUTABLE flag, since (in theory, anyway) we'd bzero() the
> memory and then increase i_size, so userspace should never be able to
> access storage that isn't totally finalized.
>
> OTOH that also seems like a huge weird loophole in IOMAP_IMMUTABLE, so
> maybe everyone would prefer to shoot down this use case now?

It seems useful, because the alternative feels more error prone if
applications are forced to toggle IOMAP_IMMUTABLE.

>
>> +
>> +     /*
>>        * Revalidate the write permissions, in case security policy has
>>        * changed since the files were opened.
>>        */
>> diff --git a/fs/read_write.c b/fs/read_write.c
>> index 0cc7033aa413..dc673be7c7cb 100644
>> --- a/fs/read_write.c
>> +++ b/fs/read_write.c
>> @@ -1706,6 +1706,9 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
>>       if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out))
>>               return -ETXTBSY;
>>
>> +     if (IS_IOMAP_IMMUTABLE(inode_in) || IS_IOMAP_IMMUTABLE(inode_out))
>> +             return -ETXTBSY;
>> +
>>       /* Don't reflink dirs, pipes, sockets... */
>>       if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
>>               return -EISDIR;
>> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
>> index e75c40a47b7d..2e64488bc4de 100644
>> --- a/fs/xfs/xfs_ioctl.c
>> +++ b/fs/xfs/xfs_ioctl.c
>> @@ -1755,6 +1755,12 @@ xfs_ioc_swapext(
>>               goto out_put_tmp_file;
>>       }
>>
>> +     if (IS_IOMAP_IMMUTABLE(file_inode(f.file)) ||
>> +         IS_IOMAP_IMMUTABLE(file_inode(tmp.file))) {
>> +             error = -EINVAL;
>> +             goto out_put_tmp_file;
>> +     }
>
> Someone's going to have to audit more of the XFS ioctls to make sure
> that none of them can touch the block mapping, but as this is an RFC it
> doesn't need to be done right this instant.

Speaking of things that need to be done before this mechanism can move
out of RFC, you had mentioned wanting to get your current devel
backlog cleared before moving ahead with this. I can offer some review
cycles, you'll just need to tolerate a higher
question-to-review-comment ratio given my relative XFS experience.

>
>> +
>>       /*
>>        * We need to ensure that the fds passed in point to XFS inodes
>>        * before we cast and access them as XFS structures as we have no
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 6e1fd5d21248..0a254b768855 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -1829,6 +1829,7 @@ struct super_operations {
>>  #else
>>  #define S_DAX                0       /* Make all the DAX code disappear */
>>  #endif
>> +#define S_IOMAP_IMMUTABLE 16384 /* logical-to-physical extent map is fixed */
>>
>>  /*
>>   * Note that nosuid etc flags are inode-specific: setting some file-system
>> @@ -1867,6 +1868,7 @@ struct super_operations {
>>  #define IS_AUTOMOUNT(inode)  ((inode)->i_flags & S_AUTOMOUNT)
>>  #define IS_NOSEC(inode)              ((inode)->i_flags & S_NOSEC)
>>  #define IS_DAX(inode)                ((inode)->i_flags & S_DAX)
>> +#define IS_IOMAP_IMMUTABLE(inode) ((inode)->i_flags & S_IOMAP_IMMUTABLE)
>>
>>  #define IS_WHITEOUT(inode)   (S_ISCHR(inode->i_mode) && \
>>                                (inode)->i_rdev == WHITEOUT_DEV)
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index a49702445ce0..e4a6529da2bd 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -2806,6 +2806,15 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
>>       if (unlikely(pos >= inode->i_sb->s_maxbytes))
>>               return -EFBIG;
>>
>> +     /* Are we about to mutate the block map on an immutable file? */
>> +     if (IS_IOMAP_IMMUTABLE(inode)
>> +                     && (pos + iov_iter_count(from) > i_size_read(inode))) {
>> +             if (pos < i_size_read(inode))
>> +                     iov_iter_truncate(from, i_size_read(inode) - pos);
>
> Writes past the end get truncated to stop at EOF?  I'd have thought that
> would be an error since userspace is asking for metadata updates it
> previously said it would never need...

It's an error if the current position is >= EOF, but otherwise allow a
truncated write. I'm fine either way.

...but then again if we allow FALLOC_FL_ZERO_RANGE to resize the file,
why not synchronous writes too? It turns IOMAP_IMMUTABLE into a flag
that only allows blocks to be added to a file, but not removed. That
would seem to dovetail well with synchronous mmap faults when / if XFS
grows that feature.

Colin Walters July 31, 2017, 5:42 p.m. UTC | #6

On Mon, Jul 31, 2017, at 12:32 PM, Colin Walters wrote:
> On Mon, Jul 31, 2017, at 12:29 PM, Dan Williams wrote:
> > 
> > How is S_CONTENTS_IMMUTABLE different than S_IMMUTABLE?
> 
> We still want the ability to make hardlinks.

Also of course, symmetrically, to unlink.   If we used S_IMMUTABLE for /etc/sudoers,
it'd still be racy since one would have to transiently remove the flag in order
to replace it with a new version.

Related to this topic is the fact that S_IMMUTABLE is itself mutable; I
think once S_IMMUTABLE_CONTENTS is set, it would not be able to made
mutable again.  

Also I just remembered that since then memfd_create() and more notably
fcntl(F_ADD_SEALS) landed - in fact it already has flags for what we want
here AFAICS.  Your S_IOMAP_IMMUTABLE is fcntl(F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW)
and mine just adds in F_SEAL_WRITE.  I think there was some discussion
of the seals for persistent files when memfd_create() landed, but I can't
find it offhand.

Darrick J. Wong July 31, 2017, 6:23 p.m. UTC | #7

On Mon, Jul 31, 2017 at 01:42:13PM -0400, Colin Walters wrote:
> 
> 
> On Mon, Jul 31, 2017, at 12:32 PM, Colin Walters wrote:
> > On Mon, Jul 31, 2017, at 12:29 PM, Dan Williams wrote:
> > > 
> > > How is S_CONTENTS_IMMUTABLE different than S_IMMUTABLE?
> > 
> > We still want the ability to make hardlinks.
> 
> Also of course, symmetrically, to unlink.   If we used S_IMMUTABLE for /etc/sudoers,
> it'd still be racy since one would have to transiently remove the flag in order
> to replace it with a new version.
> 
> Related to this topic is the fact that S_IMMUTABLE is itself mutable; I
> think once S_IMMUTABLE_CONTENTS is set, it would not be able to made
> mutable again.  
> 
> Also I just remembered that since then memfd_create() and more notably
> fcntl(F_ADD_SEALS) landed - in fact it already has flags for what we want
> here AFAICS.  Your S_IOMAP_IMMUTABLE is fcntl(F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW)

I don't think F_SEAL_{SHRINK,GROW} prevents reflinking or CoW of file data,
which are two things that cannot happen under S_IOMAP_IMMUTABLE that
aren't size changes.  From the implementation it looks like shrink and
grow are only supposed to disallow changes to i_size, not i_blocks (or
the file block map).

Then again, I suppose F_SEAL_* only work on shmem, so maybe it simply
isn't defined for any other filesystem...?  e.g. it doesn't prohibit
reflink, but the only fs implementing seals doesn't support reflink.

<shrug>

Seals cannot be removed, which is too strict for the S_IOMAP_IMMUTABLE
user cases being presented.

> and mine just adds in F_SEAL_WRITE.  I think there was some discussion
> of the seals for persistent files when memfd_create() landed, but I can't
> find it offhand.

--D

> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Colin Walters Aug. 1, 2017, 2:15 a.m. UTC | #8

On Mon, Jul 31, 2017, at 02:23 PM, Darrick J. Wong wrote:

> I don't think F_SEAL_{SHRINK,GROW} prevents reflinking or CoW of file data,
> which are two things that cannot happen under S_IOMAP_IMMUTABLE that
> aren't size changes.  From the implementation it looks like shrink and
> grow are only supposed to disallow changes to i_size, not i_blocks (or
> the file block map).

True. 

> Then again, I suppose F_SEAL_* only work on shmem, so maybe it simply
> isn't defined for any other filesystem...?  e.g. it doesn't prohibit
> reflink, but the only fs implementing seals doesn't support reflink.
> 
> <shrug>
> 
> Seals cannot be removed, which is too strict for the S_IOMAP_IMMUTABLE
> user cases being presented.

To be clear, the set of use cases is swap files and DAX, right?  Or is there anything else?
I can't imagine why anyone would want to turn a swap file back into a regular file.
I haven't fully followed DAX, but I'd take your word for it if people want to
be able to remove the flag after.

Anyways, I think your broader point is right; the use cases are different enough
that it doesn't make sense to try to add S_CONTENT_IMMUTABLE (or however 
one decides to call it) at the same time.

Dave Chinner Aug. 1, 2017, 2:42 a.m. UTC | #9

On Mon, Jul 31, 2017 at 10:15:12PM -0400, Colin Walters wrote:
> On Mon, Jul 31, 2017, at 02:23 PM, Darrick J. Wong wrote:
> 
> > I don't think F_SEAL_{SHRINK,GROW} prevents reflinking or CoW of file data,
> > which are two things that cannot happen under S_IOMAP_IMMUTABLE that
> > aren't size changes.  From the implementation it looks like shrink and
> > grow are only supposed to disallow changes to i_size, not i_blocks (or
> > the file block map).
> 
> True. 
> 
> > Then again, I suppose F_SEAL_* only work on shmem, so maybe it simply
> > isn't defined for any other filesystem...?  e.g. it doesn't prohibit
> > reflink, but the only fs implementing seals doesn't support reflink.
> > 
> > <shrug>
> > 
> > Seals cannot be removed, which is too strict for the S_IOMAP_IMMUTABLE
> > user cases being presented.
> 
> To be clear, the set of use cases is swap files and DAX, right?  Or is there anything else?

I've outlined other use cases in previous discussions. To repeat
myself, every so often we get someone with, say, a new high
speed camera that want to dma the camera frames direct to the
storage because they can't push 500,000 frames/s through the CPU
to storage. Hence they want to bypass the OS and DMA the data direct
to the storage. To do this they need a mechanism to freeze and unfreeze
the block map of the file so that nothing modifies the block map
while the camera hardware is dumping data direct to the storage.
Immutable extent maps provide the functionality they need to
implement this safely.

There's also other similar use cases for RDMA targets on PMEM
(regardless of whether DAX is enabled or not), and I've come across
a couple of requests for mechanisms to allow fabric based nvme
storage to do direct data transfers between storage devices, too.
All of these use cases can be safely implemented if there is a
mechanism to mark extent maps as immutable for the duration of
the operation they need to perform.

> I can't imagine why anyone would want to turn a swap file back into a regular file.
> I haven't fully followed DAX, but I'd take your word for it if people want to
> be able to remove the flag after.

DAX isn't the driver of that functionality, it's the other use cases
that need it, and why the proposed "only remove flag if len == 0"
API is a non-starter....

Cheers,

Dave.

Christoph Hellwig Aug. 5, 2017, 9:45 a.m. UTC | #10

On Tue, Aug 01, 2017 at 12:42:18PM +1000, Dave Chinner wrote:
> I've outlined other use cases in previous discussions. To repeat
> myself, every so often we get someone with, say, a new high
> speed camera that want to dma the camera frames direct to the
> storage because they can't push 500,000 frames/s through the CPU
> to storage. Hence they want to bypass the OS and DMA the data direct
> to the storage. To do this they need a mechanism to freeze and unfreeze
> the block map of the file so that nothing modifies the block map
> while the camera hardware is dumping data direct to the storage.
> Immutable extent maps provide the functionality they need to
> implement this safely.

And we have such a mechanism already: it's called the iolock during
I/O, and dirct I/O.  I've worked on plenty such schemes and the proper way
works perfectly fine.  Just because people ask for stupid ways to
archives that doesn't mean they understand what they are doing.

> There's also other similar use cases for RDMA targets on PMEM
> (regardless of whether DAX is enabled or not), and I've come across
> a couple of requests for mechanisms to allow fabric based nvme
> storage to do direct data transfers between storage devices, too.
> All of these use cases can be safely implemented if there is a
> mechanism to mark extent maps as immutable for the duration of
> the operation they need to perform.

As someone who spent most of them time on the last 2 years in this
area: we have a massive problem discoverability and addressing
(lack of struct page) for p2p devices.  We have absolutely no problem
with the direct I/O model with them.

> DAX isn't the driver of that functionality, it's the other use cases
> that need it, and why the proposed "only remove flag if len == 0"
> API is a non-starter....

The other "use" cases are even more bullshit than the DAX one.

[1/3] fs, xfs: introduce S_IOMAP_IMMUTABLE

Commit Message

Comments

Patch