[v2,1/5] fs, xfs: introduce S_IOMAP_IMMUTABLE

Message ID	150181369070.32119.1019092252036976783.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-nvdimm-bounces@lists.01.org> Subject: [PATCH v2 1/5] fs, xfs: introduce S_IOMAP_IMMUTABLE From: Dan Williams <dan.j.williams@intel.com> To: darrick.wong@oracle.com Date: Thu, 03 Aug 2017 19:28:10 -0700 Message-ID: <150181369070.32119.1019092252036976783.stgit@dwillia2-desk3.amr.corp.intel.com> In-Reply-To: <150181368442.32119.13336247800141074356.stgit@dwillia2-desk3.amr.corp.intel.com> References: <150181368442.32119.13336247800141074356.stgit@dwillia2-desk3.amr.corp.intel.com> User-Agent: StGit/0.17.1-9-g687f MIME-Version: 1.0 Precedence: list Cc: Jan Kara <jack@suse.cz>, linux-nvdimm@lists.01.org, Dave Chinner <david@fromorbit.com>, linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org, Alexander Viro <viro@zeniv.linux.org.uk>, luto@kernel.org, linux-fsdevel@vger.kernel.org, Christoph Hellwig <hch@lst.de> Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: linux-nvdimm-bounces@lists.01.org Sender: "Linux-nvdimm" <linux-nvdimm-bounces@lists.01.org>

Dan Williams Aug. 4, 2017, 2:28 a.m. UTC

An inode with this flag set indicates that the file's block map cannot
be changed from the currently allocated set.

The implementation of toggling the flag and sealing the state of the
extent map is saved for a later patch. The functionality provided by
S_IOMAP_IMMUTABLE, once toggle support is added, will be a superset of
that provided by S_SWAPFILE, and it is targeted to replace it.

For now, only xfs and the core vfs are updated to consider the new flag.

The additional checks that are added for this flag, beyond what we are
already doing for swapfiles, are:
* fail writes that try to extend the file size
* fail attempts to directly change the allocation map via fallocate or
  xfs ioctls. This can be done centrally by blocking
  xfs_alloc_file_space and xfs_free_file_space when the flag is set.

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Suggested-by: Dave Chinner <david@fromorbit.com>
Suggested-by: "Darrick J. Wong" <darrick.wong@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/attr.c              |   10 ++++++++++
 fs/open.c              |    6 ++++++
 fs/read_write.c        |    3 +++
 fs/xfs/xfs_bmap_util.c |    6 ++++++
 fs/xfs/xfs_ioctl.c     |    6 ++++++
 include/linux/fs.h     |    2 ++
 mm/filemap.c           |    5 +++++
 7 files changed, 38 insertions(+)

Darrick J. Wong Aug. 4, 2017, 8 p.m. UTC | #1

On Thu, Aug 03, 2017 at 07:28:10PM -0700, Dan Williams wrote:
> An inode with this flag set indicates that the file's block map cannot
> be changed from the currently allocated set.
> 
> The implementation of toggling the flag and sealing the state of the
> extent map is saved for a later patch. The functionality provided by
> S_IOMAP_IMMUTABLE, once toggle support is added, will be a superset of
> that provided by S_SWAPFILE, and it is targeted to replace it.
> 
> For now, only xfs and the core vfs are updated to consider the new flag.
> 
> The additional checks that are added for this flag, beyond what we are
> already doing for swapfiles, are:
> * fail writes that try to extend the file size
> * fail attempts to directly change the allocation map via fallocate or
>   xfs ioctls. This can be done centrally by blocking
>   xfs_alloc_file_space and xfs_free_file_space when the flag is set.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Suggested-by: "Darrick J. Wong" <darrick.wong@oracle.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/attr.c              |   10 ++++++++++
>  fs/open.c              |    6 ++++++
>  fs/read_write.c        |    3 +++
>  fs/xfs/xfs_bmap_util.c |    6 ++++++
>  fs/xfs/xfs_ioctl.c     |    6 ++++++
>  include/linux/fs.h     |    2 ++
>  mm/filemap.c           |    5 +++++
>  7 files changed, 38 insertions(+)
> 
> diff --git a/fs/attr.c b/fs/attr.c
> index 135304146120..8573e364bd06 100644
> --- a/fs/attr.c
> +++ b/fs/attr.c
> @@ -112,6 +112,16 @@ EXPORT_SYMBOL(setattr_prepare);
>   */
>  int inode_newsize_ok(const struct inode *inode, loff_t offset)
>  {
> +	if (IS_IOMAP_IMMUTABLE(inode)) {
> +		/*
> +		 * Any size change is disallowed. Size increases may
> +		 * dirty metadata that an application is not prepared to
> +		 * sync, and a size decrease may expose free blocks to
> +		 * in-flight DMA.
> +		 */
> +		return -ETXTBSY;
> +	}
> +
>  	if (inode->i_size < offset) {
>  		unsigned long limit;
>  
> diff --git a/fs/open.c b/fs/open.c
> index 35bb784763a4..7395860d7164 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -292,6 +292,12 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>  		return -ETXTBSY;
>  
>  	/*
> +	 * We cannot allow any allocation changes on an iomap immutable file
> +	 */
> +	if (IS_IOMAP_IMMUTABLE(inode))
> +		return -ETXTBSY;
> +
> +	/*
>  	 * Revalidate the write permissions, in case security policy has
>  	 * changed since the files were opened.
>  	 */
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 0cc7033aa413..dc673be7c7cb 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1706,6 +1706,9 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
>  	if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out))
>  		return -ETXTBSY;
>  
> +	if (IS_IOMAP_IMMUTABLE(inode_in) || IS_IOMAP_IMMUTABLE(inode_out))
> +		return -ETXTBSY;
> +
>  	/* Don't reflink dirs, pipes, sockets... */
>  	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
>  		return -EISDIR;
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 93e955262d07..fe0f8f7f4bb7 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1044,6 +1044,9 @@ xfs_alloc_file_space(
>  	if (XFS_FORCED_SHUTDOWN(mp))
>  		return -EIO;
>  
> +	if (IS_IOMAP_IMMUTABLE(VFS_I(ip)))
> +		return -ETXTBSY;
> +

Hm.  The 'seal this up' caller in the next patch doesn't check for
ETXTBSY (or if it does I missed that), so if you try to seal an already
sealed file you'll get an error code even though you actually got the
state you wanted.

Second question: How might we handle the situation where a filesystem
/has/ to alter a block mapping?  Hypothetically, if the block layer
tells the fs that some range of storage has gone bad and the fs decides
to punch out that part of the file (or mark it unwritten or whatever) to
avoid a machine check, can we lock out file IO, forcibly remove the
mapping from memory, make whatever block map updates we want, and then
unlock?

(Conceptually, the bmbt rebuilder in the online fsck patchset operates
in a similar manner...)

--D

>  	error = xfs_qm_dqattach(ip, 0);
>  	if (error)
>  		return error;
> @@ -1294,6 +1297,9 @@ xfs_free_file_space(
>  
>  	trace_xfs_free_file_space(ip);
>  
> +	if (IS_IOMAP_IMMUTABLE(VFS_I(ip)))
> +		return -ETXTBSY;
> +
>  	error = xfs_qm_dqattach(ip, 0);
>  	if (error)
>  		return error;
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index e75c40a47b7d..2e64488bc4de 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -1755,6 +1755,12 @@ xfs_ioc_swapext(
>  		goto out_put_tmp_file;
>  	}
>  
> +	if (IS_IOMAP_IMMUTABLE(file_inode(f.file)) ||
> +	    IS_IOMAP_IMMUTABLE(file_inode(tmp.file))) {
> +		error = -EINVAL;
> +		goto out_put_tmp_file;
> +	}
> +
>  	/*
>  	 * We need to ensure that the fds passed in point to XFS inodes
>  	 * before we cast and access them as XFS structures as we have no
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 6e1fd5d21248..0a254b768855 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1829,6 +1829,7 @@ struct super_operations {
>  #else
>  #define S_DAX		0	/* Make all the DAX code disappear */
>  #endif
> +#define S_IOMAP_IMMUTABLE 16384 /* logical-to-physical extent map is fixed */
>  
>  /*
>   * Note that nosuid etc flags are inode-specific: setting some file-system
> @@ -1867,6 +1868,7 @@ struct super_operations {
>  #define IS_AUTOMOUNT(inode)	((inode)->i_flags & S_AUTOMOUNT)
>  #define IS_NOSEC(inode)		((inode)->i_flags & S_NOSEC)
>  #define IS_DAX(inode)		((inode)->i_flags & S_DAX)
> +#define IS_IOMAP_IMMUTABLE(inode) ((inode)->i_flags & S_IOMAP_IMMUTABLE)
>  
>  #define IS_WHITEOUT(inode)	(S_ISCHR(inode->i_mode) && \
>  				 (inode)->i_rdev == WHITEOUT_DEV)
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a49702445ce0..a4105a4c1d69 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2806,6 +2806,11 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
>  	if (unlikely(pos >= inode->i_sb->s_maxbytes))
>  		return -EFBIG;
>  
> +	/* Are we about to mutate the block map on an immutable file? */
> +	if (IS_IOMAP_IMMUTABLE(inode)
> +			&& (pos + iov_iter_count(from) > i_size_read(inode)))
> +		return -ETXTBSY;
> +
>  	iov_iter_truncate(from, inode->i_sb->s_maxbytes - pos);
>  	return iov_iter_count(from);
>  }
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Dan Williams Aug. 4, 2017, 8:31 p.m. UTC | #2

On Fri, Aug 4, 2017 at 1:00 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> On Thu, Aug 03, 2017 at 07:28:10PM -0700, Dan Williams wrote:
>> An inode with this flag set indicates that the file's block map cannot
>> be changed from the currently allocated set.
>>
>> The implementation of toggling the flag and sealing the state of the
>> extent map is saved for a later patch. The functionality provided by
>> S_IOMAP_IMMUTABLE, once toggle support is added, will be a superset of
>> that provided by S_SWAPFILE, and it is targeted to replace it.
>>
>> For now, only xfs and the core vfs are updated to consider the new flag.
>>
>> The additional checks that are added for this flag, beyond what we are
>> already doing for swapfiles, are:
>> * fail writes that try to extend the file size
>> * fail attempts to directly change the allocation map via fallocate or
>>   xfs ioctls. This can be done centrally by blocking
>>   xfs_alloc_file_space and xfs_free_file_space when the flag is set.
>>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>> Suggested-by: Dave Chinner <david@fromorbit.com>
>> Suggested-by: "Darrick J. Wong" <darrick.wong@oracle.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  fs/attr.c              |   10 ++++++++++
>>  fs/open.c              |    6 ++++++
>>  fs/read_write.c        |    3 +++
>>  fs/xfs/xfs_bmap_util.c |    6 ++++++
>>  fs/xfs/xfs_ioctl.c     |    6 ++++++
>>  include/linux/fs.h     |    2 ++
>>  mm/filemap.c           |    5 +++++
>>  7 files changed, 38 insertions(+)
>>
>> diff --git a/fs/attr.c b/fs/attr.c
>> index 135304146120..8573e364bd06 100644
>> --- a/fs/attr.c
>> +++ b/fs/attr.c
>> @@ -112,6 +112,16 @@ EXPORT_SYMBOL(setattr_prepare);
>>   */
>>  int inode_newsize_ok(const struct inode *inode, loff_t offset)
>>  {
>> +     if (IS_IOMAP_IMMUTABLE(inode)) {
>> +             /*
>> +              * Any size change is disallowed. Size increases may
>> +              * dirty metadata that an application is not prepared to
>> +              * sync, and a size decrease may expose free blocks to
>> +              * in-flight DMA.
>> +              */
>> +             return -ETXTBSY;
>> +     }
>> +
>>       if (inode->i_size < offset) {
>>               unsigned long limit;
>>
>> diff --git a/fs/open.c b/fs/open.c
>> index 35bb784763a4..7395860d7164 100644
>> --- a/fs/open.c
>> +++ b/fs/open.c
>> @@ -292,6 +292,12 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>>               return -ETXTBSY;
>>
>>       /*
>> +      * We cannot allow any allocation changes on an iomap immutable file
>> +      */
>> +     if (IS_IOMAP_IMMUTABLE(inode))
>> +             return -ETXTBSY;
>> +
>> +     /*
>>        * Revalidate the write permissions, in case security policy has
>>        * changed since the files were opened.
>>        */
>> diff --git a/fs/read_write.c b/fs/read_write.c
>> index 0cc7033aa413..dc673be7c7cb 100644
>> --- a/fs/read_write.c
>> +++ b/fs/read_write.c
>> @@ -1706,6 +1706,9 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
>>       if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out))
>>               return -ETXTBSY;
>>
>> +     if (IS_IOMAP_IMMUTABLE(inode_in) || IS_IOMAP_IMMUTABLE(inode_out))
>> +             return -ETXTBSY;
>> +
>>       /* Don't reflink dirs, pipes, sockets... */
>>       if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
>>               return -EISDIR;
>> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
>> index 93e955262d07..fe0f8f7f4bb7 100644
>> --- a/fs/xfs/xfs_bmap_util.c
>> +++ b/fs/xfs/xfs_bmap_util.c
>> @@ -1044,6 +1044,9 @@ xfs_alloc_file_space(
>>       if (XFS_FORCED_SHUTDOWN(mp))
>>               return -EIO;
>>
>> +     if (IS_IOMAP_IMMUTABLE(VFS_I(ip)))
>> +             return -ETXTBSY;
>> +
>
> Hm.  The 'seal this up' caller in the next patch doesn't check for
> ETXTBSY (or if it does I missed that), so if you try to seal an already
> sealed file you'll get an error code even though you actually got the
> state you wanted.

That's a good point, I'll fix that up.

>
> Second question: How might we handle the situation where a filesystem
> /has/ to alter a block mapping?  Hypothetically, if the block layer
> tells the fs that some range of storage has gone bad and the fs decides
> to punch out that part of the file (or mark it unwritten or whatever) to
> avoid a machine check, can we lock out file IO, forcibly remove the
> mapping from memory, make whatever block map updates we want, and then
> unlock?

It's not clear that the filesystem /has/ to change the block mappings
when the backing media supports error clearing. Unlike bad DRAM ranges
where the address is permanently mapped out, we can clear pmem and
disk errors by writing new data. The bad block can be repaired or
remapped internal to the hardware device.

As far as I can see no amount of fs locking will keep in-flight DMA
from assuming it can continue to write to the storage address it
thought was immutable. So, I think that means that the only error
management that can be expected while the file is immutable is
blkdev_issue_zeroout() to clear the error, or otherwise hope that the
DMA operation can generate the properly sized/aligned write request
that can clear the error.

Christoph Hellwig Aug. 5, 2017, 9:47 a.m. UTC | #3

NAK^4.

We should not allow users to create immutable files.  We have
proper ways to synchronize I/O, and this is just an invitation
for horrible abuses that should not be allowed, and which we've
always people told not to do.

Dave Chinner Aug. 7, 2017, 12:25 a.m. UTC | #4

On Sat, Aug 05, 2017 at 11:47:08AM +0200, Christoph Hellwig wrote:
> NAK^4.
> 
> We should not allow users to create immutable files.  We have
> proper ways to synchronize I/O, and this is just an invitation
> for horrible abuses that should not be allowed, and which we've
> always people told not to do.

We've always told people not to do those "horrible abuses" because
of the TOCTOU race conditions inherent in getting accurate
BMAP/FIEMAP information to userspace. However, immutable extent maps
solve the TOCTOU problem and so removes the only *technical* barrier
in the way of using extent maps to implement functionality such as
userspace pNFS servers.

The core requirement for a userspace pNFS block server to be able to
safely export the block map of a file to remote clients is that the
extent map is allocated and will not change while the client has
been granted access to it. Immutable extent maps provide that
functionality to userspace.  However, for this to work, us
filesystem developers have to give up the idea that only the
filesystem can access the storage underlying the filesystem.

I'm not writing this for your benefit, Christoph, but for everyone
else who doesn't know about existing direct remote storage access
protocols and implementations. That is, I'm letting everyone know
we've already had to give up the exclusive storage device access
model...

.... when you implemented the kernel pNFS server code that provides
unknown third parties with the *remote direct access* to the storage
underlying the XFS filesystem.

Yup, we already allow third parties to arbitrate and directly access
to the XFS block device map. That "horrible abuse" was allowed
because it could be done safely via NFSv4 delegations and a new API
that provided a "blocks will always be allocated before a write and
won't change while the remote client has access" guarantee from XFS
to the kernel pNFS server (i.e. ->map_blocks()/->commit_blocks()
export ops and the break_layouts() API).

Immutable extent maps provide userspace with this same guarantee, so
what used to be considered a "horrible abuse" can now be done safely
and without risking data and/or filesystem corruption.  So, really,
calling this an "invitation to horrible abuses that should not be
allowed" ignores the reality that you were the architect that
introduced this "safe remote direct access" model to convert a
"horrible abuse" into a set of safe, supportable operations.

In the end, all I care about is that everyone understands the
technical merits of the proposals being considered rather than
discussion and review being shut down because "Christoph shouted
nasty words at me but I still don't understand why?".....

Cheers,

Dave.

Christoph Hellwig Aug. 11, 2017, 10:34 a.m. UTC | #5

On Mon, Aug 07, 2017 at 10:25:02AM +1000, Dave Chinner wrote:
> We've always told people not to do those "horrible abuses" because
> of the TOCTOU race conditions inherent in getting accurate
> BMAP/FIEMAP information to userspace. However, immutable extent maps
> solve the TOCTOU problem and so removes the only *technical* barrier
> in the way of using extent maps to implement functionality such as
> userspace pNFS servers.

For pNFS block/scsi and my upcoming RDMA persistent memory layout?
Hell no - we'll need concepts we can't expose to userspace for them,
and to expose the advanced functionality people are asking for
(reflinks, atomic updates, no stale data exposure) immutable extents
maps won't work at all.

> The core requirement for a userspace pNFS block server to be able to
> safely export the block map of a file to remote clients is that the
> extent map is allocated and will not change while the client has
> been granted access to it.

No.  The core feature for the block layout is to create an unwrittent
extent that we can expose to the client for writing to it and only
marking it as written after commit by converting the extent list.

Now I know you're going to argue that this could work with pre-zeroing
the extents, but for and actual SCSI or NVMe device that will suck
badly.  And for RDMA-like layouts we don't even need the zeroing as
we can control client behavior a lot better because memory registrations
allow much more fine grained control.

Either way we a good notification from the file system to the server
when the extent map changes.

But for either blocks or rdma layout and implementation with the filesystem
in kernel space and the server in user is stupid as they need to interact
closely.  There is a good reason why all successful NFS products have
the server very tightly coupled to the file system, and a userspace <->
kernel barrier does not help with that.

[v2,1/5] fs, xfs: introduce S_IOMAP_IMMUTABLE

Commit Message

Comments

Patch