Message ID | 150135741519.35318.16765137368329971936.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Sat, Jul 29, 2017, at 03:43 PM, Dan Williams wrote: > An inode with this flag set indicates that the file's block map cannot > be changed, no size change, deletion, hole-punch, range collapse, or > reflink. > > The implementation of toggling the flag and sealing the state of the > extent map is saved for a later patch. The functionality provided by > S_IOMAP_IMMUTABLE, once toggle support is added, will be a superset of > that provided by S_SWAPFILE, and it is targeted to replace it. > > For now, only xfs and the core vfs are updated to consider the new flag. Quite a while ago I started a request for O_OBJECT: http://www.spinics.net/lists/linux-fsdevel/msg75085.html A few months ago I was thinking about that more and realized it'd likely be more palatable to land as an inode flag, like you're doing here. Now, S_IOMAP_IMMUTABLE would be quite close to the semantics I want for ostree, except we also want to disallow changes to the inode uid, gid or mode. (Extended attributes are a whole other story; but I'd like to at least disallow changes to the security. namespace). The goal here is mostly about resilience to *accidental* changes; think an admin doing `cp /path/to/binary /usr/bin/bash` which does open(O_TRUNC), which would hence corrupt all hardlinks. S_IOMAP_IMMUTABLE would give a lot of great protection against those types of accidental changes - most of them are either going to be open(O_TRUNC) or O_APPEND. Since you're touching various write paths here, perhaps we can also add S_CONTENTS_IMMUTABLE or something at the same time? If this lands as is - I'm quite likely to change ostree to use it; any objections to that? As mentioned in the thread, there are several other cases of "content immutable" files in userspace, such as QEMU "qcow2", git objects. And really the most classic example is /etc/sudoers and the need for a special "visudo" program to really ensure that editors don't do in-place overwrites. But it'd be great if we can use this push to also land "content immutabilty" or however we decide to call it.
On Mon, Jul 31, 2017 at 9:02 AM, Colin Walters <walters@verbum.org> wrote: > On Sat, Jul 29, 2017, at 03:43 PM, Dan Williams wrote: >> An inode with this flag set indicates that the file's block map cannot >> be changed, no size change, deletion, hole-punch, range collapse, or >> reflink. >> >> The implementation of toggling the flag and sealing the state of the >> extent map is saved for a later patch. The functionality provided by >> S_IOMAP_IMMUTABLE, once toggle support is added, will be a superset of >> that provided by S_SWAPFILE, and it is targeted to replace it. >> >> For now, only xfs and the core vfs are updated to consider the new flag. > > Quite a while ago I started a request for O_OBJECT: > http://www.spinics.net/lists/linux-fsdevel/msg75085.html > A few months ago I was thinking about that more and realized > it'd likely be more palatable to land as an inode flag, like > you're doing here. > > Now, S_IOMAP_IMMUTABLE would be quite close to the semantics > I want for ostree, except we also want to disallow > changes to the inode uid, gid or mode. (Extended attributes are > a whole other story; but I'd like to at least disallow changes to the > security. namespace). > > The goal here is mostly about resilience to *accidental* changes; > think an admin doing `cp /path/to/binary /usr/bin/bash` which > does open(O_TRUNC), which would hence corrupt all hardlinks. > > S_IOMAP_IMMUTABLE would give a lot of great protection against > those types of accidental changes - most of them are either going > to be open(O_TRUNC) or O_APPEND. Since you're touching various > write paths here, perhaps we can also add > S_CONTENTS_IMMUTABLE or something at the same time? > > If this lands as is - I'm quite likely to change ostree to use it; > any objections to that? As mentioned in the thread, there are several > other cases of "content immutable" files in userspace, such as > QEMU "qcow2", git objects. And really the most classic example is > /etc/sudoers and the need for a special "visudo" program to really > ensure that editors don't do in-place overwrites. > > But it'd be great if we can use this push to also land "content immutabilty" > or however we decide to call it. How is S_CONTENTS_IMMUTABLE different than S_IMMUTABLE?
On Mon, Jul 31, 2017, at 12:29 PM, Dan Williams wrote: > > How is S_CONTENTS_IMMUTABLE different than S_IMMUTABLE? We still want the ability to make hardlinks.
On Sat, Jul 29, 2017 at 12:43:35PM -0700, Dan Williams wrote: > An inode with this flag set indicates that the file's block map cannot > be changed, no size change, deletion, hole-punch, range collapse, or > reflink. > > The implementation of toggling the flag and sealing the state of the > extent map is saved for a later patch. The functionality provided by > S_IOMAP_IMMUTABLE, once toggle support is added, will be a superset of > that provided by S_SWAPFILE, and it is targeted to replace it. > > For now, only xfs and the core vfs are updated to consider the new flag. > > The additional check that is added for this flag, beyond what we are > already doing for swapfiles, is to truncate or abort writes that try to > extend the file size. > > Cc: Jan Kara <jack@suse.cz> > Cc: Jeff Moyer <jmoyer@redhat.com> > Cc: Christoph Hellwig <hch@lst.de> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com> > Cc: Alexander Viro <viro@zeniv.linux.org.uk> > Suggested-by: Dave Chinner <david@fromorbit.com> > Suggested-by: "Darrick J. Wong" <darrick.wong@oracle.com> > Signed-off-by: Dan Williams <dan.j.williams@intel.com> > --- > fs/attr.c | 10 ++++++++++ > fs/namei.c | 3 ++- > fs/open.c | 6 ++++++ > fs/read_write.c | 3 +++ > fs/xfs/xfs_ioctl.c | 6 ++++++ > include/linux/fs.h | 2 ++ > mm/filemap.c | 9 +++++++++ > 7 files changed, 38 insertions(+), 1 deletion(-) > > diff --git a/fs/attr.c b/fs/attr.c > index 135304146120..8573e364bd06 100644 > --- a/fs/attr.c > +++ b/fs/attr.c > @@ -112,6 +112,16 @@ EXPORT_SYMBOL(setattr_prepare); > */ > int inode_newsize_ok(const struct inode *inode, loff_t offset) > { > + if (IS_IOMAP_IMMUTABLE(inode)) { > + /* > + * Any size change is disallowed. Size increases may > + * dirty metadata that an application is not prepared to > + * sync, and a size decrease may expose free blocks to > + * in-flight DMA. > + */ > + return -ETXTBSY; > + } > + > if (inode->i_size < offset) { > unsigned long limit; > > diff --git a/fs/namei.c b/fs/namei.c > index ddb6a7c2b3d4..588f1135c42c 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -2793,7 +2793,8 @@ static int may_delete(struct inode *dir, struct dentry *victim, bool isdir) > return -EPERM; > > if (check_sticky(dir, inode) || IS_APPEND(inode) || > - IS_IMMUTABLE(inode) || IS_SWAPFILE(inode) || HAS_UNMAPPED_ID(inode)) > + IS_IMMUTABLE(inode) || IS_SWAPFILE(inode) || HAS_UNMAPPED_ID(inode) > + || IS_IOMAP_IMMUTABLE(inode)) This caught my eye because why wouldn't we be able to unlink a file whose block map is immutable? Link count has "nothing" to do with that. :) Hmm... I guess the reasoning here is that if we're IOMAP_IMMUTABLE'd, we're not allowed to touch the link count since (we assume) nobody's who's interested in *inode is going to call fsync on it to flush the metadata to disk? If that's the case, then shouldn't there be a corresponding may_create check to prevent us from linking a file into a directory? (I don't really see why link/unlink shouldn't be allowed...) > return -EPERM; > if (isdir) { > if (!d_is_dir(victim)) > diff --git a/fs/open.c b/fs/open.c > index 35bb784763a4..7395860d7164 100644 > --- a/fs/open.c > +++ b/fs/open.c > @@ -292,6 +292,12 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len) > return -ETXTBSY; > > /* > + * We cannot allow any allocation changes on an iomap immutable file If it's a (regular allocation) or (unshare blocks) fallocate call, we could return zero immediately since no writes will be able to hit ENOSPC because we know that the block mappings are there and no CoW is required. > + */ > + if (IS_IOMAP_IMMUTABLE(inode)) > + return -ETXTBSY; I've been wondering in the back of my head if we could allow a size-extending FALLOC_FL_ZERO_RANGE here that would fsync at the end? The use case I had in mind was extending files without having to turn off the IOMAP_IMMUTABLE flag, since (in theory, anyway) we'd bzero() the memory and then increase i_size, so userspace should never be able to access storage that isn't totally finalized. OTOH that also seems like a huge weird loophole in IOMAP_IMMUTABLE, so maybe everyone would prefer to shoot down this use case now? > + > + /* > * Revalidate the write permissions, in case security policy has > * changed since the files were opened. > */ > diff --git a/fs/read_write.c b/fs/read_write.c > index 0cc7033aa413..dc673be7c7cb 100644 > --- a/fs/read_write.c > +++ b/fs/read_write.c > @@ -1706,6 +1706,9 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in, > if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out)) > return -ETXTBSY; > > + if (IS_IOMAP_IMMUTABLE(inode_in) || IS_IOMAP_IMMUTABLE(inode_out)) > + return -ETXTBSY; > + > /* Don't reflink dirs, pipes, sockets... */ > if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode)) > return -EISDIR; > diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c > index e75c40a47b7d..2e64488bc4de 100644 > --- a/fs/xfs/xfs_ioctl.c > +++ b/fs/xfs/xfs_ioctl.c > @@ -1755,6 +1755,12 @@ xfs_ioc_swapext( > goto out_put_tmp_file; > } > > + if (IS_IOMAP_IMMUTABLE(file_inode(f.file)) || > + IS_IOMAP_IMMUTABLE(file_inode(tmp.file))) { > + error = -EINVAL; > + goto out_put_tmp_file; > + } Someone's going to have to audit more of the XFS ioctls to make sure that none of them can touch the block mapping, but as this is an RFC it doesn't need to be done right this instant. > + > /* > * We need to ensure that the fds passed in point to XFS inodes > * before we cast and access them as XFS structures as we have no > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 6e1fd5d21248..0a254b768855 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -1829,6 +1829,7 @@ struct super_operations { > #else > #define S_DAX 0 /* Make all the DAX code disappear */ > #endif > +#define S_IOMAP_IMMUTABLE 16384 /* logical-to-physical extent map is fixed */ > > /* > * Note that nosuid etc flags are inode-specific: setting some file-system > @@ -1867,6 +1868,7 @@ struct super_operations { > #define IS_AUTOMOUNT(inode) ((inode)->i_flags & S_AUTOMOUNT) > #define IS_NOSEC(inode) ((inode)->i_flags & S_NOSEC) > #define IS_DAX(inode) ((inode)->i_flags & S_DAX) > +#define IS_IOMAP_IMMUTABLE(inode) ((inode)->i_flags & S_IOMAP_IMMUTABLE) > > #define IS_WHITEOUT(inode) (S_ISCHR(inode->i_mode) && \ > (inode)->i_rdev == WHITEOUT_DEV) > diff --git a/mm/filemap.c b/mm/filemap.c > index a49702445ce0..e4a6529da2bd 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -2806,6 +2806,15 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from) > if (unlikely(pos >= inode->i_sb->s_maxbytes)) > return -EFBIG; > > + /* Are we about to mutate the block map on an immutable file? */ > + if (IS_IOMAP_IMMUTABLE(inode) > + && (pos + iov_iter_count(from) > i_size_read(inode))) { > + if (pos < i_size_read(inode)) > + iov_iter_truncate(from, i_size_read(inode) - pos); Writes past the end get truncated to stop at EOF? I'd have thought that would be an error since userspace is asking for metadata updates it previously said it would never need... --D > + else > + return -ETXTBSY; > + } > + > iov_iter_truncate(from, inode->i_sb->s_maxbytes - pos); > return iov_iter_count(from); > } > > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jul 31, 2017 at 9:46 AM, Darrick J. Wong <darrick.wong@oracle.com> wrote: > On Sat, Jul 29, 2017 at 12:43:35PM -0700, Dan Williams wrote: >> An inode with this flag set indicates that the file's block map cannot >> be changed, no size change, deletion, hole-punch, range collapse, or >> reflink. >> >> The implementation of toggling the flag and sealing the state of the >> extent map is saved for a later patch. The functionality provided by >> S_IOMAP_IMMUTABLE, once toggle support is added, will be a superset of >> that provided by S_SWAPFILE, and it is targeted to replace it. >> >> For now, only xfs and the core vfs are updated to consider the new flag. >> >> The additional check that is added for this flag, beyond what we are >> already doing for swapfiles, is to truncate or abort writes that try to >> extend the file size. >> >> Cc: Jan Kara <jack@suse.cz> >> Cc: Jeff Moyer <jmoyer@redhat.com> >> Cc: Christoph Hellwig <hch@lst.de> >> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> >> Cc: Alexander Viro <viro@zeniv.linux.org.uk> >> Suggested-by: Dave Chinner <david@fromorbit.com> >> Suggested-by: "Darrick J. Wong" <darrick.wong@oracle.com> >> Signed-off-by: Dan Williams <dan.j.williams@intel.com> >> --- >> fs/attr.c | 10 ++++++++++ >> fs/namei.c | 3 ++- >> fs/open.c | 6 ++++++ >> fs/read_write.c | 3 +++ >> fs/xfs/xfs_ioctl.c | 6 ++++++ >> include/linux/fs.h | 2 ++ >> mm/filemap.c | 9 +++++++++ >> 7 files changed, 38 insertions(+), 1 deletion(-) >> >> diff --git a/fs/attr.c b/fs/attr.c >> index 135304146120..8573e364bd06 100644 >> --- a/fs/attr.c >> +++ b/fs/attr.c >> @@ -112,6 +112,16 @@ EXPORT_SYMBOL(setattr_prepare); >> */ >> int inode_newsize_ok(const struct inode *inode, loff_t offset) >> { >> + if (IS_IOMAP_IMMUTABLE(inode)) { >> + /* >> + * Any size change is disallowed. Size increases may >> + * dirty metadata that an application is not prepared to >> + * sync, and a size decrease may expose free blocks to >> + * in-flight DMA. >> + */ >> + return -ETXTBSY; >> + } >> + >> if (inode->i_size < offset) { >> unsigned long limit; >> >> diff --git a/fs/namei.c b/fs/namei.c >> index ddb6a7c2b3d4..588f1135c42c 100644 >> --- a/fs/namei.c >> +++ b/fs/namei.c >> @@ -2793,7 +2793,8 @@ static int may_delete(struct inode *dir, struct dentry *victim, bool isdir) >> return -EPERM; >> >> if (check_sticky(dir, inode) || IS_APPEND(inode) || >> - IS_IMMUTABLE(inode) || IS_SWAPFILE(inode) || HAS_UNMAPPED_ID(inode)) >> + IS_IMMUTABLE(inode) || IS_SWAPFILE(inode) || HAS_UNMAPPED_ID(inode) >> + || IS_IOMAP_IMMUTABLE(inode)) > > This caught my eye because why wouldn't we be able to unlink a file > whose block map is immutable? Link count has "nothing" to do with that. > > :) > > Hmm... I guess the reasoning here is that if we're IOMAP_IMMUTABLE'd, > we're not allowed to touch the link count since (we assume) nobody's > who's interested in *inode is going to call fsync on it to flush the > metadata to disk? > > If that's the case, then shouldn't there be a corresponding may_create > check to prevent us from linking a file into a directory? > > (I don't really see why link/unlink shouldn't be allowed...) True, unlink should be allowed, the blocks will still be immutable as long as the file is open so I think we're good. >> return -EPERM; >> if (isdir) { >> if (!d_is_dir(victim)) >> diff --git a/fs/open.c b/fs/open.c >> index 35bb784763a4..7395860d7164 100644 >> --- a/fs/open.c >> +++ b/fs/open.c >> @@ -292,6 +292,12 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len) >> return -ETXTBSY; >> >> /* >> + * We cannot allow any allocation changes on an iomap immutable file > > If it's a (regular allocation) or (unshare blocks) fallocate call, we > could return zero immediately since no writes will be able to hit > ENOSPC because we know that the block mappings are there and no CoW is > required. Sounds good assuming the regular allocation is staying within the range of the current file size, and fail otherwise. > >> + */ >> + if (IS_IOMAP_IMMUTABLE(inode)) >> + return -ETXTBSY; > > I've been wondering in the back of my head if we could allow a > size-extending FALLOC_FL_ZERO_RANGE here that would fsync at the end? > The use case I had in mind was extending files without having to turn > off the IOMAP_IMMUTABLE flag, since (in theory, anyway) we'd bzero() the > memory and then increase i_size, so userspace should never be able to > access storage that isn't totally finalized. > > OTOH that also seems like a huge weird loophole in IOMAP_IMMUTABLE, so > maybe everyone would prefer to shoot down this use case now? It seems useful, because the alternative feels more error prone if applications are forced to toggle IOMAP_IMMUTABLE. > >> + >> + /* >> * Revalidate the write permissions, in case security policy has >> * changed since the files were opened. >> */ >> diff --git a/fs/read_write.c b/fs/read_write.c >> index 0cc7033aa413..dc673be7c7cb 100644 >> --- a/fs/read_write.c >> +++ b/fs/read_write.c >> @@ -1706,6 +1706,9 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in, >> if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out)) >> return -ETXTBSY; >> >> + if (IS_IOMAP_IMMUTABLE(inode_in) || IS_IOMAP_IMMUTABLE(inode_out)) >> + return -ETXTBSY; >> + >> /* Don't reflink dirs, pipes, sockets... */ >> if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode)) >> return -EISDIR; >> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c >> index e75c40a47b7d..2e64488bc4de 100644 >> --- a/fs/xfs/xfs_ioctl.c >> +++ b/fs/xfs/xfs_ioctl.c >> @@ -1755,6 +1755,12 @@ xfs_ioc_swapext( >> goto out_put_tmp_file; >> } >> >> + if (IS_IOMAP_IMMUTABLE(file_inode(f.file)) || >> + IS_IOMAP_IMMUTABLE(file_inode(tmp.file))) { >> + error = -EINVAL; >> + goto out_put_tmp_file; >> + } > > Someone's going to have to audit more of the XFS ioctls to make sure > that none of them can touch the block mapping, but as this is an RFC it > doesn't need to be done right this instant. Speaking of things that need to be done before this mechanism can move out of RFC, you had mentioned wanting to get your current devel backlog cleared before moving ahead with this. I can offer some review cycles, you'll just need to tolerate a higher question-to-review-comment ratio given my relative XFS experience. > >> + >> /* >> * We need to ensure that the fds passed in point to XFS inodes >> * before we cast and access them as XFS structures as we have no >> diff --git a/include/linux/fs.h b/include/linux/fs.h >> index 6e1fd5d21248..0a254b768855 100644 >> --- a/include/linux/fs.h >> +++ b/include/linux/fs.h >> @@ -1829,6 +1829,7 @@ struct super_operations { >> #else >> #define S_DAX 0 /* Make all the DAX code disappear */ >> #endif >> +#define S_IOMAP_IMMUTABLE 16384 /* logical-to-physical extent map is fixed */ >> >> /* >> * Note that nosuid etc flags are inode-specific: setting some file-system >> @@ -1867,6 +1868,7 @@ struct super_operations { >> #define IS_AUTOMOUNT(inode) ((inode)->i_flags & S_AUTOMOUNT) >> #define IS_NOSEC(inode) ((inode)->i_flags & S_NOSEC) >> #define IS_DAX(inode) ((inode)->i_flags & S_DAX) >> +#define IS_IOMAP_IMMUTABLE(inode) ((inode)->i_flags & S_IOMAP_IMMUTABLE) >> >> #define IS_WHITEOUT(inode) (S_ISCHR(inode->i_mode) && \ >> (inode)->i_rdev == WHITEOUT_DEV) >> diff --git a/mm/filemap.c b/mm/filemap.c >> index a49702445ce0..e4a6529da2bd 100644 >> --- a/mm/filemap.c >> +++ b/mm/filemap.c >> @@ -2806,6 +2806,15 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from) >> if (unlikely(pos >= inode->i_sb->s_maxbytes)) >> return -EFBIG; >> >> + /* Are we about to mutate the block map on an immutable file? */ >> + if (IS_IOMAP_IMMUTABLE(inode) >> + && (pos + iov_iter_count(from) > i_size_read(inode))) { >> + if (pos < i_size_read(inode)) >> + iov_iter_truncate(from, i_size_read(inode) - pos); > > Writes past the end get truncated to stop at EOF? I'd have thought that > would be an error since userspace is asking for metadata updates it > previously said it would never need... It's an error if the current position is >= EOF, but otherwise allow a truncated write. I'm fine either way. ...but then again if we allow FALLOC_FL_ZERO_RANGE to resize the file, why not synchronous writes too? It turns IOMAP_IMMUTABLE into a flag that only allows blocks to be added to a file, but not removed. That would seem to dovetail well with synchronous mmap faults when / if XFS grows that feature.
On Mon, Jul 31, 2017, at 12:32 PM, Colin Walters wrote: > On Mon, Jul 31, 2017, at 12:29 PM, Dan Williams wrote: > > > > How is S_CONTENTS_IMMUTABLE different than S_IMMUTABLE? > > We still want the ability to make hardlinks. Also of course, symmetrically, to unlink. If we used S_IMMUTABLE for /etc/sudoers, it'd still be racy since one would have to transiently remove the flag in order to replace it with a new version. Related to this topic is the fact that S_IMMUTABLE is itself mutable; I think once S_IMMUTABLE_CONTENTS is set, it would not be able to made mutable again. Also I just remembered that since then memfd_create() and more notably fcntl(F_ADD_SEALS) landed - in fact it already has flags for what we want here AFAICS. Your S_IOMAP_IMMUTABLE is fcntl(F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW) and mine just adds in F_SEAL_WRITE. I think there was some discussion of the seals for persistent files when memfd_create() landed, but I can't find it offhand.
On Mon, Jul 31, 2017 at 01:42:13PM -0400, Colin Walters wrote: > > > On Mon, Jul 31, 2017, at 12:32 PM, Colin Walters wrote: > > On Mon, Jul 31, 2017, at 12:29 PM, Dan Williams wrote: > > > > > > How is S_CONTENTS_IMMUTABLE different than S_IMMUTABLE? > > > > We still want the ability to make hardlinks. > > Also of course, symmetrically, to unlink. If we used S_IMMUTABLE for /etc/sudoers, > it'd still be racy since one would have to transiently remove the flag in order > to replace it with a new version. > > Related to this topic is the fact that S_IMMUTABLE is itself mutable; I > think once S_IMMUTABLE_CONTENTS is set, it would not be able to made > mutable again. > > Also I just remembered that since then memfd_create() and more notably > fcntl(F_ADD_SEALS) landed - in fact it already has flags for what we want > here AFAICS. Your S_IOMAP_IMMUTABLE is fcntl(F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW) I don't think F_SEAL_{SHRINK,GROW} prevents reflinking or CoW of file data, which are two things that cannot happen under S_IOMAP_IMMUTABLE that aren't size changes. From the implementation it looks like shrink and grow are only supposed to disallow changes to i_size, not i_blocks (or the file block map). Then again, I suppose F_SEAL_* only work on shmem, so maybe it simply isn't defined for any other filesystem...? e.g. it doesn't prohibit reflink, but the only fs implementing seals doesn't support reflink. <shrug> Seals cannot be removed, which is too strict for the S_IOMAP_IMMUTABLE user cases being presented. > and mine just adds in F_SEAL_WRITE. I think there was some discussion > of the seals for persistent files when memfd_create() landed, but I can't > find it offhand. --D > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jul 31, 2017, at 02:23 PM, Darrick J. Wong wrote: > I don't think F_SEAL_{SHRINK,GROW} prevents reflinking or CoW of file data, > which are two things that cannot happen under S_IOMAP_IMMUTABLE that > aren't size changes. From the implementation it looks like shrink and > grow are only supposed to disallow changes to i_size, not i_blocks (or > the file block map). True. > Then again, I suppose F_SEAL_* only work on shmem, so maybe it simply > isn't defined for any other filesystem...? e.g. it doesn't prohibit > reflink, but the only fs implementing seals doesn't support reflink. > > <shrug> > > Seals cannot be removed, which is too strict for the S_IOMAP_IMMUTABLE > user cases being presented. To be clear, the set of use cases is swap files and DAX, right? Or is there anything else? I can't imagine why anyone would want to turn a swap file back into a regular file. I haven't fully followed DAX, but I'd take your word for it if people want to be able to remove the flag after. Anyways, I think your broader point is right; the use cases are different enough that it doesn't make sense to try to add S_CONTENT_IMMUTABLE (or however one decides to call it) at the same time.
On Mon, Jul 31, 2017 at 10:15:12PM -0400, Colin Walters wrote: > On Mon, Jul 31, 2017, at 02:23 PM, Darrick J. Wong wrote: > > > I don't think F_SEAL_{SHRINK,GROW} prevents reflinking or CoW of file data, > > which are two things that cannot happen under S_IOMAP_IMMUTABLE that > > aren't size changes. From the implementation it looks like shrink and > > grow are only supposed to disallow changes to i_size, not i_blocks (or > > the file block map). > > True. > > > Then again, I suppose F_SEAL_* only work on shmem, so maybe it simply > > isn't defined for any other filesystem...? e.g. it doesn't prohibit > > reflink, but the only fs implementing seals doesn't support reflink. > > > > <shrug> > > > > Seals cannot be removed, which is too strict for the S_IOMAP_IMMUTABLE > > user cases being presented. > > To be clear, the set of use cases is swap files and DAX, right? Or is there anything else? I've outlined other use cases in previous discussions. To repeat myself, every so often we get someone with, say, a new high speed camera that want to dma the camera frames direct to the storage because they can't push 500,000 frames/s through the CPU to storage. Hence they want to bypass the OS and DMA the data direct to the storage. To do this they need a mechanism to freeze and unfreeze the block map of the file so that nothing modifies the block map while the camera hardware is dumping data direct to the storage. Immutable extent maps provide the functionality they need to implement this safely. There's also other similar use cases for RDMA targets on PMEM (regardless of whether DAX is enabled or not), and I've come across a couple of requests for mechanisms to allow fabric based nvme storage to do direct data transfers between storage devices, too. All of these use cases can be safely implemented if there is a mechanism to mark extent maps as immutable for the duration of the operation they need to perform. > I can't imagine why anyone would want to turn a swap file back into a regular file. > I haven't fully followed DAX, but I'd take your word for it if people want to > be able to remove the flag after. DAX isn't the driver of that functionality, it's the other use cases that need it, and why the proposed "only remove flag if len == 0" API is a non-starter.... Cheers, Dave.
On Tue, Aug 01, 2017 at 12:42:18PM +1000, Dave Chinner wrote: > I've outlined other use cases in previous discussions. To repeat > myself, every so often we get someone with, say, a new high > speed camera that want to dma the camera frames direct to the > storage because they can't push 500,000 frames/s through the CPU > to storage. Hence they want to bypass the OS and DMA the data direct > to the storage. To do this they need a mechanism to freeze and unfreeze > the block map of the file so that nothing modifies the block map > while the camera hardware is dumping data direct to the storage. > Immutable extent maps provide the functionality they need to > implement this safely. And we have such a mechanism already: it's called the iolock during I/O, and dirct I/O. I've worked on plenty such schemes and the proper way works perfectly fine. Just because people ask for stupid ways to archives that doesn't mean they understand what they are doing. > There's also other similar use cases for RDMA targets on PMEM > (regardless of whether DAX is enabled or not), and I've come across > a couple of requests for mechanisms to allow fabric based nvme > storage to do direct data transfers between storage devices, too. > All of these use cases can be safely implemented if there is a > mechanism to mark extent maps as immutable for the duration of > the operation they need to perform. As someone who spent most of them time on the last 2 years in this area: we have a massive problem discoverability and addressing (lack of struct page) for p2p devices. We have absolutely no problem with the direct I/O model with them. > DAX isn't the driver of that functionality, it's the other use cases > that need it, and why the proposed "only remove flag if len == 0" > API is a non-starter.... The other "use" cases are even more bullshit than the DAX one.
diff --git a/fs/attr.c b/fs/attr.c index 135304146120..8573e364bd06 100644 --- a/fs/attr.c +++ b/fs/attr.c @@ -112,6 +112,16 @@ EXPORT_SYMBOL(setattr_prepare); */ int inode_newsize_ok(const struct inode *inode, loff_t offset) { + if (IS_IOMAP_IMMUTABLE(inode)) { + /* + * Any size change is disallowed. Size increases may + * dirty metadata that an application is not prepared to + * sync, and a size decrease may expose free blocks to + * in-flight DMA. + */ + return -ETXTBSY; + } + if (inode->i_size < offset) { unsigned long limit; diff --git a/fs/namei.c b/fs/namei.c index ddb6a7c2b3d4..588f1135c42c 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2793,7 +2793,8 @@ static int may_delete(struct inode *dir, struct dentry *victim, bool isdir) return -EPERM; if (check_sticky(dir, inode) || IS_APPEND(inode) || - IS_IMMUTABLE(inode) || IS_SWAPFILE(inode) || HAS_UNMAPPED_ID(inode)) + IS_IMMUTABLE(inode) || IS_SWAPFILE(inode) || HAS_UNMAPPED_ID(inode) + || IS_IOMAP_IMMUTABLE(inode)) return -EPERM; if (isdir) { if (!d_is_dir(victim)) diff --git a/fs/open.c b/fs/open.c index 35bb784763a4..7395860d7164 100644 --- a/fs/open.c +++ b/fs/open.c @@ -292,6 +292,12 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len) return -ETXTBSY; /* + * We cannot allow any allocation changes on an iomap immutable file + */ + if (IS_IOMAP_IMMUTABLE(inode)) + return -ETXTBSY; + + /* * Revalidate the write permissions, in case security policy has * changed since the files were opened. */ diff --git a/fs/read_write.c b/fs/read_write.c index 0cc7033aa413..dc673be7c7cb 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1706,6 +1706,9 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in, if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out)) return -ETXTBSY; + if (IS_IOMAP_IMMUTABLE(inode_in) || IS_IOMAP_IMMUTABLE(inode_out)) + return -ETXTBSY; + /* Don't reflink dirs, pipes, sockets... */ if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode)) return -EISDIR; diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c index e75c40a47b7d..2e64488bc4de 100644 --- a/fs/xfs/xfs_ioctl.c +++ b/fs/xfs/xfs_ioctl.c @@ -1755,6 +1755,12 @@ xfs_ioc_swapext( goto out_put_tmp_file; } + if (IS_IOMAP_IMMUTABLE(file_inode(f.file)) || + IS_IOMAP_IMMUTABLE(file_inode(tmp.file))) { + error = -EINVAL; + goto out_put_tmp_file; + } + /* * We need to ensure that the fds passed in point to XFS inodes * before we cast and access them as XFS structures as we have no diff --git a/include/linux/fs.h b/include/linux/fs.h index 6e1fd5d21248..0a254b768855 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1829,6 +1829,7 @@ struct super_operations { #else #define S_DAX 0 /* Make all the DAX code disappear */ #endif +#define S_IOMAP_IMMUTABLE 16384 /* logical-to-physical extent map is fixed */ /* * Note that nosuid etc flags are inode-specific: setting some file-system @@ -1867,6 +1868,7 @@ struct super_operations { #define IS_AUTOMOUNT(inode) ((inode)->i_flags & S_AUTOMOUNT) #define IS_NOSEC(inode) ((inode)->i_flags & S_NOSEC) #define IS_DAX(inode) ((inode)->i_flags & S_DAX) +#define IS_IOMAP_IMMUTABLE(inode) ((inode)->i_flags & S_IOMAP_IMMUTABLE) #define IS_WHITEOUT(inode) (S_ISCHR(inode->i_mode) && \ (inode)->i_rdev == WHITEOUT_DEV) diff --git a/mm/filemap.c b/mm/filemap.c index a49702445ce0..e4a6529da2bd 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2806,6 +2806,15 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from) if (unlikely(pos >= inode->i_sb->s_maxbytes)) return -EFBIG; + /* Are we about to mutate the block map on an immutable file? */ + if (IS_IOMAP_IMMUTABLE(inode) + && (pos + iov_iter_count(from) > i_size_read(inode))) { + if (pos < i_size_read(inode)) + iov_iter_truncate(from, i_size_read(inode) - pos); + else + return -ETXTBSY; + } + iov_iter_truncate(from, inode->i_sb->s_maxbytes - pos); return iov_iter_count(from); }
An inode with this flag set indicates that the file's block map cannot be changed, no size change, deletion, hole-punch, range collapse, or reflink. The implementation of toggling the flag and sealing the state of the extent map is saved for a later patch. The functionality provided by S_IOMAP_IMMUTABLE, once toggle support is added, will be a superset of that provided by S_SWAPFILE, and it is targeted to replace it. For now, only xfs and the core vfs are updated to consider the new flag. The additional check that is added for this flag, beyond what we are already doing for swapfiles, is to truncate or abort writes that try to extend the file size. Cc: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Suggested-by: Dave Chinner <david@fromorbit.com> Suggested-by: "Darrick J. Wong" <darrick.wong@oracle.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- fs/attr.c | 10 ++++++++++ fs/namei.c | 3 ++- fs/open.c | 6 ++++++ fs/read_write.c | 3 +++ fs/xfs/xfs_ioctl.c | 6 ++++++ include/linux/fs.h | 2 ++ mm/filemap.c | 9 +++++++++ 7 files changed, 38 insertions(+), 1 deletion(-)