diff mbox series

[1/3] vfs: add new f_op->syncfs vector

Message ID 20201216233149.39025-2-vgoyal@redhat.com (mailing list archive)
State New, archived
Headers show
Series vfs, overlayfs: Fix syncfs() to return error | expand

Commit Message

Vivek Goyal Dec. 16, 2020, 11:31 p.m. UTC
Current implementation of __sync_filesystem() ignores the return code
from ->sync_fs(). I am not sure why that's the case. There must have
been some historical reason for this.

Ignoring ->sync_fs() return code is problematic for overlayfs where
it can return error if sync_filesystem() on upper super block failed.
That error will simply be lost and sycnfs(overlay_fd), will get
success (despite the fact it failed).

If we modify existing implementation, there is a concern that it will
lead to user space visible behavior changes and break things. So
instead implement a new file_operations->syncfs() call which will
be called in syncfs() syscall path. Return code from this new
call will be captured. And all the writeback error detection
logic can go in there as well. Only filesystems which implement
this call get affected by this change. Others continue to fallback
to existing mechanism.

To be clear, I mean something like this (draft, untested) patch. You'd
also need to add a new ->syncfs op for overlayfs, and that could just do
a check_and_advance against the upper layer sb's errseq_t after calling
sync_filesystem.

Vivek, fixed couple of minor compile errors in original patch.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/sync.c          | 29 ++++++++++++++++++++---------
 include/linux/fs.h |  1 +
 2 files changed, 21 insertions(+), 9 deletions(-)

Comments

Al Viro Dec. 17, 2020, 12:49 a.m. UTC | #1
[Christoph added to Cc...]
On Wed, Dec 16, 2020 at 06:31:47PM -0500, Vivek Goyal wrote:
> Current implementation of __sync_filesystem() ignores the return code
> from ->sync_fs(). I am not sure why that's the case. There must have
> been some historical reason for this.
> 
> Ignoring ->sync_fs() return code is problematic for overlayfs where
> it can return error if sync_filesystem() on upper super block failed.
> That error will simply be lost and sycnfs(overlay_fd), will get
> success (despite the fact it failed).
> 
> If we modify existing implementation, there is a concern that it will
> lead to user space visible behavior changes and break things. So
> instead implement a new file_operations->syncfs() call which will
> be called in syncfs() syscall path. Return code from this new
> call will be captured. And all the writeback error detection
> logic can go in there as well. Only filesystems which implement
> this call get affected by this change. Others continue to fallback
> to existing mechanism.

That smells like a massive source of confusion down the road.  I'd just
looked through the existing instances; many always return 0, but quite
a few sometimes try to return an error:
fs/btrfs/super.c:2412:  .sync_fs        = btrfs_sync_fs,
fs/exfat/super.c:204:   .sync_fs        = exfat_sync_fs,
fs/ext4/super.c:1674:   .sync_fs        = ext4_sync_fs,
fs/f2fs/super.c:2480:   .sync_fs        = f2fs_sync_fs,
fs/gfs2/super.c:1600:   .sync_fs                = gfs2_sync_fs,
fs/hfsplus/super.c:368: .sync_fs        = hfsplus_sync_fs,
fs/nilfs2/super.c:689:  .sync_fs        = nilfs_sync_fs,
fs/ocfs2/super.c:139:   .sync_fs        = ocfs2_sync_fs,
fs/overlayfs/super.c:399:       .sync_fs        = ovl_sync_fs,
fs/ubifs/super.c:2052:  .sync_fs       = ubifs_sync_fs,
is the list of such.  There are 4 method callers:
dquot_quota_sync(), dquot_disable(), __sync_filesystem() and
sync_fs_one_sb().  For sync_fs_one_sb() we want to ignore the
return value; for __sync_filesystem() we almost certainly
do *not* - it ends with return __sync_blockdev(sb->s_bdev, wait),
after all.  The question for that one is whether we want
__sync_blockdev() called even in case of ->sync_fs() reporting
a failure, and I suspect that it's safer to call it anyway and
return the first error value we'd got.  No idea about quota
situation.
Jan Kara Dec. 17, 2020, 9:57 a.m. UTC | #2
On Thu 17-12-20 00:49:35, Al Viro wrote:
> [Christoph added to Cc...]
> On Wed, Dec 16, 2020 at 06:31:47PM -0500, Vivek Goyal wrote:
> > Current implementation of __sync_filesystem() ignores the return code
> > from ->sync_fs(). I am not sure why that's the case. There must have
> > been some historical reason for this.
> > 
> > Ignoring ->sync_fs() return code is problematic for overlayfs where
> > it can return error if sync_filesystem() on upper super block failed.
> > That error will simply be lost and sycnfs(overlay_fd), will get
> > success (despite the fact it failed).
> > 
> > If we modify existing implementation, there is a concern that it will
> > lead to user space visible behavior changes and break things. So
> > instead implement a new file_operations->syncfs() call which will
> > be called in syncfs() syscall path. Return code from this new
> > call will be captured. And all the writeback error detection
> > logic can go in there as well. Only filesystems which implement
> > this call get affected by this change. Others continue to fallback
> > to existing mechanism.
> 
> That smells like a massive source of confusion down the road.  I'd just
> looked through the existing instances; many always return 0, but quite
> a few sometimes try to return an error:
> fs/btrfs/super.c:2412:  .sync_fs        = btrfs_sync_fs,
> fs/exfat/super.c:204:   .sync_fs        = exfat_sync_fs,
> fs/ext4/super.c:1674:   .sync_fs        = ext4_sync_fs,
> fs/f2fs/super.c:2480:   .sync_fs        = f2fs_sync_fs,
> fs/gfs2/super.c:1600:   .sync_fs                = gfs2_sync_fs,
> fs/hfsplus/super.c:368: .sync_fs        = hfsplus_sync_fs,
> fs/nilfs2/super.c:689:  .sync_fs        = nilfs_sync_fs,
> fs/ocfs2/super.c:139:   .sync_fs        = ocfs2_sync_fs,
> fs/overlayfs/super.c:399:       .sync_fs        = ovl_sync_fs,
> fs/ubifs/super.c:2052:  .sync_fs       = ubifs_sync_fs,
> is the list of such.  There are 4 method callers:
> dquot_quota_sync(), dquot_disable(), __sync_filesystem() and
> sync_fs_one_sb().  For sync_fs_one_sb() we want to ignore the
> return value; for __sync_filesystem() we almost certainly
> do *not* - it ends with return __sync_blockdev(sb->s_bdev, wait),
> after all.  The question for that one is whether we want
> __sync_blockdev() called even in case of ->sync_fs() reporting
> a failure, and I suspect that it's safer to call it anyway and
> return the first error value we'd got.  No idea about quota
> situation.

WRT quota situation: All the ->sync_fs() calls there are due to cache
coherency reasons (we need to get quota changes to disk, then prune quota
files's page cache, and then userspace can read current quota structures
from the disk). We don't want to fail dquot_disable() just because caches
might be incoherent so ignoring ->sync_fs() return value there is fine.
With dquot_quota_sync() it might make some sense to return the error -
that's just a backend for Q_SYNC quotactl(2). OTOH I'm not sure anybody
really cares - Q_SYNC is rarely used.

								Honza
Vivek Goyal Dec. 17, 2020, 3 p.m. UTC | #3
On Thu, Dec 17, 2020 at 12:49:35AM +0000, Al Viro wrote:
> [Christoph added to Cc...]
> On Wed, Dec 16, 2020 at 06:31:47PM -0500, Vivek Goyal wrote:
> > Current implementation of __sync_filesystem() ignores the return code
> > from ->sync_fs(). I am not sure why that's the case. There must have
> > been some historical reason for this.
> > 
> > Ignoring ->sync_fs() return code is problematic for overlayfs where
> > it can return error if sync_filesystem() on upper super block failed.
> > That error will simply be lost and sycnfs(overlay_fd), will get
> > success (despite the fact it failed).
> > 
> > If we modify existing implementation, there is a concern that it will
> > lead to user space visible behavior changes and break things. So
> > instead implement a new file_operations->syncfs() call which will
> > be called in syncfs() syscall path. Return code from this new
> > call will be captured. And all the writeback error detection
> > logic can go in there as well. Only filesystems which implement
> > this call get affected by this change. Others continue to fallback
> > to existing mechanism.
> 
> That smells like a massive source of confusion down the road.  I'd just
> looked through the existing instances; many always return 0, but quite
> a few sometimes try to return an error:
> fs/btrfs/super.c:2412:  .sync_fs        = btrfs_sync_fs,
> fs/exfat/super.c:204:   .sync_fs        = exfat_sync_fs,
> fs/ext4/super.c:1674:   .sync_fs        = ext4_sync_fs,
> fs/f2fs/super.c:2480:   .sync_fs        = f2fs_sync_fs,
> fs/gfs2/super.c:1600:   .sync_fs                = gfs2_sync_fs,
> fs/hfsplus/super.c:368: .sync_fs        = hfsplus_sync_fs,
> fs/nilfs2/super.c:689:  .sync_fs        = nilfs_sync_fs,
> fs/ocfs2/super.c:139:   .sync_fs        = ocfs2_sync_fs,
> fs/overlayfs/super.c:399:       .sync_fs        = ovl_sync_fs,
> fs/ubifs/super.c:2052:  .sync_fs       = ubifs_sync_fs,
> is the list of such.  There are 4 method callers:
> dquot_quota_sync(), dquot_disable(), __sync_filesystem() and
> sync_fs_one_sb().  For sync_fs_one_sb() we want to ignore the
> return value; for __sync_filesystem() we almost certainly
> do *not* - it ends with return __sync_blockdev(sb->s_bdev, wait),
> after all.  The question for that one is whether we want
> __sync_blockdev() called even in case of ->sync_fs() reporting
> a failure, and I suspect that it's safer to call it anyway and
> return the first error value we'd got.

I posted V1 patch to do exactly above. In __sync_filesystem(), capture
return code from ->sync_fs() but continue to call __sync_blockdev() and
and return error code from ->sync_fs() if there is one otherwise
return error code from __sync_blockdev().

https://lore.kernel.org/linux-fsdevel/20201216143802.GA10550@redhat.com/

Thanks
Vivek

> No idea about quota situation.
>
Vivek Goyal Dec. 17, 2020, 4:15 p.m. UTC | #4
On Thu, Dec 17, 2020 at 10:57:28AM +0100, Jan Kara wrote:
> On Thu 17-12-20 00:49:35, Al Viro wrote:
> > [Christoph added to Cc...]
> > On Wed, Dec 16, 2020 at 06:31:47PM -0500, Vivek Goyal wrote:
> > > Current implementation of __sync_filesystem() ignores the return code
> > > from ->sync_fs(). I am not sure why that's the case. There must have
> > > been some historical reason for this.
> > > 
> > > Ignoring ->sync_fs() return code is problematic for overlayfs where
> > > it can return error if sync_filesystem() on upper super block failed.
> > > That error will simply be lost and sycnfs(overlay_fd), will get
> > > success (despite the fact it failed).
> > > 
> > > If we modify existing implementation, there is a concern that it will
> > > lead to user space visible behavior changes and break things. So
> > > instead implement a new file_operations->syncfs() call which will
> > > be called in syncfs() syscall path. Return code from this new
> > > call will be captured. And all the writeback error detection
> > > logic can go in there as well. Only filesystems which implement
> > > this call get affected by this change. Others continue to fallback
> > > to existing mechanism.
> > 
> > That smells like a massive source of confusion down the road.  I'd just
> > looked through the existing instances; many always return 0, but quite
> > a few sometimes try to return an error:
> > fs/btrfs/super.c:2412:  .sync_fs        = btrfs_sync_fs,
> > fs/exfat/super.c:204:   .sync_fs        = exfat_sync_fs,
> > fs/ext4/super.c:1674:   .sync_fs        = ext4_sync_fs,
> > fs/f2fs/super.c:2480:   .sync_fs        = f2fs_sync_fs,
> > fs/gfs2/super.c:1600:   .sync_fs                = gfs2_sync_fs,
> > fs/hfsplus/super.c:368: .sync_fs        = hfsplus_sync_fs,
> > fs/nilfs2/super.c:689:  .sync_fs        = nilfs_sync_fs,
> > fs/ocfs2/super.c:139:   .sync_fs        = ocfs2_sync_fs,
> > fs/overlayfs/super.c:399:       .sync_fs        = ovl_sync_fs,
> > fs/ubifs/super.c:2052:  .sync_fs       = ubifs_sync_fs,
> > is the list of such.  There are 4 method callers:
> > dquot_quota_sync(), dquot_disable(), __sync_filesystem() and
> > sync_fs_one_sb().  For sync_fs_one_sb() we want to ignore the
> > return value; for __sync_filesystem() we almost certainly
> > do *not* - it ends with return __sync_blockdev(sb->s_bdev, wait),
> > after all.  The question for that one is whether we want
> > __sync_blockdev() called even in case of ->sync_fs() reporting
> > a failure, and I suspect that it's safer to call it anyway and
> > return the first error value we'd got.  No idea about quota
> > situation.
> 
> WRT quota situation: All the ->sync_fs() calls there are due to cache
> coherency reasons (we need to get quota changes to disk, then prune quota
> files's page cache, and then userspace can read current quota structures
> from the disk). We don't want to fail dquot_disable() just because caches
> might be incoherent so ignoring ->sync_fs() return value there is fine.
> With dquot_quota_sync() it might make some sense to return the error -
> that's just a backend for Q_SYNC quotactl(2). OTOH I'm not sure anybody
> really cares - Q_SYNC is rarely used.

Thanks Jan. May be I will leave dquot_quota_sync() untouched for now. When
somebody has a need to capture return code from ->sync_fs() there, it
can be easily added.

Vivek
Jeffrey Layton Dec. 17, 2020, 7:49 p.m. UTC | #5
On Thu, 2020-12-17 at 00:49 +0000, Al Viro wrote:
> [Christoph added to Cc...]
> On Wed, Dec 16, 2020 at 06:31:47PM -0500, Vivek Goyal wrote:
> > Current implementation of __sync_filesystem() ignores the return code
> > from ->sync_fs(). I am not sure why that's the case. There must have
> > been some historical reason for this.
> > 
> > Ignoring ->sync_fs() return code is problematic for overlayfs where
> > it can return error if sync_filesystem() on upper super block failed.
> > That error will simply be lost and sycnfs(overlay_fd), will get
> > success (despite the fact it failed).
> > 
> > If we modify existing implementation, there is a concern that it will
> > lead to user space visible behavior changes and break things. So
> > instead implement a new file_operations->syncfs() call which will
> > be called in syncfs() syscall path. Return code from this new
> > call will be captured. And all the writeback error detection
> > logic can go in there as well. Only filesystems which implement
> > this call get affected by this change. Others continue to fallback
> > to existing mechanism.
> 
> That smells like a massive source of confusion down the road.  I'd just
> looked through the existing instances; many always return 0, but quite
> a few sometimes try to return an error:
> fs/btrfs/super.c:2412:  .sync_fs        = btrfs_sync_fs,
> fs/exfat/super.c:204:   .sync_fs        = exfat_sync_fs,
> fs/ext4/super.c:1674:   .sync_fs        = ext4_sync_fs,
> fs/f2fs/super.c:2480:   .sync_fs        = f2fs_sync_fs,
> fs/gfs2/super.c:1600:   .sync_fs                = gfs2_sync_fs,
> fs/hfsplus/super.c:368: .sync_fs        = hfsplus_sync_fs,
> fs/nilfs2/super.c:689:  .sync_fs        = nilfs_sync_fs,
> fs/ocfs2/super.c:139:   .sync_fs        = ocfs2_sync_fs,
> fs/overlayfs/super.c:399:       .sync_fs        = ovl_sync_fs,
> fs/ubifs/super.c:2052:  .sync_fs       = ubifs_sync_fs,
> is the list of such.  There are 4 method callers:
> dquot_quota_sync(), dquot_disable(), __sync_filesystem() and
> sync_fs_one_sb().  For sync_fs_one_sb() we want to ignore the
> return value; for __sync_filesystem() we almost certainly
> do *not* - it ends with return __sync_blockdev(sb->s_bdev, wait),
> after all.  The question for that one is whether we want
> __sync_blockdev() called even in case of ->sync_fs() reporting
> a failure, and I suspect that it's safer to call it anyway and
> return the first error value we'd got.  No idea about quota
> situation.
> 

The only problem there is that makes it a bit difficult to override the
error return to syncfs, which is really what overlayfs wants to be able
to do. Their syncfs syncs out the upper layer, so it makes sense to just
have their file->f_sb_err track the upper layer's sb->s_wb_err.

You can plumb the errors from sync_fs all the way through to the syncfs
syscall, but we can't currently tell whether we're doing the sync_fs op
on behalf of sync(), syncfs() or something else entirely. We need to
ensure that if it does return an error, that it doesn't get dropped on
the floor.

I think it'd be simpler to just add f_op->syncfs and change
s_op->sync_fs to a different name, to lessen the confusion.
s_op->sync_fs sort of makes it look like you're implementing syncfs(2),
but there's a bit more to it than that.

Maybe s_op->sync_filesystem? There are only about 113 instances
"sync_fs" in the tree. Changing the name might also help highlight the
fact that the return code won't be ignored like it used to be.
diff mbox series

Patch

diff --git a/fs/sync.c b/fs/sync.c
index 1373a610dc78..06caa9758d93 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -155,27 +155,38 @@  void emergency_sync(void)
 	}
 }
 
+static int generic_syncfs(struct file *file)
+{
+	int ret, ret2;
+	struct super_block *sb = file->f_path.dentry->d_sb;
+
+	down_read(&sb->s_umount);
+	ret = sync_filesystem(sb);
+	up_read(&sb->s_umount);
+
+	ret2 = errseq_check_and_advance(&sb->s_wb_err, &file->f_sb_err);
+
+	return ret ? ret : ret2;
+}
+
 /*
  * sync a single super
  */
 SYSCALL_DEFINE1(syncfs, int, fd)
 {
 	struct fd f = fdget(fd);
-	struct super_block *sb;
-	int ret, ret2;
+	int ret;
 
 	if (!f.file)
 		return -EBADF;
-	sb = f.file->f_path.dentry->d_sb;
-
-	down_read(&sb->s_umount);
-	ret = sync_filesystem(sb);
-	up_read(&sb->s_umount);
 
-	ret2 = errseq_check_and_advance(&sb->s_wb_err, &f.file->f_sb_err);
+	if (f.file->f_op->syncfs)
+		ret = f.file->f_op->syncfs(f.file);
+	else
+		ret = generic_syncfs(f.file);
 
 	fdput(f);
-	return ret ? ret : ret2;
+	return ret;
 }
 
 /**
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8667d0cdc71e..6710469b7e33 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1859,6 +1859,7 @@  struct file_operations {
 				   struct file *file_out, loff_t pos_out,
 				   loff_t len, unsigned int remap_flags);
 	int (*fadvise)(struct file *, loff_t, loff_t, int);
+	int (*syncfs)(struct file *);
 } __randomize_layout;
 
 struct inode_operations {