
ovl: skip overlayfs superblocks at global sync

Message ID 158642098777.5635.10501704178160375549.stgit@buzz (mailing list archive)
State New, archived
Series ovl: skip overlayfs superblocks at global sync

Commit Message

Konstantin Khlebnikov April 9, 2020, 8:29 a.m. UTC
Stacked filesystems like overlayfs have no writeback of their own, but
they have to forward syncfs() requests to the backing filesystem to
preserve data integrity.

During a global sync() each overlayfs instance calls the backing
filesystem's ->sync_fs() method, even though that superblock is already
on the global list of superblocks. As a result a single sync() syscall
could write the same superblock several times and send multiple disk
barriers.

This patch adds the flag SB_I_SKIP_SYNC to sb->s_iflags to avoid that.

Reported-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 fs/overlayfs/super.c |    5 +++--
 fs/sync.c            |    3 ++-
 include/linux/fs.h   |    2 ++
 3 files changed, 7 insertions(+), 3 deletions(-)
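
For context: sync(2) ends up in ksys_sync(), which walks the global
superblock list and invokes ->sync_fs() on every superblock, overlayfs
instances included. Roughly, from fs/sync.c of this era (abridged here
for illustration):

	void ksys_sync(void)
	{
		int nowait = 0, wait = 1;

		wakeup_flusher_threads(WB_REASON_SYNC);
		/* Write back dirty inodes of every superblock... */
		iterate_supers(sync_inodes_one_sb, NULL);
		/* ...then call ->sync_fs() on each superblock: first a
		 * non-blocking pass, then a waiting one. */
		iterate_supers(sync_fs_one_sb, &nowait);
		iterate_supers(sync_fs_one_sb, &wait);
		/* (block device flushing elided) */
	}

Each overlayfs superblock on that list forwards its ->sync_fs() to the
backing superblock, which the same loop visits anyway, hence the
duplicate writes and barriers this patch avoids.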

Comments

Amir Goldstein April 9, 2020, 10:23 a.m. UTC | #1
On Thu, Apr 9, 2020 at 11:30 AM Konstantin Khlebnikov
<khlebnikov@yandex-team.ru> wrote:
>
> Stacked filesystems like overlayfs have no writeback of their own, but
> they have to forward syncfs() requests to the backing filesystem to
> preserve data integrity.
>
> During a global sync() each overlayfs instance calls the backing
> filesystem's ->sync_fs() method, even though that superblock is already
> on the global list of superblocks. As a result a single sync() syscall
> could write the same superblock several times and send multiple disk
> barriers.
>
> This patch adds the flag SB_I_SKIP_SYNC to sb->s_iflags to avoid that.
>
> Reported-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> ---

Seems reasonable.
You may add:
Reviewed-by: Amir Goldstein <amir73il@gmail.com>

+CC: containers list

This brings up old memories.
I posted this way back to fix handling of emergency_remount() in the
presence of a loop-mounted fs:
https://lore.kernel.org/linux-ext4/CAA2m6vfatWKS1CQFpaRbii2AXiZFvQUjVvYhGxWTSpz+2rxDyg@mail.gmail.com/

But it seems to me that emergency_sync() and sync(2) are equally broken
for this use case.

I wonder if anyone cares enough about the resilience of loop-mounted fs to try
and change the iterate_* functions to iterate supers/bdevs in reverse order...

Thanks,
Amir.
Konstantin Khlebnikov April 9, 2020, 11:28 a.m. UTC | #2
On 09/04/2020 13.23, Amir Goldstein wrote:
> On Thu, Apr 9, 2020 at 11:30 AM Konstantin Khlebnikov
> <khlebnikov@yandex-team.ru> wrote:
>> [...]
> 
> Seems reasonable.
> You may add:
> Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> 
> +CC: containers list

Thanks

> 
> This brings up old memories.
> I posted this way back to fix handling of emergency_remount() in the
> presence of a loop-mounted fs:
> https://lore.kernel.org/linux-ext4/CAA2m6vfatWKS1CQFpaRbii2AXiZFvQUjVvYhGxWTSpz+2rxDyg@mail.gmail.com/
>
> But it seems to me that emergency_sync() and sync(2) are equally broken
> for this use case.
>
> I wonder if anyone cares enough about the resilience of loop-mounted fs to try
> and change the iterate_* functions to iterate supers/bdevs in reverse order...

Now I see the reason behind "sync; sync; sync; reboot" =)

Iterating old -> new means new items are not missed if the list is
modified during the walk. That might be important for some users.

bdev iteration already seems reversed: inode_sb_list_add() adds to the head
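
For reference, the head insertion referred to here (fs/inode.c, with a
comment added for illustration):

	void inode_sb_list_add(struct inode *inode)
	{
		spin_lock(&inode->i_sb->s_inode_list_lock);
		/* list_add() inserts at the head, so a forward walk of
		 * sb->s_inodes visits the newest inodes first. */
		list_add(&inode->i_sb_list, &inode->i_sb->s_inode_list);
		spin_unlock(&inode->i_sb->s_inode_list_lock);
	}

iterate_bdevs() walks blockdev_superblock->s_inodes, which is populated
through this path, hence the effectively reversed bdev order.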

Amir Goldstein April 9, 2020, 11:48 a.m. UTC | #3
On Thu, Apr 9, 2020 at 2:28 PM Konstantin Khlebnikov
<khlebnikov@yandex-team.ru> wrote:
>
> On 09/04/2020 13.23, Amir Goldstein wrote:
> > [...]
> >
> > This brings up old memories.
> > I posted this way back to fix handling of emergency_remount() in the
> > presence of a loop-mounted fs:
> > https://lore.kernel.org/linux-ext4/CAA2m6vfatWKS1CQFpaRbii2AXiZFvQUjVvYhGxWTSpz+2rxDyg@mail.gmail.com/
> >
> > But it seems to me that emergency_sync() and sync(2) are equally broken
> > for this use case.
> >
> > I wonder if anyone cares enough about the resilience of loop-mounted fs to try
> > and change the iterate_* functions to iterate supers/bdevs in reverse order...
>
> Now I see the reason behind "sync; sync; sync; reboot" =)
>
> Iterating old -> new means new items are not missed if the list is
> modified during the walk. That might be important for some users.
>

That's not the reason I suggested reverse order.
The reason is that with a loop-mounted fs, the correct order of flushing is:
1. sync loop-mounted fs inodes => writes to the loop image file
2. sync loop-mounted fs sb => fsyncs the loop image file
3. sync the loop image host fs sb

With forward sb iteration order, #3 happens before #1, so the
loop-mounted fs changes are not really made durable by
a single sync(2) call.
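
A reverse walk along those lines could look roughly like this: a sketch
modeled on iterate_supers() in fs/super.c; iterate_supers_reverse() as
written here is reconstructed for illustration, not the exact code from
the patch linked above:

	void iterate_supers_reverse(void (*f)(struct super_block *, void *),
				    void *arg)
	{
		struct super_block *sb, *p = NULL;

		spin_lock(&sb_lock);
		/* Walk newest -> oldest, so a loop-mounted fs is flushed
		 * before the fs hosting its image file. */
		list_for_each_entry_reverse(sb, &super_blocks, s_list) {
			if (hlist_unhashed(&sb->s_instances))
				continue;
			sb->s_count++;
			spin_unlock(&sb_lock);

			down_read(&sb->s_umount);
			if (sb->s_root && (sb->s_flags & SB_BORN))
				f(sb, arg);
			up_read(&sb->s_umount);

			spin_lock(&sb_lock);
			if (p)
				__put_super(p);
			p = sb;
		}
		if (p)
			__put_super(p);
		spin_unlock(&sb_lock);
	}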

> bdev iteration already seems reversed: inode_sb_list_add() adds to the head
>

I think bdev iteration order will not make a difference in this case.
Flushing /dev/loopX will not be needed, and it happens too late
anyway.

Thanks,
Amir.
Konstantin Khlebnikov April 9, 2020, 12:04 p.m. UTC | #4
On 09/04/2020 14.48, Amir Goldstein wrote:
> On Thu, Apr 9, 2020 at 2:28 PM Konstantin Khlebnikov
> <khlebnikov@yandex-team.ru> wrote:
>> [...]
>>
>> Now I see the reason behind "sync; sync; sync; reboot" =)
>>
>> Iterating old -> new means new items are not missed if the list is
>> modified during the walk. That might be important for some users.
>>
> 
> That's not the reason I suggested reverse order.
> The reason is that with a loop-mounted fs, the correct order of flushing is:
> 1. sync loop-mounted fs inodes => writes to the loop image file
> 2. sync loop-mounted fs sb => fsyncs the loop image file
> 3. sync the loop image host fs sb
>
> With forward sb iteration order, #3 happens before #1, so the
> loop-mounted fs changes are not really made durable by
> a single sync(2) call.

If the fs is loop-mounted with barriers enabled, then sync_fs will issue
REQ_OP_FLUSH to the loop device and trigger fsync() on the image file.
sync() might write something twice, but the data should be safe.
Without barriers this scenario is broken for sure.

Emergency remount R/O is another matter. It really needs reverse order.
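
For reference, this is roughly how the loop driver turns a barrier into
an fsync of the image file (abridged from drivers/block/loop.c of this
era, comment added for illustration):

	static int lo_req_flush(struct loop_device *lo, struct request *rq)
	{
		struct file *file = lo->lo_backing_file;
		/* A REQ_OP_FLUSH on /dev/loopX becomes an fsync() of the
		 * backing image file. */
		int ret = vfs_fsync(file, 0);

		if (unlikely(ret && ret != -EINVAL))
			ret = -EIO;
		return ret;
	}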

> 
>> bdev iteration already seems reversed: inode_sb_list_add() adds to the head
>>
> 
> I think bdev iteration order will not make a difference in this case.
> Flushing /dev/loopX will not be needed, and it happens too late
> anyway.
Amir Goldstein April 9, 2020, 1:22 p.m. UTC | #5
On Thu, Apr 9, 2020 at 3:04 PM Konstantin Khlebnikov
<khlebnikov@yandex-team.ru> wrote:
> > [...]
> >
> > That's not the reason I suggested reverse order.
> > The reason is that with a loop-mounted fs, the correct order of flushing is:
> > 1. sync loop-mounted fs inodes => writes to the loop image file
> > 2. sync loop-mounted fs sb => fsyncs the loop image file
> > 3. sync the loop image host fs sb
> >
> > With forward sb iteration order, #3 happens before #1, so the
> > loop-mounted fs changes are not really made durable by
> > a single sync(2) call.
>
> If the fs is loop-mounted with barriers enabled, then sync_fs will issue
> REQ_OP_FLUSH to the loop device and trigger fsync() on the image file.
> sync() might write something twice, but the data should be safe.
> Without barriers this scenario is broken for sure.
>
> Emergency remount R/O is another matter. It really needs reverse order.
>

Correct. There is no problem with durability.
Although for some filesystems it would be more efficient to first
write and fsync the loop images and then sync_fs().
It can potentially result in fewer disk barriers overall.

Thanks,
Amir.
Dave Chinner April 11, 2020, 10:28 p.m. UTC | #6
On Thu, Apr 09, 2020 at 11:29:47AM +0300, Konstantin Khlebnikov wrote:
> Stacked filesystems like overlayfs have no writeback of their own, but
> they have to forward syncfs() requests to the backing filesystem to
> preserve data integrity.
>
> During a global sync() each overlayfs instance calls the backing
> filesystem's ->sync_fs() method, even though that superblock is already
> on the global list of superblocks. As a result a single sync() syscall
> could write the same superblock several times and send multiple disk
> barriers.
>
> This patch adds the flag SB_I_SKIP_SYNC to sb->s_iflags to avoid that.

Why wouldn't you just remove the ->sync_fs method from overlay?

I mean, if you don't need the filesystem to do anything special for
one specific data integrity sync_fs call, you don't need it for any
of them, yes?

-Dave.
Amir Goldstein April 12, 2020, 6:46 a.m. UTC | #7
On Sun, Apr 12, 2020 at 1:29 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, Apr 09, 2020 at 11:29:47AM +0300, Konstantin Khlebnikov wrote:
> > [...]
>
> Why wouldn't you just remove the ->sync_fs method from overlay?
>
> I mean, if you don't need the filesystem to do anything special for
> one specific data integrity sync_fs call, you don't need it for any
> of them, yes?
>

No, but I understand the confusion.

Say you have 1000 overlay sb's, all of them using upper directories
from a single xfs sb (quite common for containers).

syncfs(2) on each overlay must call the sync_fs of xfs (see ovl_sync_fs).
sync(2) will call xfs sync_fs anyway, so there is no point in calling
ovl_sync_fs => xfs sync_fs 1000 more times.
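
For reference, the forwarding described here happens in ovl_sync_fs();
roughly, with this patch applied (abridged from fs/overlayfs/super.c of
this era, comments adjusted for illustration):

	static int ovl_sync_fs(struct super_block *sb, int wait)
	{
		struct ovl_fs *ofs = sb->s_fs_info;
		struct super_block *upper_sb;
		int ret;

		if (!ofs->upper_mnt)
			return 0;

		/* Only the final, waiting syncfs(2) pass needs to flush
		 * upper_sb; sync(2) never gets here at all once
		 * SB_I_SKIP_SYNC is set. */
		if (!wait)
			return 0;

		upper_sb = ofs->upper_mnt->mnt_sb;

		down_read(&upper_sb->s_umount);
		ret = sync_filesystem(upper_sb);
		up_read(&upper_sb->s_umount);

		return ret;
	}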

Thanks,
Amir.
Miklos Szeredi April 21, 2020, 9:36 a.m. UTC | #8
On Thu, Apr 9, 2020 at 10:29 AM Konstantin Khlebnikov
<khlebnikov@yandex-team.ru> wrote:
> [...]

Thanks, applied.

Miklos

Patch

diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 732ad5495c92..59df13d16280 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -261,8 +261,8 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
 		return 0;
 
 	/*
-	 * If this is a sync(2) call or an emergency sync, all the super blocks
-	 * will be iterated, including upper_sb, so no need to do anything.
+	 * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC).
+	 * All the super blocks will be iterated, including upper_sb.
 	 *
 	 * If this is a syncfs(2) call, then we do need to call
 	 * sync_filesystem() on upper_sb, but enough if we do it when being
@@ -1818,6 +1818,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 	sb->s_xattr = ovl_xattr_handlers;
 	sb->s_fs_info = ofs;
 	sb->s_flags |= SB_POSIXACL;
+	sb->s_iflags |= SB_I_SKIP_SYNC;
 
 	err = -ENOMEM;
 	root_dentry = ovl_get_root(sb, upperpath.dentry, oe);
diff --git a/fs/sync.c b/fs/sync.c
index 4d1ff010bc5a..16c2630ee4bf 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -76,7 +76,8 @@ static void sync_inodes_one_sb(struct super_block *sb, void *arg)
 
 static void sync_fs_one_sb(struct super_block *sb, void *arg)
 {
-	if (!sb_rdonly(sb) && sb->s_op->sync_fs)
+	if (!sb_rdonly(sb) && !(sb->s_iflags & SB_I_SKIP_SYNC) &&
+	    sb->s_op->sync_fs)
 		sb->s_op->sync_fs(sb, *(int *)arg);
 }
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4f6f59b4f22a..f186a966a36c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1409,6 +1409,8 @@ extern int send_sigurg(struct fown_struct *fown);
 #define SB_I_IMA_UNVERIFIABLE_SIGNATURE	0x00000020
 #define SB_I_UNTRUSTED_MOUNTER		0x00000040
 
+#define SB_I_SKIP_SYNC	0x00000100	/* Skip superblock at global sync */
+
 /* Possible states of 'frozen' field */
 enum {
 	SB_UNFROZEN = 0,		/* FS is unfrozen */