diff mbox series

fix writing to the filesystem after unmount

Message ID 59b54cc3-b98b-aff9-14fc-dc25c61111c6@redhat.com (mailing list archive)
State New, archived
Series fix writing to the filesystem after unmount

Commit Message

Mikulas Patocka Sept. 6, 2023, 1:26 p.m. UTC
lvm may suspend any logical volume at any time. If an lvm suspend races with
unmount, the kernel may write to the filesystem after umount has successfully
returned. The problem can be demonstrated with this script:

#!/bin/sh -ex
modprobe brd rd_size=4194304
vgcreate vg /dev/ram0
lvcreate -L 16M -n lv vg
mkfs.ext4 /dev/vg/lv
mount -t ext4 /dev/vg/lv /mnt/test
dmsetup suspend /dev/vg/lv
(sleep 1; dmsetup resume /dev/vg/lv) &
umount /mnt/test
md5sum /dev/vg/lv
md5sum /dev/vg/lv
dmsetup remove_all
rmmod brd

The script unmounts the filesystem and runs md5sum twice; the two
invocations return different hashes.

What happens:
* dmsetup suspend calls freeze_bdev, which goes to freeze_super and
  increments sb->s_active
* then we unmount the filesystem; we go to cleanup_mnt, which calls
  deactivate_super; deactivate_super sees that sb->s_active is 2, so it
  decreases it to 1 and does nothing - the umount syscall returns
  successfully
* now we have a mounted filesystem despite the fact that umount returned
* we call md5sum; it waits for the block device to be unblocked
* dmsetup resume unblocks the block device and calls thaw_bdev, which calls
  thaw_super and thaw_super_locked
* thaw_super_locked calls deactivate_locked_super, which actually drops the
  refcount and performs the unmount. The unmount races with md5sum; md5sum
  wins the race and returns the hash of the filesystem before it was
  unmounted
* the second md5sum returns the hash after the filesystem was unmounted

In order to fix this bug, this patch introduces a new function
wait_and_deactivate_super that waits while the filesystem is frozen and
performs deactivate_locked_super only after that.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org

---
 fs/namespace.c     |    2 +-
 fs/super.c         |   20 ++++++++++++++++++++
 include/linux/fs.h |    1 +
 3 files changed, 22 insertions(+), 1 deletion(-)


Comments

Christian Brauner Sept. 6, 2023, 2:27 p.m. UTC | #1
On Wed, Sep 06, 2023 at 03:26:21PM +0200, Mikulas Patocka wrote:
> lvm may suspend any logical volume anytime. If lvm suspend races with
> unmount, it may be possible that the kernel writes to the filesystem after
> unmount successfully returned. The problem can be demonstrated with this
> script:
> 
> #!/bin/sh -ex
> modprobe brd rd_size=4194304
> vgcreate vg /dev/ram0
> lvcreate -L 16M -n lv vg
> mkfs.ext4 /dev/vg/lv
> mount -t ext4 /dev/vg/lv /mnt/test
> dmsetup suspend /dev/vg/lv
> (sleep 1; dmsetup resume /dev/vg/lv) &
> umount /mnt/test
> md5sum /dev/vg/lv
> md5sum /dev/vg/lv
> dmsetup remove_all
> rmmod brd
> 
> The script unmounts the filesystem and runs md5sum twice, the result is
> that these two invocations return different hash.
> 
> What happens:
> * dmsetup suspend calls freeze_bdev, that goes to freeze_super and it
>   increments sb->s_active
> * then we unmount the filesystem, we go to cleanup_mnt, cleanup_mnt calls
>   deactivate_super, deactivate_super sees that sb->s_active is 2, so it
>   decreases it to 1 and does nothing - the umount syscall returns
>   successfully
> * now we have a mounted filesystem despite the fact that umount returned

That can happen for any number of reasons. Other codepaths might very
well still hold active references to the superblock. The same thing can
happen when you have your filesystem pinned in another mount namespace.

If you really want to be absolutely sure that the superblock is
destroyed you must use a mechanism like fanotify which allows you to get
notified on superblock destruction.

> @@ -1251,7 +1251,7 @@ static void cleanup_mnt(struct mount *mn
>  	}
>  	fsnotify_vfsmount_delete(&mnt->mnt);
>  	dput(mnt->mnt.mnt_root);
> -	deactivate_super(mnt->mnt.mnt_sb);
> +	wait_and_deactivate_super(mnt->mnt.mnt_sb);

Your patch means that we hang on any umount when the filesystem is
frozen. IOW, you'd also hang on any umount of a bind-mount. IOW, every
single container making use of this filesystem via bind-mounts would
hang on umount and shutdown.

You'd effectively build a deadlock trap for userspace when the
filesystem is frozen. And nothing can make progress until that thing is
thawed. Umount can't block if the block device is frozen.

>  	mnt_free_id(mnt);
>  	call_rcu(&mnt->mnt_rcu, delayed_free_vfsmnt);
>  }
> Index: linux-2.6/fs/super.c
> ===================================================================
> --- linux-2.6.orig/fs/super.c	2023-09-05 21:09:16.000000000 +0200
> +++ linux-2.6/fs/super.c	2023-09-06 09:52:20.000000000 +0200
> @@ -36,6 +36,7 @@
>  #include <linux/lockdep.h>
>  #include <linux/user_namespace.h>
>  #include <linux/fs_context.h>
> +#include <linux/delay.h>
>  #include <uapi/linux/mount.h>
>  #include "internal.h"
>  
> @@ -365,6 +366,25 @@ void deactivate_super(struct super_block
>  EXPORT_SYMBOL(deactivate_super);
>  
>  /**
> + *	wait_and_deactivate_super	-	wait for unfreeze and drop a reference
> + *	@s: superblock to deactivate
> + *
> + *	Variant of deactivate_super(), except that we wait if the filesystem is
> + *	frozen. This is required on unmount, to make sure that the filesystem is
> + *	really unmounted when this function returns.
> + */
> +void wait_and_deactivate_super(struct super_block *s)
> +{
> +	down_write(&s->s_umount);
> +	while (s->s_writers.frozen != SB_UNFROZEN) {
> +		up_write(&s->s_umount);
> +		msleep(1);
> +		down_write(&s->s_umount);
> +	}
> +	deactivate_locked_super(s);

That msleep(1) alone is a pretty nasty hack. We should definitely not
spin in code like this. That superblock could stay frozen for a long
time without s_umount held. So this is spinning.

Even if we wanted to do this it would need to use a similar wait
mechanism for the filesystem to be thawed like we do in
thaw_super_locked().

Mikulas Patocka Sept. 6, 2023, 3:03 p.m. UTC | #2
On Wed, 6 Sep 2023, Christian Brauner wrote:

> > What happens:
> > * dmsetup suspend calls freeze_bdev, that goes to freeze_super and it
> >   increments sb->s_active
> > * then we unmount the filesystem, we go to cleanup_mnt, cleanup_mnt calls
> >   deactivate_super, deactivate_super sees that sb->s_active is 2, so it
> >   decreases it to 1 and does nothing - the umount syscall returns
> >   successfully
> > * now we have a mounted filesystem despite the fact that umount returned
> 
> That can happen for any number of reasons. Other codepaths might very
> well still hold active references to the superblock. The same thing can
> happen when you have your filesystem pinned in another mount namespace.
> 
> If you really want to be absolutely sure that the superblock is
> destroyed you must use a mechanism like fanotify which allows you to get
> notified on superblock destruction.

If the administrator runs a script that performs an unmount and then a
back-up of the underlying block device, the back-up may read corrupted data.
I think that this is a problem.

> > @@ -1251,7 +1251,7 @@ static void cleanup_mnt(struct mount *mn
> >  	}
> >  	fsnotify_vfsmount_delete(&mnt->mnt);
> >  	dput(mnt->mnt.mnt_root);
> > -	deactivate_super(mnt->mnt.mnt_sb);
> > +	wait_and_deactivate_super(mnt->mnt.mnt_sb);
> 
> Your patch means that we hang on any umount when the filesystem is
> frozen.

Currently, if we freeze a filesystem with "fsfreeze" and unmount it, the
mount point is removed, but the filesystem stays active and is leaked.
You can't unfreeze it with "fsfreeze --unfreeze" because the mount point
is gone. (The only way to recover is "echo j > /proc/sysrq-trigger".)
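
For illustration, a minimal userspace sketch of that leak using the FIFREEZE
ioctl directly (the /mnt/test mount point and an already-mounted test
filesystem are assumptions, as in the script above):

#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mount.h>
#include <unistd.h>

int main(void)
{
        /* freeze the filesystem through its (assumed) mount point */
        int fd = open("/mnt/test", O_RDONLY);

        if (fd < 0 || ioctl(fd, FIFREEZE, 0) < 0) {
                perror("freeze");
                return 1;
        }
        close(fd);

        /* umount succeeds today, leaving a frozen, active superblock */
        if (umount("/mnt/test") < 0) {
                perror("umount");
                return 1;
        }
        /* FITHAW now has no mount point to operate on; only
           "echo j > /proc/sysrq-trigger" can still thaw it */
        return 0;
}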

> IOW, you'd also hang on any umount of a bind-mount. IOW, every
> single container making use of this filesystems via bind-mounts would
> hang on umount and shutdown.

bind-mount doesn't modify "s->s_writers.frozen", so the patch does nothing 
in this case. I tried unmounting bind-mounts and there was no deadlock.

> You'd effectively build a deadlock trap for userspace when the
> filesystem is frozen. And nothing can make progress until that thing is
> thawed. Umount can't block if the block device is frozen.

unmounting a filesystem frozen with "fsfreeze" doesn't work in the current 
kernel. We can say that the administrator shouldn't do it. But reading the 
block device after umount finishes is something that the administrator may 
do.

BTW. what do you think that unmount of a frozen filesystem should properly 
do? Fail with -EBUSY? Or, unfreeze the filesystem and unmount it? Or 
something else?

> That msleep(1) alone is a pretty nasty hack. We should definitely not
> spin in code like this. That superblock could stay frozen for a long
> time without s_umount held. So this is spinning.
> 
> Even if we wanted to do this it would need to use a similar wait
> mechanism for the filesystem to be thawed like we do in
> thaw_super_locked().

Yes, it may be possible to rework it using a wait queue.
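
For illustration, a minimal sketch of such a rework (hypothetical, not part of
the posted patch; it assumes the thaw path is additionally taught to call
wake_up_var(&s->s_writers.frozen) whenever the freeze state changes):

/*
 * Hypothetical, untested sketch for fs/super.c: sleep on a wait
 * variable instead of polling with msleep().
 */
void wait_and_deactivate_super(struct super_block *s)
{
        down_write(&s->s_umount);
        while (s->s_writers.frozen != SB_UNFROZEN) {
                up_write(&s->s_umount);
                /* sleep until the freeze state changes, no busy-waiting */
                wait_var_event(&s->s_writers.frozen,
                               READ_ONCE(s->s_writers.frozen) == SB_UNFROZEN);
                down_write(&s->s_umount);
        }
        deactivate_locked_super(s);
}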

Mikulas
Darrick J. Wong Sept. 6, 2023, 3:22 p.m. UTC | #3
On Wed, Sep 06, 2023 at 03:26:21PM +0200, Mikulas Patocka wrote:
> lvm may suspend any logical volume anytime. If lvm suspend races with
> unmount, it may be possible that the kernel writes to the filesystem after
> unmount successfully returned. The problem can be demonstrated with this
> script:
> 
> #!/bin/sh -ex
> modprobe brd rd_size=4194304
> vgcreate vg /dev/ram0
> lvcreate -L 16M -n lv vg
> mkfs.ext4 /dev/vg/lv
> mount -t ext4 /dev/vg/lv /mnt/test
> dmsetup suspend /dev/vg/lv
> (sleep 1; dmsetup resume /dev/vg/lv) &
> umount /mnt/test
> md5sum /dev/vg/lv
> md5sum /dev/vg/lv
> dmsetup remove_all
> rmmod brd
> 
> The script unmounts the filesystem and runs md5sum twice, the result is
> that these two invocations return different hash.
> 
> What happens:
> * dmsetup suspend calls freeze_bdev, that goes to freeze_super and it
>   increments sb->s_active
> * then we unmount the filesystem, we go to cleanup_mnt, cleanup_mnt calls
>   deactivate_super, deactivate_super sees that sb->s_active is 2, so it
>   decreases it to 1 and does nothing - the umount syscall returns
>   successfully
> * now we have a mounted filesystem despite the fact that umount returned
> * we call md5sum, this waits for the block device being unblocked
> * dmsetup resume unblocks the block device and calls thaw_bdev, that calls
>   thaw_super and thaw_super_locked
> * thaw_super_locked calls deactivate_locked_super, this actually drops the
>   refcount and performs the unmount. The unmount races with md5sum. md5sum
>   wins the race and it returns the hash of the filesystem before it was
>   unmounted
> * the second md5sum returns the hash after the filesystem was unmounted
> 
> In order to fix this bug, this patch introduces a new function
> wait_and_deactivate_super that will wait if the filesystem is frozen and
> perform deactivate_locked_super only after that.
> 
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Cc: stable@vger.kernel.org
> 
> ---
>  fs/namespace.c     |    2 +-
>  fs/super.c         |   20 ++++++++++++++++++++
>  include/linux/fs.h |    1 +
>  3 files changed, 22 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6/fs/namespace.c
> ===================================================================
> --- linux-2.6.orig/fs/namespace.c	2023-09-06 09:45:54.000000000 +0200
> +++ linux-2.6/fs/namespace.c	2023-09-06 09:47:15.000000000 +0200
> @@ -1251,7 +1251,7 @@ static void cleanup_mnt(struct mount *mn
>  	}
>  	fsnotify_vfsmount_delete(&mnt->mnt);
>  	dput(mnt->mnt.mnt_root);
> -	deactivate_super(mnt->mnt.mnt_sb);
> +	wait_and_deactivate_super(mnt->mnt.mnt_sb);
>  	mnt_free_id(mnt);
>  	call_rcu(&mnt->mnt_rcu, delayed_free_vfsmnt);
>  }
> Index: linux-2.6/fs/super.c
> ===================================================================
> --- linux-2.6.orig/fs/super.c	2023-09-05 21:09:16.000000000 +0200
> +++ linux-2.6/fs/super.c	2023-09-06 09:52:20.000000000 +0200
> @@ -36,6 +36,7 @@
>  #include <linux/lockdep.h>
>  #include <linux/user_namespace.h>
>  #include <linux/fs_context.h>
> +#include <linux/delay.h>
>  #include <uapi/linux/mount.h>
>  #include "internal.h"
>  
> @@ -365,6 +366,25 @@ void deactivate_super(struct super_block
>  EXPORT_SYMBOL(deactivate_super);
>  
>  /**
> + *	wait_and_deactivate_super	-	wait for unfreeze and drop a reference
> + *	@s: superblock to deactivate
> + *
> + *	Variant of deactivate_super(), except that we wait if the filesystem is
> + *	frozen. This is required on unmount, to make sure that the filesystem is
> + *	really unmounted when this function returns.
> + */
> +void wait_and_deactivate_super(struct super_block *s)
> +{
> +	down_write(&s->s_umount);
> +	while (s->s_writers.frozen != SB_UNFROZEN) {
> +		up_write(&s->s_umount);
> +		msleep(1);
> +		down_write(&s->s_umount);
> +	}

Instead of msleep, could you use wait_var_event_killable like
wait_for_partially_frozen() does?

--D

> +	deactivate_locked_super(s);
> +}
> +
> +/**
>   *	grab_super - acquire an active reference
>   *	@s: reference we are trying to make active
>   *
> Index: linux-2.6/include/linux/fs.h
> ===================================================================
> --- linux-2.6.orig/include/linux/fs.h	2023-09-05 21:09:16.000000000 +0200
> +++ linux-2.6/include/linux/fs.h	2023-09-06 09:46:56.000000000 +0200
> @@ -2262,6 +2262,7 @@ void kill_anon_super(struct super_block
>  void kill_litter_super(struct super_block *sb);
>  void deactivate_super(struct super_block *sb);
>  void deactivate_locked_super(struct super_block *sb);
> +void wait_and_deactivate_super(struct super_block *sb);
>  int set_anon_super(struct super_block *s, void *data);
>  int set_anon_super_fc(struct super_block *s, struct fs_context *fc);
>  int get_anon_bdev(dev_t *);
> 

Christian Brauner Sept. 6, 2023, 3:33 p.m. UTC | #4
> Currently, if we freeze a filesystem with "fsfreeze" and unmount it, the 
> mount point is removed, but the filesystem stays active and it is leaked. 
> You can't unfreeze it with "fsfreeze --unfreeze" because the mount point 
> is gone. (the only way how to recover it is "echo j>/proc/sysrq-trigger").

You can of course always remount and unfreeze it.

> > IOW, you'd also hang on any umount of a bind-mount. IOW, every
> > single container making use of this filesystems via bind-mounts would
> > hang on umount and shutdown.
> 
> bind-mount doesn't modify "s->s_writers.frozen", so the patch does nothing 
> in this case. I tried unmounting bind-mounts and there was no deadlock.

With your patch what happens if you do the following?

#!/bin/sh -ex
modprobe brd rd_size=4194304
vgcreate vg /dev/ram0
lvcreate -L 16M -n lv vg
mkfs.ext4 /dev/vg/lv

mount -t ext4 /dev/vg/lv /mnt/test
mount --bind /mnt/test /opt
mount --make-private /opt

dmsetup suspend /dev/vg/lv
(sleep 1; dmsetup resume /dev/vg/lv) &

umount /opt # I'd expect this to hang

md5sum /dev/vg/lv
md5sum /dev/vg/lv
dmsetup remove_all
rmmod brd

> BTW. what do you think that unmount of a frozen filesystem should properly 
> do? Fail with -EBUSY? Or, unfreeze the filesystem and unmount it? Or 
> something else?

In my opinion we should refuse to unmount frozen filesystems and log an
error that the filesystem is frozen. Waiting forever isn't a good idea
in my opinion.

But this is a significant uapi change afaict so this would need to be
hidden behind a config option, a sysctl, or it would have to be a new
umount2() flag MNT_UNFROZEN which an administrator could use to refuse
to unmount a frozen filesystem.

Christian Brauner Sept. 6, 2023, 3:38 p.m. UTC | #5
On Wed, Sep 06, 2023 at 08:22:45AM -0700, Darrick J. Wong wrote:
> On Wed, Sep 06, 2023 at 03:26:21PM +0200, Mikulas Patocka wrote:
> > lvm may suspend any logical volume anytime. If lvm suspend races with
> > unmount, it may be possible that the kernel writes to the filesystem after
> > unmount successfully returned. The problem can be demonstrated with this
> > script:
> > 
> > #!/bin/sh -ex
> > modprobe brd rd_size=4194304
> > vgcreate vg /dev/ram0
> > lvcreate -L 16M -n lv vg
> > mkfs.ext4 /dev/vg/lv
> > mount -t ext4 /dev/vg/lv /mnt/test
> > dmsetup suspend /dev/vg/lv
> > (sleep 1; dmsetup resume /dev/vg/lv) &
> > umount /mnt/test
> > md5sum /dev/vg/lv
> > md5sum /dev/vg/lv
> > dmsetup remove_all
> > rmmod brd
> > 
> > The script unmounts the filesystem and runs md5sum twice, the result is
> > that these two invocations return different hash.
> > 
> > What happens:
> > * dmsetup suspend calls freeze_bdev, that goes to freeze_super and it
> >   increments sb->s_active
> > * then we unmount the filesystem, we go to cleanup_mnt, cleanup_mnt calls
> >   deactivate_super, deactivate_super sees that sb->s_active is 2, so it
> >   decreases it to 1 and does nothing - the umount syscall returns
> >   successfully
> > * now we have a mounted filesystem despite the fact that umount returned
> > * we call md5sum, this waits for the block device being unblocked
> > * dmsetup resume unblocks the block device and calls thaw_bdev, that calls
> >   thaw_super and thaw_super_locked
> > * thaw_super_locked calls deactivate_locked_super, this actually drops the
> >   refcount and performs the unmount. The unmount races with md5sum. md5sum
> >   wins the race and it returns the hash of the filesystem before it was
> >   unmounted
> > * the second md5sum returns the hash after the filesystem was unmounted
> > 
> > In order to fix this bug, this patch introduces a new function
> > wait_and_deactivate_super that will wait if the filesystem is frozen and
> > perform deactivate_locked_super only after that.
> > 
> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > Cc: stable@vger.kernel.org
> > 
> > ---
> >  fs/namespace.c     |    2 +-
> >  fs/super.c         |   20 ++++++++++++++++++++
> >  include/linux/fs.h |    1 +
> >  3 files changed, 22 insertions(+), 1 deletion(-)
> > 
> > Index: linux-2.6/fs/namespace.c
> > ===================================================================
> > --- linux-2.6.orig/fs/namespace.c	2023-09-06 09:45:54.000000000 +0200
> > +++ linux-2.6/fs/namespace.c	2023-09-06 09:47:15.000000000 +0200
> > @@ -1251,7 +1251,7 @@ static void cleanup_mnt(struct mount *mn
> >  	}
> >  	fsnotify_vfsmount_delete(&mnt->mnt);
> >  	dput(mnt->mnt.mnt_root);
> > -	deactivate_super(mnt->mnt.mnt_sb);
> > +	wait_and_deactivate_super(mnt->mnt.mnt_sb);
> >  	mnt_free_id(mnt);
> >  	call_rcu(&mnt->mnt_rcu, delayed_free_vfsmnt);
> >  }
> > Index: linux-2.6/fs/super.c
> > ===================================================================
> > --- linux-2.6.orig/fs/super.c	2023-09-05 21:09:16.000000000 +0200
> > +++ linux-2.6/fs/super.c	2023-09-06 09:52:20.000000000 +0200
> > @@ -36,6 +36,7 @@
> >  #include <linux/lockdep.h>
> >  #include <linux/user_namespace.h>
> >  #include <linux/fs_context.h>
> > +#include <linux/delay.h>
> >  #include <uapi/linux/mount.h>
> >  #include "internal.h"
> >  
> > @@ -365,6 +366,25 @@ void deactivate_super(struct super_block
> >  EXPORT_SYMBOL(deactivate_super);
> >  
> >  /**
> > + *	wait_and_deactivate_super	-	wait for unfreeze and drop a reference
> > + *	@s: superblock to deactivate
> > + *
> > + *	Variant of deactivate_super(), except that we wait if the filesystem is
> > + *	frozen. This is required on unmount, to make sure that the filesystem is
> > + *	really unmounted when this function returns.
> > + */
> > +void wait_and_deactivate_super(struct super_block *s)
> > +{
> > +	down_write(&s->s_umount);
> > +	while (s->s_writers.frozen != SB_UNFROZEN) {
> > +		up_write(&s->s_umount);
> > +		msleep(1);
> > +		down_write(&s->s_umount);
> > +	}
> 
> Instead of msleep, could you use wait_var_event_killable like
> wait_for_partially_frozen() does?

I said the same thing, but I think that the patch in this form isn't a
good idea and is technically also uapi breakage. Anyway, can you take a
look at my third response here?

https://lore.kernel.org/lkml/20230906-aufheben-hagel-9925501b7822@brauner

(I forgot you worked on freezing as well.
I'm currently moving the freezing bits to fs_holder_ops
https://gitlab.com/brauner/linux/-/commits/vfs.super.freeze)

Christian Brauner Sept. 6, 2023, 3:58 p.m. UTC | #6
On Wed, Sep 06, 2023 at 05:33:32PM +0200, Christian Brauner wrote:
> > Currently, if we freeze a filesystem with "fsfreeze" and unmount it, the 
> > mount point is removed, but the filesystem stays active and it is leaked. 
> > You can't unfreeze it with "fsfreeze --unfreeze" because the mount point 
> > is gone. (the only way how to recover it is "echo j>/proc/sysrq-trigger").
> 
> You can of course always remount and unfreeze it.
> 
> > > IOW, you'd also hang on any umount of a bind-mount. IOW, every
> > > single container making use of this filesystems via bind-mounts would
> > > hang on umount and shutdown.
> > 
> > bind-mount doesn't modify "s->s_writers.frozen", so the patch does nothing 
> > in this case. I tried unmounting bind-mounts and there was no deadlock.
> 
> With your patch what happens if you do the following?
> 
> #!/bin/sh -ex
> modprobe brd rd_size=4194304
> vgcreate vg /dev/ram0
> lvcreate -L 16M -n lv vg
> mkfs.ext4 /dev/vg/lv
> 
> mount -t ext4 /dev/vg/lv /mnt/test
> mount --bind /mnt/test /opt
> mount --make-private /opt
> 
> dmsetup suspend /dev/vg/lv
> (sleep 1; dmsetup resume /dev/vg/lv) &
> 
> umount /opt # I'd expect this to hang
> 
> md5sum /dev/vg/lv
> md5sum /dev/vg/lv
> dmsetup remove_all
> rmmod brd
> 
> > BTW. what do you think that unmount of a frozen filesystem should properly 
> > do? Fail with -EBUSY? Or, unfreeze the filesystem and unmount it? Or 
> > something else?
> 
> In my opinion we should refuse to unmount frozen filesystems and log an
> error that the filesystem is frozen. Waiting forever isn't a good idea
> in my opinion.
> 
> But this is a significant uapi change afaict so this would need to be
> hidden behind a config option, a sysctl, or it would have to be a new
> flag to umount2() MNT_UNFROZEN which would allow an administrator to use
> this flag to not unmount a frozen filesystems.

That's probably too careful. I think we could risk starting to return an
error when trying to unmount a frozen filesystem. And if that causes
regressions we could go and look at another option like MNT_UNFROZEN or
whatever.

Mikulas Patocka Sept. 6, 2023, 4:01 p.m. UTC | #7
On Wed, 6 Sep 2023, Christian Brauner wrote:

> > > IOW, you'd also hang on any umount of a bind-mount. IOW, every
> > > single container making use of this filesystems via bind-mounts would
> > > hang on umount and shutdown.
> > 
> > bind-mount doesn't modify "s->s_writers.frozen", so the patch does nothing 
> > in this case. I tried unmounting bind-mounts and there was no deadlock.
> 
> With your patch what happens if you do the following?
> 
> #!/bin/sh -ex
> modprobe brd rd_size=4194304
> vgcreate vg /dev/ram0
> lvcreate -L 16M -n lv vg
> mkfs.ext4 /dev/vg/lv
> 
> mount -t ext4 /dev/vg/lv /mnt/test
> mount --bind /mnt/test /opt
> mount --make-private /opt
> 
> dmsetup suspend /dev/vg/lv
> (sleep 1; dmsetup resume /dev/vg/lv) &
> 
> umount /opt # I'd expect this to hang
> 
> md5sum /dev/vg/lv
> md5sum /dev/vg/lv
> dmsetup remove_all
> rmmod brd

"umount /opt" doesn't hang. It waits one second (until dmsetup resume is 
called) and then proceeds.

Then, it fails with "rmmod: ERROR: Module brd is in use" because the 
script didn't unmount /mnt/test.

> > BTW. what do you think that unmount of a frozen filesystem should properly 
> > do? Fail with -EBUSY? Or, unfreeze the filesystem and unmount it? Or 
> > something else?
> 
> In my opinion we should refuse to unmount frozen filesystems and log an
> error that the filesystem is frozen. Waiting forever isn't a good idea
> in my opinion.

But lvm may freeze filesystems anytime - so we'd get randomly returned 
errors then.

> But this is a significant uapi change afaict so this would need to be
> hidden behind a config option, a sysctl, or it would have to be a new
> flag to umount2() MNT_UNFROZEN which would allow an administrator to use
> this flag to not unmount a frozen filesystems.

The kernel currently distinguishes between kernel-initiated freeze (that 
is used by the XFS scrub) and userspace-initiated freeze (that is used by 
the FIFREEZE ioctl and by device-mapper initiated freeze through 
freeze_bdev).

Perhaps we could distinguish between FIFREEZE-initiated freezes and 
device-mapper initiated freezes as well. And we could change the logic to 
return -EBUSY if the freeze was initiated by FIFREEZE and to wait for 
unfreeze if it was initiated by the device-mapper.

Mikulas
Christian Brauner Sept. 6, 2023, 4:19 p.m. UTC | #8
On Wed, Sep 06, 2023 at 06:01:06PM +0200, Mikulas Patocka wrote:
> 
> 
> On Wed, 6 Sep 2023, Christian Brauner wrote:
> 
> > > > IOW, you'd also hang on any umount of a bind-mount. IOW, every
> > > > single container making use of this filesystems via bind-mounts would
> > > > hang on umount and shutdown.
> > > 
> > > bind-mount doesn't modify "s->s_writers.frozen", so the patch does nothing 
> > > in this case. I tried unmounting bind-mounts and there was no deadlock.
> > 
> > With your patch what happens if you do the following?
> > 
> > #!/bin/sh -ex
> > modprobe brd rd_size=4194304
> > vgcreate vg /dev/ram0
> > lvcreate -L 16M -n lv vg
> > mkfs.ext4 /dev/vg/lv
> > 
> > mount -t ext4 /dev/vg/lv /mnt/test
> > mount --bind /mnt/test /opt
> > mount --make-private /opt
> > 
> > dmsetup suspend /dev/vg/lv
> > (sleep 1; dmsetup resume /dev/vg/lv) &
> > 
> > umount /opt # I'd expect this to hang
> > 
> > md5sum /dev/vg/lv
> > md5sum /dev/vg/lv
> > dmsetup remove_all
> > rmmod brd
> 
> "umount /opt" doesn't hang. It waits one second (until dmsetup resume is 
> called) and then proceeds.

So unless I'm really misreading the code - entirely possible - the
umount of the bind-mount now waits until the filesystem is resumed with
your patch. And if that's the case that's a bug.

If at all, then only the last umount, the one that destroys the
superblock, should wait for the filesystem to become unfrozen.

A bind-mount shouldn't as there are still active mounts of the
filesystem (e.g., /mnt/test).

So you should see this with (unless I really misread things):

#!/bin/sh -ex
modprobe brd rd_size=4194304
vgcreate vg /dev/ram0
lvcreate -L 16M -n lv vg
mkfs.ext4 /dev/vg/lv

mount -t ext4 /dev/vg/lv /mnt/test
mount --bind /mnt/test /opt
mount --make-private /opt

dmsetup suspend /dev/vg/lv

umount /opt # This will hang with your patch?

> 
> Then, it fails with "rmmod: ERROR: Module brd is in use" because the 
> script didn't unmount /mnt/test.
> 
> > > BTW. what do you think that unmount of a frozen filesystem should properly 
> > > do? Fail with -EBUSY? Or, unfreeze the filesystem and unmount it? Or 
> > > something else?
> > 
> > In my opinion we should refuse to unmount frozen filesystems and log an
> > error that the filesystem is frozen. Waiting forever isn't a good idea
> > in my opinion.
> 
> But lvm may freeze filesystems anytime - so we'd get randomly returned 
> errors then.

So? Or you might hang at anytime.

> 
> > But this is a significant uapi change afaict so this would need to be
> > hidden behind a config option, a sysctl, or it would have to be a new
> > flag to umount2() MNT_UNFROZEN which would allow an administrator to use
> > this flag to not unmount a frozen filesystems.
> 
> The kernel currently distinguishes between kernel-initiated freeze (that 
> is used by the XFS scrub) and userspace-initiated freeze (that is used by 
> the FIFREEZE ioctl and by device-mapper initiated freeze through 
> freeze_bdev).

Yes, I'm aware.

> 
> Perhaps we could distinguish between FIFREEZE-initiated freezes and 
> device-mapper initiated freezes as well. And we could change the logic to 
> return -EBUSY if the freeze was initiated by FIFREEZE and to wait for 
> unfreeze if it was initiated by the device-mapper.

For device mapper initiated freezes you can unfreeze independent of any
filesystem mountpoint via dm ioctls.

Mikulas Patocka Sept. 6, 2023, 4:52 p.m. UTC | #9
On Wed, 6 Sep 2023, Christian Brauner wrote:

> On Wed, Sep 06, 2023 at 06:01:06PM +0200, Mikulas Patocka wrote:
> > 
> > 
> > On Wed, 6 Sep 2023, Christian Brauner wrote:
> > 
> > > > > IOW, you'd also hang on any umount of a bind-mount. IOW, every
> > > > > single container making use of this filesystems via bind-mounts would
> > > > > hang on umount and shutdown.
> > > > 
> > > > bind-mount doesn't modify "s->s_writers.frozen", so the patch does nothing 
> > > > in this case. I tried unmounting bind-mounts and there was no deadlock.
> > > 
> > > With your patch what happens if you do the following?
> > > 
> > > #!/bin/sh -ex
> > > modprobe brd rd_size=4194304
> > > vgcreate vg /dev/ram0
> > > lvcreate -L 16M -n lv vg
> > > mkfs.ext4 /dev/vg/lv
> > > 
> > > mount -t ext4 /dev/vg/lv /mnt/test
> > > mount --bind /mnt/test /opt
> > > mount --make-private /opt
> > > 
> > > dmsetup suspend /dev/vg/lv
> > > (sleep 1; dmsetup resume /dev/vg/lv) &
> > > 
> > > umount /opt # I'd expect this to hang
> > > 
> > > md5sum /dev/vg/lv
> > > md5sum /dev/vg/lv
> > > dmsetup remove_all
> > > rmmod brd
> > 
> > "umount /opt" doesn't hang. It waits one second (until dmsetup resume is 
> > called) and then proceeds.
> 
> So unless I'm really misreading the code - entirely possible - the
> umount of the bind-mount now waits until the filesystem is resumed with
> your patch. And if that's the case that's a bug.

Yes.

It can be fixed by changing wait_and_deactivate_super to this:

void wait_and_deactivate_super(struct super_block *s)
{
	down_write(&s->s_umount);
	while (s->s_writers.frozen != SB_UNFROZEN && atomic_read(&s->s_active) == 2) {
		up_write(&s->s_umount);
		msleep(1);
		down_write(&s->s_umount);
	}
	deactivate_locked_super(s);
}

> > > > BTW. what do you think that unmount of a frozen filesystem should properly 
> > > > do? Fail with -EBUSY? Or, unfreeze the filesystem and unmount it? Or 
> > > > something else?
> > > 
> > > In my opinion we should refuse to unmount frozen filesystems and log an
> > > error that the filesystem is frozen. Waiting forever isn't a good idea
> > > in my opinion.
> > 
> > But lvm may freeze filesystems anytime - so we'd get randomly returned 
> > errors then.
> 
> So? Or you might hang at anytime.

lvm doesn't keep logical volumes suspended for a prolonged amount of time. 
It will unfreeze them after it made updates to the dm table and to the 
metadata. So, it won't hang forever.

I think it's better to sleep for a short time in umount than to return an 
error.

Mikulas
Al Viro Sept. 6, 2023, 5:08 p.m. UTC | #10
On Wed, Sep 06, 2023 at 05:03:34PM +0200, Mikulas Patocka wrote:

> > IOW, you'd also hang on any umount of a bind-mount. IOW, every
> > single container making use of this filesystems via bind-mounts would
> > hang on umount and shutdown.
> 
> bind-mount doesn't modify "s->s_writers.frozen", so the patch does nothing 
> in this case. I tried unmounting bind-mounts and there was no deadlock.

You are making *any* mount destruction hang if the sucker is frozen.
Which includes the things like exit(2) of the last process within
a namespace, etc.

And it does include the things like mount --bind /usr/bin/gcc /tmp/cc; umount /tmp/cc
if /usr happened to be frozen at the moment.

This is really not an option.

> BTW. what do you think that unmount of a frozen filesystem should properly 
> do? Fail with -EBUSY? Or, unfreeze the filesystem and unmount it? Or 
> something else?

It's not just umount(2).  It's exit(2).  And close(2).  And AF_UNIX garbage
collector taking out an undeliverable SCM_RIGHTS datagram that happens to
contain a reference to the last opened file on lazy-umounted fs, etc.

Al Viro Sept. 6, 2023, 5:10 p.m. UTC | #11
On Wed, Sep 06, 2023 at 06:01:06PM +0200, Mikulas Patocka wrote:

> Perhaps we could distinguish between FIFREEZE-initiated freezes and 
> device-mapper initiated freezes as well. And we could change the logic to 
> return -EBUSY if the freeze was initiated by FIFREEZE and to wait for 
> unfreeze if it was initiated by the device-mapper.

By the time you get to cleanup_mnt() it's far too late to return -EBUSY.

Jan Kara Sept. 7, 2023, 9:44 a.m. UTC | #12
On Wed 06-09-23 18:52:39, Mikulas Patocka wrote:
> On Wed, 6 Sep 2023, Christian Brauner wrote:
> > On Wed, Sep 06, 2023 at 06:01:06PM +0200, Mikulas Patocka wrote:
> > > > > BTW. what do you think that unmount of a frozen filesystem should properly 
> > > > > do? Fail with -EBUSY? Or, unfreeze the filesystem and unmount it? Or 
> > > > > something else?
> > > > 
> > > > In my opinion we should refuse to unmount frozen filesystems and log an
> > > > error that the filesystem is frozen. Waiting forever isn't a good idea
> > > > in my opinion.
> > > 
> > > But lvm may freeze filesystems anytime - so we'd get randomly returned 
> > > errors then.
> > 
> > So? Or you might hang at anytime.
> 
> lvm doesn't keep logical volumes suspended for a prolonged amount of time. 
> It will unfreeze them after it made updates to the dm table and to the 
> metadata. So, it won't hang forever.
> 
> I think it's better to sleep for a short time in umount than to return an 
> error.

I think we've got too deep down into "how to fix things" but I'm not 100%
sure what the "bug" actually is. In the initial posting Mikulas writes "the
kernel writes to the filesystem after unmount successfully returned" - is
that really such a big issue? Anybody else can open the device and write to
it as well. Or even mount the device again. So userspace that relies on
this is kind of flaky anyway (and always has been).

I understand the need for userspace to make sure the device is not being
modified to do its thing - but then it should perhaps freeze the bdev if
it wants to be certain? Or at least open it O_EXCL to make sure it's not
mounted?
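
For illustration, a minimal sketch of such an O_EXCL check (the device path is
only an example): opening a block device with O_EXCL fails with EBUSY while
somebody else - typically a mounted filesystem - holds an exclusive claim on it.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        /* O_EXCL on a block device fails while anything (e.g. a mounted
           filesystem) holds an exclusive claim on it */
        int fd = open("/dev/vg/lv", O_RDONLY | O_EXCL);  /* example device */

        if (fd < 0) {
                fprintf(stderr, "device busy or inaccessible: %s\n",
                        strerror(errno));
                return 1;
        }
        printf("no exclusive user, safe to fsck/copy/mkfs\n");
        close(fd);
        return 0;
}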

WRT the umount behavior for frozen filesystem - as Al writes it's a tricky
issue and we've been discussing that several times over the years and it
never went anywhere because of nasty corner-cases (which current behavior
also has but trading one nasty case for another isn't really a win). Now
that we distinguish between kernel-initiated freeze (with well defined
freeze owner and lifetime) and userspace-initiated freeze, I can imagine we'd
make the last unmount of the superblock wait for the kernel-initiated freeze to
thaw. But as Al writes with lazy unmounts, bind mounts in multiple
namespaces etc. I'm not sure such behavior brings much value...

								Honza
Christian Brauner Sept. 7, 2023, 10:43 a.m. UTC | #13
> I think we've got too deep down into "how to fix things" but I'm not 100%

We did.

> sure what the "bug" actually is. In the initial posting Mikulas writes "the
> kernel writes to the filesystem after unmount successfully returned" - is
> that really such a big issue? Anybody else can open the device and write to
> it as well. Or even mount the device again. So userspace that relies on
> this is kind of flaky anyway (and always has been).

Yeah, agreed.

> namespaces etc. I'm not sure such behavior brings much value...

It would in any case mean complicating our code for little gain imho.
And as I showed in my initial reply the current patch would hang on any
bind-mount unmount. IOW, any container. And Al correctly points out
issues with exit(), close() and friends on top of that.

But I also hate the idea of waiting on the last umount because that can
also lead to new unexpected behavior when e.g. the system is shut down
and systemd goes on to unmount all things and then suddenly just hangs
when before it was able to make progress.

And returning EBUSY is tricky as well, as we somehow would need to
refcount in a manner that lets us differentiate between the last
"user-visible" superblock reference and the last active superblock
reference, which would complicate things even more.

I propose we clearly document that unmounting a frozen filesystem will
mean that the superblock stays active at least until the filesystem is
unfrozen.

And if userspace wants to make sure to not recycle such a frozen
superblock they can now use FSCONFIG_CMD_CREATE_EXCL to detect that.
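
For illustration, a rough sketch of that detection with the new mount API
(hypothetical usage; raw syscall() is used because glibc does not wrap
fsopen()/fsconfig(), FSCONFIG_CMD_CREATE_EXCL needs 6.5+ uapi headers, and the
device path is only an example):

#include <linux/mount.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Try to create a brand-new superblock for the device.  With
 * FSCONFIG_CMD_CREATE_EXCL the kernel refuses to silently hand back an
 * existing (possibly frozen, already "unmounted") superblock. */
int main(void)
{
        int fsfd = syscall(SYS_fsopen, "ext4", 0);

        if (fsfd < 0)
                return 1;
        if (syscall(SYS_fsconfig, fsfd, FSCONFIG_SET_STRING,
                    "source", "/dev/vg/lv", 0) < 0)
                return 1;
        if (syscall(SYS_fsconfig, fsfd, FSCONFIG_CMD_CREATE_EXCL,
                    NULL, NULL, 0) < 0) {
                /* EBUSY here means a superblock for the device still exists */
                perror("superblock still alive");
                return 1;
        }
        printf("fresh superblock created\n");
        close(fsfd);
        return 0;
}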

What might be useful is to extend fanotify. Right now we have
fsnotify_sb_delete() which lets you detect that a superblock has been
destroyed (generic_shutdown_super()). It could be useful to also get
notified when a superblock is frozen and unfrozen?

Mikulas Patocka Sept. 7, 2023, 12:04 p.m. UTC | #14
On Thu, 7 Sep 2023, Christian Brauner wrote:

> > I think we've got too deep down into "how to fix things" but I'm not 100%
> 
> We did.
> 
> > sure what the "bug" actually is. In the initial posting Mikulas writes "the
> > kernel writes to the filesystem after unmount successfully returned" - is
> > that really such a big issue?

I think it's an issue if the administrator writes a script that unmounts a 
filesystem and then copies the underlying block device somewhere. Or a 
script that unmounts a filesystem and runs fsck afterwards. Or a script 
that unmounts a filesystem and runs mkfs on the same block device.

> > Anybody else can open the device and write to it as well. Or even 
> > mount the device again. So userspace that relies on this is kind of 
> > flaky anyway (and always has been).

It's the admin's responsibility to make sure that the filesystem is not 
mounted multiple times when he touches the underlying block device after 
unmount.

> Yeah, agreed.
> 
> > namespaces etc. I'm not sure such behavior brings much value...
> 
> It would in any case mean complicating our code for little gain imho.
> And as I showed in my initial reply the current patch would hang on any
> bind-mount unmount. IOW, any container. And Al correctly points out
> issues with exit(), close() and friends on top of that.
> 
> But I also hate the idea of waiting on the last umount because that can
> also lead to new unexpected behavior when e.g., the system is shutdown
> and systemd goes on to unmount all things and then suddenly just hangs
> when before it was able to make progress.

Would you agree to waiting on the last umount only if the freeze was 
initiated by lvm? (there would be no hang in this case because lvm thaws 
the block device in a timely manner)

Mikulas
Jan Kara Sept. 8, 2023, 7:32 a.m. UTC | #15
On Thu 07-09-23 14:04:51, Mikulas Patocka wrote:
> 
> 
> On Thu, 7 Sep 2023, Christian Brauner wrote:
> 
> > > I think we've got too deep down into "how to fix things" but I'm not 100%
> > 
> > We did.
> > 
> > > sure what the "bug" actually is. In the initial posting Mikulas writes "the
> > > kernel writes to the filesystem after unmount successfully returned" - is
> > > that really such a big issue?
> 
> I think it's an issue if the administrator writes a script that unmounts a 
> filesystem and then copies the underyling block device somewhere. Or a 
> script that unmounts a filesystem and runs fsck afterwards. Or a script 
> that unmounts a filesystem and runs mkfs on the same block device.

Well, e.g. e2fsprogs use O_EXCL open so they will detect that the filesystem
hasn't been unmounted properly and complain. Which is exactly what should
IMHO happen.

> > > Anybody else can open the device and write to it as well. Or even 
> > > mount the device again. So userspace that relies on this is kind of 
> > > flaky anyway (and always has been).
> 
> It's admin's responsibility to make sure that the filesystem is not 
> mounted multiple times when he touches the underlying block device after 
> unmount.

What I wanted to suggest is that we should provide means to make sure the
block device is not being modified and educate admins and tool authors
about them. Because just doing "umount /dev/sda1" and thinking this means
that /dev/sda1 is unused now simply is not enough in today's world for
multiple reasons and we cannot solve it just in the kernel.

								Honza
Zdenek Kabelac Sept. 8, 2023, 9:29 a.m. UTC | #16
On 08. 09. 23 at 9:32, Jan Kara wrote:
> On Thu 07-09-23 14:04:51, Mikulas Patocka wrote:
>>
>> On Thu, 7 Sep 2023, Christian Brauner wrote:
>>
>>>> I think we've got too deep down into "how to fix things" but I'm not 100%
>>> We did.
>>>
>>>> sure what the "bug" actually is. In the initial posting Mikulas writes "the
>>>> kernel writes to the filesystem after unmount successfully returned" - is
>>>> that really such a big issue?
>> I think it's an issue if the administrator writes a script that unmounts a
>> filesystem and then copies the underyling block device somewhere. Or a
>> script that unmounts a filesystem and runs fsck afterwards. Or a script
>> that unmounts a filesystem and runs mkfs on the same block device.
> Well, e.g. e2fsprogs use O_EXCL open so they will detect that the filesystem
> hasn't been unmounted properly and complain. Which is exactly what should
> IMHO happen.
>
>>>> Anybody else can open the device and write to it as well. Or even
>>>> mount the device again. So userspace that relies on this is kind of
>>>> flaky anyway (and always has been).
>> It's admin's responsibility to make sure that the filesystem is not
>> mounted multiple times when he touches the underlying block device after
>> unmount.
> What I wanted to suggest is that we should provide means how to make sure
> block device is not being modified and educate admins and tool authors
> about them. Because just doing "umount /dev/sda1" and thinking this means
> that /dev/sda1 is unused now simply is not enough in today's world for
> multiple reasons and we cannot solve it just in the kernel.
>

Hi


/me just wondering how you then imagine e.g. the safe removal of a USB drive,
when the user cannot expect that unmount really unmounts the filesystem?

IMHO unmount should detect such a suspicious block device state if it cannot
correctly proceed - i.e. report a 'warning/error' for such commands...

The main problem is that if the 'unmount' succeeds in this case, the last
connection userspace had to this filesystem is lost, and the user cannot get
rid of such a filesystem on the system anymore.

For this particular case of unmounting a frozen filesystem I'd propose to just
proceed: drop the frozen state together with releasing the filesystem and never
issue any ioctl from such a filesystem to the device below. It would not be a
100% valid unmount, but since the freeze should be nearly equivalent to a
proper 'unmount' having been done, it shouldn't cause any harm either, and all
associated resources could be released. IMHO it's correct to 'drop' the frozen
state of a filesystem that is not going to exist anymore (assuming it's the
last such user).

Regards


Zdenek


Jan Kara Sept. 8, 2023, 10:20 a.m. UTC | #17
On Fri 08-09-23 11:29:40, Zdenek Kabelac wrote:
> Dne 08. 09. 23 v 9:32 Jan Kara napsal(a):
> > On Thu 07-09-23 14:04:51, Mikulas Patocka wrote:
> > > 
> > > On Thu, 7 Sep 2023, Christian Brauner wrote:
> > > 
> > > > > I think we've got too deep down into "how to fix things" but I'm not 100%
> > > > We did.
> > > > 
> > > > > sure what the "bug" actually is. In the initial posting Mikulas writes "the
> > > > > kernel writes to the filesystem after unmount successfully returned" - is
> > > > > that really such a big issue?
> > > I think it's an issue if the administrator writes a script that unmounts a
> > > filesystem and then copies the underyling block device somewhere. Or a
> > > script that unmounts a filesystem and runs fsck afterwards. Or a script
> > > that unmounts a filesystem and runs mkfs on the same block device.
> > Well, e.g. e2fsprogs use O_EXCL open so they will detect that the filesystem
> > hasn't been unmounted properly and complain. Which is exactly what should
> > IMHO happen.
> > 
> > > > > Anybody else can open the device and write to it as well. Or even
> > > > > mount the device again. So userspace that relies on this is kind of
> > > > > flaky anyway (and always has been).
> > > It's admin's responsibility to make sure that the filesystem is not
> > > mounted multiple times when he touches the underlying block device after
> > > unmount.
> > What I wanted to suggest is that we should provide means how to make sure
> > block device is not being modified and educate admins and tool authors
> > about them. Because just doing "umount /dev/sda1" and thinking this means
> > that /dev/sda1 is unused now simply is not enough in today's world for
> > multiple reasons and we cannot solve it just in the kernel.
> > 
> 
> /me just wondering how do you then imagine i.e. safe removal of USB drive
> when user shall not expect unmount really unmounts filesystem?

Well, currently you click some "Eject / safely remove / whatever" button
and then you get a "wait" dialog until everything is done after which
you're told the stick is safe to remove. What I imagine is that the "wait"
dialog needs to be there while there are any (or exclusive at minimum) openers
of the device. Not until umount(2) syscall has returned. And yes, the
kernel doesn't quite make that easy - the best you can currently probably
do is to try opening the device with O_EXCL and if that fails, you know
there's some other exclusive open.

> IMHO  - unmount should detect some very suspicious state of block device if
> it cannot correctly proceed - i.e. reporting 'warning/error' on such
> commands...

You seem to be concentrating too much on the simple case of a desktop with
a USB stick you just copy data to & from. :) The trouble is, as Al wrote
elsewhere in this thread, that a filesystem unmount can for example be the
result of an exit(2) or close(2) system call if you set things up in a nasty
way. Do you want exit(2) to fail because the block device is frozen?
Umount(2) has to work for all its users and changing the behavior has nasty
corner-cases. So does the current behavior, I agree, but improving the
situation for one usecase while breaking another isn't really a way
forward...

> Main problem is - if the 'unmount' is successful in this case - the last
> connection userspace had to this fileystem is lost - and user cannot get rid
> of such filesystem anymore for a system.

Well, the filesystem (struct superblock to be exact) is invisible in
/proc/mounts (or whatever), that is true. But it is still very much
associated with that block device and if you do 'mount <device>
<mntpoint>', you'll get it back. But yes, the filesystem will not go away
until all references to it are dropped and you cannot easily find who holds
those references and how to get rid of them.

> I'd likely propose in this particular state of unmounting of a frozen
> filesystem to just proceed - and drop the frozen state together with release
> filesystem and never issue any ioctl from such filelsystem to the device
> below - so it would not be a 100% valid unmount - but since the freeze
> should be nearly equivalent of having a proper 'unmount' being done -  it
> shoudn't be causing any harm either - and  all resources associated could 
> be 'released.  IMHO it's correct to 'drop' frozen state for filesystem
> that is not going to exist anymore  (assuming it's the last  such user)

This option was also discussed in the past and it has nasty consequences as
well. Cleanly shutting down a filesystem usually needs to write to the
underlying device, so either you allow the filesystem to write to the device
on umount, breaking the assumptions of the user who froze the fs, or you'd have
to implement special handling for this case in every filesystem to avoid
the writes (and put up with the fact that the filesystem will appear as
uncleanly shut down on the next mount). Not particularly nice either...

								Honza
Zdenek Kabelac Sept. 8, 2023, 10:51 a.m. UTC | #18
On 08. 09. 23 at 12:20, Jan Kara wrote:
> On Fri 08-09-23 11:29:40, Zdenek Kabelac wrote:
>> Dne 08. 09. 23 v 9:32 Jan Kara napsal(a):
>>> On Thu 07-09-23 14:04:51, Mikulas Patocka wrote:
>>>> On Thu, 7 Sep 2023, Christian Brauner wrote:
>>>>
>>>>>> I think we've got too deep down into "how to fix things" but I'm not 100%
>>>>> We did.
>>>>>
>>>>>> sure what the "bug" actually is. In the initial posting Mikulas writes "the
>>>>>> kernel writes to the filesystem after unmount successfully returned" - is
>>>>>> that really such a big issue?
>>>> I think it's an issue if the administrator writes a script that unmounts a
>>>> filesystem and then copies the underyling block device somewhere. Or a
>>>> script that unmounts a filesystem and runs fsck afterwards. Or a script
>>>> that unmounts a filesystem and runs mkfs on the same block device.
>>> Well, e.g. e2fsprogs use O_EXCL open so they will detect that the filesystem
>>> hasn't been unmounted properly and complain. Which is exactly what should
>>> IMHO happen.
>> I'd likely propose in this particular state of unmounting of a frozen
>> filesystem to just proceed - and drop the frozen state together with release
>> filesystem and never issue any ioctl from such filelsystem to the device
>> below - so it would not be a 100% valid unmount - but since the freeze
>> should be nearly equivalent of having a proper 'unmount' being done -  it
>> shoudn't be causing any harm either - and  all resources associated could
>> be 'released.  IMHO it's correct to 'drop' frozen state for filesystem
>> that is not going to exist anymore  (assuming it's the last  such user)
> This option was also discussed in the past and it has nasty consequences as
> well. Cleanly shutting down a filesystem usually needs to write to the
> underlying device so either you allow the filesystem to write to the device
> on umount breaking assumptions of the user who froze the fs or you'd have
> to implement a special handling for this case for every filesystem to avoid
> the writes (and put up with the fact that the filesystem will appear as
> uncleanly shutdown on the next mount). Not particularly nice either...


I'd say there are several options and we should aim towards the variant which 
is most usable by normal users.

Making the unmount rules hyper-complex - so that basically no user-space tools
around Gnome/KDE... can handle them well, and only core kernel developers have
the 'wisdom' to detect and decode the system state and 'know what's going on' -
isn't the goal here.

Freeze should get the filesystem into a 'consistent' state - the filesystem
should be able to 'easily' recover and finish all ongoing 'unfinished' work on
the next mount without requiring a full 'fsck' - otherwise it would be useless
for e.g. snapshots.

So to me this looks like a win-win strategy where we basically do not lose any
information and we also do not leak kernel resources - since e.g. in the case
of DM devices the underlying DM device might have already changed its disk
characteristics anyway.

If developers then believe that 'more variants' of complex behavior are
necessary, the kernel could have some sysfs parameter to configure 'more
advanced' logic - i.e. keep the 'fs mounted' for those skilled admins who are
able to go through the deepest corners here - but other than that, plain
'umount' should really mean that it either a) manages to unmount and release
the device, or b) reports to the user that there is still something holding
the device...

Regards


Zdenek
Christian Brauner Sept. 8, 2023, 11:32 a.m. UTC | #19
> I'd say there are several options and we should aim towards the variant
> which is most usable by normal users.

None of the options is sufficiently satisfying to risk intricate
behavioral changes with unknown consequences for existing workloads as
far as I'm concerned.

Pavel Machek Sept. 8, 2023, 11:59 a.m. UTC | #20
On Thu 2023-09-07 11:44:57, Jan Kara wrote:
> On Wed 06-09-23 18:52:39, Mikulas Patocka wrote:
> > On Wed, 6 Sep 2023, Christian Brauner wrote:
> > > On Wed, Sep 06, 2023 at 06:01:06PM +0200, Mikulas Patocka wrote:
> > > > > > BTW. what do you think that unmount of a frozen filesystem should properly 
> > > > > > do? Fail with -EBUSY? Or, unfreeze the filesystem and unmount it? Or 
> > > > > > something else?
> > > > > 
> > > > > In my opinion we should refuse to unmount frozen filesystems and log an
> > > > > error that the filesystem is frozen. Waiting forever isn't a good idea
> > > > > in my opinion.
> > > > 
> > > > But lvm may freeze filesystems anytime - so we'd get randomly returned 
> > > > errors then.
> > > 
> > > So? Or you might hang at anytime.
> > 
> > lvm doesn't keep logical volumes suspended for a prolonged amount of time. 
> > It will unfreeze them after it made updates to the dm table and to the 
> > metadata. So, it won't hang forever.
> > 
> > I think it's better to sleep for a short time in umount than to return an 
> > error.
> 
> I think we've got too deep down into "how to fix things" but I'm not 100%
> sure what the "bug" actually is. In the initial posting Mikulas writes "the
> kernel writes to the filesystem after unmount successfully returned" - is
> that really such a big issue? Anybody else can open the device and write to
> it as well. Or even mount the device again. So userspace that relies on
> this is kind of flaky anyway (and always has been).

Umm. No? I admin my own systems; I'm responsible for my
userspace. Maybe I'm in single user mode.

No one writes to my block devices without my permission.

By mount, I give such permission to the kernel. By umount, I take
such permission away.

There's nothing flaky about that. Kernel is simply buggy. Fix it.

[Remember that "you should umount before disconnecting USB devices to
prevent data corruption"? How is that working with kernel writing to
devices after umount?]

Best regards,
									Pavel
Pavel Machek Sept. 8, 2023, 12:01 p.m. UTC | #21
Hi!

> What I wanted to suggest is that we should provide means how to make sure
> block device is not being modified and educate admins and tool authors
> about them. Because just doing "umount /dev/sda1" and thinking this means
> that /dev/sda1 is unused now simply is not enough in today's world for
> multiple reasons and we cannot solve it just in the kernel.

It better be enough. And I'm pretty sure it is true in single-user
mode, or for usb sticks, or...

Simply fix the kernel. No need to re-educate anyone.


									Pavel
Christian Brauner Sept. 8, 2023, 12:02 p.m. UTC | #22
> Well, currently you click some "Eject / safely remove / whatever" button
> and then you get a "wait" dialog until everything is done after which
> you're told the stick is safe to remove. What I imagine is that the "wait"
> dialog needs to be there while there are any (or exclusive at minimum) openers
> of the device. Not until umount(2) syscall has returned. And yes, the

Agreed. umount(2) doesn't give guarantees about a filesystem being
really gone once it has returned. And it really shouldn't. There are too
many cases where that doesn't work and it's not a commitment we should
make.

And there are ways to wait until superblock shutdown that I've mentioned
before in other places where it somehow really matters. inotify's
IN_UNMOUNT will notify about superblock shutdown. IOW, when it really
hits generic_shutdown_super() which can only be hit after unfreezing as
that requires active references.

So this really can be used to wait for a filesystem to go away across
all namespaces, and across filesystem freezing and it's available to
unprivileged users. Simple example:

# shell 1
sudo mount -t xfs /dev/sda /mnt
sudo mount --bind /mnt /opt
inotifywait -e unmount /mnt

#shell 2
sudo umount /opt # nothing happens in shell 1
sudo umount /mnt # shell 1 gets woken
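
The same wait can be expressed in C against the raw inotify API (a minimal
sketch, not part of the patch: the watched path and the IN_ATTRIB mask below
are placeholders, and it relies on inotify reporting IN_UNMOUNT to a watch
regardless of the mask that was requested):

#include <stdio.h>
#include <unistd.h>
#include <sys/inotify.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/mnt";	/* placeholder path */
	char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
	int fd = inotify_init1(IN_CLOEXEC);

	if (fd < 0) {
		perror("inotify_init1");
		return 1;
	}
	/* IN_ATTRIB is only a placeholder; IN_UNMOUNT is reported anyway. */
	if (inotify_add_watch(fd, path, IN_ATTRIB) < 0) {
		perror("inotify_add_watch");
		return 1;
	}
	for (;;) {
		ssize_t len = read(fd, buf, sizeof(buf));

		if (len <= 0) {
			perror("read");
			return 1;
		}
		for (char *p = buf; p < buf + len;) {
			struct inotify_event *ev = (struct inotify_event *)p;

			if (ev->mask & IN_UNMOUNT) {
				printf("%s: superblock shut down\n", path);
				return 0;
			}
			p += sizeof(*ev) + ev->len;
		}
	}
}

Run against /mnt in shell 1 above, this returns only once the second umount
has actually torn the superblock down, i.e. after any freeze was lifted.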

> corner-cases. So does the current behavior, I agree, but improving
> situation for one usecase while breaking another usecase isn't really a way
> forward...

Agreed.

> Well, the filesystem (struct superblock to be exact) is invisible in
> /proc/mounts (or whatever), that is true. But it is still very much
> associated with that block device and if you do 'mount <device>
> <mntpoint>', you'll get it back. But yes, the filesystem will not go away

And now we at least have an API to detect that case and refuse to reuse
the superblock.

> until all references to it are dropped and you cannot easily find who holds
> those references and how to get rid of them.

Namespaces make this even messier. You have no easy way of knowing
whether the filesystem isn't pinned somewhere else through an explicit
bind-mount or when it was copied during mount namespace creation.

Zdenek Kabelac Sept. 8, 2023, 12:07 p.m. UTC | #23
On 08. 09. 2023 at 13:32, Christian Brauner wrote:
>> I'd say there are several options and we should aim towards the variant
>> which is most usable by normal users.
> 
> None of the options is sufficiently satisfying to risk intricate
> behavioral changes with unknown consequences for existing workloads as
> far as I'm concerned.
> 

I'm not convinced anyone has 'fsfreeze' + 'unmount' as a regular workload.
Thus solving this unusual race case shouldn't break anyone else's existing
workload.

ATM there is no good solution for this particular case.

So can you please elaborate on which new risks we are going to introduce by
fixing this resource hole?


Zdenek

Christian Brauner Sept. 8, 2023, 12:34 p.m. UTC | #24
> So can you please elaborate on which new risks we are going to introduce by
> fixing this resource hole?

I'm not quite sure why you need a personal summary of the various
reasons different people have brought up in the thread.

John Stoffel Sept. 8, 2023, 4:49 p.m. UTC | #25
>>>>> "Christian" == Christian Brauner <brauner@kernel.org> writes:

>> Well, currently you click some "Eject / safely remove / whatever" button
>> and then you get a "wait" dialog until everything is done after which
>> you're told the stick is safe to remove. What I imagine is that the "wait"
>> dialog needs to be there while there are any (or exclusive at minimum) openers
>> of the device. Not until umount(2) syscall has returned. And yes, the

> Agreed. umount(2) doesn't give guarantees about a filesystem being
> really gone once it has returned. And it really shouldn't. There's
> too many cases where that doesn't work and it's not a commitment we
> should make.

So how the heck is someone supposed to know, from userspace, that a
filesystem is unmounted?  Just wearing my SysAdmin hat, this strikes
me as really potentially painful and annoying.  But then again, so are
bind mounts from a lot of views too.

Don't people remember how bad it can be when you are trying to
shut down a system and it hangs because a remote NFS server is down
and not responding?  And your client system hangs completely?

> And there are ways to wait until superblock shutdown that I've
> mentioned before in other places where it somehow really
> matters. inotify's IN_UNMOUNT will notify about superblock
> shutdown. IOW, when it really hits generic_shutdown_super() which
> can only be hit after unfreezing as that requires active references.

Can we maybe invert this discussion and think about it from the
userspace side of things?  How does the user(space) mount/unmount
devices cleanly and reliably?  

> So this really can be used to wait for a filesystem to go away across
> all namespaces, and across filesystem freezing and it's available to
> unprivileged users. Simple example:

> # shell 1
> sudo mount -t xfs /dev/sda /mnt
> sudo mount --bind /mnt /opt
> inotifywait -e unmount /mnt

> #shell 2
> sudo umount /opt # nothing happens in shell 1
> sudo umount /mnt # shell 1 gets woken

So what makes this *useful* to anyone?  Why doesn't the bind mount
A) lock /mnt into place, and B) give you some way of seeing that
there's a bind mount in place?

>> corner-cases. So does the current behavior, I agree, but improving
>> situation for one usecase while breaking another usecase isn't really a way
>> forward...

> Agreed.

>> Well, the filesystem (struct superblock to be exact) is invisible
>> in /proc/mounts (or whatever), that is true. But it is still very
>> much associated with that block device and if you do 'mount
>> <device> <mntpoint>', you'll get it back. But yes, the filesystem
>> will not go away

Then should it be unmountable in the first place?  I mean yes, we
always need a way to force an unmount no matter what, even if that
breaks some other process on the system, but for regular use,
shouldn't it just give back an error like:

	  /mnt in use by bind mount /opt

or something like that?  Give the poor sysadmin some information on
what's going on here. 

> And now we at least have an api to detect that case and refuse to reuse
> the superblock.

>> until all references to it are dropped and you cannot easily find
>> who holds those references and how to get rid of them.

ding ding ding!!!!  I don't want to have to run 'lsof' or something
like that.

> Namespaces make this even messier. You have no easy way of knowing
> whether the filesystem isn't pinned somewhere else through an
> explicit bind-mount or when it was copied during mount namespace
> creation.

This is the biggest downside of namespaces and bind mounts in my
mind.  The lack of traceability back to the source.  
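
For what it's worth, within a single mount namespace you can at least
enumerate the mounts sitting on a given block device by scanning
/proc/self/mountinfo. A rough sketch (field layout per proc(5); escaped
spaces in mount points are left undecoded), which also shows the limit
being complained about - it sees nothing in other namespaces and nothing
about pinned but unmounted superblocks:

#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

int main(int argc, char **argv)
{
	struct stat st;
	char devnum[32], mm[32], mnt[4096], line[4096];
	FILE *f;

	if (argc != 2 || stat(argv[1], &st) < 0 || !S_ISBLK(st.st_mode)) {
		fprintf(stderr, "usage: %s <block-device>\n", argv[0]);
		return 1;
	}
	snprintf(devnum, sizeof(devnum), "%u:%u",
		 major(st.st_rdev), minor(st.st_rdev));

	f = fopen("/proc/self/mountinfo", "r");
	if (!f) {
		perror("/proc/self/mountinfo");
		return 1;
	}
	/* Fields per proc(5): ID, parent ID, major:minor, root, mount point, ... */
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "%*d %*d %31s %*s %4095s", mm, mnt) == 2 &&
		    strcmp(mm, devnum) == 0)
			printf("%s is mounted at %s (this namespace only)\n",
			       argv[1], mnt);
	}
	fclose(f);
	return 0;
}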

Christoph Hellwig Sept. 9, 2023, 11:21 a.m. UTC | #26
On Fri, Sep 08, 2023 at 12:20:14PM +0200, Jan Kara wrote:
> Well, currently you click some "Eject / safely remove / whatever" button
> and then you get a "wait" dialog until everything is done after which
> you're told the stick is safe to remove. What I imagine is that the "wait"
> dialog needs to be there while there are any (or exclusive at minimum) openers
> of the device. Not until umount(2) syscall has returned. And yes, the
> kernel doesn't quite make that easy - the best you can currently probably
> do is to try opening the device with O_EXCL and if that fails, you know
> there's some other exclusive open.

... and the simple answer to the problem is to have an event notification
for when the super_block has finally been destroyed.  That way the
application gets the notification directly instead of having to make
synchronous a process that fundamentally isn't, as explained in this thread.
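
The O_EXCL probe quoted above can be sketched as follows (a best-effort
check only: the default device path is just an example, and the answer is
inherently racy since the holder can appear or vanish right after the open):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "/dev/sda1";	/* example device */
	int fd = open(dev, O_RDONLY | O_EXCL);

	if (fd >= 0) {
		/* No exclusive holder at this instant - one can appear right away. */
		printf("%s: not held exclusively right now\n", dev);
		close(fd);
		return 0;
	}
	if (errno == EBUSY) {
		printf("%s: exclusively held (mounted, frozen, or otherwise claimed)\n",
		       dev);
		return 1;
	}
	fprintf(stderr, "open %s: %s\n", dev, strerror(errno));
	return 2;
}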

Jan Kara Sept. 12, 2023, 9:10 a.m. UTC | #27
On Fri 08-09-23 12:51:03, Zdenek Kabelac wrote:
> On 08. 09. 2023 at 12:20, Jan Kara wrote:
> > On Fri 08-09-23 11:29:40, Zdenek Kabelac wrote:
> > > On 08. 09. 2023 at 9:32, Jan Kara wrote:
> > > > On Thu 07-09-23 14:04:51, Mikulas Patocka wrote:
> > > > > On Thu, 7 Sep 2023, Christian Brauner wrote:
> > > > > 
> > > > > > > I think we've got too deep down into "how to fix things" but I'm not 100%
> > > > > > We did.
> > > > > > 
> > > > > > > sure what the "bug" actually is. In the initial posting Mikulas writes "the
> > > > > > > kernel writes to the filesystem after unmount successfully returned" - is
> > > > > > > that really such a big issue?
> > > > > I think it's an issue if the administrator writes a script that unmounts a
> > > > > filesystem and then copies the underlying block device somewhere. Or a
> > > > > script that unmounts a filesystem and runs fsck afterwards. Or a script
> > > > > that unmounts a filesystem and runs mkfs on the same block device.
> > > > Well, e.g. e2fsprogs use O_EXCL open so they will detect that the filesystem
> > > > hasn't been unmounted properly and complain. Which is exactly what should
> > > > IMHO happen.
> > > I'd likely propose in this particular state of unmounting of a frozen
> > > filesystem to just proceed - and drop the frozen state together with release
> > > filesystem and never issue any ioctl from such filesystem to the device
> > > below - so it would not be a 100% valid unmount - but since the freeze
> > > should be nearly equivalent of having a proper 'unmount' being done -  it
> > > shouldn't be causing any harm either - and all resources associated could
> > > be released.  IMHO it's correct to 'drop' frozen state for filesystem
> > > that is not going to exist anymore  (assuming it's the last  such user)
> > This option was also discussed in the past and it has nasty consequences as
> > well. Cleanly shutting down a filesystem usually needs to write to the
> > underlying device so either you allow the filesystem to write to the device
> > on umount breaking assumptions of the user who froze the fs or you'd have
> > to implement a special handling for this case for every filesystem to avoid
> > the writes (and put up with the fact that the filesystem will appear as
> > uncleanly shutdown on the next mount). Not particularly nice either...
> 
> 
> I'd say there are several options and we should aim towards the variant
> which is most usable by normal users.
> 
> Making hyper complex  unmount rule logic that basically no user-space tools
> around Gnome/KDE... are able to handle well and getting it to the position
> where only the core kernel developer have all the 'wisdom' to detect and
> decode system state and then 'know what's going on'  isn't the favourite
> goal here.

I don't think we are really making forward progress in the argument about
which behavior is more or less correct or useful. But maybe, when we cannot
agree on a general solution, we could still improve the situations that
practically matter? E.g. disputing Gnome apps telling you that you can
safely remove the USB stick when you actually cannot, because the
filesystem on it is frozen, is a rather weak argument, because users who
freeze the filesystem on their USB stick are practically non-existent. So
is there a use case where users are hitting these problems in practice?
Maybe some user report that triggered Mikulas' original patch? Or was that
mostly a theoretical concern?

> Freeze should be getting the filesystem into 'consistent' state - filesystem
> should  be able to 'easily' recover and finish all the ongoing  'unfinished'
> process with the next mount without requiring full 'fsck' - otherwise it
> would be useless for i.e. snapshot.

More or less yes, but e.g. grub2 isn't able to reliably read an image of a
just-frozen filesystem because it ignores journal contents. So if this was
the root filesystem, it could result in an unbootable system.

								Honza
diff mbox series

Patch

Index: linux-2.6/fs/namespace.c
===================================================================
--- linux-2.6.orig/fs/namespace.c	2023-09-06 09:45:54.000000000 +0200
+++ linux-2.6/fs/namespace.c	2023-09-06 09:47:15.000000000 +0200
@@ -1251,7 +1251,7 @@  static void cleanup_mnt(struct mount *mn
 	}
 	fsnotify_vfsmount_delete(&mnt->mnt);
 	dput(mnt->mnt.mnt_root);
-	deactivate_super(mnt->mnt.mnt_sb);
+	wait_and_deactivate_super(mnt->mnt.mnt_sb);
 	mnt_free_id(mnt);
 	call_rcu(&mnt->mnt_rcu, delayed_free_vfsmnt);
 }
Index: linux-2.6/fs/super.c
===================================================================
--- linux-2.6.orig/fs/super.c	2023-09-05 21:09:16.000000000 +0200
+++ linux-2.6/fs/super.c	2023-09-06 09:52:20.000000000 +0200
@@ -36,6 +36,7 @@ 
 #include <linux/lockdep.h>
 #include <linux/user_namespace.h>
 #include <linux/fs_context.h>
+#include <linux/delay.h>
 #include <uapi/linux/mount.h>
 #include "internal.h"
 
@@ -365,6 +366,25 @@  void deactivate_super(struct super_block
 EXPORT_SYMBOL(deactivate_super);
 
 /**
+ *	wait_and_deactivate_super	-	wait for unfreeze and drop a reference
+ *	@s: superblock to deactivate
+ *
+ *	Variant of deactivate_super(), except that we wait if the filesystem is
+ *	frozen. This is required on unmount, to make sure that the filesystem is
+ *	really unmounted when this function returns.
+ */
+void wait_and_deactivate_super(struct super_block *s)
+{
+	down_write(&s->s_umount);
+	while (s->s_writers.frozen != SB_UNFROZEN) {
+		up_write(&s->s_umount);
+		msleep(1);
+		down_write(&s->s_umount);
+	}
+	deactivate_locked_super(s);
+}
+
+/**
  *	grab_super - acquire an active reference
  *	@s: reference we are trying to make active
  *
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2023-09-05 21:09:16.000000000 +0200
+++ linux-2.6/include/linux/fs.h	2023-09-06 09:46:56.000000000 +0200
@@ -2262,6 +2262,7 @@  void kill_anon_super(struct super_block
 void kill_litter_super(struct super_block *sb);
 void deactivate_super(struct super_block *sb);
 void deactivate_locked_super(struct super_block *sb);
+void wait_and_deactivate_super(struct super_block *sb);
 int set_anon_super(struct super_block *s, void *data);
 int set_anon_super_fc(struct super_block *s, struct fs_context *fc);
 int get_anon_bdev(dev_t *);