btrfs: Change s_flags instead of returning -EBUSY
diff mbox

Message ID 20170307134702.GS4662@twin.jikos.cz
State New
Headers show

Commit Message

David Sterba March 7, 2017, 1:47 p.m. UTC
On Sat, Mar 04, 2017 at 12:33:22PM -0600, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> The problem is with parallel mounting multiple subvolumes rw when the
> root filesystem is marked as read-only such as a boot sequence
> using systemd. Not all subvolumes will be mounted because some of
> them will return -EBUSY.
> 
> Here is a sample execution time.
> Root filesystem is mounted read-only so s->s_flags are set to MS_RDONLY
> flags is the parameter passed and s_flags is the one recorded in sb.
> btrfs_mount is called via vfs_kern_mount().
> 
> Mount Thread 1				Mount Thread 2
> 
> btrfs_mount(flags & ~MS_RDONLY)
>  check (flags ^ s_flags) & MS_RDONLY
>  returns -EBUSY				btrfs_mount(flags & ~MS_RDONLY)
>  					check (flags ^ s_flags) & MS_RDONLY
> 					 returns -EBUSY
> btrfs_mount(flags | MS_RDONLY)
> btrfs_remount(flags & ~MS_RDONLY)
> s->s_flags &= ~MS_RDONLY
> 					btrfs_mount(flags | MS_RDONLY)
> 					check (flags ^ s_flags) & MS_RDONLY)
> 					 returns -EBUSY
> 					 mount FAILS
> 
> The -EBUSY was originally introduced in:
> 4b82d6e ("Btrfs: Add mount into directory support")
> as a copy of (then) get_sb_dev().
> 
> Later commit 0723a04 ("btrfs: allow mounting btrfs subvolumes
> with different ro/rw options") added the option of allowing
> subvolumes in rw/ro modes.
> 
> This fix is instead of toggling the flags in remount, we set
> s_flags &= MS_RDONLY when we see a conflict in s_flags and passed parameter
> flags and let mount continue as it is. This will allow the first mount attempt
> to succeed, and we can get rid of the re-kern_mount() and remount sequence
> altogether.

(as we've discussed this off-list, I'll repeat some of the points here)

We can't get rid of the vfs_kern_mount sequence in mount_subvol as this
addresses a usecase [1].

I've tried to fix the parallel mount bug once [2] but the simple
solution to remove the -EBUSY check was not working correctly. The fix
you propose does not look correct to me either, as it is chaning the
ro/rw flags that are being passed to mount.

[1] https://git.kernel.org/linus/0723a0473fb48a1c93b113a28665b64ce5faf35a
[2] https://lkml.kernel.org/r/1461770076-13000-1-git-send-email-dsterba@suse.com

>  	mnt = vfs_kern_mount(&btrfs_fs_type, flags, device_name, newargs);

The race could happen between the two calls to kern mount, the state of
the ro/rw flags is not viewed consistently as they change. The
workaround that worked for me is to add a local mutex (patch below) that
protects just this section. But the solution is quite ugly and I haven't
sent it upstream for that reason.

> -	if (PTR_ERR_OR_ZERO(mnt) == -EBUSY) {
> -		if (flags & MS_RDONLY) {
> -			mnt = vfs_kern_mount(&btrfs_fs_type, flags & ~MS_RDONLY,
> -					     device_name, newargs);
> -		} else {
> -			mnt = vfs_kern_mount(&btrfs_fs_type, flags | MS_RDONLY,
> -					     device_name, newargs);
> -			if (IS_ERR(mnt)) {
> -				root = ERR_CAST(mnt);
> -				mnt = NULL;
> -				goto out;
> -			}
> -
> -			down_write(&mnt->mnt_sb->s_umount);
> -			ret = btrfs_remount(mnt->mnt_sb, &flags, NULL);
> -			up_write(&mnt->mnt_sb->s_umount);
> -			if (ret < 0) {
> -				root = ERR_PTR(ret);
> -				goto out;
> -			}
> -		}
> -	}


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Goldwyn Rodrigues March 8, 2017, 2:28 p.m. UTC | #1
On 03/07/2017 07:47 AM, David Sterba wrote:
> On Sat, Mar 04, 2017 at 12:33:22PM -0600, Goldwyn Rodrigues wrote:
>> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>>
>> The problem is with parallel mounting multiple subvolumes rw when the
>> root filesystem is marked as read-only such as a boot sequence
>> using systemd. Not all subvolumes will be mounted because some of
>> them will return -EBUSY.
>>
>> Here is a sample execution time.
>> Root filesystem is mounted read-only so s->s_flags are set to MS_RDONLY
>> flags is the parameter passed and s_flags is the one recorded in sb.
>> btrfs_mount is called via vfs_kern_mount().
>>
>> Mount Thread 1				Mount Thread 2
>>
>> btrfs_mount(flags & ~MS_RDONLY)
>>  check (flags ^ s_flags) & MS_RDONLY
>>  returns -EBUSY				btrfs_mount(flags & ~MS_RDONLY)
>>  					check (flags ^ s_flags) & MS_RDONLY
>> 					 returns -EBUSY
>> btrfs_mount(flags | MS_RDONLY)
>> btrfs_remount(flags & ~MS_RDONLY)
>> s->s_flags &= ~MS_RDONLY
>> 					btrfs_mount(flags | MS_RDONLY)
>> 					check (flags ^ s_flags) & MS_RDONLY)
>> 					 returns -EBUSY
>> 					 mount FAILS
>>
>> The -EBUSY was originally introduced in:
>> 4b82d6e ("Btrfs: Add mount into directory support")
>> as a copy of (then) get_sb_dev().
>>
>> Later commit 0723a04 ("btrfs: allow mounting btrfs subvolumes
>> with different ro/rw options") added the option of allowing
>> subvolumes in rw/ro modes.
>>
>> This fix is instead of toggling the flags in remount, we set
>> s_flags &= MS_RDONLY when we see a conflict in s_flags and passed parameter
>> flags and let mount continue as it is. This will allow the first mount attempt
>> to succeed, and we can get rid of the re-kern_mount() and remount sequence
>> altogether.
> 
> (as we've discussed this off-list, I'll repeat some of the points here)
> 
> We can't get rid of the vfs_kern_mount sequence in mount_subvol as this
> addresses a usecase [1].

This use case works for the patch I proposed. I mentioned the link in
the patch header. However, there are other issues which Omar pointed
out, namely the checks before turning it RW.

> 
> I've tried to fix the parallel mount bug once [2] but the simple
> solution to remove the -EBUSY check was not working correctly. The fix
> you propose does not look correct to me either, as it is chaning the
> ro/rw flags that are being passed to mount.
> 
> [1] https://git.kernel.org/linus/0723a0473fb48a1c93b113a28665b64ce5faf35a
> [2] https://lkml.kernel.org/r/1461770076-13000-1-git-send-email-dsterba@suse.com
> 
>>  	mnt = vfs_kern_mount(&btrfs_fs_type, flags, device_name, newargs);
> 
> The race could happen between the two calls to kern mount, the state of
> the ro/rw flags is not viewed consistently as they change. The
> workaround that worked for me is to add a local mutex (patch below) that
> protects just this section. But the solution is quite ugly and I haven't
> sent it upstream for that reason.

The patch performs the change one way (RO->RW) exactly once during a
subvol rw mount in case the root volume is rw. Any subsequent calls will
not go through this because it is already changed. I am not sure what
would interfere/race with it.

The idea is to remove the race presented by removing return -EBUSY and
set the flags there itself. -EBUSY would re-call vfs_kern mount which is
the main cause of the flags being switched again and again.

However, the remount operation of turning any subvolume read-only is
tricky because it turns the whole volume read-only. I have started a new
thread for that.

> 
>> -	if (PTR_ERR_OR_ZERO(mnt) == -EBUSY) {
>> -		if (flags & MS_RDONLY) {
>> -			mnt = vfs_kern_mount(&btrfs_fs_type, flags & ~MS_RDONLY,
>> -					     device_name, newargs);
>> -		} else {
>> -			mnt = vfs_kern_mount(&btrfs_fs_type, flags | MS_RDONLY,
>> -					     device_name, newargs);
>> -			if (IS_ERR(mnt)) {
>> -				root = ERR_CAST(mnt);
>> -				mnt = NULL;
>> -				goto out;
>> -			}
>> -
>> -			down_write(&mnt->mnt_sb->s_umount);
>> -			ret = btrfs_remount(mnt->mnt_sb, &flags, NULL);
>> -			up_write(&mnt->mnt_sb->s_umount);
>> -			if (ret < 0) {
>> -				root = ERR_PTR(ret);
>> -				goto out;
>> -			}
>> -		}
>> -	}
> 
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -1284,6 +1284,7 @@ static struct dentry *mount_subvol(const
>         struct vfsmount *mnt = NULL;
>         char *newargs;
>         int ret;
> +       static DEFINE_MUTEX(subvol_lock);
> 
>         newargs = setup_root_args(data);
>         if (!newargs) {
> @@ -1291,6 +1292,24 @@ static struct dentry *mount_subvol(const
>                 goto out;
>         }
> 
> +       /*
> +        * Protect against racing mounts of subvolumes with different RO/RW
> +        * flags.  The first vfs_kern_mount could fail with -EBUSY if the rw
> +        * flags do not match with the first and the currently mounted
> +        * subvolume.
> +        *
> +        * To resolve that, we adjust the rw flags and do remount. If another
> +        * mounts goes through the same path and hits the window between the
> +        * adjusted vfs_kern_mount and btrfs_remount, it will fail because of
> +        * the ro/rw mismatch in btrfs_mount.
> +        *
> +        * If the mounts do not race and are serialized externally, everything
> +        * works fine.  The function-local mutex enforces the serialization but
> +        * is otherwise only an ugly workaround due to lack of better
> +        * solutions.
> +        */
> +       mutex_lock(&subvol_lock);
> +
>         mnt = vfs_kern_mount(&btrfs_fs_type, flags, device_name, newargs);
>         if (PTR_ERR_OR_ZERO(mnt) == -EBUSY) {
>                 if (flags & MS_RDONLY) {
> @@ -1302,6 +1321,7 @@ static struct dentry *mount_subvol(const
>                         if (IS_ERR(mnt)) {
>                                 root = ERR_CAST(mnt);
>                                 mnt = NULL;
> +                               mutex_unlock(&subvol_lock);
>                                 goto out;
>                         }
> 
> @@ -1310,10 +1330,13 @@ static struct dentry *mount_subvol(const
>                         up_write(&mnt->mnt_sb->s_umount);
>                         if (ret < 0) {
>                                 root = ERR_PTR(ret);
> +                               mutex_unlock(&subvol_lock);
>                                 goto out;
>                         }
>                 }
>         }
> +       mutex_unlock(&subvol_lock);
> +
>         if (IS_ERR(mnt)) {
>                 root = ERR_CAST(mnt);
>                 mnt = NULL;
>

Patch
diff mbox

--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1284,6 +1284,7 @@  static struct dentry *mount_subvol(const
        struct vfsmount *mnt = NULL;
        char *newargs;
        int ret;
+       static DEFINE_MUTEX(subvol_lock);

        newargs = setup_root_args(data);
        if (!newargs) {
@@ -1291,6 +1292,24 @@  static struct dentry *mount_subvol(const
                goto out;
        }

+       /*
+        * Protect against racing mounts of subvolumes with different RO/RW
+        * flags.  The first vfs_kern_mount could fail with -EBUSY if the rw
+        * flags do not match with the first and the currently mounted
+        * subvolume.
+        *
+        * To resolve that, we adjust the rw flags and do remount. If another
+        * mounts goes through the same path and hits the window between the
+        * adjusted vfs_kern_mount and btrfs_remount, it will fail because of
+        * the ro/rw mismatch in btrfs_mount.
+        *
+        * If the mounts do not race and are serialized externally, everything
+        * works fine.  The function-local mutex enforces the serialization but
+        * is otherwise only an ugly workaround due to lack of better
+        * solutions.
+        */
+       mutex_lock(&subvol_lock);
+
        mnt = vfs_kern_mount(&btrfs_fs_type, flags, device_name, newargs);
        if (PTR_ERR_OR_ZERO(mnt) == -EBUSY) {
                if (flags & MS_RDONLY) {
@@ -1302,6 +1321,7 @@  static struct dentry *mount_subvol(const
                        if (IS_ERR(mnt)) {
                                root = ERR_CAST(mnt);
                                mnt = NULL;
+                               mutex_unlock(&subvol_lock);
                                goto out;
                        }

@@ -1310,10 +1330,13 @@  static struct dentry *mount_subvol(const
                        up_write(&mnt->mnt_sb->s_umount);
                        if (ret < 0) {
                                root = ERR_PTR(ret);
+                               mutex_unlock(&subvol_lock);
                                goto out;
                        }
                }
        }
+       mutex_unlock(&subvol_lock);
+
        if (IS_ERR(mnt)) {
                root = ERR_CAST(mnt);
                mnt = NULL;