diff mbox series

fs: Protect reconfiguration of sb read-write from racing writes

Message ID 20230615113848.8439-1-jack@suse.cz (mailing list archive)
State New, archived
Headers show
Series fs: Protect reconfiguration of sb read-write from racing writes | expand

Commit Message

Jan Kara June 15, 2023, 11:38 a.m. UTC
The reconfigure / remount code takes a lot of effort to protect
filesystem's reconfiguration code from racing writes on remounting
read-only. However during remounting read-only filesystem to read-write
mode userspace writes can start immediately once we clear SB_RDONLY
flag. This is inconvenient for example for ext4 because we need to do
some writes to the filesystem (such as preparation of quota files)
before we can take userspace writes so we are clearing SB_RDONLY flag
before we are fully ready to accept userpace writes and syzbot has found
a way to exploit this [1]. Also as far as I'm reading the code
the filesystem remount code was protected from racing writes in the
legacy mount path by the mount's MNT_READONLY flag so this is relatively
new problem. It is actually fairly easy to protect remount read-write
from racing writes using sb->s_readonly_remount flag so let's just do
that instead of having to workaround these races in the filesystem code.

[1] https://lore.kernel.org/all/00000000000006a0df05f6667499@google.com/T/
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/super.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

Comments

Christian Brauner June 15, 2023, 12:53 p.m. UTC | #1
On Thu, Jun 15, 2023 at 01:38:48PM +0200, Jan Kara wrote:
> The reconfigure / remount code takes a lot of effort to protect
> filesystem's reconfiguration code from racing writes on remounting
> read-only. However during remounting read-only filesystem to read-write
> mode userspace writes can start immediately once we clear SB_RDONLY
> flag. This is inconvenient for example for ext4 because we need to do
> some writes to the filesystem (such as preparation of quota files)
> before we can take userspace writes so we are clearing SB_RDONLY flag
> before we are fully ready to accept userpace writes and syzbot has found
> a way to exploit this [1]. Also as far as I'm reading the code
> the filesystem remount code was protected from racing writes in the
> legacy mount path by the mount's MNT_READONLY flag so this is relatively
> new problem. It is actually fairly easy to protect remount read-write
> from racing writes using sb->s_readonly_remount flag so let's just do
> that instead of having to workaround these races in the filesystem code.
> 
> [1] https://lore.kernel.org/all/00000000000006a0df05f6667499@google.com/T/
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---

So looking at the ext4 code this can only happen when you clear
SB_RDONLY in ->reconfigure() too early (and the mount isn't
MNT_READONLY). Afaict, this was fixed in:

a44be64bbecb ("ext4: don't clear SB_RDONLY when remounting r/w until quota is re-enabled")

by clearing SB_RDONLY late, right before returning from ->reconfigure()
when everything's ready. So your change is not about fixing that bug in
[1] it's about making the vfs give the guarantee that an fs is free to
clear SB_RDONLY because any ro<->rw transitions are protected via
s_readonly_remount. Correct? It seems ok to me just making sure.
Theodore Ts'o June 15, 2023, 2:10 p.m. UTC | #2
On Thu, Jun 15, 2023 at 02:53:53PM +0200, Christian Brauner wrote:
> 
> So looking at the ext4 code this can only happen when you clear
> SB_RDONLY in ->reconfigure() too early (and the mount isn't
> MNT_READONLY). Afaict, this was fixed in:
> 
> a44be64bbecb ("ext4: don't clear SB_RDONLY when remounting r/w until quota is re-enabled")
> 
> by clearing SB_RDONLY late, right before returning from ->reconfigure()
> when everything's ready. So your change is not about fixing that bug in
> [1] it's about making the vfs give the guarantee that an fs is free to
> clear SB_RDONLY because any ro<->rw transitions are protected via
> s_readonly_remount. Correct? It seems ok to me just making sure.

Unfortunately we had to revert that commit because that broke
r/o->r/w writes when quota was enabled.  The problem is we need a way
of enabling file system writes for internal purposes (e.g., because
quota needs to set up quota inodes) but *not* allow userspace file
system writes to occur until we are fully done with the remount process.

See the discussion here:

	https://lore.kernel.org/all/20230608044056.GA1418535@mit.edu/

The problem with the current state of the tree is commit dea9d8f7643f
("ext4: only check dquot_initialize_needed() when debugging") has
caught real bugs in the past where the caller of
ext4_xattr_block_set() failed to call dquot_initialize(inode).  In
addition, shutting up the warning doesn't fix the problem that while
we hit this race where we have started remounting r/w, quota hasn't
been initialized, quota tracking will get silently dropped, leading to
the quota usage tracking no longer reflecting reality.

Jan's patch will fix this problem.

Cheers,

						- Ted
Jan Kara June 15, 2023, 2:48 p.m. UTC | #3
On Thu 15-06-23 10:10:40, Theodore Ts'o wrote:
> On Thu, Jun 15, 2023 at 02:53:53PM +0200, Christian Brauner wrote:
> > 
> > So looking at the ext4 code this can only happen when you clear
> > SB_RDONLY in ->reconfigure() too early (and the mount isn't
> > MNT_READONLY). Afaict, this was fixed in:
> > 
> > a44be64bbecb ("ext4: don't clear SB_RDONLY when remounting r/w until quota is re-enabled")
> > 
> > by clearing SB_RDONLY late, right before returning from ->reconfigure()
> > when everything's ready. So your change is not about fixing that bug in
> > [1] it's about making the vfs give the guarantee that an fs is free to
> > clear SB_RDONLY because any ro<->rw transitions are protected via
> > s_readonly_remount. Correct? It seems ok to me just making sure.
> 
> Unfortunately we had to revert that commit because that broke
> r/o->r/w writes when quota was enabled.  The problem is we need a way
> of enabling file system writes for internal purposes (e.g., because
> quota needs to set up quota inodes) but *not* allow userspace file
> system writes to occur until we are fully done with the remount process.
> 
> See the discussion here:
> 
> 	https://lore.kernel.org/all/20230608044056.GA1418535@mit.edu/
> 
> The problem with the current state of the tree is commit dea9d8f7643f
> ("ext4: only check dquot_initialize_needed() when debugging") has
> caught real bugs in the past where the caller of
> ext4_xattr_block_set() failed to call dquot_initialize(inode).  In
> addition, shutting up the warning doesn't fix the problem that while
> we hit this race where we have started remounting r/w, quota hasn't
> been initialized, quota tracking will get silently dropped, leading to
> the quota usage tracking no longer reflecting reality.
> 
> Jan's patch will fix this problem.

Yes, and to fully reveal the situation, we can indeed introduce
ext4-private superblock state "only internal writes allowed" which can be
used for quota setup. It just requires a bit of tidying and helper
adjustment (in fact I have such patches on my private branch). But it has
occured to me that generally it is easier if the filesystem's remount code
doesn't have to care about racing writers on rw<->ro transitions since we
have many filesystems and making sure all are getting this correct is ...
tedious.

								Honza
Christian Brauner June 15, 2023, 3:01 p.m. UTC | #4
On Thu, 15 Jun 2023 13:38:48 +0200, Jan Kara wrote:
> The reconfigure / remount code takes a lot of effort to protect
> filesystem's reconfiguration code from racing writes on remounting
> read-only. However during remounting read-only filesystem to read-write
> mode userspace writes can start immediately once we clear SB_RDONLY
> flag. This is inconvenient for example for ext4 because we need to do
> some writes to the filesystem (such as preparation of quota files)
> before we can take userspace writes so we are clearing SB_RDONLY flag
> before we are fully ready to accept userpace writes and syzbot has found
> a way to exploit this [1]. Also as far as I'm reading the code
> the filesystem remount code was protected from racing writes in the
> legacy mount path by the mount's MNT_READONLY flag so this is relatively
> new problem. It is actually fairly easy to protect remount read-write
> from racing writes using sb->s_readonly_remount flag so let's just do
> that instead of having to workaround these races in the filesystem code.
> 
> [...]

Applied to the vfs.misc branch of the vfs/vfs.git tree.
Patches in the vfs.misc branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.misc

[1/1] fs: Protect reconfiguration of sb read-write from racing writes
      https://git.kernel.org/vfs/vfs/c/496de0b41695
Dave Chinner June 15, 2023, 10:36 p.m. UTC | #5
On Thu, Jun 15, 2023 at 01:38:48PM +0200, Jan Kara wrote:
> The reconfigure / remount code takes a lot of effort to protect
> filesystem's reconfiguration code from racing writes on remounting
> read-only. However during remounting read-only filesystem to read-write
> mode userspace writes can start immediately once we clear SB_RDONLY
> flag. This is inconvenient for example for ext4 because we need to do
> some writes to the filesystem (such as preparation of quota files)
> before we can take userspace writes so we are clearing SB_RDONLY flag
> before we are fully ready to accept userpace writes and syzbot has found
> a way to exploit this [1]. Also as far as I'm reading the code
> the filesystem remount code was protected from racing writes in the
> legacy mount path by the mount's MNT_READONLY flag so this is relatively
> new problem. It is actually fairly easy to protect remount read-write
> from racing writes using sb->s_readonly_remount flag so let's just do
> that instead of having to workaround these races in the filesystem code.
> 
> [1] https://lore.kernel.org/all/00000000000006a0df05f6667499@google.com/T/
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/super.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 34afe411cf2b..6cd64961aa07 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -903,6 +903,7 @@ int reconfigure_super(struct fs_context *fc)
>  	struct super_block *sb = fc->root->d_sb;
>  	int retval;
>  	bool remount_ro = false;
> +	bool remount_rw = false;
>  	bool force = fc->sb_flags & SB_FORCE;
>  
>  	if (fc->sb_flags_mask & ~MS_RMT_MASK)
> @@ -920,7 +921,7 @@ int reconfigure_super(struct fs_context *fc)
>  		    bdev_read_only(sb->s_bdev))
>  			return -EACCES;
>  #endif
> -
> +		remount_rw = !(fc->sb_flags & SB_RDONLY) && sb_rdonly(sb);
>  		remount_ro = (fc->sb_flags & SB_RDONLY) && !sb_rdonly(sb);
>  	}
>  
> @@ -950,6 +951,14 @@ int reconfigure_super(struct fs_context *fc)
>  			if (retval)
>  				return retval;
>  		}
> +	} else if (remount_rw) {
> +		/*
> +		 * We set s_readonly_remount here to protect filesystem's
> +		 * reconfigure code from writes from userspace until
> +		 * reconfigure finishes.
> +		 */
> +		sb->s_readonly_remount = 1;
> +		smp_wmb();

What does the magic random memory barrier do? What is it ordering,
and what is it paired with?

This sort of thing is much better done with small helpers that
encapsulate the necessary memory barriers:

sb_set_readonly_remount()
sb_clear_readonly_remount()

alongside the helper that provides the read-side check and memory
barrier the write barrier is associated with.

I don't often ask for code to be cleaned up before a bug fix can be
added, but I think this is one of the important cases where it does
actually matter - we should never add undocumented memory barriers
in the code like this...

Cheers,

Dave.
Jan Kara June 16, 2023, 4:37 p.m. UTC | #6
On Fri 16-06-23 08:36:53, Dave Chinner wrote:
> On Thu, Jun 15, 2023 at 01:38:48PM +0200, Jan Kara wrote:
> > The reconfigure / remount code takes a lot of effort to protect
> > filesystem's reconfiguration code from racing writes on remounting
> > read-only. However during remounting read-only filesystem to read-write
> > mode userspace writes can start immediately once we clear SB_RDONLY
> > flag. This is inconvenient for example for ext4 because we need to do
> > some writes to the filesystem (such as preparation of quota files)
> > before we can take userspace writes so we are clearing SB_RDONLY flag
> > before we are fully ready to accept userpace writes and syzbot has found
> > a way to exploit this [1]. Also as far as I'm reading the code
> > the filesystem remount code was protected from racing writes in the
> > legacy mount path by the mount's MNT_READONLY flag so this is relatively
> > new problem. It is actually fairly easy to protect remount read-write
> > from racing writes using sb->s_readonly_remount flag so let's just do
> > that instead of having to workaround these races in the filesystem code.
...
> > +	} else if (remount_rw) {
> > +		/*
> > +		 * We set s_readonly_remount here to protect filesystem's
> > +		 * reconfigure code from writes from userspace until
> > +		 * reconfigure finishes.
> > +		 */
> > +		sb->s_readonly_remount = 1;
> > +		smp_wmb();
> 
> What does the magic random memory barrier do? What is it ordering,
> and what is it paired with?
> 
> This sort of thing is much better done with small helpers that
> encapsulate the necessary memory barriers:
> 
> sb_set_readonly_remount()
> sb_clear_readonly_remount()
> 
> alongside the helper that provides the read-side check and memory
> barrier the write barrier is associated with.

Fair remark. The new code including barrier just copies what happens a few
lines above for remount read-only case (and what happens ib several other
places throughout VFS code). I agree having helpers for this and actually
documenting how the barriers are matching there is a good cleanup.

> I don't often ask for code to be cleaned up before a bug fix can be
> added, but I think this is one of the important cases where it does
> actually matter - we should never add undocumented memory barriers
> in the code like this...

I've talked to Christian and we'll queue this as a separate cleanup. I'll
post it shortly.

								Honza
Dave Chinner June 16, 2023, 10:48 p.m. UTC | #7
On Fri, Jun 16, 2023 at 06:37:00PM +0200, Jan Kara wrote:
> On Fri 16-06-23 08:36:53, Dave Chinner wrote:
> > On Thu, Jun 15, 2023 at 01:38:48PM +0200, Jan Kara wrote:
> > > The reconfigure / remount code takes a lot of effort to protect
> > > filesystem's reconfiguration code from racing writes on remounting
> > > read-only. However during remounting read-only filesystem to read-write
> > > mode userspace writes can start immediately once we clear SB_RDONLY
> > > flag. This is inconvenient for example for ext4 because we need to do
> > > some writes to the filesystem (such as preparation of quota files)
> > > before we can take userspace writes so we are clearing SB_RDONLY flag
> > > before we are fully ready to accept userpace writes and syzbot has found
> > > a way to exploit this [1]. Also as far as I'm reading the code
> > > the filesystem remount code was protected from racing writes in the
> > > legacy mount path by the mount's MNT_READONLY flag so this is relatively
> > > new problem. It is actually fairly easy to protect remount read-write
> > > from racing writes using sb->s_readonly_remount flag so let's just do
> > > that instead of having to workaround these races in the filesystem code.
> ...
> > > +	} else if (remount_rw) {
> > > +		/*
> > > +		 * We set s_readonly_remount here to protect filesystem's
> > > +		 * reconfigure code from writes from userspace until
> > > +		 * reconfigure finishes.
> > > +		 */
> > > +		sb->s_readonly_remount = 1;
> > > +		smp_wmb();
> > 
> > What does the magic random memory barrier do? What is it ordering,
> > and what is it paired with?
> > 
> > This sort of thing is much better done with small helpers that
> > encapsulate the necessary memory barriers:
> > 
> > sb_set_readonly_remount()
> > sb_clear_readonly_remount()
> > 
> > alongside the helper that provides the read-side check and memory
> > barrier the write barrier is associated with.
> 
> Fair remark. The new code including barrier just copies what happens a few
> lines above for remount read-only case (and what happens ib several other
> places throughout VFS code).

Yes, I saw that you'd copied that magic memory barrier. Good for
consistency, but it doesn't answer any of the above questions
either, so....

> I agree having helpers for this and actually
> documenting how the barriers are matching there is a good cleanup.
> 
> > I don't often ask for code to be cleaned up before a bug fix can be
> > added, but I think this is one of the important cases where it does
> > actually matter - we should never add undocumented memory barriers
> > in the code like this...
> 
> I've talked to Christian and we'll queue this as a separate cleanup. I'll
> post it shortly.

Thanks!

-Dave.
diff mbox series

Patch

diff --git a/fs/super.c b/fs/super.c
index 34afe411cf2b..6cd64961aa07 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -903,6 +903,7 @@  int reconfigure_super(struct fs_context *fc)
 	struct super_block *sb = fc->root->d_sb;
 	int retval;
 	bool remount_ro = false;
+	bool remount_rw = false;
 	bool force = fc->sb_flags & SB_FORCE;
 
 	if (fc->sb_flags_mask & ~MS_RMT_MASK)
@@ -920,7 +921,7 @@  int reconfigure_super(struct fs_context *fc)
 		    bdev_read_only(sb->s_bdev))
 			return -EACCES;
 #endif
-
+		remount_rw = !(fc->sb_flags & SB_RDONLY) && sb_rdonly(sb);
 		remount_ro = (fc->sb_flags & SB_RDONLY) && !sb_rdonly(sb);
 	}
 
@@ -950,6 +951,14 @@  int reconfigure_super(struct fs_context *fc)
 			if (retval)
 				return retval;
 		}
+	} else if (remount_rw) {
+		/*
+		 * We set s_readonly_remount here to protect filesystem's
+		 * reconfigure code from writes from userspace until
+		 * reconfigure finishes.
+		 */
+		sb->s_readonly_remount = 1;
+		smp_wmb();
 	}
 
 	if (fc->ops->reconfigure) {