diff mbox series

[v2,4/4] overlay: Add rudimentary checking of writeback errseq on volatile remount

Message ID 20201127092058.15117-5-sargun@sargun.me (mailing list archive)
State New, archived
Headers show
Series Make overlayfs volatile mounts reusable | expand

Commit Message

Sargun Dhillon Nov. 27, 2020, 9:20 a.m. UTC
Volatile remounts validate the following at the moment:
 * Has the module been reloaded / the system rebooted
 * Has the workdir been remounted

This adds a new check for errors detected via the superblock's
errseq_t. At mount time, the errseq_t is snapshotted to disk,
and upon remount it's re-verified. This allows for kernel-level
detection of errors without forcing userspace to perform a
sync and allows for the hidden detection of writeback errors.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-unionfs@vger.kernel.org
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 fs/overlayfs/overlayfs.h | 1 +
 fs/overlayfs/readdir.c   | 6 ++++++
 fs/overlayfs/super.c     | 1 +
 3 files changed, 8 insertions(+)

Comments

Vivek Goyal Nov. 30, 2020, 6:43 p.m. UTC | #1
On Fri, Nov 27, 2020 at 01:20:58AM -0800, Sargun Dhillon wrote:
> Volatile remounts validate the following at the moment:
>  * Has the module been reloaded / the system rebooted
>  * Has the workdir been remounted
> 
> This adds a new check for errors detected via the superblock's
> errseq_t. At mount time, the errseq_t is snapshotted to disk,
> and upon remount it's re-verified. This allows for kernel-level
> detection of errors without forcing userspace to perform a
> sync and allows for the hidden detection of writeback errors.
> 
> Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-unionfs@vger.kernel.org
> Cc: Miklos Szeredi <miklos@szeredi.hu>
> Cc: Amir Goldstein <amir73il@gmail.com>
> Cc: Vivek Goyal <vgoyal@redhat.com>
> ---
>  fs/overlayfs/overlayfs.h | 1 +
>  fs/overlayfs/readdir.c   | 6 ++++++
>  fs/overlayfs/super.c     | 1 +
>  3 files changed, 8 insertions(+)
> 
> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> index de694ee99d7c..e8a711953b64 100644
> --- a/fs/overlayfs/overlayfs.h
> +++ b/fs/overlayfs/overlayfs.h
> @@ -85,6 +85,7 @@ struct ovl_volatile_info {
>  	 */
>  	uuid_t		ovl_boot_id;	/* Must stay first member */
>  	u64		s_instance_id;
> +	errseq_t	errseq;	/* Implemented as a u32 */
>  } __packed;
>  
>  /*
> diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> index 7b66fbb20261..5795b28bb4cf 100644
> --- a/fs/overlayfs/readdir.c
> +++ b/fs/overlayfs/readdir.c
> @@ -1117,6 +1117,12 @@ static int ovl_verify_volatile_info(struct ovl_fs *ofs,
>  		return -EINVAL;
>  	}
>  
> +	err = errseq_check(&volatiledir->d_sb->s_wb_err, info.errseq);
> +	if (err) {
> +		pr_debug("Workdir filesystem reports errors: %d\n", err);

It probably will make sense to make it pr_err() instead? Will be good
to know if kernel logic detected errors.


Thanks
Vivek
Vivek Goyal Nov. 30, 2020, 7:15 p.m. UTC | #2
On Fri, Nov 27, 2020 at 01:20:58AM -0800, Sargun Dhillon wrote:
> Volatile remounts validate the following at the moment:
>  * Has the module been reloaded / the system rebooted
>  * Has the workdir been remounted
> 
> This adds a new check for errors detected via the superblock's
> errseq_t. At mount time, the errseq_t is snapshotted to disk,
> and upon remount it's re-verified. This allows for kernel-level
> detection of errors without forcing userspace to perform a
> sync and allows for the hidden detection of writeback errors.
> 
> Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-unionfs@vger.kernel.org
> Cc: Miklos Szeredi <miklos@szeredi.hu>
> Cc: Amir Goldstein <amir73il@gmail.com>
> Cc: Vivek Goyal <vgoyal@redhat.com>
> ---
>  fs/overlayfs/overlayfs.h | 1 +
>  fs/overlayfs/readdir.c   | 6 ++++++
>  fs/overlayfs/super.c     | 1 +
>  3 files changed, 8 insertions(+)
> 
> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> index de694ee99d7c..e8a711953b64 100644
> --- a/fs/overlayfs/overlayfs.h
> +++ b/fs/overlayfs/overlayfs.h
> @@ -85,6 +85,7 @@ struct ovl_volatile_info {
>  	 */
>  	uuid_t		ovl_boot_id;	/* Must stay first member */
>  	u64		s_instance_id;
> +	errseq_t	errseq;	/* Implemented as a u32 */
>  } __packed;
>  
>  /*
> diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> index 7b66fbb20261..5795b28bb4cf 100644
> --- a/fs/overlayfs/readdir.c
> +++ b/fs/overlayfs/readdir.c
> @@ -1117,6 +1117,12 @@ static int ovl_verify_volatile_info(struct ovl_fs *ofs,
>  		return -EINVAL;
>  	}
>  
> +	err = errseq_check(&volatiledir->d_sb->s_wb_err, info.errseq);
> +	if (err) {
> +		pr_debug("Workdir filesystem reports errors: %d\n", err);
> +		return -EINVAL;
> +	}
> +
>  	return 1;
>  }
>  
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index a8ee3ba4ebbd..2e473f8c75dd 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -1248,6 +1248,7 @@ static int ovl_set_volatile_info(struct ovl_fs *ofs, struct dentry *volatiledir)
>  	int err;
>  	struct ovl_volatile_info info = {
>  		.s_instance_id = volatiledir->d_sb->s_instance_id,
> +		.errseq = errseq_sample(&volatiledir->d_sb->s_wb_err),

errse_sample() seems to return 0 if nobody has seen the error yet. That
means on remount we will fail. It is a false failure from our perspective
and we are not interested in knowing if somebody else has seen the
failure or not. 

Maybe we need a flag in errseq_sample() to get us current value
irrespective of the fact whether anybody has seen the error or not?

If we end up making this change, then we probably will have to somehow
mask ERRSEQ_SEEN bit in errseq_check() comparison. Because if we
sampled ->s_wb_err when nobody saw it and later by the remount time
say ERRSEQ_SEEN is set, we don't want remount to fail.

Thanks
Vivek
>  	};
>  
>  	uuid_copy(&info.ovl_boot_id, &ovl_boot_id);
> -- 
> 2.25.1
>
Vivek Goyal Nov. 30, 2020, 7:33 p.m. UTC | #3
On Fri, Nov 27, 2020 at 01:20:58AM -0800, Sargun Dhillon wrote:
> Volatile remounts validate the following at the moment:
>  * Has the module been reloaded / the system rebooted
>  * Has the workdir been remounted
> 
> This adds a new check for errors detected via the superblock's
> errseq_t. At mount time, the errseq_t is snapshotted to disk,
> and upon remount it's re-verified. This allows for kernel-level
> detection of errors without forcing userspace to perform a
> sync and allows for the hidden detection of writeback errors.
> 
> Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-unionfs@vger.kernel.org
> Cc: Miklos Szeredi <miklos@szeredi.hu>
> Cc: Amir Goldstein <amir73il@gmail.com>
> Cc: Vivek Goyal <vgoyal@redhat.com>
> ---
>  fs/overlayfs/overlayfs.h | 1 +
>  fs/overlayfs/readdir.c   | 6 ++++++
>  fs/overlayfs/super.c     | 1 +
>  3 files changed, 8 insertions(+)
> 
> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> index de694ee99d7c..e8a711953b64 100644
> --- a/fs/overlayfs/overlayfs.h
> +++ b/fs/overlayfs/overlayfs.h
> @@ -85,6 +85,7 @@ struct ovl_volatile_info {
>  	 */
>  	uuid_t		ovl_boot_id;	/* Must stay first member */
>  	u64		s_instance_id;
> +	errseq_t	errseq;	/* Implemented as a u32 */
>  } __packed;
>  
>  /*
> diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> index 7b66fbb20261..5795b28bb4cf 100644
> --- a/fs/overlayfs/readdir.c
> +++ b/fs/overlayfs/readdir.c
> @@ -1117,6 +1117,12 @@ static int ovl_verify_volatile_info(struct ovl_fs *ofs,
>  		return -EINVAL;
>  	}
>  
> +	err = errseq_check(&volatiledir->d_sb->s_wb_err, info.errseq);

Might be a stupid question. Will ask anyway.

But what protects against wrapping of counter. IOW, Say we stored info.errseq
value as A. It is possible that bunch of errors occurred and at remount
time ->s_wb_err is back to A and we pass the check. (Despite the fact lots
of errors have occurred since we sampled).

Thanks
Vivek
Sargun Dhillon Dec. 1, 2020, 11:56 a.m. UTC | #4
On Mon, Nov 30, 2020 at 02:33:42PM -0500, Vivek Goyal wrote:
> On Fri, Nov 27, 2020 at 01:20:58AM -0800, Sargun Dhillon wrote:
> > Volatile remounts validate the following at the moment:
> >  * Has the module been reloaded / the system rebooted
> >  * Has the workdir been remounted
> > 
> > This adds a new check for errors detected via the superblock's
> > errseq_t. At mount time, the errseq_t is snapshotted to disk,
> > and upon remount it's re-verified. This allows for kernel-level
> > detection of errors without forcing userspace to perform a
> > sync and allows for the hidden detection of writeback errors.
> > 
> > Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> > Cc: linux-fsdevel@vger.kernel.org
> > Cc: linux-unionfs@vger.kernel.org
> > Cc: Miklos Szeredi <miklos@szeredi.hu>
> > Cc: Amir Goldstein <amir73il@gmail.com>
> > Cc: Vivek Goyal <vgoyal@redhat.com>
> > ---
> >  fs/overlayfs/overlayfs.h | 1 +
> >  fs/overlayfs/readdir.c   | 6 ++++++
> >  fs/overlayfs/super.c     | 1 +
> >  3 files changed, 8 insertions(+)
> > 
> > diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> > index de694ee99d7c..e8a711953b64 100644
> > --- a/fs/overlayfs/overlayfs.h
> > +++ b/fs/overlayfs/overlayfs.h
> > @@ -85,6 +85,7 @@ struct ovl_volatile_info {
> >  	 */
> >  	uuid_t		ovl_boot_id;	/* Must stay first member */
> >  	u64		s_instance_id;
> > +	errseq_t	errseq;	/* Implemented as a u32 */
> >  } __packed;
> >  
> >  /*
> > diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> > index 7b66fbb20261..5795b28bb4cf 100644
> > --- a/fs/overlayfs/readdir.c
> > +++ b/fs/overlayfs/readdir.c
> > @@ -1117,6 +1117,12 @@ static int ovl_verify_volatile_info(struct ovl_fs *ofs,
> >  		return -EINVAL;
> >  	}
> >  
> > +	err = errseq_check(&volatiledir->d_sb->s_wb_err, info.errseq);
> 
> Might be a stupid question. Will ask anyway.
> 
> But what protects against wrapping of counter. IOW, Say we stored info.errseq
> value as A. It is possible that bunch of errors occurred and at remount
> time ->s_wb_err is back to A and we pass the check. (Despite the fact lots
> of errors have occurred since we sampled).
> 
> Thanks
> Vivek
> 

+Jeff Layton <jlayton@redhat.com>

Nothing. The current errseq API works like this today where if you have 2^20
(1048576) errors, and syncfs (or other calls that mark the errseq as seen), and
the error that occured 1048575 times ago was the same error as you just last
had, and the error on the upperdir has already been marked as seen, the error
will be swallowed up silently.

This exists throughout all of VFS. I think we're potentially making this more
likely by checkpointing to disk. The one aspect which is a little different about
the usecase in the patch is that it relies on this mechanism to determine if
an error has occured after the entire FS was constructed, so it's somewhat
more consequential than the current issue in VFS which will just bubble up
errors in a few files.

On my system syncfs takes about 2 milliseconds, so you have a chance to
experience this every ~30 minutes if the syscalls align in the right way. If
we expanded the errseq_t to u64, we would potentially get a collision
every 4503599627370496 calls, or assuming the 2 millisecond invariant
holds, every 285 years. Now, we probably don't want to make errseq_t into
a u64 because of performance reasons (not all systems have native u64
cmpxchg), and the extra memory it'd take up.

If we really want to avoid this case, I can think of one "simple" solution,
which is something like laying out errseq_t as something like a errseq_t_src
that's 64-bits, and all readers just look at the lower 32-bits. The longer
errseq_t would exist on super_blocks, but files would still get the shorter one.
To potentially avoid the performance penalty of atomic longs, we could also
do something like this:

typedef struct {
    atomic_t overflow;
    u32 errseq;
} errseq_t_big;

And in errseq_set, do:
/* Wraps */
if (new < old)
        atomic_inc(&eseq->overflow);

*shrug*
I don't think that the above scenario is likely though.
Jeff Layton Dec. 1, 2020, 12:45 p.m. UTC | #5
On Tue, 2020-12-01 at 11:56 +0000, Sargun Dhillon wrote:
> On Mon, Nov 30, 2020 at 02:33:42PM -0500, Vivek Goyal wrote:
> > On Fri, Nov 27, 2020 at 01:20:58AM -0800, Sargun Dhillon wrote:
> > > Volatile remounts validate the following at the moment:
> > >  * Has the module been reloaded / the system rebooted
> > >  * Has the workdir been remounted
> > > 
> > > This adds a new check for errors detected via the superblock's
> > > errseq_t. At mount time, the errseq_t is snapshotted to disk,
> > > and upon remount it's re-verified. This allows for kernel-level
> > > detection of errors without forcing userspace to perform a
> > > sync and allows for the hidden detection of writeback errors.
> > > 
> > > Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> > > Cc: linux-fsdevel@vger.kernel.org
> > > Cc: linux-unionfs@vger.kernel.org
> > > Cc: Miklos Szeredi <miklos@szeredi.hu>
> > > Cc: Amir Goldstein <amir73il@gmail.com>
> > > Cc: Vivek Goyal <vgoyal@redhat.com>
> > > ---
> > >  fs/overlayfs/overlayfs.h | 1 +
> > >  fs/overlayfs/readdir.c   | 6 ++++++
> > >  fs/overlayfs/super.c     | 1 +
> > >  3 files changed, 8 insertions(+)
> > > 
> > > diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> > > index de694ee99d7c..e8a711953b64 100644
> > > --- a/fs/overlayfs/overlayfs.h
> > > +++ b/fs/overlayfs/overlayfs.h
> > > @@ -85,6 +85,7 @@ struct ovl_volatile_info {
> > >  	 */
> > >  	uuid_t		ovl_boot_id;	/* Must stay first member */
> > >  	u64		s_instance_id;
> > > +	errseq_t	errseq;	/* Implemented as a u32 */
> > >  } __packed;
> > >  
> > > 
> > > 
> > > 
> > >  /*
> > > diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> > > index 7b66fbb20261..5795b28bb4cf 100644
> > > --- a/fs/overlayfs/readdir.c
> > > +++ b/fs/overlayfs/readdir.c
> > > @@ -1117,6 +1117,12 @@ static int ovl_verify_volatile_info(struct ovl_fs *ofs,
> > >  		return -EINVAL;
> > >  	}
> > >  
> > > 
> > > 
> > > 
> > > +	err = errseq_check(&volatiledir->d_sb->s_wb_err, info.errseq);
> > 
> > Might be a stupid question. Will ask anyway.
> > 
> > But what protects against wrapping of counter. IOW, Say we stored info.errseq
> > value as A. It is possible that bunch of errors occurred and at remount
> > time ->s_wb_err is back to A and we pass the check. (Despite the fact lots
> > of errors have occurred since we sampled).
> > 
> > Thanks
> > Vivek
> > 
> 
> +Jeff Layton <jlayton@redhat.com>
> 
> Nothing. The current errseq API works like this today where if you have 2^20
> (1048576) errors, and syncfs (or other calls that mark the errseq as seen), and
> the error that occured 1048575 times ago was the same error as you just last
> had, and the error on the upperdir has already been marked as seen, the error
> will be swallowed up silently.
 
> This exists throughout all of VFS. I think we're potentially making this more
> likely by checkpointing to disk. The one aspect which is a little different about
> the usecase in the patch is that it relies on this mechanism to determine if
> an error has occured after the entire FS was constructed, so it's somewhat
> more consequential than the current issue in VFS which will just bubble up
> errors in a few files.
> 
> On my system syncfs takes about 2 milliseconds, so you have a chance to
> experience this every ~30 minutes if the syscalls align in the right way. If
> we expanded the errseq_t to u64, we would potentially get a collision
> every 4503599627370496 calls, or assuming the 2 millisecond invariant
> holds, every 285 years. Now, we probably don't want to make errseq_t into
> a u64 because of performance reasons (not all systems have native u64
> cmpxchg), and the extra memory it'd take up.
> 
> If we really want to avoid this case, I can think of one "simple" solution,
> which is something like laying out errseq_t as something like a errseq_t_src
> that's 64-bits, and all readers just look at the lower 32-bits. The longer
> errseq_t would exist on super_blocks, but files would still get the shorter one.
> To potentially avoid the performance penalty of atomic longs, we could also
> do something like this:
> 
> typedef struct {
>     atomic_t overflow;
>     u32 errseq;
> } errseq_t_big;
> 
> And in errseq_set, do:
> /* Wraps */
> if (new < old)
>         atomic_inc(&eseq->overflow);
> 
> *shrug*
> I don't think that the above scenario is likely though.
> 


The counter is only bumped if the seen flag is set, so you'd need to
record 2^20 errors _and_ fetch them that many times, and just get
unlucky once and hit the exact counter value that you had previously
sampled.

If that's your situation, then you probably have much bigger problems
than the counter wrapping.
Amir Goldstein Dec. 5, 2020, 9:13 a.m. UTC | #6
On Mon, Nov 30, 2020 at 9:15 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Fri, Nov 27, 2020 at 01:20:58AM -0800, Sargun Dhillon wrote:
> > Volatile remounts validate the following at the moment:
> >  * Has the module been reloaded / the system rebooted
> >  * Has the workdir been remounted
> >
> > This adds a new check for errors detected via the superblock's
> > errseq_t. At mount time, the errseq_t is snapshotted to disk,
> > and upon remount it's re-verified. This allows for kernel-level
> > detection of errors without forcing userspace to perform a
> > sync and allows for the hidden detection of writeback errors.
> >
> > Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> > Cc: linux-fsdevel@vger.kernel.org
> > Cc: linux-unionfs@vger.kernel.org
> > Cc: Miklos Szeredi <miklos@szeredi.hu>
> > Cc: Amir Goldstein <amir73il@gmail.com>
> > Cc: Vivek Goyal <vgoyal@redhat.com>
> > ---
> >  fs/overlayfs/overlayfs.h | 1 +
> >  fs/overlayfs/readdir.c   | 6 ++++++
> >  fs/overlayfs/super.c     | 1 +
> >  3 files changed, 8 insertions(+)
> >
> > diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> > index de694ee99d7c..e8a711953b64 100644
> > --- a/fs/overlayfs/overlayfs.h
> > +++ b/fs/overlayfs/overlayfs.h
> > @@ -85,6 +85,7 @@ struct ovl_volatile_info {
> >        */
> >       uuid_t          ovl_boot_id;    /* Must stay first member */
> >       u64             s_instance_id;
> > +     errseq_t        errseq; /* Implemented as a u32 */
> >  } __packed;
> >
> >  /*
> > diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> > index 7b66fbb20261..5795b28bb4cf 100644
> > --- a/fs/overlayfs/readdir.c
> > +++ b/fs/overlayfs/readdir.c
> > @@ -1117,6 +1117,12 @@ static int ovl_verify_volatile_info(struct ovl_fs *ofs,
> >               return -EINVAL;
> >       }
> >
> > +     err = errseq_check(&volatiledir->d_sb->s_wb_err, info.errseq);
> > +     if (err) {
> > +             pr_debug("Workdir filesystem reports errors: %d\n", err);
> > +             return -EINVAL;
> > +     }
> > +
> >       return 1;
> >  }
> >
> > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > index a8ee3ba4ebbd..2e473f8c75dd 100644
> > --- a/fs/overlayfs/super.c
> > +++ b/fs/overlayfs/super.c
> > @@ -1248,6 +1248,7 @@ static int ovl_set_volatile_info(struct ovl_fs *ofs, struct dentry *volatiledir)
> >       int err;
> >       struct ovl_volatile_info info = {
> >               .s_instance_id = volatiledir->d_sb->s_instance_id,
> > +             .errseq = errseq_sample(&volatiledir->d_sb->s_wb_err),
>
> errse_sample() seems to return 0 if nobody has seen the error yet. That
> means on remount we will fail. It is a false failure from our perspective
> and we are not interested in knowing if somebody else has seen the
> failure or not.
>
> Maybe we need a flag in errseq_sample() to get us current value
> irrespective of the fact whether anybody has seen the error or not?
>
> If we end up making this change, then we probably will have to somehow
> mask ERRSEQ_SEEN bit in errseq_check() comparison. Because if we
> sampled ->s_wb_err when nobody saw it and later by the remount time
> say ERRSEQ_SEEN is set, we don't want remount to fail.
>

Hopping back to this review, looks like for volatile mount we need
something like (in this order):
1. check if re-use and get sampled errseq from volatiledir xattr
2. otherwise errseq_sample() upper_sb and store in volatiledir xattr
3. errseq_check() since stored or sampled errseq (0 for fresh mount
with unseen error)
4. fail volatile mount if errseq_check() failed
5. errseq_check() since stored errseq on fsync()/syncfs()

For fresh volatile mount, syncfs can fix the temporary mount error.
For re-used volatile mount, the mount error is permanent.

Did I miss anything?
Is the mount safe for both seen and unseen error cases? no error case?
Are we safe if a syncfs on upper_sb sneaks in between 2 and 3?

Thanks,
Amir.
Jeff Layton Dec. 5, 2020, 1:51 p.m. UTC | #7
On Sat, 2020-12-05 at 11:13 +0200, Amir Goldstein wrote:
> On Mon, Nov 30, 2020 at 9:15 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > 
> > On Fri, Nov 27, 2020 at 01:20:58AM -0800, Sargun Dhillon wrote:
> > > Volatile remounts validate the following at the moment:
> > >  * Has the module been reloaded / the system rebooted
> > >  * Has the workdir been remounted
> > > 
> > > This adds a new check for errors detected via the superblock's
> > > errseq_t. At mount time, the errseq_t is snapshotted to disk,
> > > and upon remount it's re-verified. This allows for kernel-level
> > > detection of errors without forcing userspace to perform a
> > > sync and allows for the hidden detection of writeback errors.
> > > 
> > > Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> > > Cc: linux-fsdevel@vger.kernel.org
> > > Cc: linux-unionfs@vger.kernel.org
> > > Cc: Miklos Szeredi <miklos@szeredi.hu>
> > > Cc: Amir Goldstein <amir73il@gmail.com>
> > > Cc: Vivek Goyal <vgoyal@redhat.com>
> > > ---
> > >  fs/overlayfs/overlayfs.h | 1 +
> > >  fs/overlayfs/readdir.c   | 6 ++++++
> > >  fs/overlayfs/super.c     | 1 +
> > >  3 files changed, 8 insertions(+)
> > > 
> > > diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> > > index de694ee99d7c..e8a711953b64 100644
> > > --- a/fs/overlayfs/overlayfs.h
> > > +++ b/fs/overlayfs/overlayfs.h
> > > @@ -85,6 +85,7 @@ struct ovl_volatile_info {
> > >        */
> > >       uuid_t          ovl_boot_id;    /* Must stay first member */
> > >       u64             s_instance_id;
> > > +     errseq_t        errseq; /* Implemented as a u32 */
> > >  } __packed;
> > > 
> > >  /*
> > > diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> > > index 7b66fbb20261..5795b28bb4cf 100644
> > > --- a/fs/overlayfs/readdir.c
> > > +++ b/fs/overlayfs/readdir.c
> > > @@ -1117,6 +1117,12 @@ static int ovl_verify_volatile_info(struct ovl_fs *ofs,
> > >               return -EINVAL;
> > >       }
> > > 
> > > +     err = errseq_check(&volatiledir->d_sb->s_wb_err, info.errseq);
> > > +     if (err) {
> > > +             pr_debug("Workdir filesystem reports errors: %d\n", err);
> > > +             return -EINVAL;
> > > +     }
> > > +
> > >       return 1;
> > >  }
> > > 
> > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > > index a8ee3ba4ebbd..2e473f8c75dd 100644
> > > --- a/fs/overlayfs/super.c
> > > +++ b/fs/overlayfs/super.c
> > > @@ -1248,6 +1248,7 @@ static int ovl_set_volatile_info(struct ovl_fs *ofs, struct dentry *volatiledir)
> > >       int err;
> > >       struct ovl_volatile_info info = {
> > >               .s_instance_id = volatiledir->d_sb->s_instance_id,
> > > +             .errseq = errseq_sample(&volatiledir->d_sb->s_wb_err),
> > 
> > errse_sample() seems to return 0 if nobody has seen the error yet. That
> > means on remount we will fail. It is a false failure from our perspective
> > and we are not interested in knowing if somebody else has seen the
> > failure or not.
> > 
> > Maybe we need a flag in errseq_sample() to get us current value
> > irrespective of the fact whether anybody has seen the error or not?
> > 
> > If we end up making this change, then we probably will have to somehow
> > mask ERRSEQ_SEEN bit in errseq_check() comparison. Because if we
> > sampled ->s_wb_err when nobody saw it and later by the remount time
> > say ERRSEQ_SEEN is set, we don't want remount to fail.
> > 
> 
> Hopping back to this review, looks like for volatile mount we need
> something like (in this order):
> 1. check if re-use and get sampled errseq from volatiledir xattr
> 2. otherwise errseq_sample() upper_sb and store in volatiledir xattr

I'm not sure I follow. Why does this need to go into an xattr?

errseq_t is never persisted on stable storage. It's an entirely
in-memory thing.


> 3. errseq_check() since stored or sampled errseq (0 for fresh mount
> with unseen error)
> 4. fail volatile mount if errseq_check() failed
> 5. errseq_check() since stored errseq on fsync()/syncfs()
> 

I think this is simpler than that. You just need a new errseq_t helper
that only conditionally samples if the thing is 0 or if the error has
already been seen. Something like this (hopefully with a better name):

bool errseq_sample_no_unseen(errseq_t *eseq, errseq_t *sample)         
{                                                                      
        errseq_t old = READ_ONCE(*eseq);                               
                                                                       
        if (old && !(old & ERRSEQ_SEEN))                               
                return false;                                          
        *sample = old;                                                 
        return true;                                                   
}                                                                      
       
If that returns false, fail the mount. If it's true, then save off the
sample and proceed.

> For fresh volatile mount, syncfs can fix the temporary mount error.
> For re-used volatile mount, the mount error is permanent.
> 
> Did I miss anything?
> Is the mount safe for both seen and unseen error cases? no error case?
> Are we safe if a syncfs on upper_sb sneaks in between 2 and 3?
> 
> Thanks,
> Amir.
>
Amir Goldstein Dec. 5, 2020, 2:51 p.m. UTC | #8
On Sat, Dec 5, 2020 at 3:51 PM Jeff Layton <jlayton@redhat.com> wrote:
>
> On Sat, 2020-12-05 at 11:13 +0200, Amir Goldstein wrote:
> > On Mon, Nov 30, 2020 at 9:15 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > On Fri, Nov 27, 2020 at 01:20:58AM -0800, Sargun Dhillon wrote:
> > > > Volatile remounts validate the following at the moment:
> > > >  * Has the module been reloaded / the system rebooted
> > > >  * Has the workdir been remounted
> > > >
> > > > This adds a new check for errors detected via the superblock's
> > > > errseq_t. At mount time, the errseq_t is snapshotted to disk,
> > > > and upon remount it's re-verified. This allows for kernel-level
> > > > detection of errors without forcing userspace to perform a
> > > > sync and allows for the hidden detection of writeback errors.
> > > >
> > > > Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> > > > Cc: linux-fsdevel@vger.kernel.org
> > > > Cc: linux-unionfs@vger.kernel.org
> > > > Cc: Miklos Szeredi <miklos@szeredi.hu>
> > > > Cc: Amir Goldstein <amir73il@gmail.com>
> > > > Cc: Vivek Goyal <vgoyal@redhat.com>
> > > > ---
> > > >  fs/overlayfs/overlayfs.h | 1 +
> > > >  fs/overlayfs/readdir.c   | 6 ++++++
> > > >  fs/overlayfs/super.c     | 1 +
> > > >  3 files changed, 8 insertions(+)
> > > >
> > > > diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> > > > index de694ee99d7c..e8a711953b64 100644
> > > > --- a/fs/overlayfs/overlayfs.h
> > > > +++ b/fs/overlayfs/overlayfs.h
> > > > @@ -85,6 +85,7 @@ struct ovl_volatile_info {
> > > >        */
> > > >       uuid_t          ovl_boot_id;    /* Must stay first member */
> > > >       u64             s_instance_id;
> > > > +     errseq_t        errseq; /* Implemented as a u32 */
> > > >  } __packed;
> > > >
> > > >  /*
> > > > diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> > > > index 7b66fbb20261..5795b28bb4cf 100644
> > > > --- a/fs/overlayfs/readdir.c
> > > > +++ b/fs/overlayfs/readdir.c
> > > > @@ -1117,6 +1117,12 @@ static int ovl_verify_volatile_info(struct ovl_fs *ofs,
> > > >               return -EINVAL;
> > > >       }
> > > >
> > > > +     err = errseq_check(&volatiledir->d_sb->s_wb_err, info.errseq);
> > > > +     if (err) {
> > > > +             pr_debug("Workdir filesystem reports errors: %d\n", err);
> > > > +             return -EINVAL;
> > > > +     }
> > > > +
> > > >       return 1;
> > > >  }
> > > >
> > > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > > > index a8ee3ba4ebbd..2e473f8c75dd 100644
> > > > --- a/fs/overlayfs/super.c
> > > > +++ b/fs/overlayfs/super.c
> > > > @@ -1248,6 +1248,7 @@ static int ovl_set_volatile_info(struct ovl_fs *ofs, struct dentry *volatiledir)
> > > >       int err;
> > > >       struct ovl_volatile_info info = {
> > > >               .s_instance_id = volatiledir->d_sb->s_instance_id,
> > > > +             .errseq = errseq_sample(&volatiledir->d_sb->s_wb_err),
> > >
> > > errse_sample() seems to return 0 if nobody has seen the error yet. That
> > > means on remount we will fail. It is a false failure from our perspective
> > > and we are not interested in knowing if somebody else has seen the
> > > failure or not.
> > >
> > > Maybe we need a flag in errseq_sample() to get us current value
> > > irrespective of the fact whether anybody has seen the error or not?
> > >
> > > If we end up making this change, then we probably will have to somehow
> > > mask ERRSEQ_SEEN bit in errseq_check() comparison. Because if we
> > > sampled ->s_wb_err when nobody saw it and later by the remount time
> > > say ERRSEQ_SEEN is set, we don't want remount to fail.
> > >
> >
> > Hopping back to this review, looks like for volatile mount we need
> > something like (in this order):
> > 1. check if re-use and get sampled errseq from volatiledir xattr
> > 2. otherwise errseq_sample() upper_sb and store in volatiledir xattr
>
> I'm not sure I follow. Why does this need to go into an xattr?
>
> errseq_t is never persisted on stable storage. It's an entirely
> in-memory thing.
>

We know, but that was the purpose of this patch series [1].
Reusing volatile overlay layers is not allowed in v5.9.
Sargun is trying to allow that by verifying that since the first volatile
mount there was:
* no reboot
* no overlay module reload
* no underlying fs re-mount
* [and with this patch] no writeback error on upper fs

[1] https://lore.kernel.org/linux-unionfs/20201127092058.15117-1-sargun@sargun.me/T/#mb2f1c770a47898d8781e62a46fcc7526535e5dde

>
> > 3. errseq_check() since stored or sampled errseq (0 for fresh mount
> > with unseen error)
> > 4. fail volatile mount if errseq_check() failed
> > 5. errseq_check() since stored errseq on fsync()/syncfs()
> >
>
> I think this is simpler than that. You just need a new errseq_t helper
> that only conditionally samples if the thing is 0 or if the error has
> already been seen. Something like this (hopefully with a better name):
>
> bool errseq_sample_no_unseen(errseq_t *eseq, errseq_t *sample)
> {
>         errseq_t old = READ_ONCE(*eseq);
>
>         if (old && !(old & ERRSEQ_SEEN))
>                 return false;
>         *sample = old;
>         return true;
> }
>
> If that returns false, fail the mount. If it's true, then save off the
> sample and proceed.
>

Yes, but that wasn't the purpose of the original patch set,
it was just something else that needed fixing that we found during
review of this patch for which Sargun posted a separate patch.

Thank,
Amir.
diff mbox series

Patch

diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index de694ee99d7c..e8a711953b64 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -85,6 +85,7 @@  struct ovl_volatile_info {
 	 */
 	uuid_t		ovl_boot_id;	/* Must stay first member */
 	u64		s_instance_id;
+	errseq_t	errseq;	/* Implemented as a u32 */
 } __packed;
 
 /*
diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
index 7b66fbb20261..5795b28bb4cf 100644
--- a/fs/overlayfs/readdir.c
+++ b/fs/overlayfs/readdir.c
@@ -1117,6 +1117,12 @@  static int ovl_verify_volatile_info(struct ovl_fs *ofs,
 		return -EINVAL;
 	}
 
+	err = errseq_check(&volatiledir->d_sb->s_wb_err, info.errseq);
+	if (err) {
+		pr_debug("Workdir filesystem reports errors: %d\n", err);
+		return -EINVAL;
+	}
+
 	return 1;
 }
 
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index a8ee3ba4ebbd..2e473f8c75dd 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -1248,6 +1248,7 @@  static int ovl_set_volatile_info(struct ovl_fs *ofs, struct dentry *volatiledir)
 	int err;
 	struct ovl_volatile_info info = {
 		.s_instance_id = volatiledir->d_sb->s_instance_id,
+		.errseq = errseq_sample(&volatiledir->d_sb->s_wb_err),
 	};
 
 	uuid_copy(&info.ovl_boot_id, &ovl_boot_id);