diff mbox

xfs: allow zero-length symlinks (and directories), sort of

Message ID 20170304000529.GB5073@birch.djwong.org (mailing list archive)
State New, archived
Headers show

Commit Message

Darrick J. Wong March 4, 2017, 12:05 a.m. UTC
In commit ef388e2054 ("don't allow di_size with high bit set") and
commit 3c6f46eacd87 ("sanity check directory inode di_size") we
erroneously fail the inode verification checks for symlinks or
directories with di_size == 0.

This is proven incorrect by generic/388 because the unlink process uses
multiple unconnected transactions to truncate the remote symlink /
directory and to unlink and free the inode.  If the fs goes down after
the truncate commit but before the unlink happens, we are left with a
zero-length inode.  Worse yet, if the unlink /does/ commit, log recovery
will now break after the first transaction because we've zeroed di_size.

In order to prevent the crashes and ASSERTs that the old patches
covered, re-allow the zero-length inodes and change the functions that
interface with the vfs to return the appropriate error codes to
userspace.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_inode_buf.c |    4 ----
 fs/xfs/xfs_dir2_readdir.c     |    6 ++----
 fs/xfs/xfs_iops.c             |    9 ++++++++-
 3 files changed, 10 insertions(+), 9 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Brian Foster March 6, 2017, 1:51 p.m. UTC | #1
On Fri, Mar 03, 2017 at 04:05:29PM -0800, Darrick J. Wong wrote:
> In commit ef388e2054 ("don't allow di_size with high bit set") and
> commit 3c6f46eacd87 ("sanity check directory inode di_size") we
> erroneously fail the inode verification checks for symlinks or
> directories with di_size == 0.
> 
> This is proven incorrect by generic/388 because the unlink process uses
> multiple unconnected transactions to truncate the remote symlink /
> directory and to unlink and free the inode.  If the fs goes down after
> the truncate commit but before the unlink happens, we are left with a
> zero-length inode.  Worse yet, if the unlink /does/ commit, log recovery
> will now break after the first transaction because we've zeroed di_size.
> 
> In order to prevent the crashes and ASSERTs that the old patches
> covered, re-allow the zero-length inodes and change the functions that
> interface with the vfs to return the appropriate error codes to
> userspace.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Interesting, thanks. I suppose this means we should drop the repair side
fix..? I'm a bit leery on repair modifying a filesystem that is
technically in a "valid" state, unless there's no other option to
recover from that state.

>  fs/xfs/libxfs/xfs_inode_buf.c |    4 ----
>  fs/xfs/xfs_dir2_readdir.c     |    6 ++----
>  fs/xfs/xfs_iops.c             |    9 ++++++++-
>  3 files changed, 10 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> index 3752bac..6e62999 100644
> --- a/fs/xfs/libxfs/xfs_inode_buf.c
> +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> @@ -402,10 +402,6 @@ xfs_dinode_verify(
>  	if (mode && xfs_mode_to_ftype(mode) == XFS_DIR3_FT_UNKNOWN)
>  		return false;
>  
> -	/* No zero-length symlinks/dirs. */
> -	if ((S_ISLNK(mode) || S_ISDIR(mode)) && dip->di_size == 0)
> -		return false;
> -

What's the reasoning for dropping the S_ISDIR() logic here? Is the same
scenario possible on a directory?

>  	/* only version 3 or greater inodes are extensively verified here */
>  	if (dip->di_version < 3)
>  		return true;
> diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
> index 0b3b636..be1367a 100644
> --- a/fs/xfs/xfs_dir2_readdir.c
> +++ b/fs/xfs/xfs_dir2_readdir.c
> @@ -74,10 +74,8 @@ xfs_dir2_sf_getdents(
>  	/*
>  	 * Give up if the directory is way too short.
>  	 */
> -	if (dp->i_d.di_size < offsetof(xfs_dir2_sf_hdr_t, parent)) {
> -		ASSERT(XFS_FORCED_SHUTDOWN(dp->i_mount));
> -		return -EIO;
> -	}
> +	if (dp->i_d.di_size < offsetof(xfs_dir2_sf_hdr_t, parent))
> +		return -EFSCORRUPTED;

This seems reasonable if it is truly an unexpected state...

>  
>  	ASSERT(dp->i_df.if_bytes == dp->i_d.di_size);
>  	ASSERT(dp->i_df.if_u1.if_data != NULL);
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 22c1615..b7b6807 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -457,6 +457,9 @@ xfs_vn_get_link(
>  	char			*link;
>  	int			error = -ENOMEM;
>  
> +	if (i_size_read(inode) == 0)
> +		return ERR_PTR(-EFSCORRUPTED);
> +

... but this one to me seems to qualify as an expected state, based on
what you've described in the commit log description. Can we actually now
"recover" from this situation at runtime (i.e., can we unlink the
inode)? If so, it seems a little ominous to return a corruption errror
as opposed to EINVAL or EIO or something (though I suppose either one
may confuse a user enough into running xfs_repair anyways). Thoughts?

Brian

>  	if (!dentry)
>  		return ERR_PTR(-ECHILD);
>  
> @@ -483,8 +486,12 @@ xfs_vn_get_link_inline(
>  	struct inode		*inode,
>  	struct delayed_call	*done)
>  {
> +	char			*p;
>  	ASSERT(XFS_I(inode)->i_df.if_flags & XFS_IFINLINE);
> -	return XFS_I(inode)->i_df.if_u1.if_data;
> +	p = XFS_I(inode)->i_df.if_u1.if_data;
> +	if (p)
> +		return p;
> +	return ERR_PTR(-EFSCORRUPTED);
>  }
>  
>  STATIC int
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Darrick J. Wong March 6, 2017, 5:22 p.m. UTC | #2
On Mon, Mar 06, 2017 at 08:51:44AM -0500, Brian Foster wrote:
> On Fri, Mar 03, 2017 at 04:05:29PM -0800, Darrick J. Wong wrote:
> > In commit ef388e2054 ("don't allow di_size with high bit set") and
> > commit 3c6f46eacd87 ("sanity check directory inode di_size") we
> > erroneously fail the inode verification checks for symlinks or
> > directories with di_size == 0.
> > 
> > This is proven incorrect by generic/388 because the unlink process uses
> > multiple unconnected transactions to truncate the remote symlink /
> > directory and to unlink and free the inode.  If the fs goes down after
> > the truncate commit but before the unlink happens, we are left with a
> > zero-length inode.  Worse yet, if the unlink /does/ commit, log recovery
> > will now break after the first transaction because we've zeroed di_size.

Ugh, I misspoke slightly.  Truncate and *ifree* are separate
transactions that can lose each other in the mist after a crash.

So that separates this discussion into two classes of problem -- the
first is the case of "truncate, crash, forget to ifree", which at least
isn't a user-visible problem because the inode has been unlinked from
the directory tree at that point.  Log recovery breaks if iget ->
dinode_verify stumbles over the zero-length inode.

The second case (which is what the original dinode_verify patches were
focused on) is the case where someone intentionally corrupts a
filesystem to have zero-byte metafiles.  For that case we need to return
appropriate error codes without crashing.

> > In order to prevent the crashes and ASSERTs that the old patches
> > covered, re-allow the zero-length inodes and change the functions that
> > interface with the vfs to return the appropriate error codes to
> > userspace.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> 
> Interesting, thanks. I suppose this means we should drop the repair side
> fix..? I'm a bit leery on repair modifying a filesystem that is
> technically in a "valid" state, unless there's no other option to
> recover from that state.

I'm not sure if zero byte directories & symlinks are valid states.  They
cannot be created via the regular means (symlink("", "/sym/link")
returns -EINVAL), but such things can be encountered internally because
of the truncate-crash-forgettoifree sequence.  At a bare minimum the
rest of this patch needs to fix the parts of XFS that blow up on
zero-byte special files if the verifier cannot reject them outright.

In the longer run, this situation ought to be fixed by some sort of
ifree log redo item to chain the two transactions together, thereby
avoiding this mess.

(OTOH I think the defer_ops piece could use more battle testing before
we start hooking it up to existing functionality that mostly doesn't
blow up.)

> >  fs/xfs/libxfs/xfs_inode_buf.c |    4 ----
> >  fs/xfs/xfs_dir2_readdir.c     |    6 ++----
> >  fs/xfs/xfs_iops.c             |    9 ++++++++-
> >  3 files changed, 10 insertions(+), 9 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > index 3752bac..6e62999 100644
> > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > @@ -402,10 +402,6 @@ xfs_dinode_verify(
> >  	if (mode && xfs_mode_to_ftype(mode) == XFS_DIR3_FT_UNKNOWN)
> >  		return false;
> >  
> > -	/* No zero-length symlinks/dirs. */
> > -	if ((S_ISLNK(mode) || S_ISDIR(mode)) && dip->di_size == 0)
> > -		return false;
> > -
> 
> What's the reasoning for dropping the S_ISDIR() logic here? Is the same
> scenario possible on a directory?

Yes.

> >  	/* only version 3 or greater inodes are extensively verified here */
> >  	if (dip->di_version < 3)
> >  		return true;
> > diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
> > index 0b3b636..be1367a 100644
> > --- a/fs/xfs/xfs_dir2_readdir.c
> > +++ b/fs/xfs/xfs_dir2_readdir.c
> > @@ -74,10 +74,8 @@ xfs_dir2_sf_getdents(
> >  	/*
> >  	 * Give up if the directory is way too short.
> >  	 */
> > -	if (dp->i_d.di_size < offsetof(xfs_dir2_sf_hdr_t, parent)) {
> > -		ASSERT(XFS_FORCED_SHUTDOWN(dp->i_mount));
> > -		return -EIO;
> > -	}
> > +	if (dp->i_d.di_size < offsetof(xfs_dir2_sf_hdr_t, parent))
> > +		return -EFSCORRUPTED;
> 
> This seems reasonable if it is truly an unexpected state...

I think the error code change is reasonable.  In any case the ASSERT
needs to go because we cant't crash the kernel on account of bad
disk metadata.

> >  
> >  	ASSERT(dp->i_df.if_bytes == dp->i_d.di_size);
> >  	ASSERT(dp->i_df.if_u1.if_data != NULL);
> > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > index 22c1615..b7b6807 100644
> > --- a/fs/xfs/xfs_iops.c
> > +++ b/fs/xfs/xfs_iops.c
> > @@ -457,6 +457,9 @@ xfs_vn_get_link(
> >  	char			*link;
> >  	int			error = -ENOMEM;
> >  
> > +	if (i_size_read(inode) == 0)
> > +		return ERR_PTR(-EFSCORRUPTED);
> > +
> 
> ... but this one to me seems to qualify as an expected state, based on
> what you've described in the commit log description. Can we actually now
> "recover" from this situation at runtime (i.e., can we unlink the
> inode)? If so, it seems a little ominous to return a corruption errror
> as opposed to EINVAL or EIO or something (though I suppose either one
> may confuse a user enough into running xfs_repair anyways). Thoughts?

Right now _vn_get_link() on a zero-length rmt symlink just returns "".
Zero-length metadata is (clearly) not totally invalid, since it can be
reproduced (temporarily) via regular log interactions... but from this
user's perspective, the fs shouldn't ever present a metadata object that
can't be created by userspace.

An empty buffer (apparently) gets interpreted as a symlink to the
containing dir, so at least the mechanism doesn't go totally haywire.
That might not be harmful, but I still opine that we should nudge the
user in the direction of xfs_repair with -EFSCORRUPTED.

I don't consider it a good idea to have ifree be a potential
side-effect of readlink, though.  Scraping broken things off the
filesystem (I think) should remain the domain of the scrub/repair tools.

In the longer run I think it would be useful (certainly for xfs_scrub)
to be able to unlink bad files, though I've not implemented that.  In
order to totally tear down a file one would have to make sure that there
are zero dirents pointing to the inode, which requires either parent
pointers or a full directory scan.  Or xfs_repair. :)

--D

> 
> Brian
> 
> >  	if (!dentry)
> >  		return ERR_PTR(-ECHILD);
> >  
> > @@ -483,8 +486,12 @@ xfs_vn_get_link_inline(
> >  	struct inode		*inode,
> >  	struct delayed_call	*done)
> >  {
> > +	char			*p;
> >  	ASSERT(XFS_I(inode)->i_df.if_flags & XFS_IFINLINE);
> > -	return XFS_I(inode)->i_df.if_u1.if_data;
> > +	p = XFS_I(inode)->i_df.if_u1.if_data;
> > +	if (p)
> > +		return p;
> > +	return ERR_PTR(-EFSCORRUPTED);
> >  }
> >  
> >  STATIC int
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Brian Foster March 6, 2017, 6:18 p.m. UTC | #3
On Mon, Mar 06, 2017 at 09:22:05AM -0800, Darrick J. Wong wrote:
> On Mon, Mar 06, 2017 at 08:51:44AM -0500, Brian Foster wrote:
> > On Fri, Mar 03, 2017 at 04:05:29PM -0800, Darrick J. Wong wrote:
> > > In commit ef388e2054 ("don't allow di_size with high bit set") and
> > > commit 3c6f46eacd87 ("sanity check directory inode di_size") we
> > > erroneously fail the inode verification checks for symlinks or
> > > directories with di_size == 0.
> > > 
> > > This is proven incorrect by generic/388 because the unlink process uses
> > > multiple unconnected transactions to truncate the remote symlink /
> > > directory and to unlink and free the inode.  If the fs goes down after
> > > the truncate commit but before the unlink happens, we are left with a
> > > zero-length inode.  Worse yet, if the unlink /does/ commit, log recovery
> > > will now break after the first transaction because we've zeroed di_size.
> 
> Ugh, I misspoke slightly.  Truncate and *ifree* are separate
> transactions that can lose each other in the mist after a crash.
> 
> So that separates this discussion into two classes of problem -- the
> first is the case of "truncate, crash, forget to ifree", which at least
> isn't a user-visible problem because the inode has been unlinked from
> the directory tree at that point.  Log recovery breaks if iget ->
> dinode_verify stumbles over the zero-length inode.
> 

Ok, that change seems reasonable.

> The second case (which is what the original dinode_verify patches were
> focused on) is the case where someone intentionally corrupts a
> filesystem to have zero-byte metafiles.  For that case we need to return
> appropriate error codes without crashing.
> 

Yep, makes sense too...

Though now I'm confused because the commit log implies that we can end
up with zero-sized symlinks in the directory tree if we crash at the
right point. Is that not the case?

> > > In order to prevent the crashes and ASSERTs that the old patches
> > > covered, re-allow the zero-length inodes and change the functions that
> > > interface with the vfs to return the appropriate error codes to
> > > userspace.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > 
> > Interesting, thanks. I suppose this means we should drop the repair side
> > fix..? I'm a bit leery on repair modifying a filesystem that is
> > technically in a "valid" state, unless there's no other option to
> > recover from that state.
> 
> I'm not sure if zero byte directories & symlinks are valid states.  They
> cannot be created via the regular means (symlink("", "/sym/link")
> returns -EINVAL), but such things can be encountered internally because
> of the truncate-crash-forgettoifree sequence.  At a bare minimum the
> rest of this patch needs to fix the parts of XFS that blow up on
> zero-byte special files if the verifier cannot reject them outright.
> 

By valid I just mean that it's a user-visible state the fs can end up in
through normal and generally correct operation, which I mean to include
system crash and log recovery, but exclude things like external metadata
modification. The latter is clearly corruption, but the former may not
be.

Essentially what I'm saying is that if it is not possible to have a
zero-sized symlink in a directory except for a bug or external
modification, or there is otherwise no path to recover outside of
xfs_repair (i.e., the current verify failure behavior) then EFSCORRUPTED
makes sense to me. If it is possible (even if it only occurs as an
artifact of log recovery) and everything works except for reading the
actual link returns an error (i.e., it can be renamed, unlinked), then
I'm not sure EFSCORRUPTED is the right error.

As noted above, however, I'm not quite following which of those
scenarios applies here...

Brian

> In the longer run, this situation ought to be fixed by some sort of
> ifree log redo item to chain the two transactions together, thereby
> avoiding this mess.
> 
> (OTOH I think the defer_ops piece could use more battle testing before
> we start hooking it up to existing functionality that mostly doesn't
> blow up.)
> 
> > >  fs/xfs/libxfs/xfs_inode_buf.c |    4 ----
> > >  fs/xfs/xfs_dir2_readdir.c     |    6 ++----
> > >  fs/xfs/xfs_iops.c             |    9 ++++++++-
> > >  3 files changed, 10 insertions(+), 9 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > > index 3752bac..6e62999 100644
> > > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > > @@ -402,10 +402,6 @@ xfs_dinode_verify(
> > >  	if (mode && xfs_mode_to_ftype(mode) == XFS_DIR3_FT_UNKNOWN)
> > >  		return false;
> > >  
> > > -	/* No zero-length symlinks/dirs. */
> > > -	if ((S_ISLNK(mode) || S_ISDIR(mode)) && dip->di_size == 0)
> > > -		return false;
> > > -
> > 
> > What's the reasoning for dropping the S_ISDIR() logic here? Is the same
> > scenario possible on a directory?
> 
> Yes.
> 
> > >  	/* only version 3 or greater inodes are extensively verified here */
> > >  	if (dip->di_version < 3)
> > >  		return true;
> > > diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
> > > index 0b3b636..be1367a 100644
> > > --- a/fs/xfs/xfs_dir2_readdir.c
> > > +++ b/fs/xfs/xfs_dir2_readdir.c
> > > @@ -74,10 +74,8 @@ xfs_dir2_sf_getdents(
> > >  	/*
> > >  	 * Give up if the directory is way too short.
> > >  	 */
> > > -	if (dp->i_d.di_size < offsetof(xfs_dir2_sf_hdr_t, parent)) {
> > > -		ASSERT(XFS_FORCED_SHUTDOWN(dp->i_mount));
> > > -		return -EIO;
> > > -	}
> > > +	if (dp->i_d.di_size < offsetof(xfs_dir2_sf_hdr_t, parent))
> > > +		return -EFSCORRUPTED;
> > 
> > This seems reasonable if it is truly an unexpected state...
> 
> I think the error code change is reasonable.  In any case the ASSERT
> needs to go because we cant't crash the kernel on account of bad
> disk metadata.
> 
> > >  
> > >  	ASSERT(dp->i_df.if_bytes == dp->i_d.di_size);
> > >  	ASSERT(dp->i_df.if_u1.if_data != NULL);
> > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > > index 22c1615..b7b6807 100644
> > > --- a/fs/xfs/xfs_iops.c
> > > +++ b/fs/xfs/xfs_iops.c
> > > @@ -457,6 +457,9 @@ xfs_vn_get_link(
> > >  	char			*link;
> > >  	int			error = -ENOMEM;
> > >  
> > > +	if (i_size_read(inode) == 0)
> > > +		return ERR_PTR(-EFSCORRUPTED);
> > > +
> > 
> > ... but this one to me seems to qualify as an expected state, based on
> > what you've described in the commit log description. Can we actually now
> > "recover" from this situation at runtime (i.e., can we unlink the
> > inode)? If so, it seems a little ominous to return a corruption errror
> > as opposed to EINVAL or EIO or something (though I suppose either one
> > may confuse a user enough into running xfs_repair anyways). Thoughts?
> 
> Right now _vn_get_link() on a zero-length rmt symlink just returns "".
> Zero-length metadata is (clearly) not totally invalid, since it can be
> reproduced (temporarily) via regular log interactions... but from this
> user's perspective, the fs shouldn't ever present a metadata object that
> can't be created by userspace.
> 
> An empty buffer (apparently) gets interpreted as a symlink to the
> containing dir, so at least the mechanism doesn't go totally haywire.
> That might not be harmful, but I still opine that we should nudge the
> user in the direction of xfs_repair with -EFSCORRUPTED.
> 
> I don't consider it a good idea to have ifree be a potential
> side-effect of readlink, though.  Scraping broken things off the
> filesystem (I think) should remain the domain of the scrub/repair tools.
> 
> In the longer run I think it would be useful (certainly for xfs_scrub)
> to be able to unlink bad files, though I've not implemented that.  In
> order to totally tear down a file one would have to make sure that there
> are zero dirents pointing to the inode, which requires either parent
> pointers or a full directory scan.  Or xfs_repair. :)
> 
> --D
> 
> > 
> > Brian
> > 
> > >  	if (!dentry)
> > >  		return ERR_PTR(-ECHILD);
> > >  
> > > @@ -483,8 +486,12 @@ xfs_vn_get_link_inline(
> > >  	struct inode		*inode,
> > >  	struct delayed_call	*done)
> > >  {
> > > +	char			*p;
> > >  	ASSERT(XFS_I(inode)->i_df.if_flags & XFS_IFINLINE);
> > > -	return XFS_I(inode)->i_df.if_u1.if_data;
> > > +	p = XFS_I(inode)->i_df.if_u1.if_data;
> > > +	if (p)
> > > +		return p;
> > > +	return ERR_PTR(-EFSCORRUPTED);
> > >  }
> > >  
> > >  STATIC int
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner March 6, 2017, 9:47 p.m. UTC | #4
On Mon, Mar 06, 2017 at 01:18:20PM -0500, Brian Foster wrote:
> On Mon, Mar 06, 2017 at 09:22:05AM -0800, Darrick J. Wong wrote:
> > On Mon, Mar 06, 2017 at 08:51:44AM -0500, Brian Foster wrote:
> > > On Fri, Mar 03, 2017 at 04:05:29PM -0800, Darrick J. Wong wrote:
> > > > In commit ef388e2054 ("don't allow di_size with high bit set") and
> > > > commit 3c6f46eacd87 ("sanity check directory inode di_size") we
> > > > erroneously fail the inode verification checks for symlinks or
> > > > directories with di_size == 0.
> > > > 
> > > > This is proven incorrect by generic/388 because the unlink process uses
> > > > multiple unconnected transactions to truncate the remote symlink /
> > > > directory and to unlink and free the inode.  If the fs goes down after
> > > > the truncate commit but before the unlink happens, we are left with a
> > > > zero-length inode.  Worse yet, if the unlink /does/ commit, log recovery
> > > > will now break after the first transaction because we've zeroed di_size.
> > 
> > Ugh, I misspoke slightly.  Truncate and *ifree* are separate
> > transactions that can lose each other in the mist after a crash.
> > 
> > So that separates this discussion into two classes of problem -- the
> > first is the case of "truncate, crash, forget to ifree", which at least
> > isn't a user-visible problem because the inode has been unlinked from
> > the directory tree at that point.  Log recovery breaks if iget ->
> > dinode_verify stumbles over the zero-length inode.
> > 
> 
> Ok, that change seems reasonable.

Not to me, though - it's indicative of an atomicity problem, and not
something we should work around by considering the intermediate
"corrupt" state on disk valid. Remember that one of the main reasons
for checks like these in the verifiers is that they tell us where
our transactional change atomicity is broken. Gutting the verifier
because it's indicated we have a change atomicity problem is not the
solution.

i.e. I think we should be looking at changing xfs_inactive() to use
the deferred ops state machine for the multiple {truncate-data,
truncata-attr, free inode} transactions. Then we don't need to care
about temporary incomplete state on disk - if we get the first
transaction completed, log recovery will have an intent one disk for
the next transaction, and we can complete the inode freeing
process rather than leaving a dangling chad.

Making an exception in a verifier for a special log recovery state
is fine - we have to handle all sorts of interesting intermediate
states in log recovery - but as a general rule if it's an invalid
state the verifier should always be catching it....

[snip]

> Essentially what I'm saying is that if it is not possible to have a
> zero-sized symlink in a directory except for a bug or external
> modification, or there is otherwise no path to recover outside of
> xfs_repair (i.e., the current verify failure behavior) then EFSCORRUPTED
> makes sense to me.

*nod*

> If it is possible (even if it only occurs as an
> artifact of log recovery) and everything works except for reading the
> actual link returns an error (i.e., it can be renamed, unlinked), then
> I'm not sure EFSCORRUPTED is the right error.

We shouldn't ever get that far on a zero-length symlink/directory
because the inode is clearly in an invalid state....

-Dave.
Darrick J. Wong March 6, 2017, 11:16 p.m. UTC | #5
On Tue, Mar 07, 2017 at 08:47:55AM +1100, Dave Chinner wrote:
> On Mon, Mar 06, 2017 at 01:18:20PM -0500, Brian Foster wrote:
> > On Mon, Mar 06, 2017 at 09:22:05AM -0800, Darrick J. Wong wrote:
> > > On Mon, Mar 06, 2017 at 08:51:44AM -0500, Brian Foster wrote:
> > > > On Fri, Mar 03, 2017 at 04:05:29PM -0800, Darrick J. Wong wrote:
> > > > > In commit ef388e2054 ("don't allow di_size with high bit set") and
> > > > > commit 3c6f46eacd87 ("sanity check directory inode di_size") we
> > > > > erroneously fail the inode verification checks for symlinks or
> > > > > directories with di_size == 0.
> > > > > 
> > > > > This is proven incorrect by generic/388 because the unlink process uses
> > > > > multiple unconnected transactions to truncate the remote symlink /
> > > > > directory and to unlink and free the inode.  If the fs goes down after
> > > > > the truncate commit but before the unlink happens, we are left with a
> > > > > zero-length inode.  Worse yet, if the unlink /does/ commit, log recovery
> > > > > will now break after the first transaction because we've zeroed di_size.

Brian had a question in an earlier reply about whether or not g/388
actually produced a filesystem with a zero-length symlink.  Looking back
at the metadumps I kept around from Friday I don't see any indication of
that happening.  However, the atomicity problem still exists.

> > > 
> > > Ugh, I misspoke slightly.  Truncate and *ifree* are separate
> > > transactions that can lose each other in the mist after a crash.
> > > 
> > > So that separates this discussion into two classes of problem -- the
> > > first is the case of "truncate, crash, forget to ifree", which at least
> > > isn't a user-visible problem because the inode has been unlinked from
> > > the directory tree at that point.  Log recovery breaks if iget ->
> > > dinode_verify stumbles over the zero-length inode.
> > > 
> > 
> > Ok, that change seems reasonable.
> 
> Not to me, though - it's indicative of an atomicity problem, and not
> something we should work around by considering the intermediate
> "corrupt" state on disk valid. Remember that one of the main reasons
> for checks like these in the verifiers is that they tell us where
> our transactional change atomicity is broken. Gutting the verifier
> because it's indicated we have a change atomicity problem is not the
> solution.
> 
> i.e. I think we should be looking at changing xfs_inactive() to use
> the deferred ops state machine for the multiple {truncate-data,
> truncata-attr, free inode} transactions. Then we don't need to care
> about temporary incomplete state on disk - if we get the first
> transaction completed, log recovery will have an intent one disk for
> the next transaction, and we can complete the inode freeing
> process rather than leaving a dangling chad.

I argued for this earlier as a longer-term fix, but vger has not been
transmitting emails reliably for a week now, so I don't know if you saw
that part.

There are various parts of XFS that have atomicity problems just like
this one.  Auditing the codebase to find all these places and add the
necessary log functionality could take a while, though it ought to be
done now that we have the ability to chain transactions.

(Also I think there's a minor bug with the defer_ops where if we call
defer_trans_roll, the transaction commits successfully, but then
something else fails, we end up double-freeing the intent items.)

> Making an exception in a verifier for a special log recovery state
> is fine - we have to handle all sorts of interesting intermediate
> states in log recovery - but as a general rule if it's an invalid
> state the verifier should always be catching it....

Ah, see, there's something I didn't know -- that verifiers can be
contextually sensitive.  In this case we could probably get away with a
new iget flag so that log replay could say "Hey, I don't care if the
size is zero".

> [snip]
> 
> > Essentially what I'm saying is that if it is not possible to have a
> > zero-sized symlink in a directory except for a bug or external
> > modification, or there is otherwise no path to recover outside of
> > xfs_repair (i.e., the current verify failure behavior) then EFSCORRUPTED
> > makes sense to me.
> 
> *nod*
> 
> > If it is possible (even if it only occurs as an
> > artifact of log recovery) and everything works except for reading the
> > actual link returns an error (i.e., it can be renamed, unlinked), then
> > I'm not sure EFSCORRUPTED is the right error.
> 
> We shouldn't ever get that far on a zero-length symlink/directory
> because the inode is clearly in an invalid state....

<nod> Ok, I'll leave the verifier as it is for now.

--D

> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 3752bac..6e62999 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -402,10 +402,6 @@  xfs_dinode_verify(
 	if (mode && xfs_mode_to_ftype(mode) == XFS_DIR3_FT_UNKNOWN)
 		return false;
 
-	/* No zero-length symlinks/dirs. */
-	if ((S_ISLNK(mode) || S_ISDIR(mode)) && dip->di_size == 0)
-		return false;
-
 	/* only version 3 or greater inodes are extensively verified here */
 	if (dip->di_version < 3)
 		return true;
diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
index 0b3b636..be1367a 100644
--- a/fs/xfs/xfs_dir2_readdir.c
+++ b/fs/xfs/xfs_dir2_readdir.c
@@ -74,10 +74,8 @@  xfs_dir2_sf_getdents(
 	/*
 	 * Give up if the directory is way too short.
 	 */
-	if (dp->i_d.di_size < offsetof(xfs_dir2_sf_hdr_t, parent)) {
-		ASSERT(XFS_FORCED_SHUTDOWN(dp->i_mount));
-		return -EIO;
-	}
+	if (dp->i_d.di_size < offsetof(xfs_dir2_sf_hdr_t, parent))
+		return -EFSCORRUPTED;
 
 	ASSERT(dp->i_df.if_bytes == dp->i_d.di_size);
 	ASSERT(dp->i_df.if_u1.if_data != NULL);
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 22c1615..b7b6807 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -457,6 +457,9 @@  xfs_vn_get_link(
 	char			*link;
 	int			error = -ENOMEM;
 
+	if (i_size_read(inode) == 0)
+		return ERR_PTR(-EFSCORRUPTED);
+
 	if (!dentry)
 		return ERR_PTR(-ECHILD);
 
@@ -483,8 +486,12 @@  xfs_vn_get_link_inline(
 	struct inode		*inode,
 	struct delayed_call	*done)
 {
+	char			*p;
 	ASSERT(XFS_I(inode)->i_df.if_flags & XFS_IFINLINE);
-	return XFS_I(inode)->i_df.if_u1.if_data;
+	p = XFS_I(inode)->i_df.if_u1.if_data;
+	if (p)
+		return p;
+	return ERR_PTR(-EFSCORRUPTED);
 }
 
 STATIC int