xfs: preserve i_mode if __xfs_set_acl() fails

Message ID	20170816002511.GM21024@dastard (mailing list archive)
State	Accepted
Headers	show Return-Path: <linux-xfs-owner@kernel.org> Date: Wed, 16 Aug 2017 10:25:11 +1000 From: Dave Chinner <david@fromorbit.com> To: Ernesto =?iso-8859-1?Q?A=2E_Fern=E1ndez?= <ernesto.mnd.fernandez@gmail.com> Cc: linux-xfs@vger.kernel.org, "Darrick J. Wong" <darrick.wong@oracle.com>, Zorro Lang <zlang@redhat.com> Subject: Re: [PATCH] xfs: preserve i_mode if __xfs_set_acl() fails Message-ID: <20170816002511.GM21024@dastard> References: <20170814081806.GA3924@debian.home> <20170815020018.GF21024@dastard> <20170815061857.GA2803@debian.home> <20170815084430.GK21024@dastard> <20170815192938.GA1247@debian.home> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20170815192938.GA1247@debian.home> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk

Dave Chinner Aug. 16, 2017, 12:25 a.m. UTC

On Tue, Aug 15, 2017 at 04:29:39PM -0300, Ernesto A. Fernández wrote:
> On Tue, Aug 15, 2017 at 06:44:30PM +1000, Dave Chinner wrote:
> > On Tue, Aug 15, 2017 at 03:18:58AM -0300, Ernesto A. Fernández wrote:
> > But I have to ask - why do we even need to modify the mode first?
> > Why not change the ACL first, then modify the mode+timestamp? If
> > setting the ACL fails, then we don't have anything to undo and all
> > is good....
> 
> I intended for the mode to be committed as part of the same transaction
> that sets or removes the ACL. In my mind making the changes later, as part
> of a separate transaction, would have meant that a crash between the two
> left the filesystem in an inconsistent state,

No, it will not leave the fileystem in an inconsistent state. It
will leave the inode permissions in an /unwanted/ state, but there
is no filesystem metadata inconsistency. 

> with a new ACL but without
> the corresponding mode bits.

Yup, but that's no different from right now, where a crash after
setting the mode bits could be applied but the ACL update is
missing.

Either way is even rarely than "crash at the wrong time" implies,
because we've also got to have a complete journal checkpoint occur
between the two operations and then crash between the checkpoint and
the second operation. Yes, it's possible, but in the entire time
I've been working on XFS (almost 15 years now) I can count on one
hand the number of times such a problem has occurred and been
reported...

So, it's a rare problem, and one that will get solved in time
because there's much more to solving the problem than just this
case. e.g. I worte this in 2008:

http://xfs.org/index.php/Improving_Metadata_Performance_By_Reducing_Journal_Overhead#Atomic_Multi-Transaction_Operations

And we've really only got the infrastructure we could use to
implement this in a widespread manner with the rmap/reflink
functionality. But implementing it will require a large amount of
re-organisation of filesystem operations, so it's something that
will take time to roll out.

With that in mind, here's waht I suggested above: set the mode after
the xattr. I haven't tested it - can you check it solves the problem
case you are testing?

Cheers,

Dave.

Ernesto A. Fernández Aug. 16, 2017, 7:16 a.m. UTC | #1

On Wed, Aug 16, 2017 at 10:25:11AM +1000, Dave Chinner wrote:
> On Tue, Aug 15, 2017 at 04:29:39PM -0300, Ernesto A. Fernández wrote:
> > On Tue, Aug 15, 2017 at 06:44:30PM +1000, Dave Chinner wrote:
> > > On Tue, Aug 15, 2017 at 03:18:58AM -0300, Ernesto A. Fernández wrote:
> > > But I have to ask - why do we even need to modify the mode first?
> > > Why not change the ACL first, then modify the mode+timestamp? If
> > > setting the ACL fails, then we don't have anything to undo and all
> > > is good....
> > 
> > I intended for the mode to be committed as part of the same transaction
> > that sets or removes the ACL. In my mind making the changes later, as part
> > of a separate transaction, would have meant that a crash between the two
> > left the filesystem in an inconsistent state,
> 
> No, it will not leave the fileystem in an inconsistent state. It
> will leave the inode permissions in an /unwanted/ state, but there
> is no filesystem metadata inconsistency. 
> 
> > with a new ACL but without
> > the corresponding mode bits.
> 
> Yup, but that's no different from right now, where a crash after
> setting the mode bits could be applied but the ACL update is
> missing.
> 
> Either way is even rarely than "crash at the wrong time" implies,
> because we've also got to have a complete journal checkpoint occur
> between the two operations and then crash between the checkpoint and
> the second operation. Yes, it's possible, but in the entire time
> I've been working on XFS (almost 15 years now) I can count on one
> hand the number of times such a problem has occurred and been
> reported...
> 
> So, it's a rare problem, and one that will get solved in time
> because there's much more to solving the problem than just this
> case. e.g. I worte this in 2008:
> 
> http://xfs.org/index.php/Improving_Metadata_Performance_By_Reducing_Journal_Overhead#Atomic_Multi-Transaction_Operations
> 
> And we've really only got the infrastructure we could use to
> implement this in a widespread manner with the rmap/reflink
> functionality. But implementing it will require a large amount of
> re-organisation of filesystem operations, so it's something that
> will take time to roll out.

Alright, thanks for the explanation.
 
> With that in mind, here's waht I suggested above: set the mode after
> the xattr. I haven't tested it - can you check it solves the problem
> case you are testing?

It does. Of course the test still fails, as I said before, now claiming that
the filesystem is inconsistent. But that's a separate issue.

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> 
> xfs: don't change inode mode if ACL update fails
> 
> XXX: untested
> 
> ---
>  fs/xfs/xfs_acl.c | 22 ++++++++++++++++------
>  1 file changed, 16 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/xfs_acl.c b/fs/xfs/xfs_acl.c
> index 7034e17535de..3354140de07e 100644
> --- a/fs/xfs/xfs_acl.c
> +++ b/fs/xfs/xfs_acl.c
> @@ -247,6 +247,8 @@ xfs_set_mode(struct inode *inode, umode_t mode)
>  int
>  xfs_set_acl(struct inode *inode, struct posix_acl *acl, int type)
>  {
> +	umode_t mode;
> +	bool set_mode = false;
>  	int error = 0;
>  
>  	if (!acl)
> @@ -257,16 +259,24 @@ xfs_set_acl(struct inode *inode, struct posix_acl *acl, int type)
>  		return error;
>  
>  	if (type == ACL_TYPE_ACCESS) {
> -		umode_t mode;
> -
>  		error = posix_acl_update_mode(inode, &mode, &acl);
>  		if (error)
>  			return error;
> -		error = xfs_set_mode(inode, mode);
> -		if (error)
> -			return error;
> +		set_mode = true;
>  	}
>  
>   set_acl:
> -	return __xfs_set_acl(inode, acl, type);
> +	error =  __xfs_set_acl(inode, acl, type);
> +	if (error)
> +		return error;
> +
> +	/*
> +	 * We set the mode after successfully updating the ACL xattr because the
> +	 * xattr update can fail at ENOSPC and we don't want to change the mode
> +	 * if the ACL update hasn't been applied.
> +	 */
> +	if (set_mode)
> +		error = xfs_set_mode(inode, mode);
> +
> +	return error;
>  }
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Dave Chinner Aug. 16, 2017, 1:35 p.m. UTC | #2

On Wed, Aug 16, 2017 at 04:16:53AM -0300, Ernesto A. Fernández wrote:
> On Wed, Aug 16, 2017 at 10:25:11AM +1000, Dave Chinner wrote:
> > On Tue, Aug 15, 2017 at 04:29:39PM -0300, Ernesto A. Fernández wrote:
> > > On Tue, Aug 15, 2017 at 06:44:30PM +1000, Dave Chinner wrote:
> > > > On Tue, Aug 15, 2017 at 03:18:58AM -0300, Ernesto A. Fernández wrote:
> > > > But I have to ask - why do we even need to modify the mode first?
> > > > Why not change the ACL first, then modify the mode+timestamp? If
> > > > setting the ACL fails, then we don't have anything to undo and all
> > > > is good....
> > > 
> > > I intended for the mode to be committed as part of the same transaction
> > > that sets or removes the ACL. In my mind making the changes later, as part
> > > of a separate transaction, would have meant that a crash between the two
> > > left the filesystem in an inconsistent state,
> > 
> > No, it will not leave the fileystem in an inconsistent state. It
> > will leave the inode permissions in an /unwanted/ state, but there
> > is no filesystem metadata inconsistency. 
> > 
> > > with a new ACL but without
> > > the corresponding mode bits.
> > 
> > Yup, but that's no different from right now, where a crash after
> > setting the mode bits could be applied but the ACL update is
> > missing.
> > 
> > Either way is even rarely than "crash at the wrong time" implies,
> > because we've also got to have a complete journal checkpoint occur
> > between the two operations and then crash between the checkpoint and
> > the second operation. Yes, it's possible, but in the entire time
> > I've been working on XFS (almost 15 years now) I can count on one
> > hand the number of times such a problem has occurred and been
> > reported...
> > 
> > So, it's a rare problem, and one that will get solved in time
> > because there's much more to solving the problem than just this
> > case. e.g. I worte this in 2008:
> > 
> > http://xfs.org/index.php/Improving_Metadata_Performance_By_Reducing_Journal_Overhead#Atomic_Multi-Transaction_Operations
> > 
> > And we've really only got the infrastructure we could use to
> > implement this in a widespread manner with the rmap/reflink
> > functionality. But implementing it will require a large amount of
> > re-organisation of filesystem operations, so it's something that
> > will take time to roll out.
> 
> Alright, thanks for the explanation.
>  
> > With that in mind, here's waht I suggested above: set the mode after
> > the xattr. I haven't tested it - can you check it solves the problem
> > case you are testing?
> 
> It does. Of course the test still fails, as I said before, now claiming that
> the filesystem is inconsistent. But that's a separate issue.

It shouldn't be - what's the error that is being reported?

Cheers,

Dave.

Ernesto A. Fernández Aug. 16, 2017, 7:31 p.m. UTC | #3

On Wed, Aug 16, 2017 at 11:35:02PM +1000, Dave Chinner wrote:
> On Wed, Aug 16, 2017 at 04:16:53AM -0300, Ernesto A. Fernández wrote:
> > On Wed, Aug 16, 2017 at 10:25:11AM +1000, Dave Chinner wrote:
> > > With that in mind, here's waht I suggested above: set the mode after
> > > the xattr. I haven't tested it - can you check it solves the problem
> > > case you are testing?
> > 
> > It does. Of course the test still fails, as I said before, now claiming that
> > the filesystem is inconsistent. But that's a separate issue.
> 
> It shouldn't be - what's the error that is being reported?

xfs_repair, phase 3:
bad magic number febe in block 512 (14452) for directory inode 131
problem with attribute contents in inode 131
would clear attr fork
bad nblocks 11900 for inode 131, would reset to 11274
bad anextents 1 for inode 131, would reset to 0


This is not a result of your patch, it was already happening before.
It seems to be caused by setting too many extended attributes to a
file.

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Dave Chinner Aug. 17, 2017, midnight UTC | #4

On Wed, Aug 16, 2017 at 04:31:01PM -0300, Ernesto A. Fernández wrote:
> On Wed, Aug 16, 2017 at 11:35:02PM +1000, Dave Chinner wrote:
> > On Wed, Aug 16, 2017 at 04:16:53AM -0300, Ernesto A. Fernández wrote:
> > > On Wed, Aug 16, 2017 at 10:25:11AM +1000, Dave Chinner wrote:
> > > > With that in mind, here's waht I suggested above: set the mode after
> > > > the xattr. I haven't tested it - can you check it solves the problem
> > > > case you are testing?
> > > 
> > > It does. Of course the test still fails, as I said before, now claiming that
> > > the filesystem is inconsistent. But that's a separate issue.
> > 
> > It shouldn't be - what's the error that is being reported?
> 
> xfs_repair, phase 3:
> bad magic number febe in block 512 (14452) for directory inode 131

Hmmmm - that's a magic number for a non-crc DA node. Kinda implies
that it ENOSPC'd in the middle of a tree split.

Can you test on a CRC enabled filesystem and see if there are any
other errors that are detected either at runtime or by repair? If it
does fail, can youpost the full output of xfs_repair?

> It seems to be caused by setting too many extended attributes to a
> file.

Yeah, > 500 blocks in the attribute tree implies at least several
megabytes of xattrs on a file, which is something that pretty much
never happens on production workloads. It's not likely to be a
frequently exercised path...

Cheers,

Dave.

Ernesto A. Fernández Aug. 17, 2017, 5:34 a.m. UTC | #5

On Thu, Aug 17, 2017 at 10:00:42AM +1000, Dave Chinner wrote:
> On Wed, Aug 16, 2017 at 04:31:01PM -0300, Ernesto A. Fernández wrote:
> > xfs_repair, phase 3:
> > bad magic number febe in block 512 (14452) for directory inode 131
> 
> Hmmmm - that's a magic number for a non-crc DA node. Kinda implies
> that it ENOSPC'd in the middle of a tree split.
> 
> Can you test on a CRC enabled filesystem and see if there are any
> other errors that are detected either at runtime or by repair?

No other errors, no.

> If it does fail, can youpost the full output of xfs_repair?

Of course. I don't think it will be of much help though:


Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
bad magic number 3ebe in block 506 (14446) for directory inode 67
problem with attribute contents in inode 67
would clear attr fork
bad nblocks 11894 for inode 67, would reset to 11268
bad anextents 1 for inode 67, would reset to 0
        - agno = 1
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.


> > It seems to be caused by setting too many extended attributes to a
> > file.
> 
> Yeah, > 500 blocks in the attribute tree implies at least several
> megabytes of xattrs on a file, which is something that pretty much
> never happens on production workloads. It's not likely to be a
> frequently exercised path...
>
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Perhaps this can be of some use:

In my test I'm setting as many xattrs with 1k values as possible. If I
change that to 64k values this bug is no longer seen.

The cutoff seems to be when the name+value of the xattr are above 3072+1
bytes. This does not depend on the amount of free space of the
filesystem, so this is probably not just about the number of xattrs as I
first thought.

If I change the block size to 2048 the cutoff is now 1537 bytes, that
is, still 3/4 the block size plus one. For bsize=1024, the cutoff is
when the value alone is longer than 756 bytes.

Also you don't actually need to max out the number of xattrs to see
this bug. For example: if the block size is 1024, the value size is 757
and the name size is 18 bytes, you can trigger this bug by setting 2010
xattrs.

Let me know if you need more info.

Ernest
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Dave Chinner Aug. 17, 2017, 6:34 a.m. UTC | #6

On Thu, Aug 17, 2017 at 02:34:40AM -0300, Ernesto A. Fernández wrote:
> On Thu, Aug 17, 2017 at 10:00:42AM +1000, Dave Chinner wrote:
> > On Wed, Aug 16, 2017 at 04:31:01PM -0300, Ernesto A. Fernández wrote:
> > > xfs_repair, phase 3:
> > > bad magic number febe in block 512 (14452) for directory inode 131
> > 
> > Hmmmm - that's a magic number for a non-crc DA node. Kinda implies
> > that it ENOSPC'd in the middle of a tree split.
> > 
> > Can you test on a CRC enabled filesystem and see if there are any
> > other errors that are detected either at runtime or by repair?
> 
> No other errors, no.
> 
> > If it does fail, can youpost the full output of xfs_repair?
> 
> Of course. I don't think it will be of much help though:
> 
> 
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
>         - scan filesystem freespace and inode maps...
>         - found root inode chunk
> Phase 3 - for each AG...
>         - scan (but don't clear) agi unlinked lists...
>         - process known inodes and perform inode discovery...
>         - agno = 0
> bad magic number 3ebe in block 506 (14446) for directory inode 67

Ok, so it's a specific change in tree size that is causing the
problem. It's probably a tree height change opertaion....

> Perhaps this can be of some use:
> 
> In my test I'm setting as many xattrs with 1k values as possible. If I
> change that to 64k values this bug is no longer seen.

Of course. Large xattrs are stored remote to the attribute tree, so
each attribute is only consuming ~30-40 bytes in the attribute tree
rather than 1k.

> The cutoff seems to be when the name+value of the xattr are above 3072+1
> bytes.

Which is exactly 75% of the attribute tree block size. i.e. the size
that attribute data is stored remotely instead of in the tree
itself.

> This does not depend on the amount of free space of the
> filesystem, so this is probably not just about the number of xattrs as I
> first thought.

Given it follows the block size, it smells of a btree level change
going wrong.  Haven't go ttime to look right now, but if you could
look at the inode in xfs_db and print the state of the root attr
fork and the root attr block header before running xfs_repair in the
filesystem that would probably tell us..

Cheers,

Dave.

xfs: preserve i_mode if __xfs_set_acl() fails

Commit Message

Comments

Patch