Message ID | 20170816002511.GM21024@dastard (mailing list archive) |
---|---|
State | Accepted |
Headers | show |
On Wed, Aug 16, 2017 at 10:25:11AM +1000, Dave Chinner wrote: > On Tue, Aug 15, 2017 at 04:29:39PM -0300, Ernesto A. Fernández wrote: > > On Tue, Aug 15, 2017 at 06:44:30PM +1000, Dave Chinner wrote: > > > On Tue, Aug 15, 2017 at 03:18:58AM -0300, Ernesto A. Fernández wrote: > > > But I have to ask - why do we even need to modify the mode first? > > > Why not change the ACL first, then modify the mode+timestamp? If > > > setting the ACL fails, then we don't have anything to undo and all > > > is good.... > > > > I intended for the mode to be committed as part of the same transaction > > that sets or removes the ACL. In my mind making the changes later, as part > > of a separate transaction, would have meant that a crash between the two > > left the filesystem in an inconsistent state, > > No, it will not leave the fileystem in an inconsistent state. It > will leave the inode permissions in an /unwanted/ state, but there > is no filesystem metadata inconsistency. > > > with a new ACL but without > > the corresponding mode bits. > > Yup, but that's no different from right now, where a crash after > setting the mode bits could be applied but the ACL update is > missing. > > Either way is even rarely than "crash at the wrong time" implies, > because we've also got to have a complete journal checkpoint occur > between the two operations and then crash between the checkpoint and > the second operation. Yes, it's possible, but in the entire time > I've been working on XFS (almost 15 years now) I can count on one > hand the number of times such a problem has occurred and been > reported... > > So, it's a rare problem, and one that will get solved in time > because there's much more to solving the problem than just this > case. e.g. I worte this in 2008: > > http://xfs.org/index.php/Improving_Metadata_Performance_By_Reducing_Journal_Overhead#Atomic_Multi-Transaction_Operations > > And we've really only got the infrastructure we could use to > implement this in a widespread manner with the rmap/reflink > functionality. But implementing it will require a large amount of > re-organisation of filesystem operations, so it's something that > will take time to roll out. Alright, thanks for the explanation. > With that in mind, here's waht I suggested above: set the mode after > the xattr. I haven't tested it - can you check it solves the problem > case you are testing? It does. Of course the test still fails, as I said before, now claiming that the filesystem is inconsistent. But that's a separate issue. > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > > > xfs: don't change inode mode if ACL update fails > > XXX: untested > > --- > fs/xfs/xfs_acl.c | 22 ++++++++++++++++------ > 1 file changed, 16 insertions(+), 6 deletions(-) > > diff --git a/fs/xfs/xfs_acl.c b/fs/xfs/xfs_acl.c > index 7034e17535de..3354140de07e 100644 > --- a/fs/xfs/xfs_acl.c > +++ b/fs/xfs/xfs_acl.c > @@ -247,6 +247,8 @@ xfs_set_mode(struct inode *inode, umode_t mode) > int > xfs_set_acl(struct inode *inode, struct posix_acl *acl, int type) > { > + umode_t mode; > + bool set_mode = false; > int error = 0; > > if (!acl) > @@ -257,16 +259,24 @@ xfs_set_acl(struct inode *inode, struct posix_acl *acl, int type) > return error; > > if (type == ACL_TYPE_ACCESS) { > - umode_t mode; > - > error = posix_acl_update_mode(inode, &mode, &acl); > if (error) > return error; > - error = xfs_set_mode(inode, mode); > - if (error) > - return error; > + set_mode = true; > } > > set_acl: > - return __xfs_set_acl(inode, acl, type); > + error = __xfs_set_acl(inode, acl, type); > + if (error) > + return error; > + > + /* > + * We set the mode after successfully updating the ACL xattr because the > + * xattr update can fail at ENOSPC and we don't want to change the mode > + * if the ACL update hasn't been applied. > + */ > + if (set_mode) > + error = xfs_set_mode(inode, mode); > + > + return error; > } -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Aug 16, 2017 at 04:16:53AM -0300, Ernesto A. Fernández wrote: > On Wed, Aug 16, 2017 at 10:25:11AM +1000, Dave Chinner wrote: > > On Tue, Aug 15, 2017 at 04:29:39PM -0300, Ernesto A. Fernández wrote: > > > On Tue, Aug 15, 2017 at 06:44:30PM +1000, Dave Chinner wrote: > > > > On Tue, Aug 15, 2017 at 03:18:58AM -0300, Ernesto A. Fernández wrote: > > > > But I have to ask - why do we even need to modify the mode first? > > > > Why not change the ACL first, then modify the mode+timestamp? If > > > > setting the ACL fails, then we don't have anything to undo and all > > > > is good.... > > > > > > I intended for the mode to be committed as part of the same transaction > > > that sets or removes the ACL. In my mind making the changes later, as part > > > of a separate transaction, would have meant that a crash between the two > > > left the filesystem in an inconsistent state, > > > > No, it will not leave the fileystem in an inconsistent state. It > > will leave the inode permissions in an /unwanted/ state, but there > > is no filesystem metadata inconsistency. > > > > > with a new ACL but without > > > the corresponding mode bits. > > > > Yup, but that's no different from right now, where a crash after > > setting the mode bits could be applied but the ACL update is > > missing. > > > > Either way is even rarely than "crash at the wrong time" implies, > > because we've also got to have a complete journal checkpoint occur > > between the two operations and then crash between the checkpoint and > > the second operation. Yes, it's possible, but in the entire time > > I've been working on XFS (almost 15 years now) I can count on one > > hand the number of times such a problem has occurred and been > > reported... > > > > So, it's a rare problem, and one that will get solved in time > > because there's much more to solving the problem than just this > > case. e.g. I worte this in 2008: > > > > http://xfs.org/index.php/Improving_Metadata_Performance_By_Reducing_Journal_Overhead#Atomic_Multi-Transaction_Operations > > > > And we've really only got the infrastructure we could use to > > implement this in a widespread manner with the rmap/reflink > > functionality. But implementing it will require a large amount of > > re-organisation of filesystem operations, so it's something that > > will take time to roll out. > > Alright, thanks for the explanation. > > > With that in mind, here's waht I suggested above: set the mode after > > the xattr. I haven't tested it - can you check it solves the problem > > case you are testing? > > It does. Of course the test still fails, as I said before, now claiming that > the filesystem is inconsistent. But that's a separate issue. It shouldn't be - what's the error that is being reported? Cheers, Dave.
On Wed, Aug 16, 2017 at 11:35:02PM +1000, Dave Chinner wrote: > On Wed, Aug 16, 2017 at 04:16:53AM -0300, Ernesto A. Fernández wrote: > > On Wed, Aug 16, 2017 at 10:25:11AM +1000, Dave Chinner wrote: > > > With that in mind, here's waht I suggested above: set the mode after > > > the xattr. I haven't tested it - can you check it solves the problem > > > case you are testing? > > > > It does. Of course the test still fails, as I said before, now claiming that > > the filesystem is inconsistent. But that's a separate issue. > > It shouldn't be - what's the error that is being reported? xfs_repair, phase 3: bad magic number febe in block 512 (14452) for directory inode 131 problem with attribute contents in inode 131 would clear attr fork bad nblocks 11900 for inode 131, would reset to 11274 bad anextents 1 for inode 131, would reset to 0 This is not a result of your patch, it was already happening before. It seems to be caused by setting too many extended attributes to a file. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Aug 16, 2017 at 04:31:01PM -0300, Ernesto A. Fernández wrote: > On Wed, Aug 16, 2017 at 11:35:02PM +1000, Dave Chinner wrote: > > On Wed, Aug 16, 2017 at 04:16:53AM -0300, Ernesto A. Fernández wrote: > > > On Wed, Aug 16, 2017 at 10:25:11AM +1000, Dave Chinner wrote: > > > > With that in mind, here's waht I suggested above: set the mode after > > > > the xattr. I haven't tested it - can you check it solves the problem > > > > case you are testing? > > > > > > It does. Of course the test still fails, as I said before, now claiming that > > > the filesystem is inconsistent. But that's a separate issue. > > > > It shouldn't be - what's the error that is being reported? > > xfs_repair, phase 3: > bad magic number febe in block 512 (14452) for directory inode 131 Hmmmm - that's a magic number for a non-crc DA node. Kinda implies that it ENOSPC'd in the middle of a tree split. Can you test on a CRC enabled filesystem and see if there are any other errors that are detected either at runtime or by repair? If it does fail, can youpost the full output of xfs_repair? > It seems to be caused by setting too many extended attributes to a > file. Yeah, > 500 blocks in the attribute tree implies at least several megabytes of xattrs on a file, which is something that pretty much never happens on production workloads. It's not likely to be a frequently exercised path... Cheers, Dave.
On Thu, Aug 17, 2017 at 10:00:42AM +1000, Dave Chinner wrote: > On Wed, Aug 16, 2017 at 04:31:01PM -0300, Ernesto A. Fernández wrote: > > xfs_repair, phase 3: > > bad magic number febe in block 512 (14452) for directory inode 131 > > Hmmmm - that's a magic number for a non-crc DA node. Kinda implies > that it ENOSPC'd in the middle of a tree split. > > Can you test on a CRC enabled filesystem and see if there are any > other errors that are detected either at runtime or by repair? No other errors, no. > If it does fail, can youpost the full output of xfs_repair? Of course. I don't think it will be of much help though: Phase 1 - find and verify superblock... Phase 2 - using internal log - scan filesystem freespace and inode maps... - found root inode chunk Phase 3 - for each AG... - scan (but don't clear) agi unlinked lists... - process known inodes and perform inode discovery... - agno = 0 bad magic number 3ebe in block 506 (14446) for directory inode 67 problem with attribute contents in inode 67 would clear attr fork bad nblocks 11894 for inode 67, would reset to 11268 bad anextents 1 for inode 67, would reset to 0 - agno = 1 - process newly discovered inodes... Phase 4 - check for duplicate blocks... - setting up duplicate extent list... - check for inodes claiming duplicate blocks... - agno = 0 - agno = 1 No modify flag set, skipping phase 5 Phase 6 - check inode connectivity... - traversing filesystem ... - traversal finished ... - moving disconnected inodes to lost+found ... Phase 7 - verify link counts... No modify flag set, skipping filesystem flush and exiting. > > It seems to be caused by setting too many extended attributes to a > > file. > > Yeah, > 500 blocks in the attribute tree implies at least several > megabytes of xattrs on a file, which is something that pretty much > never happens on production workloads. It's not likely to be a > frequently exercised path... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com Perhaps this can be of some use: In my test I'm setting as many xattrs with 1k values as possible. If I change that to 64k values this bug is no longer seen. The cutoff seems to be when the name+value of the xattr are above 3072+1 bytes. This does not depend on the amount of free space of the filesystem, so this is probably not just about the number of xattrs as I first thought. If I change the block size to 2048 the cutoff is now 1537 bytes, that is, still 3/4 the block size plus one. For bsize=1024, the cutoff is when the value alone is longer than 756 bytes. Also you don't actually need to max out the number of xattrs to see this bug. For example: if the block size is 1024, the value size is 757 and the name size is 18 bytes, you can trigger this bug by setting 2010 xattrs. Let me know if you need more info. Ernest -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Aug 17, 2017 at 02:34:40AM -0300, Ernesto A. Fernández wrote: > On Thu, Aug 17, 2017 at 10:00:42AM +1000, Dave Chinner wrote: > > On Wed, Aug 16, 2017 at 04:31:01PM -0300, Ernesto A. Fernández wrote: > > > xfs_repair, phase 3: > > > bad magic number febe in block 512 (14452) for directory inode 131 > > > > Hmmmm - that's a magic number for a non-crc DA node. Kinda implies > > that it ENOSPC'd in the middle of a tree split. > > > > Can you test on a CRC enabled filesystem and see if there are any > > other errors that are detected either at runtime or by repair? > > No other errors, no. > > > If it does fail, can youpost the full output of xfs_repair? > > Of course. I don't think it will be of much help though: > > > Phase 1 - find and verify superblock... > Phase 2 - using internal log > - scan filesystem freespace and inode maps... > - found root inode chunk > Phase 3 - for each AG... > - scan (but don't clear) agi unlinked lists... > - process known inodes and perform inode discovery... > - agno = 0 > bad magic number 3ebe in block 506 (14446) for directory inode 67 Ok, so it's a specific change in tree size that is causing the problem. It's probably a tree height change opertaion.... > Perhaps this can be of some use: > > In my test I'm setting as many xattrs with 1k values as possible. If I > change that to 64k values this bug is no longer seen. Of course. Large xattrs are stored remote to the attribute tree, so each attribute is only consuming ~30-40 bytes in the attribute tree rather than 1k. > The cutoff seems to be when the name+value of the xattr are above 3072+1 > bytes. Which is exactly 75% of the attribute tree block size. i.e. the size that attribute data is stored remotely instead of in the tree itself. > This does not depend on the amount of free space of the > filesystem, so this is probably not just about the number of xattrs as I > first thought. Given it follows the block size, it smells of a btree level change going wrong. Haven't go ttime to look right now, but if you could look at the inode in xfs_db and print the state of the root attr fork and the root attr block header before running xfs_repair in the filesystem that would probably tell us.. Cheers, Dave.
diff --git a/fs/xfs/xfs_acl.c b/fs/xfs/xfs_acl.c index 7034e17535de..3354140de07e 100644 --- a/fs/xfs/xfs_acl.c +++ b/fs/xfs/xfs_acl.c @@ -247,6 +247,8 @@ xfs_set_mode(struct inode *inode, umode_t mode) int xfs_set_acl(struct inode *inode, struct posix_acl *acl, int type) { + umode_t mode; + bool set_mode = false; int error = 0; if (!acl) @@ -257,16 +259,24 @@ xfs_set_acl(struct inode *inode, struct posix_acl *acl, int type) return error; if (type == ACL_TYPE_ACCESS) { - umode_t mode; - error = posix_acl_update_mode(inode, &mode, &acl); if (error) return error; - error = xfs_set_mode(inode, mode); - if (error) - return error; + set_mode = true; } set_acl: - return __xfs_set_acl(inode, acl, type); + error = __xfs_set_acl(inode, acl, type); + if (error) + return error; + + /* + * We set the mode after successfully updating the ACL xattr because the + * xattr update can fail at ENOSPC and we don't want to change the mode + * if the ACL update hasn't been applied. + */ + if (set_mode) + error = xfs_set_mode(inode, mode); + + return error; }