Message ID | 20170616105445.3314-3-cmaiolino@redhat.com (mailing list archive) |
---|---|
State | Superseded, archived |
I forgot to add "V4" to the subject here, but the patch is still the V4 one
On Fri, Jun 16, 2017 at 12:54:45PM +0200, Carlos Maiolino wrote:
> When a buffer has been failed during writeback, the inode items into it
> are kept flush locked, and are never resubmitted due the flush lock, so,
> if any buffer fails to be written, the items in AIL are never written to
> disk and never unlocked.
>
> This causes unmount operation to hang due these items flush locked in AIL,

What type of hang? If it has occurred in production is there a trace somewhere?
What does it look like?

You said you would work on an xfstest for this, how's that going? Otherwise
a commit log description of how to reproduce would be useful.

> but this also causes the items in AIL to never be written back, even when
> the IO device comes back to normal.
>
> I've been testing this patch with a DM-thin device, creating a
> filesystem larger than the real device.
>
> When writing enough data to fill the DM-thin device, XFS receives ENOSPC
> errors from the device, and keep spinning on xfsaild (when 'retry
> forever' configuration is set).
>
> At this point, the filesystem can not be unmounted because of the flush locked
> items in AIL, but worse, the items in AIL are never retried at all
> (once xfs_inode_item_push() will skip the items that are flush locked),
> even if the underlying DM-thin device is expanded to the proper size.

Jeesh.

If the above issue is a real hang, should we not consider a sensible stable fix
to start off with?

Luis
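For anyone trying to follow along, the reproducer Carlos describes boils down
to roughly the following; a minimal sketch assuming LVM thin provisioning,
with illustrative volume names and sizes rather than anything from the
original report:

  # thin pool with 1G of real backing space; thin volume advertising 5G
  lvcreate --type thin-pool -L 1G -n pool testvg
  lvcreate -V 5G --thinpool testvg/pool -n thinvol
  mkfs.xfs /dev/testvg/thinvol
  mount /dev/testvg/thinvol /mnt

  # fill past the pool's real capacity; writeback starts failing with ENOSPC
  dd if=/dev/zero of=/mnt/fill bs=1M count=2048 conv=fsync

  # with the 'retry forever' error configuration set, xfsaild now spins on
  # the flush-locked inode items and this never returns
  umount /mnt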
On Fri, Jun 16, 2017 at 08:35:10PM +0200, Luis R. Rodriguez wrote:
> On Fri, Jun 16, 2017 at 12:54:45PM +0200, Carlos Maiolino wrote:
> > When a buffer has been failed during writeback, the inode items into it
> > are kept flush locked, and are never resubmitted due the flush lock, so,
> > if any buffer fails to be written, the items in AIL are never written to
> > disk and never unlocked.
> >
> > This causes unmount operation to hang due these items flush locked in AIL,
>
> What type of hang? If it has occurred in production is there a trace somewhere?
> What does it look like?
>
> You said you would work on an xfstest for this, how's that going? Otherwise
> a commit log description of how to reproduce would be useful.

I'm curious for an xfstest too, but I think Carlos /did/ tell us how to
reproduce -- create a thinp device, format XFS, fill it up, and try to
unmount.

I don't think there /is/ much of a trace, just xfsaild doing
nothing and a bunch of AIL items that are flush locked and stuck that way.

> > but this also causes the items in AIL to never be written back, even when
> > the IO device comes back to normal.
> >
> > I've been testing this patch with a DM-thin device, creating a
> > filesystem larger than the real device.
> >
> > When writing enough data to fill the DM-thin device, XFS receives ENOSPC
> > errors from the device, and keep spinning on xfsaild (when 'retry
> > forever' configuration is set).
> >
> > At this point, the filesystem can not be unmounted because of the flush locked
> > items in AIL, but worse, the items in AIL are never retried at all
> > (once xfs_inode_item_push() will skip the items that are flush locked),
> > even if the underlying DM-thin device is expanded to the proper size.
>
> Jeesh.
>
> If the above issue is a real hang, should we not consider a sensible stable fix
> to start off with?

Huh? I thought this series is supposed to be the fix.

--D

> Luis
On Fri, Jun 16, 2017 at 12:24:45PM -0700, Darrick J. Wong wrote:
> On Fri, Jun 16, 2017 at 08:35:10PM +0200, Luis R. Rodriguez wrote:
> > On Fri, Jun 16, 2017 at 12:54:45PM +0200, Carlos Maiolino wrote:
> > > When a buffer has been failed during writeback, the inode items into it
> > > are kept flush locked, and are never resubmitted due the flush lock, so,
> > > if any buffer fails to be written, the items in AIL are never written to
> > > disk and never unlocked.
> > >
> > > This causes unmount operation to hang due these items flush locked in AIL,
> >
> > What type of hang? If it has occurred in production is there a trace somewhere?
> > What does it look like?
> >
> > You said you would work on an xfstest for this, how's that going? Otherwise
> > a commit log description of how to reproduce would be useful.
>
> I'm curious for an xfstest too, but I think Carlos /did/ tell us how to
> reproduce -- create a thinp device, format XFS, fill it up, and try to
> unmount.

Well, he did mention creating a DM-thin device with a filesystem larger than
the real device. You seem to say just filling up a filesystem?

Do both cases trigger the issue?

> I don't think there /is/ much of a trace, just xfsaild doing
> nothing and a bunch of AIL items that are flush locked and stuck that way.

OK, so no hung task seen, no crash, just a system call that never ends?

Is the issue recoverable? So unmount just never completes? Can we CTRL-C
(SIGINT) out of it?

> > > but this also causes the items in AIL to never be written back, even when
> > > the IO device comes back to normal.
> > >
> > > I've been testing this patch with a DM-thin device, creating a
> > > filesystem larger than the real device.
> > >
> > > When writing enough data to fill the DM-thin device, XFS receives ENOSPC
> > > errors from the device, and keep spinning on xfsaild (when 'retry
> > > forever' configuration is set).
> > >
> > > At this point, the filesystem can not be unmounted because of the flush locked
> > > items in AIL, but worse, the items in AIL are never retried at all
> > > (once xfs_inode_item_push() will skip the items that are flush locked),
> > > even if the underlying DM-thin device is expanded to the proper size.
> >
> > Jeesh.
> >
> > If the above issue is a real hang, should we not consider a sensible stable fix
> > to start off with?
>
> Huh? I thought this series is supposed to be the fix.

It seems like a rather large set of changes; if the issue was severe I was
hoping for a stable candidate fix first. If it's not fixing a severe issue
then sure.

Luis
On 6/16/17 2:37 PM, Luis R. Rodriguez wrote:
> On Fri, Jun 16, 2017 at 12:24:45PM -0700, Darrick J. Wong wrote:
>> On Fri, Jun 16, 2017 at 08:35:10PM +0200, Luis R. Rodriguez wrote:
>>> On Fri, Jun 16, 2017 at 12:54:45PM +0200, Carlos Maiolino wrote:
>>>> When a buffer has been failed during writeback, the inode items into it
>>>> are kept flush locked, and are never resubmitted due the flush lock, so,
>>>> if any buffer fails to be written, the items in AIL are never written to
>>>> disk and never unlocked.
>>>>
>>>> This causes unmount operation to hang due these items flush locked in AIL,
>>>
>>> What type of hang? If it has occurred in production is there a trace somewhere?
>>> What does it look like?
>>>
>>> You said you would work on an xfstest for this, how's that going? Otherwise
>>> a commit log description of how to reproduce would be useful.
>>
>> I'm curious for an xfstest too, but I think Carlos /did/ tell us how to
>> reproduce -- create a thinp device, format XFS, fill it up, and try to
>> unmount.
>
> Well, he did mention creating a DM-thin device with a filesystem larger than
> the real device. You seem to say just filling up a filesystem?

No - the case is filling up the /thinp device/, not the filesystem.

>
> Do both cases trigger the issue?

This whole issue has to do with errors from the block device during writeback;
that's why the thin device is involved. When the backing store fills, we
see the IO errors that cause this problem.

>> I don't think there /is/ much of a trace, just xfsaild doing
>> nothing and a bunch of AIL items that are flush locked and stuck that way.
>
> OK, so no hung task seen, no crash, just a system call that never ends?
>
> Is the issue recoverable? So unmount just never completes? Can we CTRL-C
> (SIGINT) out of it?

No, you can't ctrl-c a stuck kernel.

>>>> but this also causes the items in AIL to never be written back, even when
>>>> the IO device comes back to normal.
>>>>
>>>> I've been testing this patch with a DM-thin device, creating a
>>>> filesystem larger than the real device.
>>>>
>>>> When writing enough data to fill the DM-thin device, XFS receives ENOSPC
>>>> errors from the device, and keep spinning on xfsaild (when 'retry
>>>> forever' configuration is set).
>>>>
>>>> At this point, the filesystem can not be unmounted because of the flush locked
>>>> items in AIL, but worse, the items in AIL are never retried at all
>>>> (once xfs_inode_item_push() will skip the items that are flush locked),
>>>> even if the underlying DM-thin device is expanded to the proper size.
>>>
>>> Jeesh.
>>>
>>> If the above issue is a real hang, should we not consider a sensible stable fix
>>> to start off with?
>>
>> Huh? I thought this series is supposed to be the fix.
>
> It seems like a rather large set of changes; if the issue was severe I was
> hoping for a stable candidate fix first. If it's not fixing a severe issue
> then sure.

Fixes go upstream first, then to stable kernels if appropriate, right?

But not every fix is a trivial one-liner. I don't think there is any simpler
fix to be had, here.

-Eric

> Luis
On Fri, Jun 16, 2017 at 02:45:13PM -0500, Eric Sandeen wrote: > On 6/16/17 2:37 PM, Luis R. Rodriguez wrote: > > On Fri, Jun 16, 2017 at 12:24:45PM -0700, Darrick J. Wong wrote: > >> On Fri, Jun 16, 2017 at 08:35:10PM +0200, Luis R. Rodriguez wrote: > >>> On Fri, Jun 16, 2017 at 12:54:45PM +0200, Carlos Maiolino wrote: > >>>> When a buffer has been failed during writeback, the inode items into it > >>>> are kept flush locked, and are never resubmitted due the flush lock, so, > >>>> if any buffer fails to be written, the items in AIL are never written to > >>>> disk and never unlocked. > >>>> > >>>> This causes unmount operation to hang due these items flush locked in AIL, > >>> > >>> What type of hang? If it has occurred in production is there a trace somewhere? > >>> what does it look like? > >>> > >>> You said you would work on an xfstest for this, how's that going? Otherewise > >>> a commit log description of how to reproduce would be useful. > >> > >> I'm curious for an xfstest too, but I think Carlos /did/ tell us how to > >> reproduce -- create a thinp device, format XFS, fill it up, and try to > >> unmount. > > > > Well he did mention to create a Dm-thin device with a fileystem larger than > > the real device. You seem to have say just filling up a filsystem? > > No - the case is filling up the /thinp device/, not the filesystem. > > > > > Do both cases trigger the issue? > > This whole issue has to do with errors from the block device during writeback; > that's why the thin device is involved. When the backing store fills, we > see the IO errors that cause this problem. > > >> I don't think there /is/ much of a trace, just xfsaild doing > >> nothing and a bunch of ail items that are flush locked and stuck that way. > > > > OK so no hung task seek, no crash, just a system call that never ends? > > > > Is the issue recoverable? So unmount just never completes? Can we CTRL-C > > (SIGINT) out of it? > > No, you can't ctrl-c a stuck kernel. > > >>>> but this also causes the items in AIL to never be written back, even when > >>>> the IO device comes back to normal. > >>>> > >>>> I've been testing this patch with a DM-thin device, creating a > >>>> filesystem larger than the real device. > >>>> > >>>> When writing enough data to fill the DM-thin device, XFS receives ENOSPC > >>>> errors from the device, and keep spinning on xfsaild (when 'retry > >>>> forever' configuration is set). > >>>> > >>>> At this point, the filesystem can not be unmounted because of the flush locked > >>>> items in AIL, but worse, the items in AIL are never retried at all > >>>> (once xfs_inode_item_push() will skip the items that are flush locked), > >>>> even if the underlying DM-thin device is expanded to the proper size. > >>> > >>> Jeesh. > >>> > >>> If the above issue is a real hang, shoudl we not consider a sensible stable fix > >>> to start off with ? > >> > >> Huh? I thought this series is supposed to be the fix. > > > > It seems like a rather large set of changes, if the issue was sevee I was hoping > > for a stable candidate fix first. If its not fixing a severe issue then sure. > > Fixes go uptream first, then to stable kernels if appropriate, right? > > But not every fix is a trivial one-liner. I don't think there is any simpler > fix to be had, here. > Yeah, this is kind of intended to be the simplest fix available as far as I'm aware. 
TBH, I don't think the fix here is fundamentally that complex (define a
state for already flushed/failed log items), but rather the code that needs
to be modified to implement it has various states and corner cases to manage
that make it tricky to get right.

If we truly needed a stable-worthy fix in short order, that would probably be
to revert ac8809f9a ("xfs: abort metadata writeback on permanent errors"),
which caused this regression by making the AIL responsible for failed
retries.

A couple caveats I can think of with that are that 1.) this would likely
short circuit the recently added error configuration api (which is kind of
irrelevant if the underlying error management code doesn't work correctly,
but suggests that should be reverted as well) and 2.) in doing so, this
reverts back to the hardcoded infinite retry behavior in XFS. That means that
transient errors may eventually recover, but the thinp enospc use case and
whatnot are still going to hang on umount.

It hasn't seemed necessary to me to take that approach given the lower
prevalence of the issue and the fact that a solution had been worked out for
a while now. Though I suppose the longer we go without a fix in place, the
stronger the argument for something like that becomes.

Brian

> -Eric
>
> > Luis
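For context, the "error configuration api" mentioned here is the per-device
sysfs interface added around v4.7. A sketch of the 'retry forever' setup
referenced throughout this thread; the knob paths are from memory and the
device name is illustrative, so double-check them against the kernel in use:

  # -1 means retry failed metadata writeback indefinitely on ENOSPC
  echo -1 > /sys/fs/xfs/dm-5/error/metadata/ENOSPC/max_retries

  # keep retrying even during unmount; combined with the flush-lock bug,
  # this is what turns the ENOSPC condition into an unmount hang
  echo 0 > /sys/fs/xfs/dm-5/error/fail_at_unmount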
On Fri, Jun 16, 2017 at 12:54:45PM +0200, Carlos Maiolino wrote: > When a buffer has been failed during writeback, the inode items into it > are kept flush locked, and are never resubmitted due the flush lock, so, > if any buffer fails to be written, the items in AIL are never written to > disk and never unlocked. > > This causes unmount operation to hang due these items flush locked in AIL, > but this also causes the items in AIL to never be written back, even when > the IO device comes back to normal. > > I've been testing this patch with a DM-thin device, creating a > filesystem larger than the real device. > > When writing enough data to fill the DM-thin device, XFS receives ENOSPC > errors from the device, and keep spinning on xfsaild (when 'retry > forever' configuration is set). > > At this point, the filesystem can not be unmounted because of the flush locked > items in AIL, but worse, the items in AIL are never retried at all > (once xfs_inode_item_push() will skip the items that are flush locked), > even if the underlying DM-thin device is expanded to the proper size. > > This patch fixes both cases, retrying any item that has been failed > previously, using the infra-structure provided by the previous patch. > > Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> > --- > V2: > - Fix XFS_LI_FAILED flag removal > - Use atomic operations to set and clear XFS_LI_FAILED flag > - Remove check for XBF_WRITE_FAIL in xfs_inode_item_push > - Add more comments to the code > - Add a helper function to resubmit the failed buffers, so this > can be also used in dquot system without duplicating code > > V3: > - kill xfs_imap_to_bp call using a pointer in the log item to > hold the buffer address > - use xa_lock instead of atomic operations to handle log item > flags > - Add a hold to the buffer for each log item failed > - move buffer resubmission up in xfs_inode_item_push() > > V4: > - Remove bflags argument from iop_error callback > - Remove ip argument from xfs_buf_resubmit_failed_buffers > - Use helpers to set/clear XFS_LI_FAILED flag > - remove ->xa_lock from the iop->error callback and move it up > on the stack, so all log items are processed into a single > pair of lock/unlock > > fs/xfs/xfs_buf_item.c | 37 +++++++++++++++++++++++++++++++++++++ > fs/xfs/xfs_buf_item.h | 1 + > fs/xfs/xfs_inode_item.c | 36 +++++++++++++++++++++++++++++++++--- > fs/xfs/xfs_trans.h | 26 ++++++++++++++++++++++++++ > fs/xfs/xfs_trans_ail.c | 4 ++-- > 5 files changed, 99 insertions(+), 5 deletions(-) > > diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c > index 4fa68c9..c5d21ea 100644 > --- a/fs/xfs/xfs_buf_item.c > +++ b/fs/xfs/xfs_buf_item.c > @@ -1222,3 +1222,40 @@ xfs_buf_iodone( > xfs_trans_ail_delete(ailp, lip, SHUTDOWN_CORRUPT_INCORE); > xfs_buf_item_free(BUF_ITEM(lip)); > } > + > +/* > + * Requeue a failed buffer for writeback > + * > + * Return true if the buffer has been re-queued properly, false otherwise > + */ > +bool > +xfs_buf_resubmit_failed_buffers( > + struct xfs_log_item *lip, > + struct list_head *buffer_list) > +{ > + struct xfs_log_item *next; > + struct xfs_buf *bp; > + bool ret; > + > + bp = lip->li_buf; > + > + /* Clear XFS_LI_FAILED flag from all items before resubmit Nit: start of the comment (/*) should be on its own line. 
> + * > + * XFS_LI_FAILED set/clear is protected by xa_lock, caller this > + * function already have it acquired > + */ > + for (; lip; lip = next) { > + next = lip->li_bio_list; > + xfs_clear_li_failed(lip); > + } > + > + /* Add this buffer back to the delayed write list */ > + xfs_buf_lock(bp); This should be a trylock and we should return XFS_ITEM_LOCKED from the caller if it fails so xfsaild can make progress rather than block sitting on a buffer lock (and this is all much cleaner with the trylock/unlock in the caller). > + if (!xfs_buf_delwri_queue(bp, buffer_list)) > + ret = false; > + else > + ret = true; > + > + xfs_buf_unlock(bp); > + return ret; > +} > diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h > index f7eba99..82c3d64 100644 > --- a/fs/xfs/xfs_buf_item.h > +++ b/fs/xfs/xfs_buf_item.h > @@ -70,6 +70,7 @@ void xfs_buf_attach_iodone(struct xfs_buf *, > xfs_log_item_t *); > void xfs_buf_iodone_callbacks(struct xfs_buf *); > void xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *); > +bool xfs_buf_resubmit_failed_buffers(struct xfs_log_item *, struct list_head *); > > extern kmem_zone_t *xfs_buf_item_zone; > > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c > index 08cb7d1..2719ac6 100644 > --- a/fs/xfs/xfs_inode_item.c > +++ b/fs/xfs/xfs_inode_item.c > @@ -27,6 +27,7 @@ > #include "xfs_error.h" > #include "xfs_trace.h" > #include "xfs_trans_priv.h" > +#include "xfs_buf_item.h" > #include "xfs_log.h" > > > @@ -475,6 +476,18 @@ xfs_inode_item_unpin( > wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT); > } > > +/* > + * Callback used to mark a buffer with XFS_LI_FAILED when items in the buffer > + * have been failed during writeback > + */ > +STATIC void > +xfs_inode_item_error( > + struct xfs_log_item *lip, > + struct xfs_buf *bp) > +{ > + xfs_set_li_failed(lip, bp); > +} > + > STATIC uint > xfs_inode_item_push( > struct xfs_log_item *lip, > @@ -491,6 +504,17 @@ xfs_inode_item_push( > if (xfs_ipincount(ip) > 0) > return XFS_ITEM_PINNED; > > + /* > + * The buffer containing this item failed to be written back > + * previously. Resubmit the buffer for IO. > + */ > + if (lip->li_flags & XFS_LI_FAILED) { > + if (!xfs_buf_resubmit_failed_buffers(lip, buffer_list)) > + rval = XFS_ITEM_FLUSHING; > + > + return rval; > + } > + > if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED)) > return XFS_ITEM_LOCKED; > > @@ -622,7 +646,8 @@ static const struct xfs_item_ops xfs_inode_item_ops = { > .iop_unlock = xfs_inode_item_unlock, > .iop_committed = xfs_inode_item_committed, > .iop_push = xfs_inode_item_push, > - .iop_committing = xfs_inode_item_committing > + .iop_committing = xfs_inode_item_committing, > + .iop_error = xfs_inode_item_error > }; > > > @@ -710,7 +735,8 @@ xfs_iflush_done( > * the AIL lock. > */ > iip = INODE_ITEM(blip); > - if (iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn) > + if ((iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn) || > + lip->li_flags & XFS_LI_FAILED) > need_ail++; > > blip = next; > @@ -718,7 +744,8 @@ xfs_iflush_done( > > /* make sure we capture the state of the initial inode. 
*/ > iip = INODE_ITEM(lip); > - if (iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn) > + if ((iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn) || > + lip->li_flags & XFS_LI_FAILED) > need_ail++; > > /* > @@ -739,6 +766,9 @@ xfs_iflush_done( > if (INODE_ITEM(blip)->ili_logged && > blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn) > mlip_changed |= xfs_ail_delete_one(ailp, blip); > + else { > + xfs_clear_li_failed(blip); > + } > } > > if (mlip_changed) { > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h > index 50df5367..2d7cf91 100644 > --- a/fs/xfs/xfs_trans.h > +++ b/fs/xfs/xfs_trans.h > @@ -49,6 +49,7 @@ typedef struct xfs_log_item { > struct xfs_ail *li_ailp; /* ptr to AIL */ > uint li_type; /* item type */ > uint li_flags; /* misc flags */ > + struct xfs_buf *li_buf; /* real buffer pointer */ > struct xfs_log_item *li_bio_list; /* buffer item list */ > void (*li_cb)(struct xfs_buf *, > struct xfs_log_item *); > @@ -72,6 +73,31 @@ typedef struct xfs_log_item { > { XFS_LI_ABORTED, "ABORTED" }, \ > { XFS_LI_FAILED, "FAILED" } > > +static inline void > +xfs_clear_li_failed( > + struct xfs_log_item *lip) > +{ > + struct xfs_buf *bp = lip->li_buf; > + I think we should assert that ->xa_lock is held in both of these helpers. Note that we have to use lockdep_assert_held() for spinlocks. It also couldn't hurt to assert that XFS_LI_IN_AIL is set as well (we'd just have to reorder the _clear_li_failed() call in xfs_ail_delete_one() below). > + if (lip->li_flags & XFS_LI_FAILED) { > + lip->li_flags &= ~XFS_LI_FAILED; > + lip->li_buf = NULL; > + xfs_buf_rele(bp); > + } > +} > + > +static inline void > +xfs_set_li_failed( > + struct xfs_log_item *lip, > + struct xfs_buf *bp) > +{ > + if (lip->li_flags & ~XFS_LI_FAILED) { I think you want !(lip->li_flags & XFS_LI_FAILED). ;) Brian > + xfs_buf_hold(bp); > + lip->li_flags |= XFS_LI_FAILED; > + lip->li_buf = bp; > + } > +} > + > struct xfs_item_ops { > void (*iop_size)(xfs_log_item_t *, int *, int *); > void (*iop_format)(xfs_log_item_t *, struct xfs_log_vec *); > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c > index 9056c0f..c41d640 100644 > --- a/fs/xfs/xfs_trans_ail.c > +++ b/fs/xfs/xfs_trans_ail.c > @@ -687,15 +687,15 @@ xfs_trans_ail_update_bulk( > bool > xfs_ail_delete_one( > struct xfs_ail *ailp, > - struct xfs_log_item *lip) > + struct xfs_log_item *lip) > { > struct xfs_log_item *mlip = xfs_ail_min(ailp); > > trace_xfs_ail_delete(lip, mlip->li_lsn, lip->li_lsn); > xfs_ail_delete(ailp, lip); > lip->li_flags &= ~XFS_LI_IN_AIL; > + xfs_clear_li_failed(lip); > lip->li_lsn = 0; > - > return mlip == lip; > } > > -- > 2.9.4 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
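To make the review feedback above concrete, here is a sketch (an illustration
of the suggested changes, not the patch as posted) of the failed-item push
path with the trylock moved into the caller, and of the helpers with the
suggested asserts added and the inverted flag test fixed:

  /* in xfs_inode_item_push(): don't let xfsaild block on the buffer lock */
  if (lip->li_flags & XFS_LI_FAILED) {
          struct xfs_buf  *bp = lip->li_buf;

          if (!xfs_buf_trylock(bp))
                  return XFS_ITEM_LOCKED;

          /* locking now handled here rather than in the resubmit helper */
          if (!xfs_buf_resubmit_failed_buffers(lip, buffer_list))
                  rval = XFS_ITEM_FLUSHING;

          xfs_buf_unlock(bp);
          return rval;
  }

  static inline void
  xfs_set_li_failed(
          struct xfs_log_item     *lip,
          struct xfs_buf          *bp)
  {
          lockdep_assert_held(&lip->li_ailp->xa_lock);

          /* note: !(flags & FLAG), not (flags & ~FLAG) as posted */
          if (!(lip->li_flags & XFS_LI_FAILED)) {
                  xfs_buf_hold(bp);
                  lip->li_flags |= XFS_LI_FAILED;
                  lip->li_buf = bp;
          }
  }

  static inline void
  xfs_clear_li_failed(
          struct xfs_log_item     *lip)
  {
          struct xfs_buf          *bp = lip->li_buf;

          ASSERT(lip->li_flags & XFS_LI_IN_AIL);
          lockdep_assert_held(&lip->li_ailp->xa_lock);

          if (lip->li_flags & XFS_LI_FAILED) {
                  lip->li_flags &= ~XFS_LI_FAILED;
                  lip->li_buf = NULL;
                  xfs_buf_rele(bp);
          }
  }

The XFS_LI_IN_AIL assert implies the reordering mentioned above: in
xfs_ail_delete_one(), xfs_clear_li_failed() would have to be called before
XFS_LI_IN_AIL is cleared rather than after.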
On Mon, Jun 19, 2017 at 09:49:42AM -0400, Brian Foster wrote: > On Fri, Jun 16, 2017 at 12:54:45PM +0200, Carlos Maiolino wrote: > > When a buffer has been failed during writeback, the inode items into it > > are kept flush locked, and are never resubmitted due the flush lock, so, > > if any buffer fails to be written, the items in AIL are never written to > > disk and never unlocked. > > > > This causes unmount operation to hang due these items flush locked in AIL, > > but this also causes the items in AIL to never be written back, even when > > the IO device comes back to normal. > > > > I've been testing this patch with a DM-thin device, creating a > > filesystem larger than the real device. > > > > When writing enough data to fill the DM-thin device, XFS receives ENOSPC > > errors from the device, and keep spinning on xfsaild (when 'retry > > forever' configuration is set). > > > > At this point, the filesystem can not be unmounted because of the flush locked > > items in AIL, but worse, the items in AIL are never retried at all > > (once xfs_inode_item_push() will skip the items that are flush locked), > > even if the underlying DM-thin device is expanded to the proper size. > > > > This patch fixes both cases, retrying any item that has been failed > > previously, using the infra-structure provided by the previous patch. > > > > Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> > > --- ... > > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c > > index 08cb7d1..2719ac6 100644 > > --- a/fs/xfs/xfs_inode_item.c > > +++ b/fs/xfs/xfs_inode_item.c ... > > @@ -475,6 +476,18 @@ xfs_inode_item_unpin( > > wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT); > > } > > > > +/* > > + * Callback used to mark a buffer with XFS_LI_FAILED when items in the buffer > > + * have been failed during writeback > > + */ > > +STATIC void > > +xfs_inode_item_error( > > + struct xfs_log_item *lip, > > + struct xfs_buf *bp) > > +{ Also if we're going to keep the ->iop_error() thing around, could we add an 'ASSERT(xfs_isiflocked(INODE_ITEM(lip)->ili_inode))' here? Brian > > + xfs_set_li_failed(lip, bp); > > +} > > + > > STATIC uint > > xfs_inode_item_push( > > struct xfs_log_item *lip, > > @@ -491,6 +504,17 @@ xfs_inode_item_push( > > if (xfs_ipincount(ip) > 0) > > return XFS_ITEM_PINNED; > > > > + /* > > + * The buffer containing this item failed to be written back > > + * previously. Resubmit the buffer for IO. > > + */ > > + if (lip->li_flags & XFS_LI_FAILED) { > > + if (!xfs_buf_resubmit_failed_buffers(lip, buffer_list)) > > + rval = XFS_ITEM_FLUSHING; > > + > > + return rval; > > + } > > + > > if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED)) > > return XFS_ITEM_LOCKED; > > > > @@ -622,7 +646,8 @@ static const struct xfs_item_ops xfs_inode_item_ops = { > > .iop_unlock = xfs_inode_item_unlock, > > .iop_committed = xfs_inode_item_committed, > > .iop_push = xfs_inode_item_push, > > - .iop_committing = xfs_inode_item_committing > > + .iop_committing = xfs_inode_item_committing, > > + .iop_error = xfs_inode_item_error > > }; > > > > > > @@ -710,7 +735,8 @@ xfs_iflush_done( > > * the AIL lock. > > */ > > iip = INODE_ITEM(blip); > > - if (iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn) > > + if ((iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn) || > > + lip->li_flags & XFS_LI_FAILED) > > need_ail++; > > > > blip = next; > > @@ -718,7 +744,8 @@ xfs_iflush_done( > > > > /* make sure we capture the state of the initial inode. 
*/ > > iip = INODE_ITEM(lip); > > - if (iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn) > > + if ((iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn) || > > + lip->li_flags & XFS_LI_FAILED) > > need_ail++; > > > > /* > > @@ -739,6 +766,9 @@ xfs_iflush_done( > > if (INODE_ITEM(blip)->ili_logged && > > blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn) > > mlip_changed |= xfs_ail_delete_one(ailp, blip); > > + else { > > + xfs_clear_li_failed(blip); > > + } > > } > > > > if (mlip_changed) { > > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h > > index 50df5367..2d7cf91 100644 > > --- a/fs/xfs/xfs_trans.h > > +++ b/fs/xfs/xfs_trans.h > > @@ -49,6 +49,7 @@ typedef struct xfs_log_item { > > struct xfs_ail *li_ailp; /* ptr to AIL */ > > uint li_type; /* item type */ > > uint li_flags; /* misc flags */ > > + struct xfs_buf *li_buf; /* real buffer pointer */ > > struct xfs_log_item *li_bio_list; /* buffer item list */ > > void (*li_cb)(struct xfs_buf *, > > struct xfs_log_item *); > > @@ -72,6 +73,31 @@ typedef struct xfs_log_item { > > { XFS_LI_ABORTED, "ABORTED" }, \ > > { XFS_LI_FAILED, "FAILED" } > > > > +static inline void > > +xfs_clear_li_failed( > > + struct xfs_log_item *lip) > > +{ > > + struct xfs_buf *bp = lip->li_buf; > > + > > I think we should assert that ->xa_lock is held in both of these > helpers. Note that we have to use lockdep_assert_held() for spinlocks. > > It also couldn't hurt to assert that XFS_LI_IN_AIL is set as well (we'd > just have to reorder the _clear_li_failed() call in xfs_ail_delete_one() > below). > > > + if (lip->li_flags & XFS_LI_FAILED) { > > + lip->li_flags &= ~XFS_LI_FAILED; > > + lip->li_buf = NULL; > > + xfs_buf_rele(bp); > > + } > > +} > > + > > +static inline void > > +xfs_set_li_failed( > > + struct xfs_log_item *lip, > > + struct xfs_buf *bp) > > +{ > > + if (lip->li_flags & ~XFS_LI_FAILED) { > > I think you want !(lip->li_flags & XFS_LI_FAILED). ;) > > Brian > > > + xfs_buf_hold(bp); > > + lip->li_flags |= XFS_LI_FAILED; > > + lip->li_buf = bp; > > + } > > +} > > + > > struct xfs_item_ops { > > void (*iop_size)(xfs_log_item_t *, int *, int *); > > void (*iop_format)(xfs_log_item_t *, struct xfs_log_vec *); > > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c > > index 9056c0f..c41d640 100644 > > --- a/fs/xfs/xfs_trans_ail.c > > +++ b/fs/xfs/xfs_trans_ail.c > > @@ -687,15 +687,15 @@ xfs_trans_ail_update_bulk( > > bool > > xfs_ail_delete_one( > > struct xfs_ail *ailp, > > - struct xfs_log_item *lip) > > + struct xfs_log_item *lip) > > { > > struct xfs_log_item *mlip = xfs_ail_min(ailp); > > > > trace_xfs_ail_delete(lip, mlip->li_lsn, lip->li_lsn); > > xfs_ail_delete(ailp, lip); > > lip->li_flags &= ~XFS_LI_IN_AIL; > > + xfs_clear_li_failed(lip); > > lip->li_lsn = 0; > > - > > return mlip == lip; > > } > > > > -- > > 2.9.4 > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
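Similarly, the ->iop_error() callback with the flush-lock assert asked for
above would look something like this sketch:

  STATIC void
  xfs_inode_item_error(
          struct xfs_log_item     *lip,
          struct xfs_buf          *bp)
  {
          /* only a flush-locked inode should ever be marked failed */
          ASSERT(xfs_isiflocked(INODE_ITEM(lip)->ili_inode));
          xfs_set_li_failed(lip, bp);
  }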
Hello Luis.

On Fri, Jun 16, 2017 at 08:35:10PM +0200, Luis R. Rodriguez wrote:
> On Fri, Jun 16, 2017 at 12:54:45PM +0200, Carlos Maiolino wrote:
> > When a buffer has been failed during writeback, the inode items into it
> > are kept flush locked, and are never resubmitted due the flush lock, so,
> > if any buffer fails to be written, the items in AIL are never written to
> > disk and never unlocked.
> >
> > This causes unmount operation to hang due these items flush locked in AIL,
>
> What type of hang? If it has occurred in production is there a trace somewhere?
> What does it look like?
>

No, there isn't any specific trace; the hang can be seen in several different
places. When unmounting the filesystem, it will hang in xfs_ail_push_all_sync(),
but the problem is hit even if no unmount is attempted, with items stuck forever
in the AIL.

I think the easiest way to track this is to look at the device stats in sysfs,
and you will see the push_ail statistics increase forever even with no work
going on in the filesystem.

> You said you would work on an xfstest for this, how's that going? Otherwise
> a commit log description of how to reproduce would be useful.
>

The xfstest is not done yet, and I'm not focusing on it right now; I already
have a reproducer, pointed out at the beginning of the discussion of this
problem, and having this fixed is my priority. Once the patches are in shape
and accepted, I'll work on the xfstest.

Not to mention that this problem can still occur not only with inode items,
but also with dquot items, which will be fixed as soon as we reach a consensus
on how best to fix this problem. Since the dquot fix will use the same
infrastructure as the inode items use in this patchset, and largely the same
code, that is one of the reasons I separated the buffer resubmission into a
function that can be used for both item types.

> > but this also causes the items in AIL to never be written back, even when
> > the IO device comes back to normal.
> >
> > I've been testing this patch with a DM-thin device, creating a
> > filesystem larger than the real device.
> >
> > When writing enough data to fill the DM-thin device, XFS receives ENOSPC
> > errors from the device, and keep spinning on xfsaild (when 'retry
> > forever' configuration is set).
> >
> > At this point, the filesystem can not be unmounted because of the flush locked
> > items in AIL, but worse, the items in AIL are never retried at all
> > (once xfs_inode_item_push() will skip the items that are flush locked),
> > even if the underlying DM-thin device is expanded to the proper size.
>
> Jeesh.
>
> If the above issue is a real hang, should we not consider a sensible stable fix
> to start off with?
>

Please take a look at the whole history of this issue: this patchset is supposed
to be the stable fix. That's why one of the requirements was to use xa_lock here
to change the log_item flags, instead of atomic ops, making it easier to
backport to stable kernels without messing around with atomic ops and field
type changes. And yes, this is a real hang problem; we have received several
reports about it over the time I've been working on it.

Cheers
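The statistics Carlos points at can be watched while reproducing; the
per-device stats path below is from memory and the device name is
illustrative:

  # the push_ail counters climbing forever while the filesystem is
  # otherwise idle is the signature of this bug
  watch -n 1 'grep push_ail /sys/fs/xfs/dm-5/stats/stats'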
On Tue, Jun 20, 2017 at 09:01:03AM +0200, Carlos Maiolino wrote:
> Hello Luis.
>
> On Fri, Jun 16, 2017 at 08:35:10PM +0200, Luis R. Rodriguez wrote:
> > On Fri, Jun 16, 2017 at 12:54:45PM +0200, Carlos Maiolino wrote:
> > > When a buffer has been failed during writeback, the inode items into it
> > > are kept flush locked, and are never resubmitted due the flush lock, so,
> > > if any buffer fails to be written, the items in AIL are never written to
> > > disk and never unlocked.
> > >
> > > This causes unmount operation to hang due these items flush locked in AIL,
> >
> > What type of hang? If it has occurred in production is there a trace somewhere?
> > What does it look like?
> >
> No, there isn't any specific trace; the hang can be seen in several different
> places. When unmounting the filesystem, it will hang in xfs_ail_push_all_sync(),
> but the problem is hit even if no unmount is attempted, with items stuck forever
> in the AIL.

Curious: since you can reproduce, what happens if you do a hard reset on the
system when this triggers, once it boots back up? I'd guess it recovers, but I
just want to be sure.

> I think the easiest way to track this is to look at the device stats in sysfs,
> and you will see the push_ail statistics increase forever even with no work
> going on in the filesystem.

But the above seems to note it can happen for *any* failed buffer on
writeback? Has that really been so rare, or was this also just because the
'retry forever' option is required? Or did the 'retry forever' option help
increase the chances of reproducing?

I suppose I was hoping for a real world case which might land us in the worst
case. For instance the dm-thin case with a pool that will run out seems like a
common case for some docker users, and from the looks of it folks are, as a
side consequence, seeing this as docker just hanging after trying to unmount
[0] and reaching xfs_ail_push_all_sync(). Could this be an example of a real
world issue? There was no mention of the requirement of the 'retry forever'
option in these cases, though. If this is an alternative real world trigger,
then this might be more common than what the commit log seems to describe.

[0] https://github.com/moby/moby/issues/20707

> > You said you would work on an xfstest for this, how's that going? Otherwise
> > a commit log description of how to reproduce would be useful.
> >
> The xfstest is not done yet, and I'm not focusing on it right now; I already
> have a reproducer, pointed out at the beginning of the discussion of this
> problem, and having this fixed is my priority. Once the patches are in shape
> and accepted, I'll work on the xfstest.

OK thanks.

> Not to mention that this problem can still occur not only with inode items,
> but also with dquot items, which will be fixed as soon as we reach a consensus
> on how best to fix this problem. Since the dquot fix will use the same
> infrastructure as the inode items use in this patchset, and largely the same
> code, that is one of the reasons I separated the buffer resubmission into a
> function that can be used for both item types.

I see... thanks!

> > > but this also causes the items in AIL to never be written back, even when
> > > the IO device comes back to normal.
> > >
> > > I've been testing this patch with a DM-thin device, creating a
> > > filesystem larger than the real device.
> > >
> > > When writing enough data to fill the DM-thin device, XFS receives ENOSPC
> > > errors from the device, and keep spinning on xfsaild (when 'retry
> > > forever' configuration is set).
> > >
> > > At this point, the filesystem can not be unmounted because of the flush locked
> > > items in AIL, but worse, the items in AIL are never retried at all
> > > (once xfs_inode_item_push() will skip the items that are flush locked),
> > > even if the underlying DM-thin device is expanded to the proper size.
> >
> > Jeesh.
> >
> > If the above issue is a real hang, should we not consider a sensible stable fix
> > to start off with?
> >
> Please take a look at the whole history of this issue: this patchset is supposed
> to be the stable fix. That's why one of the requirements was to use xa_lock here
> to change the log_item flags, instead of atomic ops, making it easier to
> backport to stable kernels without messing around with atomic ops and field
> type changes. And yes, this is a real hang problem; we have received several
> reports about it over the time I've been working on it.

I'm thrilled to hear stable is being considered here. Thanks!

Luis
On Mon, Jun 19, 2017 at 06:59:05AM -0400, Brian Foster wrote:
> On Fri, Jun 16, 2017 at 02:45:13PM -0500, Eric Sandeen wrote:
> > On 6/16/17 2:37 PM, Luis R. Rodriguez wrote:
> > > On Fri, Jun 16, 2017 at 12:24:45PM -0700, Darrick J. Wong wrote:
> > >> On Fri, Jun 16, 2017 at 08:35:10PM +0200, Luis R. Rodriguez wrote:
> > > > It seems like a rather large set of changes; if the issue was severe I was
> > > > hoping for a stable candidate fix first. If it's not fixing a severe issue
> > > > then sure.
> > >
> > > Fixes go upstream first, then to stable kernels if appropriate, right?
> > >
> > > But not every fix is a trivial one-liner. I don't think there is any simpler
> > > fix to be had, here.
> >
> > Yeah, this is kind of intended to be the simplest fix available as far
> > as I'm aware. TBH, I don't think the fix here is fundamentally that
> > complex (define a state for already flushed/failed log items), but
> > rather the code that needs to be modified to implement it has various
> > states and corner cases to manage that make it tricky to get right.

OK this helps, thanks.

> If we truly needed a stable-worthy fix in short order, that would
> probably be to revert ac8809f9a ("xfs: abort metadata writeback on
> permanent errors"), which caused this regression by making the AIL
> responsible for failed retries.

Should the following tag be added then to this commit proposed by Carlos:

Fixes: ac8809f9a ("xfs: abort metadata writeback on permanent errors")

FWIW this was merged as of v3.13. Even though that commit seems to have taken
the failed buffer and kept it, its commit log seems to indicate we already ran
into similar situations before; after this commit we seem to just retry the IO
once, but keep the buffer around on the AIL to allow further modifications of
the buffer.

> A couple caveats I can think of with
> that are that 1.) this would likely short circuit the recently added
> error configuration api (which is kind of irrelevant if the underlying
> error management code doesn't work correctly, but suggests that should
> be reverted as well) and 2.) in doing so, this reverts back to the
> hardcoded infinite retry behavior in XFS. That means that transient
> errors may eventually recover, but the thinp enospc use case and whatnot
> are still going to hang on umount.

Right, and also one of the gains of the patch seems to have been to allow
thinp devices to keep on chugging with modifications on the buffer, so that
would be lost as well. That seems like an optimization though, so it's worth
losing IMHO if it would have resolved the hang. Since that's not the case
though...

> It hasn't seemed necessary to me to take that approach given the lower
> prevalence of the issue

Of this issue? I suppose that's why I asked for examples of issues; I seem to
have found it much more likely out in the wild than expected. It would seem
folks might be working around it somehow.

> and the fact that a solution had been worked out
> for a while now. Though I suppose the longer we go without a fix in
> place, the stronger the argument for something like that becomes.

Luis
On Tue, Jun 20, 2017 at 06:52:04PM +0200, Luis R. Rodriguez wrote: > On Mon, Jun 19, 2017 at 06:59:05AM -0400, Brian Foster wrote: > > On Fri, Jun 16, 2017 at 02:45:13PM -0500, Eric Sandeen wrote: > > > On 6/16/17 2:37 PM, Luis R. Rodriguez wrote: > > > > On Fri, Jun 16, 2017 at 12:24:45PM -0700, Darrick J. Wong wrote: > > > >> On Fri, Jun 16, 2017 at 08:35:10PM +0200, Luis R. Rodriguez wrote: > > > > It seems like a rather large set of changes, if the issue was sevee I was hoping > > > > for a stable candidate fix first. If its not fixing a severe issue then sure. > > > > > > Fixes go uptream first, then to stable kernels if appropriate, right? > > > > > > But not every fix is a trivial one-liner. I don't think there is any simpler > > > fix to be had, here. > > > > > > > Yeah, this is kind of intended to be the simplest fix available as far > > as I'm aware. TBH, I don't think the fix here is fundamentally that > > complex (define a state for already flushed/failed log items), but > > rather the code that needs to be modified to implement it has various > > states and corner cases to manage that make it tricky to get right. > > OK this helps, thanks. > > > If we truly needed a stable worthy fix in short order, that would > > probably be to revert ac8809f9a ("xfs: abort metadata writeback on > > permanent errors"), which caused this regression by making the AIL > > responsible for failed retries. > > Should the following tag be added then to this commit proposed by Carlos: > > Fixes: ac8809f9a ("xfs: abort metadata writeback on permanent errors") > That seems reasonable to me. > FWIW this was merged as of v3.13. Even though that seems to have taken the > failed buffer and kept it the commit log of that change seems to indicate we > already ran into similar situations before, after this commit we seem to just > retry IO once, but keep the buffer around on the AIL, to allow further > modifications of the buffer. > Before that commit, I believe we would retry the metadata writeback indefinitely so long as it failed. If the underlying I/O failure ceased to occur, then this loop ends and the fs proceeds as normal. If not, then the filesystem is probably going to hang. After that commit, we retry once from I/O completion handling and otherwise rely on the AIL to issue the next (indefinite) retry. If the item that failed has a flush lock, we won't actually ever submit the I/O (which is the bug), however, and thus you're toast regardless of whether the I/O would ultimately succeed or not. > > > A couple caveats I can think of with > > that are that 1.) this would likely short circuit the recently added > > error configuration api (which is kind of irrelevant if the underlying > > error management code doesn't work correctly, but suggests that should > > be reverted as well) and 2.) in doing so, this reverts back to the > > hardcoded infinite retry behavior in XFS. That means that transient > > errors may eventually recover, but the thinp enospc use case and whatnot > > are still going to hang on umount. > > Right, and also one of the gains of the patch seems to have been to allow > thinp devices to keep on chugging with modifications on the buffer, so that > would be lost as well. That seems to be like an optimization though so its > worth loosing IMHO if would have resolved the hang. Since that's not the case > though... > I think that's just a secondary effect of unlocking the buffer. Probably not that important if I/Os are failing. 
> > It hasn't seemed necessary to me to take that approach given the lower
> > prevalence of the issue
>
> Of this issue? I suppose that's why I asked for examples of issues; I seem to
> have found it much more likely out in the wild than expected. It would seem
> folks might be working around it somehow.
>

If we're talking about the thin provisioning case, I suspect most people
work around it by properly configuring their storage. ;)

Brian

> > and the fact that a solution had been worked out
> > for a while now. Though I suppose the longer we go without a fix in
> > place, the stronger the argument for something like that becomes.
>
> Luis
On Tue, Jun 20, 2017 at 01:20:41PM -0400, Brian Foster wrote:
> On Tue, Jun 20, 2017 at 06:52:04PM +0200, Luis R. Rodriguez wrote:
> > On Mon, Jun 19, 2017 at 06:59:05AM -0400, Brian Foster wrote:
> > > It hasn't seemed necessary to me to take that approach given the lower
> > > prevalence of the issue
> >
> > Of this issue? I suppose that's why I asked for examples of issues; I seem to
> > have found it much more likely out in the wild than expected. It would seem
> > folks might be working around it somehow.
>
> If we're talking about the thin provisioning case, I suspect most people
> work around it by properly configuring their storage. ;)

The fact that we *hang* makes it more serious, so even if folks misconfigured
storage with less space than needed, that should be no reason to consider
hangs any less severe. Especially if it seems to be a common issue, and I'm
alluding to the fact that this might be more common than the patch describes.

Luis
On Tue, Jun 20, 2017 at 01:20:41PM -0400, Brian Foster wrote:
> On Tue, Jun 20, 2017 at 06:52:04PM +0200, Luis R. Rodriguez wrote:
> > On Mon, Jun 19, 2017 at 06:59:05AM -0400, Brian Foster wrote:
> > > If we truly needed a stable-worthy fix in short order, that would
> > > probably be to revert ac8809f9a ("xfs: abort metadata writeback on
> > > permanent errors"), which caused this regression by making the AIL
> > > responsible for failed retries.
> >
> > Should the following tag be added then to this commit proposed by Carlos:
> >
> > Fixes: ac8809f9a ("xfs: abort metadata writeback on permanent errors")
>
> That seems reasonable to me.

Then in this case I'd like the commit log to also explain *why* the described
fix did not work. It actually describes the issue as having been considered,
"thin provisioned devices that have run out of backing space", and as
recovering. So did recovery never really work? Does recovery actually work for
some cases but not others? If so, why not?

Luis
On Tue, Jun 20, 2017 at 08:05:05PM +0200, Luis R. Rodriguez wrote:
> On Tue, Jun 20, 2017 at 01:20:41PM -0400, Brian Foster wrote:
> > On Tue, Jun 20, 2017 at 06:52:04PM +0200, Luis R. Rodriguez wrote:
> > > On Mon, Jun 19, 2017 at 06:59:05AM -0400, Brian Foster wrote:
> > > > It hasn't seemed necessary to me to take that approach given the lower
> > > > prevalence of the issue
> > >
> > > Of this issue? I suppose that's why I asked for examples of issues; I seem to
> > > have found it much more likely out in the wild than expected. It would seem
> > > folks might be working around it somehow.
> > >
> >
> > If we're talking about the thin provisioning case, I suspect most people
> > work around it by properly configuring their storage. ;)
>
> The fact that we *hang* makes it more serious, so even if folks misconfigured
> storage with less space than needed, that should be no reason to consider
> hangs any less severe. Especially if it seems to be a common issue, and I'm
> alluding to the fact that this might be more common than the patch describes.
>

My point is simply that a hang was a likely outcome before the patch
that introduced the regression as well, so the benefit of doing a proper
revert doesn't clearly outweigh the cost. Despite what the side effect
is, the fact that this tends to primarily affect users who have thin
volumes going inactive also suggests that this can be worked around
reasonably well enough via storage configuration. This all suggests to
me that Carlos' current approach is the most reasonable one.

I'm not following what the line of argument is here. Are you suggesting
a different approach? If so, based on what use case/reasoning?

Brian

> Luis
Hi Luis

> Curious: since you can reproduce, what happens if you do a hard reset on the
> system when this triggers, once it boots back up? I'd guess it recovers, but I
> just want to be sure.
>

Just for context, the problem with the stuck unmounts happens because the
items in the AIL can't be written back to their specific locations on the disk
due to lack of real space. But, instead of the FS being shut down when somebody
tries to unmount, or the buffer being permanently failed when we try to write
it back (if XFS is configured to fail at some point), xfsaild keeps spinning
around on such buffers, because the items are flush locked and they are not
even retried at all.

Given this brief context, and now answering your question regarding a hard
reset: when you hard reset the system in such a state, after the system comes
back alive, the filesystem in question will be unmountable, because the
journal will be dirty and XFS won't be able to replay it during the mount, due
to lack of space in the physical device:

# mount <volume> /mnt
[ 91.843429] XFS (dm-5): Mounting V5 Filesystem
[ 91.864321] device-mapper: thin: 253:2: reached low water mark for data device: sending event.
[ 91.889451] device-mapper: thin: 253:2: switching pool to out-of-data-space (error IO) mode
[ 91.890821] XFS (dm-5): xfs_do_force_shutdown(0x1) called from line 1201 of file fs/xfs/xfs_buf.c. Return address = 0xffffffff813bb416
[ 91.893590] XFS (dm-5): I/O Error Detected. Shutting down filesystem
[ 91.894813] XFS (dm-5): Please umount the filesystem and rectify the problem(s)
[ 91.896158] XFS (dm-5): metadata I/O error: block 0x31f80 ("xlog_bwrite") error 28 numblks 4096
[ 91.902234] XFS (dm-5): failed to locate log tail
[ 91.902920] XFS (dm-5): log mount/recovery failed: error -28
[ 91.903712] XFS (dm-5): log mount failed
mount: mount <volume> on /mnt failed: No space left on device

Although, simply by expanding the thin pool, everything comes back to normal
again:

# lvextend -L +500M <POOL>
# mount <volume> /mnt
[ 248.935258] XFS (dm-5): Mounting V5 Filesystem
[ 248.954288] XFS (dm-5): Starting recovery (logdev: internal)
[ 248.985238] XFS (dm-5): Ending recovery (logdev: internal)
On Wed, Jun 21, 2017 at 06:10:52AM -0400, Brian Foster wrote:
> On Tue, Jun 20, 2017 at 08:05:05PM +0200, Luis R. Rodriguez wrote:
> > On Tue, Jun 20, 2017 at 01:20:41PM -0400, Brian Foster wrote:
> > > On Tue, Jun 20, 2017 at 06:52:04PM +0200, Luis R. Rodriguez wrote:
> > > > On Mon, Jun 19, 2017 at 06:59:05AM -0400, Brian Foster wrote:
> > > > > It hasn't seemed necessary to me to take that approach given the lower
> > > > > prevalence of the issue
> > > >
> > > > Of this issue? I suppose that's why I asked for examples of issues; I seem to
> > > > have found it much more likely out in the wild than expected. It would seem
> > > > folks might be working around it somehow.
> > > >
> > >
> > > If we're talking about the thin provisioning case, I suspect most people
> > > work around it by properly configuring their storage. ;)
> >
> > The fact that we *hang* makes it more serious, so even if folks misconfigured
> > storage with less space than needed, that should be no reason to consider
> > hangs any less severe. Especially if it seems to be a common issue, and I'm
> > alluding to the fact that this might be more common than the patch describes.
> >
>
> My point is simply that a hang was a likely outcome before the patch
> that introduced the regression as well, so the benefit of doing a proper
> revert doesn't clearly outweigh the cost.

Sure, agreed.

> Despite what the side effect
> is, the fact that this tends to primarily affect users who have thin
> volumes going inactive also suggests that this can be worked around
> reasonably well enough via storage configuration. This all suggests to
> me that Carlos' current approach is the most reasonable one.

OK thanks.

> I'm not following what the line of argument is here. Are you suggesting
> a different approach? If so, based on what use case/reasoning?

No, it just seemed to me you were indicating that the hang was not that
serious an issue given you could work around it with proper storage
configuration. I see now you were using that analogy just to indicate it was
also an issue before, so the revert has merit.

Luis
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 4fa68c9..c5d21ea 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -1222,3 +1222,40 @@ xfs_buf_iodone(
 	xfs_trans_ail_delete(ailp, lip, SHUTDOWN_CORRUPT_INCORE);
 	xfs_buf_item_free(BUF_ITEM(lip));
 }
+
+/*
+ * Requeue a failed buffer for writeback
+ *
+ * Return true if the buffer has been re-queued properly, false otherwise
+ */
+bool
+xfs_buf_resubmit_failed_buffers(
+	struct xfs_log_item	*lip,
+	struct list_head	*buffer_list)
+{
+	struct xfs_log_item	*next;
+	struct xfs_buf		*bp;
+	bool			ret;
+
+	bp = lip->li_buf;
+
+	/*
+	 * Clear XFS_LI_FAILED flag from all items before resubmitting.
+	 *
+	 * XFS_LI_FAILED set/clear is protected by xa_lock; the caller
+	 * of this function already holds it.
+	 */
+	for (; lip; lip = next) {
+		next = lip->li_bio_list;
+		xfs_clear_li_failed(lip);
+	}
+
+	/* Add this buffer back to the delayed write list */
+	if (!xfs_buf_trylock(bp))
+		return false;	/* can't block here: xa_lock is held */
+
+	ret = xfs_buf_delwri_queue(bp, buffer_list);
+	xfs_buf_unlock(bp);
+
+	return ret;
+}
diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
index f7eba99..82c3d64 100644
--- a/fs/xfs/xfs_buf_item.h
+++ b/fs/xfs/xfs_buf_item.h
@@ -70,6 +70,7 @@ void	xfs_buf_attach_iodone(struct xfs_buf *,
 			      xfs_log_item_t *);
 void	xfs_buf_iodone_callbacks(struct xfs_buf *);
 void	xfs_buf_iodone(struct xfs_buf *, struct xfs_log_item *);
+bool	xfs_buf_resubmit_failed_buffers(struct xfs_log_item *, struct list_head *);
 
 extern kmem_zone_t	*xfs_buf_item_zone;
 
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 08cb7d1..2719ac6 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -27,6 +27,7 @@
 #include "xfs_error.h"
 #include "xfs_trace.h"
 #include "xfs_trans_priv.h"
+#include "xfs_buf_item.h"
 #include "xfs_log.h"
 
 
@@ -475,6 +476,18 @@ xfs_inode_item_unpin(
 		wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);
 }
 
+/*
+ * Callback used to mark a buffer with XFS_LI_FAILED when items in the buffer
+ * have been failed during writeback
+ */
+STATIC void
+xfs_inode_item_error(
+	struct xfs_log_item	*lip,
+	struct xfs_buf		*bp)
+{
+	xfs_set_li_failed(lip, bp);
+}
+
 STATIC uint
 xfs_inode_item_push(
 	struct xfs_log_item	*lip,
@@ -491,6 +504,17 @@ xfs_inode_item_push(
 	if (xfs_ipincount(ip) > 0)
 		return XFS_ITEM_PINNED;
 
+	/*
+	 * The buffer containing this item failed to be written back
+	 * previously. Resubmit the buffer for IO.
+	 */
+	if (lip->li_flags & XFS_LI_FAILED) {
+		if (!xfs_buf_resubmit_failed_buffers(lip, buffer_list))
+			rval = XFS_ITEM_FLUSHING;
+
+		return rval;
+	}
+
 	if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
 		return XFS_ITEM_LOCKED;
 
@@ -622,7 +646,8 @@ static const struct xfs_item_ops xfs_inode_item_ops = {
 	.iop_unlock	= xfs_inode_item_unlock,
 	.iop_committed	= xfs_inode_item_committed,
 	.iop_push	= xfs_inode_item_push,
-	.iop_committing = xfs_inode_item_committing
+	.iop_committing = xfs_inode_item_committing,
+	.iop_error	= xfs_inode_item_error
 };
 
 
@@ -710,7 +735,8 @@ xfs_iflush_done(
 		 * the AIL lock.
 		 */
		iip = INODE_ITEM(blip);
-		if (iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn)
+		if ((iip->ili_logged && blip->li_lsn == iip->ili_flush_lsn) ||
+		    blip->li_flags & XFS_LI_FAILED)
 			need_ail++;
 
 		blip = next;
@@ -718,7 +744,8 @@
 
 	/* make sure we capture the state of the initial inode. */
 	iip = INODE_ITEM(lip);
-	if (iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn)
+	if ((iip->ili_logged && lip->li_lsn == iip->ili_flush_lsn) ||
+	    lip->li_flags & XFS_LI_FAILED)
 		need_ail++;
 
 	/*
@@ -739,6 +766,9 @@ xfs_iflush_done(
 			if (INODE_ITEM(blip)->ili_logged &&
 			    blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn)
 				mlip_changed |= xfs_ail_delete_one(ailp, blip);
+			else {
+				xfs_clear_li_failed(blip);
+			}
 		}
 
 		if (mlip_changed) {
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 50df5367..2d7cf91 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -49,6 +49,7 @@ typedef struct xfs_log_item {
 	struct xfs_ail			*li_ailp;	/* ptr to AIL */
 	uint				li_type;	/* item type */
 	uint				li_flags;	/* misc flags */
+	struct xfs_buf			*li_buf;	/* real buffer pointer */
 	struct xfs_log_item		*li_bio_list;	/* buffer item list */
 	void				(*li_cb)(struct xfs_buf *,
 						 struct xfs_log_item *);
@@ -72,6 +73,31 @@ typedef struct xfs_log_item {
 	{ XFS_LI_ABORTED,	"ABORTED" }, \
 	{ XFS_LI_FAILED,	"FAILED" }
 
+static inline void
+xfs_clear_li_failed(
+	struct xfs_log_item	*lip)
+{
+	struct xfs_buf	*bp = lip->li_buf;
+
+	if (lip->li_flags & XFS_LI_FAILED) {
+		lip->li_flags &= ~XFS_LI_FAILED;
+		lip->li_buf = NULL;
+		xfs_buf_rele(bp);
+	}
+}
+
+static inline void
+xfs_set_li_failed(
+	struct xfs_log_item	*lip,
+	struct xfs_buf		*bp)
+{
+	if (!(lip->li_flags & XFS_LI_FAILED)) {
+		xfs_buf_hold(bp);
+		lip->li_flags |= XFS_LI_FAILED;
+		lip->li_buf = bp;
+	}
+}
+
 struct xfs_item_ops {
 	void (*iop_size)(xfs_log_item_t *, int *, int *);
 	void (*iop_format)(xfs_log_item_t *, struct xfs_log_vec *);
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 9056c0f..c41d640 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -687,15 +687,15 @@ xfs_trans_ail_update_bulk(
 bool
 xfs_ail_delete_one(
 	struct xfs_ail		*ailp,
-	struct xfs_log_item *lip)
+	struct xfs_log_item	*lip)
 {
 	struct xfs_log_item	*mlip = xfs_ail_min(ailp);
 
 	trace_xfs_ail_delete(lip, mlip->li_lsn, lip->li_lsn);
 	xfs_ail_delete(ailp, lip);
 	lip->li_flags &= ~XFS_LI_IN_AIL;
+	xfs_clear_li_failed(lip);
 	lip->li_lsn = 0;
-
 	return mlip == lip;
 }
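A note on the locking the new helper relies on: iop_push() is invoked by
xfsaild with ->xa_lock held, and the buffers collected on the buffer list
are only submitted after the lock is dropped, which is why
xfs_clear_li_failed() needs no extra serialization inside
xfs_buf_resubmit_failed_buffers(). Roughly, heavily condensed from
xfsaild_push() in fs/xfs/xfs_trans_ail.c (illustrative sketch, not the
verbatim source):

static long
xfsaild_push(
	struct xfs_ail		*ailp)
{
	struct xfs_ail_cursor	cur;
	struct xfs_log_item	*lip;

	spin_lock(&ailp->xa_lock);
	lip = xfs_trans_ail_cursor_first(ailp, &cur, ailp->xa_last_pushed_lsn);
	while (lip != NULL) {
		/*
		 * iop_push() runs under xa_lock; for a failed inode item
		 * it now lands in xfs_buf_resubmit_failed_buffers().
		 */
		lip->li_ops->iop_push(lip, &ailp->xa_buf_list);
		lip = xfs_trans_ail_cursor_next(ailp, &cur);
	}
	xfs_trans_ail_cursor_done(&cur);
	spin_unlock(&ailp->xa_lock);

	/* queued buffers are submitted only after the lock is dropped */
	xfs_buf_delwri_submit_nowait(&ailp->xa_buf_list);
	return 0;
}

This is also why the resubmit helper takes the buffer lock with
xfs_buf_trylock() instead of blocking: sleeping on the buffer lock while
holding the AIL spinlock is not allowed.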
When a buffer fails during writeback, the inode items attached to it are
kept flush locked and are never resubmitted because of the flush lock. So,
if any buffer fails to be written back, the corresponding items in the AIL
are never written to disk and never unlocked.

This causes the unmount operation to hang due to these flush-locked items
in the AIL, but it also causes the items in the AIL to never be written
back, even when the IO device comes back to normal.

I've been testing this patch with a DM-thin device, creating a filesystem
larger than the real device.

When writing enough data to fill the DM-thin device, XFS receives ENOSPC
errors from the device and keeps spinning in xfsaild (when the 'retry
forever' configuration is set).

At this point, the filesystem cannot be unmounted because of the
flush-locked items in the AIL, but worse, the items in the AIL are never
retried at all (since xfs_inode_item_push() skips items that are flush
locked), even if the underlying DM-thin device is expanded to the proper
size.

This patch fixes both cases, retrying any item that has failed previously,
using the infrastructure provided by the previous patch.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
---
V2:
	- Fix XFS_LI_FAILED flag removal
	- Use atomic operations to set and clear the XFS_LI_FAILED flag
	- Remove the check for XBF_WRITE_FAIL in xfs_inode_item_push
	- Add more comments to the code
	- Add a helper function to resubmit the failed buffers, so this
	  can also be used in the dquot system without duplicating code

V3:
	- Kill the xfs_imap_to_bp() call by using a pointer in the log
	  item to hold the buffer address
	- Use xa_lock instead of atomic operations to handle log item
	  flags
	- Take a hold on the buffer for each failed log item
	- Move buffer resubmission up in xfs_inode_item_push()

V4:
	- Remove the bflags argument from the iop_error callback
	- Remove the ip argument from xfs_buf_resubmit_failed_buffers
	- Use helpers to set/clear the XFS_LI_FAILED flag
	- Remove ->xa_lock from the iop->error callback and move it up
	  on the stack, so all log items are processed in a single pair
	  of lock/unlock

 fs/xfs/xfs_buf_item.c   | 37 +++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_buf_item.h   |  1 +
 fs/xfs/xfs_inode_item.c | 36 +++++++++++++++++++++++++++++++++---
 fs/xfs/xfs_trans.h      | 26 ++++++++++++++++++++++++++
 fs/xfs/xfs_trans_ail.c  |  4 ++--
 5 files changed, 99 insertions(+), 5 deletions(-)
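For reference, the "infrastructure provided by the previous patch" boils
down to an iop_error() callback invoked from the buffer I/O completion path
when a write fails and is going to be retried. Roughly, condensed from
patch 1 of this series (illustrative sketch, not the verbatim hunk):

STATIC void
xfs_buf_do_callbacks_fail(
	struct xfs_buf		*bp)
{
	struct xfs_log_item	*next;
	struct xfs_log_item	*lip = bp->b_fspriv;
	struct xfs_ail		*ailp = lip->li_ailp;

	/* one lock/unlock pair for all attached items, per the V4 notes */
	spin_lock(&ailp->xa_lock);
	for (; lip; lip = next) {
		next = lip->li_bio_list;
		if (lip->li_ops->iop_error)
			lip->li_ops->iop_error(lip, bp);
	}
	spin_unlock(&ailp->xa_lock);
}

With that in place, this patch only has to mark inode items failed in
->iop_error() and resubmit their backing buffer from ->iop_push().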