Message ID | 20170511135733.21765-3-cmaiolino@redhat.com (mailing list archive) |
---|---|
State | Superseded, archived |
On 5/11/17 8:57 AM, Carlos Maiolino wrote: > When a buffer has been failed during writeback, the inode items into it > are kept flush locked, and are never resubmitted due the flush lock, so, > if any buffer fails to be written, the items in AIL are never written to > disk and never unlocked. > > This causes a filesystem to be unmountable due these items flush locked I think you mean "not unmountable?" > in AIL, but this also causes the items in AIL to never be written back, > even when the IO device comes back to normal. > > I've been testing this patch with a DM-thin device, creating a > filesystem larger than the real device. > > When writing enough data to fill the DM-thin device, XFS receives ENOSPC > errors from the device, and keep spinning on xfsaild (when 'retry > forever' configuration is set). > > At this point, the filesystem is unmountable because of the flush locked (or cannot be unmounted ...) > items in AIL, but worse, the items in AIL are never retried at all > (once xfs_inode_item_push() will skip the items that are flush locked), > even if the underlying DM-thin device is expanded to the proper size. Can you turn that into an xfstest? > This patch fixes both cases, retrying any item that has been failed > previously, using the infra-structure provided by the previous patch. > > Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> > --- > > This same problem is also possible in dquot code, but the fix is almost > identical. > > I am not submitting a fix for dquot yet to avoid the need to create VX for both > patches, once we agree with the solution, I'll submit a fix to dquot. 
> > fs/xfs/xfs_inode_item.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++- > 1 file changed, 53 insertions(+), 1 deletion(-) > > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c > index 08cb7d1..583fa9e 100644 > --- a/fs/xfs/xfs_inode_item.c > +++ b/fs/xfs/xfs_inode_item.c > @@ -475,6 +475,21 @@ xfs_inode_item_unpin( > wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT); > } > > +STATIC void > +xfs_inode_item_error( > + struct xfs_log_item *lip, > + unsigned int bflags) > +{ > + > + /* > + * The buffer writeback containing this inode has been failed > + * mark it as failed and unlock the flush lock, so it can be retried > + * again > + */ > + if (bflags & XBF_WRITE_FAIL) > + lip->li_flags |= XFS_LI_FAILED; > +} > + > STATIC uint > xfs_inode_item_push( > struct xfs_log_item *lip, > @@ -517,8 +532,44 @@ xfs_inode_item_push( > * the AIL. > */ > if (!xfs_iflock_nowait(ip)) { Some comments about what this new block is for would be helpful, I think. > + if (lip->li_flags & XFS_LI_FAILED) { > + > + struct xfs_dinode *dip; > + struct xfs_log_item *next; > + int error; > + > + error = xfs_imap_to_bp(ip->i_mount, NULL, &ip->i_imap, > + &dip, &bp, XBF_TRYLOCK, 0); > + > + if (error) { > + rval = XFS_ITEM_FLUSHING; > + goto out_unlock; > + } > + > + if (!(bp->b_flags & XBF_WRITE_FAIL)) { > + rval = XFS_ITEM_FLUSHING; > + xfs_buf_relse(bp); > + goto out_unlock; > + } > + > + while (lip != NULL) { > + next = lip->li_bio_list; > + > + if (lip->li_flags & XFS_LI_FAILED) > + lip->li_flags &= XFS_LI_FAILED; This confuses me. If XFS_LI_FAILED is set, set XFS_LI_FAILED? I assume you meant to clear it? > + lip = next; > + } > + /* Add this buffer back to the delayed write list */ > + if (!xfs_buf_delwri_queue(bp, buffer_list)) > + rval = XFS_ITEM_FLUSHING; > + xfs_buf_relse(bp); So by here we have an implicit rval = XFS_ITEM_SUCCESS, I guess? 
(I wonder about setting FLUSHING at the top, and setting SUCCESS only if everything in here works out - but maybe that would be more confusing) Anyway that's my first drive-by review, I'm not sure I have all the state & locking clear in my head for this stuff. Thanks, -Eric > + goto out_unlock; > + } > + > rval = XFS_ITEM_FLUSHING; > goto out_unlock; > + > } > > ASSERT(iip->ili_fields != 0 || XFS_FORCED_SHUTDOWN(ip->i_mount)); > @@ -622,7 +673,8 @@ static const struct xfs_item_ops xfs_inode_item_ops = { > .iop_unlock = xfs_inode_item_unlock, > .iop_committed = xfs_inode_item_committed, > .iop_push = xfs_inode_item_push, > - .iop_committing = xfs_inode_item_committing > + .iop_committing = xfs_inode_item_committing, > + .iop_error = xfs_inode_item_error > }; > > > -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, May 11, 2017 at 03:57:33PM +0200, Carlos Maiolino wrote: > When a buffer has been failed during writeback, the inode items into it > are kept flush locked, and are never resubmitted due the flush lock, so, > if any buffer fails to be written, the items in AIL are never written to > disk and never unlocked. > > This causes a filesystem to be unmountable due these items flush locked > in AIL, but this also causes the items in AIL to never be written back, > even when the IO device comes back to normal. > > I've been testing this patch with a DM-thin device, creating a > filesystem larger than the real device. > > When writing enough data to fill the DM-thin device, XFS receives ENOSPC > errors from the device, and keep spinning on xfsaild (when 'retry > forever' configuration is set). > > At this point, the filesystem is unmountable because of the flush locked > items in AIL, but worse, the items in AIL are never retried at all > (once xfs_inode_item_push() will skip the items that are flush locked), > even if the underlying DM-thin device is expanded to the proper size. > > This patch fixes both cases, retrying any item that has been failed > previously, using the infra-structure provided by the previous patch. > > Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> > --- > > This same problem is also possible in dquot code, but the fix is almost > identical. > > I am not submitting a fix for dquot yet to avoid the need to create VX for both > patches, once we agree with the solution, I'll submit a fix to dquot. 
> > fs/xfs/xfs_inode_item.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++- > 1 file changed, 53 insertions(+), 1 deletion(-) > > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c > index 08cb7d1..583fa9e 100644 > --- a/fs/xfs/xfs_inode_item.c > +++ b/fs/xfs/xfs_inode_item.c > @@ -475,6 +475,21 @@ xfs_inode_item_unpin( > wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT); > } > > +STATIC void > +xfs_inode_item_error( > + struct xfs_log_item *lip, > + unsigned int bflags) > +{ > + > + /* > + * The buffer writeback containing this inode has been failed > + * mark it as failed and unlock the flush lock, so it can be retried > + * again > + */ > + if (bflags & XBF_WRITE_FAIL) > + lip->li_flags |= XFS_LI_FAILED; > +} > + > STATIC uint > xfs_inode_item_push( > struct xfs_log_item *lip, > @@ -517,8 +532,44 @@ xfs_inode_item_push( > * the AIL. > */ > if (!xfs_iflock_nowait(ip)) { > + if (lip->li_flags & XFS_LI_FAILED) { > + > + struct xfs_dinode *dip; > + struct xfs_log_item *next; > + int error; > + > + error = xfs_imap_to_bp(ip->i_mount, NULL, &ip->i_imap, > + &dip, &bp, XBF_TRYLOCK, 0); > + > + if (error) { > + rval = XFS_ITEM_FLUSHING; > + goto out_unlock; > + } > + > + if (!(bp->b_flags & XBF_WRITE_FAIL)) { > + rval = XFS_ITEM_FLUSHING; > + xfs_buf_relse(bp); > + goto out_unlock; > + } > + > + while (lip != NULL) { > + next = lip->li_bio_list; > + > + if (lip->li_flags & XFS_LI_FAILED) > + lip->li_flags &= XFS_LI_FAILED; Eric already pointed out that you probably intend to clear the flag here..? > + lip = next; > + } This whole hunk might be better off in a helper function (with the comment Eric suggested as well). Those points and the ->iop_error() thing aside, this otherwise seems Ok to me. 
Brian > + > + if (!xfs_buf_delwri_queue(bp, buffer_list)) > + rval = XFS_ITEM_FLUSHING; > + > + xfs_buf_relse(bp); > + goto out_unlock; > + } > + > rval = XFS_ITEM_FLUSHING; > goto out_unlock; > + > } > > ASSERT(iip->ili_fields != 0 || XFS_FORCED_SHUTDOWN(ip->i_mount)); > @@ -622,7 +673,8 @@ static const struct xfs_item_ops xfs_inode_item_ops = { > .iop_unlock = xfs_inode_item_unlock, > .iop_committed = xfs_inode_item_committed, > .iop_push = xfs_inode_item_push, > - .iop_committing = xfs_inode_item_committing > + .iop_committing = xfs_inode_item_committing, > + .iop_error = xfs_inode_item_error > }; > > > -- > 2.9.3 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Ahoj! On Thu, May 11, 2017 at 10:32:16AM -0500, Eric Sandeen wrote: > On 5/11/17 8:57 AM, Carlos Maiolino wrote: > > When a buffer has been failed during writeback, the inode items into it > > are kept flush locked, and are never resubmitted due the flush lock, so, > > if any buffer fails to be written, the items in AIL are never written to > > disk and never unlocked. > > > > This causes a filesystem to be unmountable due these items flush locked > > I think you mean "not unmountable?" > Yeah, my bad, fast typing slow thinking :) > > in AIL, but this also causes the items in AIL to never be written back, > > even when the IO device comes back to normal. > > > > I've been testing this patch with a DM-thin device, creating a > > filesystem larger than the real device. > > > > When writing enough data to fill the DM-thin device, XFS receives ENOSPC > > errors from the device, and keep spinning on xfsaild (when 'retry > > forever' configuration is set). > > > > At this point, the filesystem is unmountable because of the flush locked > > (or cannot be unmounted ...) > *nod* > > items in AIL, but worse, the items in AIL are never retried at all > > (once xfs_inode_item_push() will skip the items that are flush locked), > > even if the underlying DM-thin device is expanded to the proper size. > > Can you turn that into an xfstest? > Yeah, I am planing to do that, this is really not that hard to move into an xfstests case, although, it will be going into dangerous sub-set, once it will lockup the filesystem. > > This patch fixes both cases, retrying any item that has been failed > > previously, using the infra-structure provided by the previous patch. > > > > Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> > > --- > > > > This same problem is also possible in dquot code, but the fix is almost > > identical. > > > > I am not submitting a fix for dquot yet to avoid the need to create VX for both > > patches, once we agree with the solution, I'll submit a fix to dquot. 
> > > > fs/xfs/xfs_inode_item.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++- > > 1 file changed, 53 insertions(+), 1 deletion(-) > > > > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c > > index 08cb7d1..583fa9e 100644 > > --- a/fs/xfs/xfs_inode_item.c > > +++ b/fs/xfs/xfs_inode_item.c > > @@ -475,6 +475,21 @@ xfs_inode_item_unpin( > > wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT); > > } > > > > +STATIC void > > +xfs_inode_item_error( > > + struct xfs_log_item *lip, > > + unsigned int bflags) > > +{ > > + > > + /* > > + * The buffer writeback containing this inode has been failed > > + * mark it as failed and unlock the flush lock, so it can be retried > > + * again > > + */ > > + if (bflags & XBF_WRITE_FAIL) > > + lip->li_flags |= XFS_LI_FAILED; > > +} > > + > > STATIC uint > > xfs_inode_item_push( > > struct xfs_log_item *lip, > > @@ -517,8 +532,44 @@ xfs_inode_item_push( > > * the AIL. > > */ > > if (!xfs_iflock_nowait(ip)) { > > Some comments about what this new block is for would be helpful, I think. > /me replies on Brian's comment > > + if (lip->li_flags & XFS_LI_FAILED) { > > + > > + struct xfs_dinode *dip; > > + struct xfs_log_item *next; > > + int error; > > + > > + error = xfs_imap_to_bp(ip->i_mount, NULL, &ip->i_imap, > > + &dip, &bp, XBF_TRYLOCK, 0); > > + > > + if (error) { > > + rval = XFS_ITEM_FLUSHING; > > + goto out_unlock; > > + } > > + > > + if (!(bp->b_flags & XBF_WRITE_FAIL)) { > > + rval = XFS_ITEM_FLUSHING; > > + xfs_buf_relse(bp); > > + goto out_unlock; > > + } > > + > > + while (lip != NULL) { > > + next = lip->li_bio_list; > > + > > + if (lip->li_flags & XFS_LI_FAILED) > > + lip->li_flags &= XFS_LI_FAILED; > > This confuses me. If XFS_LI_FAILED is set, set XFS_LI_FAILED? > I assume you meant to clear it? 
> *nod* fix going to V2 > > + lip = next; > > + } > > + > > /* Add this buffer back to the delayed write list */ > > > + if (!xfs_buf_delwri_queue(bp, buffer_list)) > > + rval = XFS_ITEM_FLUSHING; > > > + xfs_buf_relse(bp); > > So by here we have an implicit rval = XFS_ITEM_SUCCESS, I guess? > AFAIK this is the current behavior of xfs_inode_item_push() without my patch, at a first glance it looked weird to me too, but then I just decided to leave it as-is. > (I wonder about setting FLUSHING at the top, and setting SUCCESS > only if everything in here works out - but maybe that would be > more confusing) > > Anyway that's my first drive-by review, I'm not sure I have all the state & > locking clear in my head for this stuff. > I really appreciate the review, thanks for your time :) > Thanks, > -Eric > > > + goto out_unlock; > > + } > > + > > rval = XFS_ITEM_FLUSHING; > > goto out_unlock; > > + > > } > > > > ASSERT(iip->ili_fields != 0 || XFS_FORCED_SHUTDOWN(ip->i_mount)); > > @@ -622,7 +673,8 @@ static const struct xfs_item_ops xfs_inode_item_ops = { > > .iop_unlock = xfs_inode_item_unlock, > > .iop_committed = xfs_inode_item_committed, > > .iop_push = xfs_inode_item_push, > > - .iop_committing = xfs_inode_item_committing > > + .iop_committing = xfs_inode_item_committing, > > + .iop_error = xfs_inode_item_error > > }; > > > > > >
On Thu, May 11, 2017 at 01:08:05PM -0400, Brian Foster wrote: > On Thu, May 11, 2017 at 03:57:33PM +0200, Carlos Maiolino wrote: > > When a buffer has been failed during writeback, the inode items into it > > are kept flush locked, and are never resubmitted due the flush lock, so, > > if any buffer fails to be written, the items in AIL are never written to > > disk and never unlocked. > > > > This causes a filesystem to be unmountable due these items flush locked > > in AIL, but this also causes the items in AIL to never be written back, > > even when the IO device comes back to normal. > > > > I've been testing this patch with a DM-thin device, creating a > > filesystem larger than the real device. > > > > When writing enough data to fill the DM-thin device, XFS receives ENOSPC > > errors from the device, and keep spinning on xfsaild (when 'retry > > forever' configuration is set). > > > > At this point, the filesystem is unmountable because of the flush locked > > items in AIL, but worse, the items in AIL are never retried at all > > (once xfs_inode_item_push() will skip the items that are flush locked), > > even if the underlying DM-thin device is expanded to the proper size. > > > > This patch fixes both cases, retrying any item that has been failed > > previously, using the infra-structure provided by the previous patch. > > > > Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> > > --- > > > > This same problem is also possible in dquot code, but the fix is almost > > identical. > > > > I am not submitting a fix for dquot yet to avoid the need to create VX for both > > patches, once we agree with the solution, I'll submit a fix to dquot. 
> > > > fs/xfs/xfs_inode_item.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++- > > 1 file changed, 53 insertions(+), 1 deletion(-) > > > > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c > > index 08cb7d1..583fa9e 100644 > > --- a/fs/xfs/xfs_inode_item.c > > +++ b/fs/xfs/xfs_inode_item.c > > @@ -475,6 +475,21 @@ xfs_inode_item_unpin( > > wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT); > > } > > > > +STATIC void > > +xfs_inode_item_error( > > + struct xfs_log_item *lip, > > + unsigned int bflags) > > +{ > > + > > + /* > > + * The buffer writeback containing this inode has been failed > > + * mark it as failed and unlock the flush lock, so it can be retried > > + * again > > + */ > > + if (bflags & XBF_WRITE_FAIL) > > + lip->li_flags |= XFS_LI_FAILED; > > +} > > + > > STATIC uint > > xfs_inode_item_push( > > struct xfs_log_item *lip, > > @@ -517,8 +532,44 @@ xfs_inode_item_push( > > * the AIL. > > */ > > if (!xfs_iflock_nowait(ip)) { > > + if (lip->li_flags & XFS_LI_FAILED) { > > + > > + struct xfs_dinode *dip; > > + struct xfs_log_item *next; > > + int error; > > + > > + error = xfs_imap_to_bp(ip->i_mount, NULL, &ip->i_imap, > > + &dip, &bp, XBF_TRYLOCK, 0); > > + > > + if (error) { > > + rval = XFS_ITEM_FLUSHING; > > + goto out_unlock; > > + } > > + > > + if (!(bp->b_flags & XBF_WRITE_FAIL)) { > > + rval = XFS_ITEM_FLUSHING; > > + xfs_buf_relse(bp); > > + goto out_unlock; > > + } > > + > > + while (lip != NULL) { > > + next = lip->li_bio_list; > > + > > + if (lip->li_flags & XFS_LI_FAILED) > > + lip->li_flags &= XFS_LI_FAILED; > > Eric already pointed out that you probably intend to clear the flag > here..? > Yup, my bad. > > + lip = next; > > + } > > This whole hunk might be better off in a helper function (with the > comment Eric suggested as well). > Agreed, a helper function can be used here and in dquot code as well, so I agree that a helper function can be useful, I'll try to make it a common code for both dquot and inode items. 
> Those points and the ->iop_error() thing aside, this otherwise seems Ok > to me. > > Brian > > > + > > + if (!xfs_buf_delwri_queue(bp, buffer_list)) > > + rval = XFS_ITEM_FLUSHING; > > + > > + xfs_buf_relse(bp); > > + goto out_unlock; > > + } > > + > > rval = XFS_ITEM_FLUSHING; > > goto out_unlock; > > + > > } > > > > ASSERT(iip->ili_fields != 0 || XFS_FORCED_SHUTDOWN(ip->i_mount)); > > @@ -622,7 +673,8 @@ static const struct xfs_item_ops xfs_inode_item_ops = { > > .iop_unlock = xfs_inode_item_unlock, > > .iop_committed = xfs_inode_item_committed, > > .iop_push = xfs_inode_item_push, > > - .iop_committing = xfs_inode_item_committing > > + .iop_committing = xfs_inode_item_committing, > > + .iop_error = xfs_inode_item_error > > }; > > > > > > -- > > 2.9.3 > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, May 12, 2017 at 10:21:35AM +0200, Carlos Maiolino wrote: > On Thu, May 11, 2017 at 01:08:05PM -0400, Brian Foster wrote: > > On Thu, May 11, 2017 at 03:57:33PM +0200, Carlos Maiolino wrote: > > > When a buffer has been failed during writeback, the inode items into it > > > are kept flush locked, and are never resubmitted due the flush lock, so, > > > if any buffer fails to be written, the items in AIL are never written to > > > disk and never unlocked. > > > > > > This causes a filesystem to be unmountable due these items flush locked > > > in AIL, but this also causes the items in AIL to never be written back, > > > even when the IO device comes back to normal. > > > > > > I've been testing this patch with a DM-thin device, creating a > > > filesystem larger than the real device. > > > > > > When writing enough data to fill the DM-thin device, XFS receives ENOSPC > > > errors from the device, and keep spinning on xfsaild (when 'retry > > > forever' configuration is set). > > > > > > At this point, the filesystem is unmountable because of the flush locked > > > items in AIL, but worse, the items in AIL are never retried at all > > > (once xfs_inode_item_push() will skip the items that are flush locked), > > > even if the underlying DM-thin device is expanded to the proper size. > > > > > > This patch fixes both cases, retrying any item that has been failed > > > previously, using the infra-structure provided by the previous patch. > > > > > > Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> > > > --- > > > > > > This same problem is also possible in dquot code, but the fix is almost > > > identical. > > > > > > I am not submitting a fix for dquot yet to avoid the need to create VX for both > > > patches, once we agree with the solution, I'll submit a fix to dquot. 
> > > > > > fs/xfs/xfs_inode_item.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++- > > > 1 file changed, 53 insertions(+), 1 deletion(-) > > > > > > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c > > > index 08cb7d1..583fa9e 100644 > > > --- a/fs/xfs/xfs_inode_item.c > > > +++ b/fs/xfs/xfs_inode_item.c > > > @@ -475,6 +475,21 @@ xfs_inode_item_unpin( > > > wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT); > > > } > > > > > > +STATIC void > > > +xfs_inode_item_error( > > > + struct xfs_log_item *lip, > > > + unsigned int bflags) > > > +{ > > > + > > > + /* > > > + * The buffer writeback containing this inode has been failed > > > + * mark it as failed and unlock the flush lock, so it can be retried > > > + * again > > > + */ > > > + if (bflags & XBF_WRITE_FAIL) > > > + lip->li_flags |= XFS_LI_FAILED; > > > +} > > > + > > > STATIC uint > > > xfs_inode_item_push( > > > struct xfs_log_item *lip, > > > @@ -517,8 +532,44 @@ xfs_inode_item_push( > > > * the AIL. > > > */ > > > if (!xfs_iflock_nowait(ip)) { > > > + if (lip->li_flags & XFS_LI_FAILED) { > > > + > > > + struct xfs_dinode *dip; > > > + struct xfs_log_item *next; > > > + int error; > > > + > > > + error = xfs_imap_to_bp(ip->i_mount, NULL, &ip->i_imap, > > > + &dip, &bp, XBF_TRYLOCK, 0); > > > + > > > + if (error) { > > > + rval = XFS_ITEM_FLUSHING; > > > + goto out_unlock; > > > + } > > > + > > > + if (!(bp->b_flags & XBF_WRITE_FAIL)) { > > > + rval = XFS_ITEM_FLUSHING; > > > + xfs_buf_relse(bp); > > > + goto out_unlock; > > > + } I think I glossed over this on my first pass, but I don't think we need to (or should) check XBF_WRITE_FAIL here or in the error handler. It's a flag used to control the internal retry and that is kind of irrelevant to this mechanism. Unless I'm missing something.. I don't think this state can occur..? 
Brian > > > + > > > + while (lip != NULL) { > > > + next = lip->li_bio_list; > > > + > > > + if (lip->li_flags & XFS_LI_FAILED) > > > + lip->li_flags &= XFS_LI_FAILED; > > > > Eric already pointed out that you probably intend to clear the flag > > here..? > > > > Yup, my bad. > > > > + lip = next; > > > + } > > > > This whole hunk might be better off in a helper function (with the > > comment Eric suggested as well). > > > > Agreed, a helper function can be used here and in dquot code as well, so I agree > that a helper function can be useful, I'll try to make it a common code for both > dquot and inode items. > > > Those points and the ->iop_error() thing aside, this otherwise seems Ok > > to me. > > > > > > Brian > > > > > + > > > + if (!xfs_buf_delwri_queue(bp, buffer_list)) > > > + rval = XFS_ITEM_FLUSHING; > > > + > > > + xfs_buf_relse(bp); > > > + goto out_unlock; > > > + } > > > + > > > rval = XFS_ITEM_FLUSHING; > > > goto out_unlock; > > > + > > > } > > > > > > ASSERT(iip->ili_fields != 0 || XFS_FORCED_SHUTDOWN(ip->i_mount)); > > > @@ -622,7 +673,8 @@ static const struct xfs_item_ops xfs_inode_item_ops = { > > > .iop_unlock = xfs_inode_item_unlock, > > > .iop_committed = xfs_inode_item_committed, > > > .iop_push = xfs_inode_item_push, > > > - .iop_committing = xfs_inode_item_committing > > > + .iop_committing = xfs_inode_item_committing, > > > + .iop_error = xfs_inode_item_error > > > }; > > > > > > > > > -- > > > 2.9.3 > > > > > > -- > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > the body of a message to majordomo@vger.kernel.org > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- > Carlos > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a 
message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, May 11, 2017 at 03:57:33PM +0200, Carlos Maiolino wrote: > +xfs_inode_item_error( > + struct xfs_log_item *lip, > + unsigned int bflags) > +{ > + > + /* > + * The buffer writeback containing this inode has been failed > + * mark it as failed and unlock the flush lock, so it can be retried > + * again > + */ > + if (bflags & XBF_WRITE_FAIL) > + lip->li_flags |= XFS_LI_FAILED; > +} I'm pretty sure we can't modify the log item state like this in IO context because the parent object is not locked by this context. We can modify the state during IO submission because we hold the parent object lock, but we don't hold it here. i.e. we can race with other log item state changes done during a concurrent transaction (e.g. marking the log item dirty in xfs_trans_log_inode()). > + > STATIC uint > xfs_inode_item_push( > struct xfs_log_item *lip, > @@ -517,8 +532,44 @@ xfs_inode_item_push( > * the AIL. > */ > if (!xfs_iflock_nowait(ip)) { > + if (lip->li_flags & XFS_LI_FAILED) { > + > + struct xfs_dinode *dip; > + struct xfs_log_item *next; > + int error; > + > + error = xfs_imap_to_bp(ip->i_mount, NULL, &ip->i_imap, > + &dip, &bp, XBF_TRYLOCK, 0); > + > + if (error) { > + rval = XFS_ITEM_FLUSHING; > + goto out_unlock; > + } > + > + if (!(bp->b_flags & XBF_WRITE_FAIL)) { > + rval = XFS_ITEM_FLUSHING; > + xfs_buf_relse(bp); > + goto out_unlock; > + } > + > + while (lip != NULL) { > + next = lip->li_bio_list; > + > + if (lip->li_flags & XFS_LI_FAILED) > + lip->li_flags &= XFS_LI_FAILED; > + lip = next; > + } Same race here - we hold the lock for this inode, but not all the other inodes linked to the buffer. This implies that lip->li_flags needs to use atomic bit updates so there are no RMW update races like this. Cheers, Dave.
Hi Dave, On Wed, May 17, 2017 at 10:57:49AM +1000, Dave Chinner wrote: > On Thu, May 11, 2017 at 03:57:33PM +0200, Carlos Maiolino wrote: > > +xfs_inode_item_error( > > + struct xfs_log_item *lip, > > + unsigned int bflags) > > +{ > > + > > + /* > > + * The buffer writeback containing this inode has been failed > > + * mark it as failed and unlock the flush lock, so it can be retried > > + * again > > + */ > > + if (bflags & XBF_WRITE_FAIL) > > + lip->li_flags |= XFS_LI_FAILED; > > +} > > I'm pretty sure we can't modify the log item state like this in > IO context because the parent object is not locked by this context. > We can modify the state during IO submission because we hold the > parent object lock, but we don't hold it here. > > i.e. we can race with other log item state changes done during a > concurrent transaction (e.g. marking the log item dirty in > xfs_trans_log_inode()). > > > + > > + if (lip->li_flags & XFS_LI_FAILED) > > + lip->li_flags &= XFS_LI_FAILED; > > + lip = next; > > + } > > Same race here - we hold the lock for this inode, but not all the > other inodes linked to the buffer. > > This implies that lip->li_flags needs to use atomic bit updates so > there are no RMW update races like this. > I believe refers to both cases above, while setting and removing the flag? I can simply use set_bit() and clear_bit() here, if this is enough, although, to use them I'll need to change lip->li_flags size to `unsigned long` if people are ok with that. > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > > > + } > > > > + > > > > + if (!(bp->b_flags & XBF_WRITE_FAIL)) { > > > > + rval = XFS_ITEM_FLUSHING; > > > > + xfs_buf_relse(bp); > > > > + goto out_unlock; > > > > + } > > I think I glossed over this on my first pass, but I don't think we need > to (or should) check XBF_WRITE_FAIL here or in the error handler. It's a > flag used to control the internal retry and that is kind of irrelevant > to this mechanism. Unless I'm missing something.. I don't think this > state can occur..? > Yup, right, it is not required, I'll remove it for the next version, thanks for the review > Brian > > > > > + > > > > + while (lip != NULL) { > > > > + next = lip->li_bio_list; > > > > + > > > > + if (lip->li_flags & XFS_LI_FAILED) > > > > + lip->li_flags &= XFS_LI_FAILED; > > > > > > Eric already pointed out that you probably intend to clear the flag > > > here..? > > > > > > > Yup, my bad. > > > > > > + lip = next; > > > > + } > > > > > > This whole hunk might be better off in a helper function (with the > > > comment Eric suggested as well). > > > > > > > Agreed, a helper function can be used here and in dquot code as well, so I agree > > that a helper function can be useful, I'll try to make it a common code for both > > dquot and inode items. > > > > > Those points and the ->iop_error() thing aside, this otherwise seems Ok > > > to me. 
> > > > > > > > > > Brian > > > > > > > + > > > > + if (!xfs_buf_delwri_queue(bp, buffer_list)) > > > > + rval = XFS_ITEM_FLUSHING; > > > > + > > > > + xfs_buf_relse(bp); > > > > + goto out_unlock; > > > > + } > > > > + > > > > rval = XFS_ITEM_FLUSHING; > > > > goto out_unlock; > > > > + > > > > } > > > > > > > > ASSERT(iip->ili_fields != 0 || XFS_FORCED_SHUTDOWN(ip->i_mount)); > > > > @@ -622,7 +673,8 @@ static const struct xfs_item_ops xfs_inode_item_ops = { > > > > .iop_unlock = xfs_inode_item_unlock, > > > > .iop_committed = xfs_inode_item_committed, > > > > .iop_push = xfs_inode_item_push, > > > > - .iop_committing = xfs_inode_item_committing > > > > + .iop_committing = xfs_inode_item_committing, > > > > + .iop_error = xfs_inode_item_error > > > > }; > > > > > > > > > > > > -- > > > > 2.9.3 > > > > > > > > -- > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > the body of a message to majordomo@vger.kernel.org > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > -- > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > the body of a message to majordomo@vger.kernel.org > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > -- > > Carlos > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, May 17, 2017 at 12:41:11PM +0200, Carlos Maiolino wrote:
> Hi Dave,
>
> On Wed, May 17, 2017 at 10:57:49AM +1000, Dave Chinner wrote:
> > On Thu, May 11, 2017 at 03:57:33PM +0200, Carlos Maiolino wrote:
> > > +xfs_inode_item_error(
> > > +	struct xfs_log_item	*lip,
> > > +	unsigned int		bflags)
> > > +{
> > > +
> > > +	/*
> > > +	 * The buffer writeback containing this inode has been failed
> > > +	 * mark it as failed and unlock the flush lock, so it can be retried
> > > +	 * again
> > > +	 */
> > > +	if (bflags & XBF_WRITE_FAIL)
> > > +		lip->li_flags |= XFS_LI_FAILED;
> > > +}
> >
> > I'm pretty sure we can't modify the log item state like this in
> > IO context because the parent object is not locked by this context.
> > We can modify the state during IO submission because we hold the
> > parent object lock, but we don't hold it here.
> >
> > i.e. we can race with other log item state changes done during a
> > concurrent transaction (e.g. marking the log item dirty in
> > xfs_trans_log_inode()).
> >
> > > +
> > > +		if (lip->li_flags & XFS_LI_FAILED)
> > > +			lip->li_flags &= XFS_LI_FAILED;
> > > +		lip = next;
> > > +	}
> >
> > Same race here - we hold the lock for this inode, but not all the
> > other inodes linked to the buffer.
> >
> > This implies that lip->li_flags needs to use atomic bit updates so
> > there are no RMW update races like this.
> >
>
> I believe refers to both cases above, while setting and removing the flag?
>
> I can simply use set_bit() and clear_bit() here, if this is enough, although, to
> use them I'll need to change lip->li_flags size to `unsigned long` if people are
> ok with that.

Yup, and you need to change every other flag in li_flags to use
atomic bitops, too. That also means using test_bit() for value
checks, test_and_clear_bit() instead of "if (bit) {clear bit}" and
so on.

i.e. it's no good just making one flag atomic and all the other
updates using RMW accesses to change the state because that doesn't
close the update race conditions.

Cheers,

Dave.
On Fri, May 19, 2017 at 10:22:27AM +1000, Dave Chinner wrote:
> On Wed, May 17, 2017 at 12:41:11PM +0200, Carlos Maiolino wrote:
> > On Wed, May 17, 2017 at 10:57:49AM +1000, Dave Chinner wrote:
[...]
> > I can simply use set_bit() and clear_bit() here, if this is enough, although, to
> > use them I'll need to change lip->li_flags size to `unsigned long` if people are
> > ok with that.
>
> Yup, and you need to change every other flag in li_flags to use
> atomic bitops, too. That also means using test_bit() for value
> checks, test_and_clear_bit() instead of "if (bit) {clear bit}" and
> so on.
>
> i.e. it's no good just making one flag atomic and all the other
> updates using RMW accesses to change the state because that doesn't
> close the update race conditions.
>

Am I following this correctly that the only potential race here looks
like it is with XFS_LI_ABORTED (the dirty transaction cancel case,
specifically)? Note that functions like xfs_trans_log_inode() set dirty
state on the log item descriptor, which is associated with the
transaction, not the log item itself. The only other log item flag is
LI_IN_AIL and that is going to remain set during the lifetime of
LI_FAILED by design. Maybe I'm missing something else going on here..

But otherwise given the above, that this is primarily I/O error path
code, and that we presumably already have serialization mechanisms in
place in the form of parent object locks, why not just grab the inode
lock in the appropriate places?

Brian
On Fri, May 19, 2017 at 07:27:01AM -0400, Brian Foster wrote:
> On Fri, May 19, 2017 at 10:22:27AM +1000, Dave Chinner wrote:
[...]
> Am I following this correctly that the only potential race here looks
> like it is with XFS_LI_ABORTED (the dirty transaction cancel case,
> specifically)?

XFS_LI_ABORTED is used during a shutdown (dirty trans cancel causes
a shutdown) so races don't really matter here.

However, the XFS_LI_IN_AIL flag is the problem because of its
serialisation requirements. That is, log IO completion can be
running asynchronously to metadata object IO completion, and the
manipulation of the XFS_LI_IN_AIL flag and state is protected by the
ailp->xa_lock.

Adding new flags to the same field that can be asynchronously
updated by RMW operations outside the ailp->xa_lock will cause
problems in future. There *may not* be a problem right now, but it
is poor programming practice to have different coherency processes
for the same variable that is updated via RMW operations. In these
situations, the only safe way to update the variable is to use an
atomic operation.

i.e. I'm not looking for a specific bug in the code - I'm pointing
out a concurrent programming error that is being made.

> But otherwise given the above, that this is primarily I/O error path
> code, and that we presumably already have serialization mechanisms in
> place in the form of parent object locks, why not just grab the inode
> lock in the appropriate places?

Because we can't guarantee that we can lock the parent object in IO
completion. Deadlock is trivial: process A grabs parent object,
blocks on flush lock. IO completion holds flush lock, blocks trying
to get parent object lock.

Cheers,

Dave.
On Sat, May 20, 2017 at 09:39:00AM +1000, Dave Chinner wrote:
> On Fri, May 19, 2017 at 07:27:01AM -0400, Brian Foster wrote:
[...]
> > Am I following this correctly that the only potential race here looks
> > like it is with XFS_LI_ABORTED (the dirty transaction cancel case,
> > specifically)?
>
> XFS_LI_ABORTED is used during a shutdown (dirty trans cancel causes
> a shutdown) so races don't really matter here.
>

Right.. though I don't think we should dismiss this case because in
theory, if a race causes loss of a flag or some such that happens to
affect the ability to tear down in-core structures, we could still end
up with things like memory leaks or unmount hangs on a shutdown fs.

> However, the XFS_LI_IN_AIL flag is the problem because of its
> serialisation requirements. That is, log IO completion can be
> running asynchronously to metadata object IO completion, and the
> manipulation of the XFS_LI_IN_AIL flag and state is protected by the
> ailp->xa_lock.
>

Indeed..

> Adding new flags to the same field that can be asynchronously
> updated by RMW operations outside the ailp->xa_lock will cause
> problems in future. There *may not* be a problem right now, but it
> is poor programming practice to have different coherency processes
> for the same variable that is updated via RMW operations. In these
> situations, the only safe way to update the variable is to use an
> atomic operation.
>

So is there a reason why we couldn't acquire ->xa_lock to fail the log
items as we would have done anyways if the metadata writeback had
succeeded and we were removing the log items from the AIL.. (e.g., why
introduce new serialization rules if we don't need to)?

> i.e. I'm not looking for a specific bug in the code - I'm pointing
> out a concurrent programming error that is being made.
>

Ok. That's not what was described above, but sounds reasonable enough to
me regardless so long as the patch description is clear on what it is
doing. I don't think we're closing any race outside of the aborted thing
above because IN_AIL has to be set for metadata writeback to occur (and
potentially fail).

> > But otherwise given the above, that this is primarily I/O error path
> > code, and that we presumably already have serialization mechanisms in
> > place in the form of parent object locks, why not just grab the inode
> > lock in the appropriate places?
>
> Because we can't guarantee that we can lock the parent object in IO
> completion. Deadlock is trivial: process A grabs parent object,
> blocks on flush lock. IO completion holds flush lock, blocks trying
> to get parent object lock.
>

Ah, right. So once we acquire the flush lock and drop the ilock, we
can no longer block on the ilock from a context responsible for
releasing the flush lock.

That makes sense, but raises a larger question... if "process A" can
cause this kind of locking problem in I/O completion context, what
prevents it from interfering with (re)submission context just the same?
For example, the AIL pushes an inode, locks, flushes and submits the
underlying buffer. The buffer I/O fails and is ultimately released back
to the AIL. Process A comes in, locks the inode and blocks on the flush
lock. Subsequent AIL pushes and retry attempts fail to lock the inode,
never even get to the point of checking for LI_FAILED and thus we're
back to the same problem we have today.

IOW, doesn't this mean we need to check and handle LI_FAILED first off
in ->iop_push() and not just in response to flush lock failure?

Brian
On Sat, May 20, 2017 at 07:46:56AM -0400, Brian Foster wrote:
> On Sat, May 20, 2017 at 09:39:00AM +1000, Dave Chinner wrote:
> > Adding new flags to the same field that can be asynchronously
> > updated by RMW operations outside the ailp->xa_lock will cause
> > problems in future. [...]
>
> So is there a reason why we couldn't acquire ->xa_lock to fail the log
> items as we would have done anyways if the metadata writeback had
> succeeded and we were removing the log items from the AIL..

Yes. The ailp->xa_lock protects AIL state and is a highly contended
lock. It should not be used for things that aren't AIL related
because that will have performance and scalability implications.

> (e.g., why
> introduce new serialization rules if we don't need to)?

Because we don't tie independent object operational state to a
single subsystem's internal locking requirements. If the log item
does not have its own context lock (which it doesn't), then we need
to use atomic operations to serialise flag updates across different
subsystems so they can safely maintain their own independent
internal locking schemes.

An example is the inode flags - they are serialised by the
i_flags_lock spinlock because there are multiple state changes that
need to be done atomically through different subsystems (including
RCU freeing state). We serialise these updates by an object lock
rather than a subsystem lock because it's the only way we can share
the flag field across different subsystems safely...

> > > But otherwise given the above, that this is primarily I/O error path
> > > code, and that we presumably already have serialization mechanisms in
> > > place in the form of parent object locks, why not just grab the inode
> > > lock in the appropriate places?
> >
> > Because we can't guarantee that we can lock the parent object in IO
> > completion. Deadlock is trivial: process A grabs parent object,
> > blocks on flush lock. IO completion holds flush lock, blocks trying
> > to get parent object lock.
>
> Ah, right. So once we acquire the flush lock and drop the ilock, we
> can no longer block on the ilock from a context responsible for
> releasing the flush lock.
>
[...]
> IOW, doesn't this mean we need to check and handle LI_FAILED first off
> in ->iop_push() and not just in response to flush lock failure?

It's entirely possible that we need to do that. This whole "don't
endlessly retry failed buffer writes" thing is a substantial change
in behaviour, so there's bound to be interactions that we don't get
right the first time...

Cheers,

Dave.
On Mon, May 22, 2017 at 09:19:06AM +1000, Dave Chinner wrote:
> On Sat, May 20, 2017 at 07:46:56AM -0400, Brian Foster wrote:
> > So is there a reason why we couldn't acquire ->xa_lock to fail the log
> > items as we would have done anyways if the metadata writeback had
> > succeeded and we were removing the log items from the AIL..
>
> Yes. The ailp->xa_lock protects AIL state and is a highly contended
> lock. It should not be used for things that aren't AIL related
> because that will have performance and scalability implications.
>

The purpose of this flag is to control AIL retry processing, how is this
not AIL related?

If the I/O succeeds, we grab ->xa_lock to remove the items from the AIL.
If the I/O fails, we can't grab ->xa_lock to tag the items as failed (so
the AIL knows if/how to retry the item(s)) because of performance and
scalability concerns..?

In fact, having taken another look at the code, you could make the
argument that failure to take the ->xa_lock is a landmine should the AIL
task ever expect to observe consistent state across the affected set of
items (independent from atomic li_flags updates).

> > (e.g., why
> > introduce new serialization rules if we don't need to)?
>
> Because we don't tie independent object operational state to a
> single subsystem's internal locking requirements. If the log item
> does not have its own context lock (which it doesn't), then we need
> to use atomic operations to serialise flag updates across different
> subsystems so they can safely maintain their own independent
> internal locking schemes.
>
> An example is the inode flags - they are serialised by the
> i_flags_lock spinlock because there are multiple state changes that
> need to be done atomically through different subsystems (including
> RCU freeing state). We serialise these updates by an object lock
> rather than a subsystem lock because it's the only way we can share
> the flag field across different subsystems safely...
>

This makes sense in principle and from the perspective that it might
matter for some future changes/flags, I just don't think this patch
really departs from our existing serialization model wrt the AIL. We
aren't adding the serialization requirement to any new context where it
doesn't already exist. Therefore, (IMO) it's mostly an overcomplication
to rework serialization as a dependency for this fix.

All that said, the bitops change is harmless and there are only a few
flags to deal with, so I don't think it really matters much. I just
think it would be nice to avoid an artificial backport dependency. IOW,
I think this patch should use ->xa_lock as is and can be immediately
followed by a patch to convert the li_flags to bit ops and remove the
->xa_lock from contexts where it is no longer necessary (with documented
justification).

> > > IOW, doesn't this mean we need to check and handle LI_FAILED first off
> > > in ->iop_push() and not just in response to flush lock failure?
> >
> > It's entirely possible that we need to do that. This whole "don't
> > endlessly retry failed buffer writes" thing is a substantial change
> > in behaviour, so there's bound to be interactions that we don't get
> > right the first time...
>

I'm not sure if you're referring to this patch or the broader error
configuration stuff here... Note that this patch doesn't change the
fundamental behavior that the AIL retries failed buffers (subject to the
error config). I tend to get this mixed up, but IIUC this has been
traditional behavior for things like buffers, for example, for quite
some time. The problem here is that the AIL doesn't correctly issue
retries for certain items (i.e., those with flush locks), so we're just
unraveling a livelock in the log subsystem (as opposed to changing
behavior).

Brian
On Mon, May 22, 2017 at 08:51:13AM -0400, Brian Foster wrote:
> On Mon, May 22, 2017 at 09:19:06AM +1000, Dave Chinner wrote:
[...]
> > Yes. The ailp->xa_lock protects AIL state and is a highly contended
> > lock. It should not be used for things that aren't AIL related
> > because that will have performance and scalability implications.
>
> The purpose of this flag is to control AIL retry processing, how is this
> not AIL related?

It's IO state, not AIL state. IO submission occurs from more places
than an AIL push (e.g. inode reclaim, inode clustering, etc) and
there's no way we should be exposing the internal AIL state lock in
places like that.

> All that said, the bitops change is harmless and there are only a few
> flags to deal with, so I don't think it really matters much. I just
> think it would be nice to avoid an artificial backport dependency. IOW,
> I think this patch should use ->xa_lock as is and can be immediately
> followed by a patch to convert the li_flags to bit ops and remove the
> ->xa_lock from contexts where it is no longer necessary (with documented
> justification).

Then it needs to be done as a single patch set with the fix you want
to backport as the first patch, otherwise the bitop change will not
get done until someone does scalability tests and trips over it, and
then we've got more shit to backport to fix performance regressions.

> > > IOW, doesn't this mean we need to check and handle LI_FAILED first off
> > > in ->iop_push() and not just in response to flush lock failure?
> >
> > It's entirely possible that we need to do that. This whole "don't
> > endlessly retry failed buffer writes" thing is a substantial change
> > in behaviour, so there's bound to be interactions that we don't get
> > right the first time...
>
> I'm not sure if you're referring to this patch or the broader error
> configuration stuff here... Note that this patch doesn't change the
> fundamental behavior that the AIL retries failed buffers (subject to the
> error config). I tend to get this mixed up, but IIUC this has been
> traditional behavior for things like buffers, for example, for quite
> some time.

Yes, but the change is that now we rely on the AIL push to trigger
cleanup of failed buffers on unmount, whereas previously the unmount
just hung endlessly retrying the failed buffers. i.e. we used to
accept this hang as "expected behaviour" but now it's considered a
bug we have to fix. Hence we now have to handle retries and failures
and untangle the locking issues we've not had to care about for 20
years.

As it is, the "only check failure if flush lock fails" idea was
designed to prevent having to look up the backing buffer to check for
failure for every inode we wanted to flush, as those lookups are too
expensive to do on every inode we need to flush. However, if we are
propagating the failure state to the log item on IO completion,
checking this state is not expensive any more, so there's no need to
hide it until we detect a state that may indicate an IO failure....

Cheers,

Dave.
On Tue, May 23, 2017 at 09:23:18PM +1000, Dave Chinner wrote: > On Mon, May 22, 2017 at 08:51:13AM -0400, Brian Foster wrote: > > On Mon, May 22, 2017 at 09:19:06AM +1000, Dave Chinner wrote: > > > On Sat, May 20, 2017 at 07:46:56AM -0400, Brian Foster wrote: > > > > On Sat, May 20, 2017 at 09:39:00AM +1000, Dave Chinner wrote: > > > > > Adding new flags to the same field that can be asynchronously > > > > > updated by RMW operations outside the ailp->xa_lock will cause > > > > > problems in future. There *may not* be a problem right now, but it > > > > > is poor programming practice to have different coherency processes > > > > > for the same variable that is updated via RMW operations. In these > > > > > situations, the only safe way to update the variable is to use an > > > > > atomic operation. > > > > > > > > > > > > > So is there a reason why we couldn't acquire ->xa_lock to fail the log > > > > items as we would have done anyways if the metadata writeback had > > > > succeeded and we were removing the log items from the AIL.. > > > > > > Yes. the alip->xa_lock protects AIL state is a highly contended > > > lock. It should not be used for things that aren't AIL related > > > because that will have performance and scalability implications. > > > > > > > The purpose of this flag is to control AIL retry processing, how is this > > not AIL related? > > It's IO state, not AIL state. IO submission occurs from more places > than and AIL push (e.g. inode reclaim, inode clustering, etc) and > there's no way we should be exposing the internal AIL state lock in > places like that. > That's reasonable, though it suggests different code from what this patchset currently does. This would be far more useful if we could focus on correctness of the code first and foremost. If the code we commit actually warrants this kind of change, then this becomes a much easier (i.e., pointless :P) discussion from my end. As it is, this patch only ever clears the flag on AIL resubmission. 
ISTM you are indirectly suggesting that should occur in other contexts as well (e.g., consider a reclaim and flush of another inode backed by a buffer that is already failed, and has other inodes sitting in the AIL flush locked). If that is the case, please elaborate, because otherwise it is kind of impossible to evaluate and discuss without specifics. > > All that said, the bitops change is harmless and there are only a few > > flags to deal with, so I don't think it really matters much. I just > > think it would be nice to avoid an artificial backport dependency. IOW, > > I think this patch should use ->xa_lock as is and can be immediately > > followed by a patch to convert the li_flags to bit ops and remove the > > ->xa_lock from contexts where it is no longer necessary (with documented > > justification). > > Then it needs to be done as a single patch set with the fix you want > to backport as the first patch, otherwise the bitop change not get > done until someone does scalability tests and trips over it and then > we've got more shit to backport to fix performance regressions. > The point is that the atomic bitmask change doesn't need to be backported at all. Notwithstanding potential changes to the patches as they currently stand (as noted above), it doesn't actually fix anything, performance or otherwise. As it is, this patch would only add an acquisition of ->xa_lock in completion context on I/O failure. Exactly what workload would you expect to be affected by changes to I/O error handling? How would it be any different than what we already do on normal I/O completion? > > > > IOW, doesn't this mean we need to check and handle LI_FAILED first off > > > > in ->iop_push() and not just in response to flush lock failure? > > > > > > It's entirely possible that we need to do that. 
This whole "don't > > > endlessly retry failed buffer writes" thing is a substantial change > > > in behaviour, so there's bound to be interactions that we don't get > > > right the first time... > > > > > > > I'm not sure if you're referring to this patch or the broader error > > configuration stuff here... Note that this patch doesn't change the > > fundamental behavior that the AIL retries failed buffers (subject to the > > error config). I tend to get this mixed up, but IIUC this has been > > traditional behavior for things like buffers, for example, for quite > > some time. > > Yes, but the change is that now we rely on the AIL push to trigger > cleanup of failed buffers on unmount, whereas previously the unmount > just hung endlessly retrying the failed buffers. i.e. we used to > accept this hang as "expected behaviour" but now it's considered a > bug we have to fix. Hence we now have to handle retries and failures > and untangle the locking issues we've not had to care about for 20 > years. > This completely mischaracterizes the problem. If this problem occurs, log deadlock is the most likely outcome. An unmount is simply part of the most common test method to immediately observe a hang that is otherwise already imminent based on previous events. This is not a pure unmount problem and afaict not associated with any recent behavior changes with regard to buffer I/O error handling. > As it is, the "only check failure if flush lock fails" idea was > designed to prevent having to lookup the backing buffer to check for > failure for every inode we wanted to flush as those lookups are too > expensive to do on every inode we need to flush. However, if we are > propagating the failure state to the log item on IO completion, > checking this state is not expensive any more, so there's no need to > hide it until we detect a state that may indicate an IO failure.... > Indeed. Brian > Cheers, > > Dave. 
> -- > Dave Chinner > david@fromorbit.com > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 23, 2017 at 12:22:04PM -0400, Brian Foster wrote: > On Tue, May 23, 2017 at 09:23:18PM +1000, Dave Chinner wrote: > > On Mon, May 22, 2017 at 08:51:13AM -0400, Brian Foster wrote: > > > On Mon, May 22, 2017 at 09:19:06AM +1000, Dave Chinner wrote: > > > > On Sat, May 20, 2017 at 07:46:56AM -0400, Brian Foster wrote: > > > > > On Sat, May 20, 2017 at 09:39:00AM +1000, Dave Chinner wrote: > > > > > > Adding new flags to the same field that can be asynchronously > > > > > > updated by RMW operations outside the ailp->xa_lock will cause > > > > > > problems in future. There *may not* be a problem right now, but it > > > > > > is poor programming practice to have different coherency processes > > > > > > for the same variable that is updated via RMW operations. In these > > > > > > situations, the only safe way to update the variable is to use an > > > > > > atomic operation. > > > > > > > > > > > > > > > > So is there a reason why we couldn't acquire ->xa_lock to fail the log > > > > > items as we would have done anyways if the metadata writeback had > > > > > succeeded and we were removing the log items from the AIL.. > > > > Yes. The ailp->xa_lock protects AIL state and is a highly contended > > > > lock. It should not be used for things that aren't AIL related > > > > because that will have performance and scalability implications. > > > > > > > > > > The purpose of this flag is to control AIL retry processing, how is this > > > not AIL related? > > It's IO state, not AIL state. IO submission occurs from more places > > than an AIL push (e.g. inode reclaim, inode clustering, etc) and > > there's no way we should be exposing the internal AIL state lock in > > places like that. > > > > That's reasonable, though it suggests different code from what this > patchset currently does. This would be far more useful if we could focus > on correctness of the code first and foremost.
If the code we commit > actually warrants this kind of change, then this becomes a much easier > (i.e., pointless :P) discussion from my end. Hmmm - this is not the first time recently you've said something similar in a review discussion where you didn't like the issues that have been pointed out. Last time it was "handwavey", now it's "pointless". We haven't worked out a solution that solves all the known problems, so these sorts of comments really don't help close on one.... > As it is, this patch only ever clears the flag on AIL resubmission. ISTM > you are indirectly suggesting that should occur in other contexts as > well (e.g., consider a reclaim and flush of another inode backed by a > buffer that is already failed, and has other inodes sitting in the AIL > flush locked). If that is the case, please elaborate, because otherwise > it is kind of impossible to evaluate and discuss without specifics. All I've done is take the deadlock scenario you pointed out and applied it to other places we block on the flush lock. Go look at inode reclaim - it can do blocking reclaim and so will block on the flush lock. How is this code supposed to handle hitting a failed, flush locked inode? If it just blocks, then it's a likely deadlock vector as you've already pointed out. Hence we'll need the same failed write checks to trigger buffer resubmission before we retry the inode reclaim process. [....] > As it is, this patch would only add an acquisition of ->xa_lock in > completion context on I/O failure. Exactly what workload would you > expect to be affected by changes to I/O error handling? How would it be > any different than what we already do on normal I/O completion? If you want to check and clear the failed flag in the log item in a non-racy way in IO submission (because we have to do this outside the flush lock) then we'd have to hold the AIL lock. And that means the AIL lock is then in the hot path of inode flushing in all situations, not just in IO completion.
We've known for a long time that the AIL lock is the second most contended lock in the filesystem (behind the CIL context lock), and that it is most heavily contended on inode IO completion because completion runs on the CPU that the interrupt is delivered to and so the AIL lock gets contended from multiple CPUs at once on multi-disk filesystems (see xfs_iflush_done() for more info). As such, we do not want to increase its use anywhere in the inode IO submission/completion path at all if we can possibly avoid it. So with all the problems that leaving failed inodes flush locked presents, I'm starting to think that the inode failed callback should remove the inode from the buffer and drop the flush lock rather than leave it held and attached to the buffer. We can check the LI_FAILED flag in xfs_iflush_int() and simply skip the inode flush to the buffer to effectively "resubmit" the inode unchanged. If we then clear the LI_FAILED flag in xfs_iflush_done() (i.e. only when a successful write occurs), we don't need to jump through hoops to clear the flag at IO submission time. We'd probably also need a failed flag on the buffer (maybe with a hold ref, too) so that it isn't reclaimed by memory pressure while we have inodes in the failed state pending writeback. This leaves the rest of the submission code completely unchanged - including the write clustering. It also gets rid of all the flush lock deadlock scenarios because we always unlock the flush lock on IO completion, regardless of whether the buffer IO succeeded or not. And we don't have to change any infrastructure, either - it all just works as it is because the LI_FAILED flag is handled at the lowest levels possible. This, to me, seems like a far cleaner solution than anything we've discussed so far.... Cheers, Dave.
On Wed, May 24, 2017 at 11:06:01AM +1000, Dave Chinner wrote: > On Tue, May 23, 2017 at 12:22:04PM -0400, Brian Foster wrote: > > On Tue, May 23, 2017 at 09:23:18PM +1000, Dave Chinner wrote: > > > On Mon, May 22, 2017 at 08:51:13AM -0400, Brian Foster wrote: > > > > On Mon, May 22, 2017 at 09:19:06AM +1000, Dave Chinner wrote: > > > > > On Sat, May 20, 2017 at 07:46:56AM -0400, Brian Foster wrote: > > > > > > On Sat, May 20, 2017 at 09:39:00AM +1000, Dave Chinner wrote: > > > > > > > Adding new flags to the same field that can be asynchronously > > > > > > > updated by RMW operations outside the ailp->xa_lock will cause > > > > > > > problems in future. There *may not* be a problem right now, but it > > > > > > > is poor programming practice to have different coherency processes > > > > > > > for the same variable that is updated via RMW operations. In these > > > > > > > situations, the only safe way to update the variable is to use an > > > > > > > atomic operation. > > > > > > > > > > > > > > > > > > > So is there a reason why we couldn't acquire ->xa_lock to fail the log > > > > > > items as we would have done anyways if the metadata writeback had > > > > > > succeeded and we were removing the log items from the AIL.. > > > > > > > > > > Yes. the alip->xa_lock protects AIL state is a highly contended > > > > > lock. It should not be used for things that aren't AIL related > > > > > because that will have performance and scalability implications. > > > > > > > > > > > > > The purpose of this flag is to control AIL retry processing, how is this > > > > not AIL related? > > > > > > It's IO state, not AIL state. IO submission occurs from more places > > > than and AIL push (e.g. inode reclaim, inode clustering, etc) and > > > there's no way we should be exposing the internal AIL state lock in > > > places like that. > > > > > > > That's reasonable, though it suggests different code from what this > > patchset currently does. 
This would be far more useful if we could focus > > on correctness of the code first and foremost. If the code we commit > > actually warrants this kind of change, then this becomes a much easier > > (i.e., pointless :P) discussion from my end. > > Hmmm - this is not the first time recently you've said something > similar in a review discussion where you didn't like this issues > have been pointed out. Last time is was "handwavey", now it's > "pointless". We haven't worked out a solution that solves all the > known problems, so these sorts of comments really don't help close > on one.... > You misinterpret what I wrote. What I am saying is that _my_ argument would become pointless if you could provide context around what locking specifically leads to your performance/scalability concern. I.e., just above you refer to taking the AIL lock in various places that are not modified by this patch. Since this patch does not touch those areas, I can only assume you see something wrong with the patch that would actually require locking or some additional changes in those areas. If that is the case, and you could explicitly point out where/why that extra locking is necessary, then that would clearly undermine my argument that there is no risk of performance degradation here. Instead, you are just asserting that there is additional lock contention and I'm kind of left guessing where these mysterious ->xa_lock acquisitions are, because I haven't suggested using it anywhere outside of I/O completion. Indeed, from this most recent mail, ISTM that your concern comes from the implication that AIL locking would be required in various other I/O submission contexts to clear the failed flags. More on that below... (And I don't see what any of this has to do with the quotacheck deadlock thing.) > > As it is, this patch only ever clears the flag on AIL resubmission. 
ISTM > > you are indirectly suggesting that should occur in other contexts as > > well (e.g., consider a reclaim and flush of another inode backed by a > > buffer that is already failed, and has other inodes sitting in the AIL > > flush locked). If that is the case, please elaborate, because otherwise > > it is kind of impossible to evaluate and discuss without specifics. > > All I've done is take the deadlock scenario you pointed > out and applied to it other places we block on the flush lock. > Ok... > Go look at inode reclaim - it can do blocking reclaim and so will > block on the flush lock. How is this code supposed to handle hitting > a failed, flush locked inode? If it just blocks, then it's a likely > deadlock vector as you've already pointed out. Hence we'll need the > same failed write checks to trigger buffer resubmission before we > retry the inode reclaim process. > I'm aware we take the flush lock there, but I'm not convinced that is a deadlock vector. AFAICT, nothing in that callpath blocks the subsequent AIL retries from occurring. If that AIL retry eventually succeeds, this all unwinds and we carry on as normal. If _blocking_ as such is a problem for the codepath in question, then that seems like more of a question of whether a trylock is more appropriate. > [....] > > > As it is, this patch would only add an acquisition of ->xa_lock in > > completion context on I/O failure. Exactly what workload would you > > expect to be affected by changes to I/O error handling? How would it be > > any different than what we already do on normal I/O completion? > > If you want to check and clear the failed flag in the log item in a > non-racy way in IO submission (because we have to do this outside > the flush lock) then we'd have to hold the AIL lock. And that means > the AIL lock is then in the hot path of inode flushing in all > situations, not just in IO completion. 
> Personally, I have no specific desire to clear the failed flags in I/O submission context as a general solution. The AIL retry originally did that as a sort of an optimization to avoid looking up the buffer for every flush locked inode (rightly and based on your suggestion). This appears to have clouded the fact that we still need to clear the failed flag on successful I/O completion (I thought I called that out on a previous iteration, but I've lost track so maybe not. I also haven't got to Carlos' latest version.). Conditionally relying on clearing the failed flag from submission context (as this patch does) is not sufficient because it is still possible to modify and flush an inode to an already failed buffer (and thus the buffer has pre-existing failed inodes). If an AIL push ended up resubmitting the buffer due to the newly flushed/notfailed inode, it never clears the failed flags on the other inodes attached to the buffer. I haven't combed through all of the codepaths there, but I wouldn't be surprised to see a data loss bug lurking in there. (Note again that this is based on the code in this patch.) This _could_ imply that all submission contexts need to unconditionally check for and clear failed items on the buffer or that the buffer needs a flag too, but it would be much more simple and straightforward to clear the flag after successful I/O completion (when ->xa_lock is already held). This avoids the need for extra flags or serialization requirements. To be fair, this _may_ require an additional ->xa_lock cycle on I/O completion to clear the flag from relogged items so they are flushed correctly again on a subsequent push (without having left the AIL), but again, that should only occur after an I/O error has occurred. [NOTE: After reading the rest of the email and coming back here, I see you suggested the same notion of clearing the flag after I/O success in your flush lock rework idea, so I suspect I'm not totally off the rails here. 
;)] If that all works, there is no need to acquire the AIL lock in any context we don't already have it in general. Again, if I'm missing something there, please point it out. > We've known for a long time that the AIL lock is the second most > contended lock in the filesystem (behind the CIL context lock), and > that it is most heavily contended on inode IO completion because > completion runs on the CPU that the interrupt is delivered to and > so the AIL lock gets contended from multiple CPUs at once on > multi-disk filesystems (see xfs_iflush_done() for more info). > As such, we do not want to increase it's use anywhere in the inode > IO submission/completion path at all if we can possibly avoid it. > Understood and agreed. But as mentioned previously, I've not suggested taking ->xa_lock anywhere else outside of when I/O errors occur and so performance in general is not affected. > So with all the problems that leaving failed inodes flush locked > presents, I'm starting to think that the inode failed callback > should remove the inode from the buffer and drop the flush lock > rather than leave it held and attached to the buffer. We can check > the LI_FAILED flag in xfs_iflush_int() and simply skip the inode > flush to the buffer to effectively "resubmit" the inode unchanged. > If we then clear the LI_FAILED flag in xfs_iflush_done() (i.e. only > when a successful write occurs), we don't need to jump through hoops > to clear the flag at IO submission time. We'd probably also need a > failed flag on the buffer (maybe with a hold ref, too) so that it > isn't reclaimed by memory pressure while we have inodes in the > failed state pending writeback. > A couple thoughts... first on the idea itself and second on the trajectory of this particular patch series. 1.) Interesting idea. Some of my initial questions/thoughts on the idea itself... 
It's not clear to me how we'd manage whether a failed (and flush-unlocked) inode has been subsequently modified and actually needs to be flushed again. I'm also not following how we'd propagate successful completion state back to the parent inodes if they had been previously removed from the buffer due to an I/O failure. It sounds like this _may_ conflict with providing a generic solution to this problem (i.e., we have to fix this problem for dquots as well, and hence this patch is focused on providing a generic solution). Is that not the case, or would we have to repeat this pattern for every flush locked item? Finally, I'm a bit skeptical that it's wise to provide such a fundamental shift in flush locking behavior based on the rare case of an I/O error. That means every case that currently acquires a flush lock has a new state to consider (IOW, we've effectively changed the meaning of the lock). My experience is that our error handling testing/regression capabilities are fairly poor as it is (which I don't think is something unique to us), and thus error path regressions are generally easier to introduce and tend to go undetected for quite some time (how long has AIL retry been broken? ;). Of course, this is just a first impression and could all be alleviated with further discussion and/or code. 2.) Independent of whether we want to do something like the above, I disagree with the premise that this series causes any new problems with regard to flush locked metadata items (as explained above). It is focused on fixing a particular AIL retry bug in the uncommon I/O error case within the AIL retry framework that has been in place for quite some time. Once the kinks are worked out, I think this is now very close to fixing that bug succinctly, correctly and without any performance side effects. As such, I would very much prefer that we actually make progress on fixing this specific bug before moving on to the new shiny design that redefines flush locking. 
If there are still fundamental flush lock issues beyond what this patch is intended to fix, we can address those incrementally and rework flush locking along with LI_FAILED handling on top of a known working base provided by this patch. At the absolute very least, I'd prefer that we work out the remaining kinks in this patchset (even if it is not ultimately merged) to provide a common reference for a correct solution. If there is a hard technical impasse with this fix, then that would at least help Carlos and me understand where the actual problems lie. That's just my .02 and is ultimately contingent on Carlos' preferred approach, since he's the one doing the heavy lifting here (though I suspect he would also prefer the incremental approach). Brian > This leaves the rest of the submission code completely unchanged - > including the write clustering. It also gets rid of all the flush > lock deadlock scenarios because we always unlock the flush lock on > IO completion, regardless of whether the buffer IO succeeded or not. > And we don't have to change any infrastructure, either - it all just > works as it is because the LI_FAILED flag is handled at the lowest > levels possible. > > This, to me, seems like a far cleaner solution than anything we've > discussed so far.... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Wed, May 24, 2017 at 08:42:47AM -0400, Brian Foster wrote: > On Wed, May 24, 2017 at 11:06:01AM +1000, Dave Chinner wrote: > > On Tue, May 23, 2017 at 12:22:04PM -0400, Brian Foster wrote: > > > On Tue, May 23, 2017 at 09:23:18PM +1000, Dave Chinner wrote: > > > > On Mon, May 22, 2017 at 08:51:13AM -0400, Brian Foster wrote: > > > > > On Mon, May 22, 2017 at 09:19:06AM +1000, Dave Chinner wrote: > > > > > > On Sat, May 20, 2017 at 07:46:56AM -0400, Brian Foster wrote: > > > > > > > On Sat, May 20, 2017 at 09:39:00AM +1000, Dave Chinner wrote: > > > > > > > > Adding new flags to the same field that can be asynchronously > > > > > > > > updated by RMW operations outside the ailp->xa_lock will cause > > > > > > > > problems in future. There *may not* be a problem right now, but it > > > > > > > > is poor programming practice to have different coherency processes > > > > > > > > for the same variable that is updated via RMW operations. In these > > > > > > > > situations, the only safe way to update the variable is to use an > > > > > > > > atomic operation. > > > > > > > > > > > > > > > > > > > > > > So is there a reason why we couldn't acquire ->xa_lock to fail the log > > > > > > > items as we would have done anyways if the metadata writeback had > > > > > > > succeeded and we were removing the log items from the AIL.. > > > > > > > > > > > > Yes. the alip->xa_lock protects AIL state is a highly contended > > > > > > lock. It should not be used for things that aren't AIL related > > > > > > because that will have performance and scalability implications. > > > > > > > > > > > > > > > > The purpose of this flag is to control AIL retry processing, how is this > > > > > not AIL related? > > > > > > > > It's IO state, not AIL state. IO submission occurs from more places > > > > than and AIL push (e.g. inode reclaim, inode clustering, etc) and > > > > there's no way we should be exposing the internal AIL state lock in > > > > places like that. 
> > > > > > > > > > That's reasonable, though it suggests different code from what this > > > patchset currently does. This would be far more useful if we could focus > > > on correctness of the code first and foremost. If the code we commit > > > actually warrants this kind of change, then this becomes a much easier > > > (i.e., pointless :P) discussion from my end. > > > > Hmmm - this is not the first time recently you've said something > > similar in a review discussion where you didn't like this issues > > have been pointed out. Last time is was "handwavey", now it's > > "pointless". We haven't worked out a solution that solves all the > > known problems, so these sorts of comments really don't help close > > on one.... > > > > You misinterpret what I wrote. What I am saying is that _my_ argument > would become pointless if you could provide context around what locking > specifically leads to your performance/scalability concern. > > I.e., just above you refer to taking the AIL lock in various places that > are not modified by this patch. Since this patch does not touch those > areas, I can only assume you see something wrong with the patch that > would actually require locking or some additional changes in those > areas. If that is the case, and you could explicitly point out where/why > that extra locking is necessary, then that would clearly undermine my > argument that there is no risk of performance degradation here. Instead, > you are just asserting that there is additional lock contention and I'm > kind of left guessing where these mysterious ->xa_lock acquisitions are, > because I haven't suggested using it anywhere outside of I/O completion. > > Indeed, from this most recent mail, ISTM that your concern comes from > the implication that AIL locking would be required in various other I/O > submission contexts to clear the failed flags. More on that below... > > (And I don't see what any of this has to do with the quotacheck deadlock > thing.) 
> > > > As it is, this patch only ever clears the flag on AIL resubmission. ISTM > > > you are indirectly suggesting that should occur in other contexts as > > > well (e.g., consider a reclaim and flush of another inode backed by a > > > buffer that is already failed, and has other inodes sitting in the AIL > > > flush locked). If that is the case, please elaborate, because otherwise > > > it is kind of impossible to evaluate and discuss without specifics. > > > > All I've done is take the deadlock scenario you pointed > > out and applied to it other places we block on the flush lock. > > > > Ok... > > > Go look at inode reclaim - it can do blocking reclaim and so will > > block on the flush lock. How is this code supposed to handle hitting > > a failed, flush locked inode? If it just blocks, then it's a likely > > deadlock vector as you've already pointed out. Hence we'll need the > > same failed write checks to trigger buffer resubmission before we > > retry the inode reclaim process. > > > > I'm aware we take the flush lock there, but I'm not convinced that is a > deadlock vector. AFAICT, nothing in that callpath blocks the subsequent > AIL retries from occurring. If that AIL retry eventually succeeds, this > all unwinds and we carry on as normal. > > If _blocking_ as such is a problem for the codepath in question, then > that seems like more of a question of whether a trylock is more > appropriate. > > > [....] > > > > > As it is, this patch would only add an acquisition of ->xa_lock in > > > completion context on I/O failure. Exactly what workload would you > > > expect to be affected by changes to I/O error handling? How would it be > > > any different than what we already do on normal I/O completion? > > > > If you want to check and clear the failed flag in the log item in a > > non-racy way in IO submission (because we have to do this outside > > the flush lock) then we'd have to hold the AIL lock. 
And that means > > the AIL lock is then in the hot path of inode flushing in all > > situations, not just in IO completion. > > > > Personally, I have no specific desire to clear the failed flags in I/O > submission context as a general solution. The AIL retry originally did > that as a sort of an optimization to avoid looking up the buffer for > every flush locked inode (rightly and based on your suggestion). > > This appears to have clouded the fact that we still need to clear the > failed flag on successful I/O completion (I thought I called that out on > a previous iteration, but I've lost track so maybe not. I also haven't > got to Carlos' latest version.). > > Conditionally relying on clearing the failed flag from submission > context (as this patch does) is not sufficient because it is still > possible to modify and flush an inode to an already failed buffer (and > thus the buffer has pre-existing failed inodes). If an AIL push ended up > resubmitting the buffer due to the newly flushed/notfailed inode, it > never clears the failed flags on the other inodes attached to the > buffer. I haven't combed through all of the codepaths there, but I > wouldn't be surprised to see a data loss bug lurking in there. (Note > again that this is based on the code in this patch.) > > This _could_ imply that all submission contexts need to unconditionally > check for and clear failed items on the buffer or that the buffer needs > a flag too, but it would be much more simple and straightforward to > clear the flag after successful I/O completion (when ->xa_lock is > already held). This avoids the need for extra flags or serialization > requirements. To be fair, this _may_ require an additional ->xa_lock > cycle on I/O completion to clear the flag from relogged items so they > are flushed correctly again on a subsequent push (without having left > the AIL), but again, that should only occur after an I/O error has > occurred. 
> > [NOTE: After reading the rest of the email and coming back here, I see > you suggested the same notion of clearing the flag after I/O success in > your flush lock rework idea, so I suspect I'm not totally off the rails > here. ;)] > > If that all works, there is no need to acquire the AIL lock in any > context we don't already have it in general. Again, if I'm missing > something there, please point it out. > > > We've known for a long time that the AIL lock is the second most > > contended lock in the filesystem (behind the CIL context lock), and > > that it is most heavily contended on inode IO completion because > > completion runs on the CPU that the interrupt is delivered to and > > so the AIL lock gets contended from multiple CPUs at once on > > multi-disk filesystems (see xfs_iflush_done() for more info). > > As such, we do not want to increase it's use anywhere in the inode > > IO submission/completion path at all if we can possibly avoid it. > > > > Understood and agreed. But as mentioned previously, I've not suggested > taking ->xa_lock anywhere else outside of when I/O errors occur and so > performance in general is not affected. > > > So with all the problems that leaving failed inodes flush locked > > presents, I'm starting to think that the inode failed callback > > should remove the inode from the buffer and drop the flush lock > > rather than leave it held and attached to the buffer. We can check > > the LI_FAILED flag in xfs_iflush_int() and simply skip the inode > > flush to the buffer to effectively "resubmit" the inode unchanged. > > If we then clear the LI_FAILED flag in xfs_iflush_done() (i.e. only > > when a successful write occurs), we don't need to jump through hoops > > to clear the flag at IO submission time. We'd probably also need a > > failed flag on the buffer (maybe with a hold ref, too) so that it > > isn't reclaimed by memory pressure while we have inodes in the > > failed state pending writeback. > > > > A couple thoughts... 
first on the idea itself and second on the > trajectory of this particular patch series. > > 1.) Interesting idea. Some of my initial questions/thoughts on the idea > itself... > > It's not clear to me how we'd manage whether a failed (and > flush-unlocked) inode has been subsequently modified and actually needs > to be flushed again. I'm also not following how we'd propagate > successful completion state back to the parent inodes if they had been > previously removed from the buffer due to an I/O failure. > > It sounds like this _may_ conflict with providing a generic solution to > this problem (i.e., we have to fix this problem for dquots as well, and > hence this patch is focused on providing a generic solution). Is that > not the case, or would we have to repeat this pattern for every flush > locked item? > > Finally, I'm a bit skeptical that it's wise to provide such a > fundamental shift in flush locking behavior based on the rare case of an > I/O error. That means every case that currently acquires a flush lock > has a new state to consider (IOW, we've effectively changed the meaning > of the lock). My experience is that our error handling > testing/regression capabilities are fairly poor as it is (which I don't > think is something unique to us), and thus error path regressions are > generally easier to introduce and tend to go undetected for quite some > time (how long has AIL retry been broken? ;). > > Of course, this is just a first impression and could all be alleviated > with further discussion and/or code. > > 2.) Independent of whether we want to do something like the above, I > disagree with the premise that this series causes any new problems with > regard to flush locked metadata items (as explained above). It is > focused on fixing a particular AIL retry bug in the uncommon I/O error > case within the AIL retry framework that has been in place for quite > some time. 
Once the kinks are worked out, I think this is now very close > to fixing that bug succinctly, correctly and without any performance > side effects. > > As such, I would very much prefer that we actually make progress on > fixing this specific bug before moving on to the new shiny design that > redefines flush locking. If there are still fundamental flush lock > issues beyond what this patch is intended to fix, we can address those > incrementally and rework flush locking along with LI_FAILED handling on > top of a known working base provided by this patch. At the absolute very > least, I'd prefer that we work out the remaining kinks in this patchset > (even if it is not ultimately merged) to provide a common reference for > a correct solution. If there is a hard technical impasse with this fix, > then that would at least help Carlos and I understand where the actual > problems lie. > > That's just my .02 and is ultimately contingent on Carlos' preferred > approach, since he's the one doing the heavy lifting here (though I > suspect he would also prefer the incremental approach). > > Brian I need to give some more thought to this new flush locking design Dave suggested. It really looks like a nice idea that needs some polish before implementation, and I can certainly work on that, but, from my POV, I'll agree with Brian here. Reworking the flush lock design will take time, more discussions, more patches, etc. In the meantime, we have an increasing number of users hitting this bug; with the increasing usage of DM-thin, this tends to grow quickly, since the bug is easily triggered when using dm-thin. The unmount hang is just one of the symptoms of the problem, but there are several other things that go wrong and might go unnoticed by the user.
The unmount is the most noticeable symptom a user will see, and I believe
we can just polish what needs to be polished on my patchset, fix the bug,
and then move on, and I'll work on the design Dave suggested.

My patchset shouldn't introduce any new bug or regression (save that it
still needs a successful I/O completion to clear the LI_FAILED flag), and
it fixes, in a reasonable way, a bug that is appearing more and more often.

Then I can move on and look over the design Dave mentioned, instead of
sticking with this bug in the code for... who knows how much longer.

Anyway, I'm not trying to 'sell my fish' here and claim this patchset is
the best solution, far from it, but this patchset uses the approach we
took weeks to decide on, so why not use it now, fix the bug, and improve
on it later, when at least users are no longer hitting the bug?

Cheers.

> >
> > This leaves the rest of the submission code completely unchanged -
> > including the write clustering. It also gets rid of all the flush
> > lock deadlock scenarios because we always unlock the flush lock on
> > IO completion, regardless of whether the buffer IO succeeded or not.
> > And we don't have to change any infrastructure, either - it all just
> > works as it is because the LI_FAILED flag is handled at the lowest
> > levels possible.
> >
> > This, to me, seems like a far cleaner solution than anything we've
> > discussed so far....
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, May 24, 2017 at 03:26:04PM +0200, Carlos Maiolino wrote:
...
>
> I need to give some more thinking about Dave's suggestion on this new flush
> locking design Dave suggested. This really looks a nice idea that needs some
> polish before the implementation, and I can certainly work on that, but, from my
> POV, I'll agree with Brian here.
>
> Reworking in the flush lock design will take time, more discussions, more
> patches, etc. In the mean time, we have an increasing number in users hitting
> this bug, with the increasing usage of DM-Thin, this tends to increase quick,
> once, the bug is easily triggered when using dm-thin.
>
> The unmount hang is just one of the symptoms of the problem, but there are
> several other things that go wrong and might be unnoticed by the user. The
> unmount is the most noticeable symptom an user will see, and I believe we can
> just polish what needed to be polished on my patchset, fix the bug, then move on
> and I'll work on this design Dave suggested.
>
> My patchset shouldn't introduce any new bug or regression (saving it still needs
> that a successful IO completion clears the LI_FAILED flag), and fixes a bug that
> is appearing more and more often in a reasonable way.
>
> Then I move on and look over the design Dave mentioned, instead of stick with
> this bug in the code for... who knows how longer.
>
> Anyway, I'm not trying to 'sell my fish' here saying this patchset is the best
> solution, I'm far away from doing things like that, but this patchset uses the
> approach we took weeks to decide, then, why not use this now, fix the bug and
> improve it when at least users are not hitting the bug anymore?
>

FWIW, I just posted some review comments to Carlos' v2 that attempt to
explicitly describe how I was thinking we could handle this without the
need for atomic flags and without introducing performance regressions.
Perhaps it would be easier to carry this discussion over there...
If what I was thinking is broken or has deeper problems, then we can
drop it and stick with the atomic flags approach that v2 uses, or
otherwise look into something completely different like what Dave
suggests above.

Brian

> Cheers.
>
> > >
> > > This leaves the rest of the submission code completely unchanged -
> > > including the write clustering. It also gets rid of all the flush
> > > lock deadlock scenarios because we always unlock the flush lock on
> > > IO completion, regardless of whether the buffer IO succeeded or not.
> > > And we don't have to change any infrastructure, either - it all just
> > > works as it is because the LI_FAILED flag is handled at the lowest
> > > levels possible.
> > >
> > > This, to me, seems like a far cleaner solution than anything we've
> > > discussed so far....
> > >
> > > Cheers,
> > >
> > > Dave.
> > > --
> > > Dave Chinner
> > > david@fromorbit.com
>
> --
> Carlos
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 08cb7d1..583fa9e 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -475,6 +475,21 @@ xfs_inode_item_unpin(
 	wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);
 }
 
+STATIC void
+xfs_inode_item_error(
+	struct xfs_log_item	*lip,
+	unsigned int		bflags)
+{
+	/*
+	 * The buffer writeback containing this inode has failed; mark
+	 * the log item as failed so that its flush lock can be released
+	 * and the item retried later.
+	 */
+	if (bflags & XBF_WRITE_FAIL)
+		lip->li_flags |= XFS_LI_FAILED;
+}
+
 STATIC uint
 xfs_inode_item_push(
 	struct xfs_log_item	*lip,
@@ -517,8 +532,44 @@ xfs_inode_item_push(
 	 * the AIL.
 	 */
 	if (!xfs_iflock_nowait(ip)) {
+		/*
+		 * If the item failed a previous writeback, reread its
+		 * buffer, clear the failed state on every attached item
+		 * and requeue the buffer for another write attempt.
+		 */
+		if (lip->li_flags & XFS_LI_FAILED) {
+			struct xfs_dinode	*dip;
+			struct xfs_log_item	*next;
+			int			error;
+
+			error = xfs_imap_to_bp(ip->i_mount, NULL, &ip->i_imap,
+					       &dip, &bp, XBF_TRYLOCK, 0);
+			if (error) {
+				rval = XFS_ITEM_FLUSHING;
+				goto out_unlock;
+			}
+
+			if (!(bp->b_flags & XBF_WRITE_FAIL)) {
+				rval = XFS_ITEM_FLUSHING;
+				xfs_buf_relse(bp);
+				goto out_unlock;
+			}
+
+			while (lip != NULL) {
+				next = lip->li_bio_list;
+
+				if (lip->li_flags & XFS_LI_FAILED)
+					lip->li_flags &= ~XFS_LI_FAILED;
+				lip = next;
+			}
+
+			/* Add this buffer back to the delayed write list */
+			if (!xfs_buf_delwri_queue(bp, buffer_list))
+				rval = XFS_ITEM_FLUSHING;
+
+			xfs_buf_relse(bp);
+			goto out_unlock;
+		}
+
 		rval = XFS_ITEM_FLUSHING;
 		goto out_unlock;
 	}
 
 	ASSERT(iip->ili_fields != 0 || XFS_FORCED_SHUTDOWN(ip->i_mount));
@@ -622,7 +673,8 @@ static const struct xfs_item_ops xfs_inode_item_ops = {
 	.iop_unlock	= xfs_inode_item_unlock,
 	.iop_committed	= xfs_inode_item_committed,
 	.iop_push	= xfs_inode_item_push,
-	.iop_committing = xfs_inode_item_committing
+	.iop_committing = xfs_inode_item_committing,
+	.iop_error	= xfs_inode_item_error
 };
When a buffer fails during writeback, the inode items attached to it
are kept flush locked and are never resubmitted due to the flush lock,
so if any buffer fails to be written, the items in the AIL are never
written to disk and never unlocked.

This prevents the filesystem from being unmounted, because these items
are flush locked in the AIL, but it also means the items in the AIL are
never written back, even when the I/O device comes back to normal.

I've been testing this patch with a dm-thin device, creating a
filesystem larger than the real device.

When writing enough data to fill the dm-thin device, XFS receives
ENOSPC errors from the device and keeps spinning in xfsaild (when the
'retry forever' configuration is set).

At this point, the filesystem cannot be unmounted because of the flush
locked items in the AIL, but worse, the items in the AIL are never
retried at all (since xfs_inode_item_push() will skip items that are
flush locked), even if the underlying dm-thin device is expanded to
the proper size.

This patch fixes both cases, retrying any item that has failed
previously, using the infrastructure provided by the previous patch.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
---

This same problem is also possible in the dquot code, and the fix is
almost identical.

I am not submitting a fix for dquot yet to avoid the need to create a
new version of both patches; once we agree on the solution, I'll submit
a fix for dquot.

 fs/xfs/xfs_inode_item.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 53 insertions(+), 1 deletion(-)