Message ID | 20200417150859.14734-6-bfoster@redhat.com (mailing list archive) |
---|---|
State | Deferred, archived |
Headers | show |
Series | xfs: flush related error handling cleanups | expand |
On Fri, Apr 17, 2020 at 11:08:52AM -0400, Brian Foster wrote: > At unmount time, XFS emits a warning for every in-core buffer that > might have undergone a write error. In practice this behavior is > probably reasonable given that the filesystem is likely short lived > once I/O errors begin to occur consistently. Under certain test or > otherwise expected error conditions, this can spam the logs and slow > down the unmount. Ratelimit the warning to prevent this problem > while still informing the user that errors have occurred. > > Signed-off-by: Brian Foster <bfoster@redhat.com> > --- > fs/xfs/xfs_buf.c | 7 +++---- > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > index 93942d8e35dd..5120fed06075 100644 > --- a/fs/xfs/xfs_buf.c > +++ b/fs/xfs/xfs_buf.c > @@ -1685,11 +1685,10 @@ xfs_wait_buftarg( > bp = list_first_entry(&dispose, struct xfs_buf, b_lru); > list_del_init(&bp->b_lru); > if (bp->b_flags & XBF_WRITE_FAIL) { > - xfs_alert(btp->bt_mount, > -"Corruption Alert: Buffer at daddr 0x%llx had permanent write failures!", > + xfs_alert_ratelimited(btp->bt_mount, > +"Corruption Alert: Buffer at daddr 0x%llx had permanent write failures!\n" > +"Please run xfs_repair to determine the extent of the problem.", > (long long)bp->b_bn); Hmmmm. I was under the impression that multiple line log messages were frowned upon because they prevent every output line in the log being tagged correctly. That's where KERN_CONT came from (i.e. it's a continuation of a previous log message), but we don't use that with the XFS logging and hence multi-line log messages are split into multiple logging calls. IOWs, this might be better handled just using a static ratelimit variable here.... Actually, we already have one for xfs_buf_item_push() to limit warnings about retrying XBF_WRITE_FAIL buffers: static DEFINE_RATELIMIT_STATE(xfs_buf_write_fail_rl_state, 30 * HZ, 10); Perhaps we should be using the same ratelimit variable here.... Cheers, Dave.
On Mon, Apr 20, 2020 at 01:19:59PM +1000, Dave Chinner wrote: > On Fri, Apr 17, 2020 at 11:08:52AM -0400, Brian Foster wrote: > > At unmount time, XFS emits a warning for every in-core buffer that > > might have undergone a write error. In practice this behavior is > > probably reasonable given that the filesystem is likely short lived > > once I/O errors begin to occur consistently. Under certain test or > > otherwise expected error conditions, this can spam the logs and slow > > down the unmount. Ratelimit the warning to prevent this problem > > while still informing the user that errors have occurred. > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > --- > > fs/xfs/xfs_buf.c | 7 +++---- > > 1 file changed, 3 insertions(+), 4 deletions(-) > > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > > index 93942d8e35dd..5120fed06075 100644 > > --- a/fs/xfs/xfs_buf.c > > +++ b/fs/xfs/xfs_buf.c > > @@ -1685,11 +1685,10 @@ xfs_wait_buftarg( > > bp = list_first_entry(&dispose, struct xfs_buf, b_lru); > > list_del_init(&bp->b_lru); > > if (bp->b_flags & XBF_WRITE_FAIL) { > > - xfs_alert(btp->bt_mount, > > -"Corruption Alert: Buffer at daddr 0x%llx had permanent write failures!", > > + xfs_alert_ratelimited(btp->bt_mount, > > +"Corruption Alert: Buffer at daddr 0x%llx had permanent write failures!\n" > > +"Please run xfs_repair to determine the extent of the problem.", > > (long long)bp->b_bn); > > Hmmmm. I was under the impression that multiple line log messages > were frowned upon because they prevent every output line in the log > being tagged correctly. That's where KERN_CONT came from (i.e. it's > a continuation of a previous log message), but we don't use that > with the XFS logging and hence multi-line log messages are split > into multiple logging calls. > I debated combining these into a single line for that exact reason for about a second and then just went with this because I didn't think it mattered that much. > IOWs, this might be better handled just using a static ratelimit > variable here.... > > Actually, we already have one for xfs_buf_item_push() to limit > warnings about retrying XBF_WRITE_FAIL buffers: > > static DEFINE_RATELIMIT_STATE(xfs_buf_write_fail_rl_state, 30 * HZ, 10); > > Perhaps we should be using the same ratelimit variable here.... > IIRC that was static in another file, but we can centralize (and perhaps generalize..) it somewhere if that is preferred.. Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com >
On 4/17/20 8:08 AM, Brian Foster wrote: > At unmount time, XFS emits a warning for every in-core buffer that > might have undergone a write error. In practice this behavior is > probably reasonable given that the filesystem is likely short lived > once I/O errors begin to occur consistently. Under certain test or > otherwise expected error conditions, this can spam the logs and slow > down the unmount. Ratelimit the warning to prevent this problem > while still informing the user that errors have occurred. > > Signed-off-by: Brian Foster <bfoster@redhat.com> Ok, other than the line return Dave pointed out, looks ok to me: Reviewed-by: Allison Collins <allison.henderson@oracle.com> > --- > fs/xfs/xfs_buf.c | 7 +++---- > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > index 93942d8e35dd..5120fed06075 100644 > --- a/fs/xfs/xfs_buf.c > +++ b/fs/xfs/xfs_buf.c > @@ -1685,11 +1685,10 @@ xfs_wait_buftarg( > bp = list_first_entry(&dispose, struct xfs_buf, b_lru); > list_del_init(&bp->b_lru); > if (bp->b_flags & XBF_WRITE_FAIL) { > - xfs_alert(btp->bt_mount, > -"Corruption Alert: Buffer at daddr 0x%llx had permanent write failures!", > + xfs_alert_ratelimited(btp->bt_mount, > +"Corruption Alert: Buffer at daddr 0x%llx had permanent write failures!\n" > +"Please run xfs_repair to determine the extent of the problem.", > (long long)bp->b_bn); > - xfs_alert(btp->bt_mount, > -"Please run xfs_repair to determine the extent of the problem."); > } > xfs_buf_rele(bp); > } >
On Mon, Apr 20, 2020 at 10:02:05AM -0400, Brian Foster wrote: > On Mon, Apr 20, 2020 at 01:19:59PM +1000, Dave Chinner wrote: > > On Fri, Apr 17, 2020 at 11:08:52AM -0400, Brian Foster wrote: > > > At unmount time, XFS emits a warning for every in-core buffer that > > > might have undergone a write error. In practice this behavior is > > > probably reasonable given that the filesystem is likely short lived > > > once I/O errors begin to occur consistently. Under certain test or > > > otherwise expected error conditions, this can spam the logs and slow > > > down the unmount. Ratelimit the warning to prevent this problem > > > while still informing the user that errors have occurred. > > > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > > --- > > > fs/xfs/xfs_buf.c | 7 +++---- > > > 1 file changed, 3 insertions(+), 4 deletions(-) > > > > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > > > index 93942d8e35dd..5120fed06075 100644 > > > --- a/fs/xfs/xfs_buf.c > > > +++ b/fs/xfs/xfs_buf.c > > > @@ -1685,11 +1685,10 @@ xfs_wait_buftarg( > > > bp = list_first_entry(&dispose, struct xfs_buf, b_lru); > > > list_del_init(&bp->b_lru); > > > if (bp->b_flags & XBF_WRITE_FAIL) { > > > - xfs_alert(btp->bt_mount, > > > -"Corruption Alert: Buffer at daddr 0x%llx had permanent write failures!", > > > + xfs_alert_ratelimited(btp->bt_mount, > > > +"Corruption Alert: Buffer at daddr 0x%llx had permanent write failures!\n" > > > +"Please run xfs_repair to determine the extent of the problem.", > > > (long long)bp->b_bn); > > > > Hmmmm. I was under the impression that multiple line log messages > > were frowned upon because they prevent every output line in the log > > being tagged correctly. That's where KERN_CONT came from (i.e. it's > > a continuation of a previous log message), but we don't use that > > with the XFS logging and hence multi-line log messages are split > > into multiple logging calls. > > > > I debated combining these into a single line for that exact reason for > about a second and then just went with this because I didn't think it > mattered that much. It doesn't matter to us, but it does matter to those people who want their log entries correctly tagged for their classification engines... > > IOWs, this might be better handled just using a static ratelimit > > variable here.... > > > > Actually, we already have one for xfs_buf_item_push() to limit > > warnings about retrying XBF_WRITE_FAIL buffers: > > > > static DEFINE_RATELIMIT_STATE(xfs_buf_write_fail_rl_state, 30 * HZ, 10); > > > > Perhaps we should be using the same ratelimit variable here.... > > > > IIRC that was static in another file, but we can centralize (and perhaps > generalize..) it somewhere if that is preferred.. I think it makes sense to have all the buffer write fail messages ratelimited under the same variable - once it starts spewing messages, we should limit them all the same way... Cheers, Dave.
On Tue, Apr 21, 2020 at 08:23:32AM +1000, Dave Chinner wrote: > On Mon, Apr 20, 2020 at 10:02:05AM -0400, Brian Foster wrote: > > On Mon, Apr 20, 2020 at 01:19:59PM +1000, Dave Chinner wrote: > > > On Fri, Apr 17, 2020 at 11:08:52AM -0400, Brian Foster wrote: > > > > At unmount time, XFS emits a warning for every in-core buffer that > > > > might have undergone a write error. In practice this behavior is > > > > probably reasonable given that the filesystem is likely short lived > > > > once I/O errors begin to occur consistently. Under certain test or > > > > otherwise expected error conditions, this can spam the logs and slow > > > > down the unmount. Ratelimit the warning to prevent this problem > > > > while still informing the user that errors have occurred. > > > > > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > > > --- > > > > fs/xfs/xfs_buf.c | 7 +++---- > > > > 1 file changed, 3 insertions(+), 4 deletions(-) > > > > > > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > > > > index 93942d8e35dd..5120fed06075 100644 > > > > --- a/fs/xfs/xfs_buf.c > > > > +++ b/fs/xfs/xfs_buf.c > > > > @@ -1685,11 +1685,10 @@ xfs_wait_buftarg( > > > > bp = list_first_entry(&dispose, struct xfs_buf, b_lru); > > > > list_del_init(&bp->b_lru); > > > > if (bp->b_flags & XBF_WRITE_FAIL) { > > > > - xfs_alert(btp->bt_mount, > > > > -"Corruption Alert: Buffer at daddr 0x%llx had permanent write failures!", > > > > + xfs_alert_ratelimited(btp->bt_mount, > > > > +"Corruption Alert: Buffer at daddr 0x%llx had permanent write failures!\n" > > > > +"Please run xfs_repair to determine the extent of the problem.", > > > > (long long)bp->b_bn); > > > > > > Hmmmm. I was under the impression that multiple line log messages > > > were frowned upon because they prevent every output line in the log > > > being tagged correctly. That's where KERN_CONT came from (i.e. it's > > > a continuation of a previous log message), but we don't use that > > > with the XFS logging and hence multi-line log messages are split > > > into multiple logging calls. > > > > > > > I debated combining these into a single line for that exact reason for > > about a second and then just went with this because I didn't think it > > mattered that much. > > It doesn't matter to us, but it does matter to those people who want > their log entries correctly tagged for their classification > engines... > Makes sense, though I am a bit curious whether it would be categorized correctly even when fixed up, or whether something like a single long line would be preferred over two. *shrug* > > > IOWs, this might be better handled just using a static ratelimit > > > variable here.... > > > > > > Actually, we already have one for xfs_buf_item_push() to limit > > > warnings about retrying XBF_WRITE_FAIL buffers: > > > > > > static DEFINE_RATELIMIT_STATE(xfs_buf_write_fail_rl_state, 30 * HZ, 10); > > > > > > Perhaps we should be using the same ratelimit variable here.... > > > > > > > IIRC that was static in another file, but we can centralize (and perhaps > > generalize..) it somewhere if that is preferred.. > > I think it makes sense to have all the buffer write fail > messages ratelimited under the same variable - once it starts > spewing messages, we should limit them all the same way... > Yeah. I actually ended up sticking the ratelimit in the buftarg as it comes off a bit cleaner than a global and I don't think there's much of a practical difference in having a per-target limit. Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com >
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c index 93942d8e35dd..5120fed06075 100644 --- a/fs/xfs/xfs_buf.c +++ b/fs/xfs/xfs_buf.c @@ -1685,11 +1685,10 @@ xfs_wait_buftarg( bp = list_first_entry(&dispose, struct xfs_buf, b_lru); list_del_init(&bp->b_lru); if (bp->b_flags & XBF_WRITE_FAIL) { - xfs_alert(btp->bt_mount, -"Corruption Alert: Buffer at daddr 0x%llx had permanent write failures!", + xfs_alert_ratelimited(btp->bt_mount, +"Corruption Alert: Buffer at daddr 0x%llx had permanent write failures!\n" +"Please run xfs_repair to determine the extent of the problem.", (long long)bp->b_bn); - xfs_alert(btp->bt_mount, -"Please run xfs_repair to determine the extent of the problem."); } xfs_buf_rele(bp); }
At unmount time, XFS emits a warning for every in-core buffer that might have undergone a write error. In practice this behavior is probably reasonable given that the filesystem is likely short lived once I/O errors begin to occur consistently. Under certain test or otherwise expected error conditions, this can spam the logs and slow down the unmount. Ratelimit the warning to prevent this problem while still informing the user that errors have occurred. Signed-off-by: Brian Foster <bfoster@redhat.com> --- fs/xfs/xfs_buf.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-)