diff mbox series

[05/12] xfs: ratelimit unmount time per-buffer I/O error warning

Message ID 20200417150859.14734-6-bfoster@redhat.com (mailing list archive)
State New, archived
Headers show
Series xfs: flush related error handling cleanups | expand

Commit Message

Brian Foster April 17, 2020, 3:08 p.m. UTC
At unmount time, XFS emits a warning for every in-core buffer that
might have undergone a write error. In practice this behavior is
probably reasonable given that the filesystem is likely short lived
once I/O errors begin to occur consistently. Under certain test or
otherwise expected error conditions, this can spam the logs and slow
down the unmount. Ratelimit the warning to prevent this problem
while still informing the user that errors have occurred.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_buf.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

Comments

Dave Chinner April 20, 2020, 3:19 a.m. UTC | #1
On Fri, Apr 17, 2020 at 11:08:52AM -0400, Brian Foster wrote:
> At unmount time, XFS emits a warning for every in-core buffer that
> might have undergone a write error. In practice this behavior is
> probably reasonable given that the filesystem is likely short lived
> once I/O errors begin to occur consistently. Under certain test or
> otherwise expected error conditions, this can spam the logs and slow
> down the unmount. Ratelimit the warning to prevent this problem
> while still informing the user that errors have occurred.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_buf.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 93942d8e35dd..5120fed06075 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1685,11 +1685,10 @@ xfs_wait_buftarg(
>  			bp = list_first_entry(&dispose, struct xfs_buf, b_lru);
>  			list_del_init(&bp->b_lru);
>  			if (bp->b_flags & XBF_WRITE_FAIL) {
> -				xfs_alert(btp->bt_mount,
> -"Corruption Alert: Buffer at daddr 0x%llx had permanent write failures!",
> +				xfs_alert_ratelimited(btp->bt_mount,
> +"Corruption Alert: Buffer at daddr 0x%llx had permanent write failures!\n"
> +"Please run xfs_repair to determine the extent of the problem.",
>  					(long long)bp->b_bn);

Hmmmm. I was under the impression that multiple line log messages
were frowned upon because they prevent every output line in the log
being tagged correctly. That's where KERN_CONT came from (i.e. it's
a continuation of a previous log message), but we don't use that
with the XFS logging and hence multi-line log messages are split
into multiple logging calls.

IOWs, this might be better handled just using a static ratelimit
variable here....

Actually, we already have one for xfs_buf_item_push() to limit
warnings about retrying XBF_WRITE_FAIL buffers:

static DEFINE_RATELIMIT_STATE(xfs_buf_write_fail_rl_state, 30 * HZ, 10);

Perhaps we should be using the same ratelimit variable here....

Cheers,

Dave.
Brian Foster April 20, 2020, 2:02 p.m. UTC | #2
On Mon, Apr 20, 2020 at 01:19:59PM +1000, Dave Chinner wrote:
> On Fri, Apr 17, 2020 at 11:08:52AM -0400, Brian Foster wrote:
> > At unmount time, XFS emits a warning for every in-core buffer that
> > might have undergone a write error. In practice this behavior is
> > probably reasonable given that the filesystem is likely short lived
> > once I/O errors begin to occur consistently. Under certain test or
> > otherwise expected error conditions, this can spam the logs and slow
> > down the unmount. Ratelimit the warning to prevent this problem
> > while still informing the user that errors have occurred.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/xfs_buf.c | 7 +++----
> >  1 file changed, 3 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index 93942d8e35dd..5120fed06075 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -1685,11 +1685,10 @@ xfs_wait_buftarg(
> >  			bp = list_first_entry(&dispose, struct xfs_buf, b_lru);
> >  			list_del_init(&bp->b_lru);
> >  			if (bp->b_flags & XBF_WRITE_FAIL) {
> > -				xfs_alert(btp->bt_mount,
> > -"Corruption Alert: Buffer at daddr 0x%llx had permanent write failures!",
> > +				xfs_alert_ratelimited(btp->bt_mount,
> > +"Corruption Alert: Buffer at daddr 0x%llx had permanent write failures!\n"
> > +"Please run xfs_repair to determine the extent of the problem.",
> >  					(long long)bp->b_bn);
> 
> Hmmmm. I was under the impression that multiple line log messages
> were frowned upon because they prevent every output line in the log
> being tagged correctly. That's where KERN_CONT came from (i.e. it's
> a continuation of a previous log message), but we don't use that
> with the XFS logging and hence multi-line log messages are split
> into multiple logging calls.
> 

I debated combining these into a single line for that exact reason for
about a second and then just went with this because I didn't think it
mattered that much.

> IOWs, this might be better handled just using a static ratelimit
> variable here....
> 
> Actually, we already have one for xfs_buf_item_push() to limit
> warnings about retrying XBF_WRITE_FAIL buffers:
> 
> static DEFINE_RATELIMIT_STATE(xfs_buf_write_fail_rl_state, 30 * HZ, 10);
> 
> Perhaps we should be using the same ratelimit variable here....
> 

IIRC that was static in another file, but we can centralize (and perhaps
generalize..) it somewhere if that is preferred..

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Allison Henderson April 20, 2020, 6:50 p.m. UTC | #3
On 4/17/20 8:08 AM, Brian Foster wrote:
> At unmount time, XFS emits a warning for every in-core buffer that
> might have undergone a write error. In practice this behavior is
> probably reasonable given that the filesystem is likely short lived
> once I/O errors begin to occur consistently. Under certain test or
> otherwise expected error conditions, this can spam the logs and slow
> down the unmount. Ratelimit the warning to prevent this problem
> while still informing the user that errors have occurred.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
Ok, other than the line return Dave pointed out, looks ok to me:
Reviewed-by: Allison Collins <allison.henderson@oracle.com>

> ---
>   fs/xfs/xfs_buf.c | 7 +++----
>   1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 93942d8e35dd..5120fed06075 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1685,11 +1685,10 @@ xfs_wait_buftarg(
>   			bp = list_first_entry(&dispose, struct xfs_buf, b_lru);
>   			list_del_init(&bp->b_lru);
>   			if (bp->b_flags & XBF_WRITE_FAIL) {
> -				xfs_alert(btp->bt_mount,
> -"Corruption Alert: Buffer at daddr 0x%llx had permanent write failures!",
> +				xfs_alert_ratelimited(btp->bt_mount,
> +"Corruption Alert: Buffer at daddr 0x%llx had permanent write failures!\n"
> +"Please run xfs_repair to determine the extent of the problem.",
>   					(long long)bp->b_bn);
> -				xfs_alert(btp->bt_mount,
> -"Please run xfs_repair to determine the extent of the problem.");
>   			}
>   			xfs_buf_rele(bp);
>   		}
>
Dave Chinner April 20, 2020, 10:23 p.m. UTC | #4
On Mon, Apr 20, 2020 at 10:02:05AM -0400, Brian Foster wrote:
> On Mon, Apr 20, 2020 at 01:19:59PM +1000, Dave Chinner wrote:
> > On Fri, Apr 17, 2020 at 11:08:52AM -0400, Brian Foster wrote:
> > > At unmount time, XFS emits a warning for every in-core buffer that
> > > might have undergone a write error. In practice this behavior is
> > > probably reasonable given that the filesystem is likely short lived
> > > once I/O errors begin to occur consistently. Under certain test or
> > > otherwise expected error conditions, this can spam the logs and slow
> > > down the unmount. Ratelimit the warning to prevent this problem
> > > while still informing the user that errors have occurred.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/xfs/xfs_buf.c | 7 +++----
> > >  1 file changed, 3 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > index 93942d8e35dd..5120fed06075 100644
> > > --- a/fs/xfs/xfs_buf.c
> > > +++ b/fs/xfs/xfs_buf.c
> > > @@ -1685,11 +1685,10 @@ xfs_wait_buftarg(
> > >  			bp = list_first_entry(&dispose, struct xfs_buf, b_lru);
> > >  			list_del_init(&bp->b_lru);
> > >  			if (bp->b_flags & XBF_WRITE_FAIL) {
> > > -				xfs_alert(btp->bt_mount,
> > > -"Corruption Alert: Buffer at daddr 0x%llx had permanent write failures!",
> > > +				xfs_alert_ratelimited(btp->bt_mount,
> > > +"Corruption Alert: Buffer at daddr 0x%llx had permanent write failures!\n"
> > > +"Please run xfs_repair to determine the extent of the problem.",
> > >  					(long long)bp->b_bn);
> > 
> > Hmmmm. I was under the impression that multiple line log messages
> > were frowned upon because they prevent every output line in the log
> > being tagged correctly. That's where KERN_CONT came from (i.e. it's
> > a continuation of a previous log message), but we don't use that
> > with the XFS logging and hence multi-line log messages are split
> > into multiple logging calls.
> > 
> 
> I debated combining these into a single line for that exact reason for
> about a second and then just went with this because I didn't think it
> mattered that much.

It doesn't matter to us, but it does matter to those people who want
their log entries correctly tagged for their classification
engines...

> > IOWs, this might be better handled just using a static ratelimit
> > variable here....
> > 
> > Actually, we already have one for xfs_buf_item_push() to limit
> > warnings about retrying XBF_WRITE_FAIL buffers:
> > 
> > static DEFINE_RATELIMIT_STATE(xfs_buf_write_fail_rl_state, 30 * HZ, 10);
> > 
> > Perhaps we should be using the same ratelimit variable here....
> > 
> 
> IIRC that was static in another file, but we can centralize (and perhaps
> generalize..) it somewhere if that is preferred..

I think it makes sense to have all the buffer write fail
messages ratelimited under the same variable - once it starts
spewing messages, we should limit them all the same way...

Cheers,

Dave.
Brian Foster April 21, 2020, 12:13 p.m. UTC | #5
On Tue, Apr 21, 2020 at 08:23:32AM +1000, Dave Chinner wrote:
> On Mon, Apr 20, 2020 at 10:02:05AM -0400, Brian Foster wrote:
> > On Mon, Apr 20, 2020 at 01:19:59PM +1000, Dave Chinner wrote:
> > > On Fri, Apr 17, 2020 at 11:08:52AM -0400, Brian Foster wrote:
> > > > At unmount time, XFS emits a warning for every in-core buffer that
> > > > might have undergone a write error. In practice this behavior is
> > > > probably reasonable given that the filesystem is likely short lived
> > > > once I/O errors begin to occur consistently. Under certain test or
> > > > otherwise expected error conditions, this can spam the logs and slow
> > > > down the unmount. Ratelimit the warning to prevent this problem
> > > > while still informing the user that errors have occurred.
> > > > 
> > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > ---
> > > >  fs/xfs/xfs_buf.c | 7 +++----
> > > >  1 file changed, 3 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > > index 93942d8e35dd..5120fed06075 100644
> > > > --- a/fs/xfs/xfs_buf.c
> > > > +++ b/fs/xfs/xfs_buf.c
> > > > @@ -1685,11 +1685,10 @@ xfs_wait_buftarg(
> > > >  			bp = list_first_entry(&dispose, struct xfs_buf, b_lru);
> > > >  			list_del_init(&bp->b_lru);
> > > >  			if (bp->b_flags & XBF_WRITE_FAIL) {
> > > > -				xfs_alert(btp->bt_mount,
> > > > -"Corruption Alert: Buffer at daddr 0x%llx had permanent write failures!",
> > > > +				xfs_alert_ratelimited(btp->bt_mount,
> > > > +"Corruption Alert: Buffer at daddr 0x%llx had permanent write failures!\n"
> > > > +"Please run xfs_repair to determine the extent of the problem.",
> > > >  					(long long)bp->b_bn);
> > > 
> > > Hmmmm. I was under the impression that multiple line log messages
> > > were frowned upon because they prevent every output line in the log
> > > being tagged correctly. That's where KERN_CONT came from (i.e. it's
> > > a continuation of a previous log message), but we don't use that
> > > with the XFS logging and hence multi-line log messages are split
> > > into multiple logging calls.
> > > 
> > 
> > I debated combining these into a single line for that exact reason for
> > about a second and then just went with this because I didn't think it
> > mattered that much.
> 
> It doesn't matter to us, but it does matter to those people who want
> their log entries correctly tagged for their classification
> engines...
> 

Makes sense, though I am a bit curious whether it would be categorized
correctly even when fixed up, or whether something like a single long
line would be preferred over two. *shrug*

> > > IOWs, this might be better handled just using a static ratelimit
> > > variable here....
> > > 
> > > Actually, we already have one for xfs_buf_item_push() to limit
> > > warnings about retrying XBF_WRITE_FAIL buffers:
> > > 
> > > static DEFINE_RATELIMIT_STATE(xfs_buf_write_fail_rl_state, 30 * HZ, 10);
> > > 
> > > Perhaps we should be using the same ratelimit variable here....
> > > 
> > 
> > IIRC that was static in another file, but we can centralize (and perhaps
> > generalize..) it somewhere if that is preferred..
> 
> I think it makes sense to have all the buffer write fail
> messages ratelimited under the same variable - once it starts
> spewing messages, we should limit them all the same way...
> 

Yeah. I actually ended up sticking the ratelimit in the buftarg as it
comes off a bit cleaner than a global and I don't think there's much of
a practical difference in having a per-target limit.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
diff mbox series

Patch

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 93942d8e35dd..5120fed06075 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1685,11 +1685,10 @@  xfs_wait_buftarg(
 			bp = list_first_entry(&dispose, struct xfs_buf, b_lru);
 			list_del_init(&bp->b_lru);
 			if (bp->b_flags & XBF_WRITE_FAIL) {
-				xfs_alert(btp->bt_mount,
-"Corruption Alert: Buffer at daddr 0x%llx had permanent write failures!",
+				xfs_alert_ratelimited(btp->bt_mount,
+"Corruption Alert: Buffer at daddr 0x%llx had permanent write failures!\n"
+"Please run xfs_repair to determine the extent of the problem.",
 					(long long)bp->b_bn);
-				xfs_alert(btp->bt_mount,
-"Please run xfs_repair to determine the extent of the problem.");
 			}
 			xfs_buf_rele(bp);
 		}