xfs: ratelimit inode flush on buffered write ENOSPC
diff mbox series

Message ID 20200329172209.GA80283@magnolia
State Accepted
Headers show
Series
  • xfs: ratelimit inode flush on buffered write ENOSPC
Related show

Commit Message

Darrick J. Wong March 29, 2020, 5:22 p.m. UTC
From: Darrick J. Wong <darrick.wong@oracle.com>

A customer reported rcu stalls and softlockup warnings on a computer
with many CPU cores and many many more IO threads trying to write to a
filesystem that is totally out of space.  Subsequent analysis pointed to
the many many IO threads calling xfs_flush_inodes -> sync_inodes_sb,
which causes a lot of wb_writeback_work to be queued.  The writeback
worker spends so much time trying to wake the many many threads waiting
for writeback completion that it trips the softlockup detector, and (in
this case) the system automatically reboots.

In addition, they complain that the lengthy xfs_flush_inodes scan traps
all of those threads in uninterruptible sleep, which hampers their
ability to kill the program or do anything else to escape the situation.

If there's thousands of threads trying to write to files on a full
filesystem, each of those threads will start separate copies of the
inode flush scan.  This is kind of pointless since we only need one
scan, so rate limit the inode flush.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_mount.h |    1 +
 fs/xfs/xfs_super.c |   14 ++++++++++++++
 2 files changed, 15 insertions(+)

Comments

Allison Collins March 29, 2020, 7:57 p.m. UTC | #1
On 3/29/20 10:22 AM, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> A customer reported rcu stalls and softlockup warnings on a computer
> with many CPU cores and many many more IO threads trying to write to a
> filesystem that is totally out of space.  Subsequent analysis pointed to
> the many many IO threads calling xfs_flush_inodes -> sync_inodes_sb,
> which causes a lot of wb_writeback_work to be queued.  The writeback
> worker spends so much time trying to wake the many many threads waiting
> for writeback completion that it trips the softlockup detector, and (in
> this case) the system automatically reboots.
> 
> In addition, they complain that the lengthy xfs_flush_inodes scan traps
> all of those threads in uninterruptible sleep, which hampers their
> ability to kill the program or do anything else to escape the situation.
> 
> If there's thousands of threads trying to write to files on a full
> filesystem, each of those threads will start separate copies of the
> inode flush scan.  This is kind of pointless since we only need one
> scan, so rate limit the inode flush.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Ok, makes sense
Reviewed-by: Allison Collins <allison.henderson@oracle.com>

> ---
>   fs/xfs/xfs_mount.h |    1 +
>   fs/xfs/xfs_super.c |   14 ++++++++++++++
>   2 files changed, 15 insertions(+)
> 
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 88ab09ed29e7..50c43422fa17 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -167,6 +167,7 @@ typedef struct xfs_mount {
>   	struct xfs_kobj		m_error_meta_kobj;
>   	struct xfs_error_cfg	m_error_cfg[XFS_ERR_CLASS_MAX][XFS_ERR_ERRNO_MAX];
>   	struct xstats		m_stats;	/* per-fs stats */
> +	struct ratelimit_state	m_flush_inodes_ratelimit;
>   
>   	struct workqueue_struct *m_buf_workqueue;
>   	struct workqueue_struct	*m_unwritten_workqueue;
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 68fea439d974..abf06bf9c3f3 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -528,6 +528,9 @@ xfs_flush_inodes(
>   {
>   	struct super_block	*sb = mp->m_super;
>   
> +	if (!__ratelimit(&mp->m_flush_inodes_ratelimit))
> +		return;
> +
>   	if (down_read_trylock(&sb->s_umount)) {
>   		sync_inodes_sb(sb);
>   		up_read(&sb->s_umount);
> @@ -1366,6 +1369,17 @@ xfs_fc_fill_super(
>   	if (error)
>   		goto out_free_names;
>   
> +	/*
> +	 * Cap the number of invocations of xfs_flush_inodes to 16 for every
> +	 * quarter of a second.  The magic numbers here were determined by
> +	 * observation neither to cause stalls in writeback when there are a
> +	 * lot of IO threads and the fs is near ENOSPC, nor cause any fstest
> +	 * regressions.  YMMV.
> +	 */
> +	ratelimit_state_init(&mp->m_flush_inodes_ratelimit, HZ / 4, 16);
> +	ratelimit_set_flags(&mp->m_flush_inodes_ratelimit,
> +			RATELIMIT_MSG_ON_RELEASE);
> +
>   	error = xfs_init_mount_workqueues(mp);
>   	if (error)
>   		goto out_close_devices;
>
Dave Chinner March 29, 2020, 10:08 p.m. UTC | #2
On Sun, Mar 29, 2020 at 10:22:09AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> A customer reported rcu stalls and softlockup warnings on a computer
> with many CPU cores and many many more IO threads trying to write to a
> filesystem that is totally out of space.  Subsequent analysis pointed to
> the many many IO threads calling xfs_flush_inodes -> sync_inodes_sb,
> which causes a lot of wb_writeback_work to be queued.  The writeback
> worker spends so much time trying to wake the many many threads waiting
> for writeback completion that it trips the softlockup detector, and (in
> this case) the system automatically reboots.
> 
> In addition, they complain that the lengthy xfs_flush_inodes scan traps
> all of those threads in uninterruptible sleep, which hampers their
> ability to kill the program or do anything else to escape the situation.
> 
> If there's thousands of threads trying to write to files on a full
> filesystem, each of those threads will start separate copies of the
> inode flush scan.  This is kind of pointless since we only need one
> scan, so rate limit the inode flush.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/xfs_mount.h |    1 +
>  fs/xfs/xfs_super.c |   14 ++++++++++++++
>  2 files changed, 15 insertions(+)
> 
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 88ab09ed29e7..50c43422fa17 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -167,6 +167,7 @@ typedef struct xfs_mount {
>  	struct xfs_kobj		m_error_meta_kobj;
>  	struct xfs_error_cfg	m_error_cfg[XFS_ERR_CLASS_MAX][XFS_ERR_ERRNO_MAX];
>  	struct xstats		m_stats;	/* per-fs stats */
> +	struct ratelimit_state	m_flush_inodes_ratelimit;
>  
>  	struct workqueue_struct *m_buf_workqueue;
>  	struct workqueue_struct	*m_unwritten_workqueue;
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 68fea439d974..abf06bf9c3f3 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -528,6 +528,9 @@ xfs_flush_inodes(
>  {
>  	struct super_block	*sb = mp->m_super;
>  
> +	if (!__ratelimit(&mp->m_flush_inodes_ratelimit))
> +		return;
> +
>  	if (down_read_trylock(&sb->s_umount)) {
>  		sync_inodes_sb(sb);
>  		up_read(&sb->s_umount);
> @@ -1366,6 +1369,17 @@ xfs_fc_fill_super(
>  	if (error)
>  		goto out_free_names;
>  
> +	/*
> +	 * Cap the number of invocations of xfs_flush_inodes to 16 for every
> +	 * quarter of a second.  The magic numbers here were determined by
> +	 * observation neither to cause stalls in writeback when there are a
> +	 * lot of IO threads and the fs is near ENOSPC, nor cause any fstest
> +	 * regressions.  YMMV.
> +	 */
> +	ratelimit_state_init(&mp->m_flush_inodes_ratelimit, HZ / 4, 16);
> +	ratelimit_set_flags(&mp->m_flush_inodes_ratelimit,
> +			RATELIMIT_MSG_ON_RELEASE);

Urk.

RATELIMIT_MSG_ON_RELEASE prevents "callbacks suppressed"
messages when rate limiting was active and resets via __rate_limit().
However, in ratelimit_state_exit(), that flag -enables- printing
"callbacks suppressed" messages when rate limiting was active and is
reset.

Same flag, exact opposite behaviour...

The comment says it's behaviour is supposed to match that of
ratelimit_state_exit() (i.e. print message on ratelimit exit), so I
really can't tell if this is correct/intended usage or just API
abuse....

Cheers,

Dave.
Darrick J. Wong March 30, 2020, 12:16 a.m. UTC | #3
On Mon, Mar 30, 2020 at 09:08:02AM +1100, Dave Chinner wrote:
> On Sun, Mar 29, 2020 at 10:22:09AM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > A customer reported rcu stalls and softlockup warnings on a computer
> > with many CPU cores and many many more IO threads trying to write to a
> > filesystem that is totally out of space.  Subsequent analysis pointed to
> > the many many IO threads calling xfs_flush_inodes -> sync_inodes_sb,
> > which causes a lot of wb_writeback_work to be queued.  The writeback
> > worker spends so much time trying to wake the many many threads waiting
> > for writeback completion that it trips the softlockup detector, and (in
> > this case) the system automatically reboots.
> > 
> > In addition, they complain that the lengthy xfs_flush_inodes scan traps
> > all of those threads in uninterruptible sleep, which hampers their
> > ability to kill the program or do anything else to escape the situation.
> > 
> > If there's thousands of threads trying to write to files on a full
> > filesystem, each of those threads will start separate copies of the
> > inode flush scan.  This is kind of pointless since we only need one
> > scan, so rate limit the inode flush.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/xfs_mount.h |    1 +
> >  fs/xfs/xfs_super.c |   14 ++++++++++++++
> >  2 files changed, 15 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > index 88ab09ed29e7..50c43422fa17 100644
> > --- a/fs/xfs/xfs_mount.h
> > +++ b/fs/xfs/xfs_mount.h
> > @@ -167,6 +167,7 @@ typedef struct xfs_mount {
> >  	struct xfs_kobj		m_error_meta_kobj;
> >  	struct xfs_error_cfg	m_error_cfg[XFS_ERR_CLASS_MAX][XFS_ERR_ERRNO_MAX];
> >  	struct xstats		m_stats;	/* per-fs stats */
> > +	struct ratelimit_state	m_flush_inodes_ratelimit;
> >  
> >  	struct workqueue_struct *m_buf_workqueue;
> >  	struct workqueue_struct	*m_unwritten_workqueue;
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index 68fea439d974..abf06bf9c3f3 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -528,6 +528,9 @@ xfs_flush_inodes(
> >  {
> >  	struct super_block	*sb = mp->m_super;
> >  
> > +	if (!__ratelimit(&mp->m_flush_inodes_ratelimit))
> > +		return;
> > +
> >  	if (down_read_trylock(&sb->s_umount)) {
> >  		sync_inodes_sb(sb);
> >  		up_read(&sb->s_umount);
> > @@ -1366,6 +1369,17 @@ xfs_fc_fill_super(
> >  	if (error)
> >  		goto out_free_names;
> >  
> > +	/*
> > +	 * Cap the number of invocations of xfs_flush_inodes to 16 for every
> > +	 * quarter of a second.  The magic numbers here were determined by
> > +	 * observation neither to cause stalls in writeback when there are a
> > +	 * lot of IO threads and the fs is near ENOSPC, nor cause any fstest
> > +	 * regressions.  YMMV.
> > +	 */
> > +	ratelimit_state_init(&mp->m_flush_inodes_ratelimit, HZ / 4, 16);
> > +	ratelimit_set_flags(&mp->m_flush_inodes_ratelimit,
> > +			RATELIMIT_MSG_ON_RELEASE);
> 
> Urk.
> 
> RATELIMIT_MSG_ON_RELEASE prevents "callbacks suppressed"
> messages when rate limiting was active and resets via __rate_limit().
> However, in ratelimit_state_exit(), that flag -enables- printing
> "callbacks suppressed" messages when rate limiting was active and is
> reset.
> 
> Same flag, exact opposite behaviour...
> 
> The comment says it's behaviour is supposed to match that of
> ratelimit_state_exit() (i.e. print message on ratelimit exit), so I
> really can't tell if this is correct/intended usage or just API
> abuse....

This flag (AFAICT) basically means "summarize skipped calls later",
where later is when _exit is called.  It's very annoying that this
printk thing is mixed in with what otherwise is a simple ratelimiting
mechanism, since there isn't much to be gained by spamming dmesg every
time a buffered write hits ENOSPC, and absolutely nothing to be gained
by logging that at umount time (with comm being the umount process!)

Since there's no design documentation for how the ratelimiting system
works, the best I can do is RTFS and do whatever magic gets the outcome
I want (which is to set the flag and skip calling _exit.  Only one of
the ratelimit state users calls ratelimit_state_exit, so it's apparently
not required.

This all is poor engineering practice, but you /did/ suggest
ratelimiting (on IRC) and I don't want to go reimplementing ratelimit.c
either, and it /does/ fix the xfs_flush_inodes flooding problems.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner March 30, 2020, 12:33 a.m. UTC | #4
On Sun, Mar 29, 2020 at 05:16:02PM -0700, Darrick J. Wong wrote:
> On Mon, Mar 30, 2020 at 09:08:02AM +1100, Dave Chinner wrote:
> > On Sun, Mar 29, 2020 at 10:22:09AM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > A customer reported rcu stalls and softlockup warnings on a computer
> > > with many CPU cores and many many more IO threads trying to write to a
> > > filesystem that is totally out of space.  Subsequent analysis pointed to
> > > the many many IO threads calling xfs_flush_inodes -> sync_inodes_sb,
> > > which causes a lot of wb_writeback_work to be queued.  The writeback
> > > worker spends so much time trying to wake the many many threads waiting
> > > for writeback completion that it trips the softlockup detector, and (in
> > > this case) the system automatically reboots.
> > > 
> > > In addition, they complain that the lengthy xfs_flush_inodes scan traps
> > > all of those threads in uninterruptible sleep, which hampers their
> > > ability to kill the program or do anything else to escape the situation.
> > > 
> > > If there's thousands of threads trying to write to files on a full
> > > filesystem, each of those threads will start separate copies of the
> > > inode flush scan.  This is kind of pointless since we only need one
> > > scan, so rate limit the inode flush.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/xfs_mount.h |    1 +
> > >  fs/xfs/xfs_super.c |   14 ++++++++++++++
> > >  2 files changed, 15 insertions(+)
> > > 
> > > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > > index 88ab09ed29e7..50c43422fa17 100644
> > > --- a/fs/xfs/xfs_mount.h
> > > +++ b/fs/xfs/xfs_mount.h
> > > @@ -167,6 +167,7 @@ typedef struct xfs_mount {
> > >  	struct xfs_kobj		m_error_meta_kobj;
> > >  	struct xfs_error_cfg	m_error_cfg[XFS_ERR_CLASS_MAX][XFS_ERR_ERRNO_MAX];
> > >  	struct xstats		m_stats;	/* per-fs stats */
> > > +	struct ratelimit_state	m_flush_inodes_ratelimit;
> > >  
> > >  	struct workqueue_struct *m_buf_workqueue;
> > >  	struct workqueue_struct	*m_unwritten_workqueue;
> > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > index 68fea439d974..abf06bf9c3f3 100644
> > > --- a/fs/xfs/xfs_super.c
> > > +++ b/fs/xfs/xfs_super.c
> > > @@ -528,6 +528,9 @@ xfs_flush_inodes(
> > >  {
> > >  	struct super_block	*sb = mp->m_super;
> > >  
> > > +	if (!__ratelimit(&mp->m_flush_inodes_ratelimit))
> > > +		return;
> > > +
> > >  	if (down_read_trylock(&sb->s_umount)) {
> > >  		sync_inodes_sb(sb);
> > >  		up_read(&sb->s_umount);
> > > @@ -1366,6 +1369,17 @@ xfs_fc_fill_super(
> > >  	if (error)
> > >  		goto out_free_names;
> > >  
> > > +	/*
> > > +	 * Cap the number of invocations of xfs_flush_inodes to 16 for every
> > > +	 * quarter of a second.  The magic numbers here were determined by
> > > +	 * observation neither to cause stalls in writeback when there are a
> > > +	 * lot of IO threads and the fs is near ENOSPC, nor cause any fstest
> > > +	 * regressions.  YMMV.
> > > +	 */
> > > +	ratelimit_state_init(&mp->m_flush_inodes_ratelimit, HZ / 4, 16);
> > > +	ratelimit_set_flags(&mp->m_flush_inodes_ratelimit,
> > > +			RATELIMIT_MSG_ON_RELEASE);
> > 
> > Urk.
> > 
> > RATELIMIT_MSG_ON_RELEASE prevents "callbacks suppressed"
> > messages when rate limiting was active and resets via __rate_limit().
> > However, in ratelimit_state_exit(), that flag -enables- printing
> > "callbacks suppressed" messages when rate limiting was active and is
> > reset.
> > 
> > Same flag, exact opposite behaviour...
> > 
> > The comment says it's behaviour is supposed to match that of
> > ratelimit_state_exit() (i.e. print message on ratelimit exit), so I
> > really can't tell if this is correct/intended usage or just API
> > abuse....
> 
> This flag (AFAICT) basically means "summarize skipped calls later",
> where later is when _exit is called.  It's very annoying that this
> printk thing is mixed in with what otherwise is a simple ratelimiting
> mechanism, since there isn't much to be gained by spamming dmesg every
> time a buffered write hits ENOSPC, and absolutely nothing to be gained
> by logging that at umount time (with comm being the umount process!)
> 
> Since there's no design documentation for how the ratelimiting system
> works, the best I can do is RTFS and do whatever magic gets the outcome
> I want (which is to set the flag and skip calling _exit.  Only one of
> the ratelimit state users calls ratelimit_state_exit, so it's apparently
> not required.

Yeah, that's pretty much where I got to with it - I wasn't at all
sure if there was something obvious I was missing or it was just
another piece of poorly written code...

> This all is poor engineering practice, but you /did/ suggest
> ratelimiting (on IRC) and I don't want to go reimplementing ratelimit.c
> either, and it /does/ fix the xfs_flush_inodes flooding problems.

*nod*

Reviewed-by: Dave Chinner <dchinner@redhat.com>

Patch
diff mbox series

diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 88ab09ed29e7..50c43422fa17 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -167,6 +167,7 @@  typedef struct xfs_mount {
 	struct xfs_kobj		m_error_meta_kobj;
 	struct xfs_error_cfg	m_error_cfg[XFS_ERR_CLASS_MAX][XFS_ERR_ERRNO_MAX];
 	struct xstats		m_stats;	/* per-fs stats */
+	struct ratelimit_state	m_flush_inodes_ratelimit;
 
 	struct workqueue_struct *m_buf_workqueue;
 	struct workqueue_struct	*m_unwritten_workqueue;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 68fea439d974..abf06bf9c3f3 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -528,6 +528,9 @@  xfs_flush_inodes(
 {
 	struct super_block	*sb = mp->m_super;
 
+	if (!__ratelimit(&mp->m_flush_inodes_ratelimit))
+		return;
+
 	if (down_read_trylock(&sb->s_umount)) {
 		sync_inodes_sb(sb);
 		up_read(&sb->s_umount);
@@ -1366,6 +1369,17 @@  xfs_fc_fill_super(
 	if (error)
 		goto out_free_names;
 
+	/*
+	 * Cap the number of invocations of xfs_flush_inodes to 16 for every
+	 * quarter of a second.  The magic numbers here were determined by
+	 * observation neither to cause stalls in writeback when there are a
+	 * lot of IO threads and the fs is near ENOSPC, nor cause any fstest
+	 * regressions.  YMMV.
+	 */
+	ratelimit_state_init(&mp->m_flush_inodes_ratelimit, HZ / 4, 16);
+	ratelimit_set_flags(&mp->m_flush_inodes_ratelimit,
+			RATELIMIT_MSG_ON_RELEASE);
+
 	error = xfs_init_mount_workqueues(mp);
 	if (error)
 		goto out_close_devices;