Message ID | 20200519222310.2576434-3-david@fromorbit.com (mailing list archive) |
---|---|
State | Accepted, archived |
Headers | show |
Series | xfs: improve transaction rate scalability | expand |
On Wed, May 20, 2020 at 08:23:10AM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > It's a global atomic counter, and we are hitting it at a rate of > half a million transactions a second, so it's bouncing the counter > cacheline all over the place on large machines. We don't actually > need it anymore - it used to be required because the VFS freeze code > could not track/prevent filesystem transactions that were running, > but that problem no longer exists. > > Hence to remove the counter, we simply have to ensure that nothing > calls xfs_sync_sb() while we are trying to quiesce the filesytem. > That only happens if the log worker is still running when we call > xfs_quiesce_attr(). The log worker is cancelled at the end of > xfs_quiesce_attr() by calling xfs_log_quiesce(), so just call it > early here and then we can remove the counter altogether. > > Concurrent create, 50 million inodes, identical 16p/16GB virtual > machines on different physical hosts. Machine A has twice the CPU > cores per socket of machine B: > > unpatched patched > machine A: 3m16s 2m00s > machine B: 4m04s 4m05s > > Create rates: > unpatched patched > machine A: 282k+/-31k 468k+/-21k > machine B: 231k+/-8k 233k+/-11k > > Concurrent rm of same 50 million inodes: > > unpatched patched > machine A: 6m42s 2m33s > machine B: 4m47s 4m47s > > The transaction rate on the fast machine went from just under > 300k/sec to 700k/sec, which indicates just how much of a bottleneck > this atomic counter was. > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_mount.h | 1 - > fs/xfs/xfs_super.c | 17 +++++------------ > fs/xfs/xfs_trans.c | 27 +++++++++++---------------- > 3 files changed, 16 insertions(+), 29 deletions(-) > > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h > index c1f92c1847bb2..3725d25ad97e8 100644 > --- a/fs/xfs/xfs_mount.h > +++ b/fs/xfs/xfs_mount.h > @@ -176,7 +176,6 @@ typedef struct xfs_mount { > uint64_t m_resblks; /* total reserved blocks */ > uint64_t m_resblks_avail;/* available reserved blocks */ > uint64_t m_resblks_save; /* reserved blks @ remount,ro */ > - atomic_t m_active_trans; /* number trans frozen */ > struct delayed_work m_reclaim_work; /* background inode reclaim */ > struct delayed_work m_eofblocks_work; /* background eof blocks > trimming */ > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c > index aae469f73efeb..fa58cb07c8fdf 100644 > --- a/fs/xfs/xfs_super.c > +++ b/fs/xfs/xfs_super.c > @@ -874,8 +874,10 @@ xfs_restore_resvblks(struct xfs_mount *mp) > * there is no log replay required to write the inodes to disk - this is the > * primary difference between a sync and a quiesce. > * > + * We cancel log work early here to ensure all transactions the log worker may > + * run have finished before we clean up and log the superblock and write an > + * unmount record. The unfreeze process is responsible for restarting the log > + * worker correctly. > */ > void > xfs_quiesce_attr( > @@ -883,9 +885,7 @@ xfs_quiesce_attr( > { > int error = 0; > > - /* wait for all modifications to complete */ > - while (atomic_read(&mp->m_active_trans) > 0) > - delay(100); > + cancel_delayed_work_sync(&mp->m_log->l_work); Shouldn't the cancel_delayed_work_sync for l_work in xfs_log_quiesce be removed now given that we've already cancelled it here?
On Wed, May 20, 2020 at 12:01:52AM -0700, Christoph Hellwig wrote: > On Wed, May 20, 2020 at 08:23:10AM +1000, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > > It's a global atomic counter, and we are hitting it at a rate of > > half a million transactions a second, so it's bouncing the counter > > cacheline all over the place on large machines. We don't actually > > need it anymore - it used to be required because the VFS freeze code > > could not track/prevent filesystem transactions that were running, > > but that problem no longer exists. > > > > Hence to remove the counter, we simply have to ensure that nothing > > calls xfs_sync_sb() while we are trying to quiesce the filesytem. > > That only happens if the log worker is still running when we call > > xfs_quiesce_attr(). The log worker is cancelled at the end of > > xfs_quiesce_attr() by calling xfs_log_quiesce(), so just call it > > early here and then we can remove the counter altogether. > > > > Concurrent create, 50 million inodes, identical 16p/16GB virtual > > machines on different physical hosts. Machine A has twice the CPU > > cores per socket of machine B: > > > > unpatched patched > > machine A: 3m16s 2m00s > > machine B: 4m04s 4m05s > > > > Create rates: > > unpatched patched > > machine A: 282k+/-31k 468k+/-21k > > machine B: 231k+/-8k 233k+/-11k > > > > Concurrent rm of same 50 million inodes: > > > > unpatched patched > > machine A: 6m42s 2m33s > > machine B: 4m47s 4m47s > > > > The transaction rate on the fast machine went from just under > > 300k/sec to 700k/sec, which indicates just how much of a bottleneck > > this atomic counter was. > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > --- > > fs/xfs/xfs_mount.h | 1 - > > fs/xfs/xfs_super.c | 17 +++++------------ > > fs/xfs/xfs_trans.c | 27 +++++++++++---------------- > > 3 files changed, 16 insertions(+), 29 deletions(-) > > > > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h > > index c1f92c1847bb2..3725d25ad97e8 100644 > > --- a/fs/xfs/xfs_mount.h > > +++ b/fs/xfs/xfs_mount.h > > @@ -176,7 +176,6 @@ typedef struct xfs_mount { > > uint64_t m_resblks; /* total reserved blocks */ > > uint64_t m_resblks_avail;/* available reserved blocks */ > > uint64_t m_resblks_save; /* reserved blks @ remount,ro */ > > - atomic_t m_active_trans; /* number trans frozen */ > > struct delayed_work m_reclaim_work; /* background inode reclaim */ > > struct delayed_work m_eofblocks_work; /* background eof blocks > > trimming */ > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c > > index aae469f73efeb..fa58cb07c8fdf 100644 > > --- a/fs/xfs/xfs_super.c > > +++ b/fs/xfs/xfs_super.c > > @@ -874,8 +874,10 @@ xfs_restore_resvblks(struct xfs_mount *mp) > > * there is no log replay required to write the inodes to disk - this is the > > * primary difference between a sync and a quiesce. > > * > > + * We cancel log work early here to ensure all transactions the log worker may > > + * run have finished before we clean up and log the superblock and write an > > + * unmount record. The unfreeze process is responsible for restarting the log > > + * worker correctly. > > */ > > void > > xfs_quiesce_attr( > > @@ -883,9 +885,7 @@ xfs_quiesce_attr( > > { > > int error = 0; > > > > - /* wait for all modifications to complete */ > > - while (atomic_read(&mp->m_active_trans) > 0) > > - delay(100); > > + cancel_delayed_work_sync(&mp->m_log->l_work); > > Shouldn't the cancel_delayed_work_sync for l_work in xfs_log_quiesce > be removed now given that we've already cancelled it here? No, because every other caller of xfs_log_quiesce() requires the work to be cancelled before the log is quiesced. Cheers, Dave.
On Wed, May 20, 2020 at 05:13:39PM +1000, Dave Chinner wrote: > > > @@ -883,9 +885,7 @@ xfs_quiesce_attr( > > > { > > > int error = 0; > > > > > > - /* wait for all modifications to complete */ > > > - while (atomic_read(&mp->m_active_trans) > 0) > > > - delay(100); > > > + cancel_delayed_work_sync(&mp->m_log->l_work); > > > > Shouldn't the cancel_delayed_work_sync for l_work in xfs_log_quiesce > > be removed now given that we've already cancelled it here? > > No, because every other caller of xfs_log_quiesce() requires the > work to be cancelled before the log is quiesced. The only other caller is xfs_log_unmount, which could grow it. But I guess an extra cancel isn't harmful in the end.
On Wed, May 20, 2020 at 08:23:10AM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > It's a global atomic counter, and we are hitting it at a rate of > half a million transactions a second, so it's bouncing the counter > cacheline all over the place on large machines. We don't actually > need it anymore - it used to be required because the VFS freeze code > could not track/prevent filesystem transactions that were running, > but that problem no longer exists. > > Hence to remove the counter, we simply have to ensure that nothing > calls xfs_sync_sb() while we are trying to quiesce the filesytem. > That only happens if the log worker is still running when we call > xfs_quiesce_attr(). The log worker is cancelled at the end of > xfs_quiesce_attr() by calling xfs_log_quiesce(), so just call it > early here and then we can remove the counter altogether. > > Concurrent create, 50 million inodes, identical 16p/16GB virtual > machines on different physical hosts. Machine A has twice the CPU > cores per socket of machine B: > > unpatched patched > machine A: 3m16s 2m00s > machine B: 4m04s 4m05s > > Create rates: > unpatched patched > machine A: 282k+/-31k 468k+/-21k > machine B: 231k+/-8k 233k+/-11k > > Concurrent rm of same 50 million inodes: > > unpatched patched > machine A: 6m42s 2m33s > machine B: 4m47s 4m47s > > The transaction rate on the fast machine went from just under > 300k/sec to 700k/sec, which indicates just how much of a bottleneck > this atomic counter was. > > Signed-off-by: Dave Chinner <dchinner@redhat.com> /me kinda wonders why removing the counter entirely has so little effect on machine B, but seeing as I've been pondering killing this counter myself for years, Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> --D > --- > fs/xfs/xfs_mount.h | 1 - > fs/xfs/xfs_super.c | 17 +++++------------ > fs/xfs/xfs_trans.c | 27 +++++++++++---------------- > 3 files changed, 16 insertions(+), 29 deletions(-) > > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h > index c1f92c1847bb2..3725d25ad97e8 100644 > --- a/fs/xfs/xfs_mount.h > +++ b/fs/xfs/xfs_mount.h > @@ -176,7 +176,6 @@ typedef struct xfs_mount { > uint64_t m_resblks; /* total reserved blocks */ > uint64_t m_resblks_avail;/* available reserved blocks */ > uint64_t m_resblks_save; /* reserved blks @ remount,ro */ > - atomic_t m_active_trans; /* number trans frozen */ > struct delayed_work m_reclaim_work; /* background inode reclaim */ > struct delayed_work m_eofblocks_work; /* background eof blocks > trimming */ > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c > index aae469f73efeb..fa58cb07c8fdf 100644 > --- a/fs/xfs/xfs_super.c > +++ b/fs/xfs/xfs_super.c > @@ -874,8 +874,10 @@ xfs_restore_resvblks(struct xfs_mount *mp) > * there is no log replay required to write the inodes to disk - this is the > * primary difference between a sync and a quiesce. > * > - * Note: xfs_log_quiesce() stops background log work - the callers must ensure > - * it is started again when appropriate. > + * We cancel log work early here to ensure all transactions the log worker may > + * run have finished before we clean up and log the superblock and write an > + * unmount record. The unfreeze process is responsible for restarting the log > + * worker correctly. > */ > void > xfs_quiesce_attr( > @@ -883,9 +885,7 @@ xfs_quiesce_attr( > { > int error = 0; > > - /* wait for all modifications to complete */ > - while (atomic_read(&mp->m_active_trans) > 0) > - delay(100); > + cancel_delayed_work_sync(&mp->m_log->l_work); > > /* force the log to unpin objects from the now complete transactions */ > xfs_log_force(mp, XFS_LOG_SYNC); > @@ -899,12 +899,6 @@ xfs_quiesce_attr( > if (error) > xfs_warn(mp, "xfs_attr_quiesce: failed to log sb changes. " > "Frozen image may not be consistent."); > - /* > - * Just warn here till VFS can correctly support > - * read-only remount without racing. > - */ > - WARN_ON(atomic_read(&mp->m_active_trans) != 0); > - > xfs_log_quiesce(mp); > } > > @@ -1793,7 +1787,6 @@ static int xfs_init_fs_context( > INIT_RADIX_TREE(&mp->m_perag_tree, GFP_ATOMIC); > spin_lock_init(&mp->m_perag_lock); > mutex_init(&mp->m_growlock); > - atomic_set(&mp->m_active_trans, 0); > INIT_WORK(&mp->m_flush_inodes_work, xfs_flush_inodes_worker); > INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker); > INIT_DELAYED_WORK(&mp->m_eofblocks_work, xfs_eofblocks_worker); > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c > index b055a5ab53465..217937d743dbb 100644 > --- a/fs/xfs/xfs_trans.c > +++ b/fs/xfs/xfs_trans.c > @@ -68,7 +68,6 @@ xfs_trans_free( > xfs_extent_busy_clear(tp->t_mountp, &tp->t_busy, false); > > trace_xfs_trans_free(tp, _RET_IP_); > - atomic_dec(&tp->t_mountp->m_active_trans); > if (!(tp->t_flags & XFS_TRANS_NO_WRITECOUNT)) > sb_end_intwrite(tp->t_mountp->m_super); > xfs_trans_free_dqinfo(tp); > @@ -125,8 +124,6 @@ xfs_trans_dup( > xfs_defer_move(ntp, tp); > > xfs_trans_dup_dqinfo(tp, ntp); > - > - atomic_inc(&tp->t_mountp->m_active_trans); > return ntp; > } > > @@ -275,7 +272,6 @@ xfs_trans_alloc( > */ > WARN_ON(resp->tr_logres > 0 && > mp->m_super->s_writers.frozen == SB_FREEZE_COMPLETE); > - atomic_inc(&mp->m_active_trans); > > tp->t_magic = XFS_TRANS_HEADER_MAGIC; > tp->t_flags = flags; > @@ -299,20 +295,19 @@ xfs_trans_alloc( > > /* > * Create an empty transaction with no reservation. This is a defensive > - * mechanism for routines that query metadata without actually modifying > - * them -- if the metadata being queried is somehow cross-linked (think a > - * btree block pointer that points higher in the tree), we risk deadlock. > - * However, blocks grabbed as part of a transaction can be re-grabbed. > - * The verifiers will notice the corrupt block and the operation will fail > - * back to userspace without deadlocking. > + * mechanism for routines that query metadata without actually modifying them -- > + * if the metadata being queried is somehow cross-linked (think a btree block > + * pointer that points higher in the tree), we risk deadlock. However, blocks > + * grabbed as part of a transaction can be re-grabbed. The verifiers will > + * notice the corrupt block and the operation will fail back to userspace > + * without deadlocking. > * > - * Note the zero-length reservation; this transaction MUST be cancelled > - * without any dirty data. > + * Note the zero-length reservation; this transaction MUST be cancelled without > + * any dirty data. > * > - * Callers should obtain freeze protection to avoid two conflicts with fs > - * freezing: (1) having active transactions trip the m_active_trans ASSERTs; > - * and (2) grabbing buffers at the same time that freeze is trying to drain > - * the buffer LRU list. > + * Callers should obtain freeze protection to avoid a conflict with fs freezing > + * where we can be grabbing buffers at the same time that freeze is trying to > + * drain the buffer LRU list. > */ > int > xfs_trans_alloc_empty( > -- > 2.26.2.761.g0e0b3e54be >
On Wed, May 20, 2020 at 01:47:54PM -0700, Darrick J. Wong wrote: > On Wed, May 20, 2020 at 08:23:10AM +1000, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > > It's a global atomic counter, and we are hitting it at a rate of > > half a million transactions a second, so it's bouncing the counter > > cacheline all over the place on large machines. We don't actually > > need it anymore - it used to be required because the VFS freeze code > > could not track/prevent filesystem transactions that were running, > > but that problem no longer exists. > > > > Hence to remove the counter, we simply have to ensure that nothing > > calls xfs_sync_sb() while we are trying to quiesce the filesytem. > > That only happens if the log worker is still running when we call > > xfs_quiesce_attr(). The log worker is cancelled at the end of > > xfs_quiesce_attr() by calling xfs_log_quiesce(), so just call it > > early here and then we can remove the counter altogether. > > > > Concurrent create, 50 million inodes, identical 16p/16GB virtual > > machines on different physical hosts. Machine A has twice the CPU > > cores per socket of machine B: > > > > unpatched patched > > machine A: 3m16s 2m00s > > machine B: 4m04s 4m05s > > > > Create rates: > > unpatched patched > > machine A: 282k+/-31k 468k+/-21k > > machine B: 231k+/-8k 233k+/-11k > > > > Concurrent rm of same 50 million inodes: > > > > unpatched patched > > machine A: 6m42s 2m33s > > machine B: 4m47s 4m47s > > > > The transaction rate on the fast machine went from just under > > 300k/sec to 700k/sec, which indicates just how much of a bottleneck > > this atomic counter was. > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > /me kinda wonders why removing the counter entirely has so little effect > on machine B, but seeing as I've been pondering killing this counter > myself for years, Because the transaction rate on machine B isn't high enough to that the cacheline bouncing becomes a limiting factor. Don't forget that the impact of cacheline bouncing is exponential - there is a very small window where it goes from "none at all" to "all the machine", and these two machines sit on either side of that threshold. i.e. the older machine is too slow to hit that threshold, the newer machine hits it easily. Cheers, Dave.
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h index c1f92c1847bb2..3725d25ad97e8 100644 --- a/fs/xfs/xfs_mount.h +++ b/fs/xfs/xfs_mount.h @@ -176,7 +176,6 @@ typedef struct xfs_mount { uint64_t m_resblks; /* total reserved blocks */ uint64_t m_resblks_avail;/* available reserved blocks */ uint64_t m_resblks_save; /* reserved blks @ remount,ro */ - atomic_t m_active_trans; /* number trans frozen */ struct delayed_work m_reclaim_work; /* background inode reclaim */ struct delayed_work m_eofblocks_work; /* background eof blocks trimming */ diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index aae469f73efeb..fa58cb07c8fdf 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -874,8 +874,10 @@ xfs_restore_resvblks(struct xfs_mount *mp) * there is no log replay required to write the inodes to disk - this is the * primary difference between a sync and a quiesce. * - * Note: xfs_log_quiesce() stops background log work - the callers must ensure - * it is started again when appropriate. + * We cancel log work early here to ensure all transactions the log worker may + * run have finished before we clean up and log the superblock and write an + * unmount record. The unfreeze process is responsible for restarting the log + * worker correctly. */ void xfs_quiesce_attr( @@ -883,9 +885,7 @@ xfs_quiesce_attr( { int error = 0; - /* wait for all modifications to complete */ - while (atomic_read(&mp->m_active_trans) > 0) - delay(100); + cancel_delayed_work_sync(&mp->m_log->l_work); /* force the log to unpin objects from the now complete transactions */ xfs_log_force(mp, XFS_LOG_SYNC); @@ -899,12 +899,6 @@ xfs_quiesce_attr( if (error) xfs_warn(mp, "xfs_attr_quiesce: failed to log sb changes. " "Frozen image may not be consistent."); - /* - * Just warn here till VFS can correctly support - * read-only remount without racing. - */ - WARN_ON(atomic_read(&mp->m_active_trans) != 0); - xfs_log_quiesce(mp); } @@ -1793,7 +1787,6 @@ static int xfs_init_fs_context( INIT_RADIX_TREE(&mp->m_perag_tree, GFP_ATOMIC); spin_lock_init(&mp->m_perag_lock); mutex_init(&mp->m_growlock); - atomic_set(&mp->m_active_trans, 0); INIT_WORK(&mp->m_flush_inodes_work, xfs_flush_inodes_worker); INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker); INIT_DELAYED_WORK(&mp->m_eofblocks_work, xfs_eofblocks_worker); diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c index b055a5ab53465..217937d743dbb 100644 --- a/fs/xfs/xfs_trans.c +++ b/fs/xfs/xfs_trans.c @@ -68,7 +68,6 @@ xfs_trans_free( xfs_extent_busy_clear(tp->t_mountp, &tp->t_busy, false); trace_xfs_trans_free(tp, _RET_IP_); - atomic_dec(&tp->t_mountp->m_active_trans); if (!(tp->t_flags & XFS_TRANS_NO_WRITECOUNT)) sb_end_intwrite(tp->t_mountp->m_super); xfs_trans_free_dqinfo(tp); @@ -125,8 +124,6 @@ xfs_trans_dup( xfs_defer_move(ntp, tp); xfs_trans_dup_dqinfo(tp, ntp); - - atomic_inc(&tp->t_mountp->m_active_trans); return ntp; } @@ -275,7 +272,6 @@ xfs_trans_alloc( */ WARN_ON(resp->tr_logres > 0 && mp->m_super->s_writers.frozen == SB_FREEZE_COMPLETE); - atomic_inc(&mp->m_active_trans); tp->t_magic = XFS_TRANS_HEADER_MAGIC; tp->t_flags = flags; @@ -299,20 +295,19 @@ xfs_trans_alloc( /* * Create an empty transaction with no reservation. This is a defensive - * mechanism for routines that query metadata without actually modifying - * them -- if the metadata being queried is somehow cross-linked (think a - * btree block pointer that points higher in the tree), we risk deadlock. - * However, blocks grabbed as part of a transaction can be re-grabbed. - * The verifiers will notice the corrupt block and the operation will fail - * back to userspace without deadlocking. + * mechanism for routines that query metadata without actually modifying them -- + * if the metadata being queried is somehow cross-linked (think a btree block + * pointer that points higher in the tree), we risk deadlock. However, blocks + * grabbed as part of a transaction can be re-grabbed. The verifiers will + * notice the corrupt block and the operation will fail back to userspace + * without deadlocking. * - * Note the zero-length reservation; this transaction MUST be cancelled - * without any dirty data. + * Note the zero-length reservation; this transaction MUST be cancelled without + * any dirty data. * - * Callers should obtain freeze protection to avoid two conflicts with fs - * freezing: (1) having active transactions trip the m_active_trans ASSERTs; - * and (2) grabbing buffers at the same time that freeze is trying to drain - * the buffer LRU list. + * Callers should obtain freeze protection to avoid a conflict with fs freezing + * where we can be grabbing buffers at the same time that freeze is trying to + * drain the buffer LRU list. */ int xfs_trans_alloc_empty(