Message ID | 20200227134321.7238-4-bfoster@redhat.com (mailing list archive) |
---|---|
State | Deferred, archived |
Series | xfs: automatic relogging experiment |
On 2/27/20 6:43 AM, Brian Foster wrote: > Automatic item relogging will occur from xfsaild context. xfsaild > cannot acquire log reservation itself because it is also responsible > for writeback and thus making used log reservation available again. > Since there is no guarantee log reservation is available by the time > a relogged item reaches the AIL, this is prone to deadlock. > > To guarantee log reservation for automatic relogging, implement a > reservation management scheme where a transaction that is capable of > enabling relogging of an item must contribute the necessary > reservation to the relog mechanism up front. Use reference counting > to associate the lifetime of pending relog reservation to the > lifetime of in-core log items with relogging enabled. > > The basic log reservation sequence for a relog enabled transaction > is as follows: > > - A transaction that uses relogging specifies XFS_TRANS_RELOG at > allocation time. > - Once initialized, RELOG transactions check for the existence of > the global relog log ticket. If it exists, grab a reference and > return. If not, allocate an empty ticket and install into the relog > subsystem. Seed the relog ticket from reservation of the current > transaction. Roll the current transaction to replenish its > reservation and return to the caller. > - The transaction is used as normal. If an item is relogged in the > transaction, that item acquires a reference on the global relog > ticket currently held open by the transaction. The item's reference > persists until relogging is disabled on the item. > - The RELOG transaction commits and releases its reference to the > global relog ticket. The global relog ticket is released once its > reference count drops to zero. > > This provides a central relog log ticket that guarantees reservation > availability for relogged items, avoids log reservation deadlocks > and is allocated and released on demand. > Ok, I followed it through and didn't see any obvious errors Reviewed-by: Allison Collins <allison.henderson@oracle.com> > Signed-off-by: Brian Foster <bfoster@redhat.com> > --- > fs/xfs/libxfs/xfs_shared.h | 1 + > fs/xfs/xfs_trans.c | 37 +++++++++++++--- > fs/xfs/xfs_trans.h | 3 ++ > fs/xfs/xfs_trans_ail.c | 89 ++++++++++++++++++++++++++++++++++++++ > fs/xfs/xfs_trans_priv.h | 1 + > 5 files changed, 126 insertions(+), 5 deletions(-) > > diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h > index c45acbd3add9..0a10ca0853ab 100644 > --- a/fs/xfs/libxfs/xfs_shared.h > +++ b/fs/xfs/libxfs/xfs_shared.h > @@ -77,6 +77,7 @@ void xfs_log_get_max_trans_res(struct xfs_mount *mp, > * made then this algorithm will eventually find all the space it needs. > */ > #define XFS_TRANS_LOWMODE 0x100 /* allocate in low space mode */ > +#define XFS_TRANS_RELOG 0x200 /* enable automatic relogging */ > > /* > * Field values for xfs_trans_mod_sb. > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c > index 3b208f9a865c..8ac05ed8deda 100644 > --- a/fs/xfs/xfs_trans.c > +++ b/fs/xfs/xfs_trans.c > @@ -107,9 +107,14 @@ xfs_trans_dup( > > ntp->t_flags = XFS_TRANS_PERM_LOG_RES | > (tp->t_flags & XFS_TRANS_RESERVE) | > - (tp->t_flags & XFS_TRANS_NO_WRITECOUNT); > - /* We gave our writer reference to the new transaction */ > + (tp->t_flags & XFS_TRANS_NO_WRITECOUNT) | > + (tp->t_flags & XFS_TRANS_RELOG); > + /* > + * The writer reference and relog reference transfer to the new > + * transaction. 
> + */ > tp->t_flags |= XFS_TRANS_NO_WRITECOUNT; > + tp->t_flags &= ~XFS_TRANS_RELOG; > ntp->t_ticket = xfs_log_ticket_get(tp->t_ticket); > > ASSERT(tp->t_blk_res >= tp->t_blk_res_used); > @@ -284,15 +289,25 @@ xfs_trans_alloc( > tp->t_firstblock = NULLFSBLOCK; > > error = xfs_trans_reserve(tp, resp, blocks, rtextents); > - if (error) { > - xfs_trans_cancel(tp); > - return error; > + if (error) > + goto error; > + > + if (flags & XFS_TRANS_RELOG) { > + error = xfs_trans_ail_relog_reserve(&tp); > + if (error) > + goto error; > } > > trace_xfs_trans_alloc(tp, _RET_IP_); > > *tpp = tp; > return 0; > + > +error: > + /* clear relog flag if we haven't acquired a ref */ > + tp->t_flags &= ~XFS_TRANS_RELOG; > + xfs_trans_cancel(tp); > + return error; > } > > /* > @@ -973,6 +988,10 @@ __xfs_trans_commit( > > xfs_log_commit_cil(mp, tp, &commit_lsn, regrant); > > + /* release the relog ticket reference if this transaction holds one */ > + if (tp->t_flags & XFS_TRANS_RELOG) > + xfs_trans_ail_relog_put(mp); > + > current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS); > xfs_trans_free(tp); > > @@ -1004,6 +1023,10 @@ __xfs_trans_commit( > error = -EIO; > tp->t_ticket = NULL; > } > + /* release the relog ticket reference if this transaction holds one */ > + /* XXX: handle RELOG items on transaction abort */ > + if (tp->t_flags & XFS_TRANS_RELOG) > + xfs_trans_ail_relog_put(mp); > current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS); > xfs_trans_free_items(tp, !!error); > xfs_trans_free(tp); > @@ -1064,6 +1087,10 @@ xfs_trans_cancel( > tp->t_ticket = NULL; > } > > + /* release the relog ticket reference if this transaction holds one */ > + if (tp->t_flags & XFS_TRANS_RELOG) > + xfs_trans_ail_relog_put(mp); > + > /* mark this thread as no longer being in a transaction */ > current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS); > > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h > index 752c7fef9de7..a032989943bd 100644 > --- a/fs/xfs/xfs_trans.h > +++ b/fs/xfs/xfs_trans.h > @@ -236,6 +236,9 @@ int xfs_trans_roll_inode(struct xfs_trans **, struct xfs_inode *); > void xfs_trans_cancel(xfs_trans_t *); > int xfs_trans_ail_init(struct xfs_mount *); > void xfs_trans_ail_destroy(struct xfs_mount *); > +int xfs_trans_ail_relog_reserve(struct xfs_trans **); > +bool xfs_trans_ail_relog_get(struct xfs_mount *); > +int xfs_trans_ail_relog_put(struct xfs_mount *); > > void xfs_trans_buf_set_type(struct xfs_trans *, struct xfs_buf *, > enum xfs_blft); > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c > index 00cc5b8734be..a3fb64275baa 100644 > --- a/fs/xfs/xfs_trans_ail.c > +++ b/fs/xfs/xfs_trans_ail.c > @@ -17,6 +17,7 @@ > #include "xfs_errortag.h" > #include "xfs_error.h" > #include "xfs_log.h" > +#include "xfs_log_priv.h" > > #ifdef DEBUG > /* > @@ -818,6 +819,93 @@ xfs_trans_ail_delete( > xfs_log_space_wake(ailp->ail_mount); > } > > +bool > +xfs_trans_ail_relog_get( > + struct xfs_mount *mp) > +{ > + struct xfs_ail *ailp = mp->m_ail; > + bool ret = false; > + > + spin_lock(&ailp->ail_lock); > + if (ailp->ail_relog_tic) { > + xfs_log_ticket_get(ailp->ail_relog_tic); > + ret = true; > + } > + spin_unlock(&ailp->ail_lock); > + return ret; > +} > + > +/* > + * Reserve log space for the automatic relogging ->tr_relog ticket. This > + * requires a clean, permanent transaction from the caller. Pull reservation > + * for the relog ticket and roll the caller's transaction back to its fully > + * reserved state. 
If the AIL relog ticket is already initialized, grab a > + * reference and return. > + */ > +int > +xfs_trans_ail_relog_reserve( > + struct xfs_trans **tpp) > +{ > + struct xfs_trans *tp = *tpp; > + struct xfs_mount *mp = tp->t_mountp; > + struct xfs_ail *ailp = mp->m_ail; > + struct xlog_ticket *tic; > + uint32_t logres = M_RES(mp)->tr_relog.tr_logres; > + > + ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); > + ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY)); > + > + if (xfs_trans_ail_relog_get(mp)) > + return 0; > + > + /* no active ticket, fall into slow path to allocate one.. */ > + tic = xlog_ticket_alloc(mp->m_log, logres, 1, XFS_TRANSACTION, true, 0); > + if (!tic) > + return -ENOMEM; > + ASSERT(tp->t_ticket->t_curr_res >= tic->t_curr_res); > + > + /* check again since we dropped the lock for the allocation */ > + spin_lock(&ailp->ail_lock); > + if (ailp->ail_relog_tic) { > + xfs_log_ticket_get(ailp->ail_relog_tic); > + spin_unlock(&ailp->ail_lock); > + xfs_log_ticket_put(tic); > + return 0; > + } > + > + /* attach and reserve space for the ->tr_relog ticket */ > + ailp->ail_relog_tic = tic; > + tp->t_ticket->t_curr_res -= tic->t_curr_res; > + spin_unlock(&ailp->ail_lock); > + > + return xfs_trans_roll(tpp); > +} > + > +/* > + * Release a reference to the relog ticket. > + */ > +int > +xfs_trans_ail_relog_put( > + struct xfs_mount *mp) > +{ > + struct xfs_ail *ailp = mp->m_ail; > + struct xlog_ticket *tic; > + > + spin_lock(&ailp->ail_lock); > + if (atomic_add_unless(&ailp->ail_relog_tic->t_ref, -1, 1)) { > + spin_unlock(&ailp->ail_lock); > + return 0; > + } > + > + ASSERT(atomic_read(&ailp->ail_relog_tic->t_ref) == 1); > + tic = ailp->ail_relog_tic; > + ailp->ail_relog_tic = NULL; > + spin_unlock(&ailp->ail_lock); > + > + xfs_log_done(mp, tic, NULL, false); > + return 0; > +} > + > int > xfs_trans_ail_init( > xfs_mount_t *mp) > @@ -854,6 +942,7 @@ xfs_trans_ail_destroy( > { > struct xfs_ail *ailp = mp->m_ail; > > + ASSERT(ailp->ail_relog_tic == NULL); > kthread_stop(ailp->ail_task); > kmem_free(ailp); > } > diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h > index 2e073c1c4614..839df6559b9f 100644 > --- a/fs/xfs/xfs_trans_priv.h > +++ b/fs/xfs/xfs_trans_priv.h > @@ -61,6 +61,7 @@ struct xfs_ail { > int ail_log_flush; > struct list_head ail_buf_list; > wait_queue_head_t ail_empty; > + struct xlog_ticket *ail_relog_tic; > }; > > /* >
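The commit message describes log items taking and dropping references on the central relog ticket as relogging is enabled and disabled on them; that item-side plumbing arrives later in the series. A minimal sketch of how it could sit on top of the helpers introduced in this patch, with XFS_LI_RELOG and both function names invented purely for illustration:

/*
 * Sketch only: tie a log item's relog state to the central relog
 * ticket using the helpers added by this patch. XFS_LI_RELOG and the
 * function names are hypothetical, not part of the posted code.
 */
static void
xfs_item_relog_enable(
	struct xfs_trans	*tp,
	struct xfs_log_item	*lip)
{
	ASSERT(tp->t_flags & XFS_TRANS_RELOG);

	/*
	 * The enabling transaction already holds a relog ticket
	 * reference, so the ticket is guaranteed to exist here; the
	 * item takes its own reference, which persists until
	 * relogging is disabled on it.
	 */
	if (!test_and_set_bit(XFS_LI_RELOG, &lip->li_flags))
		xfs_trans_ail_relog_get(tp->t_mountp);
}

static void
xfs_item_relog_disable(
	struct xfs_log_item	*lip)
{
	/* dropping the last item reference tears the ticket down */
	if (test_and_clear_bit(XFS_LI_RELOG, &lip->li_flags))
		xfs_trans_ail_relog_put(lip->li_mountp);
}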
On Thu, Feb 27, 2020 at 08:43:15AM -0500, Brian Foster wrote: > Automatic item relogging will occur from xfsaild context. xfsaild > cannot acquire log reservation itself because it is also responsible > for writeback and thus making used log reservation available again. > Since there is no guarantee log reservation is available by the time > a relogged item reaches the AIL, this is prone to deadlock. > > To guarantee log reservation for automatic relogging, implement a > reservation management scheme where a transaction that is capable of > enabling relogging of an item must contribute the necessary > reservation to the relog mechanism up front. Ooooh, I had wondered where I was going to find that hook. :) What does it mean to be capable of enabling relogging of an item? For the quotaoff example, does this mean that all the transactions that happen on behalf of a quotaoff operation must specify TRANS_RELOG? What if the relog thread grinds to a halt while other non-RELOG threads continue to push things into the log? Can we starve and/or livelock waiting around? Or should the log be able to kick some higher level thread to inject a TRANS_RELOG transaction to move things along? > Use reference counting > to associate the lifetime of pending relog reservation to the > lifetime of in-core log items with relogging enabled. Ok, so we only have to pay the relog reservation while there are reloggable items floating around in the system. > The basic log reservation sequence for a relog enabled transaction > is as follows: > > - A transaction that uses relogging specifies XFS_TRANS_RELOG at > allocation time. > - Once initialized, RELOG transactions check for the existence of > the global relog log ticket. If it exists, grab a reference and > return. If not, allocate an empty ticket and install into the relog > subsystem. Seed the relog ticket from reservation of the current > transaction. Roll the current transaction to replenish its > reservation and return to the caller. I guess we'd have to be careful that the transaction we're stealing from actually has enough reservation to re-log a pending item, but that shouldn't be difficult. I worry that there might be some operation somewhere that Just Works because tr_logcount * tr_logres is enough space for it to run without having to get more reseration, but (tr_logcount - 1) * tr_logres isn't enough. Though that might not be a big issue seeing how bloated the log reservations become when reflink and rmap are turned on. <cough> > - The transaction is used as normal. If an item is relogged in the > transaction, that item acquires a reference on the global relog > ticket currently held open by the transaction. The item's reference > persists until relogging is disabled on the item. > - The RELOG transaction commits and releases its reference to the > global relog ticket. The global relog ticket is released once its > reference count drops to zero. > > This provides a central relog log ticket that guarantees reservation > availability for relogged items, avoids log reservation deadlocks > and is allocated and released on demand. Sounds cool. /me jumps in. 
> Signed-off-by: Brian Foster <bfoster@redhat.com> > --- > fs/xfs/libxfs/xfs_shared.h | 1 + > fs/xfs/xfs_trans.c | 37 +++++++++++++--- > fs/xfs/xfs_trans.h | 3 ++ > fs/xfs/xfs_trans_ail.c | 89 ++++++++++++++++++++++++++++++++++++++ > fs/xfs/xfs_trans_priv.h | 1 + > 5 files changed, 126 insertions(+), 5 deletions(-) > > diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h > index c45acbd3add9..0a10ca0853ab 100644 > --- a/fs/xfs/libxfs/xfs_shared.h > +++ b/fs/xfs/libxfs/xfs_shared.h > @@ -77,6 +77,7 @@ void xfs_log_get_max_trans_res(struct xfs_mount *mp, > * made then this algorithm will eventually find all the space it needs. > */ > #define XFS_TRANS_LOWMODE 0x100 /* allocate in low space mode */ > +#define XFS_TRANS_RELOG 0x200 /* enable automatic relogging */ > > /* > * Field values for xfs_trans_mod_sb. > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c > index 3b208f9a865c..8ac05ed8deda 100644 > --- a/fs/xfs/xfs_trans.c > +++ b/fs/xfs/xfs_trans.c > @@ -107,9 +107,14 @@ xfs_trans_dup( > > ntp->t_flags = XFS_TRANS_PERM_LOG_RES | > (tp->t_flags & XFS_TRANS_RESERVE) | > - (tp->t_flags & XFS_TRANS_NO_WRITECOUNT); > - /* We gave our writer reference to the new transaction */ > + (tp->t_flags & XFS_TRANS_NO_WRITECOUNT) | > + (tp->t_flags & XFS_TRANS_RELOG); > + /* > + * The writer reference and relog reference transfer to the new > + * transaction. > + */ > tp->t_flags |= XFS_TRANS_NO_WRITECOUNT; > + tp->t_flags &= ~XFS_TRANS_RELOG; > ntp->t_ticket = xfs_log_ticket_get(tp->t_ticket); > > ASSERT(tp->t_blk_res >= tp->t_blk_res_used); > @@ -284,15 +289,25 @@ xfs_trans_alloc( > tp->t_firstblock = NULLFSBLOCK; > > error = xfs_trans_reserve(tp, resp, blocks, rtextents); > - if (error) { > - xfs_trans_cancel(tp); > - return error; > + if (error) > + goto error; > + > + if (flags & XFS_TRANS_RELOG) { > + error = xfs_trans_ail_relog_reserve(&tp); > + if (error) > + goto error; > } > > trace_xfs_trans_alloc(tp, _RET_IP_); > > *tpp = tp; > return 0; > + > +error: > + /* clear relog flag if we haven't acquired a ref */ > + tp->t_flags &= ~XFS_TRANS_RELOG; > + xfs_trans_cancel(tp); > + return error; > } > > /* > @@ -973,6 +988,10 @@ __xfs_trans_commit( > > xfs_log_commit_cil(mp, tp, &commit_lsn, regrant); > > + /* release the relog ticket reference if this transaction holds one */ > + if (tp->t_flags & XFS_TRANS_RELOG) > + xfs_trans_ail_relog_put(mp); > + > current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS); > xfs_trans_free(tp); > > @@ -1004,6 +1023,10 @@ __xfs_trans_commit( > error = -EIO; > tp->t_ticket = NULL; > } > + /* release the relog ticket reference if this transaction holds one */ > + /* XXX: handle RELOG items on transaction abort */ "Handle"? Hm. Do the reloggable items end up attached in some way to this new transaction, or are we purely stealing the reservation so that the ail can use it to relog the items on its own? If it's the second, then I wonder what handling do we need to do? Or maybe you meant handling the relog items that the caller attached to this relog transaction? Won't those get cancelled the same way they do now? Mechanically this looks reasonable. 
--D > + if (tp->t_flags & XFS_TRANS_RELOG) > + xfs_trans_ail_relog_put(mp); > current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS); > xfs_trans_free_items(tp, !!error); > xfs_trans_free(tp); > @@ -1064,6 +1087,10 @@ xfs_trans_cancel( > tp->t_ticket = NULL; > } > > + /* release the relog ticket reference if this transaction holds one */ > + if (tp->t_flags & XFS_TRANS_RELOG) > + xfs_trans_ail_relog_put(mp); > + > /* mark this thread as no longer being in a transaction */ > current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS); > > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h > index 752c7fef9de7..a032989943bd 100644 > --- a/fs/xfs/xfs_trans.h > +++ b/fs/xfs/xfs_trans.h > @@ -236,6 +236,9 @@ int xfs_trans_roll_inode(struct xfs_trans **, struct xfs_inode *); > void xfs_trans_cancel(xfs_trans_t *); > int xfs_trans_ail_init(struct xfs_mount *); > void xfs_trans_ail_destroy(struct xfs_mount *); > +int xfs_trans_ail_relog_reserve(struct xfs_trans **); > +bool xfs_trans_ail_relog_get(struct xfs_mount *); > +int xfs_trans_ail_relog_put(struct xfs_mount *); > > void xfs_trans_buf_set_type(struct xfs_trans *, struct xfs_buf *, > enum xfs_blft); > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c > index 00cc5b8734be..a3fb64275baa 100644 > --- a/fs/xfs/xfs_trans_ail.c > +++ b/fs/xfs/xfs_trans_ail.c > @@ -17,6 +17,7 @@ > #include "xfs_errortag.h" > #include "xfs_error.h" > #include "xfs_log.h" > +#include "xfs_log_priv.h" > > #ifdef DEBUG > /* > @@ -818,6 +819,93 @@ xfs_trans_ail_delete( > xfs_log_space_wake(ailp->ail_mount); > } > > +bool > +xfs_trans_ail_relog_get( > + struct xfs_mount *mp) > +{ > + struct xfs_ail *ailp = mp->m_ail; > + bool ret = false; > + > + spin_lock(&ailp->ail_lock); > + if (ailp->ail_relog_tic) { > + xfs_log_ticket_get(ailp->ail_relog_tic); > + ret = true; > + } > + spin_unlock(&ailp->ail_lock); > + return ret; > +} > + > +/* > + * Reserve log space for the automatic relogging ->tr_relog ticket. This > + * requires a clean, permanent transaction from the caller. Pull reservation > + * for the relog ticket and roll the caller's transaction back to its fully > + * reserved state. If the AIL relog ticket is already initialized, grab a > + * reference and return. > + */ > +int > +xfs_trans_ail_relog_reserve( > + struct xfs_trans **tpp) > +{ > + struct xfs_trans *tp = *tpp; > + struct xfs_mount *mp = tp->t_mountp; > + struct xfs_ail *ailp = mp->m_ail; > + struct xlog_ticket *tic; > + uint32_t logres = M_RES(mp)->tr_relog.tr_logres; > + > + ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); > + ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY)); > + > + if (xfs_trans_ail_relog_get(mp)) > + return 0; > + > + /* no active ticket, fall into slow path to allocate one.. */ > + tic = xlog_ticket_alloc(mp->m_log, logres, 1, XFS_TRANSACTION, true, 0); > + if (!tic) > + return -ENOMEM; > + ASSERT(tp->t_ticket->t_curr_res >= tic->t_curr_res); > + > + /* check again since we dropped the lock for the allocation */ > + spin_lock(&ailp->ail_lock); > + if (ailp->ail_relog_tic) { > + xfs_log_ticket_get(ailp->ail_relog_tic); > + spin_unlock(&ailp->ail_lock); > + xfs_log_ticket_put(tic); > + return 0; > + } > + > + /* attach and reserve space for the ->tr_relog ticket */ > + ailp->ail_relog_tic = tic; > + tp->t_ticket->t_curr_res -= tic->t_curr_res; > + spin_unlock(&ailp->ail_lock); > + > + return xfs_trans_roll(tpp); > +} > + > +/* > + * Release a reference to the relog ticket. 
> + */ > +int > +xfs_trans_ail_relog_put( > + struct xfs_mount *mp) > +{ > + struct xfs_ail *ailp = mp->m_ail; > + struct xlog_ticket *tic; > + > + spin_lock(&ailp->ail_lock); > + if (atomic_add_unless(&ailp->ail_relog_tic->t_ref, -1, 1)) { > + spin_unlock(&ailp->ail_lock); > + return 0; > + } > + > + ASSERT(atomic_read(&ailp->ail_relog_tic->t_ref) == 1); > + tic = ailp->ail_relog_tic; > + ailp->ail_relog_tic = NULL; > + spin_unlock(&ailp->ail_lock); > + > + xfs_log_done(mp, tic, NULL, false); > + return 0; > +} > + > int > xfs_trans_ail_init( > xfs_mount_t *mp) > @@ -854,6 +942,7 @@ xfs_trans_ail_destroy( > { > struct xfs_ail *ailp = mp->m_ail; > > + ASSERT(ailp->ail_relog_tic == NULL); > kthread_stop(ailp->ail_task); > kmem_free(ailp); > } > diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h > index 2e073c1c4614..839df6559b9f 100644 > --- a/fs/xfs/xfs_trans_priv.h > +++ b/fs/xfs/xfs_trans_priv.h > @@ -61,6 +61,7 @@ struct xfs_ail { > int ail_log_flush; > struct list_head ail_buf_list; > wait_queue_head_t ail_empty; > + struct xlog_ticket *ail_relog_tic; > }; > > /* > -- > 2.21.1 >
On Thu, Feb 27, 2020 at 04:02:18PM -0800, Darrick J. Wong wrote: > On Thu, Feb 27, 2020 at 08:43:15AM -0500, Brian Foster wrote: > > Automatic item relogging will occur from xfsaild context. xfsaild > > cannot acquire log reservation itself because it is also responsible > > for writeback and thus making used log reservation available again. > > Since there is no guarantee log reservation is available by the time > > a relogged item reaches the AIL, this is prone to deadlock. > > > > To guarantee log reservation for automatic relogging, implement a > > reservation management scheme where a transaction that is capable of > > enabling relogging of an item must contribute the necessary > > reservation to the relog mechanism up front. > > Ooooh, I had wondered where I was going to find that hook. :) > > What does it mean to be capable of enabling relogging of an item? > It's basically a requirement introduced by this patch that any transaction that wants to enable relog on an item needs to check whether the "central relog ticket" already exists, and if not, it needs to contribute reservation to it as part of its normal transaction allocation process. > For the quotaoff example, does this mean that all the transactions that > happen on behalf of a quotaoff operation must specify TRANS_RELOG? > No, only the transaction that commits the quotaoff start intent. > What if the relog thread grinds to a halt while other non-RELOG threads > continue to push things into the log? Can we starve and/or livelock > waiting around? Or should the log be able to kick some higher level > thread to inject a TRANS_RELOG transaction to move things along? > I hope not. :P At least I haven't seen such issues in my (fairly limited and focused) testing so far. > > Use reference counting > > to associate the lifetime of pending relog reservation to the > > lifetime of in-core log items with relogging enabled. > > Ok, so we only have to pay the relog reservation while there are > reloggable items floating around in the system. > Yeah, I was back and forth on this because the relog reservation tracking thing is a bit of complexity itself. It's much simpler to have the AIL or something allocate a relog transaction up front and just keep it around for the lifetime of the mount. My thinking was that relogs could be quite rare (quotaoff, scrub), so it's probably an unfair use of resources. I figured it best to try and provide some flexibility up front and we can always back off to something more simplified later. If we ended up with something like a reloggable "zero on recovery" intent for writeback, that ties more into core fs functionality and the simpler relog ticket implementation might be justified. > > The basic log reservation sequence for a relog enabled transaction > > is as follows: > > > > - A transaction that uses relogging specifies XFS_TRANS_RELOG at > > allocation time. > > - Once initialized, RELOG transactions check for the existence of > > the global relog log ticket. If it exists, grab a reference and > > return. If not, allocate an empty ticket and install into the relog > > subsystem. Seed the relog ticket from reservation of the current > > transaction. Roll the current transaction to replenish its > > reservation and return to the caller. > > I guess we'd have to be careful that the transaction we're stealing from > actually has enough reservation to re-log a pending item, but that > shouldn't be difficult. 
> > I worry that there might be some operation somewhere that Just Works > because tr_logcount * tr_logres is enough space for it to run without > having to get more reseration, but (tr_logcount - 1) * tr_logres isn't > enough. Though that might not be a big issue seeing how bloated the > log reservations become when reflink and rmap are turned on. <cough> > This ties in to the size calculation of the relog transaction and the fact that the contributing transaction rolls before it carries on to normal use. IOW, this approach adds zero extra reservation consumption to existing transaction commits, so there should be no additional risk of reservation overrun in that regard. > > - The transaction is used as normal. If an item is relogged in the > > transaction, that item acquires a reference on the global relog > > ticket currently held open by the transaction. The item's reference > > persists until relogging is disabled on the item. > > - The RELOG transaction commits and releases its reference to the > > global relog ticket. The global relog ticket is released once its > > reference count drops to zero. > > > > This provides a central relog log ticket that guarantees reservation > > availability for relogged items, avoids log reservation deadlocks > > and is allocated and released on demand. > > Sounds cool. /me jumps in. > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > --- > > fs/xfs/libxfs/xfs_shared.h | 1 + > > fs/xfs/xfs_trans.c | 37 +++++++++++++--- > > fs/xfs/xfs_trans.h | 3 ++ > > fs/xfs/xfs_trans_ail.c | 89 ++++++++++++++++++++++++++++++++++++++ > > fs/xfs/xfs_trans_priv.h | 1 + > > 5 files changed, 126 insertions(+), 5 deletions(-) > > > > diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h > > index c45acbd3add9..0a10ca0853ab 100644 > > --- a/fs/xfs/libxfs/xfs_shared.h > > +++ b/fs/xfs/libxfs/xfs_shared.h > > @@ -77,6 +77,7 @@ void xfs_log_get_max_trans_res(struct xfs_mount *mp, > > * made then this algorithm will eventually find all the space it needs. > > */ > > #define XFS_TRANS_LOWMODE 0x100 /* allocate in low space mode */ > > +#define XFS_TRANS_RELOG 0x200 /* enable automatic relogging */ > > > > /* > > * Field values for xfs_trans_mod_sb. > > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c > > index 3b208f9a865c..8ac05ed8deda 100644 > > --- a/fs/xfs/xfs_trans.c > > +++ b/fs/xfs/xfs_trans.c > > @@ -107,9 +107,14 @@ xfs_trans_dup( > > > > ntp->t_flags = XFS_TRANS_PERM_LOG_RES | > > (tp->t_flags & XFS_TRANS_RESERVE) | > > - (tp->t_flags & XFS_TRANS_NO_WRITECOUNT); > > - /* We gave our writer reference to the new transaction */ > > + (tp->t_flags & XFS_TRANS_NO_WRITECOUNT) | > > + (tp->t_flags & XFS_TRANS_RELOG); > > + /* > > + * The writer reference and relog reference transfer to the new > > + * transaction. 
> > + */ > > tp->t_flags |= XFS_TRANS_NO_WRITECOUNT; > > + tp->t_flags &= ~XFS_TRANS_RELOG; > > ntp->t_ticket = xfs_log_ticket_get(tp->t_ticket); > > > > ASSERT(tp->t_blk_res >= tp->t_blk_res_used); > > @@ -284,15 +289,25 @@ xfs_trans_alloc( > > tp->t_firstblock = NULLFSBLOCK; > > > > error = xfs_trans_reserve(tp, resp, blocks, rtextents); > > - if (error) { > > - xfs_trans_cancel(tp); > > - return error; > > + if (error) > > + goto error; > > + > > + if (flags & XFS_TRANS_RELOG) { > > + error = xfs_trans_ail_relog_reserve(&tp); > > + if (error) > > + goto error; > > } > > > > trace_xfs_trans_alloc(tp, _RET_IP_); > > > > *tpp = tp; > > return 0; > > + > > +error: > > + /* clear relog flag if we haven't acquired a ref */ > > + tp->t_flags &= ~XFS_TRANS_RELOG; > > + xfs_trans_cancel(tp); > > + return error; > > } > > > > /* > > @@ -973,6 +988,10 @@ __xfs_trans_commit( > > > > xfs_log_commit_cil(mp, tp, &commit_lsn, regrant); > > > > + /* release the relog ticket reference if this transaction holds one */ > > + if (tp->t_flags & XFS_TRANS_RELOG) > > + xfs_trans_ail_relog_put(mp); > > + > > current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS); > > xfs_trans_free(tp); > > > > @@ -1004,6 +1023,10 @@ __xfs_trans_commit( > > error = -EIO; > > tp->t_ticket = NULL; > > } > > + /* release the relog ticket reference if this transaction holds one */ > > + /* XXX: handle RELOG items on transaction abort */ > > "Handle"? Hm. Do the reloggable items end up attached in some way to > this new transaction, or are we purely stealing the reservation so that > the ail can use it to relog the items on its own? If it's the second, > then I wonder what handling do we need to do? > More the latter... > Or maybe you meant handling the relog items that the caller attached to > this relog transaction? Won't those get cancelled the same way they do > now? > This comment was more of a note to self that when putting this together I hadn't thought through the abort/shutdown path and whether the code is correct (i.e., should the transaction cancel relog state? I still need to test an abort of a relog commit, etc.). That's still something I need to work through, but I wouldn't read more into the comment than that. Brian > Mechanically this looks reasonable. 
> > --D > > > + if (tp->t_flags & XFS_TRANS_RELOG) > > + xfs_trans_ail_relog_put(mp); > > current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS); > > xfs_trans_free_items(tp, !!error); > > xfs_trans_free(tp); > > @@ -1064,6 +1087,10 @@ xfs_trans_cancel( > > tp->t_ticket = NULL; > > } > > > > + /* release the relog ticket reference if this transaction holds one */ > > + if (tp->t_flags & XFS_TRANS_RELOG) > > + xfs_trans_ail_relog_put(mp); > > + > > /* mark this thread as no longer being in a transaction */ > > current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS); > > > > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h > > index 752c7fef9de7..a032989943bd 100644 > > --- a/fs/xfs/xfs_trans.h > > +++ b/fs/xfs/xfs_trans.h > > @@ -236,6 +236,9 @@ int xfs_trans_roll_inode(struct xfs_trans **, struct xfs_inode *); > > void xfs_trans_cancel(xfs_trans_t *); > > int xfs_trans_ail_init(struct xfs_mount *); > > void xfs_trans_ail_destroy(struct xfs_mount *); > > +int xfs_trans_ail_relog_reserve(struct xfs_trans **); > > +bool xfs_trans_ail_relog_get(struct xfs_mount *); > > +int xfs_trans_ail_relog_put(struct xfs_mount *); > > > > void xfs_trans_buf_set_type(struct xfs_trans *, struct xfs_buf *, > > enum xfs_blft); > > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c > > index 00cc5b8734be..a3fb64275baa 100644 > > --- a/fs/xfs/xfs_trans_ail.c > > +++ b/fs/xfs/xfs_trans_ail.c > > @@ -17,6 +17,7 @@ > > #include "xfs_errortag.h" > > #include "xfs_error.h" > > #include "xfs_log.h" > > +#include "xfs_log_priv.h" > > > > #ifdef DEBUG > > /* > > @@ -818,6 +819,93 @@ xfs_trans_ail_delete( > > xfs_log_space_wake(ailp->ail_mount); > > } > > > > +bool > > +xfs_trans_ail_relog_get( > > + struct xfs_mount *mp) > > +{ > > + struct xfs_ail *ailp = mp->m_ail; > > + bool ret = false; > > + > > + spin_lock(&ailp->ail_lock); > > + if (ailp->ail_relog_tic) { > > + xfs_log_ticket_get(ailp->ail_relog_tic); > > + ret = true; > > + } > > + spin_unlock(&ailp->ail_lock); > > + return ret; > > +} > > + > > +/* > > + * Reserve log space for the automatic relogging ->tr_relog ticket. This > > + * requires a clean, permanent transaction from the caller. Pull reservation > > + * for the relog ticket and roll the caller's transaction back to its fully > > + * reserved state. If the AIL relog ticket is already initialized, grab a > > + * reference and return. > > + */ > > +int > > +xfs_trans_ail_relog_reserve( > > + struct xfs_trans **tpp) > > +{ > > + struct xfs_trans *tp = *tpp; > > + struct xfs_mount *mp = tp->t_mountp; > > + struct xfs_ail *ailp = mp->m_ail; > > + struct xlog_ticket *tic; > > + uint32_t logres = M_RES(mp)->tr_relog.tr_logres; > > + > > + ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); > > + ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY)); > > + > > + if (xfs_trans_ail_relog_get(mp)) > > + return 0; > > + > > + /* no active ticket, fall into slow path to allocate one.. 
*/ > > + tic = xlog_ticket_alloc(mp->m_log, logres, 1, XFS_TRANSACTION, true, 0); > > + if (!tic) > > + return -ENOMEM; > > + ASSERT(tp->t_ticket->t_curr_res >= tic->t_curr_res); > > + > > + /* check again since we dropped the lock for the allocation */ > > + spin_lock(&ailp->ail_lock); > > + if (ailp->ail_relog_tic) { > > + xfs_log_ticket_get(ailp->ail_relog_tic); > > + spin_unlock(&ailp->ail_lock); > > + xfs_log_ticket_put(tic); > > + return 0; > > + } > > + > > + /* attach and reserve space for the ->tr_relog ticket */ > > + ailp->ail_relog_tic = tic; > > + tp->t_ticket->t_curr_res -= tic->t_curr_res; > > + spin_unlock(&ailp->ail_lock); > > + > > + return xfs_trans_roll(tpp); > > +} > > + > > +/* > > + * Release a reference to the relog ticket. > > + */ > > +int > > +xfs_trans_ail_relog_put( > > + struct xfs_mount *mp) > > +{ > > + struct xfs_ail *ailp = mp->m_ail; > > + struct xlog_ticket *tic; > > + > > + spin_lock(&ailp->ail_lock); > > + if (atomic_add_unless(&ailp->ail_relog_tic->t_ref, -1, 1)) { > > + spin_unlock(&ailp->ail_lock); > > + return 0; > > + } > > + > > + ASSERT(atomic_read(&ailp->ail_relog_tic->t_ref) == 1); > > + tic = ailp->ail_relog_tic; > > + ailp->ail_relog_tic = NULL; > > + spin_unlock(&ailp->ail_lock); > > + > > + xfs_log_done(mp, tic, NULL, false); > > + return 0; > > +} > > + > > int > > xfs_trans_ail_init( > > xfs_mount_t *mp) > > @@ -854,6 +942,7 @@ xfs_trans_ail_destroy( > > { > > struct xfs_ail *ailp = mp->m_ail; > > > > + ASSERT(ailp->ail_relog_tic == NULL); > > kthread_stop(ailp->ail_task); > > kmem_free(ailp); > > } > > diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h > > index 2e073c1c4614..839df6559b9f 100644 > > --- a/fs/xfs/xfs_trans_priv.h > > +++ b/fs/xfs/xfs_trans_priv.h > > @@ -61,6 +61,7 @@ struct xfs_ail { > > int ail_log_flush; > > struct list_head ail_buf_list; > > wait_queue_head_t ail_empty; > > + struct xlog_ticket *ail_relog_tic; > > }; > > > > /* > > -- > > 2.21.1 > > >
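To make the quotaoff example above concrete, a rough usage sketch under the scheme in this patch: only the transaction that logs the quotaoff start intent passes XFS_TRANS_RELOG, and the relog ticket is seeded (or referenced) as a side effect of transaction allocation. The function name is made up and the intent logging is elided:

/*
 * Illustrative only: a quotaoff-start style caller opting in to
 * relogging. The reservation and flag names come from existing code
 * and this patch; the helper name and elided intent setup do not.
 */
STATIC int
xfs_qm_log_quotaoff_start_relog(
	struct xfs_mount	*mp)
{
	struct xfs_trans	*tp;
	int			error;

	/* contributes relog reservation or takes a ticket reference */
	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0,
			XFS_TRANS_RELOG, &tp);
	if (error)
		return error;

	/* ... log the quotaoff start intent and enable relogging on it ... */

	return xfs_trans_commit(tp);
}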
On Thu, Feb 27, 2020 at 08:43:15AM -0500, Brian Foster wrote: > Automatic item relogging will occur from xfsaild context. xfsaild > cannot acquire log reservation itself because it is also responsible > for writeback and thus making used log reservation available again. > Since there is no guarantee log reservation is available by the time > a relogged item reaches the AIL, this is prone to deadlock. > > To guarantee log reservation for automatic relogging, implement a > reservation management scheme where a transaction that is capable of > enabling relogging of an item must contribute the necessary > reservation to the relog mechanism up front. Use reference counting > to associate the lifetime of pending relog reservation to the > lifetime of in-core log items with relogging enabled. > > The basic log reservation sequence for a relog enabled transaction > is as follows: > > - A transaction that uses relogging specifies XFS_TRANS_RELOG at > allocation time. > - Once initialized, RELOG transactions check for the existence of > the global relog log ticket. If it exists, grab a reference and > return. If not, allocate an empty ticket and install into the relog > subsystem. Seed the relog ticket from reservation of the current > transaction. Roll the current transaction to replenish its > reservation and return to the caller. > - The transaction is used as normal. If an item is relogged in the > transaction, that item acquires a reference on the global relog > ticket currently held open by the transaction. The item's reference > persists until relogging is disabled on the item. > - The RELOG transaction commits and releases its reference to the > global relog ticket. The global relog ticket is released once its > reference count drops to zero. > > This provides a central relog log ticket that guarantees reservation > availability for relogged items, avoids log reservation deadlocks > and is allocated and released on demand. Hi Brain, I've held off commenting immediately on this while I tried to get the concept of dynamic relogging straight in my head. I couldn't put my finger on what I thought was wrong - just a nagging feeling that I'd gone down this path before and it ended in somethign that didn't work. It wasn't until a couple of hours ago that a big cogs clunked into place and I realised this roughly mirrored a path I went down 12 or 13 years ago trying to implement what turned into the CIL. I failed at least 4 times over 5 years trying to implement delayed logging... THere's a couple of simple code comments below before what was in my head seemed to gel together into something slightly more coherent than "it seems inside out".... > Signed-off-by: Brian Foster <bfoster@redhat.com> > --- > fs/xfs/libxfs/xfs_shared.h | 1 + > fs/xfs/xfs_trans.c | 37 +++++++++++++--- > fs/xfs/xfs_trans.h | 3 ++ > fs/xfs/xfs_trans_ail.c | 89 ++++++++++++++++++++++++++++++++++++++ > fs/xfs/xfs_trans_priv.h | 1 + > 5 files changed, 126 insertions(+), 5 deletions(-) > > diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h > index c45acbd3add9..0a10ca0853ab 100644 > --- a/fs/xfs/libxfs/xfs_shared.h > +++ b/fs/xfs/libxfs/xfs_shared.h > @@ -77,6 +77,7 @@ void xfs_log_get_max_trans_res(struct xfs_mount *mp, > * made then this algorithm will eventually find all the space it needs. > */ > #define XFS_TRANS_LOWMODE 0x100 /* allocate in low space mode */ > +#define XFS_TRANS_RELOG 0x200 /* enable automatic relogging */ > > /* > * Field values for xfs_trans_mod_sb. 
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c > index 3b208f9a865c..8ac05ed8deda 100644 > --- a/fs/xfs/xfs_trans.c > +++ b/fs/xfs/xfs_trans.c > @@ -107,9 +107,14 @@ xfs_trans_dup( > > ntp->t_flags = XFS_TRANS_PERM_LOG_RES | > (tp->t_flags & XFS_TRANS_RESERVE) | > - (tp->t_flags & XFS_TRANS_NO_WRITECOUNT); > - /* We gave our writer reference to the new transaction */ > + (tp->t_flags & XFS_TRANS_NO_WRITECOUNT) | > + (tp->t_flags & XFS_TRANS_RELOG); > + /* > + * The writer reference and relog reference transfer to the new > + * transaction. > + */ > tp->t_flags |= XFS_TRANS_NO_WRITECOUNT; > + tp->t_flags &= ~XFS_TRANS_RELOG; > ntp->t_ticket = xfs_log_ticket_get(tp->t_ticket); > > ASSERT(tp->t_blk_res >= tp->t_blk_res_used); > @@ -284,15 +289,25 @@ xfs_trans_alloc( > tp->t_firstblock = NULLFSBLOCK; > > error = xfs_trans_reserve(tp, resp, blocks, rtextents); > - if (error) { > - xfs_trans_cancel(tp); > - return error; > + if (error) > + goto error; > + > + if (flags & XFS_TRANS_RELOG) { > + error = xfs_trans_ail_relog_reserve(&tp); > + if (error) > + goto error; > } Hmmmm. So we are putting the AIL lock directly into the transaction reserve path? xfs_trans_reserve() goes out of it's way to be lockless in the fast paths, so if you want this to be a generic mechanism that any transaction can use, the reservation needs to be completely lockless. > > trace_xfs_trans_alloc(tp, _RET_IP_); > > *tpp = tp; > return 0; > + > +error: > + /* clear relog flag if we haven't acquired a ref */ > + tp->t_flags &= ~XFS_TRANS_RELOG; > + xfs_trans_cancel(tp); > + return error; seems like a case of "only set the flags once you have a reference"? Then xfs_trans_cancel() can clean up without special cases being needed anywhere... > } > > /* > @@ -973,6 +988,10 @@ __xfs_trans_commit( > > xfs_log_commit_cil(mp, tp, &commit_lsn, regrant); > > + /* release the relog ticket reference if this transaction holds one */ > + if (tp->t_flags & XFS_TRANS_RELOG) > + xfs_trans_ail_relog_put(mp); > + That looks ... interesting. xfs_trans_ail_relog_put() can call xfs_log_done(), which means it can do a log write, which means the commit lsn for this transaction could change. Hence to make a relog permanent as a result of a sync transaction, we'd need the commit_lsn of the AIL relog ticket write here, not that of the original transaction that was written. > current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS); > xfs_trans_free(tp); > > @@ -1004,6 +1023,10 @@ __xfs_trans_commit( > error = -EIO; > tp->t_ticket = NULL; > } > + /* release the relog ticket reference if this transaction holds one */ > + /* XXX: handle RELOG items on transaction abort */ > + if (tp->t_flags & XFS_TRANS_RELOG) > + xfs_trans_ail_relog_put(mp); This has the potential for log writes to be issued from a transaction abort context, right? > current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS); > xfs_trans_free_items(tp, !!error); > xfs_trans_free(tp); > @@ -1064,6 +1087,10 @@ xfs_trans_cancel( > tp->t_ticket = NULL; > } > > + /* release the relog ticket reference if this transaction holds one */ > + if (tp->t_flags & XFS_TRANS_RELOG) > + xfs_trans_ail_relog_put(mp); And log writes from a cancel seems possible here, too. I don't think we do this at all right now, so this could have some interesting unexpected side-effects during error handling. (As if that wasn't complex enough to begin with!) 
> @@ -818,6 +819,93 @@ xfs_trans_ail_delete( > xfs_log_space_wake(ailp->ail_mount); > } > > +bool > +xfs_trans_ail_relog_get( > + struct xfs_mount *mp) > +{ > + struct xfs_ail *ailp = mp->m_ail; > + bool ret = false; > + > + spin_lock(&ailp->ail_lock); > + if (ailp->ail_relog_tic) { > + xfs_log_ticket_get(ailp->ail_relog_tic); > + ret = true; > + } > + spin_unlock(&ailp->ail_lock); > + return ret; > +} > + > +/* > + * Reserve log space for the automatic relogging ->tr_relog ticket. This > + * requires a clean, permanent transaction from the caller. Pull reservation > + * for the relog ticket and roll the caller's transaction back to its fully > + * reserved state. If the AIL relog ticket is already initialized, grab a > + * reference and return. > + */ > +int > +xfs_trans_ail_relog_reserve( > + struct xfs_trans **tpp) > +{ > + struct xfs_trans *tp = *tpp; > + struct xfs_mount *mp = tp->t_mountp; > + struct xfs_ail *ailp = mp->m_ail; > + struct xlog_ticket *tic; > + uint32_t logres = M_RES(mp)->tr_relog.tr_logres; > + > + ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); > + ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY)); > + > + if (xfs_trans_ail_relog_get(mp)) > + return 0; > + > + /* no active ticket, fall into slow path to allocate one.. */ > + tic = xlog_ticket_alloc(mp->m_log, logres, 1, XFS_TRANSACTION, true, 0); > + if (!tic) > + return -ENOMEM; > + ASSERT(tp->t_ticket->t_curr_res >= tic->t_curr_res); > + > + /* check again since we dropped the lock for the allocation */ > + spin_lock(&ailp->ail_lock); > + if (ailp->ail_relog_tic) { > + xfs_log_ticket_get(ailp->ail_relog_tic); > + spin_unlock(&ailp->ail_lock); > + xfs_log_ticket_put(tic); > + return 0; > + } > + > + /* attach and reserve space for the ->tr_relog ticket */ > + ailp->ail_relog_tic = tic; > + tp->t_ticket->t_curr_res -= tic->t_curr_res; > + spin_unlock(&ailp->ail_lock); > + > + return xfs_trans_roll(tpp); > +} Hmmm. So before we've even returned from xfs_trans_alloc(), we may have committed the transaction and burnt and entire transaction entire log space reservations? IOWs, to support relogging of a single item in a permanent transaction, we have to increase the log count of the transaction so that we don't end up running out of reservation space one transaction before the end of the normal fast path behaviour of the change being made? And we do that up front because we're not exactly sure of how much space the item that needs relogging is going to require, but the AIL is going to need to hold on to it from as long as it needs to relog the item. <GRRRRIIINNNNDDDD> <CLUNK> Ahhhhh. This is the reason why the CIL ended up using a reservation stealing mechanism for it's ticket. Every log reservation takes into account the worst case log overhead for that single transaction - that means there are no special up front hooks or changes to the log reservation to support the CIL, and it also means that the CIL can steal what it needs from the current ticket at commit time. Then the CIL can commit at any time, knowing it has enough space to write all the items it tracks to the log. This avoided the transaction reservation mechanism from needing to know anything about the CIL or that it was stealing space from committing transactions for it's own private ticket. It greatly simplified everything, and it's the reason that the CIL succeeded where several other attempts to relog items in memory failed.... So.... Why can't the AIL steal the reservation it needs from the current transaction when the item that needs relogging is being committed? 
i.e. we add the "relog overhead" to the permanent transaction that requires items to be relogged by the AIL, and when that item is formatted into the CIL we also check to see if it is currently marked as "reloggable". If it's not, we steal the relogging reservation from the transaction (essentially the item's formatted size) and pass it to the AIL's private relog ticket, setting the log item to be "reloggable" so the reservation isn't stolen over and over again as the object is relogged in the CIL. Hence as we commit reloggable items to the CIL, the AIL ticket reservation grows with each of those items marked as reloggable. As we relog items to the CIL and the CIL grows it's reservation via the size delta, the AIL reservation can also be updated with the same delta. Hence the AIL will always know exactly how much space it needs to relog all the items it holds for relogging, and because it's been stolen from the original transaction it is, like the CIL tciket reservation, considered used space in the log. Hence the log space required for relogging items via the AIL is correctly accounted for without needing up front static, per-item reservations. When the item is logged the final time and the reloggable flag is removed (e.g. when the quotaoff completes) then we can remove the reservation from AIL ticket and add it back to the current transaction. Hence when the final transaction is committed and the ticket for that transaction is released via xfs_log_done(), the space the AIL held for relogging the item is also released. This doesn't require any modifications to the transaction reservation subsystem, nor the itransaction commit/cancel code, no changes to the log space accounting, etc. All we need to do is tag items when we log them as reloggable, and in the CIL formatting of the item pass the formatted size differences to the AIL so it can steal the reservation it needs. Then the AIL doesn't need a dynamic relogging ticket. It can just hold a single ticket from init to teardown by taking an extra reference to the ticket. When it is running a relog, it can build it as though it's just a normal permanent transaction without needing to get a new reservation, and when it rolls the write space consumed is regranted automatically. AFAICT, that gets rid of the need for all this ticket reference counting, null pointer checking, lock juggling and rechecking, etc. It does not require modification to the transaction reservation path, and we can account for the relogging on an item by item basis in each individual transaction reservation.... Thoughts? -Dave.
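A rough sketch of the stealing hook described above, assuming it runs where the CIL formats the item and already knows its formatted size; XFS_LI_RELOGGED and the ail_relog_res counter are invented names, and locking plus the delta accounting for subsequent relogs are glossed over:

/*
 * Sketch of the reservation stealing idea, not code from the posted
 * patch. Called when a reloggable item is formatted into the CIL;
 * @nbytes is the item's formatted size. XFS_LI_RELOGGED and
 * ail_relog_res are hypothetical.
 */
static void
xfs_ail_relog_steal(
	struct xfs_ail		*ailp,
	struct xfs_trans	*tp,
	struct xfs_log_item	*lip,
	int			nbytes)
{
	/* steal once per item; later relogs would only pass the delta */
	if (test_and_set_bit(XFS_LI_RELOGGED, &lip->li_flags))
		return;

	tp->t_ticket->t_curr_res -= nbytes;
	atomic_add(nbytes, &ailp->ail_relog_res);
}

When the item is logged for the final time and the reloggable state is cleared, the same amount would move back to the committing transaction so the space is released through the normal xfs_log_done() path, as described above.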
On Mon, Mar 02, 2020 at 02:07:50PM +1100, Dave Chinner wrote: > On Thu, Feb 27, 2020 at 08:43:15AM -0500, Brian Foster wrote: > > Automatic item relogging will occur from xfsaild context. xfsaild > > cannot acquire log reservation itself because it is also responsible > > for writeback and thus making used log reservation available again. > > Since there is no guarantee log reservation is available by the time > > a relogged item reaches the AIL, this is prone to deadlock. > > > > To guarantee log reservation for automatic relogging, implement a > > reservation management scheme where a transaction that is capable of > > enabling relogging of an item must contribute the necessary > > reservation to the relog mechanism up front. Use reference counting > > to associate the lifetime of pending relog reservation to the > > lifetime of in-core log items with relogging enabled. > > > > The basic log reservation sequence for a relog enabled transaction > > is as follows: > > > > - A transaction that uses relogging specifies XFS_TRANS_RELOG at > > allocation time. > > - Once initialized, RELOG transactions check for the existence of > > the global relog log ticket. If it exists, grab a reference and > > return. If not, allocate an empty ticket and install into the relog > > subsystem. Seed the relog ticket from reservation of the current > > transaction. Roll the current transaction to replenish its > > reservation and return to the caller. > > - The transaction is used as normal. If an item is relogged in the > > transaction, that item acquires a reference on the global relog > > ticket currently held open by the transaction. The item's reference > > persists until relogging is disabled on the item. > > - The RELOG transaction commits and releases its reference to the > > global relog ticket. The global relog ticket is released once its > > reference count drops to zero. > > > > This provides a central relog log ticket that guarantees reservation > > availability for relogged items, avoids log reservation deadlocks > > and is allocated and released on demand. > > Hi Brain, > > I've held off commenting immediately on this while I tried to get > the concept of dynamic relogging straight in my head. I couldn't put > my finger on what I thought was wrong - just a nagging feeling that > I'd gone down this path before and it ended in somethign that didn't > work. > No problem.. > It wasn't until a couple of hours ago that a big cogs clunked into > place and I realised this roughly mirrored a path I went down 12 or > 13 years ago trying to implement what turned into the CIL. I failed > at least 4 times over 5 years trying to implement delayed logging... > > THere's a couple of simple code comments below before what was in my > head seemed to gel together into something slightly more coherent > than "it seems inside out".... > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > --- > > fs/xfs/libxfs/xfs_shared.h | 1 + > > fs/xfs/xfs_trans.c | 37 +++++++++++++--- > > fs/xfs/xfs_trans.h | 3 ++ > > fs/xfs/xfs_trans_ail.c | 89 ++++++++++++++++++++++++++++++++++++++ > > fs/xfs/xfs_trans_priv.h | 1 + > > 5 files changed, 126 insertions(+), 5 deletions(-) > > > > diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h > > index c45acbd3add9..0a10ca0853ab 100644 > > --- a/fs/xfs/libxfs/xfs_shared.h > > +++ b/fs/xfs/libxfs/xfs_shared.h > > @@ -77,6 +77,7 @@ void xfs_log_get_max_trans_res(struct xfs_mount *mp, > > * made then this algorithm will eventually find all the space it needs. 
> > */ > > #define XFS_TRANS_LOWMODE 0x100 /* allocate in low space mode */ > > +#define XFS_TRANS_RELOG 0x200 /* enable automatic relogging */ > > > > /* > > * Field values for xfs_trans_mod_sb. > > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c > > index 3b208f9a865c..8ac05ed8deda 100644 > > --- a/fs/xfs/xfs_trans.c > > +++ b/fs/xfs/xfs_trans.c > > @@ -107,9 +107,14 @@ xfs_trans_dup( > > > > ntp->t_flags = XFS_TRANS_PERM_LOG_RES | > > (tp->t_flags & XFS_TRANS_RESERVE) | > > - (tp->t_flags & XFS_TRANS_NO_WRITECOUNT); > > - /* We gave our writer reference to the new transaction */ > > + (tp->t_flags & XFS_TRANS_NO_WRITECOUNT) | > > + (tp->t_flags & XFS_TRANS_RELOG); > > + /* > > + * The writer reference and relog reference transfer to the new > > + * transaction. > > + */ > > tp->t_flags |= XFS_TRANS_NO_WRITECOUNT; > > + tp->t_flags &= ~XFS_TRANS_RELOG; > > ntp->t_ticket = xfs_log_ticket_get(tp->t_ticket); > > > > ASSERT(tp->t_blk_res >= tp->t_blk_res_used); > > @@ -284,15 +289,25 @@ xfs_trans_alloc( > > tp->t_firstblock = NULLFSBLOCK; > > > > error = xfs_trans_reserve(tp, resp, blocks, rtextents); > > - if (error) { > > - xfs_trans_cancel(tp); > > - return error; > > + if (error) > > + goto error; > > + > > + if (flags & XFS_TRANS_RELOG) { > > + error = xfs_trans_ail_relog_reserve(&tp); > > + if (error) > > + goto error; > > } > > Hmmmm. So we are putting the AIL lock directly into the transaction > reserve path? xfs_trans_reserve() goes out of it's way to be > lockless in the fast paths, so if you want this to be a generic > mechanism that any transaction can use, the reservation needs to be > completely lockless. > Yeah, this is one of those warts mentioned in the cover letter. I wasn't planning to make it lockless as much as using an independent lock, but I can take a closer look at that if it still looks like a contention point after getting through the bigger picture feedback.. > > > > trace_xfs_trans_alloc(tp, _RET_IP_); > > > > *tpp = tp; > > return 0; > > + > > +error: > > + /* clear relog flag if we haven't acquired a ref */ > > + tp->t_flags &= ~XFS_TRANS_RELOG; > > + xfs_trans_cancel(tp); > > + return error; > > seems like a case of "only set the flags once you have a reference"? > Then xfs_trans_cancel() can clean up without special cases being > needed anywhere... > Yeah, I suppose that might be a bit more readable. > > } > > > > /* > > @@ -973,6 +988,10 @@ __xfs_trans_commit( > > > > xfs_log_commit_cil(mp, tp, &commit_lsn, regrant); > > > > + /* release the relog ticket reference if this transaction holds one */ > > + if (tp->t_flags & XFS_TRANS_RELOG) > > + xfs_trans_ail_relog_put(mp); > > + > > That looks ... interesting. xfs_trans_ail_relog_put() can call > xfs_log_done(), which means it can do a log write, which means the > commit lsn for this transaction could change. Hence to make a relog > permanent as a result of a sync transaction, we'd need the > commit_lsn of the AIL relog ticket write here, not that of the > original transaction that was written. > Hmm.. xfs_log_done() is eventually called for every transaction ticket, so I don't think it should be doing log writes based on that. The _ail_relog_put() path is basically just intended to properly terminate the relog ticket based on the existing transaction completion path. I'd have to dig a bit further into the code here, but e.g. we don't pass an iclog via this path so if we were attempting to write I'd probably have seen assert failures (or worse) by now. 
Indeed, this is similar to a cancel of a clean transaction, the difference being this is open-coded because we only have the relog ticket in this context. > > > current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS); > > xfs_trans_free(tp); > > > > @@ -1004,6 +1023,10 @@ __xfs_trans_commit( > > error = -EIO; > > tp->t_ticket = NULL; > > } > > + /* release the relog ticket reference if this transaction holds one */ > > + /* XXX: handle RELOG items on transaction abort */ > > + if (tp->t_flags & XFS_TRANS_RELOG) > > + xfs_trans_ail_relog_put(mp); > > This has the potential for log writes to be issued from a > transaction abort context, right? > > > current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS); > > xfs_trans_free_items(tp, !!error); > > xfs_trans_free(tp); > > @@ -1064,6 +1087,10 @@ xfs_trans_cancel( > > tp->t_ticket = NULL; > > } > > > > + /* release the relog ticket reference if this transaction holds one */ > > + if (tp->t_flags & XFS_TRANS_RELOG) > > + xfs_trans_ail_relog_put(mp); > > And log writes from a cancel seems possible here, too. I don't think > we do this at all right now, so this could have some interesting > unexpected side-effects during error handling. (As if that wasn't > complex enough to begin with!) > I'm not quite sure I follow your comments here, though I do still need to work through all of the error paths and make sure everything is correct. I've been putting details like that off in favor of getting a high level design approach worked out first. > > @@ -818,6 +819,93 @@ xfs_trans_ail_delete( > > xfs_log_space_wake(ailp->ail_mount); > > } > > > > +bool > > +xfs_trans_ail_relog_get( > > + struct xfs_mount *mp) > > +{ > > + struct xfs_ail *ailp = mp->m_ail; > > + bool ret = false; > > + > > + spin_lock(&ailp->ail_lock); > > + if (ailp->ail_relog_tic) { > > + xfs_log_ticket_get(ailp->ail_relog_tic); > > + ret = true; > > + } > > + spin_unlock(&ailp->ail_lock); > > + return ret; > > +} > > + > > +/* > > + * Reserve log space for the automatic relogging ->tr_relog ticket. This > > + * requires a clean, permanent transaction from the caller. Pull reservation > > + * for the relog ticket and roll the caller's transaction back to its fully > > + * reserved state. If the AIL relog ticket is already initialized, grab a > > + * reference and return. > > + */ > > +int > > +xfs_trans_ail_relog_reserve( > > + struct xfs_trans **tpp) > > +{ > > + struct xfs_trans *tp = *tpp; > > + struct xfs_mount *mp = tp->t_mountp; > > + struct xfs_ail *ailp = mp->m_ail; > > + struct xlog_ticket *tic; > > + uint32_t logres = M_RES(mp)->tr_relog.tr_logres; > > + > > + ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); > > + ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY)); > > + > > + if (xfs_trans_ail_relog_get(mp)) > > + return 0; > > + > > + /* no active ticket, fall into slow path to allocate one.. 
*/ > > + tic = xlog_ticket_alloc(mp->m_log, logres, 1, XFS_TRANSACTION, true, 0); > > + if (!tic) > > + return -ENOMEM; > > + ASSERT(tp->t_ticket->t_curr_res >= tic->t_curr_res); > > + > > + /* check again since we dropped the lock for the allocation */ > > + spin_lock(&ailp->ail_lock); > > + if (ailp->ail_relog_tic) { > > + xfs_log_ticket_get(ailp->ail_relog_tic); > > + spin_unlock(&ailp->ail_lock); > > + xfs_log_ticket_put(tic); > > + return 0; > > + } > > + > > + /* attach and reserve space for the ->tr_relog ticket */ > > + ailp->ail_relog_tic = tic; > > + tp->t_ticket->t_curr_res -= tic->t_curr_res; > > + spin_unlock(&ailp->ail_lock); > > + > > + return xfs_trans_roll(tpp); > > +} > > Hmmm. > > So before we've even returned from xfs_trans_alloc(), we may have > committed the transaction and burnt and entire transaction entire > log space reservations? > Yep. > IOWs, to support relogging of a single item in a permanent > transaction, we have to increase the log count of the transaction so > that we don't end up running out of reservation space one > transaction before the end of the normal fast path behaviour of the > change being made? > Right. > And we do that up front because we're not exactly sure of how much > space the item that needs relogging is going to require, but the AIL > is going to need to hold on to it from as long as it needs to relog > the item. > *nod* > <GRRRRIIINNNNDDDD> > > <CLUNK> > > Ahhhhh. > Heh. :P > This is the reason why the CIL ended up using a reservation stealing > mechanism for it's ticket. Every log reservation takes into account > the worst case log overhead for that single transaction - that means > there are no special up front hooks or changes to the log > reservation to support the CIL, and it also means that the CIL can > steal what it needs from the current ticket at commit time. Then the > CIL can commit at any time, knowing it has enough space to write all > the items it tracks to the log. > > This avoided the transaction reservation mechanism from needing to > know anything about the CIL or that it was stealing space from > committing transactions for it's own private ticket. It greatly > simplified everything, and it's the reason that the CIL succeeded > where several other attempts to relog items in memory failed.... > > So.... > > Why can't the AIL steal the reservation it needs from the current > transaction when the item that needs relogging is being committed? I think we can. My first proposal[1] implemented something like this. The broader design was much less fleshed out, so it was crudely implemented as more of a commit time hook to relog pending items based on the currently committing transaction as opposed to reservation stealing for an independent relog ticket. This was actually based on the observation that the CIL already did something similar (though I didn't have all of the background as to why that approach was used for the CIL). The feedback at the time was to consider moving in a different direction that didn't involve stealing reservation from transactions, which gave me the impression that approach wasn't going to be acceptable. Granted, the first few variations of this were widely different in approach and I don't think there was any explicit objection to reservation stealing itself, it just seemed a better development approach to work out an ideal design based on fundamentals as opposed to limiting the scope to contexts that might facilitate reservation stealing. 
Despite the fact that this code is hairy, the fundamental approach is rather straightforward and simplistic. The hairiness mostly comes from attempting to make things generic and dynamic and could certainly use some cleanup. Now that this is at a point where I'm reasonably confident it is fundamentally correct, I'm quite happy to revisit a reservation stealing approach if that improves functional operation. This can serve a decent reference/baseline implementation for that, I think. [1] https://lore.kernel.org/linux-xfs/20191024172850.7698-1-bfoster@redhat.com/ > i.e. we add the "relog overhead" to the permanent transaction that > requires items to be relogged by the AIL, and when that item is > formatted into the CIL we also check to see if it is currently > marked as "reloggable". If it's not, we steal the relogging > reservation from the transaction (essentially the item's formatted > size) and pass it to the AIL's private relog ticket, setting the log > item to be "reloggable" so the reservation isn't stolen over and > over again as the object is relogged in the CIL. > Ok, so this implies to me we'd update the per-transaction reservation calculations to incorporate worst case relog overhead for each transaction. E.g., the quotaoff transaction would x2 the intent size, a transaction that might relog a buffer adds a max buffer size overhead, etc. Right? > Hence as we commit reloggable items to the CIL, the AIL ticket > reservation grows with each of those items marked as reloggable. > As we relog items to the CIL and the CIL grows it's reservation via > the size delta, the AIL reservation can also be updated with the > same delta. > > Hence the AIL will always know exactly how much space it needs to relog all > the items it holds for relogging, and because it's been stolen from > the original transaction it is, like the CIL tciket reservation, > considered used space in the log. Hence the log space required for > relogging items via the AIL is correctly accounted for without > needing up front static, per-item reservations. > By this I assume you're referring to avoiding the need to reserve -> donate -> roll in the current scheme. Instead, we'd acquire the larger reservation up front and only steal if it necessary, which is less overhead because we don't need to always replenish the full transaction. Do note that I don't consider the current approach high overhead overall because the roll only needs to happen when (and if) the first transaction enables the relog ticket. There is no additional overhead from that point forward. The "relog overhead" approach sounds fine to me as well, I'm just noting that there are still some tradeoffs to either approach. > When the item is logged the final time and the reloggable flag is > removed (e.g. when the quotaoff completes) then we can remove the > reservation from AIL ticket and add it back to the current > transaction. Hence when the final transaction is committed and the > ticket for that transaction is released via xfs_log_done(), the > space the AIL held for relogging the item is also released. > Makes sense. > This doesn't require any modifications to the transaction > reservation subsystem, nor the itransaction commit/cancel code, > no changes to the log space accounting, etc. All we need to do is > tag items when we log them as reloggable, and in the CIL formatting > of the item pass the formatted size differences to the AIL so it can > steal the reservation it needs. > > Then the AIL doesn't need a dynamic relogging ticket. 
It can just > hold a single ticket from init to teardown by taking an extra > reference to the ticket. When it is running a relog, it can build it > as though it's just a normal permanent transaction without needing > to get a new reservation, and when it rolls the write space consumed > is regranted automatically. > > AFAICT, that gets rid of the need for all this ticket reference counting, > null pointer checking, lock juggling and rechecking, etc. It does > not require modification to the transaction reservation path, and we > can account for the relogging on an item by item basis in each > individual transaction reservation.... > Yep, I think there's potential for some nice cleanup there. I'd be very happy to rip out some of guts of the current patch, particularly the reference counting bits. > Thoughts? > <thinking out loud from here on..> One thing that comes to mind thinking about this is dealing with batching (relog multiple items per roll). This code doesn't handle that yet, but I anticipate it being a requirement and it's fairly easy to update the current scheme to support a fixed item count per relog-roll. A stealing approach potentially complicates things when it comes to batching because we have per-item reservation granularity to consider. For example, consider if we had a variety of different relog item types active at once, a subset are being relogged while another subset are being disabled (before ever needing to be relogged). For one, we'd have to be careful to not release reservation while it might be accounted to an active relog transaction that is currently rolling with some other items, etc. There's also a potential quirk in handling reservation of a relogged item that is cancelled while it's being relogged, but that might be more of an implementation detail. I don't think that's a show stopper, but rather just something I'd like to have factored into the design from the start. One option could be to maintain a separate counter of active relog reservation aside from the actual relog ticket. That way the relog ticket could just pull from this relog reservation pool based on the current item(s) being relogged asynchronously from different tasks that might add or remove reservation from the pool for separate items. That might get a little wonky when we consider the relog ticket needs to pull from the pool and then put something back if the item is still reloggable after the relog transaction rolls. Another problem is that reloggable items are still otherwise usable once they are unlocked. So for example we'd have to account for a situation where one transaction dirties a buffer, enables relogging, commits and then some other transaction dirties more of the buffer and commits without caring whether the buffer was relog enabled or not. Unless runtime relog reservation is always worst case, that's a subtle path to reservation overrun in the relog transaction. I'm actually wondering if a more simple approach to tracking stolen reservation is to add a new field that tracks active relog reservation to each supported log item. Then the initial transaction enables relogging with a simple transfer from the transaction to the item. The relog transaction knows how much reservation it can assign based on the current population of items and requires no further reservation accounting because it rolls and thus automatically reacquires relog reservation for each associated item. 
The path that clears relog state transfers res from the item back to a transaction simply so the reservation can be released back to the pool of unused log space. Note that clearing relog state doesn't require a transaction in the current implementation, but we could easily define a helper to allocate an empty transaction, clear relog and reclaim relog reservation and then cancel for those contexts. Thoughts on any of the above appreciated. I need to think about this some more but otherwise I'll attempt some form of a res stealing approach for the next iteration... Brian > -Dave. > -- > Dave Chinner > david@fromorbit.com >
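A rough sketch of the per-item reservation transfer floated above, for illustration only: the li_relog_res field and both helper names are hypothetical (the posted series does not define them), XFS_LI_RELOG is assumed to be a li_flags bit as referenced later in the thread, and locking and error handling are omitted.

/*
 * Illustrative sketch only: transfer relog reservation between the
 * enabling/clearing transaction and the log item itself. li_relog_res
 * is a hypothetical field; XFS_LI_RELOG is assumed to be a li_flags bit.
 */
static void
xfs_item_relog_enable(
	struct xfs_trans	*tp,
	struct xfs_log_item	*lip,
	unsigned int		nbytes)		/* worst case relog size */
{
	ASSERT(tp->t_ticket->t_curr_res >= (int)nbytes);

	/* move unused reservation from the transaction to the item */
	tp->t_ticket->t_curr_res -= nbytes;
	lip->li_relog_res = nbytes;
	set_bit(XFS_LI_RELOG, &lip->li_flags);
}

static void
xfs_item_relog_disable(
	struct xfs_trans	*tp,
	struct xfs_log_item	*lip)
{
	/* hand the item's reservation back so the ticket can release it */
	clear_bit(XFS_LI_RELOG, &lip->li_flags);
	tp->t_ticket->t_curr_res += lip->li_relog_res;
	lip->li_relog_res = 0;
}

In this scheme the relog transaction itself would need no extra accounting because, as described above, its roll regrants the space for each item it carries.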
On Mon, Mar 02, 2020 at 01:06:50PM -0500, Brian Foster wrote: > On Mon, Mar 02, 2020 at 02:07:50PM +1100, Dave Chinner wrote: > > On Thu, Feb 27, 2020 at 08:43:15AM -0500, Brian Foster wrote: > > > Automatic item relogging will occur from xfsaild context. xfsaild > > > cannot acquire log reservation itself because it is also responsible > > > for writeback and thus making used log reservation available again. > > > Since there is no guarantee log reservation is available by the time > > > a relogged item reaches the AIL, this is prone to deadlock. ..... > > > @@ -284,15 +289,25 @@ xfs_trans_alloc( > > > tp->t_firstblock = NULLFSBLOCK; > > > > > > error = xfs_trans_reserve(tp, resp, blocks, rtextents); > > > - if (error) { > > > - xfs_trans_cancel(tp); > > > - return error; > > > + if (error) > > > + goto error; > > > + > > > + if (flags & XFS_TRANS_RELOG) { > > > + error = xfs_trans_ail_relog_reserve(&tp); > > > + if (error) > > > + goto error; > > > } > > > > Hmmmm. So we are putting the AIL lock directly into the transaction > > reserve path? xfs_trans_reserve() goes out of it's way to be > > lockless in the fast paths, so if you want this to be a generic > > mechanism that any transaction can use, the reservation needs to be > > completely lockless. > > > > Yeah, this is one of those warts mentioned in the cover letter. I wasn't > planning to make it lockless as much as using an independent lock, but I > can take a closer look at that if it still looks like a contention point > after getting through the bigger picture feedback.. If relogging ends up getting widely used, then a filesystem global lock in the transaction reserve path is guaranteed to be a contention point. :) > > > + /* release the relog ticket reference if this transaction holds one */ > > > + if (tp->t_flags & XFS_TRANS_RELOG) > > > + xfs_trans_ail_relog_put(mp); > > > + > > > > That looks ... interesting. xfs_trans_ail_relog_put() can call > > xfs_log_done(), which means it can do a log write, which means the > > commit lsn for this transaction could change. Hence to make a relog > > permanent as a result of a sync transaction, we'd need the > > commit_lsn of the AIL relog ticket write here, not that of the > > original transaction that was written. > > > > Hmm.. xfs_log_done() is eventually called for every transaction ticket, > so I don't think it should be doing log writes based on that. xfs_log_done() xlog_commit_record() xlog_write(XLOG_COMMIT_TRANS) xlog_state_release_iclog() xlog_sync() xlog_write_iclog() submit_bio() > > The > _ail_relog_put() path is basically just intended to properly terminate > the relog ticket based on the existing transaction completion path. I'd > have to dig a bit further into the code here, but e.g. we don't pass an > iclog via this path so if we were attempting to write I'd probably have > seen assert failures (or worse) by now. xlog_write() code handles a missing commit iclog pointer just fine. Indeed, that's the reason xlog_write() may issue log IO itself: .... spin_lock(&log->l_icloglock); xlog_state_finish_copy(log, iclog, record_cnt, data_cnt); if (commit_iclog) { ASSERT(flags & XLOG_COMMIT_TRANS); *commit_iclog = iclog; } else { error = xlog_state_release_iclog(log, iclog); } spin_unlock(&log->l_icloglock); i.e. if you don't provide xlog_write with a commit_iclog pointer then you are saying "caller does not call xfs_log_release_iclog() itself, so please release the iclog once the log iovec has been written to the iclog. 
Hence the way you've written this code explicitly tells xlog_write() to issue log IO on you behalf if it is necessary.... > Indeed, this is similar to a > cancel of a clean transaction, the difference being this is open-coded > because we only have the relog ticket in this context. A clean transaction has XLOG_TIC_INITED set, so xfs_log_done() will never call xlog_commit_record() on it. XLOG_TIC_INITED is used to indicate a start record has been written to the log for a permanent transaction so it doesn't get written again when it is relogged..... .... Hmmmm ..... You might be right. It looks like delayed logging made XLOG_TIC_INITED completely redundant. xfs_trans_commit() no longer writes directly to iclogs, so the normal transaction tickets will never have XLOG_TIC_INITED cleared. Hence on any kernel since delayed logging was the only option, we will never get log writes from the xfs_trans* interfaces. And we no longer do an xlog_write() call for every item in the transaction after we format them into a log iovec. Instead, the CIL just passes one big list of iovecs to a singel xlog_write() call, so all callers only make a single xlog_write call per physical log transaction. OK, XLOG_TIC_INITED is redundant, and should be removed. And xfs_log_done() needs to be split into two, one for releasing the ticket, one for completing the xlog_write() call. Compile tested only patch below for you :P > > <GRRRRIIINNNNDDDD> > > > > <CLUNK> > > > > Ahhhhh. > > > > Heh. :P > > > This is the reason why the CIL ended up using a reservation stealing > > mechanism for it's ticket. Every log reservation takes into account > > the worst case log overhead for that single transaction - that means > > there are no special up front hooks or changes to the log > > reservation to support the CIL, and it also means that the CIL can > > steal what it needs from the current ticket at commit time. Then the > > CIL can commit at any time, knowing it has enough space to write all > > the items it tracks to the log. > > > > This avoided the transaction reservation mechanism from needing to > > know anything about the CIL or that it was stealing space from > > committing transactions for it's own private ticket. It greatly > > simplified everything, and it's the reason that the CIL succeeded > > where several other attempts to relog items in memory failed.... > > > > So.... > > > > Why can't the AIL steal the reservation it needs from the current > > transaction when the item that needs relogging is being committed? > > I think we can. My first proposal[1] implemented something like this. > The broader design was much less fleshed out, so it was crudely > implemented as more of a commit time hook to relog pending items based > on the currently committing transaction as opposed to reservation > stealing for an independent relog ticket. This was actually based on the > observation that the CIL already did something similar (though I didn't > have all of the background as to why that approach was used for the > CIL). > > The feedback at the time was to consider moving in a different direction > that didn't involve stealing reservation from transactions, which gave > me the impression that approach wasn't going to be acceptable. Granted, I don't recall that discussion steering away from stealing reservations; it was more about mechanics like using transaction callbacks and other possible mechanisms for enabling relogging. 
> the first few variations of this were widely different in approach and I > don't think there was any explicit objection to reservation stealing > itself, it just seemed a better development approach to work out an > ideal design based on fundamentals as opposed to limiting the scope to > contexts that might facilitate reservation stealing. Right, it wasn't clear how to best implement this at the time, because we were all just struggling to get our heads around the problem scope. :) > Despite the fact > that this code is hairy, the fundamental approach is rather > straightforward and simplistic. The hairiness mostly comes from > attempting to make things generic and dynamic and could certainly use > some cleanup. > > Now that this is at a point where I'm reasonably confident it is > fundamentally correct, I'm quite happy to revisit a reservation stealing > approach if that improves functional operation. This can serve a decent > reference/baseline implementation for that, I think. *nod* > > [1] https://lore.kernel.org/linux-xfs/20191024172850.7698-1-bfoster@redhat.com/ > > > i.e. we add the "relog overhead" to the permanent transaction that > > requires items to be relogged by the AIL, and when that item is > > formatted into the CIL we also check to see if it is currently > > marked as "reloggable". If it's not, we steal the relogging > > reservation from the transaction (essentially the item's formatted > > size) and pass it to the AIL's private relog ticket, setting the log > > item to be "reloggable" so the reservation isn't stolen over and > > over again as the object is relogged in the CIL. > > > > Ok, so this implies to me we'd update the per-transaction reservation > calculations to incorporate worst case relog overhead for each > transaction. E.g., the quotaoff transaction would x2 the intent size, > a transaction that might relog a buffer adds a max buffer size overhead, > etc. Right? Essentially, yes. And the reloggable items and reservation can be clearly documented in the per-transaction reservation explanation comments like we already do for all the structures that are modified in a transaction. > > Hence as we commit reloggable items to the CIL, the AIL ticket > > reservation grows with each of those items marked as reloggable. > > As we relog items to the CIL and the CIL grows it's reservation via > > the size delta, the AIL reservation can also be updated with the > > same delta. > > > > Hence the AIL will always know exactly how much space it needs to relog all > > the items it holds for relogging, and because it's been stolen from > > the original transaction it is, like the CIL tciket reservation, > > considered used space in the log. Hence the log space required for > > relogging items via the AIL is correctly accounted for without > > needing up front static, per-item reservations. > > > > By this I assume you're referring to avoiding the need to reserve -> > donate -> roll in the current scheme. Instead, we'd acquire the larger > reservation up front and only steal if it necessary, which is less > overhead because we don't need to always replenish the full transaction. Yes - we acquire the initial relog reservation by stealing it from the current transaction. This gives the AIL ticket the reserved space (both reserve and write grant space) it needs to relog the item once. 
When the item is relogged by the AIL, the transaction commit immediately regrants the reservation space that was just consumed, and the trans_roll regrants the write grant space the commit just consumed from the ticket via it's call to xfs_trans_reserve(). Hence we only need to steal the space necessary to do the first relogging of the item as the AIL will hold that reservation until the high level code turns off relogging for that log item. > > Thoughts? > > <thinking out loud from here on..> > > One thing that comes to mind thinking about this is dealing with > batching (relog multiple items per roll). This code doesn't handle that > yet, but I anticipate it being a requirement and it's fairly easy to > update the current scheme to support a fixed item count per relog-roll. > > A stealing approach potentially complicates things when it comes to > batching because we have per-item reservation granularity to consider. > For example, consider if we had a variety of different relog item types > active at once, a subset are being relogged while another subset are > being disabled (before ever needing to be relogged). So concurrent "active relogging" + "start relogging" + "stop relogging"? > For one, we'd have > to be careful to not release reservation while it might be accounted to > an active relog transaction that is currently rolling with some other > items, etc. There's also a potential quirk in handling reservation of a > relogged item that is cancelled while it's being relogged, but that > might be more of an implementation detail. Right, did you notice the ail->ail_relog_lock rwsem that I wrapped my example relog transaction item add loop + commit function in? i.e. while we are building and committing a relog transaction, we hold off the transactions that are trying to add/remove their items to/from the relog list. Hence the reservation stealing accounting in the ticket can be be serialised against the transactional use of the ticket. It's basically the same method we use for serialising addition to the CIL in transaction commit against CIL pushes draining the current list for log writes (rwsem for add/push serialisation, spin lock for concurrent add serialisation under the rwsem). > I don't think that's a show stopper, but rather just something I'd like > to have factored into the design from the start. One option could be to *nod* > maintain a separate counter of active relog reservation aside from the > actual relog ticket. That way the relog ticket could just pull from this > relog reservation pool based on the current item(s) being relogged > asynchronously from different tasks that might add or remove reservation > from the pool for separate items. That might get a little wonky when we > consider the relog ticket needs to pull from the pool and then put > something back if the item is still reloggable after the relog > transaction rolls. RIght, that's the whole problem that we solve via a) stealing the initial reserve/write grant space at commit time and b) serialising stealing vs transactional use of the ticket. That is, after the roll, the ticket has a full reserve and grant space reservation for all the items accounted to the relog ticket. Every new relog item added to the ticket (or is relogged in the CIL and uses more space) adds the full required reserve/write grant space to the the relog ticket. Hence the relog ticket always has current log space reserved to commit the entire set of items tagged as reloggable. 
And by avoiding modifying the ticket while we are actively processing the relog transaction, we don't screw up the ticket accounting in the middle of the transaction.... > Another problem is that reloggable items are still otherwise usable once > they are unlocked. So for example we'd have to account for a situation > where one transaction dirties a buffer, enables relogging, commits and > then some other transaction dirties more of the buffer and commits > without caring whether the buffer was relog enabled or not. yup, that's the delta size updates from the CIL commit. i.e. if we relog an item to the CIL that has the XFS_LI_RELOG flag already set on it, the change in size that we steal for the CIL ticket also needs to be stolen for the AIL ticket. i.e. we already do almost all the work we need to handle this. > Unless > runtime relog reservation is always worst case, that's a subtle path to > reservation overrun in the relog transaction. Yes, but it's a problem the CIL already solves for us :P > I'm actually wondering if a more simple approach to tracking stolen > reservation is to add a new field that tracks active relog reservation > to each supported log item. Then the initial transaction enables > relogging with a simple transfer from the transaction to the item. The > relog transaction knows how much reservation it can assign based on the > current population of items and requires no further reservation > accounting because it rolls and thus automatically reacquires relog > reservation for each associated item. The path that clears relog state > transfers res from the item back to a transaction simply so the > reservation can be released back to the pool of unused log space. Note > that clearing relog state doesn't require a transaction in the current > implementation, but we could easily define a helper to allocate an empty > transaction, clear relog and reclaim relog reservation and then cancel > for those contexts. I don't think we need any of that - the AIL ticket only needs to be kept up to date with the changes to the formatted size of the item marked for relogging. It's no different to the CIL ticket reservation accounting from that perspective. Cheers, Dave.
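A sketch of the commit-time stealing described in this mail, using the pieces named in the discussion (ailp->ail_relog_tic, the ail->ail_relog_lock rwsem, XFS_LI_RELOG); the function itself and its hook point are illustrative assumptions, not code from the series.

/*
 * Illustrative only: steal the formatted size delta for a reloggable
 * item from the committing transaction's ticket and add it to the AIL
 * relog ticket. A real version would also need to move the matching
 * write grant space and, as with the CIL, a spinlock for concurrent
 * callers nesting inside the rwsem.
 */
static void
xfs_ail_relog_steal_res(
	struct xfs_ail		*ailp,
	struct xfs_log_item	*lip,
	struct xlog_ticket	*tic,	/* committing transaction's ticket */
	int			diff)	/* formatted size delta from the CIL */
{
	if (!test_bit(XFS_LI_RELOG, &lip->li_flags) || diff <= 0)
		return;

	/* hold off the AIL building/committing a relog transaction */
	down_read(&ailp->ail_relog_lock);
	tic->t_curr_res -= diff;
	ailp->ail_relog_tic->t_curr_res += diff;
	up_read(&ailp->ail_relog_lock);
}

The first time an item is marked reloggable the delta would be its full formatted size; subsequent relogs into the CIL would pass only the growth, matching the CIL's own delta accounting described above.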
On Tue, Mar 03, 2020 at 10:25:29AM +1100, Dave Chinner wrote: > On Mon, Mar 02, 2020 at 01:06:50PM -0500, Brian Foster wrote: > OK, XLOG_TIC_INITED is redundant, and should be removed. And > xfs_log_done() needs to be split into two, one for releasing the > ticket, one for completing the xlog_write() call. Compile tested > only patch below for you :P And now with sample patch. -Dave.
On Tue, Mar 03, 2020 at 10:25:29AM +1100, Dave Chinner wrote: > On Mon, Mar 02, 2020 at 01:06:50PM -0500, Brian Foster wrote: > > On Mon, Mar 02, 2020 at 02:07:50PM +1100, Dave Chinner wrote: > > > On Thu, Feb 27, 2020 at 08:43:15AM -0500, Brian Foster wrote: > > > > Automatic item relogging will occur from xfsaild context. xfsaild > > > > cannot acquire log reservation itself because it is also responsible > > > > for writeback and thus making used log reservation available again. > > > > Since there is no guarantee log reservation is available by the time > > > > a relogged item reaches the AIL, this is prone to deadlock. > > ..... > > > > > @@ -284,15 +289,25 @@ xfs_trans_alloc( > > > > tp->t_firstblock = NULLFSBLOCK; > > > > > > > > error = xfs_trans_reserve(tp, resp, blocks, rtextents); > > > > - if (error) { > > > > - xfs_trans_cancel(tp); > > > > - return error; > > > > + if (error) > > > > + goto error; > > > > + > > > > + if (flags & XFS_TRANS_RELOG) { > > > > + error = xfs_trans_ail_relog_reserve(&tp); > > > > + if (error) > > > > + goto error; > > > > } > > > > > > Hmmmm. So we are putting the AIL lock directly into the transaction > > > reserve path? xfs_trans_reserve() goes out of it's way to be > > > lockless in the fast paths, so if you want this to be a generic > > > mechanism that any transaction can use, the reservation needs to be > > > completely lockless. > > > > > > > Yeah, this is one of those warts mentioned in the cover letter. I wasn't > > planning to make it lockless as much as using an independent lock, but I > > can take a closer look at that if it still looks like a contention point > > after getting through the bigger picture feedback.. > > If relogging ends up getting widely used, then a filesystem global > lock in the transaction reserve path is guaranteed to be a > contention point. :) > I was thinking yesterday that I had implemented a fast path there. I intended to, but looking at the code I realize I hadn't got to that yet. So... I see your point and this will need to be addressed if it doesn't fall away via the relog res rework. > > > > + /* release the relog ticket reference if this transaction holds one */ > > > > + if (tp->t_flags & XFS_TRANS_RELOG) > > > > + xfs_trans_ail_relog_put(mp); > > > > + > > > > > > That looks ... interesting. xfs_trans_ail_relog_put() can call > > > xfs_log_done(), which means it can do a log write, which means the > > > commit lsn for this transaction could change. Hence to make a relog > > > permanent as a result of a sync transaction, we'd need the > > > commit_lsn of the AIL relog ticket write here, not that of the > > > original transaction that was written. > > > > > > > Hmm.. xfs_log_done() is eventually called for every transaction ticket, > > so I don't think it should be doing log writes based on that. > > xfs_log_done() > xlog_commit_record() > xlog_write(XLOG_COMMIT_TRANS) > xlog_state_release_iclog() > xlog_sync() > xlog_write_iclog() > submit_bio() Yeah, I know that we _can_ issue writes from this path because it's shared between the CIL log ticket path and the per-transaction log ticket path. I'm just not convinced that happens from the transaction path simply because if it did, it would seem like a potentially noticeable bug. > > > > The > > _ail_relog_put() path is basically just intended to properly terminate > > the relog ticket based on the existing transaction completion path. I'd > > have to dig a bit further into the code here, but e.g. 
we don't pass an > > iclog via this path so if we were attempting to write I'd probably have > > seen assert failures (or worse) by now. > > xlog_write() code handles a missing commit iclog pointer just fine. > Indeed, that's the reason xlog_write() may issue log IO itself: > > .... > spin_lock(&log->l_icloglock); > xlog_state_finish_copy(log, iclog, record_cnt, data_cnt); > if (commit_iclog) { > ASSERT(flags & XLOG_COMMIT_TRANS); > *commit_iclog = iclog; > } else { > error = xlog_state_release_iclog(log, iclog); > } > spin_unlock(&log->l_icloglock); > > i.e. if you don't provide xlog_write with a commit_iclog pointer > then you are saying "caller does not call xfs_log_release_iclog() > itself, so please release the iclog once the log iovec has been > written to the iclog. Hence the way you've written this code > explicitly tells xlog_write() to issue log IO on you behalf if it is > necessary.... > xlog_commit_record(), which is the xlog_write() caller in this case, expects a valid iclog and (unconditionally) asserts otherwise. I would expect to have seen that assert had this happened. > > Indeed, this is similar to a > > cancel of a clean transaction, the difference being this is open-coded > > because we only have the relog ticket in this context. > > A clean transaction has XLOG_TIC_INITED set, so xfs_log_done() will > never call xlog_commit_record() on it. XLOG_TIC_INITED is used to > indicate a start record has been written to the log for a permanent > transaction so it doesn't get written again when it is relogged..... > > .... Hmmmm ..... > > You might be right. > I figured it was related to the _INITED flag since that obviously gates the call to xlog_commit_record(), but when I've taken a quick look at that code in the past I wasn't really able to make much sense of its purpose (i.e. why the behavior differs between CIL and transaction log tickets) and didn't dig any further than that. > It looks like delayed logging made XLOG_TIC_INITED completely > redundant. xfs_trans_commit() no longer writes directly to iclogs, > so the normal transaction tickets will never have XLOG_TIC_INITED > cleared. Hence on any kernel since delayed logging was the only > option, we will never get log writes from the xfs_trans* interfaces. > > And we no longer do an xlog_write() call for every item in the > transaction after we format them into a log iovec. Instead, the CIL > just passes one big list of iovecs to a singel xlog_write() call, > so all callers only make a single xlog_write call per physical log > transaction. > > OK, XLOG_TIC_INITED is redundant, and should be removed. And > xfs_log_done() needs to be split into two, one for releasing the > ticket, one for completing the xlog_write() call. Compile tested > only patch below for you :P > Heh, Ok. Will take a look. > > > <GRRRRIIINNNNDDDD> > > > > > > <CLUNK> > > > > > > Ahhhhh. > > > > > > > Heh. :P > > > > > This is the reason why the CIL ended up using a reservation stealing > > > mechanism for it's ticket. Every log reservation takes into account > > > the worst case log overhead for that single transaction - that means > > > there are no special up front hooks or changes to the log > > > reservation to support the CIL, and it also means that the CIL can > > > steal what it needs from the current ticket at commit time. Then the > > > CIL can commit at any time, knowing it has enough space to write all > > > the items it tracks to the log. 
> > > > > > This avoided the transaction reservation mechanism from needing to > > > know anything about the CIL or that it was stealing space from > > > committing transactions for it's own private ticket. It greatly > > > simplified everything, and it's the reason that the CIL succeeded > > > where several other attempts to relog items in memory failed.... > > > > > > So.... > > > > > > Why can't the AIL steal the reservation it needs from the current > > > transaction when the item that needs relogging is being committed? > > > > I think we can. My first proposal[1] implemented something like this. > > The broader design was much less fleshed out, so it was crudely > > implemented as more of a commit time hook to relog pending items based > > on the currently committing transaction as opposed to reservation > > stealing for an independent relog ticket. This was actually based on the > > observation that the CIL already did something similar (though I didn't > > have all of the background as to why that approach was used for the > > CIL). > > > > The feedback at the time was to consider moving in a different direction > > that didn't involve stealing reservation from transactions, which gave > > me the impression that approach wasn't going to be acceptable. Granted, > > I don't recall that discussion steering away from stealing > reservations; it was more about mechanics like using transaction > callbacks and other possible mechanisms for enabling relogging. > > > the first few variations of this were widely different in approach and I > > don't think there was any explicit objection to reservation stealing > > itself, it just seemed a better development approach to work out an > > ideal design based on fundamentals as opposed to limiting the scope to > > contexts that might facilitate reservation stealing. > > Right, it wasn't clear how to best implement this at the time, > because we were all just struggling to get our heads around the > problem scope. :) > > > Despite the fact > > that this code is hairy, the fundamental approach is rather > > straightforward and simplistic. The hairiness mostly comes from > > attempting to make things generic and dynamic and could certainly use > > some cleanup. > > > > Now that this is at a point where I'm reasonably confident it is > > fundamentally correct, I'm quite happy to revisit a reservation stealing > > approach if that improves functional operation. This can serve a decent > > reference/baseline implementation for that, I think. > > *nod* > > > > > [1] https://lore.kernel.org/linux-xfs/20191024172850.7698-1-bfoster@redhat.com/ > > > > > i.e. we add the "relog overhead" to the permanent transaction that > > > requires items to be relogged by the AIL, and when that item is > > > formatted into the CIL we also check to see if it is currently > > > marked as "reloggable". If it's not, we steal the relogging > > > reservation from the transaction (essentially the item's formatted > > > size) and pass it to the AIL's private relog ticket, setting the log > > > item to be "reloggable" so the reservation isn't stolen over and > > > over again as the object is relogged in the CIL. > > > > > > > Ok, so this implies to me we'd update the per-transaction reservation > > calculations to incorporate worst case relog overhead for each > > transaction. E.g., the quotaoff transaction would x2 the intent size, > > a transaction that might relog a buffer adds a max buffer size overhead, > > etc. Right? > > Essentially, yes. 
> > And the reloggable items and reservation can be clearly documented > in the per-transaction reservation explanation comments like we > already do for all the structures that are modified in a > transaction. > Ok. > > > Hence as we commit reloggable items to the CIL, the AIL ticket > > > reservation grows with each of those items marked as reloggable. > > > As we relog items to the CIL and the CIL grows it's reservation via > > > the size delta, the AIL reservation can also be updated with the > > > same delta. > > > > > > Hence the AIL will always know exactly how much space it needs to relog all > > > the items it holds for relogging, and because it's been stolen from > > > the original transaction it is, like the CIL tciket reservation, > > > considered used space in the log. Hence the log space required for > > > relogging items via the AIL is correctly accounted for without > > > needing up front static, per-item reservations. > > > > > > > By this I assume you're referring to avoiding the need to reserve -> > > donate -> roll in the current scheme. Instead, we'd acquire the larger > > reservation up front and only steal if it necessary, which is less > > overhead because we don't need to always replenish the full transaction. > > Yes - we acquire the initial relog reservation by stealing it > from the current transaction. This gives the AIL ticket the reserved > space (both reserve and write grant space) it needs to relog the > item once. > > When the item is relogged by the AIL, the transaction commit > immediately regrants the reservation space that was just consumed, > and the trans_roll regrants the write grant space the commit just > consumed from the ticket via it's call to xfs_trans_reserve(). Hence > we only need to steal the space necessary to do the first relogging > of the item as the AIL will hold that reservation until the high > level code turns off relogging for that log item. > Yep. > > > Thoughts? > > > > <thinking out loud from here on..> > > > > One thing that comes to mind thinking about this is dealing with > > batching (relog multiple items per roll). This code doesn't handle that > > yet, but I anticipate it being a requirement and it's fairly easy to > > update the current scheme to support a fixed item count per relog-roll. > > > > A stealing approach potentially complicates things when it comes to > > batching because we have per-item reservation granularity to consider. > > For example, consider if we had a variety of different relog item types > > active at once, a subset are being relogged while another subset are > > being disabled (before ever needing to be relogged). > > So concurrent "active relogging" + "start relogging" + "stop > relogging"? > Yeah.. > > For one, we'd have > > to be careful to not release reservation while it might be accounted to > > an active relog transaction that is currently rolling with some other > > items, etc. There's also a potential quirk in handling reservation of a > > relogged item that is cancelled while it's being relogged, but that > > might be more of an implementation detail. > > Right, did you notice the ail->ail_relog_lock rwsem that I wrapped > my example relog transaction item add loop + commit function in? > Yes, but the side effects didn't quite register.. > i.e. while we are building and committing a relog transaction, we > hold off the transactions that are trying to add/remove their items > to/from the relog list. 
Hence the reservation stealing accounting in > the ticket can be be serialised against the transactional use of the > ticket. > Hmm, so this changes relog item/state management as well as the reservation management. That's probably why it wasn't clear to me from the code example. > It's basically the same method we use for serialising addition to > the CIL in transaction commit against CIL pushes draining the > current list for log writes (rwsem for add/push serialisation, spin > lock for concurrent add serialisation under the rwsem). > Ok. The difference with the CIL is that in that context we're processing a single, constantly maintained list of items. As of right now, there is no such list for relog enabled items. If I'm following correctly, that's still not necessary here, we're just talking about wrapping a rw lock around the state management (+ res accounting) and the actual res consumption such that a relog cancel doesn't steal reservation from an active relog transaction, for example. > > I don't think that's a show stopper, but rather just something I'd like > > to have factored into the design from the start. One option could be to > > *nod* > > > maintain a separate counter of active relog reservation aside from the > > actual relog ticket. That way the relog ticket could just pull from this > > relog reservation pool based on the current item(s) being relogged > > asynchronously from different tasks that might add or remove reservation > > from the pool for separate items. That might get a little wonky when we > > consider the relog ticket needs to pull from the pool and then put > > something back if the item is still reloggable after the relog > > transaction rolls. > > RIght, that's the whole problem that we solve via a) stealing the > initial reserve/write grant space at commit time and b) serialising > stealing vs transactional use of the ticket. > > That is, after the roll, the ticket has a full reserve and grant > space reservation for all the items accounted to the relog ticket. > Every new relog item added to the ticket (or is relogged in the CIL > and uses more space) adds the full required reserve/write grant > space to the the relog ticket. Hence the relog ticket always has > current log space reserved to commit the entire set of items tagged > as reloggable. And by avoiding modifying the ticket while we are > actively processing the relog transaction, we don't screw up the > ticket accounting in the middle of the transaction.... > Yep, Ok. > > Another problem is that reloggable items are still otherwise usable once > > they are unlocked. So for example we'd have to account for a situation > > where one transaction dirties a buffer, enables relogging, commits and > > then some other transaction dirties more of the buffer and commits > > without caring whether the buffer was relog enabled or not. > > yup, that's the delta size updates from the CIL commit. i.e. if we > relog an item to the CIL that has the XFS_LI_RELOG flag already set > on it, the change in size that we steal for the CIL ticket also > needs to be stolen for the AIL ticket. i.e. we already do almost all > the work we need to handle this. > > > Unless > > runtime relog reservation is always worst case, that's a subtle path to > > reservation overrun in the relog transaction. > > Yes, but it's a problem the CIL already solves for us :P > Ok, I think we're pretty close to the same page here. 
I was thinking about the worst case relog reservation being pulled off the committing transaction unconditionally, where I think you're thinking about it as the transaction (i.e. reservation calculation) would have the worst case reservation, but we'd only pull off the delta as needed at commit time (i.e. exactly how the CIL works wrt to transaction reservation consumption). Let me work through a simple example to try and (in)validate my concern: - Transaction A is relog enabled, dirties 50% of a buffer and enables auto relogging. On commit, the CIL takes buffer/2 reservation for the log vector and the relog mechanism takes the same amount for subsequent relogs. - Transaction B is not relog enabled (so no extra relog reservation), dirties another 40% of the (already relog enabled) buffer and commits. The CIL takes 90% of the transaction buffer reservation. The relog ticket now needs an additional 40% (since the original 50% is maintained by the relog system), but afaict there is no guarantee that res is available if trans B is not relog enabled. So IIUC, the CIL scheme works because every transaction automatically includes worst case reservation for every possible item supported by the transaction. Relog transactions are presumably more selective, however, so we need to either have some rule where only relog enabled transactions can touch relog enabled items (it's not clear to me how realistic that is for things like buffers etc., but it sounds potentially delicate) or we simplify the relog reservation consumption calculation to consume worst case reservation for the relog ticket in anticipation of unrelated transactions dirtying more of the associated object since it was originally committed for relog. Thoughts? Note that carrying worst case reservation also potentially simplifies accounting between multiple dirty+relog increments and a single relog cancel, particularly if a relog cancelling transaction also happens to dirty more of a buffer before it commits. I'd prefer to avoid subtle usage landmines like that as much as possible. Hmm.. even if we do ultimately want a more granular and dynamic relog res accounting, I might start with the simple worst-case approach since 1.) it probably doesn't matter for intents, 2.) it's easier to reason about the isolated reservation calculation changes in an independent patch from the rest of the mechanism and 3.) I'd rather get the basics nailed down and solid before potentially fighting with granular reservation accounting accuracy bugs. ;) Brian > > I'm actually wondering if a more simple approach to tracking stolen > > reservation is to add a new field that tracks active relog reservation > > to each supported log item. Then the initial transaction enables > > relogging with a simple transfer from the transaction to the item. The > > relog transaction knows how much reservation it can assign based on the > > current population of items and requires no further reservation > > accounting because it rolls and thus automatically reacquires relog > > reservation for each associated item. The path that clears relog state > > transfers res from the item back to a transaction simply so the > > reservation can be released back to the pool of unused log space. Note > > that clearing relog state doesn't require a transaction in the current > > implementation, but we could easily define a helper to allocate an empty > > transaction, clear relog and reclaim relog reservation and then cancel > > for those contexts. 
> > I don't think we need any of that - the AIL ticket only needs to be > kept up to date with the changes to the formatted size of the item > marked for relogging. It's no different to the CIL ticket > reservation accounting from that perspective. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com >
On Tue, Mar 03, 2020 at 03:07:35PM +1100, Dave Chinner wrote: > On Tue, Mar 03, 2020 at 10:25:29AM +1100, Dave Chinner wrote: > > On Mon, Mar 02, 2020 at 01:06:50PM -0500, Brian Foster wrote: > > OK, XLOG_TIC_INITED is redundant, and should be removed. And > > xfs_log_done() needs to be split into two, one for releasing the > > ticket, one for completing the xlog_write() call. Compile tested > > only patch below for you :P > > And now with sample patch. > > -Dave. > -- > Dave Chinner > david@fromorbit.com > > xfs: kill XLOG_TIC_INITED > > From: Dave Chinner <dchinner@redhat.com> > > Delayed logging made this redundant as we never directly write > transactions to the log anymore. Hence we no longer make multiple > xlog_write() calls for a transaction as we format individual items > in a transaction, and hence don't need to keep track of whether we > should be writing a start record for every xlog_write call. > FWIW the commit log could use a bit more context, perhaps from your previous description, about the original semantics of _INITED flag. E.g., it's always been rather vague to me, probably because it seems to be a remnant of some no longer fully in place functionality. > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_log.c | 79 ++++++++++++++++++--------------------------------- > fs/xfs/xfs_log.h | 4 --- > fs/xfs/xfs_log_cil.c | 13 +++++---- > fs/xfs/xfs_log_priv.h | 18 ++++++------ > fs/xfs/xfs_trans.c | 24 ++++++++-------- > 5 files changed, 55 insertions(+), 83 deletions(-) > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c > index f6006d94a581..a45f3eefee39 100644 > --- a/fs/xfs/xfs_log.c > +++ b/fs/xfs/xfs_log.c > @@ -496,8 +496,8 @@ xfs_log_reserve( > * This routine is called when a user of a log manager ticket is done with > * the reservation. If the ticket was ever used, then a commit record for > * the associated transaction is written out as a log operation header with > - * no data. The flag XLOG_TIC_INITED is set when the first write occurs with > - * a given ticket. If the ticket was one with a permanent reservation, then > + * no data. ^ trailing whitespace > + * If the ticket was one with a permanent reservation, then > * a few operations are done differently. Permanent reservation tickets by > * default don't release the reservation. They just commit the current > * transaction with the belief that the reservation is still needed. A flag > @@ -506,49 +506,38 @@ xfs_log_reserve( > * the inited state again. By doing this, a start record will be written > * out when the next write occurs. > */ > -xfs_lsn_t > -xfs_log_done( > - struct xfs_mount *mp, > +int > +xlog_write_done( > + struct xlog *log, > struct xlog_ticket *ticket, > struct xlog_in_core **iclog, > - bool regrant) > + xfs_lsn_t *lsn) > { > - struct xlog *log = mp->m_log; > - xfs_lsn_t lsn = 0; > - > - if (XLOG_FORCED_SHUTDOWN(log) || > - /* > - * If nothing was ever written, don't write out commit record. > - * If we get an error, just continue and give back the log ticket. > - */ > - (((ticket->t_flags & XLOG_TIC_INITED) == 0) && > - (xlog_commit_record(log, ticket, iclog, &lsn)))) { > - lsn = (xfs_lsn_t) -1; > - regrant = false; > - } > + if (XLOG_FORCED_SHUTDOWN(log)) > + return -EIO; > > + return xlog_commit_record(log, ticket, iclog, lsn); > +} > > +/* > + * Release or regrant the ticket reservation now the transaction is done with > + * it depending on caller context. Rolling transactions need the ticket > + * regranted, otherwise we release it completely. 
> + */ > +void > +xlog_ticket_done( > + struct xlog *log, > + struct xlog_ticket *ticket, > + bool regrant) > +{ > if (!regrant) { > trace_xfs_log_done_nonperm(log, ticket); > - > - /* > - * Release ticket if not permanent reservation or a specific > - * request has been made to release a permanent reservation. > - */ > xlog_ungrant_log_space(log, ticket); > } else { > trace_xfs_log_done_perm(log, ticket); > - > xlog_regrant_reserve_log_space(log, ticket); > - /* If this ticket was a permanent reservation and we aren't > - * trying to release it, reset the inited flags; so next time > - * we write, a start record will be written out. > - */ > - ticket->t_flags |= XLOG_TIC_INITED; > } > - > xfs_log_ticket_put(ticket); > - return lsn; > } In general it would be nicer to split off as much refactoring as possible into separate patches, even though it's not yet clear to me what granularity is possible with this patch... > > static bool > @@ -2148,8 +2137,9 @@ xlog_print_trans( > } > > /* > - * Calculate the potential space needed by the log vector. Each region gets > - * its own xlog_op_header_t and may need to be double word aligned. > + * Calculate the potential space needed by the log vector. We always write a > + * start record, and each region gets its own xlog_op_header_t and may need to > + * be double word aligned. > */ > static int > xlog_write_calc_vec_length( > @@ -2157,14 +2147,10 @@ xlog_write_calc_vec_length( > struct xfs_log_vec *log_vector) > { > struct xfs_log_vec *lv; > - int headers = 0; > + int headers = 1; > int len = 0; > int i; > > - /* acct for start rec of xact */ > - if (ticket->t_flags & XLOG_TIC_INITED) > - headers++; > - > for (lv = log_vector; lv; lv = lv->lv_next) { > /* we don't write ordered log vectors */ > if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED) > @@ -2195,17 +2181,11 @@ xlog_write_start_rec( > struct xlog_op_header *ophdr, > struct xlog_ticket *ticket) > { > - if (!(ticket->t_flags & XLOG_TIC_INITED)) > - return 0; > - > ophdr->oh_tid = cpu_to_be32(ticket->t_tid); > ophdr->oh_clientid = ticket->t_clientid; > ophdr->oh_len = 0; > ophdr->oh_flags = XLOG_START_TRANS; > ophdr->oh_res2 = 0; > - > - ticket->t_flags &= ~XLOG_TIC_INITED; > - > return sizeof(struct xlog_op_header); > } The header comment to this function refers to the "inited" state. Also note that there's a similar reference in xfs_log_write_unmount_record(), but that instance sets ->t_flags to zero so might be fine outside of the stale comment. > > @@ -2410,12 +2390,10 @@ xlog_write( > len = xlog_write_calc_vec_length(ticket, log_vector); > > /* > - * Region headers and bytes are already accounted for. > - * We only need to take into account start records and > - * split regions in this function. > + * Region headers and bytes are already accounted for. We only need to > + * take into account start records and split regions in this function. > */ > - if (ticket->t_flags & XLOG_TIC_INITED) > - ticket->t_curr_res -= sizeof(xlog_op_header_t); > + ticket->t_curr_res -= sizeof(xlog_op_header_t); > So AFAICT the CIL allocates a ticket and up to this point only mucks around with the reservation value. That means _INITED is still in place once we get to xlog_write(). xlog_write() immediately calls xlog_write_calc_vec_length() and makes the ->t_curr_res adjustment before touching ->t_flags, so those bits all seems fine. 
We then get into the log vector loops, where it looks like we call xlog_write_start_rec() for each log vector region and rely on the _INITED flag to only write a start record once per associated ticket. Unless I'm missing something, this looks like it would change behavior to perhaps write a start record per-region..? Note that this might not preclude the broader change to kill off _INITED since we're using the same ticket throughout the call, but some initial refactoring might be required to remove this dependency first. > /* > * Commit record headers need to be accounted for. These > @@ -3609,7 +3587,6 @@ xlog_ticket_alloc( > tic->t_ocnt = cnt; > tic->t_tid = prandom_u32(); > tic->t_clientid = client; > - tic->t_flags = XLOG_TIC_INITED; > if (permanent) > tic->t_flags |= XLOG_TIC_PERM_RESERV; > > diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h > index 84e06805160f..85f8d0966811 100644 > --- a/fs/xfs/xfs_log.h > +++ b/fs/xfs/xfs_log.h > @@ -105,10 +105,6 @@ struct xfs_log_item; > struct xfs_item_ops; > struct xfs_trans; > > -xfs_lsn_t xfs_log_done(struct xfs_mount *mp, > - struct xlog_ticket *ticket, > - struct xlog_in_core **iclog, > - bool regrant); > int xfs_log_force(struct xfs_mount *mp, uint flags); > int xfs_log_force_lsn(struct xfs_mount *mp, xfs_lsn_t lsn, uint flags, > int *log_forced); > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > index 48435cf2aa16..255065d276fc 100644 > --- a/fs/xfs/xfs_log_cil.c > +++ b/fs/xfs/xfs_log_cil.c > @@ -841,10 +841,11 @@ xlog_cil_push( > } > spin_unlock(&cil->xc_push_lock); > > - /* xfs_log_done always frees the ticket on error. */ > - commit_lsn = xfs_log_done(log->l_mp, tic, &commit_iclog, false); > - if (commit_lsn == -1) > - goto out_abort; > + error = xlog_write_done(log, tic, &commit_iclog, &commit_lsn); > + if (error) > + goto out_abort_free_ticket; > + > + xlog_ticket_done(log, tic, false); > > spin_lock(&commit_iclog->ic_callback_lock); > if (commit_iclog->ic_state == XLOG_STATE_IOERROR) { > @@ -876,7 +877,7 @@ xlog_cil_push( > return 0; > > out_abort_free_ticket: > - xfs_log_ticket_put(tic); > + xlog_ticket_done(log, tic, false); > out_abort: > xlog_cil_committed(ctx, true); > return -EIO; > @@ -1017,7 +1018,7 @@ xfs_log_commit_cil( > if (commit_lsn) > *commit_lsn = xc_commit_lsn; > > - xfs_log_done(mp, tp->t_ticket, NULL, regrant); > + xlog_ticket_done(log, tp->t_ticket, regrant); > tp->t_ticket = NULL; > xfs_trans_unreserve_and_mod_sb(tp); > > diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h > index b192c5a9f9fd..6965d164ff45 100644 > --- a/fs/xfs/xfs_log_priv.h > +++ b/fs/xfs/xfs_log_priv.h > @@ -53,11 +53,9 @@ enum xlog_iclog_state { > /* > * Flags to log ticket > */ > -#define XLOG_TIC_INITED 0x1 /* has been initialized */ > #define XLOG_TIC_PERM_RESERV 0x2 /* permanent reservation */ These values don't end up on disk, right? If not, it might be worth resetting the _PERM_RESERV value to 1. Otherwise the rest looks like mostly straightforward refactoring. 
Brian > > #define XLOG_TIC_FLAGS \ > - { XLOG_TIC_INITED, "XLOG_TIC_INITED" }, \ > { XLOG_TIC_PERM_RESERV, "XLOG_TIC_PERM_RESERV" } > > /* > @@ -438,14 +436,14 @@ xlog_write_adv_cnt(void **ptr, int *len, int *off, size_t bytes) > > void xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket); > void xlog_print_trans(struct xfs_trans *); > -int > -xlog_write( > - struct xlog *log, > - struct xfs_log_vec *log_vector, > - struct xlog_ticket *tic, > - xfs_lsn_t *start_lsn, > - struct xlog_in_core **commit_iclog, > - uint flags); > + > +int xlog_write(struct xlog *log, struct xfs_log_vec *log_vector, > + struct xlog_ticket *tic, xfs_lsn_t *start_lsn, > + struct xlog_in_core **commit_iclog, uint flags); > +int xlog_write_done(struct xlog *log, struct xlog_ticket *ticket, > + struct xlog_in_core **iclog, xfs_lsn_t *lsn); > +void xlog_ticket_done(struct xlog *log, struct xlog_ticket *ticket, > + bool regrant); > > /* > * When we crack an atomic LSN, we sample it first so that the value will not > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c > index 3b208f9a865c..85ea3727878b 100644 > --- a/fs/xfs/xfs_trans.c > +++ b/fs/xfs/xfs_trans.c > @@ -9,6 +9,7 @@ > #include "xfs_shared.h" > #include "xfs_format.h" > #include "xfs_log_format.h" > +#include "xfs_log_priv.h" > #include "xfs_trans_resv.h" > #include "xfs_mount.h" > #include "xfs_extent_busy.h" > @@ -150,8 +151,9 @@ xfs_trans_reserve( > uint blocks, > uint rtextents) > { > - int error = 0; > - bool rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0; > + struct xfs_mount *mp = tp->t_mountp; > + int error = 0; > + bool rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0; > > /* Mark this thread as being in a transaction */ > current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS); > @@ -162,7 +164,7 @@ xfs_trans_reserve( > * fail if the count would go below zero. > */ > if (blocks > 0) { > - error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd); > + error = xfs_mod_fdblocks(mp, -((int64_t)blocks), rsvd); > if (error != 0) { > current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS); > return -ENOSPC; > @@ -191,9 +193,9 @@ xfs_trans_reserve( > > if (tp->t_ticket != NULL) { > ASSERT(resp->tr_logflags & XFS_TRANS_PERM_LOG_RES); > - error = xfs_log_regrant(tp->t_mountp, tp->t_ticket); > + error = xfs_log_regrant(mp, tp->t_ticket); > } else { > - error = xfs_log_reserve(tp->t_mountp, > + error = xfs_log_reserve(mp, > resp->tr_logres, > resp->tr_logcount, > &tp->t_ticket, XFS_TRANSACTION, > @@ -213,7 +215,7 @@ xfs_trans_reserve( > * fail if the count would go below zero. 
> */ > if (rtextents > 0) { > - error = xfs_mod_frextents(tp->t_mountp, -((int64_t)rtextents)); > + error = xfs_mod_frextents(mp, -((int64_t)rtextents)); > if (error) { > error = -ENOSPC; > goto undo_log; > @@ -229,7 +231,7 @@ xfs_trans_reserve( > */ > undo_log: > if (resp->tr_logres > 0) { > - xfs_log_done(tp->t_mountp, tp->t_ticket, NULL, false); > + xlog_ticket_done(mp->m_log, tp->t_ticket, false); > tp->t_ticket = NULL; > tp->t_log_res = 0; > tp->t_flags &= ~XFS_TRANS_PERM_LOG_RES; > @@ -237,7 +239,7 @@ xfs_trans_reserve( > > undo_blocks: > if (blocks > 0) { > - xfs_mod_fdblocks(tp->t_mountp, (int64_t)blocks, rsvd); > + xfs_mod_fdblocks(mp, (int64_t)blocks, rsvd); > tp->t_blk_res = 0; > } > > @@ -999,9 +1001,7 @@ __xfs_trans_commit( > */ > xfs_trans_unreserve_and_mod_dquots(tp); > if (tp->t_ticket) { > - commit_lsn = xfs_log_done(mp, tp->t_ticket, NULL, regrant); > - if (commit_lsn == -1 && !error) > - error = -EIO; > + xlog_ticket_done(mp->m_log, tp->t_ticket, regrant); > tp->t_ticket = NULL; > } > current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS); > @@ -1060,7 +1060,7 @@ xfs_trans_cancel( > xfs_trans_unreserve_and_mod_dquots(tp); > > if (tp->t_ticket) { > - xfs_log_done(mp, tp->t_ticket, NULL, false); > + xlog_ticket_done(mp->m_log, tp->t_ticket, false); > tp->t_ticket = NULL; > } > >
On Tue, Mar 03, 2020 at 09:13:16AM -0500, Brian Foster wrote: > On Tue, Mar 03, 2020 at 10:25:29AM +1100, Dave Chinner wrote: > > On Mon, Mar 02, 2020 at 01:06:50PM -0500, Brian Foster wrote: > > > <thinking out loud from here on..> > > > > > > One thing that comes to mind thinking about this is dealing with > > > batching (relog multiple items per roll). This code doesn't handle that > > > yet, but I anticipate it being a requirement and it's fairly easy to > > > update the current scheme to support a fixed item count per relog-roll. > > > > > > A stealing approach potentially complicates things when it comes to > > > batching because we have per-item reservation granularity to consider. > > > For example, consider if we had a variety of different relog item types > > > active at once, a subset are being relogged while another subset are > > > being disabled (before ever needing to be relogged). > > > > So concurrent "active relogging" + "start relogging" + "stop > > relogging"? > > > > Yeah.. > > > > For one, we'd have > > > to be careful to not release reservation while it might be accounted to > > > an active relog transaction that is currently rolling with some other > > > items, etc. There's also a potential quirk in handling reservation of a > > > relogged item that is cancelled while it's being relogged, but that > > > might be more of an implementation detail. > > > > Right, did you notice the ail->ail_relog_lock rwsem that I wrapped > > my example relog transaction item add loop + commit function in? > > > > Yes, but the side effects didn't quite register.. > > > i.e. while we are building and committing a relog transaction, we > > hold off the transactions that are trying to add/remove their items > > to/from the relog list. Hence the reservation stealing accounting in > > the ticket can be be serialised against the transactional use of the > > ticket. > > > > Hmm, so this changes relog item/state management as well as the > reservation management. That's probably why it wasn't clear to me from > the code example. Yeah, I left a lot out :P > > It's basically the same method we use for serialising addition to > > the CIL in transaction commit against CIL pushes draining the > > current list for log writes (rwsem for add/push serialisation, spin > > lock for concurrent add serialisation under the rwsem). > > > > Ok. The difference with the CIL is that in that context we're processing > a single, constantly maintained list of items. As of right now, there is > no such list for relog enabled items. If I'm following correctly, that's > still not necessary here, we're just talking about wrapping a rw lock > around the state management (+ res accounting) and the actual res > consumption such that a relog cancel doesn't steal reservation from an > active relog transaction, for example. That is correct. I don't think it actually matters because we can't remove an item that is locked and being relogged (so its reservation is in use). Except for the fact that we can't serialise internal ticket accounting updates the transaction might be doing with external ticket accounting modifications any other way. > > > I don't think that's a show stopper, but rather just something I'd like > > > to have factored into the design from the start. One option could be to > > > > *nod* > > > > > maintain a separate counter of active relog reservation aside from the > > > actual relog ticket.
That way the relog ticket could just pull from this > > > relog reservation pool based on the current item(s) being relogged > > > asynchronously from different tasks that might add or remove reservation > > > from the pool for separate items. That might get a little wonky when we > > > consider the relog ticket needs to pull from the pool and then put > > > something back if the item is still reloggable after the relog > > > transaction rolls. > > > > RIght, that's the whole problem that we solve via a) stealing the > > initial reserve/write grant space at commit time and b) serialising > > stealing vs transactional use of the ticket. > > > > That is, after the roll, the ticket has a full reserve and grant > > space reservation for all the items accounted to the relog ticket. > > Every new relog item added to the ticket (or is relogged in the CIL > > and uses more space) adds the full required reserve/write grant > > space to the the relog ticket. Hence the relog ticket always has > > current log space reserved to commit the entire set of items tagged > > as reloggable. And by avoiding modifying the ticket while we are > > actively processing the relog transaction, we don't screw up the > > ticket accounting in the middle of the transaction.... > > > > Yep, Ok. > > > > Another problem is that reloggable items are still otherwise usable once > > > they are unlocked. So for example we'd have to account for a situation > > > where one transaction dirties a buffer, enables relogging, commits and > > > then some other transaction dirties more of the buffer and commits > > > without caring whether the buffer was relog enabled or not. > > > > yup, that's the delta size updates from the CIL commit. i.e. if we > > relog an item to the CIL that has the XFS_LI_RELOG flag already set > > on it, the change in size that we steal for the CIL ticket also > > needs to be stolen for the AIL ticket. i.e. we already do almost all > > the work we need to handle this. > > > > > Unless > > > runtime relog reservation is always worst case, that's a subtle path to > > > reservation overrun in the relog transaction. > > > > Yes, but it's a problem the CIL already solves for us :P > > Ok, I think we're pretty close to the same page here. I was thinking > about the worst case relog reservation being pulled off the committing > transaction unconditionally, where I think you're thinking about it as > the transaction (i.e. reservation calculation) would have the worst case > reservation, but we'd only pull off the delta as needed at commit time > (i.e. exactly how the CIL works wrt to transaction reservation > consumption). Let me work through a simple example to try and > (in)validate my concern: > > - Transaction A is relog enabled, dirties 50% of a buffer and enables > auto relogging. On commit, the CIL takes buffer/2 reservation for the > log vector and the relog mechanism takes the same amount for > subsequent relogs. > - Transaction B is not relog enabled (so no extra relog reservation), > dirties another 40% of the (already relog enabled) buffer and commits. > The CIL takes 90% of the transaction buffer reservation. The relog > ticket now needs an additional 40% (since the original 50% is > maintained by the relog system), but afaict there is no guarantee that > res is available if trans B is not relog enabled. Yes, I can see that would be an issue - very well spotted, Brian. 
Without reading your further comments: off the top of my head that means we would probably have to declare particular types of objects as reloggable, and explicitly include that object in each reservation rather than use a generic "buffer" or "inode" reservation for it. We are likely to only be relogging "container" objects such as intents or high level structures such as inodes, AG headers, etc., so this probably isn't a huge increase in transaction size or complexity, and it will be largely self-documenting. But it still adds more complexity, and we're trying to avoid that .... ..... Oh, I'm a bit stupid, only half way through my first coffee: just get rid of delta updates altogether and steal the entire relog reservation for the object up front. We really don't need delta updates, it was just something the CIL does so I didn't think past "we can use that because it is already there".... /me reads on... > So IIUC, the CIL scheme works because every transaction automatically > includes worst case reservation for every possible item supported by the > transaction. Relog transactions are presumably more selective, however, > so we need to either have some rule where only relog enabled > transactions can touch relog enabled items (it's not clear to me how > realistic that is for things like buffers etc., but it sounds > potentially delicate) or we simplify the relog reservation consumption > calculation to consume worst case reservation for the relog ticket in > anticipation of unrelated transactions dirtying more of the associated > object since it was originally committed for relog. Thoughts? Yep, I think we've both come to the same conclusion :) > Note that carrying worst case reservation also potentially simplifies > accounting between multiple dirty+relog increments and a single relog > cancel, particularly if a relog cancelling transaction also happens to > dirty more of a buffer before it commits. I'd prefer to avoid subtle > usage landmines like that as much as possible. Hmm.. even if we do > ultimately want a more granular and dynamic relog res accounting, I > might start with the simple worst-case approach since 1.) it probably > doesn't matter for intents, 2.) it's easier to reason about the isolated > reservation calculation changes in an independent patch from the rest of > the mechanism and 3.) I'd rather get the basics nailed down and solid > before potentially fighting with granular reservation accounting > accuracy bugs. ;) Absolutely. I don't think we'll ever need anything more dynamic or complex. And I think just taking the whole reservation will scale a lot better if we are constantly modifying the object - we only need to take the relog lock when we add or remove a relog reservation now, not on every delta change to the object.... Again, well spotted and good thinking! Cheers, Dave.
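To make the "steal the entire relog reservation up front" idea concrete, a minimal sketch of the commit-time accounting might look like the following. This is illustrative only and not from the posted patches: xfs_ail_relog_steal_res(), xfs_ail_relog_return_res() and xfs_relog_res() are hypothetical names, while t_curr_res, ail_lock and ail_relog_tic are the fields the patch at the end of this thread already uses. Because the committing transaction already reserved the item's worst case size, moving curr_res between tickets needs no grant head update, which is the same trick the posted patch uses to seed the relog ticket.

/*
 * Sketch: move the full worst-case relog reservation for @lip from the
 * committing transaction's ticket to the AIL relog ticket instead of
 * tracking per-commit delta updates. xfs_relog_res() is a hypothetical
 * helper returning the maximum formatted size for the item type.
 */
static void
xfs_ail_relog_steal_res(
	struct xfs_trans	*tp,
	struct xfs_log_item	*lip)
{
	struct xfs_ail		*ailp = tp->t_mountp->m_ail;
	int			res = xfs_relog_res(lip);

	spin_lock(&ailp->ail_lock);
	ASSERT(ailp->ail_relog_tic != NULL);
	ASSERT(tp->t_ticket->t_curr_res >= res);
	tp->t_ticket->t_curr_res -= res;
	ailp->ail_relog_tic->t_curr_res += res;
	spin_unlock(&ailp->ail_lock);
}

/*
 * Relog cancel is the inverse: hand the worst-case reservation back to
 * the cancelling transaction so it is released through the normal
 * commit/cancel paths.
 */
static void
xfs_ail_relog_return_res(
	struct xfs_trans	*tp,
	struct xfs_log_item	*lip)
{
	struct xfs_ail		*ailp = tp->t_mountp->m_ail;
	int			res = xfs_relog_res(lip);

	spin_lock(&ailp->ail_lock);
	ailp->ail_relog_tic->t_curr_res -= res;
	tp->t_ticket->t_curr_res += res;
	spin_unlock(&ailp->ail_lock);
}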
On Tue, Mar 03, 2020 at 10:12:17AM -0500, Brian Foster wrote: > On Tue, Mar 03, 2020 at 03:07:35PM +1100, Dave Chinner wrote: > > On Tue, Mar 03, 2020 at 10:25:29AM +1100, Dave Chinner wrote: > > > On Mon, Mar 02, 2020 at 01:06:50PM -0500, Brian Foster wrote: > > > OK, XLOG_TIC_INITED is redundant, and should be removed. And > > > xfs_log_done() needs to be split into two, one for releasing the > > > ticket, one for completing the xlog_write() call. Compile tested > > > only patch below for you :P > > > > And now with sample patch. > > > > -Dave. > > -- > > Dave Chinner > > david@fromorbit.com > > > > xfs: kill XLOG_TIC_INITED > > > > From: Dave Chinner <dchinner@redhat.com> > > > > Delayed logging made this redundant as we never directly write > > transactions to the log anymore. Hence we no longer make multiple > > xlog_write() calls for a transaction as we format individual items > > in a transaction, and hence don't need to keep track of whether we > > should be writing a start record for every xlog_write call. > > > > FWIW the commit log could use a bit more context, perhaps from your > previous description, about the original semantics of _INITED flag. > E.g., it's always been rather vague to me, probably because it seems to > be a remnant of some no longer fully in place functionality. Yup, it was a quick "here's what it looks like" patch. > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > --- > > fs/xfs/xfs_log.c | 79 ++++++++++++++++++--------------------------------- > > fs/xfs/xfs_log.h | 4 --- > > fs/xfs/xfs_log_cil.c | 13 +++++---- > > fs/xfs/xfs_log_priv.h | 18 ++++++------ > > fs/xfs/xfs_trans.c | 24 ++++++++-------- > > 5 files changed, 55 insertions(+), 83 deletions(-) > > > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c > > index f6006d94a581..a45f3eefee39 100644 > > --- a/fs/xfs/xfs_log.c > > +++ b/fs/xfs/xfs_log.c > > @@ -496,8 +496,8 @@ xfs_log_reserve( > > * This routine is called when a user of a log manager ticket is done with > > * the reservation. If the ticket was ever used, then a commit record for > > * the associated transaction is written out as a log operation header with > > - * no data. The flag XLOG_TIC_INITED is set when the first write occurs with > > - * a given ticket. If the ticket was one with a permanent reservation, then > > + * no data. > > ^ trailing whitespace Forgot to <ctrl-j> to join the lines back together :P > > > + * If the ticket was one with a permanent reservation, then > > * a few operations are done differently. Permanent reservation tickets by > > * default don't release the reservation. They just commit the current > > * transaction with the belief that the reservation is still needed. A flag > > @@ -506,49 +506,38 @@ xfs_log_reserve( > > * the inited state again. By doing this, a start record will be written > > * out when the next write occurs. > > */ > > -xfs_lsn_t > > -xfs_log_done( > > - struct xfs_mount *mp, > > +int > > +xlog_write_done( > > + struct xlog *log, > > struct xlog_ticket *ticket, > > struct xlog_in_core **iclog, > > - bool regrant) > > + xfs_lsn_t *lsn) > > { > > - struct xlog *log = mp->m_log; > > - xfs_lsn_t lsn = 0; > > - > > - if (XLOG_FORCED_SHUTDOWN(log) || > > - /* > > - * If nothing was ever written, don't write out commit record. > > - * If we get an error, just continue and give back the log ticket. 
> > - */ > > - (((ticket->t_flags & XLOG_TIC_INITED) == 0) && > > - (xlog_commit_record(log, ticket, iclog, &lsn)))) { > > - lsn = (xfs_lsn_t) -1; > > - regrant = false; > > - } > > + if (XLOG_FORCED_SHUTDOWN(log)) > > + return -EIO; > > > > + return xlog_commit_record(log, ticket, iclog, lsn); > > +} > > > > +/* > > + * Release or regrant the ticket reservation now the transaction is done with > > + * it depending on caller context. Rolling transactions need the ticket > > + * regranted, otherwise we release it completely. > > + */ > > +void > > +xlog_ticket_done( > > + struct xlog *log, > > + struct xlog_ticket *ticket, > > + bool regrant) > > +{ > > if (!regrant) { > > trace_xfs_log_done_nonperm(log, ticket); > > - > > - /* > > - * Release ticket if not permanent reservation or a specific > > - * request has been made to release a permanent reservation. > > - */ > > xlog_ungrant_log_space(log, ticket); > > } else { > > trace_xfs_log_done_perm(log, ticket); > > - > > xlog_regrant_reserve_log_space(log, ticket); > > - /* If this ticket was a permanent reservation and we aren't > > - * trying to release it, reset the inited flags; so next time > > - * we write, a start record will be written out. > > - */ > > - ticket->t_flags |= XLOG_TIC_INITED; > > } > > - > > xfs_log_ticket_put(ticket); > > - return lsn; > > } > > In general it would be nicer to split off as much refactoring as > possible into separate patches, even though it's not yet clear to me > what granularity is possible with this patch... Yeah, there's heaps more cleanups that can be done as a result of this - e.g. xlog_write_done() and xlog_commit_record() should be merged. The one caller of xlog_write() that does not provide a commit_iclog variable should do that and call xlog_commit_record() itself, then xlog_write() can just assume that always returns the last iclog, etc.... > > static bool > > @@ -2148,8 +2137,9 @@ xlog_print_trans( > > } > > > > /* > > - * Calculate the potential space needed by the log vector. Each region gets > > - * its own xlog_op_header_t and may need to be double word aligned. > > + * Calculate the potential space needed by the log vector. We always write a > > + * start record, and each region gets its own xlog_op_header_t and may need to > > + * be double word aligned. > > */ > > static int > > xlog_write_calc_vec_length( > > @@ -2157,14 +2147,10 @@ xlog_write_calc_vec_length( > > struct xfs_log_vec *log_vector) > > { > > struct xfs_log_vec *lv; > > - int headers = 0; > > + int headers = 1; > > int len = 0; > > int i; > > > > - /* acct for start rec of xact */ > > - if (ticket->t_flags & XLOG_TIC_INITED) > > - headers++; > > - > > for (lv = log_vector; lv; lv = lv->lv_next) { > > /* we don't write ordered log vectors */ > > if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED) > > @@ -2195,17 +2181,11 @@ xlog_write_start_rec( > > struct xlog_op_header *ophdr, > > struct xlog_ticket *ticket) > > { > > - if (!(ticket->t_flags & XLOG_TIC_INITED)) > > - return 0; > > - > > ophdr->oh_tid = cpu_to_be32(ticket->t_tid); > > ophdr->oh_clientid = ticket->t_clientid; > > ophdr->oh_len = 0; > > ophdr->oh_flags = XLOG_START_TRANS; > > ophdr->oh_res2 = 0; > > - > > - ticket->t_flags &= ~XLOG_TIC_INITED; > > - > > return sizeof(struct xlog_op_header); > > } > > The header comment to this function refers to the "inited" state. Missed that... > Also note that there's a similar reference in > xfs_log_write_unmount_record(), but that instance sets ->t_flags to zero > so might be fine outside of the stale comment. More cleanups!
> > @@ -2410,12 +2390,10 @@ xlog_write( > > len = xlog_write_calc_vec_length(ticket, log_vector); > > > > /* > > - * Region headers and bytes are already accounted for. > > - * We only need to take into account start records and > > - * split regions in this function. > > + * Region headers and bytes are already accounted for. We only need to > > + * take into account start records and split regions in this function. > > */ > > - if (ticket->t_flags & XLOG_TIC_INITED) > > - ticket->t_curr_res -= sizeof(xlog_op_header_t); > > + ticket->t_curr_res -= sizeof(xlog_op_header_t); > > > > So AFAICT the CIL allocates a ticket and up to this point only mucks > around with the reservation value. That means _INITED is still in place > once we get to xlog_write(). xlog_write() immediately calls > xlog_write_calc_vec_length() and makes the ->t_curr_res adjustment > before touching ->t_flags, so those bits all seems fine. > > We then get into the log vector loops, where it looks like we call > xlog_write_start_rec() for each log vector region and rely on the > _INITED flag to only write a start record once per associated ticket. > Unless I'm missing something, this looks like it would change behavior > to perhaps write a start record per-region..? Note that this might not > preclude the broader change to kill off _INITED since we're using the > same ticket throughout the call, but some initial refactoring might be > required to remove this dependency first. Ah, yes, well spotted. I need to move the call to xlog_write_start_rec() outside the loop - it only needs to be written once per ticket, and we only ever supply one ticket to xlog_write() now, and it is never reused to call back into xlog_write again for the same transaction context. I did say "compile tested only " :) > > diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h > > index b192c5a9f9fd..6965d164ff45 100644 > > --- a/fs/xfs/xfs_log_priv.h > > +++ b/fs/xfs/xfs_log_priv.h > > @@ -53,11 +53,9 @@ enum xlog_iclog_state { > > /* > > * Flags to log ticket > > */ > > -#define XLOG_TIC_INITED 0x1 /* has been initialized */ > > #define XLOG_TIC_PERM_RESERV 0x2 /* permanent reservation */ > > These values don't end up on disk, right? If not, it might be worth > resetting the _PERM_RESERV value to 1. Otherwise the rest looks like > mostly straightforward refactoring. *nod* -Dave.
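For illustration of Dave's point above about writing the start record only once per ticket: one way to preserve that behaviour inside xlog_write() without XLOG_TIC_INITED is a local flag. This is a sketch of the idea rather than the actual follow-up fix, it glosses over the real (more involved) loop structure of xlog_write(), and it assumes the function's existing locals (ptr, len, log_offset, record_cnt, start_rec_copy).

	bool	wrote_start_rec = false;	/* replaces the XLOG_TIC_INITED check */

	/* ... */

	for (lv = log_vector; lv; lv = lv->lv_next) {
		/* ... per-iclog space setup ... */

		/* emit the start record exactly once per ticket */
		if (!wrote_start_rec) {
			start_rec_copy = xlog_write_start_rec(ptr, ticket);
			record_cnt++;
			xlog_write_adv_cnt(&ptr, &len, &log_offset,
					   start_rec_copy);
			wrote_start_rec = true;
		}

		/* ... format the regions of this log vector as before ... */
	}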
On Wed, Mar 04, 2020 at 08:26:26AM +1100, Dave Chinner wrote: > On Tue, Mar 03, 2020 at 09:13:16AM -0500, Brian Foster wrote: > > On Tue, Mar 03, 2020 at 10:25:29AM +1100, Dave Chinner wrote: > > > On Mon, Mar 02, 2020 at 01:06:50PM -0500, Brian Foster wrote: > > > > <thinking out loud from here on..> > > > > > > > > One thing that comes to mind thinking about this is dealing with > > > > batching (relog multiple items per roll). This code doesn't handle that > > > > yet, but I anticipate it being a requirement and it's fairly easy to > > > > update the current scheme to support a fixed item count per relog-roll. > > > > > > > > A stealing approach potentially complicates things when it comes to > > > > batching because we have per-item reservation granularity to consider. > > > > For example, consider if we had a variety of different relog item types > > > > active at once, a subset are being relogged while another subset are > > > > being disabled (before ever needing to be relogged). > > > > > > So concurrent "active relogging" + "start relogging" + "stop > > > relogging"? > > > > > > > Yeah.. > > > > > > For one, we'd have > > > > to be careful to not release reservation while it might be accounted to > > > > an active relog transaction that is currently rolling with some other > > > > items, etc. There's also a potential quirk in handling reservation of a > > > > relogged item that is cancelled while it's being relogged, but that > > > > might be more of an implementation detail. > > > > > > Right, did you notice the ail->ail_relog_lock rwsem that I wrapped > > > my example relog transaction item add loop + commit function in? > > > > > > > Yes, but the side effects didn't quite register.. > > > > > i.e. while we are building and committing a relog transaction, we > > > hold off the transactions that are trying to add/remove their items > > > to/from the relog list. Hence the reservation stealing accounting in > > > the ticket can be be serialised against the transactional use of the > > > ticket. > > > > > > > Hmm, so this changes relog item/state management as well as the > > reservation management. That's probably why it wasn't clear to me from > > the code example. > > Yeah, I left a lot out :P > > > > It's basically the same method we use for serialising addition to > > > the CIL in transaction commit against CIL pushes draining the > > > current list for log writes (rwsem for add/push serialisation, spin > > > lock for concurrent add serialisation under the rwsem). > > > > > > > Ok. The difference with the CIL is that in that context we're processing > > a single, constantly maintained list of items. As of right now, there is > > no such list for relog enabled items. If I'm following correctly, that's > > still not necessary here, we're just talking about wrapping a rw lock > > around the state management (+ res accounting) and the actual res > > consumption such that a relog cancel doesn't steal reservation from an > > active relog transaction, for example. > > That is correct. I don't think it actually matters because we can't > remove an item that is locked and being relogged (so it's > reservation is in use). Except for the fact we can't serialise > internal itcket accounting updates the transaction might be doing > with external tciket accounting modifications any other way. > We do still need some form of serialization for objects that aren't currently lockable, like certain intents. 
I suppose we could add locks where necessary, but right now I'm thinking of a slight alteration of the res accounting strategy to distinguish relog enabled items currently resident in the relog transaction (i.e. awaiting the AIL processing to complete and commit/roll the transaction) from all other relog enabled items might be good enough. Without something like that (or IOW if the relog transaction is the central holder of all outstanding relog reservation), then we always have to serialize relog cancel on an item against xfsaild populating a relog transaction, regardless of whether the affected item is being relogged at the time. That's again probably not something that's a huge deal for the current use case, but it also _seems_ like something that can be addressed via implementation tweaks without changing the fundamental design. I guess we'll see once I have a chance to actually play around with it... :P > > > > I don't think that's a show stopper, but rather just something I'd like > > > > to have factored into the design from the start. One option could be to > > > > > > *nod* > > > > > > > maintain a separate counter of active relog reservation aside from the > > > > actual relog ticket. That way the relog ticket could just pull from this > > > > relog reservation pool based on the current item(s) being relogged > > > > asynchronously from different tasks that might add or remove reservation > > > > from the pool for separate items. That might get a little wonky when we > > > > consider the relog ticket needs to pull from the pool and then put > > > > something back if the item is still reloggable after the relog > > > > transaction rolls. > > > > > > RIght, that's the whole problem that we solve via a) stealing the > > > initial reserve/write grant space at commit time and b) serialising > > > stealing vs transactional use of the ticket. > > > > > > That is, after the roll, the ticket has a full reserve and grant > > > space reservation for all the items accounted to the relog ticket. > > > Every new relog item added to the ticket (or is relogged in the CIL > > > and uses more space) adds the full required reserve/write grant > > > space to the the relog ticket. Hence the relog ticket always has > > > current log space reserved to commit the entire set of items tagged > > > as reloggable. And by avoiding modifying the ticket while we are > > > actively processing the relog transaction, we don't screw up the > > > ticket accounting in the middle of the transaction.... > > > > > > > Yep, Ok. > > > > > > Another problem is that reloggable items are still otherwise usable once > > > > they are unlocked. So for example we'd have to account for a situation > > > > where one transaction dirties a buffer, enables relogging, commits and > > > > then some other transaction dirties more of the buffer and commits > > > > without caring whether the buffer was relog enabled or not. > > > > > > yup, that's the delta size updates from the CIL commit. i.e. if we > > > relog an item to the CIL that has the XFS_LI_RELOG flag already set > > > on it, the change in size that we steal for the CIL ticket also > > > needs to be stolen for the AIL ticket. i.e. we already do almost all > > > the work we need to handle this. > > > > > > > Unless > > > > runtime relog reservation is always worst case, that's a subtle path to > > > > reservation overrun in the relog transaction. > > > > > > Yes, but it's a problem the CIL already solves for us :P > > > > Ok, I think we're pretty close to the same page here. 
I was thinking > > about the worst case relog reservation being pulled off the committing > > transaction unconditionally, where I think you're thinking about it as > > the transaction (i.e. reservation calculation) would have the worst case > > reservation, but we'd only pull off the delta as needed at commit time > > (i.e. exactly how the CIL works wrt to transaction reservation > > consumption). Let me work through a simple example to try and > > (in)validate my concern: > > > > - Transaction A is relog enabled, dirties 50% of a buffer and enables > > auto relogging. On commit, the CIL takes buffer/2 reservation for the > > log vector and the relog mechanism takes the same amount for > > subsequent relogs. > > - Transaction B is not relog enabled (so no extra relog reservation), > > dirties another 40% of the (already relog enabled) buffer and commits. > > The CIL takes 90% of the transaction buffer reservation. The relog > > ticket now needs an additional 40% (since the original 50% is > > maintained by the relog system), but afaict there is no guarantee that > > res is available if trans B is not relog enabled. > > Yes, I can see that would be an issue - very well spotted, Brian. > > Without reading your further comments: off the top of my head that > means we would probably have to declare particular types of objects > as reloggable, and explicitly include that object in each > reservation rather than use a generic "buffer" or "inode" > reservation for it. We are likely to only be relogging "container" > objects such as intents or high level structures such as, > inodes, AG headers, etc, so this probably isn't a huge increase > in transaction size or complexity, and it will be largely self > documenting. > > But cwit still adds more complexity, and we're trying to avoid that > .... > > ..... > > Oh, I'm a bit stupid only half way through my first coffee: just get > rid of delta updates altogether and steal the entire relog > reservation for the object up front. We really don't need delta > updates, it was just something the CIL does so I didn't think past > "we can use that because it is already there".... > > /me reads on... > > > So IIUC, the CIL scheme works because every transaction automatically > > includes worst case reservation for every possible item supported by the > > transaction. Relog transactions are presumably more selective, however, > > so we need to either have some rule where only relog enabled > > transactions can touch relog enabled items (it's not clear to me how > > realistic that is for things like buffers etc., but it sounds > > potentially delicate) or we simplify the relog reservation consumption > > calculation to consume worst case reservation for the relog ticket in > > anticipation of unrelated transactions dirtying more of the associated > > object since it was originally committed for relog. Thoughts? > > Yep, I think we've both come to the same conclusion :) > Indeed. Account for the max res. size of the associated object that the relog enabled transaction already acquired for us anyways and we shouldn't have to worry about further modifications of the object. > > Note that carrying worst case reservation also potentially simplifies > > accounting between multiple dirty+relog increments and a single relog > > cancel, particularly if a relog cancelling transaction also happens to > > dirty more of a buffer before it commits. I'd prefer to avoid subtle > > usage landmines like that as much as possible. Hmm.. 
even if we do > > ultimately want a more granular and dynamic relog res accounting, I > > might start with the simple worst-case approach since 1.) it probably > > doesn't matter for intents, 2.) it's easier to reason about the isolated > > reservation calculation changes in an independent patch from the rest of > > the mechanism and 3.) I'd rather get the basics nailed down and solid > > before potentially fighting with granular reservation accounting > > accuracy bugs. ;) > > Absolutely. I don't think we'll ever need anything more dynamic > or complex. And I think just taking the whole reservation will scale > a lot better if we are constantly modifying the object - we only > need to take the relog lock when we add or remove a relog > reservation now, not on every delta change to the object.... > Yep, Ok. I think there's still some potential quirks to work through with this approach (re: the per-item granularity accounting/serialization bits mentioned above), but at this point it's easier for me to reason about it by hacking on it a bit. If there are still any issues to work out, best to have something concrete on the list to guide discussion. Thanks Dave, appreciate the design input! Brian > Again, well spotted and good thinking! > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com >
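As a rough sketch of the serialisation being discussed above, reusing the ail_relog_lock rwsem name from Dave's earlier example (the function names, the item list handling and the XFS_LI_RELOG flag usage are hypothetical or from other patches in the series): relog enable/cancel transfer reservation under the rwsem in shared mode plus the AIL spin lock, while xfsaild takes the rwsem exclusively around building and committing a relog transaction, so the relog ticket accounting cannot change underneath an in-flight relog transaction.

int
xfs_relog_enable(
	struct xfs_trans	*tp,
	struct xfs_log_item	*lip)
{
	struct xfs_ail		*ailp = tp->t_mountp->m_ail;

	down_read(&ailp->ail_relog_lock);
	spin_lock(&ailp->ail_lock);
	set_bit(XFS_LI_RELOG, &lip->li_flags);
	/* transfer worst case reservation, e.g. xfs_ail_relog_steal_res() above */
	spin_unlock(&ailp->ail_lock);
	up_read(&ailp->ail_relog_lock);
	return 0;
}

static int
xfsaild_relog(
	struct xfs_ail		*ailp,
	struct xfs_trans	*tp)
{
	int			error;

	down_write(&ailp->ail_relog_lock);
	/* join pending reloggable items to @tp using the relog ticket */
	error = xfs_trans_commit(tp);
	up_write(&ailp->ail_relog_lock);
	return error;
}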
diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h index c45acbd3add9..0a10ca0853ab 100644 --- a/fs/xfs/libxfs/xfs_shared.h +++ b/fs/xfs/libxfs/xfs_shared.h @@ -77,6 +77,7 @@ void xfs_log_get_max_trans_res(struct xfs_mount *mp, * made then this algorithm will eventually find all the space it needs. */ #define XFS_TRANS_LOWMODE 0x100 /* allocate in low space mode */ +#define XFS_TRANS_RELOG 0x200 /* enable automatic relogging */ /* * Field values for xfs_trans_mod_sb. diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c index 3b208f9a865c..8ac05ed8deda 100644 --- a/fs/xfs/xfs_trans.c +++ b/fs/xfs/xfs_trans.c @@ -107,9 +107,14 @@ xfs_trans_dup( ntp->t_flags = XFS_TRANS_PERM_LOG_RES | (tp->t_flags & XFS_TRANS_RESERVE) | - (tp->t_flags & XFS_TRANS_NO_WRITECOUNT); - /* We gave our writer reference to the new transaction */ + (tp->t_flags & XFS_TRANS_NO_WRITECOUNT) | + (tp->t_flags & XFS_TRANS_RELOG); + /* + * The writer reference and relog reference transfer to the new + * transaction. + */ tp->t_flags |= XFS_TRANS_NO_WRITECOUNT; + tp->t_flags &= ~XFS_TRANS_RELOG; ntp->t_ticket = xfs_log_ticket_get(tp->t_ticket); ASSERT(tp->t_blk_res >= tp->t_blk_res_used); @@ -284,15 +289,25 @@ xfs_trans_alloc( tp->t_firstblock = NULLFSBLOCK; error = xfs_trans_reserve(tp, resp, blocks, rtextents); - if (error) { - xfs_trans_cancel(tp); - return error; + if (error) + goto error; + + if (flags & XFS_TRANS_RELOG) { + error = xfs_trans_ail_relog_reserve(&tp); + if (error) + goto error; } trace_xfs_trans_alloc(tp, _RET_IP_); *tpp = tp; return 0; + +error: + /* clear relog flag if we haven't acquired a ref */ + tp->t_flags &= ~XFS_TRANS_RELOG; + xfs_trans_cancel(tp); + return error; } /* @@ -973,6 +988,10 @@ __xfs_trans_commit( xfs_log_commit_cil(mp, tp, &commit_lsn, regrant); + /* release the relog ticket reference if this transaction holds one */ + if (tp->t_flags & XFS_TRANS_RELOG) + xfs_trans_ail_relog_put(mp); + current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS); xfs_trans_free(tp); @@ -1004,6 +1023,10 @@ __xfs_trans_commit( error = -EIO; tp->t_ticket = NULL; } + /* release the relog ticket reference if this transaction holds one */ + /* XXX: handle RELOG items on transaction abort */ + if (tp->t_flags & XFS_TRANS_RELOG) + xfs_trans_ail_relog_put(mp); current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS); xfs_trans_free_items(tp, !!error); xfs_trans_free(tp); @@ -1064,6 +1087,10 @@ xfs_trans_cancel( tp->t_ticket = NULL; } + /* release the relog ticket reference if this transaction holds one */ + if (tp->t_flags & XFS_TRANS_RELOG) + xfs_trans_ail_relog_put(mp); + /* mark this thread as no longer being in a transaction */ current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS); diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h index 752c7fef9de7..a032989943bd 100644 --- a/fs/xfs/xfs_trans.h +++ b/fs/xfs/xfs_trans.h @@ -236,6 +236,9 @@ int xfs_trans_roll_inode(struct xfs_trans **, struct xfs_inode *); void xfs_trans_cancel(xfs_trans_t *); int xfs_trans_ail_init(struct xfs_mount *); void xfs_trans_ail_destroy(struct xfs_mount *); +int xfs_trans_ail_relog_reserve(struct xfs_trans **); +bool xfs_trans_ail_relog_get(struct xfs_mount *); +int xfs_trans_ail_relog_put(struct xfs_mount *); void xfs_trans_buf_set_type(struct xfs_trans *, struct xfs_buf *, enum xfs_blft); diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c index 00cc5b8734be..a3fb64275baa 100644 --- a/fs/xfs/xfs_trans_ail.c +++ b/fs/xfs/xfs_trans_ail.c @@ -17,6 +17,7 @@ #include "xfs_errortag.h" 
#include "xfs_error.h" #include "xfs_log.h" +#include "xfs_log_priv.h" #ifdef DEBUG /* @@ -818,6 +819,93 @@ xfs_trans_ail_delete( xfs_log_space_wake(ailp->ail_mount); } +bool +xfs_trans_ail_relog_get( + struct xfs_mount *mp) +{ + struct xfs_ail *ailp = mp->m_ail; + bool ret = false; + + spin_lock(&ailp->ail_lock); + if (ailp->ail_relog_tic) { + xfs_log_ticket_get(ailp->ail_relog_tic); + ret = true; + } + spin_unlock(&ailp->ail_lock); + return ret; +} + +/* + * Reserve log space for the automatic relogging ->tr_relog ticket. This + * requires a clean, permanent transaction from the caller. Pull reservation + * for the relog ticket and roll the caller's transaction back to its fully + * reserved state. If the AIL relog ticket is already initialized, grab a + * reference and return. + */ +int +xfs_trans_ail_relog_reserve( + struct xfs_trans **tpp) +{ + struct xfs_trans *tp = *tpp; + struct xfs_mount *mp = tp->t_mountp; + struct xfs_ail *ailp = mp->m_ail; + struct xlog_ticket *tic; + uint32_t logres = M_RES(mp)->tr_relog.tr_logres; + + ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); + ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY)); + + if (xfs_trans_ail_relog_get(mp)) + return 0; + + /* no active ticket, fall into slow path to allocate one.. */ + tic = xlog_ticket_alloc(mp->m_log, logres, 1, XFS_TRANSACTION, true, 0); + if (!tic) + return -ENOMEM; + ASSERT(tp->t_ticket->t_curr_res >= tic->t_curr_res); + + /* check again since we dropped the lock for the allocation */ + spin_lock(&ailp->ail_lock); + if (ailp->ail_relog_tic) { + xfs_log_ticket_get(ailp->ail_relog_tic); + spin_unlock(&ailp->ail_lock); + xfs_log_ticket_put(tic); + return 0; + } + + /* attach and reserve space for the ->tr_relog ticket */ + ailp->ail_relog_tic = tic; + tp->t_ticket->t_curr_res -= tic->t_curr_res; + spin_unlock(&ailp->ail_lock); + + return xfs_trans_roll(tpp); +} + +/* + * Release a reference to the relog ticket. + */ +int +xfs_trans_ail_relog_put( + struct xfs_mount *mp) +{ + struct xfs_ail *ailp = mp->m_ail; + struct xlog_ticket *tic; + + spin_lock(&ailp->ail_lock); + if (atomic_add_unless(&ailp->ail_relog_tic->t_ref, -1, 1)) { + spin_unlock(&ailp->ail_lock); + return 0; + } + + ASSERT(atomic_read(&ailp->ail_relog_tic->t_ref) == 1); + tic = ailp->ail_relog_tic; + ailp->ail_relog_tic = NULL; + spin_unlock(&ailp->ail_lock); + + xfs_log_done(mp, tic, NULL, false); + return 0; +} + int xfs_trans_ail_init( xfs_mount_t *mp) @@ -854,6 +942,7 @@ xfs_trans_ail_destroy( { struct xfs_ail *ailp = mp->m_ail; + ASSERT(ailp->ail_relog_tic == NULL); kthread_stop(ailp->ail_task); kmem_free(ailp); } diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h index 2e073c1c4614..839df6559b9f 100644 --- a/fs/xfs/xfs_trans_priv.h +++ b/fs/xfs/xfs_trans_priv.h @@ -61,6 +61,7 @@ struct xfs_ail { int ail_log_flush; struct list_head ail_buf_list; wait_queue_head_t ail_empty; + struct xlog_ticket *ail_relog_tic; }; /*
Automatic item relogging will occur from xfsaild context. xfsaild cannot acquire log reservation itself because it is also responsible for writeback and thus making used log reservation available again. Since there is no guarantee log reservation is available by the time a relogged item reaches the AIL, this is prone to deadlock.

To guarantee log reservation for automatic relogging, implement a reservation management scheme where a transaction that is capable of enabling relogging of an item must contribute the necessary reservation to the relog mechanism up front. Use reference counting to associate the lifetime of pending relog reservation to the lifetime of in-core log items with relogging enabled.

The basic log reservation sequence for a relog enabled transaction is as follows:

- A transaction that uses relogging specifies XFS_TRANS_RELOG at allocation time.
- Once initialized, RELOG transactions check for the existence of the global relog log ticket. If it exists, grab a reference and return. If not, allocate an empty ticket and install into the relog subsystem. Seed the relog ticket from reservation of the current transaction. Roll the current transaction to replenish its reservation and return to the caller.
- The transaction is used as normal. If an item is relogged in the transaction, that item acquires a reference on the global relog ticket currently held open by the transaction. The item's reference persists until relogging is disabled on the item.
- The RELOG transaction commits and releases its reference to the global relog ticket. The global relog ticket is released once its reference count drops to zero.

This provides a central relog log ticket that guarantees reservation availability for relogged items, avoids log reservation deadlocks and is allocated and released on demand.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_shared.h |  1 +
 fs/xfs/xfs_trans.c         | 37 +++++++++++++---
 fs/xfs/xfs_trans.h         |  3 ++
 fs/xfs/xfs_trans_ail.c     | 89 ++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_trans_priv.h    |  1 +
 5 files changed, 126 insertions(+), 5 deletions(-)
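For context, a caller of the mechanism added by this patch would look something like the sketch below. The reservation type is an arbitrary stand-in and the relog-enable step is only indicated by a comment, since the per-item relogging interfaces come from other patches in the series; XFS_TRANS_RELOG and the commit-time release of the transaction's relog ticket reference are what this patch provides.

	struct xfs_trans	*tp;
	int			error;

	/*
	 * XFS_TRANS_RELOG guarantees the shared relog ticket exists and is
	 * seeded with reservation before this transaction commits anything
	 * that may later need relogging. tr_itruncate is an arbitrary
	 * example reservation.
	 */
	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0,
				XFS_TRANS_RELOG, &tp);
	if (error)
		return error;

	/* ... join/dirty the log item and enable relogging on it ... */

	/*
	 * The transaction's reference to the relog ticket is dropped at
	 * commit; references held by relog enabled items persist until
	 * relogging is disabled on them.
	 */
	error = xfs_trans_commit(tp);
	return error;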