diff mbox series

[25/45] xfs: reserve space and initialise xlog_op_header in item formatting

Message ID 20210305051143.182133-26-david@fromorbit.com (mailing list archive)
State Superseded
Headers show
Series xfs: consolidated log and optimisation changes | expand

Commit Message

Dave Chinner March 5, 2021, 5:11 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

Current xlog_write() adds op headers to the log manually for every
log item region that is in the vector passed to it. While
xlog_write() needs to stamp the transaction ID into the ophdr, we
already know it's length, flags, clientid, etc at CIL commit time.

This means the only time that xlog write really needs to format and
reserve space for a new ophdr is when a region is split across two
iclogs. Adding the opheader and accounting for it as part of the
normal formatted item region means we simplify the accounting
of space used by a transaction and we don't have to special case
reserving of space in for the ophdrs in xlog_write(). It also means
we can largely initialise the ophdr in transaction commit instead
of xlog_write, making the xlog_write formatting inner loop much
tighter.

xlog_prepare_iovec() is now too large to stay as an inline function,
so we move it out of line and into xfs_log.c.

Object sizes:
text	   data	    bss	    dec	    hex	filename
1125934	 305951	    484	1432369	 15db31 fs/xfs/built-in.a.before
1123360	 305951	    484	1429795	 15d123 fs/xfs/built-in.a.after

So the code is a roughly 2.5kB smaller with xlog_prepare_iovec() now
out of line, even though it grew in size itself.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c     | 115 +++++++++++++++++++++++++++++--------------
 fs/xfs/xfs_log.h     |  42 +++-------------
 fs/xfs/xfs_log_cil.c |  25 +++++-----
 3 files changed, 99 insertions(+), 83 deletions(-)

Comments

Darrick J. Wong March 9, 2021, 2:21 a.m. UTC | #1
On Fri, Mar 05, 2021 at 04:11:23PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Current xlog_write() adds op headers to the log manually for every
> log item region that is in the vector passed to it. While
> xlog_write() needs to stamp the transaction ID into the ophdr, we
> already know it's length, flags, clientid, etc at CIL commit time.
> 
> This means the only time that xlog write really needs to format and
> reserve space for a new ophdr is when a region is split across two
> iclogs. Adding the opheader and accounting for it as part of the
> normal formatted item region means we simplify the accounting
> of space used by a transaction and we don't have to special case
> reserving of space in for the ophdrs in xlog_write(). It also means
> we can largely initialise the ophdr in transaction commit instead
> of xlog_write, making the xlog_write formatting inner loop much
> tighter.
> 
> xlog_prepare_iovec() is now too large to stay as an inline function,
> so we move it out of line and into xfs_log.c.
> 
> Object sizes:
> text	   data	    bss	    dec	    hex	filename
> 1125934	 305951	    484	1432369	 15db31 fs/xfs/built-in.a.before
> 1123360	 305951	    484	1429795	 15d123 fs/xfs/built-in.a.after
> 
> So the code is a roughly 2.5kB smaller with xlog_prepare_iovec() now
> out of line, even though it grew in size itself.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Sooo... if I understand this part of the patchset correctly, the goal
here is to simplify and shorten the inner loop of xlog_write. Callers
are now required to create their own log op headers at the start of the
xfs_log_iovec chain in the xfs_log_vec, which means that the only time
xlog_write has to create an ophdr is when we fill up the current iclog
and must continue in a new one, because that's not something the callers
should ever have to know about.  Correct?

If so,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

It /really/ would have been nice to have kept these patches separated by
major functional change area (i.e. separate series) instead of one
gigantic 45-patch behemoth to intimidate the reviewers...

--D

> ---
>  fs/xfs/xfs_log.c     | 115 +++++++++++++++++++++++++++++--------------
>  fs/xfs/xfs_log.h     |  42 +++-------------
>  fs/xfs/xfs_log_cil.c |  25 +++++-----
>  3 files changed, 99 insertions(+), 83 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 429cb1e7cc67..98de45be80c0 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -89,6 +89,62 @@ xlog_iclogs_empty(
>  static int
>  xfs_log_cover(struct xfs_mount *);
>  
> +/*
> + * We need to make sure the buffer pointer returned is naturally aligned for the
> + * biggest basic data type we put into it. We have already accounted for this
> + * padding when sizing the buffer.
> + *
> + * However, this padding does not get written into the log, and hence we have to
> + * track the space used by the log vectors separately to prevent log space hangs
> + * due to inaccurate accounting (i.e. a leak) of the used log space through the
> + * CIL context ticket.
> + *
> + * We also add space for the xlog_op_header that describes this region in the
> + * log. This prepends the data region we return to the caller to copy their data
> + * into, so do all the static initialisation of the ophdr now. Because the ophdr
> + * is not 8 byte aligned, we have to be careful to ensure that we align the
> + * start of the buffer such that the region we return to the call is 8 byte
> + * aligned and packed against the tail of the ophdr.
> + */
> +void *
> +xlog_prepare_iovec(
> +	struct xfs_log_vec	*lv,
> +	struct xfs_log_iovec	**vecp,
> +	uint			type)
> +{
> +	struct xfs_log_iovec	*vec = *vecp;
> +	struct xlog_op_header	*oph;
> +	uint32_t		len;
> +	void			*buf;
> +
> +	if (vec) {
> +		ASSERT(vec - lv->lv_iovecp < lv->lv_niovecs);
> +		vec++;
> +	} else {
> +		vec = &lv->lv_iovecp[0];
> +	}
> +
> +	len = lv->lv_buf_len + sizeof(struct xlog_op_header);
> +	if (!IS_ALIGNED(len, sizeof(uint64_t))) {
> +		lv->lv_buf_len = round_up(len, sizeof(uint64_t)) -
> +					sizeof(struct xlog_op_header);
> +	}
> +
> +	vec->i_type = type;
> +	vec->i_addr = lv->lv_buf + lv->lv_buf_len;
> +
> +	oph = vec->i_addr;
> +	oph->oh_clientid = XFS_TRANSACTION;
> +	oph->oh_res2 = 0;
> +	oph->oh_flags = 0;
> +
> +	buf = vec->i_addr + sizeof(struct xlog_op_header);
> +	ASSERT(IS_ALIGNED((unsigned long)buf, sizeof(uint64_t)));
> +
> +	*vecp = vec;
> +	return buf;
> +}
> +
>  static void
>  xlog_grant_sub_space(
>  	struct xlog		*log,
> @@ -2120,9 +2176,9 @@ xlog_print_trans(
>  }
>  
>  /*
> - * Calculate the potential space needed by the log vector. If this is a start
> - * transaction, the caller has already accounted for both opheaders in the start
> - * transaction, so we don't need to account for them here.
> + * Calculate the potential space needed by the log vector. All regions contain
> + * their own opheaders and they are accounted for in region space so we don't
> + * need to add them to the vector length here.
>   */
>  static int
>  xlog_write_calc_vec_length(
> @@ -2149,18 +2205,7 @@ xlog_write_calc_vec_length(
>  			xlog_tic_add_region(ticket, vecp->i_len, vecp->i_type);
>  		}
>  	}
> -
> -	/* Don't account for regions with embedded ophdrs */
> -	if (optype && headers > 0) {
> -		headers--;
> -		if (optype & XLOG_START_TRANS) {
> -			ASSERT(headers >= 1);
> -			headers--;
> -		}
> -	}
> -
>  	ticket->t_res_num_ophdrs += headers;
> -	len += headers * sizeof(struct xlog_op_header);
>  
>  	return len;
>  }
> @@ -2170,7 +2215,6 @@ xlog_write_setup_ophdr(
>  	struct xlog_op_header	*ophdr,
>  	struct xlog_ticket	*ticket)
>  {
> -	ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
>  	ophdr->oh_clientid = XFS_TRANSACTION;
>  	ophdr->oh_res2 = 0;
>  	ophdr->oh_flags = 0;
> @@ -2404,21 +2448,25 @@ xlog_write(
>  			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
>  
>  			/*
> -			 * The XLOG_START_TRANS has embedded ophdrs for the
> -			 * start record and transaction header. They will always
> -			 * be the first two regions in the lv chain. Commit and
> -			 * unmount records also have embedded ophdrs.
> +			 * Regions always have their ophdr at the start of the
> +			 * region, except for:
> +			 * - a transaction start which has a start record ophdr
> +			 *   before the first region ophdr; and
> +			 * - the previous region didn't fully fit into an iclog
> +			 *   so needs a continuation ophdr to prepend the region
> +			 *   in this new iclog.
>  			 */
> -			if (optype) {
> -				ophdr = reg->i_addr;
> -				if (index)
> -					optype &= ~XLOG_START_TRANS;
> -			} else {
> +			ophdr = reg->i_addr;
> +			if (optype && index) {
> +				optype &= ~XLOG_START_TRANS;
> +			} else if (partial_copy) {
>                                  ophdr = xlog_write_setup_ophdr(ptr, ticket);
>  				xlog_write_adv_cnt(&ptr, &len, &log_offset,
>  					   sizeof(struct xlog_op_header));
>  				added_ophdr = true;
>  			}
> +			ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
> +
>  			len += xlog_write_setup_copy(ticket, ophdr,
>  						     iclog->ic_size-log_offset,
>  						     reg->i_len,
> @@ -2436,20 +2484,11 @@ xlog_write(
>  				ophdr->oh_len = cpu_to_be32(copy_len -
>  						sizeof(struct xlog_op_header));
>  			}
> -			/*
> -			 * Copy region.
> -			 *
> -			 * Commit records just log an opheader, so
> -			 * we can have empty payloads with no data region to
> -			 * copy.  Hence we only copy the payload if the vector
> -			 * says it has data to copy.
> -			 */
> -			ASSERT(copy_len >= 0);
> -			if (copy_len > 0) {
> -				memcpy(ptr, reg->i_addr + copy_off, copy_len);
> -				xlog_write_adv_cnt(&ptr, &len, &log_offset,
> -						   copy_len);
> -			}
> +
> +			ASSERT(copy_len > 0);
> +			memcpy(ptr, reg->i_addr + copy_off, copy_len);
> +			xlog_write_adv_cnt(&ptr, &len, &log_offset, copy_len);
> +
>  			if (added_ophdr)
>  				copy_len += sizeof(struct xlog_op_header);
>  			record_cnt++;
> diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
> index 1ca4f2edbdaf..af54ea3f8c90 100644
> --- a/fs/xfs/xfs_log.h
> +++ b/fs/xfs/xfs_log.h
> @@ -21,44 +21,18 @@ struct xfs_log_vec {
>  
>  #define XFS_LOG_VEC_ORDERED	(-1)
>  
> -/*
> - * We need to make sure the buffer pointer returned is naturally aligned for the
> - * biggest basic data type we put into it. We have already accounted for this
> - * padding when sizing the buffer.
> - *
> - * However, this padding does not get written into the log, and hence we have to
> - * track the space used by the log vectors separately to prevent log space hangs
> - * due to inaccurate accounting (i.e. a leak) of the used log space through the
> - * CIL context ticket.
> - */
> -static inline void *
> -xlog_prepare_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp,
> -		uint type)
> -{
> -	struct xfs_log_iovec *vec = *vecp;
> -
> -	if (vec) {
> -		ASSERT(vec - lv->lv_iovecp < lv->lv_niovecs);
> -		vec++;
> -	} else {
> -		vec = &lv->lv_iovecp[0];
> -	}
> -
> -	if (!IS_ALIGNED(lv->lv_buf_len, sizeof(uint64_t)))
> -		lv->lv_buf_len = round_up(lv->lv_buf_len, sizeof(uint64_t));
> -
> -	vec->i_type = type;
> -	vec->i_addr = lv->lv_buf + lv->lv_buf_len;
> -
> -	ASSERT(IS_ALIGNED((unsigned long)vec->i_addr, sizeof(uint64_t)));
> -
> -	*vecp = vec;
> -	return vec->i_addr;
> -}
> +void *xlog_prepare_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp,
> +		uint type);
>  
>  static inline void
>  xlog_finish_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec *vec, int len)
>  {
> +	struct xlog_op_header	*oph = vec->i_addr;
> +
> +	/* opheader tracks payload length, logvec tracks region length */
> +	oph->oh_len = len;
> +
> +	len += sizeof(struct xlog_op_header);
>  	lv->lv_buf_len += len;
>  	lv->lv_bytes += len;
>  	vec->i_len = len;
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 0c81c13e2cf6..7a5e6bdb7876 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -181,13 +181,20 @@ xlog_cil_alloc_shadow_bufs(
>  		}
>  
>  		/*
> -		 * We 64-bit align the length of each iovec so that the start
> -		 * of the next one is naturally aligned.  We'll need to
> -		 * account for that slack space here. Then round nbytes up
> -		 * to 64-bit alignment so that the initial buffer alignment is
> -		 * easy to calculate and verify.
> +		 * We 64-bit align the length of each iovec so that the start of
> +		 * the next one is naturally aligned.  We'll need to account for
> +		 * that slack space here.
> +		 *
> +		 * We also add the xlog_op_header to each region when
> +		 * formatting, but that's not accounted to the size of the item
> +		 * at this point. Hence we'll need an addition number of bytes
> +		 * for each vector to hold an opheader.
> +		 *
> +		 * Then round nbytes up to 64-bit alignment so that the initial
> +		 * buffer alignment is easy to calculate and verify.
>  		 */
> -		nbytes += niovecs * sizeof(uint64_t);
> +		nbytes += niovecs *
> +			(sizeof(uint64_t) + sizeof(struct xlog_op_header));
>  		nbytes = round_up(nbytes, sizeof(uint64_t));
>  
>  		/*
> @@ -433,11 +440,6 @@ xlog_cil_insert_items(
>  
>  	spin_lock(&cil->xc_cil_lock);
>  
> -	/* account for space used by new iovec headers  */
> -	iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t);
> -	len += iovhdr_res;
> -	ctx->nvecs += diff_iovecs;
> -
>  	/* attach the transaction to the CIL if it has any busy extents */
>  	if (!list_empty(&tp->t_busy))
>  		list_splice_init(&tp->t_busy, &ctx->busy_extents);
> @@ -469,6 +471,7 @@ xlog_cil_insert_items(
>  	}
>  	tp->t_ticket->t_curr_res -= len;
>  	ctx->space_used += len;
> +	ctx->nvecs += diff_iovecs;
>  
>  	/*
>  	 * If we've overrun the reservation, dump the tx details before we move
> -- 
> 2.28.0
>
Dave Chinner March 11, 2021, 3:29 a.m. UTC | #2
On Mon, Mar 08, 2021 at 06:21:34PM -0800, Darrick J. Wong wrote:
> On Fri, Mar 05, 2021 at 04:11:23PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Current xlog_write() adds op headers to the log manually for every
> > log item region that is in the vector passed to it. While
> > xlog_write() needs to stamp the transaction ID into the ophdr, we
> > already know it's length, flags, clientid, etc at CIL commit time.
> > 
> > This means the only time that xlog write really needs to format and
> > reserve space for a new ophdr is when a region is split across two
> > iclogs. Adding the opheader and accounting for it as part of the
> > normal formatted item region means we simplify the accounting
> > of space used by a transaction and we don't have to special case
> > reserving of space in for the ophdrs in xlog_write(). It also means
> > we can largely initialise the ophdr in transaction commit instead
> > of xlog_write, making the xlog_write formatting inner loop much
> > tighter.
> > 
> > xlog_prepare_iovec() is now too large to stay as an inline function,
> > so we move it out of line and into xfs_log.c.
> > 
> > Object sizes:
> > text	   data	    bss	    dec	    hex	filename
> > 1125934	 305951	    484	1432369	 15db31 fs/xfs/built-in.a.before
> > 1123360	 305951	    484	1429795	 15d123 fs/xfs/built-in.a.after
> > 
> > So the code is a roughly 2.5kB smaller with xlog_prepare_iovec() now
> > out of line, even though it grew in size itself.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> Sooo... if I understand this part of the patchset correctly, the goal
> here is to simplify and shorten the inner loop of xlog_write.

That's one of the goals. The other goal is to avoid needing to
account for log op headers separately in the high level CIL commit
code.

> Callers
> are now required to create their own log op headers at the start of the
> xfs_log_iovec chain in the xfs_log_vec, which means that the only time
> xlog_write has to create an ophdr is when we fill up the current iclog
> and must continue in a new one, because that's not something the callers
> should ever have to know about.  Correct?

Yes.

> If so,
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Thanks!

> It /really/ would have been nice to have kept these patches separated by
> major functional change area (i.e. separate series) instead of one
> gigantic 45-patch behemoth to intimidate the reviewers...

How is that any different from sending out 6-7 separate dependent
patchsets one immediately after another?  A change to one patch in
one series results in needing to rebase at least one patch in each
of the smaller patchsets, so I've still got to treat them all as one
big patchset in my development trees. Then I have to start
reposting patchsets just because another patchset was changed, and
that gets even more confusing trying to work out what patchset goes
with which version and so on. It's much easier for me to manage them
as a single patchset....

Cheers,

Dave.
Darrick J. Wong March 11, 2021, 3:41 a.m. UTC | #3
On Thu, Mar 12, 2021 at 02:29:32PM +1100, Dave Chinner wrote:
> On Mon, Mar 08, 2021 at 06:21:34PM -0800, Darrick J. Wong wrote:
> > On Fri, Mar 05, 2021 at 04:11:23PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Current xlog_write() adds op headers to the log manually for every
> > > log item region that is in the vector passed to it. While
> > > xlog_write() needs to stamp the transaction ID into the ophdr, we
> > > already know it's length, flags, clientid, etc at CIL commit time.
> > > 
> > > This means the only time that xlog write really needs to format and
> > > reserve space for a new ophdr is when a region is split across two
> > > iclogs. Adding the opheader and accounting for it as part of the
> > > normal formatted item region means we simplify the accounting
> > > of space used by a transaction and we don't have to special case
> > > reserving of space in for the ophdrs in xlog_write(). It also means
> > > we can largely initialise the ophdr in transaction commit instead
> > > of xlog_write, making the xlog_write formatting inner loop much
> > > tighter.
> > > 
> > > xlog_prepare_iovec() is now too large to stay as an inline function,
> > > so we move it out of line and into xfs_log.c.
> > > 
> > > Object sizes:
> > > text	   data	    bss	    dec	    hex	filename
> > > 1125934	 305951	    484	1432369	 15db31 fs/xfs/built-in.a.before
> > > 1123360	 305951	    484	1429795	 15d123 fs/xfs/built-in.a.after
> > > 
> > > So the code is a roughly 2.5kB smaller with xlog_prepare_iovec() now
> > > out of line, even though it grew in size itself.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > 
> > Sooo... if I understand this part of the patchset correctly, the goal
> > here is to simplify and shorten the inner loop of xlog_write.
> 
> That's one of the goals. The other goal is to avoid needing to
> account for log op headers separately in the high level CIL commit
> code.
> 
> > Callers
> > are now required to create their own log op headers at the start of the
> > xfs_log_iovec chain in the xfs_log_vec, which means that the only time
> > xlog_write has to create an ophdr is when we fill up the current iclog
> > and must continue in a new one, because that's not something the callers
> > should ever have to know about.  Correct?
> 
> Yes.
> 
> > If so,
> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> 
> Thanks!
> 
> > It /really/ would have been nice to have kept these patches separated by
> > major functional change area (i.e. separate series) instead of one
> > gigantic 45-patch behemoth to intimidate the reviewers...
> 
> How is that any different from sending out 6-7 separate dependent
> patchsets one immediately after another?  A change to one patch in
> one series results in needing to rebase at least one patch in each
> of the smaller patchsets, so I've still got to treat them all as one
> big patchset in my development trees. Then I have to start
> reposting patchsets just because another patchset was changed, and
> that gets even more confusing trying to work out what patchset goes
> with which version and so on. It's much easier for me to manage them
> as a single patchset....

Well, ok, but it would have been nice for the cover letter to give
/some/ hint as to what's changing in various subranges, e.g.

"Patches 32-36 reduce the xc_cil_lock critical sections,
 Patches 37-41 create per-cpu cil structures and move log items and
       vectors to use them,
 Patches 42-44 are more cleanups,
 Patch 45 documents the whole mess."

So I could see the outlines of where the 45 patches were going.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Brian Foster March 16, 2021, 2:53 p.m. UTC | #4
On Fri, Mar 05, 2021 at 04:11:23PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Current xlog_write() adds op headers to the log manually for every
> log item region that is in the vector passed to it. While
> xlog_write() needs to stamp the transaction ID into the ophdr, we
> already know it's length, flags, clientid, etc at CIL commit time.
> 
> This means the only time that xlog write really needs to format and
> reserve space for a new ophdr is when a region is split across two
> iclogs. Adding the opheader and accounting for it as part of the
> normal formatted item region means we simplify the accounting
> of space used by a transaction and we don't have to special case
> reserving of space in for the ophdrs in xlog_write(). It also means
> we can largely initialise the ophdr in transaction commit instead
> of xlog_write, making the xlog_write formatting inner loop much
> tighter.
> 
> xlog_prepare_iovec() is now too large to stay as an inline function,
> so we move it out of line and into xfs_log.c.
> 
> Object sizes:
> text	   data	    bss	    dec	    hex	filename
> 1125934	 305951	    484	1432369	 15db31 fs/xfs/built-in.a.before
> 1123360	 305951	    484	1429795	 15d123 fs/xfs/built-in.a.after
> 
> So the code is a roughly 2.5kB smaller with xlog_prepare_iovec() now
> out of line, even though it grew in size itself.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Looks mostly reasonable, a couple or so questions...

>  fs/xfs/xfs_log.c     | 115 +++++++++++++++++++++++++++++--------------
>  fs/xfs/xfs_log.h     |  42 +++-------------
>  fs/xfs/xfs_log_cil.c |  25 +++++-----
>  3 files changed, 99 insertions(+), 83 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 429cb1e7cc67..98de45be80c0 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -89,6 +89,62 @@ xlog_iclogs_empty(
>  static int
>  xfs_log_cover(struct xfs_mount *);
>  
> +/*
> + * We need to make sure the buffer pointer returned is naturally aligned for the
> + * biggest basic data type we put into it. We have already accounted for this
> + * padding when sizing the buffer.
> + *
> + * However, this padding does not get written into the log, and hence we have to
> + * track the space used by the log vectors separately to prevent log space hangs
> + * due to inaccurate accounting (i.e. a leak) of the used log space through the
> + * CIL context ticket.
> + *
> + * We also add space for the xlog_op_header that describes this region in the
> + * log. This prepends the data region we return to the caller to copy their data
> + * into, so do all the static initialisation of the ophdr now. Because the ophdr
> + * is not 8 byte aligned, we have to be careful to ensure that we align the
> + * start of the buffer such that the region we return to the call is 8 byte
> + * aligned and packed against the tail of the ophdr.
> + */
> +void *
> +xlog_prepare_iovec(
> +	struct xfs_log_vec	*lv,
> +	struct xfs_log_iovec	**vecp,
> +	uint			type)
> +{
> +	struct xfs_log_iovec	*vec = *vecp;
> +	struct xlog_op_header	*oph;
> +	uint32_t		len;
> +	void			*buf;
> +
> +	if (vec) {
> +		ASSERT(vec - lv->lv_iovecp < lv->lv_niovecs);
> +		vec++;
> +	} else {
> +		vec = &lv->lv_iovecp[0];
> +	}
> +
> +	len = lv->lv_buf_len + sizeof(struct xlog_op_header);
> +	if (!IS_ALIGNED(len, sizeof(uint64_t))) {
> +		lv->lv_buf_len = round_up(len, sizeof(uint64_t)) -
> +					sizeof(struct xlog_op_header);
> +	}
> +
> +	vec->i_type = type;
> +	vec->i_addr = lv->lv_buf + lv->lv_buf_len;
> +
> +	oph = vec->i_addr;
> +	oph->oh_clientid = XFS_TRANSACTION;
> +	oph->oh_res2 = 0;
> +	oph->oh_flags = 0;
> +
> +	buf = vec->i_addr + sizeof(struct xlog_op_header);
> +	ASSERT(IS_ALIGNED((unsigned long)buf, sizeof(uint64_t)));

Why is it the buffer portion needs to be 8 byte aligned but not ->i_addr
itself?

> +
> +	*vecp = vec;
> +	return buf;
> +}
> +

Not worth changing now, but it's helpful to reduce the size of the patch
by separating out mechanical changes like moving functions around.

>  static void
>  xlog_grant_sub_space(
>  	struct xlog		*log,
...
> @@ -2149,18 +2205,7 @@ xlog_write_calc_vec_length(
>  			xlog_tic_add_region(ticket, vecp->i_len, vecp->i_type);
>  		}
>  	}
> -
> -	/* Don't account for regions with embedded ophdrs */
> -	if (optype && headers > 0) {
> -		headers--;
> -		if (optype & XLOG_START_TRANS) {
> -			ASSERT(headers >= 1);
> -			headers--;
> -		}
> -	}
> -
>  	ticket->t_res_num_ophdrs += headers;
> -	len += headers * sizeof(struct xlog_op_header);

Hm, this seems to suggest something was off wrt to ->t_res_num_ophdrs
prior to this change.  Granted this looks like it's just a debug field,
but the previous logic filtered out embedded op headers unconditionally
whereas now it looks like we go back to accounting them. Am I missing
something?

>  
>  	return len;
>  }
...
> @@ -2404,21 +2448,25 @@ xlog_write(
>  			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
>  
>  			/*
> -			 * The XLOG_START_TRANS has embedded ophdrs for the
> -			 * start record and transaction header. They will always
> -			 * be the first two regions in the lv chain. Commit and
> -			 * unmount records also have embedded ophdrs.
> +			 * Regions always have their ophdr at the start of the
> +			 * region, except for:
> +			 * - a transaction start which has a start record ophdr
> +			 *   before the first region ophdr; and
> +			 * - the previous region didn't fully fit into an iclog
> +			 *   so needs a continuation ophdr to prepend the region
> +			 *   in this new iclog.
>  			 */
> -			if (optype) {
> -				ophdr = reg->i_addr;
> -				if (index)
> -					optype &= ~XLOG_START_TRANS;
> -			} else {
> +			ophdr = reg->i_addr;
> +			if (optype && index) {
> +				optype &= ~XLOG_START_TRANS;
> +			} else if (partial_copy) {
>                                  ophdr = xlog_write_setup_ophdr(ptr, ticket);
>  				xlog_write_adv_cnt(&ptr, &len, &log_offset,
>  					   sizeof(struct xlog_op_header));
>  				added_ophdr = true;
>  			}

So in the partial_copy continuation case we're still stamping an ophdr
directly into the iclog. Otherwise we're processing/modifying flags and
whatnot on the ophdr already stamped at commit time in the log vector.
However, this is Ok because a relog would reformat the op header.

> +			ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
> +
>  			len += xlog_write_setup_copy(ticket, ophdr,
>  						     iclog->ic_size-log_offset,
>  						     reg->i_len,
> @@ -2436,20 +2484,11 @@ xlog_write(
>  				ophdr->oh_len = cpu_to_be32(copy_len -
>  						sizeof(struct xlog_op_header));
>  			}
> -			/*
> -			 * Copy region.
> -			 *
> -			 * Commit records just log an opheader, so
> -			 * we can have empty payloads with no data region to
> -			 * copy.  Hence we only copy the payload if the vector
> -			 * says it has data to copy.
> -			 */
> -			ASSERT(copy_len >= 0);
> -			if (copy_len > 0) {
> -				memcpy(ptr, reg->i_addr + copy_off, copy_len);
> -				xlog_write_adv_cnt(&ptr, &len, &log_offset,
> -						   copy_len);
> -			}
> +
> +			ASSERT(copy_len > 0);
> +			memcpy(ptr, reg->i_addr + copy_off, copy_len);
> +			xlog_write_adv_cnt(&ptr, &len, &log_offset, copy_len);
> +

I assume the checks in xlog_write_copy_finish() to require a minimum of
one op header worth of space in the iclog prevent doing a partial write
across an embedded op header boundary, but it would be nice to have an
assert or something that ensures that. For example, assert for something
like if a partial_copy occurs, partial_copy_len was at least the length
of an op header into the region.

>  			if (added_ophdr)
>  				copy_len += sizeof(struct xlog_op_header);
>  			record_cnt++;
...
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 0c81c13e2cf6..7a5e6bdb7876 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -181,13 +181,20 @@ xlog_cil_alloc_shadow_bufs(
>  		}
>  
>  		/*
> -		 * We 64-bit align the length of each iovec so that the start
> -		 * of the next one is naturally aligned.  We'll need to
> -		 * account for that slack space here. Then round nbytes up
> -		 * to 64-bit alignment so that the initial buffer alignment is
> -		 * easy to calculate and verify.
> +		 * We 64-bit align the length of each iovec so that the start of
> +		 * the next one is naturally aligned.  We'll need to account for
> +		 * that slack space here.
> +		 *

Related to my question above, I'm a little confused by the (preexisting)
comment. If the start of the next iovec is now the ophdr, doesn't that
mean the "start of the next one (iovec)" is technically no longer
naturally aligned?

Brian

> +		 * We also add the xlog_op_header to each region when
> +		 * formatting, but that's not accounted to the size of the item
> +		 * at this point. Hence we'll need an addition number of bytes
> +		 * for each vector to hold an opheader.
> +		 *
> +		 * Then round nbytes up to 64-bit alignment so that the initial
> +		 * buffer alignment is easy to calculate and verify.
>  		 */
> -		nbytes += niovecs * sizeof(uint64_t);
> +		nbytes += niovecs *
> +			(sizeof(uint64_t) + sizeof(struct xlog_op_header));
>  		nbytes = round_up(nbytes, sizeof(uint64_t));
>  
>  		/*
> @@ -433,11 +440,6 @@ xlog_cil_insert_items(
>  
>  	spin_lock(&cil->xc_cil_lock);
>  
> -	/* account for space used by new iovec headers  */
> -	iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t);
> -	len += iovhdr_res;
> -	ctx->nvecs += diff_iovecs;
> -
>  	/* attach the transaction to the CIL if it has any busy extents */
>  	if (!list_empty(&tp->t_busy))
>  		list_splice_init(&tp->t_busy, &ctx->busy_extents);
> @@ -469,6 +471,7 @@ xlog_cil_insert_items(
>  	}
>  	tp->t_ticket->t_curr_res -= len;
>  	ctx->space_used += len;
> +	ctx->nvecs += diff_iovecs;
>  
>  	/*
>  	 * If we've overrun the reservation, dump the tx details before we move
> -- 
> 2.28.0
>
Brian Foster March 16, 2021, 2:54 p.m. UTC | #5
On Wed, Mar 10, 2021 at 07:41:14PM -0800, Darrick J. Wong wrote:
> On Thu, Mar 12, 2021 at 02:29:32PM +1100, Dave Chinner wrote:
> > On Mon, Mar 08, 2021 at 06:21:34PM -0800, Darrick J. Wong wrote:
> > > On Fri, Mar 05, 2021 at 04:11:23PM +1100, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > Current xlog_write() adds op headers to the log manually for every
> > > > log item region that is in the vector passed to it. While
> > > > xlog_write() needs to stamp the transaction ID into the ophdr, we
> > > > already know it's length, flags, clientid, etc at CIL commit time.
> > > > 
> > > > This means the only time that xlog write really needs to format and
> > > > reserve space for a new ophdr is when a region is split across two
> > > > iclogs. Adding the opheader and accounting for it as part of the
> > > > normal formatted item region means we simplify the accounting
> > > > of space used by a transaction and we don't have to special case
> > > > reserving of space in for the ophdrs in xlog_write(). It also means
> > > > we can largely initialise the ophdr in transaction commit instead
> > > > of xlog_write, making the xlog_write formatting inner loop much
> > > > tighter.
> > > > 
> > > > xlog_prepare_iovec() is now too large to stay as an inline function,
> > > > so we move it out of line and into xfs_log.c.
> > > > 
> > > > Object sizes:
> > > > text	   data	    bss	    dec	    hex	filename
> > > > 1125934	 305951	    484	1432369	 15db31 fs/xfs/built-in.a.before
> > > > 1123360	 305951	    484	1429795	 15d123 fs/xfs/built-in.a.after
> > > > 
> > > > So the code is a roughly 2.5kB smaller with xlog_prepare_iovec() now
> > > > out of line, even though it grew in size itself.
> > > > 
> > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Sooo... if I understand this part of the patchset correctly, the goal
> > > here is to simplify and shorten the inner loop of xlog_write.
> > 
> > That's one of the goals. The other goal is to avoid needing to
> > account for log op headers separately in the high level CIL commit
> > code.
> > 
> > > Callers
> > > are now required to create their own log op headers at the start of the
> > > xfs_log_iovec chain in the xfs_log_vec, which means that the only time
> > > xlog_write has to create an ophdr is when we fill up the current iclog
> > > and must continue in a new one, because that's not something the callers
> > > should ever have to know about.  Correct?
> > 
> > Yes.
> > 
> > > If so,
> > > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > 
> > Thanks!
> > 
> > > It /really/ would have been nice to have kept these patches separated by
> > > major functional change area (i.e. separate series) instead of one
> > > gigantic 45-patch behemoth to intimidate the reviewers...
> > 
> > How is that any different from sending out 6-7 separate dependent
> > patchsets one immediately after another?  A change to one patch in
> > one series results in needing to rebase at least one patch in each
> > of the smaller patchsets, so I've still got to treat them all as one
> > big patchset in my development trees. Then I have to start
> > reposting patchsets just because another patchset was changed, and
> > that gets even more confusing trying to work out what patchset goes
> > with which version and so on. It's much easier for me to manage them
> > as a single patchset....
> 
> Well, ok, but it would have been nice for the cover letter to give
> /some/ hint as to what's changing in various subranges, e.g.
> 
> "Patches 32-36 reduce the xc_cil_lock critical sections,
>  Patches 37-41 create per-cpu cil structures and move log items and
>        vectors to use them,
>  Patches 42-44 are more cleanups,
>  Patch 45 documents the whole mess."
> 
> So I could see the outlines of where the 45 patches were going.
> 

Agreed. The purpose of separate patch series' is to facilitate upstream
review and patch processing. This series strikes me as not only separate
logical changes, but changes probably with different trajectories toward
merge as well. E.g., do we expect to land this whole series together at
the same time? That would seem... unwise.

If not (or if we don't otherwise want to unnecessarily delay the earlier
parts of the series until the whole percpu cil thing at the end is
worked out), then I think it probably makes sense to split off into
three or so subseries. The first can cover the log flush optimizations
and whatever one off fixes that are all probably close to merge-worthy,
the second can cover this op header formatting rework and associated
cleanups, and the last covers all of the percpu stuff at the end. If
there's a real concern over rebase churn, there's probably no huge need
to respin the entire collection on every review cycle of one of the
earlier subseries.

Brian

> --D
> 
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
>
Dave Chinner May 19, 2021, 3:18 a.m. UTC | #6
On Tue, Mar 16, 2021 at 10:53:01AM -0400, Brian Foster wrote:
> On Fri, Mar 05, 2021 at 04:11:23PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Current xlog_write() adds op headers to the log manually for every
> > log item region that is in the vector passed to it. While
> > xlog_write() needs to stamp the transaction ID into the ophdr, we
> > already know it's length, flags, clientid, etc at CIL commit time.
> > 
> > This means the only time that xlog write really needs to format and
> > reserve space for a new ophdr is when a region is split across two
> > iclogs. Adding the opheader and accounting for it as part of the
> > normal formatted item region means we simplify the accounting
> > of space used by a transaction and we don't have to special case
> > reserving of space in for the ophdrs in xlog_write(). It also means
> > we can largely initialise the ophdr in transaction commit instead
> > of xlog_write, making the xlog_write formatting inner loop much
> > tighter.
> > 
> > xlog_prepare_iovec() is now too large to stay as an inline function,
> > so we move it out of line and into xfs_log.c.
> > 
> > Object sizes:
> > text	   data	    bss	    dec	    hex	filename
> > 1125934	 305951	    484	1432369	 15db31 fs/xfs/built-in.a.before
> > 1123360	 305951	    484	1429795	 15d123 fs/xfs/built-in.a.after
> > 
> > So the code is a roughly 2.5kB smaller with xlog_prepare_iovec() now
> > out of line, even though it grew in size itself.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> 
> Looks mostly reasonable, a couple or so questions...
> 
> >  fs/xfs/xfs_log.c     | 115 +++++++++++++++++++++++++++++--------------
> >  fs/xfs/xfs_log.h     |  42 +++-------------
> >  fs/xfs/xfs_log_cil.c |  25 +++++-----
> >  3 files changed, 99 insertions(+), 83 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > index 429cb1e7cc67..98de45be80c0 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> > @@ -89,6 +89,62 @@ xlog_iclogs_empty(
> >  static int
> >  xfs_log_cover(struct xfs_mount *);
> >  
> > +/*
> > + * We need to make sure the buffer pointer returned is naturally aligned for the
> > + * biggest basic data type we put into it. We have already accounted for this
> > + * padding when sizing the buffer.
> > + *
> > + * However, this padding does not get written into the log, and hence we have to
> > + * track the space used by the log vectors separately to prevent log space hangs
> > + * due to inaccurate accounting (i.e. a leak) of the used log space through the
> > + * CIL context ticket.
> > + *
> > + * We also add space for the xlog_op_header that describes this region in the
> > + * log. This prepends the data region we return to the caller to copy their data
> > + * into, so do all the static initialisation of the ophdr now. Because the ophdr
> > + * is not 8 byte aligned, we have to be careful to ensure that we align the
> > + * start of the buffer such that the region we return to the call is 8 byte
> > + * aligned and packed against the tail of the ophdr.
> > + */
> > +void *
> > +xlog_prepare_iovec(
> > +	struct xfs_log_vec	*lv,
> > +	struct xfs_log_iovec	**vecp,
> > +	uint			type)
> > +{
> > +	struct xfs_log_iovec	*vec = *vecp;
> > +	struct xlog_op_header	*oph;
> > +	uint32_t		len;
> > +	void			*buf;
> > +
> > +	if (vec) {
> > +		ASSERT(vec - lv->lv_iovecp < lv->lv_niovecs);
> > +		vec++;
> > +	} else {
> > +		vec = &lv->lv_iovecp[0];
> > +	}
> > +
> > +	len = lv->lv_buf_len + sizeof(struct xlog_op_header);
> > +	if (!IS_ALIGNED(len, sizeof(uint64_t))) {
> > +		lv->lv_buf_len = round_up(len, sizeof(uint64_t)) -
> > +					sizeof(struct xlog_op_header);
> > +	}
> > +
> > +	vec->i_type = type;
> > +	vec->i_addr = lv->lv_buf + lv->lv_buf_len;
> > +
> > +	oph = vec->i_addr;
> > +	oph->oh_clientid = XFS_TRANSACTION;
> > +	oph->oh_res2 = 0;
> > +	oph->oh_flags = 0;
> > +
> > +	buf = vec->i_addr + sizeof(struct xlog_op_header);
> > +	ASSERT(IS_ALIGNED((unsigned long)buf, sizeof(uint64_t)));
> 
> Why is it the buffer portion needs to be 8 byte aligned but not ->i_addr
> itself?

Same reason the returned buffer has always been 8 byte aligned -
because we cast ixit  to on-disk structures that assume 8 byte
memory alignment of the structure. e.g. when formatting inodes,
intents, etc.

We don't care about the ophdr itself, because it only contains
4 byte aligned variables (32 bit) and the log only requires them to
be 4 byte alignment of regions (asserts in xlog_write() that it
casts to ophdrs to update them...

> >  static void
> >  xlog_grant_sub_space(
> >  	struct xlog		*log,
> ...
> > @@ -2149,18 +2205,7 @@ xlog_write_calc_vec_length(
> >  			xlog_tic_add_region(ticket, vecp->i_len, vecp->i_type);
> >  		}
> >  	}
> > -
> > -	/* Don't account for regions with embedded ophdrs */
> > -	if (optype && headers > 0) {
> > -		headers--;
> > -		if (optype & XLOG_START_TRANS) {
> > -			ASSERT(headers >= 1);
> > -			headers--;
> > -		}
> > -	}
> > -
> >  	ticket->t_res_num_ophdrs += headers;
> > -	len += headers * sizeof(struct xlog_op_header);
> 
> Hm, this seems to suggest something was off wrt to ->t_res_num_ophdrs
> prior to this change.  Granted this looks like it's just a debug field,
> but the previous logic filtered out embedded op headers unconditionally
> whereas now it looks like we go back to accounting them. Am I missing
> something?

Nothign wrong here with the old or the new code. The old code *adds*
the unaccounted ophdr space for each region here, but now everything
has an embedded opheader it's accounted directly to the region size.
Hence the size of the regions we summed in the above loop already
accounts for the ophdr space and we no longer need to account for
opheaders individually here.

> > @@ -2404,21 +2448,25 @@ xlog_write(
> >  			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
> >  
> >  			/*
> > -			 * The XLOG_START_TRANS has embedded ophdrs for the
> > -			 * start record and transaction header. They will always
> > -			 * be the first two regions in the lv chain. Commit and
> > -			 * unmount records also have embedded ophdrs.
> > +			 * Regions always have their ophdr at the start of the
> > +			 * region, except for:
> > +			 * - a transaction start which has a start record ophdr
> > +			 *   before the first region ophdr; and
> > +			 * - the previous region didn't fully fit into an iclog
> > +			 *   so needs a continuation ophdr to prepend the region
> > +			 *   in this new iclog.
> >  			 */
> > -			if (optype) {
> > -				ophdr = reg->i_addr;
> > -				if (index)
> > -					optype &= ~XLOG_START_TRANS;
> > -			} else {
> > +			ophdr = reg->i_addr;
> > +			if (optype && index) {
> > +				optype &= ~XLOG_START_TRANS;
> > +			} else if (partial_copy) {
> >                                  ophdr = xlog_write_setup_ophdr(ptr, ticket);
> >  				xlog_write_adv_cnt(&ptr, &len, &log_offset,
> >  					   sizeof(struct xlog_op_header));
> >  				added_ophdr = true;
> >  			}
> 
> So in the partial_copy continuation case we're still stamping an ophdr
> directly into the iclog. Otherwise we're processing/modifying flags and
> whatnot on the ophdr already stamped at commit time in the log vector.
> However, this is Ok because a relog would reformat the op header.

Right. We don't know ahead of time when a region is going to be
split across two iclogs, so we still have to handle the case of
adding a ophdr for a split region. This gets much simpler later on
in the series as the xlog_write() code is factored.

> > +			ASSERT(copy_len > 0);
> > +			memcpy(ptr, reg->i_addr + copy_off, copy_len);
> > +			xlog_write_adv_cnt(&ptr, &len, &log_offset, copy_len);
> > +
> 
> I assume the checks in xlog_write_copy_finish() to require a minimum of
> one op header worth of space in the iclog prevent doing a partial write
> across an embedded op header boundary, but it would be nice to have an
> assert or something that ensures that. For example, assert for something
> like if a partial_copy occurs, partial_copy_len was at least the length
> of an op header into the region.

This code is completely reworked later in the series and these
whacky corner cases largely go away. There's not much point in
making this "robust" because of this...

> 
> >  			if (added_ophdr)
> >  				copy_len += sizeof(struct xlog_op_header);
> >  			record_cnt++;
> ...
> > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > index 0c81c13e2cf6..7a5e6bdb7876 100644
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> > @@ -181,13 +181,20 @@ xlog_cil_alloc_shadow_bufs(
> >  		}
> >  
> >  		/*
> > -		 * We 64-bit align the length of each iovec so that the start
> > -		 * of the next one is naturally aligned.  We'll need to
> > -		 * account for that slack space here. Then round nbytes up
> > -		 * to 64-bit alignment so that the initial buffer alignment is
> > -		 * easy to calculate and verify.
> > +		 * We 64-bit align the length of each iovec so that the start of
> > +		 * the next one is naturally aligned.  We'll need to account for
> > +		 * that slack space here.
> > +		 *
> 
> Related to my question above, I'm a little confused by the (preexisting)
> comment. If the start of the next iovec is now the ophdr, doesn't that
> mean the "start of the next one (iovec)" is technically no longer
> naturally aligned?

The ophdr doesn't need to be 64 bit aligned, just the
buffer that is returned from xlog_prepare_iovec(). That means we
still have to guarantee space in the buffer for that to be rounded
to 64 bits if it's not already aligned. All of the iovecs might
require padding to align them, so the presence of the ophdr doesn't
really change anything at all here...

Cheers,

Dave.
diff mbox series

Patch

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 429cb1e7cc67..98de45be80c0 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -89,6 +89,62 @@  xlog_iclogs_empty(
 static int
 xfs_log_cover(struct xfs_mount *);
 
+/*
+ * We need to make sure the buffer pointer returned is naturally aligned for the
+ * biggest basic data type we put into it. We have already accounted for this
+ * padding when sizing the buffer.
+ *
+ * However, this padding does not get written into the log, and hence we have to
+ * track the space used by the log vectors separately to prevent log space hangs
+ * due to inaccurate accounting (i.e. a leak) of the used log space through the
+ * CIL context ticket.
+ *
+ * We also add space for the xlog_op_header that describes this region in the
+ * log. This prepends the data region we return to the caller to copy their data
+ * into, so do all the static initialisation of the ophdr now. Because the ophdr
+ * is not 8 byte aligned, we have to be careful to ensure that we align the
+ * start of the buffer such that the region we return to the call is 8 byte
+ * aligned and packed against the tail of the ophdr.
+ */
+void *
+xlog_prepare_iovec(
+	struct xfs_log_vec	*lv,
+	struct xfs_log_iovec	**vecp,
+	uint			type)
+{
+	struct xfs_log_iovec	*vec = *vecp;
+	struct xlog_op_header	*oph;
+	uint32_t		len;
+	void			*buf;
+
+	if (vec) {
+		ASSERT(vec - lv->lv_iovecp < lv->lv_niovecs);
+		vec++;
+	} else {
+		vec = &lv->lv_iovecp[0];
+	}
+
+	len = lv->lv_buf_len + sizeof(struct xlog_op_header);
+	if (!IS_ALIGNED(len, sizeof(uint64_t))) {
+		lv->lv_buf_len = round_up(len, sizeof(uint64_t)) -
+					sizeof(struct xlog_op_header);
+	}
+
+	vec->i_type = type;
+	vec->i_addr = lv->lv_buf + lv->lv_buf_len;
+
+	oph = vec->i_addr;
+	oph->oh_clientid = XFS_TRANSACTION;
+	oph->oh_res2 = 0;
+	oph->oh_flags = 0;
+
+	buf = vec->i_addr + sizeof(struct xlog_op_header);
+	ASSERT(IS_ALIGNED((unsigned long)buf, sizeof(uint64_t)));
+
+	*vecp = vec;
+	return buf;
+}
+
 static void
 xlog_grant_sub_space(
 	struct xlog		*log,
@@ -2120,9 +2176,9 @@  xlog_print_trans(
 }
 
 /*
- * Calculate the potential space needed by the log vector. If this is a start
- * transaction, the caller has already accounted for both opheaders in the start
- * transaction, so we don't need to account for them here.
+ * Calculate the potential space needed by the log vector. All regions contain
+ * their own opheaders and they are accounted for in region space so we don't
+ * need to add them to the vector length here.
  */
 static int
 xlog_write_calc_vec_length(
@@ -2149,18 +2205,7 @@  xlog_write_calc_vec_length(
 			xlog_tic_add_region(ticket, vecp->i_len, vecp->i_type);
 		}
 	}
-
-	/* Don't account for regions with embedded ophdrs */
-	if (optype && headers > 0) {
-		headers--;
-		if (optype & XLOG_START_TRANS) {
-			ASSERT(headers >= 1);
-			headers--;
-		}
-	}
-
 	ticket->t_res_num_ophdrs += headers;
-	len += headers * sizeof(struct xlog_op_header);
 
 	return len;
 }
@@ -2170,7 +2215,6 @@  xlog_write_setup_ophdr(
 	struct xlog_op_header	*ophdr,
 	struct xlog_ticket	*ticket)
 {
-	ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
 	ophdr->oh_clientid = XFS_TRANSACTION;
 	ophdr->oh_res2 = 0;
 	ophdr->oh_flags = 0;
@@ -2404,21 +2448,25 @@  xlog_write(
 			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
 
 			/*
-			 * The XLOG_START_TRANS has embedded ophdrs for the
-			 * start record and transaction header. They will always
-			 * be the first two regions in the lv chain. Commit and
-			 * unmount records also have embedded ophdrs.
+			 * Regions always have their ophdr at the start of the
+			 * region, except for:
+			 * - a transaction start which has a start record ophdr
+			 *   before the first region ophdr; and
+			 * - the previous region didn't fully fit into an iclog
+			 *   so needs a continuation ophdr to prepend the region
+			 *   in this new iclog.
 			 */
-			if (optype) {
-				ophdr = reg->i_addr;
-				if (index)
-					optype &= ~XLOG_START_TRANS;
-			} else {
+			ophdr = reg->i_addr;
+			if (optype && index) {
+				optype &= ~XLOG_START_TRANS;
+			} else if (partial_copy) {
                                 ophdr = xlog_write_setup_ophdr(ptr, ticket);
 				xlog_write_adv_cnt(&ptr, &len, &log_offset,
 					   sizeof(struct xlog_op_header));
 				added_ophdr = true;
 			}
+			ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
+
 			len += xlog_write_setup_copy(ticket, ophdr,
 						     iclog->ic_size-log_offset,
 						     reg->i_len,
@@ -2436,20 +2484,11 @@  xlog_write(
 				ophdr->oh_len = cpu_to_be32(copy_len -
 						sizeof(struct xlog_op_header));
 			}
-			/*
-			 * Copy region.
-			 *
-			 * Commit records just log an opheader, so
-			 * we can have empty payloads with no data region to
-			 * copy.  Hence we only copy the payload if the vector
-			 * says it has data to copy.
-			 */
-			ASSERT(copy_len >= 0);
-			if (copy_len > 0) {
-				memcpy(ptr, reg->i_addr + copy_off, copy_len);
-				xlog_write_adv_cnt(&ptr, &len, &log_offset,
-						   copy_len);
-			}
+
+			ASSERT(copy_len > 0);
+			memcpy(ptr, reg->i_addr + copy_off, copy_len);
+			xlog_write_adv_cnt(&ptr, &len, &log_offset, copy_len);
+
 			if (added_ophdr)
 				copy_len += sizeof(struct xlog_op_header);
 			record_cnt++;
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index 1ca4f2edbdaf..af54ea3f8c90 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -21,44 +21,18 @@  struct xfs_log_vec {
 
 #define XFS_LOG_VEC_ORDERED	(-1)
 
-/*
- * We need to make sure the buffer pointer returned is naturally aligned for the
- * biggest basic data type we put into it. We have already accounted for this
- * padding when sizing the buffer.
- *
- * However, this padding does not get written into the log, and hence we have to
- * track the space used by the log vectors separately to prevent log space hangs
- * due to inaccurate accounting (i.e. a leak) of the used log space through the
- * CIL context ticket.
- */
-static inline void *
-xlog_prepare_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp,
-		uint type)
-{
-	struct xfs_log_iovec *vec = *vecp;
-
-	if (vec) {
-		ASSERT(vec - lv->lv_iovecp < lv->lv_niovecs);
-		vec++;
-	} else {
-		vec = &lv->lv_iovecp[0];
-	}
-
-	if (!IS_ALIGNED(lv->lv_buf_len, sizeof(uint64_t)))
-		lv->lv_buf_len = round_up(lv->lv_buf_len, sizeof(uint64_t));
-
-	vec->i_type = type;
-	vec->i_addr = lv->lv_buf + lv->lv_buf_len;
-
-	ASSERT(IS_ALIGNED((unsigned long)vec->i_addr, sizeof(uint64_t)));
-
-	*vecp = vec;
-	return vec->i_addr;
-}
+void *xlog_prepare_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp,
+		uint type);
 
 static inline void
 xlog_finish_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec *vec, int len)
 {
+	struct xlog_op_header	*oph = vec->i_addr;
+
+	/* opheader tracks payload length, logvec tracks region length */
+	oph->oh_len = len;
+
+	len += sizeof(struct xlog_op_header);
 	lv->lv_buf_len += len;
 	lv->lv_bytes += len;
 	vec->i_len = len;
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 0c81c13e2cf6..7a5e6bdb7876 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -181,13 +181,20 @@  xlog_cil_alloc_shadow_bufs(
 		}
 
 		/*
-		 * We 64-bit align the length of each iovec so that the start
-		 * of the next one is naturally aligned.  We'll need to
-		 * account for that slack space here. Then round nbytes up
-		 * to 64-bit alignment so that the initial buffer alignment is
-		 * easy to calculate and verify.
+		 * We 64-bit align the length of each iovec so that the start of
+		 * the next one is naturally aligned.  We'll need to account for
+		 * that slack space here.
+		 *
+		 * We also add the xlog_op_header to each region when
+		 * formatting, but that's not accounted to the size of the item
+		 * at this point. Hence we'll need an addition number of bytes
+		 * for each vector to hold an opheader.
+		 *
+		 * Then round nbytes up to 64-bit alignment so that the initial
+		 * buffer alignment is easy to calculate and verify.
 		 */
-		nbytes += niovecs * sizeof(uint64_t);
+		nbytes += niovecs *
+			(sizeof(uint64_t) + sizeof(struct xlog_op_header));
 		nbytes = round_up(nbytes, sizeof(uint64_t));
 
 		/*
@@ -433,11 +440,6 @@  xlog_cil_insert_items(
 
 	spin_lock(&cil->xc_cil_lock);
 
-	/* account for space used by new iovec headers  */
-	iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t);
-	len += iovhdr_res;
-	ctx->nvecs += diff_iovecs;
-
 	/* attach the transaction to the CIL if it has any busy extents */
 	if (!list_empty(&tp->t_busy))
 		list_splice_init(&tp->t_busy, &ctx->busy_extents);
@@ -469,6 +471,7 @@  xlog_cil_insert_items(
 	}
 	tp->t_ticket->t_curr_res -= len;
 	ctx->space_used += len;
+	ctx->nvecs += diff_iovecs;
 
 	/*
 	 * If we've overrun the reservation, dump the tx details before we move