diff mbox series

[RFC] block: trace: add block alignment information

Message ID 20240912-add-blkalgn-block-trace-v1-1-335dd6eea557@samsung.com (mailing list archive)
State New, archived
Headers show
Series [RFC] block: trace: add block alignment information | expand

Commit Message

Daniel Gomez via B4 Relay Sept. 12, 2024, 8:48 p.m. UTC
From: Daniel Gomez <da.gomez@samsung.com>

Report block alignment in terms of LBA and size during block tracing for
block_rq. Calculate alignment only for read/writes where the length is
greater than 0. Otherwise, report 0 to indicate no alignment calculated.

Suggested-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
---
This patch introduces LBA and size alignment information for
the block_rq tracepoints (block_rq{insert, issue, merge} and
block_{io_start, io_done}).

The idea of reporting alignment in a tracepoint was first suggested in
this thread [1] by Dave Chinner. Additionally, an eBPF-based equivalent
tracing tool [2] was developed and used during LBS development, as
mentioned in the patch series [3] and in [1].

With this addition, users can check block alignment directly through the
block layer tracepoints without needing any additional tools.

In case we have a use case, this can be extended to other tracepoints,
such as complete and error.

Another potential enhancement could be the integration of this
information into blktrace. Would that be a feasible option to consider?

[1] https://lore.kernel.org/all/ZdvXAn1Q%2F+QX5sPQ@dread.disaster.area/
[2] blkalgn tool written in eBPF/bcc:
https://github.com/dkruces/bcc/tree/lbs
[3] https://lore.kernel.org/all/20240822135018.1931258-1-kernel@pankajraghav.com/
---
 block/blk-mq.c               | 29 +++++++++++++++++++++++++++++
 include/linux/blk-mq.h       | 11 +++++++++++
 include/linux/blkdev.h       |  6 ++++++
 include/trace/events/block.h |  7 +++++--
 4 files changed, 51 insertions(+), 2 deletions(-)


---
base-commit: 57f962b956f1d116cd64d5c406776c4975de549d
change-id: 20240912-add-blkalgn-block-trace-71e8ab6708f1

Best regards,

Comments

John Garry Sept. 13, 2024, 8:59 a.m. UTC | #1
On 12/09/2024 21:48, Daniel Gomez via B4 Relay wrote:
> From: Daniel Gomez <da.gomez@samsung.com>
> 
> Report block alignment in terms of LBA and size during block tracing for
> block_rq. Calculate alignment only for read/writes where the length is
> greater than 0. Otherwise, report 0 to indicate no alignment calculated.
> 
> Suggested-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
> ---
> This patch introduces LBA and size alignment information for
> the block_rq tracepoints (block_rq{insert, issue, merge} and
> block_{io_start, io_done}).

eh, shouldn't this belong in the description of the patch?

And I still don't know what we mean by alignment in this context.

 From looking at the code, it seems to be the max detected block size 
granularity. For example, for a 64KB write at a 32KB offset, that would 
give a 32KB "alignment". But a 64KB write at a 64KB offset would be 
"64KB" alignment. While a 8KB write at 64KB offset would be 8KB 
"alignment". And a 24KB write at offset 0 is a 8KB "alignment", as 8KB 
is the lowest power-of-2 which is divisible into 24KB. Is this a correct 
understanding?

> 
> The idea of reporting alignment in a tracepoint was first suggested in
> this thread [1] by Dave Chinner. Additionally, an eBPF-based equivalent
> tracing tool [2] was developed and used during LBS development, as
> mentioned in the patch series [3] and in [1].
> 
> With this addition, users can check block alignment directly through the
> block layer tracepoints without needing any additional tools.
> 
> In case we have a use case, this can be extended to other tracepoints,
> such as complete and error.
> 
> Another potential enhancement could be the integration of this
> information into blktrace. Would that be a feasible option to consider?
> 
> [1] https://urldefense.com/v3/__https://lore.kernel.org/all/ZdvXAn1Q*2F*QX5sPQ@dread.disaster.area/__;JSs!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXtYsRb3aY$
> [2] blkalgn tool written in eBPF/bcc:
> https://urldefense.com/v3/__https://github.com/dkruces/bcc/tree/lbs__;!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXthE7cfng$
> [3] https://urldefense.com/v3/__https://lore.kernel.org/all/20240822135018.1931258-1-kernel@pankajraghav.com/__;!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXtqQ5uwAE$
> ---
>   block/blk-mq.c               | 29 +++++++++++++++++++++++++++++
>   include/linux/blk-mq.h       | 11 +++++++++++
>   include/linux/blkdev.h       |  6 ++++++
>   include/trace/events/block.h |  7 +++++--
>   4 files changed, 51 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 831c5cf5d874..714452bc236b 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -4920,6 +4920,35 @@ int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
>   }
>   EXPORT_SYMBOL_GPL(blk_rq_poll);
>   
> +u32 __blk_rq_lba_algn(struct request *req)

why use "algn", and not "align"?

"algn" is not a natural abbreviation of "alignment".

And why can't userspace figure this out? All the info is available 
already, right?

> +{
> +	u32 lbs = queue_logical_block_size(req->q);
> +	u32 lba_shift = ilog2(lbs);
> +	u32 lba = req->__sector >> (lba_shift - SECTOR_SHIFT);

req->__sector is a u64 - is it safe to store the shifted value in a u32?

> +	u32 len = req->__data_len;
> +	u32 algn_len = len;
> +	u32 algn_lba = len / lbs;
> +	u32 alignment = lbs;
> +
> +	if (is_power_of_2(len) &&
> +	    blk_rq_lba_aligned(len, algn_len, lba, algn_lba))
> +		return len;
> +
> +	algn_len = lbs << 1U;
> +	algn_lba = algn_len / lbs;
> +
> +	while (algn_len < len) {
> +		if (!blk_rq_lba_aligned(len, algn_len, lba, algn_lba))
> +			break;
> +
> +		alignment = algn_len;
> +		algn_len = algn_len << 1U;
> +		algn_lba = algn_len / lbs;
> +	}
> +
> +	return alignment;
> +}
> +
>   unsigned int blk_mq_rq_cpu(struct request *rq)
>   {
>   	return rq->mq_ctx->cpu;
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 8d304b1d16b1..02959fbd5e28 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -740,6 +740,17 @@ void blk_mq_free_request(struct request *rq);
>   int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
>   		unsigned int poll_flags);
>   
> +/* The alignment of the block in terms of LBA and size */
> +u32 __blk_rq_lba_algn(struct request *req);
> +static inline u32 blk_rq_lba_algn(struct request *req)
> +{
> +	if ((req_op(req) != REQ_OP_WRITE) && (req_op(req) != REQ_OP_READ) &&
> +	    !(req->__data_len))
> +		return 0;
> +
> +	return __blk_rq_lba_algn(req);
> +}
> +
>   bool blk_mq_queue_inflight(struct request_queue *q);
>   
>   enum {
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index bf1aa951fda2..28557987daa8 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1433,6 +1433,12 @@ static inline int blk_rq_aligned(struct request_queue *q, unsigned long addr,
>   	return !(addr & alignment) && !(len & alignment);
>   }
>   
> +static inline bool blk_rq_lba_aligned(u32 len, u32 algn_len, u32 lba,
> +				      u32 algn_lba)
> +{
> +	return !(len % algn_len) && !(lba % algn_lba);
> +}

why in blkdev.h? It is only used in block/blk-mq.c

> +
>   /* assumes size > 256 */
>   static inline unsigned int blksize_bits(unsigned int size)
>   {
> diff --git a/include/trace/events/block.h b/include/trace/events/block.h
> index 1527d5d45e01..ba3764214dc7 100644
> --- a/include/trace/events/block.h
> +++ b/include/trace/events/block.h
> @@ -202,6 +202,7 @@ DECLARE_EVENT_CLASS(block_rq,
>   		__array(  char,		rwbs,	RWBS_LEN	)
>   		__array(  char,         comm,   TASK_COMM_LEN   )
>   		__dynamic_array( char,	cmd,	1		)
> +		__field(  unsigned int,	algn			)
>   	),
>   
>   	TP_fast_assign(
> @@ -210,20 +211,22 @@ DECLARE_EVENT_CLASS(block_rq,
>   		__entry->nr_sector = blk_rq_trace_nr_sectors(rq);
>   		__entry->bytes     = blk_rq_bytes(rq);
>   		__entry->ioprio	   = rq->ioprio;
> +		__entry->algn      = blk_rq_lba_algn(rq);
>   
>   		blk_fill_rwbs(__entry->rwbs, rq->cmd_flags);
>   		__get_str(cmd)[0] = '\0';
>   		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
>   	),
>   
> -	TP_printk("%d,%d %s %u (%s) %llu + %u %s,%u,%u [%s]",
> +	TP_printk("%d,%d %s %u (%s) %llu + %u %s,%u,%u |%u| [%s]",
>   		  MAJOR(__entry->dev), MINOR(__entry->dev),
>   		  __entry->rwbs, __entry->bytes, __get_str(cmd),
>   		  (unsigned long long)__entry->sector, __entry->nr_sector,
>   		  __print_symbolic(IOPRIO_PRIO_CLASS(__entry->ioprio),
>   				   IOPRIO_CLASS_STRINGS),
>   		  IOPRIO_PRIO_HINT(__entry->ioprio),
> -		  IOPRIO_PRIO_LEVEL(__entry->ioprio), __entry->comm)
> +		  IOPRIO_PRIO_LEVEL(__entry->ioprio), __entry->algn,
> +		  __entry->comm)
>   );
>   
>   /**
> 
> ---
> base-commit: 57f962b956f1d116cd64d5c406776c4975de549d
> change-id: 20240912-add-blkalgn-block-trace-71e8ab6708f1
> 
> Best regards,
Daniel Gomez Sept. 13, 2024, 11:26 a.m. UTC | #2
On Fri, Sep 13, 2024 at 09:59:08AM +0100, John Garry wrote:
> On 12/09/2024 21:48, Daniel Gomez via B4 Relay wrote:
> > From: Daniel Gomez <da.gomez@samsung.com>
> > 
> > Report block alignment in terms of LBA and size during block tracing for
> > block_rq. Calculate alignment only for read/writes where the length is
> > greater than 0. Otherwise, report 0 to indicate no alignment calculated.
> > 
> > Suggested-by: Dave Chinner <dchinner@redhat.com>
> > Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
> > ---
> > This patch introduces LBA and size alignment information for
> > the block_rq tracepoints (block_rq{insert, issue, merge} and
> > block_{io_start, io_done}).
> 
> eh, shouldn't this belong in the description of the patch?

Yes. I'll move this to the commit message.

> 
> And I still don't know what we mean by alignment in this context.
> 
> From looking at the code, it seems to be the max detected block size
> granularity. For example, for a 64KB write at a 32KB offset, that would give
> a 32KB "alignment". But a 64KB write at a 64KB offset would be "64KB"
> alignment. While a 8KB write at 64KB offset would be 8KB "alignment". And a
> 24KB write at offset 0 is a 8KB "alignment", as 8KB is the lowest power-of-2
> which is divisible into 24KB. Is this a correct understanding?

That is correct. Do you think adding examples like yours can help to explain
this better? Below the same examples using fio with the trace output:


	sudo fio -bs=64k -size=64k -offset=32k -rw=write \
	-direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync
	
	sudo fio -bs=64k -size=64k -offset=64k -rw=write \
	-direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync
	
	sudo fio -bs=8k -size=8k -offset=64k -rw=write \
	-direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync
	
	sudo fio -bs=24k -size=24k -offset=0k -rw=write \
	-direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync

	fio-789     [000] .....  4455.092003: block_rq_issue: 259,0 WS 65536 () 64 + 128 none,0,0 |32768| [fio]
	fio-801     [000] .....  4455.474826: block_rq_issue: 259,0 WS 65536 () 128 + 128 none,0,0 |65536| [fio]
	fio-813     [000] .....  4455.855143: block_rq_issue: 259,0 WS 8192 () 128 + 16 none,0,0 |8192| [fio]
	fio-825     [000] .....  4456.235595: block_rq_issue: 259,0 WS 24576 () 0 + 48 none,0,0 |8192| [fio]


Also, the motivation behind this is explained in the LBS RFC [1] and I should
have included it here for context. I hope [1] and my description below helps to
explain what alignment means and why is needed:

[1] Subject: [RFC 00/23] Enable block size > page size in XFS
https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@pankajraghav.com/

Tracing alignment information is important for high-capacity and QLC SSDs with
Indirection Units greater than 4 KiB. These devices are still 4 KiB in Logical
Block Size (LBS) but because they work at higher IUs, unaligned writes to the IU
boundaries can imply in a read-modify-write (RMW).

The graph below is a representation of the device IU vs an I/O block aligned/
unaligned.

    |--- IU Boundaries ----|      |-PS-|
a)  [====][====][====][====][····][····][····]--
    |                      |
b)  [····][====][====][====][====][····][····]--
    |                      |
c)  [====][====][====][====][····][====][====]--
    |                      |
d)  [····][····][====][====][····][····][····]--
    |                      |
LBA 0                      4
  
    Key:
    [====] = I/O Block
    [····] = Memory in Page Size (PS) chunks

a) I/O matches IU boundaries (LBA and block size). I/O is aligned.
b) The size of the I/O matches the IU size but the I/O is not aligned to the
IU boundaries. I/O is unaligned.
c) I/O does not match in either size or LBA. I/O is unaligned.

> 
> > 
> > The idea of reporting alignment in a tracepoint was first suggested in
> > this thread [1] by Dave Chinner. Additionally, an eBPF-based equivalent
> > tracing tool [2] was developed and used during LBS development, as
> > mentioned in the patch series [3] and in [1].
> > 
> > With this addition, users can check block alignment directly through the
> > block layer tracepoints without needing any additional tools.
> > 
> > In case we have a use case, this can be extended to other tracepoints,
> > such as complete and error.
> > 
> > Another potential enhancement could be the integration of this
> > information into blktrace. Would that be a feasible option to consider?
> > 
> > [1] https://urldefense.com/v3/__https://lore.kernel.org/all/ZdvXAn1Q*2F*QX5sPQ@dread.disaster.area/__;JSs!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXtYsRb3aY$
> > [2] blkalgn tool written in eBPF/bcc:
> > https://urldefense.com/v3/__https://github.com/dkruces/bcc/tree/lbs__;!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXthE7cfng$
> > [3] https://urldefense.com/v3/__https://lore.kernel.org/all/20240822135018.1931258-1-kernel@pankajraghav.com/__;!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXtqQ5uwAE$
> > ---
> >   block/blk-mq.c               | 29 +++++++++++++++++++++++++++++
> >   include/linux/blk-mq.h       | 11 +++++++++++
> >   include/linux/blkdev.h       |  6 ++++++
> >   include/trace/events/block.h |  7 +++++--
> >   4 files changed, 51 insertions(+), 2 deletions(-)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 831c5cf5d874..714452bc236b 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -4920,6 +4920,35 @@ int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
> >   }
> >   EXPORT_SYMBOL_GPL(blk_rq_poll);
> > +u32 __blk_rq_lba_algn(struct request *req)
> 
> why use "algn", and not "align"?
> 
> "algn" is not a natural abbreviation of "alignment".

That's okay with me, changing the var name to a more natural abbreviation.

> 
> And why can't userspace figure this out? All the info is available already,
> right?

We are interested in the block alignment (LBA and size) at block device driver
level, not userspace level. That is, everything that is going out from the block
layer. Using the block tracing points currently available makes it block-driver
generic.

> 
> > +{
> > +	u32 lbs = queue_logical_block_size(req->q);
> > +	u32 lba_shift = ilog2(lbs);
> > +	u32 lba = req->__sector >> (lba_shift - SECTOR_SHIFT);
> 
> req->__sector is a u64 - is it safe to store the shifted value in a u32?
> 
> > +	u32 len = req->__data_len;
> > +	u32 algn_len = len;
> > +	u32 algn_lba = len / lbs;
> > +	u32 alignment = lbs;
> > +
> > +	if (is_power_of_2(len) &&
> > +	    blk_rq_lba_aligned(len, algn_len, lba, algn_lba))
> > +		return len;
> > +
> > +	algn_len = lbs << 1U;
> > +	algn_lba = algn_len / lbs;
> > +
> > +	while (algn_len < len) {
> > +		if (!blk_rq_lba_aligned(len, algn_len, lba, algn_lba))
> > +			break;
> > +
> > +		alignment = algn_len;
> > +		algn_len = algn_len << 1U;
> > +		algn_lba = algn_len / lbs;
> > +	}
> > +
> > +	return alignment;
> > +}
> > +
> >   unsigned int blk_mq_rq_cpu(struct request *rq)
> >   {
> >   	return rq->mq_ctx->cpu;
> > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> > index 8d304b1d16b1..02959fbd5e28 100644
> > --- a/include/linux/blk-mq.h
> > +++ b/include/linux/blk-mq.h
> > @@ -740,6 +740,17 @@ void blk_mq_free_request(struct request *rq);
> >   int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
> >   		unsigned int poll_flags);
> > +/* The alignment of the block in terms of LBA and size */
> > +u32 __blk_rq_lba_algn(struct request *req);
> > +static inline u32 blk_rq_lba_algn(struct request *req)
> > +{
> > +	if ((req_op(req) != REQ_OP_WRITE) && (req_op(req) != REQ_OP_READ) &&
> > +	    !(req->__data_len))
> > +		return 0;
> > +
> > +	return __blk_rq_lba_algn(req);
> > +}
> > +
> >   bool blk_mq_queue_inflight(struct request_queue *q);
> >   enum {
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index bf1aa951fda2..28557987daa8 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -1433,6 +1433,12 @@ static inline int blk_rq_aligned(struct request_queue *q, unsigned long addr,
> >   	return !(addr & alignment) && !(len & alignment);
> >   }
> > +static inline bool blk_rq_lba_aligned(u32 len, u32 algn_len, u32 lba,
> > +				      u32 algn_lba)
> > +{
> > +	return !(len % algn_len) && !(lba % algn_lba);
> > +}
> 
> why in blkdev.h? It is only used in block/blk-mq.c

There's a blk_rq_aligned in blkdev.h already and used in blk-map.c that checks
for dma and pad alignment. I can move it to blk-mq.h if that fits better.


> 
> > +
> >   /* assumes size > 256 */
> >   static inline unsigned int blksize_bits(unsigned int size)
> >   {
> > diff --git a/include/trace/events/block.h b/include/trace/events/block.h
> > index 1527d5d45e01..ba3764214dc7 100644
> > --- a/include/trace/events/block.h
> > +++ b/include/trace/events/block.h
> > @@ -202,6 +202,7 @@ DECLARE_EVENT_CLASS(block_rq,
> >   		__array(  char,		rwbs,	RWBS_LEN	)
> >   		__array(  char,         comm,   TASK_COMM_LEN   )
> >   		__dynamic_array( char,	cmd,	1		)
> > +		__field(  unsigned int,	algn			)
> >   	),
> >   	TP_fast_assign(
> > @@ -210,20 +211,22 @@ DECLARE_EVENT_CLASS(block_rq,
> >   		__entry->nr_sector = blk_rq_trace_nr_sectors(rq);
> >   		__entry->bytes     = blk_rq_bytes(rq);
> >   		__entry->ioprio	   = rq->ioprio;
> > +		__entry->algn      = blk_rq_lba_algn(rq);
> >   		blk_fill_rwbs(__entry->rwbs, rq->cmd_flags);
> >   		__get_str(cmd)[0] = '\0';
> >   		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
> >   	),
> > -	TP_printk("%d,%d %s %u (%s) %llu + %u %s,%u,%u [%s]",
> > +	TP_printk("%d,%d %s %u (%s) %llu + %u %s,%u,%u |%u| [%s]",
> >   		  MAJOR(__entry->dev), MINOR(__entry->dev),
> >   		  __entry->rwbs, __entry->bytes, __get_str(cmd),
> >   		  (unsigned long long)__entry->sector, __entry->nr_sector,
> >   		  __print_symbolic(IOPRIO_PRIO_CLASS(__entry->ioprio),
> >   				   IOPRIO_CLASS_STRINGS),
> >   		  IOPRIO_PRIO_HINT(__entry->ioprio),
> > -		  IOPRIO_PRIO_LEVEL(__entry->ioprio), __entry->comm)
> > +		  IOPRIO_PRIO_LEVEL(__entry->ioprio), __entry->algn,
> > +		  __entry->comm)
> >   );
> >   /**
> > 
> > ---
> > base-commit: 57f962b956f1d116cd64d5c406776c4975de549d
> > change-id: 20240912-add-blkalgn-block-trace-71e8ab6708f1
> > 
> > Best regards,
>
John Garry Sept. 13, 2024, 12:08 p.m. UTC | #3
On 13/09/2024 12:26, Daniel Gomez wrote:
> On Fri, Sep 13, 2024 at 09:59:08AM +0100, John Garry wrote:
>> On 12/09/2024 21:48, Daniel Gomez via B4 Relay wrote:
>>> From: Daniel Gomez <da.gomez@samsung.com>
>>>
>>> Report block alignment in terms of LBA and size during block tracing for
>>> block_rq. Calculate alignment only for read/writes where the length is
>>> greater than 0. Otherwise, report 0 to indicate no alignment calculated.
>>>
>>> Suggested-by: Dave Chinner <dchinner@redhat.com>
>>> Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
>>> ---
>>> This patch introduces LBA and size alignment information for
>>> the block_rq tracepoints (block_rq{insert, issue, merge} and
>>> block_{io_start, io_done}).
>>
>> eh, shouldn't this belong in the description of the patch?
> 
> Yes. I'll move this to the commit message.
> 
>>
>> And I still don't know what we mean by alignment in this context.
>>
>>  From looking at the code, it seems to be the max detected block size
>> granularity. For example, for a 64KB write at a 32KB offset, that would give
>> a 32KB "alignment". But a 64KB write at a 64KB offset would be "64KB"
>> alignment. While a 8KB write at 64KB offset would be 8KB "alignment". And a
>> 24KB write at offset 0 is a 8KB "alignment", as 8KB is the lowest power-of-2

note: I meant "8KB is the largest power-of-2"

>> which is divisible into 24KB. Is this a correct understanding?
> 
> That is correct. 

So maybe it's me, but I just find it odd to call this information 
"alignment". To me, what you are looking for is largest block size 
granularity.

 > Do you think adding examples like yours can help to explain
 > this better?
> Below the same examples using fio with the trace output:
> 
> 
> 	sudo fio -bs=64k -size=64k -offset=32k -rw=write \
> 	-direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync
> 	
> 	sudo fio -bs=64k -size=64k -offset=64k -rw=write \
> 	-direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync
> 	
> 	sudo fio -bs=8k -size=8k -offset=64k -rw=write \
> 	-direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync
> 	
> 	sudo fio -bs=24k -size=24k -offset=0k -rw=write \
> 	-direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync
> 
> 	fio-789     [000] .....  4455.092003: block_rq_issue: 259,0 WS 65536 () 64 + 128 none,0,0 |32768| [fio]
> 	fio-801     [000] .....  4455.474826: block_rq_issue: 259,0 WS 65536 () 128 + 128 none,0,0 |65536| [fio]
> 	fio-813     [000] .....  4455.855143: block_rq_issue: 259,0 WS 8192 () 128 + 16 none,0,0 |8192| [fio]
> 	fio-825     [000] .....  4456.235595: block_rq_issue: 259,0 WS 24576 () 0 + 48 none,0,0 |8192| [fio]
> 
> 
> Also, the motivation behind this is explained in the LBS RFC [1] and I should
> have included it here for context. I hope [1] and my description below helps to
> explain what alignment means and why is needed:
> 
> [1] Subject: [RFC 00/23] Enable block size > page size in XFS
> https://urldefense.com/v3/__https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@pankajraghav.com/__;!!ACWV5N9M2RV99hQ!NoMpDxzuA5uKlv0RAWE5UtOQKOrNB2zv8PHmOLWxfGCEzw5WpyyvonfhcMi0REPjCgF8pgBvEO9kyhTPO8z1$
> 
> Tracing alignment information is important for high-capacity and QLC SSDs with
> Indirection Units greater than 4 KiB. These devices are still 4 KiB in Logical
> Block Size (LBS) but because they work at higher IUs, unaligned writes to the IU
> boundaries can imply in a read-modify-write (RMW).

Yes, I get that this might be important to know.

> 
> The graph below is a representation of the device IU vs an I/O block aligned/
> unaligned.
> 
>      |--- IU Boundaries ----|      |-PS-|
> a)  [====][====][====][====][····][····][····]--
>      |                      |
> b)  [····][====][====][====][====][····][····]--
>      |                      |
> c)  [====][====][====][====][····][====][====]--

is there meant to be a gap at page index #4?

>      |                      |
> d)  [····][····][====][====][····][····][····]--
>      |                      |
> LBA 0                      4
>    
>      Key:
>      [====] = I/O Block
>      [····] = Memory in Page Size (PS) chunks
> 
> a) I/O matches IU boundaries (LBA and block size). I/O is aligned.
> b) The size of the I/O matches the IU size but the I/O is not aligned to the
> IU boundaries. I/O is unaligned.
> c) I/O does not match in either size or LBA. I/O is unaligned.

what about d)? Not aligned to IU, I assume.

> 
>>
>>>
>>> The idea of reporting alignment in a tracepoint was first suggested in
>>> this thread [1] by Dave Chinner. Additionally, an eBPF-based equivalent
>>> tracing tool [2] was developed and used during LBS development, as
>>> mentioned in the patch series [3] and in [1].
>>>
>>> With this addition, users can check block alignment directly through the
>>> block layer tracepoints without needing any additional tools.
>>>
>>> In case we have a use case, this can be extended to other tracepoints,
>>> such as complete and error.
>>>
>>> Another potential enhancement could be the integration of this
>>> information into blktrace. Would that be a feasible option to consider?
>>>
>>> [1] https://urldefense.com/v3/__https://lore.kernel.org/all/ZdvXAn1Q*2F*QX5sPQ@dread.disaster.area/__;JSs!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXtYsRb3aY$
>>> [2] blkalgn tool written in eBPF/bcc:
>>> https://urldefense.com/v3/__https://github.com/dkruces/bcc/tree/lbs__;!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXthE7cfng$
>>> [3] https://urldefense.com/v3/__https://lore.kernel.org/all/20240822135018.1931258-1-kernel@pankajraghav.com/__;!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXtqQ5uwAE$
>>> ---
>>>    block/blk-mq.c               | 29 +++++++++++++++++++++++++++++
>>>    include/linux/blk-mq.h       | 11 +++++++++++
>>>    include/linux/blkdev.h       |  6 ++++++
>>>    include/trace/events/block.h |  7 +++++--
>>>    4 files changed, 51 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index 831c5cf5d874..714452bc236b 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -4920,6 +4920,35 @@ int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
>>>    }
>>>    EXPORT_SYMBOL_GPL(blk_rq_poll);
>>> +u32 __blk_rq_lba_algn(struct request *req)
>>
>> why use "algn", and not "align"?
>>
>> "algn" is not a natural abbreviation of "alignment".
> 
> That's okay with me, changing the var name to a more natural abbreviation.
> 
>>
>> And why can't userspace figure this out? All the info is available already,
>> right?
> 
> We are interested in the block alignment (LBA and size) at block device driver
> level, not userspace level. That is, everything that is going out from the block
> layer. Using the block tracing points currently available makes it block-driver
> generic.

I am just saying that the information already present in the block trace 
point can be used to get this "alignment" info, right? And userspace can 
do the work of reading those trace events to find this "alignment" info.

Thanks,
John
Daniel Gomez Sept. 13, 2024, 1:57 p.m. UTC | #4
On Fri, Sep 13, 2024 at 01:08:34PM +0100, John Garry wrote:
> On 13/09/2024 12:26, Daniel Gomez wrote:
> > On Fri, Sep 13, 2024 at 09:59:08AM +0100, John Garry wrote:
> > > On 12/09/2024 21:48, Daniel Gomez via B4 Relay wrote:
> > > > From: Daniel Gomez <da.gomez@samsung.com>
> > > > 
> > > > Report block alignment in terms of LBA and size during block tracing for
> > > > block_rq. Calculate alignment only for read/writes where the length is
> > > > greater than 0. Otherwise, report 0 to indicate no alignment calculated.
> > > > 
> > > > Suggested-by: Dave Chinner <dchinner@redhat.com>
> > > > Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
> > > > ---
> > > > This patch introduces LBA and size alignment information for
> > > > the block_rq tracepoints (block_rq{insert, issue, merge} and
> > > > block_{io_start, io_done}).
> > > 
> > > eh, shouldn't this belong in the description of the patch?
> > 
> > Yes. I'll move this to the commit message.
> > 
> > > 
> > > And I still don't know what we mean by alignment in this context.
> > > 
> > >  From looking at the code, it seems to be the max detected block size
> > > granularity. For example, for a 64KB write at a 32KB offset, that would give
> > > a 32KB "alignment". But a 64KB write at a 64KB offset would be "64KB"
> > > alignment. While a 8KB write at 64KB offset would be 8KB "alignment". And a
> > > 24KB write at offset 0 is a 8KB "alignment", as 8KB is the lowest power-of-2
> 
> note: I meant "8KB is the largest power-of-2"

8KB will be the largest unit at what a device can operate at, for that
particular I/O.

> 
> > > which is divisible into 24KB. Is this a correct understanding?
> > 
> > That is correct.
> 
> So maybe it's me, but I just find it odd to call this information
> "alignment". To me, what you are looking for is largest block size
> granularity.

More suggestions are welcome. What about just I/O granularity? Does the term
imply LBA and size?

> 
> > Do you think adding examples like yours can help to explain
> > this better?
> > Below the same examples using fio with the trace output:
> > 
> > 
> > 	sudo fio -bs=64k -size=64k -offset=32k -rw=write \
> > 	-direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync
> > 	
> > 	sudo fio -bs=64k -size=64k -offset=64k -rw=write \
> > 	-direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync
> > 	
> > 	sudo fio -bs=8k -size=8k -offset=64k -rw=write \
> > 	-direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync
> > 	
> > 	sudo fio -bs=24k -size=24k -offset=0k -rw=write \
> > 	-direct=1 -filename=/dev/nvme0n1 -iodepth=1 -ioengine=sync -name=sync
> > 
> > 	fio-789     [000] .....  4455.092003: block_rq_issue: 259,0 WS 65536 () 64 + 128 none,0,0 |32768| [fio]
> > 	fio-801     [000] .....  4455.474826: block_rq_issue: 259,0 WS 65536 () 128 + 128 none,0,0 |65536| [fio]
> > 	fio-813     [000] .....  4455.855143: block_rq_issue: 259,0 WS 8192 () 128 + 16 none,0,0 |8192| [fio]
> > 	fio-825     [000] .....  4456.235595: block_rq_issue: 259,0 WS 24576 () 0 + 48 none,0,0 |8192| [fio]
> > 
> > 
> > Also, the motivation behind this is explained in the LBS RFC [1] and I should
> > have included it here for context. I hope [1] and my description below helps to
> > explain what alignment means and why is needed:
> > 
> > [1] Subject: [RFC 00/23] Enable block size > page size in XFS
> > https://urldefense.com/v3/__https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@pankajraghav.com/__;!!ACWV5N9M2RV99hQ!NoMpDxzuA5uKlv0RAWE5UtOQKOrNB2zv8PHmOLWxfGCEzw5WpyyvonfhcMi0REPjCgF8pgBvEO9kyhTPO8z1$
> > 
> > Tracing alignment information is important for high-capacity and QLC SSDs with
> > Indirection Units greater than 4 KiB. These devices are still 4 KiB in Logical
> > Block Size (LBS) but because they work at higher IUs, unaligned writes to the IU
> > boundaries can imply in a read-modify-write (RMW).
> 
> Yes, I get that this might be important to know.
> 
> > 
> > The graph below is a representation of the device IU vs an I/O block aligned/
> > unaligned.
> > 
> >      |--- IU Boundaries ----|      |-PS-|
> > a)  [====][====][====][====][····][····][····]--
> >      |                      |
> > b)  [····][====][====][====][====][····][····]--
> >      |                      |
> > c)  [====][====][====][====][····][====][====]--
> 
> is there meant to be a gap at page index #4?

Sorry, that's a copy+paste error. c) can be ignored. 

> 
> >      |                      |
> > d)  [····][····][====][====][····][····][····]--

d) is c)


> >      |                      |
> > LBA 0                      4
> >      Key:
> >      [====] = I/O Block
> >      [····] = Memory in Page Size (PS) chunks
> > 
> > a) I/O matches IU boundaries (LBA and block size). I/O is aligned.
> > b) The size of the I/O matches the IU size but the I/O is not aligned to the
> > IU boundaries. I/O is unaligned.
> > c) I/O does not match in either size or LBA. I/O is unaligned.
> 
> what about d)? Not aligned to IU, I assume.

Yes, c) description is meant for d).

So for clarity, the correct graph is:

    |--- IU Boundaries ----|      |-PS-|
a)  [====][====][====][====][····][····][····]--
    |                      |
b)  [····][====][====][====][====][····][····]--
    |                      |
c)  [····][····][====][====][····][····][····]--
    |                      |
LBA 0                      4

a) I/O matches IU boundaries (LBA and block size). I/O is aligned to IU boundaries.
b) The size of the I/O matches the IU size but the I/O is not aligned to the
IU boundaries. I/O is unaligned.
c) I/O does not match in either size or LBA. I/O is unaligned.

Using I/O granularity term:
a) 16k I/O granularity
b) 4k I/O granularity
c) 8k I/O granularity

> 
> > 
> > > 
> > > > 
> > > > The idea of reporting alignment in a tracepoint was first suggested in
> > > > this thread [1] by Dave Chinner. Additionally, an eBPF-based equivalent
> > > > tracing tool [2] was developed and used during LBS development, as
> > > > mentioned in the patch series [3] and in [1].
> > > > 
> > > > With this addition, users can check block alignment directly through the
> > > > block layer tracepoints without needing any additional tools.
> > > > 
> > > > In case we have a use case, this can be extended to other tracepoints,
> > > > such as complete and error.
> > > > 
> > > > Another potential enhancement could be the integration of this
> > > > information into blktrace. Would that be a feasible option to consider?
> > > > 
> > > > [1] https://urldefense.com/v3/__https://lore.kernel.org/all/ZdvXAn1Q*2F*QX5sPQ@dread.disaster.area/__;JSs!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXtYsRb3aY$
> > > > [2] blkalgn tool written in eBPF/bcc:
> > > > https://urldefense.com/v3/__https://github.com/dkruces/bcc/tree/lbs__;!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXthE7cfng$
> > > > [3] https://urldefense.com/v3/__https://lore.kernel.org/all/20240822135018.1931258-1-kernel@pankajraghav.com/__;!!ACWV5N9M2RV99hQ!P1ZM_E9uBSDLzz6M0dLc_vgEGWEY2HPBXJvEJLWp7w0l_G_r9Gvkm2kQiN586NSIH-JMx_YiCFy_6qdklHFY3pXtqQ5uwAE$
> > > > ---
> > > >    block/blk-mq.c               | 29 +++++++++++++++++++++++++++++
> > > >    include/linux/blk-mq.h       | 11 +++++++++++
> > > >    include/linux/blkdev.h       |  6 ++++++
> > > >    include/trace/events/block.h |  7 +++++--
> > > >    4 files changed, 51 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > > index 831c5cf5d874..714452bc236b 100644
> > > > --- a/block/blk-mq.c
> > > > +++ b/block/blk-mq.c
> > > > @@ -4920,6 +4920,35 @@ int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
> > > >    }
> > > >    EXPORT_SYMBOL_GPL(blk_rq_poll);
> > > > +u32 __blk_rq_lba_algn(struct request *req)
> > > 
> > > why use "algn", and not "align"?
> > > 
> > > "algn" is not a natural abbreviation of "alignment".
> > 
> > That's okay with me, changing the var name to a more natural abbreviation.
> > 
> > > 
> > > And why can't userspace figure this out? All the info is available already,
> > > right?
> > 
> > We are interested in the block alignment (LBA and size) at block device driver
> > level, not userspace level. That is, everything that is going out from the block
> > layer. Using the block tracing points currently available makes it block-driver
> > generic.
> 
> I am just saying that the information already present in the block trace
> point can be used to get this "alignment" info, right? And userspace can do
> the work of reading those trace events to find this "alignment" info.

So, maybe this is better to have integrated in blktrace tool?

> 
> Thanks,
> John
>
John Garry Sept. 13, 2024, 2:08 p.m. UTC | #5
On 13/09/2024 14:57, Daniel Gomez wrote:
>> I am just saying that the information already present in the block trace
>> point can be used to get this "alignment" info, right? And userspace can do
>> the work of reading those trace events to find this "alignment" info.
> So, maybe this is better to have integrated in blktrace tool?

Sure, or whatever tool which can process this information. I just don't 
see why the kernel should be burdened with this.

Thanks,
John
diff mbox series

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 831c5cf5d874..714452bc236b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4920,6 +4920,35 @@  int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
 }
 EXPORT_SYMBOL_GPL(blk_rq_poll);
 
+u32 __blk_rq_lba_algn(struct request *req)
+{
+	u32 lbs = queue_logical_block_size(req->q);
+	u32 lba_shift = ilog2(lbs);
+	u32 lba = req->__sector >> (lba_shift - SECTOR_SHIFT);
+	u32 len = req->__data_len;
+	u32 algn_len = len;
+	u32 algn_lba = len / lbs;
+	u32 alignment = lbs;
+
+	if (is_power_of_2(len) &&
+	    blk_rq_lba_aligned(len, algn_len, lba, algn_lba))
+		return len;
+
+	algn_len = lbs << 1U;
+	algn_lba = algn_len / lbs;
+
+	while (algn_len < len) {
+		if (!blk_rq_lba_aligned(len, algn_len, lba, algn_lba))
+			break;
+
+		alignment = algn_len;
+		algn_len = algn_len << 1U;
+		algn_lba = algn_len / lbs;
+	}
+
+	return alignment;
+}
+
 unsigned int blk_mq_rq_cpu(struct request *rq)
 {
 	return rq->mq_ctx->cpu;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 8d304b1d16b1..02959fbd5e28 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -740,6 +740,17 @@  void blk_mq_free_request(struct request *rq);
 int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
 		unsigned int poll_flags);
 
+/* The alignment of the block in terms of LBA and size */
+u32 __blk_rq_lba_algn(struct request *req);
+static inline u32 blk_rq_lba_algn(struct request *req)
+{
+	if ((req_op(req) != REQ_OP_WRITE) && (req_op(req) != REQ_OP_READ) &&
+	    !(req->__data_len))
+		return 0;
+
+	return __blk_rq_lba_algn(req);
+}
+
 bool blk_mq_queue_inflight(struct request_queue *q);
 
 enum {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index bf1aa951fda2..28557987daa8 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1433,6 +1433,12 @@  static inline int blk_rq_aligned(struct request_queue *q, unsigned long addr,
 	return !(addr & alignment) && !(len & alignment);
 }
 
+static inline bool blk_rq_lba_aligned(u32 len, u32 algn_len, u32 lba,
+				      u32 algn_lba)
+{
+	return !(len % algn_len) && !(lba % algn_lba);
+}
+
 /* assumes size > 256 */
 static inline unsigned int blksize_bits(unsigned int size)
 {
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 1527d5d45e01..ba3764214dc7 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -202,6 +202,7 @@  DECLARE_EVENT_CLASS(block_rq,
 		__array(  char,		rwbs,	RWBS_LEN	)
 		__array(  char,         comm,   TASK_COMM_LEN   )
 		__dynamic_array( char,	cmd,	1		)
+		__field(  unsigned int,	algn			)
 	),
 
 	TP_fast_assign(
@@ -210,20 +211,22 @@  DECLARE_EVENT_CLASS(block_rq,
 		__entry->nr_sector = blk_rq_trace_nr_sectors(rq);
 		__entry->bytes     = blk_rq_bytes(rq);
 		__entry->ioprio	   = rq->ioprio;
+		__entry->algn      = blk_rq_lba_algn(rq);
 
 		blk_fill_rwbs(__entry->rwbs, rq->cmd_flags);
 		__get_str(cmd)[0] = '\0';
 		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
 	),
 
-	TP_printk("%d,%d %s %u (%s) %llu + %u %s,%u,%u [%s]",
+	TP_printk("%d,%d %s %u (%s) %llu + %u %s,%u,%u |%u| [%s]",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->rwbs, __entry->bytes, __get_str(cmd),
 		  (unsigned long long)__entry->sector, __entry->nr_sector,
 		  __print_symbolic(IOPRIO_PRIO_CLASS(__entry->ioprio),
 				   IOPRIO_CLASS_STRINGS),
 		  IOPRIO_PRIO_HINT(__entry->ioprio),
-		  IOPRIO_PRIO_LEVEL(__entry->ioprio), __entry->comm)
+		  IOPRIO_PRIO_LEVEL(__entry->ioprio), __entry->algn,
+		  __entry->comm)
 );
 
 /**