diff mbox series

[4/5] xfs: report the correct read/write dio alignment for reflinked inodes

Message ID 20250108085549.1296733-5-hch@lst.de (mailing list archive)
State New
Headers show
Series [1/5] fs: reformat the statx definition | expand

Commit Message

Christoph Hellwig Jan. 8, 2025, 8:55 a.m. UTC
For I/O to reflinked blocks we always need to write an entire new
file system block, and the code enforces the file system block alignment
for the entire file if it has any reflinked blocks.

Use the new STATX_DIO_READ_ALIGN flag to report the asymmetric read
vs write alignments for reflinked files.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_iops.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

Comments

John Garry Jan. 8, 2025, 10:10 a.m. UTC | #1
On 08/01/2025 08:55, Christoph Hellwig wrote:
> For I/O to reflinked blocks we always need to write an entire new
> file system block, and the code enforces the file system block alignment
> for the entire file if it has any reflinked blocks.
> 
> Use the new STATX_DIO_READ_ALIGN flag to report the asymmetric read
> vs write alignments for reflinked files.
> 
> Signed-off-by: Christoph Hellwig<hch@lst.de>
> Reviewed-by: "Darrick J. Wong"<djwong@kernel.org>


FWIW:
Reviewed-by: John Garry <john.g.garry@oracle.com>
John Garry Jan. 8, 2025, 10:13 a.m. UTC | #2
On 08/01/2025 08:55, Christoph Hellwig wrote:
> @@ -580,9 +580,24 @@ xfs_report_dioalign(
>   	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
>   	struct block_device	*bdev = target->bt_bdev;
>   
> -	stat->result_mask |= STATX_DIOALIGN;
> +	stat->result_mask |= STATX_DIOALIGN | STATX_DIO_READ_ALIGN;

BTW, it would be a crappy userspace which can't handle fields which it 
did not ask for, e.g. asked for STATX_DIOALIGN, but got  STATX_DIOALIGN 
and STATX_DIO_READ_ALIGN
Christoph Hellwig Jan. 8, 2025, 3:18 p.m. UTC | #3
On Wed, Jan 08, 2025 at 10:13:02AM +0000, John Garry wrote:
> On 08/01/2025 08:55, Christoph Hellwig wrote:
>> @@ -580,9 +580,24 @@ xfs_report_dioalign(
>>   	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
>>   	struct block_device	*bdev = target->bt_bdev;
>>   -	stat->result_mask |= STATX_DIOALIGN;
>> +	stat->result_mask |= STATX_DIOALIGN | STATX_DIO_READ_ALIGN;
>
> BTW, it would be a crappy userspace which can't handle fields which it did 
> not ask for, e.g. asked for STATX_DIOALIGN, but got  STATX_DIOALIGN and 
> STATX_DIO_READ_ALIGN

Well, the space is marked for extension.  I don't think there ever
was a requirement only fields asked for are reported, but if that
feels safer I could switch to that.
Darrick J. Wong Jan. 8, 2025, 5:20 p.m. UTC | #4
On Wed, Jan 08, 2025 at 04:18:27PM +0100, Christoph Hellwig wrote:
> On Wed, Jan 08, 2025 at 10:13:02AM +0000, John Garry wrote:
> > On 08/01/2025 08:55, Christoph Hellwig wrote:
> >> @@ -580,9 +580,24 @@ xfs_report_dioalign(
> >>   	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
> >>   	struct block_device	*bdev = target->bt_bdev;
> >>   -	stat->result_mask |= STATX_DIOALIGN;
> >> +	stat->result_mask |= STATX_DIOALIGN | STATX_DIO_READ_ALIGN;
> >
> > BTW, it would be a crappy userspace which can't handle fields which it did 
> > not ask for, e.g. asked for STATX_DIOALIGN, but got  STATX_DIOALIGN and 
> > STATX_DIO_READ_ALIGN
> 
> Well, the space is marked for extension.  I don't think there ever
> was a requirement only fields asked for are reported, but if that
> feels safer I could switch to that.

From the "Returned information" section of the statx manpage:

"It should be noted that the kernel may return fields that weren't
requested and may fail to return fields that were requested, depending
on what the backing filesystem supports.  (Fields that are given values
despite being unrequested can just be ignored.)  In either case,
stx_mask will not be equal mask.

<snip>

"A filesystem may also fill in fields that the caller didn't ask for if
it has values for them available and the information is available at no
extra cost.  If this happens,  the corresponding bits will be set in
stx_mask."

In other words, the kernel is allowed to return information/flags that
weren't requested, and userspace can ignore it.

--D
Eric Biggers Jan. 8, 2025, 5:53 p.m. UTC | #5
On Wed, Jan 08, 2025 at 09:55:32AM +0100, Christoph Hellwig wrote:
> For I/O to reflinked blocks we always need to write an entire new
> file system block, and the code enforces the file system block alignment
> for the entire file if it has any reflinked blocks.
> 
> Use the new STATX_DIO_READ_ALIGN flag to report the asymmetric read
> vs write alignments for reflinked files.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
>  fs/xfs/xfs_iops.c | 21 ++++++++++++++++++---
>  1 file changed, 18 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 6b0228a21617..40289fe6f5b2 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -580,9 +580,24 @@ xfs_report_dioalign(
>  	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
>  	struct block_device	*bdev = target->bt_bdev;
>  
> -	stat->result_mask |= STATX_DIOALIGN;
> +	stat->result_mask |= STATX_DIOALIGN | STATX_DIO_READ_ALIGN;
>  	stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
> -	stat->dio_offset_align = bdev_logical_block_size(bdev);
> +
> +	/*
> +	 * For COW inodes, we can only perform out of place writes of entire
> +	 * allocation units (blocks or RT extents).
> +	 * For writes smaller than the allocation unit, we must fall back to
> +	 * buffered I/O to perform read-modify-write cycles.  At best this is
> +	 * highly inefficient; at worst it leads to page cache invalidation
> +	 * races.  Tell applications to avoid this by reporting the larger write
> +	 * alignment in dio_offset_align, and the smaller read alignment in
> +	 * dio_read_offset_align.
> +	 */
> +	stat->dio_read_offset_align = bdev_logical_block_size(bdev);
> +	if (xfs_is_cow_inode(ip))
> +		stat->dio_offset_align = xfs_inode_alloc_unitsize(ip);
> +	else
> +		stat->dio_offset_align = stat->dio_read_offset_align;

This contradicts the proposed man page, which says the following about
stx_dio_read_offset_align offset:

          If non-zero this value must be smaller than  stx_dio_offset_align
          which  must be provided by the file system.

The proposed code sets stx_dio_read_offset_align and stx_dio_offset_align to the
same value in some cases.

Perhaps the documentation should say "less than or equal to"?

Also, the claim that stx_dio_offset_align "must be provided by the file system"
if stx_dio_read_offset_align is nonzero should probably be conditional on
STATX_DIOALIGN being provided too.

- Eric
Christoph Hellwig Jan. 9, 2025, 6:25 a.m. UTC | #6
On Wed, Jan 08, 2025 at 09:53:58AM -0800, Eric Biggers wrote:
> This contradicts the proposed man page, which says the following about
> stx_dio_read_offset_align offset:
> 
>           If non-zero this value must be smaller than  stx_dio_offset_align
>           which  must be provided by the file system.
> 
> The proposed code sets stx_dio_read_offset_align and stx_dio_offset_align to the
> same value in some cases.
> 
> Perhaps the documentation should say "less than or equal to"?

Yes, I'll fix it up.

> Also, the claim that stx_dio_offset_align "must be provided by the file system"
> if stx_dio_read_offset_align is nonzero should probably be conditional on
> STATX_DIOALIGN being provided too.

I'll add a "if requested" to the man page.
diff mbox series

Patch

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 6b0228a21617..40289fe6f5b2 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -580,9 +580,24 @@  xfs_report_dioalign(
 	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
 	struct block_device	*bdev = target->bt_bdev;
 
-	stat->result_mask |= STATX_DIOALIGN;
+	stat->result_mask |= STATX_DIOALIGN | STATX_DIO_READ_ALIGN;
 	stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
-	stat->dio_offset_align = bdev_logical_block_size(bdev);
+
+	/*
+	 * For COW inodes, we can only perform out of place writes of entire
+	 * allocation units (blocks or RT extents).
+	 * For writes smaller than the allocation unit, we must fall back to
+	 * buffered I/O to perform read-modify-write cycles.  At best this is
+	 * highly inefficient; at worst it leads to page cache invalidation
+	 * races.  Tell applications to avoid this by reporting the larger write
+	 * alignment in dio_offset_align, and the smaller read alignment in
+	 * dio_read_offset_align.
+	 */
+	stat->dio_read_offset_align = bdev_logical_block_size(bdev);
+	if (xfs_is_cow_inode(ip))
+		stat->dio_offset_align = xfs_inode_alloc_unitsize(ip);
+	else
+		stat->dio_offset_align = stat->dio_read_offset_align;
 }
 
 static void
@@ -658,7 +673,7 @@  xfs_vn_getattr(
 		stat->rdev = inode->i_rdev;
 		break;
 	case S_IFREG:
-		if (request_mask & STATX_DIOALIGN)
+		if (request_mask & (STATX_DIOALIGN | STATX_DIO_READ_ALIGN))
 			xfs_report_dioalign(ip, stat);
 		if (request_mask & STATX_WRITE_ATOMIC)
 			xfs_report_atomic_write(ip, stat);