Message ID | 20250108085549.1296733-5-hch@lst.de (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [1/5] fs: reformat the statx definition | expand |
On 08/01/2025 08:55, Christoph Hellwig wrote: > For I/O to reflinked blocks we always need to write an entire new > file system block, and the code enforces the file system block alignment > for the entire file if it has any reflinked blocks. > > Use the new STATX_DIO_READ_ALIGN flag to report the asymmetric read > vs write alignments for reflinked files. > > Signed-off-by: Christoph Hellwig<hch@lst.de> > Reviewed-by: "Darrick J. Wong"<djwong@kernel.org> FWIW: Reviewed-by: John Garry <john.g.garry@oracle.com>
On 08/01/2025 08:55, Christoph Hellwig wrote: > @@ -580,9 +580,24 @@ xfs_report_dioalign( > struct xfs_buftarg *target = xfs_inode_buftarg(ip); > struct block_device *bdev = target->bt_bdev; > > - stat->result_mask |= STATX_DIOALIGN; > + stat->result_mask |= STATX_DIOALIGN | STATX_DIO_READ_ALIGN; BTW, it would be a crappy userspace which can't handle fields which it did not ask for, e.g. asked for STATX_DIOALIGN, but got STATX_DIOALIGN and STATX_DIO_READ_ALIGN
On Wed, Jan 08, 2025 at 10:13:02AM +0000, John Garry wrote: > On 08/01/2025 08:55, Christoph Hellwig wrote: >> @@ -580,9 +580,24 @@ xfs_report_dioalign( >> struct xfs_buftarg *target = xfs_inode_buftarg(ip); >> struct block_device *bdev = target->bt_bdev; >> - stat->result_mask |= STATX_DIOALIGN; >> + stat->result_mask |= STATX_DIOALIGN | STATX_DIO_READ_ALIGN; > > BTW, it would be a crappy userspace which can't handle fields which it did > not ask for, e.g. asked for STATX_DIOALIGN, but got STATX_DIOALIGN and > STATX_DIO_READ_ALIGN Well, the space is marked for extension. I don't think there ever was a requirement only fields asked for are reported, but if that feels safer I could switch to that.
On Wed, Jan 08, 2025 at 04:18:27PM +0100, Christoph Hellwig wrote: > On Wed, Jan 08, 2025 at 10:13:02AM +0000, John Garry wrote: > > On 08/01/2025 08:55, Christoph Hellwig wrote: > >> @@ -580,9 +580,24 @@ xfs_report_dioalign( > >> struct xfs_buftarg *target = xfs_inode_buftarg(ip); > >> struct block_device *bdev = target->bt_bdev; > >> - stat->result_mask |= STATX_DIOALIGN; > >> + stat->result_mask |= STATX_DIOALIGN | STATX_DIO_READ_ALIGN; > > > > BTW, it would be a crappy userspace which can't handle fields which it did > > not ask for, e.g. asked for STATX_DIOALIGN, but got STATX_DIOALIGN and > > STATX_DIO_READ_ALIGN > > Well, the space is marked for extension. I don't think there ever > was a requirement only fields asked for are reported, but if that > feels safer I could switch to that. From the "Returned information" section of the statx manpage: "It should be noted that the kernel may return fields that weren't requested and may fail to return fields that were requested, depending on what the backing filesystem supports. (Fields that are given values despite being unrequested can just be ignored.) In either case, stx_mask will not be equal mask. <snip> "A filesystem may also fill in fields that the caller didn't ask for if it has values for them available and the information is available at no extra cost. If this happens, the corresponding bits will be set in stx_mask." In other words, the kernel is allowed to return information/flags that weren't requested, and userspace can ignore it. --D
On Wed, Jan 08, 2025 at 09:55:32AM +0100, Christoph Hellwig wrote: > For I/O to reflinked blocks we always need to write an entire new > file system block, and the code enforces the file system block alignment > for the entire file if it has any reflinked blocks. > > Use the new STATX_DIO_READ_ALIGN flag to report the asymmetric read > vs write alignments for reflinked files. > > Signed-off-by: Christoph Hellwig <hch@lst.de> > Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> > --- > fs/xfs/xfs_iops.c | 21 ++++++++++++++++++--- > 1 file changed, 18 insertions(+), 3 deletions(-) > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c > index 6b0228a21617..40289fe6f5b2 100644 > --- a/fs/xfs/xfs_iops.c > +++ b/fs/xfs/xfs_iops.c > @@ -580,9 +580,24 @@ xfs_report_dioalign( > struct xfs_buftarg *target = xfs_inode_buftarg(ip); > struct block_device *bdev = target->bt_bdev; > > - stat->result_mask |= STATX_DIOALIGN; > + stat->result_mask |= STATX_DIOALIGN | STATX_DIO_READ_ALIGN; > stat->dio_mem_align = bdev_dma_alignment(bdev) + 1; > - stat->dio_offset_align = bdev_logical_block_size(bdev); > + > + /* > + * For COW inodes, we can only perform out of place writes of entire > + * allocation units (blocks or RT extents). > + * For writes smaller than the allocation unit, we must fall back to > + * buffered I/O to perform read-modify-write cycles. At best this is > + * highly inefficient; at worst it leads to page cache invalidation > + * races. Tell applications to avoid this by reporting the larger write > + * alignment in dio_offset_align, and the smaller read alignment in > + * dio_read_offset_align. > + */ > + stat->dio_read_offset_align = bdev_logical_block_size(bdev); > + if (xfs_is_cow_inode(ip)) > + stat->dio_offset_align = xfs_inode_alloc_unitsize(ip); > + else > + stat->dio_offset_align = stat->dio_read_offset_align; This contradicts the proposed man page, which says the following about stx_dio_read_offset_align offset: If non-zero this value must be smaller than stx_dio_offset_align which must be provided by the file system. The proposed code sets stx_dio_read_offset_align and stx_dio_offset_align to the same value in some cases. Perhaps the documentation should say "less than or equal to"? Also, the claim that stx_dio_offset_align "must be provided by the file system" if stx_dio_read_offset_align is nonzero should probably be conditional on STATX_DIOALIGN being provided too. - Eric
On Wed, Jan 08, 2025 at 09:53:58AM -0800, Eric Biggers wrote: > This contradicts the proposed man page, which says the following about > stx_dio_read_offset_align offset: > > If non-zero this value must be smaller than stx_dio_offset_align > which must be provided by the file system. > > The proposed code sets stx_dio_read_offset_align and stx_dio_offset_align to the > same value in some cases. > > Perhaps the documentation should say "less than or equal to"? Yes, I'll fix it up. > Also, the claim that stx_dio_offset_align "must be provided by the file system" > if stx_dio_read_offset_align is nonzero should probably be conditional on > STATX_DIOALIGN being provided too. I'll add a "if requested" to the man page.
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index 6b0228a21617..40289fe6f5b2 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -580,9 +580,24 @@ xfs_report_dioalign( struct xfs_buftarg *target = xfs_inode_buftarg(ip); struct block_device *bdev = target->bt_bdev; - stat->result_mask |= STATX_DIOALIGN; + stat->result_mask |= STATX_DIOALIGN | STATX_DIO_READ_ALIGN; stat->dio_mem_align = bdev_dma_alignment(bdev) + 1; - stat->dio_offset_align = bdev_logical_block_size(bdev); + + /* + * For COW inodes, we can only perform out of place writes of entire + * allocation units (blocks or RT extents). + * For writes smaller than the allocation unit, we must fall back to + * buffered I/O to perform read-modify-write cycles. At best this is + * highly inefficient; at worst it leads to page cache invalidation + * races. Tell applications to avoid this by reporting the larger write + * alignment in dio_offset_align, and the smaller read alignment in + * dio_read_offset_align. + */ + stat->dio_read_offset_align = bdev_logical_block_size(bdev); + if (xfs_is_cow_inode(ip)) + stat->dio_offset_align = xfs_inode_alloc_unitsize(ip); + else + stat->dio_offset_align = stat->dio_read_offset_align; } static void @@ -658,7 +673,7 @@ xfs_vn_getattr( stat->rdev = inode->i_rdev; break; case S_IFREG: - if (request_mask & STATX_DIOALIGN) + if (request_mask & (STATX_DIOALIGN | STATX_DIO_READ_ALIGN)) xfs_report_dioalign(ip, stat); if (request_mask & STATX_WRITE_ATOMIC) xfs_report_atomic_write(ip, stat);