diff mbox series

[RFC,1/7] statx: add I/O alignment information

Message ID 20220211061158.227688-2-ebiggers@kernel.org (mailing list archive)
State Superseded
Series make statx() return I/O alignment information

Commit Message

Eric Biggers Feb. 11, 2022, 6:11 a.m. UTC
From: Eric Biggers <ebiggers@google.com>

Traditionally, the conditions for when DIO (direct I/O) is supported
were fairly simple: filesystems either supported DIO aligned to the
block device's logical block size, or didn't support DIO at all.

However, due to filesystem features that have been added over time (e.g.,
data journalling, inline data, encryption, verity, compression,
checkpoint disabling, log-structured mode), the conditions for when DIO
is allowed on a file have gotten increasingly complex.  Whether a
particular file supports DIO, and with what alignment, can depend on
various file attributes and filesystem mount options, as well as which
block device(s) the file's data is located on.

XFS has an ioctl XFS_IOC_DIOINFO which exposes this information to
applications.  However, as discussed
(https://lore.kernel.org/linux-fsdevel/20220120071215.123274-1-ebiggers@kernel.org/T/#u),
this ioctl is rarely used and not known to be used outside of
XFS-specific code.  It was also never intended to indicate when a file
doesn't support DIO at all, and it exposes only the minimum I/O
alignment, not the optimal I/O alignment, which has also been requested.

Therefore, let's expose this information via statx().  Add the
STATX_IOALIGN flag and three fields associated with it:

* stx_mem_align_dio: the alignment (in bytes) required for user memory
  buffers for DIO, or 0 if DIO is not supported on the file.

* stx_offset_align_dio: the alignment (in bytes) required for file
  offsets and I/O segment lengths for DIO, or 0 if DIO is not supported
  on the file.  This will only be nonzero if stx_mem_align_dio is
  nonzero, and vice versa.

* stx_offset_align_optimal: the alignment (in bytes) suggested for file
  offsets and I/O segment lengths to get optimal performance.  This
  applies to both DIO and buffered I/O.  It differs from stx_blksize
  in that stx_offset_align_optimal will contain the real optimum I/O
  size, which may be a large value.  In contrast, for compatibility
  reasons stx_blksize is the minimum size needed to avoid page cache
  read/modify/write cycles, which may be much smaller than the optimum
  I/O size.  For more details about the motivation for this field, see
  https://lore.kernel.org/r/20220210040304.GM59729@dread.disaster.area

Note that as with other statx() extensions, if STATX_IOALIGN isn't set
in the returned statx struct, then these new fields won't be filled in.
This will happen if the filesystem doesn't support STATX_IOALIGN, or if
the file isn't a regular file.  (It might be supported on block device
files in the future.)  It might also happen if the caller didn't include
STATX_IOALIGN in the request mask, since statx() isn't required to
return information that wasn't requested.
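
As an illustration of how a caller might interpret these fields, here is a
minimal userspace sketch.  It is not part of the patch.  Because this series
is marked Superseded, the sketch does not assume the fields exist in any
installed header: struct rfc_ioalign, RFC_STATX_IOALIGN and the helper names
are hypothetical stand-ins for the proposed struct statx members and mask
bit.  It uses a modulo check because the description above does not say the
alignments are powers of two.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask bit proposed by this RFC (assumption: taken from the diff, not from a released header). */
#define RFC_STATX_IOALIGN 0x00002000U

/* Hypothetical container for the statx fields this RFC proposes. */
struct rfc_ioalign {
	uint32_t stx_mask;                 /* result mask returned by statx() */
	uint32_t stx_mem_align_dio;        /* 0 if DIO is not supported */
	uint32_t stx_offset_align_dio;     /* 0 if DIO is not supported */
	uint32_t stx_offset_align_optimal;
};

/* The new fields are only filled in if STATX_IOALIGN is set in the result mask. */
static inline bool ioalign_valid(const struct rfc_ioalign *st)
{
	return (st->stx_mask & RFC_STATX_IOALIGN) != 0;
}

/*
 * DIO is usable only if the fields are valid and nonzero.  The two DIO
 * fields are documented to be zero or nonzero together; checking both is
 * purely defensive.
 */
static inline bool dio_supported(const struct rfc_ioalign *st)
{
	return ioalign_valid(st) &&
	       st->stx_mem_align_dio != 0 &&
	       st->stx_offset_align_dio != 0;
}

/* Check one DIO request: buffer address, file offset and length. */
static inline bool dio_request_ok(const struct rfc_ioalign *st,
				  uintptr_t buf, uint64_t offset, uint64_t len)
{
	if (!dio_supported(st))
		return false;
	return buf % st->stx_mem_align_dio == 0 &&
	       offset % st->stx_offset_align_dio == 0 &&
	       len % st->stx_offset_align_dio == 0;
}
```

A caller would fall back to buffered I/O when dio_supported() returns false,
which covers both "the filesystem did not report STATX_IOALIGN" and "DIO is
not supported on this file".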

This commit adds the VFS-level plumbing for STATX_IOALIGN.  Individual
filesystems will still need to add code to support it.

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 fs/stat.c                 | 3 +++
 include/linux/stat.h      | 3 +++
 include/uapi/linux/stat.h | 9 +++++++--
 3 files changed, 13 insertions(+), 2 deletions(-)


base-commit: cdaa1b1941f667814300799ddb74f3079517cd5a

Comments

Chaitanya Kulkarni Feb. 11, 2022, 11:40 a.m. UTC | #1
I've actually worked on a similar series to export alignment and
granularity for non-trivial operations; this implementation
only exports I/O alignments (mostly REQ_OP_WRITE/REQ_OP_READ) via
statx.

That information comes from
bdev_logical_block_size() -> q->limits.logical_block_size, which is set when
a low-level driver such as nvme calls blk_queue_logical_block_size().

In my experience, especially with SSDs, applications want to
know similar information about non-trivial requests such as
REQ_OP_DISCARD/REQ_OP_WRITE_ZEROES/REQ_OP_VERIFY (work in progress, see
[1]).

It would be great to make this a generic userspace interface where the user
can ask about a specific REQ_OP_XXX: generic I/O such as
REQ_OP_READ/REQ_OP_WRITE as well as non-generic operations such as
REQ_OP_DISCARD/REQ_OP_VERIFY.

Since I've worked on implementing REQ_OP_VERIFY support, I don't want to
implement a separate interface for querying the granularity or alignment of
REQ_OP_VERIFY or any other non-trivial REQ_OP_XXX.

-ck

[1] https://www.spinics.net/lists/linux-xfs/msg56826.html

Chaitanya Kulkarni Feb. 11, 2022, 11:45 a.m. UTC | #2
On 2/11/22 3:40 AM, Chaitanya Kulkarni wrote:
> [1] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.spinics.net%2Flists%2Flinux-xfs%2Fmsg56826.html&amp;data=04%7C01%7Cchaitanyak%40nvidia.com%7C252d78e009ad49bd522208d9ed534dcf%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637801764313014840%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=1owqIDlcst4h%2FGr9Azteaiy22vfHFZojRipKmk6A%2FCg%3D&amp;reserved=0
> 

Adding right link for REQ_OP_VERIFY ...

[1] https://www.spinics.net/lists/linux-xfs/msg56826.html

Patch

diff --git a/fs/stat.c b/fs/stat.c
index 28d2020ba1f42..093c506e69c7b 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -598,6 +598,9 @@  cp_statx(const struct kstat *stat, struct statx __user *buffer)
 	tmp.stx_dev_major = MAJOR(stat->dev);
 	tmp.stx_dev_minor = MINOR(stat->dev);
 	tmp.stx_mnt_id = stat->mnt_id;
+	tmp.stx_mem_align_dio = stat->mem_align_dio;
+	tmp.stx_offset_align_dio = stat->offset_align_dio;
+	tmp.stx_offset_align_optimal = stat->offset_align_optimal;
 
 	return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0;
 }
diff --git a/include/linux/stat.h b/include/linux/stat.h
index 7df06931f25d8..48b8b1ad1567c 100644
--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -50,6 +50,9 @@  struct kstat {
 	struct timespec64 btime;			/* File creation time */
 	u64		blocks;
 	u64		mnt_id;
+	u32		mem_align_dio;
+	u32		offset_align_dio;
+	u32		offset_align_optimal;
 };
 
 #endif
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index 1500a0f58041a..f822b23e81091 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -124,9 +124,13 @@  struct statx {
 	__u32	stx_dev_minor;
 	/* 0x90 */
 	__u64	stx_mnt_id;
-	__u64	__spare2;
+	__u32	stx_mem_align_dio;	/* Memory buffer alignment for direct I/O */
+	__u32	stx_offset_align_dio;	/* File offset alignment for direct I/O */
 	/* 0xa0 */
-	__u64	__spare3[12];	/* Spare space for future expansion */
+	__u32	stx_offset_align_optimal; /* Optimal file offset alignment for I/O */
+	__u32	__spare2;
+	/* 0xa8 */
+	__u64	__spare3[11];	/* Spare space for future expansion */
 	/* 0x100 */
 };
 
@@ -152,6 +156,7 @@  struct statx {
 #define STATX_BASIC_STATS	0x000007ffU	/* The stuff in the normal stat struct */
 #define STATX_BTIME		0x00000800U	/* Want/got stx_btime */
 #define STATX_MNT_ID		0x00001000U	/* Got stx_mnt_id */
+#define STATX_IOALIGN		0x00002000U	/* Want/got IO alignment info */
 
 #define STATX__RESERVED		0x80000000U	/* Reserved for future struct statx expansion */
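
The offset comments in the struct statx hunk above can be checked
mechanically.  The sketch below is an illustration only and is not part of the
patch: struct rfc_statx_tail is a hypothetical name, and the fields before
stx_mnt_id, which this patch does not touch, are collapsed into opaque
padding for brevity.  It asserts that the three new fields plus the shrunken
spare area still end at 0x100.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the tail of struct statx as laid out by this RFC. */
struct rfc_statx_tail {
	uint8_t  unchanged[0x90];          /* 0x00..0x8f: fields not touched by the patch */
	uint64_t stx_mnt_id;               /* 0x90 */
	uint32_t stx_mem_align_dio;        /* 0x98 */
	uint32_t stx_offset_align_dio;     /* 0x9c */
	uint32_t stx_offset_align_optimal; /* 0xa0 */
	uint32_t spare2;                   /* 0xa4: stands in for __spare2 */
	uint64_t spare3[11];               /* 0xa8: stands in for __spare3[11] */
};

/* Compile-time checks against the offsets written in the diff comments. */
static_assert(offsetof(struct rfc_statx_tail, stx_mnt_id) == 0x90, "stx_mnt_id offset");
static_assert(offsetof(struct rfc_statx_tail, stx_mem_align_dio) == 0x98, "mem align offset");
static_assert(offsetof(struct rfc_statx_tail, stx_offset_align_dio) == 0x9c, "offset align offset");
static_assert(offsetof(struct rfc_statx_tail, stx_offset_align_optimal) == 0xa0, "optimal align offset");
static_assert(offsetof(struct rfc_statx_tail, spare3) == 0xa8, "spare3 offset");
static_assert(sizeof(struct rfc_statx_tail) == 0x100, "struct size must stay 0x100");
```

The size check reflects the arithmetic in the hunk: two __u32 fields replace
the old __u64 __spare2, and the third __u32 plus a new __u32 __spare2 consume
one of the twelve __u64 slots of __spare3, leaving eleven.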