Message ID | 20230929102726.2985188-4-john.g.garry@oracle.com (mailing list archive) |
---|---|
State | Superseded, archived |
Headers | show |
Series | block atomic writes | expand |
On Fri, Sep 29, 2023 at 10:27:08AM +0000, John Garry wrote: > diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h > index 7cab2c65d3d7..c99d7cac2aa6 100644 > --- a/include/uapi/linux/stat.h > +++ b/include/uapi/linux/stat.h > @@ -127,7 +127,10 @@ struct statx { > __u32 stx_dio_mem_align; /* Memory buffer alignment for direct I/O */ > __u32 stx_dio_offset_align; /* File offset alignment for direct I/O */ > /* 0xa0 */ > - __u64 __spare3[12]; /* Spare space for future expansion */ > + __u32 stx_atomic_write_unit_max; > + __u32 stx_atomic_write_unit_min; Maybe min first and then max? That seems a bit more natural, and a lot of the code you've written handle them in that order. > +#define STATX_ATTR_WRITE_ATOMIC 0x00400000 /* File supports atomic write operations */ How would this differ from stx_atomic_write_unit_min != 0? - Eric
On 9/29/23 15:49, Eric Biggers wrote: > On Fri, Sep 29, 2023 at 10:27:08AM +0000, John Garry wrote: >> diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h >> index 7cab2c65d3d7..c99d7cac2aa6 100644 >> --- a/include/uapi/linux/stat.h >> +++ b/include/uapi/linux/stat.h >> @@ -127,7 +127,10 @@ struct statx { >> __u32 stx_dio_mem_align; /* Memory buffer alignment for direct I/O */ >> __u32 stx_dio_offset_align; /* File offset alignment for direct I/O */ >> /* 0xa0 */ >> - __u64 __spare3[12]; /* Spare space for future expansion */ >> + __u32 stx_atomic_write_unit_max; >> + __u32 stx_atomic_write_unit_min; > > Maybe min first and then max? That seems a bit more natural, and a lot of the > code you've written handle them in that order. > >> +#define STATX_ATTR_WRITE_ATOMIC 0x00400000 /* File supports atomic write operations */ > > How would this differ from stx_atomic_write_unit_min != 0? Is it even possible that stx_atomic_write_unit_min == 0? My understanding is that all Linux filesystems rely on the assumption that writing a single logical block either succeeds or does not happen, even if a power failure occurs between writing and reading a logical block. Thanks, Bart.
On 01/10/2023 14:23, Bart Van Assche wrote: > On 9/29/23 15:49, Eric Biggers wrote: >> On Fri, Sep 29, 2023 at 10:27:08AM +0000, John Garry wrote: >>> diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h >>> index 7cab2c65d3d7..c99d7cac2aa6 100644 >>> --- a/include/uapi/linux/stat.h >>> +++ b/include/uapi/linux/stat.h >>> @@ -127,7 +127,10 @@ struct statx { >>> __u32 stx_dio_mem_align; /* Memory buffer alignment for >>> direct I/O */ >>> __u32 stx_dio_offset_align; /* File offset alignment for >>> direct I/O */ >>> /* 0xa0 */ >>> - __u64 __spare3[12]; /* Spare space for future expansion */ >>> + __u32 stx_atomic_write_unit_max; >>> + __u32 stx_atomic_write_unit_min; >> >> Maybe min first and then max? That seems a bit more natural, and a >> lot of the >> code you've written handle them in that order. ok, I think it's fine to reorder >> >>> +#define STATX_ATTR_WRITE_ATOMIC 0x00400000 /* File supports >>> atomic write operations */ >> >> How would this differ from stx_atomic_write_unit_min != 0? Yeah, I suppose that we can just not set this for the case of stx_atomic_write_unit_min == 0. > > Is it even possible that stx_atomic_write_unit_min == 0? My understanding > is that all Linux filesystems rely on the assumption that writing a single > logical block either succeeds or does not happen, even if a power failure > occurs between writing and reading a logical block. > Maybe they do rely on this, but is it particularly interesting? BTW, I would not like to provide assurances that every storage media produced writes logical blocks atomically. Thanks, John
On 10/2/23 02:51, John Garry wrote: > On 01/10/2023 14:23, Bart Van Assche wrote: >> Is it even possible that stx_atomic_write_unit_min == 0? My >> understanding is that all Linux filesystems rely on the assumption >> that writing a single logical block either succeeds or does not >> happen, even if a power failure occurs between writing and reading >> a logical block. > > Maybe they do rely on this, but is it particularly interesting? > > BTW, I would not like to provide assurances that every storage media > produced writes logical blocks atomically. Neither the SCSI SBC standard nor the NVMe standard defines a "minimum atomic write unit". So why to introduce something in the Linux kernel that is not defined in common storage standards? I propose to leave out stx_atomic_write_unit_min from struct statx and also to leave out atomic_write_unit_min_sectors from struct queue_limits. My opinion is that we should not support block devices in the Linux kernel that do not write logical blocks atomically. Block devices that do not write logical blocks atomically are not compatible with Linux kernel journaling filesystems. Additionally, I'm not sure it's even possible to write a journaling filesystem for such block devices. Thanks, Bart.
Bart, > Neither the SCSI SBC standard nor the NVMe standard defines a "minimum > atomic write unit". So why to introduce something in the Linux kernel > that is not defined in common storage standards? From SBC-5: "The ATOMIC TRANSFER LENGTH GRANULARITY field indicates the minimum transfer length for an atomic write command." > I propose to leave out stx_atomic_write_unit_min from > struct statx and also to leave out atomic_write_unit_min_sectors from > struct queue_limits. My opinion is that we should not support block > devices in the Linux kernel that do not write logical blocks atomically. The statx values exist to describe the limits for I/Os sent using RWF_ATOMIC and IOCB_ATOMIC. These limits may be different from other reported values such as the filesystem block size and the logical block size of the underlying device.
On Mon, Oct 02, 2023 at 10:51:36AM +0100, John Garry wrote: > On 01/10/2023 14:23, Bart Van Assche wrote: > > On 9/29/23 15:49, Eric Biggers wrote: > > > On Fri, Sep 29, 2023 at 10:27:08AM +0000, John Garry wrote: > > > > diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h > > > > index 7cab2c65d3d7..c99d7cac2aa6 100644 > > > > --- a/include/uapi/linux/stat.h > > > > +++ b/include/uapi/linux/stat.h > > > > @@ -127,7 +127,10 @@ struct statx { > > > > __u32 stx_dio_mem_align; /* Memory buffer alignment > > > > for direct I/O */ > > > > __u32 stx_dio_offset_align; /* File offset alignment > > > > for direct I/O */ > > > > /* 0xa0 */ > > > > - __u64 __spare3[12]; /* Spare space for future expansion */ > > > > + __u32 stx_atomic_write_unit_max; > > > > + __u32 stx_atomic_write_unit_min; > > > > > > Maybe min first and then max? That seems a bit more natural, and a > > > lot of the > > > code you've written handle them in that order. > > ok, I think it's fine to reorder > > > > > > > > +#define STATX_ATTR_WRITE_ATOMIC 0x00400000 /* File > > > > supports atomic write operations */ > > > > > > How would this differ from stx_atomic_write_unit_min != 0? > > Yeah, I suppose that we can just not set this for the case of > stx_atomic_write_unit_min == 0. Please use the STATX_ATTR_WRITE_ATOMIC flag to indicate that the filesystem, file and underlying device support atomic writes when the values are non-zero. The whole point of the attribute mask is that the caller can check the mask for supported functionality without having to read every field in the statx structure to determine if the functionality it wants is present. -Dave.
On Tue, Oct 03, 2023 at 12:51:49PM +1100, Dave Chinner wrote: > On Mon, Oct 02, 2023 at 10:51:36AM +0100, John Garry wrote: > > On 01/10/2023 14:23, Bart Van Assche wrote: > > > On 9/29/23 15:49, Eric Biggers wrote: > > > > On Fri, Sep 29, 2023 at 10:27:08AM +0000, John Garry wrote: > > > > > diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h > > > > > index 7cab2c65d3d7..c99d7cac2aa6 100644 > > > > > --- a/include/uapi/linux/stat.h > > > > > +++ b/include/uapi/linux/stat.h > > > > > @@ -127,7 +127,10 @@ struct statx { > > > > > __u32 stx_dio_mem_align; /* Memory buffer alignment > > > > > for direct I/O */ > > > > > __u32 stx_dio_offset_align; /* File offset alignment > > > > > for direct I/O */ > > > > > /* 0xa0 */ > > > > > - __u64 __spare3[12]; /* Spare space for future expansion */ > > > > > + __u32 stx_atomic_write_unit_max; > > > > > + __u32 stx_atomic_write_unit_min; > > > > > > > > Maybe min first and then max? That seems a bit more natural, and a > > > > lot of the > > > > code you've written handle them in that order. > > > > ok, I think it's fine to reorder > > > > > > > > > > > +#define STATX_ATTR_WRITE_ATOMIC 0x00400000 /* File > > > > > supports atomic write operations */ > > > > > > > > How would this differ from stx_atomic_write_unit_min != 0? > > > > Yeah, I suppose that we can just not set this for the case of > > stx_atomic_write_unit_min == 0. > > Please use the STATX_ATTR_WRITE_ATOMIC flag to indicate that the > filesystem, file and underlying device support atomic writes when > the values are non-zero. The whole point of the attribute mask is > that the caller can check the mask for supported functionality > without having to read every field in the statx structure to > determine if the functionality it wants is present. ^^ Seconding what Dave said. --D > -Dave. > -- > Dave Chinner > david@fromorbit.com
On 03/10/2023 03:57, Darrick J. Wong wrote: >>>>> +#define STATX_ATTR_WRITE_ATOMIC 0x00400000 /* File >>>>> supports atomic write operations */ >>>> How would this differ from stx_atomic_write_unit_min != 0? >> Yeah, I suppose that we can just not set this for the case of >> stx_atomic_write_unit_min == 0. > Please use the STATX_ATTR_WRITE_ATOMIC flag to indicate that the > filesystem, file and underlying device support atomic writes when > the values are non-zero. The whole point of the attribute mask is > that the caller can check the mask for supported functionality > without having to read every field in the statx structure to > determine if the functionality it wants is present. Sure, but again that would be just checking atomic_write_unit_min_bytes or another atomic write block setting as that is the only way to tell from the block layer (if atomic writes are supported), so it will be something like: if (request_mask & STATX_WRITE_ATOMIC && queue_atomic_write_unit_min_bytes(bdev->bd_queue)) { stat->atomic_write_unit_min = queue_atomic_write_unit_min_bytes(bdev->bd_queue); stat->atomic_write_unit_max = queue_atomic_write_unit_max_bytes(bdev->bd_queue); stat->attributes |= STATX_ATTR_WRITE_ATOMIC; stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC; stat->result_mask |= STATX_WRITE_ATOMIC; } Thanks, John
On Tue, Oct 03, 2023 at 08:23:26AM +0100, John Garry wrote: > On 03/10/2023 03:57, Darrick J. Wong wrote: > > > > > > +#define STATX_ATTR_WRITE_ATOMIC 0x00400000 /* File > > > > > > supports atomic write operations */ > > > > > How would this differ from stx_atomic_write_unit_min != 0? > > > Yeah, I suppose that we can just not set this for the case of > > > stx_atomic_write_unit_min == 0. > > Please use the STATX_ATTR_WRITE_ATOMIC flag to indicate that the > > filesystem, file and underlying device support atomic writes when > > the values are non-zero. The whole point of the attribute mask is > > that the caller can check the mask for supported functionality > > without having to read every field in the statx structure to > > determine if the functionality it wants is present. > > Sure, but again that would be just checking atomic_write_unit_min_bytes or > another atomic write block setting as that is the only way to tell from the > block layer (if atomic writes are supported), so it will be something like: > > if (request_mask & STATX_WRITE_ATOMIC && > queue_atomic_write_unit_min_bytes(bdev->bd_queue)) { > stat->atomic_write_unit_min = > queue_atomic_write_unit_min_bytes(bdev->bd_queue); > stat->atomic_write_unit_max = > queue_atomic_write_unit_max_bytes(bdev->bd_queue); > stat->attributes |= STATX_ATTR_WRITE_ATOMIC; > stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC; > stat->result_mask |= STATX_WRITE_ATOMIC; The result_mask (which becomes the statx stx_mask) needs to have STATX_WRITE_ATOMIC set any time a filesystem responds to STATX_WRITE_ATOMIC being set in the request_mask, even if the response is "not supported". The attributes_mask also needs to have STATX_ATTR_WRITE_ATOMIC set if the filesystem+file can support the flag, even if it's not currently set for that file. This should get turned into a generic vfs helper for the next fs that wants to support atomic write units: static void generic_fill_statx_atomic_writes(struct kstat *stat, struct block_device *bdev) { u64 min_bytes; /* Confirm that the fs driver knows about this statx request */ stat->result_mask |= STATX_WRITE_ATOMIC; /* Confirm that the file attribute is known to the fs. */ stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC; /* Fill out the rest of the atomic write fields if supported */ min_bytes = queue_atomic_write_unit_min_bytes(bdev->bd_queue); if (min_bytes == 0) return; stat->atomic_write_unit_min = min_bytes; stat->atomic_write_unit_max = queue_atomic_write_unit_max_bytes(bdev->bd_queue); /* Atomic writes actually supported on this file. */ stat->attributes |= STATX_ATTR_WRITE_ATOMIC; } and then: if (request_mask & STATX_WRITE_ATOMIC) generic_fill_statx_atomic_writes(stat, bdev); > } > > Thanks, > John
On 03/10/2023 16:46, Darrick J. Wong wrote: >> stat->result_mask |= STATX_WRITE_ATOMIC; > The result_mask (which becomes the statx stx_mask) needs to have > STATX_WRITE_ATOMIC set any time a filesystem responds to > STATX_WRITE_ATOMIC being set in the request_mask, even if the response > is "not supported". > > The attributes_mask also needs to have STATX_ATTR_WRITE_ATOMIC set if > the filesystem+file can support the flag, even if it's not currently set > for that file. This should get turned into a generic vfs helper for the > next fs that wants to support atomic write units: > > static void generic_fill_statx_atomic_writes(struct kstat *stat, > struct block_device *bdev) > { > u64 min_bytes; > > /* Confirm that the fs driver knows about this statx request */ > stat->result_mask |= STATX_WRITE_ATOMIC; > > /* Confirm that the file attribute is known to the fs. */ > stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC; > > /* Fill out the rest of the atomic write fields if supported */ > min_bytes = queue_atomic_write_unit_min_bytes(bdev->bd_queue); > if (min_bytes == 0) > return; > > stat->atomic_write_unit_min = min_bytes; > stat->atomic_write_unit_max = > queue_atomic_write_unit_max_bytes(bdev->bd_queue); > > /* Atomic writes actually supported on this file. */ > stat->attributes |= STATX_ATTR_WRITE_ATOMIC; > } > > and then: > > if (request_mask & STATX_WRITE_ATOMIC) > generic_fill_statx_atomic_writes(stat, bdev); > > That looks sensible, but, if used by an FS, we would still need a method to include extra FS restrictions, like extent alignment as in 15/21. Thanks, John
On Mon, Oct 02, 2023 at 08:28:04PM -0400, Martin K. Petersen wrote: > > Bart, > > > Neither the SCSI SBC standard nor the NVMe standard defines a "minimum > > atomic write unit". So why to introduce something in the Linux kernel > > that is not defined in common storage standards? > > >From SBC-5: > > "The ATOMIC TRANSFER LENGTH GRANULARITY field indicates the minimum > transfer length for an atomic write command." I would suggest that we don't try to claim any atomic write capability if this is not a logical block as such devices are completely useless. In fact I'd add a big warning to the kernel log if a device claims this, as this breaks all the implicit assumptions that a single logical block write is atomic.
diff --git a/block/bdev.c b/block/bdev.c index f3b13aa1b7d4..037a3d9ecbcb 100644 --- a/block/bdev.c +++ b/block/bdev.c @@ -1028,24 +1028,37 @@ void sync_bdevs(bool wait) iput(old_inode); } +#define BDEV_STATX_SUPPORTED_MSK (STATX_DIOALIGN | STATX_WRITE_ATOMIC) + /* - * Handle STATX_DIOALIGN for block devices. - * - * Note that the inode passed to this is the inode of a block device node file, - * not the block device's internal inode. Therefore it is *not* valid to use - * I_BDEV() here; the block device has to be looked up by i_rdev instead. + * Handle STATX_{DIOALIGN, WRITE_ATOMIC} for block devices. */ -void bdev_statx_dioalign(struct inode *inode, struct kstat *stat) +void bdev_statx(struct dentry *dentry, struct kstat *stat, u32 request_mask) { struct block_device *bdev; - bdev = blkdev_get_no_open(inode->i_rdev); + if (!(request_mask & BDEV_STATX_SUPPORTED_MSK)) + return; + + bdev = blkdev_get_no_open(d_backing_inode(dentry)->i_rdev); if (!bdev) return; - stat->dio_mem_align = bdev_dma_alignment(bdev) + 1; - stat->dio_offset_align = bdev_logical_block_size(bdev); - stat->result_mask |= STATX_DIOALIGN; + if (request_mask & STATX_DIOALIGN) { + stat->dio_mem_align = bdev_dma_alignment(bdev) + 1; + stat->dio_offset_align = bdev_logical_block_size(bdev); + stat->result_mask |= STATX_DIOALIGN; + } + + if (request_mask & STATX_WRITE_ATOMIC) { + stat->atomic_write_unit_min = + queue_atomic_write_unit_min_bytes(bdev->bd_queue); + stat->atomic_write_unit_max = + queue_atomic_write_unit_max_bytes(bdev->bd_queue); + stat->attributes |= STATX_ATTR_WRITE_ATOMIC; + stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC; + stat->result_mask |= STATX_WRITE_ATOMIC; + } blkdev_put_no_open(bdev); } diff --git a/fs/stat.c b/fs/stat.c index d43a5cc1bfa4..b840e58f41fa 100644 --- a/fs/stat.c +++ b/fs/stat.c @@ -250,13 +250,12 @@ static int vfs_statx(int dfd, struct filename *filename, int flags, stat->attributes |= STATX_ATTR_MOUNT_ROOT; stat->attributes_mask |= STATX_ATTR_MOUNT_ROOT; - /* Handle STATX_DIOALIGN for block devices. */ - if (request_mask & STATX_DIOALIGN) { - struct inode *inode = d_backing_inode(path.dentry); - - if (S_ISBLK(inode->i_mode)) - bdev_statx_dioalign(inode, stat); - } + /* If this is a block device inode, override the filesystem + * attributes with the block device specific parameters + * that need to be obtained from the bdev backing inode + */ + if (S_ISBLK(d_backing_inode(path.dentry)->i_mode)) + bdev_statx(path.dentry, stat, request_mask); path_put(&path); if (retry_estale(error, lookup_flags)) { @@ -649,6 +648,8 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer) tmp.stx_mnt_id = stat->mnt_id; tmp.stx_dio_mem_align = stat->dio_mem_align; tmp.stx_dio_offset_align = stat->dio_offset_align; + tmp.stx_atomic_write_unit_min = stat->atomic_write_unit_min; + tmp.stx_atomic_write_unit_max = stat->atomic_write_unit_max; return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0; } diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index c10e47bdb34f..f70988083734 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -1533,7 +1533,7 @@ int sync_blockdev(struct block_device *bdev); int sync_blockdev_range(struct block_device *bdev, loff_t lstart, loff_t lend); int sync_blockdev_nowait(struct block_device *bdev); void sync_bdevs(bool wait); -void bdev_statx_dioalign(struct inode *inode, struct kstat *stat); +void bdev_statx(struct dentry *dentry, struct kstat *stat, u32 request_mask); void printk_all_partitions(void); int __init early_lookup_bdev(const char *pathname, dev_t *dev); #else @@ -1551,7 +1551,7 @@ static inline int sync_blockdev_nowait(struct block_device *bdev) static inline void sync_bdevs(bool wait) { } -static inline void bdev_statx_dioalign(struct inode *inode, struct kstat *stat) +static inline void bdev_statx(struct dentry *dentry, struct kstat *stat, u32 request_mask) { } static inline void printk_all_partitions(void) diff --git a/include/linux/stat.h b/include/linux/stat.h index 52150570d37a..dfa69ecfaacf 100644 --- a/include/linux/stat.h +++ b/include/linux/stat.h @@ -53,6 +53,8 @@ struct kstat { u32 dio_mem_align; u32 dio_offset_align; u64 change_cookie; + u32 atomic_write_unit_max; + u32 atomic_write_unit_min; }; /* These definitions are internal to the kernel for now. Mainly used by nfsd. */ diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h index 7cab2c65d3d7..c99d7cac2aa6 100644 --- a/include/uapi/linux/stat.h +++ b/include/uapi/linux/stat.h @@ -127,7 +127,10 @@ struct statx { __u32 stx_dio_mem_align; /* Memory buffer alignment for direct I/O */ __u32 stx_dio_offset_align; /* File offset alignment for direct I/O */ /* 0xa0 */ - __u64 __spare3[12]; /* Spare space for future expansion */ + __u32 stx_atomic_write_unit_max; + __u32 stx_atomic_write_unit_min; + /* 0xb0 */ + __u64 __spare3[11]; /* Spare space for future expansion */ /* 0x100 */ }; @@ -154,6 +157,7 @@ struct statx { #define STATX_BTIME 0x00000800U /* Want/got stx_btime */ #define STATX_MNT_ID 0x00001000U /* Got stx_mnt_id */ #define STATX_DIOALIGN 0x00002000U /* Want/got direct I/O alignment info */ +#define STATX_WRITE_ATOMIC 0x00004000U /* Want/got atomic_write_* fields */ #define STATX__RESERVED 0x80000000U /* Reserved for future struct statx expansion */ @@ -189,6 +193,7 @@ struct statx { #define STATX_ATTR_MOUNT_ROOT 0x00002000 /* Root of a mount */ #define STATX_ATTR_VERITY 0x00100000 /* [I] Verity protected file */ #define STATX_ATTR_DAX 0x00200000 /* File is currently in DAX state */ +#define STATX_ATTR_WRITE_ATOMIC 0x00400000 /* File supports atomic write operations */ #endif /* _UAPI_LINUX_STAT_H */