
[v3,00/15] block atomic writes

Message ID 20240124113841.31824-1-john.g.garry@oracle.com (mailing list archive)

Message

John Garry Jan. 24, 2024, 11:38 a.m. UTC
This series introduces a proposal for implementing atomic writes in the
kernel for torn-write protection.

This series takes the approach of adding a new "atomic" flag to each of
pwritev2() and iocb->ki_flags - RWF_ATOMIC and IOCB_ATOMIC, respectively.
When set, these indicate that we want the write issued "atomically".
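
For illustration only, a minimal userspace sketch of how an application
might issue such a write. atomic_pwrite() is a hypothetical helper, not
part of this series. The fallback value for RWF_ATOMIC is the one used by
later mainline kernels and is an assumption here; the value in this series
may differ.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Assumption: fall back to the value used by later mainline kernels when
 * the installed uapi headers do not define RWF_ATOMIC. The value used by
 * this series may differ.
 */
#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040
#endif

/*
 * Hypothetical helper: issue one buffer as a single atomic write.
 * fd is expected to be a block device opened with O_DIRECT, and
 * buf/len/off must also satisfy the usual direct I/O alignment rules.
 * Returns the number of bytes written, or -1 with errno set, for example
 * when the kernel or the file does not support RWF_ATOMIC.
 */
static ssize_t atomic_pwrite(int fd, void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };

	assert(buf != NULL && len > 0);

	return pwritev2(fd, &iov, 1, off, RWF_ATOMIC);
}
```

A caller would typically allocate buf with posix_memalign() and treat a
return of -1 as "atomic write not possible" rather than silently retrying
without the flag, since that would lose the torn-write guarantee.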

Only direct I/O on block devices is supported here. This requires HW
support for atomic writes, like SCSI WRITE ATOMIC (16).

I plan to send a series for supporting atomic writes for XFS later this
week, but initially only for XFS rtvol.

Updated man pages have been posted at:
https://lore.kernel.org/lkml/20240124112731.28579-1-john.g.garry@oracle.com/T/#m520dca97a9748de352b5a723d3155a4bb1e46456

The goal here is to provide an interface that allows applications to use
application-specific block sizes larger than the logical block size
reported by the storage device or larger than the filesystem block size as
reported by stat().

With this new interface, application blocks will never be torn or
fractured when written. On a power failure, for each individual application
block, either all or none of the data will be written. A racing atomic
write and read will mean that the read sees all the old data or all the new
data, but never a mix of old and new.

Three new fields are added to struct statx - atomic_write_unit_min,
atomic_write_unit_max, and atomic_write_segments_max. For each individual
atomic write, the total length of the write must be a power-of-2 between
atomic_write_unit_min and atomic_write_unit_max, inclusive. The write must
also be at a natural offset in the file with respect to the write length,
i.e. the offset must be a multiple of the length. For pwritev2(), iovcnt is
limited by atomic_write_segments_max.
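
The length and offset rules above can be sketched as a small userspace
check. atomic_write_valid() is a hypothetical helper for illustration and
is not the kernel's implementation. unit_min and unit_max stand for the
values an application would read from statx. Treating a zero unit_min as
"atomic writes not supported" is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the rules in the cover letter.
 * pos and len are the file offset and total length of the write in bytes.
 */
static bool atomic_write_valid(uint64_t pos, uint64_t len,
			       uint32_t unit_min, uint32_t unit_max)
{
	/* Assumption: unit_min == 0 means atomic writes are unsupported. */
	if (unit_min == 0)
		return false;

	/* Length must lie within [unit_min, unit_max], inclusive. */
	if (len < unit_min || len > unit_max)
		return false;

	/* Length must be a power of two. */
	if (len & (len - 1))
		return false;

	/* Natural alignment: the offset is a multiple of the length. */
	return (pos & (len - 1)) == 0;
}
```

For example, with unit_min = 4096 and unit_max = 65536, a 16 KiB write at
offset 16384 passes this check, while the same write at offset 4096 fails
the alignment rule and a 12 KiB write fails the power-of-2 rule.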

SCSI sd.c and scsi_debug and NVMe kernel support is added.

This series is based on v6.8-rc1.

Changes since v2:
- Support atomic_write_segments_max
- Limit atomic write parameters to max_hw_sectors_kb
- Don't increase fmode_t
- Change value for RWF_ATOMIC
- Various tidying (including advised by Jan)

Changes since v1:
- Drop XFS support for now
- Tidy NVMe changes and also add checks for atomic write violating max
  AW PF length and boundary (if any)
- Reject - instead of ignoring - RWF_ATOMIC for files which do not
  support atomic writes
- Update block sysfs documentation
- Various tidy-ups

Alan Adamson (2):
  nvme: Support atomic writes
  nvme: Ensure atomic writes will be executed atomically

Himanshu Madhani (2):
  block: Add atomic write operations to request_queue limits
  block: Add REQ_ATOMIC flag

John Garry (9):
  block: Limit atomic writes according to bio and queue limits
  block: Pass blk_queue_get_max_sectors() a request pointer
  block: Limit atomic write IO size according to
    atomic_write_max_sectors
  block: Error an attempt to split an atomic write bio
  block: Add checks to merging of atomic writes
  block: Add fops atomic write support
  scsi: sd: Support reading atomic write properties from block limits
    VPD
  scsi: sd: Add WRITE_ATOMIC_16 support
  scsi: scsi_debug: Atomic write support

Prasad Singamsetty (2):
  fs/bdev: Add atomic write support info to statx
  fs: Add RWF_ATOMIC and IOCB_ATOMIC flags for atomic write support

 Documentation/ABI/stable/sysfs-block |  52 +++
 block/bdev.c                         |  37 +-
 block/blk-merge.c                    |  94 ++++-
 block/blk-mq.c                       |   2 +-
 block/blk-settings.c                 | 103 +++++
 block/blk-sysfs.c                    |  33 ++
 block/blk.h                          |   9 +-
 block/fops.c                         |  44 +-
 drivers/nvme/host/core.c             |  71 ++++
 drivers/nvme/host/nvme.h             |   2 +
 drivers/scsi/scsi_debug.c            | 589 +++++++++++++++++++++------
 drivers/scsi/scsi_trace.c            |  22 +
 drivers/scsi/sd.c                    |  93 ++++-
 drivers/scsi/sd.h                    |   8 +
 fs/stat.c                            |  47 ++-
 include/linux/blk_types.h            |   2 +
 include/linux/blkdev.h               |  45 +-
 include/linux/fs.h                   |  12 +
 include/linux/stat.h                 |   3 +
 include/scsi/scsi_proto.h            |   1 +
 include/trace/events/scsi.h          |   1 +
 include/uapi/linux/fs.h              |   5 +-
 include/uapi/linux/stat.h            |   9 +-
 23 files changed, 1123 insertions(+), 161 deletions(-)

Comments

Christoph Hellwig Jan. 29, 2024, 6:18 a.m. UTC | #1
Do you have a git tree with all patches somewhere?
John Garry Jan. 29, 2024, 9:17 a.m. UTC | #2
On 29/01/2024 06:18, Christoph Hellwig wrote:
> Do you have a git tree with all patches somewhere?
>

They should apply cleanly on v6.8-rc1, but you can also check 
https://github.com/johnpgarry/linux/commits/atomic-writes-v6.8-v3/ for 
this series. The XFS series is at top and can be found at 
https://github.com/johnpgarry/linux/tree/atomic-writes-v6.8-v3-fs

Cheers,
John
John Garry Feb. 6, 2024, 6:44 p.m. UTC | #3
On 29/01/2024 06:18, Christoph Hellwig wrote:
> Do you have a git tree with all patches somewhere?
> 

Hi Christoph,

Please let me know if you had a chance to look at this series or what 
your plans are.

BTW, about testing, it would be good to know your thoughts on power-fail 
testing.

I have done much testing for ensuring that writes are properly issued to 
HW with no undesired splitting/merging, etc for normal operation. I have 
also tested crashing the kernel only to see if atomic writes get 
corrupted. This all looks ok.

About PF testing, I have an NVMe M.2 drive, but it supports just 4K 
nawupf. In addition, unfortunately the port on my machine does not allow 
me to power it off, so I need to plug out the power cable to test PF :(

We do also support atomic writes on our SCSI storage servers, but it is 
not practically possible to PF them.

For actual PF testing, I have been using fio in crc64 verify mode with a 
couple of tweaks to support atomic writes.

What I find from limited testing for XFS and bdev atomic writes on that 
NVMe card is that indeed 4K writes are PF-safe, but 16K (this is an 
arbitrary large block size which I chose) is not. But I think all cards 
will be 4K PF safe, even if not declared.

Thanks,
John
David Laight Feb. 10, 2024, 12:12 p.m. UTC | #4
Can someone add a : after block?
