[v5,00/10] block atomic writes

Message ID 20240226173612.1478858-1-john.g.garry@oracle.com (mailing list archive)

Message

John Garry Feb. 26, 2024, 5:36 p.m. UTC
This series introduces a proposal to implement atomic writes in the
kernel for torn-write protection.

This series takes the approach of adding a new "atomic" flag to each of
pwritev2() and iocb->ki_flags - RWF_ATOMIC and IOCB_ATOMIC, respectively.
When set, these indicate that we want the write issued "atomically".

Only direct IO is supported, and only for block devices here. For this,
atomic write HW is required, like SCSI ATOMIC WRITE (16).

XFS FS support will require rework according to discussion at:
https://lore.kernel.org/linux-fsdevel/20240124142645.9334-1-john.g.garry@oracle.com/T/#m916df899e9d0fb688cdbd415826ae2423306c2e0

The current plan there is to use the forcealign feature from the start.
This will take a bit more time to get done.

Updated man pages have been posted at:
https://lore.kernel.org/lkml/20240124112731.28579-1-john.g.garry@oracle.com/T/#m520dca97a9748de352b5a723d3155a4bb1e46456

The goal here is to provide an interface that allows applications to use
application-specific block sizes larger than the logical block size
reported by the storage device or larger than the filesystem block size
as reported by stat().

With this new interface, application blocks will never be torn or
fractured when written. On a power failure, for each individual application
block, either all or none of the data is written. A racing atomic write and
read will mean that the read sees all the old data or all the new data,
but never a mix of old and new.

Three new fields are added to struct statx - atomic_write_unit_min,
atomic_write_unit_max, and atomic_write_segments_max. For each individual
atomic write, the total length of the write must be between
atomic_write_unit_min and atomic_write_unit_max, inclusive, and a
power-of-2. The write must also be naturally aligned, i.e. the file
offset must be a multiple of the write length. For pwritev2, iovcnt is
limited by atomic_write_segments_max.
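
The rules in the paragraph above can be sketched as a userspace pre-check.
This is only an illustration: atomic_write_params_ok() and its parameter
names are hypothetical and are not part of the series. It assumes the caller
has already read the three limits from statx.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only. */
static bool is_power_of_2(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Check a prospective atomic write against the limits described above:
 * - total length is a power-of-2
 * - unit_min <= length <= unit_max
 * - file offset is a multiple of the length (natural alignment)
 * - iovcnt does not exceed segments_max
 */
bool atomic_write_params_ok(uint64_t pos, uint64_t len,
			    uint32_t unit_min, uint32_t unit_max,
			    int iovcnt, uint32_t segments_max)
{
	if (!is_power_of_2(len))
		return false;
	if (len < unit_min || len > unit_max)
		return false;
	/* len is a power-of-2 here, so a mask test is enough */
	if (pos & (len - 1))
		return false;
	if (iovcnt < 1 || (uint64_t)iovcnt > segments_max)
		return false;
	return true;
}
```

A check like this only mirrors the documented constraints; the kernel
remains the authority on whether a given write is accepted.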

Kernel support is added for SCSI (sd.c and scsi_debug) and NVMe.

This series is based on v6.8-rc6

Patches can be found at:
https://github.com/johnpgarry/linux/commits/atomic-writes-v6.8-v5

Changes since v4:
- Finally combine both NVMe patches
- Pass inode to bdev_statx() (Ritesh)
- Add IOCB_ATOMIC to TRACE_IOCB_STRINGS (Ritesh)
- Make rq_straddles_atomic_write_boundary() size+sign safe (Dave) and
  simplify (Ritesh)
- Improve generic_fill_statx_atomic_writes() doc (Dave, Ritesh) and use
  GPL export (Christoph)
- Drop BDEV_STATX_SUPPORTED_MASK and improve bdev_statx() comments
  (Christoph)
- Tweak atomic_write_valid() flow and use IS_ALIGNED (Dave) and
  also rename to generic_atomic_write_valid() (Ritesh)
- Fix module param in scsi_debug (Ojaswin)
- Tweak blkdev_direct_IO() patch to pass bdev (Keith mentioned idea)
- Some smaller code style changes, like variable renames (Ritesh)
- Restructure first block layer patch commit message (Ritesh)
- Add more RB tags (thanks)


Alan Adamson (1):
  nvme: Atomic write support

John Garry (6):
  block: Pass blk_queue_get_max_sectors() a request pointer
  block: Call blkdev_dio_unaligned() from blkdev_direct_IO()
  block: Add core atomic write support
  block: Add fops atomic write support
  scsi: sd: Atomic write support
  scsi: scsi_debug: Atomic write support

Prasad Singamsetty (3):
  fs: Initial atomic write support
  fs: Add initial atomic write support info to statx
  block: Add atomic write support for statx

 Documentation/ABI/stable/sysfs-block |  52 +++
 block/bdev.c                         |  36 +-
 block/blk-merge.c                    |  98 ++++-
 block/blk-mq.c                       |   2 +-
 block/blk-settings.c                 | 101 +++++
 block/blk-sysfs.c                    |  33 ++
 block/blk.h                          |   9 +-
 block/fops.c                         |  57 ++-
 drivers/nvme/host/core.c             |  72 ++++
 drivers/scsi/scsi_debug.c            | 588 +++++++++++++++++++++------
 drivers/scsi/scsi_trace.c            |  22 +
 drivers/scsi/sd.c                    |  93 ++++-
 drivers/scsi/sd.h                    |   8 +
 fs/aio.c                             |   8 +-
 fs/btrfs/ioctl.c                     |   2 +-
 fs/read_write.c                      |   2 +-
 fs/stat.c                            |  50 ++-
 include/linux/blk_types.h            |   3 +-
 include/linux/blkdev.h               |  67 ++-
 include/linux/fs.h                   |  40 +-
 include/linux/stat.h                 |   3 +
 include/scsi/scsi_proto.h            |   1 +
 include/trace/events/scsi.h          |   1 +
 include/uapi/linux/fs.h              |   5 +-
 include/uapi/linux/stat.h            |   9 +-
 io_uring/rw.c                        |   4 +-
 26 files changed, 1177 insertions(+), 189 deletions(-)

Comments

Matthew Wilcox March 5, 2024, 11:10 p.m. UTC | #1
On Mon, Feb 26, 2024 at 05:36:02PM +0000, John Garry wrote:
> This series introduces a proposal to implementing atomic writes in the
> kernel for torn-write protection.

The API as documented will be unnecessarily complicated to implement
for buffered writes, I believe.  What I would prefer is a chattr (or, I
guess, setxattr these days) that sets the tearing boundary for the file.
The page cache can absorb writes of arbitrary size and alignment, but
will be able to guarantee that (if the storage supports it), the only
write tearing will happen on the specified boundary.

We _can_ support arbitrary power-of-two write sizes to the page cache,
but if the requirement is no tearing inside a single write, then we
will have to do a lot of work to make that true.  It isn't clear to me
that anybody is asking for this; the databases I'm aware of are willing
to submit 128kB writes and accept that there may be tearing at 16kB
boundaries (or whatever).

John Garry March 6, 2024, 9:05 a.m. UTC | #2
On 05/03/2024 23:10, Matthew Wilcox wrote:
> On Mon, Feb 26, 2024 at 05:36:02PM +0000, John Garry wrote:
>> This series introduces a proposal to implementing atomic writes in the
>> kernel for torn-write protection.
> 
> The API as documented will be unnecessarily complicated to implement
> for buffered writes, I believe.  What I would prefer is a chattr (or, I
> guess, setxattr these days) that sets the tearing boundary for the file.
> The page cache can absorb writes of arbitrary size and alignment, but
> will be able to guarantee that (if the storage supports it), the only
> write tearing will happen on the specified boundary.

In the "block atomic writes for XFS" series which I sent on Monday, we 
do use setxattr to set the extent alignment for an inode. It is not a 
tearing boundary, but rather effectively sets the max atomic write 
size for the inode. This extent size must be a power-of-2. From this we 
can support atomic write sizes of [FS block size, extent size] for 
direct IO.

For bdev file operations atomic write support in this series for direct 
IO, atomic write size is limited by the HW support only.

> 
> We _can_ support arbitrary power-of-two write sizes to the page cache,
> but if the requirement is no tearing inside a single write, then we
> will have to do a lot of work to make that true.  It isn't clear to me
> that anybody is asking for this; the databases I'm aware of are willing
> to submit 128kB writes and accept that there may be tearing at 16kB
> boundaries (or whatever).

In this case, I would expect the DB to submit 8x separate 16KB writes. 
However if we advertise a range of supported sizes, userspace is 
entitled to use that, i.e. they could submit a single 128kB write, if 
supported.
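
To make the "8x separate 16KB writes" case concrete, here is a rough sketch
of how an application might compute the per-write offsets. The name
split_into_units() is hypothetical, and the sketch assumes each resulting
chunk would then be issued as its own atomic write.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only: split [pos, pos + len)
 * into unit-sized chunks and record the start offset of each chunk.
 * Returns the number of chunks, or 0 if the range is not unit-aligned
 * or does not fit in the offsets array.
 */
size_t split_into_units(uint64_t pos, uint64_t len, uint64_t unit,
			uint64_t *offsets, size_t max_offsets)
{
	size_t n = 0;

	if (unit == 0 || pos % unit || len % unit)
		return 0;

	for (uint64_t off = pos; off < pos + len; off += unit) {
		if (n == max_offsets)
			return 0;
		offsets[n++] = off;
	}
	return n;
}
```

With a 128kB range and a 16KB unit this yields eight offsets; tearing could
then only occur between chunks, never inside one.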

As for supporting buffered atomic writes, the very simplest solution for 
regular FS files is to fix the atomic write min and max size at the 
extent size, above. Indeed, that might solve most or even all usecases. 
This is effectively the same as your idea to set a boundary size, except 
that userspace must submit individual 16KB writes for the above example. 
As for bdev file operations, extent sizes are not a thing, so that is 
still a problem.

Having said all this, from discussion "[LSF/MM/BPF TOPIC] untorn 
buffered writes", I was hearing that we can use a high-order folio for 
RWF_ATOMIC data and it would be just a matter of implementing support in 
the page cache, like dealing with already-present overlapping smaller 
folios - is implementing this now the concern?

Thanks,
John