[v2,00/16] block atomic writes

Message ID 20231212110844.19698-1-john.g.garry@oracle.com (mailing list archive)

Message

John Garry Dec. 12, 2023, 11:08 a.m. UTC
This series introduces a proposal for implementing atomic writes in the
kernel for torn-write protection.

This series takes the approach of adding a new "atomic" flag to each of
pwritev2() and iocb->ki_flags - RWF_ATOMIC and IOCB_ATOMIC, respectively.
When set, these indicate that we want the write issued "atomically".

Only direct IO to block devices is supported here. For this, atomic write
HW support is required, like SCSI WRITE ATOMIC (16). For now, the XFS
support from earlier versions is sidelined, until an interface is agreed
which does not fully rely on HW offload support.

man pages update has been posted at:
https://lore.kernel.org/linux-api/20230929093717.2972367-1-john.g.garry@oracle.com/T/#t
(not updated since I posted v1 kernel series)

The goal here is to provide an interface that allows applications to use
application-specific block sizes larger than the logical block size
reported by the storage device or larger than the filesystem block size as
reported by stat().

With this new interface, application blocks will never be torn or
fractured when written. On a power failure, for each individual application
block, either all or none of the data will be written. A racing atomic
write and read will mean that the read sees all the old data or all the new
data, but never a mix of old and new.

Two new fields are added to struct statx - atomic_write_unit_min and
atomic_write_unit_max. For each individual atomic write, the total length
of the write must be between atomic_write_unit_min and
atomic_write_unit_max, inclusive, and a power-of-2. The write must also be
at a natural offset in the file wrt the write length.

SCSI sd.c and scsi_debug and NVMe kernel support is added.
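
To make the intended usage concrete, a rough userspace sketch follows: query
the per-file limits via statx() and issue the write with
pwritev2(..., RWF_ATOMIC). RWF_ATOMIC and the statx field/mask names come
from this series' uapi updates, so the exact names used below are
assumptions rather than a settled API.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/uio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	size_t len = 16384;	/* power-of-2, within [unit_min, unit_max] */
	struct statx stx;
	struct iovec iov;
	void *buf;
	ssize_t ret;
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDWR | O_DIRECT);
	if (fd < 0)
		return 1;

	/* stx_atomic_write_unit_{min,max} and the mask bit are the new
	   statx additions from this series (exact names assumed) */
	if (statx(AT_FDCWD, argv[1], 0,
		  STATX_BASIC_STATS | STATX_WRITE_ATOMIC, &stx))
		return 1;
	printf("unit_min %u unit_max %u\n",
	       stx.stx_atomic_write_unit_min, stx.stx_atomic_write_unit_max);

	if (posix_memalign(&buf, 4096, len))	/* iovec alignment, see below */
		return 1;
	memset(buf, 'x', len);
	iov.iov_base = buf;
	iov.iov_len = len;

	/* offset 0 is naturally aligned for any power-of-2 length */
	ret = pwritev2(fd, &iov, 1, 0, RWF_ATOMIC);
	printf("pwritev2: %zd\n", ret);
	return ret == (ssize_t)len ? 0 : 1;
}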

Some open questions:
- How to make API extensible for when we have no HW support? In that case,
  we would prob not have to follow rule of power-of-2 length et al.
  As a possible solution, maybe we can say that atomic writes are
  supported for the file via statx, but not set unit_min and max values,
  and this means that writes need to be just FS block aligned there.
- For block layer, should atomic_write_unit_max be limited by
  max_sectors_kb? Currently it is not.
- How to improve requirement that iovecs are PAGE-aligned.
  There are 2x issues:
  a. We impose this rule to not split BIOs due to virt boundary for
     NVMe, but there virt boundary is 4K (and not PAGE size, so broken for
     16K/64K pages). Easy solution is to impose requirement that iovecs
     are 4K-aligned.
  b. We don't enforce this rule for virt boundary == 0, i.e. SCSI
- Since debugging torn-writes due to unwanted kernel BIO splitting/merging
  would be horrible, should we add some kernel storage stack software
  integrity checks? (A sketch of the kind of check meant follows this list.)
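
As a minimal sketch of such a check, assuming the REQ_ATOMIC flag and the
queue_atomic_write_unit_max_bytes() helper added by this series (the
function below is purely illustrative and is not the 16/16 patch):

#include <linux/blkdev.h>
#include <linux/log2.h>

/* Illustrative dispatch-time sanity check: has a REQ_ATOMIC request been
 * split or merged beyond the advertised atomic write unit limits? */
static bool blk_rq_atomic_write_sane(struct request *rq)
{
	unsigned int bytes = blk_rq_bytes(rq);

	if (!(rq->cmd_flags & REQ_ATOMIC))
		return true;
	if (!is_power_of_2(bytes) ||
	    bytes > queue_atomic_write_unit_max_bytes(rq->q))
		return false;
	/* start LBA must still be naturally aligned wrt the write length */
	return !(blk_rq_pos(rq) & ((bytes >> SECTOR_SHIFT) - 1));
}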

This series is based on v6.7-rc5.

Changes since v1:
- Drop XFS support for now
- Tidy NVMe changes and also add checks for atomic write violating max
  AW PF length and boundary (if any)
- Reject - instead of ignoring - RWF_ATOMIC for files which do not
  support atomic writes
- Update block sysfs documentation
- Various tidy-ups

Alan Adamson (2):
  nvme: Support atomic writes
  nvme: Ensure atomic writes will be executed atomically

Himanshu Madhani (2):
  block: Add atomic write operations to request_queue limits
  block: Add REQ_ATOMIC flag

John Garry (10):
  block: Limit atomic writes according to bio and queue limits
  fs: Increase fmode_t size
  block: Pass blk_queue_get_max_sectors() a request pointer
  block: Limit atomic write IO size according to
    atomic_write_max_sectors
  block: Error an attempt to split an atomic write bio
  block: Add checks to merging of atomic writes
  block: Add fops atomic write support
  scsi: sd: Support reading atomic write properties from block limits
    VPD
  scsi: sd: Add WRITE_ATOMIC_16 support
  scsi: scsi_debug: Atomic write support

Prasad Singamsetty (2):
  fs/bdev: Add atomic write support info to statx
  fs: Add RWF_ATOMIC and IOCB_ATOMIC flags for atomic write support

 Documentation/ABI/stable/sysfs-block |  47 +++
 block/bdev.c                         |  31 +-
 block/blk-merge.c                    |  95 ++++-
 block/blk-mq.c                       |   2 +-
 block/blk-settings.c                 |  84 ++++
 block/blk-sysfs.c                    |  33 ++
 block/blk.h                          |   9 +-
 block/fops.c                         |  40 +-
 drivers/dma-buf/dma-buf.c            |   2 +-
 drivers/nvme/host/core.c             | 108 ++++-
 drivers/nvme/host/nvme.h             |   2 +
 drivers/scsi/scsi_debug.c            | 590 +++++++++++++++++++++------
 drivers/scsi/scsi_trace.c            |  22 +
 drivers/scsi/sd.c                    |  93 ++++-
 drivers/scsi/sd.h                    |   8 +
 fs/stat.c                            |  44 +-
 include/linux/blk_types.h            |   2 +
 include/linux/blkdev.h               |  41 +-
 include/linux/fs.h                   |  11 +
 include/linux/stat.h                 |   2 +
 include/linux/types.h                |   2 +-
 include/scsi/scsi_proto.h            |   1 +
 include/trace/events/scsi.h          |   1 +
 include/uapi/linux/fs.h              |   5 +-
 include/uapi/linux/stat.h            |   7 +-
 25 files changed, 1098 insertions(+), 184 deletions(-)

Comments

Christoph Hellwig Dec. 12, 2023, 4:32 p.m. UTC | #1
On Tue, Dec 12, 2023 at 11:08:28AM +0000, John Garry wrote:
> Two new fields are added to struct statx - atomic_write_unit_min and
> atomic_write_unit_max. For each atomic individual write, the total length
> of a write must be a between atomic_write_unit_min and
> atomic_write_unit_max, inclusive, and a power-of-2. The write must also be
> at a natural offset in the file wrt the write length.
> 
> SCSI sd.c and scsi_debug and NVMe kernel support is added.
> 
> Some open questions:
> - How to make API extensible for when we have no HW support? In that case,
>   we would prob not have to follow rule of power-of-2 length et al.
>   As a possible solution, maybe we can say that atomic writes are
>   supported for the file via statx, but not set unit_min and max values,
>   and this means that writes need to be just FS block aligned there.

I don't think the power of two length is much of a problem to be
honest, and if we ever want to lift it we can still do that easily
by adding a new flag or limit.

What I'm a lot more worried about is how to tell the file system that
allocations are done right for these requirements.  There is no way
a user can know that allocations in an existing file are properly
aligned, so atomic writes will just fail on existing files.

I suspect we need an on-disk flag that forces allocations to be
aligned to the atomic write limit, in some ways similar how the
XFS rt flag works.  You'd need to set it on an empty file, and all
allocations after that are guaranteed to be properly aligned.

> - For block layer, should atomic_write_unit_max be limited by
>   max_sectors_kb? Currently it is not.

Well.  It must be limited to max_hw_sectors to actually work.
max_sectors is a software limit below that, which with modern hardware
is actually pretty silly and a real performance issue with today's
workloads when people don't tweak it.

> - How to improve requirement that iovecs are PAGE-aligned.
>   There are 2x issues:
>   a. We impose this rule to not split BIOs due to virt boundary for
>      NVMe, but there virt boundary is 4K (and not PAGE size, so broken for
>      16K/64K pages). Easy solution is to impose requirement that iovecs
>      are 4K-aligned.
>   b. We don't enforce this rule for virt boundary == 0, i.e. SCSI

.. we require any device that wants to support atomic writes to not
have that silly limit.  For NVMe that would require SGL support
(and some driver changes I've been wanting to make for long where
we always use SGLs for transfers larger than a single PRP if supported)


> - Since debugging torn-writes due to unwanted kernel BIO splitting/merging
>   would be horrible, should we add some kernel storage stack software
>   integrity checks?

Yes, I think we'll need asserts in the drivers.  At least for NVMe I
will insist on them.  For SCSI I think the device actually checks
because the atomic writes are a different command anyway, or am I
misunderstanding how SCSI works here?
John Garry Dec. 13, 2023, 9:32 a.m. UTC | #2
On 12/12/2023 16:32, Christoph Hellwig wrote:
> On Tue, Dec 12, 2023 at 11:08:28AM +0000, John Garry wrote:
>> Two new fields are added to struct statx - atomic_write_unit_min and
>> atomic_write_unit_max. For each atomic individual write, the total length
>> of a write must be a between atomic_write_unit_min and
>> atomic_write_unit_max, inclusive, and a power-of-2. The write must also be
>> at a natural offset in the file wrt the write length.
>>
>> SCSI sd.c and scsi_debug and NVMe kernel support is added.
>>
>> Some open questions:
>> - How to make API extensible for when we have no HW support? In that case,
>>    we would prob not have to follow rule of power-of-2 length et al.
>>    As a possible solution, maybe we can say that atomic writes are
>>    supported for the file via statx, but not set unit_min and max values,
>>    and this means that writes need to be just FS block aligned there.
> I don't think the power of two length is much of a problem to be
> honest, and if we every want to lift it we can still do that easily
> by adding a new flag or limit.

ok, but it would be nice to have some idea on what that flag or limit 
change would be.

> 
> What I'm a lot more worried about is how to tell the file system that
> allocations are done right for these requirement.  There is no way
> a user can know that allocations in an existing file are properly
> aligned, so atomic writes will just fail on existing files.
> 
> I suspect we need an on-disk flag that forces allocations to be
> aligned to the atomic write limit, in some ways similar how the
> XFS rt flag works.  You'd need to set it on an empty file, and all
> allocations after that are guaranteed to be properly aligned.

Hmmm... so how is this different to the XFS forcealign feature?

For XFS, I thought that your idea was to always CoW new extents for 
misaligned extents or writes which spanned multiple extents.

> 
>> - For block layer, should atomic_write_unit_max be limited by
>>    max_sectors_kb? Currently it is not.
> Well.  It must be limited to max_hw_sectors to actually work.

Sure, as max_hw_sectors may also be limited by DMA controller max 
mapping size.

> max_sectors is a software limit below that, which with modern hardware
> is actually pretty silly and a real performance issue with todays
> workloads when people don't tweak it..

Right, so we should limit atomic write queue limits to max_hw_sectors. 
But people can still tweak max_sectors, and I am inclined to say that 
atomic_write_unit_max et al should be (dynamically) limited to 
max_sectors also.

> 
>> - How to improve requirement that iovecs are PAGE-aligned.
>>    There are 2x issues:
>>    a. We impose this rule to not split BIOs due to virt boundary for
>>       NVMe, but there virt boundary is 4K (and not PAGE size, so broken for
>>       16K/64K pages). Easy solution is to impose requirement that iovecs
>>       are 4K-aligned.
>>    b. We don't enforce this rule for virt boundary == 0, i.e. SCSI
> .. we require any device that wants to support atomic writes to not
> have that silly limit.  For NVMe that would require SGL support
> (and some driver changes I've been wanting to make for long where
> we always use SGLs for transfers larger than a single PRP if supported)

If we could avoid dealing with a virt boundary, then that would be nice.

Are there any patches yet for the change to always use SGLs for 
transfers larger than a single PRP?

On a related topic, I am not sure about how - or if we even should - 
enforce iovec PAGE-alignment or length; rather, the user could just be 
advised that iovecs must be PAGE-aligned and min PAGE length to achieve 
atomic_write_unit_max.

> 
> 
>> - Since debugging torn-writes due to unwanted kernel BIO splitting/merging
>>    would be horrible, should we add some kernel storage stack software
>>    integrity checks?
> Yes, I think we'll need asserts in the drivers.  At least for NVMe I
> will insist on them. 

Please see 16/16 in this series.

> For SCSI I think the device actually checks
> because the atomic writes are a different command anyway, or am I
> misunderstanding how SCSI works here?

Right, for WRITE ATOMIC (16) the target will reject a command when it 
does not respect the device atomic write limits.

Thanks,
John
Christoph Hellwig Dec. 13, 2023, 3:44 p.m. UTC | #3
On Wed, Dec 13, 2023 at 09:32:06AM +0000, John Garry wrote:
>>> - How to make API extensible for when we have no HW support? In that case,
>>>    we would prob not have to follow rule of power-of-2 length et al.
>>>    As a possible solution, maybe we can say that atomic writes are
>>>    supported for the file via statx, but not set unit_min and max values,
>>>    and this means that writes need to be just FS block aligned there.
>> I don't think the power of two length is much of a problem to be
>> honest, and if we every want to lift it we can still do that easily
>> by adding a new flag or limit.
>
> ok, but it would be nice to have some idea on what that flag or limit 
> change would be.

That would require a concrete use case.  The simplest thing for a file
system that can or does log I/O would simply be a flag waiving all
the alignment and size requirements.

>> I suspect we need an on-disk flag that forces allocations to be
>> aligned to the atomic write limit, in some ways similar how the
>> XFS rt flag works.  You'd need to set it on an empty file, and all
>> allocations after that are guaranteed to be properly aligned.
>
> Hmmm... so how is this different to the XFS forcealign feature?

Maybe not much.  But that's not what it is about - we need a common
API for this and not some XFS internal flag.  So if this is something
we could support in ext4 as well that would be a good step.  And for
btrfs you'd probably want to support something like it in nocow mode
if people care enough, or always support atomics and write out of
place.

> For XFS, I thought that your idea was to always CoW new extents for 
> misaligned extents or writes which spanned multiple extents.

Well, that is useful for two things:

 - atomic writes on hardware that does not support it
 - atomic writes for bufferd I/O
 - supporting other sizes / alignments than the strict power of
   two above.

> Right, so we should limit atomic write queue limits to max_hw_sectors. But 
> people can still tweak max_sectors, and I am inclined to say that 
> atomic_write_unit_max et al should be (dynamically) limited to max_sectors 
> also.

Allowing people to tweak it seems to be asking for trouble.

>> have that silly limit.  For NVMe that would require SGL support
>> (and some driver changes I've been wanting to make for long where
>> we always use SGLs for transfers larger than a single PRP if supported)
>
> If we could avoid dealing with a virt boundary, then that would be nice.
>
> Are there any patches yet for the change to always use SGLs for transfers 
> larger than a single PRP?

No.

> On a related topic, I am not sure about how - or if we even should - 
> enforce iovec PAGE-alignment or length; rather, the user could just be 
> advised that iovecs must be PAGE-aligned and min PAGE length to achieve 
> atomic_write_unit_max.

Anything that just advises the user and is not clear cut and results in
an error is data loss waiting to happen.  Even more so if it differs
from device to device.
John Garry Dec. 13, 2023, 4:27 p.m. UTC | #4
On 13/12/2023 15:44, Christoph Hellwig wrote:
> On Wed, Dec 13, 2023 at 09:32:06AM +0000, John Garry wrote:
>>>> - How to make API extensible for when we have no HW support? In that case,
>>>>     we would prob not have to follow rule of power-of-2 length et al.
>>>>     As a possible solution, maybe we can say that atomic writes are
>>>>     supported for the file via statx, but not set unit_min and max values,
>>>>     and this means that writes need to be just FS block aligned there.
>>> I don't think the power of two length is much of a problem to be
>>> honest, and if we every want to lift it we can still do that easily
>>> by adding a new flag or limit.
>> ok, but it would be nice to have some idea on what that flag or limit
>> change would be.
> That would require a concrete use case.  The simples thing for a file
> system that can or does log I/O it would simply be a flag waving all
> the alignment and size requirements.

ok...

> 
>>> I suspect we need an on-disk flag that forces allocations to be
>>> aligned to the atomic write limit, in some ways similar how the
>>> XFS rt flag works.  You'd need to set it on an empty file, and all
>>> allocations after that are guaranteed to be properly aligned.
>> Hmmm... so how is this different to the XFS forcealign feature?
> Maybe not much.  But that's not what it is about - we need a common
> API for this and not some XFS internal flag.  So if this is something
> we could support in ext4 as well that would be a good step.  And for
> btrfs you'd probably want to support something like it in nocow mode
> if people care enough, or always support atomics and write out of
> place.

The forcealign feature was a bit of a case of knowing specifically what to 
do for XFS to enable atomic writes.

But some common method to configure any FS file for atomic writes (if 
supported) would be preferable, right?

> 
>> For XFS, I thought that your idea was to always CoW new extents for
>> misaligned extents or writes which spanned multiple extents.
> Well, that is useful for two things:
> 
>   - atomic writes on hardware that does not support it
>   - atomic writes for bufferd I/O
>   - supporting other sizes / alignments than the strict power of
>     two above.

Sure

> 
>> Right, so we should limit atomic write queue limits to max_hw_sectors. But
>> people can still tweak max_sectors, and I am inclined to say that
>> atomic_write_unit_max et al should be (dynamically) limited to max_sectors
>> also.
> Allowing people to tweak it seems to be asking for trouble.

I tend to agree....

Let's just limit at max_hw_sectors.

> 
>>> have that silly limit.  For NVMe that would require SGL support
>>> (and some driver changes I've been wanting to make for long where
>>> we always use SGLs for transfers larger than a single PRP if supported)
>> If we could avoid dealing with a virt boundary, then that would be nice.
>>
>> Are there any patches yet for the change to always use SGLs for transfers
>> larger than a single PRP?
> No.

As below, if we are going to enforce alignment/size of iovecs elsewhere, 
then I don't need to worry about virt boundary.

> 
>> On a related topic, I am not sure about how - or if we even should -
>> enforce iovec PAGE-alignment or length; rather, the user could just be
>> advised that iovecs must be PAGE-aligned and min PAGE length to achieve
>> atomic_write_unit_max.
> Anything that just advices the user an it not clear cut and results in
> an error is data loss waiting to happen.  Even more so if it differs
> from device to device.

ok, fine. I suppose that we better enforce it then.

Thanks,
John
Christoph Hellwig Dec. 14, 2023, 2:37 p.m. UTC | #5
On Wed, Dec 13, 2023 at 04:27:35PM +0000, John Garry wrote:
>>> Are there any patches yet for the change to always use SGLs for transfers
>>> larger than a single PRP?
>> No.

Here is the WIP version.  With that you'd need to make atomic writes
conditional on !ctrl->need_virt_boundary.

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 8ebdfd623e0f78..e04faffd6551fe 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1889,7 +1889,8 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
 		blk_queue_max_hw_sectors(q, ctrl->max_hw_sectors);
 		blk_queue_max_segments(q, min_t(u32, max_segments, USHRT_MAX));
 	}
-	blk_queue_virt_boundary(q, NVME_CTRL_PAGE_SIZE - 1);
+	if (q == ctrl->admin_q || ctrl->need_virt_boundary)
+		blk_queue_virt_boundary(q, NVME_CTRL_PAGE_SIZE - 1);
 	blk_queue_dma_alignment(q, 3);
 	blk_queue_write_cache(q, vwc, vwc);
 }
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index e7411dac00f725..aa98794a3ec53d 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -262,6 +262,7 @@ enum nvme_ctrl_flags {
 struct nvme_ctrl {
 	bool comp_seen;
 	bool identified;
+	bool need_virt_boundary;
 	enum nvme_ctrl_state state;
 	spinlock_t lock;
 	struct mutex scan_lock;
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 61af7ff1a9d6ba..a8d273b475cb40 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -60,8 +60,7 @@ MODULE_PARM_DESC(max_host_mem_size_mb,
 static unsigned int sgl_threshold = SZ_32K;
 module_param(sgl_threshold, uint, 0644);
 MODULE_PARM_DESC(sgl_threshold,
-		"Use SGLs when average request segment size is larger or equal to "
-		"this size. Use 0 to disable SGLs.");
+		"Use SGLs when > 0. Use 0 to disable SGLs.");
 
 #define NVME_PCI_MIN_QUEUE_SIZE 2
 #define NVME_PCI_MAX_QUEUE_SIZE 4095
@@ -504,23 +503,6 @@ static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
 	spin_unlock(&nvmeq->sq_lock);
 }
 
-static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req,
-				     int nseg)
-{
-	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
-	unsigned int avg_seg_size;
-
-	avg_seg_size = DIV_ROUND_UP(blk_rq_payload_bytes(req), nseg);
-
-	if (!nvme_ctrl_sgl_supported(&dev->ctrl))
-		return false;
-	if (!nvmeq->qid)
-		return false;
-	if (!sgl_threshold || avg_seg_size < sgl_threshold)
-		return false;
-	return true;
-}
-
 static void nvme_free_prps(struct nvme_dev *dev, struct request *req)
 {
 	const int last_prp = NVME_CTRL_PAGE_SIZE / sizeof(__le64) - 1;
@@ -769,12 +751,14 @@ static blk_status_t nvme_setup_sgl_simple(struct nvme_dev *dev,
 static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 		struct nvme_command *cmnd)
 {
+	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	bool sgl_supported = nvme_ctrl_sgl_supported(&dev->ctrl) &&
+			nvmeq->qid && sgl_threshold;
 	blk_status_t ret = BLK_STS_RESOURCE;
 	int rc;
 
 	if (blk_rq_nr_phys_segments(req) == 1) {
-		struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
 		struct bio_vec bv = req_bvec(req);
 
 		if (!is_pci_p2pdma_page(bv.bv_page)) {
@@ -782,8 +766,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 				return nvme_setup_prp_simple(dev, req,
 							     &cmnd->rw, &bv);
 
-			if (nvmeq->qid && sgl_threshold &&
-			    nvme_ctrl_sgl_supported(&dev->ctrl))
+			if (sgl_supported)
 				return nvme_setup_sgl_simple(dev, req,
 							     &cmnd->rw, &bv);
 		}
@@ -806,7 +789,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 		goto out_free_sg;
 	}
 
-	if (nvme_pci_use_sgls(dev, req, iod->sgt.nents))
+	if (sgl_supported)
 		ret = nvme_pci_setup_sgls(dev, req, &cmnd->rw);
 	else
 		ret = nvme_pci_setup_prps(dev, req, &cmnd->rw);
@@ -3036,6 +3019,8 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	result = nvme_init_ctrl_finish(&dev->ctrl, false);
 	if (result)
 		goto out_disable;
+	if (!nvme_ctrl_sgl_supported(&dev->ctrl))
+		dev->ctrl.need_virt_boundary = true;
 
 	nvme_dbbuf_dma_alloc(dev);
 
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 81e2621169e5d3..416a9fbcccfc74 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -838,6 +838,7 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
 	error = nvme_init_ctrl_finish(&ctrl->ctrl, false);
 	if (error)
 		goto out_quiesce_queue;
+	ctrl->ctrl.need_virt_boundary = true;
 
 	return 0;
John Garry Dec. 14, 2023, 3:46 p.m. UTC | #6
On 14/12/2023 14:37, Christoph Hellwig wrote:
> On Wed, Dec 13, 2023 at 04:27:35PM +0000, John Garry wrote:
>>>> Are there any patches yet for the change to always use SGLs for transfers
>>>> larger than a single PRP?
>>> No.
> Here is the WIP version.  With that you'd need to make atomic writes
> conditional on !ctrl->need_virt_boundary.

Cheers, I gave it a quick spin and on the surface looks ok.

# more /sys/block/nvme0n1/queue/atomic_write_unit_max_bytes
262144
# more /sys/block/nvme0n1/queue/virt_boundary_mask
0
Keith Busch Dec. 18, 2023, 10:50 p.m. UTC | #7
On Thu, Dec 14, 2023 at 03:37:09PM +0100, Christoph Hellwig wrote:
> On Wed, Dec 13, 2023 at 04:27:35PM +0000, John Garry wrote:
> >>> Are there any patches yet for the change to always use SGLs for transfers
> >>> larger than a single PRP?
> >> No.
> 
> Here is the WIP version.  With that you'd need to make atomic writes
> conditional on !ctrl->need_virt_boundary.

This looks pretty good as-is!
 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 8ebdfd623e0f78..e04faffd6551fe 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1889,7 +1889,8 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
>  		blk_queue_max_hw_sectors(q, ctrl->max_hw_sectors);
>  		blk_queue_max_segments(q, min_t(u32, max_segments, USHRT_MAX));
>  	}
> -	blk_queue_virt_boundary(q, NVME_CTRL_PAGE_SIZE - 1);
> +	if (q == ctrl->admin_q || ctrl->need_virt_boundary)
> +		blk_queue_virt_boundary(q, NVME_CTRL_PAGE_SIZE - 1);
>  	blk_queue_dma_alignment(q, 3);
>  	blk_queue_write_cache(q, vwc, vwc);
>  }
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index e7411dac00f725..aa98794a3ec53d 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -262,6 +262,7 @@ enum nvme_ctrl_flags {
>  struct nvme_ctrl {
>  	bool comp_seen;
>  	bool identified;
> +	bool need_virt_boundary;
>  	enum nvme_ctrl_state state;
>  	spinlock_t lock;
>  	struct mutex scan_lock;
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 61af7ff1a9d6ba..a8d273b475cb40 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -60,8 +60,7 @@ MODULE_PARM_DESC(max_host_mem_size_mb,
>  static unsigned int sgl_threshold = SZ_32K;
>  module_param(sgl_threshold, uint, 0644);
>  MODULE_PARM_DESC(sgl_threshold,
> -		"Use SGLs when average request segment size is larger or equal to "
> -		"this size. Use 0 to disable SGLs.");
> +		"Use SGLs when > 0. Use 0 to disable SGLs.");
>  
>  #define NVME_PCI_MIN_QUEUE_SIZE 2
>  #define NVME_PCI_MAX_QUEUE_SIZE 4095
> @@ -504,23 +503,6 @@ static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
>  	spin_unlock(&nvmeq->sq_lock);
>  }
>  
> -static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req,
> -				     int nseg)
> -{
> -	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
> -	unsigned int avg_seg_size;
> -
> -	avg_seg_size = DIV_ROUND_UP(blk_rq_payload_bytes(req), nseg);
> -
> -	if (!nvme_ctrl_sgl_supported(&dev->ctrl))
> -		return false;
> -	if (!nvmeq->qid)
> -		return false;
> -	if (!sgl_threshold || avg_seg_size < sgl_threshold)
> -		return false;
> -	return true;
> -}
> -
>  static void nvme_free_prps(struct nvme_dev *dev, struct request *req)
>  {
>  	const int last_prp = NVME_CTRL_PAGE_SIZE / sizeof(__le64) - 1;
> @@ -769,12 +751,14 @@ static blk_status_t nvme_setup_sgl_simple(struct nvme_dev *dev,
>  static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
>  		struct nvme_command *cmnd)
>  {
> +	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
>  	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> +	bool sgl_supported = nvme_ctrl_sgl_supported(&dev->ctrl) &&
> +			nvmeq->qid && sgl_threshold;
>  	blk_status_t ret = BLK_STS_RESOURCE;
>  	int rc;
>  
>  	if (blk_rq_nr_phys_segments(req) == 1) {
> -		struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
>  		struct bio_vec bv = req_bvec(req);
>  
>  		if (!is_pci_p2pdma_page(bv.bv_page)) {
> @@ -782,8 +766,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
>  				return nvme_setup_prp_simple(dev, req,
>  							     &cmnd->rw, &bv);
>  
> -			if (nvmeq->qid && sgl_threshold &&
> -			    nvme_ctrl_sgl_supported(&dev->ctrl))
> +			if (sgl_supported)
>  				return nvme_setup_sgl_simple(dev, req,
>  							     &cmnd->rw, &bv);
>  		}
> @@ -806,7 +789,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
>  		goto out_free_sg;
>  	}
>  
> -	if (nvme_pci_use_sgls(dev, req, iod->sgt.nents))
> +	if (sgl_supported)
>  		ret = nvme_pci_setup_sgls(dev, req, &cmnd->rw);
>  	else
>  		ret = nvme_pci_setup_prps(dev, req, &cmnd->rw);
> @@ -3036,6 +3019,8 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  	result = nvme_init_ctrl_finish(&dev->ctrl, false);
>  	if (result)
>  		goto out_disable;
> +	if (!nvme_ctrl_sgl_supported(&dev->ctrl))
> +		dev->ctrl.need_virt_boundary = true;
>  
>  	nvme_dbbuf_dma_alloc(dev);
>  
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 81e2621169e5d3..416a9fbcccfc74 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -838,6 +838,7 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
>  	error = nvme_init_ctrl_finish(&ctrl->ctrl, false);
>  	if (error)
>  		goto out_quiesce_queue;
> +	ctrl->ctrl.need_virt_boundary = true;
>  
>  	return 0;
>
Darrick J. Wong Dec. 19, 2023, 5:14 a.m. UTC | #8
On Wed, Dec 13, 2023 at 04:27:35PM +0000, John Garry wrote:
> On 13/12/2023 15:44, Christoph Hellwig wrote:
> > On Wed, Dec 13, 2023 at 09:32:06AM +0000, John Garry wrote:
> > > > > - How to make API extensible for when we have no HW support? In that case,
> > > > >     we would prob not have to follow rule of power-of-2 length et al.
> > > > >     As a possible solution, maybe we can say that atomic writes are
> > > > >     supported for the file via statx, but not set unit_min and max values,
> > > > >     and this means that writes need to be just FS block aligned there.
> > > > I don't think the power of two length is much of a problem to be
> > > > honest, and if we every want to lift it we can still do that easily
> > > > by adding a new flag or limit.
> > > ok, but it would be nice to have some idea on what that flag or limit
> > > change would be.
> > That would require a concrete use case.  The simples thing for a file
> > system that can or does log I/O it would simply be a flag waving all
> > the alignment and size requirements.
> 
> ok...
> 
> > 
> > > > I suspect we need an on-disk flag that forces allocations to be
> > > > aligned to the atomic write limit, in some ways similar how the
> > > > XFS rt flag works.  You'd need to set it on an empty file, and all
> > > > allocations after that are guaranteed to be properly aligned.
> > > Hmmm... so how is this different to the XFS forcealign feature?
> > Maybe not much.  But that's not what it is about - we need a common
> > API for this and not some XFS internal flag.  So if this is something
> > we could support in ext4 as well that would be a good step.  And for
> > btrfs you'd probably want to support something like it in nocow mode
> > if people care enough, or always support atomics and write out of
> > place.
> 
> The forcealign feature was a bit of case of knowing specifically what to do
> for XFS to enable atomic writes.
> 
> But some common method to configure any FS file for atomic writes (if
> supported) would be preferable, right?

/me stumbles back in with plenty of covidbrain to go around.

So ... Christoph, you're asking for a common API for
sysadmins/applications to signal to the filesystem that they want all
data allocations aligned to a given value, right?

e.g. if a nvme device advertises a capability for untorn writes between
$lbasize and 64k, then we need a way to set up each untorn-file with
some alignment between $lbasize and 64k?

or if cxl storage becomes not ung-dly expensive, then we'd want a way to
set up files with an alignment of 2M (x86) or 512M (arm64lol) to take
advantage of PMD mmap mappings?

I guess we could define a fsxattr xflag for FORCEALIGN and declare that
the fsx_extsize field mandates mapping startoff/startblock
alignment if FORCEALIGN is set.

It just happens that since fsxattr comes from XFS then it'll be easy
enough for us to wire that up.  Maybe.

Concretely, xfs would have to have a way to force the
ag/rtgroup/rtextent size to align with the maximum anticipated required
alignment (e.g. 64k) of the device so that groups (and hence per-group
allocations) don't split the alignment.  xfs would also need to
constrain the per-inode alignment to some factor of the ag size / rt
extent size.

Writing files could use hw acceleration directly if all the factors are
set up correctly; or fall back to COW.  Assuming there's a way to
require that writeback pick up all the folios in a $forcealign chunk for
simultaneous writeout.  (I think this is a lingering bug in the xfs
rtreflink patchset.)  Also, IIRC John mentioned that there might be some
programs that would rather get an EIO than fall back to COW.

ext4 could maybe sort of do this by allowing forcealign alignments up to
the bigalloc size, if bigalloc is enabled.  Assuming bigalloc isn't DOA.
IIRC bigalloc only supports power of 2 factors.  It wouldn't have the
COW problem because it only supports overwrites.

As usual, I dunno about btrfs.

So does that sound like more or less what you're getting at, Christoph?
Or have I misread the entire thing?  (PS: there's been like 1200 fsdevel
messages since I got sick and I've only had sufficient brainpower to
limp the xfs 6.8 patches across the finish line...)

--D

> > 
> > > For XFS, I thought that your idea was to always CoW new extents for
> > > misaligned extents or writes which spanned multiple extents.
> > Well, that is useful for two things:
> > 
> >   - atomic writes on hardware that does not support it
> >   - atomic writes for bufferd I/O
> >   - supporting other sizes / alignments than the strict power of
> >     two above.
> 
> Sure
> 
> > 
> > > Right, so we should limit atomic write queue limits to max_hw_sectors. But
> > > people can still tweak max_sectors, and I am inclined to say that
> > > atomic_write_unit_max et al should be (dynamically) limited to max_sectors
> > > also.
> > Allowing people to tweak it seems to be asking for trouble.
> 
> I tend to agree....
> 
> Let's just limit at max_hw_sectors.
> 
> > 
> > > > have that silly limit.  For NVMe that would require SGL support
> > > > (and some driver changes I've been wanting to make for long where
> > > > we always use SGLs for transfers larger than a single PRP if supported)
> > > If we could avoid dealing with a virt boundary, then that would be nice.
> > > 
> > > Are there any patches yet for the change to always use SGLs for transfers
> > > larger than a single PRP?
> > No.
> 
> As below, if we are going to enforce alignment/size of iovecs elsewhere,
> then I don't need to worry about virt boundary.
> 
> > 
> > > On a related topic, I am not sure about how - or if we even should -
> > > enforce iovec PAGE-alignment or length; rather, the user could just be
> > > advised that iovecs must be PAGE-aligned and min PAGE length to achieve
> > > atomic_write_unit_max.
> > Anything that just advices the user an it not clear cut and results in
> > an error is data loss waiting to happen.  Even more so if it differs
> > from device to device.
> 
> ok, fine. I suppose that we better enforce it then.
> 
> Thanks,
> John
>
Christoph Hellwig Dec. 19, 2023, 5:21 a.m. UTC | #9
On Mon, Dec 18, 2023 at 09:14:56PM -0800, Darrick J. Wong wrote:
> /me stumbles back in with plenty of covidbrain to go around.
> 
> So ... Christoph, you're asking for a common API for
> sysadmins/applications to signal to the filesystem that they want all
> data allocations aligned to a given value, right?
> 
> e.g. if a nvme device advertises a capability for untorn writes between
> $lbasize and 64k, then we need a way to set up each untorn-file with
> some alignment between $lbasize and 64k?
> 
> or if cxl storage becomes not ung-dly expensive, then we'd want a way to
> set up files with an alignment of 2M (x86) or 512M (arm64lol) to take
> advantage of PMD mmap mappings?

The most important point is to not mix these up.

If we want to use a file for atomic writes I should tell the fs about
it, and preferably in a way that does not require me to know about weird
internal implementation details of every file system.  I really just
want to use atomic writes.  Preferably I'd just start using them after
asking for the limits.  But that's obviously not going to work for
file systems that use the hardware offload and don't otherwise align
to the limit (which would suck for very small files anyway :))

So as a compromise I tell the file system before writing or otherwise
adding any data [1] to the file that I want to use atomic writes so that
the fs can take the right decisions.

[1] reflinking data into a such marked file will be ... interesting.
John Garry Dec. 19, 2023, 12:41 p.m. UTC | #10
On 19/12/2023 05:21, Christoph Hellwig wrote:
> On Mon, Dec 18, 2023 at 09:14:56PM -0800, Darrick J. Wong wrote:
>> /me stumbles back in with plenty of covidbrain to go around.
>>
>> So ... Christoph, you're asking for a common API for
>> sysadmins/applications to signal to the filesystem that they want all
>> data allocations aligned to a given value, right?
>>
>> e.g. if a nvme device advertises a capability for untorn writes between
>> $lbasize and 64k, then we need a way to set up each untorn-file with
>> some alignment between $lbasize and 64k?
>>
>> or if cxl storage becomes not ung-dly expensive, then we'd want a way to
>> set up files with an alignment of 2M (x86) or 512M (arm64lol) to take
>> advantage of PMD mmap mappings?
> 
> The most important point is to not mix these up.
> 
> If we want to use a file for atomic writes I should tell the fs about
> it, and preferably in a way that does not require me to know about weird
> internal implementation details of every file system.  I really just
> want to use atomic writes.  Preferably I'd just start using them after
> asking for the limits.  But that's obviously not going to work for
> file systems that use the hardware offload and don't otherwise align
> to the limit (which would suck for very small files anyway :))
> 
> So as a compromise I tell the file system before writing or otherwise
> adding any data [1] to file that I want to use atomic writes so that
> the fs can take the right decisions.
> 
> [1] reflinking data into a such marked file will be ... interesting.
> 

How about something based on fcntl, like below? We will prob also 
require some per-FS flag for enabling atomic writes without HW support. 
That flag might also be useful for XFS for differentiating forcealign for 
atomic writes from just forcealign.

---8<----

diff --git a/fs/fcntl.c b/fs/fcntl.c
index c80a6acad742..d4a50655565a 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -313,6 +313,15 @@ static long fcntl_rw_hint(struct file *file, unsigned int cmd,
  	}
  }

+static long fcntl_atomic_write(struct file *file, unsigned int cmd,
+			  unsigned long arg)
+{
+	if (file->f_op->enable_atomic_writes)
+		return file->f_op->enable_atomic_writes(file, arg);
+
+	return -EOPNOTSUPP;
+}
+
  static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
  		struct file *filp)
  {
@@ -419,6 +428,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
  	case F_SET_RW_HINT:
  		err = fcntl_rw_hint(filp, cmd, arg);
  		break;
+	case F_SET_ATOMIC_WRITE_SIZE:
+		err = fcntl_atomic_write(filp, cmd, arg);
+		break;
  	default:
  		break;
  	}
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e33e5e13b95f..dba206cbe1ab 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1478,6 +1478,46 @@ xfs_file_mmap(
  	return 0;
  }

+STATIC int
+xfs_enable_atomic_writes(
+	struct file	*file,
+	unsigned int unit_max)
+{
+	struct xfs_inode *ip = XFS_I(file_inode(file));
+	struct mnt_idmap *idmap = file_mnt_idmap(file);
+	struct dentry *dentry = file->f_path.dentry;
+	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
+	struct block_device	*bdev = target->bt_bdev;
+	struct request_queue	*q = bdev->bd_queue;
+	struct inode		*inode = &ip->i_vnode;
+
+	if (queue_atomic_write_unit_max_bytes(q) == 0) {
+		if (unit_max != 0) {
+			/* We expect unbounded CoW size if no HW support */
+			return -ENOTBLK;
+		}
+		/* Do something for CoW support ... */
+
+		return 0;
+	}
+
+	if (!unit_max)
+		return -EINVAL;
+
+	/* bodge alert */
+	if (inode->i_op->fileattr_set) {
+		struct fileattr fa = {
+			.fsx_extsize = unit_max,
+			.fsx_xflags = FS_XFLAG_EXTSIZE | FS_XFLAG_FORCEALIGN,
+			.fsx_valid = 1,
+		};
+
+		return inode->i_op->fileattr_set(idmap, dentry, &fa);
+	}
+
+	return -EOPNOTSUPP;
+}
+
  const struct file_operations xfs_file_operations = {
  	.llseek		= xfs_file_llseek,
  	.read_iter	= xfs_file_read_iter,
@@ -1498,6 +1538,7 @@ const struct file_operations xfs_file_operations = {
  	.fallocate	= xfs_file_fallocate,
  	.fadvise	= xfs_file_fadvise,
  	.remap_file_range = xfs_file_remap_range,
+	.enable_atomic_writes = xfs_enable_atomic_writes,
  };

  const struct file_operations xfs_dir_file_operations = {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 98b7a7a8c42e..a715f98057b8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1956,6 +1956,7 @@ struct file_operations {
  	int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
  	int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
  				unsigned int poll_flags);
+	int (*enable_atomic_writes)(struct file *, unsigned int unit_max);
  } __randomize_layout;

  /* Wrap a directory iterator that needs exclusive inode access */
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 6c80f96049bd..69780c49622e 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -56,6 +56,8 @@
  #define F_GET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 13)
  #define F_SET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 14)

+#define F_SET_ATOMIC_WRITE_SIZE  (F_LINUX_SPECIFIC_BASE + 15)
+
  /*
   * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
   * used to clear any hints previously set.
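
For illustration, a userspace caller of the sketch above might look roughly
like this (F_SET_ATOMIC_WRITE_SIZE value taken from the diff; all of this is
hypothetical until the interface settles):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

#ifndef F_SET_ATOMIC_WRITE_SIZE
#define F_SET_ATOMIC_WRITE_SIZE	(1024 + 15)	/* F_LINUX_SPECIFIC_BASE + 15 */
#endif

int main(void)
{
	/* the file should be empty so the FS can lay it out accordingly */
	int fd = open("mnt/file", O_CREAT | O_RDWR, 0644);

	if (fd < 0)
		return 1;
	/* ask the FS to set the file up for a 64K atomic write unit max */
	if (fcntl(fd, F_SET_ATOMIC_WRITE_SIZE, 64 * 1024UL) == -1) {
		perror("F_SET_ATOMIC_WRITE_SIZE");
		return 1;
	}
	return 0;
}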
Christoph Hellwig Dec. 19, 2023, 3:17 p.m. UTC | #11
On Tue, Dec 19, 2023 at 12:41:37PM +0000, John Garry wrote:
> How about something based on fcntl, like below? We will prob also require 
> some per-FS flag for enabling atomic writes without HW support. That flag 
> might be also useful for XFS for differentiating forcealign for atomic 
> writes with just forcealign.

I would have just exposed it through a user visible flag instead of
adding yet another ioctl/fcntl opcode and yet another method.

And yes, for anything that doesn't always support atomic writes it would
need to be persisted.
John Garry Dec. 19, 2023, 4:53 p.m. UTC | #12
On 19/12/2023 15:17, Christoph Hellwig wrote:
> On Tue, Dec 19, 2023 at 12:41:37PM +0000, John Garry wrote:
>> How about something based on fcntl, like below? We will prob also require
>> some per-FS flag for enabling atomic writes without HW support. That flag
>> might be also useful for XFS for differentiating forcealign for atomic
>> writes with just forcealign.
> I would have just exposed it through a user visible flag instead of
> adding yet another ioctl/fcntl opcode and yet another method.
> 

Any specific type of flag?

I would suggest a file attribute which we can set via chattr, but that 
is still using an ioctl and would require a new inode flag; but at least 
there is standard userspace support.

> And yes, for anything that doesn't always support atomic writes it would
> need to be persisted.
Christoph Hellwig Dec. 21, 2023, 6:50 a.m. UTC | #13
On Tue, Dec 19, 2023 at 04:53:27PM +0000, John Garry wrote:
> On 19/12/2023 15:17, Christoph Hellwig wrote:
>> On Tue, Dec 19, 2023 at 12:41:37PM +0000, John Garry wrote:
>>> How about something based on fcntl, like below? We will prob also require
>>> some per-FS flag for enabling atomic writes without HW support. That flag
>>> might be also useful for XFS for differentiating forcealign for atomic
>>> writes with just forcealign.
>> I would have just exposed it through a user visible flag instead of
>> adding yet another ioctl/fcntl opcode and yet another method.
>>
>
> Any specific type of flag?
>
> I would suggest a file attribute which we can set via chattr, but that is 
> still using an ioctl and would require a new inode flag; but at least there 
> is standard userspace support.

I'd be fine with that, but we're kinda running out of flags there.
That's why I suggested the FS_XFLAG_ instead, which basically works
the same.
John Garry Dec. 21, 2023, 9:49 a.m. UTC | #14
On 21/12/2023 06:50, Christoph Hellwig wrote:
> On Tue, Dec 19, 2023 at 04:53:27PM +0000, John Garry wrote:
>> On 19/12/2023 15:17, Christoph Hellwig wrote:
>>> On Tue, Dec 19, 2023 at 12:41:37PM +0000, John Garry wrote:
>>>> How about something based on fcntl, like below? We will prob also require
>>>> some per-FS flag for enabling atomic writes without HW support. That flag
>>>> might be also useful for XFS for differentiating forcealign for atomic
>>>> writes with just forcealign.
>>> I would have just exposed it through a user visible flag instead of
>>> adding yet another ioctl/fcntl opcode and yet another method.
>>>
>>
>> Any specific type of flag?
>>
>> I would suggest a file attribute which we can set via chattr, but that is
>> still using an ioctl and would require a new inode flag; but at least there
>> is standard userspace support.
> 
> I'd be fine with that, but we're kinda running out of flag there.

Yeah, in looking at e2fsprogs they are all used.

> That's why I suggested the FS_XFLAG_ instead, which basically works
> the same.
> 

ok, fine, we can try that out.

On another topic, maybe you can advise..

I noticed the NVMe patch to stop always setting the virt boundary (thanks), 
but I am struggling with the wording for the iovec rules. I'd like to reuse 
iov_iter_is_aligned() to enforce any such rule.

I am thinking:
- ubuf / iovecs need to be PAGE-aligned
- each iovec needs to be length of multiple of PAGE_SIZE

But that does not work for total length < PAGE_SIZE.

So then we could have:
- ubuf / iovecs need to be PAGE-aligned
- each iovec needs to have a length that is a multiple of 
atomic_write_unit_min. If total length > PAGE_SIZE, each iovec also needs 
to be a multiple of PAGE_SIZE.

I'd rather something simpler. Maybe it's ok.
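
As a strawman, the simpler PAGE-based rule could be expressed via
iov_iter_is_aligned() roughly as below; this is illustrative only and does
not handle the total length < PAGE_SIZE case mentioned above:

#include <linux/uio.h>
#include <linux/mm.h>

/* Strawman: ubuf/iovecs PAGE-aligned and each segment a PAGE_SIZE multiple.
 * Does not yet cover writes shorter than PAGE_SIZE. */
static bool atomic_write_iter_aligned(const struct iov_iter *iter)
{
	return iov_iter_is_aligned(iter, PAGE_SIZE - 1, PAGE_SIZE - 1);
}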

Thanks,
John
Christoph Hellwig Dec. 21, 2023, 12:19 p.m. UTC | #15
On Thu, Dec 21, 2023 at 09:49:35AM +0000, John Garry wrote:
> I noticed the NVMe patch to stop always setting virt boundary (thanks), but 
> I am struggling for the wording for iovecs rules. I'd like to reuse 
> iov_iter_is_aligned() to enforce any such rule.
>
> I am thinking:
> - ubuf / iovecs need to be PAGE-aligned
> - each iovec needs to be length of multiple of PAGE_SIZE
>
> But that does not work for total length < PAGE_SIZE.
>
> So then we could have:
> - ubuf / iovecs need to be PAGE-aligned
> - each iovec needs to be length of multiple of atomic_write_unit_min. If 
> total length > PAGE_SIZE, each iovec also needs to be a multiple of 
> PAGE_SIZE.
>
> I'd rather something simpler. Maybe it's ok.

If we decided to not support atomic writes on anything setting a virt
boundary we don't have to care about the alignment of each vector,
and IMHO we should do that as everything else would be a life in
constant pain.  If we really have a use case for atomic writes on
consumer NVMe devices we'll just have to limit it to a single iovec.
John Garry Dec. 21, 2023, 12:48 p.m. UTC | #16
On 21/12/2023 12:19, Christoph Hellwig wrote:
>> So then we could have:
>> - ubuf / iovecs need to be PAGE-aligned
>> - each iovec needs to be length of multiple of atomic_write_unit_min. If
>> total length > PAGE_SIZE, each iovec also needs to be a multiple of
>> PAGE_SIZE.
>>
>> I'd rather something simpler. Maybe it's ok.
> If we decided to not support atomic writes on anything setting a virt
> boundary we don't have to care about the alignment of each vector,

ok, I think that alignment is not so important, but we still need to 
consider a minimum length per iovec, such that we will always be able to 
fit a write of length atomic_write_unit_max in a bio.
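
(For a rough sense of scale: with a hypothetical atomic_write_unit_max of 
64K and BIO_MAX_VECS of 256, each iovec would need to be at least 
64K / 256 = 256 bytes for the write to be guaranteed to fit in a single bio.)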

> and IMHO we should do that as everything else would be a life in
> constant pain.  If we really have a use case for atomic writes on
> consumer NVMe devices we'll just have to limit it to a single iovec.

I'd be more inclined to say that SGL support is compulsory, but I don't 
know if that is limiting atomic writes support to an unsatisfactory 
market segment.
Christoph Hellwig Dec. 21, 2023, 12:57 p.m. UTC | #17
On Thu, Dec 21, 2023 at 12:48:24PM +0000, John Garry wrote:
>>> - ubuf / iovecs need to be PAGE-aligned
>>> - each iovec needs to be length of multiple of atomic_write_unit_min. If
>>> total length > PAGE_SIZE, each iovec also needs to be a multiple of
>>> PAGE_SIZE.
>>>
>>> I'd rather something simpler. Maybe it's ok.
>> If we decided to not support atomic writes on anything setting a virt
>> boundary we don't have to care about the alignment of each vector,
>
> ok, I think that alignment is not so important, but we still need to 
> consider a minimum length per iovec, such that we will always be able to 
> fit a write of length atomic_write_unit_max in a bio.

I don't think you mean a minimum length per iovec for that, but a maximum
number of iovecs instead.  For SGL-capable devices that would be
BIO_MAX_VECS, otherwise 1.
John Garry Dec. 21, 2023, 1:18 p.m. UTC | #18
On 21/12/2023 12:57, Christoph Hellwig wrote:
> On Thu, Dec 21, 2023 at 12:48:24PM +0000, John Garry wrote:
>>>> - ubuf / iovecs need to be PAGE-aligned
>>>> - each iovec needs to be length of multiple of atomic_write_unit_min. If
>>>> total length > PAGE_SIZE, each iovec also needs to be a multiple of
>>>> PAGE_SIZE.
>>>>
>>>> I'd rather something simpler. Maybe it's ok.
>>> If we decided to not support atomic writes on anything setting a virt
>>> boundary we don't have to care about the alignment of each vector,
>>
>> ok, I think that alignment is not so important, but we still need to
>> consider a minimum length per iovec, such that we will always be able to
>> fit a write of length atomic_write_unit_max in a bio.
> 
> I don't think you man a minim length per iovec for that, but a maximum
> number of iovecs instead. 

That would make sense, but I was thinking that if the request queue has 
a limit on segments then we would need to specify an iovec min length.

> For SGL-capable devices that would be
> BIO_MAX_VECS, otherwise 1.

ok, but we would need to advertise that or whatever segment limit. A 
statx field just for that seems a bit inefficient in terms of space.
Christoph Hellwig Dec. 21, 2023, 1:22 p.m. UTC | #19
On Thu, Dec 21, 2023 at 01:18:33PM +0000, John Garry wrote:
>> For SGL-capable devices that would be
>> BIO_MAX_VECS, otherwise 1.
>
> ok, but we would need to advertise that or whatever segment limit. A statx 
> field just for that seems a bit inefficient in terms of space.

I'd rather not hard code BIO_MAX_VECS in the ABI, which suggests we
want to export it as a field.  Network file systems also might have
their own limits for one reason or another.
John Garry Dec. 21, 2023, 1:56 p.m. UTC | #20
On 21/12/2023 13:22, Christoph Hellwig wrote:
> On Thu, Dec 21, 2023 at 01:18:33PM +0000, John Garry wrote:
>>> For SGL-capable devices that would be
>>> BIO_MAX_VECS, otherwise 1.
>>
>> ok, but we would need to advertise that or whatever segment limit. A statx
>> field just for that seems a bit inefficient in terms of space.
> 
> I'd rather not hard code BIO_MAX_VECS in the ABI,

For sure

> which suggest we
> want to export is as a field. 

ok

> Network file systems also might have
> their own limits for one reason or another.

Understood
John Garry Jan. 9, 2024, 9:55 a.m. UTC | #21
On 21/12/2023 06:50, Christoph Hellwig wrote:
> On Tue, Dec 19, 2023 at 04:53:27PM +0000, John Garry wrote:
>> On 19/12/2023 15:17, Christoph Hellwig wrote:
>>> On Tue, Dec 19, 2023 at 12:41:37PM +0000, John Garry wrote:
>>>> How about something based on fcntl, like below? We will prob also require
>>>> some per-FS flag for enabling atomic writes without HW support. That flag
>>>> might be also useful for XFS for differentiating forcealign for atomic
>>>> writes with just forcealign.
>>> I would have just exposed it through a user visible flag instead of
>>> adding yet another ioctl/fcntl opcode and yet another method.
>>>
>> Any specific type of flag?
>>
>> I would suggest a file attribute which we can set via chattr, but that is
>> still using an ioctl and would require a new inode flag; but at least there
>> is standard userspace support.
> I'd be fine with that, but we're kinda running out of flag there.
> That's why I suggested the FS_XFLAG_ instead, which basically works
> the same.

Hi Christoph,

Coming back to this topic... how about this FS_XFLAG_ and fsxattr update:

diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index da43810b7485..9ef15fced20c 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -118,7 +118,8 @@ struct fsxattr {
        __u32           fsx_nextents;   /* nextents field value (get)   */
        __u32           fsx_projid;     /* project identifier (get/set) */
        __u32           fsx_cowextsize; /* CoW extsize field value (get/set)*/
-       unsigned char   fsx_pad[8];
+       __u32           fsx_atomicwrites_size; /* unit max */
+       unsigned char   fsx_pad[4];
};

/*
@@ -140,6 +141,7 @@ struct fsxattr {
#define FS_XFLAG_FILESTREAM    0x00004000      /* use filestream allocator */
#define FS_XFLAG_DAX           0x00008000      /* use DAX for IO */
#define FS_XFLAG_COWEXTSIZE    0x00010000      /* CoW extent size allocator hint */
+#define FS_XFLAG_ATOMICWRITES  0x00020000
#define FS_XFLAG_HASATTR       0x80000000      /* no DIFLAG for this   */

/* the read-only stuff doesn't really belong here, but any other place is

Having FS_XFLAG_ATOMICWRITES set will lead to FMODE_CAN_ATOMIC_WRITE 
being set.

So a user can issue:

 >xfs_io -c "atomic-writes 64K" mnt/file
 >xfs_io -c "atomic-writes" mnt/file
[65536] mnt/file

and then:

 >xfs_io -c "lsattr" mnt/file
------------------W mnt/file

(W is new flag for atomic writes obvs)

The user will still have to issue statx to get the actual atomic write 
limit for a file, as 'xfs_io -c "atomic-writes"' does not take into 
account any HW/linux block layer atomic write limits.

FS_XFLAG_ATOMICWRITES will force XFS extent size and alignment to 
fsx_atomicwrites_size when we have HW support, so effectively the same as 
forcealign.  For no HW support, we still specify a size. In case of a 
possible XFS CoW solution for no atomic write HW support, I suppose that 
there would be no size limit in reality, so specifying the size 
would just be for userspace experience consistency.
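
For reference, the xfs_io "atomic-writes 64K" step above would reduce to
roughly the following fsxattr update (FS_XFLAG_ATOMICWRITES and
fsx_atomicwrites_size are only proposed in the header diff above, so this
is just a sketch):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

/* Sketch of what the xfs_io command would do with the proposed fsxattr
 * extension; the new flag/field do not exist in current uapi headers. */
static int set_atomic_writes(int fd, unsigned int unit_max)
{
	struct fsxattr fsx;

	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) == -1)
		return -1;
	fsx.fsx_xflags |= FS_XFLAG_ATOMICWRITES;	/* proposed flag */
	fsx.fsx_atomicwrites_size = unit_max;		/* proposed field */
	return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}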

Is this the sort of userspace API which you would like to see?

John
Christoph Hellwig Jan. 9, 2024, 4:02 p.m. UTC | #22
On Tue, Jan 09, 2024 at 09:55:24AM +0000, John Garry wrote:
> So a user can issue:
>
> >xfs_io -c "atomic-writes 64K" mnt/file
> >xfs_io -c "atomic-writes" mnt/file
> [65536] mnt/file

Let me try to decipher that:

 - the first call sets a 64k fsx_atomicwrites_size size
 - the second call queries fsx_atomicwrites_size?

> The user will still have to issue statx to get the actual atomic write 
> limit for a file, as 'xfs_io -c "atomic-writes"' does not take into account 
> any HW/linux block layer atomic write limits.

So will the set side never fail?

> Is this the sort of userspace API which you would like to see?

What I had in mind (and that doesn't mean it's right..) was that
the user just sets a binary flag, and the fs reports the best it
could.  But there might be reasons to do it differently.
John Garry Jan. 9, 2024, 4:52 p.m. UTC | #23
On 09/01/2024 16:02, Christoph Hellwig wrote:
> On Tue, Jan 09, 2024 at 09:55:24AM +0000, John Garry wrote:
>> So a user can issue:
>>
>>> xfs_io -c "atomic-writes 64K" mnt/file
>>> xfs_io -c "atomic-writes" mnt/file
>> [65536] mnt/file
> Let me try to decipher that:
> 
>   - the first call sets a 64k fsx_atomicwrites_size size

It should also set FS_XFLAG_ATOMICWRITES for the file. So this step does 
everything to enable atomic writes for that file.

>   - the secon call queries fsx_atomicwrites_size?

Right, I'm just demoing how it would look.

> 
>> The user will still have to issue statx to get the actual atomic write
>> limit for a file, as 'xfs_io -c "atomic-writes"' does not take into account
>> any HW/linux block layer atomic write limits.
> So will the set side never fail?

It could fail.

Examples of when it could fail include (a rough sketch for a. follows 
below):

a. If the user gave a bad size value. The size should be a power-of-2 
and also divisible into the AG size and compatible with any stripe 
alignment.

b. If the file already had flags set which are incompatible with or not 
supported for atomic writes.

c. The file already has data written. I guess that's obvious.
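
To make a. a bit more concrete, a rough illustration of that check 
(hypothetical helper and parameters, not code from this series):

/* Hypothetical illustration of check a. above; sizes are in bytes. */
#include <stdbool.h>

static bool atomic_writes_size_ok(unsigned long long size,
				  unsigned long long ag_size)
{
	if (!size || (size & (size - 1)))	/* must be a power-of-2 */
		return false;
	if (ag_size % size)			/* must divide into the AG size */
		return false;
	/* a real implementation would also check stripe alignment compat */
	return true;
}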

> 
>> Is this the sort of userspace API which you would like to see?
> What I had in mind (and that doesn't mean it's right..) was that
> the user just sets a binary flag, and the fs reports the best it
> could.  But there might be reasons to do it differently.

That is what I am trying to do, but I also want the user to be able to 
specify the atomic write unit max size that they want. I'd rather not 
use the atomic write unit max from the device for that, as it could be 
huge. However, re-reading a. above makes me think that the kernel should 
have more input on this, but we still need some input on the max from 
the user...

Thanks,
John
Dave Chinner Jan. 9, 2024, 11:04 p.m. UTC | #24
On Tue, Jan 09, 2024 at 09:55:24AM +0000, John Garry wrote:
> On 21/12/2023 06:50, Christoph Hellwig wrote:
> > On Tue, Dec 19, 2023 at 04:53:27PM +0000, John Garry wrote:
> > > On 19/12/2023 15:17, Christoph Hellwig wrote:
> > > > On Tue, Dec 19, 2023 at 12:41:37PM +0000, John Garry wrote:
> > > > > How about something based on fcntl, like below? We will prob also require
> > > > > some per-FS flag for enabling atomic writes without HW support. That flag
> > > > > might be also useful for XFS for differentiating forcealign for atomic
> > > > > writes with just forcealign.
> > > > I would have just exposed it through a user visible flag instead of
> > > > adding yet another ioctl/fcntl opcode and yet another method.
> > > > 
> > > Any specific type of flag?
> > > 
> > > I would suggest a file attribute which we can set via chattr, but that is
> > > still using an ioctl and would require a new inode flag; but at least there
> > > is standard userspace support.
> > I'd be fine with that, but we're kinda running out of flag there.
> > That's why I suggested the FS_XFLAG_ instead, which basically works
> > the same.
> 
> Hi Christoph,
> 
> Coming back to this topic... how about this FS_XFLAG_ and fsxattr update:
> 
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index da43810b7485..9ef15fced20c 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -118,7 +118,8 @@ struct fsxattr {
>        __u32           fsx_nextents;   /* nextents field value (get)   */
>        __u32           fsx_projid;     /* project identifier (get/set) */
>        __u32           fsx_cowextsize; /* CoW extsize field value
> (get/set)*/
> -       unsigned char   fsx_pad[8];
> +       __u32           fsx_atomicwrites_size; /* unit max */
> +       unsigned char   fsx_pad[4];
> };
> 
> /*
> @@ -140,6 +141,7 @@ struct fsxattr {
> #define FS_XFLAG_FILESTREAM    0x00004000      /* use filestream allocator
> */
> #define FS_XFLAG_DAX           0x00008000      /* use DAX for IO */
> #define FS_XFLAG_COWEXTSIZE    0x00010000      /* CoW extent size
> allocator hint */
> +#define FS_XFLAG_ATOMICWRITES  0x00020000
> #define FS_XFLAG_HASATTR       0x80000000      /* no DIFLAG for this   */
> 
> /* the read-only stuff doesn't really belong here, but any other place is
> 
> Having FS_XFLAG_ATOMICWRITES set will lead to FMODE_CAN_ATOMIC_WRITE being
> set.
> 
> So a user can issue:
> 
> >xfs_io -c "atomic-writes 64K" mnt/file
> >xfs_io -c "atomic-writes" mnt/file
> [65536] mnt/file

Where are you going to store this value in the inode?  It requires a
new field in the inode and so is a change of on-disk format, right?

As it is, I really don't see this as a better solution than the
original generic "force align" flag that simply makes the extent
size hint alignment a hard physical alignment requirement rather
than just a hint. This has multiple uses (DAX PMD alignment is
another), so I just don't see why a single-purpose, application-specific
API that implements a hard physical alignment is desirable.

Indeed, the whole reason that extent size hints are so versatile is
that they implement a generic allocation alignment/size function
that can be used for anything your imagination extends to. If they
were implemented as a "only allow RAID stripe aligned/sized
allocation" for the original use case then that functionality would
have been far less useful than it has proven to be over the past
couple of decades.

Hence history teaches us that we should be designing the API around
the generic filesystem function required (hard alignment of physical
extent allocation), not the specific use case that requires that
functionality.

-Dave.
John Garry Jan. 10, 2024, 8:55 a.m. UTC | #25
On 09/01/2024 23:04, Dave Chinner wrote:
>> --- a/include/uapi/linux/fs.h
>> +++ b/include/uapi/linux/fs.h
>> @@ -118,7 +118,8 @@ struct fsxattr {
>>         __u32           fsx_nextents;   /* nextents field value (get)   */
>>         __u32           fsx_projid;     /* project identifier (get/set) */
>>         __u32           fsx_cowextsize; /* CoW extsize field value
>> (get/set)*/
>> -       unsigned char   fsx_pad[8];
>> +       __u32           fsx_atomicwrites_size; /* unit max */
>> +       unsigned char   fsx_pad[4];
>> };
>>
>> /*
>> @@ -140,6 +141,7 @@ struct fsxattr {
>> #define FS_XFLAG_FILESTREAM    0x00004000      /* use filestream allocator
>> */
>> #define FS_XFLAG_DAX           0x00008000      /* use DAX for IO */
>> #define FS_XFLAG_COWEXTSIZE    0x00010000      /* CoW extent size
>> allocator hint */
>> +#define FS_XFLAG_ATOMICWRITES  0x00020000
>> #define FS_XFLAG_HASATTR       0x80000000      /* no DIFLAG for this   */
>>
>> /* the read-only stuff doesn't really belong here, but any other place is
>>
>> Having FS_XFLAG_ATOMICWRITES set will lead to FMODE_CAN_ATOMIC_WRITE being
>> set.
>>
>> So a user can issue:
>>
>>> xfs_io -c "atomic-writes 64K" mnt/file
>>> xfs_io -c "atomic-writes" mnt/file
>> [65536] mnt/file
> Where are you going to store this value in the inode?  It requires a
> new field in the inode and so is a change of on-disk format, right?

It would require an on-disk format change, unless we can find an 
alternative way to store the value, like:
a. re-use pre-existing extsize or even cowextsize fields and 'xfs_io -c 
"atomic-writes $SIZE"' would update those fields and 
FS_XFLAG_ATOMICWRITES would be incompatible with FS_XFLAG_COWEXTSIZE or 
FS_XFLAG_EXTSIZE
b. require FS_XFLAG_EXTSIZE and extsize be also set to enable atomic 
writes, and extsize is used for atomic write unit max

I'm trying to think of ways to avoid requiring a value, but I don't see 
good options, like:
- make atomic write unit max some compile-time option
- require mkfs stripe alignment/width be set and use that as basis for 
atomic write unit max

We could just use the atomic write unit max which the HW provides, but 
that could be 1MB or more, and that will hardly give efficient data 
usage for small files. But maybe we don't care about that if we expect 
this feature to only be used on DB files, which can be huge anyway. 
However I still have concerns: we require that value to be fixed, but a 
disk firmware update could increase it, and that could leave 
pre-existing extents mis-aligned.

> 
> As it is, I really don't see this as a better solution than the
> original generic "force align" flag that simply makes the extent
> size hint alignment a hard physical alignment requirement rather
> than just a hint. This has multiple uses (DAX PMD alignment is
> another), so I just don't see why something that has a single,
> application specific API that implements a hard physical alignment
> is desirable.

I would still hope that we will support forcealign separately for those 
purposes.

> 
> Indeed, the whole reason that extent size hints are so versatile is
> that they implement a generic allocation alignment/size function
> that can be used for anything your imagination extends to. If they
> were implemented as a "only allow RAID stripe aligned/sized
> allocation" for the original use case then that functionality would
> have been far less useful than it has proven to be over the past
> couple of decades.
> 
> Hence history teaches us that we should be designing the API around
> the generic filesystem function required (hard alignment of physical
> extent allocation), not the specific use case that requires that
> functionality.

I understand your concern. However I am not sure that forcealign even 
gives us everything we want to enable atomic writes. There is an issue 
where we were required to pre-zero a file prior to issuing atomic writes 
to ensure extents are suitably sized, so FS_XFLAG_ATOMICWRITES would 
make the FS do what is required to avoid that pre-zeroing (though that 
pre-zeroing requirement does sound like a forcealign issue...)

Furthermore, there was some desire to support atomic writes on block 
devices with no HW support by using a CoW-based solution, and forcealign 
would not be relevant there.

Thanks,
John
Christoph Hellwig Jan. 10, 2024, 9:19 a.m. UTC | #26
On Wed, Jan 10, 2024 at 10:04:00AM +1100, Dave Chinner wrote:
> Hence history teaches us that we should be designing the API around
> the generic filesystem function required (hard alignment of physical
> extent allocation), not the specific use case that requires that
> functionality.

I disagree.  The alignment requirement is an artefact of how you
implement atomic writes.  As the fs user I care that I can do atomic
writes on a file and need to query how big the writes can be and
what alignment is required.

The forcealign feature is a sensible fs side implementation of that
if using hardware based atomic writes with alignment requirements,
but it is a really lousy userspace API.

So with John's API proposal for XFS with hardware alignment based atomic
writes we could still use force align.

Requesting atomic writes for an inode will set the forcealign flag
and the extent size hint, and after that it'll report atomic write
capabilities.  Roughly the same implementation, but not an API
tied to an implementation detail.
Darrick J. Wong Jan. 11, 2024, 1:40 a.m. UTC | #27
On Wed, Jan 10, 2024 at 10:19:29AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 10, 2024 at 10:04:00AM +1100, Dave Chinner wrote:
> > Hence history teaches us that we should be designing the API around
> > the generic filesystem function required (hard alignment of physical
> > extent allocation), not the specific use case that requires that
> > functionality.
> 
> I disagree.  The alignment requirement is an artefact of how you
> implement atomic writes.  As the fs user I care that I can do atomic
> writes on a file and need to query how big the writes can be and
> what alignment is required.
> 
> The forcealign feature is a sensible fs side implementation of that
> if using hardware based atomic writes with alignment requirements,
> but it is a really lousy userspace API.
> 
> So with John's API proposal for XFS with hardware alignment based atomic
> writes we could still use force align.
> 
> Requesting atomic writes for an inode will set the forcealign flag
> and the extent size hint, and after that it'll report atomic write
> capabilities.  Roughly the same implementation, but not an API
> tied to an implementation detail.

Sounds good to me!  So to summarize, this is approximately what
userspace programs would have to do:

struct statx statx;
struct fsxattr fsxattr;
int fd = open('/foofile', O_RDWR | O_DIRECT);

ioctl(fd, FS_IOC_GETXATTR, &fsxattr);

fsxattr.fsx_xflags |= FS_XFLAG_FORCEALIGN | FS_XFLAG_WRITE_ATOMIC;
fsxattr.fsx_extsize = 16384; /* only for hardware no-tears writes */

ioctl(fd, FS_IOC_SETXATTR, &fsxattr);

statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &statx);

if (statx.stx_atomic_write_unit_max >= 16384) {
	pwrite(fd, &iov, 1, 0, RWF_SYNC | RWF_ATOMIC);
	printf("HAPPY DANCE\n");
}

(Assume we bail out on errors.)

--D
Christoph Hellwig Jan. 11, 2024, 5:02 a.m. UTC | #28
On Wed, Jan 10, 2024 at 05:40:56PM -0800, Darrick J. Wong wrote:
> struct statx statx;
> struct fsxattr fsxattr;
> int fd = open('/foofile', O_RDWR | O_DIRECT);
> 
> ioctl(fd, FS_IOC_GETXATTR, &fsxattr);
> 
> fsxattr.fsx_xflags |= FS_XFLAG_FORCEALIGN | FS_XFLAG_WRITE_ATOMIC;
> fsxattr.fsx_extsize = 16384; /* only for hardware no-tears writes */
> 
> ioctl(fd, FS_IOC_SETXATTR, &fsxattr);
> 
> statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &statx);
> 
> if (statx.stx_atomic_write_unit_max >= 16384) {
> 	pwrite(fd, &iov, 1, 0, RWF_SYNC | RWF_ATOMIC);
> 	printf("HAPPY DANCE\n");
> }

I think this still needs a check if the fs needs alignment for
atomic writes at all. i.e.

struct statx statx;
struct fsxattr fsxattr;
int fd = open('/foofile', O_RDWR | O_DIRECT);

ioctl(fd, FS_IOC_GETXATTR, &fsxattr);
statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &statx);
if (statx.stx_atomic_write_unit_max < 16384) {
	bailout();
}

fsxattr.fsx_xflags |= FS_XFLAG_WRITE_ATOMIC;
if (statx.stx_atomic_write_alignment) {
	fsxattr.fsx_xflags |= FS_XFLAG_FORCEALIGN;
	fsxattr.fsx_extsize = 16384; /* only for hardware no-tears writes */
}
if (ioctl(fd, FS_IOC_SETXATTR, &fsxattr) < 0) {
	bailout();
}

pwrite(fd, &iov, 1, 0, RWF_SYNC | RWF_ATOMIC);
printf("HAPPY DANCE\n");
John Garry Jan. 11, 2024, 9:55 a.m. UTC | #29
On 11/01/2024 05:02, Christoph Hellwig wrote:
> On Wed, Jan 10, 2024 at 05:40:56PM -0800, Darrick J. Wong wrote:
>> struct statx statx;
>> struct fsxattr fsxattr;
>> int fd = open('/foofile', O_RDWR | O_DIRECT);

I'm assuming O_CREAT also.

>>
>> ioctl(fd, FS_IOC_GETXATTR, &fsxattr);
>>
>> fsxattr.fsx_xflags |= FS_XFLAG_FORCEALIGN | FS_XFLAG_WRITE_ATOMIC;
>> fsxattr.fsx_extsize = 16384; /* only for hardware no-tears writes */
>>
>> ioctl(fd, FS_IOC_SETXATTR, &fsxattr);
>>
>> statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &statx);
>>
>> if (statx.stx_atomic_write_unit_max >= 16384) {
>> 	pwrite(fd, &iov, 1, 0, RWF_SYNC | RWF_ATOMIC);
>> 	printf("HAPPY DANCE\n");
>> }
> 
> I think this still needs a check if the fs needs alignment for
> atomic writes at all. i.e.
> 
> struct statx statx;
> struct fsxattr fsxattr;
> int fd = open('/foofile', O_RDWR | O_DIRECT);
> 
> ioctl(fd, FS_IOC_GETXATTR, &fsxattr);
> statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &statx);
> if (statx.stx_atomic_write_unit_max < 16384) {
> 	bailout();
> }

How could this value be >= 16384 initially? Would it be from 
pre-configured FS alignment, like XFS RT extsize? Or is this from some 
special CoW-based atomic write support? Or FS block size of 16384?

Incidentally, for consistency, only setting FS_XFLAG_WRITE_ATOMIC will 
lead to FMODE_CAN_ATOMIC_WRITE being set. So until FS_XFLAG_WRITE_ATOMIC 
is set, would it make sense to have statx return 0 for 
STATX_WRITE_ATOMIC? Otherwise the user may be misled into thinking that 
it is ok to issue an atomic write (when it isn't).

Thanks,
John

> 
> fsxattr.fsx_xflags |= FS_XFLAG_WRITE_ATOMIC;
> if (statx.stx_atomic_write_alignment) {
> 	fsxattr.fsx_xflags |= FS_XFLAG_FORCEALIGN;
> 	fsxattr.fsx_extsize = 16384; /* only for hardware no-tears writes */
> }
> if (ioctl(fd, FS_IOC_SETXATTR, &fsxattr) < 1) {
> 	bailout();
> }
> 
> pwrite(fd, &iov, 1, 0, RWF_SYNC | RWF_ATOMIC);
> printf("HAPPY DANCE\n");
>
Christoph Hellwig Jan. 11, 2024, 2:45 p.m. UTC | #30
On Thu, Jan 11, 2024 at 09:55:36AM +0000, John Garry wrote:
> On 11/01/2024 05:02, Christoph Hellwig wrote:
>> On Wed, Jan 10, 2024 at 05:40:56PM -0800, Darrick J. Wong wrote:
>>> struct statx statx;
>>> struct fsxattr fsxattr;
>>> int fd = open('/foofile', O_RDWR | O_DIRECT);
>
> I'm assuming O_CREAT also.

Yes.

>> I think this still needs a check if the fs needs alignment for
>> atomic writes at all. i.e.
>>
>> struct statx statx;
>> struct fsxattr fsxattr;
>> int fd = open('/foofile', O_RDWR | O_DIRECT);
>>
>> ioctl(fd, FS_IOC_GETXATTR, &fsxattr);
>> statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &statx);
>> if (statx.stx_atomic_write_unit_max < 16384) {
>> 	bailout();
>> }
>
> How could this value be >= 16384 initially? Would it be from pre-configured 
> FS alignment, like XFS RT extsize? Or is this from some special CoW-based 
> atomic write support? Or FS block size of 16384?

Sorry, this check should not be here at all, we should only check it
later.

> Incidentally, for consistency only setting FS_XFLAG_WRITE_ATOMIC will lead 
> to FMODE_CAN_ATOMIC_WRITE being set. So until FS_XFLAG_WRITE_ATOMIC is set 
> would it make sense to have statx return 0 for STATX_WRITE_ATOMIC. 

True.  We might need to report the limits even without that, though.
John Garry Jan. 11, 2024, 4:11 p.m. UTC | #31
> 
>>> I think this still needs a check if the fs needs alignment for
>>> atomic writes at all. i.e.
>>>
>>> struct statx statx;
>>> struct fsxattr fsxattr;
>>> int fd = open('/foofile', O_RDWR | O_DIRECT);
>>>
>>> ioctl(fd, FS_IOC_GETXATTR, &fsxattr);
>>> statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &statx);
>>> if (statx.stx_atomic_write_unit_max < 16384) {
>>> 	bailout();
>>> }
>>
>> How could this value be >= 16384 initially? Would it be from pre-configured
>> FS alignment, like XFS RT extsize? Or is this from some special CoW-based
>> atomic write support? Or FS block size of 16384?
> 
> Sorry, this check should not be here at all, we should only check it
> later.
> 
>> Incidentally, for consistency only setting FS_XFLAG_WRITE_ATOMIC will lead
>> to FMODE_CAN_ATOMIC_WRITE being set. So until FS_XFLAG_WRITE_ATOMIC is set
>> would it make sense to have statx return 0 for STATX_WRITE_ATOMIC.
> 
> True.  We might need to report the limits even without that, though.

Could we just error the SETXATTR ioctl when FS_XFLAG_FORCEALIGN is not 
set (and it is required)? The problem is that ioctl reports -EINVAL for 
any such errors, so hard for the user to know the issue...

Thanks,
John
Christoph Hellwig Jan. 11, 2024, 4:15 p.m. UTC | #32
On Thu, Jan 11, 2024 at 04:11:38PM +0000, John Garry wrote:
> Could we just error the SETXATTR ioctl when FS_XFLAG_FORCEALIGN is not set 
> (and it is required)? The problem is that ioctl reports -EINVAL for any 
> such errors, so hard for the user to know the issue...

Sure.  Pick a good unique error code, though.
John Garry Jan. 16, 2024, 11:35 a.m. UTC | #33
On 21/12/2023 13:22, Christoph Hellwig wrote:
> On Thu, Dec 21, 2023 at 01:18:33PM +0000, John Garry wrote:
>>> For SGL-capable devices that would be
>>> BIO_MAX_VECS, otherwise 1.
>> ok, but we would need to advertise that or whatever segment limit. A statx
>> field just for that seems a bit inefficient in terms of space.
> I'd rather not hard code BIO_MAX_VECS in the ABI, which suggest we
> want to export is as a field.  Network file systems also might have
> their own limits for one reason or another.

Hi Christoph,

I have been looking at this issue again and I am not sure if telling the 
user the max number of segments allowed is the best option. I'm worried 
that the resultant atomic write unit max will be too small.

The background again is that we want to tell the user what the maximum 
atomic write unit size is, such that we can always guarantee to fit the 
write in a single bio. And there would be no iovec length or alignment 
rules.

The max segments value advertised would be min(queue max segments, 
BIO_MAX_VECS), so it would be 256 when the request queue is not limiting.

The worst-case (most inefficient) iovec layout which the user could 
provide would be something like .iov_base = 0x...0E00 and .iov_len = 
0x400, which would mean 2x pages and 2x DMA sg elements required for 
each 1024B-length iovec. I am assuming that we will still use the direct 
IO rule of LBS length and alignment.

As such, we then need to set atomic write unit max = min(queue max 
segments, BIO_MAX_VECS) * LBS. That would mean atomic write unit max 256 
* 512 = 128K (for 512B LBS). For a DMA controller with a max of 64 
segments, for example, we would have 32K. These seem too low.

Alternatively, I'm thinking that we should just limit to 1x iovec 
always, and then atomic write unit max = (min(queue max segments, 
BIO_MAX_VECS) - 1) * PAGE_SIZE [ignoring first/last iovec contents]. It 
also makes support for non-enterprise NVMe drives more straightforward. 
If someone wants, they can introduce support for multi-iovec later, but 
it would probably require some more iovec length/alignment rules.
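
To put concrete numbers on the two options (illustrative arithmetic 
only, using the example figures above):

/* Illustrative arithmetic only for the two options above. */
#include <stdio.h>

#define BIO_MAX_VECS	256			/* current kernel value */
#define MIN(a, b)	((a) < (b) ? (a) : (b))

int main(void)
{
	unsigned int queue_max_segments = 64;	/* example DMA controller limit */
	unsigned int lbs = 512;			/* logical block size */
	unsigned int page_size = 4096;
	unsigned int segs = MIN(queue_max_segments, BIO_MAX_VECS);

	/* multi-iovec, worst case of one LBS-sized chunk per segment */
	printf("multi-iovec max:  %u\n", segs * lbs);	/* 32K here, 128K at 256 segs */

	/* single iovec: all middle segments are full pages */
	printf("single-iovec max: %u\n", (segs - 1) * page_size);	/* 252K here */

	return 0;
}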

Please let me know your thoughts.

Thanks,
John
Christoph Hellwig Jan. 17, 2024, 3:02 p.m. UTC | #34
On Tue, Jan 16, 2024 at 11:35:47AM +0000, John Garry wrote:
> As such, we then need to set atomic write unit max = min(queue max 
> segments, BIO_MAX_VECS) * LBS. That would mean atomic write unit max 256 * 
> 512 = 128K (for 512B LBS). For a DMA controller of max segments 64, for 
> example, then we would have 32K. These seem too low.

I don't see how this would work if we support multiple sectors.

>
> Alternative I'm thinking that we should just limit to 1x iovec always, and 
> then atomic write unit max = (min(queue max segments, BIO_MAX_VECS) - 1) * 
> PAGE_SIZE [ignoring first/last iovec contents]. It also makes support for 
> non-enterprise NVMe drives more straightforward. If someone wants, they can 
> introduce support for multi-iovec later, but it would prob require some 
> more iovec length/alignment rules.

Supporting just a single iovec initially is fine with me, as extending
that is pretty easy.  Just talk to your potential users to confirm that
they can live with it.

I'd probably still advertise the limits even if it currently always is 1.
John Garry Jan. 17, 2024, 4:16 p.m. UTC | #35
On 17/01/2024 15:02, Christoph Hellwig wrote:
> On Tue, Jan 16, 2024 at 11:35:47AM +0000, John Garry wrote:
>> As such, we then need to set atomic write unit max = min(queue max
>> segments, BIO_MAX_VECS) * LBS. That would mean atomic write unit max 256 *
>> 512 = 128K (for 512B LBS). For a DMA controller of max segments 64, for
>> example, then we would have 32K. These seem too low.
> 
> I don't see how this would work if support multiple sectors.
> 
>>
>> Alternative I'm thinking that we should just limit to 1x iovec always, and
>> then atomic write unit max = (min(queue max segments, BIO_MAX_VECS) - 1) *
>> PAGE_SIZE [ignoring first/last iovec contents]. It also makes support for
>> non-enterprise NVMe drives more straightforward. If someone wants, they can
>> introduce support for multi-iovec later, but it would prob require some
>> more iovec length/alignment rules.
> 
> Supporting just a single iovec initially is fine with me, as extending
> that is pretty easy.  Just talk to your potential users that they can
> live with it.

Yeah, any porting I know about has just been using aio with 
IO_CMD_PWRITE, so that would be ok.

> 
> I'd probably still advertise the limits even if it currently always is 1.
> 

I suppose that we don't need any special rule until we support > 1, as 
we cannot break anyone already using > 1 :)

Thanks,
John