Message ID: 20231212110844.19698-1-john.g.garry@oracle.com
Series: block atomic writes
On Tue, Dec 12, 2023 at 11:08:28AM +0000, John Garry wrote: > Two new fields are added to struct statx - atomic_write_unit_min and > atomic_write_unit_max. For each atomic individual write, the total length > of a write must be a between atomic_write_unit_min and > atomic_write_unit_max, inclusive, and a power-of-2. The write must also be > at a natural offset in the file wrt the write length. > > SCSI sd.c and scsi_debug and NVMe kernel support is added. > > Some open questions: > - How to make API extensible for when we have no HW support? In that case, > we would prob not have to follow rule of power-of-2 length et al. > As a possible solution, maybe we can say that atomic writes are > supported for the file via statx, but not set unit_min and max values, > and this means that writes need to be just FS block aligned there. I don't think the power of two length is much of a problem to be honest, and if we every want to lift it we can still do that easily by adding a new flag or limit. What I'm a lot more worried about is how to tell the file system that allocations are done right for these requirement. There is no way a user can know that allocations in an existing file are properly aligned, so atomic writes will just fail on existing files. I suspect we need an on-disk flag that forces allocations to be aligned to the atomic write limit, in some ways similar how the XFS rt flag works. You'd need to set it on an empty file, and all allocations after that are guaranteed to be properly aligned. > - For block layer, should atomic_write_unit_max be limited by > max_sectors_kb? Currently it is not. Well. It must be limited to max_hw_sectors to actually work. max_sectors is a software limit below that, which with modern hardware is actually pretty silly and a real performance issue with todays workloads when people don't tweak it.. > - How to improve requirement that iovecs are PAGE-aligned. > There are 2x issues: > a. We impose this rule to not split BIOs due to virt boundary for > NVMe, but there virt boundary is 4K (and not PAGE size, so broken for > 16K/64K pages). Easy solution is to impose requirement that iovecs > are 4K-aligned. > b. We don't enforce this rule for virt boundary == 0, i.e. SCSI .. we require any device that wants to support atomic writes to not have that silly limit. For NVMe that would require SGL support (and some driver changes I've been wanting to make for long where we always use SGLs for transfers larger than a single PRP if supported) > - Since debugging torn-writes due to unwanted kernel BIO splitting/merging > would be horrible, should we add some kernel storage stack software > integrity checks? Yes, I think we'll need asserts in the drivers. At least for NVMe I will insist on them. For SCSI I think the device actually checks because the atomic writes are a different command anyway, or am I misunderstanding how SCSI works here?
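For reference, the per-write rules quoted above (a power-of-2 length between atomic_write_unit_min and atomic_write_unit_max, at a naturally aligned offset) boil down to a check like the sketch below. It only illustrates the constraint; the unit_min/unit_max values would come from the statx fields proposed in this series, which are not in released headers, and the helper name is made up.

#include <stdbool.h>
#include <stddef.h>

/*
 * Sketch of the per-write rules described above: the total length must
 * be a power of two between atomic_write_unit_min and
 * atomic_write_unit_max (inclusive), and the file offset must be
 * naturally aligned to that length.  unit_min/unit_max would come from
 * the statx fields proposed in this series.
 */
static bool atomic_write_ok(unsigned int unit_min, unsigned int unit_max,
			    unsigned long long offset, size_t len)
{
	if (!len || len < unit_min || len > unit_max)
		return false;
	if (len & (len - 1))		/* not a power of two */
		return false;
	if (offset & (len - 1))		/* not naturally aligned */
		return false;
	return true;
}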
On 12/12/2023 16:32, Christoph Hellwig wrote: > On Tue, Dec 12, 2023 at 11:08:28AM +0000, John Garry wrote: >> Two new fields are added to struct statx - atomic_write_unit_min and >> atomic_write_unit_max. For each atomic individual write, the total length >> of a write must be a between atomic_write_unit_min and >> atomic_write_unit_max, inclusive, and a power-of-2. The write must also be >> at a natural offset in the file wrt the write length. >> >> SCSI sd.c and scsi_debug and NVMe kernel support is added. >> >> Some open questions: >> - How to make API extensible for when we have no HW support? In that case, >> we would prob not have to follow rule of power-of-2 length et al. >> As a possible solution, maybe we can say that atomic writes are >> supported for the file via statx, but not set unit_min and max values, >> and this means that writes need to be just FS block aligned there. > I don't think the power of two length is much of a problem to be > honest, and if we every want to lift it we can still do that easily > by adding a new flag or limit. ok, but it would be nice to have some idea on what that flag or limit change would be. > > What I'm a lot more worried about is how to tell the file system that > allocations are done right for these requirement. There is no way > a user can know that allocations in an existing file are properly > aligned, so atomic writes will just fail on existing files. > > I suspect we need an on-disk flag that forces allocations to be > aligned to the atomic write limit, in some ways similar how the > XFS rt flag works. You'd need to set it on an empty file, and all > allocations after that are guaranteed to be properly aligned. Hmmm... so how is this different to the XFS forcealign feature? For XFS, I thought that your idea was to always CoW new extents for misaligned extents or writes which spanned multiple extents. > >> - For block layer, should atomic_write_unit_max be limited by >> max_sectors_kb? Currently it is not. > Well. It must be limited to max_hw_sectors to actually work. Sure, as max_hw_sectors may also be limited by DMA controller max mapping size. > max_sectors is a software limit below that, which with modern hardware > is actually pretty silly and a real performance issue with todays > workloads when people don't tweak it.. Right, so we should limit atomic write queue limits to max_hw_sectors. But people can still tweak max_sectors, and I am inclined to say that atomic_write_unit_max et al should be (dynamically) limited to max_sectors also. > >> - How to improve requirement that iovecs are PAGE-aligned. >> There are 2x issues: >> a. We impose this rule to not split BIOs due to virt boundary for >> NVMe, but there virt boundary is 4K (and not PAGE size, so broken for >> 16K/64K pages). Easy solution is to impose requirement that iovecs >> are 4K-aligned. >> b. We don't enforce this rule for virt boundary == 0, i.e. SCSI > .. we require any device that wants to support atomic writes to not > have that silly limit. For NVMe that would require SGL support > (and some driver changes I've been wanting to make for long where > we always use SGLs for transfers larger than a single PRP if supported) If we could avoid dealing with a virt boundary, then that would be nice. Are there any patches yet for the change to always use SGLs for transfers larger than a single PRP? 
On a related topic, I am not sure about how - or if we even should - enforce iovec PAGE-alignment or length; rather, the user could just be advised that iovecs must be PAGE-aligned and min PAGE length to achieve atomic_write_unit_max. > > >> - Since debugging torn-writes due to unwanted kernel BIO splitting/merging >> would be horrible, should we add some kernel storage stack software >> integrity checks? > Yes, I think we'll need asserts in the drivers. At least for NVMe I > will insist on them. Please see 16/16 in this series. > For SCSI I think the device actually checks > because the atomic writes are a different command anyway, or am I > misunderstanding how SCSI works here? Right, for WRITE ATOMIC (16) the target will reject a command when it does not respect the device atomic write limits. Thanks, John
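As a rough illustration of the kind of driver-side assert being discussed here (not the actual check in patch 16/16 of the series), something along these lines could warn when an atomic write reaching the driver no longer satisfies the advertised limits, i.e. it was split or merged somewhere in the stack. REQ_ATOMIC and the atomic_write_unit_{min,max} queue limits are names from the proposed series; the helper itself is hypothetical.

#include <linux/blkdev.h>
#include <linux/log2.h>

/*
 * Hypothetical integrity check, not the series code: warn if an atomic
 * write request was split or merged behind the submitter's back.
 * REQ_ATOMIC and the atomic_write_unit_{min,max} limits are from the
 * proposed series.
 */
static inline void assert_atomic_write_intact(struct request *req)
{
	struct queue_limits *lim = &req->q->limits;
	unsigned int bytes = blk_rq_bytes(req);

	if (!(req->cmd_flags & REQ_ATOMIC))
		return;

	WARN_ON_ONCE(!is_power_of_2(bytes));
	WARN_ON_ONCE(bytes < lim->atomic_write_unit_min ||
		     bytes > lim->atomic_write_unit_max);
	/* the write must still sit on its natural boundary on disk */
	WARN_ON_ONCE((blk_rq_pos(req) << SECTOR_SHIFT) & (bytes - 1));
}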
On Wed, Dec 13, 2023 at 09:32:06AM +0000, John Garry wrote: >>> - How to make API extensible for when we have no HW support? In that case, >>> we would prob not have to follow rule of power-of-2 length et al. >>> As a possible solution, maybe we can say that atomic writes are >>> supported for the file via statx, but not set unit_min and max values, >>> and this means that writes need to be just FS block aligned there. >> I don't think the power of two length is much of a problem to be >> honest, and if we every want to lift it we can still do that easily >> by adding a new flag or limit. > > ok, but it would be nice to have some idea on what that flag or limit > change would be. That would require a concrete use case. The simples thing for a file system that can or does log I/O it would simply be a flag waving all the alignment and size requirements. >> I suspect we need an on-disk flag that forces allocations to be >> aligned to the atomic write limit, in some ways similar how the >> XFS rt flag works. You'd need to set it on an empty file, and all >> allocations after that are guaranteed to be properly aligned. > > Hmmm... so how is this different to the XFS forcealign feature? Maybe not much. But that's not what it is about - we need a common API for this and not some XFS internal flag. So if this is something we could support in ext4 as well that would be a good step. And for btrfs you'd probably want to support something like it in nocow mode if people care enough, or always support atomics and write out of place. > For XFS, I thought that your idea was to always CoW new extents for > misaligned extents or writes which spanned multiple extents. Well, that is useful for two things: - atomic writes on hardware that does not support it - atomic writes for bufferd I/O - supporting other sizes / alignments than the strict power of two above. > Right, so we should limit atomic write queue limits to max_hw_sectors. But > people can still tweak max_sectors, and I am inclined to say that > atomic_write_unit_max et al should be (dynamically) limited to max_sectors > also. Allowing people to tweak it seems to be asking for trouble. >> have that silly limit. For NVMe that would require SGL support >> (and some driver changes I've been wanting to make for long where >> we always use SGLs for transfers larger than a single PRP if supported) > > If we could avoid dealing with a virt boundary, then that would be nice. > > Are there any patches yet for the change to always use SGLs for transfers > larger than a single PRP? No. > On a related topic, I am not sure about how - or if we even should - > enforce iovec PAGE-alignment or length; rather, the user could just be > advised that iovecs must be PAGE-aligned and min PAGE length to achieve > atomic_write_unit_max. Anything that just advices the user an it not clear cut and results in an error is data loss waiting to happen. Even more so if it differs from device to device.
On 13/12/2023 15:44, Christoph Hellwig wrote: > On Wed, Dec 13, 2023 at 09:32:06AM +0000, John Garry wrote: >>>> - How to make API extensible for when we have no HW support? In that case, >>>> we would prob not have to follow rule of power-of-2 length et al. >>>> As a possible solution, maybe we can say that atomic writes are >>>> supported for the file via statx, but not set unit_min and max values, >>>> and this means that writes need to be just FS block aligned there. >>> I don't think the power of two length is much of a problem to be >>> honest, and if we every want to lift it we can still do that easily >>> by adding a new flag or limit. >> ok, but it would be nice to have some idea on what that flag or limit >> change would be. > That would require a concrete use case. The simples thing for a file > system that can or does log I/O it would simply be a flag waving all > the alignment and size requirements. ok... > >>> I suspect we need an on-disk flag that forces allocations to be >>> aligned to the atomic write limit, in some ways similar how the >>> XFS rt flag works. You'd need to set it on an empty file, and all >>> allocations after that are guaranteed to be properly aligned. >> Hmmm... so how is this different to the XFS forcealign feature? > Maybe not much. But that's not what it is about - we need a common > API for this and not some XFS internal flag. So if this is something > we could support in ext4 as well that would be a good step. And for > btrfs you'd probably want to support something like it in nocow mode > if people care enough, or always support atomics and write out of > place. The forcealign feature was a bit of case of knowing specifically what to do for XFS to enable atomic writes. But some common method to configure any FS file for atomic writes (if supported) would be preferable, right? > >> For XFS, I thought that your idea was to always CoW new extents for >> misaligned extents or writes which spanned multiple extents. > Well, that is useful for two things: > > - atomic writes on hardware that does not support it > - atomic writes for bufferd I/O > - supporting other sizes / alignments than the strict power of > two above. Sure > >> Right, so we should limit atomic write queue limits to max_hw_sectors. But >> people can still tweak max_sectors, and I am inclined to say that >> atomic_write_unit_max et al should be (dynamically) limited to max_sectors >> also. > Allowing people to tweak it seems to be asking for trouble. I tend to agree.... Let's just limit at max_hw_sectors. > >>> have that silly limit. For NVMe that would require SGL support >>> (and some driver changes I've been wanting to make for long where >>> we always use SGLs for transfers larger than a single PRP if supported) >> If we could avoid dealing with a virt boundary, then that would be nice. >> >> Are there any patches yet for the change to always use SGLs for transfers >> larger than a single PRP? > No. As below, if we are going to enforce alignment/size of iovecs elsewhere, then I don't need to worry about virt boundary. > >> On a related topic, I am not sure about how - or if we even should - >> enforce iovec PAGE-alignment or length; rather, the user could just be >> advised that iovecs must be PAGE-aligned and min PAGE length to achieve >> atomic_write_unit_max. > Anything that just advices the user an it not clear cut and results in > an error is data loss waiting to happen. Even more so if it differs > from device to device. ok, fine. I suppose that we better enforce it then. 
Thanks, John
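A minimal sketch of the clamping this sub-thread settles on, assuming the atomic_write_unit_* queue limits from the series are held in bytes: cap the advertised maximum by max_hw_sectors (not the tunable max_sectors) and keep the result a power of two. The helper name is made up for illustration.

#include <linux/blkdev.h>
#include <linux/log2.h>

/*
 * Sketch of the clamp discussed above: never advertise an atomic write
 * unit larger than max_hw_sectors allows, and keep it a power of two.
 */
static void blk_clamp_atomic_write_limits(struct queue_limits *lim)
{
	unsigned int cap = lim->max_hw_sectors << SECTOR_SHIFT;

	if (!cap || !lim->atomic_write_unit_max)
		return;

	if (lim->atomic_write_unit_max > cap)
		lim->atomic_write_unit_max = rounddown_pow_of_two(cap);

	if (lim->atomic_write_unit_min > lim->atomic_write_unit_max) {
		/* nothing usable left: disable atomic writes */
		lim->atomic_write_unit_min = 0;
		lim->atomic_write_unit_max = 0;
	}
}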
On Wed, Dec 13, 2023 at 04:27:35PM +0000, John Garry wrote: >>> Are there any patches yet for the change to always use SGLs for transfers >>> larger than a single PRP? >> No. Here is the WIP version. With that you'd need to make atomic writes conditional on !ctrl->need_virt_boundary. diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 8ebdfd623e0f78..e04faffd6551fe 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -1889,7 +1889,8 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl, blk_queue_max_hw_sectors(q, ctrl->max_hw_sectors); blk_queue_max_segments(q, min_t(u32, max_segments, USHRT_MAX)); } - blk_queue_virt_boundary(q, NVME_CTRL_PAGE_SIZE - 1); + if (q == ctrl->admin_q || ctrl->need_virt_boundary) + blk_queue_virt_boundary(q, NVME_CTRL_PAGE_SIZE - 1); blk_queue_dma_alignment(q, 3); blk_queue_write_cache(q, vwc, vwc); } diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index e7411dac00f725..aa98794a3ec53d 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -262,6 +262,7 @@ enum nvme_ctrl_flags { struct nvme_ctrl { bool comp_seen; bool identified; + bool need_virt_boundary; enum nvme_ctrl_state state; spinlock_t lock; struct mutex scan_lock; diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 61af7ff1a9d6ba..a8d273b475cb40 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -60,8 +60,7 @@ MODULE_PARM_DESC(max_host_mem_size_mb, static unsigned int sgl_threshold = SZ_32K; module_param(sgl_threshold, uint, 0644); MODULE_PARM_DESC(sgl_threshold, - "Use SGLs when average request segment size is larger or equal to " - "this size. Use 0 to disable SGLs."); + "Use SGLs when > 0. Use 0 to disable SGLs."); #define NVME_PCI_MIN_QUEUE_SIZE 2 #define NVME_PCI_MAX_QUEUE_SIZE 4095 @@ -504,23 +503,6 @@ static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx) spin_unlock(&nvmeq->sq_lock); } -static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req, - int nseg) -{ - struct nvme_queue *nvmeq = req->mq_hctx->driver_data; - unsigned int avg_seg_size; - - avg_seg_size = DIV_ROUND_UP(blk_rq_payload_bytes(req), nseg); - - if (!nvme_ctrl_sgl_supported(&dev->ctrl)) - return false; - if (!nvmeq->qid) - return false; - if (!sgl_threshold || avg_seg_size < sgl_threshold) - return false; - return true; -} - static void nvme_free_prps(struct nvme_dev *dev, struct request *req) { const int last_prp = NVME_CTRL_PAGE_SIZE / sizeof(__le64) - 1; @@ -769,12 +751,14 @@ static blk_status_t nvme_setup_sgl_simple(struct nvme_dev *dev, static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req, struct nvme_command *cmnd) { + struct nvme_queue *nvmeq = req->mq_hctx->driver_data; struct nvme_iod *iod = blk_mq_rq_to_pdu(req); + bool sgl_supported = nvme_ctrl_sgl_supported(&dev->ctrl) && + nvmeq->qid && sgl_threshold; blk_status_t ret = BLK_STS_RESOURCE; int rc; if (blk_rq_nr_phys_segments(req) == 1) { - struct nvme_queue *nvmeq = req->mq_hctx->driver_data; struct bio_vec bv = req_bvec(req); if (!is_pci_p2pdma_page(bv.bv_page)) { @@ -782,8 +766,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req, return nvme_setup_prp_simple(dev, req, &cmnd->rw, &bv); - if (nvmeq->qid && sgl_threshold && - nvme_ctrl_sgl_supported(&dev->ctrl)) + if (sgl_supported) return nvme_setup_sgl_simple(dev, req, &cmnd->rw, &bv); } @@ -806,7 +789,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req, goto out_free_sg; } - if (nvme_pci_use_sgls(dev, 
req, iod->sgt.nents)) + if (sgl_supported) ret = nvme_pci_setup_sgls(dev, req, &cmnd->rw); else ret = nvme_pci_setup_prps(dev, req, &cmnd->rw); @@ -3036,6 +3019,8 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id) result = nvme_init_ctrl_finish(&dev->ctrl, false); if (result) goto out_disable; + if (!nvme_ctrl_sgl_supported(&dev->ctrl)) + dev->ctrl.need_virt_boundary = true; nvme_dbbuf_dma_alloc(dev); diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c index 81e2621169e5d3..416a9fbcccfc74 100644 --- a/drivers/nvme/host/rdma.c +++ b/drivers/nvme/host/rdma.c @@ -838,6 +838,7 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl, error = nvme_init_ctrl_finish(&ctrl->ctrl, false); if (error) goto out_quiesce_queue; + ctrl->ctrl.need_virt_boundary = true; return 0;
On 14/12/2023 14:37, Christoph Hellwig wrote:
> On Wed, Dec 13, 2023 at 04:27:35PM +0000, John Garry wrote:
>>>> Are there any patches yet for the change to always use SGLs for transfers
>>>> larger than a single PRP?
>>> No.
> Here is the WIP version. With that you'd need to make atomic writes
> conditional on !ctrl->need_virt_boundary.

Cheers, I gave it a quick spin and on the surface looks ok.

# more /sys/block/nvme0n1/queue/atomic_write_unit_max_bytes
262144
# more /sys/block/nvme0n1/queue/virt_boundary_mask
0
On Thu, Dec 14, 2023 at 03:37:09PM +0100, Christoph Hellwig wrote: > On Wed, Dec 13, 2023 at 04:27:35PM +0000, John Garry wrote: > >>> Are there any patches yet for the change to always use SGLs for transfers > >>> larger than a single PRP? > >> No. > > Here is the WIP version. With that you'd need to make atomic writes > conditional on !ctrl->need_virt_boundary. This looks pretty good as-is! > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > index 8ebdfd623e0f78..e04faffd6551fe 100644 > --- a/drivers/nvme/host/core.c > +++ b/drivers/nvme/host/core.c > @@ -1889,7 +1889,8 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl, > blk_queue_max_hw_sectors(q, ctrl->max_hw_sectors); > blk_queue_max_segments(q, min_t(u32, max_segments, USHRT_MAX)); > } > - blk_queue_virt_boundary(q, NVME_CTRL_PAGE_SIZE - 1); > + if (q == ctrl->admin_q || ctrl->need_virt_boundary) > + blk_queue_virt_boundary(q, NVME_CTRL_PAGE_SIZE - 1); > blk_queue_dma_alignment(q, 3); > blk_queue_write_cache(q, vwc, vwc); > } > diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h > index e7411dac00f725..aa98794a3ec53d 100644 > --- a/drivers/nvme/host/nvme.h > +++ b/drivers/nvme/host/nvme.h > @@ -262,6 +262,7 @@ enum nvme_ctrl_flags { > struct nvme_ctrl { > bool comp_seen; > bool identified; > + bool need_virt_boundary; > enum nvme_ctrl_state state; > spinlock_t lock; > struct mutex scan_lock; > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c > index 61af7ff1a9d6ba..a8d273b475cb40 100644 > --- a/drivers/nvme/host/pci.c > +++ b/drivers/nvme/host/pci.c > @@ -60,8 +60,7 @@ MODULE_PARM_DESC(max_host_mem_size_mb, > static unsigned int sgl_threshold = SZ_32K; > module_param(sgl_threshold, uint, 0644); > MODULE_PARM_DESC(sgl_threshold, > - "Use SGLs when average request segment size is larger or equal to " > - "this size. Use 0 to disable SGLs."); > + "Use SGLs when > 0. 
Use 0 to disable SGLs."); > > #define NVME_PCI_MIN_QUEUE_SIZE 2 > #define NVME_PCI_MAX_QUEUE_SIZE 4095 > @@ -504,23 +503,6 @@ static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx) > spin_unlock(&nvmeq->sq_lock); > } > > -static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req, > - int nseg) > -{ > - struct nvme_queue *nvmeq = req->mq_hctx->driver_data; > - unsigned int avg_seg_size; > - > - avg_seg_size = DIV_ROUND_UP(blk_rq_payload_bytes(req), nseg); > - > - if (!nvme_ctrl_sgl_supported(&dev->ctrl)) > - return false; > - if (!nvmeq->qid) > - return false; > - if (!sgl_threshold || avg_seg_size < sgl_threshold) > - return false; > - return true; > -} > - > static void nvme_free_prps(struct nvme_dev *dev, struct request *req) > { > const int last_prp = NVME_CTRL_PAGE_SIZE / sizeof(__le64) - 1; > @@ -769,12 +751,14 @@ static blk_status_t nvme_setup_sgl_simple(struct nvme_dev *dev, > static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req, > struct nvme_command *cmnd) > { > + struct nvme_queue *nvmeq = req->mq_hctx->driver_data; > struct nvme_iod *iod = blk_mq_rq_to_pdu(req); > + bool sgl_supported = nvme_ctrl_sgl_supported(&dev->ctrl) && > + nvmeq->qid && sgl_threshold; > blk_status_t ret = BLK_STS_RESOURCE; > int rc; > > if (blk_rq_nr_phys_segments(req) == 1) { > - struct nvme_queue *nvmeq = req->mq_hctx->driver_data; > struct bio_vec bv = req_bvec(req); > > if (!is_pci_p2pdma_page(bv.bv_page)) { > @@ -782,8 +766,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req, > return nvme_setup_prp_simple(dev, req, > &cmnd->rw, &bv); > > - if (nvmeq->qid && sgl_threshold && > - nvme_ctrl_sgl_supported(&dev->ctrl)) > + if (sgl_supported) > return nvme_setup_sgl_simple(dev, req, > &cmnd->rw, &bv); > } > @@ -806,7 +789,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req, > goto out_free_sg; > } > > - if (nvme_pci_use_sgls(dev, req, iod->sgt.nents)) > + if (sgl_supported) > ret = nvme_pci_setup_sgls(dev, req, &cmnd->rw); > else > ret = nvme_pci_setup_prps(dev, req, &cmnd->rw); > @@ -3036,6 +3019,8 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id) > result = nvme_init_ctrl_finish(&dev->ctrl, false); > if (result) > goto out_disable; > + if (!nvme_ctrl_sgl_supported(&dev->ctrl)) > + dev->ctrl.need_virt_boundary = true; > > nvme_dbbuf_dma_alloc(dev); > > diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c > index 81e2621169e5d3..416a9fbcccfc74 100644 > --- a/drivers/nvme/host/rdma.c > +++ b/drivers/nvme/host/rdma.c > @@ -838,6 +838,7 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl, > error = nvme_init_ctrl_finish(&ctrl->ctrl, false); > if (error) > goto out_quiesce_queue; > + ctrl->ctrl.need_virt_boundary = true; > > return 0; >
On Wed, Dec 13, 2023 at 04:27:35PM +0000, John Garry wrote: > On 13/12/2023 15:44, Christoph Hellwig wrote: > > On Wed, Dec 13, 2023 at 09:32:06AM +0000, John Garry wrote: > > > > > - How to make API extensible for when we have no HW support? In that case, > > > > > we would prob not have to follow rule of power-of-2 length et al. > > > > > As a possible solution, maybe we can say that atomic writes are > > > > > supported for the file via statx, but not set unit_min and max values, > > > > > and this means that writes need to be just FS block aligned there. > > > > I don't think the power of two length is much of a problem to be > > > > honest, and if we every want to lift it we can still do that easily > > > > by adding a new flag or limit. > > > ok, but it would be nice to have some idea on what that flag or limit > > > change would be. > > That would require a concrete use case. The simples thing for a file > > system that can or does log I/O it would simply be a flag waving all > > the alignment and size requirements. > > ok... > > > > > > > I suspect we need an on-disk flag that forces allocations to be > > > > aligned to the atomic write limit, in some ways similar how the > > > > XFS rt flag works. You'd need to set it on an empty file, and all > > > > allocations after that are guaranteed to be properly aligned. > > > Hmmm... so how is this different to the XFS forcealign feature? > > Maybe not much. But that's not what it is about - we need a common > > API for this and not some XFS internal flag. So if this is something > > we could support in ext4 as well that would be a good step. And for > > btrfs you'd probably want to support something like it in nocow mode > > if people care enough, or always support atomics and write out of > > place. > > The forcealign feature was a bit of case of knowing specifically what to do > for XFS to enable atomic writes. > > But some common method to configure any FS file for atomic writes (if > supported) would be preferable, right? /me stumbles back in with plenty of covidbrain to go around. So ... Christoph, you're asking for a common API for sysadmins/applications to signal to the filesystem that they want all data allocations aligned to a given value, right? e.g. if a nvme device advertises a capability for untorn writes between $lbasize and 64k, then we need a way to set up each untorn-file with some alignment between $lbasize and 64k? or if cxl storage becomes not ung-dly expensive, then we'd want a way to set up files with an alignment of 2M (x86) or 512M (arm64lol) to take advantage of PMD mmap mappings? I guess we could define a fsxattr xflag for FORCEALIGN and declare that the fsx_extsize field becomes mandates mapping startoff/startblock alignment if FORCEALIGN is set. It just happens that since fsxattr comes from XFS then it'll be easy enough for us to wire that up. Maybe. Concretely, xfs would have to have a way to force the ag/rtgroup/rtextent size to align with the maximum anticipated required alignment (e.g. 64k) of the device so that groups (and hence per-group allocations) don't split the alignment. xfs would also have need to constrain the per-inode alignment to some factor of the ag size / rt extent size. Writing files could use hw acceleration directly if all the factors are set up correctly; or fall back to COW. Assuming there's a way to require that writeback pick up all the folios in a $forcealign chunk for simultaneous writeout. (I think this is a lingering bug in the xfs rtreflink patchset.) 
Also, IIRC John mentioned that there might be some programs that would rather get an EIO than fall back to COW. ext4 could maybe sort of do this by allowing forcealign alignments up to the bigalloc size, if bigalloc is enabled. Assuming bigalloc isn't DOA. IIRC bigalloc only supports power of 2 factors. It wouldn't have the COW problem because it only supports overwrites. As usual, I dunno about btrfs. So does that sound like more or less what you're getting at, Christoph? Or have I misread the entire thing? (PS: there's been like 1200 fsdevel messages since I got sick and I've only had sufficient brainpower to limp the xfs 6.8 patches across the finish line...) --D > > > > > For XFS, I thought that your idea was to always CoW new extents for > > > misaligned extents or writes which spanned multiple extents. > > Well, that is useful for two things: > > > > - atomic writes on hardware that does not support it > > - atomic writes for bufferd I/O > > - supporting other sizes / alignments than the strict power of > > two above. > > Sure > > > > > > Right, so we should limit atomic write queue limits to max_hw_sectors. But > > > people can still tweak max_sectors, and I am inclined to say that > > > atomic_write_unit_max et al should be (dynamically) limited to max_sectors > > > also. > > Allowing people to tweak it seems to be asking for trouble. > > I tend to agree.... > > Let's just limit at max_hw_sectors. > > > > > > > have that silly limit. For NVMe that would require SGL support > > > > (and some driver changes I've been wanting to make for long where > > > > we always use SGLs for transfers larger than a single PRP if supported) > > > If we could avoid dealing with a virt boundary, then that would be nice. > > > > > > Are there any patches yet for the change to always use SGLs for transfers > > > larger than a single PRP? > > No. > > As below, if we are going to enforce alignment/size of iovecs elsewhere, > then I don't need to worry about virt boundary. > > > > > > On a related topic, I am not sure about how - or if we even should - > > > enforce iovec PAGE-alignment or length; rather, the user could just be > > > advised that iovecs must be PAGE-aligned and min PAGE length to achieve > > > atomic_write_unit_max. > > Anything that just advices the user an it not clear cut and results in > > an error is data loss waiting to happen. Even more so if it differs > > from device to device. > > ok, fine. I suppose that we better enforce it then. > > Thanks, > John >
On Mon, Dec 18, 2023 at 09:14:56PM -0800, Darrick J. Wong wrote: > /me stumbles back in with plenty of covidbrain to go around. > > So ... Christoph, you're asking for a common API for > sysadmins/applications to signal to the filesystem that they want all > data allocations aligned to a given value, right? > > e.g. if a nvme device advertises a capability for untorn writes between > $lbasize and 64k, then we need a way to set up each untorn-file with > some alignment between $lbasize and 64k? > > or if cxl storage becomes not ung-dly expensive, then we'd want a way to > set up files with an alignment of 2M (x86) or 512M (arm64lol) to take > advantage of PMD mmap mappings? The most important point is to not mix these up. If we want to use a file for atomic writes I should tell the fs about it, and preferably in a way that does not require me to know about weird internal implementation details of every file system. I really just want to use atomic writes. Preferably I'd just start using them after asking for the limits. But that's obviously not going to work for file systems that use the hardware offload and don't otherwise align to the limit (which would suck for very small files anyway :)) So as a compromise I tell the file system before writing or otherwise adding any data [1] to file that I want to use atomic writes so that the fs can take the right decisions. [1] reflinking data into a such marked file will be ... interesting.
On 19/12/2023 05:21, Christoph Hellwig wrote: > On Mon, Dec 18, 2023 at 09:14:56PM -0800, Darrick J. Wong wrote: >> /me stumbles back in with plenty of covidbrain to go around. >> >> So ... Christoph, you're asking for a common API for >> sysadmins/applications to signal to the filesystem that they want all >> data allocations aligned to a given value, right? >> >> e.g. if a nvme device advertises a capability for untorn writes between >> $lbasize and 64k, then we need a way to set up each untorn-file with >> some alignment between $lbasize and 64k? >> >> or if cxl storage becomes not ung-dly expensive, then we'd want a way to >> set up files with an alignment of 2M (x86) or 512M (arm64lol) to take >> advantage of PMD mmap mappings? > > The most important point is to not mix these up. > > If we want to use a file for atomic writes I should tell the fs about > it, and preferably in a way that does not require me to know about weird > internal implementation details of every file system. I really just > want to use atomic writes. Preferably I'd just start using them after > asking for the limits. But that's obviously not going to work for > file systems that use the hardware offload and don't otherwise align > to the limit (which would suck for very small files anyway :)) > > So as a compromise I tell the file system before writing or otherwise > adding any data [1] to file that I want to use atomic writes so that > the fs can take the right decisions. > > [1] reflinking data into a such marked file will be ... interesting. > How about something based on fcntl, like below? We will prob also require some per-FS flag for enabling atomic writes without HW support. That flag might be also useful for XFS for differentiating forcealign for atomic writes with just forcealign. ---8<---- diff --git a/fs/fcntl.c b/fs/fcntl.c index c80a6acad742..d4a50655565a 100644 --- a/fs/fcntl.c +++ b/fs/fcntl.c @@ -313,6 +313,15 @@ static long fcntl_rw_hint(struct file *file, unsigned int cmd, } } +static long fcntl_atomic_write(struct file *file, unsigned int cmd, + unsigned long arg) +{ + if (file->f_op->enable_atomic_writes) + return file->f_op->enable_atomic_writes(file, arg); + + return -EOPNOTSUPP; +} + static long do_fcntl(int fd, unsigned int cmd, unsigned long arg, struct file *filp) { @@ -419,6 +428,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg, case F_SET_RW_HINT: err = fcntl_rw_hint(filp, cmd, arg); break; + case F_SET_ATOMIC_WRITE_SIZE: + err = fcntl_atomic_write(filp, cmd, arg); + break; default: break; } diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index e33e5e13b95f..dba206cbe1ab 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1478,6 +1478,46 @@ xfs_file_mmap( return 0; } +STATIC int +xfs_enable_atomic_writes( + struct file *file, + unsigned int unit_max) +{ + struct xfs_inode *ip = XFS_I(file_inode(file)); + struct mnt_idmap *idmap = file_mnt_idmap(file); + struct dentry *dentry = file->f_path.dentry; + struct xfs_buftarg *target = xfs_inode_buftarg(ip); + struct block_device *bdev = target->bt_bdev; + struct request_queue *q = bdev->bd_queue; + struct inode *inode = &ip->i_vnode; + + if (queue_atomic_write_unit_max_bytes(q) == 0) { + if (unit_max != 0) { + /* We expect unbounded CoW size if no HW support */ + return -ENOTBLK; + } + /* Do something for CoW support ... 
*/ + + return 0; + } + + if (!unit_max) + return -EINVAL; + + /* bodge alert */ + if (inode->i_op->fileattr_set) { + struct fileattr fa = { + .fsx_extsize = unit_max, + .fsx_xflags = FS_XFLAG_EXTSIZE | FS_XFLAG_FORCEALIGN, + .fsx_valid = 1, + }; + + return inode->i_op->fileattr_set(idmap, dentry, &fa); + } + + return -EOPNOTSUPP; +} + const struct file_operations xfs_file_operations = { .llseek = xfs_file_llseek, .read_iter = xfs_file_read_iter, @@ -1498,6 +1538,7 @@ const struct file_operations xfs_file_operations = { .fallocate = xfs_file_fallocate, .fadvise = xfs_file_fadvise, .remap_file_range = xfs_file_remap_range, + .enable_atomic_writes = xfs_enable_atomic_writes, }; const struct file_operations xfs_dir_file_operations = { diff --git a/include/linux/fs.h b/include/linux/fs.h index 98b7a7a8c42e..a715f98057b8 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1956,6 +1956,7 @@ struct file_operations { int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags); int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *, unsigned int poll_flags); + int (*enable_atomic_writes)(struct file *, unsigned int unit_max); } __randomize_layout; /* Wrap a directory iterator that needs exclusive inode access */ diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 6c80f96049bd..69780c49622e 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -56,6 +56,8 @@ #define F_GET_FILE_RW_HINT (F_LINUX_SPECIFIC_BASE + 13) #define F_SET_FILE_RW_HINT (F_LINUX_SPECIFIC_BASE + 14) +#define F_SET_ATOMIC_WRITE_SIZE (F_LINUX_SPECIFIC_BASE + 15) + /* * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be * used to clear any hints previously set.
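A minimal userspace sketch of how the fcntl proposed above might be used, assuming a kernel carrying the RFC diff; F_SET_ATOMIC_WRITE_SIZE does not exist in mainline, so the fallback define below only mirrors the value chosen in that diff (F_LINUX_SPECIFIC_BASE, 1024, plus 15).

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#ifndef F_SET_ATOMIC_WRITE_SIZE
/* F_LINUX_SPECIFIC_BASE (1024) + 15, matching the RFC diff above */
#define F_SET_ATOMIC_WRITE_SIZE	1039
#endif

int main(void)
{
	int fd = open("dbfile", O_CREAT | O_RDWR | O_DIRECT, 0600);

	if (fd < 0)
		return 1;

	/* ask the filesystem to set the (empty) file up for 64k atomic writes */
	if (fcntl(fd, F_SET_ATOMIC_WRITE_SIZE, 64 * 1024) < 0)
		perror("F_SET_ATOMIC_WRITE_SIZE");

	close(fd);
	return 0;
}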
On Tue, Dec 19, 2023 at 12:41:37PM +0000, John Garry wrote: > How about something based on fcntl, like below? We will prob also require > some per-FS flag for enabling atomic writes without HW support. That flag > might be also useful for XFS for differentiating forcealign for atomic > writes with just forcealign. I would have just exposed it through a user visible flag instead of adding yet another ioctl/fcntl opcode and yet another method. And yes, for anything that doesn't always support atomic writes it would need to be persisted.
On 19/12/2023 15:17, Christoph Hellwig wrote: > On Tue, Dec 19, 2023 at 12:41:37PM +0000, John Garry wrote: >> How about something based on fcntl, like below? We will prob also require >> some per-FS flag for enabling atomic writes without HW support. That flag >> might be also useful for XFS for differentiating forcealign for atomic >> writes with just forcealign. > I would have just exposed it through a user visible flag instead of > adding yet another ioctl/fcntl opcode and yet another method. > Any specific type of flag? I would suggest a file attribute which we can set via chattr, but that is still using an ioctl and would require a new inode flag; but at least there is standard userspace support. > And yes, for anything that doesn't always support atomic writes it would > need to be persisted.
On Tue, Dec 19, 2023 at 04:53:27PM +0000, John Garry wrote: > On 19/12/2023 15:17, Christoph Hellwig wrote: >> On Tue, Dec 19, 2023 at 12:41:37PM +0000, John Garry wrote: >>> How about something based on fcntl, like below? We will prob also require >>> some per-FS flag for enabling atomic writes without HW support. That flag >>> might be also useful for XFS for differentiating forcealign for atomic >>> writes with just forcealign. >> I would have just exposed it through a user visible flag instead of >> adding yet another ioctl/fcntl opcode and yet another method. >> > > Any specific type of flag? > > I would suggest a file attribute which we can set via chattr, but that is > still using an ioctl and would require a new inode flag; but at least there > is standard userspace support. I'd be fine with that, but we're kinda running out of flag there. That's why I suggested the FS_XFLAG_ instead, which basically works the same.
On 21/12/2023 06:50, Christoph Hellwig wrote: > On Tue, Dec 19, 2023 at 04:53:27PM +0000, John Garry wrote: >> On 19/12/2023 15:17, Christoph Hellwig wrote: >>> On Tue, Dec 19, 2023 at 12:41:37PM +0000, John Garry wrote: >>>> How about something based on fcntl, like below? We will prob also require >>>> some per-FS flag for enabling atomic writes without HW support. That flag >>>> might be also useful for XFS for differentiating forcealign for atomic >>>> writes with just forcealign. >>> I would have just exposed it through a user visible flag instead of >>> adding yet another ioctl/fcntl opcode and yet another method. >>> >> >> Any specific type of flag? >> >> I would suggest a file attribute which we can set via chattr, but that is >> still using an ioctl and would require a new inode flag; but at least there >> is standard userspace support. > > I'd be fine with that, but we're kinda running out of flag there. Yeah, in looking at e2fsprogs they are all used. > That's why I suggested the FS_XFLAG_ instead, which basically works > the same. > ok, fine, we can try that out. On another topic, maybe you can advise.. I noticed the NVMe patch to stop always setting virt boundary (thanks), but I am struggling for the wording for iovecs rules. I'd like to reuse iov_iter_is_aligned() to enforce any such rule. I am thinking: - ubuf / iovecs need to be PAGE-aligned - each iovec needs to be length of multiple of PAGE_SIZE But that does not work for total length < PAGE_SIZE. So then we could have: - ubuf / iovecs need to be PAGE-aligned - each iovec needs to be length of multiple of atomic_write_unit_min. If total length > PAGE_SIZE, each iovec also needs to be a multiple of PAGE_SIZE. I'd rather something simpler. Maybe it's ok. Thanks, John
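For illustration, one possible reading of the rule sketched above, reusing the existing iov_iter_is_aligned() helper. The function name and the treatment of writes shorter than PAGE_SIZE (here: require a single segment) are assumptions, and the following messages move away from per-iovec alignment anyway.

#include <linux/mm.h>
#include <linux/uio.h>

/*
 * One possible reading of the rule above: every segment page-aligned
 * and a multiple of PAGE_SIZE long; a write shorter than PAGE_SIZE
 * must fit in a single segment.  Hypothetical helper, not series code.
 */
static bool atomic_write_iter_aligned(struct iov_iter *iter)
{
	if (iov_iter_count(iter) < PAGE_SIZE)
		return iter->nr_segs == 1;
	return iov_iter_is_aligned(iter, PAGE_SIZE - 1, PAGE_SIZE - 1);
}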
On Thu, Dec 21, 2023 at 09:49:35AM +0000, John Garry wrote: > I noticed the NVMe patch to stop always setting virt boundary (thanks), but > I am struggling for the wording for iovecs rules. I'd like to reuse > iov_iter_is_aligned() to enforce any such rule. > > I am thinking: > - ubuf / iovecs need to be PAGE-aligned > - each iovec needs to be length of multiple of PAGE_SIZE > > But that does not work for total length < PAGE_SIZE. > > So then we could have: > - ubuf / iovecs need to be PAGE-aligned > - each iovec needs to be length of multiple of atomic_write_unit_min. If > total length > PAGE_SIZE, each iovec also needs to be a multiple of > PAGE_SIZE. > > I'd rather something simpler. Maybe it's ok. If we decided to not support atomic writes on anything setting a virt boundary we don't have to care about the alignment of each vector, and IMHO we should do that as everything else would be a life in constant pain. If we really have a use case for atomic writes on consumer NVMe devices we'll just have to limit it to a single iovec.
On 21/12/2023 12:19, Christoph Hellwig wrote: >> So then we could have: >> - ubuf / iovecs need to be PAGE-aligned >> - each iovec needs to be length of multiple of atomic_write_unit_min. If >> total length > PAGE_SIZE, each iovec also needs to be a multiple of >> PAGE_SIZE. >> >> I'd rather something simpler. Maybe it's ok. > If we decided to not support atomic writes on anything setting a virt > boundary we don't have to care about the alignment of each vector, ok, I think that alignment is not so important, but we still need to consider a minimum length per iovec, such that we will always be able to fit a write of length atomic_write_unit_max in a bio. > and IMHO we should do that as everything else would be a life in > constant pain. If we really have a use case for atomic writes on > consumer NVMe devices we'll just have to limit it to a single iovec. I'd be more inclined to say that SGL support is compulsory, but I don't know if that is limiting atomic writes support to an unsatisfactory market segment.
On Thu, Dec 21, 2023 at 12:48:24PM +0000, John Garry wrote: >>> - ubuf / iovecs need to be PAGE-aligned >>> - each iovec needs to be length of multiple of atomic_write_unit_min. If >>> total length > PAGE_SIZE, each iovec also needs to be a multiple of >>> PAGE_SIZE. >>> >>> I'd rather something simpler. Maybe it's ok. >> If we decided to not support atomic writes on anything setting a virt >> boundary we don't have to care about the alignment of each vector, > > ok, I think that alignment is not so important, but we still need to > consider a minimum length per iovec, such that we will always be able to > fit a write of length atomic_write_unit_max in a bio. I don't think you man a minim length per iovec for that, but a maximum number of iovecs instead. For SGL-capable devices that would be BIO_MAX_VECS, otherwise 1.
On 21/12/2023 12:57, Christoph Hellwig wrote: > On Thu, Dec 21, 2023 at 12:48:24PM +0000, John Garry wrote: >>>> - ubuf / iovecs need to be PAGE-aligned >>>> - each iovec needs to be length of multiple of atomic_write_unit_min. If >>>> total length > PAGE_SIZE, each iovec also needs to be a multiple of >>>> PAGE_SIZE. >>>> >>>> I'd rather something simpler. Maybe it's ok. >>> If we decided to not support atomic writes on anything setting a virt >>> boundary we don't have to care about the alignment of each vector, >> >> ok, I think that alignment is not so important, but we still need to >> consider a minimum length per iovec, such that we will always be able to >> fit a write of length atomic_write_unit_max in a bio. > > I don't think you man a minim length per iovec for that, but a maximum > number of iovecs instead. That would make sense, but I was thinking that if the request queue has a limit on segments then we would need to specify a iovec min length. > For SGL-capable devices that would be > BIO_MAX_VECS, otherwise 1. ok, but we would need to advertise that or whatever segment limit. A statx field just for that seems a bit inefficient in terms of space.
On Thu, Dec 21, 2023 at 01:18:33PM +0000, John Garry wrote: >> For SGL-capable devices that would be >> BIO_MAX_VECS, otherwise 1. > > ok, but we would need to advertise that or whatever segment limit. A statx > field just for that seems a bit inefficient in terms of space. I'd rather not hard code BIO_MAX_VECS in the ABI, which suggest we want to export is as a field. Network file systems also might have their own limits for one reason or another.
On 21/12/2023 13:22, Christoph Hellwig wrote: > On Thu, Dec 21, 2023 at 01:18:33PM +0000, John Garry wrote: >>> For SGL-capable devices that would be >>> BIO_MAX_VECS, otherwise 1. >> >> ok, but we would need to advertise that or whatever segment limit. A statx >> field just for that seems a bit inefficient in terms of space. > > I'd rather not hard code BIO_MAX_VECS in the ABI, For sure > which suggest we > want to export is as a field. ok > Network file systems also might have > their own limits for one reason or another. Understood
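A hedged sketch of where this sub-thread lands: bound the number of segments rather than the per-iovec alignment. The helper and the sgl_capable parameter are made up for illustration; the per-device maximum would be advertised to userspace (e.g. via a statx field) rather than hard-coding BIO_MAX_VECS in the ABI.

#include <linux/bio.h>
#include <linux/uio.h>

/*
 * Sketch of the segment-count approach: SGL-capable devices (no virt
 * boundary) could allow up to BIO_MAX_VECS segments, anything else is
 * limited to one.  Hypothetical helper, not series code.
 */
static bool atomic_write_nr_segs_ok(const struct iov_iter *iter,
				    bool sgl_capable)
{
	unsigned int max_segs = sgl_capable ? BIO_MAX_VECS : 1;

	return iter->nr_segs <= max_segs;
}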
On 21/12/2023 06:50, Christoph Hellwig wrote: > On Tue, Dec 19, 2023 at 04:53:27PM +0000, John Garry wrote: >> On 19/12/2023 15:17, Christoph Hellwig wrote: >>> On Tue, Dec 19, 2023 at 12:41:37PM +0000, John Garry wrote: >>>> How about something based on fcntl, like below? We will prob also require >>>> some per-FS flag for enabling atomic writes without HW support. That flag >>>> might be also useful for XFS for differentiating forcealign for atomic >>>> writes with just forcealign. >>> I would have just exposed it through a user visible flag instead of >>> adding yet another ioctl/fcntl opcode and yet another method. >>> >> Any specific type of flag? >> >> I would suggest a file attribute which we can set via chattr, but that is >> still using an ioctl and would require a new inode flag; but at least there >> is standard userspace support. > I'd be fine with that, but we're kinda running out of flag there. > That's why I suggested the FS_XFLAG_ instead, which basically works > the same. Hi Christoph, Coming back to this topic... how about this FS_XFLAG_ and fsxattr update: diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h index da43810b7485..9ef15fced20c 100644 --- a/include/uapi/linux/fs.h +++ b/include/uapi/linux/fs.h @@ -118,7 +118,8 @@ struct fsxattr { __u32 fsx_nextents; /* nextents field value (get) */ __u32 fsx_projid; /* project identifier (get/set) */ __u32 fsx_cowextsize; /* CoW extsize field value (get/set)*/ - unsigned char fsx_pad[8]; + __u32 fsx_atomicwrites_size; /* unit max */ + unsigned char fsx_pad[4]; }; /* @@ -140,6 +141,7 @@ struct fsxattr { #define FS_XFLAG_FILESTREAM 0x00004000 /* use filestream allocator */ #define FS_XFLAG_DAX 0x00008000 /* use DAX for IO */ #define FS_XFLAG_COWEXTSIZE 0x00010000 /* CoW extent size allocator hint */ +#define FS_XFLAG_ATOMICWRITES 0x00020000 #define FS_XFLAG_HASATTR 0x80000000 /* no DIFLAG for this */ /* the read-only stuff doesn't really belong here, but any other place is lines 1-22/22 (END) Having FS_XFLAG_ATOMICWRITES set will lead to FMODE_CAN_ATOMIC_WRITE being set. So a user can issue: >xfs_io -c "atomic-writes 64K" mnt/file >xfs_io -c "atomic-writes" mnt/file [65536] mnt/file and then: /xfs_io -c "lsattr" mnt/file ------------------W mnt/file (W is new flag for atomic writes obvs) The user will still have to issue statx to get the actual atomic write limit for a file, as 'xfs_io -c "atomic-writes"' does not take into account any HW/linux block layer atomic write limits. FS_XFLAG_ATOMICWRITES will force XFS extent size and alignment to fsx_atomicwrites_size when we have HW support, so effectively same as forcealign. For no HW support, we still specify a size. In case of possible XFS CoW solution for no atomic write HW support, I suppose that there would be no size limit in reality, so the specifying the size would only be just for userspace experience consistency. Is this the sort of userspace API which you would like to see? John
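To make the proposed uapi concrete, here is roughly what 'xfs_io -c "atomic-writes 64K" mnt/file' would reduce to in userspace. This assumes the patched header above: FS_XFLAG_ATOMICWRITES and fsx_atomicwrites_size exist only in this RFC, so with an unpatched linux/fs.h the snippet will not compile.

#define _GNU_SOURCE
#include <sys/ioctl.h>
#include <fcntl.h>
#include <linux/fs.h>	/* assumes the patched uapi header above */

#ifndef FS_XFLAG_ATOMICWRITES
#define FS_XFLAG_ATOMICWRITES	0x00020000	/* value from the RFC diff above */
#endif

/* roughly what 'xfs_io -c "atomic-writes 64K" mnt/file' would do */
static int enable_atomic_writes(int fd, __u32 size)
{
	struct fsxattr fsx;

	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		return -1;

	fsx.fsx_xflags |= FS_XFLAG_ATOMICWRITES;
	fsx.fsx_atomicwrites_size = size;	/* requested unit max, e.g. 64k */

	return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}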
On Tue, Jan 09, 2024 at 09:55:24AM +0000, John Garry wrote: > So a user can issue: > > >xfs_io -c "atomic-writes 64K" mnt/file > >xfs_io -c "atomic-writes" mnt/file > [65536] mnt/file Let me try to decipher that: - the first call sets a 64k fsx_atomicwrites_size size - the secon call queries fsx_atomicwrites_size? > The user will still have to issue statx to get the actual atomic write > limit for a file, as 'xfs_io -c "atomic-writes"' does not take into account > any HW/linux block layer atomic write limits. So will the set side never fail? > Is this the sort of userspace API which you would like to see? What I had in mind (and that's doesn't mean it's right..) was that the user just sets a binary flag, and the fs reports the best it could. But there might be reasons to do it differently.
On 09/01/2024 16:02, Christoph Hellwig wrote: > On Tue, Jan 09, 2024 at 09:55:24AM +0000, John Garry wrote: >> So a user can issue: >> >>> xfs_io -c "atomic-writes 64K" mnt/file >>> xfs_io -c "atomic-writes" mnt/file >> [65536] mnt/file > Let me try to decipher that: > > - the first call sets a 64k fsx_atomicwrites_size size It should also set FS_XFLAG_ATOMICWRITES for the file. So this step does everything to enable atomic writes for that file. > - the secon call queries fsx_atomicwrites_size? Right, I'm just demo'ing how it would look > >> The user will still have to issue statx to get the actual atomic write >> limit for a file, as 'xfs_io -c "atomic-writes"' does not take into account >> any HW/linux block layer atomic write limits. > So will the set side never fail? It could fail. Examples of when it could fail could include: a. If user gave a bad size value. So the size should be a power-of-2 and also divisible into the AG size and compatible with any stripe alignment. b. If the file already had flags set which are incompatible with or not supported for atomic writes. c. the file already has data written. I guess that's obvious. > >> Is this the sort of userspace API which you would like to see? > What I had in mind (and that's doesn't mean it's right..) was that > the user just sets a binary flag, and the fs reports the best it > could. But there might be reasons to do it differently. That is what I am trying to do, but I also want to specify a size for the atomic write unit max which the user could want. I'd rather not use the atomic write unit max from the device for that, as that could be huge. However, re-reading a., above, makes me think that the kernel should have more input on this, but we still need some input on the max from the user... Thanks, John
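The failure cases listed above amount to a set-side validation along these lines. This is only a sketch with the filesystem geometry passed in explicitly, since the real checks would live inside XFS with its own types; the flag-compatibility check (b) is omitted.

#include <linux/errno.h>
#include <linux/log2.h>
#include <linux/types.h>

/*
 * Sketch of the set-side checks listed above; (b), incompatible inode
 * flags, is left out here.
 */
static int validate_atomic_write_size(unsigned int size,
				      unsigned int fs_blocksize,
				      unsigned long long ag_bytes,
				      unsigned int stripe_align_bytes,
				      bool file_has_data)
{
	/* (a) bad size: power of two, whole fs blocks, ... */
	if (!size || !is_power_of_2(size) || size % fs_blocksize)
		return -EINVAL;
	/* ... dividing evenly into the AG and any stripe alignment */
	if (ag_bytes % size)
		return -EINVAL;
	if (stripe_align_bytes && size % stripe_align_bytes)
		return -EINVAL;
	/* (c) the file must still be empty */
	if (file_has_data)
		return -EBUSY;
	return 0;
}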
On Tue, Jan 09, 2024 at 09:55:24AM +0000, John Garry wrote: > On 21/12/2023 06:50, Christoph Hellwig wrote: > > On Tue, Dec 19, 2023 at 04:53:27PM +0000, John Garry wrote: > > > On 19/12/2023 15:17, Christoph Hellwig wrote: > > > > On Tue, Dec 19, 2023 at 12:41:37PM +0000, John Garry wrote: > > > > > How about something based on fcntl, like below? We will prob also require > > > > > some per-FS flag for enabling atomic writes without HW support. That flag > > > > > might be also useful for XFS for differentiating forcealign for atomic > > > > > writes with just forcealign. > > > > I would have just exposed it through a user visible flag instead of > > > > adding yet another ioctl/fcntl opcode and yet another method. > > > > > > > Any specific type of flag? > > > > > > I would suggest a file attribute which we can set via chattr, but that is > > > still using an ioctl and would require a new inode flag; but at least there > > > is standard userspace support. > > I'd be fine with that, but we're kinda running out of flag there. > > That's why I suggested the FS_XFLAG_ instead, which basically works > > the same. > > Hi Christoph, > > Coming back to this topic... how about this FS_XFLAG_ and fsxattr update: > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h > index da43810b7485..9ef15fced20c 100644 > --- a/include/uapi/linux/fs.h > +++ b/include/uapi/linux/fs.h > @@ -118,7 +118,8 @@ struct fsxattr { > __u32 fsx_nextents; /* nextents field value (get) */ > __u32 fsx_projid; /* project identifier (get/set) */ > __u32 fsx_cowextsize; /* CoW extsize field value > (get/set)*/ > - unsigned char fsx_pad[8]; > + __u32 fsx_atomicwrites_size; /* unit max */ > + unsigned char fsx_pad[4]; > }; > > /* > @@ -140,6 +141,7 @@ struct fsxattr { > #define FS_XFLAG_FILESTREAM 0x00004000 /* use filestream allocator > */ > #define FS_XFLAG_DAX 0x00008000 /* use DAX for IO */ > #define FS_XFLAG_COWEXTSIZE 0x00010000 /* CoW extent size > allocator hint */ > +#define FS_XFLAG_ATOMICWRITES 0x00020000 > #define FS_XFLAG_HASATTR 0x80000000 /* no DIFLAG for this */ > > /* the read-only stuff doesn't really belong here, but any other place is > lines 1-22/22 (END) > > Having FS_XFLAG_ATOMICWRITES set will lead to FMODE_CAN_ATOMIC_WRITE being > set. > > So a user can issue: > > >xfs_io -c "atomic-writes 64K" mnt/file > >xfs_io -c "atomic-writes" mnt/file > [65536] mnt/file Where are you going to store this value in the inode? It requires a new field in the inode and so is a change of on-disk format, right? As it is, I really don't see this as a better solution than the original generic "force align" flag that simply makes the extent size hint alignment a hard physical alignment requirement rather than just a hint. This has multiple uses (DAX PMD alignment is another), so I just don't see why something that has a single, application specific API that implements a hard physical alignment is desirable. Indeed, the whole reason that extent size hints are so versatile is that they implement a generic allocation alignment/size function that can be used for anything your imagination extends to. If they were implemented as a "only allow RAID stripe aligned/sized allocation" for the original use case then that functionality would have been far less useful than it has proven to be over the past couple of decades. 
Hence history teaches us that we should be designing the API around the generic filesystem function required (hard alignment of physical extent allocation), not the specific use case that requires that functionality. -Dave.
On 09/01/2024 23:04, Dave Chinner wrote:
>> --- a/include/uapi/linux/fs.h
>> +++ b/include/uapi/linux/fs.h
>> @@ -118,7 +118,8 @@ struct fsxattr {
>>  	__u32		fsx_nextents;	/* nextents field value (get)	*/
>>  	__u32		fsx_projid;	/* project identifier (get/set) */
>>  	__u32		fsx_cowextsize;	/* CoW extsize field value (get/set)*/
>> -	unsigned char	fsx_pad[8];
>> +	__u32		fsx_atomicwrites_size;	/* unit max */
>> +	unsigned char	fsx_pad[4];
>>  };
>>
>>  /*
>> @@ -140,6 +141,7 @@ struct fsxattr {
>>  #define FS_XFLAG_FILESTREAM	0x00004000	/* use filestream allocator */
>>  #define FS_XFLAG_DAX		0x00008000	/* use DAX for IO */
>>  #define FS_XFLAG_COWEXTSIZE	0x00010000	/* CoW extent size allocator hint */
>> +#define FS_XFLAG_ATOMICWRITES	0x00020000
>>  #define FS_XFLAG_HASATTR	0x80000000	/* no DIFLAG for this */
>>
>>  /* the read-only stuff doesn't really belong here, but any other place is
>>
>> Having FS_XFLAG_ATOMICWRITES set will lead to FMODE_CAN_ATOMIC_WRITE
>> being set.
>>
>> So a user can issue:
>>
>>> xfs_io -c "atomic-writes 64K" mnt/file
>>> xfs_io -c "atomic-writes" mnt/file
>> [65536] mnt/file
>
> Where are you going to store this value in the inode? It requires a
> new field in the inode and so is a change of on-disk format, right?

It would require an on-disk format change, unless we can find an alternative way to store the value, like:

a. re-use the pre-existing extsize or even cowextsize fields, so 'xfs_io -c "atomic-writes $SIZE"' would update those fields, and FS_XFLAG_ATOMICWRITES would be incompatible with FS_XFLAG_COWEXTSIZE or FS_XFLAG_EXTSIZE

b. require FS_XFLAG_EXTSIZE and extsize to also be set to enable atomic writes, and use extsize as the atomic write unit max

I'm trying to think of ways to avoid requiring a value at all, but I don't see good options, like:

- make atomic write unit max some compile-time option
- require mkfs stripe alignment/width to be set and use that as the basis for atomic write unit max

We could just use the atomic write unit max which the HW provides, but that could be 1MB or more, and that will hardly give efficient data usage for small files. But maybe we don't care about that if we expect this feature to only be used on DB files, which can be huge anyway. However, I still have a concern: we require that value to be fixed, but a disk firmware update could increase it, which could leave us with pre-existing misaligned extents.

> As it is, I really don't see this as a better solution than the
> original generic "force align" flag that simply makes the extent
> size hint alignment a hard physical alignment requirement rather
> than just a hint. This has multiple uses (DAX PMD alignment is
> another), so I just don't see why something that has a single,
> application specific API that implements a hard physical alignment
> is desirable.

I would still hope that we will support forcealign separately for those purposes.

> Indeed, the whole reason that extent size hints are so versatile is
> that they implement a generic allocation alignment/size function
> that can be used for anything your imagination extends to. If they
> were implemented as a "only allow RAID stripe aligned/sized
> allocation" for the original use case then that functionality would
> have been far less useful than it has proven to be over the past
> couple of decades.
>
> Hence history teaches us that we should be designing the API around
> the generic filesystem function required (hard alignment of physical
> extent allocation), not the specific use case that requires that
> functionality.

I understand your concern. However, I am not sure that forcealign gives us everything we want for atomic writes. There is an issue where we were required to pre-zero a file prior to issuing atomic writes to ensure extents are suitably sized, so FS_XFLAG_ATOMICWRITES would make the FS do whatever is required to avoid that pre-zeroing (though that pre-zeroing requirement does sound like a forcealign issue...).

Furthermore, there was some desire to support atomic writes on block devices with no HW support by using a CoW-based solution, and forcealign would not be relevant there.

Thanks,
John
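For orientation, here is a minimal userspace sketch of the flow being discussed at this point, assuming the proposed fsx_atomicwrites_size field and FS_XFLAG_ATOMICWRITES flag from the patch quoted above; neither exists in mainline linux/fs.h, so this is roughly what 'xfs_io -c "atomic-writes 64K"' would do rather than code that builds against current headers.

/*
 * Sketch only: fsx_atomicwrites_size and FS_XFLAG_ATOMICWRITES are taken
 * from the patch quoted above and are not in mainline linux/fs.h.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(void)
{
	struct fsxattr fsx;
	int fd = open("mnt/file", O_RDWR | O_CREAT | O_DIRECT, 0644);

	if (fd < 0 || ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		exit(1);

	/* roughly what "xfs_io -c 'atomic-writes 64K' mnt/file" would do */
	fsx.fsx_xflags |= FS_XFLAG_ATOMICWRITES;	/* proposed flag */
	fsx.fsx_atomicwrites_size = 64 * 1024;		/* proposed field: unit max */

	if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0)
		exit(1);
	return 0;
}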
On Wed, Jan 10, 2024 at 10:04:00AM +1100, Dave Chinner wrote:
> Hence history teaches us that we should be designing the API around
> the generic filesystem function required (hard alignment of physical
> extent allocation), not the specific use case that requires that
> functionality.

I disagree. The alignment requirement is an artefact of how you implement atomic writes. As the fs user I care that I can do atomic writes on a file and need to query how big the writes can be and what alignment is required.

The forcealign feature is a sensible fs-side implementation of that if using hardware-based atomic writes with alignment requirements, but it is a really lousy userspace API.

So with John's API proposal for XFS with hardware alignment based atomic writes we could still use forcealign. Requesting atomic writes for an inode will set the forcealign flag and the extent size hint, and after that it'll report atomic write capabilities. Roughly the same implementation, but not an API tied to an implementation detail.
On Wed, Jan 10, 2024 at 10:19:29AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 10, 2024 at 10:04:00AM +1100, Dave Chinner wrote:
> > Hence history teaches us that we should be designing the API around
> > the generic filesystem function required (hard alignment of physical
> > extent allocation), not the specific use case that requires that
> > functionality.
>
> I disagree. The alignment requirement is an artefact of how you
> implement atomic writes. As the fs user I care that I can do atomic
> writes on a file and need to query how big the writes can be and
> what alignment is required.
>
> The forcealign feature is a sensible fs-side implementation of that
> if using hardware-based atomic writes with alignment requirements,
> but it is a really lousy userspace API.
>
> So with John's API proposal for XFS with hardware alignment based atomic
> writes we could still use forcealign.
>
> Requesting atomic writes for an inode will set the forcealign flag
> and the extent size hint, and after that it'll report atomic write
> capabilities. Roughly the same implementation, but not an API
> tied to an implementation detail.

Sounds good to me! So to summarize, userspace programs would have to do something like this:

	struct statx stx;
	struct fsxattr fsxattr;
	int fd = open("/foofile", O_RDWR | O_DIRECT);

	ioctl(fd, FS_IOC_FSGETXATTR, &fsxattr);

	fsxattr.fsx_xflags |= FS_XFLAG_FORCEALIGN | FS_XFLAG_WRITE_ATOMIC;
	fsxattr.fsx_extsize = 16384;	/* only for hardware no-tears writes */

	ioctl(fd, FS_IOC_FSSETXATTR, &fsxattr);

	statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &stx);

	if (stx.stx_atomic_write_unit_max >= 16384) {
		pwritev2(fd, &iov, 1, 0, RWF_SYNC | RWF_ATOMIC);
		printf("HAPPY DANCE\n");
	}

(Assume we bail out on errors.)

--D
On Wed, Jan 10, 2024 at 05:40:56PM -0800, Darrick J. Wong wrote:
> 	struct statx stx;
> 	struct fsxattr fsxattr;
> 	int fd = open("/foofile", O_RDWR | O_DIRECT);
>
> 	ioctl(fd, FS_IOC_FSGETXATTR, &fsxattr);
>
> 	fsxattr.fsx_xflags |= FS_XFLAG_FORCEALIGN | FS_XFLAG_WRITE_ATOMIC;
> 	fsxattr.fsx_extsize = 16384;	/* only for hardware no-tears writes */
>
> 	ioctl(fd, FS_IOC_FSSETXATTR, &fsxattr);
>
> 	statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &stx);
>
> 	if (stx.stx_atomic_write_unit_max >= 16384) {
> 		pwritev2(fd, &iov, 1, 0, RWF_SYNC | RWF_ATOMIC);
> 		printf("HAPPY DANCE\n");
> 	}

I think this still needs a check if the fs needs alignment for atomic writes at all, i.e.:

	struct statx stx;
	struct fsxattr fsxattr;
	int fd = open("/foofile", O_RDWR | O_DIRECT);

	ioctl(fd, FS_IOC_FSGETXATTR, &fsxattr);
	statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &stx);
	if (stx.stx_atomic_write_unit_max < 16384) {
		bailout();
	}

	fsxattr.fsx_xflags |= FS_XFLAG_WRITE_ATOMIC;
	if (stx.stx_atomic_write_alignment) {
		fsxattr.fsx_xflags |= FS_XFLAG_FORCEALIGN;
		fsxattr.fsx_extsize = 16384;	/* only for hardware no-tears writes */
	}
	if (ioctl(fd, FS_IOC_FSSETXATTR, &fsxattr) < 0) {
		bailout();
	}

	pwritev2(fd, &iov, 1, 0, RWF_SYNC | RWF_ATOMIC);
	printf("HAPPY DANCE\n");
On 11/01/2024 05:02, Christoph Hellwig wrote:
> On Wed, Jan 10, 2024 at 05:40:56PM -0800, Darrick J. Wong wrote:
>> 	struct statx stx;
>> 	struct fsxattr fsxattr;
>> 	int fd = open("/foofile", O_RDWR | O_DIRECT);

I'm assuming O_CREAT also.

>>
>> 	ioctl(fd, FS_IOC_FSGETXATTR, &fsxattr);
>>
>> 	fsxattr.fsx_xflags |= FS_XFLAG_FORCEALIGN | FS_XFLAG_WRITE_ATOMIC;
>> 	fsxattr.fsx_extsize = 16384;	/* only for hardware no-tears writes */
>>
>> 	ioctl(fd, FS_IOC_FSSETXATTR, &fsxattr);
>>
>> 	statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &stx);
>>
>> 	if (stx.stx_atomic_write_unit_max >= 16384) {
>> 		pwritev2(fd, &iov, 1, 0, RWF_SYNC | RWF_ATOMIC);
>> 		printf("HAPPY DANCE\n");
>> 	}
>
> I think this still needs a check if the fs needs alignment for
> atomic writes at all, i.e.:
>
> 	struct statx stx;
> 	struct fsxattr fsxattr;
> 	int fd = open("/foofile", O_RDWR | O_DIRECT);
>
> 	ioctl(fd, FS_IOC_FSGETXATTR, &fsxattr);
> 	statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &stx);
> 	if (stx.stx_atomic_write_unit_max < 16384) {
> 		bailout();
> 	}

How could this value be >= 16384 initially? Would it come from pre-configured FS alignment, like XFS RT extsize? Or from some special CoW-based atomic write support? Or from an FS block size of 16384?

Incidentally, for consistency, only setting FS_XFLAG_WRITE_ATOMIC will lead to FMODE_CAN_ATOMIC_WRITE being set. So until FS_XFLAG_WRITE_ATOMIC is set, would it make sense to have statx return 0 for STATX_WRITE_ATOMIC? Otherwise the user may be misled into thinking that it is ok to issue an atomic write (when it isn't).

Thanks,
John

>
> 	fsxattr.fsx_xflags |= FS_XFLAG_WRITE_ATOMIC;
> 	if (stx.stx_atomic_write_alignment) {
> 		fsxattr.fsx_xflags |= FS_XFLAG_FORCEALIGN;
> 		fsxattr.fsx_extsize = 16384;	/* only for hardware no-tears writes */
> 	}
> 	if (ioctl(fd, FS_IOC_FSSETXATTR, &fsxattr) < 0) {
> 		bailout();
> 	}
>
> 	pwritev2(fd, &iov, 1, 0, RWF_SYNC | RWF_ATOMIC);
> 	printf("HAPPY DANCE\n");
>
On Thu, Jan 11, 2024 at 09:55:36AM +0000, John Garry wrote:
> On 11/01/2024 05:02, Christoph Hellwig wrote:
>> On Wed, Jan 10, 2024 at 05:40:56PM -0800, Darrick J. Wong wrote:
>>> 	struct statx stx;
>>> 	struct fsxattr fsxattr;
>>> 	int fd = open("/foofile", O_RDWR | O_DIRECT);
>
> I'm assuming O_CREAT also.

Yes.

>> I think this still needs a check if the fs needs alignment for
>> atomic writes at all, i.e.:
>>
>> 	struct statx stx;
>> 	struct fsxattr fsxattr;
>> 	int fd = open("/foofile", O_RDWR | O_DIRECT);
>>
>> 	ioctl(fd, FS_IOC_FSGETXATTR, &fsxattr);
>> 	statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &stx);
>> 	if (stx.stx_atomic_write_unit_max < 16384) {
>> 		bailout();
>> 	}
>
> How could this value be >= 16384 initially? Would it come from
> pre-configured FS alignment, like XFS RT extsize? Or from some special
> CoW-based atomic write support? Or from an FS block size of 16384?

Sorry, this check should not be here at all, we should only check it later.

> Incidentally, for consistency, only setting FS_XFLAG_WRITE_ATOMIC will lead
> to FMODE_CAN_ATOMIC_WRITE being set. So until FS_XFLAG_WRITE_ATOMIC is set,
> would it make sense to have statx return 0 for STATX_WRITE_ATOMIC?

True. We might need to report the limits even without that, though.
>>> I think this still needs a check if the fs needs alignment for
>>> atomic writes at all, i.e.:
>>>
>>> 	struct statx stx;
>>> 	struct fsxattr fsxattr;
>>> 	int fd = open("/foofile", O_RDWR | O_DIRECT);
>>>
>>> 	ioctl(fd, FS_IOC_FSGETXATTR, &fsxattr);
>>> 	statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &stx);
>>> 	if (stx.stx_atomic_write_unit_max < 16384) {
>>> 		bailout();
>>> 	}
>>
>> How could this value be >= 16384 initially? Would it come from
>> pre-configured FS alignment, like XFS RT extsize? Or from some special
>> CoW-based atomic write support? Or from an FS block size of 16384?
>
> Sorry, this check should not be here at all, we should only check it
> later.
>
>> Incidentally, for consistency, only setting FS_XFLAG_WRITE_ATOMIC will lead
>> to FMODE_CAN_ATOMIC_WRITE being set. So until FS_XFLAG_WRITE_ATOMIC is set,
>> would it make sense to have statx return 0 for STATX_WRITE_ATOMIC?
>
> True. We might need to report the limits even without that, though.

Could we just error the SETXATTR ioctl when FS_XFLAG_FORCEALIGN is not set (and it is required)? The problem is that the ioctl reports -EINVAL for any such error, so it is hard for the user to know the issue...

Thanks,
John
On Thu, Jan 11, 2024 at 04:11:38PM +0000, John Garry wrote:
> Could we just error the SETXATTR ioctl when FS_XFLAG_FORCEALIGN is not set
> (and it is required)? The problem is that the ioctl reports -EINVAL for any
> such error, so it is hard for the user to know the issue...

Sure. Pick a good unique error code, though.
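To make the error-reporting point concrete, here is a sketch of the application-side check, assuming the proposed FS_XFLAG_WRITE_ATOMIC flag and using EOPNOTSUPP purely as a placeholder for whatever "good unique error code" the kernel might pick; nothing in this thread settles the actual errno.

/*
 * Sketch only: FS_XFLAG_WRITE_ATOMIC is a proposed flag, and EOPNOTSUPP is
 * a placeholder for a distinct error code, as opposed to an ambiguous
 * -EINVAL.
 */
#include <errno.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

static int enable_atomic_writes(int fd)
{
	struct fsxattr fsx;

	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		return -errno;

	fsx.fsx_xflags |= FS_XFLAG_WRITE_ATOMIC;	/* proposed flag */

	if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0) {
		if (errno == EOPNOTSUPP)	/* placeholder "unique" code */
			fprintf(stderr, "forcealign required but not set\n");
		else if (errno == EINVAL)
			fprintf(stderr, "rejected, but the reason is ambiguous\n");
		return -errno;
	}
	return 0;
}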
On 21/12/2023 13:22, Christoph Hellwig wrote:
> On Thu, Dec 21, 2023 at 01:18:33PM +0000, John Garry wrote:
>>> For SGL-capable devices that would be
>>> BIO_MAX_VECS, otherwise 1.
>>
>> ok, but we would need to advertise that or whatever segment limit. A statx
>> field just for that seems a bit inefficient in terms of space.
>
> I'd rather not hard code BIO_MAX_VECS in the ABI, which suggests we
> want to export it as a field. Network file systems also might have
> their own limits for one reason or another.

Hi Christoph,

I have been looking at this issue again and I am not sure if telling the user the max number of segments allowed is the best option. I'm worried that the resultant atomic write unit max will be too small.

The background again is that we want to tell the user what the maximum atomic write unit size is, such that we can always guarantee to fit the write in a single bio. And there would be no iovec length or alignment rules.

The max segments value advertised would be min(queue max segments, BIO_MAX_VECS), so it would be 256 when the request queue is not limiting.

The worst case (most inefficient) iovec layout which the user could provide would be something like .iov_base = 0x...0E00 and .iov_len = 0x400, which would mean that we would need 2x pages and 2x DMA sg elems for each 1024B-length iovec. I am assuming that we will still use the direct IO rule of LBS length and alignment.

As such, we then need to set atomic write unit max = min(queue max segments, BIO_MAX_VECS) * LBS. That would mean an atomic write unit max of 256 * 512 = 128K (for 512B LBS). For a DMA controller with a max segments of 64, for example, we would have 32K. These seem too low.

Alternatively, I'm thinking that we should just always limit to a single iovec, and then atomic write unit max = (min(queue max segments, BIO_MAX_VECS) - 1) * PAGE_SIZE [ignoring first/last iovec contents]. It also makes support for non-enterprise NVMe drives more straightforward. If someone wants, they can introduce support for multiple iovecs later, but that would probably require some more iovec length/alignment rules.

Please let me know your thoughts.

Thanks,
John
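To make the arithmetic in the two options above easy to check, here is a small stand-alone sketch; BIO_MAX_VECS matches the current kernel constant, but the function names and parameters are illustrative, not kernel APIs.

/*
 * Sketch of the two candidate formulas discussed above.  Names and
 * parameters here are illustrative only.
 */
#include <stdio.h>

#define BIO_MAX_VECS	256

static unsigned int min_u(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* multiple iovecs, worst-case layout: one logical block per segment */
static unsigned int awu_max_multi_iovec(unsigned int queue_max_segments,
					unsigned int lbs)
{
	return min_u(queue_max_segments, BIO_MAX_VECS) * lbs;
}

/* single iovec: contiguous buffer, only first/last pages may be partial */
static unsigned int awu_max_single_iovec(unsigned int queue_max_segments,
					 unsigned int page_size)
{
	return (min_u(queue_max_segments, BIO_MAX_VECS) - 1) * page_size;
}

int main(void)
{
	printf("multi-iovec,  256 segs, 512B LBS: %uK\n",
	       awu_max_multi_iovec(256, 512) / 1024);	/* 128K */
	printf("multi-iovec,   64 segs, 512B LBS: %uK\n",
	       awu_max_multi_iovec(64, 512) / 1024);	/* 32K */
	printf("single iovec, 256 segs, 4K pages: %uK\n",
	       awu_max_single_iovec(256, 4096) / 1024);	/* 1020K */
	printf("single iovec,  64 segs, 4K pages: %uK\n",
	       awu_max_single_iovec(64, 4096) / 1024);	/* 252K */
	return 0;
}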
On Tue, Jan 16, 2024 at 11:35:47AM +0000, John Garry wrote:
> As such, we then need to set atomic write unit max = min(queue max
> segments, BIO_MAX_VECS) * LBS. That would mean an atomic write unit max of
> 256 * 512 = 128K (for 512B LBS). For a DMA controller with a max segments
> of 64, for example, we would have 32K. These seem too low.

I don't see how this would work if we support multiple sectors.

> Alternatively, I'm thinking that we should just always limit to a single
> iovec, and then atomic write unit max = (min(queue max segments,
> BIO_MAX_VECS) - 1) * PAGE_SIZE [ignoring first/last iovec contents]. It
> also makes support for non-enterprise NVMe drives more straightforward. If
> someone wants, they can introduce support for multiple iovecs later, but
> that would probably require some more iovec length/alignment rules.

Supporting just a single iovec initially is fine with me, as extending that is pretty easy. Just talk to your potential users to make sure they can live with it.

I'd probably still advertise the limits even if it is currently always 1.
On 17/01/2024 15:02, Christoph Hellwig wrote:
> On Tue, Jan 16, 2024 at 11:35:47AM +0000, John Garry wrote:
>> As such, we then need to set atomic write unit max = min(queue max
>> segments, BIO_MAX_VECS) * LBS. That would mean an atomic write unit max of
>> 256 * 512 = 128K (for 512B LBS). For a DMA controller with a max segments
>> of 64, for example, we would have 32K. These seem too low.
>
> I don't see how this would work if we support multiple sectors.
>
>> Alternatively, I'm thinking that we should just always limit to a single
>> iovec, and then atomic write unit max = (min(queue max segments,
>> BIO_MAX_VECS) - 1) * PAGE_SIZE [ignoring first/last iovec contents]. It
>> also makes support for non-enterprise NVMe drives more straightforward. If
>> someone wants, they can introduce support for multiple iovecs later, but
>> that would probably require some more iovec length/alignment rules.
>
> Supporting just a single iovec initially is fine with me, as extending
> that is pretty easy. Just talk to your potential users to make sure they
> can live with it.

Yeah, any porting I know about has just been using aio with IO_CMD_PWRITE, so that would be ok.

> I'd probably still advertise the limits even if it is currently always 1.

I suppose that we don't need any special rule until we support more than one iovec, as we cannot break anyone already using more than one :)

Thanks,
John
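For reference, a minimal usage sketch of the single-iovec model being settled on here: one contiguous buffer, one iovec, submitted with pwritev2() and the RWF_ATOMIC flag from this series. The RWF_ATOMIC fallback #define is a placeholder (the flag is not in mainline headers at the time of this thread), and the 16K size, 4K alignment and file name are illustrative.

/*
 * Sketch only: RWF_ATOMIC comes from this patch series; the #define below
 * is a placeholder so the example compiles.  Size, alignment and path are
 * illustrative.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC	0x00000040
#endif

int main(void)
{
	void *buf = NULL;
	struct iovec iov;
	int fd = open("/foofile", O_RDWR | O_CREAT | O_DIRECT, 0644);

	if (fd < 0 || posix_memalign(&buf, 4096, 16384))
		exit(1);
	memset(buf, 0, 16384);

	iov.iov_base = buf;
	iov.iov_len = 16384;	/* single iovec covering the whole write */

	/* offset 0 is naturally aligned for a 16K atomic write */
	if (pwritev2(fd, &iov, 1, 0, RWF_SYNC | RWF_ATOMIC) < 0)
		perror("pwritev2");

	free(buf);
	return 0;
}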