diff mbox series

[v8,10/10] nvme: Atomic write support

Message ID 20240610104329.3555488-11-john.g.garry@oracle.com (mailing list archive)
State New, archived
Headers show
Series block atomic writes | expand

Commit Message

John Garry June 10, 2024, 10:43 a.m. UTC
From: Alan Adamson <alan.adamson@oracle.com>

Add support to set block layer request_queue atomic write limits. The
limits will be derived from either the namespace or controller atomic
parameters.

NVMe atomic-related parameters are grouped into "normal" and "power-fail"
(or PF) class of parameter. For atomic write support, only PF parameters
are of interest. The "normal" parameters are concerned with racing reads
and writes (which also applies to PF). See NVM Command Set Specification
Revision 1.0d section 2.1.4 for reference.

Whether to use per namespace or controller atomic parameters is decided by
NSFEAT bit 1 - see Figure 97: Identify – Identify Namespace Data
Structure, NVM Command Set.

NVMe namespaces may define an atomic boundary, whereby no atomic guarantees
are provided for a write which straddles this per-lba space boundary. The
block layer merging policy is such that no merges may occur in which the
resultant request would straddle such a boundary.

Unlike SCSI, NVMe specifies no granularity or alignment rules, apart from
atomic boundary rule. In addition, again unlike SCSI, there is no
dedicated atomic write command - a write which adheres to the atomic size
limit and boundary is implicitly atomic.

If NSFEAT bit 1 is set, the following parameters are of interest:
- NAWUPF (Namespace Atomic Write Unit Power Fail)
- NABSPF (Namespace Atomic Boundary Size Power Fail)
- NABO (Namespace Atomic Boundary Offset)

and we set request_queue limits as follows:
- atomic_write_unit_max = rounddown_pow_of_two(NAWUPF)
- atomic_write_max_bytes = NAWUPF
- atomic_write_boundary = NABSPF

If in the unlikely scenario that NABO is non-zero, then atomic writes will
not be supported at all as dealing with this adds extra complexity. This
policy may change in future.

In all cases, atomic_write_unit_min is set to the logical block size.

If NSFEAT bit 1 is unset, the following parameter is of interest:
- AWUPF (Atomic Write Unit Power Fail)

and we set request_queue limits as follows:
- atomic_write_unit_max = rounddown_pow_of_two(AWUPF)
- atomic_write_max_bytes = AWUPF
- atomic_write_boundary = 0

A new function, nvme_valid_atomic_write(), is also called from submission
path to verify that a request has been submitted to the driver will
actually be executed atomically. As mentioned, there is no dedicated NVMe
atomic write command (which may error for a command which exceeds the
controller atomic write limits).

Note on NABSPF:
There seems to be some vagueness in the spec as to whether NABSPF applies
for NSFEAT bit 1 being unset. Figure 97 does not explicitly mention NABSPF
and how it is affected by bit 1. However Figure 4 does tell to check Figure
97 for info about per-namespace parameters, which NABSPF is, so it is
implied. However currently nvme_update_disk_info() does check namespace
parameter NABO regardless of this bit.

Signed-off-by: Alan Adamson <alan.adamson@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
jpg: total rewrite
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 drivers/nvme/host/core.c | 49 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

Comments

Kanchan Joshi June 17, 2024, 5:24 p.m. UTC | #1
On 6/10/2024 4:13 PM, John Garry wrote:
> +static bool nvme_valid_atomic_write(struct request *req)
> +{
> +	struct request_queue *q = req->q;
> +	u32 boundary_bytes = queue_atomic_write_boundary_bytes(q);
> +
> +	if (blk_rq_bytes(req) > queue_atomic_write_unit_max_bytes(q))
> +		return false;
> +
> +	if (boundary_bytes) {
> +		u64 mask = boundary_bytes - 1, imask = ~mask;
> +		u64 start = blk_rq_pos(req) << SECTOR_SHIFT;
> +		u64 end = start + blk_rq_bytes(req) - 1;
> +
> +		/* If greater then must be crossing a boundary */
> +		if (blk_rq_bytes(req) > boundary_bytes)
> +			return false;

Nit: I'd cache blk_rq_bytes(req), since that is repeating and this 
function is called for each atomic IO.

> +
> +		if ((start & imask) != (end & imask))
> +			return false;
> +	}
> +
> +	return true;
> +}
> +
>   static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
>   		struct request *req, struct nvme_command *cmnd,
>   		enum nvme_opcode op)
> @@ -941,6 +965,12 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
>   
>   	if (req->cmd_flags & REQ_RAHEAD)
>   		dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
> +	/*
> +	 * Ensure that nothing has been sent which cannot be executed
> +	 * atomically.
> +	 */
> +	if (req->cmd_flags & REQ_ATOMIC && !nvme_valid_atomic_write(req))
> +		return BLK_STS_INVAL;
>   

Is this validity check specific to NVMe or should this be moved up to 
block layer as it also knows the limits?
John Garry June 17, 2024, 6:04 p.m. UTC | #2
On 17/06/2024 18:24, Kanchan Joshi wrote:
> On 6/10/2024 4:13 PM, John Garry wrote:
>> +static bool nvme_valid_atomic_write(struct request *req)
>> +{
>> +	struct request_queue *q = req->q;
>> +	u32 boundary_bytes = queue_atomic_write_boundary_bytes(q);
>> +
>> +	if (blk_rq_bytes(req) > queue_atomic_write_unit_max_bytes(q))
>> +		return false;
>> +
>> +	if (boundary_bytes) {
>> +		u64 mask = boundary_bytes - 1, imask = ~mask;
>> +		u64 start = blk_rq_pos(req) << SECTOR_SHIFT;
>> +		u64 end = start + blk_rq_bytes(req) - 1;
>> +
>> +		/* If greater then must be crossing a boundary */
>> +		if (blk_rq_bytes(req) > boundary_bytes)
>> +			return false;
> 
> Nit: I'd cache blk_rq_bytes(req), since that is repeating and this
> function is called for each atomic IO.

blk_rq_bytes() is just a wrapper for rq->__data_len. I suppose that we 
could cache that value to stop re-reading that memory, but I would 
hope/expect that memory to be in the CPU cache anyway.

> 
>> +
>> +		if ((start & imask) != (end & imask))
>> +			return false;
>> +	}
>> +
>> +	return true;
>> +}
>> +
>>    static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
>>    		struct request *req, struct nvme_command *cmnd,
>>    		enum nvme_opcode op)
>> @@ -941,6 +965,12 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
>>    
>>    	if (req->cmd_flags & REQ_RAHEAD)
>>    		dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
>> +	/*
>> +	 * Ensure that nothing has been sent which cannot be executed
>> +	 * atomically.
>> +	 */
>> +	if (req->cmd_flags & REQ_ATOMIC && !nvme_valid_atomic_write(req))
>> +		return BLK_STS_INVAL;
>>    
> 
> Is this validity check specific to NVMe or should this be moved up to
> block layer as it also knows the limits?

Only NVMe supports an LBA space boundary, so that part is specific to NVMe.

Regardless, the block layer already should ensure that the atomic write 
length and boundary is respected. nvme_valid_atomic_write() is just an 
insurance policy against the block layer or some other component not 
doing its job.

For SCSI, the device would error - for example - if the atomic write 
length was larger than the device supported. NVMe silently just does not 
execute the write atomically in that scenario, which we must avoid.

Thanks,
John
Christoph Hellwig June 18, 2024, 6:49 a.m. UTC | #3
On Mon, Jun 17, 2024 at 07:04:23PM +0100, John Garry wrote:
>> Nit: I'd cache blk_rq_bytes(req), since that is repeating and this
>> function is called for each atomic IO.
>
> blk_rq_bytes() is just a wrapper for rq->__data_len. I suppose that we 
> could cache that value to stop re-reading that memory, but I would 
> hope/expect that memory to be in the CPU cache anyway.

Yes, that feels a bit pointless.

> Only NVMe supports an LBA space boundary, so that part is specific to NVMe.
>
> Regardless, the block layer already should ensure that the atomic write 
> length and boundary is respected. nvme_valid_atomic_write() is just an 
> insurance policy against the block layer or some other component not doing 
> its job.
>
> For SCSI, the device would error - for example - if the atomic write length 
> was larger than the device supported. NVMe silently just does not execute 
> the write atomically in that scenario, which we must avoid.

It might be worth to expand the comment to include this information to
help future readers.
John Garry June 18, 2024, 7:22 a.m. UTC | #4
On 18/06/2024 07:49, Christoph Hellwig wrote:
>> Only NVMe supports an LBA space boundary, so that part is specific to NVMe.
>>
>> Regardless, the block layer already should ensure that the atomic write
>> length and boundary is respected. nvme_valid_atomic_write() is just an
>> insurance policy against the block layer or some other component not doing
>> its job.
>>
>> For SCSI, the device would error - for example - if the atomic write length
>> was larger than the device supported. NVMe silently just does not execute
>> the write atomically in that scenario, which we must avoid.
> It might be worth to expand the comment to include this information to
> help future readers.


OK, will do. I have been asked this more than once now.

John
diff mbox series

Patch

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f5d150c62955..91001892f60b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -927,6 +927,30 @@  static inline blk_status_t nvme_setup_write_zeroes(struct nvme_ns *ns,
 	return BLK_STS_OK;
 }
 
+static bool nvme_valid_atomic_write(struct request *req)
+{
+	struct request_queue *q = req->q;
+	u32 boundary_bytes = queue_atomic_write_boundary_bytes(q);
+
+	if (blk_rq_bytes(req) > queue_atomic_write_unit_max_bytes(q))
+		return false;
+
+	if (boundary_bytes) {
+		u64 mask = boundary_bytes - 1, imask = ~mask;
+		u64 start = blk_rq_pos(req) << SECTOR_SHIFT;
+		u64 end = start + blk_rq_bytes(req) - 1;
+
+		/* If greater then must be crossing a boundary */
+		if (blk_rq_bytes(req) > boundary_bytes)
+			return false;
+
+		if ((start & imask) != (end & imask))
+			return false;
+	}
+
+	return true;
+}
+
 static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 		struct request *req, struct nvme_command *cmnd,
 		enum nvme_opcode op)
@@ -941,6 +965,12 @@  static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 
 	if (req->cmd_flags & REQ_RAHEAD)
 		dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
+	/*
+	 * Ensure that nothing has been sent which cannot be executed
+	 * atomically.
+	 */
+	if (req->cmd_flags & REQ_ATOMIC && !nvme_valid_atomic_write(req))
+		return BLK_STS_INVAL;
 
 	cmnd->rw.opcode = op;
 	cmnd->rw.flags = 0;
@@ -1921,6 +1951,23 @@  static void nvme_configure_metadata(struct nvme_ctrl *ctrl,
 	}
 }
 
+
+static void nvme_update_atomic_write_disk_info(struct nvme_ns *ns,
+			struct nvme_id_ns *id, struct queue_limits *lim,
+			u32 bs, u32 atomic_bs)
+{
+	unsigned int boundary = 0;
+
+	if (id->nsfeat & NVME_NS_FEAT_ATOMICS && id->nawupf) {
+		if (le16_to_cpu(id->nabspf))
+			boundary = (le16_to_cpu(id->nabspf) + 1) * bs;
+	}
+	lim->atomic_write_hw_max = atomic_bs;
+	lim->atomic_write_hw_boundary = boundary;
+	lim->atomic_write_hw_unit_min = bs;
+	lim->atomic_write_hw_unit_max = rounddown_pow_of_two(atomic_bs);
+}
+
 static u32 nvme_max_drv_segments(struct nvme_ctrl *ctrl)
 {
 	return ctrl->max_hw_sectors / (NVME_CTRL_PAGE_SIZE >> SECTOR_SHIFT) + 1;
@@ -1967,6 +2014,8 @@  static bool nvme_update_disk_info(struct nvme_ns *ns, struct nvme_id_ns *id,
 			atomic_bs = (1 + le16_to_cpu(id->nawupf)) * bs;
 		else
 			atomic_bs = (1 + ns->ctrl->subsys->awupf) * bs;
+
+		nvme_update_atomic_write_disk_info(ns, id, lim, bs, atomic_bs);
 	}
 
 	if (id->nsfeat & NVME_NS_FEAT_IO_OPT) {