[v2,08/11] block: mq-deadline: Fix a race condition related to zoned writes

Message ID 20230418224002.1195163-9-bvanassche@acm.org (mailing list archive)
State New, archived
Series mq-deadline: Improve support for zoned block devices

Commit Message

Bart Van Assche April 18, 2023, 10:39 p.m. UTC
Let deadline_next_request() only consider the first zoned write per
zone. This patch fixes a race condition between deadline_next_request()
and completion of zoned writes.

Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

Comments

Christoph Hellwig April 19, 2023, 5:07 a.m. UTC | #1
On Tue, Apr 18, 2023 at 03:39:59PM -0700, Bart Van Assche wrote:
> Let deadline_next_request() only consider the first zoned write per
> zone. This patch fixes a race condition between deadline_next_request()
> and completion of zoned writes.

Can you explain the condition in a bit more detail?

>   * For the specified data direction, return the next request to
> @@ -386,9 +388,25 @@ deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
>  	 */
>  	spin_lock_irqsave(&dd->zone_lock, flags);
>  	while (rq) {
> +		unsigned int zno __maybe_unused;
> +
>  		if (blk_req_can_dispatch_to_zone(rq))
>  			break;
> +
> +#ifdef CONFIG_BLK_DEV_ZONED
> +		zno = blk_rq_zone_no(rq);
> +
>  		rq = deadline_skip_seq_writes(dd, rq);
> +
> +		/*
> +		 * Skip all other write requests for the zone with zone number
> +		 * 'zno'. This prevents this function from selecting a zoned
> +		 * write that is not the first write for a given zone.
> +		 */
> +		while (rq && blk_rq_zone_no(rq) == zno &&
> +		       blk_rq_is_seq_zoned_write(rq))
> +			rq = deadline_latter_request(rq);
> +#endif

The ifdeffery and extra checks are a bit ugly here.

I'd suggest to:

 - move all the zone-related logic in deadline_next_request into
   a separate helper that is stubbed out for !CONFIG_BLK_DEV_ZONED
 - merge the loop in deadline_skip_seq_writes and the loop
   added here into a single loop.  Something like:

static struct request *
deadline_zoned_next_request(struct deadline_data *dd, struct request *rq)
{
	unsigned long flags;

	spin_lock_irqsave(&dd->zone_lock, flags);
	while (!blk_req_can_dispatch_to_zone(rq)) {
		unsigned int zone_no = blk_rq_zone_no(rq);
		sector_t pos = blk_rq_pos(rq);

		while (blk_rq_is_seq_zoned_write(rq)) {
			// XXX: good comment explaining the check here
			if (blk_rq_pos(rq) != pos &&
			    blk_rq_zone_no(rq) != zone_no)
				break;
			pos += blk_rq_sectors(rq);
			rq = deadline_latter_request(rq);
			if (!rq)
				goto out_unlock;
		}
	}
out_unlock:
	spin_unlock_irqrestore(&dd->zone_lock, flags);
	return rq;
}
Bart Van Assche April 19, 2023, 6:46 p.m. UTC | #2
On 4/18/23 22:07, Christoph Hellwig wrote:
> On Tue, Apr 18, 2023 at 03:39:59PM -0700, Bart Van Assche wrote:
>> Let deadline_next_request() only consider the first zoned write per
>> zone. This patch fixes a race condition between deadline_next_request()
>> and completion of zoned writes.
> 
> Can you explain the condition in a bit more detail?

Hi Christoph,

I'm concerned about the following scenario:
* deadline_next_request() starts iterating over the writes for a
   particular zone.
* blk_req_can_dispatch_to_zone() returns false for the first zoned write
   examined by deadline_next_request().
* A write for the same zone completes.
* deadline_next_request() continues examining zoned writes and
   blk_req_can_dispatch_to_zone() returns true for a write other than
   the first write for that zone. This would result in an UNALIGNED
   WRITE COMMAND error for zoned SCSI devices.

Bart.
Damien Le Moal April 20, 2023, 1 a.m. UTC | #3
On 4/20/23 03:46, Bart Van Assche wrote:
> On 4/18/23 22:07, Christoph Hellwig wrote:
>> On Tue, Apr 18, 2023 at 03:39:59PM -0700, Bart Van Assche wrote:
>>> Let deadline_next_request() only consider the first zoned write per
>>> zone. This patch fixes a race condition between deadline_next_request()
>>> and completion of zoned writes.
>>
>> Can you explain the condition in a bit more detail?
> 
> Hi Christoph,
> 
> I'm concerned about the following scenario:
> * deadline_next_request() starts iterating over the writes for a
>    particular zone.
> * blk_req_can_dispatch_to_zone() returns false for the first zoned write
>    examined by deadline_next_request().
> * A write for the same zone completes.
> * deadline_next_request() continues examining zoned writes and
>    blk_req_can_dispatch_to_zone() returns true for a write other than
>    the first write for that zone. This would result in an UNALIGNED
>    WRITE COMMAND error for zoned SCSI devices.

Examining zoned writes has to be done with the deadline zone lock held, precisely
to avoid this issue.

> 
> Bart.
>

Patch

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 3122c471f473..32a2cc013ed3 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -302,6 +302,7 @@  static bool deadline_is_seq_write(struct deadline_data *dd, struct request *rq)
 	return blk_rq_pos(prev) + blk_rq_sectors(prev) == blk_rq_pos(rq);
 }
 
+#ifdef CONFIG_BLK_DEV_ZONED
 /*
  * Skip all write requests that are sequential from @rq, even if we cross
  * a zone boundary.
@@ -318,6 +319,7 @@  static struct request *deadline_skip_seq_writes(struct deadline_data *dd,
 
 	return rq;
 }
+#endif
 
 /*
  * For the specified data direction, return the next request to
@@ -386,9 +388,25 @@  deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
 	 */
 	spin_lock_irqsave(&dd->zone_lock, flags);
 	while (rq) {
+		unsigned int zno __maybe_unused;
+
 		if (blk_req_can_dispatch_to_zone(rq))
 			break;
+
+#ifdef CONFIG_BLK_DEV_ZONED
+		zno = blk_rq_zone_no(rq);
+
 		rq = deadline_skip_seq_writes(dd, rq);
+
+		/*
+		 * Skip all other write requests for the zone with zone number
+		 * 'zno'. This prevents this function from selecting a zoned
+		 * write that is not the first write for a given zone.
+		 */
+		while (rq && blk_rq_zone_no(rq) == zno &&
+		       blk_rq_is_seq_zoned_write(rq))
+			rq = deadline_latter_request(rq);
+#endif
 	}
 	spin_unlock_irqrestore(&dd->zone_lock, flags);