diff mbox series

[v7,1/7] block: Introduce the use_zone_write_lock member variable

Message ID 20230809202355.1171455-2-bvanassche@acm.org (mailing list archive)
State New, archived
Series Improve performance for zoned UFS devices

Commit Message

Bart Van Assche Aug. 9, 2023, 8:23 p.m. UTC
Writes in sequential write required zones must happen at the write
pointer. Even if the submitter of the write commands (e.g. a filesystem)
submits writes for sequential write required zones in order, the block
layer or the storage controller may reorder these write commands.

The zone locking mechanism in the mq-deadline I/O scheduler serializes
write commands for sequential zones. Some but not all storage controllers
require this serialization. Introduce a new request queue limit member
variable to allow block drivers to indicate that they preserve the order
of write commands and thus do not require serialization of writes per
zone.

Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-settings.c   | 6 ++++++
 include/linux/blkdev.h | 1 +
 2 files changed, 7 insertions(+)

Comments

Damien Le Moal Aug. 10, 2023, 1:33 a.m. UTC | #1
On 8/10/23 05:23, Bart Van Assche wrote:
> Writes in sequential write required zones must happen at the write
> pointer. Even if the submitter of the write commands (e.g. a filesystem)
> submits writes for sequential write required zones in order, the block
> layer or the storage controller may reorder these write commands.
> 
> The zone locking mechanism in the mq-deadline I/O scheduler serializes
> write commands for sequential zones. Some but not all storage controllers
> require this serialization. Introduce a new request queue limit member
> variable to allow block drivers to indicate that they preserve the order
> of write commands and thus do not require serialization of writes per
> zone.
> 
> Cc: Damien Le Moal <dlemoal@kernel.org>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/blk-settings.c   | 6 ++++++
>  include/linux/blkdev.h | 1 +
>  2 files changed, 7 insertions(+)
> 
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 0046b447268f..b75c97971860 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -56,6 +56,7 @@ void blk_set_default_limits(struct queue_limits *lim)
>  	lim->alignment_offset = 0;
>  	lim->io_opt = 0;
>  	lim->misaligned = 0;
> +	lim->use_zone_write_lock = true;
>  	lim->zoned = BLK_ZONED_NONE;

Given that the default for zoned is BLK_ZONED_NONE, having use_zone_write_lock
default to true is strange. It would be better to set the default to false and
have disk_set_zoned() set it to true if needed, with an additional argument to
specify if it should be the case or not. E.g., for SMR drives, sd.c would call
something like:

disk_set_zoned(sdkp->disk, BLK_ZONED_HM, sdp->use_zone_write_lock);

sd.c would default to sdp->use_zone_write_lock == true and UFS driver can set
it to false. That would be cleaner I think.

>  	lim->zone_write_granularity = 0;
>  	lim->dma_alignment = 511;
> @@ -685,6 +686,11 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>  						   b->max_secure_erase_sectors);
>  	t->zone_write_granularity = max(t->zone_write_granularity,
>  					b->zone_write_granularity);
> +	/*
> +	 * Whether or not the zone write lock should be used depends on the
> +	 * bottom driver only.
> +	 */
> +	t->use_zone_write_lock = b->use_zone_write_lock;

Given that DM bio targets do not have a scheduler and do not have a zone lock
bitmap allocated, I do not think this is necessary at all. This can remain
false, thus staying in sync with the fact that there is no I/O scheduler.

>  	t->zoned = max(t->zoned, b->zoned);
>  	return ret;
>  }
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 2f5371b8482c..deffa1f13444 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -316,6 +316,7 @@ struct queue_limits {
>  	unsigned char		misaligned;
>  	unsigned char		discard_misaligned;
>  	unsigned char		raid_partial_stripes_expensive;
> +	bool			use_zone_write_lock;
>  	enum blk_zoned_model	zoned;
>  
>  	/*
Bart Van Assche Aug. 10, 2023, 2:02 p.m. UTC | #2
On 8/9/23 18:33, Damien Le Moal wrote:
> On 8/10/23 05:23, Bart Van Assche wrote:
>> Writes in sequential write required zones must happen at the write
>> pointer. Even if the submitter of the write commands (e.g. a filesystem)
>> submits writes for sequential write required zones in order, the block
>> layer or the storage controller may reorder these write commands.
>>
>> The zone locking mechanism in the mq-deadline I/O scheduler serializes
>> write commands for sequential zones. Some but not all storage controllers
>> require this serialization. Introduce a new request queue limit member
>> variable to allow block drivers to indicate that they preserve the order
>> of write commands and thus do not require serialization of writes per
>> zone.
>>
>> Cc: Damien Le Moal <dlemoal@kernel.org>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Ming Lei <ming.lei@redhat.com>
>> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
>> ---
>>   block/blk-settings.c   | 6 ++++++
>>   include/linux/blkdev.h | 1 +
>>   2 files changed, 7 insertions(+)
>>
>> diff --git a/block/blk-settings.c b/block/blk-settings.c
>> index 0046b447268f..b75c97971860 100644
>> --- a/block/blk-settings.c
>> +++ b/block/blk-settings.c
>> @@ -56,6 +56,7 @@ void blk_set_default_limits(struct queue_limits *lim)
>>   	lim->alignment_offset = 0;
>>   	lim->io_opt = 0;
>>   	lim->misaligned = 0;
>> +	lim->use_zone_write_lock = true;
>>   	lim->zoned = BLK_ZONED_NONE;
> 
> Given that the default for zoned is BLK_ZONED_NONE, having use_zone_write_lock
> default to true is strange. It would be better to set the default to false and
> have disk_set_zoned() set it to true if needed, with an additional argument to
> specify if it should be the case or not. E.g., for SMR drives, sd.c would call
> something like:
> 
> disk_set_zoned(sdkp->disk, BLK_ZONED_HM, sdp->use_zone_write_lock);
> 
> sd.c would default to sdp->use_zone_write_lock == true and UFS driver can set
> it to false. That would be cleaner I think.

Hi Damien,

Thanks for the detailed feedback.

My concerns about the above proposal are as follows:
* The information about whether or not the zone write lock should be used comes
   from the block driver, e.g. a SCSI LLD.
* sdp, the SCSI disk pointer, is owned by the ULD.
* An ULD may be attached and detached multiple times during the lifetime of a
   logical unit without the LLD being informed about this. So how to set
   sdp->use_zone_write_lock without introducing a new callback or member variable
   in a data structure owned by the LLD?

Hence my preference to store use_zone_write_lock in a data structure that has the
same lifetime as the logical unit and not in any data structure controlled by the
ULD.

>>   	lim->zone_write_granularity = 0;
>>   	lim->dma_alignment = 511;
>> @@ -685,6 +686,11 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>>   						   b->max_secure_erase_sectors);
>>   	t->zone_write_granularity = max(t->zone_write_granularity,
>>   					b->zone_write_granularity);
>> +	/*
>> +	 * Whether or not the zone write lock should be used depends on the
>> +	 * bottom driver only.
>> +	 */
>> +	t->use_zone_write_lock = b->use_zone_write_lock;
> 
> Given that DM bio targets do not have a scheduler and do not have a zone lock
> bitmap allocated, I do not think this is necessary at all. This can remain
> false, thus staying in sync with the fact that there is no I/O scheduler.

How about the request-based dm drivers (e.g. dm-mpath)? Isn't the dm-mpath
driver request-based precisely so that an I/O scheduler can be configured on top
of the dm-mpath driver? From https://lwn.net/Articles/274292/: "The basic idea to
resolve the issue is to move multipathing layer down below the I/O scheduler, and it
was proposed from Mike Christie as the block layer (request-based) multipath".

Thanks,

Bart.
Damien Le Moal Aug. 11, 2023, 12:39 a.m. UTC | #3
On 8/10/23 23:02, Bart Van Assche wrote:
> On 8/9/23 18:33, Damien Le Moal wrote:
>> On 8/10/23 05:23, Bart Van Assche wrote:
>>> Writes in sequential write required zones must happen at the write
>>> pointer. Even if the submitter of the write commands (e.g. a filesystem)
>>> submits writes for sequential write required zones in order, the block
>>> layer or the storage controller may reorder these write commands.
>>>
>>> The zone locking mechanism in the mq-deadline I/O scheduler serializes
>>> write commands for sequential zones. Some but not all storage controllers
>>> require this serialization. Introduce a new request queue limit member
>>> variable to allow block drivers to indicate that they preserve the order
>>> of write commands and thus do not require serialization of writes per
>>> zone.
>>>
>>> Cc: Damien Le Moal <dlemoal@kernel.org>
>>> Cc: Christoph Hellwig <hch@lst.de>
>>> Cc: Ming Lei <ming.lei@redhat.com>
>>> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
>>> ---
>>>   block/blk-settings.c   | 6 ++++++
>>>   include/linux/blkdev.h | 1 +
>>>   2 files changed, 7 insertions(+)
>>>
>>> diff --git a/block/blk-settings.c b/block/blk-settings.c
>>> index 0046b447268f..b75c97971860 100644
>>> --- a/block/blk-settings.c
>>> +++ b/block/blk-settings.c
>>> @@ -56,6 +56,7 @@ void blk_set_default_limits(struct queue_limits *lim)
>>>   	lim->alignment_offset = 0;
>>>   	lim->io_opt = 0;
>>>   	lim->misaligned = 0;
>>> +	lim->use_zone_write_lock = true;
>>>   	lim->zoned = BLK_ZONED_NONE;
>>
>> Given that the default for zoned is BLK_ZONED_NONE, having use_zone_write_lock
>> default to true is strange. It would be better to set the default to false and
>> have disk_set_zoned() set it to true if needed, with an additional argument to
>> specify if it should be the case or not. E.g., for SMR drives, sd.c would call
>> something like:
>>
>> disk_set_zoned(sdkp->disk, BLK_ZONED_HM, sdp->use_zone_write_lock);
>>
>> sd.c would default to sdp->use_zone_write_lock == true and UFS driver can set
>> it to false. That would be cleaner I think.
> 
> Hi Damien,
> 
> Thanks for the detailed feedback.
> 
> My concerns about the above proposal are as follows:
> * The information about whether or not the zone write lock should be used comes
>    from the block driver, e.g. a SCSI LLD.

Yes.

> * sdp, the SCSI disk pointer, is owned by the ULD.
> * An ULD may be attached and detached multiple times during the lifetime of a
>    logical unit without the LLD being informed about this. So how to set
>    sdp->use_zone_write_lock without introducing a new callback or member variable
>    in a data structure owned by the LLD?

That would be set during device scan and device revalidate. And if the value
changes, then disk_set_zoned() should be called again to update the queue limit.
That is already what is done for the zoned limit indicating the type of the
drive. My point is that the zoned limit should dictate if use_zone_write_lock
can be true. The default should be:

	q->limits.use_zone_write_lock = q->limits.zoned != BLK_ZONED_NONE.

And as proposed, if the UFS driver wants to disable zone write locking, all it
needs to do is call "disk_set_zoned(disk, BLK_ZONED_HM, false)". I did not try
to actually code that, and the scsi disk driver may be in the way and that may
need to be done there, hence the suggestion of having a use_zone_write_lock flag
in the scsi device structure so that the UFS driver can set it as needed as well
(and detect changes when revalidating). That should work, but I may be missing
something.

> Hence my preference to store use_zone_write_lock in a data structure that has the
> same lifetime as the logical unit and not in any data structure controlled by the
> ULD.

See above. The actual storage would still be in q->limits so that the block
layer can see it.

> 
>>>   	lim->zone_write_granularity = 0;
>>>   	lim->dma_alignment = 511;
>>> @@ -685,6 +686,11 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>>>   						   b->max_secure_erase_sectors);
>>>   	t->zone_write_granularity = max(t->zone_write_granularity,
>>>   					b->zone_write_granularity);
>>> +	/*
>>> +	 * Whether or not the zone write lock should be used depends on the
>>> +	 * bottom driver only.
>>> +	 */
>>> +	t->use_zone_write_lock = b->use_zone_write_lock;
>>
>> Given that DM bio targets do not have a scheduler and do not have a zone lock
>> bitmap allocated, I do not think this is necessary at all. This can remain
>> false, thus staying in sync with the fact that there is no I/O scheduler.
> 
> How about the request-based dm drivers (e.g. dm-mpath)? Isn't the dm-mpath
> driver request-based precisely so that an I/O scheduler can be configured on top
> of the dm-mpath driver? From https://lwn.net/Articles/274292/: "The basic idea to
> resolve the issue is to move multipathing layer down below the I/O scheduler, and it
> was proposed from Mike Christie as the block layer (request-based) multipath".

These DM targets do not support zoned devices, so I do not think it is an issue.

That said, you can still keep the parameter stacking, but at the very least, I
think it should be:

t->use_zone_write_lock = t->use_zone_write_lock || b->use_zone_write_lock;

so that for a target composed of multiple devices, if one device needs zone write
locking, the DM device requests it as well. Otherwise, problems may occur, I think.

> 
> Thanks,
> 
> Bart.
>
Bart Van Assche Aug. 11, 2023, 3:41 p.m. UTC | #4
On 8/10/23 17:39, Damien Le Moal wrote:
> On 8/10/23 23:02, Bart Van Assche wrote:
>> My concerns about the above proposal are as follows:
>> * The information about whether or not the zone write lock should be used comes
>>     from the block driver, e.g. a SCSI LLD.
> 
> Yes.
> 
>> * sdp, the SCSI disk pointer, is owned by the ULD.
>> * An ULD may be attached and detached multiple times during the lifetime of a
>>     logical unit without the LLD being informed about this. So how to set
>>     sdp->use_zone_write_lock without introducing a new callback or member variable
>>     in a data structure owned by the LLD?
> 
> That would be set during device scan and device revalidate. And if the value
> changes, then disk_set_zoned() should be called again to update the queue limit.
> That is already what is done for the zoned limit indicating the type of the
> drive. My point is that the zoned limit should dictate if use_zone_write_lock
> can be true. The default should be:
> 
> 	q->limits.use_zone_write_lock = q->limits.zoned != BLK_ZONED_NONE.
> 
> And as proposed, if the UFS driver wants to disable zone write locking, all it
> needs to do is call "disk_set_zoned(disk, BLK_ZONED_HM, false)". I did not try
> to actually code that, and the scsi disk driver may be in the way and that may
> need to be done there, hence the suggestion of having a use_zone_write_lock flag
> in the scsi device structure so that the UFS driver can set it as needed as well
> (and detect changes when revalidating). That should work, but I may be missing
> something.

Hi Damien,

Adding a use_zone_write_lock argument to disk_set_zoned() seems wrong to 
me. The information about whether or not to use the zone write lock 
comes from the LLD. The zone model is retrieved by the ULD. Since both 
pieces of information come from different drivers, both properties 
should be modified independently.

Moving the use_zone_write_lock member variable into a data structure 
owned by the ULD seems wrong to me because that member variable is set 
by the LLD.

Reviewers are allowed to request changes for well-designed and working 
code but should be able to explain why they request these changes. Can 
you please explain why you care about the value of 
q->limits.use_zone_write_lock for the BLK_ZONED_NONE case? If 
dd_use_zone_write_locking() were renamed and moved into 
include/linux/blkdev.h, and if reads of q->limits.use_zone_write_lock 
were changed into blk_use_zone_write_locking() calls, then the value 
of q->limits.use_zone_write_lock for the BLK_ZONED_NONE case wouldn't 
matter at all.

Thanks,

Bart.
Damien Le Moal Aug. 12, 2023, 2:44 a.m. UTC | #5
On 8/12/23 00:41, Bart Van Assche wrote:
> On 8/10/23 17:39, Damien Le Moal wrote:
>> On 8/10/23 23:02, Bart Van Assche wrote:
>>> My concerns about the above proposal are as follows:
>>> * The information about whether or not the zone write lock should be used comes
>>>     from the block driver, e.g. a SCSI LLD.
>>
>> Yes.
>>
>>> * sdp, the SCSI disk pointer, is owned by the ULD.
>>> * An ULD may be attached and detached multiple times during the lifetime of a
>>>     logical unit without the LLD being informed about this. So how to set
>>>     sdp->use_zone_write_lock without introducing a new callback or member variable
>>>     in a data structure owned by the LLD?
>>
>> That would be set during device scan and device revalidate. And if the value
>> changes, then disk_set_zoned() should be called again to update the queue limit.
>> That is already what is done for the zoned limit indicating the type of the
>> drive. My point is that the zoned limit should dictate if use_zone_write_lock
>> can be true. The default should be:
>>
>> 	q->limits.use_zone_write_lock = q->limits.zoned != BLK_ZONED_NONE.
>>
>> And as proposed, if the UFS driver wants to disable zone write locking, all it
>> needs to do is call "disk_set_zoned(disk, BLK_ZONED_HM, false)". I did not try
>> to actually code that, and the scsi disk driver may be in the way and that may
>> need to be done there, hence the suggestion of having a use_zone_write_lock flag
>> in the scsi device structure so that the UFS driver can set it as needed as well
>> (and detect changes when revalidating). That should work, but I may be missing
>> something.
> 
> Hi Damien,
> 
> Adding a use_zone_write_lock argument to disk_set_zoned() seems wrong to 
> me. The information about whether or not to use the zone write lock 
> comes from the LLD. The zone model is retrieved by the ULD. Since both 
> pieces of information come from different drivers, both properties 
> should be modified independently.

OK. But I still think that disk_set_zoned() is the place where the default for
use_zone_write_lock should be set.

And we need a clean way for the LLD to change use_zone_write_lock.

> 
> Moving the use_zone_write_lock member variable into a data structure 
> owned by the ULD seems wrong to me because that member variable is set 
> by the LLD.
> 
> Reviewers are allowed to request changes for well-designed and working 
> code but should be able to explain why they request these changes. Can 
> you please explain why you care about the value of 
> q->limits.use_zone_write_lock for the BLK_ZONED_NONE case? If 

Because use_zone_write_lock should only ever be true in the !BLK_ZONED_NONE
case. Allowing use_zone_write_lock to be true even with zoned == BLK_ZONED_NONE
forces every reader to also check that the device is zoned to know whether the
value of use_zone_write_lock is valid. This is awkward and uselessly complicates
things.

> dd_use_zone_write_locking() were renamed and moved into 
> include/linux/blkdev.h, and if reads of q->limits.use_zone_write_lock 
> were changed into blk_use_zone_write_locking() calls, then the value 
> of q->limits.use_zone_write_lock for the BLK_ZONED_NONE case wouldn't 
> matter at all.

Sure, that would work. But again, why the need to check both the zoned model
and the actual value of use_zone_write_lock? I do not think it is that hard to
keep use_zone_write_lock false for the BLK_ZONED_NONE case. Then
blk_use_zone_write_locking() reduces to returning the value of
use_zone_write_lock.

> 
> Thanks,
> 
> Bart.

Patch

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 0046b447268f..b75c97971860 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -56,6 +56,7 @@  void blk_set_default_limits(struct queue_limits *lim)
 	lim->alignment_offset = 0;
 	lim->io_opt = 0;
 	lim->misaligned = 0;
+	lim->use_zone_write_lock = true;
 	lim->zoned = BLK_ZONED_NONE;
 	lim->zone_write_granularity = 0;
 	lim->dma_alignment = 511;
@@ -685,6 +686,11 @@  int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 						   b->max_secure_erase_sectors);
 	t->zone_write_granularity = max(t->zone_write_granularity,
 					b->zone_write_granularity);
+	/*
+	 * Whether or not the zone write lock should be used depends on the
+	 * bottom driver only.
+	 */
+	t->use_zone_write_lock = b->use_zone_write_lock;
 	t->zoned = max(t->zoned, b->zoned);
 	return ret;
 }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2f5371b8482c..deffa1f13444 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -316,6 +316,7 @@  struct queue_limits {
 	unsigned char		misaligned;
 	unsigned char		discard_misaligned;
 	unsigned char		raid_partial_stripes_expensive;
+	bool			use_zone_write_lock;
 	enum blk_zoned_model	zoned;
 
 	/*