[v5,4/5] sd: limit to use write life hints

Message ID	20240910150200.6589-5-joshi.k@samsung.com (mailing list archive)
State	Not Applicable
Headers	show Received: from mailout2.samsung.com (mailout2.samsung.com [203.254.224.25]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1A818199952 for <linux-scsi@vger.kernel.org>; Tue, 10 Sep 2024 15:11:02 +0000 (UTC) From: Kanchan Joshi <joshi.k@samsung.com> To: axboe@kernel.dk, kbusch@kernel.org, hch@lst.de, sagi@grimberg.me, martin.petersen@oracle.com, James.Bottomley@HansenPartnership.com, brauner@kernel.org, viro@zeniv.linux.org.uk, jack@suse.cz, jaegeuk@kernel.org, jlayton@kernel.org, chuck.lever@oracle.com, bvanassche@acm.org Cc: linux-nvme@lists.infradead.org, linux-fsdevel@vger.kernel.org, linux-f2fs-devel@lists.sourceforge.net, linux-block@vger.kernel.org, linux-scsi@vger.kernel.org, gost.dev@samsung.com, vishak.g@samsung.com, javier.gonz@samsung.com, Nitesh Shetty <nj.shetty@samsung.com>, Kanchan Joshi <joshi.k@samsung.com> Subject: [PATCH v5 4/5] sd: limit to use write life hints Date: Tue, 10 Sep 2024 20:31:59 +0530 Message-Id: <20240910150200.6589-5-joshi.k@samsung.com> In-Reply-To: <20240910150200.6589-1-joshi.k@samsung.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain; charset="utf-8" CMS-TYPE: 105P DLP-Filter: Pass References: <20240910150200.6589-1-joshi.k@samsung.com> <CGME20240910151057epcas5p3369c6257a6f169b4caa6dd59548b538c@epcas5p3.samsung.com>
Series	[v5,1/5] fs, block: refactor enum rw_hint \| expand [v5,1/5] fs, block: refactor enum rw_hint [v5,2/5] fcntl: rename rw_hint_* to rw_lifetime_hint_* [v5,3/5] fcntl: add F_{SET/GET}_RW_HINT_EX [v5,4/5] sd: limit to use write life hints [v5,5/5] nvme: enable FDP support

Kanchan Joshi Sept. 10, 2024, 3:01 p.m. UTC

From: Nitesh Shetty <nj.shetty@samsung.com>

The incoming hint value maybe either lifetime hint or placement hint.
Make SCSI interpret only temperature-based write lifetime hints.

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
---
 drivers/scsi/sd.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

Christoph Hellwig Sept. 12, 2024, 1:02 p.m. UTC | #1

On Tue, Sep 10, 2024 at 08:31:59PM +0530, Kanchan Joshi wrote:
> From: Nitesh Shetty <nj.shetty@samsung.com>
> 
> The incoming hint value maybe either lifetime hint or placement hint.

.. may either be .. ?

> Make SCSI interpret only temperature-based write lifetime hints.
> 
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> ---
>  drivers/scsi/sd.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index dad3991397cf..82bd4b07314e 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -1191,8 +1191,8 @@ static u8 sd_group_number(struct scsi_cmnd *cmd)
>  	if (!sdkp->rscs)
>  		return 0;
>  
> -	return min3((u32)rq->write_hint, (u32)sdkp->permanent_stream_count,
> -		    0x3fu);
> +	return min3((u32)WRITE_LIFETIME_HINT(rq->write_hint),

No fan of the screaming WRITE_LIFETIME_HINT.    Or the fact that multiple
things are multiplexed into the single rq->write_hint field to
start with.

This code could also use a bit of documentation already in the existing
version, but even more so now.

Kanchan Joshi Sept. 12, 2024, 4:31 p.m. UTC | #2

On 9/12/2024 6:32 PM, Christoph Hellwig wrote:
> On Tue, Sep 10, 2024 at 08:31:59PM +0530, Kanchan Joshi wrote:
>> From: Nitesh Shetty <nj.shetty@samsung.com>
>>
>> The incoming hint value maybe either lifetime hint or placement hint.
> 
> .. may either be .. ?

Sure.

>> Make SCSI interpret only temperature-based write lifetime hints.
>>
>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
>> ---
>>   drivers/scsi/sd.c | 7 ++++---
>>   1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
>> index dad3991397cf..82bd4b07314e 100644
>> --- a/drivers/scsi/sd.c
>> +++ b/drivers/scsi/sd.c
>> @@ -1191,8 +1191,8 @@ static u8 sd_group_number(struct scsi_cmnd *cmd)
>>   	if (!sdkp->rscs)
>>   		return 0;
>>   
>> -	return min3((u32)rq->write_hint, (u32)sdkp->permanent_stream_count,
>> -		    0x3fu);
>> +	return min3((u32)WRITE_LIFETIME_HINT(rq->write_hint),
> 
> No fan of the screaming WRITE_LIFETIME_HINT. 

Macros tend to. Once it becomes lowercase (inline function), it will 
stop screaming.

    Or the fact that multiple
> things are multiplexed into the single rq->write_hint field to
> start with.

Please see the response in patch #1. My worries were:
(a) adding a new field and propagating it across the stack will cause 
code duplication.
(b) to add a new field we need to carve space within inode, bio and 
request.
We had a hole in request, but it is set to vanish after ongoing 
integrity refactoring patch of Keith [1]. For inode also, there is no 
liberty at this point [2].

I think current multiplexing approach is similar to ioprio where 
multiple io priority classes/values are expressed within an int type. 
And few kernel components choose to interpret certain ioprio values at will.

And all this is still in-kernel details. Which can be changed if/when 
other factors start helping.

[1] 
https://lore.kernel.org/linux-nvme/20240911201240.3982856-2-kbusch@meta.com/
[2] 
https://lore.kernel.org/linux-nvme/20240903-erfassen-bandmitglieder-32dfaeee66b2@brauner/

Christoph Hellwig Sept. 13, 2024, 8:06 a.m. UTC | #3

On Thu, Sep 12, 2024 at 10:01:00PM +0530, Kanchan Joshi wrote:
> Please see the response in patch #1. My worries were:
> (a) adding a new field and propagating it across the stack will cause 
> code duplication.
> (b) to add a new field we need to carve space within inode, bio and 
> request.
> We had a hole in request, but it is set to vanish after ongoing 
> integrity refactoring patch of Keith [1]. For inode also, there is no 
> liberty at this point [2].
> 
> I think current multiplexing approach is similar to ioprio where 
> multiple io priority classes/values are expressed within an int type. 
> And few kernel components choose to interpret certain ioprio values at will.
> 
> And all this is still in-kernel details. Which can be changed if/when 
> other factors start helping.

Maybe part of the problem is that the API is very confusing.  A smal
part of that is of course that the existing temperature hints already
have some issues, but this seems to be taking them make it significantly
worse.

Note: this tries to include highlevel comments from the discussion of
the previous patches instead of splitting them over multiple threads.

F_{S,G}ET_RW_HINT works on arbitrary file descriptors with absolutely no
check for support by the device or file system and not check for the
file type.  That's not exactly good API design, but not really a major
because they are clearly designed as hints with a fixed number of
values, allowing the implementation to map them if not enough are
supported.

But if we increase this to a variable number of hints that don't have
any meaning (and even if that is just the rough order of the temperature
hints assigned to them), that doesn't really work.  We'll need an API
to check if these stream hints are supported and how many of them,
otherwise the applications can't make any sensible use of them.

If these aren't just stream hints of the file system but you actually
want them as an abstract API for FDP you'll also need to actually
expose even more information like the reclaim unit size, but let's
ignore that for this part of the discssion.

Back the the API: the existing lifetime hints have basically three
layers:

 1) syscall ABI
 2) the hint stored in the inode
 3) the hint passed in the bio

1) is very much fixed for the temperature API, we just need to think if
   we want to support it at the same time as a more general hints API.
   Or if we can map one into another.  Or if we can't support them at
   the same time how that is communicated.

For 2) and 3) we can use an actual union if we decide to not support
both at the same time, keyed off a flag outside the field, but if not
we simply need space for both.

Kanchan Joshi Sept. 16, 2024, 1:49 p.m. UTC | #4

On 9/13/2024 1:36 PM, Christoph Hellwig wrote:
> On Thu, Sep 12, 2024 at 10:01:00PM +0530, Kanchan Joshi wrote:
>> Please see the response in patch #1. My worries were:
>> (a) adding a new field and propagating it across the stack will cause
>> code duplication.
>> (b) to add a new field we need to carve space within inode, bio and
>> request.
>> We had a hole in request, but it is set to vanish after ongoing
>> integrity refactoring patch of Keith [1]. For inode also, there is no
>> liberty at this point [2].
>>
>> I think current multiplexing approach is similar to ioprio where
>> multiple io priority classes/values are expressed within an int type.
>> And few kernel components choose to interpret certain ioprio values at will.
>>
>> And all this is still in-kernel details. Which can be changed if/when
>> other factors start helping.
> 
> Maybe part of the problem is that the API is very confusing.  A smal
> part of that is of course that the existing temperature hints already
> have some issues, but this seems to be taking them make it significantly
> worse.

Can you explain what part is confusing. This is a simple API that takes 
type/value pair. Two types (and respective values) are clearly defined 
currently, and more can be added in future.

> Note: this tries to include highlevel comments from the discussion of
> the previous patches instead of splitting them over multiple threads.
> 
> F_{S,G}ET_RW_HINT works on arbitrary file descriptors with absolutely no
> check for support by the device or file system and not check for the
> file type.  That's not exactly good API design, but not really a major
> because they are clearly designed as hints with a fixed number of
> values, allowing the implementation to map them if not enough are
> supported.
> 
> But if we increase this to a variable number of hints that don't have
> any meaning (and even if that is just the rough order of the temperature
> hints assigned to them), that doesn't really work.  We'll need an API
> to check if these stream hints are supported and how many of them,
> otherwise the applications can't make any sensible use of them.

- Since writes are backward compatible, nothing bad happens if the 
passed placement-hint value is not supported. Maybe desired outcome (in 
terms of WAF reduction) may not come but that's not a kernel problem 
anyway. It's rather about how well application is segregating and how 
well device is doing its job.

- Device is perfectly happy to work with numbers (0 to 256 in current 
spec) to produce some value (i.e., WAF reduction). Any extra 
semantics/abstraction on these numbers only adds to the work without 
increasing that value. If any application needs that, it's free to 
attach any meaning/semantics to these numbers.

Extra abstraction has already been done with temperature-hint (over 
multi-stream numbers). If that's useful somehow, we should consider 
going back to using those (v3)? But if we are doing a new placement 
hint, it's better to use plain numbers without any semantics. That will 
be (a) more scalable, (b) be closer to what device can readily accept, 
(c) justify why placement should be a different hint-type, and (d) help 
Kernel because it has to do less (no intermediate mapping/transformation 
etc).

IMHO sticking to the existing hint model and doing less (in terms of 
abstraction, reporting and stuff) in kernel maybe a better path.

> If these aren't just stream hints of the file system but you actually
> want them as an abstract API for FDP you'll also need to actually
> expose even more information like the reclaim unit size, but let's
> ignore that for this part of the discssion.
> 
> Back the the API: the existing lifetime hints have basically three
> layers:
> 
>   1) syscall ABI
>   2) the hint stored in the inode
>   3) the hint passed in the bio
> 
> 1) is very much fixed for the temperature API, we just need to think if
>     we want to support it at the same time as a more general hints API.
>     Or if we can map one into another.  Or if we can't support them at
>     the same time how that is communicated.
> 
> For 2) and 3) we can use an actual union if we decide to not support
> both at the same time, keyed off a flag outside the field, but if not
> we simply need space for both.
> 

Right, if there were space, we probably would have kept both.
But particularly for these two types (temperature and placement) it's 
probably fine if one overwrites the another. This is not automatic and 
will happen only at the behest of user. And that's something we can 
clearly document in the man page of the new fcntl. Hope that sounds fine?

Christoph Hellwig Sept. 17, 2024, 6:20 a.m. UTC | #5

On Mon, Sep 16, 2024 at 07:19:21PM +0530, Kanchan Joshi wrote:
> > Maybe part of the problem is that the API is very confusing.  A smal
> > part of that is of course that the existing temperature hints already
> > have some issues, but this seems to be taking them make it significantly
> > worse.
> 
> Can you explain what part is confusing. This is a simple API that takes 
> type/value pair. Two types (and respective values) are clearly defined 
> currently, and more can be added in future.

I though I outlined that below.

> > But if we increase this to a variable number of hints that don't have
> > any meaning (and even if that is just the rough order of the temperature
> > hints assigned to them), that doesn't really work.  We'll need an API
> > to check if these stream hints are supported and how many of them,
> > otherwise the applications can't make any sensible use of them.
> 
> - Since writes are backward compatible, nothing bad happens if the 
> passed placement-hint value is not supported. Maybe desired outcome (in 
> terms of WAF reduction) may not come but that's not a kernel problem 
> anyway. It's rather about how well application is segregating and how 
> well device is doing its job.

What do you mean with "writes are backward compatible" ?

> - Device is perfectly happy to work with numbers (0 to 256 in current 
> spec) to produce some value (i.e., WAF reduction). Any extra 
> semantics/abstraction on these numbers only adds to the work without 
> increasing that value. If any application needs that, it's free to 
> attach any meaning/semantics to these numbers.

If the device (or file system, which really needs to be in control
for actual files vs just block devices) does not support all 256
we need to reduce them to less than that.  The kernel can help with
that a bit if the streams have meanings (collapsing temperature levels
that are close), but not at all if they don't have meanings.  The
application can and thus needs to know the number of separate
streams available.

Kanchan Joshi Sept. 17, 2024, 4:03 p.m. UTC | #6

On 9/17/2024 11:50 AM, Christoph Hellwig wrote:
>>> But if we increase this to a variable number of hints that don't have
>>> any meaning (and even if that is just the rough order of the temperature
>>> hints assigned to them), that doesn't really work.  We'll need an API
>>> to check if these stream hints are supported and how many of them,
>>> otherwise the applications can't make any sensible use of them.
>> - Since writes are backward compatible, nothing bad happens if the
>> passed placement-hint value is not supported. Maybe desired outcome (in
>> terms of WAF reduction) may not come but that's not a kernel problem
>> anyway. It's rather about how well application is segregating and how
>> well device is doing its job.
> What do you mean with "writes are backward compatible" ?
> 

Writes are not going to fail even if you don't pass the placement-id or 
pass a placement-id that is not valid. FDP-enabled SSD will not shout 
and complete writes fine even with FDP-unaware software.

I think that part is same as how Linux write hints behave ATM. Writes 
don't have to carry the lifetime hint always. And when they do, the hint 
value never becomes the reason of failure (e.g. life hints on NVMe 
vanish in the thin air rather than causing any failure).

>> - Device is perfectly happy to work with numbers (0 to 256 in current
>> spec) to produce some value (i.e., WAF reduction). Any extra
>> semantics/abstraction on these numbers only adds to the work without
>> increasing that value. If any application needs that, it's free to
>> attach any meaning/semantics to these numbers.
> If the device (or file system, which really needs to be in control
> for actual files vs just block devices) does not support all 256
> we need to reduce them to less than that.  The kernel can help with
> that a bit if the streams have meanings (collapsing temperature levels
> that are close), but not at all if they don't have meanings. 

Current patch (nvme) does what you mentioned above.
Pasting the fragment that maps potentially large placement-hints to the 
last valid placement-id.

+static inline void nvme_assign_placement_id(struct nvme_ns *ns,
+					struct request *req,
+					struct nvme_command *cmd)
+{
+	u8 h = umin(ns->head->nr_plids - 1,
+				WRITE_PLACEMENT_HINT(req->write_hint));
+
+	cmd->rw.control |= cpu_to_le16(NVME_RW_DTYPE_DPLCMT);
+	cmd->rw.dsmgmt |= cpu_to_le32(ns->head->plids[h] << 16);
+}

But this was just an implementation choice (and not a failure avoidance 
fallback).

Kanchan Joshi Sept. 17, 2024, 5 p.m. UTC | #7

On 9/17/2024 9:33 PM, Kanchan Joshi wrote:
> On 9/17/2024 11:50 AM, Christoph Hellwig wrote:
>>>> But if we increase this to a variable number of hints that don't have
>>>> any meaning (and even if that is just the rough order of the temperature
>>>> hints assigned to them), that doesn't really work.  We'll need an API
>>>> to check if these stream hints are supported and how many of them,
>>>> otherwise the applications can't make any sensible use of them.
>>> - Since writes are backward compatible, nothing bad happens if the
>>> passed placement-hint value is not supported. Maybe desired outcome (in
>>> terms of WAF reduction) may not come but that's not a kernel problem
>>> anyway. It's rather about how well application is segregating and how
>>> well device is doing its job.
>> What do you mean with "writes are backward compatible" ?
>>
> Writes are not going to fail even if you don't pass the placement-id or
> pass a placement-id that is not valid. FDP-enabled SSD will not shout
> and complete writes fine even with FDP-unaware software.
> 
> I think that part is same as how Linux write hints behave ATM. Writes
> don't have to carry the lifetime hint always. And when they do, the hint
> value never becomes the reason of failure (e.g. life hints on NVMe
> vanish in the thin air rather than causing any failure).
> 

FWIW, I am not sure about current SCSI streams but NVMe multi-stream did 
not tolerate invalid values. Write command with invalid stream was 
aborted. So in that scheme of things, it was important to be pedantic 
about what values are being passed.
But in FDP, things are closer to Linux hints that don't cause failures. 
With the plain-numbers interface, the similarities will increase.

Christoph Hellwig Sept. 18, 2024, 6:42 a.m. UTC | #8

> > If the device (or file system, which really needs to be in control
> > for actual files vs just block devices) does not support all 256
> > we need to reduce them to less than that.  The kernel can help with
> > that a bit if the streams have meanings (collapsing temperature levels
> > that are close), but not at all if they don't have meanings. 
> 
> Current patch (nvme) does what you mentioned above.
> Pasting the fragment that maps potentially large placement-hints to the 
> last valid placement-id.
> 
> +static inline void nvme_assign_placement_id(struct nvme_ns *ns,
> +					struct request *req,
> +					struct nvme_command *cmd)
> +{
> +	u8 h = umin(ns->head->nr_plids - 1,
> +				WRITE_PLACEMENT_HINT(req->write_hint));
> +
> +	cmd->rw.control |= cpu_to_le16(NVME_RW_DTYPE_DPLCMT);
> +	cmd->rw.dsmgmt |= cpu_to_le32(ns->head->plids[h] << 16);
> +}
> 
> But this was just an implementation choice (and not a failure avoidance 
> fallback).

And it completely fucks thing up as I said.  If I have an application
that wants to separate streams I need to know how many stream I
have available, and not fold all higher numbers into the last one
available.

Kanchan Joshi Sept. 18, 2024, 8:12 a.m. UTC | #9

On 9/18/2024 12:12 PM, Christoph Hellwig wrote:
>>> If the device (or file system, which really needs to be in control
>>> for actual files vs just block devices) does not support all 256
>>> we need to reduce them to less than that.  The kernel can help with
>>> that a bit if the streams have meanings (collapsing temperature levels
>>> that are close), but not at all if they don't have meanings.
>> Current patch (nvme) does what you mentioned above.
>> Pasting the fragment that maps potentially large placement-hints to the
>> last valid placement-id.
>>
>> +static inline void nvme_assign_placement_id(struct nvme_ns *ns,
>> +					struct request *req,
>> +					struct nvme_command *cmd)
>> +{
>> +	u8 h = umin(ns->head->nr_plids - 1,
>> +				WRITE_PLACEMENT_HINT(req->write_hint));
>> +
>> +	cmd->rw.control |= cpu_to_le16(NVME_RW_DTYPE_DPLCMT);
>> +	cmd->rw.dsmgmt |= cpu_to_le32(ns->head->plids[h] << 16);
>> +}
>>
>> But this was just an implementation choice (and not a failure avoidance
>> fallback).
> And it completely fucks thing up as I said.  If I have an application
> that wants to separate streams I need to know how many stream I
> have available, and not fold all higher numbers into the last one
> available.

Would you prefer a new queue attribute (say nr_streams) that tells that?

Christoph Hellwig Sept. 18, 2024, 12:01 p.m. UTC | #10

On Wed, Sep 18, 2024 at 01:42:51PM +0530, Kanchan Joshi wrote:
> Would you prefer a new queue attribute (say nr_streams) that tells that?

No.  For one because using the same file descriptors as the one used
to set the hind actually makes it usable - finding the block device
does not.  And second as told about half a dozend time for this scheme
to actually work on a regular file the file system actually needs the
arbiter, as it can work on top of multiple block devices, consumes
streams, might export streams even if the underlying devices don't and
so on.

Kanchan Joshi Sept. 24, 2024, 9:24 a.m. UTC | #11

On 9/18/2024 5:31 PM, Christoph Hellwig wrote:
> On Wed, Sep 18, 2024 at 01:42:51PM +0530, Kanchan Joshi wrote:
>> Would you prefer a new queue attribute (say nr_streams) that tells that?
> 
> No.  For one because using the same file descriptors as the one used
> to set the hind actually makes it usable - finding the block device
> does not.  And second as told about half a dozend time for this scheme
> to actually work on a regular file the file system actually needs the
> arbiter, as it can work on top of multiple block devices, consumes
> streams, might export streams even if the underlying devices don't and
> so on.
> 

FS managed/created hints is a different topic altogether, and honestly 
that is not the scope of this series. That needs to be thought at per-FS 
level due to different data/meta layouts.
This scope of this series is to enable application-managed hints passing 
through the file system. FS only needs to pass what it receives. No 
active decision making (since application is doing that). Whether it 
works fine or not - is application's problem. But due to the simplicity 
it scales across filesystems. This is for the class of applications that 
know about their data and have decided to be in control.

Regardless, since placement-hints are not getting the reception I 
imagined, I will backtrack.

Christoph Hellwig Sept. 24, 2024, 9:28 a.m. UTC | #12

On Tue, Sep 24, 2024 at 02:54:51PM +0530, Kanchan Joshi wrote:
> FS managed/created hints is a different topic altogether,
> and honestly 
> that is not the scope of this series. That needs to be thought at per-FS 
> level due to different data/meta layouts.

No, it is not.  If you design an API where hints bypass the file
system you fundamentally do the wrong thing when there is a file
system.  No one is asking to actually implement file system
support in this series, but we need to consider the fundamental
problem in the API design.

And yes, the actual implementation will be highly dependent on the
file system.

> This scope of this series is to enable application-managed hints passing 
> through the file system. FS only needs to pass what it receives.

Which fundamentally can't work for even a semi-intelligent file system.

[v5,4/5] sd: limit to use write life hints

Commit Message

Comments

Patch