mbox series

[00/13] Pass data temperature information to zoned UFS devices

Message ID 20230920191442.3701673-1-bvanassche@acm.org (mailing list archive)
Headers show
Series Pass data temperature information to zoned UFS devices | expand

Message

Bart Van Assche Sept. 20, 2023, 7:14 p.m. UTC
Hi Jens,

Zoned UFS vendors need the data temperature information. Hence this patch
series that restores write hint information in F2FS and in the block layer.
The SCSI disk (sd) driver is modified such that it passes write hint
information to SCSI devices via the GROUP NUMBER field.

Please consider this patch series for the next merge window.

Thanks,

Bart.

Bart Van Assche (13):
  fs/f2fs: Restore the whint_mode mount option
  fs: Restore support for F_GET_FILE_RW_HINT and F_SET_FILE_RW_HINT
  fs: Restore kiocb.ki_hint
  block: Restore write hint support
  scsi: core: Query the Block Limits Extension VPD page
  scsi_proto: Add struct io_group_descriptor
  sd: Translate data lifetime information
  scsi_debug: Reduce code duplication
  scsi_debug: Support the block limits extension VPD page
  scsi_debug: Rework page code error handling
  scsi_debug: Rework subpage code error handling
  scsi_debug: Implement the IO Advice Hints Grouping mode page
  scsi_debug: Maintain write statistics per group number

 Documentation/filesystems/f2fs.rst |  70 ++++++++++
 block/bio.c                        |   2 +
 block/blk-crypto-fallback.c        |   1 +
 block/blk-merge.c                  |  14 ++
 block/blk-mq.c                     |   2 +
 block/bounce.c                     |   1 +
 block/fops.c                       |   3 +
 drivers/scsi/scsi.c                |   2 +
 drivers/scsi/scsi_debug.c          | 202 +++++++++++++++++++----------
 drivers/scsi/scsi_sysfs.c          |  10 ++
 drivers/scsi/sd.c                  |  78 ++++++++++-
 drivers/scsi/sd.h                  |   2 +
 fs/aio.c                           |   1 +
 fs/buffer.c                        |  13 +-
 fs/cachefiles/io.c                 |   2 +
 fs/direct-io.c                     |   1 +
 fs/f2fs/data.c                     |   2 +
 fs/f2fs/f2fs.h                     |   9 ++
 fs/f2fs/file.c                     |   6 +
 fs/f2fs/segment.c                  |  95 ++++++++++++++
 fs/f2fs/super.c                    |  32 ++++-
 fs/fcntl.c                         |  18 +++
 fs/iomap/buffered-io.c             |   2 +
 fs/iomap/direct-io.c               |   1 +
 fs/mpage.c                         |   1 +
 fs/open.c                          |   1 +
 include/linux/blk-mq.h             |   1 +
 include/linux/blk_types.h          |   1 +
 include/linux/fs.h                 |  21 +++
 include/scsi/scsi_device.h         |   1 +
 include/scsi/scsi_proto.h          |  40 ++++++
 include/trace/events/f2fs.h        |   5 +-
 io_uring/rw.c                      |   1 +
 33 files changed, 566 insertions(+), 75 deletions(-)

Comments

Matthew Wilcox Sept. 20, 2023, 7:28 p.m. UTC | #1
On Wed, Sep 20, 2023 at 12:14:25PM -0700, Bart Van Assche wrote:
> Hi Jens,
> 
> Zoned UFS vendors need the data temperature information. Hence this patch
> series that restores write hint information in F2FS and in the block layer.
> The SCSI disk (sd) driver is modified such that it passes write hint
> information to SCSI devices via the GROUP NUMBER field.

"Need" in what sense?  Can you quantify what improvements we might
see from this patchset?
Bart Van Assche Sept. 20, 2023, 8:46 p.m. UTC | #2
On 9/20/23 12:28, Matthew Wilcox wrote:
> On Wed, Sep 20, 2023 at 12:14:25PM -0700, Bart Van Assche wrote:
>> Zoned UFS vendors need the data temperature information. Hence
>> this patch series that restores write hint information in F2FS and
>> in the block layer. The SCSI disk (sd) driver is modified such that
>> it passes write hint information to SCSI devices via the GROUP
>> NUMBER field.
> 
> "Need" in what sense?  Can you quantify what improvements we might 
> see from this patchset?

Hi Matthew,

This is what Jens wrote about 1.5 years ago in reply to complaints about
the removal of write hint support making it impossible to pass write 
hint information to SSD devices: "If at some point there's a
desire to actually try and upstream this support, then we'll be happy to
review that patchset."
(https://lore.kernel.org/linux-block/ef77ef36-df95-8658-ff54-7d8046f5d0e7@kernel.dk/). 
Hence this patch series.

Recently T10 standardized how data temperature information should be 
passed to SCSI devices. One of the patches in this series translates 
write hint information into a data temperature for SCSI devices. This 
can be used by SCSI SSD devices (including UFS devices) to reduce write 
amplification inside the device because host software should assign the 
same data temperature to all data that will be garbage collected at once.

Thanks,

Bart.
Niklas Cassel Sept. 21, 2023, 7:46 a.m. UTC | #3
On Wed, Sep 20, 2023 at 01:46:41PM -0700, Bart Van Assche wrote:
> On 9/20/23 12:28, Matthew Wilcox wrote:
> > On Wed, Sep 20, 2023 at 12:14:25PM -0700, Bart Van Assche wrote:
> > > Zoned UFS vendors need the data temperature information. Hence
> > > this patch series that restores write hint information in F2FS and
> > > in the block layer. The SCSI disk (sd) driver is modified such that
> > > it passes write hint information to SCSI devices via the GROUP
> > > NUMBER field.
> > 
> > "Need" in what sense?  Can you quantify what improvements we might see
> > from this patchset?
> 
> Hi Matthew,
> 
> This is what Jens wrote about 1.5 years ago in reply to complaints about
> the removal of write hint support making it impossible to pass write hint
> information to SSD devices: "If at some point there's a
> desire to actually try and upstream this support, then we'll be happy to
> review that patchset."
> (https://lore.kernel.org/linux-block/ef77ef36-df95-8658-ff54-7d8046f5d0e7@kernel.dk/).
> Hence this patch series.
> 
> Recently T10 standardized how data temperature information should be passed
> to SCSI devices. One of the patches in this series translates write hint
> information into a data temperature for SCSI devices. This can be used by
> SCSI SSD devices (including UFS devices) to reduce write amplification
> inside the device because host software should assign the same data
> temperature to all data that will be garbage collected at once.

Hello Bart,

Considering that this API (F_GET_FILE_RW_HINT / F_SET_FILE_RW_HINT)
was previously only used by NVMe (NVMe streams).

Yet, this API and the support in NVMe (NVMe streams) was removed.

Now you want to re-add the same API, but this time, it will only
be used by SCSI.

Since you basically revert (some of) the patches, I would have expected
the cover letter to at least mention NVMe somewhere.

Should NVMe streams be brought back? Yes? No?
While I have a strong guess of what the NVMe maintainers will say, I think
that your cover letter should mention "why"/"why not" the NVMe support
"is"/"is not" reverted.


Kind regards,
Niklas
Bart Van Assche Sept. 21, 2023, 2:27 p.m. UTC | #4
On 9/21/23 00:46, Niklas Cassel wrote:
> Considering that this API (F_GET_FILE_RW_HINT / F_SET_FILE_RW_HINT) 
> was previously only used by NVMe (NVMe streams).

That doesn't sound correct to me. I think support for this API was added
in F2FS in November 2017 (commit 4f0a03d34dd4 ("f2fs: apply write hints
to select the type of segments for buffered write")). That was a few
months after NVMe stream support was added (June 2017) by commit
f5d118406247 ("nvme: add support for streams and directives").

> Should NVMe streams be brought back? Yes? No?

 From commit 561593a048d7 ("Merge tag 'for-5.18/write-streams-2022-03-18'
of git://git.kernel.dk/linux-block"): "This removes the write streams
support in NVMe. No vendor ever really shipped working support for this,
and they are not interested in supporting it."

I do not want to reopen the discussion about NVMe streams.

Thanks,

Bart.
Niklas Cassel Sept. 21, 2023, 3:34 p.m. UTC | #5
Hello Bart,

On Thu, Sep 21, 2023 at 07:27:08AM -0700, Bart Van Assche wrote:
> On 9/21/23 00:46, Niklas Cassel wrote:
> > Considering that this API (F_GET_FILE_RW_HINT / F_SET_FILE_RW_HINT) was
> > previously only used by NVMe (NVMe streams).
> 
> That doesn't sound correct to me. I think support for this API was added
> in F2FS in November 2017 (commit 4f0a03d34dd4 ("f2fs: apply write hints
> to select the type of segments for buffered write")). That was a few
> months after NVMe stream support was added (June 2017) by commit
> f5d118406247 ("nvme: add support for streams and directives").

I wrote the "this API (F_GET_FILE_RW_HINT / F_SET_FILE_RW_HINT),
i.e. the support for hints in the block layer.

This addition to the block layer API was added in:
c75b1d9421f8 ("fs: add fcntl() interface for setting/getting write life time hints")

As part of this series:
https://lore.kernel.org/linux-block/1498491480-16306-1-git-send-email-axboe@kernel.dk/

So this support included:
-the block layer API changes
-the support for NVMe streams


The modifications to f2fs to actually make use of these block layer write
hints was not included in this initial series. They were added several
months later.


> From commit 561593a048d7 ("Merge tag 'for-5.18/write-streams-2022-03-18'
> of git://git.kernel.dk/linux-block"): "This removes the write streams
> support in NVMe. No vendor ever really shipped working support for this,
> and they are not interested in supporting it."
> 
> I do not want to reopen the discussion about NVMe streams.

I don't think we need to.

I simply think that your cover letter should mention it somehow...

As the whole reason why the block layer API was merged was to be
able to support NVMe streams.

So you bringing back this API, I think that you should at least
mention that you don't bring back NVMe streams...
and mention that you bring back the support for f2fs,
and add support for SCSI.. with some short motivation of why support
is needed in both SCSI and f2fs.

Right now your cover letter is 4 lines :)
I don't recall when I last saw such a small cover letter for a feature
impacting so many different parts of the kernel.


Kind regards,
Niklas
Bart Van Assche Sept. 21, 2023, 5 p.m. UTC | #6
On 9/21/23 08:34, Niklas Cassel wrote:
> Right now your cover letter is 4 lines :)
> I don't recall when I last saw such a small cover letter for a feature
> impacting so many different parts of the kernel.

I will expand the cover letter if I have to repost this patch series.

Thanks,

Bart.
Matthew Wilcox Sept. 21, 2023, 7:27 p.m. UTC | #7
On Thu, Sep 21, 2023 at 07:27:08AM -0700, Bart Van Assche wrote:
> On 9/21/23 00:46, Niklas Cassel wrote:
> > Considering that this API (F_GET_FILE_RW_HINT / F_SET_FILE_RW_HINT) was
> > previously only used by NVMe (NVMe streams).
> 
> That doesn't sound correct to me. I think support for this API was added
> in F2FS in November 2017 (commit 4f0a03d34dd4 ("f2fs: apply write hints
> to select the type of segments for buffered write")). That was a few
> months after NVMe stream support was added (June 2017) by commit
> f5d118406247 ("nvme: add support for streams and directives").
> 
> > Should NVMe streams be brought back? Yes? No?
> 
> From commit 561593a048d7 ("Merge tag 'for-5.18/write-streams-2022-03-18'
> of git://git.kernel.dk/linux-block"): "This removes the write streams
> support in NVMe. No vendor ever really shipped working support for this,
> and they are not interested in supporting it."

It sounds like UFS is at the same stage that NVMe got to -- standard
exists, no vendor has committed to actually shipping it.  Isn't bringing
it back a little premature?
Bart Van Assche Sept. 21, 2023, 7:39 p.m. UTC | #8
On 9/21/23 12:27, Matthew Wilcox wrote:
> On Thu, Sep 21, 2023 at 07:27:08AM -0700, Bart Van Assche wrote:
>> On 9/21/23 00:46, Niklas Cassel wrote:
>>> Should NVMe streams be brought back? Yes? No?
>>
>> From commit 561593a048d7 ("Merge tag 'for-5.18/write-streams-2022-03-18'
>> of git://git.kernel.dk/linux-block"): "This removes the write streams
>> support in NVMe. No vendor ever really shipped working support for this,
>> and they are not interested in supporting it."
> 
> It sounds like UFS is at the same stage that NVMe got to -- standard
> exists, no vendor has committed to actually shipping it.  Isn't bringing
> it back a little premature?

Hi Matthew,

That's a misunderstanding. UFS vendors support interpreting the SCSI 
GROUP NUMBER as a data temperature since many years, probably since more 
than ten years. Additionally, for multiple UFS vendors having the data 
temperature available is important for achieving good performance. This 
message shows how UFS vendors were using that information before write 
hint support was removed: 
https://lore.kernel.org/linux-block/PH0PR08MB7889642784B2E1FC1799A828DB0B9@PH0PR08MB7889.namprd08.prod.outlook.com/

This patch series implements support for passing data temperature 
information from F2FS to UFS devices in a standards-compliant way.

Thanks,

Bart.
Matthew Wilcox Sept. 21, 2023, 7:46 p.m. UTC | #9
On Thu, Sep 21, 2023 at 12:39:00PM -0700, Bart Van Assche wrote:
> On 9/21/23 12:27, Matthew Wilcox wrote:
> > On Thu, Sep 21, 2023 at 07:27:08AM -0700, Bart Van Assche wrote:
> > > On 9/21/23 00:46, Niklas Cassel wrote:
> > > > Should NVMe streams be brought back? Yes? No?
> > > 
> > > From commit 561593a048d7 ("Merge tag 'for-5.18/write-streams-2022-03-18'
> > > of git://git.kernel.dk/linux-block"): "This removes the write streams
> > > support in NVMe. No vendor ever really shipped working support for this,
> > > and they are not interested in supporting it."
> > 
> > It sounds like UFS is at the same stage that NVMe got to -- standard
> > exists, no vendor has committed to actually shipping it.  Isn't bringing
> > it back a little premature?
> 
> Hi Matthew,
> 
> That's a misunderstanding. UFS vendors support interpreting the SCSI GROUP
> NUMBER as a data temperature since many years, probably since more than ten
> years. Additionally, for multiple UFS vendors having the data temperature
> available is important for achieving good performance. This message shows
> how UFS vendors were using that information before write hint support was
> removed: https://lore.kernel.org/linux-block/PH0PR08MB7889642784B2E1FC1799A828DB0B9@PH0PR08MB7889.namprd08.prod.outlook.com/

If vendor support already exists, then why did you dodge the question
asking for quantified data that I asked earlier?  And can we have that
data now?
Bart Van Assche Sept. 21, 2023, 8:11 p.m. UTC | #10
On 9/21/23 12:46, Matthew Wilcox wrote:
> If vendor support already exists, then why did you dodge the question
> asking for quantified data that I asked earlier?  And can we have that
> data now?

 From Rho, Eunhee, Kanchan Joshi, Seung-Uk Shin, Nitesh Jagadeesh 
Shetty, Jooyoung Hwang, Sangyeun Cho, Daniel DG Lee, and Jaeheon Jeong. 
"{FStream}: Managing Flash Streams in the File System." In 16th USENIX 
Conference on File and Storage Technologies (FAST 18), pp. 257-264. 
2018: "Experimental results show that FStream enhances the filebench 
performance by 5%∼35% and reduces WAF (Write Amplification Factor) by 
7%∼46%. For a NoSQL database benchmark, performance is improved by up to 
38% and WAF is reduced by up to 81%." Please note that these results are 
for ext4 instead of F2FS. The benefit for F2FS is probably smaller since 
F2FS is optimized for NAND flash media.

I have Cc-ed open source contributors from multiple UFS vendors on this 
email and I hope that they can share performance numbers for F2FS.

Thanks,

Bart.
Jaegeuk Kim Sept. 21, 2023, 8:47 p.m. UTC | #11
On 09/21, Matthew Wilcox wrote:
> On Thu, Sep 21, 2023 at 12:39:00PM -0700, Bart Van Assche wrote:
> > On 9/21/23 12:27, Matthew Wilcox wrote:
> > > On Thu, Sep 21, 2023 at 07:27:08AM -0700, Bart Van Assche wrote:
> > > > On 9/21/23 00:46, Niklas Cassel wrote:
> > > > > Should NVMe streams be brought back? Yes? No?
> > > > 
> > > > From commit 561593a048d7 ("Merge tag 'for-5.18/write-streams-2022-03-18'
> > > > of git://git.kernel.dk/linux-block"): "This removes the write streams
> > > > support in NVMe. No vendor ever really shipped working support for this,
> > > > and they are not interested in supporting it."
> > > 
> > > It sounds like UFS is at the same stage that NVMe got to -- standard
> > > exists, no vendor has committed to actually shipping it.  Isn't bringing
> > > it back a little premature?
> > 
> > Hi Matthew,
> > 
> > That's a misunderstanding. UFS vendors support interpreting the SCSI GROUP
> > NUMBER as a data temperature since many years, probably since more than ten
> > years. Additionally, for multiple UFS vendors having the data temperature
> > available is important for achieving good performance. This message shows
> > how UFS vendors were using that information before write hint support was
> > removed: https://lore.kernel.org/linux-block/PH0PR08MB7889642784B2E1FC1799A828DB0B9@PH0PR08MB7889.namprd08.prod.outlook.com/
> 
> If vendor support already exists, then why did you dodge the question
> asking for quantified data that I asked earlier?  And can we have that
> data now?

I'm in doubt this patch-set really requires the quantified data which may be
mostly confidential to all the companies, also given the revert reason was no
user, IIUC. OTOH, I'm not sure whether you're famailiar with FTL, but, when
we consider the entire stack ranging from f2fs to FTL which manages NAND blocks,
I do see a clear benefit to give the temperature hints for FTL to align therein
garbage collection unit with one in f2fs, which is the key idea on Zoned UFS
in mobile world, I believe. Otherwise, it can show non-deterministic longer
write latencies due to internal GCs, increase WAI feeding to shorter lifetime.
Martin K. Petersen Sept. 27, 2023, 7:14 p.m. UTC | #12
Hi Bart!

> Zoned UFS vendors need the data temperature information. Hence this
> patch series that restores write hint information in F2FS and in the
> block layer. The SCSI disk (sd) driver is modified such that it passes
> write hint information to SCSI devices via the GROUP NUMBER field.

I don't have any particular problems with your implementation, although
I'm still trying to wrap my head around how to make this coexist with my
I/O hinting series. But I guess there's probably not going to be a big
overlap between devices that support both features.

However, it still pains me greatly to see the SBC proposal being
intertwined with the travesty that is streams. Why not define everything
in the IO advice hints group descriptor? I/O hints already use GROUP
NUMBER as an index. Why not just define a few permanent hint
descriptors? What's the point of the additional level of indirection to
tie this new feature into streams? RSCS basically says "ignore the
streams-specific bits and bobs and do this other stuff instead". What
does the streams infrastructure provide that can't be solved trivially
in the IO advise mode page alone?

For existing UFS devices which predate RSCS and streams but which
support getting data temperature from GROUP NUMBER, what is the
mechanism for detecting and enabling the feature?
Bart Van Assche Sept. 27, 2023, 8:49 p.m. UTC | #13
On 9/27/23 12:14, Martin K. Petersen wrote:
> I don't have any particular problems with your implementation, 
> although I'm still trying to wrap my head around how to make this 
> coexist with my I/O hinting series. But I guess there's probably not
> going to be a big overlap between devices that support both 
> features.

Hi Martin,

This patch series should make it easier to implement I/O hint support
since some of the code added by this patch series is also needed to
implement I/O hint support.

> However, it still pains me greatly to see the SBC proposal being 
> intertwined with the travesty that is streams. Why not define 
> everything in the IO advice hints group descriptor? I/O hints already
> use GROUP NUMBER as an index. Why not just define a few permanent
> hint descriptors? What's the point of the additional level of
> indirection to tie this new feature into streams? RSCS basically says
> "ignore the streams-specific bits and bobs and do this other stuff
> instead". What does the streams infrastructure provide that can't be
> solved trivially in the IO advise mode page alone?

Hmm ... isn't that exactly what T10 did, define everything in the IO
advice hints group descriptor by introducing the new ST_ENBLE bit in
that descriptor?

This patch series relies on the constrained streams mechanism. A
constrained stream is permanently open. The new ST_ENBLE bit in the IO
advice hints group descriptor indicates whether or not an IO advice
hints group represents a permanent stream.

The new ST_ENBLE bit in the IO advice hints group descriptor allows SCSI
devices to interpret the index of the descriptor as a data lifetime.
 From the approved T10 proposal:

Table x1 – RELATIVE LIFETIME field
..............................................
Code        Relative lifetime
..............................................
00h         no relative lifetime is applicable
01h         shortest relative lifetime
02h         second shortest relative lifetime
03h to 3Dh  intermediate relative lifetimes
3Eh         second longest relative lifetime
3Fh         longest relative lifetime
..............................................

> For existing UFS devices which predate RSCS and streams but which 
> support getting data temperature from GROUP NUMBER, what is the 
> mechanism for detecting and enabling the feature?

We plan to ask UFS device vendors to modify the UFS device firmware and
to add support for the VPD and mode pages this patch series relies on.
My understanding is that this can be done easily in UFS device firmware.

Although it is technically possible to update the firmware of UFS
devices in smartphones, most smartphones do not support this because
this is considered risky. Hence, only new smartphones will benefit from
this patch series.

I do not want to add support in the Linux kernel for how conventional
UFS devices use the GROUP NUMBER field today. Conventional UFS devices
interpret the GROUP NUMBER field as a "ContextID". The ContextID
mechanism has a state, just like the SCSI stream mechanism. UFS contexts
need to be opened explicitly and are closed upon reset. From the UFS 4.0
specification: "No ContextID shall be open after power cycle."

Please let me know if you need more information.

Bart.
Niklas Cassel Oct. 2, 2023, 11:38 a.m. UTC | #14
On Wed, Sep 27, 2023 at 03:14:10PM -0400, Martin K. Petersen wrote:
> 
> Hi Bart!
> 
> > Zoned UFS vendors need the data temperature information. Hence this
> > patch series that restores write hint information in F2FS and in the
> > block layer. The SCSI disk (sd) driver is modified such that it passes
> > write hint information to SCSI devices via the GROUP NUMBER field.
> 
> I don't have any particular problems with your implementation, although
> I'm still trying to wrap my head around how to make this coexist with my
> I/O hinting series. But I guess there's probably not going to be a big
> overlap between devices that support both features.

Hello Bart, Martin,

I don't know which user facing API Martin's I/O hinting series is intending
to use.

However, while discussing this series at ALPSS, we did ask ourselves why this
series is not reusing the already existing block layer API for providing I/O
hints:
https://github.com/torvalds/linux/blob/v6.6-rc4/include/uapi/linux/ioprio.h#L83-L103

We can have 1023 possible I/O hints, and so far we are only using 7, which
means that there are 1016 possible hints left.
This also enables you to define more than the 4 previous temperature hints
(extreme, long, medium, short), if so desired.

There is also support in fio for these I/O hints:
https://github.com/axboe/fio/blob/master/HOWTO.rst?plain=1#L2294-L2302

When this new I/O hint API has added, there was no other I/O hint API
in the kernel (since the old fcntl() F_GET_FILE_RW_HINT / F_SET_FILE_RW_HINT
API had already been removed when this new API was added).

So there should probably be a good argument why we would want to introduce
yet another API for providing I/O hints, instead of extending the I/O hint
API that we already have in the kernel right now.
(Especially since it seems fairly easy to modify your patches to reuse the
existing API.)


Kind regards,
Niklas
Niklas Cassel Oct. 2, 2023, 11:53 a.m. UTC | #15
On Mon, Oct 02, 2023 at 01:37:59PM +0200, Niklas Cassel wrote:
> On Wed, Sep 27, 2023 at 03:14:10PM -0400, Martin K. Petersen wrote:
> > 
> > Hi Bart!
> > 
> > > Zoned UFS vendors need the data temperature information. Hence this
> > > patch series that restores write hint information in F2FS and in the
> > > block layer. The SCSI disk (sd) driver is modified such that it passes
> > > write hint information to SCSI devices via the GROUP NUMBER field.
> > 
> > I don't have any particular problems with your implementation, although
> > I'm still trying to wrap my head around how to make this coexist with my
> > I/O hinting series. But I guess there's probably not going to be a big
> > overlap between devices that support both features.
> 
> Hello Bart, Martin,
> 
> I don't know which user facing API Martin's I/O hinting series is intending
> to use.
> 
> However, while discussing this series at ALPSS, we did ask ourselves why this
> series is not reusing the already existing block layer API for providing I/O
> hints:
> https://github.com/torvalds/linux/blob/v6.6-rc4/include/uapi/linux/ioprio.h#L83-L103
> 
> We can have 1023 possible I/O hints, and so far we are only using 7, which
> means that there are 1016 possible hints left.
> This also enables you to define more than the 4 previous temperature hints
> (extreme, long, medium, short), if so desired.
> 
> There is also support in fio for these I/O hints:
> https://github.com/axboe/fio/blob/master/HOWTO.rst?plain=1#L2294-L2302
> 
> When this new I/O hint API has added, there was no other I/O hint API
> in the kernel (since the old fcntl() F_GET_FILE_RW_HINT / F_SET_FILE_RW_HINT
> API had already been removed when this new API was added).
> 
> So there should probably be a good argument why we would want to introduce
> yet another API for providing I/O hints, instead of extending the I/O hint
> API that we already have in the kernel right now.
> (Especially since it seems fairly easy to modify your patches to reuse the
> existing API.)

One argument might be that the current I/O hints API does not allow hints to
be stacked. So one would not e.g. be able to combine a command duration limit
with a temperature hint...


Kind regards,
Niklas
Bart Van Assche Oct. 2, 2023, 4:33 p.m. UTC | #16
On 10/2/23 04:53, Niklas Cassel wrote:
> On Mon, Oct 02, 2023 at 01:37:59PM +0200, Niklas Cassel wrote:
>> I don't know which user facing API Martin's I/O hinting series is intending
>> to use.
>>
>> However, while discussing this series at ALPSS, we did ask ourselves why this
>> series is not reusing the already existing block layer API for providing I/O
>> hints:
>> https://github.com/torvalds/linux/blob/v6.6-rc4/include/uapi/linux/ioprio.h#L83-L103
>>
>> We can have 1023 possible I/O hints, and so far we are only using 7, which
>> means that there are 1016 possible hints left.
>> This also enables you to define more than the 4 previous temperature hints
>> (extreme, long, medium, short), if so desired.
>>
>> There is also support in fio for these I/O hints:
>> https://github.com/axboe/fio/blob/master/HOWTO.rst?plain=1#L2294-L2302
>>
>> When this new I/O hint API has added, there was no other I/O hint API
>> in the kernel (since the old fcntl() F_GET_FILE_RW_HINT / F_SET_FILE_RW_HINT
>> API had already been removed when this new API was added).
>>
>> So there should probably be a good argument why we would want to introduce
>> yet another API for providing I/O hints, instead of extending the I/O hint
>> API that we already have in the kernel right now.
>> (Especially since it seems fairly easy to modify your patches to reuse the
>> existing API.)
> 
> One argument might be that the current I/O hints API does not allow hints to
> be stacked. So one would not e.g. be able to combine a command duration limit
> with a temperature hint...

Hi Niklas,

Is your feedback about the user space API only or also about the
mechanism that is used internally in the kernel?

Restoring the ability to pass data temperature information from a
filesystem to a block device is much more important to me than
restoring the ability to pass data temperature information from user
space to a filesystem. Would it be sufficient to address your concern
if patch 2/13 would be dropped from this series?

Thanks,

Bart.
Bart Van Assche Oct. 2, 2023, 5:20 p.m. UTC | #17
On 10/2/23 04:38, Niklas Cassel wrote:
> So there should probably be a good argument why we would want to 
> introduce yet another API for providing I/O hints, instead of 
> extending the I/O hint API that we already have in the kernel right 
> now. (Especially since it seems fairly easy to modify your patches
> to reuse the existing API.)

Here is a strong argument: there is user space software that is using
the F_SET_FILE_RW_HINT API, e.g. Samba. I don't think that the above
arguments are strong enough to tell all developers of user space
software to switch from F_SET_FILE_RW_HINT to another API. This would
force user space developers to check the kernel version before they
can decide which user space API to use. If the new user space API would
get backported to distro kernels then that would cause a real nightmare
for user space developers who want to use F_SET_FILE_RW_HINT or its
equivalent.

Thanks,

Bart.
Niklas Cassel Oct. 2, 2023, 7:19 p.m. UTC | #18
On Mon, Oct 02, 2023 at 09:33:22AM -0700, Bart Van Assche wrote:
> On 10/2/23 04:53, Niklas Cassel wrote:
> > On Mon, Oct 02, 2023 at 01:37:59PM +0200, Niklas Cassel wrote:
> > > I don't know which user facing API Martin's I/O hinting series is intending
> > > to use.
> > > 
> > > However, while discussing this series at ALPSS, we did ask ourselves why this
> > > series is not reusing the already existing block layer API for providing I/O
> > > hints:
> > > https://github.com/torvalds/linux/blob/v6.6-rc4/include/uapi/linux/ioprio.h#L83-L103
> > > 
> > > We can have 1023 possible I/O hints, and so far we are only using 7, which
> > > means that there are 1016 possible hints left.
> > > This also enables you to define more than the 4 previous temperature hints
> > > (extreme, long, medium, short), if so desired.
> > > 
> > > There is also support in fio for these I/O hints:
> > > https://github.com/axboe/fio/blob/master/HOWTO.rst?plain=1#L2294-L2302
> > > 
> > > When this new I/O hint API has added, there was no other I/O hint API
> > > in the kernel (since the old fcntl() F_GET_FILE_RW_HINT / F_SET_FILE_RW_HINT
> > > API had already been removed when this new API was added).
> > > 
> > > So there should probably be a good argument why we would want to introduce
> > > yet another API for providing I/O hints, instead of extending the I/O hint
> > > API that we already have in the kernel right now.
> > > (Especially since it seems fairly easy to modify your patches to reuse the
> > > existing API.)
> > 
> > One argument might be that the current I/O hints API does not allow hints to
> > be stacked. So one would not e.g. be able to combine a command duration limit
> > with a temperature hint...
> 
> Hi Niklas,
> 
> Is your feedback about the user space API only or also about the
> mechanism that is used internally in the kernel?

The concern is only related to the user space API.

(However, if you do reuse the existing I/O prio hints, you will avoid
adding a new struct member to a lot of structs.)


> 
> Restoring the ability to pass data temperature information from a
> filesystem to a block device is much more important to me than
> restoring the ability to pass data temperature information from user
> space to a filesystem. Would it be sufficient to address your concern
> if patch 2/13 would be dropped from this series?

Right now 0 means no I/O hint.
Value 1-7 is used for CDL.
This means that bits 0-2 are currently used by CDL.

I guess we could define e.g. bits 3-5 to be used by temperature hints,
i.e. temperature hints could have values 0-7, where 0 would be no
temperature hint. (I guess we could still limit the temperature hints
to 1-4 if we want to keep the previous extreme/long/medium/short constants.)

This way, we can combine a CDL value with a temperature hint.
I.e. if user space has set bits in both bits 0-2 and 3-5, then both CDL
and temperature hints are used.

(And we would still have 4 bits left in 10 bit long I/O hints field that
can be used by some other I/O hint feature in the future.)

We could theoretically do this without changing the existing I/O prio hints
API, as all the existing hints (CDL descriptors 1-7) would keep their existing
values.

While I think this sounds quite nice, since it would avoid what your patches
currently do: adding a new "write_hint" struct member to the following structs:
struct kiocb, struct file, struct request, struct request, struct bio.

Instead it would rely on the existing ioprio struct members in these structs.
Additionally you would not need to add code that avoid merging of requests with
different write hints, as the current code already avoids merging of requests
with different ioprio (which thus extends to ioprio I/O hints).

Anyway, even if I do think that modifying your patch series to use the I/O prio
hints API would be a simpler and cleaner solution, including a smaller diffstat,
I do not care too strongly about this, and will leave the pondering to the very
wise maintainers.


Kind regards,
Niklas
Martin K. Petersen Oct. 3, 2023, 1:40 a.m. UTC | #19
Niklas,

> I don't know which user facing API Martin's I/O hinting series is
> intending to use.

I'm just using ioprio.
Bart Van Assche Oct. 3, 2023, 5:26 p.m. UTC | #20
On 10/2/23 18:40, Martin K. Petersen wrote:
> 
> Niklas,
> 
>> I don't know which user facing API Martin's I/O hinting series is
>> intending to use.
> 
> I'm just using ioprio.

Hi Martin,

Do you plan to use existing bits from the ioprio bitmask or new bits? 
Bits 0-2 are used for the priority level. Bits 3-5 are used for CDL. 
Bits 13-15 are used for the I/O priority. The SCSI and NVMe standard 
define 64 different data lifetimes (six bits). So there are 16 - 3 - 3 - 
6 = 4 remaining bits.

Thanks,

Bart.
Niklas Cassel Oct. 3, 2023, 6:45 p.m. UTC | #21
On Tue, Oct 03, 2023 at 10:26:27AM -0700, Bart Van Assche wrote:
> On 10/2/23 18:40, Martin K. Petersen wrote:
> > 
> > Niklas,
> > 
> > > I don't know which user facing API Martin's I/O hinting series is
> > > intending to use.
> > 
> > I'm just using ioprio.
> 
> Hi Martin,
> 
> Do you plan to use existing bits from the ioprio bitmask or new bits? Bits
> 0-2 are used for the priority level. Bits 3-5 are used for CDL. Bits 13-15
> are used for the I/O priority. The SCSI and NVMe standard define 64
> different data lifetimes (six bits). So there are 16 - 3 - 3 - 6 = 4
> remaining bits.

Hello Bart,

I think the math is:

16 - 3 (prio level) - 3 (CDL) - 3 (prio class) = 7

so if we want 64 different values for data lifetimes
(we previously only had 4 different values), that is 6 bits:

16 - 3 (prio level) - 3 (CDL) - 3 (prio class) - 6 (lifetime) = 1

so only one bit left for Martin :)

Not very much room to play with...


Kind regards,
Niklas
Martin K. Petersen Oct. 4, 2023, 3:17 a.m. UTC | #22
Bart,

> Do you plan to use existing bits from the ioprio bitmask or new bits?
> Bits 0-2 are used for the priority level. Bits 3-5 are used for CDL.
> Bits 13-15 are used for the I/O priority. The SCSI and NVMe standard
> define 64 different data lifetimes (six bits). So there are 16 - 3 - 3
> - 6 = 4 remaining bits.

I just use the existing I/O priority classes and levels to set a
high/normal/low relative priority.

I would still like pursue I/O classification since that performed better
in our testing. But that does involve working with vendors on a Linux
profile as discussed at LSF/MM. Don't really more than a handful in
either case.