[v7,0/3] FDP and per-io hints

Message ID	20240930181305.17286-1-joshi.k@samsung.com (mailing list archive)
Headers	show Received: from mailout1.samsung.com (mailout1.samsung.com [203.254.224.24]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 490AF1991D8 for <linux-fsdevel@vger.kernel.org>; Mon, 30 Sep 2024 18:20:58 +0000 (UTC) From: Kanchan Joshi <joshi.k@samsung.com> To: axboe@kernel.dk, kbusch@kernel.org, hch@lst.de, hare@suse.de, sagi@grimberg.me, martin.petersen@oracle.com, brauner@kernel.org, viro@zeniv.linux.org.uk, jack@suse.cz, jaegeuk@kernel.org, bcrl@kvack.org, dhowells@redhat.com, bvanassche@acm.org, asml.silence@gmail.com Cc: linux-nvme@lists.infradead.org, linux-fsdevel@vger.kernel.org, io-uring@vger.kernel.org, linux-block@vger.kernel.org, linux-aio@kvack.org, gost.dev@samsung.com, vishak.g@samsung.com, javier.gonz@samsung.com, Kanchan Joshi <joshi.k@samsung.com> Subject: [PATCH v7 0/3] FDP and per-io hints Date: Mon, 30 Sep 2024 23:43:02 +0530 Message-Id: <20240930181305.17286-1-joshi.k@samsung.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain; charset="utf-8" CMS-TYPE: 105P DLP-Filter: Pass References: <CGME20240930182052epcas5p37edefa7556b87c3fbb543275756ac736@epcas5p3.samsung.com>
Series	FDP and per-io hints \| expand [v7,0/3] FDP and per-io hints [v7,1/3] nvme: enable FDP support [v7,2/3] block, fs: restore kiocb based write hint processing [v7,3/3] io_uring: enable per-io hinting capability

Kanchan Joshi Sept. 30, 2024, 6:13 p.m. UTC

Another spin to incorporate the feedback from LPC and previous
iterations. The series adds two capabilities:
- FDP support at NVMe level (patch #1)
- Per-io hinting via io_uring (patch #3)
Patch #2 is needed to do per-io hints.

The motivation and interface details are present in the commit
descriptions.

Testing:
Done with fcntl and liburing based custom applications.
On raw block device, ext4, xfs, btrfs and F2FS.
Checked that no regression occurs for application that use per-inode
hints.
Checked that per-io hints, when passed, take the precedence over per-inode
hints.

Changes since v6:
- Change io_uring interface to pass hints as SQE metadata (Pavel, Hannes)

Changes since v5:
- Drop placement hints
- Add per-io hint interface

Changes since v4:
- Retain the size/type checking on the enum (Bart)
- Use the name "*_lifetime_hint" rather than "*_life_hint" (Bart)

Changes since v3:
- 4 new patches to introduce placement hints
- Make nvme patch use the placement hints rather than lifetime hints

Changes since v2:
- Base it on nvme-6.11 and resolve a merge conflict

Changes since v1:
- Reduce the fetched plids from 128 to 6 (Keith)
- Use struct_size for a calculation (Keith)
- Handle robot/sparse warning

Kanchan Joshi (3):
  nvme: enable FDP support
  block, fs: restore kiocb based write hint processing
  io_uring: enable per-io hinting capability

Kanchan Joshi (3):
  nvme: enable FDP support
  block, fs: restore kiocb based write hint processing
  io_uring: enable per-io hinting capability

 block/fops.c                  |  6 +--
 drivers/nvme/host/core.c      | 70 +++++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h      |  4 ++
 fs/aio.c                      |  1 +
 fs/cachefiles/io.c            |  1 +
 fs/direct-io.c                |  2 +-
 fs/fcntl.c                    | 22 -----------
 fs/iomap/direct-io.c          |  2 +-
 include/linux/fs.h            |  8 ++++
 include/linux/nvme.h          | 19 ++++++++++
 include/linux/rw_hint.h       | 24 ++++++++++++
 include/uapi/linux/io_uring.h | 19 ++++++++++
 io_uring/rw.c                 | 24 ++++++++++++
 13 files changed, 175 insertions(+), 27 deletions(-)

Christoph Hellwig Oct. 1, 2024, 9:20 a.m. UTC | #1

Any reason you completely ignored my feedback on the last version
and did not even answer?

That's not a very productive way to work.

James R. Bergsten Oct. 1, 2024, 3:58 p.m. UTC | #2

Now THIS Is the NVMe I came to know and love.

Jens Axboe Oct. 1, 2024, 4:18 p.m. UTC | #3

On 10/1/24 3:20 AM, Christoph Hellwig wrote:
> Any reason you completely ignored my feedback on the last version
> and did not even answer?

Probably because he got a little tired of dealing with the bullshit
related to this topic.

> That's not a very productive way to work.

Have to say, that neither have your responses been. Can't really fault
people for just giving up at some point, when no productive end seems to
be in sight.

As far as I'm concerned, this looks fine to me. There are customers
wanting to use us, and folks making drives that support it. It's not
right to continually gatekeep this feature just because you don't like
it.

I need to review the io_uring bits, but that should be the least of it.

Keith Busch Oct. 1, 2024, 4:23 p.m. UTC | #4

On Tue, Oct 01, 2024 at 11:20:48AM +0200, Christoph Hellwig wrote:
> Any reason you completely ignored my feedback on the last version
> and did not even answer?
> 
> That's not a very productive way to work.

I think because he's getting conflicting feedback. The arguments against
it being that it's not a good match, however it's the same match created
for streams, and no one complained then, and it's an existing user ABI.
FDP is a the new "streams", so I really don't see a big deal here. I
understand you have use cases that have filesystems intercept the hints
for different types of devices, but I don't see these use cases as
mutally exclusive. Supporting this one doesn't prevent other uses.

Christoph Hellwig Oct. 2, 2024, 7:49 a.m. UTC | #5

On Tue, Oct 01, 2024 at 10:23:43AM -0600, Keith Busch wrote:
> I think because he's getting conflicting feedback. The arguments against
> it being that it's not a good match, however it's the same match created
> for streams, and no one complained then, and it's an existing user ABI.

People complained, and the streams code got removed pretty quickly
because it did not work as expected.  I don't think that counts as
a big success.

Christoph Hellwig Oct. 2, 2024, 7:51 a.m. UTC | #6

On Tue, Oct 01, 2024 at 10:18:41AM -0600, Jens Axboe wrote:
> Have to say, that neither have your responses been. Can't really fault
> people for just giving up at some point, when no productive end seems to
> be in sight.

I heavily disagree and take offence on that.

The previous stream separation approach made total sense, but just
needed a fair amount of work.  But it closely matches how things work
at the hardware and file system level, so it was the right approach.

Suddenly dropping that and ignoring all comments really feels like
someone urgently needs to pull a marketing stunt here.

Keith Busch Oct. 2, 2024, 2:56 p.m. UTC | #7

On Wed, Oct 02, 2024 at 09:49:26AM +0200, Christoph Hellwig wrote:
> On Tue, Oct 01, 2024 at 10:23:43AM -0600, Keith Busch wrote:
> > I think because he's getting conflicting feedback. The arguments against
> > it being that it's not a good match, however it's the same match created
> > for streams, and no one complained then, and it's an existing user ABI.
> 
> People complained, and the streams code got removed pretty quickly
> because it did not work as expected.  I don't think that counts as
> a big success.

I don't think the kernel API was the problem. Capable devices never
materialized, so the code wasn't doing anything useful.

Jens Axboe Oct. 2, 2024, 3 p.m. UTC | #8

On 10/2/24 8:56 AM, Keith Busch wrote:
> On Wed, Oct 02, 2024 at 09:49:26AM +0200, Christoph Hellwig wrote:
>> On Tue, Oct 01, 2024 at 10:23:43AM -0600, Keith Busch wrote:
>>> I think because he's getting conflicting feedback. The arguments against
>>> it being that it's not a good match, however it's the same match created
>>> for streams, and no one complained then, and it's an existing user ABI.
>>
>> People complained, and the streams code got removed pretty quickly
>> because it did not work as expected.  I don't think that counts as
>> a big success.
> 
> I don't think the kernel API was the problem. Capable devices never
> materialized, so the code wasn't doing anything useful.

Exactly. I never saw ones that kept the stream persistent across GC,
and hence they ended up being pretty useless in practice.

Jens Axboe Oct. 2, 2024, 3:03 p.m. UTC | #9

On 10/2/24 1:51 AM, Christoph Hellwig wrote:
> On Tue, Oct 01, 2024 at 10:18:41AM -0600, Jens Axboe wrote:
>> Have to say, that neither have your responses been. Can't really fault
>> people for just giving up at some point, when no productive end seems to
>> be in sight.
> 
> I heavily disagree and take offence on that.
> 
> The previous stream separation approach made total sense, but just
> needed a fair amount of work.  But it closely matches how things work
> at the hardware and file system level, so it was the right approach.

What am I missing that makes this effort that different from streams?
Both are a lifetime hint.

> Suddenly dropping that and ignoring all comments really feels like
> someone urgently needs to pull a marketing stunt here.

I think someone is just trying to finally get this feature in, so it
can get used by customers, after many months of fairly unproductive
discussion.

Christoph Hellwig Oct. 2, 2024, 3:13 p.m. UTC | #10

On Wed, Oct 02, 2024 at 09:03:22AM -0600, Jens Axboe wrote:
> > The previous stream separation approach made total sense, but just
> > needed a fair amount of work.  But it closely matches how things work
> > at the hardware and file system level, so it was the right approach.
> 
> What am I missing that makes this effort that different from streams?
> Both are a lifetime hint.

A stream has a lot less strings attached.  But hey, don't make me
argue for streams - you pushed the API in despite reservations back
then, and we've learned a lot since.

> > Suddenly dropping that and ignoring all comments really feels like
> > someone urgently needs to pull a marketing stunt here.
> 
> I think someone is just trying to finally get this feature in, so it
> can get used by customers, after many months of fairly unproductive
> discussion.

Well, he finally started on the right approach and gave it up after the
first round of feedback.  That's just a bit weird.

Keith Busch Oct. 2, 2024, 3:17 p.m. UTC | #11

On Wed, Oct 02, 2024 at 05:13:44PM +0200, Christoph Hellwig wrote:
> 
> Well, he finally started on the right approach and gave it up after the
> first round of feedback.  That's just a bit weird.

Nothing prevents future improvements in that direction. It just seems
out of scope for what Kanchan is trying to enable for his customer use
cases. This patch looks harmless.

Christoph Hellwig Oct. 2, 2024, 3:19 p.m. UTC | #12

On Wed, Oct 02, 2024 at 09:17:35AM -0600, Keith Busch wrote:
> On Wed, Oct 02, 2024 at 05:13:44PM +0200, Christoph Hellwig wrote:
> > 
> > Well, he finally started on the right approach and gave it up after the
> > first round of feedback.  That's just a bit weird.
> 
> Nothing prevents future improvements in that direction. It just seems
> out of scope for what Kanchan is trying to enable for his customer use
> cases. This patch looks harmless.

It's not really.  Once we wire it up like this we mess up the ability
to use the feature in other ways.  Additionally the per-I/O hints are
simply broken if you want a file system in the way as last time,
and if I don't misremember you acknowledged in person last week.

Jens Axboe Oct. 2, 2024, 3:22 p.m. UTC | #13

On 10/2/24 9:13 AM, Christoph Hellwig wrote:
> On Wed, Oct 02, 2024 at 09:03:22AM -0600, Jens Axboe wrote:
>>> The previous stream separation approach made total sense, but just
>>> needed a fair amount of work.  But it closely matches how things work
>>> at the hardware and file system level, so it was the right approach.
>>
>> What am I missing that makes this effort that different from streams?
>> Both are a lifetime hint.
> 
> A stream has a lot less strings attached.  But hey, don't make me
> argue for streams - you pushed the API in despite reservations back
> then, and we've learned a lot since.

Hah, I remember arguing for it back then! It was just a pain to use, but
thankfully we could kill it when devices didn't materialize with a
quality of implementation that made them useful. And then we just yanked
it. Nothing is forever, and particularly hints is something that can
always be ignored. Same for this.

>>> Suddenly dropping that and ignoring all comments really feels like
>>> someone urgently needs to pull a marketing stunt here.
>>
>> I think someone is just trying to finally get this feature in, so it
>> can get used by customers, after many months of fairly unproductive
>> discussion.
> 
> Well, he finally started on the right approach and gave it up after the
> first round of feedback.  That's just a bit weird.

I'm a big fan of keep-it-simple-stupid, and then you can improve it down
the line. I think the simple approach stands just fine alone.

Keith Busch Oct. 2, 2024, 3:33 p.m. UTC | #14

On Wed, Oct 02, 2024 at 05:19:49PM +0200, Christoph Hellwig wrote:
> > Nothing prevents future improvements in that direction. It just seems
> > out of scope for what Kanchan is trying to enable for his customer use
> > cases. This patch looks harmless.
> 
> It's not really.  Once we wire it up like this we mess up the ability
> to use the feature in other ways.  Additionally the per-I/O hints are
> simply broken if you want a file system in the way as last time,
> and if I don't misremember you acknowledged in person last week.

You remember right. I also explained the use cases for per-io hints are
to replace current passthrough users with generic read/write paths that
have all the memory guard rails, read/write stats, and other features
that you don't get with passthrough. This just lets you accomplish the
same placement hints with the nvme raw block device that you get with
the nvme char device.

Martin K. Petersen Oct. 2, 2024, 3:47 p.m. UTC | #15

Christoph,

>> Nothing prevents future improvements in that direction. It just seems
>> out of scope for what Kanchan is trying to enable for his customer use
>> cases. This patch looks harmless.
>
> It's not really.  Once we wire it up like this we mess up the ability
> to use the feature in other ways.  Additionally the per-I/O hints are
> simply broken if you want a file system

Here is my take:

It is the kernel's job to manage the system's hardware resources and
arbitrate and share these resources optimally and fairly between all
running applications.

What irks me is defining application interfaces which fundamentally tell
the kernel that "these blocks are part of the same file".

The kernel already knows this. It is the very entity which provides that
abstraction. Why do we need an explicit interface to inform the kernel
that concurrent writes to the same file should have the same
"temperature" or need to go to the same "bin" on the storage device?
Shouldn't that just happen automatically?

Whether it's SCSI groups, streams, UFS hints, or NVMe FDP, it seems like
we are consistently failing to deliver something that actually works for
anything but a few specialized corner cases. I think that is a shame.

Bart Van Assche Oct. 2, 2024, 6:34 p.m. UTC | #16

On 10/2/24 8:47 AM, Martin K. Petersen wrote:
> What irks me is defining application interfaces which fundamentally tell
> the kernel that "these blocks are part of the same file".

Isn't FDP about communicating much more than only this information to
the block device, e.g. information about reclaim units? Although I'm
personally not interested in FDP, my colleagues were involved in the
standardization of FDP.

Thanks,

Bart.

Bart Van Assche Oct. 3, 2024, 12:20 a.m. UTC | #17

On 9/30/24 11:13 AM, Kanchan Joshi wrote:
> Another spin to incorporate the feedback from LPC and previous
> iterations. The series adds two capabilities:
> - FDP support at NVMe level (patch #1)
> - Per-io hinting via io_uring (patch #3)
> Patch #2 is needed to do per-io hints.
> 
> The motivation and interface details are present in the commit
> descriptions.

Something is missing from the patch descriptions. What is the goal
of this patch series? Is the goal to support FDP in filesystems in
the kernel, is the goal to support filesystems implemented in user
space or perhaps something else? I think this should be mentioned in
the cover letter.

Thanks,

Bart.

Christoph Hellwig Oct. 3, 2024, 12:51 p.m. UTC | #18

On Wed, Oct 02, 2024 at 09:33:16AM -0600, Keith Busch wrote:
> You remember right. I also explained the use cases for per-io hints are
> to replace current passthrough users with generic read/write paths that
> have all the memory guard rails, read/write stats, and other features
> that you don't get with passthrough. This just lets you accomplish the
> same placement hints with the nvme raw block device that you get with
> the nvme char device.

Yes, and as I said before we need an opt-in from the file_ops, because
file systems can't support this.  And I'm speaking that from a position
of working on a file system that made the existing non-idea hints
work (to all fairness Hans did the hard work, but I was part of it).

Christoph Hellwig Oct. 3, 2024, 12:54 p.m. UTC | #19

On Wed, Oct 02, 2024 at 11:47:32AM -0400, Martin K. Petersen wrote:
> It is the kernel's job to manage the system's hardware resources and
> arbitrate and share these resources optimally and fairly between all
> running applications.

Exactly.

> What irks me is defining application interfaces which fundamentally tell
> the kernel that "these blocks are part of the same file".
> 
> The kernel already knows this. It is the very entity which provides that
> abstraction. Why do we need an explicit interface to inform the kernel
> that concurrent writes to the same file should have the same
> "temperature" or need to go to the same "bin" on the storage device?
> Shouldn't that just happen automatically?

For file: yes.  The problem is when you have more files than buckets on
the device or file systems.  Typical enterprise SSDs support somewhere
between 8 and 16 write streams, and there typically is more data than
that.  So trying to group it somehow is good idea as not all files can
have their own bucket.

Allowing this inside a file like done in this patch set on the other
hand is pretty crazy.

> Whether it's SCSI groups, streams, UFS hints, or NVMe FDP, it seems like
> we are consistently failing to deliver something that actually works for
> anything but a few specialized corner cases. I think that is a shame.

Yes, it is.  And as someone who has been sitting in this group that is
because it's always someone in a place of power forcing down their version
because they got a promotіon, best of show award or whatever.

Christoph Hellwig Oct. 3, 2024, 12:55 p.m. UTC | #20

On Wed, Oct 02, 2024 at 11:34:47AM -0700, Bart Van Assche wrote:
> Isn't FDP about communicating much more than only this information to
> the block device, e.g. information about reclaim units? Although I'm
> personally not interested in FDP, my colleagues were involved in the
> standardization of FDP.

Yes, it is.  And when I explained how to properly export this kind of
information can be implemented on top of the version Kanchan sent everyone
suddenly stopped diskussion technical points and went either silent or
all political.

So I think some peoples bonuses depend on not understanding the problem
I fear :(

Keith Busch Oct. 3, 2024, 9:48 p.m. UTC | #21

On Thu, Oct 03, 2024 at 02:55:16PM +0200, Christoph Hellwig wrote:
> On Wed, Oct 02, 2024 at 11:34:47AM -0700, Bart Van Assche wrote:
> > Isn't FDP about communicating much more than only this information to
> > the block device, e.g. information about reclaim units? Although I'm
> > personally not interested in FDP, my colleagues were involved in the
> > standardization of FDP.
> 
> Yes, it is.  And when I explained how to properly export this kind of
> information can be implemented on top of the version Kanchan sent everyone
> suddenly stopped diskussion technical points and went either silent or
> all political.

The nominals can mean whatever you want. If you want it to mean
"temperature", then that's what it means. If you want it to mean
something else, then don't use this.

These are hints at the end of the day. Nothing locks the kernel into
this if a better solution develops. As you know, it was ripped out
before.

> So I think some peoples bonuses depend on not understanding the problem
> I fear :(

The only "bonus" I have is not repeatedly explaining why people can't
use h/w features the way they want.

Bart Van Assche Oct. 3, 2024, 10 p.m. UTC | #22

On 10/3/24 2:48 PM, Keith Busch wrote:
> The only "bonus" I have is not repeatedly explaining why people can't
> use h/w features the way they want.

Hi Keith,

Although that's a fair argument, what are the use cases for this patch
series? Filesystems in the kernel? Filesystems implemented in user
space? Perhaps something else?

This patch series adds new a new user space interface for passing hints
to storage devices (in io_uring). As we all know such interfaces are
hard to remove once these have been added.

We don't need new user space interfaces to support FDP for filesystems
in the kernel.

For filesystems implemented in user space, would using NVMe pass-through
be a viable approach? With this approach, no new user space interfaces
have to be added.

I'm wondering how to unblock FDP users without adding a new
controversial mechanism in the kernel.

Thanks,

Bart.

Jens Axboe Oct. 3, 2024, 10:12 p.m. UTC | #23

On 10/3/24 4:00 PM, Bart Van Assche wrote:
> On 10/3/24 2:48 PM, Keith Busch wrote:
>> The only "bonus" I have is not repeatedly explaining why people can't
>> use h/w features the way they want.
> 
> Hi Keith,
> 
> Although that's a fair argument, what are the use cases for this patch
> series? Filesystems in the kernel? Filesystems implemented in user
> space? Perhaps something else?
> 
> This patch series adds new a new user space interface for passing hints
> to storage devices (in io_uring). As we all know such interfaces are
> hard to remove once these have been added.

It's a _hint_, I'm not sure how many times that has to be stated. The
kernel is free to ignore it, and in the future, it may very well do
that. We already had fcntl interfaces for streams, in the same vein, and
we yanked those in the sense that they ended up doing _nothing_. Did
things break because of that? Of course not. This is no different.

> We don't need new user space interfaces to support FDP for filesystems
> in the kernel.
> 
> For filesystems implemented in user space, would using NVMe pass-through
> be a viable approach? With this approach, no new user space interfaces
> have to be added.
> 
> I'm wondering how to unblock FDP users without adding a new
> controversial mechanism in the kernel.

pass-through already works, obviously - this is more about adding a
viable real interface for it. If it's a feature that is in devices AND
customers want to use, then pointing them at pass-through is a cop-out.

Jens Axboe Oct. 3, 2024, 10:14 p.m. UTC | #24

On 10/3/24 6:54 AM, Christoph Hellwig wrote:
> For file: yes.  The problem is when you have more files than buckets on
> the device or file systems.  Typical enterprise SSDs support somewhere
> between 8 and 16 write streams, and there typically is more data than
> that.  So trying to group it somehow is good idea as not all files can
> have their own bucket.
> 
> Allowing this inside a file like done in this patch set on the other
> hand is pretty crazy.

I do agree that per-file hints are not ideal. In the spirit of making
some progress, how about we just retain per-io hints initially? We can
certainly make that work over dio. Yes buffered IO won't work initially,
but at least we're getting somewhere.

Keith Busch Oct. 3, 2024, 10:17 p.m. UTC | #25

On Thu, Oct 03, 2024 at 03:00:21PM -0700, Bart Van Assche wrote:
> On 10/3/24 2:48 PM, Keith Busch wrote:
> > The only "bonus" I have is not repeatedly explaining why people can't
> > use h/w features the way they want.
> 
> Hi Keith,
> 
> Although that's a fair argument, what are the use cases for this patch
> series? Filesystems in the kernel? Filesystems implemented in user
> space? Perhaps something else?
> 
> This patch series adds new a new user space interface for passing hints
> to storage devices (in io_uring). As we all know such interfaces are
> hard to remove once these have been added.
> 
> We don't need new user space interfaces to support FDP for filesystems
> in the kernel.
> 
> For filesystems implemented in user space, would using NVMe pass-through
> be a viable approach? With this approach, no new user space interfaces
> have to be added.
> 
> I'm wondering how to unblock FDP users without adding a new
> controversial mechanism in the kernel.

I really like the passthrough interface. This is flexible and enables
experimenting with all sorts of features.

But it exposes exploits and pitfalls. Just check the CVE's against it
(ex: CVE-2023-6238, note: same problem exists without metadata). The
consequences of misuse include memory corruption and private memory
leakage. The barriers for exploits are too low, and the potential for
mistakes are too high. A possible h/w solution requires an IOMMU, but
that's a non-starter for many users and only solves the private memory
vulnerabilities.

Passthrough also skips all the block layer features. I just today added
statistics to it staged for 6.13, but even that requires yet another
config step to enable it.

tl;dr: no one wants to use nvme passthrough in production long term.

Christoph Hellwig Oct. 4, 2024, 5:31 a.m. UTC | #26

On Thu, Oct 03, 2024 at 04:14:57PM -0600, Jens Axboe wrote:
> On 10/3/24 6:54 AM, Christoph Hellwig wrote:
> > For file: yes.  The problem is when you have more files than buckets on
> > the device or file systems.  Typical enterprise SSDs support somewhere
> > between 8 and 16 write streams, and there typically is more data than
> > that.  So trying to group it somehow is good idea as not all files can
> > have their own bucket.
> > 
> > Allowing this inside a file like done in this patch set on the other
> > hand is pretty crazy.
> 
> I do agree that per-file hints are not ideal. In the spirit of making
> some progress, how about we just retain per-io hints initially? We can
> certainly make that work over dio. Yes buffered IO won't work initially,
> but at least we're getting somewhere.

Huh?  Per I/O hints at the syscall level are the problem (see also the
reply from Martin).  Per file make total sense, but we need the file
system in control.

The real problem is further down the stack.  For the SCSI temperature
hints just passing them on make sense.  But when you map to some kind
of stream separation in the device, no matter if that is streams, FDP,
or various kinds of streams we don't even support in thing like CF
and SDcard, the driver is not the right place to map temperature hint
to streams.  The requires some kind of intelligence.  It could be
dirt simple and just do a best effort mapping of the temperature
hints 1:1 to separate write streams, or do a little mapping if there
is not enough of them which should work fine for a raw block device.

But one we have a file system things get more complicated:

 - the file system will want it's own streams for metadata and GC
 - even with that on beefy enough hardware you can have more streams
   then temperature levels, and the file system can and should
   do intelligen placement (based usually on files)

Or to summarize:  the per-file temperature hints make sense as a user
interface.  Per-I/O hints tend to be really messy at least if a file
system is involved.  Placing the temperatures to separate write streams
in the driver does not scale even to the most trivial write stream
aware file system implementations.

And for anyone who followed the previous discussions of the patches
none of this should been new, each point has been made at least three
times before.

Javier Gonzalez Oct. 4, 2024, 6:18 a.m. UTC | #27

On 04.10.2024 07:31, Christoph Hellwig wrote:
>On Thu, Oct 03, 2024 at 04:14:57PM -0600, Jens Axboe wrote:
>> On 10/3/24 6:54 AM, Christoph Hellwig wrote:
>> > For file: yes.  The problem is when you have more files than buckets on
>> > the device or file systems.  Typical enterprise SSDs support somewhere
>> > between 8 and 16 write streams, and there typically is more data than
>> > that.  So trying to group it somehow is good idea as not all files can
>> > have their own bucket.
>> >
>> > Allowing this inside a file like done in this patch set on the other
>> > hand is pretty crazy.
>>
>> I do agree that per-file hints are not ideal. In the spirit of making
>> some progress, how about we just retain per-io hints initially? We can
>> certainly make that work over dio. Yes buffered IO won't work initially,
>> but at least we're getting somewhere.
>
>Huh?  Per I/O hints at the syscall level are the problem (see also the
>reply from Martin).  Per file make total sense, but we need the file
>system in control.
>
>The real problem is further down the stack.  For the SCSI temperature
>hints just passing them on make sense.  But when you map to some kind
>of stream separation in the device, no matter if that is streams, FDP,
>or various kinds of streams we don't even support in thing like CF
>and SDcard, the driver is not the right place to map temperature hint
>to streams.  The requires some kind of intelligence.  It could be
>dirt simple and just do a best effort mapping of the temperature
>hints 1:1 to separate write streams, or do a little mapping if there
>is not enough of them which should work fine for a raw block device.
>
>But one we have a file system things get more complicated:
>
> - the file system will want it's own streams for metadata and GC
> - even with that on beefy enough hardware you can have more streams
>   then temperature levels, and the file system can and should
>   do intelligen placement (based usually on files)
>
>Or to summarize:  the per-file temperature hints make sense as a user
>interface.  Per-I/O hints tend to be really messy at least if a file
>system is involved.  Placing the temperatures to separate write streams
>in the driver does not scale even to the most trivial write stream
>>
>And for anyone who followed the previous discussions of the patches
>none of this should been new, each point has been made at least three
>times before.

Looking at the work you and Hans have been doing on XFS, it seems you
have been successful at mapping the semantics of the temperature to
zones (which has no semantics, just as FDP).

    What is the difference between the mapping in zones and for FDP?

The whole point is using an existing interface to cover the use-case of
people wanting hints in block. If you have a use-case for enabling hints
on file, let's work on this aftterwards.

Javier Gonzalez Oct. 4, 2024, 6:21 a.m. UTC | #28

On 03.10.2024 14:55, Christoph Hellwig wrote:
>On Wed, Oct 02, 2024 at 11:34:47AM -0700, Bart Van Assche wrote:
>> Isn't FDP about communicating much more than only this information to
>> the block device, e.g. information about reclaim units? Although I'm
>> personally not interested in FDP, my colleagues were involved in the
>> standardization of FDP.
>
>Yes, it is.  And when I explained how to properly export this kind of
>information can be implemented on top of the version Kanchan sent everyone
>suddenly stopped diskussion technical points and went either silent or
>all political.
>
>So I think some peoples bonuses depend on not understanding the problem
>I fear :(
>

Please, don't.

Childish comments like this delegitimize the work that a lot of people
are doing in Linux.

We all operate under the assumption that folks here know how to wear two
hants, and that there is no such thing as so specific incentives
neither to push _nor_ block upstream contributions without a use-case
and technical background.

Christoph Hellwig Oct. 4, 2024, 6:24 a.m. UTC | #29

On Fri, Oct 04, 2024 at 08:21:29AM +0200, Javier González wrote:
>> So I think some peoples bonuses depend on not understanding the problem
>> I fear :(
>>
>
> Please, don't.
>
> Childish comments like this delegitimize the work that a lot of people
> are doing in Linux.

It's the only very sarcastic explanation I can come up with to explain
this discussion, where people from the exactly two companies where this
might be bonus material love political discussion and drop dead when it
turns technical.

I'm probably wrong, but I've run out of other explanations.

Christoph Hellwig Oct. 4, 2024, 6:27 a.m. UTC | #30

On Fri, Oct 04, 2024 at 08:18:11AM +0200, Javier González wrote:
>> And for anyone who followed the previous discussions of the patches
>> none of this should been new, each point has been made at least three
>> times before.
>
> Looking at the work you and Hans have been doing on XFS, it seems you
> have been successful at mapping the semantics of the temperature to
> zones (which has no semantics, just as FDP).
>
>    What is the difference between the mapping in zones and for FDP?

Probably not much, except for all the pitfalls in the FDP not quite hint
not madatory design.

> The whole point is using an existing interface to cover the use-case of
> people wanting hints in block.

And that's fine.  And that point all the way back for month is that
doing a complete dumb mapping in the driver for that is fundamentally
wrong.  Kanchan's previous series did about 50% of the work to get
it rid, but then everyone dropped dead and played politics.

Javier Gonzalez Oct. 4, 2024, 6:52 a.m. UTC | #31

On 04.10.2024 08:27, Christoph Hellwig wrote:
>On Fri, Oct 04, 2024 at 08:18:11AM +0200, Javier González wrote:
>>> And for anyone who followed the previous discussions of the patches
>>> none of this should been new, each point has been made at least three
>>> times before.
>>
>> Looking at the work you and Hans have been doing on XFS, it seems you
>> have been successful at mapping the semantics of the temperature to
>> zones (which has no semantics, just as FDP).
>>
>>    What is the difference between the mapping in zones and for FDP?
>
>Probably not much, except for all the pitfalls in the FDP not quite hint
>not madatory design.

It is too late to change the first version of FDP productas, and the
solution of "go back to NVMe and make it match SCSI" is not realistic.

We have drives from several companies out there supporting this version
of FDP. And we have customers that are actively using them. It is our
collective responsibility to support them.

So, considerign that file system _are_ able to use temperature hints and
actually make them work, why don't we support FDP the same way we are
supporting zones so that people can use it in production?

I agree that down the road, an interface that allows hints (many more
than 5!) is needed. And in my opinion, this interface should not have
semintics attached to it, just a hint ID, #hints, and enough space to
put 100s of them to support storage node deployments. But this needs to
come from the users of the hints / zones / streams / etc,  not from
us vendors. We do not have neither details on how they deploy these
features at scale, nor the workloads to validate the results. Anything
else will probably just continue polluting the storage stack with more
interfaces that are not used and add to the problem of data placement
fragmentation.

>
>> The whole point is using an existing interface to cover the use-case of
>> people wanting hints in block.
>
>And that's fine.  And that point all the way back for month is that
>doing a complete dumb mapping in the driver for that is fundamentally
>wrong.  Kanchan's previous series did about 50% of the work to get
>it rid, but then everyone dropped dead and played politics.

The issue is that the first series of this patch, which is as simple as
it gets, hit the list in May. Since then we are down paths that lead
nowhere. So the line between real technical feedback that leads to
a feature being merged, and technical misleading to make people be a
busy bee becomes very thin. In the whole data placement effort, we have
been down this path many times, unfortunately...

At LPC, we discussed about the work you did in XFS and it became clear
that coming back to the first version of the patches was the easiest way
to support the use-case that current customers have. It is a pity you
did not attend the conference and could make a point against this line
of thought there.

So summarizing, all folks in this thread are asking is for a path to
move forward for solving the block use-case, which is the only one that
I am aware of requiring kernel support. This has been validated with
real workloads, so it is very specific and tangible.

Javier Gonzalez Oct. 4, 2024, 6:59 a.m. UTC | #32

On 04.10.2024 08:24, Christoph Hellwig wrote:
>On Fri, Oct 04, 2024 at 08:21:29AM +0200, Javier González wrote:
>>> So I think some peoples bonuses depend on not understanding the problem
>>> I fear :(
>>>
>>
>> Please, don't.
>>
>> Childish comments like this delegitimize the work that a lot of people
>> are doing in Linux.
>
>It's the only very sarcastic explanation I can come up with to explain
>this discussion, where people from the exactly two companies where this
>might be bonus material love political discussion and drop dead when it
>turns technical.

FDP has authors from Meta, Google, Kioxia, Micron, Hynix, Solidigm,
Microship, Marvell, FADU, WDC, and Samsung.

The fact that 2 of these companies are the ones starting to build the
Linux ecosystem should not surprise you, as it is the way things work
normally.

Christoph Hellwig Oct. 4, 2024, 12:30 p.m. UTC | #33

On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote:
> So, considerign that file system _are_ able to use temperature hints and
> actually make them work, why don't we support FDP the same way we are
> supporting zones so that people can use it in production?

Because apparently no one has tried it.  It should be possible in theory,
but for example unless you have power of 2 reclaim unit size size it
won't work very well with XFS where the AGs/RTGs must be power of two
aligned in the LBA space, except by overprovisioning the LBA space vs
the capacity actually used.

> I agree that down the road, an interface that allows hints (many more
> than 5!) is needed. And in my opinion, this interface should not have
> semintics attached to it, just a hint ID, #hints, and enough space to
> put 100s of them to support storage node deployments. But this needs to
> come from the users of the hints / zones / streams / etc,  not from
> us vendors. We do not have neither details on how they deploy these
> features at scale, nor the workloads to validate the results. Anything
> else will probably just continue polluting the storage stack with more
> interfaces that are not used and add to the problem of data placement
> fragmentation.

Please always mentioned what layer you are talking about.  At the syscall
level the temperatur hints are doing quite ok.  A full stream separation
would obviously be a lot better, as would be communicating the zone /
reclaim unit / etc size.

As an interface to a driver that doesn't natively speak temperature
hint on the other hand it doesn't work at all.

> The issue is that the first series of this patch, which is as simple as
> it gets, hit the list in May. Since then we are down paths that lead
> nowhere. So the line between real technical feedback that leads to
> a feature being merged, and technical misleading to make people be a
> busy bee becomes very thin. In the whole data placement effort, we have
> been down this path many times, unfortunately...

Well, the previous round was the first one actually trying to address the
fundamental issue after 4 month.  And then after a first round of feedback
it gets shutdown somehow out of nowhere.  As a maintainer and review that
is the kinda of contributors I have a hard time taking serious.

Christoph Hellwig Oct. 4, 2024, 12:32 p.m. UTC | #34

On Fri, Oct 04, 2024 at 08:59:23AM +0200, Javier González wrote:
> FDP has authors from Meta, Google, Kioxia, Micron, Hynix, Solidigm,
> Microship, Marvell, FADU, WDC, and Samsung.
>
> The fact that 2 of these companies are the ones starting to build the
> Linux ecosystem should not surprise you, as it is the way things work
> normally.

That's not the point.  There is one company that drivers entirely pointless
marketing BS, and that one is pretty central here.  The same company
that said FDP has absolutely no іntent to work on Linux and fought my
initial attempt to make the protocol not totally unusable ony layer system.
And no, that's not Samsung.

Javier Gonzalez Oct. 7, 2024, 10:10 a.m. UTC | #35

On 04.10.2024 14:30, Christoph Hellwig wrote:
>On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote:
>> So, considerign that file system _are_ able to use temperature hints and
>> actually make them work, why don't we support FDP the same way we are
>> supporting zones so that people can use it in production?
>
>Because apparently no one has tried it.  It should be possible in theory,
>but for example unless you have power of 2 reclaim unit size size it
>won't work very well with XFS where the AGs/RTGs must be power of two
>aligned in the LBA space, except by overprovisioning the LBA space vs
>the capacity actually used.

This is good. I think we should have at least a FS POC with data
placement support to be able to drive conclusions on how the interface
and requirements should be. Until we have that, we can support the
use-cases that we know customers are asking for, i.e., block-level hints
through the existing temperature API.

>
>> I agree that down the road, an interface that allows hints (many more
>> than 5!) is needed. And in my opinion, this interface should not have
>> semintics attached to it, just a hint ID, #hints, and enough space to
>> put 100s of them to support storage node deployments. But this needs to
>> come from the users of the hints / zones / streams / etc,  not from
>> us vendors. We do not have neither details on how they deploy these
>> features at scale, nor the workloads to validate the results. Anything
>> else will probably just continue polluting the storage stack with more
>> interfaces that are not used and add to the problem of data placement
>> fragmentation.
>
>Please always mentioned what layer you are talking about.  At the syscall
>level the temperatur hints are doing quite ok.  A full stream separation
>would obviously be a lot better, as would be communicating the zone /
>reclaim unit / etc size.

I mean at the syscall level. But as mentioned above, we need to be very
sure that we have a clear use-case for that. If we continue seeing hints
being use in NVMe or other protocols, and the number increase
significantly, we can deal with it later on.

>
>As an interface to a driver that doesn't natively speak temperature
>hint on the other hand it doesn't work at all.
>
>> The issue is that the first series of this patch, which is as simple as
>> it gets, hit the list in May. Since then we are down paths that lead
>> nowhere. So the line between real technical feedback that leads to
>> a feature being merged, and technical misleading to make people be a
>> busy bee becomes very thin. In the whole data placement effort, we have
>> been down this path many times, unfortunately...
>
>Well, the previous round was the first one actually trying to address the
>fundamental issue after 4 month.  And then after a first round of feedback
>it gets shutdown somehow out of nowhere.  As a maintainer and review that
>is the kinda of contributors I have a hard time taking serious.

I am not sure I understand what you mean in the last sentece, so I will
not respond filling blanks with a bad interpretation.

In summary, what we are asking for is to take the patches that cover the
current use-case, and work together on what might be needed for better
FS support. For this, it seems you and Hans have a good idea of what you
want to have based on XFS. We can help review or do part of the work,
but trying to guess our way will only delay existing customers using
existing HW.

Javier Gonzalez Oct. 7, 2024, 11:29 a.m. UTC | #36

On 04.10.2024 14:32, Christoph Hellwig wrote:
>On Fri, Oct 04, 2024 at 08:59:23AM +0200, Javier González wrote:
>> FDP has authors from Meta, Google, Kioxia, Micron, Hynix, Solidigm,
>> Microship, Marvell, FADU, WDC, and Samsung.
>>
>> The fact that 2 of these companies are the ones starting to build the
>> Linux ecosystem should not surprise you, as it is the way things work
>> normally.
>
>That's not the point.  There is one company that drivers entirely pointless
>marketing BS, and that one is pretty central here.  The same company
>that said FDP has absolutely no іntent to work on Linux and fought my
>initial attempt to make the protocol not totally unusable ony layer system.
>And no, that's not Samsung.

So you had an interaction in the working group, your feedback was not
taking into consideration by the authors, and the result is that FDP
cannot be supported in Linux as a consequence of that? Come on...

Hans Holmberg Oct. 8, 2024, 10:06 a.m. UTC | #37

On Mon, Oct 7, 2024 at 12:10 PM Javier González <javier.gonz@samsung.com> wrote:
>
> On 04.10.2024 14:30, Christoph Hellwig wrote:
> >On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote:
> >> So, considerign that file system _are_ able to use temperature hints and
> >> actually make them work, why don't we support FDP the same way we are
> >> supporting zones so that people can use it in production?
> >
> >Because apparently no one has tried it.  It should be possible in theory,
> >but for example unless you have power of 2 reclaim unit size size it
> >won't work very well with XFS where the AGs/RTGs must be power of two
> >aligned in the LBA space, except by overprovisioning the LBA space vs
> >the capacity actually used.
>
> This is good. I think we should have at least a FS POC with data
> placement support to be able to drive conclusions on how the interface
> and requirements should be. Until we have that, we can support the
> use-cases that we know customers are asking for, i.e., block-level hints
> through the existing temperature API.
>
> >
> >> I agree that down the road, an interface that allows hints (many more
> >> than 5!) is needed. And in my opinion, this interface should not have
> >> semintics attached to it, just a hint ID, #hints, and enough space to
> >> put 100s of them to support storage node deployments. But this needs to
> >> come from the users of the hints / zones / streams / etc,  not from
> >> us vendors. We do not have neither details on how they deploy these
> >> features at scale, nor the workloads to validate the results. Anything
> >> else will probably just continue polluting the storage stack with more
> >> interfaces that are not used and add to the problem of data placement
> >> fragmentation.
> >
> >Please always mentioned what layer you are talking about.  At the syscall
> >level the temperatur hints are doing quite ok.  A full stream separation
> >would obviously be a lot better, as would be communicating the zone /
> >reclaim unit / etc size.
>
> I mean at the syscall level. But as mentioned above, we need to be very
> sure that we have a clear use-case for that. If we continue seeing hints
> being use in NVMe or other protocols, and the number increase
> significantly, we can deal with it later on.
>
> >
> >As an interface to a driver that doesn't natively speak temperature
> >hint on the other hand it doesn't work at all.
> >
> >> The issue is that the first series of this patch, which is as simple as
> >> it gets, hit the list in May. Since then we are down paths that lead
> >> nowhere. So the line between real technical feedback that leads to
> >> a feature being merged, and technical misleading to make people be a
> >> busy bee becomes very thin. In the whole data placement effort, we have
> >> been down this path many times, unfortunately...
> >
> >Well, the previous round was the first one actually trying to address the
> >fundamental issue after 4 month.  And then after a first round of feedback
> >it gets shutdown somehow out of nowhere.  As a maintainer and review that
> >is the kinda of contributors I have a hard time taking serious.
>
> I am not sure I understand what you mean in the last sentece, so I will
> not respond filling blanks with a bad interpretation.
>
> In summary, what we are asking for is to take the patches that cover the
> current use-case, and work together on what might be needed for better
> FS support. For this, it seems you and Hans have a good idea of what you
> want to have based on XFS. We can help review or do part of the work,
> but trying to guess our way will only delay existing customers using
> existing HW.

After reading the whole thread, I end up wondering why we need to rush the
support for a single use case through instead of putting the architecture
in place for properly supporting this new type of hardware from the start
throughout the stack.

Even for user space consumers of raw block devices, is the last version
of the patch set good enough?

* It severely cripples the data separation capabilities as only a handful of
  data placement buckets are supported

* It just won't work if there is more than one user application per storage
  device as different applications data streams will be mixed at the nvme
  driver level..

While Christoph has already outlined what would be desirable from a
file system point of view, I don't have the answer to what would be the overall
best design for FDP. I would like to say that it looks to me like we need to
consider more than more than the early adoption use cases and make sure we
make the most of the hardware capabilities via logical abstractions that
would be compatible with a wider range of storage devices.

Figuring the right way forward is tricky, but why not just let it take the time
that is needed to sort this out while early users explore how to use FDP
drives and share the results?

Christoph Hellwig Oct. 8, 2024, 12:25 p.m. UTC | #38

On Mon, Oct 07, 2024 at 12:10:11PM +0200, Javier González wrote:
> In summary, what we are asking for is to take the patches that cover the
> current use-case, and work together on what might be needed for better
> FS support.

And I really do not think it is a good idea.  For one it actually
works against the stated intent of the FDP spec.  Second extending
the hints to per per-I/O in the io_uring patch is actively breaking
the nice per-file I/O hint abstraction we have right now, and is
really unsuitable when actually used on a file and not just a block
device.  And if you are only on a block device I think passthrough
of some form is still the far better option, despite the problems
with it mentioned by Keith.

Christoph Hellwig Oct. 8, 2024, 12:27 p.m. UTC | #39

On Mon, Oct 07, 2024 at 01:29:31PM +0200, Javier González wrote:
>> That's not the point.  There is one company that drivers entirely pointless
>> marketing BS, and that one is pretty central here.  The same company
>> that said FDP has absolutely no іntent to work on Linux and fought my
>> initial attempt to make the protocol not totally unusable ony layer system.
>> And no, that's not Samsung.
>
> So you had an interaction in the working group, your feedback was not
> taking into consideration by the authors, and the result is that FDP
> cannot be supported in Linux as a consequence of that? Come on...

No, what I am saying is that the "small" FDP group that was doing the
development while keeping it doing away from the group insisted that
FDP is only for use in userspace drivers, and even getting the basis
in to properly make it suitable for a semi-multi tenant setup like
Linux io_uring passthrough was not welcome and actually fought tooth
and nail by the particular one particular company.  It a long talk to
the head of the NVMe board to even get this sorted out.

Keith Busch Oct. 8, 2024, 2:44 p.m. UTC | #40

On Tue, Oct 08, 2024 at 02:25:35PM +0200, Christoph Hellwig wrote:
> On Mon, Oct 07, 2024 at 12:10:11PM +0200, Javier González wrote:
> > In summary, what we are asking for is to take the patches that cover the
> > current use-case, and work together on what might be needed for better
> > FS support.
> 
> And I really do not think it is a good idea.  For one it actually
> works against the stated intent of the FDP spec.  Second extending
> the hints to per per-I/O in the io_uring patch is actively breaking
> the nice per-file I/O hint abstraction we have right now, and is
> really unsuitable when actually used on a file and not just a block
> device.  And if you are only on a block device I think passthrough
> of some form is still the far better option, despite the problems
> with it mentioned by Keith.

Then let's just continue with patches 1 and 2. They introduce no new
user or kernel APIs, and people have already reported improvements using
it. Further, it is just a hint, it doesn't lock the kernel into anything
that may hinder future inovations and enhancements. So let's unblock
users and refocus *our* time to something more productive, please?

And to be honest, the per-io hints for generic read/write use is only
valuable for my users if metadata is also exposed to userspace. I know
Javier's team is working on that in parallel, so per-io hints are a
lower priority for me until that part settles.

Christoph Hellwig Oct. 9, 2024, 9:28 a.m. UTC | #41

On Tue, Oct 08, 2024 at 08:44:28AM -0600, Keith Busch wrote:
> Then let's just continue with patches 1 and 2. They introduce no new
> user or kernel APIs, and people have already reported improvements using
> it.

They are still not any way actually exposing the FDP functionality
in the standard though.  How is your application going to align
anything to the reclaim unit?  Or is this another of the cases where
as a hyperscaler you just "know" from the data sheet?

But also given that the submitter completely disappeared and refuses
to even discuss his patches I thing they are simply abandonware at
this point anyway.

Javier Gonzalez Oct. 9, 2024, 2:36 p.m. UTC | #42

> -----Original Message-----
> From: Hans Holmberg <hans@owltronix.com>
> Sent: Tuesday, October 8, 2024 12:07 PM
> To: Javier Gonzalez <javier.gonz@samsung.com>
> Cc: Christoph Hellwig <hch@lst.de>; Jens Axboe <axboe@kernel.dk>; Martin K.
> Petersen <martin.petersen@oracle.com>; Keith Busch <kbusch@kernel.org>;
> Kanchan Joshi <joshi.k@samsung.com>; hare@suse.de; sagi@grimberg.me;
> brauner@kernel.org; viro@zeniv.linux.org.uk; jack@suse.cz; jaegeuk@kernel.org;
> bcrl@kvack.org; dhowells@redhat.com; bvanassche@acm.org;
> asml.silence@gmail.com; linux-nvme@lists.infradead.org; linux-
> fsdevel@vger.kernel.org; io-uring@vger.kernel.org; linux-block@vger.kernel.org;
> linux-aio@kvack.org; gost.dev@samsung.com; vishak.g@samsung.com
> Subject: Re: [PATCH v7 0/3] FDP and per-io hints
> 
> On Mon, Oct 7, 2024 at 12:10 PM Javier González <javier.gonz@samsung.com>
> wrote:
> >
> > On 04.10.2024 14:30, Christoph Hellwig wrote:
> > >On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote:
> > >> So, considerign that file system _are_ able to use temperature hints and
> > >> actually make them work, why don't we support FDP the same way we are
> > >> supporting zones so that people can use it in production?
> > >
> > >Because apparently no one has tried it.  It should be possible in theory,
> > >but for example unless you have power of 2 reclaim unit size size it
> > >won't work very well with XFS where the AGs/RTGs must be power of two
> > >aligned in the LBA space, except by overprovisioning the LBA space vs
> > >the capacity actually used.
> >
> > This is good. I think we should have at least a FS POC with data
> > placement support to be able to drive conclusions on how the interface
> > and requirements should be. Until we have that, we can support the
> > use-cases that we know customers are asking for, i.e., block-level hints
> > through the existing temperature API.
> >
> > >
> > >> I agree that down the road, an interface that allows hints (many more
> > >> than 5!) is needed. And in my opinion, this interface should not have
> > >> semintics attached to it, just a hint ID, #hints, and enough space to
> > >> put 100s of them to support storage node deployments. But this needs to
> > >> come from the users of the hints / zones / streams / etc,  not from
> > >> us vendors. We do not have neither details on how they deploy these
> > >> features at scale, nor the workloads to validate the results. Anything
> > >> else will probably just continue polluting the storage stack with more
> > >> interfaces that are not used and add to the problem of data placement
> > >> fragmentation.
> > >
> > >Please always mentioned what layer you are talking about.  At the syscall
> > >level the temperatur hints are doing quite ok.  A full stream separation
> > >would obviously be a lot better, as would be communicating the zone /
> > >reclaim unit / etc size.
> >
> > I mean at the syscall level. But as mentioned above, we need to be very
> > sure that we have a clear use-case for that. If we continue seeing hints
> > being use in NVMe or other protocols, and the number increase
> > significantly, we can deal with it later on.
> >
> > >
> > >As an interface to a driver that doesn't natively speak temperature
> > >hint on the other hand it doesn't work at all.
> > >
> > >> The issue is that the first series of this patch, which is as simple as
> > >> it gets, hit the list in May. Since then we are down paths that lead
> > >> nowhere. So the line between real technical feedback that leads to
> > >> a feature being merged, and technical misleading to make people be a
> > >> busy bee becomes very thin. In the whole data placement effort, we have
> > >> been down this path many times, unfortunately...
> > >
> > >Well, the previous round was the first one actually trying to address the
> > >fundamental issue after 4 month.  And then after a first round of feedback
> > >it gets shutdown somehow out of nowhere.  As a maintainer and review that
> > >is the kinda of contributors I have a hard time taking serious.
> >
> > I am not sure I understand what you mean in the last sentece, so I will
> > not respond filling blanks with a bad interpretation.
> >
> > In summary, what we are asking for is to take the patches that cover the
> > current use-case, and work together on what might be needed for better
> > FS support. For this, it seems you and Hans have a good idea of what you
> > want to have based on XFS. We can help review or do part of the work,
> > but trying to guess our way will only delay existing customers using
> > existing HW.
> 
> After reading the whole thread, I end up wondering why we need to rush the
> support for a single use case through instead of putting the architecture
> in place for properly supporting this new type of hardware from the start
> throughout the stack.

This is not a rush. We have been supporting this use case through passthru for 
over 1/2 year with code already upstream in Cachelib. This is mature enough as 
to move into the block layer, which is what the end user wants to do either way.

This is though a very good point. This is why we upstreamed passthru at the 
time; so people can experiment, validate, and upstream only when there is a 
clear path.

> 
> Even for user space consumers of raw block devices, is the last version
> of the patch set good enough?
> 
> * It severely cripples the data separation capabilities as only a handful of
>   data placement buckets are supported

I could understand from your presentation at LPC, and late looking at the code that 
is available that you have been successful at getting good results with the existing 
interface in XFS. The mapping form the temperature semantics to zones (no semantics)
is the exact same as we are doing with FDP. Not having to change neither in-kernel  nor user-space 
structures is great.

> 
> * It just won't work if there is more than one user application per storage
>   device as different applications data streams will be mixed at the nvme
>   driver level..

For now this use-case is not clear. Folks working on it are using passthru. When we
have a more clear understanding of what is needed, we might need changes in the kernel.

If you see a need for this on the work that you are doing, by all means, please send patches.
As I said at LPC, we can work together on this.

> 
> While Christoph has already outlined what would be desirable from a
> file system point of view, I don't have the answer to what would be the overall
> best design for FDP. I would like to say that it looks to me like we need to
> consider more than more than the early adoption use cases and make sure we
> make the most of the hardware capabilities via logical abstractions that
> would be compatible with a wider range of storage devices.
> 
> Figuring the right way forward is tricky, but why not just let it take the time
> that is needed to sort this out while early users explore how to use FDP
> drives and share the results?

I agree that we might need a new interface to support more hints, beyond the temperatures. 
Or maybe not. We would not know until someone comes with a use case. We have made the 
mistake in the past of treating internal research as upstreamable work. I know can see that 
this simply complicates the in-kernel and user-space APIs.

The existing API is usable and requires no changes. There is hardware. There are customers. 
There are applications with upstream support which have been tested with passthru (the 
early results you mention). And the wiring to NVMe is _very_ simple. There is no reason 
not to take this in, and then we will see what new interfaces we might need in the future.

I would much rather spend time in discussing ideas with you and others on a potential
future API than arguing about the validity of an _existing_ one.

Keith Busch Oct. 9, 2024, 3:06 p.m. UTC | #43

On Wed, Oct 09, 2024 at 11:28:28AM +0200, Christoph Hellwig wrote:
> On Tue, Oct 08, 2024 at 08:44:28AM -0600, Keith Busch wrote:
> > Then let's just continue with patches 1 and 2. They introduce no new
> > user or kernel APIs, and people have already reported improvements using
> > it.
> 
> They are still not any way actually exposing the FDP functionality
> in the standard though.  How is your application going to align
> anything to the reclaim unit?  Or is this another of the cases where
> as a hyperscaler you just "know" from the data sheet?

As far as I know, this is an inconsequential spec detail that is not
being considered by any applications testing this. And yet, the expected
imrpovements are still there, so I don't see a point holding this up for
that reason.

Nitesh Shetty Oct. 9, 2024, 4:28 p.m. UTC | #44

On 09/10/24 11:28AM, Christoph Hellwig wrote:
>On Tue, Oct 08, 2024 at 08:44:28AM -0600, Keith Busch wrote:
>> Then let's just continue with patches 1 and 2. They introduce no new
>> user or kernel APIs, and people have already reported improvements using
>> it.
>
>They are still not any way actually exposing the FDP functionality
>in the standard though.  How is your application going to align
>anything to the reclaim unit?  Or is this another of the cases where
>as a hyperscaler you just "know" from the data sheet?

I think Keith already[1] mentioned a couple of times in previous replies,
this is an optional feature, if experiments doesn't give better
results, FDP hints can be disabled.

>
>But also given that the submitter completely disappeared and refuses
>to even discuss his patches I thing they are simply abandonware at
>this point anyway.

Kanchan is away due to personal reasons, he is expected to be back in a
week or so.

Regards,
Nitesh Shetty

[1] https://lore.kernel.org/all/ZmjSwfD1IqX-ADtL@kbusch-mbp.dhcp.thefacebook.com/

Hans Holmberg Oct. 10, 2024, 6:40 a.m. UTC | #45

On Wed, Oct 9, 2024 at 4:36 PM Javier Gonzalez <javier.gonz@samsung.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Hans Holmberg <hans@owltronix.com>
> > Sent: Tuesday, October 8, 2024 12:07 PM
> > To: Javier Gonzalez <javier.gonz@samsung.com>
> > Cc: Christoph Hellwig <hch@lst.de>; Jens Axboe <axboe@kernel.dk>; Martin K.
> > Petersen <martin.petersen@oracle.com>; Keith Busch <kbusch@kernel.org>;
> > Kanchan Joshi <joshi.k@samsung.com>; hare@suse.de; sagi@grimberg.me;
> > brauner@kernel.org; viro@zeniv.linux.org.uk; jack@suse.cz; jaegeuk@kernel.org;
> > bcrl@kvack.org; dhowells@redhat.com; bvanassche@acm.org;
> > asml.silence@gmail.com; linux-nvme@lists.infradead.org; linux-
> > fsdevel@vger.kernel.org; io-uring@vger.kernel.org; linux-block@vger.kernel.org;
> > linux-aio@kvack.org; gost.dev@samsung.com; vishak.g@samsung.com
> > Subject: Re: [PATCH v7 0/3] FDP and per-io hints
> >
> > On Mon, Oct 7, 2024 at 12:10 PM Javier González <javier.gonz@samsung.com>
> > wrote:
> > >
> > > On 04.10.2024 14:30, Christoph Hellwig wrote:
> > > >On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote:
> > > >> So, considerign that file system _are_ able to use temperature hints and
> > > >> actually make them work, why don't we support FDP the same way we are
> > > >> supporting zones so that people can use it in production?
> > > >
> > > >Because apparently no one has tried it.  It should be possible in theory,
> > > >but for example unless you have power of 2 reclaim unit size size it
> > > >won't work very well with XFS where the AGs/RTGs must be power of two
> > > >aligned in the LBA space, except by overprovisioning the LBA space vs
> > > >the capacity actually used.
> > >
> > > This is good. I think we should have at least a FS POC with data
> > > placement support to be able to drive conclusions on how the interface
> > > and requirements should be. Until we have that, we can support the
> > > use-cases that we know customers are asking for, i.e., block-level hints
> > > through the existing temperature API.
> > >
> > > >
> > > >> I agree that down the road, an interface that allows hints (many more
> > > >> than 5!) is needed. And in my opinion, this interface should not have
> > > >> semintics attached to it, just a hint ID, #hints, and enough space to
> > > >> put 100s of them to support storage node deployments. But this needs to
> > > >> come from the users of the hints / zones / streams / etc,  not from
> > > >> us vendors. We do not have neither details on how they deploy these
> > > >> features at scale, nor the workloads to validate the results. Anything
> > > >> else will probably just continue polluting the storage stack with more
> > > >> interfaces that are not used and add to the problem of data placement
> > > >> fragmentation.
> > > >
> > > >Please always mentioned what layer you are talking about.  At the syscall
> > > >level the temperatur hints are doing quite ok.  A full stream separation
> > > >would obviously be a lot better, as would be communicating the zone /
> > > >reclaim unit / etc size.
> > >
> > > I mean at the syscall level. But as mentioned above, we need to be very
> > > sure that we have a clear use-case for that. If we continue seeing hints
> > > being use in NVMe or other protocols, and the number increase
> > > significantly, we can deal with it later on.
> > >
> > > >
> > > >As an interface to a driver that doesn't natively speak temperature
> > > >hint on the other hand it doesn't work at all.
> > > >
> > > >> The issue is that the first series of this patch, which is as simple as
> > > >> it gets, hit the list in May. Since then we are down paths that lead
> > > >> nowhere. So the line between real technical feedback that leads to
> > > >> a feature being merged, and technical misleading to make people be a
> > > >> busy bee becomes very thin. In the whole data placement effort, we have
> > > >> been down this path many times, unfortunately...
> > > >
> > > >Well, the previous round was the first one actually trying to address the
> > > >fundamental issue after 4 month.  And then after a first round of feedback
> > > >it gets shutdown somehow out of nowhere.  As a maintainer and review that
> > > >is the kinda of contributors I have a hard time taking serious.
> > >
> > > I am not sure I understand what you mean in the last sentece, so I will
> > > not respond filling blanks with a bad interpretation.
> > >
> > > In summary, what we are asking for is to take the patches that cover the
> > > current use-case, and work together on what might be needed for better
> > > FS support. For this, it seems you and Hans have a good idea of what you
> > > want to have based on XFS. We can help review or do part of the work,
> > > but trying to guess our way will only delay existing customers using
> > > existing HW.
> >
> > After reading the whole thread, I end up wondering why we need to rush the
> > support for a single use case through instead of putting the architecture
> > in place for properly supporting this new type of hardware from the start
> > throughout the stack.
>
> This is not a rush. We have been supporting this use case through passthru for
> over 1/2 year with code already upstream in Cachelib. This is mature enough as
> to move into the block layer, which is what the end user wants to do either way.
>
> This is though a very good point. This is why we upstreamed passthru at the
> time; so people can experiment, validate, and upstream only when there is a
> clear path.
>
> >
> > Even for user space consumers of raw block devices, is the last version
> > of the patch set good enough?
> >
> > * It severely cripples the data separation capabilities as only a handful of
> >   data placement buckets are supported
>
> I could understand from your presentation at LPC, and late looking at the code that
> is available that you have been successful at getting good results with the existing
> interface in XFS. The mapping form the temperature semantics to zones (no semantics)
> is the exact same as we are doing with FDP. Not having to change neither in-kernel  nor user-space
> structures is great.

No, we don't map data directly to zones using lifetime hints. In fact,
lifetime hints contribute only a
relatively small part  (~10% extra write amp reduction, see the
rocksdb benchmark results).
Segregating data by file is the most important part of the data
placement heuristic, at least
for this type of workload.

>
> >
> > * It just won't work if there is more than one user application per storage
> >   device as different applications data streams will be mixed at the nvme
> >   driver level..
>
> For now this use-case is not clear. Folks working on it are using passthru. When we
> have a more clear understanding of what is needed, we might need changes in the kernel.
>
> If you see a need for this on the work that you are doing, by all means, please send patches.
> As I said at LPC, we can work together on this.
>
> >
> > While Christoph has already outlined what would be desirable from a
> > file system point of view, I don't have the answer to what would be the overall
> > best design for FDP. I would like to say that it looks to me like we need to
> > consider more than more than the early adoption use cases and make sure we
> > make the most of the hardware capabilities via logical abstractions that
> > would be compatible with a wider range of storage devices.
> >
> > Figuring the right way forward is tricky, but why not just let it take the time
> > that is needed to sort this out while early users explore how to use FDP
> > drives and share the results?
>
> I agree that we might need a new interface to support more hints, beyond the temperatures.
> Or maybe not. We would not know until someone comes with a use case. We have made the
> mistake in the past of treating internal research as upstreamable work. I know can see that
> this simply complicates the in-kernel and user-space APIs.
>
> The existing API is usable and requires no changes. There is hardware. There are customers.
> There are applications with upstream support which have been tested with passthru (the
> early results you mention). And the wiring to NVMe is _very_ simple. There is no reason
> not to take this in, and then we will see what new interfaces we might need in the future.
>
> I would much rather spend time in discussing ideas with you and others on a potential
> future API than arguing about the validity of an _existing_ one.
>

Yes, but while FDP support could be improved later on(happy to help if
that'll be the case),
I'm just afraid that less work now defining the way data placement is
exposed is going to
result in a bigger mess later when more use cases will be considered.

Javier Gonzalez Oct. 10, 2024, 7:07 a.m. UTC | #46

On 09.10.2024 09:06, Keith Busch wrote:
>On Wed, Oct 09, 2024 at 11:28:28AM +0200, Christoph Hellwig wrote:
>> On Tue, Oct 08, 2024 at 08:44:28AM -0600, Keith Busch wrote:
>> > Then let's just continue with patches 1 and 2. They introduce no new
>> > user or kernel APIs, and people have already reported improvements using
>> > it.
>>
>> They are still not any way actually exposing the FDP functionality
>> in the standard though.  How is your application going to align
>> anything to the reclaim unit?  Or is this another of the cases where
>> as a hyperscaler you just "know" from the data sheet?
>
>As far as I know, this is an inconsequential spec detail that is not
>being considered by any applications testing this. And yet, the expected
>imrpovements are still there, so I don't see a point holding this up for
>that reason.

I am re-reading the thread to find a common ground. I realize that my
last email came sharper than I intended. I apologize for that Christoph.

Let me summarize and propose something constructive so that we can move
forward.

It is clear that you have a lot of insight on how a FS API should look
like to unify the different data placement alternatives across
protocols. I have not seen code for it, but based on the technical
feedback you are providing, it seems to be fairly clear in your mind.

I think we should attempt to pursue that with an example in mind. Seems
XFS is the clear candidate. You have done work already in enable SMR
HDDs; it seems we can get FDP under that umbrella. This will however
take time to get right. We can help with development, testing, and
experimental evaluation on the WAF benefits for such an interface.

However, this work should not block existing hardware enabling an
existing use-case. The current patches are not intrusive. They do not
make changse to the API and merely wire up what is there to the driver.
Anyone using temperaturs will be able to use FDP - this is a win without
a maintainance burden attached to it. The change to the FS / application
API will not require major changes either; I believe we all agree that
we cannot remove the temperatures, so all existing temperature users
will be able to continue using them; we will just add an alternative for
power users on the side.

So the proposal is to merge the patches as they are and commit to work
together on a new, better API for in-kernel users (FS), and for
applications (syscall, uring).

Christoph, would this be a viable way to move forward for you?

Thanks,
Javier

Javier Gonzalez Oct. 10, 2024, 7:13 a.m. UTC | #47

On 10.10.2024 08:40, Hans Holmberg wrote:
>On Wed, Oct 9, 2024 at 4:36 PM Javier Gonzalez <javier.gonz@samsung.com> wrote:
>>
>>
>>
>> > -----Original Message-----
>> > From: Hans Holmberg <hans@owltronix.com>
>> > Sent: Tuesday, October 8, 2024 12:07 PM
>> > To: Javier Gonzalez <javier.gonz@samsung.com>
>> > Cc: Christoph Hellwig <hch@lst.de>; Jens Axboe <axboe@kernel.dk>; Martin K.
>> > Petersen <martin.petersen@oracle.com>; Keith Busch <kbusch@kernel.org>;
>> > Kanchan Joshi <joshi.k@samsung.com>; hare@suse.de; sagi@grimberg.me;
>> > brauner@kernel.org; viro@zeniv.linux.org.uk; jack@suse.cz; jaegeuk@kernel.org;
>> > bcrl@kvack.org; dhowells@redhat.com; bvanassche@acm.org;
>> > asml.silence@gmail.com; linux-nvme@lists.infradead.org; linux-
>> > fsdevel@vger.kernel.org; io-uring@vger.kernel.org; linux-block@vger.kernel.org;
>> > linux-aio@kvack.org; gost.dev@samsung.com; vishak.g@samsung.com
>> > Subject: Re: [PATCH v7 0/3] FDP and per-io hints
>> >
>> > On Mon, Oct 7, 2024 at 12:10 PM Javier González <javier.gonz@samsung.com>
>> > wrote:
>> > >
>> > > On 04.10.2024 14:30, Christoph Hellwig wrote:
>> > > >On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote:
>> > > >> So, considerign that file system _are_ able to use temperature hints and
>> > > >> actually make them work, why don't we support FDP the same way we are
>> > > >> supporting zones so that people can use it in production?
>> > > >
>> > > >Because apparently no one has tried it.  It should be possible in theory,
>> > > >but for example unless you have power of 2 reclaim unit size size it
>> > > >won't work very well with XFS where the AGs/RTGs must be power of two
>> > > >aligned in the LBA space, except by overprovisioning the LBA space vs
>> > > >the capacity actually used.
>> > >
>> > > This is good. I think we should have at least a FS POC with data
>> > > placement support to be able to drive conclusions on how the interface
>> > > and requirements should be. Until we have that, we can support the
>> > > use-cases that we know customers are asking for, i.e., block-level hints
>> > > through the existing temperature API.
>> > >
>> > > >
>> > > >> I agree that down the road, an interface that allows hints (many more
>> > > >> than 5!) is needed. And in my opinion, this interface should not have
>> > > >> semintics attached to it, just a hint ID, #hints, and enough space to
>> > > >> put 100s of them to support storage node deployments. But this needs to
>> > > >> come from the users of the hints / zones / streams / etc,  not from
>> > > >> us vendors. We do not have neither details on how they deploy these
>> > > >> features at scale, nor the workloads to validate the results. Anything
>> > > >> else will probably just continue polluting the storage stack with more
>> > > >> interfaces that are not used and add to the problem of data placement
>> > > >> fragmentation.
>> > > >
>> > > >Please always mentioned what layer you are talking about.  At the syscall
>> > > >level the temperatur hints are doing quite ok.  A full stream separation
>> > > >would obviously be a lot better, as would be communicating the zone /
>> > > >reclaim unit / etc size.
>> > >
>> > > I mean at the syscall level. But as mentioned above, we need to be very
>> > > sure that we have a clear use-case for that. If we continue seeing hints
>> > > being use in NVMe or other protocols, and the number increase
>> > > significantly, we can deal with it later on.
>> > >
>> > > >
>> > > >As an interface to a driver that doesn't natively speak temperature
>> > > >hint on the other hand it doesn't work at all.
>> > > >
>> > > >> The issue is that the first series of this patch, which is as simple as
>> > > >> it gets, hit the list in May. Since then we are down paths that lead
>> > > >> nowhere. So the line between real technical feedback that leads to
>> > > >> a feature being merged, and technical misleading to make people be a
>> > > >> busy bee becomes very thin. In the whole data placement effort, we have
>> > > >> been down this path many times, unfortunately...
>> > > >
>> > > >Well, the previous round was the first one actually trying to address the
>> > > >fundamental issue after 4 month.  And then after a first round of feedback
>> > > >it gets shutdown somehow out of nowhere.  As a maintainer and review that
>> > > >is the kinda of contributors I have a hard time taking serious.
>> > >
>> > > I am not sure I understand what you mean in the last sentece, so I will
>> > > not respond filling blanks with a bad interpretation.
>> > >
>> > > In summary, what we are asking for is to take the patches that cover the
>> > > current use-case, and work together on what might be needed for better
>> > > FS support. For this, it seems you and Hans have a good idea of what you
>> > > want to have based on XFS. We can help review or do part of the work,
>> > > but trying to guess our way will only delay existing customers using
>> > > existing HW.
>> >
>> > After reading the whole thread, I end up wondering why we need to rush the
>> > support for a single use case through instead of putting the architecture
>> > in place for properly supporting this new type of hardware from the start
>> > throughout the stack.
>>
>> This is not a rush. We have been supporting this use case through passthru for
>> over 1/2 year with code already upstream in Cachelib. This is mature enough as
>> to move into the block layer, which is what the end user wants to do either way.
>>
>> This is though a very good point. This is why we upstreamed passthru at the
>> time; so people can experiment, validate, and upstream only when there is a
>> clear path.
>>
>> >
>> > Even for user space consumers of raw block devices, is the last version
>> > of the patch set good enough?
>> >
>> > * It severely cripples the data separation capabilities as only a handful of
>> >   data placement buckets are supported
>>
>> I could understand from your presentation at LPC, and late looking at the code that
>> is available that you have been successful at getting good results with the existing
>> interface in XFS. The mapping form the temperature semantics to zones (no semantics)
>> is the exact same as we are doing with FDP. Not having to change neither in-kernel  nor user-space
>> structures is great.
>
>No, we don't map data directly to zones using lifetime hints. In fact,
>lifetime hints contribute only a
>relatively small part  (~10% extra write amp reduction, see the
>rocksdb benchmark results).
>Segregating data by file is the most important part of the data
>placement heuristic, at least
>for this type of workload.

Is this because RocksDB already does seggregation per file itself? Are
you doing something specific on XFS or using your knoledge on RocksDB to
map files with an "unwritten" protocol in the midde?

    In this context, we have collected data both using FDP natively in
    RocksDB and using the temperatures. Both look very good, because both
    are initiated by RocksDB, and the FS just passes the hints directly
    to the driver.

I ask this to understand if this is the FS responsibility or the
application's one. Our work points more to letting applications use the
hints (as the use-cases are power users, like RocksDB). I agree with you
that a FS could potentially make an improvement for legacy applications
- we have not focused much on these though, so I trust you insights on
it.

>>
>> >
>> > * It just won't work if there is more than one user application per storage
>> >   device as different applications data streams will be mixed at the nvme
>> >   driver level..
>>
>> For now this use-case is not clear. Folks working on it are using passthru. When we
>> have a more clear understanding of what is needed, we might need changes in the kernel.
>>
>> If you see a need for this on the work that you are doing, by all means, please send patches.
>> As I said at LPC, we can work together on this.
>>
>> >
>> > While Christoph has already outlined what would be desirable from a
>> > file system point of view, I don't have the answer to what would be the overall
>> > best design for FDP. I would like to say that it looks to me like we need to
>> > consider more than more than the early adoption use cases and make sure we
>> > make the most of the hardware capabilities via logical abstractions that
>> > would be compatible with a wider range of storage devices.
>> >
>> > Figuring the right way forward is tricky, but why not just let it take the time
>> > that is needed to sort this out while early users explore how to use FDP
>> > drives and share the results?
>>
>> I agree that we might need a new interface to support more hints, beyond the temperatures.
>> Or maybe not. We would not know until someone comes with a use case. We have made the
>> mistake in the past of treating internal research as upstreamable work. I know can see that
>> this simply complicates the in-kernel and user-space APIs.
>>
>> The existing API is usable and requires no changes. There is hardware. There are customers.
>> There are applications with upstream support which have been tested with passthru (the
>> early results you mention). And the wiring to NVMe is _very_ simple. There is no reason
>> not to take this in, and then we will see what new interfaces we might need in the future.
>>
>> I would much rather spend time in discussing ideas with you and others on a potential
>> future API than arguing about the validity of an _existing_ one.
>>
>
>Yes, but while FDP support could be improved later on(happy to help if
>that'll be the case),
>I'm just afraid that less work now defining the way data placement is
>exposed is going to
>result in a bigger mess later when more use cases will be considered.

Please, see the message I responded on the other thread. I hope it is a
way to move forward and actually work together on this.

Christoph Hellwig Oct. 10, 2024, 9:10 a.m. UTC | #48

On Wed, Oct 09, 2024 at 09:06:25AM -0600, Keith Busch wrote:
> > anything to the reclaim unit?  Or is this another of the cases where
> > as a hyperscaler you just "know" from the data sheet?
> 
> As far as I know, this is an inconsequential spec detail that is not
> being considered by any applications testing this. And yet, the expected
> imrpovements are still there, so I don't see a point holding this up for
> that reason.

It was the whole point of the thing, and a major source of complexity.
Although not quite for this use case Hans has numbers that aligning
application data tables / objects / etc to the underlying "erase unit"
absolutely matters.

And just to make it clear the objection here is not that we have an
an interface that doesn't take that into respect at the syscall
level.  We already have that interface and might as well make use
of that.  The problem is that the series tries to directly expose
that to the driver, freezing us into this incomplet interface forever
(assuming we have users actually picking it up).

That's why I keep insisting like a broken record that we need to get
this lower interface right first.

Christoph Hellwig Oct. 10, 2024, 9:13 a.m. UTC | #49

On Thu, Oct 10, 2024 at 09:07:36AM +0200, Javier González wrote:
> I think we should attempt to pursue that with an example in mind. Seems
> XFS is the clear candidate. You have done work already in enable SMR
> HDDs; it seems we can get FDP under that umbrella. This will however
> take time to get right. We can help with development, testing, and
> experimental evaluation on the WAF benefits for such an interface.

Or ZNS SSDs for that matter.

> However, this work should not block existing hardware enabling an
> existing use-case. The current patches are not intrusive. They do not
> make changse to the API and merely wire up what is there to the driver.
> Anyone using temperaturs will be able to use FDP - this is a win without
> a maintainance burden attached to it. The change to the FS / application
> API will not require major changes either; I believe we all agree that
> we cannot remove the temperatures, so all existing temperature users
> will be able to continue using them; we will just add an alternative for
> power users on the side.

As mentioned probably close to a dozen times over this thread and it's
predecessors:  Keeping the per-file I/O hint API and mapping that to
FDP is fine.  Exposing the overly-simplistic hints to the NVMe driver
through the entire I/O stack and locking us into that is not.

> So the proposal is to merge the patches as they are and commit to work
> together on a new, better API for in-kernel users (FS), and for
> applications (syscall, uring).

And because of that the whole merge it now and fix it later unfortunately
doesn't work, as a proper implementation will regress behavior for at
least some users that only have the I/O hints API but try to emulate
real stream separation with it.  With the per-I/O hints in io_uring that
is in fact almost guaranteed.

Christoph Hellwig Oct. 10, 2024, 9:20 a.m. UTC | #50

On Thu, Oct 10, 2024 at 09:13:27AM +0200, Javier Gonzalez wrote:
> Is this because RocksDB already does seggregation per file itself? Are
> you doing something specific on XFS or using your knoledge on RocksDB to
> map files with an "unwritten" protocol in the midde?

XFS doesn't really do anything smart at all except for grouping files
with similar temperatures, but Hans can probably explain it in more
detail.  So yes, this relies on the application doing the data separation
and using the most logical vehicle for it: files.

>
>    In this context, we have collected data both using FDP natively in
>    RocksDB and using the temperatures. Both look very good, because both
>    are initiated by RocksDB, and the FS just passes the hints directly
>    to the driver.
>
> I ask this to understand if this is the FS responsibility or the
> application's one. Our work points more to letting applications use the
> hints (as the use-cases are power users, like RocksDB). I agree with you
> that a FS could potentially make an improvement for legacy applications
> - we have not focused much on these though, so I trust you insights on
> it.

As mentioned multiple times before in this thread this absolutely
depends on the abstraction level of the application.  If the application
works on a raw device without a file system it obviously needs very
low-level control.  And in my opinion passthrough is by far the best
interface for that level of control.  If the application is using a
file system there is no better basic level abstraction than a file,
which can then be enhanced with relatively small amount of additional
information going both ways: the file system telling the application
what good file sizes and write patterns are, and the application telling
the file system what files are good candidates to merge into the same
write stream if the file system has to merge multiple actively written
to files into a write stream.  Trying to do low-level per I/O hints
on top of a file system is a recipe for trouble because you now have
to entities fighting over placement control.

Hans Holmberg Oct. 10, 2024, 10:46 a.m. UTC | #51

On Thu, Oct 10, 2024 at 9:13 AM Javier Gonzalez <javier.gonz@samsung.com> wrote:
>
> On 10.10.2024 08:40, Hans Holmberg wrote:
> >On Wed, Oct 9, 2024 at 4:36 PM Javier Gonzalez <javier.gonz@samsung.com> wrote:
> >>
> >>
> >>
> >> > -----Original Message-----
> >> > From: Hans Holmberg <hans@owltronix.com>
> >> > Sent: Tuesday, October 8, 2024 12:07 PM
> >> > To: Javier Gonzalez <javier.gonz@samsung.com>
> >> > Cc: Christoph Hellwig <hch@lst.de>; Jens Axboe <axboe@kernel.dk>; Martin K.
> >> > Petersen <martin.petersen@oracle.com>; Keith Busch <kbusch@kernel.org>;
> >> > Kanchan Joshi <joshi.k@samsung.com>; hare@suse.de; sagi@grimberg.me;
> >> > brauner@kernel.org; viro@zeniv.linux.org.uk; jack@suse.cz; jaegeuk@kernel.org;
> >> > bcrl@kvack.org; dhowells@redhat.com; bvanassche@acm.org;
> >> > asml.silence@gmail.com; linux-nvme@lists.infradead.org; linux-
> >> > fsdevel@vger.kernel.org; io-uring@vger.kernel.org; linux-block@vger.kernel.org;
> >> > linux-aio@kvack.org; gost.dev@samsung.com; vishak.g@samsung.com
> >> > Subject: Re: [PATCH v7 0/3] FDP and per-io hints
> >> >
> >> > On Mon, Oct 7, 2024 at 12:10 PM Javier González <javier.gonz@samsung.com>
> >> > wrote:
> >> > >
> >> > > On 04.10.2024 14:30, Christoph Hellwig wrote:
> >> > > >On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote:
> >> > > >> So, considerign that file system _are_ able to use temperature hints and
> >> > > >> actually make them work, why don't we support FDP the same way we are
> >> > > >> supporting zones so that people can use it in production?
> >> > > >
> >> > > >Because apparently no one has tried it.  It should be possible in theory,
> >> > > >but for example unless you have power of 2 reclaim unit size size it
> >> > > >won't work very well with XFS where the AGs/RTGs must be power of two
> >> > > >aligned in the LBA space, except by overprovisioning the LBA space vs
> >> > > >the capacity actually used.
> >> > >
> >> > > This is good. I think we should have at least a FS POC with data
> >> > > placement support to be able to drive conclusions on how the interface
> >> > > and requirements should be. Until we have that, we can support the
> >> > > use-cases that we know customers are asking for, i.e., block-level hints
> >> > > through the existing temperature API.
> >> > >
> >> > > >
> >> > > >> I agree that down the road, an interface that allows hints (many more
> >> > > >> than 5!) is needed. And in my opinion, this interface should not have
> >> > > >> semintics attached to it, just a hint ID, #hints, and enough space to
> >> > > >> put 100s of them to support storage node deployments. But this needs to
> >> > > >> come from the users of the hints / zones / streams / etc,  not from
> >> > > >> us vendors. We do not have neither details on how they deploy these
> >> > > >> features at scale, nor the workloads to validate the results. Anything
> >> > > >> else will probably just continue polluting the storage stack with more
> >> > > >> interfaces that are not used and add to the problem of data placement
> >> > > >> fragmentation.
> >> > > >
> >> > > >Please always mentioned what layer you are talking about.  At the syscall
> >> > > >level the temperatur hints are doing quite ok.  A full stream separation
> >> > > >would obviously be a lot better, as would be communicating the zone /
> >> > > >reclaim unit / etc size.
> >> > >
> >> > > I mean at the syscall level. But as mentioned above, we need to be very
> >> > > sure that we have a clear use-case for that. If we continue seeing hints
> >> > > being use in NVMe or other protocols, and the number increase
> >> > > significantly, we can deal with it later on.
> >> > >
> >> > > >
> >> > > >As an interface to a driver that doesn't natively speak temperature
> >> > > >hint on the other hand it doesn't work at all.
> >> > > >
> >> > > >> The issue is that the first series of this patch, which is as simple as
> >> > > >> it gets, hit the list in May. Since then we are down paths that lead
> >> > > >> nowhere. So the line between real technical feedback that leads to
> >> > > >> a feature being merged, and technical misleading to make people be a
> >> > > >> busy bee becomes very thin. In the whole data placement effort, we have
> >> > > >> been down this path many times, unfortunately...
> >> > > >
> >> > > >Well, the previous round was the first one actually trying to address the
> >> > > >fundamental issue after 4 month.  And then after a first round of feedback
> >> > > >it gets shutdown somehow out of nowhere.  As a maintainer and review that
> >> > > >is the kinda of contributors I have a hard time taking serious.
> >> > >
> >> > > I am not sure I understand what you mean in the last sentece, so I will
> >> > > not respond filling blanks with a bad interpretation.
> >> > >
> >> > > In summary, what we are asking for is to take the patches that cover the
> >> > > current use-case, and work together on what might be needed for better
> >> > > FS support. For this, it seems you and Hans have a good idea of what you
> >> > > want to have based on XFS. We can help review or do part of the work,
> >> > > but trying to guess our way will only delay existing customers using
> >> > > existing HW.
> >> >
> >> > After reading the whole thread, I end up wondering why we need to rush the
> >> > support for a single use case through instead of putting the architecture
> >> > in place for properly supporting this new type of hardware from the start
> >> > throughout the stack.
> >>
> >> This is not a rush. We have been supporting this use case through passthru for
> >> over 1/2 year with code already upstream in Cachelib. This is mature enough as
> >> to move into the block layer, which is what the end user wants to do either way.
> >>
> >> This is though a very good point. This is why we upstreamed passthru at the
> >> time; so people can experiment, validate, and upstream only when there is a
> >> clear path.
> >>
> >> >
> >> > Even for user space consumers of raw block devices, is the last version
> >> > of the patch set good enough?
> >> >
> >> > * It severely cripples the data separation capabilities as only a handful of
> >> >   data placement buckets are supported
> >>
> >> I could understand from your presentation at LPC, and late looking at the code that
> >> is available that you have been successful at getting good results with the existing
> >> interface in XFS. The mapping form the temperature semantics to zones (no semantics)
> >> is the exact same as we are doing with FDP. Not having to change neither in-kernel  nor user-space
> >> structures is great.
> >
> >No, we don't map data directly to zones using lifetime hints. In fact,
> >lifetime hints contribute only a
> >relatively small part  (~10% extra write amp reduction, see the
> >rocksdb benchmark results).
> >Segregating data by file is the most important part of the data
> >placement heuristic, at least
> >for this type of workload.
>
> Is this because RocksDB already does seggregation per file itself? Are
> you doing something specific on XFS or using your knoledge on RocksDB to
> map files with an "unwritten" protocol in the midde?

Data placement by-file is based on that the lifetime of a file's data
blocks are strongly correlated. When a file is deleted, all its blocks
will be reclaimable at that point. This requires knowledge about the
data placement buckets and works really well without any hints
provided.
The life-time hint heuristic I added on top is based on rocksdb
statistics, but designed to be generic enough to work for a wider
range of workloads (still need to validate this though - more work to
be done).

>
>     In this context, we have collected data both using FDP natively in
>     RocksDB and using the temperatures. Both look very good, because both
>     are initiated by RocksDB, and the FS just passes the hints directly
>     to the driver.
>
> I ask this to understand if this is the FS responsibility or the
> application's one. Our work points more to letting applications use the
> hints (as the use-cases are power users, like RocksDB). I agree with you
> that a FS could potentially make an improvement for legacy applications
> - we have not focused much on these though, so I trust you insights on
> it.

The big problem as I see it is that if applications are going to work
well together on the same media we need a common placement
implementation somewhere, and it seems pretty natural to make it part
of filesystems to me.


>
> >>
> >> >
> >> > * It just won't work if there is more than one user application per storage
> >> >   device as different applications data streams will be mixed at the nvme
> >> >   driver level..
> >>
> >> For now this use-case is not clear. Folks working on it are using passthru. When we
> >> have a more clear understanding of what is needed, we might need changes in the kernel.
> >>
> >> If you see a need for this on the work that you are doing, by all means, please send patches.
> >> As I said at LPC, we can work together on this.
> >>
> >> >
> >> > While Christoph has already outlined what would be desirable from a
> >> > file system point of view, I don't have the answer to what would be the overall
> >> > best design for FDP. I would like to say that it looks to me like we need to
> >> > consider more than more than the early adoption use cases and make sure we
> >> > make the most of the hardware capabilities via logical abstractions that
> >> > would be compatible with a wider range of storage devices.
> >> >
> >> > Figuring the right way forward is tricky, but why not just let it take the time
> >> > that is needed to sort this out while early users explore how to use FDP
> >> > drives and share the results?
> >>
> >> I agree that we might need a new interface to support more hints, beyond the temperatures.
> >> Or maybe not. We would not know until someone comes with a use case. We have made the
> >> mistake in the past of treating internal research as upstreamable work. I know can see that
> >> this simply complicates the in-kernel and user-space APIs.
> >>
> >> The existing API is usable and requires no changes. There is hardware. There are customers.
> >> There are applications with upstream support which have been tested with passthru (the
> >> early results you mention). And the wiring to NVMe is _very_ simple. There is no reason
> >> not to take this in, and then we will see what new interfaces we might need in the future.
> >>
> >> I would much rather spend time in discussing ideas with you and others on a potential
> >> future API than arguing about the validity of an _existing_ one.
> >>
> >
> >Yes, but while FDP support could be improved later on(happy to help if
> >that'll be the case),
> >I'm just afraid that less work now defining the way data placement is
> >exposed is going to
> >result in a bigger mess later when more use cases will be considered.
>
> Please, see the message I responded on the other thread. I hope it is a
> way to move forward and actually work together on this.

Javier Gonzalez Oct. 10, 2024, 11:59 a.m. UTC | #52

On 10.10.2024 11:13, Christoph Hellwig wrote:
>On Thu, Oct 10, 2024 at 09:07:36AM +0200, Javier González wrote:
>> I think we should attempt to pursue that with an example in mind. Seems
>> XFS is the clear candidate. You have done work already in enable SMR
>> HDDs; it seems we can get FDP under that umbrella. This will however
>> take time to get right. We can help with development, testing, and
>> experimental evaluation on the WAF benefits for such an interface.
>
>Or ZNS SSDs for that matter.

Maybe. I do not see much movement on this.

>
>> However, this work should not block existing hardware enabling an
>> existing use-case. The current patches are not intrusive. They do not
>> make changse to the API and merely wire up what is there to the driver.
>> Anyone using temperaturs will be able to use FDP - this is a win without
>> a maintainance burden attached to it. The change to the FS / application
>> API will not require major changes either; I believe we all agree that
>> we cannot remove the temperatures, so all existing temperature users
>> will be able to continue using them; we will just add an alternative for
>> power users on the side.
>
>As mentioned probably close to a dozen times over this thread and it's
>predecessors:  Keeping the per-file I/O hint API and mapping that to
>FDP is fine.  Exposing the overly-simplistic hints to the NVMe driver
>through the entire I/O stack and locking us into that is not.

I don't understand the "locking us into that" part.

>> So the proposal is to merge the patches as they are and commit to work
>> together on a new, better API for in-kernel users (FS), and for
>> applications (syscall, uring).
>
>And because of that the whole merge it now and fix it later unfortunately
>doesn't work, as a proper implementation will regress behavior for at
>least some users that only have the I/O hints API but try to emulate
>real stream separation with it.  With the per-I/O hints in io_uring that
>is in fact almost guaranteed.

I do not thing this argument is valid. These patches are not a merge now,
fix later. It is ma: use the current interface, work together on a new
one, if needed.

The new interface you are talking about is not clear to me, at all. If
you could share some patches on how that would look like, then it would
give a much clearer idea of what you have in mind. The whole idea of
let's not cover an existing use-case because we might be able to do
something in the future is not very tangible.

Javier Gonzalez Oct. 10, 2024, 12:22 p.m. UTC | #53

On 10.10.2024 11:20, Christoph Hellwig wrote:
>On Thu, Oct 10, 2024 at 09:13:27AM +0200, Javier Gonzalez wrote:
>> Is this because RocksDB already does seggregation per file itself? Are
>> you doing something specific on XFS or using your knoledge on RocksDB to
>> map files with an "unwritten" protocol in the midde?
>
>XFS doesn't really do anything smart at all except for grouping files
>with similar temperatures, but Hans can probably explain it in more
>detail.  So yes, this relies on the application doing the data separation
>and using the most logical vehicle for it: files.

This makes sense. Agree.

>
>>
>>    In this context, we have collected data both using FDP natively in
>>    RocksDB and using the temperatures. Both look very good, because both
>>    are initiated by RocksDB, and the FS just passes the hints directly
>>    to the driver.
>>
>> I ask this to understand if this is the FS responsibility or the
>> application's one. Our work points more to letting applications use the
>> hints (as the use-cases are power users, like RocksDB). I agree with you
>> that a FS could potentially make an improvement for legacy applications
>> - we have not focused much on these though, so I trust you insights on
>> it.
>
>As mentioned multiple times before in this thread this absolutely
>depends on the abstraction level of the application.  If the application
>works on a raw device without a file system it obviously needs very
>low-level control.  And in my opinion passthrough is by far the best
>interface for that level of control. 

Passthru is great for prototyping and getting insights on end-to-end
applicability. We see though that it is difficult to get a full solution
based on it, unless people implement a use-space layer tailored to their
use-case (e.g., a version SPDK's bdev). After the POC phase, most folks
that can use passthru prefer to move to block - with a validated
use-case it should be easier to get things upstream.

This is exactly where we are now.

>If the application is using a
>file system there is no better basic level abstraction than a file,
>which can then be enhanced with relatively small amount of additional
>information going both ways: the file system telling the application
>what good file sizes and write patterns are, and the application telling
>the file system what files are good candidates to merge into the same
>write stream if the file system has to merge multiple actively written
>to files into a write stream.  Trying to do low-level per I/O hints
>on top of a file system is a recipe for trouble because you now have
>to entities fighting over placement control.

For file, I agree with you.

If you saw the comments from Christian on the inode space, there are a
few plumbing challenges. Do you have any patches we could look at?

Javier Gonzalez Oct. 10, 2024, 12:27 p.m. UTC | #54

On 10.10.2024 12:46, Hans Holmberg wrote:
>On Thu, Oct 10, 2024 at 9:13 AM Javier Gonzalez <javier.gonz@samsung.com> wrote:
>>
>> On 10.10.2024 08:40, Hans Holmberg wrote:
>> >On Wed, Oct 9, 2024 at 4:36 PM Javier Gonzalez <javier.gonz@samsung.com> wrote:
>> >>
>> >>
>> >>
>> >> > -----Original Message-----
>> >> > From: Hans Holmberg <hans@owltronix.com>
>> >> > Sent: Tuesday, October 8, 2024 12:07 PM
>> >> > To: Javier Gonzalez <javier.gonz@samsung.com>
>> >> > Cc: Christoph Hellwig <hch@lst.de>; Jens Axboe <axboe@kernel.dk>; Martin K.
>> >> > Petersen <martin.petersen@oracle.com>; Keith Busch <kbusch@kernel.org>;
>> >> > Kanchan Joshi <joshi.k@samsung.com>; hare@suse.de; sagi@grimberg.me;
>> >> > brauner@kernel.org; viro@zeniv.linux.org.uk; jack@suse.cz; jaegeuk@kernel.org;
>> >> > bcrl@kvack.org; dhowells@redhat.com; bvanassche@acm.org;
>> >> > asml.silence@gmail.com; linux-nvme@lists.infradead.org; linux-
>> >> > fsdevel@vger.kernel.org; io-uring@vger.kernel.org; linux-block@vger.kernel.org;
>> >> > linux-aio@kvack.org; gost.dev@samsung.com; vishak.g@samsung.com
>> >> > Subject: Re: [PATCH v7 0/3] FDP and per-io hints
>> >> >
>> >> > On Mon, Oct 7, 2024 at 12:10 PM Javier González <javier.gonz@samsung.com>
>> >> > wrote:
>> >> > >
>> >> > > On 04.10.2024 14:30, Christoph Hellwig wrote:
>> >> > > >On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote:
>> >> > > >> So, considerign that file system _are_ able to use temperature hints and
>> >> > > >> actually make them work, why don't we support FDP the same way we are
>> >> > > >> supporting zones so that people can use it in production?
>> >> > > >
>> >> > > >Because apparently no one has tried it.  It should be possible in theory,
>> >> > > >but for example unless you have power of 2 reclaim unit size size it
>> >> > > >won't work very well with XFS where the AGs/RTGs must be power of two
>> >> > > >aligned in the LBA space, except by overprovisioning the LBA space vs
>> >> > > >the capacity actually used.
>> >> > >
>> >> > > This is good. I think we should have at least a FS POC with data
>> >> > > placement support to be able to drive conclusions on how the interface
>> >> > > and requirements should be. Until we have that, we can support the
>> >> > > use-cases that we know customers are asking for, i.e., block-level hints
>> >> > > through the existing temperature API.
>> >> > >
>> >> > > >
>> >> > > >> I agree that down the road, an interface that allows hints (many more
>> >> > > >> than 5!) is needed. And in my opinion, this interface should not have
>> >> > > >> semintics attached to it, just a hint ID, #hints, and enough space to
>> >> > > >> put 100s of them to support storage node deployments. But this needs to
>> >> > > >> come from the users of the hints / zones / streams / etc,  not from
>> >> > > >> us vendors. We do not have neither details on how they deploy these
>> >> > > >> features at scale, nor the workloads to validate the results. Anything
>> >> > > >> else will probably just continue polluting the storage stack with more
>> >> > > >> interfaces that are not used and add to the problem of data placement
>> >> > > >> fragmentation.
>> >> > > >
>> >> > > >Please always mentioned what layer you are talking about.  At the syscall
>> >> > > >level the temperatur hints are doing quite ok.  A full stream separation
>> >> > > >would obviously be a lot better, as would be communicating the zone /
>> >> > > >reclaim unit / etc size.
>> >> > >
>> >> > > I mean at the syscall level. But as mentioned above, we need to be very
>> >> > > sure that we have a clear use-case for that. If we continue seeing hints
>> >> > > being use in NVMe or other protocols, and the number increase
>> >> > > significantly, we can deal with it later on.
>> >> > >
>> >> > > >
>> >> > > >As an interface to a driver that doesn't natively speak temperature
>> >> > > >hint on the other hand it doesn't work at all.
>> >> > > >
>> >> > > >> The issue is that the first series of this patch, which is as simple as
>> >> > > >> it gets, hit the list in May. Since then we are down paths that lead
>> >> > > >> nowhere. So the line between real technical feedback that leads to
>> >> > > >> a feature being merged, and technical misleading to make people be a
>> >> > > >> busy bee becomes very thin. In the whole data placement effort, we have
>> >> > > >> been down this path many times, unfortunately...
>> >> > > >
>> >> > > >Well, the previous round was the first one actually trying to address the
>> >> > > >fundamental issue after 4 month.  And then after a first round of feedback
>> >> > > >it gets shutdown somehow out of nowhere.  As a maintainer and review that
>> >> > > >is the kinda of contributors I have a hard time taking serious.
>> >> > >
>> >> > > I am not sure I understand what you mean in the last sentece, so I will
>> >> > > not respond filling blanks with a bad interpretation.
>> >> > >
>> >> > > In summary, what we are asking for is to take the patches that cover the
>> >> > > current use-case, and work together on what might be needed for better
>> >> > > FS support. For this, it seems you and Hans have a good idea of what you
>> >> > > want to have based on XFS. We can help review or do part of the work,
>> >> > > but trying to guess our way will only delay existing customers using
>> >> > > existing HW.
>> >> >
>> >> > After reading the whole thread, I end up wondering why we need to rush the
>> >> > support for a single use case through instead of putting the architecture
>> >> > in place for properly supporting this new type of hardware from the start
>> >> > throughout the stack.
>> >>
>> >> This is not a rush. We have been supporting this use case through passthru for
>> >> over 1/2 year with code already upstream in Cachelib. This is mature enough as
>> >> to move into the block layer, which is what the end user wants to do either way.
>> >>
>> >> This is though a very good point. This is why we upstreamed passthru at the
>> >> time; so people can experiment, validate, and upstream only when there is a
>> >> clear path.
>> >>
>> >> >
>> >> > Even for user space consumers of raw block devices, is the last version
>> >> > of the patch set good enough?
>> >> >
>> >> > * It severely cripples the data separation capabilities as only a handful of
>> >> >   data placement buckets are supported
>> >>
>> >> I could understand from your presentation at LPC, and late looking at the code that
>> >> is available that you have been successful at getting good results with the existing
>> >> interface in XFS. The mapping form the temperature semantics to zones (no semantics)
>> >> is the exact same as we are doing with FDP. Not having to change neither in-kernel  nor user-space
>> >> structures is great.
>> >
>> >No, we don't map data directly to zones using lifetime hints. In fact,
>> >lifetime hints contribute only a
>> >relatively small part  (~10% extra write amp reduction, see the
>> >rocksdb benchmark results).
>> >Segregating data by file is the most important part of the data
>> >placement heuristic, at least
>> >for this type of workload.
>>
>> Is this because RocksDB already does seggregation per file itself? Are
>> you doing something specific on XFS or using your knoledge on RocksDB to
>> map files with an "unwritten" protocol in the midde?
>
>Data placement by-file is based on that the lifetime of a file's data
>blocks are strongly correlated. When a file is deleted, all its blocks
>will be reclaimable at that point. This requires knowledge about the
>data placement buckets and works really well without any hints
>provided.

But we need hints to put files together. I believe you do this already,
as no placement protocol gives you unlimited separation.

>The life-time hint heuristic I added on top is based on rocksdb
>statistics, but designed to be generic enough to work for a wider
>range of workloads (still need to validate this though - more work to
>be done).

Maybe you can post some patches on the parts dedicated to the VFS level
and user-space API (syscall or uring)?

Following on the comment to Christoph, it would be good to have
something tangible to work together on for the next stage of this
support.

>
>>
>>     In this context, we have collected data both using FDP natively in
>>     RocksDB and using the temperatures. Both look very good, because both
>>     are initiated by RocksDB, and the FS just passes the hints directly
>>     to the driver.
>>
>> I ask this to understand if this is the FS responsibility or the
>> application's one. Our work points more to letting applications use the
>> hints (as the use-cases are power users, like RocksDB). I agree with you
>> that a FS could potentially make an improvement for legacy applications
>> - we have not focused much on these though, so I trust you insights on
>> it.
>
>The big problem as I see it is that if applications are going to work
>well together on the same media we need a common placement
>implementation somewhere, and it seems pretty natural to make it part
>of filesystems to me.

For FS users, makes a lot of sense. But we still need to cover
applications using raw block.

Christoph Hellwig Oct. 11, 2024, 8:56 a.m. UTC | #55

On Thu, Oct 10, 2024 at 02:22:32PM +0200, Javier Gonzalez wrote:
> Passthru is great for prototyping and getting insights on end-to-end
> applicability. We see though that it is difficult to get a full solution
> based on it, unless people implement a use-space layer tailored to their
> use-case (e.g., a version SPDK's bdev). After the POC phase, most folks
> that can use passthru prefer to move to block - with a validated
> use-case it should be easier to get things upstream.
>
> This is exactly where we are now.

That's a lot of marketing babble :)    What exact thing is missing
from the passthrough interface when using say spdx over io_uring?

> If you saw the comments from Christian on the inode space, there are a
> few plumbing challenges. Do you have any patches we could look at?

I'm not sure what you refer to here.

Christoph Hellwig Oct. 11, 2024, 8:59 a.m. UTC | #56

On Thu, Oct 10, 2024 at 02:27:33PM +0200, Javier Gonzalez wrote:

[full quote snipped, it would be really helpful to only quote what you
actuall reference per the usual email rules]

>> Data placement by-file is based on that the lifetime of a file's data
>> blocks are strongly correlated. When a file is deleted, all its blocks
>> will be reclaimable at that point. This requires knowledge about the
>> data placement buckets and works really well without any hints
>> provided.
>
> But we need hints to put files together. I believe you do this already,
> as no placement protocol gives you unlimited separation.

The per-file temperature hints do a reasonable job for that.  A fully
developed version of the separate write streams submitted by Kanchan
would probably do even better, but for now the per-file hints seem
to be enough.

> Maybe you can post some patches on the parts dedicated to the VFS level
> and user-space API (syscall or uring)?

It's just using the existing temperature hints.

Christoph Hellwig Oct. 11, 2024, 9:02 a.m. UTC | #57

On Thu, Oct 10, 2024 at 01:59:14PM +0200, Javier González wrote:
>> As mentioned probably close to a dozen times over this thread and it's
>> predecessors:  Keeping the per-file I/O hint API and mapping that to
>> FDP is fine.  Exposing the overly-simplistic hints to the NVMe driver
>> through the entire I/O stack and locking us into that is not.
>
> I don't understand the "locking us into that" part.

The patches as submitted do the two following two things:

 1) interpret the simple temperature hints to map to FDP reclaim handles
 2) add a new interface to set the temperature hints per I/O

and also rely on an annoying existing implementation detail where the I/O
hints set on random files get automatically propagated to the block
device without file system involvement.

This means we can't easily make the nvme driver actually use smarter
hints that expose the actual FDP capabilities without breaking users
that relied on the existing behavior, especially the per-I/O hints that
counteract any kind of file system based data placement.

Javier Gonzalez Oct. 11, 2024, 12:21 p.m. UTC | #58

On 11.10.2024 10:56, Christoph Hellwig wrote:
>On Thu, Oct 10, 2024 at 02:22:32PM +0200, Javier Gonzalez wrote:
>> Passthru is great for prototyping and getting insights on end-to-end
>> applicability. We see though that it is difficult to get a full solution
>> based on it, unless people implement a use-space layer tailored to their
>> use-case (e.g., a version SPDK's bdev). After the POC phase, most folks
>> that can use passthru prefer to move to block - with a validated
>> use-case it should be easier to get things upstream.
>>
>> This is exactly where we are now.
>
>That's a lot of marketing babble :)    What exact thing is missing
>from the passthrough interface when using say spdx over io_uring?

The block layer provides a lot of functionality that passthru cannot
provide. A simple example would be splits. You know this :)

I am sure Jens and Keith can give you more specifics on their particular
reasons.

>
>> If you saw the comments from Christian on the inode space, there are a
>> few plumbing challenges. Do you have any patches we could look at?
>
>I'm not sure what you refer to here.

This from Christian:
   https://lore.kernel.org/all/20240903-erfassen-bandmitglieder-32dfaeee66b2@brauner/

Keith Busch Oct. 11, 2024, 4:59 p.m. UTC | #59

On Fri, Oct 11, 2024 at 02:21:02PM +0200, Javier Gonzalez wrote:
> > That's a lot of marketing babble :)    What exact thing is missing
> > from the passthrough interface when using say spdx over io_uring?
> 
> The block layer provides a lot of functionality that passthru cannot
> provide. A simple example would be splits. You know this :)
> 
> I am sure Jens and Keith can give you more specifics on their particular
> reasons.

Splitting, merging, cgroups, iostats, error retries, failover.

We have a recent change staged in 6.13 to try to count iostats on
passthrough, but it doesn't do discards.

Passthrough to partitions requires root access and the slow CAP_SYSADMIN
check on every IO.

Page cache. Though that may not readily work with this io_uring proposal
as-is either.

The generic IO path is safe to changing formats; passthrough isn't.

And it's just generally more work to port existing applications to use
the passthrough interface. It causes duplicating solutions to scenarios
the kernel already handles.

Jens Axboe Oct. 11, 2024, 5:08 p.m. UTC | #60

On 10/11/24 3:02 AM, Christoph Hellwig wrote:
>>> As mentioned probably close to a dozen times over this thread and it's
>>> predecessors:  Keeping the per-file I/O hint API and mapping that to
>>> FDP is fine.  Exposing the overly-simplistic hints to the NVMe driver
>>> through the entire I/O stack and locking us into that is not.
>>
>> I don't understand the "locking us into that" part.
> 
> The patches as submitted do the two following two things:
> 
>  1) interpret the simple temperature hints to map to FDP reclaim handles
>  2) add a new interface to set the temperature hints per I/O
> 
> and also rely on an annoying existing implementation detail where the I/O
> hints set on random files get automatically propagated to the block
> device without file system involvement.
> 
> This means we can't easily make the nvme driver actually use smarter
> hints that expose the actual FDP capabilities without breaking users
> that relied on the existing behavior, especially the per-I/O hints that
> counteract any kind of file system based data placement.
> 

I think that last argument is a straw man - for any kind of interface
like this, we've ALWAYS just had the rule that any per-whatever
overrides the generic setting. Per IO hints would do the same thing. Is
this a mess if a user has assigned a file hint and uses other per IO
hints on that very same file and they differ? Certainly, but that's
really just the user/app being dumb. Or maybe being smart, as this
particular one block is always hot in the file (yeah contrived case).

The point is, if you mix and match per-io hints and per-file hints, then
you're probably being pretty silly and nobody should do that. I don't
think this is a practical concern at all.

Christoph Hellwig Oct. 14, 2024, 6:21 a.m. UTC | #61

On Fri, Oct 11, 2024 at 11:08:26AM -0600, Jens Axboe wrote:
> 
> I think that last argument is a straw man - for any kind of interface
> like this, we've ALWAYS just had the rule that any per-whatever
> overrides the generic setting.

And exactly that is the problem.  For file systems we can't support
that sanely.  So IFF you absolutely want the per-I/O hints we need
an opt in by the file operations.  I've said that at least twice
in this discussion before, but as everyone likes to have political
discussions instead of technical ones no one replied to that.

Javier Gonzalez Oct. 14, 2024, 7:02 a.m. UTC | #62

> -----Original Message-----
> From: Christoph Hellwig <hch@lst.de>
> Sent: Monday, October 14, 2024 8:21 AM
> To: Jens Axboe <axboe@kernel.dk>
> Cc: Christoph Hellwig <hch@lst.de>; Javier Gonzalez <javier.gonz@samsung.com>;
> Keith Busch <kbusch@kernel.org>; Martin K. Petersen
> <martin.petersen@oracle.com>; Kanchan Joshi <joshi.k@samsung.com>;
> hare@suse.de; sagi@grimberg.me; brauner@kernel.org; viro@zeniv.linux.org.uk;
> jack@suse.cz; jaegeuk@kernel.org; bcrl@kvack.org; dhowells@redhat.com;
> bvanassche@acm.org; asml.silence@gmail.com; linux-nvme@lists.infradead.org;
> linux-fsdevel@vger.kernel.org; io-uring@vger.kernel.org; linux-
> block@vger.kernel.org; linux-aio@kvack.org; gost.dev@samsung.com;
> vishak.g@samsung.com
> Subject: Re: [PATCH v7 0/3] FDP and per-io hints
> 
> On Fri, Oct 11, 2024 at 11:08:26AM -0600, Jens Axboe wrote:
> >
> > I think that last argument is a straw man - for any kind of interface
> > like this, we've ALWAYS just had the rule that any per-whatever
> > overrides the generic setting.
> 
> And exactly that is the problem.  For file systems we can't support
> that sanely.  So IFF you absolutely want the per-I/O hints we need
> an opt in by the file operations.  I've said that at least twice
> in this discussion before, but as everyone likes to have political
> discussions instead of technical ones no one replied to that.

Is it a way forward to add this in a new spin of the series - keeping the 
temperature mapping on the NVMe side?

If not, what would be acceptable for a first version, before getting into adding
a new interface to expose agnostic hints?

Christoph Hellwig Oct. 14, 2024, 7:47 a.m. UTC | #63

On Mon, Oct 14, 2024 at 07:02:11AM +0000, Javier Gonzalez wrote:
> > And exactly that is the problem.  For file systems we can't support
> > that sanely.  So IFF you absolutely want the per-I/O hints we need
> > an opt in by the file operations.  I've said that at least twice
> > in this discussion before, but as everyone likes to have political
> > discussions instead of technical ones no one replied to that.
> 
> Is it a way forward to add this in a new spin of the series - keeping the 
> temperature mapping on the NVMe side?

What do you gain from that?  NVMe does not understand data temperatures,
so why make up that claim?  Especially as it directly undermindes any
file system work to actually make use of it.

> If not, what would be acceptable for a first version, before getting into adding
> a new interface to expose agnostic hints?

Just iterate on Kanchan's series for a block layer (and possible user level,
but that's less urgent) interface for stream separation?

Javier Gonzalez Oct. 14, 2024, 9:08 a.m. UTC | #64

> -----Original Message-----
> From: Christoph Hellwig <hch@lst.de>
> Sent: Monday, October 14, 2024 9:47 AM
> To: Javier Gonzalez <javier.gonz@samsung.com>
> Cc: Christoph Hellwig <hch@lst.de>; Jens Axboe <axboe@kernel.dk>; Keith Busch
> <kbusch@kernel.org>; Martin K. Petersen <martin.petersen@oracle.com>; Kanchan
> Joshi <joshi.k@samsung.com>; hare@suse.de; sagi@grimberg.me;
> brauner@kernel.org; viro@zeniv.linux.org.uk; jack@suse.cz; jaegeuk@kernel.org;
> bcrl@kvack.org; dhowells@redhat.com; bvanassche@acm.org;
> asml.silence@gmail.com; linux-nvme@lists.infradead.org; linux-
> fsdevel@vger.kernel.org; io-uring@vger.kernel.org; linux-block@vger.kernel.org;
> linux-aio@kvack.org; gost.dev@samsung.com; vishak.g@samsung.com
> Subject: Re: [PATCH v7 0/3] FDP and per-io hints
> 
> On Mon, Oct 14, 2024 at 07:02:11AM +0000, Javier Gonzalez wrote:
> > > And exactly that is the problem.  For file systems we can't support
> > > that sanely.  So IFF you absolutely want the per-I/O hints we need
> > > an opt in by the file operations.  I've said that at least twice
> > > in this discussion before, but as everyone likes to have political
> > > discussions instead of technical ones no one replied to that.
> >
> > Is it a way forward to add this in a new spin of the series - keeping the
> > temperature mapping on the NVMe side?
> 
> What do you gain from that?  NVMe does not understand data temperatures,
> so why make up that claim?  

I expressed this a couple of times in this thread. It is no problem to map temperatures
to a protocol that does not understand the semantics. It is not perfect, and in time
I agree that we would benefit from exposing the raw hints without semantics from
Block layer upwards.

But the temperatures are already there. We are not adding anything new. And thanks
to this, we can enable existing users with _minimal_ changes to user-space. As pointed
on the XFS zoned discussions, this is very similar to what you guys are doing (exactly to
re-use what it is already there), and it works. On top of this,  applications that already
understand temperatures (e.g., RocksDB) will be able to leverage FDP SSDs without changes.

> Especially as it directly undermindes any file system work to actually make use of it.

I do not think it does. If a FS wants to use the temperatures, then they would be able to leverage
FDP besides SCSI. And if we come up with a better interface later on, we can make the changes then.
I really do not see the issue. If we were adding a temperature abstraction now, I would agree with
You that we would need to cover the use-case you mention for FSs from the beginning, but this
Is already here. Seems like a fair compromise to support current users.

> 
> > If not, what would be acceptable for a first version, before getting into adding
> > a new interface to expose agnostic hints?
> 
> Just iterate on Kanchan's series for a block layer (and possible user level,
> but that's less urgent) interface for stream separation?

We can work on this later, but it is not that easy. Look at the mail I forwarded 
Form Christian. We now can store hints in the same structure as the temperatures. 
I believe that we would be able to support 128 hints. If we are serious about supporting 
hints as a general interface, 128 hints is not nearly enough. Moreover:

  - How do we convince VFS folks to give us more space for hints at this point?
  - How do we make agnostic hints and temperature co-exist? Do we use a bit to select 
     and create a protocol? This seems over-complicated
  - When we are exposing general hints to the block layer, how do we support N:N 
     FS:Block_Devices? and DMs? This is not obvious, and it goes well beyond what we
     need today, but we probably need to think about it.

And these are just questions that we know. You see that this is a big lift (which we 
can work together on!), but will delay current users by months. And honestly, I do
not feel comfortable doing this alone; we need application and  FS folks like you to 
help guide this so that it is really usable.

I say it again: Your idea about properly supporting hints in FS is good, and I think we should
work on it. But I do not believe that the current patches (with added things like opt in file ops) 
get in the way of this.  So if the disagreement is if these patches are harmful, let's talk about 
this explicitly and try to agree; we do not need to go over the need for a new hint interface. 
We already agree on this.

Christoph Hellwig Oct. 14, 2024, 11:50 a.m. UTC | #65

On Mon, Oct 14, 2024 at 09:08:24AM +0000, Javier Gonzalez wrote:

[can you fix your mailer please, no full quotes, and especially not
quotes of the mail headers]

> > What do you gain from that?  NVMe does not understand data temperatures,
> > so why make up that claim?  
> 
> I expressed this a couple of times in this thread. It is no problem to
> map temperatures to a protocol that does not understand the semantics.

And I've agreed every time with you.  But the important point is that we
must not do it in the driver where all context is lost.

> > Especially as it directly undermindes any file system work to actually make use of it.
> 
> I do not think it does. If a FS wants to use the temperatures, then they
> would be able to leverage FDP besides SCSI.

What do you mean with that?  This is a bit too much whitepaper vocabularly.

We have code in XFS that can make use of the temperature hint.  But to
make them work it actually needs to do real stream separation on the
device.  I.e. the file system consumes the temperature hints.

> And if we come up with a better interface later on, we can make the changes then.
> I really do not see the issue. If we were adding a temperature abstraction now, I would agree with
> You that we would need to cover the use-case you mention for FSs from the beginning, but this
> Is already here. Seems like a fair compromise to support current users.

Again, I think the temperature hints at the syscall level aren't all
bad.  There's definitively a few things I'd like to do better in hindsight,
but that's not the point.  The problem is trying to turn them into
stream separation all the way down in the driver, which is fundamentally
broken.

>   - How do we convince VFS folks to give us more space for hints at this point?

What space from VFS folks do you need for hints?  And why does it
matter?

Javier Gonzalez Oct. 15, 2024, 3:07 a.m. UTC | #66

> On Mon, Oct 14, 2024 at 09:08:24AM +0000, Javier Gonzalez wrote:
> > > Especially as it directly undermindes any file system work to actually make use
> of it.
> >
> > I do not think it does. If a FS wants to use the temperatures, then they
> > would be able to leverage FDP besides SCSI.
> 
> What do you mean with that?  This is a bit too much whitepaper vocabularly.
> 
> We have code in XFS that can make use of the temperature hint.  But to
> make them work it actually needs to do real stream separation on the
> device.  I.e. the file system consumes the temperature hints.

The device can guarantee the stream separation without knowing the temperature.

> > And if we come up with a better interface later on, we can make the changes
> then.
> > I really do not see the issue. If we were adding a temperature abstraction now, I
> would agree with
> > You that we would need to cover the use-case you mention for FSs from the
> beginning, but this
> > Is already here. Seems like a fair compromise to support current users.
> 
> Again, I think the temperature hints at the syscall level aren't all
> bad.  There's definitively a few things I'd like to do better in hindsight,
> but that's not the point.  The problem is trying to turn them into
> stream separation all the way down in the driver, which is fundamentally
> broken.
> 
> >   - How do we convince VFS folks to give us more space for hints at this point?
> 
> What space from VFS folks do you need for hints?  And why does it
> matter?

We need space in the inode to store the hint ID.

Look, this feels like going in circles. All this gaslighting is what makes it difficult to 
push patches when you just do not like the feature. It is the 3rd time I propose you 
a way forward and you simply cannot provide any specific technical feedback - in the 
past email I posted several questions about the interface you seem to be talking 
about and you explicitly omit that.

Christoph Hellwig Oct. 15, 2024, 5:30 a.m. UTC | #67

On Tue, Oct 15, 2024 at 03:07:57AM +0000, Javier Gonzalez wrote:
> > On Mon, Oct 14, 2024 at 09:08:24AM +0000, Javier Gonzalez wrote:
> > > > Especially as it directly undermindes any file system work to actually make use
> > of it.
> > >
> > > I do not think it does. If a FS wants to use the temperatures, then they
> > > would be able to leverage FDP besides SCSI.
> > 
> > What do you mean with that?  This is a bit too much whitepaper vocabularly.
> > 
> > We have code in XFS that can make use of the temperature hint.  But to
> > make them work it actually needs to do real stream separation on the
> > device.  I.e. the file system consumes the temperature hints.
> 
> The device can guarantee the stream separation without knowing the temperature.

Of course.  But that's entirely not the point.

> > What space from VFS folks do you need for hints?  And why does it
> > matter?
> 
> We need space in the inode to store the hint ID.
> 
> Look, this feels like going in circles. All this gaslighting is what makes it difficult to 

I'm so fucking tired of this.  You are not even listening to my arguments
at all, even when explaing every detail to you again and again.  And then
you acuse me of "gaslighting". 

> push patches when you just do not like the feature. It is the 3rd time I propose you 
> a way forward and you simply cannot provide any specific technical feedback - in the 
> past email I posted several questions about the interface you seem to be talking 
> about and you explicitly omit that.

???

You are throwing random buzzwords and not even replying to all the
low level technical points.  If you think I missed something please
just ask again instead of this crap.

Christoph Hellwig Oct. 15, 2024, 5:50 a.m. UTC | #68

As the current discussion is probably too tiring for anyone who is not a
professional mud fighter I'll summarize my objections and comments here
once again in a way that even aspiring middle managers should be able
understand.

1) While the current per-file temperature hints interface is not perfect
it is okay and make sense to reuse until we need something more fancy.
We make good use of it in f2fs and the upcoming zoned xfs code to help
with data placement and have numbers to show that it helps.

2) A per-I/O interface to set these temperature hint conflicts badly
with how placement works in file systems.  If we have an urgent need
for it on the block device it needs to be opt-in by the file operations
so it can be enabled on block device, but not on file systems by
default.  This way you can implement it for block device, but not
provide it on file systems by default.  If a given file system finds
a way to implement it it can still opt into implementing it of course.

3) Mapping from temperature hints to separate write streams needs to
happen above the block layer, because file systems need to be in
control of it to do intelligent placement.  That means if you want to
map from temperature hints to stream separation it needs to be
implemented at the file operation layer, not in the device driver.
The mapping implemented in this series is probably only useful for
block devices.  Maybe if dumb file systems want to adopt it, it could
be split into library code for reuse, but as usual that's probably
best done only when actually needed.

4) To support this the block layer, that is bios and requests need
to support a notion of stream separation.   Kanchan's previous series
had most of the bits for that, it just needs to be iterated on.

All of this could have probably be easily done in the time spent on
this discussion.

Keith Busch Oct. 15, 2024, 3:09 p.m. UTC | #69

On Tue, Oct 15, 2024 at 07:50:06AM +0200, Christoph Hellwig wrote:
> 1) While the current per-file temperature hints interface is not perfect
> it is okay and make sense to reuse until we need something more fancy.
> We make good use of it in f2fs and the upcoming zoned xfs code to help
> with data placement and have numbers to show that it helps.

So we're okay to proceed with patch 1?
 
> 2) A per-I/O interface to set these temperature hint conflicts badly
> with how placement works in file systems.  If we have an urgent need
> for it on the block device it needs to be opt-in by the file operations
> so it can be enabled on block device, but not on file systems by
> default.  This way you can implement it for block device, but not
> provide it on file systems by default.  If a given file system finds
> a way to implement it it can still opt into implementing it of course.

If we add a new fop_flag that only block fops enables, then it's okay?

> 3) Mapping from temperature hints to separate write streams needs to
> happen above the block layer, because file systems need to be in
> control of it to do intelligent placement.  That means if you want to
> map from temperature hints to stream separation it needs to be
> implemented at the file operation layer, not in the device driver.
> The mapping implemented in this series is probably only useful for
> block devices.  Maybe if dumb file systems want to adopt it, it could
> be split into library code for reuse, but as usual that's probably
> best done only when actually needed.

IMO, I don't even think the io_uring per-io hint needs to be limited to
the fcntl lifetime values. It could just be a u16 value opaque to the
block layer that just gets forwarded to the device.

> 4) To support this the block layer, that is bios and requests need
> to support a notion of stream separation.   Kanchan's previous series
> had most of the bits for that, it just needs to be iterated on.
> 
> All of this could have probably be easily done in the time spent on
> this discussion.

Christoph Hellwig Oct. 15, 2024, 3:22 p.m. UTC | #70

On Tue, Oct 15, 2024 at 09:09:20AM -0600, Keith Busch wrote:
> On Tue, Oct 15, 2024 at 07:50:06AM +0200, Christoph Hellwig wrote:
> > 1) While the current per-file temperature hints interface is not perfect
> > it is okay and make sense to reuse until we need something more fancy.
> > We make good use of it in f2fs and the upcoming zoned xfs code to help
> > with data placement and have numbers to show that it helps.
> 
> So we're okay to proceed with patch 1?

No, see point 3 and 4 below for why.  We'll need something like the
interface you suggested by me in point 4 and by you in reply to point 3
in the block layer, and then block/fops.c can implement the mapping on
top of that for drivers supporting it.

>  
> > 2) A per-I/O interface to set these temperature hint conflicts badly
> > with how placement works in file systems.  If we have an urgent need
> > for it on the block device it needs to be opt-in by the file operations
> > so it can be enabled on block device, but not on file systems by
> > default.  This way you can implement it for block device, but not
> > provide it on file systems by default.  If a given file system finds
> > a way to implement it it can still opt into implementing it of course.
> 
> If we add a new fop_flag that only block fops enables, then it's okay?

The flag is just one part of it.  Of course it need to be discoverable
from userspace in one way or another, and the marshalling of the flag
needs to be controller by the file system / fops instance.

> > 3) Mapping from temperature hints to separate write streams needs to
> > happen above the block layer, because file systems need to be in
> > control of it to do intelligent placement.  That means if you want to
> > map from temperature hints to stream separation it needs to be
> > implemented at the file operation layer, not in the device driver.
> > The mapping implemented in this series is probably only useful for
> > block devices.  Maybe if dumb file systems want to adopt it, it could
> > be split into library code for reuse, but as usual that's probably
> > best done only when actually needed.
> 
> IMO, I don't even think the io_uring per-io hint needs to be limited to
> the fcntl lifetime values. It could just be a u16 value opaque to the
> block layer that just gets forwarded to the device.

Well, that's what I've been arguing for all the time, and what Kanchan's
previous series was working towards.  It's not quite as trivial as
we need a bit more than just the stream, e.g. a way to discover how many
of them exist.

> > 4) To support this the block layer, that is bios and requests need
> > to support a notion of stream separation.   Kanchan's previous series
> > had most of the bits for that, it just needs to be iterated on.
> > 
> > All of this could have probably be easily done in the time spent on
> > this discussion.
---end quoted text---

Kanchan Joshi Oct. 17, 2024, 2:35 p.m. UTC | #71

Seems per-I/O hints are not getting the love they deserve.
Apart from the block device, the usecase is when all I/Os of VM (or 
container) are to be grouped together or placed differently.

Per-IO hints are fine-granular (enough for userspace to build 
per-process/vm/file/whatever etc.) and add the flexibility we have 
lacked so far.
As for conflict, I doubt if that exists. Please see below:

> 2) A per-I/O interface to set these temperature hint conflicts badly
> with how placement works in file systems.  If we have an urgent need
> for it on the block device it needs to be opt-in by the file operations
> so it can be enabled on block device, but not on file systems by
> default.  This way you can implement it for block device, but not
> provide it on file systems by default.  If a given file system finds
> a way to implement it it can still opt into implementing it of course.

Why do you see this as something that is so different across filesystems 
that they would need to "find a way to implement"?
Both per-file and per-io hints are supplied by userspace. Inode and 
kiocb only happen to be the mean to receive the hint information.
FS is free to use this information (iff it wants) or simply forward this 
down.

Per-file hint just gets stored (within inode) without individual FS 
involvement. Per-io hint follows the same model (i.e., it is set by 
upper layer like io_uring/aio) and uses kiocb to store the hint. It does 
not alter the stored inode hint value!

After patch #3, filesystems have both per-file and per-io information 
available. And as before, they can use that hint info (from kiocb or 
inode) and/or simply forward.

The generic code (like fs/direct-io.c, fs/iomap/direct-io.c etc.,) 
already forwards the incoming hints, without any intelligence.
And we need just that because with user-passed hints, the onus of 
intelligence is on the userspace. This is how hints work on 
xfs/ext4/btrfs files at this point.

The owner of the file (filesystem, block, whatever) can still use the 
incoming hints to do any custom/extra feature. Either from inode or from 
kiocb depending on what information (per-file or per-io hint) it finds 
more useful for that custom feature. For example, F2FS is using 
per-inode information to do something custom and that part has not been 
changed by these patches.

Overall, I do not see the conflict. It's all user-driven. No?

For the currently uncommon case when FS decides to generate its own 
hints (as opposed to userspace hint consumer/forwarder), it can directly 
put the hint value in bio.

Christoph Hellwig Oct. 17, 2024, 3:23 p.m. UTC | #72

On Thu, Oct 17, 2024 at 08:05:38PM +0530, Kanchan Joshi wrote:
> Seems per-I/O hints are not getting the love they deserve.
> Apart from the block device, the usecase is when all I/Os of VM (or 
> container) are to be grouped together or placed differently.

But that assumes the file system could actually support it.  Which
is hard when you don't assume the file system isn't simply a passthrough
entity, which will not give you great results.

> > 2) A per-I/O interface to set these temperature hint conflicts badly
> > with how placement works in file systems.  If we have an urgent need
> > for it on the block device it needs to be opt-in by the file operations
> > so it can be enabled on block device, but not on file systems by
> > default.  This way you can implement it for block device, but not
> > provide it on file systems by default.  If a given file system finds
> > a way to implement it it can still opt into implementing it of course.
> 
> Why do you see this as something that is so different across filesystems 
> that they would need to "find a way to implement"?

If you want to do useful stream separation you need to write data
sequentially into the stream.  Now with streams or FDP that does not
actually imply sequentially in LBA space, but if you want the file
system to not actually deal with fragmentation from hell, and be
easily track what is grouped together you really want it sequentially
in the LBA space as well.  In other words, any kind of write placement
needs to be intimately tied to the file system block allocator.

> Both per-file and per-io hints are supplied by userspace. Inode and 
> kiocb only happen to be the mean to receive the hint information.
> FS is free to use this information (iff it wants) or simply forward this 
> down.

As mentioned above just passing it down is not actually very useful.
It might give you nice benchmark numbers when you basically reimplement
space management in userspace on a fully preallocated file, but for that
you're better of just using the block device.  If you actually want
to treat the files as files you need full file system involvement.

> Per-file hint just gets stored (within inode) without individual FS 
> involvement. Per-io hint follows the same model (i.e., it is set by 
> upper layer like io_uring/aio) and uses kiocb to store the hint. It does 
> not alter the stored inode hint value!

Yes, and now you'll get complaints that the file system ignores it
when it can't properly support it.  This is why we need a per-fop
opt in.

> The generic code (like fs/direct-io.c, fs/iomap/direct-io.c etc.,) 
> already forwards the incoming hints, without any intelligence.

Yes, and that is a problem.  We stopped doing that, but Samsung sneaked
some of this back in recently as I noticed.

> Overall, I do not see the conflict. It's all user-driven. No?

I have the gut feeling that you've just run benchmarks on image files
emulating block devices and not actually tried real file system workloads
based on this unfortunately.

Keith Busch Oct. 17, 2024, 3:44 p.m. UTC | #73

On Thu, Oct 17, 2024 at 05:23:37PM +0200, Christoph Hellwig wrote:
> If you want to do useful stream separation you need to write data
> sequentially into the stream.  Now with streams or FDP that does not
> actually imply sequentially in LBA space, but if you want the file
> system to not actually deal with fragmentation from hell, and be
> easily track what is grouped together you really want it sequentially
> in the LBA space as well.  In other words, any kind of write placement
> needs to be intimately tied to the file system block allocator.

I'm replying just to make sure I understand what you're saying:

If we send per IO hints on a file, we could have interleaved hot and
cold pages at various offsets of that file, so the filesystem needs an
efficient way to allocate extents and track these so that it doesn't
interleave these in LBA space. I think that makes sense.

We can add a fop_flags and block/fops.c can be the first one to turn it
on since that LBA access is entirely user driven.

Christoph Hellwig Oct. 17, 2024, 3:46 p.m. UTC | #74

On Thu, Oct 17, 2024 at 09:44:39AM -0600, Keith Busch wrote:
> I'm replying just to make sure I understand what you're saying:
> 
> If we send per IO hints on a file, we could have interleaved hot and
> cold pages at various offsets of that file, so the filesystem needs an
> efficient way to allocate extents and track these so that it doesn't
> interleave these in LBA space. I think that makes sense.

That's a little simplified, but yes.

> We can add a fop_flags and block/fops.c can be the first one to turn it
> on since that LBA access is entirely user driven.

Yes, that's my main request on the per-I/O hint interface.

Now we just need to not dumb down the bio level interface to five
temeperature level and just expose the different write streams and we
can all have a happy Kumbaya.

Keith Busch Oct. 17, 2024, 4:06 p.m. UTC | #75

On Thu, Oct 17, 2024 at 05:46:49PM +0200, Christoph Hellwig wrote:
> On Thu, Oct 17, 2024 at 09:44:39AM -0600, Keith Busch wrote:
> 
> > We can add a fop_flags and block/fops.c can be the first one to turn it
> > on since that LBA access is entirely user driven.
> 
> Yes, that's my main request on the per-I/O hint interface.
> 
> Now we just need to not dumb down the bio level interface to five
> temeperature level and just expose the different write streams and we
> can all have a happy Kumbaya.

I'm sending a revised version of this patch set that attempts to address
these issues.

The io_uring hint is weird to me, so that part is simplified just to get
the point across.

Bart Van Assche Oct. 17, 2024, 4:15 p.m. UTC | #76

On 10/17/24 8:44 AM, Keith Busch wrote:
> On Thu, Oct 17, 2024 at 05:23:37PM +0200, Christoph Hellwig wrote:
>> If you want to do useful stream separation you need to write data
>> sequentially into the stream.  Now with streams or FDP that does not
>> actually imply sequentially in LBA space, but if you want the file
>> system to not actually deal with fragmentation from hell, and be
>> easily track what is grouped together you really want it sequentially
>> in the LBA space as well.  In other words, any kind of write placement
>> needs to be intimately tied to the file system block allocator.
> 
> I'm replying just to make sure I understand what you're saying:
> 
> If we send per IO hints on a file, we could have interleaved hot and
> cold pages at various offsets of that file, so the filesystem needs an
> efficient way to allocate extents and track these so that it doesn't
> interleave these in LBA space. I think that makes sense.
> 
> We can add a fop_flags and block/fops.c can be the first one to turn it
> on since that LBA access is entirely user driven.

Does anyone care about buffered I/O to block devices? When using
buffered I/O, the write_hint information from the inode is used and the 
per I/O write_hint information is ignored.

Thanks,

Bart.

Keith Busch Oct. 17, 2024, 4:23 p.m. UTC | #77

On Thu, Oct 17, 2024 at 09:15:21AM -0700, Bart Van Assche wrote:
> On 10/17/24 8:44 AM, Keith Busch wrote:
> > On Thu, Oct 17, 2024 at 05:23:37PM +0200, Christoph Hellwig wrote:
> > > If you want to do useful stream separation you need to write data
> > > sequentially into the stream.  Now with streams or FDP that does not
> > > actually imply sequentially in LBA space, but if you want the file
> > > system to not actually deal with fragmentation from hell, and be
> > > easily track what is grouped together you really want it sequentially
> > > in the LBA space as well.  In other words, any kind of write placement
> > > needs to be intimately tied to the file system block allocator.
> > 
> > I'm replying just to make sure I understand what you're saying:
> > 
> > If we send per IO hints on a file, we could have interleaved hot and
> > cold pages at various offsets of that file, so the filesystem needs an
> > efficient way to allocate extents and track these so that it doesn't
> > interleave these in LBA space. I think that makes sense.
> > 
> > We can add a fop_flags and block/fops.c can be the first one to turn it
> > on since that LBA access is entirely user driven.
> 
> Does anyone care about buffered I/O to block devices? When using
> buffered I/O, the write_hint information from the inode is used and the per
> I/O write_hint information is ignored.

I'm pretty sure there are applications that use buffered IO on raw block
(ex: postgresql), but it's a moot point: the block file_operations that
provide the fops_flags also provide the callbacks for O_DIRECT, which is
where this matters.

We can't really use per-io write_hints on buffered-io. At least not yet,
and maybe never. I'm not sure if it makes sense for raw block because
the page writes won't necessarily match writes to storage.

[v7,0/3] FDP and per-io hints

Message

Comments