Message ID: 20240930181305.17286-1-joshi.k@samsung.com
Series: FDP and per-io hints
Any reason you completely ignored my feedback on the last version and did not even answer? That's not a very productive way to work.
Now THIS Is the NVMe I came to know and love.
On 10/1/24 3:20 AM, Christoph Hellwig wrote:
> Any reason you completely ignored my feedback on the last version
> and did not even answer?

Probably because he got a little tired of dealing with the bullshit related to this topic.

> That's not a very productive way to work.

Have to say, neither have your responses been. Can't really fault people for just giving up at some point, when no productive end seems to be in sight.

As far as I'm concerned, this looks fine to me. There are customers wanting to use it, and folks making drives that support it. It's not right to continually gatekeep this feature just because you don't like it. I need to review the io_uring bits, but that should be the least of it.
On Tue, Oct 01, 2024 at 11:20:48AM +0200, Christoph Hellwig wrote:
> Any reason you completely ignored my feedback on the last version
> and did not even answer?
>
> That's not a very productive way to work.

I think because he's getting conflicting feedback. The arguments against it are that it's not a good match; however, it's the same match created for streams, no one complained then, and it's an existing user ABI. FDP is the new "streams", so I really don't see a big deal here.

I understand you have use cases that have filesystems intercept the hints for different types of devices, but I don't see these use cases as mutually exclusive. Supporting this one doesn't prevent other uses.
On Tue, Oct 01, 2024 at 10:23:43AM -0600, Keith Busch wrote:
> I think because he's getting conflicting feedback. The arguments against
> it being that it's not a good match, however it's the same match created
> for streams, and no one complained then, and it's an existing user ABI.

People complained, and the streams code got removed pretty quickly because it did not work as expected. I don't think that counts as a big success.
On Tue, Oct 01, 2024 at 10:18:41AM -0600, Jens Axboe wrote:
> Have to say, that neither have your responses been. Can't really fault
> people for just giving up at some point, when no productive end seems to
> be in sight.

I heavily disagree and take offence at that.

The previous stream separation approach made total sense, but just needed a fair amount of work. It closely matches how things work at the hardware and file system level, so it was the right approach.

Suddenly dropping that and ignoring all comments really feels like someone urgently needs to pull a marketing stunt here.
On Wed, Oct 02, 2024 at 09:49:26AM +0200, Christoph Hellwig wrote:
> On Tue, Oct 01, 2024 at 10:23:43AM -0600, Keith Busch wrote:
> > I think because he's getting conflicting feedback. The arguments against
> > it being that it's not a good match, however it's the same match created
> > for streams, and no one complained then, and it's an existing user ABI.
>
> People complained, and the streams code got removed pretty quickly
> because it did not work as expected. I don't think that counts as
> a big success.

I don't think the kernel API was the problem. Capable devices never materialized, so the code wasn't doing anything useful.
On 10/2/24 8:56 AM, Keith Busch wrote:
> On Wed, Oct 02, 2024 at 09:49:26AM +0200, Christoph Hellwig wrote:
>> People complained, and the streams code got removed pretty quickly
>> because it did not work as expected. I don't think that counts as
>> a big success.
>
> I don't think the kernel API was the problem. Capable devices never
> materialized, so the code wasn't doing anything useful.

Exactly. I never saw ones that kept the stream persistent across GC, and hence they ended up being pretty useless in practice.
On 10/2/24 1:51 AM, Christoph Hellwig wrote:
> I heavily disagree and take offence on that.
>
> The previous stream separation approach made total sense, but just
> needed a fair amount of work. But it closely matches how things work
> at the hardware and file system level, so it was the right approach.

What am I missing that makes this effort that different from streams? Both are a lifetime hint.

> Suddenly dropping that and ignoring all comments really feels like
> someone urgently needs to pull a marketing stunt here.

I think someone is just trying to finally get this feature in, so it can get used by customers, after many months of fairly unproductive discussion.
On Wed, Oct 02, 2024 at 09:03:22AM -0600, Jens Axboe wrote:
> What am I missing that makes this effort that different from streams?
> Both are a lifetime hint.

A stream has a lot fewer strings attached. But hey, don't make me argue for streams - you pushed the API in despite reservations back then, and we've learned a lot since.

> I think someone is just trying to finally get this feature in, so it
> can get used by customers, after many months of fairly unproductive
> discussion.

Well, he finally started on the right approach and gave it up after the first round of feedback. That's just a bit weird.
On Wed, Oct 02, 2024 at 05:13:44PM +0200, Christoph Hellwig wrote:
> Well, he finally started on the right approach and gave it up after the
> first round of feedback. That's just a bit weird.

Nothing prevents future improvements in that direction. It just seems out of scope for what Kanchan is trying to enable for his customer use cases. This patch looks harmless.
On Wed, Oct 02, 2024 at 09:17:35AM -0600, Keith Busch wrote:
> Nothing prevents future improvements in that direction. It just seems
> out of scope for what Kanchan is trying to enable for his customer use
> cases. This patch looks harmless.

It's not, really. Once we wire it up like this we mess up the ability to use the feature in other ways. Additionally, the per-I/O hints are simply broken if you want a file system in the way, same as last time - which, if I don't misremember, you acknowledged in person last week.
On 10/2/24 9:13 AM, Christoph Hellwig wrote:
> A stream has a lot less strings attached. But hey, don't make me
> argue for streams - you pushed the API in despite reservations back
> then, and we've learned a lot since.

Hah, I remember arguing for it back then! It was just a pain to use, but thankfully we could kill it when devices didn't materialize with a quality of implementation that made them useful. And then we just yanked it. Nothing is forever, and hints in particular are something that can always be ignored. Same for this.

> Well, he finally started on the right approach and gave it up after the
> first round of feedback. That's just a bit weird.

I'm a big fan of keep-it-simple-stupid, and then you can improve it down the line. I think the simple approach stands just fine alone.
On Wed, Oct 02, 2024 at 05:19:49PM +0200, Christoph Hellwig wrote:
> It's not really. Once we wire it up like this we mess up the ability
> to use the feature in other ways. Additionally the per-I/O hints are
> simply broken if you want a file system in the way as last time,
> and if I don't misremember you acknowledged in person last week.

You remember right. I also explained the use cases for per-io hints are to replace current passthrough users with generic read/write paths that have all the memory guard rails, read/write stats, and other features that you don't get with passthrough. This just lets you accomplish the same placement hints with the nvme raw block device that you get with the nvme char device.
Christoph,

> It's not really. Once we wire it up like this we mess up the ability
> to use the feature in other ways. Additionally the per-I/O hints are
> simply broken if you want a file system

Here is my take:

It is the kernel's job to manage the system's hardware resources and arbitrate and share these resources optimally and fairly between all running applications.

What irks me is defining application interfaces which fundamentally tell the kernel that "these blocks are part of the same file". The kernel already knows this. It is the very entity which provides that abstraction. Why do we need an explicit interface to inform the kernel that concurrent writes to the same file should have the same "temperature" or need to go to the same "bin" on the storage device? Shouldn't that just happen automatically?

Whether it's SCSI groups, streams, UFS hints, or NVMe FDP, it seems like we are consistently failing to deliver something that actually works for anything but a few specialized corner cases. I think that is a shame.
On 10/2/24 8:47 AM, Martin K. Petersen wrote:
> What irks me is defining application interfaces which fundamentally tell
> the kernel that "these blocks are part of the same file".

Isn't FDP about communicating much more than only this information to the block device, e.g. information about reclaim units? Although I'm personally not interested in FDP, my colleagues were involved in the standardization of FDP.

Thanks,

Bart.
On 9/30/24 11:13 AM, Kanchan Joshi wrote:
> Another spin to incorporate the feedback from LPC and previous
> iterations. The series adds two capabilities:
> - FDP support at NVMe level (patch #1)
> - Per-io hinting via io_uring (patch #3)
> Patch #2 is needed to do per-io hints.
>
> The motivation and interface details are present in the commit
> descriptions.

Something is missing from the patch descriptions: what is the goal of this patch series? Is the goal to support FDP in filesystems in the kernel, is it to support filesystems implemented in user space, or perhaps something else? I think this should be mentioned in the cover letter.

Thanks,

Bart.
On Wed, Oct 02, 2024 at 09:33:16AM -0600, Keith Busch wrote:
> You remember right. I also explained the use cases for per-io hints are
> to replace current passthrough users with generic read/write paths that
> have all the memory guard rails, read/write stats, and other features
> that you don't get with passthrough. This just lets you accomplish the
> same placement hints with the nvme raw block device that you get with
> the nvme char device.

Yes, and as I said before, we need an opt-in from the file_ops, because file systems can't support this. And I'm speaking from a position of having worked on a file system that made the existing non-ideal hints work (to be fair, Hans did the hard work, but I was part of it).
On Wed, Oct 02, 2024 at 11:47:32AM -0400, Martin K. Petersen wrote:
> It is the kernel's job to manage the system's hardware resources and
> arbitrate and share these resources optimally and fairly between all
> running applications.

Exactly.

> What irks me is defining application interfaces which fundamentally tell
> the kernel that "these blocks are part of the same file". The kernel
> already knows this. It is the very entity which provides that
> abstraction. Why do we need an explicit interface to inform the kernel
> that concurrent writes to the same file should have the same
> "temperature" or need to go to the same "bin" on the storage device?
> Shouldn't that just happen automatically?

For a file: yes. The problem is when you have more files than buckets on the device or file system. Typical enterprise SSDs support somewhere between 8 and 16 write streams, and there typically is more data than that. So trying to group it somehow is a good idea, as not all files can have their own bucket. Allowing this inside a file, like done in this patch set, on the other hand is pretty crazy.

> Whether it's SCSI groups, streams, UFS hints, or NVMe FDP, it seems like
> we are consistently failing to deliver something that actually works for
> anything but a few specialized corner cases. I think that is a shame.

Yes, it is. And as someone who has been sitting in this group, that is because it's always someone in a place of power forcing down their version because they got a promotion, best of show award or whatever.
On Wed, Oct 02, 2024 at 11:34:47AM -0700, Bart Van Assche wrote:
> Isn't FDP about communicating much more than only this information to
> the block device, e.g. information about reclaim units? Although I'm
> personally not interested in FDP, my colleagues were involved in the
> standardization of FDP.

Yes, it is. And when I explained how properly exporting this kind of information can be implemented on top of the version Kanchan sent, everyone suddenly stopped discussing technical points and went either silent or all political.

So I think some people's bonuses depend on not understanding the problem, I fear :(
On Thu, Oct 03, 2024 at 02:55:16PM +0200, Christoph Hellwig wrote:
> Yes, it is. And when I explained how properly exporting this kind of
> information can be implemented on top of the version Kanchan sent,
> everyone suddenly stopped discussing technical points and went either
> silent or all political.

The nominals can mean whatever you want. If you want them to mean "temperature", then that's what they mean. If you want them to mean something else, then don't use this. These are hints at the end of the day. Nothing locks the kernel into this if a better solution develops. As you know, it was ripped out before.

> So I think some people's bonuses depend on not understanding the
> problem, I fear :(

The only "bonus" I have is not repeatedly explaining why people can't use h/w features the way they want.
On 10/3/24 2:48 PM, Keith Busch wrote:
> The only "bonus" I have is not repeatedly explaining why people can't
> use h/w features the way they want.

Hi Keith,

Although that's a fair argument, what are the use cases for this patch series? Filesystems in the kernel? Filesystems implemented in user space? Perhaps something else?

This patch series adds a new user space interface for passing hints to storage devices (in io_uring). As we all know, such interfaces are hard to remove once they have been added.

We don't need new user space interfaces to support FDP for filesystems in the kernel.

For filesystems implemented in user space, would using NVMe pass-through be a viable approach? With this approach, no new user space interfaces have to be added.

I'm wondering how to unblock FDP users without adding a new controversial mechanism in the kernel.

Thanks,

Bart.
On 10/3/24 4:00 PM, Bart Van Assche wrote:
> This patch series adds a new user space interface for passing hints
> to storage devices (in io_uring). As we all know, such interfaces are
> hard to remove once they have been added.

It's a _hint_, I'm not sure how many times that has to be stated. The kernel is free to ignore it, and in the future, it may very well do that. We already had fcntl interfaces for streams, in the same vein, and we yanked those in the sense that they ended up doing _nothing_. Did things break because of that? Of course not. This is no different.

> For filesystems implemented in user space, would using NVMe pass-through
> be a viable approach? With this approach, no new user space interfaces
> have to be added.
>
> I'm wondering how to unblock FDP users without adding a new
> controversial mechanism in the kernel.

Pass-through already works, obviously - this is more about adding a viable real interface for it. If it's a feature that is in devices AND customers want to use it, then pointing them at pass-through is a cop-out.
On 10/3/24 6:54 AM, Christoph Hellwig wrote:
> For file: yes. The problem is when you have more files than buckets on
> the device or file systems. Typical enterprise SSDs support somewhere
> between 8 and 16 write streams, and there typically is more data than
> that. So trying to group it somehow is good idea as not all files can
> have their own bucket.
>
> Allowing this inside a file like done in this patch set on the other
> hand is pretty crazy.

I do agree that per-file hints are not ideal. In the spirit of making some progress, how about we just retain per-io hints initially? We can certainly make that work over dio. Yes, buffered IO won't work initially, but at least we're getting somewhere.
On Thu, Oct 03, 2024 at 03:00:21PM -0700, Bart Van Assche wrote:
> For filesystems implemented in user space, would using NVMe pass-through
> be a viable approach? With this approach, no new user space interfaces
> have to be added.
>
> I'm wondering how to unblock FDP users without adding a new
> controversial mechanism in the kernel.

I really like the passthrough interface. It is flexible and enables experimenting with all sorts of features. But it exposes exploits and pitfalls. Just check the CVEs against it (ex: CVE-2023-6238; note: the same problem exists without metadata). The consequences of misuse include memory corruption and private memory leakage. The barriers for exploits are too low, and the potential for mistakes is too high. A possible h/w solution requires an IOMMU, but that's a non-starter for many users and only solves the private memory vulnerabilities.

Passthrough also skips all the block layer features. I just today added statistics to it, staged for 6.13, but even that requires yet another config step to enable.

tl;dr: no one wants to use nvme passthrough in production long term.
On Thu, Oct 03, 2024 at 04:14:57PM -0600, Jens Axboe wrote:
> I do agree that per-file hints are not ideal. In the spirit of making
> some progress, how about we just retain per-io hints initially? We can
> certainly make that work over dio. Yes buffered IO won't work initially,
> but at least we're getting somewhere.

Huh? Per-I/O hints at the syscall level are the problem (see also the reply from Martin). Per-file makes total sense, but we need the file system in control.

The real problem is further down the stack. For the SCSI temperature hints, just passing them on makes sense. But when you map to some kind of stream separation in the device, no matter if that is streams, FDP, or the various kinds of streams we don't even support in things like CF and SD cards, the driver is not the right place to map temperature hints to streams. That requires some kind of intelligence. It could be dirt simple and just do a best-effort mapping of the temperature hints 1:1 to separate write streams, or do a little folding if there are not enough of them, which should work fine for a raw block device.

But once we have a file system, things get more complicated:

 - the file system will want its own streams for metadata and GC
 - even with that, on beefy enough hardware you can have more streams
   than temperature levels, and the file system can and should do
   intelligent placement (usually based on files)

Or to summarize: the per-file temperature hints make sense as a user interface. Per-I/O hints tend to be really messy, at least if a file system is involved. Placing the temperatures into separate write streams in the driver does not scale even to the most trivial write-stream-aware file system implementations.

And for anyone who followed the previous discussions of the patches, none of this should be new; each point has been made at least three times before.
On 04.10.2024 07:31, Christoph Hellwig wrote:
> Huh? Per I/O hints at the syscall level are the problem (see also the
> reply from Martin). Per file make total sense, but we need the file
> system in control.
>
> [...]
>
> Or to summarize: the per-file temperature hints make sense as a user
> interface. Per-I/O hints tend to be really messy at least if a file
> system is involved. Placing the temperatures to separate write streams
> in the driver does not scale even to the most trivial write stream
> aware file system implementations.
>
> And for anyone who followed the previous discussions of the patches
> none of this should been new, each point has been made at least three
> times before.

Looking at the work you and Hans have been doing on XFS, it seems you have been successful at mapping the semantics of the temperature to zones (which have no semantics, just as FDP). What is the difference between the mapping for zones and for FDP?

The whole point is using an existing interface to cover the use case of people wanting hints on block. If you have a use case for enabling hints on file, let's work on this afterwards.
On 03.10.2024 14:55, Christoph Hellwig wrote:
> So I think some people's bonuses depend on not understanding the
> problem, I fear :(

Please, don't.

Childish comments like this delegitimize the work that a lot of people are doing in Linux. We all operate under the assumption that folks here know how to wear two hats, and that there are no company-specific incentives either to push _or_ to block upstream contributions without a use case and technical background.
On Fri, Oct 04, 2024 at 08:21:29AM +0200, Javier González wrote:
> Please, don't.
>
> Childish comments like this delegitimize the work that a lot of people
> are doing in Linux.

It's the only, very sarcastic, explanation I can come up with for this discussion, where people from exactly the two companies where this might be bonus material love political discussion and drop dead when it turns technical. I'm probably wrong, but I've run out of other explanations.
On Fri, Oct 04, 2024 at 08:18:11AM +0200, Javier González wrote:
> Looking at the work you and Hans have been doing on XFS, it seems you
> have been successful at mapping the semantics of the temperature to
> zones (which has no semantics, just as FDP).
>
> What is the difference between the mapping in zones and for FDP?

Probably not much, except for all the pitfalls in FDP's not-quite-hint, not-mandatory design.

> The whole point is using an existing interface to cover the use-case of
> people wanting hints in block.

And that's fine. And the point all the way back for months is that doing a completely dumb mapping in the driver for that is fundamentally wrong. Kanchan's previous series did about 50% of the work to get rid of it, but then everyone dropped dead and played politics.
On 04.10.2024 08:27, Christoph Hellwig wrote:
>> What is the difference between the mapping in zones and for FDP?
>
> Probably not much, except for all the pitfalls in the FDP not quite hint
> not madatory design.

It is too late to change the first version of FDP products, and the solution of "go back to NVMe and make it match SCSI" is not realistic. We have drives from several companies out there supporting this version of FDP. And we have customers that are actively using them. It is our collective responsibility to support them.

So, considering that file systems _are_ able to use temperature hints and actually make them work, why don't we support FDP the same way we are supporting zones, so that people can use it in production?

I agree that down the road, an interface that allows hints (many more than 5!) is needed. And in my opinion, this interface should not have semantics attached to it, just a hint ID, #hints, and enough space to put 100s of them to support storage node deployments. But this needs to come from the users of the hints / zones / streams / etc., not from us vendors. We have neither details on how they deploy these features at scale, nor the workloads to validate the results. Anything else will probably just continue polluting the storage stack with more interfaces that are not used and add to the problem of data placement fragmentation.

> And that's fine. And the point all the way back for months is that
> doing a completely dumb mapping in the driver for that is fundamentally
> wrong. Kanchan's previous series did about 50% of the work to get
> rid of it, but then everyone dropped dead and played politics.

The issue is that the first series of this patch, which is as simple as it gets, hit the list in May. Since then we have gone down paths that lead nowhere. So the line between real technical feedback that leads to a feature being merged, and technical misleading that keeps people busy, becomes very thin. In the whole data placement effort, we have been down this path many times, unfortunately...

At LPC, we discussed the work you did in XFS, and it became clear that coming back to the first version of the patches was the easiest way to support the use case that current customers have. It is a pity you did not attend the conference and could not make a point against this line of thought there.

So, summarizing: all that folks in this thread are asking for is a path forward for solving the block use case, which is the only one that I am aware of requiring kernel support. This has been validated with real workloads, so it is very specific and tangible.
On 04.10.2024 08:24, Christoph Hellwig wrote:
>> Childish comments like this delegitimize the work that a lot of people
>> are doing in Linux.
>
> It's the only very sarcastic explanation I can come up with to explain
> this discussion, where people from the exactly two companies where this
> might be bonus material love political discussion and drop dead when it
> turns technical.

FDP has authors from Meta, Google, Kioxia, Micron, Hynix, Solidigm, Microchip, Marvell, FADU, WDC, and Samsung. The fact that 2 of these companies are the ones starting to build the Linux ecosystem should not surprise you, as it is the way things normally work.
On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote: > So, considering that file systems _are_ able to use temperature hints and > actually make them work, why don't we support FDP the same way we are > supporting zones so that people can use it in production? Because apparently no one has tried it. It should be possible in theory, but for example unless you have a power of two reclaim unit size it won't work very well with XFS, where the AGs/RTGs must be power of two aligned in the LBA space, except by overprovisioning the LBA space vs the capacity actually used. > I agree that down the road, an interface that allows hints (many more > than 5!) is needed. And in my opinion, this interface should not have > semantics attached to it, just a hint ID, #hints, and enough space to > put 100s of them to support storage node deployments. But this needs to > come from the users of the hints / zones / streams / etc, not from > us vendors. We have neither details on how they deploy these > features at scale, nor the workloads to validate the results. Anything > else will probably just continue polluting the storage stack with more > interfaces that are not used and add to the problem of data placement > fragmentation. Please always mention which layer you are talking about. At the syscall level the temperature hints are doing quite ok. A full stream separation would obviously be a lot better, as would be communicating the zone / reclaim unit / etc size. As an interface to a driver that doesn't natively speak temperature hints, on the other hand, it doesn't work at all. > The issue is that the first series of this patch, which is as simple as > it gets, hit the list in May. Since then we have gone down paths that lead > nowhere. So the line between real technical feedback that leads to > a feature being merged, and technical misleading to make people be a > busy bee becomes very thin. 
> In the whole data placement effort, we have been down this path many times, unfortunately... Well, the previous round was the first one actually trying to address the fundamental issue after 4 months. And then, after a first round of feedback, it gets shut down somehow out of nowhere. As a maintainer and reviewer, that is the kind of contributor I have a hard time taking seriously.
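To make the alignment concern raised above concrete, here is a small sketch (illustrative numbers only, not from any real device or filesystem layout): with a power-of-two reclaim unit size, power-of-two-aligned allocation groups nest cleanly inside reclaim units, while with a non-power-of-two reclaim unit size a fixed fraction of the AGs inevitably straddle reclaim unit boundaries.

```python
def boundary_crossings(ag_size, ru_size, n_ags):
    """Count allocation groups whose LBA range straddles a reclaim
    unit boundary, given sizes in bytes."""
    crossings = 0
    for ag in range(n_ags):
        start = ag * ag_size
        end = start + ag_size - 1
        # The AG straddles a boundary if its first and last byte fall
        # into different reclaim units.
        if start // ru_size != end // ru_size:
            crossings += 1
    return crossings

GiB = 1 << 30
# Power-of-two reclaim unit: 4 GiB AGs nest cleanly in 8 GiB RUs.
print(boundary_crossings(4 * GiB, 8 * GiB, 16))  # → 0
# Non-power-of-two reclaim unit (6 GiB): every third AG straddles an RU.
print(boundary_crossings(4 * GiB, 6 * GiB, 16))  # → 5
```

This is why the options mentioned are either rounding the reclaim unit size down to a power of two (overprovisioning the LBA space) or accepting that writes to one AG hit two reclaim units.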
On Fri, Oct 04, 2024 at 08:59:23AM +0200, Javier González wrote: > FDP has authors from Meta, Google, Kioxia, Micron, Hynix, Solidigm, > Microchip, Marvell, FADU, WDC, and Samsung. > > The fact that 2 of these companies are the ones starting to build the > Linux ecosystem should not surprise you, as it is the way things work > normally. That's not the point. There is one company that drives entirely pointless marketing BS, and that one is pretty central here. The same company that said FDP has absolutely no intent to work on Linux and fought my initial attempt to make the protocol not totally unusable on a layered system. And no, that's not Samsung.
On 04.10.2024 14:30, Christoph Hellwig wrote: >On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote: >> So, considering that file systems _are_ able to use temperature hints and >> actually make them work, why don't we support FDP the same way we are >> supporting zones so that people can use it in production? > >Because apparently no one has tried it. It should be possible in theory, >but for example unless you have a power of two reclaim unit size it >won't work very well with XFS, where the AGs/RTGs must be power of two >aligned in the LBA space, except by overprovisioning the LBA space vs >the capacity actually used. This is good. I think we should have at least an FS POC with data placement support to be able to draw conclusions on what the interface and requirements should be. Until we have that, we can support the use-cases that we know customers are asking for, i.e., block-level hints through the existing temperature API. > >> I agree that down the road, an interface that allows hints (many more >> than 5!) is needed. And in my opinion, this interface should not have >> semantics attached to it, just a hint ID, #hints, and enough space to >> put 100s of them to support storage node deployments. But this needs to >> come from the users of the hints / zones / streams / etc, not from >> us vendors. We have neither details on how they deploy these >> features at scale, nor the workloads to validate the results. Anything >> else will probably just continue polluting the storage stack with more >> interfaces that are not used and add to the problem of data placement >> fragmentation. > >Please always mention which layer you are talking about. At the syscall >level the temperature hints are doing quite ok. A full stream separation >would obviously be a lot better, as would be communicating the zone / >reclaim unit / etc size. I mean at the syscall level. But, as mentioned above, we need to be very sure that we have a clear use-case for that. 
If we continue seeing hints being used in NVMe or other protocols, and the number increases significantly, we can deal with it later on. > >As an interface to a driver that doesn't natively speak temperature >hints, on the other hand, it doesn't work at all. > >> The issue is that the first series of this patch, which is as simple as >> it gets, hit the list in May. Since then we have gone down paths that lead >> nowhere. So the line between real technical feedback that leads to >> a feature being merged, and technical misleading to make people be a >> busy bee becomes very thin. In the whole data placement effort, we have >> been down this path many times, unfortunately... > >Well, the previous round was the first one actually trying to address the >fundamental issue after 4 months. And then, after a first round of feedback, >it gets shut down somehow out of nowhere. As a maintainer and reviewer, >that is the kind of contributor I have a hard time taking seriously. I am not sure I understand what you mean in the last sentence, so I will not respond by filling in the blanks with a bad interpretation. In summary, what we are asking for is to take the patches that cover the current use-case, and work together on what might be needed for better FS support. For this, it seems you and Hans have a good idea of what you want to have based on XFS. We can help review or do part of the work, but trying to guess our way forward will only delay existing customers using existing HW.
On 04.10.2024 14:32, Christoph Hellwig wrote: >On Fri, Oct 04, 2024 at 08:59:23AM +0200, Javier González wrote: >> FDP has authors from Meta, Google, Kioxia, Micron, Hynix, Solidigm, >> Microchip, Marvell, FADU, WDC, and Samsung. >> >> The fact that 2 of these companies are the ones starting to build the >> Linux ecosystem should not surprise you, as it is the way things work >> normally. > >That's not the point. There is one company that drives entirely pointless >marketing BS, and that one is pretty central here. The same company >that said FDP has absolutely no intent to work on Linux and fought my >initial attempt to make the protocol not totally unusable on a layered system. >And no, that's not Samsung. So you had an interaction in the working group, your feedback was not taken into consideration by the authors, and the result is that FDP cannot be supported in Linux as a consequence of that? Come on...
On Mon, Oct 7, 2024 at 12:10 PM Javier González <javier.gonz@samsung.com> wrote: > > On 04.10.2024 14:30, Christoph Hellwig wrote: > >On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote: > >> So, considerign that file system _are_ able to use temperature hints and > >> actually make them work, why don't we support FDP the same way we are > >> supporting zones so that people can use it in production? > > > >Because apparently no one has tried it. It should be possible in theory, > >but for example unless you have power of 2 reclaim unit size size it > >won't work very well with XFS where the AGs/RTGs must be power of two > >aligned in the LBA space, except by overprovisioning the LBA space vs > >the capacity actually used. > > This is good. I think we should have at least a FS POC with data > placement support to be able to drive conclusions on how the interface > and requirements should be. Until we have that, we can support the > use-cases that we know customers are asking for, i.e., block-level hints > through the existing temperature API. > > > > >> I agree that down the road, an interface that allows hints (many more > >> than 5!) is needed. And in my opinion, this interface should not have > >> semintics attached to it, just a hint ID, #hints, and enough space to > >> put 100s of them to support storage node deployments. But this needs to > >> come from the users of the hints / zones / streams / etc, not from > >> us vendors. We do not have neither details on how they deploy these > >> features at scale, nor the workloads to validate the results. Anything > >> else will probably just continue polluting the storage stack with more > >> interfaces that are not used and add to the problem of data placement > >> fragmentation. > > > >Please always mentioned what layer you are talking about. At the syscall > >level the temperatur hints are doing quite ok. 
A full stream separation > >would obviously be a lot better, as would be communicating the zone / > >reclaim unit / etc size. > > I mean at the syscall level. But as mentioned above, we need to be very > sure that we have a clear use-case for that. If we continue seeing hints > being use in NVMe or other protocols, and the number increase > significantly, we can deal with it later on. > > > > >As an interface to a driver that doesn't natively speak temperature > >hint on the other hand it doesn't work at all. > > > >> The issue is that the first series of this patch, which is as simple as > >> it gets, hit the list in May. Since then we are down paths that lead > >> nowhere. So the line between real technical feedback that leads to > >> a feature being merged, and technical misleading to make people be a > >> busy bee becomes very thin. In the whole data placement effort, we have > >> been down this path many times, unfortunately... > > > >Well, the previous round was the first one actually trying to address the > >fundamental issue after 4 month. And then after a first round of feedback > >it gets shutdown somehow out of nowhere. As a maintainer and review that > >is the kinda of contributors I have a hard time taking serious. > > I am not sure I understand what you mean in the last sentece, so I will > not respond filling blanks with a bad interpretation. > > In summary, what we are asking for is to take the patches that cover the > current use-case, and work together on what might be needed for better > FS support. For this, it seems you and Hans have a good idea of what you > want to have based on XFS. We can help review or do part of the work, > but trying to guess our way will only delay existing customers using > existing HW. 
After reading the whole thread, I end up wondering why we need to rush the support for a single use case through instead of putting the architecture in place for properly supporting this new type of hardware from the start, throughout the stack.

Even for user space consumers of raw block devices, is the last version of the patch set good enough?

* It severely cripples the data separation capabilities, as only a handful of data placement buckets are supported.
* It just won't work if there is more than one user application per storage device, as different applications' data streams will be mixed at the nvme driver level.

While Christoph has already outlined what would be desirable from a file system point of view, I don't have the answer to what would be the overall best design for FDP. I would say that it looks to me like we need to consider more than the early adoption use cases and make sure we make the most of the hardware capabilities via logical abstractions that would be compatible with a wider range of storage devices.

Figuring out the right way forward is tricky, but why not just let it take the time that is needed to sort this out while early users explore how to use FDP drives and share the results?
On Mon, Oct 07, 2024 at 12:10:11PM +0200, Javier González wrote: > In summary, what we are asking for is to take the patches that cover the > current use-case, and work together on what might be needed for better > FS support. And I really do not think it is a good idea. For one, it actually works against the stated intent of the FDP spec. Second, extending the hints to be per-I/O in the io_uring patch is actively breaking the nice per-file I/O hint abstraction we have right now, and is really unsuitable when actually used on a file and not just a block device. And if you are only on a block device, I think passthrough of some form is still the far better option, despite the problems with it mentioned by Keith.
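For context, the per-file abstraction referred to here is the F_SET_RW_HINT/F_GET_RW_HINT fcntl pair (Linux 4.13+), which attaches a single write-lifetime hint to an open file rather than to each individual I/O. A minimal sketch in Python follows; the constant values are taken from include/uapi/linux/fcntl.h, the file path is arbitrary, and the call may fail on kernels or filesystems without rw-hint support, so the helper reports that case instead of raising:

```python
import fcntl
import os
import struct

# Values from include/uapi/linux/fcntl.h (F_LINUX_SPECIFIC_BASE is 1024).
F_GET_RW_HINT = 1024 + 11
F_SET_RW_HINT = 1024 + 12
# Lifetime classes 0..5: NOT_SET, NONE, SHORT, MEDIUM, LONG, EXTREME.
RWH_WRITE_LIFE_SHORT = 2

def set_write_hint(fd, hint):
    """Attach a write-lifetime hint to an open file and return the
    read-back value, or None if the kernel rejects the fcntl."""
    try:
        # Both fcntls take a pointer to a u64; Python's fcntl passes
        # a mutable copy of the bytes buffer and returns the result.
        fcntl.fcntl(fd, F_SET_RW_HINT, struct.pack("Q", hint))
        buf = fcntl.fcntl(fd, F_GET_RW_HINT, bytes(8))
        return struct.unpack("Q", buf)[0]
    except OSError:
        return None

fd = os.open("/tmp/hint-demo", os.O_CREAT | os.O_WRONLY, 0o600)
print(set_write_hint(fd, RWH_WRITE_LIFE_SHORT))  # 2 where supported, else None
os.close(fd)
```

The per-I/O variant under discussion would instead carry the hint on each submission (e.g. in the io_uring SQE), which is what breaks the one-hint-per-file model above.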
On Mon, Oct 07, 2024 at 01:29:31PM +0200, Javier González wrote: >> That's not the point. There is one company that drives entirely pointless >> marketing BS, and that one is pretty central here. The same company >> that said FDP has absolutely no intent to work on Linux and fought my >> initial attempt to make the protocol not totally unusable on a layered system. >> And no, that's not Samsung. > > So you had an interaction in the working group, your feedback was not > taken into consideration by the authors, and the result is that FDP > cannot be supported in Linux as a consequence of that? Come on... No, what I am saying is that the "small" FDP group that was doing the development while keeping it away from the broader group insisted that FDP is only for use in userspace drivers, and even getting the basics in to properly make it suitable for a semi-multi-tenant setup like Linux io_uring passthrough was not welcome and was actually fought tooth and nail by one particular company. It took a long talk with the head of the NVMe board to even get this sorted out.
On Tue, Oct 08, 2024 at 02:25:35PM +0200, Christoph Hellwig wrote: > On Mon, Oct 07, 2024 at 12:10:11PM +0200, Javier González wrote: > > In summary, what we are asking for is to take the patches that cover the > > current use-case, and work together on what might be needed for better > > FS support. > > And I really do not think it is a good idea. For one, it actually > works against the stated intent of the FDP spec. Second, extending > the hints to be per-I/O in the io_uring patch is actively breaking > the nice per-file I/O hint abstraction we have right now, and is > really unsuitable when actually used on a file and not just a block > device. And if you are only on a block device, I think passthrough > of some form is still the far better option, despite the problems > with it mentioned by Keith. Then let's just continue with patches 1 and 2. They introduce no new user or kernel APIs, and people have already reported improvements using them. Further, it is just a hint; it doesn't lock the kernel into anything that may hinder future innovations and enhancements. So let's unblock users and refocus *our* time on something more productive, please? And to be honest, per-io hints for generic read/write use are only valuable for my users if metadata is also exposed to userspace. I know Javier's team is working on that in parallel, so per-io hints are a lower priority for me until that part settles.
On Tue, Oct 08, 2024 at 08:44:28AM -0600, Keith Busch wrote: > Then let's just continue with patches 1 and 2. They introduce no new > user or kernel APIs, and people have already reported improvements using > it. They are still not in any way actually exposing the FDP functionality in the standard, though. How is your application going to align anything to the reclaim unit? Or is this another of the cases where, as a hyperscaler, you just "know" from the data sheet? But also, given that the submitter completely disappeared and refuses to even discuss his patches, I think they are simply abandonware at this point anyway.
> -----Original Message----- > From: Hans Holmberg <hans@owltronix.com> > Sent: Tuesday, October 8, 2024 12:07 PM > To: Javier Gonzalez <javier.gonz@samsung.com> > Cc: Christoph Hellwig <hch@lst.de>; Jens Axboe <axboe@kernel.dk>; Martin K. > Petersen <martin.petersen@oracle.com>; Keith Busch <kbusch@kernel.org>; > Kanchan Joshi <joshi.k@samsung.com>; hare@suse.de; sagi@grimberg.me; > brauner@kernel.org; viro@zeniv.linux.org.uk; jack@suse.cz; jaegeuk@kernel.org; > bcrl@kvack.org; dhowells@redhat.com; bvanassche@acm.org; > asml.silence@gmail.com; linux-nvme@lists.infradead.org; linux- > fsdevel@vger.kernel.org; io-uring@vger.kernel.org; linux-block@vger.kernel.org; > linux-aio@kvack.org; gost.dev@samsung.com; vishak.g@samsung.com > Subject: Re: [PATCH v7 0/3] FDP and per-io hints > > On Mon, Oct 7, 2024 at 12:10 PM Javier González <javier.gonz@samsung.com> > wrote: > > > > On 04.10.2024 14:30, Christoph Hellwig wrote: > > >On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote: > > >> So, considerign that file system _are_ able to use temperature hints and > > >> actually make them work, why don't we support FDP the same way we are > > >> supporting zones so that people can use it in production? > > > > > >Because apparently no one has tried it. It should be possible in theory, > > >but for example unless you have power of 2 reclaim unit size size it > > >won't work very well with XFS where the AGs/RTGs must be power of two > > >aligned in the LBA space, except by overprovisioning the LBA space vs > > >the capacity actually used. > > > > This is good. I think we should have at least a FS POC with data > > placement support to be able to drive conclusions on how the interface > > and requirements should be. Until we have that, we can support the > > use-cases that we know customers are asking for, i.e., block-level hints > > through the existing temperature API. 
> > > > > > > >> I agree that down the road, an interface that allows hints (many more > > >> than 5!) is needed. And in my opinion, this interface should not have > > >> semintics attached to it, just a hint ID, #hints, and enough space to > > >> put 100s of them to support storage node deployments. But this needs to > > >> come from the users of the hints / zones / streams / etc, not from > > >> us vendors. We do not have neither details on how they deploy these > > >> features at scale, nor the workloads to validate the results. Anything > > >> else will probably just continue polluting the storage stack with more > > >> interfaces that are not used and add to the problem of data placement > > >> fragmentation. > > > > > >Please always mentioned what layer you are talking about. At the syscall > > >level the temperatur hints are doing quite ok. A full stream separation > > >would obviously be a lot better, as would be communicating the zone / > > >reclaim unit / etc size. > > > > I mean at the syscall level. But as mentioned above, we need to be very > > sure that we have a clear use-case for that. If we continue seeing hints > > being use in NVMe or other protocols, and the number increase > > significantly, we can deal with it later on. > > > > > > > >As an interface to a driver that doesn't natively speak temperature > > >hint on the other hand it doesn't work at all. > > > > > >> The issue is that the first series of this patch, which is as simple as > > >> it gets, hit the list in May. Since then we are down paths that lead > > >> nowhere. So the line between real technical feedback that leads to > > >> a feature being merged, and technical misleading to make people be a > > >> busy bee becomes very thin. In the whole data placement effort, we have > > >> been down this path many times, unfortunately... > > > > > >Well, the previous round was the first one actually trying to address the > > >fundamental issue after 4 month. 
And then, after a first round of feedback, > > >it gets shut down somehow out of nowhere. As a maintainer and reviewer, > > >that is the kind of contributor I have a hard time taking seriously. > > > > I am not sure I understand what you mean in the last sentence, so I will > > not respond by filling in the blanks with a bad interpretation. > > > > In summary, what we are asking for is to take the patches that cover the > > current use-case, and work together on what might be needed for better > > FS support. For this, it seems you and Hans have a good idea of what you > > want to have based on XFS. We can help review or do part of the work, > > but trying to guess our way forward will only delay existing customers using > > existing HW. > > After reading the whole thread, I end up wondering why we need to rush the > support for a single use case through instead of putting the architecture > in place for properly supporting this new type of hardware from the start > throughout the stack. This is not a rush. We have been supporting this use case through passthru for over half a year, with code already upstream in Cachelib. This is mature enough to move into the block layer, which is what the end user wants to do either way. This is, though, a very good point. It is why we upstreamed passthru at the time: so people can experiment, validate, and upstream only when there is a clear path. > > Even for user space consumers of raw block devices, is the last version > of the patch set good enough? > > * It severely cripples the data separation capabilities as only a handful of > data placement buckets are supported I understood from your presentation at LPC, and later looking at the code that is available, that you have been successful at getting good results with the existing interface in XFS. The mapping from the temperature semantics to zones (no semantics) is the exact same as we are doing with FDP. Not having to change either in-kernel or user-space structures is great. 
> > * It just won't work if there is more than one user application per storage > device as different applications' data streams will be mixed at the nvme > driver level. For now this use-case is not clear. Folks working on it are using passthru. When we have a clearer understanding of what is needed, we might need changes in the kernel. If you see a need for this in the work that you are doing, by all means, please send patches. As I said at LPC, we can work together on this. > > While Christoph has already outlined what would be desirable from a > file system point of view, I don't have the answer to what would be the overall > best design for FDP. I would like to say that it looks to me like we need to > consider more than the early adoption use cases and make sure we > make the most of the hardware capabilities via logical abstractions that > would be compatible with a wider range of storage devices. > > Figuring out the right way forward is tricky, but why not just let it take the time > that is needed to sort this out while early users explore how to use FDP > drives and share the results? I agree that we might need a new interface to support more hints, beyond the temperatures. Or maybe not. We will not know until someone comes with a use case. We have made the mistake in the past of treating internal research as upstreamable work. I can now see that this simply complicates the in-kernel and user-space APIs. The existing API is usable and requires no changes. There is hardware. There are customers. There are applications with upstream support which have been tested with passthru (the early results you mention). And the wiring to NVMe is _very_ simple. There is no reason not to take this in, and then we will see what new interfaces we might need in the future. I would much rather spend time discussing ideas with you and others on a potential future API than arguing about the validity of an _existing_ one.
On Wed, Oct 09, 2024 at 11:28:28AM +0200, Christoph Hellwig wrote: > On Tue, Oct 08, 2024 at 08:44:28AM -0600, Keith Busch wrote: > > Then let's just continue with patches 1 and 2. They introduce no new > > user or kernel APIs, and people have already reported improvements using > > it. > > They are still not in any way actually exposing the FDP functionality > in the standard, though. How is your application going to align > anything to the reclaim unit? Or is this another of the cases where > as a hyperscaler you just "know" from the data sheet? As far as I know, this is an inconsequential spec detail that is not being considered by any applications testing this. And yet, the expected improvements are still there, so I don't see a point in holding this up for that reason.
On 09/10/24 11:28AM, Christoph Hellwig wrote: >On Tue, Oct 08, 2024 at 08:44:28AM -0600, Keith Busch wrote: >> Then let's just continue with patches 1 and 2. They introduce no new >> user or kernel APIs, and people have already reported improvements using >> it. > >They are still not in any way actually exposing the FDP functionality >in the standard, though. How is your application going to align >anything to the reclaim unit? Or is this another of the cases where >as a hyperscaler you just "know" from the data sheet? I think Keith already[1] mentioned a couple of times in previous replies that this is an optional feature; if experiments don't give better results, FDP hints can be disabled. > >But also given that the submitter completely disappeared and refuses >to even discuss his patches I think they are simply abandonware at >this point anyway. Kanchan is away due to personal reasons; he is expected to be back in a week or so. Regards, Nitesh Shetty [1] https://lore.kernel.org/all/ZmjSwfD1IqX-ADtL@kbusch-mbp.dhcp.thefacebook.com/
On Wed, Oct 9, 2024 at 4:36 PM Javier Gonzalez <javier.gonz@samsung.com> wrote: > > > > > -----Original Message----- > > From: Hans Holmberg <hans@owltronix.com> > > Sent: Tuesday, October 8, 2024 12:07 PM > > To: Javier Gonzalez <javier.gonz@samsung.com> > > Cc: Christoph Hellwig <hch@lst.de>; Jens Axboe <axboe@kernel.dk>; Martin K. > > Petersen <martin.petersen@oracle.com>; Keith Busch <kbusch@kernel.org>; > > Kanchan Joshi <joshi.k@samsung.com>; hare@suse.de; sagi@grimberg.me; > > brauner@kernel.org; viro@zeniv.linux.org.uk; jack@suse.cz; jaegeuk@kernel.org; > > bcrl@kvack.org; dhowells@redhat.com; bvanassche@acm.org; > > asml.silence@gmail.com; linux-nvme@lists.infradead.org; linux- > > fsdevel@vger.kernel.org; io-uring@vger.kernel.org; linux-block@vger.kernel.org; > > linux-aio@kvack.org; gost.dev@samsung.com; vishak.g@samsung.com > > Subject: Re: [PATCH v7 0/3] FDP and per-io hints > > > > On Mon, Oct 7, 2024 at 12:10 PM Javier González <javier.gonz@samsung.com> > > wrote: > > > > > > On 04.10.2024 14:30, Christoph Hellwig wrote: > > > >On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote: > > > >> So, considerign that file system _are_ able to use temperature hints and > > > >> actually make them work, why don't we support FDP the same way we are > > > >> supporting zones so that people can use it in production? > > > > > > > >Because apparently no one has tried it. It should be possible in theory, > > > >but for example unless you have power of 2 reclaim unit size size it > > > >won't work very well with XFS where the AGs/RTGs must be power of two > > > >aligned in the LBA space, except by overprovisioning the LBA space vs > > > >the capacity actually used. > > > > > > This is good. I think we should have at least a FS POC with data > > > placement support to be able to drive conclusions on how the interface > > > and requirements should be. 
Until we have that, we can support the > > > use-cases that we know customers are asking for, i.e., block-level hints > > > through the existing temperature API. > > > > > > > > > > >> I agree that down the road, an interface that allows hints (many more > > > >> than 5!) is needed. And in my opinion, this interface should not have > > > >> semintics attached to it, just a hint ID, #hints, and enough space to > > > >> put 100s of them to support storage node deployments. But this needs to > > > >> come from the users of the hints / zones / streams / etc, not from > > > >> us vendors. We do not have neither details on how they deploy these > > > >> features at scale, nor the workloads to validate the results. Anything > > > >> else will probably just continue polluting the storage stack with more > > > >> interfaces that are not used and add to the problem of data placement > > > >> fragmentation. > > > > > > > >Please always mentioned what layer you are talking about. At the syscall > > > >level the temperatur hints are doing quite ok. A full stream separation > > > >would obviously be a lot better, as would be communicating the zone / > > > >reclaim unit / etc size. > > > > > > I mean at the syscall level. But as mentioned above, we need to be very > > > sure that we have a clear use-case for that. If we continue seeing hints > > > being use in NVMe or other protocols, and the number increase > > > significantly, we can deal with it later on. > > > > > > > > > > >As an interface to a driver that doesn't natively speak temperature > > > >hint on the other hand it doesn't work at all. > > > > > > > >> The issue is that the first series of this patch, which is as simple as > > > >> it gets, hit the list in May. Since then we are down paths that lead > > > >> nowhere. So the line between real technical feedback that leads to > > > >> a feature being merged, and technical misleading to make people be a > > > >> busy bee becomes very thin. 
In the whole data placement effort, we have > > > >> been down this path many times, unfortunately... > > > > > > > >Well, the previous round was the first one actually trying to address the > > > >fundamental issue after 4 month. And then after a first round of feedback > > > >it gets shutdown somehow out of nowhere. As a maintainer and review that > > > >is the kinda of contributors I have a hard time taking serious. > > > > > > I am not sure I understand what you mean in the last sentece, so I will > > > not respond filling blanks with a bad interpretation. > > > > > > In summary, what we are asking for is to take the patches that cover the > > > current use-case, and work together on what might be needed for better > > > FS support. For this, it seems you and Hans have a good idea of what you > > > want to have based on XFS. We can help review or do part of the work, > > > but trying to guess our way will only delay existing customers using > > > existing HW. > > > > After reading the whole thread, I end up wondering why we need to rush the > > support for a single use case through instead of putting the architecture > > in place for properly supporting this new type of hardware from the start > > throughout the stack. > > This is not a rush. We have been supporting this use case through passthru for > over 1/2 year with code already upstream in Cachelib. This is mature enough as > to move into the block layer, which is what the end user wants to do either way. > > This is though a very good point. This is why we upstreamed passthru at the > time; so people can experiment, validate, and upstream only when there is a > clear path. > > > > > Even for user space consumers of raw block devices, is the last version > > of the patch set good enough? 
> > > > * It severely cripples the data separation capabilities as only a handful of > > data placement buckets are supported > > I could understand from your presentation at LPC, and late looking at the code that > is available that you have been successful at getting good results with the existing > interface in XFS. The mapping form the temperature semantics to zones (no semantics) > is the exact same as we are doing with FDP. Not having to change neither in-kernel nor user-space > structures is great. No, we don't map data directly to zones using lifetime hints. In fact, lifetime hints contribute only a relatively small part (~10% extra write amp reduction, see the rocksdb benchmark results). Segregating data by file is the most important part of the data placement heuristic, at least for this type of workload. > > > > > * It just won't work if there is more than one user application per storage > > device as different applications data streams will be mixed at the nvme > > driver level.. > > For now this use-case is not clear. Folks working on it are using passthru. When we > have a more clear understanding of what is needed, we might need changes in the kernel. > > If you see a need for this on the work that you are doing, by all means, please send patches. > As I said at LPC, we can work together on this. > > > > > While Christoph has already outlined what would be desirable from a > > file system point of view, I don't have the answer to what would be the overall > > best design for FDP. I would like to say that it looks to me like we need to > > consider more than more than the early adoption use cases and make sure we > > make the most of the hardware capabilities via logical abstractions that > > would be compatible with a wider range of storage devices. > > > > Figuring the right way forward is tricky, but why not just let it take the time > > that is needed to sort this out while early users explore how to use FDP > > drives and share the results? 
> > I agree that we might need a new interface to support more hints, beyond the temperatures. > Or maybe not. We would not know until someone comes with a use case. We have made the > mistake in the past of treating internal research as upstreamable work. I know can see that > this simply complicates the in-kernel and user-space APIs. > > The existing API is usable and requires no changes. There is hardware. There are customers. > There are applications with upstream support which have been tested with passthru (the > early results you mention). And the wiring to NVMe is _very_ simple. There is no reason > not to take this in, and then we will see what new interfaces we might need in the future. > > I would much rather spend time in discussing ideas with you and others on a potential > future API than arguing about the validity of an _existing_ one. > Yes, but while FDP support could be improved later on (happy to help if that's the case), I'm just afraid that putting less work now into defining how data placement is exposed is going to result in a bigger mess later when more use cases are considered.
On 09.10.2024 09:06, Keith Busch wrote: >On Wed, Oct 09, 2024 at 11:28:28AM +0200, Christoph Hellwig wrote: >> On Tue, Oct 08, 2024 at 08:44:28AM -0600, Keith Busch wrote: >> > Then let's just continue with patches 1 and 2. They introduce no new >> > user or kernel APIs, and people have already reported improvements using >> > it. >> >> They are still not any way actually exposing the FDP functionality >> in the standard though. How is your application going to align >> anything to the reclaim unit? Or is this another of the cases where >> as a hyperscaler you just "know" from the data sheet? > >As far as I know, this is an inconsequential spec detail that is not >being considered by any applications testing this. And yet, the expected >imrpovements are still there, so I don't see a point holding this up for >that reason. I am re-reading the thread to find common ground. I realize that my last email came across sharper than I intended. I apologize for that, Christoph. Let me summarize and propose something constructive so that we can move forward. It is clear that you have a lot of insight on how a FS API should look to unify the different data placement alternatives across protocols. I have not seen code for it, but based on the technical feedback you are providing, it seems to be fairly clear in your mind. I think we should attempt to pursue that with an example in mind. XFS seems the clear candidate. You have already done work enabling SMR HDDs; it seems we can get FDP under that umbrella. This will however take time to get right. We can help with development, testing, and experimental evaluation of the WAF benefits for such an interface. However, this work should not block existing hardware enabling an existing use-case. The current patches are not intrusive. They do not make changes to the API and merely wire up what is there to the driver. Anyone using temperatures will be able to use FDP - this is a win without a maintenance burden attached to it. 
The change to the FS / application API will not require major changes either; I believe we all agree that we cannot remove the temperatures, so all existing temperature users will be able to continue using them; we will just add an alternative for power users on the side. So the proposal is to merge the patches as they are and commit to work together on a new, better API for in-kernel users (FS), and for applications (syscall, uring). Christoph, would this be a viable way to move forward for you? Thanks, Javier
On 10.10.2024 08:40, Hans Holmberg wrote: >On Wed, Oct 9, 2024 at 4:36 PM Javier Gonzalez <javier.gonz@samsung.com> wrote: >> >> >> >> > -----Original Message----- >> > From: Hans Holmberg <hans@owltronix.com> >> > Sent: Tuesday, October 8, 2024 12:07 PM >> > To: Javier Gonzalez <javier.gonz@samsung.com> >> > Cc: Christoph Hellwig <hch@lst.de>; Jens Axboe <axboe@kernel.dk>; Martin K. >> > Petersen <martin.petersen@oracle.com>; Keith Busch <kbusch@kernel.org>; >> > Kanchan Joshi <joshi.k@samsung.com>; hare@suse.de; sagi@grimberg.me; >> > brauner@kernel.org; viro@zeniv.linux.org.uk; jack@suse.cz; jaegeuk@kernel.org; >> > bcrl@kvack.org; dhowells@redhat.com; bvanassche@acm.org; >> > asml.silence@gmail.com; linux-nvme@lists.infradead.org; linux- >> > fsdevel@vger.kernel.org; io-uring@vger.kernel.org; linux-block@vger.kernel.org; >> > linux-aio@kvack.org; gost.dev@samsung.com; vishak.g@samsung.com >> > Subject: Re: [PATCH v7 0/3] FDP and per-io hints >> > >> > On Mon, Oct 7, 2024 at 12:10 PM Javier González <javier.gonz@samsung.com> >> > wrote: >> > > >> > > On 04.10.2024 14:30, Christoph Hellwig wrote: >> > > >On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote: >> > > >> So, considerign that file system _are_ able to use temperature hints and >> > > >> actually make them work, why don't we support FDP the same way we are >> > > >> supporting zones so that people can use it in production? >> > > > >> > > >Because apparently no one has tried it. It should be possible in theory, >> > > >but for example unless you have power of 2 reclaim unit size size it >> > > >won't work very well with XFS where the AGs/RTGs must be power of two >> > > >aligned in the LBA space, except by overprovisioning the LBA space vs >> > > >the capacity actually used. >> > > >> > > This is good. I think we should have at least a FS POC with data >> > > placement support to be able to drive conclusions on how the interface >> > > and requirements should be. 
Until we have that, we can support the >> > > use-cases that we know customers are asking for, i.e., block-level hints >> > > through the existing temperature API. >> > > >> > > > >> > > >> I agree that down the road, an interface that allows hints (many more >> > > >> than 5!) is needed. And in my opinion, this interface should not have >> > > >> semintics attached to it, just a hint ID, #hints, and enough space to >> > > >> put 100s of them to support storage node deployments. But this needs to >> > > >> come from the users of the hints / zones / streams / etc, not from >> > > >> us vendors. We do not have neither details on how they deploy these >> > > >> features at scale, nor the workloads to validate the results. Anything >> > > >> else will probably just continue polluting the storage stack with more >> > > >> interfaces that are not used and add to the problem of data placement >> > > >> fragmentation. >> > > > >> > > >Please always mentioned what layer you are talking about. At the syscall >> > > >level the temperatur hints are doing quite ok. A full stream separation >> > > >would obviously be a lot better, as would be communicating the zone / >> > > >reclaim unit / etc size. >> > > >> > > I mean at the syscall level. But as mentioned above, we need to be very >> > > sure that we have a clear use-case for that. If we continue seeing hints >> > > being use in NVMe or other protocols, and the number increase >> > > significantly, we can deal with it later on. >> > > >> > > > >> > > >As an interface to a driver that doesn't natively speak temperature >> > > >hint on the other hand it doesn't work at all. >> > > > >> > > >> The issue is that the first series of this patch, which is as simple as >> > > >> it gets, hit the list in May. Since then we are down paths that lead >> > > >> nowhere. 
So the line between real technical feedback that leads to >> > > >> a feature being merged, and technical misleading to make people be a >> > > >> busy bee becomes very thin. In the whole data placement effort, we have >> > > >> been down this path many times, unfortunately... >> > > > >> > > >Well, the previous round was the first one actually trying to address the >> > > >fundamental issue after 4 month. And then after a first round of feedback >> > > >it gets shutdown somehow out of nowhere. As a maintainer and review that >> > > >is the kinda of contributors I have a hard time taking serious. >> > > >> > > I am not sure I understand what you mean in the last sentece, so I will >> > > not respond filling blanks with a bad interpretation. >> > > >> > > In summary, what we are asking for is to take the patches that cover the >> > > current use-case, and work together on what might be needed for better >> > > FS support. For this, it seems you and Hans have a good idea of what you >> > > want to have based on XFS. We can help review or do part of the work, >> > > but trying to guess our way will only delay existing customers using >> > > existing HW. >> > >> > After reading the whole thread, I end up wondering why we need to rush the >> > support for a single use case through instead of putting the architecture >> > in place for properly supporting this new type of hardware from the start >> > throughout the stack. >> >> This is not a rush. We have been supporting this use case through passthru for >> over 1/2 year with code already upstream in Cachelib. This is mature enough as >> to move into the block layer, which is what the end user wants to do either way. >> >> This is though a very good point. This is why we upstreamed passthru at the >> time; so people can experiment, validate, and upstream only when there is a >> clear path. >> >> > >> > Even for user space consumers of raw block devices, is the last version >> > of the patch set good enough? 
>> > >> > * It severely cripples the data separation capabilities as only a handful of >> > data placement buckets are supported >> >> I could understand from your presentation at LPC, and late looking at the code that >> is available that you have been successful at getting good results with the existing >> interface in XFS. The mapping form the temperature semantics to zones (no semantics) >> is the exact same as we are doing with FDP. Not having to change neither in-kernel nor user-space >> structures is great. > >No, we don't map data directly to zones using lifetime hints. In fact, >lifetime hints contribute only a >relatively small part (~10% extra write amp reduction, see the >rocksdb benchmark results). >Segregating data by file is the most important part of the data >placement heuristic, at least >for this type of workload. Is this because RocksDB already does segregation per file itself? Are you doing something specific on XFS, or using your knowledge of RocksDB to map files with an "unwritten" protocol in the middle? In this context, we have collected data both using FDP natively in RocksDB and using the temperatures. Both look very good, because both are initiated by RocksDB, and the FS just passes the hints directly to the driver. I ask this to understand whether this is the FS's responsibility or the application's. Our work points more to letting applications use the hints (as the use-cases are power users, like RocksDB). I agree with you that a FS could potentially make an improvement for legacy applications - we have not focused much on these though, so I trust your insights on it. >> >> > >> > * It just won't work if there is more than one user application per storage >> > device as different applications data streams will be mixed at the nvme >> > driver level.. >> >> For now this use-case is not clear. Folks working on it are using passthru. When we >> have a more clear understanding of what is needed, we might need changes in the kernel. 
>> >> If you see a need for this on the work that you are doing, by all means, please send patches. >> As I said at LPC, we can work together on this. >> >> > >> > While Christoph has already outlined what would be desirable from a >> > file system point of view, I don't have the answer to what would be the overall >> > best design for FDP. I would like to say that it looks to me like we need to >> > consider more than more than the early adoption use cases and make sure we >> > make the most of the hardware capabilities via logical abstractions that >> > would be compatible with a wider range of storage devices. >> > >> > Figuring the right way forward is tricky, but why not just let it take the time >> > that is needed to sort this out while early users explore how to use FDP >> > drives and share the results? >> >> I agree that we might need a new interface to support more hints, beyond the temperatures. >> Or maybe not. We would not know until someone comes with a use case. We have made the >> mistake in the past of treating internal research as upstreamable work. I know can see that >> this simply complicates the in-kernel and user-space APIs. >> >> The existing API is usable and requires no changes. There is hardware. There are customers. >> There are applications with upstream support which have been tested with passthru (the >> early results you mention). And the wiring to NVMe is _very_ simple. There is no reason >> not to take this in, and then we will see what new interfaces we might need in the future. >> >> I would much rather spend time in discussing ideas with you and others on a potential >> future API than arguing about the validity of an _existing_ one. >> > >Yes, but while FDP support could be improved later on(happy to help if >that'll be the case), >I'm just afraid that less work now defining the way data placement is >exposed is going to >result in a bigger mess later when more use cases will be considered. 
Please see the message I responded to on the other thread. I hope it is a way to move forward and actually work together on this.
On Wed, Oct 09, 2024 at 09:06:25AM -0600, Keith Busch wrote: > > anything to the reclaim unit? Or is this another of the cases where > > as a hyperscaler you just "know" from the data sheet? > > As far as I know, this is an inconsequential spec detail that is not > being considered by any applications testing this. And yet, the expected > imrpovements are still there, so I don't see a point holding this up for > that reason. It was the whole point of the thing, and a major source of complexity. Although not quite for this use case, Hans has numbers showing that aligning application data tables / objects / etc. to the underlying "erase unit" absolutely matters. And just to make it clear, the objection here is not that we have an interface that doesn't take that into account at the syscall level. We already have that interface and might as well make use of it. The problem is that the series tries to directly expose that to the driver, freezing us into this incomplete interface forever (assuming we have users actually picking it up). That's why I keep insisting like a broken record that we need to get this lower interface right first.
On Thu, Oct 10, 2024 at 09:07:36AM +0200, Javier González wrote: > I think we should attempt to pursue that with an example in mind. Seems > XFS is the clear candidate. You have done work already in enable SMR > HDDs; it seems we can get FDP under that umbrella. This will however > take time to get right. We can help with development, testing, and > experimental evaluation on the WAF benefits for such an interface. Or ZNS SSDs for that matter. > However, this work should not block existing hardware enabling an > existing use-case. The current patches are not intrusive. They do not > make changse to the API and merely wire up what is there to the driver. > Anyone using temperaturs will be able to use FDP - this is a win without > a maintainance burden attached to it. The change to the FS / application > API will not require major changes either; I believe we all agree that > we cannot remove the temperatures, so all existing temperature users > will be able to continue using them; we will just add an alternative for > power users on the side. As mentioned probably close to a dozen times over this thread and its predecessors: Keeping the per-file I/O hint API and mapping that to FDP is fine. Exposing the overly-simplistic hints to the NVMe driver through the entire I/O stack and locking us into that is not. > So the proposal is to merge the patches as they are and commit to work > together on a new, better API for in-kernel users (FS), and for > applications (syscall, uring). And because of that, the whole "merge it now and fix it later" unfortunately doesn't work, as a proper implementation will regress behavior for at least some users that only have the I/O hints API but try to emulate real stream separation with it. With the per-I/O hints in io_uring that is in fact almost guaranteed.
On Thu, Oct 10, 2024 at 09:13:27AM +0200, Javier Gonzalez wrote: > Is this because RocksDB already does seggregation per file itself? Are > you doing something specific on XFS or using your knoledge on RocksDB to > map files with an "unwritten" protocol in the midde? XFS doesn't really do anything smart at all except for grouping files with similar temperatures, but Hans can probably explain it in more detail. So yes, this relies on the application doing the data separation and using the most logical vehicle for it: files. > > In this context, we have collected data both using FDP natively in > RocksDB and using the temperatures. Both look very good, because both > are initiated by RocksDB, and the FS just passes the hints directly > to the driver. > > I ask this to understand if this is the FS responsibility or the > application's one. Our work points more to letting applications use the > hints (as the use-cases are power users, like RocksDB). I agree with you > that a FS could potentially make an improvement for legacy applications > - we have not focused much on these though, so I trust you insights on > it. As mentioned multiple times before in this thread, this absolutely depends on the abstraction level of the application. If the application works on a raw device without a file system, it obviously needs very low-level control. And in my opinion passthrough is by far the best interface for that level of control. If the application is using a file system, there is no better basic abstraction than a file, which can then be enhanced with a relatively small amount of additional information going both ways: the file system telling the application what good file sizes and write patterns are, and the application telling the file system which files are good candidates to merge into the same write stream if the file system has to merge multiple actively written-to files into a write stream. 
Trying to do low-level per-I/O hints on top of a file system is a recipe for trouble because you now have two entities fighting over placement control.
On Thu, Oct 10, 2024 at 9:13 AM Javier Gonzalez <javier.gonz@samsung.com> wrote: > > On 10.10.2024 08:40, Hans Holmberg wrote: > >On Wed, Oct 9, 2024 at 4:36 PM Javier Gonzalez <javier.gonz@samsung.com> wrote: > >> > >> > >> > >> > -----Original Message----- > >> > From: Hans Holmberg <hans@owltronix.com> > >> > Sent: Tuesday, October 8, 2024 12:07 PM > >> > To: Javier Gonzalez <javier.gonz@samsung.com> > >> > Cc: Christoph Hellwig <hch@lst.de>; Jens Axboe <axboe@kernel.dk>; Martin K. > >> > Petersen <martin.petersen@oracle.com>; Keith Busch <kbusch@kernel.org>; > >> > Kanchan Joshi <joshi.k@samsung.com>; hare@suse.de; sagi@grimberg.me; > >> > brauner@kernel.org; viro@zeniv.linux.org.uk; jack@suse.cz; jaegeuk@kernel.org; > >> > bcrl@kvack.org; dhowells@redhat.com; bvanassche@acm.org; > >> > asml.silence@gmail.com; linux-nvme@lists.infradead.org; linux- > >> > fsdevel@vger.kernel.org; io-uring@vger.kernel.org; linux-block@vger.kernel.org; > >> > linux-aio@kvack.org; gost.dev@samsung.com; vishak.g@samsung.com > >> > Subject: Re: [PATCH v7 0/3] FDP and per-io hints > >> > > >> > On Mon, Oct 7, 2024 at 12:10 PM Javier González <javier.gonz@samsung.com> > >> > wrote: > >> > > > >> > > On 04.10.2024 14:30, Christoph Hellwig wrote: > >> > > >On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote: > >> > > >> So, considerign that file system _are_ able to use temperature hints and > >> > > >> actually make them work, why don't we support FDP the same way we are > >> > > >> supporting zones so that people can use it in production? > >> > > > > >> > > >Because apparently no one has tried it. It should be possible in theory, > >> > > >but for example unless you have power of 2 reclaim unit size size it > >> > > >won't work very well with XFS where the AGs/RTGs must be power of two > >> > > >aligned in the LBA space, except by overprovisioning the LBA space vs > >> > > >the capacity actually used. > >> > > > >> > > This is good. 
I think we should have at least a FS POC with data > >> > > placement support to be able to drive conclusions on how the interface > >> > > and requirements should be. Until we have that, we can support the > >> > > use-cases that we know customers are asking for, i.e., block-level hints > >> > > through the existing temperature API. > >> > > > >> > > > > >> > > >> I agree that down the road, an interface that allows hints (many more > >> > > >> than 5!) is needed. And in my opinion, this interface should not have > >> > > >> semintics attached to it, just a hint ID, #hints, and enough space to > >> > > >> put 100s of them to support storage node deployments. But this needs to > >> > > >> come from the users of the hints / zones / streams / etc, not from > >> > > >> us vendors. We do not have neither details on how they deploy these > >> > > >> features at scale, nor the workloads to validate the results. Anything > >> > > >> else will probably just continue polluting the storage stack with more > >> > > >> interfaces that are not used and add to the problem of data placement > >> > > >> fragmentation. > >> > > > > >> > > >Please always mentioned what layer you are talking about. At the syscall > >> > > >level the temperatur hints are doing quite ok. A full stream separation > >> > > >would obviously be a lot better, as would be communicating the zone / > >> > > >reclaim unit / etc size. > >> > > > >> > > I mean at the syscall level. But as mentioned above, we need to be very > >> > > sure that we have a clear use-case for that. If we continue seeing hints > >> > > being use in NVMe or other protocols, and the number increase > >> > > significantly, we can deal with it later on. > >> > > > >> > > > > >> > > >As an interface to a driver that doesn't natively speak temperature > >> > > >hint on the other hand it doesn't work at all. > >> > > > > >> > > >> The issue is that the first series of this patch, which is as simple as > >> > > >> it gets, hit the list in May. 
Since then we are down paths that lead > >> > > >> nowhere. So the line between real technical feedback that leads to > >> > > >> a feature being merged, and technical misleading to make people be a > >> > > >> busy bee becomes very thin. In the whole data placement effort, we have > >> > > >> been down this path many times, unfortunately... > >> > > > > >> > > >Well, the previous round was the first one actually trying to address the > >> > > >fundamental issue after 4 month. And then after a first round of feedback > >> > > >it gets shutdown somehow out of nowhere. As a maintainer and review that > >> > > >is the kinda of contributors I have a hard time taking serious. > >> > > > >> > > I am not sure I understand what you mean in the last sentece, so I will > >> > > not respond filling blanks with a bad interpretation. > >> > > > >> > > In summary, what we are asking for is to take the patches that cover the > >> > > current use-case, and work together on what might be needed for better > >> > > FS support. For this, it seems you and Hans have a good idea of what you > >> > > want to have based on XFS. We can help review or do part of the work, > >> > > but trying to guess our way will only delay existing customers using > >> > > existing HW. > >> > > >> > After reading the whole thread, I end up wondering why we need to rush the > >> > support for a single use case through instead of putting the architecture > >> > in place for properly supporting this new type of hardware from the start > >> > throughout the stack. > >> > >> This is not a rush. We have been supporting this use case through passthru for > >> over 1/2 year with code already upstream in Cachelib. This is mature enough as > >> to move into the block layer, which is what the end user wants to do either way. > >> > >> This is though a very good point. This is why we upstreamed passthru at the > >> time; so people can experiment, validate, and upstream only when there is a > >> clear path. 
> >> > >> > Even for user space consumers of raw block devices, is the last version > >> > of the patch set good enough? > >> > > >> > * It severely cripples the data separation capabilities as only a handful of > >> > data placement buckets are supported > >> > >> I could understand from your presentation at LPC, and late looking at the code that > >> is available that you have been successful at getting good results with the existing > >> interface in XFS. The mapping form the temperature semantics to zones (no semantics) > >> is the exact same as we are doing with FDP. Not having to change neither in-kernel nor user-space > >> structures is great. > > > > >No, we don't map data directly to zones using lifetime hints. In fact, > >lifetime hints contribute only a > >relatively small part (~10% extra write amp reduction, see the > >rocksdb benchmark results). > >Segregating data by file is the most important part of the data > >placement heuristic, at least > >for this type of workload. > > Is this because RocksDB already does seggregation per file itself? Are > you doing something specific on XFS or using your knoledge on RocksDB to > map files with an "unwritten" protocol in the midde? By-file data placement is based on the observation that the lifetimes of a file's data blocks are strongly correlated: when a file is deleted, all of its blocks become reclaimable at that point. This requires knowledge about the data placement buckets and works really well without any hints provided. The lifetime hint heuristic I added on top is based on RocksDB statistics, but is designed to be generic enough to work for a wider range of workloads (still need to validate this though - more work to be done). > > In this context, we have collected data both using FDP natively in > RocksDB and using the temperatures. Both look very good, because both > are initiated by RocksDB, and the FS just passes the hints directly > to the driver. 
> > I ask this to understand if this is the FS responsibility or the > application's one. Our work points more to letting applications use the > hints (as the use-cases are power users, like RocksDB). I agree with you > that a FS could potentially make an improvement for legacy applications > - we have not focused much on these though, so I trust you insights on > it. The big problem as I see it is that if applications are going to work well together on the same media we need a common placement implementation somewhere, and it seems pretty natural to make it part of filesystems to me. > > >> > >> > > >> > * It just won't work if there is more than one user application per storage > >> > device as different applications data streams will be mixed at the nvme > >> > driver level.. > >> > >> For now this use-case is not clear. Folks working on it are using passthru. When we > >> have a more clear understanding of what is needed, we might need changes in the kernel. > >> > >> If you see a need for this on the work that you are doing, by all means, please send patches. > >> As I said at LPC, we can work together on this. > >> > >> > > >> > While Christoph has already outlined what would be desirable from a > >> > file system point of view, I don't have the answer to what would be the overall > >> > best design for FDP. I would like to say that it looks to me like we need to > >> > consider more than more than the early adoption use cases and make sure we > >> > make the most of the hardware capabilities via logical abstractions that > >> > would be compatible with a wider range of storage devices. > >> > > >> > Figuring the right way forward is tricky, but why not just let it take the time > >> > that is needed to sort this out while early users explore how to use FDP > >> > drives and share the results? > >> > >> I agree that we might need a new interface to support more hints, beyond the temperatures. > >> Or maybe not. 
We would not know until someone comes with a use case. We have made the > >> mistake in the past of treating internal research as upstreamable work. I know can see that > >> this simply complicates the in-kernel and user-space APIs. > >> > >> The existing API is usable and requires no changes. There is hardware. There are customers. > >> There are applications with upstream support which have been tested with passthru (the > >> early results you mention). And the wiring to NVMe is _very_ simple. There is no reason > >> not to take this in, and then we will see what new interfaces we might need in the future. > >> > >> I would much rather spend time in discussing ideas with you and others on a potential > >> future API than arguing about the validity of an _existing_ one. > >> > > > >Yes, but while FDP support could be improved later on(happy to help if > >that'll be the case), > >I'm just afraid that less work now defining the way data placement is > >exposed is going to > >result in a bigger mess later when more use cases will be considered. > > Please, see the message I responded on the other thread. I hope it is a > way to move forward and actually work together on this.
On 10.10.2024 11:13, Christoph Hellwig wrote: >On Thu, Oct 10, 2024 at 09:07:36AM +0200, Javier González wrote: >> I think we should attempt to pursue that with an example in mind. Seems >> XFS is the clear candidate. You have done work already in enable SMR >> HDDs; it seems we can get FDP under that umbrella. This will however >> take time to get right. We can help with development, testing, and >> experimental evaluation on the WAF benefits for such an interface. > >Or ZNS SSDs for that matter. Maybe. I do not see much movement on this. > >> However, this work should not block existing hardware enabling an >> existing use-case. The current patches are not intrusive. They do not >> make changse to the API and merely wire up what is there to the driver. >> Anyone using temperaturs will be able to use FDP - this is a win without >> a maintainance burden attached to it. The change to the FS / application >> API will not require major changes either; I believe we all agree that >> we cannot remove the temperatures, so all existing temperature users >> will be able to continue using them; we will just add an alternative for >> power users on the side. > >As mentioned probably close to a dozen times over this thread and it's >predecessors: Keeping the per-file I/O hint API and mapping that to >FDP is fine. Exposing the overly-simplistic hints to the NVMe driver >through the entire I/O stack and locking us into that is not. I don't understand the "locking us into that" part. >> So the proposal is to merge the patches as they are and commit to work >> together on a new, better API for in-kernel users (FS), and for >> applications (syscall, uring). > >And because of that the whole merge it now and fix it later unfortunately >doesn't work, as a proper implementation will regress behavior for at >least some users that only have the I/O hints API but try to emulate >real stream separation with it. With the per-I/O hints in io_uring that >is in fact almost guaranteed. 
I do not think this argument is valid. These patches are not a "merge now, fix later". It is more: use the current interface, and work together on a new one if needed. The new interface you are talking about is not clear to me at all. If you could share some patches showing how that would look, it would give a much clearer idea of what you have in mind. The whole idea of "let's not cover an existing use-case because we might be able to do something in the future" is not very tangible.
On 10.10.2024 11:20, Christoph Hellwig wrote: >On Thu, Oct 10, 2024 at 09:13:27AM +0200, Javier Gonzalez wrote: >> Is this because RocksDB already does segregation per file itself? Are >> you doing something specific on XFS or using your knowledge on RocksDB to >> map files with an "unwritten" protocol in the middle? > >XFS doesn't really do anything smart at all except for grouping files >with similar temperatures, but Hans can probably explain it in more >detail. So yes, this relies on the application doing the data separation >and using the most logical vehicle for it: files. This makes sense. Agree. > >> >> In this context, we have collected data both using FDP natively in >> RocksDB and using the temperatures. Both look very good, because both >> are initiated by RocksDB, and the FS just passes the hints directly >> to the driver. >> >> I ask this to understand if this is the FS responsibility or the >> application's one. Our work points more to letting applications use the >> hints (as the use-cases are power users, like RocksDB). I agree with you >> that a FS could potentially make an improvement for legacy applications >> - we have not focused much on these though, so I trust your insights on >> it. > >As mentioned multiple times before in this thread this absolutely >depends on the abstraction level of the application. If the application >works on a raw device without a file system it obviously needs very >low-level control. And in my opinion passthrough is by far the best >interface for that level of control. Passthru is great for prototyping and getting insights on end-to-end applicability. We see though that it is difficult to get a full solution based on it, unless people implement a user-space layer tailored to their use-case (e.g., a version of SPDK's bdev). After the POC phase, most folks that can use passthru prefer to move to block - with a validated use-case it should be easier to get things upstream. This is exactly where we are now.
>If the application is using a >file system there is no better basic level abstraction than a file, >which can then be enhanced with a relatively small amount of additional >information going both ways: the file system telling the application >what good file sizes and write patterns are, and the application telling >the file system what files are good candidates to merge into the same >write stream if the file system has to merge multiple actively written >to files into a write stream. Trying to do low-level per I/O hints >on top of a file system is a recipe for trouble because you now have >two entities fighting over placement control. For files, I agree with you. If you saw the comments from Christian on the inode space, there are a few plumbing challenges. Do you have any patches we could look at?
On 10.10.2024 12:46, Hans Holmberg wrote: >On Thu, Oct 10, 2024 at 9:13 AM Javier Gonzalez <javier.gonz@samsung.com> wrote: >> >> On 10.10.2024 08:40, Hans Holmberg wrote: >> >On Wed, Oct 9, 2024 at 4:36 PM Javier Gonzalez <javier.gonz@samsung.com> wrote: >> >> >> >> > On Mon, Oct 7, 2024 at 12:10 PM Javier González <javier.gonz@samsung.com> >> >> > wrote: >> >> > > >> >> > > On 04.10.2024 14:30, Christoph Hellwig wrote: >> >> > > >On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote: >> >> > > >> So, considering that file systems _are_ able to use temperature hints and >> >> > > >> actually make them work, why don't we support FDP the same way we are >> >> > > >> supporting zones so that people can use it in production? >> >> > > > >> >> > > >Because apparently no one has tried it.
It should be possible in theory, >> >> > > >but for example unless you have a power-of-two reclaim unit size it >> >> > > >won't work very well with XFS where the AGs/RTGs must be power of two >> >> > > >aligned in the LBA space, except by overprovisioning the LBA space vs >> >> > > >the capacity actually used. >> >> > > >> >> > > This is good. I think we should have at least a FS POC with data >> >> > > placement support to be able to drive conclusions on how the interface >> >> > > and requirements should be. Until we have that, we can support the >> >> > > use-cases that we know customers are asking for, i.e., block-level hints >> >> > > through the existing temperature API. >> >> > > >> >> > > > >> >> > > >> I agree that down the road, an interface that allows hints (many more >> >> > > >> than 5!) is needed. And in my opinion, this interface should not have >> >> > > >> semantics attached to it, just a hint ID, #hints, and enough space to >> >> > > >> put 100s of them to support storage node deployments. But this needs to >> >> > > >> come from the users of the hints / zones / streams / etc, not from >> >> > > >> us vendors. We have neither details on how they deploy these >> >> > > >> features at scale, nor the workloads to validate the results. Anything >> >> > > >> else will probably just continue polluting the storage stack with more >> >> > > >> interfaces that are not used and add to the problem of data placement >> >> > > >> fragmentation. >> >> > > > >> >> > > >Please always mention what layer you are talking about. At the syscall >> >> > > >level the temperature hints are doing quite ok. A full stream separation >> >> > > >would obviously be a lot better, as would be communicating the zone / >> >> > > >reclaim unit / etc size. >> >> > > >> >> > > I mean at the syscall level. But as mentioned above, we need to be very >> >> > > sure that we have a clear use-case for that.
If we continue seeing hints >> >> > > being used in NVMe or other protocols, and the number increases >> >> > > significantly, we can deal with it later on. >> >> > > >> >> > > > >> >> > > >As an interface to a driver that doesn't natively speak temperature >> >> > > >hints on the other hand it doesn't work at all. >> >> > > > >> >> > > >> The issue is that the first series of this patch, which is as simple as >> >> > > >> it gets, hit the list in May. Since then we have gone down paths that lead >> >> > > >> nowhere. So the line between real technical feedback that leads to >> >> > > >> a feature being merged, and technical misdirection that just keeps people >> >> > > >> busy becomes very thin. In the whole data placement effort, we have >> >> > > >> been down this path many times, unfortunately... >> >> > > > >> >> > > >Well, the previous round was the first one actually trying to address the >> >> > > >fundamental issue after 4 months. And then after a first round of feedback >> >> > > >it gets shut down somehow out of nowhere. As a maintainer and reviewer, that >> >> > > >is the kind of contributor I have a hard time taking seriously. >> >> > > >> >> > > I am not sure I understand what you mean in the last sentence, so I will >> >> > > not respond filling blanks with a bad interpretation. >> >> > > >> >> > > In summary, what we are asking for is to take the patches that cover the >> >> > > current use-case, and work together on what might be needed for better >> >> > > FS support. For this, it seems you and Hans have a good idea of what you >> >> > > want to have based on XFS. We can help review or do part of the work, >> >> > > but trying to guess our way will only delay existing customers using >> >> > > existing HW.
>> >> > >> >> > After reading the whole thread, I end up wondering why we need to rush the >> >> > support for a single use case through instead of putting the architecture >> >> > in place for properly supporting this new type of hardware from the start >> >> > throughout the stack. >> >> >> >> This is not a rush. We have been supporting this use case through passthru for >> >> over half a year with code already upstream in Cachelib. This is mature enough >> >> to move into the block layer, which is what the end user wants to do either way. >> >> >> >> This is though a very good point. This is why we upstreamed passthru at the >> >> time; so people can experiment, validate, and upstream only when there is a >> >> clear path. >> >> >> >> > >> >> > Even for user space consumers of raw block devices, is the last version >> >> > of the patch set good enough? >> >> > >> >> > * It severely cripples the data separation capabilities as only a handful of >> >> > data placement buckets are supported >> >> >> >> I could understand from your presentation at LPC, and later looking at the code that >> >> is available, that you have been successful at getting good results with the existing >> >> interface in XFS. The mapping from the temperature semantics to zones (no semantics) >> >> is the exact same as we are doing with FDP. Not having to change either in-kernel or user-space >> >> structures is great. >> > >> >No, we don't map data directly to zones using lifetime hints. In fact, >> >lifetime hints contribute only a >> >relatively small part (~10% extra write amp reduction, see the >> >rocksdb benchmark results). >> >Segregating data by file is the most important part of the data >> >placement heuristic, at least >> >for this type of workload. >> >> Is this because RocksDB already does segregation per file itself? Are >> you doing something specific on XFS or using your knowledge on RocksDB to >> map files with an "unwritten" protocol in the middle?
> >Data placement by-file is based on the fact that the lifetimes of a file's data >blocks are strongly correlated. When a file is deleted, all its blocks >will be reclaimable at that point. This requires knowledge about the >data placement buckets and works really well without any hints >provided. But we need hints to put files together. I believe you do this already, as no placement protocol gives you unlimited separation. >The lifetime hint heuristic I added on top is based on rocksdb >statistics, but designed to be generic enough to work for a wider >range of workloads (still need to validate this though - more work to >be done). Maybe you can post some patches on the parts dedicated to the VFS level and user-space API (syscall or uring)? Following on the comment to Christoph, it would be good to have something tangible to work together on for the next stage of this support. > >> >> In this context, we have collected data both using FDP natively in >> RocksDB and using the temperatures. Both look very good, because both >> are initiated by RocksDB, and the FS just passes the hints directly >> to the driver. >> >> I ask this to understand if this is the FS responsibility or the >> application's one. Our work points more to letting applications use the >> hints (as the use-cases are power users, like RocksDB). I agree with you >> that a FS could potentially make an improvement for legacy applications >> - we have not focused much on these though, so I trust your insights on >> it. > >The big problem as I see it is that if applications are going to work >well together on the same media we need a common placement >implementation somewhere, and it seems pretty natural to make it part >of filesystems to me. For FS users, this makes a lot of sense. But we still need to cover applications using raw block.
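The file-based placement heuristic Hans describes can be sketched in a few lines: separate data primarily by file (so blocks freed together by an unlink sit in the same stream), and use the lifetime-hint class only to decide which files may share a stream when streams are scarce. This is a toy illustration only, not the actual XFS code; `pick_stream`, the class layout and the stream counts are all invented for the example:

```c
#include <assert.h>
#include <stdint.h>

/* mirrors the kernel's enum rw_hint / RWH_WRITE_LIFE_* values */
enum rw_hint { WRITE_LIFE_NONE = 1, WRITE_LIFE_SHORT, WRITE_LIFE_MEDIUM,
               WRITE_LIFE_LONG, WRITE_LIFE_EXTREME };

#define STREAMS_PER_CLASS 2   /* invented: streams reserved per lifetime class */

/*
 * Pick a write stream for a file: all writes to the same inode land in
 * the same stream, and different lifetime classes never share a stream,
 * so file-granular reclaim correlation is preserved and the hint only
 * refines which pool of streams a file hashes into.
 */
static unsigned pick_stream(uint64_t ino, enum rw_hint hint)
{
    unsigned class = (hint >= WRITE_LIFE_NONE && hint <= WRITE_LIFE_EXTREME)
                         ? (unsigned)(hint - WRITE_LIFE_NONE) : 0;
    return class * STREAMS_PER_CLASS + (unsigned)(ino % STREAMS_PER_CLASS);
}
```

The point of the sketch is the ordering of the two signals: the inode, not the hint, is the primary separation key, matching the ~10% figure quoted above for how much the hint adds on top.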
On Thu, Oct 10, 2024 at 02:22:32PM +0200, Javier Gonzalez wrote: > Passthru is great for prototyping and getting insights on end-to-end > applicability. We see though that it is difficult to get a full solution > based on it, unless people implement a user-space layer tailored to their > use-case (e.g., a version of SPDK's bdev). After the POC phase, most folks > that can use passthru prefer to move to block - with a validated > use-case it should be easier to get things upstream. > > This is exactly where we are now. That's a lot of marketing babble :) What exact thing is missing from the passthrough interface when using say SPDK over io_uring? > If you saw the comments from Christian on the inode space, there are a > few plumbing challenges. Do you have any patches we could look at? I'm not sure what you refer to here.
On Thu, Oct 10, 2024 at 02:27:33PM +0200, Javier Gonzalez wrote: [full quote snipped, it would be really helpful to only quote what you actually reference per the usual email rules] >> Data placement by-file is based on the fact that the lifetimes of a file's data >> blocks are strongly correlated. When a file is deleted, all its blocks >> will be reclaimable at that point. This requires knowledge about the >> data placement buckets and works really well without any hints >> provided. > > But we need hints to put files together. I believe you do this already, > as no placement protocol gives you unlimited separation. The per-file temperature hints do a reasonable job for that. A fully developed version of the separate write streams submitted by Kanchan would probably do even better, but for now the per-file hints seem to be enough. > Maybe you can post some patches on the parts dedicated to the VFS level > and user-space API (syscall or uring)? It's just using the existing temperature hints.
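For reference, the "existing temperature hints" in question are the `F_SET_RW_HINT`/`F_GET_RW_HINT` fcntl pair (kernel 4.13+), which stores one of the `RWH_WRITE_LIFE_*` values in the inode. A minimal sketch of how an application such as RocksDB tags a file; the constants are copied from `<linux/fcntl.h>` so the example is self-contained:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* from <linux/fcntl.h> (F_LINUX_SPECIFIC_BASE + 11/12) */
#ifndef F_GET_RW_HINT
#define F_GET_RW_HINT 1035
#define F_SET_RW_HINT 1036
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define RWH_WRITE_LIFE_SHORT 2
#endif

/*
 * Tag an open file's expected write lifetime. The hint travels with
 * each write so the filesystem (or, with the series under discussion,
 * the NVMe driver) can separate the data stream.
 */
static int set_write_life(int fd, uint64_t hint)
{
    return fcntl(fd, F_SET_RW_HINT, &hint);  /* argument is a u64 pointer */
}
```

The fcntl is handled generically in the VFS, so it succeeds on any regular file regardless of filesystem; very old kernels return EINVAL.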
On Thu, Oct 10, 2024 at 01:59:14PM +0200, Javier González wrote: >> As mentioned probably close to a dozen times over this thread and its >> predecessors: Keeping the per-file I/O hint API and mapping that to >> FDP is fine. Exposing the overly-simplistic hints to the NVMe driver >> through the entire I/O stack and locking us into that is not. > > I don't understand the "locking us into that" part. The patches as submitted do the following two things: 1) interpret the simple temperature hints to map to FDP reclaim handles 2) add a new interface to set the temperature hints per I/O and also rely on an annoying existing implementation detail where the I/O hints set on random files get automatically propagated to the block device without file system involvement. This means we can't easily make the nvme driver actually use smarter hints that expose the actual FDP capabilities without breaking users that relied on the existing behavior, especially the per-I/O hints that counteract any kind of file system based data placement.
On 11.10.2024 10:56, Christoph Hellwig wrote: >On Thu, Oct 10, 2024 at 02:22:32PM +0200, Javier Gonzalez wrote: >> Passthru is great for prototyping and getting insights on end-to-end >> applicability. We see though that it is difficult to get a full solution >> based on it, unless people implement a user-space layer tailored to their >> use-case (e.g., a version of SPDK's bdev). After the POC phase, most folks >> that can use passthru prefer to move to block - with a validated >> use-case it should be easier to get things upstream. >> >> This is exactly where we are now. > >That's a lot of marketing babble :) What exact thing is missing >from the passthrough interface when using say SPDK over io_uring? The block layer provides a lot of functionality that passthru cannot provide. A simple example would be splits. You know this :) I am sure Jens and Keith can give you more specifics on their particular reasons. > >> If you saw the comments from Christian on the inode space, there are a >> few plumbing challenges. Do you have any patches we could look at? > >I'm not sure what you refer to here. This from Christian: https://lore.kernel.org/all/20240903-erfassen-bandmitglieder-32dfaeee66b2@brauner/
On Fri, Oct 11, 2024 at 02:21:02PM +0200, Javier Gonzalez wrote: > > That's a lot of marketing babble :) What exact thing is missing > > from the passthrough interface when using say SPDK over io_uring? > > The block layer provides a lot of functionality that passthru cannot > provide. A simple example would be splits. You know this :) > > I am sure Jens and Keith can give you more specifics on their particular > reasons. Splitting, merging, cgroups, iostats, error retries, failover. We have a recent change staged in 6.13 to try to count iostats on passthrough, but it doesn't do discards. Passthrough to partitions requires root access and the slow CAP_SYS_ADMIN check on every IO. Page cache. Though that may not readily work with this io_uring proposal as-is either. The generic IO path is safe against changing formats; passthrough isn't. And it's just generally more work to port existing applications to use the passthrough interface. It leads to duplicated solutions for scenarios the kernel already handles.
On 10/11/24 3:02 AM, Christoph Hellwig wrote: >>> As mentioned probably close to a dozen times over this thread and its >>> predecessors: Keeping the per-file I/O hint API and mapping that to >>> FDP is fine. Exposing the overly-simplistic hints to the NVMe driver >>> through the entire I/O stack and locking us into that is not. >> >> I don't understand the "locking us into that" part. > > The patches as submitted do the following two things: > > 1) interpret the simple temperature hints to map to FDP reclaim handles > 2) add a new interface to set the temperature hints per I/O > > and also rely on an annoying existing implementation detail where the I/O > hints set on random files get automatically propagated to the block > device without file system involvement. > > This means we can't easily make the nvme driver actually use smarter > hints that expose the actual FDP capabilities without breaking users > that relied on the existing behavior, especially the per-I/O hints that > counteract any kind of file system based data placement. > I think that last argument is a straw man - for any kind of interface like this, we've ALWAYS just had the rule that any per-whatever overrides the generic setting. Per IO hints would do the same thing. Is this a mess if a user has assigned a file hint and uses other per IO hints on that very same file and they differ? Certainly, but that's really just the user/app being dumb. Or maybe being smart, as maybe this particular block is always hot in the file (yeah, a contrived case). The point is, if you mix and match per-io hints and per-file hints, then you're probably being pretty silly and nobody should do that. I don't think this is a practical concern at all.
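The precedence rule Jens describes ("any per-whatever overrides the generic setting") is small enough to state in code. A hypothetical sketch with invented names — this is not the submitted series, just the rule itself:

```c
#include <assert.h>
#include <stdint.h>

#define HINT_NOT_SET 0  /* mirrors RWH_WRITE_LIFE_NOT_SET */

/*
 * Effective hint for one I/O: an explicit per-I/O hint (e.g. carried in
 * an io_uring SQE field) wins; otherwise fall back to the hint set on
 * the file with F_SET_RW_HINT; otherwise no hint at all.
 */
static uint16_t effective_hint(uint16_t per_io, uint16_t per_file)
{
    return per_io != HINT_NOT_SET ? per_io : per_file;
}
```

The whole dispute in this subthread is not about this function, but about *where* it may run: above the filesystem (fine for raw block devices) or below it (where it fights the filesystem's own placement).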
On Fri, Oct 11, 2024 at 11:08:26AM -0600, Jens Axboe wrote: > > I think that last argument is a straw man - for any kind of interface > like this, we've ALWAYS just had the rule that any per-whatever > overrides the generic setting. And exactly that is the problem. For file systems we can't support that sanely. So IFF you absolutely want the per-I/O hints we need an opt-in via the file operations. I've said that at least twice in this discussion before, but as everyone likes to have political discussions instead of technical ones no one replied to that.
> On Fri, Oct 11, 2024 at 11:08:26AM -0600, Jens Axboe wrote: > > > > I think that last argument is a straw man - for any kind of interface > > like this, we've ALWAYS just had the rule that any per-whatever > > overrides the generic setting. > > And exactly that is the problem. For file systems we can't support > that sanely. So IFF you absolutely want the per-I/O hints we need > an opt in by the file operations. I've said that at least twice > in this discussion before, but as everyone likes to have political > discussions instead of technical ones no one replied to that. Is it a way forward to add this in a new spin of the series - keeping the temperature mapping on the NVMe side? If not, what would be acceptable for a first version, before getting into adding a new interface to expose agnostic hints?
On Mon, Oct 14, 2024 at 07:02:11AM +0000, Javier Gonzalez wrote: > > And exactly that is the problem. For file systems we can't support > > that sanely. So IFF you absolutely want the per-I/O hints we need > > an opt-in via the file operations. I've said that at least twice > > in this discussion before, but as everyone likes to have political > > discussions instead of technical ones no one replied to that. > > Is it a way forward to add this in a new spin of the series - keeping the > temperature mapping on the NVMe side? What do you gain from that? NVMe does not understand data temperatures, so why make up that claim? Especially as it directly undermines any file system work to actually make use of it. > If not, what would be acceptable for a first version, before getting into adding > a new interface to expose agnostic hints? Just iterate on Kanchan's series for a block layer (and possibly user-level, but that's less urgent) interface for stream separation?
> On Mon, Oct 14, 2024 at 07:02:11AM +0000, Javier Gonzalez wrote: > > > And exactly that is the problem. For file systems we can't support > > > that sanely. So IFF you absolutely want the per-I/O hints we need > > > an opt-in via the file operations. I've said that at least twice > > > in this discussion before, but as everyone likes to have political > > > discussions instead of technical ones no one replied to that. > > > > Is it a way forward to add this in a new spin of the series - keeping the > > temperature mapping on the NVMe side? > > What do you gain from that? NVMe does not understand data temperatures, > so why make up that claim? I expressed this a couple of times in this thread. It is no problem to map temperatures to a protocol that does not understand the semantics. It is not perfect, and in time I agree that we would benefit from exposing the raw hints without semantics from the block layer upwards. But the temperatures are already there. We are not adding anything new. And thanks to this, we can enable existing users with _minimal_ changes to user-space.
As pointed out in the XFS zoned discussions, this is very similar to what you guys are doing (exactly to re-use what is already there), and it works. On top of this, applications that already understand temperatures (e.g., RocksDB) will be able to leverage FDP SSDs without changes. > Especially as it directly undermines any file system work to actually make use of it. I do not think it does. If a FS wants to use the temperatures, then they would be able to leverage FDP besides SCSI. And if we come up with a better interface later on, we can make the changes then. I really do not see the issue. If we were adding a temperature abstraction now, I would agree with you that we would need to cover the use-case you mention for FSs from the beginning, but this is already here. It seems like a fair compromise to support current users. > > If not, what would be acceptable for a first version, before getting into adding > a new interface to expose agnostic hints? > > Just iterate on Kanchan's series for a block layer (and possibly user-level, > but that's less urgent) interface for stream separation? We can work on this later, but it is not that easy. Look at the mail I forwarded from Christian. We can now store hints in the same structure as the temperatures. I believe that we would be able to support 128 hints. If we are serious about supporting hints as a general interface, 128 hints is not nearly enough. Moreover: - How do we convince VFS folks to give us more space for hints at this point? - How do we make agnostic hints and temperatures co-exist? Do we use a bit to select and create a protocol? This seems over-complicated - When we are exposing general hints to the block layer, how do we support N:N FS:Block_Devices? and DMs? This is not obvious, and it goes well beyond what we need today, but we probably need to think about it. And these are just the questions that we know of.
You see that this is a big lift (which we can work together on!), but it will delay current users by months. And honestly, I do not feel comfortable doing this alone; we need application and FS folks like you to help guide this so that it is really usable. I say it again: your idea about properly supporting hints in FSs is good, and I think we should work on it. But I do not believe that the current patches (with added things like opt-in file ops) get in the way of this. So if the disagreement is whether these patches are harmful, let's talk about that explicitly and try to agree; we do not need to go over the need for a new hint interface. We already agree on this.
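The mapping being argued over in this exchange — temperatures onto a protocol that does not understand them — is mechanically small: the five `RWH_WRITE_LIFE_*` values become FDP placement identifiers (carried in the DSPEC field of the NVMe write command), bounded by however many reclaim unit handles the device actually exposes. A rough sketch of that translation, illustrative only and not the submitted patch:

```c
#include <assert.h>
#include <stdint.h>

enum rwh { RWH_NOT_SET, RWH_NONE, RWH_SHORT, RWH_MEDIUM, RWH_LONG, RWH_EXTREME };

/*
 * Map a temperature hint to a placement identifier index, wrapping when
 * the device reports fewer reclaim unit handles than hint classes.
 * Returns -1 for "no hint": the write then uses the default handle.
 */
static int hint_to_pid(enum rwh hint, unsigned nr_handles)
{
    if (hint == RWH_NOT_SET || hint == RWH_NONE || nr_handles == 0)
        return -1;
    return (int)((hint - RWH_SHORT) % nr_handles);  /* SHORT -> 0, MEDIUM -> 1, ... */
}
```

The sketch also shows why the feature caps out at "a handful of data placement buckets", as noted earlier in the thread: with only four active temperature classes, a device exposing dozens of reclaim unit handles can never be fully addressed through this interface.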
On Mon, Oct 14, 2024 at 09:08:24AM +0000, Javier Gonzalez wrote: [can you fix your mailer please, no full quotes, and especially not quotes of the mail headers] > > What do you gain from that? NVMe does not understand data temperatures, > > so why make up that claim? > > I expressed this a couple of times in this thread. It is no problem to > map temperatures to a protocol that does not understand the semantics. And I've agreed every time with you. But the important point is that we must not do it in the driver where all context is lost. > > Especially as it directly undermines any file system work to actually make use of it. > > I do not think it does. If a FS wants to use the temperatures, then they > would be able to leverage FDP besides SCSI. What do you mean with that? This is a bit too much whitepaper vocabulary. We have code in XFS that can make use of the temperature hints. But to make them work it actually needs to do real stream separation on the device. I.e. the file system consumes the temperature hints. > And if we come up with a better interface later on, we can make the changes then. > I really do not see the issue. If we were adding a temperature abstraction now, I would agree with > you that we would need to cover the use-case you mention for FSs from the beginning, but this > is already here. Seems like a fair compromise to support current users. Again, I think the temperature hints at the syscall level aren't all bad. There's definitely a few things I'd like to do better in hindsight, but that's not the point. The problem is trying to turn them into stream separation all the way down in the driver, which is fundamentally broken. > - How do we convince VFS folks to give us more space for hints at this point? What space from VFS folks do you need for hints? And why does it matter?
> On Mon, Oct 14, 2024 at 09:08:24AM +0000, Javier Gonzalez wrote: > > > Especially as it directly undermines any file system work to actually make use > of it. > > > > I do not think it does. If a FS wants to use the temperatures, then they > > would be able to leverage FDP besides SCSI. > > What do you mean with that? This is a bit too much whitepaper vocabulary. > > We have code in XFS that can make use of the temperature hints. But to > make them work it actually needs to do real stream separation on the > device. I.e. the file system consumes the temperature hints. The device can guarantee the stream separation without knowing the temperature. > > And if we come up with a better interface later on, we can make the changes > then. > > I really do not see the issue. If we were adding a temperature abstraction now, I > would agree with > > you that we would need to cover the use-case you mention for FSs from the > beginning, but this > > is already here. Seems like a fair compromise to support current users. > > Again, I think the temperature hints at the syscall level aren't all > bad. There's definitely a few things I'd like to do better in hindsight, > but that's not the point. The problem is trying to turn them into > stream separation all the way down in the driver, which is fundamentally > broken. > > > - How do we convince VFS folks to give us more space for hints at this point? > > What space from VFS folks do you need for hints? And why does it > matter? We need space in the inode to store the hint ID. Look, this feels like going in circles. All this gaslighting is what makes it difficult to push patches when you just do not like the feature. It is the 3rd time I have proposed a way forward and you simply cannot provide any specific technical feedback - in the past email I posted several questions about the interface you seem to be talking about and you explicitly omitted that.
On Tue, Oct 15, 2024 at 03:07:57AM +0000, Javier Gonzalez wrote: > > On Mon, Oct 14, 2024 at 09:08:24AM +0000, Javier Gonzalez wrote: > > > > Especially as it directly undermines any file system work to actually make use > > of it. > > > > > > I do not think it does. If a FS wants to use the temperatures, then they > > > would be able to leverage FDP besides SCSI. > > > > What do you mean with that? This is a bit too much whitepaper vocabulary. > > > > We have code in XFS that can make use of the temperature hints. But to > > make them work it actually needs to do real stream separation on the > > device. I.e. the file system consumes the temperature hints. > > The device can guarantee the stream separation without knowing the temperature. Of course. But that's entirely not the point. > > What space from VFS folks do you need for hints? And why does it > > matter? > > We need space in the inode to store the hint ID. > > Look, this feels like going in circles. All this gaslighting is what makes it difficult to I'm so fucking tired of this. You are not even listening to my arguments at all, even when explaining every detail to you again and again. And then you accuse me of "gaslighting". > push patches when you just do not like the feature. It is the 3rd time I have proposed > a way forward and you simply cannot provide any specific technical feedback - in the > past email I posted several questions about the interface you seem to be talking > about and you explicitly omitted that. ??? You are throwing random buzzwords and not even replying to all the low level technical points. If you think I missed something please just ask again instead of this crap.
As the current discussion is probably too tiring for anyone who is not a professional mud fighter I'll summarize my objections and comments here once again in a way that even aspiring middle managers should be able to understand. 1) While the current per-file temperature hints interface is not perfect it is okay and makes sense to reuse it until we need something more fancy. We make good use of it in f2fs and the upcoming zoned xfs code to help with data placement and have numbers to show that it helps. 2) A per-I/O interface to set these temperature hints conflicts badly with how placement works in file systems. If we have an urgent need for it on the block device it needs to be opt-in via the file operations, so it can be enabled on block devices but not on file systems by default. If a given file system finds a way to implement it, it can still opt into implementing it, of course. 3) Mapping from temperature hints to separate write streams needs to happen above the block layer, because file systems need to be in control of it to do intelligent placement. That means if you want to map from temperature hints to stream separation it needs to be implemented at the file operation layer, not in the device driver. The mapping implemented in this series is probably only useful for block devices. Maybe if dumb file systems want to adopt it, it could be split into library code for reuse, but as usual that's probably best done only when actually needed. 4) To support this the block layer, that is bios and requests, needs to support a notion of stream separation. Kanchan's previous series had most of the bits for that, it just needs to be iterated on. All of this could probably have been done easily in the time spent on this discussion.
On Tue, Oct 15, 2024 at 07:50:06AM +0200, Christoph Hellwig wrote:
> 1) While the current per-file temperature hints interface is not perfect
> it is okay and make sense to reuse until we need something more fancy.
> We make good use of it in f2fs and the upcoming zoned xfs code to help
> with data placement and have numbers to show that it helps.

So we're okay to proceed with patch 1?

> 2) A per-I/O interface to set these temperature hint conflicts badly
> with how placement works in file systems. If we have an urgent need
> for it on the block device it needs to be opt-in by the file operations
> so it can be enabled on block device, but not on file systems by
> default. This way you can implement it for block device, but not
> provide it on file systems by default. If a given file system finds
> a way to implement it it can still opt into implementing it of course.

If we add a new fop_flag that only block fops enables, then it's okay?

> 3) Mapping from temperature hints to separate write streams needs to
> happen above the block layer, because file systems need to be in
> control of it to do intelligent placement. That means if you want to
> map from temperature hints to stream separation it needs to be
> implemented at the file operation layer, not in the device driver.
> The mapping implemented in this series is probably only useful for
> block devices. Maybe if dumb file systems want to adopt it, it could
> be split into library code for reuse, but as usual that's probably
> best done only when actually needed.

IMO, I don't even think the io_uring per-io hint needs to be limited to
the fcntl lifetime values. It could just be a u16 value opaque to the
block layer that just gets forwarded to the device.

> 4) To support this the block layer, that is bios and requests need
> to support a notion of stream separation. Kanchan's previous series
> had most of the bits for that, it just needs to be iterated on.
>
> All of this could have probably be easily done in the time spent on
> this discussion.
On Tue, Oct 15, 2024 at 09:09:20AM -0600, Keith Busch wrote:
> On Tue, Oct 15, 2024 at 07:50:06AM +0200, Christoph Hellwig wrote:
> > 1) While the current per-file temperature hints interface is not perfect
> > it is okay and make sense to reuse until we need something more fancy.
> > We make good use of it in f2fs and the upcoming zoned xfs code to help
> > with data placement and have numbers to show that it helps.
>
> So we're okay to proceed with patch 1?

No, see points 3 and 4 below for why. We'll need something like the
interface suggested by me in point 4 and by you in reply to point 3 in
the block layer, and then block/fops.c can implement the mapping on top
of that for drivers supporting it.

> > 2) A per-I/O interface to set these temperature hint conflicts badly
> > with how placement works in file systems. If we have an urgent need
> > for it on the block device it needs to be opt-in by the file operations
> > so it can be enabled on block device, but not on file systems by
> > default. This way you can implement it for block device, but not
> > provide it on file systems by default. If a given file system finds
> > a way to implement it it can still opt into implementing it of course.
>
> If we add a new fop_flag that only block fops enables, then it's okay?

The flag is just one part of it. Of course it needs to be discoverable
from userspace in one way or another, and the marshalling of the flag
needs to be controlled by the file system / fops instance.

> > 3) Mapping from temperature hints to separate write streams needs to
> > happen above the block layer, because file systems need to be in
> > control of it to do intelligent placement. That means if you want to
> > map from temperature hints to stream separation it needs to be
> > implemented at the file operation layer, not in the device driver.
> > The mapping implemented in this series is probably only useful for
> > block devices. Maybe if dumb file systems want to adopt it, it could
> > be split into library code for reuse, but as usual that's probably
> > best done only when actually needed.
>
> IMO, I don't even think the io_uring per-io hint needs to be limited to
> the fcntl lifetime values. It could just be a u16 value opaque to the
> block layer that just gets forwarded to the device.

Well, that's what I've been arguing for all the time, and what Kanchan's
previous series was working towards. It's not quite that trivial, as we
need a bit more than just the stream, e.g. a way to discover how many of
them exist.

> > 4) To support this the block layer, that is bios and requests need
> > to support a notion of stream separation. Kanchan's previous series
> > had most of the bits for that, it just needs to be iterated on.
> >
> > All of this could have probably be easily done in the time spent on
> > this discussion.

---end quoted text---
Seems per-I/O hints are not getting the love they deserve.

Apart from the block device, the use case is when all I/Os of a VM (or
container) are to be grouped together or placed differently. Per-I/O
hints are fine-granular (enough for userspace to build
per-process/vm/file/whatever etc.) and add the flexibility we have
lacked so far.

As for conflict, I doubt it exists. Please see below:

> 2) A per-I/O interface to set these temperature hint conflicts badly
> with how placement works in file systems. If we have an urgent need
> for it on the block device it needs to be opt-in by the file operations
> so it can be enabled on block device, but not on file systems by
> default. This way you can implement it for block device, but not
> provide it on file systems by default. If a given file system finds
> a way to implement it it can still opt into implementing it of course.

Why do you see this as something that is so different across filesystems
that they would need to "find a way to implement"?

Both per-file and per-io hints are supplied by userspace. Inode and
kiocb only happen to be the means to receive the hint information. The
FS is free to use this information (iff it wants) or simply forward it
down.

The per-file hint just gets stored (within the inode) without individual
FS involvement. The per-io hint follows the same model (i.e., it is set
by an upper layer like io_uring/aio) and uses the kiocb to store the
hint. It does not alter the stored inode hint value!

After patch #3, filesystems have both per-file and per-io information
available. And as before, they can use that hint info (from kiocb or
inode) and/or simply forward it. The generic code (like fs/direct-io.c,
fs/iomap/direct-io.c etc.) already forwards the incoming hints, without
any intelligence. And we need just that, because with user-passed hints
the onus of intelligence is on the userspace. This is how hints work on
xfs/ext4/btrfs files at this point.

The owner of the file (filesystem, block, whatever) can still use the
incoming hints to do any custom/extra feature, either from the inode or
from the kiocb depending on which information (per-file or per-io hint)
it finds more useful for that custom feature. For example, F2FS is using
per-inode information to do something custom, and that part has not been
changed by these patches.

Overall, I do not see the conflict. It's all user-driven. No?

For the currently uncommon case when the FS decides to generate its own
hints (as opposed to being a userspace hint consumer/forwarder), it can
directly put the hint value in the bio.
On Thu, Oct 17, 2024 at 08:05:38PM +0530, Kanchan Joshi wrote:
> Seems per-I/O hints are not getting the love they deserve.
> Apart from the block device, the usecase is when all I/Os of VM (or
> container) are to be grouped together or placed differently.

But that assumes the file system could actually support it. Which is
hard unless you treat the file system as simply a passthrough entity,
which will not give you great results.

> > 2) A per-I/O interface to set these temperature hint conflicts badly
> > with how placement works in file systems. If we have an urgent need
> > for it on the block device it needs to be opt-in by the file operations
> > so it can be enabled on block device, but not on file systems by
> > default. This way you can implement it for block device, but not
> > provide it on file systems by default. If a given file system finds
> > a way to implement it it can still opt into implementing it of course.
>
> Why do you see this as something that is so different across filesystems
> that they would need to "find a way to implement"?

If you want to do useful stream separation you need to write data
sequentially into the stream. Now with streams or FDP that does not
actually imply sequentially in LBA space, but if you want the file
system to not actually deal with fragmentation from hell, and to easily
track what is grouped together, you really want it sequential in LBA
space as well. In other words, any kind of write placement needs to be
intimately tied to the file system block allocator.

> Both per-file and per-io hints are supplied by userspace. Inode and
> kiocb only happen to be the mean to receive the hint information.
> FS is free to use this information (iff it wants) or simply forward this
> down.

As mentioned above, just passing it down is not actually very useful. It
might give you nice benchmark numbers when you basically reimplement
space management in userspace on a fully preallocated file, but for that
you're better off just using the block device. If you actually want to
treat the files as files you need full file system involvement.

> Per-file hint just gets stored (within inode) without individual FS
> involvement. Per-io hint follows the same model (i.e., it is set by
> upper layer like io_uring/aio) and uses kiocb to store the hint. It does
> not alter the stored inode hint value!

Yes, and now you'll get complaints that the file system ignores it when
it can't properly support it. This is why we need a per-fop opt in.

> The generic code (like fs/direct-io.c, fs/iomap/direct-io.c etc.,)
> already forwards the incoming hints, without any intelligence.

Yes, and that is a problem. We stopped doing that, but Samsung sneaked
some of this back in recently as I noticed.

> Overall, I do not see the conflict. It's all user-driven. No?

I have the gut feeling that you've just run benchmarks on image files
emulating block devices and not actually tried real file system
workloads based on this, unfortunately.
On Thu, Oct 17, 2024 at 05:23:37PM +0200, Christoph Hellwig wrote:
> If you want to do useful stream separation you need to write data
> sequentially into the stream. Now with streams or FDP that does not
> actually imply sequentially in LBA space, but if you want the file
> system to not actually deal with fragmentation from hell, and be
> easily track what is grouped together you really want it sequentially
> in the LBA space as well. In other words, any kind of write placement
> needs to be intimately tied to the file system block allocator.

I'm replying just to make sure I understand what you're saying:

If we send per-IO hints on a file, we could have interleaved hot and
cold pages at various offsets of that file, so the filesystem needs an
efficient way to allocate extents and track these so that it doesn't
interleave these in LBA space. I think that makes sense.

We can add a fop_flags flag, and block/fops.c can be the first one to
turn it on since that LBA access is entirely user driven.
On Thu, Oct 17, 2024 at 09:44:39AM -0600, Keith Busch wrote:
> I'm replying just to make sure I understand what you're saying:
>
> If we send per IO hints on a file, we could have interleaved hot and
> cold pages at various offsets of that file, so the filesystem needs an
> efficient way to allocate extents and track these so that it doesn't
> interleave these in LBA space. I think that makes sense.

That's a little simplified, but yes.

> We can add a fop_flags and block/fops.c can be the first one to turn it
> on since that LBA access is entirely user driven.

Yes, that's my main request on the per-I/O hint interface.

Now we just need to not dumb down the bio level interface to five
temperature levels and instead just expose the different write streams,
and we can all have a happy Kumbaya.
On Thu, Oct 17, 2024 at 05:46:49PM +0200, Christoph Hellwig wrote:
> On Thu, Oct 17, 2024 at 09:44:39AM -0600, Keith Busch wrote:
>
> > We can add a fop_flags and block/fops.c can be the first one to turn it
> > on since that LBA access is entirely user driven.
>
> Yes, that's my main request on the per-I/O hint interface.
>
> Now we just need to not dumb down the bio level interface to five
> temeperature level and just expose the different write streams and we
> can all have a happy Kumbaya.

I'm sending a revised version of this patch set that attempts to address
these issues. The io_uring hint is weird to me, so that part is
simplified just to get the point across.
On 10/17/24 8:44 AM, Keith Busch wrote:
> On Thu, Oct 17, 2024 at 05:23:37PM +0200, Christoph Hellwig wrote:
>> If you want to do useful stream separation you need to write data
>> sequentially into the stream. Now with streams or FDP that does not
>> actually imply sequentially in LBA space, but if you want the file
>> system to not actually deal with fragmentation from hell, and be
>> easily track what is grouped together you really want it sequentially
>> in the LBA space as well. In other words, any kind of write placement
>> needs to be intimately tied to the file system block allocator.
>
> I'm replying just to make sure I understand what you're saying:
>
> If we send per IO hints on a file, we could have interleaved hot and
> cold pages at various offsets of that file, so the filesystem needs an
> efficient way to allocate extents and track these so that it doesn't
> interleave these in LBA space. I think that makes sense.
>
> We can add a fop_flags and block/fops.c can be the first one to turn it
> on since that LBA access is entirely user driven.

Does anyone care about buffered I/O to block devices? When using
buffered I/O, the write_hint information from the inode is used and the
per-I/O write_hint information is ignored.

Thanks,

Bart.
On Thu, Oct 17, 2024 at 09:15:21AM -0700, Bart Van Assche wrote:
> On 10/17/24 8:44 AM, Keith Busch wrote:
> > On Thu, Oct 17, 2024 at 05:23:37PM +0200, Christoph Hellwig wrote:
> > > If you want to do useful stream separation you need to write data
> > > sequentially into the stream. Now with streams or FDP that does not
> > > actually imply sequentially in LBA space, but if you want the file
> > > system to not actually deal with fragmentation from hell, and be
> > > easily track what is grouped together you really want it sequentially
> > > in the LBA space as well. In other words, any kind of write placement
> > > needs to be intimately tied to the file system block allocator.
> >
> > I'm replying just to make sure I understand what you're saying:
> >
> > If we send per IO hints on a file, we could have interleaved hot and
> > cold pages at various offsets of that file, so the filesystem needs an
> > efficient way to allocate extents and track these so that it doesn't
> > interleave these in LBA space. I think that makes sense.
> >
> > We can add a fop_flags and block/fops.c can be the first one to turn it
> > on since that LBA access is entirely user driven.
>
> Does anyone care about buffered I/O to block devices? When using
> buffered I/O, the write_hint information from the inode is used and the per
> I/O write_hint information is ignored.

I'm pretty sure there are applications that use buffered IO on raw block
devices (e.g. postgresql), but it's a moot point: the block
file_operations that provide the fop_flags also provide the callbacks
for O_DIRECT, which is where this matters.

We can't really use per-io write_hints on buffered I/O. At least not
yet, and maybe never. I'm not sure it makes sense for raw block because
the page writes won't necessarily match writes to storage.