
[PATCHv10,9/9] scsi: set permanent stream count in block limits

Message ID 20241029151922.459139-10-kbusch@meta.com (mailing list archive)
State New
Series: write hints with nvme fdp, scsi streams

Commit Message

Keith Busch Oct. 29, 2024, 3:19 p.m. UTC
From: Keith Busch <kbusch@kernel.org>

The block limits structure exports the number of write hints, so set this
limit if the device reports support for the lifetime hints. Not only does
this inform the user which hints are possible, it also allows SCSI devices
supporting the feature to utilize the full range through raw block
device direct I/O.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/scsi/sd.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Christoph Hellwig Oct. 29, 2024, 3:26 p.m. UTC | #1
On Tue, Oct 29, 2024 at 08:19:22AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> The block limits exports the number of write hints, so set this limit if
> the device reports support for the lifetime hints. Not only does this
> inform the user of which hints are possible, it also allows scsi devices
> supporting the feature to utilize the full range through raw block
> device direct-io.
> 
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Keith Busch <kbusch@kernel.org>

Despite the reviews this is still incorrect.  The permanent streams have
a relative data temperature associated with them as pointed out last
round and are not arbitrary write stream contexts despite (ab)using
the SBC streams facilities.

Bart, btw: I think the current sd implementation is buggy as well, as
it assumes the permanent streams are ordered by their data temperature
in the IO Advise hints mode page, but I can't find anything in the
spec that requires a particular ordering.
Keith Busch Oct. 29, 2024, 3:34 p.m. UTC | #2
On Tue, Oct 29, 2024 at 04:26:54PM +0100, Christoph Hellwig wrote:
> On Tue, Oct 29, 2024 at 08:19:22AM -0700, Keith Busch wrote:
> > From: Keith Busch <kbusch@kernel.org>
> > 
> > The block limits exports the number of write hints, so set this limit if
> > the device reports support for the lifetime hints. Not only does this
> > inform the user of which hints are possible, it also allows scsi devices
> > supporting the feature to utilize the full range through raw block
> > device direct-io.
> > 
> > Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> > Reviewed-by: Hannes Reinecke <hare@suse.de>
> > Signed-off-by: Keith Busch <kbusch@kernel.org>
> 
> Despite the reviews this is still incorrect.  The permanent streams have
> a relative data temperature associated with them as pointed out last
> round and are not arbitrary write stream contexts despite (ab)using
> the SBC streams facilities.

So then don't use it that way? I still don't know what change you're
expecting to happen with this feedback. What do you want the kernel to
do differently here?
Christoph Hellwig Oct. 29, 2024, 3:37 p.m. UTC | #3
On Tue, Oct 29, 2024 at 09:34:07AM -0600, Keith Busch wrote:
> So then don't use it that way? I still don't know what change you're
> expecting to happen with this feedback. What do you want the kernel to
> do differently here?

Same as before:  don't expose them as write streams, because they
aren't.  A big mess in this series going back to the versions before
your involvement is that they somehow want to tie up the temperature
hints with the stream separation, which just ends up very messy.
Keith Busch Oct. 29, 2024, 3:38 p.m. UTC | #4
On Tue, Oct 29, 2024 at 04:37:02PM +0100, Christoph Hellwig wrote:
> On Tue, Oct 29, 2024 at 09:34:07AM -0600, Keith Busch wrote:
> > So then don't use it that way? I still don't know what change you're
> > expecting to happen with this feedback. What do you want the kernel to
> > do differently here?
> 
> Same as before:  don't expose them as write streams, because they
> aren't.  A big mess in this series going back to the versions before
> your involvement is that they somehow want to tie up the temperature
> hints with the stream separation, which just ends up very messy.

They're not exposed as write streams. Patch 7/9 sets the feature depending
on whether it is a placement id or not, and only nvme sets it, so scsi's
attributes are not claiming to be a write stream.
Christoph Hellwig Oct. 29, 2024, 3:53 p.m. UTC | #5
On Tue, Oct 29, 2024 at 09:38:44AM -0600, Keith Busch wrote:
> They're not exposed as write streams. Patch 7/9 sets the feature if it
> is a placement id or not, and only nvme sets it, so scsi's attributes
> are not claiming to be a write stream.

So it shows up in sysfs, but:

 - queue_max_write_hints (which really should be queue_max_write_streams)
   still picks it up, and from there the statx interface

 - per-inode fcntl hints that encode a temperature still magically
   get dumped into the write streams if they are set.

In other words it's a really leaky half-baked abstraction.

Let's brainstorm how it could be done better:

 - the max_write_streams values are only set by block devices that actually
   do support write streams, and not by the fire-and-forget temperature
   hints.  The way this is queried is by having a non-zero value
   there, no need for an extra flag.
 - but the struct file (or maybe inode) gets a supported flag, as stream
   separation needs to be supported by the file system 
 - a separate fcntl is used to set per-inode streams (if you care about
   that, seems like the bdev use case focuses on per-I/O).  In that case
   we'd probably also need a separate inode field for them, or a somewhat
   complicated scheme to decide what is stored in the inode field if there
   is only one.
 - for block devices bdev/fops.c maps the temperature hints into write
   streams if write streams are supported, any user that mixes and
   matches write streams and temperature hints gets what they deserve
 - this could also be a helper for file systems that want to do the
   same.

Just a quick writeup while I'm on the run, there's probably a hole or
two that could be poked into it.
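
Purely as an illustration of the bdev/fops.c mapping idea sketched in the
list above (nothing here is from the series; bdev_max_write_streams() is an
assumed helper and the names are made up):

static u16 bdev_hint_to_write_stream(struct block_device *bdev,
				     enum rw_hint hint)
{
	/* Assumed helper: number of write streams the device exposes. */
	unsigned short streams = bdev_max_write_streams(bdev);

	if (!streams || hint <= WRITE_LIFE_NONE)
		return 0;	/* 0: no explicit stream, let the device place it */

	/* Scale the SHORT..EXTREME hints onto the available streams. */
	return min_t(u16, hint - WRITE_LIFE_NONE, streams);
}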
Keith Busch Oct. 29, 2024, 4:22 p.m. UTC | #6
On Tue, Oct 29, 2024 at 04:53:30PM +0100, Christoph Hellwig wrote:
> On Tue, Oct 29, 2024 at 09:38:44AM -0600, Keith Busch wrote:
> > They're not exposed as write streams. Patch 7/9 sets the feature if it
> > is a placement id or not, and only nvme sets it, so scsi's attributes
> > are not claiming to be a write stream.
> 
> So it shows up in sysfs, but:
> 
>  - queue_max_write_hints (which really should be queue_max_write_streams)
>    still picks it up, and from there the statx interface
> 
>  - per-inode fcntl hint that encode a temperature still magically
>    get dumpted into the write streams if they are set.
> 
> In other words it's a really leaky half-backed abstraction.

Exactly why I asked last time: "who uses it and how do you want them to
use it" :)
 
> Let's brainstorm how it could be done better:
> 
>  - the max_write_streams values only set by block devices that actually
>    do support write streams, and not the fire and forget temperature
>    hints.  They way this is queried is by having a non-zero value
>    there, not need for an extra flag.

So we need a completely different attribute for SCSI's permanent write
streams? You'd mentioned earlier you were okay with having SCSI be able
to utilize per-IO raw block write hints. Having multiple things to
check for what are all just write classifiers seems unnecessarily
complicated.

>  - but the struct file (or maybe inode) gets a supported flag, as stream
>    separation needs to be supported by the file system 
>  - a separate fcntl is used to set per-inode streams (if you care about
>    that, seem like the bdev use case focusses on per-I/O).  In that case
>    we'd probably also need a separate inode field for them, or a somewhat
>    complicated scheme to decide what is stored in the inode field if there
>    is only one.

No need to create a new fcntl. The people already testing this are
successfully using FDP with the existing fcntl hints. Their applications
leverage FDP as a way to separate files based on expected lifetime. It is
how they want to use it and it is working above expectations.

>  - for block devices bdev/fops.c maps the temperature hints into write
>    streams if write streams are supported, any user that mixes and
>    matches write streams and temperature hints gets what they deserve

That's fine. This patch series pretty much accomplishes that part.

>  - this could also be a helper for file systems that want to do the
>    same.
> 
> Just a quick writeup while I'm on the run, there's probably a hole or
> two that could be poked into it.
Bart Van Assche Oct. 29, 2024, 5:18 p.m. UTC | #7
On 10/29/24 8:26 AM, Christoph Hellwig wrote:
> Bart, btw: I think the current sd implementation is buggy as well, as
> it assumes the permanent streams are ordered by their data temperature
> in the IO Advise hints mode page, but I can't find anything in the
> spec that requires a particular ordering.

How about modifying sd_read_io_hints() such that permanent stream
information is ignored if the order of the RELATIVE LIFETIME information
reported by the GET STREAM STATUS command does not match the permanent
stream order?

Thanks,

Bart.

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 41e2dfa2d67d..277035febd82 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3192,7 +3192,12 @@ sd_read_cache_type(struct scsi_disk *sdkp, unsigned char *buffer)
 	sdkp->DPOFUA = 0;
 }
 
-static bool sd_is_perm_stream(struct scsi_disk *sdkp, unsigned int stream_id)
+/*
+ * Returns the relative lifetime of a permanent stream. Returns -1 if the
+ * GET STREAM STATUS command fails or if the stream is not a permanent stream.
+ */
+static int sd_perm_stream_rel_lifetime(struct scsi_disk *sdkp,
+				       unsigned int stream_id)
 {
 	u8 cdb[16] = { SERVICE_ACTION_IN_16, SAI_GET_STREAM_STATUS };
 	struct {
@@ -3212,14 +3217,16 @@ static bool sd_is_perm_stream(struct scsi_disk *sdkp, unsigned int stream_id)
 	res = scsi_execute_cmd(sdev, cdb, REQ_OP_DRV_IN, &buf, sizeof(buf),
 			       SD_TIMEOUT, sdkp->max_retries, &exec_args);
 	if (res < 0)
-		return false;
+		return -1;
 	if (scsi_status_is_check_condition(res) && scsi_sense_valid(&sshdr))
 		sd_print_sense_hdr(sdkp, &sshdr);
 	if (res)
-		return false;
+		return -1;
 	if (get_unaligned_be32(&buf.h.len) < sizeof(struct scsi_stream_status))
-		return false;
-	return buf.h.stream_status[0].perm;
+		return -1;
+	if (!buf.h.stream_status[0].perm)
+		return -1;
+	return buf.h.stream_status[0].rel_lifetime;
 }
 
 static void sd_read_io_hints(struct scsi_disk *sdkp, unsigned char *buffer)
@@ -3247,9 +3254,17 @@ static void sd_read_io_hints(struct scsi_disk *sdkp, unsigned char *buffer)
 	 * should assign the lowest numbered stream identifiers to permanent
 	 * streams.
	 */
-	for (desc = start; desc < end; desc++)
-		if (!desc->st_enble || !sd_is_perm_stream(sdkp, desc - start))
+	int prev_rel_lifetime = -1;
+	for (desc = start; desc < end; desc++) {
+		int rel_lifetime;
+
+		if (!desc->st_enble)
 			break;
+		rel_lifetime = sd_perm_stream_rel_lifetime(sdkp, desc - start);
+		if (rel_lifetime < 0 || rel_lifetime < prev_rel_lifetime)
+			break;
+		prev_rel_lifetime = rel_lifetime;
+	}
 	permanent_stream_count_old = sdkp->permanent_stream_count;
 	sdkp->permanent_stream_count = desc - start;
 	if (sdkp->rscs && sdkp->permanent_stream_count < 2)
Christoph Hellwig Oct. 30, 2024, 4:55 a.m. UTC | #8
On Tue, Oct 29, 2024 at 10:22:56AM -0600, Keith Busch wrote:
> On Tue, Oct 29, 2024 at 04:53:30PM +0100, Christoph Hellwig wrote:
> > On Tue, Oct 29, 2024 at 09:38:44AM -0600, Keith Busch wrote:
> > > They're not exposed as write streams. Patch 7/9 sets the feature if it
> > > is a placement id or not, and only nvme sets it, so scsi's attributes
> > > are not claiming to be a write stream.
> > 
> > So it shows up in sysfs, but:
> > 
> >  - queue_max_write_hints (which really should be queue_max_write_streams)
> >    still picks it up, and from there the statx interface
> > 
> >  - per-inode fcntl hint that encode a temperature still magically
> >    get dumpted into the write streams if they are set.
> > 
> > In other words it's a really leaky half-backed abstraction.
> 
> Exactly why I asked last time: "who uses it and how do you want them to
> use it" :)

For the temperature hints the only public user I know of is rocksdb, and
that only started working when Hans fixed a brown paperbag bug in the
rocksdb code a while ago.  Given that f2fs interprets the hints I suspect
something in the Android world does as well, maybe Bart knows more.

For the separate write streams the usage I want for them is poor man's
zones - e.g. write N LBAs sequentially into a separate write stream
and then eventually discard them together.  This will fit nicely into
f2fs and the pending xfs work as well as quite a few userspace storage
systems.  For that the file system or application needs to query
the number of available write streams (and in the bitmap world their
numbers if they are discontiguous) and the size you can fit into the
"reclaim unit" in FDP terms.  I've not been bothering you much with
the latter as it is an easy retrofit once the I/O path bits land.

> > Let's brainstorm how it could be done better:
> > 
> >  - the max_write_streams values only set by block devices that actually
> >    do support write streams, and not the fire and forget temperature
> >    hints.  They way this is queried is by having a non-zero value
> >    there, not need for an extra flag.
> 
> So we need a completely different attribute for SCSI's permanent write
> streams? You'd mentioned earlier you were okay with having SCSI be able
> to utilized per-io raw block write hints. Having multiple things to
> check for what are all just write classifiers seems unnecessarily
> complicated.

I don't think the multiple write streams interface applies to SCSI's
write streams, as they enforce a relative temperature, and they don't
have the concept of how much you can write into a "reclaim unit".

OTOH there isn't much you need to query for them anyway, as the
temperature hints have always been defined as pure hints with all
the up- and downsides of that.

> No need to create a new fcntl. The people already testing this are
> successfully using FDP with the existing fcntl hints. Their applications
> leverage FDP as way to separate files based on expected lifetime. It is
> how they want to use it and it is working above expectations. 

FYI, I think it's always fine and easy to map the temperature hints to
write streams if that's all the driver offers.  It loses a lot of the
capabilities, but as long as it doesn't enforce a lower level interface
that never exposes more that's fine.
Christoph Hellwig Oct. 30, 2024, 5:42 a.m. UTC | #9
On Tue, Oct 29, 2024 at 10:18:31AM -0700, Bart Van Assche wrote:
> On 10/29/24 8:26 AM, Christoph Hellwig wrote:
>> Bart, btw: I think the current sd implementation is buggy as well, as
>> it assumes the permanent streams are ordered by their data temperature
>> in the IO Advise hints mode page, but I can't find anything in the
>> spec that requires a particular ordering.
>
> How about modifying sd_read_io_hints() such that permanent stream
> information is ignored if the order of the RELATIVE LIFETIME information
> reported by the GET STREAM STATUS command does not match the permanent
> stream order?

That seems odd as there is nothing implying that they should be ordered.
The logical thing to do would be to have a little array mapping the Linux
temperature levels to the stream ids.
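
A rough sketch of such a mapping, purely illustrative (the structure and
helpers below do not exist in sd.c; they only show the idea of recording
which stream id carries which relative lifetime instead of assuming an
ordering):

/* Indexed by enum rw_hint; 0 means no matching permanent stream. */
struct sd_stream_map {
	u16 stream_id[WRITE_LIFE_EXTREME + 1];
};

static void sd_record_perm_stream(struct sd_stream_map *map,
				  unsigned int stream_id,
				  enum rw_hint lifetime)
{
	/* Remember the first stream reported for each lifetime. */
	if (lifetime <= WRITE_LIFE_EXTREME && !map->stream_id[lifetime])
		map->stream_id[lifetime] = stream_id;
}

static u16 sd_stream_for_hint(const struct sd_stream_map *map,
			      enum rw_hint hint)
{
	return hint <= WRITE_LIFE_EXTREME ? map->stream_id[hint] : 0;
}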
Keith Busch Oct. 30, 2024, 3:41 p.m. UTC | #10
On Wed, Oct 30, 2024 at 05:55:26AM +0100, Christoph Hellwig wrote:
> On Tue, Oct 29, 2024 at 10:22:56AM -0600, Keith Busch wrote:
> 
> > No need to create a new fcntl. The people already testing this are
> > successfully using FDP with the existing fcntl hints. Their applications
> > leverage FDP as way to separate files based on expected lifetime. It is
> > how they want to use it and it is working above expectations. 
> 
> FYI, I think it's always fine and easy to map the temperature hits to
> write streams if that's all the driver offers.  It loses a lot of the
> capapilities, but as long as it doesn't enforce a lower level interface
> that never exposes more that's fine.

But that's just the v2 from this sequence:

https://lore.kernel.org/linux-nvme/20240528150233.55562-1-joshi.k@samsung.com/

If you're okay with it now, then let's just go with that and I'm happy
to continue iterating on the rest separately.
Christoph Hellwig Oct. 30, 2024, 3:45 p.m. UTC | #11
On Wed, Oct 30, 2024 at 09:41:39AM -0600, Keith Busch wrote:
> On Wed, Oct 30, 2024 at 05:55:26AM +0100, Christoph Hellwig wrote:
> > On Tue, Oct 29, 2024 at 10:22:56AM -0600, Keith Busch wrote:
> > 
> > > No need to create a new fcntl. The people already testing this are
> > > successfully using FDP with the existing fcntl hints. Their applications
> > > leverage FDP as way to separate files based on expected lifetime. It is
> > > how they want to use it and it is working above expectations. 
> > 
> > FYI, I think it's always fine and easy to map the temperature hits to
> > write streams if that's all the driver offers.  It loses a lot of the
> > capapilities, but as long as it doesn't enforce a lower level interface
> > that never exposes more that's fine.
> 
> But that's just the v2 from this sequence:
> 
> https://lore.kernel.org/linux-nvme/20240528150233.55562-1-joshi.k@samsung.com/
> 
> If you're okay with it now, then let's just go with that and I'm happy
> continue iterating on the rest separately. 

That's exactly what I do not want - it takes the temperature hints
and forces them into the write streams down in the driver with no
way to make actually useful use of the stream separation.
Keith Busch Oct. 30, 2024, 3:48 p.m. UTC | #12
On Wed, Oct 30, 2024 at 04:45:56PM +0100, Christoph Hellwig wrote:
> On Wed, Oct 30, 2024 at 09:41:39AM -0600, Keith Busch wrote:
> > On Wed, Oct 30, 2024 at 05:55:26AM +0100, Christoph Hellwig wrote:
> > > On Tue, Oct 29, 2024 at 10:22:56AM -0600, Keith Busch wrote:
> > > 
> > > > No need to create a new fcntl. The people already testing this are
> > > > successfully using FDP with the existing fcntl hints. Their applications
> > > > leverage FDP as way to separate files based on expected lifetime. It is
> > > > how they want to use it and it is working above expectations. 
> > > 
> > > FYI, I think it's always fine and easy to map the temperature hits to
> > > write streams if that's all the driver offers.  It loses a lot of the
> > > capapilities, but as long as it doesn't enforce a lower level interface
> > > that never exposes more that's fine.
> > 
> > But that's just the v2 from this sequence:
> > 
> > https://lore.kernel.org/linux-nvme/20240528150233.55562-1-joshi.k@samsung.com/
> > 
> > If you're okay with it now, then let's just go with that and I'm happy
> > continue iterating on the rest separately. 
> 
> That's exactly what I do not want - it takes the temperature hints
> and force them into the write streams down in the driver 

What??? You said to map the temperature hints to a write stream. The
driver offers that here. But you specifically don't want that? I'm so
confused.

> with no way to make actually useful use of the stream separation.

Have you tried it? The people who actually do can easily demonstrate that
it is in fact very useful.
Christoph Hellwig Oct. 30, 2024, 3:50 p.m. UTC | #13
On Wed, Oct 30, 2024 at 09:48:39AM -0600, Keith Busch wrote:
> What??? You said to map the temperature hints to a write stream. The
> driver offers that here. But you specifically don't want that? I'm so
> confused.

In bdev/fops.c (or file systems if they want to do that), not down in the
driver, forced down everyone's throat.  Which was the whole point of the
discussion that we're running in circles here.

> > with no way to make actually useful use of the stream separation.
> 
> Have you tried it? The people who actually do easily demonstrate it is
> in fact very useful.

While I've read the claim multiple times, I've not actually seen any
numbers.
Keith Busch Oct. 30, 2024, 4:42 p.m. UTC | #14
On Wed, Oct 30, 2024 at 04:50:52PM +0100, Christoph Hellwig wrote:
> On Wed, Oct 30, 2024 at 09:48:39AM -0600, Keith Busch wrote:
> > What??? You said to map the temperature hints to a write stream. The
> > driver offers that here. But you specifically don't want that? I'm so
> > confused.
> 
> In bdev/fops.c (or file systems if they want to do that) not down in the
> driver forced down everyones throat.  Which was the whole point of the
> discussion that we're running in circles here.

That makes no sense. A change completely isolated to a driver isn't
forcing anything on anyone. It's the upper layers that are forcing this
down, whether the driver uses it or not: the hints are already getting
to the driver, but the driver currently doesn't use them. Finding a way to
use them is not some force to be demonized...

> > > with no way to make actually useful use of the stream separation.
> > 
> > Have you tried it? The people who actually do easily demonstrate it is
> > in fact very useful.
> 
> While I've read the claim multiple times, I've not actually seen any
> numbers.

Here's something recent from rocksdb developers running the ycsb workloada
benchmark. The filesystem used is XFS.

It sets temperature hints for different SST levels, which already
happens today. The last data point made some minor changes to the
level-to-hint mapping.

Without FDP:

WAF:        2.72
IOPS:       1465
READ LAT:   2681us
UPDATE LAT: 3115us

With FDP (rocksdb unmodified):

WAF:        2.26
IOPS:       1473
READ LAT:   2415us
UPDATE LAT: 2807us

With FDP (with some minor rocksdb changes):

WAF:        1.67
IOPS:       1547
READ LAT:   1978us
UPDATE LAT: 2267us
Christoph Hellwig Oct. 30, 2024, 4:57 p.m. UTC | #15
On Wed, Oct 30, 2024 at 10:42:59AM -0600, Keith Busch wrote:
> On Wed, Oct 30, 2024 at 04:50:52PM +0100, Christoph Hellwig wrote:
> > On Wed, Oct 30, 2024 at 09:48:39AM -0600, Keith Busch wrote:
> > > What??? You said to map the temperature hints to a write stream. The
> > > driver offers that here. But you specifically don't want that? I'm so
> > > confused.
> > 
> > In bdev/fops.c (or file systems if they want to do that) not down in the
> > driver forced down everyones throat.  Which was the whole point of the
> > discussion that we're running in circles here.
> 
> That makes no sense. A change completely isolated to a driver isn't
> forcing anything on anyone. It's the upper layers that's forcing this
> down, whether the driver uses it or not: the hints are already getting
> to the driver, but the driver currently doesn't use it.

And once it's used by default, taking it away will have someone scream
regression, because we're now taking it away from that super special
use case.

> Here's something recent from rocksdb developers running ycsb worklada
> benchmark. The filesystem used is XFS.

Thanks for finally putting something up.

> It sets temperature hints for different SST levels, which already
> happens today. The last data point made some minor changes with
> level-to-hint mapping.

Do you have a pointer to the changes?

> Without FDP:
> 
> WAF:        2.72
> IOPS:       1465
> READ LAT:   2681us
> UPDATE LAT: 3115us
> 
> With FDP (rocksdb unmodified):
> 
> WAF:        2.26
> IOPS:       1473
> READ LAT:   2415us
> UPDATE LAT: 2807us
> 
> With FDP (with some minor rocksdb changes):
> 
> WAF:        1.67
> IOPS:       1547
> READ LAT:   1978us
> UPDATE LAT: 2267us

Compared to the numbers Hans presented at Plumbers for the Zoned XFS code,
which should work just fine with FDP IFF we exposed real write streams,
and which roughly double read and write IOPS and reduce the WAF to almost
1, this doesn't look too spectacular to be honest, but it sure is something.

I just wish we could get the real infrastructure instead of some band
aid, which makes it really hard to expose the real thing because now
it's been taken up and directly wired to a UAPI.
Bart Van Assche Oct. 30, 2024, 4:59 p.m. UTC | #16
On 10/29/24 9:55 PM, Christoph Hellwig wrote:
> For the temperature hints the only public user I known is rocksdb, and
> that only started working when Hans fixed a brown paperbag bug in the
> rocksdb code a while ago.  Given that f2fs interprets the hints I suspect
> something in the Android world does as well, maybe Bart knows more.

UFS devices typically have less internal memory (SRAM) than the size of 
a single zone. Hence, it helps UFS devices if it can be decided at the
time a write command is received where to send the data (SRAM, SLC NAND
or TLC NAND). This is why UFS vendors asked to provide data lifetime
information to zoned logical units. More information about UFS device
internals is available in this paper: Hwang, Joo-Young, Seokhwan Kim,
Daejun Park, Yong-Gil Song, Junyoung Han, Seunghyun Choi, Sangyeun Cho,
and Youjip Won. "{ZMS}: Zone Abstraction for Mobile Flash Storage." In
2024 USENIX Annual Technical Conference (USENIX ATC 24), pp. 173-189.
2024 (https://www.usenix.org/system/files/atc24-hwang.pdf).

Bart.
Keith Busch Oct. 30, 2024, 5:05 p.m. UTC | #17
On Wed, Oct 30, 2024 at 05:57:08PM +0100, Christoph Hellwig wrote:
> And once it uses by default, taking it away will have someone scream
> regresion, because we're not taking it away form that super special
> use case.

Refusing to allow something because someone might find it useful has got
to be the worst reasoning I've heard. :)
Christoph Hellwig Oct. 30, 2024, 5:14 p.m. UTC | #18
On Wed, Oct 30, 2024 at 09:59:24AM -0700, Bart Van Assche wrote:
>
> On 10/29/24 9:55 PM, Christoph Hellwig wrote:
>> For the temperature hints the only public user I known is rocksdb, and
>> that only started working when Hans fixed a brown paperbag bug in the
>> rocksdb code a while ago.  Given that f2fs interprets the hints I suspect
>> something in the Android world does as well, maybe Bart knows more.
>
> UFS devices typically have less internal memory (SRAM) than the size of a 
> single zone.

That wasn't quite the question.  Do you know what application in Android
sets the fcntl temperature hints?
Christoph Hellwig Oct. 30, 2024, 5:15 p.m. UTC | #19
On Wed, Oct 30, 2024 at 11:05:03AM -0600, Keith Busch wrote:
> On Wed, Oct 30, 2024 at 05:57:08PM +0100, Christoph Hellwig wrote:
> > And once it uses by default, taking it away will have someone scream
> > regresion, because we're not taking it away form that super special
> > use case.
> 
> Refusing to allow something because someone might find it useful has got
> to be the worst reasoning I've heard. :)

That's not the point.  The point is that by locking us in we can't actually
do the proper thing.  And that's what I'm really worried about.
Especially with the not all that great numbers in the success story.
Keith Busch Oct. 30, 2024, 5:23 p.m. UTC | #20
On Wed, Oct 30, 2024 at 05:57:08PM +0100, Christoph Hellwig wrote:
> > On Wed, Oct 30, 2024 at 04:50:52PM +0100, Christoph Hellwig wrote:
> 
> > It sets temperature hints for different SST levels, which already
> > happens today. The last data point made some minor changes with
> > level-to-hint mapping.
> 
> Do you have a pointer to the changes?

The change moves levels 2 and 3 to "MEDIUM" (along with 0 and 1 already
there), 4 to "LONG", and >= 5 remain "EXTREME".

WAL continues to be "SHORT", as before.
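
Spelled out as code just for clarity (this is not the actual RocksDB patch;
the helper name is invented and it uses the fcntl hint values the kernel
ends up seeing):

static int sst_level_to_rw_hint(int level, bool is_wal)
{
	if (is_wal)
		return RWH_WRITE_LIFE_SHORT;	/* WAL stays SHORT, as before */
	if (level <= 3)
		return RWH_WRITE_LIFE_MEDIUM;	/* levels 0-3 */
	if (level == 4)
		return RWH_WRITE_LIFE_LONG;
	return RWH_WRITE_LIFE_EXTREME;		/* level 5 and up */
}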
 
> > Without FDP:
> > 
> > WAF:        2.72
> > IOPS:       1465
> > READ LAT:   2681us
> > UPDATE LAT: 3115us
> > 
> > With FDP (rocksdb unmodified):
> > 
> > WAF:        2.26
> > IOPS:       1473
> > READ LAT:   2415us
> > UPDATE LAT: 2807us
> > 
> > With FDP (with some minor rocksdb changes):
> > 
> > WAF:        1.67
> > IOPS:       1547
> > READ LAT:   1978us
> > UPDATE LAT: 2267us
> 
> Compared to the Numbers Hans presented at Plumbers for the Zoned XFS code,
> which should work just fine with FDP IFF we exposed real write streams,
> which roughly double read nad wirte IOPS and reduce the WAF to almost
> 1 this doesn't look too spectacular to be honest, but it sure it something.
> 
> > I just wish we could get the real infrastructure instead of some band
> aid, which makes it really hard to expose the real thing because now
> it's been taken up and directly wired to a UAPI.
> one

This doesn't have to be the end of placement streams development. I
fundamentally disagree that this locks anyone in to anything.
Bart Van Assche Oct. 30, 2024, 5:44 p.m. UTC | #21
On 10/30/24 10:14 AM, Christoph Hellwig wrote:
> On Wed, Oct 30, 2024 at 09:59:24AM -0700, Bart Van Assche wrote:
>>
>> On 10/29/24 9:55 PM, Christoph Hellwig wrote:
>>> For the temperature hints the only public user I known is rocksdb, and
>>> that only started working when Hans fixed a brown paperbag bug in the
>>> rocksdb code a while ago.  Given that f2fs interprets the hints I suspect
>>> something in the Android world does as well, maybe Bart knows more.
>>
>> UFS devices typically have less internal memory (SRAM) than the size of a
>> single zone.
> 
> That wasn't quite the question.  Do you know what application in android
> set the fcntl temperature hints?

I do not know whether there are any Android apps that use the
F_SET_(FILE_|)RW_HINT fcntls.

The only use case in Android platform code I know of is this one: Daejun
Park, "f2fs-tools: add write hint support", f2fs-dev mailing list,
September 2024 
(https://lore.kernel.org/all/20240904011217epcms2p5a1b15db8e0ae50884429da7be4af4de4@epcms2p5/T/).
As you probably know f2fs-tools is a software package that includes
fsck.f2fs.

Jaegeuk, please correct me if necessary.

Bart.
Keith Busch Oct. 30, 2024, 10:32 p.m. UTC | #22
On Wed, Oct 30, 2024 at 05:57:08PM +0100, Christoph Hellwig wrote:
> On Wed, Oct 30, 2024 at 10:42:59AM -0600, Keith Busch wrote:
> > With FDP (with some minor rocksdb changes):
> > 
> > WAF:        1.67
> > IOPS:       1547
> > READ LAT:   1978us
> > UPDATE LAT: 2267us
> 
> Compared to the Numbers Hans presented at Plumbers for the Zoned XFS code,
> which should work just fine with FDP IFF we exposed real write streams,
> which roughly double read nad wirte IOPS and reduce the WAF to almost
> 1 this doesn't look too spectacular to be honest, but it sure it something.

Hold up... I absolutely appreciate the work Hans is doing and has done. But
are you talking about this talk?

https://lpc.events/event/18/contributions/1822/attachments/1464/3105/Zoned%20XFS%20LPC%20Zoned%20MC%202024%20V1.pdf

That is very much apples-to-oranges. The B+ isn't on the same device
being evaluated for WAF, where this has all that mixed in. I think the
results are pretty good, all things considered.
 
> I just wish we could get the real infrastructure instead of some band
> aid, which makes it really hard to expose the real thing because now
> it's been taken up and directly wired to a UAPI.
> one

I don't know what to make of this. I think we're talking past each other.
Hans Holmberg Oct. 31, 2024, 8:19 a.m. UTC | #23
On Wed, Oct 30, 2024 at 11:33 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Wed, Oct 30, 2024 at 05:57:08PM +0100, Christoph Hellwig wrote:
> > On Wed, Oct 30, 2024 at 10:42:59AM -0600, Keith Busch wrote:
> > > With FDP (with some minor rocksdb changes):
> > >
> > > WAF:        1.67
> > > IOPS:       1547
> > > READ LAT:   1978us
> > > UPDATE LAT: 2267us
> >
> > Compared to the Numbers Hans presented at Plumbers for the Zoned XFS code,
> > which should work just fine with FDP IFF we exposed real write streams,
> > which roughly double read nad wirte IOPS and reduce the WAF to almost
> > 1 this doesn't look too spectacular to be honest, but it sure it something.
>
> Hold up... I absolutely appreciate the work Hans is and has done. But
> are you talking about this talk?
>
> https://lpc.events/event/18/contributions/1822/attachments/1464/3105/Zoned%20XFS%20LPC%20Zoned%20MC%202024%20V1.pdf
>
> That is very much apples-to-oranges. The B+ isn't on the same device
> being evaluated for WAF, where this has all that mixed in. I think the
> results are pretty good, all things considered.

No. The meta data IO is just 0.1% of all writes, so that we use a
separate device for that in the benchmark really does not matter.

Since we can achieve a WAF of ~1 for RocksDB on flash, why should we
be content with another 67% of unwanted device side writes on top of
that?

It's of course impossible to compare your benchmark figures and mine
directly since we are using different devices, but hey, we definitely
have an opportunity here to make significant gains for FDP if we just
provide the right kernel interfaces.

Why shouldn't we expose the hardware in a way that enables the users
to make the most out of it?
Christoph Hellwig Oct. 31, 2024, 1:02 p.m. UTC | #24
On Thu, Oct 31, 2024 at 09:19:51AM +0100, Hans Holmberg wrote:
> No. The meta data IO is just 0.1% of all writes, so that we use a
> separate device for that in the benchmark really does not matter.
> 
> Since we can achieve a WAF of ~1 for RocksDB on flash, why should we
> be content with another 67% of unwanted device side writes on top of
> that?
> 
> It's of course impossible to compare your benchmark figures and mine
> directly since we are using different devices, but hey, we definitely
> have an opportunity here to make significant gains for FDP if we just
> provide the right kernel interfaces.

I'll write code to do a 1:1 single device comparison over the weekend
and Hans will test it once he is back.
Keith Busch Oct. 31, 2024, 2:06 p.m. UTC | #25
On Thu, Oct 31, 2024 at 09:19:51AM +0100, Hans Holmberg wrote:
> On Wed, Oct 30, 2024 at 11:33 PM Keith Busch <kbusch@kernel.org> wrote:
> > That is very much apples-to-oranges. The B+ isn't on the same device
> > being evaluated for WAF, where this has all that mixed in. I think the
> > results are pretty good, all things considered.
> 
> No. The meta data IO is just 0.1% of all writes, so that we use a
> separate device for that in the benchmark really does not matter.

It's very little spatially, but they overwrite differently than other
data, creating many small holes in large erase blocks.
 
> Since we can achieve a WAF of ~1 for RocksDB on flash, why should we
> be content with another 67% of unwanted device side writes on top of
> that?
> 
> It's of course impossible to compare your benchmark figures and mine
> directly since we are using different devices, but hey, we definitely
> have an opportunity here to make significant gains for FDP if we just
> provide the right kernel interfaces.
> 
> Why shouldn't we expose the hardware in a way that enables the users
> to make the most out of it?

Because the people using this want this interface. Stalling for the last
6 months hasn't produced anything better, and appealing to non-existent
vaporware to block something ready-to-go that satisfies a need right
now is just wasting everyone's time.

Again, I absolutely disagree that this locks anyone in to anything.
That's an overly dramatic excuse.
Jaegeuk Kim Nov. 1, 2024, 1:03 a.m. UTC | #26
On 10/30, Bart Van Assche wrote:
> On 10/30/24 10:14 AM, Christoph Hellwig wrote:
> > On Wed, Oct 30, 2024 at 09:59:24AM -0700, Bart Van Assche wrote:
> > > 
> > > On 10/29/24 9:55 PM, Christoph Hellwig wrote:
> > > > For the temperature hints the only public user I known is rocksdb, and
> > > > that only started working when Hans fixed a brown paperbag bug in the
> > > > rocksdb code a while ago.  Given that f2fs interprets the hints I suspect
> > > > something in the Android world does as well, maybe Bart knows more.
> > > 
> > > UFS devices typically have less internal memory (SRAM) than the size of a
> > > single zone.
> > 
> > That wasn't quite the question.  Do you know what application in android
> > set the fcntl temperature hints?
> 
> I do not know whether there are any Android apps that use the
> F_SET_(FILE_|)RW_HINT fcntls.
> 
> The only use case in Android platform code I know of is this one: Daejun
> Park, "f2fs-tools: add write hint support", f2fs-dev mailing list,
> September 2024 (https://lore.kernel.org/all/20240904011217epcms2p5a1b15db8e0ae50884429da7be4af4de4@epcms2p5/T/).
> As you probably know f2fs-tools is a software package that includes
> fsck.f2fs.
> 
> Jaegeuk, please correct me if necessary.

Yes, f2fs-tools in Android calls fcntl(fd, F_SET_RW_HINT, &hint);

> 
> Bart.
> 
>
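
For readers unfamiliar with the interface being referenced, a minimal
userspace example of setting such a hint (illustrative only; the fallback
defines carry the values from linux/fcntl.h):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#ifndef F_SET_RW_HINT
#define F_SET_RW_HINT		1036	/* F_LINUX_SPECIFIC_BASE + 12 */
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define RWH_WRITE_LIFE_SHORT	2
#endif

int main(void)
{
	uint64_t hint = RWH_WRITE_LIFE_SHORT;
	int fd = open("testfile", O_WRONLY | O_CREAT, 0644);

	/* Tell the kernel that writes to this fd have a short lifetime. */
	if (fd < 0 || fcntl(fd, F_SET_RW_HINT, &hint) < 0) {
		perror("F_SET_RW_HINT");
		return 1;
	}
	close(fd);
	return 0;
}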
Hans Holmberg Nov. 1, 2024, 7:16 a.m. UTC | #27
On Thu, Oct 31, 2024 at 3:06 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Thu, Oct 31, 2024 at 09:19:51AM +0100, Hans Holmberg wrote:
> > On Wed, Oct 30, 2024 at 11:33 PM Keith Busch <kbusch@kernel.org> wrote:
> > > That is very much apples-to-oranges. The B+ isn't on the same device
> > > being evaluated for WAF, where this has all that mixed in. I think the
> > > results are pretty good, all things considered.
> >
> > No. The meta data IO is just 0.1% of all writes, so that we use a
> > separate device for that in the benchmark really does not matter.
>
> It's very little spatially, but they overwrite differently than other
> data, creating many small holes in large erase blocks.

I don't really get how this could influence anything significantly (if at all).

>
> > Since we can achieve a WAF of ~1 for RocksDB on flash, why should we
> > be content with another 67% of unwanted device side writes on top of
> > that?
> >
> > It's of course impossible to compare your benchmark figures and mine
> > directly since we are using different devices, but hey, we definitely
> > have an opportunity here to make significant gains for FDP if we just
> > provide the right kernel interfaces.
> >
> > Why shouldn't we expose the hardware in a way that enables the users
> > to make the most out of it?
>
> Because the people using this want this interface. Stalling for the last
> 6 months hasn't produced anything better, appealing to non-existent
> vaporware to block something ready-to-go that satisfies a need right
> now is just wasting everyone's time.
>
> Again, I absolutely disagree that this locks anyone in to anything.
> That's an overly dramatic excuse.

Locking in or not, to constructively move things forward (if we are
now stuck on how to wire up fs support) I believe it would be
worthwhile to prototype active fdp data placement in xfs and evaluate
it. Happy to help out with that.

FDP and ZNS are different beasts, so I don't expect the results in the
presentation to be directly translatable, but we can see what we can
do.

Is RocksDB the only file system user at the moment?
Is the benchmark setup/config something that could be shared?
Javier Gonzalez Nov. 1, 2024, 8:19 a.m. UTC | #28
On 01.11.2024 08:16, Hans Holmberg wrote:
>Locking in or not, to constructively move things forward (if we are
>now stuck on how to wire up fs support) I believe it would be
>worthwhile to prototype active fdp data placement in xfs and evaluate
>it. Happy to help out with that.

I appreciate your willingness to move things forward. I really mean it.

I have talked several times in this thread about collaborating on the
API that you have in mind. I would _very_ much like to have a common
abstraction for ZNS, ZUFS, FDP, and whatever people build on other
protocols. But without tangible patches showing this, we simply cannot
block this anymore.

>
>Fdp and zns are different beasts, so I don't expect the results in the
>presentation to be directly translatable but we can see what we can
>do.
>
>Is RocksDB the only file system user at the moment?
>Is the benchmark setup/config something that could be shared?

It is a YCSB workload. You have the scripts here:

    https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workloada

If you have other standard workloads you want us to run, let me know and
we will post the results on the list too.

We will post the changes to the L3 placement in RocksDB. I think we can
make them available somewhere for you to test before that. Let me come
back to you on this.
Keith Busch Nov. 1, 2024, 2:49 p.m. UTC | #29
On Fri, Nov 01, 2024 at 08:16:30AM +0100, Hans Holmberg wrote:
> On Thu, Oct 31, 2024 at 3:06 PM Keith Busch <kbusch@kernel.org> wrote:
> > On Thu, Oct 31, 2024 at 09:19:51AM +0100, Hans Holmberg wrote:
> > > No. The meta data IO is just 0.1% of all writes, so that we use a
> > > separate device for that in the benchmark really does not matter.
> >
> > It's very little spatially, but they overwrite differently than other
> > data, creating many small holes in large erase blocks.
> 
> I don't really get how this could influence anything significantly.(If at all).

Fill your filesystem to near capacity, then continue using it for a few
months. While the filesystem will report some available space, there
may not be many good blocks available to erase. Maybe.
 
> > Again, I absolutely disagree that this locks anyone in to anything.
> > That's an overly dramatic excuse.
> 
> Locking in or not, to constructively move things forward (if we are
> now stuck on how to wire up fs support) 

But we're not stuck on how to wire it up to the fs. That part was settled
and in the kernel 10 years ago. We're stuck on wiring it down to the
driver, which should have been the easiest part.

> I believe it would be worthwhile to prototype active fdp data
> placement in xfs and evaluate it. Happy to help out with that.

When are we allowed to conclude evaluation? We have benefits my
customers want on well tested kernels, and wish to proceed now.

I'm not discouraging anyone from continuing further prototypes,
innovations, and improvements. I'd like to spend more time doing that
too, and merging something incrementally better doesn't prevent anyone
from doing that.

> Fdp and zns are different beasts, so I don't expect the results in the
> presentation to be directly translatable but we can see what we can
> do.
> 
> Is RocksDB the only file system user at the moment?

Rocks is the only open source one I know about. There are proprietary
users, too.
Hans Holmberg Nov. 6, 2024, 2:26 p.m. UTC | #30
On Fri, Nov 1, 2024 at 3:49 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Fri, Nov 01, 2024 at 08:16:30AM +0100, Hans Holmberg wrote:
> > On Thu, Oct 31, 2024 at 3:06 PM Keith Busch <kbusch@kernel.org> wrote:
> > > On Thu, Oct 31, 2024 at 09:19:51AM +0100, Hans Holmberg wrote:
> > > > No. The meta data IO is just 0.1% of all writes, so that we use a
> > > > separate device for that in the benchmark really does not matter.
> > >
> > > It's very little spatially, but they overwrite differently than other
> > > data, creating many small holes in large erase blocks.
> >
> > I don't really get how this could influence anything significantly.(If at all).
>
> Fill your filesystem to near capacity, then continue using it for a few
> months. While the filesystem will report some available space, there
> may not be many good blocks available to erase. Maybe.

For *this* benchmark workload, the metadata IO is such a tiny fraction
that I doubt the impact on write amplification could be measured.
I completely agree it's a good idea to separate metadata from data
blocks in general. It is actually a good reason for letting the file
system control write stream allocation for all blocks :)

> > I believe it would be worthwhile to prototype active fdp data
> > placement in xfs and evaluate it. Happy to help out with that.
>
> When are we allowed to conclude evaluation? We have benefits my
> customers want on well tested kernels, and wish to proceed now.

Christoph has now wired up prototype support for FDP on top of the
xfs-rt-zoned work + this patch set, and I have had time to look over
it and started doing some testing on HW.
In addition to the FDP support, metadata can also be stored on the
same block device as the data.

Now that all placement handles are available, we can use the full data
separation capabilities of the underlying storage, so that's good.
We can map the placement handles to different write streams much
like we assign open zones for zoned storage, and this opens up
support for data placement heuristics for a wider range of use cases
(not just the RocksDB use case discussed here).

The big pieces that are missing from the FDP plumbing, as I see it, are
the ability to read the reclaim unit size and syncing up the remaining
capacity of the placement units with the file system allocation
groups, but I guess that can be added later.

I've started benchmarking on the hardware I have at hand, iterating on
a good workload configuration. It will take some time to get to some
robust write amp measurements since the drives are very big and
require a painfully long warmup time.

Patch

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index ca4bc0ac76adc..235dd6e5b6688 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3768,6 +3768,8 @@  static int sd_revalidate_disk(struct gendisk *disk)
 		sd_config_protection(sdkp, &lim);
 	}
 
+	lim.max_write_hints = sdkp->permanent_stream_count;
+
 	/*
 	 * We now have all cache related info, determine how we deal
 	 * with flush requests.