mbox series

[PATCHv10,0/9] write hints with nvme fdp, scsi streams

Message ID 20241029151922.459139-1-kbusch@meta.com (mailing list archive)
Headers show
Series write hints with nvme fdp, scsi streams | expand

Message

Keith Busch Oct. 29, 2024, 3:19 p.m. UTC
From: Keith Busch <kbusch@kernel.org>

Changes from v9:

  Document the partition hint mask

  Use bitmap_alloc API

  Fixup bitmap memory leak

  Return invalid value if user requests an invalid write hint

  Added and exported a block device feature flag for indicating generic
  placement hint support

  Added statx write hint max field

  Added BUILD_BUG_ON check for new io_uring SQE fields.

  Added reviews

Kanchan Joshi (2):
  io_uring: enable per-io hinting capability
  nvme: enable FDP support

Keith Busch (7):
  block: use generic u16 for write hints
  block: introduce max_write_hints queue limit
  statx: add write hint information
  block: allow ability to limit partition write hints
  block, fs: add write hint to kiocb
  block: export placement hint feature
  scsi: set permanent stream count in block limits

 Documentation/ABI/stable/sysfs-block | 13 +++++
 block/bdev.c                         | 18 ++++++
 block/blk-settings.c                 |  5 ++
 block/blk-sysfs.c                    |  6 ++
 block/fops.c                         | 31 +++++++++-
 block/partitions/core.c              | 44 ++++++++++++++-
 drivers/nvme/host/core.c             | 84 ++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h             |  5 ++
 drivers/scsi/sd.c                    |  2 +
 fs/stat.c                            |  1 +
 include/linux/blk-mq.h               |  3 +-
 include/linux/blk_types.h            |  4 +-
 include/linux/blkdev.h               | 15 +++++
 include/linux/fs.h                   |  1 +
 include/linux/nvme.h                 | 19 +++++++
 include/linux/stat.h                 |  1 +
 include/uapi/linux/io_uring.h        |  4 ++
 include/uapi/linux/stat.h            |  3 +-
 io_uring/io_uring.c                  |  2 +
 io_uring/rw.c                        |  3 +-
 20 files changed, 253 insertions(+), 11 deletions(-)

Comments

Christoph Hellwig Oct. 29, 2024, 3:24 p.m. UTC | #1
On Tue, Oct 29, 2024 at 08:19:13AM -0700, Keith Busch wrote:
>   Return invalid value if user requests an invalid write hint
> 
>   Added and exported a block device feature flag for indicating generic
>   placement hint support

But it still talks of write hints everywhere and conflates the write
streams with the temperature hints which are completely different
beasts.
Christoph Hellwig Oct. 29, 2024, 3:26 p.m. UTC | #2
On Tue, Oct 29, 2024 at 08:19:22AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> The block limits exports the number of write hints, so set this limit if
> the device reports support for the lifetime hints. Not only does this
> inform the user of which hints are possible, it also allows scsi devices
> supporting the feature to utilize the full range through raw block
> device direct-io.
> 
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Keith Busch <kbusch@kernel.org>

Despite the reviews this is still incorrect.  The permanent streams have
a relative data temperature associated with them as pointed out last
round and are not arbitrary write stream contexts despite (ab)using
the SBC streams facilities.

Bart, btw: I think the current sd implementation is buggy as well, as
it assumes the permanent streams are ordered by their data temperature
in the IO Advise hints mode page, but I can't find anything in the
spec that requires a particular ordering.
Keith Busch Oct. 29, 2024, 3:34 p.m. UTC | #3
On Tue, Oct 29, 2024 at 04:26:54PM +0100, Christoph Hellwig wrote:
> On Tue, Oct 29, 2024 at 08:19:22AM -0700, Keith Busch wrote:
> > From: Keith Busch <kbusch@kernel.org>
> > 
> > The block limits exports the number of write hints, so set this limit if
> > the device reports support for the lifetime hints. Not only does this
> > inform the user of which hints are possible, it also allows scsi devices
> > supporting the feature to utilize the full range through raw block
> > device direct-io.
> > 
> > Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> > Reviewed-by: Hannes Reinecke <hare@suse.de>
> > Signed-off-by: Keith Busch <kbusch@kernel.org>
> 
> Despite the reviews this is still incorrect.  The permanent streams have
> a relative data temperature associated with them as pointed out last
> round and are not arbitrary write stream contexts despite (ab)using
> the SBC streams facilities.

So then don't use it that way? I still don't know what change you're
expecting to happen with this feedback. What do you want the kernel to
do differently here?
Christoph Hellwig Oct. 29, 2024, 3:37 p.m. UTC | #4
On Tue, Oct 29, 2024 at 09:34:07AM -0600, Keith Busch wrote:
> So then don't use it that way? I still don't know what change you're
> expecting to happen with this feedback. What do you want the kernel to
> do differently here?

Same as before:  don't expose them as write streams, because they
aren't.  A big mess in this series going back to the versions before
your involvement is that they somehow want to tie up the temperature
hints with the stream separation, which just ends up very messy.
Keith Busch Oct. 29, 2024, 3:38 p.m. UTC | #5
On Tue, Oct 29, 2024 at 04:37:02PM +0100, Christoph Hellwig wrote:
> On Tue, Oct 29, 2024 at 09:34:07AM -0600, Keith Busch wrote:
> > So then don't use it that way? I still don't know what change you're
> > expecting to happen with this feedback. What do you want the kernel to
> > do differently here?
> 
> Same as before:  don't expose them as write streams, because they
> aren't.  A big mess in this series going back to the versions before
> your involvement is that they somehow want to tie up the temperature
> hints with the stream separation, which just ends up very messy.

They're not exposed as write streams. Patch 7/9 sets the feature if it
is a placement id or not, and only nvme sets it, so scsi's attributes
are not claiming to be a write stream.
Christoph Hellwig Oct. 29, 2024, 3:53 p.m. UTC | #6
On Tue, Oct 29, 2024 at 09:38:44AM -0600, Keith Busch wrote:
> They're not exposed as write streams. Patch 7/9 sets the feature if it
> is a placement id or not, and only nvme sets it, so scsi's attributes
> are not claiming to be a write stream.

So it shows up in sysfs, but:

 - queue_max_write_hints (which really should be queue_max_write_streams)
   still picks it up, and from there the statx interface

 - per-inode fcntl hint that encode a temperature still magically
   get dumpted into the write streams if they are set.

In other words it's a really leaky half-backed abstraction.

Let's brainstorm how it could be done better:

 - the max_write_streams values only set by block devices that actually
   do support write streams, and not the fire and forget temperature
   hints.  They way this is queried is by having a non-zero value
   there, not need for an extra flag.
 - but the struct file (or maybe inode) gets a supported flag, as stream
   separation needs to be supported by the file system 
 - a separate fcntl is used to set per-inode streams (if you care about
   that, seem like the bdev use case focusses on per-I/O).  In that case
   we'd probably also need a separate inode field for them, or a somewhat
   complicated scheme to decide what is stored in the inode field if there
   is only one.
 - for block devices bdev/fops.c maps the temperature hints into write
   streams if write streams are supported, any user that mixes and
   matches write streams and temperature hints gets what they deserve
 - this could also be a helper for file systems that want to do the
   same.

Just a quick writeup while I'm on the run, there's probably a hole or
two that could be poked into it.
Keith Busch Oct. 29, 2024, 4:22 p.m. UTC | #7
On Tue, Oct 29, 2024 at 04:53:30PM +0100, Christoph Hellwig wrote:
> On Tue, Oct 29, 2024 at 09:38:44AM -0600, Keith Busch wrote:
> > They're not exposed as write streams. Patch 7/9 sets the feature if it
> > is a placement id or not, and only nvme sets it, so scsi's attributes
> > are not claiming to be a write stream.
> 
> So it shows up in sysfs, but:
> 
>  - queue_max_write_hints (which really should be queue_max_write_streams)
>    still picks it up, and from there the statx interface
> 
>  - per-inode fcntl hint that encode a temperature still magically
>    get dumpted into the write streams if they are set.
> 
> In other words it's a really leaky half-backed abstraction.

Exactly why I asked last time: "who uses it and how do you want them to
use it" :)
 
> Let's brainstorm how it could be done better:
> 
>  - the max_write_streams values only set by block devices that actually
>    do support write streams, and not the fire and forget temperature
>    hints.  They way this is queried is by having a non-zero value
>    there, not need for an extra flag.

So we need a completely different attribute for SCSI's permanent write
streams? You'd mentioned earlier you were okay with having SCSI be able
to utilized per-io raw block write hints. Having multiple things to
check for what are all just write classifiers seems unnecessarily
complicated.

>  - but the struct file (or maybe inode) gets a supported flag, as stream
>    separation needs to be supported by the file system 
>  - a separate fcntl is used to set per-inode streams (if you care about
>    that, seem like the bdev use case focusses on per-I/O).  In that case
>    we'd probably also need a separate inode field for them, or a somewhat
>    complicated scheme to decide what is stored in the inode field if there
>    is only one.

No need to create a new fcntl. The people already testing this are
successfully using FDP with the existing fcntl hints. Their applications
leverage FDP as way to separate files based on expected lifetime. It is
how they want to use it and it is working above expectations. 

>  - for block devices bdev/fops.c maps the temperature hints into write
>    streams if write streams are supported, any user that mixes and
>    matches write streams and temperature hints gets what they deserve

That's fine. This patch series pretty much accomplishes that part.

>  - this could also be a helper for file systems that want to do the
>    same.
> 
> Just a quick writeup while I'm on the run, there's probably a hole or
> two that could be poked into it.
Bart Van Assche Oct. 29, 2024, 5:18 p.m. UTC | #8
On 10/29/24 8:26 AM, Christoph Hellwig wrote:
> Bart, btw: I think the current sd implementation is buggy as well, as
> it assumes the permanent streams are ordered by their data temperature
> in the IO Advise hints mode page, but I can't find anything in the
> spec that requires a particular ordering.

How about modifying sd_read_io_hints() such that permanent stream
information is ignored if the order of the RELATIVE LIFETIME information
reported by the GET STREAM STATUS command does not match the permanent
stream order?

Thanks,

Bart.

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 41e2dfa2d67d..277035febd82 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3192,7 +3192,12 @@ sd_read_cache_type(struct scsi_disk *sdkp, 
unsigned char *buffer)
  	sdkp->DPOFUA = 0;
  }

-static bool sd_is_perm_stream(struct scsi_disk *sdkp, unsigned int 
stream_id)
+/*
+ * Returns the relative lifetime of a permanent stream. Returns -1 if the
+ * GET STREAM STATUS command fails or if the stream is not a permanent 
stream.
+ */
+static int sd_perm_stream_rel_lifetime(struct scsi_disk *sdkp,
+				       unsigned int stream_id)
  {
  	u8 cdb[16] = { SERVICE_ACTION_IN_16, SAI_GET_STREAM_STATUS };
  	struct {
@@ -3212,14 +3217,16 @@ static bool sd_is_perm_stream(struct scsi_disk 
*sdkp, unsigned int stream_id)
  	res = scsi_execute_cmd(sdev, cdb, REQ_OP_DRV_IN, &buf, sizeof(buf),
  			       SD_TIMEOUT, sdkp->max_retries, &exec_args);
  	if (res < 0)
-		return false;
+		return -1;
  	if (scsi_status_is_check_condition(res) && scsi_sense_valid(&sshdr))
  		sd_print_sense_hdr(sdkp, &sshdr);
  	if (res)
-		return false;
+		return -1;
  	if (get_unaligned_be32(&buf.h.len) < sizeof(struct scsi_stream_status))
-		return false;
-	return buf.h.stream_status[0].perm;
+		return -1;
+	if (!buf.h.stream_status[0].perm)
+		return -1;
+	return buf.h.stream_status[0].rel_lifetime;
  }

  static void sd_read_io_hints(struct scsi_disk *sdkp, unsigned char 
*buffer)
@@ -3247,9 +3254,17 @@ static void sd_read_io_hints(struct scsi_disk 
*sdkp, unsigned char *buffer)
  	 * should assign the lowest numbered stream identifiers to permanent
  	 * streams.
  	 */
-	for (desc = start; desc < end; desc++)
-		if (!desc->st_enble || !sd_is_perm_stream(sdkp, desc - start))
+	int prev_rel_lifetime = -1;
+	for (desc = start; desc < end; desc++) {
+		int rel_lifetime;
+
+		if (!desc->st_enble)
  			break;
+		rel_lifetime = sd_perm_stream_rel_lifetime(sdkp, desc - start);
+		if (rel_lifetime < 0 || rel_lifetime < prev_rel_lifetime)
+			break;
+		prev_rel_lifetime = rel_lifetime;
+	}
  	permanent_stream_count_old = sdkp->permanent_stream_count;
  	sdkp->permanent_stream_count = desc - start;
  	if (sdkp->rscs && sdkp->permanent_stream_count < 2)
Chaitanya Kulkarni Oct. 30, 2024, 12:24 a.m. UTC | #9
On 10/29/24 08:19, Keith Busch wrote:
> From: Kanchan Joshi <joshi.k@samsung.com>
>
> Flexible Data Placement (FDP), as ratified in TP 4146a, allows the host
> to control the placement of logical blocks so as to reduce the SSD WAF.
> Userspace can send the write hint information using io_uring or fcntl.
>
> Fetch the placement-identifiers if the device supports FDP. The incoming
> write-hint is mapped to a placement-identifier, which in turn is set in
> the DSPEC field of the write command.
>
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> Signed-off-by: Hui Qi <hui81.qi@samsung.com>
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
>   drivers/nvme/host/core.c | 84 ++++++++++++++++++++++++++++++++++++++++
>   drivers/nvme/host/nvme.h |  5 +++
>   include/linux/nvme.h     | 19 +++++++++
>   3 files changed, 108 insertions(+)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 3de7555a7de74..bd7b89912ddb9 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -44,6 +44,20 @@ struct nvme_ns_info {
>   	bool is_removed;
>   };
>   
> +struct nvme_fdp_ruh_status_desc {
> +	u16 pid;
> +	u16 ruhid;
> +	u32 earutr;
> +	u64 ruamw;
> +	u8  rsvd16[16];
> +};
> +
> +struct nvme_fdp_ruh_status {
> +	u8  rsvd0[14];
> +	__le16 nruhsd;
> +	struct nvme_fdp_ruh_status_desc ruhsd[];
> +};
> +
>   unsigned int admin_timeout = 60;
>   module_param(admin_timeout, uint, 0644);
>   MODULE_PARM_DESC(admin_timeout, "timeout in seconds for admin commands");
> @@ -657,6 +671,7 @@ static void nvme_free_ns_head(struct kref *ref)
>   	ida_free(&head->subsys->ns_ida, head->instance);
>   	cleanup_srcu_struct(&head->srcu);
>   	nvme_put_subsystem(head->subsys);
> +	kfree(head->plids);
>   	kfree(head);
>   }
>   
> @@ -974,6 +989,13 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
>   	if (req->cmd_flags & REQ_RAHEAD)
>   		dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
>   
> +	if (req->write_hint && ns->head->nr_plids) {
> +		u16 hint = max(req->write_hint, ns->head->nr_plids);
> +
> +		dsmgmt |= ns->head->plids[hint - 1] << 16;
> +		control |= NVME_RW_DTYPE_DPLCMT;
> +	}
> +
>   	if (req->cmd_flags & REQ_ATOMIC && !nvme_valid_atomic_write(req))
>   		return BLK_STS_INVAL;
>   
> @@ -2105,6 +2127,52 @@ static int nvme_update_ns_info_generic(struct nvme_ns *ns,
>   	return ret;
>   }
>   
> +static int nvme_fetch_fdp_plids(struct nvme_ns *ns, u32 nsid)
> +{
> +	struct nvme_fdp_ruh_status_desc *ruhsd;
> +	struct nvme_ns_head *head = ns->head;
> +	struct nvme_fdp_ruh_status *ruhs;
> +	struct nvme_command c = {};
> +	int size, ret, i;
> +
> +	if (head->plids)
> +		return 0;
> +
> +	size = struct_size(ruhs, ruhsd, NVME_MAX_PLIDS);
> +	ruhs = kzalloc(size, GFP_KERNEL);
> +	if (!ruhs)
> +		return -ENOMEM;
> +
> +	c.imr.opcode = nvme_cmd_io_mgmt_recv;
> +	c.imr.nsid = cpu_to_le32(nsid);
> +	c.imr.mo = 0x1;

can we please add some comment where values are hardcoded ?

> +	c.imr.numd =  cpu_to_le32((size >> 2) - 1);
> +
> +	ret = nvme_submit_sync_cmd(ns->queue, &c, ruhs, size);
> +	if (ret)
> +		goto out;
> +
> +	i = le16_to_cpu(ruhs->nruhsd);

instead of i why can't we use local variable nr_plids ?



> +	if (!i)
> +		goto out;
> +
> +	ns->head->nr_plids = min_t(u16, i, NVME_MAX_PLIDS);
> +	head->plids = kcalloc(ns->head->nr_plids, sizeof(head->plids),
> +			      GFP_KERNEL);
> +	if (!head->plids) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	for (i = 0; i < ns->head->nr_plids; i++) {
> +		ruhsd = &ruhs->ruhsd[i];
> +		head->plids[i] = le16_to_cpu(ruhsd->pid);
> +	}
> +out:
> +	kfree(ruhs);
> +	return ret;
> +}
> +
>   static int nvme_update_ns_info_block(struct nvme_ns *ns,
>   		struct nvme_ns_info *info)
>   {
> @@ -2141,6 +2209,19 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
>   			goto out;
>   	}
>   
> +	if (ns->ctrl->ctratt & NVME_CTRL_ATTR_FDPS) {
> +		ret = nvme_fetch_fdp_plids(ns, info->nsid);
> +		if (ret)
> +			dev_warn(ns->ctrl->device,
> +				"FDP failure status:0x%x\n", ret);
> +		if (ret < 0)
> +			goto out;
> +	} else {
> +		ns->head->nr_plids = 0;
> +		kfree(ns->head->plids);
> +		ns->head->plids = NULL;
> +	}
> +
>   	blk_mq_freeze_queue(ns->disk->queue);
>   	ns->head->lba_shift = id->lbaf[lbaf].ds;
>   	ns->head->nuse = le64_to_cpu(id->nuse);
> @@ -2171,6 +2252,9 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
>   	if (!nvme_init_integrity(ns->head, &lim, info))
>   		capacity = 0;
>   
> +	lim.max_write_hints = ns->head->nr_plids;
> +	if (lim.max_write_hints)
> +		lim.features |= BLK_FEAT_PLACEMENT_HINTS;
>   	ret = queue_limits_commit_update(ns->disk->queue, &lim);
>   	if (ret) {
>   		blk_mq_unfreeze_queue(ns->disk->queue);
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 093cb423f536b..cec8e5d96377b 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -454,6 +454,8 @@ struct nvme_ns_ids {
>   	u8	csi;
>   };
>   
> +#define NVME_MAX_PLIDS   (NVME_CTRL_PAGE_SIZE / sizeof(16))

this calculates how many plids can fit into the ctrl page size ?

sorry but I didn't understand sizeof(16) here, since plids are u16

nvme_ns_head -> u16 *plidsshould this be sizeof(u16) ? -ck
Christoph Hellwig Oct. 30, 2024, 4:55 a.m. UTC | #10
On Tue, Oct 29, 2024 at 10:22:56AM -0600, Keith Busch wrote:
> On Tue, Oct 29, 2024 at 04:53:30PM +0100, Christoph Hellwig wrote:
> > On Tue, Oct 29, 2024 at 09:38:44AM -0600, Keith Busch wrote:
> > > They're not exposed as write streams. Patch 7/9 sets the feature if it
> > > is a placement id or not, and only nvme sets it, so scsi's attributes
> > > are not claiming to be a write stream.
> > 
> > So it shows up in sysfs, but:
> > 
> >  - queue_max_write_hints (which really should be queue_max_write_streams)
> >    still picks it up, and from there the statx interface
> > 
> >  - per-inode fcntl hint that encode a temperature still magically
> >    get dumpted into the write streams if they are set.
> > 
> > In other words it's a really leaky half-backed abstraction.
> 
> Exactly why I asked last time: "who uses it and how do you want them to
> use it" :)

For the temperature hints the only public user I known is rocksdb, and
that only started working when Hans fixed a brown paperbag bug in the
rocksdb code a while ago.  Given that f2fs interprets the hints I suspect
something in the Android world does as well, maybe Bart knows more.

For the separate write streams the usage I want for them is poor mans
zones - e.g. write N LBAs sequentially into a separate write streams
and then eventually discard them together.  This will fit nicely into
f2fs and the pending xfs work as well as quite a few userspace storage
systems.  For that the file system or application needs to query
the number of available write streams (and in the bitmap world their
numbers of they are distontigous) and the size your can fit into the
"reclaim unit" in FDP terms.  I've not been bothering you much with
the latter as it is an easy retrofit once the I/O path bits lands.

> > Let's brainstorm how it could be done better:
> > 
> >  - the max_write_streams values only set by block devices that actually
> >    do support write streams, and not the fire and forget temperature
> >    hints.  They way this is queried is by having a non-zero value
> >    there, not need for an extra flag.
> 
> So we need a completely different attribute for SCSI's permanent write
> streams? You'd mentioned earlier you were okay with having SCSI be able
> to utilized per-io raw block write hints. Having multiple things to
> check for what are all just write classifiers seems unnecessarily
> complicated.

I don't think the multiple write streams interface applies to SCSIs
write streams, as they enforce a relative temperature, and they don't
have the concept of how much you can write into an "reclaim unit".

OTOH there isn't much you need to query for them anyway, as the
temperature hints have always been defined as pure hints with all
up and downsides of that.

> No need to create a new fcntl. The people already testing this are
> successfully using FDP with the existing fcntl hints. Their applications
> leverage FDP as way to separate files based on expected lifetime. It is
> how they want to use it and it is working above expectations. 

FYI, I think it's always fine and easy to map the temperature hits to
write streams if that's all the driver offers.  It loses a lot of the
capapilities, but as long as it doesn't enforce a lower level interface
that never exposes more that's fine.
Christoph Hellwig Oct. 30, 2024, 5:42 a.m. UTC | #11
On Tue, Oct 29, 2024 at 10:18:31AM -0700, Bart Van Assche wrote:
> On 10/29/24 8:26 AM, Christoph Hellwig wrote:
>> Bart, btw: I think the current sd implementation is buggy as well, as
>> it assumes the permanent streams are ordered by their data temperature
>> in the IO Advise hints mode page, but I can't find anything in the
>> spec that requires a particular ordering.
>
> How about modifying sd_read_io_hints() such that permanent stream
> information is ignored if the order of the RELATIVE LIFETIME information
> reported by the GET STREAM STATUS command does not match the permanent
> stream order?

That seems odd as there is nothing implying that they should be ordered.
The logic thing to do would be to a little array mapping the linux
temperature levels to the streams ids.