mbox series

[PATCHv12,00/12] block write streams with nvme fdp

Message ID 20241206221801.790690-1-kbusch@meta.com (mailing list archive)
Headers show
Series block write streams with nvme fdp | expand

Message

Keith Busch Dec. 6, 2024, 10:17 p.m. UTC
From: Keith Busch <kbusch@kernel.org>

changes from v11:

 - Place the write hint in an unused io_uring SQE field
   - Obviates the need to modify the external "attributes" stuff that
     support PI
   - Make it a u8 to match the type the block layer supports
   - And it's just easier to use for the user

 - Fix the sparse warnings from FDP definitions
   - Just use the patches that Christoph posted a few weeks ago since
     it already defined it in a way that makes sparse happy; I just made
     some minor changes to field names to match what the spec calls them

 - Actually include the first patch in this series

Christoph Hellwig (7):
  fs: add a write stream field to the kiocb
  block: add a bi_write_stream field
  block: introduce a write_stream_granularity queue limit
  block: expose write streams for block device nodes
  nvme: add a nvme_get_log_lsi helper
  nvme: pass a void pointer to nvme_get/set_features for the result
  nvme.h: add FDP definitions

Keith Busch (5):
  fs: add write stream information to statx
  block: introduce max_write_streams queue limit
  io_uring: enable per-io write streams
  nvme: register fdp parameters with the block layer
  nvme: use fdp streams if write stream is provided

 Documentation/ABI/stable/sysfs-block |  15 +++
 block/bdev.c                         |   6 +
 block/bio.c                          |   2 +
 block/blk-crypto-fallback.c          |   1 +
 block/blk-merge.c                    |   4 +
 block/blk-sysfs.c                    |   6 +
 block/bounce.c                       |   1 +
 block/fops.c                         |  23 ++++
 drivers/nvme/host/core.c             | 160 ++++++++++++++++++++++++++-
 drivers/nvme/host/nvme.h             |   9 +-
 fs/stat.c                            |   2 +
 include/linux/blk_types.h            |   1 +
 include/linux/blkdev.h               |  16 +++
 include/linux/fs.h                   |   1 +
 include/linux/nvme.h                 |  77 +++++++++++++
 include/linux/stat.h                 |   2 +
 include/uapi/linux/io_uring.h        |   4 +
 include/uapi/linux/stat.h            |   7 +-
 io_uring/io_uring.c                  |   2 +
 io_uring/rw.c                        |   1 +
 20 files changed, 332 insertions(+), 8 deletions(-)

Comments

Christoph Hellwig Dec. 9, 2024, 12:55 p.m. UTC | #1
I just compared this to a crude rebase of what I last sent out, and
AFAICS the differences are:

 1) basically all new io_uring handling due to the integrity stuff that
   went in
 2) fixes for the NVMe FDP log page parsing
 3) drop the support for the remapping of per-partition streams
 
conceptually this all looks fine to me.  I'll throw in a few nitpicks
on the nvme bits, and I'd need to get up to speed a bit more on the
io_uring bits before commenting useful.

One thing that came I was pondering for a new version is if statx
really is the right vehicle for this as it is a very common fast-path
information.  If we had a separate streaminfo ioctl or fcntl it might
be easier to leave a bit spare space for extensibility.  I can try to
prototype that or we can leave it as-is because everyone is tired of
the series.
Keith Busch Dec. 9, 2024, 4:07 p.m. UTC | #2
On Mon, Dec 09, 2024 at 01:55:11PM +0100, Christoph Hellwig wrote:
> I just compared this to a crude rebase of what I last sent out, and
> AFAICS the differences are:
> 
>  1) basically all new io_uring handling due to the integrity stuff that
>    went in
>  2) fixes for the NVMe FDP log page parsing
>  3) drop the support for the remapping of per-partition streams

Yep, pretty much. I will revisit the partition mapping. I just haven't
heard any use cases for divvying the streams up this way, so it's not
clear to me what the interface needs to provide.

> One thing that came I was pondering for a new version is if statx
> really is the right vehicle for this as it is a very common fast-path
> information.  If we had a separate streaminfo ioctl or fcntl it might
> be easier to leave a bit spare space for extensibility.  I can try to
> prototype that or we can leave it as-is because everyone is tired of
> the series.

Oh sure. I can live without the statx parts from this series if you
prefer we take additional time to consider other approaches. We have the
sysfs block attributes reporting the same information, and that is okay
for now.
Martin K. Petersen Dec. 10, 2024, 1:49 a.m. UTC | #3
Hi Keith!

>>  3) drop the support for the remapping of per-partition streams
>
> Yep, pretty much. I will revisit the partition mapping. I just haven't
> heard any use cases for divvying the streams up this way, so it's not
> clear to me what the interface needs to provide.

Since the streams are a (very) scarce hardware resource, it does seem to
me like we should have an explicit interface for an entity (whether
app-on-bdev or a filesystem) to allocate them.

While there certainly are cases where there is a 1:1 app-to-device
mapping, as soon as you add virtualization or enterprise apps to the
mix, that assumption quickly falls apart...
Christoph Hellwig Dec. 10, 2024, 7:19 a.m. UTC | #4
On Mon, Dec 09, 2024 at 09:07:35AM -0700, Keith Busch wrote:
> Yep, pretty much. I will revisit the partition mapping. I just haven't
> heard any use cases for divvying the streams up this way, so it's not
> clear to me what the interface needs to provide.

Yes, it would be good to understand use cases first.  I just threw the
patch in as a POC to show we can do it.

> > One thing that came I was pondering for a new version is if statx
> > really is the right vehicle for this as it is a very common fast-path
> > information.  If we had a separate streaminfo ioctl or fcntl it might
> > be easier to leave a bit spare space for extensibility.  I can try to
> > prototype that or we can leave it as-is because everyone is tired of
> > the series.
> 
> Oh sure. I can live without the statx parts from this series if you
> prefer we take additional time to consider other approaches. We have the
> sysfs block attributes reporting the same information, and that is okay
> for now.

I'll try to find some time this afternoon for an interface, but if it
doesn't arrive in time we can probably drop if for the next submission.