[PATCHv11,0/9] write hints with nvme fdp and scsi streams

Message ID 20241108193629.3817619-1-kbusch@meta.com (mailing list archive)

Message

Keith Busch Nov. 8, 2024, 7:36 p.m. UTC
From: Keith Busch <kbusch@kernel.org>

Changes from v10:

  Fixed FDP max handle size calculations (wrong type)

  Defined and used FDP constants instead of literal numbers

  Moved io_uring write_hint to the end of the SQE so as not to overlap
  with other defined fields except uring_cmd

  Default partition split so partition one gets all the write hints
  exclusively

  Folded in the fix for stacking block stream feature for nvme-multipath
  (from hch xfs-zoned-streams branch)

Kanchan Joshi (2):
  io_uring: enable per-io hinting capability
  nvme: enable FDP support

Keith Busch (7):
  block: use generic u16 for write hints
  block: introduce max_write_hints queue limit
  statx: add write hint information
  block: allow ability to limit partition write hints
  block, fs: add write hint to kiocb
  block: export placement hint feature
  scsi: set permanent stream count in block limits

 Documentation/ABI/stable/sysfs-block | 14 ++++++
 block/bdev.c                         | 22 +++++++++
 block/blk-settings.c                 |  5 ++
 block/blk-sysfs.c                    |  6 +++
 block/fops.c                         | 31 +++++++++++--
 block/partitions/core.c              | 45 +++++++++++++++++-
 drivers/nvme/host/core.c             | 69 ++++++++++++++++++++++++++++
 drivers/nvme/host/multipath.c        |  3 +-
 drivers/nvme/host/nvme.h             |  5 ++
 drivers/scsi/sd.c                    |  2 +
 fs/stat.c                            |  1 +
 include/linux/blk-mq.h               |  3 +-
 include/linux/blk_types.h            |  4 +-
 include/linux/blkdev.h               | 15 ++++++
 include/linux/fs.h                   |  1 +
 include/linux/nvme.h                 | 37 +++++++++++++++
 include/linux/stat.h                 |  1 +
 include/uapi/linux/io_uring.h        |  4 ++
 include/uapi/linux/stat.h            |  3 +-
 io_uring/io_uring.c                  |  2 +
 io_uring/rw.c                        |  2 +-
 21 files changed, 263 insertions(+), 12 deletions(-)

Comments

Christoph Hellwig Nov. 11, 2024, 10:29 a.m. UTC | #1
On Fri, Nov 08, 2024 at 11:36:20AM -0800, Keith Busch wrote:
>   Default partition split so partition one gets all the write hints
>   exclusively

I still don't think this actually works as expected, as the user
interface says the write streams are contiguous, and with the bitmap
they aren't.

As I seem to have a really hard time to get my point across, I instead
spent this morning doing a POC of what I mean, and pushed it here:

http://git.infradead.org/?p=users/hch/misc.git;a=shortlog;h=refs/heads/block-write-streams

The big differences are:

 - there is a separate write_stream value now instead of overloading
   the write hint.  For now it is an 8-bit field for the internal
   data structures so that we don't have to grow the bio, but all the
   user interfaces are kept at 16 bits (or in case of statx reduced to
   it).  If this becomes not enough because we need to support devices
   with multiple reclaim groups we'll have to find some space by using
   unions or growing structures
 - block/fops.c is the place to map the existing write hints into
   the write streams instead of the driver
 - the stream granularity is added, because adding it to statx at a
   later time would be nasty.  Getting it in nvme is actually amazingly
   cumbersome so I gave up on that and just fed a dummy value for
   testing, though
 - the partitions remapping is now done using an offset into the global
   write stream space so that there is a contiguous number space.
   The interface for this is rather hacky, so only treat it as a start
   for interface and use case discussions.
 - the generic stack limits code stopped stacking the max write
   streams.  While it does the right thing for simple things like
   multipath and mirroring/striping it is wrong for anything non-trivial
   like parity raid.  I've left this as a separate fold patch for the
   discussion.
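
As a rough illustration of the offset-based partition remapping in the
list above: each partition could own a contiguous window of the device's
global write stream numbers.  This is only a sketch under assumptions;
struct part_streams and part_map_stream() are hypothetical names and are
not taken from the POC branch or from this series.

```c
#include <stdint.h>

/* Hypothetical per-partition window into the global write stream space. */
struct part_streams {
	uint16_t first;	/* first global stream owned by the partition (1-based) */
	uint16_t count;	/* number of contiguous streams in the window */
};

/*
 * Translate a partition-local stream number (1..count) into a global one.
 * 0 means "no stream" and passes through unchanged; anything beyond the
 * partition's window is rejected with -1.
 */
static int part_map_stream(const struct part_streams *ps, uint16_t local)
{
	if (local == 0)
		return 0;
	if (local > ps->count)
		return -1;
	return ps->first + local - 1;
}
```

With a window of { .first = 5, .count = 3 }, local streams 1..3 map to
global streams 5..7, so the partition always sees a contiguous number
space regardless of where its window starts.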
Keith Busch Nov. 11, 2024, 4:27 p.m. UTC | #2
On Mon, Nov 11, 2024 at 11:29:14AM +0100, Christoph Hellwig wrote:
> On Fri, Nov 08, 2024 at 11:36:20AM -0800, Keith Busch wrote:
> >   Default partition split so partition one gets all the write hints
> >   exclusively
> 
> I still don't think this actually works as expected, as the user
> interface says the write streams are contiguous, and with the bitmap
> they aren't.
> 
> As I seem to have a really hard time to get my point across, I instead
> spent this morning doing a POC of what I mean, and pushed it here:
> 
> http://git.infradead.org/?p=users/hch/misc.git;a=shortlog;h=refs/heads/block-write-streams

Just purely for backward compatibility, I don't think you can have the
nvme driver error out if a stream is too large. The fcntl lifetime hint
never errored out before, which gets set unconditionally from the
file_inode without considering the block device's max write stream.

> The big differences are:
> 
>  - there is a separate write_stream value now instead of overloading
>    the write hint.  For now it is an 8-bit field for the internal
>    data structures so that we don't have to grow the bio, but all the
>    user interfaces are kept at 16 bits (or in case of statx reduced to
>    it).  If this becomes not enough because we need to support devices
>    with multiple reclaim groups we'll have to find some space by using
>    unions or growing structures

As far as I know, 255 possible streams exceeds any existing use case.

>  - block/fops.c is the place to map the existing write hints into
>    the write streams instead of the driver

I might be missing something here, but that part sure looks the same as what's
in this series.

>  - the stream granularity is added, because adding it to statx at a
>    later time would be nasty.  Getting it in nvme is actually amazingly
>    cumbersome so I gave up on that and just fed a dummy value for
>    testing, though

Just regarding the documentation on the write_stream_granularity, you
don't need to discard the entire RU in a single command. You can
invalidate the RU simply by overwriting the LBAs without ever issuing
any discard commands.

If you really want to treat it this way, you need to ensure the first
LBA written to an RU is always aligned to NPDA/NPDAL.

If this is really what you require to move this forward, though, that's
fine with me.

>  - the partitions remapping is now done using an offset into the global
>    write stream space so that there is a contiguous number space.
>    The interface for this is rather hacky, so only treat it as a start
>    for interface and use case discussions.
>  - the generic stack limits code stopped stacking the max write
>    streams.  While it does the right thing for simple things like
>    multipath and mirroring/striping it is wrong for anything non-trivial
>    like parity raid.  I've left this as a separate fold patch for the
>    discussion.
Christoph Hellwig Nov. 11, 2024, 4:34 p.m. UTC | #3
On Mon, Nov 11, 2024 at 09:27:33AM -0700, Keith Busch wrote:
> Just purely for backward compatibility, I don't think you can have the
> nvme driver error out if a stream is too large. The fcntl lifetime hint
> never errored out before, which gets set unconditionally from the
> file_inode without considering the block device's max write stream.

True.  But block/fops.c should simply not set the write hint in that
case (or even do a bit of folding if we care enough).
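
To make the two options concrete (dropping an out-of-range hint versus
folding it), here is a minimal sketch.  The function names are
hypothetical and this is not the code in block/fops.c; it only assumes
that stream 0 means "no stream" and that valid streams are
1..max_streams.

```c
#include <stdint.h>

/* Option 1: silently drop a hint the device cannot honor (never error out). */
static uint8_t hint_to_stream_drop(uint16_t hint, uint8_t max_streams)
{
	return hint <= max_streams ? (uint8_t)hint : 0;
}

/* Option 2: fold out-of-range hints back into the range 1..max_streams. */
static uint8_t hint_to_stream_fold(uint16_t hint, uint8_t max_streams)
{
	if (hint == 0 || max_streams == 0)
		return 0;
	return (uint8_t)((hint - 1) % max_streams + 1);
}
```

Either way the write itself always succeeds, which preserves the
existing behaviour of the fcntl lifetime hint.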

> >  - block/fops.c is the place to map the existing write hints into
> >    the write streams instead of the driver
> 
> I might be missing something here, but that part sure looks the same as what's
> in this series.

Your series simply mixes up the existing write (temperature) hint and
the write stream, including for file system use.  This version does
something very similar, but only for block devices.

> 
> >  - the stream granularity is added, because adding it to statx at a
> >    later time would be nasty.  Getting it in nvme is actually amazingly
> >    cumbersome so I gave up on that and just fed a dummy value for
> >    testing, though
> 
> Just regarding the documentation on the write_stream_granularity, you
> don't need to discard the entire RU in a single command. You can
> invalidate the RU simply by overwriting the LBAs without ever issuing
> any discard commands.

True.  Did I mention this was a quick hack job?

> If you really want to treat it this way, you need to ensure the first
> LBA written to an RU is always aligned to NPDA/NPDAL.

Those are just hints as well, but I agree you probably get much
better results if the writes are aligned to them.

> If this is really what you require to move this forward, though, that's
> fine with me.

I could move it forward, but right now I'm more than oversubscribed.
If someone actually pushing for this work could put more effort into it,
it will surely be faster.
Kanchan Joshi Nov. 12, 2024, 1:26 p.m. UTC | #4
On 11/11/2024 3:59 PM, Christoph Hellwig wrote:
>   - there is a separate write_stream value now instead of overloading
>     the write hint.  For now it is an 8-bit field for the internal
>     data structures so that we don't have to grow the bio, but all the
>     user interfaces are kept at 16 bits (or in case of statx reduced to
>     it).  If this becomes not enough because we need to support devices
>     with multiple reclaim groups we'll have to find some space by using
>     unions or growing structures

>   - block/fops.c is the place to map the existing write hints into
>     the write streams instead of the driver


The last time I attempted this separation between temperature and
placement hints, it required adding a new fcntl[*] too, because
per-inode/file hints continue to be useful even when they are treated
as passthrough by the FS.

Applications are able to use temperature hints to group multiple files 
on device regardless of the logical placement made by FS.
The same ability is useful for write-streams/placement-hints too. But 
these patches reduce the scope to only block device.

IMO, passthrough propagation of hints/streams should continue to remain 
the default behavior as it applies on multiple filesystems. And more 
active placement by FS should rather be enabled by some opt in (e.g., 
mount option). Such opt in will anyway be needed for other reasons (like 
regression avoidance on a broken device).

[*] 
https://lore.kernel.org/linux-nvme/20240910150200.6589-4-joshi.k@samsung.com/
Christoph Hellwig Nov. 12, 2024, 1:34 p.m. UTC | #5
On Tue, Nov 12, 2024 at 06:56:25PM +0530, Kanchan Joshi wrote:
> IMO, passthrough propagation of hints/streams should continue to remain 
> the default behavior as it applies on multiple filesystems. And more 
> active placement by FS should rather be enabled by some opt in (e.g., 
> mount option). Such opt in will anyway be needed for other reasons (like 
> regression avoidance on a broken device).

I feel like banging my head against the wall.  No, passing through write
streams is simply not acceptable without the file system being in
control.  I've said and explained this in detail about a dozen times
and the file system actually needing to do data separation for its own
purpose doesn't go away by ignoring it.
Keith Busch Nov. 12, 2024, 2:25 p.m. UTC | #6
On Tue, Nov 12, 2024 at 02:34:39PM +0100, Christoph Hellwig wrote:
> On Tue, Nov 12, 2024 at 06:56:25PM +0530, Kanchan Joshi wrote:
> > IMO, passthrough propagation of hints/streams should continue to remain 
> > the default behavior as it applies on multiple filesystems. And more 
> > active placement by FS should rather be enabled by some opt in (e.g., 
> > mount option). Such opt in will anyway be needed for other reasons (like 
> > regression avoidance on a broken device).
> 
> I feel like banging my head against the wall.  No, passing through write
> streams is simply not acceptable without the file system being in
> control.  I've said and explained this in detail about a dozen times
> and the file system actually needing to do data separation for its own
> purpose doesn't go away by ignoring it.

But that's just an ideological decision that doesn't jibe with how
people use these. The applications know how they use their data better
than the filesystem, so putting the filesystem in the way to force
streams to look like zones is just an unnecessary layer of indirection
getting in the way.
Christoph Hellwig Nov. 12, 2024, 4:50 p.m. UTC | #7
On Tue, Nov 12, 2024 at 07:25:45AM -0700, Keith Busch wrote:
> > I feel like banging my head against the wall.  No, passing through write
> > streams is simply not acceptable without the file system being in
> > control.  I've said and explained this in detail about a dozen times
> > and the file system actually needing to do data separation for its own
> > purpose doesn't go away by ignoring it.
> 
> But that's just an ideological decision that doesn't jibe with how
> people use these.

Sorry, but no it is not.  The file system is the entity that owns the
block device, and it is the layer that manages the block device.
Bypassing it is a layering violation that creates a lot of problems
and solves none at all.

> The applications know how they use their data better
> than the filesystem,

That is a very bold assumption, and a clear indication that you are
actually approaching this with a rather ideological hat.  If your
specific application actually thinks it knows the storage better than
the file system that you are using, you probably should not be using
that file system.  Use a raw block device or even better passthrough
or spdk if you really know what you are doing (or at least think so).

Otherwise you need to agree that the file system is the final arbiter
of the underlying device resource.  Hint: if you have an application
that knows what it is doing (there actually are a few of those) it's
usually not hard to actually work with file system people to create
abstractions that don't poke holes into layering but still give the
applications what you want.  There's also the third option of doing
something like what Damien did with zonefs and actually create an
abstraction for what you are doing.

> so putting the filesystem in the way to force
> streams to look like zones is just an unnecessary layer of indirection
> getting in the way.

Can you please stop this BS?  Even if a file system doesn't treat
write streams like zones and keeps LBA space and physical allocation units
entirely separate (for which I see no good reason, but others might
disagree) you still need the file system in control of the hardware
resources.
Christoph Hellwig Nov. 12, 2024, 5:19 p.m. UTC | #8
On Tue, Nov 12, 2024 at 05:50:54PM +0100, Christoph Hellwig wrote:
> > so putting the filesystem in the way to force
> > streams to look like zones is just an unnecessary layer of indirection
> > getting in the way.
> 
> Can you please stop this BS?  Even if a file system doesn't treat
> write streams like zones and keeps LBA space and physical allocation units
> entirely separate (for which I see no good reason, but others might
> disagree) you still need the file system in control of the hardware
> resources.

And in case this wasn't clear enough.  Let's assume you want to write
a low write amp flash optimized file system similar to say the storage
layers of the all flash arrays of the last 10-15 years.

You really want to avoid device GC.  You'd better group your data to
the reclaim unit / erase block / insert name here.  So you need file
system control of the write streams, you need to know their size,
you need to be able to query how much you wrote after a power fail.
Totally independent of how you organize your LBA space.  Mapping
it linearly might be the easier option without major downside, but
you could also allocate them randomly for that matter.

Pierre Labat Nov. 12, 2024, 6:18 p.m. UTC | #9
My 2 cents.

Overall, it seems to me that the difficulty here comes from 2 things:
1)  The write hints may have different semantics (temperature, FDP placement, and whatever will come next).
2) Different software layers may want to use the hints, and if several do that at the same time on the same storage that may result in a mess.

About 1)
Seems to me that having a different interface for each semantic is overkill: extra code to maintain, and extra work when a new semantic comes along.
To keep things simple, keep one set of interfaces (per-IO interface, per-file interface) for all write hint semantics, and carry the difference in semantics in the hint itself.
For example, with 32-bit hints, store the semantic in 8 bits and use the rest in the context of that semantic.
The storage transport driver (the nvme driver, for example) then translates the write hint for the storage device based on the 8-bit semantic.
The storage driver can support several translations, one for each supported semantic. Linux doesn't need to yank out one translation to replace it with a new one.
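
A minimal sketch of that encoding, assuming the semantic lives in the top 8 bits and the remaining 24 bits carry the value. The enum values, macro names and helper names are made up for illustration and are not an existing kernel interface.

```c
#include <stdint.h>

/* Hypothetical semantic identifiers stored in the top 8 bits of the hint. */
enum hint_sem {
	HINT_SEM_NONE = 0,
	HINT_SEM_TEMPERATURE = 1,	/* lifetime/temperature hint */
	HINT_SEM_PLACEMENT = 2,		/* FDP-style placement hint */
};

#define HINT_SEM_SHIFT	24
#define HINT_VAL_MASK	0x00ffffffu

/* Combine a semantic and a 24-bit value into one 32-bit write hint. */
static uint32_t hint_pack(uint8_t sem, uint32_t value)
{
	return ((uint32_t)sem << HINT_SEM_SHIFT) | (value & HINT_VAL_MASK);
}

/* Extract the semantic so a driver can pick the matching translation. */
static uint8_t hint_sem_of(uint32_t hint)
{
	return (uint8_t)(hint >> HINT_SEM_SHIFT);
}

/* Extract the semantic-specific value. */
static uint32_t hint_value(uint32_t hint)
{
	return hint & HINT_VAL_MASK;
}
```

A driver would switch on hint_sem_of() and interpret hint_value() accordingly, ignoring semantics it does not support.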

About 2)
Provide a simple way for the user to decide which layer generates write hints.
As an example, as some of you pointed out, what if the filesystem wants to generate write hints to optimize how the storage handles its [own] data, and at the same time the application using the FS understands the storage and also wants to optimize using write hints?
Both use cases are legit, I think.
To handle that in a simple way, why not have a filesystem mount parameter enabling/disabling the use of write hints by the FS?
In the case of an application not needing/wanting to use write hints on its own, the user would mount the filesystem with generation of write hints enabled. That could be the default.
Conversely, if the user decides it is best for one application to directly generate write hints to get the best performance, they mount the filesystem with generation of write hints by the FS disabled. The FS then acts as a passthrough for write hints.

Regards,

Pierre