diff mbox series

[PATCHv8,1/6] block, fs: restore kiocb based write hint processing

Message ID 20241017160937.2283225-2-kbusch@meta.com (mailing list archive)
State New
Headers show
Series write hints for nvme fdp | expand

Commit Message

Keith Busch Oct. 17, 2024, 4:09 p.m. UTC
From: Kanchan Joshi <joshi.k@samsung.com>

struct kiocb has a 2 bytes hole that developed post commit 41d36a9f3e53
("fs: remove kiocb.ki_hint"). But write hint returned with commit
449813515d3e ("block, fs: Restore the per-bio/request data lifetime
fields").

This patch uses the leftover space in kiocb to carve 2 byte field
ki_write_hint. Restore the code that operates on kiocb to use
ki_write_hint instead of inode hint value.

This does not change any behavior, but needed to enable per-io hints.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/fops.c         | 6 +++---
 fs/aio.c             | 1 +
 fs/cachefiles/io.c   | 1 +
 fs/direct-io.c       | 2 +-
 fs/iomap/direct-io.c | 2 +-
 include/linux/fs.h   | 8 ++++++++
 io_uring/rw.c        | 1 +
 7 files changed, 16 insertions(+), 5 deletions(-)

Comments

Christoph Hellwig Oct. 18, 2024, 5:50 a.m. UTC | #1
On Thu, Oct 17, 2024 at 09:09:32AM -0700, Keith Busch wrote:
> From: Kanchan Joshi <joshi.k@samsung.com>
> 
> struct kiocb has a 2 bytes hole that developed post commit 41d36a9f3e53
> ("fs: remove kiocb.ki_hint"). But write hint returned with commit
> 449813515d3e ("block, fs: Restore the per-bio/request data lifetime
> fields").
> 
> This patch uses the leftover space in kiocb to carve 2 byte field
> ki_write_hint. Restore the code that operates on kiocb to use
> ki_write_hint instead of inode hint value.
> 
> This does not change any behavior, but needed to enable per-io hints.

In this version it doesn't really restore anything, but adds a new
write hinting capability.   Similarly to the bio patch we'll probably
need to make clear what is in there instead of having it completely
untyped (the exact same appraoch as for the bio should work).

> index bbd05f1a21453..73629e26becbe 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -409,7 +409,7 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
>  		bio->bi_end_io = dio_bio_end_io;
>  	if (dio->is_pinned)
>  		bio_set_flag(bio, BIO_PAGE_PINNED);
> -	bio->bi_write_hint = file_inode(dio->iocb->ki_filp)->i_write_hint;
> +	bio->bi_write_hint = dio->iocb->ki_write_hint;
>  
>  	sdio->bio = bio;
>  	sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index f637aa0706a31..fff43f121ee65 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -397,7 +397,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  		fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
>  					  GFP_KERNEL);
>  		bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
> -		bio->bi_write_hint = inode->i_write_hint;
> +		bio->bi_write_hint = dio->iocb->ki_write_hint;

File system (helper) code should not directly apply this limit,
but the file system needs to set it.

> +static inline enum rw_hint file_write_hint(struct file *filp)
> +{
> +	return file_inode(filp)->i_write_hint;
> +}
> +
>  static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
>  {
>  	*kiocb = (struct kiocb) {
>  		.ki_filp = filp,
>  		.ki_flags = filp->f_iocb_flags,
>  		.ki_ioprio = get_current_ioprio(),
> +		.ki_write_hint = file_write_hint(filp),

And we'll need to distinguish between the per-inode and per file
hint.  I.e. don't blindly initialize ki_write_hint to the per-inode
one here, but make that conditional in the file operation.
Bart Van Assche Oct. 18, 2024, 4:11 p.m. UTC | #2
On 10/17/24 9:09 AM, Keith Busch wrote:
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3559446279c15..04e875a37f604 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -370,6 +370,7 @@ struct kiocb {
>   	void			*private;
>   	int			ki_flags;
>   	u16			ki_ioprio; /* See linux/ioprio.h */
> +	u16			ki_write_hint;
>   	union {
>   		/*
>   		 * Only used for async buffered reads, where it denotes the

Why 'u16'? Shouldn't this patch use the type 'enum rw_hint' for 
ki_write_hint and shouldn't this type be changed into u16 in a later patch?

Thanks,

Bart.
Keith Busch Oct. 21, 2024, 3:47 p.m. UTC | #3
On Fri, Oct 18, 2024 at 07:50:32AM +0200, Christoph Hellwig wrote:
> On Thu, Oct 17, 2024 at 09:09:32AM -0700, Keith Busch wrote:
> >  {
> >  	*kiocb = (struct kiocb) {
> >  		.ki_filp = filp,
> >  		.ki_flags = filp->f_iocb_flags,
> >  		.ki_ioprio = get_current_ioprio(),
> > +		.ki_write_hint = file_write_hint(filp),
> 
> And we'll need to distinguish between the per-inode and per file
> hint.  I.e. don't blindly initialize ki_write_hint to the per-inode
> one here, but make that conditional in the file operation.

Maybe someone wants to do direct-io with partions where each partition
has a different default "hint" when not provided a per-io hint? I don't
know of such a case, but it doesn't sound terrible. In any case, I feel
if you're directing writes through these interfaces, you get to keep all
the pieces: user space controls policy, kernel just provides the
mechanisms to do it.
Bart Van Assche Oct. 21, 2024, 5:09 p.m. UTC | #4
On 10/21/24 8:47 AM, Keith Busch wrote:
> On Fri, Oct 18, 2024 at 07:50:32AM +0200, Christoph Hellwig wrote:
>> On Thu, Oct 17, 2024 at 09:09:32AM -0700, Keith Busch wrote:
>>>   {
>>>   	*kiocb = (struct kiocb) {
>>>   		.ki_filp = filp,
>>>   		.ki_flags = filp->f_iocb_flags,
>>>   		.ki_ioprio = get_current_ioprio(),
>>> +		.ki_write_hint = file_write_hint(filp),
>>
>> And we'll need to distinguish between the per-inode and per file
>> hint.  I.e. don't blindly initialize ki_write_hint to the per-inode
>> one here, but make that conditional in the file operation.
> 
> Maybe someone wants to do direct-io with partions where each partition
> has a different default "hint" when not provided a per-io hint? I don't
> know of such a case, but it doesn't sound terrible. In any case, I feel
> if you're directing writes through these interfaces, you get to keep all
> the pieces: user space controls policy, kernel just provides the
> mechanisms to do it.

Is it important to support partitions on top of FDP namespaces? We could
follow the example of zoned block devices and not support partitions on
top of FDP devices. From block/core.c, function add_partition():

	/*
	 * Partitions are not supported on zoned block devices that are used as
	 * such.
	 */
	if (bdev_is_zoned(disk->part0)) {
		pr_warn("%s: partitions not supported on host managed zoned block 
device\n",
			disk->disk_name);
		return ERR_PTR(-ENXIO);
	}

Thanks,

Bart.
Keith Busch Oct. 21, 2024, 7:35 p.m. UTC | #5
On Mon, Oct 21, 2024 at 10:09:57AM -0700, Bart Van Assche wrote:
> On 10/21/24 8:47 AM, Keith Busch wrote:
> > On Fri, Oct 18, 2024 at 07:50:32AM +0200, Christoph Hellwig wrote:
> > > On Thu, Oct 17, 2024 at 09:09:32AM -0700, Keith Busch wrote:
> > > >   {
> > > >   	*kiocb = (struct kiocb) {
> > > >   		.ki_filp = filp,
> > > >   		.ki_flags = filp->f_iocb_flags,
> > > >   		.ki_ioprio = get_current_ioprio(),
> > > > +		.ki_write_hint = file_write_hint(filp),
> > > 
> > > And we'll need to distinguish between the per-inode and per file
> > > hint.  I.e. don't blindly initialize ki_write_hint to the per-inode
> > > one here, but make that conditional in the file operation.
> > 
> > Maybe someone wants to do direct-io with partions where each partition
> > has a different default "hint" when not provided a per-io hint? I don't
> > know of such a case, but it doesn't sound terrible. In any case, I feel
> > if you're directing writes through these interfaces, you get to keep all
> > the pieces: user space controls policy, kernel just provides the
> > mechanisms to do it.
> 
> Is it important to support partitions on top of FDP namespaces? 

It's already used with partitions, so yes, it's important. Breaking that
as a condition to make it work with the block stack is a non-starter.
Christoph Hellwig Oct. 22, 2024, 6:43 a.m. UTC | #6
On Mon, Oct 21, 2024 at 09:47:47AM -0600, Keith Busch wrote:
> On Fri, Oct 18, 2024 at 07:50:32AM +0200, Christoph Hellwig wrote:
> > On Thu, Oct 17, 2024 at 09:09:32AM -0700, Keith Busch wrote:
> > >  {
> > >  	*kiocb = (struct kiocb) {
> > >  		.ki_filp = filp,
> > >  		.ki_flags = filp->f_iocb_flags,
> > >  		.ki_ioprio = get_current_ioprio(),
> > > +		.ki_write_hint = file_write_hint(filp),
> > 
> > And we'll need to distinguish between the per-inode and per file
> > hint.  I.e. don't blindly initialize ki_write_hint to the per-inode
> > one here, but make that conditional in the file operation.
> 
> Maybe someone wants to do direct-io with partions where each partition
> has a different default "hint" when not provided a per-io hint? I don't
> know of such a case, but it doesn't sound terrible. In any case, I feel
> if you're directing writes through these interfaces, you get to keep all
> the pieces: user space controls policy, kernel just provides the
> mechanisms to do it.

Eww.  You actually pointed out a real problem here: if a device
has multiple partitions the write streams as of this series are
shared by them, which breaks their use case as the applications or
file systems in different partitions will get other users of the
write stream randomly overlayed onto theirs.

So either the available streams need to be split into smaller pools
by partitions, or we just assigned them to the first partition to
make these scheme work for partitioned devices.

Either way mixing up the per-inode hint and the dynamic one remains
a bad idea.
Keith Busch Oct. 22, 2024, 2:37 p.m. UTC | #7
On Tue, Oct 22, 2024 at 08:43:09AM +0200, Christoph Hellwig wrote:
> On Mon, Oct 21, 2024 at 09:47:47AM -0600, Keith Busch wrote:
> > On Fri, Oct 18, 2024 at 07:50:32AM +0200, Christoph Hellwig wrote:
> > > On Thu, Oct 17, 2024 at 09:09:32AM -0700, Keith Busch wrote:
> > > >  {
> > > >  	*kiocb = (struct kiocb) {
> > > >  		.ki_filp = filp,
> > > >  		.ki_flags = filp->f_iocb_flags,
> > > >  		.ki_ioprio = get_current_ioprio(),
> > > > +		.ki_write_hint = file_write_hint(filp),
> > > 
> > > And we'll need to distinguish between the per-inode and per file
> > > hint.  I.e. don't blindly initialize ki_write_hint to the per-inode
> > > one here, but make that conditional in the file operation.
> > 
> > Maybe someone wants to do direct-io with partions where each partition
> > has a different default "hint" when not provided a per-io hint? I don't
> > know of such a case, but it doesn't sound terrible. In any case, I feel
> > if you're directing writes through these interfaces, you get to keep all
> > the pieces: user space controls policy, kernel just provides the
> > mechanisms to do it.
> 
> Eww.  You actually pointed out a real problem here: if a device
> has multiple partitions the write streams as of this series are
> shared by them, which breaks their use case as the applications or
> file systems in different partitions will get other users of the
> write stream randomly overlayed onto theirs.
> 
> So either the available streams need to be split into smaller pools
> by partitions, or we just assigned them to the first partition to
> make these scheme work for partitioned devices.
> 
> Either way mixing up the per-inode hint and the dynamic one remains
> a bad idea.

No doubt it's almost certainly not a good idea to mix different stream
usages, but that's not the kernels problem. It's user space policy. I
don't think the kernel needs to perform any heroic efforts to split
anything here. Just keep it simple.
Christoph Hellwig Oct. 22, 2024, 2:40 p.m. UTC | #8
On Tue, Oct 22, 2024 at 08:37:56AM -0600, Keith Busch wrote:
> No doubt it's almost certainly not a good idea to mix different stream
> usages, but that's not the kernels problem. It's user space policy. I
> don't think the kernel needs to perform any heroic efforts to split
> anything here. Just keep it simple.

Unfortunately it's not.  This complicated breaks any intelligent
scheme to manage them.  You can't write portable code assuming you
can know write streams when you need to cope with someone else using
some or all of the same resources.
diff mbox series

Patch

diff --git a/block/fops.c b/block/fops.c
index e696ae53bf1e0..85b9b97d372c8 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -74,7 +74,7 @@  static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 		bio_init(&bio, bdev, vecs, nr_pages, dio_bio_write_op(iocb));
 	}
 	bio.bi_iter.bi_sector = pos >> SECTOR_SHIFT;
-	bio.bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+	bio.bi_write_hint = iocb->ki_write_hint;
 	bio.bi_ioprio = iocb->ki_ioprio;
 	if (iocb->ki_flags & IOCB_ATOMIC)
 		bio.bi_opf |= REQ_ATOMIC;
@@ -203,7 +203,7 @@  static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 
 	for (;;) {
 		bio->bi_iter.bi_sector = pos >> SECTOR_SHIFT;
-		bio->bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+		bio->bi_write_hint = iocb->ki_write_hint;
 		bio->bi_private = dio;
 		bio->bi_end_io = blkdev_bio_end_io;
 		bio->bi_ioprio = iocb->ki_ioprio;
@@ -319,7 +319,7 @@  static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 	dio->flags = 0;
 	dio->iocb = iocb;
 	bio->bi_iter.bi_sector = pos >> SECTOR_SHIFT;
-	bio->bi_write_hint = file_inode(iocb->ki_filp)->i_write_hint;
+	bio->bi_write_hint = iocb->ki_write_hint;
 	bio->bi_end_io = blkdev_bio_end_io_async;
 	bio->bi_ioprio = iocb->ki_ioprio;
 
diff --git a/fs/aio.c b/fs/aio.c
index e8920178b50f7..db618817e670d 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1517,6 +1517,7 @@  static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb, int rw_type)
 	req->ki_flags = req->ki_filp->f_iocb_flags | IOCB_AIO_RW;
 	if (iocb->aio_flags & IOCB_FLAG_RESFD)
 		req->ki_flags |= IOCB_EVENTFD;
+	req->ki_write_hint = file_write_hint(req->ki_filp);
 	if (iocb->aio_flags & IOCB_FLAG_IOPRIO) {
 		/*
 		 * If the IOCB_FLAG_IOPRIO flag of aio_flags is set, then
diff --git a/fs/cachefiles/io.c b/fs/cachefiles/io.c
index 6a821a959b59e..c3db102ae64e2 100644
--- a/fs/cachefiles/io.c
+++ b/fs/cachefiles/io.c
@@ -309,6 +309,7 @@  int __cachefiles_write(struct cachefiles_object *object,
 	ki->iocb.ki_pos		= start_pos;
 	ki->iocb.ki_flags	= IOCB_DIRECT | IOCB_WRITE;
 	ki->iocb.ki_ioprio	= get_current_ioprio();
+	ki->iocb.ki_write_hint  = file_write_hint(file);
 	ki->object		= object;
 	ki->start		= start_pos;
 	ki->len			= len;
diff --git a/fs/direct-io.c b/fs/direct-io.c
index bbd05f1a21453..73629e26becbe 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -409,7 +409,7 @@  dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
 		bio->bi_end_io = dio_bio_end_io;
 	if (dio->is_pinned)
 		bio_set_flag(bio, BIO_PAGE_PINNED);
-	bio->bi_write_hint = file_inode(dio->iocb->ki_filp)->i_write_hint;
+	bio->bi_write_hint = dio->iocb->ki_write_hint;
 
 	sdio->bio = bio;
 	sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f637aa0706a31..fff43f121ee65 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -397,7 +397,7 @@  static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
 					  GFP_KERNEL);
 		bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
-		bio->bi_write_hint = inode->i_write_hint;
+		bio->bi_write_hint = dio->iocb->ki_write_hint;
 		bio->bi_ioprio = dio->iocb->ki_ioprio;
 		bio->bi_private = dio;
 		bio->bi_end_io = iomap_dio_bio_end_io;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3559446279c15..04e875a37f604 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -370,6 +370,7 @@  struct kiocb {
 	void			*private;
 	int			ki_flags;
 	u16			ki_ioprio; /* See linux/ioprio.h */
+	u16			ki_write_hint;
 	union {
 		/*
 		 * Only used for async buffered reads, where it denotes the
@@ -2337,12 +2338,18 @@  static inline bool HAS_UNMAPPED_ID(struct mnt_idmap *idmap,
 	       !vfsgid_valid(i_gid_into_vfsgid(idmap, inode));
 }
 
+static inline enum rw_hint file_write_hint(struct file *filp)
+{
+	return file_inode(filp)->i_write_hint;
+}
+
 static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
 {
 	*kiocb = (struct kiocb) {
 		.ki_filp = filp,
 		.ki_flags = filp->f_iocb_flags,
 		.ki_ioprio = get_current_ioprio(),
+		.ki_write_hint = file_write_hint(filp),
 	};
 }
 
@@ -2353,6 +2360,7 @@  static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
 		.ki_filp = filp,
 		.ki_flags = kiocb_src->ki_flags,
 		.ki_ioprio = kiocb_src->ki_ioprio,
+		.ki_write_hint = kiocb_src->ki_write_hint,
 		.ki_pos = kiocb_src->ki_pos,
 	};
 }
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 80ae3c2ebb70c..ffd637ca0bd17 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -1027,6 +1027,7 @@  int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	if (unlikely(ret))
 		return ret;
 	req->cqe.res = iov_iter_count(&io->iter);
+	rw->kiocb.ki_write_hint = file_write_hint(rw->kiocb.ki_filp);
 
 	if (force_nonblock) {
 		/* If the file doesn't support async, just async punt */