diff mbox series

[v6,18/34] dio: Pin pages rather than ref'ing if appropriate

Message ID 167391061117.2311931.16807283804788007499.stgit@warthog.procyon.org.uk (mailing list archive)
State New, archived
Headers show
Series iov_iter: Improve page extraction (ref, pin or just list) | expand

Commit Message

David Howells Jan. 16, 2023, 11:10 p.m. UTC
Convert the generic direct-I/O code to use iov_iter_extract_pages() instead
of iov_iter_get_pages().  This will pin pages or leave them unaltered
rather than getting a ref on them as appropriate to the iterator.

The pages need to be pinned for DIO-read rather than having refs taken on
them to prevent VM copy-on-write from malfunctioning during a concurrent
fork() (the result of the I/O would otherwise end up only visible to the
child process and not the parent).

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Christoph Hellwig <hch@lst.de>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
---

 fs/direct-io.c |   57 ++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 37 insertions(+), 20 deletions(-)

Comments

Al Viro Jan. 19, 2023, 5:04 a.m. UTC | #1
On Mon, Jan 16, 2023 at 11:10:11PM +0000, David Howells wrote:
> Convert the generic direct-I/O code to use iov_iter_extract_pages() instead
> of iov_iter_get_pages().  This will pin pages or leave them unaltered
> rather than getting a ref on them as appropriate to the iterator.
> 
> The pages need to be pinned for DIO-read rather than having refs taken on
> them to prevent VM copy-on-write from malfunctioning during a concurrent
> fork() (the result of the I/O would otherwise end up only visible to the
> child process and not the parent).

Several observations:

1) fs/direct-io.c is ancient, grotty and has few remaining users.
The case of block devices got split off first; these days it's in
block/fops.c.  Then iomap-using filesystems went to fs/iomap/direct-io.c,
leaving this sucker used only by affs, ext2, fat, exfat, hfs, hfsplus, jfs,
nilfs2, ntfs3, reiserfs, udf and ocfs2.  And frankly, the sooner it dies
the better off we are.  IOW, you've picked an uninteresting part and left
the important ones untouched.

2) if you look at the "should_dirty" logics in either of those (including
fs/direct-io.c itself) you'll see a funny thing.  First of all,
dio->should_dirty (or its counterparts) gets set iff we have a user-backed
iter and operation is a read.  I.e. precisely the case when you get bio
marked with BIO_PAGE_PINNED.  And that's not an accident - look at the
places where we check that predicate: dio_bio_submit() calls
bio_set_pages_dirty() if that predicate is true before submitting the
sucker and dio_bio_complete() uses it to choose between bio_check_pages_dirty()
and bio_release_pages() + bio_put().

Look at bio_check_pages_dirty() - it checks if any of the pages we were
reading into had managed to lose the dirty bit; if none had it does
bio_release_pages(bio, false) + bio_put(bio) and returns.  If some had,
it shoves bio into bio_dirty_list and arranges for bio_release_pages(bio, true)
+ bio_put(bio) called from upper half (via schedule_work()).  The effect
of the second argument of bio_release_pages() is to (re)dirty the pages;
it can't be done from interrupt, so we have to defer it to process context.

Now, do we need to redirty anything there?  Recall that page pinning had
been introduced precisely to prevent writeback while the page is getting
DMA into it.  Who is going to mark it clean before we unpin it?

Unless I misunderstand something fundamental about the whole thing,
this crap should become useless with that conversion.  And it's not just
the ->should_dirty and its equivalents - bio_check_pages_dirty() and
the stuff around it should also be gone once block/fops.c and
fs/iomap/direct-io.c are switched to your iov_iter_extract_pages.
Moreover, that way the only places legitimately passing true to
bio_release_pages() are blk_rq_unmap_user() (on completion of
bypass read request mapping a user page) and __blkdev_direct_IO_simple()
(on completion of short sync O_DIRECT read from block device).
Both could just as well call bio_set_pages_dirty(bio) +
bio_release_pages(bio, false), killing the "dirty on release" logics
and losing the 'mark_dirty' argument.

BTW, where do we dirty the pages on IO_URING_OP_READ_FIXED with
O_DIRECT file?  AFAICS, bio_set_pages_dirty() won't be called
(ITER_BVEC iter) and neither will bio_release_pages() do anything
(BIO_NO_PAGE_REF set on the bio by bio_iov_bvec_set() called
due to the same ITER_BVEC iter).  Am I missing something trivial
here?  Jens?
Christoph Hellwig Jan. 19, 2023, 5:51 a.m. UTC | #2
On Thu, Jan 19, 2023 at 05:04:20AM +0000, Al Viro wrote:
> 1) fs/direct-io.c is ancient, grotty and has few remaining users.
> The case of block devices got split off first; these days it's in
> block/fops.c.  Then iomap-using filesystems went to fs/iomap/direct-io.c,
> leaving this sucker used only by affs, ext2, fat, exfat, hfs, hfsplus, jfs,
> nilfs2, ntfs3, reiserfs, udf and ocfs2.  And frankly, the sooner it dies
> the better off we are.  IOW, you've picked an uninteresting part and left
> the important ones untouched.

Agreed.  That being said if we want file systems (including those not
using this legacy version) to be able to rely on correct page dirtying
eventually everything needs to pin pages it writes to.  So we need to
either actually fix or remove this code in the forseeable future.  It's
by far not the most interesting and highest priority, though.   And as
I said this series is already too large too review anyway, I'd really
prefer to get a core set done ASAP and then iterate on the callers and
additional bits.

> Unless I misunderstand something fundamental about the whole thing,
> this crap should become useless with that conversion.

It should - mostly.  But we need to be very careful about that, so
I'd prefer a separate small series for it to be honest.

> BTW, where do we dirty the pages on IO_URING_OP_READ_FIXED with
> O_DIRECT file?  AFAICS, bio_set_pages_dirty() won't be called
> (ITER_BVEC iter) and neither will bio_release_pages() do anything
> (BIO_NO_PAGE_REF set on the bio by bio_iov_bvec_set() called
> due to the same ITER_BVEC iter).  Am I missing something trivial
> here?  Jens?

I don't think we do that all right now.
diff mbox series

Patch

diff --git a/fs/direct-io.c b/fs/direct-io.c
index b1e26a706e31..b4d2c9f85a5b 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -142,9 +142,11 @@  struct dio {
 
 	/*
 	 * pages[] (and any fields placed after it) are not zeroed out at
-	 * allocation time.  Don't add new fields after pages[] unless you
-	 * wish that they not be zeroed.
+	 * allocation time.  Don't add new fields after pages[] unless you wish
+	 * that they not be zeroed.  Pages may have a ref taken, a pin emplaced
+	 * or no retention measures.
 	 */
+	unsigned int cleanup_mode;	/* How pages should be cleaned up (0/FOLL_GET/PIN) */
 	union {
 		struct page *pages[DIO_PAGES];	/* page buffer */
 		struct work_struct complete_work;/* deferred AIO completion */
@@ -167,12 +169,13 @@  static inline unsigned dio_pages_present(struct dio_submit *sdio)
 static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 {
 	const enum req_op dio_op = dio->opf & REQ_OP_MASK;
+	unsigned int gup_flags =
+		op_is_write(dio_op) ? FOLL_SOURCE_BUF : FOLL_DEST_BUF;
+	struct page **pages = dio->pages;
 	ssize_t ret;
 
-	ret = iov_iter_get_pages(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
-				 &sdio->from,
-				 op_is_write(dio_op) ?
-				 FOLL_SOURCE_BUF : FOLL_DEST_BUF);
+	ret = iov_iter_extract_pages(sdio->iter, &pages, LONG_MAX, DIO_PAGES,
+				     gup_flags, &sdio->from);
 
 	if (ret < 0 && sdio->blocks_available && dio_op == REQ_OP_WRITE) {
 		struct page *page = ZERO_PAGE(0);
@@ -183,7 +186,7 @@  static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 		 */
 		if (dio->page_errors == 0)
 			dio->page_errors = ret;
-		get_page(page);
+		dio->cleanup_mode = 0;
 		dio->pages[0] = page;
 		sdio->head = 0;
 		sdio->tail = 1;
@@ -197,6 +200,8 @@  static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 		sdio->head = 0;
 		sdio->tail = (ret + PAGE_SIZE - 1) / PAGE_SIZE;
 		sdio->to = ((ret - 1) & (PAGE_SIZE - 1)) + 1;
+		dio->cleanup_mode =
+			iov_iter_extract_mode(sdio->iter, gup_flags);
 		return 0;
 	}
 	return ret;	
@@ -400,6 +405,10 @@  dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
 	 * we request a valid number of vectors.
 	 */
 	bio = bio_alloc(bdev, nr_vecs, dio->opf, GFP_KERNEL);
+	if (!(dio->cleanup_mode & FOLL_GET))
+		bio_clear_flag(bio, BIO_PAGE_REFFED);
+	if (dio->cleanup_mode & FOLL_PIN)
+		bio_set_flag(bio, BIO_PAGE_PINNED);
 	bio->bi_iter.bi_sector = first_sector;
 	if (dio->is_async)
 		bio->bi_end_io = dio_bio_end_aio;
@@ -443,13 +452,18 @@  static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
 	sdio->logical_offset_in_bio = 0;
 }
 
+static void dio_cleanup_page(struct dio *dio, struct page *page)
+{
+	page_put_unpin(page, dio->cleanup_mode);
+}
+
 /*
  * Release any resources in case of a failure
  */
 static inline void dio_cleanup(struct dio *dio, struct dio_submit *sdio)
 {
 	while (sdio->head < sdio->tail)
-		put_page(dio->pages[sdio->head++]);
+		dio_cleanup_page(dio, dio->pages[sdio->head++]);
 }
 
 /*
@@ -704,7 +718,7 @@  static inline int dio_new_bio(struct dio *dio, struct dio_submit *sdio,
  *
  * Return zero on success.  Non-zero means the caller needs to start a new BIO.
  */
-static inline int dio_bio_add_page(struct dio_submit *sdio)
+static inline int dio_bio_add_page(struct dio *dio, struct dio_submit *sdio)
 {
 	int ret;
 
@@ -771,11 +785,11 @@  static inline int dio_send_cur_page(struct dio *dio, struct dio_submit *sdio,
 			goto out;
 	}
 
-	if (dio_bio_add_page(sdio) != 0) {
+	if (dio_bio_add_page(dio, sdio) != 0) {
 		dio_bio_submit(dio, sdio);
 		ret = dio_new_bio(dio, sdio, sdio->cur_page_block, map_bh);
 		if (ret == 0) {
-			ret = dio_bio_add_page(sdio);
+			ret = dio_bio_add_page(dio, sdio);
 			BUG_ON(ret != 0);
 		}
 	}
@@ -832,13 +846,16 @@  submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
 	 */
 	if (sdio->cur_page) {
 		ret = dio_send_cur_page(dio, sdio, map_bh);
-		put_page(sdio->cur_page);
+		dio_cleanup_page(dio, sdio->cur_page);
 		sdio->cur_page = NULL;
 		if (ret)
 			return ret;
 	}
 
-	get_page(page);		/* It is in dio */
+	ret = try_grab_page(page, dio->cleanup_mode);		/* It is in dio */
+	if (ret < 0)
+		return ret;
+
 	sdio->cur_page = page;
 	sdio->cur_page_offset = offset;
 	sdio->cur_page_len = len;
@@ -853,7 +870,7 @@  submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
 		ret = dio_send_cur_page(dio, sdio, map_bh);
 		if (sdio->bio)
 			dio_bio_submit(dio, sdio);
-		put_page(sdio->cur_page);
+		dio_cleanup_page(dio, sdio->cur_page);
 		sdio->cur_page = NULL;
 	}
 	return ret;
@@ -954,7 +971,7 @@  static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 
 				ret = get_more_blocks(dio, sdio, map_bh);
 				if (ret) {
-					put_page(page);
+					dio_cleanup_page(dio, page);
 					goto out;
 				}
 				if (!buffer_mapped(map_bh))
@@ -999,7 +1016,7 @@  static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 
 				/* AKPM: eargh, -ENOTBLK is a hack */
 				if (dio_op == REQ_OP_WRITE) {
-					put_page(page);
+					dio_cleanup_page(dio, page);
 					return -ENOTBLK;
 				}
 
@@ -1012,7 +1029,7 @@  static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 				if (sdio->block_in_file >=
 						i_size_aligned >> blkbits) {
 					/* We hit eof */
-					put_page(page);
+					dio_cleanup_page(dio, page);
 					goto out;
 				}
 				zero_user(page, from, 1 << blkbits);
@@ -1052,7 +1069,7 @@  static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 						  sdio->next_block_for_io,
 						  map_bh);
 			if (ret) {
-				put_page(page);
+				dio_cleanup_page(dio, page);
 				goto out;
 			}
 			sdio->next_block_for_io += this_chunk_blocks;
@@ -1068,7 +1085,7 @@  static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 		}
 
 		/* Drop the ref which was taken in get_user_pages() */
-		put_page(page);
+		dio_cleanup_page(dio, page);
 	}
 out:
 	return ret;
@@ -1288,7 +1305,7 @@  ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 		ret2 = dio_send_cur_page(dio, &sdio, &map_bh);
 		if (retval == 0)
 			retval = ret2;
-		put_page(sdio.cur_page);
+		dio_cleanup_page(dio, sdio.cur_page);
 		sdio.cur_page = NULL;
 	}
 	if (sdio.bio)