diff mbox series

[v4,7/7] iov_iter, block: Make bio structs pin pages rather than ref'ing if appropriate

Message ID 167305166150.1521586.10220949115402059720.stgit@warthog.procyon.org.uk (mailing list archive)
State New, archived
Headers show
Series iov_iter: Add extraction helpers | expand

Commit Message

David Howells Jan. 7, 2023, 12:34 a.m. UTC
Convert the block layer's bio code to use iov_iter_extract_pages() instead
of iov_iter_get_pages().  This will pin pages or leave them unaltered
rather than getting a ref on them as appropriate to the source iterator.

A field, bi_cleanup_mode, is added to the bio struct that gets set by
iov_iter_extract_pages() with FOLL_* flags indicating what cleanup is
necessary.  FOLL_GET -> put_page(), FOLL_PIN -> unpin_user_page().  Other
flags could also be used in future.

Newly allocated bio structs have bi_cleanup_mode set to FOLL_GET to
indicate that attached pages are ref'd by default.  Cloning sets it to 0.
__bio_iov_iter_get_pages() overrides it to what iov_iter_extract_pages()
indicates.

[!] Note that this is tested a bit with ext4, but nothing else.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
---

 block/bio.c               |   47 +++++++++++++++++++++++++++++++++------------
 include/linux/blk_types.h |    1 +
 2 files changed, 35 insertions(+), 13 deletions(-)

Comments

Jens Axboe Jan. 9, 2023, 3:54 a.m. UTC | #1
On 1/6/23 5:34 PM, David Howells wrote:
> Convert the block layer's bio code to use iov_iter_extract_pages() instead
> of iov_iter_get_pages().  This will pin pages or leave them unaltered
> rather than getting a ref on them as appropriate to the source iterator.
> 
> A field, bi_cleanup_mode, is added to the bio struct that gets set by
> iov_iter_extract_pages() with FOLL_* flags indicating what cleanup is
> necessary.  FOLL_GET -> put_page(), FOLL_PIN -> unpin_user_page().  Other
> flags could also be used in future.
> 
> Newly allocated bio structs have bi_cleanup_mode set to FOLL_GET to
> indicate that attached pages are ref'd by default.  Cloning sets it to 0.
> __bio_iov_iter_get_pages() overrides it to what iov_iter_extract_pages()
> indicates.

What's the motivation for this change? It's growing struct bio, which we
can have a lot of in the system. I read the cover letter too and I can
tell what the change does, but there's no justification really for the
change.

So unless there's a good reason to do this, then that's a NAK in terms
of just the addition to struct bio alone.
David Howells Jan. 9, 2023, 9:43 a.m. UTC | #2
Jens Axboe <axboe@kernel.dk> wrote:

> > A field, bi_cleanup_mode, is added to the bio struct that gets set by
> > iov_iter_extract_pages() with FOLL_* flags indicating what cleanup is
> > necessary.  FOLL_GET -> put_page(), FOLL_PIN -> unpin_user_page().  Other
> > flags could also be used in future.
> > 
> > Newly allocated bio structs have bi_cleanup_mode set to FOLL_GET to
> > indicate that attached pages are ref'd by default.  Cloning sets it to 0.
> > __bio_iov_iter_get_pages() overrides it to what iov_iter_extract_pages()
> > indicates.
> 
> What's the motivation for this change?

DIO reads in most filesystems and, I think, the block layer are currently
broken with respect to concurrent fork in the same process because they take
refs (FOLL_GET) on the pages involved which causes the CoW mechanism to
malfunction, leading (I think) the parent process to not see the result of the
DIO.  IIRC, the pages undergoing DIO get forcibly copied by fork - and the
copies given to the parent.  Instead, DIO reads should be pinning the pages
(FOLL_PIN).  Maybe Willy can weigh in on this?

Further, getting refs on pages in, say, a KVEC iterator is the wrong thing to
do as the kvec may point to things that shouldn't be ref'd (vmap'd or
vmalloc'd regions, for example).  Instead, the in-kernel caller should do what
it needs to do to keep hold of the memory and the DIO should not take a ref at
all.

> It's growing struct bio, which we can have a lot of in the system. I read
> the cover letter too and I can tell what the change does, but there's no
> justification really for the change.

The FOLL_* flags I'm getting back from iov_iter_extract_pages() can be mapped
to BIO_* flags in the bio.  For the moment, AFAIK, I think only FOLL_GET and
FOLL_PIN are necessary.  There are three cleanup types: put, unpin and do
nothing.

David
Jan Kara Jan. 9, 2023, 5:25 p.m. UTC | #3
On Mon 09-01-23 09:43:24, David Howells wrote:
> Jens Axboe <axboe@kernel.dk> wrote:
> 
> > > A field, bi_cleanup_mode, is added to the bio struct that gets set by
> > > iov_iter_extract_pages() with FOLL_* flags indicating what cleanup is
> > > necessary.  FOLL_GET -> put_page(), FOLL_PIN -> unpin_user_page().  Other
> > > flags could also be used in future.
> > > 
> > > Newly allocated bio structs have bi_cleanup_mode set to FOLL_GET to
> > > indicate that attached pages are ref'd by default.  Cloning sets it to 0.
> > > __bio_iov_iter_get_pages() overrides it to what iov_iter_extract_pages()
> > > indicates.
> > 
> > What's the motivation for this change?
> 
> DIO reads in most filesystems and, I think, the block layer are currently
> broken with respect to concurrent fork in the same process because they take
> refs (FOLL_GET) on the pages involved which causes the CoW mechanism to
> malfunction, leading (I think) the parent process to not see the result of the
> DIO.  IIRC, the pages undergoing DIO get forcibly copied by fork - and the
> copies given to the parent.  Instead, DIO reads should be pinning the pages
> (FOLL_PIN).  Maybe Willy can weigh in on this?
> 
> Further, getting refs on pages in, say, a KVEC iterator is the wrong
> thing to do as the kvec may point to things that shouldn't be ref'd
> (vmap'd or vmalloc'd regions, for example).  Instead, the in-kernel
> caller should do what it needs to do to keep hold of the memory and the
> DIO should not take a ref at all.

Yes, plus there is also a problem if user sets up a DIO read into a buffer
backed by memory mapped file, then these mapped pages can be cleaned by
writeback while the DIO read is running causing checksum failures or
DIF/DIX failures. Also once the writeback is done, the filesystem currently
thinks it controls all paths modifying page data and thus can happily go on
deduplicating file blocks or do similar stuff although pages are
concurrently modified by DIO read possibly causing data corruption. See [1]
for more details why filesystems have a problem with this. So filesystems
really need DIO reads to use FOLL_PIN instead of FOLL_GET and consequently
we need to pass information to bio completion function how page references
should be dropped.

								Honza

[1] https://lore.kernel.org/all/20180103100430.GE4911@quack2.suse.cz/
Jens Axboe Jan. 9, 2023, 5:27 p.m. UTC | #4
On 1/9/23 2:43?AM, David Howells wrote:
> Jens Axboe <axboe@kernel.dk> wrote:
> 
>>> A field, bi_cleanup_mode, is added to the bio struct that gets set by
>>> iov_iter_extract_pages() with FOLL_* flags indicating what cleanup is
>>> necessary.  FOLL_GET -> put_page(), FOLL_PIN -> unpin_user_page().  Other
>>> flags could also be used in future.
>>>
>>> Newly allocated bio structs have bi_cleanup_mode set to FOLL_GET to
>>> indicate that attached pages are ref'd by default.  Cloning sets it to 0.
>>> __bio_iov_iter_get_pages() overrides it to what iov_iter_extract_pages()
>>> indicates.
>>
>> What's the motivation for this change?
> 
> DIO reads in most filesystems and, I think, the block layer are
> currently broken with respect to concurrent fork in the same process
> because they take refs (FOLL_GET) on the pages involved which causes
> the CoW mechanism to malfunction, leading (I think) the parent process
> to not see the result of the DIO.  IIRC, the pages undergoing DIO get
> forcibly copied by fork - and the copies given to the parent.
> Instead, DIO reads should be pinning the pages (FOLL_PIN).  Maybe
> Willy can weigh in on this?
> 
> Further, getting refs on pages in, say, a KVEC iterator is the wrong
> thing to do as the kvec may point to things that shouldn't be ref'd
> (vmap'd or vmalloc'd regions, for example).  Instead, the in-kernel
> caller should do what it needs to do to keep hold of the memory and
> the DIO should not take a ref at all.

Makes sense!

>> It's growing struct bio, which we can have a lot of in the system. I read
>> the cover letter too and I can tell what the change does, but there's no
>> justification really for the change.
> 
> The FOLL_* flags I'm getting back from iov_iter_extract_pages() can be mapped
> to BIO_* flags in the bio.  For the moment, AFAIK, I think only FOLL_GET and
> FOLL_PIN are necessary.  There are three cleanup types: put, unpin and do
> nothing.

That would work, or if they fit into a ushort, there is room that could
be utilized for that. My main knee jerk reaction is just plopping a full
into in there for 2 bits of state, really. I try pretty hard to not
succumb to struct bloat, particularly on the hot path and when we can
have lots of them. So that part needs to be done more carefully in v5.
Jan Kara Jan. 9, 2023, 5:35 p.m. UTC | #5
On Sat 07-01-23 00:34:21, David Howells wrote:
> Convert the block layer's bio code to use iov_iter_extract_pages() instead
> of iov_iter_get_pages().  This will pin pages or leave them unaltered
> rather than getting a ref on them as appropriate to the source iterator.
> 
> A field, bi_cleanup_mode, is added to the bio struct that gets set by
> iov_iter_extract_pages() with FOLL_* flags indicating what cleanup is
> necessary.  FOLL_GET -> put_page(), FOLL_PIN -> unpin_user_page().  Other
> flags could also be used in future.
> 
> Newly allocated bio structs have bi_cleanup_mode set to FOLL_GET to
> indicate that attached pages are ref'd by default.  Cloning sets it to 0.
> __bio_iov_iter_get_pages() overrides it to what iov_iter_extract_pages()
> indicates.
> 
> [!] Note that this is tested a bit with ext4, but nothing else.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Al Viro <viro@zeniv.linux.org.uk>
> cc: Jens Axboe <axboe@kernel.dk>
> cc: Christoph Hellwig <hch@lst.de>
> cc: Matthew Wilcox <willy@infradead.org>
> cc: Logan Gunthorpe <logang@deltatee.com>

So currently we already have BIO_NO_PAGE_REF flag and what you do in this
patch partially duplicates that. So either I'd drop that flag or instead of
bi_cleanup_mode variable (which honestly looks a bit wasteful given how we
microoptimize struct bio) just add another BIO_ flag...

								Honza

> ---
> 
>  block/bio.c               |   47 +++++++++++++++++++++++++++++++++------------
>  include/linux/blk_types.h |    1 +
>  2 files changed, 35 insertions(+), 13 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 5f96fcae3f75..eafcbeba0bab 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -243,6 +243,11 @@ static void bio_free(struct bio *bio)
>   * Users of this function have their own bio allocation. Subsequently,
>   * they must remember to pair any call to bio_init() with bio_uninit()
>   * when IO has completed, or when the bio is released.
> + *
> + * We set the initial assumption that pages attached to the bio will be
> + * released with put_page() by setting bi_cleanup_mode to FOLL_GET, but this
> + * should be set to FOLL_PIN if the page should be unpinned instead; if the
> + * pages should not be put or unpinned, this should be set to 0
>   */
>  void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
>  	      unsigned short max_vecs, blk_opf_t opf)
> @@ -274,6 +279,7 @@ void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
>  #ifdef CONFIG_BLK_DEV_INTEGRITY
>  	bio->bi_integrity = NULL;
>  #endif
> +	bio->bi_cleanup_mode = FOLL_GET;
>  	bio->bi_vcnt = 0;
>  
>  	atomic_set(&bio->__bi_remaining, 1);
> @@ -302,6 +308,7 @@ void bio_reset(struct bio *bio, struct block_device *bdev, blk_opf_t opf)
>  {
>  	bio_uninit(bio);
>  	memset(bio, 0, BIO_RESET_BYTES);
> +	bio->bi_cleanup_mode = FOLL_GET;
>  	atomic_set(&bio->__bi_remaining, 1);
>  	bio->bi_bdev = bdev;
>  	if (bio->bi_bdev)
> @@ -814,6 +821,7 @@ static int __bio_clone(struct bio *bio, struct bio *bio_src, gfp_t gfp)
>  	bio_set_flag(bio, BIO_CLONED);
>  	bio->bi_ioprio = bio_src->bi_ioprio;
>  	bio->bi_iter = bio_src->bi_iter;
> +	bio->bi_cleanup_mode = 0;
>  
>  	if (bio->bi_bdev) {
>  		if (bio->bi_bdev == bio_src->bi_bdev &&
> @@ -1168,6 +1176,18 @@ bool bio_add_folio(struct bio *bio, struct folio *folio, size_t len,
>  	return bio_add_page(bio, &folio->page, len, off) > 0;
>  }
>  
> +/*
> + * Clean up a page according to the mode indicated by iov_iter_extract_pages(),
> + * where the page is may be pinned or may have a ref taken on it.
> + */
> +static void bio_release_page(struct bio *bio, struct page *page)
> +{
> +	if (bio->bi_cleanup_mode & FOLL_PIN)
> +		unpin_user_page(page);
> +	if (bio->bi_cleanup_mode & FOLL_GET)
> +		put_page(page);
> +}
> +
>  void __bio_release_pages(struct bio *bio, bool mark_dirty)
>  {
>  	struct bvec_iter_all iter_all;
> @@ -1176,7 +1196,7 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
>  	bio_for_each_segment_all(bvec, bio, iter_all) {
>  		if (mark_dirty && !PageCompound(bvec->bv_page))
>  			set_page_dirty_lock(bvec->bv_page);
> -		put_page(bvec->bv_page);
> +		bio_release_page(bio, bvec->bv_page);
>  	}
>  }
>  EXPORT_SYMBOL_GPL(__bio_release_pages);
> @@ -1213,7 +1233,7 @@ static int bio_iov_add_page(struct bio *bio, struct page *page,
>  	}
>  
>  	if (same_page)
> -		put_page(page);
> +		bio_release_page(bio, page);
>  	return 0;
>  }
>  
> @@ -1227,7 +1247,7 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
>  			queue_max_zone_append_sectors(q), &same_page) != len)
>  		return -EINVAL;
>  	if (same_page)
> -		put_page(page);
> +		bio_release_page(bio, page);
>  	return 0;
>  }
>  
> @@ -1238,10 +1258,10 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
>   * @bio: bio to add pages to
>   * @iter: iov iterator describing the region to be mapped
>   *
> - * Pins pages from *iter and appends them to @bio's bvec array. The
> - * pages will have to be released using put_page() when done.
> - * For multi-segment *iter, this function only adds pages from the
> - * next non-empty segment of the iov iterator.
> + * Pins pages from *iter and appends them to @bio's bvec array.  The pages will
> + * have to be released using put_page() or unpin_user_page() when done as
> + * according to bi_cleanup_mode.  For multi-segment *iter, this function only
> + * adds pages from the next non-empty segment of the iov iterator.
>   */
>  static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  {
> @@ -1273,9 +1293,10 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  	 * result to ensure the bio's total size is correct. The remainder of
>  	 * the iov data will be picked up in the next bio iteration.
>  	 */
> -	size = iov_iter_get_pages(iter, pages,
> -				  UINT_MAX - bio->bi_iter.bi_size,
> -				  nr_pages, &offset, gup_flags);
> +	size = iov_iter_extract_pages(iter, &pages,
> +				      UINT_MAX - bio->bi_iter.bi_size,
> +				      nr_pages, gup_flags,
> +				      &offset, &bio->bi_cleanup_mode);
>  	if (unlikely(size <= 0))
>  		return size ? size : -EFAULT;
>  
> @@ -1308,7 +1329,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  	iov_iter_revert(iter, left);
>  out:
>  	while (i < nr_pages)
> -		put_page(pages[i++]);
> +		bio_release_page(bio, pages[i++]);
>  
>  	return ret;
>  }
> @@ -1489,8 +1510,8 @@ void bio_set_pages_dirty(struct bio *bio)
>   * the BIO and re-dirty the pages in process context.
>   *
>   * It is expected that bio_check_pages_dirty() will wholly own the BIO from
> - * here on.  It will run one put_page() against each page and will run one
> - * bio_put() against the BIO.
> + * here on.  It will run one put_page() or unpin_user_page() against each page
> + * and will run one bio_put() against the BIO.
>   */
>  
>  static void bio_dirty_fn(struct work_struct *work);
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 99be590f952f..883f873a01ef 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -289,6 +289,7 @@ struct bio {
>  #endif
>  	};
>  
> +	unsigned int		bi_cleanup_mode; /* How to clean up pages */
>  	unsigned short		bi_vcnt;	/* how many bio_vec's */
>  
>  	/*
> 
>
David Howells Jan. 9, 2023, 9:37 p.m. UTC | #6
Jan Kara <jack@suse.cz> wrote:

> So currently we already have BIO_NO_PAGE_REF flag and what you do in this
> patch partially duplicates that. So either I'd drop that flag or instead of
> bi_cleanup_mode variable (which honestly looks a bit wasteful given how we
> microoptimize struct bio) just add another BIO_ flag...

I'm fine with translating the FOLL_* flags to the BIO_* flags.  I could add a
BIO_PAGE_PINNED and translate:

	FOLL_GET => 0
	FOLL_PIN => BIO_PAGE_PINNED
	0	 => BIO_NO_PAGE_REF

It would seem that BIO_NO_PAGE_REF can't be set for BIO_PAGE_PINNED because
BIO_NO_PAGE_REF governs whether bio_release_pages() calls
__bio_release_pages() - which would be necessary.  However, bio_release_page()
can do one or the other on the basis of BIO_PAGE_PINNED being specified.  So
in my patch I would end up with:

	static void bio_release_page(struct bio *bio, struct page *page)
	{
		if (bio->bi_flags & BIO_NO_PAGE_REF)
			;
		else if (bio->bi_flags & BIO_PAGE_PINNED)
			unpin_user_page(page);
		else
			put_page(page);
	}

(This is called from four places, so it has to handle BIO_NO_PAGE_REF).

It might make sense flip the logic of BIO_NO_PAGE_REF so that we have, say:

	FOLL_GET => BIO_PAGE_REFFED
	FOLL_PIN => BIO_PAGE_PINNED
	0	 => 0

Set BIO_PAGE_REFFED by default and clear it in bio_iov_bvec_set().

Note that one reason I was thinking of saving the returned FOLL_* flags is
that I don't know if, at some point, the VM will acquire yet more different
cleanup modes - or even if a page could at some point be both ref'd *and*
pinned.

Also, I could change the interface to return something other than FOLL_* - it
just seems that they're appropriate given the underlying VM interface.

David
Jens Axboe Jan. 9, 2023, 9:57 p.m. UTC | #7
On 1/9/23 2:37?PM, David Howells wrote:
> Jan Kara <jack@suse.cz> wrote:
> 
>> So currently we already have BIO_NO_PAGE_REF flag and what you do in this
>> patch partially duplicates that. So either I'd drop that flag or instead of
>> bi_cleanup_mode variable (which honestly looks a bit wasteful given how we
>> microoptimize struct bio) just add another BIO_ flag...
> 
> I'm fine with translating the FOLL_* flags to the BIO_* flags.  I could add a
> BIO_PAGE_PINNED and translate:
> 
> 	FOLL_GET => 0
> 	FOLL_PIN => BIO_PAGE_PINNED
> 	0	 => BIO_NO_PAGE_REF
> 
> It would seem that BIO_NO_PAGE_REF can't be set for BIO_PAGE_PINNED because
> BIO_NO_PAGE_REF governs whether bio_release_pages() calls
> __bio_release_pages() - which would be necessary.  However, bio_release_page()
> can do one or the other on the basis of BIO_PAGE_PINNED being specified.  So
> in my patch I would end up with:
> 
> 	static void bio_release_page(struct bio *bio, struct page *page)
> 	{
> 		if (bio->bi_flags & BIO_NO_PAGE_REF)
> 			;
> 		else if (bio->bi_flags & BIO_PAGE_PINNED)
> 			unpin_user_page(page);
> 		else
> 			put_page(page);
> 	}

Let's please make this a bit more readable with:

static void bio_release_page(struct bio *bio, struct page *page)
{
	if (bio->bi_flags & BIO_NO_PAGE_REF)
		return;
	if (bio->bi_flags & BIO_PAGE_PINNED)
		unpin_user_page(page);
	else
		put_page(page);
}
David Howells Jan. 9, 2023, 10:24 p.m. UTC | #8
Would you be okay with me flipping the logic of BIO_NO_PAGE_REF, so I end up
with:

	static void bio_release_page(struct bio *bio, struct page *page)
	{
		if (bio_flagged(bio, BIO_PAGE_PINNED))
			unpin_user_page(page);
		if (bio_flagged(bio, BIO_PAGE_REFFED))
			put_page(page);
	}

See attached patch.

David
---
iov_iter, block: Make bio structs pin pages rather than ref'ing if appropriate

Convert the block layer's bio code to use iov_iter_extract_pages() instead
of iov_iter_get_pages().  This will pin pages or leave them unaltered
rather than getting a ref on them as appropriate to the source iterator.

The pages need to be pinned for DIO-read rather than having refs taken on
them to prevent VM copy-on-write from malfunctioning during a concurrent
fork() (the result of the I/O would otherwise end up only visible to the
child process and not the parent).

To implement this:

 (1) The BIO_NO_PAGE_REF flag is renamed to BIO_PAGE_REFFED and has it's
     logic inverted.  If set, this causes attached pages to be passed to
     put_page() during cleanup.

 (2) A BIO_PAGE_PINNED flag is provided as well.  If set, this causes
     attached pages to be passed to unpin_user_page() during cleanup.

 (3) BIO_PAGE_REFFED is set by default and BIO_PAGE_PINNED is cleared by
     default when the bio is (re-)initialised.

 (4) If iov_iter_extract_pages() indicates FOLL_GET, this causes
     BIO_PAGE_REFFED to be set and if FOLL_PIN is indicated, this causes
     BIO_PAGE_PINNED to be set.  If it returns neither FOLL_* flag, then
     both BIO_PAGE_* flags will be cleared.  Mixed sets are not supported.

 (5) Cloned bio structs have both flags cleared.

 (6) bio_release_pages() will do the release if either BIO_PAGE_* flag is
     set.

[!] Note that this is tested a bit with ext4, but nothing else.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
---
 block/bio.c               |   60 ++++++++++++++++++++++++++++++++++------------
 include/linux/bio.h       |    3 +-
 include/linux/blk_types.h |    3 +-
 3 files changed, 49 insertions(+), 17 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 5f96fcae3f75..5b9f9fc62345 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -243,6 +243,11 @@ static void bio_free(struct bio *bio)
  * Users of this function have their own bio allocation. Subsequently,
  * they must remember to pair any call to bio_init() with bio_uninit()
  * when IO has completed, or when the bio is released.
+ *
+ * We set the initial assumption that pages attached to the bio will be
+ * released with put_page() by setting BIO_PAGE_REFFED, but this should be set
+ * to BIO_PAGE_PINNED if the page should be unpinned instead; if the pages
+ * should not be put or unpinned, these flags should be cleared.
  */
 void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
 	      unsigned short max_vecs, blk_opf_t opf)
@@ -274,6 +279,7 @@ void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 	bio->bi_integrity = NULL;
 #endif
+	bio_set_flag(bio, BIO_PAGE_REFFED);
 	bio->bi_vcnt = 0;
 
 	atomic_set(&bio->__bi_remaining, 1);
@@ -302,6 +308,8 @@ void bio_reset(struct bio *bio, struct block_device *bdev, blk_opf_t opf)
 {
 	bio_uninit(bio);
 	memset(bio, 0, BIO_RESET_BYTES);
+	bio_set_flag(bio, BIO_PAGE_REFFED);
+	bio_clear_flag(bio, BIO_PAGE_PINNED);
 	atomic_set(&bio->__bi_remaining, 1);
 	bio->bi_bdev = bdev;
 	if (bio->bi_bdev)
@@ -814,6 +822,8 @@ static int __bio_clone(struct bio *bio, struct bio *bio_src, gfp_t gfp)
 	bio_set_flag(bio, BIO_CLONED);
 	bio->bi_ioprio = bio_src->bi_ioprio;
 	bio->bi_iter = bio_src->bi_iter;
+	bio_clear_flag(bio, BIO_PAGE_REFFED);
+	bio_clear_flag(bio, BIO_PAGE_PINNED);
 
 	if (bio->bi_bdev) {
 		if (bio->bi_bdev == bio_src->bi_bdev &&
@@ -1168,6 +1178,18 @@ bool bio_add_folio(struct bio *bio, struct folio *folio, size_t len,
 	return bio_add_page(bio, &folio->page, len, off) > 0;
 }
 
+/*
+ * Clean up a page according to the mode indicated by iov_iter_extract_pages(),
+ * where the page is may be pinned or may have a ref taken on it.
+ */
+static void bio_release_page(struct bio *bio, struct page *page)
+{
+	if (bio_flagged(bio, BIO_PAGE_PINNED))
+		unpin_user_page(page);
+	if (bio_flagged(bio, BIO_PAGE_REFFED))
+		put_page(page);
+}
+
 void __bio_release_pages(struct bio *bio, bool mark_dirty)
 {
 	struct bvec_iter_all iter_all;
@@ -1176,7 +1198,7 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
 	bio_for_each_segment_all(bvec, bio, iter_all) {
 		if (mark_dirty && !PageCompound(bvec->bv_page))
 			set_page_dirty_lock(bvec->bv_page);
-		put_page(bvec->bv_page);
+		bio_release_page(bio, bvec->bv_page);
 	}
 }
 EXPORT_SYMBOL_GPL(__bio_release_pages);
@@ -1198,7 +1220,7 @@ void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
 	bio->bi_io_vec = (struct bio_vec *)iter->bvec;
 	bio->bi_iter.bi_bvec_done = iter->iov_offset;
 	bio->bi_iter.bi_size = size;
-	bio_set_flag(bio, BIO_NO_PAGE_REF);
+	bio_clear_flag(bio, BIO_PAGE_REFFED);
 	bio_set_flag(bio, BIO_CLONED);
 }
 
@@ -1213,7 +1235,7 @@ static int bio_iov_add_page(struct bio *bio, struct page *page,
 	}
 
 	if (same_page)
-		put_page(page);
+		bio_release_page(bio, page);
 	return 0;
 }
 
@@ -1227,7 +1249,7 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
 			queue_max_zone_append_sectors(q), &same_page) != len)
 		return -EINVAL;
 	if (same_page)
-		put_page(page);
+		bio_release_page(bio, page);
 	return 0;
 }
 
@@ -1238,10 +1260,10 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
  * @bio: bio to add pages to
  * @iter: iov iterator describing the region to be mapped
  *
- * Pins pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
- * For multi-segment *iter, this function only adds pages from the
- * next non-empty segment of the iov iterator.
+ * Pins pages from *iter and appends them to @bio's bvec array.  The pages will
+ * have to be released using put_page() or unpin_user_page() when done as
+ * according to bi_cleanup_mode.  For multi-segment *iter, this function only
+ * adds pages from the next non-empty segment of the iov iterator.
  */
 static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
@@ -1249,7 +1271,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
 	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
 	struct page **pages = (struct page **)bv;
-	unsigned int gup_flags = 0;
+	unsigned int gup_flags = 0, cleanup_mode;
 	ssize_t size, left;
 	unsigned len, i = 0;
 	size_t offset, trim;
@@ -1273,12 +1295,20 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	 * result to ensure the bio's total size is correct. The remainder of
 	 * the iov data will be picked up in the next bio iteration.
 	 */
-	size = iov_iter_get_pages(iter, pages,
-				  UINT_MAX - bio->bi_iter.bi_size,
-				  nr_pages, &offset, gup_flags);
+	size = iov_iter_extract_pages(iter, &pages,
+				      UINT_MAX - bio->bi_iter.bi_size,
+				      nr_pages, gup_flags,
+				      &offset, &cleanup_mode);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
 
+	bio_clear_flag(bio, BIO_PAGE_REFFED);
+	bio_clear_flag(bio, BIO_PAGE_PINNED);
+	if (cleanup_mode & FOLL_GET)
+		bio_set_flag(bio, BIO_PAGE_REFFED);
+	if (cleanup_mode & FOLL_PIN)
+		bio_set_flag(bio, BIO_PAGE_PINNED);
+
 	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
 
 	trim = size & (bdev_logical_block_size(bio->bi_bdev) - 1);
@@ -1308,7 +1338,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	iov_iter_revert(iter, left);
 out:
 	while (i < nr_pages)
-		put_page(pages[i++]);
+		bio_release_page(bio, pages[i++]);
 
 	return ret;
 }
@@ -1489,8 +1519,8 @@ void bio_set_pages_dirty(struct bio *bio)
  * the BIO and re-dirty the pages in process context.
  *
  * It is expected that bio_check_pages_dirty() will wholly own the BIO from
- * here on.  It will run one put_page() against each page and will run one
- * bio_put() against the BIO.
+ * here on.  It will run one put_page() or unpin_user_page() against each page
+ * and will run one bio_put() against the BIO.
  */
 
 static void bio_dirty_fn(struct work_struct *work);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 22078a28d7cb..1c6f051f6ff2 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -482,7 +482,8 @@ void zero_fill_bio(struct bio *bio);
 
 static inline void bio_release_pages(struct bio *bio, bool mark_dirty)
 {
-	if (!bio_flagged(bio, BIO_NO_PAGE_REF))
+	if (bio_flagged(bio, BIO_PAGE_REFFED) ||
+	    bio_flagged(bio, BIO_PAGE_PINNED))
 		__bio_release_pages(bio, mark_dirty);
 }
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 99be590f952f..7a2d2b2cf3a0 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -318,7 +318,8 @@ struct bio {
  * bio flags
  */
 enum {
-	BIO_NO_PAGE_REF,	/* don't put release vec pages */
+	BIO_PAGE_REFFED,	/* Pages need refs putting (see FOLL_GET) */
+	BIO_PAGE_PINNED,	/* Pages need pins unpinning (see FOLL_PIN) */
 	BIO_CLONED,		/* doesn't own data */
 	BIO_BOUNCED,		/* bio is a bounce bio */
 	BIO_QUIET,		/* Make BIO Quiet */
Jens Axboe Jan. 9, 2023, 10:57 p.m. UTC | #9
On 1/9/23 3:24?PM, David Howells wrote:
> Would you be okay with me flipping the logic of BIO_NO_PAGE_REF, so I end up
> with:
> 
> 	static void bio_release_page(struct bio *bio, struct page *page)
> 	{
> 		if (bio_flagged(bio, BIO_PAGE_PINNED))
> 			unpin_user_page(page);
> 		if (bio_flagged(bio, BIO_PAGE_REFFED))
> 			put_page(page);
> 	}
> 
> See attached patch.

I think it makes more sense to have NO_REF check, to be honest, as that
means the general path doesn't have to set that flag. But I don't feel
too strongly about that part.

> diff --git a/block/bio.c b/block/bio.c
> index 5f96fcae3f75..5b9f9fc62345 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -243,6 +243,11 @@ static void bio_free(struct bio *bio)
>   * Users of this function have their own bio allocation. Subsequently,
>   * they must remember to pair any call to bio_init() with bio_uninit()
>   * when IO has completed, or when the bio is released.
> + *
> + * We set the initial assumption that pages attached to the bio will be
> + * released with put_page() by setting BIO_PAGE_REFFED, but this should be set
> + * to BIO_PAGE_PINNED if the page should be unpinned instead; if the pages
> + * should not be put or unpinned, these flags should be cleared.
>   */
>  void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
>  	      unsigned short max_vecs, blk_opf_t opf)
> @@ -274,6 +279,7 @@ void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
>  #ifdef CONFIG_BLK_DEV_INTEGRITY
>  	bio->bi_integrity = NULL;
>  #endif
> +	bio_set_flag(bio, BIO_PAGE_REFFED);

This is first set to zero, then you set the flag. Why not just
initialize it like that to begin with?

> @@ -302,6 +308,8 @@ void bio_reset(struct bio *bio, struct block_device *bdev, blk_opf_t opf)
>  {
>  	bio_uninit(bio);
>  	memset(bio, 0, BIO_RESET_BYTES);
> +	bio_set_flag(bio, BIO_PAGE_REFFED);
> +	bio_clear_flag(bio, BIO_PAGE_PINNED);
>  	atomic_set(&bio->__bi_remaining, 1);
>  	bio->bi_bdev = bdev;
>  	if (bio->bi_bdev)

You just memset bi_flags here, surely we don't need to clear
BIO_PAGE_PINNED after that?

> @@ -814,6 +822,8 @@ static int __bio_clone(struct bio *bio, struct bio *bio_src, gfp_t gfp)
>  	bio_set_flag(bio, BIO_CLONED);
>  	bio->bi_ioprio = bio_src->bi_ioprio;
>  	bio->bi_iter = bio_src->bi_iter;
> +	bio_clear_flag(bio, BIO_PAGE_REFFED);
> +	bio_clear_flag(bio, BIO_PAGE_PINNED);

Maybe it would make sense to have a set/clear mask operation? Not
strictly required for this patch, but probably worth checking after the
fact.

> @@ -1273,12 +1295,20 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  	 * result to ensure the bio's total size is correct. The remainder of
>  	 * the iov data will be picked up in the next bio iteration.
>  	 */
> -	size = iov_iter_get_pages(iter, pages,
> -				  UINT_MAX - bio->bi_iter.bi_size,
> -				  nr_pages, &offset, gup_flags);
> +	size = iov_iter_extract_pages(iter, &pages,
> +				      UINT_MAX - bio->bi_iter.bi_size,
> +				      nr_pages, gup_flags,
> +				      &offset, &cleanup_mode);
>  	if (unlikely(size <= 0))
>  		return size ? size : -EFAULT;
>  
> +	bio_clear_flag(bio, BIO_PAGE_REFFED);
> +	bio_clear_flag(bio, BIO_PAGE_PINNED);
> +	if (cleanup_mode & FOLL_GET)
> +		bio_set_flag(bio, BIO_PAGE_REFFED);
> +	if (cleanup_mode & FOLL_PIN)
> +		bio_set_flag(bio, BIO_PAGE_PINNED);
> +
>  	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);

The cleanup_mode pass-back isn't the prettiest thing in the world, and
that's a lot of arguments. Maybe it'd be slightly better if we just have
gup_flags be an output parameter too?

Also not great to first clear both flags, then set them with the two
added branches...

> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 22078a28d7cb..1c6f051f6ff2 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -482,7 +482,8 @@ void zero_fill_bio(struct bio *bio);
>  
>  static inline void bio_release_pages(struct bio *bio, bool mark_dirty)
>  {
> -	if (!bio_flagged(bio, BIO_NO_PAGE_REF))
> +	if (bio_flagged(bio, BIO_PAGE_REFFED) ||
> +	    bio_flagged(bio, BIO_PAGE_PINNED))
>  		__bio_release_pages(bio, mark_dirty);
>  }

Same here on a mask check, but perhaps it ends up generating the same
code?
David Howells Jan. 10, 2023, 2:37 p.m. UTC | #10
Jens Axboe <axboe@kernel.dk> wrote:

> I think it makes more sense to have NO_REF check, to be honest, as that
> means the general path doesn't have to set that flag. But I don't feel
> too strongly about that part.

It's just that the logic seems weird with BIO_NO_PAGE_REF and BIO_PAGE_PINNED
being kind of opposite polarity.

Anyway, see attached.

David
---
iov_iter, block: Make bio structs pin pages rather than ref'ing if appropriate

Convert the block layer's bio code to use iov_iter_extract_pages() instead
of iov_iter_get_pages().  This will pin pages or leave them unaltered
rather than getting a ref on them as appropriate to the source iterator.

The pages need to be pinned for DIO-read rather than having refs taken on
them to prevent VM copy-on-write from malfunctioning during a concurrent
fork() (the result of the I/O would otherwise end up only visible to the
child process and not the parent).

To implement this:

 (1) The BIO_NO_PAGE_REF flag, if unset, indicates the page needs putting
     or unpinning.

 (2) A BIO_PAGE_PINNED flag is added.  If set, this causes attached pages
     to be passed to unpin_user_page() during cleanup instead of
     put_page().

 (3) BIO_NO_PAGE_REF and BIO_PAGE_PINNED are both cleared by default when
     the bio is (re-)initialised.

 (4) If iov_iter_extract_pages() indicates FOLL_PIN, then BIO_PAGE_PINNED
     is set; if it indicates 0, BIO_NO_PAGE_REF is set; and if it indicates
     FOLL_GET, then neither flag is set.  If it indicates anything else, a
     WARN_ON_ONCE will be triggered and BIO_NO_PAGE_REF will be set.

     Mixed sets are not supported - all the pages must be handled in the
     same way.

 (5) Cloned bio structs have BIO_NO_PAGE_REF as they don't own their own
     pages.

 (6) bio_release_pages() will do the release if BIO_NO_PAGE_REF flag is
     not set.

[!] Note that this is tested a bit with ext4, but nothing else.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
---
 block/bio.c               |   66 ++++++++++++++++++++++++++++++++++++----------
 include/linux/blk_types.h |    1 
 2 files changed, 53 insertions(+), 14 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 5f96fcae3f75..88dfa0e34e81 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -243,6 +243,12 @@ static void bio_free(struct bio *bio)
  * Users of this function have their own bio allocation. Subsequently,
  * they must remember to pair any call to bio_init() with bio_uninit()
  * when IO has completed, or when the bio is released.
+ *
+ * We set the initial assumption that pages attached to the bio will be
+ * released with put_page() by setting neither BIO_NO_PAGE_REF nor
+ * BIO_PAGE_PINNED; BIO_PAGE_PINNED should be set if the page should be
+ * unpinned instead and BIO_NO_PAGE_REF should be set if the pages should not
+ * be put or unpinned.
  */
 void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
 	      unsigned short max_vecs, blk_opf_t opf)
@@ -814,6 +820,8 @@ static int __bio_clone(struct bio *bio, struct bio *bio_src, gfp_t gfp)
 	bio_set_flag(bio, BIO_CLONED);
 	bio->bi_ioprio = bio_src->bi_ioprio;
 	bio->bi_iter = bio_src->bi_iter;
+	bio_set_flag(bio, BIO_NO_PAGE_REF);
+	bio_clear_flag(bio, BIO_PAGE_PINNED);
 
 	if (bio->bi_bdev) {
 		if (bio->bi_bdev == bio_src->bi_bdev &&
@@ -1168,6 +1176,20 @@ bool bio_add_folio(struct bio *bio, struct folio *folio, size_t len,
 	return bio_add_page(bio, &folio->page, len, off) > 0;
 }
 
+/*
+ * Clean up a page according to the mode indicated by iov_iter_extract_pages(),
+ * where the page is may be pinned or may have a ref taken on it.
+ */
+static void bio_release_page(struct bio *bio, struct page *page)
+{
+	if (bio_flagged(bio, BIO_NO_PAGE_REF))
+		return;
+	if (bio_flagged(bio, BIO_PAGE_PINNED))
+		unpin_user_page(page);
+	else
+		put_page(page);
+}
+
 void __bio_release_pages(struct bio *bio, bool mark_dirty)
 {
 	struct bvec_iter_all iter_all;
@@ -1176,7 +1198,7 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
 	bio_for_each_segment_all(bvec, bio, iter_all) {
 		if (mark_dirty && !PageCompound(bvec->bv_page))
 			set_page_dirty_lock(bvec->bv_page);
-		put_page(bvec->bv_page);
+		bio_release_page(bio, bvec->bv_page);
 	}
 }
 EXPORT_SYMBOL_GPL(__bio_release_pages);
@@ -1213,7 +1235,7 @@ static int bio_iov_add_page(struct bio *bio, struct page *page,
 	}
 
 	if (same_page)
-		put_page(page);
+		bio_release_page(bio, page);
 	return 0;
 }
 
@@ -1227,7 +1249,7 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
 			queue_max_zone_append_sectors(q), &same_page) != len)
 		return -EINVAL;
 	if (same_page)
-		put_page(page);
+		bio_release_page(bio, page);
 	return 0;
 }
 
@@ -1238,10 +1260,11 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
  * @bio: bio to add pages to
  * @iter: iov iterator describing the region to be mapped
  *
- * Pins pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
- * For multi-segment *iter, this function only adds pages from the
- * next non-empty segment of the iov iterator.
+ * Pins pages from *iter and appends them to @bio's bvec array.  The pages will
+ * have to be released using put_page() or unpin_user_page() when done as
+ * according to BIO_NO_PAGE_REF and BIO_PAGE_PINNED.  For multi-segment *iter,
+ * this function only adds pages from the next non-empty segment of the iov
+ * iterator.
  */
 static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
@@ -1249,7 +1272,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
 	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
 	struct page **pages = (struct page **)bv;
-	unsigned int gup_flags = 0;
+	unsigned int gup_flags = 0, cleanup_mode;
 	ssize_t size, left;
 	unsigned len, i = 0;
 	size_t offset, trim;
@@ -1273,12 +1296,27 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	 * result to ensure the bio's total size is correct. The remainder of
 	 * the iov data will be picked up in the next bio iteration.
 	 */
-	size = iov_iter_get_pages(iter, pages,
-				  UINT_MAX - bio->bi_iter.bi_size,
-				  nr_pages, &offset, gup_flags);
+	size = iov_iter_extract_pages(iter, &pages,
+				      UINT_MAX - bio->bi_iter.bi_size,
+				      nr_pages, gup_flags,
+				      &offset, &cleanup_mode);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
 
+	switch (cleanup_mode) {
+	case FOLL_GET:
+		break;
+	case FOLL_PIN:
+		bio_set_flag(bio, BIO_PAGE_PINNED);
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		fallthrough;
+	case 0:
+		bio_set_flag(bio, BIO_NO_PAGE_REF);
+		break;
+	}
+
 	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
 
 	trim = size & (bdev_logical_block_size(bio->bi_bdev) - 1);
@@ -1308,7 +1346,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	iov_iter_revert(iter, left);
 out:
 	while (i < nr_pages)
-		put_page(pages[i++]);
+		bio_release_page(bio, pages[i++]);
 
 	return ret;
 }
@@ -1489,8 +1527,8 @@ void bio_set_pages_dirty(struct bio *bio)
  * the BIO and re-dirty the pages in process context.
  *
  * It is expected that bio_check_pages_dirty() will wholly own the BIO from
- * here on.  It will run one put_page() against each page and will run one
- * bio_put() against the BIO.
+ * here on.  It will run one put_page() or unpin_user_page() against each page
+ * and will run one bio_put() against the BIO.
  */
 
 static void bio_dirty_fn(struct work_struct *work);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 99be590f952f..38e22a27d029 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -319,6 +319,7 @@ struct bio {
  */
 enum {
 	BIO_NO_PAGE_REF,	/* don't put release vec pages */
+	BIO_PAGE_PINNED,	/* Pages need unpinning rather than putting */
 	BIO_CLONED,		/* doesn't own data */
 	BIO_BOUNCED,		/* bio is a bounce bio */
 	BIO_QUIET,		/* Make BIO Quiet */
David Howells Jan. 10, 2023, 2:42 p.m. UTC | #11
Jan Kara <jack@suse.cz> wrote:

> ... So filesystems really need DIO reads to use FOLL_PIN instead of FOLL_GET
> and consequently we need to pass information to bio completion function how
> page references should be dropped.

That information would be available in the bio struct with this patch if
necessary, though transcribed into a combination of BIO_* flags instead off
FOLL_* flags.

I wonder if there's the possibility of the filesystem that generated the bio
nicking the pages out of the bio and putting them itself.

David
Jens Axboe Jan. 10, 2023, 9:41 p.m. UTC | #12
On 1/10/23 7:37 AM, David Howells wrote:
> Jens Axboe <axboe@kernel.dk> wrote:
> 
>> I think it makes more sense to have NO_REF check, to be honest, as that
>> means the general path doesn't have to set that flag. But I don't feel
>> too strongly about that part.
> 
> It's just that the logic seems weird with BIO_NO_PAGE_REF and BIO_PAGE_PINNED
> being kind of opposite polarity.
> 
> Anyway, see attached.

Yeah, I guess I'll have to agree with you. So let's go with that approach
instead, but then please:

1) Change that flag as a prep patch so you don't mix up the two
2) Incorporate the feedback from the previous patch

Any chance we can get that cleanup_mode thing cleaned up as well?
Jan Kara Jan. 11, 2023, 9:58 a.m. UTC | #13
On Tue 10-01-23 14:42:04, David Howells wrote:
> Jan Kara <jack@suse.cz> wrote:
> 
> > ... So filesystems really need DIO reads to use FOLL_PIN instead of FOLL_GET
> > and consequently we need to pass information to bio completion function how
> > page references should be dropped.
> 
> That information would be available in the bio struct with this patch if
> necessary, though transcribed into a combination of BIO_* flags instead off
> FOLL_* flags.
> 
> I wonder if there's the possibility of the filesystem that generated the bio
> nicking the pages out of the bio and putting them itself.

I just meant to say that some addition struct bio is needed because your
bio_release_page() needs to find out how to release page ref. Filesystem
itself does not care about type of page reference in this path so what you
do in the latest version of this patch looks good to me.

								Honza
diff mbox series

Patch

diff --git a/block/bio.c b/block/bio.c
index 5f96fcae3f75..eafcbeba0bab 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -243,6 +243,11 @@  static void bio_free(struct bio *bio)
  * Users of this function have their own bio allocation. Subsequently,
  * they must remember to pair any call to bio_init() with bio_uninit()
  * when IO has completed, or when the bio is released.
+ *
+ * We set the initial assumption that pages attached to the bio will be
+ * released with put_page() by setting bi_cleanup_mode to FOLL_GET, but this
+ * should be set to FOLL_PIN if the page should be unpinned instead; if the
+ * pages should not be put or unpinned, this should be set to 0
  */
 void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
 	      unsigned short max_vecs, blk_opf_t opf)
@@ -274,6 +279,7 @@  void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 	bio->bi_integrity = NULL;
 #endif
+	bio->bi_cleanup_mode = FOLL_GET;
 	bio->bi_vcnt = 0;
 
 	atomic_set(&bio->__bi_remaining, 1);
@@ -302,6 +308,7 @@  void bio_reset(struct bio *bio, struct block_device *bdev, blk_opf_t opf)
 {
 	bio_uninit(bio);
 	memset(bio, 0, BIO_RESET_BYTES);
+	bio->bi_cleanup_mode = FOLL_GET;
 	atomic_set(&bio->__bi_remaining, 1);
 	bio->bi_bdev = bdev;
 	if (bio->bi_bdev)
@@ -814,6 +821,7 @@  static int __bio_clone(struct bio *bio, struct bio *bio_src, gfp_t gfp)
 	bio_set_flag(bio, BIO_CLONED);
 	bio->bi_ioprio = bio_src->bi_ioprio;
 	bio->bi_iter = bio_src->bi_iter;
+	bio->bi_cleanup_mode = 0;
 
 	if (bio->bi_bdev) {
 		if (bio->bi_bdev == bio_src->bi_bdev &&
@@ -1168,6 +1176,18 @@  bool bio_add_folio(struct bio *bio, struct folio *folio, size_t len,
 	return bio_add_page(bio, &folio->page, len, off) > 0;
 }
 
+/*
+ * Clean up a page according to the mode indicated by iov_iter_extract_pages(),
+ * where the page is may be pinned or may have a ref taken on it.
+ */
+static void bio_release_page(struct bio *bio, struct page *page)
+{
+	if (bio->bi_cleanup_mode & FOLL_PIN)
+		unpin_user_page(page);
+	if (bio->bi_cleanup_mode & FOLL_GET)
+		put_page(page);
+}
+
 void __bio_release_pages(struct bio *bio, bool mark_dirty)
 {
 	struct bvec_iter_all iter_all;
@@ -1176,7 +1196,7 @@  void __bio_release_pages(struct bio *bio, bool mark_dirty)
 	bio_for_each_segment_all(bvec, bio, iter_all) {
 		if (mark_dirty && !PageCompound(bvec->bv_page))
 			set_page_dirty_lock(bvec->bv_page);
-		put_page(bvec->bv_page);
+		bio_release_page(bio, bvec->bv_page);
 	}
 }
 EXPORT_SYMBOL_GPL(__bio_release_pages);
@@ -1213,7 +1233,7 @@  static int bio_iov_add_page(struct bio *bio, struct page *page,
 	}
 
 	if (same_page)
-		put_page(page);
+		bio_release_page(bio, page);
 	return 0;
 }
 
@@ -1227,7 +1247,7 @@  static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
 			queue_max_zone_append_sectors(q), &same_page) != len)
 		return -EINVAL;
 	if (same_page)
-		put_page(page);
+		bio_release_page(bio, page);
 	return 0;
 }
 
@@ -1238,10 +1258,10 @@  static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
  * @bio: bio to add pages to
  * @iter: iov iterator describing the region to be mapped
  *
- * Pins pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
- * For multi-segment *iter, this function only adds pages from the
- * next non-empty segment of the iov iterator.
+ * Pins pages from *iter and appends them to @bio's bvec array.  The pages will
+ * have to be released using put_page() or unpin_user_page() when done as
+ * according to bi_cleanup_mode.  For multi-segment *iter, this function only
+ * adds pages from the next non-empty segment of the iov iterator.
  */
 static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
@@ -1273,9 +1293,10 @@  static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	 * result to ensure the bio's total size is correct. The remainder of
 	 * the iov data will be picked up in the next bio iteration.
 	 */
-	size = iov_iter_get_pages(iter, pages,
-				  UINT_MAX - bio->bi_iter.bi_size,
-				  nr_pages, &offset, gup_flags);
+	size = iov_iter_extract_pages(iter, &pages,
+				      UINT_MAX - bio->bi_iter.bi_size,
+				      nr_pages, gup_flags,
+				      &offset, &bio->bi_cleanup_mode);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
 
@@ -1308,7 +1329,7 @@  static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	iov_iter_revert(iter, left);
 out:
 	while (i < nr_pages)
-		put_page(pages[i++]);
+		bio_release_page(bio, pages[i++]);
 
 	return ret;
 }
@@ -1489,8 +1510,8 @@  void bio_set_pages_dirty(struct bio *bio)
  * the BIO and re-dirty the pages in process context.
  *
  * It is expected that bio_check_pages_dirty() will wholly own the BIO from
- * here on.  It will run one put_page() against each page and will run one
- * bio_put() against the BIO.
+ * here on.  It will run one put_page() or unpin_user_page() against each page
+ * and will run one bio_put() against the BIO.
  */
 
 static void bio_dirty_fn(struct work_struct *work);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 99be590f952f..883f873a01ef 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -289,6 +289,7 @@  struct bio {
 #endif
 	};
 
+	unsigned int		bi_cleanup_mode; /* How to clean up pages */
 	unsigned short		bi_vcnt;	/* how many bio_vec's */
 
 	/*