
[v2] block : add larger order folio size instead of pages

Message ID 20240430175014.8276-1-kundan.kumar@samsung.com (mailing list archive)
State New, archived

Commit Message

Kundan Kumar April 30, 2024, 5:50 p.m. UTC
When mTHP is enabled, IO can contain larger folios instead of pages.
In such cases add a larger size to the bio instead of looping through
pages. This reduces the overhead of iterating through pages for larger
block sizes. perf diff before and after this change:

Perf diff for write I/O with 128K block size:
	1.26%     -1.04%  [kernel.kallsyms]  [k] bio_iov_iter_get_pages
	1.74%             [kernel.kallsyms]  [k] bvec_try_merge_page
Perf diff for read I/O with 128K block size:
	4.40%     -3.63%  [kernel.kallsyms]  [k] bio_iov_iter_get_pages
	5.60%             [kernel.kallsyms]  [k] bvec_try_merge_page

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
---
Changes since v1:
https://lore.kernel.org/all/20240419091721.1790-1-kundan.kumar@samsung.com/
- Changed functions bio_iov_add_page() and bio_iov_add_zone_append_page() to
  accept a folio
- Removed branching and now calculate folio_offset and len in the same fashion
  for both order-0 and larger folios
- Added change in NVMe driver to use nvme_setup_prp_simple() by ignoring
  multiples of NVME_CTRL_PAGE_SIZE in offset
- Added a change to unpin_user_pages() for pages that were added as folios,
  and stopped unpinning pages one by one from __bio_release_pages()
  (suggested by Keith)

 block/bio.c             | 48 +++++++++++++++++++++++++----------------
 drivers/nvme/host/pci.c |  3 ++-
 2 files changed, 31 insertions(+), 20 deletions(-)

Comments

Christoph Hellwig May 2, 2024, 5:35 a.m. UTC | #1
> - Changed functions bio_iov_add_page() and bio_iov_add_zone_append_page() to
>   accept a folio

Those should be separate prep patches.

> - Added change in NVMe driver to use nvme_setup_prp_simple() by ignoring
>   multiples of NVME_CTRL_PAGE_SIZE in offset

This should also be a prep patch.

> - Added a change to unpin_user_pages() for pages that were added as folios,
>   and stopped unpinning pages one by one from __bio_release_pages()
>   (suggested by Keith)

and this as well.

> @@ -1289,16 +1285,30 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  
>  	for (left = size, i = 0; left > 0; left -= len, i++) {
>  		struct page *page = pages[i];
> +		folio = page_folio(page);

Please keep an empty line after declarations.  But I think you can also
just move the folio declaration here and combine the lines, i.e.

		struct page *page = pages[i];
		struct folio *folio = page_folio(page);

		...

> +		/* See the offset in folio and the size */
> +		folio_offset = (folio_page_idx(folio, page)
> +				<< PAGE_SHIFT) + offset;

Kernel coding style keeps the operators on the previous line, i.e.

		folio_offset = (folio_page_idx(folio, page) << PAGE_SHIFT) +
				offset;

> +		size_folio = folio_size(folio);
> +
> +		/* Calculate the length of folio to be added */
> +		len = min_t(size_t, (size_folio - folio_offset), left);

size_folio is only used in this expression, so we can simplify the code
by just removing the variable:

		/* Calculate how much of the folio we're going to add: */
		len = min_t(size_t, folio_size(folio) - folio_offset, left);

> +		/* Skip the pages which got added */
> +		if (bio_flagged(bio, BIO_PAGE_PINNED) && num_pages > 1)
> +			unpin_user_pages(pages + i, num_pages - 1);

The comment doesn't sound quite correct to me: we're not really skipping
the pages, but we are dropping the extra references early here.
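
Maybe something like this (same code, just a suggested wording):

		/* drop the extra pins on the pages this folio covers */
		if (bio_flagged(bio, BIO_PAGE_PINNED) && num_pages > 1)
			unpin_user_pages(pages + i, num_pages - 1);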

>  		if (!is_pci_p2pdma_page(bv.bv_page)) {
> -			if (bv.bv_offset + bv.bv_len <= NVME_CTRL_PAGE_SIZE * 2)
> +			if ((bv.bv_offset & (NVME_CTRL_PAGE_SIZE - 1))
> +				+ bv.bv_len <= NVME_CTRL_PAGE_SIZE * 2)

Same comment about operator placement as above.
Hannes Reinecke May 2, 2024, 6:45 a.m. UTC | #2
On 4/30/24 19:50, Kundan Kumar wrote:
> When mTHP is enabled, IO can contain larger folios instead of pages.
> In such cases add a larger size to the bio instead of looping through
> pages. This reduces the overhead of iterating through pages for larger
> block sizes. perf diff before and after this change:
> 
> Perf diff for write I/O with 128K block size:
> 	1.26%     -1.04%  [kernel.kallsyms]  [k] bio_iov_iter_get_pages
> 	1.74%             [kernel.kallsyms]  [k] bvec_try_merge_page
> Perf diff for read I/O with 128K block size:
> 	4.40%     -3.63%  [kernel.kallsyms]  [k] bio_iov_iter_get_pages
> 	5.60%             [kernel.kallsyms]  [k] bvec_try_merge_page
> 
> Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
> ---
> Changes since v1:
> https://lore.kernel.org/all/20240419091721.1790-1-kundan.kumar@samsung.com/
> - Changed functions bio_iov_add_page() and bio_iov_add_zone_append_page() to
>    accept a folio
> - Removed branching and now calculate folio_offset and len in the same
>    fashion for both order-0 and larger folios
> - Added change in NVMe driver to use nvme_setup_prp_simple() by ignoring
>    multiples of NVME_CTRL_PAGE_SIZE in offset
> - Added a change to unpin_user_pages() for pages that were added as folios,
>    and stopped unpinning pages one by one from __bio_release_pages()
>    (suggested by Keith)
> 
>   block/bio.c             | 48 +++++++++++++++++++++++++----------------
>   drivers/nvme/host/pci.c |  3 ++-
>   2 files changed, 31 insertions(+), 20 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 38baedb39c6f..0ec453ad15b3 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1155,7 +1155,6 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
>   
>   	bio_for_each_folio_all(fi, bio) {
>   		struct page *page;
> -		size_t nr_pages;
>   
>   		if (mark_dirty) {
>   			folio_lock(fi.folio);
> @@ -1163,11 +1162,7 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
>   			folio_unlock(fi.folio);
>   		}
>   		page = folio_page(fi.folio, fi.offset / PAGE_SIZE);
> -		nr_pages = (fi.offset + fi.length - 1) / PAGE_SIZE -
> -			   fi.offset / PAGE_SIZE + 1;
> -		do {
> -			bio_release_page(bio, page++);
> -		} while (--nr_pages != 0);
> +		bio_release_page(bio, page);

Errm. I guess you need to call 'folio_put()' here, otherwise the page 
reference counting will be messed up.

>   	}
>   }
>   EXPORT_SYMBOL_GPL(__bio_release_pages);
> @@ -1192,7 +1187,7 @@ void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
>   	bio_set_flag(bio, BIO_CLONED);
>   }
>   
> -static int bio_iov_add_page(struct bio *bio, struct page *page,
> +static int bio_iov_add_folio(struct bio *bio, struct folio *folio,
>   		unsigned int len, unsigned int offset)
>   {
>   	bool same_page = false;
> @@ -1202,27 +1197,26 @@ static int bio_iov_add_page(struct bio *bio, struct page *page,
>   
>   	if (bio->bi_vcnt > 0 &&
>   	    bvec_try_merge_page(&bio->bi_io_vec[bio->bi_vcnt - 1],
> -				page, len, offset, &same_page)) {
> +				&folio->page, len, offset, &same_page)) {
>   		bio->bi_iter.bi_size += len;
>   		if (same_page)
> -			bio_release_page(bio, page);
> +			bio_release_page(bio, &folio->page);
>   		return 0;
>   	}
> -	__bio_add_page(bio, page, len, offset);
> +	bio_add_folio_nofail(bio, folio, len, offset);
>   	return 0;
>   }

That is not a valid conversion.
bvec_try_to_merge_pages() will try to merge a page with an existing
bvec. If the logic is switched to folios you would need to iterate over
all pages in a folio, and call bvec_try_to_merge_pages() for each page
in the folio.
Or convert / add a function 'bvec_try_to_merge_folio()'.
But with the above patch it will only ever try to merge the first page
in the folio.
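
(For illustration, a hypothetical helper along those lines -- not an
existing kernel function -- could be a thin wrapper that resolves the
page and per-page offset inside the folio before delegating:)

	static bool bvec_try_merge_folio(struct bio_vec *bv, struct folio *folio,
			unsigned int len, unsigned int off, bool *same_page)
	{
		/* resolve which page of the folio 'off' lands in */
		return bvec_try_merge_page(bv, folio_page(folio, off / PAGE_SIZE),
				len, off % PAGE_SIZE, same_page);
	}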

>   
> -static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
> +static int bio_iov_add_zone_append_folio(struct bio *bio, struct folio *folio,
>   		unsigned int len, unsigned int offset)
>   {
>   	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>   	bool same_page = false;
> -
> -	if (bio_add_hw_page(q, bio, page, len, offset,
> +	if (bio_add_hw_page(q, bio, &folio->page, len, offset,
>   			queue_max_zone_append_sectors(q), &same_page) != len)

Wouldn't this just try to add the first page in a folio?

>   		return -EINVAL;
>   	if (same_page)
> -		bio_release_page(bio, page);
> +		bio_release_page(bio, &folio->page);
>   	return 0;
>   }
>   
> @@ -1247,8 +1241,10 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>   	struct page **pages = (struct page **)bv;
>   	ssize_t size, left;
>   	unsigned len, i = 0;
> -	size_t offset;
> +	size_t offset, folio_offset, size_folio;
>   	int ret = 0;
> +	int num_pages;
> +	struct folio *folio;
>   
>   	/*
>   	 * Move page array up in the allocated memory for the bio vecs as far as
> @@ -1289,16 +1285,30 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>   
>   	for (left = size, i = 0; left > 0; left -= len, i++) {
>   		struct page *page = pages[i];
> +		folio = page_folio(page);
> +
> +		/* See the offset in folio and the size */
> +		folio_offset = (folio_page_idx(folio, page)
> +				<< PAGE_SHIFT) + offset;
> +		size_folio = folio_size(folio);
> +
> +		/* Calculate the length of folio to be added */
> +		len = min_t(size_t, (size_folio - folio_offset), left);
> +
> +		num_pages = DIV_ROUND_UP(offset + len, PAGE_SIZE);
>   
> -		len = min_t(size_t, PAGE_SIZE - offset, left);
>   		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> -			ret = bio_iov_add_zone_append_page(bio, page, len,
> -					offset);
> +			ret = bio_iov_add_zone_append_folio(bio, folio, len,
> +					folio_offset);
>   			if (ret)
>   				break;
>   		} else
> -			bio_iov_add_page(bio, page, len, offset);
> +			bio_iov_add_folio(bio, folio, len, folio_offset);
>   
> +		/* Skip the pages which got added */
> +		if (bio_flagged(bio, BIO_PAGE_PINNED) && num_pages > 1)
> +			unpin_user_pages(pages + i, num_pages - 1);
> +		i = i + (num_pages - 1);
>   		offset = 0;
>   	}
>   
I would rather use folio as an argument here ...

> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 8e0bb9692685..7c07b0582cae 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -778,7 +778,8 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
>   		struct bio_vec bv = req_bvec(req);
>   
>   		if (!is_pci_p2pdma_page(bv.bv_page)) {
> -			if (bv.bv_offset + bv.bv_len <= NVME_CTRL_PAGE_SIZE * 2)
> +			if ((bv.bv_offset & (NVME_CTRL_PAGE_SIZE - 1))
> +				+ bv.bv_len <= NVME_CTRL_PAGE_SIZE * 2)
>   				return nvme_setup_prp_simple(dev, req,
>   							     &cmnd->rw, &bv);
>   

In general I'm in favour of moving the block layer over to folios, but 
then we should do it by adding a folio-based version of the functions,
and then gradually convert the callers over to those functions.

Trying to slide it in via folio->page or page_folio() things is not
the way to do it as they are not equivalent transformations.

Cheers,

Hannes
Kundan Kumar May 2, 2024, 11:52 a.m. UTC | #3
On 02/05/24 08:45AM, Hannes Reinecke wrote:
>On 4/30/24 19:50, Kundan Kumar wrote:
>>When mTHP is enabled, IO can contain larger folios instead of pages.
>>In such cases add a larger size to the bio instead of looping through
>>pages. This reduces the overhead of iterating through pages for larger
>>block sizes. perf diff before and after this change:
>>
>>Perf diff for write I/O with 128K block size:
>>	1.26%     -1.04%  [kernel.kallsyms]  [k] bio_iov_iter_get_pages
>>	1.74%             [kernel.kallsyms]  [k] bvec_try_merge_page
>>Perf diff for read I/O with 128K block size:
>>	4.40%     -3.63%  [kernel.kallsyms]  [k] bio_iov_iter_get_pages
>>	5.60%             [kernel.kallsyms]  [k] bvec_try_merge_page
>>
>>Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
>>---
>>Changes since v1:
>>https://lore.kernel.org/all/20240419091721.1790-1-kundan.kumar@samsung.com/
>>- Changed functions bio_iov_add_page() and bio_iov_add_zone_append_page() to
>>   accept a folio
>>- Removed branching and now calculate folio_offset and len in the same
>>   fashion for both order-0 and larger folios
>>- Added change in NVMe driver to use nvme_setup_prp_simple() by ignoring
>>   multiples of NVME_CTRL_PAGE_SIZE in offset
>>- Added a change to unpin_user_pages() for pages that were added as folios,
>>   and stopped unpinning pages one by one from __bio_release_pages()
>>   (suggested by Keith)
>>
>>  block/bio.c             | 48 +++++++++++++++++++++++++----------------
>>  drivers/nvme/host/pci.c |  3 ++-
>>  2 files changed, 31 insertions(+), 20 deletions(-)
>>
>>diff --git a/block/bio.c b/block/bio.c
>>index 38baedb39c6f..0ec453ad15b3 100644
>>--- a/block/bio.c
>>+++ b/block/bio.c
>>@@ -1155,7 +1155,6 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
>>  	bio_for_each_folio_all(fi, bio) {
>>  		struct page *page;
>>-		size_t nr_pages;
>>  		if (mark_dirty) {
>>  			folio_lock(fi.folio);
>>@@ -1163,11 +1162,7 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
>>  			folio_unlock(fi.folio);
>>  		}
>>  		page = folio_page(fi.folio, fi.offset / PAGE_SIZE);
>>-		nr_pages = (fi.offset + fi.length - 1) / PAGE_SIZE -
>>-			   fi.offset / PAGE_SIZE + 1;
>>-		do {
>>-			bio_release_page(bio, page++);
>>-		} while (--nr_pages != 0);
>>+		bio_release_page(bio, page);
>
>Errm. I guess you need to call 'folio_put()' here, otherwise the page 
>reference counting will be messed up.

Though we call bio_release_page(), the actual refcounting works on folios.
We can keep it like:
+               bio_release_page(bio, &fi.folio->page);
which avoids a call to folio_page() and also takes care of the
BIO_PAGE_PINNED flag.

>
>>  	}
>>  }
>>  EXPORT_SYMBOL_GPL(__bio_release_pages);
>>@@ -1192,7 +1187,7 @@ void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
>>  	bio_set_flag(bio, BIO_CLONED);
>>  }
>>-static int bio_iov_add_page(struct bio *bio, struct page *page,
>>+static int bio_iov_add_folio(struct bio *bio, struct folio *folio,
>>  		unsigned int len, unsigned int offset)
>>  {
>>  	bool same_page = false;
>>@@ -1202,27 +1197,26 @@ static int bio_iov_add_page(struct bio *bio, struct page *page,
>>  	if (bio->bi_vcnt > 0 &&
>>  	    bvec_try_merge_page(&bio->bi_io_vec[bio->bi_vcnt - 1],
>>-				page, len, offset, &same_page)) {
>>+				&folio->page, len, offset, &same_page)) {
>>  		bio->bi_iter.bi_size += len;
>>  		if (same_page)
>>-			bio_release_page(bio, page);
>>+			bio_release_page(bio, &folio->page);
>>  		return 0;
>>  	}
>>-	__bio_add_page(bio, page, len, offset);
>>+	bio_add_folio_nofail(bio, folio, len, offset);
>>  	return 0;
>>  }
>
>That is not a valid conversion.
>bvec_try_to_merge_pages() will try to merge a page with an existing
>bvec. If the logic is switched to folios you would need to iterate over
>all pages in a folio, and call bvec_try_to_merge_pages() for each page
>in the folio.
>Or convert / add a function 'bvec_try_to_merge_folio()'.
>But with the above patch it will only ever try to merge the first page
>in the folio.
>

This logic merges physically contiguous ranges. If the new folio to be
added is contiguous with the existing bvec, this will merge the new folio
as well. We need not iterate over the pages one by one, as the pages of a
folio are contiguous. The folio_offset does the job: if folio_offset is 0,
we check whether the first page of the folio is contiguous with the
existing bvec.
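
As a sketch of the arithmetic (numbers chosen for illustration, assuming
4K pages and a 16K folio):

	/* pages[i] is the 3rd page of its folio, iterator offset = 512 */
	folio_offset = (2 << PAGE_SHIFT) + 512;		/* 8704 */
	len = min_t(size_t, folio_size(folio) - folio_offset, left);

	/*
	 * bvec_try_merge_page(..., &folio->page, len, 8704, ...) then checks
	 * whether folio page 0 plus 8704 bytes abuts the end of the previous
	 * bvec; one check covers the whole range because the pages of a
	 * folio are physically contiguous.
	 */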

>>-static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
>>+static int bio_iov_add_zone_append_folio(struct bio *bio, struct folio *folio,
>>  		unsigned int len, unsigned int offset)
>>  {
>>  	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>>  	bool same_page = false;
>>-
>>-	if (bio_add_hw_page(q, bio, page, len, offset,
>>+	if (bio_add_hw_page(q, bio, &folio->page, len, offset,
>>  			queue_max_zone_append_sectors(q), &same_page) != len)
>
>Wouldn't this just try to add the first page in a folio?
>

Yes and no; it depends on the offset passed here. This adds the first page
of the folio, but with an offset that can span multiple pages. Say we add
the 4th page of a folio: the offset becomes 4096*3 = 12288.

>>  		return -EINVAL;
>>  	if (same_page)
>>-		bio_release_page(bio, page);
>>+		bio_release_page(bio, &folio->page);
>>  	return 0;
>>  }
>>@@ -1247,8 +1241,10 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>>  	struct page **pages = (struct page **)bv;
>>  	ssize_t size, left;
>>  	unsigned len, i = 0;
>>-	size_t offset;
>>+	size_t offset, folio_offset, size_folio;
>>  	int ret = 0;
>>+	int num_pages;
>>+	struct folio *folio;
>>  	/*
>>  	 * Move page array up in the allocated memory for the bio vecs as far as
>>@@ -1289,16 +1285,30 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>>  	for (left = size, i = 0; left > 0; left -= len, i++) {
>>  		struct page *page = pages[i];
>>+		folio = page_folio(page);
>>+
>>+		/* See the offset in folio and the size */
>>+		folio_offset = (folio_page_idx(folio, page)
>>+				<< PAGE_SHIFT) + offset;
>>+		size_folio = folio_size(folio);
>>+
>>+		/* Calculate the length of folio to be added */
>>+		len = min_t(size_t, (size_folio - folio_offset), left);
>>+
>>+		num_pages = DIV_ROUND_UP(offset + len, PAGE_SIZE);
>>-		len = min_t(size_t, PAGE_SIZE - offset, left);
>>  		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
>>-			ret = bio_iov_add_zone_append_page(bio, page, len,
>>-					offset);
>>+			ret = bio_iov_add_zone_append_folio(bio, folio, len,
>>+					folio_offset);
>>  			if (ret)
>>  				break;
>>  		} else
>>-			bio_iov_add_page(bio, page, len, offset);
>>+			bio_iov_add_folio(bio, folio, len, folio_offset);
>>+		/* Skip the pages which got added */
>>+		if (bio_flagged(bio, BIO_PAGE_PINNED) && num_pages > 1)
>>+			unpin_user_pages(pages + i, num_pages - 1);
>>+		i = i + (num_pages - 1);
>>  		offset = 0;
>>  	}
>I would rather use folio as an argument here ...
>

I didn't understand which exact line number you mean here.

For extracting the pages, as of now we depend on functions like
pin_user_pages() and iov_iter_extract_pages(), which work on pages. There
is an overhead from pinning and unpinning every page. As Christoph
suggested, "in the long run we really need a folio version of
pin_user_pages and iov_iter_extract_pages."
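
(A sketch of that page-granular contract; pin_user_pages_fast() and
unpin_user_pages() are the existing helpers, the usage is simplified:)

	/* one pin/unpin operation per PAGE_SIZE worth of memory, even
	 * when every page belongs to the same large folio */
	nr = pin_user_pages_fast(addr, nr_pages, FOLL_WRITE, pages);
	if (nr > 0)
		unpin_user_pages(pages, nr);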

>>diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>>index 8e0bb9692685..7c07b0582cae 100644
>>--- a/drivers/nvme/host/pci.c
>>+++ b/drivers/nvme/host/pci.c
>>@@ -778,7 +778,8 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
>>  		struct bio_vec bv = req_bvec(req);
>>  		if (!is_pci_p2pdma_page(bv.bv_page)) {
>>-			if (bv.bv_offset + bv.bv_len <= NVME_CTRL_PAGE_SIZE * 2)
>>+			if ((bv.bv_offset & (NVME_CTRL_PAGE_SIZE - 1))
>>+				+ bv.bv_len <= NVME_CTRL_PAGE_SIZE * 2)
>>  				return nvme_setup_prp_simple(dev, req,
>>  							     &cmnd->rw, &bv);
>
>In general I'm in favour of moving the block layer over to folios, but 
>then we should do it by adding a folio-based version of the functions,
>and then gradually convert the callers over to those functions.
>
>Trying to slide it in via folio->page or page_folio() things is not
>the way to do it as they are not equivalent transformations.
>
>Cheers,
>
>Hannes
>-- 
>Dr. Hannes Reinecke                  Kernel Storage Architect
>hare@suse.de                                +49 911 74053 688
>SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
>HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
>
Christoph Hellwig May 2, 2024, 12:53 p.m. UTC | #4
On Thu, May 02, 2024 at 08:45:33AM +0200, Hannes Reinecke wrote:
>> -		nr_pages = (fi.offset + fi.length - 1) / PAGE_SIZE -
>> -			   fi.offset / PAGE_SIZE + 1;
>> -		do {
>> -			bio_release_page(bio, page++);
>> -		} while (--nr_pages != 0);
>> +		bio_release_page(bio, page);
>
> Errm. I guess you need to call 'folio_put()' here, otherwise the page 
> reference counting will be messed up.

It shouldn't.  See the rfc patch and explanation that Keith sent in reply
to the previous version.  But as I wrote earlier it should be a separate
prep patch including a commit log clearly explaining the reason for it
and how it works.

> That is not a valid conversion.
> bvec_try_to_merge_pages() will try to merge a page with an existing
> bvec. If the logic is switched to folios you would need to iterate over
> all pages in a folio, and call bvec_try_to_merge_pages() for each page
> in the folio.
> Or convert / add a function 'bvec_try_to_merge_folio()'.
> But with the above patch it will only ever try to merge the first page
> in the folio.

You mean bvec_try_merge_page?  The only place where it uses more
information than just the physical address is for the same_page
logic.  That being said, converting it to a folio would still be
a good idea and a good prep patch.
Matthew Wilcox May 3, 2024, 3:26 p.m. UTC | #5
On Thu, May 02, 2024 at 02:53:40PM +0200, Christoph Hellwig wrote:
> On Thu, May 02, 2024 at 08:45:33AM +0200, Hannes Reinecke wrote:
> >> -		nr_pages = (fi.offset + fi.length - 1) / PAGE_SIZE -
> >> -			   fi.offset / PAGE_SIZE + 1;
> >> -		do {
> >> -			bio_release_page(bio, page++);
> >> -		} while (--nr_pages != 0);
> >> +		bio_release_page(bio, page);
> >
> > Errm. I guess you need to call 'folio_put()' here, otherwise the page 
> > reference counting will be messed up.
> 
> It shouldn't.  See the rfc patch and explanation that Keith sent in reply
> to the previous version.  But as I wrote earlier it should be a separate
> prep patch including a commit log clearly explaining the reason for it
> and how it works.

I think this is wandering into a minefield.  I'm pretty sure
it's considered valid to split the bio, and complete the two halves
independently.  Each one will put the refcounts for the pages it touches,
and if we do this early putting of references, that's going to fail.
Christoph Hellwig May 3, 2024, 4:22 p.m. UTC | #6
On Fri, May 03, 2024 at 04:26:55PM +0100, Matthew Wilcox wrote:
> I think this is wandering into a minefield.  I'm pretty sure
> it's considered valid to split the bio, and complete the two halves
> independently.  Each one will put the refcounts for the pages it touches,
> and if we do this early putting of references, that's going to fail.

That's not how bios work.  The submitter always operates on the entire
bio using the _all iterators.  bios do get split and advanced, but that
only affects the bi_iter state.

In a perfect world we'd split the memory containers aspect of the bio
from the iterator, but that would be a lot of churn and we've got
bigger fish to fry.
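
(To make that concrete -- a sketch using the existing _all iterator, as
in __bio_release_pages(): the submitter's completion path walks the full
bvec table, which a split does not modify:)

	struct folio_iter fi;

	/* iterates bio->bi_io_vec in full, independent of bi_iter, so an
	 * earlier bio_split()/bio_advance() does not change what the
	 * submitter releases here */
	bio_for_each_folio_all(fi, bio)
		bio_release_page(bio, &fi.folio->page);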
Hannes Reinecke May 4, 2024, 12:35 p.m. UTC | #7
On 5/3/24 17:26, Matthew Wilcox wrote:
> On Thu, May 02, 2024 at 02:53:40PM +0200, Christoph Hellwig wrote:
>> On Thu, May 02, 2024 at 08:45:33AM +0200, Hannes Reinecke wrote:
>>>> -		nr_pages = (fi.offset + fi.length - 1) / PAGE_SIZE -
>>>> -			   fi.offset / PAGE_SIZE + 1;
>>>> -		do {
>>>> -			bio_release_page(bio, page++);
>>>> -		} while (--nr_pages != 0);
>>>> +		bio_release_page(bio, page);
>>>
>>> Errm. I guess you need to call 'folio_put()' here, otherwise the page
>>> reference counting will be messed up.
>>
>> It shouldn't.  See the rfc patch and explanation that Keith sent in reply
>> to the previous version.  But as I wrote earlier it should be a separate
>> prep patch including a commit log clearly explaining the reason for it
>> and how it works.
> 
> I think this is wandering into a minefield.  I'm pretty sure
> it's considered valid to split the bio, and complete the two halves
> independently.  Each one will put the refcounts for the pages it touches,
> and if we do this early putting of references, that's going to fail.

Precisely my worries. Something I want to talk to you about at LSF:
refcounting of folios vs refcounting of pages.
When one takes a refcount on a folio we are actually taking a refcount
on the first page, which is okay if we stick with using the folio 
throughout the call chain. But if we start mixing between pages and 
folios (as we do here) we will be getting the refcount wrong.

Do you have plans how we could improve the situation?
Like a warning 'Hey, you've used the folio for taking the reference, but 
now you are releasing the references for the page'?

Cheers,

Hannes
Matthew Wilcox May 4, 2024, 4:32 p.m. UTC | #8
On Sat, May 04, 2024 at 02:35:15PM +0200, Hannes Reinecke wrote:
> > I think this is wandering into a minefield.  I'm pretty sure
> > it's considered valid to split the bio, and complete the two halves
> > independently.  Each one will put the refcounts for the pages it touches,
> > and if we do this early putting of references, that's going to fail.
> 
> Precisely my worries. Something I want to talk to you about at LSF:
> refcounting of folios vs refcounting of pages.
> When one takes a refcount on a folio we are actually taking a refcount
> on the first page, which is okay if we stick with using the folio throughout
> the call chain. But if we start mixing between pages and folios (as we do
> here) we will be getting the refcount wrong.
> 
> Do you have plans how we could improve the situation?
> Like a warning 'Hey, you've used the folio for taking the reference, but now
> you are releasing the references for the page'?

This is a fairly common misunderstanding, but TLDR: problem solved long
before I started this project.

Individual pages don't actually have a refcount.  I know it looks
like they do, and they kind of do, but for tail pages, the refcount is
always 0.  Functions like get_page() and put_page() always operate on
the head page (ie folio) refcount.

Specifically, I think you're concerned about pages coming from GUP.
Take a look at try_get_folio().  We pass in a struct page, explicitly
get the refcount on a folio, check the page is still part of the
folio, then return the folio.  And we return the page to the caller
because the caller needs to know the precise page at that address,
not the folio that contains it.

There are functions which don't surreptitiously call compound_head()
behind your back.  set_page_count(), for example.  And page_ref_count()
(rather than the more normal page_count()).

And none of this is true if you don't use __GFP_COMP.  But let's call
that an aberration that must die.
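
(Roughly, and paraphrasing the mm helpers rather than quoting them: the
page-level get resolves to the folio before touching any refcount.)

	static inline void get_page(struct page *page)
	{
		/* tail pages keep a refcount of 0; the folio holds it */
		folio_get(page_folio(page));
	}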
Hannes Reinecke May 5, 2024, 12:10 p.m. UTC | #9
On 5/4/24 18:32, Matthew Wilcox wrote:
> On Sat, May 04, 2024 at 02:35:15PM +0200, Hannes Reinecke wrote:
>>> I think this is wandering into a minefield.  I'm pretty sure
>>> it's considered valid to split the bio, and complete the two halves
>>> independently.  Each one will put the refcounts for the pages it touches,
>>> and if we do this early putting of references, that's going to fail.
>>
>> Precisely my worries. Something I want to talk to you about at LSF:
>> refcounting of folios vs refcounting of pages.
>> When one takes a refcount on a folio we are actually taking a refcount
>> on the first page, which is okay if we stick with using the folio throughout
>> the call chain. But if we start mixing between pages and folios (as we do
>> here) we will be getting the refcount wrong.
>>
>> Do you have plans how we could improve the situation?
>> Like a warning 'Hey, you've used the folio for taking the reference, but now
>> you are releasing the references for the page'?
> 
> This is a fairly common misunderstanding, but TLDR: problem solved long
> before I started this project.
> 
> Individual pages don't actually have a refcount.  I know it looks
> like they do, and they kind of do, but for tail pages, the refcount is
> always 0.  Functions like get_page() and put_page() always operate on
> the head page (ie folio) refcount.
> 
Precisely.

> Specifically, I think you're concerned about pages coming from GUP.
> Take a look at try_get_folio().  We pass in a struct page, explicitly
> get the refcount on a folio, check the page is still part of the
> folio, then return the folio.  And we return the page to the caller
> because the caller needs to know the precise page at that address,
> not the folio that contains it.
> 
> There are functions which don't surreptitiously call compound_head()
> behind your back.  set_page_count(), for example.  And page_ref_count()
> (rather than the more normal page_count()).
> 
> And none of this is true if you don't use __GFP_COMP.  But let's call
> that an aberration that must die.

Ah, right. So the refcount for a page always resolves to the
refcount of the enclosing folio.

I was actually concerned with the iov_iter functions, where we take a 
reference for each page. Currently iov_iter is iterating in units of
PAGE_SIZE, so there is no easy way of converting that to folios.

But one step at a time, I guess. First get the blocksize > pagesize 
patches in.

Cheers,

Hannes
Kundan Kumar May 7, 2024, 11:19 a.m. UTC | #10
On Thu, May 2, 2024 at 11:06 AM Christoph Hellwig <hch@lst.de> wrote:
>
> > - Changed functions bio_iov_add_page() and bio_iov_add_zone_append_page() to
> >   accept a folio
>
> Those should be separate prep patches.
>
I tried to split this patch into prep patches for these two functions.
The main patch logic (folio, folio_offset, length, skip) gets split
into prep patches, which looks less neat than doing it in a single
patch.

Patch

diff --git a/block/bio.c b/block/bio.c
index 38baedb39c6f..0ec453ad15b3 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1155,7 +1155,6 @@  void __bio_release_pages(struct bio *bio, bool mark_dirty)
 
 	bio_for_each_folio_all(fi, bio) {
 		struct page *page;
-		size_t nr_pages;
 
 		if (mark_dirty) {
 			folio_lock(fi.folio);
@@ -1163,11 +1162,7 @@  void __bio_release_pages(struct bio *bio, bool mark_dirty)
 			folio_unlock(fi.folio);
 		}
 		page = folio_page(fi.folio, fi.offset / PAGE_SIZE);
-		nr_pages = (fi.offset + fi.length - 1) / PAGE_SIZE -
-			   fi.offset / PAGE_SIZE + 1;
-		do {
-			bio_release_page(bio, page++);
-		} while (--nr_pages != 0);
+		bio_release_page(bio, page);
 	}
 }
 EXPORT_SYMBOL_GPL(__bio_release_pages);
@@ -1192,7 +1187,7 @@  void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
 	bio_set_flag(bio, BIO_CLONED);
 }
 
-static int bio_iov_add_page(struct bio *bio, struct page *page,
+static int bio_iov_add_folio(struct bio *bio, struct folio *folio,
 		unsigned int len, unsigned int offset)
 {
 	bool same_page = false;
@@ -1202,27 +1197,26 @@  static int bio_iov_add_page(struct bio *bio, struct page *page,
 
 	if (bio->bi_vcnt > 0 &&
 	    bvec_try_merge_page(&bio->bi_io_vec[bio->bi_vcnt - 1],
-				page, len, offset, &same_page)) {
+				&folio->page, len, offset, &same_page)) {
 		bio->bi_iter.bi_size += len;
 		if (same_page)
-			bio_release_page(bio, page);
+			bio_release_page(bio, &folio->page);
 		return 0;
 	}
-	__bio_add_page(bio, page, len, offset);
+	bio_add_folio_nofail(bio, folio, len, offset);
 	return 0;
 }
 
-static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
+static int bio_iov_add_zone_append_folio(struct bio *bio, struct folio *folio,
 		unsigned int len, unsigned int offset)
 {
 	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 	bool same_page = false;
-
-	if (bio_add_hw_page(q, bio, page, len, offset,
+	if (bio_add_hw_page(q, bio, &folio->page, len, offset,
 			queue_max_zone_append_sectors(q), &same_page) != len)
 		return -EINVAL;
 	if (same_page)
-		bio_release_page(bio, page);
+		bio_release_page(bio, &folio->page);
 	return 0;
 }
 
@@ -1247,8 +1241,10 @@  static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	struct page **pages = (struct page **)bv;
 	ssize_t size, left;
 	unsigned len, i = 0;
-	size_t offset;
+	size_t offset, folio_offset, size_folio;
 	int ret = 0;
+	int num_pages;
+	struct folio *folio;
 
 	/*
 	 * Move page array up in the allocated memory for the bio vecs as far as
@@ -1289,16 +1285,30 @@  static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
 	for (left = size, i = 0; left > 0; left -= len, i++) {
 		struct page *page = pages[i];
+		folio = page_folio(page);
+
+		/* See the offset in folio and the size */
+		folio_offset = (folio_page_idx(folio, page)
+				<< PAGE_SHIFT) + offset;
+		size_folio = folio_size(folio);
+
+		/* Calculate the length of folio to be added */
+		len = min_t(size_t, (size_folio - folio_offset), left);
+
+		num_pages = DIV_ROUND_UP(offset + len, PAGE_SIZE);
 
-		len = min_t(size_t, PAGE_SIZE - offset, left);
 		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
-			ret = bio_iov_add_zone_append_page(bio, page, len,
-					offset);
+			ret = bio_iov_add_zone_append_folio(bio, folio, len,
+					folio_offset);
 			if (ret)
 				break;
 		} else
-			bio_iov_add_page(bio, page, len, offset);
+			bio_iov_add_folio(bio, folio, len, folio_offset);
 
+		/* Skip the pages which got added */
+		if (bio_flagged(bio, BIO_PAGE_PINNED) && num_pages > 1)
+			unpin_user_pages(pages + i, num_pages - 1);
+		i = i + (num_pages - 1);
 		offset = 0;
 	}
 
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8e0bb9692685..7c07b0582cae 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -778,7 +778,8 @@  static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 		struct bio_vec bv = req_bvec(req);
 
 		if (!is_pci_p2pdma_page(bv.bv_page)) {
-			if (bv.bv_offset + bv.bv_len <= NVME_CTRL_PAGE_SIZE * 2)
+			if ((bv.bv_offset & (NVME_CTRL_PAGE_SIZE - 1))
+				+ bv.bv_len <= NVME_CTRL_PAGE_SIZE * 2)
 				return nvme_setup_prp_simple(dev, req,
 							     &cmnd->rw, &bv);
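
To read the new check with concrete numbers (chosen for illustration,
with NVME_CTRL_PAGE_SIZE = 4096): for a 4K request whose bvec starts at
bv_offset = 12800 inside a large folio,

	old:  12800 + 4096           = 16896  > 8192  -> full PRP list setup
	new:  (12800 & 4095) + 4096  =  4608 <= 8192  -> nvme_setup_prp_simple()

i.e. masking off whole multiples of NVME_CTRL_PAGE_SIZE keeps such a
request on the two-PRP fast path.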