diff mbox series

[v3,2/3] block: add folio awareness instead of looping through pages

Message ID 20240507144509.37477-3-kundan.kumar@samsung.com (mailing list archive)
State New
Series block: add larger order folio instead of pages

Commit Message

Kundan Kumar May 7, 2024, 2:45 p.m. UTC
Add a larger size from the folio to the bio, and skip per-page processing.

Fetch the offset of the page within its folio. Depending on the folio size
and folio_offset, fetch a larger length. This length may span
multiple contiguous pages if the folio is multi-order.

Using this length, calculate the number of pages that will be added to the
bio, and increment the loop counter to skip those pages.

This technique avoids the overhead of merging pages that belong to the
same large-order folio.

Also folio-lize the functions bio_iov_add_page() and
bio_iov_add_zone_append_page().

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
---
 block/bio.c | 37 ++++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 13 deletions(-)

Comments

Matthew Wilcox May 15, 2024, 8:55 p.m. UTC | #1
On Tue, May 07, 2024 at 08:15:08PM +0530, Kundan Kumar wrote:
> Add a larger size from the folio to the bio, and skip per-page processing.
> 
> Fetch the offset of the page within its folio. Depending on the folio size
> and folio_offset, fetch a larger length. This length may span
> multiple contiguous pages if the folio is multi-order.

The problem is that it may not.  Here's the scenario:

int fd, fd2;
fd = open(src, O_RDONLY);
char *addr = mmap(NULL, 1024 * 1024, PROT_READ | PROT_WRITE,
	MAP_PRIVATE | MAP_HUGETLB, fd, 0);
int i, j = 0;

for (i = 0; i < 1024 * 1024; i++)
	j |= addr[i];

addr[30000] = 17;
fd2 = open(dest, O_RDWR | O_DIRECT);
write(fd2, &addr[16384], 32768);

Assuming that the source file supports being cached in large folios,
the page array we get from GUP might contain:

f0p4 f0p5 f0p6 f1p0 f0p8 f0p9 ...

because we allocated 'f1' when we did COW due to the store to addr[30000].

We can certainly reduce the cost of merge if we know two pages are part
of the same folio, but we still need to check that we actually got
pages which are part of the same folio.
Kundan Kumar May 24, 2024, 9:22 a.m. UTC | #2
On 15/05/24 09:55PM, Matthew Wilcox wrote:
>On Tue, May 07, 2024 at 08:15:08PM +0530, Kundan Kumar wrote:
>> Add a larger size from the folio to the bio, and skip per-page processing.
>>
>> Fetch the offset of the page within its folio. Depending on the folio size
>> and folio_offset, fetch a larger length. This length may span
>> multiple contiguous pages if the folio is multi-order.
>
>The problem is that it may not.  Here's the scenario:
>
>int fd, fd2;
>fd = open(src, O_RDONLY);
>char *addr = mmap(NULL, 1024 * 1024, PROT_READ | PROT_WRITE,
>	MAP_PRIVATE | MAP_HUGETLB, fd, 0);

I also added the MAP_ANONYMOUS flag here; otherwise mmap fails.

>int i, j = 0;
>
>for (i = 0; i < 1024 * 1024; i++)
>	j |= addr[i];
>
>addr[30000] = 17;
>fd2 = open(dest, O_RDWR | O_DIRECT);
>write(fd2, &addr[16384], 32768);
>
>Assuming that the source file supports being cached in large folios,
>the page array we get from GUP might contain:
>
>f0p4 f0p5 f0p6 f1p0 f0p8 f0p9 ...
>
>because we allocated 'f1' when we did COW due to the store to addr[30000].
>
>We can certainly reduce the cost of merge if we know two pages are part
>of the same folio, but we still need to check that we actually got
>pages which are part of the same folio.
>

Thanks, now I understand that COW may happen and modify the PTEs in the
middle of a contiguous folio mapping in the page table. I will include a
check that each page belongs to the same folio.

I tried executing the sample code. When it does a GUP due to the O_DIRECT
write, I see that the entire huge folio gets reused.
"addr[30000] = 17;" triggers a COW, but hugetlb_wp() reuses the
same huge page, using old_folio again.

Am I missing something here? Do you have any further suggestions for
generating a scenario similar to "f0p4 f0p5 f0p6 f1p0 f0p8 f0p9 ..."
with a sample program?
--
Kundan
Matthew Wilcox May 24, 2024, 3:40 p.m. UTC | #3
On Fri, May 24, 2024 at 02:52:31PM +0530, Kundan Kumar wrote:
> On 15/05/24 09:55PM, Matthew Wilcox wrote:
> > On Tue, May 07, 2024 at 08:15:08PM +0530, Kundan Kumar wrote:
> > > Add a larger size from the folio to the bio, and skip per-page processing.
> > > 
> > > Fetch the offset of the page within its folio. Depending on the folio size
> > > and folio_offset, fetch a larger length. This length may span
> > > multiple contiguous pages if the folio is multi-order.
> > 
> > The problem is that it may not.  Here's the scenario:
> > 
> > int fd, fd2;
> > fd = open(src, O_RDONLY);
> > char *addr = mmap(NULL, 1024 * 1024, PROT_READ | PROT_WRITE,
> > 	MAP_PRIVATE | MAP_HUGETLB, fd, 0);
> 
> I also added the MAP_ANONYMOUS flag here; otherwise mmap fails.

I didn't test this code, but MAP_ANONYMOUS is wrong.  I'm trying to get
a file mapping here, not an anonymous mapping.

The intent is to hit the
        if (vm_flags & VM_HUGEPAGE) {
case in do_sync_mmap_readahead().

Ah, I see.

ksys_mmap_pgoff:
        if (!(flags & MAP_ANONYMOUS)) {
...
                if (is_file_hugepages(file)) {
                        len = ALIGN(len, huge_page_size(hstate_file(file)));
                } else if (unlikely(flags & MAP_HUGETLB)) {
                        retval = -EINVAL;
                        goto out_fput;
                }

Maybe we need something like this:

diff --git a/mm/mmap.c b/mm/mmap.c
index 83b4682ec85c..7c5066a8a3ac 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1307,6 +1307,8 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 		flags_mask = LEGACY_MAP_MASK;
 		if (file->f_op->fop_flags & FOP_MMAP_SYNC)
 			flags_mask |= MAP_SYNC;
+		if (flags & MAP_HUGETLB)
+			vm_flags |= VM_HUGEPAGE;
 
 		switch (flags & MAP_TYPE) {
 		case MAP_SHARED:
@@ -1414,12 +1416,8 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
 		file = fget(fd);
 		if (!file)
 			return -EBADF;
-		if (is_file_hugepages(file)) {
+		if (is_file_hugepages(file))
 			len = ALIGN(len, huge_page_size(hstate_file(file)));
-		} else if (unlikely(flags & MAP_HUGETLB)) {
-			retval = -EINVAL;
-			goto out_fput;
-		}
 	} else if (flags & MAP_HUGETLB) {
 		struct hstate *hs;
 
@@ -1441,7 +1439,6 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
 	}
 
 	retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
-out_fput:
 	if (file)
 		fput(file);
 	return retval;

(compile tested only)

Patch

diff --git a/block/bio.c b/block/bio.c
index 38baedb39c6f..e2c31f1b0aa8 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1192,7 +1192,7 @@  void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
 	bio_set_flag(bio, BIO_CLONED);
 }
 
-static int bio_iov_add_page(struct bio *bio, struct page *page,
+static int bio_iov_add_folio(struct bio *bio, struct folio *folio,
 		unsigned int len, unsigned int offset)
 {
 	bool same_page = false;
@@ -1202,27 +1202,27 @@  static int bio_iov_add_page(struct bio *bio, struct page *page,
 
 	if (bio->bi_vcnt > 0 &&
 	    bvec_try_merge_page(&bio->bi_io_vec[bio->bi_vcnt - 1],
-				page, len, offset, &same_page)) {
+				&folio->page, len, offset, &same_page)) {
 		bio->bi_iter.bi_size += len;
 		if (same_page)
-			bio_release_page(bio, page);
+			bio_release_page(bio, &folio->page);
 		return 0;
 	}
-	__bio_add_page(bio, page, len, offset);
+	bio_add_folio_nofail(bio, folio, len, offset);
 	return 0;
 }
 
-static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
+static int bio_iov_add_zone_append_folio(struct bio *bio, struct folio *folio,
 		unsigned int len, unsigned int offset)
 {
 	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 	bool same_page = false;
 
-	if (bio_add_hw_page(q, bio, page, len, offset,
+	if (bio_add_hw_page(q, bio, &folio->page, len, offset,
 			queue_max_zone_append_sectors(q), &same_page) != len)
 		return -EINVAL;
 	if (same_page)
-		bio_release_page(bio, page);
+		bio_release_page(bio, &folio->page);
 	return 0;
 }
 
@@ -1247,8 +1247,8 @@  static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	struct page **pages = (struct page **)bv;
 	ssize_t size, left;
 	unsigned len, i = 0;
-	size_t offset;
-	int ret = 0;
+	size_t offset, folio_offset;
+	int ret = 0, num_pages;
 
 	/*
 	 * Move page array up in the allocated memory for the bio vecs as far as
@@ -1289,15 +1289,26 @@  static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
 	for (left = size, i = 0; left > 0; left -= len, i++) {
 		struct page *page = pages[i];
+		struct folio *folio = page_folio(page);
+
+		/* Calculate the offset of page in folio */
+		folio_offset = (folio_page_idx(folio, page) << PAGE_SHIFT) +
+				offset;
+
+		len = min_t(size_t, (folio_size(folio) - folio_offset), left);
+
+		num_pages = DIV_ROUND_UP(offset + len, PAGE_SIZE);
 
-		len = min_t(size_t, PAGE_SIZE - offset, left);
 		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
-			ret = bio_iov_add_zone_append_page(bio, page, len,
-					offset);
+			ret = bio_iov_add_zone_append_folio(bio, folio, len,
+					folio_offset);
 			if (ret)
 				break;
 		} else
-			bio_iov_add_page(bio, page, len, offset);
+			bio_iov_add_folio(bio, folio, len, folio_offset);
+
+		/* Skip the pages which got added */
+		i = i + (num_pages - 1);
 
 		offset = 0;
 	}