[v2,1/4] net: introduce helper sendpages_ok()

Message ID	20240530142417.146696-2-ofir.gal@volumez.com (mailing list archive)
State	New
Headers	show Received: from EUR04-DB3-obe.outbound.protection.outlook.com (mail-db3eur04on2139.outbound.protection.outlook.com [40.107.6.139]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 163D717C21C; Thu, 30 May 2024 14:24:36 +0000 (UTC) From: Ofir Gal <ofir.gal@volumez.com> To: davem@davemloft.net, linux-block@vger.kernel.org, linux-nvme@lists.infradead.org, netdev@vger.kernel.org, ceph-devel@vger.kernel.org Cc: dhowells@redhat.com, edumazet@google.com, pabeni@redhat.com, kbusch@kernel.org, axboe@kernel.dk, hch@lst.de, sagi@grimberg.me, philipp.reisner@linbit.com, lars.ellenberg@linbit.com, christoph.boehmwalder@linbit.com, idryomov@gmail.com, xiubli@redhat.com Subject: [PATCH v2 1/4] net: introduce helper sendpages_ok() Date: Thu, 30 May 2024 17:24:11 +0300 Message-ID: <20240530142417.146696-2-ofir.gal@volumez.com> In-Reply-To: <20240530142417.146696-1-ofir.gal@volumez.com> References: <20240530142417.146696-1-ofir.gal@volumez.com> Content-Transfer-Encoding: 8bit Content-Type: text/plain Precedence: bulk MIME-Version: 1.0
Series	[v2,1/4] net: introduce helper sendpages_ok() \| expand [v2,1/4] net: introduce helper sendpages_ok() [v2,2/4] nvme-tcp: use sendpages_ok() instead of sendpage_ok() [v2,3/4] drbd: use sendpages_ok() instead of sendpage_ok() [v2,4/4] libceph: use sendpages_ok() instead of sendpage_ok()

Ofir Gal May 30, 2024, 2:24 p.m. UTC

Network drivers are using sendpage_ok() to check the first page of an
iterator in order to disable MSG_SPLICE_PAGES. The iterator can
represent list of contiguous pages.

When MSG_SPLICE_PAGES is enabled skb_splice_from_iter() is being used,
it requires all pages in the iterator to be sendable. Therefore it needs
to check that each page is sendable.

The patch introduces a helper sendpages_ok(), it returns true if all the
contiguous pages are sendable.

Drivers who want to send contiguous pages with MSG_SPLICE_PAGES may use
this helper to check whether the page list is OK. If the helper does not
return true, the driver should remove MSG_SPLICE_PAGES flag.

Signed-off-by: Ofir Gal <ofir.gal@volumez.com>
---
 include/linux/net.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

Christoph Hellwig May 31, 2024, 7:32 a.m. UTC | #1

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

Sagi Grimberg May 31, 2024, 8:51 a.m. UTC | #2

On 30/05/2024 17:24, Ofir Gal wrote:
> Network drivers are using sendpage_ok() to check the first page of an
> iterator in order to disable MSG_SPLICE_PAGES. The iterator can
> represent list of contiguous pages.
>
> When MSG_SPLICE_PAGES is enabled skb_splice_from_iter() is being used,
> it requires all pages in the iterator to be sendable. Therefore it needs
> to check that each page is sendable.
>
> The patch introduces a helper sendpages_ok(), it returns true if all the
> contiguous pages are sendable.
>
> Drivers who want to send contiguous pages with MSG_SPLICE_PAGES may use
> this helper to check whether the page list is OK. If the helper does not
> return true, the driver should remove MSG_SPLICE_PAGES flag.
>
> Signed-off-by: Ofir Gal <ofir.gal@volumez.com>
> ---
>   include/linux/net.h | 20 ++++++++++++++++++++
>   1 file changed, 20 insertions(+)
>
> diff --git a/include/linux/net.h b/include/linux/net.h
> index 688320b79fcc..b33bdc3e2031 100644
> --- a/include/linux/net.h
> +++ b/include/linux/net.h
> @@ -322,6 +322,26 @@ static inline bool sendpage_ok(struct page *page)
>   	return !PageSlab(page) && page_count(page) >= 1;
>   }
>   
> +/*
> + * Check sendpage_ok on contiguous pages.
> + */
> +static inline bool sendpages_ok(struct page *page, size_t len, size_t offset)
> +{
> +	unsigned int pagecount;
> +	size_t page_offset;
> +	int k;
> +
> +	page = page + offset / PAGE_SIZE;
> +	page_offset = offset % PAGE_SIZE;

lets not modify the input page variable.

p = page + offset >> PAGE_SHIFT;
poffset = offset & PAGE_MASK;

> +	pagecount = DIV_ROUND_UP(len + page_offset, PAGE_SIZE);
> +
> +	for (k = 0; k < pagecount; k++)
> +		if (!sendpage_ok(page + k))
> +			return false;

perhaps instead of doing a costly DIV_ROUND_UP for every network send we 
can do:

         count = 0;
         while (count < len) {
                 if (!sendpage_ok(p))
                         return false;
                 page++;
                 count += PAGE_SIZE;
         }

And we can lose page_offset.

It can be done in a number of ways, but we should be able to do it
without the DIV_ROUND_UP...

I still don't understand how a page in the middle of a contiguous range ends
up coming from the slab while others don't.

Ofir, can you please check which condition in sendpage_ok actually fails?

Ofir Gal June 3, 2024, 12:35 p.m. UTC | #3

On 31/05/2024 11:51, Sagi Grimberg wrote:
>
>
> On 30/05/2024 17:24, Ofir Gal wrote:
>> Network drivers are using sendpage_ok() to check the first page of an
>> iterator in order to disable MSG_SPLICE_PAGES. The iterator can
>> represent list of contiguous pages.
>>
>> When MSG_SPLICE_PAGES is enabled skb_splice_from_iter() is being used,
>> it requires all pages in the iterator to be sendable. Therefore it needs
>> to check that each page is sendable.
>>
>> The patch introduces a helper sendpages_ok(), it returns true if all the
>> contiguous pages are sendable.
>>
>> Drivers who want to send contiguous pages with MSG_SPLICE_PAGES may use
>> this helper to check whether the page list is OK. If the helper does not
>> return true, the driver should remove MSG_SPLICE_PAGES flag.
>>
>> Signed-off-by: Ofir Gal <ofir.gal@volumez.com>
>> ---
>>   include/linux/net.h | 20 ++++++++++++++++++++
>>   1 file changed, 20 insertions(+)
>>
>> diff --git a/include/linux/net.h b/include/linux/net.h
>> index 688320b79fcc..b33bdc3e2031 100644
>> --- a/include/linux/net.h
>> +++ b/include/linux/net.h
>> @@ -322,6 +322,26 @@ static inline bool sendpage_ok(struct page *page)
>>       return !PageSlab(page) && page_count(page) >= 1;
>>   }
>>   +/*
>> + * Check sendpage_ok on contiguous pages.
>> + */
>> +static inline bool sendpages_ok(struct page *page, size_t len, size_t offset)
>> +{
>> +    unsigned int pagecount;
>> +    size_t page_offset;
>> +    int k;
>> +
>> +    page = page + offset / PAGE_SIZE;
>> +    page_offset = offset % PAGE_SIZE;
>
> lets not modify the input page variable.
>
> p = page + offset >> PAGE_SHIFT;
> poffset = offset & PAGE_MASK;
Ok, will be applied in the next patch set.

>> +    pagecount = DIV_ROUND_UP(len + page_offset, PAGE_SIZE);
>> +
>> +    for (k = 0; k < pagecount; k++)
>> +        if (!sendpage_ok(page + k))
>> +            return false;
>
> perhaps instead of doing a costly DIV_ROUND_UP for every network send we can do:
>
>         count = 0;
>         while (count < len) {
>                 if (!sendpage_ok(p))
>                         return false;
>                 page++;
>                 count += PAGE_SIZE;
>         }
>
> And we can lose page_offset.
>
> It can be done in a number of ways, but we should be able to do it
> without the DIV_ROUND_UP...
Ok, will be applied in the next patch set.

> I still don't understand how a page in the middle of a contiguous range ends
> up coming from the slab while others don't.
I haven't investigate the origin of the IO
yet. I suspect the first 2 pages are the superblocks of the raid
(mdp_superblock_1 and bitmap_super_s) and the rest of the IO is the bitmap.

> Ofir, can you please check which condition in sendpage_ok actually fails?
It failed because the page has slab, page count is 1. Sorry for not
clarifying this.

"skbuff: !sendpage_ok - page: 0x54f9f140 (pfn: 120757). is_slab: 1, page_count: 1"
                                                                 ^
The print I used:
pr_info(
    "!sendpage_ok - page: 0x%p (pfn: %lx). is_slab: %u, page_count: %u\n",
    (void *)page,
    page_to_pfn(page),
    page_address(page),
    !!PageSlab(page),
    page_count(page)
);

Sagi Grimberg June 3, 2024, 9:27 p.m. UTC | #4

>> I still don't understand how a page in the middle of a contiguous range ends
>> up coming from the slab while others don't.
> I haven't investigate the origin of the IO
> yet. I suspect the first 2 pages are the superblocks of the raid
> (mdp_superblock_1 and bitmap_super_s) and the rest of the IO is the bitmap.

Well, if these indeed are different origins and just *happen* to be a 
mixture
of slab originated pages and non-slab pages combined together in a 
single bio of a bvec entry,
I'd suspect that it would be more beneficial to split the bvec 
(essentially not allow bio_add_page
to append the page to tail bvec depending on a queue limit (similar to 
how we handle sg gaps).

>
>> Ofir, can you please check which condition in sendpage_ok actually fails?
> It failed because the page has slab, page count is 1. Sorry for not
> clarifying this.
>
> "skbuff: !sendpage_ok - page: 0x54f9f140 (pfn: 120757). is_slab: 1, page_count: 1"
>                                                                   ^
> The print I used:
> pr_info(
>      "!sendpage_ok - page: 0x%p (pfn: %lx). is_slab: %u, page_count: %u\n",
>      (void *)page,
>      page_to_pfn(page),
>      page_address(page),
>      !!PageSlab(page),
>      page_count(page)
> );
>
>

Christoph Hellwig June 4, 2024, 4:27 a.m. UTC | #5

On Tue, Jun 04, 2024 at 12:27:06AM +0300, Sagi Grimberg wrote:
>>> I still don't understand how a page in the middle of a contiguous range ends
>>> up coming from the slab while others don't.
>> I haven't investigate the origin of the IO
>> yet. I suspect the first 2 pages are the superblocks of the raid
>> (mdp_superblock_1 and bitmap_super_s) and the rest of the IO is the bitmap.
>
> Well, if these indeed are different origins and just *happen* to be a 
> mixture
> of slab originated pages and non-slab pages combined together in a single 
> bio of a bvec entry,
> I'd suspect that it would be more beneficial to split the bvec (essentially 
> not allow bio_add_page
> to append the page to tail bvec depending on a queue limit (similar to how 
> we handle sg gaps).

So you want to add a PageSlab check to bvec_try_merge_page?  That sounds
fairly expensive..

Sagi Grimberg June 4, 2024, 8:24 a.m. UTC | #6

On 04/06/2024 7:27, Christoph Hellwig wrote:
> On Tue, Jun 04, 2024 at 12:27:06AM +0300, Sagi Grimberg wrote:
>>>> I still don't understand how a page in the middle of a contiguous range ends
>>>> up coming from the slab while others don't.
>>> I haven't investigate the origin of the IO
>>> yet. I suspect the first 2 pages are the superblocks of the raid
>>> (mdp_superblock_1 and bitmap_super_s) and the rest of the IO is the bitmap.
>> Well, if these indeed are different origins and just *happen* to be a
>> mixture
>> of slab originated pages and non-slab pages combined together in a single
>> bio of a bvec entry,
>> I'd suspect that it would be more beneficial to split the bvec (essentially
>> not allow bio_add_page
>> to append the page to tail bvec depending on a queue limit (similar to how
>> we handle sg gaps).
> So you want to add a PageSlab check to bvec_try_merge_page?  That sounds
> fairly expensive..
>

The check needs to happen somewhere apparently, and given that it will 
be gated by a queue flag
only request queues that actually needed will suffer, but they will 
suffer anyways...

Sagi Grimberg June 4, 2024, 1:01 p.m. UTC | #7

On 04/06/2024 11:24, Sagi Grimberg wrote:
>
>
> On 04/06/2024 7:27, Christoph Hellwig wrote:
>> On Tue, Jun 04, 2024 at 12:27:06AM +0300, Sagi Grimberg wrote:
>>>>> I still don't understand how a page in the middle of a contiguous 
>>>>> range ends
>>>>> up coming from the slab while others don't.
>>>> I haven't investigate the origin of the IO
>>>> yet. I suspect the first 2 pages are the superblocks of the raid
>>>> (mdp_superblock_1 and bitmap_super_s) and the rest of the IO is the 
>>>> bitmap.
>>> Well, if these indeed are different origins and just *happen* to be a
>>> mixture
>>> of slab originated pages and non-slab pages combined together in a 
>>> single
>>> bio of a bvec entry,
>>> I'd suspect that it would be more beneficial to split the bvec 
>>> (essentially
>>> not allow bio_add_page
>>> to append the page to tail bvec depending on a queue limit (similar 
>>> to how
>>> we handle sg gaps).
>> So you want to add a PageSlab check to bvec_try_merge_page? That sounds
>> fairly expensive..
>>
>
> The check needs to happen somewhere apparently, and given that it will 
> be gated by a queue flag
> only request queues that actually needed will suffer, but they will 
> suffer anyways...

Something like the untested patch below:
--
diff --git a/block/bio.c b/block/bio.c
index 53f608028c78..e55a4184c0e6 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -18,6 +18,7 @@
  #include <linux/highmem.h>
  #include <linux/blk-crypto.h>
  #include <linux/xarray.h>
+#include <linux/net.h>

  #include <trace/events/block.h>
  #include "blk.h"
@@ -960,6 +961,9 @@ bool bvec_try_merge_hw_page(struct request_queue *q, 
struct bio_vec *bv,
                 return false;
         if (len > queue_max_segment_size(q) - bv->bv_len)
                 return false;
+       if (q->limits.splice_pages &&
+           sendpage_ok(bv->bv_page) ^ sendpage_ok(page))
+                       return false;
         return bvec_try_merge_page(bv, page, len, offset, same_page);
  }

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index a7e820840cf7..82e2719acb9c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1937,6 +1937,7 @@ static void nvme_set_ctrl_limits(struct nvme_ctrl 
*ctrl,
         lim->virt_boundary_mask = NVME_CTRL_PAGE_SIZE - 1;
         lim->max_segment_size = UINT_MAX;
         lim->dma_alignment = 3;
+       lim->splice_pages = ctrl->splice_pages;
  }

  static bool nvme_update_disk_info(struct nvme_ns *ns, struct 
nvme_id_ns *id,
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 3f3e26849b61..d9818330e236 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -398,6 +398,7 @@ struct nvme_ctrl {

         enum nvme_ctrl_type cntrltype;
         enum nvme_dctype dctype;
+       bool splice_pages
  };

  static inline enum nvme_ctrl_state nvme_ctrl_state(struct nvme_ctrl *ctrl)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 02076b8cb4d8..618b8f20206a 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2146,6 +2146,12 @@ static int nvme_tcp_configure_admin_queue(struct 
nvme_ctrl *ctrl, bool new)
         if (error)
                 goto out_stop_queue;

+       /*
+        * we want to prevent contig pages with conflicting 
splice-ability with
+        * respect to the network transmission
+        */
+       ctrl->splice_pages = true;
+
         nvme_unquiesce_admin_queue(ctrl);

         error = nvme_init_ctrl_finish(ctrl, false);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 69c4f113db42..ec657ddad2a4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -331,6 +331,12 @@ struct queue_limits {
          * due to possible offsets.
          */
         unsigned int            dma_alignment;
+
+       /*
+        * Drivers that use MSG_SPLICE_PAGES to send the bvec over the 
network,
+        * will need either bvec entry contig pages spliceable or not.
+        */
+       bool                    splice_pages;
  };

  typedef int (*report_zones_cb)(struct blk_zone *zone, unsigned int idx,
--

What I now see is that we will check PageSlab twice (bvec last index and 
append page)
and skb_splice_from_iter checks it again... How many times check we 
check this :)

Would be great if the network stack can just check it once and fallback 
to page copy...

Ofir Gal June 6, 2024, 12:57 p.m. UTC | #8

On 04/06/2024 16:01, Sagi Grimberg wrote:
>
>
> On 04/06/2024 11:24, Sagi Grimberg wrote:
>>
>>
>> On 04/06/2024 7:27, Christoph Hellwig wrote:
>>> On Tue, Jun 04, 2024 at 12:27:06AM +0300, Sagi Grimberg wrote:
>>>>>> I still don't understand how a page in the middle of a contiguous range ends
>>>>>> up coming from the slab while others don't.
>>>>> I haven't investigate the origin of the IO
>>>>> yet. I suspect the first 2 pages are the superblocks of the raid
>>>>> (mdp_superblock_1 and bitmap_super_s) and the rest of the IO is the bitmap.
>>>> Well, if these indeed are different origins and just *happen* to be a
>>>> mixture
>>>> of slab originated pages and non-slab pages combined together in a single
>>>> bio of a bvec entry,
>>>> I'd suspect that it would be more beneficial to split the bvec (essentially
>>>> not allow bio_add_page
>>>> to append the page to tail bvec depending on a queue limit (similar to how
>>>> we handle sg gaps).
I have investigated the origin of the IO. It's a bug in the md-bitmap
code. It's a single IO that __write_sb_page() sends, it rounds up the io
size to the optimal io size but doesn't check that the final size exceeds
the amount of pages it allocated.

The slab pages aren't allocated by the md-bitmap, they are pages that
happens to be after the allocated pages. I'm applying a patch to the md
subsystem asap.

I have added some logs to test the theory:
...
md: created bitmap (1 pages) for device md127
__write_sb_page before md_super_write. offset: 16, size: 262144. pfn: 0x53ee
### __write_sb_page before md_super_write. logging pages ###
pfn: 0x53ee, slab: 0
pfn: 0x53ef, slab: 1
pfn: 0x53f0, slab: 0
pfn: 0x53f1, slab: 0
pfn: 0x53f2, slab: 0
pfn: 0x53f3, slab: 1
...
nvme_tcp: sendpage_ok - pfn: 0x53ee, len: 262144, offset: 0
skbuff: before sendpage_ok() - pfn: 0x53ee
skbuff: before sendpage_ok() - pfn: 0x53ef
WARNING at net/core/skbuff.c:6848 skb_splice_from_iter+0x142/0x450
skbuff: !sendpage_ok - pfn: 0x53ef. is_slab: 1, page_count: 1
...

There is only 1 page allocated for bitmap but __write_sb_page() tries to
write 64 pages.

>>> So you want to add a PageSlab check to bvec_try_merge_page? That sounds
>>> fairly expensive..
>>>
>>
>> The check needs to happen somewhere apparently, and given that it will be gated by a queue flag
>> only request queues that actually needed will suffer, but they will suffer anyways...
> What I now see is that we will check PageSlab twice (bvec last index and append page)
> and skb_splice_from_iter checks it again... How many times check we check this :)
>
> Would be great if the network stack can just check it once and fallback to page copy...
Nevertheless I think a check in the network stack or when merging bio's
is still necessary.

Christoph Hellwig June 6, 2024, 1:08 p.m. UTC | #9

On Thu, Jun 06, 2024 at 03:57:25PM +0300, Ofir Gal wrote:
> The slab pages aren't allocated by the md-bitmap, they are pages that
> happens to be after the allocated pages. I'm applying a patch to the md
> subsystem asap.

Similar cases could happen by other means as well.  E.g. if you write
to the last blocks in an XFS allocation group while also writing something
to superblock copy at the beginnig of the next one and the two writes get
merged.  Probably not easy to reproduce but entirely possible.  Just as
as lot of other scenarious could happen due to merges.

Ofir Gal June 6, 2024, 1:18 p.m. UTC | #10

On 06/06/2024 16:08, Christoph Hellwig wrote:
> On Thu, Jun 06, 2024 at 03:57:25PM +0300, Ofir Gal wrote:
>> The slab pages aren't allocated by the md-bitmap, they are pages that
>> happens to be after the allocated pages. I'm applying a patch to the md
>> subsystem asap.
> Similar cases could happen by other means as well.  E.g. if you write
> to the last blocks in an XFS allocation group while also writing something
> to superblock copy at the beginnig of the next one and the two writes get
> merged.  Probably not easy to reproduce but entirely possible.  Just as
> as lot of other scenarious could happen due to merges.
I agree. Do we want to proceed with my patch or would we prefer a more
efficient solution?

Christoph Hellwig June 6, 2024, 1:52 p.m. UTC | #11

On Thu, Jun 06, 2024 at 04:18:29PM +0300, Ofir Gal wrote:
> I agree. Do we want to proceed with my patch or would we prefer a more
> efficient solution?

We'll need something for sure.  I don't like the idea of hacking
special cases that call into the networking code into the core block
layer, so I think it'll have to be something more like your original
patch.

Ofir Gal June 6, 2024, 3:42 p.m. UTC | #12

On 06/06/2024 16:52, Christoph Hellwig wrote:
> On Thu, Jun 06, 2024 at 04:18:29PM +0300, Ofir Gal wrote:
>> I agree. Do we want to proceed with my patch or would we prefer a more
>> efficient solution?
> We'll need something for sure.  I don't like the idea of hacking
> special cases that call into the networking code into the core block
> layer, so I think it'll have to be something more like your original
> patch.
Ok, I'm sending v3 in case we need it.

[v2,1/4] net: introduce helper sendpages_ok()

Commit Message

Comments

Patch