[RFC] net: skb: Increasing allocation in __napi_alloc_skb() to 2k when needed.

Message ID 20240419222328.3231075-1-dwilder@us.ibm.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series [RFC] net: skb: Increasing allocation in __napi_alloc_skb() to 2k when needed.

Checks

Context                Check    Description
netdev/tree_selection  success  Guessing tree name failed - patch did not apply

Commit Message

David J Wilder April 19, 2024, 10:23 p.m. UTC
When testing CONFIG_MAX_SKB_FRAGS=45 on ppc64le and x86_64 I ran into a
couple of issues.

__napi_alloc_skb() assumes its smallest fragment allocations will fit in
1K. When CONFIG_MAX_SKB_FRAGS is increased this may no longer be true,
resulting in __napi_alloc_skb() falling back to page_frag_alloc(). This
reintroduces the bug fixed in:
Commit 3226b158e67c ("net: avoid 32 x truesize under-estimation for
tiny skbs")

That commit ensured that "small skb head fragments are kmalloc backed,
so that other objects in the slab page can be reused instead of being held
as long as skbs are sitting in socket queues."

On ppc64le the warning from napi_get_frags_check() is displayed when
CONFIG_MAX_SKB_FRAGS is set to 45. The purpose of the warning is to detect
when an increase of MAX_SKB_FRAGS has reintroduced the aforementioned bug.
Unfortunately on x86_64 this warning is not seen, even though it should be.
I found the warning was disabled by:
commit dbae2b062824 ("net: skb: introduce and use a single page frag
cache")

This RFC patch to __napi_alloc_skb() determines if an skbuff allocation
with a head fragment of size GRO_MAX_HEAD will fit in a 1k allocation,
increasing the allocation to 2k if needed.

I have functionally tested this patch, performance testing is still needed.

TBD: Remove the limitation on 4k page size from the single page frag cache
allowing ppc64le (64K page size) to benefit from this change.

TBD: I have not addressed the warning in napi_get_frags_check() on x86_64.
Will the warning still be needed once the other changes are completed?

Signed-off-by: David J Wilder <dwilder@us.ibm.com>
---
 net/core/skbuff.c | 35 ++++++++++++++++++++---------------
 1 file changed, 20 insertions(+), 15 deletions(-)
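
For reference, a rough sketch of the arithmetic behind the 1K assumption,
assuming x86_64 defaults (64-byte cache lines, 16-byte skb_frag_t, roughly
48 bytes of fixed fields in struct skb_shared_info, NET_SKB_PAD = 64 and
NET_IP_ALIGN = 0):

/* MAX_SKB_FRAGS = 17: sizeof(struct skb_shared_info) ~= 48 + 17 * 16 = 320
 *   SKB_WITH_OVERHEAD(1024) = 1024 - SKB_DATA_ALIGN(320) = 704
 * MAX_SKB_FRAGS = 45: sizeof(struct skb_shared_info) ~= 48 + 45 * 16 = 768
 *   SKB_WITH_OVERHEAD(1024) = 1024 - SKB_DATA_ALIGN(768) = 256
 *
 * A GRO-sized request needs GRO_MAX_HEAD + NET_SKB_PAD + NET_IP_ALIGN,
 * up to roughly 304 + 64 + 0 = 368 bytes: that fits in 704 but not in
 * 256, so __napi_alloc_skb() falls back to page_frag_alloc().
 */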

Comments

Paolo Abeni April 23, 2024, 7:56 a.m. UTC | #1
On Fri, 2024-04-19 at 15:23 -0700, David J Wilder wrote:
> When testing CONFIG_MAX_SKB_FRAGS=45 on ppc64le and x86_64 I ran into a
> couple of issues.
> 
> __napi_alloc_skb() assumes its smallest fragment allocations will fit in
> 1K. When CONFIG_MAX_SKB_FRAGS is increased this may no longer be true,
> resulting in __napi_alloc_skb() falling back to page_frag_alloc(). This
> reintroduces the bug fixed in:
> Commit 3226b158e67c ("net: avoid 32 x truesize under-estimation for
> tiny skbs")
> 
> That commit ensured that "small skb head fragments are kmalloc backed,
> so that other objects in the slab page can be reused instead of being held
> as long as skbs are sitting in socket queues."
> 
> On ppc64le the warning from napi_get_frags_check() is displayed when
> CONFIG_MAX_SKB_FRAGS is set to 45. The purpose of the warning is to detect
> when an increase of MAX_SKB_FRAGS has reintroduced the aforementioned bug.
> Unfortunately on x86_64 this warning is not seen, even though it should be.
> I found the warning was disabled by:
> commit dbae2b062824 ("net: skb: introduce and use a single page frag
> cache")
> 
> This RFC patch to __napi_alloc_skb() determines if an skbuff allocation
> with a head fragment of size GRO_MAX_HEAD will fit in a 1k allocation,
> increasing the allocation to 2k if needed.
> 
> I have functionally tested this patch, performance testing is still needed.
> 
> TBD: Remove the limitation on 4k page size from the single page frag cache
> allowing ppc64le (64K page size) to benefit from this change.
> 
> TBD: I have not addressed the warning in napi_get_frags_check() on x86_64.
> Will the warning still be needed once the other changes are completed?


Thanks for the detailed analysis.

As mentioned by Eric in commit
bf9f1baa279f0758dc2297080360c5a616843927, it should now be possible to
revert dbae2b062824 without incurring performance regressions for the
relevant use case. That has been on my todo list for a long time, but I
was unable to allocate time for it.

I think such a revert would be preferable. Would you be able to
evaluate that option?

Thanks!

Paolo
Sabrina Dubroca April 24, 2024, 8:49 p.m. UTC | #2
2024-04-23, 09:56:33 +0200, Paolo Abeni wrote:
> On Fri, 2024-04-19 at 15:23 -0700, David J Wilder wrote:
> > When testing CONFIG_MAX_SKB_FRAGS=45 on ppc64le and x86_64 I ran into a
> > couple of issues.
> > 
> > __napi_alloc_skb() assumes its smallest fragment allocations will fit in
> > 1K. When CONFIG_MAX_SKB_FRAGS is increased this may no longer be true,
> > resulting in __napi_alloc_skb() falling back to page_frag_alloc(). This
> > reintroduces the bug fixed in:
> > Commit 3226b158e67c ("net: avoid 32 x truesize under-estimation for
> > tiny skbs")
> > 
> > That commit ensured that "small skb head fragments are kmalloc backed,
> > so that other objects in the slab page can be reused instead of being held
> > as long as skbs are sitting in socket queues."
> > 
> > On ppc64le the warning from napi_get_frags_check() is displayed when
> > CONFIG_MAX_SKB_FRAGS is set to 45. The purpose of the warning is to detect
> > when an increase of MAX_SKB_FRAGS has reintroduced the aforementioned bug.
> > Unfortunately on x86_64 this warning is not seen, even though it should be.
> > I found the warning was disabled by:
> > commit dbae2b062824 ("net: skb: introduce and use a single page frag
> > cache")
> > 
> > This RFC patch to __napi_alloc_skb() determines if an skbuff allocation
> > with a head fragment of size GRO_MAX_HEAD will fit in a 1k allocation,
> > increasing the allocation to 2k if needed.
> > 
> > I have functionally tested this patch, performance testing is still needed.
> > 
> > TBD: Remove the limitation on 4k page size from the single page frag cache
> > allowing ppc64le (64K page size) to benefit from this change.
> > 
> > TBD: I have not addressed the warning in napi_get_frags_check() on x86_64.
> > Will the warning still be needed once the other changes are completed?
> 
> 
> Thanks for the detailed analysis.
> 
> As mentioned by Eric in commit
> bf9f1baa279f0758dc2297080360c5a616843927, it should now be possible to
> revert dbae2b062824 without incurring performance regressions for the
> relevant use case. That has been on my todo list for a long time, but I
> was unable to allocate time for it.
>
> I think such a revert would be preferable. Would you be able to
> evaluate that option?

I don't think reverting dbae2b062824 would fix David's issue.

The problem is that with MAX_SKB_FRAGS=45, skb_shared_info becomes
huge, so 1024 is not enough for those small packets, and we use a
pagefrag instead of kmalloc, which makes napi_get_frags_check unhappy.

Even after reverting dbae2b062824, we would still go through the
pagefrag path and not __alloc_skb.

What about something like this?  (boot-tested on x86 only, but I
disabled NAPI_HAS_SMALL_PAGE_FRAG. no perf testing at all.)

-------- 8< --------
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f85e6989c36c..88923b7b64fe 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -108,6 +108,8 @@ static struct kmem_cache *skbuff_ext_cache __ro_after_init;
 #define SKB_SMALL_HEAD_HEADROOM						\
 	SKB_WITH_OVERHEAD(SKB_SMALL_HEAD_CACHE_SIZE)
 
+#define SKB_SMALL_HEAD_THRESHOLD (SKB_SMALL_HEAD_HEADROOM + NET_SKB_PAD + NET_IP_ALIGN)
+
 int sysctl_max_skb_frags __read_mostly = MAX_SKB_FRAGS;
 EXPORT_SYMBOL(sysctl_max_skb_frags);
 
@@ -726,7 +728,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
 	/* If requested length is either too small or too big,
 	 * we use kmalloc() for skb->head allocation.
 	 */
-	if (len <= SKB_WITH_OVERHEAD(1024) ||
+	if (len <= SKB_SMALL_HEAD_THRESHOLD ||
 	    len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
 	    (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
 		skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
@@ -802,7 +804,7 @@ struct sk_buff *napi_alloc_skb(struct napi_struct *napi, unsigned int len)
 	 * When the small frag allocator is available, prefer it over kmalloc
 	 * for small fragments
 	 */
-	if ((!NAPI_HAS_SMALL_PAGE_FRAG && len <= SKB_WITH_OVERHEAD(1024)) ||
+	if ((!NAPI_HAS_SMALL_PAGE_FRAG && len <= SKB_SMALL_HEAD_THRESHOLD) ||
 	    len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
 	    (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
 		skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX | SKB_ALLOC_NAPI,
-------- 8< --------

(__)napi_alloc_skb extends the GRO_MAX_HEAD size by NET_SKB_PAD +
NET_IP_ALIGN, so I added them here as well. Mainly this is reusing a
size that we know is big enough to fit a small header and whatever
size skb_shared_info is on the current build. Maybe this could be
max(SKB_WITH_OVERHEAD(1024), <...>) to preserve the current behavior
on MAX_SKB_FRAGS=17, since in that case
SKB_WITH_OVERHEAD(1024) > SKB_SMALL_HEAD_HEADROOM
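
Spelled out, that max() variant might look like this (an untested sketch of
the suggestion above, not part of the posted diff):

#define SKB_SMALL_HEAD_THRESHOLD					\
	max_t(unsigned int, SKB_WITH_OVERHEAD(1024),			\
	      SKB_SMALL_HEAD_HEADROOM + NET_SKB_PAD + NET_IP_ALIGN)
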
Eric Dumazet April 24, 2024, 8:56 p.m. UTC | #3
On Wed, Apr 24, 2024 at 10:49 PM Sabrina Dubroca <sd@queasysnail.net> wrote:
>
> 2024-04-23, 09:56:33 +0200, Paolo Abeni wrote:
> > On Fri, 2024-04-19 at 15:23 -0700, David J Wilder wrote:
> > > When testing CONFIG_MAX_SKB_FRAGS=45 on ppc64le and x86_64 I ran into a
> > > couple of issues.
> > >
> > > __napi_alloc_skb() assumes its smallest fragment allocations will fit in
> > > 1K. When CONFIG_MAX_SKB_FRAGS is increased this may no longer be true,
> > > resulting in __napi_alloc_skb() falling back to page_frag_alloc(). This
> > > reintroduces the bug fixed in:
> > > Commit 3226b158e67c ("net: avoid 32 x truesize under-estimation for
> > > tiny skbs")
> > >
> > > That commit ensured that "small skb head fragments are kmalloc backed,
> > > so that other objects in the slab page can be reused instead of being held
> > > as long as skbs are sitting in socket queues."
> > >
> > > On ppc64le the warning from napi_get_frags_check() is displayed when
> > > CONFIG_MAX_SKB_FRAGS is set to 45. The purpose of the warning is to detect
> > > when an increase of MAX_SKB_FRAGS has reintroduced the aforementioned bug.
> > > Unfortunately on x86_64 this warning is not seen, even though it should be.
> > > I found the warning was disabled by:
> > > commit dbae2b062824 ("net: skb: introduce and use a single page frag
> > > cache")
> > >
> > > This RFC patch to __napi_alloc_skb() determines if an skbuff allocation
> > > with a head fragment of size GRO_MAX_HEAD will fit in a 1k allocation,
> > > increasing the allocation to 2k if needed.
> > >
> > > I have functionally tested this patch, performance testing is still needed.
> > >
> > > TBD: Remove the limitation on 4k page size from the single page frag cache
> > > allowing ppc64le (64K page size) to benefit from this change.
> > >
> > > TBD: I have not addressed the warning in napi_get_frags_check() on x86_64.
> > > Will the warning still be needed once the other changes are completed?
> >
> >
> > Thanks for the detailed analysis.
> >
> > As mentioned by Eric in commit
> > bf9f1baa279f0758dc2297080360c5a616843927, it should now be possible to
> > revert dbae2b062824 without incurring performance regressions for the
> > relevant use case. That has been on my todo list for a long time, but I
> > was unable to allocate time for it.
> >
> > I think such a revert would be preferable. Would you be able to
> > evaluate that option?
>
> I don't think reverting dbae2b062824 would fix David's issue.
>
> The problem is that with MAX_SKB_FRAGS=45, skb_shared_info becomes
> huge, so 1024 is not enough for those small packets, and we use a
> pagefrag instead of kmalloc, which makes napi_get_frags_check unhappy.
>

768 bytes is not huge ...
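(On a 64-bit build: 45 * sizeof(skb_frag_t) = 45 * 16 = 720 bytes of frags,
plus roughly 48 bytes of fixed fields, gives those 768 bytes.)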

> Even after reverting dbae2b062824, we would still go through the
> pagefrag path and not __alloc_skb.
>
> What about something like this?  (boot-tested on x86 only, but I
> disabled NAPI_HAS_SMALL_PAGE_FRAG. no perf testing at all.)
>
> -------- 8< --------
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index f85e6989c36c..88923b7b64fe 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -108,6 +108,8 @@ static struct kmem_cache *skbuff_ext_cache __ro_after_init;
>  #define SKB_SMALL_HEAD_HEADROOM                                                \
>         SKB_WITH_OVERHEAD(SKB_SMALL_HEAD_CACHE_SIZE)
>
> +#define SKB_SMALL_HEAD_THRESHOLD (SKB_SMALL_HEAD_HEADROOM + NET_SKB_PAD + NET_IP_ALIGN)
> +
>  int sysctl_max_skb_frags __read_mostly = MAX_SKB_FRAGS;
>  EXPORT_SYMBOL(sysctl_max_skb_frags);
>
> @@ -726,7 +728,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
>         /* If requested length is either too small or too big,
>          * we use kmalloc() for skb->head allocation.
>          */
> -       if (len <= SKB_WITH_OVERHEAD(1024) ||
> +       if (len <= SKB_SMALL_HEAD_THRESHOLD ||
>             len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
>             (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
>                 skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
> @@ -802,7 +804,7 @@ struct sk_buff *napi_alloc_skb(struct napi_struct *napi, unsigned int len)
>          * When the small frag allocator is available, prefer it over kmalloc
>          * for small fragments
>          */
> -       if ((!NAPI_HAS_SMALL_PAGE_FRAG && len <= SKB_WITH_OVERHEAD(1024)) ||
> +       if ((!NAPI_HAS_SMALL_PAGE_FRAG && len <= SKB_SMALL_HEAD_THRESHOLD) ||
>             len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
>             (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
>                 skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX | SKB_ALLOC_NAPI,
> -------- 8< --------
>
> (__)napi_alloc_skb extends the GRO_MAX_HEAD size by NET_SKB_PAD +
> NET_IP_ALIGN, so I added them here as well. Mainly this is reusing a
> size that we know is big enough to fit a small header and whatever
> size skb_shared_info is on the current build. Maybe this could be
> max(SKB_WITH_OVERHEAD(1024), <...>) to preserve the current behavior
> on MAX_SKB_FRAGS=17, since in that case
> SKB_WITH_OVERHEAD(1024) > SKB_SMALL_HEAD_HEADROOM
>
>

Here we simply use

#define GRO_MAX_HEAD 192
David J Wilder April 26, 2024, 6:25 p.m. UTC | #4
On Wed, Apr 24, 2024 at 10:49 PM Sabrina Dubroca <sd@queasysnail.net> wrote:
>>
>> 2024-04-23, 09:56:33 +0200, Paolo Abeni wrote:
>> > On Fri, 2024-04-19 at 15:23 -0700, David J Wilder wrote:
>> > > When testing CONFIG_MAX_SKB_FRAGS=45 on ppc64le and x86_64 I ran into a
>> > > couple of issues.
>> > >
>> > > __napi_alloc_skb() assumes its smallest fragment allocations will fit in
>> > > 1K. When CONFIG_MAX_SKB_FRAGS is increased this may no longer be true,
>> > > resulting in __napi_alloc_skb() falling back to page_frag_alloc(). This
>> > > reintroduces the bug fixed in:
>> > > Commit 3226b158e67c ("net: avoid 32 x truesize under-estimation for
>> > > tiny skbs")
>> > >
>> > > That commit ensured that "small skb head fragments are kmalloc backed,
>> > > so that other objects in the slab page can be reused instead of being held
>> > > as long as skbs are sitting in socket queues."
>> > >
>> > > On ppc64le the warning from napi_get_frags_check() is displayed when
>> > > CONFIG_MAX_SKB_FRAGS is set to 45. The purpose of the warning is to detect
>> > > when an increase of MAX_SKB_FRAGS has reintroduced the aforementioned bug.
>> > > Unfortunately on x86_64 this warning is not seen, even though it should be.
>> > > I found the warning was disabled by:
>> > > commit dbae2b062824 ("net: skb: introduce and use a single page frag
>> > > cache")
>> > >
>> > > This RFC patch to __napi_alloc_skb() determines if an skbuff allocation
>> > > with a head fragment of size GRO_MAX_HEAD will fit in a 1k allocation,
>> > > increasing the allocation to 2k if needed.
>> > >
>> > > I have functionally tested this patch, performance testing is still needed.
>> > >
>> > > TBD: Remove the limitation on 4k page size from the single page frag cache
>> > > allowing ppc64le (64K page size) to benefit from this change.
>> > >
>> > > TBD: I have not addressed the warning in napi_get_frags_check() on x86_64.
>> > > Will the warning still be needed once the other changes are completed?
>> >
>> >
>> > Thanks for the detailed analysis.
>> >
>> > As mentioned by Eric in commit
>> > bf9f1baa279f0758dc2297080360c5a616843927, it should now be possible to
>> > revert dbae2b062824 without incurring performance regressions for the
>> > relevant use case. That has been on my todo list for a long time, but I
>> > was unable to allocate time for it.
>> >
>> > I think such a revert would be preferable. Would you be able to
>> > evaluate that option?

Thanks Paolo, yes, I can evaluate removing dbae2b062824.

>>
>> I don't think reverting dbae2b062824 would fix David's issue.
>>
>> The problem is that with MAX_SKB_FRAGS=45, skb_shared_info becomes
>> huge, so 1024 is not enough for those small packets, and we use a
>> pagefrag instead of kmalloc, which makes napi_get_frags_check unhappy.
>>
>
> 768 bytes is not huge ...
>
>> Even after reverting dbae2b062824, we would still go through the
>> pagefrag path and not __alloc_skb.
>>
>> What about something like this?  (boot-tested on x86 only, but I
>> disabled NAPI_HAS_SMALL_PAGE_FRAG. no perf testing at all.)
>>
>> -------- 8< --------
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index f85e6989c36c..88923b7b64fe 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -108,6 +108,8 @@ static struct kmem_cache *skbuff_ext_cache __ro_after_init;
>>  #define SKB_SMALL_HEAD_HEADROOM                                                \
>>         SKB_WITH_OVERHEAD(SKB_SMALL_HEAD_CACHE_SIZE)
>>
>> +#define SKB_SMALL_HEAD_THRESHOLD (SKB_SMALL_HEAD_HEADROOM + NET_SKB_PAD + NET_IP_ALIGN)
>> +
>>  int sysctl_max_skb_frags __read_mostly = MAX_SKB_FRAGS;
>>  EXPORT_SYMBOL(sysctl_max_skb_frags);
>>
>> @@ -726,7 +728,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
>>         /* If requested length is either too small or too big,
>>          * we use kmalloc() for skb->head allocation.
>>          */
>> -       if (len <= SKB_WITH_OVERHEAD(1024) ||
>> +       if (len <= SKB_SMALL_HEAD_THRESHOLD ||
>>             len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
>>             (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
>>                 skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
>> @@ -802,7 +804,7 @@ struct sk_buff *napi_alloc_skb(struct napi_struct *napi, unsigned int len)
>>          * When the small frag allocator is available, prefer it over kmalloc
>>          * for small fragments
>>          */
>> -       if ((!NAPI_HAS_SMALL_PAGE_FRAG && len <= SKB_WITH_OVERHEAD(1024)) ||
>> +       if ((!NAPI_HAS_SMALL_PAGE_FRAG && len <= SKB_SMALL_HEAD_THRESHOLD) ||
>>             len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
>>             (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
>>                 skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX | SKB_ALLOC_NAPI,
>> -------- 8< --------
>>
>> (__)napi_alloc_skb extends the GRO_MAX_HEAD size by NET_SKB_PAD +
>> NET_IP_ALIGN, so I added them here as well. Mainly this is reusing a
>> size that we know is big enough to fit a small header and whatever
>> size skb_shared_info is on the current build. Maybe this could be
>> max(SKB_WITH_OVERHEAD(1024), <...>) to preserve the current behavior
>> on MAX_SKB_FRAGS=17, since in that case
>> SKB_WITH_OVERHEAD(1024) > SKB_SMALL_HEAD_HEADROOM
>>
>>

Thanks for the suggestion Sabrina, I will incorporate it. 

> 
> Here we simply use
> 
> #define GRO_MAX_HEAD 192

Sorry Eric, where did 192 come from?
Eric Dumazet April 26, 2024, 6:50 p.m. UTC | #5
On Fri, Apr 26, 2024 at 8:28 PM David J Wilder <dwilder@us.ibm.com> wrote:
>
> On Wed, Apr 24, 2024 at 10:49 PM Sabrina Dubroca <sd@queasysnail.net> wrote:
> >>
> >> 2024-04-23, 09:56:33 +0200, Paolo Abeni wrote:
> >> > On Fri, 2024-04-19 at 15:23 -0700, David J Wilder wrote:
> >> > > When testing CONFIG_MAX_SKB_FRAGS=45 on ppc64le and x86_64 I ran into a
> >> > > couple of issues.
> >> > >
> >> > > __napi_alloc_skb() assumes its smallest fragment allocations will fit in
> >> > > 1K. When CONFIG_MAX_SKB_FRAGS is increased this may no longer be true,
> >> > > resulting in __napi_alloc_skb() falling back to page_frag_alloc(). This
> >> > > reintroduces the bug fixed in:
> >> > > Commit 3226b158e67c ("net: avoid 32 x truesize under-estimation for
> >> > > tiny skbs")
> >> > >
> >> > > That commit ensured that "small skb head fragments are kmalloc backed,
> >> > > so that other objects in the slab page can be reused instead of being held
> >> > > as long as skbs are sitting in socket queues."
> >> > >
> >> > > On ppc64le the warning from napi_get_frags_check() is displayed when
> >> > > CONFIG_MAX_SKB_FRAGS is set to 45. The purpose of the warning is to detect
> >> > > when an increase of MAX_SKB_FRAGS has reintroduced the aforementioned bug.
> >> > > Unfortunately on x86_64 this warning is not seen, even though it should be.
> >> > > I found the warning was disabled by:
> >> > > commit dbae2b062824 ("net: skb: introduce and use a single page frag
> >> > > cache")
> >> > >
> >> > > This RFC patch to __napi_alloc_skb() determines if an skbuff allocation
> >> > > with a head fragment of size GRO_MAX_HEAD will fit in a 1k allocation,
> >> > > increasing the allocation to 2k if needed.
> >> > >
> >> > > I have functionally tested this patch, performance testing is still needed.
> >> > >
> >> > > TBD: Remove the limitation on 4k page size from the single page frag cache
> >> > > allowing ppc64le (64K page size) to benefit from this change.
> >> > >
> >> > > TBD: I have not addressed the warning in napi_get_frags_check() on x86_64.
> >> > > Will the warning still be needed once the other changes are completed?
> >> >
> >> >
> >> > Thanks for the detailed analysis.
> >> >
> >> > As mentioned by Eric in commit
> >> > bf9f1baa279f0758dc2297080360c5a616843927, it should now be possible to
> >> > revert dbae2b062824 without incurring performance regressions for the
> >> > relevant use case. That has been on my todo list for a long time, but I
> >> > was unable to allocate time for it.
> >> >
> >> > I think such a revert would be preferable. Would you be able to
> >> > evaluate that option?
>
> Thanks Paolo,  yes, I can evaluate removing dbae2b062824.
>
> >>
> >> I don't think reverting dbae2b062824 would fix David's issue.
> >>
> >> The problem is that with MAX_SKB_FRAGS=45, skb_shared_info becomes
> >> huge, so 1024 is not enough for those small packets, and we use a
> >> pagefrag instead of kmalloc, which makes napi_get_frags_check unhappy.
> >>
> >
> > 768 bytes is not huge ...
> >
> >> Even after reverting dbae2b062824, we would still go through the
> >> pagefrag path and not __alloc_skb.
> >>
> >> What about something like this?  (boot-tested on x86 only, but I
> >> disabled NAPI_HAS_SMALL_PAGE_FRAG. no perf testing at all.)
> >>
> >> -------- 8< --------
> >> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> >> index f85e6989c36c..88923b7b64fe 100644
> >> --- a/net/core/skbuff.c
> >> +++ b/net/core/skbuff.c
> >> @@ -108,6 +108,8 @@ static struct kmem_cache *skbuff_ext_cache __ro_after_init;
> >>  #define SKB_SMALL_HEAD_HEADROOM                                                \
> >>         SKB_WITH_OVERHEAD(SKB_SMALL_HEAD_CACHE_SIZE)
> >>
> >> +#define SKB_SMALL_HEAD_THRESHOLD (SKB_SMALL_HEAD_HEADROOM + NET_SKB_PAD + NET_IP_ALIGN)
> >> +
> >>  int sysctl_max_skb_frags __read_mostly = MAX_SKB_FRAGS;
> >>  EXPORT_SYMBOL(sysctl_max_skb_frags);
> >>
> >> @@ -726,7 +728,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
> >>         /* If requested length is either too small or too big,
> >>          * we use kmalloc() for skb->head allocation.
> >>          */
> >> -       if (len <= SKB_WITH_OVERHEAD(1024) ||
> >> +       if (len <= SKB_SMALL_HEAD_THRESHOLD ||
> >>             len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
> >>             (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
> >>                 skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
> >> @@ -802,7 +804,7 @@ struct sk_buff *napi_alloc_skb(struct napi_struct *napi, unsigned int len)
> >>          * When the small frag allocator is available, prefer it over kmalloc
> >>          * for small fragments
> >>          */
> >> -       if ((!NAPI_HAS_SMALL_PAGE_FRAG && len <= SKB_WITH_OVERHEAD(1024)) ||
> >> +       if ((!NAPI_HAS_SMALL_PAGE_FRAG && len <= SKB_SMALL_HEAD_THRESHOLD) ||
> >>             len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
> >>             (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
> >>                 skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX | SKB_ALLOC_NAPI,
> >> -------- 8< --------
> >>
> >> (__)napi_alloc_skb extends the GRO_MAX_HEAD size by NET_SKB_PAD +
> >> NET_IP_ALIGN, so I added them here as well. Mainly this is reusing a
> >> size that we know is big enough to fit a small header and whatever
> >> size skb_shared_info is on the current build. Maybe this could be
> >> max(SKB_WITH_OVERHEAD(1024), <...>) to preserve the current behavior
> >> on MAX_SKB_FRAGS=17, since in that case
> >> SKB_WITH_OVERHEAD(1024) > SKB_SMALL_HEAD_HEADROOM
> >>
> >>
>
> Thanks for the suggestion Sabrina, I will incorporate it.
>
> >
> > Here we simply use
> >
> > #define GRO_MAX_HEAD 192
>
> Sorry Eric, where did 192 come from?

99.9999% of packets have a total header size smaller than 192 bytes.

The current Linux definition is too big IMO:

#define GRO_MAX_HEAD (MAX_HEADER + 128)

Depending on various CONFIG options, MAX_HEADER can be 176, and the
final GRO_MAX_HEAD can be 304.
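
For reference, how those values are derived (paraphrasing the definitions
in include/linux/netdevice.h; the exact numbers depend on the kernel config):

/* LL_MAX_HEADER: 32 by default; 96 or 128 when CONFIG_WLAN, CONFIG_AX25,
 *   CONFIG_MAC80211_MESH or CONFIG_HYPERV_NET are enabled.
 * MAX_HEADER: LL_MAX_HEADER, plus 48 when IPIP/GRE/SIT/IPv6 tunnel
 *   support is built in, hence up to 128 + 48 = 176.
 * GRO_MAX_HEAD: MAX_HEADER + 128, hence up to 176 + 128 = 304.
 */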

Patch

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b99127712e67..e3b6115a2edc 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -232,18 +232,18 @@  static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr)
  * page - to avoid excessive truesize underestimation
  */
 
-struct page_frag_1k {
+struct page_frag_small {
 	void *va;
 	u16 offset;
 	bool pfmemalloc;
 };
 
-static void *page_frag_alloc_1k(struct page_frag_1k *nc, gfp_t gfp)
+static void *page_frag_alloc_small(struct page_frag_small *nc, gfp_t gfp, unsigned int len)
 {
 	struct page *page;
 	int offset;
 
-	offset = nc->offset - SZ_1K;
+	offset = nc->offset - len;
 	if (likely(offset >= 0))
 		goto use_frag;
 
@@ -253,8 +253,8 @@  static void *page_frag_alloc_1k(struct page_frag_1k *nc, gfp_t gfp)
 
 	nc->va = page_address(page);
 	nc->pfmemalloc = page_is_pfmemalloc(page);
-	offset = PAGE_SIZE - SZ_1K;
-	page_ref_add(page, offset / SZ_1K);
+	offset = PAGE_SIZE - len;
+	page_ref_add(page, offset / len);
 
 use_frag:
 	nc->offset = offset;
@@ -268,10 +268,10 @@  static void *page_frag_alloc_1k(struct page_frag_1k *nc, gfp_t gfp)
 #define NAPI_HAS_SMALL_PAGE_FRAG	0
 #define NAPI_SMALL_PAGE_PFMEMALLOC(nc)	false
 
-struct page_frag_1k {
+struct page_frag_small {
 };
 
-static void *page_frag_alloc_1k(struct page_frag_1k *nc, gfp_t gfp_mask)
+static void *page_frag_alloc_small(struct page_frag_small *nc, gfp_t gfp_mask, unsigned int len)
 {
 	return NULL;
 }
@@ -280,7 +280,7 @@  static void *page_frag_alloc_1k(struct page_frag_1k *nc, gfp_t gfp_mask)
 
 struct napi_alloc_cache {
 	struct page_frag_cache page;
-	struct page_frag_1k page_small;
+	struct page_frag_small page_small;
 	unsigned int skb_count;
 	void *skb_cache[NAPI_SKB_CACHE_SIZE];
 };
@@ -787,6 +787,7 @@  EXPORT_SYMBOL(__netdev_alloc_skb);
  *
  *	%NULL is returned if there is no free memory.
  */
+
 struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 				 gfp_t gfp_mask)
 {
@@ -794,6 +795,7 @@  struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 	struct sk_buff *skb;
 	bool pfmemalloc;
 	void *data;
+	unsigned int size = SZ_1K;
 
 	DEBUG_NET_WARN_ON_ONCE(!in_softirq());
 	len += NET_SKB_PAD + NET_IP_ALIGN;
@@ -801,9 +803,14 @@  struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 	/* If requested length is either too small or too big,
 	 * we use kmalloc() for skb->head allocation.
 	 * When the small frag allocator is available, prefer it over kmalloc
-	 * for small fragments
+	 * for small fragments. Larger MAX_SKB_FRAGS values may require more
+	 * than a 1K allocation; check if a GRO_MAX_HEAD request still fits.
 	 */
-	if ((!NAPI_HAS_SMALL_PAGE_FRAG && len <= SKB_WITH_OVERHEAD(1024)) ||
+
+	if (SKB_WITH_OVERHEAD(1024) < (MAX_HEADER + 128 + NET_SKB_PAD + NET_IP_ALIGN))
+		size = SZ_2K;
+
+	if ((!NAPI_HAS_SMALL_PAGE_FRAG && len <= SKB_WITH_OVERHEAD(size)) ||
 	    len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
 	    (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
 		skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX | SKB_ALLOC_NAPI,
@@ -818,7 +825,7 @@  struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 	if (sk_memalloc_socks())
 		gfp_mask |= __GFP_MEMALLOC;
 
-	if (NAPI_HAS_SMALL_PAGE_FRAG && len <= SKB_WITH_OVERHEAD(1024)) {
+	if (NAPI_HAS_SMALL_PAGE_FRAG && len <= SKB_WITH_OVERHEAD(size)) {
 		/* we are artificially inflating the allocation size, but
 		 * that is not as bad as it may look like, as:
 		 * - 'len' less than GRO_MAX_HEAD makes little sense
@@ -829,13 +836,11 @@  struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 		 *   little networking, as that implies no WiFi and no
 		 *   tunnels support, and 32 bits arches.
 		 */
-		len = SZ_1K;
-
-		data = page_frag_alloc_1k(&nc->page_small, gfp_mask);
+		len = size;
+		data = page_frag_alloc_small(&nc->page_small, gfp_mask, size);
 		pfmemalloc = NAPI_SMALL_PAGE_PFMEMALLOC(nc->page_small);
 	} else {
 		len = SKB_HEAD_ALIGN(len);
-
 		data = page_frag_alloc(&nc->page, len, gfp_mask);
 		pfmemalloc = nc->page.pfmemalloc;
 	}
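
A worked sketch of the fragment accounting in page_frag_alloc_small() above,
assuming 4K pages (the only case NAPI_HAS_SMALL_PAGE_FRAG currently covers):

/* len = 1K: a fresh page starts with refcount 1; offset becomes
 *   4096 - 1024 = 3072 and page_ref_add(page, 3) raises the refcount
 *   to 4, one reference per 1K fragment.
 * len = 2K: offset = 4096 - 2048 = 2048 and page_ref_add(page, 1),
 *   giving refcount 2, one reference per 2K fragment.
 * Each skb consuming a fragment drops one reference when it is freed,
 * so the page is returned once the last fragment is released. With
 * 64K pages (if the 4K limitation is lifted per the TBD above) the
 * same scheme would yield 64 or 32 fragments per page.
 */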