[net-next] net: skbuff: Skip early return in skb_unref when debugging

Message ID 20240729104741.370327-1-leitao@debian.org (mailing list archive)
State Accepted
Commit c9c0ee5f20c593faf289fa8850c3ed84031dd12a
Delegated to: Netdev Maintainers
Series [net-next] net: skbuff: Skip early return in skb_unref when debugging

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 91 this patch: 91
netdev/build_tools success Errors and warnings before: 10 this patch: 10
netdev/cc_maintainers fail 1 maintainers not CCed: almasrymina@google.com
netdev/build_clang success Errors and warnings before: 153 this patch: 153
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 5110 this patch: 5110
netdev/checkpatch warning WARNING: line length of 85 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 59 this patch: 59
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-07-29--18-00 (tests: 703)

Commit Message

Breno Leitao July 29, 2024, 10:47 a.m. UTC
This patch modifies the skb_unref function to skip the early return
optimization when CONFIG_DEBUG_NET is enabled. The change ensures that
the reference count decrement always occurs in debug builds, allowing
for more thorough checking of SKB reference counting.

Previously, when the SKB's reference count was 1 and CONFIG_DEBUG_NET
was not set, the function would return early after a memory barrier
(smp_rmb()) without decrementing the reference count. This optimization
assumes it's safe to proceed with freeing the SKB without the overhead
of an atomic decrement from 1 to 0.

With this change:
- In non-debug builds (CONFIG_DEBUG_NET not set), behavior remains
  unchanged, preserving the performance optimization.
- In debug builds (CONFIG_DEBUG_NET set), the reference count is always
  decremented, even when it's 1, allowing for consistent behavior and
  potentially catching subtle SKB management bugs.

This modification enhances debugging capabilities for networking code
without impacting performance in production kernels. It helps kernel
developers identify and diagnose issues related to SKB management and
reference counting in the network stack.

Cc: Chris Mason <clm@fb.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 include/linux/skbuff.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
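
For reference, here is a sketch of the resulting helper with the patch
applied (reconstructed from the hunk at the bottom of this page plus the
surrounding skb_unref() body in the kernel tree; the trailing return true
sits just outside the diff context):

static inline bool skb_unref(struct sk_buff *skb)
{
	if (unlikely(!skb))
		return false;
	/*
	 * Fast path: with users == 1 we are the sole owner, so the atomic
	 * decrement is skipped and smp_rmb() orders the refcount read
	 * against the frees that follow. With CONFIG_DEBUG_NET=y this
	 * branch is compiled out, so the count is always decremented.
	 */
	if (!IS_ENABLED(CONFIG_DEBUG_NET) &&
	    likely(refcount_read(&skb->users) == 1))
		smp_rmb();
	else if (likely(!refcount_dec_and_test(&skb->users)))
		return false;

	return true;	/* the caller may go on to free the skb */
}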

Comments

Paolo Abeni July 30, 2024, 9:38 a.m. UTC | #1
On 7/29/24 12:47, Breno Leitao wrote:
> This patch modifies the skb_unref function to skip the early return
> optimization when CONFIG_DEBUG_NET is enabled. The change ensures that
> the reference count decrement always occurs in debug builds, allowing
> for more thorough checking of SKB reference counting.
> 
> Previously, when the SKB's reference count was 1 and CONFIG_DEBUG_NET
> was not set, the function would return early after a memory barrier
> (smp_rmb()) without decrementing the reference count. This optimization
> assumes it's safe to proceed with freeing the SKB without the overhead
> of an atomic decrement from 1 to 0.
> 
> With this change:
> - In non-debug builds (CONFIG_DEBUG_NET not set), behavior remains
>    unchanged, preserving the performance optimization.
> - In debug builds (CONFIG_DEBUG_NET set), the reference count is always
>    decremented, even when it's 1, allowing for consistent behavior and
>    potentially catching subtle SKB management bugs.
> 
> This modification enhances debugging capabilities for networking code
> without impacting performance in production kernels. It helps kernel
> developers identify and diagnose issues related to SKB management and
> reference counting in the network stack.
> 
> Cc: Chris Mason <clm@fb.com>
> Suggested-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
>   include/linux/skbuff.h | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 29c3ea5b6e93..cf8f6ce06742 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1225,7 +1225,7 @@ static inline bool skb_unref(struct sk_buff *skb)
>   {
>   	if (unlikely(!skb))
>   		return false;
> -	if (likely(refcount_read(&skb->users) == 1))
> +	if (!IS_ENABLED(CONFIG_DEBUG_NET) && likely(refcount_read(&skb->users) == 1))
>   		smp_rmb();
>   	else if (likely(!refcount_dec_and_test(&skb->users)))
>   		return false;

I think one assumption behind CONFIG_DEBUG_NET is that enabling such a
config should not have any measurable impact on performance.

I suspect the above could indeed cause some measurable impact, e.g.
under UDP flood, when the user-space receiver and the BH run on
different cores, as this will increase pressure on the CPU cache. Could
you please benchmark such a scenario before and after this patch?

Thanks!

Paolo
Breno Leitao July 30, 2024, 10:24 a.m. UTC | #2
Hello Paolo,

On Tue, Jul 30, 2024 at 11:38:38AM +0200, Paolo Abeni wrote:
> On 7/29/24 12:47, Breno Leitao wrote:

> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -1225,7 +1225,7 @@ static inline bool skb_unref(struct sk_buff *skb)
> >   {
> >   	if (unlikely(!skb))
> >   		return false;
> > -	if (likely(refcount_read(&skb->users) == 1))
> > +	if (!IS_ENABLED(CONFIG_DEBUG_NET) && likely(refcount_read(&skb->users) == 1))
> >   		smp_rmb();
> >   	else if (likely(!refcount_dec_and_test(&skb->users)))
> >   		return false;
 
> I think one assumption behind CONFIG_DEBUG_NET is that enabling such a config
> should not have any measurable impact on performance.
> 
> I suspect the above could indeed cause some measurable impact, e.g. under
> UDP flood, when the user-space receiver and the BH run on different cores,
> as this will increase pressure on the CPU cache. Could you please benchmark
> such a scenario before and after this patch?

Sure, I am more than happy to do so. I will be back with it soon.

Assuming there is some performance overhead, isn't it a worthwhile
trade-off for those who are debugging the network? In other words,
wouldn't it be better to prioritize correctness over optimization in
the CONFIG_DEBUG_NET case, even if it means sacrificing some
performance?

Thanks for reviewing,
--breno
Florian Westphal July 30, 2024, 10:50 a.m. UTC | #3
Paolo Abeni <pabeni@redhat.com> wrote:
> >   	else if (likely(!refcount_dec_and_test(&skb->users)))
> >   		return false;
> 
> I think one assumption behind CONFIG_DEBUG_NET is that enabling such a config
> should not have any measurable impact on performance.

If that's the case, why does it exist at all?

I was under the impression that the entire reason for CONFIG_DEBUG_NET
was to enable more checks for fuzzers and the like, i.e. NOT for
production kernels.
Paolo Abeni July 30, 2024, 11:15 a.m. UTC | #4
On 7/30/24 12:50, Florian Westphal wrote:
> Paolo Abeni <pabeni@redhat.com> wrote:
>>>    	else if (likely(!refcount_dec_and_test(&skb->users)))
>>>    		return false;
>>
>> I think one assumption behind CONFIG_DEBUG_NET is that enabling such a config
>> should not have any measurable impact on performance.
> 
> If that's the case, why does it exist at all?
> 
> I was under the impression that the entire reason for CONFIG_DEBUG_NET
> was to enable more checks for fuzzers and the like, i.e. NOT for
> production kernels.

I feel like I already had this discussion and forgot the outcome; if
so, I'm sorry. To me, the "but is safe to select." part of the knob
description means this could be enabled in production, and AFAICS the
CONFIG_DEBUG_NET-enabled code so far respects that assumption.

Thanks,

Paolo
Jakub Kicinski July 30, 2024, 2:10 p.m. UTC | #5
On Tue, 30 Jul 2024 13:15:57 +0200 Paolo Abeni wrote:
> > If that's the case, why does it exist at all?

+1

> > I was under the impression that the entire reason for CONFIG_DEBUG_NET
> > was to enable more checks for fuzzers and the like, i.e. NOT for
> > production kernels.
> 
> I feel like I already had this discussion and forgot the outcome; if
> so, I'm sorry. To me, the "but is safe to select." part of the knob
> description means this could be enabled in production, and AFAICS the
> CONFIG_DEBUG_NET-enabled code so far respects that assumption.

I believe the previous discussion was page pool specific and there
wasn't so much a conclusion as an acquiescence (read: we had more
important things on our minds than that argument ;)).

Should we set a bar for how much perf impact is okay?

FTR I suspect there will be no measurable perf impact here.
Eric Dumazet July 30, 2024, 2:21 p.m. UTC | #6
On Tue, Jul 30, 2024 at 4:10 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 30 Jul 2024 13:15:57 +0200 Paolo Abeni wrote:
> > > If that's the case, why does it exist at all?
>
> +1
>
> > > I was under the impression that the entire reason for CONFIG_DEBUG_NET
> > > was to enable more checks for fuzzers and the like, i.e. NOT for
> > > production kernels.
> >
> > I feel like I already had this discussion and forgot the outcome; if
> > so, I'm sorry. To me, the "but is safe to select." part of the knob
> > description means this could be enabled in production, and AFAICS the
> > CONFIG_DEBUG_NET-enabled code so far respects that assumption.
>
> I believe the previous discussion was page pool specific and there
> wasn't so much a conclusion as an acquiescence (read: we had more
> important things on our minds than that argument ;)).
>
> Should we set a bar for how much perf impact is okay?
>
> FTR I suspect there will be no measurable perf impact here.

Same feeling here; I think this patch is fine, really.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Paolo Abeni July 30, 2024, 2:37 p.m. UTC | #7
On 7/30/24 16:10, Jakub Kicinski wrote:
> On Tue, 30 Jul 2024 13:15:57 +0200 Paolo Abeni wrote:
>>> I was under the impression that the entire reason for CONFIG_DEBUG_NET
>>> was to enable more checks for fuzzers and the like, i.e. NOT for
>>> production kernels.
>>
>> I feel like I already had this discussion and forgot the outcome; if
>> so, I'm sorry. To me, the "but is safe to select." part of the knob
>> description means this could be enabled in production, and AFAICS the
>> CONFIG_DEBUG_NET-enabled code so far respects that assumption.
> 
> I believe the previous discussion was page pool specific and there
> wasn't so much a conclusion as an acquiescence (read: we had more
> important things on our minds than that argument ;)).
> 
> Should we set a bar for how much perf impact is okay?

I think that better specifying the general guidance/expectation should
be enough. What about extending the knob description with something like:
---
diff --git a/net/Kconfig.debug b/net/Kconfig.debug
index 5e3fffe707dd..058cf031913b 100644
--- a/net/Kconfig.debug
+++ b/net/Kconfig.debug
@@ -24,3 +24,5 @@ config DEBUG_NET
         help
           Enable extra sanity checks in networking.
           This is mostly used by fuzzers, but is safe to select.
+         This could introduce some very minimal overhead and
+         is not suggested for production systems.
Jakub Kicinski July 30, 2024, 2:48 p.m. UTC | #8
On Tue, 30 Jul 2024 16:37:10 +0200 Paolo Abeni wrote:
> I think that better specifying the general guidance/expectation should
> be enough. What about extending the knob description with something like:
> ---
> diff --git a/net/Kconfig.debug b/net/Kconfig.debug
> index 5e3fffe707dd..058cf031913b 100644
> --- a/net/Kconfig.debug
> +++ b/net/Kconfig.debug
> @@ -24,3 +24,5 @@ config DEBUG_NET
>          help
>            Enable extra sanity checks in networking.
>            This is mostly used by fuzzers, but is safe to select.
> +         This could introduce some very minimal overhead and
> +         is not suggested for production systems.

I'd go with:

             Enable extra sanity checks in networking.
-            This is mostly used by fuzzers, but is safe to select.
+            This is mostly used by fuzzers, and may introduce some
+            minimal overhead, but is safe to select.

What's acceptable on prod systems really depends on the workload...
Breno Leitao July 31, 2024, 11:24 a.m. UTC | #9
Hello Paolo,

On Tue, Jul 30, 2024 at 11:38:38AM +0200, Paolo Abeni wrote:
> Could you please benchmark such a scenario before and after this patch?

I've tested it on an 18-core Xeon D-2191A host, and I haven't found any
difference in either TX/RX in TCP or UDP. At the same time, I must admit
that I have very low confidence in my tests.

I ran the following tests 10x on the same machine, just toggling my
patch, and took the simple average of these 10 iterations. This is
what I am doing for TCP and UDP:

TCP:
	# iperf -s &
	# iperf -c localhost

	Output: 16.5 Gbits/sec

UDP:
	# iperf -s -u &
	# iperf -u -c localhost

	Output: 1.05 Mbits/sec

I don't know how to explain why the UDP numbers are so low. I am happy to
run different tests, if you have any other recommendation.

--breno
Jason Xing July 31, 2024, 11:53 a.m. UTC | #10
On Wed, Jul 31, 2024 at 7:25 PM Breno Leitao <leitao@debian.org> wrote:
>
> Hello Paolo,
>
> On Tue, Jul 30, 2024 at 11:38:38AM +0200, Paolo Abeni wrote:
> > Could you please benchmark such a scenario before and after this patch?
>
> I've tested it on an 18-core Xeon D-2191A host, and I haven't found any
> difference in either TX/RX in TCP or UDP. At the same time, I must admit
> that I have very low confidence in my tests.
>
> I ran the following tests 10x on the same machine, just toggling my
> patch, and took the simple average of these 10 iterations. This is
> what I am doing for TCP and UDP:
>
> TCP:
>         # iperf -s &
>         # iperf -c localhost
>
>         Output: 16.5 Gbits/sec
>
> UDP:
>         # iperf -s -u &
>         # iperf -u -c localhost
>
>         Output: 1.05 Mbits/sec
>
> I don't know how to explain why the UDP numbers are so low. I am happy to
> run different tests, if you have any other recommendation.

I think the iperf tool defaults to '-b 1M' (1 Mbit/sec), which is
explained in the man page:
CLIENT SPECIFIC OPTIONS
       -b, --bandwidth n[kmgKMG][,n[kmgKMG]] | n[kmgKMG]pps
              set target bandwidth to n bits/sec (default 1 Mbit/sec)
              or n packets per sec. This may be used with TCP or UDP.
              Optionally, for variable loads, use format of
              mean,standard deviation

So, if you pass a higher target like '-b 40M', it will reach around
40 Mbit/sec.

If I were you, I would try the following:
1) iperf3 -s
2) iperf3 -u -c 127.0.0.1 -b 0 -l 64

Hope it can help you :)

Thanks,
Jason

>
> --breno
>
Paolo Abeni Aug. 1, 2024, 9:07 a.m. UTC | #11
On 7/31/24 13:24, Breno Leitao wrote:
> Hello Paolo,
> 
> On Tue, Jul 30, 2024 at 11:38:38AM +0200, Paolo Abeni wrote:
>> Could you please benchmark such a scenario before and after this patch?
> 
> I've tested it on an 18-core Xeon D-2191A host, and I haven't found any
> difference in either TX/RX in TCP or UDP. At the same time, I must admit
> that I have very low confidence in my tests.
> 
> I ran the following tests 10x on the same machine, just toggling my
> patch, and took the simple average of these 10 iterations. This is
> what I am doing for TCP and UDP:
> 
> TCP:
> 	# iperf -s &
> 	# iperf -c localhost
> 
> 	Output: 16.5 Gbits/sec
> 
> UDP:
> 	# iperf -s -u &
> 	# iperf -u -c localhost
> 
> 	Output: 1.05 Mbits/sec
> 
> I don't know how to explain why the UDP numbers are so low. I am happy to
> run different tests, if you have any other recommendation.

Beyond the '-b 0' argument, as noted by Jason, you need to do manual CPU
pinning of both the sender and the receiver. Additionally, to really
flood the receiver, you likely have to place the sender on a different host.

In any case, given all the prior discussion, I don't intend to block 
this patch.

Cheers,

Paolo
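
For anyone reproducing this, a minimal sketch combining Jason's iperf3
flags with the pinning Paolo describes could look like the following
(the core IDs and the receiver address are illustrative; adjust them to
the actual host topology):

	# on the receiver, pin the server to a dedicated core
	taskset -c 0 iperf3 -s

	# on the sender (ideally a separate host), pin to a different core
	# and flood the receiver with small, unthrottled UDP packets
	taskset -c 4 iperf3 -u -c <receiver-ip> -b 0 -l 64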
patchwork-bot+netdevbpf@kernel.org Aug. 1, 2024, 9:30 a.m. UTC | #12
Hello:

This patch was applied to netdev/net-next.git (main)
by Paolo Abeni <pabeni@redhat.com>:

On Mon, 29 Jul 2024 03:47:40 -0700 you wrote:
> This patch modifies the skb_unref function to skip the early return
> optimization when CONFIG_DEBUG_NET is enabled. The change ensures that
> the reference count decrement always occurs in debug builds, allowing
> for more thorough checking of SKB reference counting.
> 
> Previously, when the SKB's reference count was 1 and CONFIG_DEBUG_NET
> was not set, the function would return early after a memory barrier
> (smp_rmb()) without decrementing the reference count. This optimization
> assumes it's safe to proceed with freeing the SKB without the overhead
> of an atomic decrement from 1 to 0.
> 
> [...]

Here is the summary with links:
  - [net-next] net: skbuff: Skip early return in skb_unref when debugging
    https://git.kernel.org/netdev/net-next/c/c9c0ee5f20c5

You are awesome, thank you!

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 29c3ea5b6e93..cf8f6ce06742 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1225,7 +1225,7 @@ static inline bool skb_unref(struct sk_buff *skb)
 {
 	if (unlikely(!skb))
 		return false;
-	if (likely(refcount_read(&skb->users) == 1))
+	if (!IS_ENABLED(CONFIG_DEBUG_NET) && likely(refcount_read(&skb->users) == 1))
 		smp_rmb();
 	else if (likely(!refcount_dec_and_test(&skb->users)))
 		return false;