
[v9,net-next,08/15] net: softnet_data: Make xmit per task.

Message ID 20240620132727.660738-9-bigeasy@linutronix.de (mailing list archive)
State Accepted
Commit ecefbc09e8ee768ae85b7bb7a1de8c8287397d68
Delegated to: Netdev Maintainers
Series locking: Introduce nested-BH locking.

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next, async
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1523 this patch: 1523
netdev/build_tools success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers success CCed 15 of 15 maintainers
netdev/build_clang success Errors and warnings before: 2029 this patch: 2029
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 16300 this patch: 16300
netdev/checkpatch warning WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? WARNING: line length of 90 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 101 this patch: 101
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-06-23--18-00 (tests: 660)

Commit Message

Sebastian Andrzej Siewior June 20, 2024, 1:21 p.m. UTC
Softirq is preemptible on PREEMPT_RT. Without a per-CPU lock in
local_bh_disable() there is no guarantee that only one device is
transmitting at a time.
With preemption and multiple senders it is possible that the per-CPU
`recursion' counter gets incremented by different threads and exceeds
XMIT_RECURSION_LIMIT, leading to a false-positive recursion alert.
The `more' member is subject to a similar problem: it can be set by one
thread for one driver and then wrongly consumed by another driver in
another thread.
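
For reference, the counter guards the transmit path roughly like this
(condensed sketch of the __dev_queue_xmit() pattern; qdisc handling,
locking and error handling are omitted):

	if (dev_xmit_recursion())
		goto recursion_alert;	/* limit exceeded */

	dev_xmit_recursion_inc();
	skb = dev_hard_start_xmit(skb, dev, txq, &rc);
	dev_xmit_recursion_dec();

	/* With preemptible softirqs a second sender scheduled on the same
	 * CPU between inc and dec bumps the same per-CPU counter, so the
	 * limit can be reached without any real recursion. */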

Instead of adding a lock to protect the per-CPU variable it is simpler
to make xmit per-task. Sending and receiving skbs always happens in
thread context anyway.

Having a lock to protect the per-CPU counter would needlessly
block/serialize two sending threads. It would also require a recursive
lock so that the owner can increment the counter further.

Make softnet_data.xmit a task_struct member on PREEMPT_RT. Add the
needed wrappers.
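
The `more' flag, in turn, is what drivers consult in their
ndo_start_xmit() to defer doorbell writes. A rough driver-side sketch
(the foo_* helpers are made up for illustration; netdev_xmit_more() is
the real accessor):

	static netdev_tx_t foo_start_xmit(struct sk_buff *skb,
					  struct net_device *dev)
	{
		struct foo_ring *ring = foo_pick_ring(dev, skb);

		foo_post_descriptor(ring, skb);

		/* Kick the hardware only when no further skb follows in
		 * this batch. If `more' lived in per-CPU data, another
		 * preempted sender could overwrite it in between; keeping
		 * it in task_struct makes it consistent per sender. */
		if (!netdev_xmit_more())
			foo_ring_doorbell(ring);

		return NETDEV_TX_OK;
	}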

Cc: Ben Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/netdevice.h      | 42 +++++++++++++++++++++++++---------
 include/linux/netdevice_xmit.h | 13 +++++++++++
 include/linux/sched.h          |  5 +++-
 net/core/dev.c                 | 14 ++++++++++++
 net/core/dev.h                 | 18 +++++++++++++++
 5 files changed, 80 insertions(+), 12 deletions(-)
 create mode 100644 include/linux/netdevice_xmit.h

Comments

Jakub Kicinski June 22, 2024, 2:12 a.m. UTC | #1
On Thu, 20 Jun 2024 15:21:58 +0200 Sebastian Andrzej Siewior wrote:
> +static inline void netdev_xmit_set_more(bool more)
> +{
> +	current->net_xmit.more = more;
> +}
> +
> +static inline bool netdev_xmit_more(void)
> +{
> +	return current->net_xmit.more;
> +}
> +#endif
> +
> +static inline netdev_tx_t __netdev_start_xmit(const struct net_device_ops *ops,
> +					      struct sk_buff *skb, struct net_device *dev,
> +					      bool more)
> +{
> +	netdev_xmit_set_more(more);
> +	return ops->ndo_start_xmit(skb, dev);
> +}

The series looks clean, I'm happy for it to be applied as is.

But I'm curious whether similar helper organization as with the BPF
code would work. By which I mean - instead of read / write helpers
for each member can we not have one helper which returns the struct?
It would be a per-CPU struct on !RT and pointer from current on RT.
Does it change the generated code? Or stripping the __percpu annotation
is a PITA?
Sebastian Andrzej Siewior June 24, 2024, 10:20 a.m. UTC | #2
On 2024-06-21 19:12:45 [-0700], Jakub Kicinski wrote:
> On Thu, 20 Jun 2024 15:21:58 +0200 Sebastian Andrzej Siewior wrote:
> > +static inline void netdev_xmit_set_more(bool more)
> > +{
> > +	current->net_xmit.more = more;
> > +}
> > +
> > +static inline bool netdev_xmit_more(void)
> > +{
> > +	return current->net_xmit.more;
> > +}
> > +#endif
> > +
> > +static inline netdev_tx_t __netdev_start_xmit(const struct net_device_ops *ops,
> > +					      struct sk_buff *skb, struct net_device *dev,
> > +					      bool more)
> > +{
> > +	netdev_xmit_set_more(more);
> > +	return ops->ndo_start_xmit(skb, dev);
> > +}
> 
> The series looks clean, I'm happy for it to be applied as is.
> 
> But I'm curious whether similar helper organization as with the BPF
> code would work. By which I mean - instead of read / write helpers
> for each member can we not have one helper which returns the struct?
> It would be a per-CPU struct on !RT and pointer from current on RT.
> Does it change the generated code? Or stripping the __percpu annotation
> is a PITA?

You are asking for

| #ifndef CONFIG_PREEMPT_RT
| static inline struct netdev_xmit *netdev_get_xmit(void)
| {
|        return this_cpu_ptr(&softnet_data.xmit);
| }
| #else
| static inline struct netdev_xmit *netdev_get_xmit(void)
| {
|        return &current->net_xmit;
| }
| #endif

on one side so that we can have then

| static inline void dev_xmit_recursion_inc(void)
| {
|	netdev_get_xmit()->recursion++;
| }
|
| static inline void dev_xmit_recursion_dec(void)
| {
|	netdev_get_xmit()->recursion--;
| }

This changes the generated code slightly. The inc increases from one
opcode to two; here is a __dev_direct_xmit() snippet:

|         addl $512, %gs:pcpu_hot+8(%rip) #, *_45
local_bh_disable();
|         incw %gs:softnet_data+120(%rip)         # *_44
dev_xmit_recursion_inc();

|         testb   $16, 185(%rbx)  #, dev_24->features
|         je      .L3310  #,
|         movl    $16, %r13d      #, <retval>
|         testb   $5, 208(%r12)   #, MEM[(const struct netdev_queue *)_54].state
|         je      .L3290  #,
|         movl    $512, %esi      #,
^ part of local_bh_enable();
|         decw %gs:softnet_data+120(%rip)         # *_44
dev_xmit_recursion_dec();

|         lea 0(%rip), %rdi       # __here
|         call    __local_bh_enable_ip    #


With the change mentioned above we get:
|         addl $512, %gs:pcpu_hot+8(%rip) #, *_51
local_bh_disable();

|         movq    %gs:this_cpu_off(%rip), %rax    # *_44, tcp_ptr__
|         addw    $1, softnet_data+120(%rax)      #, _48->recursion
two opcodes for dev_xmit_recursion_inc()

|         testb   $16, 185(%rbx)  #, dev_24->features
|         je      .L3310  #,
|         movl    $16, %r13d      #, <retval>
|         testb   $5, 208(%r12)   #, MEM[(const struct netdev_queue *)_60].state
|         je      .L3290  #,
|         movq    %gs:this_cpu_off(%rip), %rax    # *_44, tcp_ptr__
one opcode from dev_xmit_recursion_dec()

|         movl    $512, %esi      #,
part of local_bh_enable()

|         lea 0(%rip), %rdi       # __here
|         subw    $1, softnet_data+120(%rax)      #, _68->recursion
second opcode from dev_xmit_recursion_dec()

|         call    __local_bh_enable_ip    #

So we end up with one additional opcode per usage and I can't tell how
bad it is. The second invocation (dec) was interleaved so it might use
idle cycles. Instead of one optimized operation we get two and the
pointer can't be cached.
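
In C terms the two variants boil down to (sketch; netdev_get_xmit() is
the helper proposed above):

| 	/* current code: one segment-prefixed read-modify-write */
| 	__this_cpu_inc(softnet_data.xmit.recursion);
| 
| 	/* struct-returning helper: materialise the per-CPU pointer first,
| 	 * then add through it -- two instructions, and the pointer cannot
| 	 * be cached across local_bh_enable() */
| 	netdev_get_xmit()->recursion++;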

And in case you ask, the task version looks like this:

|         addl $512, %gs:pcpu_hot+8(%rip) #, *_47
local_bh_disable()

|         movq    %gs:const_pcpu_hot(%rip), %r14  # const_pcpu_hot.D.2663.D.2661.current_task, _44
|         movzwl  2426(%r14), %eax        # MEM[(struct netdev_xmit *)_44 + 2426B].recursion, _45
|         leal    1(%rax), %edx   #, tmp140
|         movw    %dx, 2426(%r14) # tmp140, MEM[(struct netdev_xmit *)_44 + 2426B].recursion

four opcodes for the inc.

|         testb   $16, 185(%rbx)  #, dev_24->features
|         je      .L3311  #,
|         movl    $16, %r13d      #, <retval>
|         testb   $5, 208(%r12)   #, MEM[(const struct netdev_queue *)_56].state
|         je      .L3291  #,
|         movw    %ax, 2426(%r14) # _45, MEM[(struct netdev_xmit *)_44 + 2426B].recursion

but then gcc recycles the initial value. It reloads the value and
decrements it in case it calls the function.

|         movl    $512, %esi      #,
|         lea 0(%rip), %rdi       # __here
|         call    __local_bh_enable_ip    #
| 

Any update request?

Sebastian
Jakub Kicinski June 24, 2024, 11:36 p.m. UTC | #3
On Mon, 24 Jun 2024 12:20:18 +0200 Sebastian Andrzej Siewior wrote:
> Any update request?

Current version is good, then, thanks for investigating!

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c83b390191d47..f6fc9066147d2 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -43,6 +43,7 @@ 
 
 #include <linux/netdev_features.h>
 #include <linux/neighbour.h>
+#include <linux/netdevice_xmit.h>
 #include <uapi/linux/netdevice.h>
 #include <uapi/linux/if_bonding.h>
 #include <uapi/linux/pkt_cls.h>
@@ -3223,13 +3224,7 @@  struct softnet_data {
 	struct sk_buff_head	xfrm_backlog;
 #endif
 	/* written and read only by owning cpu: */
-	struct {
-		u16 recursion;
-		u8  more;
-#ifdef CONFIG_NET_EGRESS
-		u8  skip_txqueue;
-#endif
-	} xmit;
+	struct netdev_xmit xmit;
 #ifdef CONFIG_RPS
 	/* input_queue_head should be written by cpu owning this struct,
 	 * and only read by other cpus. Worth using a cache line.
@@ -3257,10 +3252,18 @@  struct softnet_data {
 
 DECLARE_PER_CPU_ALIGNED(struct softnet_data, softnet_data);
 
+#ifndef CONFIG_PREEMPT_RT
 static inline int dev_recursion_level(void)
 {
 	return this_cpu_read(softnet_data.xmit.recursion);
 }
+#else
+static inline int dev_recursion_level(void)
+{
+	return current->net_xmit.recursion;
+}
+
+#endif
 
 void __netif_schedule(struct Qdisc *q);
 void netif_schedule_queue(struct netdev_queue *txq);
@@ -4872,18 +4875,35 @@  static inline ktime_t netdev_get_tstamp(struct net_device *dev,
 	return hwtstamps->hwtstamp;
 }
 
-static inline netdev_tx_t __netdev_start_xmit(const struct net_device_ops *ops,
-					      struct sk_buff *skb, struct net_device *dev,
-					      bool more)
+#ifndef CONFIG_PREEMPT_RT
+static inline void netdev_xmit_set_more(bool more)
 {
 	__this_cpu_write(softnet_data.xmit.more, more);
-	return ops->ndo_start_xmit(skb, dev);
 }
 
 static inline bool netdev_xmit_more(void)
 {
 	return __this_cpu_read(softnet_data.xmit.more);
 }
+#else
+static inline void netdev_xmit_set_more(bool more)
+{
+	current->net_xmit.more = more;
+}
+
+static inline bool netdev_xmit_more(void)
+{
+	return current->net_xmit.more;
+}
+#endif
+
+static inline netdev_tx_t __netdev_start_xmit(const struct net_device_ops *ops,
+					      struct sk_buff *skb, struct net_device *dev,
+					      bool more)
+{
+	netdev_xmit_set_more(more);
+	return ops->ndo_start_xmit(skb, dev);
+}
 
 static inline netdev_tx_t netdev_start_xmit(struct sk_buff *skb, struct net_device *dev,
 					    struct netdev_queue *txq, bool more)
diff --git a/include/linux/netdevice_xmit.h b/include/linux/netdevice_xmit.h
new file mode 100644
index 0000000000000..38325e0702968
--- /dev/null
+++ b/include/linux/netdevice_xmit.h
@@ -0,0 +1,13 @@ 
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _LINUX_NETDEVICE_XMIT_H
+#define _LINUX_NETDEVICE_XMIT_H
+
+struct netdev_xmit {
+	u16 recursion;
+	u8  more;
+#ifdef CONFIG_NET_EGRESS
+	u8  skip_txqueue;
+#endif
+};
+
+#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 61591ac6eab6d..5187486c25222 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -36,6 +36,7 @@ 
 #include <linux/signal_types.h>
 #include <linux/syscall_user_dispatch_types.h>
 #include <linux/mm_types_task.h>
+#include <linux/netdevice_xmit.h>
 #include <linux/task_io_accounting.h>
 #include <linux/posix-timers_types.h>
 #include <linux/restart_block.h>
@@ -975,7 +976,9 @@  struct task_struct {
 	/* delay due to memory thrashing */
 	unsigned                        in_thrashing:1;
 #endif
-
+#ifdef CONFIG_PREEMPT_RT
+	struct netdev_xmit		net_xmit;
+#endif
 	unsigned long			atomic_flags; /* Flags requiring atomic access. */
 
 	struct restart_block		restart_block;
diff --git a/net/core/dev.c b/net/core/dev.c
index 093d82bf0e288..95b9e4cc17676 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3940,6 +3940,7 @@  netdev_tx_queue_mapping(struct net_device *dev, struct sk_buff *skb)
 	return netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, qm));
 }
 
+#ifndef CONFIG_PREEMPT_RT
 static bool netdev_xmit_txqueue_skipped(void)
 {
 	return __this_cpu_read(softnet_data.xmit.skip_txqueue);
@@ -3950,6 +3951,19 @@  void netdev_xmit_skip_txqueue(bool skip)
 	__this_cpu_write(softnet_data.xmit.skip_txqueue, skip);
 }
 EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue);
+
+#else
+static bool netdev_xmit_txqueue_skipped(void)
+{
+	return current->net_xmit.skip_txqueue;
+}
+
+void netdev_xmit_skip_txqueue(bool skip)
+{
+	current->net_xmit.skip_txqueue = skip;
+}
+EXPORT_SYMBOL_GPL(netdev_xmit_skip_txqueue);
+#endif
 #endif /* CONFIG_NET_EGRESS */
 
 #ifdef CONFIG_NET_XGRESS
diff --git a/net/core/dev.h b/net/core/dev.h
index 58f88d28bc994..5654325c5b710 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -150,6 +150,8 @@  struct napi_struct *napi_by_id(unsigned int napi_id);
 void kick_defer_list_purge(struct softnet_data *sd, unsigned int cpu);
 
 #define XMIT_RECURSION_LIMIT	8
+
+#ifndef CONFIG_PREEMPT_RT
 static inline bool dev_xmit_recursion(void)
 {
 	return unlikely(__this_cpu_read(softnet_data.xmit.recursion) >
@@ -165,6 +167,22 @@  static inline void dev_xmit_recursion_dec(void)
 {
 	__this_cpu_dec(softnet_data.xmit.recursion);
 }
+#else
+static inline bool dev_xmit_recursion(void)
+{
+	return unlikely(current->net_xmit.recursion > XMIT_RECURSION_LIMIT);
+}
+
+static inline void dev_xmit_recursion_inc(void)
+{
+	current->net_xmit.recursion++;
+}
+
+static inline void dev_xmit_recursion_dec(void)
+{
+	current->net_xmit.recursion--;
+}
+#endif
 
 int dev_set_hwtstamp_phylib(struct net_device *dev,
 			    struct kernel_hwtstamp_config *cfg,