[net] net: lwtunnel: disable preemption when required

Message ID 20250403083956.13946-1-justin.iurman@uliege.be (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series [net] net: lwtunnel: disable preemption when required

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 7 of 7 maintainers
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 1 this patch: 1
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 112 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Justin Iurman April 3, 2025, 8:39 a.m. UTC
In lwtunnel_{input|output|xmit}(), dev_xmit_recursion() may be called in
preemptible scope for PREEMPT kernels. This patch disables preemption
before calling dev_xmit_recursion(). Preemption is re-enabled only at
the end, since we must ensure the same CPU is used for both
dev_xmit_recursion_inc() and dev_xmit_recursion_dec() (and any other
recursion levels in some cases) in order to maintain valid per-cpu
counters.

Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Closes: https://lore.kernel.org/netdev/CAADnVQJFWn3dBFJtY+ci6oN1pDFL=TzCmNbRgey7MdYxt_AP2g@mail.gmail.com/
Fixes: 986ffb3a57c5 ("net: lwtunnel: fix recursion loops")
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
---
Cc: bpf <bpf@vger.kernel.org>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Stanislav Fomichev <stfomichev@gmail.com>
---
 net/core/lwtunnel.c | 47 ++++++++++++++++++++++++++++-----------------
 1 file changed, 29 insertions(+), 18 deletions(-)
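
For context, a rough sketch of the recursion helpers the commit message refers to, paraphrased from include/linux/netdevice.h (details vary by kernel version). On !PREEMPT_RT the counter lives in per-CPU data, which is why the inc and the matching dec must run on the same CPU:

	/* !PREEMPT_RT variant, roughly: per-CPU counter in softnet_data. */
	static inline bool dev_xmit_recursion(void)
	{
		return unlikely(__this_cpu_read(softnet_data.xmit.recursion) >
				XMIT_RECURSION_LIMIT);
	}

	static inline void dev_xmit_recursion_inc(void)
	{
		__this_cpu_inc(softnet_data.xmit.recursion);
	}

	static inline void dev_xmit_recursion_dec(void)
	{
		__this_cpu_dec(softnet_data.xmit.recursion);
	}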

Comments

Stanislav Fomichev April 3, 2025, 4:24 p.m. UTC | #1
On 04/03, Justin Iurman wrote:
> In lwtunnel_{input|output|xmit}(), dev_xmit_recursion() may be called in
> preemptible scope for PREEMPT kernels. This patch disables preemption
> before calling dev_xmit_recursion(). Preemption is re-enabled only at
> the end, since we must ensure the same CPU is used for both
> dev_xmit_recursion_inc() and dev_xmit_recursion_dec() (and any other
> recursion levels in some cases) in order to maintain valid per-cpu
> counters.

Dummy question: CONFIG_PREEMPT_RT uses current->net_xmit.recursion to
track the recursion. Any reason not to do it in the generic PREEMPT case?
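
(For reference, the PREEMPT_RT variant Stanislav mentions tracks the depth on the task instead of the CPU; roughly, again paraphrased from include/linux/netdevice.h:)

	/* PREEMPT_RT variant, roughly: the counter travels with the task,
	 * so preemption and CPU migration are harmless. */
	static inline bool dev_xmit_recursion(void)
	{
		return unlikely(current->net_xmit.recursion > XMIT_RECURSION_LIMIT);
	}

	static inline void dev_xmit_recursion_inc(void)
	{
		current->net_xmit.recursion++;
	}

	static inline void dev_xmit_recursion_dec(void)
	{
		current->net_xmit.recursion--;
	}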
Justin Iurman April 3, 2025, 7:08 p.m. UTC | #2
On 4/3/25 18:24, Stanislav Fomichev wrote:
> On 04/03, Justin Iurman wrote:
>> In lwtunnel_{input|output|xmit}(), dev_xmit_recursion() may be called in
>> preemptible scope for PREEMPT kernels. This patch disables preemption
>> before calling dev_xmit_recursion(). Preemption is re-enabled only at
>> the end, since we must ensure the same CPU is used for both
>> dev_xmit_recursion_inc() and dev_xmit_recursion_dec() (and any other
>> recursion levels in some cases) in order to maintain valid per-cpu
>> counters.
> 
> Dummy question: CONFIG_PREEMPT_RT uses current->net_xmit.recursion to
> track the recursion. Any reason not to do it in the generic PREEMPT case?

I'd say PREEMPT_RT is a different beast. IMO, softirqs can be 
preempted/migrated in RT kernels, which is not true for non-RT kernels. 
Maybe RT kernels could use __this_cpu_* instead of "current" though, but 
it would be less trivial. For example, see commit ecefbc09e8ee ("net: 
softnet_data: Make xmit per task.") on why it makes sense to use 
"current" in RT kernels. I guess the opposite as you suggest (i.e., 
non-RT kernels using "current") would be technically possible, but there 
must be a reason it is defined the way it is... so probably incorrect or 
inefficient?
Alexei Starovoitov April 3, 2025, 8:35 p.m. UTC | #3
On Thu, Apr 3, 2025 at 12:08 PM Justin Iurman <justin.iurman@uliege.be> wrote:
>
> On 4/3/25 18:24, Stanislav Fomichev wrote:
> > On 04/03, Justin Iurman wrote:
> >> In lwtunnel_{input|output|xmit}(), dev_xmit_recursion() may be called in
> >> preemptible scope for PREEMPT kernels. This patch disables preemption
> >> before calling dev_xmit_recursion(). Preemption is re-enabled only at
> >> the end, since we must ensure the same CPU is used for both
> >> dev_xmit_recursion_inc() and dev_xmit_recursion_dec() (and any other
> >> recursion levels in some cases) in order to maintain valid per-cpu
> >> counters.
> >
> > Dummy question: CONFIG_PREEMPT_RT uses current->net_xmit.recursion to
> > track the recursion. Any reason not to do it in the generic PREEMPT case?
>
> I'd say PREEMPT_RT is a different beast. IMO, softirqs can be
> preempted/migrated in RT kernels, which is not true for non-RT kernels.
> Maybe RT kernels could use __this_cpu_* instead of "current" though, but
> it would be less trivial. For example, see commit ecefbc09e8ee ("net:
> softnet_data: Make xmit per task.") on why it makes sense to use
> "current" in RT kernels. I guess the opposite as you suggest (i.e.,
> non-RT kernels using "current") would be technically possible, but there
> must be a reason it is defined the way it is... so probably incorrect or
> inefficient?

Stating the obvious...
Sebastian did a lot of work removing preempt_disable from the networking
stack.
We're certainly not adding them back.
This patch is no go.
Jakub Kicinski April 3, 2025, 11:38 p.m. UTC | #4
On Thu, 3 Apr 2025 21:08:12 +0200 Justin Iurman wrote:
> On 4/3/25 18:24, Stanislav Fomichev wrote:
> > On 04/03, Justin Iurman wrote:  
> >> In lwtunnel_{input|output|xmit}(), dev_xmit_recursion() may be called in
> >> preemptible scope for PREEMPT kernels. This patch disables preemption
> >> before calling dev_xmit_recursion(). Preemption is re-enabled only at
> >> the end, since we must ensure the same CPU is used for both
> >> dev_xmit_recursion_inc() and dev_xmit_recursion_dec() (and any other
> >> recursion levels in some cases) in order to maintain valid per-cpu
> >> counters.  
> > 
> > Dummy question: CONFIG_PREEMPT_RT uses current->net_xmit.recursion to
> > track the recursion. Any reason not to do it in the generic PREEMPT case?  
> 
> I'd say PREEMPT_RT is a different beast. IMO, softirqs can be 
> preempted/migrated in RT kernels, which is not true for non-RT kernels. 
> Maybe RT kernels could use __this_cpu_* instead of "current" though, but 
> it would be less trivial. For example, see commit ecefbc09e8ee ("net: 
> softnet_data: Make xmit per task.") on why it makes sense to use 
> "current" in RT kernels. I guess the opposite as you suggest (i.e., 
> non-RT kernels using "current") would be technically possible, but there 
> must be a reason it is defined the way it is... so probably incorrect or 
> inefficient?

I suspect it's to avoid the performance overhead.
IIUC you would be better off using local_bh_disable() here.
It doesn't disable preemption on RT.
I don't believe a "disable preemption if !RT" primitive exists.
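
(A condensed sketch of the suggested shape, assuming the guarded section from the patch under discussion; do_encap() is a hypothetical stand-in for the rcu-protected ops->output() call, and the error handling is simplified:)

	int lwtunnel_output_sketch(struct net *net, struct sock *sk,
				   struct sk_buff *skb)
	{
		int ret;

		local_bh_disable();	/* instead of preempt_disable() */
		if (dev_xmit_recursion()) {
			kfree_skb(skb);
			ret = -ENETDOWN;	/* simplified drop path */
			goto out;
		}
		dev_xmit_recursion_inc();
		ret = do_encap(net, sk, skb);	/* hypothetical placeholder */
		dev_xmit_recursion_dec();
	out:
		local_bh_enable();
		return ret;
	}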
Sebastian Andrzej Siewior April 4, 2025, 2:19 p.m. UTC | #5
Alexei, thank you for the Cc.

On 2025-04-03 13:35:10 [-0700], Alexei Starovoitov wrote:
> Stating the obvious...
> Sebastian did a lot of work removing preempt_disable from the networking
> stack.
> We're certainly not adding them back.
> This patch is no go.

While looking through the code, it looks as if lwtunnel_xmit() lacks a
local_bh_disable().

Sebastian
Justin Iurman April 6, 2025, 8:59 a.m. UTC | #6
On 4/4/25 16:19, Sebastian Siewior wrote:
> Alexei, thank you for the Cc.
> 
> On 2025-04-03 13:35:10 [-0700], Alexei Starovoitov wrote:
>> Stating the obvious...
>> Sebastian did a lot of work removing preempt_disable from the networking
>> stack.
>> We're certainly not adding them back.
>> This patch is no go.
> 
> While looking through the code, it looks as if lwtunnel_xmit() lacks a
> local_bh_disable().

Thanks Sebastian for the confirmation, as the initial idea was to use 
local_bh_disable() as well. Then I thought preempt_disable() would be 
enough in this context, but I didn't realize you made efforts to remove 
it from the networking stack.

@Alexei, just to clarify: would you ACK this patch if we do 
s/preempt_{disable|enable}()/local_bh_{disable|enable}()/g ?

> Sebastian
Alexei Starovoitov April 7, 2025, 5:54 p.m. UTC | #7
On Sun, Apr 6, 2025 at 1:59 AM Justin Iurman <justin.iurman@uliege.be> wrote:
>
> On 4/4/25 16:19, Sebastian Siewior wrote:
> > Alexei, thank you for the Cc.
> >
> > On 2025-04-03 13:35:10 [-0700], Alexei Starovoitov wrote:
> >> Stating the obvious...
> >> Sebastian did a lot of work removing preempt_disable from the networking
> >> stack.
> >> We're certainly not adding them back.
> >> This patch is no go.
> >
> > While looking through the code, it looks as if lwtunnel_xmit() lacks a
> > local_bh_disable().
>
> Thanks Sebastian for the confirmation, as the initial idea was to use
> local_bh_disable() as well. Then I thought preempt_disable() would be
> enough in this context, but I didn't realize you made efforts to remove
> it from the networking stack.
>
> @Alexei, just to clarify: would you ACK this patch if we do
> s/preempt_{disable|enable}()/local_bh_{disable|enable}()/g ?

You need to think it through and not sprinkle local_bh_disable in
every lwt related function.
Like lwtunnel_input should be running with bh disabled already.

I don't remember the exact conditions where bh is disabled in xmit path.
Justin Iurman April 11, 2025, 6:34 p.m. UTC | #8
On 4/7/25 19:54, Alexei Starovoitov wrote:
> On Sun, Apr 6, 2025 at 1:59 AM Justin Iurman <justin.iurman@uliege.be> wrote:
>>
>> On 4/4/25 16:19, Sebastian Siewior wrote:
>>> Alexei, thank you for the Cc.
>>>
>>> On 2025-04-03 13:35:10 [-0700], Alexei Starovoitov wrote:
>>>> Stating the obvious...
>>>> Sebastian did a lot of work removing preempt_disable from the networking
>>>> stack.
>>>> We're certainly not adding them back.
>>>> This patch is no go.
>>>
>>> While looking through the code, it looks as if lwtunnel_xmit() lacks a
>>> local_bh_disable().
>>
>> Thanks Sebastian for the confirmation, as the initial idea was to use
>> local_bh_disable() as well. Then I thought preempt_disable() would be
>> enough in this context, but I didn't realize you made efforts to remove
>> it from the networking stack.
>>
>> @Alexei, just to clarify: would you ACK this patch if we do
>> s/preempt_{disable|enable}()/local_bh_{disable|enable}()/g ?
> 
> You need to think it through and not sprinkle local_bh_disable in
> every lwt related function.
> Like lwtunnel_input should be running with bh disabled already.

Having nested calls to local_bh_{disable|enable}() is fine (i.e., 
disabling BHs when they're already disabled), but I guess it's cleaner 
to avoid it here as you suggest. And since lwtunnel_input() is indeed 
(always) running with BHs disabled, no changes needed. Thanks for the 
reminder.
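
(A quick illustration of why nesting is safe: on !PREEMPT_RT, local_bh_disable() adds to the softirq count in the preempt count and local_bh_enable() subtracts from it, so BHs are only truly re-enabled when the count returns to zero:)

	local_bh_disable();	/* count > 0: BHs off */
	local_bh_disable();	/* nested: count bumped again, still off */
	local_bh_enable();	/* count still > 0: BHs remain off */
	local_bh_enable();	/* count back to 0: BHs re-enabled */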

> I don't remember the exact conditions where bh is disabled in xmit path.

Right. Not sure for lwtunnel_xmit(), but lwtunnel_output() can 
definitely run with or without BHs disabled. So, what I propose is the 
following logic (applied to lwtunnel_xmit() too): if BHs disabled then 
NOP else local_bh_disable(). Thoughts on this new version? (sorry, my 
mailer messes it up, but you got the idea):

diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index e39a459540ec..d44d341683c5 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -331,8 +331,13 @@ int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb)
  	const struct lwtunnel_encap_ops *ops;
  	struct lwtunnel_state *lwtstate;
  	struct dst_entry *dst;
+	bool in_softirq;
  	int ret;

+	in_softirq = in_softirq();
+	if (!in_softirq)
+		local_bh_disable();
+
  	if (dev_xmit_recursion()) {
  		net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
  				     __func__);
@@ -345,11 +350,13 @@ int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb)
  		ret = -EINVAL;
  		goto drop;
  	}
-	lwtstate = dst->lwtstate;

+	lwtstate = dst->lwtstate;
  	if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
-	    lwtstate->type > LWTUNNEL_ENCAP_MAX)
-		return 0;
+	    lwtstate->type > LWTUNNEL_ENCAP_MAX) {
+		ret = 0;
+		goto out;
+	}

  	ret = -EOPNOTSUPP;
  	rcu_read_lock();
@@ -364,10 +371,12 @@ int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb)
  	if (ret == -EOPNOTSUPP)
  		goto drop;

-	return ret;
-
+	goto out;
  drop:
  	kfree_skb(skb);
+out:
+	if (!in_softirq)
+		local_bh_enable();

  	return ret;
  }
@@ -378,8 +387,13 @@ int lwtunnel_xmit(struct sk_buff *skb)
  	const struct lwtunnel_encap_ops *ops;
  	struct lwtunnel_state *lwtstate;
  	struct dst_entry *dst;
+	bool in_softirq;
  	int ret;

+	in_softirq = in_softirq();
+	if (!in_softirq)
+		local_bh_disable();
+
  	if (dev_xmit_recursion()) {
  		net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
  				     __func__);
@@ -394,10 +408,11 @@ int lwtunnel_xmit(struct sk_buff *skb)
  	}

  	lwtstate = dst->lwtstate;
-
  	if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
-	    lwtstate->type > LWTUNNEL_ENCAP_MAX)
-		return 0;
+	    lwtstate->type > LWTUNNEL_ENCAP_MAX) {
+		ret = 0;
+		goto out;
+	}

  	ret = -EOPNOTSUPP;
  	rcu_read_lock();
@@ -412,10 +427,12 @@ int lwtunnel_xmit(struct sk_buff *skb)
  	if (ret == -EOPNOTSUPP)
  		goto drop;

-	return ret;
-
+	goto out;
  drop:
  	kfree_skb(skb);
+out:
+	if (!in_softirq)
+		local_bh_enable();

  	return ret;
  }
@@ -428,6 +445,8 @@ int lwtunnel_input(struct sk_buff *skb)
  	struct dst_entry *dst;
  	int ret;

+	WARN_ON_ONCE(!in_softirq());
+
  	if (dev_xmit_recursion()) {
  		net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
  				     __func__);
@@ -440,8 +459,8 @@ int lwtunnel_input(struct sk_buff *skb)
  		ret = -EINVAL;
  		goto drop;
  	}
-	lwtstate = dst->lwtstate;

+	lwtstate = dst->lwtstate;
  	if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
  	    lwtstate->type > LWTUNNEL_ENCAP_MAX)
  		return 0;
@@ -460,10 +479,8 @@ int lwtunnel_input(struct sk_buff *skb)
  		goto drop;

  	return ret;
-
  drop:
  	kfree_skb(skb);
-
  	return ret;
  }
  EXPORT_SYMBOL_GPL(lwtunnel_input);
Alexei Starovoitov April 14, 2025, 11:13 p.m. UTC | #9
On Fri, Apr 11, 2025 at 11:34 AM Justin Iurman <justin.iurman@uliege.be> wrote:
>
> On 4/7/25 19:54, Alexei Starovoitov wrote:
> > On Sun, Apr 6, 2025 at 1:59 AM Justin Iurman <justin.iurman@uliege.be> wrote:
> >>
> >> On 4/4/25 16:19, Sebastian Siewior wrote:
> >>> Alexei, thank you for the Cc.
> >>>
> >>> On 2025-04-03 13:35:10 [-0700], Alexei Starovoitov wrote:
> >>>> Stating the obvious...
> >>>> Sebastian did a lot of work removing preempt_disable from the networking
> >>>> stack.
> >>>> We're certainly not adding them back.
> >>>> This patch is no go.
> >>>
> >>> While looking through the code, it looks as if lwtunnel_xmit() lacks a
> >>> local_bh_disable().
> >>
> >> Thanks Sebastian for the confirmation, as the initial idea was to use
> >> local_bh_disable() as well. Then I thought preempt_disable() would be
> >> enough in this context, but I didn't realize you made efforts to remove
> >> it from the networking stack.
> >>
> >> @Alexei, just to clarify: would you ACK this patch if we do
> >> s/preempt_{disable|enable}()/local_bh_{disable|enable}()/g ?
> >
> > You need to think it through and not sprinkle local_bh_disable in
> > every lwt related function.
> > Like lwtunnel_input should be running with bh disabled already.
>
> Having nested calls to local_bh_{disable|enable}() is fine (i.e.,
> disabling BHs when they're already disabled), but I guess it's cleaner
> to avoid it here as you suggest. And since lwtunnel_input() is indeed
> (always) running with BHs disabled, no changes needed. Thanks for the
> reminder.
>
> > I don't remember the exact conditions where bh is disabled in xmit path.
>
> Right. Not sure for lwtunnel_xmit(), but lwtunnel_output() can
> definitely run with or without BHs disabled. So, what I propose is the
> following logic (applied to lwtunnel_xmit() too): if BHs disabled then
> NOP else local_bh_disable(). Thoughts on this new version? (sorry, my
> mailer messes it up, but you got the idea):
>
> diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
> index e39a459540ec..d44d341683c5 100644
> --- a/net/core/lwtunnel.c
> +++ b/net/core/lwtunnel.c
> @@ -331,8 +331,13 @@ int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb)
>         const struct lwtunnel_encap_ops *ops;
>         struct lwtunnel_state *lwtstate;
>         struct dst_entry *dst;
> +       bool in_softirq;
>         int ret;
>
> +       in_softirq = in_softirq();
> +       if (!in_softirq)
> +               local_bh_disable();
> +

This looks like a hack to me.

Instead analyze the typical xmit path. If bh is not disabled
then add local_bh_disable(). It's fine if it happens to be nested
in some cases.
Andrea Mayer April 15, 2025, 12:54 a.m. UTC | #10
On Fri, 11 Apr 2025 20:34:54 +0200
Justin Iurman <justin.iurman@uliege.be> wrote:

> On 4/7/25 19:54, Alexei Starovoitov wrote:
> > On Sun, Apr 6, 2025 at 1:59 AM Justin Iurman <justin.iurman@uliege.be> wrote:
> >>
> >> On 4/4/25 16:19, Sebastian Siewior wrote:
> >>> Alexei, thank you for the Cc.
> >>>
> >>> On 2025-04-03 13:35:10 [-0700], Alexei Starovoitov wrote:
> >>>> Stating the obvious...
> >>>> Sebastian did a lot of work removing preempt_disable from the networking
> >>>> stack.
> >>>> We're certainly not adding them back.
> >>>> This patch is no go.
> >>>
> >>> While looking through the code, it looks as if lwtunnel_xmit() lacks a
> >>> local_bh_disable().
> >>
> >> Thanks Sebastian for the confirmation, as the initial idea was to use
> >> local_bh_disable() as well. Then I thought preempt_disable() would be
> >> enough in this context, but I didn't realize you made efforts to remove
> >> it from the networking stack.
> >>
> >> @Alexei, just to clarify: would you ACK this patch if we do
> >> s/preempt_{disable|enable}()/local_bh_{disable|enable}()/g ?
> > 
> > You need to think it through and not sprinkle local_bh_disable in
> > every lwt related function.
> > Like lwtunnel_input should be running with bh disabled already.
> 
> Having nested calls to local_bh_{disable|enable}() is fine (i.e., 
> disabling BHs when they're already disabled), but I guess it's cleaner 
> to avoid it here as you suggest. And since lwtunnel_input() is indeed 
> (always) running with BHs disabled, no changes needed. Thanks for the 
> reminder.
> 
> > I don't remember the exact conditions where bh is disabled in xmit path.
> 
> Right. Not sure for lwtunnel_xmit(), but lwtunnel_output() can

Justin, thanks for the Cc.

I have been looking into the behavior of the lwtunnel_xmit() function in both
task and softirq contexts. To facilitate this investigation, I have written a
simple eBPF program that only prints messages to the trace pipe. This program
is attached to the LWT BPF XMIT hook by configuring a route (on my test node)
with a destination address (DA) pointing to an external node, referred to as
x.x.x.x, within my testbed network.
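
(A hypothetical minimal program of that kind; the file name, section name, and attach command below are illustrative, not Andrea's actual test code:)

	/* lwt_trace.bpf.c */
	#include <linux/bpf.h>
	#include <bpf/bpf_helpers.h>

	SEC("lwt_xmit")
	int trace_lwt_xmit(struct __sk_buff *skb)
	{
		bpf_printk("lwtunnel_xmit hook, len=%u", skb->len);
		return BPF_OK;	/* continue normal processing */
	}

	char _license[] SEC("license") = "GPL";

Attached by routing the external DA through the LWT BPF XMIT hook, e.g.:

	ip route add x.x.x.x/32 encap bpf xmit obj lwt_trace.bpf.o \
		section lwt_xmit dev eth0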

To trigger that LWT BPF XMIT instance from a softirq context, it is sufficient
to receive (on the test node) a packet with a DA matching x.x.x.x. This packet
is then processed through the forwarding path, eventually leading to the
ip_output() function. Processing ends with a call to ip_finish_output2(), which
then calls lwtunnel_xmit().

Below is the stack trace from my testing machine, highlighting the key
functions involved in this processing path:

 ============================================
  <IRQ>                                         
  ...
  lwtunnel_xmit+0x18/0x3f0                      
  ip_finish_output2+0x45a/0xcc0                 
  ip_output+0xe2/0x380                          
  NF_HOOK.constprop.0+0x7e/0x2f0                
  ip_rcv+0x4bf/0x4d0                            
  __netif_receive_skb_one_core+0x11c/0x130      
  process_backlog+0x277/0x980
  __napi_poll.constprop.0+0x58/0x260
  net_rx_action+0x396/0x6e0
  handle_softirqs+0x116/0x640
  do_softirq+0xa9/0xe0
  </IRQ>
 ============================================

Conversely, to trigger lwtunnel_xmit() from the task context, simply ping
x.x.x.x on the same testing node. Below is the corresponding stack trace:

 ============================================
  <TASK>                                                               
  ...
  lwtunnel_xmit+0x18/0x3f0                                             
  ip_finish_output2+0x45a/0xcc0                                        
  ip_output+0xe2/0x380                                                 
  ip_push_pending_frames+0x17a/0x200                                   
  raw_sendmsg+0x9fa/0x1060                                             
  __sys_sendto+0x294/0x2e0                                             
  __x64_sys_sendto+0x6d/0x80                                           
  do_syscall_64+0x64/0x140                                             
  entry_SYSCALL_64_after_hwframe+0x76/0x7e                             
  </TASK>    
 ============================================

So, for lwtunnel_xmit() as well, we need to make sure that the
dev_xmit_recursion{_inc/dec}() functions and the logic that avoids lwt
recursion are protected, i.e. run inside a local_bh_{disable/enable} block.

> definitely run with or without BHs disabled. So, what I propose is the 
> following logic (applied to lwtunnel_xmit() too): if BHs disabled then 
> NOP else local_bh_disable(). Thoughts on this new version? (sorry, my 
> mailer messes it up, but you got the idea):
> 
> diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
> index e39a459540ec..d44d341683c5 100644
> --- a/net/core/lwtunnel.c
> +++ b/net/core/lwtunnel.c
> @@ -331,8 +331,13 @@ int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb)
>   	const struct lwtunnel_encap_ops *ops;
>   	struct lwtunnel_state *lwtstate;
>   	struct dst_entry *dst;
> +	bool in_softirq;
>   	int ret;
> 
> +	in_softirq = in_softirq();
> +	if (!in_softirq)
> +		local_bh_disable();
> +

In a non-real-time environment (i.e., when !PREEMPT_RT), in_softirq()
expands to softirq_count(), which in turn uses preempt_count(). On my x86
architecture, preempt_count() accesses the per-CPU __preempt_count
variable.

If in_softirq() returns 0, it indicates that no softirqs are currently being
processed on the local CPU and BHs are not disabled. Therefore, following the
logic above, we disable bottom halves (BH) on that particular CPU.

However, there is in my opinion an issue that can occur: between the check on
in_softirq() and the call to local_bh_disable(), the task may be scheduled on
another CPU. As a result, the check on in_softirq() becomes ineffective because
we may end up disabling BH on a CPU that is not the one we just checked (with
if (in_softirq()) { ... }).
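
(For reference, the !PREEMPT_RT definitions involved, roughly as in include/linux/preempt.h; see also Jakub's later note that the context is not affected by migration:)

	#define softirq_count()	(preempt_count() & SOFTIRQ_MASK)
	#define in_softirq()	(softirq_count())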


>   	if (dev_xmit_recursion()) {
>   		net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
>   				     __func__);
> @@ -345,11 +350,13 @@ int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb)
>   		ret = -EINVAL;
>   		goto drop;
>   	}
> -	lwtstate = dst->lwtstate;
> 
> +	lwtstate = dst->lwtstate;
>   	if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
> -	    lwtstate->type > LWTUNNEL_ENCAP_MAX)
> -		return 0;
> +	    lwtstate->type > LWTUNNEL_ENCAP_MAX) {
> +		ret = 0;
> +		goto out;
> +	}
> 
>   	ret = -EOPNOTSUPP;
>   	rcu_read_lock();
> @@ -364,10 +371,12 @@ int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb)
>   	if (ret == -EOPNOTSUPP)
>   		goto drop;
> 
> -	return ret;
> -
> +	goto out;
>   drop:
>   	kfree_skb(skb);
> +out:
> +	if (!in_softirq)
> +		local_bh_enable();
> 
>   	return ret;
>   }
> @@ -378,8 +387,13 @@ int lwtunnel_xmit(struct sk_buff *skb)
>   	const struct lwtunnel_encap_ops *ops;
>   	struct lwtunnel_state *lwtstate;
>   	struct dst_entry *dst;
> +	bool in_softirq;
>   	int ret;
> 
> +	in_softirq = in_softirq();
> +	if (!in_softirq)
> +		local_bh_disable();
> +
>   	if (dev_xmit_recursion()) {
>   		net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
>   				     __func__);
> @@ -394,10 +408,11 @@ int lwtunnel_xmit(struct sk_buff *skb)
>   	}
> 
>   	lwtstate = dst->lwtstate;
> -
>   	if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
> -	    lwtstate->type > LWTUNNEL_ENCAP_MAX)
> -		return 0;
> +	    lwtstate->type > LWTUNNEL_ENCAP_MAX) {
> +		ret = 0;
> +		goto out;
> +	}
> 
>   	ret = -EOPNOTSUPP;
>   	rcu_read_lock();
> @@ -412,10 +427,12 @@ int lwtunnel_xmit(struct sk_buff *skb)
>   	if (ret == -EOPNOTSUPP)
>   		goto drop;
> 
> -	return ret;
> -
> +	goto out;
>   drop:
>   	kfree_skb(skb);
> +out:
> +	if (!in_softirq)
> +		local_bh_enable();
> 
>   	return ret;
>   }
> @@ -428,6 +445,8 @@ int lwtunnel_input(struct sk_buff *skb)
>   	struct dst_entry *dst;
>   	int ret;
> 
> +	WARN_ON_ONCE(!in_softirq());
> +

What about DEBUG_NET_WARN_ON_ONCE instead?

Ciao,
Andrea

>   	if (dev_xmit_recursion()) {
>   		net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
>   				     __func__);
> @@ -440,8 +459,8 @@ int lwtunnel_input(struct sk_buff *skb)
>   		ret = -EINVAL;
>   		goto drop;
>   	}
> -	lwtstate = dst->lwtstate;
> 
> +	lwtstate = dst->lwtstate;
>   	if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
>   	    lwtstate->type > LWTUNNEL_ENCAP_MAX)
>   		return 0;
> @@ -460,10 +479,8 @@ int lwtunnel_input(struct sk_buff *skb)
>   		goto drop;
> 
>   	return ret;
> -
>   drop:
>   	kfree_skb(skb);
> -
>   	return ret;
>   }
>   EXPORT_SYMBOL_GPL(lwtunnel_input);
Justin Iurman April 15, 2025, 9:10 a.m. UTC | #11
On 4/15/25 02:54, Andrea Mayer wrote:
> I have been looking into the behavior of the lwtunnel_xmit() function in both
> task and softirq contexts. To facilitate this investigation, I have written a
> simple eBPF program that only prints messages to the trace pipe. This program
> is attached to the LWT BPF XMIT hook by configuring a route (on my test node)
> with a destination address (DA) pointing to an external node, referred to as
> x.x.x.x, within my testbed network.
> 
> To trigger that LWT BPF XMIT instance from a softirq context, it is sufficient
> to receive (on the test node) a packet with a DA matching x.x.x.x. This packet
> is then processed through the forwarding path, eventually leading to the
> ip_output() function. Processing ends with a call to ip_finish_output2(), which
> then calls lwtunnel_xmit().
> 
> Below is the stack trace from my testing machine, highlighting the key
> functions involved in this processing path:
> 
>   ============================================
>    <IRQ>
>    ...
>    lwtunnel_xmit+0x18/0x3f0
>    ip_finish_output2+0x45a/0xcc0
>    ip_output+0xe2/0x380
>    NF_HOOK.constprop.0+0x7e/0x2f0
>    ip_rcv+0x4bf/0x4d0
>    __netif_receive_skb_one_core+0x11c/0x130
>    process_backlog+0x277/0x980
>    __napi_poll.constprop.0+0x58/0x260
>    net_rx_action+0x396/0x6e0
>    handle_softirqs+0x116/0x640
>    do_softirq+0xa9/0xe0
>    </IRQ>
>   ============================================
> 
> Conversely, to trigger lwtunnel_xmit() from the task context, simply ping
> x.x.x.x on the same testing node. Below is the corresponding stack trace:
> 
>   ============================================
>    <TASK>
>    ...
>    lwtunnel_xmit+0x18/0x3f0
>    ip_finish_output2+0x45a/0xcc0
>    ip_output+0xe2/0x380
>    ip_push_pending_frames+0x17a/0x200
>    raw_sendmsg+0x9fa/0x1060
>    __sys_sendto+0x294/0x2e0
>    __x64_sys_sendto+0x6d/0x80
>    do_syscall_64+0x64/0x140
>    entry_SYSCALL_64_after_hwframe+0x76/0x7e
>    </TASK>
>   ============================================
> 
> So also for the lwtunnel_xmit(), we need to make sure that the functions
> dev_xmit_recursion{_inc/dec}() and the necessary logic to avoid lwt recursion
> are protected, i.e. inside a local_bh_{disable/enable} block.

That's correct, and I ended up with the same conclusion as yours on the 
possible paths for lwtunnel_xmit() depending on the context (task vs 
irq). Based on your description, we're using a similar approach with 
eBPF :-) Note that paths are similar for lwtunnel_output() (see below).

> In a non-preemptible real-time environment (i.e., when !PREEMPT_RT), the
> in_softirq() expands to softirq_count(), which in turn uses the preempt_count()
> function. On my x86 architecture, preempt_count() accesses the per-CPU
> __preempt_count variable.
> 
> If in_softirq() returns 0, it indicates that no softirqs are currently being
> processed on the local CPU and BH are not disabled. Therefore, following the
> logic above, we disable bottom halves (BH) on that particular CPU.
> 
> However, there is my opinion an issue that can occur: between the check on
> in_softirq() and the call to local_bh_disable(), the task may be scheduled on
> another CPU. As a result, the check on in_softirq() becomes ineffective because
> we may end up disabling BH on a CPU that is not the one we just checked (with
> if (in_softirq()) { ... }).

Hmm, I think it's correct... good catch. I went for this solution to (i) 
avoid useless nested BHs disable calls; and (ii) avoid ending up with a 
spaghetti graph of possible paths with or without BHs disabled (i.e., 
with single entry points, namely lwtunnel_xmit() and lwtunnel_output()), 
which otherwise makes it hard to maintain the code IMO.

So, if we want to follow what Alexei suggests (see his last response), 
we'd need to disable BHs in both ip_local_out() and ip6_local_out(). 
These are the closest common functions in depth for both lwtunnel_xmit()
and lwtunnel_output(). But... at the "cost" of disabling BHs even when it
may not be required. Indeed, ip_local_out() and ip6_local_out() both call
dst_output(), which usually is not lwtunnel_output() (and there may not
even be a lwtunnel_xmit() to call either).

The other solution is to always call local_bh_disable() in both 
lwtunnel_xmit() and lwtunnel_output(), at the cost of disabling BHs when 
they already are. This was basically -v1, which received a NACK from Alexei.

At the moment, I'm not sure what's best.
Justin Iurman April 15, 2025, 9:25 a.m. UTC | #12
On 4/15/25 01:13, Alexei Starovoitov wrote:
> On Fri, Apr 11, 2025 at 11:34 AM Justin Iurman <justin.iurman@uliege.be> wrote:
>>
>> On 4/7/25 19:54, Alexei Starovoitov wrote:
>>> On Sun, Apr 6, 2025 at 1:59 AM Justin Iurman <justin.iurman@uliege.be> wrote:
>>>>
>>>> On 4/4/25 16:19, Sebastian Siewior wrote:
>>>>> Alexei, thank you for the Cc.
>>>>>
>>>>> On 2025-04-03 13:35:10 [-0700], Alexei Starovoitov wrote:
>>>>>> Stating the obvious...
>>>>>> Sebastian did a lot of work removing preempt_disable from the networking
>>>>>> stack.
>>>>>> We're certainly not adding them back.
>>>>>> This patch is no go.
>>>>>
>>>>> While looking through the code, it looks as if lwtunnel_xmit() lacks a
>>>>> local_bh_disable().
>>>>
>>>> Thanks Sebastian for the confirmation, as the initial idea was to use
>>>> local_bh_disable() as well. Then I thought preempt_disable() would be
>>>> enough in this context, but I didn't realize you made efforts to remove
>>>> it from the networking stack.
>>>>
>>>> @Alexei, just to clarify: would you ACK this patch if we do
>>>> s/preempt_{disable|enable}()/local_bh_{disable|enable}()/g ?
>>>
>>> You need to think it through and not sprinkle local_bh_disable in
>>> every lwt related function.
>>> Like lwtunnel_input should be running with bh disabled already.
>>
>> Having nested calls to local_bh_{disable|enable}() is fine (i.e.,
>> disabling BHs when they're already disabled), but I guess it's cleaner
>> to avoid it here as you suggest. And since lwtunnel_input() is indeed
>> (always) running with BHs disabled, no changes needed. Thanks for the
>> reminder.
>>
>>> I don't remember the exact conditions where bh is disabled in xmit path.
>>
>> Right. Not sure for lwtunnel_xmit(), but lwtunnel_output() can
>> definitely run with or without BHs disabled. So, what I propose is the
>> following logic (applied to lwtunnel_xmit() too): if BHs disabled then
>> NOP else local_bh_disable(). Thoughts on this new version? (sorry, my
>> mailer messes it up, but you got the idea):
>>
>> diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
>> index e39a459540ec..d44d341683c5 100644
>> --- a/net/core/lwtunnel.c
>> +++ b/net/core/lwtunnel.c
>> @@ -331,8 +331,13 @@ int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb)
>>          const struct lwtunnel_encap_ops *ops;
>>          struct lwtunnel_state *lwtstate;
>>          struct dst_entry *dst;
>> +       bool in_softirq;
>>          int ret;
>>
>> +       in_softirq = in_softirq();
>> +       if (!in_softirq)
>> +               local_bh_disable();
>> +
> 
> This looks like a hack to me.
> 
> Instead analyze the typical xmit path. If bh is not disabled

This is already what I did, and it's exactly the reason why I ended up 
with the above proposal. It's not only about the xmit path but also the 
output path. Of course, having BHs disabled only where they need to be, 
without useless nested calls, would be nice, but in reality that solution 
is not perfect and makes it even more difficult to visualize the path(s) 
with or without BHs disabled IMO.

For both lwtunnel_xmit() and lwtunnel_output(), the closest common 
functions in depth where BHs should be disabled are ip_local_out() and 
ip6_local_out() -- even when it's not required (which is the tradeoff). 
The other solution was -v1, which you NACK'ed. 
Please see my reply to Andrea for the whole story. To summarize, I'd say 
that it's either (a) what you suggest, i.e., non-required BH disable 
calls, or (b) nested BH disable calls. With tradeoffs for each.

> then add local_bh_disable(). It's fine if it happens to be nested
> in some cases.
Justin Iurman April 15, 2025, 11:56 a.m. UTC | #13
On 4/15/25 01:13, Alexei Starovoitov wrote:
> On Fri, Apr 11, 2025 at 11:34 AM Justin Iurman <justin.iurman@uliege.be> wrote:
>>
>> On 4/7/25 19:54, Alexei Starovoitov wrote:
>>> On Sun, Apr 6, 2025 at 1:59 AM Justin Iurman <justin.iurman@uliege.be> wrote:
>>>>
>>>> On 4/4/25 16:19, Sebastian Siewior wrote:
>>>>> Alexei, thank you for the Cc.
>>>>>
>>>>> On 2025-04-03 13:35:10 [-0700], Alexei Starovoitov wrote:
>>>>>> Stating the obvious...
>>>>>> Sebastian did a lot of work removing preempt_disable from the networking
>>>>>> stack.
>>>>>> We're certainly not adding them back.
>>>>>> This patch is no go.
>>>>>
>>>>> While looking through the code, it looks as if lwtunnel_xmit() lacks a
>>>>> local_bh_disable().
>>>>
>>>> Thanks Sebastian for the confirmation, as the initial idea was to use
>>>> local_bh_disable() as well. Then I thought preempt_disable() would be
>>>> enough in this context, but I didn't realize you made efforts to remove
>>>> it from the networking stack.
>>>>
>>>> @Alexei, just to clarify: would you ACK this patch if we do
>>>> s/preempt_{disable|enable}()/local_bh_{disable|enable}()/g ?
>>>
>>> You need to think it through and not sprinkle local_bh_disable in
>>> every lwt related function.
>>> Like lwtunnel_input should be running with bh disabled already.
>>
>> Having nested calls to local_bh_{disable|enable}() is fine (i.e.,
>> disabling BHs when they're already disabled), but I guess it's cleaner
>> to avoid it here as you suggest. And since lwtunnel_input() is indeed
>> (always) running with BHs disabled, no changes needed. Thanks for the
>> reminder.
>>
>>> I don't remember the exact conditions where bh is disabled in xmit path.
>>
>> Right. Not sure for lwtunnel_xmit(), but lwtunnel_output() can
>> definitely run with or without BHs disabled. So, what I propose is the
>> following logic (applied to lwtunnel_xmit() too): if BHs disabled then
>> NOP else local_bh_disable(). Thoughts on this new version? (sorry, my
>> mailer messes it up, but you got the idea):
>>
>> diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
>> index e39a459540ec..d44d341683c5 100644
>> --- a/net/core/lwtunnel.c
>> +++ b/net/core/lwtunnel.c
>> @@ -331,8 +331,13 @@ int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb)
>>          const struct lwtunnel_encap_ops *ops;
>>          struct lwtunnel_state *lwtstate;
>>          struct dst_entry *dst;
>> +       bool in_softirq;
>>          int ret;
>>
>> +       in_softirq = in_softirq();
>> +       if (!in_softirq)
>> +               local_bh_disable();
>> +
> 
> This looks like a hack to me.
> 
> Instead analyze the typical xmit path. If bh is not disabled
> then add local_bh_disable(). It's fine if it happens to be nested
> in some cases.

FYI, and based on my previous response, the patch would look like this 
in that case (again, my mailer messes long lines up, sorry). I'll let 
others comment on which solution/tradeoff seems better.

diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index e39a459540ec..d0cb0f2f9efe 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -333,6 +333,8 @@ int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb)
  	struct dst_entry *dst;
  	int ret;

+	DEBUG_NET_WARN_ON_ONCE(!in_softirq());
+
  	if (dev_xmit_recursion()) {
  		net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
  				     __func__);
@@ -380,6 +382,8 @@ int lwtunnel_xmit(struct sk_buff *skb)
  	struct dst_entry *dst;
  	int ret;

+	DEBUG_NET_WARN_ON_ONCE(!in_softirq());
+
  	if (dev_xmit_recursion()) {
  		net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
  				     __func__);
@@ -428,6 +432,8 @@ int lwtunnel_input(struct sk_buff *skb)
  	struct dst_entry *dst;
  	int ret;

+	DEBUG_NET_WARN_ON_ONCE(!in_softirq());
+
  	if (dev_xmit_recursion()) {
  		net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
  				     __func__);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 6e18d7ec5062..89bda2f424bb 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -124,10 +124,13 @@ int ip_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
  {
  	int err;

+	local_bh_disable();
+
  	err = __ip_local_out(net, sk, skb);
  	if (likely(err == 1))
  		err = dst_output(net, sk, skb);

+	local_bh_enable();
  	return err;
  }
  EXPORT_SYMBOL_GPL(ip_local_out);
diff --git a/net/ipv6/output_core.c b/net/ipv6/output_core.c
index 806d4b5dd1e6..bb40196edeb6 100644
--- a/net/ipv6/output_core.c
+++ b/net/ipv6/output_core.c
@@ -150,10 +150,13 @@ int ip6_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
  {
  	int err;

+	local_bh_disable();
+
  	err = __ip6_local_out(net, sk, skb);
  	if (likely(err == 1))
  		err = dst_output(net, sk, skb);

+	local_bh_enable();
  	return err;
  }
  EXPORT_SYMBOL_GPL(ip6_local_out);
Jakub Kicinski April 15, 2025, 2:38 p.m. UTC | #14
On Tue, 15 Apr 2025 11:10:01 +0200 Justin Iurman wrote:
> > However, there is my opinion an issue that can occur: between the check on
> > in_softirq() and the call to local_bh_disable(), the task may be scheduled on
> > another CPU. As a result, the check on in_softirq() becomes ineffective because
> > we may end up disabling BH on a CPU that is not the one we just checked (with
> > if (in_softirq()) { ... }).  

The context is not affected by migration. The context is fully defined
by the execution stack.

> Hmm, I think it's correct... good catch. I went for this solution to (i) 
> avoid useless nested BHs disable calls; and (ii) avoid ending up with a 
> spaghetti graph of possible paths with or without BHs disabled (i.e., 
> with single entry points, namely lwtunnel_xmit() and lwtunnel_output()), 
> which otherwise makes it hard to maintain the code IMO.
> 
> So, if we want to follow what Alexei suggests (see his last response), 
> we'd need to disable BHs in both ip_local_out() and ip6_local_out(). 
> These are the common functions which are closest in depth, and so for 
> both lwtunnel_xmit() and lwtunnel_output(). But... at the "cost" of 
> disabling BHs even when it may not be required. Indeed, ip_local_out() 
> and ip6_local_out() both call dst_output(), which one is usually not 
> lwtunnel_output() (and there may not even be a lwtunnel_xmit() to call 
> either).
> 
> The other solution is to always call local_bh_disable() in both 
> lwtunnel_xmit() and lwtunnel_output(), at the cost of disabling BHs when 
> they were already. Which was basically -v1 and received a NACK from Alexei.

I thought he nacked preempt_disable()
Justin Iurman April 15, 2025, 3:12 p.m. UTC | #15
On 4/15/25 16:38, Jakub Kicinski wrote:
> On Tue, 15 Apr 2025 11:10:01 +0200 Justin Iurman wrote:
>>> However, there is my opinion an issue that can occur: between the check on
>>> in_softirq() and the call to local_bh_disable(), the task may be scheduled on
>>> another CPU. As a result, the check on in_softirq() becomes ineffective because
>>> we may end up disabling BH on a CPU that is not the one we just checked (with
>>> if (in_softirq()) { ... }).
> 
> The context is not affected by migration. The context is fully defined
> by the execution stack.
> 
>> Hmm, I think it's correct... good catch. I went for this solution to (i)
>> avoid useless nested BHs disable calls; and (ii) avoid ending up with a
>> spaghetti graph of possible paths with or without BHs disabled (i.e.,
>> with single entry points, namely lwtunnel_xmit() and lwtunnel_output()),
>> which otherwise makes it hard to maintain the code IMO.
>>
>> So, if we want to follow what Alexei suggests (see his last response),
>> we'd need to disable BHs in both ip_local_out() and ip6_local_out().
>> These are the common functions which are closest in depth, and so for
>> both lwtunnel_xmit() and lwtunnel_output(). But... at the "cost" of
>> disabling BHs even when it may not be required. Indeed, ip_local_out()
>> and ip6_local_out() both call dst_output(), which one is usually not
>> lwtunnel_output() (and there may not even be a lwtunnel_xmit() to call
>> either).
>>
>> The other solution is to always call local_bh_disable() in both
>> lwtunnel_xmit() and lwtunnel_output(), at the cost of disabling BHs when
>> they were already. Which was basically -v1 and received a NACK from Alexei.
> 
> I thought he nacked preempt_disable()

I think I wasn't clear enough, sorry. Alexei explicitly NACK'ed the 
initial patch (the one with preempt_disable()) -- you're right. Judging 
by his reply, I think he also (implicitly) NACK'ed the other solution I 
proposed (see [1]). Alexei can clarify if I'm mistaken. He seems to 
prefer the solution that disables BHs on specific paths (see [2] for the 
other proposal) instead of inside lwtunnel_xmit() and lwtunnel_output().

So, basically, we have a choice to make between [1] and [2], where [1] 
IMO is better for the following reasons: (i) no nested calls to disable 
BHs, (ii) BHs disabled only when they're not already, and (iii) code 
clarity, i.e., not overly complicating the graph of paths w/ or w/o BHs 
disabled. With [2], BHs will be disabled even when not required on 
the xmit/output path, also resulting in an even more complex graph 
of paths regarding BHs.

   [1] https://lore.kernel.org/netdev/20250415073818.06ea327c@kernel.org/T/#m5a4e6a56206d9d110a5e4d664ab4ea09e7e9b33e
   [2] https://lore.kernel.org/netdev/20250415073818.06ea327c@kernel.org/T/#m3a88de38eceb0a53e2d173dc3675ecaa37e9d0b4
Alexei Starovoitov April 15, 2025, 3:12 p.m. UTC | #16
On Tue, Apr 15, 2025 at 7:38 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 15 Apr 2025 11:10:01 +0200 Justin Iurman wrote:
> > > However, there is my opinion an issue that can occur: between the check on
> > > in_softirq() and the call to local_bh_disable(), the task may be scheduled on
> > > another CPU. As a result, the check on in_softirq() becomes ineffective because
> > > we may end up disabling BH on a CPU that is not the one we just checked (with
> > > if (in_softirq()) { ... }).
>
> The context is not affected by migration. The context is fully defined
> by the execution stack.
>
> > Hmm, I think it's correct... good catch. I went for this solution to (i)
> > avoid useless nested BHs disable calls; and (ii) avoid ending up with a
> > spaghetti graph of possible paths with or without BHs disabled (i.e.,
> > with single entry points, namely lwtunnel_xmit() and lwtunnel_output()),
> > which otherwise makes it hard to maintain the code IMO.
> >
> > So, if we want to follow what Alexei suggests (see his last response),
> > we'd need to disable BHs in both ip_local_out() and ip6_local_out().
> > These are the common functions which are closest in depth, and so for
> > both lwtunnel_xmit() and lwtunnel_output(). But... at the "cost" of
> > disabling BHs even when it may not be required. Indeed, ip_local_out()
> > and ip6_local_out() both call dst_output(), which one is usually not
> > lwtunnel_output() (and there may not even be a lwtunnel_xmit() to call
> > either).
> >
> > The other solution is to always call local_bh_disable() in both
> > lwtunnel_xmit() and lwtunnel_output(), at the cost of disabling BHs when
> > they were already. Which was basically -v1 and received a NACK from Alexei.
>
> I thought he nacked preempt_disable()

+1.

imo unconditional local_bh_disable() in tx path is fine.
I didn't like the addition of local_bh_disable() in every lwt related
function without doing the homework on whether it's needed there or not.
Like the input path shouldn't need local_bh_disable
Justin Iurman April 15, 2025, 5:12 p.m. UTC | #17
On 4/15/25 17:12, Alexei Starovoitov wrote:
> On Tue, Apr 15, 2025 at 7:38 AM Jakub Kicinski <kuba@kernel.org> wrote:
>>
>> On Tue, 15 Apr 2025 11:10:01 +0200 Justin Iurman wrote:
>>>> However, there is my opinion an issue that can occur: between the check on
>>>> in_softirq() and the call to local_bh_disable(), the task may be scheduled on
>>>> another CPU. As a result, the check on in_softirq() becomes ineffective because
>>>> we may end up disabling BH on a CPU that is not the one we just checked (with
>>>> if (in_softirq()) { ... }).
>>
>> The context is not affected by migration. The context is fully defined
>> by the execution stack.
>>
>>> Hmm, I think it's correct... good catch. I went for this solution to (i)
>>> avoid useless nested BHs disable calls; and (ii) avoid ending up with a
>>> spaghetti graph of possible paths with or without BHs disabled (i.e.,
>>> with single entry points, namely lwtunnel_xmit() and lwtunnel_output()),
>>> which otherwise makes it hard to maintain the code IMO.
>>>
>>> So, if we want to follow what Alexei suggests (see his last response),
>>> we'd need to disable BHs in both ip_local_out() and ip6_local_out().
>>> These are the common functions which are closest in depth, and so for
>>> both lwtunnel_xmit() and lwtunnel_output(). But... at the "cost" of
>>> disabling BHs even when it may not be required. Indeed, ip_local_out()
>>> and ip6_local_out() both call dst_output(), which one is usually not
>>> lwtunnel_output() (and there may not even be a lwtunnel_xmit() to call
>>> either).
>>>
>>> The other solution is to always call local_bh_disable() in both
>>> lwtunnel_xmit() and lwtunnel_output(), at the cost of disabling BHs when
>>> they were already. Which was basically -v1 and received a NACK from Alexei.
>>
>> I thought he nacked preempt_disable()
> 
> +1.
> 
> imo unconditional local_bh_disable() in tx path is fine.
> I didn't like the addition of local_bh_disable() in every lwt related
> function without doing home work whether it's needed there or not.
> Like input path shouldn't need local_bh_disable

Ack, sorry for the confusion. I'll post -v2 with that solution.

Patch

diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index e39a459540ec..a9ad068e5707 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -333,6 +333,8 @@  int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 	struct dst_entry *dst;
 	int ret;
 
+	preempt_disable();
+
 	if (dev_xmit_recursion()) {
 		net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
 				     __func__);
@@ -345,11 +347,13 @@  int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 		ret = -EINVAL;
 		goto drop;
 	}
-	lwtstate = dst->lwtstate;
 
+	lwtstate = dst->lwtstate;
 	if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
-	    lwtstate->type > LWTUNNEL_ENCAP_MAX)
-		return 0;
+	    lwtstate->type > LWTUNNEL_ENCAP_MAX) {
+		ret = 0;
+		goto out;
+	}
 
 	ret = -EOPNOTSUPP;
 	rcu_read_lock();
@@ -364,11 +368,11 @@  int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 	if (ret == -EOPNOTSUPP)
 		goto drop;
 
-	return ret;
-
+	goto out;
 drop:
 	kfree_skb(skb);
-
+out:
+	preempt_enable();
 	return ret;
 }
 EXPORT_SYMBOL_GPL(lwtunnel_output);
@@ -380,6 +384,8 @@  int lwtunnel_xmit(struct sk_buff *skb)
 	struct dst_entry *dst;
 	int ret;
 
+	preempt_disable();
+
 	if (dev_xmit_recursion()) {
 		net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
 				     __func__);
@@ -394,10 +400,11 @@  int lwtunnel_xmit(struct sk_buff *skb)
 	}
 
 	lwtstate = dst->lwtstate;
-
 	if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
-	    lwtstate->type > LWTUNNEL_ENCAP_MAX)
-		return 0;
+	    lwtstate->type > LWTUNNEL_ENCAP_MAX) {
+		ret = 0;
+		goto out;
+	}
 
 	ret = -EOPNOTSUPP;
 	rcu_read_lock();
@@ -412,11 +419,11 @@  int lwtunnel_xmit(struct sk_buff *skb)
 	if (ret == -EOPNOTSUPP)
 		goto drop;
 
-	return ret;
-
+	goto out;
 drop:
 	kfree_skb(skb);
-
+out:
+	preempt_enable();
 	return ret;
 }
 EXPORT_SYMBOL_GPL(lwtunnel_xmit);
@@ -428,6 +435,8 @@  int lwtunnel_input(struct sk_buff *skb)
 	struct dst_entry *dst;
 	int ret;
 
+	preempt_disable();
+
 	if (dev_xmit_recursion()) {
 		net_crit_ratelimited("%s(): recursion limit reached on datapath\n",
 				     __func__);
@@ -440,11 +449,13 @@  int lwtunnel_input(struct sk_buff *skb)
 		ret = -EINVAL;
 		goto drop;
 	}
-	lwtstate = dst->lwtstate;
 
+	lwtstate = dst->lwtstate;
 	if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
-	    lwtstate->type > LWTUNNEL_ENCAP_MAX)
-		return 0;
+	    lwtstate->type > LWTUNNEL_ENCAP_MAX) {
+		ret = 0;
+		goto out;
+	}
 
 	ret = -EOPNOTSUPP;
 	rcu_read_lock();
@@ -459,11 +470,11 @@  int lwtunnel_input(struct sk_buff *skb)
 	if (ret == -EOPNOTSUPP)
 		goto drop;
 
-	return ret;
-
+	goto out;
 drop:
 	kfree_skb(skb);
-
+out:
+	preempt_enable();
 	return ret;
 }
 EXPORT_SYMBOL_GPL(lwtunnel_input);
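
(For completeness, the -v2 direction agreed above amounts to substituting the BH primitives for the preemption primitives in the two TX-path entry points and dropping the lwtunnel_input() hunk, since the input path already runs with BHs disabled. A sketch of the substitution, not the posted follow-up:)

	@@ lwtunnel_output(), lwtunnel_xmit() @@
	-	preempt_disable();
	+	local_bh_disable();
	 ...
	-	preempt_enable();
	+	local_bh_enable();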