diff mbox series

[net] net: dsa: fix bonding with ARP monitoring by updating trans_start manually

Message ID 20220715232641.952532-1-vladimir.oltean@nxp.com (mailing list archive)
State Rejected
Delegated to: Netdev Maintainers
Headers show
Series [net] net: dsa: fix bonding with ARP monitoring by updating trans_start manually | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net
netdev/fixes_present success Fixes tag present in non-next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers fail 1 blamed authors not CCed: olteanv@gmail.com; 1 maintainers not CCed: olteanv@gmail.com
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 17 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Vladimir Oltean July 15, 2022, 11:26 p.m. UTC
Documentation/networking/bonding.rst points out that for ARP monitoring
to work, dev_trans_start() must be able to verify the latest trans_start
update of any slave_dev TX queue. However, with NETIF_F_LLTX,
netdev_start_xmit() -> txq_trans_update() fails to do anything, because
the TX queue hasn't been locked.

Fix this by manually updating the current TX queue's trans_start for
each packet sent.

Fixes: 2b86cb829976 ("net: dsa: declare lockless TX feature for slave ports")
Reported-by: Brian Hutchinson <b.hutchman@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/dsa/slave.c | 5 +++++
 1 file changed, 5 insertions(+)

Comments

Florian Fainelli July 15, 2022, 11:52 p.m. UTC | #1
On 7/15/2022 4:26 PM, Vladimir Oltean wrote:
> Documentation/networking/bonding.rst points out that for ARP monitoring
> to work, dev_trans_start() must be able to verify the latest trans_start
> update of any slave_dev TX queue. However, with NETIF_F_LLTX,
> netdev_start_xmit() -> txq_trans_update() fails to do anything, because
> the TX queue hasn't been locked.
> 
> Fix this by manually updating the current TX queue's trans_start for
> each packet sent.
> 
> Fixes: 2b86cb829976 ("net: dsa: declare lockless TX feature for slave ports")
> Reported-by: Brian Hutchinson <b.hutchman@gmail.com>
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>

Had seen the recent veth change but did not consider that it would be 
applicable to DSA somehow although it certainly is. Thanks!
Jakub Kicinski July 16, 2022, midnight UTC | #2
On Sat, 16 Jul 2022 02:26:41 +0300 Vladimir Oltean wrote:
> Documentation/networking/bonding.rst points out that for ARP monitoring
> to work, dev_trans_start() must be able to verify the latest trans_start
> update of any slave_dev TX queue. However, with NETIF_F_LLTX,
> netdev_start_xmit() -> txq_trans_update() fails to do anything, because
> the TX queue hasn't been locked.
> 
> Fix this by manually updating the current TX queue's trans_start for
> each packet sent.
> 
> Fixes: 2b86cb829976 ("net: dsa: declare lockless TX feature for slave ports")
> Reported-by: Brian Hutchinson <b.hutchman@gmail.com>
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

Did you see my discussion with Jay? Let's stop the spread of this
workaround, I'm tossing..
Vladimir Oltean July 16, 2022, 12:14 a.m. UTC | #3
On Fri, Jul 15, 2022 at 05:00:42PM -0700, Jakub Kicinski wrote:
> On Sat, 16 Jul 2022 02:26:41 +0300 Vladimir Oltean wrote:
> > Documentation/networking/bonding.rst points out that for ARP monitoring
> > to work, dev_trans_start() must be able to verify the latest trans_start
> > update of any slave_dev TX queue. However, with NETIF_F_LLTX,
> > netdev_start_xmit() -> txq_trans_update() fails to do anything, because
> > the TX queue hasn't been locked.
> > 
> > Fix this by manually updating the current TX queue's trans_start for
> > each packet sent.
> > 
> > Fixes: 2b86cb829976 ("net: dsa: declare lockless TX feature for slave ports")
> > Reported-by: Brian Hutchinson <b.hutchman@gmail.com>
> > Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> 
> Did you see my discussion with Jay? Let's stop the spread of this
> workaround, I'm tossing..

No, I didn't, could you summarize the alternative proposal?
Jakub Kicinski July 16, 2022, 12:19 a.m. UTC | #4
On Sat, 16 Jul 2022 00:14:44 +0000 Vladimir Oltean wrote:
> On Fri, Jul 15, 2022 at 05:00:42PM -0700, Jakub Kicinski wrote:
> > On Sat, 16 Jul 2022 02:26:41 +0300 Vladimir Oltean wrote:  
> > > Documentation/networking/bonding.rst points out that for ARP monitoring
> > > to work, dev_trans_start() must be able to verify the latest trans_start
> > > update of any slave_dev TX queue. However, with NETIF_F_LLTX,
> > > netdev_start_xmit() -> txq_trans_update() fails to do anything, because
> > > the TX queue hasn't been locked.
> > > 
> > > Fix this by manually updating the current TX queue's trans_start for
> > > each packet sent.
> > > 
> > > Fixes: 2b86cb829976 ("net: dsa: declare lockless TX feature for slave ports")
> > > Reported-by: Brian Hutchinson <b.hutchman@gmail.com>
> > > Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>  
> > 
> > Did you see my discussion with Jay? Let's stop the spread of this
> > workaround, I'm tossing..  
> 
> No, I didn't, could you summarize the alternative proposal?

Make bonding not depend on a field which is only valid for HW devices
which use the Tx watchdog. Let me find the thread...
https://lore.kernel.org/all/20220621213823.51c51326@kernel.org/
Vladimir Oltean July 16, 2022, 12:26 a.m. UTC | #5
On Fri, Jul 15, 2022 at 05:19:59PM -0700, Jakub Kicinski wrote:
> On Sat, 16 Jul 2022 00:14:44 +0000 Vladimir Oltean wrote:
> > On Fri, Jul 15, 2022 at 05:00:42PM -0700, Jakub Kicinski wrote:
> > > On Sat, 16 Jul 2022 02:26:41 +0300 Vladimir Oltean wrote:  
> > > > Documentation/networking/bonding.rst points out that for ARP monitoring
> > > > to work, dev_trans_start() must be able to verify the latest trans_start
> > > > update of any slave_dev TX queue. However, with NETIF_F_LLTX,
> > > > netdev_start_xmit() -> txq_trans_update() fails to do anything, because
> > > > the TX queue hasn't been locked.
> > > > 
> > > > Fix this by manually updating the current TX queue's trans_start for
> > > > each packet sent.
> > > > 
> > > > Fixes: 2b86cb829976 ("net: dsa: declare lockless TX feature for slave ports")
> > > > Reported-by: Brian Hutchinson <b.hutchman@gmail.com>
> > > > Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>  
> > > 
> > > Did you see my discussion with Jay? Let's stop the spread of this
> > > workaround, I'm tossing..  
> > 
> > No, I didn't, could you summarize the alternative proposal?
> 
> Make bonding not depend on a field which is only valid for HW devices
> which use the Tx watchdog. Let me find the thread...
> https://lore.kernel.org/all/20220621213823.51c51326@kernel.org/

That won't work in the general case with dsa_slave_get_stats64(), which
may take the stats from hardware (delayed) or from dev_get_tstats64().

Also, not to mention that ARP monitoring used to work before the commit
I blamed, this is a punctual fix for a regression.
Jakub Kicinski July 16, 2022, 12:55 a.m. UTC | #6
On Sat, 16 Jul 2022 00:26:13 +0000 Vladimir Oltean wrote:
> > Make bonding not depend on a field which is only valid for HW devices
> > which use the Tx watchdog. Let me find the thread...
> > https://lore.kernel.org/all/20220621213823.51c51326@kernel.org/  
> 
> That won't work in the general case with dsa_slave_get_stats64(), which
> may take the stats from hardware (delayed) or from dev_get_tstats64().

Ah, that's annoying.

> Also, not to mention that ARP monitoring used to work before the commit
> I blamed, this is a punctual fix for a regression.

trans_start is for the watchdog. This is the third patch pointlessly 
messing with trans_start while the bug is in bonding. It's trying to
piggy back on semantics which are not universally true.

Fix bonding please.
Vladimir Oltean July 16, 2022, 1:30 p.m. UTC | #7
On Fri, Jul 15, 2022 at 05:55:16PM -0700, Jakub Kicinski wrote:
> On Sat, 16 Jul 2022 00:26:13 +0000 Vladimir Oltean wrote:
> > > Make bonding not depend on a field which is only valid for HW devices
> > > which use the Tx watchdog. Let me find the thread...
> > > https://lore.kernel.org/all/20220621213823.51c51326@kernel.org/  
> > 
> > That won't work in the general case with dsa_slave_get_stats64(), which
> > may take the stats from hardware (delayed) or from dev_get_tstats64().
> 
> Ah, that's annoying.
> 
> > Also, not to mention that ARP monitoring used to work before the commit
> > I blamed, this is a punctual fix for a regression.
> 
> trans_start is for the watchdog. This is the third patch pointlessly 
> messing with trans_start while the bug is in bonding. It's trying to
> piggy back on semantics which are not universally true.
> 
> Fix bonding please.

I would need some assistance from Jay or other people more familiar with
bonding to do that. I'm not exactly clear which packets the bonding
driver wants to check they have been transmitted in the last interval:
ARP packets? any packets? With DSA and switchdev drivers in general,
they have an offloaded forwarding path as well, so expect that what you
get through ndo_get_stats64 may also report packets which egressed a
physical port but weren't originated by the network stack.
I simply don't know what is a viable substitute for dev_trans_start()
because I don't understand very well what it intends to capture.
Jakub Kicinski July 16, 2022, 11:33 p.m. UTC | #8
On Sat, 16 Jul 2022 13:30:10 +0000 Vladimir Oltean wrote:
> I would need some assistance from Jay or other people more familiar with
> bonding to do that. I'm not exactly clear which packets the bonding
> driver wants to check they have been transmitted in the last interval:
> ARP packets? any packets?

And why - stack has queued a packet to the driver, how is that useful
to assess the fact that the link is up? Can bonding check instead
whether any queue is running? Or maybe the whole thing is just a remnant
of times before the centralized watchdog tx and should be dropped?

> With DSA and switchdev drivers in general,
> they have an offloaded forwarding path as well, so expect that what you
> get through ndo_get_stats64 may also report packets which egressed a
> physical port but weren't originated by the network stack.
> I simply don't know what is a viable substitute for dev_trans_start()
> because I don't understand very well what it intends to capture.

Looking thru the code I stumbled on the implementation of
dev_trans_start() - it already takes care of some upper devices.
Adding DSA handling there would offend my sensibilities slightly
less, if you want to do that. At least it's not on the fast path.
Vladimir Oltean July 25, 2022, 8:31 p.m. UTC | #9
Hi Jakub,

On Sat, Jul 16, 2022 at 04:33:38PM -0700, Jakub Kicinski wrote:
> On Sat, 16 Jul 2022 13:30:10 +0000 Vladimir Oltean wrote:
> > I would need some assistance from Jay or other people more familiar with
> > bonding to do that. I'm not exactly clear which packets the bonding
> > driver wants to check they have been transmitted in the last interval:
> > ARP packets? any packets?
> 
> And why - stack has queued a packet to the driver, how is that useful
> to assess the fact that the link is up? Can bonding check instead
> whether any queue is running? Or maybe the whole thing is just a remnant
> of times before the centralized watchdog tx and should be dropped?

Yes, indeed, why? I don't know.

> > With DSA and switchdev drivers in general,
> > they have an offloaded forwarding path as well, so expect that what you
> > get through ndo_get_stats64 may also report packets which egressed a
> > physical port but weren't originated by the network stack.
> > I simply don't know what is a viable substitute for dev_trans_start()
> > because I don't understand very well what it intends to capture.
> 
> Looking thru the code I stumbled on the implementation of
> dev_trans_start() - it already takes care of some upper devices.
> Adding DSA handling there would offend my sensibilities slightly
> less, if you want to do that. At least it's not on the fast path.

How are your sensibilities feeling about this change?

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index cc6eabee2830..b844eb0bde7e 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -427,20 +427,43 @@ void __qdisc_run(struct Qdisc *q)
 
 unsigned long dev_trans_start(struct net_device *dev)
 {
+	struct net_device *lower;
+	struct list_head *iter;
 	unsigned long val, res;
+	bool have_lowers;
 	unsigned int i;
 
-	if (is_vlan_dev(dev))
-		dev = vlan_dev_real_dev(dev);
-	else if (netif_is_macvlan(dev))
-		dev = macvlan_dev_real_dev(dev);
+	rcu_read_lock();
+
+	/* Stacked network interfaces usually have NETIF_F_LLTX so
+	 * netdev_start_xmit() -> txq_trans_update() fails to do anything,
+	 * because they don't lock the TX queue. However, layers such as the
+	 * bonding arp monitor may still use dev_trans_start() on slave
+	 * interfaces such as vlan, macvlan, DSA (or potentially stacked
+	 * combinations of the above). In this case, walk the adjacency lists
+	 * all the way down, hoping that the lower-most device won't have
+	 * NETIF_F_LLTX.
+	 */
+	do {
+		have_lowers = false;
+
+		netdev_for_each_lower_dev(dev, lower, iter) {
+			have_lowers = true;
+			dev = lower;
+			break;
+		}
+	} while (have_lowers);
+
 	res = READ_ONCE(netdev_get_tx_queue(dev, 0)->trans_start);
+
 	for (i = 1; i < dev->num_tx_queues; i++) {
 		val = READ_ONCE(netdev_get_tx_queue(dev, i)->trans_start);
 		if (val && time_after(val, res))
 			res = val;
 	}
 
+	rcu_read_unlock();
+
 	return res;
 }
 EXPORT_SYMBOL(dev_trans_start);
Jakub Kicinski July 25, 2022, 8:40 p.m. UTC | #10
On Mon, 25 Jul 2022 20:31:25 +0000 Vladimir Oltean wrote:
> How are your sensibilities feeling about this change?

The code seems fine but I'd suggest we state clearly in the comment
that the entire procedure is a workaround for bonding depending on 
an antiquated/incorrect semantics of the field. And we doing this
because we don't know why and therefore how to fix it.
Jonathan Toppins July 25, 2022, 9:39 p.m. UTC | #11
On 7/25/22 16:31, Vladimir Oltean wrote:
> Hi Jakub,
> 
> On Sat, Jul 16, 2022 at 04:33:38PM -0700, Jakub Kicinski wrote:
>> On Sat, 16 Jul 2022 13:30:10 +0000 Vladimir Oltean wrote:
>>> I would need some assistance from Jay or other people more familiar with
>>> bonding to do that. I'm not exactly clear which packets the bonding
>>> driver wants to check they have been transmitted in the last interval:
>>> ARP packets? any packets?
>>
>> And why - stack has queued a packet to the driver, how is that useful
>> to assess the fact that the link is up? Can bonding check instead
>> whether any queue is running? Or maybe the whole thing is just a remnant
>> of times before the centralized watchdog tx and should be dropped?
> 
> Yes, indeed, why? I don't know.
> 
>>> With DSA and switchdev drivers in general,
>>> they have an offloaded forwarding path as well, so expect that what you
>>> get through ndo_get_stats64 may also report packets which egressed a
>>> physical port but weren't originated by the network stack.
>>> I simply don't know what is a viable substitute for dev_trans_start()
>>> because I don't understand very well what it intends to capture.
>>
>> Looking thru the code I stumbled on the implementation of
>> dev_trans_start() - it already takes care of some upper devices.
>> Adding DSA handling there would offend my sensibilities slightly
>> less, if you want to do that. At least it's not on the fast path.
> 
> How are your sensibilities feeling about this change?
> 
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index cc6eabee2830..b844eb0bde7e 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -427,20 +427,43 @@ void __qdisc_run(struct Qdisc *q)
>   
>   unsigned long dev_trans_start(struct net_device *dev)
>   {
> +	struct net_device *lower;
> +	struct list_head *iter;
>   	unsigned long val, res;
> +	bool have_lowers;
>   	unsigned int i;
>   
> -	if (is_vlan_dev(dev))
> -		dev = vlan_dev_real_dev(dev);
> -	else if (netif_is_macvlan(dev))
> -		dev = macvlan_dev_real_dev(dev);
> +	rcu_read_lock();
> +
> +	/* Stacked network interfaces usually have NETIF_F_LLTX so
> +	 * netdev_start_xmit() -> txq_trans_update() fails to do anything,
> +	 * because they don't lock the TX queue. However, layers such as the
> +	 * bonding arp monitor may still use dev_trans_start() on slave
> +	 * interfaces such as vlan, macvlan, DSA (or potentially stacked
> +	 * combinations of the above). In this case, walk the adjacency lists
> +	 * all the way down, hoping that the lower-most device won't have
> +	 * NETIF_F_LLTX.
> +	 */

I am not sure this assumption holds for virtual devices like veth unless 
the virtual interfaces are updated to modify trans_start, see
e66e257a5d83 ("veth: Add updating of trans_start")

And not all the virtual interfaces update trans_start iirc.

> +	do {
> +		have_lowers = false;
> +
> +		netdev_for_each_lower_dev(dev, lower, iter) {
> +			have_lowers = true;
> +			dev = lower;
> +			break;
> +		}
> +	} while (have_lowers);
> +
>   	res = READ_ONCE(netdev_get_tx_queue(dev, 0)->trans_start);
> +
>   	for (i = 1; i < dev->num_tx_queues; i++) {
>   		val = READ_ONCE(netdev_get_tx_queue(dev, i)->trans_start);
>   		if (val && time_after(val, res))
>   			res = val;
>   	}
>   
> +	rcu_read_unlock();
> +
>   	return res;
>   }
>   EXPORT_SYMBOL(dev_trans_start);
>
Vladimir Oltean July 25, 2022, 9:53 p.m. UTC | #12
On Mon, Jul 25, 2022 at 05:39:33PM -0400, Jonathan Toppins wrote:
> > +	/* Stacked network interfaces usually have NETIF_F_LLTX so
> > +	 * netdev_start_xmit() -> txq_trans_update() fails to do anything,
> > +	 * because they don't lock the TX queue. However, layers such as the
> > +	 * bonding arp monitor may still use dev_trans_start() on slave
> > +	 * interfaces such as vlan, macvlan, DSA (or potentially stacked
> > +	 * combinations of the above). In this case, walk the adjacency lists
> > +	 * all the way down, hoping that the lower-most device won't have
> > +	 * NETIF_F_LLTX.
> > +	 */
> 
> I am not sure this assumption holds for virtual devices like veth unless the
> virtual interfaces are updated to modify trans_start, see
> e66e257a5d83 ("veth: Add updating of trans_start")
> 
> And not all the virtual interfaces update trans_start iirc.

Agree this hack won't give callers of dev_trans_start() something to chew
in all cases when "dev" is a virtual interface, that's why I said 'stacked'
and not 'virtual'.
diff mbox series

Patch

diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index ad6a6663feeb..c1b5c549698d 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -746,12 +746,17 @@  static void dsa_skb_tx_timestamp(struct dsa_slave_priv *p,
 
 netdev_tx_t dsa_enqueue_skb(struct sk_buff *skb, struct net_device *dev)
 {
+	struct netdev_queue *q = netdev_get_tx_queue(dev, skb->queue_mapping);
+
 	/* SKB for netpoll still need to be mangled with the protocol-specific
 	 * tag to be successfully transmitted
 	 */
 	if (unlikely(netpoll_tx_running(dev)))
 		return dsa_slave_netpoll_send_skb(dev, skb);
 
+	/* NETIF_F_LLTX requires us to update q->trans_start manually */
+	WRITE_ONCE(q->trans_start, jiffies);
+
 	/* Queue the SKB for transmission on the parent interface, but
 	 * do not modify its EtherType
 	 */