Message ID | 20220715232641.952532-1-vladimir.oltean@nxp.com (mailing list archive) |
---|---|
State | Rejected |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net] net: dsa: fix bonding with ARP monitoring by updating trans_start manually | expand |
On 7/15/2022 4:26 PM, Vladimir Oltean wrote: > Documentation/networking/bonding.rst points out that for ARP monitoring > to work, dev_trans_start() must be able to verify the latest trans_start > update of any slave_dev TX queue. However, with NETIF_F_LLTX, > netdev_start_xmit() -> txq_trans_update() fails to do anything, because > the TX queue hasn't been locked. > > Fix this by manually updating the current TX queue's trans_start for > each packet sent. > > Fixes: 2b86cb829976 ("net: dsa: declare lockless TX feature for slave ports") > Reported-by: Brian Hutchinson <b.hutchman@gmail.com> > Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Had seen the recent veth change but did not consider that it would be applicable to DSA somehow although it certainly is. Thanks!
On Sat, 16 Jul 2022 02:26:41 +0300 Vladimir Oltean wrote: > Documentation/networking/bonding.rst points out that for ARP monitoring > to work, dev_trans_start() must be able to verify the latest trans_start > update of any slave_dev TX queue. However, with NETIF_F_LLTX, > netdev_start_xmit() -> txq_trans_update() fails to do anything, because > the TX queue hasn't been locked. > > Fix this by manually updating the current TX queue's trans_start for > each packet sent. > > Fixes: 2b86cb829976 ("net: dsa: declare lockless TX feature for slave ports") > Reported-by: Brian Hutchinson <b.hutchman@gmail.com> > Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Did you see my discussion with Jay? Let's stop the spread of this workaround, I'm tossing..
On Fri, Jul 15, 2022 at 05:00:42PM -0700, Jakub Kicinski wrote: > On Sat, 16 Jul 2022 02:26:41 +0300 Vladimir Oltean wrote: > > Documentation/networking/bonding.rst points out that for ARP monitoring > > to work, dev_trans_start() must be able to verify the latest trans_start > > update of any slave_dev TX queue. However, with NETIF_F_LLTX, > > netdev_start_xmit() -> txq_trans_update() fails to do anything, because > > the TX queue hasn't been locked. > > > > Fix this by manually updating the current TX queue's trans_start for > > each packet sent. > > > > Fixes: 2b86cb829976 ("net: dsa: declare lockless TX feature for slave ports") > > Reported-by: Brian Hutchinson <b.hutchman@gmail.com> > > Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> > > Did you see my discussion with Jay? Let's stop the spread of this > workaround, I'm tossing.. No, I didn't, could you summarize the alternative proposal?
On Sat, 16 Jul 2022 00:14:44 +0000 Vladimir Oltean wrote: > On Fri, Jul 15, 2022 at 05:00:42PM -0700, Jakub Kicinski wrote: > > On Sat, 16 Jul 2022 02:26:41 +0300 Vladimir Oltean wrote: > > > Documentation/networking/bonding.rst points out that for ARP monitoring > > > to work, dev_trans_start() must be able to verify the latest trans_start > > > update of any slave_dev TX queue. However, with NETIF_F_LLTX, > > > netdev_start_xmit() -> txq_trans_update() fails to do anything, because > > > the TX queue hasn't been locked. > > > > > > Fix this by manually updating the current TX queue's trans_start for > > > each packet sent. > > > > > > Fixes: 2b86cb829976 ("net: dsa: declare lockless TX feature for slave ports") > > > Reported-by: Brian Hutchinson <b.hutchman@gmail.com> > > > Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> > > > > Did you see my discussion with Jay? Let's stop the spread of this > > workaround, I'm tossing.. > > No, I didn't, could you summarize the alternative proposal? Make bonding not depend on a field which is only valid for HW devices which use the Tx watchdog. Let me find the thread... https://lore.kernel.org/all/20220621213823.51c51326@kernel.org/
On Fri, Jul 15, 2022 at 05:19:59PM -0700, Jakub Kicinski wrote: > On Sat, 16 Jul 2022 00:14:44 +0000 Vladimir Oltean wrote: > > On Fri, Jul 15, 2022 at 05:00:42PM -0700, Jakub Kicinski wrote: > > > On Sat, 16 Jul 2022 02:26:41 +0300 Vladimir Oltean wrote: > > > > Documentation/networking/bonding.rst points out that for ARP monitoring > > > > to work, dev_trans_start() must be able to verify the latest trans_start > > > > update of any slave_dev TX queue. However, with NETIF_F_LLTX, > > > > netdev_start_xmit() -> txq_trans_update() fails to do anything, because > > > > the TX queue hasn't been locked. > > > > > > > > Fix this by manually updating the current TX queue's trans_start for > > > > each packet sent. > > > > > > > > Fixes: 2b86cb829976 ("net: dsa: declare lockless TX feature for slave ports") > > > > Reported-by: Brian Hutchinson <b.hutchman@gmail.com> > > > > Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> > > > > > > Did you see my discussion with Jay? Let's stop the spread of this > > > workaround, I'm tossing.. > > > > No, I didn't, could you summarize the alternative proposal? > > Make bonding not depend on a field which is only valid for HW devices > which use the Tx watchdog. Let me find the thread... > https://lore.kernel.org/all/20220621213823.51c51326@kernel.org/ That won't work in the general case with dsa_slave_get_stats64(), which may take the stats from hardware (delayed) or from dev_get_tstats64(). Also, not to mention that ARP monitoring used to work before the commit I blamed, this is a punctual fix for a regression.
On Sat, 16 Jul 2022 00:26:13 +0000 Vladimir Oltean wrote: > > Make bonding not depend on a field which is only valid for HW devices > > which use the Tx watchdog. Let me find the thread... > > https://lore.kernel.org/all/20220621213823.51c51326@kernel.org/ > > That won't work in the general case with dsa_slave_get_stats64(), which > may take the stats from hardware (delayed) or from dev_get_tstats64(). Ah, that's annoying. > Also, not to mention that ARP monitoring used to work before the commit > I blamed, this is a punctual fix for a regression. trans_start is for the watchdog. This is the third patch pointlessly messing with trans_start while the bug is in bonding. It's trying to piggy back on semantics which are not universally true. Fix bonding please.
On Fri, Jul 15, 2022 at 05:55:16PM -0700, Jakub Kicinski wrote: > On Sat, 16 Jul 2022 00:26:13 +0000 Vladimir Oltean wrote: > > > Make bonding not depend on a field which is only valid for HW devices > > > which use the Tx watchdog. Let me find the thread... > > > https://lore.kernel.org/all/20220621213823.51c51326@kernel.org/ > > > > That won't work in the general case with dsa_slave_get_stats64(), which > > may take the stats from hardware (delayed) or from dev_get_tstats64(). > > Ah, that's annoying. > > > Also, not to mention that ARP monitoring used to work before the commit > > I blamed, this is a punctual fix for a regression. > > trans_start is for the watchdog. This is the third patch pointlessly > messing with trans_start while the bug is in bonding. It's trying to > piggy back on semantics which are not universally true. > > Fix bonding please. I would need some assistance from Jay or other people more familiar with bonding to do that. I'm not exactly clear which packets the bonding driver wants to check they have been transmitted in the last interval: ARP packets? any packets? With DSA and switchdev drivers in general, they have an offloaded forwarding path as well, so expect that what you get through ndo_get_stats64 may also report packets which egressed a physical port but weren't originated by the network stack. I simply don't know what is a viable substitute for dev_trans_start() because I don't understand very well what it intends to capture.
On Sat, 16 Jul 2022 13:30:10 +0000 Vladimir Oltean wrote: > I would need some assistance from Jay or other people more familiar with > bonding to do that. I'm not exactly clear which packets the bonding > driver wants to check they have been transmitted in the last interval: > ARP packets? any packets? And why - stack has queued a packet to the driver, how is that useful to assess the fact that the link is up? Can bonding check instead whether any queue is running? Or maybe the whole thing is just a remnant of times before the centralized watchdog tx and should be dropped? > With DSA and switchdev drivers in general, > they have an offloaded forwarding path as well, so expect that what you > get through ndo_get_stats64 may also report packets which egressed a > physical port but weren't originated by the network stack. > I simply don't know what is a viable substitute for dev_trans_start() > because I don't understand very well what it intends to capture. Looking thru the code I stumbled on the implementation of dev_trans_start() - it already takes care of some upper devices. Adding DSA handling there would offend my sensibilities slightly less, if you want to do that. At least it's not on the fast path.
Hi Jakub, On Sat, Jul 16, 2022 at 04:33:38PM -0700, Jakub Kicinski wrote: > On Sat, 16 Jul 2022 13:30:10 +0000 Vladimir Oltean wrote: > > I would need some assistance from Jay or other people more familiar with > > bonding to do that. I'm not exactly clear which packets the bonding > > driver wants to check they have been transmitted in the last interval: > > ARP packets? any packets? > > And why - stack has queued a packet to the driver, how is that useful > to assess the fact that the link is up? Can bonding check instead > whether any queue is running? Or maybe the whole thing is just a remnant > of times before the centralized watchdog tx and should be dropped? Yes, indeed, why? I don't know. > > With DSA and switchdev drivers in general, > > they have an offloaded forwarding path as well, so expect that what you > > get through ndo_get_stats64 may also report packets which egressed a > > physical port but weren't originated by the network stack. > > I simply don't know what is a viable substitute for dev_trans_start() > > because I don't understand very well what it intends to capture. > > Looking thru the code I stumbled on the implementation of > dev_trans_start() - it already takes care of some upper devices. > Adding DSA handling there would offend my sensibilities slightly > less, if you want to do that. At least it's not on the fast path. How are your sensibilities feeling about this change? diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index cc6eabee2830..b844eb0bde7e 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -427,20 +427,43 @@ void __qdisc_run(struct Qdisc *q) unsigned long dev_trans_start(struct net_device *dev) { + struct net_device *lower; + struct list_head *iter; unsigned long val, res; + bool have_lowers; unsigned int i; - if (is_vlan_dev(dev)) - dev = vlan_dev_real_dev(dev); - else if (netif_is_macvlan(dev)) - dev = macvlan_dev_real_dev(dev); + rcu_read_lock(); + + /* Stacked network interfaces usually have NETIF_F_LLTX so + * netdev_start_xmit() -> txq_trans_update() fails to do anything, + * because they don't lock the TX queue. However, layers such as the + * bonding arp monitor may still use dev_trans_start() on slave + * interfaces such as vlan, macvlan, DSA (or potentially stacked + * combinations of the above). In this case, walk the adjacency lists + * all the way down, hoping that the lower-most device won't have + * NETIF_F_LLTX. + */ + do { + have_lowers = false; + + netdev_for_each_lower_dev(dev, lower, iter) { + have_lowers = true; + dev = lower; + break; + } + } while (have_lowers); + res = READ_ONCE(netdev_get_tx_queue(dev, 0)->trans_start); + for (i = 1; i < dev->num_tx_queues; i++) { val = READ_ONCE(netdev_get_tx_queue(dev, i)->trans_start); if (val && time_after(val, res)) res = val; } + rcu_read_unlock(); + return res; } EXPORT_SYMBOL(dev_trans_start);
On Mon, 25 Jul 2022 20:31:25 +0000 Vladimir Oltean wrote:
> How are your sensibilities feeling about this change?
The code seems fine but I'd suggest we state clearly in the comment
that the entire procedure is a workaround for bonding depending on
an antiquated/incorrect semantics of the field. And we doing this
because we don't know why and therefore how to fix it.
On 7/25/22 16:31, Vladimir Oltean wrote: > Hi Jakub, > > On Sat, Jul 16, 2022 at 04:33:38PM -0700, Jakub Kicinski wrote: >> On Sat, 16 Jul 2022 13:30:10 +0000 Vladimir Oltean wrote: >>> I would need some assistance from Jay or other people more familiar with >>> bonding to do that. I'm not exactly clear which packets the bonding >>> driver wants to check they have been transmitted in the last interval: >>> ARP packets? any packets? >> >> And why - stack has queued a packet to the driver, how is that useful >> to assess the fact that the link is up? Can bonding check instead >> whether any queue is running? Or maybe the whole thing is just a remnant >> of times before the centralized watchdog tx and should be dropped? > > Yes, indeed, why? I don't know. > >>> With DSA and switchdev drivers in general, >>> they have an offloaded forwarding path as well, so expect that what you >>> get through ndo_get_stats64 may also report packets which egressed a >>> physical port but weren't originated by the network stack. >>> I simply don't know what is a viable substitute for dev_trans_start() >>> because I don't understand very well what it intends to capture. >> >> Looking thru the code I stumbled on the implementation of >> dev_trans_start() - it already takes care of some upper devices. >> Adding DSA handling there would offend my sensibilities slightly >> less, if you want to do that. At least it's not on the fast path. > > How are your sensibilities feeling about this change? > > diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c > index cc6eabee2830..b844eb0bde7e 100644 > --- a/net/sched/sch_generic.c > +++ b/net/sched/sch_generic.c > @@ -427,20 +427,43 @@ void __qdisc_run(struct Qdisc *q) > > unsigned long dev_trans_start(struct net_device *dev) > { > + struct net_device *lower; > + struct list_head *iter; > unsigned long val, res; > + bool have_lowers; > unsigned int i; > > - if (is_vlan_dev(dev)) > - dev = vlan_dev_real_dev(dev); > - else if (netif_is_macvlan(dev)) > - dev = macvlan_dev_real_dev(dev); > + rcu_read_lock(); > + > + /* Stacked network interfaces usually have NETIF_F_LLTX so > + * netdev_start_xmit() -> txq_trans_update() fails to do anything, > + * because they don't lock the TX queue. However, layers such as the > + * bonding arp monitor may still use dev_trans_start() on slave > + * interfaces such as vlan, macvlan, DSA (or potentially stacked > + * combinations of the above). In this case, walk the adjacency lists > + * all the way down, hoping that the lower-most device won't have > + * NETIF_F_LLTX. > + */ I am not sure this assumption holds for virtual devices like veth unless the virtual interfaces are updated to modify trans_start, see e66e257a5d83 ("veth: Add updating of trans_start") And not all the virtual interfaces update trans_start iirc. > + do { > + have_lowers = false; > + > + netdev_for_each_lower_dev(dev, lower, iter) { > + have_lowers = true; > + dev = lower; > + break; > + } > + } while (have_lowers); > + > res = READ_ONCE(netdev_get_tx_queue(dev, 0)->trans_start); > + > for (i = 1; i < dev->num_tx_queues; i++) { > val = READ_ONCE(netdev_get_tx_queue(dev, i)->trans_start); > if (val && time_after(val, res)) > res = val; > } > > + rcu_read_unlock(); > + > return res; > } > EXPORT_SYMBOL(dev_trans_start); >
On Mon, Jul 25, 2022 at 05:39:33PM -0400, Jonathan Toppins wrote: > > + /* Stacked network interfaces usually have NETIF_F_LLTX so > > + * netdev_start_xmit() -> txq_trans_update() fails to do anything, > > + * because they don't lock the TX queue. However, layers such as the > > + * bonding arp monitor may still use dev_trans_start() on slave > > + * interfaces such as vlan, macvlan, DSA (or potentially stacked > > + * combinations of the above). In this case, walk the adjacency lists > > + * all the way down, hoping that the lower-most device won't have > > + * NETIF_F_LLTX. > > + */ > > I am not sure this assumption holds for virtual devices like veth unless the > virtual interfaces are updated to modify trans_start, see > e66e257a5d83 ("veth: Add updating of trans_start") > > And not all the virtual interfaces update trans_start iirc. Agree this hack won't give callers of dev_trans_start() something to chew in all cases when "dev" is a virtual interface, that's why I said 'stacked' and not 'virtual'.
diff --git a/net/dsa/slave.c b/net/dsa/slave.c index ad6a6663feeb..c1b5c549698d 100644 --- a/net/dsa/slave.c +++ b/net/dsa/slave.c @@ -746,12 +746,17 @@ static void dsa_skb_tx_timestamp(struct dsa_slave_priv *p, netdev_tx_t dsa_enqueue_skb(struct sk_buff *skb, struct net_device *dev) { + struct netdev_queue *q = netdev_get_tx_queue(dev, skb->queue_mapping); + /* SKB for netpoll still need to be mangled with the protocol-specific * tag to be successfully transmitted */ if (unlikely(netpoll_tx_running(dev))) return dsa_slave_netpoll_send_skb(dev, skb); + /* NETIF_F_LLTX requires us to update q->trans_start manually */ + WRITE_ONCE(q->trans_start, jiffies); + /* Queue the SKB for transmission on the parent interface, but * do not modify its EtherType */
Documentation/networking/bonding.rst points out that for ARP monitoring to work, dev_trans_start() must be able to verify the latest trans_start update of any slave_dev TX queue. However, with NETIF_F_LLTX, netdev_start_xmit() -> txq_trans_update() fails to do anything, because the TX queue hasn't been locked. Fix this by manually updating the current TX queue's trans_start for each packet sent. Fixes: 2b86cb829976 ("net: dsa: declare lockless TX feature for slave ports") Reported-by: Brian Hutchinson <b.hutchman@gmail.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> --- net/dsa/slave.c | 5 +++++ 1 file changed, 5 insertions(+)