Message ID | 20210311072311.2969-1-xie.he.0141@gmail.com (mailing list archive) |
---|---|
State | Accepted |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net] net: lapbether: Prevent racing when checking whether the netif is running | expand |
Context | Check | Description |
---|---|---|
netdev/cover_letter | success | Link |
netdev/fixes_present | success | Link |
netdev/patch_count | success | Link |
netdev/tree_selection | success | Clearly marked for net |
netdev/subject_prefix | success | Link |
netdev/cc_maintainers | success | CCed 5 of 5 maintainers |
netdev/source_inline | success | Was 0 now: 0 |
netdev/verify_signedoff | success | Link |
netdev/module_param | success | Was 0 now: 0 |
netdev/build_32bit | success | Errors and warnings before: 0 this patch: 0 |
netdev/kdoc | success | Errors and warnings before: 0 this patch: 0 |
netdev/verify_fixes | success | Link |
netdev/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 94 lines checked |
netdev/build_allmodconfig_warn | success | Errors and warnings before: 0 this patch: 0 |
netdev/header_inline | success | Link |
On Wed, 10 Mar 2021 23:23:09 -0800 Xie He wrote: > There are two "netif_running" checks in this driver. One is in > "lapbeth_xmit" and the other is in "lapbeth_rcv". They serve to make > sure that the LAPB APIs called in these functions are called before > "lapb_unregister" is called by the "ndo_stop" function. > > However, these "netif_running" checks are unreliable, because it's > possible that immediately after "netif_running" returns true, "ndo_stop" > is called (which causes "lapb_unregister" to be called). > > This patch adds locking to make sure "lapbeth_xmit" and "lapbeth_rcv" can > reliably check and ensure the netif is running while doing their work. > > Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") > Signed-off-by: Xie He <xie.he.0141@gmail.com> Is this a theoretical issues or do you see a path where it triggers? Who are the callers sending frames to a device which went down?
On Thu, Mar 11, 2021 at 12:43 PM Jakub Kicinski <kuba@kernel.org> wrote: > > Is this a theoretical issues or do you see a path where it triggers? > > Who are the callers sending frames to a device which went down? This is a theoretical issue. I didn't see this issue in practice. When "__dev_queue_xmit" and "sch_direct_xmit" call "dev_hard_start_xmit", there appears to be no locking mechanism preventing the netif from going down while "dev_hard_start_xmit" is doing its work. David once confirmed in an email that a driver's "ndo_stop" function would indeed race with its "ndo_start_xmit" function: https://lore.kernel.org/netdev/20190520.200922.2277656639346033061.davem@davemloft.net/
On Thu, 11 Mar 2021 13:12:25 -0800 Xie He wrote: > On Thu, Mar 11, 2021 at 12:43 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > Is this a theoretical issues or do you see a path where it triggers? > > > > Who are the callers sending frames to a device which went down? > > This is a theoretical issue. I didn't see this issue in practice. > > When "__dev_queue_xmit" and "sch_direct_xmit" call > "dev_hard_start_xmit", there appears to be no locking mechanism > preventing the netif from going down while "dev_hard_start_xmit" is > doing its work. Normally driver's ndo_stop() calls netif_tx_disable() which takes TX locks, so unless your driver is lockless (LLTX) there should be no xmit calls after that point.
On Thu, Mar 11, 2021 at 2:52 PM Jakub Kicinski <kuba@kernel.org> wrote: > > Normally driver's ndo_stop() calls netif_tx_disable() which takes TX > locks, so unless your driver is lockless (LLTX) there should be no xmit > calls after that point. Do you mean I should call "netif_tx_disable" inside my "ndo_stop" function as a fix for the racing between "ndo_stop" and "ndo_start_xmit"? I can't call "netif_tx_disable" inside my "ndo_stop" function because "netif_tx_disable" will call "netif_tx_stop_queue", which causes another racing problem. Please see my recent commit f7d9d4854519 ("net: lapbether: Remove netif_start_queue / netif_stop_queue")
On Thu, 11 Mar 2021 15:13:01 -0800 Xie He wrote: > On Thu, Mar 11, 2021 at 2:52 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > Normally driver's ndo_stop() calls netif_tx_disable() which takes TX > > locks, so unless your driver is lockless (LLTX) there should be no xmit > > calls after that point. > > Do you mean I should call "netif_tx_disable" inside my "ndo_stop" > function as a fix for the racing between "ndo_stop" and > "ndo_start_xmit"? > > I can't call "netif_tx_disable" inside my "ndo_stop" function because > "netif_tx_disable" will call "netif_tx_stop_queue", which causes > another racing problem. Please see my recent commit f7d9d4854519 > ("net: lapbether: Remove netif_start_queue / netif_stop_queue") And the "noqueue" queue is there because it's on top of hdlc_fr.c somehow or some out of tree driver? Or do you install it manually?
On Thu, Mar 11, 2021 at 4:10 PM Jakub Kicinski <kuba@kernel.org> wrote: > > And the "noqueue" queue is there because it's on top of hdlc_fr.c > somehow or some out of tree driver? Or do you install it manually? No, this driver is not related to "hdlc_fr.c" or any out-of-tree driver. The default qdisc is "noqueue" for this driver because this driver doesn't set "tx_queue_len". This means the value of "tx_queue_len" would be 0. In this case, "alloc_netdev_mqs" will automatically add the "IFF_NO_QUEUE" flag to the device, then "attach_one_default_qdisc" in "net/sched/sch_generic.c" will attach the "noqueue" qdisc for devices with the "IFF_NO_QUEUE" flag.
On Thu, 11 Mar 2021 16:28:47 -0800 Xie He wrote: > On Thu, Mar 11, 2021 at 4:10 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > And the "noqueue" queue is there because it's on top of hdlc_fr.c > > somehow or some out of tree driver? Or do you install it manually? > > No, this driver is not related to "hdlc_fr.c" or any out-of-tree > driver. The default qdisc is "noqueue" for this driver because this > driver doesn't set "tx_queue_len". This means the value of > "tx_queue_len" would be 0. In this case, "alloc_netdev_mqs" will > automatically add the "IFF_NO_QUEUE" flag to the device, then > "attach_one_default_qdisc" in "net/sched/sch_generic.c" will attach > the "noqueue" qdisc for devices with the "IFF_NO_QUEUE" flag. I see.
Hi Martin, Could you ack? Thanks!
On 2021-03-14 02:59, Xie He wrote: > Hi Martin, > > Could you ack? Thanks! Acked-by: Martin Schiller <ms@dev.tdt.de>
On Tue, Mar 16, 2021 at 8:17 AM Martin Schiller <ms@dev.tdt.de> wrote: > > On 2021-03-14 02:59, Xie He wrote: > > Hi Martin, > > > > Could you ack? Thanks! > > Acked-by: Martin Schiller <ms@dev.tdt.de> Thank you!!
diff --git a/drivers/net/wan/lapbether.c b/drivers/net/wan/lapbether.c index c3372498f4f1..8fda0446ff71 100644 --- a/drivers/net/wan/lapbether.c +++ b/drivers/net/wan/lapbether.c @@ -51,6 +51,8 @@ struct lapbethdev { struct list_head node; struct net_device *ethdev; /* link to ethernet device */ struct net_device *axdev; /* lapbeth device (lapb#) */ + bool up; + spinlock_t up_lock; /* Protects "up" */ }; static LIST_HEAD(lapbeth_devices); @@ -101,8 +103,9 @@ static int lapbeth_rcv(struct sk_buff *skb, struct net_device *dev, struct packe rcu_read_lock(); lapbeth = lapbeth_get_x25_dev(dev); if (!lapbeth) - goto drop_unlock; - if (!netif_running(lapbeth->axdev)) + goto drop_unlock_rcu; + spin_lock_bh(&lapbeth->up_lock); + if (!lapbeth->up) goto drop_unlock; len = skb->data[0] + skb->data[1] * 256; @@ -117,11 +120,14 @@ static int lapbeth_rcv(struct sk_buff *skb, struct net_device *dev, struct packe goto drop_unlock; } out: + spin_unlock_bh(&lapbeth->up_lock); rcu_read_unlock(); return 0; drop_unlock: kfree_skb(skb); goto out; +drop_unlock_rcu: + rcu_read_unlock(); drop: kfree_skb(skb); return 0; @@ -151,13 +157,11 @@ static int lapbeth_data_indication(struct net_device *dev, struct sk_buff *skb) static netdev_tx_t lapbeth_xmit(struct sk_buff *skb, struct net_device *dev) { + struct lapbethdev *lapbeth = netdev_priv(dev); int err; - /* - * Just to be *really* sure not to send anything if the interface - * is down, the ethernet device may have gone. - */ - if (!netif_running(dev)) + spin_lock_bh(&lapbeth->up_lock); + if (!lapbeth->up) goto drop; /* There should be a pseudo header of 1 byte added by upper layers. @@ -194,6 +198,7 @@ static netdev_tx_t lapbeth_xmit(struct sk_buff *skb, goto drop; } out: + spin_unlock_bh(&lapbeth->up_lock); return NETDEV_TX_OK; drop: kfree_skb(skb); @@ -285,6 +290,7 @@ static const struct lapb_register_struct lapbeth_callbacks = { */ static int lapbeth_open(struct net_device *dev) { + struct lapbethdev *lapbeth = netdev_priv(dev); int err; if ((err = lapb_register(dev, &lapbeth_callbacks)) != LAPB_OK) { @@ -292,13 +298,22 @@ static int lapbeth_open(struct net_device *dev) return -ENODEV; } + spin_lock_bh(&lapbeth->up_lock); + lapbeth->up = true; + spin_unlock_bh(&lapbeth->up_lock); + return 0; } static int lapbeth_close(struct net_device *dev) { + struct lapbethdev *lapbeth = netdev_priv(dev); int err; + spin_lock_bh(&lapbeth->up_lock); + lapbeth->up = false; + spin_unlock_bh(&lapbeth->up_lock); + if ((err = lapb_unregister(dev)) != LAPB_OK) pr_err("lapb_unregister error: %d\n", err); @@ -356,6 +371,9 @@ static int lapbeth_new_device(struct net_device *dev) dev_hold(dev); lapbeth->ethdev = dev; + lapbeth->up = false; + spin_lock_init(&lapbeth->up_lock); + rc = -EIO; if (register_netdevice(ndev)) goto fail;
There are two "netif_running" checks in this driver. One is in "lapbeth_xmit" and the other is in "lapbeth_rcv". They serve to make sure that the LAPB APIs called in these functions are called before "lapb_unregister" is called by the "ndo_stop" function. However, these "netif_running" checks are unreliable, because it's possible that immediately after "netif_running" returns true, "ndo_stop" is called (which causes "lapb_unregister" to be called). This patch adds locking to make sure "lapbeth_xmit" and "lapbeth_rcv" can reliably check and ensure the netif is running while doing their work. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Xie He <xie.he.0141@gmail.com> --- drivers/net/wan/lapbether.c | 32 +++++++++++++++++++++++++------- 1 file changed, 25 insertions(+), 7 deletions(-)