Message ID | 20231017085341.813335-1-yoshihiro.shimoda.uh@renesas.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net] ravb: Fix races between ravb_tx_timeout_work() and net related ops | expand |
Hello! On 10/17/23 11:53 AM, Yoshihiro Shimoda wrote: > Fix races between ravb_tx_timeout_work() and functions of net_device_ops > and ethtool_ops by using rtnl_trylock() and rtnl_unlock(). Note that > since ravb_close() is under the rtnl lock and calls cancel_work_sync(), > ravb_tx_timeout_work() calls rtnl_trylock() to avoid a deadlock. I don't quite follow... how calling cancel_work_sync() is a problem? I thought the problem was that unregister_netdev() can be called with the TX timeout work still pending? And, more generally, shouldn't we protect against the TX timeout work being executed on a different CPU than the {net_device|ethtool}_ops methods are being executed (is that possible?)? I also had a suspicion that we still miss the spinlock calls in ravb_tx_timeout_work() but nothing in particular jumped at me... mind looking into that? :-) > Fixes: c156633f1353 ("Renesas Ethernet AVB driver proper") > Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> > --- > drivers/net/ethernet/renesas/ravb_main.c | 5 +++++ > 1 file changed, 5 insertions(+) > > diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c > index 0ef0b88b7145..b53533ab4599 100644 > --- a/drivers/net/ethernet/renesas/ravb_main.c > +++ b/drivers/net/ethernet/renesas/ravb_main.c [...] > @@ -1907,6 +1910,7 @@ static void ravb_tx_timeout_work(struct work_struct *work) > */ > netdev_err(ndev, "%s: ravb_dmac_init() failed, error %d\n", > __func__, error); > + rtnl_unlock(); > return; Perhaps *goto* instead here? > } > ravb_emac_init(ndev); > @@ -1917,6 +1921,7 @@ static void ravb_tx_timeout_work(struct work_struct *work) > ravb_ptp_init(ndev, priv->pdev); > > netif_tx_start_all_queues(ndev); ... and add label here? > + rtnl_unlock(); > } MBR, Sergey
Hello Sergey, > From: Sergey Shtylyov, Sent: Wednesday, October 18, 2023 3:59 AM > > Hello! > > On 10/17/23 11:53 AM, Yoshihiro Shimoda wrote: > > > Fix races between ravb_tx_timeout_work() and functions of net_device_ops > > and ethtool_ops by using rtnl_trylock() and rtnl_unlock(). Note that > > since ravb_close() is under the rtnl lock and calls cancel_work_sync(), > > ravb_tx_timeout_work() calls rtnl_trylock() to avoid a deadlock. > > I don't quite follow... how calling cancel_work_sync() is a problem? > I thought the problem was that unregister_netdev() can be called with > the TX timeout work still pending? And, more generally, shouldn't we > protect against the TX timeout work being executed on a different CPU > than the {net_device|ethtool}_ops methods are being executed (is that > possible?)? __dev_close_many() in net/core/dev.c calls ASSERT_RTNL() and ops->ndo_stop(). So, the ravb_close() is under rtnl lock. While locking the rtnl, it's possible to call ravb_tx_timeout_work() on other CPU. In such a case, it's possible to cause a deadlock between ravb_close() and ravb_tx_timeout_work() CPU0 CPU1 ravb_tx_timeout() schedule_work() ... __dev_close_many() // this is under rtnl lock ravb_close() cancel_work_sync() ravb_tx_timeout_work() rtnl_lock() // this is possible to cause a deadlock > I also had a suspicion that we still miss the spinlock calls in > ravb_tx_timeout_work() but nothing in particular jumped at me... > mind looking into that? :-) Yes, perhaps we have to check it somehow... > > Fixes: c156633f1353 ("Renesas Ethernet AVB driver proper") > > Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> > > --- > > drivers/net/ethernet/renesas/ravb_main.c | 5 +++++ > > 1 file changed, 5 insertions(+) > > > > diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c > > index 0ef0b88b7145..b53533ab4599 100644 > > --- a/drivers/net/ethernet/renesas/ravb_main.c > > +++ b/drivers/net/ethernet/renesas/ravb_main.c > [...] > > @@ -1907,6 +1910,7 @@ static void ravb_tx_timeout_work(struct work_struct *work) > > */ > > netdev_err(ndev, "%s: ravb_dmac_init() failed, error %d\n", > > __func__, error); > > + rtnl_unlock(); > > return; > > Perhaps *goto* instead here? ... > > } > > ravb_emac_init(ndev); > > @@ -1917,6 +1921,7 @@ static void ravb_tx_timeout_work(struct work_struct *work) > > ravb_ptp_init(ndev, priv->pdev); > > > > netif_tx_start_all_queues(ndev); > > ... and add label here? I got it. Using goto is better, I think. Best regards, Yoshihiro Shimoda > > + rtnl_unlock(); > > } > > MBR, Sergey
On 10/18/23 12:39 PM, Yoshihiro Shimoda wrote: [...] >>> Fix races between ravb_tx_timeout_work() and functions of net_device_ops >>> and ethtool_ops by using rtnl_trylock() and rtnl_unlock(). Note that >>> since ravb_close() is under the rtnl lock and calls cancel_work_sync(), >>> ravb_tx_timeout_work() calls rtnl_trylock() to avoid a deadlock. >> >> I don't quite follow... how calling cancel_work_sync() is a problem? >> I thought the problem was that unregister_netdev() can be called with >> the TX timeout work still pending? And, more generally, shouldn't we >> protect against the TX timeout work being executed on a different CPU >> than the {net_device|ethtool}_ops methods are being executed (is that >> possible?)? > > __dev_close_many() in net/core/dev.c calls ASSERT_RTNL() and ops->ndo_stop(). > So, the ravb_close() is under rtnl lock. While locking the rtnl, it's > possible to call ravb_tx_timeout_work() on other CPU. In such a case, > it's possible to cause a deadlock between ravb_close() and ravb_tx_timeout_work() > > CPU0 CPU1 > ravb_tx_timeout() > schedule_work() > ... > __dev_close_many() > // this is under rtnl lock > ravb_close() > cancel_work_sync() > ravb_tx_timeout_work() > rtnl_lock() > // this is possible to cause a deadlock Ah, cancel_work_sync() means we have to wait for the work to finish -- indeed a deadlock is possiblet then. >> I also had a suspicion that we still miss the spinlock calls in >> ravb_tx_timeout_work() but nothing in particular jumped at me... We mainly need to protect against the interrupts in this case... >> mind looking into that? :-) > > Yes, perhaps we have to check it somehow... Unfortunately, I don't seem to have no bandwidth to do that myself... [...] MBR, Sergey
Hello Sergey, > From: Sergey Shtylyov, Sent: Thursday, October 19, 2023 2:27 AM > > On 10/18/23 12:39 PM, Yoshihiro Shimoda wrote: > [...] > >>> Fix races between ravb_tx_timeout_work() and functions of net_device_ops > >>> and ethtool_ops by using rtnl_trylock() and rtnl_unlock(). Note that > >>> since ravb_close() is under the rtnl lock and calls cancel_work_sync(), > >>> ravb_tx_timeout_work() calls rtnl_trylock() to avoid a deadlock. > >> > >> I don't quite follow... how calling cancel_work_sync() is a problem? > >> I thought the problem was that unregister_netdev() can be called with > >> the TX timeout work still pending? And, more generally, shouldn't we > >> protect against the TX timeout work being executed on a different CPU > >> than the {net_device|ethtool}_ops methods are being executed (is that > >> possible?)? > > > > __dev_close_many() in net/core/dev.c calls ASSERT_RTNL() and ops->ndo_stop(). > > So, the ravb_close() is under rtnl lock. While locking the rtnl, it's > > possible to call ravb_tx_timeout_work() on other CPU. In such a case, > > it's possible to cause a deadlock between ravb_close() and ravb_tx_timeout_work() > > > > CPU0 CPU1 > > ravb_tx_timeout() > > schedule_work() > > ... > > __dev_close_many() > > // this is under rtnl lock > > ravb_close() > > cancel_work_sync() > > ravb_tx_timeout_work() > > rtnl_lock() > > // this is possible to cause a deadlock > > Ah, cancel_work_sync() means we have to wait for the work to > finish -- indeed a deadlock is possiblet then. Thank you for your reply. I'll update commit description in v2. > >> I also had a suspicion that we still miss the spinlock calls in > >> ravb_tx_timeout_work() but nothing in particular jumped at me... > > We mainly need to protect against the interrupts in this case... I think so. However, we can not use spin_lock_irqsave() for whole this ravb_tx_timeout_work() because ravb_ring_init() calls kcalloc() with GFP_KERNEL. > >> mind looking into that? :-) > > > > Yes, perhaps we have to check it somehow... > > Unfortunately, I don't seem to have no bandwidth to do that myself... I got it. I'll investigate this later. Best regards, Yoshihiro Shimoda > [...] > > MBR, Sergey
diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c index 0ef0b88b7145..b53533ab4599 100644 --- a/drivers/net/ethernet/renesas/ravb_main.c +++ b/drivers/net/ethernet/renesas/ravb_main.c @@ -1874,6 +1874,9 @@ static void ravb_tx_timeout_work(struct work_struct *work) struct net_device *ndev = priv->ndev; int error; + if (!rtnl_trylock()) + return; + netif_tx_stop_all_queues(ndev); /* Stop PTP Clock driver */ @@ -1907,6 +1910,7 @@ static void ravb_tx_timeout_work(struct work_struct *work) */ netdev_err(ndev, "%s: ravb_dmac_init() failed, error %d\n", __func__, error); + rtnl_unlock(); return; } ravb_emac_init(ndev); @@ -1917,6 +1921,7 @@ static void ravb_tx_timeout_work(struct work_struct *work) ravb_ptp_init(ndev, priv->pdev); netif_tx_start_all_queues(ndev); + rtnl_unlock(); } /* Packet transmit function for Ethernet AVB */
Fix races between ravb_tx_timeout_work() and functions of net_device_ops and ethtool_ops by using rtnl_trylock() and rtnl_unlock(). Note that since ravb_close() is under the rtnl lock and calls cancel_work_sync(), ravb_tx_timeout_work() calls rtnl_trylock() to avoid a deadlock. Fixes: c156633f1353 ("Renesas Ethernet AVB driver proper") Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> --- drivers/net/ethernet/renesas/ravb_main.c | 5 +++++ 1 file changed, 5 insertions(+)