Message ID | 20230711152647.28673-1-mike.looijmans@topic.nl (mailing list archive) |
---|---|
State | Awaiting Upstream |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | can: mcp251xfd: Always stop on BUS_OFF and call netif_stop_queue | expand |
Context | Check | Description |
---|---|---|
netdev/tree_selection | success | Series ignored based on subject |
On 11.07.2023 17:26:47, Mike Looijmans wrote: > When there's an error attempting to store the BER counter, don't abort > but continue shutting down the chip as required. If you refer to an error by __mcp251xfd_get_berr_counter(), it's not a store error, but a failure of regmap_read(priv->map_reg, MCP251XFD_REG_TREC, &trec). By default the SPI transfers are CRC enabled and no/wrong data from the chip will be detected and return an error here (after 3 retires). In out of memory conditions or other kernel errors might be possible here, too. Have you seen a problem here? But as we shut down the chip here anyways, we can ignore the error here. > After disabling communications, also stop the TX queue with a call to > netif_stop_queue. can_bus_off() will call netif_carrier_off(), isn't this sufficient? Have you enabled automatic restart in case of bus off? I think the netdev watchdog will kick you, if the interface has a stooped queue for too long (IIRC 5 seconds). > When the interface restarts, mcp251xfd_set_mode will > call netif_wake_queue and resume. > > This fixes a hangup in either send() or poll() from userspace. To > reproduce: I ran "cansequence can0 -p" to flood the system with packets. > While running that, I shorted the CAN signals, causing a bus error. > Usually communications would resume after the CAN bus restarted, but > sometimes the system got stuck and refused to send any more packets. > The sending process would be in either poll() or send(), waiting for > the queue to resume. To "unstuck" the process from send() it was > sufficient to send any packet on the can interface from another > process. To get it out of the poll() hang, only bringing the can > interface down (and up) would work. > > After adding the netif_stop_queue call, I was unable to reproduce the > problem. The newly added netif_stop_queue() will cause the netif_wake_queue() in the mcp251xfd_set_mode() to actually wake the queue. If you observe a problem, I think it's a general problem, so all drivers would be effected. > Signed-off-by: Mike Looijmans <mike.looijmans@topic.nl> Marc
Met vriendelijke groet / kind regards, Mike Looijmans System Expert TOPIC Embedded Products B.V. Materiaalweg 4, 5681 RJ Best The Netherlands T: +31 (0) 499 33 69 69 E: mike.looijmans@topicproducts.com W: www.topic.nl Please consider the environment before printing this e-mail On 11-07-2023 19:05, Marc Kleine-Budde wrote: > On 11.07.2023 17:26:47, Mike Looijmans wrote: >> When there's an error attempting to store the BER counter, don't abort >> but continue shutting down the chip as required. > > If you refer to an error by __mcp251xfd_get_berr_counter(), it's not a > store error, but a failure of regmap_read(priv->map_reg, > MCP251XFD_REG_TREC, &trec). By default the SPI transfers are CRC enabled > and no/wrong data from the chip will be detected and return an error > here (after 3 retires). In out of memory conditions or other kernel > errors might be possible here, too. > > Have you seen a problem here? But as we shut down the chip here anyways, > we can ignore the error here. Excactly my point, the important task here is to report the bus off state. I haven't seen the error trigger, so it wasn't the cause of our problems. In retrospect, maybe I should have split this in two patches? >> After disabling communications, also stop the TX queue with a call to >> netif_stop_queue. > > can_bus_off() will call netif_carrier_off(), isn't this sufficient? Have > you enabled automatic restart in case of bus off? I think the netdev > watchdog will kick you, if the interface has a stooped queue for too > long (IIRC 5 seconds). Apparently not. The problem also occurred in the field, where the CAN communication got stuck for something like a full day. Automatic restart was enabled, with a 100ms timeout. Setting a shorter timeout like 50 ms seemed to make the problem more likely to occur What I also noticed, but didn't accurately measure, is that the time it took to recover used to be somewhat random, it could take up to about a second for packets to start going out again, as observed by watching LEDs. After adding netif_stop_queue the recovery was almost instantaneous. > >> When the interface restarts, mcp251xfd_set_mode will >> call netif_wake_queue and resume. >> >> This fixes a hangup in either send() or poll() from userspace. To >> reproduce: I ran "cansequence can0 -p" to flood the system with packets. >> While running that, I shorted the CAN signals, causing a bus error. >> Usually communications would resume after the CAN bus restarted, but >> sometimes the system got stuck and refused to send any more packets. >> The sending process would be in either poll() or send(), waiting for >> the queue to resume. To "unstuck" the process from send() it was >> sufficient to send any packet on the can interface from another >> process. To get it out of the poll() hang, only bringing the can >> interface down (and up) would work. >> >> After adding the netif_stop_queue call, I was unable to reproduce the >> problem. > > The newly added netif_stop_queue() will cause the netif_wake_queue() in > the mcp251xfd_set_mode() to actually wake the queue. If you observe a > problem, I think it's a general problem, so all drivers would be > effected. After two days digging, my conclusion was that "something" wasn't getting a proper "kick" when the CAN interface came back online. One could argue the netif_stop_queue should be put into can_bus_off. But the netif_wake_queue call is in the driver, so that would break symmetry. > >> Signed-off-by: Mike Looijmans <mike.looijmans@topic.nl> > > Marc >
diff --git a/drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c b/drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c index 68df6d4641b5..854e0644764c 100644 --- a/drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c +++ b/drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c @@ -1084,10 +1084,11 @@ static int mcp251xfd_handle_cerrif(struct mcp251xfd_priv *priv) */ err = __mcp251xfd_get_berr_counter(priv->ndev, &priv->bec); if (err) - return err; + netdev_err(priv->ndev, "Failed to store BER counter\n"); mcp251xfd_chip_stop(priv, CAN_STATE_BUS_OFF); can_bus_off(priv->ndev); + netif_stop_queue(priv->ndev); } if (!skb)
When there's an error attempting to store the BER counter, don't abort but continue shutting down the chip as required. After disabling communications, also stop the TX queue with a call to netif_stop_queue. When the interface restarts, mcp251xfd_set_mode will call netif_wake_queue and resume. This fixes a hangup in either send() or poll() from userspace. To reproduce: I ran "cansequence can0 -p" to flood the system with packets. While running that, I shorted the CAN signals, causing a bus error. Usually communications would resume after the CAN bus restarted, but sometimes the system got stuck and refused to send any more packets. The sending process would be in either poll() or send(), waiting for the queue to resume. To "unstuck" the process from send() it was sufficient to send any packet on the can interface from another process. To get it out of the poll() hang, only bringing the can interface down (and up) would work. After adding the netif_stop_queue call, I was unable to reproduce the problem. Signed-off-by: Mike Looijmans <mike.looijmans@topic.nl> --- drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)