[net] can: sja1000: Always restart the Tx queue after an overrun

Upstream commit 717c6ec241b5 ("can: sja1000: Prevent overrun stalls with
a soft reset on Renesas SoCs") fixes a reception issue with Renesas' own
SJA1000 CAN controller: the Rx buffer is only 5 messages long, so when
the bus is loaded (e.g. a message every 50us), an overrun may easily
happen. Upon an overrun, possibly due to internal crosstalk, the
controller enters a frozen state which, as found experimentally, can
only be unlocked with a soft reset. The solution was to offload a call
to sja1000_start() to a threaded handler. This needs to happen in
process context as the operation needs to sleep. sja1000_start()
basically enters "reset mode", performs a proper software reset and
returns to "normal mode".
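
To put numbers on the overrun window, here is a minimal sketch of the
arithmetic. It is a hypothetical model and not driver code: the function
name and parameters are made up for illustration, and it assumes the Rx
buffer is empty when the CPU stops draining it and that frames arrive at
a fixed period.

```c
#include <assert.h>

/*
 * Hypothetical model, not driver code: count the frames lost while the
 * Rx buffer is not drained for stall_us microseconds. Assumes the buffer
 * is empty when the stall begins and that one frame arrives every
 * period_us microseconds.
 */
static unsigned int rx_frames_lost(unsigned int depth,
				   unsigned int period_us,
				   unsigned int stall_us)
{
	unsigned int arrived;

	assert(period_us > 0);
	arrived = stall_us / period_us;

	/* Frames beyond the buffer depth have nowhere to go. */
	return arrived > depth ? arrived - depth : 0;
}
```

With a 5-message buffer and a frame every 50us, a 250us stall just fits
in this model, while a 300us stall already loses one frame.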

Since this fix was introduced, we no longer observe any stalls in
reception. However, it was sporadically observed that the transmit path
would now freeze. Further investigation blamed the fix mentioned above,
and especially the reset operation. Reproducing the reset in a loop
helped identify what could possibly go wrong. The sja1000 is a single
Tx queue device, which leverages the netdev helpers to process one Tx
message at a time. The logic is: the queue is stopped and the message is
sent to the transceiver; once it is properly transmitted, the controller
sets a status bit which triggers an interrupt; in the interrupt handler,
the transmission status is checked and the queue is woken up.
Unfortunately, if an overrun happens, we might perform the soft reset
precisely between the transmission of the buffer to the transceiver and
the advent of the transmission status bit. We would then stop the
transmission operation without re-enabling the queue, causing all
further transmissions to be ignored.
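
The sequence described above can be sketched as a small user-space
model. This is only an illustration under stated assumptions: the struct
and the model_*() names are hypothetical and do not exist in the driver,
and the two flags merely stand in for the netdev queue state and the
frame held by the controller.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of the single-slot Tx path; not driver code. */
struct tx_model {
	bool queue_stopped;	/* stands in for the netdev queue state */
	bool frame_in_flight;	/* frame handed to the controller, not done yet */
};

/* Transmit path: stop the queue, hand the frame to the controller. */
static bool model_start_xmit(struct tx_model *m)
{
	if (m->queue_stopped)
		return false;	/* queue stopped: the frame is not accepted */
	m->queue_stopped = true;
	m->frame_in_flight = true;
	return true;
}

/* Tx-complete interrupt: only has an effect if a frame was in flight. */
static void model_tx_irq(struct tx_model *m)
{
	if (!m->frame_in_flight)
		return;
	m->frame_in_flight = false;
	m->queue_stopped = false;	/* wake the queue */
}

/* Soft reset without any queue handling: the frame is aborted. */
static void model_soft_reset(struct tx_model *m)
{
	m->frame_in_flight = false;
}

/* Returns true when a reset between xmit and Tx-complete stalls the queue. */
static bool model_reset_stalls_tx(void)
{
	struct tx_model m = { false, false };

	(void)model_start_xmit(&m);
	assert(m.queue_stopped);
	model_soft_reset(&m);	/* overrun handled before Tx completes */
	model_tx_irq(&m);	/* the completion never arrives: no-op */

	return !model_start_xmit(&m);
}
```

In this model, model_reset_stalls_tx() reports a stall because nothing
ever clears queue_stopped once the in-flight frame has been aborted.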

The reset interrupt can only happen while the device is "open", and
after a reset we want to resume normal operations anyway, no matter if a
packet to transmit got dropped in the process, so we shall wake up the
queue. Restarting the device and waking up the queue is exactly what
sja1000_set_mode(CAN_MODE_START) does. In order to be consistent about
the queue state, we must acquire a lock both in the reset handler and in
the transmit path to ensure serialization of both operations. As the
reset handler might still be called after the transmission of a frame to
the transceiver but before it actually gets transmitted, we must ensure
we don't leak the skb, so we free it (the behavior is consistent, no
matter if there was an skb on the stack or not).
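
The locking and cleanup rules above can be sketched in user space as
follows. This is a hypothetical model, not the code of this patch: the
names are invented, a pthread mutex stands in for whatever lock the
driver uses, and a malloc()ed buffer stands in for the in-flight skb.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical user-space model of the fix; names are not the driver's. */
struct tx_model {
	pthread_mutex_t lock;	/* stands in for the lock shared by both paths */
	bool queue_stopped;
	void *pending;		/* stands in for the in-flight skb */
};

static void model_init(struct tx_model *m)
{
	pthread_mutex_init(&m->lock, NULL);
	m->queue_stopped = false;
	m->pending = NULL;
}

/* Transmit path: takes ownership of buf when the frame is accepted. */
static bool model_start_xmit(struct tx_model *m, void *buf)
{
	bool accepted = false;

	pthread_mutex_lock(&m->lock);
	if (!m->queue_stopped) {
		m->queue_stopped = true;
		m->pending = buf;
		accepted = true;
	}
	pthread_mutex_unlock(&m->lock);

	return accepted;
}

/* Reset handler: drop any in-flight buffer, then always wake the queue. */
static void model_reset(struct tx_model *m)
{
	pthread_mutex_lock(&m->lock);
	free(m->pending);	/* free(NULL) is a no-op: same path either way */
	m->pending = NULL;
	m->queue_stopped = false;
	pthread_mutex_unlock(&m->lock);
}

/* Returns true when transmission works again after a mid-flight reset. */
static bool model_reset_recovers_tx(void)
{
	struct tx_model m;
	bool ok;

	model_init(&m);
	(void)model_start_xmit(&m, malloc(16));
	model_reset(&m);			/* overrun before Tx completes */
	assert(m.pending == NULL && !m.queue_stopped);
	ok = model_start_xmit(&m, malloc(16));	/* queue is usable again */
	model_reset(&m);			/* release the second buffer */
	pthread_mutex_destroy(&m.lock);

	return ok;
}
```

Because both paths take the same lock, the reset in this model can never
observe a half-updated queue state, and the buffer is released whether
or not a frame was pending.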

Fixes: 717c6ec241b5 ("can: sja1000: Prevent overrun stalls with a soft reset on Renesas SoCs")
Cc: stable@vger.kernel.org
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
---

This patch was written and tested on a slightly older stable kernel, as
this is the kernel that runs on the boards which showed the problem, but
there should be no difference with upstream kernels.

 drivers/net/can/sja1000/sja1000.c          | 13 ++++++++++++-
 drivers/net/can/sja1000/sja1000.h          |  1 +
 drivers/net/can/sja1000/sja1000_platform.c |  2 ++
 3 files changed, 15 insertions(+), 1 deletion(-)

Message ID	20230922154727.591672-1-miquel.raynal@bootlin.com (mailing list archive)
State	Superseded
Delegated to:	Netdev Maintainers
Headers	From: Miquel Raynal <miquel.raynal@bootlin.com>
	To: Wolfgang Grandegger <wg@grandegger.com>, Marc Kleine-Budde <mkl@pengutronix.de>
	Cc: "David S. Miller" <davem@davemloft.net>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>, Eric Dumazet <edumazet@google.com>, netdev@vger.kernel.org, linux-can@vger.kernel.org, Jérémie Dautheribes <jeremie.dautheribes@bootlin.com>, Thomas Petazzoni <thomas.petazzoni@bootlin.com>, sylvain.girard@se.com, pascal.eberhard@se.com, Miquel Raynal <miquel.raynal@bootlin.com>, stable@vger.kernel.org
	Subject: [PATCH net] can: sja1000: Always restart the Tx queue after an overrun
	Date: Fri, 22 Sep 2023 17:47:27 +0200
	Message-Id: <20230922154727.591672-1-miquel.raynal@bootlin.com>

Context	Check	Description
netdev/tree_selection	success	Clearly marked for net
netdev/apply	fail	Patch does not apply to net
