Message ID | 20250116225521.2688224-1-sean.anderson@linux.dev (mailing list archive) |
---|---|
Headers | show |
Series | spi: zynqmp-gqspi: Improve error recovery by resetting | expand |
On Thu, Jan 16, 2025 at 05:55:16PM -0500, Sean Anderson wrote: > This series adds support for resetting the QSPI controller if we have a > timeout. I find this greatly improves the stability of the device, which > would tend to break after any timeout. If you're hitting a timeout that tends to indicate there's already a serious stability problem...
On 1/17/25 08:21, Mark Brown wrote: > On Thu, Jan 16, 2025 at 05:55:16PM -0500, Sean Anderson wrote: >> This series adds support for resetting the QSPI controller if we have a >> timeout. I find this greatly improves the stability of the device, which >> would tend to break after any timeout. > > If you're hitting a timeout that tends to indicate there's already a > serious stability problem... This was mostly hit when I was hacking on the driver. But of course there are probably still bugs lurking in this driver, so I think it is good to handle the exceptional conditions in a more-robust way. --Sean
On Fri, Jan 17, 2025 at 11:17:35AM -0500, Sean Anderson wrote: > On 1/17/25 08:21, Mark Brown wrote: > > If you're hitting a timeout that tends to indicate there's already a > > serious stability problem... > This was mostly hit when I was hacking on the driver. But of course > there are probably still bugs lurking in this driver, so I think it is > good to handle the exceptional conditions in a more-robust way. I'm not saying it's a bad idea, but your changelog is written in a way that makes it sound like timeouts are normal.
On 1/17/25 11:48, Mark Brown wrote: > On Fri, Jan 17, 2025 at 11:17:35AM -0500, Sean Anderson wrote: >> On 1/17/25 08:21, Mark Brown wrote: > >> > If you're hitting a timeout that tends to indicate there's already a >> > serious stability problem... > >> This was mostly hit when I was hacking on the driver. But of course >> there are probably still bugs lurking in this driver, so I think it is >> good to handle the exceptional conditions in a more-robust way. > > I'm not saying it's a bad idea, but your changelog is written in a way > that makes it sound like timeouts are normal. I've updated my cover letter for v2 to include the above blurb. Hopefully that clears things up. --Sean
On 17/01/2025 at 13:21:58 GMT, Mark Brown <broonie@kernel.org> wrote: > On Thu, Jan 16, 2025 at 05:55:16PM -0500, Sean Anderson wrote: >> This series adds support for resetting the QSPI controller if we have a >> timeout. I find this greatly improves the stability of the device, which >> would tend to break after any timeout. > > If you're hitting a timeout that tends to indicate there's already a > serious stability problem... Yes, unless the timeout is reached for "good reasons", ie. you request substantial amounts of data (typically from a memory device) and the timeout is too short compared to the theoretical time spent in the transfer. A loaded machine can also increase the number of false positives I guess. Cheers, Miquèl
On Fri, Jan 17, 2025 at 07:31:08PM +0100, Miquel Raynal wrote: > On 17/01/2025 at 13:21:58 GMT, Mark Brown <broonie@kernel.org> wrote: > > If you're hitting a timeout that tends to indicate there's already a > > serious stability problem... > Yes, unless the timeout is reached for "good reasons", ie. you request > substantial amounts of data (typically from a memory device) and the > timeout is too short compared to the theoretical time spent in the > transfer. A loaded machine can also increase the number of false > positives I guess. I'd argue that all of those are bad reasons, I'd only expect us to time out when there's a bug - choosing too low a timeout or doing things in a way that generates timeouts under load is a problem.
On 1/17/25 13:41, Mark Brown wrote: > On Fri, Jan 17, 2025 at 07:31:08PM +0100, Miquel Raynal wrote: >> On 17/01/2025 at 13:21:58 GMT, Mark Brown <broonie@kernel.org> wrote: > >> > If you're hitting a timeout that tends to indicate there's already a >> > serious stability problem... > >> Yes, unless the timeout is reached for "good reasons", ie. you request >> substantial amounts of data (typically from a memory device) and the >> timeout is too short compared to the theoretical time spent in the >> transfer. A loaded machine can also increase the number of false >> positives I guess. > > I'd argue that all of those are bad reasons, I'd only expect us to time > out when there's a bug - choosing too low a timeout or doing things in a > way that generates timeouts under load is a problem. There's no transmit DMA for this device. So if you are under high load and make a long transfer, it's possible to time out. I don't know if it's possible to fix that very easily. The timeout calculation assumes that data is being transferred at the SPI bus rate. That said, in the common case (NOR flash) writes don't work like that. To write a flash, we make a short transfer (such as an eraseblock) and then poll the status register before making another transfer. With short transfers there is less probability of timing out because the extra time makes up more of the duration. --Sean
On Fri, Jan 17, 2025 at 04:46:23PM -0500, Sean Anderson wrote: > On 1/17/25 13:41, Mark Brown wrote: > > On Fri, Jan 17, 2025 at 07:31:08PM +0100, Miquel Raynal wrote: > >> Yes, unless the timeout is reached for "good reasons", ie. you request > >> substantial amounts of data (typically from a memory device) and the > >> timeout is too short compared to the theoretical time spent in the > >> transfer. A loaded machine can also increase the number of false > >> positives I guess. > > I'd argue that all of those are bad reasons, I'd only expect us to time > > out when there's a bug - choosing too low a timeout or doing things in a > > way that generates timeouts under load is a problem. > There's no transmit DMA for this device. So if you are under high load > and make a long transfer, it's possible to time out. I don't know if > it's possible to fix that very easily. The timeout calculation assumes > that data is being transferred at the SPI bus rate. In that case I wouldn't expect the timeout to apply to the whole operation, or I'd expect a timeout applied waiting for something interrupt driven to not to be fired unless we stop making forward progress.