diff mbox series

[v4,8/8] serial: qcom-geni: Rework TX in FIFO mode to fix hangs/lockups

Message ID 20240610152420.v4.8.I1af05e555c42a9c98435bb7aee0ee60e3dcd015e@changeid (mailing list archive)
State Not Applicable
Headers show
Series serial: qcom-geni: Overhaul TX handling to fix crashes/hangs | expand

Commit Message

Doug Anderson June 10, 2024, 10:24 p.m. UTC
The fact that the Qualcomm GENI hardware interface is based around
"packets" is really awkward to fit into Linux's UART design.
Specifically, in order to send bytes you need to start up a new
"command" saying how many bytes you want to send and then you need to
send all those bytes. Once you've committed to sending that number of
bytes it's very awkward to change your mind and send fewer, especially
if you want to do so without dropping bytes on the ground.

There may be a few cases where you might want to send fewer bytes than
you originally expected:
1. You might want to interrupt the transfer with something higher
   priority, like the kernel console or kdb.
2. You might want to enter system suspend.
3. The user might have killed the program that had queued bytes for
   sending over the UART.

Despite this awkwardness the Linux driver has still tried to send
bytes using large transfers. Whenever the driver started a new
transfer it would look at the number of bytes in the OS's queue and
start a transfer for that many. The idea of using larger transfers is
that it should be more efficient. When you're in the middle of a large
transfer you can get interrupted when the hardware FIFO is close to
empty and add more bytes in. Whenever you get to the end of a transfer
you have to wait until the transfer is totally done before you can add
more bytes and, depending on interrupt latency, that can cause the
UART to idle a bit.

Unfortunately there were lots of corner cases that the Linux driver
didn't handle.

One problem with the current driver is that if the user killed the
program that queued bytes for sending over the UART then bad things
would happen. Before commit 1788cf6a91d9 ("tty: serial: switch from
circ_buf to kfifo") we'd just send stale data out the UART. After that
commit we'll hard lockup.

Another problem with the current driver can be seen if you queue a
bunch of data to the UART and enter kdb. Specifically on a device
_without_ kernel console on the UART, with an agetty on the UART, and
with kgdb on the UART, doing `cat /var/log/messages` and then dropping
into kdb and resuming caused console output to stop.

Give up on trying to use large transfers in FIFO mode on GENI UART
since there doesn't appear to be any way to solve these problems
cleanly. Visually inspecting the console output even after these
patches doesn't show any big pauses.

In order to make this all work:
- Switch the watermark interrupt to just being used to prime the TX
  pump. Once transfers are running, use "done" to queue the next
  batch. As part of this, change the watermark to fire whenever the
  queue is empty.
- Never queue more than what can fit in the FIFO. This means we don't
  need to keep track of a command we're partway through.
- For the console code and kgdb code where we can safely block while
  the queue empties, just do that rather than trying to queue a
  command when one was already in progress (which didn't work so well
  and is why there were some weird/awkward hacks in
  qcom_geni_serial_console_write()).
- Leave the CMD_DONE interrupt enabled all the time since there's
  never any reason we don't want to see it.
- Start using the "SE_GENI_M_IRQ_EN_SET" and "SE_GENI_M_IRQ_EN_CLEAR"
  registers to avoid read-modify-write of the "SE_GENI_M_IRQ_EN"
  register. This could be done in more of the driver if needed but for
  now just update code that's touched.

Fixes: 1788cf6a91d9 ("tty: serial: switch from circ_buf to kfifo")
Fixes: a1fee899e5be ("tty: serial: qcom_geni_serial: Fix softlock")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
---
I'm listing two "fixes" commits here. The first is the kfifo change
since it is very easy to see a hardlockup after that change. Almost
certainly anyone with the kfifo patch wants this patch. I've also
listed a much earlier patch as one being fixed since that was the one
that made us send larger transfers.

I've tested this commit on an sc7180-trogdor board both with and
without kernel console going to the UART. I've tested across some
suspend/resume cycles and with kgdb. I've also confirmed that
bluetooth, which uses the DMA paths in this driver, continues to work.
That all being said, a lot of things change here so I'd love any
testing folks want to do.

I'm not explicitly CCing stable here. The only truly terrible problem
is the hardlockup introduced by the kfifo change. The rest of the
issue have been around for years. If someone wants the fixes ported
back to stable that's fine but IMO unless you're seeing problems it's
not 100% required.

(no changes since v3)

Changes in v3:
- Reword commit message.

Changes in v2:
- New

 drivers/tty/serial/qcom_geni_serial.c | 192 +++++++++++++-------------
 1 file changed, 94 insertions(+), 98 deletions(-)

Comments

Konrad Dybcio June 17, 2024, 7:10 p.m. UTC | #1
On 6/11/24 00:24, Douglas Anderson wrote:
> The fact that the Qualcomm GENI hardware interface is based around
> "packets" is really awkward to fit into Linux's UART design.
> Specifically, in order to send bytes you need to start up a new
> "command" saying how many bytes you want to send and then you need to
> send all those bytes. Once you've committed to sending that number of
> bytes it's very awkward to change your mind and send fewer, especially
> if you want to do so without dropping bytes on the ground.

[...]

  
> +static void qcom_geni_serial_enable_cmd_done(struct uart_port *uport)
> +{
> +	struct qcom_geni_serial_port *port = to_dev_port(uport);
> +
> +	/* If we're not in FIFO mode we don't use CMD_DONE. */
> +	if (port->dev_data->mode != GENI_SE_FIFO)
> +		return;
> +
> +	writel(M_CMD_DONE_EN, uport->membase + SE_GENI_M_IRQ_EN_SET);
> +}

IDK if this is worth of a separate function, instead of checking for the
FIFO in port_setup and writing it there, but generally this patch looks
good to me

Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org>

Konrad
Doug Anderson June 17, 2024, 7:37 p.m. UTC | #2
Hi,

On Mon, Jun 17, 2024 at 12:10 PM Konrad Dybcio <konrad.dybcio@linaro.org> wrote:
>
>
>
> On 6/11/24 00:24, Douglas Anderson wrote:
> > The fact that the Qualcomm GENI hardware interface is based around
> > "packets" is really awkward to fit into Linux's UART design.
> > Specifically, in order to send bytes you need to start up a new
> > "command" saying how many bytes you want to send and then you need to
> > send all those bytes. Once you've committed to sending that number of
> > bytes it's very awkward to change your mind and send fewer, especially
> > if you want to do so without dropping bytes on the ground.
>
> [...]
>
>
> > +static void qcom_geni_serial_enable_cmd_done(struct uart_port *uport)
> > +{
> > +     struct qcom_geni_serial_port *port = to_dev_port(uport);
> > +
> > +     /* If we're not in FIFO mode we don't use CMD_DONE. */
> > +     if (port->dev_data->mode != GENI_SE_FIFO)
> > +             return;
> > +
> > +     writel(M_CMD_DONE_EN, uport->membase + SE_GENI_M_IRQ_EN_SET);
> > +}
>
> IDK if this is worth of a separate function, instead of checking for the
> FIFO in port_setup and writing it there, but generally this patch looks
> good to me

Sure. Somehow it felt weird to me to put it straight in there, but I
could go either way. Do you think I should spin the series just for
this, or just make this change if I happen to need to spin the series
for something else?


> Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org>

Thanks for your reviews, I appreciate it!

-Doug
Konrad Dybcio June 17, 2024, 7:54 p.m. UTC | #3
On 6/17/24 21:37, Doug Anderson wrote:
> Hi,
> 
> On Mon, Jun 17, 2024 at 12:10 PM Konrad Dybcio <konrad.dybcio@linaro.org> wrote:
>>
>>
>>
>> On 6/11/24 00:24, Douglas Anderson wrote:
>>> The fact that the Qualcomm GENI hardware interface is based around
>>> "packets" is really awkward to fit into Linux's UART design.
>>> Specifically, in order to send bytes you need to start up a new
>>> "command" saying how many bytes you want to send and then you need to
>>> send all those bytes. Once you've committed to sending that number of
>>> bytes it's very awkward to change your mind and send fewer, especially
>>> if you want to do so without dropping bytes on the ground.
>>
>> [...]
>>
>>
>>> +static void qcom_geni_serial_enable_cmd_done(struct uart_port *uport)
>>> +{
>>> +     struct qcom_geni_serial_port *port = to_dev_port(uport);
>>> +
>>> +     /* If we're not in FIFO mode we don't use CMD_DONE. */
>>> +     if (port->dev_data->mode != GENI_SE_FIFO)
>>> +             return;
>>> +
>>> +     writel(M_CMD_DONE_EN, uport->membase + SE_GENI_M_IRQ_EN_SET);
>>> +}
>>
>> IDK if this is worth of a separate function, instead of checking for the
>> FIFO in port_setup and writing it there, but generally this patch looks
>> good to me
> 
> Sure. Somehow it felt weird to me to put it straight in there, but I
> could go either way. Do you think I should spin the series just for
> this, or just make this change if I happen to need to spin the series
> for something else?

The latter.

Konrad
Johan Hovold June 24, 2024, 12:43 p.m. UTC | #4
On Mon, Jun 10, 2024 at 03:24:26PM -0700, Douglas Anderson wrote:
> The fact that the Qualcomm GENI hardware interface is based around
> "packets" is really awkward to fit into Linux's UART design.
> Specifically, in order to send bytes you need to start up a new
> "command" saying how many bytes you want to send and then you need to
> send all those bytes. Once you've committed to sending that number of
> bytes it's very awkward to change your mind and send fewer, especially
> if you want to do so without dropping bytes on the ground.
> 
> There may be a few cases where you might want to send fewer bytes than
> you originally expected:
> 1. You might want to interrupt the transfer with something higher
>    priority, like the kernel console or kdb.
> 2. You might want to enter system suspend.
> 3. The user might have killed the program that had queued bytes for
>    sending over the UART.
> 
> Despite this awkwardness the Linux driver has still tried to send
> bytes using large transfers. Whenever the driver started a new
> transfer it would look at the number of bytes in the OS's queue and
> start a transfer for that many. The idea of using larger transfers is
> that it should be more efficient. When you're in the middle of a large
> transfer you can get interrupted when the hardware FIFO is close to
> empty and add more bytes in. Whenever you get to the end of a transfer
> you have to wait until the transfer is totally done before you can add
> more bytes and, depending on interrupt latency, that can cause the
> UART to idle a bit.

As I mentioned last week, the slowdown from this is quite noticeable
(e.g. 25% slowdown at @115200), but this may be the price we need to pay
for correctness, at least temporarily.

An alternative might be to switch to using a 16 byte fifo. This should
reduce console latency even further, and may be able avoid the idling
UART penalty by continuing to use the watermark interrupt for refilling
the FIFO.

Johan
Doug Anderson June 24, 2024, 9:15 p.m. UTC | #5
Hi,

On Mon, Jun 24, 2024 at 5:43 AM Johan Hovold <johan@kernel.org> wrote:
>
> On Mon, Jun 10, 2024 at 03:24:26PM -0700, Douglas Anderson wrote:
> > The fact that the Qualcomm GENI hardware interface is based around
> > "packets" is really awkward to fit into Linux's UART design.
> > Specifically, in order to send bytes you need to start up a new
> > "command" saying how many bytes you want to send and then you need to
> > send all those bytes. Once you've committed to sending that number of
> > bytes it's very awkward to change your mind and send fewer, especially
> > if you want to do so without dropping bytes on the ground.
> >
> > There may be a few cases where you might want to send fewer bytes than
> > you originally expected:
> > 1. You might want to interrupt the transfer with something higher
> >    priority, like the kernel console or kdb.
> > 2. You might want to enter system suspend.
> > 3. The user might have killed the program that had queued bytes for
> >    sending over the UART.
> >
> > Despite this awkwardness the Linux driver has still tried to send
> > bytes using large transfers. Whenever the driver started a new
> > transfer it would look at the number of bytes in the OS's queue and
> > start a transfer for that many. The idea of using larger transfers is
> > that it should be more efficient. When you're in the middle of a large
> > transfer you can get interrupted when the hardware FIFO is close to
> > empty and add more bytes in. Whenever you get to the end of a transfer
> > you have to wait until the transfer is totally done before you can add
> > more bytes and, depending on interrupt latency, that can cause the
> > UART to idle a bit.
>
> As I mentioned last week, the slowdown from this is quite noticeable
> (e.g. 25% slowdown at @115200), but this may be the price we need to pay
> for correctness, at least temporarily.
>
> An alternative might be to switch to using a 16 byte fifo. This should
> reduce console latency even further, and may be able avoid the idling
> UART penalty by continuing to use the watermark interrupt for refilling
> the FIFO.

I'm a bit confused. Right now we're using (effectively) a 64-byte
FIFO. The FIFO is 16-words deep and we have 4 bytes per word. ...so
I'm not sure what you mean by switching to a 16-byte FIFO. Do you mean
to make less use of the FIFO, or something else?

Overall the big problem I found in all my testing was that I needed to
wait for a "command done" before kicking off a new command. When the
"command done" arrives then the UART has stopped transmitting and
you've got to suffer an interrupt latency before you can start
transferring again. Essentially:

1. Pick a transfer size.
2. You can keep sending bytes / using the FIFO efficiently as long as
there are still bytes left in the transfer.
3. When you get to the end of the transfer, you have to wait for the
UART to stop, report that it's done, and then suffer an interrupt
latency to start a new transfer.

So to be efficient you want to pick a big transfer size but if there's
any chance that you might not need to transfer that many bytes then
you need to figure out what to do. If you can handle that properly
then that's great. If not then we have to make sure we never kick off
a transfer that we might not finish.

I'd also mention that, as talked about in my response to your other
patch [1], I'm not seeing a 25% slowdown. I tested both with my simple
proposal and with this whole series applied and my slowdown is less
than 2%. I guess there must be something different with your setup?
Trying to think about what kind of slowdown would be reasonable for my
patch series at 115200:

a) We send 64 bytes efficiently, which takes 5.6ms (64 * 1000 / 11520)

b) We stop transferring and wait for an interrupt.

c) We start transferring 64 bytes again.

Let's say that your interrupt latency is 1 ms, which would be really
terrible. In that case you'll essentially transfer 64 bytes in 6.6ms
instead of 5.6 ms, right? That would be an 18% hit. Let's imagine
something more sensible and say that most of the time you can handle
an interrupt in 100 ms. That would be about a 1.7% slowdown, which
actually matches what I was seeing. For reference, even an old arm32
rk3288-veyron device I worked with years ago could usually handle
interrupts in ~100-200 ms since dwc2 needs you to handle at least one
(sometimes more) interrupt per USB uFrame (250ms).

...so I'm confused about where your 25% number is coming from...


[1] https://lore.kernel.org/r/CAD=FV=UwyzA614tDoq7BntW1DWmic=DOszr+iRJVafVEYrXhpw@mail.gmail.com
Johan Hovold June 25, 2024, 11:21 a.m. UTC | #6
On Mon, Jun 24, 2024 at 02:15:07PM -0700, Doug Anderson wrote:
> On Mon, Jun 24, 2024 at 5:43 AM Johan Hovold <johan@kernel.org> wrote:

> > As I mentioned last week, the slowdown from this is quite noticeable
> > (e.g. 25% slowdown at @115200), but this may be the price we need to pay
> > for correctness, at least temporarily.
> >
> > An alternative might be to switch to using a 16 byte fifo. This should
> > reduce console latency even further, and may be able avoid the idling
> > UART penalty by continuing to use the watermark interrupt for refilling
> > the FIFO.
> 
> I'm a bit confused. Right now we're using (effectively) a 64-byte
> FIFO. The FIFO is 16-words deep and we have 4 bytes per word. ...so
> I'm not sure what you mean by switching to a 16-byte FIFO. Do you mean
> to make less use of the FIFO, or something else?

I meant switching to using one-byte words so that we end up with a
16-byte FIFO where we don't have the issue of adding more data when the
last word is not a full four-byte one.

> Overall the big problem I found in all my testing was that I needed to
> wait for a "command done" before kicking off a new command. When the
> "command done" arrives then the UART has stopped transmitting and
> you've got to suffer an interrupt latency before you can start
> transferring again. Essentially:
> 
> 1. Pick a transfer size.
> 2. You can keep sending bytes / using the FIFO efficiently as long as
> there are still bytes left in the transfer.
> 3. When you get to the end of the transfer, you have to wait for the
> UART to stop, report that it's done, and then suffer an interrupt
> latency to start a new transfer.
> 
> So to be efficient you want to pick a big transfer size but if there's
> any chance that you might not need to transfer that many bytes then
> you need to figure out what to do. If you can handle that properly
> then that's great. If not then we have to make sure we never kick off
> a transfer that we might not finish.

Right. But with a 16 1-byte word FIFO, we may be able to kick of a
really long transfer and just keep it running until it needs to be
kicked again (cf. enabling TX). The console code can easily insert
characters in the FIFO while the transfer is running (and would only
have to wait for 16 characters to drain in the worst case).

Effectively, most of the identified issues would just go away, as
there's basically never any need to cancel anything except at port
shutdown.

> I'd also mention that, as talked about in my response to your other
> patch [1], I'm not seeing a 25% slowdown. I tested both with my simple
> proposal and with this whole series applied and my slowdown is less
> than 2%. I guess there must be something different with your setup?
> Trying to think about what kind of slowdown would be reasonable for my
> patch series at 115200:
> 
> a) We send 64 bytes efficiently, which takes 5.6ms (64 * 1000 / 11520)
> 
> b) We stop transferring and wait for an interrupt.
> 
> c) We start transferring 64 bytes again.
> 
> Let's say that your interrupt latency is 1 ms, which would be really
> terrible. In that case you'll essentially transfer 64 bytes in 6.6ms
> instead of 5.6 ms, right? That would be an 18% hit. Let's imagine
> something more sensible and say that most of the time you can handle
> an interrupt in 100 ms. That would be about a 1.7% slowdown, which
> actually matches what I was seeing. For reference, even an old arm32
> rk3288-veyron device I worked with years ago could usually handle
> interrupts in ~100-200 ms since dwc2 needs you to handle at least one
> (sometimes more) interrupt per USB uFrame (250ms).
> 
> ...so I'm confused about where your 25% number is coming from...

I didn't do an in-depth analysis of the slowdown, but I did rerun the
tests now and I'm still seeing a 22-24% slowdown on x1e80100 with rc5.
This is a new platform so I compared with sc8280xp, which shows similar
numbers even if it's slightly faster to begin with:

					sc8280xp	x1e80100

	rc5 full series			61 s		67 s
	rc5 last patch reverted		50 s		54 s

I have a getty running and cat a 10x dmesg file of 543950 bytes to
/dev/ttyMSM0 from an ssh session (just catting in a serial console gives
similar numbers). 

Johan
Doug Anderson June 25, 2024, 2:29 p.m. UTC | #7
Hi,

On Tue, Jun 25, 2024 at 4:21 AM Johan Hovold <johan@kernel.org> wrote:
>
> On Mon, Jun 24, 2024 at 02:15:07PM -0700, Doug Anderson wrote:
> > On Mon, Jun 24, 2024 at 5:43 AM Johan Hovold <johan@kernel.org> wrote:
>
> > > As I mentioned last week, the slowdown from this is quite noticeable
> > > (e.g. 25% slowdown at @115200), but this may be the price we need to pay
> > > for correctness, at least temporarily.
> > >
> > > An alternative might be to switch to using a 16 byte fifo. This should
> > > reduce console latency even further, and may be able avoid the idling
> > > UART penalty by continuing to use the watermark interrupt for refilling
> > > the FIFO.
> >
> > I'm a bit confused. Right now we're using (effectively) a 64-byte
> > FIFO. The FIFO is 16-words deep and we have 4 bytes per word. ...so
> > I'm not sure what you mean by switching to a 16-byte FIFO. Do you mean
> > to make less use of the FIFO, or something else?
>
> I meant switching to using one-byte words so that we end up with a
> 16-byte FIFO where we don't have the issue of adding more data when the
> last word is not a full four-byte one.

Ah, I get it! I guess I would have described it as 1-byte per FIFO word.

Certainly that seems like something that's worth trying but, at least
in the past, I remember getting noticeably worse performance with it.
We used to be in that mode when kdb was enabled which I run with most
of the time. Depending on what you set the watermark level to you may
either end up spending a lot more resources servicing interrupts or
you might end up back in the case where you're stalling the transfer
because you couldn't service the interrupt fast enough. At 115.2, each
byte is about 87 microseconds, and draining a 16-byte FIFO is about
1.4ms. If you set the watermark at halfway then you'll get an
interrupt every 8 bytes or ~8x as many interrupts as with my patch
series. You'll also stall any time your interrupt latency is worse
than 694 microseconds. Hopefully that's not too often, though the
slowdowns you measured below make me worried.


> > Overall the big problem I found in all my testing was that I needed to
> > wait for a "command done" before kicking off a new command. When the
> > "command done" arrives then the UART has stopped transmitting and
> > you've got to suffer an interrupt latency before you can start
> > transferring again. Essentially:
> >
> > 1. Pick a transfer size.
> > 2. You can keep sending bytes / using the FIFO efficiently as long as
> > there are still bytes left in the transfer.
> > 3. When you get to the end of the transfer, you have to wait for the
> > UART to stop, report that it's done, and then suffer an interrupt
> > latency to start a new transfer.
> >
> > So to be efficient you want to pick a big transfer size but if there's
> > any chance that you might not need to transfer that many bytes then
> > you need to figure out what to do. If you can handle that properly
> > then that's great. If not then we have to make sure we never kick off
> > a transfer that we might not finish.
>
> Right. But with a 16 1-byte word FIFO, we may be able to kick of a
> really long transfer and just keep it running until it needs to be
> kicked again (cf. enabling TX). The console code can easily insert
> characters in the FIFO while the transfer is running (and would only
> have to wait for 16 characters to drain in the worst case).
>
> Effectively, most of the identified issues would just go away, as
> there's basically never any need to cancel anything except at port
> shutdown.

Yeah, though you'd still have to make sure that the corner cases
worked OK. You'll have to pick _some_ sort of fixed transfer size and
make sure that all the special cases / console / kdb work if they show
up right at the end of the transfer.

I was also a bit curious if there could be power implications with
leaving an active TX command always in place. Perhaps geni wouldn't be
able to drop some resources? Do you happen to know?


> > I'd also mention that, as talked about in my response to your other
> > patch [1], I'm not seeing a 25% slowdown. I tested both with my simple
> > proposal and with this whole series applied and my slowdown is less
> > than 2%. I guess there must be something different with your setup?
> > Trying to think about what kind of slowdown would be reasonable for my
> > patch series at 115200:
> >
> > a) We send 64 bytes efficiently, which takes 5.6ms (64 * 1000 / 11520)
> >
> > b) We stop transferring and wait for an interrupt.
> >
> > c) We start transferring 64 bytes again.
> >
> > Let's say that your interrupt latency is 1 ms, which would be really
> > terrible. In that case you'll essentially transfer 64 bytes in 6.6ms
> > instead of 5.6 ms, right? That would be an 18% hit. Let's imagine
> > something more sensible and say that most of the time you can handle
> > an interrupt in 100 ms. That would be about a 1.7% slowdown, which
> > actually matches what I was seeing. For reference, even an old arm32
> > rk3288-veyron device I worked with years ago could usually handle
> > interrupts in ~100-200 ms since dwc2 needs you to handle at least one
> > (sometimes more) interrupt per USB uFrame (250ms).
> >
> > ...so I'm confused about where your 25% number is coming from...
>
> I didn't do an in-depth analysis of the slowdown, but I did rerun the
> tests now and I'm still seeing a 22-24% slowdown on x1e80100 with rc5.
> This is a new platform so I compared with sc8280xp, which shows similar
> numbers even if it's slightly faster to begin with:
>
>                                         sc8280xp        x1e80100
>
>         rc5 full series                 61 s            67 s
>         rc5 last patch reverted         50 s            54 s
>
> I have a getty running and cat a 10x dmesg file of 543950 bytes to
> /dev/ttyMSM0 from an ssh session (just catting in a serial console gives
> similar numbers).

That's really weird / unexpected. Your hardware should be fancier than
mine so, if anything, I'd expect it to be faster. Is there something
causing you really bad interrupt latency or something? ...or is some
clock misconfigured and "geni" is behaving sub-optimally?

...although it wouldn't explain the slowness, I'd at least be a little
curious if you've confirmed that you're running with a 16-word FIFO
depth. See the function geni_se_get_tx_fifo_depth() where newer
hardware can actually have larger FIFO depths.

Just in case it matters, I'd be curious if you have
`CONFIG_IRQ_TIME_ACCOUNTING=y`

Oh: one last thing to confirm: do you have kernel console output
disabled for your tests? I've been doing tests with the kernel console
_not_ enabled over the serial port and just an agetty there. I could
believe things might be different if the kernel console was sending
messages over the same port.

-Doug
Johan Hovold June 26, 2024, 8:20 a.m. UTC | #8
On Tue, Jun 25, 2024 at 07:29:38AM -0700, Doug Anderson wrote:
> On Tue, Jun 25, 2024 at 4:21 AM Johan Hovold <johan@kernel.org> wrote:

> > Right. But with a 16 1-byte word FIFO, we may be able to kick of a
> > really long transfer and just keep it running until it needs to be
> > kicked again (cf. enabling TX). The console code can easily insert
> > characters in the FIFO while the transfer is running (and would only
> > have to wait for 16 characters to drain in the worst case).
> >
> > Effectively, most of the identified issues would just go away, as
> > there's basically never any need to cancel anything except at port
> > shutdown.
> 
> Yeah, though you'd still have to make sure that the corner cases
> worked OK. You'll have to pick _some_ sort of fixed transfer size and
> make sure that all the special cases / console / kdb work if they show
> up right at the end of the transfer.

Yes, there are some details like that would need to be worked out.

> I was also a bit curious if there could be power implications with
> leaving an active TX command always in place. Perhaps geni wouldn't be
> able to drop some resources? Do you happen to know?

Hmm, good point. I'll see if I can ask someone with access to docs.

But I guess we can still continue to stop the command on stop_tx() (as
we are considering anyway) to avoid that.

> > I didn't do an in-depth analysis of the slowdown, but I did rerun the
> > tests now and I'm still seeing a 22-24% slowdown on x1e80100 with rc5.
> > This is a new platform so I compared with sc8280xp, which shows similar
> > numbers even if it's slightly faster to begin with:
> >
> >                                         sc8280xp        x1e80100
> >
> >         rc5 full series                 61 s            67 s
> >         rc5 last patch reverted         50 s            54 s
> >
> > I have a getty running and cat a 10x dmesg file of 543950 bytes to
> > /dev/ttyMSM0 from an ssh session (just catting in a serial console gives
> > similar numbers).
> 
> That's really weird / unexpected. Your hardware should be fancier than
> mine so, if anything, I'd expect it to be faster. Is there something
> causing you really bad interrupt latency or something? ...or is some
> clock misconfigured and "geni" is behaving sub-optimally?

That may be the case. I'm not seeing more interrupts with the last patch
applied, and not more time spent servicing interrupts (based on a quick
look at top), so it may just be geni taking a lot of time to start or
stop commands.

> ...although it wouldn't explain the slowness, I'd at least be a little
> curious if you've confirmed that you're running with a 16-word FIFO
> depth. See the function geni_se_get_tx_fifo_depth() where newer
> hardware can actually have larger FIFO depths.

No, I had confirmed that it is using 16 words (64 bytes).
 
> Just in case it matters, I'd be curious if you have
> `CONFIG_IRQ_TIME_ACCOUNTING=y`

I do, yes.

> Oh: one last thing to confirm: do you have kernel console output
> disabled for your tests? I've been doing tests with the kernel console
> _not_ enabled over the serial port and just an agetty there. I could
> believe things might be different if the kernel console was sending
> messages over the same port.

Yes, there has been no console output during my tests, and I get similar
results with the console disabled.

Johan
diff mbox series

Patch

diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
index 1a66424f0f5f..9d71296eae11 100644
--- a/drivers/tty/serial/qcom_geni_serial.c
+++ b/drivers/tty/serial/qcom_geni_serial.c
@@ -78,7 +78,7 @@ 
 #define GENI_UART_CONS_PORTS		1
 #define GENI_UART_PORTS			3
 #define DEF_FIFO_DEPTH_WORDS		16
-#define DEF_TX_WM			2
+#define DEF_TX_WM			1
 #define DEF_FIFO_WIDTH_BITS		32
 #define UART_RX_WM			2
 
@@ -128,8 +128,8 @@  struct qcom_geni_serial_port {
 	void *rx_buf;
 	u32 loopback;
 	bool brk;
+	bool tx_fifo_stopped;
 
-	unsigned int tx_remaining;
 	unsigned int tx_total;
 	int wakeup_irq;
 	bool rx_tx_swap;
@@ -336,6 +336,14 @@  static void qcom_geni_serial_poll_tx_done(struct uart_port *uport)
 							M_CMD_ABORT_EN, true);
 	}
 	writel(irq_clear, uport->membase + SE_GENI_M_IRQ_CLEAR);
+
+	/*
+	 * Re-enable the TX watermark interrupt when we clear the "done"
+	 * in case we were waiting on the "done" bit before starting a new
+	 * command. The interrupt routine will re-disable this if it's not
+	 * appropriate.
+	 */
+	writel(M_TX_FIFO_WATERMARK_EN, uport->membase +	SE_GENI_M_IRQ_EN_SET);
 }
 
 static void qcom_geni_serial_drain_tx_fifo(struct uart_port *uport)
@@ -357,7 +365,7 @@  static void qcom_geni_serial_drain_tx_fifo(struct uart_port *uport)
 	 * get lost.
 	 */
 	qcom_geni_serial_poll_bitfield(uport, SE_GENI_M_GP_LENGTH, GP_LENGTH,
-				       port->tx_total - port->tx_remaining);
+				       port->tx_total);
 
 	/*
 	 * If clearing the FIFO made us inactive then we're done--no need for
@@ -386,14 +394,6 @@  static void qcom_geni_serial_drain_tx_fifo(struct uart_port *uport)
 		writel(M_CMD_ABORT_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
 	}
 	writel(M_CMD_CANCEL_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
-
-	/*
-	 * We've cancelled the current command. "tx_remaining" stores how
-	 * many bytes are left to finish in the current command so we know
-	 * when to start a new command. Since the command was cancelled we
-	 * need to zero "tx_remaining".
-	 */
-	port->tx_remaining = 0;
 }
 
 static void qcom_geni_serial_abort_rx(struct uart_port *uport)
@@ -453,11 +453,12 @@  static int qcom_geni_serial_get_char(struct uart_port *uport)
 static void qcom_geni_serial_poll_put_char(struct uart_port *uport,
 							unsigned char c)
 {
+	qcom_geni_serial_drain_tx_fifo(uport);
+
 	qcom_geni_serial_setup_tx(uport, 1);
 	WARN_ON(!qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
 						M_TX_FIFO_WATERMARK_EN, true));
 	writel(c, uport->membase + SE_GENI_TX_FIFOn);
-	writel(M_TX_FIFO_WATERMARK_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
 	qcom_geni_serial_poll_tx_done(uport);
 }
 #endif
@@ -487,6 +488,8 @@  __qcom_geni_serial_console_write(struct uart_port *uport, const char *s,
 	int i;
 	u32 bytes_to_send = count;
 
+	qcom_geni_serial_drain_tx_fifo(uport);
+
 	for (i = 0; i < count; i++) {
 		/*
 		 * uart_console_write() adds a carriage return for each newline.
@@ -537,7 +540,6 @@  static void qcom_geni_serial_console_write(struct console *co, const char *s,
 	bool locked = true;
 	unsigned long flags;
 	u32 geni_status;
-	u32 irq_en;
 
 	WARN_ON(co->index < 0 || co->index >= GENI_UART_CONS_PORTS);
 
@@ -553,38 +555,10 @@  static void qcom_geni_serial_console_write(struct console *co, const char *s,
 
 	geni_status = readl(uport->membase + SE_GENI_STATUS);
 
-	if (!locked) {
-		/*
-		 * We can only get here if an oops is in progress then we were
-		 * unable to get the lock. This means we can't safely access
-		 * our state variables like tx_remaining. About the best we
-		 * can do is wait for the FIFO to be empty before we start our
-		 * transfer, so we'll do that.
-		 */
-		qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
-					  M_TX_FIFO_NOT_EMPTY_EN, false);
-	} else if ((geni_status & M_GENI_CMD_ACTIVE) && !port->tx_remaining) {
-		/*
-		 * It seems we can't interrupt existing transfers if all data
-		 * has been sent, in which case we need to look for done first.
-		 */
-		qcom_geni_serial_poll_tx_done(uport);
-
-		if (!kfifo_is_empty(&uport->state->port.xmit_fifo)) {
-			irq_en = readl(uport->membase + SE_GENI_M_IRQ_EN);
-			writel(irq_en | M_TX_FIFO_WATERMARK_EN,
-					uport->membase + SE_GENI_M_IRQ_EN);
-		}
-	}
-
 	__qcom_geni_serial_console_write(uport, s, count);
 
-
-	if (locked) {
-		if (port->tx_remaining)
-			qcom_geni_serial_setup_tx(uport, port->tx_remaining);
+	if (locked)
 		uart_port_unlock_irqrestore(uport, flags);
-	}
 }
 
 static void handle_rx_console(struct uart_port *uport, u32 bytes, bool drop)
@@ -661,9 +635,9 @@  static void qcom_geni_serial_stop_tx_dma(struct uart_port *uport)
 
 	if (port->tx_dma_addr) {
 		geni_se_tx_dma_unprep(&port->se, port->tx_dma_addr,
-				      port->tx_remaining);
+				      port->tx_total);
 		port->tx_dma_addr = 0;
-		port->tx_remaining = 0;
+		port->tx_total = 0;
 	}
 
 	geni_se_cancel_m_cmd(&port->se);
@@ -708,26 +682,27 @@  static void qcom_geni_serial_start_tx_dma(struct uart_port *uport)
 		qcom_geni_serial_stop_tx_dma(uport);
 		return;
 	}
-
-	port->tx_remaining = xmit_size;
 }
 
 static void qcom_geni_serial_start_tx_fifo(struct uart_port *uport)
 {
-	u32 irq_en;
+	struct qcom_geni_serial_port *port = to_dev_port(uport);
 
-	irq_en = readl(uport->membase +	SE_GENI_M_IRQ_EN);
-	irq_en |= M_TX_FIFO_WATERMARK_EN | M_CMD_DONE_EN;
-	writel(irq_en, uport->membase +	SE_GENI_M_IRQ_EN);
+	port->tx_fifo_stopped = false;
+
+	/* Prime the pump to get data flowing. */
+	writel(M_TX_FIFO_WATERMARK_EN, uport->membase +	SE_GENI_M_IRQ_EN_SET);
 }
 
 static void qcom_geni_serial_stop_tx_fifo(struct uart_port *uport)
 {
-	u32 irq_en;
+	struct qcom_geni_serial_port *port = to_dev_port(uport);
 
-	irq_en = readl(uport->membase + SE_GENI_M_IRQ_EN);
-	irq_en &= ~(M_CMD_DONE_EN | M_TX_FIFO_WATERMARK_EN);
-	writel(irq_en, uport->membase + SE_GENI_M_IRQ_EN);
+	/*
+	 * We can't do anything to safely pause the bytes that have already
+	 * been queued up so just set a flag saying we shouldn't queue any more.
+	 */
+	port->tx_fifo_stopped = true;
 }
 
 static void qcom_geni_serial_handle_rx_fifo(struct uart_port *uport, bool drop)
@@ -895,10 +870,20 @@  static void qcom_geni_serial_stop_tx(struct uart_port *uport)
 	uport->ops->stop_tx(uport);
 }
 
+static void qcom_geni_serial_enable_cmd_done(struct uart_port *uport)
+{
+	struct qcom_geni_serial_port *port = to_dev_port(uport);
+
+	/* If we're not in FIFO mode we don't use CMD_DONE. */
+	if (port->dev_data->mode != GENI_SE_FIFO)
+		return;
+
+	writel(M_CMD_DONE_EN, uport->membase + SE_GENI_M_IRQ_EN_SET);
+}
+
 static void qcom_geni_serial_send_chunk_fifo(struct uart_port *uport,
 					     unsigned int chunk)
 {
-	struct qcom_geni_serial_port *port = to_dev_port(uport);
 	unsigned int tx_bytes, remaining = chunk;
 	u8 buf[BYTES_PER_FIFO_WORD];
 
@@ -911,52 +896,74 @@  static void qcom_geni_serial_send_chunk_fifo(struct uart_port *uport,
 		iowrite32_rep(uport->membase + SE_GENI_TX_FIFOn, buf, 1);
 
 		remaining -= tx_bytes;
-		port->tx_remaining -= tx_bytes;
 	}
 }
 
-static void qcom_geni_serial_handle_tx_fifo(struct uart_port *uport,
-					    bool done, bool active)
+static void qcom_geni_serial_handle_tx_fifo(struct uart_port *uport)
 {
 	struct qcom_geni_serial_port *port = to_dev_port(uport);
 	struct tty_port *tport = &uport->state->port;
 	size_t avail;
 	size_t pending;
 	u32 status;
-	u32 irq_en;
 	unsigned int chunk;
+	bool active;
 
-	status = readl(uport->membase + SE_GENI_TX_FIFO_STATUS);
-
-	/* Complete the current tx command before taking newly added data */
-	if (active)
-		pending = port->tx_remaining;
-	else
-		pending = kfifo_len(&tport->xmit_fifo);
+	/*
+	 * The TX watermark interrupt is only used to "prime the pump" for
+	 * transfers. Once transfers have been kicked off we always use the
+	 * "done" interrupt to queue the next batch. Once were here we can
+	 * always disable the TX watermark interrupt.
+	 *
+	 * NOTE: we use the TX watermark in this way because we don't ever
+	 * kick off TX transfers larger than we can stuff into the FIFO. This
+	 * is because bytes from the OS's circular queue can disappear and
+	 * there's no known safe/non-blocking way to cancel the larger
+	 * transfer when bytes disappear. See qcom_geni_serial_drain_tx_fifo()
+	 * for an example of a safe (but blocking) way to drain, but that's
+	 * not appropriate in an IRQ handler. We also can't just kick off one
+	 * large transfer and queue bytes whenever because we're using 4 bytes
+	 * per FIFO word and thus we can only queue non-multiple-of-4 bytes as
+	 * in the last word of a transfer.
+	 */
+	writel(M_TX_FIFO_WATERMARK_EN, uport->membase +	SE_GENI_M_IRQ_EN_CLEAR);
 
-	/* All data has been transmitted and acknowledged as received */
-	if (!pending && !status && done) {
-		qcom_geni_serial_stop_tx_fifo(uport);
+	/*
+	 * If we've got an active TX command running then we expect to still
+	 * see the "done" bit in the future and we can't kick off another
+	 * transfer till then. Bail. NOTE: it's important that we read "active"
+	 * after we've cleared the "done" interrupt (which the caller already
+	 * did for us) so that we know that if we show as non-active we're
+	 * guaranteed to later get "done".
+	 *
+	 * If nothing is pending we _also_ want to bail. Later start_tx()
+	 * will start transfers again by temporarily turning on the TX
+	 * watermark.
+	 */
+	active = readl(uport->membase + SE_GENI_STATUS) & M_GENI_CMD_ACTIVE;
+	pending = port->tx_fifo_stopped ? 0 : kfifo_len(&tport->xmit_fifo);
+	if (active || !pending)
 		goto out_write_wakeup;
-	}
 
+	/* Calculate how much space is available in the FIFO right now. */
+	status = readl(uport->membase + SE_GENI_TX_FIFO_STATUS);
 	avail = port->tx_fifo_depth - (status & TX_FIFO_WC);
 	avail *= BYTES_PER_FIFO_WORD;
 
-	chunk = min(avail, pending);
-	if (!chunk)
+	/*
+	 * It's a bit odd if we get here and have bytes pending and we're
+	 * handling a "done" or "TX watermark" interrupt but we don't
+	 * have space in the FIFO. Stick in a warning and bail.
+	 */
+	if (!avail) {
+		dev_warn(uport->dev, "FIFO unexpectedly out of space\n");
 		goto out_write_wakeup;
-
-	if (!port->tx_remaining) {
-		qcom_geni_serial_setup_tx(uport, pending);
-		port->tx_remaining = pending;
-
-		irq_en = readl(uport->membase + SE_GENI_M_IRQ_EN);
-		if (!(irq_en & M_TX_FIFO_WATERMARK_EN))
-			writel(irq_en | M_TX_FIFO_WATERMARK_EN,
-					uport->membase + SE_GENI_M_IRQ_EN);
 	}
 
+
+	/* We're ready to throw some bytes into the FIFO. */
+	chunk = min(avail, pending);
+	qcom_geni_serial_setup_tx(uport, chunk);
 	qcom_geni_serial_send_chunk_fifo(uport, chunk);
 
 	/*
@@ -964,17 +971,9 @@  static void qcom_geni_serial_handle_tx_fifo(struct uart_port *uport,
 	 * cleared it in qcom_geni_serial_isr it will have already reasserted
 	 * so we must clear it again here after our writes.
 	 */
-	writel(M_TX_FIFO_WATERMARK_EN,
-			uport->membase + SE_GENI_M_IRQ_CLEAR);
+	writel(M_TX_FIFO_WATERMARK_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
 
 out_write_wakeup:
-	if (!port->tx_remaining) {
-		irq_en = readl(uport->membase + SE_GENI_M_IRQ_EN);
-		if (irq_en & M_TX_FIFO_WATERMARK_EN)
-			writel(irq_en & ~M_TX_FIFO_WATERMARK_EN,
-					uport->membase + SE_GENI_M_IRQ_EN);
-	}
-
 	if (kfifo_len(&tport->xmit_fifo) < WAKEUP_CHARS)
 		uart_write_wakeup(uport);
 }
@@ -984,10 +983,10 @@  static void qcom_geni_serial_handle_tx_dma(struct uart_port *uport)
 	struct qcom_geni_serial_port *port = to_dev_port(uport);
 	struct tty_port *tport = &uport->state->port;
 
-	uart_xmit_advance(uport, port->tx_remaining);
-	geni_se_tx_dma_unprep(&port->se, port->tx_dma_addr, port->tx_remaining);
+	uart_xmit_advance(uport, port->tx_total);
+	geni_se_tx_dma_unprep(&port->se, port->tx_dma_addr, port->tx_total);
 	port->tx_dma_addr = 0;
-	port->tx_remaining = 0;
+	port->tx_total = 0;
 
 	if (!kfifo_is_empty(&tport->xmit_fifo))
 		qcom_geni_serial_start_tx_dma(uport);
@@ -1001,7 +1000,6 @@  static irqreturn_t qcom_geni_serial_isr(int isr, void *dev)
 	u32 m_irq_en;
 	u32 m_irq_status;
 	u32 s_irq_status;
-	u32 geni_status;
 	u32 dma;
 	u32 dma_tx_status;
 	u32 dma_rx_status;
@@ -1019,7 +1017,6 @@  static irqreturn_t qcom_geni_serial_isr(int isr, void *dev)
 	s_irq_status = readl(uport->membase + SE_GENI_S_IRQ_STATUS);
 	dma_tx_status = readl(uport->membase + SE_DMA_TX_IRQ_STAT);
 	dma_rx_status = readl(uport->membase + SE_DMA_RX_IRQ_STAT);
-	geni_status = readl(uport->membase + SE_GENI_STATUS);
 	dma = readl(uport->membase + SE_GENI_DMA_MODE_EN);
 	m_irq_en = readl(uport->membase + SE_GENI_M_IRQ_EN);
 	writel(m_irq_status, uport->membase + SE_GENI_M_IRQ_CLEAR);
@@ -1066,9 +1063,7 @@  static irqreturn_t qcom_geni_serial_isr(int isr, void *dev)
 	} else {
 		if (m_irq_status & m_irq_en &
 		    (M_TX_FIFO_WATERMARK_EN | M_CMD_DONE_EN))
-			qcom_geni_serial_handle_tx_fifo(uport,
-					m_irq_status & M_CMD_DONE_EN,
-					geni_status & M_GENI_CMD_ACTIVE);
+			qcom_geni_serial_handle_tx_fifo(uport);
 
 		if (s_irq_status & (S_RX_FIFO_WATERMARK_EN | S_RX_FIFO_LAST_EN))
 			qcom_geni_serial_handle_rx_fifo(uport, drop_rx);
@@ -1176,6 +1171,7 @@  static int qcom_geni_serial_port_setup(struct uart_port *uport)
 	geni_se_init(&port->se, UART_RX_WM, port->rx_fifo_depth - 2);
 	geni_se_select_mode(&port->se, port->dev_data->mode);
 	writel(DEF_TX_WM, uport->membase + SE_GENI_TX_WATERMARK_REG);
+	qcom_geni_serial_enable_cmd_done(uport);
 	qcom_geni_serial_start_rx(uport);
 	port->setup = true;