diff mbox

[26/27,v12] arm/xen: vpl011: Fix the slow early console SBSA UART output

Message ID 1507891231-4386-1-git-send-email-bhupinder.thakur@linaro.org (mailing list archive)
State New, archived
Headers show

Commit Message

Bhupinder Thakur Oct. 13, 2017, 10:40 a.m. UTC
The early console output uses pl011_early_write() to write data. This
function waits for BUSY bit to get cleared before writing the next byte.

In the SBSA UART emulation logic, the BUSY bit was set as soon one
byte was written in the FIFO and it remained set until the FIFO was
emptied. This meant that the output was delayed as each character needed
the BUSY to get cleared.

Since the SBSA UART is getting emulated in Xen using ring buffers, it
ensures that once the data is enqueued in the FIFO, it will be received
by xenconsole so it is safe to set the BUSY bit only when FIFO becomes
full. This will ensure that pl011_early_write() is not delayed unduly
to write the data.

Signed-off-by: Bhupinder Thakur <bhupinder.thakur@linaro.org>
---
CC: Julien Grall <julien.grall@arm.com>
CC: Andre Przywara <andre.przywara@arm.com>
CC: Stefano Stabellini <sstabellini@kernel.org>

 xen/arch/arm/vpl011.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

Comments

Andre Przywara Oct. 17, 2017, 9:51 a.m. UTC | #1
Hi Bhupinder,

first thing: As the bulk of the series has been merged now, please
restart your patch and version numbering, so a (potential) next post
should be prefixed [PATCH v3 1/2]. And please have a cover letter giving
a brief overview what this series fixes.

On 13/10/17 11:40, Bhupinder Thakur wrote:
> The early console output uses pl011_early_write() to write data. This
> function waits for BUSY bit to get cleared before writing the next byte.

... which is questionable given the actual definition of the BUSY bit in
the PL011 TRM:

============
.... The BUSY signal goes HIGH as soon as data is written to the
transmit FIFO (that is, the FIFO is non-empty) and remains asserted
HIGH while data is being transmitted. BUSY is negated only when the
transmit FIFO is empty, and the last character has been transmitted from
the shift register, ....
============

(I take it you are talking about the Linux driver in a guest here).
I think the early_write routine tries to (deliberately?) ignore the
FIFO, possibly to make sure characters really get pushed out before a
system crashes, maybe.

> 
> In the SBSA UART emulation logic, the BUSY bit was set as soon one
> byte was written in the FIFO and it remained set until the FIFO was
> emptied.

Which is correct behaviour, as this matches the PL011 TRM as quoted above.

> This meant that the output was delayed as each character needed
> the BUSY to get cleared.

But this is true as well!

> Since the SBSA UART is getting emulated in Xen using ring buffers, it
> ensures that once the data is enqueued in the FIFO, it will be received
> by xenconsole so it is safe to set the BUSY bit only when FIFO becomes
> full. This will ensure that pl011_early_write() is not delayed unduly
> to write the data.

So I can confirm that this patch fixes the very slow earlycon output
observed with the current staging HEAD.

So while this is somewhat deviating from the spec, I can see the benefit
for an emulation scenario. I believe that emulations in general might
choose implementing things a bit differently, to cope with the
fundamental differences in their environment, like the virtually endless
"FIFO" and the lack of any timing restrictions on the emulated "wire".

So unless someone comes up with a better solution, I would support
taking this patch, as this fixes a real problem.

Cheers,
Andre

> Signed-off-by: Bhupinder Thakur <bhupinder.thakur@linaro.org>
> ---
> CC: Julien Grall <julien.grall@arm.com>
> CC: Andre Przywara <andre.przywara@arm.com>
> CC: Stefano Stabellini <sstabellini@kernel.org>
> 
>  xen/arch/arm/vpl011.c | 21 ++++++++++++++++-----
>  1 file changed, 16 insertions(+), 5 deletions(-)
> 
> diff --git a/xen/arch/arm/vpl011.c b/xen/arch/arm/vpl011.c
> index f7ddccb..0b07436 100644
> --- a/xen/arch/arm/vpl011.c
> +++ b/xen/arch/arm/vpl011.c
> @@ -159,9 +159,15 @@ static void vpl011_write_data(struct domain *d, uint8_t data)
>      {
>          vpl011->uartfr |= TXFF;
>          vpl011->uartris &= ~TXI;
> -    }
>  
> -    vpl011->uartfr |= BUSY;
> +        /*
> +         * This bit is set only when FIFO becomes full. This ensures that
> +         * the SBSA UART driver can write the early console data as fast as
> +         * possible, without waiting for the BUSY bit to get cleared before
> +         * writing each byte.
> +         */
> +        vpl011->uartfr |= BUSY;
> +    }
>  
>      vpl011->uartfr &= ~TXFE;
>  
> @@ -371,11 +377,16 @@ static void vpl011_data_avail(struct domain *d)
>      {
>          vpl011->uartfr &= ~TXFF;
>          vpl011->uartris |= TXI;
> +
> +        /*
> +         * Clear the BUSY bit as soon as space becomes available
> +         * so that the SBSA UART driver can start writing more data
> +         * without any further delay.
> +         */
> +        vpl011->uartfr &= ~BUSY;
> +
>          if ( out_ring_qsize == 0 )
> -        {
> -            vpl011->uartfr &= ~BUSY;
>              vpl011->uartfr |= TXFE;
> -        }
>      }
>  
>      vpl011_update_interrupt_status(d);
>
Dave Martin Oct. 17, 2017, 11:19 a.m. UTC | #2
On Tue, Oct 17, 2017 at 10:51:07AM +0100, Andre Przywara wrote:
> Hi Bhupinder,
> 
> first thing: As the bulk of the series has been merged now, please
> restart your patch and version numbering, so a (potential) next post
> should be prefixed [PATCH v3 1/2]. And please have a cover letter giving
> a brief overview what this series fixes.
> 
> On 13/10/17 11:40, Bhupinder Thakur wrote:
> > The early console output uses pl011_early_write() to write data. This
> > function waits for BUSY bit to get cleared before writing the next byte.
> 
> ... which is questionable given the actual definition of the BUSY bit in
> the PL011 TRM:
> 
> ============
> .... The BUSY signal goes HIGH as soon as data is written to the
> transmit FIFO (that is, the FIFO is non-empty) and remains asserted
> HIGH while data is being transmitted. BUSY is negated only when the
> transmit FIFO is empty, and the last character has been transmitted from
> the shift register, ....
> ============
> 
> (I take it you are talking about the Linux driver in a guest here).
> I think the early_write routine tries to (deliberately?) ignore the
> FIFO, possibly to make sure characters really get pushed out before a
> system crashes, maybe.
> 
> > 
> > In the SBSA UART emulation logic, the BUSY bit was set as soon one
> > byte was written in the FIFO and it remained set until the FIFO was
> > emptied.
> 
> Which is correct behaviour, as this matches the PL011 TRM as quoted above.
> 
> > This meant that the output was delayed as each character needed
> > the BUSY to get cleared.
> 
> But this is true as well!
> 
> > Since the SBSA UART is getting emulated in Xen using ring buffers, it
> > ensures that once the data is enqueued in the FIFO, it will be received
> > by xenconsole so it is safe to set the BUSY bit only when FIFO becomes
> > full. This will ensure that pl011_early_write() is not delayed unduly
> > to write the data.
> 
> So I can confirm that this patch fixes the very slow earlycon output
> observed with the current staging HEAD.
> 
> So while this is somewhat deviating from the spec, I can see the benefit
> for an emulation scenario. I believe that emulations in general might
> choose implementing things a bit differently, to cope with the
> fundamental differences in their environment, like the virtually endless
> "FIFO" and the lack of any timing restrictions on the emulated "wire".
> 
> So unless someone comes up with a better solution, I would support
> taking this patch, as this fixes a real problem.

I think you get away with this, but it does violate the spec in order
to work around a feature of a correctly implemented driver.

Software can now see this, for example:

	uart_write(ch, UARTDR);
	busy = uart_read(UARTFR) & UARTFR_BUSY;
	BUG_ON(!(uart_read(UARTFR) & UARTFR_TXFE) && !busy);

which violates the spec, though I can't currently think of a good reason
for software to rely on that.


[+Rob, who wrote the original earlycon code in the amba-pl011 driver:
0d3c673e7881 ("tty/serial: pl011: add generic earlycon support")

Is there any actualy reason why we poll for !BUSY after each char in
pl011_putc()?  pl011_putc() is not exposed at all: it's only called by
pl011_console_write().

This will result in stuttering output even on hardware, though this
doesn't typically matter.

I think if the poll for !BUSY were moved to the end of
pl011_console_write(), the effect would be much less bad.]



For Xen vpl011:

This is a software emulation, so characters really are transmitted
instantaneously.

Can you drain the TX FIFO into the ring buffer immediately in
vpl011_write_data()?  Then you could always set TXFE and clear BUSY on
return to the guest, unless there is backpressure from the ring buffer.

Note that this could break with Linux versions prior to v4.1, which used
some trickery to assert the TX FIFO interrupt for the first time (see
734745caeb9f ("serial/amba-pl011: Activate TX IRQ passively")).
However, there is a subtlety in the way a real PL011 asserts the
TX/RX threshold interrupts which I think you don't implement in vpl011,
so you probably won't hit this problem in practice.  You could test with
an older kernel to find out -- but if <v4.1 kernels don't work with
current Xen for other reasons then this doesn't matter.

Cheers
---Dave

> > Signed-off-by: Bhupinder Thakur <bhupinder.thakur@linaro.org>
> > ---
> > CC: Julien Grall <julien.grall@arm.com>
> > CC: Andre Przywara <andre.przywara@arm.com>
> > CC: Stefano Stabellini <sstabellini@kernel.org>
> > 
> >  xen/arch/arm/vpl011.c | 21 ++++++++++++++++-----
> >  1 file changed, 16 insertions(+), 5 deletions(-)
> > 
> > diff --git a/xen/arch/arm/vpl011.c b/xen/arch/arm/vpl011.c
> > index f7ddccb..0b07436 100644
> > --- a/xen/arch/arm/vpl011.c
> > +++ b/xen/arch/arm/vpl011.c
> > @@ -159,9 +159,15 @@ static void vpl011_write_data(struct domain *d, uint8_t data)
> >      {
> >          vpl011->uartfr |= TXFF;
> >          vpl011->uartris &= ~TXI;
> > -    }
> >  
> > -    vpl011->uartfr |= BUSY;
> > +        /*
> > +         * This bit is set only when FIFO becomes full. This ensures that
> > +         * the SBSA UART driver can write the early console data as fast as
> > +         * possible, without waiting for the BUSY bit to get cleared before
> > +         * writing each byte.
> > +         */
> > +        vpl011->uartfr |= BUSY;
> > +    }
> >  
> >      vpl011->uartfr &= ~TXFE;
> >  
> > @@ -371,11 +377,16 @@ static void vpl011_data_avail(struct domain *d)
> >      {
> >          vpl011->uartfr &= ~TXFF;
> >          vpl011->uartris |= TXI;
> > +
> > +        /*
> > +         * Clear the BUSY bit as soon as space becomes available
> > +         * so that the SBSA UART driver can start writing more data
> > +         * without any further delay.
> > +         */
> > +        vpl011->uartfr &= ~BUSY;
> > +
> >          if ( out_ring_qsize == 0 )
> > -        {
> > -            vpl011->uartfr &= ~BUSY;
> >              vpl011->uartfr |= TXFE;
> > -        }
> >      }
> >  
> >      vpl011_update_interrupt_status(d);
> > 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> https://lists.xen.org/xen-devel
Rob Herring Oct. 17, 2017, 12:53 p.m. UTC | #3
On Tue, Oct 17, 2017 at 6:19 AM, Dave Martin <Dave.Martin@arm.com> wrote:
> On Tue, Oct 17, 2017 at 10:51:07AM +0100, Andre Przywara wrote:
>> Hi Bhupinder,
>>
>> first thing: As the bulk of the series has been merged now, please
>> restart your patch and version numbering, so a (potential) next post
>> should be prefixed [PATCH v3 1/2]. And please have a cover letter giving
>> a brief overview what this series fixes.
>>
>> On 13/10/17 11:40, Bhupinder Thakur wrote:
>> > The early console output uses pl011_early_write() to write data. This
>> > function waits for BUSY bit to get cleared before writing the next byte.
>>
>> ... which is questionable given the actual definition of the BUSY bit in
>> the PL011 TRM:
>>
>> ============
>> .... The BUSY signal goes HIGH as soon as data is written to the
>> transmit FIFO (that is, the FIFO is non-empty) and remains asserted
>> HIGH while data is being transmitted. BUSY is negated only when the
>> transmit FIFO is empty, and the last character has been transmitted from
>> the shift register, ....
>> ============
>>
>> (I take it you are talking about the Linux driver in a guest here).
>> I think the early_write routine tries to (deliberately?) ignore the
>> FIFO, possibly to make sure characters really get pushed out before a
>> system crashes, maybe.
>>
>> >
>> > In the SBSA UART emulation logic, the BUSY bit was set as soon one
>> > byte was written in the FIFO and it remained set until the FIFO was
>> > emptied.
>>
>> Which is correct behaviour, as this matches the PL011 TRM as quoted above.
>>
>> > This meant that the output was delayed as each character needed
>> > the BUSY to get cleared.
>>
>> But this is true as well!
>>
>> > Since the SBSA UART is getting emulated in Xen using ring buffers, it
>> > ensures that once the data is enqueued in the FIFO, it will be received
>> > by xenconsole so it is safe to set the BUSY bit only when FIFO becomes
>> > full. This will ensure that pl011_early_write() is not delayed unduly
>> > to write the data.
>>
>> So I can confirm that this patch fixes the very slow earlycon output
>> observed with the current staging HEAD.
>>
>> So while this is somewhat deviating from the spec, I can see the benefit
>> for an emulation scenario. I believe that emulations in general might
>> choose implementing things a bit differently, to cope with the
>> fundamental differences in their environment, like the virtually endless
>> "FIFO" and the lack of any timing restrictions on the emulated "wire".
>>
>> So unless someone comes up with a better solution, I would support
>> taking this patch, as this fixes a real problem.
>
> I think you get away with this, but it does violate the spec in order
> to work around a feature of a correctly implemented driver.
>
> Software can now see this, for example:
>
>         uart_write(ch, UARTDR);
>         busy = uart_read(UARTFR) & UARTFR_BUSY;
>         BUG_ON(!(uart_read(UARTFR) & UARTFR_TXFE) && !busy);
>
> which violates the spec, though I can't currently think of a good reason
> for software to rely on that.
>
>
> [+Rob, who wrote the original earlycon code in the amba-pl011 driver:
> 0d3c673e7881 ("tty/serial: pl011: add generic earlycon support")
>
> Is there any actualy reason why we poll for !BUSY after each char in
> pl011_putc()?  pl011_putc() is not exposed at all: it's only called by
> pl011_console_write().
>
> This will result in stuttering output even on hardware, though this
> doesn't typically matter.
>
> I think if the poll for !BUSY were moved to the end of
> pl011_console_write(), the effect would be much less bad.]

I just copied the code from the arm64 earlyprintk code... Maybe it was
because on simulation (which was the main platform at the time) folks
wanted the character "on the wire". It seems to be that you could just
drop it.

Rob
Dave Martin Oct. 17, 2017, 1:44 p.m. UTC | #4
On Tue, Oct 17, 2017 at 07:53:36AM -0500, Rob Herring wrote:
> On Tue, Oct 17, 2017 at 6:19 AM, Dave Martin <Dave.Martin@arm.com> wrote:
> > On Tue, Oct 17, 2017 at 10:51:07AM +0100, Andre Przywara wrote:
> >> Hi Bhupinder,
> >>
> >> first thing: As the bulk of the series has been merged now, please
> >> restart your patch and version numbering, so a (potential) next post
> >> should be prefixed [PATCH v3 1/2]. And please have a cover letter giving
> >> a brief overview what this series fixes.
> >>
> >> On 13/10/17 11:40, Bhupinder Thakur wrote:
> >> > The early console output uses pl011_early_write() to write data. This
> >> > function waits for BUSY bit to get cleared before writing the next byte.
> >>
> >> ... which is questionable given the actual definition of the BUSY bit in
> >> the PL011 TRM:
> >>
> >> ============
> >> .... The BUSY signal goes HIGH as soon as data is written to the
> >> transmit FIFO (that is, the FIFO is non-empty) and remains asserted
> >> HIGH while data is being transmitted. BUSY is negated only when the
> >> transmit FIFO is empty, and the last character has been transmitted from
> >> the shift register, ....
> >> ============
> >>
> >> (I take it you are talking about the Linux driver in a guest here).
> >> I think the early_write routine tries to (deliberately?) ignore the
> >> FIFO, possibly to make sure characters really get pushed out before a
> >> system crashes, maybe.
> >>
> >> >
> >> > In the SBSA UART emulation logic, the BUSY bit was set as soon one
> >> > byte was written in the FIFO and it remained set until the FIFO was
> >> > emptied.
> >>
> >> Which is correct behaviour, as this matches the PL011 TRM as quoted above.
> >>
> >> > This meant that the output was delayed as each character needed
> >> > the BUSY to get cleared.
> >>
> >> But this is true as well!
> >>
> >> > Since the SBSA UART is getting emulated in Xen using ring buffers, it
> >> > ensures that once the data is enqueued in the FIFO, it will be received
> >> > by xenconsole so it is safe to set the BUSY bit only when FIFO becomes
> >> > full. This will ensure that pl011_early_write() is not delayed unduly
> >> > to write the data.
> >>
> >> So I can confirm that this patch fixes the very slow earlycon output
> >> observed with the current staging HEAD.
> >>
> >> So while this is somewhat deviating from the spec, I can see the benefit
> >> for an emulation scenario. I believe that emulations in general might
> >> choose implementing things a bit differently, to cope with the
> >> fundamental differences in their environment, like the virtually endless
> >> "FIFO" and the lack of any timing restrictions on the emulated "wire".
> >>
> >> So unless someone comes up with a better solution, I would support
> >> taking this patch, as this fixes a real problem.
> >
> > I think you get away with this, but it does violate the spec in order
> > to work around a feature of a correctly implemented driver.
> >
> > Software can now see this, for example:
> >
> >         uart_write(ch, UARTDR);
> >         busy = uart_read(UARTFR) & UARTFR_BUSY;
> >         BUG_ON(!(uart_read(UARTFR) & UARTFR_TXFE) && !busy);
> >
> > which violates the spec, though I can't currently think of a good reason
> > for software to rely on that.
> >
> >
> > [+Rob, who wrote the original earlycon code in the amba-pl011 driver:
> > 0d3c673e7881 ("tty/serial: pl011: add generic earlycon support")
> >
> > Is there any actualy reason why we poll for !BUSY after each char in
> > pl011_putc()?  pl011_putc() is not exposed at all: it's only called by
> > pl011_console_write().
> >
> > This will result in stuttering output even on hardware, though this
> > doesn't typically matter.
> >
> > I think if the poll for !BUSY were moved to the end of
> > pl011_console_write(), the effect would be much less bad.]
> 
> I just copied the code from the arm64 earlyprintk code... Maybe it was
> because on simulation (which was the main platform at the time) folks
> wanted the character "on the wire". It seems to be that you could just
> drop it.

Hmmm, the arm64 earlyprintk code

	2475ff9d2c6e ("arm64: Add simple earlyprintk support")

looks to have been derived by Catalin from arm's assembly printch/
printascii implementation, which predates git AFACT:

(Catalin, pleaes put me right if I misunderstood the history.)


arch/arm/kernel/debug.S:

ENTRY(printascii)
                addruart_current r3, r1, r2
                b       2f
1:              waituart r2, r3
                senduart r1, r3
                busyuart r2, r3
                teq     r1, #'\n'
                moveq   r1, #'\r'
                beq     1b
2:              teq     r0, #0
                ldrneb  r1, [r0], #1
                teqne   r1, #0
                bne     1b
                ret     lr
ENDPROC(printascii)

ENTRY(printch)
                addruart_current r3, r1, r2
                mov     r1, r0
                mov     r0, #0
                b       1b
ENDPROC(printch)



Russell, do you know why we wait for the UART transmitter to go
completely idle before queueing a new char?

For an individual printch this can makes sense, but it also introduces
delay for every char in printascii.

This seems to interact interestingly with virtualised UARTs, because we
may thrash between the guest and hypervisor per-char, though there
may be a way to reduce the impact of this on the emulation side.

(See above for some context)


In the pl011 earlycon code that was derived from arm64 earlycon (the
latter now deceased), pl011_putc() is not exposed at all and polling for
!BUSY other than at the end of pl011_early_write() seems unnecessary...

Crashing the platform so hard that the PL011 stops transmitting is
likely to be challenging -- e.g., turning off some clock or regulator,
making the bus lock up etc.  None of these is likely to be triggered
by pl011_early_write() itself.

Cheers
---Dave
Russell King (Oracle) Oct. 17, 2017, 2:03 p.m. UTC | #5
On Tue, Oct 17, 2017 at 02:44:29PM +0100, Dave Martin wrote:
> arch/arm/kernel/debug.S:
> 
> ENTRY(printascii)
>                 addruart_current r3, r1, r2
>                 b       2f
> 1:              waituart r2, r3
>                 senduart r1, r3
>                 busyuart r2, r3
>                 teq     r1, #'\n'
>                 moveq   r1, #'\r'
>                 beq     1b
> 2:              teq     r0, #0
>                 ldrneb  r1, [r0], #1
>                 teqne   r1, #0
>                 bne     1b
>                 ret     lr
> ENDPROC(printascii)
> 
> ENTRY(printch)
>                 addruart_current r3, r1, r2
>                 mov     r1, r0
>                 mov     r0, #0
>                 b       1b
> ENDPROC(printch)
> 
> 
> 
> Russell, do you know why we wait for the UART transmitter to go
> completely idle before queueing a new char?

It's the only way the /debug/ (and I stress /debug/) functions know
that the character has actually been sent before allowing control to
continue.  This is important for early-crashy-debug situations, but
probably less so for early_printk.

> For an individual printch this can makes sense, but it also introduces
> delay for every char in printascii.
> 
> This seems to interact interestingly with virtualised UARTs, because we
> may thrash between the guest and hypervisor per-char, though there
> may be a way to reduce the impact of this on the emulation side.

Well, I guess re-using the early /debugging/ code for early printk is
showing more and more issues - and reusing this code in this way is
something that I've never really been a fan of.

I'd personally like to see the coupling between the two things gone -
I never really wanted that coupling in the first place.
Dave Martin Oct. 17, 2017, 2:49 p.m. UTC | #6
On Tue, Oct 17, 2017 at 03:03:02PM +0100, Russell King - ARM Linux wrote:
> On Tue, Oct 17, 2017 at 02:44:29PM +0100, Dave Martin wrote:
> > arch/arm/kernel/debug.S:
> >
> > ENTRY(printascii)
> >                 addruart_current r3, r1, r2
> >                 b       2f
> > 1:              waituart r2, r3
> >                 senduart r1, r3
> >                 busyuart r2, r3
> >                 teq     r1, #'\n'
> >                 moveq   r1, #'\r'
> >                 beq     1b
> > 2:              teq     r0, #0
> >                 ldrneb  r1, [r0], #1
> >                 teqne   r1, #0
> >                 bne     1b
> >                 ret     lr
> > ENDPROC(printascii)
> >
> > ENTRY(printch)
> >                 addruart_current r3, r1, r2
> >                 mov     r1, r0
> >                 mov     r0, #0
> >                 b       1b
> > ENDPROC(printch)
> >
> >
> >
> > Russell, do you know why we wait for the UART transmitter to go
> > completely idle before queueing a new char?
>
> It's the only way the /debug/ (and I stress /debug/) functions know
> that the character has actually been sent before allowing control to
> continue.  This is important for early-crashy-debug situations, but
> probably less so for early_printk.
>
> > For an individual printch this can makes sense, but it also introduces
> > delay for every char in printascii.
> >
> > This seems to interact interestingly with virtualised UARTs, because we
> > may thrash between the guest and hypervisor per-char, though there
> > may be a way to reduce the impact of this on the emulation side.
>
> Well, I guess re-using the early /debugging/ code for early printk is
> showing more and more issues - and reusing this code in this way is
> something that I've never really been a fan of.
>
> I'd personally like to see the coupling between the two things gone -
> I never really wanted that coupling in the first place.

Agreed.  I'll propose a patch for the amba-pl011 earlycon code so that
the flush is only per each write() call, which should at least mitigate
the impact.

For low-level debug, it makes more sense to be as conservative as
possible though: I don't see a need to change arm printch/printascii:
as you point out, these two bits of code have different purposes,
even if they have common ancestry.

Cheers
---Dave
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
Bhupinder Thakur Oct. 18, 2017, 10:17 a.m. UTC | #7
Hi Andre,

On 17 October 2017 at 15:21, Andre Przywara <andre.przywara@arm.com> wrote:
> Hi Bhupinder,
>
> first thing: As the bulk of the series has been merged now, please
> restart your patch and version numbering, so a (potential) next post
> should be prefixed [PATCH v3 1/2]. And please have a cover letter giving
> a brief overview what this series fixes.
>
Should I resend the patch series with a cover letter? I will also add
a reported-by tag.

> On 13/10/17 11:40, Bhupinder Thakur wrote:
>> The early console output uses pl011_early_write() to write data. This
>> function waits for BUSY bit to get cleared before writing the next byte.
>
> ... which is questionable given the actual definition of the BUSY bit in
> the PL011 TRM:
>
> ============
> .... The BUSY signal goes HIGH as soon as data is written to the
> transmit FIFO (that is, the FIFO is non-empty) and remains asserted
> HIGH while data is being transmitted. BUSY is negated only when the
> transmit FIFO is empty, and the last character has been transmitted from
> the shift register, ....
> ============
>
> (I take it you are talking about the Linux driver in a guest here).
> I think the early_write routine tries to (deliberately?) ignore the
> FIFO, possibly to make sure characters really get pushed out before a
> system crashes, maybe.
>
>>
>> In the SBSA UART emulation logic, the BUSY bit was set as soon one
>> byte was written in the FIFO and it remained set until the FIFO was
>> emptied.
>
> Which is correct behaviour, as this matches the PL011 TRM as quoted above.
>
>> This meant that the output was delayed as each character needed
>> the BUSY to get cleared.
>
> But this is true as well!
>
>> Since the SBSA UART is getting emulated in Xen using ring buffers, it
>> ensures that once the data is enqueued in the FIFO, it will be received
>> by xenconsole so it is safe to set the BUSY bit only when FIFO becomes
>> full. This will ensure that pl011_early_write() is not delayed unduly
>> to write the data.
>
> So I can confirm that this patch fixes the very slow earlycon output
> observed with the current staging HEAD.
>
> So while this is somewhat deviating from the spec, I can see the benefit
> for an emulation scenario. I believe that emulations in general might
> choose implementing things a bit differently, to cope with the
> fundamental differences in their environment, like the virtually endless
> "FIFO" and the lack of any timing restrictions on the emulated "wire".
>
> So unless someone comes up with a better solution, I would support
> taking this patch, as this fixes a real problem.
>
> Cheers,
> Andre
>
>> Signed-off-by: Bhupinder Thakur <bhupinder.thakur@linaro.org>
>> ---
>> CC: Julien Grall <julien.grall@arm.com>
>> CC: Andre Przywara <andre.przywara@arm.com>
>> CC: Stefano Stabellini <sstabellini@kernel.org>
>>
>>  xen/arch/arm/vpl011.c | 21 ++++++++++++++++-----
>>  1 file changed, 16 insertions(+), 5 deletions(-)
>>
>> diff --git a/xen/arch/arm/vpl011.c b/xen/arch/arm/vpl011.c
>> index f7ddccb..0b07436 100644
>> --- a/xen/arch/arm/vpl011.c
>> +++ b/xen/arch/arm/vpl011.c
>> @@ -159,9 +159,15 @@ static void vpl011_write_data(struct domain *d, uint8_t data)
>>      {
>>          vpl011->uartfr |= TXFF;
>>          vpl011->uartris &= ~TXI;
>> -    }
>>
>> -    vpl011->uartfr |= BUSY;
>> +        /*
>> +         * This bit is set only when FIFO becomes full. This ensures that
>> +         * the SBSA UART driver can write the early console data as fast as
>> +         * possible, without waiting for the BUSY bit to get cleared before
>> +         * writing each byte.
>> +         */
>> +        vpl011->uartfr |= BUSY;
>> +    }
>>
>>      vpl011->uartfr &= ~TXFE;
>>
>> @@ -371,11 +377,16 @@ static void vpl011_data_avail(struct domain *d)
>>      {
>>          vpl011->uartfr &= ~TXFF;
>>          vpl011->uartris |= TXI;
>> +
>> +        /*
>> +         * Clear the BUSY bit as soon as space becomes available
>> +         * so that the SBSA UART driver can start writing more data
>> +         * without any further delay.
>> +         */
>> +        vpl011->uartfr &= ~BUSY;
>> +
>>          if ( out_ring_qsize == 0 )
>> -        {
>> -            vpl011->uartfr &= ~BUSY;
>>              vpl011->uartfr |= TXFE;
>> -        }
>>      }
>>
>>      vpl011_update_interrupt_status(d);
>>

Regards,
Bhupinder
Andre Przywara Oct. 18, 2017, 10:31 a.m. UTC | #8
Hi,

On 18/10/17 11:17, Bhupinder Thakur wrote:
> Hi Andre,
> 
> On 17 October 2017 at 15:21, Andre Przywara <andre.przywara@arm.com> wrote:
>> Hi Bhupinder,
>>
>> first thing: As the bulk of the series has been merged now, please
>> restart your patch and version numbering, so a (potential) next post
>> should be prefixed [PATCH v3 1/2]. And please have a cover letter giving
>> a brief overview what this series fixes.
>>
> Should I resend the patch series with a cover letter? I will also add
> a reported-by tag.

Please wait until we have settled upon a solution, especially for that
other patch. We can talk about this in our meeting later today.

Cheers,
Andre.

>> On 13/10/17 11:40, Bhupinder Thakur wrote:
>>> The early console output uses pl011_early_write() to write data. This
>>> function waits for BUSY bit to get cleared before writing the next byte.
>>
>> ... which is questionable given the actual definition of the BUSY bit in
>> the PL011 TRM:
>>
>> ============
>> .... The BUSY signal goes HIGH as soon as data is written to the
>> transmit FIFO (that is, the FIFO is non-empty) and remains asserted
>> HIGH while data is being transmitted. BUSY is negated only when the
>> transmit FIFO is empty, and the last character has been transmitted from
>> the shift register, ....
>> ============
>>
>> (I take it you are talking about the Linux driver in a guest here).
>> I think the early_write routine tries to (deliberately?) ignore the
>> FIFO, possibly to make sure characters really get pushed out before a
>> system crashes, maybe.
>>
>>>
>>> In the SBSA UART emulation logic, the BUSY bit was set as soon one
>>> byte was written in the FIFO and it remained set until the FIFO was
>>> emptied.
>>
>> Which is correct behaviour, as this matches the PL011 TRM as quoted above.
>>
>>> This meant that the output was delayed as each character needed
>>> the BUSY to get cleared.
>>
>> But this is true as well!
>>
>>> Since the SBSA UART is getting emulated in Xen using ring buffers, it
>>> ensures that once the data is enqueued in the FIFO, it will be received
>>> by xenconsole so it is safe to set the BUSY bit only when FIFO becomes
>>> full. This will ensure that pl011_early_write() is not delayed unduly
>>> to write the data.
>>
>> So I can confirm that this patch fixes the very slow earlycon output
>> observed with the current staging HEAD.
>>
>> So while this is somewhat deviating from the spec, I can see the benefit
>> for an emulation scenario. I believe that emulations in general might
>> choose implementing things a bit differently, to cope with the
>> fundamental differences in their environment, like the virtually endless
>> "FIFO" and the lack of any timing restrictions on the emulated "wire".
>>
>> So unless someone comes up with a better solution, I would support
>> taking this patch, as this fixes a real problem.
>>
>> Cheers,
>> Andre
>>
>>> Signed-off-by: Bhupinder Thakur <bhupinder.thakur@linaro.org>
>>> ---
>>> CC: Julien Grall <julien.grall@arm.com>
>>> CC: Andre Przywara <andre.przywara@arm.com>
>>> CC: Stefano Stabellini <sstabellini@kernel.org>
>>>
>>>  xen/arch/arm/vpl011.c | 21 ++++++++++++++++-----
>>>  1 file changed, 16 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/xen/arch/arm/vpl011.c b/xen/arch/arm/vpl011.c
>>> index f7ddccb..0b07436 100644
>>> --- a/xen/arch/arm/vpl011.c
>>> +++ b/xen/arch/arm/vpl011.c
>>> @@ -159,9 +159,15 @@ static void vpl011_write_data(struct domain *d, uint8_t data)
>>>      {
>>>          vpl011->uartfr |= TXFF;
>>>          vpl011->uartris &= ~TXI;
>>> -    }
>>>
>>> -    vpl011->uartfr |= BUSY;
>>> +        /*
>>> +         * This bit is set only when FIFO becomes full. This ensures that
>>> +         * the SBSA UART driver can write the early console data as fast as
>>> +         * possible, without waiting for the BUSY bit to get cleared before
>>> +         * writing each byte.
>>> +         */
>>> +        vpl011->uartfr |= BUSY;
>>> +    }
>>>
>>>      vpl011->uartfr &= ~TXFE;
>>>
>>> @@ -371,11 +377,16 @@ static void vpl011_data_avail(struct domain *d)
>>>      {
>>>          vpl011->uartfr &= ~TXFF;
>>>          vpl011->uartris |= TXI;
>>> +
>>> +        /*
>>> +         * Clear the BUSY bit as soon as space becomes available
>>> +         * so that the SBSA UART driver can start writing more data
>>> +         * without any further delay.
>>> +         */
>>> +        vpl011->uartfr &= ~BUSY;
>>> +
>>>          if ( out_ring_qsize == 0 )
>>> -        {
>>> -            vpl011->uartfr &= ~BUSY;
>>>              vpl011->uartfr |= TXFE;
>>> -        }
>>>      }
>>>
>>>      vpl011_update_interrupt_status(d);
>>>
> 
> Regards,
> Bhupinder
>
diff mbox

Patch

diff --git a/xen/arch/arm/vpl011.c b/xen/arch/arm/vpl011.c
index f7ddccb..0b07436 100644
--- a/xen/arch/arm/vpl011.c
+++ b/xen/arch/arm/vpl011.c
@@ -159,9 +159,15 @@  static void vpl011_write_data(struct domain *d, uint8_t data)
     {
         vpl011->uartfr |= TXFF;
         vpl011->uartris &= ~TXI;
-    }
 
-    vpl011->uartfr |= BUSY;
+        /*
+         * This bit is set only when FIFO becomes full. This ensures that
+         * the SBSA UART driver can write the early console data as fast as
+         * possible, without waiting for the BUSY bit to get cleared before
+         * writing each byte.
+         */
+        vpl011->uartfr |= BUSY;
+    }
 
     vpl011->uartfr &= ~TXFE;
 
@@ -371,11 +377,16 @@  static void vpl011_data_avail(struct domain *d)
     {
         vpl011->uartfr &= ~TXFF;
         vpl011->uartris |= TXI;
+
+        /*
+         * Clear the BUSY bit as soon as space becomes available
+         * so that the SBSA UART driver can start writing more data
+         * without any further delay.
+         */
+        vpl011->uartfr &= ~BUSY;
+
         if ( out_ring_qsize == 0 )
-        {
-            vpl011->uartfr &= ~BUSY;
             vpl011->uartfr |= TXFE;
-        }
     }
 
     vpl011_update_interrupt_status(d);