diff mbox series

[RFC] dt-bindings: mailbox: add doorbell support to ARM MHU

Message ID 0a50f0cf5593baeb628dc8606c523665e5e2ae6c.1589519600.git.viresh.kumar@linaro.org (mailing list archive)
State New, archived
Headers show
Series [RFC] dt-bindings: mailbox: add doorbell support to ARM MHU | expand

Commit Message

Viresh Kumar May 15, 2020, 5:17 a.m. UTC
From: Sudeep Holla <sudeep.holla@arm.com>

Hi Rob, Arnd and Jassi,

This stuff has been doing rounds on the mailing list since several years
now with no agreed conclusion by all the parties. And here is another
attempt to get some feedback from everyone involved to close this once
and for ever. Your comments will very much be appreciated.

The ARM MHU is defined here in the TRM [1] for your reference, which
states following:

	"The MHU drives the signal using a 32-bit register, with all 32
	bits logically ORed together. The MHU provides a set of
	registers to enable software to set, clear, and check the status
	of each of the bits of this register independently.  The use of
	32 bits for each interrupt line enables software to provide more
	information about the source of the interrupt. For example, each
	bit of the register can be associated with a type of event that
	can contribute to raising the interrupt."

On few other platforms, like qcom, similar doorbell mechanism is present
with separate interrupt for each of the bits (that's how I understood
it), while in case of ARM MHU, there is a single interrupt line for all
the 32 bits. Also in case of ARM MHU, these registers and interrupts
have 3 copies for different priority levels, i.e. low priority
non-secure, high priority non-secure and secure channels.

For ARM MHU, both the dt bindings and the Linux driver support 3
channels for the different priorities right now and support sending a 32
bit data on every transfer in a locked fashion, i.e. only one transfer
can be done at once and the other have to wait for it to finish first.

Here are the point of view of the parties involved on this subject:

Jassi's viewpoint:

- Virtualization of channels should be discouraged in software based on
  specific usecases of the application. This may invite other mailbox
  driver authors to ask for doing virtualization in their drivers.

- In mailbox's terminology, every channel is equivalent to a signal,
  since there is only one signal generated here by the MHU, there should
  be only one channel per priority level.

- The clients should send data (of just setting 1 bit or many in the 32
  bit word) using the existing mechanism as the delays due to
  serialization shouldn't be significant anyway.

- The driver supports 90% of the users with the current implementation
  and it shouldn't be extended to support doorbell and implement two
  different modes by changing value of #mbox-cells field in bindings.

Sudeep (ARM) and myself as well to some extent:

- The hardware gives us the capability to write the register in
  parallel, i.e. we can write 0x800 and 0x400 together without any
  software locks, and so these 32 bits should be considered as separate
  channel even if only one interrupt is issued by the hardware finally.
  This shouldn't be called as virtualization of the channels, as the
  hardware supports this (as clearly mentioned in the TRM) and it takes
  care of handling the signal properly.

- With serialization, if we use only one channel as today at every
  priority, if there are 5 requests to send signal to the receiver and
  the dvfs request is the last one in queue (which may be called from
  scheduler's hot path with fast switching), it unnecessarily needs to
  wait for the first four transfers to finish due to the software
  locking imposed by the mailbox framework. This adds additional delay,
  maybe of few ms only, which isn't required by the hardware but just by
  the software and few ms can be important in scheduler's hotpath.

- With the current approach it isn't possible to assign different bits
  (or doorbell numbers) to clients from DT and the only way of doing
  that without adding new bindings is by extending #mbox-cells to accept
  a value of 2 as done in this patch.

Jassi and Sudeep, I hope I was able to represent both the view points
properly here. Please correct me if I have made a mistake here.

This is it. It would be nice to get the views of everyone now on this
and how should this be handled.

Thanks.

[1] http://infocenter.arm.com/help/topic/com.arm.doc.ddi0515f/DDI0515F_juno_arm_development_platform_soc_trm.pdf , section 3.4.4, page 3-38.

Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 .../devicetree/bindings/mailbox/arm-mhu.txt   | 39 ++++++++++++++++++-
 1 file changed, 37 insertions(+), 2 deletions(-)

Comments

Jassi Brar May 15, 2020, 4:46 p.m. UTC | #1
On Fri, May 15, 2020 at 12:17 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> - The hardware gives us the capability to write the register in
>   parallel, i.e. we can write 0x800 and 0x400 together without any
>   software locks, and so these 32 bits should be considered as separate
>   channel even if only one interrupt is issued by the hardware finally.
>   This shouldn't be called as virtualization of the channels, as the
>   hardware supports this (as clearly mentioned in the TRM) and it takes
>   care of handling the signal properly.
>
I'll leave this one open to bikeshed arguments.

> - With serialization, if we use only one channel as today at every
>   priority, if there are 5 requests to send signal to the receiver and
>   the dvfs request is the last one in queue (which may be called from
>   scheduler's hot path with fast switching), it unnecessarily needs to
>   wait for the first four transfers to finish due to the software
>   locking imposed by the mailbox framework. This adds additional delay,
>   maybe of few ms only, which isn't required by the hardware but just by
>   the software and few ms can be important in scheduler's hotpath.
>
As I asked you yesterday over the call, it may help if you could share
some numbers to back up the doomsday scenario.
I don't believe mailbox will be a bottleneck, unless you send commands
in a while(1) ... but even then you have to compare against the
virtual-channel implementation. (Not to forget one usually doesn't
need/want the dvfs, power, clock, hotplug all happening at the _same_
time)

Please note, SCMI... lets not pretend it is not about making scmi work
with mhu :) ...  itself uses shared-memory transfers and
wait_for_completion_timeout  in scmi_do_xfer().   If some platform
_really-really_ faced speed bottlenecks, it would come to want to
exchange 32-bit encoded command/response over the mhu register,
asynchronously and totally bypassing shmem... which is possible only
now.


> - With the current approach it isn't possible to assign different bits
>   (or doorbell numbers) to clients from DT and the only way of doing
>   that without adding new bindings is by extending #mbox-cells to accept
>   a value of 2 as done in this patch.
>
I am afraid you are confused. You can use bit/doorbell-6 by passing
0x40 to mhu as the data to send.

Cheers!
Viresh Kumar May 18, 2020, 7:35 a.m. UTC | #2
On 15-05-20, 11:46, Jassi Brar wrote:
> As I asked you yesterday over the call, it may help if you could share
> some numbers to back up the doomsday scenario.

Yes, I have already asked Sudeep to get some numbers for this. He will
get back to us.

> > - With the current approach it isn't possible to assign different bits
> >   (or doorbell numbers) to clients from DT and the only way of doing
> >   that without adding new bindings is by extending #mbox-cells to accept
> >   a value of 2 as done in this patch.
> >
> I am afraid you are confused. You can use bit/doorbell-6 by passing
> 0x40 to mhu as the data to send.

That's how the code will do it, right I agree. What I was asking was
the way this information is passed from DT.
Jassi Brar May 19, 2020, 12:53 a.m. UTC | #3
On Mon, May 18, 2020 at 2:42 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 15-05-20, 11:46, Jassi Brar wrote:
> > As I asked you yesterday over the call, it may help if you could share
> > some numbers to back up the doomsday scenario.
>
> Yes, I have already asked Sudeep to get some numbers for this. He will
> get back to us.
>
Thanks, current bottleneck numbers and the patch/changes to improve
that, would help.

> > > - With the current approach it isn't possible to assign different bits
> > >   (or doorbell numbers) to clients from DT and the only way of doing
> > >   that without adding new bindings is by extending #mbox-cells to accept
> > >   a value of 2 as done in this patch.
> > >
> > I am afraid you are confused. You can use bit/doorbell-6 by passing
> > 0x40 to mhu as the data to send.
>
> That's how the code will do it, right I agree. What I was asking was
> the way this information is passed from DT.
>
That is a client/protocol property and has nothing to do with the
controller dt node.

cheers!
Bjorn Andersson May 19, 2020, 1:29 a.m. UTC | #4
On Thu 14 May 22:17 PDT 2020, Viresh Kumar wrote:

> From: Sudeep Holla <sudeep.holla@arm.com>
> 
> Hi Rob, Arnd and Jassi,
> 
> This stuff has been doing rounds on the mailing list since several years
> now with no agreed conclusion by all the parties. And here is another
> attempt to get some feedback from everyone involved to close this once
> and for ever. Your comments will very much be appreciated.
> 
> The ARM MHU is defined here in the TRM [1] for your reference, which
> states following:
> 
> 	"The MHU drives the signal using a 32-bit register, with all 32
> 	bits logically ORed together. The MHU provides a set of
> 	registers to enable software to set, clear, and check the status
> 	of each of the bits of this register independently.  The use of
> 	32 bits for each interrupt line enables software to provide more
> 	information about the source of the interrupt. For example, each
> 	bit of the register can be associated with a type of event that
> 	can contribute to raising the interrupt."
> 

Does this mean that there are 32 different signals and they are all ORed
into the same interrupt line to trigger software action when something
happens?

Or does it mean that this register is used to pass multi-bit information
and when any such information is passed an interrupt will be triggered?
If so, what does that information mean? How is it tied into other Linux
drivers/subsystems?

> On few other platforms, like qcom, similar doorbell mechanism is present
> with separate interrupt for each of the bits (that's how I understood
> it), while in case of ARM MHU, there is a single interrupt line for all
> the 32 bits. Also in case of ARM MHU, these registers and interrupts
> have 3 copies for different priority levels, i.e. low priority
> non-secure, high priority non-secure and secure channels.
> 

In the Qualcomm case we have 32 bits in a register where each bit
written will trigger an interrupt elsewhere in the SoC, as such the
mailbox is purely write only and the "read" side is handled by some
interrupt-controller.

The three copies just sounds like 3 (or 3 * 32) different mailbox
channels.

> For ARM MHU, both the dt bindings and the Linux driver support 3
> channels for the different priorities right now and support sending a 32
> bit data on every transfer in a locked fashion, i.e. only one transfer
> can be done at once and the other have to wait for it to finish first.
> 
> Here are the point of view of the parties involved on this subject:
> 
> Jassi's viewpoint:
> 
> - Virtualization of channels should be discouraged in software based on
>   specific usecases of the application. This may invite other mailbox
>   driver authors to ask for doing virtualization in their drivers.
> 
> - In mailbox's terminology, every channel is equivalent to a signal,
>   since there is only one signal generated here by the MHU, there should
>   be only one channel per priority level.
> 
> - The clients should send data (of just setting 1 bit or many in the 32
>   bit word) using the existing mechanism as the delays due to
>   serialization shouldn't be significant anyway.
> 
> - The driver supports 90% of the users with the current implementation
>   and it shouldn't be extended to support doorbell and implement two
>   different modes by changing value of #mbox-cells field in bindings.

I interpret Jassi's view as this is a channel that passes 32 bit
"messages".

> 
> Sudeep (ARM) and myself as well to some extent:
> 
> - The hardware gives us the capability to write the register in
>   parallel, i.e. we can write 0x800 and 0x400 together without any
>   software locks, and so these 32 bits should be considered as separate
>   channel even if only one interrupt is issued by the hardware finally.
>   This shouldn't be called as virtualization of the channels, as the
>   hardware supports this (as clearly mentioned in the TRM) and it takes
>   care of handling the signal properly.
> 

But if writes to the register is ORed together than it seems like the
hardware isn't supposed to pass multi-bit messages, but instead is
supposed to carry 32 individual signals - somewhat similar to the
Qualcomm block.

I don't see a problem with having a cascaded IRQ handler for incoming
notifications.

> - With serialization, if we use only one channel as today at every
>   priority, if there are 5 requests to send signal to the receiver and
>   the dvfs request is the last one in queue (which may be called from
>   scheduler's hot path with fast switching), it unnecessarily needs to
>   wait for the first four transfers to finish due to the software
>   locking imposed by the mailbox framework. This adds additional delay,
>   maybe of few ms only, which isn't required by the hardware but just by
>   the software and few ms can be important in scheduler's hotpath.
> 

So these 5 requests, are they conveyed by the signals [1,2,3,4,5] or
[BIT(0), BIT(1), BIT(2), BIT(3), BIT(4)]?

In the first case you have to serialize things given that e.g. signal 1
immediately followed by 2 is indistinguishable from 3.

If you signals are single-bit notifications then you don't need any
serialization.

Regards,
Bjorn

> - With the current approach it isn't possible to assign different bits
>   (or doorbell numbers) to clients from DT and the only way of doing
>   that without adding new bindings is by extending #mbox-cells to accept
>   a value of 2 as done in this patch.
> 
> Jassi and Sudeep, I hope I was able to represent both the view points
> properly here. Please correct me if I have made a mistake here.
> 
> This is it. It would be nice to get the views of everyone now on this
> and how should this be handled.
> 
> Thanks.
> 
> [1] http://infocenter.arm.com/help/topic/com.arm.doc.ddi0515f/DDI0515F_juno_arm_development_platform_soc_trm.pdf , section 3.4.4, page 3-38.
> 
> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
> ---
>  .../devicetree/bindings/mailbox/arm-mhu.txt   | 39 ++++++++++++++++++-
>  1 file changed, 37 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/devicetree/bindings/mailbox/arm-mhu.txt b/Documentation/devicetree/bindings/mailbox/arm-mhu.txt
> index 4971f03f0b33..ba659bcc7109 100644
> --- a/Documentation/devicetree/bindings/mailbox/arm-mhu.txt
> +++ b/Documentation/devicetree/bindings/mailbox/arm-mhu.txt
> @@ -10,6 +10,15 @@ STAT register and the remote clears it after having read the data.
>  The last channel is specified to be a 'Secure' resource, hence can't be
>  used by Linux running NS.
>  
> +The MHU drives the interrupt signal using a 32-bit register, with all
> +32-bits logically ORed together. It provides a set of registers to
> +enable software to set, clear and check the status of each of the bits
> +of this register independently. The use of 32 bits per interrupt line
> +enables software to provide more information about the source of the
> +interrupt. For example, each bit of the register can be associated with
> +a type of event that can contribute to raising the interrupt. Each of
> +the 32-bits can be used as "doorbell" to alert the remote processor.
> +
>  Mailbox Device Node:
>  ====================
>  
> @@ -18,13 +27,21 @@ used by Linux running NS.
>  - compatible:		Shall be "arm,mhu" & "arm,primecell"
>  - reg:			Contains the mailbox register address range (base
>  			address and length)
> -- #mbox-cells		Shall be 1 - the index of the channel needed.
> +- #mbox-cells		Shall be 1 - the index of the channel needed when
> +			not used as set of doorbell bits.
> +			Shall be 2 - the index of the channel needed, and
> +			the index of the doorbell bit within the channel
> +			when used in doorbell mode.
>  - interrupts:		Contains the interrupt information corresponding to
> -			each of the 3 links of MHU.
> +			each of the 3 physical channels of MHU namely low
> +			priority non-secure, high priority non-secure and
> +			secure channels.
>  
>  Example:
>  --------
>  
> +1. Controller which doesn't support doorbells
> +
>  	mhu: mailbox@2b1f0000 {
>  		#mbox-cells = <1>;
>  		compatible = "arm,mhu", "arm,primecell";
> @@ -41,3 +58,21 @@ used by Linux running NS.
>  		reg = <0 0x2e000000 0x4000>;
>  		mboxes = <&mhu 1>; /* HP-NonSecure */
>  	};
> +
> +2. Controller which supports doorbells
> +
> +	mhu: mailbox@2b1f0000 {
> +		#mbox-cells = <2>;
> +		compatible = "arm,mhu", "arm,primecell";
> +		reg = <0 0x2b1f0000 0x1000>;
> +		interrupts = <0 36 4>, /* LP-NonSecure */
> +			     <0 35 4>; /* HP-NonSecure */
> +		clocks = <&clock 0 2 1>;
> +		clock-names = "apb_pclk";
> +	};
> +
> +	mhu_client: scb@2e000000 {
> +		compatible = "arm,scpi";
> +		reg = <0 0x2e000000 0x200>;
> +		mboxes = <&mhu 1 4>; /* HP-NonSecure 5th doorbell bit */
> +	};
> -- 
> 2.25.0.rc1.19.g042ed3e048af
>
Viresh Kumar May 19, 2020, 3:40 a.m. UTC | #5
On 18-05-20, 18:29, Bjorn Andersson wrote:
> On Thu 14 May 22:17 PDT 2020, Viresh Kumar wrote:
> > This stuff has been doing rounds on the mailing list since several years
> > now with no agreed conclusion by all the parties. And here is another
> > attempt to get some feedback from everyone involved to close this once
> > and for ever. Your comments will very much be appreciated.
> > 
> > The ARM MHU is defined here in the TRM [1] for your reference, which
> > states following:
> > 
> > 	"The MHU drives the signal using a 32-bit register, with all 32
> > 	bits logically ORed together. The MHU provides a set of
> > 	registers to enable software to set, clear, and check the status
> > 	of each of the bits of this register independently.  The use of
> > 	32 bits for each interrupt line enables software to provide more
> > 	information about the source of the interrupt. For example, each
> > 	bit of the register can be associated with a type of event that
> > 	can contribute to raising the interrupt."
> > 
> 
> Does this mean that there are 32 different signals and they are all ORed
> into the same interrupt line to trigger software action when something
> happens?
> 
> Or does it mean that this register is used to pass multi-bit information
> and when any such information is passed an interrupt will be triggered?
> If so, what does that information mean? How is it tied into other Linux
> drivers/subsystems?

I have started to believe the hardware is written badly at this point
:)

The thing is that the register can be used to send a 32 bit data
(which the driver already provides), or it can be used by writing
different bits to the SET register concurrently, without corrupting
the other bits as writing 0 to a bit has no effect, we have a separate
CLEAR register for that. And so it says that all the bits are ORed
together to generate the interrupt, i.e. any bit set will generate an
interrupt. Which also means that you can't send data 0 with the
register.
Jassi Brar May 19, 2020, 4:05 a.m. UTC | #6
On Mon, May 18, 2020 at 10:40 PM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 18-05-20, 18:29, Bjorn Andersson wrote:
> > On Thu 14 May 22:17 PDT 2020, Viresh Kumar wrote:
> > > This stuff has been doing rounds on the mailing list since several years
> > > now with no agreed conclusion by all the parties. And here is another
> > > attempt to get some feedback from everyone involved to close this once
> > > and for ever. Your comments will very much be appreciated.
> > >
> > > The ARM MHU is defined here in the TRM [1] for your reference, which
> > > states following:
> > >
> > >     "The MHU drives the signal using a 32-bit register, with all 32
> > >     bits logically ORed together. The MHU provides a set of
> > >     registers to enable software to set, clear, and check the status
> > >     of each of the bits of this register independently.  The use of
> > >     32 bits for each interrupt line enables software to provide more
> > >     information about the source of the interrupt. For example, each
> > >     bit of the register can be associated with a type of event that
> > >     can contribute to raising the interrupt."
> > >
> >
> > Does this mean that there are 32 different signals and they are all ORed
> > into the same interrupt line to trigger software action when something
> > happens?
> >
> > Or does it mean that this register is used to pass multi-bit information
> > and when any such information is passed an interrupt will be triggered?
> > If so, what does that information mean? How is it tied into other Linux
> > drivers/subsystems?
>
> I have started to believe the hardware is written badly at this point
> :)
>
H/W is actually fine :)   Its just that the driver is written to
_also_ support a platform (my original) that doesn't have shmem and
need to pass data via 32bit registers.
Frankly, I am not against the doorbell mode, I am against implementing
two modes in a driver. If it really helped (note the past tense) the
SCMI, we could implement the driver only in doorbell mode but
unfortunately SCMI would still be _broken_ for non-doorbell
controllers.
Viresh Kumar May 19, 2020, 4:39 a.m. UTC | #7
On 18-05-20, 19:53, Jassi Brar wrote:
> That is a client/protocol property and has nothing to do with the
> controller dt node.

That's what I am concerned about, i.e. different ways of passing the
doorbell number via DT.
Rob Herring May 28, 2020, 7:20 p.m. UTC | #8
On Fri, May 15, 2020 at 10:47:38AM +0530, Viresh Kumar wrote:
> From: Sudeep Holla <sudeep.holla@arm.com>
> 
> Hi Rob, Arnd and Jassi,
> 
> This stuff has been doing rounds on the mailing list since several years
> now with no agreed conclusion by all the parties. And here is another
> attempt to get some feedback from everyone involved to close this once
> and for ever. Your comments will very much be appreciated.
> 
> The ARM MHU is defined here in the TRM [1] for your reference, which
> states following:
> 
> 	"The MHU drives the signal using a 32-bit register, with all 32
> 	bits logically ORed together. The MHU provides a set of
> 	registers to enable software to set, clear, and check the status
> 	of each of the bits of this register independently.  The use of
> 	32 bits for each interrupt line enables software to provide more
> 	information about the source of the interrupt. For example, each
> 	bit of the register can be associated with a type of event that
> 	can contribute to raising the interrupt."
> 
> On few other platforms, like qcom, similar doorbell mechanism is present
> with separate interrupt for each of the bits (that's how I understood
> it), while in case of ARM MHU, there is a single interrupt line for all
> the 32 bits. Also in case of ARM MHU, these registers and interrupts
> have 3 copies for different priority levels, i.e. low priority
> non-secure, high priority non-secure and secure channels.
> 
> For ARM MHU, both the dt bindings and the Linux driver support 3
> channels for the different priorities right now and support sending a 32
> bit data on every transfer in a locked fashion, i.e. only one transfer
> can be done at once and the other have to wait for it to finish first.
> 
> Here are the point of view of the parties involved on this subject:
> 
> Jassi's viewpoint:
> 
> - Virtualization of channels should be discouraged in software based on
>   specific usecases of the application. This may invite other mailbox
>   driver authors to ask for doing virtualization in their drivers.
> 
> - In mailbox's terminology, every channel is equivalent to a signal,
>   since there is only one signal generated here by the MHU, there should
>   be only one channel per priority level.
> 
> - The clients should send data (of just setting 1 bit or many in the 32
>   bit word) using the existing mechanism as the delays due to
>   serialization shouldn't be significant anyway.
> 
> - The driver supports 90% of the users with the current implementation
>   and it shouldn't be extended to support doorbell and implement two
>   different modes by changing value of #mbox-cells field in bindings.
> 
> Sudeep (ARM) and myself as well to some extent:
> 
> - The hardware gives us the capability to write the register in
>   parallel, i.e. we can write 0x800 and 0x400 together without any
>   software locks, and so these 32 bits should be considered as separate
>   channel even if only one interrupt is issued by the hardware finally.
>   This shouldn't be called as virtualization of the channels, as the
>   hardware supports this (as clearly mentioned in the TRM) and it takes
>   care of handling the signal properly.
> 
> - With serialization, if we use only one channel as today at every
>   priority, if there are 5 requests to send signal to the receiver and
>   the dvfs request is the last one in queue (which may be called from
>   scheduler's hot path with fast switching), it unnecessarily needs to
>   wait for the first four transfers to finish due to the software
>   locking imposed by the mailbox framework. This adds additional delay,
>   maybe of few ms only, which isn't required by the hardware but just by
>   the software and few ms can be important in scheduler's hotpath.
> 
> - With the current approach it isn't possible to assign different bits
>   (or doorbell numbers) to clients from DT and the only way of doing
>   that without adding new bindings is by extending #mbox-cells to accept
>   a value of 2 as done in this patch.
> 
> Jassi and Sudeep, I hope I was able to represent both the view points
> properly here. Please correct me if I have made a mistake here.
> 
> This is it. It would be nice to get the views of everyone now on this
> and how should this be handled.

I am perfectly fine with adding another cell which seems appropriate 
here. You can have 5 cells for all I care if that makes sense for 
the h/w. That has nothing to do with the Linux design. Whether Linux 
requires serializing mailbox accesses is a separate issue. On that side, 
it seems silly to not allow driving the h/w in the most efficient way 
possible.

Rob
Viresh Kumar May 29, 2020, 4:07 a.m. UTC | #9
On 28-05-20, 13:20, Rob Herring wrote:
> Whether Linux 
> requires serializing mailbox accesses is a separate issue. On that side, 
> it seems silly to not allow driving the h/w in the most efficient way 
> possible.

That's exactly what we are trying to say. The hardware allows us to
write all 32 bits in parallel, without any hardware issues, why
shouldn't we do that ? The delay (which Sudeep will find out, he is
facing issues with hardware access because of lockdown right now)
which may be small in transmitting across a mailbox becomes
significant because of the fact that it happens synchronously and the
receiver will send some sort of acknowledgement (and that depends on
the firmware there) and the kernel needs to wait for it, while the
kernel doesn't really need to. There is no reason IMHO for being
inefficient here while we can do better.
Jassi Brar May 29, 2020, 5:20 a.m. UTC | #10
On Thu, May 28, 2020 at 2:20 PM Rob Herring <robh@kernel.org> wrote:
>
> On Fri, May 15, 2020 at 10:47:38AM +0530, Viresh Kumar wrote:
> > From: Sudeep Holla <sudeep.holla@arm.com>
> >
> > Hi Rob, Arnd and Jassi,
> >
> > This stuff has been doing rounds on the mailing list since several years
> > now with no agreed conclusion by all the parties. And here is another
> > attempt to get some feedback from everyone involved to close this once
> > and for ever. Your comments will very much be appreciated.
> >
> > The ARM MHU is defined here in the TRM [1] for your reference, which
> > states following:
> >
> >       "The MHU drives the signal using a 32-bit register, with all 32
> >       bits logically ORed together. The MHU provides a set of
> >       registers to enable software to set, clear, and check the status
> >       of each of the bits of this register independently.  The use of
> >       32 bits for each interrupt line enables software to provide more
> >       information about the source of the interrupt. For example, each
> >       bit of the register can be associated with a type of event that
> >       can contribute to raising the interrupt."
> >
> > On few other platforms, like qcom, similar doorbell mechanism is present
> > with separate interrupt for each of the bits (that's how I understood
> > it), while in case of ARM MHU, there is a single interrupt line for all
> > the 32 bits. Also in case of ARM MHU, these registers and interrupts
> > have 3 copies for different priority levels, i.e. low priority
> > non-secure, high priority non-secure and secure channels.
> >
> > For ARM MHU, both the dt bindings and the Linux driver support 3
> > channels for the different priorities right now and support sending a 32
> > bit data on every transfer in a locked fashion, i.e. only one transfer
> > can be done at once and the other have to wait for it to finish first.
> >
> > Here are the point of view of the parties involved on this subject:
> >
> > Jassi's viewpoint:
> >
> > - Virtualization of channels should be discouraged in software based on
> >   specific usecases of the application. This may invite other mailbox
> >   driver authors to ask for doing virtualization in their drivers.
> >
> > - In mailbox's terminology, every channel is equivalent to a signal,
> >   since there is only one signal generated here by the MHU, there should
> >   be only one channel per priority level.
> >
> > - The clients should send data (of just setting 1 bit or many in the 32
> >   bit word) using the existing mechanism as the delays due to
> >   serialization shouldn't be significant anyway.
> >
> > - The driver supports 90% of the users with the current implementation
> >   and it shouldn't be extended to support doorbell and implement two
> >   different modes by changing value of #mbox-cells field in bindings.
> >
> > Sudeep (ARM) and myself as well to some extent:
> >
> > - The hardware gives us the capability to write the register in
> >   parallel, i.e. we can write 0x800 and 0x400 together without any
> >   software locks, and so these 32 bits should be considered as separate
> >   channel even if only one interrupt is issued by the hardware finally.
> >   This shouldn't be called as virtualization of the channels, as the
> >   hardware supports this (as clearly mentioned in the TRM) and it takes
> >   care of handling the signal properly.
> >
> > - With serialization, if we use only one channel as today at every
> >   priority, if there are 5 requests to send signal to the receiver and
> >   the dvfs request is the last one in queue (which may be called from
> >   scheduler's hot path with fast switching), it unnecessarily needs to
> >   wait for the first four transfers to finish due to the software
> >   locking imposed by the mailbox framework. This adds additional delay,
> >   maybe of few ms only, which isn't required by the hardware but just by
> >   the software and few ms can be important in scheduler's hotpath.
> >
> > - With the current approach it isn't possible to assign different bits
> >   (or doorbell numbers) to clients from DT and the only way of doing
> >   that without adding new bindings is by extending #mbox-cells to accept
> >   a value of 2 as done in this patch.
> >
> > Jassi and Sudeep, I hope I was able to represent both the view points
> > properly here. Please correct me if I have made a mistake here.
> >
> > This is it. It would be nice to get the views of everyone now on this
> > and how should this be handled.
>
> I am perfectly fine with adding another cell which seems appropriate
> here. You can have 5 cells for all I care if that makes sense for
> the h/w. That has nothing to do with the Linux design. Whether Linux
> requires serializing mailbox accesses is a separate issue. On that side,
> it seems silly to not allow driving the h/w in the most efficient way
> possible.
>
The fact that all these bits are backed by one physical signal makes
the firmware implement its own mux-demux'ing scheme. So the choice was
between demux and single signal at a time. Had there been one signal
per bit, the implementation would have definitely been 'efficient'.

And the first platform the driver was developed on, required message
passing over the registers. So now my approach is to make do with what
we have...unless it shows in numbers.

Anyways, if it comes to that, I'd rather a separate "doorbell' driver
than a driver working in two s/w modes.

Thanks.
Viresh Kumar May 29, 2020, 6:27 a.m. UTC | #11
On 29-05-20, 00:20, Jassi Brar wrote:
> The fact that all these bits are backed by one physical signal makes

I believe that within the IP itself, there must be 32 signals, just
that only one comes out of it all OR'd together. Maybe that can be
verified by writing 0x01 to it multiple times and it should generate
the interrupt only once on the first instance. i.e. writing 1 to any
big again won't raise the interrupt.

> the firmware implement its own mux-demux'ing scheme.

Similar to how the interrupt controllers do it in the kernel, they
receive a single interrupt and then go check its registers/bits to see
which peripheral raised it.

> So the choice was
> between demux and single signal at a time.

Well, the demux is on the firmware side and the kernel may not need to
worry about that, let the platform do it ?

> Had there been one signal
> per bit, the implementation would have definitely been 'efficient'.

I will say 'More efficient', it will still be 'efficient' if we
implement doorbell.

> And the first platform the driver was developed on, required message
> passing over the registers.

I think it was correctly done at that point of time. No doubt about
that.

> So now my approach is to make do with what
> we have...unless it shows in numbers.

ARM's office is closed (lock-down) and he doesn't have remote access
to the board and so was unable to do it until now.

Numbers or no numbers, I don't think there is a question here about
what is the most efficient way of doing it. Doing it in parallel (when
the hardware allows) is surely going to be more efficient. Sending a
signal, getting it processed by the firmware, doing something with it
and then sending an Ack and that being received by kernel will take
time. Why make a queue on the kernel side when we don't have to. :)

> Anyways, if it comes to that, I'd rather a separate "doorbell' driver
> than a driver working in two s/w modes.

I think that would be fine with everyone, if you are fine with that
now (based on plain logic and no numbers), maybe we can get that done
now?
Sudeep Holla June 3, 2020, 6:04 p.m. UTC | #12
On Fri, May 29, 2020 at 09:37:58AM +0530, Viresh Kumar wrote:
> On 28-05-20, 13:20, Rob Herring wrote:
> > Whether Linux
> > requires serializing mailbox accesses is a separate issue. On that side,
> > it seems silly to not allow driving the h/w in the most efficient way
> > possible.
>
> That's exactly what we are trying to say. The hardware allows us to
> write all 32 bits in parallel, without any hardware issues, why
> shouldn't we do that ? The delay (which Sudeep will find out, he is
> facing issues with hardware access because of lockdown right now)

OK, I was able to access the setup today. I couldn't reach a point
where I can do measurements as the system just became unusable with
one physical channel instead of 2 virtual channels as in my patches.

My test was simple. Switch to schedutil and read sensors periodically
via sysfs.

 arm-scmi firmware:scmi: message for 1 is not expected!
 arm-scmi firmware:scmi: timed out in resp(caller: scmi_sensor_reading_get+0xf4/0x120)
 arm-scmi firmware:scmi: timed out in resp(caller: scmi_sensor_reading_get+0xf4/0x120)
 arm-scmi firmware:scmi: message for 1 is not expected!
 arm-scmi firmware:scmi: timed out in resp(caller: scmi_sensor_reading_get+0xf4/0x120)
 arm-scmi firmware:scmi: message for 1 is not expected!

With trace enabled I can see even cpufreq_set timing out. Sample trace
output:

       bash-1019  [005]  1149.452340: scmi_xfer_begin:      transfer_id=1537 msg_id=7 protocol_id=19 seq=0 poll=1
       bash-1019  [005]  1149.452407: scmi_xfer_end:        transfer_id=1537 msg_id=7 protocol_id=19 seq=0 status=0
       bash-1526  [000]  1149.472553: scmi_xfer_begin:      transfer_id=1538 msg_id=6 protocol_id=21 seq=0 poll=0
     <idle>-0     [001]  1149.472733: scmi_xfer_begin:      transfer_id=1539 msg_id=7 protocol_id=19 seq=1 poll=1
     <idle>-0     [001]  1149.472842: scmi_xfer_end:        transfer_id=1539 msg_id=7 protocol_id=19 seq=1 status=-110
     <idle>-0     [001]  1149.483040: scmi_xfer_begin:      transfer_id=1540 msg_id=7 protocol_id=19 seq=1 poll=1
     <idle>-0     [001]  1149.483043: scmi_xfer_end:        transfer_id=1540 msg_id=7 protocol_id=19 seq=1 status=0
    rs:main-543   [003]  1149.493031: scmi_xfer_begin:      transfer_id=1541 msg_id=7 protocol_id=19 seq=1 poll=1
    rs:main-543   [003]  1149.493047: scmi_xfer_end:        transfer_id=1541 msg_id=7 protocol_id=19 seq=1 status=0
     <idle>-0     [000]  1149.507033: scmi_xfer_begin:      transfer_id=1542 msg_id=7 protocol_id=19 seq=1 poll=1
     <idle>-0     [000]  1149.507044: scmi_xfer_end:        transfer_id=1542 msg_id=7 protocol_id=19 seq=1 status=0
       bash-1526  [000]  1149.516068: scmi_xfer_end:        transfer_id=1538 msg_id=6 protocol_id=21 seq=0 status=-110
       bash-1526  [000]  1149.516559: scmi_xfer_begin:      transfer_id=1543 msg_id=6 protocol_id=21 seq=0 poll=0
     <idle>-0     [001]  1149.516729: scmi_xfer_begin:      transfer_id=1544 msg_id=7 protocol_id=19 seq=1 poll=1
     <idle>-0     [001]  1149.516837: scmi_xfer_end:        transfer_id=1544 msg_id=7 protocol_id=19 seq=1 status=-110
ksoftirqd/0-9     [000]  1149.519065: scmi_xfer_begin:      transfer_id=1545 msg_id=7 protocol_id=19 seq=1 poll=1
ksoftirqd/0-9     [000]  1149.519072: scmi_xfer_end:        transfer_id=1545 msg_id=7 protocol_id=19 seq=1 status=0
     <idle>-0     [001]  1149.526878: scmi_xfer_begin:      transfer_id=1546 msg_id=7 protocol_id=19 seq=1 poll=1
     <idle>-0     [001]  1149.526882: scmi_xfer_end:        transfer_id=1546 msg_id=7 protocol_id=19 seq=1 status=0
     <idle>-0     [000]  1149.551119: scmi_xfer_begin:      transfer_id=1547 msg_id=7 protocol_id=19 seq=1 poll=1
     <idle>-0     [000]  1149.551138: scmi_xfer_end:        transfer_id=1547 msg_id=7 protocol_id=19 seq=1 status=0
       bash-1526  [000]  1149.560191: scmi_xfer_end:        transfer_id=1543 msg_id=6 protocol_id=21 seq=0 status=-110
       bash-1526  [000]  1149.560690: scmi_xfer_begin:      transfer_id=1548 msg_id=6 protocol_id=21 seq=0 poll=0
     <idle>-0     [001]  1149.560859: scmi_xfer_begin:      transfer_id=1549 msg_id=7 protocol_id=19 seq=1 poll=1
     <idle>-0     [001]  1149.560968: scmi_xfer_end:        transfer_id=1549 msg_id=7 protocol_id=19 seq=1 status=-110

protocol_id=19 is cpufreq and 21 is sensor. This is simplest test and
I can easily generate more timeouts starting some stress test with DVFS.

--
Regards,
Sudeep
Sudeep Holla June 3, 2020, 6:17 p.m. UTC | #13
On Wed, Jun 03, 2020 at 07:04:35PM +0100, Sudeep Holla wrote:
> On Fri, May 29, 2020 at 09:37:58AM +0530, Viresh Kumar wrote:
> > On 28-05-20, 13:20, Rob Herring wrote:
> > > Whether Linux
> > > requires serializing mailbox accesses is a separate issue. On that side,
> > > it seems silly to not allow driving the h/w in the most efficient way
> > > possible.
> >
> > That's exactly what we are trying to say. The hardware allows us to
> > write all 32 bits in parallel, without any hardware issues, why
> > shouldn't we do that ? The delay (which Sudeep will find out, he is
> > facing issues with hardware access because of lockdown right now)
> 
> OK, I was able to access the setup today. I couldn't reach a point
> where I can do measurements as the system just became unusable with
> one physical channel instead of 2 virtual channels as in my patches.
> 
> My test was simple. Switch to schedutil and read sensors periodically
> via sysfs.
> 
>  arm-scmi firmware:scmi: message for 1 is not expected!
>  arm-scmi firmware:scmi: timed out in resp(caller: scmi_sensor_reading_get+0xf4/0x120)
>  arm-scmi firmware:scmi: timed out in resp(caller: scmi_sensor_reading_get+0xf4/0x120)
>  arm-scmi firmware:scmi: message for 1 is not expected!
>  arm-scmi firmware:scmi: timed out in resp(caller: scmi_sensor_reading_get+0xf4/0x120)
>  arm-scmi firmware:scmi: message for 1 is not expected!
> 
> With trace enabled I can see even cpufreq_set timing out. Sample trace
> output:
> 
>        bash-1019  [005]  1149.452340: scmi_xfer_begin:      transfer_id=1537 msg_id=7 protocol_id=19 seq=0 poll=1
>        bash-1019  [005]  1149.452407: scmi_xfer_end:        transfer_id=1537 msg_id=7 protocol_id=19 seq=0 status=0
>        bash-1526  [000]  1149.472553: scmi_xfer_begin:      transfer_id=1538 msg_id=6 protocol_id=21 seq=0 poll=0
>      <idle>-0     [001]  1149.472733: scmi_xfer_begin:      transfer_id=1539 msg_id=7 protocol_id=19 seq=1 poll=1
>      <idle>-0     [001]  1149.472842: scmi_xfer_end:        transfer_id=1539 msg_id=7 protocol_id=19 seq=1 status=-110
>      <idle>-0     [001]  1149.483040: scmi_xfer_begin:      transfer_id=1540 msg_id=7 protocol_id=19 seq=1 poll=1
>      <idle>-0     [001]  1149.483043: scmi_xfer_end:        transfer_id=1540 msg_id=7 protocol_id=19 seq=1 status=0
>     rs:main-543   [003]  1149.493031: scmi_xfer_begin:      transfer_id=1541 msg_id=7 protocol_id=19 seq=1 poll=1
>     rs:main-543   [003]  1149.493047: scmi_xfer_end:        transfer_id=1541 msg_id=7 protocol_id=19 seq=1 status=0
>      <idle>-0     [000]  1149.507033: scmi_xfer_begin:      transfer_id=1542 msg_id=7 protocol_id=19 seq=1 poll=1
>      <idle>-0     [000]  1149.507044: scmi_xfer_end:        transfer_id=1542 msg_id=7 protocol_id=19 seq=1 status=0
>        bash-1526  [000]  1149.516068: scmi_xfer_end:        transfer_id=1538 msg_id=6 protocol_id=21 seq=0 status=-110
>        bash-1526  [000]  1149.516559: scmi_xfer_begin:      transfer_id=1543 msg_id=6 protocol_id=21 seq=0 poll=0
>      <idle>-0     [001]  1149.516729: scmi_xfer_begin:      transfer_id=1544 msg_id=7 protocol_id=19 seq=1 poll=1
>      <idle>-0     [001]  1149.516837: scmi_xfer_end:        transfer_id=1544 msg_id=7 protocol_id=19 seq=1 status=-110
> ksoftirqd/0-9     [000]  1149.519065: scmi_xfer_begin:      transfer_id=1545 msg_id=7 protocol_id=19 seq=1 poll=1
> ksoftirqd/0-9     [000]  1149.519072: scmi_xfer_end:        transfer_id=1545 msg_id=7 protocol_id=19 seq=1 status=0
>      <idle>-0     [001]  1149.526878: scmi_xfer_begin:      transfer_id=1546 msg_id=7 protocol_id=19 seq=1 poll=1
>      <idle>-0     [001]  1149.526882: scmi_xfer_end:        transfer_id=1546 msg_id=7 protocol_id=19 seq=1 status=0
>      <idle>-0     [000]  1149.551119: scmi_xfer_begin:      transfer_id=1547 msg_id=7 protocol_id=19 seq=1 poll=1
>      <idle>-0     [000]  1149.551138: scmi_xfer_end:        transfer_id=1547 msg_id=7 protocol_id=19 seq=1 status=0
>        bash-1526  [000]  1149.560191: scmi_xfer_end:        transfer_id=1543 msg_id=6 protocol_id=21 seq=0 status=-110
>        bash-1526  [000]  1149.560690: scmi_xfer_begin:      transfer_id=1548 msg_id=6 protocol_id=21 seq=0 poll=0
>      <idle>-0     [001]  1149.560859: scmi_xfer_begin:      transfer_id=1549 msg_id=7 protocol_id=19 seq=1 poll=1
>      <idle>-0     [001]  1149.560968: scmi_xfer_end:        transfer_id=1549 msg_id=7 protocol_id=19 seq=1 status=-110
> 
> protocol_id=19 is cpufreq and 21 is sensor. This is simplest test and
> I can easily generate more timeouts starting some stress test with DVFS.
> 

I just realised that we have the timing info in the traces and you will
observe the sensor readings take something in order of 100us to 500-600us
or even more based on which sensor is being read. While we have 100us
timeout for cpufreq opp set. I am attaching full trace now.
Sudeep Holla June 3, 2020, 6:28 p.m. UTC | #14
Hi Bjorn,

Thanks for the details response.

On Mon, May 18, 2020 at 06:29:27PM -0700, Bjorn Andersson wrote:
> On Thu 14 May 22:17 PDT 2020, Viresh Kumar wrote:
>

[...]

I find this part nicely summarise your response.

> > - With serialization, if we use only one channel as today at every
> >   priority, if there are 5 requests to send signal to the receiver and
> >   the dvfs request is the last one in queue (which may be called from
> >   scheduler's hot path with fast switching), it unnecessarily needs to
> >   wait for the first four transfers to finish due to the software
> >   locking imposed by the mailbox framework. This adds additional delay,
> >   maybe of few ms only, which isn't required by the hardware but just by
> >   the software and few ms can be important in scheduler's hotpath.
> >
>
> So these 5 requests, are they conveyed by the signals [1,2,3,4,5] or
> [BIT(0), BIT(1), BIT(2), BIT(3), BIT(4)]?
>

Latter in this case. IMO it is platform choice on how to use it. It is
equally possible to send 2^31 different signals. But the receiver must
also interpret it in the *exact* same way. In this case, the receiver
which is platform firmware interprets as individual bit signals.

> In the first case you have to serialize things given that e.g. signal 1
> immediately followed by 2 is indistinguishable from 3.
>

Agree and we are not proposing to break that use case. It exists in the
driver/binding today and will continue as is.

> If you signals are single-bit notifications then you don't need any
> serialization.
>

Indeed, we are making use of that.

--
Regards,
Sudeep
Sudeep Holla June 3, 2020, 6:31 p.m. UTC | #15
On Mon, May 18, 2020 at 11:05:03PM -0500, Jassi Brar wrote:
> On Mon, May 18, 2020 at 10:40 PM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> >
> > On 18-05-20, 18:29, Bjorn Andersson wrote:
> > > On Thu 14 May 22:17 PDT 2020, Viresh Kumar wrote:
> > > > This stuff has been doing rounds on the mailing list since several years
> > > > now with no agreed conclusion by all the parties. And here is another
> > > > attempt to get some feedback from everyone involved to close this once
> > > > and for ever. Your comments will very much be appreciated.
> > > >
> > > > The ARM MHU is defined here in the TRM [1] for your reference, which
> > > > states following:
> > > >
> > > >     "The MHU drives the signal using a 32-bit register, with all 32
> > > >     bits logically ORed together. The MHU provides a set of
> > > >     registers to enable software to set, clear, and check the status
> > > >     of each of the bits of this register independently.  The use of
> > > >     32 bits for each interrupt line enables software to provide more
> > > >     information about the source of the interrupt. For example, each
> > > >     bit of the register can be associated with a type of event that
> > > >     can contribute to raising the interrupt."
> > > >
> > >
> > > Does this mean that there are 32 different signals and they are all ORed
> > > into the same interrupt line to trigger software action when something
> > > happens?
> > >
> > > Or does it mean that this register is used to pass multi-bit information
> > > and when any such information is passed an interrupt will be triggered?
> > > If so, what does that information mean? How is it tied into other Linux
> > > drivers/subsystems?
> >
> > I have started to believe the hardware is written badly at this point
> > :)
> >
> H/W is actually fine :)   Its just that the driver is written to
> _also_ support a platform (my original) that doesn't have shmem and
> need to pass data via 32bit registers.
> Frankly, I am not against the doorbell mode, I am against implementing
> two modes in a driver. If it really helped (note the past tense) the
> SCMI, we could implement the driver only in doorbell mode but
> unfortunately SCMI would still be _broken_ for non-doorbell
> controllers.

Should be fine as the specification is designed to work with only shmem
for any data transfer and mailbox is just a signal mechanism. I won't
be too worried about that.

--
Regards,
Sudeep
Jassi Brar June 3, 2020, 6:32 p.m. UTC | #16
On Wed, Jun 3, 2020 at 1:04 PM Sudeep Holla <sudeep.holla@arm.com> wrote:
>
> On Fri, May 29, 2020 at 09:37:58AM +0530, Viresh Kumar wrote:
> > On 28-05-20, 13:20, Rob Herring wrote:
> > > Whether Linux
> > > requires serializing mailbox accesses is a separate issue. On that side,
> > > it seems silly to not allow driving the h/w in the most efficient way
> > > possible.
> >
> > That's exactly what we are trying to say. The hardware allows us to
> > write all 32 bits in parallel, without any hardware issues, why
> > shouldn't we do that ? The delay (which Sudeep will find out, he is
> > facing issues with hardware access because of lockdown right now)
>
> OK, I was able to access the setup today. I couldn't reach a point
> where I can do measurements as the system just became unusable with
> one physical channel instead of 2 virtual channels as in my patches.
>
> My test was simple. Switch to schedutil and read sensors periodically
> via sysfs.
>
>  arm-scmi firmware:scmi: message for 1 is not expected!
>
This sounds like you are not serialising requests on a shared channel.
Can you please also share the patch?

Thanks.
Jassi Brar June 3, 2020, 6:42 p.m. UTC | #17
On Wed, Jun 3, 2020 at 1:31 PM Sudeep Holla <sudeep.holla@arm.com> wrote:
>
> > >
> > H/W is actually fine :)   Its just that the driver is written to
> > _also_ support a platform (my original) that doesn't have shmem and
> > need to pass data via 32bit registers.
> > Frankly, I am not against the doorbell mode, I am against implementing
> > two modes in a driver. If it really helped (note the past tense) the
> > SCMI, we could implement the driver only in doorbell mode but
> > unfortunately SCMI would still be _broken_ for non-doorbell
> > controllers.
>
> Should be fine as the specification is designed to work with only shmem
> for any data transfer and mailbox is just a signal mechanism. I won't
> be too worried about that.
>
Please clarify how it will work on, say again, rockchip platform with shmem.

thanks.
Viresh Kumar June 4, 2020, 5:59 a.m. UTC | #18
On 03-06-20, 19:17, Sudeep Holla wrote:
> I just realised that we have the timing info in the traces and you will
> observe the sensor readings take something in order of 100us to 500-600us
> or even more based on which sensor is being read. While we have 100us
> timeout for cpufreq opp set.

Which timeout from opp core are you talking about ?
Sudeep Holla June 4, 2020, 8:28 a.m. UTC | #19
On Thu, Jun 04, 2020 at 11:29:03AM +0530, Viresh Kumar wrote:
> On 03-06-20, 19:17, Sudeep Holla wrote:
> > I just realised that we have the timing info in the traces and you will
> > observe the sensor readings take something in order of 100us to 500-600us
> > or even more based on which sensor is being read. While we have 100us
> > timeout for cpufreq opp set.
>
> Which timeout from opp core are you talking about ?

The one we have in SCMI itself to meet the fast_switch requirement.

--
Regards,
Sudeep
Sudeep Holla June 4, 2020, 9:20 a.m. UTC | #20
On Wed, Jun 03, 2020 at 01:32:42PM -0500, Jassi Brar wrote:
> On Wed, Jun 3, 2020 at 1:04 PM Sudeep Holla <sudeep.holla@arm.com> wrote:
> >
> > On Fri, May 29, 2020 at 09:37:58AM +0530, Viresh Kumar wrote:
> > > On 28-05-20, 13:20, Rob Herring wrote:
> > > > Whether Linux
> > > > requires serializing mailbox accesses is a separate issue. On that side,
> > > > it seems silly to not allow driving the h/w in the most efficient way
> > > > possible.
> > >
> > > That's exactly what we are trying to say. The hardware allows us to
> > > write all 32 bits in parallel, without any hardware issues, why
> > > shouldn't we do that ? The delay (which Sudeep will find out, he is
> > > facing issues with hardware access because of lockdown right now)
> >
> > OK, I was able to access the setup today. I couldn't reach a point
> > where I can do measurements as the system just became unusable with
> > one physical channel instead of 2 virtual channels as in my patches.
> >
> > My test was simple. Switch to schedutil and read sensors periodically
> > via sysfs.
> >
> >  arm-scmi firmware:scmi: message for 1 is not expected!
> >
> This sounds like you are not serialising requests on a shared channel.
> Can you please also share the patch?

OK, I did try with a small patch initially and then realised we must hit
issue with mainline as is. Tried and the behaviour is exact same. All
I did is removed my patches and use bit[0] as the signal. It doesn't
matter as writes to the register are now serialised. Oh, the above
message comes when OS times out in advance while firmware continues to
process the old request and respond.

The trace I sent gives much better view of what's going on.

--
Regards,
Sudeep
Jassi Brar June 4, 2020, 3:15 p.m. UTC | #21
On Thu, Jun 4, 2020 at 4:20 AM Sudeep Holla <sudeep.holla@arm.com> wrote:
>
> On Wed, Jun 03, 2020 at 01:32:42PM -0500, Jassi Brar wrote:
> > On Wed, Jun 3, 2020 at 1:04 PM Sudeep Holla <sudeep.holla@arm.com> wrote:
> > >
> > > On Fri, May 29, 2020 at 09:37:58AM +0530, Viresh Kumar wrote:
> > > > On 28-05-20, 13:20, Rob Herring wrote:
> > > > > Whether Linux
> > > > > requires serializing mailbox accesses is a separate issue. On that side,
> > > > > it seems silly to not allow driving the h/w in the most efficient way
> > > > > possible.
> > > >
> > > > That's exactly what we are trying to say. The hardware allows us to
> > > > write all 32 bits in parallel, without any hardware issues, why
> > > > shouldn't we do that ? The delay (which Sudeep will find out, he is
> > > > facing issues with hardware access because of lockdown right now)
> > >
> > > OK, I was able to access the setup today. I couldn't reach a point
> > > where I can do measurements as the system just became unusable with
> > > one physical channel instead of 2 virtual channels as in my patches.
> > >
> > > My test was simple. Switch to schedutil and read sensors periodically
> > > via sysfs.
> > >
> > >  arm-scmi firmware:scmi: message for 1 is not expected!
> > >
> > This sounds like you are not serialising requests on a shared channel.
> > Can you please also share the patch?
>
> OK, I did try with a small patch initially and then realised we must hit
> issue with mainline as is. Tried and the behaviour is exact same. All
> I did is removed my patches and use bit[0] as the signal. It doesn't
> matter as writes to the register are now serialised. Oh, the above
> message comes when OS times out in advance while firmware continues to
> process the old request and respond.
>
> The trace I sent gives much better view of what's going on.
>
BTW, you didn't even share 'good' vs 'bad' log for me to actually see
if the api lacks.

Let us look closely ...

 >>    bash-1019  [005]  1149.452340: scmi_xfer_begin:
transfer_id=1537 msg_id=7 protocol_id=19 seq=0 poll=1
 >>    bash-1019  [005]  1149.452407: scmi_xfer_end:
transfer_id=1537 msg_id=7 protocol_id=19 seq=0 status=0
>
This round trip took  67usecs.  (log shows some at even 3usecs)
That includes mailbox api overhead, memcpy and the time taken by
remote to respond.
So the api is definitely capable of much faster transfers than you require.

>>     bash-1526  [000]  1149.472553: scmi_xfer_begin:      transfer_id=1538 msg_id=6 protocol_id=21 seq=0 poll=0
>>      <idle>-0     [001]  1149.472733: scmi_xfer_begin:      transfer_id=1539 msg_id=7 protocol_id=19 seq=1 poll=1
>
Here another request is started before the first is finished.
If you want this to work you have to serialise access to the single
physical channel and/or run the remote firmware in asynchronous mode -
that is, the firmware ack the bit as soon as it sees and starts
working in the background, so that we return in ~3usec always, while
the data returns whenever it is ready.  Again, I don't have the code
or the 'good' run log to compare.

PS: I feel it is probably less effort for me to simply let the patch
through, than use my reiki powers to imagine how your test code and
firmware looks like.
Sudeep Holla June 5, 2020, 4:56 a.m. UTC | #22
On Thu, Jun 04, 2020 at 10:15:55AM -0500, Jassi Brar wrote:
> On Thu, Jun 4, 2020 at 4:20 AM Sudeep Holla <sudeep.holla@arm.com> wrote:
> >
> > On Wed, Jun 03, 2020 at 01:32:42PM -0500, Jassi Brar wrote:
> > > On Wed, Jun 3, 2020 at 1:04 PM Sudeep Holla <sudeep.holla@arm.com> wrote:
> > > >
> > > > On Fri, May 29, 2020 at 09:37:58AM +0530, Viresh Kumar wrote:
> > > > > On 28-05-20, 13:20, Rob Herring wrote:
> > > > > > Whether Linux
> > > > > > requires serializing mailbox accesses is a separate issue. On that side,
> > > > > > it seems silly to not allow driving the h/w in the most efficient way
> > > > > > possible.
> > > > >
> > > > > That's exactly what we are trying to say. The hardware allows us to
> > > > > write all 32 bits in parallel, without any hardware issues, why
> > > > > shouldn't we do that ? The delay (which Sudeep will find out, he is
> > > > > facing issues with hardware access because of lockdown right now)
> > > >
> > > > OK, I was able to access the setup today. I couldn't reach a point
> > > > where I can do measurements as the system just became unusable with
> > > > one physical channel instead of 2 virtual channels as in my patches.
> > > >
> > > > My test was simple. Switch to schedutil and read sensors periodically
> > > > via sysfs.
> > > >
> > > >  arm-scmi firmware:scmi: message for 1 is not expected!
> > > >
> > > This sounds like you are not serialising requests on a shared channel.
> > > Can you please also share the patch?
> >
> > OK, I did try with a small patch initially and then realised we must hit
> > issue with mainline as is. Tried and the behaviour is exact same. All
> > I did is removed my patches and use bit[0] as the signal. It doesn't
> > matter as writes to the register are now serialised. Oh, the above
> > message comes when OS times out in advance while firmware continues to
> > process the old request and respond.
> >
> > The trace I sent gives much better view of what's going on.
> >
> BTW, you didn't even share 'good' vs 'bad' log for me to actually see
> if the api lacks.
>
> Let us look closely ...
>
>  >>    bash-1019  [005]  1149.452340: scmi_xfer_begin:
> transfer_id=1537 msg_id=7 protocol_id=19 seq=0 poll=1
>  >>    bash-1019  [005]  1149.452407: scmi_xfer_end:
> transfer_id=1537 msg_id=7 protocol_id=19 seq=0 status=0
> >
> This round trip took  67usecs.  (log shows some at even 3usecs)
> That includes mailbox api overhead, memcpy and the time taken by
> remote to respond.

This is DVFS request which firmware acknowledges quickly and expected
to at most 100us.

> So the api is definitely capable of much faster transfers than you require.
>

I am not complaining about that. The delay is mostly due to the load on
the mailbox and parallelising helps is the focus here.

> >>     bash-1526  [000]  1149.472553: scmi_xfer_begin:      transfer_id=1538 msg_id=6 protocol_id=21 seq=0 poll=0
> >>      <idle>-0     [001]  1149.472733: scmi_xfer_begin:      transfer_id=1539 msg_id=7 protocol_id=19 seq=1 poll=1
> >
> Here another request is started before the first is finished.

Ah, the prints are when the client requested. It is not when the mailbox
started it. So this just indicates the beginning of the transfer from the
client. I must have mentioned that earlier. Some request timeout before
being picked up by mailbox if the previous request is not acknowledge
quickly. E.g. Say a sensor command started which may take upto 1ms,
almost 5-6 DVFS request after that will timeout.

> If you want this to work you have to serialise access to the single
> physical channel and/or run the remote firmware in asynchronous mode -
> that is, the firmware ack the bit as soon as it sees and starts
> working in the background, so that we return in ~3usec always, while
> the data returns whenever it is ready.

Yes it does that for few requests like DVFS while it uses synchronous
mode for few others. While ideally I agree everything in asynchronous
most is better, I don't know there may be reasons for such design. Also
the solution given is to use different bits as independent channels
which hardware allows. Anyways it's open source SCP project[1].

--
Regards,
Sudeep

[1] https://github.com/ARM-software/SCP-firmware
Jassi Brar June 5, 2020, 6:30 a.m. UTC | #23
On Thu, Jun 4, 2020 at 11:56 PM Sudeep Holla <sudeep.holla@arm.com> wrote:

>
> > >>     bash-1526  [000]  1149.472553: scmi_xfer_begin:      transfer_id=1538 msg_id=6 protocol_id=21 seq=0 poll=0
> > >>      <idle>-0     [001]  1149.472733: scmi_xfer_begin:      transfer_id=1539 msg_id=7 protocol_id=19 seq=1 poll=1
> > >
> > Here another request is started before the first is finished.
>
> Ah, the prints are when the client requested. It is not when the mailbox
> started it. So this just indicates the beginning of the transfer from the
> client.
>
There maybe condition on a sensor read to finish within 1ms, but there
is no condition for the read to _start_ at this very moment (usually
there are sleeps in the path to sensor requests).

> > If you want this to work you have to serialise access to the single
> > physical channel and/or run the remote firmware in asynchronous mode -
> > that is, the firmware ack the bit as soon as it sees and starts
> > working in the background, so that we return in ~3usec always, while
> > the data returns whenever it is ready.
>
> Yes it does that for few requests like DVFS while it uses synchronous
> mode for few others. While ideally I agree everything in asynchronous
> most is better, I don't know there may be reasons for such design.
>
The reason is that, if the firmware works asynchronously, every call
would return in ~3usec and you won't think you need virtual channels.

You have shared only 'bad' log without serialising access. Please
share log after serialising access to the channel and the 'good' log
with virtual channels.  That should put the topic to rest.

thanks.
Jassi Brar June 5, 2020, 3:42 p.m. UTC | #24
On Fri, Jun 5, 2020 at 3:58 AM Sudeep Holla <sudeep.holla@arm.com> wrote:
>
> > > > >>     bash-1526  [000]  1149.472553: scmi_xfer_begin:      transfer_id=1538 msg_id=6 protocol_id=21 seq=0 poll=0
> > > > >>      <idle>-0     [001]  1149.472733: scmi_xfer_begin:      transfer_id=1539 msg_id=7 protocol_id=19 seq=1 poll=1
> > > > >
> > > > Here another request is started before the first is finished.
> > >
> > > Ah, the prints are when the client requested. It is not when the mailbox
> > > started it. So this just indicates the beginning of the transfer from the
> > > client.
> > >
> > There maybe condition on a sensor read to finish within 1ms, but there
> > is no condition for the read to _start_ at this very moment (usually
> > there are sleeps in the path to sensor requests).
> >
>
> Again I wasn't clear. The trace logs are at the point just before calling
> mbox_send_messages. So any delay in sensor drivers won't get include. It
> is after the point sensor driver request to read the value and before we
> send the request via mailbox.
>
No, you were clear, I wasn't. Let me try again.

Since origin upto scmi_xfer, there can be many forms of sleep like
schedule/mutexlock etc.... think of some userspace triggering sensor
or dvfs operation. Linux does not provide real-time guarantees. Even
if remote (scmi) firmware guarantee RT response, it makes sense to
timeout a response only after the _request is on the bus_  and not
when you submit a request to the api (unless you serialise it).
IOW, start the timeout from  mbox_client.tx_prepare()  when the
message actually gets on the bus.


> > You have shared only 'bad' log without serialising access. Please
> > share log after serialising access to the channel and the 'good' log
> > with virtual channels.  That should put the topic to rest.
> >
>
> I didn't realise that, sorry for missing that earlier. Attached both
> now, thanks for asking.
>
Interesting logs !  The time taken to complete _successful_ requests
are arguably better in bad_trace ... there are many <10usec responses
in bad_trace, while the fastest response in good_trace is  53usec.
And the requests that 'fail/timeout' are purely the result of not
serialising them or checkout for timeout at wrong place as explained
above.

thanks.
Viresh Kumar June 10, 2020, 9:33 a.m. UTC | #25
On 05-06-20, 10:42, Jassi Brar wrote:
> Since origin upto scmi_xfer, there can be many forms of sleep like
> schedule/mutexlock etc.... think of some userspace triggering sensor
> or dvfs operation. Linux does not provide real-time guarantees. Even
> if remote (scmi) firmware guarantee RT response, it makes sense to
> timeout a response only after the _request is on the bus_  and not
> when you submit a request to the api (unless you serialise it).
> IOW, start the timeout from  mbox_client.tx_prepare()  when the
> message actually gets on the bus.

There are multiple purposes of the timeout IMO:

- Returning early if the other side is dead/hung, in such a case the
  timeout can be put when the request is put on the bus as we don't
  care of the time it takes to complete the request until the time the
  request can be fulfilled. This can be a example of i2c/spi memory
  read.

- Ensuring maximum time in which the request needs to be serviced.
  There may be hard requirements, like in case for DVFS from
  scheduler's hot path (which is essential for better working of the
  overall system). And for such a case the timeout is placed at the
  right place IMO, i.e. right after a request is submitted to mailbox.

And some more points I wanted to share..

- I am not sure I understood the *serializing* part you guys were
  talking about. I believe mailbox framework is already serializing
  the requests it is receiving on a single channel with a spin lock,
  right ? Why does the client need to serialize them as well? Is that
  for avoiding timeouts ?

- For me, and Sudeep as well IIUC, the bigger problem isn't that
  timeouts are happening and requests are failing (and so changing the
  timeout to a bigger value isn't going to fix anything), but the
  problem is that it is taking too long (because of the queue of
  requests on a channel) for a request to finish after being
  submitted. Scheduler doesn't care of the underneath logistics for
  example, all it cares for is the time it takes to change the
  frequency of a CPU. If you can do it fast enough in a guaranteed
  manner, then you can use fast switching, otherwise not.

- The hardware can very well support the case today where this can be
  done in parallel and (almost) in a guaranteed time-frame. While the
  software wants to add a limit to that and so wants to serialize
  requests.

- As many people have already suggested it (like me, Sudeep, Rob,
  maybe Bjorn as well), it seems silly to not allow driving the h/w in
  the most efficient way possible (and allow fast cpu switching in
  this case).
 
> Interesting logs !  The time taken to complete _successful_ requests
> are arguably better in bad_trace ... there are many <10usec responses
> in bad_trace, while the fastest response in good_trace is  53usec.

Indeed this is interesting. It may be worth looking (separately) into
why don't we see those 3 us long requests anymore, or maybe they were
just not there in the logs.

> And the requests that 'fail/timeout' are purely the result of not
> serialising them or checkout for timeout at wrong place as explained
> above.

We can't allow for the requests to go on for ever in some cases, while
in other cases it may be absolutely fine.
Sudeep Holla June 11, 2020, 10 a.m. UTC | #26
Hi Viresh,

Thanks for summarising the thoughts quite nicely.

On Wed, Jun 10, 2020 at 03:03:34PM +0530, Viresh Kumar wrote:
> On 05-06-20, 10:42, Jassi Brar wrote:
> > Since origin upto scmi_xfer, there can be many forms of sleep like
> > schedule/mutexlock etc.... think of some userspace triggering sensor
> > or dvfs operation. Linux does not provide real-time guarantees. Even
> > if remote (scmi) firmware guarantee RT response, it makes sense to
> > timeout a response only after the _request is on the bus_  and not
> > when you submit a request to the api (unless you serialise it).
> > IOW, start the timeout from  mbox_client.tx_prepare()  when the
> > message actually gets on the bus.
>
> There are multiple purposes of the timeout IMO:
>
> - Returning early if the other side is dead/hung, in such a case the
>   timeout can be put when the request is put on the bus as we don't
>   care of the time it takes to complete the request until the time the
>   request can be fulfilled. This can be a example of i2c/spi memory
>   read.
>
> - Ensuring maximum time in which the request needs to be serviced.
>   There may be hard requirements, like in case for DVFS from
>   scheduler's hot path (which is essential for better working of the
>   overall system). And for such a case the timeout is placed at the
>   right place IMO, i.e. right after a request is submitted to mailbox.
>

Agreed on both points.

> And some more points I wanted to share..
>
> - I am not sure I understood the *serializing* part you guys were
>   talking about. I believe mailbox framework is already serializing
>   the requests it is receiving on a single channel with a spin lock,
>   right ? Why does the client need to serialize them as well? Is that
>   for avoiding timeouts ?
>
> - For me, and Sudeep as well IIUC, the bigger problem isn't that
>   timeouts are happening and requests are failing (and so changing the
>   timeout to a bigger value isn't going to fix anything), but the
>   problem is that it is taking too long (because of the queue of
>   requests on a channel) for a request to finish after being
>   submitted. Scheduler doesn't care of the underneath logistics for
>   example, all it cares for is the time it takes to change the
>   frequency of a CPU. If you can do it fast enough in a guaranteed
>   manner, then you can use fast switching, otherwise not.
>
> - The hardware can very well support the case today where this can be
>   done in parallel and (almost) in a guaranteed time-frame. While the
>   software wants to add a limit to that and so wants to serialize
>   requests.
>

+1

> - As many people have already suggested it (like me, Sudeep, Rob,
>   maybe Bjorn as well), it seems silly to not allow driving the h/w in
>   the most efficient way possible (and allow fast cpu switching in
>   this case).
>
> > Interesting logs !  The time taken to complete _successful_ requests
> > are arguably better in bad_trace ... there are many <10usec responses
> > in bad_trace, while the fastest response in good_trace is  53usec.
>
> Indeed this is interesting. It may be worth looking (separately) into
> why don't we see those 3 us long requests anymore, or maybe they were
> just not there in the logs.
>

As I mentioned in another thread that non-dvfs requests may be prioritised
lower when there are parallel request to the remote. The so called bad
trace doesn't have such scenario with single channel and all requests
from OS being serialised. The good trace has 2 channels and requests to
remote happen in parallel and hence it is fair to see slightly higher
latencies for lower priority requests.

--
Regards,
Sudeep
Jassi Brar June 12, 2020, 12:34 a.m. UTC | #27
On Thu, Jun 11, 2020 at 5:00 AM Sudeep Holla <sudeep.holla@arm.com> wrote:

> >
> > > Interesting logs !  The time taken to complete _successful_ requests
> > > are arguably better in bad_trace ... there are many <10usec responses
> > > in bad_trace, while the fastest response in good_trace is  53usec.
> >
> > Indeed this is interesting. It may be worth looking (separately) into
> > why don't we see those 3 us long requests anymore, or maybe they were
> > just not there in the logs.
> >
>
> As I mentioned in another thread that non-dvfs requests may be prioritised
> lower when there are parallel request to the remote. The so called bad
> trace doesn't have such scenario with single channel and all requests
> from OS being serialised. The good trace has 2 channels and requests to
> remote happen in parallel and hence it is fair to see slightly higher
> latencies for lower priority requests.
>
In the first post in this thread, Viresh lamented that mailbox
introduces "a few ms" delay in the scheduler path.
Your own tests show that is certainly not the case -- average is the
same as proposed virtual channels 50-100us, the best case is 3us vs
53us for virtual channels.

-#define SCMI_MAX_POLL_TO_NS (100 * NSEC_PER_USEC)
+#define SCMI_MAX_POLL_TO_NS (30 * 1000 * NSEC_PER_USEC)

The above simple fix (not a hack or workaround) avoids the need of
virtual channels' implementation, as per tests you conducted.

It might have been silly to not implement virtual channels originally,
but it would be just as silly now to implement if we can reuse the
code.
So I welcome new tests.

thanks.
Viresh Kumar June 12, 2020, 5:28 a.m. UTC | #28
On 11-06-20, 19:34, Jassi Brar wrote:
> In the first post in this thread, Viresh lamented that mailbox
> introduces "a few ms" delay in the scheduler path.
> Your own tests show that is certainly not the case -- average is the
> same as proposed virtual channels 50-100us, the best case is 3us vs
> 53us for virtual channels.

Hmmm, I am not sure where is the confusion here Jassi. There are two
things which are very very different from each other.

- Time taken by the mailbox framework (and remote for acknowledging
  it) for completion of a single request, this can be 3us to 100s of
  us. This is clear for everyone. THIS IS NOT THE PROBLEM.

- Delay introduced by few of such requests on the last one, i.e. 5
  normal requests followed by an important one (like DVFS), the last
  one needs to wait for the first 5 to finish first. THIS IS THE
  PROBLEM.

Just increasing the timeout isn't going to solve anything as I said in
the last email, we can make it 5 minutes for what's its worth. The
idea is to make the turn-around-time less for all the requests..

From Google (I know you must already know it, I am just trying to
highlight the importance of this thing here):

Turnaround time (TAT) is the time interval from the time of submission
of a process (read request) to the time of the completion of the
process.

This is what people care about, that is the whole reason kernel has
multi-processing support in the first place. If making things
sequential was good enough, we would have never reached here. The
whole idea is to parallelize things as much as possible without
hurting efficiency in a bad way (like too much parallelism). The
hardware allows parallelism and there is absolutely no point in not
allowing that. The kernel doesn't need to worry about how the remote
is going to handle it. Remote may be simple and handle it sequentially
or it may be running Linux itself and can schedule multiple threads
for requests.
Arnd Bergmann Sept. 8, 2020, 9:14 a.m. UTC | #29
Picking up the old thread again after and getting pinged by multiple
colleagues about it (thanks!) reading through the history.

On Fri, Jun 12, 2020 at 7:29 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 11-06-20, 19:34, Jassi Brar wrote:
> > In the first post in this thread, Viresh lamented that mailbox
> > introduces "a few ms" delay in the scheduler path.
> > Your own tests show that is certainly not the case -- average is the
> > same as proposed virtual channels 50-100us, the best case is 3us vs
> > 53us for virtual channels.
>
> Hmmm, I am not sure where is the confusion here Jassi. There are two
> things which are very very different from each other.
>
> - Time taken by the mailbox framework (and remote for acknowledging
>   it) for completion of a single request, this can be 3us to 100s of
>   us. This is clear for everyone. THIS IS NOT THE PROBLEM.
>
> - Delay introduced by few of such requests on the last one, i.e. 5
>   normal requests followed by an important one (like DVFS), the last
>   one needs to wait for the first 5 to finish first. THIS IS THE
>   PROBLEM.

Earlier, Jassi also commented "Linux does not provide real-time
guarantees", which to me is what actually causes the issue here:

Linux having timeouts when communicating to the firmware means
that it relies on the hardware and firmware having real-time behavior
even when not providing real-time guarantees to its processes.

When comparing the two usage models, it's clear that the minimum
latency for a message delivery is always at least the time time
to process an interrupt, plus at least one expensive MMIO read
and one less expensive posted MMIO write for an Ack. If we
have a doorbell plus out-of-band message, we need an extra
DMA barrier and a read from coherent memory, both of which can
be noticeable. As soon as messages are queued in the current
model, the maximum latency increases by a potentially unbounded
number of round-trips, while in the doorbell model that problem
does not exist, so I agree that we need to handle both modes
in the kernel deal with all existing hardware as well as firmware
that requires low-latency communication.

It also sounds like that debate is already settled because there
are platforms using both modes, and in the kernel we usually
end up supporting the platforms that our users have, whether
we think it's a good idea or not.

The only questions that I see in need of being answered are:

1. Should the binding use just different "#mbox-cells" values or
   also different "compatible" strings to tell that difference?
2. Should one driver try to handle both modes or should there
   be two drivers?

It sounds like Jassi strongly prefers separate drivers, which
would make separate compatible strings the more practical
approach. While the argument can be made that a single
piece of hardware should only have one DT description,
the counter-argument would be that the behavior described
by the DT here is made up by both the hardware and the
firmware behind it, and they are in fact different.

       Arnd
Viresh Kumar Sept. 8, 2020, 9:27 a.m. UTC | #30
On 08-09-20, 11:14, Arnd Bergmann wrote:
> Picking up the old thread again after and getting pinged by multiple
> colleagues about it (thanks!) reading through the history.

Thanks for your inputs Arnd.

> Earlier, Jassi also commented "Linux does not provide real-time
> guarantees", which to me is what actually causes the issue here:
> 
> Linux having timeouts when communicating to the firmware means
> that it relies on the hardware and firmware having real-time behavior
> even when not providing real-time guarantees to its processes.
> 
> When comparing the two usage models, it's clear that the minimum
> latency for a message delivery is always at least the time time
> to process an interrupt, plus at least one expensive MMIO read
> and one less expensive posted MMIO write for an Ack. If we
> have a doorbell plus out-of-band message, we need an extra
> DMA barrier and a read from coherent memory, both of which can
> be noticeable. As soon as messages are queued in the current
> model, the maximum latency increases by a potentially unbounded
> number of round-trips, while in the doorbell model that problem
> does not exist, so I agree that we need to handle both modes
> in the kernel deal with all existing hardware as well as firmware
> that requires low-latency communication.

Right.

> It also sounds like that debate is already settled because there
> are platforms using both modes, and in the kernel we usually
> end up supporting the platforms that our users have, whether
> we think it's a good idea or not.
> 
> The only questions that I see in need of being answered are:
> 
> 1. Should the binding use just different "#mbox-cells" values or
>    also different "compatible" strings to tell that difference?
> 2. Should one driver try to handle both modes or should there
>    be two drivers?
> 
> It sounds like Jassi strongly prefers separate drivers, which
> would make separate compatible strings the more practical
> approach. While the argument can be made that a single
> piece of hardware should only have one DT description,
> the counter-argument would be that the behavior described
> by the DT here is made up by both the hardware and the
> firmware behind it, and they are in fact different.

I would be fine with both the ideas and that isn't a blocker for me.
Though I still feel this is exactly why we have #mbox-cells here and
that should be enough in this case, even if we opt for multiple
drivers.

But whatever everyone agrees to will be fine.
Sudeep Holla Sept. 8, 2020, 1:26 p.m. UTC | #31
On Tue, Sep 08, 2020 at 11:14:50AM +0200, Arnd Bergmann wrote:
> Picking up the old thread again after and getting pinged by multiple
> colleagues about it (thanks!) reading through the history.
> 
> On Fri, Jun 12, 2020 at 7:29 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> >
> > On 11-06-20, 19:34, Jassi Brar wrote:
> > > In the first post in this thread, Viresh lamented that mailbox
> > > introduces "a few ms" delay in the scheduler path.
> > > Your own tests show that is certainly not the case -- average is the
> > > same as proposed virtual channels 50-100us, the best case is 3us vs
> > > 53us for virtual channels.
> >
> > Hmmm, I am not sure where is the confusion here Jassi. There are two
> > things which are very very different from each other.
> >
> > - Time taken by the mailbox framework (and remote for acknowledging
> >   it) for completion of a single request, this can be 3us to 100s of
> >   us. This is clear for everyone. THIS IS NOT THE PROBLEM.
> >
> > - Delay introduced by few of such requests on the last one, i.e. 5
> >   normal requests followed by an important one (like DVFS), the last
> >   one needs to wait for the first 5 to finish first. THIS IS THE
> >   PROBLEM.
> 
> Earlier, Jassi also commented "Linux does not provide real-time
> guarantees", which to me is what actually causes the issue here:
> 
> Linux having timeouts when communicating to the firmware means
> that it relies on the hardware and firmware having real-time behavior
> even when not providing real-time guarantees to its processes.
> 
> When comparing the two usage models, it's clear that the minimum
> latency for a message delivery is always at least the time time
> to process an interrupt, plus at least one expensive MMIO read
> and one less expensive posted MMIO write for an Ack. If we
> have a doorbell plus out-of-band message, we need an extra
> DMA barrier and a read from coherent memory, both of which can
> be noticeable. As soon as messages are queued in the current
> model, the maximum latency increases by a potentially unbounded
> number of round-trips, while in the doorbell model that problem
> does not exist, so I agree that we need to handle both modes
> in the kernel deal with all existing hardware as well as firmware
> that requires low-latency communication.
> 
> It also sounds like that debate is already settled because there
> are platforms using both modes, and in the kernel we usually
> end up supporting the platforms that our users have, whether
> we think it's a good idea or not.
>

Thanks for the nice summary of the discussion so far.

> The only questions that I see in need of being answered are:
> 
> 1. Should the binding use just different "#mbox-cells" values or
>    also different "compatible" strings to tell that difference?

I initially proposed latter, but Rob preferred the former which
makes sense for the reasons you have mentioned below.

> 2. Should one driver try to handle both modes or should there
>    be two drivers?
>
> It sounds like Jassi strongly prefers separate drivers, which
> would make separate compatible strings the more practical
> approach.

Indeed.

> While the argument can be made that a single
> piece of hardware should only have one DT description,
> the counter-argument would be that the behavior described
> by the DT here is made up by both the hardware and the
> firmware behind it, and they are in fact different.
>

I am too fine either way.

--
Regards,
Sudeep
Jassi Brar Sept. 9, 2020, 3:23 a.m. UTC | #32
On Tue, Sep 8, 2020 at 4:15 AM Arnd Bergmann <arnd@arndb.de> wrote:
>
> Picking up the old thread again after and getting pinged by multiple
> colleagues about it (thanks!) reading through the history.
>
> On Fri, Jun 12, 2020 at 7:29 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> >
> > On 11-06-20, 19:34, Jassi Brar wrote:
> > > In the first post in this thread, Viresh lamented that mailbox
> > > introduces "a few ms" delay in the scheduler path.
> > > Your own tests show that is certainly not the case -- average is the
> > > same as proposed virtual channels 50-100us, the best case is 3us vs
> > > 53us for virtual channels.
> >
> > Hmmm, I am not sure where is the confusion here Jassi. There are two
> > things which are very very different from each other.
> >
> > - Time taken by the mailbox framework (and remote for acknowledging
> >   it) for completion of a single request, this can be 3us to 100s of
> >   us. This is clear for everyone. THIS IS NOT THE PROBLEM.
> >
> > - Delay introduced by few of such requests on the last one, i.e. 5
> >   normal requests followed by an important one (like DVFS), the last
> >   one needs to wait for the first 5 to finish first. THIS IS THE
> >   PROBLEM.
>
> Earlier, Jassi also commented "Linux does not provide real-time
> guarantees", which to me is what actually causes the issue here:
>
> Linux having timeouts when communicating to the firmware means
> that it relies on the hardware and firmware having real-time behavior
> even when not providing real-time guarantees to its processes.
>
The timeout used in SCMI is simply based on how long the Juno (?)
platform takes to reply in most cases.
Talking proper code-design, the timeout (if at all) shouldn't even be
a hardcoded value, but instead taken from the platform.

> When comparing the two usage models, it's clear that the minimum
> latency for a message delivery is always at least the time time
> to process an interrupt, plus at least one expensive MMIO read
> and one less expensive posted MMIO write for an Ack. If we
> have a doorbell plus out-of-band message, we need an extra
> DMA barrier and a read from coherent memory, both of which can
> be noticeable. As soon as messages are queued in the current
> model, the maximum latency increases by a potentially unbounded
> number of round-trips, while in the doorbell model that problem
> does not exist, so I agree that we need to handle both modes
> in the kernel deal with all existing hardware as well as firmware
> that requires low-latency communication.
>
From the test case Sudeep last shared, the scmi usage on mhu doesn't
not even hit any bottleneck ... the test "failed" because of the too
small hardcoded timeout value. Otherwise the current code actually
shows better numbers.
We need some synthetic tests to bring the limitation to the surface. I
agree that there may be such a test case, however fictitious. For that
reason, I am ok with the doorbell mode.

> The only questions that I see in need of being answered are:
>
> 1. Should the binding use just different "#mbox-cells" values or
>    also different "compatible" strings to tell that difference?
> 2. Should one driver try to handle both modes or should there
>    be two drivers?
>
> It sounds like Jassi strongly prefers separate drivers, which
> would make separate compatible strings the more practical
> approach. While the argument can be made that a single
> piece of hardware should only have one DT description,
> the counter-argument would be that the behavior described
> by the DT here is made up by both the hardware and the
> firmware behind it, and they are in fact different.
>
I totally agree with one compat-string for one hardware. However, as
you said, unlike other device classes, the mailbox driver runs the
sumtotal of hardware and the remote firmware behaviour. Also the
implementations wouldn't share much, so I think a separate file+dt
will be better.  But I wanna get rid of this toothache that flares up
every season, so whatever.

Cheers!
Viresh Kumar Sept. 9, 2020, 4:46 a.m. UTC | #33
On 08-09-20, 22:23, Jassi Brar wrote:
> From the test case Sudeep last shared, the scmi usage on mhu doesn't
> not even hit any bottleneck ... the test "failed" because of the too
> small hardcoded timeout value. Otherwise the current code actually
> shows better numbers.

Its not important on why the test failed there, but the fact that
there were requests in queue which have to be completed one by one and
the last ones in the queue will always pay the penalty.

> We need some synthetic tests to bring the limitation to the surface. I
> agree that there may be such a test case, however fictitious. For that
> reason, I am ok with the doorbell mode.
> 
> I totally agree with one compat-string for one hardware. However, as
> you said, unlike other device classes, the mailbox driver runs the
> sumtotal of hardware and the remote firmware behaviour. Also the
> implementations wouldn't share much, so I think a separate file+dt
> will be better.

> But I wanna get rid of this toothache that flares up
> every season, so whatever.

I can't agree more :)

So to conclude the thread, if I have understood correctly, we are
going to implement another doorbell driver for this hardware which
will use a different compatible string and #mbox-cells value.

I will try to refresh the bindings soon, which will be followed by the
driver implementation.

Thanks everyone.
Sudeep Holla Sept. 9, 2020, 9:31 a.m. UTC | #34
On Tue, Sep 08, 2020 at 10:23:33PM -0500, Jassi Brar wrote:
> On Tue, Sep 8, 2020 at 4:15 AM Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > Picking up the old thread again after and getting pinged by multiple
> > colleagues about it (thanks!) reading through the history.
> >
> > On Fri, Jun 12, 2020 at 7:29 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> > >
> > > On 11-06-20, 19:34, Jassi Brar wrote:
> > > > In the first post in this thread, Viresh lamented that mailbox
> > > > introduces "a few ms" delay in the scheduler path.
> > > > Your own tests show that is certainly not the case -- average is the
> > > > same as proposed virtual channels 50-100us, the best case is 3us vs
> > > > 53us for virtual channels.
> > >
> > > Hmmm, I am not sure where is the confusion here Jassi. There are two
> > > things which are very very different from each other.
> > >
> > > - Time taken by the mailbox framework (and remote for acknowledging
> > >   it) for completion of a single request, this can be 3us to 100s of
> > >   us. This is clear for everyone. THIS IS NOT THE PROBLEM.
> > >
> > > - Delay introduced by few of such requests on the last one, i.e. 5
> > >   normal requests followed by an important one (like DVFS), the last
> > >   one needs to wait for the first 5 to finish first. THIS IS THE
> > >   PROBLEM.
> >
> > Earlier, Jassi also commented "Linux does not provide real-time
> > guarantees", which to me is what actually causes the issue here:
> >
> > Linux having timeouts when communicating to the firmware means
> > that it relies on the hardware and firmware having real-time behavior
> > even when not providing real-time guarantees to its processes.
> >
> The timeout used in SCMI is simply based on how long the Juno (?)
> platform takes to reply in most cases.

Just FYI, the timeouts in SCMI can be platform specific. So each platform
have flexibility to choose it's own choice of timeout and need not be
stuck with so called *Juno values".

The architects of SCMI believe the transfers(especially DVFS) must not exceed
few 100s of uS and worst case for any transfers must be few ms. My initial
choice of 30ms was based on the jiffes based timer and 100Hz. Architect claim
that is too much, but I thought 3 jiffies at minimum in case we start timer
when we are close to the boundaries.

> Talking proper code-design, the timeout (if at all) shouldn't even be
> a hardcoded value, but instead taken from the platform.
> 
> > When comparing the two usage models, it's clear that the minimum
> > latency for a message delivery is always at least the time time
> > to process an interrupt, plus at least one expensive MMIO read
> > and one less expensive posted MMIO write for an Ack. If we
> > have a doorbell plus out-of-band message, we need an extra
> > DMA barrier and a read from coherent memory, both of which can
> > be noticeable. As soon as messages are queued in the current
> > model, the maximum latency increases by a potentially unbounded
> > number of round-trips, while in the doorbell model that problem
> > does not exist, so I agree that we need to handle both modes
> > in the kernel deal with all existing hardware as well as firmware
> > that requires low-latency communication.
> >
> From the test case Sudeep last shared, the scmi usage on mhu doesn't
> not even hit any bottleneck ... the test "failed" because of the too
> small hardcoded timeout value. Otherwise the current code actually
> shows better numbers.

No disagreement on the latter part. But we can't ignore the bottlenecks
even if they are rare.

> We need some synthetic tests to bring the limitation to the surface. I
> agree that there may be such a test case, however fictitious. For that
> reason, I am ok with the doorbell mode.
>

Thanks !
diff mbox series

Patch

diff --git a/Documentation/devicetree/bindings/mailbox/arm-mhu.txt b/Documentation/devicetree/bindings/mailbox/arm-mhu.txt
index 4971f03f0b33..ba659bcc7109 100644
--- a/Documentation/devicetree/bindings/mailbox/arm-mhu.txt
+++ b/Documentation/devicetree/bindings/mailbox/arm-mhu.txt
@@ -10,6 +10,15 @@  STAT register and the remote clears it after having read the data.
 The last channel is specified to be a 'Secure' resource, hence can't be
 used by Linux running NS.
 
+The MHU drives the interrupt signal using a 32-bit register, with all
+32-bits logically ORed together. It provides a set of registers to
+enable software to set, clear and check the status of each of the bits
+of this register independently. The use of 32 bits per interrupt line
+enables software to provide more information about the source of the
+interrupt. For example, each bit of the register can be associated with
+a type of event that can contribute to raising the interrupt. Each of
+the 32-bits can be used as "doorbell" to alert the remote processor.
+
 Mailbox Device Node:
 ====================
 
@@ -18,13 +27,21 @@  used by Linux running NS.
 - compatible:		Shall be "arm,mhu" & "arm,primecell"
 - reg:			Contains the mailbox register address range (base
 			address and length)
-- #mbox-cells		Shall be 1 - the index of the channel needed.
+- #mbox-cells		Shall be 1 - the index of the channel needed when
+			not used as set of doorbell bits.
+			Shall be 2 - the index of the channel needed, and
+			the index of the doorbell bit within the channel
+			when used in doorbell mode.
 - interrupts:		Contains the interrupt information corresponding to
-			each of the 3 links of MHU.
+			each of the 3 physical channels of MHU namely low
+			priority non-secure, high priority non-secure and
+			secure channels.
 
 Example:
 --------
 
+1. Controller which doesn't support doorbells
+
 	mhu: mailbox@2b1f0000 {
 		#mbox-cells = <1>;
 		compatible = "arm,mhu", "arm,primecell";
@@ -41,3 +58,21 @@  used by Linux running NS.
 		reg = <0 0x2e000000 0x4000>;
 		mboxes = <&mhu 1>; /* HP-NonSecure */
 	};
+
+2. Controller which supports doorbells
+
+	mhu: mailbox@2b1f0000 {
+		#mbox-cells = <2>;
+		compatible = "arm,mhu", "arm,primecell";
+		reg = <0 0x2b1f0000 0x1000>;
+		interrupts = <0 36 4>, /* LP-NonSecure */
+			     <0 35 4>; /* HP-NonSecure */
+		clocks = <&clock 0 2 1>;
+		clock-names = "apb_pclk";
+	};
+
+	mhu_client: scb@2e000000 {
+		compatible = "arm,scpi";
+		reg = <0 0x2e000000 0x200>;
+		mboxes = <&mhu 1 4>; /* HP-NonSecure 5th doorbell bit */
+	};