[RFC,00/10] Add Fujitsu A64FX soc entry/hardware barrier driver

Message ID 20210108105241.1757799-1-misono.tomohiro@jp.fujitsu.com (mailing list archive)

Message

Misono Tomohiro Jan. 8, 2021, 10:52 a.m. UTC
(Resending as the cover letter title was missing the first time. Sorry for the noise.)

Hello,

This series adds Fujitsu A64FX SoC entry in drivers/soc and hardware
barrier driver for it.

[Driver Description]
 The A64FX CPU has several features for HPC workloads, and the hardware
 barrier is one of them. It is a mechanism that realizes fast synchronization
 among PEs belonging to the same L3 cache domain, using implementation-defined
 hardware registers.
 For more details, see A64FX HPC extension specification in
 https://github.com/fujitsu/A64FX
 
 The driver mainly offers a set of ioctls to manipulate the related registers.
 Patches 1-9 implement the driver code, and patch 10 adds the Kconfig,
 Makefile and MAINTAINERS entry for the driver.

 Also, a C library and test program for this driver are available at:
 https://github.com/fujitsu/hardware_barrier
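
 For illustration, the intended usage from an application looks roughly like
 the sketch below. Note this is only a sketch: the device node path, the
 argument struct and the ioctl numbers are placeholders, not the actual
 definitions from include/uapi/linux/fujitsu_hpc_ioctl.h.

/* Illustrative sketch only: the device node, struct layout and ioctl numbers
 * below are placeholders, not the real UAPI. */
#include <fcntl.h>
#include <linux/ioctl.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct hwb_alloc_req {                  /* placeholder argument layout */
    unsigned long cpu_mask;             /* PEs in one L3 domain taking part */
    unsigned int  bb_id;                /* barrier blade id filled in by driver */
};

#define FUJITSU_HWB_DEV   "/dev/fujitsu_hwb"                    /* assumed */
#define HWB_IOC_BB_ALLOC  _IOWR('F', 0, struct hwb_alloc_req)   /* placeholder */
#define HWB_IOC_BB_FREE   _IOW('F', 1, struct hwb_alloc_req)    /* placeholder */

int main(void)
{
    struct hwb_alloc_req req = { .cpu_mask = 0xfff };   /* e.g. 12 PEs */
    int fd = open(FUJITSU_HWB_DEV, O_RDWR);

    if (fd < 0)
        return 1;
    if (ioctl(fd, HWB_IOC_BB_ALLOC, &req) == 0) {
        /* Each pinned thread would then issue IOC_BW_ASSIGN, synchronize
         * via the imp-def barrier registers from EL0, and IOC_BW_UNASSIGN
         * when done. */
        ioctl(fd, HWB_IOC_BB_FREE, &req);
    }
    close(fd);
    return 0;
}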

 The driver is based on v5.11-rc2 and was tested on an FX700 environment.

[RFC]
 This is the first time we have upstreamed drivers for our chip, and I want to
 confirm the driver location and the patch submission process.

 Based on my observation, the drivers/soc folder seems to be the right place
 for this driver, so I added a Kconfig entry to the arm64 platform config,
 created the soc/fujitsu folder, and updated the MAINTAINERS entry accordingly
 (last patch). Is this right?

 Also, for the final submission I think I need to 1) create a public git
 tree to push the driver code to (GitHub or similar), and 2) send a pull
 request to the SoC team (soc@kernel.org). Is that the correct procedure?

 I would appreciate any help/comments.

Side note: we plan to post other drivers for the A64FX HPC extension
(prefetch control and cache control) soon as well.

Misono Tomohiro (10):
  soc: fujitsu: hwb: Add hardware barrier driver init/exit code
  soc: fujitsu: hwb: Add open operation
  soc: fujitsu: hwb: Add IOC_BB_ALLOC ioctl
  soc: fujitsu: hwb: Add IOC_BW_ASSIGN ioctl
  soc: fujitsu: hwb: Add IOC_BW_UNASSIGN ioctl
  soc: fujitsu: hwb: Add IOC_BB_FREE ioctl
  soc: fujitsu: hwb: Add IOC_GET_PE_INFO ioctl
  soc: fujitsu: hwb: Add release operation
  soc: fujitsu: hwb: Add sysfs entry
  soc: fujitsu: hwb: Add Kconfig/Makefile to build fujitsu_hwb driver

 MAINTAINERS                            |    7 +
 arch/arm64/Kconfig.platforms           |    5 +
 drivers/soc/Kconfig                    |    1 +
 drivers/soc/Makefile                   |    1 +
 drivers/soc/fujitsu/Kconfig            |   24 +
 drivers/soc/fujitsu/Makefile           |    2 +
 drivers/soc/fujitsu/fujitsu_hwb.c      | 1253 ++++++++++++++++++++++++
 include/uapi/linux/fujitsu_hpc_ioctl.h |   41 +
 8 files changed, 1334 insertions(+)
 create mode 100644 drivers/soc/fujitsu/Kconfig
 create mode 100644 drivers/soc/fujitsu/Makefile
 create mode 100644 drivers/soc/fujitsu/fujitsu_hwb.c
 create mode 100644 include/uapi/linux/fujitsu_hpc_ioctl.h

Comments

Mark Rutland Jan. 8, 2021, 12:54 p.m. UTC | #1
On Fri, Jan 08, 2021 at 07:52:31PM +0900, Misono Tomohiro wrote:
> (Resend as cover letter title was missing in the first time. Sorry for noise)
> 
> Hello,

Hi,

> This series adds Fujitsu A64FX SoC entry in drivers/soc and hardware
> barrier driver for it.
> 
> [Driver Description]
>  A64FX CPU has several functions for HPC workload and hardware barrier
>  is one of them. It is a mechanism to realize fast synchronization by
>  PEs belonging to the same L3 cache domain by using implementation
>  defined hardware registers.
>  For more details, see A64FX HPC extension specification in
>  https://github.com/fujitsu/A64FX
>  
>  The driver mainly offers a set of ioctls to manipulate related registers.
>  Patch 1-9 implements driver code and patch 10 finally adds kconfig,
>  Makefile and MAINTAINER entry for the driver.  

I have a number of concerns here, and at a high level, I do not think
that this is something Linux can reasonably support in its current form.
Sorry if this comes across as harsh; I appreciate the work that has gone
into this, and the effort to try to upstream support is great -- my
concerns are with the overall picture.

As a general rule, we avoid the use of IMPLEMENTATION DEFINED features
in Linux, as they pose a number of correctness/safety challenges and
come with a potentially significant long-term maintenance burden that is
generally not justified by the features themselves. For example, such
features are not usable under virtualization (where a hypervisor may set
HCR_EL2.TIDCP, or fail to context-switch state that it is unaware of).

Secondly, the intended usage model appears to expose this to EL0 for
direct access, and the code seems to depend on threads being pinned, but
AFAICT this is not enforced and there is no provision for
context-switch, thread migration, or interaction with ptrace. I fear
this is going to be very fragile in practice, and that extending that
support in future will require much more complexity than is currently
apparent, with potentially invasive changes to arch code.

Thirdly, this requires userspace software to be intimately familiar with
the HW platform that it is running on (both in terms of using IMP-DEF
instructions and needing to know the physical layout), rather than being
generic and portable, which I don't believe is something that we wish to
encourage.  I also think this is unlikely to be supported by generic
software because of the lack of portability, and consequently I struggle
to believe that this will see significant usage.

Further, as an IMP-DEF feature, it's not clear how much of this will
carry forward into future designs, and where things may change. It's
extremely difficult to determine whether any of the ABI decisions (e.g.
the sysfs layout) are sufficient, or what level of changes would be
necessary in userspace code if there are physical topology changes or
changes to the structure of the system register interfaces.

Overall, I think this needs much more justification in terms of
practical usage, safety/correctness, and long term maintenance, and with
that I think the longer term goal would be to use this to justify an
architectural feature along similar lines rather than to support any
IMPLEMENTATION DEFINED variants upstream in Linux.

>  Also, C library and test program for this driver is available on: 
>  https://github.com/fujitsu/hardware_barrier

Hmm... I see some code in that repo which looks suspiciously like code
from the Linux kernel tree, but licensed differently, which is
concerning.

Specifically, the asm block in internal.h (which the SPDX header says
is licensed as LGPL-3.0-only) looks like a copy of code from
arch/arm64/include/asm/sysreg.h (which is licensed as GPL-2.0-only per
its SPDX header).

If that code was copied, I don't believe that relicensing is permitted.
I would advise that someone with legal training considers the provenance
of that code and what is permitted.

Thanks,
Mark.

>  The driver is based on v5.11-rc2 and tested on FX700 environment.
> 
> [RFC]
>  This is the first time we upstream drivers for our chip and I want to
>  confirm driver location and patch submission process.
> 
>  Based on my observation it seems drivers/soc folder is right place to put
>  this driver, so I added Kconfig entry for arm64 platform config, created
>  soc/fujitsu folder and updated MAINTAINER entry accordingly (last patch).
>  Is it right?
> 
>  Also for final submission I think I need to 1) create some public git
>  tree to push driver code (github or something), 2) make pull request to
>  SOC team (soc@kernel.org). Is it a correct procedure?
> 
>  I will appreciate any help/comments.
> 
> sidenote: We plan to post other drivers for A64FX HPC extension
> (prefetch control and cache control) too anytime soon.
> 
> Misono Tomohiro (10):
>   soc: fujitsu: hwb: Add hardware barrier driver init/exit code
>   soc: fujtisu: hwb: Add open operation
>   soc: fujitsu: hwb: Add IOC_BB_ALLOC ioctl
>   soc: fujitsu: hwb: Add IOC_BW_ASSIGN ioctl
>   soc: fujitsu: hwb: Add IOC_BW_UNASSIGN ioctl
>   soc: fujitsu: hwb: Add IOC_BB_FREE ioctl
>   soc: fujitsu: hwb: Add IOC_GET_PE_INFO ioctl
>   soc: fujitsu: hwb: Add release operation
>   soc: fujitsu: hwb: Add sysfs entry
>   soc: fujitsu: hwb: Add Kconfig/Makefile to build fujitsu_hwb driver
> 
>  MAINTAINERS                            |    7 +
>  arch/arm64/Kconfig.platforms           |    5 +
>  drivers/soc/Kconfig                    |    1 +
>  drivers/soc/Makefile                   |    1 +
>  drivers/soc/fujitsu/Kconfig            |   24 +
>  drivers/soc/fujitsu/Makefile           |    2 +
>  drivers/soc/fujitsu/fujitsu_hwb.c      | 1253 ++++++++++++++++++++++++
>  include/uapi/linux/fujitsu_hpc_ioctl.h |   41 +
>  8 files changed, 1334 insertions(+)
>  create mode 100644 drivers/soc/fujitsu/Kconfig
>  create mode 100644 drivers/soc/fujitsu/Makefile
>  create mode 100644 drivers/soc/fujitsu/fujitsu_hwb.c
>  create mode 100644 include/uapi/linux/fujitsu_hpc_ioctl.h
> 
> -- 
> 2.26.2
> 
Arnd Bergmann Jan. 8, 2021, 2:23 p.m. UTC | #2
On Fri, Jan 8, 2021 at 1:54 PM Mark Rutland <mark.rutland@arm.com> wrote:
> On Fri, Jan 08, 2021 at 07:52:31PM +0900, Misono Tomohiro wrote:
> > (Resend as cover letter title was missing in the first time. Sorry for noise)
> >
> > This series adds Fujitsu A64FX SoC entry in drivers/soc and hardware
> > barrier driver for it.
> >
> > [Driver Description]
> >  A64FX CPU has several functions for HPC workload and hardware barrier
> >  is one of them. It is a mechanism to realize fast synchronization by
> >  PEs belonging to the same L3 cache domain by using implementation
> >  defined hardware registers.
> >  For more details, see A64FX HPC extension specification in
> >  https://github.com/fujitsu/A64FX
> >
> >  The driver mainly offers a set of ioctls to manipulate related registers.
> >  Patch 1-9 implements driver code and patch 10 finally adds kconfig,
> >  Makefile and MAINTAINER entry for the driver.
>
> I have a number of concerns here, and at a high level, I do not think
> that this is something Linux can reasonably support in its current form.
> Sorry if this comes across as harsh; I appreciate the work that has gone
> into this, and the effort to try to upstream support is great -- my
> concerns are with the overal picture.
>
> As a general rule, we avoid the use of IMPLEMENTATION DEFINED features
> in Linux, as they pose a number of correctness/safety challenges and
> come with a potentially significan long term maintenance burden that is
> generally not justified by the features themselves. For example, such
> features are not usable under virtualization (where a hypervisor may set
> HCR_EL2.TIDCP, or fail to context-switch state that it is unaware of).

I am somewhat less concerned about the feature being implementation
defined than I am about adding a custom user interface for one
platform.

In the end, anything outside of the CPU core that ends up in a SoC
is implementation defined, and this is usually not a problem as long
as we have an abstraction in the kernel that hides the details from
the user, and the system is still functional if the implementation is
turned off for whatever reason.

> Secondly, the intended usage model appears to expose this to EL0 for
> direct access, and the code seems to depend on threads being pinned, but
> AFAICT this is not enforced and there is no provision for
> context-switch, thread migration, or interaction with ptrace. I fear
> this is going to be very fragile in practice, and that extending that
> support in future will require much more complexity than is currently
> apparent, with potentially invasive changes to arch code.

Right, this is the main problem I see, too. I had not even realized
that this will have to tie in with user space threads in some form, but
you are right that once this has to interact with the CPU scheduler,
it all breaks down.

One way I can imagine this working out is to tie into the cpuset
mechanism that is used for isolating threads to CPU cores, and
then provide a cpuset interface that has the desired behavior
but that can fall back to a generic implementation with the same
or stronger (but normally slower) semantics.

> Thirdly, this requires userspace software to be intimately familiar with
> the HW platform that it is running on (both in terms of using IMP-DEF
> instructions and needing to know the physical layout), rather than being
> generic and portable, which I don't believe is something that we wish to
> encourage.  I also think this is unlikely to be supported by generic
> software because of the lack of portability, and consequently I struggle
> to beleive that this will see significant usage.

Agreed as well.

        Arnd
Mark Rutland Jan. 8, 2021, 3:51 p.m. UTC | #3
On Fri, Jan 08, 2021 at 03:23:23PM +0100, Arnd Bergmann wrote:
> On Fri, Jan 8, 2021 at 1:54 PM Mark Rutland <mark.rutland@arm.com> wrote:
> > As a general rule, we avoid the use of IMPLEMENTATION DEFINED features
> > in Linux, as they pose a number of correctness/safety challenges and
> > come with a potentially significan long term maintenance burden that is
> > generally not justified by the features themselves. For example, such
> > features are not usable under virtualization (where a hypervisor may set
> > HCR_EL2.TIDCP, or fail to context-switch state that it is unaware of).
> 
> I am somewhat less concerned about the feature being implementation
> defined than I am about adding a custom user interface for one
> platform.

I completely agree that adding a custom interface that's platform
specific is undesirable.

> In the end, anything outside of the CPU core that ends up in a SoC
> is implementation defined, and this is usually not a problem as long
> as we have an abstraction in the kernel that hides the details from
> the user, and the system is still functional if the implementation is
> turned off for whatever reason.

I think that peripherals and other bits out in the SoC are quite
different from things built into the CPU, where there are inevitably more
significant and subtle interactions with the architecture, and which can be so
closely coupled that there is no good point at which to apply an abstraction. We
have common ways of abstracting storage devices, but the same is not as
true for userspace instructions.

Thanks,
Mark.
Tomohiro Misono (Fujitsu) Jan. 12, 2021, 10:24 a.m. UTC | #4
Hi, 

First of all, thanks a lot to both of you for all the comments (continued below).

> On Fri, Jan 8, 2021 at 1:54 PM Mark Rutland <mark.rutland@arm.com> wrote:
> > On Fri, Jan 08, 2021 at 07:52:31PM +0900, Misono Tomohiro wrote:
> > > (Resend as cover letter title was missing in the first time. Sorry for noise)
> > >
> > > This series adds Fujitsu A64FX SoC entry in drivers/soc and hardware
> > > barrier driver for it.
> > >
> > > [Driver Description]
> > >  A64FX CPU has several functions for HPC workload and hardware barrier
> > >  is one of them. It is a mechanism to realize fast synchronization by
> > >  PEs belonging to the same L3 cache domain by using implementation
> > >  defined hardware registers.
> > >  For more details, see A64FX HPC extension specification in
> > >  https://github.com/fujitsu/A64FX
> > >
> > >  The driver mainly offers a set of ioctls to manipulate related registers.
> > >  Patch 1-9 implements driver code and patch 10 finally adds kconfig,
> > >  Makefile and MAINTAINER entry for the driver.
> >
> > I have a number of concerns here, and at a high level, I do not think
> > that this is something Linux can reasonably support in its current form.
> > Sorry if this comes across as harsh; I appreciate the work that has gone
> > into this, and the effort to try to upstream support is great -- my
> > concerns are with the overal picture.
> >
> > As a general rule, we avoid the use of IMPLEMENTATION DEFINED features
> > in Linux, as they pose a number of correctness/safety challenges and
> > come with a potentially significan long term maintenance burden that is
> > generally not justified by the features themselves. For example, such
> > features are not usable under virtualization (where a hypervisor may set
> > HCR_EL2.TIDCP, or fail to context-switch state that it is unaware of).
> 
> I am somewhat less concerned about the feature being implementation
> defined than I am about adding a custom user interface for one
> platform.
> 
> In the end, anything outside of the CPU core that ends up in a SoC
> is implementation defined, and this is usually not a problem as long
> as we have an abstraction in the kernel that hides the details from
> the user, and the system is still functional if the implementation is
> turned off for whatever reason.

Understood. However, I don't know of any other processors having similar
features at this point, and it is hard to provide a common abstraction
interface. I would appreciate it if anyone has any information.

> > Secondly, the intended usage model appears to expose this to EL0 for
> > direct access, and the code seems to depend on threads being pinned, but
> > AFAICT this is not enforced and there is no provision for
> > context-switch, thread migration, or interaction with ptrace. I fear
> > this is going to be very fragile in practice, and that extending that
> > support in future will require much more complexity than is currently
> > apparent, with potentially invasive changes to arch code.
> 
> Right, this is the main problem I see, too. I had not even realized
> that this will have to tie in with user space threads in some form, but
> you are right that once this has to interact with the CPU scheduler,
> it all breaks down.

This observation is right. I thought that adding context-switch etc. support
for implementation-defined registers would require core arch code changes,
which would be far less acceptable. So I tried to confine the code changes to
a module with these restrictions.

Regarding direct access from EL0, it is necessary for fast synchronization,
as it enables the synchronization logic in the user application to check
whether all threads have reached the synchronization point without switching
to the kernel.
Also, it is common for each running thread to be bound to one PE in
multi-threaded HPC applications.
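
(For reference, the per-thread binding assumed here is just the standard
affinity setup that HPC runtimes already perform; a minimal sketch,
independent of this driver:)

/* Bind the calling thread to one PE, as HPC runtimes commonly do before
 * using any per-core feature; not specific to this driver. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static int bind_current_thread_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* Fails (EINVAL) if the CPU is not in the thread's allowed cpuset. */
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}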

> One way I can imagine this working out is to tie into the cpuset
> mechanism that is used for isolating threads to CPU cores, and
> then provide a cpuset interface that has the desired behavior
> but that can fall back to a generic implementation with the same
> or stronger (but normally slower) semantics.

I'm not sure if this approach is feasible, but I will try to look into it.

> > Thirdly, this requires userspace software to be intimately familiar with
> > the HW platform that it is running on (both in terms of using IMP-DEF
> > instructions and needing to know the physical layout), rather than being
> > generic and portable, which I don't believe is something that we wish to
> > encourage.  I also think this is unlikely to be supported by generic
> > software because of the lack of portability, and consequently I struggle
> > to beleive that this will see significant usage.
> 
> Agreed as well.

It may be possible to trap accesses to these implementation-defined registers
and fall back to some logic in the driver. The problem is that other
processors might use the same IMP-DEF registers for a different purpose.

Regards,
Tomohiro
Tomohiro Misono (Fujitsu) Jan. 12, 2021, 10:32 a.m. UTC | #5
> On Fri, Jan 08, 2021 at 07:52:31PM +0900, Misono Tomohiro wrote:
> > (Resend as cover letter title was missing in the first time. Sorry for noise)
> >
> > Hello,
> 
> Hi,
> 
> > This series adds Fujitsu A64FX SoC entry in drivers/soc and hardware
> > barrier driver for it.
> >
> > [Driver Description]
> >  A64FX CPU has several functions for HPC workload and hardware barrier
> >  is one of them. It is a mechanism to realize fast synchronization by
> >  PEs belonging to the same L3 cache domain by using implementation
> >  defined hardware registers.
> >  For more details, see A64FX HPC extension specification in
> >  https://github.com/fujitsu/A64FX
> >
> >  The driver mainly offers a set of ioctls to manipulate related registers.
> >  Patch 1-9 implements driver code and patch 10 finally adds kconfig,
> >  Makefile and MAINTAINER entry for the driver.
> 
> I have a number of concerns here, and at a high level, I do not think
> that this is something Linux can reasonably support in its current form.
> Sorry if this comes across as harsh; I appreciate the work that has gone
> into this, and the effort to try to upstream support is great -- my
> concerns are with the overal picture.
> 
> As a general rule, we avoid the use of IMPLEMENTATION DEFINED features
> in Linux, as they pose a number of correctness/safety challenges and
> come with a potentially significan long term maintenance burden that is
> generally not justified by the features themselves. For example, such
> features are not usable under virtualization (where a hypervisor may set
> HCR_EL2.TIDCP, or fail to context-switch state that it is unaware of).
> 
> Secondly, the intended usage model appears to expose this to EL0 for
> direct access, and the code seems to depend on threads being pinned, but
> AFAICT this is not enforced and there is no provision for
> context-switch, thread migration, or interaction with ptrace. I fear
> this is going to be very fragile in practice, and that extending that
> support in future will require much more complexity than is currently
> apparent, with potentially invasive changes to arch code.
> 
> Thirdly, this requires userspace software to be intimately familiar with
> the HW platform that it is running on (both in terms of using IMP-DEF
> instructions and needing to know the physical layout), rather than being
> generic and portable, which I don't believe is something that we wish to
> encourage.  I also think this is unlikely to be supported by generic
> software because of the lack of portability, and consequently I struggle
> to beleive that this will see significant usage.
> 
> Further, as an IMP-DEF feature, it's not clear how much of this will
> carry forward into future designs, and where things may change. It's
> extremely difficult to determine whether any of the ABI decisions (e.g.
> the sysfs layout) are sufficient, or what level of changes would be
> necessary in userspace code if there are physical topology changes or
> changes to the strucutre of the system register interfaces.
> 
> Overall, I think this needs much more justification in terms of
> practical usage, safety/correctness, and long term maintenance, and with
> that I think the longer term goal would be to use this to justify an
> architectural feature along similar lines rather than to support any
> IMPLEMENTATION DEFINED variants upstream in Linux.
> 
> >  Also, C library and test program for this driver is available on:
> >  https://github.com/fujitsu/hardware_barrier
> 
> Hmm... I see some code in that repo which looks suspiciously like code
> from the Linux kernel tree, but licensed differently, which is
> concerning.
> 
> Specifically, the asm block in internal.h (which the SPDX headers says
> is licensed as LGPL-3.0-only) looks like a copy of code from
> arch/arm64/include/asm/sysreg.h (which is licensed as GPL-2.0-only per
> its SPDX header).
> 
> If that code was copied, I don't believe that relicensing is permitted.
> I would advise that someone with legal training considers the provenance
> of that code and what is permitted.

Sorry, I must not have paid attention to where the code came from when I wrote it.
I have removed that part and written the assembly directly:
 https://github.com/fujitsu/hardware_barrier/blob/develop/src/internal.h
 https://github.com/fujitsu/hardware_barrier/blob/develop/src/hwblib.c#L215

Regards,
Tomohiro
Arnd Bergmann Jan. 12, 2021, 2:22 p.m. UTC | #6
On Tue, Jan 12, 2021 at 11:24 AM misono.tomohiro@fujitsu.com
<misono.tomohiro@fujitsu.com> wrote:
> > On Fri, Jan 8, 2021 at 1:54 PM Mark Rutland <mark.rutland@arm.com> wrote:
> However, I don't know any other processors having similar
> features at this point and it is hard to provide common abstraction interface.
> I would appreciate should anyone have any information.

The specification you pointed to mentions the SPARC64 XIfx, so
at a minimum, a user interface should be designed to also work on
whatever register-level interface that provides.

> > > Secondly, the intended usage model appears to expose this to EL0 for
> > > direct access, and the code seems to depend on threads being pinned, but
> > > AFAICT this is not enforced and there is no provision for
> > > context-switch, thread migration, or interaction with ptrace. I fear
> > > this is going to be very fragile in practice, and that extending that
> > > support in future will require much more complexity than is currently
> > > apparent, with potentially invasive changes to arch code.
> >
> > Right, this is the main problem I see, too. I had not even realized
> > that this will have to tie in with user space threads in some form, but
> > you are right that once this has to interact with the CPU scheduler,
> > it all breaks down.
>
> This observation is right. I thought adding context switch etc. support for
> implementation defined registers requires core arch code changes and
> it is far less acceptable. So, I tried to confine code change in a module with
> these restrictions.

My feeling is that having the code separate from where it would belong
in an operating system that was designed specifically for this feature
ends up being no better than rewriting the core scheduling code.

As Mark said, it may well be that neither approach would be sufficient
for an upstream merge. On the other hand, keeping the code in a
separate loadable module does make most sense if we end up
not merging it at all, in which case this is the easiest to port
between kernel versions.

> Regarding direct access from EL0, it is necessary for realizing fast synchronization
> as this enables synchronization logic in user application check if all threads have
> reached at synchronization point without switching to kernel.

Ok, I see.

> Also, It is common usage that each running thread is bound to one PE in
> multi-threaded HPC applications.

I think the expectation that all threads are bound to a physical CPU
makes sense for using this feature, but I think it would be necessary
to enforce that, e.g. by allowing only threads to enable it after they
are isolated to a non-shared CPU, and automatically disabling it
if the CPU isolation is changed.

For the user space interface, something based on process IDs
seems to make more sense to me than something based on CPU
numbers. All of the above does require some level of integration
with the core kernel of course.

I think the next step would be to try to come up with a high-level
user interface design that has a chance to get merged, rather than
addressing the review comments for the current implementation.

Aside from the user interface question, it would be good to
understand the performance impact of the feature.
As I understand it, the entire purpose is to make things faster, so
to put it in perspective compared to the burden of adding an
interface, there should be some numbers: What are the kinds of
applications that would use it in practice, and how much faster are
they compared to not having it?

       Arnd
Tomohiro Misono (Fujitsu) Jan. 15, 2021, 11:10 a.m. UTC | #7
> On Tue, Jan 12, 2021 at 11:24 AM misono.tomohiro@fujitsu.com
> <misono.tomohiro@fujitsu.com> wrote:
> > > On Fri, Jan 8, 2021 at 1:54 PM Mark Rutland <mark.rutland@arm.com> wrote:
> > However, I don't know any other processors having similar
> > features at this point and it is hard to provide common abstraction interface.
> > I would appreciate should anyone have any information.
> 
> The specification you pointed to mentions the SPARC64 XIfx, so
> at a minimum, a user interface should be designed to also work on
> whatever register-level interface that provides.

Our previous CPUs have a hardware barrier function too, but it is not
currently used (I believe the hardware design shares the same idea, and this
driver's logic/ioctl interface could be applicable to both).

> > > > Secondly, the intended usage model appears to expose this to EL0 for
> > > > direct access, and the code seems to depend on threads being pinned, but
> > > > AFAICT this is not enforced and there is no provision for
> > > > context-switch, thread migration, or interaction with ptrace. I fear
> > > > this is going to be very fragile in practice, and that extending that
> > > > support in future will require much more complexity than is currently
> > > > apparent, with potentially invasive changes to arch code.
> > >
> > > Right, this is the main problem I see, too. I had not even realized
> > > that this will have to tie in with user space threads in some form, but
> > > you are right that once this has to interact with the CPU scheduler,
> > > it all breaks down.
> >
> > This observation is right. I thought adding context switch etc. support for
> > implementation defined registers requires core arch code changes and
> > it is far less acceptable. So, I tried to confine code change in a module with
> > these restrictions.
> 
> My feeling is that having the code separate from where it would belong
> in an operating system that was designed specifically for this feature
> ends up being no better than rewriting the core scheduling code.
> 
> As Mark said, it may well be that neither approach would be sufficient
> for an upstream merge. On the other hand, keeping the code in a
> separate loadable module does make most sense if we end up
> not merging it at all, in which case this is the easiest to port
> between kernel versions.
> 
> > Regarding direct access from EL0, it is necessary for realizing fast synchronization
> > as this enables synchronization logic in user application check if all threads have
> > reached at synchronization point without switching to kernel.
> 
> Ok, I see.
> 
> > Also, It is common usage that each running thread is bound to one PE in
> > multi-threaded HPC applications.
> 
> I think the expectation that all threads are bound to a physical CPU
> makes sense for using this feature, but I think it would be necessary
> to enforce that, e.g. by allowing only threads to enable it after they
> are isolated to a non-shared CPU, and automatically disabling it
> if the CPU isolation is changed.
> 
> For the user space interface, something based on process IDs
> seems to make more sense to me than something based on CPU
> numbers. All of the above does require some level of integration
> with the core kernel of course.
> 
> I think the next step would be to try to come up with a high-level
> user interface design that has a chance to get merged, rather than
> addressing the review comments for the current implementation.

Understood. One question: while a high-level interface such as process-based
control could solve several problems (i.e. access control/forced binding),
I cannot eliminate access to the IMP-DEF registers from EL0, as I explained
above. Is that acceptable in your view?

> Aside from the user interface question, it would be good to
> understand the performance impact of the feature.
> As I understand it, the entire purpose is to make things faster, so
> to put it in perspective compared to the burden of adding an
> interface, there should be some numbers: What are the kinds of
> applications that would use it in practice, and how much faster are
> they compared to not having it?

A microbenchmark shows that one synchronization of 12 PEs takes around 250 ns
with the hardware barrier, which is several times faster than a software
barrier (measuring only the core synchronization logic and excluding setup
time). I don't have application results at this point and will share them
when I get some.

Regards,
Tomohiro
Arnd Bergmann Jan. 15, 2021, 12:24 p.m. UTC | #8
On Fri, Jan 15, 2021 at 12:10 PM misono.tomohiro@fujitsu.com
<misono.tomohiro@fujitsu.com> wrote:
> > On Tue, Jan 12, 2021 at 11:24 AM misono.tomohiro@fujitsu.com <misono.tomohiro@fujitsu.com> wrote:

> > > Also, It is common usage that each running thread is bound to one PE in
> > > multi-threaded HPC applications.
> >
> > I think the expectation that all threads are bound to a physical CPU
> > makes sense for using this feature, but I think it would be necessary
> > to enforce that, e.g. by allowing only threads to enable it after they
> > are isolated to a non-shared CPU, and automatically disabling it
> > if the CPU isolation is changed.
> >
> > For the user space interface, something based on process IDs
> > seems to make more sense to me than something based on CPU
> > numbers. All of the above does require some level of integration
> > with the core kernel of course.
> >
> > I think the next step would be to try to come up with a high-level
> > user interface design that has a chance to get merged, rather than
> > addressing the review comments for the current implementation.
>
> Understood. One question is that high-level interface such as process
> based control could solve several problems (i.e. access control/force binding),
> I cannot eliminate access to IMP-DEF registers from EL0 as I explained
> above. Is it acceptable in your sense?

I think you will get different answers for that depending on who you ask ;-)

I'm generally ok with it, given that it will only affect a very small
number of specialized applications that are already built for
a specific microarchitecture for performance reasons. E.g. when
using an arm64 BLAS library, you would use different versions
of the same functions depending on CPU support for NEON,
SVE, SVE2, Apple AMX (which also uses imp-def instructions),
ARMv8.6 GEMM extensions, and likely a hand-optimized
version for the A64FX pipeline. Having a version for A64FX with
hardware barriers adds (at most) one more code path but hopefully
does not add complexity to the common code.
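
To illustrate, such a library would typically select the code path at run
time from the HWCAP bits and, for a vendor-specific path, from the MIDR; a
rough sketch (the Fujitsu implementer/part values here are from memory, so
treat them as an assumption):

/* Runtime dispatch sketch: pick a GEMM implementation from HWCAP/MIDR.
 * The 0x46/0x001 implementer/part values for the A64FX are an assumption. */
#include <stdio.h>
#include <sys/auxv.h>
#include <asm/hwcap.h>    /* HWCAP_SVE on arm64 */

static int cpu_is_a64fx(void)
{
    unsigned long long midr = 0;
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/regs/identification/midr_el1", "r");

    if (!f)
        return 0;
    if (fscanf(f, "%llx", &midr) != 1)
        midr = 0;
    fclose(f);
    return ((midr >> 24) & 0xff) == 0x46 && ((midr >> 4) & 0xfff) == 0x001;
}

const char *pick_gemm_impl(void)
{
    if (cpu_is_a64fx())
        return "a64fx";    /* could also use the hardware barrier here */
    if (getauxval(AT_HWCAP) & HWCAP_SVE)
        return "sve";
    return "neon";
}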

> > Aside from the user interface question, it would be good to
> > understand the performance impact of the feature.
> > As I understand it, the entire purpose is to make things faster, so
> > to put it in perspective compared to the burden of adding an
> > interface, there should be some numbers: What are the kinds of
> > applications that would use it in practice, and how much faster are
> > they compared to not having it?
>
> Microbenchmark shows it takes around 250ns for 1 synchronization for
> 12 PEs with hardware barrier and it is multiple times faster than software
> barrier (only measuring core synchronization logic and excluding setup time).
> I don't have application results at this point and will share when I could get some.

Thanks. That will be helpful indeed. Please also include information
about what you are comparing against for the software barrier. E.g.
Is that based on a futex() system call, or completely implemented
in user space?

      Arnd
Tomohiro Misono (Fujitsu) Jan. 19, 2021, 5:30 a.m. UTC | #9
> > > > Also, It is common usage that each running thread is bound to one PE in
> > > > multi-threaded HPC applications.
> > >
> > > I think the expectation that all threads are bound to a physical CPU
> > > makes sense for using this feature, but I think it would be necessary
> > > to enforce that, e.g. by allowing only threads to enable it after they
> > > are isolated to a non-shared CPU, and automatically disabling it
> > > if the CPU isolation is changed.
> > >
> > > For the user space interface, something based on process IDs
> > > seems to make more sense to me than something based on CPU
> > > numbers. All of the above does require some level of integration
> > > with the core kernel of course.
> > >
> > > I think the next step would be to try to come up with a high-level
> > > user interface design that has a chance to get merged, rather than
> > > addressing the review comments for the current implementation.
> >
> > Understood. One question is that high-level interface such as process
> > based control could solve several problems (i.e. access control/force binding),
> > I cannot eliminate access to IMP-DEF registers from EL0 as I explained
> > above. Is it acceptable in your sense?
> 
> I think you will get different answers for that depending on who you ask ;-)
> 
> I'm generally ok with it, given that it will only affect a very small
> number of specialized applications that are already built for
> a specific microarchitecture for performance reasons. E.g. when
> using an arm64 BLAS library, you would use different versions
> of the same functions depending on CPU support for NEON,
> SVE, SVE2, Apple AMX (which also uses imp-def instructions),
> ARMv8.6 GEMM extensions, and likely a hand-optimized
> version for the A64FX pipeline. Having a version for A64FX with
> hardware barriers adds (at most) one more code path but hopefully
> does not add complexity to the common code.

Thanks. By the way, to be precise, the A64FX doesn't use imp-def instructions.
It provides imp-def registers which can be accessed with the standard system
register access instructions (MRS/MSR).

> > > Aside from the user interface question, it would be good to
> > > understand the performance impact of the feature.
> > > As I understand it, the entire purpose is to make things faster, so
> > > to put it in perspective compared to the burden of adding an
> > > interface, there should be some numbers: What are the kinds of
> > > applications that would use it in practice, and how much faster are
> > > they compared to not having it?
> >
> > Microbenchmark shows it takes around 250ns for 1 synchronization for
> > 12 PEs with hardware barrier and it is multiple times faster than software
> > barrier (only measuring core synchronization logic and excluding setup time).
> > I don't have application results at this point and will share when I could get some.
> 
> Thanks. That will be helpful indeed. Please also include information
> about what you are comparing against for the software barrier. E.g.
> Is that based on a futex() system call, or completely implemented
> in user space?

It is completely implemented in user space using shared variables, without
any system call.
(As all PEs to be synchronized share the L3 cache, the synchronization should
result in L3 accesses.)
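
For reference, a pure user-space baseline of this kind is typically a
sense-reversing barrier over shared atomics; the following is only an
illustrative sketch, not the exact code used for the measurement above.

/* Sense-reversing spin barrier on shared variables, entirely in user space.
 * Initialize count = nthreads and sense = 0 before use. */
#include <stdatomic.h>

struct sw_barrier {
    atomic_int count;    /* threads still to arrive in this round */
    atomic_int sense;    /* flips each time the barrier opens */
    int nthreads;
};

static void sw_barrier_wait(struct sw_barrier *b)
{
    int my_sense = !atomic_load_explicit(&b->sense, memory_order_acquire);

    if (atomic_fetch_sub_explicit(&b->count, 1, memory_order_acq_rel) == 1) {
        /* Last arrival: reset the count, then release the waiters. */
        atomic_store_explicit(&b->count, b->nthreads, memory_order_relaxed);
        atomic_store_explicit(&b->sense, my_sense, memory_order_release);
    } else {
        while (atomic_load_explicit(&b->sense, memory_order_acquire) != my_sense)
            ;    /* spin on L3-shared memory, no system call */
    }
}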

Regards,
Tomohiro
Tomohiro Misono (Fujitsu) Feb. 18, 2021, 9:49 a.m. UTC | #10
> > > > Also, It is common usage that each running thread is bound to one PE in
> > > > multi-threaded HPC applications.
> > >
> > > I think the expectation that all threads are bound to a physical CPU
> > > makes sense for using this feature, but I think it would be necessary
> > > to enforce that, e.g. by allowing only threads to enable it after they
> > > are isolated to a non-shared CPU, and automatically disabling it
> > > if the CPU isolation is changed.
> > >
> > > For the user space interface, something based on process IDs
> > > seems to make more sense to me than something based on CPU
> > > numbers. All of the above does require some level of integration
> > > with the core kernel of course.
> > >
> > > I think the next step would be to try to come up with a high-level
> > > user interface design that has a chance to get merged, rather than
> > > addressing the review comments for the current implementation.

Hello,

Sorry for the late response, but while thinking about new approaches I came
up with a different idea and want to hear your opinions. How about offloading
all control to user space, while the driver just offers read/write access
to the needed registers? Let me explain in detail.

Although I searched for similar functions in other products, I could not find
any. Also, this hardware barrier performs intra-NUMA synchronization, and it
is hard to use it as a general inter-process barrier. So I don't think
generalizing this feature in the kernel would work well.

As I said, this is mainly for HPC applications. In the usual situation, the
user has full control of the compute nodes when running an HPC application,
and thus the user has full responsibility for the processes running on the
machine. Offloading all control of these registers to the user is acceptable
in that case (i.e. the driver just offers access to the registers and does
not control them).
This is safe for kernel operation, as manipulating the barrier-related
registers only affects the user application.

In this approach we could remove the ioctls and control logic from the driver,
but we still need some way to access the needed registers. I first wondered
whether I could use an approach like x86's MSR driver, but I know that idea
was recently rejected for security concerns:
 https://lore.kernel.org/linux-arm-kernel/20201130174833.41315-1-rongwei.wang@linux.alibaba.com/ 

Based on these observations, I currently have two ideas:
 1) make the driver expose only a sysfs interface for reading/writing the
   A64FX barrier registers
or
 2) generalize (1) in some way, i.e. create a mechanism to expose CPU-defined
   registers which can be safely accessed from user space

Are these ideas acceptable directions to explore for an upstream merge?
I'd appreciate any criticism/comments.
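
To make idea (1) a bit more concrete, the shape I imagine is an ordinary
root-writable sysfs attribute per register, along the lines of the sketch
below (the attribute name and the sys_reg() encoding are placeholders, not
the real A64FX barrier registers):

/* Rough sketch only: a sysfs attribute backed by one imp-def register.
 * The encoding below is a placeholder; a real interface would also have to
 * make sure the access happens on the PE the user cares about. */
#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/sysfs.h>
#include <asm/sysreg.h>

#define IMP_BARRIER_REG_PLACEHOLDER sys_reg(3, 0, 11, 12, 0)  /* placeholder */

static ssize_t barrier_reg_show(struct device *dev,
                struct device_attribute *attr, char *buf)
{
    return sysfs_emit(buf, "%llx\n",
              read_sysreg_s(IMP_BARRIER_REG_PLACEHOLDER));
}

static ssize_t barrier_reg_store(struct device *dev,
                 struct device_attribute *attr,
                 const char *buf, size_t count)
{
    u64 val;

    if (kstrtou64(buf, 0, &val))
        return -EINVAL;
    write_sysreg_s(val, IMP_BARRIER_REG_PLACEHOLDER);
    return count;
}
/* Registered against the driver's device; mode 0644, so only root can write. */
static DEVICE_ATTR_RW(barrier_reg);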

Regards, 
Tomohiro
Tomohiro Misono (Fujitsu) March 1, 2021, 7:53 a.m. UTC | #11
Hi,

Gentle Ping?
Tomohiro

> Subject: RE: [RFC PATCH 00/10] Add Fujitsu A64FX soc entry/hardware barrier driver
> 
> > > > > Also, It is common usage that each running thread is bound to one PE in
> > > > > multi-threaded HPC applications.
> > > >
> > > > I think the expectation that all threads are bound to a physical CPU
> > > > makes sense for using this feature, but I think it would be necessary
> > > > to enforce that, e.g. by allowing only threads to enable it after they
> > > > are isolated to a non-shared CPU, and automatically disabling it
> > > > if the CPU isolation is changed.
> > > >
> > > > For the user space interface, something based on process IDs
> > > > seems to make more sense to me than something based on CPU
> > > > numbers. All of the above does require some level of integration
> > > > with the core kernel of course.
> > > >
> > > > I think the next step would be to try to come up with a high-level
> > > > user interface design that has a chance to get merged, rather than
> > > > addressing the review comments for the current implementation.
> 
> Hello,
> 
> Sorry for late response but while thinking new approaches, I come up with
> some different idea and want to hear your opinions. How about offload
> all control to user space while the driver just offers read/write access
> to the needed registers? Let me explain in detail.
> 
> Although I searched similar functions in other products, I could not find
> it. Also, this hardware barrier performs intra-numa synchronization and
> it is hard to be used for general inter-process barrier. So I think
> generalizing this feature in kernel does not go well.
> 
> As I said this is mainly for HPC application. In the usual situations, the
> user has full control of the PC nodes when running HPC application and
> thus the user has full responsibility of running processes on the machine.
> Offloading all controls to these registers to the user is acceptable in that
> case (i.e. the driver just offers access to the registers and does not control it).
> This is the safe for the kernel operation as manipulating barrier related
> registers just affects user application.
> 
> In this approach we could remove ioctls or control logic in the driver but
> we need some way to access the needed registers. I firstly think if I can
> use x86's MSR driver like approach but I know the idea is rejected
> recently for security concerns:
>  https://lore.kernel.org/linux-arm-kernel/20201130174833.41315-1-rongwei.wang@linux.alibaba.com/
> 
> Based on these observations, I have two ideas currently:
>  1) make the driver to only expose sysfs interface for reading/writing
>    A64FX's barrier registers
> or
>  2) generalizing (1) in some way; To make some mechanism to expose
>    CPU defined registers which can be safely accessed from user space
> 
> Are these idea acceptable ways to explore to get merged in upstream?
> I'd appreciate any criticism/comments.
> 
> Regards,
> Tomohiro
Arnd Bergmann March 2, 2021, 11:06 a.m. UTC | #12
On Thu, Feb 18, 2021 at 10:49 AM misono.tomohiro@fujitsu.com
<misono.tomohiro@fujitsu.com> wrote:
>
> > > > > Also, It is common usage that each running thread is bound to one PE in
> > > > > multi-threaded HPC applications.
> > > >
> > > > I think the expectation that all threads are bound to a physical CPU
> > > > makes sense for using this feature, but I think it would be necessary
> > > > to enforce that, e.g. by allowing only threads to enable it after they
> > > > are isolated to a non-shared CPU, and automatically disabling it
> > > > if the CPU isolation is changed.
> > > >
> > > > For the user space interface, something based on process IDs
> > > > seems to make more sense to me than something based on CPU
> > > > numbers. All of the above does require some level of integration
> > > > with the core kernel of course.
> > > >
> > > > I think the next step would be to try to come up with a high-level
> > > > user interface design that has a chance to get merged, rather than
> > > > addressing the review comments for the current implementation.
>
> Hello,
>
> Sorry for late response but while thinking new approaches, I come up with
> some different idea and want to hear your opinions. How about offload
> all control to user space while the driver just offers read/write access
> to the needed registers? Let me explain in detail.
>
> Although I searched similar functions in other products, I could not find
> it. Also, this hardware barrier performs intra-numa synchronization and
> it is hard to be used for general inter-process barrier. So I think
> generalizing this feature in kernel does not go well.

Ok, thank you for taking a look.

> As I said this is mainly for HPC application. In the usual situations, the
> user has full control of the PC nodes when running HPC application and
> thus the user has full responsibility of running processes on the machine.
> Offloading all controls to these registers to the user is acceptable in that
> case (i.e. the driver just offers access to the registers and does not control it).
> This is the safe for the kernel operation as manipulating barrier related
> registers just affects user application.
>
> In this approach we could remove ioctls or control logic in the driver but
> we need some way to access the needed registers. I firstly think if I can
> use x86's MSR driver like approach but I know the idea is rejected
> recently for security concerns:
>  https://lore.kernel.org/linux-arm-kernel/20201130174833.41315-1-rongwei.wang@linux.alibaba.com/
>
> Based on these observations, I have two ideas currently:
>  1) make the driver to only expose sysfs interface for reading/writing
>    A64FX's barrier registers
> or
>  2) generalizing (1) in some way; To make some mechanism to expose
>    CPU defined registers which can be safely accessed from user space
>
> Are these idea acceptable ways to explore to get merged in upstream?
> I'd appreciate any criticism/comments.

I'm also running out of ideas here. I don't think a sysfs interface would
be any different from your earlier ioctl interface or the /dev/msr approach,
they all share the same problem that they expose low-level access to
platform specific registers in a way that is neither portable nor safe to
use for general-purpose applications outside the very narrow scope
of running highly optimized HPC applications.

You can of course continue using the module you have as an external
module that gets built outside of the kernel itself, and shipped along
with the application or library using it, rather than with the kernel.
Obviously this is not something I would generally recommend either,
but this may be the last resort to fall back to when everything else
fails.

      Arnd
Tomohiro Misono (Fujitsu) March 3, 2021, 11:20 a.m. UTC | #13
> On Thu, Feb 18, 2021 at 10:49 AM misono.tomohiro@fujitsu.com
> <misono.tomohiro@fujitsu.com> wrote:
> >
> > > > > > Also, It is common usage that each running thread is bound to one PE in
> > > > > > multi-threaded HPC applications.
> > > > >
> > > > > I think the expectation that all threads are bound to a physical CPU
> > > > > makes sense for using this feature, but I think it would be necessary
> > > > > to enforce that, e.g. by allowing only threads to enable it after they
> > > > > are isolated to a non-shared CPU, and automatically disabling it
> > > > > if the CPU isolation is changed.
> > > > >
> > > > > For the user space interface, something based on process IDs
> > > > > seems to make more sense to me than something based on CPU
> > > > > numbers. All of the above does require some level of integration
> > > > > with the core kernel of course.
> > > > >
> > > > > I think the next step would be to try to come up with a high-level
> > > > > user interface design that has a chance to get merged, rather than
> > > > > addressing the review comments for the current implementation.
> >
> > Hello,
> >
> > Sorry for late response but while thinking new approaches, I come up with
> > some different idea and want to hear your opinions. How about offload
> > all control to user space while the driver just offers read/write access
> > to the needed registers? Let me explain in detail.
> >
> > Although I searched similar functions in other products, I could not find
> > it. Also, this hardware barrier performs intra-numa synchronization and
> > it is hard to be used for general inter-process barrier. So I think
> > generalizing this feature in kernel does not go well.
> 
> Ok, thank you for taking a look.
> 
> > As I said this is mainly for HPC application. In the usual situations, the
> > user has full control of the PC nodes when running HPC application and
> > thus the user has full responsibility of running processes on the machine.
> > Offloading all controls to these registers to the user is acceptable in that
> > case (i.e. the driver just offers access to the registers and does not control it).
> > This is the safe for the kernel operation as manipulating barrier related
> > registers just affects user application.
> >
> > In this approach we could remove ioctls or control logic in the driver but
> > we need some way to access the needed registers. I firstly think if I can
> > use x86's MSR driver like approach but I know the idea is rejected
> > recently for security concerns:
> >  https://lore.kernel.org/linux-arm-kernel/20201130174833.41315-1-rongwei.wang@linux.alibaba.com/
> >
> > Based on these observations, I have two ideas currently:
> >  1) make the driver to only expose sysfs interface for reading/writing
> >    A64FX's barrier registers
> > or
> >  2) generalizing (1) in some way; To make some mechanism to expose
> >    CPU defined registers which can be safely accessed from user space
> >
> > Are these idea acceptable ways to explore to get merged in upstream?
> > I'd appreciate any criticism/comments.
> 
> I'm also running out of ideas here. I don't think a sysfs interface would
> be any different to your earlier ioctl interface or the the /dev/msr approach,
> they all share the same problem that they expose low-level access to
> platform specific registers in a way that is neither portable nor safe to
> use for general-purpose applications outside the very narrow scope
> of running highly optimized HPC applications.

OK, but the Arm architecture permits implementation-defined registers in the
first place. So can we provide some method/interface to access them as a CPU
feature, provided these registers at least do not affect kernel operations
(like this barrier) and only root can access them? A library could offer a
portable way for user applications (with root permission) to access them.

> You can of course continue using the module you have as an external
> module that gets built outside of the kernel itself, and shipped along
> with the application or library using it, rather than with the kernel.
> Obviously this is not something I would generally recommend either,
> but this may be the last resort to fall back to when everything else
> fails.

Understood. This is the last resort.
Thanks very much for your help.

Regards,
Tomohiro
Arnd Bergmann March 3, 2021, 1:33 p.m. UTC | #14
On Wed, Mar 3, 2021 at 12:20 PM misono.tomohiro@fujitsu.com
<misono.tomohiro@fujitsu.com> wrote:
> On Thu, Feb 18, 2021 at 10:49 AM misono.tomohiro@fujitsu.com <misono.tomohiro@fujitsu.com> wrote:
>
> > I'm also running out of ideas here. I don't think a sysfs interface would
> > be any different to your earlier ioctl interface or the the /dev/msr approach,
> > they all share the same problem that they expose low-level access to
> > platform specific registers in a way that is neither portable nor safe to
> > use for general-purpose applications outside the very narrow scope
> > of running highly optimized HPC applications.
>
> Ok, but ARM architecture permits implementation defined registers at the
> first place. So can we provide some method/interface to access them as
> CPU feature if these registers do not at least affect kernel operations (like
> this barrier) and only root can access them? Library could offer portable way
> for user applications (under root permission) to access them.

The kernel is meant to provide an abstraction for any differences between the
CPUs, including implementation defined registers. While any such abstraction
will be leaky, just passing through the raw registers is generally not a helpful
abstraction at all, as seen from the x86 MSR discussion you pointed to.

One problem with having a root-only register level interface is that this
can break the boundary between kernel mode and root user space, and
this is something that a lot of people would like to strengthen for security
reasons (e.g. a root user should not be able to break secure boot).

Another problem is that exposing the raw registers from kernel space
creates an ABI, and if it turns out to be a bad idea later on, this is hard to
take back without breaking existing applications. Not breaking things that
used to work is the primary rule for the Linux kernel.

In order to merge anything into the mainline kernel, I think the requirement
would be that it does provide a sensible abstraction inside of the kernel
that can directly be used from applications without having to go through
another library that abstracts it, and that has a good chance of being
supportable forever.

      Arnd
Tomohiro Misono (Fujitsu) March 4, 2021, 7:03 a.m. UTC | #15
> > > I'm also running out of ideas here. I don't think a sysfs interface would
> > > be any different to your earlier ioctl interface or the the /dev/msr approach,
> > > they all share the same problem that they expose low-level access to
> > > platform specific registers in a way that is neither portable nor safe to
> > > use for general-purpose applications outside the very narrow scope
> > > of running highly optimized HPC applications.
> >
> > Ok, but ARM architecture permits implementation defined registers at the
> > first place. So can we provide some method/interface to access them as
> > CPU feature if these registers do not at least affect kernel operations (like
> > this barrier) and only root can access them? Library could offer portable way
> > for user applications (under root permission) to access them.
> 
> The kernel is meant to provide an abstraction for any differences between the
> CPUs, including implementation defined registers. While any such abstraction
> will be leaky, just passing through the raw registers is generally not a helpful
> abstraction at all, as seen from the x86 MSR discussion you pointed to.
> 
> One problem with having a root-only register level interface is that this
> can break the boundary between kernel mode and root user space, and
> this is something that a lot of people would like to strengthen for security
> reasons (e.g. a root user should not be able to break secure boot).
> 
> Another problem is that exposing the raw registers from kernel space
> creates an ABI, and if it turns out to be a bad idea later on, this is hard to
> take back without breaking existing applications. Not breaking things that
> used to work is the primary rule for the Linux kernel.

OK, thanks for the thorough explanations. It helps my understanding.

> In order to merge anything into the mainline kernel, I think the requirement
> would be that it does provide a sensible abstraction inside of the kernel
> that can directly be used from applications without having to go through
> another library that abstracts it, and that has a good chance of being
> supportable forever.

As you mentioned the idea of a process-based approach earlier, I will
reconsider the possibility of a general abstraction interface along those lines.

Regards,
Tomohiro