diff mbox series

mips: Remove q-accessors from non-64bit platforms

Message ID 20190614063341.1672-1-fancer.lancer@gmail.com (mailing list archive)
State Mainlined
Commit 1e2791448b922069b1d76cb863290c7341ff5eb5
Headers show
Series mips: Remove q-accessors from non-64bit platforms | expand

Commit Message

Serge Semin June 14, 2019, 6:33 a.m. UTC
There are some generic drivers in the kernel, which make use of the
q-accessors or their derivatives. While at current asm/io.h the accessors
are defined, their implementation is only applicable either for 64bit
systems, or for systems with cpu_has_64bits flag set. Obviously there
are MIPS systems which are neither of these, but still need to have
those drivers supported. In this case the solution is to define some
generic versions of the q-accessors, but with a limitation to be
non-atomic. Such accessors are defined in the
io-64-nonatomic-{hi-lo,lo-hi}.h file. The drivers which utilize the
q-suffixed IO-methods are supposed to include the header file, so
in case if these accessors aren't defined for the platform, the generic
non-atomic versions are utilized. Currently the MIPS-specific asm/io.h
file provides the q-accessors for any MIPS system even for ones, which
in fact don't support them and raise BUG() in case if any of them is
called. Due to this the generic versions of the accessors are never
used while an attempt to call the IO-methods causes the kernel BUG().
In order to fix this we need to define the q-accessors only for
the MIPS systems, which actually support them, and don't define them
otherwise, so to let the corresponding drivers to use the non-atomic
q-suffixed accessors.

Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Cc: Vadim V. Vlasov <vadim.vlasov@t-platforms.ru>
---
 arch/mips/include/asm/io.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

Comments

Paul Burton June 20, 2019, 5:40 p.m. UTC | #1
Hi Serge,

On Fri, Jun 14, 2019 at 09:33:42AM +0300, Serge Semin wrote:
> There are some generic drivers in the kernel, which make use of the
> q-accessors or their derivatives. While at current asm/io.h the accessors
> are defined, their implementation is only applicable either for 64bit
> systems, or for systems with cpu_has_64bits flag set. Obviously there
> are MIPS systems which are neither of these, but still need to have
> those drivers supported. In this case the solution is to define some
> generic versions of the q-accessors, but with a limitation to be
> non-atomic. Such accessors are defined in the
> io-64-nonatomic-{hi-lo,lo-hi}.h file. The drivers which utilize the
> q-suffixed IO-methods are supposed to include the header file, so
> in case if these accessors aren't defined for the platform, the generic
> non-atomic versions are utilized. Currently the MIPS-specific asm/io.h
> file provides the q-accessors for any MIPS system even for ones, which
> in fact don't support them and raise BUG() in case if any of them is
> called. Due to this the generic versions of the accessors are never
> used while an attempt to call the IO-methods causes the kernel BUG().
> In order to fix this we need to define the q-accessors only for
> the MIPS systems, which actually support them, and don't define them
> otherwise, so to let the corresponding drivers to use the non-atomic
> q-suffixed accessors.
> 
> Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
> Suggested-by: Arnd Bergmann <arnd@arndb.de>
> Cc: Vadim V. Vlasov <vadim.vlasov@t-platforms.ru>
> ---
>  arch/mips/include/asm/io.h | 11 +++++++++++
>  1 file changed, 11 insertions(+)

So this seems pretty reasonable. Build testing all our defconfigs only
showed up one issue for decstation_defconfig & decstation_r4k_defconfig:

  drivers/net/fddi/defza.c: In function 'fza_reads':
  drivers/net/fddi/defza.c:88:17: error: implicit declaration of
    function 'readq_relaxed'; did you mean 'readw_relaxed'?
    [-Werror=implicit-function-declaration]
   #define readq_u readq_relaxed
                   ^~~~~~~~~~~~~
  drivers/net/fddi/defza.c:126:13: note: in expansion of macro 'readq_u'
      *dst++ = readq_u(src++);
               ^~~~~~~
  drivers/net/fddi/defza.c: In function 'fza_writes':
  drivers/net/fddi/defza.c:92:18: error: implicit declaration of
    function 'writeq_relaxed'; did you mean 'writel_relaxed'?
    [-Werror=implicit-function-declaration]
   #define writeq_u writeq_relaxed
                    ^~~~~~~~~~~~~~
  drivers/net/fddi/defza.c:151:4: note: in expansion of macro 'writeq_u'
      writeq_u(*src++, dst++);
      ^~~~~~~~
    CC      net/core/scm.o
  cc1: some warnings being treated as errors
  make[4]: *** [scripts/Makefile.build:279: drivers/net/fddi/defza.o] Error 1

These uses of readq_relaxed & writeq_relaxed are both conditional upon
sizeof(unsigned long) == 8, ie. upon CONFIG_64BIT=y so they're not going
to present a runtime issue but we need to provide some implementation of
the *q accessors to keep the compiler happy.

I see a few options:

1) We could just have defza.c include <io-64-nonatomic-lo-hi.h> to get
   the appropriate declarations, which should then get optimized away by
   the compiler anyway & never actually be used.

2) We could have defza.h #define its readq_u & writeq_u macros
   differently for CONFIG_32BIT=y kernels, perhaps using
   __compiletime_error to catch any bogus use of them.

3) We could do the same in a generic header, though if nobody else has
   needed it so far & this is the only place we need it then maybe it's
   not worth it.

So I'm thinking option 2 might be best, as below. Having said that I
don't mind option 1 either - it's simple. Maciej do you have any
preference?

Thanks,
    Paul

---
diff --git a/drivers/net/fddi/defza.c b/drivers/net/fddi/defza.c
index c5cae8e74dc4..85d6a7f22fe7 100644
--- a/drivers/net/fddi/defza.c
+++ b/drivers/net/fddi/defza.c
@@ -85,11 +85,21 @@ static u8 hw_addr_beacon[8] = { 0x01, 0x80, 0xc2, 0x00, 0x01, 0x00 };
  */
 #define readw_u readw_relaxed
 #define readl_u readl_relaxed
-#define readq_u readq_relaxed
 
 #define writew_u writew_relaxed
 #define writel_u writel_relaxed
-#define writeq_u writeq_relaxed
+
+#ifdef CONFIG_32BIT
+extern u64 defza_readq_u(const void *ptr)
+	__compiletime_error("readq_u should not be used by 32b kernels");
+extern void defza_writeq_u(u64 val, void *ptr)
+	__compiletime_error("writeq_u should not be used by 32b kernels");
+# define readq_u defza_readq_u
+# define writeq_u defza_writeq_u
+#else
+# define readq_u readq_relaxed
+# define writeq_u writeq_relaxed
+#endif
 
 static inline struct sk_buff *fza_alloc_skb_irq(struct net_device *dev,
 						unsigned int length)
Maciej W. Rozycki June 20, 2019, 6:19 p.m. UTC | #2
On Thu, 20 Jun 2019, Paul Burton wrote:

> So this seems pretty reasonable. Build testing all our defconfigs only
> showed up one issue for decstation_defconfig & decstation_r4k_defconfig:
> 
>   drivers/net/fddi/defza.c: In function 'fza_reads':
>   drivers/net/fddi/defza.c:88:17: error: implicit declaration of
>     function 'readq_relaxed'; did you mean 'readw_relaxed'?
>     [-Werror=implicit-function-declaration]
>    #define readq_u readq_relaxed
>                    ^~~~~~~~~~~~~
>   drivers/net/fddi/defza.c:126:13: note: in expansion of macro 'readq_u'
>       *dst++ = readq_u(src++);
>                ^~~~~~~
>   drivers/net/fddi/defza.c: In function 'fza_writes':
>   drivers/net/fddi/defza.c:92:18: error: implicit declaration of
>     function 'writeq_relaxed'; did you mean 'writel_relaxed'?
>     [-Werror=implicit-function-declaration]
>    #define writeq_u writeq_relaxed
>                     ^~~~~~~~~~~~~~
>   drivers/net/fddi/defza.c:151:4: note: in expansion of macro 'writeq_u'
>       writeq_u(*src++, dst++);
>       ^~~~~~~~
>     CC      net/core/scm.o
>   cc1: some warnings being treated as errors
>   make[4]: *** [scripts/Makefile.build:279: drivers/net/fddi/defza.o] Error 1
> 
> These uses of readq_relaxed & writeq_relaxed are both conditional upon
> sizeof(unsigned long) == 8, ie. upon CONFIG_64BIT=y so they're not going
> to present a runtime issue but we need to provide some implementation of
> the *q accessors to keep the compiler happy.
> 
> I see a few options:
> 
> 1) We could just have defza.c include <io-64-nonatomic-lo-hi.h> to get
>    the appropriate declarations, which should then get optimized away by
>    the compiler anyway & never actually be used.

 This, definitely.

> 2) We could have defza.h #define its readq_u & writeq_u macros
>    differently for CONFIG_32BIT=y kernels, perhaps using
>    __compiletime_error to catch any bogus use of them.
> 
> 3) We could do the same in a generic header, though if nobody else has
>    needed it so far & this is the only place we need it then maybe it's
>    not worth it.
> 
> So I'm thinking option 2 might be best, as below. Having said that I
> don't mind option 1 either - it's simple. Maciej do you have any
> preference?

 The use of 64-bit operations to access option's packet memory, which is 
true SRAM, i.e. no side effects, is to improve throughput only and there's 
no need for atomicity here nor also any kind of barriers, except at the 
conclusion.  Splitting 64-bit accesses into 32-bit halves in software 
would not be a functional error here.

 I benchmarked it back in the day and the difference was noticeable with 
actual network transmissions between loops using 32-bit (LW/SW) and 64-bit 
(LD/SD) accesses respectively, though I may not be able to track down the 
figures anymore as it must have been some 18 years ago.  The performance 
of the MB ASIC used to interface the R4400 CPU to DRAM and TURBOchannel 
with the 5000/260 systems was stellar as it was specifically designed with 
high throughput in mind, as an upgrade to the exiting R3400-based 5000/240 
systems (the CPU and the ASIC are both on a daughtercard), though new such 
machines used to be sold as well.

 For the record the CPU and TURBOchannel run at 60MHz (40MHz with the 
R3400) and 25MHz respectively with these systems.

 Thanks for the heads-up!

  Maciej
Serge Semin June 21, 2019, 6:06 a.m. UTC | #3
Hello Paul,

On Thu, Jun 20, 2019 at 05:40:04PM +0000, Paul Burton wrote:
> Hi Serge,
> 
> On Fri, Jun 14, 2019 at 09:33:42AM +0300, Serge Semin wrote:
> > There are some generic drivers in the kernel, which make use of the
> > q-accessors or their derivatives. While at current asm/io.h the accessors
> > are defined, their implementation is only applicable either for 64bit
> > systems, or for systems with cpu_has_64bits flag set. Obviously there
> > are MIPS systems which are neither of these, but still need to have
> > those drivers supported. In this case the solution is to define some
> > generic versions of the q-accessors, but with a limitation to be
> > non-atomic. Such accessors are defined in the
> > io-64-nonatomic-{hi-lo,lo-hi}.h file. The drivers which utilize the
> > q-suffixed IO-methods are supposed to include the header file, so
> > in case if these accessors aren't defined for the platform, the generic
> > non-atomic versions are utilized. Currently the MIPS-specific asm/io.h
> > file provides the q-accessors for any MIPS system even for ones, which
> > in fact don't support them and raise BUG() in case if any of them is
> > called. Due to this the generic versions of the accessors are never
> > used while an attempt to call the IO-methods causes the kernel BUG().
> > In order to fix this we need to define the q-accessors only for
> > the MIPS systems, which actually support them, and don't define them
> > otherwise, so to let the corresponding drivers to use the non-atomic
> > q-suffixed accessors.
> > 
> > Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
> > Suggested-by: Arnd Bergmann <arnd@arndb.de>
> > Cc: Vadim V. Vlasov <vadim.vlasov@t-platforms.ru>
> > ---
> >  arch/mips/include/asm/io.h | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> 
> So this seems pretty reasonable. Build testing all our defconfigs only
> showed up one issue for decstation_defconfig & decstation_r4k_defconfig:
> 
>   drivers/net/fddi/defza.c: In function 'fza_reads':
>   drivers/net/fddi/defza.c:88:17: error: implicit declaration of
>     function 'readq_relaxed'; did you mean 'readw_relaxed'?
>     [-Werror=implicit-function-declaration]
>    #define readq_u readq_relaxed
>                    ^~~~~~~~~~~~~
>   drivers/net/fddi/defza.c:126:13: note: in expansion of macro 'readq_u'
>       *dst++ = readq_u(src++);
>                ^~~~~~~
>   drivers/net/fddi/defza.c: In function 'fza_writes':
>   drivers/net/fddi/defza.c:92:18: error: implicit declaration of
>     function 'writeq_relaxed'; did you mean 'writel_relaxed'?
>     [-Werror=implicit-function-declaration]
>    #define writeq_u writeq_relaxed
>                     ^~~~~~~~~~~~~~
>   drivers/net/fddi/defza.c:151:4: note: in expansion of macro 'writeq_u'
>       writeq_u(*src++, dst++);
>       ^~~~~~~~
>     CC      net/core/scm.o
>   cc1: some warnings being treated as errors
>   make[4]: *** [scripts/Makefile.build:279: drivers/net/fddi/defza.o] Error 1
> 

Thanks for review and testing this for each target. I see you and Maciej already
agreed regarding the solution, and you even sent the fixup. So I don't have
to send the v2 patch.)

Regards,
-Sergey

> These uses of readq_relaxed & writeq_relaxed are both conditional upon
> sizeof(unsigned long) == 8, ie. upon CONFIG_64BIT=y so they're not going
> to present a runtime issue but we need to provide some implementation of
> the *q accessors to keep the compiler happy.
> 
> I see a few options:
> 
> 1) We could just have defza.c include <io-64-nonatomic-lo-hi.h> to get
>    the appropriate declarations, which should then get optimized away by
>    the compiler anyway & never actually be used.
> 
> 2) We could have defza.h #define its readq_u & writeq_u macros
>    differently for CONFIG_32BIT=y kernels, perhaps using
>    __compiletime_error to catch any bogus use of them.
> 
> 3) We could do the same in a generic header, though if nobody else has
>    needed it so far & this is the only place we need it then maybe it's
>    not worth it.
> 
> So I'm thinking option 2 might be best, as below. Having said that I
> don't mind option 1 either - it's simple. Maciej do you have any
> preference?
> 

> Thanks,
>     Paul
> 
> ---
> diff --git a/drivers/net/fddi/defza.c b/drivers/net/fddi/defza.c
> index c5cae8e74dc4..85d6a7f22fe7 100644
> --- a/drivers/net/fddi/defza.c
> +++ b/drivers/net/fddi/defza.c
> @@ -85,11 +85,21 @@ static u8 hw_addr_beacon[8] = { 0x01, 0x80, 0xc2, 0x00, 0x01, 0x00 };
>   */
>  #define readw_u readw_relaxed
>  #define readl_u readl_relaxed
> -#define readq_u readq_relaxed
>  
>  #define writew_u writew_relaxed
>  #define writel_u writel_relaxed
> -#define writeq_u writeq_relaxed
> +
> +#ifdef CONFIG_32BIT
> +extern u64 defza_readq_u(const void *ptr)
> +	__compiletime_error("readq_u should not be used by 32b kernels");
> +extern void defza_writeq_u(u64 val, void *ptr)
> +	__compiletime_error("writeq_u should not be used by 32b kernels");
> +# define readq_u defza_readq_u
> +# define writeq_u defza_writeq_u
> +#else
> +# define readq_u readq_relaxed
> +# define writeq_u writeq_relaxed
> +#endif
>  
>  static inline struct sk_buff *fza_alloc_skb_irq(struct net_device *dev,
>  						unsigned int length)
>
Arnd Bergmann June 21, 2019, 9:25 a.m. UTC | #4
On Thu, Jun 20, 2019 at 8:19 PM Maciej W. Rozycki <macro@linux-mips.org> wrote:
>
> On Thu, 20 Jun 2019, Paul Burton wrote:
>
> > So this seems pretty reasonable. Build testing all our defconfigs only
> > showed up one issue for decstation_defconfig & decstation_r4k_defconfig:
> >
> >   drivers/net/fddi/defza.c: In function 'fza_reads':
> >   drivers/net/fddi/defza.c:88:17: error: implicit declaration of
> >     function 'readq_relaxed'; did you mean 'readw_relaxed'?
> >     [-Werror=implicit-function-declaration]
> >    #define readq_u readq_relaxed
> >                    ^~~~~~~~~~~~~
> >   drivers/net/fddi/defza.c:126:13: note: in expansion of macro 'readq_u'
> >       *dst++ = readq_u(src++);
> >                ^~~~~~~
> >   drivers/net/fddi/defza.c: In function 'fza_writes':
> >   drivers/net/fddi/defza.c:92:18: error: implicit declaration of
> >     function 'writeq_relaxed'; did you mean 'writel_relaxed'?
> >     [-Werror=implicit-function-declaration]
> >    #define writeq_u writeq_relaxed
> >                     ^~~~~~~~~~~~~~
> >   drivers/net/fddi/defza.c:151:4: note: in expansion of macro 'writeq_u'
> >       writeq_u(*src++, dst++);
> >       ^~~~~~~~
> >     CC      net/core/scm.o
> >   cc1: some warnings being treated as errors
> >   make[4]: *** [scripts/Makefile.build:279: drivers/net/fddi/defza.o] Error 1
> >
> > These uses of readq_relaxed & writeq_relaxed are both conditional upon
> > sizeof(unsigned long) == 8, ie. upon CONFIG_64BIT=y so they're not going
> > to present a runtime issue but we need to provide some implementation of
> > the *q accessors to keep the compiler happy.
> >
> > I see a few options:
> >
> > 1) We could just have defza.c include <io-64-nonatomic-lo-hi.h> to get
> >    the appropriate declarations, which should then get optimized away by
> >    the compiler anyway & never actually be used.
>
>  This, definitely.

The compiler should generally not be allowed to combine two adjacent
readl_relaxed() back into a 64-bit load. Only __raw_readl() can be
combined or split up. If the mips version of the *_relaxed() accessors
allows the compiler to do this, I would consider that a bug.

> > 2) We could have defza.h #define its readq_u & writeq_u macros
> >    differently for CONFIG_32BIT=y kernels, perhaps using
> >    __compiletime_error to catch any bogus use of them.
> >
> > 3) We could do the same in a generic header, though if nobody else has
> >    needed it so far & this is the only place we need it then maybe it's
> >    not worth it.
> >
> > So I'm thinking option 2 might be best, as below. Having said that I
> > don't mind option 1 either - it's simple. Maciej do you have any
> > preference?
>
>  The use of 64-bit operations to access option's packet memory, which is
> true SRAM, i.e. no side effects, is to improve throughput only and there's
> no need for atomicity here nor also any kind of barriers, except at the
> conclusion.  Splitting 64-bit accesses into 32-bit halves in software
> would not be a functional error here.

The other property of packet memory and similar things is that you
basically want memcpy()-behavior with no byteswaps. This is one
of the few cases in which __raw_readq() is actually the right accessor
in (mostly) portable code.

       Arnd
Maciej W. Rozycki June 21, 2019, 10:09 a.m. UTC | #5
On Fri, 21 Jun 2019, Arnd Bergmann wrote:

> >  The use of 64-bit operations to access option's packet memory, which is
> > true SRAM, i.e. no side effects, is to improve throughput only and there's
> > no need for atomicity here nor also any kind of barriers, except at the
> > conclusion.  Splitting 64-bit accesses into 32-bit halves in software
> > would not be a functional error here.
> 
> The other property of packet memory and similar things is that you
> basically want memcpy()-behavior with no byteswaps. This is one
> of the few cases in which __raw_readq() is actually the right accessor
> in (mostly) portable code.

 Correct, but we're missing an `__raw_readq_relaxed', etc. interface and 
having additional barriers applied on every access would hit performance 
very badly; in fact even the barriers `*_relaxed' accessors imply would 
best be removed in this use (which is why defza.c uses `readw_o' vs 
`readw_u', etc. internally), but after all the struggles over the years 
for weakly ordered internal APIs x86 people are so averse to I'm not sure 
if I want to start another one.  We can get away with `readq_relaxed' in 
this use though as all the systems this device can be used with are 
little-endian as is TURBOchannel, so no byte-swapping will ever actually 
occur.

  Maciej
Arnd Bergmann June 21, 2019, 11:09 a.m. UTC | #6
On Fri, Jun 21, 2019 at 12:09 PM Maciej W. Rozycki <macro@linux-mips.org> wrote:
>
> On Fri, 21 Jun 2019, Arnd Bergmann wrote:
>
> > >  The use of 64-bit operations to access option's packet memory, which is
> > > true SRAM, i.e. no side effects, is to improve throughput only and there's
> > > no need for atomicity here nor also any kind of barriers, except at the
> > > conclusion.  Splitting 64-bit accesses into 32-bit halves in software
> > > would not be a functional error here.
> >
> > The other property of packet memory and similar things is that you
> > basically want memcpy()-behavior with no byteswaps. This is one
> > of the few cases in which __raw_readq() is actually the right accessor
> > in (mostly) portable code.
>
>  Correct, but we're missing an `__raw_readq_relaxed', etc. interface and
> having additional barriers applied on every access would hit performance
> very badly;

How so? __raw_readq() by definition has the least barriers of
all, you can't make it more relaxed than it already is.

> in fact even the barriers `*_relaxed' accessors imply would
> best be removed in this use (which is why defza.c uses `readw_o' vs
> `readw_u', etc. internally), but after all the struggles over the years
> for weakly ordered internal APIs x86 people are so averse to I'm not sure
> if I want to start another one.  We can get away with `readq_relaxed' in
> this use though as all the systems this device can be used with are
> little-endian as is TURBOchannel, so no byte-swapping will ever actually
> occur.

I still don't see any downside of using __raw_readq() here, while the
upsides are:

- makes the driver portable to big-endian kernels (even though we don't
  care)
- avoids all barriers
- fixes the build regression.

      Arnd
Maciej W. Rozycki June 21, 2019, 12:24 p.m. UTC | #7
On Fri, 21 Jun 2019, Arnd Bergmann wrote:

> > > The other property of packet memory and similar things is that you
> > > basically want memcpy()-behavior with no byteswaps. This is one
> > > of the few cases in which __raw_readq() is actually the right accessor
> > > in (mostly) portable code.
> >
> >  Correct, but we're missing an `__raw_readq_relaxed', etc. interface and
> > having additional barriers applied on every access would hit performance
> > very badly;
> 
> How so? __raw_readq() by definition has the least barriers of
> all, you can't make it more relaxed than it already is.

 Well, `__raw_readq' has all the barriers plain `readq' has except it does 
not ever do byte-swapping (which may be bad where address swizzling is 
also present).  Whereas `readq_relaxed' at least avoids the trailing DMA 
barrier.

 This is what the MIPS version has:

#define __BUILD_MEMORY_SINGLE(pfx, bwlq, type, barrier, relax, irq)	\
[...]

#define __BUILD_MEMORY_PFX(bus, bwlq, type, relax)			\
									\
__BUILD_MEMORY_SINGLE(bus, bwlq, type, 1, relax, 1)

#define BUILDIO_MEM(bwlq, type)						\
									\
__BUILD_MEMORY_PFX(__raw_, bwlq, type, 0)				\
__BUILD_MEMORY_PFX(__relaxed_, bwlq, type, 1)				\
__BUILD_MEMORY_PFX(__mem_, bwlq, type, 0)				\
__BUILD_MEMORY_PFX(, bwlq, type, 0)

So `barrier' is always passed 1 and consequently all the accessors have a 
leading MMIO ordering barrier inserted and only `__relaxed_*' ones have 
`relax' set to 0 making them skip the trailing MMIO read vs DMA ordering 
barrier.  This is in accordance to Documentation/memory-barriers.txt I 
believe.

 NB I got one part wrong in the previous e-mail, sorry, as for packet 
memory accesses etc. the correct accessors are actually `__mem_*' rather 
than `__raw_*' ones, but the former ones are not portable.  I always 
forget about this peculiarity and it took us years to get it right with 
the MIPS port and the old IDE subsystem when doing PIO.

 The `__mem_*' handlers still do whetever system-specific transformation 
is required to present data in the memory rather than CPU byte ordering.  
See arch/mips/include/asm/mach-ip27/mangle-port.h for a non-trivial 
example and arch/mips/include/asm/mach-generic/mangle-port.h for the 
general case.  Whereas `__raw_*' pass raw data unchanged and are generally 
only suitable for accesses to onchip SOC MMIO or similar resources that do 
not traverse any external bus where a system's endianness may be observed.

 So contrary to what I have written before for the theoretical case of a 
big-endian system possibly doing address swizzling we'd have to define and 
use `__mem_readq_unordered', etc. here rather than `__raw_readq_relaxed', 
etc.

> > in fact even the barriers `*_relaxed' accessors imply would
> > best be removed in this use (which is why defza.c uses `readw_o' vs
> > `readw_u', etc. internally), but after all the struggles over the years
> > for weakly ordered internal APIs x86 people are so averse to I'm not sure
> > if I want to start another one.  We can get away with `readq_relaxed' in
> > this use though as all the systems this device can be used with are
> > little-endian as is TURBOchannel, so no byte-swapping will ever actually
> > occur.
> 
> I still don't see any downside of using __raw_readq() here, while the
> upsides are:
> 
> - makes the driver portable to big-endian kernels (even though we don't
>   care)
> - avoids all barriers
> - fixes the build regression.

 Giving my observations above it would only address item #3 on your list, 
while addressing #1 and #2 would require defining `__mem_readq_unordered', 
etc. I am afraid.

 Have I missed anything?

  Maciej
Arnd Bergmann June 21, 2019, 2:01 p.m. UTC | #8
On Fri, Jun 21, 2019 at 2:24 PM Maciej W. Rozycki <macro@linux-mips.org> wrote:
> On Fri, 21 Jun 2019, Arnd Bergmann wrote:
> > > > The other property of packet memory and similar things is that you
> > > > basically want memcpy()-behavior with no byteswaps. This is one
> > > > of the few cases in which __raw_readq() is actually the right accessor
> > > > in (mostly) portable code.
> > >
> > >  Correct, but we're missing an `__raw_readq_relaxed', etc. interface and
> > > having additional barriers applied on every access would hit performance
> > > very badly;
> >
> > How so? __raw_readq() by definition has the least barriers of
> > all, you can't make it more relaxed than it already is.
>
>  Well, `__raw_readq' has all the barriers plain `readq' has except it does
> not ever do byte-swapping (which may be bad where address swizzling is
> also present).  Whereas `readq_relaxed' at least avoids the trailing DMA
> barrier.
>
>  This is what the MIPS version has:
>
> #define __BUILD_MEMORY_SINGLE(pfx, bwlq, type, barrier, relax, irq)     \
> [...]
>
> #define __BUILD_MEMORY_PFX(bus, bwlq, type, relax)                      \
>                                                                         \
> __BUILD_MEMORY_SINGLE(bus, bwlq, type, 1, relax, 1)
>
> #define BUILDIO_MEM(bwlq, type)                                         \
>                                                                         \
> __BUILD_MEMORY_PFX(__raw_, bwlq, type, 0)                               \
> __BUILD_MEMORY_PFX(__relaxed_, bwlq, type, 1)                           \
> __BUILD_MEMORY_PFX(__mem_, bwlq, type, 0)                               \
> __BUILD_MEMORY_PFX(, bwlq, type, 0)
>
> So `barrier' is always passed 1 and consequently all the accessors have a
> leading MMIO ordering barrier inserted and only `__relaxed_*' ones have
> `relax' set to 0 making them skip the trailing MMIO read vs DMA ordering
> barrier.  This is in accordance to Documentation/memory-barriers.txt I
> believe.

It is definitely not what other architectures do here. In particular, the
asm-generic implementation that is now used on most of them
defines raw_readl() as

static inline u32 __raw_readl(const volatile void __iomem *addr)
{
        return *(const volatile u32 __force *)addr;
}

and there are a number of drivers that depend on this behavior.
readl_relaxed() typically adds the byteswap on this, and readl() adds
the barriers on top of readl_relaxed().

>  NB I got one part wrong in the previous e-mail, sorry, as for packet
> memory accesses etc. the correct accessors are actually `__mem_*' rather
> than `__raw_*' ones, but the former ones are not portable.  I always
> forget about this peculiarity and it took us years to get it right with
> the MIPS port and the old IDE subsystem when doing PIO.
>
>  The `__mem_*' handlers still do whetever system-specific transformation
> is required to present data in the memory rather than CPU byte ordering.
> See arch/mips/include/asm/mach-ip27/mangle-port.h for a non-trivial
> example and arch/mips/include/asm/mach-generic/mangle-port.h for the
> general case.  Whereas `__raw_*' pass raw data unchanged and are generally
> only suitable for accesses to onchip SOC MMIO or similar resources that do
> not traverse any external bus where a system's endianness may be observed.

Ok, so what you have for __mem_* is actually what I had expected from
__raw_* for an architecture, except for the barriers that should have been
left out.

>  So contrary to what I have written before for the theoretical case of a
> big-endian system possibly doing address swizzling we'd have to define and
> use `__mem_readq_unordered', etc. here rather than `__raw_readq_relaxed',
> etc.

Right.

> > > in fact even the barriers `*_relaxed' accessors imply would
> > > best be removed in this use (which is why defza.c uses `readw_o' vs
> > > `readw_u', etc. internally), but after all the struggles over the years
> > > for weakly ordered internal APIs x86 people are so averse to I'm not sure
> > > if I want to start another one.  We can get away with `readq_relaxed' in
> > > this use though as all the systems this device can be used with are
> > > little-endian as is TURBOchannel, so no byte-swapping will ever actually
> > > occur.
> >
> > I still don't see any downside of using __raw_readq() here, while the
> > upsides are:
> >
> > - makes the driver portable to big-endian kernels (even though we don't
> >   care)
> > - avoids all barriers
> > - fixes the build regression.
>
>  Giving my observations above it would only address item #3 on your list,
> while addressing #1 and #2 would require defining `__mem_readq_unordered',
> etc. I am afraid.
>
>  Have I missed anything?

No, I think you are right based on how mips defines __raw_readl().

Unfortunately, this also means that all portable drivers using the
__raw_ accessors to do what you want here are broken on mips
(at least on big-endian), while mips drivers using __raw_* are not
portable to anything else.

      Arnd
Maciej W. Rozycki June 24, 2019, 7:13 p.m. UTC | #9
Arnd,

 We're getting into MMIO and barriers again, sigh.  Cc-ing people recently 
involved then.

On Fri, 21 Jun 2019, Arnd Bergmann wrote:

> > > > > The other property of packet memory and similar things is that you
> > > > > basically want memcpy()-behavior with no byteswaps. This is one
> > > > > of the few cases in which __raw_readq() is actually the right accessor
> > > > > in (mostly) portable code.
> > > >
> > > >  Correct, but we're missing an `__raw_readq_relaxed', etc. interface and
> > > > having additional barriers applied on every access would hit performance
> > > > very badly;
> > >
> > > How so? __raw_readq() by definition has the least barriers of
> > > all, you can't make it more relaxed than it already is.
> >
> >  Well, `__raw_readq' has all the barriers plain `readq' has except it does
> > not ever do byte-swapping (which may be bad where address swizzling is
> > also present).  Whereas `readq_relaxed' at least avoids the trailing DMA
> > barrier.
> >
> >  This is what the MIPS version has:
> >
> > #define __BUILD_MEMORY_SINGLE(pfx, bwlq, type, barrier, relax, irq)     \
> > [...]
> >
> > #define __BUILD_MEMORY_PFX(bus, bwlq, type, relax)                      \
> >                                                                         \
> > __BUILD_MEMORY_SINGLE(bus, bwlq, type, 1, relax, 1)
> >
> > #define BUILDIO_MEM(bwlq, type)                                         \
> >                                                                         \
> > __BUILD_MEMORY_PFX(__raw_, bwlq, type, 0)                               \
> > __BUILD_MEMORY_PFX(__relaxed_, bwlq, type, 1)                           \
> > __BUILD_MEMORY_PFX(__mem_, bwlq, type, 0)                               \
> > __BUILD_MEMORY_PFX(, bwlq, type, 0)
> >
> > So `barrier' is always passed 1 and consequently all the accessors have a
> > leading MMIO ordering barrier inserted and only `__relaxed_*' ones have
> > `relax' set to 0 making them skip the trailing MMIO read vs DMA ordering
> > barrier.  This is in accordance to Documentation/memory-barriers.txt I
> > believe.
> 
> It is definitely not what other architectures do here. In particular, the
> asm-generic implementation that is now used on most of them
> defines raw_readl() as
> 
> static inline u32 __raw_readl(const volatile void __iomem *addr)
> {
>         return *(const volatile u32 __force *)addr;
> }
> 
> and there are a number of drivers that depend on this behavior.
> readl_relaxed() typically adds the byteswap on this, and readl() adds
> the barriers on top of readl_relaxed().

 If the lack of any ordering barriers with `__raw_*' accessors is indeed 
intended, then we need to document it in `memory-barriers.txt' as the 
expected semantics.

 This however does not make the interface suitable for accesses to 
memory-like MMIO resources, e.g. string I/O, but also the ATA data port, 
unless you want to make/keep the nomenclature more confusing than it has 
to be.

 First some backgound information.

1. For systems that have a uniform endianness throughout there's obviously 
   no issue.  We only need to choose between strongly-ordered and, if 
   applicable weakly-ordered accesses.  As the minimum and at the cost of 
   performance we can enforce strong MMIO ordering everywhere.

2. For systems parts of which have the opposite endianness we can have two 
   variants (or endianness policies) implemented in hardware for accesses 
   that cross the boundary of the two endiannesses:

   a. Byte lane matching.

      In this case individual byte addresses are preserved while accessing 
      a device and byte enables are generated directly from the least 
      significant bits of the CPU address, meaning that the device is 
      addressed within its decoding window directly as documented.

      The data quantity accessed, however, has to be byte-swapped to be 
      interpreted within the domain of the opposite endianness as a 
      numerical value or a bit pattern if wider than 8 bits.

   b. Bit lane matching.

      In this case individual byte addresses are mangled while accessing a 
      device and byte enables are generated from the inverse of the 
      leastsignificant bits of the CPU address, meaning that to calculate 
      a device address the CPU address has to be transformed, unless 
      accessing an aligned data quantity of the bus width.

      The data quantity accessed, however, can be used directly to be 
      interpreted within the domain of the opposite endianness as a 
      numerical value or a bit pattern regardless of its width.

   Like with the systems of a uniform endianness ordering is to be 
   considered separately.

Existing computer systems based on the MIPS processor implement either 
endianness policy or both.  The existence of systems that do bit lane 
matching is why we have all the arch/mips/include/asm/mach-*/mangle-port.h 
headers.

 I actually own a SOC-based system that implements both, when strapped at 
reset for the big endianness, by having two MMIO windows defined in its 
address space, which assume each of the policies respectively, and the 
boundary between the two endianness domains is at the PCI host bridge.  
This means that device accesses that cross the PCI bridge need address 
mangling or byte-swapping as necessary while device accesses that access 
onchip SOC devices or external devices wired to SOC's interfaces other 
than PCI require straight through accesses.

 The implementers of our Linux port for this system chose to use the byte 
lane matching policy exclusively, although for performance reason it might 
make sense to switch between the two policies depending on whether data is 
to be interpreted as a value or memory contents as the offset between the 
two windows is fixed.

 Anyway, for a system like this and regardless of any ordering requirement 
we clearly need to have three kinds of accessors:

1. Native (pass-through data, never with address mangling).

2. Value (data byte-swapped if required, possibly with address mangling).

3. Memory (data byte-swapped if required, possibly with address mangling).

And I think the natural nomenclature for the three kinds of accessors 
respectively is:

1. `__raw_*'.

2. `*' (no prefix by long-established practice, although `__val_*' would 
   do too, IMO).

3. `__mem_*'.

So I think the MIPS port got it right (and the rest is confused, IMHO).

 Furthermore, going back to the ordering requirement, I think we also need 
to have strongly-ordered native accessors, so that driver writers do not 
have to be bothered about sprinkling barriers in the common case.

 The MIPS port currently meets all the requirements, although it does not 
provide weakly ordered accessors.

> >  NB I got one part wrong in the previous e-mail, sorry, as for packet
> > memory accesses etc. the correct accessors are actually `__mem_*' rather
> > than `__raw_*' ones, but the former ones are not portable.  I always
> > forget about this peculiarity and it took us years to get it right with
> > the MIPS port and the old IDE subsystem when doing PIO.
> >
> >  The `__mem_*' handlers still do whetever system-specific transformation
> > is required to present data in the memory rather than CPU byte ordering.
> > See arch/mips/include/asm/mach-ip27/mangle-port.h for a non-trivial
> > example and arch/mips/include/asm/mach-generic/mangle-port.h for the
> > general case.  Whereas `__raw_*' pass raw data unchanged and are generally
> > only suitable for accesses to onchip SOC MMIO or similar resources that do
> > not traverse any external bus where a system's endianness may be observed.
> 
> Ok, so what you have for __mem_* is actually what I had expected from
> __raw_* for an architecture, except for the barriers that should have been
> left out.

 Please see above for an explanation as to why `__mem_*' is equivalent to 
neither `__raw_*' nor plain `*'.

> >  So contrary to what I have written before for the theoretical case of a
> > big-endian system possibly doing address swizzling we'd have to define and
> > use `__mem_readq_unordered', etc. here rather than `__raw_readq_relaxed',
> > etc.
> 
> Right.
> 
> > > > in fact even the barriers `*_relaxed' accessors imply would
> > > > best be removed in this use (which is why defza.c uses `readw_o' vs
> > > > `readw_u', etc. internally), but after all the struggles over the years
> > > > for weakly ordered internal APIs x86 people are so averse to I'm not sure
> > > > if I want to start another one.  We can get away with `readq_relaxed' in
> > > > this use though as all the systems this device can be used with are
> > > > little-endian as is TURBOchannel, so no byte-swapping will ever actually
> > > > occur.
> > >
> > > I still don't see any downside of using __raw_readq() here, while the
> > > upsides are:
> > >
> > > - makes the driver portable to big-endian kernels (even though we don't
> > >   care)
> > > - avoids all barriers
> > > - fixes the build regression.
> >
> >  Giving my observations above it would only address item #3 on your list,
> > while addressing #1 and #2 would require defining `__mem_readq_unordered',
> > etc. I am afraid.
> >
> >  Have I missed anything?
> 
> No, I think you are right based on how mips defines __raw_readl().
> 
> Unfortunately, this also means that all portable drivers using the
> __raw_ accessors to do what you want here are broken on mips
> (at least on big-endian), while mips drivers using __raw_* are not
> portable to anything else.

 I guess your conclusion is correct then, but I maintain that the 
nomenclature chosen for the MIPS port years ago is the correct one.  This 
goes back to commit 4912ba72d6e2 ("Define mem_*() I/O accessory functions 
that preserve byte addresses.") which defined `mem_*' accessors later 
renamed to `__mem_*' with commit 290f10ae4230 ("mips: namespace pollution
- mem_... -> __mem_... in io.h").  I don't know at what stage of awareness 
of the three scenarios other ports were back then.

 Questions, comments, thoughts?

  Maciej
Paul Burton June 24, 2019, 9:16 p.m. UTC | #10
Hello,

Serge Semin wrote:
> There are some generic drivers in the kernel, which make use of the
> q-accessors or their derivatives. While at current asm/io.h the accessors
> are defined, their implementation is only applicable either for 64bit
> systems, or for systems with cpu_has_64bits flag set. Obviously there
> are MIPS systems which are neither of these, but still need to have
> those drivers supported. In this case the solution is to define some
> generic versions of the q-accessors, but with a limitation to be
> non-atomic. Such accessors are defined in the
> io-64-nonatomic-{hi-lo,lo-hi}.h file. The drivers which utilize the
> q-suffixed IO-methods are supposed to include the header file, so
> in case if these accessors aren't defined for the platform, the generic
> non-atomic versions are utilized. Currently the MIPS-specific asm/io.h
> file provides the q-accessors for any MIPS system even for ones, which
> in fact don't support them and raise BUG() in case if any of them is
> called. Due to this the generic versions of the accessors are never
> used while an attempt to call the IO-methods causes the kernel BUG().
> In order to fix this we need to define the q-accessors only for
> the MIPS systems, which actually support them, and don't define them
> otherwise, so to let the corresponding drivers to use the non-atomic
> q-suffixed accessors.
> 
> Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
> Suggested-by: Arnd Bergmann <arnd@arndb.de>
> Cc: Vadim V. Vlasov <vadim.vlasov@t-platforms.ru>

Applied to mips-next.

Thanks,
    Paul

[ This message was auto-generated; if you believe anything is incorrect
  then please email paul.burton@mips.com to report it. ]
diff mbox series

Patch

diff --git a/arch/mips/include/asm/io.h b/arch/mips/include/asm/io.h
index 29997e42480e..4597017f147b 100644
--- a/arch/mips/include/asm/io.h
+++ b/arch/mips/include/asm/io.h
@@ -462,7 +462,12 @@  __BUILD_MEMORY_PFX(, bwlq, type, 0)
 BUILDIO_MEM(b, u8)
 BUILDIO_MEM(w, u16)
 BUILDIO_MEM(l, u32)
+#ifdef CONFIG_64BIT
 BUILDIO_MEM(q, u64)
+#else
+__BUILD_MEMORY_PFX(__raw_, q, u64, 0)
+__BUILD_MEMORY_PFX(__mem_, q, u64, 0)
+#endif
 
 #define __BUILD_IOPORT_PFX(bus, bwlq, type)				\
 	__BUILD_IOPORT_SINGLE(bus, bwlq, type, 1, 0,)			\
@@ -488,12 +493,16 @@  __BUILDIO(q, u64)
 #define readb_relaxed			__relaxed_readb
 #define readw_relaxed			__relaxed_readw
 #define readl_relaxed			__relaxed_readl
+#ifdef CONFIG_64BIT
 #define readq_relaxed			__relaxed_readq
+#endif
 
 #define writeb_relaxed			__relaxed_writeb
 #define writew_relaxed			__relaxed_writew
 #define writel_relaxed			__relaxed_writel
+#ifdef CONFIG_64BIT
 #define writeq_relaxed			__relaxed_writeq
+#endif
 
 #define readb_be(addr)							\
 	__raw_readb((__force unsigned *)(addr))
@@ -516,8 +525,10 @@  __BUILDIO(q, u64)
 /*
  * Some code tests for these symbols
  */
+#ifdef CONFIG_64BIT
 #define readq				readq
 #define writeq				writeq
+#endif
 
 #define __BUILD_MEMORY_STRING(bwlq, type)				\
 									\