diff mbox series

[RFC,v6,1/6] riscv: mm: dma-noncoherent: Switch using function pointers for cache management

Message ID 20230106185526.260163-2-prabhakar.mahadev-lad.rj@bp.renesas.com (mailing list archive)
State Changes Requested
Delegated to: Palmer Dabbelt
Headers show
Series RISC-V non-coherent function pointer based cache management operations + non-coherent DMA support for AX45MP | expand

Checks

Context Check Description
conchuod/patch_count success Link
conchuod/cover_letter success Series has a cover letter
conchuod/tree_selection success Guessed tree name to be for-next
conchuod/fixes_present success Fixes tag not required for -next series
conchuod/maintainers_pattern success MAINTAINERS pattern errors before the patch: 13 and now 13
conchuod/verify_signedoff success Signed-off-by tag matches author and committer
conchuod/kdoc success Errors and warnings before: 0 this patch: 0
conchuod/module_param success Was 0 now: 0
conchuod/alphanumeric_selects success Out of order selects before the patch: 57 and now 57
conchuod/build_rv32_defconfig success Build OK
conchuod/build_warn_rv64 success Errors and warnings before: 2054 this patch: 2054
conchuod/dtb_warn_rv64 success Errors and warnings before: 0 this patch: 0
conchuod/header_inline fail Detected static functions without inline keyword in header files: 1
conchuod/checkpatch fail ERROR: Macros with complex values should be enclosed in parentheses WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? WARNING: line length of 102 exceeds 100 columns
conchuod/source_inline success Was 0 now: 0
conchuod/build_rv64_nommu_k210_defconfig fail Build failed
conchuod/verify_fixes success No Fixes tag
conchuod/build_rv64_nommu_virt_defconfig fail Build failed

Commit Message

Lad, Prabhakar Jan. 6, 2023, 6:55 p.m. UTC
From: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>

The current implementation of CMO was handled using the ALTERNATIVE_X()
macro; this was manageable when there were a limited number of platforms
using this. Now that we are having more and more platforms coming through
with the CMO the use of the ALTERNATIVE_X() macro becomes unmanageable.

To avoid such issues this patch switches to use of function pointers
instead of ALTERNATIVE_X() macro for cache management (the only draw being
performance over the previous approach).

void (*clean_range)(unsigned long addr, unsigned long size);
void (*inv_range)(unsigned long addr, unsigned long size);
void (*flush_range)(unsigned long addr, unsigned long size);

The above function pointers are provided to be overridden where platforms
using standard approach and for platforms who want handle the operation
based on the operation the below function pointer is provided:

void (*riscv_dma_noncoherent_cmo_ops)(void *vaddr, size_t size,
                                      enum dma_data_direction dir,
                                      enum dma_noncoherent_ops ops);

In the current patch we have moved the ZICBOM and T-Head CMO to use
function pointers.

Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
---
v5->v6
* New patch
---
 arch/riscv/errata/thead/errata.c         | 71 ++++++++++++++++++
 arch/riscv/include/asm/dma-noncoherent.h | 83 +++++++++++++++++++++
 arch/riscv/include/asm/errata_list.h     | 53 -------------
 arch/riscv/kernel/cpufeature.c           |  2 +
 arch/riscv/mm/dma-noncoherent.c          | 94 ++++++++++++++++++++++--
 arch/riscv/mm/pmem.c                     | 18 ++++-
 6 files changed, 260 insertions(+), 61 deletions(-)
 create mode 100644 arch/riscv/include/asm/dma-noncoherent.h

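To make the proposed dispatch concrete, here is a minimal standalone C model of the registration scheme the commit message describes. This is plain userspace code that mirrors the patch's names; it is a sketch of the mechanism, not the kernel implementation itself.

```c
#include <assert.h>
#include <stddef.h>

/* Standalone model of the patch's function-pointer dispatch (not kernel code). */
struct riscv_cache_ops {
	void (*clean_range)(unsigned long addr, unsigned long size);
	void (*inv_range)(unsigned long addr, unsigned long size);
	void (*flush_range)(unsigned long addr, unsigned long size);
};

/* All-NULL by default, as in the patch's noncoherent_cache_ops. */
static struct riscv_cache_ops noncoherent_cache_ops;

static void riscv_noncoherent_register_cache_ops(struct riscv_cache_ops *ops)
{
	if (ops)
		noncoherent_cache_ops = *ops;
}

/* Helper guards against a NULL pointer, as in the patch's header. */
static void riscv_dma_noncoherent_clean(void *vaddr, size_t size)
{
	if (noncoherent_cache_ops.clean_range)
		noncoherent_cache_ops.clean_range((unsigned long)vaddr, size);
}

/* A fake platform implementation that records what it was asked to clean. */
static unsigned long last_addr, last_size;
static void fake_clean_range(unsigned long addr, unsigned long size)
{
	last_addr = addr;
	last_size = size;
}
```

Before registration the helper is a no-op; after a platform registers its ops, the same call dispatches to the platform hook. This is the behaviour the review below discusses.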
Comments

Conor Dooley Jan. 6, 2023, 11:29 p.m. UTC | #1
On Fri, Jan 06, 2023 at 11:31:33PM +0100, Arnd Bergmann wrote:
> On Fri, Jan 6, 2023, at 19:55, Prabhakar wrote:
> > From: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
> > +struct riscv_cache_ops zicbom_cmo_ops = {
> > +	.clean_range = &zicbom_cmo_clean_range,
> > +	.inv_range = &zicbom_cmo_inval_range,
> > +	.flush_range = &zicbom_cmo_flush_range,
> > +};
> > +#else
> > +struct riscv_cache_ops zicbom_cmo_ops = {
> > +	.clean_range = NULL,
> > +	.inv_range = NULL,
> > +	.flush_range = NULL,
> > +	.riscv_dma_noncoherent_cmo_ops = NULL,
> > +};
> > +#endif
> > +EXPORT_SYMBOL(zicbom_cmo_ops);
> 
> Same here: If the ZICBOM ISA is disabled, nothing should
> reference zicbom_cmo_ops.

> Also, since ZICBOM is a standard
> extension, I think it makes sense to always have it enabled,
> at least whenever noncoherent DMA is supported, that way
> it can be the default that gets used in the absence of any
> nonstandard cache controller.

While I think of it, this is not possible as Zicbom requires toolchain
support whereas the alternative methods for non-coherent DMA do not.

Thanks,
Conor.
Conor Dooley Jan. 6, 2023, 11:47 p.m. UTC | #2
+CC Icenowy

Hey Prabhakar,

On Fri, Jan 06, 2023 at 06:55:21PM +0000, Prabhakar wrote:
> From: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
> 
> The current implementation of CMO was handled using the ALTERNATIVE_X()
> macro; this was manageable when there were a limited number of platforms
> using this. Now that we are having more and more platforms coming through
> with the CMO the use of the ALTERNATIVE_X() macro becomes unmanageable.

Nitpick time! This opening paragraph is all a little oddly worded IMO.
How about:

Currently, selecting which CMOs to use on a given platform is done using
an ALTERNATIVE_X() macro. This was manageable when there were just two
CMO implementations, but now that there are more and more platforms coming
needing custom CMOs, the use of the ALTERNATIVE_X() macro is unmanageable.

> To avoid such issues this patch switches to use of function pointers
> instead of ALTERNATIVE_X() macro for cache management (the only draw being

s/draw/drawback

> performance over the previous approach).

Did you notice a performance decrease or are you speculating?
Perhaps Icenowy - who I know benched their implementation of a new CMO
macro for THEAD stuff can do so again for this.

> void (*clean_range)(unsigned long addr, unsigned long size);
> void (*inv_range)(unsigned long addr, unsigned long size);
> void (*flush_range)(unsigned long addr, unsigned long size);
> 
> The above function pointers are provided to be overridden where platforms
> using standard approach and for platforms who want handle the operation
> based on the operation the below function pointer is provided:

Think you've left out some words here chief!
Perhaps s/and for platforms who/. For platforms that/ & on the line
after, s/operation/direction/ ?

> void (*riscv_dma_noncoherent_cmo_ops)(void *vaddr, size_t size,
>                                       enum dma_data_direction dir,
>                                       enum dma_noncoherent_ops ops);
> 
> In the current patch we have moved the ZICBOM and T-Head CMO to use
> function pointers.

"Convert ZICBOM..."

> Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
> ---
> v5->v6
> * New patch
> ---
>  arch/riscv/errata/thead/errata.c         | 71 ++++++++++++++++++
>  arch/riscv/include/asm/dma-noncoherent.h | 83 +++++++++++++++++++++
>  arch/riscv/include/asm/errata_list.h     | 53 -------------
>  arch/riscv/kernel/cpufeature.c           |  2 +
>  arch/riscv/mm/dma-noncoherent.c          | 94 ++++++++++++++++++++++--
>  arch/riscv/mm/pmem.c                     | 18 ++++-
>  6 files changed, 260 insertions(+), 61 deletions(-)
>  create mode 100644 arch/riscv/include/asm/dma-noncoherent.h
> 
> diff --git a/arch/riscv/errata/thead/errata.c b/arch/riscv/errata/thead/errata.c
> index fac5742d1c1e..826b2ba3e61e 100644
> --- a/arch/riscv/errata/thead/errata.c
> +++ b/arch/riscv/errata/thead/errata.c
> @@ -8,12 +8,72 @@
>  #include <linux/module.h>
>  #include <linux/string.h>
>  #include <linux/uaccess.h>
> +#include <asm/dma-noncoherent.h>
>  #include <asm/alternative.h>
>  #include <asm/cacheflush.h>
>  #include <asm/errata_list.h>
>  #include <asm/patch.h>
>  #include <asm/vendorid_list.h>
>  
> +#ifdef CONFIG_ERRATA_THEAD_CMO
> +/*
> + * dcache.ipa rs1 (invalidate, physical address)
> + * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 |
> + *   0000001    01010      rs1       000      00000  0001011
> + * dcache.iva rs1 (invalidate, virtual address)
> + *   0000001    00110      rs1       000      00000  0001011
> + *
> + * dcache.cpa rs1 (clean, physical address)
> + * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 |
> + *   0000001    01001      rs1       000      00000  0001011
> + * dcache.cva rs1 (clean, virtual address)
> + *   0000001    00100      rs1       000      00000  0001011
> + *
> + * dcache.cipa rs1 (clean then invalidate, physical address)
> + * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 |
> + *   0000001    01011      rs1       000      00000  0001011
> + * dcache.civa rs1 (... virtual address)
> + *   0000001    00111      rs1       000      00000  0001011
> + *
> + * sync.s (make sure all cache operations finished)
> + * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 |
> + *   0000000    11001     00000      000      00000  0001011
> + */
> +#define THEAD_inval_A0	".long 0x0265000b"
> +#define THEAD_clean_A0	".long 0x0245000b"
> +#define THEAD_flush_A0	".long 0x0275000b"
> +#define THEAD_SYNC_S	".long 0x0190000b"
> +
> +#define THEAD_CMO_OP(_op, _start, _size, _cachesize)				\
> +	asm volatile("mv a0, %1\n\t"						\
> +		     "j 2f\n\t"							\
> +		     "3:\n\t"							\
> +		     THEAD_##_op##_A0 "\n\t"					\
> +		     "add a0, a0, %0\n\t"					\
> +		     "2:\n\t"							\
> +		     "bltu a0, %2, 3b\n\t"					\
> +		     THEAD_SYNC_S						\
> +		     : : "r"(_cachesize),					\
> +			 "r"((unsigned long)(_start) & ~((_cachesize) - 1UL)),	\
> +			 "r"((unsigned long)(_start) + (_size))			\
> +		     : "a0")
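For reference, the fixed `.long` values above can be cross-checked against the field layout documented in the comment: funct7 in bits [31:25], a sub-opcode in bits [24:20], rs1 (a0 = x10) in bits [19:15], and the custom 0x0b major opcode in bits [6:0]. A tiny helper (the helper name is ours, not from the patch) reproduces each encoding:

```c
#include <stdint.h>

/* Rebuild a T-Head CMO opcode from the fields in the comment above:
 * funct7[31:25] | op[24:20] | rs1[19:15] | funct3=000 | rd=00000 | 0x0b[6:0]
 */
static uint32_t thead_cmo_insn(uint32_t funct7, uint32_t op, uint32_t rs1)
{
	return (funct7 << 25) | (op << 20) | (rs1 << 15) | 0x0b;
}
```

With funct7=1 and rs1=10 (a0), sub-opcodes 0x06/0x04/0x07 yield the iva/cva/civa encodings, and funct7=0, op=0x19, rs1=0 yields sync.s.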

I'm not going to provide a tested-by for the THEAD stuff, as I have no
metrics - but I booted my usual NFS and did some tyre kicking. Seemed
grand.

> +static void thead_cmo_clean_range(unsigned long addr, unsigned long size)
> +{
> +	THEAD_CMO_OP(clean, addr, size, riscv_cbom_block_size);
> +}
> +
> +static void thead_cmo_flush_range(unsigned long addr, unsigned long size)
> +{
> +	THEAD_CMO_OP(flush, addr, size, riscv_cbom_block_size);
> +}
> +
> +static void thead_cmo_inval_range(unsigned long addr, unsigned long size)
> +{
> +	THEAD_CMO_OP(inval, addr, size, riscv_cbom_block_size);
> +}
> +#endif
> +
>  static bool errata_probe_pbmt(unsigned int stage,
>  			      unsigned long arch_id, unsigned long impid)
>  {
> @@ -33,6 +93,8 @@ static bool errata_probe_pbmt(unsigned int stage,
>  static bool errata_probe_cmo(unsigned int stage,
>  			     unsigned long arch_id, unsigned long impid)
>  {
> +	struct riscv_cache_ops thead_cmo_ops;
> +
>  	if (!IS_ENABLED(CONFIG_ERRATA_THEAD_CMO))
>  		return false;
>  
> @@ -44,6 +106,15 @@ static bool errata_probe_cmo(unsigned int stage,
>  
>  	riscv_cbom_block_size = L1_CACHE_BYTES;
>  	riscv_noncoherent_supported();
> +
> +	memset(&thead_cmo_ops, 0x0, sizeof(thead_cmo_ops));
> +	if (IS_ENABLED(CONFIG_ERRATA_THEAD_CMO)) {

With CONFIG_ERRATA_THEAD_CMO=n:

/stuff/linux/arch/riscv/errata/thead/errata.c: In function 'errata_probe_cmo':
/stuff/linux/arch/riscv/errata/thead/errata.c:112:46: error: 'thead_cmo_clean_range' undeclared (first use in this function)
  112 |                 thead_cmo_ops.clean_range = &thead_cmo_clean_range;
      |                                              ^~~~~~~~~~~~~~~~~~~~~
/stuff/linux/arch/riscv/errata/thead/errata.c:112:46: note: each undeclared identifier is reported only once for each function it appears in
/stuff/linux/arch/riscv/errata/thead/errata.c:113:44: error: 'thead_cmo_inval_range' undeclared (first use in this function)
  113 |                 thead_cmo_ops.inv_range = &thead_cmo_inval_range;
      |                                            ^~~~~~~~~~~~~~~~~~~~~
/stuff/linux/arch/riscv/errata/thead/errata.c:114:46: error: 'thead_cmo_flush_range' undeclared (first use in this function)
  114 |                 thead_cmo_ops.flush_range = &thead_cmo_flush_range;
      |                                              ^~~~~~~~~~~~~~~~~~~~~

These functions are only present if CONFIG_ERRATA_THEAD_CMO is set,
so this cannot be an IS_ENABLED() as things stand.

> +		thead_cmo_ops.clean_range = &thead_cmo_clean_range;
> +		thead_cmo_ops.inv_range = &thead_cmo_inval_range;
> +		thead_cmo_ops.flush_range = &thead_cmo_flush_range;

As Arnd suggested, I'd rather see a `static const struct
riscv_cache_ops` type deal too, so you can just call register()
directly.

> +		riscv_noncoherent_register_cache_ops(&thead_cmo_ops);
> +	}
> +
>  	return true;
>  }
>  
> diff --git a/arch/riscv/include/asm/dma-noncoherent.h b/arch/riscv/include/asm/dma-noncoherent.h
> new file mode 100644
> index 000000000000..a2af863d2608
> --- /dev/null
> +++ b/arch/riscv/include/asm/dma-noncoherent.h
> @@ -0,0 +1,83 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (C) 2022 Renesas Electronics Corp.
> + */
> +
> +#ifndef __ASM_DMA_NONCOHERENT_H
> +#define __ASM_DMA_NONCOHERENT_H
> +
> +#include <linux/dma-direct.h>
> +
> +enum dma_noncoherent_ops {
> +	NON_COHERENT_SYNC_DMA_FOR_DEVICE = 0,
> +	NON_COHERENT_SYNC_DMA_FOR_CPU,
> +	NON_COHERENT_DMA_PREP,
> +	NON_COHERENT_DMA_PMEM,
> +};
> +
> +/*
> + * struct riscv_cache_ops - Structure for CMO function pointers

Drop the "function pointers" from everything here IMO.

> + * @clean_range: Function pointer for clean cache
> + * @inv_range: Function pointer for invalidate cache
> + * @flush_range: Function pointer for flushing the cache
> + * @riscv_dma_noncoherent_cmo_ops: Function pointer for platforms who want
> + *  to handle CMO themselves. If this function pointer is set rest of the
> + *  function pointers will be NULL.

Will be NULL? Or must be NULL?
Assuming you keep it, as I see Arnd is questioning it, I think a better
description is needed for this one. And a rename of the function name,
the one you have right now is oddly juxtaposed and a bit confusing.
I don't really have something much better to suggest, so I am gonna use
cmo_omnibus here...

@cmo_omnibus: For platforms whose CMO capabilities do not align with the
              standard cache management operations. If set, the specific
	      ops must be left unset.

Circling back ages later, cmo_universal()?

> +struct riscv_cache_ops {
> +	void (*clean_range)(unsigned long addr, unsigned long size);
> +	void (*inv_range)(unsigned long addr, unsigned long size);
> +	void (*flush_range)(unsigned long addr, unsigned long size);
> +	void (*riscv_dma_noncoherent_cmo_ops)(void *vaddr, size_t size,
> +					      enum dma_data_direction dir,
> +					      enum dma_noncoherent_ops ops);
> +};
> +
> +extern struct riscv_cache_ops zicbom_cmo_ops;
> +
> +#ifdef CONFIG_RISCV_DMA_NONCOHERENT
> +
> +extern struct riscv_cache_ops noncoherent_cache_ops;
> +
> +void riscv_noncoherent_register_cache_ops(struct riscv_cache_ops *ops);
> +
> +static inline void riscv_dma_noncoherent_clean(void *vaddr, size_t size)
> +{
> +	if (noncoherent_cache_ops.clean_range) {
> +		unsigned long addr = (unsigned long)vaddr;
> +
> +		noncoherent_cache_ops.clean_range(addr, size);
> +	}
> +}
> +
> +static inline void riscv_dma_noncoherent_flush(void *vaddr, size_t size)
> +{
> +	if (noncoherent_cache_ops.flush_range) {
> +		unsigned long addr = (unsigned long)vaddr;
> +
> +		noncoherent_cache_ops.flush_range(addr, size);
> +	}
> +}
> +
> +static inline void riscv_dma_noncoherent_inval(void *vaddr, size_t size)
> +{
> +	if (noncoherent_cache_ops.inv_range) {
> +		unsigned long addr = (unsigned long)vaddr;
> +
> +		noncoherent_cache_ops.inv_range(addr, size);
> +	}
> +}
> +
> +#else
> +
> +static void riscv_noncoherent_register_cache_ops(struct riscv_cache_ops *ops) {}
> +
> +static inline void riscv_dma_noncoherent_clean(void *vaddr, size_t size) {}
> +
> +static inline void riscv_dma_noncoherent_flush(void *vaddr, size_t size) {}
> +
> +static inline void riscv_dma_noncoherent_inval(void *vaddr, size_t size) {}

Can these ever be used if DMA_NONCOHERENT is not set? I think
riscv/mm/Makefile gates their usage on the symbol, no?

> +
> +#endif
> +
> +#endif	/* __ASM_DMA_NONCOHERENT_H */
> diff --git a/arch/riscv/include/asm/errata_list.h b/arch/riscv/include/asm/errata_list.h
> index 4180312d2a70..ae3fc8b80edd 100644

[...]

> diff --git a/arch/riscv/mm/dma-noncoherent.c b/arch/riscv/mm/dma-noncoherent.c
> index d919efab6eba..d9445c266bfd 100644
> --- a/arch/riscv/mm/dma-noncoherent.c
> +++ b/arch/riscv/mm/dma-noncoherent.c
> @@ -9,23 +9,82 @@
>  #include <linux/dma-map-ops.h>
>  #include <linux/mm.h>
>  #include <asm/cacheflush.h>
> +#include <asm/dma-noncoherent.h>
>  
>  static bool noncoherent_supported;
>  
> +struct riscv_cache_ops noncoherent_cache_ops = {
> +	.clean_range = NULL,
> +	.inv_range = NULL,
> +	.flush_range = NULL,
> +	.riscv_dma_noncoherent_cmo_ops = NULL,
> +};
> +EXPORT_SYMBOL(noncoherent_cache_ops);

These exports should be _GPL, no? The file is GPLed. Ditto the others.

> +#ifdef CONFIG_RISCV_ISA_ZICBOM
> +#define ZICBOM_CMO_OP(_op, _start, _size, _cachesize)				\
> +	asm volatile("mv a0, %1\n\t"						\
> +		     "j 2f\n\t"							\
> +		     "3:\n\t"							\
> +		     "cbo." __stringify(_op) " (a0)\n\t"			\
> +		     "add a0, a0, %0\n\t"					\
> +		     "2:\n\t"							\
> +		     "bltu a0, %2, 3b\n\t"					\
> +		     : : "r"(_cachesize),					\
> +			 "r"((unsigned long)(_start) & ~((_cachesize) - 1UL)),	\
> +			 "r"((unsigned long)(_start) + (_size))			\
> +		     : "a0")
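The loop bounds shared by ZICBOM_CMO_OP and THEAD_CMO_OP can be sanity-checked with a small standalone model: the start address is rounded down to a cache-block boundary and the loop runs until a0 reaches start + size, so a misaligned range still covers every block it touches. The helper name below is ours, not from the patch.

```c
/* Count how many cache-block ops the asm loop would issue for a range. */
static unsigned long cmo_op_count(unsigned long start, unsigned long size,
				  unsigned long blksz)
{
	unsigned long a0 = start & ~(blksz - 1UL);	/* round down to block */
	unsigned long end = start + size;
	unsigned long n = 0;

	while (a0 < end) {	/* mirrors "bltu a0, %2, 3b" */
		n++;
		a0 += blksz;
	}
	return n;
}
```

A 64-byte range starting mid-block spans two 64-byte blocks and so takes two ops; an aligned one takes one.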
> +
> +static void zicbom_cmo_clean_range(unsigned long addr, unsigned long size)
> +{
> +	ZICBOM_CMO_OP(clean, addr, size, riscv_cbom_block_size);
> +}
> +
> +static void zicbom_cmo_flush_range(unsigned long addr, unsigned long size)
> +{
> +	ZICBOM_CMO_OP(flush, addr, size, riscv_cbom_block_size);
> +}
> +
> +static void zicbom_cmo_inval_range(unsigned long addr, unsigned long size)
> +{
> +	ZICBOM_CMO_OP(inval, addr, size, riscv_cbom_block_size);
> +}
> +
> +struct riscv_cache_ops zicbom_cmo_ops = {
> +	.clean_range = &zicbom_cmo_clean_range,
> +	.inv_range = &zicbom_cmo_inval_range,
> +	.flush_range = &zicbom_cmo_flush_range,
> +};
> +#else
> +struct riscv_cache_ops zicbom_cmo_ops = {
> +	.clean_range = NULL,
> +	.inv_range = NULL,
> +	.flush_range = NULL,
> +	.riscv_dma_noncoherent_cmo_ops = NULL,
> +};
> +#endif
> +EXPORT_SYMBOL(zicbom_cmo_ops);
> +
>  void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
>  			      enum dma_data_direction dir)
>  {
>  	void *vaddr = phys_to_virt(paddr);
>  
> +	if (noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops) {
> +		noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops(vaddr, size, dir,
> +								    NON_COHERENT_SYNC_DMA_FOR_DEVICE);
> +		return;
> +	}
> +
>  	switch (dir) {
>  	case DMA_TO_DEVICE:
> -		ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
> +		riscv_dma_noncoherent_clean(vaddr, size);
>  		break;
>  	case DMA_FROM_DEVICE:
> -		ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
> +		riscv_dma_noncoherent_clean(vaddr, size);
>  		break;
>  	case DMA_BIDIRECTIONAL:
> -		ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size);
> +		riscv_dma_noncoherent_flush(vaddr, size);
>  		break;
>  	default:
>  		break;
> @@ -37,12 +96,18 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
>  {
>  	void *vaddr = phys_to_virt(paddr);
>  
> +	if (noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops) {
> +		noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops(vaddr, size, dir,
> +								    NON_COHERENT_SYNC_DMA_FOR_CPU);
> +		return;
> +	}
> +
>  	switch (dir) {
>  	case DMA_TO_DEVICE:
>  		break;
>  	case DMA_FROM_DEVICE:
>  	case DMA_BIDIRECTIONAL:
> -		ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size);
> +		riscv_dma_noncoherent_flush(vaddr, size);
>  		break;
>  	default:
>  		break;
> @@ -53,7 +118,13 @@ void arch_dma_prep_coherent(struct page *page, size_t size)
>  {
>  	void *flush_addr = page_address(page);
>  
> -	ALT_CMO_OP(flush, flush_addr, size, riscv_cbom_block_size);
> +	if (noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops) {
> +		noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops(flush_addr, size, -1,
> +								    NON_COHERENT_DMA_PREP);
> +		return;
> +	}
> +
> +	riscv_dma_noncoherent_flush(flush_addr, size);
>  }
>  
>  void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
> @@ -78,3 +149,16 @@ void riscv_noncoherent_supported(void)
>  	     "Non-coherent DMA support enabled without a block size\n");
>  	noncoherent_supported = true;
>  }

Barring the function name, which I really don't like - it really is
confusingly named. Something like cmo_universal() would get its point
across a bit better I think.

> +void riscv_noncoherent_register_cache_ops(struct riscv_cache_ops *ops)
> +{
> +	if (!ops)
> +		return;
> +
> +	if (ops->riscv_dma_noncoherent_cmo_ops)
> +		noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops =
> +				ops->riscv_dma_noncoherent_cmo_ops;
> +	else
> +		noncoherent_cache_ops = *ops;

Can we just chop off the extra step here? Behaviourally this is going to
be no different whether someone sets the universal op and the individual
ones anyway.
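That point can be checked with a small standalone model: because the dispatch in arch_sync_dma_for_device() tests the universal hook first and returns, copying the whole ops struct in register() behaves identically whether or not the individual range ops are also set. All names below are ours (plain C, not kernel code):

```c
#include <stddef.h>

typedef void (*range_fn)(unsigned long, unsigned long);
typedef void (*universal_fn)(void *, size_t, int, int);

struct ops {
	range_fn clean_range;
	universal_fn universal;	/* riscv_dma_noncoherent_cmo_ops in the patch */
};

static int used_universal, used_clean;

static void my_universal(void *v, size_t s, int dir, int op)
{
	(void)v; (void)s; (void)dir; (void)op;
	used_universal = 1;
}

static void my_clean(unsigned long a, unsigned long s)
{
	(void)a; (void)s;
	used_clean = 1;
}

static void sync_for_device(const struct ops *ops, void *v, size_t s)
{
	if (ops->universal) {		/* checked first, as in the patch */
		ops->universal(v, s, 0, 0);
		return;
	}
	if (ops->clean_range)
		ops->clean_range((unsigned long)v, s);
}
```

Even with both hooks populated, only the universal op ever runs, so the special case in register() buys nothing behaviourally.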

> +}
> +EXPORT_SYMBOL(riscv_noncoherent_register_cache_ops);

_GPL again I think!

I'm happy with this approach, modulo the minor bits I've pointed out.
I do really like that it takes us away from messing with an asm
alternative every time someone wants to do this stuff differently or for
a new platform.

I would like Heiko to have a look at what you've done here, but ye looks
promising IMO.

Thanks,
Conor.
Arnd Bergmann Jan. 7, 2023, 9:52 p.m. UTC | #3
On Sat, Jan 7, 2023, at 00:29, Conor Dooley wrote:
> On Fri, Jan 06, 2023 at 11:31:33PM +0100, Arnd Bergmann wrote:
>> On Fri, Jan 6, 2023, at 19:55, Prabhakar wrote:
>> > From: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
>> > +struct riscv_cache_ops zicbom_cmo_ops = {
>> > +	.clean_range = &zicbom_cmo_clean_range,
>> > +	.inv_range = &zicbom_cmo_inval_range,
>> > +	.flush_range = &zicbom_cmo_flush_range,
>> > +};
>> > +#else
>> > +struct riscv_cache_ops zicbom_cmo_ops = {
>> > +	.clean_range = NULL,
>> > +	.inv_range = NULL,
>> > +	.flush_range = NULL,
>> > +	.riscv_dma_noncoherent_cmo_ops = NULL,
>> > +};
>> > +#endif
>> > +EXPORT_SYMBOL(zicbom_cmo_ops);
>> 
>> Same here: If the ZICBOM ISA is disabled, nothing should
>> reference zicbom_cmo_ops.
>
>> Also, since ZICBOM is a standard
>> extension, I think it makes sense to always have it enabled,
>> at least whenever noncoherent DMA is supported, that way
>> it can be the default that gets used in the absence of any
>> nonstandard cache controller.
>
> While I think of it, this is not possible as Zicbom requires toolchain
> support whereas the alternative methods for non-coherent DMA do not.

Ah, I see. Would it be possible to use the same .long trick
as in the other ones though? Something like

#if CONFIG_AS_VERSION >= 23600 /* or whichever version */
     /* proper inline asm */
#else
     /* .long hack */
#endif

That way everyone can use it, and the hack would automatically
go away in a few years after linux requires a newer toolchain.

Alternatively, the entire noncoherent-dma support could be
made to depend on whichever toolchain introduced Zicbom.

     Arnd
Lad, Prabhakar Jan. 7, 2023, 10:10 p.m. UTC | #4
Hi Arnd,

Thank you for the review.

On Fri, Jan 6, 2023 at 10:31 PM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Fri, Jan 6, 2023, at 19:55, Prabhakar wrote:
> > From: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
> >
> > The current implementation of CMO was handled using the ALTERNATIVE_X()
> > macro; this was manageable when there were a limited number of platforms
> > using this. Now that we are having more and more platforms coming through
> > with the CMO the use of the ALTERNATIVE_X() macro becomes unmanageable.
> >
> > To avoid such issues this patch switches to use of function pointers
> > instead of ALTERNATIVE_X() macro for cache management (the only draw being
> > performance over the previous approach).
> >
> > void (*clean_range)(unsigned long addr, unsigned long size);
> > void (*inv_range)(unsigned long addr, unsigned long size);
> > void (*flush_range)(unsigned long addr, unsigned long size);
> >
> > The above function pointers are provided to be overridden where platforms
> > using standard approach and for platforms who want handle the operation
> > based on the operation the below function pointer is provided:
> >
> > void (*riscv_dma_noncoherent_cmo_ops)(void *vaddr, size_t size,
> >                                       enum dma_data_direction dir,
> >                                       enum dma_noncoherent_ops ops);
> >
> > In the current patch we have moved the ZICBOM and T-Head CMO to use
> > function pointers.
> >
> > Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
>
> This looks like a nice improvement! I have a few suggestions
> for improvements, but no objections here.
>
> > diff --git a/arch/riscv/errata/thead/errata.c b/arch/riscv/errata/thead/errata.c
> > index fac5742d1c1e..826b2ba3e61e 100644
> > --- a/arch/riscv/errata/thead/errata.c
> > +++ b/arch/riscv/errata/thead/errata.c
> ...
> > @@ -44,6 +106,15 @@ static bool errata_probe_cmo(unsigned int stage,
> >
> >       riscv_cbom_block_size = L1_CACHE_BYTES;
> >       riscv_noncoherent_supported();
> > +
> > +     memset(&thead_cmo_ops, 0x0, sizeof(thead_cmo_ops));
> > +     if (IS_ENABLED(CONFIG_ERRATA_THEAD_CMO)) {
> > +             thead_cmo_ops.clean_range = &thead_cmo_clean_range;
> > +             thead_cmo_ops.inv_range = &thead_cmo_inval_range;
> > +             thead_cmo_ops.flush_range = &thead_cmo_flush_range;
> > +             riscv_noncoherent_register_cache_ops(&thead_cmo_ops);
> > +     }
>
> The implementation here looks reasonable, just wonder whether
> the classification as an 'errata' makes sense. I would probably
> consider this a 'driver' at this point, but that's just
> a question of personal preference.
>
Zicbom is a CPU feature that doesn't have any DT node and hence no
driver, and similarly for the T-HEAD SoC. Also, arch_setup_dma_ops()
happens quite early, before driver probing, due to which we get WARN()
messages during bootup; hence I have implemented it as errata, as
errata patching happens quite early.

> For the operations structure, I think a 'static const struct
> riscv_cache_ops' is more intuitive than assigning the
> members individually.
OK.

> > +
> > +enum dma_noncoherent_ops {
> > +     NON_COHERENT_SYNC_DMA_FOR_DEVICE = 0,
> > +     NON_COHERENT_SYNC_DMA_FOR_CPU,
> > +     NON_COHERENT_DMA_PREP,
> > +     NON_COHERENT_DMA_PMEM,
> > +};
> > +
> > +/*
> > + * struct riscv_cache_ops - Structure for CMO function pointers
> > + * @clean_range: Function pointer for clean cache
> > + * @inv_range: Function pointer for invalidate cache
> > + * @flush_range: Function pointer for flushing the cache
> > + * @riscv_dma_noncoherent_cmo_ops: Function pointer for platforms who
> > want
> > + *  to handle CMO themselves. If this function pointer is set rest of
> > the
> > + *  function pointers will be NULL.
> > + */
> > +struct riscv_cache_ops {
> > +     void (*clean_range)(unsigned long addr, unsigned long size);
> > +     void (*inv_range)(unsigned long addr, unsigned long size);
> > +     void (*flush_range)(unsigned long addr, unsigned long size);
> > +     void (*riscv_dma_noncoherent_cmo_ops)(void *vaddr, size_t size,
> > +                                           enum dma_data_direction dir,
> > +                                           enum dma_noncoherent_ops ops);
> > +};
>
> I don't quite see how the fourth operation is used here.
> Are there cache controllers that need something beyond
> clean/inv/flush?
>
This is for platforms that don't follow the standard cache operations
(like done in patch 5/6) and their drivers decide on the operations
depending on the ops and dir.

> > +
> > +#else
> > +
> > +static void riscv_noncoherent_register_cache_ops(struct
> > riscv_cache_ops *ops) {}
> > +
> > +static inline void riscv_dma_noncoherent_clean(void *vaddr, size_t
> > size) {}
> > +
> > +static inline void riscv_dma_noncoherent_flush(void *vaddr, size_t
> > size) {}
> > +
> > +static inline void riscv_dma_noncoherent_inval(void *vaddr, size_t
> > size) {}
>
> I think you can drop the #else path here: if there is no
> noncoherent DMA, then nothing should ever call these
> functions, right?
>
Yes it shouldn't be called.

> > +
> > +#ifdef CONFIG_RISCV_ISA_ZICBOM
> ...
> > +struct riscv_cache_ops zicbom_cmo_ops = {
> > +     .clean_range = &zicbom_cmo_clean_range,
> > +     .inv_range = &zicbom_cmo_inval_range,
> > +     .flush_range = &zicbom_cmo_flush_range,
> > +};
> > +#else
> > +struct riscv_cache_ops zicbom_cmo_ops = {
> > +     .clean_range = NULL,
> > +     .inv_range = NULL,
> > +     .flush_range = NULL,
> > +     .riscv_dma_noncoherent_cmo_ops = NULL,
> > +};
> > +#endif
> > +EXPORT_SYMBOL(zicbom_cmo_ops);
>
> Same here: If the ZICBOM ISA is disabled, nothing should
> reference zicbom_cmo_ops.
OK.

Cheers,
Prabhakar
Conor Dooley Jan. 7, 2023, 10:21 p.m. UTC | #5
On Sat, Jan 07, 2023 at 10:52:55PM +0100, Arnd Bergmann wrote:
> On Sat, Jan 7, 2023, at 00:29, Conor Dooley wrote:
> > On Fri, Jan 06, 2023 at 11:31:33PM +0100, Arnd Bergmann wrote:
> >> On Fri, Jan 6, 2023, at 19:55, Prabhakar wrote:
> >> > From: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
> >> > +struct riscv_cache_ops zicbom_cmo_ops = {
> >> > +	.clean_range = &zicbom_cmo_clean_range,
> >> > +	.inv_range = &zicbom_cmo_inval_range,
> >> > +	.flush_range = &zicbom_cmo_flush_range,
> >> > +};
> >> > +#else
> >> > +struct riscv_cache_ops zicbom_cmo_ops = {
> >> > +	.clean_range = NULL,
> >> > +	.inv_range = NULL,
> >> > +	.flush_range = NULL,
> >> > +	.riscv_dma_noncoherent_cmo_ops = NULL,
> >> > +};
> >> > +#endif
> >> > +EXPORT_SYMBOL(zicbom_cmo_ops);
> >> 
> >> Same here: If the ZICBOM ISA is disabled, nothing should
> >> reference zicbom_cmo_ops.
> >
> >> Also, since ZICBOM is a standard
> >> extension, I think it makes sense to always have it enabled,
> >> at least whenever noncoherent DMA is supported, that way
> >> it can be the default that gets used in the absence of any
> >> nonstandard cache controller.
> >
> > While I think of it, this is not possible as Zicbom requires toolchain
> > support whereas the alternative methods for non-coherent DMA do not.
> 
> Ah, I see. Would it be possible to use the same .long trick
> as in the other ones though? Something like
> 
> #if CONFIG_AS_VERSION >= 23600 /* or whichever version */


>      /* proper inline asm */
> #else
>      /* .long hack */
> #endif
> 
> That way everyone can use it, and the hack would automatically
> go away in a few years after linux requires a newer toolchain.

> Alternatively, the entire noncoherent-dma support could be
> made to depend on whichever toolchain introduced Zicbom.

Ehh, I don't think that's a great idea. It'd require far too recent a
toolchain IMO.

Ideally, in my opinion, we'd just do something like what Drew has
proposed for Zicboz, negating the need for a check at all:
https://lore.kernel.org/linux-riscv/20221027130247.31634-4-ajones@ventanamicro.com/

Been waiting for that to be re-spun and Palmer to accept it before doing
the same thing for Zicbom. At present we have this in the arch Kconfig:

config TOOLCHAIN_HAS_ZICBOM
	bool
	default y
	depends on !64BIT || $(cc-option,-mabi=lp64 -march=rv64ima_zicbom)
	depends on !32BIT || $(cc-option,-mabi=ilp32 -march=rv32ima_zicbom)
	depends on LLD_VERSION >= 150000 || LD_VERSION >= 23800

config RISCV_ISA_ZICBOM
	bool "Zicbom extension support for non-coherent DMA operation"
	depends on TOOLCHAIN_HAS_ZICBOM

The linker version check is entirely due to the linker having issues if
it sees zicbom in the ISA string in object files.

I'd been intending to do that for Zicbom anyway, so I guess I'll just go
do it & Prabhakar can attach it to his v7..

Thanks,
Conor.
Lad, Prabhakar Jan. 7, 2023, 10:36 p.m. UTC | #6
Hi Conor,

Thank you for the review.

On Fri, Jan 6, 2023 at 11:48 PM Conor Dooley <conor@kernel.org> wrote:
>
> +CC Icenowy
>
> Hey Prabhakar,
>
> On Fri, Jan 06, 2023 at 06:55:21PM +0000, Prabhakar wrote:
> > From: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
> >
> > The current implementation of CMO was handled using the ALTERNATIVE_X()
> > macro; this was manageable when there were a limited number of platforms
> > using this. Now that we are having more and more platforms coming through
> > with the CMO the use of the ALTERNATIVE_X() macro becomes unmanageable.
>
> Nitpick time! This opening paragraph is all a little oddly worded IMO.
> How about:
>
> Currently, selecting which CMOs to use on a given platform is done using
> an ALTERNATIVE_X() macro. This was manageable when there were just two
> CMO implementations, but now that there are more and more platforms coming
> needing custom CMOs, the use of the ALTERNATIVE_X() macro is unmanageable.
>
sounds good.

> > To avoid such issues this patch switches to use of function pointers
> > instead of ALTERNATIVE_X() macro for cache management (the only draw being
>
> s/draw/drawback
>
oops.

> > performance over the previous approach).
>
> Did you notice a performance decrease or are you speculating?
> Perhaps Icenowy - who I know benched their implementation of a new CMO
> macro for THEAD stuff can do so again for this.
>
I actually haven't benchmarked the speeds, so tbh I am speculating that
it is because of the additional checks. Maybe if I export the functions
instead of having them in the header, and have static keys for each
callback, that will reduce some lag, if any. Does that sound good?

> > void (*clean_range)(unsigned long addr, unsigned long size);
> > void (*inv_range)(unsigned long addr, unsigned long size);
> > void (*flush_range)(unsigned long addr, unsigned long size);
> >
> > The above function pointers are provided to be overridden where platforms
> > using standard approach and for platforms who want handle the operation
> > based on the operation the below function pointer is provided:
>
> Think you've left out some words here chief!
> Perhaps s/and for platforms who/. For platforms that/ & on the line
> after, s/operation/direction/ ?
>
I'll re-word it.

> > void (*riscv_dma_noncoherent_cmo_ops)(void *vaddr, size_t size,
> >                                       enum dma_data_direction dir,
> >                                       enum dma_noncoherent_ops ops);
> >
> > In the current patch we have moved the ZICBOM and T-Head CMO to use
> > function pointers.
>
> "Convert ZICBOM..."
>
OK.

> > Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
> > ---
> > v5->v6
> > * New patch
> > ---
> >  arch/riscv/errata/thead/errata.c         | 71 ++++++++++++++++++
> >  arch/riscv/include/asm/dma-noncoherent.h | 83 +++++++++++++++++++++
> >  arch/riscv/include/asm/errata_list.h     | 53 -------------
> >  arch/riscv/kernel/cpufeature.c           |  2 +
> >  arch/riscv/mm/dma-noncoherent.c          | 94 ++++++++++++++++++++++--
> >  arch/riscv/mm/pmem.c                     | 18 ++++-
> >  6 files changed, 260 insertions(+), 61 deletions(-)
> >  create mode 100644 arch/riscv/include/asm/dma-noncoherent.h
> >
> > diff --git a/arch/riscv/errata/thead/errata.c b/arch/riscv/errata/thead/errata.c
> > index fac5742d1c1e..826b2ba3e61e 100644
> > --- a/arch/riscv/errata/thead/errata.c
> > +++ b/arch/riscv/errata/thead/errata.c
> > @@ -8,12 +8,72 @@
> >  #include <linux/module.h>
> >  #include <linux/string.h>
> >  #include <linux/uaccess.h>
> > +#include <asm/dma-noncoherent.h>
> >  #include <asm/alternative.h>
> >  #include <asm/cacheflush.h>
> >  #include <asm/errata_list.h>
> >  #include <asm/patch.h>
> >  #include <asm/vendorid_list.h>
> >
> > +#ifdef CONFIG_ERRATA_THEAD_CMO
> > +/*
> > + * dcache.ipa rs1 (invalidate, physical address)
> > + * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 |
> > + *   0000001    01010      rs1       000      00000  0001011
> > + * dcache.iva rs1 (invalidate, virtual address)
> > + *   0000001    00110      rs1       000      00000  0001011
> > + *
> > + * dcache.cpa rs1 (clean, physical address)
> > + * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 |
> > + *   0000001    01001      rs1       000      00000  0001011
> > + * dcache.cva rs1 (clean, virtual address)
> > + *   0000001    00100      rs1       000      00000  0001011
> > + *
> > + * dcache.cipa rs1 (clean then invalidate, physical address)
> > + * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 |
> > + *   0000001    01011      rs1       000      00000  0001011
> > + * dcache.civa rs1 (... virtual address)
> > + *   0000001    00111      rs1       000      00000  0001011
> > + *
> > + * sync.s (make sure all cache operations finished)
> > + * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 |
> > + *   0000000    11001     00000      000      00000  0001011
> > + */
> > +#define THEAD_inval_A0       ".long 0x0265000b"
> > +#define THEAD_clean_A0       ".long 0x0245000b"
> > +#define THEAD_flush_A0       ".long 0x0275000b"
> > +#define THEAD_SYNC_S ".long 0x0190000b"
> > +
> > +#define THEAD_CMO_OP(_op, _start, _size, _cachesize)                         \
> > +     asm volatile("mv a0, %1\n\t"                                            \
> > +                  "j 2f\n\t"                                                 \
> > +                  "3:\n\t"                                                   \
> > +                  THEAD_##_op##_A0 "\n\t"                                    \
> > +                  "add a0, a0, %0\n\t"                                       \
> > +                  "2:\n\t"                                                   \
> > +                  "bltu a0, %2, 3b\n\t"                                      \
> > +                  THEAD_SYNC_S                                               \
> > +                  : : "r"(_cachesize),                                       \
> > +                      "r"((unsigned long)(_start) & ~((_cachesize) - 1UL)),  \
> > +                      "r"((unsigned long)(_start) + (_size))                 \
> > +                  : "a0")
>
> I'm not going to provide a tested-by for the THEAD stuff, as I have no
> metrics - but I booted my usual NFS and did some tyre kicking. Seemed
> grand.
>
\o/

> > +static void thead_cmo_clean_range(unsigned long addr, unsigned long size)
> > +{
> > +     THEAD_CMO_OP(clean, addr, size, riscv_cbom_block_size);
> > +}
> > +
> > +static void thead_cmo_flush_range(unsigned long addr, unsigned long size)
> > +{
> > +     THEAD_CMO_OP(flush, addr, size, riscv_cbom_block_size);
> > +}
> > +
> > +static void thead_cmo_inval_range(unsigned long addr, unsigned long size)
> > +{
> > +     THEAD_CMO_OP(inval, addr, size, riscv_cbom_block_size);
> > +}
> > +#endif
> > +
> >  static bool errata_probe_pbmt(unsigned int stage,
> >                             unsigned long arch_id, unsigned long impid)
> >  {
> > @@ -33,6 +93,8 @@ static bool errata_probe_pbmt(unsigned int stage,
> >  static bool errata_probe_cmo(unsigned int stage,
> >                            unsigned long arch_id, unsigned long impid)
> >  {
> > +     struct riscv_cache_ops thead_cmo_ops;
> > +
> >       if (!IS_ENABLED(CONFIG_ERRATA_THEAD_CMO))
> >               return false;
> >
> > @@ -44,6 +106,15 @@ static bool errata_probe_cmo(unsigned int stage,
> >
> >       riscv_cbom_block_size = L1_CACHE_BYTES;
> >       riscv_noncoherent_supported();
> > +
> > +     memset(&thead_cmo_ops, 0x0, sizeof(thead_cmo_ops));
> > +     if (IS_ENABLED(CONFIG_ERRATA_THEAD_CMO)) {
>
> With CONFIG_ERRATA_THEAD_CMO=n:
>
> /stuff/linux/arch/riscv/errata/thead/errata.c: In function 'errata_probe_cmo':
> /stuff/linux/arch/riscv/errata/thead/errata.c:112:46: error: 'thead_cmo_clean_range' undeclared (first use in this function)
>   112 |                 thead_cmo_ops.clean_range = &thead_cmo_clean_range;
>       |                                              ^~~~~~~~~~~~~~~~~~~~~
> /stuff/linux/arch/riscv/errata/thead/errata.c:112:46: note: each undeclared identifier is reported only once for each function it appears in
> /stuff/linux/arch/riscv/errata/thead/errata.c:113:44: error: 'thead_cmo_inval_range' undeclared (first use in this function)
>   113 |                 thead_cmo_ops.inv_range = &thead_cmo_inval_range;
>       |                                            ^~~~~~~~~~~~~~~~~~~~~
> /stuff/linux/arch/riscv/errata/thead/errata.c:114:46: error: 'thead_cmo_flush_range' undeclared (first use in this function)
>   114 |                 thead_cmo_ops.flush_range = &thead_cmo_flush_range;
>       |                                              ^~~~~~~~~~~~~~~~~~~~~
>
> These functions are only present if CONFIG_ERRATA_THEAD_CMO is set,
> so this cannot be an IS_ENABLED() as things stand.
>
OK, I'll fix this.

> > +             thead_cmo_ops.clean_range = &thead_cmo_clean_range;
> > +             thead_cmo_ops.inv_range = &thead_cmo_inval_range;
> > +             thead_cmo_ops.flush_range = &thead_cmo_flush_range;
>
> As Arnd suggested, I'd rather see a `static const struct
> riscv_cache_ops` type deal too, so you can just call register()
> directly.
>
Sure will do that.

> > +             riscv_noncoherent_register_cache_ops(&thead_cmo_ops);
> > +     }
> > +
> >       return true;
> >  }
> >
> > diff --git a/arch/riscv/include/asm/dma-noncoherent.h b/arch/riscv/include/asm/dma-noncoherent.h
> > new file mode 100644
> > index 000000000000..a2af863d2608
> > --- /dev/null
> > +++ b/arch/riscv/include/asm/dma-noncoherent.h
> > @@ -0,0 +1,83 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * Copyright (C) 2022 Renesas Electronics Corp.
> > + */
> > +
> > +#ifndef __ASM_DMA_NONCOHERENT_H
> > +#define __ASM_DMA_NONCOHERENT_H
> > +
> > +#include <linux/dma-direct.h>
> > +
> > +enum dma_noncoherent_ops {
> > +     NON_COHERENT_SYNC_DMA_FOR_DEVICE = 0,
> > +     NON_COHERENT_SYNC_DMA_FOR_CPU,
> > +     NON_COHERENT_DMA_PREP,
> > +     NON_COHERENT_DMA_PMEM,
> > +};
> > +
> > +/*
> > + * struct riscv_cache_ops - Structure for CMO function pointers
>
> Drop the "function pointers" from everything here IMO.
>
OK.

> > + * @clean_range: Function pointer for clean cache
> > + * @inv_range: Function pointer for invalidate cache
> > + * @flush_range: Function pointer for flushing the cache
> > + * @riscv_dma_noncoherent_cmo_ops: Function pointer for platforms who want
> > + *  to handle CMO themselves. If this function pointer is set rest of the
> > + *  function pointers will be NULL.
>
> Will be NULL? Or must be NULL?
> Assuming you keep it, as I see Arnd is questioning it, I think a better
> description is needed for this one. And a rename of the function name,
> the one you have right now is oddly juxtaposed and a bit confusing.
> I don't really have something much better to suggest, so I am gonna use
> cmo_omnibus here...
>
> @cmo_omnibus: For platforms whose CMO capabilities do not align with the
>               standard cache management operations. If set, the specific
>               ops must be left unset.
>
OK, I'll update the description as above.

> Circling back ages later, cmo_universal()?
>
Sounds good to me.

> > +struct riscv_cache_ops {
> > +     void (*clean_range)(unsigned long addr, unsigned long size);
> > +     void (*inv_range)(unsigned long addr, unsigned long size);
> > +     void (*flush_range)(unsigned long addr, unsigned long size);
> > +     void (*riscv_dma_noncoherent_cmo_ops)(void *vaddr, size_t size,
> > +                                           enum dma_data_direction dir,
> > +                                           enum dma_noncoherent_ops ops);
> > +};
> > +
> > +extern struct riscv_cache_ops zicbom_cmo_ops;
> > +
> > +#ifdef CONFIG_RISCV_DMA_NONCOHERENT
> > +
> > +extern struct riscv_cache_ops noncoherent_cache_ops;
> > +
> > +void riscv_noncoherent_register_cache_ops(struct riscv_cache_ops *ops);
> > +
> > +static inline void riscv_dma_noncoherent_clean(void *vaddr, size_t size)
> > +{
> > +     if (noncoherent_cache_ops.clean_range) {
> > +             unsigned long addr = (unsigned long)vaddr;
> > +
> > +             noncoherent_cache_ops.clean_range(addr, size);
> > +     }
> > +}
> > +
> > +static inline void riscv_dma_noncoherent_flush(void *vaddr, size_t size)
> > +{
> > +     if (noncoherent_cache_ops.flush_range) {
> > +             unsigned long addr = (unsigned long)vaddr;
> > +
> > +             noncoherent_cache_ops.flush_range(addr, size);
> > +     }
> > +}
> > +
> > +static inline void riscv_dma_noncoherent_inval(void *vaddr, size_t size)
> > +{
> > +     if (noncoherent_cache_ops.inv_range) {
> > +             unsigned long addr = (unsigned long)vaddr;
> > +
> > +             noncoherent_cache_ops.inv_range(addr, size);
> > +     }
> > +}
> > +
> > +#else
> > +
> > +static void riscv_noncoherent_register_cache_ops(struct riscv_cache_ops *ops) {}
> > +
> > +static inline void riscv_dma_noncoherent_clean(void *vaddr, size_t size) {}
> > +
> > +static inline void riscv_dma_noncoherent_flush(void *vaddr, size_t size) {}
> > +
> > +static inline void riscv_dma_noncoherent_inval(void *vaddr, size_t size) {}
>
> Can these ever be used if DMA_NONCOHERENT is not set? I think
> riscv/mm/Makefile gates their usage on the symbol, no?
>
I think I can get rid of these.

> > +
> > +#endif
> > +
> > +#endif       /* __ASM_DMA_NONCOHERENT_H */
> > diff --git a/arch/riscv/include/asm/errata_list.h b/arch/riscv/include/asm/errata_list.h
> > index 4180312d2a70..ae3fc8b80edd 100644
>
> [...]
>
> > diff --git a/arch/riscv/mm/dma-noncoherent.c b/arch/riscv/mm/dma-noncoherent.c
> > index d919efab6eba..d9445c266bfd 100644
> > --- a/arch/riscv/mm/dma-noncoherent.c
> > +++ b/arch/riscv/mm/dma-noncoherent.c
> > @@ -9,23 +9,82 @@
> >  #include <linux/dma-map-ops.h>
> >  #include <linux/mm.h>
> >  #include <asm/cacheflush.h>
> > +#include <asm/dma-noncoherent.h>
> >
> >  static bool noncoherent_supported;
> >
> > +struct riscv_cache_ops noncoherent_cache_ops = {
> > +     .clean_range = NULL,
> > +     .inv_range = NULL,
> > +     .flush_range = NULL,
> > +     .riscv_dma_noncoherent_cmo_ops = NULL,
> > +};
> > +EXPORT_SYMBOL(noncoherent_cache_ops);
>
> These exports should be _GPL, no? The file is GPLed. Ditto the others.
>
Agreed.

> > +#ifdef CONFIG_RISCV_ISA_ZICBOM
> > +#define ZICBOM_CMO_OP(_op, _start, _size, _cachesize)                                \
> > +     asm volatile("mv a0, %1\n\t"                                            \
> > +                  "j 2f\n\t"                                                 \
> > +                  "3:\n\t"                                                   \
> > +                  "cbo." __stringify(_op) " (a0)\n\t"                        \
> > +                  "add a0, a0, %0\n\t"                                       \
> > +                  "2:\n\t"                                                   \
> > +                  "bltu a0, %2, 3b\n\t"                                      \
> > +                  : : "r"(_cachesize),                                       \
> > +                      "r"((unsigned long)(_start) & ~((_cachesize) - 1UL)),  \
> > +                      "r"((unsigned long)(_start) + (_size))                 \
> > +                  : "a0")
> > +
> > +static void zicbom_cmo_clean_range(unsigned long addr, unsigned long size)
> > +{
> > +     ZICBOM_CMO_OP(clean, addr, size, riscv_cbom_block_size);
> > +}
> > +
> > +static void zicbom_cmo_flush_range(unsigned long addr, unsigned long size)
> > +{
> > +     ZICBOM_CMO_OP(flush, addr, size, riscv_cbom_block_size);
> > +}
> > +
> > +static void zicbom_cmo_inval_range(unsigned long addr, unsigned long size)
> > +{
> > +     ZICBOM_CMO_OP(inval, addr, size, riscv_cbom_block_size);
> > +}
> > +
> > +struct riscv_cache_ops zicbom_cmo_ops = {
> > +     .clean_range = &zicbom_cmo_clean_range,
> > +     .inv_range = &zicbom_cmo_inval_range,
> > +     .flush_range = &zicbom_cmo_flush_range,
> > +};
> > +#else
> > +struct riscv_cache_ops zicbom_cmo_ops = {
> > +     .clean_range = NULL,
> > +     .inv_range = NULL,
> > +     .flush_range = NULL,
> > +     .riscv_dma_noncoherent_cmo_ops = NULL,
> > +};
> > +#endif
> > +EXPORT_SYMBOL(zicbom_cmo_ops);
> > +
> >  void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> >                             enum dma_data_direction dir)
> >  {
> >       void *vaddr = phys_to_virt(paddr);
> >
> > +     if (noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops) {
> > +             noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops(vaddr, size, dir,
> > +                                                                 NON_COHERENT_SYNC_DMA_FOR_DEVICE);
> > +             return;
> > +     }
> > +
> >       switch (dir) {
> >       case DMA_TO_DEVICE:
> > -             ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
> > +             riscv_dma_noncoherent_clean(vaddr, size);
> >               break;
> >       case DMA_FROM_DEVICE:
> > -             ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
> > +             riscv_dma_noncoherent_clean(vaddr, size);
> >               break;
> >       case DMA_BIDIRECTIONAL:
> > -             ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size);
> > +             riscv_dma_noncoherent_flush(vaddr, size);
> >               break;
> >       default:
> >               break;
> > @@ -37,12 +96,18 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> >  {
> >       void *vaddr = phys_to_virt(paddr);
> >
> > +     if (noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops) {
> > +             noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops(vaddr, size, dir,
> > +                                                                 NON_COHERENT_SYNC_DMA_FOR_CPU);
> > +             return;
> > +     }
> > +
> >       switch (dir) {
> >       case DMA_TO_DEVICE:
> >               break;
> >       case DMA_FROM_DEVICE:
> >       case DMA_BIDIRECTIONAL:
> > -             ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size);
> > +             riscv_dma_noncoherent_flush(vaddr, size);
> >               break;
> >       default:
> >               break;
> > @@ -53,7 +118,13 @@ void arch_dma_prep_coherent(struct page *page, size_t size)
> >  {
> >       void *flush_addr = page_address(page);
> >
> > -     ALT_CMO_OP(flush, flush_addr, size, riscv_cbom_block_size);
> > +     if (noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops) {
> > +             noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops(flush_addr, size, -1,
> > +                                                                 NON_COHERENT_DMA_PREP);
> > +             return;
> > +     }
> > +
> > +     riscv_dma_noncoherent_flush(flush_addr, size);
> >  }
> >
> >  void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
> > @@ -78,3 +149,16 @@ void riscv_noncoherent_supported(void)
> >            "Non-coherent DMA support enabled without a block size\n");
> >       noncoherent_supported = true;
> >  }
>
> Barring the function name, which I really don't like - it really is
> confusingly named. Something like cmo_universal() would get its point
> across a bit better I think.
>
OK, I'll rename riscv_dma_noncoherent_cmo_ops to cmo_universal.

> > +void riscv_noncoherent_register_cache_ops(struct riscv_cache_ops *ops)
> > +{
> > +     if (!ops)
> > +             return;
> > +
> > +     if (ops->riscv_dma_noncoherent_cmo_ops)
> > +             noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops =
> > +                             ops->riscv_dma_noncoherent_cmo_ops;
> > +     else
> > +             noncoherent_cache_ops = *ops;
>
> Can we just chop off the extra step here? Behaviourally this is going to
> be no different whether someone sets the universal op and the individual
> ones anyway.
>
Agreed.

> > +}
> > +EXPORT_SYMBOL(riscv_noncoherent_register_cache_ops);
>
> _GPL again I think!
>
OK.

> I'm happy with this approach, modulo the minor bits I've pointed out.
> I do really like that it takes us away from messing with an asm
> alternative every time someone wants to do this stuff differently or for
> a new platform.
>
\o/

> I would like Heiko to have a look at what you've done here, but ye looks
> promising IMO.
>
> Thanks,
> Conor.
>

Cheers,
Prabhakar
Arnd Bergmann Jan. 8, 2023, 12:07 a.m. UTC | #7
On Sat, Jan 7, 2023, at 23:10, Lad, Prabhakar wrote:

>> > +
>> > +     memset(&thead_cmo_ops, 0x0, sizeof(thead_cmo_ops));
>> > +     if (IS_ENABLED(CONFIG_ERRATA_THEAD_CMO)) {
>> > +             thead_cmo_ops.clean_range = &thead_cmo_clean_range;
>> > +             thead_cmo_ops.inv_range = &thead_cmo_inval_range;
>> > +             thead_cmo_ops.flush_range = &thead_cmo_flush_range;
>> > +             riscv_noncoherent_register_cache_ops(&thead_cmo_ops);
>> > +     }
>>
>> The implementation here looks reasonable, just wonder whether
>> the classification as an 'errata' makes sense. I would probably
>> consider this a 'driver' at this point, but that's just
>> a question of personal preference.
>>
> zicbom is a CPU feature that doesn't have any DT node and hence no
> driver and similarly for T-HEAD SoC.

A driver does not have to be a 'struct platform_driver' that
matches to a device node, my point was more about what to
name it, regardless of how the code is entered.

> Also the arch_setup_dma_ops()
> happens quite early before driver probing due to which we get WARN()
> messages during bootup hence I have implemented it as errata; as
> errata patching happens quite early.

But there is no more patching here, just setting the
function pointers, right?

>> > +struct riscv_cache_ops {
>> > +     void (*clean_range)(unsigned long addr, unsigned long size);
>> > +     void (*inv_range)(unsigned long addr, unsigned long size);
>> > +     void (*flush_range)(unsigned long addr, unsigned long size);
>> > +     void (*riscv_dma_noncoherent_cmo_ops)(void *vaddr, size_t size,
>> > +                                           enum dma_data_direction dir,
>> > +                                           enum dma_noncoherent_ops ops);
>> > +};
>>
>> I don't quite see how the fourth operation is used here.
>> Are there cache controllers that need something beyond
>> clean/inv/flush?
>>
> This is for platforms that don't follow standard cache operations (like
> done in patch 5/6) and their drivers decide on the operations
> depending on the ops and dir.

My feeling is that the set of operations that get called should
not depend on the cache controller but at best the CPU. I tried to
enumerate how zicbom and ax45 differ here, and how that compares
to other architectures:

                  zicbom      ax45,mips,arc      arm           arm64
fromdevice      clean/flush   inval/inval   inval/inval   clean/inval
todevice        clean/-       clean/-       clean/-       clean/-
bidi            flush/flush   flush/inval   clean/inval   clean/inval

So everyone does the same operation for DMA_TO_DEVICE, but
they differ in the DMA_FROM_DEVICE handling, for reasons I
don't quite see:

Your ax45 code does the same as arc and mips. arm and
arm64 skip invalidating the cache before bidi mappings,
but arm has a FIXME comment about that. arm64 does a
'clean' instead of 'inval' when mapping a fromdevice
page, which seems valid but slower than necessary.

Could the zicbom operations be changed to do the same
things as the ax45/mips/arc ones, or are there specific
details in the zicbom spec that require this?

     Arnd
Conor Dooley Jan. 8, 2023, 4:37 p.m. UTC | #8
On Sat, Jan 07, 2023 at 10:21:52PM +0000, Conor Dooley wrote:
> On Sat, Jan 07, 2023 at 10:52:55PM +0100, Arnd Bergmann wrote:
> > On Sat, Jan 7, 2023, at 00:29, Conor Dooley wrote:
> > > On Fri, Jan 06, 2023 at 11:31:33PM +0100, Arnd Bergmann wrote:
> > >> On Fri, Jan 6, 2023, at 19:55, Prabhakar wrote:
> > >> > From: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
> > >> > +struct riscv_cache_ops zicbom_cmo_ops = {
> > >> > +	.clean_range = &zicbom_cmo_clean_range,
> > >> > +	.inv_range = &zicbom_cmo_inval_range,
> > >> > +	.flush_range = &zicbom_cmo_flush_range,
> > >> > +};
> > >> > +#else
> > >> > +struct riscv_cache_ops zicbom_cmo_ops = {
> > >> > +	.clean_range = NULL,
> > >> > +	.inv_range = NULL,
> > >> > +	.flush_range = NULL,
> > >> > +	.riscv_dma_noncoherent_cmo_ops = NULL,
> > >> > +};
> > >> > +#endif
> > >> > +EXPORT_SYMBOL(zicbom_cmo_ops);
> > >> 
> > >> Same here: If the ZICBOM ISA is disabled, nothing should
> > >> reference zicbom_cmo_ops.
> > >
> > >> Also, since ZICBOM is a standard
> > >> extension, I think it makes sense to always have it enabled,
> > >> at least whenever noncoherent DMA is supported, that way
> > >> it can be the default that gets used in the absence of any
> > >> nonstandard cache controller.
> > >
> > > While I think of it, this is not possible as Zicbom requires toolchain
> > > support whereas the alternative methods for non-coherent DMA do not.
> > 
> > Ah, I see. Would it be possible to use the same .long trick
> > as in the other ones though? Something like
> > 
> > #if CONFIG_AS_VERSION >= 23600 /* or whichever version */
> 
> 
> >      /* proper inline asm */
> > #else
> >      /* .long hack */
> > #endif
> > 
> > That way everyone can use it, and the hack would automatically
> > go away in a few years after linux requires a newer toolchain.
> 
> > Alternatively, the entire noncoherent-dma support could be
> > made to depend on whichever toolchain introduced Zicbom.
> 
> Ehh, I don't think that's a great idea. It'd require far too recent a
> toolchain IMO.
> 
> Ideally, in my opinion, we'd just do something like what Drew has
> proposed for Zicboz, negating the need for a check at all:
> https://lore.kernel.org/linux-riscv/20221027130247.31634-4-ajones@ventanamicro.com/
> 
> Been waiting for that to be re-spun and Palmer to accept it before doing
> the same thing for Zicbom. At present we have this in the arch Kconfig:
> 
> config TOOLCHAIN_HAS_ZICBOM
> 	bool
> 	default y
> 	depends on !64BIT || $(cc-option,-mabi=lp64 -march=rv64ima_zicbom)
> 	depends on !32BIT || $(cc-option,-mabi=ilp32 -march=rv32ima_zicbom)
> 	depends on LLD_VERSION >= 150000 || LD_VERSION >= 23800
> 
> config RISCV_ISA_ZICBOM
> 	bool "Zicbom extension support for non-coherent DMA operation"
> 	depends on TOOLCHAIN_HAS_ZICBOM
> 
> The linker version check is entirely due to the linker having issues if
> it sees zicbom in the ISA string in object files.
> 
> I'd been intending to do that for Zicbom anyway, so I guess I'll just go
> do it & Prabhakar can attach it to his v7..

Should pop up here in a few minutes..
https://lore.kernel.org/linux-riscv/20230108163356.3063839-1-conor@kernel.org/

Hopefully that both works & makes life easier. Certainly from a CI
coverage point of view, relaxing toolchain requirements makes *my* life
easier!
Lad, Prabhakar Jan. 9, 2023, 12:03 p.m. UTC | #9
On Sun, Jan 8, 2023 at 12:08 AM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Sat, Jan 7, 2023, at 23:10, Lad, Prabhakar wrote:
>
> >> > +
> >> > +     memset(&thead_cmo_ops, 0x0, sizeof(thead_cmo_ops));
> >> > +     if (IS_ENABLED(CONFIG_ERRATA_THEAD_CMO)) {
> >> > +             thead_cmo_ops.clean_range = &thead_cmo_clean_range;
> >> > +             thead_cmo_ops.inv_range = &thead_cmo_inval_range;
> >> > +             thead_cmo_ops.flush_range = &thead_cmo_flush_range;
> >> > +             riscv_noncoherent_register_cache_ops(&thead_cmo_ops);
> >> > +     }
> >>
> >> The implementation here looks reasonable, just wonder whether
> >> the classification as an 'errata' makes sense. I would probably
> >> consider this a 'driver' at this point, but that's just
> >> a question of personal preference.
> >>
> > zicbom is a CPU feature that doesn't have any DT node and hence no
> > driver and similarly for T-HEAD SoC.
>
> A driver does not have to be a 'struct platform_driver' that
> matches to a device node, my point was more about what to
> name it, regardless of how the code is entered.
>
> > Also the arch_setup_dma_ops()
> > happens quite early before driver probing due to which we get WARN()
> > messages during bootup hence I have implemented it as errata; as
> > errata patching happens quite early.
>
> But there is no more patching here, just setting the
> function pointers, right?
>
Yes that's right.

> >> > +struct riscv_cache_ops {
> >> > +     void (*clean_range)(unsigned long addr, unsigned long size);
> >> > +     void (*inv_range)(unsigned long addr, unsigned long size);
> >> > +     void (*flush_range)(unsigned long addr, unsigned long size);
> >> > +     void (*riscv_dma_noncoherent_cmo_ops)(void *vaddr, size_t size,
> >> > +                                           enum dma_data_direction dir,
> >> > +                                           enum dma_noncoherent_ops ops);
> >> > +};
> >>
> >> I don't quite see how the fourth operation is used here.
> >> Are there cache controllers that need something beyond
> >> clean/inv/flush?
> >>
>> > This is for platforms that don't follow standard cache operations (like
>> > done in patch 5/6) and their drivers decide on the operations
> > depending on the ops and dir.
>
> My feeling is that the set of operations that get called should
> not depend on the cache controller but at best the CPU. I tried to
> enumerate how zicbom and ax45 differ here, and how that compares
> to other architectures:
>
>                   zicbom      ax45,mips,arc      arm           arm64
> fromdevice      clean/flush   inval/inval   inval/inval   clean/inval
> todevice        clean/-       clean/-       clean/-       clean/-
> bidi            flush/flush   flush/inval   clean/inval   clean/inval
>
> So everyone does the same operation for DMA_TO_DEVICE, but
> they differ in the DMA_FROM_DEVICE handling, for reasons I
> don't quite see:
>
> Your ax45 code does the same as arc and mips. arm and
> arm64 skip invalidating the cache before bidi mappings,
> but arm has a FIXME comment about that. arm64 does a
> 'clean' instead of 'inval' when mapping a fromdevice
> page, which seems valid but slower than necessary.
>
> Could the zicbom operations be changed to do the same
> things as the ax45/mips/arc ones, or are there specific
> details in the zicbom spec that require this?
>
I'll let the RISC-V experts respond here.

Cheers,
Prabhakar
Arnd Bergmann Jan. 9, 2023, 12:59 p.m. UTC | #10
On Mon, Jan 9, 2023, at 13:03, Lad, Prabhakar wrote:
> On Sun, Jan 8, 2023 at 12:08 AM Arnd Bergmann <arnd@arndb.de> wrote:
>> >> > +struct riscv_cache_ops {
>> >> > +     void (*clean_range)(unsigned long addr, unsigned long size);
>> >> > +     void (*inv_range)(unsigned long addr, unsigned long size);
>> >> > +     void (*flush_range)(unsigned long addr, unsigned long size);
>> >> > +     void (*riscv_dma_noncoherent_cmo_ops)(void *vaddr, size_t size,
>> >> > +                                           enum dma_data_direction dir,
>> >> > +                                           enum dma_noncoherent_ops ops);
>> >> > +};
>> >>
>> >> I don't quite see how the fourth operation is used here.
>> >> Are there cache controllers that need something beyond
>> >> clean/inv/flush?
>> >>
>> > This is for platforms that don't follow standard cache operations (like
>> > done in patch 5/6) and their drivers decide on the operations
>> > depending on the ops and dir.
>>
>> My feeling is that the set of operations that get called should
>> not depend on the cache controller but at best the CPU. I tried to
>> enumerate how zicbom and ax45 differ here, and how that compares
>> to other architectures:
>>
>>                   zicbom      ax45,mips,arc      arm           arm64
>> fromdevice      clean/flush   inval/inval   inval/inval   clean/inval
>> todevice        clean/-       clean/-       clean/-       clean/-
>> bidi            flush/flush   flush/inval   clean/inval   clean/inval
>>
>> So everyone does the same operation for DMA_TO_DEVICE, but
>> they differ in the DMA_FROM_DEVICE handling, for reasons I
>> don't quite see:
>>
>> Your ax45 code does the same as arc and mips. arm and
>> arm64 skip invalidating the cache before bidi mappings,
>> but arm has a FIXME comment about that. arm64 does a
>> 'clean' instead of 'inval' when mapping a fromdevice
>> page, which seems valid but slower than necessary.
>>
>> Could the zicbom operations be changed to do the same
>> things as the ax45/mips/arc ones, or are there specific
>> details in the zicbom spec that require this?
>>
> I'll let the RISC-V experts respond here.

Adding Christoph Hellwig and Will Deacon to Cc as well.

I had another look at the arm64 side, which (like the zicbom
variant) uses 'clean' on dma_sync_single_for_device(DMA_FROM_DEVICE),
as that has changed not that long ago, see

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50f11c6196f45c92ca48b16a5071615d4ae0572

I'm still not sure what the correct set of operations has
to be, but nothing in that patch description sounds ISA
or even microarchitecture specific.

    Arnd
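For reference, the direction-to-operation mapping that the table above compares can be reduced to a plain C model. This sketch follows the ax45/mips/arc column of the table; the enum and function names are illustrative only, not the kernel's actual identifiers:

```c
#include <assert.h>

/* Illustrative model only: the op choices follow the ax45/mips/arc
 * column of the table above; names are not real kernel identifiers. */
enum cache_op { OP_NONE, OP_CLEAN, OP_INVAL, OP_FLUSH };
enum dma_dir { DMA_BIDIRECTIONAL, DMA_TO_DEVICE, DMA_FROM_DEVICE };

/* op applied when handing the buffer to the device
 * (the arch_sync_dma_for_device step) */
static enum cache_op op_for_device(enum dma_dir dir)
{
	switch (dir) {
	case DMA_TO_DEVICE:	return OP_CLEAN; /* write back dirty lines */
	case DMA_FROM_DEVICE:	return OP_INVAL; /* discard stale lines */
	default:		return OP_FLUSH; /* bidi: write back + discard */
	}
}

/* op applied when handing the buffer back to the CPU
 * (the arch_sync_dma_for_cpu step) */
static enum cache_op op_for_cpu(enum dma_dir dir)
{
	switch (dir) {
	case DMA_TO_DEVICE:	return OP_NONE;	 /* nothing to undo */
	default:		return OP_INVAL; /* drop speculative fills */
	}
}
```

The zicbom column differs only in the two DMA_FROM_DEVICE slots (clean on map, flush on unmap), which is exactly the divergence being questioned in the thread.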
Conor Dooley Jan. 9, 2023, 1:27 p.m. UTC | #11
On Mon, Jan 09, 2023 at 01:59:12PM +0100, Arnd Bergmann wrote:
> On Mon, Jan 9, 2023, at 13:03, Lad, Prabhakar wrote:
> > On Sun, Jan 8, 2023 at 12:08 AM Arnd Bergmann <arnd@arndb.de> wrote:
> >> >> > +struct riscv_cache_ops {
> >> >> > +     void (*clean_range)(unsigned long addr, unsigned long size);
> >> >> > +     void (*inv_range)(unsigned long addr, unsigned long size);
> >> >> > +     void (*flush_range)(unsigned long addr, unsigned long size);
> >> >> > +     void (*riscv_dma_noncoherent_cmo_ops)(void *vaddr, size_t size,
> >> >> > +                                           enum dma_data_direction dir,
> >> >> > +                                           enum dma_noncoherent_ops ops);
> >> >> > +};
> >> >>
> >> >> I don't quite see how the fourth operation is used here.
> >> >> Are there cache controllers that need something beyond
> >> >> clean/inv/flush?
> >> >>
> >> > This is for platforms that don't follow standard cache operations
> >> > (like done in patch 5/6), where their drivers decide on the operations
> >> > depending on the ops and dir.
> >>
> >> My feeling is that the set of operations that get called should
> >> not depend on the cache controller but at best the CPU. I tried to
> >> enumerate how zicbom and ax45 differ here, and how that compares
> >> to other architectures:
> >>
> >>                   zicbom      ax45,mips,arc      arm           arm64
> >> fromdevice      clean/flush   inval/inval   inval/inval   clean/inval
> >> todevice        clean/-       clean/-       clean/-       clean/-
> >> bidi            flush/flush   flush/inval   clean/inval   clean/inval

I did a bit of digging on lore for context on why the ops are what they
are..
In v3 of the Zicbom enablement patchset, things looked like:
fromdevice	inval/inval
todevice	clean/-
bidi		flush/inval

v3:
https://lore.kernel.org/linux-riscv/20220610004308.1903626-3-heiko@sntech.de/

Samuel had some comments about the invals:
https://lore.kernel.org/linux-riscv/342e3c12-ebb0-badf-7d4c-c444a2b842b2@sholland.org/

In v4 it was changed to:
fromdevice	inval/flush
todevice	clean/-
bidi		flush/flush

v4:
https://lore.kernel.org/linux-riscv/20220619203212.3604485-4-heiko@sntech.de/

Christoph replied to that one, linking the thread belonging to the
commit you pointed out earlier:
https://lore.kernel.org/linux-riscv/20220620061607.GB10485@lst.de/

v5 produced what you have in your table above:
https://lore.kernel.org/linux-riscv/20220629215944.397952-4-heiko@sntech.de/

> >>
> >> So everyone does the same operation for DMA_TO_DEVICE, but
> >> they differ in the DMA_FROM_DEVICE handling, for reasons I
> >> don't quite see:
> >>
> >> Your ax45 code does the same as arc and mips. arm and
> >> arm64 skip invalidating the cache before bidi mappings,
> >> but arm has a FIXME comment about that. arm64 does a
> >> 'clean' instead of 'inval' when mapping a fromdevice
> >> page, which seems valid but slower than necessary.
> >>
> >> Could the zicbom operations be changed to do the same
> >> things as the ax45/mips/arc ones, or are there specific
> >> details in the zicbom spec that require this?
> >>
> > I'll let the RISC-V experts respond here.
> 
> Adding Christoph Hellwig and Will Deacon to Cc as well.
> 
> I had another look at the arm64 side, which (like the zicbom
> variant) uses 'clean' on dma_sync_single_for_device(DMA_FROM_DEVICE),
> as that has changed not that long ago, see
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50f11c6196f45c92ca48b16a5071615d4ae0572
> 
> I'm still not sure what the correct set of operations has
> to be, but nothing in that patch description sounds ISA
> or even microarchitecture specific.

Hope the lore archaeology helps jog people's memories...

Conor
Christoph Hellwig Jan. 10, 2023, 7:01 a.m. UTC | #12
On Mon, Jan 09, 2023 at 01:59:12PM +0100, Arnd Bergmann wrote:
> I had another look at the arm64 side, which (like the zicbom
> variant) uses 'clean' on dma_sync_single_for_device(DMA_FROM_DEVICE),
> as that has changed not that long ago, see
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50f11c6196f45c92ca48b16a5071615d4ae0572

which IIRC has been reverted recently.

> I'm still not sure what the correct set of operations has
> to be, but nothing in that patch description sounds ISA
> or even microarchitecture specific.

Nothing is ISA specific, and the only microarchitecture-related thing
is whether the specific core can speculate memory accesses.  See the
table in arch/arc/mm/dma.c for details.

And as commented on the arm64 patch I really hate architectures getting
creative here, as I'd much prefer to move the choice of primitives to
the core DMA code and just provide helpers to invalidate/writeback/
writeback+invalidate from the architectures eventually.
Arnd Bergmann Jan. 10, 2023, 3:03 p.m. UTC | #13
On Tue, Jan 10, 2023, at 08:01, Christoph Hellwig wrote:
> On Mon, Jan 09, 2023 at 01:59:12PM +0100, Arnd Bergmann wrote:
>> I had another look at the arm64 side, which (like the zicbom
>> variant) uses 'clean' on dma_sync_single_for_device(DMA_FROM_DEVICE),
>> as that has changed not that long ago, see
>> 
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50f11c6196f45c92ca48b16a5071615d4ae0572
>
> which IIRC has been reverted recently.

To clarify: I was looking at arch_sync_dma_for_device(), which
changed from 'invalidate' to 'clean' last June in commit
c50f11c6196f ("arm64: mm: Don't invalidate FROM_DEVICE buffers
at start of DMA transfer"). I don't see a revert for that.

The one that was reverted recently is arch_dma_prep_coherent, which
was changed and reverted in

c44094eee32 Aug 23 2022 flush->clean
b7d9aae4048 Dec 6  2022 clean->flush

I'm primarily interested in the streaming mappings (arch_sync_*)
at the moment.

>> I'm still not sure what the correct set of operations has
>> to be, but nothing in that patch description sounds ISA
>> or even microarchitecture specific.
>
> Nothing is ISA specific, and the only micro architecture related thing
> is if the specific core can speculate memory accesses.  See the table
> in arch/arc/mm/dma.c for details.
>
> And as commented on the arm64 patch I really hate architectures getting
> creative here, as I'd much prefer to move the choice of primitives to
> the core DMA code and just provide helpers to invalidate/writeback/
> writeback+invalidate from the architectures eventually.

Agreed, that's why I particularly don't like the idea of giving SoC
specific L2$ drivers the option of making custom choices here.

The arch_dma_prep_coherent() change is arguably arm64 specific
because it is only needed for machines sharing memory with EL3
firmware that enforces buffer ownership, but even for that it would
be nice to have a central definition and some architecture specific
flag that picks one behavior or the other. Same thing for the
speculative access difference.

I looked at all the implementations now and put them in a table[1]
to see what the differences are. The only bit that I think needs
discussion is the dma_sync_single_for_device(DMA_FROM_DEVICE) op
that I mentioned above. I see that arm64, csky, powerpc, riscv
and parisc all write out at least partical cache lines first to
avoid losing dirty data in the part that is not written by the
device[2][3], while the other ones don't[4]. 

The other differences I found are:

- arm and arm64 use clean instead of flush for
  dma_sync_single_for_device(DMA_BIDIRECTIONAL).
  This seems harmless, since there is another
  invalidation at the _for_cpu() step.

- hexagon, m68k, openrisc and sh don't deal with
  speculative loads

- m68k, nios2, openrisc, parisc and xtensa use
  flush instead of clean for DMA_TO_DEVICE, presumably
  because they don't have a flush without invalidate.

- microblaze, parisc and powerpc use the same function
  for _for_device and _for_cpu, which is slightly wasteful
  but harmless.

- openrisc lacks support for bidirectional mappings,
  which is a bug

- sparc32 and xtensa only support writethrough caches
  and consequently skip the clean before DMA but
  instead invalidate the buffer in the _for_cpu
  operation.

I also see that at least arc, arm, mips and riscv all want
CPU specific cache management operations to be registered
at boot time. While Russell had some concerns about your
suggestion to generalize the arm version, we could start
by moving the abstracted riscv version into
kernel/dma/direct.c and make sure it can be shared with
at least mips and arc, provided that we can agree on the
DMA_FROM_DEVICE behavior.

     Arnd

[1] https://docs.google.com/spreadsheets/d/1qDuMqB6TnRTj_CgUwgIIm_RJ6EZO76qohpTJUMQjUEo/edit#gid=0
[2] https://git.kernel.org/torvalds/c/c50f11c6196f
[3] https://git.kernel.org/torvalds/c/03d70617b8a7
[4] https://lore.kernel.org/all/20220606152150.GA31568@willie-the-truck/
Will Deacon Jan. 10, 2023, 3:11 p.m. UTC | #14
On Tue, Jan 10, 2023 at 04:03:06PM +0100, Arnd Bergmann wrote:
> On Tue, Jan 10, 2023, at 08:01, Christoph Hellwig wrote:
> > On Mon, Jan 09, 2023 at 01:59:12PM +0100, Arnd Bergmann wrote:
> >> I had another look at the arm64 side, which (like the zicbom
> >> variant) uses 'clean' on dma_sync_single_for_device(DMA_FROM_DEVICE),
> >> as that has changed not that long ago, see
> >> 
> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50f11c6196f45c92ca48b16a5071615d4ae0572
> >
> > which IIRC has been reverted recently.
> 
> To clarify: I was looking at arch_sync_dma_for_device(), which
> changed from 'invalidate' to 'clean' last June in commit
> c50f11c6196f ("arm64: mm: Don't invalidate FROM_DEVICE buffers
> at start of DMA transfer"). I don't see  a revert for that.
> 
> The one that was reverted recently is arch_dma_prep_coherent, which
> was changed and reverted in
> 
> c44094eee32 Aug 23 2022 flush->clean
> b7d9aae4048 Dec 6  2022 clean->flush
> 
> I'm primarily interested in the streaming mappings (arch_sync_*)
> at the moment.

Just as an FYI, but we plan to revert the revert (i.e. go back to 'clean')
here once Qualcomm's modem firmware loader has been updated:

https://lore.kernel.org/r/20230109034843.23759-1-quic_sibis@quicinc.com

Will
Christoph Hellwig Jan. 13, 2023, 5:48 a.m. UTC | #15
On Tue, Jan 10, 2023 at 04:03:06PM +0100, Arnd Bergmann wrote:
> To clarify: I was looking at arch_sync_dma_for_device(), which
> changed from 'invalidate' to 'clean' last June in commit
> c50f11c6196f ("arm64: mm: Don't invalidate FROM_DEVICE buffers
> at start of DMA transfer"). I don't see  a revert for that.
> 
> The one that was reverted recently is arch_dma_prep_coherent, which
> was changed and reverted in
> 
> c44094eee32 Aug 23 2022 flush->clean
> b7d9aae4048 Dec 6  2022 clean->flush
> 
> I'm primarily interested in the streaming mappings (arch_sync_*)
> at the moment.

Oh, true.  I've been confused about the two changes.

> Agreed, that's why I particularly don't like the idea of giving SoC
> specific L2$ drivers the option of making custom choices here.

Yes, moving this into individual drivers is an absolute no-go. 

> The arch_dma_prep_coherent() change is arguably arm64 specific
> because it is only needed for machines sharing memory with EL3
> firmware that enforces buffer ownership, but even for that it would
> be nice to have a central definition and some architecture specific
> flag that picks one behavior or the other. Same thing for the
> speculative access difference.

Yes.

> I looked at all the implementations now and put them in a table[1]
> to see what the differences are. The only bit that I think needs
> discussion is the dma_sync_single_for_device(DMA_FROM_DEVICE) op
> that I mentioned above. I see that arm64, csky, powerpc, riscv
> and parisc all write out at least partial cache lines first to
> avoid losing dirty data in the part that is not written by the
> device[2][3], while the other ones don't[4]. 

I'm tempted to declare [4] buggy until proof of the inverse.

> I also see that at least arc, arm, mips and riscv all want
> CPU specific cache management operations to be registered
> at boot time. While Russell had some concerns about your
> suggestion to generalize the arm version, we could start
> by moving the abstracted riscv version into
> kernel/dma/direct.c and make sure it can be shared with
> at least mips and arc, provided that we can agree on the
> DMA_FROM_DEVICE behavior.

Yes, I'd really like to start out with a common version and then
move bits over.  There are also some interesting bits about handling
highmem for architectures that need virtual addresses for flushing
that might be worth sharing, too.

This should be a new file in kernel/dma/ as it's not only used by
dma-direct but also by dma-iommu, and to keep the code nicely
separated.

Can you give it a go?
Arnd Bergmann Jan. 20, 2023, 5:04 p.m. UTC | #16
On Fri, Jan 13, 2023, at 06:48, Christoph Hellwig wrote:
> On Tue, Jan 10, 2023 at 04:03:06PM +0100, Arnd Bergmann wrote:
>> I looked at all the implementations now and put them in a table[1]
>> to see what the differences are. The only bit that I think needs
>> discussion is the dma_sync_single_for_device(DMA_FROM_DEVICE) op
>> that I mentioned above. I see that arm64, csky, powerpc, riscv
>> and parisc all write out at least partial cache lines first to
>> avoid losing dirty data in the part that is not written by the
>> device[2][3], while the other ones don't[4]. 
>
> I'm tempted to declare [4] buggy until proof of the inverse.

Having looked at this some more, I see that the powerpc
version is a bit problematic here as well: this one
flushes the partial cache lines before and after the
DMA transfer, while only invalidating the full
cache lines. If a partial cache line gets written
to by the CPU while the buffer is owned by the device,
this means that the received data from the device is
immediately overwritten by the second flush.

The arm64 and riscv behavior of doing a flush before
and an invalidate after the DMA seems a bit more
appropriate, as that ends up keeping the DMA
data but discarding anything written by the CPU.

Obviously there is no winning either way if the same
cache line gets written by both CPU and device, I'm
just trying to figure out what behavior we actually
want here.

The best I can think of so far is:

- flush the partial cache lines before the DMA,
  as powerpc does, and just invalidate the full
  cache lines

- only invalidate but not clean/flush after the
  DMA. This is the arm64 behavior

- warn when flushing partial cache lines
  if dma_debug is enabled.
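The distinction the first two bullets rely on, between the partial head/tail lines of a buffer and the full lines in its interior, can be sketched as follows. This is a minimal illustration assuming a 64-byte line size; the helper names are hypothetical, not kernel API:

```c
#include <assert.h>

#define LINE 64UL /* assumed cache line size, for illustration only */

struct range { unsigned long start, end; };

/* Full cache lines inside [addr, addr + size): these can be invalidated
 * outright. The partial lines at the edges (if any) may share data with
 * the CPU and would need a writeback before the DMA starts. */
static struct range full_lines(unsigned long addr, unsigned long size)
{
	struct range r = {
		(addr + LINE - 1) & ~(LINE - 1),	/* round start up */
		(addr + size) & ~(LINE - 1),		/* round end down */
	};

	if (r.end < r.start)
		r.end = r.start; /* buffer smaller than one line */
	return r;
}
```

For a buffer at 0x1010 of 0x100 bytes, only [0x1040, 0x1100) consists of full lines; the bytes outside that subrange sit in lines that must be cleaned first, which is where a dma_debug warning could be hooked in.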

>> I also see that at least arc, arm, mips and riscv all want
>> CPU specific cache management operations to be registered
>> at boot time. While Russell had some concerns about your
>> suggestion to generalize the arm version, we could start
>> by moving the abstracted riscv version into
>> kernel/dma/direct.c and make sure it can be shared with
>> at least mips and arc, provided that we can agree on the
>> DMA_FROM_DEVICE behavior.
>
> Yes, I'd really like to start out with a common version and then
> move bits over.  There are also some interesting bits about handling
> highmem for architectures that need virtual addresses for flushing
> that might be worth sharing, too.
>
> This should be a new file in kernel/dma/ as it's not only used by
> dma-direct but also by dma-iommu, and to keep the code nicely
> separated.
>
> Can you give it a go?

I started this at the beginning of the week but couldn't
finish it at all, but still plan to get back to it
next week.

Aside from the question for how to handle flush vs invalidate
on DMA_FROM_DEVICE, I'm still trying to figure out how to
best handle highmem with architecture specific cache management
operations. The easy approach would be to leave that up
to the architecture, passing only a physical address to
the flush function. A nicer interface might be to move the
loop over highmem pages out into common code, flush
lowmem pages by virtual address, and have a separate
callback for highmem pages that takes a page pointer,
like

struct dma_cache_ops {
        void (*dma_cache_wback_inv)(void *start, unsigned long sz);
        void (*dma_cache_inv)(void *start, unsigned long sz);
        void (*dma_cache_wback)(void *start, unsigned long sz);
#ifdef CONFIG_HIGHMEM
        void (*dma_cache_wback_inv_high_page)(struct page *, size_t start, unsigned long sz);
        void (*dma_cache_inv_high_page)(struct page *, size_t start, unsigned long sz);
        void (*dma_cache_wback_high_page)(struct page *, size_t start, unsigned long sz);
#endif
};

Let me know if you have a preference here, before I spend
too much time on something we don't want in the end.

     Arnd
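The "easy approach" described above, passing only a physical address to the arch callback and keeping the per-page iteration in common code, could look roughly like the following userspace sketch. The function names and the stub callback are made up for illustration:

```c
#include <assert.h>

#define PAGE_SZ 4096UL /* assumed page size */

static unsigned long handed_out; /* bytes seen by the stub callback */

/* stands in for an arch-provided callback that takes only a paddr;
 * a real implementation would issue CMOs on [paddr, paddr + sz) */
static void arch_wback_inv(unsigned long paddr, unsigned long sz)
{
	handed_out += sz;
}

/* common-code loop: split the buffer so the arch callback never sees a
 * span crossing a page boundary, which is what a highmem kmap window
 * (mapped one page at a time) would require */
static void dma_cache_maint(unsigned long paddr, unsigned long size)
{
	while (size) {
		unsigned long len = PAGE_SZ - (paddr & (PAGE_SZ - 1));

		if (len > size)
			len = size;
		arch_wback_inv(paddr, len);
		paddr += len;
		size -= len;
	}
}
```

With this split, the kmap_atomic() (or equivalent) can live either in the common loop or behind the paddr-based callback, which is exactly the interface question being debated.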
Christoph Hellwig Jan. 21, 2023, 2:37 p.m. UTC | #17
On Fri, Jan 20, 2023 at 06:04:37PM +0100, Arnd Bergmann wrote:
> Having looked at this some more, I see that the powerpc
> version is a bit problematic here as well: this one
> flushes the partial cache lines before and after the
> DMA transfer, while only invalidating the full
> cache lines.

That feels really odd, and might be worth a bug report to the
PPC maintainers.

> Obviously there is no winning either way if the same
> cache line gets written by both CPU and device, I'm
> just trying to figure out what behavior we actually
> want here.

There isn't, and that's why we require DMAed regions to be cache line
aligned.

> Aside from the question for how to handle flush vs invalidate
> on DMA_FROM_DEVICE, I'm still trying to figure out how to
> best handle highmem with architecture specific cache management
> operations. The easy approach would be to leave that up
> to the architecture, passing only a physical address to
> the flush function.

I suspect that is a good enough first step.  Especially as I remember
that some architectures have physical address based cache management
anyway (unless we removed them in the meantime).

> A nicer interface might be to move the
> loop over highmem pages out into common code, flush
> lowmem pages by virtual address, and have a separate
> callback for highmem pages that takes a page pointer,
> like

I'd rather avoid multiple callbacks if we can.  But maybe solve
the simple problem first and just pass the paddr and then
iterate from there.

> 
> struct dma_cache_ops {
>         void (*dma_cache_wback_inv)(void *start, unsigned long sz);
>         void (*dma_cache_inv)(void *start, unsigned long sz);
>         void (*dma_cache_wback)(void *start, unsigned long sz);
> #ifdef CONFIG_HIGHMEM
>         void (*dma_cache_wback_inv_high_page)(struct page *, size_t start, unsigned long sz);
>         void (*dma_cache_inv_high_page)(struct page *, size_t start, unsigned long sz);
>         void (*dma_cache_wback_high_page)(struct page *, size_t start, unsigned long sz);

Btw, I really don't think these should be indirect calls.  
For sane architectures there should be exactly one way to call them,
and the ones that have different implementations really should be
using alternatives instead of expensive indirect calls.
Arnd Bergmann Jan. 21, 2023, 7:30 p.m. UTC | #18
On Sat, Jan 21, 2023, at 15:37, Christoph Hellwig wrote:
> On Fri, Jan 20, 2023 at 06:04:37PM +0100, Arnd Bergmann wrote:
>> Having looked at this some more, I see that the powerpc
>> version is a bit problematic here as well: this one
>> flushes the partial cache lines before and after the
>> DMA transfer, while only invalidating the full
>> cache lines.
>
> That feels really odd, and might be worth a bug report to the
> PPC maintainers.

Right, my first step would be to change all of the current
outliers to use the same set of operations where possible.

>> Aside from the question for how to handle flush vs invalidate
>> on DMA_FROM_DEVICE, I'm still trying to figure out how to
>> best handle highmem with architecture specific cache management
>> operations. The easy approach would be to leave that up
>> to the architecture, passing only a physical address to
>> the flush function.
>
> I suspect that is a good enough first step.  Especially as I remember
> that some architectures have physical address based cache management
> anyway (unless we removed them in the meantime).

ok

>> A nicer interface might be to move the
>> loop over highmem pages out into common code, flush
>> lowmem pages by virtual address, and have a separate
>> callback for highmem pages that takes a page pointer,
>> like
>
> I'd rather avoid multiple callbacks if we can.  But maybe solve
> the simple problem first and just pass the paddr and then
> iterate from there.

Ok, fair enough. This means we can't easily put the kmap_atomic()
into common code for highmem, though the per-page loop would
still work.

>> struct dma_cache_ops {
>>         void (*dma_cache_wback_inv)(void *start, unsigned long sz);
>>         void (*dma_cache_inv)(void *start, unsigned long sz);
>>         void (*dma_cache_wback)(void *start, unsigned long sz);
>> #ifdef CONFIG_HIGHMEM
>>         void (*dma_cache_wback_inv_high_page)(struct page *, size_t start, unsigned long sz);
>>         void (*dma_cache_inv_high_page)(struct page *, size_t start, unsigned long sz);
>>         void (*dma_cache_wback_high_page)(struct page *, size_t start, unsigned long sz);
>
> Btw, I really don't think these should be indirect calls.  
> For sane architectures there should be exactly one way to call them,
> and the ones that have different implementations really should be
> using alternatives instead of expensive indirect calls.

I was thinking of using STATIC_CALL() as an optimization here, which
I find easier to read and understand than alternatives. One advantage
here is that this allows the actual cache operations to be declared
locally in the architecture without letting drivers call them,
but still update the common code to work without indirect branches.

The main downside is that this is currently only optimized on
powerpc and x86, both of which don't actually need CPU specific
callbacks. ARC, ARM, and MIPS on the other hand already
have indirect function pointers, RISC-V would likely benefit the
most from either alternatives or static_call, as it already
uses alternatives and has one implementation that is clearly
preferred over the others.

      Arnd
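For contrast with the STATIC_CALL and alternatives options, the boot-time function-pointer registration that arc/arm/mips already use (and that this series proposes for RISC-V) reduces to the following sketch. All names here are illustrative stand-ins, not the actual kernel interfaces:

```c
#include <assert.h>

/* Indirect-call dispatch as used by arc/arm/mips: a CPU/SoC driver
 * registers its ops once during early boot, and generic DMA code
 * calls through the pointers afterwards. Names are illustrative. */
struct cache_ops {
	void (*wback)(unsigned long paddr, unsigned long sz);
	void (*inv)(unsigned long paddr, unsigned long sz);
};

static unsigned long wback_calls;

static void soc_wback(unsigned long paddr, unsigned long sz)
{
	(void)paddr; (void)sz;
	wback_calls++; /* a real op would write back [paddr, paddr + sz) */
}

static void soc_inv(unsigned long paddr, unsigned long sz)
{
	(void)paddr; (void)sz;
}

static struct cache_ops cache_ops; /* filled in once at boot */

static void register_cache_ops(const struct cache_ops *ops)
{
	cache_ops = *ops;
}

/* generic entry point: every call pays for an indirect branch, which
 * is the cost STATIC_CALL or alternatives would patch away */
static void dma_wback(unsigned long paddr, unsigned long sz)
{
	if (cache_ops.wback)
		cache_ops.wback(paddr, sz);
}
```

The trade-off in the thread is exactly about this last function: a pointer load plus indirect branch on every cache operation, versus patching the call site once the CPU type is known.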
Christoph Hellwig Jan. 22, 2023, 7:27 a.m. UTC | #19
On Sat, Jan 21, 2023 at 08:30:23PM +0100, Arnd Bergmann wrote:
> > That feels really odd, and might be worth a bug report to the
> > PPC maintainers.
> 
> Right, my first step would be to change all of the current
> outliers to use the same set of operations where possible.

Sounds good.

> > I'd rather avoid multiple callbacks if we can.  But maybe solve
> > the simple problem first and just pass the paddr and then
> > iterate from there.
> 
> Ok, fair enough. This means we can't easily put the kmap_atomic()
> into common code for highmem, though the per-page loop would
> still work.

Yes.  Given how messy many of the ops are I think one step at a time
is always good.

> I was thinking of using STATIC_CALL() as an optimization here, which
> I find easier to read and understand than alternatives. One advantage
> here is that this allows the actual cache operations to be declared
> locally in the architecture without letting drivers call them,
> but still update the common code to work without indirect branches.
> 
> The main downside is that this is currently only optimized on
> powerpc and x86, both of which don't actually need CPU specific
> callbacks. ARC, ARM, and MIPS on the other hand already
> have indirect function pointers, RISC-V would likely benefit the
> most from either alternatives or static_call, as it already
> uses alternatives and has one implementation that is clearly
> preferred over the others.

For now I'd just keep doing direct calls into the arch code, just
for the lower level invalidate, writeback, invalidate+writeback
calls as that helps cement the logic of which of those to use
in well documented core code.

And I'm not really sure I'd like to go beyond that - making it too
easy pluggable will make people feel more comfortable doing stupid
things here.  And yes, maybe that's personal because I've warned
the RISC-V people years ago that they'll need architectural
cache management instructions yesterday and the answer was that
no one is going to use them on modern CPUs.  *sigh*
Arnd Bergmann Jan. 22, 2023, 11:04 a.m. UTC | #20
On Sun, Jan 22, 2023, at 08:27, Christoph Hellwig wrote:
> On Sat, Jan 21, 2023 at 08:30:23PM +0100, Arnd Bergmann wrote:
>> I was thinking of using STATIC_CALL() as an optimization here, which
>> I find easier to read and understand than alternatives. One advantage
>> here is that this allows the actual cache operations to be declared
>> locally in the architecture without letting drivers call them,
>> but still update the common code to work without indirect branches.
>> 
>> The main downside is that this is currently only optimized on
>> powerpc and x86, both of which don't actually need CPU specific
>> callbacks. ARC, ARM, and MIPS on the other hand already
>> have indirect function pointers, RISC-V would likely benefit the
>> most from either alternatives or static_call, as it already
>> uses alternatives and has one implementation that is clearly
>> preferred over the others.
>
> For now I'd just keep doing direct calls into the arch code, just
> for the lower level invalidate, writeback, invalidate+writeback
> calls as that helps cement the logic of which of those to use
> in well documented core code.

Ok.

> And I'm not really sure I'd like to go beyond that - making it too
> easy pluggable will make people feel more comfortable doing stupid
> things here. 

I fear the bigger risk is still making the functions callable
from device driver code than it is to make the functions
globally settable.

You introduced the mips version in f8c55dc6e828 ("MIPS: use generic
dma noncoherent ops for simple noncoherent platforms"), which
was clearly meant as an implementation detail, yet we already
have a driver that slipped in with 3bdffa8ffb45 ("Input: Add
N64 controller driver") that just calls this directly rather
than using the dma-mapping interface.

On the other hand, the indirect function pointers for
per-cpu cache operations are not easily translated anyway:
with the three architectures that multiplex between
cpu specific operations, arc uses physical addresses,
mips uses virtual addresses (because of highmem), and
arm even uses both because of incompatible requirements
between l1 and l2 cache operations. arm32 also seems to
have the superset of all possible corner cases that
one might see elsewhere (prefetching vs in-order,
writethrough vs writeback, broken broadcast invalidation,
...).

> And yes, maybe that's personal because I've warned
> the RISC-V people years ago that they'll need architectural
> cache management instructions yesterday and the answer was that
> no one is going to use them on modern CPUs.  *sigh*

To be fair, from the ISA point of view, it really shouldn't
be necessary as long as you have a sane SoC design.
In practice there are always chips that are cutting corners,
or use the new CPU core as a drop-in for an existing
design. Arm SBSA tried to enforce the same thing and also
failed for pretty much the same reason.

      Arnd
Christoph Hellwig Jan. 23, 2023, 2:46 p.m. UTC | #21
On Sun, Jan 22, 2023 at 12:04:35PM +0100, Arnd Bergmann wrote:
> > And I'm not really sure I'd like to go beyond that - making it too
> > easy pluggable will make people feel more comfortable doing stupid
> > things here. 
> 
> I fear the bigger risk is still making the functions callable
> from device driver code than it is to make the functions
> globally settable.
> 
> You introduced the mips version in f8c55dc6e828 ("MIPS: use generic
> dma noncoherent ops for simple noncoherent platforms"), which
> was clearly meant as an implementation detail, yet we already
> have a driver that slipped in with 3bdffa8ffb45 ("Input: Add
> N64 controller driver") that just calls this directly rather
> than using the dma-mapping interface.

MIPS actually has a bit of a history of these odd bypasses that
it seems like this driver copied.  rmk has been very worried by
this bypassing, and in general I agree.  But there's only so
much we can do except for auditing drivers.  Especially as none of
these helpers is exported and built-in only drivers are quite rare.

> > And yes, maybe that's personal because I've warned
> > the RISC-V people years ago that they'll need architectural
> > cache management instructions yesterday and the answer was that
> > no one is going to use them on modern CPUs.  *sigh*
> 
> To be fair, from the ISA point of view, it really shouldn't
> be necessary as long as you have a sane SoC design.
> In practice there are always chips that are cutting corners,
> or use the new CPU core as a drop-in for an existing
> design. Arm SBSA tried to enforce the same thing and also
> failed for pretty much the same reason.

Not wiring up IP blocks for cache coherency is cheap.  So it totally
makes sense for dirt cheap SOCs, or even for low performance periphals
in general.  The GPU folks also believe they can make some things
faster by deliberately turning coherency off on SOCs that support
coherency.

In addition to that, cache maintenance is absolutely needed for NVDIMM
support.
diff mbox series

Patch

diff --git a/arch/riscv/errata/thead/errata.c b/arch/riscv/errata/thead/errata.c
index fac5742d1c1e..826b2ba3e61e 100644
--- a/arch/riscv/errata/thead/errata.c
+++ b/arch/riscv/errata/thead/errata.c
@@ -8,12 +8,72 @@ 
 #include <linux/module.h>
 #include <linux/string.h>
 #include <linux/uaccess.h>
+#include <asm/dma-noncoherent.h>
 #include <asm/alternative.h>
 #include <asm/cacheflush.h>
 #include <asm/errata_list.h>
 #include <asm/patch.h>
 #include <asm/vendorid_list.h>
 
+#ifdef CONFIG_ERRATA_THEAD_CMO
+/*
+ * dcache.ipa rs1 (invalidate, physical address)
+ * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 |
+ *   0000001    01010      rs1       000      00000  0001011
+ * dcache.iva rs1 (invalidate, virtual address)
+ *   0000001    00110      rs1       000      00000  0001011
+ *
+ * dcache.cpa rs1 (clean, physical address)
+ * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 |
+ *   0000001    01001      rs1       000      00000  0001011
+ * dcache.cva rs1 (clean, virtual address)
+ *   0000001    00100      rs1       000      00000  0001011
+ *
+ * dcache.cipa rs1 (clean then invalidate, physical address)
+ * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 |
+ *   0000001    01011      rs1       000      00000  0001011
+ * dcache.civa rs1 (... virtual address)
+ *   0000001    00111      rs1       000      00000  0001011
+ *
+ * sync.s (make sure all cache operations finished)
+ * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 |
+ *   0000000    11001     00000      000      00000  0001011
+ */
+#define THEAD_inval_A0	".long 0x0265000b"
+#define THEAD_clean_A0	".long 0x0245000b"
+#define THEAD_flush_A0	".long 0x0275000b"
+#define THEAD_SYNC_S	".long 0x0190000b"
+
+#define THEAD_CMO_OP(_op, _start, _size, _cachesize)				\
+	asm volatile("mv a0, %1\n\t"						\
+		     "j 2f\n\t"							\
+		     "3:\n\t"							\
+		     THEAD_##_op##_A0 "\n\t"					\
+		     "add a0, a0, %0\n\t"					\
+		     "2:\n\t"							\
+		     "bltu a0, %2, 3b\n\t"					\
+		     THEAD_SYNC_S						\
+		     : : "r"(_cachesize),					\
+			 "r"((unsigned long)(_start) & ~((_cachesize) - 1UL)),	\
+			 "r"((unsigned long)(_start) + (_size))			\
+		     : "a0")
+
+static void thead_cmo_clean_range(unsigned long addr, unsigned long size)
+{
+	THEAD_CMO_OP(clean, addr, size, riscv_cbom_block_size);
+}
+
+static void thead_cmo_flush_range(unsigned long addr, unsigned long size)
+{
+	THEAD_CMO_OP(flush, addr, size, riscv_cbom_block_size);
+}
+
+static void thead_cmo_inval_range(unsigned long addr, unsigned long size)
+{
+	THEAD_CMO_OP(inval, addr, size, riscv_cbom_block_size);
+}
+#endif
+
 static bool errata_probe_pbmt(unsigned int stage,
 			      unsigned long arch_id, unsigned long impid)
 {
@@ -33,6 +93,8 @@  static bool errata_probe_pbmt(unsigned int stage,
 static bool errata_probe_cmo(unsigned int stage,
 			     unsigned long arch_id, unsigned long impid)
 {
+	struct riscv_cache_ops thead_cmo_ops;
+
 	if (!IS_ENABLED(CONFIG_ERRATA_THEAD_CMO))
 		return false;
 
@@ -44,6 +106,15 @@  static bool errata_probe_cmo(unsigned int stage,
 
 	riscv_cbom_block_size = L1_CACHE_BYTES;
 	riscv_noncoherent_supported();
+
+	memset(&thead_cmo_ops, 0x0, sizeof(thead_cmo_ops));
+	if (IS_ENABLED(CONFIG_ERRATA_THEAD_CMO)) {
+		thead_cmo_ops.clean_range = &thead_cmo_clean_range;
+		thead_cmo_ops.inv_range = &thead_cmo_inval_range;
+		thead_cmo_ops.flush_range = &thead_cmo_flush_range;
+		riscv_noncoherent_register_cache_ops(&thead_cmo_ops);
+	}
+
 	return true;
 }
 
diff --git a/arch/riscv/include/asm/dma-noncoherent.h b/arch/riscv/include/asm/dma-noncoherent.h
new file mode 100644
index 000000000000..a2af863d2608
--- /dev/null
+++ b/arch/riscv/include/asm/dma-noncoherent.h
@@ -0,0 +1,83 @@ 
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2022 Renesas Electronics Corp.
+ */
+
+#ifndef __ASM_DMA_NONCOHERENT_H
+#define __ASM_DMA_NONCOHERENT_H
+
+#include <linux/dma-direct.h>
+
+enum dma_noncoherent_ops {
+	NON_COHERENT_SYNC_DMA_FOR_DEVICE = 0,
+	NON_COHERENT_SYNC_DMA_FOR_CPU,
+	NON_COHERENT_DMA_PREP,
+	NON_COHERENT_DMA_PMEM,
+};
+
+/*
+ * struct riscv_cache_ops - Structure for CMO function pointers
+ * @clean_range: Function pointer to clean a cache range
+ * @inv_range: Function pointer to invalidate a cache range
+ * @flush_range: Function pointer to flush a cache range
+ * @riscv_dma_noncoherent_cmo_ops: Function pointer for platforms that want
+ *  to handle the CMOs themselves. If this function pointer is set, the
+ *  rest of the function pointers are ignored.
+ */
+struct riscv_cache_ops {
+	void (*clean_range)(unsigned long addr, unsigned long size);
+	void (*inv_range)(unsigned long addr, unsigned long size);
+	void (*flush_range)(unsigned long addr, unsigned long size);
+	void (*riscv_dma_noncoherent_cmo_ops)(void *vaddr, size_t size,
+					      enum dma_data_direction dir,
+					      enum dma_noncoherent_ops ops);
+};
+
+extern struct riscv_cache_ops zicbom_cmo_ops;
+
+#ifdef CONFIG_RISCV_DMA_NONCOHERENT
+
+extern struct riscv_cache_ops noncoherent_cache_ops;
+
+void riscv_noncoherent_register_cache_ops(struct riscv_cache_ops *ops);
+
+static inline void riscv_dma_noncoherent_clean(void *vaddr, size_t size)
+{
+	if (noncoherent_cache_ops.clean_range) {
+		unsigned long addr = (unsigned long)vaddr;
+
+		noncoherent_cache_ops.clean_range(addr, size);
+	}
+}
+
+static inline void riscv_dma_noncoherent_flush(void *vaddr, size_t size)
+{
+	if (noncoherent_cache_ops.flush_range) {
+		unsigned long addr = (unsigned long)vaddr;
+
+		noncoherent_cache_ops.flush_range(addr, size);
+	}
+}
+
+static inline void riscv_dma_noncoherent_inval(void *vaddr, size_t size)
+{
+	if (noncoherent_cache_ops.inv_range) {
+		unsigned long addr = (unsigned long)vaddr;
+
+		noncoherent_cache_ops.inv_range(addr, size);
+	}
+}
+
+#else
+
+static inline void riscv_noncoherent_register_cache_ops(struct riscv_cache_ops *ops) {}
+
+static inline void riscv_dma_noncoherent_clean(void *vaddr, size_t size) {}
+
+static inline void riscv_dma_noncoherent_flush(void *vaddr, size_t size) {}
+
+static inline void riscv_dma_noncoherent_inval(void *vaddr, size_t size) {}
+
+#endif
+
+#endif	/* __ASM_DMA_NONCOHERENT_H */
diff --git a/arch/riscv/include/asm/errata_list.h b/arch/riscv/include/asm/errata_list.h
index 4180312d2a70..ae3fc8b80edd 100644
--- a/arch/riscv/include/asm/errata_list.h
+++ b/arch/riscv/include/asm/errata_list.h
@@ -91,59 +91,6 @@  asm volatile(ALTERNATIVE(						\
 #define ALT_THEAD_PMA(_val)
 #endif
 
-/*
- * dcache.ipa rs1 (invalidate, physical address)
- * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 |
- *   0000001    01010      rs1       000      00000  0001011
- * dache.iva rs1 (invalida, virtual address)
- *   0000001    00110      rs1       000      00000  0001011
- *
- * dcache.cpa rs1 (clean, physical address)
- * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 |
- *   0000001    01001      rs1       000      00000  0001011
- * dcache.cva rs1 (clean, virtual address)
- *   0000001    00100      rs1       000      00000  0001011
- *
- * dcache.cipa rs1 (clean then invalidate, physical address)
- * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 |
- *   0000001    01011      rs1       000      00000  0001011
- * dcache.civa rs1 (... virtual address)
- *   0000001    00111      rs1       000      00000  0001011
- *
- * sync.s (make sure all cache operations finished)
- * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 |
- *   0000000    11001     00000      000      00000  0001011
- */
-#define THEAD_inval_A0	".long 0x0265000b"
-#define THEAD_clean_A0	".long 0x0245000b"
-#define THEAD_flush_A0	".long 0x0275000b"
-#define THEAD_SYNC_S	".long 0x0190000b"
-
-#define ALT_CMO_OP(_op, _start, _size, _cachesize)			\
-asm volatile(ALTERNATIVE_2(						\
-	__nops(6),							\
-	"mv a0, %1\n\t"							\
-	"j 2f\n\t"							\
-	"3:\n\t"							\
-	"cbo." __stringify(_op) " (a0)\n\t"				\
-	"add a0, a0, %0\n\t"						\
-	"2:\n\t"							\
-	"bltu a0, %2, 3b\n\t"						\
-	"nop", 0, CPUFEATURE_ZICBOM, CONFIG_RISCV_ISA_ZICBOM,		\
-	"mv a0, %1\n\t"							\
-	"j 2f\n\t"							\
-	"3:\n\t"							\
-	THEAD_##_op##_A0 "\n\t"						\
-	"add a0, a0, %0\n\t"						\
-	"2:\n\t"							\
-	"bltu a0, %2, 3b\n\t"						\
-	THEAD_SYNC_S, THEAD_VENDOR_ID,					\
-			ERRATA_THEAD_CMO, CONFIG_ERRATA_THEAD_CMO)	\
-	: : "r"(_cachesize),						\
-	    "r"((unsigned long)(_start) & ~((_cachesize) - 1UL)),	\
-	    "r"((unsigned long)(_start) + (_size))			\
-	: "a0")
-
 #define THEAD_C9XX_RV_IRQ_PMU			17
 #define THEAD_C9XX_CSR_SCOUNTEROF		0x5c5
 
diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
index 205bbd6b1fce..d94d32eb7faf 100644
--- a/arch/riscv/kernel/cpufeature.c
+++ b/arch/riscv/kernel/cpufeature.c
@@ -14,6 +14,7 @@ 
 #include <linux/of.h>
 #include <asm/alternative.h>
 #include <asm/cacheflush.h>
+#include <asm/dma-noncoherent.h>
 #include <asm/errata_list.h>
 #include <asm/hwcap.h>
 #include <asm/patch.h>
@@ -298,6 +299,7 @@  static bool __init_or_module cpufeature_probe_zicbom(unsigned int stage)
 		return false;
 
 	riscv_noncoherent_supported();
+	riscv_noncoherent_register_cache_ops(&zicbom_cmo_ops);
 	return true;
 }
 
diff --git a/arch/riscv/mm/dma-noncoherent.c b/arch/riscv/mm/dma-noncoherent.c
index d919efab6eba..d9445c266bfd 100644
--- a/arch/riscv/mm/dma-noncoherent.c
+++ b/arch/riscv/mm/dma-noncoherent.c
@@ -9,23 +9,82 @@ 
 #include <linux/dma-map-ops.h>
 #include <linux/mm.h>
 #include <asm/cacheflush.h>
+#include <asm/dma-noncoherent.h>
 
 static bool noncoherent_supported;
 
+struct riscv_cache_ops noncoherent_cache_ops = {
+	.clean_range = NULL,
+	.inv_range = NULL,
+	.flush_range = NULL,
+	.riscv_dma_noncoherent_cmo_ops = NULL,
+};
+EXPORT_SYMBOL(noncoherent_cache_ops);
+
+#ifdef CONFIG_RISCV_ISA_ZICBOM
+#define ZICBOM_CMO_OP(_op, _start, _size, _cachesize)				\
+	asm volatile("mv a0, %1\n\t"						\
+		     "j 2f\n\t"							\
+		     "3:\n\t"							\
+		     "cbo." __stringify(_op) " (a0)\n\t"			\
+		     "add a0, a0, %0\n\t"					\
+		     "2:\n\t"							\
+		     "bltu a0, %2, 3b\n\t"					\
+		     : : "r"(_cachesize),					\
+			 "r"((unsigned long)(_start) & ~((_cachesize) - 1UL)),	\
+			 "r"((unsigned long)(_start) + (_size))			\
+		     : "a0")
+
+static void zicbom_cmo_clean_range(unsigned long addr, unsigned long size)
+{
+	ZICBOM_CMO_OP(clean, addr, size, riscv_cbom_block_size);
+}
+
+static void zicbom_cmo_flush_range(unsigned long addr, unsigned long size)
+{
+	ZICBOM_CMO_OP(flush, addr, size, riscv_cbom_block_size);
+}
+
+static void zicbom_cmo_inval_range(unsigned long addr, unsigned long size)
+{
+	ZICBOM_CMO_OP(inval, addr, size, riscv_cbom_block_size);
+}
+
+struct riscv_cache_ops zicbom_cmo_ops = {
+	.clean_range = &zicbom_cmo_clean_range,
+	.inv_range = &zicbom_cmo_inval_range,
+	.flush_range = &zicbom_cmo_flush_range,
+};
+#else
+struct riscv_cache_ops zicbom_cmo_ops = {
+	.clean_range = NULL,
+	.inv_range = NULL,
+	.flush_range = NULL,
+	.riscv_dma_noncoherent_cmo_ops = NULL,
+};
+#endif
+EXPORT_SYMBOL(zicbom_cmo_ops);
+
 void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
 			      enum dma_data_direction dir)
 {
 	void *vaddr = phys_to_virt(paddr);
 
+	if (noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops) {
+		noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops(vaddr, size, dir,
+								    NON_COHERENT_SYNC_DMA_FOR_DEVICE);
+		return;
+	}
+
 	switch (dir) {
 	case DMA_TO_DEVICE:
-		ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
+		riscv_dma_noncoherent_clean(vaddr, size);
 		break;
 	case DMA_FROM_DEVICE:
-		ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
+		riscv_dma_noncoherent_clean(vaddr, size);
 		break;
 	case DMA_BIDIRECTIONAL:
-		ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size);
+		riscv_dma_noncoherent_flush(vaddr, size);
 		break;
 	default:
 		break;
@@ -37,12 +96,18 @@  void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
 {
 	void *vaddr = phys_to_virt(paddr);
 
+	if (noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops) {
+		noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops(vaddr, size, dir,
+								    NON_COHERENT_SYNC_DMA_FOR_CPU);
+		return;
+	}
+
 	switch (dir) {
 	case DMA_TO_DEVICE:
 		break;
 	case DMA_FROM_DEVICE:
 	case DMA_BIDIRECTIONAL:
-		ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size);
+		riscv_dma_noncoherent_flush(vaddr, size);
 		break;
 	default:
 		break;
@@ -53,7 +118,13 @@  void arch_dma_prep_coherent(struct page *page, size_t size)
 {
 	void *flush_addr = page_address(page);
 
-	ALT_CMO_OP(flush, flush_addr, size, riscv_cbom_block_size);
+	if (noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops) {
+		noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops(flush_addr, size, -1,
+								    NON_COHERENT_DMA_PREP);
+		return;
+	}
+
+	riscv_dma_noncoherent_flush(flush_addr, size);
 }
 
 void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
@@ -78,3 +149,16 @@  void riscv_noncoherent_supported(void)
 	     "Non-coherent DMA support enabled without a block size\n");
 	noncoherent_supported = true;
 }
+
+void riscv_noncoherent_register_cache_ops(struct riscv_cache_ops *ops)
+{
+	if (!ops)
+		return;
+
+	if (ops->riscv_dma_noncoherent_cmo_ops)
+		noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops =
+				ops->riscv_dma_noncoherent_cmo_ops;
+	else
+		noncoherent_cache_ops = *ops;
+}
+EXPORT_SYMBOL(riscv_noncoherent_register_cache_ops);
diff --git a/arch/riscv/mm/pmem.c b/arch/riscv/mm/pmem.c
index 089df92ae876..cd5aa6f42851 100644
--- a/arch/riscv/mm/pmem.c
+++ b/arch/riscv/mm/pmem.c
@@ -6,16 +6,28 @@ 
 #include <linux/export.h>
 #include <linux/libnvdimm.h>
 
-#include <asm/cacheflush.h>
+#include <asm/dma-noncoherent.h>
 
 void arch_wb_cache_pmem(void *addr, size_t size)
 {
-	ALT_CMO_OP(clean, addr, size, riscv_cbom_block_size);
+	if (noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops) {
+		noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops(addr, size,
+								    -1, NON_COHERENT_DMA_PMEM);
+		return;
+	}
+
+	riscv_dma_noncoherent_clean(addr, size);
 }
 EXPORT_SYMBOL_GPL(arch_wb_cache_pmem);
 
 void arch_invalidate_pmem(void *addr, size_t size)
 {
-	ALT_CMO_OP(inval, addr, size, riscv_cbom_block_size);
+	if (noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops) {
+		noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops(addr, size,
+								    -1, NON_COHERENT_DMA_PMEM);
+		return;
+	}
+
+	riscv_dma_noncoherent_inval(addr, size);
 }
 EXPORT_SYMBOL_GPL(arch_invalidate_pmem);