Message ID | 20230106185526.260163-2-prabhakar.mahadev-lad.rj@bp.renesas.com (mailing list archive)
---|---
State | Changes Requested
Delegated to: | Palmer Dabbelt
Series | RISC-V non-coherent function pointer based cache management operations + non-coherent DMA support for AX45MP
On Fri, Jan 06, 2023 at 11:31:33PM +0100, Arnd Bergmann wrote: > On Fri, Jan 6, 2023, at 19:55, Prabhakar wrote: > > From: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> > > +struct riscv_cache_ops zicbom_cmo_ops = { > > + .clean_range = &zicbom_cmo_clean_range, > > + .inv_range = &zicbom_cmo_inval_range, > > + .flush_range = &zicbom_cmo_flush_range, > > +}; > > +#else > > +struct riscv_cache_ops zicbom_cmo_ops = { > > + .clean_range = NULL, > > + .inv_range = NULL, > > + .flush_range = NULL, > > + .riscv_dma_noncoherent_cmo_ops = NULL, > > +}; > > +#endif > > +EXPORT_SYMBOL(zicbom_cmo_ops); > > Same here: If the ZICBOM ISA is disabled, nothing should > reference zicbom_cmo_ops. > Also, since ZICBOM is a standard > extension, I think it makes sense to always have it enabled, > at least whenever noncoherent DMA is supported, that way > it can be the default that gets used in the absence of any > nonstandard cache controller. While I think of it, this is not possible as Zicbom requires toolchain support whereas the alternative methods for non-coherent DMA do not. Thanks, Conor.
+CC Icenowy Hey Prabhakar, On Fri, Jan 06, 2023 at 06:55:21PM +0000, Prabhakar wrote: > From: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> > > The current implementation of CMO was handled using the ALTERNATIVE_X() > macro; this was manageable when there were a limited number of platforms > using this. Now that we are having more and more platforms coming through > with the CMO the use of the ALTERNATIVE_X() macro becomes unmanageable. Nitpick time! This opening paragraph is all a little oddly worded IMO. How about: Currently, selecting which CMOs to use on a given platform is done using an ALTERNATIVE_X() macro. This was manageable when there were just two CMO implementations, but now that there are more and more platforms coming needing custom CMOs, the use of the ALTERNATIVE_X() macro is unmanageable. > To avoid such issues this patch switches to use of function pointers > instead of ALTERNATIVE_X() macro for cache management (the only draw being s/draw/drawback > performance over the previous approach). Did you notice a performance decrease or are you speculating? Perhaps Icenowy - who I know benched their implementation of a new CMO macro for the THEAD stuff - can do so again for this. > void (*clean_range)(unsigned long addr, unsigned long size); > void (*inv_range)(unsigned long addr, unsigned long size); > void (*flush_range)(unsigned long addr, unsigned long size); > > The above function pointers are provided to be overridden where platforms > using standard approach and for platforms who want handle the operation > based on the operation the below function pointer is provided: Think you've left out some words here chief! Perhaps s/and for platforms who/. For platforms that/ & on the line after, s/operation/direction/ ? > void (*riscv_dma_noncoherent_cmo_ops)(void *vaddr, size_t size, > enum dma_data_direction dir, > enum dma_noncoherent_ops ops); > > In the current patch we have moved the ZICBOM and T-Head CMO to use > function pointers. 
"Convert ZICBOM..." > Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> > --- > v5->v6 > * New patch > --- > arch/riscv/errata/thead/errata.c | 71 ++++++++++++++++++ > arch/riscv/include/asm/dma-noncoherent.h | 83 +++++++++++++++++++++ > arch/riscv/include/asm/errata_list.h | 53 ------------- > arch/riscv/kernel/cpufeature.c | 2 + > arch/riscv/mm/dma-noncoherent.c | 94 ++++++++++++++++++++++-- > arch/riscv/mm/pmem.c | 18 ++++- > 6 files changed, 260 insertions(+), 61 deletions(-) > create mode 100644 arch/riscv/include/asm/dma-noncoherent.h > > diff --git a/arch/riscv/errata/thead/errata.c b/arch/riscv/errata/thead/errata.c > index fac5742d1c1e..826b2ba3e61e 100644 > --- a/arch/riscv/errata/thead/errata.c > +++ b/arch/riscv/errata/thead/errata.c > @@ -8,12 +8,72 @@ > #include <linux/module.h> > #include <linux/string.h> > #include <linux/uaccess.h> > +#include <asm/dma-noncoherent.h> > #include <asm/alternative.h> > #include <asm/cacheflush.h> > #include <asm/errata_list.h> > #include <asm/patch.h> > #include <asm/vendorid_list.h> > > +#ifdef CONFIG_ERRATA_THEAD_CMO > +/* > + * dcache.ipa rs1 (invalidate, physical address) > + * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 | > + * 0000001 01010 rs1 000 00000 0001011 > + * dache.iva rs1 (invalida, virtual address) > + * 0000001 00110 rs1 000 00000 0001011 > + * > + * dcache.cpa rs1 (clean, physical address) > + * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 | > + * 0000001 01001 rs1 000 00000 0001011 > + * dcache.cva rs1 (clean, virtual address) > + * 0000001 00100 rs1 000 00000 0001011 > + * > + * dcache.cipa rs1 (clean then invalidate, physical address) > + * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 | > + * 0000001 01011 rs1 000 00000 0001011 > + * dcache.civa rs1 (... 
virtual address) > + * 0000001 00111 rs1 000 00000 0001011 > + * > + * sync.s (make sure all cache operations finished) > + * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 | > + * 0000000 11001 00000 000 00000 0001011 > + */ > +#define THEAD_inval_A0 ".long 0x0265000b" > +#define THEAD_clean_A0 ".long 0x0245000b" > +#define THEAD_flush_A0 ".long 0x0275000b" > +#define THEAD_SYNC_S ".long 0x0190000b" > + > +#define THEAD_CMO_OP(_op, _start, _size, _cachesize) \ > + asm volatile("mv a0, %1\n\t" \ > + "j 2f\n\t" \ > + "3:\n\t" \ > + THEAD_##_op##_A0 "\n\t" \ > + "add a0, a0, %0\n\t" \ > + "2:\n\t" \ > + "bltu a0, %2, 3b\n\t" \ > + THEAD_SYNC_S \ > + : : "r"(_cachesize), \ > + "r"((unsigned long)(_start) & ~((_cachesize) - 1UL)), \ > + "r"((unsigned long)(_start) + (_size)) \ > + : "a0") I'm not going to provide a tested-by for the THEAD stuff, as I have no metrics - but I booted my usual NFS and did some tyre kicking. Seemed grand. > +static void thead_cmo_clean_range(unsigned long addr, unsigned long size) > +{ > + THEAD_CMO_OP(clean, addr, size, riscv_cbom_block_size); > +} > + > +static void thead_cmo_flush_range(unsigned long addr, unsigned long size) > +{ > + THEAD_CMO_OP(flush, addr, size, riscv_cbom_block_size); > +} > + > +static void thead_cmo_inval_range(unsigned long addr, unsigned long size) > +{ > + THEAD_CMO_OP(inval, addr, size, riscv_cbom_block_size); > +} > +#endif > + > static bool errata_probe_pbmt(unsigned int stage, > unsigned long arch_id, unsigned long impid) > { > @@ -33,6 +93,8 @@ static bool errata_probe_pbmt(unsigned int stage, > static bool errata_probe_cmo(unsigned int stage, > unsigned long arch_id, unsigned long impid) > { > + struct riscv_cache_ops thead_cmo_ops; > + > if (!IS_ENABLED(CONFIG_ERRATA_THEAD_CMO)) > return false; > > @@ -44,6 +106,15 @@ static bool errata_probe_cmo(unsigned int stage, > > riscv_cbom_block_size = L1_CACHE_BYTES; > riscv_noncoherent_supported(); > + > + memset(&thead_cmo_ops, 0x0, 
sizeof(thead_cmo_ops)); > + if (IS_ENABLED(CONFIG_ERRATA_THEAD_CMO)) { With CONFIG_ERRATA_THEAD_CMO=n: /stuff/linux/arch/riscv/errata/thead/errata.c: In function 'errata_probe_cmo': /stuff/linux/arch/riscv/errata/thead/errata.c:112:46: error: 'thead_cmo_clean_range' undeclared (first use in this function) 112 | thead_cmo_ops.clean_range = &thead_cmo_clean_range; | ^~~~~~~~~~~~~~~~~~~~~ /stuff/linux/arch/riscv/errata/thead/errata.c:112:46: note: each undeclared identifier is reported only once for each function it appears in /stuff/linux/arch/riscv/errata/thead/errata.c:113:44: error: 'thead_cmo_inval_range' undeclared (first use in this function) 113 | thead_cmo_ops.inv_range = &thead_cmo_inval_range; | ^~~~~~~~~~~~~~~~~~~~~ /stuff/linux/arch/riscv/errata/thead/errata.c:114:46: error: 'thead_cmo_flush_range' undeclared (first use in this function) 114 | thead_cmo_ops.flush_range = &thead_cmo_flush_range; | ^~~~~~~~~~~~~~~~~~~~~ These functions are only present if CONFIG_ERRATA_THEAD_CMO is set, so this cannot be an IS_ENABLED() as things stand. > + thead_cmo_ops.clean_range = &thead_cmo_clean_range; > + thead_cmo_ops.inv_range = &thead_cmo_inval_range; > + thead_cmo_ops.flush_range = &thead_cmo_flush_range; As Arnd suggested, I'd rather see a `static const struct riscv_cache_ops` type deal too, so you can just call register() directly. > + riscv_noncoherent_register_cache_ops(&thead_cmo_ops); > + } > + > return true; > } > > diff --git a/arch/riscv/include/asm/dma-noncoherent.h b/arch/riscv/include/asm/dma-noncoherent.h > new file mode 100644 > index 000000000000..a2af863d2608 > --- /dev/null > +++ b/arch/riscv/include/asm/dma-noncoherent.h > @@ -0,0 +1,83 @@ > +/* SPDX-License-Identifier: GPL-2.0-only */ > +/* > + * Copyright (C) 2022 Renesas Electronics Corp. 
> + */ > + > +#ifndef __ASM_DMA_NONCOHERENT_H > +#define __ASM_DMA_NONCOHERENT_H > + > +#include <linux/dma-direct.h> > + > +enum dma_noncoherent_ops { > + NON_COHERENT_SYNC_DMA_FOR_DEVICE = 0, > + NON_COHERENT_SYNC_DMA_FOR_CPU, > + NON_COHERENT_DMA_PREP, > + NON_COHERENT_DMA_PMEM, > +}; > + > +/* > + * struct riscv_cache_ops - Structure for CMO function pointers Drop the "function pointers" from everything here IMO. > + * @clean_range: Function pointer for clean cache > + * @inv_range: Function pointer for invalidate cache > + * @flush_range: Function pointer for flushing the cache > + * @riscv_dma_noncoherent_cmo_ops: Function pointer for platforms who want > + * to handle CMO themselves. If this function pointer is set rest of the > + * function pointers will be NULL. Will be NULL? Or must be NULL? Assuming you keep it, as I see Arnd is questioning it, I think a better description is needed for this one. And a rename of the function name, the one you have right now is oddly juxtaposed and a bit confusing. I don't really have something much better to suggest, so I am gonna use cmo_omnibus here... @cmo_omnibus: For platforms whose CMO capabilities do not align with the standard cache management operations. If set, the specific ops must be left unset. Circling back ages later, cmo_universal()? 
> +struct riscv_cache_ops { > + void (*clean_range)(unsigned long addr, unsigned long size); > + void (*inv_range)(unsigned long addr, unsigned long size); > + void (*flush_range)(unsigned long addr, unsigned long size); > + void (*riscv_dma_noncoherent_cmo_ops)(void *vaddr, size_t size, > + enum dma_data_direction dir, > + enum dma_noncoherent_ops ops); > +}; > + > +extern struct riscv_cache_ops zicbom_cmo_ops; > + > +#ifdef CONFIG_RISCV_DMA_NONCOHERENT > + > +extern struct riscv_cache_ops noncoherent_cache_ops; > + > +void riscv_noncoherent_register_cache_ops(struct riscv_cache_ops *ops); > + > +static inline void riscv_dma_noncoherent_clean(void *vaddr, size_t size) > +{ > + if (noncoherent_cache_ops.clean_range) { > + unsigned long addr = (unsigned long)vaddr; > + > + noncoherent_cache_ops.clean_range(addr, size); > + } > +} > + > +static inline void riscv_dma_noncoherent_flush(void *vaddr, size_t size) > +{ > + if (noncoherent_cache_ops.flush_range) { > + unsigned long addr = (unsigned long)vaddr; > + > + noncoherent_cache_ops.flush_range(addr, size); > + } > +} > + > +static inline void riscv_dma_noncoherent_inval(void *vaddr, size_t size) > +{ > + if (noncoherent_cache_ops.inv_range) { > + unsigned long addr = (unsigned long)vaddr; > + > + noncoherent_cache_ops.inv_range(addr, size); > + } > +} > + > +#else > + > +static void riscv_noncoherent_register_cache_ops(struct riscv_cache_ops *ops) {} > + > +static inline void riscv_dma_noncoherent_clean(void *vaddr, size_t size) {} > + > +static inline void riscv_dma_noncoherent_flush(void *vaddr, size_t size) {} > + > +static inline void riscv_dma_noncoherent_inval(void *vaddr, size_t size) {} Can these ever be used if DMA_NONCOHERENT is not set? I think riscv/mm/Makefile gates their usage on the symbol, no? > + > +#endif > + > +#endif /* __ASM_DMA_NONCOHERENT_H */ > diff --git a/arch/riscv/include/asm/errata_list.h b/arch/riscv/include/asm/errata_list.h > index 4180312d2a70..ae3fc8b80edd 100644 [...] 
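Conor's Makefile point can be made concrete: kbuild only compiles dma-noncoherent.o when the gating symbol is set, so with RISCV_DMA_NONCOHERENT=n there should be no callers left that need the #else stubs. A sketch of the kind of gating he is referring to (the exact contents of arch/riscv/mm/Makefile are assumed here, not quoted from the thread):

```make
# arch/riscv/mm/Makefile (sketch): the file containing all callers of the
# riscv_dma_noncoherent_* helpers is only built when the symbol is set.
obj-$(CONFIG_RISCV_DMA_NONCOHERENT) += dma-noncoherent.o
```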
> diff --git a/arch/riscv/mm/dma-noncoherent.c b/arch/riscv/mm/dma-noncoherent.c > index d919efab6eba..d9445c266bfd 100644 > --- a/arch/riscv/mm/dma-noncoherent.c > +++ b/arch/riscv/mm/dma-noncoherent.c > @@ -9,23 +9,82 @@ > #include <linux/dma-map-ops.h> > #include <linux/mm.h> > #include <asm/cacheflush.h> > +#include <asm/dma-noncoherent.h> > > static bool noncoherent_supported; > > +struct riscv_cache_ops noncoherent_cache_ops = { > + .clean_range = NULL, > + .inv_range = NULL, > + .flush_range = NULL, > + .riscv_dma_noncoherent_cmo_ops = NULL, > +}; > +EXPORT_SYMBOL(noncoherent_cache_ops); These exports should be _GPL, no? The file is GPLed. Ditto the others. > +#ifdef CONFIG_RISCV_ISA_ZICBOM > +#define ZICBOM_CMO_OP(_op, _start, _size, _cachesize) \ > + asm volatile("mv a0, %1\n\t" \ > + "j 2f\n\t" \ > + "3:\n\t" \ > + "cbo." __stringify(_op) " (a0)\n\t" \ > + "add a0, a0, %0\n\t" \ > + "2:\n\t" \ > + "bltu a0, %2, 3b\n\t" \ > + : : "r"(_cachesize), \ > + "r"((unsigned long)(_start) & ~((_cachesize) - 1UL)), \ > + "r"((unsigned long)(_start) + (_size)) \ > + : "a0") > + > +static void zicbom_cmo_clean_range(unsigned long addr, unsigned long size) > +{ > + ZICBOM_CMO_OP(clean, addr, size, riscv_cbom_block_size); > +} > + > +static void zicbom_cmo_flush_range(unsigned long addr, unsigned long size) > +{ > + ZICBOM_CMO_OP(flush, addr, size, riscv_cbom_block_size); > +} > + > +static void zicbom_cmo_inval_range(unsigned long addr, unsigned long size) > +{ > + ZICBOM_CMO_OP(inval, addr, size, riscv_cbom_block_size); > +} > + > +struct riscv_cache_ops zicbom_cmo_ops = { > + .clean_range = &zicbom_cmo_clean_range, > + .inv_range = &zicbom_cmo_inval_range, > + .flush_range = &zicbom_cmo_flush_range, > +}; > +#else > +struct riscv_cache_ops zicbom_cmo_ops = { > + .clean_range = NULL, > + .inv_range = NULL, > + .flush_range = NULL, > + .riscv_dma_noncoherent_cmo_ops = NULL, > +}; > +#endif > +EXPORT_SYMBOL(zicbom_cmo_ops); > + > void 
arch_sync_dma_for_device(phys_addr_t paddr, size_t size, > enum dma_data_direction dir) > { > void *vaddr = phys_to_virt(paddr); > > + if (noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops) { > + noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops(vaddr, size, dir, > + NON_COHERENT_SYNC_DMA_FOR_DEVICE); > + return; > + } > + > switch (dir) { > case DMA_TO_DEVICE: > - ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size); > + riscv_dma_noncoherent_clean(vaddr, size); > break; > case DMA_FROM_DEVICE: > - ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size); > + riscv_dma_noncoherent_clean(vaddr, size); > break; > case DMA_BIDIRECTIONAL: > - ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size); > + riscv_dma_noncoherent_flush(vaddr, size); > break; > default: > break; > @@ -37,12 +96,18 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size, > { > void *vaddr = phys_to_virt(paddr); > > + if (noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops) { > + noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops(vaddr, size, dir, > + NON_COHERENT_SYNC_DMA_FOR_CPU); > + return; > + } > + > switch (dir) { > case DMA_TO_DEVICE: > break; > case DMA_FROM_DEVICE: > case DMA_BIDIRECTIONAL: > - ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size); > + riscv_dma_noncoherent_flush(vaddr, size); > break; > default: > break; > @@ -53,7 +118,13 @@ void arch_dma_prep_coherent(struct page *page, size_t size) > { > void *flush_addr = page_address(page); > > - ALT_CMO_OP(flush, flush_addr, size, riscv_cbom_block_size); > + if (noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops) { > + noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops(flush_addr, size, -1, > + NON_COHERENT_DMA_PREP); > + return; > + } > + > + riscv_dma_noncoherent_flush(flush_addr, size); > } > > void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size, > @@ -78,3 +149,16 @@ void riscv_noncoherent_supported(void) > "Non-coherent DMA support enabled without a block size\n"); > noncoherent_supported = true; > 
} Barring the function name, which I really don't like - it really is confusingly named. Something like cmo_universal() would get its point across a bit better I think. > +void riscv_noncoherent_register_cache_ops(struct riscv_cache_ops *ops) > +{ > + if (!ops) > + return; > + > + if (ops->riscv_dma_noncoherent_cmo_ops) > + noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops = > + ops->riscv_dma_noncoherent_cmo_ops; > + else > + noncoherent_cache_ops = *ops; Can we just chop off the extra step here? Behaviourally this is going to be no different whether someone sets both the universal op and the individual ones anyway. > +} > +EXPORT_SYMBOL(riscv_noncoherent_register_cache_ops); _GPL again I think! I'm happy with this approach, modulo the minor bits I've pointed out. I do really like that it takes us away from messing with an asm alternative every time someone wants to do this stuff differently or for a new platform. I would like Heiko to have a look at what you've done here, but yeah, looks promising IMO. Thanks, Conor.
On Sat, Jan 7, 2023, at 00:29, Conor Dooley wrote: > On Fri, Jan 06, 2023 at 11:31:33PM +0100, Arnd Bergmann wrote: >> On Fri, Jan 6, 2023, at 19:55, Prabhakar wrote: >> > From: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> >> > +struct riscv_cache_ops zicbom_cmo_ops = { >> > + .clean_range = &zicbom_cmo_clean_range, >> > + .inv_range = &zicbom_cmo_inval_range, >> > + .flush_range = &zicbom_cmo_flush_range, >> > +}; >> > +#else >> > +struct riscv_cache_ops zicbom_cmo_ops = { >> > + .clean_range = NULL, >> > + .inv_range = NULL, >> > + .flush_range = NULL, >> > + .riscv_dma_noncoherent_cmo_ops = NULL, >> > +}; >> > +#endif >> > +EXPORT_SYMBOL(zicbom_cmo_ops); >> >> Same here: If the ZICBOM ISA is disabled, nothing should >> reference zicbom_cmo_ops. > >> Also, since ZICBOM is a standard >> extension, I think it makes sense to always have it enabled, >> at least whenever noncoherent DMA is supported, that way >> it can be the default that gets used in the absence of any >> nonstandard cache controller. > > While I think of it, this is not possible as Zicbom requires toolchain > support whereas the alternative methods for non-coherent DMA do not. Ah, I see. Would it be possible to use the same .long trick as in the other ones though? Something like #if CONFIG_AS_VERSION >= 23600 /* or whichever version */ /* proper inline asm */ #else /* .long hack */ #endif That way everyone can use it, and the hack would automatically go away in a few years after linux requires a newer toolchain. Alternatively, the entire noncoherent-dma support could be made to depend on whichever toolchain introduced Zicbom. Arnd
Hi Arnd, Thank you for the review. On Fri, Jan 6, 2023 at 10:31 PM Arnd Bergmann <arnd@arndb.de> wrote: > > On Fri, Jan 6, 2023, at 19:55, Prabhakar wrote: > > From: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> > > > > The current implementation of CMO was handled using the ALTERNATIVE_X() > > macro; this was manageable when there were a limited number of platforms > > using this. Now that we are having more and more platforms coming through > > with the CMO the use of the ALTERNATIVE_X() macro becomes unmanageable. > > > > To avoid such issues this patch switches to use of function pointers > > instead of ALTERNATIVE_X() macro for cache management (the only draw being > > performance over the previous approach). > > > > void (*clean_range)(unsigned long addr, unsigned long size); > > void (*inv_range)(unsigned long addr, unsigned long size); > > void (*flush_range)(unsigned long addr, unsigned long size); > > > > The above function pointers are provided to be overridden where platforms > > using standard approach and for platforms who want handle the operation > > based on the operation the below function pointer is provided: > > > > void (*riscv_dma_noncoherent_cmo_ops)(void *vaddr, size_t size, > > enum dma_data_direction dir, > > enum dma_noncoherent_ops ops); > > > > In the current patch we have moved the ZICBOM and T-Head CMO to use > > function pointers. > > > > Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> > > This looks like a nice improvement! I have a few suggestions > for improvements, but no objections here. > > > diff --git a/arch/riscv/errata/thead/errata.c b/arch/riscv/errata/thead/errata.c > > index fac5742d1c1e..826b2ba3e61e 100644 > > --- a/arch/riscv/errata/thead/errata.c > > +++ b/arch/riscv/errata/thead/errata.c > ... 
> > @@ -44,6 +106,15 @@ static bool errata_probe_cmo(unsigned int stage, > > > > riscv_cbom_block_size = L1_CACHE_BYTES; > > riscv_noncoherent_supported(); > > + > > + memset(&thead_cmo_ops, 0x0, sizeof(thead_cmo_ops)); > > + if (IS_ENABLED(CONFIG_ERRATA_THEAD_CMO)) { > > + thead_cmo_ops.clean_range = &thead_cmo_clean_range; > > + thead_cmo_ops.inv_range = &thead_cmo_inval_range; > > + thead_cmo_ops.flush_range = &thead_cmo_flush_range; > > + riscv_noncoherent_register_cache_ops(&thead_cmo_ops); > > + } > > The implementation here looks reasonable, just wonder whether > the classification as an 'errata' makes sense. I would probably > consider this a 'driver' at this point, but that's just > a question of personal preference. > zicbom is a CPU feature that doesn't have any DT node, and hence no driver; similarly for the T-HEAD SoC. Also, arch_setup_dma_ops() happens quite early, before driver probing, due to which we get WARN() messages during bootup; hence I have implemented it as errata, as errata patching happens quite early. > For the operations structure, I think a 'static const struct > riscv_cache_ops' is more intuitive than assigning the > members individually. OK.
> > + */ > > +struct riscv_cache_ops { > > + void (*clean_range)(unsigned long addr, unsigned long size); > > + void (*inv_range)(unsigned long addr, unsigned long size); > > + void (*flush_range)(unsigned long addr, unsigned long size); > > + void (*riscv_dma_noncoherent_cmo_ops)(void *vaddr, size_t size, > > + enum dma_data_direction dir, > > + enum dma_noncoherent_ops ops); > > +}; > > I don't quite see how the fourth operation is used here. > Are there cache controllers that need something beyond > clean/inv/flush? > This is for platforms that don't follow the standard cache operations (as done in patch 5/6), and their drivers decide on the operations depending on the ops and dir. > > + > > +#else > > + > > +static void riscv_noncoherent_register_cache_ops(struct > > riscv_cache_ops *ops) {} > > + > > +static inline void riscv_dma_noncoherent_clean(void *vaddr, size_t > > size) {} > > + > > +static inline void riscv_dma_noncoherent_flush(void *vaddr, size_t > > size) {} > > + > > +static inline void riscv_dma_noncoherent_inval(void *vaddr, size_t > > size) {} > > I think you can drop the #else path here: if there is no > noncoherent DMA, then nothing should ever call these > functions, right? > Yes, they shouldn't be called. > > + > > +#ifdef CONFIG_RISCV_ISA_ZICBOM > ... > > +struct riscv_cache_ops zicbom_cmo_ops = { > > + .clean_range = &zicbom_cmo_clean_range, > > + .inv_range = &zicbom_cmo_inval_range, > > + .flush_range = &zicbom_cmo_flush_range, > > +}; > > +#else > > +struct riscv_cache_ops zicbom_cmo_ops = { > > + .clean_range = NULL, > > + .inv_range = NULL, > > + .flush_range = NULL, > > + .riscv_dma_noncoherent_cmo_ops = NULL, > > +}; > > +#endif > > +EXPORT_SYMBOL(zicbom_cmo_ops); > > Same here: If the ZICBOM ISA is disabled, nothing should > reference zicbom_cmo_ops. OK. Cheers, Prabhakar
On Sat, Jan 07, 2023 at 10:52:55PM +0100, Arnd Bergmann wrote: > On Sat, Jan 7, 2023, at 00:29, Conor Dooley wrote: > > On Fri, Jan 06, 2023 at 11:31:33PM +0100, Arnd Bergmann wrote: > >> On Fri, Jan 6, 2023, at 19:55, Prabhakar wrote: > >> > From: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> > >> > +struct riscv_cache_ops zicbom_cmo_ops = { > >> > + .clean_range = &zicbom_cmo_clean_range, > >> > + .inv_range = &zicbom_cmo_inval_range, > >> > + .flush_range = &zicbom_cmo_flush_range, > >> > +}; > >> > +#else > >> > +struct riscv_cache_ops zicbom_cmo_ops = { > >> > + .clean_range = NULL, > >> > + .inv_range = NULL, > >> > + .flush_range = NULL, > >> > + .riscv_dma_noncoherent_cmo_ops = NULL, > >> > +}; > >> > +#endif > >> > +EXPORT_SYMBOL(zicbom_cmo_ops); > >> > >> Same here: If the ZICBOM ISA is disabled, nothing should > >> reference zicbom_cmo_ops. > > > >> Also, since ZICBOM is a standard > >> extension, I think it makes sense to always have it enabled, > >> at least whenever noncoherent DMA is supported, that way > >> it can be the default that gets used in the absence of any > >> nonstandard cache controller. > > > > While I think of it, this is not possible as Zicbom requires toolchain > > support whereas the alternative methods for non-coherent DMA do not. > > Ah, I see. Would it be possible to use the same .long trick > as in the other ones though? Something like > > #if CONFIG_AS_VERSION >= 23600 /* or whichever version */ > /* proper inline asm */ > #else > /* .long hack */ > #endif > > That way everyone can use it, and the hack would automatically > go away in a few years after linux requires a newer toolchain. > Alternatively, the entire noncoherent-dma support could be > made to depend on whichever toolchain introduced Zicbom. Ehh, I don't think that's a great idea. It'd require far too recent a toolchain IMO. 
Ideally, in my opinion, we'd just do something like what Drew has proposed for Zicboz, negating the need for a check at all: https://lore.kernel.org/linux-riscv/20221027130247.31634-4-ajones@ventanamicro.com/ Been waiting for that to be re-spun and Palmer to accept it before doing the same thing for Zicbom. At present we have this in the arch Kconfig: config TOOLCHAIN_HAS_ZICBOM bool default y depends on !64BIT || $(cc-option,-mabi=lp64 -march=rv64ima_zicbom) depends on !32BIT || $(cc-option,-mabi=ilp32 -march=rv32ima_zicbom) depends on LLD_VERSION >= 150000 || LD_VERSION >= 23800 config RISCV_ISA_ZICBOM bool "Zicbom extension support for non-coherent DMA operation" depends on TOOLCHAIN_HAS_ZICBOM The linker version check is entirely due to the linker having issues if it sees zicbom in the ISA string in object files. I'd been intending to do that for Zicbom anyway, so I guess I'll just go do it & Prabhakar can attach it to his v7.. Thanks, Conor.
Hi Conor, Thank you for the review. On Fri, Jan 6, 2023 at 11:48 PM Conor Dooley <conor@kernel.org> wrote: > > +CC Icenowy > > Hey Prabhakar, > > On Fri, Jan 06, 2023 at 06:55:21PM +0000, Prabhakar wrote: > > From: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> > > > > The current implementation of CMO was handled using the ALTERNATIVE_X() > > macro; this was manageable when there were a limited number of platforms > > using this. Now that we are having more and more platforms coming through > > with the CMO the use of the ALTERNATIVE_X() macro becomes unmanageable. > > Nitpick time! This opening paragraph is all a little oddly worded IMO. > How about: > > Currently, selecting which CMOs to use on a given platform is done using > an ALTERNATIVE_X() macro. This was manageable when there were just two > CMO implementations, but now that there are more and more platforms coming > needing custom CMOs, the use of the ALTERNATIVE_X() macro is unmanageable. > sounds good. > > To avoid such issues this patch switches to use of function pointers > > instead of ALTERNATIVE_X() macro for cache management (the only draw being > > s/draw/drawback > oops. > > performance over the previous approach). > > Did you notice a performance decrease or are you speculating? > Perhaps Icenowy - who I know benched their implementation of a new CMO > macro for the THEAD stuff - can do so again for this. > I actually haven't benchmarked the speeds, so tbh I am speculating, because of the additional checks. Maybe if I export the functions instead of having them in the header, and have static keys for each callback, that will reduce some lag if any. Does that sound good? 
> > void (*clean_range)(unsigned long addr, unsigned long size); > > void (*inv_range)(unsigned long addr, unsigned long size); > > void (*flush_range)(unsigned long addr, unsigned long size); > > > > The above function pointers are provided to be overridden where platforms > > using standard approach and for platforms who want handle the operation > > based on the operation the below function pointer is provided: > > Think you've left out some words here chief! > Perhaps s/and for platforms who/. For platforms that/ & on the line > after, s/operation/direction/ ? > I'll re-word it. > > void (*riscv_dma_noncoherent_cmo_ops)(void *vaddr, size_t size, > > enum dma_data_direction dir, > > enum dma_noncoherent_ops ops); > > > > In the current patch we have moved the ZICBOM and T-Head CMO to use > > function pointers. > > "Convert ZICBOM..." > OK. > > Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> > > --- > > v5->v6 > > * New patch > > --- > > arch/riscv/errata/thead/errata.c | 71 ++++++++++++++++++ > > arch/riscv/include/asm/dma-noncoherent.h | 83 +++++++++++++++++++++ > > arch/riscv/include/asm/errata_list.h | 53 ------------- > > arch/riscv/kernel/cpufeature.c | 2 + > > arch/riscv/mm/dma-noncoherent.c | 94 ++++++++++++++++++++++-- > > arch/riscv/mm/pmem.c | 18 ++++- > > 6 files changed, 260 insertions(+), 61 deletions(-) > > create mode 100644 arch/riscv/include/asm/dma-noncoherent.h > > > > diff --git a/arch/riscv/errata/thead/errata.c b/arch/riscv/errata/thead/errata.c > > index fac5742d1c1e..826b2ba3e61e 100644 > > --- a/arch/riscv/errata/thead/errata.c > > +++ b/arch/riscv/errata/thead/errata.c > > @@ -8,12 +8,72 @@ > > #include <linux/module.h> > > #include <linux/string.h> > > #include <linux/uaccess.h> > > +#include <asm/dma-noncoherent.h> > > #include <asm/alternative.h> > > #include <asm/cacheflush.h> > > #include <asm/errata_list.h> > > #include <asm/patch.h> > > #include <asm/vendorid_list.h> > > > > +#ifdef CONFIG_ERRATA_THEAD_CMO 
> > +/* > > + * dcache.ipa rs1 (invalidate, physical address) > > + * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 | > > + * 0000001 01010 rs1 000 00000 0001011 > > + * dache.iva rs1 (invalida, virtual address) > > + * 0000001 00110 rs1 000 00000 0001011 > > + * > > + * dcache.cpa rs1 (clean, physical address) > > + * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 | > > + * 0000001 01001 rs1 000 00000 0001011 > > + * dcache.cva rs1 (clean, virtual address) > > + * 0000001 00100 rs1 000 00000 0001011 > > + * > > + * dcache.cipa rs1 (clean then invalidate, physical address) > > + * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 | > > + * 0000001 01011 rs1 000 00000 0001011 > > + * dcache.civa rs1 (... virtual address) > > + * 0000001 00111 rs1 000 00000 0001011 > > + * > > + * sync.s (make sure all cache operations finished) > > + * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 | > > + * 0000000 11001 00000 000 00000 0001011 > > + */ > > +#define THEAD_inval_A0 ".long 0x0265000b" > > +#define THEAD_clean_A0 ".long 0x0245000b" > > +#define THEAD_flush_A0 ".long 0x0275000b" > > +#define THEAD_SYNC_S ".long 0x0190000b" > > + > > +#define THEAD_CMO_OP(_op, _start, _size, _cachesize) \ > > + asm volatile("mv a0, %1\n\t" \ > > + "j 2f\n\t" \ > > + "3:\n\t" \ > > + THEAD_##_op##_A0 "\n\t" \ > > + "add a0, a0, %0\n\t" \ > > + "2:\n\t" \ > > + "bltu a0, %2, 3b\n\t" \ > > + THEAD_SYNC_S \ > > + : : "r"(_cachesize), \ > > + "r"((unsigned long)(_start) & ~((_cachesize) - 1UL)), \ > > + "r"((unsigned long)(_start) + (_size)) \ > > + : "a0") > > I'm not going to provide a tested-by for the THEAD stuff, as I have no > metrics - but I booted my usual NFS and did some tyre kicking. Seemed > grand. 
> \o/ > > +static void thead_cmo_clean_range(unsigned long addr, unsigned long size) > > +{ > > + THEAD_CMO_OP(clean, addr, size, riscv_cbom_block_size); > > +} > > + > > +static void thead_cmo_flush_range(unsigned long addr, unsigned long size) > > +{ > > + THEAD_CMO_OP(flush, addr, size, riscv_cbom_block_size); > > +} > > + > > +static void thead_cmo_inval_range(unsigned long addr, unsigned long size) > > +{ > > + THEAD_CMO_OP(inval, addr, size, riscv_cbom_block_size); > > +} > > +#endif > > + > > static bool errata_probe_pbmt(unsigned int stage, > > unsigned long arch_id, unsigned long impid) > > { > > @@ -33,6 +93,8 @@ static bool errata_probe_pbmt(unsigned int stage, > > static bool errata_probe_cmo(unsigned int stage, > > unsigned long arch_id, unsigned long impid) > > { > > + struct riscv_cache_ops thead_cmo_ops; > > + > > if (!IS_ENABLED(CONFIG_ERRATA_THEAD_CMO)) > > return false; > > > > @@ -44,6 +106,15 @@ static bool errata_probe_cmo(unsigned int stage, > > > > riscv_cbom_block_size = L1_CACHE_BYTES; > > riscv_noncoherent_supported(); > > + > > + memset(&thead_cmo_ops, 0x0, sizeof(thead_cmo_ops)); > > + if (IS_ENABLED(CONFIG_ERRATA_THEAD_CMO)) { > > With CONFIG_ERRATA_THEAD_CMO=n: > > /stuff/linux/arch/riscv/errata/thead/errata.c: In function 'errata_probe_cmo': > /stuff/linux/arch/riscv/errata/thead/errata.c:112:46: error: 'thead_cmo_clean_range' undeclared (first use in this function) > 112 | thead_cmo_ops.clean_range = &thead_cmo_clean_range; > | ^~~~~~~~~~~~~~~~~~~~~ > /stuff/linux/arch/riscv/errata/thead/errata.c:112:46: note: each undeclared identifier is reported only once for each function it appears in > /stuff/linux/arch/riscv/errata/thead/errata.c:113:44: error: 'thead_cmo_inval_range' undeclared (first use in this function) > 113 | thead_cmo_ops.inv_range = &thead_cmo_inval_range; > | ^~~~~~~~~~~~~~~~~~~~~ > /stuff/linux/arch/riscv/errata/thead/errata.c:114:46: error: 'thead_cmo_flush_range' undeclared (first use in this function) > 114 | 
thead_cmo_ops.flush_range = &thead_cmo_flush_range; > | ^~~~~~~~~~~~~~~~~~~~~ > > These functions are only present if CONFIG_ERRATA_THEAD_CMO is set, > so this cannot be an IS_ENABLED() as things stand. > OK, I'll fix this. > > + thead_cmo_ops.clean_range = &thead_cmo_clean_range; > > + thead_cmo_ops.inv_range = &thead_cmo_inval_range; > > + thead_cmo_ops.flush_range = &thead_cmo_flush_range; > > As Arnd suggested, I'd rather see a `static const struct > riscv_cache_ops` type deal too, so you can just call register() > directly. > Sure will do that. > > + riscv_noncoherent_register_cache_ops(&thead_cmo_ops); > > + } > > + > > return true; > > } > > > > diff --git a/arch/riscv/include/asm/dma-noncoherent.h b/arch/riscv/include/asm/dma-noncoherent.h > > new file mode 100644 > > index 000000000000..a2af863d2608 > > --- /dev/null > > +++ b/arch/riscv/include/asm/dma-noncoherent.h > > @@ -0,0 +1,83 @@ > > +/* SPDX-License-Identifier: GPL-2.0-only */ > > +/* > > + * Copyright (C) 2022 Renesas Electronics Corp. > > + */ > > + > > +#ifndef __ASM_DMA_NONCOHERENT_H > > +#define __ASM_DMA_NONCOHERENT_H > > + > > +#include <linux/dma-direct.h> > > + > > +enum dma_noncoherent_ops { > > + NON_COHERENT_SYNC_DMA_FOR_DEVICE = 0, > > + NON_COHERENT_SYNC_DMA_FOR_CPU, > > + NON_COHERENT_DMA_PREP, > > + NON_COHERENT_DMA_PMEM, > > +}; > > + > > +/* > > + * struct riscv_cache_ops - Structure for CMO function pointers > > Drop the "function pointers" from everything here IMO. > OK. > > + * @clean_range: Function pointer for clean cache > > + * @inv_range: Function pointer for invalidate cache > > + * @flush_range: Function pointer for flushing the cache > > + * @riscv_dma_noncoherent_cmo_ops: Function pointer for platforms who want > > + * to handle CMO themselves. If this function pointer is set rest of the > > + * function pointers will be NULL. > > Will be NULL? Or must be NULL? 
> Assuming you keep it, as I see Arnd is questioning it, I think a better > description is needed for this one. And a rename of the function name, > the one you have right now is oddly juxtaposed and a bit confusing. > I don't really have something much better to suggest, so I am gonna use > cmo_omnibus here... > > @cmo_omnibus: For platforms whose CMO capabilities do not align with the > standard cache management operations. If set, the specific > ops must be left unset. > OK, I'll update the description as above. > Circling back ages later, cmo_universal()? > Sounds good to me. > > +struct riscv_cache_ops { > > + void (*clean_range)(unsigned long addr, unsigned long size); > > + void (*inv_range)(unsigned long addr, unsigned long size); > > + void (*flush_range)(unsigned long addr, unsigned long size); > > + void (*riscv_dma_noncoherent_cmo_ops)(void *vaddr, size_t size, > > + enum dma_data_direction dir, > > + enum dma_noncoherent_ops ops); > > +}; > > + > > +extern struct riscv_cache_ops zicbom_cmo_ops; > > + > > +#ifdef CONFIG_RISCV_DMA_NONCOHERENT > > + > > +extern struct riscv_cache_ops noncoherent_cache_ops; > > + > > +void riscv_noncoherent_register_cache_ops(struct riscv_cache_ops *ops); > > + > > +static inline void riscv_dma_noncoherent_clean(void *vaddr, size_t size) > > +{ > > + if (noncoherent_cache_ops.clean_range) { > > + unsigned long addr = (unsigned long)vaddr; > > + > > + noncoherent_cache_ops.clean_range(addr, size); > > + } > > +} > > + > > +static inline void riscv_dma_noncoherent_flush(void *vaddr, size_t size) > > +{ > > + if (noncoherent_cache_ops.flush_range) { > > + unsigned long addr = (unsigned long)vaddr; > > + > > + noncoherent_cache_ops.flush_range(addr, size); > > + } > > +} > > + > > +static inline void riscv_dma_noncoherent_inval(void *vaddr, size_t size) > > +{ > > + if (noncoherent_cache_ops.inv_range) { > > + unsigned long addr = (unsigned long)vaddr; > > + > > + noncoherent_cache_ops.inv_range(addr, size); > > + } > > +} > > 
+ > > +#else > > + > > +static void riscv_noncoherent_register_cache_ops(struct riscv_cache_ops *ops) {} > > + > > +static inline void riscv_dma_noncoherent_clean(void *vaddr, size_t size) {} > > + > > +static inline void riscv_dma_noncoherent_flush(void *vaddr, size_t size) {} > > + > > +static inline void riscv_dma_noncoherent_inval(void *vaddr, size_t size) {} > > Can these ever be used if DMA_NONCOHERENT is not set? I think > riscv/mm/Makefile gates their usage on the symbol, no? > I think I can get rid of these. > > + > > +#endif > > + > > +#endif /* __ASM_DMA_NONCOHERENT_H */ > > diff --git a/arch/riscv/include/asm/errata_list.h b/arch/riscv/include/asm/errata_list.h > > index 4180312d2a70..ae3fc8b80edd 100644 > > [...] > > > diff --git a/arch/riscv/mm/dma-noncoherent.c b/arch/riscv/mm/dma-noncoherent.c > > index d919efab6eba..d9445c266bfd 100644 > > --- a/arch/riscv/mm/dma-noncoherent.c > > +++ b/arch/riscv/mm/dma-noncoherent.c > > @@ -9,23 +9,82 @@ > > #include <linux/dma-map-ops.h> > > #include <linux/mm.h> > > #include <asm/cacheflush.h> > > +#include <asm/dma-noncoherent.h> > > > > static bool noncoherent_supported; > > > > +struct riscv_cache_ops noncoherent_cache_ops = { > > + .clean_range = NULL, > > + .inv_range = NULL, > > + .flush_range = NULL, > > + .riscv_dma_noncoherent_cmo_ops = NULL, > > +}; > > +EXPORT_SYMBOL(noncoherent_cache_ops); > > These exports should be _GPL, no? The file is GPLed. Ditto the others. > Agreed. > > +#ifdef CONFIG_RISCV_ISA_ZICBOM > > +#define ZICBOM_CMO_OP(_op, _start, _size, _cachesize) \ > > + asm volatile("mv a0, %1\n\t" \ > > + "j 2f\n\t" \ > > + "3:\n\t" \ > > + "cbo." 
__stringify(_op) " (a0)\n\t" \ > > + "add a0, a0, %0\n\t" \ > > + "2:\n\t" \ > > + "bltu a0, %2, 3b\n\t" \ > > + : : "r"(_cachesize), \ > > + "r"((unsigned long)(_start) & ~((_cachesize) - 1UL)), \ > > + "r"((unsigned long)(_start) + (_size)) \ > > + : "a0") > > + > > +static void zicbom_cmo_clean_range(unsigned long addr, unsigned long size) > > +{ > > + ZICBOM_CMO_OP(clean, addr, size, riscv_cbom_block_size); > > +} > > + > > +static void zicbom_cmo_flush_range(unsigned long addr, unsigned long size) > > +{ > > + ZICBOM_CMO_OP(flush, addr, size, riscv_cbom_block_size); > > +} > > + > > +static void zicbom_cmo_inval_range(unsigned long addr, unsigned long size) > > +{ > > + ZICBOM_CMO_OP(inval, addr, size, riscv_cbom_block_size); > > +} > > + > > +struct riscv_cache_ops zicbom_cmo_ops = { > > + .clean_range = &zicbom_cmo_clean_range, > > + .inv_range = &zicbom_cmo_inval_range, > > + .flush_range = &zicbom_cmo_flush_range, > > +}; > > +#else > > +struct riscv_cache_ops zicbom_cmo_ops = { > > + .clean_range = NULL, > > + .inv_range = NULL, > > + .flush_range = NULL, > > + .riscv_dma_noncoherent_cmo_ops = NULL, > > +}; > > +#endif > > +EXPORT_SYMBOL(zicbom_cmo_ops); > > + > > void arch_sync_dma_for_device(phys_addr_t paddr, size_t size, > > enum dma_data_direction dir) > > { > > void *vaddr = phys_to_virt(paddr); > > > > + if (noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops) { > > + noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops(vaddr, size, dir, > > + NON_COHERENT_SYNC_DMA_FOR_DEVICE); > > + return; > > + } > > + > > switch (dir) { > > case DMA_TO_DEVICE: > > - ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size); > > + riscv_dma_noncoherent_clean(vaddr, size); > > break; > > case DMA_FROM_DEVICE: > > - ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size); > > + riscv_dma_noncoherent_clean(vaddr, size); > > break; > > case DMA_BIDIRECTIONAL: > > - ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size); > > + riscv_dma_noncoherent_flush(vaddr, size); > > 
break; > > default: > > break; > > @@ -37,12 +96,18 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size, > > { > > void *vaddr = phys_to_virt(paddr); > > > > + if (noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops) { > > + noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops(vaddr, size, dir, > > + NON_COHERENT_SYNC_DMA_FOR_CPU); > > + return; > > + } > > + > > switch (dir) { > > case DMA_TO_DEVICE: > > break; > > case DMA_FROM_DEVICE: > > case DMA_BIDIRECTIONAL: > > - ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size); > > + riscv_dma_noncoherent_flush(vaddr, size); > > break; > > default: > > break; > > @@ -53,7 +118,13 @@ void arch_dma_prep_coherent(struct page *page, size_t size) > > { > > void *flush_addr = page_address(page); > > > > - ALT_CMO_OP(flush, flush_addr, size, riscv_cbom_block_size); > > + if (noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops) { > > + noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops(flush_addr, size, -1, > > + NON_COHERENT_DMA_PREP); > > + return; > > + } > > + > > + riscv_dma_noncoherent_flush(flush_addr, size); > > } > > > > void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size, > > @@ -78,3 +149,16 @@ void riscv_noncoherent_supported(void) > > "Non-coherent DMA support enabled without a block size\n"); > > noncoherent_supported = true; > > } > > Barring the function name, which I really don't like - it really is > confusingly named. Something like cmo_universal() would get it's point > across a bit better I think. > OK, I'll rename riscv_dma_noncoherent_cmo_ops to cmo_universal. > > +void riscv_noncoherent_register_cache_ops(struct riscv_cache_ops *ops) > > +{ > > + if (!ops) > > + return; > > + > > + if (ops->riscv_dma_noncoherent_cmo_ops) > > + noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops = > > + ops->riscv_dma_noncoherent_cmo_ops; > > + else > > + noncoherent_cache_ops = *ops; > > Can we just chop off the extra step here? 
Behaviourally this is going to > be no different even if someone sets both the universal op and the individual > ones anyway. > Agreed. > > +} > > +EXPORT_SYMBOL(riscv_noncoherent_register_cache_ops); > > _GPL again I think! > OK. > I'm happy with this approach, modulo the minor bits I've pointed out. > I do really like that it takes us away from messing with an asm > alternative every time someone wants to do this stuff differently or for > a new platform. > \o/ > I would like Heiko to have a look at what you've done here, but ye looks > promising IMO. > > Thanks, > Conor. > Cheers, Prabhakar
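As an aside, the `static const struct riscv_cache_ops` registration shape suggested in the review above can be sketched in plain C. This is a hedged userspace mock: the struct layout mirrors the patch (minus the universal op), but the stub CMO bodies and the `last_cleaned` bookkeeping are invented for the example; the real implementations are the `THEAD_CMO_OP()` asm blocks.

```c
#include <stddef.h>

/* Mirrors the patch's riscv_cache_ops; the universal op is omitted here. */
struct riscv_cache_ops {
	void (*clean_range)(unsigned long addr, unsigned long size);
	void (*inv_range)(unsigned long addr, unsigned long size);
	void (*flush_range)(unsigned long addr, unsigned long size);
};

/* All-NULL by default, as in the patch. */
static struct riscv_cache_ops noncoherent_cache_ops;

/* Registration is just a pointer copy - no runtime code patching. */
static void riscv_noncoherent_register_cache_ops(const struct riscv_cache_ops *ops)
{
	if (ops)
		noncoherent_cache_ops = *ops;
}

/* Stand-ins for the THEAD_CMO_OP() asm blocks; they only record the call. */
static unsigned long last_cleaned;
static void thead_cmo_clean_range(unsigned long addr, unsigned long size)
{
	last_cleaned = addr;
}
static void thead_cmo_inval_range(unsigned long addr, unsigned long size) { }
static void thead_cmo_flush_range(unsigned long addr, unsigned long size) { }

/* The static const table the reviewers asked for, passed straight to
 * the register() call instead of being built up field by field. */
static const struct riscv_cache_ops thead_cmo_ops = {
	.clean_range = thead_cmo_clean_range,
	.inv_range   = thead_cmo_inval_range,
	.flush_range = thead_cmo_flush_range,
};
```

Callers would then guard each use with a NULL check, exactly as the `riscv_dma_noncoherent_clean()` helpers in the patch do.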
On Sat, Jan 7, 2023, at 23:10, Lad, Prabhakar wrote: >> > + >> > + memset(&thead_cmo_ops, 0x0, sizeof(thead_cmo_ops)); >> > + if (IS_ENABLED(CONFIG_ERRATA_THEAD_CMO)) { >> > + thead_cmo_ops.clean_range = &thead_cmo_clean_range; >> > + thead_cmo_ops.inv_range = &thead_cmo_inval_range; >> > + thead_cmo_ops.flush_range = &thead_cmo_flush_range; >> > + riscv_noncoherent_register_cache_ops(&thead_cmo_ops); >> > + } >> >> The implementation here looks reasonable, just wonder whether >> the classification as an 'errata' makes sense. I would probably >> consider this a 'driver' at this point, but that's just >> a question of personal preference. >> > zicbom is a CPU feature that doesn't have any DT node and hence no > driver and similarly for T-HEAD SoC. A driver does not have to be a 'struct platform_driver' that matches to a device node, my point was more about what to name it, regardless of how the code is entered. > Also the arch_setup_dma_ops() > happens quite early before driver probing due to which we get WARN() > messages during bootup hence I have implemented it as errata; as > errata patching happens quite early. But there is no more patching here, just setting the function pointers, right? >> > +struct riscv_cache_ops { >> > + void (*clean_range)(unsigned long addr, unsigned long size); >> > + void (*inv_range)(unsigned long addr, unsigned long size); >> > + void (*flush_range)(unsigned long addr, unsigned long size); >> > + void (*riscv_dma_noncoherent_cmo_ops)(void *vaddr, size_t size, >> > + enum dma_data_direction dir, >> > + enum dma_noncoherent_ops ops); >> > +}; >> >> I don't quite see how the fourth operation is used here. >> Are there cache controllers that need something beyond >> clean/inv/flush? >> > This is for platforms that dont follow standard cache operations (like > done in patch 5/6) and there drivers decide on the operations > depending on the ops and dir. 
My feeling is that the set of operations that get called should not depend on the cache controller but at best the CPU. I tried to enumerate how zicbom and ax45 differ here, and how that compares to other architectures:

            zicbom       ax45,mips,arc  arm          arm64
fromdevice  clean/flush  inval/inval    inval/inval  clean/inval
todevice    clean/-      clean/-        clean/-      clean/-
bidi        flush/flush  flush/inval    clean/inval  clean/inval

So everyone does the same operation for DMA_TO_DEVICE, but they differ in the DMA_FROM_DEVICE handling, for reasons I don't quite see: Your ax45 code does the same as arc and mips. arm and arm64 skip invalidating the cache before bidi mappings, but arm has a FIXME comment about that. arm64 does a 'clean' instead of 'inval' when mapping a fromdevice page, which seems valid but slower than necessary. Could the zicbom operations be changed to do the same things as the ax45/mips/arc ones, or are there specific details in the zicbom spec that require this? Arnd
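To make the zicbom column of that table concrete, the dispatch it corresponds to can be sketched as below. This is a hedged userspace mock - the stub CMOs and `last_op` bookkeeping are invented for the example, and real code would issue `cbo.clean`/`cbo.flush`; the point is only the direction-to-operation mapping.

```c
#include <stddef.h>

enum dma_data_direction { DMA_BIDIRECTIONAL = 0, DMA_TO_DEVICE, DMA_FROM_DEVICE };

/* Stubs that record which CMO ran; the real ops are cbo.clean / cbo.flush. */
static const char *last_op;
static void cmo_clean(unsigned long addr, unsigned long size) { last_op = "clean"; }
static void cmo_flush(unsigned long addr, unsigned long size) { last_op = "flush"; }

/* zicbom column of the table: clean (write back) for both unidirectional
 * cases, flush (write back + invalidate) for bidirectional. */
static void sync_for_device(unsigned long addr, unsigned long size,
			    enum dma_data_direction dir)
{
	switch (dir) {
	case DMA_TO_DEVICE:
	case DMA_FROM_DEVICE:
		cmo_clean(addr, size);
		break;
	case DMA_BIDIRECTIONAL:
		cmo_flush(addr, size);
		break;
	}
}
```

The ax45/mips/arc column would instead invalidate in the DMA_FROM_DEVICE case here, which is exactly the divergence being questioned.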
On Sat, Jan 07, 2023 at 10:21:52PM +0000, Conor Dooley wrote: > On Sat, Jan 07, 2023 at 10:52:55PM +0100, Arnd Bergmann wrote: > > On Sat, Jan 7, 2023, at 00:29, Conor Dooley wrote: > > > On Fri, Jan 06, 2023 at 11:31:33PM +0100, Arnd Bergmann wrote: > > >> On Fri, Jan 6, 2023, at 19:55, Prabhakar wrote: > > >> > From: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> > > >> > +struct riscv_cache_ops zicbom_cmo_ops = { > > >> > + .clean_range = &zicbom_cmo_clean_range, > > >> > + .inv_range = &zicbom_cmo_inval_range, > > >> > + .flush_range = &zicbom_cmo_flush_range, > > >> > +}; > > >> > +#else > > >> > +struct riscv_cache_ops zicbom_cmo_ops = { > > >> > + .clean_range = NULL, > > >> > + .inv_range = NULL, > > >> > + .flush_range = NULL, > > >> > + .riscv_dma_noncoherent_cmo_ops = NULL, > > >> > +}; > > >> > +#endif > > >> > +EXPORT_SYMBOL(zicbom_cmo_ops); > > >> > > >> Same here: If the ZICBOM ISA is disabled, nothing should > > >> reference zicbom_cmo_ops. > > > > > >> Also, since ZICBOM is a standard > > >> extension, I think it makes sense to always have it enabled, > > >> at least whenever noncoherent DMA is supported, that way > > >> it can be the default that gets used in the absence of any > > >> nonstandard cache controller. > > > > > > While I think of it, this is not possible as Zicbom requires toolchain > > > support whereas the alternative methods for non-coherent DMA do not. > > > > Ah, I see. Would it be possible to use the same .long trick > > as in the other ones though? Something like > > > > #if CONFIG_AS_VERSION >= 23600 /* or whichever version */ > > > > /* proper inline asm */ > > #else > > /* .long hack */ > > #endif > > > > That way everyone can use it, and the hack would automatically > > go away in a few years after linux requires a newer toolchain. > > > Alternatively, the entire noncoherent-dma support could be > > made to depend on whichever toolchain introduced Zicbom. > > Ehh, I don't think that's a great idea. 
It'd require far too recent a > toolchain IMO. > > Ideally, in my opinion, we'd just do something like what Drew has > proposed for Zicboz, negating the need for a check at all: > https://lore.kernel.org/linux-riscv/20221027130247.31634-4-ajones@ventanamicro.com/ > > Been waiting for that to be re-spun and Palmer to accept it before doing > the same thing for Zicbom. At present we have this in the arch Kconfig: > > config TOOLCHAIN_HAS_ZICBOM > bool > default y > depends on !64BIT || $(cc-option,-mabi=lp64 -march=rv64ima_zicbom) > depends on !32BIT || $(cc-option,-mabi=ilp32 -march=rv32ima_zicbom) > depends on LLD_VERSION >= 150000 || LD_VERSION >= 23800 > > config RISCV_ISA_ZICBOM > bool "Zicbom extension support for non-coherent DMA operation" > depends on TOOLCHAIN_HAS_ZICBOM > > The linker version check is entirely due to the linker having issues if > it sees zicbom in the ISA string in object files. > > I'd been intending to do that for Zicbom anyway, so I guess I'll just go > do it & Prabhakar can attach it to his v7.. Should pop up here in a few minutes.. https://lore.kernel.org/linux-riscv/20230108163356.3063839-1-conor@kernel.org/ Hopefully that both works & makes life easier. Certainly from a CI coverage point of view, relaxing toolchain requirements makes *my* life easier!
On Sun, Jan 8, 2023 at 12:08 AM Arnd Bergmann <arnd@arndb.de> wrote: > > On Sat, Jan 7, 2023, at 23:10, Lad, Prabhakar wrote: > > >> > + > >> > + memset(&thead_cmo_ops, 0x0, sizeof(thead_cmo_ops)); > >> > + if (IS_ENABLED(CONFIG_ERRATA_THEAD_CMO)) { > >> > + thead_cmo_ops.clean_range = &thead_cmo_clean_range; > >> > + thead_cmo_ops.inv_range = &thead_cmo_inval_range; > >> > + thead_cmo_ops.flush_range = &thead_cmo_flush_range; > >> > + riscv_noncoherent_register_cache_ops(&thead_cmo_ops); > >> > + } > >> > >> The implementation here looks reasonable, just wonder whether > >> the classification as an 'errata' makes sense. I would probably > >> consider this a 'driver' at this point, but that's just > >> a question of personal preference. > >> > > zicbom is a CPU feature that doesn't have any DT node and hence no > > driver and similarly for T-HEAD SoC. > > A driver does not have to be a 'struct platform_driver' that > matches to a device node, my point was more about what to > name it, regardless of how the code is entered. > > > Also the arch_setup_dma_ops() > > happens quite early before driver probing due to which we get WARN() > > messages during bootup hence I have implemented it as errata; as > > errata patching happens quite early. > > But there is no more patching here, just setting the > function pointers, right? > Yes that's right. > >> > +struct riscv_cache_ops { > >> > + void (*clean_range)(unsigned long addr, unsigned long size); > >> > + void (*inv_range)(unsigned long addr, unsigned long size); > >> > + void (*flush_range)(unsigned long addr, unsigned long size); > >> > + void (*riscv_dma_noncoherent_cmo_ops)(void *vaddr, size_t size, > >> > + enum dma_data_direction dir, > >> > + enum dma_noncoherent_ops ops); > >> > +}; > >> > >> I don't quite see how the fourth operation is used here. > >> Are there cache controllers that need something beyond > >> clean/inv/flush? 
> >> > > This is for platforms that dont follow standard cache operations (like > > done in patch 5/6) and there drivers decide on the operations > > depending on the ops and dir. > > My feeling is that the set of operations that get called should > not depend on the cache controller but at best the CPU. I tried to > enumerate how zicbom and ax45 differ here, and how that compares > to other architectures: > > zicbom ax45,mips,arc arm arm64 > fromdevice clean/flush inval/inval inval/inval clean/inval > todevice clean/- clean/- clean/- clean/- > bidi flush/flush flush/inval clean/inval clean/inval > > So everyone does the same operation for DMA_TO_DEVICE, but > they differ in the DMA_FROM_DEVICE handling, for reasons I > don't quite see: > > Your ax45 code does the same as arc and mips. arm and > arm64 skip invalidating the cache before bidi mappings, > but arm has a FIXME comment about that. arm64 does a > 'clean' instead of 'inval' when mapping a fromdevice > page, which seems valid but slower than necessary. > > Could the zicbom operations be changed to do the same > things as the ax45/mips/arc ones, or are there specific > details in the zicbom spec that require this? > I'll let the RISC-V experts respond here. Cheers, Prabhakar
On Mon, Jan 9, 2023, at 13:03, Lad, Prabhakar wrote: > On Sun, Jan 8, 2023 at 12:08 AM Arnd Bergmann <arnd@arndb.de> wrote: >> >> > +struct riscv_cache_ops { >> >> > + void (*clean_range)(unsigned long addr, unsigned long size); >> >> > + void (*inv_range)(unsigned long addr, unsigned long size); >> >> > + void (*flush_range)(unsigned long addr, unsigned long size); >> >> > + void (*riscv_dma_noncoherent_cmo_ops)(void *vaddr, size_t size, >> >> > + enum dma_data_direction dir, >> >> > + enum dma_noncoherent_ops ops); >> >> > +}; >> >> >> >> I don't quite see how the fourth operation is used here. >> >> Are there cache controllers that need something beyond >> >> clean/inv/flush? >> >> >> > This is for platforms that dont follow standard cache operations (like >> > done in patch 5/6) and there drivers decide on the operations >> > depending on the ops and dir. >> >> My feeling is that the set of operations that get called should >> not depend on the cache controller but at best the CPU. I tried to >> enumerate how zicbom and ax45 differ here, and how that compares >> to other architectures: >> >> zicbom ax45,mips,arc arm arm64 >> fromdevice clean/flush inval/inval inval/inval clean/inval >> todevice clean/- clean/- clean/- clean/- >> bidi flush/flush flush/inval clean/inval clean/inval >> >> So everyone does the same operation for DMA_TO_DEVICE, but >> they differ in the DMA_FROM_DEVICE handling, for reasons I >> don't quite see: >> >> Your ax45 code does the same as arc and mips. arm and >> arm64 skip invalidating the cache before bidi mappings, >> but arm has a FIXME comment about that. arm64 does a >> 'clean' instead of 'inval' when mapping a fromdevice >> page, which seems valid but slower than necessary. >> >> Could the zicbom operations be changed to do the same >> things as the ax45/mips/arc ones, or are there specific >> details in the zicbom spec that require this? >> > I'll let the RISC-V experts respond here. 
Adding Christoph Hellwig and Will Deacon to Cc as well. I had another look at the arm64 side, which (like the zicbom variant) uses 'clean' on dma_sync_single_for_device(DMA_FROM_DEVICE), as that has changed not that long ago, see https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50f11c6196f45c92ca48b16a5071615d4ae0572 I'm still not sure what the correct set of operations has to be, but nothing in that patch description sounds ISA or even microarchitecture specific. Arnd
On Mon, Jan 09, 2023 at 01:59:12PM +0100, Arnd Bergmann wrote: > On Mon, Jan 9, 2023, at 13:03, Lad, Prabhakar wrote: > > On Sun, Jan 8, 2023 at 12:08 AM Arnd Bergmann <arnd@arndb.de> wrote: > >> >> > +struct riscv_cache_ops { > >> >> > + void (*clean_range)(unsigned long addr, unsigned long size); > >> >> > + void (*inv_range)(unsigned long addr, unsigned long size); > >> >> > + void (*flush_range)(unsigned long addr, unsigned long size); > >> >> > + void (*riscv_dma_noncoherent_cmo_ops)(void *vaddr, size_t size, > >> >> > + enum dma_data_direction dir, > >> >> > + enum dma_noncoherent_ops ops); > >> >> > +}; > >> >> > >> >> I don't quite see how the fourth operation is used here. > >> >> Are there cache controllers that need something beyond > >> >> clean/inv/flush? > >> >> > >> > This is for platforms that dont follow standard cache operations (like > >> > done in patch 5/6) and there drivers decide on the operations > >> > depending on the ops and dir. > >> > >> My feeling is that the set of operations that get called should > >> not depend on the cache controller but at best the CPU. I tried to > >> enumerate how zicbom and ax45 differ here, and how that compares > >> to other architectures: > >> > >> zicbom ax45,mips,arc arm arm64 > >> fromdevice clean/flush inval/inval inval/inval clean/inval > >> todevice clean/- clean/- clean/- clean/- > >> bidi flush/flush flush/inval clean/inval clean/inval I did a bit of digging on lore for context on why the ops are what they are.. 
In v3 of the Zicbom enablement patchset, things looked like:

  fromdevice  inval/inval
  todevice    clean/-
  bidi        flush/inval

v3: https://lore.kernel.org/linux-riscv/20220610004308.1903626-3-heiko@sntech.de/

Samuel had some comments about the invals: https://lore.kernel.org/linux-riscv/342e3c12-ebb0-badf-7d4c-c444a2b842b2@sholland.org/

In v4 it was changed to:

  fromdevice  inval/flush
  todevice    clean/-
  bidi        flush/flush

v4: https://lore.kernel.org/linux-riscv/20220619203212.3604485-4-heiko@sntech.de/

Christoph replied to that one, linking the thread belonging to the commit you pointed out earlier: https://lore.kernel.org/linux-riscv/20220620061607.GB10485@lst.de/

v5 produced what you have in your table above: https://lore.kernel.org/linux-riscv/20220629215944.397952-4-heiko@sntech.de/

> >> > >> So everyone does the same operation for DMA_TO_DEVICE, but > >> they differ in the DMA_FROM_DEVICE handling, for reasons I > >> don't quite see: > >> > >> Your ax45 code does the same as arc and mips. arm and > >> arm64 skip invalidating the cache before bidi mappings, > >> but arm has a FIXME comment about that. arm64 does a > >> 'clean' instead of 'inval' when mapping a fromdevice > >> page, which seems valid but slower than necessary. > >> > >> Could the zicbom operations be changed to do the same > >> things as the ax45/mips/arc ones, or are there specific > >> details in the zicbom spec that require this? > >> > > I'll let the RISC-V experts respond here. > > Adding Christoph Hellwig and Will Deacon to Cc as well. > > I had another look at the arm64 side, which (like the zicbom > variant) uses 'clean' on dma_sync_single_for_device(DMA_FROM_DEVICE), > as that has changed not that long ago, see > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50f11c6196f45c92ca48b16a5071615d4ae0572 > > I'm still not sure what the correct set of operations has > to be, but nothing in that patch description sounds ISA > or even microarchitecture specific.
Hope the lore archaeology helps jog people's memories... Conor
On Mon, Jan 09, 2023 at 01:59:12PM +0100, Arnd Bergmann wrote: > I had another look at the arm64 side, which (like the zicbom > variant) uses 'clean' on dma_sync_single_for_device(DMA_FROM_DEVICE), > as that has changed not that long ago, see > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50f11c6196f45c92ca48b16a5071615d4ae0572 which IIRC has been reverted recently. > I'm still not sure what the correct set of operations has > to be, but nothing in that patch description sounds ISA > or even microarchitecture specific. Nothing is ISA specific, and the only micro architecture related thing is if the specific core can speculate memory accesses. See the table in arch/arc/mm/dma.c for details. And as commented on the arm64 patch I really hate architectures getting creative here, as I'd much prefer to move the choice of primitives to the core DMA code and just provide helpers to invalidate/writeback/writeback+invalidate from the architectures eventually.
On Tue, Jan 10, 2023, at 08:01, Christoph Hellwig wrote: > On Mon, Jan 09, 2023 at 01:59:12PM +0100, Arnd Bergmann wrote: >> I had another look at the arm64 side, which (like the zicbom >> variant) uses 'clean' on dma_sync_single_for_device(DMA_FROM_DEVICE), >> as that has changed not that long ago, see >> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50f11c6196f45c92ca48b16a5071615d4ae0572 > > which IIRC has been reverted recently. To clarify: I was looking at arch_sync_dma_for_device(), which changed from 'invalidate' to 'clean' last June in commit c50f11c6196f ("arm64: mm: Don't invalidate FROM_DEVICE buffers at start of DMA transfer"). I don't see a revert for that. The one that was reverted recently is arch_dma_prep_coherent, which was changed and reverted in

  c44094eee32  Aug 23 2022  flush->clean
  b7d9aae4048  Dec  6 2022  clean->flush

I'm primarily interested in the streaming mappings (arch_sync_*) at the moment. >> I'm still not sure what the correct set of operations has >> to be, but nothing in that patch description sounds ISA >> or even microarchitecture specific. > > Nothing is ISA specific, and the only micro architecture related thing > is if the specific core can speculate memory accesses. See the table > in arch/arc/mm/dma.c for details. > > And as commented on the arm64 patch I really hate architectures getting > creative here, as I'd much prefer to move the choice of primitives to > the core DMA code and just provide helpers to invalidate/writeback/ > writeback+invalidate from the architectures eventually. Agreed, that's why I particularly don't like the idea of giving SoC specific L2$ drivers the option of making custom choices here.
The arch_dma_prep_coherent() change is arguably arm64 specific because it is only needed for machines sharing memory with EL3 firmware that enforces buffer ownership, but even for that it would be nice to have a central definition and some architecture specific flag that picks one behavior or the other. Same thing for the speculative access difference.

I looked at all the implementations now and put them in a table[1] to see what the differences are. The only bit that I think needs discussion is the dma_sync_single_for_device(DMA_FROM_DEVICE) op that I mentioned above. I see that arm64, csky, powerpc, riscv and parisc all write out at least partial cache lines first to avoid losing dirty data in the part that is not written by the device[2][3], while the other ones don't[4].

The other differences I found are:

- arm and arm64 use clean instead of flush for dma_sync_single_for_device(DMA_BIDIRECTIONAL). This seems harmless, since there is another invalidation at the _for_cpu() step.
- hexagon, m68k, openrisc and sh don't deal with speculative loads
- m68k, nios2, openrisc, parisc and xtensa use flush instead of clean for DMA_TO_DEVICE, presumably because they don't have a flush without invalidate.
- microblaze, parisc and powerpc use the same function for _for_device and _for_cpu, which is slightly wasteful but harmless.
- openrisc lacks support for bidirectional mappings, which is a bug
- sparc32 and xtensa only support writethrough caches and consequently skip the clean before DMA but instead invalidate the buffer in the _for_cpu operation.

I also see that at least arc, arm, mips and riscv all want CPU specific cache management operations to be registered at boot time. While Russell had some concerns about your suggestion to generalize the arm version, we could start by moving the abstracted riscv version into kernel/dma/direct.c and make sure it can be shared with at least mips and arc, provided that we can agree on the DMA_FROM_DEVICE behavior.
Arnd [1] https://docs.google.com/spreadsheets/d/1qDuMqB6TnRTj_CgUwgIIm_RJ6EZO76qohpTJUMQjUEo/edit#gid=0 [2] https://git.kernel.org/torvalds/c/c50f11c6196f [3] https://git.kernel.org/torvalds/c/03d70617b8a7 [4] https://lore.kernel.org/all/20220606152150.GA31568@willie-the-truck/
On Tue, Jan 10, 2023 at 04:03:06PM +0100, Arnd Bergmann wrote: > On Tue, Jan 10, 2023, at 08:01, Christoph Hellwig wrote: > > On Mon, Jan 09, 2023 at 01:59:12PM +0100, Arnd Bergmann wrote: > >> I had another look at the arm64 side, which (like the zicbom > >> variant) uses 'clean' on dma_sync_single_for_device(DMA_FROM_DEVICE), > >> as that has changed not that long ago, see > >> > >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50f11c6196f45c92ca48b16a5071615d4ae0572 > > > > which IIRC has been reverted recently. > > To clarify: I was looking at arch_sync_dma_for_device(), which > changed from 'invalidate' to 'clean' last June in commit > c50f11c6196f ("arm64: mm: Don't invalidate FROM_DEVICE buffers > at start of DMA transfer"). I don't see a revert for that. > > The one that was reverted recently is arch_dma_prep_coherent, which > was changed and reverted in > > c44094eee32 Aug 23 2022 flush->clean > b7d9aae4048 Dec 6 2022 clean->flush > > I'm primarily interested in the streaming mappings (arch_sync_*) > at the moment. Just as an FYI, but we plan to revert the revert (i.e. go back to 'clean') here once Qualcomm's modem firmware loader has been updated: https://lore.kernel.org/r/20230109034843.23759-1-quic_sibis@quicinc.com Will
On Tue, Jan 10, 2023 at 04:03:06PM +0100, Arnd Bergmann wrote: > To clarify: I was looking at arch_sync_dma_for_device(), which > changed from 'invalidate' to 'clean' last June in commit > c50f11c6196f ("arm64: mm: Don't invalidate FROM_DEVICE buffers > at start of DMA transfer"). I don't see a revert for that. > > The one that was reverted recently is arch_dma_prep_coherent, which > was changed and reverted in > > c44094eee32 Aug 23 2022 flush->clean > b7d9aae4048 Dec 6 2022 clean->flush > > I'm primarily interested in the streaming mappings (arch_sync_*) > at the moment. Oh, true. I've been confused about the two changes. > Agreed, that's why I particularly don't like the idea of giving SoC > specific L2$ drivers the option of making custom choices here. Yes, moving this into individual drivers is an absolute no-go. > The arch_dma_prep_coherent() change is arguably arm64 specific > because it is only needed for machines sharing memory with EL3 > firmware that enforces buffer ownership, but even for that it would > be nice to have a central definition and some architecture specific > flag that picks one behavior or the other. Same thing for the > speculative access difference. Yes. > I looked at all the implementations now and put them in a table[1] > to see what the differences are. The only bit that I think needs > discussion is the dma_sync_single_for_device(DMA_FROM_DEVICE) op > that I mentioned above. I see that arm64, csky, powerpc, riscv > and parisc all write out at least partial cache lines first to > avoid losing dirty data in the part that is not written by the > device[2][3], while the other ones don't[4]. I'm tempted to declare [4] buggy until proof of the inverse. > I also see that at least arc, arm, mips and riscv all want > CPU specific cache management operations to be registered > at boot time. 
While Russell had some concerns about your > suggestion to generalize the arm version, we could start > by moving the abstracted riscv version into > kernel/dma/direct.c and make sure it can be shared with > at least mips and arc, provided that we can agree on the > DMA_FROM_DEVICE behavior. Yes, I'd really like to start out with a common version and then move bits over. There are also some interesting bits about handling highmem for architectures that need virtual addresses for flushing that might be worth sharing, too. This should be a new file in kernel/dma/ as it's not only used by dma-direct but also by dma-iommu, and to keep the code nicely separated. Can you give it a go?
On Fri, Jan 13, 2023, at 06:48, Christoph Hellwig wrote: > On Tue, Jan 10, 2023 at 04:03:06PM +0100, Arnd Bergmann wrote: >> I looked at all the implementations now and put them in a table[1] >> to see what the differences are. The only bit that I think needs >> discussion is the dma_sync_single_for_device(DMA_FROM_DEVICE) op >> that I mentioned above. I see that arm64, csky, powerpc, riscv >> and parisc all write out at least partial cache lines first to >> avoid losing dirty data in the part that is not written by the >> device[2][3], while the other ones don't[4]. > > I'm tempted to declare [4] buggy until proof of the inverse. Having looked at this some more, I see that the powerpc version is a bit problematic here as well: this one flushes the partial cache lines before and after the DMA transfer, while only invalidating the full cache lines. If a partial cache line gets written to by the CPU while the buffer is owned by the device, this means that the received data from the device is immediately overwritten by the second flush. The arm64 and riscv behavior of doing a flush before and an invalidate after the DMA seems a bit more appropriate, as that ends up keeping the DMA data but discarding anything written by the CPU. Obviously there is no winning either way if the same cache line gets written by both CPU and device, I'm just trying to figure out what behavior we actually want here. The best I can think of so far is: - flush the partial cache lines before the DMA, as powerpc does, and just invalidate the full cache lines - only invalidate but not clean/flush after the DMA. This is the arm64 behavior - warn when flushing partial cache lines if dma_debug is enabled. >> I also see that at least arc, arm, mips and riscv all want >> CPU specific cache management operations to be registered >> at boot time. 
While Russell had some concerns about your >> suggestion to generalize the arm version, we could start >> by moving the abstracted riscv version into >> kernel/dma/direct.c and make sure it can be shared with >> at least mips and arc, provided that we can agree on the >> DMA_FROM_DEVICE behavior. > > Yes, I'd really like to start out with a common version and then > move bits over. There are also some interesting bits about handling > highmem for architectures that need virtual addresses for flushing > that might be worth sharing, too. > > This should be a new file in kernel/dma/ as it's not only used by > dma-direct but also by dma-iommu, and to keep the code nicely > separated. > > Can you give it a go? I started this at the beginning of the week but couldn't finish it at all, but still plan to get back to it next week. Aside from the question of how to handle flush vs invalidate on DMA_FROM_DEVICE, I'm still trying to figure out how to best handle highmem with architecture specific cache management operations. The easy approach would be to leave that up to the architecture, passing only a physical address to the flush function. A nicer interface might be to move the loop over highmem pages out into common code, flush lowmem pages by virtual address, and have a separate callback for highmem pages that takes a page pointer, like struct dma_cache_ops { void (*dma_cache_wback_inv)(void *start, unsigned long sz); void (*dma_cache_inv)(void *start, unsigned long sz); void (*dma_cache_wback)(void *start, unsigned long sz); #ifdef CONFIG_HIGHMEM void (*dma_cache_wback_inv_high_page)(struct page *, size_t start, unsigned long sz); void (*dma_cache_inv_high_page)(struct page *, size_t start, unsigned long sz); void (*dma_cache_wback_high_page)(struct page *, size_t start, unsigned long sz); #endif }; Let me know if you have a preference here, before I spend too much time on something we don't want in the end. Arnd
On Fri, Jan 20, 2023 at 06:04:37PM +0100, Arnd Bergmann wrote: > Having looked at this some more, I see that the powerpc > version is a bit problematic here as well: this one > flushes the partial cache lines before and after the > DMA transfer, while only invalidating the full > cache lines. That feels really odd, and might be worth a bug report to the PPC maintainers. > Obviously there is no winning either way if the same > cache line gets written by both CPU and device, I'm > just trying to figure out what behavior we actually > want here. There isn't, and that's why we require DMAed regions to be cache line aligned. > Aside from the question for how to handle flush vs invalidate > on DMA_FROM_DEVICE, I'm still trying to figure out how to > best handle highmem with architecture specific cache management > operations. The easy approach would be to leave that up > to the architecture, passing only a physical address to > the flush function. I suspect that is a good enough first step. Especially as I remember that some architectures have physical address based cache management anyway (unless we removed them in the meantime). > A nicer interface might be to move the > loop over highmem pages out into common code, flush > lowmem pages by virtual addresss, and have a separate > callback for highmem pages that takes a page pointer, > like I'd rather avoid multiple callbacks if we can. But maybe solve the simple problem first and just pass the paddr and then iterate from there. 
> > struct dma_cache_ops { > void (*dma_cache_wback_inv)(void *start, unsigned long sz); > void (*dma_cache_inv)(void *start, unsigned long sz); > void (*dma_cache_wback)(void *start, unsigned long sz); > #ifdef CONFIG_HIGHMEM > void (*dma_cache_wback_inv_high_page)(struct page *, size_t start, unsigned long sz); > void (*dma_cache_inv_high_page)(struct page *, size_t start, unsigned long sz); > void (*dma_cache_wback_high_page)(struct page *, size_t start, unsigned long sz); Btw, I really don't think these should be indirect calls. For sane architectures there should be exactly one way to call them, and the ones that have different implementations really should be using alternatives instead of expensive indirect calls.
On Sat, Jan 21, 2023, at 15:37, Christoph Hellwig wrote: > On Fri, Jan 20, 2023 at 06:04:37PM +0100, Arnd Bergmann wrote: >> Having looked at this some more, I see that the powerpc >> version is a bit problematic here as well: this one >> flushes the partial cache lines before and after the >> DMA transfer, while only invalidating the full >> cache lines. > > That feels really odd, and might be worth a bug report to the > PPC maintainers. Right, my first step would be to change all of the current outliers to use the same set of operations where possible. >> Aside from the question for how to handle flush vs invalidate >> on DMA_FROM_DEVICE, I'm still trying to figure out how to >> best handle highmem with architecture specific cache management >> operations. The easy approach would be to leave that up >> to the architecture, passing only a physical address to >> the flush function. > > I suspect that is a good enough first step. Especially as I remember > that some architectures have physical address based cache management > anyway (unless we removed them in the meantime). ok >> A nicer interface might be to move the >> loop over highmem pages out into common code, flush >> lowmem pages by virtual addresss, and have a separate >> callback for highmem pages that takes a page pointer, >> like > > I'd rather avoid multiple callbacks if we can. But maybe solve > the simple problem first and just pass the paddr and then > iterate from there. Ok, fair enough. This means we can't easily put the kmap_atomic() into common code for highmem, though the per-page loop would still work. 
>> struct dma_cache_ops { >> void (*dma_cache_wback_inv)(void *start, unsigned long sz); >> void (*dma_cache_inv)(void *start, unsigned long sz); >> void (*dma_cache_wback)(void *start, unsigned long sz); >> #ifdef CONFIG_HIGHMEM >> void (*dma_cache_wback_inv_high_page)(struct page *, size_t start, unsigned long sz); >> void (*dma_cache_inv_high_page)(struct page *, size_t start, unsigned long sz); >> void (*dma_cache_wback_high_page)(struct page *, size_t start, unsigned long sz); > > Btw, I really don't think these should be indirect calls. > For sane architectures there should be exactly one way to call them, > and the ones that have different implementations really should be > using alternatives instead of expensive indirect calls. I was thinking of using STATIC_CALL() as an optimization here, which I find easier to read and understand than alternatives. One advantage here is that this allows the actual cache operations to be declared locally in the architecture without letting drivers call them, but still update the common code to work without indirect branches. The main downside is that this is currently only optimized on powerpc and x86, both of which don't actually need CPU specific callbacks. ARC, ARM, and MIPS on the other hand already have indirect function pointers, RISC-V would likely benefit the most from either alternatives or static_call, as it already uses alternatives and has one implementation that is clearly preferred over the others. Arnd
On Sat, Jan 21, 2023 at 08:30:23PM +0100, Arnd Bergmann wrote: > > That feels really odd, and might be worth a bug report to the > > PPC maintainers. > > Right, my first step would be to change all of the current > outliers to use the same set of operations where possible. Sounds good. > > I'd rather avoid multiple callbacks if we can. But maybe solve > > the simple problem first and just pass the paddr and then > > iterate from there. > > Ok, fair enough. This means we can't easily put the kmap_atomic() > into common code for highmem, though the per-page loop would > still work. Yes. Given how messy many of the ops are I think one step at a time is always good. > I was thinking of using STATIC_CALL() as an optimization here, which > I find easier to read and understand than alternatives. One advantage > here is that this allows the actual cache operations to be declared > locally in the architecture without letting drivers call them, > but still update the common code to work without indirect branches. > > The main downside is that this is currently only optimized on > powerpc and x86, both of which don't actually need CPU specific > callbacks. ARC, ARM, and MIPS on the other hand already > have indirect function pointers, RISC-V would likely benefit the > most from either alternatives or static_call, as it already > uses alternatives and has one implementation that is clearly > preferred over the others. For now I'd just keep doing direct calls into the arch code, just for the lower level invalidate, writeback, invalidate+writeback calls as that helps cement the logic of which of those to use in well documented core code. And I'm not really sure I'd like to go beyond that - making it too easily pluggable will make people feel more comfortable doing stupid things here. 
And yes, maybe that's personal because I've warned the RISC-V people years ago that they'll need architectural cache management instructions yesterday and the answer was that no one is going to use them on modern CPUs. *sigh*
On Sun, Jan 22, 2023, at 08:27, Christoph Hellwig wrote: > On Sat, Jan 21, 2023 at 08:30:23PM +0100, Arnd Bergmann wrote: >> I was thinking of using STATIC_CALL() as an optimization here, which >> I find easier to read and understand than alternatives. One advantage >> here is that this allows the actual cache operations to be declared >> locally in the architecture without letting drivers call them, >> but still update the common code to work without indirect branches. >> >> The main downside is that this is currently only optimized on >> powerpc and x86, both of which don't actually need CPU specific >> callbacks. ARC, ARM, and MIPS on the other hand already >> have indirect function pointers, RISC-V would likely benefit the >> most from either alternatives or static_call, as it already >> uses alternatives and has one implementation that is clearly >> preferred over the others. > > For now I'd just keep doing direct calls into the arch code, just > for the lower level invalidate, writeback, invalidate+writeback > calls as that helps cement the logic of which of those to use > in well documented core code. Ok. > And I'm not really sure I'd like to go beyond that - making it too > easily pluggable will make people feel more comfortable doing stupid > things here. I fear the bigger risk is still making the functions callable from device driver code than making the functions globally settable. You introduced the mips version in f8c55dc6e828 ("MIPS: use generic dma noncoherent ops for simple noncoherent platforms"), which was clearly meant as an implementation detail, yet we already have a driver that slipped in with 3bdffa8ffb45 ("Input: Add N64 controller driver") that just calls this directly rather than using the dma-mapping interface. 
On the other hand, the indirect function pointers for per-cpu cache operations are not easily translated anyway: with the three architectures that multiplex between cpu specific operations, arc uses physical addresses, mips uses virtual addresses (because of highmem), and arm even uses both because of incompatible requirements between l1 and l2 cache operations. arm32 also seems to have the superset of all possible corner cases that one might see elsewhere (prefetching vs in-order, writethrough vs writeback, broken broadcast invalidation, ...). > And yes, maybe that's personal because I've warned > the RISC-V people years ago that they'll need architectural > cache management instructions yesterday and the answer was that > no one is going to use them on modern CPUs. *sigh* To be fair, from the ISA point of view, it really shouldn't be necessary as long as you have a sane SoC design. In practice there are always chips that are cutting corners, or use the new CPU core as a drop-in for an existing design. Arm SBSA tried to enforce the same thing and also failed for pretty much the same reason. Arnd
On Sun, Jan 22, 2023 at 12:04:35PM +0100, Arnd Bergmann wrote: > > And I'm not really sure I'd like to go beyond that - making it too > > easily pluggable will make people feel more comfortable doing stupid > > things here. > > I fear the bigger risk is still making the functions callable > from device driver code than making the functions > globally settable. > > You introduced the mips version in f8c55dc6e828 ("MIPS: use generic > dma noncoherent ops for simple noncoherent platforms"), which > was clearly meant as an implementation detail, yet we already > have a driver that slipped in with 3bdffa8ffb45 ("Input: Add > N64 controller driver") that just calls this directly rather > than using the dma-mapping interface. MIPS actually has a bit of a history of these odd bypasses that it seems like this driver copied. rmk has been very worried by this bypassing, and in general I agree. But there's only so much we can do except for auditing drivers. Especially as none of these helpers is exported and built-in only drivers are quite rare. > > And yes, maybe that's personal because I've warned > > the RISC-V people years ago that they'll need architectural > > cache management instructions yesterday and the answer was that > > no one is going to use them on modern CPUs. *sigh* > > To be fair, from the ISA point of view, it really shouldn't > be necessary as long as you have a sane SoC design. > In practice there are always chips that are cutting corners, > or use the new CPU core as a drop-in for an existing > design. Arm SBSA tried to enforce the same thing and also > failed for pretty much the same reason. Not wiring up IP blocks for cache coherency is cheap. So it totally makes sense for dirt cheap SOCs, or even for low performance peripherals in general. The GPU folks also believe they can make some things faster by deliberately turning coherency off on SOCs that support coherency. In addition to that, cache maintenance is absolutely needed for NVDIMM support.
diff --git a/arch/riscv/errata/thead/errata.c b/arch/riscv/errata/thead/errata.c index fac5742d1c1e..826b2ba3e61e 100644 --- a/arch/riscv/errata/thead/errata.c +++ b/arch/riscv/errata/thead/errata.c @@ -8,12 +8,72 @@ #include <linux/module.h> #include <linux/string.h> #include <linux/uaccess.h> +#include <asm/dma-noncoherent.h> #include <asm/alternative.h> #include <asm/cacheflush.h> #include <asm/errata_list.h> #include <asm/patch.h> #include <asm/vendorid_list.h> +#ifdef CONFIG_ERRATA_THEAD_CMO +/* + * dcache.ipa rs1 (invalidate, physical address) + * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 | + * 0000001 01010 rs1 000 00000 0001011 + * dache.iva rs1 (invalida, virtual address) + * 0000001 00110 rs1 000 00000 0001011 + * + * dcache.cpa rs1 (clean, physical address) + * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 | + * 0000001 01001 rs1 000 00000 0001011 + * dcache.cva rs1 (clean, virtual address) + * 0000001 00100 rs1 000 00000 0001011 + * + * dcache.cipa rs1 (clean then invalidate, physical address) + * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 | + * 0000001 01011 rs1 000 00000 0001011 + * dcache.civa rs1 (... 
virtual address) + * 0000001 00111 rs1 000 00000 0001011 + * + * sync.s (make sure all cache operations finished) + * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 | + * 0000000 11001 00000 000 00000 0001011 + */ +#define THEAD_inval_A0 ".long 0x0265000b" +#define THEAD_clean_A0 ".long 0x0245000b" +#define THEAD_flush_A0 ".long 0x0275000b" +#define THEAD_SYNC_S ".long 0x0190000b" + +#define THEAD_CMO_OP(_op, _start, _size, _cachesize) \ + asm volatile("mv a0, %1\n\t" \ + "j 2f\n\t" \ + "3:\n\t" \ + THEAD_##_op##_A0 "\n\t" \ + "add a0, a0, %0\n\t" \ + "2:\n\t" \ + "bltu a0, %2, 3b\n\t" \ + THEAD_SYNC_S \ + : : "r"(_cachesize), \ + "r"((unsigned long)(_start) & ~((_cachesize) - 1UL)), \ + "r"((unsigned long)(_start) + (_size)) \ + : "a0") + +static void thead_cmo_clean_range(unsigned long addr, unsigned long size) +{ + THEAD_CMO_OP(clean, addr, size, riscv_cbom_block_size); +} + +static void thead_cmo_flush_range(unsigned long addr, unsigned long size) +{ + THEAD_CMO_OP(flush, addr, size, riscv_cbom_block_size); +} + +static void thead_cmo_inval_range(unsigned long addr, unsigned long size) +{ + THEAD_CMO_OP(inval, addr, size, riscv_cbom_block_size); +} +#endif + static bool errata_probe_pbmt(unsigned int stage, unsigned long arch_id, unsigned long impid) { @@ -33,6 +93,8 @@ static bool errata_probe_pbmt(unsigned int stage, static bool errata_probe_cmo(unsigned int stage, unsigned long arch_id, unsigned long impid) { + struct riscv_cache_ops thead_cmo_ops; + if (!IS_ENABLED(CONFIG_ERRATA_THEAD_CMO)) return false; @@ -44,6 +106,15 @@ static bool errata_probe_cmo(unsigned int stage, riscv_cbom_block_size = L1_CACHE_BYTES; riscv_noncoherent_supported(); + + memset(&thead_cmo_ops, 0x0, sizeof(thead_cmo_ops)); + if (IS_ENABLED(CONFIG_ERRATA_THEAD_CMO)) { + thead_cmo_ops.clean_range = &thead_cmo_clean_range; + thead_cmo_ops.inv_range = &thead_cmo_inval_range; + thead_cmo_ops.flush_range = &thead_cmo_flush_range; + 
riscv_noncoherent_register_cache_ops(&thead_cmo_ops); + } + return true; } diff --git a/arch/riscv/include/asm/dma-noncoherent.h b/arch/riscv/include/asm/dma-noncoherent.h new file mode 100644 index 000000000000..a2af863d2608 --- /dev/null +++ b/arch/riscv/include/asm/dma-noncoherent.h @@ -0,0 +1,83 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright (C) 2022 Renesas Electronics Corp. + */ + +#ifndef __ASM_DMA_NONCOHERENT_H +#define __ASM_DMA_NONCOHERENT_H + +#include <linux/dma-direct.h> + +enum dma_noncoherent_ops { + NON_COHERENT_SYNC_DMA_FOR_DEVICE = 0, + NON_COHERENT_SYNC_DMA_FOR_CPU, + NON_COHERENT_DMA_PREP, + NON_COHERENT_DMA_PMEM, +}; + +/* + * struct riscv_cache_ops - Structure for CMO function pointers + * @clean_range: Function pointer for clean cache + * @inv_range: Function pointer for invalidate cache + * @flush_range: Function pointer for flushing the cache + * @riscv_dma_noncoherent_cmo_ops: Function pointer for platforms who want + * to handle CMO themselves. If this function pointer is set rest of the + * function pointers will be NULL. 
+ */ +struct riscv_cache_ops { + void (*clean_range)(unsigned long addr, unsigned long size); + void (*inv_range)(unsigned long addr, unsigned long size); + void (*flush_range)(unsigned long addr, unsigned long size); + void (*riscv_dma_noncoherent_cmo_ops)(void *vaddr, size_t size, + enum dma_data_direction dir, + enum dma_noncoherent_ops ops); +}; + +extern struct riscv_cache_ops zicbom_cmo_ops; + +#ifdef CONFIG_RISCV_DMA_NONCOHERENT + +extern struct riscv_cache_ops noncoherent_cache_ops; + +void riscv_noncoherent_register_cache_ops(struct riscv_cache_ops *ops); + +static inline void riscv_dma_noncoherent_clean(void *vaddr, size_t size) +{ + if (noncoherent_cache_ops.clean_range) { + unsigned long addr = (unsigned long)vaddr; + + noncoherent_cache_ops.clean_range(addr, size); + } +} + +static inline void riscv_dma_noncoherent_flush(void *vaddr, size_t size) +{ + if (noncoherent_cache_ops.flush_range) { + unsigned long addr = (unsigned long)vaddr; + + noncoherent_cache_ops.flush_range(addr, size); + } +} + +static inline void riscv_dma_noncoherent_inval(void *vaddr, size_t size) +{ + if (noncoherent_cache_ops.inv_range) { + unsigned long addr = (unsigned long)vaddr; + + noncoherent_cache_ops.inv_range(addr, size); + } +} + +#else + +static void riscv_noncoherent_register_cache_ops(struct riscv_cache_ops *ops) {} + +static inline void riscv_dma_noncoherent_clean(void *vaddr, size_t size) {} + +static inline void riscv_dma_noncoherent_flush(void *vaddr, size_t size) {} + +static inline void riscv_dma_noncoherent_inval(void *vaddr, size_t size) {} + +#endif + +#endif /* __ASM_DMA_NONCOHERENT_H */ diff --git a/arch/riscv/include/asm/errata_list.h b/arch/riscv/include/asm/errata_list.h index 4180312d2a70..ae3fc8b80edd 100644 --- a/arch/riscv/include/asm/errata_list.h +++ b/arch/riscv/include/asm/errata_list.h @@ -91,59 +91,6 @@ asm volatile(ALTERNATIVE( \ #define ALT_THEAD_PMA(_val) #endif -/* - * dcache.ipa rs1 (invalidate, physical address) - * | 31 - 25 | 24 - 20 | 
19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 | - * 0000001 01010 rs1 000 00000 0001011 - * dache.iva rs1 (invalida, virtual address) - * 0000001 00110 rs1 000 00000 0001011 - * - * dcache.cpa rs1 (clean, physical address) - * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 | - * 0000001 01001 rs1 000 00000 0001011 - * dcache.cva rs1 (clean, virtual address) - * 0000001 00100 rs1 000 00000 0001011 - * - * dcache.cipa rs1 (clean then invalidate, physical address) - * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 | - * 0000001 01011 rs1 000 00000 0001011 - * dcache.civa rs1 (... virtual address) - * 0000001 00111 rs1 000 00000 0001011 - * - * sync.s (make sure all cache operations finished) - * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 | - * 0000000 11001 00000 000 00000 0001011 - */ -#define THEAD_inval_A0 ".long 0x0265000b" -#define THEAD_clean_A0 ".long 0x0245000b" -#define THEAD_flush_A0 ".long 0x0275000b" -#define THEAD_SYNC_S ".long 0x0190000b" - -#define ALT_CMO_OP(_op, _start, _size, _cachesize) \ -asm volatile(ALTERNATIVE_2( \ - __nops(6), \ - "mv a0, %1\n\t" \ - "j 2f\n\t" \ - "3:\n\t" \ - "cbo." 
__stringify(_op) " (a0)\n\t" \ - "add a0, a0, %0\n\t" \ - "2:\n\t" \ - "bltu a0, %2, 3b\n\t" \ - "nop", 0, CPUFEATURE_ZICBOM, CONFIG_RISCV_ISA_ZICBOM, \ - "mv a0, %1\n\t" \ - "j 2f\n\t" \ - "3:\n\t" \ - THEAD_##_op##_A0 "\n\t" \ - "add a0, a0, %0\n\t" \ - "2:\n\t" \ - "bltu a0, %2, 3b\n\t" \ - THEAD_SYNC_S, THEAD_VENDOR_ID, \ - ERRATA_THEAD_CMO, CONFIG_ERRATA_THEAD_CMO) \ - : : "r"(_cachesize), \ - "r"((unsigned long)(_start) & ~((_cachesize) - 1UL)), \ - "r"((unsigned long)(_start) + (_size)) \ - : "a0") - #define THEAD_C9XX_RV_IRQ_PMU 17 #define THEAD_C9XX_CSR_SCOUNTEROF 0x5c5 diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c index 205bbd6b1fce..d94d32eb7faf 100644 --- a/arch/riscv/kernel/cpufeature.c +++ b/arch/riscv/kernel/cpufeature.c @@ -14,6 +14,7 @@ #include <linux/of.h> #include <asm/alternative.h> #include <asm/cacheflush.h> +#include <asm/dma-noncoherent.h> #include <asm/errata_list.h> #include <asm/hwcap.h> #include <asm/patch.h> @@ -298,6 +299,7 @@ static bool __init_or_module cpufeature_probe_zicbom(unsigned int stage) return false; riscv_noncoherent_supported(); + riscv_noncoherent_register_cache_ops(&zicbom_cmo_ops); return true; } diff --git a/arch/riscv/mm/dma-noncoherent.c b/arch/riscv/mm/dma-noncoherent.c index d919efab6eba..d9445c266bfd 100644 --- a/arch/riscv/mm/dma-noncoherent.c +++ b/arch/riscv/mm/dma-noncoherent.c @@ -9,23 +9,82 @@ #include <linux/dma-map-ops.h> #include <linux/mm.h> #include <asm/cacheflush.h> +#include <asm/dma-noncoherent.h> static bool noncoherent_supported; +struct riscv_cache_ops noncoherent_cache_ops = { + .clean_range = NULL, + .inv_range = NULL, + .flush_range = NULL, + .riscv_dma_noncoherent_cmo_ops = NULL, +}; +EXPORT_SYMBOL(noncoherent_cache_ops); + +#ifdef CONFIG_RISCV_ISA_ZICBOM +#define ZICBOM_CMO_OP(_op, _start, _size, _cachesize) \ + asm volatile("mv a0, %1\n\t" \ + "j 2f\n\t" \ + "3:\n\t" \ + "cbo." 
__stringify(_op) " (a0)\n\t" \ + "add a0, a0, %0\n\t" \ + "2:\n\t" \ + "bltu a0, %2, 3b\n\t" \ + : : "r"(_cachesize), \ + "r"((unsigned long)(_start) & ~((_cachesize) - 1UL)), \ + "r"((unsigned long)(_start) + (_size)) \ + : "a0") + +static void zicbom_cmo_clean_range(unsigned long addr, unsigned long size) +{ + ZICBOM_CMO_OP(clean, addr, size, riscv_cbom_block_size); +} + +static void zicbom_cmo_flush_range(unsigned long addr, unsigned long size) +{ + ZICBOM_CMO_OP(flush, addr, size, riscv_cbom_block_size); +} + +static void zicbom_cmo_inval_range(unsigned long addr, unsigned long size) +{ + ZICBOM_CMO_OP(inval, addr, size, riscv_cbom_block_size); +} + +struct riscv_cache_ops zicbom_cmo_ops = { + .clean_range = &zicbom_cmo_clean_range, + .inv_range = &zicbom_cmo_inval_range, + .flush_range = &zicbom_cmo_flush_range, +}; +#else +struct riscv_cache_ops zicbom_cmo_ops = { + .clean_range = NULL, + .inv_range = NULL, + .flush_range = NULL, + .riscv_dma_noncoherent_cmo_ops = NULL, +}; +#endif +EXPORT_SYMBOL(zicbom_cmo_ops); + void arch_sync_dma_for_device(phys_addr_t paddr, size_t size, enum dma_data_direction dir) { void *vaddr = phys_to_virt(paddr); + if (noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops) { + noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops(vaddr, size, dir, + NON_COHERENT_SYNC_DMA_FOR_DEVICE); + return; + } + switch (dir) { case DMA_TO_DEVICE: - ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size); + riscv_dma_noncoherent_clean(vaddr, size); break; case DMA_FROM_DEVICE: - ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size); + riscv_dma_noncoherent_clean(vaddr, size); break; case DMA_BIDIRECTIONAL: - ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size); + riscv_dma_noncoherent_flush(vaddr, size); break; default: break; @@ -37,12 +96,18 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size, { void *vaddr = phys_to_virt(paddr); + if (noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops) { + 
noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops(vaddr, size, dir, + NON_COHERENT_SYNC_DMA_FOR_CPU); + return; + } + switch (dir) { case DMA_TO_DEVICE: break; case DMA_FROM_DEVICE: case DMA_BIDIRECTIONAL: - ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size); + riscv_dma_noncoherent_flush(vaddr, size); break; default: break; @@ -53,7 +118,13 @@ void arch_dma_prep_coherent(struct page *page, size_t size) { void *flush_addr = page_address(page); - ALT_CMO_OP(flush, flush_addr, size, riscv_cbom_block_size); + if (noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops) { + noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops(flush_addr, size, -1, + NON_COHERENT_DMA_PREP); + return; + } + + riscv_dma_noncoherent_flush(flush_addr, size); } void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size, @@ -78,3 +149,16 @@ void riscv_noncoherent_supported(void) "Non-coherent DMA support enabled without a block size\n"); noncoherent_supported = true; } + +void riscv_noncoherent_register_cache_ops(struct riscv_cache_ops *ops) +{ + if (!ops) + return; + + if (ops->riscv_dma_noncoherent_cmo_ops) + noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops = + ops->riscv_dma_noncoherent_cmo_ops; + else + noncoherent_cache_ops = *ops; +} +EXPORT_SYMBOL(riscv_noncoherent_register_cache_ops); diff --git a/arch/riscv/mm/pmem.c b/arch/riscv/mm/pmem.c index 089df92ae876..cd5aa6f42851 100644 --- a/arch/riscv/mm/pmem.c +++ b/arch/riscv/mm/pmem.c @@ -6,16 +6,28 @@ #include <linux/export.h> #include <linux/libnvdimm.h> -#include <asm/cacheflush.h> +#include <asm/dma-noncoherent.h> void arch_wb_cache_pmem(void *addr, size_t size) { - ALT_CMO_OP(clean, addr, size, riscv_cbom_block_size); + if (noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops) { + noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops(addr, size, + -1, NON_COHERENT_DMA_PMEM); + return; + } + + riscv_dma_noncoherent_clean(addr, size); } EXPORT_SYMBOL_GPL(arch_wb_cache_pmem); void arch_invalidate_pmem(void *addr, size_t size) 
{ - ALT_CMO_OP(inval, addr, size, riscv_cbom_block_size); + if (noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops) { + noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops(addr, size, + -1, NON_COHERENT_DMA_PMEM); + return; + } + + riscv_dma_noncoherent_inval(addr, size); } EXPORT_SYMBOL_GPL(arch_invalidate_pmem);