Message ID | 1555947870-23014-1-git-send-email-guoren@kernel.org
---|---
State | New, archived
Series | riscv: Support non-coherency memory model
On Mon, Apr 22, 2019 at 11:44:30PM +0800, guoren@kernel.org wrote:
> - Add a _PAGE_COHERENCY bit in the current page table entry attributes.
>   The bit designates coherence for this page mapping. Software sets the
>   bit to tell the hardware that this page's memory region must be kept
>   coherent with the I/O devices in the SOC system by PMA settings.
>   If I/O devices and the CPU are already coherent in the SOC system, the
>   CPU simply ignores this bit.
>
>   PTE format:
>   | XLEN-1  10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0
>        PFN       C  RSW  D   A   G   U   X   W   R   V
>                  ^
>   BIT(9): Coherence attribute bit
>           0: hardware needn't keep the page coherent and software will
>              maintain the coherence with cache clean/invalidate operations.
>           1: hardware must keep the page coherent and software needn't
>              maintain the coherence.
>   BIT(8): Reserved for software; currently it's _PAGE_SPECIAL in linux
>
>   Adding a new hardware bit in the PTE also needs a modification of the
>   Privileged Architecture Supervisor-Level ISA:
>   https://github.com/riscv/riscv-isa-manual/pull/374
>
> - Add SBI_FENCE_DMA 9 in riscv-sbi.
>   sbi_fence_dma(start, size, dir) could synchronize CPU cache data with
>   a DMA device in the non-coherency memory model. The third param's
>   definition is the same as Linux's in include/linux/dma-direction.h:

Please don't make this an SBI call.  We need a proper instruction
for cache flushing and invalidation.  We'll also need that for pmem
support, for example.  I heard at least one other vendor already
has an instruction, and we really need to get this into the privileged
spec ASAP (yesterday, in fact).

If you already have your own instructions, we can probably binary
patch those in using the Linux alternatives mechanism once we have
a standardized way in the privileged spec.

We should probably start a working group for this ASAP unless we can
get another working group to take care of it.

> +#define pgprot_noncached pgprot_noncached
> +static inline pgprot_t pgprot_noncached(pgprot_t _prot)
> +{
> +	unsigned long prot = pgprot_val(_prot);
> +
> +	prot |= _PAGE_COHERENCY;
> +
> +	return __pgprot(prot);

Nitpick: this can be shortened to

	return __pgprot(pgprot_val(_prot) | _PAGE_COHERENCY);

Also, is this really a coherence flag, or an 'uncached' flag like on
many other architectures?

> +++ b/arch/riscv/mm/dma-mapping.c

This should probably be called dma-noncoherent.c.

It should also have a user-visible config option so that we don't
have to build it for fully coherent systems.

> +void arch_dma_prep_coherent(struct page *page, size_t size)
> +{
> +	memset(page_address(page), 0, size);

No need for this memset, the caller takes care of it.

> diff --git a/arch/riscv/mm/ioremap.c b/arch/riscv/mm/ioremap.c
> index bd2f2db..f6aaf1e 100644
> --- a/arch/riscv/mm/ioremap.c
> +++ b/arch/riscv/mm/ioremap.c
> @@ -73,7 +73,7 @@ static void __iomem *__ioremap_caller(phys_addr_t addr, size_t size,
>   */
>  void __iomem *ioremap(phys_addr_t offset, unsigned long size)
>  {
> -	return __ioremap_caller(offset, size, PAGE_KERNEL,
> +	return __ioremap_caller(offset, size, PAGE_KERNEL_COHERENCY,
>  				__builtin_return_address(0));
>  }
>  EXPORT_SYMBOL(ioremap);

I think ioremap is a different story, and should be a separate patch.
Thx Christoph,

On Mon, Apr 22, 2019 at 06:18:14PM +0200, Christoph Hellwig wrote:
> Please don't make this an SBI call.  We need a proper instruction
> for cache flushing and invalidation.  We'll also need that for pmem
> support, for example.  I heard at least one other vendor already
> has an instruction, and we really need to get this into the privileged
> spec ASAP (yesterday, in fact).
>
> If you already have your own instructions, we can probably binary
> patch those in using the Linux alternatives mechanism once we have
> a standardized way in the privileged spec.
>
> We should probably start a working group for this ASAP unless we can
> get another working group to take care of it.
Good news. I prefer to use instructions directly instead of an SBI_CALL.

Our instruction is "dcache.c/iva %0" (one cache line) and the parameter
is a virtual address in S-state. When we get into M-state via an
SBI_CALL, we could let dcache.c/iva use a physical address directly, so
it needn't kmap the page for RV32 with highmem (of course highmem is not
ready on RV32 now).

> Nitpick: this can be shortened to
>
> 	return __pgprot(pgprot_val(_prot) | _PAGE_COHERENCY);
Good.

> Also, is this really a coherence flag, or an 'uncached' flag like on
> many other architectures?
There are a lot of features covered by the coherency attributes, e.g.
cacheable, bufferable, strong order..., and coherency is a more abstract
name that covers all of these. In our hardware, coherence = uncached +
unbufferable + (strong order).

But I don't care much what the name is; uncached is also OK. My key
point is that page attribute bits are very precious, and this patch
uses the last unused attribute bit in the PTE.

Another point is that we could get more attribute bits by modifying the
riscv spec:
 - Remove the Global bit; I think it duplicates the User bit in linux.
 - Change _PAGE_PFN_SHIFT from 10 to 12, because the huge pfn on RV32 is
   quite useless and current RV32 linux doesn't even implement highmem.

And then we could get another three page attribute bits in the PTE.

> This should probably be called dma-noncoherent.c.
>
> It should also have a user-visible config option so that we don't
> have to build it for fully coherent systems.
Ok, dma-noncoherent.c is clearer.

> No need for this memset, the caller takes care of it.
Ok

> I think ioremap is a different story, and should be a separate patch.
Ok

Best Regards
 Guo Ren
On Tue, Apr 23, 2019 at 08:13:48AM +0800, Guo Ren wrote:
> > We should probably start a working group for this ASAP unless we can
> > get another working group to take care of it.
> Good news. I prefer to use instructions directly instead of an SBI_CALL.
>
> Our instruction is "dcache.c/iva %0" (one cache line) and the parameter
> is a virtual address in S-state. When we get into M-state via an
> SBI_CALL, we could let dcache.c/iva use a physical address directly, so
> it needn't kmap the page for RV32 with highmem (of course highmem is not
> ready on RV32 now).

So you only have one instruction variant?  Normally we'd have two or
three to implement the non-coherent DMA (or pmem) semantics: cache
writeback, cache invalidate, and potentially cache writeback +
invalidate to optimize that case.  Here is the table of how Linux
uses them for DMA:

         |  map == for_device             |  unmap == for_cpu
---------+--------------------------------+----------------------------
TO_DEV   |  writeback      writeback      |  none          none
FROM_DEV |  invalidate     invalidate     |  invalidate*   invalidate*
BIDI     |  writeback+inv  writeback+inv  |  invalidate    invalidate

	[*] needed for CPU speculative prefetches

We already have a discussion on isa-dev on something like these
instructions:

https://groups.google.com/a/groups.riscv.org/forum/#!msg/isa-dev/qXbzqaQbDXU/4ThcEAeCCAAJ

It got a little side-tracked, both due to the usual noise on isa-dev
and due to the proposal including a lot more instructions that might be
a little more contentious, but it might be a good start to bring this
into a working group.

> > Also, is this really a coherence flag, or an 'uncached' flag like on
> > many other architectures?
> There are a lot of features covered by the coherency attributes, e.g.
> cacheable, bufferable, strong order..., and coherency is a more abstract
> name that covers all of these. In our hardware, coherence = uncached +
> unbufferable + (strong order).
>
> But I don't care much what the name is; uncached is also OK. My key
> point is that page attribute bits are very precious, and this patch
> uses the last unused attribute bit in the PTE.

I don't care about the name actually, more about having defined semantics.
Totally uncached should include unbuffered.  I don't think we need the
strong ordering for DMA memory either.

> Another point is that we could get more attribute bits by modifying the
> riscv spec:
> - Remove the Global bit; I think it duplicates the User bit in linux.

It is in Linux, but it is conceptually very different.

> - Change _PAGE_PFN_SHIFT from 10 to 12, because the huge pfn on RV32 is
>   quite useless and current RV32 linux doesn't even implement highmem.

This would seem sensible to me, but I'm not sure everyone agrees.  Even
then we are very late in the game for changes like that.
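[Editor's sketch, not part of the posted patch: how the two kernel hooks
could encode Christoph's table above.  The dcache_wb_range(),
dcache_inv_range() and dcache_wbinv_range() helpers are hypothetical
(one possible implementation is sketched after the next mail, where the
three instruction variants are named), and lowmem-only buffers are
assumed so phys_to_virt() is valid.]

#include <linux/bug.h>
#include <linux/dma-direction.h>
#include <linux/io.h>
#include <linux/types.h>

void arch_sync_dma_for_device(struct device *dev, phys_addr_t paddr,
			      size_t size, enum dma_data_direction dir)
{
	unsigned long start = (unsigned long)phys_to_virt(paddr);

	switch (dir) {
	case DMA_TO_DEVICE:		/* CPU wrote, device will read */
		dcache_wb_range(start, start + size);
		break;
	case DMA_FROM_DEVICE:		/* device will write */
		dcache_inv_range(start, start + size);
		break;
	case DMA_BIDIRECTIONAL:
		dcache_wbinv_range(start, start + size);
		break;
	default:
		BUG();
	}
}

void arch_sync_dma_for_cpu(struct device *dev, phys_addr_t paddr,
			   size_t size, enum dma_data_direction dir)
{
	unsigned long start = (unsigned long)phys_to_virt(paddr);

	switch (dir) {
	case DMA_TO_DEVICE:
		/* "none" in the table: nothing to do on unmap */
		break;
	case DMA_FROM_DEVICE:
	case DMA_BIDIRECTIONAL:
		/* drop lines the CPU may have speculatively fetched */
		dcache_inv_range(start, start + size);
		break;
	default:
		BUG();
	}
}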
On Tue, Apr 23, 2019 at 07:55:48AM +0200, Christoph Hellwig wrote:
> So you only have one instruction variant?  Normally we'd have two or
> three to implement the non-coherent DMA (or pmem) semantics:
dcache.c/iva means three instructions:
 - dcache.cva %0  : writeback by virtual address cacheline
 - dcache.iva %0  : invalidate by virtual address cacheline
 - dcache.civa %0 : writeback+inv by virtual address cacheline

We also have memory barrier instructions, ref:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/csky/include/asm/barrier.h

> We already have a discussion on isa-dev on something like these
> instructions:
>
> https://groups.google.com/a/groups.riscv.org/forum/#!msg/isa-dev/qXbzqaQbDXU/4ThcEAeCCAAJ
>
> It got a little side-tracked, both due to the usual noise on isa-dev
> and due to the proposal including a lot more instructions that might be
> a little more contentious, but it might be a good start to bring this
> into a working group.
I think using an sbi_call as a temporary solution is a good choice
before the instruction is determined.

> I don't care about the name actually, more about having defined semantics.
> Totally uncached should include unbuffered.  I don't think we need the
> strong ordering for DMA memory either.
Yes, memory doesn't need strong order, and strong order in the PMA also
implies uncached and unbuffered for an IO address. If the entry of the
PMA is strong order, the page mapping must be uncached + unbuffered +
strong-order without _PAGE_COHERENCY in the PTE.

> > Another point is that we could get more attribute bits by modifying the
> > riscv spec:
> > - Remove the Global bit; I think it duplicates the User bit in linux.
>
> It is in Linux, but it is conceptually very different.
Yes, but hardware could ignore one of them, and in riscv linux
_PAGE_GLOBAL is of no use at all, see:
	grep _PAGE_GLOBAL arch/riscv -r

In fact, the _PAGE_KERNEL for the pte doesn't contain _PAGE_GLOBAL and it
works on the FU540 and qemu. As I've mentioned, page attribute bits are
very precious; defining a useless bit makes people confused.

> > - Change _PAGE_PFN_SHIFT from 10 to 12, because the huge pfn on RV32 is
> >   quite useless and current RV32 linux doesn't even implement highmem.
>
> This would seem sensible to me, but I'm not sure everyone agrees.  Even
> then we are very late in the game for changes like that.
Agree.

Best Regards
 Guo Ren
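[Editor's sketch: S-mode range helpers built on the three vendor
mnemonics named above (dcache.cva/iva/civa, one cache line per
operation).  This assumes a toolchain that accepts these mnemonics,
assumes L1_CACHE_BYTES matches the real line size, and the function
names are made up here, not taken from the patch.]

#include <linux/cache.h>

static inline void dcache_wb_range(unsigned long start, unsigned long end)
{
	unsigned long i;

	/* writeback each line so the device sees the CPU's data */
	for (i = start & ~(L1_CACHE_BYTES - 1); i < end; i += L1_CACHE_BYTES)
		asm volatile("dcache.cva %0" : : "r"(i) : "memory");
}

static inline void dcache_inv_range(unsigned long start, unsigned long end)
{
	unsigned long i;

	/* invalidate each line so the CPU re-reads device-written data */
	for (i = start & ~(L1_CACHE_BYTES - 1); i < end; i += L1_CACHE_BYTES)
		asm volatile("dcache.iva %0" : : "r"(i) : "memory");
}

static inline void dcache_wbinv_range(unsigned long start, unsigned long end)
{
	unsigned long i;

	/* writeback + invalidate, for DMA_BIDIRECTIONAL buffers */
	for (i = start & ~(L1_CACHE_BYTES - 1); i < end; i += L1_CACHE_BYTES)
		asm volatile("dcache.civa %0" : : "r"(i) : "memory");
}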
On 23/04/2019 16:46, Guo Ren wrote:
> On Tue, Apr 23, 2019 at 07:55:48AM +0200, Christoph Hellwig wrote:
[...]
>>> Another point is that we could get more attribute bits by modifying the
>>> riscv spec:
>>> - Remove the Global bit; I think it duplicates the User bit in linux.
>>
>> It is in Linux, but it is conceptually very different.
> Yes, but hardware could ignore one of them, and in riscv linux
> _PAGE_GLOBAL is of no use at all, see:
> 	grep _PAGE_GLOBAL arch/riscv -r
>
> In fact, the _PAGE_KERNEL for the pte doesn't contain _PAGE_GLOBAL and it
> works on the FU540 and qemu. As I've mentioned, page attribute bits are
> very precious; defining a useless bit makes people confused.

The fact that it isn't used yet doesn't imply it is not useful. We don't
use ASIDs at the moment, and without using ASIDs the "global" bit is
indeed not useful. However, with ASIDs the bit will be vital for saving
TLB space. Without the global bit, the kernel pages become synonyms of
themselves (i.e. they have different tags in the TLB but refer to the
same physical page).

The global bit also exists in many other ISAs. It's definitely not a
"useless" bit.

Moreover, this bit is already implemented in both Rocket and Ariane. It
has also been in the spec for quite a while. The fact that Linux doesn't
use it at the moment is not a reason for removing it.

>>> - Change _PAGE_PFN_SHIFT from 10 to 12, because the huge pfn on RV32 is
>>>   quite useless and current RV32 linux doesn't even implement highmem.
>>
>> This would seem sensible to me, but I'm not sure everyone agrees.  Even
>> then we are very late in the game for changes like that.
> Agree.
>
> Best Regards
>  Guo Ren
Hi Gary,

On Tue, Apr 23, 2019 at 03:57:30PM +0000, Gary Guo wrote:
> The fact that it isn't used yet doesn't imply it is not useful. We don't
> use ASIDs at the moment, and without using ASIDs the "global" bit is
> indeed not useful. However, with ASIDs the bit will be vital for saving
> TLB space. Without the global bit, the kernel pages become synonyms of
> themselves (i.e. they have different tags in the TLB but refer to the
> same physical page).
>
> The global bit also exists in many other ISAs. It's definitely not a
> "useless" bit.
>
> Moreover, this bit is already implemented in both Rocket and Ariane. It
> has also been in the spec for quite a while. The fact that Linux doesn't
> use it at the moment is not a reason for removing it.

Look:
linux-next git:(riscv_asid_allocator_v2)$ grep GLOBAL arch/riscv -r
arch/riscv/include/asm/pgtable-bits.h:#define _PAGE_GLOBAL   (1 << 5)  /* Global */
arch/riscv/include/asm/pgtable-bits.h:                       _PAGE_USER | _PAGE_GLOBAL))

Your patch tells us _PAGE_USER and _PAGE_GLOBAL are duplicates, so why
couldn't we make _PAGE_USER imply _PAGE_GLOBAL? Can you give an example
of a real scenario in a PTE with:
	_PAGE_USER:0 + _PAGE_GLOBAL:1
or
	_PAGE_USER:1 + _PAGE_GLOBAL:0

Of course I know USER & GLOBAL are conceptually very different, but
there are only 10 attribute bits in riscv (in fact we've wasted two bits
to support the huge RV32 pfn :P). So I think it is time to merge these
two bits before hardware supports GLOBAL. Reserve them for the future!

Best Regards
 Guo Ren
> -----Original Message-----
> From: Guo Ren <guoren@kernel.org>
> Sent: Wednesday, April 24, 2019 03:08
> To: Gary Guo <gary@garyguo.net>
> Subject: Re: [PATCH] riscv: Support non-coherency memory model
>
> Hi Gary,
>
[...]
> Your patch tells us _PAGE_USER and _PAGE_GLOBAL are duplicates, so why
> couldn't we make _PAGE_USER imply _PAGE_GLOBAL? Can you give an example
> of a real scenario in a PTE with:
> 	_PAGE_USER:0 + _PAGE_GLOBAL:1
> or
> 	_PAGE_USER:1 + _PAGE_GLOBAL:0
>
> Of course I know USER & GLOBAL are conceptually very different, but
> there are only 10 attribute bits in riscv (in fact we've wasted two bits
> to support the huge RV32 pfn :P). So I think it is time to merge these
> two bits before hardware supports GLOBAL. Reserve them for the future!

Two cases I can think of:

* vdso-like things. They're user pages that can really be shared across
  address spaces (i.e. global). Kernels like L4 implement most system
  calls similar to a VDSO, so USER + GLOBAL is useful.

* hypervisor without the H-extension: this requires shadow page tables.
  Supervisor pages are mapped to supervisor shadow pages. However, these
  shadow pages cannot be GLOBAL because they can't be shared between
  VMs. So !USER + !GLOBAL is useful.

Remember, Linux isn't the only supervisor software that RISC-V cares
about!

> Best Regards
>  Guo Ren
Hi Gary,

On Wed, Apr 24, 2019 at 03:21:14AM +0000, Gary Guo wrote:
> Two cases I can think of:
>
> * vdso-like things. They're user pages that can really be shared across
>   address spaces (i.e. global). Kernels like L4 implement most system
>   calls similar to a VDSO, so USER + GLOBAL is useful.
The vdso is a user-space mapping in linux, see fs/binfmt_elf.c:

static int load_elf_binary(struct linux_binprm *bprm) {
	...
#ifdef ARCH_HAS_SETUP_ADDITIONAL_PAGES
	retval = arch_setup_additional_pages(bprm, !!elf_interpreter);
	if (retval < 0)
		goto out;
#endif /* ARCH_HAS_SETUP_ADDITIONAL_PAGES */

All linux archs use arch_setup_additional_pages for the vdso mapping,
and every process has its own vdso mapping to the same pages.

I don't think the vdso is a real scenario for GLOBAL in the PTE.

> * hypervisor without the H-extension: this requires shadow page tables.
>   Supervisor pages are mapped to supervisor shadow pages. However, these
>   shadow pages cannot be GLOBAL because they can't be shared between
>   VMs. So !USER + !GLOBAL is useful.
A hypervisor uses 2-stage TLB translation in hardware, and shadow page
tables are for the stage-2 translation. Shadow page tables care about
the VMID, not the ASID.

If the hardware doesn't support the H-extension (2-stage MMU
translation), the virtualization performance is hard to accept.

I don't think a hypervisor is a real scenario for GLOBAL in the PTE.

Are there other scenarios for GLOBAL in the PTE?

Best Regards
 Guo Ren
On 24/04/2019 06:57, Guo Ren wrote:
[...]
> All linux archs use arch_setup_additional_pages for the vdso mapping,
> and every process has its own vdso mapping to the same pages.

But we shouldn't prevent a kernel from mapping a USER page globally. As
I said, the fact that Linux doesn't do it isn't a valid reason for
omitting the possibility.

> I don't think the vdso is a real scenario for GLOBAL in the PTE.
>
> A hypervisor uses 2-stage TLB translation in hardware, and shadow page
> tables are for the stage-2 translation. Shadow page tables care about
> the VMID, not the ASID.

When the H-extension is present, stage-2 translation uses the VMID and
is performed by hardware. When the H-extension is not present, there's
no such thing as a VMID.

When the H-extension is not present, both the hypervisor and the guest
supervisor will run in supervisor mode, and the hypervisor uses
MSTATUS.TVM to trap the guest supervisor's virtual memory operations.
The shadow page table is populated by doing a 2-stage page walk in
software. In this case, the hypervisor likely needs to use some bits of
the ASID to emulate the VMID feature, and a GLOBAL page cannot be used,
as it would mean that the page exists in all physical ASIDs (which
encode both the emulated VMID and the ASID). Having supervisor pages be
GLOBAL makes the semantics incorrect!

> If the hardware doesn't support the H-extension (2-stage MMU
> translation), the virtualization performance is hard to accept.

The RISC-V privileged spec is explicitly designed to allow the
techniques described above (this is the sole purpose of MSTATUS.TVM). It
might not be as high-performance as hardware with the H-extension, but
it is definitely a legit use case. In fact, it is vital for use cases
like recursive virtualization.

Also, I believe the PTE format of RISC-V is already frozen -- therefore
it is impossible now to merge the GLOBAL and USER bits, or to replace
the RSW bits with something else.

> I don't think a hypervisor is a real scenario for GLOBAL in the PTE.
>
> Are there other scenarios for GLOBAL in the PTE?
>
> Best Regards
>  Guo Ren
On Wed, Apr 24, 2019 at 12:45:56PM +0000, Gary Guo wrote:
> The RISC-V privileged spec is explicitly designed to allow the
> techniques described above (this is the sole purpose of MSTATUS.TVM). It
> might not be as high-performance as hardware with the H-extension, but
> it is definitely a legit use case. In fact, it is vital for use cases
> like recursive virtualization.
>
> Also, I believe the PTE format of RISC-V is already frozen -- therefore
> it is impossible now to merge the GLOBAL and USER bits, or to replace
> the RSW bits with something else.

Yes, I do not think we can just repurpose a bit.  Even using a currently
unused one would require some gymnastics.

That being said, IFF we want to support non-coherent DMA (and I think we
do, as people glue together their SOCs using shoestring and paper clips,
as already demonstrated by Andes and C-SKY in RISC-V space, and by most
arm, mips and ppc SOCs) we need something like this flag.  The current
RISC-V method that only allows M-mode to set up such attributes on
a small number of PMP regions just doesn't work well with the way
Linux and most non-trivial OSes implement DMA memory allocations.

Note that I said "well" - in theory we can have a firmware-provided
uncached pool - that is what Linux does on most nommu (that is, without
pagetables) ports, but the fixed-size pool really does suck and will
make users very unhappy.
On Wed, Apr 24, 2019 at 4:23 PM Christoph Hellwig <hch@lst.de> wrote:
> That being said, IFF we want to support non-coherent DMA (and I think we
> do, as people glue together their SOCs using shoestring and paper clips,
> as already demonstrated by Andes and C-SKY in RISC-V space, and by most
> arm, mips and ppc SOCs) we need something like this flag.  The current
> RISC-V method that only allows M-mode to set up such attributes on
> a small number of PMP regions just doesn't work well with the way
> Linux and most non-trivial OSes implement DMA memory allocations.
>
> Note that I said "well" - in theory we can have a firmware-provided
> uncached pool - that is what Linux does on most nommu (that is, without
> pagetables) ports, but the fixed-size pool really does suck and will
> make users very unhappy.

You could probably get away with allowing uncached mappings only
for huge pages, and using one or two of the bits in the PMD for it.
This should cover most use cases, since in practice coherent allocations
tend to be either small and rare (device descriptors) or very big
(frame buffers etc), and both cases can be handled with hugepages
and gen_pool_alloc, possibly with CMA added in, since there will likely
not be an IOMMU either on the systems that lack cache-coherent DMA.

One downside is that you need a little more care for drivers that
use dma_mmap_coherent() to expose coherent buffers to user space.

Two other points about the proposal:

- Aside from completely uncached/unbuffered mappings, you typically
  want uncached/buffered mappings to cover dma_alloc_wc(), which is
  typically used for frame buffers etc. that need write-combining to
  get acceptable performance.

- You need to decide what is supposed to happen when there are
  multiple conflicting mappings for the same physical address.

       Arnd
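[Editor's sketch of the hugepage-backed pool Arnd describes, under
stated assumptions: pgprot_noncached() behaves as in the patch under
discussion, the pool size and all names here are invented, the backing
memory would realistically come from CMA, and error unwinding is
trimmed to the essentials.]

#include <linux/genalloc.h>
#include <linux/gfp.h>
#include <linux/init.h>
#include <linux/mm.h>
#include <linux/sizes.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

#define UNCACHED_POOL_SIZE	SZ_4M		/* one RV32 hugepage */

static struct gen_pool *uncached_pool;		/* hypothetical */

static int __init uncached_pool_init(void)
{
	int i, nr = UNCACHED_POOL_SIZE / PAGE_SIZE;
	struct page *page, **pages;
	void *va;

	/* physically contiguous, hugepage-sized backing memory */
	page = alloc_pages(GFP_KERNEL, get_order(UNCACHED_POOL_SIZE));
	if (!page)
		return -ENOMEM;

	pages = kmalloc_array(nr, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;
	for (i = 0; i < nr; i++)
		pages[i] = page + i;

	/* map the region a second time, this time uncached */
	va = vmap(pages, nr, VM_MAP, pgprot_noncached(PAGE_KERNEL));
	kfree(pages);
	if (!va)
		return -ENOMEM;

	/* carve coherent allocations out of the region with a genpool */
	uncached_pool = gen_pool_create(PAGE_SHIFT, -1);
	if (!uncached_pool)
		return -ENOMEM;
	return gen_pool_add_virt(uncached_pool, (unsigned long)va,
				 page_to_phys(page), UNCACHED_POOL_SIZE, -1);
}
postcore_initcall(uncached_pool_init);

With something like this, a coherent allocation would reduce to
gen_pool_alloc(uncached_pool, size) plus gen_pool_virt_to_phys() for the
dma_addr_t, with a writeback+invalidate of the cacheable linear-map
alias before the first use.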
Hi Arnd,

On Thu, Apr 25, 2019 at 11:50:11AM +0200, Arnd Bergmann wrote:
> You could probably get away with allowing uncached mappings only
> for huge pages, and using one or two of the bits in the PMD for it.
> This should cover most use cases, since in practice coherent allocations
> tend to be either small and rare (device descriptors) or very big
> (frame buffers etc), and both cases can be handled with hugepages
> and gen_pool_alloc, possibly with CMA added in, since there will likely
> not be an IOMMU either on the systems that lack cache-coherent DMA.

Generally, the attributes in a huge-tlb-entry and a leaf-tlb-entry
should be the same. Putting the _PAGE_CACHE and _PAGE_BUF bits only in
the huge-tlb-entry sounds a bit strange.

Also, the gen_pool pool is only 256KB by default, but a huge tlb entry
is 4MB. Hardware couldn't set up a virtual-4MB to physical-256KB range
mapping in the TLB.

> One downside is that you need a little more care for drivers that
> use dma_mmap_coherent() to expose coherent buffers to user space.
>
> Two other points about the proposal:
>
> - Aside from completely uncached/unbuffered mappings, you typically
>   want uncached/buffered mappings to cover dma_alloc_wc(), which is
>   typically used for frame buffers etc. that need write-combining to
>   get acceptable performance.

I agree dma_alloc_wc is necessary, and we'd need to add one more
attribute bit in the PTE: _PAGE_BUF. Perhaps using _PAGE_BUF +
_PAGE_CACHE is better than a single _PAGE_COHERENCY.

> - You need to decide what is supposed to happen when there are
>   multiple conflicting mappings for the same physical address.
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
What are the multiple conflicting mappings?

Best Regards
 Guo Ren
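[Editor's sketch of what the two-bit split Guo Ren proposes could look
like.  Everything here is hypothetical: the bit positions assume
_PAGE_PFN_SHIFT were raised to 12 as suggested earlier in the thread
(freeing bits 10 and 11), and the polarity (bit set = cacheable /
bufferable) is an assumption, not part of the posted patch.]

/* hypothetical PTE attribute bits, polarity assumed: set = enabled */
#define _PAGE_CACHE	(1 << 9)	/* cacheable */
#define _PAGE_BUF	(1 << 10)	/* bufferable (write-combining) */

#define pgprot_noncached pgprot_noncached
static inline pgprot_t pgprot_noncached(pgprot_t prot)
{
	/* strictly uncached and unbuffered, for dma_alloc_coherent() */
	return __pgprot(pgprot_val(prot) & ~(_PAGE_CACHE | _PAGE_BUF));
}

#define pgprot_writecombine pgprot_writecombine
static inline pgprot_t pgprot_writecombine(pgprot_t prot)
{
	/* uncached but bufferable, for dma_alloc_wc() frame buffers */
	return __pgprot((pgprot_val(prot) & ~_PAGE_CACHE) | _PAGE_BUF);
}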
On Fri, Apr 26, 2019 at 6:06 PM Guo Ren <guoren@kernel.org> wrote:
> Generally, the attributes in a huge-tlb-entry and a leaf-tlb-entry
> should be the same. Putting the _PAGE_CACHE and _PAGE_BUF bits only in
> the huge-tlb-entry sounds a bit strange.

Well, the point is that we can't really change the meaning of the
existing low bits, but because of the alignment constraints on
hugepages, the extra bits are currently unused for hugepage TLBs.

There are other architectures that reuse the bits in clever ways, e.g.
allowing larger physical address ranges to be used with hugepages than
with normal pages.

> Also, the gen_pool pool is only 256KB by default, but a huge tlb entry
> is 4MB. Hardware couldn't set up a virtual-4MB to physical-256KB range
> mapping in the TLB.

I expect the size could easily be changed, as long as there is
sufficient physical memory. If the entire system has 32MB or less,
setting 2MB aside would have a fairly significant impact, of course.

> > - You need to decide what is supposed to happen when there are
> >   multiple conflicting mappings for the same physical address.
> What are the multiple conflicting mappings?

I mean when you have the linear mapping as cacheable and another
mapping for the same physical page as uncacheable, and then access
the virtual address in both. This is usually a bad idea, but
architectures go to different lengths to prevent it.

The safest way would be for the CPU to produce a checkstop as soon
as there are TLB entries for the same physical address but different
caching settings. You can also do that if you have a cache-bypassing
load/store that hits a live cache line.

The other extreme would be to not do anything special and try to come
up with sane behavior, e.g. allow accesses in both ways but ensure that
a cache-bypassing load/store always flushes and invalidates cache
lines for the same physical address before its access.

       Arnd
On 4/26/19 11:42 AM, Arnd Bergmann wrote:
> I mean when you have the linear mapping as cacheable and another
> mapping for the same physical page as uncacheable, and then access
> the virtual address in both. This is usually a bad idea, but
> architectures go to different lengths to prevent it.
>
> The safest way would be for the CPU to produce a checkstop as soon
> as there are TLB entries for the same physical address but different
> caching settings. You can also do that if you have a cache-bypassing
> load/store that hits a live cache line.

The second one is probably doable in most systems, but the first one
usually isn't: the TLB usually can't be looked up by physical address.

     Bill

> The other extreme would be to not do anything special and try to come
> up with sane behavior, e.g. allow accesses in both ways but ensure that
> a cache-bypassing load/store always flushes and invalidates cache
> lines for the same physical address before its access.
>
>        Arnd
On Mon, 22 Apr 2019 08:44:30 PDT (-0700), guoren@kernel.org wrote:
> From: Guo Ren <ren_guo@c-sky.com>
>
> The current riscv linux implementation requires the SOC system to
> support memory coherence between all I/O devices and CPUs. But some SOC
> systems cannot maintain the coherence, and they need to support cache
> clean/invalidate operations to synchronize data.
>
> The current implementation is no problem with the SiFive FU540, because
> the FU540 keeps all IO devices and DMA master devices coherent with the
> CPU. But a traditional SOC vendor may already have a stable
> non-coherency SOC system; the need is simply to replace the CPU with a
> RV CPU, and rebuilding the whole system with IO-coherency is very
> expensive.
>
> So we should make riscv linux also support the non-coherency memory
> model. Here are the two points where riscv linux needs to be modified:
>
>  - Add a _PAGE_COHERENCY bit in the current page table entry attributes.
>    [...]
>
>    Adding a new hardware bit in the PTE also needs a modification of the
>    Privileged Architecture Supervisor-Level ISA:
>    https://github.com/riscv/riscv-isa-manual/pull/374

This is a RISC-V ISA modification, which isn't really appropriate to
suggest on the kernel mailing lists.  The right place to talk about this
is at the RISC-V foundation, which owns the ISA -- we can't change the
hardware with a patch to Linux :).

>  - Add SBI_FENCE_DMA 9 in riscv-sbi.
>    sbi_fence_dma(start, size, dir) could synchronize CPU cache data with
>    a DMA device in the non-coherency memory model. The third param's
>    definition is the same as Linux's in include/linux/dma-direction.h:
>
>    enum dma_data_direction {
>        DMA_BIDIRECTIONAL = 0,
>        DMA_TO_DEVICE = 1,
>        DMA_FROM_DEVICE = 2,
>        DMA_NONE = 3,
>    };
>
>    The first param, start, must be a physical address which can be
>    handled in M-state.
>
>    Here is a pull request to the riscv-sbi-doc:
>    https://github.com/riscv/riscv-sbi-doc/pull/15
>
> We have tested the patch on our fpga SOC system, whose network
> controller is connected to a non-cache-coherency interconnect, and it
> couldn't work without the patch.
>
> There is no side effect for the FU540, whose CPU doesn't care about
> _PAGE_COHERENCY in the PTE, but the FU540's bbl also needs to implement
> a trivial sbi_fence_dma that directly returns. In fact, if you give a
> correct configuration for dev_is_dma_coherent(), the linux dma framework
> wouldn't call sbi_fence_dma any more.

Non-coherent fences also need to be discussed as part of a RISC-V ISA
extension.  I know people have expressed interest, but I don't know of a
working group that's already been set up.

> Changelog:
>  - Use coherency instead of consistency throughout, to maintain term
>    consistency. (Xiang Xiaoyan)
>  - Add the riscv-isa-manual modification pull request link.
>  - Correct grammatical errors.
> [... remainder of the patch quoted without further comments; see the
> full diff at the end of this page ...]
On Mon, Apr 29, 2019 at 01:11:43PM -0700, Palmer Dabbelt wrote:
> On Mon, 22 Apr 2019 08:44:30 PDT (-0700), guoren@kernel.org wrote:
[...]
> This is a RISC-V ISA modification, which isn't really appropriate to
> suggest on the kernel mailing lists.  The right place to talk about this
> is at the RISC-V foundation, which owns the ISA -- we can't change the
> hardware with a patch to Linux :).

I just want a discussion, and a wide discussion is good for all of us :)

> Non-coherent fences also need to be discussed as part of a RISC-V ISA
  ^^^^^^^^^^^^^^^^^^^
  Fence instructions, not page attributes?
> extension.  I know people have expressed interest, but I don't know of
> a working group that's already been set up.

Does that mean the current RISC-V ISA forces the SOC to use a coherent
memory model?

Best Regards
 Guo Ren
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index eb56c82..f0fc503 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -16,9 +16,12 @@ config RISCV
 	select OF
 	select OF_EARLY_FLATTREE
 	select OF_IRQ
+	select ARCH_HAS_SYNC_DMA_FOR_CPU
+	select ARCH_HAS_SYNC_DMA_FOR_DEVICE
 	select ARCH_WANT_FRAME_POINTERS
 	select CLONE_BACKWARDS
 	select COMMON_CLK
+	select DMA_DIRECT_REMAP
 	select GENERIC_CLOCKEVENTS
 	select GENERIC_CPU_DEVICES
 	select GENERIC_IRQ_SHOW
@@ -27,6 +30,7 @@ config RISCV
 	select GENERIC_STRNCPY_FROM_USER
 	select GENERIC_STRNLEN_USER
 	select GENERIC_SMP_IDLE_THREAD
+	select GENERIC_ALLOCATOR
 	select GENERIC_ATOMIC64 if !64BIT || !RISCV_ISA_A
 	select HAVE_ARCH_AUDITSYSCALL
 	select HAVE_MEMBLOCK_NODE_MAP
diff --git a/arch/riscv/include/asm/pgtable-bits.h b/arch/riscv/include/asm/pgtable-bits.h
index 470755c..104f8c0 100644
--- a/arch/riscv/include/asm/pgtable-bits.h
+++ b/arch/riscv/include/asm/pgtable-bits.h
@@ -31,6 +31,7 @@
 #define _PAGE_ACCESSED  (1 << 6)    /* Set by hardware on any access */
 #define _PAGE_DIRTY     (1 << 7)    /* Set by hardware on any write */
 #define _PAGE_SOFT      (1 << 8)    /* Reserved for software */
+#define _PAGE_COHERENCY (1 << 9)    /* Coherency */
 
 #define _PAGE_SPECIAL   _PAGE_SOFT
 #define _PAGE_TABLE     _PAGE_PRESENT
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 1141364..26debb4 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -66,6 +66,7 @@
 
 #define PAGE_KERNEL		__pgprot(_PAGE_KERNEL)
 #define PAGE_KERNEL_EXEC	__pgprot(_PAGE_KERNEL | _PAGE_EXEC)
+#define PAGE_KERNEL_COHERENCY	__pgprot(_PAGE_KERNEL | _PAGE_COHERENCY)
 
 extern pgd_t swapper_pg_dir[];
 
@@ -375,6 +376,16 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
 	return ptep_test_and_clear_young(vma, address, ptep);
 }
 
+#define pgprot_noncached pgprot_noncached
+static inline pgprot_t pgprot_noncached(pgprot_t _prot)
+{
+	unsigned long prot = pgprot_val(_prot);
+
+	prot |= _PAGE_COHERENCY;
+
+	return __pgprot(prot);
+}
+
 /*
  * Encode and decode a swap entry
  *
diff --git a/arch/riscv/include/asm/sbi.h b/arch/riscv/include/asm/sbi.h
index b6bb10b..b945e50 100644
--- a/arch/riscv/include/asm/sbi.h
+++ b/arch/riscv/include/asm/sbi.h
@@ -25,6 +25,7 @@
 #define SBI_REMOTE_SFENCE_VMA 6
 #define SBI_REMOTE_SFENCE_VMA_ASID 7
 #define SBI_SHUTDOWN 8
+#define SBI_FENCE_DMA 9
 
 #define SBI_CALL(which, arg0, arg1, arg2) ({			\
 	register uintptr_t a0 asm ("a0") = (uintptr_t)(arg0);	\
@@ -42,6 +43,8 @@
 #define SBI_CALL_0(which) SBI_CALL(which, 0, 0, 0)
 #define SBI_CALL_1(which, arg0) SBI_CALL(which, arg0, 0, 0)
 #define SBI_CALL_2(which, arg0, arg1) SBI_CALL(which, arg0, arg1, 0)
+#define SBI_CALL_3(which, arg0, arg1, arg2) \
+		SBI_CALL(which, arg0, arg1, arg2)
 
 static inline void sbi_console_putchar(int ch)
 {
@@ -82,6 +85,13 @@ static inline void sbi_remote_fence_i(const unsigned long *hart_mask)
 	SBI_CALL_1(SBI_REMOTE_FENCE_I, hart_mask);
 }
 
+static inline void sbi_fence_dma(unsigned long start,
+				 unsigned long size,
+				 unsigned long dir)
+{
+	SBI_CALL_3(SBI_FENCE_DMA, start, size, dir);
+}
+
 static inline void sbi_remote_sfence_vma(const unsigned long *hart_mask,
 					 unsigned long start,
 					 unsigned long size)
diff --git a/arch/riscv/mm/Makefile b/arch/riscv/mm/Makefile
index b68aac7..adc563a 100644
--- a/arch/riscv/mm/Makefile
+++ b/arch/riscv/mm/Makefile
@@ -9,3 +9,4 @@ obj-y += fault.o
 obj-y += extable.o
 obj-y += ioremap.o
 obj-y += cacheflush.o
+obj-y += dma-mapping.o
diff --git a/arch/riscv/mm/dma-mapping.c b/arch/riscv/mm/dma-mapping.c
new file mode 100644
index 0000000..5e1d179
--- /dev/null
+++ b/arch/riscv/mm/dma-mapping.c
@@ -0,0 +1,44 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/dma-mapping.h>
+
+static int __init atomic_pool_init(void)
+{
+	return dma_atomic_pool_init(GFP_KERNEL, pgprot_noncached(PAGE_KERNEL));
+}
+postcore_initcall(atomic_pool_init);
+
+void arch_dma_prep_coherent(struct page *page, size_t size)
+{
+	memset(page_address(page), 0, size);
+
+	sbi_fence_dma(page_to_phys(page), size, DMA_BIDIRECTIONAL);
+}
+
+void arch_sync_dma_for_device(struct device *dev, phys_addr_t paddr,
+			      size_t size, enum dma_data_direction dir)
+{
+	switch (dir) {
+	case DMA_TO_DEVICE:
+	case DMA_FROM_DEVICE:
+	case DMA_BIDIRECTIONAL:
+		sbi_fence_dma(paddr, size, dir);
+		break;
+	default:
+		BUG();
+	}
+}
+
+void arch_sync_dma_for_cpu(struct device *dev, phys_addr_t paddr,
+			   size_t size, enum dma_data_direction dir)
+{
+	switch (dir) {
+	case DMA_TO_DEVICE:
+	case DMA_FROM_DEVICE:
+	case DMA_BIDIRECTIONAL:
+		sbi_fence_dma(paddr, size, dir);
+		break;
+	default:
+		BUG();
+	}
+}
diff --git a/arch/riscv/mm/ioremap.c b/arch/riscv/mm/ioremap.c
index bd2f2db..f6aaf1e 100644
--- a/arch/riscv/mm/ioremap.c
+++ b/arch/riscv/mm/ioremap.c
@@ -73,7 +73,7 @@ static void __iomem *__ioremap_caller(phys_addr_t addr, size_t size,
  */
 void __iomem *ioremap(phys_addr_t offset, unsigned long size)
 {
-	return __ioremap_caller(offset, size, PAGE_KERNEL,
+	return __ioremap_caller(offset, size, PAGE_KERNEL_COHERENCY,
 				__builtin_return_address(0));
 }
 EXPORT_SYMBOL(ioremap);
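[Editor's note: to see how the pieces of this patch fit together at
runtime, here is a hedged sketch of a driver-side transmit path on a
non-coherent SOC.  The "mydev" device and its send function are
invented for illustration; only the dma_map/unmap calls and the hooks
they reach are taken from the patch.]

#include <linux/dma-mapping.h>

static int mydev_send(struct device *dev, void *buf, size_t len)
{
	dma_addr_t dma;

	/*
	 * On a non-coherent platform, dma_map_single() calls
	 * arch_sync_dma_for_device(..., DMA_TO_DEVICE), which this patch
	 * turns into sbi_fence_dma(paddr, len, DMA_TO_DEVICE): a cache
	 * writeback so the device sees the CPU's latest data.
	 */
	dma = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
	if (dma_mapping_error(dev, dma))
		return -ENOMEM;

	/* ... point the device at "dma" and kick off the transfer ... */

	/*
	 * dma_unmap_single() reaches arch_sync_dma_for_cpu(); for
	 * DMA_TO_DEVICE no cache maintenance is strictly required, which
	 * is why the table earlier in the thread lists "none" here.
	 */
	dma_unmap_single(dev, dma, len, DMA_TO_DEVICE);
	return 0;
}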