
riscv: Support non-coherency memory model

Message ID 1555947870-23014-1-git-send-email-guoren@kernel.org (mailing list archive)
State New, archived
Series riscv: Support non-coherency memory model

Commit Message

Guo Ren April 22, 2019, 3:44 p.m. UTC
From: Guo Ren <ren_guo@c-sky.com>

The current riscv Linux implementation requires the SOC to keep memory
coherent between all I/O devices and CPUs. But some SOCs cannot maintain
that coherence, and they need cache clean/invalidate operations to
synchronize data.

The current implementation works fine on the SiFive FU540, because the
FU540 keeps all I/O devices and DMA master devices coherent with the
CPUs. But a traditional SOC vendor may already have a stable
non-coherent SOC system; the need is simply to replace the CPU with a
RISC-V CPU, and rebuilding the whole system for I/O coherency is very
expensive.

So riscv Linux should also support a non-coherent memory model. Here
are the two points where riscv Linux needs to be modified:

 - Add a _PAGE_COHERENCY bit to the page table entry attributes. The bit
   designates coherence for this page mapping. Software sets the bit to
   tell the hardware that the page's memory region must be kept coherent
   with the I/O devices in the SOC via the PMA settings.
   If I/O devices and CPUs are already coherent in the SOC, the CPU just
   ignores this bit.

   PTE format:
   | XLEN-1  10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0
         PFN      C  RSW  D   A   G   U   X   W   R   V
                  ^
   BIT(9): Coherence attribute bit
          0: hardware need not keep the page coherent; software will
             maintain coherence with cache clean/invalidate operations.
          1: hardware must keep the page coherent; software need not
             maintain coherence.
   BIT(8): Reserved for software; currently _PAGE_SPECIAL in Linux

   Adding a new hardware bit to the PTE also requires a modification
   to the Privileged Architecture Supervisor-Level ISA:
   https://github.com/riscv/riscv-isa-manual/pull/374

 - Add SBI_FENCE_DMA (9) to the RISC-V SBI.
   sbi_fence_dma(start, size, dir) synchronizes the CPU caches with a
   DMA device in the non-coherent memory model. The third parameter's
   definition is the same as Linux's in include/linux/dma-direction.h:

   enum dma_data_direction {
	DMA_BIDIRECTIONAL = 0,
	DMA_TO_DEVICE = 1,
	DMA_FROM_DEVICE = 2,
	DMA_NONE = 3,
   };

   The first parameter, start, must be a physical address that can be
   handled in M-mode.

   Here is a pull request to the riscv-sbi-doc (a sketch of a possible
   M-mode handler follows below):
   https://github.com/riscv/riscv-sbi-doc/pull/15
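
   For concreteness, here is a minimal sketch of how M-mode firmware
   (e.g. bbl) could implement SBI_FENCE_DMA. The cache_wb_range(),
   cache_inv_range() and cache_wbinv_range() helpers are hypothetical
   placeholders for vendor-specific cache-maintenance instructions
   operating on physical addresses; they are not part of any spec:

   static void sbi_fence_dma_handler(unsigned long start,
                                     unsigned long size,
                                     unsigned long dir)
   {
           switch (dir) {
           case 1: /* DMA_TO_DEVICE: write dirty lines back to memory */
                   cache_wb_range(start, start + size);
                   break;
           case 2: /* DMA_FROM_DEVICE: discard possibly stale lines */
                   cache_inv_range(start, start + size);
                   break;
           case 0: /* DMA_BIDIRECTIONAL: write back, then discard */
                   cache_wbinv_range(start, start + size);
                   break;
           }
   }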

We have tested the patch on our FPGA SOC system, whose network
controller is connected to a non-cache-coherent interconnect, and it
couldn't work without the patch.

There is no side effect for the FU540, whose CPUs don't care about
_PAGE_COHERENCY in the PTE, but the FU540's bbl also needs to implement
a trivial sbi_fence_dma that simply returns. In fact, if you give a
correct configuration for dev_is_dma_coherent(), the Linux DMA
framework wouldn't call sbi_fence_dma at all.

Changelog:
 - Use 'coherency' instead of 'consistency' throughout to keep the
   terminology consistent. (Xiang Xiaoyan)
 - Add riscv-isa-manual modification pull request link.
 - Correct grammatical errors.

Signed-off-by: Guo Ren <ren_guo@c-sky.com>
Cc: Andrew Waterman <andrew@sifive.com>
Cc: Anup Patel <anup.patel@wdc.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Scott Wood <swood@redhat.com>
Cc: Vincent Chen <vincentc@andestech.com>
Cc: Xiang Xiaoyan <xiaoyan_xiang@c-sky.com>
---
 arch/riscv/Kconfig                    |  4 ++++
 arch/riscv/include/asm/pgtable-bits.h |  1 +
 arch/riscv/include/asm/pgtable.h      | 11 +++++++++
 arch/riscv/include/asm/sbi.h          | 10 ++++++++
 arch/riscv/mm/Makefile                |  1 +
 arch/riscv/mm/dma-mapping.c           | 44 +++++++++++++++++++++++++++++++++++
 arch/riscv/mm/ioremap.c               |  2 +-
 7 files changed, 72 insertions(+), 1 deletion(-)
 create mode 100644 arch/riscv/mm/dma-mapping.c

Comments

Christoph Hellwig April 22, 2019, 4:18 p.m. UTC | #1
On Mon, Apr 22, 2019 at 11:44:30PM +0800, guoren@kernel.org wrote:
>  - Add a _PAGE_COHERENCY bit to the page table entry attributes. The bit
>    designates coherence for this page mapping. Software sets the bit to
>    tell the hardware that the page's memory region must be kept coherent
>    with the I/O devices in the SOC via the PMA settings.
>    If I/O devices and CPUs are already coherent in the SOC, the CPU just
>    ignores this bit.
> 
>    PTE format:
>    | XLEN-1  10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0
>          PFN      C  RSW  D   A   G   U   X   W   R   V
>                   ^
>    BIT(9): Coherence attribute bit
>           0: hardware need not keep the page coherent; software will
>              maintain coherence with cache clean/invalidate operations.
>           1: hardware must keep the page coherent; software need not
>              maintain coherence.
>    BIT(8): Reserved for software; currently _PAGE_SPECIAL in Linux
> 
>    Adding a new hardware bit to the PTE also requires a modification
>    to the Privileged Architecture Supervisor-Level ISA:
>    https://github.com/riscv/riscv-isa-manual/pull/374
> 
>  - Add SBI_FENCE_DMA (9) to the RISC-V SBI.
>    sbi_fence_dma(start, size, dir) synchronizes the CPU caches with a
>    DMA device in the non-coherent memory model. The third parameter's
>    definition is the same as Linux's in include/linux/dma-direction.h:

Please don't make this an SBI call.  We need a proper instruction
for cache flushing and invalidation.  We'll also need that for pmem
support for example.  I heard at least one other vendor already
had an instruction, and we really need to get this into the privileged
spec ASAP (yesterday in fact).

If you have your own instructions already we can probably binary
patch those in using the Linux alternatives mechanism once we have
a standardized way in the privileged spec.

We should probably start a working group for this ASAP unless we can
get another working group to help take care of it.

> +#define pgprot_noncached pgprot_noncached
> +static inline pgprot_t pgprot_noncached(pgprot_t _prot)
> +{
> +	unsigned long prot = pgprot_val(_prot);
> +
> +	prot |= _PAGE_COHERENCY;
> +
> +	return __pgprot(prot);

Nitpick: this can be shortened to

	return __pgprot(pgprot_val(_prot) | _PAGE_COHERENCY);

Also is this really a coherent flag, or an 'uncached' flag like in
many other architectures?

> +++ b/arch/riscv/mm/dma-mapping.c

This should probably be called dma-noncoherent.c

It should also have a user visible config option so that we don't
have to build it for fully coherent systems.

> +void arch_dma_prep_coherent(struct page *page, size_t size)
> +{
> +	memset(page_address(page), 0, size);

No need for this memset, the caller takes care of it.

> diff --git a/arch/riscv/mm/ioremap.c b/arch/riscv/mm/ioremap.c
> index bd2f2db..f6aaf1e 100644
> --- a/arch/riscv/mm/ioremap.c
> +++ b/arch/riscv/mm/ioremap.c
> @@ -73,7 +73,7 @@ static void __iomem *__ioremap_caller(phys_addr_t addr, size_t size,
>   */
>  void __iomem *ioremap(phys_addr_t offset, unsigned long size)
>  {
> -	return __ioremap_caller(offset, size, PAGE_KERNEL,
> +	return __ioremap_caller(offset, size, PAGE_KERNEL_COHERENCY,
>  		__builtin_return_address(0));
>  }
>  EXPORT_SYMBOL(ioremap);

I think ioremap is a different story, and should be a separate patch.
Guo Ren April 23, 2019, 12:13 a.m. UTC | #2
Thx Christoph,

On Mon, Apr 22, 2019 at 06:18:14PM +0200, Christoph Hellwig wrote:
> On Mon, Apr 22, 2019 at 11:44:30PM +0800, guoren@kernel.org wrote:
> >  - Add a _PAGE_COHERENCY bit to the page table entry attributes. The bit
> >    designates coherence for this page mapping. Software sets the bit to
> >    tell the hardware that the page's memory region must be kept coherent
> >    with the I/O devices in the SOC via the PMA settings.
> >    If I/O devices and CPUs are already coherent in the SOC, the CPU just
> >    ignores this bit.
> > 
> >    PTE format:
> >    | XLEN-1  10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0
> >          PFN      C  RSW  D   A   G   U   X   W   R   V
> >                   ^
> >    BIT(9): Coherence attribute bit
> >           0: hardware need not keep the page coherent; software will
> >              maintain coherence with cache clean/invalidate operations.
> >           1: hardware must keep the page coherent; software need not
> >              maintain coherence.
> >    BIT(8): Reserved for software; currently _PAGE_SPECIAL in Linux
> > 
> >    Adding a new hardware bit to the PTE also requires a modification
> >    to the Privileged Architecture Supervisor-Level ISA:
> >    https://github.com/riscv/riscv-isa-manual/pull/374
> > 
> >  - Add SBI_FENCE_DMA (9) to the RISC-V SBI.
> >    sbi_fence_dma(start, size, dir) synchronizes the CPU caches with a
> >    DMA device in the non-coherent memory model. The third parameter's
> >    definition is the same as Linux's in include/linux/dma-direction.h:
> 
> Please don't make this an SBI call.  We need a proper instruction
> for cache flushing and invalidation.  We'll also need that for pmem
> support for example.  I heard at least one other vendor already
> had an instruction, and we really need to get this into the privileged
> spec ASAP (yesterday in fact).
> 
> If you have your own instructions already we can probably binary
> patch those in using the Linux alternatives mechanism once we have
> a standardized way in the privileged spec.
> 
> We should probably start a working group for this ASAP unless we can
> get another working group to help take care of it.
Good news, I prefer to use instructions directly instead of SBI_CALL.

Our instruction is "dcache.c/iva %0" (one cache line), and the parameter
is a virtual address in S-mode. Once we get into M-mode via SBI_CALL, we
could let dcache.c/iva use a physical address directly, and it needn't
kmap the page for RV32 with highmem (of course, highmem is not ready on
RV32 yet).

> 
> > +#define pgprot_noncached pgprot_noncached
> > +static inline pgprot_t pgprot_noncached(pgprot_t _prot)
> > +{
> > +	unsigned long prot = pgprot_val(_prot);
> > +
> > +	prot |= _PAGE_COHERENCY;
> > +
> > +	return __pgprot(prot);
> 
> Nitpick: this can be shortened to
> 
> 	return __pgprot(pgprot_val(_prot) | _PAGE_COHERENCY);
Good.

> 
> Also is this really a coherent flag, or an 'uncached' flag like in
> many other architectures?
There are a lot of features related to coherency attributes, e.g.
cacheable, bufferable, strong order..., and coherency is a more abstract
name that covers all of these. In our hardware, coherence = uncached +
unbuffered + (strong order).

But I don't care much about what the name is; uncached is also OK. My
key point is that page attribute bits are very precious, and this patch
uses the last unused attribute bit in the PTE.

Another point is that we could get more attribute bits by modifying the
riscv spec:
 - Remove the Global bit; I think it duplicates the User bit in Linux.
 - Change _PAGE_PFN_SHIFT from 10 to 12, because the huge PFN on RV32 is
   nearly useless and current RV32 Linux doesn't even implement highmem.

That would give us another three page attribute bits in the PTE.

> 
> > +++ b/arch/riscv/mm/dma-mapping.c
> 
> This should probably be called dma-noncoherent.c
> 
> It should also have a user visible config option so that we don't
> have to build it for fully coherent systems.
Ok, dma-noncoherent.c is more clear.

> 
> > +void arch_dma_prep_coherent(struct page *page, size_t size)
> > +{
> > +	memset(page_address(page), 0, size);
> 
> No need for this memset, the caller takes care of it.
Ok

> 
> > diff --git a/arch/riscv/mm/ioremap.c b/arch/riscv/mm/ioremap.c
> > index bd2f2db..f6aaf1e 100644
> > --- a/arch/riscv/mm/ioremap.c
> > +++ b/arch/riscv/mm/ioremap.c
> > @@ -73,7 +73,7 @@ static void __iomem *__ioremap_caller(phys_addr_t addr, size_t size,
> >   */
> >  void __iomem *ioremap(phys_addr_t offset, unsigned long size)
> >  {
> > -	return __ioremap_caller(offset, size, PAGE_KERNEL,
> > +	return __ioremap_caller(offset, size, PAGE_KERNEL_COHERENCY,
> >  		__builtin_return_address(0));
> >  }
> >  EXPORT_SYMBOL(ioremap);
> 
> I think ioremap is a different story, and should be a separate patch.
Ok

Best Regards
 Guo Ren
Christoph Hellwig April 23, 2019, 5:55 a.m. UTC | #3
On Tue, Apr 23, 2019 at 08:13:48AM +0800, Guo Ren wrote:
> > We should probably start a working group for this ASAP unless we can
> > get another working group to help take care of it.
> Good news, I prefer to use instructions directly instead of SBI_CALL.
> 
> Our instruction is "dcache.c/iva %0" (one cache line), and the parameter
> is a virtual address in S-mode. Once we get into M-mode via SBI_CALL, we
> could let dcache.c/iva use a physical address directly, and it needn't
> kmap the page for RV32 with highmem (of course, highmem is not ready on
> RV32 yet).

So you only have one instruction variant?  Normally we'd have two or
three to implement the non-coherent DMA (or pmem) semantics:

cache writeback, cache invalidate and potentially cache writeback +
invalidate to optimize that case.  Here is a table of how Linux
uses them for DMA:

          |   map          ==  for_device     |   unmap     ==  for_cpu
          |----------------------------------------------------------------
 TO_DEV   |   writeback        writeback      |   none          none
 FROM_DEV |   invalidate       invalidate     |   invalidate*   invalidate*
 BIDI     |   writeback+inv    writeback+inv  |   invalidate    invalidate

     [*] needed for CPU speculative prefetches
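
To make the table concrete, here is a minimal sketch of how those
operations could map onto the kernel's arch hooks, assuming hypothetical
range primitives dcache_wb_range(), dcache_inv_range() and
dcache_wbinv_range() (illustrative names, not an existing API):

void arch_sync_dma_for_device(struct device *dev, phys_addr_t paddr,
			      size_t size, enum dma_data_direction dir)
{
	switch (dir) {
	case DMA_TO_DEVICE:			/* map: writeback */
		dcache_wb_range(paddr, paddr + size);
		break;
	case DMA_FROM_DEVICE:			/* map: invalidate */
		dcache_inv_range(paddr, paddr + size);
		break;
	case DMA_BIDIRECTIONAL:			/* map: writeback+inv */
		dcache_wbinv_range(paddr, paddr + size);
		break;
	default:
		break;
	}
}

void arch_sync_dma_for_cpu(struct device *dev, phys_addr_t paddr,
			   size_t size, enum dma_data_direction dir)
{
	/* unmap: nothing for TO_DEV, invalidate for FROM_DEV and BIDI */
	if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
		dcache_inv_range(paddr, paddr + size);
}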


We already have a discussion on isa-dev on something like these
instructions:

https://groups.google.com/a/groups.riscv.org/forum/#!msg/isa-dev/qXbzqaQbDXU/4ThcEAeCCAAJ

It got a little side tracked, both due to the usual noise on isa-dev
and due to the proposal including a lot more instructions that might be
a little more contentious, but it might be a good start to bring this
into a working group.

> > Also is this really a coherent flag, or an 'uncached' flag like in
> > many other architectures?
> There are a lot of features related to coherency attributes, e.g.
> cacheable, bufferable, strong order..., and coherency is a more abstract
> name that covers all of these. In our hardware, coherence = uncached +
> unbuffered + (strong order).
> 
> But I don't care much about what the name is; uncached is also OK. My
> key point is that page attribute bits are very precious, and this patch
> uses the last unused attribute bit in the PTE.

I don't care about the name actually, more about having defined semantics.
Totally uncached should include unbuffered.  I don't think we need the
strong ordering for DMA memory either.

> Another point is that we could get more attribute bits by modifying
> the riscv spec:
>  - Remove the Global bit; I think it duplicates the User bit in Linux.

It is in Linux, but it is conceptually very different.

>  - Change _PAGE_PFN_SHIFT from 10 to 12, because the huge PFN on RV32 is
>    nearly useless and current RV32 Linux doesn't even implement highmem.

This would seem sensible to me, but I'm not sure everyone agrees.  Even
then we are very late in the game for changes like that.
Guo Ren April 23, 2019, 3:46 p.m. UTC | #4
On Tue, Apr 23, 2019 at 07:55:48AM +0200, Christoph Hellwig wrote:
> On Tue, Apr 23, 2019 at 08:13:48AM +0800, Guo Ren wrote:
> > > We should probably start a working group for this ASAP unless we can
> > > get another working group to help take care of it.
> > Good news, I prefer to use instructions directly instead of SBI_CALL.
> > 
> > Our instruction is "dcache.c/iva %0" (one cache line), and the parameter
> > is a virtual address in S-mode. Once we get into M-mode via SBI_CALL, we
> > could let dcache.c/iva use a physical address directly, and it needn't
> > kmap the page for RV32 with highmem (of course, highmem is not ready on
> > RV32 yet).
> 
> So you only have one instruction variant?  Normally we'd have two or
> three to implement the non-coherent DMA (or pmem) semantics:
dcache.c/iva means three instructions:
 - dcache.cva %0  : writeback     by virtual address cacheline
 - dcache.iva %0  : invalidate    by virtual address cacheline
 - dcache.civa %0 : writeback+inv by virtual address cacheline

We also have memory barrier instructions, ref:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/csky/include/asm/barrier.h
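
As an illustration, a range writeback helper built on dcache.cva might
look like the sketch below; the 64-byte line size and the sync.s
completion barrier are assumptions for the example, not taken from any
spec:

#define CACHE_LINE_SIZE	64	/* assumed line size, for illustration */

static inline void dcache_wb_range(unsigned long start, unsigned long end)
{
	unsigned long addr = start & ~(CACHE_LINE_SIZE - 1UL);

	for (; addr < end; addr += CACHE_LINE_SIZE)
		asm volatile("dcache.cva %0" : : "r"(addr) : "memory");
	asm volatile("sync.s" : : : "memory");	/* wait for completion */
}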

> 
> cache writeback, cache invalidate and potentially cache writeback +
> invalidate to optimize that case.  Here is a table of how Linux
> uses them for DMA:
> 
>           |   map          ==  for_device     |   unmap     ==  for_cpu
>           |----------------------------------------------------------------
>  TO_DEV   |   writeback        writeback      |   none          none
>  FROM_DEV |   invalidate       invalidate     |   invalidate*   invalidate*
>  BIDI     |   writeback+inv    writeback+inv  |   invalidate    invalidate
> 
>      [*] needed for CPU speculative prefetches
> 
> 
> We already have a discussion on isa-dev on something like these
> instructions:
> 
> https://groups.google.com/a/groups.riscv.org/forum/#!msg/isa-dev/qXbzqaQbDXU/4ThcEAeCCAAJ
> 
> It got a little side tracked, both due to the usual noise on isa-dev
> and due to the proposal including a lot more instructions that might be
> a little more contentious, but it might be a good start to bring this
> into a working group.
I think using sbi_call as a temporary solution is a good choice before
the instruction is determined.

> 
> > > Also is this really a coherent flag, or an 'uncached' flag like in
> > > many other architectures?
> > There are a lot of features related to coherency attributes, e.g.
> > cacheable, bufferable, strong order..., and coherency is a more abstract
> > name that covers all of these. In our hardware, coherence = uncached +
> > unbuffered + (strong order).
> > 
> > But I don't care much about what the name is; uncached is also OK. My
> > key point is that page attribute bits are very precious, and this patch
> > uses the last unused attribute bit in the PTE.
> 
> I don't care about the name actually, more about having defined semantics.
> Totally uncached should include unbuffered.  I don't think we need the
> strong ordering for DMA memory either.
Yes, memory doesn't need strong order, and strong order in the PMA also
implies uncached and unbuffered for an I/O address. If the PMA entry is
strong order, the page mapping must be uncached + unbuffered +
strong-order even without _PAGE_COHERENCY in the PTE.

> 
> > Another point is that we could get more attribute bits by modifying
> > the riscv spec:
> >  - Remove the Global bit; I think it duplicates the User bit in Linux.
> 
> It is in Linux, but it is conceptually very different.
Yes, but hardware could ignore one of them, and in riscv Linux
_PAGE_GLOBAL is not used at all, see:
grep _PAGE_GLOBAL arch/riscv -r

In fact, the _PAGE_KERNEL for a pte doesn't contain _PAGE_GLOBAL, and it
works on the FU540 and qemu. As I've mentioned, page attribute bits are
very precious; defining a useless bit makes people confused.

> 
> >  - Change _PAGE_PFN_SHIFT from 10 to 12, because the huge PFN on RV32 is
> >    nearly useless and current RV32 Linux doesn't even implement highmem.
> 
> This would seem sensible to me, but I'm not sure everyone agrees.  Even
> then we are very late in the game for changes like that.
Agree.

Best Regards
 Guo Ren
Gary Guo April 23, 2019, 3:57 p.m. UTC | #5
On 23/04/2019 16:46, Guo Ren wrote:
> On Tue, Apr 23, 2019 at 07:55:48AM +0200, Christoph Hellwig wrote:
>> On Tue, Apr 23, 2019 at 08:13:48AM +0800, Guo Ren wrote:
>>>> We should probably start a working group for this ASAP unless we can
>>>> get another working group to help take care of it.
>>> Good news, I prefer to use instructions directly instead of SBI_CALL.
>>>
>>> Our instruction is "dcache.c/iva %0" (one cache line), and the parameter
>>> is a virtual address in S-mode. Once we get into M-mode via SBI_CALL, we
>>> could let dcache.c/iva use a physical address directly, and it needn't
>>> kmap the page for RV32 with highmem (of course, highmem is not ready on
>>> RV32 yet).
>>
>> So you only have one instruction variant?  Normally we'd have two or
>> three to implement the non-coherent DMA (or pmem) semantics:
> dcache.c/iva means three instructions:
>   - dcache.cva %0  : writeback     by virtual address cacheline
>   - dcache.iva %0  : invalidate    by virtual address cacheline
>   - dcache.civa %0 : writeback+inv by virtual address cacheline
> 
> We also have memory barrier instructions, ref:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/csky/include/asm/barrier.h
> 
>>
>> cache writeback, cache invalidate and potentially cache writeback +
>> invalidate to optimize that case.  Here is a table of how Linux
>> uses them for DMA:
>>
>>            |   map          ==  for_device     |   unmap     ==  for_cpu
>>            |----------------------------------------------------------------
>>   TO_DEV   |   writeback        writeback      |   none          none
>>   FROM_DEV |   invalidate       invalidate     |   invalidate*   invalidate*
>>   BIDI     |   writeback+inv    writeback+inv  |   invalidate    invalidate
>>
>>       [*] needed for CPU speculative prefetches
>>
>>
>> We already have a discussion on isa-dev on something like these
>> instructions:
>>
>> https://groups.google.com/a/groups.riscv.org/forum/#!msg/isa-dev/qXbzqaQbDXU/4ThcEAeCCAAJ
>>
>> It got a little side tracked, both due to the usual noise on isa-dev
>> and due to the proposal including a lot more instructions that might be
>> a little more contentious, but it might be a good start to bring this
>> into a working group.
> I think using sbi_call as a temporary solution is a good choice before
> the instruction is determined.
> 
>>
>>>> Also is this really a coherent flag, or an 'uncached' flag like in
>>>> many other architectures?
>>> There are a lot of features related to coherency attributes, e.g.
>>> cacheable, bufferable, strong order..., and coherency is a more abstract
>>> name that covers all of these. In our hardware, coherence = uncached +
>>> unbuffered + (strong order).
>>>
>>> But I don't care much about what the name is; uncached is also OK. My
>>> key point is that page attribute bits are very precious, and this patch
>>> uses the last unused attribute bit in the PTE.
>>
>> I don't care about the name actually, more about having defined semantics.
>> Totally uncached should include unbuffered.  I don't think we need the
>> strong ordering for DMA memory either.
> Yes, memory doesn't need strong order, and strong order in the PMA also
> implies uncached and unbuffered for an I/O address. If the PMA entry is
> strong order, the page mapping must be uncached + unbuffered +
> strong-order even without _PAGE_COHERENCY in the PTE.
> 
>>
>>> Another point is that we could get more attribute bits by modifying
>>> the riscv spec:
>>>   - Remove the Global bit; I think it duplicates the User bit in Linux.
>>
>> It is in Linux, but it is conceptually very different.
> Yes, but hardware could ignore one of them, and in riscv Linux
> _PAGE_GLOBAL is not used at all, see:
> grep _PAGE_GLOBAL arch/riscv -r
> 
> In fact, the _PAGE_KERNEL for a pte doesn't contain _PAGE_GLOBAL, and it
> works on the FU540 and qemu. As I've mentioned, page attribute bits are
> very precious; defining a useless bit makes people confused.
 >

The fact that it isn't used yet doesn't imply it is not useful. We don't 
use ASIDs at the moment, and without using ASIDs the "global" bit is 
indeed not useful. However with ASIDs the bit will be vital for saving 
TLB space. Without the global bit, the kernel pages become synonyms to 
themselves (i.e. they have different tags in TLB but refer to the same 
physical page).

The global bit also exists in many other ISAs as well. It's definitely 
not a "useless" bit.

Moreover, this bit is already implemented in both Rocket and Ariane. It 
is also in the spec for quite a while. The fact that Linux doesn't use 
it at the moment is not a reason for removing it.
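
To illustrate the TLB-space argument, here is an illustrative pseudo-C
sketch of a TLB hit check with ASIDs (the struct and field names are
invented for the example): without the global bit, every kernel mapping
needs a separate TLB entry per ASID.

struct tlb_entry {
	unsigned long vpn, ppn;
	unsigned int asid;
	bool global;
};

static bool tlb_hit(const struct tlb_entry *e, unsigned long vpn,
		    unsigned int cur_asid)
{
	/* A global entry matches in every address space. */
	return e->vpn == vpn && (e->global || e->asid == cur_asid);
}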

> 
>>
>>>   - Change _PAGE_PFN_SHIFT from 10 to 12, because the huge PFN on RV32 is
>>>     nearly useless and current RV32 Linux doesn't even implement highmem.
>>
>> This would seem sensible to me, but I'm not sure everyone agrees.  Even
>> then we are very late in the game for changes like that.
> Agree.
> 
> Best Regards
>   Guo Ren
Guo Ren April 24, 2019, 2:08 a.m. UTC | #6
Hi Gary,

On Tue, Apr 23, 2019 at 03:57:30PM +0000, Gary Guo wrote:
> >>> Another point is that we could get more attribute bits by modifying
> >>> the riscv spec:
> >>>   - Remove the Global bit; I think it duplicates the User bit in Linux.
> >>
> >> It is in Linux, but it is conceptually very different.
> > Yes, but hardware could ignore one of them and in riscv linux
> > _PAGE_GLOBAL is no use at all, see:
> > grep _PAGE_GLOBAL arch/riscv -r
> > 
> > In fact, the _PAGE_KERNEL for pte doesn't contain _PAGE_GLOBAL and it
> > works on FU540 and qemu. As I've mentioned page attribute bits is very
> > precious, define a useless bit make people confused.
>  >
> 
> The fact that it isn't used yet doesn't imply it is not useful. We don't 
> use ASIDs at the moment, and without using ASIDs the "global" bit is 
> indeed not useful. However with ASIDs the bit will be vital for saving 
> TLB space. Without the global bit, the kernel pages become synonyms to 
> themselves (i.e. they have different tags in TLB but refer to the same 
> physical page).
> 
> The global bit also exists in many other ISAs as well. It's definitely 
> not a "useless" bit.
> 
> Moreover, this bit is already implemented in both Rocket and Ariane. It 
> is also in the spec for quite a while. The fact that Linux doesn't use 
> it at the moment is not a reason for removing it.
> 

Look:
linux-next git:(riscv_asid_allocator_v2)$ grep GLOBAL arch/riscv -r
arch/riscv/include/asm/pgtable-bits.h:#define _PAGE_GLOBAL    (1 << 5)    /* Global */
arch/riscv/include/asm/pgtable-bits.h:                                    _PAGE_USER | _PAGE_GLOBAL))

Your patch tells us _PAGE_USER and _PAGE_GLOBAL are duplicates, so why
couldn't we make _PAGE_USER imply _PAGE_GLOBAL? Can you give an example
of a real scenario in a PTE with:
  _PAGE_USER:0 + _PAGE_GLOBAL:1
or
  _PAGE_USER:1 + _PAGE_GLOBAL:0

Of course I know USER & GLOBAL are conceptually very different, but
there are only 10 attribute bits for riscv (in fact we've wasted two
bits to support the huge RV32 PFN :P). So I think it is time to merge
these two bits before hardware supports GLOBAL. Reserve them for the
future!

Best Regards
 Guo Ren
Gary Guo April 24, 2019, 3:21 a.m. UTC | #7
> 
> Hi Gary,
> 
> On Tue, Apr 23, 2019 at 03:57:30PM +0000, Gary Guo wrote:
> > >>> Another point is that we could get more attribute bits by modifying
> > >>> the riscv spec:
> > >>>   - Remove the Global bit; I think it duplicates the User bit in Linux.
> > >>
> > >> It is in Linux, but it is conceptually very different.
> > > Yes, but hardware could ignore one of them and in riscv linux
> > > _PAGE_GLOBAL is no use at all, see:
> > > grep _PAGE_GLOBAL arch/riscv -r
> > >
> > > In fact, the _PAGE_KERNEL for pte doesn't contain _PAGE_GLOBAL and it
> > > works on FU540 and qemu. As I've mentioned page attribute bits is very
> > > precious, define a useless bit make people confused.
> >  >
> >
> > The fact that it isn't used yet doesn't imply it is not useful. We don't
> > use ASIDs at the moment, and without using ASIDs the "global" bit is
> > indeed not useful. However with ASIDs the bit will be vital for saving
> > TLB space. Without the global bit, the kernel pages become synonyms to
> > themselves (i.e. they have different tags in TLB but refer to the same
> > physical page).
> >
> > The global bit also exists in many other ISAs as well. It's definitely
> > not a "useless" bit.
> >
> > Moreover, this bit is already implemented in both Rocket and Ariane. It
> > is also in the spec for quite a while. The fact that Linux doesn't use
> > it at the moment is not a reason for removing it.
> >
> 
> Look:
> linux-next git:(riscv_asid_allocator_v2)$ grep GLOBAL arch/riscv -r
> arch/riscv/include/asm/pgtable-bits.h:#define _PAGE_GLOBAL    (1 << 5)    /*
> Global */
> arch/riscv/include/asm/pgtable-bits.h:                                    _PAGE_USER |
> _PAGE_GLOBAL))
> 
> Your patch tells us _PAGE_USER and _PAGE_GLOBAL are duplicates, so why
> couldn't we make _PAGE_USER imply _PAGE_GLOBAL? Can you give an example
> of a real scenario in a PTE with:
>   _PAGE_USER:0 + _PAGE_GLOBAL:1
> or
>   _PAGE_USER:1 + _PAGE_GLOBAL:0
> 
> Of course I know USER & GLOBAL are conceptually very different, but
> there are only 10 attribute bits for riscv (in fact we've wasted two bits
> to support the huge RV32 PFN :P). So I think it is time to merge these
> two bits before hardware supports GLOBAL. Reserve them for the future!

Two cases I can think of:
* vdso-like things. They're user pages that can really be shared across address spaces (i.e. global). Kernels like L4 implement most system calls similarly to the vdso, so USER + GLOBAL is useful.
* hypervisor without H-extension: This requires shadow page tables. Supervisor pages are mapped to supervisor shadow pages. However these shadow pages cannot be GLOBAL because they can't be shared between VMs. So !USER + !GLOBAL is useful.

Remember, Linux isn't the only supervisor software that RISC-V cares about!

> 
> Best Regards
>  Guo Ren
Guo Ren April 24, 2019, 5:57 a.m. UTC | #8
Hi Gary,

On Wed, Apr 24, 2019 at 03:21:14AM +0000, Gary Guo wrote:
> > Look:
> > linux-next git:(riscv_asid_allocator_v2)$ grep GLOBAL arch/riscv -r
> > arch/riscv/include/asm/pgtable-bits.h:#define _PAGE_GLOBAL    (1 << 5)    /*
> > Global */
> > arch/riscv/include/asm/pgtable-bits.h:                                    _PAGE_USER |
> > _PAGE_GLOBAL))
> > 
> > Your patch tells us _PAGE_USER and _PAGE_GLOBAL are duplicates, so why
> > couldn't we make _PAGE_USER imply _PAGE_GLOBAL? Can you give an example
> > of a real scenario in a PTE with:
> >   _PAGE_USER:0 + _PAGE_GLOBAL:1
> > or
> >   _PAGE_USER:1 + _PAGE_GLOBAL:0
> > 
> > Of course I know USER & GLOBAL are conceptually very different, but
> > there are only 10 attribute bits for riscv (in fact we've wasted two
> > bits to support the huge RV32 PFN :P). So I think it is time to merge
> > these two bits before hardware supports GLOBAL. Reserve them for the
> > future!
> 
> Two cases I can think of:
> * vdso-like things. They're user pages that can really be shared across address spaces (i.e. global). Kernels like L4 implement most system calls similarly to the vdso, so USER + GLOBAL is useful.
The vdso is a user-space mapping in Linux; see fs/binfmt_elf.c:

static int load_elf_binary(struct linux_binprm *bprm) {
...
#ifdef ARCH_HAS_SETUP_ADDITIONAL_PAGES
	retval = arch_setup_additional_pages(bprm, !!elf_interpreter);
	if (retval < 0)
		goto out;
#endif /* ARCH_HAS_SETUP_ADDITIONAL_PAGES */

All Linux architectures use arch_setup_additional_pages for the vdso
mapping, and every process has its own vdso mapping to the same pages.

I don't think the vdso is a real scenario for GLOBAL in the PTE.

> * hypervisor without H-extension: This requires shadow page tables. Supervisor
> pages are mapped to supervisor shadow pages. However these shadow pages cannot
> be GLOBAL because they can't be shared between VMs. So  !USER + !GLOBAL is useful.
A hypervisor uses 2-stage TLB translation in hardware, and shadow page
tables are for the stage 2 translation. Shadow page tables care about
the VMID, not the ASID.
If hardware doesn't support the H-extension (2-stage MMU translation),
the virtualization performance is hard to accept.

I don't think a hypervisor is a real scenario for GLOBAL in the PTE.

Are there other scenarios for GLOBAL in the PTE?

Best Regards
 Guo Ren
Gary Guo April 24, 2019, 12:45 p.m. UTC | #9
On 24/04/2019 06:57, Guo Ren wrote:
> Hi Gary,
> 
> On Wed, Apr 24, 2019 at 03:21:14AM +0000, Gary Guo wrote:
>>> Look:
>>> linux-next git:(riscv_asid_allocator_v2)$ grep GLOBAL arch/riscv -r
>>> arch/riscv/include/asm/pgtable-bits.h:#define _PAGE_GLOBAL    (1 << 5)    /*
>>> Global */
>>> arch/riscv/include/asm/pgtable-bits.h:                                    _PAGE_USER |
>>> _PAGE_GLOBAL))
>>>
>>> Your patch tells us _PAGE_USER and _PAGE_GLOBAL are duplicates, so why
>>> couldn't we make _PAGE_USER imply _PAGE_GLOBAL? Can you give an example
>>> of a real scenario in a PTE with:
>>>    _PAGE_USER:0 + _PAGE_GLOBAL:1
>>> or
>>>    _PAGE_USER:1 + _PAGE_GLOBAL:0
>>>
>>> Of course I know USER & GLOBAL are conceptually very different, but
>>> there are only 10 attribute bits for riscv (in fact we've wasted two bits
>>> to support the huge RV32 PFN :P). So I think it is time to merge these
>>> two bits before hardware supports GLOBAL. Reserve them for the future!
>>
>> Two cases I can think of:
>> * vdso-like things. They're user pages that can really be shared across address spaces (i.e. global). Kernels like L4 implement most system calls similarly to the vdso, so USER + GLOBAL is useful.
> The vdso is a user-space mapping in Linux; see fs/binfmt_elf.c:
> 
> static int load_elf_binary(struct linux_binprm *bprm) {
> ...
> #ifdef ARCH_HAS_SETUP_ADDITIONAL_PAGES
> 	retval = arch_setup_additional_pages(bprm, !!elf_interpreter);
> 	if (retval < 0)
> 		goto out;
> #endif /* ARCH_HAS_SETUP_ADDITIONAL_PAGES */
> 
> All Linux architectures use arch_setup_additional_pages for the vdso
> mapping, and every process has its own vdso mapping to the same pages.

But we shouldn't prevent a kernel from mapping a USER page globally. As 
I said, the fact that Linux doesn't do it isn't a valid reason for 
omitting the possibility.

> 
> I don't think the vdso is a real scenario for GLOBAL in the PTE.
> 
>> * hypervisor without H-extension: This requires shadow page tables. Supervisor
>> pages are mapped to supervisor shadow pages. However these shadow pages cannot
>> be GLOBAL because they can't be shared between VMs. So  !USER + !GLOBAL is useful.
> A hypervisor uses 2-stage TLB translation in hardware, and shadow page
> tables are for the stage 2 translation. Shadow page tables care about
> the VMID, not the ASID.

When H-extension is present, stage 2 translation uses VMID and is 
performed by hardware. When H-extension is not present, there's no such 
thing called VMID. When H-extension is not present, both hypervisor and 
guest supervisor will run in supervisor mode, and hypervisor uses 
MSTATUS.TVM to trap guest supervisor virtual memory operations. The 
shadow page table is populated by doing a 2-stage page walk in software.
In this case, the hypervisor likely needs to use some bits of the ASID
to emulate the VMID feature, so a GLOBAL page cannot be used, as it
means that the page exists in all physical ASIDs (which contain both
the emulated VMID and the ASID). Having supervisor pages be GLOBAL
makes the semantics incorrect!
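
As a concrete illustration of that ASID partitioning (the field widths
are assumed for the example, not taken from any implementation):

#define GUEST_ASID_BITS	6	/* assumed split of the hardware ASID */

static inline unsigned int hw_asid(unsigned int vmid,
				   unsigned int guest_asid)
{
	/* High bits carry the emulated VMID, low bits the guest ASID. */
	return (vmid << GUEST_ASID_BITS) |
	       (guest_asid & ((1u << GUEST_ASID_BITS) - 1));
}

A global PTE would match regardless of hw_asid(), i.e. across the
emulated VM boundary, which is exactly the broken semantics described
above.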

> If hardware doesn't support the H-extension (2-stage MMU translation),
> the virtualization performance is hard to accept.

The RISC-V privileged spec is explicitly designed to allow the
techniques described above (this is the sole purpose of MSTATUS.TVM).
It might not be as high performance as hardware with the H-extension,
but it is definitely a legit use case. In fact, it is vital for use
cases like recursive virtualization.

Also, I believe the PTE format of RISC-V is already frozen -- therefore
it is impossible now to merge the GLOBAL and USER bits, or to replace
the RSW bit with another bit.

> 
> I don't think a hypervisor is a real scenario for GLOBAL in the PTE.
> 
> Are there other scenarios for GLOBAL in the PTE?
> 
> Best Regards
>   Guo Ren
>
Christoph Hellwig April 24, 2019, 2:23 p.m. UTC | #10
On Wed, Apr 24, 2019 at 12:45:56PM +0000, Gary Guo wrote:
> The RISC-V privileged spec is explicitly designed to allow the 
> techniques described above (this is the sole purpose of MSTATUS.TVM). 
> It might not be as high performance as hardware with the H-extension, 
> but it is definitely a legit use case. In fact, it is vital for use 
> cases like recursive virtualization.
> 
> Also, I believe the PTE format of RISC-V is already frozen -- therefore 
> it is impossible now to merge the GLOBAL and USER bits, or to replace 
> the RSW bit with another bit.

Yes, I do not think we can just repurpose a bit.  Even using a currently
unused one would require some gymnastics.

That being said IFF we want to support non-coherent DMA (and I think we
do as people glue together their SOCs using shoestring and paper clips,
as already demonstrated by Andes and C-SKY in RISC-V space, and most
arm, mips and ppc SOCs) we need something like this flag.  The current
RISC-V method that only allows M-mode to set up such attributes on
a small number of PMP regions just doesn't work well with the way
Linux and most non-trivial OSes implement DMA memory allocations.

Note that I said "well" - in theory we can have a firmware-provided
uncached pool - that is what Linux does on most nommu (that is, without
pagetables) ports, but the fixed-size pool really does suck and will
make users very unhappy.
Arnd Bergmann April 25, 2019, 9:50 a.m. UTC | #11
On Wed, Apr 24, 2019 at 4:23 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Wed, Apr 24, 2019 at 12:45:56PM +0000, Gary Guo wrote:
> > The RISC-V privileged spec is explicitly designed to allow the
> > techniques described above (this is the sole purpose of MSTATUS.TVM).
> > It might not be as high performance as hardware with the H-extension,
> > but it is definitely a legit use case. In fact, it is vital for use
> > cases like recursive virtualization.
> >
> > Also, I believe the PTE format of RISC-V is already frozen -- therefore
> > it is impossible now to merge the GLOBAL and USER bits, or to replace
> > the RSW bit with another bit.
>
> Yes, I do not think we can just repurpose a bit.  Even using a currently
> unused one would require some gymnastics.
>
> That being said IFF we want to support non-coherent DMA (and I think we
> do as people glue together their SOCs using shoestring and paper clips,
> as already demonstrated by Andes and C-SKY in RISC-V space, and most
> arm, mips and ppc SOCs) we need something like this flag.  The current
> RISC-V method that only allows M-mode to set up such attributes on
> a small number of PMP regions just doesn't work well with the way
> Linux and most non-trivial OSes implement DMA memory allocations.
>
> Note that I said "well" - in theory we can have a firmware-provided
> uncached pool - that is what Linux does on most nommu (that is, without
> pagetables) ports, but the fixed-size pool really does suck and will
> make users very unhappy.

You could probably get away with allowing uncached mappings only
for huge pages, and using one or two of the bits in the PMD for it.
This should cover most use cases, since in practice coherent allocations
tend to be either small and rare (device descriptors) or very big
(frame buffer etc), and both cases can be handled with hugepages
and gen_pool_alloc, possibly CMA added in since there will likely
not be an IOMMU either on the systems that lack cache coherent DMA.

One downside is that you need a little more care for drivers that
use dma_mmap_coherent() to expose coherent buffers to user space.

Two other points about the proposal:
- Aside from completely uncached/unbuffered mappings, you typically
  want uncached/buffered mappings to cover dma_alloc_wc() that is
  typically used for frame buffers etc that need write-combining to get
  acceptable performance
- you need to decide what is supposed to happen when there are
  multiple conflicting mappings for the same physical address.

        Arnd
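
To make the hugepage-pool idea above concrete, here is a minimal
sketch, assuming uncached_base was already mapped uncacheable at boot
(e.g. through a hugepage PMD attribute); the function names are
illustrative only:

#include <linux/genalloc.h>

static struct gen_pool *uncached_pool;

/* Carve a boot-time uncached region into an allocator pool. */
static int __init uncached_pool_init(void *uncached_base,
				     phys_addr_t phys, size_t size)
{
	uncached_pool = gen_pool_create(PAGE_SHIFT, -1);
	if (!uncached_pool)
		return -ENOMEM;
	return gen_pool_add_virt(uncached_pool,
				 (unsigned long)uncached_base,
				 phys, size, -1);
}

/* Hand out small coherent buffers from the pool. */
static void *uncached_alloc(size_t size, dma_addr_t *dma_handle)
{
	unsigned long vaddr = gen_pool_alloc(uncached_pool, size);

	if (!vaddr)
		return NULL;
	*dma_handle = gen_pool_virt_to_phys(uncached_pool, vaddr);
	return (void *)vaddr;
}
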
Guo Ren April 26, 2019, 4:05 p.m. UTC | #12
Hi Arnd,

On Thu, Apr 25, 2019 at 11:50:11AM +0200, Arnd Bergmann wrote:
> On Wed, Apr 24, 2019 at 4:23 PM Christoph Hellwig <hch@lst.de> wrote:
> >
> > On Wed, Apr 24, 2019 at 12:45:56PM +0000, Gary Guo wrote:
> > > The RISC-V privileged spec is explicitly designed to allow the
> > > techniques described above (this is the sole purpose of MSTATUS.TVM).
> > > It might not be as high performance as hardware with the H-extension,
> > > but it is definitely a legit use case. In fact, it is vital for use
> > > cases like recursive virtualization.
> > >
> > > Also, I believe the PTE format of RISC-V is already frozen -- therefore
> > > it is impossible now to merge the GLOBAL and USER bits, or to replace
> > > the RSW bit with another bit.
> >
> > Yes, I do not think we can just repurpose a bit.  Even using a currently
> > unused one would require some gymnastics.
> >
> > That being said IFF we want to support non-coherent DMA (and I think we
> > do as people glue together their SOCs using shoestring and paper clips,
> > as already demonstrated by Andes and C-SKY in RISC-V space, and most
> > arm, mips and ppc SOCs) we need something like this flag.  The current
> > RISC-V method that only allows M-mode to set up such attributes on
> > a small number of PMP regions just doesn't work well with the way
> > Linux and most non-trivial OSes implement DMA memory allocations.
> >
> > Note that I said "well" - in theory we can have a firmware-provided
> > uncached pool - that is what Linux does on most nommu (that is, without
> > pagetables) ports, but the fixed-size pool really does suck and will
> > make users very unhappy.
> 
> You could probably get away with allowing uncached mappings only
> for huge pages, and using one or two of the bits the PMD for it.
> This should cover most use cases, since in practice coherent allocations
> tend to be either small and rare (device descriptors) or very big
> (frame buffer etc), and both cases can be handled with hugepages
> and gen_pool_alloc, possibly CMA added in since there will likely
> not be an IOMMU either on the systems that lack cache coherent DMA.
Generally, the attributes in a huge TLB entry and a leaf TLB entry
should be the same. Putting the _PAGE_CACHE and _PAGE_BUF bits only in
the huge TLB entry sounds a bit strange.

The gen_pool is only 256KB by default, but a huge TLB entry is 4MB.
Hardware couldn't set up a virtual-4MB to phys-256KB range mapping in
the TLB.

> 
> One downside is that you need a little more care for drivers that
> use dma_mmap_coherent() to expose coherent buffers to user space.
> 
> Two other points about the proposal:
> - Aside from completely uncached/unbuffered mappings, you typically
>   want uncached/buffered mappings to cover dma_alloc_wc() that is
>   typically used for frame buffers etc that need write-combining to get
>   acceptable performance
I agree dma_alloc_wc is necessary, and we need to add one more
attribute bit in the PTE: _PAGE_BUF.
Perhaps using _PAGE_BUF + _PAGE_CACHE is better than _PAGE_COHERENCY.
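
For instance, with the two proposed bits (polarity assumed here:
set = cacheable/bufferable; this whole snippet is a sketch of the
proposal, not existing code), the standard helpers could look like:

#define pgprot_noncached(prot)					\
	__pgprot(pgprot_val(prot) & ~(_PAGE_CACHE | _PAGE_BUF))

/* Write-combining: uncached but still bufferable. */
#define pgprot_writecombine(prot)				\
	__pgprot((pgprot_val(prot) & ~_PAGE_CACHE) | _PAGE_BUF)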

> - you need to decide what is supposed to happen when there are
>   multiple conflicting mappings for the same physical address.
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
What are the multiple conflicting mappings?

Best Regards
 Guo Ren
Arnd Bergmann April 26, 2019, 6:42 p.m. UTC | #13
On Fri, Apr 26, 2019 at 6:06 PM Guo Ren <guoren@kernel.org> wrote:
> On Thu, Apr 25, 2019 at 11:50:11AM +0200, Arnd Bergmann wrote:
> > On Wed, Apr 24, 2019 at 4:23 PM Christoph Hellwig <hch@lst.de> wrote:
> >
> > You could probably get away with allowing uncached mappings only
> > for huge pages, and using one or two of the bits in the PMD for it.
> > This should cover most use cases, since in practice coherent allocations
> > tend to be either small and rare (device descriptors) or very big
> > (frame buffer etc), and both cases can be handled with hugepages
> > and gen_pool_alloc, possibly CMA added in since there will likely
> > not be an IOMMU either on the systems that lack cache coherent DMA.
>
> Generally, the attributes in a huge TLB entry and a leaf TLB entry
> should be the same. Putting the _PAGE_CACHE and _PAGE_BUF bits only in
> the huge TLB entry sounds a bit strange.

Well, the point is that we can't really change the meaning of the existing
low bits, but because of the alignment constraints on hugepages, the extra bits
are currently unused for hugepage TLBs.
There are other architectures that reuse the bits in clever ways, e.g.
allowing larger physical address ranges to be used with hugepages than
normal pages.

> The gen_pool is only 256KB by default, but a huge TLB entry is 4MB.
> Hardware couldn't set up a virtual-4MB to phys-256KB range mapping in
> the TLB.

I expect the size would be easily changed, as long as there is sufficient
physical memory. If the entire system has 32MB or less, setting 2MB aside
would have a fairly significant impact of course.


> > - you need to decide what is supposed to happen when there are
> >   multiple conflicting mappings for the same physical address.
>     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> What are the multiple conflicting mappings?

I mean when you have the linear mapping as cacheable and another
mapping for the same physical page as uncacheable, and then access the
page through both virtual addresses. This is usually a bad idea, but
architectures
go to different lengths to prevent it.

The safest way would be for the CPU to produce a checkstop as soon
as there are TLB entries for the same physical address but different
caching settings. You can also do that if you have a cache-bypassing
load/store that hits a live cache line.

The other extreme would be to not do anything special and try to come
up with sane behavior, e.g. allow accesses in both ways but ensure that
a cache-bypassing load/store always flushes and invalidates cache
lines for the same physical address before its access.

       Arnd
Bill Huffman April 26, 2019, 7:05 p.m. UTC | #14
On 4/26/19 11:42 AM, Arnd Bergmann wrote:
> On Fri, Apr 26, 2019 at 6:06 PM Guo Ren <guoren@kernel.org> wrote:
>> On Thu, Apr 25, 2019 at 11:50:11AM +0200, Arnd Bergmann wrote:
>>> On Wed, Apr 24, 2019 at 4:23 PM Christoph Hellwig <hch@lst.de> wrote:
>>>
>>> You could probably get away with allowing uncached mappings only
>>> for huge pages, and using one or two of the bits in the PMD for it.
>>> This should cover most use cases, since in practice coherent allocations
>>> tend to be either small and rare (device descriptors) or very big
>>> (frame buffer etc), and both cases can be handled with hugepages
>>> and gen_pool_alloc, possibly CMA added in since there will likely
>>> not be an IOMMU either on the systems that lack cache coherent DMA.
>>
>> Generally, the attributes in a huge TLB entry and a leaf TLB entry
>> should be the same. Putting the _PAGE_CACHE and _PAGE_BUF bits only in
>> the huge TLB entry sounds a bit strange.
> 
> Well, the point is that we can't really change the meaning of the existing
> low bits, but because of the alignment constraints on hugepages, the extra bits
> are currently unused for hugepage TLBs.
> There are other architectures that reuse the bits in clever ways, e.g.
> allowing larger physical address ranges to be used with hugepages than
> normal pages.
> 
>> The gen_pool is only 256KB by default, but a huge TLB entry is 4MB.
>> Hardware couldn't set up a virtual-4MB to phys-256KB range mapping in
>> the TLB.
> 
> I expect the size would be easily changed, as long as there is sufficient
> physical memory. If the entire system has 32MB or less, setting 2MB aside
> would have a fairly significant impact of course.
> 
> 
>>> - you need to decide what is supposed to happen when there are
>>>    multiple conflicting mappings for the same physical address.
>>      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> What are the multiple conflicting mappings?
> 
> I mean when you have the linear mapping as cacheable and another
> mapping for the same physical page as uncacheable, and then access the
> page through both virtual addresses. This is usually a bad idea, but
> architectures
> go to different lengths to prevent it.
> 
> The safest way would be for the CPU to produce a checkstop as soon
> as there are TLB entries for the same physical address but different
> caching settings. You can also do that if you have a cache-bypassing
> load/store that hits a live cache line.

The second one is probably do-able in most systems.  But the first one 
usually isn't.  The TLB usually can't be looked up by physical address.

       Bill

> 
> The other extreme would be to not do anything special and try to come
> up with sane behavior, e.g. allow accesses in both ways but ensure that
> a cache-bypassing load/store always flushes and invalidates cache
> lines for the same physical address before its access.
> 
>         Arnd
> 
Palmer Dabbelt April 29, 2019, 8:11 p.m. UTC | #15
On Mon, 22 Apr 2019 08:44:30 PDT (-0700), guoren@kernel.org wrote:
> From: Guo Ren <ren_guo@c-sky.com>
>
> The current riscv Linux implementation requires the SOC to keep memory
> coherent between all I/O devices and CPUs. But some SOCs cannot maintain
> that coherence, and they need cache clean/invalidate operations to
> synchronize data.
>
> The current implementation works fine on the SiFive FU540, because the
> FU540 keeps all I/O devices and DMA master devices coherent with the
> CPUs. But a traditional SOC vendor may already have a stable
> non-coherent SOC system; the need is simply to replace the CPU with a
> RISC-V CPU, and rebuilding the whole system for I/O coherency is very
> expensive.
>
> So riscv Linux should also support a non-coherent memory model. Here
> are the two points where riscv Linux needs to be modified:
>
>  - Add a _PAGE_COHERENCY bit to the page table entry attributes. The bit
>    designates coherence for this page mapping. Software sets the bit to
>    tell the hardware that the page's memory region must be kept coherent
>    with the I/O devices in the SOC via the PMA settings.
>    If I/O devices and CPUs are already coherent in the SOC, the CPU just
>    ignores this bit.
>
>    PTE format:
>    | XLEN-1  10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0
>          PFN      C  RSW  D   A   G   U   X   W   R   V
>                   ^
>    BIT(9): Coherence attribute bit
>           0: hardware need not keep the page coherent; software will
>              maintain coherence with cache clean/invalidate operations.
>           1: hardware must keep the page coherent; software need not
>              maintain coherence.
>    BIT(8): Reserved for software; currently _PAGE_SPECIAL in Linux
>
>    Adding a new hardware bit to the PTE also requires a modification
>    to the Privileged Architecture Supervisor-Level ISA:
>    https://github.com/riscv/riscv-isa-manual/pull/374

This is a RISC-V ISA modification, which isn't really appropriate to suggest on
the kernel mailing lists.  The right place to talk about this is at the RISC-V
foundation, which owns the ISA -- we can't change the hardware with a patch to
Linux :).

>  - Add SBI_FENCE_DMA (9) to the RISC-V SBI.
>    sbi_fence_dma(start, size, dir) synchronizes the CPU caches with a
>    DMA device in the non-coherent memory model. The third parameter's
>    definition is the same as Linux's in include/linux/dma-direction.h:
>
>    enum dma_data_direction {
> 	DMA_BIDIRECTIONAL = 0,
> 	DMA_TO_DEVICE = 1,
> 	DMA_FROM_DEVICE = 2,
> 	DMA_NONE = 3,
>    };
>
>    The first parameter, start, must be a physical address that can be
>    handled in M-mode.
>
>    Here is a pull request to the riscv-sbi-doc:
>    https://github.com/riscv/riscv-sbi-doc/pull/15
>
> We have tested the patch on our FPGA SOC system, whose network
> controller is connected to a non-cache-coherent interconnect, and it
> couldn't work without the patch.
>
> There is no side effect for the FU540, whose CPUs don't care about
> _PAGE_COHERENCY in the PTE, but the FU540's bbl also needs to implement
> a trivial sbi_fence_dma that simply returns. In fact, if you give a
> correct configuration for dev_is_dma_coherent(), the Linux DMA
> framework wouldn't call sbi_fence_dma at all.

Non-coherent fences also need to be discussed as part of a RISC-V ISA
extension.  I know people have expressed interest, but I don't know of a
working group that's already been set up.

>
> Changelog:
>  - Use 'coherency' instead of 'consistency' throughout to keep the
>    terminology consistent. (Xiang Xiaoyan)
>  - Add riscv-isa-manual modification pull request link.
>  - Correct grammatical errors.
>
> Signed-off-by: Guo Ren <ren_guo@c-sky.com>
> Cc: Andrew Waterman <andrew@sifive.com>
> Cc: Anup Patel <anup.patel@wdc.com>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Greentime Hu <green.hu@gmail.com>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Cc: Palmer Dabbelt <palmer@sifive.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Scott Wood <swood@redhat.com>
> Cc: Vincent Chen <vincentc@andestech.com>
> Cc: Xiang Xiaoyan <xiaoyan_xiang@c-sky.com>
> ---
>  arch/riscv/Kconfig                    |  4 ++++
>  arch/riscv/include/asm/pgtable-bits.h |  1 +
>  arch/riscv/include/asm/pgtable.h      | 11 +++++++++
>  arch/riscv/include/asm/sbi.h          | 10 ++++++++
>  arch/riscv/mm/Makefile                |  1 +
>  arch/riscv/mm/dma-mapping.c           | 44 +++++++++++++++++++++++++++++++++++
>  arch/riscv/mm/ioremap.c               |  2 +-
>  7 files changed, 72 insertions(+), 1 deletion(-)
>  create mode 100644 arch/riscv/mm/dma-mapping.c
>
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index eb56c82..f0fc503 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -16,9 +16,12 @@ config RISCV
>  	select OF
>  	select OF_EARLY_FLATTREE
>  	select OF_IRQ
> +	select ARCH_HAS_SYNC_DMA_FOR_CPU
> +	select ARCH_HAS_SYNC_DMA_FOR_DEVICE
>  	select ARCH_WANT_FRAME_POINTERS
>  	select CLONE_BACKWARDS
>  	select COMMON_CLK
> +	select DMA_DIRECT_REMAP
>  	select GENERIC_CLOCKEVENTS
>  	select GENERIC_CPU_DEVICES
>  	select GENERIC_IRQ_SHOW
> @@ -27,6 +30,7 @@ config RISCV
>  	select GENERIC_STRNCPY_FROM_USER
>  	select GENERIC_STRNLEN_USER
>  	select GENERIC_SMP_IDLE_THREAD
> +	select GENERIC_ALLOCATOR
>  	select GENERIC_ATOMIC64 if !64BIT || !RISCV_ISA_A
>  	select HAVE_ARCH_AUDITSYSCALL
>  	select HAVE_MEMBLOCK_NODE_MAP
> diff --git a/arch/riscv/include/asm/pgtable-bits.h b/arch/riscv/include/asm/pgtable-bits.h
> index 470755c..104f8c0 100644
> --- a/arch/riscv/include/asm/pgtable-bits.h
> +++ b/arch/riscv/include/asm/pgtable-bits.h
> @@ -31,6 +31,7 @@
>  #define _PAGE_ACCESSED  (1 << 6)    /* Set by hardware on any access */
>  #define _PAGE_DIRTY     (1 << 7)    /* Set by hardware on any write */
>  #define _PAGE_SOFT      (1 << 8)    /* Reserved for software */
> +#define _PAGE_COHERENCY (1 << 9)    /* Coherency */
>
>  #define _PAGE_SPECIAL   _PAGE_SOFT
>  #define _PAGE_TABLE     _PAGE_PRESENT
> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> index 1141364..26debb4 100644
> --- a/arch/riscv/include/asm/pgtable.h
> +++ b/arch/riscv/include/asm/pgtable.h
> @@ -66,6 +66,7 @@
>
>  #define PAGE_KERNEL		__pgprot(_PAGE_KERNEL)
>  #define PAGE_KERNEL_EXEC	__pgprot(_PAGE_KERNEL | _PAGE_EXEC)
> +#define PAGE_KERNEL_COHERENCY	__pgprot(_PAGE_KERNEL | _PAGE_COHERENCY)
>
>  extern pgd_t swapper_pg_dir[];
>
> @@ -375,6 +376,16 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>  	return ptep_test_and_clear_young(vma, address, ptep);
>  }
>
> +#define pgprot_noncached pgprot_noncached
> +static inline pgprot_t pgprot_noncached(pgprot_t _prot)
> +{
> +	unsigned long prot = pgprot_val(_prot);
> +
> +	prot |= _PAGE_COHERENCY;
> +
> +	return __pgprot(prot);
> +}
> +
>  /*
>   * Encode and decode a swap entry
>   *
> diff --git a/arch/riscv/include/asm/sbi.h b/arch/riscv/include/asm/sbi.h
> index b6bb10b..b945e50 100644
> --- a/arch/riscv/include/asm/sbi.h
> +++ b/arch/riscv/include/asm/sbi.h
> @@ -25,6 +25,7 @@
>  #define SBI_REMOTE_SFENCE_VMA 6
>  #define SBI_REMOTE_SFENCE_VMA_ASID 7
>  #define SBI_SHUTDOWN 8
> +#define SBI_FENCE_DMA 9
>
>  #define SBI_CALL(which, arg0, arg1, arg2) ({			\
>  	register uintptr_t a0 asm ("a0") = (uintptr_t)(arg0);	\
> @@ -42,6 +43,8 @@
>  #define SBI_CALL_0(which) SBI_CALL(which, 0, 0, 0)
>  #define SBI_CALL_1(which, arg0) SBI_CALL(which, arg0, 0, 0)
>  #define SBI_CALL_2(which, arg0, arg1) SBI_CALL(which, arg0, arg1, 0)
> +#define SBI_CALL_3(which, arg0, arg1, arg2) \
> +			SBI_CALL(which, arg0, arg1, arg2)
>
>  static inline void sbi_console_putchar(int ch)
>  {
> @@ -82,6 +85,13 @@ static inline void sbi_remote_fence_i(const unsigned long *hart_mask)
>  	SBI_CALL_1(SBI_REMOTE_FENCE_I, hart_mask);
>  }
>
> +static inline void sbi_fence_dma(unsigned long start,
> +			       unsigned long size,
> +			       unsigned long dir)
> +{
> +	SBI_CALL_3(SBI_FENCE_DMA, start, size, dir);
> +}
> +
>  static inline void sbi_remote_sfence_vma(const unsigned long *hart_mask,
>  					 unsigned long start,
>  					 unsigned long size)
> diff --git a/arch/riscv/mm/Makefile b/arch/riscv/mm/Makefile
> index b68aac7..adc563a 100644
> --- a/arch/riscv/mm/Makefile
> +++ b/arch/riscv/mm/Makefile
> @@ -9,3 +9,4 @@ obj-y += fault.o
>  obj-y += extable.o
>  obj-y += ioremap.o
>  obj-y += cacheflush.o
> +obj-y += dma-mapping.o
> diff --git a/arch/riscv/mm/dma-mapping.c b/arch/riscv/mm/dma-mapping.c
> new file mode 100644
> index 0000000..5e1d179
> --- /dev/null
> +++ b/arch/riscv/mm/dma-mapping.c
> @@ -0,0 +1,44 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/dma-mapping.h>
> +
> +static int __init atomic_pool_init(void)
> +{
> +	return dma_atomic_pool_init(GFP_KERNEL, pgprot_noncached(PAGE_KERNEL));
> +}
> +postcore_initcall(atomic_pool_init);
> +
> +void arch_dma_prep_coherent(struct page *page, size_t size)
> +{
> +	memset(page_address(page), 0, size);
> +
> +	sbi_fence_dma(page_to_phys(page), size, DMA_BIDIRECTIONAL);
> +}
> +
> +void arch_sync_dma_for_device(struct device *dev, phys_addr_t paddr,
> +			      size_t size, enum dma_data_direction dir)
> +{
> +	switch (dir) {
> +	case DMA_TO_DEVICE:
> +	case DMA_FROM_DEVICE:
> +	case DMA_BIDIRECTIONAL:
> +		sbi_fence_dma(paddr, size, dir);
> +		break;
> +	default:
> +		BUG();
> +	}
> +}
> +
> +void arch_sync_dma_for_cpu(struct device *dev, phys_addr_t paddr,
> +			   size_t size, enum dma_data_direction dir)
> +{
> +	switch (dir) {
> +	case DMA_TO_DEVICE:
> +	case DMA_FROM_DEVICE:
> +	case DMA_BIDIRECTIONAL:
> +		sbi_fence_dma(paddr, size, dir);
> +		break;
> +	default:
> +		BUG();
> +	}
> +}
> diff --git a/arch/riscv/mm/ioremap.c b/arch/riscv/mm/ioremap.c
> index bd2f2db..f6aaf1e 100644
> --- a/arch/riscv/mm/ioremap.c
> +++ b/arch/riscv/mm/ioremap.c
> @@ -73,7 +73,7 @@ static void __iomem *__ioremap_caller(phys_addr_t addr, size_t size,
>   */
>  void __iomem *ioremap(phys_addr_t offset, unsigned long size)
>  {
> -	return __ioremap_caller(offset, size, PAGE_KERNEL,
> +	return __ioremap_caller(offset, size, PAGE_KERNEL_COHERENCY,
>  		__builtin_return_address(0));
>  }
>  EXPORT_SYMBOL(ioremap);
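
A side note on the SBI plumbing above: in the kernel's legacy SBI calling
convention, SBI_CALL expands to an ecall with the call number in a7 and the
arguments in a0-a2, so sbi_fence_dma() traps into M-mode roughly like this
(a sketch of the convention, not the exact macro expansion):

	static inline uintptr_t sbi_ecall3(uintptr_t which, uintptr_t arg0,
					   uintptr_t arg1, uintptr_t arg2)
	{
		register uintptr_t a0 asm ("a0") = arg0;
		register uintptr_t a1 asm ("a1") = arg1;
		register uintptr_t a2 asm ("a2") = arg2;
		register uintptr_t a7 asm ("a7") = which;

		/* Trap into the M-mode SBI implementation (e.g. bbl). */
		asm volatile ("ecall"
			      : "+r" (a0)
			      : "r" (a1), "r" (a2), "r" (a7)
			      : "memory");
		return a0;	/* legacy calls return status in a0 */
	}
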
Guo Ren April 30, 2019, 3:29 a.m. UTC | #16
On Mon, Apr 29, 2019 at 01:11:43PM -0700, Palmer Dabbelt wrote:
> On Mon, 22 Apr 2019 08:44:30 PDT (-0700), guoren@kernel.org wrote:
> >From: Guo Ren <ren_guo@c-sky.com>
> >
> >The current riscv Linux implementation requires the SoC to maintain
> >memory coherence between all I/O devices and CPUs. But some SoCs
> >cannot maintain that coherence and need cache clean/invalidate
> >operations to synchronize data.
> >
> >The current implementation is fine on the SiFive FU540, because the
> >FU540 keeps all I/O and DMA master devices coherent with the CPU. But a
> >traditional SoC vendor may already have a stable non-coherent SoC
> >system; it simply wants to replace the CPU with a RISC-V CPU, and
> >rebuilding the whole system for I/O coherency is very expensive.
> >
> >So riscv Linux should also support a non-coherent memory model.
> >Here are the two points where riscv Linux needs to be modified:
> >
> > - Add a _PAGE_COHERENCY bit to the current page table entry attributes.
> >   The bit designates coherence for the page mapping. Software sets the
> >   bit to tell the hardware that the page's memory region must be kept
> >   coherent with the SoC's I/O devices through the PMA settings.
> >   If the I/O devices and CPU are already coherent in the SoC, the CPU
> >   simply ignores this bit.
> >
> >   PTE format:
> >   | XLEN-1  10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0
> >         PFN      C  RSW  D   A   G   U   X   W   R   V
> >                  ^
> >   BIT(9): Coherence attribute bit
> >          0: hardware needn't keep the page coherent; software will
> >             maintain coherence with cache clean/invalidate operations.
> >          1: hardware must keep the page coherent; software needn't
> >             maintain coherence.
> >   BIT(8): Reserved for software; currently _PAGE_SPECIAL in Linux
> >
> >   Adding a new hardware bit to the PTE also requires modifying the
> >   Privileged Architecture Supervisor-Level ISA:
> >   https://github.com/riscv/riscv-isa-manual/pull/374
> 
> This is a RISC-V ISA modification, which isn't really appropriate to suggest on
> the kernel mailing lists.  The right place to talk about this is at the RISC-V
> foundation, which owns the ISA -- we can't change the hardware with a patch to
> Linux :).
I just want a discussion, and a wide discussion is good for all of us :)

> 
> > - Add SBI_FENCE_DMA 9 to riscv-sbi.
> >   sbi_fence_dma(start, size, dir) synchronizes CPU cache data with a
> >   DMA device in the non-coherent memory model. The third parameter's
> >   definition is the same as Linux's in include/linux/dma-direction.h:
> >
> >   enum dma_data_direction {
> >	DMA_BIDIRECTIONAL = 0,
> >	DMA_TO_DEVICE = 1,
> >	DMA_FROM_DEVICE = 2,
> >	DMA_NONE = 3,
> >   };
> >
> >   The first parameter, start, must be a physical address that can be
> >   handled in M-mode.
> >
> >   Here is a pull request to the riscv-sbi-doc:
> >   https://github.com/riscv/riscv-sbi-doc/pull/15
> >
> >We have tested the patch on our FPGA SoC system, whose network
> >controller is connected to a non-cache-coherent interconnect; the
> >controller couldn't work without the patch.
> >
> >There is no side effect for the FU540, whose CPU ignores _PAGE_COHERENCY
> >in the PTE, but the FU540's bbl also needs to implement a trivial
> >sbi_fence_dma that simply returns. In fact, with a correct configuration
> >for dev_is_dma_coherent(), the Linux DMA framework won't call
> >sbi_fence_dma at all.
> 
> Non-coherent fences also need to be discussed as part of a RISC-V ISA
  ^^^^^^^^^^^^ ^^^^^^
  Do you mean fence instructions, not page attributes?
> extension.  
> I know people have expressed interest, but I don't know of a
> working group that's already been set up.
Does that mean the current RISC-V ISA forces the SoC to use a coherent memory model?

Best Regards
 Guo Ren
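
To make the SBI side of the discussion concrete, here is a minimal sketch
of what an M-mode SBI_FENCE_DMA handler might look like. All helper names
(mcall_fence_dma, cache_clean_range, cache_invalidate_range,
cache_flush_range) are purely illustrative and not taken from any real bbl:

	static void mcall_fence_dma(uintptr_t start, uintptr_t size,
				    uintptr_t dir)
	{
		/*
		 * On a fully coherent SoC such as the FU540 this handler
		 * can simply return, as the cover letter notes.
		 */
		switch (dir) {
		case 1:	/* DMA_TO_DEVICE: clean dirty lines to memory */
			cache_clean_range(start, size);
			break;
		case 2:	/* DMA_FROM_DEVICE: invalidate stale lines */
			cache_invalidate_range(start, size);
			break;
		default: /* DMA_BIDIRECTIONAL: clean + invalidate */
			cache_flush_range(start, size);
			break;
		}
	}
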
diff mbox series

Patch

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index eb56c82..f0fc503 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -16,9 +16,12 @@  config RISCV
 	select OF
 	select OF_EARLY_FLATTREE
 	select OF_IRQ
+	select ARCH_HAS_SYNC_DMA_FOR_CPU
+	select ARCH_HAS_SYNC_DMA_FOR_DEVICE
 	select ARCH_WANT_FRAME_POINTERS
 	select CLONE_BACKWARDS
 	select COMMON_CLK
+	select DMA_DIRECT_REMAP
 	select GENERIC_CLOCKEVENTS
 	select GENERIC_CPU_DEVICES
 	select GENERIC_IRQ_SHOW
@@ -27,6 +30,7 @@  config RISCV
 	select GENERIC_STRNCPY_FROM_USER
 	select GENERIC_STRNLEN_USER
 	select GENERIC_SMP_IDLE_THREAD
+	select GENERIC_ALLOCATOR
 	select GENERIC_ATOMIC64 if !64BIT || !RISCV_ISA_A
 	select HAVE_ARCH_AUDITSYSCALL
 	select HAVE_MEMBLOCK_NODE_MAP
diff --git a/arch/riscv/include/asm/pgtable-bits.h b/arch/riscv/include/asm/pgtable-bits.h
index 470755c..104f8c0 100644
--- a/arch/riscv/include/asm/pgtable-bits.h
+++ b/arch/riscv/include/asm/pgtable-bits.h
@@ -31,6 +31,7 @@ 
 #define _PAGE_ACCESSED  (1 << 6)    /* Set by hardware on any access */
 #define _PAGE_DIRTY     (1 << 7)    /* Set by hardware on any write */
 #define _PAGE_SOFT      (1 << 8)    /* Reserved for software */
+#define _PAGE_COHERENCY (1 << 9)    /* Coherency */
 
 #define _PAGE_SPECIAL   _PAGE_SOFT
 #define _PAGE_TABLE     _PAGE_PRESENT
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 1141364..26debb4 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -66,6 +66,7 @@ 
 
 #define PAGE_KERNEL		__pgprot(_PAGE_KERNEL)
 #define PAGE_KERNEL_EXEC	__pgprot(_PAGE_KERNEL | _PAGE_EXEC)
+#define PAGE_KERNEL_COHERENCY	__pgprot(_PAGE_KERNEL | _PAGE_COHERENCY)
 
 extern pgd_t swapper_pg_dir[];
 
@@ -375,6 +376,16 @@  static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
 	return ptep_test_and_clear_young(vma, address, ptep);
 }
 
+#define pgprot_noncached pgprot_noncached
+static inline pgprot_t pgprot_noncached(pgprot_t _prot)
+{
+	unsigned long prot = pgprot_val(_prot);
+
+	prot |= _PAGE_COHERENCY;
+
+	return __pgprot(prot);
+}
+
 /*
  * Encode and decode a swap entry
  *
diff --git a/arch/riscv/include/asm/sbi.h b/arch/riscv/include/asm/sbi.h
index b6bb10b..b945e50 100644
--- a/arch/riscv/include/asm/sbi.h
+++ b/arch/riscv/include/asm/sbi.h
@@ -25,6 +25,7 @@ 
 #define SBI_REMOTE_SFENCE_VMA 6
 #define SBI_REMOTE_SFENCE_VMA_ASID 7
 #define SBI_SHUTDOWN 8
+#define SBI_FENCE_DMA 9
 
 #define SBI_CALL(which, arg0, arg1, arg2) ({			\
 	register uintptr_t a0 asm ("a0") = (uintptr_t)(arg0);	\
@@ -42,6 +43,8 @@ 
 #define SBI_CALL_0(which) SBI_CALL(which, 0, 0, 0)
 #define SBI_CALL_1(which, arg0) SBI_CALL(which, arg0, 0, 0)
 #define SBI_CALL_2(which, arg0, arg1) SBI_CALL(which, arg0, arg1, 0)
+#define SBI_CALL_3(which, arg0, arg1, arg2) \
+			SBI_CALL(which, arg0, arg1, arg2)
 
 static inline void sbi_console_putchar(int ch)
 {
@@ -82,6 +85,13 @@  static inline void sbi_remote_fence_i(const unsigned long *hart_mask)
 	SBI_CALL_1(SBI_REMOTE_FENCE_I, hart_mask);
 }
 
+static inline void sbi_fence_dma(unsigned long start,
+			       unsigned long size,
+			       unsigned long dir)
+{
+	SBI_CALL_3(SBI_FENCE_DMA, start, size, dir);
+}
+
 static inline void sbi_remote_sfence_vma(const unsigned long *hart_mask,
 					 unsigned long start,
 					 unsigned long size)
diff --git a/arch/riscv/mm/Makefile b/arch/riscv/mm/Makefile
index b68aac7..adc563a 100644
--- a/arch/riscv/mm/Makefile
+++ b/arch/riscv/mm/Makefile
@@ -9,3 +9,4 @@  obj-y += fault.o
 obj-y += extable.o
 obj-y += ioremap.o
 obj-y += cacheflush.o
+obj-y += dma-mapping.o
diff --git a/arch/riscv/mm/dma-mapping.c b/arch/riscv/mm/dma-mapping.c
new file mode 100644
index 0000000..5e1d179
--- /dev/null
+++ b/arch/riscv/mm/dma-mapping.c
@@ -0,0 +1,44 @@ 
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/dma-mapping.h>
+
+static int __init atomic_pool_init(void)
+{
+	return dma_atomic_pool_init(GFP_KERNEL, pgprot_noncached(PAGE_KERNEL));
+}
+postcore_initcall(atomic_pool_init);
+
+void arch_dma_prep_coherent(struct page *page, size_t size)
+{
+	memset(page_address(page), 0, size);
+
+	sbi_fence_dma(page_to_phys(page), size, DMA_BIDIRECTIONAL);
+}
+
+void arch_sync_dma_for_device(struct device *dev, phys_addr_t paddr,
+			      size_t size, enum dma_data_direction dir)
+{
+	switch (dir) {
+	case DMA_TO_DEVICE:
+	case DMA_FROM_DEVICE:
+	case DMA_BIDIRECTIONAL:
+		sbi_fence_dma(paddr, size, dir);
+		break;
+	default:
+		BUG();
+	}
+}
+
+void arch_sync_dma_for_cpu(struct device *dev, phys_addr_t paddr,
+			   size_t size, enum dma_data_direction dir)
+{
+	switch (dir) {
+	case DMA_TO_DEVICE:
+	case DMA_FROM_DEVICE:
+	case DMA_BIDIRECTIONAL:
+		sbi_fence_dma(paddr, size, dir);
+		break;
+	default:
+		BUG();
+	}
+}
diff --git a/arch/riscv/mm/ioremap.c b/arch/riscv/mm/ioremap.c
index bd2f2db..f6aaf1e 100644
--- a/arch/riscv/mm/ioremap.c
+++ b/arch/riscv/mm/ioremap.c
@@ -73,7 +73,7 @@  static void __iomem *__ioremap_caller(phys_addr_t addr, size_t size,
  */
 void __iomem *ioremap(phys_addr_t offset, unsigned long size)
 {
-	return __ioremap_caller(offset, size, PAGE_KERNEL,
+	return __ioremap_caller(offset, size, PAGE_KERNEL_COHERENCY,
 		__builtin_return_address(0));
 }
 EXPORT_SYMBOL(ioremap);
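
As a closing usage illustration (standard DMA API calls, not part of the
patch; dev, buf and len are placeholders), a streaming mapping on a
non-coherent device reaches the new hooks roughly like this:

	dma_addr_t handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
	if (dma_mapping_error(dev, handle))
		return -ENOMEM;
	/* map path: arch_sync_dma_for_device() -> sbi_fence_dma() */

	/* ... the device reads the buffer via DMA ... */

	dma_unmap_single(dev, handle, len, DMA_TO_DEVICE);
	/* unmap path: arch_sync_dma_for_cpu() -> sbi_fence_dma() */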