
[RFC,0/3] riscv: Add DMA_COHERENT support

Message ID: 1621400656-25678-1-git-send-email-guoren@kernel.org
Series: riscv: Add DMA_COHERENT support

Message

Guo Ren May 19, 2021, 5:04 a.m. UTC
From: Guo Ren <guoren@linux.alibaba.com>

The RISC-V ISA doesn't yet specify how to query or modify PMAs, so let
vendors define the custom properties of memory regions in PTE.

This patchset helps SoC vendors support their own custom coherent
interconnect solutions through PTE attributes.

For example, the Allwinner D1 [1] uses the T-HEAD C906 as its main processor.
The C906 MMU has two modes:
 - Compatible mode: the same as the definitions in the spec.
 - Enhanced mode: adds custom DMA_COHERENT attribute bits in the PTE which
   are not mentioned in the spec.

The Allwinner D1 needs the enhanced mode to support DMA devices behind the
non-coherent interconnect in its SoC. The C906 uses bits [63:59] as custom
attribute bits in the PTE.

Here is the config example for Allwinner D1:
CONFIG_RISCV_DMA_COHERENT=y
CONFIG_RISCV_PAGE_DMA_MASK=0xf800000000000000
CONFIG_RISCV_PAGE_CACHE=0x7000000000000000
CONFIG_RISCV_PAGE_DMA_NONCACHE=0x8000000000000000

Link: https://linux-sunxi.org/D1 [1]
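
For reference, a minimal sketch of how such Kconfig-provided masks could be
folded into the page protection used for DMA mappings (illustrative only, not
taken from the patches themselves; the riscv_pgprot_dma() helper and the
_PAGE_DMA_* macro names are made up here):

    #include <linux/pgtable.h>

    /* Assumes the example Kconfig values above. */
    #define _PAGE_DMA_MASK      CONFIG_RISCV_PAGE_DMA_MASK      /* vendor-owned PTE bits */
    #define _PAGE_DMA_NONCACHE  CONFIG_RISCV_PAGE_DMA_NONCACHE  /* "non-cacheable" encoding */

    static inline pgprot_t riscv_pgprot_dma(pgprot_t prot)
    {
            /* Clear the vendor attribute field, then mark the mapping non-cacheable. */
            return __pgprot((pgprot_val(prot) & ~_PAGE_DMA_MASK) |
                            _PAGE_DMA_NONCACHE);
    }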

Guo Ren (3):
  riscv: pgtable.h: Fixup _PAGE_CHG_MASK usage
  riscv: Add DMA_COHERENT for custom PTE attributes
  riscv: Add SYNC_DMA_FOR_CPU/DEVICE for DMA_COHERENT

 arch/riscv/Kconfig                    | 31 ++++++++++++++++++++++++++
 arch/riscv/include/asm/pgtable-64.h   |  8 ++++---
 arch/riscv/include/asm/pgtable-bits.h | 13 ++++++++++-
 arch/riscv/include/asm/pgtable.h      | 26 +++++++++++++++++-----
 arch/riscv/include/asm/sbi.h          | 16 ++++++++++++++
 arch/riscv/kernel/sbi.c               | 19 ++++++++++++++++
 arch/riscv/mm/Makefile                |  4 ++++
 arch/riscv/mm/dma-mapping.c           | 41 +++++++++++++++++++++++++++++++++++
 8 files changed, 148 insertions(+), 10 deletions(-)
 create mode 100644 arch/riscv/mm/dma-mapping.c

Comments

Christoph Hellwig May 19, 2021, 5:20 a.m. UTC | #1
On Wed, May 19, 2021 at 05:04:13AM +0000, guoren@kernel.org wrote:
> From: Guo Ren <guoren@linux.alibaba.com>
> 
> The RISC-V ISA doesn't yet specify how to query or modify PMAs, so let
> vendors define the custom properties of memory regions in PTE.

Err, hell no.   The ISA needs to get this fixed first.  Then we can
talk about alternatives patching things in or trapping in the SBI.
But if the RISC-V ISA can't get these basics done after years we can't
support it in Linux at all.
Guo Ren May 19, 2021, 5:48 a.m. UTC | #2
On Wed, May 19, 2021 at 1:20 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Wed, May 19, 2021 at 05:04:13AM +0000, guoren@kernel.org wrote:
> > From: Guo Ren <guoren@linux.alibaba.com>
> >
> > The RISC-V ISA doesn't yet specify how to query or modify PMAs, so let
> > vendors define the custom properties of memory regions in PTE.
>
> Err, hell no.   The ISA needs to gets this fixed first.  Then we can
> talk about alternatives patching things in or trapping in the SBI.
> But if the RISC-V ISA can't get these basic done after years we can't
> support it in Linux at all.

The patchset just gives vendors a configuration option. Until the
RISC-V ISA fixes this, we should give vendors the chance to solve
their real chip issues.
Christoph Hellwig May 19, 2021, 5:55 a.m. UTC | #3
On Wed, May 19, 2021 at 01:48:23PM +0800, Guo Ren wrote:
> The patchset just leaves a configuration chance for vendors. Before
> RISC-V ISA fixes it, we should give the chance to let vendor solve
> their real chip issues.

No.  The vendors need to work to get a feature standardized before
implementing it.  There is no other way to have a sane kernel build that
supports all the different SOCs.
Guo Ren May 19, 2021, 6:05 a.m. UTC | #4
On Wed, May 19, 2021 at 1:20 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Wed, May 19, 2021 at 05:04:13AM +0000, guoren@kernel.org wrote:
> > From: Guo Ren <guoren@linux.alibaba.com>
> >
> > The RISC-V ISA doesn't yet specify how to query or modify PMAs, so let
> > vendors define the custom properties of memory regions in PTE.
>
> Err, hell no.   The ISA needs to gets this fixed first.  Then we can
> talk about alternatives patching things in or trapping in the SBI.
> But if the RISC-V ISA can't get these basic done after years we can't
> support it in Linux at all.
This is the lightest solution I could imagine; it avoids conflicts
with the RISC-V ISA.

Since the existing RISC-V ISA cannot solve this problem, it is better
to provide some configuration for the SOC vendor to customize.

--
Best Regards
 Guo Ren

ML: https://lore.kernel.org/linux-csky/
Christoph Hellwig May 19, 2021, 6:06 a.m. UTC | #5
On Wed, May 19, 2021 at 02:05:00PM +0800, Guo Ren wrote:
> Since the existing RISC-V ISA cannot solve this problem, it is better
> to provide some configuration for the SOC vendor to customize.

We've been talking about this problem for close to five years.  So no,
if you don't manage to get the feature into the ISA it can't be
supported.
Guo Ren May 19, 2021, 6:09 a.m. UTC | #6
On Wed, May 19, 2021 at 1:55 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Wed, May 19, 2021 at 01:48:23PM +0800, Guo Ren wrote:
> > The patchset just leaves a configuration chance for vendors. Before
> > RISC-V ISA fixes it, we should give the chance to let vendor solve
> > their real chip issues.
>
> No.  The vendors need to work to get a feature standardized before
> implementing it.  There is other way to have a sane kernel build that
> supports all the different SOCs.

As I've said, the patchset doesn't define any features; it just leaves the
choice to vendors.

It's not in conflict with any standardized RISC-V ISA.
Guo Ren May 19, 2021, 6:11 a.m. UTC | #7
On Wed, May 19, 2021 at 2:06 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Wed, May 19, 2021 at 02:05:00PM +0800, Guo Ren wrote:
> > Since the existing RISC-V ISA cannot solve this problem, it is better
> > to provide some configuration for the SOC vendor to customize.
>
> We've been talking about this problem for close to five years.  So no,
> if you don't manage to get the feature into the ISA it can't be
> supported.

Is arch/riscv/errata/ also defined in the RISC-V ISA?
Drew Fustini May 19, 2021, 6:44 a.m. UTC | #8
On Wed, May 19, 2021 at 01:48:23PM +0800, Guo Ren wrote:
> On Wed, May 19, 2021 at 1:20 PM Christoph Hellwig <hch@lst.de> wrote:
> >
> > On Wed, May 19, 2021 at 05:04:13AM +0000, guoren@kernel.org wrote:
> > > From: Guo Ren <guoren@linux.alibaba.com>
> > >
> > > The RISC-V ISA doesn't yet specify how to query or modify PMAs, so let
> > > vendors define the custom properties of memory regions in PTE.
> >
> > Err, hell no.   The ISA needs to gets this fixed first.  Then we can
> > talk about alternatives patching things in or trapping in the SBI.
> > But if the RISC-V ISA can't get these basic done after years we can't
> > support it in Linux at all.
> 
> The patchset just leaves a configuration chance for vendors. Before
> RISC-V ISA fixes it, we should give the chance to let vendor solve
> their real chip issues.

This patch series looks like it might be useful for the StarFive JH7100
[1] [2] too, as it has peripherals on a non-coherent interconnect. GMAC,
USB and SDIO require that the L2 cache be manually flushed after
DMA operations if the data is intended to be shared with the U74 cores [2].

There is the RISC-V Cache Management Operation, or CMO, task group [3],
but I am not sure whether that can help SoCs that have already been
fabbed, like the D1 and the JH7100.

thanks,
drew

[1] https://github.com/starfive-tech/beaglev_doc/blob/main/JH7100%20Data%20Sheet%20V01.01.04-EN%20(4-21-2021).pdf
[2] https://github.com/starfive-tech/beaglev_doc/blob/main/JH7100%20Cache%20Coherence%20V1.0.pdf
[3] https://github.com/riscv/riscv-CMOs
Christoph Hellwig May 19, 2021, 6:53 a.m. UTC | #9
On Tue, May 18, 2021 at 11:44:35PM -0700, Drew Fustini wrote:
> This patch series looks like it might be useful for the StarFive JH7100
> [1] [2] too as it has peripherals on a non-coherent interconnect. GMAC,
> USB and SDIO require that the L2 cache must be manually flushed after
> DMA operations if the data is intended to be shared with U74 cores [2].

Not too much, given that the SiFive-lineage CPUs have an uncached
window, which is a totally different way to allocate uncached memory.

> There is the RISC-V Cache Management Operation, or CMO, task group [3]
> but I am not sure if that can help the SoC's that have already been
> fabbed like the the D1 and the JH7100.

It does, because unimplemented instructions trap into M-mode, where they
can be emulated.

Or to summarize things: non-coherent DMA (not "coherent" as in the title of
this series) requires two things:

 1) allocating chunks of memory that are marked as not cacheable
 2) instructions to invalidate and/or write back cache lines

neither of which currently exists in RISC-V.  Hacking vendor-specific
cruft into the kernel doesn't scale, as shown perfectly by this
series, which requires hard-coding vendor-specific, non-standardized
extensions into a kernel, making it specific to that implementation.

What we need to do is to standardize a way to do this properly, and then
after that figure out a way to quirk in non-compliant implementations
in a way that does not harm the general kernel.
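
For context, those two pieces correspond to hooks that the generic kernel DMA
code already expects an architecture to provide. Below is a minimal sketch,
assuming the generic interface used by kernel/dma/direct.c (declared in
<linux/dma-map-ops.h> on kernels of this era); the bodies are left as
placeholders, since filling them in is exactly what is being debated here:

    #include <linux/dma-map-ops.h>

    /*
     * 2) Invalidate and/or write back cache lines around a DMA transfer
     *    (requires selecting ARCH_HAS_SYNC_DMA_FOR_DEVICE / _CPU).
     */
    void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
                                  enum dma_data_direction dir)
    {
            /* vendor cache clean/flush instructions, or an SBI call */
    }

    void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
                               enum dma_data_direction dir)
    {
            /* vendor cache invalidate instructions, or an SBI call */
    }

    /*
     * 1) Hand back a non-cacheable alias of a coherent allocation
     *    (requires selecting ARCH_HAS_DMA_SET_UNCACHED).
     */
    void *arch_dma_set_uncached(void *addr, size_t size)
    {
            /* placeholder: a real implementation returns an uncached alias,
             * e.g. via a fixed uncached window or a remap with non-cacheable
             * PTE attributes */
            return addr;
    }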
Drew Fustini May 19, 2021, 6:54 a.m. UTC | #10
On Wed, May 19, 2021 at 08:06:17AM +0200, Christoph Hellwig wrote:
> On Wed, May 19, 2021 at 02:05:00PM +0800, Guo Ren wrote:
> > Since the existing RISC-V ISA cannot solve this problem, it is better
> > to provide some configuration for the SOC vendor to customize.
> 
> We've been talking about this problem for close to five years.  So no,
> if you don't manage to get the feature into the ISA it can't be
> supported.

Isn't it a good goal for Linux to support the capabilities present in
the SoCs that are currently being fab'd?

I believe the CMO group only started last year [1], so the RV64GC SoCs
that are going into mass production this year would not have had the
opportunity of utilizing any RISC-V ISA extension for handling cache
management.

Thanks,
Drew

[1] https://github.com/riscv/riscv-CMOs
Christoph Hellwig May 19, 2021, 6:56 a.m. UTC | #11
On Tue, May 18, 2021 at 11:54:31PM -0700, Drew Fustini wrote:
> Isn't it a good goal for Linux to support the capabilities present in
> the SoC that a currently being fab'd?
> 
> I believe the CMO group only started last year [1] so the RV64GC SoCs
> that are going into mass production this year would not have had the
> opporuntiy of utilizing any RISC-V ISA extension for handling cache
> management.

Then the vendors need to push harder.  This problem has been known
for years but ignored by the vendors.
Anup Patel May 19, 2021, 7:14 a.m. UTC | #12
On Wed, May 19, 2021 at 12:24 PM Drew Fustini <drew@beagleboard.org> wrote:
>
> On Wed, May 19, 2021 at 08:06:17AM +0200, Christoph Hellwig wrote:
> > On Wed, May 19, 2021 at 02:05:00PM +0800, Guo Ren wrote:
> > > Since the existing RISC-V ISA cannot solve this problem, it is better
> > > to provide some configuration for the SOC vendor to customize.
> >
> > We've been talking about this problem for close to five years.  So no,
> > if you don't manage to get the feature into the ISA it can't be
> > supported.
>
> Isn't it a good goal for Linux to support the capabilities present in
> the SoC that a currently being fab'd?
>
> I believe the CMO group only started last year [1] so the RV64GC SoCs
> that are going into mass production this year would not have had the
> opporuntiy of utilizing any RISC-V ISA extension for handling cache
> management.

The current Linux RISC-V policy is to only accept patches for frozen or
ratified ISA specs.
(Refer, Documentation/riscv/patch-acceptance.rst)

This means that even if we emulate CMO instructions in OpenSBI, the Linux
patches won't be taken by Palmer because the CMO specification is
still in the draft stage.

Also, we all know how much time it takes for RISC-V International
to freeze a spec. Judging by that, we are looking at another
3-4 years at minimum.

Regards,
Anup
Damien Le Moal May 19, 2021, 8:25 a.m. UTC | #13
On 2021/05/19 16:16, Anup Patel wrote:
> On Wed, May 19, 2021 at 12:24 PM Drew Fustini <drew@beagleboard.org> wrote:
>>
>> On Wed, May 19, 2021 at 08:06:17AM +0200, Christoph Hellwig wrote:
>>> On Wed, May 19, 2021 at 02:05:00PM +0800, Guo Ren wrote:
>>>> Since the existing RISC-V ISA cannot solve this problem, it is better
>>>> to provide some configuration for the SOC vendor to customize.
>>>
>>> We've been talking about this problem for close to five years.  So no,
>>> if you don't manage to get the feature into the ISA it can't be
>>> supported.
>>
>> Isn't it a good goal for Linux to support the capabilities present in
>> the SoC that a currently being fab'd?
>>
>> I believe the CMO group only started last year [1] so the RV64GC SoCs
>> that are going into mass production this year would not have had the
>> opporuntiy of utilizing any RISC-V ISA extension for handling cache
>> management.
> 
> The current Linux RISC-V policy is to only accept patches for frozen or
> ratified ISA specs.
> (Refer, Documentation/riscv/patch-acceptance.rst)
> 
> This means even if emulate CMO instructions in OpenSBI, the Linux
> patches won't be taken by Palmer because CMO specification is
> still in draft stage.
> 
> Also, we all know how much time it takes for RISCV international
> to freeze some spec. Judging by that we are looking at another
> 3-4 years at minimum.

Which is the root cause of most problems with RISC-V extension support in Linux.
All RISC-V foundation members need to apply pressure on the foundation and these
standards groups to deliver frozen specifications on an acceptable schedule.
Cf. the H extension specs, which are not yet frozen despite not having been
changed for months if not years.
Guo Ren May 20, 2021, 1:45 a.m. UTC | #14
On Wed, May 19, 2021 at 2:53 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Tue, May 18, 2021 at 11:44:35PM -0700, Drew Fustini wrote:
> > This patch series looks like it might be useful for the StarFive JH7100
> > [1] [2] too as it has peripherals on a non-coherent interconnect. GMAC,
> > USB and SDIO require that the L2 cache must be manually flushed after
> > DMA operations if the data is intended to be shared with U74 cores [2].
>
> Not too much, given that the SiFive lineage CPUs have an uncached
> window, that is a totally different way to allocate uncached memory.
That has a very strong MIPS smell. What are the attributes of the uncached
window? (Uncached + strongly ordered, or uncached + weakly ordered? Most
vendors still use an AXI interconnect, so how is the bufferable attribute
handled?) In fact, customers' drivers deal with DMA memory in different
ways on non-coherent SOCs. Most RISC-V SOC vendors come from the ARM world,
so giving them the same approach to DMA memory is a smart choice, which
makes PTE attributes more suitable.

See section 4.4.1 of
https://github.com/riscv/virtual-memory/blob/main/specs/611-virtual-memory-diff.pdf:
the draft supports custom attribute bits in the PTE.

Although I do not agree with uncached windows, this patchset does not
conflict with SiFive uncached windows.

>
> > There is the RISC-V Cache Management Operation, or CMO, task group [3]
> > but I am not sure if that can help the SoC's that have already been
> > fabbed like the the D1 and the JH7100.
>
> It does, because unimplemented instructions trap into M-mode, where they
> can be emulated.
>
> Or to summarize things.  Non-coherent DMA (and not coherent as title in
> this series) requires two things:
>
>  1) allocating chunks of memory that is marked as not cachable
>  2) instructions to invalidate and/or writeback cache lines
Maybe an sbi_ecall (dma_sync) is enough; let the vendor deal with it
in OpenSBI. From a hardware view, a CMO instruction can only deal with
one cache line at a time, so trapping on CMOs is not a good idea.

>
> none of which currently exists in RISV-V.  Hacking vendor specific
> cruft into the kernel doesn't scale, as shown perfectly by this
> series which requires to hard code vendor-specific non-standardized
> extensions in a kernel that makes it specific to that implementation.
>
> What we need to do is to standardize a way to do this properly, and then
> after that figure out a way to quirk in non-compliant implementations
> in a way that does not harm the general kernel.
Guo Ren May 20, 2021, 1:47 a.m. UTC | #15
On Wed, May 19, 2021 at 3:15 PM Anup Patel <anup@brainfault.org> wrote:
>
> On Wed, May 19, 2021 at 12:24 PM Drew Fustini <drew@beagleboard.org> wrote:
> >
> > On Wed, May 19, 2021 at 08:06:17AM +0200, Christoph Hellwig wrote:
> > > On Wed, May 19, 2021 at 02:05:00PM +0800, Guo Ren wrote:
> > > > Since the existing RISC-V ISA cannot solve this problem, it is better
> > > > to provide some configuration for the SOC vendor to customize.
> > >
> > > We've been talking about this problem for close to five years.  So no,
> > > if you don't manage to get the feature into the ISA it can't be
> > > supported.
> >
> > Isn't it a good goal for Linux to support the capabilities present in
> > the SoC that a currently being fab'd?
> >
> > I believe the CMO group only started last year [1] so the RV64GC SoCs
> > that are going into mass production this year would not have had the
> > opporuntiy of utilizing any RISC-V ISA extension for handling cache
> > management.
>
> The current Linux RISC-V policy is to only accept patches for frozen or
> ratified ISA specs.
> (Refer, Documentation/riscv/patch-acceptance.rst)
>
> This means even if emulate CMO instructions in OpenSBI, the Linux
> patches won't be taken by Palmer because CMO specification is
> still in draft stage.
What do you think about
sbi_ecall(SBI_EXT_DMA, SBI_DMA_SYNC, start, size, dir, 0, 0, 0);
? Thanks.
>
> Also, we all know how much time it takes for RISCV international
> to freeze some spec. Judging by that we are looking at another
> 3-4 years at minimum.
>
> Regards,
> Anup
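
As a rough illustration of how that proposal could look on the kernel side
(SBI_EXT_DMA and SBI_DMA_SYNC are hypothetical IDs from this thread, not part
of any ratified SBI specification; the helper name and the extension ID value
below are likewise made up):

    #include <linux/types.h>
    #include <linux/dma-direction.h>
    #include <asm/sbi.h>

    #define SBI_EXT_DMA   0x0A000005   /* placeholder extension ID */
    #define SBI_DMA_SYNC  0x0          /* placeholder function ID */

    static void sbi_dma_sync(phys_addr_t start, size_t size,
                             enum dma_data_direction dir)
    {
            /* Let the SBI firmware do the platform-specific cache maintenance. */
            sbi_ecall(SBI_EXT_DMA, SBI_DMA_SYNC, start, size, dir, 0, 0, 0);
    }

arch_sync_dma_for_device() and arch_sync_dma_for_cpu() could then be thin
wrappers around such a helper.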
Guo Ren May 20, 2021, 1:59 a.m. UTC | #16
On Thu, May 20, 2021 at 9:47 AM Guo Ren <guoren@kernel.org> wrote:
>
> On Wed, May 19, 2021 at 3:15 PM Anup Patel <anup@brainfault.org> wrote:
> >
> > On Wed, May 19, 2021 at 12:24 PM Drew Fustini <drew@beagleboard.org> wrote:
> > >
> > > On Wed, May 19, 2021 at 08:06:17AM +0200, Christoph Hellwig wrote:
> > > > On Wed, May 19, 2021 at 02:05:00PM +0800, Guo Ren wrote:
> > > > > Since the existing RISC-V ISA cannot solve this problem, it is better
> > > > > to provide some configuration for the SOC vendor to customize.
> > > >
> > > > We've been talking about this problem for close to five years.  So no,
> > > > if you don't manage to get the feature into the ISA it can't be
> > > > supported.
> > >
> > > Isn't it a good goal for Linux to support the capabilities present in
> > > the SoC that a currently being fab'd?
> > >
> > > I believe the CMO group only started last year [1] so the RV64GC SoCs
> > > that are going into mass production this year would not have had the
> > > opporuntiy of utilizing any RISC-V ISA extension for handling cache
> > > management.
> >
> > The current Linux RISC-V policy is to only accept patches for frozen or
> > ratified ISA specs.
> > (Refer, Documentation/riscv/patch-acceptance.rst)
> >
> > This means even if emulate CMO instructions in OpenSBI, the Linux
I think it's CBO now.
https://github.com/riscv/riscv-CMOs/blob/master/discussion-files/RISC_V_range_CMOs_bad_v1.00.pdf

> > patches won't be taken by Palmer because CMO specification is
> > still in draft stage.
> How do you think about
> sbi_ecall(SBI_EXT_DMA, SBI_DMA_SYNC, start, size, dir, 0, 0, 0);
> ? thx
CBO insn trap is okay for me ;-)

> >
> > Also, we all know how much time it takes for RISCV international
> > to freeze some spec. Judging by that we are looking at another
> > 3-4 years at minimum.
> >
> > Regards,
> > Anup
>
>
>
> --
> Best Regards
>  Guo Ren
>
> ML: https://lore.kernel.org/linux-csky/
Christoph Hellwig May 20, 2021, 5:48 a.m. UTC | #17
On Thu, May 20, 2021 at 09:45:45AM +0800, Guo Ren wrote:
> It's a very big MIPS smell. What's the attribute of the uncached
> window? (uncached + strong-order/ uncached + weak, most vendors still
> use AXI interconnect, how to deal with a bufferable attribute?) In
> fact, customers' drivers use different ways to deal with DMA memory in
> non-coherent SOC. Most riscv SOC vendors are from ARM, so giving them
> the same way in DMA memory is a smart choice. So using PTE attributes
> is more suitable.

I'm not saying it is a good idea.  Just that apparently this exists in
the ASICs.
Guo Ren May 22, 2021, 12:36 a.m. UTC | #18
On Wed, May 19, 2021 at 3:15 PM Anup Patel <anup@brainfault.org> wrote:
>
> On Wed, May 19, 2021 at 12:24 PM Drew Fustini <drew@beagleboard.org> wrote:
> >
> > On Wed, May 19, 2021 at 08:06:17AM +0200, Christoph Hellwig wrote:
> > > On Wed, May 19, 2021 at 02:05:00PM +0800, Guo Ren wrote:
> > > > Since the existing RISC-V ISA cannot solve this problem, it is better
> > > > to provide some configuration for the SOC vendor to customize.
> > >
> > > We've been talking about this problem for close to five years.  So no,
> > > if you don't manage to get the feature into the ISA it can't be
> > > supported.
> >
> > Isn't it a good goal for Linux to support the capabilities present in
> > the SoC that a currently being fab'd?
> >
> > I believe the CMO group only started last year [1] so the RV64GC SoCs
> > that are going into mass production this year would not have had the
> > opporuntiy of utilizing any RISC-V ISA extension for handling cache
> > management.
>
> The current Linux RISC-V policy is to only accept patches for frozen or
> ratified ISA specs.
> (Refer, Documentation/riscv/patch-acceptance.rst)
>
> This means even if emulate CMO instructions in OpenSBI, the Linux
> patches won't be taken by Palmer because CMO specification is
> still in draft stage.
Before the CMO specification is released, could we use an sbi_ecall to solve
the current problem? This is not against the specification; when CMO
is ready, we could let users choose to use the new CMO in Linux.

From a technical view, CMO trap emulation is the same as an sbi_ecall.

>
> Also, we all know how much time it takes for RISCV international
> to freeze some spec. Judging by that we are looking at another
> 3-4 years at minimum.
>
> Regards,
> Anup
Palmer Dabbelt May 30, 2021, 12:30 a.m. UTC | #19
On Fri, 21 May 2021 17:36:08 PDT (-0700), guoren@kernel.org wrote:
> On Wed, May 19, 2021 at 3:15 PM Anup Patel <anup@brainfault.org> wrote:
>>
>> On Wed, May 19, 2021 at 12:24 PM Drew Fustini <drew@beagleboard.org> wrote:
>> >
>> > On Wed, May 19, 2021 at 08:06:17AM +0200, Christoph Hellwig wrote:
>> > > On Wed, May 19, 2021 at 02:05:00PM +0800, Guo Ren wrote:
>> > > > Since the existing RISC-V ISA cannot solve this problem, it is better
>> > > > to provide some configuration for the SOC vendor to customize.
>> > >
>> > > We've been talking about this problem for close to five years.  So no,
>> > > if you don't manage to get the feature into the ISA it can't be
>> > > supported.
>> >
>> > Isn't it a good goal for Linux to support the capabilities present in
>> > the SoC that a currently being fab'd?
>> >
>> > I believe the CMO group only started last year [1] so the RV64GC SoCs
>> > that are going into mass production this year would not have had the
>> > opporuntiy of utilizing any RISC-V ISA extension for handling cache
>> > management.
>>
>> The current Linux RISC-V policy is to only accept patches for frozen or
>> ratified ISA specs.
>> (Refer, Documentation/riscv/patch-acceptance.rst)
>>
>> This means even if emulate CMO instructions in OpenSBI, the Linux
>> patches won't be taken by Palmer because CMO specification is
>> still in draft stage.
> Before CMO specification release, could we use a sbi_ecall to solve
> the current problem? This is not against the specification, when CMO
> is ready we could let users choose to use the new CMO in Linux.
>
> From a tech view, CMO trap emulation is the same as sbi_ecall.
>
>>
>> Also, we all know how much time it takes for RISCV international
>> to freeze some spec. Judging by that we are looking at another
>> 3-4 years at minimum.

Sorry for being slow here, this thread got buried.

I've been trying to work with a handful of folks at the RISC-V 
foundation to try and get a subset of the various in-development 
specifications (some simple CMOs, something about non-caching in the 
page tables, and some way to prevent speculative accesses from generating 
coherence traffic that will break non-coherent systems).  I'm not sure 
we can get this together quickly, but I'd prefer to at least try before 
we jump to taking vendor-specified behavior here.  It's obviously an 
up-hill battle to try and get specifications through the process and I'm 
certainly not going to promise it will work, but I'm hoping that the 
impending need to avoid forking the ISA will be sufficient to get people 
behind producing some specifications in a timely fashion.

I wasn't aware that this chip had non-coherent devices until I saw this 
thread, so we'd been mostly focused on the Beagle V chip.  That was in a 
sense an easier problem because the SiFive IP in it was never designed 
to have non-coherent devices so we'd have to make anything work via a 
series of slow workarounds, which would make emulating the eventually 
standardized behavior reasonable in terms of performance (ie, everything 
would be super slow so who really cares).

I don't think relying on some sort of SBI call for the CMOs would be 
such a performance hit that it would prevent these systems from being 
viable, but assuming you have reasonable performance on your non-cached 
accesses then that's probably not going to be viable to trap and 
emulate.  At that point it really just becomes silly to pretend that 
we're still making things work by emulating the eventually ratified 
behavior, as anyone who actually tries to use this thing to do IO would 
need out of tree patches.  I'm not sure exactly what the plan is for the 
page table bits in the specification right now, but if you can give me a 
pointer to some documentation then I'm happy to try and push for 
something compatible.

If we can't make the process work at the foundation then I'd be strongly 
in favor of just biting the bullet and starting to take vendor-specific 
code that's been implemented in hardware and is necessary to make 
things work acceptably.  That's obviously a sub-optimal solution as 
it'll lead to a bunch of ISA fragmentation, but at least we'll be able 
to keep the software stack together.

Can you tell us when these will be in the hands of users?  That's pretty 
important here, as I don't want to be blocking real users from having 
their hardware work.  IIRC there were some plans to distribute early 
boards, but it looks like the foundation got involved and I guess I lost 
the thread at that point.

Sorry this is all such a headache, but hopefully we can get things 
sorted out.

>>
>> Regards,
>> Anup
Palmer Dabbelt June 3, 2021, 4:13 a.m. UTC | #20
On Sat, 29 May 2021 17:30:18 PDT (-0700), Palmer Dabbelt wrote:
> On Fri, 21 May 2021 17:36:08 PDT (-0700), guoren@kernel.org wrote:
>> On Wed, May 19, 2021 at 3:15 PM Anup Patel <anup@brainfault.org> wrote:
>>>
>>> On Wed, May 19, 2021 at 12:24 PM Drew Fustini <drew@beagleboard.org> wrote:
>>> >
>>> > On Wed, May 19, 2021 at 08:06:17AM +0200, Christoph Hellwig wrote:
>>> > > On Wed, May 19, 2021 at 02:05:00PM +0800, Guo Ren wrote:
>>> > > > Since the existing RISC-V ISA cannot solve this problem, it is better
>>> > > > to provide some configuration for the SOC vendor to customize.
>>> > >
>>> > > We've been talking about this problem for close to five years.  So no,
>>> > > if you don't manage to get the feature into the ISA it can't be
>>> > > supported.
>>> >
>>> > Isn't it a good goal for Linux to support the capabilities present in
>>> > the SoC that a currently being fab'd?
>>> >
>>> > I believe the CMO group only started last year [1] so the RV64GC SoCs
>>> > that are going into mass production this year would not have had the
>>> > opporuntiy of utilizing any RISC-V ISA extension for handling cache
>>> > management.
>>>
>>> The current Linux RISC-V policy is to only accept patches for frozen or
>>> ratified ISA specs.
>>> (Refer, Documentation/riscv/patch-acceptance.rst)
>>>
>>> This means even if emulate CMO instructions in OpenSBI, the Linux
>>> patches won't be taken by Palmer because CMO specification is
>>> still in draft stage.
>> Before CMO specification release, could we use a sbi_ecall to solve
>> the current problem? This is not against the specification, when CMO
>> is ready we could let users choose to use the new CMO in Linux.
>>
>> From a tech view, CMO trap emulation is the same as sbi_ecall.
>>
>>>
>>> Also, we all know how much time it takes for RISCV international
>>> to freeze some spec. Judging by that we are looking at another
>>> 3-4 years at minimum.
>
> Sorry for being slow here, this thread got buried.
>
> I've been trying to work with a handful of folks at the RISC-V
> foundation to try and get a subset of the various in-development
> specifications (some simple CMOs, something about non-caching in the
> page tables, and some way to prevent speculative accesse from generating
> coherence traffic that will break non-coherent systems).  I'm not sure
> we can get this together quickly, but I'd prefer to at least try before
> we jump to taking vendor-specificed behavior here.  It's obviously an
> up-hill battle to try and get specifications through the process and I'm
> certainly not going to promise it will work, but I'm hoping that the
> impending need to avoid forking the ISA will be sufficient to get people
> behind producing some specifications in a timely fashion.
>
> I wasn't aware than this chip had non-coherent devices until I saw this
> thread, so we'd been mostly focused on the Beagle V chip.  That was in a
> sense an easier problem because the SiFive IP in it was never designed
> to have non-coherent devices so we'd have to make anything work via a
> series of slow workarounds, which would make emulating the eventually
> standardized behavior reasonable in terms of performance (ie, everything
> would be super slow so who really cares).
>
> I don't think relying on some sort of SBI call for the CMOs whould be
> such a performance hit that it would prevent these systems from being
> viable, but assuming you have reasonable performance on your non-cached
> accesses then that's probably not going to be viable to trap and
> emulate.  At that point it really just becomes silly to pretend that
> we're still making things work by emulating the eventually ratified
> behavior, as anyone who actually tries to use this thing to do IO would
> need out of tree patches.  I'm not sure exactly what the plan is for the
> page table bits in the specification right now, but if you can give me a
> pointer to some documentation then I'm happy to try and push for
> something compatible.
>
> If we can't make the process work at the foundation then I'd be strongly
> in favor of just biting the bullet and starting to take vendor-specific
> code that's been implemented in hardware and is necessarry to make
> things work acceptably.  That's obviously a sub-optimal solution as
> it'll lead to a bunch of ISA fragmentation, but at least we'll be able
> to keep the software stack together.
>
> Can you tell us when these will be in the hands of users?  That's pretty
> important here, as I don't want to be blocking real users from having
> their hardware work.  IIRC there were some plans to distribute early
> boards, but it looks like the foundation got involved and I guess I lost
> the thread at that point.
>
> Sorry this is all such a headache, but hopefully we can get things
> sorted out.

I talked with some of the RISC-V foundation folks, we're not going to 
have an ISA specification for the non-coherent stuff any time soon.  I 
took a look at this code and I definitely don't want to take it as is, 
but I'm not opposed to taking something that makes the hardware work as 
long as it's a lot cleaner.  We've already got two of these non-coherent 
chips, I'm sure more will come, and I'd rather have the extra headaches 
than make everyone fork the software stack.

After talking to Atish it looks like there's likely to be an SBI 
extension to handle the CMOs, which should let us avoid the bulk of the 
vendor-specific behavior in the kernel.  I know some people are worried 
about adding to the SBI surface.  I'm worried about that too, but that's 
way better than sticking a bunch of vendor-specific instructions into 
the kernel.  The SBI extension should make for a straight-forward cache 
flush implementation in Linux, so let's just plan on that getting 
through quickly (as has been done before).

Unfortunately we've yet to come up with a way to handle the 
non-cacheable mappings without introducing a degree of vendor-specific 
behavior or seriously impacting performance (mark them as not valid and 
deal with them in the trap handler).  I'm not really sure it counts as 
supporting the hardware if it's massively slow, so that really leaves us 
with vendor-specific mappings as the only option to make these chips 
work.

This implementation, which adds some Kconfig entries that control page 
table bits, definitely isn't suitable for upstream.  Allowing users to 
set arbitrary page table bits will eventually conflict with the 
standard, and is just going to be a mess.  It'll also lead to kernels 
that are only compatible with specific designs, which we're trying very 
hard to avoid.  At a bare minimum we'll need some way to detect systems 
with these page table bits before setting them, and some description of 
what the bits actually do so we can reason about them.
Anup Patel June 3, 2021, 6 a.m. UTC | #21
> -----Original Message-----
> From: Palmer Dabbelt <palmer@dabbelt.com>
> Sent: 03 June 2021 09:43
> To: guoren@kernel.org
> Cc: anup@brainfault.org; drew@beagleboard.org; Christoph Hellwig
> <hch@lst.de>; Anup Patel <Anup.Patel@wdc.com>; wefu@redhat.com;
> lazyparser@gmail.com; linux-riscv@lists.infradead.org; linux-
> kernel@vger.kernel.org; linux-arch@vger.kernel.org; linux-
> sunxi@lists.linux.dev; guoren@linux.alibaba.com; Paul Walmsley
> <paul.walmsley@sifive.com>
> Subject: Re: [PATCH RFC 0/3] riscv: Add DMA_COHERENT support
> 
> On Sat, 29 May 2021 17:30:18 PDT (-0700), Palmer Dabbelt wrote:
> > On Fri, 21 May 2021 17:36:08 PDT (-0700), guoren@kernel.org wrote:
> >> On Wed, May 19, 2021 at 3:15 PM Anup Patel <anup@brainfault.org>
> wrote:
> >>>
> >>> On Wed, May 19, 2021 at 12:24 PM Drew Fustini
> <drew@beagleboard.org> wrote:
> >>> >
> >>> > On Wed, May 19, 2021 at 08:06:17AM +0200, Christoph Hellwig
> wrote:
> >>> > > On Wed, May 19, 2021 at 02:05:00PM +0800, Guo Ren wrote:
> >>> > > > Since the existing RISC-V ISA cannot solve this problem, it is
> >>> > > > better to provide some configuration for the SOC vendor to
> customize.
> >>> > >
> >>> > > We've been talking about this problem for close to five years.
> >>> > > So no, if you don't manage to get the feature into the ISA it
> >>> > > can't be supported.
> >>> >
> >>> > Isn't it a good goal for Linux to support the capabilities present
> >>> > in the SoC that a currently being fab'd?
> >>> >
> >>> > I believe the CMO group only started last year [1] so the RV64GC
> >>> > SoCs that are going into mass production this year would not have
> >>> > had the opporuntiy of utilizing any RISC-V ISA extension for
> >>> > handling cache management.
> >>>
> >>> The current Linux RISC-V policy is to only accept patches for frozen
> >>> or ratified ISA specs.
> >>> (Refer, Documentation/riscv/patch-acceptance.rst)
> >>>
> >>> This means even if emulate CMO instructions in OpenSBI, the Linux
> >>> patches won't be taken by Palmer because CMO specification is still
> >>> in draft stage.
> >> Before CMO specification release, could we use a sbi_ecall to solve
> >> the current problem? This is not against the specification, when CMO
> >> is ready we could let users choose to use the new CMO in Linux.
> >>
> >> From a tech view, CMO trap emulation is the same as sbi_ecall.
> >>
> >>>
> >>> Also, we all know how much time it takes for RISCV international to
> >>> freeze some spec. Judging by that we are looking at another
> >>> 3-4 years at minimum.
> >
> > Sorry for being slow here, this thread got buried.
> >
> > I've been trying to work with a handful of folks at the RISC-V
> > foundation to try and get a subset of the various in-development
> > specifications (some simple CMOs, something about non-caching in the
> > page tables, and some way to prevent speculative accesse from
> > generating coherence traffic that will break non-coherent systems).
> > I'm not sure we can get this together quickly, but I'd prefer to at
> > least try before we jump to taking vendor-specificed behavior here.
> > It's obviously an up-hill battle to try and get specifications through
> > the process and I'm certainly not going to promise it will work, but
> > I'm hoping that the impending need to avoid forking the ISA will be
> > sufficient to get people behind producing some specifications in a timely
> fashion.
> >
> > I wasn't aware than this chip had non-coherent devices until I saw
> > this thread, so we'd been mostly focused on the Beagle V chip.  That
> > was in a sense an easier problem because the SiFive IP in it was never
> > designed to have non-coherent devices so we'd have to make anything
> > work via a series of slow workarounds, which would make emulating the
> > eventually standardized behavior reasonable in terms of performance
> > (ie, everything would be super slow so who really cares).
> >
> > I don't think relying on some sort of SBI call for the CMOs whould be
> > such a performance hit that it would prevent these systems from being
> > viable, but assuming you have reasonable performance on your
> > non-cached accesses then that's probably not going to be viable to
> > trap and emulate.  At that point it really just becomes silly to
> > pretend that we're still making things work by emulating the
> > eventually ratified behavior, as anyone who actually tries to use this
> > thing to do IO would need out of tree patches.  I'm not sure exactly
> > what the plan is for the page table bits in the specification right
> > now, but if you can give me a pointer to some documentation then I'm
> > happy to try and push for something compatible.
> >
> > If we can't make the process work at the foundation then I'd be
> > strongly in favor of just biting the bullet and starting to take
> > vendor-specific code that's been implemented in hardware and is
> > necessarry to make things work acceptably.  That's obviously a
> > sub-optimal solution as it'll lead to a bunch of ISA fragmentation,
> > but at least we'll be able to keep the software stack together.
> >
> > Can you tell us when these will be in the hands of users?  That's
> > pretty important here, as I don't want to be blocking real users from
> > having their hardware work.  IIRC there were some plans to distribute
> > early boards, but it looks like the foundation got involved and I
> > guess I lost the thread at that point.
> >
> > Sorry this is all such a headache, but hopefully we can get things
> > sorted out.
> 
> I talked with some of the RISC-V foundation folks, we're not going to have an
> ISA specification for the non-coherent stuff any time soon.  I took a look at
> this code and I definately don't want to take it as is, but I'm not opposed to
> taking something that makes the hardware work as long as it's a lot cleaner.
> We've already got two of these non-coherent chips, I'm sure more will come,
> and I'd rather have the extra headaches than make everyone fork the software
> stack.

Thanks for confirming. The CMO extension is still in its early stages, so it will
certainly take more time. After the CMO extension is finalized, it will
take some more time to have actual RISC-V platforms with CMO implemented.

> 
> After talking to Atish it looks like there's likely to be an SBI extension to
> handle the CMOs, which should let us avoid the bulk of the vendor-specific
> behavior in the kernel.  I know some people are worried about adding to the
> SBI surface.  I'm worried about that too, but that's way better than sticking a
> bunch of vendor-specific instructions into the kernel.  The SBI extension
> should make for a straight-forward cache flush implementation in Linux, so
> let's just plan on that getting through quickly (as has been done before).

Yes, I agree. We can have just a single SBI call which is meant for DMA sync
purposes only, which means it will flush/invalidate pages from all cache
levels irrespective of the cache hierarchy (i.e. flush/invalidate to RAM). The
CMO extension might provide more generic cache operations which can target
any cache level.

I am already preparing a write-up for an SBI DMA sync call in the SBI spec. I will
share it with you separately as well.

> 
> Unfortunately we've yet to come up with a way to handle the non-cacheable
> mappings without introducing a degree of vendor-specific behavior or
> seriously impacting performance (mark them as not valid and deal with them
> in the trap handler).  I'm not really sure it counts as supporting the hardware
> if it's massively slow, so that really leaves us with vendor-specific mappings as
> the only option to make these chips work.

A RISC-V platform can have non-cacheable mappings in the following possible
ways:
1) A fixed physical address range marked as non-cacheable using PMAs
2) Custom page table attributes
3) The Svpbmt extension being defined by RVI

Atish and I both think it is possible to have a RISC-V specific DMA ops
implementation which can handle all of the above cases. Atish is already working
on the DMA ops implementation for RISC-V.
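
As a rough sketch of that idea (this is not Atish's actual implementation; the
variable names and the riscv_isa_has_svpbmt() predicate below are placeholders,
and the Svpbmt "NC" encoding is taken from the draft extension, so it may still
change), the choice between the three mechanisms could be made once at boot so
that a single kernel image keeps working everywhere:

    #include <linux/init.h>
    #include <linux/pgtable.h>
    #include <linux/types.h>

    /* Filled in at boot from DT / errata probing; zero means "not used". */
    static unsigned long riscv_dma_uncached_offset;  /* case 1: fixed PMA window */
    static unsigned long riscv_pte_nocache;          /* cases 2/3: PTE attribute bits */

    /* Placeholder for however the platform reports Svpbmt support. */
    static bool riscv_isa_has_svpbmt(void) { return false; }

    static void __init riscv_dma_init(void)
    {
            if (riscv_isa_has_svpbmt())
                    riscv_pte_nocache = 1UL << 61;   /* Svpbmt "NC" (draft encoding) */
            /*
             * Otherwise the errata/alternatives framework would fill in either
             * riscv_pte_nocache (vendor PTE bits) or riscv_dma_uncached_offset
             * (a fixed uncached window) for the platform at hand.
             */
    }

    /* Would back the architecture's pgprot_dmacoherent()/pgprot_noncached(). */
    static pgprot_t riscv_pgprot_noncached(pgprot_t prot)
    {
            return __pgprot(pgprot_val(prot) | riscv_pte_nocache);
    }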

> 
> This implementation, which adds some Kconfig entries that control page table
> bits, definately isn't suitable for upstream.  Allowing users to set arbitrary
> page table bits will eventually conflict with the standard, and is just going to
> be a mess.  It'll also lead to kernels that are only compatible with specific
> designs, which we're trying very hard to avoid.  At a bare minimum we'll need
> some way to detect systems with these page table bits before setting them,
> and some description of what the bits actually do so we can reason about
> them.

Yes, vendor-specific Kconfig options are a strict NO NO. We can't give up the
goal of a unified kernel image for all platforms.

Regards,
Anup
Palmer Dabbelt June 3, 2021, 3:39 p.m. UTC | #22
On Wed, 02 Jun 2021 23:00:29 PDT (-0700), Anup Patel wrote:
>
>
>> -----Original Message-----
>> From: Palmer Dabbelt <palmer@dabbelt.com>
>> Sent: 03 June 2021 09:43
>> To: guoren@kernel.org
>> Cc: anup@brainfault.org; drew@beagleboard.org; Christoph Hellwig
>> <hch@lst.de>; Anup Patel <Anup.Patel@wdc.com>; wefu@redhat.com;
>> lazyparser@gmail.com; linux-riscv@lists.infradead.org; linux-
>> kernel@vger.kernel.org; linux-arch@vger.kernel.org; linux-
>> sunxi@lists.linux.dev; guoren@linux.alibaba.com; Paul Walmsley
>> <paul.walmsley@sifive.com>
>> Subject: Re: [PATCH RFC 0/3] riscv: Add DMA_COHERENT support
>> 
>> On Sat, 29 May 2021 17:30:18 PDT (-0700), Palmer Dabbelt wrote:
>> > On Fri, 21 May 2021 17:36:08 PDT (-0700), guoren@kernel.org wrote:
>> >> On Wed, May 19, 2021 at 3:15 PM Anup Patel <anup@brainfault.org>
>> wrote:
>> >>>
>> >>> On Wed, May 19, 2021 at 12:24 PM Drew Fustini
>> <drew@beagleboard.org> wrote:
>> >>> >
>> >>> > On Wed, May 19, 2021 at 08:06:17AM +0200, Christoph Hellwig
>> wrote:
>> >>> > > On Wed, May 19, 2021 at 02:05:00PM +0800, Guo Ren wrote:
>> >>> > > > Since the existing RISC-V ISA cannot solve this problem, it is
>> >>> > > > better to provide some configuration for the SOC vendor to
>> customize.
>> >>> > >
>> >>> > > We've been talking about this problem for close to five years.
>> >>> > > So no, if you don't manage to get the feature into the ISA it
>> >>> > > can't be supported.
>> >>> >
>> >>> > Isn't it a good goal for Linux to support the capabilities present
>> >>> > in the SoC that a currently being fab'd?
>> >>> >
>> >>> > I believe the CMO group only started last year [1] so the RV64GC
>> >>> > SoCs that are going into mass production this year would not have
>> >>> > had the opporuntiy of utilizing any RISC-V ISA extension for
>> >>> > handling cache management.
>> >>>
>> >>> The current Linux RISC-V policy is to only accept patches for frozen
>> >>> or ratified ISA specs.
>> >>> (Refer, Documentation/riscv/patch-acceptance.rst)
>> >>>
>> >>> This means even if emulate CMO instructions in OpenSBI, the Linux
>> >>> patches won't be taken by Palmer because CMO specification is still
>> >>> in draft stage.
>> >> Before CMO specification release, could we use a sbi_ecall to solve
>> >> the current problem? This is not against the specification, when CMO
>> >> is ready we could let users choose to use the new CMO in Linux.
>> >>
>> >> From a tech view, CMO trap emulation is the same as sbi_ecall.
>> >>
>> >>>
>> >>> Also, we all know how much time it takes for RISCV international to
>> >>> freeze some spec. Judging by that we are looking at another
>> >>> 3-4 years at minimum.
>> >
>> > Sorry for being slow here, this thread got buried.
>> >
>> > I've been trying to work with a handful of folks at the RISC-V
>> > foundation to try and get a subset of the various in-development
>> > specifications (some simple CMOs, something about non-caching in the
>> > page tables, and some way to prevent speculative accesse from
>> > generating coherence traffic that will break non-coherent systems).
>> > I'm not sure we can get this together quickly, but I'd prefer to at
>> > least try before we jump to taking vendor-specificed behavior here.
>> > It's obviously an up-hill battle to try and get specifications through
>> > the process and I'm certainly not going to promise it will work, but
>> > I'm hoping that the impending need to avoid forking the ISA will be
>> > sufficient to get people behind producing some specifications in a timely
>> fashion.
>> >
>> > I wasn't aware than this chip had non-coherent devices until I saw
>> > this thread, so we'd been mostly focused on the Beagle V chip.  That
>> > was in a sense an easier problem because the SiFive IP in it was never
>> > designed to have non-coherent devices so we'd have to make anything
>> > work via a series of slow workarounds, which would make emulating the
>> > eventually standardized behavior reasonable in terms of performance
>> > (ie, everything would be super slow so who really cares).
>> >
>> > I don't think relying on some sort of SBI call for the CMOs whould be
>> > such a performance hit that it would prevent these systems from being
>> > viable, but assuming you have reasonable performance on your
>> > non-cached accesses then that's probably not going to be viable to
>> > trap and emulate.  At that point it really just becomes silly to
>> > pretend that we're still making things work by emulating the
>> > eventually ratified behavior, as anyone who actually tries to use this
>> > thing to do IO would need out of tree patches.  I'm not sure exactly
>> > what the plan is for the page table bits in the specification right
>> > now, but if you can give me a pointer to some documentation then I'm
>> > happy to try and push for something compatible.
>> >
>> > If we can't make the process work at the foundation then I'd be
>> > strongly in favor of just biting the bullet and starting to take
>> > vendor-specific code that's been implemented in hardware and is
>> > necessarry to make things work acceptably.  That's obviously a
>> > sub-optimal solution as it'll lead to a bunch of ISA fragmentation,
>> > but at least we'll be able to keep the software stack together.
>> >
>> > Can you tell us when these will be in the hands of users?  That's
>> > pretty important here, as I don't want to be blocking real users from
>> > having their hardware work.  IIRC there were some plans to distribute
>> > early boards, but it looks like the foundation got involved and I
>> > guess I lost the thread at that point.
>> >
>> > Sorry this is all such a headache, but hopefully we can get things
>> > sorted out.
>> 
>> I talked with some of the RISC-V foundation folks, we're not going to have an
>> ISA specification for the non-coherent stuff any time soon.  I took a look at
>> this code and I definately don't want to take it as is, but I'm not opposed to
>> taking something that makes the hardware work as long as it's a lot cleaner.
>> We've already got two of these non-coherent chips, I'm sure more will come,
>> and I'd rather have the extra headaches than make everyone fork the software
>> stack.
>
> Thanks for confirming. The CMO extension is still in early stages so it will
> certainly take more time for them. After CMO extension is finalized, it will
> take some more time to have actual RISC-V platforms with CMO implemented.

Agreed.  It's going to take two or three years from the standard to get 
hardware supporting it, so that means we're three or four years away 
(at least; there's not even any solid timeline for a spec a year from 
now) from having hardware.  There's just going to be too much 
non-standard hardware here to try to ignore it all.

>> After talking to Atish it looks like there's likely to be an SBI extension to
>> handle the CMOs, which should let us avoid the bulk of the vendor-specific
>> behavior in the kernel.  I know some people are worried about adding to the
>> SBI surface.  I'm worried about that too, but that's way better than sticking a
>> bunch of vendor-specific instructions into the kernel.  The SBI extension
>> should make for a straight-forward cache flush implementation in Linux, so
>> let's just plan on that getting through quickly (as has been done before).
>
> Yes, I agree. We can have just a single SBI call which is meant for DMA sync
> purpose only which means it will flush/invalidate pages from all cache
> levels irrespective of the cache hierarchy (i.e. flush/invalidate to RAM). The
> CMO extension might more generic cache operations which can target
> any cache level.
>
> I am already preparing a write-up for SBI DMA sync call in SBI spec. I will
> share it with you separately as well.

Great, thanks.  Atish sort of mentioned that, but I didn't want to put 
words in your mouth (and I assume you were asleep or something, due to 
time zones).

>> Unfortunately we've yet to come up with a way to handle the non-cacheable
>> mappings without introducing a degree of vendor-specific behavior or
>> seriously impacting performance (mark them as not valid and deal with them
>> in the trap handler).  I'm not really sure it counts as supporting the hardware
>> if it's massively slow, so that really leaves us with vendor-specific mappings as
>> the only option to make these chips work.
>
> A RISC-V platform can have non-cacheable mappings is following possible
> ways:
> 1) Fixed physical address range as non-cacheable using PMAs
> 2) Custom page table attributes
> 3) Svpmbt extension being defined by RVI
>
> Atish and me both think it is possible to have RISC-V specific DMA ops
> implementation which can handle all above case. Atish is already working
> on DMA ops implementation for RISC-V.

Great, thanks.  I haven't started writing any code, but I think we're 
going to be able to get a big chunk of #1 from the "dma-ranges" device 
tree stuff.  I think we still need some arch-specific allocation work to 
make sure we don't alias, though.

The page table attributes are definitely going to need dma ops.  I'd 
been assuming we'd have multiple DMA op tables for the multiple flavors, 
but if they fit into a single op table cleanly that's fine -- that sort 
of stuff really needs the code here.

Since you guys have already started I'll just wait for patches.

Thanks!

>> This implementation, which adds some Kconfig entries that control page table
>> bits, definately isn't suitable for upstream.  Allowing users to set arbitrary
>> page table bits will eventually conflict with the standard, and is just going to
>> be a mess.  It'll also lead to kernels that are only compatible with specific
>> designs, which we're trying very hard to avoid.  At a bare minimum we'll need
>> some way to detect systems with these page table bits before setting them,
>> and some description of what the bits actually do so we can reason about
>> them.
>
> Yes, vendor specific Kconfig options are strict NO NO. We can't give-up the
> goal of unified kernel image for all platforms.

I think this is just a phrasing issue, but just to be sure:

IMO it's not that they're vendor-specific Kconfig options, it's that 
turning them on will conflict with standard systems (and other vendors).  
We've already got the ability to select sets of Kconfig settings that 
will only boot on one vendor's system, which is fine, as long as there 
remains a set of Kconfig settings that will boot on all systems.

An example here would be the errata: every system has errata of some 
sort, so if we start flipping off various vendors' errata Kconfigs 
you'll end up with kernels that only function properly on some systems.  
That's fine with me, as long as it's possible to turn on all vendors' 
errata Kconfigs at the same time and the resulting kernel functions 
correctly on all systems.

> Regards,
> Anup
David Laight June 4, 2021, 9:02 a.m. UTC | #23
From: Palmer Dabbelt
> Sent: 03 June 2021 16:39
...
> An example here would be the errata: every system has errata of some
> sort, so if we start flipping off various vendor's errata Kconfigs
> you'll end up with kernels that only function properly on some systems.
> That's fine with me, as long as it's possible to turn on all vendor's
> errata Kconfigs at the same time and the resulting kernel functions
> correctly on all systems.

ISTM that if you can (easily) detect the errata then the detection
should be left in - but the kernel should fail to boot if the system
needs the errata fixed.

The same would be needed for DMA in systems with non-coherent memory.

Only a hardware engineer would build a system with non-coherent memory
and without the ability to do uncached accesses and flush/invalidate
small sections of cache.

Mind you we did get a dual-cpu system that didn't have cache-coherency
between the cpus! That was singularly useless.

	David

Arnd Bergmann June 4, 2021, 9:53 a.m. UTC | #24
On Thu, Jun 3, 2021 at 5:39 PM Palmer Dabbelt <palmer@dabbelt.com> wrote:
> On Wed, 02 Jun 2021 23:00:29 PDT (-0700), Anup Patel wrote:
> >> This implementation, which adds some Kconfig entries that control page table
> >> bits, definately isn't suitable for upstream.  Allowing users to set arbitrary
> >> page table bits will eventually conflict with the standard, and is just going to
> >> be a mess.  It'll also lead to kernels that are only compatible with specific
> >> designs, which we're trying very hard to avoid.  At a bare minimum we'll need
> >> some way to detect systems with these page table bits before setting them,
> >> and some description of what the bits actually do so we can reason about
> >> them.
> >
> > Yes, vendor specific Kconfig options are strict NO NO. We can't give-up the
> > goal of unified kernel image for all platforms.
>
> I think this is just a phrasing issue, but just to be sure:
>
> IMO it's not that they're vendor-specific Kconfig options, it's that
> turning them on will conflict with standard systems (and other vendors).
> We've already got the ability to select sets of Kconfig settings that
> will only boot on one vendor's system, which is fine, as long as there
> remains a set of Kconfig settings that will boot on all systems.
>
> An example here would be the errata: every system has errata of some
> sort, so if we start flipping off various vendor's errata Kconfigs
> you'll end up with kernels that only function properly on some systems.
> That's fine with me, as long as it's possible to turn on all vendor's
> errata Kconfigs at the same time and the resulting kernel functions
> correctly on all systems.

Yes, this is generally the goal, and it would be great to have that
working in a way where a 'defconfig' build just turns on all the options
that are needed to use any SoC specific features and drivers while
still working on all hardware. There are however limits you may run
into at some point, and other architectures usually only manage to span
some 10 to 15 years of hardware implementations with a single
kernel before it gets really hard.

To give some common examples that make it break down:

- 32-bit vs 64-bit already violates that rule on risc-v (as it does on
  most other architectures)

- architectures that support both big-endian and little-endian kernels
  tend to have platforms that require one or the other (e.g. mips,
  though not arm). Not an issue for you.

- page table formats are the main cause of incompatibility: arm32
  and x86-32 require three-level tables for certain features, but those
  are incompatible with older cores, arm64 supports three different
  page sizes, but none of them works on all cores (4KB almost works
  everywhere).

- SMP-enabled ARMv7 kernels can be configured to run on either
  ARMv6 or ARMv8, but not both, in this case because of incompatible
  barrier instructions.

- 32-bit Arm has a couple more remaining features that require building
  a machine specific kernel if enabled because they hardcode physical
  addresses: early printk (debug_ll, not the normal earlycon), NOMMU,
  and XIP.

       Arnd
Guo Ren June 4, 2021, 2:47 p.m. UTC | #25
Hi Arnd & Palmer,

Sorry for the delayed reply, I'm working on the next version of the patch.

On Fri, Jun 4, 2021 at 5:56 PM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Thu, Jun 3, 2021 at 5:39 PM Palmer Dabbelt <palmer@dabbelt.com> wrote:
> > On Wed, 02 Jun 2021 23:00:29 PDT (-0700), Anup Patel wrote:
> > >> This implementation, which adds some Kconfig entries that control page table
> > >> bits, definately isn't suitable for upstream.  Allowing users to set arbitrary
> > >> page table bits will eventually conflict with the standard, and is just going to
> > >> be a mess.  It'll also lead to kernels that are only compatible with specific
> > >> designs, which we're trying very hard to avoid.  At a bare minimum we'll need
> > >> some way to detect systems with these page table bits before setting them,
> > >> and some description of what the bits actually do so we can reason about
> > >> them.
> > >
> > > Yes, vendor specific Kconfig options are strict NO NO. We can't give-up the
> > > goal of unified kernel image for all platforms.
Okay, agreed. Please help review the next version of the patch.

> >
> > I think this is just a phrasing issue, but just to be sure:
> >
> > IMO it's not that they're vendor-specific Kconfig options, it's that
> > turning them on will conflict with standard systems (and other vendors).
> > We've already got the ability to select sets of Kconfig settings that
> > will only boot on one vendor's system, which is fine, as long as there
> > remains a set of Kconfig settings that will boot on all systems.
> >
> > An example here would be the errata: every system has errata of some
> > sort, so if we start flipping off various vendor's errata Kconfigs
> > you'll end up with kernels that only function properly on some systems.
> > That's fine with me, as long as it's possible to turn on all vendor's
> > errata Kconfigs at the same time and the resulting kernel functions
> > correctly on all systems.
>
> Yes, this is generally the goal, and it would be great to have that
> working in a way where a 'defconfig' build just turns on all the options
> that are needed to use any SoC specific features and drivers while
> still working on all hardware. There are however limits you may run
> into at some point, and other architectures usually only manage to span
> some 10 to 15 years of hardware implementations with a single
> kernel before it get really hard.
I will follow that goal in the next version of the patchset. Please
help review, thanks.

>
> To give some common examples that make it break down:
>
> - 32-bit vs 64-bit already violates that rule on risc-v (as it does on
>   most other architectures)
>
> - architectures that support both big-endian and little-endian kernels
>   tend to have platforms that require one or the other (e.g. mips,
>   though not arm). Not an issue for you.
>
> - page table formats are the main cause of incompatibility: arm32
>   and x86-32 require three-level tables for certain features, but those
>   are incompatible with older cores, arm64 supports three different
>   page sizes, but none of them works on all cores (4KB almost works
>   everywhere).
>
> - SMP-enabled ARMv7 kernels can be configured to run on either
>   ARMv6 or ARMv8, but not both, in this case because of incompatible
>   barrier instructions.
>
> - 32-bit Arm has a couple more remaining features that require building
>   a machine specific kernel if enabled because they hardcode physical
>   addresses: early printk (debug_ll, not the normal earlycon), NOMMU,
>   and XIP.
>
>        Arnd
Palmer Dabbelt June 4, 2021, 4:12 p.m. UTC | #26
On Fri, 04 Jun 2021 07:47:22 PDT (-0700), guoren@kernel.org wrote:
> Hi Arnd & Palmer,
>
> Sorry for the delayed reply, I'm working on the next version of the patch.
>
> On Fri, Jun 4, 2021 at 5:56 PM Arnd Bergmann <arnd@arndb.de> wrote:
>>
>> On Thu, Jun 3, 2021 at 5:39 PM Palmer Dabbelt <palmer@dabbelt.com> wrote:
>> > On Wed, 02 Jun 2021 23:00:29 PDT (-0700), Anup Patel wrote:
>> > >> This implementation, which adds some Kconfig entries that control page table
>> > >> bits, definately isn't suitable for upstream.  Allowing users to set arbitrary
>> > >> page table bits will eventually conflict with the standard, and is just going to
>> > >> be a mess.  It'll also lead to kernels that are only compatible with specific
>> > >> designs, which we're trying very hard to avoid.  At a bare minimum we'll need
>> > >> some way to detect systems with these page table bits before setting them,
>> > >> and some description of what the bits actually do so we can reason about
>> > >> them.
>> > >
>> > > Yes, vendor specific Kconfig options are strict NO NO. We can't give-up the
>> > > goal of unified kernel image for all platforms.
> Okay,  Agree. Please help review the next version of the patch.
>
>> >
>> > I think this is just a phrasing issue, but just to be sure:
>> >
>> > IMO it's not that they're vendor-specific Kconfig options, it's that
>> > turning them on will conflict with standard systems (and other vendors).
>> > We've already got the ability to select sets of Kconfig settings that
>> > will only boot on one vendor's system, which is fine, as long as there
>> > remains a set of Kconfig settings that will boot on all systems.
>> >
>> > An example here would be the errata: every system has errata of some
>> > sort, so if we start flipping off various vendor's errata Kconfigs
>> > you'll end up with kernels that only function properly on some systems.
>> > That's fine with me, as long as it's possible to turn on all vendor's
>> > errata Kconfigs at the same time and the resulting kernel functions
>> > correctly on all systems.
>>
>> Yes, this is generally the goal, and it would be great to have that
>> working in a way where a 'defconfig' build just turns on all the options
>> that are needed to use any SoC specific features and drivers while
>> still working on all hardware. There are however limits you may run
>> into at some point, and other architectures usually only manage to span
>> some 10 to 15 years of hardware implementations with a single
>> kernel before it get really hard.
> I could follow the goal in the next version of the patchset. Please
> help review, thx.

IMO we're essentially here now with the RISC-V stuff: defconfig flips on 
everything necessary to boot normal-smelling SOCs, with everything being 
detected as the system boots.  We have some wacky configurations like 
!MMU and XIP that are coupled to the hardware, but (and sorry for 
crossing the other threads, I missed your pointer as it's early here) as 
I said in the other thread it might be time to make it explicit that 
those things are non-portable.

The hope here has always been that we'd have enough in the standards 
that we could avoid a proliferation of vendor-specific code.  We've 
always put a strong "things keep working forever" stake in the ground in 
RISC-V land, but that's largely been because we were counting on the 
standards existing that make support easy.  In practice we don't have 
those standards so we're ending up with a fairly large software base 
that is required to support everything.  We don't have all that much 
hardware right now so we'll have to see how it goes, but for now I'm in 
favor of keeping defconfig as a "boots on everything" sort of setup -- 
both because it makes life easier for users, and because it makes issues 
like the non-portable Kconfigs that showed up here quite explicit.

If we get to 10/15 years of hardware then I'm sure we'll be removing old 
systems from defconfig (or maybe even the kernel entirely, a lot of this 
stuff isn't in production).  I'm just hoping we make it that far ;)

>> To give some common examples that make it break down:
>>
>> - 32-bit vs 64-bit already violates that rule on risc-v (as it does on
>>   most other architectures)

Yes, and there's no way around that on RISC-V.  They're different base 
ISAs and therefore re-define the same instructions, so we're essentially at 
two kernel binaries by that point.  The platform spec says rv64gc, so we 
can kind of punt on this one for now.  If rv32 hardware shows up 
we'll probably want a standard system there too, which is why we've 
avoided coupling kernel portability to XLEN.

>> - architectures that support both big-endian and little-endian kernels
>>   tend to have platforms that require one or the other (e.g. mips,
>>   though not arm). Not an issue for you.

It is now!  We've added big-endian to RISC-V.  There's no hardware yet 
and very little software support.  IMO the right answer is to ban that 
from the platform spec, but again it'll depend on what vendors want to 
build (though if anyone is listening, please don't make my life miserable 
;)).

>> - page table formats are the main cause of incompatibility: arm32
>>   and x86-32 require three-level tables for certain features, but those
>>   are incompatible with older cores, arm64 supports three different
>>   page sizes, but none of them works on all cores (4KB almost works
>>   everywhere).

We actually have some support in the works for multiple page table 
levels in a single binary, which should help with a lot of that 
incompatibility.  I don't know of any plans to couple other page table 
features to the number of levels, though.

>> - SMP-enabled ARMv7 kernels can be configured to run on either
>>   ARMv6 or ARMv8, but not both, in this case because of incompatible
>>   barrier instructions.

Our barriers aren't quite split the same way, but we do have two memory 
models (RVWMO and TSO).  IIUC we should be able to support both in the 
same kernels with some patching, but the resulting kernels would be 
biased towards one memory model over the other WRT performance.  Again, 
we'll have to see what the vendors do and I'm hoping we don't end up 
with too many headaches.

>> - 32-bit Arm has a couple more remaining features that require building
>>   a machine specific kernel if enabled because they hardcode physical
>>   addresses: early printk (debug_ll, not the normal earlycon), NOMMU,
>>   and XIP.

We've got NOMMU and XIP as well, but we have some SBI support for early 
printk.  IMO we're not really sure if we've decoupled all the PA layout 
dependencies yet from Linux, as we really only support one vendor's 
systems, but we've had a lot of work lately on beefing up our memory 
layout so with any luck we'll be able to quickly sort out anything that 
comes up.
Arnd Bergmann June 4, 2021, 9:26 p.m. UTC | #27
On Fri, Jun 4, 2021 at 6:14 PM Palmer Dabbelt <palmer@dabbelt.com> wrote:

> The hope here has always been that we'd have enough in the standards
> that we could avoid a proliferation of vendor-specific code.  We've
> always put a strong "things keep working forever" stake in the ground in
> RISC-V land, but that's largely been because we were counting on the
> standards existing that make support easy.  In practice we don't have
> those standards so we're ending up with a fairly large software base
> that is required to support everything.  We don't have all that much
> hardware right now so we'll have to see how it goes, but for now I'm in
> favor of keeping defconfig as a "boots on everything" sort of setup --
> both because it makes life easier for users, and because it makes issues
> like the non-portable Kconfigs that showed up here quite explicit.

It's obviously easy to take the hard line approach as long as there is
so little hardware available. I expect this to be a constant struggle,
but it's definitely worth trying as long as you can.

> >> To give some common examples that make it break down:
> >>
> >> - 32-bit vs 64-bit already violates that rule on risc-v (as it does on
> >>   most other architectures)
>
> Yes, and there's no way around that on RISC-V.  They're different base
> ISAs therefor re-define the same instructions, so we're essentially at
> two kernel binaries by that point.  The platform spec says rv64gc, so we
> can kind of punt on this one for now.  If rv32 hardware shows up
> we'll probably want a standard system there too, which is why we've
> avoided coupling kernel portability to XLEN.

I would actually put 32-bit into the same category as NOMMU, XIP
and the built-in DTB:
Since it seems unrealistic to expect an rv32 Debian or Fedora build,
there is very little to gain by enforcing compatibility between machines.
This is different from 32-bit Arm, which is widely used across multiple
distros and many SoCs.

> >> - architectures that support both big-endian and little-endian kernels
> >>   tend to have platforms that require one or the other (e.g. mips,
> >>   though not arm). Not an issue for you.
>
> It is now!  We've added big-endian to RISC-V.  There's no hardware yet
> and very little software support.  IMO the right answer is to ban that
> from the platform spec, but again it'll depnd on what vendors want to
> build (though anyone is listening, please don't make my life miserable
> ;)).

I don't see any big-endian support in linux-next. This one does seem
worth enforcing to be kept out, as it would double the number of user
space ABIs, not just kernel configurations. On arm64, I think the general
feeling is now that we would have been better off not merging big-endian
support into the kernel as an option, but it still seemed important at the
time. Not that there is anything really wrong with big-endian by itself,
just that there is no use case that is worth the added complexity of
supporting both.

Let me know if there are patches you want me to Nak in the future ;-)

> >> - SMP-enabled ARMv7 kernels can be configured to run on either
> >>   ARMv6 or ARMv8, but not both, in this case because of incompatible
> >>   barrier instructions.
>
> Our barriers aren't quite split the same way, but we do have two memory
> models (RVWMO and TSO).  IIUC we should be able to support both in the
> same kernels with some patching, but the resulting kernels would be
> biased towards one memory models over the other WRT performance.  Again,
> we'll have to see what the vendors do and I'm hoping we don't end up
> with too many headaches.

I wouldn't specifically expect the problem to be barriers in the rv64 case,
this was just an example of instruction sets slowly changing in incompatible
ways over a long time. There might be an important reason for version 3.0
of one of the specifications to drop compatibility with version 1.x.

          Arnd
Palmer Dabbelt June 4, 2021, 10:10 p.m. UTC | #28
On Fri, 04 Jun 2021 14:26:11 PDT (-0700), Arnd Bergmann wrote:
> On Fri, Jun 4, 2021 at 6:14 PM Palmer Dabbelt <palmer@dabbelt.com> wrote:
>> >> To give some common examples that make it break down:
>> >>
>> >> - 32-bit vs 64-bit already violates that rule on risc-v (as it does on
>> >>   most other architectures)
>>
>> Yes, and there's no way around that on RISC-V.  They're different base
>> ISAs therefor re-define the same instructions, so we're essentially at
>> two kernel binaries by that point.  The platform spec says rv64gc, so we
>> can kind of punt on this one for now.  If rv32 hardware shows up
>> we'll probably want a standard system there too, which is why we've
>> avoided coupling kernel portability to XLEN.
>
> I would actually put 32-bit into the same category as NOMMU, XIP
> and the built-in DTB:
> Since it seems unrealistic to expect an rv32 Debian or Fedora build,
> there is very little to gain by enforcing compatibility between machines.
> This is different from 32-bit Arm, which is widely used across multiple
> distros and many SoCs.

OK, well, that's what the spec says already.  Maybe the right answer is 
to just add that "be compatible with the platform spec" Kconfig and have 
it also enforce rv64gc like the spec says.

>
>> >> - architectures that support both big-endian and little-endian kernels
>> >>   tend to have platforms that require one or the other (e.g. mips,
>> >>   though not arm). Not an issue for you.
>>
>> It is now!  We've added big-endian to RISC-V.  There's no hardware yet
>> and very little software support.  IMO the right answer is to ban that
>> from the platform spec, but again it'll depnd on what vendors want to
>> build (though anyone is listening, please don't make my life miserable
>> ;)).
>
> I don't see any big-endian support in linux-next. This one does seem
> worth enforcing to be kept out, as it would double the number of user
> space ABIs, not just kernel configurations. On arm64, I think the general
> feeling is now that we would have been better off not merging big-endian
> support into the kernel as an option, but it still seemed important at the
> time. Not that there is anything really wrong with big-endian by itself,
> just that there is no use case that is worth the added complexity of
> supporting both.
>
> Let me know if there are patches you want me to Nak in the future ;-)

Sorry, by "added big-endian to RISC-V" I meant to the ISA, not to Linux.  
We haven't had any interest in adding it to Linux.  The interest has 
all been in the embedded space.
Guo Ren June 6, 2021, 5:11 p.m. UTC | #29
Hi Anup and Atish,

On Thu, Jun 3, 2021 at 2:00 PM Anup Patel <Anup.Patel@wdc.com> wrote:
>
>
>
> >
> > On Sat, 29 May 2021 17:30:18 PDT (-0700), Palmer Dabbelt wrote:
> > > On Fri, 21 May 2021 17:36:08 PDT (-0700), guoren@kernel.org wrote:
> > >> On Wed, May 19, 2021 at 3:15 PM Anup Patel <anup@brainfault.org>
> > wrote:
> > >>>
> > >>> On Wed, May 19, 2021 at 12:24 PM Drew Fustini
> > <drew@beagleboard.org> wrote:
> > >>> >
> > >>> > On Wed, May 19, 2021 at 08:06:17AM +0200, Christoph Hellwig
> > wrote:
> > >>> > > On Wed, May 19, 2021 at 02:05:00PM +0800, Guo Ren wrote:
> > >>> > > > Since the existing RISC-V ISA cannot solve this problem, it is
> > >>> > > > better to provide some configuration for the SOC vendor to
> > customize.
> > >>> > >
> > >>> > > We've been talking about this problem for close to five years.
> > >>> > > So no, if you don't manage to get the feature into the ISA it
> > >>> > > can't be supported.
> > >>> >
> > >>> > Isn't it a good goal for Linux to support the capabilities present
> > >>> > in the SoC that a currently being fab'd?
> > >>> >
> > >>> > I believe the CMO group only started last year [1] so the RV64GC
> > >>> > SoCs that are going into mass production this year would not have
> > >>> > had the opporuntiy of utilizing any RISC-V ISA extension for
> > >>> > handling cache management.
> > >>>
> > >>> The current Linux RISC-V policy is to only accept patches for frozen
> > >>> or ratified ISA specs.
> > >>> (Refer, Documentation/riscv/patch-acceptance.rst)
> > >>>
> > >>> This means even if emulate CMO instructions in OpenSBI, the Linux
> > >>> patches won't be taken by Palmer because CMO specification is still
> > >>> in draft stage.
> > >> Before CMO specification release, could we use a sbi_ecall to solve
> > >> the current problem? This is not against the specification, when CMO
> > >> is ready we could let users choose to use the new CMO in Linux.
> > >>
> > >> From a tech view, CMO trap emulation is the same as sbi_ecall.
> > >>
> > >>>
> > >>> Also, we all know how much time it takes for RISCV international to
> > >>> freeze some spec. Judging by that we are looking at another
> > >>> 3-4 years at minimum.
> > >
> > > Sorry for being slow here, this thread got buried.
> > >
> > > I've been trying to work with a handful of folks at the RISC-V
> > > foundation to try and get a subset of the various in-development
> > > specifications (some simple CMOs, something about non-caching in the
> > > page tables, and some way to prevent speculative accesse from
> > > generating coherence traffic that will break non-coherent systems).
> > > I'm not sure we can get this together quickly, but I'd prefer to at
> > > least try before we jump to taking vendor-specificed behavior here.
> > > It's obviously an up-hill battle to try and get specifications through
> > > the process and I'm certainly not going to promise it will work, but
> > > I'm hoping that the impending need to avoid forking the ISA will be
> > > sufficient to get people behind producing some specifications in a timely
> > fashion.
> > >
> > > I wasn't aware than this chip had non-coherent devices until I saw
> > > this thread, so we'd been mostly focused on the Beagle V chip.  That
> > > was in a sense an easier problem because the SiFive IP in it was never
> > > designed to have non-coherent devices so we'd have to make anything
> > > work via a series of slow workarounds, which would make emulating the
> > > eventually standardized behavior reasonable in terms of performance
> > > (ie, everything would be super slow so who really cares).
> > >
> > > I don't think relying on some sort of SBI call for the CMOs whould be
> > > such a performance hit that it would prevent these systems from being
> > > viable, but assuming you have reasonable performance on your
> > > non-cached accesses then that's probably not going to be viable to
> > > trap and emulate.  At that point it really just becomes silly to
> > > pretend that we're still making things work by emulating the
> > > eventually ratified behavior, as anyone who actually tries to use this
> > > thing to do IO would need out of tree patches.  I'm not sure exactly
> > > what the plan is for the page table bits in the specification right
> > > now, but if you can give me a pointer to some documentation then I'm
> > > happy to try and push for something compatible.
> > >
> > > If we can't make the process work at the foundation then I'd be
> > > strongly in favor of just biting the bullet and starting to take
> > > vendor-specific code that's been implemented in hardware and is
> > > necessarry to make things work acceptably.  That's obviously a
> > > sub-optimal solution as it'll lead to a bunch of ISA fragmentation,
> > > but at least we'll be able to keep the software stack together.
> > >
> > > Can you tell us when these will be in the hands of users?  That's
> > > pretty important here, as I don't want to be blocking real users from
> > > having their hardware work.  IIRC there were some plans to distribute
> > > early boards, but it looks like the foundation got involved and I
> > > guess I lost the thread at that point.
> > >
> > > Sorry this is all such a headache, but hopefully we can get things
> > > sorted out.
> >
> > I talked with some of the RISC-V foundation folks, we're not going to have an
> > ISA specification for the non-coherent stuff any time soon.  I took a look at
> > this code and I definately don't want to take it as is, but I'm not opposed to
> > taking something that makes the hardware work as long as it's a lot cleaner.
> > We've already got two of these non-coherent chips, I'm sure more will come,
> > and I'd rather have the extra headaches than make everyone fork the software
> > stack.
>
> Thanks for confirming. The CMO extension is still in early stages so it will
> certainly take more time for them. After CMO extension is finalized, it will
> take some more time to have actual RISC-V platforms with CMO implemented.
>
> >
> > After talking to Atish it looks like there's likely to be an SBI extension to
> > handle the CMOs, which should let us avoid the bulk of the vendor-specific
> > behavior in the kernel.  I know some people are worried about adding to the
> > SBI surface.  I'm worried about that too, but that's way better than sticking a
> > bunch of vendor-specific instructions into the kernel.  The SBI extension
> > should make for a straight-forward cache flush implementation in Linux, so
> > let's just plan on that getting through quickly (as has been done before).
>
> Yes, I agree. We can have just a single SBI call which is meant for DMA sync
> purpose only which means it will flush/invalidate pages from all cache
> levels irrespective of the cache hierarchy (i.e. flush/invalidate to RAM). The
> CMO extension might more generic cache operations which can target
> any cache level.
>
> I am already preparing a write-up for SBI DMA sync call in SBI spec. I will
> share it with you separately as well.
>
> >
> > Unfortunately we've yet to come up with a way to handle the non-cacheable
> > mappings without introducing a degree of vendor-specific behavior or
> > seriously impacting performance (mark them as not valid and deal with them
> > in the trap handler).  I'm not really sure it counts as supporting the hardware
> > if it's massively slow, so that really leaves us with vendor-specific mappings as
> > the only option to make these chips work.
>
> A RISC-V platform can have non-cacheable mappings is following possible
> ways:
> 1) Fixed physical address range as non-cacheable using PMAs
> 2) Custom page table attributes
> 3) Svpmbt extension being defined by RVI
>
> Atish and me both think it is possible to have RISC-V specific DMA ops
> implementation which can handle all above case. Atish is already working
> on DMA ops implementation for RISC-V.
Not only DMA ops, but also icache_sync & __vdso_icache_sync. Please
have a look at:
https://lore.kernel.org/linux-riscv/1622970249-50770-12-git-send-email-guoren@kernel.org/T/#u


>
> >
> > This implementation, which adds some Kconfig entries that control page table
> > bits, definately isn't suitable for upstream.  Allowing users to set arbitrary
> > page table bits will eventually conflict with the standard, and is just going to
> > be a mess.  It'll also lead to kernels that are only compatible with specific
> > designs, which we're trying very hard to avoid.  At a bare minimum we'll need
> > some way to detect systems with these page table bits before setting them,
> > and some description of what the bits actually do so we can reason about
> > them.
>
> Yes, vendor specific Kconfig options are strict NO NO. We can't give-up the
> goal of unified kernel image for all platforms.
>
> Regards,
> Anup
Nick Kossifidis June 6, 2021, 6:14 p.m. UTC | #30
On 2021-05-20 04:45, Guo Ren wrote:
> On Wed, May 19, 2021 at 2:53 PM Christoph Hellwig <hch@lst.de> wrote:
>> 
>> On Tue, May 18, 2021 at 11:44:35PM -0700, Drew Fustini wrote:
>> > This patch series looks like it might be useful for the StarFive JH7100
>> > [1] [2] too as it has peripherals on a non-coherent interconnect. GMAC,
>> > USB and SDIO require that the L2 cache must be manually flushed after
>> > DMA operations if the data is intended to be shared with U74 cores [2].
>> 
>> Not too much, given that the SiFive lineage CPUs have an uncached
>> window, that is a totally different way to allocate uncached memory.
> It's a very big MIPS smell. What's the attribute of the uncached
> window? (uncached + strong-order/ uncached + weak, most vendors still
> use AXI interconnect, how to deal with a bufferable attribute?) In
> fact, customers' drivers use different ways to deal with DMA memory in
> non-coherent SOC. Most riscv SOC vendors are from ARM, so giving them
> the same way in DMA memory is a smart choice. So using PTE attributes
> is more suitable.
> 
> See:
> https://github.com/riscv/virtual-memory/blob/main/specs/611-virtual-memory-diff.pdf
> 4.4.1
> The draft supports custom attribute bits in PTE.
> 

Not only does it not support custom attributes on PTEs:

"Bits63–54 are reserved for future standard use and must be zeroed by 
software for forward compatibility."

It also goes further to say that:

"if any of these bits are set, a page-fault exception is raised"
Guo Ren June 7, 2021, 12:04 a.m. UTC | #31
On Mon, Jun 7, 2021 at 2:14 AM Nick Kossifidis <mick@ics.forth.gr> wrote:
>
> On 2021-05-20 04:45, Guo Ren wrote:
> > On Wed, May 19, 2021 at 2:53 PM Christoph Hellwig <hch@lst.de> wrote:
> >>
> >> On Tue, May 18, 2021 at 11:44:35PM -0700, Drew Fustini wrote:
> >> > This patch series looks like it might be useful for the StarFive JH7100
> >> > [1] [2] too as it has peripherals on a non-coherent interconnect. GMAC,
> >> > USB and SDIO require that the L2 cache must be manually flushed after
> >> > DMA operations if the data is intended to be shared with U74 cores [2].
> >>
> >> Not too much, given that the SiFive lineage CPUs have an uncached
> >> window, that is a totally different way to allocate uncached memory.
> > It's a very big MIPS smell. What's the attribute of the uncached
> > window? (uncached + strong-order/ uncached + weak, most vendors still
> > use AXI interconnect, how to deal with a bufferable attribute?) In
> > fact, customers' drivers use different ways to deal with DMA memory in
> > non-coherent SOC. Most riscv SOC vendors are from ARM, so giving them
> > the same way in DMA memory is a smart choice. So using PTE attributes
> > is more suitable.
> >
> > See:
> > https://github.com/riscv/virtual-memory/blob/main/specs/611-virtual-memory-diff.pdf
> > 4.4.1
> > The draft supports custom attribute bits in PTE.
> >
>
> Not only it doesn't support custom attributes on PTEs:
>
> "Bits63–54 are reserved for future standard use and must be zeroed by
> software for forward compatibility."
>
> It also goes further to say that:
>
> "if any of these bits are set, a page-fault exception is raised"

In the RISC-V VM TG, a C-bit discussion has been raised, so it's a
common idea to support it.

Letting Linux support custom PTE attributes won't have any side effects
in practice.

IMO:
We needn't waste a bit in the PTE, but the idea of custom attributes in
the reserved PTE bits is necessary, because Allwinner D1 needs custom
bits in the reserved PTE bits as a workaround.
So I recommend just removing the "C" bit from the PTE, but allowing
vendors to define their own PTE attributes in the reserved bits. I've
found a way to handle the different PTE attributes of different vendors
during the Linux boot stage. That means we could still use one image
for all vendors in Linux.
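A minimal sketch of that boot-stage idea (every identifier below is
invented for illustration; this is not code from this series): identify
the vendor early and record its non-standard attribute bits as a mask
that the rest of the kernel can OR into PTE values where a non-cached
mapping is wanted.

#include <linux/init.h>
#include <linux/cache.h>

/* Made-up vendor ID and attribute mask, for illustration only. */
#define EXAMPLE_VENDOR_ID	0x123UL
#define EXAMPLE_PTE_IO_BITS	(0x3UL << 59)

unsigned long riscv_pte_io_mask __ro_after_init;

static void __init riscv_setup_vendor_pte_bits(unsigned long mvendorid)
{
	switch (mvendorid) {
	case EXAMPLE_VENDOR_ID:
		/* vendor-defined "non-cacheable" attribute bits */
		riscv_pte_io_mask = EXAMPLE_PTE_IO_BITS;
		break;
	default:
		/* standard behaviour: no extra bits */
		riscv_pte_io_mask = 0;
		break;
	}
}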
Nick Kossifidis June 7, 2021, 2:16 a.m. UTC | #32
On 2021-06-07 03:04, Guo Ren wrote:
> On Mon, Jun 7, 2021 at 2:14 AM Nick Kossifidis <mick@ics.forth.gr> 
> wrote:
>> 
>> On 2021-05-20 04:45, Guo Ren wrote:
>> > On Wed, May 19, 2021 at 2:53 PM Christoph Hellwig <hch@lst.de> wrote:
>> >>
>> >> On Tue, May 18, 2021 at 11:44:35PM -0700, Drew Fustini wrote:
>> >> > This patch series looks like it might be useful for the StarFive JH7100
>> >> > [1] [2] too as it has peripherals on a non-coherent interconnect. GMAC,
>> >> > USB and SDIO require that the L2 cache must be manually flushed after
>> >> > DMA operations if the data is intended to be shared with U74 cores [2].
>> >>
>> >> Not too much, given that the SiFive lineage CPUs have an uncached
>> >> window, that is a totally different way to allocate uncached memory.
>> > It's a very big MIPS smell. What's the attribute of the uncached
>> > window? (uncached + strong-order/ uncached + weak, most vendors still
>> > use AXI interconnect, how to deal with a bufferable attribute?) In
>> > fact, customers' drivers use different ways to deal with DMA memory in
>> > non-coherent SOC. Most riscv SOC vendors are from ARM, so giving them
>> > the same way in DMA memory is a smart choice. So using PTE attributes
>> > is more suitable.
>> >
>> > See:
>> > https://github.com/riscv/virtual-memory/blob/main/specs/611-virtual-memory-diff.pdf
>> > 4.4.1
>> > The draft supports custom attribute bits in PTE.
>> >
>> 
>> Not only it doesn't support custom attributes on PTEs:
>> 
>> "Bits63–54 are reserved for future standard use and must be zeroed by
>> software for forward compatibility."
>> 
>> It also goes further to say that:
>> 
>> "if any of these bits are set, a page-fault exception is raised"
> 
> In RISC-V VM TG, A C-bit discussion is raised. So it's a comm idea to
> support it.
> 

The C-bit was recently dropped; there is a new proposal for Page Based 
Memory Attributes (PBMT) that we can work on / push for.

> Let Linux support custom PTE attributes won't get any side effect in 
> practice.
> 
> IMO:
> We needn't waste a bit in PTE, but the custom idea in PTE reserved
> bits is necessary. Because Allwinner D1 needs custom PTE bits in
> reserved bits to work around.
> So I recommend just remove the "C" bit in PTE, but allow vendors to
> define their own PTE attributes in reserved bits. I've found a way to
> compact different PTE attributes of different vendors during the Linux
> boot stage. That means we still could use One Image for all vendors in
> Linux

The spec is clear: those attributes are for standard use only, not for 
custom/platform use. It's one thing to implement custom CMOs where the 
ISA doesn't have anything for it and doesn't prevent you from doing so 
(so you are not violating anything, it's just a custom extension), and 
we can hide them behind SBI calls etc., and another to violate the 
current Privilege Spec by using bits in PTEs that are reserved for 
standard use only. The intentions of the VM TG are clear: not only are 
those bits reserved, but if software sets them the hw will raise a page 
fault in future revisions of the spec. What's the idea here, to support 
non-compliant implementations in mainline? I'm sure you have a good 
idea of how to make this work, but as long as it violates the spec it 
can't go in IMHO.
Guo Ren June 7, 2021, 3:19 a.m. UTC | #33
On Mon, Jun 7, 2021 at 10:16 AM Nick Kossifidis <mick@ics.forth.gr> wrote:
>
> On 2021-06-07 03:04, Guo Ren wrote:
> > On Mon, Jun 7, 2021 at 2:14 AM Nick Kossifidis <mick@ics.forth.gr>
> > wrote:
> >>
> >> On 2021-05-20 04:45, Guo Ren wrote:
> >> > On Wed, May 19, 2021 at 2:53 PM Christoph Hellwig <hch@lst.de> wrote:
> >> >>
> >> >> On Tue, May 18, 2021 at 11:44:35PM -0700, Drew Fustini wrote:
> >> >> > This patch series looks like it might be useful for the StarFive JH7100
> >> >> > [1] [2] too as it has peripherals on a non-coherent interconnect. GMAC,
> >> >> > USB and SDIO require that the L2 cache must be manually flushed after
> >> >> > DMA operations if the data is intended to be shared with U74 cores [2].
> >> >>
> >> >> Not too much, given that the SiFive lineage CPUs have an uncached
> >> >> window, that is a totally different way to allocate uncached memory.
> >> > It's a very big MIPS smell. What's the attribute of the uncached
> >> > window? (uncached + strong-order/ uncached + weak, most vendors still
> >> > use AXI interconnect, how to deal with a bufferable attribute?) In
> >> > fact, customers' drivers use different ways to deal with DMA memory in
> >> > non-coherent SOC. Most riscv SOC vendors are from ARM, so giving them
> >> > the same way in DMA memory is a smart choice. So using PTE attributes
> >> > is more suitable.
> >> >
> >> > See:
> >> > https://github.com/riscv/virtual-memory/blob/main/specs/611-virtual-memory-diff.pdf
> >> > 4.4.1
> >> > The draft supports custom attribute bits in PTE.
> >> >
> >>
> >> Not only it doesn't support custom attributes on PTEs:
> >>
> >> "Bits63–54 are reserved for future standard use and must be zeroed by
> >> software for forward compatibility."
> >>
> >> It also goes further to say that:
> >>
> >> "if any of these bits are set, a page-fault exception is raised"
> >
> > In RISC-V VM TG, A C-bit discussion is raised. So it's a comm idea to
> > support it.
> >
>
> The C-bit was recently dropped, there is a new proposal for Page Based
> Memory Attributes (PBMT) that we can work on / push for.
The C-bit still needs discussion; we shouldn't drop it outright.

>
> > Let Linux support custom PTE attributes won't get any side effect in
> > practice.
> >
> > IMO:
> > We needn't waste a bit in PTE, but the custom idea in PTE reserved
> > bits is necessary. Because Allwinner D1 needs custom PTE bits in
> > reserved bits to work around.
> > So I recommend just remove the "C" bit in PTE, but allow vendors to
> > define their own PTE attributes in reserved bits. I've found a way to
> > compact different PTE attributes of different vendors during the Linux
> > boot stage. That means we still could use One Image for all vendors in
> > Linux
>
> The spec is clear, those attributes are for standard use only, not for
> custom/platform use. It's one thing to implement custom CMOs where the
> ISA doesn't have anything for it and doesn't prevent you to do so (so
> you are not violating anything, it's just a custom extension), and we
> can hide them behind SBI calls etc, and another to violate the current
> Privilege Spec by using bits on PTEs that are reserved for standard use
> only. The intentions of the VM TG are clear, not only those bits are
> reserved but if software uses them the hw will result a page fault in
> future revisions of the spec. What's the idea here, to support
> non-compliant implementations on mainline ?
Raising a page fault won't solve anything. We still need to access the
page with proper performance.

> I'm sure you have a good
> idea on how to make this work, but as long as it violates the spec it
> can't go in IMHO.

We need PTEs to provide a non-coherency solution; CMO instructions
alone are not enough, and we can't modify so many Linux drivers to cope
with it.
From the Linux non-coherency point of view, we need:
 - Non-cache + strong-order PTE attributes to deal with drivers' DMA
descriptors (a rough sketch of the mapping types follows this list)
 - Non-cache + weak-order PTE attributes to deal with framebuffer drivers
 - CMO dma_sync to sync the cache with DMA devices
 - A userspace icache_sync solution, which avoids calls into S-mode with
IPI fence.i (necessary for JIT/Java scenarios)
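The first two items boil down to something like the sketch below,
assuming vendor-specific attribute masks along the lines discussed
above (the three _PAGE_* names are invented, not bits from any spec;
pgprot_noncached()/pgprot_writecombine() are the existing generic hooks
an architecture can override):

/* Strongly ordered, non-cacheable: DMA descriptors, device memory. */
#define pgprot_noncached(prot)						\
	__pgprot((pgprot_val(prot) & ~_PAGE_VENDOR_CACHE_MASK) |	\
		 _PAGE_VENDOR_IO_STRONG)

/* Weakly ordered, non-cacheable (write-combining): framebuffers. */
#define pgprot_writecombine(prot)					\
	__pgprot((pgprot_val(prot) & ~_PAGE_VENDOR_CACHE_MASK) |	\
		 _PAGE_VENDOR_IO_WEAK)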

None of the above is in the spec, but the real chips are already done.
(Actually, these have been talked about for more than five years, and we
still don't have a uniform approach.)

The idea of the C-bit is really important for us, because it would
prevent our chips from violating the spec.
Anup Patel June 7, 2021, 3:38 a.m. UTC | #34
> 
> Hi Anup and Atish,
> 
> On Thu, Jun 3, 2021 at 2:00 PM Anup Patel <Anup.Patel@wdc.com> wrote:
> >
> >
> >
> > >
> > > On Sat, 29 May 2021 17:30:18 PDT (-0700), Palmer Dabbelt wrote:
> > > > On Fri, 21 May 2021 17:36:08 PDT (-0700), guoren@kernel.org wrote:
> > > >> On Wed, May 19, 2021 at 3:15 PM Anup Patel <anup@brainfault.org>
> > > wrote:
> > > >>>
> > > >>> On Wed, May 19, 2021 at 12:24 PM Drew Fustini
> > > <drew@beagleboard.org> wrote:
> > > >>> >
> > > >>> > On Wed, May 19, 2021 at 08:06:17AM +0200, Christoph Hellwig
> > > wrote:
> > > >>> > > On Wed, May 19, 2021 at 02:05:00PM +0800, Guo Ren wrote:
> > > >>> > > > Since the existing RISC-V ISA cannot solve this problem,
> > > >>> > > > it is better to provide some configuration for the SOC
> > > >>> > > > vendor to
> > > customize.
> > > >>> > >
> > > >>> > > We've been talking about this problem for close to five years.
> > > >>> > > So no, if you don't manage to get the feature into the ISA
> > > >>> > > it can't be supported.
> > > >>> >
> > > >>> > Isn't it a good goal for Linux to support the capabilities
> > > >>> > present in the SoC that a currently being fab'd?
> > > >>> >
> > > >>> > I believe the CMO group only started last year [1] so the
> > > >>> > RV64GC SoCs that are going into mass production this year
> > > >>> > would not have had the opporuntiy of utilizing any RISC-V ISA
> > > >>> > extension for handling cache management.
> > > >>>
> > > >>> The current Linux RISC-V policy is to only accept patches for
> > > >>> frozen or ratified ISA specs.
> > > >>> (Refer, Documentation/riscv/patch-acceptance.rst)
> > > >>>
> > > >>> This means even if emulate CMO instructions in OpenSBI, the
> > > >>> Linux patches won't be taken by Palmer because CMO specification
> > > >>> is still in draft stage.
> > > >> Before CMO specification release, could we use a sbi_ecall to
> > > >> solve the current problem? This is not against the specification,
> > > >> when CMO is ready we could let users choose to use the new CMO in
> Linux.
> > > >>
> > > >> From a tech view, CMO trap emulation is the same as sbi_ecall.
> > > >>
> > > >>>
> > > >>> Also, we all know how much time it takes for RISCV international
> > > >>> to freeze some spec. Judging by that we are looking at another
> > > >>> 3-4 years at minimum.
> > > >
> > > > Sorry for being slow here, this thread got buried.
> > > >
> > > > I've been trying to work with a handful of folks at the RISC-V
> > > > foundation to try and get a subset of the various in-development
> > > > specifications (some simple CMOs, something about non-caching in
> > > > the page tables, and some way to prevent speculative accesse from
> > > > generating coherence traffic that will break non-coherent systems).
> > > > I'm not sure we can get this together quickly, but I'd prefer to
> > > > at least try before we jump to taking vendor-specificed behavior here.
> > > > It's obviously an up-hill battle to try and get specifications
> > > > through the process and I'm certainly not going to promise it will
> > > > work, but I'm hoping that the impending need to avoid forking the
> > > > ISA will be sufficient to get people behind producing some
> > > > specifications in a timely
> > > fashion.
> > > >
> > > > I wasn't aware than this chip had non-coherent devices until I saw
> > > > this thread, so we'd been mostly focused on the Beagle V chip.
> > > > That was in a sense an easier problem because the SiFive IP in it
> > > > was never designed to have non-coherent devices so we'd have to
> > > > make anything work via a series of slow workarounds, which would
> > > > make emulating the eventually standardized behavior reasonable in
> > > > terms of performance (ie, everything would be super slow so who really
> cares).
> > > >
> > > > I don't think relying on some sort of SBI call for the CMOs whould
> > > > be such a performance hit that it would prevent these systems from
> > > > being viable, but assuming you have reasonable performance on your
> > > > non-cached accesses then that's probably not going to be viable to
> > > > trap and emulate.  At that point it really just becomes silly to
> > > > pretend that we're still making things work by emulating the
> > > > eventually ratified behavior, as anyone who actually tries to use
> > > > this thing to do IO would need out of tree patches.  I'm not sure
> > > > exactly what the plan is for the page table bits in the
> > > > specification right now, but if you can give me a pointer to some
> > > > documentation then I'm happy to try and push for something
> compatible.
> > > >
> > > > If we can't make the process work at the foundation then I'd be
> > > > strongly in favor of just biting the bullet and starting to take
> > > > vendor-specific code that's been implemented in hardware and is
> > > > necessarry to make things work acceptably.  That's obviously a
> > > > sub-optimal solution as it'll lead to a bunch of ISA
> > > > fragmentation, but at least we'll be able to keep the software stack
> together.
> > > >
> > > > Can you tell us when these will be in the hands of users?  That's
> > > > pretty important here, as I don't want to be blocking real users
> > > > from having their hardware work.  IIRC there were some plans to
> > > > distribute early boards, but it looks like the foundation got
> > > > involved and I guess I lost the thread at that point.
> > > >
> > > > Sorry this is all such a headache, but hopefully we can get things
> > > > sorted out.
> > >
> > > I talked with some of the RISC-V foundation folks, we're not going
> > > to have an ISA specification for the non-coherent stuff any time
> > > soon.  I took a look at this code and I definately don't want to
> > > take it as is, but I'm not opposed to taking something that makes the
> hardware work as long as it's a lot cleaner.
> > > We've already got two of these non-coherent chips, I'm sure more
> > > will come, and I'd rather have the extra headaches than make
> > > everyone fork the software stack.
> >
> > Thanks for confirming. The CMO extension is still in early stages so
> > it will certainly take more time for them. After CMO extension is
> > finalized, it will take some more time to have actual RISC-V platforms with
> CMO implemented.
> >
> > >
> > > After talking to Atish it looks like there's likely to be an SBI
> > > extension to handle the CMOs, which should let us avoid the bulk of
> > > the vendor-specific behavior in the kernel.  I know some people are
> > > worried about adding to the SBI surface.  I'm worried about that
> > > too, but that's way better than sticking a bunch of vendor-specific
> > > instructions into the kernel.  The SBI extension should make for a
> > > straight-forward cache flush implementation in Linux, so let's just plan on
> that getting through quickly (as has been done before).
> >
> > Yes, I agree. We can have just a single SBI call which is meant for
> > DMA sync purpose only which means it will flush/invalidate pages from
> > all cache levels irrespective of the cache hierarchy (i.e.
> > flush/invalidate to RAM). The CMO extension might more generic cache
> > operations which can target any cache level.
> >
> > I am already preparing a write-up for SBI DMA sync call in SBI spec. I
> > will share it with you separately as well.
> >
> > >
> > > Unfortunately we've yet to come up with a way to handle the
> > > non-cacheable mappings without introducing a degree of
> > > vendor-specific behavior or seriously impacting performance (mark
> > > them as not valid and deal with them in the trap handler).  I'm not
> > > really sure it counts as supporting the hardware if it's massively
> > > slow, so that really leaves us with vendor-specific mappings as the only
> option to make these chips work.
> >
> > A RISC-V platform can have non-cacheable mappings is following
> > possible
> > ways:
> > 1) Fixed physical address range as non-cacheable using PMAs
> > 2) Custom page table attributes
> > 3) Svpmbt extension being defined by RVI
> >
> > Atish and me both think it is possible to have RISC-V specific DMA ops
> > implementation which can handle all above case. Atish is already
> > working on DMA ops implementation for RISC-V.
> Not only DMA ops, but also icache_sync & __vdso_icache_sync. Please have a
> look at:
> https://lore.kernel.org/linux-riscv/1622970249-50770-12-git-send-email-
> guoren@kernel.org/T/#u

The icache_sync and __vdso_icache_sync will have to be addressed
differently. The SBI DMA sync extension cannot address this.

It seems Allwinner D1 has more non-standard stuff:
1) Custom PTE bits for IO-coherent access
2) Custom data cache flush/invalidate for DMA sync
3) Custom icache flush/invalidate

On the other hand, BeagleV has only two problems:
1) Custom physical address range for IO-coherent access
2) Custom L2 cache flush/invalidate for DMA sync

From the above, #2 can be solved by an SBI DMA sync call and
Linux DMA ops for both BeagleV and Allwinner D1.
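To make that concrete, here is a bare-bones sketch of the Linux side
(this is not Atish's actual implementation; the SBI extension and
function IDs are placeholders, since the extension is still being
written):

#include <linux/dma-direction.h>
#include <linux/dma-map-ops.h>
#include <asm/sbi.h>

/* Placeholder IDs: the real SBI DMA sync extension is not defined yet. */
#define SBI_EXT_DMA_SYNC	0x444D4153
#define SBI_DMA_SYNC		0x0

static void sbi_dma_sync(phys_addr_t paddr, size_t size,
			 enum dma_data_direction dir)
{
	/* Ask the SBI implementation to flush/invalidate to RAM. */
	sbi_ecall(SBI_EXT_DMA_SYNC, SBI_DMA_SYNC, paddr, size, dir, 0, 0, 0);
}

void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
			      enum dma_data_direction dir)
{
	sbi_dma_sync(paddr, size, dir);
}

void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
			   enum dma_data_direction dir)
{
	sbi_dma_sync(paddr, size, dir);
}

With those hooks in place, the generic dma-direct code takes care of
calling them for non-coherent devices.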

On BeagleV, issue #1 can be solved using "dma-ranges".

On Allwinner D1, issues #1 and #3 need to be addressed
separately.

I think supporting BeagleV in upstream Linux is relatively
easy compared to Allwinner D1.

@Guo, please check if you can reserve a dedicated
physical address range for IO-coherent access (just like
BeagleV). If yes, then we can tackle issue #1 for Allwinner
D1 using the "dma-ranges" DT property.

Regards,
Anup

> 
> 
> >
> > >
> > > This implementation, which adds some Kconfig entries that control
> > > page table bits, definately isn't suitable for upstream.  Allowing
> > > users to set arbitrary page table bits will eventually conflict with
> > > the standard, and is just going to be a mess.  It'll also lead to
> > > kernels that are only compatible with specific designs, which we're
> > > trying very hard to avoid.  At a bare minimum we'll need some way to
> > > detect systems with these page table bits before setting them, and
> > > some description of what the bits actually do so we can reason about
> them.
> >
> > Yes, vendor specific Kconfig options are strict NO NO. We can't
> > give-up the goal of unified kernel image for all platforms.
> >
> > Regards,
> > Anup
> 
> 
> 
Guo Ren June 7, 2021, 4:22 a.m. UTC | #35
Hi Anup,

On Mon, Jun 7, 2021 at 11:38 AM Anup Patel <Anup.Patel@wdc.com> wrote:
>
>
>
> >
> > Hi Anup and Atish,
> >
> > On Thu, Jun 3, 2021 at 2:00 PM Anup Patel <Anup.Patel@wdc.com> wrote:
> > >
> > >
> > >
> > > >
> > > > On Sat, 29 May 2021 17:30:18 PDT (-0700), Palmer Dabbelt wrote:
> > > > > On Fri, 21 May 2021 17:36:08 PDT (-0700), guoren@kernel.org wrote:
> > > > >> On Wed, May 19, 2021 at 3:15 PM Anup Patel <anup@brainfault.org>
> > > > wrote:
> > > > >>>
> > > > >>> On Wed, May 19, 2021 at 12:24 PM Drew Fustini
> > > > <drew@beagleboard.org> wrote:
> > > > >>> >
> > > > >>> > On Wed, May 19, 2021 at 08:06:17AM +0200, Christoph Hellwig
> > > > wrote:
> > > > >>> > > On Wed, May 19, 2021 at 02:05:00PM +0800, Guo Ren wrote:
> > > > >>> > > > Since the existing RISC-V ISA cannot solve this problem,
> > > > >>> > > > it is better to provide some configuration for the SOC
> > > > >>> > > > vendor to
> > > > customize.
> > > > >>> > >
> > > > >>> > > We've been talking about this problem for close to five years.
> > > > >>> > > So no, if you don't manage to get the feature into the ISA
> > > > >>> > > it can't be supported.
> > > > >>> >
> > > > >>> > Isn't it a good goal for Linux to support the capabilities
> > > > >>> > present in the SoC that a currently being fab'd?
> > > > >>> >
> > > > >>> > I believe the CMO group only started last year [1] so the
> > > > >>> > RV64GC SoCs that are going into mass production this year
> > > > >>> > would not have had the opporuntiy of utilizing any RISC-V ISA
> > > > >>> > extension for handling cache management.
> > > > >>>
> > > > >>> The current Linux RISC-V policy is to only accept patches for
> > > > >>> frozen or ratified ISA specs.
> > > > >>> (Refer, Documentation/riscv/patch-acceptance.rst)
> > > > >>>
> > > > >>> This means even if emulate CMO instructions in OpenSBI, the
> > > > >>> Linux patches won't be taken by Palmer because CMO specification
> > > > >>> is still in draft stage.
> > > > >> Before CMO specification release, could we use a sbi_ecall to
> > > > >> solve the current problem? This is not against the specification,
> > > > >> when CMO is ready we could let users choose to use the new CMO in
> > Linux.
> > > > >>
> > > > >> From a tech view, CMO trap emulation is the same as sbi_ecall.
> > > > >>
> > > > >>>
> > > > >>> Also, we all know how much time it takes for RISCV international
> > > > >>> to freeze some spec. Judging by that we are looking at another
> > > > >>> 3-4 years at minimum.
> > > > >
> > > > > Sorry for being slow here, this thread got buried.
> > > > >
> > > > > I've been trying to work with a handful of folks at the RISC-V
> > > > > foundation to try and get a subset of the various in-development
> > > > > specifications (some simple CMOs, something about non-caching in
> > > > > the page tables, and some way to prevent speculative accesse from
> > > > > generating coherence traffic that will break non-coherent systems).
> > > > > I'm not sure we can get this together quickly, but I'd prefer to
> > > > > at least try before we jump to taking vendor-specificed behavior here.
> > > > > It's obviously an up-hill battle to try and get specifications
> > > > > through the process and I'm certainly not going to promise it will
> > > > > work, but I'm hoping that the impending need to avoid forking the
> > > > > ISA will be sufficient to get people behind producing some
> > > > > specifications in a timely
> > > > fashion.
> > > > >
> > > > > I wasn't aware than this chip had non-coherent devices until I saw
> > > > > this thread, so we'd been mostly focused on the Beagle V chip.
> > > > > That was in a sense an easier problem because the SiFive IP in it
> > > > > was never designed to have non-coherent devices so we'd have to
> > > > > make anything work via a series of slow workarounds, which would
> > > > > make emulating the eventually standardized behavior reasonable in
> > > > > terms of performance (ie, everything would be super slow so who really
> > cares).
> > > > >
> > > > > I don't think relying on some sort of SBI call for the CMOs whould
> > > > > be such a performance hit that it would prevent these systems from
> > > > > being viable, but assuming you have reasonable performance on your
> > > > > non-cached accesses then that's probably not going to be viable to
> > > > > trap and emulate.  At that point it really just becomes silly to
> > > > > pretend that we're still making things work by emulating the
> > > > > eventually ratified behavior, as anyone who actually tries to use
> > > > > this thing to do IO would need out of tree patches.  I'm not sure
> > > > > exactly what the plan is for the page table bits in the
> > > > > specification right now, but if you can give me a pointer to some
> > > > > documentation then I'm happy to try and push for something
> > compatible.
> > > > >
> > > > > If we can't make the process work at the foundation then I'd be
> > > > > strongly in favor of just biting the bullet and starting to take
> > > > > vendor-specific code that's been implemented in hardware and is
> > > > > necessary to make things work acceptably.  That's obviously a
> > > > > sub-optimal solution as it'll lead to a bunch of ISA
> > > > > fragmentation, but at least we'll be able to keep the software stack
> > together.
> > > > >
> > > > > Can you tell us when these will be in the hands of users?  That's
> > > > > pretty important here, as I don't want to be blocking real users
> > > > > from having their hardware work.  IIRC there were some plans to
> > > > > distribute early boards, but it looks like the foundation got
> > > > > involved and I guess I lost the thread at that point.
> > > > >
> > > > > Sorry this is all such a headache, but hopefully we can get things
> > > > > sorted out.
> > > >
> > > > I talked with some of the RISC-V foundation folks, we're not going
> > > > to have an ISA specification for the non-coherent stuff any time
> > > > soon.  I took a look at this code and I definitely don't want to
> > > > take it as is, but I'm not opposed to taking something that makes the
> > hardware work as long as it's a lot cleaner.
> > > > We've already got two of these non-coherent chips, I'm sure more
> > > > will come, and I'd rather have the extra headaches than make
> > > > everyone fork the software stack.
> > >
> > > Thanks for confirming. The CMO extension is still in early stages so
> > > it will certainly take more time for them. After CMO extension is
> > > finalized, it will take some more time to have actual RISC-V platforms with
> > CMO implemented.
> > >
> > > >
> > > > After talking to Atish it looks like there's likely to be an SBI
> > > > extension to handle the CMOs, which should let us avoid the bulk of
> > > > the vendor-specific behavior in the kernel.  I know some people are
> > > > worried about adding to the SBI surface.  I'm worried about that
> > > > too, but that's way better than sticking a bunch of vendor-specific
> > > > instructions into the kernel.  The SBI extension should make for a
> > > > straight-forward cache flush implementation in Linux, so let's just plan on
> > that getting through quickly (as has been done before).
> > >
> > > Yes, I agree. We can have just a single SBI call which is meant for
> > > DMA sync purpose only which means it will flush/invalidate pages from
> > > all cache levels irrespective of the cache hierarchy (i.e.
> > > flush/invalidate to RAM). The CMO extension might provide more generic cache
> > > operations which can target any cache level.
> > >
> > > I am already preparing a write-up for SBI DMA sync call in SBI spec. I
> > > will share it with you separately as well.
> > >
> > > >
> > > > Unfortunately we've yet to come up with a way to handle the
> > > > non-cacheable mappings without introducing a degree of
> > > > vendor-specific behavior or seriously impacting performance (mark
> > > > them as not valid and deal with them in the trap handler).  I'm not
> > > > really sure it counts as supporting the hardware if it's massively
> > > > slow, so that really leaves us with vendor-specific mappings as the only
> > option to make these chips work.
> > >
> > > A RISC-V platform can have non-cacheable mappings in the following
> > > possible ways:
> > > 1) Fixed physical address range as non-cacheable using PMAs
> > > 2) Custom page table attributes
> > > 3) Svpbmt extension being defined by RVI
> > >
> > > Atish and I both think it is possible to have a RISC-V specific DMA ops
> > > implementation which can handle all the above cases. Atish is already
> > > working on DMA ops implementation for RISC-V.
> > Not only DMA ops, but also icache_sync & __vdso_icache_sync. Please have a
> > look at:
> > https://lore.kernel.org/linux-riscv/1622970249-50770-12-git-send-email-
> > guoren@kernel.org/T/#u
>
> The icache_sync and __vdso_icache_sync will have to be addressed
> differently. The SBI DMA sync extension cannot address this.
Agree
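
(For concreteness, a minimal sketch of what an SBI-backed DMA sync path could
look like on the Linux side. SBI_EXT_DMA_SYNC, SBI_DMA_SYNC_FID and the
extension ID value are hypothetical placeholders for the extension that is
still being written up; they are not part of any ratified SBI spec.)

/*
 * Hedged sketch only: the SBI extension/function IDs below are made up;
 * the real DMA sync extension was still a draft at the time of this thread.
 */
#include <linux/dma-direction.h>
#include <linux/dma-map-ops.h>
#include <linux/types.h>
#include <asm/sbi.h>

#define SBI_EXT_DMA_SYNC        0x0A000001      /* hypothetical extension ID */
#define SBI_DMA_SYNC_FID        0               /* hypothetical function ID  */

static void sbi_dma_sync(phys_addr_t paddr, size_t size,
                         enum dma_data_direction dir)
{
        /* One ecall flushes/invalidates the range through all cache levels. */
        sbi_ecall(SBI_EXT_DMA_SYNC, SBI_DMA_SYNC_FID, paddr, size, dir, 0, 0, 0);
}

void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
                              enum dma_data_direction dir)
{
        sbi_dma_sync(paddr, size, dir);
}

void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
                           enum dma_data_direction dir)
{
        sbi_dma_sync(paddr, size, dir);
}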

>
> It seems Allwinner D1 has more non-standard stuff:
> 1) Custom PTE bits for IO-coherent access
> 2) Custom data cache flush/invalidate for DMA sync
> 3) Custom icache flush/invalidate
Yes, but 3) is a performance optimization, not critical for running.

>
> On the other hand, BeagleV has only two problems:
> 1) Custom physical address range for IO-coherent access
> 2) Custom L2 cache flush/invalidate for DMA sync
https://github.com/starfive-tech/linux/commit/d4c4044c08134dca8e5eaaeb6d3faf97dc453b6d

Currently, they still use DMA sync with the DMA descriptors; are you sure
they have a mirror memory physical address range?

>
> From the above, #2 can be solved by the SBI DMA sync call and
> Linux DMA ops for both BeagleV and Allwinner D1
>
> On BeagleV, issue #1 can be solved using "dma-ranges".
>
> On Allwinner D1, issues #1 and #3 need to be addressed
> separately.
>
> I think supporting BeagleV in upstream Linux is relatively
> easy compared to Allwinner D1.
>
> @Guo, please check if you can reserve dedicated
> physical address range for IO-coherent access (just like
> BeagleV). If yes, then we can tackle issue #1 for Allwinner
> D1 using "dma-ranges" DT property.
There is no dedicated physical address range for IO-coherent access in
D1. But the solution you mentioned couldn't solve all the requirements.
Only one mirror physical address range is not enough; we need at least
three (normal, DMA descriptors, frame buffer).
And that would triple the physical address space, which can't be
accepted by our users from a hardware design cost point of view.

 "dma-ranges" DT property is a big early MIPS smell. ARM SOC users
can't accept it. (They just say replace the CPU, but don't touch
anything other.)

PTE attributes have been the non-coherent solution for many years. MIPS
also follows that now; see arch/mips/include/asm/pgtable-bits.h &
arch/mips/include/asm/pgtable.h:

#ifndef _CACHE_CACHABLE_NO_WA
#define _CACHE_CACHABLE_NO_WA           (0<<_CACHE_SHIFT)
#endif
#ifndef _CACHE_CACHABLE_WA
#define _CACHE_CACHABLE_WA              (1<<_CACHE_SHIFT)
#endif
#ifndef _CACHE_UNCACHED
#define _CACHE_UNCACHED                 (2<<_CACHE_SHIFT)
#endif
#ifndef _CACHE_CACHABLE_NONCOHERENT
#define _CACHE_CACHABLE_NONCOHERENT     (3<<_CACHE_SHIFT)
#endif
#ifndef _CACHE_CACHABLE_CE
#define _CACHE_CACHABLE_CE              (4<<_CACHE_SHIFT)
#endif
#ifndef _CACHE_CACHABLE_COW
#define _CACHE_CACHABLE_COW             (5<<_CACHE_SHIFT)
#endif
#ifndef _CACHE_CACHABLE_CUW
#define _CACHE_CACHABLE_CUW             (6<<_CACHE_SHIFT)
#endif
#ifndef _CACHE_UNCACHED_ACCELERATED
#define _CACHE_UNCACHED_ACCELERATED     (7<<_CACHE_SHIFT)
#endif

We can't force our users to double/triplicate their physical memory regions.
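
(To make the comparison concrete, the PTE-attribute approach boils down to
something like the sketch below, i.e. what an arch's asm/pgtable.h override
would look like. The _PAGE_VENDOR_* names and the bit positions are
hypothetical, and today's privileged spec reserves these high PTE bits.)

/*
 * Hedged illustration only: hypothetical vendor PTE attribute bits and the
 * pgprot helpers an arch header could build from them.
 */
#include <linux/pgtable.h>

#define _PAGE_VENDOR_NOCACHE    (1UL << 62)     /* hypothetical: bypass cache    */
#define _PAGE_VENDOR_SO         (1UL << 63)     /* hypothetical: strong ordering */

/* Uncached + strongly ordered, e.g. for DMA descriptor rings. */
#define pgprot_noncached(prot)                                          \
        __pgprot(pgprot_val(prot) | _PAGE_VENDOR_NOCACHE | _PAGE_VENDOR_SO)

/* Uncached + weakly ordered (write-combining), e.g. for a frame buffer. */
#define pgprot_writecombine(prot)                                       \
        __pgprot(pgprot_val(prot) | _PAGE_VENDOR_NOCACHE)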

>
> Regards,
> Anup
>
> >
> >
> > >
> > > >
> > > > This implementation, which adds some Kconfig entries that control
> > > > page table bits, definitely isn't suitable for upstream.  Allowing
> > > > users to set arbitrary page table bits will eventually conflict with
> > > > the standard, and is just going to be a mess.  It'll also lead to
> > > > kernels that are only compatible with specific designs, which we're
> > > > trying very hard to avoid.  At a bare minimum we'll need some way to
> > > > detect systems with these page table bits before setting them, and
> > > > some description of what the bits actually do so we can reason about
> > them.
> > >
> > > Yes, vendor-specific Kconfig options are a strict NO NO. We can't
> > > give up the goal of a unified kernel image for all platforms.
> > >
> > > Regards,
> > > Anup
> >
> >
> >
> > --
> > Best Regards
> >  Guo Ren
> >
> > ML: https://lore.kernel.org/linux-csky/
Anup Patel June 7, 2021, 4:47 a.m. UTC | #36
> -----Original Message-----
> From: Guo Ren <guoren@kernel.org>
> Sent: 07 June 2021 09:52
> To: Anup Patel <Anup.Patel@wdc.com>
> Cc: Atish Patra <atishp@atishpatra.org>; Palmer Dabbelt
> <palmer@dabbelt.com>; anup@brainfault.org; drew@beagleboard.org;
> Christoph Hellwig <hch@lst.de>; wefu@redhat.com; lazyparser@gmail.com;
> linux-riscv@lists.infradead.org; linux-kernel@vger.kernel.org; linux-
> arch@vger.kernel.org; linux-sunxi@lists.linux.dev; guoren@linux.alibaba.com;
> Paul Walmsley <paul.walmsley@sifive.com>
> Subject: Re: [PATCH RFC 0/3] riscv: Add DMA_COHERENT support
> 
> Hi Anup,
> 
> On Mon, Jun 7, 2021 at 11:38 AM Anup Patel <Anup.Patel@wdc.com> wrote:
> >
> >
> >
> > [...]
> > @Guo, please check if you can reserve dedicated physical address range
> > for IO-coherent access (just like BeagleV). If yes, then we can tackle
> > issue #1 for Allwinner
> > D1 using "dma-ranges" DT property.
> There is no dedicated physical address range for IO-coherent access in D1. But
> the solution you mentioned couldn't solve all requirements.
> Only one mirror physical address range is not enough, we need at least three
> (Normal, DMA desc, frame buffer).

How many non-coherent devices do you really have?

I guess a lot of critical devices on Allwinner D1 are not coherent with the CPU.
The problem for Allwinner D1 is even worse than I thought. If such critical
high-throughput devices are not cache coherent with the CPU then I am
speechless about the Allwinner D1 situation.

> And that will triple the memory physical address which can't be accepted by
> our users from the hardware design cost view.
> 
>  "dma-ranges" DT property is a big early MIPS smell. ARM SOC users can't
> accept it. (They just say replace the CPU, but don't touch anything other.)
> 
> PTE attributes are the non-coherent solution for many years. MIPS also
> follows that now:
> ref arch/mips/include/asm/pgtable-bits.h &
> arch/mips/include/asm/pgtable.h

RISC-V is in the process of standardizing the Svpbmt extension.

Unfortunately, the higher-order bits which your implementation uses are
not for SoC vendor use as per the RISC-V privileged spec.

> [...]
> We can't force our users to double/triplicate their physical memory regions.

We are trying to find a workable solution here so that we don't have
to deal with custom PTE attributes, which are reserved for the RISC-V
privileged specification only.

Regards,
Anup

Guo Ren June 7, 2021, 5:08 a.m. UTC | #37
On Mon, Jun 7, 2021 at 12:47 PM Anup Patel <Anup.Patel@wdc.com> wrote:
>
>
>
> > [...]
> > > @Guo, please check if you can reserve dedicated physical address range
> > > for IO-coherent access (just like BeagleV). If yes, then we can tackle
> > > issue #1 for Allwinner
> > > D1 using "dma-ranges" DT property.
> > There is no dedicated physical address range for IO-coherent access in D1. But
> > the solution you mentioned couldn't solve all requirements.
> > Only one mirror physical address range is not enough, we need at least three
> > (Normal, DMA desc, frame buffer).
>
> How many non-coherent devices do you really have?
>
> I guess a lot of critical devices on Allwinner D1 are not coherent with the CPU.
> The problem for Allwinner D1 is even worse than I thought. If such critical
> high-throughput devices are not cache coherent with the CPU then I am
> speechless about the Allwinner D1 situation.
Allwinner D1 is a cost-down product and there is no cache-coherent
device at all. A cache-coherent interconnect would increase the chip
design cost, and the performance is good enough in their scenario.

That is why we need both Strong Order + non-cache and Weak Order +
non-cache attributes for optimization.
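
(In driver terms the two flavours map onto the existing DMA API roughly as in
the sketch below; the device pointer, buffer sizes and function name are
placeholders, and teardown handling is trimmed for brevity.)

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/sizes.h>

/* Sketch only: how a driver asks for the two kinds of non-cached memory. */
static int example_alloc_dma_buffers(struct device *dev)
{
        dma_addr_t desc_dma, fb_dma;
        void *desc, *fb;

        /* DMA descriptor ring: wants uncached + strongly ordered semantics. */
        desc = dma_alloc_coherent(dev, SZ_4K, &desc_dma, GFP_KERNEL);
        if (!desc)
                return -ENOMEM;

        /* Frame buffer: uncached, but write-combining is sufficient. */
        fb = dma_alloc_wc(dev, SZ_1M, &fb_dma, GFP_KERNEL);
        if (!fb) {
                dma_free_coherent(dev, SZ_4K, desc, desc_dma);
                return -ENOMEM;
        }

        return 0;
}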

From the T-HEAD side we could provide two kinds of DMA coherency solutions:
 - Let the SoC vendor update the coherent interconnect; our CPU can
support the coherency protocol.
 - Let the SoC vendor connect their DMA devices to our CPU's LL cache
coherent interface.

But we can't force them to do that. Their attitude is: keep the original
SoC design as it is and just make it work with your RISC-V core. They know
the trade-off between coherency and non-coherency in their business scenario.

>
> > And that will triple the memory physical address which can't be accepted by
> > our users from the hardware design cost view.
> >
> >  "dma-ranges" DT property is a big early MIPS smell. ARM SOC users can't
> > accept it. (They just say replace the CPU, but don't touch anything other.)
> >
> > PTE attributes are the non-coherent solution for many years. MIPS also
> > follows that now:
> > ref arch/mips/include/asm/pgtable-bits.h &
> > arch/mips/include/asm/pgtable.h
>
> RISC-V is in the process of standardizing the Svpbmt extension.
>
> Unfortunately, the higher-order bits which your implementation uses are
> not for SoC vendor use as per the RISC-V privileged spec.
For a while, I had placed my hopes on the C-bit, but that illusion was
shattered. -_-!

>
> >
> > #ifndef _CACHE_CACHABLE_NO_WA
> > #define _CACHE_CACHABLE_NO_WA           (0<<_CACHE_SHIFT)
> > #endif
> > #ifndef _CACHE_CACHABLE_WA
> > #define _CACHE_CACHABLE_WA              (1<<_CACHE_SHIFT)
> > #endif
> > #ifndef _CACHE_UNCACHED
> > #define _CACHE_UNCACHED                 (2<<_CACHE_SHIFT)
> > #endif
> > #ifndef _CACHE_CACHABLE_NONCOHERENT
> > #define _CACHE_CACHABLE_NONCOHERENT     (3<<_CACHE_SHIFT)
> > #endif
> > #ifndef _CACHE_CACHABLE_CE
> > #define _CACHE_CACHABLE_CE              (4<<_CACHE_SHIFT)
> > #endif
> > #ifndef _CACHE_CACHABLE_COW
> > #define _CACHE_CACHABLE_COW             (5<<_CACHE_SHIFT)
> > #endif
> > #ifndef _CACHE_CACHABLE_CUW
> > #define _CACHE_CACHABLE_CUW             (6<<_CACHE_SHIFT)
> > #endif
> > #ifndef _CACHE_UNCACHED_ACCELERATED
> > #define _CACHE_UNCACHED_ACCELERATED     (7<<_CACHE_SHIFT)
> >
> > We can't force our users to double/triplicate their physical memory regions.
>
> We are trying to find a workable solution here so that we don't have
> to deal with custom PTE attributes which are reserved for RISC-V priv
> specification only.
Thank you for your hard work in this regard, sincerely.

>
> Regards,
> Anup
Guo Ren June 7, 2021, 5:13 a.m. UTC | #38


On Mon, Jun 7, 2021 at 12:47 PM Anup Patel <Anup.Patel@wdc.com> wrote:
> > [...]
> > We can't force our users to double/triplicate their physical memory regions.
>
> We are trying to find a workable solution here so that we don't have
> to deal with custom PTE attributes which are reserved for RISC-V priv
> specification only.
What do you think about my new patch for custom PTE attributes?
https://lore.kernel.org/linux-riscv/610849b6f66e8d5a9653c9f62f46c48d@mailhost.ics.forth.gr/T/#mdc0dacba57346b5ac59a01961495c132b93cfcdb

>
> Regards,
> Anup
>
> >
> > >
> > > Regards,
> > > Anup
> > >
> > > >
> > > >
> > > > >
> > > > > >
> > > > > > This implementation, which adds some Kconfig entries that
> > > > > > control page table bits, definitely isn't suitable for upstream.
> > > > > > Allowing users to set arbitrary page table bits will eventually
> > > > > > conflict with the standard, and is just going to be a mess.
> > > > > > It'll also lead to kernels that are only compatible with
> > > > > > specific designs, which we're trying very hard to avoid.  At a
> > > > > > bare minimum we'll need some way to detect systems with these
> > > > > > page table bits before setting them, and some description of
> > > > > > what the bits actually do so we can reason about them.
> > > > >
> > > > > Yes, vendor specific Kconfig options are strict NO NO. We can't
> > > > > give-up the goal of unified kernel image for all platforms.
> > > > >
> > > > > Regards,
> > > > > Anup
> > > >
> > > >
> > > >
> > > > --
> > > > Best Regards
> > > >  Guo Ren
> > > >
> > > > ML: https://lore.kernel.org/linux-csky/
> >
> >
> >
> > --
> > Best Regards
> >  Guo Ren
> >
> > ML: https://lore.kernel.org/linux-csky/
Christoph Hellwig June 7, 2021, 6:27 a.m. UTC | #39
On Mon, Jun 07, 2021 at 11:19:03AM +0800, Guo Ren wrote:
> From the Linux non-coherency point of view, we need:
>  - Non-cache + Strong Order PTE attributes to deal with drivers' DMA descriptors
>  - Non-cache + weak order to deal with framebuffer drivers
>  - CMO dma_sync to sync cache with DMA devices

This is not strictly true.  At the very minimum you only need cache
invalidation and writeback instructions.  For example early parisc
CPUs and some m68knommu SOCs have no support for uncached areas at all,
and Linux works.  But to be fair this is very painful and supports only
very limited peripherals.  So for modern full Linux support some uncached
memory is advisable.  But that doesn't have to be using PTE attributes.
It could also be physical memory regions that are either totally fixed
or somewhat dynamic.
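
For illustration only, a minimal sketch of the arch hooks that this kind of
CMO-based non-coherent DMA support boils down to, assuming nothing more than
writeback and invalidate operations are available. local_cache_wback() and
local_cache_inv() are hypothetical placeholders for whatever cache maintenance
instruction or SBI call a platform provides; this is not code taken from the
series under discussion.

#include <linux/dma-map-ops.h>
#include <linux/types.h>

/* Hypothetical vendor hooks: write back / invalidate a physical range. */
static void local_cache_wback(phys_addr_t paddr, size_t size)
{
	/* vendor cache maintenance instruction or SBI call */
}

static void local_cache_inv(phys_addr_t paddr, size_t size)
{
	/* vendor cache maintenance instruction or SBI call */
}

/* Push CPU-written data out to memory before the device reads it. */
void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
			      enum dma_data_direction dir)
{
	switch (dir) {
	case DMA_TO_DEVICE:
	case DMA_BIDIRECTIONAL:
		local_cache_wback(paddr, size);
		break;
	case DMA_FROM_DEVICE:
		local_cache_inv(paddr, size);
		break;
	default:
		break;
	}
}

/* Drop stale lines so the CPU sees what the device wrote. */
void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
			   enum dma_data_direction dir)
{
	if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
		local_cache_inv(paddr, size);
}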
Guo Ren June 7, 2021, 6:41 a.m. UTC | #40
On Mon, Jun 7, 2021 at 2:27 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Mon, Jun 07, 2021 at 11:19:03AM +0800, Guo Ren wrote:
> > From the Linux non-coherency point of view, we need:
> >  - Non-cache + Strong Order PTE attributes to deal with drivers' DMA descriptors
> >  - Non-cache + weak order to deal with framebuffer drivers
> >  - CMO dma_sync to sync cache with DMA devices
>
> This is not strictly true.  At the very minimum you only need cache
> invalidation and writeback instructions.  For example early parisc
> CPUs and some m68knommu SOCs have no support for uncached areas at all,
> and Linux works.  But to be fair this is very painful and supports only
> very limited peripherals.  So for modern full Linux support some uncached
> memory is advisable.  But that doesn't have to be using PTE attributes.
> It could also be physical memory regions that are either totally fixed
Double/Triple the size of physical memory regions can't be accepted by
SOC vendors, because it wastes HW resources.
Some cost-down soc interconnects only have 32bit~34bit width of
physical address, are you sure you could force them to expand it? (I
can't)

> or somewhat dynamic.
How can HW implement dynamic modification of the PMAs? What's the granularity?
Christoph Hellwig June 7, 2021, 6:51 a.m. UTC | #41
On Mon, Jun 07, 2021 at 02:41:14PM +0800, Guo Ren wrote:
> Double/Triple the size of physical memory regions can't be accepted by
> SOC vendors, because it wastes HW resources.
> Some cost-down soc interconnects only have 32bit~34bit width of
> physical address, are you sure you could force them to expand it? (I
> can't)
> 
> > or somewhat dynamic.
> How can HW implement dynamic modification of the PMAs? What's the granularity?

I'm just stating the requirements from the Linux DMA perspective.  You
also do not need to triple the address space, just double it.
Guo Ren June 7, 2021, 7:46 a.m. UTC | #42
On Mon, Jun 7, 2021 at 2:51 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Mon, Jun 07, 2021 at 02:41:14PM +0800, Guo Ren wrote:
> > Double/Triple the size of physical memory regions can't be accepted by
> > SOC vendors, because it wastes HW resources.
> > Some cost-down soc interconnects only have 32bit~34bit width of
> > physical address, are you sure you could force them to expand it? (I
> > can't)
> >
> > > or somewhat dynamic.
> > How can HW implement dynamic modification of the PMAs? What's the granularity?
>
> I'm just stating the requirements from the Linux DMA perspective.  You
> also do not need to triple the address space, just double it.

With double, you only get "strong order + non-cache" for the DMA
descriptors. What about the write-combine scenario?

Even doubling the physical memory address space wastes HW resources.




--
Best Regards
 Guo Ren

ML: https://lore.kernel.org/linux-csky/
Nick Kossifidis June 7, 2021, 8:35 a.m. UTC | #43
On 2021-06-07 06:19, Guo Ren wrote:
>> The C-bit was recently dropped, there is a new proposal for Page Based
>> Memory Attributes (PBMT) that we can work on / push for.
> C-bit still needs discussion, we shouldn't drop it directly.
> 

You can always participate in the discussion on the virtmem mailing list.

> Raising a page fault won't solve anything. We still need to access the
> page with proper performance.
> 

The point is that future hw implementations will be required to raise a
page fault in case we tamper with those reserved bits; they won't just
ignore them. Supporting custom values there means supporting
non-compliant implementations.

> 
> We need PTEs to provide a non-coherency solution, and only CMO
> instructions are not enough. We can't modify so many Linux drivers to
> fit it.
> From the Linux non-coherency point of view, we need:
>  - Non-cache + Strong Order PTE attributes to deal with drivers' DMA 
> descriptors
>  - Non-cache + weak order to deal with framebuffer drivers
>  - CMO dma_sync to sync cache with DMA devices
>  - A userspace icache_sync solution, which prevents calls to S-mode with
> IPI fence.i. (Necessary for JIT Java scenarios.)
> 
> All of the above are not in the spec, but the real chips are done.
> (Actually, these have been talked about for more than five years, and we
> still haven't reached a uniform idea.)
> 
> The idea of the C-bit is really important for us; it prevents our chips
> from violating the spec.

Have you checked the PBMT proposal? It defines (so far) the following 
attributes that can be set on PTEs to override the PMAs of the 
underlying physical memory:

Bits [62:61]
00 (WB) -> Cacheable, default ordering
01 (NC) -> Noncacheable, default ordering
10 (IO) -> Noncacheable, strong ordering

So it'll cover the use cases you mention.
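
For illustration, a rough sketch of how the draft encoding above could be
expressed as Linux PTE attribute masks. The names (_PAGE_PBMT_*, pgprot_dma_*)
are made up for this sketch; they are not part of the proposal or of
arch/riscv.

#include <linux/pgtable.h>

/* Bits [62:61] of the PTE, per the draft PBMT encoding quoted above. */
#define _PAGE_PBMT_SHIFT	61
#define _PAGE_PBMT_MASK		(3UL << _PAGE_PBMT_SHIFT)
#define _PAGE_PBMT_WB		(0UL << _PAGE_PBMT_SHIFT)	/* cacheable, default ordering */
#define _PAGE_PBMT_NC		(1UL << _PAGE_PBMT_SHIFT)	/* non-cacheable, default ordering */
#define _PAGE_PBMT_IO		(2UL << _PAGE_PBMT_SHIFT)	/* non-cacheable, strong ordering */

/* Weakly ordered uncached mapping, e.g. for a framebuffer. */
#define pgprot_dma_wc(prot) \
	__pgprot((pgprot_val(prot) & ~_PAGE_PBMT_MASK) | _PAGE_PBMT_NC)

/* Strongly ordered uncached mapping, e.g. for DMA descriptors. */
#define pgprot_dma_strong(prot) \
	__pgprot((pgprot_val(prot) & ~_PAGE_PBMT_MASK) | _PAGE_PBMT_IO)

That covers the "non-cache + weak order" and "non-cache + strong order" cases
from the requirements quoted above with the two non-cacheable encodings.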
Guo Ren June 8, 2021, 12:26 p.m. UTC | #44
On Sat, Jun 5, 2021 at 12:12 AM Palmer Dabbelt <palmer@dabbelt.com> wrote:
>
> On Fri, 04 Jun 2021 07:47:22 PDT (-0700), guoren@kernel.org wrote:
> > Hi Arnd & Palmer,
> >
> > Sorry for the delayed reply, I'm working on the next version of the patch.
> >
> > On Fri, Jun 4, 2021 at 5:56 PM Arnd Bergmann <arnd@arndb.de> wrote:
> >>
> >> On Thu, Jun 3, 2021 at 5:39 PM Palmer Dabbelt <palmer@dabbelt.com> wrote:
> >> > On Wed, 02 Jun 2021 23:00:29 PDT (-0700), Anup Patel wrote:
> >> > >> This implementation, which adds some Kconfig entries that control page table
> >> > >> bits, definitely isn't suitable for upstream.  Allowing users to set arbitrary
> >> > >> page table bits will eventually conflict with the standard, and is just going to
> >> > >> be a mess.  It'll also lead to kernels that are only compatible with specific
> >> > >> designs, which we're trying very hard to avoid.  At a bare minimum we'll need
> >> > >> some way to detect systems with these page table bits before setting them,
> >> > >> and some description of what the bits actually do so we can reason about
> >> > >> them.
> >> > >
> >> > > Yes, vendor specific Kconfig options are strict NO NO. We can't give-up the
> >> > > goal of unified kernel image for all platforms.
> > Okay,  Agree. Please help review the next version of the patch.
> >
> >> >
> >> > I think this is just a phrasing issue, but just to be sure:
> >> >
> >> > IMO it's not that they're vendor-specific Kconfig options, it's that
> >> > turning them on will conflict with standard systems (and other vendors).
> >> > We've already got the ability to select sets of Kconfig settings that
> >> > will only boot on one vendor's system, which is fine, as long as there
> >> > remains a set of Kconfig settings that will boot on all systems.
> >> >
> >> > An example here would be the errata: every system has errata of some
> >> > sort, so if we start flipping off various vendor's errata Kconfigs
> >> > you'll end up with kernels that only function properly on some systems.
> >> > That's fine with me, as long as it's possible to turn on all vendor's
> >> > errata Kconfigs at the same time and the resulting kernel functions
> >> > correctly on all systems.
> >>
> >> Yes, this is generally the goal, and it would be great to have that
> >> working in a way where a 'defconfig' build just turns on all the options
> >> that are needed to use any SoC specific features and drivers while
> >> still working on all hardware. There are however limits you may run
> >> into at some point, and other architectures usually only manage to span
> >> some 10 to 15 years of hardware implementations with a single
> >> kernel before it gets really hard.
> > I could follow the goal in the next version of the patchset. Please
> > help review, thx.
>
> IMO we're essentially here now with the RISC-V stuff: defconfig flips on
> everything necessary to boot normal-smelling SOCs, with everything being
> detected as the system boots.  We have some wacky configurations like
> !MMU and XIP that are coupled to the hardware, but (and sorry for
> crossing the other threads, I missed your pointer as it's early here) as
> I said in the other thread it might be time to make it explicit that
> those things are non-portable.
>
> The hope here has always been that we'd have enough in the standards
> that we could avoid a proliferation of vendor-specific code.  We've
> always put a strong "things keep working forever" stake in the ground in
> RISC-V land, but that's largely been because we were counting on the
> standards existing that make support easy.  In practice we don't have
> those standards so we're ending up with a fairly large software base
> that is required to support everything.  We don't have all that much
> hardware right now so we'll have to see how it goes, but for now I'm in
> favor of keeping defconfig as a "boots on everything" sort of setup --
> both because it makes life easier for users, and because it makes issues
> like the non-portable Kconfigs that showed up here quite explicit.
I reuse the Image header to pass a vendor magic value to initialize the PTE
attribute variables before setup_vm. Can you give me some feedback on that patch?
https://lore.kernel.org/linux-riscv/1622970249-50770-9-git-send-email-guoren@kernel.org/T/#mdc0dacba57346b5ac59a01961495c132b93cfcdb

>
> If we get to 10/15 years of hardware then I'm sure we'll be removing old
> systems from defconfig (or maybe even the kernel entirely, a lot of this
> stuff isn't in production).  I'm just hoping we make it that far ;)
>
> >> To give some common examples that make it break down:
> >>
> >> - 32-bit vs 64-bit already violates that rule on risc-v (as it does on
> >>   most other architectures)
>
> Yes, and there's no way around that on RISC-V.  They're different base
> ISAs therefor re-define the same instructions, so we're essentially at
> two kernel binaries by that point.  The platform spec says rv64gc, so we
> can kind of punt on this one for now.  If rv32 hardware shows up
> we'll probably want a standard system there too, which is why we've
> avoided coupling kernel portability to XLEN.
>
> >> - architectures that support both big-endian and little-endian kernels
> >>   tend to have platforms that require one or the other (e.g. mips,
> >>   though not arm). Not an issue for you.
>
> It is now!  We've added big-endian to RISC-V.  There's no hardware yet
> and very little software support.  IMO the right answer is to ban that
> from the platform spec, but again it'll depend on what vendors want to
> build (though if anyone is listening, please don't make my life miserable
> ;)).
>
> >> - page table formats are the main cause of incompatibility: arm32
> >>   and x86-32 require three-level tables for certain features, but those
> >>   are incompatible with older cores, arm64 supports three different
> >>   page sizes, but none of them works on all cores (4KB almost works
> >>   everywhere).
>
> We actually have some support in the works for multiple page table
> levels in a single binary, which should help with a lot of that
> incompatibility.  I don't know of any plans to couple other page table
> features to the number of levels, though.
>
> >> - SMP-enabled ARMv7 kernels can be configured to run on either
> >>   ARMv6 or ARMv8, but not both, in this case because of incompatible
> >>   barrier instructions.
>
> Our barriers aren't quite split the same way, but we do have two memory
> models (RVWMO and TSO).  IIUC we should be able to support both in the
> same kernels with some patching, but the resulting kernels would be
> biased towards one memory models over the other WRT performance.  Again,
> we'll have to see what the vendors do and I'm hoping we don't end up
> with too many headaches.
>
> >> - 32-bit Arm has a couple more remaining features that require building
> >>   a machine specific kernel if enabled because they hardcode physical
> >>   addresses: early printk (debug_ll, not the normal earlycon), NOMMU,
> >>   and XIP.
>
> We've got NOMMU and XIP as well, but we have some SBI support for early
> printk.  IMO we're not really sure if we've decoupled all the PA layout
> dependencies yet from Linux, as we really only support one vendor's
> systems, but we've had a lot of work lately on beefing up our memory
> layout so with any luck we'll be able to quickly sort out anything that
> comes up.
David Laight June 8, 2021, 3 p.m. UTC | #45
From: Christoph Hellwig
> Sent: 07 June 2021 07:27
> 
> On Mon, Jun 07, 2021 at 11:19:03AM +0800, Guo Ren wrote:
> > From the Linux non-coherency point of view, we need:
> >  - Non-cache + Strong Order PTE attributes to deal with drivers' DMA descriptors
> >  - Non-cache + weak order to deal with framebuffer drivers
> >  - CMO dma_sync to sync cache with DMA devices
> 
> This is not strictly true.  At the very minimum you only need cache
> invalidation and writeback instructions.  For example early parisc
> CPUs and some m68knommu SOCs have no support for uncached areas at all,
> and Linux works.  But to be fair this is very painful and supports only
> very limited peripherals.  So for modern full Linux support some uncached
> memory is advisable.  But that doesn't have to be using PTE attributes.
> It could also be physical memory regions that are either totally fixed
> or somewhat dynamic.

It is almost impossible to interface to many ethernet chips without
either coherent or uncached memory for the descriptor rings.
The status bits on the transmit ring are particularly problematic.

The receive ring can be done with writeback+invalidate provided you
fill a cache line at a time.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Christoph Hellwig June 8, 2021, 3:32 p.m. UTC | #46
On Tue, Jun 08, 2021 at 03:00:17PM +0000, David Laight wrote:
> It is almost impossible to interface to many ethernet chips without
> either coherent or uncached memory for the descriptor rings.
> The status bits on the transmit ring are particularly problematic.
> 
> The receive ring can be done with writeback+invalidate provided you
> fill a cache line at a time.

It is horrible, but it has been done.  Take a look at:

drivers/net/ethernet/i825xx/lasi_82596.c and
drivers/net/ethernet/seeq/sgiseeq.c
David Laight June 8, 2021, 4:11 p.m. UTC | #47
From: 'Christoph Hellwig'
> Sent: 08 June 2021 16:32
> 
> On Tue, Jun 08, 2021 at 03:00:17PM +0000, David Laight wrote:
> > It is almost impossible to interface to many ethernet chips without
> > either coherent or uncached memory for the descriptor rings.
> > The status bits on the transmit ring are particularly problematic.
> >
> > The receive ring can be done with writeback+invalidate provided you
> > fill a cache line at a time.
> 
> It is horrible, but it has been done.  Take a look at:
> 
> drivers/net/ethernet/i825xx/lasi_82596.c and
> drivers/net/ethernet/seeq/sgiseeq.c

I guess that each transmit has to be split into enough
fragments that they fill a cache line.
That won't work with some (probably old now) devices that
require the first fragment to be 64 bytes because it won't
back up the descriptors after a collision.

It's all as horrid as a DSP we have that can't receive ethernet
frames onto a 4n+2 boundary and doesn't support misaligned accesses.

Mind you, Sun's original Sbus ethernet board had to be given
a 4n aligned rx buffer and then a misaligned copy done in kernel
in order to not drop packets!

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Guo Ren June 9, 2021, 3:28 a.m. UTC | #48
On Mon, Jun 7, 2021 at 2:14 AM Nick Kossifidis <mick@ics.forth.gr> wrote:
>
> On 2021-05-20 04:45, Guo Ren wrote:
> > On Wed, May 19, 2021 at 2:53 PM Christoph Hellwig <hch@lst.de> wrote:
> >>
> >> On Tue, May 18, 2021 at 11:44:35PM -0700, Drew Fustini wrote:
> >> > This patch series looks like it might be useful for the StarFive JH7100
> >> > [1] [2] too as it has peripherals on a non-coherent interconnect. GMAC,
> >> > USB and SDIO require that the L2 cache must be manually flushed after
> >> > DMA operations if the data is intended to be shared with U74 cores [2].
> >>
> >> Not too much, given that the SiFive lineage CPUs have an uncached
> >> window, that is a totally different way to allocate uncached memory.
> > It's a very big MIPS smell. What's the attribute of the uncached
> > window? (uncached + strong-order/ uncached + weak, most vendors still
> > use AXI interconnect, how to deal with a bufferable attribute?) In
> > fact, customers' drivers use different ways to deal with DMA memory in
> > non-coherent SOC. Most riscv SOC vendors are from ARM, so giving them
> > the same way in DMA memory is a smart choice. So using PTE attributes
> > is more suitable.
> >
> > See:
> > https://github.com/riscv/virtual-memory/blob/main/specs/611-virtual-memory-diff.pdf
> > 4.4.1
> > The draft supports custom attribute bits in PTE.
> >
>
> Not only it doesn't support custom attributes on PTEs:
>
> "Bits63–54 are reserved for future standard use and must be zeroed by
> software for forward compatibility."
>
> It also goes further to say that:
>
> "if any of these bits are set, a page-fault exception is raised"
Agreed; when our processor's MMU works in compatible mode, we must keep
bits 63–54 zero in Linux.
So, I think this is the first version of the PTE format.

If the "PBMT" extension proposal is approved, it will cause the second
version of the PTE format.

Maybe in the future, we'll get more versions of the PTE formats.

So, seems Linux must support multi versions of PTE formats with one
Image, right?

Okay, we could stop arguing with the D1 PTE format. And talk about how
to let Linux support multi versions of PTE formats that come from the
future RISC-V privilege spec.
Jisheng Zhang June 9, 2021, 6:05 a.m. UTC | #49
On Wed, 9 Jun 2021 11:28:19 +0800
Guo Ren <guoren@kernel.org> wrote:


> 
> 
> On Mon, Jun 7, 2021 at 2:14 AM Nick Kossifidis <mick@ics.forth.gr> wrote:
> >
> > On 2021-05-20 04:45, Guo Ren wrote:
> > > On Wed, May 19, 2021 at 2:53 PM Christoph Hellwig <hch@lst.de> wrote:  
> > >>
> > >> On Tue, May 18, 2021 at 11:44:35PM -0700, Drew Fustini wrote:  
> > >> > This patch series looks like it might be useful for the StarFive JH7100
> > >> > [1] [2] too as it has peripherals on a non-coherent interconnect. GMAC,
> > >> > USB and SDIO require that the L2 cache must be manually flushed after
> > >> > DMA operations if the data is intended to be shared with U74 cores [2].  
> > >>
> > >> Not too much, given that the SiFive lineage CPUs have an uncached
> > >> window, that is a totally different way to allocate uncached memory.  
> > > It's a very big MIPS smell. What's the attribute of the uncached
> > > window? (uncached + strong-order/ uncached + weak, most vendors still
> > > use AXI interconnect, how to deal with a bufferable attribute?) In
> > > fact, customers' drivers use different ways to deal with DMA memory in
> > > non-coherent SOC. Most riscv SOC vendors are from ARM, so giving them
> > > the same way in DMA memory is a smart choice. So using PTE attributes
> > > is more suitable.
> > >

<snip>

> > > 4.4.1
> > > The draft supports custom attribute bits in PTE.
> > >  
> >
> > Not only it doesn't support custom attributes on PTEs:
> >
> > "Bits63–54 are reserved for future standard use and must be zeroed by
> > software for forward compatibility."
> >
> > It also goes further to say that:
> >
> > "if any of these bits are set, a page-fault exception is raised"  
> Agreed; when our processor's MMU works in compatible mode, we must keep
> bits 63–54 zero in Linux.
> So, I think this is the first version of the PTE format.
> 
> If the "PBMT" extension proposal is approved, it will cause the second
> version of the PTE format.
> 
> Maybe in the future, we'll get more versions of the PTE formats.
> 
> So, seems Linux must support multi versions of PTE formats with one
> Image, right?
> 
> Okay, we could stop arguing with the D1 PTE format. And talk about how
> to let Linux support multi versions of PTE formats that come from the
> future RISC-V privilege spec.
> 

Just my humble opinion:
When the usage of those bits (63~54) is standardized in a future RISC-V privilege spec,
a generic Image can still be supported with the following solutions:

*alternative patching on the fly:
If the bit only needs to be set during init, we may insert nop instruction(s)
at the proper place, then patch the nop into set_the_target_bit instruction(s)
based on the hart's features.

*normal check feature then use:
If the feature needs a bit more complex code, we could go through the "feature check
then use" path. Static key tech can be used here to avoid branches, as in the sketch below.
Nick Kossifidis June 9, 2021, 9:45 a.m. UTC | #50
On 2021-06-09 06:28, Guo Ren wrote:
> On Mon, Jun 7, 2021 at 2:14 AM Nick Kossifidis <mick@ics.forth.gr> 
> wrote:
>> 
>> On 2021-05-20 04:45, Guo Ren wrote:
>> > On Wed, May 19, 2021 at 2:53 PM Christoph Hellwig <hch@lst.de> wrote:
>> >>
>> >> On Tue, May 18, 2021 at 11:44:35PM -0700, Drew Fustini wrote:
>> >> > This patch series looks like it might be useful for the StarFive JH7100
>> >> > [1] [2] too as it has peripherals on a non-coherent interconnect. GMAC,
>> >> > USB and SDIO require that the L2 cache must be manually flushed after
>> >> > DMA operations if the data is intended to be shared with U74 cores [2].
>> >>
>> >> Not too much, given that the SiFive lineage CPUs have an uncached
>> >> window, that is a totally different way to allocate uncached memory.
>> > It's a very big MIPS smell. What's the attribute of the uncached
>> > window? (uncached + strong-order/ uncached + weak, most vendors still
>> > use AXI interconnect, how to deal with a bufferable attribute?) In
>> > fact, customers' drivers use different ways to deal with DMA memory in
>> > non-coherent SOC. Most riscv SOC vendors are from ARM, so giving them
>> > the same way in DMA memory is a smart choice. So using PTE attributes
>> > is more suitable.
>> >
>> > See:
>> > https://github.com/riscv/virtual-memory/blob/main/specs/611-virtual-memory-diff.pdf
>> > 4.4.1
>> > The draft supports custom attribute bits in PTE.
>> >
>> 
>> Not only it doesn't support custom attributes on PTEs:
>> 
>> "Bits63–54 are reserved for future standard use and must be zeroed by
>> software for forward compatibility."
>> 
>> It also goes further to say that:
>> 
>> "if any of these bits are set, a page-fault exception is raised"
> Agreed; when our processor's MMU works in compatible mode, we must keep
> bits 63–54 zero in Linux.
> So, I think this is the first version of the PTE format.
> 
> If the "PBMT" extension proposal is approved, it will cause the second
> version of the PTE format.
> 
> Maybe in the future, we'll get more versions of the PTE formats.
> 
> So, seems Linux must support multi versions of PTE formats with one
> Image, right?
> 
> Okay, we could stop arguing with the D1 PTE format. And talk about how
> to let Linux support multi versions of PTE formats that come from the
> future RISC-V privilege spec.

The RISC-V ISA specs are meant to be backwards compatible, so newer PTE 
versions should work on older devices (note that the spec says that 
software must set those bits to zero for "forward compatibility" and that 
they are "reserved for future use", so current implementations must ignore them). 
Obviously the proposed "if any of these bits are set, a page-fault 
exception is raised" will break backwards compatibility, which is why we 
need to ask for it to be removed from the draft.

As an example the PBMT proposal uses bits 62:61 that on older hw should 
be ignored ("reserved for future use"); if Linux uses those bits we 
won't need a different code path for supporting older hw/older PTE 
versions, we'll just set them and older hw will ignore them. Because of 
the guarantee that ISA specs maintain backwards compatibility, the 
functionality of bits 62:61 is guaranteed to remain backwards 
compatible.

In other words we don't need any special handling of multiple PTE 
formats; we just need to support the latest Priv. Spec, and the Spec 
itself will guarantee backwards compatibility.
Guo Ren June 9, 2021, 12:43 p.m. UTC | #51
On Wed, Jun 9, 2021 at 5:45 PM Nick Kossifidis <mick@ics.forth.gr> wrote:
>
> On 2021-06-09 06:28, Guo Ren wrote:
> > On Mon, Jun 7, 2021 at 2:14 AM Nick Kossifidis <mick@ics.forth.gr>
> > wrote:
> >>
> >> On 2021-05-20 04:45, Guo Ren wrote:
> >> > On Wed, May 19, 2021 at 2:53 PM Christoph Hellwig <hch@lst.de> wrote:
> >> >>
> >> >> On Tue, May 18, 2021 at 11:44:35PM -0700, Drew Fustini wrote:
> >> >> > This patch series looks like it might be useful for the StarFive JH7100
> >> >> > [1] [2] too as it has peripherals on a non-coherent interconnect. GMAC,
> >> >> > USB and SDIO require that the L2 cache must be manually flushed after
> >> >> > DMA operations if the data is intended to be shared with U74 cores [2].
> >> >>
> >> >> Not too much, given that the SiFive lineage CPUs have an uncached
> >> >> window, that is a totally different way to allocate uncached memory.
> >> > It's a very big MIPS smell. What's the attribute of the uncached
> >> > window? (uncached + strong-order/ uncached + weak, most vendors still
> >> > use AXI interconnect, how to deal with a bufferable attribute?) In
> >> > fact, customers' drivers use different ways to deal with DMA memory in
> >> > non-coherent SOC. Most riscv SOC vendors are from ARM, so giving them
> >> > the same way in DMA memory is a smart choice. So using PTE attributes
> >> > is more suitable.
> >> >
> >> > See:
> >> > https://github.com/riscv/virtual-memory/blob/main/specs/611-virtual-memory-diff.pdf
> >> > 4.4.1
> >> > The draft supports custom attribute bits in PTE.
> >> >
> >>
> >> Not only it doesn't support custom attributes on PTEs:
> >>
> >> "Bits63–54 are reserved for future standard use and must be zeroed by
> >> software for forward compatibility."
> >>
> >> It also goes further to say that:
> >>
> >> "if any of these bits are set, a page-fault exception is raised"
> > Agreed; when our processor's MMU works in compatible mode, we must keep
> > bits 63–54 zero in Linux.
> > So, I think this is the first version of the PTE format.
> >
> > If the "PBMT" extension proposal is approved, it will cause the second
> > version of the PTE format.
> >
> > Maybe in the future, we'll get more versions of the PTE formats.
> >
> > So, seems Linux must support multi versions of PTE formats with one
> > Image, right?
> >
> > Okay, we could stop arguing with the D1 PTE format. And talk about how
> > to let Linux support multi versions of PTE formats that come from the
> > future RISC-V privilege spec.
>
> The RISC-V ISA specs are meant to be backwards compatible, so newer PTE
> versions should work on older devices (note that the spec says that
> software must set those bits to zero for "forward compatibility" and are
> "reserved for future use" so current implementations must ignore them).
> Obviously the proposed "if any of these bits are set, a page-fault
> exception is raised" will break backwards compatibility which is why we
> need to ask for it to be removed from the draft.
>
> As an example the PBMT proposal uses bits 62:61 that on older hw should
> be ignored ("reserved for future use"), if Linux uses those bits we
> won't need a different code path for supporting older hw/older PTE
> versions, we'll just set them and older hw will ignore them. Because of
> the guarantee that ISA specs maintain backwards compatibility, the
> functionality of bits 62:61 is guaranteed to remain backwards
> compatible.
The spec says that software must set those bits to zero for "forward
compatibility". So how would older hw ignore them?
If older hw follows the current spec, which requires software to set those
bits to zero, how can we put any PBMT bits there without different Linux PTE
formats?

>
> In other words we don't need any special handling of multiple PTE
> formats, we just need to support the latest Priv. Spec and the Spec
> itself will guarantee backwards compatibility.
Nak, that is totally not logically self-consistent.