Message ID | 20190110072841.3283-1-ard.biesheuvel@linaro.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RFC] drm/ttm: force cached mappings for system RAM on ARM | expand |
On 1/10/19 3:28 PM, Ard Biesheuvel wrote: > ARM systems do not permit the use of anything other than cached > mappings for system memory, since that memory may be mapped in the > linear region as well, and the architecture does not permit aliases > with mismatched attributes. > > So short-circuit the evaluation in ttm_io_prot() if the flags include > TTM_PL_SYSTEM when running on ARM or arm64, and just return cached > attributes immediately. It sounds a case for ARM system memory access from CPU only? If that always applies to ARM memory, suppose we should do that for TTM_PL_TT as well. While TTM_PL_TT | TTM_PL_FLAG_WC is likely to work as below mention. Regards, Jerry > This fixes the radeon and amdgpu [TBC] drivers when running on arm64. > Without this change, amdgpu does not start at all, and radeon only > produces corrupt display output. > > Cc: Christian Koenig <christian.koenig@amd.com> > Cc: Huang Rui <ray.huang@amd.com> > Cc: Junwei Zhang <Jerry.Zhang@amd.com> > Cc: David Airlie <airlied@linux.ie> > Reported-by: Carsten Haitzler <Carsten.Haitzler@arm.com> > Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> > --- > drivers/gpu/drm/ttm/ttm_bo_util.c | 5 +++++ > 1 file changed, 5 insertions(+) > > diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c > index 046a6dda690a..0c1eef5f7ae3 100644 > --- a/drivers/gpu/drm/ttm/ttm_bo_util.c > +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c > @@ -530,6 +530,11 @@ pgprot_t ttm_io_prot(uint32_t caching_flags, pgprot_t tmp) > if (caching_flags & TTM_PL_FLAG_CACHED) > return tmp; > > +#if defined(__arm__) || defined(__aarch64__) > + /* ARM only permits cached mappings of system memory */ > + if (caching_flags & TTM_PL_SYSTEM) > + return tmp; > +#endif > #if defined(__i386__) || defined(__x86_64__) > if (caching_flags & TTM_PL_FLAG_WC) > tmp = pgprot_writecombine(tmp);
Hi Ard, thanks a lot for this! At least somebody who can explain why this doesn't work as expected. The problem is that the hardware actually needs a few pages as uncached in a couple of cases to work correctly. So we could still run into issues with that solution. For now we have blocked userspace/kernel from extensively using write combine mappings by adjusting drm_arch_can_wc_memory(), but that is probably degrading performance quite a bit. What can be done to improve the solution? At least on X86 we solve this by marking the write combined pages in the linear mapping as uncached as well. Would this be doable on ARM as well? Thanks, Christian. Am 10.01.19 um 08:28 schrieb Ard Biesheuvel: > ARM systems do not permit the use of anything other than cached > mappings for system memory, since that memory may be mapped in the > linear region as well, and the architecture does not permit aliases > with mismatched attributes. > > So short-circuit the evaluation in ttm_io_prot() if the flags include > TTM_PL_SYSTEM when running on ARM or arm64, and just return cached > attributes immediately. > > This fixes the radeon and amdgpu [TBC] drivers when running on arm64. > Without this change, amdgpu does not start at all, and radeon only > produces corrupt display output. > > Cc: Christian Koenig <christian.koenig@amd.com> > Cc: Huang Rui <ray.huang@amd.com> > Cc: Junwei Zhang <Jerry.Zhang@amd.com> > Cc: David Airlie <airlied@linux.ie> > Reported-by: Carsten Haitzler <Carsten.Haitzler@arm.com> > Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> > --- > drivers/gpu/drm/ttm/ttm_bo_util.c | 5 +++++ > 1 file changed, 5 insertions(+) > > diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c > index 046a6dda690a..0c1eef5f7ae3 100644 > --- a/drivers/gpu/drm/ttm/ttm_bo_util.c > +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c > @@ -530,6 +530,11 @@ pgprot_t ttm_io_prot(uint32_t caching_flags, pgprot_t tmp) > if (caching_flags & TTM_PL_FLAG_CACHED) > return tmp; > > +#if defined(__arm__) || defined(__aarch64__) > + /* ARM only permits cached mappings of system memory */ > + if (caching_flags & TTM_PL_SYSTEM) > + return tmp; > +#endif > #if defined(__i386__) || defined(__x86_64__) > if (caching_flags & TTM_PL_FLAG_WC) > tmp = pgprot_writecombine(tmp);
On 2019-01-10 8:28 a.m., Ard Biesheuvel wrote: > ARM systems do not permit the use of anything other than cached > mappings for system memory, since that memory may be mapped in the > linear region as well, and the architecture does not permit aliases > with mismatched attributes. > > So short-circuit the evaluation in ttm_io_prot() if the flags include > TTM_PL_SYSTEM when running on ARM or arm64, and just return cached > attributes immediately. > > This fixes the radeon and amdgpu [TBC] drivers when running on arm64. > Without this change, amdgpu does not start at all, and radeon only > produces corrupt display output. > > Cc: Christian Koenig <christian.koenig@amd.com> > Cc: Huang Rui <ray.huang@amd.com> > Cc: Junwei Zhang <Jerry.Zhang@amd.com> > Cc: David Airlie <airlied@linux.ie> > Reported-by: Carsten Haitzler <Carsten.Haitzler@arm.com> > Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> > --- > drivers/gpu/drm/ttm/ttm_bo_util.c | 5 +++++ > 1 file changed, 5 insertions(+) > > diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c > index 046a6dda690a..0c1eef5f7ae3 100644 > --- a/drivers/gpu/drm/ttm/ttm_bo_util.c > +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c > @@ -530,6 +530,11 @@ pgprot_t ttm_io_prot(uint32_t caching_flags, pgprot_t tmp) > if (caching_flags & TTM_PL_FLAG_CACHED) > return tmp; > > +#if defined(__arm__) || defined(__aarch64__) > + /* ARM only permits cached mappings of system memory */ > + if (caching_flags & TTM_PL_SYSTEM) > + return tmp; > +#endif > #if defined(__i386__) || defined(__x86_64__) > if (caching_flags & TTM_PL_FLAG_WC) > tmp = pgprot_writecombine(tmp); > Apart from Christian's concerns, I think this is the wrong place for this, because other TTM / driver code will still consider the memory uncacheable. E.g. the amdgpu driver will program the GPU to treat the memory as uncacheable, so it won't participate in cache coherency protocol for it, which is unlikely to work as expected in general if the CPU treats the memory as cacheable.
On Thu, 10 Jan 2019 at 10:34, Michel Dänzer <michel@daenzer.net> wrote: > > On 2019-01-10 8:28 a.m., Ard Biesheuvel wrote: > > ARM systems do not permit the use of anything other than cached > > mappings for system memory, since that memory may be mapped in the > > linear region as well, and the architecture does not permit aliases > > with mismatched attributes. > > > > So short-circuit the evaluation in ttm_io_prot() if the flags include > > TTM_PL_SYSTEM when running on ARM or arm64, and just return cached > > attributes immediately. > > > > This fixes the radeon and amdgpu [TBC] drivers when running on arm64. > > Without this change, amdgpu does not start at all, and radeon only > > produces corrupt display output. > > > > Cc: Christian Koenig <christian.koenig@amd.com> > > Cc: Huang Rui <ray.huang@amd.com> > > Cc: Junwei Zhang <Jerry.Zhang@amd.com> > > Cc: David Airlie <airlied@linux.ie> > > Reported-by: Carsten Haitzler <Carsten.Haitzler@arm.com> > > Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> > > --- > > drivers/gpu/drm/ttm/ttm_bo_util.c | 5 +++++ > > 1 file changed, 5 insertions(+) > > > > diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c > > index 046a6dda690a..0c1eef5f7ae3 100644 > > --- a/drivers/gpu/drm/ttm/ttm_bo_util.c > > +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c > > @@ -530,6 +530,11 @@ pgprot_t ttm_io_prot(uint32_t caching_flags, pgprot_t tmp) > > if (caching_flags & TTM_PL_FLAG_CACHED) > > return tmp; > > > > +#if defined(__arm__) || defined(__aarch64__) > > + /* ARM only permits cached mappings of system memory */ > > + if (caching_flags & TTM_PL_SYSTEM) > > + return tmp; > > +#endif > > #if defined(__i386__) || defined(__x86_64__) > > if (caching_flags & TTM_PL_FLAG_WC) > > tmp = pgprot_writecombine(tmp); > > > > Apart from Christian's concerns, I think this is the wrong place for > this, because other TTM / driver code will still consider the memory > uncacheable. E.g. the amdgpu driver will program the GPU to treat the > memory as uncacheable, so it won't participate in cache coherency > protocol for it, which is unlikely to work as expected in general if the > CPU treats the memory as cacheable. > Will and I have spent some time digging into this, so allow me to share some preliminary findings while we carry on and try to fix this properly. - The patch above is flawed, i.e., it doesn't do what it intends to since it uses TTM_PL_SYSTEM instead of TTM_PL_FLAG_SYSTEM. Apologies for that. - The existence of a linear region mapping with mismatched attributes is likely not the culprit here. (We do something similar with non-cache coherent DMA in other places). - The reason remapping the CPU side as cacheable does work (which I did test) is because the GPU's uncacheable accesses (which I assume are made using the NoSnoop PCIe transaction attribute) are actually emitted as cacheable in some cases. . On my AMD Seattle, with or without SMMU (which is stage 2 only), I must use cacheable accesses from the CPU side or things are broken. This might be a h/w flaw, though. . On systems with stage 1+2 SMMUs, the driver uses stage 1 translations which always override the memory attributes to cacheable for DMA coherent devices. This is what is affecting the Cavium ThunderX2 (although it appears the attributes emitted by the RC may be incorrect as well.) The latter issue is a shortcoming in the SMMU driver that we have to fix, i.e., it should take care not to modify the incoming attributes of DMA coherent PCIe devices for NoSnoop to be able to work. So in summary, the mismatch appears to be between the CPU accessing the vmap region with non-cacheable attributes and the GPU accessing the same memory with cacheable attributes, resulting in a loss of coherency and lots of visible corruption. To be able to debug this further, could you elaborate a bit on - How does the hardware emit those uncached/wc inbound accesses? Do they rely on NoSnoop? - Christian pointed out that some accesses must be uncached even when not using WC. What kind of accesses are those? And do they access system RAM?
Am 14.01.19 um 11:53 schrieb Ard Biesheuvel: > On Thu, 10 Jan 2019 at 10:34, Michel Dänzer <michel@daenzer.net> wrote: >> On 2019-01-10 8:28 a.m., Ard Biesheuvel wrote: >>> ARM systems do not permit the use of anything other than cached >>> mappings for system memory, since that memory may be mapped in the >>> linear region as well, and the architecture does not permit aliases >>> with mismatched attributes. >>> >>> So short-circuit the evaluation in ttm_io_prot() if the flags include >>> TTM_PL_SYSTEM when running on ARM or arm64, and just return cached >>> attributes immediately. >>> >>> This fixes the radeon and amdgpu [TBC] drivers when running on arm64. >>> Without this change, amdgpu does not start at all, and radeon only >>> produces corrupt display output. >>> >>> Cc: Christian Koenig <christian.koenig@amd.com> >>> Cc: Huang Rui <ray.huang@amd.com> >>> Cc: Junwei Zhang <Jerry.Zhang@amd.com> >>> Cc: David Airlie <airlied@linux.ie> >>> Reported-by: Carsten Haitzler <Carsten.Haitzler@arm.com> >>> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> >>> --- >>> drivers/gpu/drm/ttm/ttm_bo_util.c | 5 +++++ >>> 1 file changed, 5 insertions(+) >>> >>> diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c >>> index 046a6dda690a..0c1eef5f7ae3 100644 >>> --- a/drivers/gpu/drm/ttm/ttm_bo_util.c >>> +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c >>> @@ -530,6 +530,11 @@ pgprot_t ttm_io_prot(uint32_t caching_flags, pgprot_t tmp) >>> if (caching_flags & TTM_PL_FLAG_CACHED) >>> return tmp; >>> >>> +#if defined(__arm__) || defined(__aarch64__) >>> + /* ARM only permits cached mappings of system memory */ >>> + if (caching_flags & TTM_PL_SYSTEM) >>> + return tmp; >>> +#endif >>> #if defined(__i386__) || defined(__x86_64__) >>> if (caching_flags & TTM_PL_FLAG_WC) >>> tmp = pgprot_writecombine(tmp); >>> >> Apart from Christian's concerns, I think this is the wrong place for >> this, because other TTM / driver code will still consider the memory >> uncacheable. E.g. the amdgpu driver will program the GPU to treat the >> memory as uncacheable, so it won't participate in cache coherency >> protocol for it, which is unlikely to work as expected in general if the >> CPU treats the memory as cacheable. >> > Will and I have spent some time digging into this, so allow me to > share some preliminary findings while we carry on and try to fix this > properly. > > - The patch above is flawed, i.e., it doesn't do what it intends to > since it uses TTM_PL_SYSTEM instead of TTM_PL_FLAG_SYSTEM. Apologies > for that. > - The existence of a linear region mapping with mismatched attributes > is likely not the culprit here. (We do something similar with > non-cache coherent DMA in other places). This is still rather problematic. The issue is that we often don't create a vmap for a page, but rather access the page directly using the linear mapping. So we would use the wrong access type here. > - The reason remapping the CPU side as cacheable does work (which I > did test) is because the GPU's uncacheable accesses (which I assume > are made using the NoSnoop PCIe transaction attribute) are actually > emitted as cacheable in some cases. > . On my AMD Seattle, with or without SMMU (which is stage 2 only), I > must use cacheable accesses from the CPU side or things are broken. > This might be a h/w flaw, though. > . On systems with stage 1+2 SMMUs, the driver uses stage 1 > translations which always override the memory attributes to cacheable > for DMA coherent devices. This is what is affecting the Cavium > ThunderX2 (although it appears the attributes emitted by the RC may be > incorrect as well.) > > The latter issue is a shortcoming in the SMMU driver that we have to > fix, i.e., it should take care not to modify the incoming attributes > of DMA coherent PCIe devices for NoSnoop to be able to work. > > So in summary, the mismatch appears to be between the CPU accessing > the vmap region with non-cacheable attributes and the GPU accessing > the same memory with cacheable attributes, resulting in a loss of > coherency and lots of visible corruption. Actually it is the other way around. The CPU thinks some data is in the cache and the GPU only updates the system memory version because the snoop flag is not set. > To be able to debug this further, could you elaborate a bit on > - How does the hardware emit those uncached/wc inbound accesses? Do > they rely on NoSnoop? The GPU has a separate page walker in the MC and the page tables there have a bits saying if the access should go to the PCIe bus and if yes if the snoop bit should be set. > - Christian pointed out that some accesses must be uncached even when > not using WC. What kind of accesses are those? And do they access > system RAM? On some hardware generations we have a buggy engine which fails to forward the snoop bit and because of this the system memory page used by this engine must be uncached. But this only applies if you use ROCm in a specific configuration. Regards, Christian.
On Mon, 14 Jan 2019 at 12:38, Koenig, Christian <Christian.Koenig@amd.com> wrote: > > Am 14.01.19 um 11:53 schrieb Ard Biesheuvel: > > On Thu, 10 Jan 2019 at 10:34, Michel Dänzer <michel@daenzer.net> wrote: > >> On 2019-01-10 8:28 a.m., Ard Biesheuvel wrote: > >>> ARM systems do not permit the use of anything other than cached > >>> mappings for system memory, since that memory may be mapped in the > >>> linear region as well, and the architecture does not permit aliases > >>> with mismatched attributes. > >>> > >>> So short-circuit the evaluation in ttm_io_prot() if the flags include > >>> TTM_PL_SYSTEM when running on ARM or arm64, and just return cached > >>> attributes immediately. > >>> > >>> This fixes the radeon and amdgpu [TBC] drivers when running on arm64. > >>> Without this change, amdgpu does not start at all, and radeon only > >>> produces corrupt display output. > >>> > >>> Cc: Christian Koenig <christian.koenig@amd.com> > >>> Cc: Huang Rui <ray.huang@amd.com> > >>> Cc: Junwei Zhang <Jerry.Zhang@amd.com> > >>> Cc: David Airlie <airlied@linux.ie> > >>> Reported-by: Carsten Haitzler <Carsten.Haitzler@arm.com> > >>> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> > >>> --- > >>> drivers/gpu/drm/ttm/ttm_bo_util.c | 5 +++++ > >>> 1 file changed, 5 insertions(+) > >>> > >>> diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c > >>> index 046a6dda690a..0c1eef5f7ae3 100644 > >>> --- a/drivers/gpu/drm/ttm/ttm_bo_util.c > >>> +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c > >>> @@ -530,6 +530,11 @@ pgprot_t ttm_io_prot(uint32_t caching_flags, pgprot_t tmp) > >>> if (caching_flags & TTM_PL_FLAG_CACHED) > >>> return tmp; > >>> > >>> +#if defined(__arm__) || defined(__aarch64__) > >>> + /* ARM only permits cached mappings of system memory */ > >>> + if (caching_flags & TTM_PL_SYSTEM) > >>> + return tmp; > >>> +#endif > >>> #if defined(__i386__) || defined(__x86_64__) > >>> if (caching_flags & TTM_PL_FLAG_WC) > >>> tmp = pgprot_writecombine(tmp); > >>> > >> Apart from Christian's concerns, I think this is the wrong place for > >> this, because other TTM / driver code will still consider the memory > >> uncacheable. E.g. the amdgpu driver will program the GPU to treat the > >> memory as uncacheable, so it won't participate in cache coherency > >> protocol for it, which is unlikely to work as expected in general if the > >> CPU treats the memory as cacheable. > >> > > Will and I have spent some time digging into this, so allow me to > > share some preliminary findings while we carry on and try to fix this > > properly. > > > > - The patch above is flawed, i.e., it doesn't do what it intends to > > since it uses TTM_PL_SYSTEM instead of TTM_PL_FLAG_SYSTEM. Apologies > > for that. > > - The existence of a linear region mapping with mismatched attributes > > is likely not the culprit here. (We do something similar with > > non-cache coherent DMA in other places). > > This is still rather problematic. > > The issue is that we often don't create a vmap for a page, but rather > access the page directly using the linear mapping. > > So we would use the wrong access type here. > Yes. But how are these accesses done? Are they wrapped in a kmap()? > > - The reason remapping the CPU side as cacheable does work (which I > > did test) is because the GPU's uncacheable accesses (which I assume > > are made using the NoSnoop PCIe transaction attribute) are actually > > emitted as cacheable in some cases. > > . On my AMD Seattle, with or without SMMU (which is stage 2 only), I > > must use cacheable accesses from the CPU side or things are broken. > > This might be a h/w flaw, though. > > . On systems with stage 1+2 SMMUs, the driver uses stage 1 > > translations which always override the memory attributes to cacheable > > for DMA coherent devices. This is what is affecting the Cavium > > ThunderX2 (although it appears the attributes emitted by the RC may be > > incorrect as well.) > > > > The latter issue is a shortcoming in the SMMU driver that we have to > > fix, i.e., it should take care not to modify the incoming attributes > > of DMA coherent PCIe devices for NoSnoop to be able to work. > > > > So in summary, the mismatch appears to be between the CPU accessing > > the vmap region with non-cacheable attributes and the GPU accessing > > the same memory with cacheable attributes, resulting in a loss of > > coherency and lots of visible corruption. > > Actually it is the other way around. The CPU thinks some data is in the > cache and the GPU only updates the system memory version because the > snoop flag is not set. > That doesn't seem to be what is happening. As far as we can tell from our experiments, all inbound transactions are always cacheable, and so the only way to make things work is to ensure that the CPU uses the same attributes. > > To be able to debug this further, could you elaborate a bit on > > - How does the hardware emit those uncached/wc inbound accesses? Do > > they rely on NoSnoop? > > The GPU has a separate page walker in the MC and the page tables there > have a bits saying if the access should go to the PCIe bus and if yes if > the snoop bit should be set. > > > - Christian pointed out that some accesses must be uncached even when > > not using WC. What kind of accesses are those? And do they access > > system RAM? > > On some hardware generations we have a buggy engine which fails to > forward the snoop bit and because of this the system memory page used by > this engine must be uncached. But this only applies if you use ROCm in a > specific configuration. > OK. I don't know what that means tbh. Does this apply to both radeon and amdgpu?
Am 14.01.19 um 18:32 schrieb Ard Biesheuvel: [SNIP] - The existence of a linear region mapping with mismatched attributes is likely not the culprit here. (We do something similar with non-cache coherent DMA in other places). This is still rather problematic. The issue is that we often don't create a vmap for a page, but rather access the page directly using the linear mapping. So we would use the wrong access type here. Yes. But how are these accesses done? Are they wrapped in a kmap()? I just double checked the source for this and it should actually be handled gracefully: if (num_pages<https://elixir.bootlin.com/linux/v5.0-rc1/ident/num_pages> == 1 && (mem<https://elixir.bootlin.com/linux/v5.0-rc1/ident/mem>->placement<https://elixir.bootlin.com/linux/v5.0-rc1/ident/placement> & TTM_PL_FLAG_CACHED<https://elixir.bootlin.com/linux/v5.0-rc1/ident/TTM_PL_FLAG_CACHED>)) { /* * We're mapping a single page, and the desired * page protection is consistent with the bo. */ map<https://elixir.bootlin.com/linux/v5.0-rc1/ident/map>->bo_kmap_type = ttm_bo_map_kmap<https://elixir.bootlin.com/linux/v5.0-rc1/ident/ttm_bo_map_kmap>; map<https://elixir.bootlin.com/linux/v5.0-rc1/ident/map>->page = ttm->pages<https://elixir.bootlin.com/linux/v5.0-rc1/ident/pages>[start_page]; map<https://elixir.bootlin.com/linux/v5.0-rc1/ident/map>->virtual = kmap<https://elixir.bootlin.com/linux/v5.0-rc1/ident/kmap>(map<https://elixir.bootlin.com/linux/v5.0-rc1/ident/map>->page); } else { /* * We need to use vmap to get the desired page protection * or to make the buffer object look contiguous. */ prot = ttm_io_prot<https://elixir.bootlin.com/linux/v5.0-rc1/ident/ttm_io_prot>(mem<https://elixir.bootlin.com/linux/v5.0-rc1/ident/mem>->placement<https://elixir.bootlin.com/linux/v5.0-rc1/ident/placement>, PAGE_KERNEL<https://elixir.bootlin.com/linux/v5.0-rc1/ident/PAGE_KERNEL>); map<https://elixir.bootlin.com/linux/v5.0-rc1/ident/map>->bo_kmap_type = ttm_bo_map_vmap<https://elixir.bootlin.com/linux/v5.0-rc1/ident/ttm_bo_map_vmap>; map<https://elixir.bootlin.com/linux/v5.0-rc1/ident/map>->virtual = vmap<https://elixir.bootlin.com/linux/v5.0-rc1/ident/vmap>(ttm->pages<https://elixir.bootlin.com/linux/v5.0-rc1/ident/pages> + start_page, num_pages<https://elixir.bootlin.com/linux/v5.0-rc1/ident/num_pages>, 0, prot); } We check the placement flags and only directly use kmap when it is cached. Otherwise we fallback to using vmap. So that should indeed work correctly. - The reason remapping the CPU side as cacheable does work (which I did test) is because the GPU's uncacheable accesses (which I assume are made using the NoSnoop PCIe transaction attribute) are actually emitted as cacheable in some cases. . On my AMD Seattle, with or without SMMU (which is stage 2 only), I must use cacheable accesses from the CPU side or things are broken. This might be a h/w flaw, though. . On systems with stage 1+2 SMMUs, the driver uses stage 1 translations which always override the memory attributes to cacheable for DMA coherent devices. This is what is affecting the Cavium ThunderX2 (although it appears the attributes emitted by the RC may be incorrect as well.) The latter issue is a shortcoming in the SMMU driver that we have to fix, i.e., it should take care not to modify the incoming attributes of DMA coherent PCIe devices for NoSnoop to be able to work. So in summary, the mismatch appears to be between the CPU accessing the vmap region with non-cacheable attributes and the GPU accessing the same memory with cacheable attributes, resulting in a loss of coherency and lots of visible corruption. Actually it is the other way around. The CPU thinks some data is in the cache and the GPU only updates the system memory version because the snoop flag is not set. That doesn't seem to be what is happening. As far as we can tell from our experiments, all inbound transactions are always cacheable, and so the only way to make things work is to ensure that the CPU uses the same attributes. Ok that doesn't make any sense. If inbound transactions are cacheable or not is irrelevant when the CPU always uses uncached accesses. See on the PCIe side you have the snoop bit in the read/write transactions which tells the root hub if the device wants to snoop caches or not. When the CPU accesses some memory as cached then devices need to snoop the cache for coherent accesses. When the CPU accesses some memory as uncached then devices can disable snooping to improve performance, but when they don't do this it is mandated by the spec that this still works. To be able to debug this further, could you elaborate a bit on - How does the hardware emit those uncached/wc inbound accesses? Do they rely on NoSnoop? The GPU has a separate page walker in the MC and the page tables there have a bits saying if the access should go to the PCIe bus and if yes if the snoop bit should be set. - Christian pointed out that some accesses must be uncached even when not using WC. What kind of accesses are those? And do they access system RAM? On some hardware generations we have a buggy engine which fails to forward the snoop bit and because of this the system memory page used by this engine must be uncached. But this only applies if you use ROCm in a specific configuration. OK. I don't know what that means tbh. Does this apply to both radeon and amdgpu? That applies to both, but only in rare corner cases. Christian. <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> </head> <body text="#000000" bgcolor="#FFFFFF"> <div class="moz-cite-prefix">Am 14.01.19 um 18:32 schrieb Ard Biesheuvel:<br> </div> <blockquote type="cite" cite="mid:CAKv+Gu_JyO2KNtdN1-_TcoL_kqs0yswdwBrpAXA6-jyWuDrj+g@mail.gmail.com"> [SNIP] <blockquote type="cite"> <blockquote type="cite"> <pre class="moz-quote-pre" wrap="">- The existence of a linear region mapping with mismatched attributes is likely not the culprit here. (We do something similar with non-cache coherent DMA in other places). </pre> </blockquote> <pre class="moz-quote-pre" wrap=""> This is still rather problematic. The issue is that we often don't create a vmap for a page, but rather access the page directly using the linear mapping. So we would use the wrong access type here. </pre> </blockquote> <pre class="moz-quote-pre" wrap=""> Yes. But how are these accesses done? Are they wrapped in a kmap()?</pre> </blockquote> <br> I just double checked the source for this and it should actually be handled gracefully:<br> <blockquote type="cite"> <pre> <span class="k">if</span> <span class="p">(</span><span class="n"><a href="https://elixir.bootlin.com/linux/v5.0-rc1/ident/num_pages">num_pages</a></span> <span class="o">==</span> <span class="mi">1</span> <span class="o">&&</span> <span class="p">(</span><span class="n"><a href="https://elixir.bootlin.com/linux/v5.0-rc1/ident/mem">mem</a></span><span class="o">-></span><span class="n"><a href="https://elixir.bootlin.com/linux/v5.0-rc1/ident/placement">placement</a></span> <span class="o">&</span> <span class="n"><a href="https://elixir.bootlin.com/linux/v5.0-rc1/ident/TTM_PL_FLAG_CACHED">TTM_PL_FLAG_CACHED</a></span><span class="p">))</span> <span class="p">{</span> <span class="cm">/*</span> <span class="cm"> * We're mapping a single page, and the desired</span> <span class="cm"> * page protection is consistent with the bo.</span> <span class="cm"> */</span> <span class="n"><a href="https://elixir.bootlin.com/linux/v5.0-rc1/ident/map">map</a></span><span class="o">-></span><span class="n">bo_kmap_type</span> <span class="o">=</span> <span class="n"><a href="https://elixir.bootlin.com/linux/v5.0-rc1/ident/ttm_bo_map_kmap">ttm_bo_map_kmap</a></span><span class="p">;</span> <span class="n"><a href="https://elixir.bootlin.com/linux/v5.0-rc1/ident/map">map</a></span><span class="o">-></span><span class="n">page</span> <span class="o">=</span> <span class="n">ttm</span><span class="o">-></span><span class="n"><a href="https://elixir.bootlin.com/linux/v5.0-rc1/ident/pages">pages</a></span><span class="p">[</span><span class="n">start_page</span><span class="p">];</span> <span class="n"><a href="https://elixir.bootlin.com/linux/v5.0-rc1/ident/map">map</a></span><span class="o">-></span><span class="n">virtual</span> <span class="o">=</span> <span class="n"><a href="https://elixir.bootlin.com/linux/v5.0-rc1/ident/kmap">kmap</a></span><span class="p">(</span><span class="n"><a href="https://elixir.bootlin.com/linux/v5.0-rc1/ident/map">map</a></span><span class="o">-></span><span class="n">page</span><span class="p">);</span> <span class="p">}</span> <span class="k">else</span> <span class="p">{</span> <span class="cm">/*</span> <span class="cm"> * We need to use vmap to get the desired page protection</span> <span class="cm"> * or to make the buffer object look contiguous.</span> <span class="cm"> */</span> <span class="n">prot</span> <span class="o">=</span> <span class="n"><a href="https://elixir.bootlin.com/linux/v5.0-rc1/ident/ttm_io_prot">ttm_io_prot</a></span><span class="p">(</span><span class="n"><a href="https://elixir.bootlin.com/linux/v5.0-rc1/ident/mem">mem</a></span><span class="o">-></span><span class="n"><a href="https://elixir.bootlin.com/linux/v5.0-rc1/ident/placement">placement</a></span><span class="p">,</span> <span class="n"><a href="https://elixir.bootlin.com/linux/v5.0-rc1/ident/PAGE_KERNEL">PAGE_KERNEL</a></span><span class="p">);</span> <span class="n"><a href="https://elixir.bootlin.com/linux/v5.0-rc1/ident/map">map</a></span><span class="o">-></span><span class="n">bo_kmap_type</span> <span class="o">=</span> <span class="n"><a href="https://elixir.bootlin.com/linux/v5.0-rc1/ident/ttm_bo_map_vmap">ttm_bo_map_vmap</a></span><span class="p">;</span> <span class="n"><a href="https://elixir.bootlin.com/linux/v5.0-rc1/ident/map">map</a></span><span class="o">-></span><span class="n">virtual</span> <span class="o">=</span> <span class="n"><a href="https://elixir.bootlin.com/linux/v5.0-rc1/ident/vmap">vmap</a></span><span class="p">(</span><span class="n">ttm</span><span class="o">-></span><span class="n"><a href="https://elixir.bootlin.com/linux/v5.0-rc1/ident/pages">pages</a></span> <span class="o">+</span> <span class="n">start_page</span><span class="p">,</span> <span class="n"><a href="https://elixir.bootlin.com/linux/v5.0-rc1/ident/num_pages">num_pages</a></span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">prot</span><span class="p">);</span> <span class="p">}</span></pre> </blockquote> <br> We check the placement flags and only directly use kmap when it is cached. Otherwise we fallback to using vmap. So that should indeed work correctly.<br> <br> <blockquote type="cite" cite="mid:CAKv+Gu_JyO2KNtdN1-_TcoL_kqs0yswdwBrpAXA6-jyWuDrj+g@mail.gmail.com"> <blockquote type="cite"> <blockquote type="cite"> <pre class="moz-quote-pre" wrap="">- The reason remapping the CPU side as cacheable does work (which I did test) is because the GPU's uncacheable accesses (which I assume are made using the NoSnoop PCIe transaction attribute) are actually emitted as cacheable in some cases. . On my AMD Seattle, with or without SMMU (which is stage 2 only), I must use cacheable accesses from the CPU side or things are broken. This might be a h/w flaw, though. . On systems with stage 1+2 SMMUs, the driver uses stage 1 translations which always override the memory attributes to cacheable for DMA coherent devices. This is what is affecting the Cavium ThunderX2 (although it appears the attributes emitted by the RC may be incorrect as well.) The latter issue is a shortcoming in the SMMU driver that we have to fix, i.e., it should take care not to modify the incoming attributes of DMA coherent PCIe devices for NoSnoop to be able to work. So in summary, the mismatch appears to be between the CPU accessing the vmap region with non-cacheable attributes and the GPU accessing the same memory with cacheable attributes, resulting in a loss of coherency and lots of visible corruption. </pre> </blockquote> <pre class="moz-quote-pre" wrap=""> Actually it is the other way around. The CPU thinks some data is in the cache and the GPU only updates the system memory version because the snoop flag is not set. </pre> </blockquote> <pre class="moz-quote-pre" wrap=""> That doesn't seem to be what is happening. As far as we can tell from our experiments, all inbound transactions are always cacheable, and so the only way to make things work is to ensure that the CPU uses the same attributes.</pre> </blockquote> <br> Ok that doesn't make any sense. If inbound transactions are cacheable or not is irrelevant when the CPU always uses uncached accesses.<br> <br> See on the PCIe side you have the snoop bit in the read/write transactions which tells the root hub if the device wants to snoop caches or not.<br> <br> When the CPU accesses some memory as cached then devices need to snoop the cache for coherent accesses.<br> <br> When the CPU accesses some memory as uncached then devices can disable snooping to improve performance, but when they don't do this it is mandated by the spec that this still works.<br> <br> <blockquote type="cite" cite="mid:CAKv+Gu_JyO2KNtdN1-_TcoL_kqs0yswdwBrpAXA6-jyWuDrj+g@mail.gmail.com"> <blockquote type="cite"> <blockquote type="cite"> <pre class="moz-quote-pre" wrap="">To be able to debug this further, could you elaborate a bit on - How does the hardware emit those uncached/wc inbound accesses? Do they rely on NoSnoop? </pre> </blockquote> <pre class="moz-quote-pre" wrap=""> The GPU has a separate page walker in the MC and the page tables there have a bits saying if the access should go to the PCIe bus and if yes if the snoop bit should be set. </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap="">- Christian pointed out that some accesses must be uncached even when not using WC. What kind of accesses are those? And do they access system RAM? </pre> </blockquote> <pre class="moz-quote-pre" wrap=""> On some hardware generations we have a buggy engine which fails to forward the snoop bit and because of this the system memory page used by this engine must be uncached. But this only applies if you use ROCm in a specific configuration. </pre> </blockquote> <pre class="moz-quote-pre" wrap=""> OK. I don't know what that means tbh. Does this apply to both radeon and amdgpu? </pre> </blockquote> <br> That applies to both, but only in rare corner cases.<br> <br> Christian.<br> <br> </body> </html>
On Mon, Jan 14, 2019 at 07:07:54PM +0000, Koenig, Christian wrote: > Am 14.01.19 um 18:32 schrieb Ard Biesheuvel: > - The reason remapping the CPU side as cacheable does work (which I > did test) is because the GPU's uncacheable accesses (which I assume > are made using the NoSnoop PCIe transaction attribute) are actually > emitted as cacheable in some cases. > . On my AMD Seattle, with or without SMMU (which is stage 2 only), I > must use cacheable accesses from the CPU side or things are broken. > This might be a h/w flaw, though. > . On systems with stage 1+2 SMMUs, the driver uses stage 1 > translations which always override the memory attributes to cacheable > for DMA coherent devices. This is what is affecting the Cavium > ThunderX2 (although it appears the attributes emitted by the RC may be > incorrect as well.) > > The latter issue is a shortcoming in the SMMU driver that we have to > fix, i.e., it should take care not to modify the incoming attributes > of DMA coherent PCIe devices for NoSnoop to be able to work. > > So in summary, the mismatch appears to be between the CPU accessing > the vmap region with non-cacheable attributes and the GPU accessing > the same memory with cacheable attributes, resulting in a loss of > coherency and lots of visible corruption. > > Actually it is the other way around. The CPU thinks some data is in the > cache and the GPU only updates the system memory version because the > snoop flag is not set. > > > That doesn't seem to be what is happening. As far as we can tell from > our experiments, all inbound transactions are always cacheable, and so > the only way to make things work is to ensure that the CPU uses the > same attributes. > > > Ok that doesn't make any sense. If inbound transactions are cacheable or not is > irrelevant when the CPU always uses uncached accesses. > > See on the PCIe side you have the snoop bit in the read/write transactions > which tells the root hub if the device wants to snoop caches or not. > > When the CPU accesses some memory as cached then devices need to snoop the > cache for coherent accesses. > > When the CPU accesses some memory as uncached then devices can disable snooping > to improve performance, but when they don't do this it is mandated by the spec > that this still works. Which spec? The Arm architecture (and others including Power afaiu) doesn't guarantee coherency when memory is accessed using mismatched cacheability attributes. Will
Am 14.01.19 um 20:13 schrieb Will Deacon: > On Mon, Jan 14, 2019 at 07:07:54PM +0000, Koenig, Christian wrote: >> Am 14.01.19 um 18:32 schrieb Ard Biesheuvel: >> - The reason remapping the CPU side as cacheable does work (which I >> did test) is because the GPU's uncacheable accesses (which I assume >> are made using the NoSnoop PCIe transaction attribute) are actually >> emitted as cacheable in some cases. >> . On my AMD Seattle, with or without SMMU (which is stage 2 only), I >> must use cacheable accesses from the CPU side or things are broken. >> This might be a h/w flaw, though. >> . On systems with stage 1+2 SMMUs, the driver uses stage 1 >> translations which always override the memory attributes to cacheable >> for DMA coherent devices. This is what is affecting the Cavium >> ThunderX2 (although it appears the attributes emitted by the RC may be >> incorrect as well.) >> >> The latter issue is a shortcoming in the SMMU driver that we have to >> fix, i.e., it should take care not to modify the incoming attributes >> of DMA coherent PCIe devices for NoSnoop to be able to work. >> >> So in summary, the mismatch appears to be between the CPU accessing >> the vmap region with non-cacheable attributes and the GPU accessing >> the same memory with cacheable attributes, resulting in a loss of >> coherency and lots of visible corruption. >> >> Actually it is the other way around. The CPU thinks some data is in the >> cache and the GPU only updates the system memory version because the >> snoop flag is not set. >> >> >> That doesn't seem to be what is happening. As far as we can tell from >> our experiments, all inbound transactions are always cacheable, and so >> the only way to make things work is to ensure that the CPU uses the >> same attributes. >> >> >> Ok that doesn't make any sense. If inbound transactions are cacheable or not is >> irrelevant when the CPU always uses uncached accesses. >> >> See on the PCIe side you have the snoop bit in the read/write transactions >> which tells the root hub if the device wants to snoop caches or not. >> >> When the CPU accesses some memory as cached then devices need to snoop the >> cache for coherent accesses. >> >> When the CPU accesses some memory as uncached then devices can disable snooping >> to improve performance, but when they don't do this it is mandated by the spec >> that this still works. > Which spec? The PCIe spec. The snoop bit (or rather the NoSnoop) in the transaction is perfectly optional IIRC. > The Arm architecture (and others including Power afaiu) doesn't > guarantee coherency when memory is accessed using mismatched cacheability > attributes. Well what exactly goes wrong on ARM? As far as I know Power doesn't really supports un-cached memory at all, except for a very very old and odd configuration with AGP. I mean in theory I agree that devices should use matching cacheability attributes, but in practice I know of quite a bunch of devices/engines which fails to do this correctly. Regards, Christian. > > Will
[+ BenH and MPE] On Mon, Jan 14, 2019 at 07:21:08PM +0000, Koenig, Christian wrote: > Am 14.01.19 um 20:13 schrieb Will Deacon: > > On Mon, Jan 14, 2019 at 07:07:54PM +0000, Koenig, Christian wrote: > >> Am 14.01.19 um 18:32 schrieb Ard Biesheuvel: > >> - The reason remapping the CPU side as cacheable does work (which I > >> did test) is because the GPU's uncacheable accesses (which I assume > >> are made using the NoSnoop PCIe transaction attribute) are actually > >> emitted as cacheable in some cases. > >> . On my AMD Seattle, with or without SMMU (which is stage 2 only), I > >> must use cacheable accesses from the CPU side or things are broken. > >> This might be a h/w flaw, though. > >> . On systems with stage 1+2 SMMUs, the driver uses stage 1 > >> translations which always override the memory attributes to cacheable > >> for DMA coherent devices. This is what is affecting the Cavium > >> ThunderX2 (although it appears the attributes emitted by the RC may be > >> incorrect as well.) > >> > >> The latter issue is a shortcoming in the SMMU driver that we have to > >> fix, i.e., it should take care not to modify the incoming attributes > >> of DMA coherent PCIe devices for NoSnoop to be able to work. > >> > >> So in summary, the mismatch appears to be between the CPU accessing > >> the vmap region with non-cacheable attributes and the GPU accessing > >> the same memory with cacheable attributes, resulting in a loss of > >> coherency and lots of visible corruption. > >> > >> Actually it is the other way around. The CPU thinks some data is in the > >> cache and the GPU only updates the system memory version because the > >> snoop flag is not set. > >> > >> > >> That doesn't seem to be what is happening. As far as we can tell from > >> our experiments, all inbound transactions are always cacheable, and so > >> the only way to make things work is to ensure that the CPU uses the > >> same attributes. > >> > >> > >> Ok that doesn't make any sense. If inbound transactions are cacheable or not is > >> irrelevant when the CPU always uses uncached accesses. > >> > >> See on the PCIe side you have the snoop bit in the read/write transactions > >> which tells the root hub if the device wants to snoop caches or not. > >> > >> When the CPU accesses some memory as cached then devices need to snoop the > >> cache for coherent accesses. > >> > >> When the CPU accesses some memory as uncached then devices can disable snooping > >> to improve performance, but when they don't do this it is mandated by the spec > >> that this still works. > > Which spec? > > The PCIe spec. The snoop bit (or rather the NoSnoop) in the transaction > is perfectly optional IIRC. Thanks for the clarification. I suspect the devil is in the details, so I'll try to dig up the spec. > > The Arm architecture (and others including Power afaiu) doesn't > > guarantee coherency when memory is accessed using mismatched cacheability > > attributes. > > Well what exactly goes wrong on ARM? Coherency (and any ordering guarantees) can be lost, so the device may see a stale copy of the memory it is accessing. The architecture requires cache maintenance to restore coherency between the mismatched aliases. > As far as I know Power doesn't really supports un-cached memory at all, > except for a very very old and odd configuration with AGP. Hopefully Michael/Ben can elaborate here, but I was under the (possibly mistaken) impression that mismatched attributes could cause a machine-check on Power. > I mean in theory I agree that devices should use matching cacheability > attributes, but in practice I know of quite a bunch of devices/engines > which fails to do this correctly. Given that the experiences of Ard and I so far has been that the system ends up making everything cacheable after the RC, perhaps that's an attempt by system designers to correct for these devices. Unfortunately, it doesn't help if the CPU carefully goes ahead and establishes a non-cacheable mapping for itself! Will
Hi Will, Will Deacon <will.deacon@arm.com> writes: > [+ BenH and MPE] > > On Mon, Jan 14, 2019 at 07:21:08PM +0000, Koenig, Christian wrote: >> Am 14.01.19 um 20:13 schrieb Will Deacon: ... > >> > The Arm architecture (and others including Power afaiu) doesn't >> > guarantee coherency when memory is accessed using mismatched cacheability >> > attributes. ... > >> As far as I know Power doesn't really supports un-cached memory at all, >> except for a very very old and odd configuration with AGP. > > Hopefully Michael/Ben can elaborate here, but I was under the (possibly > mistaken) impression that mismatched attributes could cause a machine-check > on Power. That's what I've always been told, but I can't actually find where it's documented, I'll keep searching. But you're right that mixing cached / uncached is not really supported, and probably results in a machine check or worse. cheers
On Tue, 2019-01-15 at 22:31 +1100, Michael Ellerman wrote: > > > As far as I know Power doesn't really supports un-cached memory at all, > > > except for a very very old and odd configuration with AGP. > > > > Hopefully Michael/Ben can elaborate here, but I was under the (possibly > > mistaken) impression that mismatched attributes could cause a machine-check > > on Power. > > That's what I've always been told, but I can't actually find where it's > documented, I'll keep searching. > > But you're right that mixing cached / uncached is not really supported, > and probably results in a machine check or worse. .. or worse :) It could checkstop. It's also my understanding that on ARM v7 and above, it's technically forbidden to map the same physical page with both cached and non-cached mappings, since the cached one could prefetch (or speculatively load), thus creating collisions and inconsistencies. Am I wrong here ? The old hack of using non-cached mapping to avoid snoop cost in AGP and others is just that ... an ugly and horrible hacks that should have never eventuated, when the search for performance pushes HW people into utter insanity :) Cheers, Ben.
Am 16.01.19 um 01:33 schrieb Benjamin Herrenschmidt: > On Tue, 2019-01-15 at 22:31 +1100, Michael Ellerman wrote: >>>> As far as I know Power doesn't really supports un-cached memory at all, >>>> except for a very very old and odd configuration with AGP. >>> Hopefully Michael/Ben can elaborate here, but I was under the (possibly >>> mistaken) impression that mismatched attributes could cause a machine-check >>> on Power. >> That's what I've always been told, but I can't actually find where it's >> documented, I'll keep searching. >> >> But you're right that mixing cached / uncached is not really supported, >> and probably results in a machine check or worse. > .. or worse :) It could checkstop. Not sure if that would be so bad, it would at least give us a clear indicator that something is wrong instead of silently corrupting data. > It's also my understanding that on ARM v7 and above, it's technically > forbidden to map the same physical page with both cached and non-cached > mappings, since the cached one could prefetch (or speculatively load), > thus creating collisions and inconsistencies. Am I wrong here ? No, but you answer the wrong question. See we don't want to have different mappings of cached and non-cached on the CPU, but rather want to know if a snooped DMA from the PCIe counts as cached access as well. As far as I know on x86 it doesn't, so when you have an un-cached page you can still access it with a snooping DMA read/write operation and don't cause trouble. > The old hack of using non-cached mapping to avoid snoop cost in AGP and > others is just that ... an ugly and horrible hacks that should have > never eventuated, when the search for performance pushes HW people into > utter insanity :) Well I agree that un-cached system memory makes things much more complicated for a questionable gain. But fact is we now have to deal with the mess, so no point in complaining about it to much :) Cheers, Christian. > > Cheers, > Ben. > > >
On Wed, 16 Jan 2019 at 08:36, Koenig, Christian <Christian.Koenig@amd.com> wrote: > > Am 16.01.19 um 01:33 schrieb Benjamin Herrenschmidt: > > On Tue, 2019-01-15 at 22:31 +1100, Michael Ellerman wrote: > >>>> As far as I know Power doesn't really supports un-cached memory at all, > >>>> except for a very very old and odd configuration with AGP. > >>> Hopefully Michael/Ben can elaborate here, but I was under the (possibly > >>> mistaken) impression that mismatched attributes could cause a machine-check > >>> on Power. > >> That's what I've always been told, but I can't actually find where it's > >> documented, I'll keep searching. > >> > >> But you're right that mixing cached / uncached is not really supported, > >> and probably results in a machine check or worse. > > .. or worse :) It could checkstop. > > Not sure if that would be so bad, it would at least give us a clear > indicator that something is wrong instead of silently corrupting data. > > > It's also my understanding that on ARM v7 and above, it's technically > > forbidden to map the same physical page with both cached and non-cached > > mappings, since the cached one could prefetch (or speculatively load), > > thus creating collisions and inconsistencies. Am I wrong here ? > > No, but you answer the wrong question. > > See we don't want to have different mappings of cached and non-cached on > the CPU, but rather want to know if a snooped DMA from the PCIe counts > as cached access as well. > > As far as I know on x86 it doesn't, so when you have an un-cached page > you can still access it with a snooping DMA read/write operation and > don't cause trouble. > I think it is the other way around. The question is, on an otherwise cache coherent device, whether the NoSnoop attribute set by the GPU propagates all the way to the bus so that it bypasses the caches. On x86, we can tolerate if this is not the case, since uncached memory accesses by the CPU snoop the caches as well. On other architectures, uncached accesses go straight to main memory, so if the device wrote anything to the caches we won't see it. So to use this optimization, you have to either be 100% sure that NoSnoop is implemented correctly, or have a x86 CPU. > > The old hack of using non-cached mapping to avoid snoop cost in AGP and > > others is just that ... an ugly and horrible hacks that should have > > never eventuated, when the search for performance pushes HW people into > > utter insanity :) > > Well I agree that un-cached system memory makes things much more > complicated for a questionable gain. > > But fact is we now have to deal with the mess, so no point in > complaining about it to much :) > Indeed. I wonder if we should just disable it altogether unless CONFIG_X86=y
On Wed, 2019-01-16 at 07:35 +0000, Koenig, Christian wrote: > No, but you answer the wrong question. > > See we don't want to have different mappings of cached and non-cached on > the CPU, but rather want to know if a snooped DMA from the PCIe counts > as cached access as well. > > As far as I know on x86 it doesn't, so when you have an un-cached page > you can still access it with a snooping DMA read/write operation and > don't cause trouble. Hrm... well, if you map it uncached on the CPU on powerpc, a snoop DMA will work fine too, it won't hit any cache. The only problem I'm aware of is a core (or CAPI device) emiting non-cached load/stores colliding with a cache snooper. > > The old hack of using non-cached mapping to avoid snoop cost in AGP and > > others is just that ... an ugly and horrible hacks that should have > > never eventuated, when the search for performance pushes HW people into > > utter insanity :) > > Well I agree that un-cached system memory makes things much more > complicated for a questionable gain. > > But fact is we now have to deal with the mess, so no point in > complaining about it to much :) I wish we could just sent the HW designers home and tell them we won't support that crap... oh well. Ben. > Cheers, > Christian. > > > Cheers, > > Ben. > > > >
On Wed, 2019-01-16 at 08:47 +0100, Ard Biesheuvel wrote: > > As far as I know on x86 it doesn't, so when you have an un-cached page > > you can still access it with a snooping DMA read/write operation and > > don't cause trouble. > > > > I think it is the other way around. The question is, on an otherwise > cache coherent device, whether the NoSnoop attribute set by the GPU > propagates all the way to the bus so that it bypasses the caches. On powerpc it's ignored, all DMA accesses will be snooped. But that's fine regardless of whether the memory was mapped cachable or not, the snooper will simply not find anything if not. I *think* we only do cache inject if the line already exists in one of the caches. > On x86, we can tolerate if this is not the case, since uncached memory > accesses by the CPU snoop the caches as well. > > On other architectures, uncached accesses go straight to main memory, > so if the device wrote anything to the caches we won't see it. Well, on all powerpc implementations that I am aware of at least (dunno about ARM), they do, but we don't have a problem because I don't think the devices can/will write to the caches directly unless a corresponding line already exists (but I might be wrong, we need to double check all implementations which is tricky). I am not aware of any powerpc chip implementing NoSnoop. > So to use this optimization, you have to either be 100% sure that > NoSnoop is implemented correctly, or have a x86 CPU. > > > > The old hack of using non-cached mapping to avoid snoop cost in AGP and > > > others is just that ... an ugly and horrible hacks that should have > > > never eventuated, when the search for performance pushes HW people into > > > utter insanity :) > > > > Well I agree that un-cached system memory makes things much more > > complicated for a questionable gain. > > > > But fact is we now have to deal with the mess, so no point in > > complaining about it to much :) > > > > Indeed. I wonder if we should just disable it altogether unless CONFIG_X86=y The question is whether DMA from a device can instanciate cache lines in your system. This a system specific rather than architecture specific question I suspect... Cheers, Ben.
On Thu, 17 Jan 2019 at 07:07, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: > > On Wed, 2019-01-16 at 08:47 +0100, Ard Biesheuvel wrote: > > > As far as I know on x86 it doesn't, so when you have an un-cached page > > > you can still access it with a snooping DMA read/write operation and > > > don't cause trouble. > > > > > > > I think it is the other way around. The question is, on an otherwise > > cache coherent device, whether the NoSnoop attribute set by the GPU > > propagates all the way to the bus so that it bypasses the caches. > > On powerpc it's ignored, all DMA accesses will be snooped. But that's > fine regardless of whether the memory was mapped cachable or not, the > snooper will simply not find anything if not. I *think* we only do > cache inject if the line already exists in one of the caches. > Others should correct me if I am wrong, but arm64 SoCs often have L3 system caches, and I would expect inbound transactions with writeback write-allocate (WBWA) attributes to allocate there. > > On x86, we can tolerate if this is not the case, since uncached memory > > accesses by the CPU snoop the caches as well. > > > > On other architectures, uncached accesses go straight to main memory, > > so if the device wrote anything to the caches we won't see it. > > Well, on all powerpc implementations that I am aware of at least (dunno > about ARM), they do, but we don't have a problem because I don't think > the devices can/will write to the caches directly unless a > corresponding line already exists (but I might be wrong, we need to > double check all implementations which is tricky). > > I am not aware of any powerpc chip implementing NoSnoop. > Do you have any history on why this optimization is disabled for power unless CONFIG_NOT_CACHE_COHERENT is set? That also begs the question how any of this is supposed to work with non-cache coherent DMA. The code appears to always assume cache coherent, and provide non-cache coherent as an optimization if dma_arch_can_wc_memory() returns true. So I wonder if that helper should take a struct device pointer instead, and return true for non-cache coherent devices. > > So to use this optimization, you have to either be 100% sure that > > NoSnoop is implemented correctly, or have a x86 CPU. > > > > > > The old hack of using non-cached mapping to avoid snoop cost in AGP and > > > > others is just that ... an ugly and horrible hacks that should have > > > > never eventuated, when the search for performance pushes HW people into > > > > utter insanity :) > > > > > > Well I agree that un-cached system memory makes things much more > > > complicated for a questionable gain. > > > > > > But fact is we now have to deal with the mess, so no point in > > > complaining about it to much :) > > > > > > > Indeed. I wonder if we should just disable it altogether unless CONFIG_X86=y > > The question is whether DMA from a device can instanciate cache lines > in your system. This a system specific rather than architecture > specific question I suspect... > The ARM architecture permits it, afaict, and write-allocate is a hint so the implementation is free to ignore it, whether it is set or cleared.
diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c index 046a6dda690a..0c1eef5f7ae3 100644 --- a/drivers/gpu/drm/ttm/ttm_bo_util.c +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c @@ -530,6 +530,11 @@ pgprot_t ttm_io_prot(uint32_t caching_flags, pgprot_t tmp) if (caching_flags & TTM_PL_FLAG_CACHED) return tmp; +#if defined(__arm__) || defined(__aarch64__) + /* ARM only permits cached mappings of system memory */ + if (caching_flags & TTM_PL_SYSTEM) + return tmp; +#endif #if defined(__i386__) || defined(__x86_64__) if (caching_flags & TTM_PL_FLAG_WC) tmp = pgprot_writecombine(tmp);
ARM systems do not permit the use of anything other than cached mappings for system memory, since that memory may be mapped in the linear region as well, and the architecture does not permit aliases with mismatched attributes. So short-circuit the evaluation in ttm_io_prot() if the flags include TTM_PL_SYSTEM when running on ARM or arm64, and just return cached attributes immediately. This fixes the radeon and amdgpu [TBC] drivers when running on arm64. Without this change, amdgpu does not start at all, and radeon only produces corrupt display output. Cc: Christian Koenig <christian.koenig@amd.com> Cc: Huang Rui <ray.huang@amd.com> Cc: Junwei Zhang <Jerry.Zhang@amd.com> Cc: David Airlie <airlied@linux.ie> Reported-by: Carsten Haitzler <Carsten.Haitzler@arm.com> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> --- drivers/gpu/drm/ttm/ttm_bo_util.c | 5 +++++ 1 file changed, 5 insertions(+)