diff mbox series

arm64: cache: Lower ARCH_DMA_MINALIGN to 64 (L1_CACHE_BYTES)

Message ID 20210527124356.22367-1-will@kernel.org (mailing list archive)
State New, archived
Headers show
Series arm64: cache: Lower ARCH_DMA_MINALIGN to 64 (L1_CACHE_BYTES) | expand

Commit Message

Will Deacon May 27, 2021, 12:43 p.m. UTC
Back in 97303480753e ("arm64: Increase the max granular size"),
ARCH_DMA_MINALIGN was effectively increased to 128 bytes thanks to an
increase in L1_CACHE_BYTES due to an unsubstantiated performance claim
on the now obsolete ThunderX-1. Although this was reverted in
d93277b9839b, ARCH_DMA_MINALIGN was kept at 128 bytes by ebc7e21e0fa2
("arm64: Increase ARCH_DMA_MINALIGN to 128").

During discussion of the original patch, it was reported that the change
also prevented a warning during boot on (again, now obsolete) Qualcomm
server hardware where the cache writeback granule was larger than 64
bytes. The reason for this warning was because non-coherent DMA could
lead to data corruption due to unexpected writeback from the CPU where a
cacheline is shared with other allocations.

Since then, systems have appeared with larger cachelines still, and so
commit 8f5c9037a55b ("arm64/mm: Correct the cache line size warning with
non coherent device") reworked the warning so that it only appears on
systems where non-coherent DMA is actually required and taints the
kernel with TAINT_CPU_OUT_OF_SPEC. We are not aware of any systems, even
including the aforementioned obsolete machines, which have a CWG larger
than 64 bytes and require non-coherent DMA.

More recently, it has been reported that a ARCH_DMA_MINALIGN of 128
bytes wastes considerable memory (~6% immediately after boot on one
system).

Reduce ARCH_DMA_MINALIGN to 64 bytes and allow the warning/taint to
indicate if there are machines that unknowingly rely on this.

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Vincent Whitchurch <vincent.whitchurch@axis.com>
Link: https://lore.kernel.org/linux-arm-kernel/1442944788-17254-1-git-send-email-rric@kernel.org/
Link: https://lore.kernel.org/linux-arm-kernel/CAOZdJXUiRMAguDV+HEJqPg57MyBNqEcTyaH+ya=U93NHb-pdJA@mail.gmail.com/
Link: https://lore.kernel.org/linux-arm-kernel/20190614131141.4428-1-msys.mizuma@gmail.com/
Link: https://lore.kernel.org/r/20210517074332.28280-1-vincent.whitchurch@axis.com
Signed-off-by: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/cache.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Catalin Marinas May 27, 2021, 1:11 p.m. UTC | #1
On Thu, May 27, 2021 at 01:43:56PM +0100, Will Deacon wrote:
> Back in 97303480753e ("arm64: Increase the max granular size"),
> ARCH_DMA_MINALIGN was effectively increased to 128 bytes thanks to an
> increase in L1_CACHE_BYTES due to an unsubstantiated performance claim
> on the now obsolete ThunderX-1. Although this was reverted in
> d93277b9839b, ARCH_DMA_MINALIGN was kept at 128 bytes by ebc7e21e0fa2
> ("arm64: Increase ARCH_DMA_MINALIGN to 128").
> 
> During discussion of the original patch, it was reported that the change
> also prevented a warning during boot on (again, now obsolete) Qualcomm
> server hardware where the cache writeback granule was larger than 64
> bytes. The reason for this warning was because non-coherent DMA could
> lead to data corruption due to unexpected writeback from the CPU where a
> cacheline is shared with other allocations.
> 
> Since then, systems have appeared with larger cachelines still, and so
> commit 8f5c9037a55b ("arm64/mm: Correct the cache line size warning with
> non coherent device") reworked the warning so that it only appears on
> systems where non-coherent DMA is actually required and taints the
> kernel with TAINT_CPU_OUT_OF_SPEC. We are not aware of any systems, even
> including the aforementioned obsolete machines, which have a CWG larger
> than 64 bytes and require non-coherent DMA.
> 
> More recently, it has been reported that a ARCH_DMA_MINALIGN of 128
> bytes wastes considerable memory (~6% immediately after boot on one
> system).
> 
> Reduce ARCH_DMA_MINALIGN to 64 bytes and allow the warning/taint to
> indicate if there are machines that unknowingly rely on this.
> 
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Vincent Whitchurch <vincent.whitchurch@axis.com>
> Link: https://lore.kernel.org/linux-arm-kernel/1442944788-17254-1-git-send-email-rric@kernel.org/
> Link: https://lore.kernel.org/linux-arm-kernel/CAOZdJXUiRMAguDV+HEJqPg57MyBNqEcTyaH+ya=U93NHb-pdJA@mail.gmail.com/
> Link: https://lore.kernel.org/linux-arm-kernel/20190614131141.4428-1-msys.mizuma@gmail.com/
> Link: https://lore.kernel.org/r/20210517074332.28280-1-vincent.whitchurch@axis.com
> Signed-off-by: Will Deacon <will@kernel.org>

Unless we hear from anyone that the warning is triggered, I think we
should change this in mainline.

Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Mark Rutland May 27, 2021, 1:19 p.m. UTC | #2
On Thu, May 27, 2021 at 01:43:56PM +0100, Will Deacon wrote:
> Back in 97303480753e ("arm64: Increase the max granular size"),
> ARCH_DMA_MINALIGN was effectively increased to 128 bytes thanks to an
> increase in L1_CACHE_BYTES due to an unsubstantiated performance claim
> on the now obsolete ThunderX-1. Although this was reverted in
> d93277b9839b, ARCH_DMA_MINALIGN was kept at 128 bytes by ebc7e21e0fa2
> ("arm64: Increase ARCH_DMA_MINALIGN to 128").
> 
> During discussion of the original patch, it was reported that the change
> also prevented a warning during boot on (again, now obsolete) Qualcomm
> server hardware where the cache writeback granule was larger than 64
> bytes. The reason for this warning was because non-coherent DMA could
> lead to data corruption due to unexpected writeback from the CPU where a
> cacheline is shared with other allocations.
> 
> Since then, systems have appeared with larger cachelines still, and so
> commit 8f5c9037a55b ("arm64/mm: Correct the cache line size warning with
> non coherent device") reworked the warning so that it only appears on
> systems where non-coherent DMA is actually required and taints the
> kernel with TAINT_CPU_OUT_OF_SPEC. We are not aware of any systems, even
> including the aforementioned obsolete machines, which have a CWG larger
> than 64 bytes and require non-coherent DMA.
> 
> More recently, it has been reported that a ARCH_DMA_MINALIGN of 128
> bytes wastes considerable memory (~6% immediately after boot on one
> system).
> 
> Reduce ARCH_DMA_MINALIGN to 64 bytes and allow the warning/taint to
> indicate if there are machines that unknowingly rely on this.

The rationale above makes sense to me, so:

Acked-by: Mark Rutland <mark.rutland@arm.com>

Mark.

> 
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Vincent Whitchurch <vincent.whitchurch@axis.com>
> Link: https://lore.kernel.org/linux-arm-kernel/1442944788-17254-1-git-send-email-rric@kernel.org/
> Link: https://lore.kernel.org/linux-arm-kernel/CAOZdJXUiRMAguDV+HEJqPg57MyBNqEcTyaH+ya=U93NHb-pdJA@mail.gmail.com/
> Link: https://lore.kernel.org/linux-arm-kernel/20190614131141.4428-1-msys.mizuma@gmail.com/
> Link: https://lore.kernel.org/r/20210517074332.28280-1-vincent.whitchurch@axis.com
> Signed-off-by: Will Deacon <will@kernel.org>
> ---
>  arch/arm64/include/asm/cache.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
> index a074459f8f2f..a9c0716e7440 100644
> --- a/arch/arm64/include/asm/cache.h
> +++ b/arch/arm64/include/asm/cache.h
> @@ -47,7 +47,7 @@
>   * cache before the transfer is done, causing old data to be seen by
>   * the CPU.
>   */
> -#define ARCH_DMA_MINALIGN	(128)
> +#define ARCH_DMA_MINALIGN	L1_CACHE_BYTES
>  
>  #ifdef CONFIG_KASAN_SW_TAGS
>  #define ARCH_SLAB_MINALIGN	(1ULL << KASAN_SHADOW_SCALE_SHIFT)
> -- 
> 2.31.1.818.g46aad6cb9e-goog
>
Arnd Bergmann May 28, 2021, 9:35 a.m. UTC | #3
On Thu, May 27, 2021 at 3:19 PM Mark Rutland <mark.rutland@arm.com> wrote:
> > More recently, it has been reported that a ARCH_DMA_MINALIGN of 128
> > bytes wastes considerable memory (~6% immediately after boot on one
> > system).
> >
> > Reduce ARCH_DMA_MINALIGN to 64 bytes and allow the warning/taint to
> > indicate if there are machines that unknowingly rely on this.
>
> The rationale above makes sense to me, so:
>
> Acked-by: Mark Rutland <mark.rutland@arm.com>

I think it would make sense to go even further than this in the
future, and allow
setting a smaller minimum alignment depending what hardware is detected
at boot time. That would clearly require more work and testing to do right,
so for the moment, this approach is the best we can do.

Acked-by: Arnd Bergmann <arnd@arndb.de>
Ard Biesheuvel May 31, 2021, 5:38 a.m. UTC | #4
On Thu, 27 May 2021 at 14:44, Will Deacon <will@kernel.org> wrote:
>
> Back in 97303480753e ("arm64: Increase the max granular size"),
> ARCH_DMA_MINALIGN was effectively increased to 128 bytes thanks to an
> increase in L1_CACHE_BYTES due to an unsubstantiated performance claim
> on the now obsolete ThunderX-1. Although this was reverted in
> d93277b9839b, ARCH_DMA_MINALIGN was kept at 128 bytes by ebc7e21e0fa2
> ("arm64: Increase ARCH_DMA_MINALIGN to 128").
>
> During discussion of the original patch, it was reported that the change
> also prevented a warning during boot on (again, now obsolete) Qualcomm
> server hardware where the cache writeback granule was larger than 64
> bytes. The reason for this warning was because non-coherent DMA could
> lead to data corruption due to unexpected writeback from the CPU where a
> cacheline is shared with other allocations.
>
> Since then, systems have appeared with larger cachelines still, and so
> commit 8f5c9037a55b ("arm64/mm: Correct the cache line size warning with
> non coherent device") reworked the warning so that it only appears on
> systems where non-coherent DMA is actually required and taints the
> kernel with TAINT_CPU_OUT_OF_SPEC. We are not aware of any systems, even
> including the aforementioned obsolete machines, which have a CWG larger
> than 64 bytes and require non-coherent DMA.
>
> More recently, it has been reported that a ARCH_DMA_MINALIGN of 128
> bytes wastes considerable memory (~6% immediately after boot on one
> system).
>
> Reduce ARCH_DMA_MINALIGN to 64 bytes and allow the warning/taint to
> indicate if there are machines that unknowingly rely on this.
>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Vincent Whitchurch <vincent.whitchurch@axis.com>
> Link: https://lore.kernel.org/linux-arm-kernel/1442944788-17254-1-git-send-email-rric@kernel.org/
> Link: https://lore.kernel.org/linux-arm-kernel/CAOZdJXUiRMAguDV+HEJqPg57MyBNqEcTyaH+ya=U93NHb-pdJA@mail.gmail.com/
> Link: https://lore.kernel.org/linux-arm-kernel/20190614131141.4428-1-msys.mizuma@gmail.com/
> Link: https://lore.kernel.org/r/20210517074332.28280-1-vincent.whitchurch@axis.com
> Signed-off-by: Will Deacon <will@kernel.org>

Acked-by: Ard Biesheuvel <ardb@kernel.org>

> ---
>  arch/arm64/include/asm/cache.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
> index a074459f8f2f..a9c0716e7440 100644
> --- a/arch/arm64/include/asm/cache.h
> +++ b/arch/arm64/include/asm/cache.h
> @@ -47,7 +47,7 @@
>   * cache before the transfer is done, causing old data to be seen by
>   * the CPU.
>   */
> -#define ARCH_DMA_MINALIGN      (128)
> +#define ARCH_DMA_MINALIGN      L1_CACHE_BYTES
>
>  #ifdef CONFIG_KASAN_SW_TAGS
>  #define ARCH_SLAB_MINALIGN     (1ULL << KASAN_SHADOW_SCALE_SHIFT)
> --
> 2.31.1.818.g46aad6cb9e-goog
>
Catalin Marinas June 1, 2021, 10:14 a.m. UTC | #5
On Fri, May 28, 2021 at 11:35:32AM +0200, Arnd Bergmann wrote:
> On Thu, May 27, 2021 at 3:19 PM Mark Rutland <mark.rutland@arm.com> wrote:
> > > More recently, it has been reported that a ARCH_DMA_MINALIGN of 128
> > > bytes wastes considerable memory (~6% immediately after boot on one
> > > system).
> > >
> > > Reduce ARCH_DMA_MINALIGN to 64 bytes and allow the warning/taint to
> > > indicate if there are machines that unknowingly rely on this.
> >
> > The rationale above makes sense to me, so:
> >
> > Acked-by: Mark Rutland <mark.rutland@arm.com>
> 
> I think it would make sense to go even further than this in the
> future, and allow
> setting a smaller minimum alignment depending what hardware is detected
> at boot time.

Yeah, we talked about this in the past. The problem is that very early
the kernel doesn't know whether it'll have devices that require
non-coherent DMA. So we'd probably need to start with a 64 byte
ARCH_DMA_MINALIGN and populate the slab caches slightly later once the
kernel learns more about the system it's running on.
Will Deacon June 1, 2021, 6:21 p.m. UTC | #6
On Thu, 27 May 2021 13:43:56 +0100, Will Deacon wrote:
> Back in 97303480753e ("arm64: Increase the max granular size"),
> ARCH_DMA_MINALIGN was effectively increased to 128 bytes thanks to an
> increase in L1_CACHE_BYTES due to an unsubstantiated performance claim
> on the now obsolete ThunderX-1. Although this was reverted in
> d93277b9839b, ARCH_DMA_MINALIGN was kept at 128 bytes by ebc7e21e0fa2
> ("arm64: Increase ARCH_DMA_MINALIGN to 128").
> 
> [...]

Applied to arm64 (for-next/mm), thanks!

[1/1] arm64: cache: Lower ARCH_DMA_MINALIGN to 64 (L1_CACHE_BYTES)
      https://git.kernel.org/arm64/c/65688d2a05de

Cheers,
Marek Szyprowski June 2, 2021, 1:25 p.m. UTC | #7
Hi Will,

On 27.05.2021 14:43, Will Deacon wrote:
> Back in 97303480753e ("arm64: Increase the max granular size"),
> ARCH_DMA_MINALIGN was effectively increased to 128 bytes thanks to an
> increase in L1_CACHE_BYTES due to an unsubstantiated performance claim
> on the now obsolete ThunderX-1. Although this was reverted in
> d93277b9839b, ARCH_DMA_MINALIGN was kept at 128 bytes by ebc7e21e0fa2
> ("arm64: Increase ARCH_DMA_MINALIGN to 128").
>
> During discussion of the original patch, it was reported that the change
> also prevented a warning during boot on (again, now obsolete) Qualcomm
> server hardware where the cache writeback granule was larger than 64
> bytes. The reason for this warning was because non-coherent DMA could
> lead to data corruption due to unexpected writeback from the CPU where a
> cacheline is shared with other allocations.
>
> Since then, systems have appeared with larger cachelines still, and so
> commit 8f5c9037a55b ("arm64/mm: Correct the cache line size warning with
> non coherent device") reworked the warning so that it only appears on
> systems where non-coherent DMA is actually required and taints the
> kernel with TAINT_CPU_OUT_OF_SPEC. We are not aware of any systems, even
> including the aforementioned obsolete machines, which have a CWG larger
> than 64 bytes and require non-coherent DMA.
>
> More recently, it has been reported that a ARCH_DMA_MINALIGN of 128
> bytes wastes considerable memory (~6% immediately after boot on one
> system).
>
> Reduce ARCH_DMA_MINALIGN to 64 bytes and allow the warning/taint to
> indicate if there are machines that unknowingly rely on this.
>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Vincent Whitchurch <vincent.whitchurch@axis.com>
> Link: https://lore.kernel.org/linux-arm-kernel/1442944788-17254-1-git-send-email-rric@kernel.org/
> Link: https://lore.kernel.org/linux-arm-kernel/CAOZdJXUiRMAguDV+HEJqPg57MyBNqEcTyaH+ya=U93NHb-pdJA@mail.gmail.com/
> Link: https://lore.kernel.org/linux-arm-kernel/20190614131141.4428-1-msys.mizuma@gmail.com/
> Link: https://lore.kernel.org/r/20210517074332.28280-1-vincent.whitchurch@axis.com
> Signed-off-by: Will Deacon <will@kernel.org>
> ---
This patch landed in todays linux-next as commit 65688d2a05de ("arm64: 
cache: Lower ARCH_DMA_MINALIGN to 64 (L1_CACHE_BYTES)"). It causes an 
issue on Raspberry Pi 3b board. System boots to userspace fine, but then 
it hangs somewhere during the init scripts after loading the modules. I 
didn't manage to track where it hangs yet though.
>   arch/arm64/include/asm/cache.h | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
> index a074459f8f2f..a9c0716e7440 100644
> --- a/arch/arm64/include/asm/cache.h
> +++ b/arch/arm64/include/asm/cache.h
> @@ -47,7 +47,7 @@
>    * cache before the transfer is done, causing old data to be seen by
>    * the CPU.
>    */
> -#define ARCH_DMA_MINALIGN	(128)
> +#define ARCH_DMA_MINALIGN	L1_CACHE_BYTES
>   
>   #ifdef CONFIG_KASAN_SW_TAGS
>   #define ARCH_SLAB_MINALIGN	(1ULL << KASAN_SHADOW_SCALE_SHIFT)

Best regards
Mark Rutland June 2, 2021, 1:51 p.m. UTC | #8
Hi Marek,

On Wed, Jun 02, 2021 at 03:25:41PM +0200, Marek Szyprowski wrote:
> On 27.05.2021 14:43, Will Deacon wrote:
> This patch landed in todays linux-next as commit 65688d2a05de ("arm64: 
> cache: Lower ARCH_DMA_MINALIGN to 64 (L1_CACHE_BYTES)"). It causes an 
> issue on Raspberry Pi 3b board. System boots to userspace fine, but then 
> it hangs somewhere during the init scripts after loading the modules. I 
> didn't manage to track where it hangs yet though.

Ouch!

I have a 3b in a drawer that I might be able to reproduce the issue
with; can you tell me how you're booting that kernel? e.g. which FW and
DT you're using?

Is your filesystem on the SD card, or some USB storage? I'm guessing
we'll need to stress DMA over one of those.

Thanks,
Mark.

> >   arch/arm64/include/asm/cache.h | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
> > index a074459f8f2f..a9c0716e7440 100644
> > --- a/arch/arm64/include/asm/cache.h
> > +++ b/arch/arm64/include/asm/cache.h
> > @@ -47,7 +47,7 @@
> >    * cache before the transfer is done, causing old data to be seen by
> >    * the CPU.
> >    */
> > -#define ARCH_DMA_MINALIGN	(128)
> > +#define ARCH_DMA_MINALIGN	L1_CACHE_BYTES
> >   
> >   #ifdef CONFIG_KASAN_SW_TAGS
> >   #define ARCH_SLAB_MINALIGN	(1ULL << KASAN_SHADOW_SCALE_SHIFT)
> 
> Best regards
> -- 
> Marek Szyprowski, PhD
> Samsung R&D Institute Poland
>
Marek Szyprowski June 2, 2021, 2:09 p.m. UTC | #9
Hi

On 02.06.2021 15:51, Mark Rutland wrote:
> Hi Marek,
>
> On Wed, Jun 02, 2021 at 03:25:41PM +0200, Marek Szyprowski wrote:
>> On 27.05.2021 14:43, Will Deacon wrote:
>> This patch landed in todays linux-next as commit 65688d2a05de ("arm64:
>> cache: Lower ARCH_DMA_MINALIGN to 64 (L1_CACHE_BYTES)"). It causes an
>> issue on Raspberry Pi 3b board. System boots to userspace fine, but then
>> it hangs somewhere during the init scripts after loading the modules. I
>> didn't manage to track where it hangs yet though.
> Ouch!
>
> I have a 3b in a drawer that I might be able to reproduce the issue
> with; can you tell me how you're booting that kernel? e.g. which FW and
> DT you're using?

I'm booting the kernel with the mainline dtb 
(arch/arm64/boot/dts/broadcom/bcm2837-rpi-3-b.dtb) from the u-boot, 
which downloads it via TFTP. I don't remember which firmware version is 
there, but without raspberry specific tools (which I don't have deployed 
there) it is hard to check that now. The rootfs is on SD card, the 
system is some older Debian release. Here is the last part of the boot 
log if it helps:

[    7.906329] Freeing unused kernel memory: 8512K
[    7.911793] Run /sbin/init as init process
INIT: version 2.88 booting
[info] Using makefile-style concurrent boot in runlevel S.
ERROR: could not open /proc/stat: No such file or directory
[....] Starting the hotplug events dispatcher: systemd-udevdstarting 
version 236
. ok
[....] Synthesizing the initial hotplug events...[   13.641287] 
bcm2835-rng 3f104000.rng: hwrng registered
[   13.896575] i2c-bcm2835 3f805000.i2c: Could not read clock-frequency 
property
done.
[   13.938691] debugfs: Directory '3f902000.hdmi' with parent 'vc4-hdmi' 
already present!
[   13.983716] vc4-drm soc:gpu: bound 3f902000.hdmi (ops vc4_hdmi_ops)
[   13.991816] vc4-drm soc:gpu: bound 3f806000.vec (ops vc4_vec_ops)
[   14.000945] vc4-drm soc:gpu: bound 3f400000.hvs (ops vc4_hvs_ops)
[   14.009644] vc4-drm soc:gpu: bound 3f004000.txp (ops vc4_txp_ops)
[   14.017547] vc4-drm soc:gpu: bound 3f206000.pixelvalve (ops vc4_crtc_ops)
[   14.025780] vc4-drm soc:gpu: bound 3f207000.pixelvalve (ops vc4_crtc_ops)
[   14.033995] vc4-drm soc:gpu: bound 3f807000.pixelvalve (ops vc4_crtc_ops)
[   14.042196] vc4-drm soc:gpu: bound 3fc00000.v3d (ops vc4_v3d_ops)
[....] Waiting for /dev to be fully populated...[   14.112812] [drm] 
Initialized vc4 0.0.0 20140616 for soc:gpu on minor 0
done.
[ ok ] Activating swap...done.
[....] Checking file systems...fsck from util-linux 2.29.2
done.
[ ok ] Cleaning up temporary files... /tmp.
[ ok ] Mounting local filesystems...done.
[ ok ] Activating swapfile swap...done.
[ ok ] Cleaning up temporary files....
[ ok ] Setting kernel variables...done.
[ ok ] Configuring network interfaces...done.
[ ok ] Starting RPC port mapper daemon: rpcbind.
[ ok ] Cleaning up temporary files....
[ ok ] Setting up ALSA...done.
[ ok ] Setting up X socket directories... /tmp/.X11-unix /tmp/.ICE-unix.

Best regards
Arnd Bergmann June 2, 2021, 2:11 p.m. UTC | #10
On Wed, Jun 2, 2021 at 3:51 PM Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Wed, Jun 02, 2021 at 03:25:41PM +0200, Marek Szyprowski wrote:
> > On 27.05.2021 14:43, Will Deacon wrote:
> > This patch landed in todays linux-next as commit 65688d2a05de ("arm64:
> > cache: Lower ARCH_DMA_MINALIGN to 64 (L1_CACHE_BYTES)"). It causes an
> > issue on Raspberry Pi 3b board. System boots to userspace fine, but then
> > it hangs somewhere during the init scripts after loading the modules. I
> > didn't manage to track where it hangs yet though.
>
> Ouch!
>
> I have a 3b in a drawer that I might be able to reproduce the issue
> with; can you tell me how you're booting that kernel? e.g. which FW and
> DT you're using?
>
> Is your filesystem on the SD card, or some USB storage? I'm guessing
> we'll need to stress DMA over one of those.

It could be something other than DMA in this case, possibly an out-of-bounds
access into a dynamically allocated structure that happens to work when there
is extra data at the end of it. Running that kernel with KASAN should help
rule this out.

Running the same kernel on a Pi 4, or on a virtual machine could also help
figure out if there is something hardware specific.

       Arnd
Arnd Bergmann June 2, 2021, 2:14 p.m. UTC | #11
On Wed, Jun 2, 2021 at 4:09 PM Marek Szyprowski
<m.szyprowski@samsung.com> wrote:
> On 02.06.2021 15:51, Mark Rutland wrote:
> [   13.938691] debugfs: Directory '3f902000.hdmi' with parent 'vc4-hdmi'
> already present!
> [   13.983716] vc4-drm soc:gpu: bound 3f902000.hdmi (ops vc4_hdmi_ops)
> [   13.991816] vc4-drm soc:gpu: bound 3f806000.vec (ops vc4_vec_ops)
> [   14.000945] vc4-drm soc:gpu: bound 3f400000.hvs (ops vc4_hvs_ops)
> [   14.009644] vc4-drm soc:gpu: bound 3f004000.txp (ops vc4_txp_ops)
> [   14.017547] vc4-drm soc:gpu: bound 3f206000.pixelvalve (ops vc4_crtc_ops)
> [   14.025780] vc4-drm soc:gpu: bound 3f207000.pixelvalve (ops vc4_crtc_ops)
> [   14.033995] vc4-drm soc:gpu: bound 3f807000.pixelvalve (ops vc4_crtc_ops)
> [   14.042196] vc4-drm soc:gpu: bound 3fc00000.v3d (ops vc4_v3d_ops)
> [....] Waiting for /dev to be fully populated...[   14.112812] [drm]
> Initialized vc4 0.0.0 20140616 for soc:gpu on minor 0
> done.

I wonder if the vc4's cache is involved here rather than the CPU cache.
If it's that, then removing the vc4 drivers should make it boot reliably, and
it might be possible to fix it by manually padding any allocations that get
passed to the GPU.

       Arnd
Marek Szyprowski June 2, 2021, 2:15 p.m. UTC | #12
Hi Arnd,

On 02.06.2021 16:11, Arnd Bergmann wrote:
> On Wed, Jun 2, 2021 at 3:51 PM Mark Rutland <mark.rutland@arm.com> wrote:
>> On Wed, Jun 02, 2021 at 03:25:41PM +0200, Marek Szyprowski wrote:
>>> On 27.05.2021 14:43, Will Deacon wrote:
>>> This patch landed in todays linux-next as commit 65688d2a05de ("arm64:
>>> cache: Lower ARCH_DMA_MINALIGN to 64 (L1_CACHE_BYTES)"). It causes an
>>> issue on Raspberry Pi 3b board. System boots to userspace fine, but then
>>> it hangs somewhere during the init scripts after loading the modules. I
>>> didn't manage to track where it hangs yet though.
>> Ouch!
>>
>> I have a 3b in a drawer that I might be able to reproduce the issue
>> with; can you tell me how you're booting that kernel? e.g. which FW and
>> DT you're using?
>>
>> Is your filesystem on the SD card, or some USB storage? I'm guessing
>> we'll need to stress DMA over one of those.
> It could be something other than DMA in this case, possibly an out-of-bounds
> access into a dynamically allocated structure that happens to work when there
> is extra data at the end of it. Running that kernel with KASAN should help
> rule this out.

Okay, I will enable KASAN and give it a try.

> Running the same kernel on a Pi 4, or on a virtual machine could also help
> figure out if there is something hardware specific.

Exactly the same kernel binary boots fine on RPi4, virt (qemu) and a few 
other arm64 boards I have on my test farm.

Best regards
Marek Szyprowski June 2, 2021, 2:28 p.m. UTC | #13
On 02.06.2021 16:14, Arnd Bergmann wrote:
> On Wed, Jun 2, 2021 at 4:09 PM Marek Szyprowski
> <m.szyprowski@samsung.com> wrote:
>> On 02.06.2021 15:51, Mark Rutland wrote:
>> [   13.938691] debugfs: Directory '3f902000.hdmi' with parent 'vc4-hdmi'
>> already present!
>> [   13.983716] vc4-drm soc:gpu: bound 3f902000.hdmi (ops vc4_hdmi_ops)
>> [   13.991816] vc4-drm soc:gpu: bound 3f806000.vec (ops vc4_vec_ops)
>> [   14.000945] vc4-drm soc:gpu: bound 3f400000.hvs (ops vc4_hvs_ops)
>> [   14.009644] vc4-drm soc:gpu: bound 3f004000.txp (ops vc4_txp_ops)
>> [   14.017547] vc4-drm soc:gpu: bound 3f206000.pixelvalve (ops vc4_crtc_ops)
>> [   14.025780] vc4-drm soc:gpu: bound 3f207000.pixelvalve (ops vc4_crtc_ops)
>> [   14.033995] vc4-drm soc:gpu: bound 3f807000.pixelvalve (ops vc4_crtc_ops)
>> [   14.042196] vc4-drm soc:gpu: bound 3fc00000.v3d (ops vc4_v3d_ops)
>> [....] Waiting for /dev to be fully populated...[   14.112812] [drm]
>> Initialized vc4 0.0.0 20140616 for soc:gpu on minor 0
>> done.
> I wonder if the vc4's cache is involved here rather than the CPU cache.
> If it's that, then removing the vc4 drivers should make it boot reliably, and
> it might be possible to fix it by manually padding any allocations that get
> passed to the GPU.

Indeed, after disabling DRM_VC4 in the .config, system boots fine and 
seems to be working correctly.

Best regards
Arnd Bergmann June 2, 2021, 2:52 p.m. UTC | #14
On Wed, Jun 2, 2021 at 4:28 PM Marek Szyprowski
<m.szyprowski@samsung.com> wrote:
>
> On 02.06.2021 16:14, Arnd Bergmann wrote:
> > On Wed, Jun 2, 2021 at 4:09 PM Marek Szyprowski
> > <m.szyprowski@samsung.com> wrote:
> >> On 02.06.2021 15:51, Mark Rutland wrote:
> >> [   13.938691] debugfs: Directory '3f902000.hdmi' with parent 'vc4-hdmi'
> >> already present!
> >> [   13.983716] vc4-drm soc:gpu: bound 3f902000.hdmi (ops vc4_hdmi_ops)
> >> [   13.991816] vc4-drm soc:gpu: bound 3f806000.vec (ops vc4_vec_ops)
> >> [   14.000945] vc4-drm soc:gpu: bound 3f400000.hvs (ops vc4_hvs_ops)
> >> [   14.009644] vc4-drm soc:gpu: bound 3f004000.txp (ops vc4_txp_ops)
> >> [   14.017547] vc4-drm soc:gpu: bound 3f206000.pixelvalve (ops vc4_crtc_ops)
> >> [   14.025780] vc4-drm soc:gpu: bound 3f207000.pixelvalve (ops vc4_crtc_ops)
> >> [   14.033995] vc4-drm soc:gpu: bound 3f807000.pixelvalve (ops vc4_crtc_ops)
> >> [   14.042196] vc4-drm soc:gpu: bound 3fc00000.v3d (ops vc4_v3d_ops)
> >> [....] Waiting for /dev to be fully populated...[   14.112812] [drm]
> >> Initialized vc4 0.0.0 20140616 for soc:gpu on minor 0
> >> done.
> > I wonder if the vc4's cache is involved here rather than the CPU cache.
> > If it's that, then removing the vc4 drivers should make it boot reliably, and
> > it might be possible to fix it by manually padding any allocations that get
> > passed to the GPU.
>
> Indeed, after disabling DRM_VC4 in the .config, system boots fine and
> seems to be working correctly.

Ok, that helps. This means there is a good chance it would be one of these:

$ git grep k[cmz]alloc drivers/gpu/drm/vc4
drivers/gpu/drm/vc4/vc4_bo.c:           new_list =
kmalloc_array(new_size, sizeof(struct list_head),
drivers/gpu/drm/vc4/vc4_bo.c:   bo = kzalloc(sizeof(*bo), GFP_KERNEL);
drivers/gpu/drm/vc4/vc4_bo.c:   vc4->bo_labels =
kcalloc(VC4_BO_TYPE_COUNT, sizeof(*vc4->bo_labels),
drivers/gpu/drm/vc4/vc4_crtc.c: flip_state =
kzalloc(sizeof(*flip_state), GFP_KERNEL);
drivers/gpu/drm/vc4/vc4_crtc.c: vc4_state =
kzalloc(sizeof(*vc4_state), GFP_KERNEL);
drivers/gpu/drm/vc4/vc4_crtc.c: vc4_crtc_state =
kzalloc(sizeof(*vc4_crtc_state), GFP_KERNEL);
drivers/gpu/drm/vc4/vc4_crtc.c: vc4_crtc = devm_kzalloc(dev,
sizeof(*vc4_crtc), GFP_KERNEL);
drivers/gpu/drm/vc4/vc4_debugfs.c:              devm_kzalloc(dev->dev,
sizeof(*entry), GFP_KERNEL);
drivers/gpu/drm/vc4/vc4_dpi.c:  dpi = devm_kzalloc(dev, sizeof(*dpi),
GFP_KERNEL);
drivers/gpu/drm/vc4/vc4_dpi.c:  vc4_dpi_encoder = devm_kzalloc(dev,
sizeof(*vc4_dpi_encoder),
drivers/gpu/drm/vc4/vc4_drv.c:  vc4file = kzalloc(sizeof(*vc4file), GFP_KERNEL);
drivers/gpu/drm/vc4/vc4_dsi.c:  dsi->clk_onecell = devm_kzalloc(dev,
drivers/gpu/drm/vc4/vc4_dsi.c:  vc4_dsi_encoder = devm_kzalloc(dev,
sizeof(*vc4_dsi_encoder),
drivers/gpu/drm/vc4/vc4_dsi.c:  dsi = devm_kzalloc(dev, sizeof(*dsi),
GFP_KERNEL);
drivers/gpu/drm/vc4/vc4_gem.c:  bo_state = kcalloc(state->bo_count,
sizeof(*bo_state), GFP_KERNEL);
drivers/gpu/drm/vc4/vc4_gem.c:  kernel_state = kcalloc(1,
sizeof(*kernel_state), GFP_KERNEL);
drivers/gpu/drm/vc4/vc4_gem.c:  kernel_state->bo = kcalloc(state->bo_count,
drivers/gpu/drm/vc4/vc4_gem.c:  fence = kzalloc(sizeof(*fence), GFP_KERNEL);
drivers/gpu/drm/vc4/vc4_gem.c:  exec = kcalloc(1, sizeof(*exec), GFP_KERNEL);
drivers/gpu/drm/vc4/vc4_hdmi.c:         kzalloc(sizeof(*new_state), GFP_KERNEL);
drivers/gpu/drm/vc4/vc4_hdmi.c: new_state =
kzalloc(sizeof(*new_state), GFP_KERNEL);
drivers/gpu/drm/vc4/vc4_hdmi.c: regs = kcalloc(variant->num_registers,
sizeof(*regs),
drivers/gpu/drm/vc4/vc4_hdmi.c: vc4_hdmi = devm_kzalloc(dev,
sizeof(*vc4_hdmi), GFP_KERNEL);
drivers/gpu/drm/vc4/vc4_hvs.c:  hvs = devm_kzalloc(&pdev->dev,
sizeof(*hvs), GFP_KERNEL);
drivers/gpu/drm/vc4/vc4_kms.c:  ctm_state =
kzalloc(sizeof(*ctm_state), GFP_KERNEL);
drivers/gpu/drm/vc4/vc4_kms.c:  load_state =
kzalloc(sizeof(*load_state), GFP_KERNEL);
drivers/gpu/drm/vc4/vc4_kms.c:  state = kzalloc(sizeof(*state), GFP_KERNEL);
drivers/gpu/drm/vc4/vc4_kms.c:  state = kzalloc(sizeof(*state), GFP_KERNEL);
drivers/gpu/drm/vc4/vc4_perfmon.c:      perfmon =
kzalloc(struct_size(perfmon, counters, req->ncounters),
drivers/gpu/drm/vc4/vc4_plane.c:        vc4_state =
kzalloc(sizeof(*vc4_state), GFP_KERNEL);
drivers/gpu/drm/vc4/vc4_plane.c:                u32 *new_dlist =
kmalloc_array(new_size, 4, GFP_KERNEL);
drivers/gpu/drm/vc4/vc4_plane.c:        vc4_plane =
devm_kzalloc(dev->dev, sizeof(*vc4_plane),
drivers/gpu/drm/vc4/vc4_txp.c:  txp = devm_kzalloc(dev, sizeof(*txp),
GFP_KERNEL);
drivers/gpu/drm/vc4/vc4_v3d.c:  v3d = devm_kzalloc(&pdev->dev,
sizeof(*v3d), GFP_KERNEL);
drivers/gpu/drm/vc4/vc4_validate_shaders.c:
kcalloc(BITS_TO_LONGS(validation_state.max_ip),
drivers/gpu/drm/vc4/vc4_validate_shaders.c:     validated_shader =
kcalloc(1, sizeof(*validated_shader), GFP_KERNEL);
drivers/gpu/drm/vc4/vc4_vec.c:  vec_connector = devm_kzalloc(dev->dev,
sizeof(*vec_connector),
drivers/gpu/drm/vc4/vc4_vec.c:  vec = devm_kzalloc(dev, sizeof(*vec),
GFP_KERNEL);
drivers/gpu/drm/vc4/vc4_vec.c:  vc4_vec_encoder = devm_kzalloc(dev,
sizeof(*vc4_vec_encoder),

I suppose most can be easily ruled out because they are not shared, or
because they are larger
than 64 bytes. I'll have a quick look if I find any smoking guns.

       Arnd
Mark Rutland June 4, 2021, 10:01 a.m. UTC | #15
Hi Marek,

On Wed, Jun 02, 2021 at 04:09:19PM +0200, Marek Szyprowski wrote:
> On 02.06.2021 15:51, Mark Rutland wrote:
> > On Wed, Jun 02, 2021 at 03:25:41PM +0200, Marek Szyprowski wrote:
> >> On 27.05.2021 14:43, Will Deacon wrote:
> >> This patch landed in todays linux-next as commit 65688d2a05de ("arm64:
> >> cache: Lower ARCH_DMA_MINALIGN to 64 (L1_CACHE_BYTES)"). It causes an
> >> issue on Raspberry Pi 3b board. System boots to userspace fine, but then
> >> it hangs somewhere during the init scripts after loading the modules. I
> >> didn't manage to track where it hangs yet though.
> > Ouch!
> >
> > I have a 3b in a drawer that I might be able to reproduce the issue
> > with; can you tell me how you're booting that kernel? e.g. which FW and
> > DT you're using?
> 
> I'm booting the kernel with the mainline dtb 
> (arch/arm64/boot/dts/broadcom/bcm2837-rpi-3-b.dtb) from the u-boot, 
> which downloads it via TFTP. I don't remember which firmware version is 
> there, but without raspberry specific tools (which I don't have deployed 
> there) it is hard to check that now.

Thanks, this was enough info to get started.

For comparison, I have a 3Bv1.2 board.

I grabbed the latest RPI firmware, built myself a v2021.07-rc3
rpi_3_defconfig u-boot, and got a kernel booting (off the SD card rather
than over the network).

For the kernel I'm testing commit 65688d2a05de; defconfig with DRM and
VC4 built-in, since passing modules around is painful in my setup.

> The rootfs is on SD card, the system is some older Debian release.

For comparison, I built myself a buildroot 2021.02.2 filesystem.

So far, booting up an running I'm not seeeing issues (and no complaints
from KASAN or similar), but I don't have a good way to stress the VC4
GPU, so I might not be triggering whatever's going wrong.

From the log below I see the last message is about X sockets -- is the
lockup happening when the display manager starts?

Thanks,
Mark.

> Here is the last part of the boot 
> log if it helps:
> 
> [    7.906329] Freeing unused kernel memory: 8512K
> [    7.911793] Run /sbin/init as init process
> INIT: version 2.88 booting
> [info] Using makefile-style concurrent boot in runlevel S.
> ERROR: could not open /proc/stat: No such file or directory
> [....] Starting the hotplug events dispatcher: systemd-udevdstarting 
> version 236
> . ok
> [....] Synthesizing the initial hotplug events...[   13.641287] 
> bcm2835-rng 3f104000.rng: hwrng registered
> [   13.896575] i2c-bcm2835 3f805000.i2c: Could not read clock-frequency 
> property
> done.
> [   13.938691] debugfs: Directory '3f902000.hdmi' with parent 'vc4-hdmi' 
> already present!
> [   13.983716] vc4-drm soc:gpu: bound 3f902000.hdmi (ops vc4_hdmi_ops)
> [   13.991816] vc4-drm soc:gpu: bound 3f806000.vec (ops vc4_vec_ops)
> [   14.000945] vc4-drm soc:gpu: bound 3f400000.hvs (ops vc4_hvs_ops)
> [   14.009644] vc4-drm soc:gpu: bound 3f004000.txp (ops vc4_txp_ops)
> [   14.017547] vc4-drm soc:gpu: bound 3f206000.pixelvalve (ops vc4_crtc_ops)
> [   14.025780] vc4-drm soc:gpu: bound 3f207000.pixelvalve (ops vc4_crtc_ops)
> [   14.033995] vc4-drm soc:gpu: bound 3f807000.pixelvalve (ops vc4_crtc_ops)
> [   14.042196] vc4-drm soc:gpu: bound 3fc00000.v3d (ops vc4_v3d_ops)
> [....] Waiting for /dev to be fully populated...[   14.112812] [drm] 
> Initialized vc4 0.0.0 20140616 for soc:gpu on minor 0
> done.
> [ ok ] Activating swap...done.
> [....] Checking file systems...fsck from util-linux 2.29.2
> done.
> [ ok ] Cleaning up temporary files... /tmp.
> [ ok ] Mounting local filesystems...done.
> [ ok ] Activating swapfile swap...done.
> [ ok ] Cleaning up temporary files....
> [ ok ] Setting kernel variables...done.
> [ ok ] Configuring network interfaces...done.
> [ ok ] Starting RPC port mapper daemon: rpcbind.
> [ ok ] Cleaning up temporary files....
> [ ok ] Setting up ALSA...done.
> [ ok ] Setting up X socket directories... /tmp/.X11-unix /tmp/.ICE-unix.
> 
> Best regards
> 
> -- 
> Marek Szyprowski, PhD
> Samsung R&D Institute Poland
>
Marek Szyprowski June 7, 2021, 9:58 a.m. UTC | #16
Hi Mark,

On 04.06.2021 12:01, Mark Rutland wrote:
> On Wed, Jun 02, 2021 at 04:09:19PM +0200, Marek Szyprowski wrote:
>> On 02.06.2021 15:51, Mark Rutland wrote:
>>> On Wed, Jun 02, 2021 at 03:25:41PM +0200, Marek Szyprowski wrote:
>>>> On 27.05.2021 14:43, Will Deacon wrote:
>>>> This patch landed in todays linux-next as commit 65688d2a05de ("arm64:
>>>> cache: Lower ARCH_DMA_MINALIGN to 64 (L1_CACHE_BYTES)"). It causes an
>>>> issue on Raspberry Pi 3b board. System boots to userspace fine, but then
>>>> it hangs somewhere during the init scripts after loading the modules. I
>>>> didn't manage to track where it hangs yet though.
>>> Ouch!
>>>
>>> I have a 3b in a drawer that I might be able to reproduce the issue
>>> with; can you tell me how you're booting that kernel? e.g. which FW and
>>> DT you're using?
>> I'm booting the kernel with the mainline dtb
>> (arch/arm64/boot/dts/broadcom/bcm2837-rpi-3-b.dtb) from the u-boot,
>> which downloads it via TFTP. I don't remember which firmware version is
>> there, but without raspberry specific tools (which I don't have deployed
>> there) it is hard to check that now.
> Thanks, this was enough info to get started.
>
> For comparison, I have a 3Bv1.2 board.
>
> I grabbed the latest RPI firmware, built myself a v2021.07-rc3
> rpi_3_defconfig u-boot, and got a kernel booting (off the SD card rather
> than over the network).
>
> For the kernel I'm testing commit 65688d2a05de; defconfig with DRM and
> VC4 built-in, since passing modules around is painful in my setup.
>
>> The rootfs is on SD card, the system is some older Debian release.
> For comparison, I built myself a buildroot 2021.02.2 filesystem.
>
> So far, booting up an running I'm not seeeing issues (and no complaints
> from KASAN or similar), but I don't have a good way to stress the VC4
> GPU, so I might not be triggering whatever's going wrong.
>
> >From the log below I see the last message is about X sockets -- is the
> lockup happening when the display manager starts?

I've just checked with the latest firmware from 
https://github.com/raspberrypi/firmware (master branch, just copied 
everything to /boot) and the issue is still there.

If you start from arm64/defconfig without modules, please make sure you 
have enabled all RPi drivers, otherwise VC4 DRM won't come up. I've 
managed to reproduce the issue without the modules with the following 
changes to arm64's defconfig:

./scripts/config -e DRM -e DRM_VC4 -e CONFIG_CLK_RASPBERRYPI -e 
CONFIG_SENSORS_RASPBERRYPI_HWMON -e CONFIG_I2C_BCM2835 -e 
CONFIG_ARM_RASPBERRYPI_CPUFREQ

Best regards
Mark Rutland June 7, 2021, 12:01 p.m. UTC | #17
On Mon, Jun 07, 2021 at 11:58:32AM +0200, Marek Szyprowski wrote:
> Hi Mark,

> I've just checked with the latest firmware from 
> https://github.com/raspberrypi/firmware (master branch, just copied 
> everything to /boot) and the issue is still there.
> 
> If you start from arm64/defconfig without modules, please make sure you 
> have enabled all RPi drivers, otherwise VC4 DRM won't come up. I've 
> managed to reproduce the issue without the modules with the following 
> changes to arm64's defconfig:
> 
> ./scripts/config -e DRM -e DRM_VC4 -e CONFIG_CLK_RASPBERRYPI -e 
> CONFIG_SENSORS_RASPBERRYPI_HWMON -e CONFIG_I2C_BCM2835 -e 
> CONFIG_ARM_RASPBERRYPI_CPUFREQ

Thanks for this!

With that config on commit 65688d2a05deb9f0 I also see a hang at the end
of boot, but before reaching userspace, with the last messages in dmesg
as below.

I'll go check that the ARCH_DMA_MINALIGN affects this, then I'll go play
with debug options.

| [    1.397102] 3f201000.serial: ttyAMA0 at MMIO 0x3f201000 (irq = 99, base_baud = 0) is a PL011 rev2
| [    1.406437] serial serial0: tty port ttyAMA0 registered
| [    1.408914] sdhost-bcm2835 3f202000.mmc: loaded - DMA enabled (>1)
| [    1.430510] raspberrypi-firmware soc:firmware: Attached to firmware from 2021-05-24T19:52:58
| [    1.504953] mmc0: host does not support reading read-only switch, assuming write-enable
| [    1.514908] mmc0: new high speed SDXC card at address 0001
| [    1.521626] mmcblk0: mmc0:0001 00000 59.6 GiB
| [    1.528406]  mmcblk0: p1 p2
| [    1.644694] debugfs: Directory '3f902000.hdmi' with parent 'vc4-hdmi' already present!
| [    1.654544] vc4-drm soc:gpu: bound 3f902000.hdmi (ops vc4_hdmi_ops)
| [    1.661078] vc4-drm soc:gpu: bound 3f806000.vec (ops vc4_vec_ops)
| [    1.667401] vc4-drm soc:gpu: bound 3f400000.hvs (ops vc4_hvs_ops)
| [    1.673760] vc4-drm soc:gpu: bound 3f004000.txp (ops vc4_txp_ops)
| [    1.680096] vc4-drm soc:gpu: bound 3f206000.pixelvalve (ops vc4_crtc_ops)
| [    1.687157] vc4-drm soc:gpu: bound 3f207000.pixelvalve (ops vc4_crtc_ops)
| [    1.694172] vc4-drm soc:gpu: bound 3f807000.pixelvalve (ops vc4_crtc_ops)
| [    1.701170] vc4-drm soc:gpu: bound 3fc00000.v3d (ops vc4_v3d_ops)
| [    1.709277] [drm] Initialized vc4 0.0.0 20140616 for soc:gpu on minor 0

Thanks,
Mark.
Arnd Bergmann June 7, 2021, 12:17 p.m. UTC | #18
On Wed, Jun 2, 2021 at 4:52 PM Arnd Bergmann <arnd@arndb.de> wrote:
> On Wed, Jun 2, 2021 at 4:28 PM Marek Szyprowski <m.szyprowski@samsung.com> wrote:
> I suppose most can be easily ruled out because they are not shared, or
> because they are larger than 64 bytes. I'll have a quick look if I find any smoking guns.

Sorry for the late follow-up. I did check these last week, but
unfortunately none of
the allocations in the vc4 driver itself  looked suspicious to me in
the end. It might
still be an indirect allocation though, where the vc4 driver calls
into a drivers/drm/
API or something else that allocates a small DMA buffer.

        Arnd
Mark Rutland June 7, 2021, 1:08 p.m. UTC | #19
On Mon, Jun 07, 2021 at 01:01:18PM +0100, Mark Rutland wrote:
> On Mon, Jun 07, 2021 at 11:58:32AM +0200, Marek Szyprowski wrote:
> > Hi Mark,
> 
> > I've just checked with the latest firmware from 
> > https://github.com/raspberrypi/firmware (master branch, just copied 
> > everything to /boot) and the issue is still there.
> > 
> > If you start from arm64/defconfig without modules, please make sure you 
> > have enabled all RPi drivers, otherwise VC4 DRM won't come up. I've 
> > managed to reproduce the issue without the modules with the following 
> > changes to arm64's defconfig:
> > 
> > ./scripts/config -e DRM -e DRM_VC4 -e CONFIG_CLK_RASPBERRYPI -e 
> > CONFIG_SENSORS_RASPBERRYPI_HWMON -e CONFIG_I2C_BCM2835 -e 
> > CONFIG_ARM_RASPBERRYPI_CPUFREQ
> 
> Thanks for this!
> 
> With that config on commit 65688d2a05deb9f0 I also see a hang at the end
> of boot, but before reaching userspace, with the last messages in dmesg
> as below.
> 
> I'll go check that the ARCH_DMA_MINALIGN affects this, then I'll go play
> with debug options.

I can confirm that with the ARCH_DMA_MINALIGN change reverted, the hang
goes away. Running with that reverted andwith KASAN, I get the
slab-out-of-bounds splat below, which occurs at the time the hang would
otherwise occur, and is possibly the problem:

[    3.609515] vc4-drm soc:gpu: bound 3f902000.hdmi (ops vc4_hdmi_ops)
[    3.621451] vc4-drm soc:gpu: bound 3f806000.vec (ops vc4_vec_ops)
[    3.628344] vc4-drm soc:gpu: bound 3f400000.hvs (ops vc4_hvs_ops)
[    3.635904] vc4-drm soc:gpu: bound 3f004000.txp (ops vc4_txp_ops)
[    3.643351] vc4-drm soc:gpu: bound 3f206000.pixelvalve (ops vc4_crtc_ops)
[    3.651238] vc4-drm soc:gpu: bound 3f207000.pixelvalve (ops vc4_crtc_ops)
[    3.659167] vc4-drm soc:gpu: bound 3f807000.pixelvalve (ops vc4_crtc_ops)
[    3.666499] vc4-drm soc:gpu: bound 3fc00000.v3d (ops vc4_v3d_ops)
[    3.688560] [drm] Initialized vc4 0.0.0 20140616 for soc:gpu on minor 0
[    3.728010] ==================================================================
[    3.728042] BUG: KASAN: slab-out-of-bounds in vc4_atomic_commit_tail+0x1cc/0x910
[    3.728123] Read of size 8 at addr ffff000007360440 by task kworker/u8:0/7
[    3.728153]
[    3.728169] CPU: 2 PID: 7 Comm: kworker/u8:0 Not tainted 5.13.0-rc3-00009-g694c523e7267 #3
[    3.728203] Hardware name: Raspberry Pi 3 Model B (DT)
[    3.728225] Workqueue: events_unbound deferred_probe_work_func
[    3.728290] Call trace:
[    3.728301]  dump_backtrace+0x0/0x2b4
[    3.728358]  show_stack+0x1c/0x30
[    3.728407]  dump_stack+0xfc/0x168
[    3.728445]  print_address_description.constprop.0+0x2c/0x2c0
[    3.728495]  kasan_report+0x1dc/0x240
[    3.728529]  __asan_load8+0x98/0xd4
[    3.728565]  vc4_atomic_commit_tail+0x1cc/0x910
[    3.728621]  commit_tail+0x100/0x210
[    3.728675]  drm_atomic_helper_commit+0x1c4/0x3dc
[    3.728730]  drm_atomic_commit+0x80/0x94
[    3.728768]  drm_client_modeset_commit_atomic+0x2f4/0x3a0
[    3.728821]  drm_client_modeset_commit_locked+0x8c/0x230
[    3.728872]  drm_fb_helper_pan_display+0x164/0x3a0
[    3.728924]  fb_pan_display+0x12c/0x1fc
[    3.728963]  bit_update_start+0x34/0xa0
[    3.729013]  fbcon_switch+0x678/0x920
[    3.729058]  redraw_screen+0x17c/0x35c
[    3.729095]  fbcon_prepare_logo+0x484/0x5bc
[    3.729143]  fbcon_init+0x77c/0x970
[    3.729187]  visual_init+0x14c/0x1e4
[    3.729239]  do_bind_con_driver.isra.0+0x2c4/0x530
[    3.729279]  do_take_over_console+0x200/0x2e0
[    3.729317]  do_fbcon_takeover+0x90/0x120
[    3.729363]  fbcon_fb_registered+0x14c/0x164
[    3.729412]  register_framebuffer+0x308/0x4e0
[    3.729451]  __drm_fb_helper_initial_config_and_unlock+0x538/0x7d0
[    3.729506]  drm_fbdev_client_hotplug+0x204/0x374
[    3.729556]  drm_fbdev_generic_setup+0xf4/0x24c
[    3.729604]  vc4_drm_bind+0x1d4/0x1f0
[    3.729654]  try_to_bring_up_master+0x254/0x2dc
[    3.729709]  __component_add+0x10c/0x240
[    3.729759]  component_add+0x18/0x24
[    3.729807]  vc4_v3d_dev_probe+0x20/0x30
[    3.729854]  platform_probe+0x90/0x110
[    3.729907]  really_probe+0x148/0x744
[    3.729952]  driver_probe_device+0x8c/0xfc
[    3.729998]  __device_attach_driver+0x120/0x180
[    3.730048]  bus_for_each_drv+0xf4/0x15c
[    3.730091]  __device_attach+0x168/0x250
[    3.730137]  device_initial_probe+0x18/0x24
[    3.730186]  bus_probe_device+0xec/0x100
[    3.730230]  deferred_probe_work_func+0xe8/0x130
[    3.730279]  process_one_work+0x3b8/0x650
[    3.730319]  worker_thread+0x3cc/0x72c
[    3.730356]  kthread+0x21c/0x224
[    3.730402]  ret_from_fork+0x10/0x38
[    3.730442]
[    3.730453] Allocated by task 7:
[    3.730470]  kasan_save_stack+0x2c/0x60
[    3.730526]  __kasan_kmalloc+0x90/0xb4
[    3.730577]  vc4_hvs_channels_duplicate_state+0x60/0x1a0
[    3.730637]  drm_atomic_get_private_obj_state+0x144/0x230
[    3.730680]  vc4_atomic_check+0x40/0x73c
[    3.730732]  drm_atomic_check_only+0x998/0xe60
[    3.730769]  drm_atomic_commit+0x34/0x94
[    3.730804]  drm_client_modeset_commit_atomic+0x2f4/0x3a0
[    3.730854]  drm_client_modeset_commit_locked+0x8c/0x230
[    3.730904]  drm_client_modeset_commit+0x38/0x60
[    3.730951]  drm_fb_helper_set_par+0x104/0x17c
[    3.730998]  fbcon_init+0x43c/0x970
[    3.731041]  visual_init+0x14c/0x1e4
[    3.731090]  do_bind_con_driver.isra.0+0x2c4/0x530
[    3.731128]  do_take_over_console+0x200/0x2e0
[    3.731165]  do_fbcon_takeover+0x90/0x120
[    3.731210]  fbcon_fb_registered+0x14c/0x164
[    3.731258]  register_framebuffer+0x308/0x4e0
[    3.731296]  __drm_fb_helper_initial_config_and_unlock+0x538/0x7d0
[    3.731349]  drm_fbdev_client_hotplug+0x204/0x374
[    3.731398]  drm_fbdev_generic_setup+0xf4/0x24c
[    3.731446]  vc4_drm_bind+0x1d4/0x1f0
[    3.731493]  try_to_bring_up_master+0x254/0x2dc
[    3.731546]  __component_add+0x10c/0x240
[    3.731594]  component_add+0x18/0x24
[    3.731642]  vc4_v3d_dev_probe+0x20/0x30
[    3.731686]  platform_probe+0x90/0x110
[    3.731737]  really_probe+0x148/0x744
[    3.731781]  driver_probe_device+0x8c/0xfc
[    3.731827]  __device_attach_driver+0x120/0x180
[    3.731875]  bus_for_each_drv+0xf4/0x15c
[    3.731916]  __device_attach+0x168/0x250
[    3.731962]  device_initial_probe+0x18/0x24
[    3.732009]  bus_probe_device+0xec/0x100
[    3.732052]  deferred_probe_work_func+0xe8/0x130
[    3.732100]  process_one_work+0x3b8/0x650
[    3.732137]  worker_thread+0x3cc/0x72c
[    3.732172]  kthread+0x21c/0x224
[    3.732215]  ret_from_fork+0x10/0x38
[    3.732253]
[    3.732262] The buggy address belongs to the object at ffff000007360400
[    3.732262]  which belongs to the cache kmalloc-128 of size 128
[    3.732293] The buggy address is located 64 bytes inside of
[    3.732293]  128-byte region [ffff000007360400, ffff000007360480)
[    3.732329] The buggy address belongs to the page:
[    3.732344] page:(____ptrval____) refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7360
[    3.732380] flags: 0x3fffc0000000200(slab|node=0|zone=0|lastcpupid=0xffff)
[    3.732442] raw: 03fffc0000000200 dead000000000100 dead000000000122 ffff000004c02300
[    3.732478] raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
[    3.732501] page dumped because: kasan: bad access detected
[    3.732518]
[    3.732527] Memory state around the buggy address:
[    3.732549]  ffff000007360300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[    3.732579]  ffff000007360380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[    3.732608] >ffff000007360400: 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc
[    3.732629]                                            ^
[    3.732652]  ffff000007360480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[    3.732682]  ffff000007360500: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[    3.732703] ==================================================================
[    3.732718] Disabling lock debugging due to kernel taint
[    3.769129] Console: switching to colour frame buffer device 90x30
[    5.148699] vc4-drm soc:gpu: [drm] fb0: vc4drmfb frame buffer device

Thanks,
Mark.
Will Deacon June 7, 2021, 1:39 p.m. UTC | #20
[Adding VC4 folks -- please see the KASAN splat below!]

Background here is that reducing ARCH_DMA_MINALIGN to 64 on arm64 (queued in
-next) is causing vc4 to hang on Rpi3b due to a probable driver bug.

Will

On Mon, Jun 07, 2021 at 02:08:59PM +0100, Mark Rutland wrote:
> On Mon, Jun 07, 2021 at 01:01:18PM +0100, Mark Rutland wrote:
> > On Mon, Jun 07, 2021 at 11:58:32AM +0200, Marek Szyprowski wrote:
> > > I've just checked with the latest firmware from 
> > > https://github.com/raspberrypi/firmware (master branch, just copied 
> > > everything to /boot) and the issue is still there.
> > > 
> > > If you start from arm64/defconfig without modules, please make sure you 
> > > have enabled all RPi drivers, otherwise VC4 DRM won't come up. I've 
> > > managed to reproduce the issue without the modules with the following 
> > > changes to arm64's defconfig:
> > > 
> > > ./scripts/config -e DRM -e DRM_VC4 -e CONFIG_CLK_RASPBERRYPI -e 
> > > CONFIG_SENSORS_RASPBERRYPI_HWMON -e CONFIG_I2C_BCM2835 -e 
> > > CONFIG_ARM_RASPBERRYPI_CPUFREQ
> > 
> > Thanks for this!
> > 
> > With that config on commit 65688d2a05deb9f0 I also see a hang at the end
> > of boot, but before reaching userspace, with the last messages in dmesg
> > as below.
> > 
> > I'll go check that the ARCH_DMA_MINALIGN affects this, then I'll go play
> > with debug options.
> 
> I can confirm that with the ARCH_DMA_MINALIGN change reverted, the hang
> goes away. Running with that reverted andwith KASAN, I get the
> slab-out-of-bounds splat below, which occurs at the time the hang would
> otherwise occur, and is possibly the problem:
> 
> [    3.609515] vc4-drm soc:gpu: bound 3f902000.hdmi (ops vc4_hdmi_ops)
> [    3.621451] vc4-drm soc:gpu: bound 3f806000.vec (ops vc4_vec_ops)
> [    3.628344] vc4-drm soc:gpu: bound 3f400000.hvs (ops vc4_hvs_ops)
> [    3.635904] vc4-drm soc:gpu: bound 3f004000.txp (ops vc4_txp_ops)
> [    3.643351] vc4-drm soc:gpu: bound 3f206000.pixelvalve (ops vc4_crtc_ops)
> [    3.651238] vc4-drm soc:gpu: bound 3f207000.pixelvalve (ops vc4_crtc_ops)
> [    3.659167] vc4-drm soc:gpu: bound 3f807000.pixelvalve (ops vc4_crtc_ops)
> [    3.666499] vc4-drm soc:gpu: bound 3fc00000.v3d (ops vc4_v3d_ops)
> [    3.688560] [drm] Initialized vc4 0.0.0 20140616 for soc:gpu on minor 0
> [    3.728010] ==================================================================
> [    3.728042] BUG: KASAN: slab-out-of-bounds in vc4_atomic_commit_tail+0x1cc/0x910
> [    3.728123] Read of size 8 at addr ffff000007360440 by task kworker/u8:0/7
> [    3.728153]
> [    3.728169] CPU: 2 PID: 7 Comm: kworker/u8:0 Not tainted 5.13.0-rc3-00009-g694c523e7267 #3
> [    3.728203] Hardware name: Raspberry Pi 3 Model B (DT)
> [    3.728225] Workqueue: events_unbound deferred_probe_work_func
> [    3.728290] Call trace:
> [    3.728301]  dump_backtrace+0x0/0x2b4
> [    3.728358]  show_stack+0x1c/0x30
> [    3.728407]  dump_stack+0xfc/0x168
> [    3.728445]  print_address_description.constprop.0+0x2c/0x2c0
> [    3.728495]  kasan_report+0x1dc/0x240
> [    3.728529]  __asan_load8+0x98/0xd4
> [    3.728565]  vc4_atomic_commit_tail+0x1cc/0x910
> [    3.728621]  commit_tail+0x100/0x210
> [    3.728675]  drm_atomic_helper_commit+0x1c4/0x3dc
> [    3.728730]  drm_atomic_commit+0x80/0x94
> [    3.728768]  drm_client_modeset_commit_atomic+0x2f4/0x3a0
> [    3.728821]  drm_client_modeset_commit_locked+0x8c/0x230
> [    3.728872]  drm_fb_helper_pan_display+0x164/0x3a0
> [    3.728924]  fb_pan_display+0x12c/0x1fc
> [    3.728963]  bit_update_start+0x34/0xa0
> [    3.729013]  fbcon_switch+0x678/0x920
> [    3.729058]  redraw_screen+0x17c/0x35c
> [    3.729095]  fbcon_prepare_logo+0x484/0x5bc
> [    3.729143]  fbcon_init+0x77c/0x970
> [    3.729187]  visual_init+0x14c/0x1e4
> [    3.729239]  do_bind_con_driver.isra.0+0x2c4/0x530
> [    3.729279]  do_take_over_console+0x200/0x2e0
> [    3.729317]  do_fbcon_takeover+0x90/0x120
> [    3.729363]  fbcon_fb_registered+0x14c/0x164
> [    3.729412]  register_framebuffer+0x308/0x4e0
> [    3.729451]  __drm_fb_helper_initial_config_and_unlock+0x538/0x7d0
> [    3.729506]  drm_fbdev_client_hotplug+0x204/0x374
> [    3.729556]  drm_fbdev_generic_setup+0xf4/0x24c
> [    3.729604]  vc4_drm_bind+0x1d4/0x1f0
> [    3.729654]  try_to_bring_up_master+0x254/0x2dc
> [    3.729709]  __component_add+0x10c/0x240
> [    3.729759]  component_add+0x18/0x24
> [    3.729807]  vc4_v3d_dev_probe+0x20/0x30
> [    3.729854]  platform_probe+0x90/0x110
> [    3.729907]  really_probe+0x148/0x744
> [    3.729952]  driver_probe_device+0x8c/0xfc
> [    3.729998]  __device_attach_driver+0x120/0x180
> [    3.730048]  bus_for_each_drv+0xf4/0x15c
> [    3.730091]  __device_attach+0x168/0x250
> [    3.730137]  device_initial_probe+0x18/0x24
> [    3.730186]  bus_probe_device+0xec/0x100
> [    3.730230]  deferred_probe_work_func+0xe8/0x130
> [    3.730279]  process_one_work+0x3b8/0x650
> [    3.730319]  worker_thread+0x3cc/0x72c
> [    3.730356]  kthread+0x21c/0x224
> [    3.730402]  ret_from_fork+0x10/0x38
> [    3.730442]
> [    3.730453] Allocated by task 7:
> [    3.730470]  kasan_save_stack+0x2c/0x60
> [    3.730526]  __kasan_kmalloc+0x90/0xb4
> [    3.730577]  vc4_hvs_channels_duplicate_state+0x60/0x1a0
> [    3.730637]  drm_atomic_get_private_obj_state+0x144/0x230
> [    3.730680]  vc4_atomic_check+0x40/0x73c
> [    3.730732]  drm_atomic_check_only+0x998/0xe60
> [    3.730769]  drm_atomic_commit+0x34/0x94
> [    3.730804]  drm_client_modeset_commit_atomic+0x2f4/0x3a0
> [    3.730854]  drm_client_modeset_commit_locked+0x8c/0x230
> [    3.730904]  drm_client_modeset_commit+0x38/0x60
> [    3.730951]  drm_fb_helper_set_par+0x104/0x17c
> [    3.730998]  fbcon_init+0x43c/0x970
> [    3.731041]  visual_init+0x14c/0x1e4
> [    3.731090]  do_bind_con_driver.isra.0+0x2c4/0x530
> [    3.731128]  do_take_over_console+0x200/0x2e0
> [    3.731165]  do_fbcon_takeover+0x90/0x120
> [    3.731210]  fbcon_fb_registered+0x14c/0x164
> [    3.731258]  register_framebuffer+0x308/0x4e0
> [    3.731296]  __drm_fb_helper_initial_config_and_unlock+0x538/0x7d0
> [    3.731349]  drm_fbdev_client_hotplug+0x204/0x374
> [    3.731398]  drm_fbdev_generic_setup+0xf4/0x24c
> [    3.731446]  vc4_drm_bind+0x1d4/0x1f0
> [    3.731493]  try_to_bring_up_master+0x254/0x2dc
> [    3.731546]  __component_add+0x10c/0x240
> [    3.731594]  component_add+0x18/0x24
> [    3.731642]  vc4_v3d_dev_probe+0x20/0x30
> [    3.731686]  platform_probe+0x90/0x110
> [    3.731737]  really_probe+0x148/0x744
> [    3.731781]  driver_probe_device+0x8c/0xfc
> [    3.731827]  __device_attach_driver+0x120/0x180
> [    3.731875]  bus_for_each_drv+0xf4/0x15c
> [    3.731916]  __device_attach+0x168/0x250
> [    3.731962]  device_initial_probe+0x18/0x24
> [    3.732009]  bus_probe_device+0xec/0x100
> [    3.732052]  deferred_probe_work_func+0xe8/0x130
> [    3.732100]  process_one_work+0x3b8/0x650
> [    3.732137]  worker_thread+0x3cc/0x72c
> [    3.732172]  kthread+0x21c/0x224
> [    3.732215]  ret_from_fork+0x10/0x38
> [    3.732253]
> [    3.732262] The buggy address belongs to the object at ffff000007360400
> [    3.732262]  which belongs to the cache kmalloc-128 of size 128
> [    3.732293] The buggy address is located 64 bytes inside of
> [    3.732293]  128-byte region [ffff000007360400, ffff000007360480)
> [    3.732329] The buggy address belongs to the page:
> [    3.732344] page:(____ptrval____) refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7360
> [    3.732380] flags: 0x3fffc0000000200(slab|node=0|zone=0|lastcpupid=0xffff)
> [    3.732442] raw: 03fffc0000000200 dead000000000100 dead000000000122 ffff000004c02300
> [    3.732478] raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
> [    3.732501] page dumped because: kasan: bad access detected
> [    3.732518]
> [    3.732527] Memory state around the buggy address:
> [    3.732549]  ffff000007360300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [    3.732579]  ffff000007360380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> [    3.732608] >ffff000007360400: 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc
> [    3.732629]                                            ^
> [    3.732652]  ffff000007360480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> [    3.732682]  ffff000007360500: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [    3.732703] ==================================================================
> [    3.732718] Disabling lock debugging due to kernel taint
> [    3.769129] Console: switching to colour frame buffer device 90x30
> [    5.148699] vc4-drm soc:gpu: [drm] fb0: vc4drmfb frame buffer device
Mark Rutland June 7, 2021, 1:56 p.m. UTC | #21
On Mon, Jun 07, 2021 at 02:39:54PM +0100, Will Deacon wrote:
> [Adding VC4 folks -- please see the KASAN splat below!]
> 
> Background here is that reducing ARCH_DMA_MINALIGN to 64 on arm64 (queued in
> -next) is causing vc4 to hang on Rpi3b due to a probable driver bug.
> 
> Will
> 
> On Mon, Jun 07, 2021 at 02:08:59PM +0100, Mark Rutland wrote:
> > On Mon, Jun 07, 2021 at 01:01:18PM +0100, Mark Rutland wrote:
> > > On Mon, Jun 07, 2021 at 11:58:32AM +0200, Marek Szyprowski wrote:
> > > > I've just checked with the latest firmware from 
> > > > https://github.com/raspberrypi/firmware (master branch, just copied 
> > > > everything to /boot) and the issue is still there.
> > > > 
> > > > If you start from arm64/defconfig without modules, please make sure you 
> > > > have enabled all RPi drivers, otherwise VC4 DRM won't come up. I've 
> > > > managed to reproduce the issue without the modules with the following 
> > > > changes to arm64's defconfig:
> > > > 
> > > > ./scripts/config -e DRM -e DRM_VC4 -e CONFIG_CLK_RASPBERRYPI -e 
> > > > CONFIG_SENSORS_RASPBERRYPI_HWMON -e CONFIG_I2C_BCM2835 -e 
> > > > CONFIG_ARM_RASPBERRYPI_CPUFREQ
> > > 
> > > Thanks for this!
> > > 
> > > With that config on commit 65688d2a05deb9f0 I also see a hang at the end
> > > of boot, but before reaching userspace, with the last messages in dmesg
> > > as below.
> > > 
> > > I'll go check that the ARCH_DMA_MINALIGN affects this, then I'll go play
> > > with debug options.
> > 
> > I can confirm that with the ARCH_DMA_MINALIGN change reverted, the hang
> > goes away. Running with that reverted andwith KASAN, I get the
> > slab-out-of-bounds splat below, which occurs at the time the hang would
> > otherwise occur, and is possibly the problem:
> > 
> > [    3.609515] vc4-drm soc:gpu: bound 3f902000.hdmi (ops vc4_hdmi_ops)
> > [    3.621451] vc4-drm soc:gpu: bound 3f806000.vec (ops vc4_vec_ops)
> > [    3.628344] vc4-drm soc:gpu: bound 3f400000.hvs (ops vc4_hvs_ops)
> > [    3.635904] vc4-drm soc:gpu: bound 3f004000.txp (ops vc4_txp_ops)
> > [    3.643351] vc4-drm soc:gpu: bound 3f206000.pixelvalve (ops vc4_crtc_ops)
> > [    3.651238] vc4-drm soc:gpu: bound 3f207000.pixelvalve (ops vc4_crtc_ops)
> > [    3.659167] vc4-drm soc:gpu: bound 3f807000.pixelvalve (ops vc4_crtc_ops)
> > [    3.666499] vc4-drm soc:gpu: bound 3fc00000.v3d (ops vc4_v3d_ops)
> > [    3.688560] [drm] Initialized vc4 0.0.0 20140616 for soc:gpu on minor 0
> > [    3.728010] ==================================================================
> > [    3.728042] BUG: KASAN: slab-out-of-bounds in vc4_atomic_commit_tail+0x1cc/0x910
> > [    3.728123] Read of size 8 at addr ffff000007360440 by task kworker/u8:0/7

FWIW, faddr2line tells me this is:

[mark@lakrids:~/src/linux]% ./scripts/faddr2line vmlinux vc4_atomic_commit_tail+0x1cc/0x910
vc4_atomic_commit_tail+0x1cc/0x910:
vc4_atomic_commit_tail at drivers/gpu/drm/vc4/vc4_kms.c:375

... which is:

| ret = drm_crtc_commit_wait(old_hvs_state->fifo_state[i].pending_commit);

Thanks,
Mark.

> > [    3.728153]
> > [    3.728169] CPU: 2 PID: 7 Comm: kworker/u8:0 Not tainted 5.13.0-rc3-00009-g694c523e7267 #3
> > [    3.728203] Hardware name: Raspberry Pi 3 Model B (DT)
> > [    3.728225] Workqueue: events_unbound deferred_probe_work_func
> > [    3.728290] Call trace:
> > [    3.728301]  dump_backtrace+0x0/0x2b4
> > [    3.728358]  show_stack+0x1c/0x30
> > [    3.728407]  dump_stack+0xfc/0x168
> > [    3.728445]  print_address_description.constprop.0+0x2c/0x2c0
> > [    3.728495]  kasan_report+0x1dc/0x240
> > [    3.728529]  __asan_load8+0x98/0xd4
> > [    3.728565]  vc4_atomic_commit_tail+0x1cc/0x910
> > [    3.728621]  commit_tail+0x100/0x210
> > [    3.728675]  drm_atomic_helper_commit+0x1c4/0x3dc
> > [    3.728730]  drm_atomic_commit+0x80/0x94
> > [    3.728768]  drm_client_modeset_commit_atomic+0x2f4/0x3a0
> > [    3.728821]  drm_client_modeset_commit_locked+0x8c/0x230
> > [    3.728872]  drm_fb_helper_pan_display+0x164/0x3a0
> > [    3.728924]  fb_pan_display+0x12c/0x1fc
> > [    3.728963]  bit_update_start+0x34/0xa0
> > [    3.729013]  fbcon_switch+0x678/0x920
> > [    3.729058]  redraw_screen+0x17c/0x35c
> > [    3.729095]  fbcon_prepare_logo+0x484/0x5bc
> > [    3.729143]  fbcon_init+0x77c/0x970
> > [    3.729187]  visual_init+0x14c/0x1e4
> > [    3.729239]  do_bind_con_driver.isra.0+0x2c4/0x530
> > [    3.729279]  do_take_over_console+0x200/0x2e0
> > [    3.729317]  do_fbcon_takeover+0x90/0x120
> > [    3.729363]  fbcon_fb_registered+0x14c/0x164
> > [    3.729412]  register_framebuffer+0x308/0x4e0
> > [    3.729451]  __drm_fb_helper_initial_config_and_unlock+0x538/0x7d0
> > [    3.729506]  drm_fbdev_client_hotplug+0x204/0x374
> > [    3.729556]  drm_fbdev_generic_setup+0xf4/0x24c
> > [    3.729604]  vc4_drm_bind+0x1d4/0x1f0
> > [    3.729654]  try_to_bring_up_master+0x254/0x2dc
> > [    3.729709]  __component_add+0x10c/0x240
> > [    3.729759]  component_add+0x18/0x24
> > [    3.729807]  vc4_v3d_dev_probe+0x20/0x30
> > [    3.729854]  platform_probe+0x90/0x110
> > [    3.729907]  really_probe+0x148/0x744
> > [    3.729952]  driver_probe_device+0x8c/0xfc
> > [    3.729998]  __device_attach_driver+0x120/0x180
> > [    3.730048]  bus_for_each_drv+0xf4/0x15c
> > [    3.730091]  __device_attach+0x168/0x250
> > [    3.730137]  device_initial_probe+0x18/0x24
> > [    3.730186]  bus_probe_device+0xec/0x100
> > [    3.730230]  deferred_probe_work_func+0xe8/0x130
> > [    3.730279]  process_one_work+0x3b8/0x650
> > [    3.730319]  worker_thread+0x3cc/0x72c
> > [    3.730356]  kthread+0x21c/0x224
> > [    3.730402]  ret_from_fork+0x10/0x38
> > [    3.730442]
> > [    3.730453] Allocated by task 7:
> > [    3.730470]  kasan_save_stack+0x2c/0x60
> > [    3.730526]  __kasan_kmalloc+0x90/0xb4
> > [    3.730577]  vc4_hvs_channels_duplicate_state+0x60/0x1a0
> > [    3.730637]  drm_atomic_get_private_obj_state+0x144/0x230
> > [    3.730680]  vc4_atomic_check+0x40/0x73c
> > [    3.730732]  drm_atomic_check_only+0x998/0xe60
> > [    3.730769]  drm_atomic_commit+0x34/0x94
> > [    3.730804]  drm_client_modeset_commit_atomic+0x2f4/0x3a0
> > [    3.730854]  drm_client_modeset_commit_locked+0x8c/0x230
> > [    3.730904]  drm_client_modeset_commit+0x38/0x60
> > [    3.730951]  drm_fb_helper_set_par+0x104/0x17c
> > [    3.730998]  fbcon_init+0x43c/0x970
> > [    3.731041]  visual_init+0x14c/0x1e4
> > [    3.731090]  do_bind_con_driver.isra.0+0x2c4/0x530
> > [    3.731128]  do_take_over_console+0x200/0x2e0
> > [    3.731165]  do_fbcon_takeover+0x90/0x120
> > [    3.731210]  fbcon_fb_registered+0x14c/0x164
> > [    3.731258]  register_framebuffer+0x308/0x4e0
> > [    3.731296]  __drm_fb_helper_initial_config_and_unlock+0x538/0x7d0
> > [    3.731349]  drm_fbdev_client_hotplug+0x204/0x374
> > [    3.731398]  drm_fbdev_generic_setup+0xf4/0x24c
> > [    3.731446]  vc4_drm_bind+0x1d4/0x1f0
> > [    3.731493]  try_to_bring_up_master+0x254/0x2dc
> > [    3.731546]  __component_add+0x10c/0x240
> > [    3.731594]  component_add+0x18/0x24
> > [    3.731642]  vc4_v3d_dev_probe+0x20/0x30
> > [    3.731686]  platform_probe+0x90/0x110
> > [    3.731737]  really_probe+0x148/0x744
> > [    3.731781]  driver_probe_device+0x8c/0xfc
> > [    3.731827]  __device_attach_driver+0x120/0x180
> > [    3.731875]  bus_for_each_drv+0xf4/0x15c
> > [    3.731916]  __device_attach+0x168/0x250
> > [    3.731962]  device_initial_probe+0x18/0x24
> > [    3.732009]  bus_probe_device+0xec/0x100
> > [    3.732052]  deferred_probe_work_func+0xe8/0x130
> > [    3.732100]  process_one_work+0x3b8/0x650
> > [    3.732137]  worker_thread+0x3cc/0x72c
> > [    3.732172]  kthread+0x21c/0x224
> > [    3.732215]  ret_from_fork+0x10/0x38
> > [    3.732253]
> > [    3.732262] The buggy address belongs to the object at ffff000007360400
> > [    3.732262]  which belongs to the cache kmalloc-128 of size 128
> > [    3.732293] The buggy address is located 64 bytes inside of
> > [    3.732293]  128-byte region [ffff000007360400, ffff000007360480)
> > [    3.732329] The buggy address belongs to the page:
> > [    3.732344] page:(____ptrval____) refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7360
> > [    3.732380] flags: 0x3fffc0000000200(slab|node=0|zone=0|lastcpupid=0xffff)
> > [    3.732442] raw: 03fffc0000000200 dead000000000100 dead000000000122 ffff000004c02300
> > [    3.732478] raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
> > [    3.732501] page dumped because: kasan: bad access detected
> > [    3.732518]
> > [    3.732527] Memory state around the buggy address:
> > [    3.732549]  ffff000007360300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > [    3.732579]  ffff000007360380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> > [    3.732608] >ffff000007360400: 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc
> > [    3.732629]                                            ^
> > [    3.732652]  ffff000007360480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> > [    3.732682]  ffff000007360500: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > [    3.732703] ==================================================================
> > [    3.732718] Disabling lock debugging due to kernel taint
> > [    3.769129] Console: switching to colour frame buffer device 90x30
> > [    5.148699] vc4-drm soc:gpu: [drm] fb0: vc4drmfb frame buffer device
Arnd Bergmann June 7, 2021, 1:57 p.m. UTC | #22
On Mon, Jun 7, 2021 at 3:39 PM Will Deacon <will@kernel.org> wrote:
>
> [Adding VC4 folks -- please see the KASAN splat below!]
>
> Background here is that reducing ARCH_DMA_MINALIGN to 64 on arm64 (queued in
> -next) is causing vc4 to hang on Rpi3b due to a probable driver bug.

The great news for the patch that caused it is that this has nothing to
do with DMA alignment.

> On Mon, Jun 07, 2021 at 02:08:59PM +0100, Mark Rutland wrote:
> > On Mon, Jun 07, 2021 at 01:01:18PM +0100, Mark Rutland wrote:
> > > On Mon, Jun 07, 2021 at 11:58:32AM +0200, Marek Szyprowski wrote:

> > [    3.728042] BUG: KASAN: slab-out-of-bounds in vc4_atomic_commit_tail+0x1cc/0x910
> > [    3.728123] Read of size 8 at addr ffff000007360440 by task kworker/u8:0/7

This is offset 0x40 into struct vc4_hvs_state, which is the
'pending_commit' pointer
for the array index 4, i.e. one after the end of the structure.

> > [    3.728495]  kasan_report+0x1dc/0x240
> > [    3.728529]  __asan_load8+0x98/0xd4
> > [    3.728565]  vc4_atomic_commit_tail+0x1cc/0x910

It seems to be this loop:

        for_each_old_crtc_in_state(state, crtc, old_crtc_state, i) {
                struct vc4_crtc_state *vc4_crtc_state =
                        to_vc4_crtc_state(old_crtc_state);
                unsigned int channel = vc4_crtc_state->assigned_channel;
                int ret;

                if (channel == VC4_HVS_CHANNEL_DISABLED)
                        continue;

                if (!old_hvs_state->fifo_state[channel].in_use)
                        continue;

                ret =
drm_crtc_commit_wait(old_hvs_state->fifo_state[i].pending_commit);
                if (ret)
                        drm_err(dev, "Timed out waiting for commit\n");
        }

I notice that it checks index 'fifos_state[channel].in_use', but then
uses a different index 'i' for looking at the 'pending_commit' field
beyond the end of the array.

This code was introduced by Maxime Ripard in commit 9ec03d7f1ed3
 ("drm/vc4: kms: Wait on previous FIFO users before a commit").

    Arnd
Maxime Ripard June 7, 2021, 3:17 p.m. UTC | #23
On Mon, Jun 07, 2021 at 03:57:41PM +0200, Arnd Bergmann wrote:
> On Mon, Jun 7, 2021 at 3:39 PM Will Deacon <will@kernel.org> wrote:
> >
> > [Adding VC4 folks -- please see the KASAN splat below!]
> >
> > Background here is that reducing ARCH_DMA_MINALIGN to 64 on arm64 (queued in
> > -next) is causing vc4 to hang on Rpi3b due to a probable driver bug.
> 
> The great news for the patch that caused it is that this has nothing to
> do with DMA alignment.
> 
> > On Mon, Jun 07, 2021 at 02:08:59PM +0100, Mark Rutland wrote:
> > > On Mon, Jun 07, 2021 at 01:01:18PM +0100, Mark Rutland wrote:
> > > > On Mon, Jun 07, 2021 at 11:58:32AM +0200, Marek Szyprowski wrote:
> 
> > > [    3.728042] BUG: KASAN: slab-out-of-bounds in vc4_atomic_commit_tail+0x1cc/0x910
> > > [    3.728123] Read of size 8 at addr ffff000007360440 by task kworker/u8:0/7
> 
> This is offset 0x40 into struct vc4_hvs_state, which is the
> 'pending_commit' pointer
> for the array index 4, i.e. one after the end of the structure.
> 
> > > [    3.728495]  kasan_report+0x1dc/0x240
> > > [    3.728529]  __asan_load8+0x98/0xd4
> > > [    3.728565]  vc4_atomic_commit_tail+0x1cc/0x910
> 
> It seems to be this loop:
> 
>         for_each_old_crtc_in_state(state, crtc, old_crtc_state, i) {
>                 struct vc4_crtc_state *vc4_crtc_state =
>                         to_vc4_crtc_state(old_crtc_state);
>                 unsigned int channel = vc4_crtc_state->assigned_channel;
>                 int ret;
> 
>                 if (channel == VC4_HVS_CHANNEL_DISABLED)
>                         continue;
> 
>                 if (!old_hvs_state->fifo_state[channel].in_use)
>                         continue;
> 
>                 ret =
> drm_crtc_commit_wait(old_hvs_state->fifo_state[i].pending_commit);
>                 if (ret)
>                         drm_err(dev, "Timed out waiting for commit\n");
>         }
> 
> I notice that it checks index 'fifos_state[channel].in_use', but then
> uses a different index 'i' for looking at the 'pending_commit' field
> beyond the end of the array.
> 
> This code was introduced by Maxime Ripard in commit 9ec03d7f1ed3
>  ("drm/vc4: kms: Wait on previous FIFO users before a commit").

Awesome, I tried to find out that bug a few weeks ago but couldn't
reproduce the KASAN spat. You're right, it should be channel here
instead of i. Since you did the whole work, do you want to send the
patch?

maxime
Mark Rutland June 7, 2021, 3:32 p.m. UTC | #24
On Mon, Jun 07, 2021 at 03:57:41PM +0200, Arnd Bergmann wrote:
> On Mon, Jun 7, 2021 at 3:39 PM Will Deacon <will@kernel.org> wrote:
> >
> > [Adding VC4 folks -- please see the KASAN splat below!]
> >
> > Background here is that reducing ARCH_DMA_MINALIGN to 64 on arm64 (queued in
> > -next) is causing vc4 to hang on Rpi3b due to a probable driver bug.
> 
> The great news for the patch that caused it is that this has nothing to
> do with DMA alignment.
> 
> > On Mon, Jun 07, 2021 at 02:08:59PM +0100, Mark Rutland wrote:
> > > On Mon, Jun 07, 2021 at 01:01:18PM +0100, Mark Rutland wrote:
> > > > On Mon, Jun 07, 2021 at 11:58:32AM +0200, Marek Szyprowski wrote:
> 
> > > [    3.728042] BUG: KASAN: slab-out-of-bounds in vc4_atomic_commit_tail+0x1cc/0x910
> > > [    3.728123] Read of size 8 at addr ffff000007360440 by task kworker/u8:0/7
> 
> This is offset 0x40 into struct vc4_hvs_state, which is the
> 'pending_commit' pointer
> for the array index 4, i.e. one after the end of the structure.
> 
> > > [    3.728495]  kasan_report+0x1dc/0x240
> > > [    3.728529]  __asan_load8+0x98/0xd4
> > > [    3.728565]  vc4_atomic_commit_tail+0x1cc/0x910
> 
> It seems to be this loop:
> 
>         for_each_old_crtc_in_state(state, crtc, old_crtc_state, i) {
>                 struct vc4_crtc_state *vc4_crtc_state =
>                         to_vc4_crtc_state(old_crtc_state);
>                 unsigned int channel = vc4_crtc_state->assigned_channel;
>                 int ret;
> 
>                 if (channel == VC4_HVS_CHANNEL_DISABLED)
>                         continue;
> 
>                 if (!old_hvs_state->fifo_state[channel].in_use)
>                         continue;
> 
>                 ret =
> drm_crtc_commit_wait(old_hvs_state->fifo_state[i].pending_commit);
>                 if (ret)
>                         drm_err(dev, "Timed out waiting for commit\n");
>         }
> 
> I notice that it checks index 'fifos_state[channel].in_use', but then
> uses a different index 'i' for looking at the 'pending_commit' field
> beyond the end of the array.

FWIW, with that drm_crtc_commit_wait() call changed to:

| ret = drm_crtc_commit_wait(old_hvs_state->fifo_state[channel].pending_commit);

... I no longer see a KASAN splat, and I no longer see a hang with
ARCH_DMA_MINALIGN reduced to 64.

Thanks,
Mark.

> 
> This code was introduced by Maxime Ripard in commit 9ec03d7f1ed3
>  ("drm/vc4: kms: Wait on previous FIFO users before a commit").
> 
>     Arnd
Arnd Bergmann June 7, 2021, 3:50 p.m. UTC | #25
On Mon, Jun 7, 2021 at 5:17 PM Maxime Ripard <maxime@cerno.tech> wrote:
> On Mon, Jun 07, 2021 at 03:57:41PM +0200, Arnd Bergmann wrote:
> > On Mon, Jun 7, 2021 at 3:39 PM Will Deacon <will@kernel.org> wrote:
> > > On Mon, Jun 07, 2021 at 02:08:59PM +0100, Mark Rutland wrote:
> > > > On Mon, Jun 07, 2021 at 01:01:18PM +0100, Mark Rutland wrote:
> > > > > On Mon, Jun 07, 2021 at 11:58:32AM +0200, Marek Szyprowski wrote:
> > I notice that it checks index 'fifos_state[channel].in_use', but then
> > uses a different index 'i' for looking at the 'pending_commit' field
> > beyond the end of the array.
> >
> > This code was introduced by Maxime Ripard in commit 9ec03d7f1ed3
> >  ("drm/vc4: kms: Wait on previous FIFO users before a commit").
>
> Awesome, I tried to find out that bug a few weeks ago but couldn't
> reproduce the KASAN spat. You're right, it should be channel here
> instead of i. Since you did the whole work, do you want to send the
> patch?

Marek and Mark did most of the work finding the problem, I just looked
in the right place a few times (and a bit in the wrong place). I'd suggest
you send that patch with the corresponding Reported-by/Analyzed-by/
Tested-by tags.

        Arnd
Mark Rutland June 8, 2021, 8:57 a.m. UTC | #26
On Mon, Jun 07, 2021 at 05:50:57PM +0200, Arnd Bergmann wrote:
> On Mon, Jun 7, 2021 at 5:17 PM Maxime Ripard <maxime@cerno.tech> wrote:
> > On Mon, Jun 07, 2021 at 03:57:41PM +0200, Arnd Bergmann wrote:
> > > On Mon, Jun 7, 2021 at 3:39 PM Will Deacon <will@kernel.org> wrote:
> > > > On Mon, Jun 07, 2021 at 02:08:59PM +0100, Mark Rutland wrote:
> > > > > On Mon, Jun 07, 2021 at 01:01:18PM +0100, Mark Rutland wrote:
> > > > > > On Mon, Jun 07, 2021 at 11:58:32AM +0200, Marek Szyprowski wrote:
> > > I notice that it checks index 'fifos_state[channel].in_use', but then
> > > uses a different index 'i' for looking at the 'pending_commit' field
> > > beyond the end of the array.
> > >
> > > This code was introduced by Maxime Ripard in commit 9ec03d7f1ed3
> > >  ("drm/vc4: kms: Wait on previous FIFO users before a commit").
> >
> > Awesome, I tried to find out that bug a few weeks ago but couldn't
> > reproduce the KASAN spat. You're right, it should be channel here
> > instead of i. Since you did the whole work, do you want to send the
> > patch?
> 
> Marek and Mark did most of the work finding the problem, I just looked
> in the right place a few times (and a bit in the wrong place). I'd suggest
> you send that patch with the corresponding Reported-by/Analyzed-by/
> Tested-by tags.

I've sent:

https://lore.kernel.org/r/20210608085513.2069-1-mark.rutland@arm.com

Thanks,
Mark.
Catalin Marinas July 6, 2021, 10:26 a.m. UTC | #27
On Tue, Jul 06, 2021 at 09:26:59AM +0000, Yassine Oudjana wrote:
> In-Reply-To: <20210527124356.22367-1-will@kernel.org>
> > Reduce ARCH_DMA_MINALIGN to 64 bytes and allow the warning/taint to
> > indicate if there are machines that unknowingly rely on this.
> 
> The warning is being triggered on Qualcomm MSM8996, as well as the out-of-spec taint:

Is this booting with ACPI or DT?

> ------------[ cut here ]------------
> rtc-pm8xxx 400f000.qcom,spmi:pmic@0:rtc@6000: ARCH_DMA_MINALIGN smaller than CTR_EL0.CWG (64 < 128)
> WARNING: CPU: 0 PID: 1 at arch/arm64/mm/dma-mapping.c:45 arch_setup_dma_ops+0xf8/0x10c
[...]
> This warning is triggered with nearly every driver probe, not only rtc-pm8xxx.

I have a suspicion none of the reported devices actually do any DMA, so
in practice it should be safe but we need to figure out why
arch_setup_dma_ops() gets called.
Robin Murphy July 6, 2021, 1:29 p.m. UTC | #28
On 2021-07-06 11:26, Catalin Marinas wrote:
> On Tue, Jul 06, 2021 at 09:26:59AM +0000, Yassine Oudjana wrote:
>> In-Reply-To: <20210527124356.22367-1-will@kernel.org>
>>> Reduce ARCH_DMA_MINALIGN to 64 bytes and allow the warning/taint to
>>> indicate if there are machines that unknowingly rely on this.
>>
>> The warning is being triggered on Qualcomm MSM8996, as well as the out-of-spec taint:
> 
> Is this booting with ACPI or DT?
> 
>> ------------[ cut here ]------------
>> rtc-pm8xxx 400f000.qcom,spmi:pmic@0:rtc@6000: ARCH_DMA_MINALIGN smaller than CTR_EL0.CWG (64 < 128)
>> WARNING: CPU: 0 PID: 1 at arch/arm64/mm/dma-mapping.c:45 arch_setup_dma_ops+0xf8/0x10c
> [...]
>> This warning is triggered with nearly every driver probe, not only rtc-pm8xxx.
> 
> I have a suspicion none of the reported devices actually do any DMA, so
> in practice it should be safe but we need to figure out why
> arch_setup_dma_ops() gets called.

It gets called because there's no straightforward way to know that a 
platform device *isn't* DMA-capable, so we have to assume they are.

I would also assume that in a Qcom SoC there really are at least some 
things doing non-coherent DMA :(

Robin.
Will Deacon July 6, 2021, 1:33 p.m. UTC | #29
On Tue, Jul 06, 2021 at 02:29:07PM +0100, Robin Murphy wrote:
> On 2021-07-06 11:26, Catalin Marinas wrote:
> > On Tue, Jul 06, 2021 at 09:26:59AM +0000, Yassine Oudjana wrote:
> > > In-Reply-To: <20210527124356.22367-1-will@kernel.org>
> > > > Reduce ARCH_DMA_MINALIGN to 64 bytes and allow the warning/taint to
> > > > indicate if there are machines that unknowingly rely on this.
> > > 
> > > The warning is being triggered on Qualcomm MSM8996, as well as the out-of-spec taint:
> > 
> > Is this booting with ACPI or DT?
> > 
> > > ------------[ cut here ]------------
> > > rtc-pm8xxx 400f000.qcom,spmi:pmic@0:rtc@6000: ARCH_DMA_MINALIGN smaller than CTR_EL0.CWG (64 < 128)
> > > WARNING: CPU: 0 PID: 1 at arch/arm64/mm/dma-mapping.c:45 arch_setup_dma_ops+0xf8/0x10c
> > [...]
> > > This warning is triggered with nearly every driver probe, not only rtc-pm8xxx.
> > 
> > I have a suspicion none of the reported devices actually do any DMA, so
> > in practice it should be safe but we need to figure out why
> > arch_setup_dma_ops() gets called.
> 
> It gets called because there's no straightforward way to know that a
> platform device *isn't* DMA-capable, so we have to assume they are.
> 
> I would also assume that in a Qcom SoC there really are at least some things
> doing non-coherent DMA :(

Agreed, unless this is a CPU erratum and the line size is being reported for
a cache beyond the PoC, then I think we're going to have to revert the patch
reducing ARCH_DMA_MINALIGN after all.

I can't find much information about the original Kryo core at all...

Will
Marc Zyngier July 6, 2021, 1:44 p.m. UTC | #30
On 2021-07-06 14:33, Will Deacon wrote:
> On Tue, Jul 06, 2021 at 02:29:07PM +0100, Robin Murphy wrote:
>> On 2021-07-06 11:26, Catalin Marinas wrote:
>> > On Tue, Jul 06, 2021 at 09:26:59AM +0000, Yassine Oudjana wrote:
>> > > In-Reply-To: <20210527124356.22367-1-will@kernel.org>
>> > > > Reduce ARCH_DMA_MINALIGN to 64 bytes and allow the warning/taint to
>> > > > indicate if there are machines that unknowingly rely on this.
>> > >
>> > > The warning is being triggered on Qualcomm MSM8996, as well as the out-of-spec taint:
>> >
>> > Is this booting with ACPI or DT?
>> >
>> > > ------------[ cut here ]------------
>> > > rtc-pm8xxx 400f000.qcom,spmi:pmic@0:rtc@6000: ARCH_DMA_MINALIGN smaller than CTR_EL0.CWG (64 < 128)
>> > > WARNING: CPU: 0 PID: 1 at arch/arm64/mm/dma-mapping.c:45 arch_setup_dma_ops+0xf8/0x10c
>> > [...]
>> > > This warning is triggered with nearly every driver probe, not only rtc-pm8xxx.
>> >
>> > I have a suspicion none of the reported devices actually do any DMA, so
>> > in practice it should be safe but we need to figure out why
>> > arch_setup_dma_ops() gets called.
>> 
>> It gets called because there's no straightforward way to know that a
>> platform device *isn't* DMA-capable, so we have to assume they are.
>> 
>> I would also assume that in a Qcom SoC there really are at least some 
>> things
>> doing non-coherent DMA :(
> 
> Agreed, unless this is a CPU erratum and the line size is being 
> reported for
> a cache beyond the PoC, then I think we're going to have to revert the 
> patch
> reducing ARCH_DMA_MINALIGN after all.
> 
> I can't find much information about the original Kryo core at all...

I have similar issues with my QDF2400. The UART, RTC and DMA controllers
are all screaming at me. I'm confident that the UART doesn't do any
DMA (it is handled by the SBSA driver), but the DMA controllers are
probably doing what it says on the tin.

Do we know whether Falkor and Kryo share any part of their design?

         M.
Robin Murphy July 6, 2021, 2:21 p.m. UTC | #31
On 2021-07-06 14:44, Marc Zyngier wrote:
> On 2021-07-06 14:33, Will Deacon wrote:
>> On Tue, Jul 06, 2021 at 02:29:07PM +0100, Robin Murphy wrote:
>>> On 2021-07-06 11:26, Catalin Marinas wrote:
>>> > On Tue, Jul 06, 2021 at 09:26:59AM +0000, Yassine Oudjana wrote:
>>> > > In-Reply-To: <20210527124356.22367-1-will@kernel.org>
>>> > > > Reduce ARCH_DMA_MINALIGN to 64 bytes and allow the 
>>> warning/taint to
>>> > > > indicate if there are machines that unknowingly rely on this.
>>> > >
>>> > > The warning is being triggered on Qualcomm MSM8996, as well as 
>>> the out-of-spec taint:
>>> >
>>> > Is this booting with ACPI or DT?
>>> >
>>> > > ------------[ cut here ]------------
>>> > > rtc-pm8xxx 400f000.qcom,spmi:pmic@0:rtc@6000: ARCH_DMA_MINALIGN 
>>> smaller than CTR_EL0.CWG (64 < 128)
>>> > > WARNING: CPU: 0 PID: 1 at arch/arm64/mm/dma-mapping.c:45 
>>> arch_setup_dma_ops+0xf8/0x10c
>>> > [...]
>>> > > This warning is triggered with nearly every driver probe, not 
>>> only rtc-pm8xxx.
>>> >
>>> > I have a suspicion none of the reported devices actually do any 
>>> DMA, so
>>> > in practice it should be safe but we need to figure out why
>>> > arch_setup_dma_ops() gets called.
>>>
>>> It gets called because there's no straightforward way to know that a
>>> platform device *isn't* DMA-capable, so we have to assume they are.
>>>
>>> I would also assume that in a Qcom SoC there really are at least some 
>>> things
>>> doing non-coherent DMA :(
>>
>> Agreed, unless this is a CPU erratum and the line size is being 
>> reported for
>> a cache beyond the PoC, then I think we're going to have to revert the 
>> patch
>> reducing ARCH_DMA_MINALIGN after all.
>>
>> I can't find much information about the original Kryo core at all...
> 
> I have similar issues with my QDF2400. The UART, RTC and DMA controllers
> are all screaming at me. I'm confident that the UART doesn't do any
> DMA (it is handled by the SBSA driver), but the DMA controllers are
> probably doing what it says on the tin.
> 
> Do we know whether Falkor and Kryo share any part of their design?

According to [1], not literally, but some of the same people being 
involved in both could plausibly imply that some basic design decisions 
might have carried over.

Robin.

[1] 
https://www.anandtech.com/show/11737/analyzing-falkors-microarchitecture-a-deep-dive-into-qualcomms-centriq-2400-for-windows-server-and-linux
Arnd Bergmann July 6, 2021, 2:30 p.m. UTC | #32
On Tue, Jul 6, 2021 at 3:44 PM Marc Zyngier <maz@kernel.org> wrote:
>
> On 2021-07-06 14:33, Will Deacon wrote:
> > On Tue, Jul 06, 2021 at 02:29:07PM +0100, Robin Murphy wrote:
> >
> > I can't find much information about the original Kryo core at all...
>
> I have similar issues with my QDF2400. The UART, RTC and DMA controllers
> are all screaming at me. I'm confident that the UART doesn't do any
> DMA (it is handled by the SBSA driver), but the DMA controllers are
> probably doing what it says on the tin.

But that's a server chip, surely the DMA controller is fully cache coherent
as required by SBSA? (please?)

Maybe just a misannotation on the device node?

> Do we know whether Falkor and Kryo share any part of their design?

I'm fairly sure the Snapdragon 821 / msm8996 is not cache coherent.

I can only speculate on how much got reused between the two, but
as Falkor was released only after they had already given up on
the full-custom Kryo core, it's plausible that it incorporates bits from
that one. In particular the cache controller is probably easy to reuse
even if the rest of it was a new design.

      Arnd
Marc Zyngier July 6, 2021, 2:46 p.m. UTC | #33
On Tue, 06 Jul 2021 15:30:34 +0100,
Arnd Bergmann <arnd@arndb.de> wrote:
> 
> On Tue, Jul 6, 2021 at 3:44 PM Marc Zyngier <maz@kernel.org> wrote:
> >
> > On 2021-07-06 14:33, Will Deacon wrote:
> > > On Tue, Jul 06, 2021 at 02:29:07PM +0100, Robin Murphy wrote:
> > >
> > > I can't find much information about the original Kryo core at all...
> >
> > I have similar issues with my QDF2400. The UART, RTC and DMA controllers
> > are all screaming at me. I'm confident that the UART doesn't do any
> > DMA (it is handled by the SBSA driver), but the DMA controllers are
> > probably doing what it says on the tin.
> 
> But that's a server chip, surely the DMA controller is fully cache
> coherent as required by SBSA? (please?)

Remember that there is a SBSA level for each broken implementation
(even my XGene is SBSA compliant!), so that doesn't mean much.

> Maybe just a misannotation on the device node?

Maybe. But given that whoever wrote the ACPI tables made sure that
everything else was annotated as coherent, I doubt the DMA controllers
being advertised as non-coherent is an accident.

> > Do we know whether Falkor and Kryo share any part of their design?
> 
> I'm fairly sure the Snapdragon 821 / msm8996 is not cache coherent.
> 
> I can only speculate on how much got reused between the two, but
> as Falkor was released only after they had already given up on
> the full-custom Kryo core, it's plausible that it incorporates bits from
> that one. In particular the cache controller is probably easy to reuse
> even if the rest of it was a new design.

I guess we'll never find out, and I'm probably one of the few still
having some access to this HW (not even sure for how long anyway).

I won't cry if we decide to pull the plug on it.

	M.
Arnd Bergmann July 6, 2021, 3:43 p.m. UTC | #34
On Tue, Jul 6, 2021 at 4:46 PM Marc Zyngier <maz@kernel.org> wrote:
> On Tue, 06 Jul 2021 15:30:34 +0100, Arnd Bergmann <arnd@arndb.de> wrote:
> > I can only speculate on how much got reused between the two, but
> > as Falkor was released only after they had already given up on
> > the full-custom Kryo core, it's plausible that it incorporates bits from
> > that one. In particular the cache controller is probably easy to reuse
> > even if the rest of it was a new design.
>
> I guess we'll never find out, and I'm probably one of the few still
> having some access to this HW (not even sure for how long anyway).
>
> I won't cry if we decide to pull the plug on it.

Sure, but the Snapdragon 820E is one we do need to worry about.

While the internet pretty much agrees on Falkor having 128 bytes
L1 cache line, it might be good to rule out that Kryo just misreports
it before we revert the patch.

Yassine, could you run the 'line' and 'cache' helper from lmbench
to determine what the cache topology appears to be and if that
matches the CTR_EL0 contents?

Something like

numactl -C 0 line -M 1M
numactl -C 3 line -M 1M
numactl -C 0 cache
numactl -C 3 cache

(the numactl command helps run this both on the 'big' and 'little'
cores without running into migration)

      Arnd
Will Deacon July 6, 2021, 4:20 p.m. UTC | #35
On Tue, Jul 06, 2021 at 04:30:34PM +0200, Arnd Bergmann wrote:
> On Tue, Jul 6, 2021 at 3:44 PM Marc Zyngier <maz@kernel.org> wrote:
> >
> > On 2021-07-06 14:33, Will Deacon wrote:
> > > On Tue, Jul 06, 2021 at 02:29:07PM +0100, Robin Murphy wrote:
> > >
> > > I can't find much information about the original Kryo core at all...
> >
> > I have similar issues with my QDF2400. The UART, RTC and DMA controllers
> > are all screaming at me. I'm confident that the UART doesn't do any
> > DMA (it is handled by the SBSA driver), but the DMA controllers are
> > probably doing what it says on the tin.
> 
> But that's a server chip, surely the DMA controller is fully cache coherent
> as required by SBSA? (please?)
> 
> Maybe just a misannotation on the device node?
> 
> > Do we know whether Falkor and Kryo share any part of their design?
> 
> I'm fairly sure the Snapdragon 821 / msm8996 is not cache coherent.
> 
> I can only speculate on how much got reused between the two, but
> as Falkor was released only after they had already given up on
> the full-custom Kryo core, it's plausible that it incorporates bits from
> that one. In particular the cache controller is probably easy to reuse
> even if the rest of it was a new design.

I think the million dollar question is whether the 128-byte cache-lines
live in a cache above the PoC or not. If it's just a system level cache
through which all DMA is "coherent", then it doesn't matter.

Digging around on the web, it's unclear whether msm8996 has an L3 or not.

Will
Yassine Oudjana July 6, 2021, 5:15 p.m. UTC | #36
On Tuesday, July 6th, 2021 at 7:43 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tue, Jul 6, 2021 at 4:46 PM Marc Zyngier maz@kernel.org wrote:
> > On Tue, 06 Jul 2021 15:30:34 +0100, Arnd Bergmann arnd@arndb.de wrote:
> > > I can only speculate on how much got reused between the two, but
> > > as Falkor was released only after they had already given up on
> > > the full-custom Kryo core, it's plausible that it incorporates bits from
> > > that one. In particular the cache controller is probably easy to reuse
> > > even if the rest of it was a new design.
> >
> > I guess we'll never find out, and I'm probably one of the few still
> > having some access to this HW (not even sure for how long anyway).
> >
> > I won't cry if we decide to pull the plug on it.
>
> Sure, but the Snapdragon 820E is one we do need to worry about.
> While the internet pretty much agrees on Falkor having 128 bytes
> L1 cache line, it might be good to rule out that Kryo just misreports
> it before we revert the patch.
>
> Yassine, could you run the 'line' and 'cache' helper from lmbench
> to determine what the cache topology appears to be and if that
> matches the CTR_EL0 contents?
>
> Something like
>
> numactl -C 0 line -M 1M
> numactl -C 3 line -M 1M
> numactl -C 0 cache
> numactl -C 3 cache
>
> (the numactl command helps run this both on the 'big' and 'little'
> cores without running into migration)
>
> Arnd

Here are the results:

$ numactl -C 0 line -M 1M
128
$ numactl -C 3 line -M 1M
128
$ numactl -C 0 cache
L1 cache: 512 bytes 1.37 nanoseconds 64 linesize -1.00 parallelism
L2 cache: 24576 bytes 2.75 nanoseconds 64 linesize 5.06 parallelism
L3 cache: 131072 bytes 7.89 nanoseconds 64 linesize 3.85 parallelism
L4 cache: 524288 bytes 15.86 nanoseconds 128 linesize 3.48 parallelism
Memory latency: 145.93 nanoseconds 4.88 parallelism
$ numactl -C 3 cache
L1 cache: 24576 bytes 1.29 nanoseconds 64 linesize 5.00 parallelism
L2 cache: 1048576 bytes 8.60 nanoseconds 128 linesize 3.07 parallelism
Memory latency: 143.29 nanoseconds 5.37 parallelism
Arnd Bergmann July 6, 2021, 8:33 p.m. UTC | #37
On Tue, Jul 6, 2021 at 7:15 PM Yassine Oudjana <y.oudjana@protonmail.com> wrote:
> > (the numactl command helps run this both on the 'big' and 'little'
> > cores without running into migration)
> >
> > Arnd
>
> Here are the results:

Thanks, that was quick

> $ numactl -C 0 line -M 1M
> 128
> $ numactl -C 3 line -M 1M
> 128
> $ numactl -C 0 cache
> L1 cache: 512 bytes 1.37 nanoseconds 64 linesize -1.00 parallelism
> L2 cache: 24576 bytes 2.75 nanoseconds 64 linesize 5.06 parallelism
> L3 cache: 131072 bytes 7.89 nanoseconds 64 linesize 3.85 parallelism
> L4 cache: 524288 bytes 15.86 nanoseconds 128 linesize 3.48 parallelism
> Memory latency: 145.93 nanoseconds 4.88 parallelism
> $ numactl -C 3 cache
> L1 cache: 24576 bytes 1.29 nanoseconds 64 linesize 5.00 parallelism
> L2 cache: 1048576 bytes 8.60 nanoseconds 128 linesize 3.07 parallelism
> Memory latency: 143.29 nanoseconds 5.37 parallelism

This is still somewhat inconclusive, but it does give some hope. The data that
I found on random web sites was

- 32KB L1, 2MB/1MB L2 [1][2]
- 16KB L1, 1.5MB L2 [3]
- 32KB L1, 1MB/512KB L2 [4]

so none of the sizes really line up. My best guess is that the actual hierarchy

1MB per-core L2 cache on the two big CPU, 512KB per-core L2 cache on
the two little ones, but no shared L2 or L3. The older Krait had a 4KB L0
cache, which could explain the 512-byte L1 output.

Can you rerun the the 'line' test with '-M 128K' to see if that confirms the 64
byte L1 line size that the 'cache' test reported?
      Arnd

[1] https://en.wikipedia.org/wiki/List_of_Qualcomm_Snapdragon_processors#Snapdragon_820_and_821_(2016)
[2] https://en.wikipedia.org/wiki/Kryo
[3] https://www.geektopia.es/es/product/qualcomm/snapdragon-820/
[4] https://www.anandtech.com/show/9837/snapdragon-820-preview/2
Bjorn Andersson July 6, 2021, 10:27 p.m. UTC | #38
On Tue 06 Jul 15:33 CDT 2021, Arnd Bergmann wrote:

> On Tue, Jul 6, 2021 at 7:15 PM Yassine Oudjana <y.oudjana@protonmail.com> wrote:
> > > (the numactl command helps run this both on the 'big' and 'little'
> > > cores without running into migration)
> > >
> > > Arnd
> >
> > Here are the results:
> 
> Thanks, that was quick
> 
> > $ numactl -C 0 line -M 1M
> > 128
> > $ numactl -C 3 line -M 1M
> > 128
> > $ numactl -C 0 cache
> > L1 cache: 512 bytes 1.37 nanoseconds 64 linesize -1.00 parallelism
> > L2 cache: 24576 bytes 2.75 nanoseconds 64 linesize 5.06 parallelism
> > L3 cache: 131072 bytes 7.89 nanoseconds 64 linesize 3.85 parallelism
> > L4 cache: 524288 bytes 15.86 nanoseconds 128 linesize 3.48 parallelism
> > Memory latency: 145.93 nanoseconds 4.88 parallelism
> > $ numactl -C 3 cache
> > L1 cache: 24576 bytes 1.29 nanoseconds 64 linesize 5.00 parallelism
> > L2 cache: 1048576 bytes 8.60 nanoseconds 128 linesize 3.07 parallelism
> > Memory latency: 143.29 nanoseconds 5.37 parallelism
> 
> This is still somewhat inconclusive, but it does give some hope. The data that
> I found on random web sites was
> 
> - 32KB L1, 2MB/1MB L2 [1][2]
> - 16KB L1, 1.5MB L2 [3]
> - 32KB L1, 1MB/512KB L2 [4]
> 
> so none of the sizes really line up. My best guess is that the actual hierarchy
> 
> 1MB per-core L2 cache on the two big CPU, 512KB per-core L2 cache on
> the two little ones, but no shared L2 or L3. The older Krait had a 4KB L0
> cache, which could explain the 512-byte L1 output.
> 
> Can you rerun the the 'line' test with '-M 128K' to see if that confirms the 64
> byte L1 line size that the 'cache' test reported?

I can confirm that MSM8996, and a few derivatives, has 128 byte cache
lines.

Regards,
Bjorn
Yassine Oudjana July 7, 2021, 8:24 a.m. UTC | #39
On Wednesday, July 7th, 2021 at 12:33 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tue, Jul 6, 2021 at 7:15 PM Yassine Oudjana y.oudjana@protonmail.com wrote:
> > ...
>
> This is still somewhat inconclusive, but it does give some hope. The data that
>
> I found on random web sites was
>
> -   32KB L1, 2MB/1MB L2 [1][2]
> -   16KB L1, 1.5MB L2 [3]
> -   32KB L1, 1MB/512KB L2 [4]
>
>     so none of the sizes really line up. My best guess is that the actual hierarchy
>     1MB per-core L2 cache on the two big CPU, 512KB per-core L2 cache on
>     the two little ones, but no shared L2 or L3. The older Krait had a 4KB L0
>     cache, which could explain the 512-byte L1 output.
>
>     Can you rerun the the 'line' test with '-M 128K' to see if that confirms the 64
>     byte L1 line size that the 'cache' test reported?
>
>     Arnd
>
>     [1] https://en.wikipedia.org/wiki/List_of_Qualcomm_Snapdragon_processors#Snapdragon_820_and_821_(2016)
>     [2] https://en.wikipedia.org/wiki/Kryo
>     [3] https://www.geektopia.es/es/product/qualcomm/snapdragon-820/
>     [4] https://www.anandtech.com/show/9837/snapdragon-820-preview/2

$ numactl -C 0 line -M 128K
64
$ numactl -C 3 line -M 128K
64
Will Deacon July 7, 2021, 9:27 a.m. UTC | #40
On Tue, Jul 06, 2021 at 05:27:53PM -0500, Bjorn Andersson wrote:
> On Tue 06 Jul 15:33 CDT 2021, Arnd Bergmann wrote:
> 
> > On Tue, Jul 6, 2021 at 7:15 PM Yassine Oudjana <y.oudjana@protonmail.com> wrote:
> > > > (the numactl command helps run this both on the 'big' and 'little'
> > > > cores without running into migration)
> > > >
> > > > Arnd
> > >
> > > Here are the results:
> > 
> > Thanks, that was quick
> > 
> > > $ numactl -C 0 line -M 1M
> > > 128
> > > $ numactl -C 3 line -M 1M
> > > 128
> > > $ numactl -C 0 cache
> > > L1 cache: 512 bytes 1.37 nanoseconds 64 linesize -1.00 parallelism
> > > L2 cache: 24576 bytes 2.75 nanoseconds 64 linesize 5.06 parallelism
> > > L3 cache: 131072 bytes 7.89 nanoseconds 64 linesize 3.85 parallelism
> > > L4 cache: 524288 bytes 15.86 nanoseconds 128 linesize 3.48 parallelism
> > > Memory latency: 145.93 nanoseconds 4.88 parallelism
> > > $ numactl -C 3 cache
> > > L1 cache: 24576 bytes 1.29 nanoseconds 64 linesize 5.00 parallelism
> > > L2 cache: 1048576 bytes 8.60 nanoseconds 128 linesize 3.07 parallelism
> > > Memory latency: 143.29 nanoseconds 5.37 parallelism
> > 
> > This is still somewhat inconclusive, but it does give some hope. The data that
> > I found on random web sites was
> > 
> > - 32KB L1, 2MB/1MB L2 [1][2]
> > - 16KB L1, 1.5MB L2 [3]
> > - 32KB L1, 1MB/512KB L2 [4]
> > 
> > so none of the sizes really line up. My best guess is that the actual hierarchy
> > 
> > 1MB per-core L2 cache on the two big CPU, 512KB per-core L2 cache on
> > the two little ones, but no shared L2 or L3. The older Krait had a 4KB L0
> > cache, which could explain the 512-byte L1 output.
> > 
> > Can you rerun the the 'line' test with '-M 128K' to see if that confirms the 64
> > byte L1 line size that the 'cache' test reported?
> 
> I can confirm that MSM8996, and a few derivatives, has 128 byte cache
> lines.

Do you know if the caches with 128-byte lines sit above the PoC? i.e. is
non-coherent DMA coherent with this cache or not?

Will
Arnd Bergmann July 7, 2021, 9:29 a.m. UTC | #41
On Tue, Jul 6, 2021 at 6:20 PM Will Deacon <will@kernel.org> wrote:
>
> I think the million dollar question is whether the 128-byte cache-lines
> live in a cache above the PoC or not. If it's just a system level cache
> through which all DMA is "coherent", then it doesn't matter.

On Wed, Jul 7, 2021 at 10:24 AM Yassine Oudjana
<y.oudjana@protonmail.com> wrote:
>
> On Wednesday, July 7th, 2021 at 12:33 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> > On Tue, Jul 6, 2021 at 7:15 PM Yassine Oudjana y.oudjana@protonmail.com wrote:
> > >
> > > $ numactl -C 0 line -M 1M
> > > 128
> > > $ numactl -C 3 line -M 1M
> > > 128
> >
> >     Can you rerun the the 'line' test with '-M 128K' to see if that confirms the 64
> >     byte L1 line size that the 'cache' test reported?
>
> $ numactl -C 0 line -M 128K
> 64
> $ numactl -C 3 line -M 128K
> 64

Ok, so this seems to confirm that the L1 uses 64 byte lines, while the L2 (or
possibly L3) uses 128 byte lines.

On Wed, Jul 7, 2021 at 12:27 AM Bjorn Andersson
<bjorn.andersson@linaro.org> wrote:
>
> I can confirm that MSM8996, and a few derivatives, has 128 byte cache lines.

Ok, thanks. Assuming this is an outer cache and the L1 indeed has a smaller line
size, can you also confirm that this 128 byte lines are north of the point of
coherency?

In other words, does the CTR_EL0.DminLine field also show 128 bytes
(in which case
it seems we already lost)? And if not, does a CPU store to the second half of a
128 byte L2 line cause DMA data in the first half to be clobbered?

       Arnd
Jeffrey Hugo July 7, 2021, 2:41 p.m. UTC | #42
On Wed, Jul 7, 2021 at 3:30 AM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Tue, Jul 6, 2021 at 6:20 PM Will Deacon <will@kernel.org> wrote:
> >
> > I think the million dollar question is whether the 128-byte cache-lines
> > live in a cache above the PoC or not. If it's just a system level cache
> > through which all DMA is "coherent", then it doesn't matter.
>
> On Wed, Jul 7, 2021 at 10:24 AM Yassine Oudjana
> <y.oudjana@protonmail.com> wrote:
> >
> > On Wednesday, July 7th, 2021 at 12:33 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> > > On Tue, Jul 6, 2021 at 7:15 PM Yassine Oudjana y.oudjana@protonmail.com wrote:
> > > >
> > > > $ numactl -C 0 line -M 1M
> > > > 128
> > > > $ numactl -C 3 line -M 1M
> > > > 128
> > >
> > >     Can you rerun the the 'line' test with '-M 128K' to see if that confirms the 64
> > >     byte L1 line size that the 'cache' test reported?
> >
> > $ numactl -C 0 line -M 128K
> > 64
> > $ numactl -C 3 line -M 128K
> > 64
>
> Ok, so this seems to confirm that the L1 uses 64 byte lines, while the L2 (or
> possibly L3) uses 128 byte lines.
>
> On Wed, Jul 7, 2021 at 12:27 AM Bjorn Andersson
> <bjorn.andersson@linaro.org> wrote:
> >
> > I can confirm that MSM8996, and a few derivatives, has 128 byte cache lines.
>
> Ok, thanks. Assuming this is an outer cache and the L1 indeed has a smaller line
> size, can you also confirm that this 128 byte lines are north of the point of
> coherency?

Finding this old documentation has been painful  :)

L0 I 64 byte cacheline
L1 I 64
L1 D 64
L2 unified 128 (shared between the CPUs of a duplex)

I believe L2 is within the POC, but I'm trying to dig up the old
documentation to confirm.

>
> In other words, does the CTR_EL0.DminLine field also show 128 bytes
> (in which case
> it seems we already lost)? And if not, does a CPU store to the second half of a
> 128 byte L2 line cause DMA data in the first half to be clobbered?

Per the documentation I'm seeing, CTR_EL0.DminLine should show 128
bytes.  I don't have hardware handy to confirm.
Jeffrey Hugo July 8, 2021, 8:59 p.m. UTC | #43
On Wed, Jul 7, 2021 at 8:41 AM Jeffrey Hugo <jeffrey.l.hugo@gmail.com> wrote:
>
> On Wed, Jul 7, 2021 at 3:30 AM Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > On Tue, Jul 6, 2021 at 6:20 PM Will Deacon <will@kernel.org> wrote:
> > >
> > > I think the million dollar question is whether the 128-byte cache-lines
> > > live in a cache above the PoC or not. If it's just a system level cache
> > > through which all DMA is "coherent", then it doesn't matter.
> >
> > On Wed, Jul 7, 2021 at 10:24 AM Yassine Oudjana
> > <y.oudjana@protonmail.com> wrote:
> > >
> > > On Wednesday, July 7th, 2021 at 12:33 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> > > > On Tue, Jul 6, 2021 at 7:15 PM Yassine Oudjana y.oudjana@protonmail.com wrote:
> > > > >
> > > > > $ numactl -C 0 line -M 1M
> > > > > 128
> > > > > $ numactl -C 3 line -M 1M
> > > > > 128
> > > >
> > > >     Can you rerun the the 'line' test with '-M 128K' to see if that confirms the 64
> > > >     byte L1 line size that the 'cache' test reported?
> > >
> > > $ numactl -C 0 line -M 128K
> > > 64
> > > $ numactl -C 3 line -M 128K
> > > 64
> >
> > Ok, so this seems to confirm that the L1 uses 64 byte lines, while the L2 (or
> > possibly L3) uses 128 byte lines.
> >
> > On Wed, Jul 7, 2021 at 12:27 AM Bjorn Andersson
> > <bjorn.andersson@linaro.org> wrote:
> > >
> > > I can confirm that MSM8996, and a few derivatives, has 128 byte cache lines.
> >
> > Ok, thanks. Assuming this is an outer cache and the L1 indeed has a smaller line
> > size, can you also confirm that this 128 byte lines are north of the point of
> > coherency?
>
> Finding this old documentation has been painful  :)
>
> L0 I 64 byte cacheline
> L1 I 64
> L1 D 64
> L2 unified 128 (shared between the CPUs of a duplex)
>
> I believe L2 is within the POC, but I'm trying to dig up the old
> documentation to confirm.

Was able to track down a friendly hardware designer.  The POC lies
between L2 and L3.  Hope this helps.

> > In other words, does the CTR_EL0.DminLine field also show 128 bytes
> > (in which case
> > it seems we already lost)? And if not, does a CPU store to the second half of a
> > 128 byte L2 line cause DMA data in the first half to be clobbered?
>
> Per the documentation I'm seeing, CTR_EL0.DminLine should show 128
> bytes.  I don't have hardware handy to confirm.
Will Deacon July 9, 2021, 8:48 a.m. UTC | #44
On Thu, Jul 08, 2021 at 02:59:28PM -0600, Jeffrey Hugo wrote:
> On Wed, Jul 7, 2021 at 8:41 AM Jeffrey Hugo <jeffrey.l.hugo@gmail.com> wrote:
> >
> > On Wed, Jul 7, 2021 at 3:30 AM Arnd Bergmann <arnd@arndb.de> wrote:
> > >
> > > On Tue, Jul 6, 2021 at 6:20 PM Will Deacon <will@kernel.org> wrote:
> > > >
> > > > I think the million dollar question is whether the 128-byte cache-lines
> > > > live in a cache above the PoC or not. If it's just a system level cache
> > > > through which all DMA is "coherent", then it doesn't matter.
> > >
> > > On Wed, Jul 7, 2021 at 10:24 AM Yassine Oudjana
> > > <y.oudjana@protonmail.com> wrote:
> > > >
> > > > On Wednesday, July 7th, 2021 at 12:33 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> > > > > On Tue, Jul 6, 2021 at 7:15 PM Yassine Oudjana y.oudjana@protonmail.com wrote:
> > > > > >
> > > > > > $ numactl -C 0 line -M 1M
> > > > > > 128
> > > > > > $ numactl -C 3 line -M 1M
> > > > > > 128
> > > > >
> > > > >     Can you rerun the the 'line' test with '-M 128K' to see if that confirms the 64
> > > > >     byte L1 line size that the 'cache' test reported?
> > > >
> > > > $ numactl -C 0 line -M 128K
> > > > 64
> > > > $ numactl -C 3 line -M 128K
> > > > 64
> > >
> > > Ok, so this seems to confirm that the L1 uses 64 byte lines, while the L2 (or
> > > possibly L3) uses 128 byte lines.
> > >
> > > On Wed, Jul 7, 2021 at 12:27 AM Bjorn Andersson
> > > <bjorn.andersson@linaro.org> wrote:
> > > >
> > > > I can confirm that MSM8996, and a few derivatives, has 128 byte cache lines.
> > >
> > > Ok, thanks. Assuming this is an outer cache and the L1 indeed has a smaller line
> > > size, can you also confirm that this 128 byte lines are north of the point of
> > > coherency?
> >
> > Finding this old documentation has been painful  :)
> >
> > L0 I 64 byte cacheline
> > L1 I 64
> > L1 D 64
> > L2 unified 128 (shared between the CPUs of a duplex)
> >
> > I believe L2 is within the POC, but I'm trying to dig up the old
> > documentation to confirm.
> 
> Was able to track down a friendly hardware designer.  The POC lies
> between L2 and L3.  Hope this helps.

Damn, yes, it's bad news but thanks for chasing it up. I'll revert the patch
at -rc1 and add a comment about MSM8996.

Will
Catalin Marinas July 9, 2021, 5:10 p.m. UTC | #45
On Fri, Jul 09, 2021 at 09:48:42AM +0100, Will Deacon wrote:
> On Thu, Jul 08, 2021 at 02:59:28PM -0600, Jeffrey Hugo wrote:
> > On Wed, Jul 7, 2021 at 8:41 AM Jeffrey Hugo <jeffrey.l.hugo@gmail.com> wrote:
> > > L0 I 64 byte cacheline
> > > L1 I 64
> > > L1 D 64
> > > L2 unified 128 (shared between the CPUs of a duplex)
> > >
> > > I believe L2 is within the POC, but I'm trying to dig up the old
> > > documentation to confirm.
> > 
> > Was able to track down a friendly hardware designer.  The POC lies
> > between L2 and L3.  Hope this helps.
> 
> Damn, yes, it's bad news but thanks for chasing it up. I'll revert the patch
> at -rc1 and add a comment about MSM8996.

It's a shame but we can't do much for this platform.

Longer term, we should look at making kmalloc() cache selection more
dynamic. Probably still starting with a 128 byte minimum size but, after
initialising all the devices during boot, if we can't find any
non-coherent one just relax the kmalloc() allocations. We still have the
issue with platform devices with DT assumed to be non-coherent and any
late call (after boot) to arch_setup_dma_ops().

Some bodge below to get an idea, not a final patch (not even the
beginning of one). It initialises the kmalloc caches to size 8 but
limits the allocation size to a kmalloc_dyn_min_size, initially set to
128 on arm64. In a device_initcall_sync(), if we didn't find any
non-coherent device, we lower this to KMALLOC_MIN_SIZE (8 with slub).

----------------8<----------------------------
diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
index a074459f8f2f..bed65db3c42e 100644
--- a/arch/arm64/include/asm/cache.h
+++ b/arch/arm64/include/asm/cache.h
@@ -40,15 +40,6 @@
 #define CLIDR_LOC(clidr)	(((clidr) >> CLIDR_LOC_SHIFT) & 0x7)
 #define CLIDR_LOUIS(clidr)	(((clidr) >> CLIDR_LOUIS_SHIFT) & 0x7)
 
-/*
- * Memory returned by kmalloc() may be used for DMA, so we must make
- * sure that all such allocations are cache aligned. Otherwise,
- * unrelated code may cause parts of the buffer to be read into the
- * cache before the transfer is done, causing old data to be seen by
- * the CPU.
- */
-#define ARCH_DMA_MINALIGN	(128)
-
 #ifdef CONFIG_KASAN_SW_TAGS
 #define ARCH_SLAB_MINALIGN	(1ULL << KASAN_SHADOW_SCALE_SHIFT)
 #elif defined(CONFIG_KASAN_HW_TAGS)
@@ -59,6 +50,9 @@
 
 #include <linux/bitops.h>
 
+extern int kmalloc_dyn_min_size;
+#define __HAVE_ARCH_KMALLOC_DYN_MIN_SIZE
+
 #define ICACHEF_ALIASING	0
 #define ICACHEF_VPIPT		1
 extern unsigned long __icache_flags;
@@ -88,7 +82,7 @@ static inline int cache_line_size_of_cpu(void)
 {
 	u32 cwg = cache_type_cwg();
 
-	return cwg ? 4 << cwg : ARCH_DMA_MINALIGN;
+	return cwg ? 4 << cwg : __alignof__(unsigned long long);
 }
 
 int cache_line_size(void);
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index efed2830d141..a25813377187 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -2808,8 +2808,8 @@ void __init setup_cpu_features(void)
 	 */
 	cwg = cache_type_cwg();
 	if (!cwg)
-		pr_warn("No Cache Writeback Granule information, assuming %d\n",
-			ARCH_DMA_MINALIGN);
+		pr_warn("No Cache Writeback Granule information, assuming %ld\n",
+			__alignof__(unsigned long long));
 }
 
 static void __maybe_unused cpu_enable_cnp(struct arm64_cpu_capabilities const *cap)
diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index 4bf1dd3eb041..9a30d1beb3ea 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -13,6 +13,18 @@
 
 #include <asm/cacheflush.h>
 
+/*
+ * Memory returned by kmalloc() may be used for DMA, so we must make
+ * sure that all such allocations are cache aligned. Otherwise,
+ * unrelated code may cause parts of the buffer to be read into the
+ * cache before the transfer is done, causing old data to be seen by
+ * the CPU.
+ */
+int kmalloc_dyn_min_size = 128;
+EXPORT_SYMBOL(kmalloc_dyn_min_size);
+
+static bool non_coherent_devices;
+
 void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
 		enum dma_data_direction dir)
 {
@@ -42,11 +54,14 @@ void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
 {
 	int cls = cache_line_size_of_cpu();
 
-	WARN_TAINT(!coherent && cls > ARCH_DMA_MINALIGN,
+	WARN_TAINT(!coherent && cls > kmalloc_dyn_min_size,
 		   TAINT_CPU_OUT_OF_SPEC,
-		   "%s %s: ARCH_DMA_MINALIGN smaller than CTR_EL0.CWG (%d < %d)",
+		   "%s %s: kmalloc() minimum size smaller than CTR_EL0.CWG (%d < %d)",
 		   dev_driver_string(dev), dev_name(dev),
-		   ARCH_DMA_MINALIGN, cls);
+		   kmalloc_dyn_min_size, cls);
+
+	if (!coherent)
+		non_coherent_devices = true;
 
 	dev->dma_coherent = coherent;
 	if (iommu)
@@ -57,3 +72,12 @@ void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
 		dev->dma_ops = &xen_swiotlb_dma_ops;
 #endif
 }
+
+static int __init adjust_kmalloc_dyn_min_size(void)
+{
+	if (!non_coherent_devices)
+		kmalloc_dyn_min_size = KMALLOC_MIN_SIZE;
+
+	return 0;
+}
+device_initcall_sync(adjust_kmalloc_dyn_min_size);
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 0c97d788762c..e40c7899cb07 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -349,15 +349,21 @@ static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags)
  */
 static __always_inline unsigned int kmalloc_index(size_t size)
 {
+	int min_size = KMALLOC_MIN_SIZE;
+
 	if (!size)
 		return 0;
 
-	if (size <= KMALLOC_MIN_SIZE)
-		return KMALLOC_SHIFT_LOW;
+#ifdef __HAVE_ARCH_KMALLOC_DYN_MIN_SIZE
+	min_size = kmalloc_dyn_min_size;
+#endif
+
+	if (size <= min_size)
+		return ilog2(min_size);
 
-	if (KMALLOC_MIN_SIZE <= 32 && size > 64 && size <= 96)
+	if (min_size <= 32 && size > 64 && size <= 96)
 		return 1;
-	if (KMALLOC_MIN_SIZE <= 64 && size > 128 && size <= 192)
+	if (min_size <= 64 && size > 128 && size <= 192)
 		return 2;
 	if (size <=          8) return 3;
 	if (size <=         16) return 4;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 7cab77655f11..2666237c84c4 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -725,6 +725,10 @@ struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
 		if (!size)
 			return ZERO_SIZE_PTR;
 
+#ifdef __HAVE_ARCH_KMALLOC_DYN_MIN_SIZE
+		if (size < kmalloc_dyn_min_size)
+			size = kmalloc_dyn_min_size;
+#endif
 		index = size_index[size_index_elem(size)];
 	} else {
 		if (WARN_ON_ONCE(size > KMALLOC_MAX_CACHE_SIZE))
diff mbox series

Patch

diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
index a074459f8f2f..a9c0716e7440 100644
--- a/arch/arm64/include/asm/cache.h
+++ b/arch/arm64/include/asm/cache.h
@@ -47,7 +47,7 @@ 
  * cache before the transfer is done, causing old data to be seen by
  * the CPU.
  */
-#define ARCH_DMA_MINALIGN	(128)
+#define ARCH_DMA_MINALIGN	L1_CACHE_BYTES
 
 #ifdef CONFIG_KASAN_SW_TAGS
 #define ARCH_SLAB_MINALIGN	(1ULL << KASAN_SHADOW_SCALE_SHIFT)