[0/2] riscv: implement Zicbom-based CMO instructions + the t-head variant

Message ID 20220307224620.1933061-1-heiko@sntech.de (mailing list archive)

Message

Heiko Stuebner March 7, 2022, 10:46 p.m. UTC
This series is based on the alternatives changes done in my svpbmt series
and thus also depends on Atish's isa-extension parsing series.

It uses the cache-management instructions from the Zicbom extension to
handle cache flushes and similar actions on platforms that need them.

SoCs using CPU cores from T-Head, like the Allwinner D1, implement a
different set of cache instructions. But while the instructions are
different, they provide the same functionality, so a variant can
easily hook into the existing alternatives mechanism on those.


Heiko Stuebner (2):
  riscv: Implement Zicbom-based cache management operations
  riscv: implement cache-management errata for T-Head SoCs

 arch/riscv/Kconfig                   |  8 +++
 arch/riscv/Kconfig.erratas           | 10 ++++
 arch/riscv/errata/thead/errata.c     |  5 ++
 arch/riscv/include/asm/errata_list.h | 78 +++++++++++++++++++++++++++-
 arch/riscv/include/asm/hwcap.h       |  1 +
 arch/riscv/kernel/cpu.c              |  1 +
 arch/riscv/kernel/cpufeature.c       | 17 ++++++
 arch/riscv/mm/Makefile               |  1 +
 arch/riscv/mm/dma-noncoherent.c      | 61 ++++++++++++++++++++++
 9 files changed, 180 insertions(+), 2 deletions(-)
 create mode 100644 arch/riscv/mm/dma-noncoherent.c

Comments

Corentin Labbe April 15, 2022, 11:26 a.m. UTC | #1
On Mon, Mar 07, 2022 at 11:46:18PM +0100, Heiko Stuebner wrote:
> This series is based on the alternatives changes done in my svpbmt series
> and thus also depends on Atish's isa-extension parsing series.
> 
> It implements using the cache-management instructions from the  Zicbom-
> extension to handle cache flush, etc actions on platforms needing them.
> 
> SoCs using cpu cores from T-Head like the Allwinne D1 implement a
> different set of cache instructions. But while they are different,
> instructions they provide the same functionality, so a variant can
> easly hook into the existing alternatives mechanism on those.
> 
> 

Hello

I am testing https://github.com/smaeul/linux.git branch:origin/riscv/d1-wip, which contains this series.

I am hitting a buffer corruption problem with DMA.
The sun8i-ce crypto driver fails its self-tests with "device overran destination buffer".
In fact, the buffer is not overrun by the device but by the dma_map_single() operation.

The following small snippet shows the problem:

#define BSIZE 2048
#define DMASIZE 16

dma_addr_t dma;
u8 *buf;
int i;

/* Fill the whole buffer with a known pattern from the CPU side. */
buf = kmalloc(BSIZE, GFP_KERNEL | GFP_DMA);
for (i = 0; i < BSIZE; i++)
    buf[i] = 0xFE;
print_hex_dump(KERN_INFO, "DMATEST1:", DUMP_PREFIX_NONE, 16, 4, buf, 256, false);
/* Map and unmap only the first DMASIZE bytes; no device transfer happens. */
dma = dma_map_single(ce->dev, buf, DMASIZE, DMA_FROM_DEVICE);
dma_unmap_single(ce->dev, dma, DMASIZE, DMA_FROM_DEVICE);
print_hex_dump(KERN_INFO, "DMATEST3:", DUMP_PREFIX_NONE, 16, 4, buf, 256, false);

This leads to:
[    2.960040] DMATEST1:fefefefe fefefefe fefefefe fefefefe
[    2.965354] DMATEST1:fefefefe fefefefe fefefefe fefefefe
[    2.970709] DMATEST1:fefefefe fefefefe fefefefe fefefefe
[    2.976069] DMATEST1:fefefefe fefefefe fefefefe fefefefe
[    2.981440] DMATEST1:fefefefe fefefefe fefefefe fefefefe
[    2.986814] DMATEST1:fefefefe fefefefe fefefefe fefefefe
[    2.992188] DMATEST1:fefefefe fefefefe fefefefe fefefefe
[    2.997560] DMATEST1:fefefefe fefefefe fefefefe fefefefe
[    3.002934] DMATEST1:fefefefe fefefefe fefefefe fefefefe
[    3.008307] DMATEST1:fefefefe fefefefe fefefefe fefefefe
[    3.013680] DMATEST1:fefefefe fefefefe fefefefe fefefefe
[    3.019054] DMATEST1:fefefefe fefefefe fefefefe fefefefe
[    3.024427] DMATEST1:fefefefe fefefefe fefefefe fefefefe
[    3.029802] DMATEST1:fefefefe fefefefe fefefefe fefefefe
[    3.035175] DMATEST1:fefefefe fefefefe fefefefe fefefefe
[    3.040546] DMATEST1:fefefefe fefefefe fefefefe fefefefe
[    3.401647] DMATEST3:a9c3a9c3 a9c3a9c3 a9c3a9c3 a9c3a9c3
[    3.406982] DMATEST3:a9c3a9c3 a9c3a9c3 a9c3a9c3 a9c3a9c3
[    3.412350] DMATEST3:a9c3a9c3 a9c3a9c3 a9c3a9c3 a9c3a9c3
[    3.417720] DMATEST3:a9c3a9c3 a9c3a9c3 a9c3a9c3 a9c3a9c3
[    3.423094] DMATEST3:fefefefe fefefefe fefefefe fefefefe
[    3.428468] DMATEST3:fefefefe fefefefe fefefefe fefefefe
[    3.433841] DMATEST3:fefefefe fefefefe fefefefe fefefefe
[    3.439213] DMATEST3:fefefefe fefefefe fefefefe fefefefe
[    3.444588] DMATEST3:fefefefe fefefefe fefefefe fefefefe
[    3.449962] DMATEST3:fefefefe fefefefe fefefefe fefefefe
[    3.455334] DMATEST3:fefefefe fefefefe fefefefe fefefefe
[    3.460707] DMATEST3:fefefefe fefefefe fefefefe fefefefe
[    3.466081] DMATEST3:fefefefe fefefefe fefefefe fefefefe
[    3.471454] DMATEST3:fefefefe fefefefe fefefefe fefefefe
[    3.476828] DMATEST3:fefefefe fefefefe fefefefe fefefefe
[    3.482200] DMATEST3:fefefefe fefefefe fefefefe fefefefe

Even though no DMA transfer takes place, the first 64 bytes of the buffer are corrupted.

Regards
Samuel Holland April 16, 2022, 2:19 a.m. UTC | #2
On 4/15/22 6:26 AM, Corentin Labbe wrote:
> Le Mon, Mar 07, 2022 at 11:46:18PM +0100, Heiko Stuebner a écrit :
>> This series is based on the alternatives changes done in my svpbmt series
>> and thus also depends on Atish's isa-extension parsing series.
>>
>> It implements using the cache-management instructions from the  Zicbom-
>> extension to handle cache flush, etc actions on platforms needing them.
>>
>> SoCs using cpu cores from T-Head like the Allwinne D1 implement a
>> different set of cache instructions. But while they are different,
>> instructions they provide the same functionality, so a variant can
>> easly hook into the existing alternatives mechanism on those.
>>
>>
> 
> Hello
> 
> I am testing https://github.com/smaeul/linux.git branch:origin/riscv/d1-wip which contain this serie.
> 
> I am hitting a buffer corruption problem with DMA.
> The sun8i-ce crypto driver fail self tests due to "device overran destination buffer".
> In fact the buffer is not overran by device but by dma_map_single() operation.
> 
> The following small code show the problem:
> 
> dma_addr_t dma;
> u8 *buf;
> #define BSIZE 2048
> #define DMASIZE 16
> 
> buf = kmalloc(BSIZE, GFP_KERNEL | GFP_DMA);
> for (i = 0; i < BSIZE; i++)
>     buf[i] = 0xFE;
> print_hex_dump(KERN_INFO, "DMATEST1:", DUMP_PREFIX_NONE, 16, 4, buf, 256, false);
> dma = dma_map_single(ce->dev, buf, DMASIZE, DMA_FROM_DEVICE);

This function (through dma_direct_map_page()) ends up calling
arch_sync_dma_for_device(..., ..., DMA_FROM_DEVICE), which invalidates the CPU's
cache. This is the same thing other architectures do (at least arm, arm64,
openrisc, and powerpc). So this appears to be working as intended.

Regards,
Samuel

Corentin Labbe April 16, 2022, 7:35 a.m. UTC | #3
On Fri, Apr 15, 2022 at 09:19:23PM -0500, Samuel Holland wrote:
> On 4/15/22 6:26 AM, Corentin Labbe wrote:
> > Le Mon, Mar 07, 2022 at 11:46:18PM +0100, Heiko Stuebner a écrit :
> >> This series is based on the alternatives changes done in my svpbmt series
> >> and thus also depends on Atish's isa-extension parsing series.
> >>
> >> It implements using the cache-management instructions from the  Zicbom-
> >> extension to handle cache flush, etc actions on platforms needing them.
> >>
> >> SoCs using cpu cores from T-Head like the Allwinne D1 implement a
> >> different set of cache instructions. But while they are different,
> >> instructions they provide the same functionality, so a variant can
> >> easly hook into the existing alternatives mechanism on those.
> >>
> >>
> > 
> > Hello
> > 
> > I am testing https://github.com/smaeul/linux.git branch:origin/riscv/d1-wip which contain this serie.
> > 
> > I am hitting a buffer corruption problem with DMA.
> > The sun8i-ce crypto driver fail self tests due to "device overran destination buffer".
> > In fact the buffer is not overran by device but by dma_map_single() operation.
> > 
> > The following small code show the problem:
> > 
> > dma_addr_t dma;
> > u8 *buf;
> > #define BSIZE 2048
> > #define DMASIZE 16
> > 
> > buf = kmalloc(BSIZE, GFP_KERNEL | GFP_DMA);
> > for (i = 0; i < BSIZE; i++)
> >     buf[i] = 0xFE;
> > print_hex_dump(KERN_INFO, "DMATEST1:", DUMP_PREFIX_NONE, 16, 4, buf, 256, false);
> > dma = dma_map_single(ce->dev, buf, DMASIZE, DMA_FROM_DEVICE);
> 
> This function (through dma_direct_map_page()) ends up calling
> arch_sync_dma_for_device(..., ..., DMA_FROM_DEVICE), which invalidates the CPU's
> cache. This is the same thing other architectures do (at least arm, arm64,
> openrisc, and powerpc). So this appears to be working as intended.
> 
> Regards,
> Samuel
> 

This behaviour is not present on ARM and ARM64, at least.
The sample code I provided does not corrupt the buffer on them.

Regards
Samuel Holland April 16, 2022, 5:47 p.m. UTC | #4
On 4/16/22 2:35 AM, Corentin Labbe wrote:
> Le Fri, Apr 15, 2022 at 09:19:23PM -0500, Samuel Holland a écrit :
>> On 4/15/22 6:26 AM, Corentin Labbe wrote:
>>> Le Mon, Mar 07, 2022 at 11:46:18PM +0100, Heiko Stuebner a écrit :
>>>> This series is based on the alternatives changes done in my svpbmt series
>>>> and thus also depends on Atish's isa-extension parsing series.
>>>>
>>>> It implements using the cache-management instructions from the  Zicbom-
>>>> extension to handle cache flush, etc actions on platforms needing them.
>>>>
>>>> SoCs using cpu cores from T-Head like the Allwinne D1 implement a
>>>> different set of cache instructions. But while they are different,
>>>> instructions they provide the same functionality, so a variant can
>>>> easly hook into the existing alternatives mechanism on those.
>>>>
>>>>
>>>
>>> Hello
>>>
>>> I am testing https://github.com/smaeul/linux.git branch:origin/riscv/d1-wip which contain this serie.
>>>
>>> I am hitting a buffer corruption problem with DMA.
>>> The sun8i-ce crypto driver fail self tests due to "device overran destination buffer".
>>> In fact the buffer is not overran by device but by dma_map_single() operation.
>>>
>>> The following small code show the problem:
>>>
>>> dma_addr_t dma;
>>> u8 *buf;
>>> #define BSIZE 2048
>>> #define DMASIZE 16
>>>
>>> buf = kmalloc(BSIZE, GFP_KERNEL | GFP_DMA);
>>> for (i = 0; i < BSIZE; i++)
>>>     buf[i] = 0xFE;
>>> print_hex_dump(KERN_INFO, "DMATEST1:", DUMP_PREFIX_NONE, 16, 4, buf, 256, false);
>>> dma = dma_map_single(ce->dev, buf, DMASIZE, DMA_FROM_DEVICE);
>>
>> This function (through dma_direct_map_page()) ends up calling
>> arch_sync_dma_for_device(..., ..., DMA_FROM_DEVICE), which invalidates the CPU's
>> cache. This is the same thing other architectures do (at least arm, arm64,
>> openrisc, and powerpc). So this appears to be working as intended.
> 
> This behavour is not present at least on ARM and ARM64.
> The sample code I provided does not corrupt the buffer on them.

That can be explained by the 0xFE bytes having been flushed to DRAM already in
your ARM/ARM64 tests, whereas in your riscv64 case, the 0xFE bytes were still in
a dirty cache line. The cache topology and implementation are totally different
across the SoCs, so this is not too surprising.

Semantically, dma_map_single(..., DMA_FROM_DEVICE) means you are doing a
unidirectional DMA transfer from the device into that buffer. So the contents of
the buffer are "undefined" until the DMA transfer completes. If you are also
writing data into the buffer from the CPU side, then you need DMA_BIDIRECTIONAL.

Regards,
Samuel
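
The difference between invalidating and flushing a dirty line can be
illustrated with a toy model. This is only a sketch under stated assumptions,
not kernel code: struct toy_line and the toy_* helpers are hypothetical names,
and the 64-byte line size is assumed. Invalidating discards the CPU's
not-yet-written-back 0xFE bytes, so a later read sees whatever was in DRAM,
while a flush writes them back first.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 64 /* assumed cache line size for this toy model */

/* One write-back cache line shadowing LINE_SIZE bytes of "DRAM". */
struct toy_line {
	uint8_t dram[LINE_SIZE];  /* what a DMA device would read or write */
	uint8_t cache[LINE_SIZE]; /* what the CPU sees while the line is valid */
	bool valid;
	bool dirty;
};

/* CPU store: allocate the line if needed, then dirty it. */
static void toy_cpu_fill(struct toy_line *l, uint8_t value)
{
	if (!l->valid) {
		memcpy(l->cache, l->dram, LINE_SIZE);
		l->valid = true;
	}
	memset(l->cache, value, LINE_SIZE);
	l->dirty = true;
}

/* CPU load: refill from DRAM when the line is not cached. */
static uint8_t toy_cpu_read(struct toy_line *l, size_t off)
{
	if (!l->valid) {
		memcpy(l->cache, l->dram, LINE_SIZE);
		l->valid = true;
		l->dirty = false;
	}
	return l->cache[off];
}

/* CLEAN: write dirty data back to DRAM, keep the line cached. */
static void toy_clean(struct toy_line *l)
{
	if (l->valid && l->dirty) {
		memcpy(l->dram, l->cache, LINE_SIZE);
		l->dirty = false;
	}
}

/* INVAL: drop the line without writing it back; dirty data is lost. */
static void toy_inval(struct toy_line *l)
{
	l->valid = false;
	l->dirty = false;
}

/* FLUSH: CLEAN followed by INVAL, so no CPU-written data is lost. */
static void toy_flush(struct toy_line *l)
{
	toy_clean(l);
	toy_inval(l);
}
```

In this model, filling a line with 0xFE and then calling toy_inval() makes
toy_cpu_read() return the stale DRAM contents, whereas toy_flush() preserves
the 0xFE bytes. That matches both the corrupted dump and the reason the same
test can appear to pass when the data has already been written back.
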
Corentin Labbe April 16, 2022, 7:32 p.m. UTC | #5
On Sat, Apr 16, 2022 at 12:47:29PM -0500, Samuel Holland wrote:
> On 4/16/22 2:35 AM, Corentin Labbe wrote:
> > Le Fri, Apr 15, 2022 at 09:19:23PM -0500, Samuel Holland a écrit :
> >> On 4/15/22 6:26 AM, Corentin Labbe wrote:
> >>> Le Mon, Mar 07, 2022 at 11:46:18PM +0100, Heiko Stuebner a écrit :
> >>>> This series is based on the alternatives changes done in my svpbmt series
> >>>> and thus also depends on Atish's isa-extension parsing series.
> >>>>
> >>>> It implements using the cache-management instructions from the  Zicbom-
> >>>> extension to handle cache flush, etc actions on platforms needing them.
> >>>>
> >>>> SoCs using cpu cores from T-Head like the Allwinne D1 implement a
> >>>> different set of cache instructions. But while they are different,
> >>>> instructions they provide the same functionality, so a variant can
> >>>> easly hook into the existing alternatives mechanism on those.
> >>>>
> >>>>
> >>>
> >>> Hello
> >>>
> >>> I am testing https://github.com/smaeul/linux.git branch:origin/riscv/d1-wip which contain this serie.
> >>>
> >>> I am hitting a buffer corruption problem with DMA.
> >>> The sun8i-ce crypto driver fail self tests due to "device overran destination buffer".
> >>> In fact the buffer is not overran by device but by dma_map_single() operation.
> >>>
> >>> The following small code show the problem:
> >>>
> >>> dma_addr_t dma;
> >>> u8 *buf;
> >>> #define BSIZE 2048
> >>> #define DMASIZE 16
> >>>
> >>> buf = kmalloc(BSIZE, GFP_KERNEL | GFP_DMA);
> >>> for (i = 0; i < BSIZE; i++)
> >>>     buf[i] = 0xFE;
> >>> print_hex_dump(KERN_INFO, "DMATEST1:", DUMP_PREFIX_NONE, 16, 4, buf, 256, false);
> >>> dma = dma_map_single(ce->dev, buf, DMASIZE, DMA_FROM_DEVICE);
> >>
> >> This function (through dma_direct_map_page()) ends up calling
> >> arch_sync_dma_for_device(..., ..., DMA_FROM_DEVICE), which invalidates the CPU's
> >> cache. This is the same thing other architectures do (at least arm, arm64,
> >> openrisc, and powerpc). So this appears to be working as intended.
> > 
> > This behavour is not present at least on ARM and ARM64.
> > The sample code I provided does not corrupt the buffer on them.
> 
> That can be explained by the 0xFE bytes having been flushed to DRAM already in
> your ARM/ARM64 tests, whereas in your riscv64 case, the 0xFE bytes were still in
> a dirty cache line. The cache topology and implementation is totally different
> across the SoCs, so this is not too surprising.
> 
> Semantically, dma_map_single(..., DMA_FROM_DEVICE) means you are doing a
> unidirectional DMA transfer from the device into that buffer. So the contents of
> the buffer are "undefined" until the DMA transfer completes. If you are also
> writing data into the buffer from the CPU side, then you need DMA_BIDIRECTIONAL.
> 
> Regards,
> Samuel

+CC crypto mailing list + maintainer

My problem is that the crypto selftest, for each buffer on which I need to do a cipher operation,
appends a poison buffer to check that the device does not write beyond the buffer.

But dma_map_sg(FROM_DEVICE) corrupts this poison buffer, and the crypto selftests fail, thinking my device did a buffer overrun.

So do you mean that on the D1 SoC, this crypto API check strategy is impossible?
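
Why a 16-byte mapping can touch a neighbouring poison area follows from cache
maintenance operating on whole cache lines. The sketch below is an
illustration only: the helper names are hypothetical, and the 64-byte line
size is an assumption that happens to agree with the four 16-byte rows that
changed in the dump earlier in the thread. It computes the line-aligned range
covered by an operation on [addr, addr + size) and how many bytes outside the
requested range share those lines.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u /* assumed line size, not read from hardware */

struct cmo_span {
	uintptr_t start; /* first byte of the first affected line */
	uintptr_t end;   /* one past the last byte of the last affected line */
};

/* Round the requested range outwards to cache-line boundaries. */
static struct cmo_span cmo_span_for(uintptr_t addr, size_t size)
{
	const uintptr_t mask = ~(uintptr_t)(CACHE_LINE - 1);
	struct cmo_span s;

	s.start = addr & mask;
	s.end = (addr + size + CACHE_LINE - 1) & mask;
	return s;
}

/* Bytes outside [addr, addr + size) that live in the affected lines. */
static size_t cmo_collateral_bytes(uintptr_t addr, size_t size)
{
	struct cmo_span s = cmo_span_for(addr, size);

	return (size_t)(s.end - s.start) - size;
}
```

With these assumptions, a 16-byte buffer at the start of a line shares its
line with 48 further bytes, which is where an appended poison pattern would
sit. A buffer that starts on a line boundary and spans a whole number of lines
has no such collateral bytes.
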
Guo Ren April 17, 2022, 2:17 a.m. UTC | #6
On Sun, Apr 17, 2022 at 3:32 AM Corentin Labbe
<clabbe.montjoie@gmail.com> wrote:
>
> Le Sat, Apr 16, 2022 at 12:47:29PM -0500, Samuel Holland a écrit :
> > On 4/16/22 2:35 AM, Corentin Labbe wrote:
> > > Le Fri, Apr 15, 2022 at 09:19:23PM -0500, Samuel Holland a écrit :
> > >> On 4/15/22 6:26 AM, Corentin Labbe wrote:
> > >>> Le Mon, Mar 07, 2022 at 11:46:18PM +0100, Heiko Stuebner a écrit :
> > >>>> This series is based on the alternatives changes done in my svpbmt series
> > >>>> and thus also depends on Atish's isa-extension parsing series.
> > >>>>
> > >>>> It implements using the cache-management instructions from the  Zicbom-
> > >>>> extension to handle cache flush, etc actions on platforms needing them.
> > >>>>
> > >>>> SoCs using cpu cores from T-Head like the Allwinne D1 implement a
> > >>>> different set of cache instructions. But while they are different,
> > >>>> instructions they provide the same functionality, so a variant can
> > >>>> easly hook into the existing alternatives mechanism on those.
> > >>>>
> > >>>>
> > >>>
> > >>> Hello
> > >>>
> > >>> I am testing https://github.com/smaeul/linux.git branch:origin/riscv/d1-wip which contain this serie.
> > >>>
> > >>> I am hitting a buffer corruption problem with DMA.
> > >>> The sun8i-ce crypto driver fail self tests due to "device overran destination buffer".
> > >>> In fact the buffer is not overran by device but by dma_map_single() operation.
> > >>>
> > >>> The following small code show the problem:
> > >>>
> > >>> dma_addr_t dma;
> > >>> u8 *buf;
> > >>> #define BSIZE 2048
> > >>> #define DMASIZE 16
> > >>>
> > >>> buf = kmalloc(BSIZE, GFP_KERNEL | GFP_DMA);
> > >>> for (i = 0; i < BSIZE; i++)
> > >>>     buf[i] = 0xFE;
> > >>> print_hex_dump(KERN_INFO, "DMATEST1:", DUMP_PREFIX_NONE, 16, 4, buf, 256, false);
> > >>> dma = dma_map_single(ce->dev, buf, DMASIZE, DMA_FROM_DEVICE);
> > >>
> > >> This function (through dma_direct_map_page()) ends up calling
> > >> arch_sync_dma_for_device(..., ..., DMA_FROM_DEVICE), which invalidates the CPU's
> > >> cache. This is the same thing other architectures do (at least arm, arm64,
> > >> openrisc, and powerpc). So this appears to be working as intended.
> > >
> > > This behavour is not present at least on ARM and ARM64.
> > > The sample code I provided does not corrupt the buffer on them.
> >
> > That can be explained by the 0xFE bytes having been flushed to DRAM already in
> > your ARM/ARM64 tests, whereas in your riscv64 case, the 0xFE bytes were still in
> > a dirty cache line. The cache topology and implementation is totally different
> > across the SoCs, so this is not too surprising.
> >
> > Semantically, dma_map_single(..., DMA_FROM_DEVICE) means you are doing a
> > unidirectional DMA transfer from the device into that buffer. So the contents of
> > the buffer are "undefined" until the DMA transfer completes. If you are also
> > writing data into the buffer from the CPU side, then you need DMA_BIDIRECTIONAL.
> >
> > Regards,
> > Samuel
>
> +CC crypto mailing list + maintainer
>
> My problem is that crypto selftest, for each buffer where I need to do a cipher operation,
> concat a poison buffer to check that device does write beyond buffer.
>
> But the dma_map_sg(FROM_DEVICE) corrupts this poison buffer and crypto selftests fails thinking my device did a buffer overrun.
>
> So you mean that on SoC D1, this crypto API check strategy is impossible ?

I think you could try replacing all CLEAN & INVAL ops with FLUSH ops
for testing. (All cache-block-aligned data coming from the device to the
CPU should be invalidated.)

+void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
+		enum dma_data_direction dir)
+{
+	switch (dir) {
+	case DMA_TO_DEVICE:
+		ALT_CMO_OP(CLEAN, (unsigned long)phys_to_virt(paddr), size);
+		break;
+	case DMA_FROM_DEVICE:
+		ALT_CMO_OP(INVAL, (unsigned long)phys_to_virt(paddr), size);
+		break;
+	case DMA_BIDIRECTIONAL:
+		ALT_CMO_OP(FLUSH, (unsigned long)phys_to_virt(paddr), size);
+		break;
+	default:
+		break;
+	}
+}
+
+void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
+		enum dma_data_direction dir)
+{
+	switch (dir) {
+	case DMA_TO_DEVICE:
+		break;
+	case DMA_FROM_DEVICE:
+	case DMA_BIDIRECTIONAL:
+		ALT_CMO_OP(INVAL, (unsigned long)phys_to_virt(paddr), size);
+		break;
+	default:
+		break;
+	}
+}
Corentin Labbe April 17, 2022, 8:45 a.m. UTC | #7
On Sun, Apr 17, 2022 at 10:17:34AM +0800, Guo Ren wrote:
> On Sun, Apr 17, 2022 at 3:32 AM Corentin Labbe
> <clabbe.montjoie@gmail.com> wrote:
> >
> > Le Sat, Apr 16, 2022 at 12:47:29PM -0500, Samuel Holland a écrit :
> > > On 4/16/22 2:35 AM, Corentin Labbe wrote:
> > > > Le Fri, Apr 15, 2022 at 09:19:23PM -0500, Samuel Holland a écrit :
> > > >> On 4/15/22 6:26 AM, Corentin Labbe wrote:
> > > >>> Le Mon, Mar 07, 2022 at 11:46:18PM +0100, Heiko Stuebner a écrit :
> > > >>>> This series is based on the alternatives changes done in my svpbmt series
> > > >>>> and thus also depends on Atish's isa-extension parsing series.
> > > >>>>
> > > >>>> It implements using the cache-management instructions from the  Zicbom-
> > > >>>> extension to handle cache flush, etc actions on platforms needing them.
> > > >>>>
> > > >>>> SoCs using cpu cores from T-Head like the Allwinne D1 implement a
> > > >>>> different set of cache instructions. But while they are different,
> > > >>>> instructions they provide the same functionality, so a variant can
> > > >>>> easly hook into the existing alternatives mechanism on those.
> > > >>>>
> > > >>>>
> > > >>>
> > > >>> Hello
> > > >>>
> > > >>> I am testing https://github.com/smaeul/linux.git branch:origin/riscv/d1-wip which contain this serie.
> > > >>>
> > > >>> I am hitting a buffer corruption problem with DMA.
> > > >>> The sun8i-ce crypto driver fail self tests due to "device overran destination buffer".
> > > >>> In fact the buffer is not overran by device but by dma_map_single() operation.
> > > >>>
> > > >>> The following small code show the problem:
> > > >>>
> > > >>> dma_addr_t dma;
> > > >>> u8 *buf;
> > > >>> #define BSIZE 2048
> > > >>> #define DMASIZE 16
> > > >>>
> > > >>> buf = kmalloc(BSIZE, GFP_KERNEL | GFP_DMA);
> > > >>> for (i = 0; i < BSIZE; i++)
> > > >>>     buf[i] = 0xFE;
> > > >>> print_hex_dump(KERN_INFO, "DMATEST1:", DUMP_PREFIX_NONE, 16, 4, buf, 256, false);
> > > >>> dma = dma_map_single(ce->dev, buf, DMASIZE, DMA_FROM_DEVICE);
> > > >>
> > > >> This function (through dma_direct_map_page()) ends up calling
> > > >> arch_sync_dma_for_device(..., ..., DMA_FROM_DEVICE), which invalidates the CPU's
> > > >> cache. This is the same thing other architectures do (at least arm, arm64,
> > > >> openrisc, and powerpc). So this appears to be working as intended.
> > > >
> > > > This behavour is not present at least on ARM and ARM64.
> > > > The sample code I provided does not corrupt the buffer on them.
> > >
> > > That can be explained by the 0xFE bytes having been flushed to DRAM already in
> > > your ARM/ARM64 tests, whereas in your riscv64 case, the 0xFE bytes were still in
> > > a dirty cache line. The cache topology and implementation is totally different
> > > across the SoCs, so this is not too surprising.
> > >
> > > Semantically, dma_map_single(..., DMA_FROM_DEVICE) means you are doing a
> > > unidirectional DMA transfer from the device into that buffer. So the contents of
> > > the buffer are "undefined" until the DMA transfer completes. If you are also
> > > writing data into the buffer from the CPU side, then you need DMA_BIDIRECTIONAL.
> > >
> > > Regards,
> > > Samuel
> >
> > +CC crypto mailing list + maintainer
> >
> > My problem is that crypto selftest, for each buffer where I need to do a cipher operation,
> > concat a poison buffer to check that device does write beyond buffer.
> >
> > But the dma_map_sg(FROM_DEVICE) corrupts this poison buffer and crypto selftests fails thinking my device did a buffer overrun.
> >
> > So you mean that on SoC D1, this crypto API check strategy is impossible ?
> 
> I think you could try to replace all CLEAN & INVAL ops with FLUSH ops
> for the testing. (All cache block-aligned data from the device for the
> CPU should be invalided.)
> 

With:
diff --git a/arch/riscv/mm/dma-noncoherent.c b/arch/riscv/mm/dma-noncoherent.c
index 2c124bcc1932..608483522e05 100644
--- a/arch/riscv/mm/dma-noncoherent.c
+++ b/arch/riscv/mm/dma-noncoherent.c
@@ -21,7 +21,7 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size, enum dma_data_dire
                ALT_CMO_OP(CLEAN, (unsigned long)phys_to_virt(paddr), size);
                break;
        case DMA_FROM_DEVICE:
-               ALT_CMO_OP(INVAL, (unsigned long)phys_to_virt(paddr), size);
+               ALT_CMO_OP(FLUSH, (unsigned long)phys_to_virt(paddr), size);
                break;
        case DMA_BIDIRECTIONAL:
                ALT_CMO_OP(FLUSH, (unsigned long)phys_to_virt(paddr), size);


The crypto self-tests pass and I no longer see any buffer corruption.

Thanks
Guo Ren April 17, 2022, 8:49 a.m. UTC | #8
On Sun, Apr 17, 2022 at 4:45 PM Corentin Labbe
<clabbe.montjoie@gmail.com> wrote:
>
> Le Sun, Apr 17, 2022 at 10:17:34AM +0800, Guo Ren a écrit :
> > On Sun, Apr 17, 2022 at 3:32 AM Corentin Labbe
> > <clabbe.montjoie@gmail.com> wrote:
> > >
> > > Le Sat, Apr 16, 2022 at 12:47:29PM -0500, Samuel Holland a écrit :
> > > > On 4/16/22 2:35 AM, Corentin Labbe wrote:
> > > > > Le Fri, Apr 15, 2022 at 09:19:23PM -0500, Samuel Holland a écrit :
> > > > >> On 4/15/22 6:26 AM, Corentin Labbe wrote:
> > > > >>> Le Mon, Mar 07, 2022 at 11:46:18PM +0100, Heiko Stuebner a écrit :
> > > > >>>> This series is based on the alternatives changes done in my svpbmt series
> > > > >>>> and thus also depends on Atish's isa-extension parsing series.
> > > > >>>>
> > > > >>>> It implements using the cache-management instructions from the  Zicbom-
> > > > >>>> extension to handle cache flush, etc actions on platforms needing them.
> > > > >>>>
> > > > >>>> SoCs using cpu cores from T-Head like the Allwinne D1 implement a
> > > > >>>> different set of cache instructions. But while they are different,
> > > > >>>> instructions they provide the same functionality, so a variant can
> > > > >>>> easly hook into the existing alternatives mechanism on those.
> > > > >>>>
> > > > >>>>
> > > > >>>
> > > > >>> Hello
> > > > >>>
> > > > >>> I am testing https://github.com/smaeul/linux.git branch:origin/riscv/d1-wip which contain this serie.
> > > > >>>
> > > > >>> I am hitting a buffer corruption problem with DMA.
> > > > >>> The sun8i-ce crypto driver fail self tests due to "device overran destination buffer".
> > > > >>> In fact the buffer is not overran by device but by dma_map_single() operation.
> > > > >>>
> > > > >>> The following small code show the problem:
> > > > >>>
> > > > >>> dma_addr_t dma;
> > > > >>> u8 *buf;
> > > > >>> #define BSIZE 2048
> > > > >>> #define DMASIZE 16
> > > > >>>
> > > > >>> buf = kmalloc(BSIZE, GFP_KERNEL | GFP_DMA);
> > > > >>> for (i = 0; i < BSIZE; i++)
> > > > >>>     buf[i] = 0xFE;
> > > > >>> print_hex_dump(KERN_INFO, "DMATEST1:", DUMP_PREFIX_NONE, 16, 4, buf, 256, false);
> > > > >>> dma = dma_map_single(ce->dev, buf, DMASIZE, DMA_FROM_DEVICE);
> > > > >>
> > > > >> This function (through dma_direct_map_page()) ends up calling
> > > > >> arch_sync_dma_for_device(..., ..., DMA_FROM_DEVICE), which invalidates the CPU's
> > > > >> cache. This is the same thing other architectures do (at least arm, arm64,
> > > > >> openrisc, and powerpc). So this appears to be working as intended.
> > > > >
> > > > > This behavour is not present at least on ARM and ARM64.
> > > > > The sample code I provided does not corrupt the buffer on them.
> > > >
> > > > That can be explained by the 0xFE bytes having been flushed to DRAM already in
> > > > your ARM/ARM64 tests, whereas in your riscv64 case, the 0xFE bytes were still in
> > > > a dirty cache line. The cache topology and implementation is totally different
> > > > across the SoCs, so this is not too surprising.
> > > >
> > > > Semantically, dma_map_single(..., DMA_FROM_DEVICE) means you are doing a
> > > > unidirectional DMA transfer from the device into that buffer. So the contents of
> > > > the buffer are "undefined" until the DMA transfer completes. If you are also
> > > > writing data into the buffer from the CPU side, then you need DMA_BIDIRECTIONAL.
> > > >
> > > > Regards,
> > > > Samuel
> > >
> > > +CC crypto mailing list + maintainer
> > >
> > > My problem is that crypto selftest, for each buffer where I need to do a cipher operation,
> > > concat a poison buffer to check that device does write beyond buffer.
> > >
> > > But the dma_map_sg(FROM_DEVICE) corrupts this poison buffer and crypto selftests fails thinking my device did a buffer overrun.
> > >
> > > So you mean that on SoC D1, this crypto API check strategy is impossible ?
> >
> > I think you could try to replace all CLEAN & INVAL ops with FLUSH ops
> > for the testing. (All cache block-aligned data from the device for the
> > CPU should be invalided.)
> >
>
> With:
> diff --git a/arch/riscv/mm/dma-noncoherent.c b/arch/riscv/mm/dma-noncoherent.c
> index 2c124bcc1932..608483522e05 100644
> --- a/arch/riscv/mm/dma-noncoherent.c
> +++ b/arch/riscv/mm/dma-noncoherent.c
> @@ -21,7 +21,7 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size, enum dma_data_dire
>                 ALT_CMO_OP(CLEAN, (unsigned long)phys_to_virt(paddr), size);
>                 break;
>         case DMA_FROM_DEVICE:
> -               ALT_CMO_OP(INVAL, (unsigned long)phys_to_virt(paddr), size);
> +               ALT_CMO_OP(FLUSH, (unsigned long)phys_to_virt(paddr), size);
>                 break;
>         case DMA_BIDIRECTIONAL:
>                 ALT_CMO_OP(FLUSH, (unsigned long)phys_to_virt(paddr), size);
>
>
> The crypto self test works and I got no more buffer corruption.
No, no... that is not a solution. It means your driver has a problem.
For DMA from the device, INVAL alone is enough.

>
> Thanks
Corentin Labbe April 17, 2022, 5:35 p.m. UTC | #9
On Sun, Apr 17, 2022 at 04:49:34PM +0800, Guo Ren wrote:
> On Sun, Apr 17, 2022 at 4:45 PM Corentin Labbe
> <clabbe.montjoie@gmail.com> wrote:
> >
> > Le Sun, Apr 17, 2022 at 10:17:34AM +0800, Guo Ren a écrit :
> > > On Sun, Apr 17, 2022 at 3:32 AM Corentin Labbe
> > > <clabbe.montjoie@gmail.com> wrote:
> > > >
> > > > Le Sat, Apr 16, 2022 at 12:47:29PM -0500, Samuel Holland a écrit :
> > > > > On 4/16/22 2:35 AM, Corentin Labbe wrote:
> > > > > > Le Fri, Apr 15, 2022 at 09:19:23PM -0500, Samuel Holland a écrit :
> > > > > >> On 4/15/22 6:26 AM, Corentin Labbe wrote:
> > > > > >>> Le Mon, Mar 07, 2022 at 11:46:18PM +0100, Heiko Stuebner a écrit :
> > > > > >>>> This series is based on the alternatives changes done in my svpbmt series
> > > > > >>>> and thus also depends on Atish's isa-extension parsing series.
> > > > > >>>>
> > > > > >>>> It implements using the cache-management instructions from the  Zicbom-
> > > > > >>>> extension to handle cache flush, etc actions on platforms needing them.
> > > > > >>>>
> > > > > >>>> SoCs using cpu cores from T-Head like the Allwinne D1 implement a
> > > > > >>>> different set of cache instructions. But while they are different,
> > > > > >>>> instructions they provide the same functionality, so a variant can
> > > > > >>>> easly hook into the existing alternatives mechanism on those.
> > > > > >>>>
> > > > > >>>>
> > > > > >>>
> > > > > >>> Hello
> > > > > >>>
> > > > > >>> I am testing https://github.com/smaeul/linux.git branch:origin/riscv/d1-wip which contain this serie.
> > > > > >>>
> > > > > >>> I am hitting a buffer corruption problem with DMA.
> > > > > >>> The sun8i-ce crypto driver fail self tests due to "device overran destination buffer".
> > > > > >>> In fact the buffer is not overran by device but by dma_map_single() operation.
> > > > > >>>
> > > > > >>> The following small code show the problem:
> > > > > >>>
> > > > > >>> dma_addr_t dma;
> > > > > >>> u8 *buf;
> > > > > >>> #define BSIZE 2048
> > > > > >>> #define DMASIZE 16
> > > > > >>>
> > > > > >>> buf = kmalloc(BSIZE, GFP_KERNEL | GFP_DMA);
> > > > > >>> for (i = 0; i < BSIZE; i++)
> > > > > >>>     buf[i] = 0xFE;
> > > > > >>> print_hex_dump(KERN_INFO, "DMATEST1:", DUMP_PREFIX_NONE, 16, 4, buf, 256, false);
> > > > > >>> dma = dma_map_single(ce->dev, buf, DMASIZE, DMA_FROM_DEVICE);
> > > > > >>
> > > > > >> This function (through dma_direct_map_page()) ends up calling
> > > > > >> arch_sync_dma_for_device(..., ..., DMA_FROM_DEVICE), which invalidates the CPU's
> > > > > >> cache. This is the same thing other architectures do (at least arm, arm64,
> > > > > >> openrisc, and powerpc). So this appears to be working as intended.
> > > > > >
> > > > > > This behavour is not present at least on ARM and ARM64.
> > > > > > The sample code I provided does not corrupt the buffer on them.
> > > > >
> > > > > That can be explained by the 0xFE bytes having been flushed to DRAM already in
> > > > > your ARM/ARM64 tests, whereas in your riscv64 case, the 0xFE bytes were still in
> > > > > a dirty cache line. The cache topology and implementation is totally different
> > > > > across the SoCs, so this is not too surprising.
> > > > >
> > > > > Semantically, dma_map_single(..., DMA_FROM_DEVICE) means you are doing a
> > > > > unidirectional DMA transfer from the device into that buffer. So the contents of
> > > > > the buffer are "undefined" until the DMA transfer completes. If you are also
> > > > > writing data into the buffer from the CPU side, then you need DMA_BIDIRECTIONAL.
> > > > >
> > > > > Regards,
> > > > > Samuel
> > > >
> > > > +CC crypto mailing list + maintainer
> > > >
> > > > My problem is that crypto selftest, for each buffer where I need to do a cipher operation,
> > > > concat a poison buffer to check that device does write beyond buffer.
> > > >
> > > > But the dma_map_sg(FROM_DEVICE) corrupts this poison buffer and crypto selftests fails thinking my device did a buffer overrun.
> > > >
> > > > So you mean that on SoC D1, this crypto API check strategy is impossible ?
> > >
> > > I think you could try to replace all CLEAN & INVAL ops with FLUSH ops
> > > for the testing. (All cache block-aligned data from the device for the
> > > CPU should be invalided.)
> > >
> >
> > With:
> > diff --git a/arch/riscv/mm/dma-noncoherent.c b/arch/riscv/mm/dma-noncoherent.c
> > index 2c124bcc1932..608483522e05 100644
> > --- a/arch/riscv/mm/dma-noncoherent.c
> > +++ b/arch/riscv/mm/dma-noncoherent.c
> > @@ -21,7 +21,7 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size, enum dma_data_dire
> >                 ALT_CMO_OP(CLEAN, (unsigned long)phys_to_virt(paddr), size);
> >                 break;
> >         case DMA_FROM_DEVICE:
> > -               ALT_CMO_OP(INVAL, (unsigned long)phys_to_virt(paddr), size);
> > +               ALT_CMO_OP(FLUSH, (unsigned long)phys_to_virt(paddr), size);
> >                 break;
> >         case DMA_BIDIRECTIONAL:
> >                 ALT_CMO_OP(FLUSH, (unsigned long)phys_to_virt(paddr), size);
> >
> >
> > The crypto self test works and I got no more buffer corruption.
> No, No ... it's not a solution. That means your driver has a problem.
> From device, we only need INVAL enough.
> 

As far as I can tell, my driver works fine and the problem comes from dma_map_sg(). I probably did not explain it well, so let me start again.

Example:
The crypto self test sends my driver an AES cipher operation of 16 bytes inside an SG, but the underlying buffer is larger (say 32 bytes for this example).
So the first 16 bytes are used by the SG and the last 16 bytes are a poison area (filled with 0xFE) used to check that the driver does not write beyond the 16 bytes of the operation (that is, beyond the SG length).

Doing the dma_map_sg(FROM_DEVICE) on the SG corrupts the whole buffer.
My driver writes the first 16 bytes via DMA as expected.
The crypto API then checks the last bytes, finds they are no longer 0xFE, and fails, believing my driver wrote beyond the first 16 bytes.

But even if I disable my hardware operation, the buffer is still corrupted. (See my sample code, which only does dma_map/dma_unmap.)

So the problem is that dma_map(FROM_DEVICE) changes the buffer content.

So if this behaviour is normal on the D1 SoC, how should the crypto self tests be fixed?
Guo Ren April 17, 2022, 10:50 p.m. UTC | #10
On Mon, Apr 18, 2022 at 1:35 AM Corentin Labbe
<clabbe.montjoie@gmail.com> wrote:
>
> Le Sun, Apr 17, 2022 at 04:49:34PM +0800, Guo Ren a écrit :
> > On Sun, Apr 17, 2022 at 4:45 PM Corentin Labbe
> > <clabbe.montjoie@gmail.com> wrote:
> > >
> > > Le Sun, Apr 17, 2022 at 10:17:34AM +0800, Guo Ren a écrit :
> > > > On Sun, Apr 17, 2022 at 3:32 AM Corentin Labbe
> > > > <clabbe.montjoie@gmail.com> wrote:
> > > > >
> > > > > Le Sat, Apr 16, 2022 at 12:47:29PM -0500, Samuel Holland a écrit :
> > > > > > On 4/16/22 2:35 AM, Corentin Labbe wrote:
> > > > > > > Le Fri, Apr 15, 2022 at 09:19:23PM -0500, Samuel Holland a écrit :
> > > > > > >> On 4/15/22 6:26 AM, Corentin Labbe wrote:
> > > > > > >>> Le Mon, Mar 07, 2022 at 11:46:18PM +0100, Heiko Stuebner a écrit :
> > > > > > >>>> This series is based on the alternatives changes done in my svpbmt series
> > > > > > >>>> and thus also depends on Atish's isa-extension parsing series.
> > > > > > >>>>
> > > > > > >>>> It implements using the cache-management instructions from the  Zicbom-
> > > > > > >>>> extension to handle cache flush, etc actions on platforms needing them.
> > > > > > >>>>
> > > > > > >>>> SoCs using cpu cores from T-Head like the Allwinne D1 implement a
> > > > > > >>>> different set of cache instructions. But while they are different,
> > > > > > >>>> instructions they provide the same functionality, so a variant can
> > > > > > >>>> easly hook into the existing alternatives mechanism on those.
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>
> > > > > > >>> Hello
> > > > > > >>>
> > > > > > >>> I am testing https://github.com/smaeul/linux.git branch:origin/riscv/d1-wip which contain this serie.
> > > > > > >>>
> > > > > > >>> I am hitting a buffer corruption problem with DMA.
> > > > > > >>> The sun8i-ce crypto driver fail self tests due to "device overran destination buffer".
> > > > > > >>> In fact the buffer is not overran by device but by dma_map_single() operation.
> > > > > > >>>
> > > > > > >>> The following small code show the problem:
> > > > > > >>>
> > > > > > >>> dma_addr_t dma;
> > > > > > >>> u8 *buf;
> > > > > > >>> #define BSIZE 2048
> > > > > > >>> #define DMASIZE 16
> > > > > > >>>
> > > > > > >>> buf = kmalloc(BSIZE, GFP_KERNEL | GFP_DMA);
> > > > > > >>> for (i = 0; i < BSIZE; i++)
> > > > > > >>>     buf[i] = 0xFE;
> > > > > > >>> print_hex_dump(KERN_INFO, "DMATEST1:", DUMP_PREFIX_NONE, 16, 4, buf, 256, false);
> > > > > > >>> dma = dma_map_single(ce->dev, buf, DMASIZE, DMA_FROM_DEVICE);
> > > > > > >>
> > > > > > >> This function (through dma_direct_map_page()) ends up calling
> > > > > > >> arch_sync_dma_for_device(..., ..., DMA_FROM_DEVICE), which invalidates the CPU's
> > > > > > >> cache. This is the same thing other architectures do (at least arm, arm64,
> > > > > > >> openrisc, and powerpc). So this appears to be working as intended.
> > > > > > >
> > > > > > > This behavour is not present at least on ARM and ARM64.
> > > > > > > The sample code I provided does not corrupt the buffer on them.
> > > > > >
> > > > > > That can be explained by the 0xFE bytes having been flushed to DRAM already in
> > > > > > your ARM/ARM64 tests, whereas in your riscv64 case, the 0xFE bytes were still in
> > > > > > a dirty cache line. The cache topology and implementation is totally different
> > > > > > across the SoCs, so this is not too surprising.
> > > > > >
> > > > > > Semantically, dma_map_single(..., DMA_FROM_DEVICE) means you are doing a
> > > > > > unidirectional DMA transfer from the device into that buffer. So the contents of
> > > > > > the buffer are "undefined" until the DMA transfer completes. If you are also
> > > > > > writing data into the buffer from the CPU side, then you need DMA_BIDIRECTIONAL.
> > > > > >
> > > > > > Regards,
> > > > > > Samuel
> > > > >
> > > > > +CC crypto mailing list + maintainer
> > > > >
> > > > > My problem is that crypto selftest, for each buffer where I need to do a cipher operation,
> > > > > concat a poison buffer to check that device does write beyond buffer.
> > > > >
> > > > > But the dma_map_sg(FROM_DEVICE) corrupts this poison buffer and crypto selftests fails thinking my device did a buffer overrun.
> > > > >
> > > > > So you mean that on SoC D1, this crypto API check strategy is impossible ?
> > > >
> > > > I think you could try to replace all CLEAN & INVAL ops with FLUSH ops
> > > > for the testing. (All cache block-aligned data from the device for the
> > > > CPU should be invalided.)
> > > >
> > >
> > > With:
> > > diff --git a/arch/riscv/mm/dma-noncoherent.c b/arch/riscv/mm/dma-noncoherent.c
> > > index 2c124bcc1932..608483522e05 100644
> > > --- a/arch/riscv/mm/dma-noncoherent.c
> > > +++ b/arch/riscv/mm/dma-noncoherent.c
> > > @@ -21,7 +21,7 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size, enum dma_data_dire
> > >                 ALT_CMO_OP(CLEAN, (unsigned long)phys_to_virt(paddr), size);
> > >                 break;
> > >         case DMA_FROM_DEVICE:
> > > -               ALT_CMO_OP(INVAL, (unsigned long)phys_to_virt(paddr), size);
> > > +               ALT_CMO_OP(FLUSH, (unsigned long)phys_to_virt(paddr), size);
> > >                 break;
> > >         case DMA_BIDIRECTIONAL:
> > >                 ALT_CMO_OP(FLUSH, (unsigned long)phys_to_virt(paddr), size);
> > >
> > >
> > > The crypto self test works and I got no more buffer corruption.
> > No, No ... it's not a solution. That means your driver has a problem.
> > From device, we only need INVAL enough.
> >
>
> For me, my driver works fine, the problem came from dma_map_sg(), probably I didnt explain right, I restart.
>
> Example:
> crypto self test send to my driver an AES cipher operation of 16 bytes inside a SG, but the original buffer is greater (said 32 for the example).
> So the first 16 bytes are used by the SG and the last 16 bytes are a poisoned buffer (with value 0xFE) to check driver do not write beyong the normal operation of 16 bytes (and beyond the SG length).
>
> Doing the dma_map_sg(FROM_DEVICE) on the SG corrupt the whole buffer.
> My driver write normally via DMA the first 16 bytes.
> Crypto API check the last bytes, no more 0xFE, so it fail believing my driver wrote beyond the first 16 bytes.
>
> But even If I disable my hardware operation, the buffer is still corrupted. (See my sample code which just do dma_map/dma_unmap)
>
> So the problem is the dma_map(FROM_DEVICE) which change buffer content.
>
> So if this behavour is normal on D1 SoC, how to fix the crypto self tests ?
Actually, FLUSH is safe in all cases, but it is more expensive. Can you
tell me which ARM SoC you are using, and which version of Linux is
running on it?
Philipp Tomsich April 18, 2022, 3:29 p.m. UTC | #11
On Sun, 17 Apr 2022 at 19:35, Corentin Labbe <clabbe.montjoie@gmail.com> wrote:
>
> Le Sun, Apr 17, 2022 at 04:49:34PM +0800, Guo Ren a écrit :
> > On Sun, Apr 17, 2022 at 4:45 PM Corentin Labbe
> > <clabbe.montjoie@gmail.com> wrote:
> > >
> > > Le Sun, Apr 17, 2022 at 10:17:34AM +0800, Guo Ren a écrit :
> > > > On Sun, Apr 17, 2022 at 3:32 AM Corentin Labbe
> > > > <clabbe.montjoie@gmail.com> wrote:
> > > > >
> > > > > Le Sat, Apr 16, 2022 at 12:47:29PM -0500, Samuel Holland a écrit :
> > > > > > On 4/16/22 2:35 AM, Corentin Labbe wrote:
> > > > > > > Le Fri, Apr 15, 2022 at 09:19:23PM -0500, Samuel Holland a écrit :
> > > > > > >> On 4/15/22 6:26 AM, Corentin Labbe wrote:
> > > > > > >>> Le Mon, Mar 07, 2022 at 11:46:18PM +0100, Heiko Stuebner a écrit :
> > > > > > >>>> This series is based on the alternatives changes done in my svpbmt series
> > > > > > >>>> and thus also depends on Atish's isa-extension parsing series.
> > > > > > >>>>
> > > > > > >>>> It implements using the cache-management instructions from the  Zicbom-
> > > > > > >>>> extension to handle cache flush, etc actions on platforms needing them.
> > > > > > >>>>
> > > > > > >>>> SoCs using cpu cores from T-Head like the Allwinne D1 implement a
> > > > > > >>>> different set of cache instructions. But while they are different,
> > > > > > >>>> instructions they provide the same functionality, so a variant can
> > > > > > >>>> easly hook into the existing alternatives mechanism on those.
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>
> > > > > > >>> Hello
> > > > > > >>>
> > > > > > >>> I am testing https://github.com/smaeul/linux.git branch:origin/riscv/d1-wip which contain this serie.
> > > > > > >>>
> > > > > > >>> I am hitting a buffer corruption problem with DMA.
> > > > > > >>> The sun8i-ce crypto driver fail self tests due to "device overran destination buffer".
> > > > > > >>> In fact the buffer is not overran by device but by dma_map_single() operation.
> > > > > > >>>
> > > > > > >>> The following small code show the problem:
> > > > > > >>>
> > > > > > >>> dma_addr_t dma;
> > > > > > >>> u8 *buf;
> > > > > > >>> #define BSIZE 2048
> > > > > > >>> #define DMASIZE 16
> > > > > > >>>
> > > > > > >>> buf = kmalloc(BSIZE, GFP_KERNEL | GFP_DMA);
> > > > > > >>> for (i = 0; i < BSIZE; i++)
> > > > > > >>>     buf[i] = 0xFE;
> > > > > > >>> print_hex_dump(KERN_INFO, "DMATEST1:", DUMP_PREFIX_NONE, 16, 4, buf, 256, false);
> > > > > > >>> dma = dma_map_single(ce->dev, buf, DMASIZE, DMA_FROM_DEVICE);
> > > > > > >>
> > > > > > >> This function (through dma_direct_map_page()) ends up calling
> > > > > > >> arch_sync_dma_for_device(..., ..., DMA_FROM_DEVICE), which invalidates the CPU's
> > > > > > >> cache. This is the same thing other architectures do (at least arm, arm64,
> > > > > > >> openrisc, and powerpc). So this appears to be working as intended.
> > > > > > >
> > > > > > > This behavour is not present at least on ARM and ARM64.
> > > > > > > The sample code I provided does not corrupt the buffer on them.
> > > > > >
> > > > > > That can be explained by the 0xFE bytes having been flushed to DRAM already in
> > > > > > your ARM/ARM64 tests, whereas in your riscv64 case, the 0xFE bytes were still in
> > > > > > a dirty cache line. The cache topology and implementation is totally different
> > > > > > across the SoCs, so this is not too surprising.
> > > > > >
> > > > > > Semantically, dma_map_single(..., DMA_FROM_DEVICE) means you are doing a
> > > > > > unidirectional DMA transfer from the device into that buffer. So the contents of
> > > > > > the buffer are "undefined" until the DMA transfer completes. If you are also
> > > > > > writing data into the buffer from the CPU side, then you need DMA_BIDIRECTIONAL.
> > > > > >
> > > > > > Regards,
> > > > > > Samuel
> > > > >
> > > > > +CC crypto mailing list + maintainer
> > > > >
> > > > > My problem is that crypto selftest, for each buffer where I need to do a cipher operation,
> > > > > concat a poison buffer to check that device does write beyond buffer.
> > > > >
> > > > > But the dma_map_sg(FROM_DEVICE) corrupts this poison buffer and crypto selftests fails thinking my device did a buffer overrun.
> > > > >
> > > > > So you mean that on SoC D1, this crypto API check strategy is impossible ?
> > > >
> > > > I think you could try to replace all CLEAN & INVAL ops with FLUSH ops
> > > > for the testing. (All cache block-aligned data from the device for the
> > > > CPU should be invalided.)
> > > >
> > >
> > > With:
> > > diff --git a/arch/riscv/mm/dma-noncoherent.c b/arch/riscv/mm/dma-noncoherent.c
> > > index 2c124bcc1932..608483522e05 100644
> > > --- a/arch/riscv/mm/dma-noncoherent.c
> > > +++ b/arch/riscv/mm/dma-noncoherent.c
> > > @@ -21,7 +21,7 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size, enum dma_data_dire
> > >                 ALT_CMO_OP(CLEAN, (unsigned long)phys_to_virt(paddr), size);
> > >                 break;
> > >         case DMA_FROM_DEVICE:
> > > -               ALT_CMO_OP(INVAL, (unsigned long)phys_to_virt(paddr), size);
> > > +               ALT_CMO_OP(FLUSH, (unsigned long)phys_to_virt(paddr), size);
> > >                 break;
> > >         case DMA_BIDIRECTIONAL:
> > >                 ALT_CMO_OP(FLUSH, (unsigned long)phys_to_virt(paddr), size);
> > >
> > >
> > > The crypto self test works and I got no more buffer corruption.
> > No, No ... it's not a solution. That means your driver has a problem.
> > From device, we only need INVAL enough.
> >
>
> For me, my driver works fine, the problem came from dma_map_sg(), probably I didnt explain right, I restart.
>
> Example:
> crypto self test send to my driver an AES cipher operation of 16 bytes inside a SG, but the original buffer is greater (said 32 for the example).
> So the first 16 bytes are used by the SG and the last 16 bytes are a poisoned buffer (with value 0xFE) to check driver do not write beyong the normal operation of 16 bytes (and beyond the SG length).
>
> Doing the dma_map_sg(FROM_DEVICE) on the SG corrupt the whole buffer.

Doesn't the DMA_FROM_DEVICE indicate that there are no expected writes
from the CPU to the buffer (and that any modifications to the
underlying cache line can be dropped via an invalidation)?
In other words: does the behavior change when mapping as
DMA_BIDIRECTIONAL? And should a map/unmap sequence be used where the
buffer is first mapped as DMA_TO_DEVICE when poisoning it and later
as DMA_FROM_DEVICE in normal operation?

Philipp.

> My driver write normally via DMA the first 16 bytes.
> Crypto API check the last bytes, no more 0xFE, so it fail believing my driver wrote beyond the first 16 bytes.
>
> But even If I disable my hardware operation, the buffer is still corrupted. (See my sample code which just do dma_map/dma_unmap)
>
> So the problem is the dma_map(FROM_DEVICE) which change buffer content.
>
> So if this behavour is normal on D1 SoC, how to fix the crypto self tests ?
Corentin Labbe April 19, 2022, 7:44 a.m. UTC | #12
On Mon, Apr 18, 2022 at 06:50:57AM +0800, Guo Ren wrote:
> On Mon, Apr 18, 2022 at 1:35 AM Corentin Labbe
> <clabbe.montjoie@gmail.com> wrote:
> >
> > Le Sun, Apr 17, 2022 at 04:49:34PM +0800, Guo Ren a écrit :
> > > On Sun, Apr 17, 2022 at 4:45 PM Corentin Labbe
> > > <clabbe.montjoie@gmail.com> wrote:
> > > >
> > > > Le Sun, Apr 17, 2022 at 10:17:34AM +0800, Guo Ren a écrit :
> > > > > On Sun, Apr 17, 2022 at 3:32 AM Corentin Labbe
> > > > > <clabbe.montjoie@gmail.com> wrote:
> > > > > >
> > > > > > Le Sat, Apr 16, 2022 at 12:47:29PM -0500, Samuel Holland a écrit :
> > > > > > > On 4/16/22 2:35 AM, Corentin Labbe wrote:
> > > > > > > > Le Fri, Apr 15, 2022 at 09:19:23PM -0500, Samuel Holland a écrit :
> > > > > > > >> On 4/15/22 6:26 AM, Corentin Labbe wrote:
> > > > > > > >>> Le Mon, Mar 07, 2022 at 11:46:18PM +0100, Heiko Stuebner a écrit :
> > > > > > > >>>> This series is based on the alternatives changes done in my svpbmt series
> > > > > > > >>>> and thus also depends on Atish's isa-extension parsing series.
> > > > > > > >>>>
> > > > > > > >>>> It implements using the cache-management instructions from the  Zicbom-
> > > > > > > >>>> extension to handle cache flush, etc actions on platforms needing them.
> > > > > > > >>>>
> > > > > > > >>>> SoCs using cpu cores from T-Head like the Allwinne D1 implement a
> > > > > > > >>>> different set of cache instructions. But while they are different,
> > > > > > > >>>> instructions they provide the same functionality, so a variant can
> > > > > > > >>>> easly hook into the existing alternatives mechanism on those.
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>
> > > > > > > >>> Hello
> > > > > > > >>>
> > > > > > > >>> I am testing https://github.com/smaeul/linux.git branch:origin/riscv/d1-wip which contain this serie.
> > > > > > > >>>
> > > > > > > >>> I am hitting a buffer corruption problem with DMA.
> > > > > > > >>> The sun8i-ce crypto driver fail self tests due to "device overran destination buffer".
> > > > > > > >>> In fact the buffer is not overran by device but by dma_map_single() operation.
> > > > > > > >>>
> > > > > > > >>> The following small code show the problem:
> > > > > > > >>>
> > > > > > > >>> dma_addr_t dma;
> > > > > > > >>> u8 *buf;
> > > > > > > >>> #define BSIZE 2048
> > > > > > > >>> #define DMASIZE 16
> > > > > > > >>>
> > > > > > > >>> buf = kmalloc(BSIZE, GFP_KERNEL | GFP_DMA);
> > > > > > > >>> for (i = 0; i < BSIZE; i++)
> > > > > > > >>>     buf[i] = 0xFE;
> > > > > > > >>> print_hex_dump(KERN_INFO, "DMATEST1:", DUMP_PREFIX_NONE, 16, 4, buf, 256, false);
> > > > > > > >>> dma = dma_map_single(ce->dev, buf, DMASIZE, DMA_FROM_DEVICE);
> > > > > > > >>
> > > > > > > >> This function (through dma_direct_map_page()) ends up calling
> > > > > > > >> arch_sync_dma_for_device(..., ..., DMA_FROM_DEVICE), which invalidates the CPU's
> > > > > > > >> cache. This is the same thing other architectures do (at least arm, arm64,
> > > > > > > >> openrisc, and powerpc). So this appears to be working as intended.
> > > > > > > >
> > > > > > > > This behavour is not present at least on ARM and ARM64.
> > > > > > > > The sample code I provided does not corrupt the buffer on them.
> > > > > > >
> > > > > > > That can be explained by the 0xFE bytes having been flushed to DRAM already in
> > > > > > > your ARM/ARM64 tests, whereas in your riscv64 case, the 0xFE bytes were still in
> > > > > > > a dirty cache line. The cache topology and implementation is totally different
> > > > > > > across the SoCs, so this is not too surprising.
> > > > > > >
> > > > > > > Semantically, dma_map_single(..., DMA_FROM_DEVICE) means you are doing a
> > > > > > > unidirectional DMA transfer from the device into that buffer. So the contents of
> > > > > > > the buffer are "undefined" until the DMA transfer completes. If you are also
> > > > > > > writing data into the buffer from the CPU side, then you need DMA_BIDIRECTIONAL.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Samuel
> > > > > >
> > > > > > +CC crypto mailing list + maintainer
> > > > > >
> > > > > > My problem is that crypto selftest, for each buffer where I need to do a cipher operation,
> > > > > > concat a poison buffer to check that device does write beyond buffer.
> > > > > >
> > > > > > But the dma_map_sg(FROM_DEVICE) corrupts this poison buffer and crypto selftests fails thinking my device did a buffer overrun.
> > > > > >
> > > > > > So you mean that on SoC D1, this crypto API check strategy is impossible ?
> > > > >
> > > > > I think you could try to replace all CLEAN & INVAL ops with FLUSH ops
> > > > > for the testing. (All cache block-aligned data from the device for the
> > > > > CPU should be invalided.)
> > > > >
> > > >
> > > > With:
> > > > diff --git a/arch/riscv/mm/dma-noncoherent.c b/arch/riscv/mm/dma-noncoherent.c
> > > > index 2c124bcc1932..608483522e05 100644
> > > > --- a/arch/riscv/mm/dma-noncoherent.c
> > > > +++ b/arch/riscv/mm/dma-noncoherent.c
> > > > @@ -21,7 +21,7 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size, enum dma_data_dire
> > > >                 ALT_CMO_OP(CLEAN, (unsigned long)phys_to_virt(paddr), size);
> > > >                 break;
> > > >         case DMA_FROM_DEVICE:
> > > > -               ALT_CMO_OP(INVAL, (unsigned long)phys_to_virt(paddr), size);
> > > > +               ALT_CMO_OP(FLUSH, (unsigned long)phys_to_virt(paddr), size);
> > > >                 break;
> > > >         case DMA_BIDIRECTIONAL:
> > > >                 ALT_CMO_OP(FLUSH, (unsigned long)phys_to_virt(paddr), size);
> > > >
> > > >
> > > > The crypto self test works and I got no more buffer corruption.
> > > No, No ... it's not a solution. That means your driver has a problem.
> > > From device, we only need INVAL enough.
> > >
> >
> > For me, my driver works fine, the problem came from dma_map_sg(), probably I didnt explain right, I restart.
> >
> > Example:
> > crypto self test send to my driver an AES cipher operation of 16 bytes inside a SG, but the original buffer is greater (said 32 for the example).
> > So the first 16 bytes are used by the SG and the last 16 bytes are a poisoned buffer (with value 0xFE) to check driver do not write beyong the normal operation of 16 bytes (and beyond the SG length).
> >
> > Doing the dma_map_sg(FROM_DEVICE) on the SG corrupt the whole buffer.
> > My driver write normally via DMA the first 16 bytes.
> > Crypto API check the last bytes, no more 0xFE, so it fail believing my driver wrote beyond the first 16 bytes.
> >
> > But even If I disable my hardware operation, the buffer is still corrupted. (See my sample code which just do dma_map/dma_unmap)
> >
> > So the problem is the dma_map(FROM_DEVICE) which change buffer content.
> >
> > So if this behavour is normal on D1 SoC, how to fix the crypto self tests ?
> Actually, FLUSH is safe for all, but more expensive. Can you tell me
> which arm SOC are you using? And which version of linux is running on
> your arm SOC?
> 

The SoC is the Allwinner D1 (RISC-V).
I am testing Linux from https://github.com/smaeul/linux.git branch:origin/riscv/d1-wip
Corentin Labbe April 19, 2022, 7:52 a.m. UTC | #13
On Mon, Apr 18, 2022 at 05:29:10PM +0200, Philipp Tomsich wrote:
> On Sun, 17 Apr 2022 at 19:35, Corentin Labbe <clabbe.montjoie@gmail.com> wrote:
> >
> > Le Sun, Apr 17, 2022 at 04:49:34PM +0800, Guo Ren a écrit :
> > > On Sun, Apr 17, 2022 at 4:45 PM Corentin Labbe
> > > <clabbe.montjoie@gmail.com> wrote:
> > > >
> > > > Le Sun, Apr 17, 2022 at 10:17:34AM +0800, Guo Ren a écrit :
> > > > > On Sun, Apr 17, 2022 at 3:32 AM Corentin Labbe
> > > > > <clabbe.montjoie@gmail.com> wrote:
> > > > > >
> > > > > > Le Sat, Apr 16, 2022 at 12:47:29PM -0500, Samuel Holland a écrit :
> > > > > > > On 4/16/22 2:35 AM, Corentin Labbe wrote:
> > > > > > > > Le Fri, Apr 15, 2022 at 09:19:23PM -0500, Samuel Holland a écrit :
> > > > > > > >> On 4/15/22 6:26 AM, Corentin Labbe wrote:
> > > > > > > >>> Le Mon, Mar 07, 2022 at 11:46:18PM +0100, Heiko Stuebner a écrit :
> > > > > > > >>>> This series is based on the alternatives changes done in my svpbmt series
> > > > > > > >>>> and thus also depends on Atish's isa-extension parsing series.
> > > > > > > >>>>
> > > > > > > >>>> It implements using the cache-management instructions from the  Zicbom-
> > > > > > > >>>> extension to handle cache flush, etc actions on platforms needing them.
> > > > > > > >>>>
> > > > > > > >>>> SoCs using cpu cores from T-Head like the Allwinne D1 implement a
> > > > > > > >>>> different set of cache instructions. But while they are different,
> > > > > > > >>>> instructions they provide the same functionality, so a variant can
> > > > > > > >>>> easly hook into the existing alternatives mechanism on those.
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>
> > > > > > > >>> Hello
> > > > > > > >>>
> > > > > > > >>> I am testing https://github.com/smaeul/linux.git branch:origin/riscv/d1-wip which contain this serie.
> > > > > > > >>>
> > > > > > > >>> I am hitting a buffer corruption problem with DMA.
> > > > > > > >>> The sun8i-ce crypto driver fail self tests due to "device overran destination buffer".
> > > > > > > >>> In fact the buffer is not overran by device but by dma_map_single() operation.
> > > > > > > >>>
> > > > > > > >>> The following small code show the problem:
> > > > > > > >>>
> > > > > > > >>> dma_addr_t dma;
> > > > > > > >>> u8 *buf;
> > > > > > > >>> #define BSIZE 2048
> > > > > > > >>> #define DMASIZE 16
> > > > > > > >>>
> > > > > > > >>> buf = kmalloc(BSIZE, GFP_KERNEL | GFP_DMA);
> > > > > > > >>> for (i = 0; i < BSIZE; i++)
> > > > > > > >>>     buf[i] = 0xFE;
> > > > > > > >>> print_hex_dump(KERN_INFO, "DMATEST1:", DUMP_PREFIX_NONE, 16, 4, buf, 256, false);
> > > > > > > >>> dma = dma_map_single(ce->dev, buf, DMASIZE, DMA_FROM_DEVICE);
> > > > > > > >>
> > > > > > > >> This function (through dma_direct_map_page()) ends up calling
> > > > > > > >> arch_sync_dma_for_device(..., ..., DMA_FROM_DEVICE), which invalidates the CPU's
> > > > > > > >> cache. This is the same thing other architectures do (at least arm, arm64,
> > > > > > > >> openrisc, and powerpc). So this appears to be working as intended.
> > > > > > > >
> > > > > > > > This behavour is not present at least on ARM and ARM64.
> > > > > > > > The sample code I provided does not corrupt the buffer on them.
> > > > > > >
> > > > > > > That can be explained by the 0xFE bytes having been flushed to DRAM already in
> > > > > > > your ARM/ARM64 tests, whereas in your riscv64 case, the 0xFE bytes were still in
> > > > > > > a dirty cache line. The cache topology and implementation is totally different
> > > > > > > across the SoCs, so this is not too surprising.
> > > > > > >
> > > > > > > Semantically, dma_map_single(..., DMA_FROM_DEVICE) means you are doing a
> > > > > > > unidirectional DMA transfer from the device into that buffer. So the contents of
> > > > > > > the buffer are "undefined" until the DMA transfer completes. If you are also
> > > > > > > writing data into the buffer from the CPU side, then you need DMA_BIDIRECTIONAL.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Samuel
> > > > > >
> > > > > > +CC crypto mailing list + maintainer
> > > > > >
> > > > > > My problem is that crypto selftest, for each buffer where I need to do a cipher operation,
> > > > > > concat a poison buffer to check that device does write beyond buffer.
> > > > > >
> > > > > > But the dma_map_sg(FROM_DEVICE) corrupts this poison buffer and crypto selftests fails thinking my device did a buffer overrun.
> > > > > >
> > > > > > So you mean that on SoC D1, this crypto API check strategy is impossible ?
> > > > >
> > > > > I think you could try to replace all CLEAN & INVAL ops with FLUSH ops
> > > > > for the testing. (All cache block-aligned data from the device for the
> > > > > CPU should be invalided.)
> > > > >
> > > >
> > > > With:
> > > > diff --git a/arch/riscv/mm/dma-noncoherent.c b/arch/riscv/mm/dma-noncoherent.c
> > > > index 2c124bcc1932..608483522e05 100644
> > > > --- a/arch/riscv/mm/dma-noncoherent.c
> > > > +++ b/arch/riscv/mm/dma-noncoherent.c
> > > > @@ -21,7 +21,7 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size, enum dma_data_dire
> > > >                 ALT_CMO_OP(CLEAN, (unsigned long)phys_to_virt(paddr), size);
> > > >                 break;
> > > >         case DMA_FROM_DEVICE:
> > > > -               ALT_CMO_OP(INVAL, (unsigned long)phys_to_virt(paddr), size);
> > > > +               ALT_CMO_OP(FLUSH, (unsigned long)phys_to_virt(paddr), size);
> > > >                 break;
> > > >         case DMA_BIDIRECTIONAL:
> > > >                 ALT_CMO_OP(FLUSH, (unsigned long)phys_to_virt(paddr), size);
> > > >
> > > >
> > > > The crypto self test works and I got no more buffer corruption.
> > > No, No ... it's not a solution. That means your driver has a problem.
> > > From device, we only need INVAL enough.
> > >
> >
> > > As far as I can tell my driver works fine; the problem comes from dma_map_sg(). I probably did not explain it well, so let me start over.
> > >
> > > Example:
> > > The crypto self-test sends my driver an AES cipher operation of 16 bytes inside an SG, but the original buffer is larger (say 32 bytes for the example).
> > > So the first 16 bytes are used by the SG and the last 16 bytes are a poisoned buffer (with value 0xFE) to check that the driver does not write beyond the normal operation of 16 bytes (and beyond the SG length).
> > >
> > > Doing the dma_map_sg(FROM_DEVICE) on the SG corrupts the whole buffer.
> 
> Doesn't the DMA_FROM_DEVICE indicate that there are no expected writes
> from the CPU to the buffer (and that any modifications to the
> underlying cache line can be dropped via an invalidation)?
> > In other words: does the behavior change when mapping as
> > DMA_BIDIRECTIONAL? And should a map/unmap sequence be used where it
> > is first mapped as DMA_TO_DEVICE when poisoning the buffer and later
> > as DMA_FROM_DEVICE when in normal operation?
> 

There are no CPU writes after the dma_map(FROM_DEVICE).
The buffer is initialized by the crypto API before that.
Furthermore, the corrupted buffer is next to the buffer being mapped.

I verified the size of dma_map_sg() via some debug:
sun8i-ce 3040000.crypto: sun8i_ce_cipher_prepare ecb(aes) cryptlen=16
dma_direct_map_sg:483 SG0 len=16   <- dma_map TO_DEVICE
dma_direct_map_sg:483 SG0 len=16   <- dma_map FROM_DEVICE
need:a47ca9dd e0df4c86 a070af6e 91710dec 
have:a47ca9dd e0df4c86 a070af6e 91710dec
dump whole buffer:
over:a47ca9dd e0df4c86 a070af6e 91710dec
over:ec05e6f2 d542fb77 128b2059 5bf06986 < here we should have 0xFE
alg: skcipher: ecb-aes-sun8i-ce encryption overran dst buffer on test vector 1, cfg="random: use_finup src_divs=[<reimport>100.0%@+1604]"


Note that I tried the following patch:
diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index 4948201065cc..c5b945974441 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -19,6 +19,7 @@
 #include <crypto/aead.h>
 #include <crypto/hash.h>
 #include <crypto/skcipher.h>
+#include <linux/cacheflush.h>
 #include <linux/err.h>
 #include <linux/fips.h>
 #include <linux/module.h>
@@ -205,6 +206,7 @@ static void testmgr_free_buf(char *buf[XBUFSIZE])
 static inline void testmgr_poison(void *addr, size_t len)
 {
        memset(addr, TESTMGR_POISON_BYTE, len);
+       flush_icache_range(addr, addr + len);
 }
 
 /* Is the memory region still fully poisoned? */

This patch fixes the problem, but I am not sure this is the right way.
A DMA mapping operation that corrupts the memory around the mapped buffer does not seem right.
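
One possible reading of the symptom, offered as a guess rather than a diagnosis: the poison bytes share a cache block with the 16 mapped bytes, are still dirty in the cache when the INVAL runs, and are therefore dropped before they ever reach RAM. The toy userspace model below (not kernel code) sketches that scenario. Everything in it is hypothetical: the 64-byte block size, the single-line "cache", and the names toy_line, toy_inval, toy_flush and poison_survives are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LINE_SIZE 64   /* assumed cache block size, for illustration only */
#define POISON    0xFE /* same value as the poison byte described above */

/* Toy model: a single cache line sitting in front of one line of RAM. */
struct toy_line {
	unsigned char ram[LINE_SIZE];
	unsigned char cache[LINE_SIZE];
	int valid;
	int dirty;
};

/* Load the line from RAM if it is not currently cached. */
static void line_fill(struct toy_line *l)
{
	if (!l->valid) {
		memcpy(l->cache, l->ram, LINE_SIZE);
		l->valid = 1;
		l->dirty = 0;
	}
}

/* CPU store: lands in the cache only and marks the line dirty. */
static void cpu_write(struct toy_line *l, size_t off, unsigned char val,
		      size_t len)
{
	assert(off + len <= LINE_SIZE);
	line_fill(l);
	memset(l->cache + off, val, len);
	l->dirty = 1;
}

/* CPU load: served from the cache, refilling from RAM on a miss. */
static unsigned char cpu_read(struct toy_line *l, size_t off)
{
	assert(off < LINE_SIZE);
	line_fill(l);
	return l->cache[off];
}

/* INVAL: drop the whole line, including any dirty bytes. */
static void toy_inval(struct toy_line *l)
{
	l->valid = 0;
	l->dirty = 0;
}

/* FLUSH: write dirty bytes back to RAM first, then drop the line. */
static void toy_flush(struct toy_line *l)
{
	if (l->valid && l->dirty)
		memcpy(l->ram, l->cache, LINE_SIZE);
	l->valid = 0;
	l->dirty = 0;
}

/* Returns 1 if bytes 16..31 still hold the poison after the "DMA". */
int poison_survives(int use_flush)
{
	struct toy_line l;
	size_t i;

	memset(&l, 0, sizeof(l));
	memset(l.ram, 0xAA, LINE_SIZE);   /* stale content already in RAM */
	cpu_write(&l, 16, POISON, 16);    /* poison sits in the cache only */

	if (use_flush)
		toy_flush(&l);
	else
		toy_inval(&l);

	memset(l.ram, 0x11, 16);          /* device writes its 16 bytes to RAM */

	for (i = 16; i < 32; i++)
		if (cpu_read(&l, i) != POISON)
			return 0;
	return 1;
}
```

In this model poison_survives(0) returns 0 and poison_survives(1) returns 1, which would be consistent with the INVAL versus FLUSH results above. It says nothing about where the fix belongs (the test code, the DMA mapping code or the buffer alignment).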