[0/3] ARM ZSTD boot compression

Message ID: 20230412212126.3966502-1-j.neuschaefer@gmx.net

Message

J. Neuschäfer April 12, 2023, 9:21 p.m. UTC
This patchset enables ZSTD kernel (de)compression on 32-bit ARM.
Unfortunately, it is much slower than I hoped (tested on ARM926EJ-S):

 - LZO:  7.2 MiB,  6 seconds
 - ZSTD: 5.6 MiB, 60 seconds

Jonathan Neuschäfer (3):
  ARM: compressed: Pass the actual output length to the decompressor
  ARM: compressed: Bump MALLOC_SIZE to 128 KiB
  ARM: compressed: Enable ZSTD compression

 arch/arm/Kconfig                      |  1 +
 arch/arm/boot/compressed/Makefile     |  5 +++--
 arch/arm/boot/compressed/decompress.c |  8 ++++++--
 arch/arm/boot/compressed/head.S       |  4 ++--
 arch/arm/boot/compressed/misc.c       | 12 ++++++++++--
 5 files changed, 22 insertions(+), 8 deletions(-)

--
2.39.2

Comments

Arnd Bergmann April 12, 2023, 9:33 p.m. UTC | #1
On Wed, Apr 12, 2023, at 23:21, Jonathan Neuschäfer wrote:
> This patchset enables ZSTD kernel (de)compression on 32-bit ARM.
> Unfortunately, it is much slower than I hoped (tested on ARM926EJ-S):
>
>  - LZO:  7.2 MiB,  6 seconds
>  - ZSTD: 5.6 MiB, 60 seconds

That seems unexpected, as the usual numbers say it's about 25%
slower than LZO. Do you have an idea why it is so much slower
here? How long does it take to decompress the
generated arch/arm/boot/Image file in user space on the same
hardware using lzop and zstd?

       Arnd
Arnd Bergmann April 13, 2023, 11:13 a.m. UTC | #2
On Wed, Apr 12, 2023, at 23:33, Arnd Bergmann wrote:
> On Wed, Apr 12, 2023, at 23:21, Jonathan Neuschäfer wrote:
>> This patchset enables ZSTD kernel (de)compression on 32-bit ARM.
>> Unfortunately, it is much slower than I hoped (tested on ARM926EJ-S):
>>
>>  - LZO:  7.2 MiB,  6 seconds
>>  - ZSTD: 5.6 MiB, 60 seconds
>
> That seems unexpected, as the usual numbers say it's about 25%
> slower than LZO. Do  you have an idea why it is so much slower
> here? How long does it take to decompress the
> generated arch/arm/boot/Image file in user space on the same
> hardware using lzop and zstd?

I looked through this a bit more and found two interesting points:

- zstd uses a lot more unaligned loads and stores while
  decompressing. On armv5 those turn into individual byte
  accesses, while the others can likely use word-aligned
  accesses. This could make a huge difference if caches are
  disabled during the decompression.

- The sliding window on zstd is much larger, with the kernel
  using an 8 MB window (zstd=23), compared to the normal 32 KB
  for deflate (I couldn't find the default for LZO), so on
  machines with no L2 cache, it is much more likely to thrash
  the small L1 dcache used on most ARM9 cores.

      Arnd
J. Neuschäfer April 14, 2023, 10:50 p.m. UTC | #3
On Wed, Apr 12, 2023 at 11:33:15PM +0200, Arnd Bergmann wrote:
> On Wed, Apr 12, 2023, at 23:21, Jonathan Neuschäfer wrote:
> > This patchset enables ZSTD kernel (de)compression on 32-bit ARM.
> > Unfortunately, it is much slower than I hoped (tested on ARM926EJ-S):
> >
> >  - LZO:  7.2 MiB,  6 seconds
> >  - ZSTD: 5.6 MiB, 60 seconds
> 
> That seems unexpected, as the usual numbers say it's about 25%
> slower than LZO. Do you have an idea why it is so much slower
> here?

No clear idea.

I guess it might be related to caching or unaligned memory accesses
somehow.

I suspected CONFIG_CPU_DCACHE_WRITETHROUGH, which was enabled, but
disabling it didn't improve performance.

> How long does it take to decompress the generated arch/arm/boot/Image
> file in user space on the same hardware using lzop and zstd?


Unfortunately, the unzstd userspace tool requires a buffer of 128 MiB
(the window size), which is too big for my usual devboard (which has about
100 MiB available). I'd have to test on a different board.


Jonathan

---
# uname -a
Linux buildroot 6.3.0-rc6-00020-g023058d50f2f #1212 PREEMPT Fri Apr 14 20:58:21 CEST 2023 armv5tejl GNU/Linux

# ls -lh
total 13M
-rw-r--r--    1 root     root        7.5M Jan  1 00:07 piggy.lzo
-rw-r--r--    1 root     root        5.8M Jan  1 00:07 piggy.zstd

# time lzop -d piggy.lzo -c > /dev/null
lzop: piggy.lzo: warning: ignoring trailing garbage in lzop file
Command exited with non-zero status 2
real    0m 3.38s
user    0m 3.20s
sys     0m 0.18s

# time unzstd piggy.zstd -c > /dev/null
[  858.270000] __vm_enough_memory: pid: 114, comm: unzstd, not enough memory for the allocation
piggy.zstd : Decoding error (36) : Allocation error : not enough memory
Command exited with non-zero status 1
real    0m 0.03s
user    0m 0.01s
sys     0m 0.03s
J. Neuschäfer April 15, 2023, 2 a.m. UTC | #4
On Thu, Apr 13, 2023 at 01:13:21PM +0200, Arnd Bergmann wrote:
> On Wed, Apr 12, 2023, at 23:33, Arnd Bergmann wrote:
> > On Wed, Apr 12, 2023, at 23:21, Jonathan Neuschäfer wrote:
> >> This patchset enables ZSTD kernel (de)compression on 32-bit ARM.
> >> Unfortunately, it is much slower than I hoped (tested on ARM926EJ-S):
> >>
> >>  - LZO:  7.2 MiB,  6 seconds
> >>  - ZSTD: 5.6 MiB, 60 seconds
> >
> > That seems unexpected, as the usual numbers say it's about 25%
> > slower than LZO. Do  you have an idea why it is so much slower
> > here? How long does it take to decompress the
> > generated arch/arm/boot/Image file in user space on the same
> > hardware using lzop and zstd?
> 
> I looked through this a bit more and found two interesting points:
> 
> - zstd uses a lot more unaligned loads and stores while
>   decompressing. On armv5 those turn into individual byte
>   accesses, while the others can likely use word-aligned
>   accesses. This could make a huge difference if caches are
>   disabled during the decompression.
> 
> - The sliding window on zstd is much larger, with the kernel
>   using an 8MB window (zstd=23), compared to the normal 32kb
>   for deflate (couldn't find the default for lzo), so on
>   machines with no L2 cache, it is much likely to thrash a
>   small L1 dcache that are used on most arm9.
> 
>       Arnd

Makes sense.

For ZSTD as used in kernel decompression (the zstd22 configuration), the
window is even bigger, 128 MiB. (AFAIU)


Thanks

Jonathan
Nick Terrell Oct. 12, 2023, 10:33 p.m. UTC | #5
> On Apr 14, 2023, at 10:00 PM, Jonathan Neuschäfer <j.neuschaefer@gmx.net> wrote:
> 
> On Thu, Apr 13, 2023 at 01:13:21PM +0200, Arnd Bergmann wrote:
>> On Wed, Apr 12, 2023, at 23:33, Arnd Bergmann wrote:
>>> On Wed, Apr 12, 2023, at 23:21, Jonathan Neuschäfer wrote:
>>>> This patchset enables ZSTD kernel (de)compression on 32-bit ARM.
>>>> Unfortunately, it is much slower than I hoped (tested on ARM926EJ-S):
>>>> 
>>>> - LZO:  7.2 MiB,  6 seconds
>>>> - ZSTD: 5.6 MiB, 60 seconds
>>> 
>>> That seems unexpected, as the usual numbers say it's about 25%
>>> slower than LZO. Do  you have an idea why it is so much slower
>>> here? How long does it take to decompress the
>>> generated arch/arm/boot/Image file in user space on the same
>>> hardware using lzop and zstd?
>> 
>> I looked through this a bit more and found two interesting points:
>> 
>> - zstd uses a lot more unaligned loads and stores while
>>  decompressing. On armv5 those turn into individual byte
>>  accesses, while the others can likely use word-aligned
>>  accesses. This could make a huge difference if caches are
>>  disabled during the decompression.
>> 
>> - The sliding window on zstd is much larger, with the kernel
>>  using an 8MB window (zstd=23), compared to the normal 32kb
>>  for deflate (couldn't find the default for lzo), so on
>>  machines with no L2 cache, it is much likely to thrash a
>>  small L1 dcache that are used on most arm9.
>> 
>>      Arnd
> 
> Make sense.
> 
> For ZSTD as used in kernel decompression (the zstd22 configuration), the
> window is even bigger, 128 MiB. (AFAIU)

Sorry, I’m a bit late to the party, I wasn’t getting LKML email for some time...

But this is totally configurable. You can switch compression configurations
at any time. If you believe that the window size is the issue causing the
speed regression, you could configure zstd to use e.g. a 256 KB window
like this:

  zstd -19 --zstd=wlog=18

This will keep the same algorithm search strength, but limit the decoder memory
usage.

I will also try to get this patchset working on my machine, and try to debug.
The 10x speed difference is not expected, and we see much better speed
in ARM userspace. I suspect it has something to do with the preboot environment.
E.g. when implementing x86-64 zstd kernel decompression, I noticed that
memcpy(dst, src, 16) wasn’t getting inlined properly, causing a massive performance
penalty.

Best,
Nick Terrell

> Thanks
> 
> Jonathan
J. Neuschäfer Oct. 13, 2023, 1:27 a.m. UTC | #6
On Thu, Oct 12, 2023 at 10:33:23PM +0000, Nick Terrell wrote:
> > On Apr 14, 2023, at 10:00 PM, Jonathan Neuschäfer <j.neuschaefer@gmx.net> wrote:
> > On Thu, Apr 13, 2023 at 01:13:21PM +0200, Arnd Bergmann wrote:
> >> On Wed, Apr 12, 2023, at 23:33, Arnd Bergmann wrote:
> >>> On Wed, Apr 12, 2023, at 23:21, Jonathan Neuschäfer wrote:
> >>>> This patchset enables ZSTD kernel (de)compression on 32-bit ARM.
> >>>> Unfortunately, it is much slower than I hoped (tested on ARM926EJ-S):
> >>>> 
> >>>> - LZO:  7.2 MiB,  6 seconds
> >>>> - ZSTD: 5.6 MiB, 60 seconds
[...]
> > For ZSTD as used in kernel decompression (the zstd22 configuration), the
> > window is even bigger, 128 MiB. (AFAIU)
> 
> Sorry, I’m a bit late to the party, I wasn’t getting LKML email for some time...
> 
> But this is totally configurable. You can switch compression configurations
> at any time. If you believe that the window size is the issue causing speed
> regressions, you could use a zstd compression to use a e.g. 256KB window
> size like this:
> 
>   zstd -19 --zstd=wlog=18
> 
> This will keep the same algorithm search strength, but limit the decoder memory
> usage.

Noted.

> I will also try to get this patchset working on my machine, and try to debug.
> The 10x slower speed difference is not expected, and we see much better speed
> in userspace ARM. I suspect it has something to do with the preboot environment.
> E.g. when implementing x86-64 zstd kernel decompression, I noticed that
> memcpy(dst, src, 16) wasn’t getting inlined properly, causing a massive performance
> penalty.

In the meantime I've seen 8s for ZSTD vs. 2s for other algorithms, on
only mildly less ancient hardware (Hi3518A, another ARM9 SoC), so I
think the main culprit here was particularly bad luck in my choice of
test hardware.

The inlining issues are a good point, noted for the next time I work on this.


Thanks,
Jonathan
Nick Terrell Oct. 20, 2023, 6:53 p.m. UTC | #7
> On Oct 12, 2023, at 6:27 PM, J. Neuschäfer <j.neuschaefer@gmx.net> wrote:
> 
> On Thu, Oct 12, 2023 at 10:33:23PM +0000, Nick Terrell wrote:
>>> On Apr 14, 2023, at 10:00 PM, Jonathan Neuschäfer <j.neuschaefer@gmx.net> wrote:
>>> On Thu, Apr 13, 2023 at 01:13:21PM +0200, Arnd Bergmann wrote:
>>>> On Wed, Apr 12, 2023, at 23:33, Arnd Bergmann wrote:
>>>>> On Wed, Apr 12, 2023, at 23:21, Jonathan Neuschäfer wrote:
>>>>>> This patchset enables ZSTD kernel (de)compression on 32-bit ARM.
>>>>>> Unfortunately, it is much slower than I hoped (tested on ARM926EJ-S):
>>>>>> 
>>>>>> - LZO:  7.2 MiB,  6 seconds
>>>>>> - ZSTD: 5.6 MiB, 60 seconds
> [...]
>>> For ZSTD as used in kernel decompression (the zstd22 configuration), the
>>> window is even bigger, 128 MiB. (AFAIU)
>> 
>> Sorry, I’m a bit late to the party, I wasn’t getting LKML email for some time...
>> 
>> But this is totally configurable. You can switch compression configurations
>> at any time. If you believe that the window size is the issue causing speed
>> regressions, you could use a zstd compression to use a e.g. 256KB window
>> size like this:
>> 
>>  zstd -19 --zstd=wlog=18
>> 
>> This will keep the same algorithm search strength, but limit the decoder memory
>> usage.
> 
> Noted.
> 
>> I will also try to get this patchset working on my machine, and try to debug.
>> The 10x slower speed difference is not expected, and we see much better speed
>> in userspace ARM. I suspect it has something to do with the preboot environment.
>> E.g. when implementing x86-64 zstd kernel decompression, I noticed that
>> memcpy(dst, src, 16) wasn’t getting inlined properly, causing a massive performance
>> penalty.
> 
> In the meantime I've seen 8s for ZSTD vs. 2s for other algorithms, on
> only mildly less ancient hardware (Hi3518A, another ARM9 SoC), so I
> think the main culprit here was particularly bad luck in my choice of
> test hardware.
> 
> The inlining issues are a good point, noted for the next time I work on this.

I went out and bought a Raspberry Pi 4 to test on. I’ve done some crude measurements
and see that zstd kernel decompression is just slightly slower than gzip kernel
decompression, and about 2x slower than lzo. In userspace decompression of the same
file (a manually compressed kernel image) I see that zstd decompression is significantly
faster than gzip. So it is definitely something about the preboot environment, or how
the code is compiled for the preboot environment, that is causing the issue.

My next step is to set up qemu on my Pi to try to get some perf measurements of the
decompression. One thing I’ve really been struggling with, and what thwarted my last
attempts at adding ARM zstd kernel decompression, was getting preboot logs printed.

I’ve figured out I need CONFIG_DEBUG_LL=y, but I’ve yet to actually get any logs.
And I can’t figure out how to get it working in qemu. I haven’t tried qemu on an ARM
host with kvm, but that’s the next thing I will try.

Do you happen to have any advice about how to get preboot logs in qemu? Is it
possible only on an ARM host, or would it also be possible on an x86-64 host?

Thanks,
Nick Terrell

> Thanks,
> Jonathan