[v3,3/8] lib: add zstd support to decompress

Message ID 20200325195849.407900-4-nickrterrell@gmail.com (mailing list archive)
State New, archived
Series Add support for ZSTD-compressed kernel and initramfs

Commit Message

Nick Terrell March 25, 2020, 7:58 p.m. UTC
From: Nick Terrell <terrelln@fb.com>

* Add unzstd() and the zstd decompress interface.
* Add zstd support to decompress_method().

The decompress_method() and unzstd() functions are used to decompress
the initramfs and the initrd. The __decompress() function is used in
the preboot environment to decompress a zstd compressed kernel.

The zstd decompression function allows the input and output buffers to
overlap because that is used by x86 kernel decompression.

Signed-off-by: Nick Terrell <terrelln@fb.com>
---
 include/linux/decompress/unzstd.h |  11 +
 lib/Kconfig                       |   4 +
 lib/Makefile                      |   1 +
 lib/decompress.c                  |   5 +
 lib/decompress_unzstd.c           | 338 ++++++++++++++++++++++++++++++
 5 files changed, 359 insertions(+)
 create mode 100644 include/linux/decompress/unzstd.h
 create mode 100644 lib/decompress_unzstd.c

Comments

Petr Malat March 26, 2020, 4:47 p.m. UTC | #1
Hi!
On Wed, Mar 25, 2020 at 12:58:44PM -0700, Nick Terrell wrote:
> From: Nick Terrell <terrelln@fb.com>
> * Add unzstd() and the zstd decompress interface.
Here I do not understand why you limit the window size to 8MB even when
you read a larger value from the header. I do not see a reason why there
should be such a limitation in the first place, and if there should be,
why it differs from ZSTD_WINDOWLOG_MAX.

I removed that limitation to be able to test it in my environment and I
found the performance is worse than with my patch by roughly 20% (on
i7-3520M), which is a major drawback considering the main motivation
to use zstd is the decompression speed. I will test on arm as well and
share the result tomorrow.
  Petr
Nick Terrell March 26, 2020, 7:03 p.m. UTC | #2
> On Mar 26, 2020, at 9:47 AM, Petr Malat <oss@malat.biz> wrote:
> 
> Hi!
> On Wed, Mar 25, 2020 at 12:58:44PM -0700, Nick Terrell wrote:
>> From: Nick Terrell <terrelln@fb.com>
>> * Add unzstd() and the zstd decompress interface.
> Here I do not understand why you limit the window size to 8MB even when
> you read a larger value from the header. I do not see a reason why there
> should be such a limitation in the first place, and if there should be,
> why it differs from ZSTD_WINDOWLOG_MAX.

When we are doing streaming decompression (either flush or fill is provided)
we have to allocate memory proportional to the window size. We want to
bound that memory so we don’t accidentally allocate too much memory.
When we are doing a single-pass decompression (neither flush nor fill
are provided) the window size doesn’t matter, and we only have to allocate
a fixed amount of memory ~192 KB.

The zstd spec [0] specifies that all decoders should allow window sizes
up to 8 MB. Additionally, the zstd CLI won’t produce window sizes greater
than 8 MB by default. The window size is controlled by the compression
level, and can be explicitly set.
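
For reference, the window size is carried in the frame header as a one-byte Window_Descriptor. The following is a minimal sketch of how that byte maps to a size, following the formula in RFC 8478 section 3.1.1.1.2, and of how a decoder could compare the result against an 8 MB bound. The names are made up for illustration and are not part of the zstd API or of this patch.

```c
#include <stdint.h>

/* Illustrative bound, mirroring the 8 MB limit discussed in this thread. */
#define EXAMPLE_WINDOW_LIMIT (1ULL << 23)

/*
 * Decode a zstd Window_Descriptor byte as described in RFC 8478:
 * the top 5 bits are an exponent, the low 3 bits a mantissa.
 */
static uint64_t window_size_from_descriptor(uint8_t descriptor)
{
	const unsigned exponent = descriptor >> 3;
	const unsigned mantissa = descriptor & 7;
	const uint64_t window_base = 1ULL << (10 + exponent);
	const uint64_t window_add = (window_base / 8) * mantissa;

	return window_base + window_add;
}

/* Returns 1 if the window fits under the example limit, 0 otherwise. */
static int window_within_limit(uint8_t descriptor)
{
	return window_size_from_descriptor(descriptor) <= EXAMPLE_WINDOW_LIMIT;
}
```

A descriptor of 0 gives the 1 KB minimum, and an exponent of 13 with a zero mantissa gives exactly 8 MB; any non-zero mantissa at that exponent already exceeds the bound.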

I would expect larger window sizes to be beneficial for compression ratio,
though there are diminishing returns. I would expect that for kernel image
compression larger window sizes are beneficial, since it is decompressed
with a single pass. For initramfs decompression, I would expect that limiting
the window size could help decompression speed, since it uses streaming
decompression, so unzstd() has to allocate a buffer of window size bytes.

> I removed that limitation to be able to test it in my environment and I
> found the performance is worse than with my patch by roughly 20% (on
> i7-3520M), which is a major drawback considering the main motivation
> to use zstd is the decompression speed. I will test on arm as well and
> share the result tomorrow.
>  Petr

What do you mean by that? Can you share with me the test you ran?
Is this for kernel decompression or initramfs decompression?

Best,
Nick

[0] https://tools.ietf.org/html/rfc8478#section-3.1.1.1.2
Petr Malat March 26, 2020, 8:16 p.m. UTC | #3
Hi!
On Thu, Mar 26, 2020 at 07:03:54PM +0000, Nick Terrell wrote:
> >> * Add unzstd() and the zstd decompress interface.
> > Here I do not understand why you limit the window size to 8MB even when
> > you read a larger value from the header. I do not see a reason why there
> > should be such a limitation in the first place, and if there should be,
> > why it differs from ZSTD_WINDOWLOG_MAX.
> 
> When we are doing streaming decompression (either flush or fill is provided)
> we have to allocate memory proportional to the window size. We want to
> bound that memory so we don't accidentally allocate too much memory.
> When we are doing a single-pass decompression (neither flush nor fill
> are provided) the window size doesn't matter, and we only have to allocate
> a fixed amount of memory ~192 KB.
>
> The zstd spec [0] specifies that all decoders should allow window sizes
> up to 8 MB. Additionally, the zstd CLI won't produce window sizes greater
> than 8 MB by default. The window size is controlled by the compression
> level, and can be explicitly set.
Yes, one needs to pass --ultra option to zstd to produce an incompatible
archive, but that doesn't justify the reason to limit this in the kernel,
especially if one is able to read the needed window size from the header
when allocating the memory. At the time when initramfs is extracted,
there usually is memory available as it's before any processes are
started and this memory is reclaimed after the decompression.

If, on the other hand, a user makes an initramfs for a memory-constrained
system, he limits the window size while compressing the archive and
the small window size will be announced in the header.

The only scenario where using the hard-coded limit makes sense is the
case where the window size is not available (I'm not sure if it's mandatory
to provide it). That's how my code works - if the size is available,
it uses the provided value, if not it uses 1 << ZSTD_WINDOWLOG_MAX.
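
The fallback described above can be sketched as follows. This is an illustration only, not the code from either patchset: pick_window_size() and its parameters are hypothetical, and the value 27 is assumed from the statement later in this thread that 1 << ZSTD_WINDOWLOG_MAX equals 128 MB.

```c
#include <stdint.h>

/* Assumed value: the thread equates 1 << ZSTD_WINDOWLOG_MAX with 128 MB. */
#define EXAMPLE_WINDOWLOG_MAX 27

/*
 * Use the window size announced in the frame header when there is one,
 * otherwise fall back to the largest window the format allows here.
 */
static uint64_t pick_window_size(int have_header_size, uint64_t header_size)
{
	if (have_header_size)
		return header_size;
	return 1ULL << EXAMPLE_WINDOWLOG_MAX;
}
```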

I would also agree a fixed limit would make sense if user- (or network-)
provided data were used, but in this case only the system owner is able
to provide an initramfs. If one is able to change the initramfs, he can render
the system unusable simply by providing a corrupted file. He doesn't have
to bother making the window bigger than the available memory.

> I would expect larger window sizes to be beneficial for compression ratio,
> though there are diminishing returns. I would expect that for kernel image
> compression larger window sizes are beneficial, since it is decompressed
> with a single pass. For initramfs decompression, I would expect that limiting
> the window size could help decompression speed, since it uses streaming
> decompression, so unzstd() has to allocate a buffer of window size bytes.
Yes, a larger window improves the compression ratio, see here a comparison
between level 19 and 22 on my testing x86-64 initramfs:
  30775022 rootfs.cpio.zst-19
  28755429 rootfs.cpio.zst-22
This 7% can be noticeable when one has slow storage, e.g. a flash memory
on an SPI bus.
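
The size difference can be checked with a few lines of C. size_reduction_percent() is a helper name invented for this example; for the two file sizes above it gives roughly 6.6%, in line with the rounded 7% figure.

```c
#include <assert.h>

/* Percentage by which `after` is smaller than `before`. */
static double size_reduction_percent(unsigned long before, unsigned long after)
{
	assert(before > 0 && after <= before);
	return 100.0 * (double)(before - after) / (double)before;
}
```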

> > I removed that limitation to be able to test it in my environment and I
> > found the performance is worse than with my patch by roughly 20% (on
> > i7-3520M), which is a major drawback considering the main motivation
> > to use zstd is the decompression speed. I will test on arm as well and
> > share the result tomorrow.
> >  Petr
> 
> What do you mean by that? Can you share with me the test you ran?
> Is this for kernel decompression or initramfs decompression?
Initramfs - you can apply my v2 patch on v5.5 and try with your test data.

I have also tested your patch on an ARMv7 platform, and there the degradation
was 8%.
  Petr
Nick Terrell March 26, 2020, 9:13 p.m. UTC | #4
> On Mar 26, 2020, at 1:16 PM, Petr Malat <oss@malat.biz> wrote:
> 
> Hi!
> On Thu, Mar 26, 2020 at 07:03:54PM +0000, Nick Terrell wrote:
>>>> * Add unzstd() and the zstd decompress interface.
>>> Here I do not understand why you limit the window size to 8MB even when
>>> you read a larger value from the header. I do not see a reason why there
>>> should be such a limitation in the first place, and if there should be,
>>> why it differs from ZSTD_WINDOWLOG_MAX.
>> 
>> When we are doing streaming decompression (either flush or fill is provided)
>> we have to allocate memory proportional to the window size. We want to
>> bound that memory so we don't accidentally allocate too much memory.
>> When we are doing a single-pass decompression (neither flush nor fill
>> are provided) the window size doesn't matter, and we only have to allocate
>> a fixed amount of memory ~192 KB.
>> 
>> The zstd spec [0] specifies that all decoders should allow window sizes
>> up to 8 MB. Additionally, the zstd CLI won't produce window sizes greater
>> than 8 MB by default. The window size is controlled by the compression
>> level, and can be explicitly set.
> Yes, one needs to pass --ultra option to zstd to produce an incompatible
> archive, but that doesn't justify the reason to limit this in the kernel,
> especially if one is able to read the needed window size from the header
> when allocating the memory. At the time when initramfs is extracted,
> there usually is memory available as it's before any processes are
> started and this memory is reclaimed after the decompression.

I’m happy to increase this limit. I set it to 8 MB to be conservative, but I am
happy to increase it to 128 MB == 1 << ZSTD_WINDOWLOG_MAX. I will
submit a new version with that change.

> If, on the other hand, a user makes an initramfs for a memory-constrained
> system, he limits the window size while compressing the archive and
> the small window size will be announced in the header.
> 
> The only scenario where using the hard-coded limit makes sense is the
> case where the window size is not available (I'm not sure if it's mandatory
> to provide it). That's how my code works - if the size is available,
> it uses the provided value, if not it uses 1 << ZSTD_WINDOWLOG_MAX.
> 
> I would also agree a fixed limit would make sense if user- (or network-)
> provided data were used, but in this case only the system owner is able
> to provide an initramfs. If one is able to change the initramfs, he can render
> the system unusable simply by providing a corrupted file. He doesn't have
> to bother making the window bigger than the available memory.

That makes sense to me.

>> I would expect larger window sizes to be beneficial for compression ratio,
>> though there are diminishing returns. I would expect that for kernel image
>> compression larger window sizes are beneficial, since it is decompressed
>> with a single pass. For initramfs decompression, I would expect that limiting
>> the window size could help decompression speed, since it uses streaming
>> decompression, so unzstd() has to allocate a buffer of window size bytes.
> Yes, larger window improves the compression ratio, see here a comparison
> between level 19 and 22 on my testing x86-64 initramfs:
>  30775022 rootfs.cpio.zst-19
>  28755429 rootfs.cpio.zst-22
> These 7% can be noticeable when one has a slow storage, e.g. a flash memory
> on SPI bus. 
> 
>>> I removed that limitation to be able to test it in my environment and I
>>> found the performance is worse than with my patch by roughly 20% (on
>>> i7-3520M), which is a major drawback considering the main motivation
>>> to use zstd is the decompression speed. I will test on arm as well and
>>> share the result tomorrow.
>>> Petr
>> 
>> What do you mean by that? Can you share with me the test you ran?
>> Is this for kernel decompression or initramfs decompression?
> Initramfs - you can apply my v2 patch on v5.5 and try with your test data.
> 
> I have tested your patch also on ARMv7 platform and there the degradation
> was 8%.

Are you comparing the performance of an 8 MB window size vs a 128 MB
window size?

>  Petr
Petr Malat March 26, 2020, 9:44 p.m. UTC | #5
On Thu, Mar 26, 2020 at 09:13:54PM +0000, Nick Terrell wrote:
> >> What do you mean by that? Can you share with me the test you ran?
> >> Is this for kernel decompression or initramfs decompression?
> > Initramfs - you can apply my v2 patch on v5.5 and try with your test data.
> > 
> > I have tested your patch also on ARMv7 platform and there the degradation
> > was 8%.
> 
> Are you comparing the performance of an 8 MB window size vs a 128 MB
> window size?
No, I use the same initramfs file with 2 different kernels for the test. I have
removed
  if (params.windowSize > ZSTD_WINDOWSIZE_MAX) goto out;
from your code.
  Petr
Nick Terrell March 26, 2020, 9:58 p.m. UTC | #6
> On Mar 26, 2020, at 2:44 PM, Petr Malat <oss@malat.biz> wrote:
> 
> On Thu, Mar 26, 2020 at 09:13:54PM +0000, Nick Terrell wrote:
>>>> What do you mean by that? Can you share with me the test you ran?
>>>> Is this for kernel decompression or initramfs decompression?
>>> Initramfs - you can apply my v2 patch on v5.5 and try with your test data.
>>> 
>>> I have tested your patch also on ARMv7 platform and there the degradation
>>> was 8%.
>> 
>> Are you comparing the performance of an 8 MB window size vs a 128 MB
>> window size?
> No, I use the same initramfs file with 2 different kernels for the test. I have
> removed
>  if (params.windowSize > ZSTD_WINDOWSIZE_MAX) goto out;
> from your code.

Thanks for the clarification. I will try to reproduce the speed difference you’ve
measured before submitting v4 (that deletes the windowSize bound).

Initramfs passes the whole input buffer (doesn’t use fill), but does use flush.
Zstd always decompresses into an internal buffer, then copies into the
ZSTD_outBuffer. That means the only functional difference between the two
versions for initramfs should be that I will call flush() 4 KB at a time, and you
call flush 128 KB at a time. Naively I wouldn’t expect this to matter too much,
but I will measure.
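
To get a feel for the difference in call counts, here is a rough sketch. It assumes every flush() call hands over a completely full buffer, which is the best case, so real counts can only be higher. min_flush_calls() and the 100 MiB figure in the note below are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Lower bound on flush() calls: ceil(total_out / buf_size). */
static size_t min_flush_calls(size_t total_out, size_t buf_size)
{
	assert(buf_size > 0);
	return total_out / buf_size + (total_out % buf_size != 0);
}
```

For a hypothetical 100 MiB of decompressed output this gives at least 25600 calls with a 4 KB buffer and at least 800 with a 128 KB buffer.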

Best,
Nick
Nick Terrell April 1, 2020, 2:46 a.m. UTC | #7
> On Mar 26, 2020, at 1:16 PM, Petr Malat <oss@malat.biz> wrote:
> 
> Hi!
> On Thu, Mar 26, 2020 at 07:03:54PM +0000, Nick Terrell wrote:
>>>> * Add unzstd() and the zstd decompress interface.
>>> Here I do not understand why you limit the window size to 8MB even when
>>> you read a larger value from the header. I do not see a reason why there
>>> should be such a limitation in the first place, and if there should be,
>>> why it differs from ZSTD_WINDOWLOG_MAX.
>> 
>> When we are doing streaming decompression (either flush or fill is provided)
>> we have to allocate memory proportional to the window size. We want to
>> bound that memory so we don't accidentally allocate too much memory.
>> When we are doing a single-pass decompression (neither flush nor fill
>> are provided) the window size doesn't matter, and we only have to allocate
>> a fixed amount of memory ~192 KB.
>> 
>> The zstd spec [0] specifies that all decoders should allow window sizes
>> up to 8 MB. Additionally, the zstd CLI won't produce window sizes greater
>> than 8 MB by default. The window size is controlled by the compression
>> level, and can be explicitly set.
> Yes, one needs to pass --ultra option to zstd to produce an incompatible
> archive, but that doesn't justify the reason to limit this in the kernel,
> especially if one is able to read the needed window size from the header
> when allocating the memory. At the time when initramfs is extracted,
> there usually is memory available as it's before any processes are
> started and this memory is reclaimed after the decompression.
> 
> If, on the other hand, a user makes an initramfs for a memory-constrained
> system, he limits the window size while compressing the archive and
> the small window size will be announced in the header.
> 
> The only scenario where using the hard-coded limit makes sense is the
> case where the window size is not available (I'm not sure if it's mandatory
> to provide it). That's how my code works - if the size is available,
> it uses the provided value, if not it uses 1 << ZSTD_WINDOWLOG_MAX.
> 
> I would also agree a fixed limit would make sense if user- (or network-)
> provided data were used, but in this case only the system owner is able
> to provide an initramfs. If one is able to change the initramfs, he can render
> the system unusable simply by providing a corrupted file. He doesn't have
> to bother making the window bigger than the available memory.
> 
>> I would expect larger window sizes to be beneficial for compression ratio,
>> though there are diminishing returns. I would expect that for kernel image
>> compression larger window sizes are beneficial, since it is decompressed
>> with a single pass. For initramfs decompression, I would expect that limiting
>> the window size could help decompression speed, since it uses streaming
>> decompression, so unzstd() has to allocate a buffer of window size bytes.
> Yes, larger window improves the compression ratio, see here a comparison
> between level 19 and 22 on my testing x86-64 initramfs:
>  30775022 rootfs.cpio.zst-19
>  28755429 rootfs.cpio.zst-22
> These 7% can be noticeable when one has a slow storage, e.g. a flash memory
> on SPI bus.
> 
>>> I removed that limitation to be able to test it in my environment and I
>>> found the performance is worse than with my patch by roughly 20% (on
>>> i7-3520M), which is a major drawback considering the main motivation
>>> to use zstd is the decompression speed. I will test on arm as well and
>>> share the result tomorrow.
>>> Petr
>> 
>> What do you mean by that? Can you share with me the test you ran?
>> Is this for kernel decompression or initramfs decompression?
> Initramfs - you can apply my v2 patch on v5.5 and try with your test data.
> 
> I have tested your patch also on ARMv7 platform and there the degradation
> was 8%.

Thanks again for measuring the speed differences between the two patchsets!
I’ve found that the difference in performance between our two patchsets is
caused by the output buffer size. I expect this is due to calling flush() more often,
since that is a complex state machine in initramfs’s use case.

I’ve measured the speed of three variants: this patch set (v3), this patch set
with a 128 KB buffer size (ZSTD_DStreamOutSize()), and Petr’s patchset. I’m
measuring on an Intel i9-9900K with turbo disabled on CPU 0. I’m booting the
kernel using QEMU.

To measure the initramfs decompression speed I look at the difference in timestamp
between “Unpacking initramfs…” and “Freeing initrd memory”. The initramfs is
compressed using level 19, but results for level 22 are similar. Times are reported
in seconds. I ran each test 3 times and took the median time, but the results are
very stable. On ARM the initramfs is 26 MB. On x86-64 the initramfs is 97 MB.

Arch     v3 (4 KB buf)   v3 (128 KB buf)   Petr
ARM      1.67            1.52              1.55
x86-64   1.76            1.69              1.66
With the 128 KB buffer, my patch is slightly faster on ARM and yours is slightly faster on x86-64.
In v4 of my patchset, which I will send out tonight, I will increase ZSTD_IOBUF_SIZE
to 128 KB (as well as remove the 8 MB window size limit). Please let me know if your
results align with mine on v4.

I’ve also measured the x86_64 zstd kernel decompression speed using our two
patchsets. I measured it by the timing between the “Decompressing Linux…” message
and the “Parsing ELF” message with this script [0]. I used the same technique for
measurement as above. The kernel I am testing is compressed at level 19 with my
patchset and at level 19 with a window size of 4 MB with your patchset.

I found that my patchset takes 70ms to decompress and yours takes 318ms. Your
patchset also uses 4 MB of heap memory, where mine only needs 192 KB. The difference
is caused by two things:

1. memcpy() is replaced by __builtin_memcpy() in patch 1 of my set. This is the
    core of the decompression hot loop, and without it the compiler can’t inline memcpy.
2. My patchset calls decompress_single() when neither flush nor fill are provided,
    like when decompressing the kernel. This saves the 4 MB of memory, as well as
    speeds up decompression a little bit.
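
Point 2 boils down to a check on the two callbacks. Below is a simplified sketch of that dispatch. The enum, the typedef and the function names are invented here; only the condition itself mirrors the `fill == NULL && flush == NULL` test in __unzstd() in the patch.

```c
#include <stddef.h>

enum unzstd_mode {
	UNZSTD_SINGLE_PASS,	/* whole input and output are in memory */
	UNZSTD_STREAMING,	/* needs a window buffer and I/O buffers */
};

typedef long (*io_fn)(void *, unsigned long);

/* Stand-in callback, used only to exercise pick_mode(). */
static long noop_io(void *buf, unsigned long len)
{
	(void)buf;
	return (long)len;
}

/* Single-pass is possible only when neither callback is supplied. */
static enum unzstd_mode pick_mode(io_fn fill, io_fn flush)
{
	if (fill == NULL && flush == NULL)
		return UNZSTD_SINGLE_PASS;
	return UNZSTD_STREAMING;
}
```

Kernel decompression passes neither callback and so takes the single-pass branch, while initramfs unpacking passes flush and ends up in streaming mode.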

Best,
Nick Terrell

[0] https://gist.github.com/terrelln/9bd53321a669f62683c608af8944fbc2

Patch

diff --git a/include/linux/decompress/unzstd.h b/include/linux/decompress/unzstd.h
new file mode 100644
index 000000000000..56d539ae880f
--- /dev/null
+++ b/include/linux/decompress/unzstd.h
@@ -0,0 +1,11 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef LINUX_DECOMPRESS_UNZSTD_H
+#define LINUX_DECOMPRESS_UNZSTD_H
+
+int unzstd(unsigned char *inbuf, long len,
+	   long (*fill)(void*, unsigned long),
+	   long (*flush)(void*, unsigned long),
+	   unsigned char *output,
+	   long *pos,
+	   void (*error_fn)(char *x));
+#endif
diff --git a/lib/Kconfig b/lib/Kconfig
index bc7e56370129..11de5fa09a52 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -336,6 +336,10 @@  config DECOMPRESS_LZ4
 	select LZ4_DECOMPRESS
 	tristate
 
+config DECOMPRESS_ZSTD
+	select ZSTD_DECOMPRESS
+	tristate
+
 #
 # Generic allocator support is selected if needed
 #
diff --git a/lib/Makefile b/lib/Makefile
index 611872c06926..09ad45ba6883 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -160,6 +160,7 @@  lib-$(CONFIG_DECOMPRESS_LZMA) += decompress_unlzma.o
 lib-$(CONFIG_DECOMPRESS_XZ) += decompress_unxz.o
 lib-$(CONFIG_DECOMPRESS_LZO) += decompress_unlzo.o
 lib-$(CONFIG_DECOMPRESS_LZ4) += decompress_unlz4.o
+lib-$(CONFIG_DECOMPRESS_ZSTD) += decompress_unzstd.o
 
 obj-$(CONFIG_TEXTSEARCH) += textsearch.o
 obj-$(CONFIG_TEXTSEARCH_KMP) += ts_kmp.o
diff --git a/lib/decompress.c b/lib/decompress.c
index 857ab1af1ef3..ab3fc90ffc64 100644
--- a/lib/decompress.c
+++ b/lib/decompress.c
@@ -13,6 +13,7 @@ 
 #include <linux/decompress/inflate.h>
 #include <linux/decompress/unlzo.h>
 #include <linux/decompress/unlz4.h>
+#include <linux/decompress/unzstd.h>
 
 #include <linux/types.h>
 #include <linux/string.h>
@@ -37,6 +38,9 @@ 
 #ifndef CONFIG_DECOMPRESS_LZ4
 # define unlz4 NULL
 #endif
+#ifndef CONFIG_DECOMPRESS_ZSTD
+# define unzstd NULL
+#endif
 
 struct compress_format {
 	unsigned char magic[2];
@@ -52,6 +56,7 @@  static const struct compress_format compressed_formats[] __initconst = {
 	{ {0xfd, 0x37}, "xz", unxz },
 	{ {0x89, 0x4c}, "lzo", unlzo },
 	{ {0x02, 0x21}, "lz4", unlz4 },
+	{ {0x28, 0xb5}, "zstd", unzstd },
 	{ {0, 0}, NULL, NULL }
 };
 
diff --git a/lib/decompress_unzstd.c b/lib/decompress_unzstd.c
new file mode 100644
index 000000000000..a6b391b47ab8
--- /dev/null
+++ b/lib/decompress_unzstd.c
@@ -0,0 +1,338 @@ 
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Important notes about in-place decompression
+ *
+ * At least on x86, the kernel is decompressed in place: the compressed data
+ * is placed to the end of the output buffer, and the decompressor overwrites
+ * most of the compressed data. There must be enough safety margin to
+ * guarantee that the write position is always behind the read position.
+ *
+ * The safety margin for ZSTD with a 128 KB block size is calculated below.
+ * Note that the margin with ZSTD is bigger than with GZIP or XZ!
+ *
+ * The worst case for in-place decompression is that the beginning of
+ * the file is compressed extremely well, and the rest of the file is
+ * uncompressible. Thus, we must look for worst-case expansion when the
+ * compressor is encoding uncompressible data.
+ *
+ * The structure of the .zst file in case of a compressed kernel is as follows.
+ * Maximum sizes (as bytes) of the fields are in parentheses.
+ *
+ *    Frame Header: (18)
+ *    Blocks: (N)
+ *    Checksum: (4)
+ *
+ * The frame header and checksum overhead is at most 22 bytes.
+ *
+ * ZSTD stores the data in blocks. Each block has a header whose size is
+ * 3 bytes. After the block header, there is up to 128 KB of payload.
+ * The maximum uncompressed size of the payload is 128 KB. The minimum
+ * uncompressed size of the payload is never less than the payload size
+ * (excluding the block header).
+ *
+ * The assumption, that the uncompressed size of the payload is never
+ * smaller than the payload itself, is valid only when talking about
+ * the payload as a whole. It is possible that the payload has parts where
+ * the decompressor consumes more input than it produces output. Calculating
+ * the worst case for this would be tricky. Instead of trying to do that,
+ * let's simply make sure that the decompressor never overwrites any bytes
+ * of the payload which it is currently reading.
+ *
+ * Now we have enough information to calculate the safety margin. We need
+ *   - 22 bytes for the .zst file format headers;
+ *   - 3 bytes per every 128 KiB of uncompressed size (one block header per
+ *     block); and
+ *   - 128 KiB (biggest possible zstd block size) to make sure that the
+ *     decompressor never overwrites anything from the block it is currently
+ *     reading.
+ *
+ * We get the following formula:
+ *
+ *    safety_margin = 22 + uncompressed_size * 3 / 131072 + 131072
+ *                 <= 22 + (uncompressed_size >> 15) + 131072
+ */
+
+/*
+ * Preboot environments #include "path/to/decompress_unzstd.c".
+ * All of the source files we depend on must be #included.
+ * zstd's only source dependency is xxhash, which has no source
+ * dependencies.
+ *
+ * zstd and xxhash avoid declaring themselves as modules
+ * when ZSTD_PREBOOT and XXH_PREBOOT are defined.
+ */
+#ifdef STATIC
+# define ZSTD_PREBOOT
+# define XXH_PREBOOT
+# include "xxhash.c"
+# include "zstd/entropy_common.c"
+# include "zstd/fse_decompress.c"
+# include "zstd/huf_decompress.c"
+# include "zstd/zstd_common.c"
+# include "zstd/decompress.c"
+#endif
+
+#include <linux/decompress/mm.h>
+#include <linux/kernel.h>
+#include <linux/zstd.h>
+
+/* 8 MB maximum window size */
+#define ZSTD_WINDOWSIZE_MAX	(1 << 23)
+/* Size of the input and output buffers in multi-call mode */
+#define ZSTD_IOBUF_SIZE		4096
+
+static int INIT handle_zstd_error(size_t ret, void (*error)(char *x))
+{
+	const int err = ZSTD_getErrorCode(ret);
+
+	if (!ZSTD_isError(ret))
+		return 0;
+
+	switch (err) {
+	case ZSTD_error_memory_allocation:
+		error("ZSTD decompressor ran out of memory");
+		break;
+	case ZSTD_error_prefix_unknown:
+		error("Input is not in the ZSTD format (wrong magic bytes)");
+		break;
+	case ZSTD_error_dstSize_tooSmall:
+	case ZSTD_error_corruption_detected:
+	case ZSTD_error_checksum_wrong:
+		error("ZSTD-compressed data is corrupt");
+		break;
+	default:
+		error("ZSTD-compressed data is probably corrupt");
+		break;
+	}
+	return -1;
+}
+
+/*
+ * Handle the case where we have the entire input and output in one segment.
+ * We can allocate less memory (no circular buffer for the sliding window),
+ * and avoid some memcpy() calls.
+ */
+static int INIT decompress_single(const u8 *in_buf, long in_len, u8 *out_buf,
+				  long out_len, long *in_pos,
+				  void (*error)(char *x))
+{
+	const size_t wksp_size = ZSTD_DCtxWorkspaceBound();
+	void *wksp = large_malloc(wksp_size);
+	ZSTD_DCtx *dctx = ZSTD_initDCtx(wksp, wksp_size);
+	int err;
+	size_t ret;
+
+	if (dctx == NULL) {
+		error("Out of memory while allocating ZSTD_DCtx");
+		err = -1;
+		goto out;
+	}
+	/*
+	 * Find out how large the frame actually is, there may be junk at
+	 * the end of the frame that ZSTD_decompressDCtx() can't handle.
+	 */
+	ret = ZSTD_findFrameCompressedSize(in_buf, in_len);
+	err = handle_zstd_error(ret, error);
+	if (err)
+		goto out;
+	in_len = (long)ret;
+
+	ret = ZSTD_decompressDCtx(dctx, out_buf, out_len, in_buf, in_len);
+	err = handle_zstd_error(ret, error);
+	if (err)
+		goto out;
+
+	if (in_pos != NULL)
+		*in_pos = in_len;
+
+	err = 0;
+out:
+	if (wksp != NULL)
+		large_free(wksp);
+	return err;
+}
+
+static int INIT __unzstd(unsigned char *in_buf, long in_len,
+			 long (*fill)(void*, unsigned long),
+			 long (*flush)(void*, unsigned long),
+			 unsigned char *out_buf, long out_len,
+			 long *in_pos,
+			 void (*error)(char *x))
+{
+	ZSTD_inBuffer in;
+	ZSTD_outBuffer out;
+	ZSTD_frameParams params;
+	void *in_allocated = NULL;
+	void *out_allocated = NULL;
+	void *wksp = NULL;
+	size_t wksp_size;
+	ZSTD_DStream *dstream;
+	int err;
+	size_t ret;
+
+	if (out_len == 0)
+		out_len = LONG_MAX; /* no limit */
+
+	if (fill == NULL && flush == NULL)
+		/*
+		 * We can decompress faster and with less memory when we have a
+		 * single chunk.
+		 */
+		return decompress_single(in_buf, in_len, out_buf, out_len,
+					 in_pos, error);
+
+	/*
+	 * If in_buf is not provided, we must be using fill(), so allocate
+	 * a large enough buffer. If it is provided, it must be at least
+	 * ZSTD_IOBUF_SIZE large.
+	 */
+	if (in_buf == NULL) {
+		in_allocated = malloc(ZSTD_IOBUF_SIZE);
+		if (in_allocated == NULL) {
+			error("Out of memory while allocating input buffer");
+			err = -1;
+			goto out;
+		}
+		in_buf = in_allocated;
+		in_len = 0;
+	}
+	/* Read the first chunk, since we need to decode the frame header. */
+	if (fill != NULL)
+		in_len = fill(in_buf, ZSTD_IOBUF_SIZE);
+	if (in_len < 0) {
+		error("ZSTD-compressed data is truncated");
+		err = -1;
+		goto out;
+	}
+	/* Set the first non-empty input buffer. */
+	in.src = in_buf;
+	in.pos = 0;
+	in.size = in_len;
+	/* Allocate the output buffer if we are using flush(). */
+	if (flush != NULL) {
+		out_allocated = malloc(ZSTD_IOBUF_SIZE);
+		if (out_allocated == NULL) {
+			error("Out of memory while allocating output buffer");
+			err = -1;
+			goto out;
+		}
+		out_buf = out_allocated;
+		out_len = ZSTD_IOBUF_SIZE;
+	}
+	/* Set the output buffer. */
+	out.dst = out_buf;
+	out.pos = 0;
+	out.size = out_len;
+
+	/*
+	 * We need to know the window size to allocate the ZSTD_DStream.
+	 * Since we are streaming, we need to allocate a buffer for the sliding
+	 * window. The window size varies from 1 KB to ZSTD_WINDOWSIZE_MAX
+	 * (8 MB), so it is important to use the actual value so as not to
+	 * waste memory when it is smaller.
+	 */
+	ret = ZSTD_getFrameParams(&params, in.src, in.size);
+	err = handle_zstd_error(ret, error);
+	if (err)
+		goto out;
+	if (ret != 0) {
+		error("ZSTD-compressed data has an incomplete frame header");
+		err = -1;
+		goto out;
+	}
+	if (params.windowSize > ZSTD_WINDOWSIZE_MAX) {
+		error("ZSTD-compressed data has too large a window size");
+		err = -1;
+		goto out;
+	}
+
+	/*
+	 * Allocate the ZSTD_DStream now that we know how much memory is
+	 * required.
+	 */
+	wksp_size = ZSTD_DStreamWorkspaceBound(params.windowSize);
+	wksp = large_malloc(wksp_size);
+	dstream = ZSTD_initDStream(params.windowSize, wksp, wksp_size);
+	if (dstream == NULL) {
+		error("Out of memory while allocating ZSTD_DStream");
+		err = -1;
+		goto out;
+	}
+
+	/*
+	 * Decompression loop:
+	 * Read more data if necessary (error if no more data can be read).
+	 * Call the decompression function, which returns 0 when finished.
+	 * Flush any data produced if using flush().
+	 */
+	if (in_pos != NULL)
+		*in_pos = 0;
+	do {
+		/*
+		 * If we need to reload data, either we have fill() and can
+		 * try to get more data, or we don't and the input is truncated.
+		 */
+		if (in.pos == in.size) {
+			if (in_pos != NULL)
+				*in_pos += in.pos;
+			in_len = fill ? fill(in_buf, ZSTD_IOBUF_SIZE) : -1;
+			if (in_len < 0) {
+				error("ZSTD-compressed data is truncated");
+				err = -1;
+				goto out;
+			}
+			in.pos = 0;
+			in.size = in_len;
+		}
+		/* Returns zero when the frame is complete. */
+		ret = ZSTD_decompressStream(dstream, &out, &in);
+		err = handle_zstd_error(ret, error);
+		if (err)
+			goto out;
+		/* Flush all of the data produced if using flush(). */
+		if (flush != NULL && out.pos > 0) {
+			if (out.pos != flush(out.dst, out.pos)) {
+				error("Failed to flush()");
+				err = -1;
+				goto out;
+			}
+			out.pos = 0;
+		}
+	} while (ret != 0);
+
+	if (in_pos != NULL)
+		*in_pos += in.pos;
+
+	err = 0;
+out:
+	if (in_allocated != NULL)
+		free(in_allocated);
+	if (out_allocated != NULL)
+		free(out_allocated);
+	if (wksp != NULL)
+		large_free(wksp);
+	return err;
+}
+
+#ifndef ZSTD_PREBOOT
+STATIC int INIT unzstd(unsigned char *buf, long len,
+		       long (*fill)(void*, unsigned long),
+		       long (*flush)(void*, unsigned long),
+		       unsigned char *out_buf,
+		       long *pos,
+		       void (*error)(char *x))
+{
+	return __unzstd(buf, len, fill, flush, out_buf, 0, pos, error);
+}
+#else
+STATIC int INIT __decompress(unsigned char *buf, long len,
+			     long (*fill)(void*, unsigned long),
+			     long (*flush)(void*, unsigned long),
+			     unsigned char *out_buf, long out_len,
+			     long *pos,
+			     void (*error)(char *x))
+{
+	return __unzstd(buf, len, fill, flush, out_buf, out_len, pos, error);
+}
+#endif
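
As a sanity check on the safety-margin comment at the top of lib/decompress_unzstd.c, the exact formula and its shift-based upper bound can be compared in a small standalone sketch. The macro and function names are invented for this example, and it assumes the uncompressed size is small enough that multiplying it by 3 does not overflow 64 bits.

```c
#include <stdint.h>

#define EXAMPLE_BLOCK_MAX	131072ULL	/* 128 KiB, largest zstd block */
#define EXAMPLE_FRAME_OVERHEAD	22ULL		/* 18-byte header + 4-byte checksum */

/* 22 + uncompressed_size * 3 / 131072 + 131072, as in the comment. */
static uint64_t safety_margin_exact(uint64_t uncompressed_size)
{
	return EXAMPLE_FRAME_OVERHEAD +
	       uncompressed_size * 3 / EXAMPLE_BLOCK_MAX +
	       EXAMPLE_BLOCK_MAX;
}

/* 22 + (uncompressed_size >> 15) + 131072, the cheaper upper bound. */
static uint64_t safety_margin_bound(uint64_t uncompressed_size)
{
	return EXAMPLE_FRAME_OVERHEAD +
	       (uncompressed_size >> 15) +
	       EXAMPLE_BLOCK_MAX;
}
```

For a 64 MiB image the exact value is 132630 bytes and the bound is 133142 bytes, so the shift costs about half a kilobyte of extra margin.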