[0/4] crypto: Add new compression modes for zlib and IAA

Message ID cover.1710969449.git.andre.glover@linux.intel.com

Message

Andre Glover March 28, 2024, 5:44 p.m. UTC
This patch series adds 'canned' compression for Intel IAA and zlib. It
also adds 'dynamic' compression for Intel IAA, which is compatible with
zlib's deflate algorithm.

The original IAA crypto submissions [1] included a 'canned' compression
mode, but support for 'canned' compression was removed during the review
because it didn't have an equivalent software implementation available [2].

Deflate compression can be done in a variety of modes. The dynamic mode
uses Huffman tables that are generated/optimized for that particular
input. This gives the best compression ratio but has the longest latency.
The fixed mode uses Huffman tables that are defined by the Deflate
standard. It generally gives the best latency but with a lower compression
ratio. The 'canned' compression mode implements a compression scheme that
uses a statically defined set of Huffman tables, but where the Deflate
Block Header is implied rather than stored with the compressed data. 

The 'canned' mode results in lower compression/decompression latencies
compared to using the dynamic mode, but it results in a better compression
ratio than using the fixed mode.
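
For reference, the existing fixed/dynamic trade-off is already
selectable through stock zlib's strategy argument; a minimal userspace
sketch (illustrative only, not part of this series) is:

/* Illustrative sketch using the stock zlib API: Z_DEFAULT_STRATEGY
 * lets deflate emit dynamic blocks with input-optimized Huffman
 * tables, while Z_FIXED forces the fixed tables from RFC 1951.
 * Canned mode has no stock-zlib equivalent; that is what this
 * series adds.
 */
#include <string.h>
#include <zlib.h>

static int deflate_buf(const unsigned char *in, size_t in_len,
		       unsigned char *out, size_t out_len,
		       int level, int strategy)
{
	z_stream strm;
	int ret;

	memset(&strm, 0, sizeof(strm));
	ret = deflateInit2(&strm, level, Z_DEFLATED, -15 /* raw deflate */,
			   8, strategy);
	if (ret != Z_OK)
		return ret;

	strm.next_in   = (unsigned char *)in;
	strm.avail_in  = in_len;
	strm.next_out  = out;
	strm.avail_out = out_len;

	ret = deflate(&strm, Z_FINISH);	/* single-shot compression */
	deflateEnd(&strm);
	return ret == Z_STREAM_END ? Z_OK : ret;
}

/* deflate_buf(in, n, out, m, 6, Z_DEFAULT_STRATEGY) -> dynamic blocks
 * deflate_buf(in, n, out, m, 6, Z_FIXED)            -> fixed blocks
 */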

Below is a table showing the latency improvements with zlib, between
zlib dynamic and zlib canned modes, and the compression ratio for
each mode while using a set of 4300 4KB pages sampled from SPEC
CPU17 workloads:
_________________________________________________________
| Zlib Level |  Canned Latency Gain  |    Comp Ratio    |
|------------|-----------------------|------------------|
|            | compress | decompress | dynamic | canned |
|____________|__________|____________|_________|________|
|     1      |    49%   |    29%     |  3.16   |  2.92  |
|------------|----------|------------|---------|--------|
|     6      |    27%   |    28%     |  3.35   |  3.09  |
|------------|----------|------------|---------|--------|
|     9      |    12%   |    29%     |  3.36   |  3.11  |
|____________|__________|____________|_________|________|

Comments

Eric Biggers March 29, 2024, 2:46 a.m. UTC | #1
On Thu, Mar 28, 2024 at 10:44:41AM -0700, Andre Glover wrote:
> The 'canned' compression mode implements a compression scheme that
> uses a statically defined set of Huffman tables, but where the Deflate
> Block Header is implied rather than stored with the compressed data. 

This already exists in standard DEFLATE; it's called fixed mode.  See section
3.2.6 of RFC1951 (https://datatracker.ietf.org/doc/html/rfc1951#page-12).

I think that what's going on is that you've implemented a custom variant of
DEFLATE where you set the fixed Huffman codes to something different from the
ones defined in the standard.

Is that correct, or are there other differences?

Actually, looking at your zlib_tr_flush_block(), it looks like instead of using the
reserved block type value (3) or redefining the meaning of the fixed block type
value (1), you actually deleted the BTYPE and BFINAL fields from the data stream
entirely.  So the stream no longer stores the type of block or the flag that
indicates whether the block is the final one or not.

That has the property that there cannot be any standard blocks, even
uncompressed blocks, included in the data stream anymore.  Is that intentional?
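
For concreteness, the fields in question are just three bits at the
start of each block; a sketch (which handles only the byte-aligned
first block of a raw deflate stream, since later headers land at
arbitrary bit offsets) is:

/* Per-block header fields from RFC 1951 section 3.2.3. Deflate packs
 * bits LSB-first, so for the first (byte-aligned) block BFINAL is
 * bit 0 and BTYPE is bits 1-2 of the first byte. These are the
 * fields the canned format omits from the stored stream.
 */
#include <stdint.h>

struct deflate_block_hdr {
	unsigned int bfinal;	/* 1 = last block in the stream */
	unsigned int btype;	/* 0 = stored, 1 = fixed Huffman,
				 * 2 = dynamic Huffman, 3 = reserved */
};

static struct deflate_block_hdr read_first_block_hdr(const uint8_t *stream)
{
	struct deflate_block_hdr hdr;

	hdr.bfinal = stream[0] & 0x1;
	hdr.btype  = (stream[0] >> 1) & 0x3;
	return hdr;
}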

Maybe this is why you're using the name "canned", instead of going with
something more consistent with the existing "fixed" name, like "custom-fixed"?

I wonder what the plan is for when the next hardware vendor tries to do this and
chooses their own Huffman codes, different from yours.  Or what if Intel decides
the Huffman codes they chose aren't the best ones anymore and releases new
hardware that uses different codes.  Will we perhaps be getting a tinned mode
too?

Is your customization described in any sort of publicly available document that
could hint at some way to name it properly?

- Eric
Andre Glover April 1, 2024, 8:46 p.m. UTC | #2
Hi Eric,

Thank you for reviewing the patch. Please see responses to your
questions inline below.

On Thu, 2024-03-28 at 19:46 -0700, Eric Biggers wrote:
> On Thu, Mar 28, 2024 at 10:44:41AM -0700, Andre Glover wrote:
> > The 'canned' compression mode implements a compression scheme that
> > uses a statically defined set of Huffman tables, but where the
> > Deflate
> > Block Header is implied rather than stored with the compressed
> > data. 
> 
> This already exists in standard DEFLATE; it's called fixed mode.  See
> section
> 3.2.6 of RFC1951
> (https://datatracker.ietf.org/doc/html/rfc1951#page-12).
> 
> I think that what's going on is that you've implemented a custom
> variant of
> DEFLATE where you set the fixed Huffman codes to something different
> from the
> ones defined in the standard.
> 
> Is that correct, or are there other differences?
> 

We view it as a variant of dynamic block Deflate as opposed to a
variant of fixed. In particular, it is compressing the input with
static Huffman tables (i.e. ones that do not vary with the input), and
where the Deflate block header (which is a constant) is not stored with
the compressed data. If the missing block header were to be prepended
to the compressed data, the result would be a valid dynamic compressed
block.

One might think of this as vaguely similar to dictionary compression.
In that case, the dictionary is not stored with the compressed data but
is agreed to by the compressor and decompressor. In the case of canned
compression, the header is not stored with the compressed data but is
agreed to by both entities.
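
To make that concrete, a sketch of what software decompression could
look like (canned_hdr[] is a hypothetical blob holding the implied
header, and we assume the implied header ends on a byte boundary so
plain concatenation is valid):

#include <string.h>
#include <zlib.h>

extern const unsigned char canned_hdr[];	/* hypothetical: implied header */
extern const unsigned int canned_hdr_len;	/* hypothetical */

static int canned_inflate(const unsigned char *payload, size_t payload_len,
			  unsigned char *out, size_t out_len)
{
	unsigned char buf[8192];
	z_stream strm;
	int ret;

	/* Reconstitute a valid dynamic deflate block by prepending
	 * the agreed-on header to the stored payload. */
	if (canned_hdr_len + payload_len > sizeof(buf))
		return Z_BUF_ERROR;
	memcpy(buf, canned_hdr, canned_hdr_len);
	memcpy(buf + canned_hdr_len, payload, payload_len);

	memset(&strm, 0, sizeof(strm));
	ret = inflateInit2(&strm, -15 /* raw deflate, no zlib wrapper */);
	if (ret != Z_OK)
		return ret;

	strm.next_in   = buf;
	strm.avail_in  = canned_hdr_len + payload_len;
	strm.next_out  = out;
	strm.avail_out = out_len;

	ret = inflate(&strm, Z_FINISH);
	inflateEnd(&strm);
	return ret == Z_STREAM_END ? Z_OK : ret;
}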

> Actually, looking at your zlib_tr_flush_block(), it looks like instead of
> using the
> reserved block type value (3) or redefining the meaning of the fixed
> block type
> value (1), you actually deleted the BTYPE and BFINAL fields from the
> data stream
> entirely.  So the stream no longer stores the type of block or the
> flag that
> indicates whether the block is the final one or not.
> 
> That has the property that there cannot be any standard blocks, even
> uncompressed blocks, included in the data stream anymore.  Is that
> intentional?
> 

Conceptually, there is a valid dynamic block header associated with the
compressed data; it is just not stored with the data in order to save
space (since it is effectively an unchanging constant). In this usage,
it is envisioned that the output would consist of a single Deflate
block (i.e. the implied header is marked as BFINAL). In theory, one
could have the implied block header not marked as BFINAL, and so the
compressed data would need to contain at least two blocks (i.e. the
body of the initial block, an EOB, and a normal block with header), but
this would be counterproductive to the intended usage.

> Maybe this is why you're using the name "canned", instead of going
> with
> something more consistent with the existing "fixed" name, like
> "custom-fixed"?
> 

Again, we view this as a variant of dynamic compression rather than as
a variant of fixed compression. We use the term "static" to describe
compression with a dynamic block where the tables are unchanging (i.e.
not dependent on the input data). This can allow the compression to be
done in one pass rather than two. The "canned" compression uses a
static table, but is different in that the header is implied rather
than stored.

> I wonder what the plan is for when the next hardware vendor tries to
> do this and
> chooses their own Huffman codes, different from yours.  Or what if
> Intel decides
> the Huffman codes they chose aren't the best ones anymore and
> releases new
> hardware that uses different codes.  Will we perhaps be getting a
> tinned mode
> too?
> 

The Huffman tables are not built into the hardware. The Huffman
tables/block header is in this case built into the software. The same
tables are embedded in the zlib portion of the patch for software
compression/decompression and in the hardware driver portion of the
patch for IAA compression/decompression. The hardware itself can work
with any desired set of tables.
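
A sketch of how that sharing could look (names hypothetical; the real
definitions live in the patches):

/* Single source of truth for the implied header and its Huffman
 * tables, shared by the zlib canned code paths and the IAA driver.
 * The driver programs the device from this same data, so software
 * and hardware agree on the tables by construction; the hardware
 * itself is table-agnostic.
 */
extern const unsigned char canned_hdr[];   /* implied block header + tables */
extern const unsigned int canned_hdr_len;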

Thanks,
Andre
Herbert Xu April 5, 2024, 7:07 a.m. UTC | #3
On Thu, Mar 28, 2024 at 10:44:41AM -0700, Andre Glover wrote:
>
> Below is a table showing the latency improvements with zlib, between
> zlib dynamic and zlib canned modes, and the compression ratio for 
> each mode while using a set of 4300 4KB pages sampled from SPEC 
> CPU17 workloads:
> _________________________________________________________
> | Zlib Level |  Canned Latency Gain  |    Comp Ratio    |
> |------------|-----------------------|------------------|
> |            | compress | decompress | dynamic | canned |
> |____________|__________|____________|_________|________|
> |     1      |    49%   |    29%     |  3.16   |  2.92  |
> |------------|----------|------------|---------|--------|
> |     6      |    27%   |    28%     |  3.35   |  3.09  |
> |------------|----------|------------|---------|--------|
> |     9      |    12%   |    29%     |  3.36   |  3.11  |
> |____________|__________|____________|_________|________|

So which kernel user (zswap I presume) is clamouring for this
feature? We don't add new algorithms that have no in-kernel
users.  So we need to be sure that the kernel user actually
wants this.

Thanks,
Andre Glover May 1, 2024, 10:07 p.m. UTC | #4
On Fri, 2024-04-05 at 15:07 +0800, Herbert Xu wrote:
> On Thu, Mar 28, 2024 at 10:44:41AM -0700, Andre Glover wrote:
> > 
> > Below is a table showing the latency improvements with zlib,
> > between
> > zlib dynamic and zlib canned modes, and the compression ratio for 
> > each mode while using a set of 4300 4KB pages sampled from SPEC 
> > CPU17 workloads:
> > _________________________________________________________
> > | Zlib Level |  Canned Latency Gain  |    Comp Ratio    |
> > |------------|-----------------------|------------------|
> > |            | compress | decompress | dynamic | canned |
> > |____________|__________|____________|_________|________|
> > |     1      |    49%   |    29%     |  3.16   |  2.92  |
> > |------------|----------|------------|---------|--------|
> > |     6      |    27%   |    28%     |  3.35   |  3.09  |
> > |------------|----------|------------|---------|--------|
> > |     9      |    12%   |    29%     |  3.36   |  3.11  |
> > |____________|__________|____________|_________|________|
> 
> So which kernel user (zswap I presume) is clamouring for this
> feature? We don't add new algorithms that have no in-kernel
> users.  So we need to be sure that the kernel user actually
> wants this.
> 
> Thanks,

Hi Herbert,
We have recently submitted an RFC for by_n compression with Intel IAA
[1] to the zswap and zram maintainers and users for feedback. This work
is in support of efforts to swap in/out large and multi-sized folios.
With by_n compression, we have created a scheme that allows parallel
IAA compression and decompression operations on a single folio,
resulting in performance gains. Currently the by_n scheme uses the
canned mode compression algorithm to perform the compression and
decompression operations. Using canned mode compression reduces
compression latency, because the deflate header doesn't need to be
created dynamically, while also producing a better ratio than Deflate
fixed mode. We would appreciate your feedback on this scheme.
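
As a conceptual sketch of the by_n idea (the helper names below are
hypothetical stand-ins, not the driver's actual API):

/* Conceptual sketch only: a 64KB folio is split into n equal chunks,
 * each compressed as an independent canned-mode stream. The hardware
 * processes the n operations in parallel; here submission is shown as
 * a loop. iaa_submit_compress() and iaa_wait_all() are hypothetical.
 */
#define FOLIO_SIZE	(64 * 1024)

struct chunk_result {
	void *dst;		/* per-chunk output buffer */
	unsigned int dst_len;	/* compressed length on completion */
};

extern int iaa_submit_compress(const void *src, unsigned int len,
			       struct chunk_result *out);	/* hypothetical */
extern int iaa_wait_all(unsigned int n,
			struct chunk_result *res);		/* hypothetical */

static int by_n_compress(const void *folio, unsigned int n,
			 struct chunk_result *res)
{
	unsigned int chunk = FOLIO_SIZE / n;
	unsigned int i;

	for (i = 0; i < n; i++) {
		/* Independent streams, so all n can be in flight at once. */
		int ret = iaa_submit_compress((const char *)folio + i * chunk,
					      chunk, &res[i]);
		if (ret)
			return ret;
	}
	return iaa_wait_all(n, res);
}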

Here is data from the RFC showing a performance comparison for 64KB
folio swap in/out with zram on Sapphire Rapids, with the core frequency
fixed at 2500MHz:
+------------+-------------+---------+-------------+----------+-------+
|            | Compression | Decomp  | Compression | zram     | zram  |
| Algorithm  | latency     | latency | ratio       | write    | read  |
+------------+-------------+---------+-------------+----------+-------+
|            |       Median (ns)     |             |      Median (ns) |
+------------+-------------+---------+-------------+----------+-------+
|            |             |         |             |          |       |
| IAA by_1   | 34,493      | 20,038  | 2.93        | 40,130   | 24,478|
| IAA by_2   | 18,830      | 11,888  | 2.93        | 24,149   | 15,536|
| IAA by_4   | 11,364      |  8,146  | 2.90        | 16,735   | 11,469|
| IAA by_8   |  8,344      |  6,342  | 2.77        | 13,527   |  9,177|
| IAA by_16  |  8,837      |  6,549  | 2.33        | 15,309   |  9,547|
| IAA by_32  | 11,153      |  9,641  | 2.19        | 16,457   | 14,086|
| IAA by_64  | 18,272      | 16,696  | 1.96        | 24,294   | 20,048|
|            |             |         |             |          |       |
| lz4        | 139,190     | 33,687  | 2.40        | 144,940  | 37,312|
|            |             |         |             |          |       |
| lzo-rle    | 138,235     | 61,055  | 2.52        | 143,666  | 64,321|
|            |             |         |             |          |       |
| zstd       | 251,820     | 90,878  | 3.40        | 256,384  | 94,328|
+------------+-------------+---------+-------------+----------+-------+

[1] https://lore.kernel.org/all/cover.1714581792.git.andre.glover@linux.intel.com/