diff mbox series

[v2,2/3] lib: zstd: Don't inline functions in zstd_opt.c

Message ID 20211117201459.1194876-3-nickrterrell@gmail.com (mailing list archive)
State Superseded
Headers show
Series Fix stack usage on parisc & improve code size bloat | expand

Commit Message

Nick Terrell Nov. 17, 2021, 8:14 p.m. UTC
From: Nick Terrell <terrelln@fb.com>

`zstd_opt.c` contains the match finder for the highest compression
levels. These levels are already very slow, and are unlikely to be used
in the kernel. If they are used, they shouldn't be used in latency
sensitive workloads, so slowing them down shouldn't be a big deal.

This saves 188 KB of the 288 KB regression reported by Geert Uytterhoeven [0].
I've also opened an issue upstream [1] so that we can properly tackle
the code size issue in `zstd_opt.c` for all users, and can hopefully
remove this hack in the next zstd version we import.

Bloat-o-meter output on x86-64:

```
> ../scripts/bloat-o-meter vmlinux.old vmlinux
add/remove: 6/5 grow/shrink: 1/9 up/down: 16673/-209939 (-193266)
Function                                     old     new   delta
ZSTD_compressBlock_opt_generic.constprop       -    7559   +7559
ZSTD_insertBtAndGetAllMatches                  -    6304   +6304
ZSTD_insertBt1                                 -    1731   +1731
ZSTD_storeSeq                                  -     693    +693
ZSTD_BtGetAllMatches                           -     255    +255
ZSTD_updateRep                                 -     128    +128
ZSTD_updateTree                               96      99      +3
ZSTD_insertAndFindFirstIndexHash3             81       -     -81
ZSTD_setBasePrices.constprop                  98       -     -98
ZSTD_litLengthPrice.constprop                138       -    -138
ZSTD_count                                   362     181    -181
ZSTD_count_2segments                        1407     938    -469
ZSTD_insertBt1.constprop                    2689       -   -2689
ZSTD_compressBlock_btultra2                19990     423  -19567
ZSTD_compressBlock_btultra                 19633      15  -19618
ZSTD_initStats_ultra                       19825       -  -19825
ZSTD_compressBlock_btopt                   20374      12  -20362
ZSTD_compressBlock_btopt_extDict           29984      12  -29972
ZSTD_compressBlock_btultra_extDict         30718      15  -30703
ZSTD_compressBlock_btopt_dictMatchState    32689      12  -32677
ZSTD_compressBlock_btultra_dictMatchState   33574      15  -33559
Total: Before=6611828, After=6418562, chg -2.92%
```

[0] https://lkml.org/lkml/2021/11/14/189
[1] https://github.com/facebook/zstd/issues/2862

Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Nick Terrell <terrelln@fb.com>
---
 lib/zstd/common/compiler.h   |  7 +++++++
 lib/zstd/compress/zstd_opt.c | 14 +++++++++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)

Comments

Geert Uytterhoeven Nov. 18, 2021, 7:52 a.m. UTC | #1
Hi Nick,

On Wed, Nov 17, 2021 at 9:08 PM Nick Terrell <nickrterrell@gmail.com> wrote:
> `zstd_opt.c` contains the match finder for the highest compression
> levels. These levels are already very slow, and are unlikely to be used
> in the kernel. If they are used, they shouldn't be used in latency
> sensitive workloads, so slowing them down shouldn't be a big deal.
>
> This saves 188 KB of the 288 KB regression reported by Geert Uytterhoeven [0].
> I've also opened an issue upstream [1] so that we can properly tackle
> the code size issue in `zstd_opt.c` for all users, and can hopefully
> remove this hack in the next zstd version we import.
>
> Bloat-o-meter output on x86-64:
>
> ```
> > ../scripts/bloat-o-meter vmlinux.old vmlinux
> add/remove: 6/5 grow/shrink: 1/9 up/down: 16673/-209939 (-193266)
> Function                                     old     new   delta
> ZSTD_compressBlock_opt_generic.constprop       -    7559   +7559
> ZSTD_insertBtAndGetAllMatches                  -    6304   +6304
> ZSTD_insertBt1                                 -    1731   +1731
> ZSTD_storeSeq                                  -     693    +693
> ZSTD_BtGetAllMatches                           -     255    +255
> ZSTD_updateRep                                 -     128    +128
> ZSTD_updateTree                               96      99      +3
> ZSTD_insertAndFindFirstIndexHash3             81       -     -81
> ZSTD_setBasePrices.constprop                  98       -     -98
> ZSTD_litLengthPrice.constprop                138       -    -138
> ZSTD_count                                   362     181    -181
> ZSTD_count_2segments                        1407     938    -469
> ZSTD_insertBt1.constprop                    2689       -   -2689
> ZSTD_compressBlock_btultra2                19990     423  -19567
> ZSTD_compressBlock_btultra                 19633      15  -19618
> ZSTD_initStats_ultra                       19825       -  -19825
> ZSTD_compressBlock_btopt                   20374      12  -20362
> ZSTD_compressBlock_btopt_extDict           29984      12  -29972
> ZSTD_compressBlock_btultra_extDict         30718      15  -30703
> ZSTD_compressBlock_btopt_dictMatchState    32689      12  -32677
> ZSTD_compressBlock_btultra_dictMatchState   33574      15  -33559
> Total: Before=6611828, After=6418562, chg -2.92%
> ```
>
> [0] https://lkml.org/lkml/2021/11/14/189
> [1] https://github.com/facebook/zstd/issues/2862
>
> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
> Signed-off-by: Nick Terrell <terrelln@fb.com>

Thanks for your patch!

Impact on lib/zstd/zstd_compress.ko for atari_defconfig:

    add/remove: 5/4 grow/shrink: 1/9 up/down: 15392/-167214 (-151822)

Nice!

Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>

> --- a/lib/zstd/compress/zstd_opt.c
> +++ b/lib/zstd/compress/zstd_opt.c
> @@ -894,7 +906,7 @@ static void ZSTD_optLdm_processMatchCandidate(ZSTD_optLdm_t* optLdm, ZSTD_match_
>               */
>              U32 posOvershoot = currPosInBlock - optLdm->endPosInBlock;
>              ZSTD_optLdm_skipRawSeqStoreBytes(&optLdm->seqStore, posOvershoot);
> -        }
> +        }
>          ZSTD_opt_getNextMatchAndUpdateSeqStore(optLdm, currPosInBlock, remainingBytes);
>      }
>      ZSTD_optLdm_maybeAddMatch(matches, nbMatches, optLdm, currPosInBlock);

This change is unrelated. With that removed:
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
Nick Terrell Nov. 18, 2021, 8:22 a.m. UTC | #2
> On Nov 17, 2021, at 11:52 PM, Geert Uytterhoeven <geert@linux-m68k.org> wrote:
> 
> Hi Nick,
> 
> On Wed, Nov 17, 2021 at 9:08 PM Nick Terrell <nickrterrell@gmail.com> wrote:
>> `zstd_opt.c` contains the match finder for the highest compression
>> levels. These levels are already very slow, and are unlikely to be used
>> in the kernel. If they are used, they shouldn't be used in latency
>> sensitive workloads, so slowing them down shouldn't be a big deal.
>> 
>> This saves 188 KB of the 288 KB regression reported by Geert Uytterhoeven [0].
>> I've also opened an issue upstream [1] so that we can properly tackle
>> the code size issue in `zstd_opt.c` for all users, and can hopefully
>> remove this hack in the next zstd version we import.
>> 
>> Bloat-o-meter output on x86-64:
>> 
>> ```
>>> ../scripts/bloat-o-meter vmlinux.old vmlinux
>> add/remove: 6/5 grow/shrink: 1/9 up/down: 16673/-209939 (-193266)
>> Function                                     old     new   delta
>> ZSTD_compressBlock_opt_generic.constprop       -    7559   +7559
>> ZSTD_insertBtAndGetAllMatches                  -    6304   +6304
>> ZSTD_insertBt1                                 -    1731   +1731
>> ZSTD_storeSeq                                  -     693    +693
>> ZSTD_BtGetAllMatches                           -     255    +255
>> ZSTD_updateRep                                 -     128    +128
>> ZSTD_updateTree                               96      99      +3
>> ZSTD_insertAndFindFirstIndexHash3             81       -     -81
>> ZSTD_setBasePrices.constprop                  98       -     -98
>> ZSTD_litLengthPrice.constprop                138       -    -138
>> ZSTD_count                                   362     181    -181
>> ZSTD_count_2segments                        1407     938    -469
>> ZSTD_insertBt1.constprop                    2689       -   -2689
>> ZSTD_compressBlock_btultra2                19990     423  -19567
>> ZSTD_compressBlock_btultra                 19633      15  -19618
>> ZSTD_initStats_ultra                       19825       -  -19825
>> ZSTD_compressBlock_btopt                   20374      12  -20362
>> ZSTD_compressBlock_btopt_extDict           29984      12  -29972
>> ZSTD_compressBlock_btultra_extDict         30718      15  -30703
>> ZSTD_compressBlock_btopt_dictMatchState    32689      12  -32677
>> ZSTD_compressBlock_btultra_dictMatchState   33574      15  -33559
>> Total: Before=6611828, After=6418562, chg -2.92%
>> ```
>> 
>> [0] https://lkml.org/lkml/2021/11/14/189
>> [1] https://github.com/facebook/zstd/issues/2862
>> 
>> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
>> Signed-off-by: Nick Terrell <terrelln@fb.com>
> 
> Thanks for your patch!
> 
> Impact on lib/zstd/zstd_compress.ko for atari_defconfig:
> 
>    add/remove: 5/4 grow/shrink: 1/9 up/down: 15392/-167214 (-151822)
> 
> Nice!
> 
> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
> 
>> --- a/lib/zstd/compress/zstd_opt.c
>> +++ b/lib/zstd/compress/zstd_opt.c
>> @@ -894,7 +906,7 @@ static void ZSTD_optLdm_processMatchCandidate(ZSTD_optLdm_t* optLdm, ZSTD_match_
>>              */
>>             U32 posOvershoot = currPosInBlock - optLdm->endPosInBlock;
>>             ZSTD_optLdm_skipRawSeqStoreBytes(&optLdm->seqStore, posOvershoot);
>> -        }
>> +        }
>>         ZSTD_opt_getNextMatchAndUpdateSeqStore(optLdm, currPosInBlock, remainingBytes);
>>     }
>>     ZSTD_optLdm_maybeAddMatch(matches, nbMatches, optLdm, currPosInBlock);
> 
> This change is unrelated. With that removed:
> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>

Thanks for the review! I will remove this change before submitting
the PR. Unless you have a strong preference, I won’t put up another
version of the patch set with just this change.

Best,
Nick Terrell

> Gr{oetje,eeting}s,
> 
>                        Geert
> 
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
> 
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like that.
>                                -- Linus Torvalds
diff mbox series

Patch

diff --git a/lib/zstd/common/compiler.h b/lib/zstd/common/compiler.h
index a1a051e4bce6..f5a9c70a228a 100644
--- a/lib/zstd/common/compiler.h
+++ b/lib/zstd/common/compiler.h
@@ -16,6 +16,7 @@ 
 *********************************************************/
 /* force inlining */
 
+#if !defined(ZSTD_NO_INLINE)
 #if (defined(__GNUC__) && !defined(__STRICT_ANSI__)) || defined(__cplusplus) || defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L   /* C99 */
 #  define INLINE_KEYWORD inline
 #else
@@ -24,6 +25,12 @@ 
 
 #define FORCE_INLINE_ATTR __attribute__((always_inline))
 
+#else
+
+#define INLINE_KEYWORD
+#define FORCE_INLINE_ATTR
+
+#endif
 
 /*
   On MSVC qsort requires that functions passed into it use the __cdecl calling conversion(CC).
diff --git a/lib/zstd/compress/zstd_opt.c b/lib/zstd/compress/zstd_opt.c
index 04337050fe9a..09483f518dc3 100644
--- a/lib/zstd/compress/zstd_opt.c
+++ b/lib/zstd/compress/zstd_opt.c
@@ -8,6 +8,18 @@ 
  * You may select, at your option, one of the above-listed licenses.
  */
 
+/*
+ * Disable inlining for the optimal parser for the kernel build.
+ * It is unlikely to be used in the kernel, and where it is used
+ * latency shouldn't matter because it is very slow to begin with.
+ * We prefer a ~180KB binary size win over faster optimal parsing.
+ *
+ * TODO(https://github.com/facebook/zstd/issues/2862):
+ * Improve the code size of the optimal parser in general, so we
+ * don't need this hack for the kernel build.
+ */
+#define ZSTD_NO_INLINE 1
+
 #include "zstd_compress_internal.h"
 #include "hist.h"
 #include "zstd_opt.h"
@@ -894,7 +906,7 @@  static void ZSTD_optLdm_processMatchCandidate(ZSTD_optLdm_t* optLdm, ZSTD_match_
              */
             U32 posOvershoot = currPosInBlock - optLdm->endPosInBlock;
             ZSTD_optLdm_skipRawSeqStoreBytes(&optLdm->seqStore, posOvershoot);
-        } 
+        }
         ZSTD_opt_getNextMatchAndUpdateSeqStore(optLdm, currPosInBlock, remainingBytes);
     }
     ZSTD_optLdm_maybeAddMatch(matches, nbMatches, optLdm, currPosInBlock);