diff mbox series

[net-next,09/15] net: increase MAX_SKB_FRAGS

Message ID 20220203015140.3022854-10-eric.dumazet@gmail.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series tcp: BIG TCP implementation | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit fail Errors and warnings before: 5979 this patch: 3782
netdev/cc_maintainers warning 1 maintainers not CCed: keescook@chromium.org
netdev/build_clang fail Errors and warnings before: 879 this patch: 889
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn fail Errors and warnings before: 6130 this patch: 4795
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 20 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Eric Dumazet Feb. 3, 2022, 1:51 a.m. UTC
From: Eric Dumazet <edumazet@google.com>

Currently, MAX_SKB_FRAGS value is 17.

For standard tcp sendmsg() traffic, no big deal because tcp_sendmsg()
attempts order-3 allocations, stuffing 32768 bytes per frag.

But with zero copy, we use order-0 pages.

For BIG TCP to show its full potential, we increase MAX_SKB_FRAGS
to be able to fit 45 segments per skb.

This is also needed for BIG TCP rx zerocopy, as zerocopy currently
does not support skbs with frag list.

We have used this MAX_SKB_FRAGS value for years at Google before
we deployed 4K MTU, with no adverse effect.
Back then, goal was to be able to receive full size (64KB) GRO
packets without the frag_list overhead.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/skbuff.h | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

Comments

kernel test robot Feb. 3, 2022, 5:02 a.m. UTC | #1
Hi Eric,

I love your patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Eric-Dumazet/tcp-BIG-TCP-implementation/20220203-095336
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 52dae93f3bad842c6d585700460a0dea4d70e096
config: arc-randconfig-r043-20220130 (https://download.01.org/0day-ci/archive/20220203/202202031206.1nNLT568-lkp@intel.com/config)
compiler: arc-elf-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/64ec6b0260be94b2ed90ee6d139591bdbd49c82d
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Eric-Dumazet/tcp-BIG-TCP-implementation/20220203-095336
        git checkout 64ec6b0260be94b2ed90ee6d139591bdbd49c82d
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=arc SHELL=/bin/bash kernel/bpf/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from include/linux/container_of.h:5,
                    from include/linux/list.h:5,
                    from include/linux/rculist.h:10,
                    from include/linux/pid.h:5,
                    from include/linux/sched.h:14,
                    from include/linux/ptrace.h:6,
                    from include/uapi/asm-generic/bpf_perf_event.h:4,
                    from ./arch/arc/include/generated/uapi/asm/bpf_perf_event.h:1,
                    from include/uapi/linux/bpf_perf_event.h:11,
                    from kernel/bpf/btf.c:6:
>> include/linux/build_bug.h:78:41: error: static assertion failed: "BITS_PER_LONG >= NR_MSG_FRAG_IDS"
      78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
         |                                         ^~~~~~~~~~~~~~
   include/linux/build_bug.h:77:34: note: in expansion of macro '__static_assert'
      77 | #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
         |                                  ^~~~~~~~~~~~~~~
   include/linux/skmsg.h:41:1: note: in expansion of macro 'static_assert'
      41 | static_assert(BITS_PER_LONG >= NR_MSG_FRAG_IDS);
         | ^~~~~~~~~~~~~
   kernel/bpf/btf.c: In function 'btf_seq_show':
   kernel/bpf/btf.c:6049:29: warning: function 'btf_seq_show' might be a candidate for 'gnu_printf' format attribute [-Wsuggest-attribute=format]
    6049 |         seq_vprintf((struct seq_file *)show->target, fmt, args);
         |                             ^~~~~~~~
   kernel/bpf/btf.c: In function 'btf_snprintf_show':
   kernel/bpf/btf.c:6086:9: warning: function 'btf_snprintf_show' might be a candidate for 'gnu_printf' format attribute [-Wsuggest-attribute=format]
    6086 |         len = vsnprintf(show->target, ssnprintf->len_left, fmt, args);
         |         ^~~


vim +78 include/linux/build_bug.h

bc6245e5efd70c4 Ian Abbott       2017-07-10  60  
6bab69c65013bed Rasmus Villemoes 2019-03-07  61  /**
6bab69c65013bed Rasmus Villemoes 2019-03-07  62   * static_assert - check integer constant expression at build time
6bab69c65013bed Rasmus Villemoes 2019-03-07  63   *
6bab69c65013bed Rasmus Villemoes 2019-03-07  64   * static_assert() is a wrapper for the C11 _Static_assert, with a
6bab69c65013bed Rasmus Villemoes 2019-03-07  65   * little macro magic to make the message optional (defaulting to the
6bab69c65013bed Rasmus Villemoes 2019-03-07  66   * stringification of the tested expression).
6bab69c65013bed Rasmus Villemoes 2019-03-07  67   *
6bab69c65013bed Rasmus Villemoes 2019-03-07  68   * Contrary to BUILD_BUG_ON(), static_assert() can be used at global
6bab69c65013bed Rasmus Villemoes 2019-03-07  69   * scope, but requires the expression to be an integer constant
6bab69c65013bed Rasmus Villemoes 2019-03-07  70   * expression (i.e., it is not enough that __builtin_constant_p() is
6bab69c65013bed Rasmus Villemoes 2019-03-07  71   * true for expr).
6bab69c65013bed Rasmus Villemoes 2019-03-07  72   *
6bab69c65013bed Rasmus Villemoes 2019-03-07  73   * Also note that BUILD_BUG_ON() fails the build if the condition is
6bab69c65013bed Rasmus Villemoes 2019-03-07  74   * true, while static_assert() fails the build if the expression is
6bab69c65013bed Rasmus Villemoes 2019-03-07  75   * false.
6bab69c65013bed Rasmus Villemoes 2019-03-07  76   */
6bab69c65013bed Rasmus Villemoes 2019-03-07  77  #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
6bab69c65013bed Rasmus Villemoes 2019-03-07 @78  #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
6bab69c65013bed Rasmus Villemoes 2019-03-07  79  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Eric Dumazet Feb. 3, 2022, 5:20 a.m. UTC | #2
On Wed, Feb 2, 2022 at 9:02 PM kernel test robot <lkp@intel.com> wrote:
>
> Hi Eric,
>
> I love your patch! Yet something to improve:
>
> [auto build test ERROR on net-next/master]
>
> url:    https://github.com/0day-ci/linux/commits/Eric-Dumazet/tcp-BIG-TCP-implementation/20220203-095336
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 52dae93f3bad842c6d585700460a0dea4d70e096
> config: arc-randconfig-r043-20220130 (https://download.01.org/0day-ci/archive/20220203/202202031206.1nNLT568-lkp@intel.com/config)
> compiler: arc-elf-gcc (GCC) 11.2.0
> reproduce (this is a W=1 build):
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # https://github.com/0day-ci/linux/commit/64ec6b0260be94b2ed90ee6d139591bdbd49c82d
>         git remote add linux-review https://github.com/0day-ci/linux
>         git fetch --no-tags linux-review Eric-Dumazet/tcp-BIG-TCP-implementation/20220203-095336
>         git checkout 64ec6b0260be94b2ed90ee6d139591bdbd49c82d
>         # save the config file to linux build tree
>         mkdir build_dir
>         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=arc SHELL=/bin/bash kernel/bpf/
>
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot <lkp@intel.com>
>
> All errors (new ones prefixed by >>):
>
>    In file included from include/linux/container_of.h:5,
>                     from include/linux/list.h:5,
>                     from include/linux/rculist.h:10,
>                     from include/linux/pid.h:5,
>                     from include/linux/sched.h:14,
>                     from include/linux/ptrace.h:6,
>                     from include/uapi/asm-generic/bpf_perf_event.h:4,
>                     from ./arch/arc/include/generated/uapi/asm/bpf_perf_event.h:1,
>                     from include/uapi/linux/bpf_perf_event.h:11,
>                     from kernel/bpf/btf.c:6:
> >> include/linux/build_bug.h:78:41: error: static assertion failed: "BITS_PER_LONG >= NR_MSG_FRAG_IDS"
>       78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
>          |                                         ^~~~~~~~~~~~~~
>    include/linux/build_bug.h:77:34: note: in expansion of macro '__static_assert'
>       77 | #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
>          |                                  ^~~~~~~~~~~~~~~
>    include/linux/skmsg.h:41:1: note: in expansion of macro 'static_assert'
>       41 | static_assert(BITS_PER_LONG >= NR_MSG_FRAG_IDS);

Not clear why we have this assertion. Do we use a bitmap in an
"unsigned long" in skmsg ?

We could still use the old 17 limit for 32bit arches/builds.

>          | ^~~~~~~~~~~~~
>    kernel/bpf/btf.c: In function 'btf_seq_show':
>    kernel/bpf/btf.c:6049:29: warning: function 'btf_seq_show' might be a candidate for 'gnu_printf' format attribute [-Wsuggest-attribute=format]
>     6049 |         seq_vprintf((struct seq_file *)show->target, fmt, args);
>          |                             ^~~~~~~~
>    kernel/bpf/btf.c: In function 'btf_snprintf_show':
>    kernel/bpf/btf.c:6086:9: warning: function 'btf_snprintf_show' might be a candidate for 'gnu_printf' format attribute [-Wsuggest-attribute=format]
>     6086 |         len = vsnprintf(show->target, ssnprintf->len_left, fmt, args);
>          |         ^~~
>
>
> vim +78 include/linux/build_bug.h
>
> bc6245e5efd70c4 Ian Abbott       2017-07-10  60
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  61  /**
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  62   * static_assert - check integer constant expression at build time
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  63   *
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  64   * static_assert() is a wrapper for the C11 _Static_assert, with a
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  65   * little macro magic to make the message optional (defaulting to the
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  66   * stringification of the tested expression).
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  67   *
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  68   * Contrary to BUILD_BUG_ON(), static_assert() can be used at global
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  69   * scope, but requires the expression to be an integer constant
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  70   * expression (i.e., it is not enough that __builtin_constant_p() is
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  71   * true for expr).
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  72   *
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  73   * Also note that BUILD_BUG_ON() fails the build if the condition is
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  74   * true, while static_assert() fails the build if the expression is
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  75   * false.
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  76   */
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  77  #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
> 6bab69c65013bed Rasmus Villemoes 2019-03-07 @78  #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
> 6bab69c65013bed Rasmus Villemoes 2019-03-07  79
>
> ---
> 0-DAY CI Kernel Test Service, Intel Corporation
> https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
kernel test robot Feb. 3, 2022, 5:23 a.m. UTC | #3
Hi Eric,

I love your patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Eric-Dumazet/tcp-BIG-TCP-implementation/20220203-095336
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 52dae93f3bad842c6d585700460a0dea4d70e096
config: hexagon-randconfig-r045-20220130 (https://download.01.org/0day-ci/archive/20220203/202202031315.B425Ipe8-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project a73e4ce6a59b01f0e37037761c1e6889d539d233)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/64ec6b0260be94b2ed90ee6d139591bdbd49c82d
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Eric-Dumazet/tcp-BIG-TCP-implementation/20220203-095336
        git checkout 64ec6b0260be94b2ed90ee6d139591bdbd49c82d
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash kernel/bpf/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from kernel/bpf/btf.c:22:
>> include/linux/skmsg.h:41:1: error: static_assert failed due to requirement '32 >= (45UL + 1)' "BITS_PER_LONG >= NR_MSG_FRAG_IDS"
   static_assert(BITS_PER_LONG >= NR_MSG_FRAG_IDS);
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:77:34: note: expanded from macro 'static_assert'
   #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
                                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:78:41: note: expanded from macro '__static_assert'
   #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
                                           ^              ~~~~
   1 error generated.


vim +41 include/linux/skmsg.h

604326b41a6fb9 Daniel Borkmann 2018-10-13  25  
604326b41a6fb9 Daniel Borkmann 2018-10-13  26  struct sk_msg_sg {
604326b41a6fb9 Daniel Borkmann 2018-10-13  27  	u32				start;
604326b41a6fb9 Daniel Borkmann 2018-10-13  28  	u32				curr;
604326b41a6fb9 Daniel Borkmann 2018-10-13  29  	u32				end;
604326b41a6fb9 Daniel Borkmann 2018-10-13  30  	u32				size;
604326b41a6fb9 Daniel Borkmann 2018-10-13  31  	u32				copybreak;
163ab96b52ae2b Jakub Kicinski  2019-10-06  32  	unsigned long			copy;
031097d9e079e4 Jakub Kicinski  2019-11-27  33  	/* The extra two elements:
031097d9e079e4 Jakub Kicinski  2019-11-27  34  	 * 1) used for chaining the front and sections when the list becomes
031097d9e079e4 Jakub Kicinski  2019-11-27  35  	 *    partitioned (e.g. end < start). The crypto APIs require the
031097d9e079e4 Jakub Kicinski  2019-11-27  36  	 *    chaining;
031097d9e079e4 Jakub Kicinski  2019-11-27  37  	 * 2) to chain tailer SG entries after the message.
d3b18ad31f93d0 John Fastabend  2018-10-13  38  	 */
031097d9e079e4 Jakub Kicinski  2019-11-27  39  	struct scatterlist		data[MAX_MSG_FRAGS + 2];
604326b41a6fb9 Daniel Borkmann 2018-10-13  40  };
031097d9e079e4 Jakub Kicinski  2019-11-27 @41  static_assert(BITS_PER_LONG >= NR_MSG_FRAG_IDS);
604326b41a6fb9 Daniel Borkmann 2018-10-13  42  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Jakub Kicinski Feb. 3, 2022, 5:31 a.m. UTC | #4
On Wed, 2 Feb 2022 21:20:32 -0800 Eric Dumazet wrote:
> Not clear why we have this assertion. Do we use a bitmap in an
> "unsigned long" in skmsg ?
> 
> We could still use the old 17 limit for 32bit arches/builds.

git blame points at me but I just adjusted it. Looks like its
struct sk_msg_sg::copy that's the reason. On a quick look we 
can make it an array of unsigned longs without a problem.
kernel test robot Feb. 3, 2022, 5:43 a.m. UTC | #5
Hi Eric,

I love your patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Eric-Dumazet/tcp-BIG-TCP-implementation/20220203-095336
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 52dae93f3bad842c6d585700460a0dea4d70e096
config: arc-allyesconfig (https://download.01.org/0day-ci/archive/20220203/202202031344.0FFfnywX-lkp@intel.com/config)
compiler: arceb-elf-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/64ec6b0260be94b2ed90ee6d139591bdbd49c82d
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Eric-Dumazet/tcp-BIG-TCP-implementation/20220203-095336
        git checkout 64ec6b0260be94b2ed90ee6d139591bdbd49c82d
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=arc SHELL=/bin/bash drivers/net/ethernet/3com/ drivers/net/ethernet/agere/ drivers/net/ethernet/mellanox/mlx5/core/ drivers/net/wireguard/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   drivers/net/wireguard/send.c: In function 'encrypt_packet':
>> drivers/net/wireguard/send.c:219:1: warning: the frame size of 1064 bytes is larger than 1024 bytes [-Wframe-larger-than=]
     219 | }
         | ^
--
   drivers/net/wireguard/receive.c: In function 'decrypt_packet':
>> drivers/net/wireguard/receive.c:299:1: warning: the frame size of 1064 bytes is larger than 1024 bytes [-Wframe-larger-than=]
     299 | }
         | ^
--
>> drivers/net/ethernet/3com/typhoon.c:142:2: warning: #warning Typhoon only supports 32 entries in its SG list for TSO, disabling TSO [-Wcpp]
     142 | #warning Typhoon only supports 32 entries in its SG list for TSO, disabling TSO
         |  ^~~~~~~


vim +219 drivers/net/wireguard/send.c

e7096c131e5161 Jason A. Donenfeld 2019-12-09  161  
e7096c131e5161 Jason A. Donenfeld 2019-12-09  162  static bool encrypt_packet(struct sk_buff *skb, struct noise_keypair *keypair)
e7096c131e5161 Jason A. Donenfeld 2019-12-09  163  {
e7096c131e5161 Jason A. Donenfeld 2019-12-09  164  	unsigned int padding_len, plaintext_len, trailer_len;
e7096c131e5161 Jason A. Donenfeld 2019-12-09  165  	struct scatterlist sg[MAX_SKB_FRAGS + 8];
e7096c131e5161 Jason A. Donenfeld 2019-12-09  166  	struct message_data *header;
e7096c131e5161 Jason A. Donenfeld 2019-12-09  167  	struct sk_buff *trailer;
e7096c131e5161 Jason A. Donenfeld 2019-12-09  168  	int num_frags;
e7096c131e5161 Jason A. Donenfeld 2019-12-09  169  
c78a0b4a78839d Jason A. Donenfeld 2020-05-19  170  	/* Force hash calculation before encryption so that flow analysis is
c78a0b4a78839d Jason A. Donenfeld 2020-05-19  171  	 * consistent over the inner packet.
c78a0b4a78839d Jason A. Donenfeld 2020-05-19  172  	 */
c78a0b4a78839d Jason A. Donenfeld 2020-05-19  173  	skb_get_hash(skb);
c78a0b4a78839d Jason A. Donenfeld 2020-05-19  174  
e7096c131e5161 Jason A. Donenfeld 2019-12-09  175  	/* Calculate lengths. */
e7096c131e5161 Jason A. Donenfeld 2019-12-09  176  	padding_len = calculate_skb_padding(skb);
e7096c131e5161 Jason A. Donenfeld 2019-12-09  177  	trailer_len = padding_len + noise_encrypted_len(0);
e7096c131e5161 Jason A. Donenfeld 2019-12-09  178  	plaintext_len = skb->len + padding_len;
e7096c131e5161 Jason A. Donenfeld 2019-12-09  179  
e7096c131e5161 Jason A. Donenfeld 2019-12-09  180  	/* Expand data section to have room for padding and auth tag. */
e7096c131e5161 Jason A. Donenfeld 2019-12-09  181  	num_frags = skb_cow_data(skb, trailer_len, &trailer);
e7096c131e5161 Jason A. Donenfeld 2019-12-09  182  	if (unlikely(num_frags < 0 || num_frags > ARRAY_SIZE(sg)))
e7096c131e5161 Jason A. Donenfeld 2019-12-09  183  		return false;
e7096c131e5161 Jason A. Donenfeld 2019-12-09  184  
e7096c131e5161 Jason A. Donenfeld 2019-12-09  185  	/* Set the padding to zeros, and make sure it and the auth tag are part
e7096c131e5161 Jason A. Donenfeld 2019-12-09  186  	 * of the skb.
e7096c131e5161 Jason A. Donenfeld 2019-12-09  187  	 */
e7096c131e5161 Jason A. Donenfeld 2019-12-09  188  	memset(skb_tail_pointer(trailer), 0, padding_len);
e7096c131e5161 Jason A. Donenfeld 2019-12-09  189  
e7096c131e5161 Jason A. Donenfeld 2019-12-09  190  	/* Expand head section to have room for our header and the network
e7096c131e5161 Jason A. Donenfeld 2019-12-09  191  	 * stack's headers.
e7096c131e5161 Jason A. Donenfeld 2019-12-09  192  	 */
e7096c131e5161 Jason A. Donenfeld 2019-12-09  193  	if (unlikely(skb_cow_head(skb, DATA_PACKET_HEAD_ROOM) < 0))
e7096c131e5161 Jason A. Donenfeld 2019-12-09  194  		return false;
e7096c131e5161 Jason A. Donenfeld 2019-12-09  195  
e7096c131e5161 Jason A. Donenfeld 2019-12-09  196  	/* Finalize checksum calculation for the inner packet, if required. */
e7096c131e5161 Jason A. Donenfeld 2019-12-09  197  	if (unlikely(skb->ip_summed == CHECKSUM_PARTIAL &&
e7096c131e5161 Jason A. Donenfeld 2019-12-09  198  		     skb_checksum_help(skb)))
e7096c131e5161 Jason A. Donenfeld 2019-12-09  199  		return false;
e7096c131e5161 Jason A. Donenfeld 2019-12-09  200  
e7096c131e5161 Jason A. Donenfeld 2019-12-09  201  	/* Only after checksumming can we safely add on the padding at the end
e7096c131e5161 Jason A. Donenfeld 2019-12-09  202  	 * and the header.
e7096c131e5161 Jason A. Donenfeld 2019-12-09  203  	 */
e7096c131e5161 Jason A. Donenfeld 2019-12-09  204  	skb_set_inner_network_header(skb, 0);
e7096c131e5161 Jason A. Donenfeld 2019-12-09  205  	header = (struct message_data *)skb_push(skb, sizeof(*header));
e7096c131e5161 Jason A. Donenfeld 2019-12-09  206  	header->header.type = cpu_to_le32(MESSAGE_DATA);
e7096c131e5161 Jason A. Donenfeld 2019-12-09  207  	header->key_idx = keypair->remote_index;
e7096c131e5161 Jason A. Donenfeld 2019-12-09  208  	header->counter = cpu_to_le64(PACKET_CB(skb)->nonce);
e7096c131e5161 Jason A. Donenfeld 2019-12-09  209  	pskb_put(skb, trailer, trailer_len);
e7096c131e5161 Jason A. Donenfeld 2019-12-09  210  
e7096c131e5161 Jason A. Donenfeld 2019-12-09  211  	/* Now we can encrypt the scattergather segments */
e7096c131e5161 Jason A. Donenfeld 2019-12-09  212  	sg_init_table(sg, num_frags);
e7096c131e5161 Jason A. Donenfeld 2019-12-09  213  	if (skb_to_sgvec(skb, sg, sizeof(struct message_data),
e7096c131e5161 Jason A. Donenfeld 2019-12-09  214  			 noise_encrypted_len(plaintext_len)) <= 0)
e7096c131e5161 Jason A. Donenfeld 2019-12-09  215  		return false;
e7096c131e5161 Jason A. Donenfeld 2019-12-09  216  	return chacha20poly1305_encrypt_sg_inplace(sg, plaintext_len, NULL, 0,
e7096c131e5161 Jason A. Donenfeld 2019-12-09  217  						   PACKET_CB(skb)->nonce,
e7096c131e5161 Jason A. Donenfeld 2019-12-09  218  						   keypair->sending.key);
e7096c131e5161 Jason A. Donenfeld 2019-12-09 @219  }
e7096c131e5161 Jason A. Donenfeld 2019-12-09  220  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Eric Dumazet Feb. 3, 2022, 6:35 a.m. UTC | #6
On Wed, Feb 2, 2022 at 9:31 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 2 Feb 2022 21:20:32 -0800 Eric Dumazet wrote:
> > Not clear why we have this assertion. Do we use a bitmap in an
> > "unsigned long" in skmsg ?
> >
> > We could still use the old 17 limit for 32bit arches/builds.
>
> git blame points at me but I just adjusted it. Looks like its
> struct sk_msg_sg::copy that's the reason. On a quick look we
> can make it an array of unsigned longs without a problem.

Oh right, thanks for the pointer.
Paolo Abeni Feb. 3, 2022, 4:01 p.m. UTC | #7
On Wed, 2022-02-02 at 17:51 -0800, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Currently, MAX_SKB_FRAGS value is 17.
> 
> For standard tcp sendmsg() traffic, no big deal because tcp_sendmsg()
> attempts order-3 allocations, stuffing 32768 bytes per frag.
> 
> But with zero copy, we use order-0 pages.
> 
> For BIG TCP to show its full potential, we increase MAX_SKB_FRAGS
> to be able to fit 45 segments per skb.
> 
> This is also needed for BIG TCP rx zerocopy, as zerocopy currently
> does not support skbs with frag list.
> 
> We have used this MAX_SKB_FRAGS value for years at Google before
> we deployed 4K MTU, with no adverse effect.
> Back then, goal was to be able to receive full size (64KB) GRO
> packets without the frag_list overhead.

IIRC, while backporting some changes to an older RHEL kernel, we had to
increase the skb overhead due to kabi issue.

That caused some measurable regressions because some drivers (e.g.
ixgbe) where not able any more to allocate multiple (skb) heads from
the same page. 

All the above subject to some noise - it's a fainting memory.

I'll try to do some tests with the H/W I have handy, but it could take
a little time due to conflicting scheduling here.

Thanks,

Paolo
Alexander Duyck Feb. 3, 2022, 5:26 p.m. UTC | #8
On Wed, 2022-02-02 at 17:51 -0800, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Currently, MAX_SKB_FRAGS value is 17.
> 
> For standard tcp sendmsg() traffic, no big deal because tcp_sendmsg()
> attempts order-3 allocations, stuffing 32768 bytes per frag.
> 
> But with zero copy, we use order-0 pages.
> 
> For BIG TCP to show its full potential, we increase MAX_SKB_FRAGS
> to be able to fit 45 segments per skb.
> 
> This is also needed for BIG TCP rx zerocopy, as zerocopy currently
> does not support skbs with frag list.
> 
> We have used this MAX_SKB_FRAGS value for years at Google before
> we deployed 4K MTU, with no adverse effect.
> Back then, goal was to be able to receive full size (64KB) GRO
> packets without the frag_list overhead.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

So a big issue I see with this patch is the potential queueing issues
it may introduce on Tx queues. I suspect it will cause a number of
performance regressions and deadlocks as it will change the Tx queueing
behavior for many NICs.

As I recall many of the Intel drivers are using MAX_SKB_FRAGS as one of
the ingredients for DESC_NEEDED in order to determine if the Tx queue
needs to stop. With this change the value for igb for instance is
jumping from 21 to 49, and the wake threshold is twice that, 98. As
such the minimum Tx descriptor threshold for the driver would need to
be updated beyond 80 otherwise it is likely to deadlock the first time
it has to pause.
Eric Dumazet Feb. 3, 2022, 5:34 p.m. UTC | #9
On Thu, Feb 3, 2022 at 9:26 AM Alexander H Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Wed, 2022-02-02 at 17:51 -0800, Eric Dumazet wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > Currently, MAX_SKB_FRAGS value is 17.
> >
> > For standard tcp sendmsg() traffic, no big deal because tcp_sendmsg()
> > attempts order-3 allocations, stuffing 32768 bytes per frag.
> >
> > But with zero copy, we use order-0 pages.
> >
> > For BIG TCP to show its full potential, we increase MAX_SKB_FRAGS
> > to be able to fit 45 segments per skb.
> >
> > This is also needed for BIG TCP rx zerocopy, as zerocopy currently
> > does not support skbs with frag list.
> >
> > We have used this MAX_SKB_FRAGS value for years at Google before
> > we deployed 4K MTU, with no adverse effect.
> > Back then, goal was to be able to receive full size (64KB) GRO
> > packets without the frag_list overhead.
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>
> So a big issue I see with this patch is the potential queueing issues
> it may introduce on Tx queues. I suspect it will cause a number of
> performance regressions and deadlocks as it will change the Tx queueing
> behavior for many NICs.
>
> As I recall many of the Intel drivers are using MAX_SKB_FRAGS as one of
> the ingredients for DESC_NEEDED in order to determine if the Tx queue
> needs to stop. With this change the value for igb for instance is
> jumping from 21 to 49, and the wake threshold is twice that, 98. As
> such the minimum Tx descriptor threshold for the driver would need to
> be updated beyond 80 otherwise it is likely to deadlock the first time
> it has to pause.

Are these limits hard coded in Intel drivers and firmware, or do you
think this can be changed ?

I could make  MAX_SKB_FRAGS a config option, and default to 17, until
all drivers have been fixed.

Alternative is that I remove this patch from the series and we apply
it to Google production kernels,
as we did before.
Alexander Duyck Feb. 3, 2022, 5:56 p.m. UTC | #10
On Thu, Feb 3, 2022 at 9:34 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Feb 3, 2022 at 9:26 AM Alexander H Duyck
> <alexander.duyck@gmail.com> wrote:
> >
> > On Wed, 2022-02-02 at 17:51 -0800, Eric Dumazet wrote:
> > > From: Eric Dumazet <edumazet@google.com>
> > >
> > > Currently, MAX_SKB_FRAGS value is 17.
> > >
> > > For standard tcp sendmsg() traffic, no big deal because tcp_sendmsg()
> > > attempts order-3 allocations, stuffing 32768 bytes per frag.
> > >
> > > But with zero copy, we use order-0 pages.
> > >
> > > For BIG TCP to show its full potential, we increase MAX_SKB_FRAGS
> > > to be able to fit 45 segments per skb.
> > >
> > > This is also needed for BIG TCP rx zerocopy, as zerocopy currently
> > > does not support skbs with frag list.
> > >
> > > We have used this MAX_SKB_FRAGS value for years at Google before
> > > we deployed 4K MTU, with no adverse effect.
> > > Back then, goal was to be able to receive full size (64KB) GRO
> > > packets without the frag_list overhead.
> > >
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> >
> > So a big issue I see with this patch is the potential queueing issues
> > it may introduce on Tx queues. I suspect it will cause a number of
> > performance regressions and deadlocks as it will change the Tx queueing
> > behavior for many NICs.
> >
> > As I recall many of the Intel drivers are using MAX_SKB_FRAGS as one of
> > the ingredients for DESC_NEEDED in order to determine if the Tx queue
> > needs to stop. With this change the value for igb for instance is
> > jumping from 21 to 49, and the wake threshold is twice that, 98. As
> > such the minimum Tx descriptor threshold for the driver would need to
> > be updated beyond 80 otherwise it is likely to deadlock the first time
> > it has to pause.
>
> Are these limits hard coded in Intel drivers and firmware, or do you
> think this can be changed ?

This is all code in the drivers. Most drivers have them as the logic
is used to avoid having to return NETIDEV_TX_BUSY. Basically the
assumption is there is a 1:1 correlation between descriptors and
individual frags. So most drivers would need to increase the size of
their Tx descriptor rings if they were optimized for a lower value.

The other thing is that most of the tuning for things like interrupt
moderation assume a certain fill level on the queues and those would
likely need to be updated to account for this change.

> I could make  MAX_SKB_FRAGS a config option, and default to 17, until
> all drivers have been fixed.
>
> Alternative is that I remove this patch from the series and we apply
> it to Google production kernels,
> as we did before.

A config option would probably be preferred. The big issue as I see it
is that changing MAX_SKB_FRAGS is going to have ripples throughout the
ecosystem as the shared info size will be increasing and the queueing
behavior for most drivers will be modified as a result.
Jakub Kicinski Feb. 3, 2022, 7:18 p.m. UTC | #11
On Thu, 3 Feb 2022 09:56:42 -0800 Alexander Duyck wrote:
> > I could make  MAX_SKB_FRAGS a config option, and default to 17, until
> > all drivers have been fixed.
> >
> > Alternative is that I remove this patch from the series and we apply
> > it to Google production kernels,
> > as we did before.  
> 
> A config option would probably be preferred. The big issue as I see it
> is that changing MAX_SKB_FRAGS is going to have ripples throughout the
> ecosystem as the shared info size will be increasing and the queueing
> behavior for most drivers will be modified as a result.

I'd vote for making the change and dealing with the fall out. Unlikely
many people would turn this knob otherwise and it's a major difference.
Better not to fork the characteristics of the stack, IMHO.
Eric Dumazet Feb. 3, 2022, 7:20 p.m. UTC | #12
On Thu, Feb 3, 2022 at 11:18 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 3 Feb 2022 09:56:42 -0800 Alexander Duyck wrote:
> > > I could make  MAX_SKB_FRAGS a config option, and default to 17, until
> > > all drivers have been fixed.
> > >
> > > Alternative is that I remove this patch from the series and we apply
> > > it to Google production kernels,
> > > as we did before.
> >
> > A config option would probably be preferred. The big issue as I see it
> > is that changing MAX_SKB_FRAGS is going to have ripples throughout the
> > ecosystem as the shared info size will be increasing and the queueing
> > behavior for most drivers will be modified as a result.
>
> I'd vote for making the change and dealing with the fall out. Unlikely
> many people would turn this knob otherwise and it's a major difference.
> Better not to fork the characteristics of the stack, IMHO.

Another issue with CONFIG_ options is that they are integer.

Trying the following did not work

#define MAX_SKB_FRAGS  ((unsigned long)CONFIG_MAX_SKB_FRAGS)

Because in some places we have

#if    (   MAX_SKB_FRAGS > ...)

(MAX_SKB_FRAGS is UL currently, making it an integer might cause some
signed/unsigned operations buggy)
Eric Dumazet Feb. 3, 2022, 7:54 p.m. UTC | #13
On Thu, Feb 3, 2022 at 11:20 AM Eric Dumazet <edumazet@google.com> wrote:

>
> Another issue with CONFIG_ options is that they are integer.
>
> Trying the following did not work
>
> #define MAX_SKB_FRAGS  ((unsigned long)CONFIG_MAX_SKB_FRAGS)
>
> Because in some places we have
>
> #if    (   MAX_SKB_FRAGS > ...)
>
> (MAX_SKB_FRAGS is UL currently, making it an integer might cause some
> signed/unsigned operations buggy)

I came to something like this, clearly this a bit ugly.

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 08c12c41c5a5907dccc7389f396394d8132d962e..cc3cac3ee109f95c8a51eb90ba4a3bf7bebe86eb
100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -323,7 +323,15 @@ enum skb_drop_reason {
        SKB_DROP_REASON_MAX,
 };

+#ifdef CONFIG_MAX_SKB_FRAGS_17
+#define MAX_SKB_FRAGS 17UL
+#endif
+#ifdef CONFIG_MAX_SKB_FRAGS_25
+#define MAX_SKB_FRAGS 25UL
+#endif
+#ifdef CONFIG_MAX_SKB_FRAGS_45
 #define MAX_SKB_FRAGS 45UL
+#endif

 extern int sysctl_max_skb_frags;

diff --git a/net/Kconfig b/net/Kconfig
index 8a1f9d0287de3c32040eee03b60114c6e6d150bc..d91027a654c2aad7bfa55152ef81c882bf394aff
100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -253,6 +253,29 @@ config PCPU_DEV_REFCNT
          network device refcount are using per cpu variables if this
option is set.
          This can be forced to N to detect underflows (with a
performance drop).

+choice
+       prompt "Maximum number of fragments per skb_shared_info"
+       default MAX_SKB_FRAGS_17
+
+config MAX_SKB_FRAGS_17
+       bool "17 fragments per skb_shared_info"
+       help
+         Some drivers have assumptions about MAX_SKB_FRAGS being 17.
+         Until they are fixed, it is safe to adopt the old limit.
+
+config MAX_SKB_FRAGS_25
+       bool "25 fragments per skb_shared_info"
+       help
+         Helps BIG TCP workloads, but might expose bugs in some legacy drivers.
+
+config MAX_SKB_FRAGS_45
+       bool "45 fragments per skb_shared_info"
+       help
+         Helps BIG TCP workloads, but might expose bugs in some legacy drivers.
+         This also increase memory overhead of small packets.
+
+endchoice
+
 config RPS
        bool
        depends on SMP && SYSFS
David Laight Feb. 4, 2022, 10:18 a.m. UTC | #14
From: Alexander Duyck
> Sent: 03 February 2022 17:57
...
> > > So a big issue I see with this patch is the potential queueing issues
> > > it may introduce on Tx queues. I suspect it will cause a number of
> > > performance regressions and deadlocks as it will change the Tx queueing
> > > behavior for many NICs.
> > >
> > > As I recall many of the Intel drivers are using MAX_SKB_FRAGS as one of
> > > the ingredients for DESC_NEEDED in order to determine if the Tx queue
> > > needs to stop. With this change the value for igb for instance is
> > > jumping from 21 to 49, and the wake threshold is twice that, 98. As
> > > such the minimum Tx descriptor threshold for the driver would need to
> > > be updated beyond 80 otherwise it is likely to deadlock the first time
> > > it has to pause.
> >
> > Are these limits hard coded in Intel drivers and firmware, or do you
> > think this can be changed ?
> 
> This is all code in the drivers. Most drivers have them as the logic
> is used to avoid having to return NETIDEV_TX_BUSY. Basically the
> assumption is there is a 1:1 correlation between descriptors and
> individual frags. So most drivers would need to increase the size of
> their Tx descriptor rings if they were optimized for a lower value.

Maybe the drivers can be a little less conservative about the number
of fragments they expect in the next message?
There is little point requiring 49 free descriptors when the workload
never has more than 2 or 3 fragments.

Clearly you don't want to re-enable things unless there are enough
descriptors for an skb that has generated NETDEV_TX_BUSY, but the
current logic of 'trying to never actually return NETDEV_TX_BUSY'
is probably over cautious.

Does Linux allow skb to have a lot of short fragments?
If dma_map isn't cheap (probably anything with an iommu or non-coherent
memory) them copying/merging short fragments into a pre-mapped
buffer can easily be faster.
Many years ago we found it was worth copying anything under 1k on
a sparc mbus+sbus system.
I don't think Linux can generate what I've seen elsewhere - the mac
driver being asked to transmit something with 1000+ one byte fragmemts!

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Alexander Duyck Feb. 4, 2022, 3:46 p.m. UTC | #15
On Fri, Feb 4, 2022 at 2:18 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Alexander Duyck
> > Sent: 03 February 2022 17:57
> ...
> > > > So a big issue I see with this patch is the potential queueing issues
> > > > it may introduce on Tx queues. I suspect it will cause a number of
> > > > performance regressions and deadlocks as it will change the Tx queueing
> > > > behavior for many NICs.
> > > >
> > > > As I recall many of the Intel drivers are using MAX_SKB_FRAGS as one of
> > > > the ingredients for DESC_NEEDED in order to determine if the Tx queue
> > > > needs to stop. With this change the value for igb for instance is
> > > > jumping from 21 to 49, and the wake threshold is twice that, 98. As
> > > > such the minimum Tx descriptor threshold for the driver would need to
> > > > be updated beyond 80 otherwise it is likely to deadlock the first time
> > > > it has to pause.
> > >
> > > Are these limits hard coded in Intel drivers and firmware, or do you
> > > think this can be changed ?
> >
> > This is all code in the drivers. Most drivers have them as the logic
> > is used to avoid having to return NETIDEV_TX_BUSY. Basically the
> > assumption is there is a 1:1 correlation between descriptors and
> > individual frags. So most drivers would need to increase the size of
> > their Tx descriptor rings if they were optimized for a lower value.
>
> Maybe the drivers can be a little less conservative about the number
> of fragments they expect in the next message?
> There is little point requiring 49 free descriptors when the workload
> never has more than 2 or 3 fragments.
>
> Clearly you don't want to re-enable things unless there are enough
> descriptors for an skb that has generated NETDEV_TX_BUSY, but the
> current logic of 'trying to never actually return NETDEV_TX_BUSY'
> is probably over cautious.

The problem is that NETDEV_TX_BUSY can cause all sorts of issues in
terms of the flow of packets. Basically when you start having to push
packets back from the device to the qdisc you can essentially create
head-of-line blocking type scenarios which can make things like
traffic shaping that much more difficult.

> Does Linux allow skb to have a lot of short fragments?
> If dma_map isn't cheap (probably anything with an iommu or non-coherent
> memory) them copying/merging short fragments into a pre-mapped
> buffer can easily be faster.

I know Linux skbs can have a lot of short fragments. The i40e has a
workaround for cases where more than 8 fragments are needed to
transmit a single frame for instance, see __i40e_chk_linearize().

> Many years ago we found it was worth copying anything under 1k on
> a sparc mbus+sbus system.
> I don't think Linux can generate what I've seen elsewhere - the mac
> driver being asked to transmit something with 1000+ one byte fragmemts!
>
>         David

Linux cannot generate the 1000+ fragments, mainly because it is
limited by the frags. However as I pointed out above it isn't uncommon
to see an skb composed of a number of smaller fragments.

That said, I don't know if we really need to be rewriting the code for
NETDEV_TX_BUSY handling on the drivers. It would just be a matter of
reserving more memory in the descriptor rings since the counts would
be going from 42 to 98 in order to unblock a Tx queue in the case of
igb for instance, and currently the minimum ring size is 80. So in
this case it would just be a matter of increasing the minimum so that
it cannot be configured into a deadlock.

Ultimately that is the trade-off with this approach. What we are doing
is increasing the memory footprint of the drivers and skbs in order to
allow for more buffering in the skb to increase throughput. I wonder
if it wouldn't make sense to just make MAX_SKB_FRAGS a driver level
setting like gso_max_size so that low end NICs out there aren't having
to reserve a ton of memory to store fragments they will never use.

- Alex
diff mbox series

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a27bcc4f7e9a92ea4f4f4f6e5f454bb4f8099f66..08c12c41c5a5907dccc7389f396394d8132d962e 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -323,18 +323,8 @@  enum skb_drop_reason {
 	SKB_DROP_REASON_MAX,
 };
 
-/* To allow 64K frame to be packed as single skb without frag_list we
- * require 64K/PAGE_SIZE pages plus 1 additional page to allow for
- * buffers which do not start on a page boundary.
- *
- * Since GRO uses frags we allocate at least 16 regardless of page
- * size.
- */
-#if (65536/PAGE_SIZE + 1) < 16
-#define MAX_SKB_FRAGS 16UL
-#else
-#define MAX_SKB_FRAGS (65536/PAGE_SIZE + 1)
-#endif
+#define MAX_SKB_FRAGS 45UL
+
 extern int sysctl_max_skb_frags;
 
 /* Set skb_shinfo(skb)->gso_size to this in case you want skb_segment to