Message ID | 20240406182107.261472-2-jmaloy@redhat.com (mailing list archive)
---|---
State | Superseded
Delegated to: | Netdev Maintainers
Series | tcp: add support for SO_PEEK_OFF socket option
On Sat, Apr 6, 2024 at 8:21 PM <jmaloy@redhat.com> wrote:
>
> From: Jon Maloy <jmaloy@redhat.com>
>
> When reading received messages from a socket with MSG_PEEK, we may want
> to read the contents with an offset, like we can do with pread/preadv()
> when reading files. Currently, it is not possible to do that.
>
> In this commit, we add support for the SO_PEEK_OFF socket option for TCP,
> similar to the way it is done for Unix domain sockets.
>
> In the iperf3 log examples shown below, we can observe a throughput
> improvement of 15-20% in the direction host->namespace when using the
> protocol splicer 'pasta' (https://passt.top).
> This is a consistent result.
>
> pasta(1) and passt(1) implement user-mode networking for network
> namespaces (containers) and virtual machines by means of a translation
> layer between a Layer-2 network interface and native Layer-4 sockets
> (TCP, UDP, ICMP/ICMPv6 echo).
>
> Received, pending TCP data to the container/guest is kept in kernel
> buffers until acknowledged, so the tool routinely needs to fetch new
> data from the socket, skipping data that was already sent.
>
> At the moment this is implemented using a dummy buffer passed to
> recvmsg(). With this change, we don't need a dummy buffer and the
> related buffer copy (copy_to_user()) anymore.
>
> passt and pasta are supported in KubeVirt and libvirt/qemu.
>
> j
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 52084
> [  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 52098
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-1.00   sec  1.32 GBytes  11.3 Gbits/sec
> [  5]   1.00-2.00   sec  1.19 GBytes  10.2 Gbits/sec
> [  5]   2.00-3.00   sec  1.26 GBytes  10.8 Gbits/sec
> [  5]   3.00-4.00   sec  1.36 GBytes  11.7 Gbits/sec
> [  5]   4.00-5.00   sec  1.33 GBytes  11.4 Gbits/sec
> [  5]   5.00-6.00   sec  1.21 GBytes  10.4 Gbits/sec
> [  5]   6.00-7.00   sec  1.31 GBytes  11.2 Gbits/sec
> [  5]   7.00-8.00   sec  1.25 GBytes  10.7 Gbits/sec
> [  5]   8.00-9.00   sec  1.33 GBytes  11.5 Gbits/sec
> [  5]   9.00-10.00  sec  1.24 GBytes  10.7 Gbits/sec
> [  5]  10.00-10.04  sec  56.0 MBytes  12.1 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-10.04  sec  12.9 GBytes  11.0 Gbits/sec  receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
> ^Ciperf3: interrupt - the server has terminated
> logout
> [ perf record: Woken up 20 times to write data ]
> [ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
> jmaloy@freyr:~/passt$
>
> The perf record confirms this result. Below, we can observe that the
> CPU spends significantly less time in the function ____sys_recvmsg()
> when we have offset support.
>
> Without offset support:
> ----------------------
> jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \
>     -p ____sys_recvmsg -x --stdio -i perf.data | head -1
>     46.32%  0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg
>
> With offset support:
> ----------------------
> jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \
>     -p ____sys_recvmsg -x --stdio -i perf.data | head -1
>     28.12%  0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg
>
> Suggested-by: Paolo Abeni <pabeni@redhat.com>
> Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
>
> ---
> v3: - Applied changes suggested by Stefano Brivio and Paolo Abeni
> v4: - Same as v3. Posting was delayed because I first had to debug
>       an issue that turned out to not be directly related to this
>       change. See next commit in this series.

This other issue is orthogonal, and might take more time.

SO_RCVLOWAT had a similar issue, please take a look at what we did there.

If you need SO_PEEK_OFF support, I would suggest you submit this patch
as a standalone one.

Reviewed-by: Eric Dumazet <edumazet@google.com>

Thanks.
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 55bd72997b31..a7cfeda28bb2 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1072,6 +1072,7 @@ const struct proto_ops inet_stream_ops = {
 #endif
 	.splice_eof	   = inet_splice_eof,
 	.splice_read	   = tcp_splice_read,
+	.set_peek_off	   = sk_set_peek_off,
 	.read_sock	   = tcp_read_sock,
 	.read_skb	   = tcp_read_skb,
 	.sendmsg_locked    = tcp_sendmsg_locked,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 92ee60492314..c0d6fd576d32 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1416,8 +1416,6 @@ static int tcp_peek_sndq(struct sock *sk, struct msghdr *msg, int len)
 	struct sk_buff *skb;
 	int copied = 0, err = 0;
 
-	/* XXX -- need to support SO_PEEK_OFF */
-
 	skb_rbtree_walk(skb, &sk->tcp_rtx_queue) {
 		err = skb_copy_datagram_msg(skb, 0, msg, skb->len);
 		if (err)
@@ -2328,6 +2326,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 	int target;		/* Read at least this many bytes */
 	long timeo;
 	struct sk_buff *skb, *last;
+	u32 peek_offset = 0;
 	u32 urg_hole = 0;
 
 	err = -ENOTCONN;
@@ -2361,7 +2360,8 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 
 	seq = &tp->copied_seq;
 	if (flags & MSG_PEEK) {
-		peek_seq = tp->copied_seq;
+		peek_offset = max(sk_peek_offset(sk, flags), 0);
+		peek_seq = tp->copied_seq + peek_offset;
 		seq = &peek_seq;
 	}
 
@@ -2464,11 +2464,11 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 		}
 
 		if ((flags & MSG_PEEK) &&
-		    (peek_seq - copied - urg_hole != tp->copied_seq)) {
+		    (peek_seq - peek_offset - copied - urg_hole != tp->copied_seq)) {
 			net_dbg_ratelimited("TCP(%s:%d): Application bug, race in MSG_PEEK\n",
 					    current->comm,
 					    task_pid_nr(current));
-			peek_seq = tp->copied_seq;
+			peek_seq = tp->copied_seq + peek_offset;
 		}
 		continue;
 
@@ -2509,7 +2509,10 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 		WRITE_ONCE(*seq, *seq + used);
 		copied += used;
 		len -= used;
-
+		if (flags & MSG_PEEK)
+			sk_peek_offset_fwd(sk, used);
+		else
+			sk_peek_offset_bwd(sk, used);
 
 		tcp_rcv_space_adjust(sk);
 
 skip_copy:
@@ -3010,6 +3013,7 @@ int tcp_disconnect(struct sock *sk, int flags)
 	__skb_queue_purge(&sk->sk_receive_queue);
 	WRITE_ONCE(tp->copied_seq, tp->rcv_nxt);
 	WRITE_ONCE(tp->urg_data, 0);
+	sk_set_peek_off(sk, -1);
 	tcp_write_queue_purge(sk);
 	tcp_fastopen_active_disable_ofo_check(sk);
 	skb_rbtree_purge(&tp->out_of_order_queue);