
[bpf-next] bpf: add sock_ops callbacks for data send/recv/acked events

Message ID: 20231123030732.111576-1-lulie@linux.alibaba.com (mailing list archive)
State: Changes Requested
Delegated to: BPF
Series [bpf-next] bpf: add sock_ops callbacks for data send/recv/acked events

Checks

Context Check Description
bpf/vmtest-bpf-next-PR fail PR summary
netdev/series_format success Single patches do not need cover letters
netdev/codegen success Generated files up to date
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 2923 this patch: 2923
netdev/cc_maintainers fail 16 maintainers not CCed: ast@kernel.org yonghong.song@linux.dev song@kernel.org haoluo@google.com netdev@vger.kernel.org jolsa@kernel.org kpsingh@kernel.org dsahern@kernel.org edumazet@google.com pabeni@redhat.com kuba@kernel.org martin.lau@linux.dev andrii@kernel.org john.fastabend@gmail.com daniel@iogearbox.net sdf@google.com
netdev/build_clang success Errors and warnings before: 1290 this patch: 1290
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 3005 this patch: 3005
netdev/checkpatch warning CHECK: spaces preferred around that '<<' (ctx:VxV) WARNING: Block comments should align the * on each line WARNING: line length of 83 exceeds 80 columns WARNING: line length of 84 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns WARNING: please, no space before tabs
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-4 success Logs for aarch64-gcc / test (test_maps, false, 360) / test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-6 fail Logs for aarch64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-18 fail Logs for x86_64-gcc / test (test_progs, false, 360) / test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-20 success Logs for x86_64-gcc / test (test_progs_no_alu32_parallel, true, 30) / test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for x86_64-gcc / veristat / veristat on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for x86_64-gcc / test (test_maps, false, 360) / test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-9 success Logs for s390x-gcc / build / build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-14 success Logs for s390x-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-29 success Logs for x86_64-llvm-16 / veristat
bpf/vmtest-bpf-next-VM_Test-21 success Logs for x86_64-gcc / test (test_progs_parallel, true, 30) / test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-15 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-25 success Logs for x86_64-llvm-16 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-22 success Logs for x86_64-gcc / test (test_verifier, false, 360) / test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-16 success Logs for x86_64-gcc / build / build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-28 success Logs for x86_64-llvm-16 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-26 fail Logs for x86_64-llvm-16 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-27 fail Logs for x86_64-llvm-16 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-13 success Logs for s390x-gcc / test (test_verifier, false, 360) / test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-12 fail Logs for s390x-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-5 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-2 success Logs for Validate matrix.py
bpf/vmtest-bpf-next-VM_Test-0 success Logs for Lint
bpf/vmtest-bpf-next-VM_Test-3 success Logs for aarch64-gcc / build / build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-7 success Logs for aarch64-gcc / test (test_verifier, false, 360) / test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-8 success Logs for aarch64-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-10 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-11 success Logs for x86_64-gcc / build / build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-19 success Logs for x86_64-llvm-16 / build / build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-24 success Logs for x86_64-llvm-16 / veristat

Commit Message

Philo Lu Nov. 23, 2023, 3:07 a.m. UTC
Add 3 sock_ops operators, namely BPF_SOCK_OPS_DATA_SEND_CB,
BPF_SOCK_OPS_DATA_RECV_CB, and BPF_SOCK_OPS_DATA_ACKED_CB. A flag
BPF_SOCK_OPS_DATA_EVENT_CB_FLAG is provided to minimize the performance
impact. The flag must be explicitly set to enable these callbacks.

If the flag is enabled, the bpf sock_ops program will be called every time
a tcp data packet is sent, received, or acked:
BPF_SOCK_OPS_DATA_SEND_CB: call bpf after a data packet is sent.
BPF_SOCK_OPS_DATA_RECV_CB: call bpf after a data packet is received.
BPF_SOCK_OPS_DATA_ACKED_CB: call bpf after a valid ack packet is
processed (some sent data are acknowledged).

We use these callbacks for fine-grained tcp monitoring, which collects and
analyses information about every tcp request/response event. The whole
system has been described in SIGMOD'18 (see
https://dl.acm.org/doi/pdf/10.1145/3183713.3190659 for details). To
achieve this with bpf, we require hooks for data events that call
sock_ops bpf (1) when any data packet is sent/received/acked, and (2)
after critical tcp state variables have been updated (e.g., snd_una,
snd_nxt, rcv_nxt). However, existing sock_ops operators cannot meet our
requirements.

Besides, these hooks also help to debug tcp when data is sent/received/acked.

Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
---
 include/net/tcp.h        |  9 +++++++++
 include/uapi/linux/bpf.h | 14 +++++++++++++-
 net/ipv4/tcp_input.c     |  4 ++++
 net/ipv4/tcp_output.c    |  2 ++
 4 files changed, 28 insertions(+), 1 deletion(-)
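
For context, a consumer of these callbacks could look roughly like the
following sock_ops program (a minimal sketch, not part of the patch; the
program name and the bpf_printk usage are illustrative only):

// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("sockops")
int tcp_data_monitor(struct bpf_sock_ops *skops)
{
	switch (skops->op) {
	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
		/* The new callbacks fire only after this flag is
		 * explicitly set on the connection.
		 */
		bpf_sock_ops_cb_flags_set(skops,
					  skops->bpf_sock_ops_cb_flags |
					  BPF_SOCK_OPS_DATA_EVENT_CB_FLAG);
		break;
	case BPF_SOCK_OPS_DATA_SEND_CB:
	case BPF_SOCK_OPS_DATA_RECV_CB:
	case BPF_SOCK_OPS_DATA_ACKED_CB:
		/* snd_una/snd_nxt/rcv_nxt are already updated at this
		 * point, per requirement (2) above.
		 */
		bpf_printk("op=%d snd_nxt=%u rcv_nxt=%u",
			   skops->op, skops->snd_nxt, skops->rcv_nxt);
		break;
	}
	return 1;
}

char _license[] SEC("license") = "GPL";

Such a program would be attached like any other sock_ops program, e.g. with
bpftool cgroup attach /sys/fs/cgroup sock_ops pinned /sys/fs/bpf/monitor
(pin path illustrative).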

Comments

Philo Lu Nov. 23, 2023, 12:37 p.m. UTC | #1
Sorry, I forgot to cc the maintainers.

On 2023/11/23 11:07, Philo Lu wrote:
> Add 3 sock_ops operators, namely BPF_SOCK_OPS_DATA_SEND_CB,
> BPF_SOCK_OPS_DATA_RECV_CB, and BPF_SOCK_OPS_DATA_ACKED_CB. A flag
> BPF_SOCK_OPS_DATA_EVENT_CB_FLAG is provided to minimize the performance
> impact. The flag must be explicitly set to enable these callbacks.
>
> If the flag is enabled, the bpf sock_ops program will be called every time
> a tcp data packet is sent, received, or acked:
> BPF_SOCK_OPS_DATA_SEND_CB: call bpf after a data packet is sent.
> BPF_SOCK_OPS_DATA_RECV_CB: call bpf after a data packet is received.
> BPF_SOCK_OPS_DATA_ACKED_CB: call bpf after a valid ack packet is
> processed (some sent data are acknowledged).
>
> We use these callbacks for fine-grained tcp monitoring, which collects and
> analyses information about every tcp request/response event. The whole
> system has been described in SIGMOD'18 (see
> https://dl.acm.org/doi/pdf/10.1145/3183713.3190659 for details). To
> achieve this with bpf, we require hooks for data events that call
> sock_ops bpf (1) when any data packet is sent/received/acked, and (2)
> after critical tcp state variables have been updated (e.g., snd_una,
> snd_nxt, rcv_nxt). However, existing sock_ops operators cannot meet our
> requirements.
>
> Besides, these hooks also help to debug tcp when data is sent/received/acked.
>
> Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
> ---
>   include/net/tcp.h        |  9 +++++++++
>   include/uapi/linux/bpf.h | 14 +++++++++++++-
>   net/ipv4/tcp_input.c     |  4 ++++
>   net/ipv4/tcp_output.c    |  2 ++
>   4 files changed, 28 insertions(+), 1 deletion(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index d2f0736b76b8..73eda03fdda5 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -2660,6 +2660,15 @@ static inline void tcp_bpf_rtt(struct sock *sk)
>   		tcp_call_bpf(sk, BPF_SOCK_OPS_RTT_CB, 0, NULL);
>   }
>   
> +/* op must be one of BPF_SOCK_OPS_DATA_SEND_CB, BPF_SOCK_OPS_DATA_RECV_CB,
> + * or BPF_SOCK_OPS_DATA_ACKED_CB.
> + */
> +static inline void tcp_bpf_data_event(struct sock *sk, int op)
> +{
> +	if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_DATA_EVENT_CB_FLAG))
> +		tcp_call_bpf(sk, op, 0, NULL);
> +}
> +
>   #if IS_ENABLED(CONFIG_SMC)
>   extern struct static_key_false tcp_have_smc;
>   #endif
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 7cf8bcf9f6a2..2154a6235901 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -3016,6 +3016,7 @@ union bpf_attr {
>    * 		* **BPF_SOCK_OPS_RETRANS_CB_FLAG** (retransmission)
>    * 		* **BPF_SOCK_OPS_STATE_CB_FLAG** (TCP state change)
>    * 		* **BPF_SOCK_OPS_RTT_CB_FLAG** (every RTT)
> + * 		* **BPF_SOCK_OPS_DATA_EVENT_CB_FLAG** (data packet send/recv/acked)
>    *
>    * 		Therefore, this function can be used to clear a callback flag by
>    * 		setting the appropriate bit to zero. e.g. to disable the RTO
> @@ -6755,8 +6756,10 @@ enum {
>   	 * options first before the BPF program does.
>   	 */
>   	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
> +	/* Call bpf when data is sent/received/acked. */
> +	BPF_SOCK_OPS_DATA_EVENT_CB_FLAG = (1<<7),
>   /* Mask of all currently supported cb flags */
> -	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0x7F,
> +	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0xFF,
>   };
>   
>   /* List of known BPF sock_ops operators.
> @@ -6869,6 +6872,15 @@ enum {
>   					 * by the kernel or the
>   					 * earlier bpf-progs.
>   					 */
> +	BPF_SOCK_OPS_DATA_SEND_CB,		/* Calls BPF program when a
> +					 * data packet is sent. Pure ack is ignored.
> +					 */
> +	BPF_SOCK_OPS_DATA_RECV_CB,		/* Calls BPF program when a
> +					 * data packet is received. Pure ack is ignored.
> +					 */
> +	BPF_SOCK_OPS_DATA_ACKED_CB,		/* Calls BPF program when sent
> +					 * data are acknowledged.
> +					 */
>   };
>   
>   /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index bcb55d98004c..72c6192e7cd0 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -824,6 +824,8 @@ static void tcp_event_data_recv(struct sock *sk, struct sk_buff *skb)
>   
>   	now = tcp_jiffies32;
>   
> +	tcp_bpf_data_event(sk, BPF_SOCK_OPS_DATA_RECV_CB);
> +
>   	if (!icsk->icsk_ack.ato) {
>   		/* The _first_ data packet received, initialize
>   		 * delayed ACK engine.
> @@ -3454,6 +3456,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, const struct sk_buff *ack_skb,
>   		flag |= FLAG_SET_XMIT_TIMER;  /* set TLP or RTO timer */
>   	}
>   
> +	tcp_bpf_data_event(sk, BPF_SOCK_OPS_DATA_ACKED_CB);
> +
>   	if (icsk->icsk_ca_ops->pkts_acked) {
>   		struct ack_sample sample = { .pkts_acked = pkts_acked,
>   					     .rtt_us = sack->rate->rtt_us };
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index eb13a55d660c..ddd6a9c2150f 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2821,6 +2821,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>   		/* Send one loss probe per tail loss episode. */
>   		if (push_one != 2)
>   			tcp_schedule_loss_probe(sk, false);
> +
> +		tcp_bpf_data_event(sk, BPF_SOCK_OPS_DATA_SEND_CB);
>   		return false;
>   	}
>   	return !tp->packets_out && !tcp_write_queue_empty(sk);
Daniel Borkmann Nov. 24, 2023, 9:47 a.m. UTC | #2
On 11/23/23 1:37 PM, Philo Lu wrote:
> Sorry, I forgot to cc the maintainers.
> 
> On 2023/11/23 11:07, Philo Lu wrote:
>> Add 3 sock_ops operators, namely BPF_SOCK_OPS_DATA_SEND_CB,
>> BPF_SOCK_OPS_DATA_RECV_CB, and BPF_SOCK_OPS_DATA_ACKED_CB. A flag
>> BPF_SOCK_OPS_DATA_EVENT_CB_FLAG is provided to minimize the performance
>> impact. The flag must be explicitly set to enable these callbacks.
>>
>> If the flag is enabled, the bpf sock_ops program will be called every
>> time a tcp data packet is sent, received, or acked:
>> BPF_SOCK_OPS_DATA_SEND_CB: call bpf after a data packet is sent.
>> BPF_SOCK_OPS_DATA_RECV_CB: call bpf after a data packet is received.
>> BPF_SOCK_OPS_DATA_ACKED_CB: call bpf after a valid ack packet is
>> processed (some sent data are acknowledged).
>>
>> We use these callbacks for fine-grained tcp monitoring, which collects and
>> analyses information about every tcp request/response event. The whole
>> system has been described in SIGMOD'18 (see
>> https://dl.acm.org/doi/pdf/10.1145/3183713.3190659 for details). To
>> achieve this with bpf, we require hooks for data events that call
>> sock_ops bpf (1) when any data packet is sent/received/acked, and (2)
>> after critical tcp state variables have been updated (e.g., snd_una,
>> snd_nxt, rcv_nxt). However, existing sock_ops operators cannot meet our
>> requirements.
>>
>> Besides, these hooks also help to debug tcp when data is sent/received/acked.
>>
>> Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
>> ---
>>   include/net/tcp.h        |  9 +++++++++
>>   include/uapi/linux/bpf.h | 14 +++++++++++++-
>>   net/ipv4/tcp_input.c     |  4 ++++
>>   net/ipv4/tcp_output.c    |  2 ++
>>   4 files changed, 28 insertions(+), 1 deletion(-)

Please also add selftests for the new hooks. Speaking of the latter, it
looks like this fails the current BPF selftests:

https://github.com/kernel-patches/bpf/actions/runs/6974541866/job/18980491457

Notice: Success: 502/3526, Skipped: 56, Failed: 1
Error: #348 tcpbpf_user
   Error: #348 tcpbpf_user
   test_tcpbpf_user:PASS:open and load skel 0 nsec
   test_tcpbpf_user:PASS:test__join_cgroup(/tcpbpf-user-test) 0 nsec
   test_tcpbpf_user:PASS:attach_cgroup(bpf_testcb) 0 nsec
   run_test:PASS:start_server 0 nsec
   run_test:PASS:connect_to_fd(listen_fd) 0 nsec
   run_test:PASS:accept(listen_fd) 0 nsec
   run_test:PASS:send(cli_fd) 0 nsec
   run_test:PASS:recv(accept_fd) 0 nsec
   run_test:PASS:send(accept_fd) 0 nsec
   run_test:PASS:recv(cli_fd) 0 nsec
   run_test:PASS:recv(cli_fd) for fin 0 nsec
   run_test:PASS:recv(accept_fd) for fin 0 nsec
   verify_result:PASS:event_map 0 nsec
   verify_result:PASS:bytes_received 0 nsec
   verify_result:PASS:bytes_acked 0 nsec
   verify_result:PASS:data_segs_in 0 nsec
   verify_result:PASS:data_segs_out 0 nsec
   verify_result:FAIL:bad_cb_test_rv unexpected bad_cb_test_rv: actual 0 != expected 128
   verify_result:PASS:good_cb_test_rv 0 nsec
   verify_result:PASS:num_listen 0 nsec
   verify_result:PASS:num_close_events 0 nsec
   verify_result:PASS:tcp_save_syn 0 nsec
   verify_result:PASS:tcp_saved_syn 0 nsec
   verify_result:PASS:window_clamp_client 0 nsec
   verify_result:PASS:window_clamp_server 0 nsec
Test Results:
              bpftool: PASS
           test_progs: FAIL (returned 1)
             shutdown: CLEAN
Error: Process completed with exit code 1.
Martin KaFai Lau Nov. 29, 2023, 12:33 a.m. UTC | #3
On 11/23/23 4:37 AM, Philo Lu wrote:
> Sorry, I forgot to cc the maintainers.
> 
> On 2023/11/23 11:07, Philo Lu wrote:
>> Add 3 sock_ops operators, namely BPF_SOCK_OPS_DATA_SEND_CB,
>> BPF_SOCK_OPS_DATA_RECV_CB, and BPF_SOCK_OPS_DATA_ACKED_CB. A flag
>> BPF_SOCK_OPS_DATA_EVENT_CB_FLAG is provided to minimize the performance
>> impact. The flag must be explicitly set to enable these callbacks.
>>
>> If the flag is enabled, the bpf sock_ops program will be called every
>> time a tcp data packet is sent, received, or acked:
>> BPF_SOCK_OPS_DATA_SEND_CB: call bpf after a data packet is sent.
>> BPF_SOCK_OPS_DATA_RECV_CB: call bpf after a data packet is received.
>> BPF_SOCK_OPS_DATA_ACKED_CB: call bpf after a valid ack packet is
>> processed (some sent data are acknowledged).
>>
>> We use these callbacks for fine-grained tcp monitoring, which collects and
>> analyses information about every tcp request/response event. The whole
>> system has been described in SIGMOD'18 (see
>> https://dl.acm.org/doi/pdf/10.1145/3183713.3190659 for details). To
>> achieve this with bpf, we require hooks for data events that call
>> sock_ops bpf (1) when any data packet is sent/received/acked, and (2)
>> after critical tcp state variables have been updated (e.g., snd_una,
>> snd_nxt, rcv_nxt). However, existing sock_ops operators cannot meet our
>> requirements.
>>
>> Besides, these hooks also help to debug tcp when data is sent/received/acked.

This all sounds like a tracing use case. Why is a tracepoint not used instead?
Philo Lu Nov. 29, 2023, 10:05 a.m. UTC | #4
On 2023/11/24 17:47, Daniel Borkmann wrote:
> On 11/23/23 1:37 PM, Philo Lu wrote:
>> Sorry, I forgot to cc the maintainers.
>>
>> On 2023/11/23 11:07, Philo Lu wrote:
>>> Add 3 sock_ops operators, namely BPF_SOCK_OPS_DATA_SEND_CB,
>>> BPF_SOCK_OPS_DATA_RECV_CB, and BPF_SOCK_OPS_DATA_ACKED_CB. A flag
>>> BPF_SOCK_OPS_DATA_EVENT_CB_FLAG is provided to minimize the performance
>>> impact. The flag must be explicitly set to enable these callbacks.
>>>
>>> If the flag is enabled, the bpf sock_ops program will be called every
>>> time a tcp data packet is sent, received, or acked:
>>> BPF_SOCK_OPS_DATA_SEND_CB: call bpf after a data packet is sent.
>>> BPF_SOCK_OPS_DATA_RECV_CB: call bpf after a data packet is received.
>>> BPF_SOCK_OPS_DATA_ACKED_CB: call bpf after a valid ack packet is
>>> processed (some sent data are acknowledged).
>>>
>>> We use these callbacks for fine-grained tcp monitoring, which collects and
>>> analyses information about every tcp request/response event. The whole
>>> system has been described in SIGMOD'18 (see
>>> https://dl.acm.org/doi/pdf/10.1145/3183713.3190659 for details). To
>>> achieve this with bpf, we require hooks for data events that call
>>> sock_ops bpf (1) when any data packet is sent/received/acked, and (2)
>>> after critical tcp state variables have been updated (e.g., snd_una,
>>> snd_nxt, rcv_nxt). However, existing sock_ops operators cannot meet our
>>> requirements.
>>>
>>> Besides, these hooks also help to debug tcp when data is sent/received/acked.
>>>
>>> Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
>>> ---
>>>   include/net/tcp.h        |  9 +++++++++
>>>   include/uapi/linux/bpf.h | 14 +++++++++++++-
>>>   net/ipv4/tcp_input.c     |  4 ++++
>>>   net/ipv4/tcp_output.c    |  2 ++
>>>   4 files changed, 28 insertions(+), 1 deletion(-)
>
> Please also add selftests for the new hooks. Speaking of the latter, it
> looks like this fails the current BPF selftests:
>
> https://github.com/kernel-patches/bpf/actions/runs/6974541866/job/18980491457 
>
>

We will add selftests in the next version. The current selftest fails only
because of the newly added flag, and we will also fix this in the next version.
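
For reference, the failing check appears to be the tcpbpf selftest's
"deliberately invalid cb flag" test. A paraphrased sketch of the relevant
logic (assuming the current shape of test_tcpbpf_kern.c, not verbatim):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__u32 bad_cb_test_rv;
} global;

SEC("sockops")
int bpf_testcb(struct bpf_sock_ops *skops)
{
	/* bpf_sock_ops_cb_flags_set() returns the flag bits it could not
	 * set. Before this patch, 0x80 lies outside
	 * BPF_SOCK_OPS_ALL_CB_FLAGS (0x7F), so the call returns 0x80 and
	 * the test expects 128. With this patch, 0x80 ==
	 * BPF_SOCK_OPS_DATA_EVENT_CB_FLAG is a valid flag, the helper
	 * returns 0, and verify_result() sees "actual 0 != expected 128".
	 */
	global.bad_cb_test_rv = bpf_sock_ops_cb_flags_set(skops, 0x80);
	return 1;
}

char _license[] SEC("license") = "GPL";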
Philo Lu Nov. 29, 2023, 10:05 a.m. UTC | #5
On 2023/11/29 08:33, Martin KaFai Lau wrote:
> On 11/23/23 4:37 AM, Philo Lu wrote:
>> Sorry, I forgot to cc the maintainers.
>>
>> On 2023/11/23 11:07, Philo Lu wrote:
>>> Add 3 sock_ops operators, namely BPF_SOCK_OPS_DATA_SEND_CB,
>>> BPF_SOCK_OPS_DATA_RECV_CB, and BPF_SOCK_OPS_DATA_ACKED_CB. A flag
>>> BPF_SOCK_OPS_DATA_EVENT_CB_FLAG is provided to minimize the performance
>>> impact. The flag must be explicitly set to enable these callbacks.
>>>
>>> If the flag is enabled, the bpf sock_ops program will be called every
>>> time a tcp data packet is sent, received, or acked:
>>> BPF_SOCK_OPS_DATA_SEND_CB: call bpf after a data packet is sent.
>>> BPF_SOCK_OPS_DATA_RECV_CB: call bpf after a data packet is received.
>>> BPF_SOCK_OPS_DATA_ACKED_CB: call bpf after a valid ack packet is
>>> processed (some sent data are acknowledged).
>>>
>>> We use these callbacks for fine-grained tcp monitoring, which collects and
>>> analyses information about every tcp request/response event. The whole
>>> system has been described in SIGMOD'18 (see
>>> https://dl.acm.org/doi/pdf/10.1145/3183713.3190659 for details). To
>>> achieve this with bpf, we require hooks for data events that call
>>> sock_ops bpf (1) when any data packet is sent/received/acked, and (2)
>>> after critical tcp state variables have been updated (e.g., snd_una,
>>> snd_nxt, rcv_nxt). However, existing sock_ops operators cannot meet our
>>> requirements.
>>>
>>> Besides, these hooks also help to debug tcp when data is sent/received/acked.
>
> This all sounds like a tracing use case. Why is a tracepoint not used
> instead?

Yes, our use case is pure tracing. We add hooks to sockops because we also
use other ops like BPF_SOCK_OPS_STATE_CB. Thus, sockops seems a natural
solution for us.

We can also use a tracepoint (with sockops) instead. So which to use
depends on your opinion. Many thanks.
Martin KaFai Lau Nov. 30, 2023, 6:13 p.m. UTC | #6
On 11/29/23 2:05 AM, Philo Lu wrote:
> 
> On 2023/11/29 08:33, Martin KaFai Lau wrote:
>> On 11/23/23 4:37 AM, Philo Lu wrote:
>>> Sorry, I forgot to cc the maintainers.
>>>
>>> On 2023/11/23 11:07, Philo Lu wrote:
>>>> Add 3 sock_ops operators, namely BPF_SOCK_OPS_DATA_SEND_CB,
>>>> BPF_SOCK_OPS_DATA_RECV_CB, and BPF_SOCK_OPS_DATA_ACKED_CB. A flag
>>>> BPF_SOCK_OPS_DATA_EVENT_CB_FLAG is provided to minimize the performance
>>>> impact. The flag must be explicitly set to enable these callbacks.
>>>>
>>>> If the flag is enabled, the bpf sock_ops program will be called every
>>>> time a tcp data packet is sent, received, or acked:
>>>> BPF_SOCK_OPS_DATA_SEND_CB: call bpf after a data packet is sent.
>>>> BPF_SOCK_OPS_DATA_RECV_CB: call bpf after a data packet is received.
>>>> BPF_SOCK_OPS_DATA_ACKED_CB: call bpf after a valid ack packet is
>>>> processed (some sent data are acknowledged).
>>>>
>>>> We use these callbacks for fine-grained tcp monitoring, which collects and
>>>> analyses information about every tcp request/response event. The whole
>>>> system has been described in SIGMOD'18 (see
>>>> https://dl.acm.org/doi/pdf/10.1145/3183713.3190659 for details). To
>>>> achieve this with bpf, we require hooks for data events that call
>>>> sock_ops bpf (1) when any data packet is sent/received/acked, and (2)
>>>> after critical tcp state variables have been updated (e.g., snd_una,
>>>> snd_nxt, rcv_nxt). However, existing sock_ops operators cannot meet our
>>>> requirements.
>>>>
>>>> Besides, these hooks also help to debug tcp when data is sent/received/acked.
>>
>> This all sounds like a tracing use case. Why is a tracepoint not used instead?
> 
> Yes, our use case is pure tracing. We add hooks to sockops because we also use
> other ops like BPF_SOCK_OPS_STATE_CB. Thus, sockops seems a natural solution
> for us.

There is also an existing trace_inet_sock_set_state() tracepoint for tracking
the state change. There are other existing tracepoints in
include/trace/events/tcp.h for tcp perf monitoring/analysis purposes (e.g.
trace_tcp_retransmit_skb). All this use case needs is read-only access to sk,
and the purpose is tcp perf monitoring/analysis. If a hook is needed here
(cgroup-bpf or tracepoint), I would think it is better to supplement the
existing tcp tracepoints, which were also added to do tcp monitoring.

I suspect a fexit bpf prog may also work, because it is called after the
traced kernel function returns. However, the kernel functions may get
inlined, in which case the tracepoint will still be needed. Maybe the netdev
maintainers can chime in here regarding the tracepoint additions.

> 
> We can also use a tracepoint (with sockops) instead. So which to use
> depends on your opinion. Many thanks.
> 
>
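
As a rough illustration of the tracepoint route suggested above, a read-only
monitoring program could hook an existing tcp tracepoint like this (a hedged
sketch: tcp_retransmit_skb is a real tracepoint in
include/trace/events/tcp.h, while the program name and printed fields are
illustrative only):

// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("tp_btf/tcp_retransmit_skb")
int BPF_PROG(on_tcp_retransmit, const struct sock *sk,
	     const struct sk_buff *skb)
{
	const struct tcp_sock *tp = (const struct tcp_sock *)sk;

	/* tp_btf programs get read-only, BTF-typed access to sk, which is
	 * all this monitoring use case needs.
	 */
	bpf_printk("retransmit: snd_una=%u snd_nxt=%u",
		   tp->snd_una, tp->snd_nxt);
	return 0;
}

char _license[] SEC("license") = "GPL";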

Patch

diff --git a/include/net/tcp.h b/include/net/tcp.h
index d2f0736b76b8..73eda03fdda5 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2660,6 +2660,15 @@  static inline void tcp_bpf_rtt(struct sock *sk)
 		tcp_call_bpf(sk, BPF_SOCK_OPS_RTT_CB, 0, NULL);
 }
 
+/* op must be one of BPF_SOCK_OPS_DATA_SEND_CB, BPF_SOCK_OPS_DATA_RECV_CB,
+ * or BPF_SOCK_OPS_DATA_ACKED_CB.
+ */
+static inline void tcp_bpf_data_event(struct sock *sk, int op)
+{
+	if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_DATA_EVENT_CB_FLAG))
+		tcp_call_bpf(sk, op, 0, NULL);
+}
+
 #if IS_ENABLED(CONFIG_SMC)
 extern struct static_key_false tcp_have_smc;
 #endif
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 7cf8bcf9f6a2..2154a6235901 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3016,6 +3016,7 @@  union bpf_attr {
  * 		* **BPF_SOCK_OPS_RETRANS_CB_FLAG** (retransmission)
  * 		* **BPF_SOCK_OPS_STATE_CB_FLAG** (TCP state change)
  * 		* **BPF_SOCK_OPS_RTT_CB_FLAG** (every RTT)
+ * 		* **BPF_SOCK_OPS_DATA_EVENT_CB_FLAG** (data packet send/recv/acked)
  *
  * 		Therefore, this function can be used to clear a callback flag by
  * 		setting the appropriate bit to zero. e.g. to disable the RTO
@@ -6755,8 +6756,10 @@  enum {
 	 * options first before the BPF program does.
 	 */
 	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
+	/* Call bpf when data is sent/received/acked. */
+	BPF_SOCK_OPS_DATA_EVENT_CB_FLAG = (1<<7),
 /* Mask of all currently supported cb flags */
-	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0x7F,
+	BPF_SOCK_OPS_ALL_CB_FLAGS       = 0xFF,
 };
 
 /* List of known BPF sock_ops operators.
@@ -6869,6 +6872,15 @@  enum {
 					 * by the kernel or the
 					 * earlier bpf-progs.
 					 */
+	BPF_SOCK_OPS_DATA_SEND_CB,		/* Calls BPF program when a
+					 * data packet is sent. Pure ack is ignored.
+					 */
+	BPF_SOCK_OPS_DATA_RECV_CB,		/* Calls BPF program when a
+					 * data packet is received. Pure ack is ignored.
+					 */
+	BPF_SOCK_OPS_DATA_ACKED_CB,		/* Calls BPF program when sent
+					 * data are acknowledged.
+					 */
 };
 
 /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index bcb55d98004c..72c6192e7cd0 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -824,6 +824,8 @@  static void tcp_event_data_recv(struct sock *sk, struct sk_buff *skb)
 
 	now = tcp_jiffies32;
 
+	tcp_bpf_data_event(sk, BPF_SOCK_OPS_DATA_RECV_CB);
+
 	if (!icsk->icsk_ack.ato) {
 		/* The _first_ data packet received, initialize
 		 * delayed ACK engine.
@@ -3454,6 +3456,8 @@  static int tcp_clean_rtx_queue(struct sock *sk, const struct sk_buff *ack_skb,
 		flag |= FLAG_SET_XMIT_TIMER;  /* set TLP or RTO timer */
 	}
 
+	tcp_bpf_data_event(sk, BPF_SOCK_OPS_DATA_ACKED_CB);
+
 	if (icsk->icsk_ca_ops->pkts_acked) {
 		struct ack_sample sample = { .pkts_acked = pkts_acked,
 					     .rtt_us = sack->rate->rtt_us };
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index eb13a55d660c..ddd6a9c2150f 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2821,6 +2821,8 @@  static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		/* Send one loss probe per tail loss episode. */
 		if (push_one != 2)
 			tcp_schedule_loss_probe(sk, false);
+
+		tcp_bpf_data_event(sk, BPF_SOCK_OPS_DATA_SEND_CB);
 		return false;
 	}
 	return !tp->packets_out && !tcp_write_queue_empty(sk);