diff mbox series

[bpf,v1,1/2] bpf: fix ktls panic

Message ID 20250123171552.57345-2-mrpre@163.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Headers show
Series bpf: fix ktls panic and add tests | expand

Checks

Context Check Description
bpf/vmtest-bpf-PR success PR summary
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for bpf, async
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 10 of 11 maintainers
netdev/build_clang success Errors and warnings before: 2 this patch: 2
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 15 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-VM_Test-2 success Logs for Unittests
bpf/vmtest-bpf-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-VM_Test-4 success Logs for aarch64-gcc / build / build for aarch64 with gcc
bpf/vmtest-bpf-VM_Test-0 success Logs for Lint
bpf/vmtest-bpf-VM_Test-3 success Logs for Validate matrix.py
bpf/vmtest-bpf-VM_Test-5 success Logs for aarch64-gcc / build-release
bpf/vmtest-bpf-VM_Test-10 success Logs for aarch64-gcc / veristat-kernel
bpf/vmtest-bpf-VM_Test-13 success Logs for s390x-gcc / build-release
bpf/vmtest-bpf-VM_Test-11 success Logs for aarch64-gcc / veristat-meta
bpf/vmtest-bpf-VM_Test-9 success Logs for aarch64-gcc / test (test_verifier, false, 360) / test_verifier on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-6 success Logs for aarch64-gcc / test (test_maps, false, 360) / test_maps on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-12 success Logs for s390x-gcc / build / build for s390x with gcc
bpf/vmtest-bpf-VM_Test-17 success Logs for s390x-gcc / veristat-kernel
bpf/vmtest-bpf-VM_Test-16 success Logs for s390x-gcc / test (test_verifier, false, 360) / test_verifier on s390x with gcc
bpf/vmtest-bpf-VM_Test-18 success Logs for s390x-gcc / veristat-meta
bpf/vmtest-bpf-VM_Test-19 success Logs for set-matrix
bpf/vmtest-bpf-VM_Test-20 success Logs for x86_64-gcc / build / build for x86_64 with gcc
bpf/vmtest-bpf-VM_Test-21 success Logs for x86_64-gcc / build-release
bpf/vmtest-bpf-VM_Test-30 success Logs for x86_64-llvm-17 / build / build for x86_64 with llvm-17
bpf/vmtest-bpf-VM_Test-31 success Logs for x86_64-llvm-17 / build-release / build for x86_64 with llvm-17-O2
bpf/vmtest-bpf-VM_Test-35 success Logs for x86_64-llvm-17 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-17
bpf/vmtest-bpf-VM_Test-36 success Logs for x86_64-llvm-17 / veristat-kernel
bpf/vmtest-bpf-VM_Test-37 success Logs for x86_64-llvm-17 / veristat-meta
bpf/vmtest-bpf-VM_Test-38 success Logs for x86_64-llvm-18 / build / build for x86_64 with llvm-18
bpf/vmtest-bpf-VM_Test-39 success Logs for x86_64-llvm-18 / build-release / build for x86_64 with llvm-18-O2
bpf/vmtest-bpf-VM_Test-44 success Logs for x86_64-llvm-18 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-18
bpf/vmtest-bpf-VM_Test-45 success Logs for x86_64-llvm-18 / veristat-kernel
bpf/vmtest-bpf-VM_Test-46 success Logs for x86_64-llvm-18 / veristat-meta
bpf/vmtest-bpf-VM_Test-7 success Logs for aarch64-gcc / test (test_progs, false, 360) / test_progs on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-22 success Logs for x86_64-gcc / test (test_maps, false, 360) / test_maps on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-8 success Logs for aarch64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-26 success Logs for x86_64-gcc / test (test_progs_parallel, true, 30) / test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-25 success Logs for x86_64-gcc / test (test_progs_no_alu32_parallel, true, 30) / test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-24 success Logs for x86_64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-23 success Logs for x86_64-gcc / test (test_progs, false, 360) / test_progs on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-27 success Logs for x86_64-gcc / test (test_verifier, false, 360) / test_verifier on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-28 success Logs for x86_64-gcc / veristat-kernel / x86_64-gcc veristat_kernel
bpf/vmtest-bpf-VM_Test-29 success Logs for x86_64-gcc / veristat-meta / x86_64-gcc veristat_meta
bpf/vmtest-bpf-VM_Test-32 success Logs for x86_64-llvm-17 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-17
bpf/vmtest-bpf-VM_Test-33 success Logs for x86_64-llvm-17 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-17
bpf/vmtest-bpf-VM_Test-34 success Logs for x86_64-llvm-17 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-17
bpf/vmtest-bpf-VM_Test-40 success Logs for x86_64-llvm-18 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-18
bpf/vmtest-bpf-VM_Test-41 success Logs for x86_64-llvm-18 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-18
bpf/vmtest-bpf-VM_Test-42 success Logs for x86_64-llvm-18 / test (test_progs_cpuv4, false, 360) / test_progs_cpuv4 on x86_64 with llvm-18
bpf/vmtest-bpf-VM_Test-43 success Logs for x86_64-llvm-18 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-18
bpf/vmtest-bpf-VM_Test-14 success Logs for s390x-gcc / test (test_progs, false, 360) / test_progs on s390x with gcc
bpf/vmtest-bpf-VM_Test-15 success Logs for s390x-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on s390x with gcc

Commit Message

Jiayuan Chen Jan. 23, 2025, 5:15 p.m. UTC
[ 2172.936997] ------------[ cut here ]------------
[ 2172.936999] kernel BUG at lib/iov_iter.c:629!
......
[ 2172.944996] PKRU: 55555554
[ 2172.945155] Call Trace:
[ 2172.945299]  <TASK>
[ 2172.945428]  ? die+0x36/0x90
[ 2172.945601]  ? do_trap+0xdd/0x100
[ 2172.945795]  ? iov_iter_revert+0x178/0x180
[ 2172.946031]  ? iov_iter_revert+0x178/0x180
[ 2172.946267]  ? do_error_trap+0x7d/0x110
[ 2172.946499]  ? iov_iter_revert+0x178/0x180
[ 2172.946736]  ? exc_invalid_op+0x50/0x70
[ 2172.946961]  ? iov_iter_revert+0x178/0x180
[ 2172.947197]  ? asm_exc_invalid_op+0x1a/0x20
[ 2172.947446]  ? iov_iter_revert+0x178/0x180
[ 2172.947683]  ? iov_iter_revert+0x5c/0x180
[ 2172.947913]  tls_sw_sendmsg_locked.isra.0+0x794/0x840
[ 2172.948206]  tls_sw_sendmsg+0x52/0x80
[ 2172.948420]  ? inet_sendmsg+0x1f/0x70
[ 2172.948634]  __sys_sendto+0x1cd/0x200
[ 2172.948848]  ? find_held_lock+0x2b/0x80
[ 2172.949072]  ? syscall_trace_enter+0x140/0x270
[ 2172.949330]  ? __lock_release.isra.0+0x5e/0x170
[ 2172.949595]  ? find_held_lock+0x2b/0x80
[ 2172.949817]  ? syscall_trace_enter+0x140/0x270
[ 2172.950211]  ? lockdep_hardirqs_on_prepare+0xda/0x190
[ 2172.950632]  ? ktime_get_coarse_real_ts64+0xc2/0xd0
[ 2172.951036]  __x64_sys_sendto+0x24/0x30
[ 2172.951382]  do_syscall_64+0x90/0x170
......

After calling bpf_exec_tx_verdict(), the size of msg_pl->sg may increase,
e.g., when the BPF program executes bpf_msg_push_data().

If the BPF program sets cork_bytes and sg.size is smaller than cork_bytes,
it will return -ENOSPC and attempt to roll back to the non-zero copy
logic. However, during rollback, msg->msg_iter is reset, but since
msg_pl->sg.size has been increased, subsequent executions will exceed the
actual size of msg_iter.
'''
iov_iter_revert(&msg->msg_iter, msg_pl->sg.size - orig_size);
'''

The changes in this commit are based on the following considerations:

1. When cork_bytes is set, rolling back to non-zero copy logic is
pointless and can directly go to zero-copy logic.

2. Suppose sg.size is initially 5, and we push it to 100, setting
apply_bytes to 7. Then, 98 bytes of data are sent out, leaving 2 bytes to
be processed. The rollback logic cannot determine which data has been
processed and which hasn't.

This current change is based on the principle of minimal modification,
which won't make things worse. If we still encounter similar panic
further, we can consider a more comprehensive solution.

Fixes: fcb14cb1bdac ("new iov_iter flavour - ITER_UBUF")
Fixes: d3b18ad31f93 ("tls: add bpf support to sk_msg handling")
Signed-off-by: Jiayuan Chen <mrpre@163.com>
---
 net/tls/tls_sw.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

Comments

John Fastabend Jan. 25, 2025, 5:24 a.m. UTC | #1
On 2025-01-24 01:15:51, Jiayuan Chen wrote:
> [ 2172.936997] ------------[ cut here ]------------
> [ 2172.936999] kernel BUG at lib/iov_iter.c:629!
> ......
> [ 2172.944996] PKRU: 55555554
> [ 2172.945155] Call Trace:
> [ 2172.945299]  <TASK>
> [ 2172.945428]  ? die+0x36/0x90
> [ 2172.945601]  ? do_trap+0xdd/0x100
> [ 2172.945795]  ? iov_iter_revert+0x178/0x180
> [ 2172.946031]  ? iov_iter_revert+0x178/0x180
> [ 2172.946267]  ? do_error_trap+0x7d/0x110
> [ 2172.946499]  ? iov_iter_revert+0x178/0x180
> [ 2172.946736]  ? exc_invalid_op+0x50/0x70
> [ 2172.946961]  ? iov_iter_revert+0x178/0x180
> [ 2172.947197]  ? asm_exc_invalid_op+0x1a/0x20
> [ 2172.947446]  ? iov_iter_revert+0x178/0x180
> [ 2172.947683]  ? iov_iter_revert+0x5c/0x180
> [ 2172.947913]  tls_sw_sendmsg_locked.isra.0+0x794/0x840
> [ 2172.948206]  tls_sw_sendmsg+0x52/0x80
> [ 2172.948420]  ? inet_sendmsg+0x1f/0x70
> [ 2172.948634]  __sys_sendto+0x1cd/0x200
> [ 2172.948848]  ? find_held_lock+0x2b/0x80
> [ 2172.949072]  ? syscall_trace_enter+0x140/0x270
> [ 2172.949330]  ? __lock_release.isra.0+0x5e/0x170
> [ 2172.949595]  ? find_held_lock+0x2b/0x80
> [ 2172.949817]  ? syscall_trace_enter+0x140/0x270
> [ 2172.950211]  ? lockdep_hardirqs_on_prepare+0xda/0x190
> [ 2172.950632]  ? ktime_get_coarse_real_ts64+0xc2/0xd0
> [ 2172.951036]  __x64_sys_sendto+0x24/0x30
> [ 2172.951382]  do_syscall_64+0x90/0x170
> ......
> 
> After calling bpf_exec_tx_verdict(), the size of msg_pl->sg may increase,
> e.g., when the BPF program executes bpf_msg_push_data().
> 
> If the BPF program sets cork_bytes and sg.size is smaller than cork_bytes,
> it will return -ENOSPC and attempt to roll back to the non-zero copy
> logic. However, during rollback, msg->msg_iter is reset, but since
> msg_pl->sg.size has been increased, subsequent executions will exceed the
> actual size of msg_iter.
> '''
> iov_iter_revert(&msg->msg_iter, msg_pl->sg.size - orig_size);
> '''
> 
> The changes in this commit are based on the following considerations:
> 
> 1. When cork_bytes is set, rolling back to non-zero copy logic is
> pointless and can directly go to zero-copy logic.
> 
> 2. Suppose sg.size is initially 5, and we push it to 100, setting
> apply_bytes to 7. Then, 98 bytes of data are sent out, leaving 2 bytes to
> be processed. The rollback logic cannot determine which data has been
> processed and which hasn't.

This is the error path we are talking about correct?

        if (msg->cork_bytes && msg->cork_bytes > msg->sg.size &&
            !enospc && !full_record) {
                err = -ENOSPC;
                goto out_err;
        }

> 
> This current change is based on the principle of minimal modification,
> which won't make things worse. If we still encounter similar panic
> further, we can consider a more comprehensive solution.
> 
> Fixes: fcb14cb1bdac ("new iov_iter flavour - ITER_UBUF")
> Fixes: d3b18ad31f93 ("tls: add bpf support to sk_msg handling")
> Signed-off-by: Jiayuan Chen <mrpre@163.com>
> ---
>  net/tls/tls_sw.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
> index 7bcc9b4408a2..b3cae4dd4f49 100644
> --- a/net/tls/tls_sw.c
> +++ b/net/tls/tls_sw.c
> @@ -1120,9 +1120,13 @@ static int tls_sw_sendmsg_locked(struct sock *sk, struct msghdr *msg,
>  					num_async++;
>  				else if (ret == -ENOMEM)
>  					goto wait_for_memory;
> -				else if (ctx->open_rec && ret == -ENOSPC)
> +				else if (ctx->open_rec && ret == -ENOSPC) {
> +					if (msg_pl->cork_bytes) {
> +						ret = 0;
> +						goto send_end;
> +					}

The app will lose bytes here I suspect if we return copied == try_to_copy then
no error makes it to the user?

Could we return delta from bpf_exec_tx_verdict and then we can calculate
the correct number of bytes to revert? I'll need to check but its not
clear to me if BPF program pushes data that the right thing is done with
delta there now.

Thanks for looking into this.
Jiayuan Chen Jan. 25, 2025, 6:51 a.m. UTC | #2
On Fri, Jan 24, 2025 at 09:24:48PM -0800, John Fastabend wrote:
> On 2025-01-24 01:15:51, Jiayuan Chen wrote:
> > [ 2172.936997] ------------[ cut here ]------------
> > [ 2172.936999] kernel BUG at lib/iov_iter.c:629!
> > ......
> > pointless and can directly go to zero-copy logic.
> > 
> > 2. Suppose sg.size is initially 5, and we push it to 100, setting
> > apply_bytes to 7. Then, 98 bytes of data are sent out, leaving 2 bytes to
> > be processed. The rollback logic cannot determine which data has been
> > processed and which hasn't.
> 
> This is the error path we are talking about correct?
> 
>         if (msg->cork_bytes && msg->cork_bytes > msg->sg.size &&
>             !enospc && !full_record) {
>                 err = -ENOSPC;
>                 goto out_err;
>         }
> 
yes, it its.
> > 
> > diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
> > index 7bcc9b4408a2..b3cae4dd4f49 100644
> > --- a/net/tls/tls_sw.c
> > +++ b/net/tls/tls_sw.c
> > @@ -1120,9 +1120,13 @@ static int tls_sw_sendmsg_locked(struct sock *sk, struct msghdr *msg,
> >  					num_async++;
> >  				else if (ret == -ENOMEM)
> >  					goto wait_for_memory;
> > -				else if (ctx->open_rec && ret == -ENOSPC)
> > +				else if (ctx->open_rec && ret == -ENOSPC) {
> > +					if (msg_pl->cork_bytes) {
> > +						ret = 0;
> > +						goto send_end;
> > +					}
> 
> The app will lose bytes here I suspect if we return copied == try_to_copy then
> no error makes it to the user?

I looked into the corking logic for non-TLS sockets in tcp_bpf_sendmsg,
and I found that when a "cork" situation occurs, the user-space send
doesn't return an error, and the returned length is the same as the input
length parameter, even if some data is cached. I think TLS should also
behave similarly.

Additionally, I saw that the current non-zero-copy logic for handling
corking is written as:
'''
line 1177

else if (ret != -EAGAIN) {
	if (ret == -ENOSPC)
		ret = 0;
	goto send_end;
}
'''

Meanwhile, I set cork_bytes to 1 and tested the following behavior logic:
'''
send(msg, 1);
send(msg+1, 1);
send(msg+2, remain_length);
'''
Both the sender and receiver seem to be working normally both for TLS and
non-TLS sockets.
 
> Could we return delta from bpf_exec_tx_verdict and then we can calculate
> the correct number of bytes to revert? I'll need to check but its not
> clear to me if BPF program pushes data that the right thing is done with
> delta there now.
> 
> Thanks for looking into this.

Let's assume the original data is "abcdefgh" (8 bytes), and after 3 pushes
by the BPF program, it becomes 11-byte data: "abc?de?fgh?".

Then, we set cork_bytes to 6, which means the first 6 bytes have been
processed, and the remaining 5 bytes "?fgh?" will be cached until the
length meets the cork_bytes requirement.

However, some data in "?fgh?" is not within 'sg->msg_iter'
(but in msg_pl instead), especially the data "?" we pushed.

So it doesn't seem as simple as just reverting through an offset of msg_iter.
It appears that 'msg_iter' and 'msg_pl' are two separate objects,
and the BPF program modifies the scatterlist within the msg_pl.

--
Thanks.
Jiayuan Chen Jan. 31, 2025, 6:20 a.m. UTC | #3
On Sat, Jan 25, 2025 at 02:51:49PM +0800, Jiayuan Chen wrote:
> On Fri, Jan 24, 2025 at 09:24:48PM -0800, John Fastabend wrote:
> > On 2025-01-24 01:15:51, Jiayuan Chen wrote:
> > > [ 2172.936997] ------------[ cut here ]------------
> > > [ 2172.936999] kernel BUG at lib/iov_iter.c:629!
> > > ......
> > > pointless and can directly go to zero-copy logic.
> > > 
> > > 2. Suppose sg.size is initially 5, and we push it to 100, setting
> > > apply_bytes to 7. Then, 98 bytes of data are sent out, leaving 2 bytes to
> > > be processed. The rollback logic cannot determine which data has been
> > > processed and which hasn't.
> > 
> > This is the error path we are talking about correct?
> > 
> >         if (msg->cork_bytes && msg->cork_bytes > msg->sg.size &&
> >             !enospc && !full_record) {
> >                 err = -ENOSPC;
> >                 goto out_err;
> >         }
> > 
> yes, it its.
> > > 
> > > diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
> > > index 7bcc9b4408a2..b3cae4dd4f49 100644
> > > --- a/net/tls/tls_sw.c
> > > +++ b/net/tls/tls_sw.c
> > > @@ -1120,9 +1120,13 @@ static int tls_sw_sendmsg_locked(struct sock *sk, struct msghdr *msg,
> > >  					num_async++;
> > >  				else if (ret == -ENOMEM)
> > >  					goto wait_for_memory;
> > > -				else if (ctx->open_rec && ret == -ENOSPC)
> > > +				else if (ctx->open_rec && ret == -ENOSPC) {
> > > +					if (msg_pl->cork_bytes) {
> > > +						ret = 0;
> > > +						goto send_end;
> > > +					}
> > 
> > The app will lose bytes here I suspect if we return copied == try_to_copy then
> > no error makes it to the user?
> 
> I looked into the corking logic for non-TLS sockets in tcp_bpf_sendmsg,
> and I found that when a "cork" situation occurs, the user-space send
> doesn't return an error, and the returned length is the same as the input
> length parameter, even if some data is cached. I think TLS should also
> behave similarly.
> 
> Additionally, I saw that the current non-zero-copy logic for handling
> corking is written as:
> '''
> line 1177
> 
> else if (ret != -EAGAIN) {
> 	if (ret == -ENOSPC)
> 		ret = 0;
> 	goto send_end;
> }
> '''
> 
> Meanwhile, I set cork_bytes to 1 and tested the following behavior logic:
> '''
> send(msg, 1);
> send(msg+1, 1);
> send(msg+2, remain_length);
> '''
> Both the sender and receiver seem to be working normally both for TLS and
> non-TLS sockets.
>  
> > Could we return delta from bpf_exec_tx_verdict and then we can calculate
> > the correct number of bytes to revert? I'll need to check but its not
> > clear to me if BPF program pushes data that the right thing is done with
> > delta there now.
> > 
> > Thanks for looking into this.
> 
> Let's assume the original data is "abcdefgh" (8 bytes), and after 3 pushes
> by the BPF program, it becomes 11-byte data: "abc?de?fgh?".
> 
> Then, we set cork_bytes to 6, which means the first 6 bytes have been
> processed, and the remaining 5 bytes "?fgh?" will be cached until the
> length meets the cork_bytes requirement.
> 
> However, some data in "?fgh?" is not within 'sg->msg_iter'
> (but in msg_pl instead), especially the data "?" we pushed.
> 
> So it doesn't seem as simple as just reverting through an offset of msg_iter.
> It appears that 'msg_iter' and 'msg_pl' are two separate objects,
> and the BPF program modifies the scatterlist within the msg_pl.
> 
> --
> Thanks.
> 

Hi John

Just checking in on the status of my patch. If there's any new feedback,
I'm happy to move forward with the next steps.

Regards.
diff mbox series

Patch

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 7bcc9b4408a2..b3cae4dd4f49 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1120,9 +1120,13 @@  static int tls_sw_sendmsg_locked(struct sock *sk, struct msghdr *msg,
 					num_async++;
 				else if (ret == -ENOMEM)
 					goto wait_for_memory;
-				else if (ctx->open_rec && ret == -ENOSPC)
+				else if (ctx->open_rec && ret == -ENOSPC) {
+					if (msg_pl->cork_bytes) {
+						ret = 0;
+						goto send_end;
+					}
 					goto rollback_iter;
-				else if (ret != -EAGAIN)
+				} else if (ret != -EAGAIN)
 					goto send_end;
 			}
 			continue;