Message ID | 20230703090444.38734-1-hare@suse.de
---|---
Series | net/tls: fixes for NVMe-over-TLS
On 7/3/23 12:04, Hannes Reinecke wrote:
> Hi all,
>
> here are some small fixes to get NVMe-over-TLS up and running.
> The first three are just minor modifications to have MSG_EOR handled
> for TLS (and adding a test for it), but the last two implement the
> ->read_sock() callback for tls_sw and that, I guess, could do with
> some reviews.
> It does work with my NVMe-TLS test harness, but what do I know :-)

Hannes, have you tested nvme/tls with the MSG_SPLICE_PAGES series from
david?
On 7/3/23 12:08, Sagi Grimberg wrote:
>
> On 7/3/23 12:04, Hannes Reinecke wrote:
>> Hi all,
>>
>> here are some small fixes to get NVMe-over-TLS up and running.
>> The first three are just minor modifications to have MSG_EOR handled
>> for TLS (and adding a test for it), but the last two implement the
>> ->read_sock() callback for tls_sw and that, I guess, could do with
>> some reviews.
>> It does work with my NVMe-TLS test harness, but what do I know :-)
>
> Hannes, have you tested nvme/tls with the MSG_SPLICE_PAGES series from
> david?

Yes. This patchset has been tested on top of current linus' HEAD, which
includes the MSG_SPLICE_PAGES series (hence the absence of patch hunks
for ->sendpage() and friends).

Cheers,

Hannes
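For readers following along: the MSG_EOR handling mentioned in the cover
letter matters for callers such as nvme-tcp that build one PDU out of
several sendmsg() calls and need to tell the TLS layer when the current
record may be closed. A rough sketch of such a caller is below; the
function name, the "last_fragment" parameter and the fragment layout are
illustrative placeholders, not the actual nvme-tcp code.

	/* Sketch only: stream one PDU fragment over a (TLS) kernel socket,
	 * flagging the final fragment with MSG_EOR so the record can be
	 * closed, and everything else with MSG_MORE. */
	static int send_pdu_fragment(struct socket *sock, void *buf,
				     size_t len, bool last_fragment)
	{
		struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
		struct kvec iov = { .iov_base = buf, .iov_len = len };

		if (last_fragment)
			msg.msg_flags |= MSG_EOR;	/* close the record */
		else
			msg.msg_flags |= MSG_MORE;	/* more data follows */

		return kernel_sendmsg(sock, &msg, &iov, 1, len);
	}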
On 7/3/23 12:20, Hannes Reinecke wrote:
> On 7/3/23 12:08, Sagi Grimberg wrote:
>>
>> On 7/3/23 12:04, Hannes Reinecke wrote:
>>> Hi all,
>>>
>>> here are some small fixes to get NVMe-over-TLS up and running.
>>> The first three are just minor modifications to have MSG_EOR handled
>>> for TLS (and adding a test for it), but the last two implement the
>>> ->read_sock() callback for tls_sw and that, I guess, could do with
>>> some reviews.
>>> It does work with my NVMe-TLS test harness, but what do I know :-)
>>
>> Hannes, have you tested nvme/tls with the MSG_SPLICE_PAGES series from
>> david?
>
> Yes. This patchset has been tested on top of current linus' HEAD, which
> includes the MSG_SPLICE_PAGES series (hence the absence of patch hunks
> for ->sendpage() and friends).
>
Well, of course, spoke too soon.
'discover' and 'connect' works, but when I'm trying to transfer data
(eg by doing a 'mkfs.xfs') the whole thing crashes horribly in
sock_sendmsg() as it's trying to access invalid pages :-(

Debugging to find out if it's my or Davids fault.

Cheers,

Hannes
On 7/3/23 14:13, Hannes Reinecke wrote:
> On 7/3/23 12:20, Hannes Reinecke wrote:
>> On 7/3/23 12:08, Sagi Grimberg wrote:
>>>
>>> On 7/3/23 12:04, Hannes Reinecke wrote:
>>>> Hi all,
>>>>
>>>> here are some small fixes to get NVMe-over-TLS up and running.
>>>> The first three are just minor modifications to have MSG_EOR handled
>>>> for TLS (and adding a test for it), but the last two implement the
>>>> ->read_sock() callback for tls_sw and that, I guess, could do with
>>>> some reviews.
>>>> It does work with my NVMe-TLS test harness, but what do I know :-)
>>>
>>> Hannes, have you tested nvme/tls with the MSG_SPLICE_PAGES series from
>>> david?
>>
>> Yes. This patchset has been tested on top of current linus' HEAD,
>> which includes the MSG_SPLICE_PAGES series (hence the absence of patch
>> hunks for ->sendpage() and friends).
>>
> Well, of course, spoke too soon.
> 'discover' and 'connect' works, but when I'm trying to transfer data
> (eg by doing a 'mkfs.xfs') the whole thing crashes horribly in
> sock_sendmsg() as it's trying to access invalid pages :-(
>
> Debugging to find out if it's my or Davids fault.
>
Hehe. Wasn't me. Crashes even without TLS.

Sagi, does current linus' HEAD (I've tested with e55e5df193d2) work for
you (traddr tcp 127.0.0.1)?

Cheers,

Hannes
Hannes Reinecke <hare@suse.de> wrote:

> > 'discover' and 'connect' works, but when I'm trying to transfer data
> > (eg by doing a 'mkfs.xfs') the whole thing crashes horribly in
> > sock_sendmsg() as it's trying to access invalid pages :-(

Can you be more specific about the crash?

David
> Hannes Reinecke <hare@suse.de> wrote:
>
>>> 'discover' and 'connect' works, but when I'm trying to transfer data
>>> (eg by doing a 'mkfs.xfs') the whole thing crashes horribly in
>>> sock_sendmsg() as it's trying to access invalid pages :-(
>
> Can you be more specific about the crash?

Hannes,

See:
[PATCH net] nvme-tcp: Fix comma-related oops
On 7/3/23 14:26, David Howells wrote:
> Hannes Reinecke <hare@suse.de> wrote:
>
>>> 'discover' and 'connect' works, but when I'm trying to transfer data
>>> (eg by doing a 'mkfs.xfs') the whole thing crashes horribly in
>>> sock_sendmsg() as it's trying to access invalid pages :-(
>
> Can you be more specific about the crash?
>
[ 1804.190109] nvme nvme0: new ctrl: NQN "nqn.test", addr 127.0.0.1:4420
[ 1814.346751] BUG: unable to handle page fault for address: ffffffffc4e166cc
[ 1814.347980] #PF: supervisor read access in kernel mode
[ 1814.348864] #PF: error_code(0x0000) - not-present page
[ 1814.349727] PGD 2b239067 P4D 2b239067 PUD 2b23b067 PMD 0
[ 1814.350630] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 1814.351369] CPU: 0 PID: 2309 Comm: mkfs.xfs Kdump: loaded Tainted: G E 6.4.0-54-default+ #296
[ 1814.353021] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 1814.354298] RIP: 0010:skb_splice_from_iter+0xda/0x310
[ 1814.355171] Code: 0f 8e 2f 02 00 00 4c 8d 6c 24 30 48 8b 54 24 28 4c 89 f8 4d 89 f7 4d 89 ee 49 89 c5 49 8b 1e bd 00 10 00 00 48 29 d5 4c 39 fd <48> 8b 43 08 49 0f 47 ef a8 01 0f 85 c9 01 00 00 66 90 48 89 d8 48
[ 1814.358314] RSP: 0018:ffffa90f00db7778 EFLAGS: 00010283
[ 1814.359202] RAX: ffff8941ad156600 RBX: ffffffffc4e166c4 RCX: 0000000000000000
[ 1814.360426] RDX: 0000000000000fff RSI: 0000000000000000 RDI: 0000000000000000
[ 1814.361613] RBP: 0000000000000001 R08: ffffa90f00db7958 R09: 00000000ac6df68e
[ 1814.362797] R10: 0000000000020000 R11: ffffffffac9873c0 R12: 0000000000000000
[ 1814.363998] R13: ffff8941ad156600 R14: ffffa90f00db77a8 R15: 0000000000001000
[ 1814.365175] FS: 00007f5f0ea63780(0000) GS:ffff8941fb400000(0000) knlGS:0000000000000000
[ 1814.366493] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1814.367456] CR2: ffffffffc4e166cc CR3: 0000000005e68000 CR4: 0000000000350ef0
[ 1814.368671] Call Trace:
[ 1814.369092]  <TASK>
[ 1814.369463]  ? __die_body+0x1a/0x60
[ 1814.370060]  ? page_fault_oops+0x151/0x470
[ 1814.370746]  ? search_bpf_extables+0x65/0x70
[ 1814.371471]  ? fixup_exception+0x22/0x310
[ 1814.372162]  ? exc_page_fault+0xb1/0x150
[ 1814.372822]  ? asm_exc_page_fault+0x22/0x30
[ 1814.373521]  ? kmem_cache_alloc_node+0x1c0/0x320
[ 1814.374294]  ? skb_splice_from_iter+0xda/0x310
[ 1814.375036]  ? skb_splice_from_iter+0xa9/0x310
[ 1814.375777]  tcp_sendmsg_locked+0x763/0xce0
[ 1814.376488]  ? tcp_sendmsg_locked+0x936/0xce0
[ 1814.377225]  tcp_sendmsg+0x27/0x40
[ 1814.377797]  sock_sendmsg+0x8b/0xa0
[ 1814.378386]  nvme_tcp_try_send_data+0x131/0x400 [nvme_tcp]
[ 1814.379297]  ? nvme_tcp_try_send_cmd_pdu+0x164/0x290 [nvme_tcp]
[ 1814.380302]  ? __kernel_text_address+0xe/0x40
[ 1814.381037]  ? __pfx_stack_trace_consume_entry+0x10/0x10
[ 1814.381909]  ? arch_stack_walk+0xa1/0xf0
[ 1814.382565]  nvme_tcp_try_send+0x197/0x2c0 [nvme_tcp]
[ 1814.383407]  ? __rq_qos_issue+0x24/0x40
[ 1814.384069]  nvme_tcp_queue_rq+0x38b/0x3c0 [nvme_tcp]
[ 1814.384907]  __blk_mq_issue_directly+0x3e/0xe0
[ 1814.385648]  blk_mq_try_issue_directly+0x6c/0xa0
[ 1814.386411]  blk_mq_submit_bio+0x52b/0x650
[ 1814.387095]  __submit_bio+0xd8/0x150
[ 1814.387697]  submit_bio_noacct_nocheck+0x151/0x360
[ 1814.388503]  ? submit_bio_wait+0x54/0xc0
[ 1814.389157]  submit_bio_wait+0x54/0xc0
[ 1814.389780]  __blkdev_direct_IO_simple+0x124/0x280

My suspicion is here:

	while (true) {
		struct bio_vec bvec;
		struct msghdr msg = {
			.msg_flags = MSG_DONTWAIT | MSG_SPLICE_PAGES,
		};
		struct page *page = nvme_tcp_req_cur_page(req);
		size_t offset = nvme_tcp_req_cur_offset(req);
		size_t len = nvme_tcp_req_cur_length(req);
		bool last = nvme_tcp_pdu_last_send(req, len);
		int req_data_sent = req->data_sent;
		int ret;

		if (!last || queue->data_digest || nvme_tcp_queue_more(queue))
			msg.msg_flags |= MSG_MORE;

		if (!sendpage_ok(page))
			msg.msg_flags &= ~MSG_SPLICE_PAGES,

		bvec_set_page(&bvec, page, len, offset);
		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
		ret = sock_sendmsg(queue->sock, &msg);

Thing is, 'req' already has an iter (that's what nvme_tcp_req_cur_page()
extracts from). Maybe a missing kmap() somewhere?

But the entire thing is borderline pedantic. 'req' already has an iter,
so we should be using it directly and not create another iter which just
replicates the existing one.

Cheers,

Hannes
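As a rough illustration of the "use the existing iter directly" idea
above: the field and helper names (in particular "req->iter") are
assumptions for the sketch, not a proposed patch, and the MSG_MORE /
last-PDU handling is elided.

	struct msghdr msg = {
		.msg_flags = MSG_DONTWAIT | MSG_SPLICE_PAGES | MSG_MORE,
	};
	int ret;

	/* send straight from the bvec iterator the request already
	 * carries, instead of rebuilding a one-entry bvec per page */
	msg.msg_iter = req->iter;
	ret = sock_sendmsg(queue->sock, &msg);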
Hannes Reinecke <hare@suse.de> wrote:
> msg.msg_flags &= ~MSG_SPLICE_PAGES,
Ah, yes - as Sagi pointed out there's a patch for that comma there.
David
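For reference, the effect of that comma: it makes bvec_set_page() part of
the single statement controlled by 'if (!sendpage_ok(page))', so for an
ordinary page the bvec is never initialized and the subsequent
iov_iter_bvec()/sock_sendmsg() walks a garbage page pointer, which matches
the fault in skb_splice_from_iter() above. A minimal sketch of the
intended flow (illustrative only, not necessarily the exact hunk of the
referenced patch):

	if (!sendpage_ok(page))
		msg.msg_flags &= ~MSG_SPLICE_PAGES;	/* ';' instead of ',' */

	/* now always executed, regardless of sendpage_ok() */
	bvec_set_page(&bvec, page, len, offset);
	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
	ret = sock_sendmsg(queue->sock, &msg);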
On 7/3/23 14:33, Sagi Grimberg wrote:
>
>> Hannes Reinecke <hare@suse.de> wrote:
>>
>>>> 'discover' and 'connect' works, but when I'm trying to transfer data
>>>> (eg by doing a 'mkfs.xfs') the whole thing crashes horribly in
>>>> sock_sendmsg() as it's trying to access invalid pages :-(
>>
>> Can you be more specific about the crash?
>
> Hannes,
>
> See:
> [PATCH net] nvme-tcp: Fix comma-related oops

Ah, right. That solves _that_ issue.

But now I'm deadlocking on the tls_rx_reader_lock() (patched as to your
suggestion). Investigating.

But it brought up yet another can of worms: what _exactly_ is the return
value of ->read_sock()?

There are currently two conflicting use-cases:

-> Ignore the return value, and assume errors etc are signalled
   via 'desc.error':
   net/strparser/strparser.c
   drivers/infiniband/sw/siw
   drivers/scsi/iscsi_tcp.c

-> Use the return value of ->read_sock(), ignoring 'desc.error':
   drivers/nvme/host/tcp.c
   net/ipv4/tcp.c

So which one is it?

Needless to say, implementations following the second style do not set
'desc.error', causing any errors there to be ignored for callers from
the first style...

Jakub?

Cheers,

Hannes
>>> Hannes Reinecke <hare@suse.de> wrote:
>>>
>>>>> 'discover' and 'connect' works, but when I'm trying to transfer data
>>>>> (eg by doing a 'mkfs.xfs') the whole thing crashes horribly in
>>>>> sock_sendmsg() as it's trying to access invalid pages :-(
>>>
>>> Can you be more specific about the crash?
>>
>> Hannes,
>>
>> See:
>> [PATCH net] nvme-tcp: Fix comma-related oops
>
> Ah, right. That solves _that_ issue.
>
> But now I'm deadlocking on the tls_rx_reader_lock() (patched as to your
> suggestion). Investigating.

Are you sure it is a deadlock? or maybe you returned EAGAIN and nvme-tcp
does not interpret this as a transient status and simply returns from
io_work?

> But it brought up yet another can of worms: what _exactly_ is the
> return value of ->read_sock()?
>
> There are currently two conflicting use-cases:
> -> Ignore the return value, and assume errors etc are signalled
>    via 'desc.error'.
>    net/strparser/strparser.c
>    drivers/infiniband/sw/siw
>    drivers/scsi/iscsi_tcp.c
> -> use the return value of ->read_sock(), ignoring 'desc.error':
>    drivers/nvme/host/tcp.c
>    net/ipv4/tcp.c
> So which one is it?
> Needless to say, implementations following the second style do not
> set 'desc.error', causing any errors there to be ignored for callers
> from the first style...

I don't think ignoring the return value of read_sock makes sense because
it can fail outside of the recv_actor failures.

But to be on the safe side, perhaps you can both return an error and set
desc.error?

> Jakub?
>
> Cheers,
>
> Hannes
>
On 7/3/23 15:42, Sagi Grimberg wrote:
>
>>>> Hannes Reinecke <hare@suse.de> wrote:
>>>>
>>>>>> 'discover' and 'connect' works, but when I'm trying to transfer data
>>>>>> (eg by doing a 'mkfs.xfs') the whole thing crashes horribly in
>>>>>> sock_sendmsg() as it's trying to access invalid pages :-(
>>>>
>>>> Can you be more specific about the crash?
>>>
>>> Hannes,
>>>
>>> See:
>>> [PATCH net] nvme-tcp: Fix comma-related oops
>>
>> Ah, right. That solves _that_ issue.
>>
>> But now I'm deadlocking on the tls_rx_reader_lock() (patched as to
>> your suggestion). Investigating.
>
> Are you sure it is a deadlock? or maybe you returned EAGAIN and nvme-tcp
> does not interpret this as a transient status and simply returns from
> io_work?
>
>> But it brought up yet another can of worms: what _exactly_ is the
>> return value of ->read_sock()?
>>
>> There are currently two conflicting use-cases:
>> -> Ignore the return value, and assume errors etc are signalled
>>    via 'desc.error'.
>>    net/strparser/strparser.c
>>    drivers/infiniband/sw/siw
>>    drivers/scsi/iscsi_tcp.c
>> -> use the return value of ->read_sock(), ignoring 'desc.error':
>>    drivers/nvme/host/tcp.c
>>    net/ipv4/tcp.c
>> So which one is it?
>> Needless to say, implementations following the second style do not
>> set 'desc.error', causing any errors there to be ignored for callers
>> from the first style...
>
> I don't think ignoring the return value of read_sock makes sense because
> it can fail outside of the recv_actor failures.
>
Oh, but it's not read_actor which is expected to set desc.error.
Have a look at 'strp_read_sock()':

	/* sk should be locked here, so okay to do read_sock */
	sock->ops->read_sock(strp->sk, &desc, strp_recv);

	desc.error = strp->cb.read_sock_done(strp, desc.error);

it's the ->read_sock() callback which is expected to set desc.error.

> But to be on the safe side, perhaps you can both return an error and set
> desc.error?
>
But why? We can easily make ->read_sock() a void function, then it's
obvious that you can't check the return value.

Cheers,

Hannes
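To make the two conventions concrete, here is a condensed sketch of the
two caller styles discussed above; recv_actor, handle_error and ctx are
placeholders for illustration, not code taken from the listed files.

	read_descriptor_t desc = { .arg.data = ctx, .count = 1 };
	int consumed;

	/* style 1 (strparser, iscsi_tcp, siw): ignore the return value
	 * and expect ->read_sock() to report failures via desc.error */
	sock->ops->read_sock(sk, &desc, recv_actor);
	if (desc.error)
		handle_error(desc.error);

	/* style 2 (nvme-tcp, tcp_read_sock()): use the return value,
	 * which is the consumed byte count or a negative errno, and
	 * never look at desc.error */
	consumed = sock->ops->read_sock(sk, &desc, recv_actor);
	if (consumed < 0)
		handle_error(consumed);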
On 7/3/23 15:42, Sagi Grimberg wrote:
>
>>>> Hannes Reinecke <hare@suse.de> wrote:
>>>>
>>>>>> 'discover' and 'connect' works, but when I'm trying to transfer data
>>>>>> (eg by doing a 'mkfs.xfs') the whole thing crashes horribly in
>>>>>> sock_sendmsg() as it's trying to access invalid pages :-(
>>>>
>>>> Can you be more specific about the crash?
>>>
>>> Hannes,
>>>
>>> See:
>>> [PATCH net] nvme-tcp: Fix comma-related oops
>>
>> Ah, right. That solves _that_ issue.
>>
>> But now I'm deadlocking on the tls_rx_reader_lock() (patched as to
>> your suggestion). Investigating.
>
> Are you sure it is a deadlock? or maybe you returned EAGAIN and nvme-tcp
> does not interpret this as a transient status and simply returns from
> io_work?
>
Unfortunately, yes.

static int tls_rx_reader_acquire(struct sock *sk, struct tls_sw_context_rx *ctx,
				 bool nonblock)
{
	long timeo;

	timeo = sock_rcvtimeo(sk, nonblock);

	while (unlikely(ctx->reader_present)) {
		DEFINE_WAIT_FUNC(wait, woken_wake_function);

		ctx->reader_contended = 1;

		add_wait_queue(&ctx->wq, &wait);
		sk_wait_event(sk, &timeo,
			      !READ_ONCE(ctx->reader_present), &wait);

and sk_wait_event() does:

#define sk_wait_event(__sk, __timeo, __condition, __wait)		\
	({	int __rc;						\
		__sk->sk_wait_pending++;				\
		release_sock(__sk);					\
		__rc = __condition;					\
		if (!__rc) {						\
			*(__timeo) = wait_woken(__wait,			\
						TASK_INTERRUPTIBLE,	\
						*(__timeo));		\
		}							\
		sched_annotate_sleep();					\
		lock_sock(__sk);					\
		__sk->sk_wait_pending--;				\
		__rc = __condition;					\
		__rc;							\
	})

so not calling 'lock_sock()' in tls_rx_reader_acquire() helps only _so_
much, we're still deadlocking.

Cheers,

Hannes
>>>>> Hannes Reinecke <hare@suse.de> wrote:
>>>>>
>>>>>>> 'discover' and 'connect' works, but when I'm trying to transfer data
>>>>>>> (eg by doing a 'mkfs.xfs') the whole thing crashes horribly in
>>>>>>> sock_sendmsg() as it's trying to access invalid pages :-(
>>>>>
>>>>> Can you be more specific about the crash?
>>>>
>>>> Hannes,
>>>>
>>>> See:
>>>> [PATCH net] nvme-tcp: Fix comma-related oops
>>>
>>> Ah, right. That solves _that_ issue.
>>>
>>> But now I'm deadlocking on the tls_rx_reader_lock() (patched as to
>>> your suggestion). Investigating.
>>
>> Are you sure it is a deadlock? or maybe you returned EAGAIN and nvme-tcp
>> does not interpret this as a transient status and simply returns from
>> io_work?
>>
>>> But it brought up yet another can of worms: what _exactly_ is the
>>> return value of ->read_sock()?
>>>
>>> There are currently two conflicting use-cases:
>>> -> Ignore the return value, and assume errors etc are signalled
>>>    via 'desc.error'.
>>>    net/strparser/strparser.c
>>>    drivers/infiniband/sw/siw
>>>    drivers/scsi/iscsi_tcp.c
>>> -> use the return value of ->read_sock(), ignoring 'desc.error':
>>>    drivers/nvme/host/tcp.c
>>>    net/ipv4/tcp.c
>>> So which one is it?
>>> Needless to say, implementations following the second style do not
>>> set 'desc.error', causing any errors there to be ignored for callers
>>> from the first style...
>>
>> I don't think ignoring the return value of read_sock makes sense because
>> it can fail outside of the recv_actor failures.
>>
> Oh, but it's not read_actor which is expected to set desc.error.
> Have a look at 'strp_read_sock()':
>
>	/* sk should be locked here, so okay to do read_sock */
>	sock->ops->read_sock(strp->sk, &desc, strp_recv);
>
>	desc.error = strp->cb.read_sock_done(strp, desc.error);
>
> it's the ->read_sock() callback which is expected to set desc.error.

Then it is completely up to the consumer how it wants to interpret the
error.

>> But to be on the safe side, perhaps you can both return an error and set
>> desc.error?
>>
> But why? We can easily make ->read_sock() a void function, then it's
> obvious that you can't check the return value.

but it returns the consumed byte count, where would this info go?
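For what it's worth, a minimal sketch of the "do both" variant suggested
earlier: return the consumed byte count on success and, on failure, both
return the negative errno and mirror it into desc->error. The receive
loop is elided and the function name is a placeholder; this is not the
proposed tls_sw implementation.

	static int example_read_sock(struct sock *sk, read_descriptor_t *desc,
				     sk_read_actor_t recv_actor)
	{
		int copied = 0;
		int err = 0;

		/* ... feed received data to recv_actor(), adding the
		 * consumed bytes to 'copied' and breaking out with a
		 * negative 'err' on failure ... */

		if (err < 0) {
			desc->error = err;		/* desc.error-style callers */
			return copied ? copied : err;	/* return-value callers */
		}
		return copied;
	}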
On 7/3/23 16:57, Hannes Reinecke wrote:
> On 7/3/23 15:42, Sagi Grimberg wrote:
>>
>>>>> Hannes Reinecke <hare@suse.de> wrote:
>>>>>
>>>>>>> 'discover' and 'connect' works, but when I'm trying to transfer data
>>>>>>> (eg by doing a 'mkfs.xfs') the whole thing crashes horribly in
>>>>>>> sock_sendmsg() as it's trying to access invalid pages :-(
>>>>>
>>>>> Can you be more specific about the crash?
>>>>
>>>> Hannes,
>>>>
>>>> See:
>>>> [PATCH net] nvme-tcp: Fix comma-related oops
>>>
>>> Ah, right. That solves _that_ issue.
>>>
>>> But now I'm deadlocking on the tls_rx_reader_lock() (patched as to
>>> your suggestion). Investigating.
>>
>> Are you sure it is a deadlock? or maybe you returned EAGAIN and nvme-tcp
>> does not interpret this as a transient status and simply returns from
>> io_work?
>>
> Unfortunately, yes.
>
> static int tls_rx_reader_acquire(struct sock *sk, struct tls_sw_context_rx *ctx,
>				  bool nonblock)
> {
>	long timeo;
>
>	timeo = sock_rcvtimeo(sk, nonblock);
>
>	while (unlikely(ctx->reader_present)) {
>		DEFINE_WAIT_FUNC(wait, woken_wake_function);
>
>		ctx->reader_contended = 1;
>
>		add_wait_queue(&ctx->wq, &wait);
>		sk_wait_event(sk, &timeo,
>			      !READ_ONCE(ctx->reader_present), &wait);
>
> and sk_wait_event() does:
>
> #define sk_wait_event(__sk, __timeo, __condition, __wait)		\
>	({	int __rc;						\
>		__sk->sk_wait_pending++;				\
>		release_sock(__sk);					\
>		__rc = __condition;					\
>		if (!__rc) {						\
>			*(__timeo) = wait_woken(__wait,			\
>						TASK_INTERRUPTIBLE,	\
>						*(__timeo));		\
>		}							\
>		sched_annotate_sleep();					\
>		lock_sock(__sk);					\
>		__sk->sk_wait_pending--;				\
>		__rc = __condition;					\
>		__rc;							\
>	})
>
> so not calling 'lock_sock()' in tls_rx_reader_acquire() helps only _so_
> much, we're still deadlocking.

That still is legal assuming that sock lock is taken prior to
sk_wait_event...

What are the blocked threads from sysrq-trigger?
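(The task dump asked for here is what 'echo w > /proc/sysrq-trigger'
produces for blocked tasks, or 'echo t' for all tasks, assuming sysrq is
enabled; the backtraces land in the kernel log.)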