Message ID | 20230703090444.38734-1-hare@suse.de
---|---
Series | net/tls: fixes for NVMe-over-TLS
On 7/3/23 12:04, Hannes Reinecke wrote:
> Hi all,
>
> here are some small fixes to get NVMe-over-TLS up and running.
> The first three are just minor modifications to have MSG_EOR handled
> for TLS (and adding a test for it), but the last two implement the
> ->read_sock() callback for tls_sw and that, I guess, could do with
> some reviews.
> It does work with my NVMe-TLS test harness, but what do I know :-)

Hannes, have you tested nvme/tls with the MSG_SPLICE_PAGES series from
david?
On 7/3/23 12:08, Sagi Grimberg wrote:
>
> On 7/3/23 12:04, Hannes Reinecke wrote:
>> Hi all,
>>
>> here are some small fixes to get NVMe-over-TLS up and running.
>> The first three are just minor modifications to have MSG_EOR handled
>> for TLS (and adding a test for it), but the last two implement the
>> ->read_sock() callback for tls_sw and that, I guess, could do with
>> some reviews.
>> It does work with my NVMe-TLS test harness, but what do I know :-)
>
> Hannes, have you tested nvme/tls with the MSG_SPLICE_PAGES series from
> david?

Yes. This patchset has been tested on top of current linus' HEAD, which
includes the MSG_SPLICE_PAGES series (hence the absence of patch hunks
for ->sendpage() and friends).

Cheers,

Hannes
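For readers following along: the MSG_EOR handling mentioned in the cover
letter matters for callers such as nvme-tcp that build one PDU out of
several sendmsg() calls and need to tell the TLS layer when the current
record may be closed. A rough sketch of such a caller is below; the
function name, the "last_fragment" parameter and the fragment layout are
illustrative placeholders, not the actual nvme-tcp code.

	/* Sketch only: stream one PDU fragment over a (TLS) kernel socket,
	 * flagging the final fragment with MSG_EOR so the record can be
	 * closed, and everything else with MSG_MORE. */
	static int send_pdu_fragment(struct socket *sock, void *buf,
				     size_t len, bool last_fragment)
	{
		struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
		struct kvec iov = { .iov_base = buf, .iov_len = len };

		if (last_fragment)
			msg.msg_flags |= MSG_EOR;	/* close the record */
		else
			msg.msg_flags |= MSG_MORE;	/* more data follows */

		return kernel_sendmsg(sock, &msg, &iov, 1, len);
	}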
On 7/3/23 12:20, Hannes Reinecke wrote:
> On 7/3/23 12:08, Sagi Grimberg wrote:
>>
>> On 7/3/23 12:04, Hannes Reinecke wrote:
>>> Hi all,
>>>
>>> here are some small fixes to get NVMe-over-TLS up and running.
>>> The first three are just minor modifications to have MSG_EOR handled
>>> for TLS (and adding a test for it), but the last two implement the
>>> ->read_sock() callback for tls_sw and that, I guess, could do with
>>> some reviews.
>>> It does work with my NVMe-TLS test harness, but what do I know :-)
>>
>> Hannes, have you tested nvme/tls with the MSG_SPLICE_PAGES series from
>> david?
>
> Yes. This patchset has been tested on top of current linus' HEAD, which
> includes the MSG_SPLICE_PAGES series (hence the absence of patch hunks
> for ->sendpage() and friends).
>
Well, of course, spoke too soon.
'discover' and 'connect' works, but when I'm trying to transfer data
(eg by doing a 'mkfs.xfs') the whole thing crashes horribly in
sock_sendmsg() as it's trying to access invalid pages :-(

Debugging to find out if it's my or Davids fault.

Cheers,

Hannes
On 7/3/23 14:13, Hannes Reinecke wrote:
> On 7/3/23 12:20, Hannes Reinecke wrote:
>> On 7/3/23 12:08, Sagi Grimberg wrote:
>>>
>>> On 7/3/23 12:04, Hannes Reinecke wrote:
>>>> Hi all,
>>>>
>>>> here are some small fixes to get NVMe-over-TLS up and running.
>>>> The first three are just minor modifications to have MSG_EOR handled
>>>> for TLS (and adding a test for it), but the last two implement the
>>>> ->read_sock() callback for tls_sw and that, I guess, could do with
>>>> some reviews.
>>>> It does work with my NVMe-TLS test harness, but what do I know :-)
>>>
>>> Hannes, have you tested nvme/tls with the MSG_SPLICE_PAGES series from
>>> david?
>>
>> Yes. This patchset has been tested on top of current linus' HEAD,
>> which includes the MSG_SPLICE_PAGES series (hence the absence of patch
>> hunks for ->sendpage() and friends).
>>
> Well, of course, spoke too soon.
> 'discover' and 'connect' works, but when I'm trying to transfer data
> (eg by doing a 'mkfs.xfs') the whole thing crashes horribly in
> sock_sendmsg() as it's trying to access invalid pages :-(
>
> Debugging to find out if it's my or Davids fault.
>
Hehe. Wasn't me. Crashes even without TLS.

Sagi, does current linus' HEAD (I've tested with e55e5df193d2) work for
you (traddr tcp 127.0.0.1)?

Cheers,

Hannes
Hannes Reinecke <hare@suse.de> wrote:

> > 'discover' and 'connect' works, but when I'm trying to transfer data
> > (eg by doing a 'mkfs.xfs') the whole thing crashes horribly in
> > sock_sendmsg() as it's trying to access invalid pages :-(

Can you be more specific about the crash?

David
> Hannes Reinecke <hare@suse.de> wrote:
>
>>> 'discover' and 'connect' works, but when I'm trying to transfer data
>>> (eg by doing a 'mkfs.xfs') the whole thing crashes horribly in
>>> sock_sendmsg() as it's trying to access invalid pages :-(
>
> Can you be more specific about the crash?

Hannes,

See:
[PATCH net] nvme-tcp: Fix comma-related oops
On 7/3/23 14:26, David Howells wrote:
> Hannes Reinecke <hare@suse.de> wrote:
>
>>> 'discover' and 'connect' works, but when I'm trying to transfer data
>>> (eg by doing a 'mkfs.xfs') the whole thing crashes horribly in
>>> sock_sendmsg() as it's trying to access invalid pages :-(
>
> Can you be more specific about the crash?
>
[ 1804.190109] nvme nvme0: new ctrl: NQN "nqn.test", addr 127.0.0.1:4420
[ 1814.346751] BUG: unable to handle page fault for address: ffffffffc4e166cc
[ 1814.347980] #PF: supervisor read access in kernel mode
[ 1814.348864] #PF: error_code(0x0000) - not-present page
[ 1814.349727] PGD 2b239067 P4D 2b239067 PUD 2b23b067 PMD 0
[ 1814.350630] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 1814.351369] CPU: 0 PID: 2309 Comm: mkfs.xfs Kdump: loaded Tainted: G E 6.4.0-54-default+ #296
[ 1814.353021] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 1814.354298] RIP: 0010:skb_splice_from_iter+0xda/0x310
[ 1814.355171] Code: 0f 8e 2f 02 00 00 4c 8d 6c 24 30 48 8b 54 24 28 4c 89 f8 4d 89 f7 4d 89 ee 49 89 c5 49 8b 1e bd 00 10 00 00 48 29 d5 4c 39 fd <48> 8b 43 08 49 0f 47 ef a8 01 0f 85 c9 01 00 00 66 90 48 89 d8 48
[ 1814.358314] RSP: 0018:ffffa90f00db7778 EFLAGS: 00010283
[ 1814.359202] RAX: ffff8941ad156600 RBX: ffffffffc4e166c4 RCX: 0000000000000000
[ 1814.360426] RDX: 0000000000000fff RSI: 0000000000000000 RDI: 0000000000000000
[ 1814.361613] RBP: 0000000000000001 R08: ffffa90f00db7958 R09: 00000000ac6df68e
[ 1814.362797] R10: 0000000000020000 R11: ffffffffac9873c0 R12: 0000000000000000
[ 1814.363998] R13: ffff8941ad156600 R14: ffffa90f00db77a8 R15: 0000000000001000
[ 1814.365175] FS: 00007f5f0ea63780(0000) GS:ffff8941fb400000(0000) knlGS:0000000000000000
[ 1814.366493] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1814.367456] CR2: ffffffffc4e166cc CR3: 0000000005e68000 CR4: 0000000000350ef0
[ 1814.368671] Call Trace:
[ 1814.369092]  <TASK>
[ 1814.369463]  ? __die_body+0x1a/0x60
[ 1814.370060]  ? page_fault_oops+0x151/0x470
[ 1814.370746]  ? search_bpf_extables+0x65/0x70
[ 1814.371471]  ? fixup_exception+0x22/0x310
[ 1814.372162]  ? exc_page_fault+0xb1/0x150
[ 1814.372822]  ? asm_exc_page_fault+0x22/0x30
[ 1814.373521]  ? kmem_cache_alloc_node+0x1c0/0x320
[ 1814.374294]  ? skb_splice_from_iter+0xda/0x310
[ 1814.375036]  ? skb_splice_from_iter+0xa9/0x310
[ 1814.375777]  tcp_sendmsg_locked+0x763/0xce0
[ 1814.376488]  ? tcp_sendmsg_locked+0x936/0xce0
[ 1814.377225]  tcp_sendmsg+0x27/0x40
[ 1814.377797]  sock_sendmsg+0x8b/0xa0
[ 1814.378386]  nvme_tcp_try_send_data+0x131/0x400 [nvme_tcp]
[ 1814.379297]  ? nvme_tcp_try_send_cmd_pdu+0x164/0x290 [nvme_tcp]
[ 1814.380302]  ? __kernel_text_address+0xe/0x40
[ 1814.381037]  ? __pfx_stack_trace_consume_entry+0x10/0x10
[ 1814.381909]  ? arch_stack_walk+0xa1/0xf0
[ 1814.382565]  nvme_tcp_try_send+0x197/0x2c0 [nvme_tcp]
[ 1814.383407]  ? __rq_qos_issue+0x24/0x40
[ 1814.384069]  nvme_tcp_queue_rq+0x38b/0x3c0 [nvme_tcp]
[ 1814.384907]  __blk_mq_issue_directly+0x3e/0xe0
[ 1814.385648]  blk_mq_try_issue_directly+0x6c/0xa0
[ 1814.386411]  blk_mq_submit_bio+0x52b/0x650
[ 1814.387095]  __submit_bio+0xd8/0x150
[ 1814.387697]  submit_bio_noacct_nocheck+0x151/0x360
[ 1814.388503]  ? submit_bio_wait+0x54/0xc0
[ 1814.389157]  submit_bio_wait+0x54/0xc0
[ 1814.389780]  __blkdev_direct_IO_simple+0x124/0x280

My suspicion is here:

	while (true) {
		struct bio_vec bvec;
		struct msghdr msg = {
			.msg_flags = MSG_DONTWAIT | MSG_SPLICE_PAGES,
		};
		struct page *page = nvme_tcp_req_cur_page(req);
		size_t offset = nvme_tcp_req_cur_offset(req);
		size_t len = nvme_tcp_req_cur_length(req);
		bool last = nvme_tcp_pdu_last_send(req, len);
		int req_data_sent = req->data_sent;
		int ret;

		if (!last || queue->data_digest || nvme_tcp_queue_more(queue))
			msg.msg_flags |= MSG_MORE;

		if (!sendpage_ok(page))
			msg.msg_flags &= ~MSG_SPLICE_PAGES,

		bvec_set_page(&bvec, page, len, offset);
		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
		ret = sock_sendmsg(queue->sock, &msg);

Thing is, 'req' already has an iter (that's what nvme_tcp_req_cur_page()
extracts from). Maybe a missing kmap() somewhere?

But the entire thing is borderline pedantic. 'req' already has an iter,
so we should be using it directly and not create another iter which just
replicates the existing one.

Cheers,

Hannes
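As a rough illustration of the "use the existing iter directly" idea
above: the field and helper names (in particular "req->iter") are
assumptions for the sketch, not a proposed patch, and the MSG_MORE /
last-PDU handling is elided.

	struct msghdr msg = {
		.msg_flags = MSG_DONTWAIT | MSG_SPLICE_PAGES | MSG_MORE,
	};
	int ret;

	/* send straight from the bvec iterator the request already
	 * carries, instead of rebuilding a one-entry bvec per page */
	msg.msg_iter = req->iter;
	ret = sock_sendmsg(queue->sock, &msg);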
Hannes Reinecke <hare@suse.de> wrote:
> msg.msg_flags &= ~MSG_SPLICE_PAGES,
Ah, yes - as Sagi pointed out there's a patch for that comma there.
David
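For reference, the effect of that comma: it makes bvec_set_page() part of
the single statement controlled by 'if (!sendpage_ok(page))', so for an
ordinary page the bvec is never initialized and the subsequent
iov_iter_bvec()/sock_sendmsg() walks a garbage page pointer, which matches
the fault in skb_splice_from_iter() above. A minimal sketch of the
intended flow (illustrative only, not necessarily the exact hunk of the
referenced patch):

	if (!sendpage_ok(page))
		msg.msg_flags &= ~MSG_SPLICE_PAGES;	/* ';' instead of ',' */

	/* now always executed, regardless of sendpage_ok() */
	bvec_set_page(&bvec, page, len, offset);
	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
	ret = sock_sendmsg(queue->sock, &msg);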
On 7/3/23 14:33, Sagi Grimberg wrote:
>
>> Hannes Reinecke <hare@suse.de> wrote:
>>
>>>> 'discover' and 'connect' works, but when I'm trying to transfer data
>>>> (eg by doing a 'mkfs.xfs') the whole thing crashes horribly in
>>>> sock_sendmsg() as it's trying to access invalid pages :-(
>>
>> Can you be more specific about the crash?
>
> Hannes,
>
> See:
> [PATCH net] nvme-tcp: Fix comma-related oops

Ah, right. That solves _that_ issue.

But now I'm deadlocking on the tls_rx_reader_lock() (patched as to your
suggestion). Investigating.

But it brought up yet another can of worms: what _exactly_ is the return
value of ->read_sock()?

There are currently two conflicting use-cases:

-> Ignore the return value, and assume errors etc are signalled
   via 'desc.error':
   net/strparser/strparser.c
   drivers/infiniband/sw/siw
   drivers/scsi/iscsi_tcp.c

-> Use the return value of ->read_sock(), ignoring 'desc.error':
   drivers/nvme/host/tcp.c
   net/ipv4/tcp.c

So which one is it?

Needless to say, implementations following the second style do not set
'desc.error', causing any errors there to be ignored for callers from
the first style...

Jakub?

Cheers,

Hannes
>>> Hannes Reinecke <hare@suse.de> wrote:
>>>
>>>>> 'discover' and 'connect' works, but when I'm trying to transfer data
>>>>> (eg by doing a 'mkfs.xfs') the whole thing crashes horribly in
>>>>> sock_sendmsg() as it's trying to access invalid pages :-(
>>>
>>> Can you be more specific about the crash?
>>
>> Hannes,
>>
>> See:
>> [PATCH net] nvme-tcp: Fix comma-related oops
>
> Ah, right. That solves _that_ issue.
>
> But now I'm deadlocking on the tls_rx_reader_lock() (patched as to your
> suggestion). Investigating.

Are you sure it is a deadlock? or maybe you returned EAGAIN and nvme-tcp
does not interpret this as a transient status and simply returns from
io_work?

> But it brought up yet another can of worms: what _exactly_ is the
> return value of ->read_sock()?
>
> There are currently two conflicting use-cases:
> -> Ignore the return value, and assume errors etc are signalled
>    via 'desc.error'.
>    net/strparser/strparser.c
>    drivers/infiniband/sw/siw
>    drivers/scsi/iscsi_tcp.c
> -> use the return value of ->read_sock(), ignoring 'desc.error':
>    drivers/nvme/host/tcp.c
>    net/ipv4/tcp.c
> So which one is it?
> Needless to say, implementations following the second style do not
> set 'desc.error', causing any errors there to be ignored for callers
> from the first style...

I don't think ignoring the return value of read_sock makes sense because
it can fail outside of the recv_actor failures.

But to be on the safe side, perhaps you can both return an error and set
desc.error?

> Jakub?
>
> Cheers,
>
> Hannes
>
On 7/3/23 15:42, Sagi Grimberg wrote:
>
>>>> Hannes Reinecke <hare@suse.de> wrote:
>>>>
>>>>>> 'discover' and 'connect' works, but when I'm trying to transfer data
>>>>>> (eg by doing a 'mkfs.xfs') the whole thing crashes horribly in
>>>>>> sock_sendmsg() as it's trying to access invalid pages :-(
>>>>
>>>> Can you be more specific about the crash?
>>>
>>> Hannes,
>>>
>>> See:
>>> [PATCH net] nvme-tcp: Fix comma-related oops
>>
>> Ah, right. That solves _that_ issue.
>>
>> But now I'm deadlocking on the tls_rx_reader_lock() (patched as to
>> your suggestion). Investigating.
>
> Are you sure it is a deadlock? or maybe you returned EAGAIN and nvme-tcp
> does not interpret this as a transient status and simply returns from
> io_work?
>
>> But it brought up yet another can of worms: what _exactly_ is the
>> return value of ->read_sock()?
>>
>> There are currently two conflicting use-cases:
>> -> Ignore the return value, and assume errors etc are signalled
>>    via 'desc.error'.
>>    net/strparser/strparser.c
>>    drivers/infiniband/sw/siw
>>    drivers/scsi/iscsi_tcp.c
>> -> use the return value of ->read_sock(), ignoring 'desc.error':
>>    drivers/nvme/host/tcp.c
>>    net/ipv4/tcp.c
>> So which one is it?
>> Needless to say, implementations following the second style do not
>> set 'desc.error', causing any errors there to be ignored for callers
>> from the first style...
>
> I don't think ignoring the return value of read_sock makes sense because
> it can fail outside of the recv_actor failures.
>
Oh, but it's not read_actor which is expected to set desc.error.
Have a look at 'strp_read_sock()':

	/* sk should be locked here, so okay to do read_sock */
	sock->ops->read_sock(strp->sk, &desc, strp_recv);

	desc.error = strp->cb.read_sock_done(strp, desc.error);

it's the ->read_sock() callback which is expected to set desc.error.

> But to be on the safe side, perhaps you can both return an error and set
> desc.error?
>
But why? We can easily make ->read_sock() a void function, then it's
obvious that you can't check the return value.

Cheers,

Hannes
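To make the two conventions concrete, here is a condensed sketch of the
two caller styles discussed above; recv_actor, handle_error and ctx are
placeholders for illustration, not code taken from the listed files.

	read_descriptor_t desc = { .arg.data = ctx, .count = 1 };
	int consumed;

	/* style 1 (strparser, iscsi_tcp, siw): ignore the return value
	 * and expect ->read_sock() to report failures via desc.error */
	sock->ops->read_sock(sk, &desc, recv_actor);
	if (desc.error)
		handle_error(desc.error);

	/* style 2 (nvme-tcp, tcp_read_sock()): use the return value,
	 * which is the consumed byte count or a negative errno, and
	 * never look at desc.error */
	consumed = sock->ops->read_sock(sk, &desc, recv_actor);
	if (consumed < 0)
		handle_error(consumed);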
On 7/3/23 15:42, Sagi Grimberg wrote:
>
>>>> Hannes Reinecke <hare@suse.de> wrote:
>>>>
>>>>>> 'discover' and 'connect' works, but when I'm trying to transfer data
>>>>>> (eg by doing a 'mkfs.xfs') the whole thing crashes horribly in
>>>>>> sock_sendmsg() as it's trying to access invalid pages :-(
>>>>
>>>> Can you be more specific about the crash?
>>>
>>> Hannes,
>>>
>>> See:
>>> [PATCH net] nvme-tcp: Fix comma-related oops
>>
>> Ah, right. That solves _that_ issue.
>>
>> But now I'm deadlocking on the tls_rx_reader_lock() (patched as to
>> your suggestion). Investigating.
>
> Are you sure it is a deadlock? or maybe you returned EAGAIN and nvme-tcp
> does not interpret this as a transient status and simply returns from
> io_work?
>
Unfortunately, yes.

static int tls_rx_reader_acquire(struct sock *sk, struct tls_sw_context_rx *ctx,
				 bool nonblock)
{
	long timeo;

	timeo = sock_rcvtimeo(sk, nonblock);

	while (unlikely(ctx->reader_present)) {
		DEFINE_WAIT_FUNC(wait, woken_wake_function);

		ctx->reader_contended = 1;

		add_wait_queue(&ctx->wq, &wait);
		sk_wait_event(sk, &timeo,
			      !READ_ONCE(ctx->reader_present), &wait);

and sk_wait_event() does:

#define sk_wait_event(__sk, __timeo, __condition, __wait)		\
	({	int __rc;						\
		__sk->sk_wait_pending++;				\
		release_sock(__sk);					\
		__rc = __condition;					\
		if (!__rc) {						\
			*(__timeo) = wait_woken(__wait,			\
						TASK_INTERRUPTIBLE,	\
						*(__timeo));		\
		}							\
		sched_annotate_sleep();					\
		lock_sock(__sk);					\
		__sk->sk_wait_pending--;				\
		__rc = __condition;					\
		__rc;							\
	})

so not calling 'lock_sock()' in tls_rx_reader_acquire() helps only _so_
much, we're still deadlocking.

Cheers,

Hannes
>>>>> Hannes Reinecke <hare@suse.de> wrote:
>>>>>
>>>>>>> 'discover' and 'connect' works, but when I'm trying to transfer data
>>>>>>> (eg by doing a 'mkfs.xfs') the whole thing crashes horribly in
>>>>>>> sock_sendmsg() as it's trying to access invalid pages :-(
>>>>>
>>>>> Can you be more specific about the crash?
>>>>
>>>> Hannes,
>>>>
>>>> See:
>>>> [PATCH net] nvme-tcp: Fix comma-related oops
>>>
>>> Ah, right. That solves _that_ issue.
>>>
>>> But now I'm deadlocking on the tls_rx_reader_lock() (patched as to
>>> your suggestion). Investigating.
>>
>> Are you sure it is a deadlock? or maybe you returned EAGAIN and nvme-tcp
>> does not interpret this as a transient status and simply returns from
>> io_work?
>>
>>> But it brought up yet another can of worms: what _exactly_ is the
>>> return value of ->read_sock()?
>>>
>>> There are currently two conflicting use-cases:
>>> -> Ignore the return value, and assume errors etc are signalled
>>>    via 'desc.error'.
>>>    net/strparser/strparser.c
>>>    drivers/infiniband/sw/siw
>>>    drivers/scsi/iscsi_tcp.c
>>> -> use the return value of ->read_sock(), ignoring 'desc.error':
>>>    drivers/nvme/host/tcp.c
>>>    net/ipv4/tcp.c
>>> So which one is it?
>>> Needless to say, implementations following the second style do not
>>> set 'desc.error', causing any errors there to be ignored for callers
>>> from the first style...
>>
>> I don't think ignoring the return value of read_sock makes sense because
>> it can fail outside of the recv_actor failures.
>>
> Oh, but it's not read_actor which is expected to set desc.error.
> Have a look at 'strp_read_sock()':
>
>	/* sk should be locked here, so okay to do read_sock */
>	sock->ops->read_sock(strp->sk, &desc, strp_recv);
>
>	desc.error = strp->cb.read_sock_done(strp, desc.error);
>
> it's the ->read_sock() callback which is expected to set desc.error.

Then it is completely up to the consumer how it wants to interpret the
error.

>> But to be on the safe side, perhaps you can both return an error and set
>> desc.error?
>>
> But why? We can easily make ->read_sock() a void function, then it's
> obvious that you can't check the return value.

but it returns the consumed byte count, where would this info go?
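For what it's worth, a minimal sketch of the "do both" variant suggested
earlier: return the consumed byte count on success and, on failure, both
return the negative errno and mirror it into desc->error. The receive
loop is elided and the function name is a placeholder; this is not the
proposed tls_sw implementation.

	static int example_read_sock(struct sock *sk, read_descriptor_t *desc,
				     sk_read_actor_t recv_actor)
	{
		int copied = 0;
		int err = 0;

		/* ... feed received data to recv_actor(), adding the
		 * consumed bytes to 'copied' and breaking out with a
		 * negative 'err' on failure ... */

		if (err < 0) {
			desc->error = err;		/* desc.error-style callers */
			return copied ? copied : err;	/* return-value callers */
		}
		return copied;
	}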
On 7/3/23 16:57, Hannes Reinecke wrote:
> On 7/3/23 15:42, Sagi Grimberg wrote:
>>
>>>>> Hannes Reinecke <hare@suse.de> wrote:
>>>>>
>>>>>>> 'discover' and 'connect' works, but when I'm trying to transfer data
>>>>>>> (eg by doing a 'mkfs.xfs') the whole thing crashes horribly in
>>>>>>> sock_sendmsg() as it's trying to access invalid pages :-(
>>>>>
>>>>> Can you be more specific about the crash?
>>>>
>>>> Hannes,
>>>>
>>>> See:
>>>> [PATCH net] nvme-tcp: Fix comma-related oops
>>>
>>> Ah, right. That solves _that_ issue.
>>>
>>> But now I'm deadlocking on the tls_rx_reader_lock() (patched as to
>>> your suggestion). Investigating.
>>
>> Are you sure it is a deadlock? or maybe you returned EAGAIN and nvme-tcp
>> does not interpret this as a transient status and simply returns from
>> io_work?
>>
> Unfortunately, yes.
>
> static int tls_rx_reader_acquire(struct sock *sk, struct tls_sw_context_rx *ctx,
>				  bool nonblock)
> {
>	long timeo;
>
>	timeo = sock_rcvtimeo(sk, nonblock);
>
>	while (unlikely(ctx->reader_present)) {
>		DEFINE_WAIT_FUNC(wait, woken_wake_function);
>
>		ctx->reader_contended = 1;
>
>		add_wait_queue(&ctx->wq, &wait);
>		sk_wait_event(sk, &timeo,
>			      !READ_ONCE(ctx->reader_present), &wait);
>
> and sk_wait_event() does:
>
> #define sk_wait_event(__sk, __timeo, __condition, __wait)		\
>	({	int __rc;						\
>		__sk->sk_wait_pending++;				\
>		release_sock(__sk);					\
>		__rc = __condition;					\
>		if (!__rc) {						\
>			*(__timeo) = wait_woken(__wait,			\
>						TASK_INTERRUPTIBLE,	\
>						*(__timeo));		\
>		}							\
>		sched_annotate_sleep();					\
>		lock_sock(__sk);					\
>		__sk->sk_wait_pending--;				\
>		__rc = __condition;					\
>		__rc;							\
>	})
>
> so not calling 'lock_sock()' in tls_rx_reader_acquire() helps only _so_
> much, we're still deadlocking.

That still is legal assuming that sock lock is taken prior to
sk_wait_event...

What are the blocked threads from sysrq-trigger?
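(The task dump asked for here is what 'echo w > /proc/sysrq-trigger'
produces for blocked tasks, or 'echo t' for all tasks, assuming sysrq is
enabled; the backtraces land in the kernel log.)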