From patchwork Thu Mar 16 15:26:08 2023
X-Patchwork-Submitter: David Howells
X-Patchwork-Id: 13177818
From: David Howells
To: Matthew Wilcox, "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe, Jeff Layton, Christian Brauner, Linus Torvalds, netdev@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Bernard Metzler, Tom Talpey, linux-rdma@vger.kernel.org
Subject: [RFC PATCH 18/28] siw: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage to transmit
Date: Thu, 16 Mar 2023 15:26:08 +0000
Message-Id: <20230316152618.711970-19-dhowells@redhat.com>
In-Reply-To: <20230316152618.711970-1-dhowells@redhat.com>
References: <20230316152618.711970-1-dhowells@redhat.com>

When transmitting data, call down into TCP using a single sendmsg with
MSG_SPLICE_PAGES to indicate that content should be spliced rather than
performing several sendmsg and sendpage calls to transmit header, data
pages and trailer.

To make this work, the data is assembled in a bio_vec array and attached to
a BVEC-type iterator.  The header and trailer (if present) are copied into
memory acquired from zcopy_alloc() which just breaks a page up into small
pieces that can be freed with put_page().

Signed-off-by: David Howells
cc: Bernard Metzler
cc: Tom Talpey
cc: "David S. Miller"
cc: Eric Dumazet
cc: Jakub Kicinski
cc: Paolo Abeni
cc: Jens Axboe
cc: Matthew Wilcox
cc: linux-rdma@vger.kernel.org
cc: netdev@vger.kernel.org
---
 drivers/infiniband/sw/siw/siw_qp_tx.c | 231 +++++---------------------
 1 file changed, 46 insertions(+), 185 deletions(-)

diff --git a/drivers/infiniband/sw/siw/siw_qp_tx.c b/drivers/infiniband/sw/siw/siw_qp_tx.c
index 8fc179321e2b..ec4f0ac324ce 100644
--- a/drivers/infiniband/sw/siw/siw_qp_tx.c
+++ b/drivers/infiniband/sw/siw/siw_qp_tx.c
@@ -8,6 +8,7 @@
 #include
 #include
 #include
+#include
 #include
 
 #include
@@ -312,114 +313,8 @@ static int siw_tx_ctrl(struct siw_iwarp_tx *c_tx, struct socket *s,
 	return rv;
 }
 
-/*
- * 0copy TCP transmit interface: Use MSG_SPLICE_PAGES.
- *
- * Using sendpage to push page by page appears to be less efficient
- * than using sendmsg, even if data are copied.
- *
- * A general performance limitation might be the extra four bytes
- * trailer checksum segment to be pushed after user data.
- */
-static int siw_tcp_sendpages(struct socket *s, struct page **page, int offset,
-			     size_t size)
-{
-	struct bio_vec bvec;
-	struct msghdr msg = {
-		.msg_flags = (MSG_SPLICE_PAGES | MSG_MORE | MSG_DONTWAIT |
-			      MSG_SENDPAGE_NOTLAST),
-	};
-	struct sock *sk = s->sk;
-	int i = 0, rv = 0, sent = 0;
-
-	while (size) {
-		size_t bytes = min_t(size_t, PAGE_SIZE - offset, size);
-
-		if (size + offset <= PAGE_SIZE)
-			msg.msg_flags = MSG_SPLICE_PAGES | MSG_MORE | MSG_DONTWAIT;
-
-		tcp_rate_check_app_limited(sk);
-		bvec_set_page(&bvec, page[i], bytes, offset);
-		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
-
-try_page_again:
-		lock_sock(sk);
-		rv = tcp_sendmsg_locked(sk, &msg, size);
-		release_sock(sk);
-
-		if (rv > 0) {
-			size -= rv;
-			sent += rv;
-			if (rv != bytes) {
-				offset += rv;
-				bytes -= rv;
-				goto try_page_again;
-			}
-			offset = 0;
-		} else {
-			if (rv == -EAGAIN || rv == 0)
-				break;
-			return rv;
-		}
-		i++;
-	}
-	return sent;
-}
-
-/*
- * siw_0copy_tx()
- *
- * Pushes list of pages to TCP socket. If pages from multiple
- * SGE's, all referenced pages of each SGE are pushed in one
- * shot.
- */
-static int siw_0copy_tx(struct socket *s, struct page **page,
-			struct siw_sge *sge, unsigned int offset,
-			unsigned int size)
-{
-	int i = 0, sent = 0, rv;
-	int sge_bytes = min(sge->length - offset, size);
-
-	offset = (sge->laddr + offset) & ~PAGE_MASK;
-
-	while (sent != size) {
-		rv = siw_tcp_sendpages(s, &page[i], offset, sge_bytes);
-		if (rv >= 0) {
-			sent += rv;
-			if (size == sent || sge_bytes > rv)
-				break;
-
-			i += PAGE_ALIGN(sge_bytes + offset) >> PAGE_SHIFT;
-			sge++;
-			sge_bytes = min(sge->length, size - sent);
-			offset = sge->laddr & ~PAGE_MASK;
-		} else {
-			sent = rv;
-			break;
-		}
-	}
-	return sent;
-}
-
 #define MAX_TRAILER (MPA_CRC_SIZE + 4)
 
-static void siw_unmap_pages(struct kvec *iov, unsigned long kmap_mask, int len)
-{
-	int i;
-
-	/*
-	 * Work backwards through the array to honor the kmap_local_page()
-	 * ordering requirements.
-	 */
-	for (i = (len-1); i >= 0; i--) {
-		if (kmap_mask & BIT(i)) {
-			unsigned long addr = (unsigned long)iov[i].iov_base;
-
-			kunmap_local((void *)(addr & PAGE_MASK));
-		}
-	}
-}
-
 /*
  * siw_tx_hdt() tries to push a complete packet to TCP where all
  * packet fragments are referenced by the elements of one iovec.
@@ -439,15 +334,13 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
 {
 	struct siw_wqe *wqe = &c_tx->wqe_active;
 	struct siw_sge *sge = &wqe->sqe.sge[c_tx->sge_idx];
-	struct kvec iov[MAX_ARRAY];
-	struct page *page_array[MAX_ARRAY];
+	struct bio_vec bvec[MAX_ARRAY];
 	struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_EOR };
 
 	int seg = 0, do_crc = c_tx->do_crc, is_kva = 0, rv;
 	unsigned int data_len = c_tx->bytes_unsent, hdr_len = 0, trl_len = 0,
 		     sge_off = c_tx->sge_off, sge_idx = c_tx->sge_idx,
 		     pbl_idx = c_tx->pbl_idx;
-	unsigned long kmap_mask = 0L;
 
 	if (c_tx->state == SIW_SEND_HDR) {
 		if (c_tx->use_sendpage) {
@@ -457,10 +350,12 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
 
 			c_tx->state = SIW_SEND_DATA;
 		} else {
-			iov[0].iov_base =
-				(char *)&c_tx->pkt.ctrl + c_tx->ctrl_sent;
-			iov[0].iov_len = hdr_len =
-				c_tx->ctrl_len - c_tx->ctrl_sent;
+			const void *hdr = &c_tx->pkt.ctrl + c_tx->ctrl_sent;
+
+			hdr_len = c_tx->ctrl_len - c_tx->ctrl_sent;
+			rv = zcopy_memdup(hdr_len, hdr, &bvec[0], GFP_NOFS);
+			if (rv < 0)
+				goto done;
 			seg = 1;
 		}
 	}
@@ -478,28 +373,9 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
 		} else {
 			is_kva = 1;
 		}
-		if (is_kva && !c_tx->use_sendpage) {
-			/*
-			 * tx from kernel virtual address: either inline data
-			 * or memory region with assigned kernel buffer
-			 */
-			iov[seg].iov_base =
-				(void *)(uintptr_t)(sge->laddr + sge_off);
-			iov[seg].iov_len = sge_len;
-
-			if (do_crc)
-				crypto_shash_update(c_tx->mpa_crc_hd,
-						    iov[seg].iov_base,
-						    sge_len);
-			sge_off += sge_len;
-			data_len -= sge_len;
-			seg++;
-			goto sge_done;
-		}
 
 		while (sge_len) {
 			size_t plen = min((int)PAGE_SIZE - fp_off, sge_len);
-			void *kaddr;
 
 			if (!is_kva) {
 				struct page *p;
@@ -512,33 +388,12 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
 					p = siw_get_upage(mem->umem,
 							  sge->laddr + sge_off);
 				if (unlikely(!p)) {
-					siw_unmap_pages(iov, kmap_mask, seg);
 					wqe->processed -= c_tx->bytes_unsent;
 					rv = -EFAULT;
 					goto done_crc;
 				}
-				page_array[seg] = p;
-
-				if (!c_tx->use_sendpage) {
-					void *kaddr = kmap_local_page(p);
-
-					/* Remember for later kunmap() */
-					kmap_mask |= BIT(seg);
-					iov[seg].iov_base = kaddr + fp_off;
-					iov[seg].iov_len = plen;
-
-					if (do_crc)
-						crypto_shash_update(
-							c_tx->mpa_crc_hd,
-							iov[seg].iov_base,
-							plen);
-				} else if (do_crc) {
-					kaddr = kmap_local_page(p);
-					crypto_shash_update(c_tx->mpa_crc_hd,
-							    kaddr + fp_off,
-							    plen);
-					kunmap_local(kaddr);
-				}
+
+				bvec_set_page(&bvec[seg], p, plen, fp_off);
 			} else {
 				/*
 				 * Cast to an uintptr_t to preserve all 64 bits
@@ -552,12 +407,15 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
 				 * bits on a 64 bit platform and 32 bits on a
 				 * 32 bit platform.
 				 */
-				page_array[seg] = virt_to_page((void *)(va & PAGE_MASK));
-				if (do_crc)
-					crypto_shash_update(
-						c_tx->mpa_crc_hd,
-						(void *)va,
-						plen);
+				bvec_set_virt(&bvec[seg], (void *)va, plen);
+			}
+
+			if (do_crc) {
+				void *kaddr = kmap_local_page(bvec[seg].bv_page);
+				crypto_shash_update(c_tx->mpa_crc_hd,
+						    kaddr + bvec[seg].bv_offset,
+						    bvec[seg].bv_len);
+				kunmap_local(kaddr);
 			}
 
 			sge_len -= plen;
@@ -567,13 +425,12 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
 
 			if (++seg > (int)MAX_ARRAY) {
 				siw_dbg_qp(tx_qp(c_tx), "to many fragments\n");
-				siw_unmap_pages(iov, kmap_mask, seg-1);
 				wqe->processed -= c_tx->bytes_unsent;
 				rv = -EMSGSIZE;
 				goto done_crc;
 			}
 		}
-sge_done:
+
 		/* Update SGE variables at end of SGE */
 		if (sge_off == sge->length &&
 		    (data_len != 0 || wqe->processed < wqe->bytes)) {
@@ -582,15 +439,8 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
 			sge_off = 0;
 		}
 	}
-	/* trailer */
-	if (likely(c_tx->state != SIW_SEND_TRAILER)) {
-		iov[seg].iov_base = &c_tx->trailer.pad[4 - c_tx->pad];
-		iov[seg].iov_len = trl_len = MAX_TRAILER - (4 - c_tx->pad);
-	} else {
-		iov[seg].iov_base = &c_tx->trailer.pad[c_tx->ctrl_sent];
-		iov[seg].iov_len = trl_len = MAX_TRAILER - c_tx->ctrl_sent;
-	}
 
+	/* Set the CRC in the trailer */
 	if (c_tx->pad) {
 		*(u32 *)c_tx->trailer.pad = 0;
 		if (do_crc)
@@ -603,23 +453,31 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
 	else if (do_crc)
 		crypto_shash_final(c_tx->mpa_crc_hd, (u8 *)&c_tx->trailer.crc);
 
-	data_len = c_tx->bytes_unsent;
+	/* Copy the trailer and add it to the output list */
+	if (likely(c_tx->state != SIW_SEND_TRAILER)) {
+		void *trl = &c_tx->trailer.pad[4 - c_tx->pad];
 
-	if (c_tx->use_sendpage) {
-		rv = siw_0copy_tx(s, page_array, &wqe->sqe.sge[c_tx->sge_idx],
-				  c_tx->sge_off, data_len);
-		if (rv == data_len) {
-			rv = kernel_sendmsg(s, &msg, &iov[seg], 1, trl_len);
-			if (rv > 0)
-				rv += data_len;
-			else
-				rv = data_len;
-		}
+		trl_len = MAX_TRAILER - (4 - c_tx->pad);
+		rv = zcopy_memdup(trl_len, trl, &bvec[seg], GFP_NOFS);
+		if (rv < 0)
+			goto done_crc;
 	} else {
-		rv = kernel_sendmsg(s, &msg, iov, seg + 1,
-				    hdr_len + data_len + trl_len);
-		siw_unmap_pages(iov, kmap_mask, seg);
+		void *trl = &c_tx->trailer.pad[c_tx->ctrl_sent];
+
+		trl_len = MAX_TRAILER - c_tx->ctrl_sent;
+		rv = zcopy_memdup(trl_len, trl, &bvec[seg], GFP_NOFS);
+		if (rv < 0)
+			goto done_crc;
 	}
+
+	data_len = c_tx->bytes_unsent;
+
+	if (c_tx->use_sendpage)
+		msg.msg_flags |= MSG_SPLICE_PAGES;
+	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bvec, seg + 1,
+		      hdr_len + data_len + trl_len);
+	rv = sock_sendmsg(s, &msg);
+
 	if (rv < (int)hdr_len) {
 		/* Not even complete hdr pushed or negative rv */
 		wqe->processed -= data_len;
@@ -680,6 +538,9 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
 	}
 done_crc:
 	c_tx->do_crc = 0;
+	if (c_tx->state == SIW_SEND_HDR)
+		folio_put(page_folio(bvec[0].bv_page));
+	folio_put(page_folio(bvec[seg].bv_page));
 done:
 	return rv;
 }
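
Editorial note: the core idea of the patch is to assemble header, data and trailer into one vector and hand it to the socket in a single sendmsg call. The sketch below is a userspace analogy only, not the siw code. It uses struct iovec and sendmsg(2) in place of the in-kernel bio_vec array, iov_iter_bvec() and sock_sendmsg(), and it does not use MSG_SPLICE_PAGES, which is not assumed to be available to userspace here. The names send_hdt() and send_hdt_selftest() are hypothetical and exist only for this illustration.

```c
/*
 * Userspace analogy (assumption: POSIX sockets; this is NOT kernel code).
 * Gather header, payload and trailer into one vector and transmit them
 * with a single sendmsg() call instead of one call per fragment.
 */
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send up to three fragments in one sendmsg() call. */
ssize_t send_hdt(int fd, const void *hdr, size_t hdr_len,
		 const void *data, size_t data_len,
		 const void *trl, size_t trl_len)
{
	struct iovec iov[3];
	struct msghdr msg;
	int seg = 0;

	/* Only add fragments that are present, as the patch does for the header. */
	if (hdr_len) {
		iov[seg].iov_base = (void *)hdr;
		iov[seg].iov_len = hdr_len;
		seg++;
	}
	if (data_len) {
		iov[seg].iov_base = (void *)data;
		iov[seg].iov_len = data_len;
		seg++;
	}
	if (trl_len) {
		iov[seg].iov_base = (void *)trl;
		iov[seg].iov_len = trl_len;
		seg++;
	}

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = iov;
	msg.msg_iovlen = seg;

	return sendmsg(fd, &msg, 0);
}

/* Hypothetical self-check: returns 0 if the fragments arrive in order. */
int send_hdt_selftest(void)
{
	const char *hdr = "HDR:", *data = "payload", *trl = ":CRC";
	const char *expect = "HDR:payload:CRC";
	size_t total = strlen(expect);
	char buf[64];
	ssize_t sent, got = -1;
	int sv[2];

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;

	sent = send_hdt(sv[0], hdr, strlen(hdr), data, strlen(data),
			trl, strlen(trl));
	if (sent == (ssize_t)total)
		got = recv(sv[1], buf, total, MSG_WAITALL);

	close(sv[0]);
	close(sv[1]);

	if (got != (ssize_t)total)
		return -1;
	return memcmp(buf, expect, total) == 0 ? 0 : -1;
}
```

Calling send_hdt_selftest() from a small main() exercises the helper over a socketpair. As in the patch, the caller still has to handle a short return value, since a single call may push fewer bytes than the header, data and trailer combined.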