Message ID | 20230316152618.711970-9-dhowells@redhat.com (mailing list archive)
---|---
State | Not Applicable
Series | None
> -----Original Message-----
> From: David Howells <dhowells@redhat.com>
> Sent: Thursday, 16 March 2023 16:26
> To: Matthew Wilcox <willy@infradead.org>; David S. Miller <davem@davemloft.net>; Eric Dumazet <edumazet@google.com>; Jakub Kicinski <kuba@kernel.org>; Paolo Abeni <pabeni@redhat.com>
> Cc: David Howells <dhowells@redhat.com>; Al Viro <viro@zeniv.linux.org.uk>; Christoph Hellwig <hch@infradead.org>; Jens Axboe <axboe@kernel.dk>; Jeff Layton <jlayton@kernel.org>; Christian Brauner <brauner@kernel.org>; Linus Torvalds <torvalds@linux-foundation.org>; netdev@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org; linux-mm@kvack.org; Bernard Metzler <BMT@zurich.ibm.com>; Tom Talpey <tom@talpey.com>; linux-rdma@vger.kernel.org
> Subject: [EXTERNAL] [RFC PATCH 08/28] siw: Inline do_tcp_sendpages()
>
> do_tcp_sendpages() is now just a small wrapper around tcp_sendmsg_locked(),
> so inline it, allowing do_tcp_sendpages() to be removed. This is part of
> replacing ->sendpage() with a call to sendmsg() with MSG_SPLICE_PAGES set.
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Bernard Metzler <bmt@zurich.ibm.com>
> cc: Tom Talpey <tom@talpey.com>
> cc: "David S. Miller" <davem@davemloft.net>
> cc: Eric Dumazet <edumazet@google.com>
> cc: Jakub Kicinski <kuba@kernel.org>
> cc: Paolo Abeni <pabeni@redhat.com>
> cc: Jens Axboe <axboe@kernel.dk>
> cc: Matthew Wilcox <willy@infradead.org>
> cc: linux-rdma@vger.kernel.org
> cc: netdev@vger.kernel.org
> ---
>  drivers/infiniband/sw/siw/siw_qp_tx.c | 17 ++++++++++++-----
>  1 file changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/infiniband/sw/siw/siw_qp_tx.c b/drivers/infiniband/sw/siw/siw_qp_tx.c
> index 05052b49107f..8fc179321e2b 100644
> --- a/drivers/infiniband/sw/siw/siw_qp_tx.c
> +++ b/drivers/infiniband/sw/siw/siw_qp_tx.c
> @@ -313,7 +313,7 @@ static int siw_tx_ctrl(struct siw_iwarp_tx *c_tx, struct socket *s,
>  }

Hi David,

many thanks for looking into that! I reply to a limited audience, expecting
limited interest, basically to the rdma list plus Tom.

>  /*
> - * 0copy TCP transmit interface: Use do_tcp_sendpages.
> + * 0copy TCP transmit interface: Use MSG_SPLICE_PAGES.
>   *
>   * Using sendpage to push page by page appears to be less efficient
>   * than using sendmsg, even if data are copied.

That is an interesting observation. Is efficiency to be read as CPU load, or
throughput on the wire, or both?

Back in the days, I introduced that zcopy path for efficiency reasons -
getting both better throughput and less CPU load. I looked at both WRITE and
READ performance. Using do_tcp_sendpages() is currently limited to processing
work which is not registered with local completion generation. Replying to a
remote READ request is a typical case. Did you check with READ?
> @@ -324,20 +324,27 @@ static int siw_tx_ctrl(struct siw_iwarp_tx *c_tx, struct socket *s,
>  static int siw_tcp_sendpages(struct socket *s, struct page **page, int offset,
>  			     size_t size)
>  {
> +	struct bio_vec bvec;
> +	struct msghdr msg = {
> +		.msg_flags = (MSG_SPLICE_PAGES | MSG_MORE | MSG_DONTWAIT |
> +			      MSG_SENDPAGE_NOTLAST),
> +	};
>  	struct sock *sk = s->sk;
> -	int i = 0, rv = 0, sent = 0,
> -	    flags = MSG_MORE | MSG_DONTWAIT | MSG_SENDPAGE_NOTLAST;
> +	int i = 0, rv = 0, sent = 0;
>
>  	while (size) {
>  		size_t bytes = min_t(size_t, PAGE_SIZE - offset, size);
>
>  		if (size + offset <= PAGE_SIZE)
> -			flags = MSG_MORE | MSG_DONTWAIT;
> +			msg.msg_flags = MSG_SPLICE_PAGES | MSG_MORE | MSG_DONTWAIT;
>
>  		tcp_rate_check_app_limited(sk);
> +		bvec_set_page(&bvec, page[i], bytes, offset);
> +		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
> +
>  try_page_again:
>  		lock_sock(sk);
> -		rv = do_tcp_sendpages(sk, page[i], offset, bytes, flags);
> +		rv = tcp_sendmsg_locked(sk, &msg, size);

Would that tcp_sendmsg_locked() with a msg flagged MSG_SPLICE_PAGES still
have zero copy semantics?

>  		release_sock(sk);
>
>  		if (rv > 0) {
Bernard Metzler <BMT@zurich.ibm.com> wrote:

> >  /*
> > - * 0copy TCP transmit interface: Use do_tcp_sendpages.
> > + * 0copy TCP transmit interface: Use MSG_SPLICE_PAGES.
> >   *
> >   * Using sendpage to push page by page appears to be less efficient
> >   * than using sendmsg, even if data are copied.
>
> That is an interesting observation. Is efficiency to be read as
> CPU load, or throughput on the wire, or both?

Um.  The observation in the comment is one you made, not me according to git
blame.  I merely changed "do_tcp_sendpages" to "MSG_SPLICE_PAGES" in the
first line of the comment.

> Back in the days, I introduced that zcopy path for efficiency
> reasons - getting both better throughput and less CPU load.
> I looked at both WRITE and READ performance. Using
> do_tcp_sendpages() is currently limited to processing work
> which is not registered with local completion generation.
> Replying to a remote READ request is a typical case. Did
> you check with READ?

Ah - you're talking about ksmbd there?  I haven't tested the patch with that.

> > -		rv = do_tcp_sendpages(sk, page[i], offset, bytes, flags);
> > +		rv = tcp_sendmsg_locked(sk, &msg, size);
>
> Would that tcp_sendmsg_locked() with a msg flagged
> MSG_SPLICE_PAGES still have zero copy semantics?

Yes - though I am considering making it conditional on whether the pages in
the iterator belong to the slab allocator (in which case they get copied) or
not.

David
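[Editor's note] For readers following the zero-copy question above, the sketch below distils the MSG_SPLICE_PAGES pattern from the patch hunk quoted earlier into a standalone helper that splices a single page into a kernel TCP socket. It is illustrative only: the function name and error handling are ours, not part of the patch, and the flags are reduced to the minimum.

#include <linux/bvec.h>
#include <linux/socket.h>
#include <linux/uio.h>
#include <net/sock.h>
#include <net/tcp.h>

/* Illustrative only: queue one page fragment on a kernel TCP socket with
 * MSG_SPLICE_PAGES, mirroring the calls used in the patch above.  The page
 * is referenced rather than copied, unless the core decides it must copy
 * (e.g. slab-backed pages, as discussed in this thread).
 */
static int example_splice_one_page(struct socket *s, struct page *page,
				   int offset, size_t len)
{
	struct bio_vec bvec;
	struct msghdr msg = {
		.msg_flags = MSG_SPLICE_PAGES | MSG_DONTWAIT,
	};
	struct sock *sk = s->sk;
	int rv;

	/* Point the message iterator at the page fragment to be spliced. */
	bvec_set_page(&bvec, page, len, offset);
	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);

	lock_sock(sk);
	rv = tcp_sendmsg_locked(sk, &msg, len);	/* bytes queued or -errno */
	release_sock(sk);

	return rv;
}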
> -----Original Message-----
> From: David Howells <dhowells@redhat.com>
> Sent: Monday, 20 March 2023 12:09
> To: Bernard Metzler <BMT@zurich.ibm.com>
> Cc: David Howells <dhowells@redhat.com>; Tom Talpey <tom@talpey.com>; linux-rdma@vger.kernel.org
> Subject: [EXTERNAL] Re: [RFC PATCH 08/28] siw: Inline do_tcp_sendpages()
>
> Bernard Metzler <BMT@zurich.ibm.com> wrote:
>
> > >  /*
> > > - * 0copy TCP transmit interface: Use do_tcp_sendpages.
> > > + * 0copy TCP transmit interface: Use MSG_SPLICE_PAGES.
> > >   *
> > >   * Using sendpage to push page by page appears to be less efficient
> > >   * than using sendmsg, even if data are copied.
> >
> > That is an interesting observation. Is efficiency to be read as
> > CPU load, or throughput on the wire, or both?
>
> Um.  The observation in the comment is one you made, not me according to
> git blame.

Haha, yes. So sorry for that. I am getting older ;) I need to put in some
more sanity checks before posting here!

> I merely changed "do_tcp_sendpages" to "MSG_SPLICE_PAGES" in the first
> line of the comment.
>
> > Back in the days, I introduced that zcopy path for efficiency
> > reasons - getting both better throughput and less CPU load.
> > I looked at both WRITE and READ performance. Using
> > do_tcp_sendpages() is currently limited to processing work
> > which is not registered with local completion generation.
> > Replying to a remote READ request is a typical case. Did
> > you check with READ?
>
> Ah - you're talking about ksmbd there?  I haven't tested the patch with
> that.

Did you test with both kernel ULPs and user level applications?

> > > -		rv = do_tcp_sendpages(sk, page[i], offset, bytes, flags);
> > > +		rv = tcp_sendmsg_locked(sk, &msg, size);
> >
> > Would that tcp_sendmsg_locked() with a msg flagged
> > MSG_SPLICE_PAGES still have zero copy semantics?
>
> Yes - though I am considering making it conditional on whether the pages in
> the iterator belong to the slab allocator (in which case they get copied)
> or not.

Sounds good to me!

> David
Bernard Metzler <BMT@zurich.ibm.com> wrote:

> > > Back in the days, I introduced that zcopy path for efficiency
> > > reasons - getting both better throughput and less CPU load.
> > > I looked at both WRITE and READ performance. Using
> > > do_tcp_sendpages() is currently limited to processing work
> > > which is not registered with local completion generation.
> > > Replying to a remote READ request is a typical case. Did
> > > you check with READ?
> >
> > Ah - you're talking about ksmbd there?  I haven't tested the patch with
> > that.
>
> Did you test with both kernel ULPs and user level applications?

Kernel "ULPs"?

As far as cifs goes, I've tested the fs with large dd commands for the
moment, but that's all.  This post was more to find out how attached people
were to ->sendpage() and to see if anyone had any preferences on a couple of
things mentioned in the cover note.  This isn't aimed at the next merge
window.

David
> -----Original Message-----
> From: David Howells <dhowells@redhat.com>
> Sent: Monday, 20 March 2023 14:13
> To: Bernard Metzler <BMT@zurich.ibm.com>
> Cc: David Howells <dhowells@redhat.com>; Tom Talpey <tom@talpey.com>; linux-rdma@vger.kernel.org
> Subject: [EXTERNAL] Re: [RFC PATCH 08/28] siw: Inline do_tcp_sendpages()
>
> Bernard Metzler <BMT@zurich.ibm.com> wrote:
>
> > > > Back in the days, I introduced that zcopy path for efficiency
> > > > reasons - getting both better throughput and less CPU load.
> > > > I looked at both WRITE and READ performance. Using
> > > > do_tcp_sendpages() is currently limited to processing work
> > > > which is not registered with local completion generation.
> > > > Replying to a remote READ request is a typical case. Did
> > > > you check with READ?
> > >
> > > Ah - you're talking about ksmbd there?  I haven't tested the patch with
> > > that.
> >
> > Did you test with both kernel ULPs and user level applications?
>
> Kernel "ULPs"?

I was trying to refer to kernel applications or clients or upper layer
protocols (ULPs, like NFS).

> As far as cifs goes, I've tested the fs with large dd commands for the
> moment, but that's all.  This post was more to find out how attached people
> were to ->sendpage() and to see if anyone had any preferences on a couple
> of things mentioned in the cover note.  This isn't aimed at the next merge
> window.

I like your patches to siw a lot, since they would significantly simplify
the transmit code path.

Thank you,
Bernard.

> David
diff --git a/drivers/infiniband/sw/siw/siw_qp_tx.c b/drivers/infiniband/sw/siw/siw_qp_tx.c
index 05052b49107f..8fc179321e2b 100644
--- a/drivers/infiniband/sw/siw/siw_qp_tx.c
+++ b/drivers/infiniband/sw/siw/siw_qp_tx.c
@@ -313,7 +313,7 @@ static int siw_tx_ctrl(struct siw_iwarp_tx *c_tx, struct socket *s,
 }
 
 /*
- * 0copy TCP transmit interface: Use do_tcp_sendpages.
+ * 0copy TCP transmit interface: Use MSG_SPLICE_PAGES.
  *
  * Using sendpage to push page by page appears to be less efficient
  * than using sendmsg, even if data are copied.
@@ -324,20 +324,27 @@ static int siw_tx_ctrl(struct siw_iwarp_tx *c_tx, struct socket *s,
 static int siw_tcp_sendpages(struct socket *s, struct page **page, int offset,
 			     size_t size)
 {
+	struct bio_vec bvec;
+	struct msghdr msg = {
+		.msg_flags = (MSG_SPLICE_PAGES | MSG_MORE | MSG_DONTWAIT |
+			      MSG_SENDPAGE_NOTLAST),
+	};
 	struct sock *sk = s->sk;
-	int i = 0, rv = 0, sent = 0,
-	    flags = MSG_MORE | MSG_DONTWAIT | MSG_SENDPAGE_NOTLAST;
+	int i = 0, rv = 0, sent = 0;
 
 	while (size) {
 		size_t bytes = min_t(size_t, PAGE_SIZE - offset, size);
 
 		if (size + offset <= PAGE_SIZE)
-			flags = MSG_MORE | MSG_DONTWAIT;
+			msg.msg_flags = MSG_SPLICE_PAGES | MSG_MORE | MSG_DONTWAIT;
 
 		tcp_rate_check_app_limited(sk);
+		bvec_set_page(&bvec, page[i], bytes, offset);
+		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
+
 try_page_again:
 		lock_sock(sk);
-		rv = do_tcp_sendpages(sk, page[i], offset, bytes, flags);
+		rv = tcp_sendmsg_locked(sk, &msg, size);
 		release_sock(sk);
 
 		if (rv > 0) {
do_tcp_sendpages() is now just a small wrapper around tcp_sendmsg_locked(),
so inline it, allowing do_tcp_sendpages() to be removed. This is part of
replacing ->sendpage() with a call to sendmsg() with MSG_SPLICE_PAGES set.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Bernard Metzler <bmt@zurich.ibm.com>
cc: Tom Talpey <tom@talpey.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-rdma@vger.kernel.org
cc: netdev@vger.kernel.org
---
 drivers/infiniband/sw/siw/siw_qp_tx.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)
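[Editor's note] To make the "small wrapper" claim above concrete, here is a rough reconstruction, based only on the commit message and the diff in this thread, of what a do_tcp_sendpages() wrapper around tcp_sendmsg_locked() would look like after the MSG_SPLICE_PAGES conversion done elsewhere in the series. It is a sketch, not verbatim kernel code, and assumes the usual net/ipv4/tcp.c context; once the wrapper is this thin, inlining it into siw_tcp_sendpages() (as the diff does) removes a layer without changing behaviour.

/* Illustrative reconstruction: package the page/offset/size arguments into
 * a bvec-backed msghdr and hand off to tcp_sendmsg_locked(), which does the
 * actual splicing when MSG_SPLICE_PAGES is set.
 */
ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
			 size_t size, int flags)
{
	struct bio_vec bvec;
	struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES, };

	bvec_set_page(&bvec, page, size, offset);
	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);

	return tcp_sendmsg_locked(sk, &msg, size);
}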