Message ID | 20230316152618.711970-4-dhowells@redhat.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Series | splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES) |
David Howells wrote:
> Make TCP's sendmsg() support MSG_SPLICE_PAGES.  This causes pages to be
> spliced from the source iterator if possible (the iterator must be
> ITER_BVEC and the pages must be spliceable).
>
> This allows ->sendpage() to be replaced by something that can handle
> multiple multipage folios in a single transaction.
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Eric Dumazet <edumazet@google.com>
> cc: "David S. Miller" <davem@davemloft.net>
> cc: Jakub Kicinski <kuba@kernel.org>
> cc: Paolo Abeni <pabeni@redhat.com>
> cc: Jens Axboe <axboe@kernel.dk>
> cc: Matthew Wilcox <willy@infradead.org>
> cc: netdev@vger.kernel.org
> ---
>  net/ipv4/tcp.c | 59 +++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 53 insertions(+), 6 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 288693981b00..77c0c69208a5 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1220,7 +1220,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>  	int flags, err, copied = 0;
>  	int mss_now = 0, size_goal, copied_syn = 0;
>  	int process_backlog = 0;
> -	bool zc = false;
> +	int zc = 0;
>  	long timeo;
>
>  	flags = msg->msg_flags;
> @@ -1231,17 +1231,24 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>  		if (msg->msg_ubuf) {
>  			uarg = msg->msg_ubuf;
>  			net_zcopy_get(uarg);
> -			zc = sk->sk_route_caps & NETIF_F_SG;
> +			if (sk->sk_route_caps & NETIF_F_SG)
> +				zc = 1;
>  		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
>  			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
>  			if (!uarg) {
>  				err = -ENOBUFS;
>  				goto out_err;
>  			}
> -			zc = sk->sk_route_caps & NETIF_F_SG;
> -			if (!zc)
> +			if (sk->sk_route_caps & NETIF_F_SG)
> +				zc = 1;
> +			else
>  				uarg_to_msgzc(uarg)->zerocopy = 0;
>  		}
> +	} else if (unlikely(flags & MSG_SPLICE_PAGES) && size) {
> +		if (!iov_iter_is_bvec(&msg->msg_iter))
> +			return -EINVAL;
> +		if (sk->sk_route_caps & NETIF_F_SG)
> +			zc = 2;
>  	}

The commit message mentions MSG_SPLICE_PAGES as an internal flag.

It can be passed from userspace. The code anticipates that and checks
preconditions.

A side effect is that legacy applications that may already be setting
this bit in the flags now start failing. Most socket types are
historically permissive and simply ignore undefined flags.

With MSG_ZEROCOPY we chose to be extra cautious and added
SOCK_ZEROCOPY, only testing the MSG_ZEROCOPY bit if this socket option
is explicitly enabled. Perhaps more cautious than necessary, but FYI.
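The opt-in gating described for MSG_ZEROCOPY can be modelled outside the kernel. The following is a minimal sketch, not kernel code: `struct gate_sock`, `gate_use_zerocopy()` and the `GATE_MSG_ZEROCOPY` value are hypothetical names invented for illustration. It only shows the idea that a per-call flag bit is honoured when the socket has opted in and is silently ignored otherwise.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit value for illustration; not the kernel's definition. */
#define GATE_MSG_ZEROCOPY 0x1u

/* Hypothetical stand-in for a socket; the field models SOCK_ZEROCOPY. */
struct gate_sock {
	bool zerocopy_enabled;
};

/*
 * Opt-in gating: the per-call bit is honoured only when the socket has
 * opted in. Otherwise the bit is ignored and the send falls back to
 * copying, so a legacy caller that happens to set it keeps working.
 */
static bool gate_use_zerocopy(const struct gate_sock *sk, unsigned int flags)
{
	return sk->zerocopy_enabled && (flags & GATE_MSG_ZEROCOPY) != 0;
}

/* Small usage example covering the three interesting cases. */
static void gate_selfcheck(void)
{
	struct gate_sock legacy = { .zerocopy_enabled = false };
	struct gate_sock opted_in = { .zerocopy_enabled = true };

	assert(!gate_use_zerocopy(&legacy, GATE_MSG_ZEROCOPY));
	assert(gate_use_zerocopy(&opted_in, GATE_MSG_ZEROCOPY));
	assert(!gate_use_zerocopy(&opted_in, 0));
}
```

By contrast, the MSG_SPLICE_PAGES branch in the quoted hunk has no such opt-in and returns -EINVAL when the iterator is not ITER_BVEC, which is the behaviour change for legacy callers raised above.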
Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:

> The commit message mentions MSG_SPLICE_PAGES as an internal flag.
>
> It can be passed from userspace. The code anticipates that and checks
> preconditions.

Should I add a separate field in the in-kernel msghdr struct for such internal
flags?  That would also avoid putting an internal flag in the same space as
the uapi flags.

David
David Howells wrote:
> Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
>
> > The commit message mentions MSG_SPLICE_PAGES as an internal flag.
> >
> > It can be passed from userspace. The code anticipates that and checks
> > preconditions.
>
> Should I add a separate field in the in-kernel msghdr struct for such internal
> flags?  That would also avoid putting an internal flag in the same space as
> the uapi flags.

That would work, if no cost to common paths that don't need it.

A not very pretty alternative would be to add an extra arg to each
sendmsg handler that is used only when called from sendpage.

There are a few other internal MSG_.. flags, such as
MSG_SENDPAGE_NOPOLICY. Those are all limited to sendpage, and ignored
in sendmsg, I think. Which would explain why it was clearly safe to
add them.
Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:

> David Howells wrote:
> > Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> >
> > > The commit message mentions MSG_SPLICE_PAGES as an internal flag.
> > >
> > > It can be passed from userspace. The code anticipates that and checks
> > > preconditions.
> >
> > Should I add a separate field in the in-kernel msghdr struct for such internal
> > flags?  That would also avoid putting an internal flag in the same space as
> > the uapi flags.
>
> That would work, if no cost to common paths that don't need it.

Actually, it might be tricky.  __ip_append_data() doesn't take a msghdr struct
pointer per se.  The "void *from" argument *might* point to one - but it
depends on seeing a MSG_SPLICE_PAGES or MSG_ZEROCOPY flag, otherwise we don't
know.

Possibly this changes if sendpage goes away.

> A not very pretty alternative would be to add an extra arg to each
> sendmsg handler that is used only when called from sendpage.
>
> There are a few other internal MSG_.. flags, such as
> MSG_SENDPAGE_NOPOLICY. Those are all limited to sendpage, and ignored
> in sendmsg, I think. Which would explain why it was clearly safe to
> add them.

Should those be moved across to the internal flags with MSG_SPLICE_PAGES?

David
David Howells wrote:
> Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
>
> > David Howells wrote:
> > > Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> > >
> > > > The commit message mentions MSG_SPLICE_PAGES as an internal flag.
> > > >
> > > > It can be passed from userspace. The code anticipates that and checks
> > > > preconditions.
> > >
> > > Should I add a separate field in the in-kernel msghdr struct for such internal
> > > flags?  That would also avoid putting an internal flag in the same space as
> > > the uapi flags.
> >
> > That would work, if no cost to common paths that don't need it.
>
> Actually, it might be tricky.  __ip_append_data() doesn't take a msghdr struct
> pointer per se.  The "void *from" argument *might* point to one - but it
> depends on seeing a MSG_SPLICE_PAGES or MSG_ZEROCOPY flag, otherwise we don't
> know.
>
> Possibly this changes if sendpage goes away.

Is it sufficient to mask out this bit in tcp_sendmsg_locked and
udp_sendmsg if passed from userspace (and should be ignored), and pass
it through flags to callees like ip_append_data?

> > A not very pretty alternative would be to add an extra arg to each
> > sendmsg handler that is used only when called from sendpage.
> >
> > There are a few other internal MSG_.. flags, such as
> > MSG_SENDPAGE_NOPOLICY. Those are all limited to sendpage, and ignored
> > in sendmsg, I think. Which would explain why it was clearly safe to
> > add them.
>
> Should those be moved across to the internal flags with MSG_SPLICE_PAGES?

I would not include that in this patch series.
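The masking idea floated above can be sketched in a few lines of plain C. This is an illustrative model only, under the assumption that userspace and in-kernel callers reach the protocol code through distinguishable entry points; every `mask_*` function and `MASK_*` constant below is hypothetical and does not correspond to a real kernel symbol or flag value.

```c
#include <assert.h>

/* Hypothetical bit values for illustration; not the kernel's definitions. */
#define MASK_MSG_DONTWAIT     0x1u	/* stands in for an ordinary uapi flag */
#define MASK_MSG_SPLICE_PAGES 0x2u	/* stands in for the internal flag */
#define MASK_INTERNAL_FLAGS   (MASK_MSG_SPLICE_PAGES)

/* Models a callee such as ip_append_data(): it just sees the flags it gets. */
static unsigned int mask_append_data(unsigned int flags)
{
	return flags;
}

/* Userspace entry: strip internal bits so they are ignored, not acted on. */
static unsigned int mask_sendmsg_from_user(unsigned int user_flags)
{
	return mask_append_data(user_flags & ~MASK_INTERNAL_FLAGS);
}

/* In-kernel entry: internal bits are passed through to the callee. */
static unsigned int mask_sendmsg_from_kernel(unsigned int flags)
{
	return mask_append_data(flags);
}

/* Small usage example for both entry points. */
static void mask_selfcheck(void)
{
	unsigned int f = MASK_MSG_DONTWAIT | MASK_MSG_SPLICE_PAGES;

	assert(mask_sendmsg_from_user(f) == MASK_MSG_DONTWAIT);
	assert(mask_sendmsg_from_kernel(f) == f);
}
```

With this shape, a legacy application that sets the bit sees no change in behaviour, while the flag still travels through the existing flags argument to callees that do not receive a msghdr.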
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 288693981b00..77c0c69208a5 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1220,7 +1220,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 	int flags, err, copied = 0;
 	int mss_now = 0, size_goal, copied_syn = 0;
 	int process_backlog = 0;
-	bool zc = false;
+	int zc = 0;
 	long timeo;
 
 	flags = msg->msg_flags;
@@ -1231,17 +1231,24 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 		if (msg->msg_ubuf) {
 			uarg = msg->msg_ubuf;
 			net_zcopy_get(uarg);
-			zc = sk->sk_route_caps & NETIF_F_SG;
+			if (sk->sk_route_caps & NETIF_F_SG)
+				zc = 1;
 		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
 			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
 			if (!uarg) {
 				err = -ENOBUFS;
 				goto out_err;
 			}
-			zc = sk->sk_route_caps & NETIF_F_SG;
-			if (!zc)
+			if (sk->sk_route_caps & NETIF_F_SG)
+				zc = 1;
+			else
 				uarg_to_msgzc(uarg)->zerocopy = 0;
 		}
+	} else if (unlikely(flags & MSG_SPLICE_PAGES) && size) {
+		if (!iov_iter_is_bvec(&msg->msg_iter))
+			return -EINVAL;
+		if (sk->sk_route_caps & NETIF_F_SG)
+			zc = 2;
 	}
 
 	if (unlikely(flags & MSG_FASTOPEN || inet_sk(sk)->defer_connect) &&
@@ -1345,7 +1352,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 		if (copy > msg_data_left(msg))
 			copy = msg_data_left(msg);
 
-		if (!zc) {
+		if (zc == 0) {
 			bool merge = true;
 			int i = skb_shinfo(skb)->nr_frags;
 			struct page_frag *pfrag = sk_page_frag(sk);
@@ -1390,7 +1397,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 				page_ref_inc(pfrag->page);
 			}
 			pfrag->offset += copy;
-		} else {
+		} else if (zc == 1) {
 			/* First append to a fragless skb builds initial
 			 * pure zerocopy skb
 			 */
@@ -1411,6 +1418,46 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 			if (err < 0)
 				goto do_error;
 			copy = err;
+		} else if (zc == 2) {
+			/* Splice in data. */
+			const struct bio_vec *bv = msg->msg_iter.bvec;
+			size_t seg = iov_iter_single_seg_count(&msg->msg_iter);
+			size_t off = bv->bv_offset + msg->msg_iter.iov_offset;
+			bool can_coalesce;
+			int i = skb_shinfo(skb)->nr_frags;
+
+			if (copy > seg)
+				copy = seg;
+
+			can_coalesce = skb_can_coalesce(skb, i, bv->bv_page, off);
+			if (!can_coalesce && i >= READ_ONCE(sysctl_max_skb_frags)) {
+				tcp_mark_push(tp, skb);
+				goto new_segment;
+			}
+			if (tcp_downgrade_zcopy_pure(sk, skb))
+				goto wait_for_space;
+
+			copy = tcp_wmem_schedule(sk, copy);
+			if (!copy)
+				goto wait_for_space;
+
+			if (can_coalesce) {
+				skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
+			} else {
+				get_page(bv->bv_page);
+				skb_fill_page_desc_noacc(skb, i, bv->bv_page, off, copy);
+			}
+			iov_iter_advance(&msg->msg_iter, copy);
+
+			if (!(flags & MSG_NO_SHARED_FRAGS))
+				skb_shinfo(skb)->flags |= SKBFL_SHARED_FRAG;
+
+			skb->len += copy;
+			skb->data_len += copy;
+			skb->truesize += copy;
+			sk_wmem_queued_add(sk, copy);
+			sk_mem_charge(sk, copy);
+
 		}
 
 		if (!copied)
Make TCP's sendmsg() support MSG_SPLICE_PAGES.  This causes pages to be
spliced from the source iterator if possible (the iterator must be
ITER_BVEC and the pages must be spliceable).

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 net/ipv4/tcp.c | 59 +++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 53 insertions(+), 6 deletions(-)
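The way the patch turns `zc` from a bool into a three-way mode (0 = copy, 1 = zerocopy, 2 = splice) can be illustrated with a small standalone model. This is a sketch under stated assumptions, not the kernel function: all `mode_*` and `MODE_*` names are hypothetical, the flag values are made up, the outer `(flags & MSG_ZEROCOPY) && size` test is assumed from the surrounding code that the hunk does not show, and the -ENOBUFS path of msg_zerocopy_realloc() is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit values for illustration; not the kernel's definitions. */
#define MODE_MSG_ZEROCOPY     0x1u
#define MODE_MSG_SPLICE_PAGES 0x2u

enum mode_zc { MODE_ZC_COPY = 0, MODE_ZC_ZEROCOPY = 1, MODE_ZC_SPLICE = 2 };

struct mode_inputs {
	unsigned int flags;
	size_t size;
	bool route_has_sg;	/* models sk->sk_route_caps & NETIF_F_SG */
	bool have_uarg;		/* models msg->msg_ubuf or SOCK_ZEROCOPY being set */
	bool iter_is_bvec;	/* models iov_iter_is_bvec(&msg->msg_iter) */
};

/* Returns 0 and sets *zc, or -EINVAL for a splice request on a non-bvec iterator. */
static int mode_select(const struct mode_inputs *in, enum mode_zc *zc)
{
	*zc = MODE_ZC_COPY;
	if ((in->flags & MODE_MSG_ZEROCOPY) && in->size) {
		if (in->have_uarg && in->route_has_sg)
			*zc = MODE_ZC_ZEROCOPY;
	} else if ((in->flags & MODE_MSG_SPLICE_PAGES) && in->size) {
		if (!in->iter_is_bvec)
			return -EINVAL;
		if (in->route_has_sg)
			*zc = MODE_ZC_SPLICE;
	}
	return 0;
}

/* Small usage example: a splice request on a scatter-gather capable route. */
static void mode_selfcheck(void)
{
	struct mode_inputs in = {
		.flags = MODE_MSG_SPLICE_PAGES, .size = 4096,
		.route_has_sg = true, .iter_is_bvec = true,
	};
	enum mode_zc zc;

	assert(mode_select(&in, &zc) == 0 && zc == MODE_ZC_SPLICE);
}
```

Note that in this model, as in the hunk, a splice request on a route without scatter-gather support is not an error; it falls back to the copy path.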