Message ID | 20230522121125.2595254-4-dhowells@redhat.com (mailing list archive) |
---|---|
State | Accepted |
Commit | 2e910b95329c2dc7feffbec00907f9e02d1a850a |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1 | expand |
On 2023/5/22 20:11, David Howells wrote: Hi, David I am not very familiar with the 'struct iov_iter' yet, just two questions below. > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index 7f53dcb26ad3..f4a5b51aed22 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -6892,3 +6892,91 @@ nodefer: __kfree_skb(skb); > if (unlikely(kick) && !cmpxchg(&sd->defer_ipi_scheduled, 0, 1)) > smp_call_function_single_async(cpu, &sd->defer_csd); > } > + > +static void skb_splice_csum_page(struct sk_buff *skb, struct page *page, > + size_t offset, size_t len) > +{ > + const char *kaddr; > + __wsum csum; > + > + kaddr = kmap_local_page(page); > + csum = csum_partial(kaddr + offset, len, 0); > + kunmap_local(kaddr); > + skb->csum = csum_block_add(skb->csum, csum, skb->len); > +} > + > +/** > + * skb_splice_from_iter - Splice (or copy) pages to skbuff > + * @skb: The buffer to add pages to > + * @iter: Iterator representing the pages to be added > + * @maxsize: Maximum amount of pages to be added > + * @gfp: Allocation flags > + * > + * This is a common helper function for supporting MSG_SPLICE_PAGES. It > + * extracts pages from an iterator and adds them to the socket buffer if > + * possible, copying them to fragments if not possible (such as if they're slab > + * pages). > + * > + * Returns the amount of data spliced/copied or -EMSGSIZE if there's I am not seeing any copying done directly in the skb_splice_from_iter(), maybe iov_iter_extract_pages() has done copying for it? > + * insufficient space in the buffer to transfer anything. > + */ > +ssize_t skb_splice_from_iter(struct sk_buff *skb, struct iov_iter *iter, > + ssize_t maxsize, gfp_t gfp) > +{ > + size_t frag_limit = READ_ONCE(sysctl_max_skb_frags); > + struct page *pages[8], **ppages = pages; > + ssize_t spliced = 0, ret = 0; > + unsigned int i; > + > + while (iter->count > 0) { > + ssize_t space, nr; > + size_t off, len; > + > + ret = -EMSGSIZE; > + space = frag_limit - skb_shinfo(skb)->nr_frags; > + if (space < 0) > + break; > + > + /* We might be able to coalesce without increasing nr_frags */ > + nr = clamp_t(size_t, space, 1, ARRAY_SIZE(pages)); > + > + len = iov_iter_extract_pages(iter, &ppages, maxsize, nr, 0, &off); > + if (len <= 0) { > + ret = len ?: -EIO; > + break; > + } > + > + i = 0; > + do { > + struct page *page = pages[i++]; > + size_t part = min_t(size_t, PAGE_SIZE - off, len); > + > + ret = -EIO; > + if (WARN_ON_ONCE(!sendpage_ok(page))) > + goto out; > + > + ret = skb_append_pagefrags(skb, page, off, part, > + frag_limit); > + if (ret < 0) { > + iov_iter_revert(iter, len); I am not sure I understand the error handling here, doesn't 'len' indicate the remaining size of the data to be appended to skb, maybe we should revert the size of data that is already appended to skb here? Does 'spliced' need to be adjusted accordingly? > + goto out; > + } > + > + if (skb->ip_summed == CHECKSUM_NONE) > + skb_splice_csum_page(skb, page, off, part); > + > + off = 0; > + spliced += part; > + maxsize -= part; > + len -= part; > + } while (len > 0); > + > + if (maxsize <= 0) > + break; > + } > + > +out: > + skb_len_add(skb, spliced); > + return spliced ?: ret; > +} > +EXPORT_SYMBOL(skb_splice_from_iter); > > > . >
Yunsheng Lin <linyunsheng@huawei.com> wrote: > > + * Returns the amount of data spliced/copied or -EMSGSIZE if there's > > I am not seeing any copying done directly in the skb_splice_from_iter(), > maybe iov_iter_extract_pages() has done copying for it? Ah, I took the code for that out and deferred it. The comment needs amending. > > + ret = skb_append_pagefrags(skb, page, off, part, > > + frag_limit); > > + if (ret < 0) { > > + iov_iter_revert(iter, len); > > I am not sure I understand the error handling here, doesn't 'len' > indicate the remaining size of the data to be appended to skb, Yes. > maybe we should revert the size of data that is already appended to skb > here? Does 'spliced' need to be adjusted accordingly? Neither. > I am not very familiar with the 'struct iov_iter' yet An iov_iter struct is a cursor over a buffer. It advances as we draw data or space from that buffer. Sometimes we overdraw and have to back up a bit - hence the revert function. It could possibly be renamed to something more appropriate as (if/when ITER_PIPE is removed) it doesn't actually change the buffer. So looking at skb_splice_from_iter(): iov_iter_extract_pages() is used to get a list of pages from the buffer that we think we're going to be able to handle. If the buffer is of type IOVEC or UBUF those pages would have pins inserted into them also; otherwise no pin or ref will be taken on them. MSG_SPLICE_PAGES should not be used with IOVEC or UBUF types for the moment as the network layer does not yet handle pins. iov_iter_extract_pages() will advance the iterator past the page fragments it has returned. If skb_append_pagefrags() indicates that it could not attach the page, this isn't necessarily fatal - it could return -EMSGSIZE to indicate there was no space, in which case we return to the caller to create a new skbuff. If a non-fatal error occurs, we may already have committed some parts of the buffer to the skbuff and rewinding into that part of the buffer would cause a repeat of the data which would be bad. What the iov_iter_revert() is doing is rewinding iterator back past the part of the extracted pages that we didn't get to use so that we will pick up where we left off next time we're called. It does *not* and must not revert the data we've already transferred. Arguably, I should revert when I return -EIO because sendpage_ok() returned false, but that's a fatal error. David
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 15011408c47c..1b2ebf6113e0 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -5097,5 +5097,8 @@ static inline void skb_mark_for_recycle(struct sk_buff *skb) #endif } +ssize_t skb_splice_from_iter(struct sk_buff *skb, struct iov_iter *iter, + ssize_t maxsize, gfp_t gfp); + #endif /* __KERNEL__ */ #endif /* _LINUX_SKBUFF_H */ diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 7f53dcb26ad3..f4a5b51aed22 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -6892,3 +6892,91 @@ nodefer: __kfree_skb(skb); if (unlikely(kick) && !cmpxchg(&sd->defer_ipi_scheduled, 0, 1)) smp_call_function_single_async(cpu, &sd->defer_csd); } + +static void skb_splice_csum_page(struct sk_buff *skb, struct page *page, + size_t offset, size_t len) +{ + const char *kaddr; + __wsum csum; + + kaddr = kmap_local_page(page); + csum = csum_partial(kaddr + offset, len, 0); + kunmap_local(kaddr); + skb->csum = csum_block_add(skb->csum, csum, skb->len); +} + +/** + * skb_splice_from_iter - Splice (or copy) pages to skbuff + * @skb: The buffer to add pages to + * @iter: Iterator representing the pages to be added + * @maxsize: Maximum amount of pages to be added + * @gfp: Allocation flags + * + * This is a common helper function for supporting MSG_SPLICE_PAGES. It + * extracts pages from an iterator and adds them to the socket buffer if + * possible, copying them to fragments if not possible (such as if they're slab + * pages). + * + * Returns the amount of data spliced/copied or -EMSGSIZE if there's + * insufficient space in the buffer to transfer anything. + */ +ssize_t skb_splice_from_iter(struct sk_buff *skb, struct iov_iter *iter, + ssize_t maxsize, gfp_t gfp) +{ + size_t frag_limit = READ_ONCE(sysctl_max_skb_frags); + struct page *pages[8], **ppages = pages; + ssize_t spliced = 0, ret = 0; + unsigned int i; + + while (iter->count > 0) { + ssize_t space, nr; + size_t off, len; + + ret = -EMSGSIZE; + space = frag_limit - skb_shinfo(skb)->nr_frags; + if (space < 0) + break; + + /* We might be able to coalesce without increasing nr_frags */ + nr = clamp_t(size_t, space, 1, ARRAY_SIZE(pages)); + + len = iov_iter_extract_pages(iter, &ppages, maxsize, nr, 0, &off); + if (len <= 0) { + ret = len ?: -EIO; + break; + } + + i = 0; + do { + struct page *page = pages[i++]; + size_t part = min_t(size_t, PAGE_SIZE - off, len); + + ret = -EIO; + if (WARN_ON_ONCE(!sendpage_ok(page))) + goto out; + + ret = skb_append_pagefrags(skb, page, off, part, + frag_limit); + if (ret < 0) { + iov_iter_revert(iter, len); + goto out; + } + + if (skb->ip_summed == CHECKSUM_NONE) + skb_splice_csum_page(skb, page, off, part); + + off = 0; + spliced += part; + maxsize -= part; + len -= part; + } while (len > 0); + + if (maxsize <= 0) + break; + } + +out: + skb_len_add(skb, spliced); + return spliced ?: ret; +} +EXPORT_SYMBOL(skb_splice_from_iter);
Add a function to handle MSG_SPLICE_PAGES being passed internally to sendmsg(). Pages are spliced into the given socket buffer if possible and copied in if not (e.g. they're slab pages or have a zero refcount). Signed-off-by: David Howells <dhowells@redhat.com> cc: Eric Dumazet <edumazet@google.com> cc: "David S. Miller" <davem@davemloft.net> cc: David Ahern <dsahern@kernel.org> cc: Jakub Kicinski <kuba@kernel.org> cc: Paolo Abeni <pabeni@redhat.com> cc: Al Viro <viro@zeniv.linux.org.uk> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> cc: netdev@vger.kernel.org --- Notes: ver #8) - Order local variables in reverse xmas tree order. - Remove duplicate coalescence check. - Warn if sendpage_ok() fails. ver #7) - Export function. - Never copy data, return -EIO if sendpage_ok() returns false. include/linux/skbuff.h | 3 ++ net/core/skbuff.c | 88 ++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 91 insertions(+)