
[RFC,net-next,v3,05/29] net: bvec specific path in zerocopy_sg_from_iter

Message ID: 5143111391e771dc97237e2a5e6a74223ef8f15f.1653992701.git.asml.silence@gmail.com (mailing list archive)
State: Superseded
Delegated to: Netdev Maintainers
Series: io_uring zerocopy send

Checks

Context | Check | Description
netdev/tree_selection | success | Clearly marked for net-next, async
netdev/apply | fail | Patch does not apply to net-next

Commit Message

Pavel Begunkov June 28, 2022, 6:56 p.m. UTC
Add a bvec-specialised and optimised path in zerocopy_sg_from_iter.
It'll be used later for {get,put}_page() optimisations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 net/core/datagram.c | 47 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

Comments

Al Viro June 28, 2022, 8:06 p.m. UTC | #1
On Tue, Jun 28, 2022 at 07:56:27PM +0100, Pavel Begunkov wrote:
> Add an bvec specialised and optimised path in zerocopy_sg_from_iter.
> It'll be used later for {get,put}_page() optimisations.

If you need a variant that would not grab page references for ITER_BVEC
(and presumably other non-userland ones), the natural thing to do would
be to provide just such a primitive, wouldn't it?

The fun question here is by which paths an ITER_BVEC can be passed to that
function, and whether all of them are currently guaranteed to hold the
underlying pages pinned...

And AFAICS you quietly assume that only ITER_BVEC ones will ever have that
"managed" flag of your set.  Or am I misreading the next patch in the
series?
Pavel Begunkov June 28, 2022, 9:33 p.m. UTC | #2
On 6/28/22 21:06, Al Viro wrote:
> On Tue, Jun 28, 2022 at 07:56:27PM +0100, Pavel Begunkov wrote:
>> Add an bvec specialised and optimised path in zerocopy_sg_from_iter.
>> It'll be used later for {get,put}_page() optimisations.
> 
> If you need a variant that would not grab page references for ITER_BVEC
> (and presumably other non-userland ones), the natural thing to do would

I don't see other iter types being interesting in this context.

> be to provide just such a primitive, wouldn't it?

A helper returning a page array sounds like overkill and a waste of cycles,
considering that it would just copy one bvec into another, and especially
since iov_iter_get_pages() parses only the first struct bio_vec and so
returns only one page at a time.

I can actually use for_each_bvec(), but that still leaves updating the iter
from the bvec_iter.
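
For illustration, a minimal sketch of what that for_each_bvec() route would
look like (not part of the patch; the helper name is made up and it assumes
the iter is an ITER_BVEC):

#include <linux/bvec.h>
#include <linux/minmax.h>
#include <linux/uio.h>

/*
 * Sketch only: walk the bvec array with a local bvec_iter, then advance
 * the parent iov_iter by hand, which is the extra step mentioned above.
 */
static size_t walk_bvec_sketch(struct iov_iter *from, size_t length)
{
	struct bvec_iter start, bi;
	struct bio_vec bv;
	size_t copied = 0;

	start.bi_size = min(from->count, length);
	start.bi_bvec_done = from->iov_offset;
	start.bi_idx = 0;

	for_each_bvec(bv, from->bvec, bi, start)
		copied += bv.bv_len;	/* a real version would fill skb frags here */

	/* for_each_bvec() only advanced the local bvec_iter; the iov_iter
	 * itself still has to be updated separately.
	 */
	iov_iter_advance(from, copied);
	return copied;
}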

> The fun question here is by which paths ITER_BVEC can be passed to that
> function and which all of them are currently guaranteed to hold the
> underlying pages pinned...

It's the other way around: not all ITER_BVEC iters are managed, but all users
asking to use managed frags (i.e. io_uring) should keep the pages pinned and
provide an ITER_BVEC. It's opt-in, both for users and for protocols.

--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -66,9 +66,16 @@ struct msghdr {
  	};
  	bool		msg_control_is_user : 1;
  	bool		msg_get_inq : 1;/* return INQ after receive */
+	/*
+	 * The data pages are pinned and won't be released before ->msg_ubuf
+	 * is released. ->msg_iter should point to a bvec and ->msg_ubuf has
+	 * to be non-NULL.
+	 */
+	bool		msg_managed_data : 1;
  	unsigned int	msg_flags;	/* flags on received message */
  	__kernel_size_t	msg_controllen;	/* ancillary data buffer length */
  	struct kiocb	*msg_iocb;	/* ptr to iocb for async requests */
+	struct ubuf_info *msg_ubuf;
  };

The user sets ->msg_managed_data, then the protocol sees it and sets
SKBFL_MANAGED_FRAG_REFS. If either of those steps doesn't happen, the
feature is not used.

The ->msg_managed_data part makes io_uring the only user, and io_uring
ensures the pages are pinned.
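
In code, the protocol-side half of the opt-in is just the check from the
__ip_append_data hunk further down the thread, roughly:

	if (!skb_shinfo(skb)->nr_frags) {
		if (msg->msg_managed_data)
			skb_shinfo(skb)->flags |= SKBFL_MANAGED_FRAG_REFS;
	} else {
		/* appending, don't mix managed and unmanaged frags */
		if (!msg->msg_managed_data)
			skb_zcopy_downgrade_managed(skb);
	}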


> And AFAICS you quietly assume that only ITER_BVEC ones will ever have that
> "managed" flag of your set.  Or am I misreading the next patch in the
> series?

I hope the comment just above ->msg_managed_data counts as not quiet.
David Ahern June 28, 2022, 10:52 p.m. UTC | #3
On Tue, Jun 28, 2022 at 07:56:27PM +0100, Pavel Begunkov wrote:
> Add an bvec specialised and optimised path in zerocopy_sg_from_iter.
> It'll be used later for {get,put}_page() optimisations.
> 
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  net/core/datagram.c | 47 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 47 insertions(+)
> 

Rather than propagating iter functions, I have been using the attached
patch for a few months now. It leverages your ubuf_info in msghdr to
allow in-kernel users to pass in their own iter handler.
Pavel Begunkov July 4, 2022, 1:31 p.m. UTC | #4
On 6/28/22 23:52, David Ahern wrote:
> On Tue, Jun 28, 2022 at 07:56:27PM +0100, Pavel Begunkov wrote:
>> Add an bvec specialised and optimised path in zerocopy_sg_from_iter.
>> It'll be used later for {get,put}_page() optimisations.
>>
>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>> ---
>>   net/core/datagram.c | 47 +++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 47 insertions(+)
>>
> 
> Rather than propagating iter functions, I have been using the attached
> patch for a few months now. It leverages your ubuf_info in msghdr to
> allow in kernel users to pass in their own iter handler.

If the series is going to be picked up for 5.20, how about we delay
this one for 5.21? I'll have time to think about it (maybe moving
the skb managed flag setup inside?), and will anyway need to send
some omitted patches then.

I was also entertaining the idea of having a smaller ubuf_info to
fit it into io_kiocb, which is tight on space.

struct ubuf_info {
	void *callback;
	refcount_t refcnt;
	u32 flags;
};

struct ubuf_info_msgzerocopy {
	struct ubuf_info ubuf;
	/* others fields */
};

48 bytes would be taking too much, but 16 looks nice. It might
make sense to move the callback into struct msghdr, I don't know.
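
With such a split, a callback that is only handed the embedded ubuf_info
would recover the outer struct via container_of(); a rough sketch (the
function name is made up):

static void example_zc_callback(struct sk_buff *skb, struct ubuf_info *uarg,
				bool success)
{
	struct ubuf_info_msgzerocopy *zc =
		container_of(uarg, struct ubuf_info_msgzerocopy, ubuf);

	/* complete/notify using the extra fields of the larger struct */
}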
David Ahern July 5, 2022, 2:28 a.m. UTC | #5
On 7/4/22 7:31 AM, Pavel Begunkov wrote:
> If the series is going to be picked up for 5.20, how about we delay
> this one for 5.21? I'll have time to think about it (maybe moving
> the skb managed flag setup inside?), and will anyway need to send
> some omitted patches then.
> 

I think it reads better for io_uring and future io_uring extensions to
contain the optimized bvec iter handler and to set the managed flag. There
are too many disjointed assumptions in the way the code is now. By pulling
that into io_uring, core code does not assume that "managed" means bvec and
no page references; rather, that is embedded in the code that cares.
Pavel Begunkov July 5, 2022, 2:03 p.m. UTC | #6
On 7/5/22 03:28, David Ahern wrote:
> On 7/4/22 7:31 AM, Pavel Begunkov wrote:
>> If the series is going to be picked up for 5.20, how about we delay
>> this one for 5.21? I'll have time to think about it (maybe moving
>> the skb managed flag setup inside?), and will anyway need to send
>> some omitted patches then.
>>
> 
> I think it reads better for io_uring and future extensions for io_uring
> to contain the optimized bvec iter handler and setting the managed flag.
> Too many disjointed assumptions the way the code is now. By pulling that
> into io_uring, core code does not make assumptions that "managed" means
> bvec and no page references - rather that is embedded in the code that
> cares.

Core code would still need to know when to remove the skb's managed
flag, e.g. in case of mixing. It can be worked out, but with assumptions,
which doesn't look better than it currently is. I'll post a 5.20-rebased
version and will iron it out on the way then.
Pavel Begunkov July 5, 2022, 10:09 p.m. UTC | #7
On 7/5/22 15:03, Pavel Begunkov wrote:
> On 7/5/22 03:28, David Ahern wrote:
>> On 7/4/22 7:31 AM, Pavel Begunkov wrote:
>>> If the series is going to be picked up for 5.20, how about we delay
>>> this one for 5.21? I'll have time to think about it (maybe moving
>>> the skb managed flag setup inside?), and will anyway need to send
>>> some omitted patches then.
>>>
>>
>> I think it reads better for io_uring and future extensions for io_uring
>> to contain the optimized bvec iter handler and setting the managed flag.
>> Too many disjointed assumptions the way the code is now. By pulling that
>> into io_uring, core code does not make assumptions that "managed" means
>> bvec and no page references - rather that is embedded in the code that
>> cares.
> 
> Core code would still need to know when to remove the skb's managed
> flag, e.g. in case of mixing. Can be worked out but with assumptions,
> which doesn't look better that it currently is. I'll post a 5.20
> rebased version and will iron it out on the way then.

The incremental looks like below, and probably reads better. What is slightly
dubious is that for zerocopy paths it leaves downgrading the managed bit
to the callback, unlike most other places where that's done by the core.
Also, knowing upfront whether the user requests the feature or not
sounds less convoluted, but I guess it's not that important for now.

I can try to rebase and see how it goes.


diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 2d5badd4b9ff..2cc5b8850cb4 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1782,12 +1782,13 @@ void msg_zerocopy_callback(struct sk_buff *skb, struct ubuf_info *uarg,
  			   bool success);
  
  int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
-			    struct iov_iter *from, size_t length);
+			    struct iov_iter *from, struct msghdr *msg,
+			    size_t length);
  
  static inline int skb_zerocopy_iter_dgram(struct sk_buff *skb,
  					  struct msghdr *msg, int len)
  {
-	return __zerocopy_sg_from_iter(skb->sk, skb, &msg->msg_iter, len);
+	return __zerocopy_sg_from_iter(skb->sk, skb, &msg->msg_iter, msg, len);
  }
  
  int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
diff --git a/include/linux/socket.h b/include/linux/socket.h
index ba84ee614d5a..59b0f47c1f5a 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -14,6 +14,8 @@ struct file;
  struct pid;
  struct cred;
  struct socket;
+struct sock;
+struct sk_buff;
  
  #define __sockaddr_check_size(size)	\
  	BUILD_BUG_ON(((size) > sizeof(struct __kernel_sockaddr_storage)))
@@ -66,16 +68,13 @@ struct msghdr {
  	};
  	bool		msg_control_is_user : 1;
  	bool		msg_get_inq : 1;/* return INQ after receive */
-	/*
-	 * The data pages are pinned and won't be released before ->msg_ubuf
-	 * is released. ->msg_iter should point to a bvec and ->msg_ubuf has
-	 * to be non-NULL.
-	 */
-	bool		msg_managed_data : 1;
  	unsigned int	msg_flags;	/* flags on received message */
  	__kernel_size_t	msg_controllen;	/* ancillary data buffer length */
  	struct kiocb	*msg_iocb;	/* ptr to iocb for async requests */
  	struct ubuf_info *msg_ubuf;
+
+	int (*sg_from_iter)(struct sock *sk, struct sk_buff *skb,
+			    struct iov_iter *from, size_t length);
  };
  
  struct user_msghdr {
diff --git a/io_uring/net.c b/io_uring/net.c
index a142a609790d..b7643f267e20 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -269,7 +269,6 @@ int io_send(struct io_kiocb *req, unsigned int issue_flags)
  	msg.msg_controllen = 0;
  	msg.msg_namelen = 0;
  	msg.msg_ubuf = NULL;
-	msg.msg_managed_data = false;
  
  	flags = sr->msg_flags;
  	if (issue_flags & IO_URING_F_NONBLOCK)
@@ -617,7 +616,6 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
  	msg.msg_controllen = 0;
  	msg.msg_iocb = NULL;
  	msg.msg_ubuf = NULL;
-	msg.msg_managed_data = false;
  
  	flags = sr->msg_flags;
  	if (force_nonblock)
@@ -706,6 +704,60 @@ int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
  	return 0;
  }
  
+static int io_sg_from_iter(struct sock *sk, struct sk_buff *skb,
+			   struct iov_iter *from, size_t length)
+{
+	struct skb_shared_info *shinfo = skb_shinfo(skb);
+	int frag = shinfo->nr_frags;
+	int ret = 0;
+	struct bvec_iter bi;
+	ssize_t copied = 0;
+	unsigned long truesize = 0;
+
+	if (!shinfo->nr_frags)
+		shinfo->flags |= SKBFL_MANAGED_FRAG_REFS;
+
+	if (!skb_zcopy_managed(skb) || !iov_iter_is_bvec(from)) {
+		skb_zcopy_downgrade_managed(skb);
+		return __zerocopy_sg_from_iter(sk, skb, from, NULL, length);
+	}
+
+	bi.bi_size = min(from->count, length);
+	bi.bi_bvec_done = from->iov_offset;
+	bi.bi_idx = 0;
+
+	while (bi.bi_size && frag < MAX_SKB_FRAGS) {
+		struct bio_vec v = mp_bvec_iter_bvec(from->bvec, bi);
+
+		copied += v.bv_len;
+		truesize += PAGE_ALIGN(v.bv_len + v.bv_offset);
+		__skb_fill_page_desc_noacc(shinfo, frag++, v.bv_page,
+					   v.bv_offset, v.bv_len);
+		bvec_iter_advance_single(from->bvec, &bi, v.bv_len);
+	}
+	if (bi.bi_size)
+		ret = -EMSGSIZE;
+
+	shinfo->nr_frags = frag;
+	from->bvec += bi.bi_idx;
+	from->nr_segs -= bi.bi_idx;
+	from->count = bi.bi_size;
+	from->iov_offset = bi.bi_bvec_done;
+
+	skb->data_len += copied;
+	skb->len += copied;
+	skb->truesize += truesize;
+
+	if (sk && sk->sk_type == SOCK_STREAM) {
+		sk_wmem_queued_add(sk, truesize);
+		if (!skb_zcopy_pure(skb))
+			sk_mem_charge(sk, truesize);
+	} else {
+		refcount_add(truesize, &skb->sk->sk_wmem_alloc);
+	}
+	return ret;
+}
+
  int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
  {
  	struct sockaddr_storage address;
@@ -740,7 +792,7 @@ int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
  	msg.msg_control = NULL;
  	msg.msg_controllen = 0;
  	msg.msg_namelen = 0;
-	msg.msg_managed_data = 1;
+	msg.sg_from_iter = io_sg_from_iter;
  
  	if (zc->flags & IORING_RECVSEND_FIXED_BUF) {
  		ret = io_import_fixed(WRITE, &msg.msg_iter, req->imu,
@@ -748,7 +800,6 @@ int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
  		if (unlikely(ret))
  				return ret;
  	} else {
-		msg.msg_managed_data = 0;
  		ret = import_single_range(WRITE, zc->buf, zc->len, &iov,
  					  &msg.msg_iter);
  		if (unlikely(ret))
diff --git a/net/compat.c b/net/compat.c
index 435846fa85e0..6cd2e7683dd0 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -81,7 +81,6 @@ int __get_compat_msghdr(struct msghdr *kmsg,
  
  	kmsg->msg_iocb = NULL;
  	kmsg->msg_ubuf = NULL;
-	kmsg->msg_managed_data = false;
  	*ptr = msg.msg_iov;
  	*len = msg.msg_iovlen;
  	return 0;
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 3c913a6342ad..6901dcb44d72 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -613,59 +613,14 @@ int skb_copy_datagram_from_iter(struct sk_buff *skb, int offset,
  }
  EXPORT_SYMBOL(skb_copy_datagram_from_iter);
  
-static int __zerocopy_sg_from_bvec(struct sock *sk, struct sk_buff *skb,
-				   struct iov_iter *from, size_t length)
-{
-	struct skb_shared_info *shinfo = skb_shinfo(skb);
-	int frag = shinfo->nr_frags;
-	int ret = 0;
-	struct bvec_iter bi;
-	ssize_t copied = 0;
-	unsigned long truesize = 0;
-
-	bi.bi_size = min(from->count, length);
-	bi.bi_bvec_done = from->iov_offset;
-	bi.bi_idx = 0;
-
-	while (bi.bi_size && frag < MAX_SKB_FRAGS) {
-		struct bio_vec v = mp_bvec_iter_bvec(from->bvec, bi);
-
-		copied += v.bv_len;
-		truesize += PAGE_ALIGN(v.bv_len + v.bv_offset);
-		__skb_fill_page_desc_noacc(shinfo, frag++, v.bv_page,
-					   v.bv_offset, v.bv_len);
-		bvec_iter_advance_single(from->bvec, &bi, v.bv_len);
-	}
-	if (bi.bi_size)
-		ret = -EMSGSIZE;
-
-	shinfo->nr_frags = frag;
-	from->bvec += bi.bi_idx;
-	from->nr_segs -= bi.bi_idx;
-	from->count = bi.bi_size;
-	from->iov_offset = bi.bi_bvec_done;
-
-	skb->data_len += copied;
-	skb->len += copied;
-	skb->truesize += truesize;
-
-	if (sk && sk->sk_type == SOCK_STREAM) {
-		sk_wmem_queued_add(sk, truesize);
-		if (!skb_zcopy_pure(skb))
-			sk_mem_charge(sk, truesize);
-	} else {
-		refcount_add(truesize, &skb->sk->sk_wmem_alloc);
-	}
-	return ret;
-}
-
  int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
-			    struct iov_iter *from, size_t length)
+			    struct iov_iter *from, struct msghdr *msg,
+			    size_t length)
  {
  	int frag;
  
-	if (skb_zcopy_managed(skb))
-		return __zerocopy_sg_from_bvec(sk, skb, from, length);
+	if (unlikely(msg && msg->msg_ubuf && msg->sg_from_iter))
+		return msg->sg_from_iter(sk, skb, from, length);
  
  	frag = skb_shinfo(skb)->nr_frags;
  
@@ -753,7 +708,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from)
  	if (skb_copy_datagram_from_iter(skb, 0, from, copy))
  		return -EFAULT;
  
-	return __zerocopy_sg_from_iter(NULL, skb, from, ~0U);
+	return __zerocopy_sg_from_iter(NULL, skb, from, NULL, ~0U);
  }
  EXPORT_SYMBOL(zerocopy_sg_from_iter);
  
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7e6fcb3cd817..046ec3124835 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1368,7 +1368,7 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
  	if (orig_uarg && uarg != orig_uarg)
  		return -EEXIST;
  
-	err = __zerocopy_sg_from_iter(sk, skb, &msg->msg_iter, len);
+	err = __zerocopy_sg_from_iter(sk, skb, &msg->msg_iter, msg, len);
  	if (err == -EFAULT || (err == -EMSGSIZE && skb->len == orig_len)) {
  		struct sock *save_sk = skb->sk;
  
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 3fd1bf675598..df7f9dfbe8be 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1241,18 +1241,7 @@ static int __ip_append_data(struct sock *sk,
  			skb->truesize += copy;
  			wmem_alloc_delta += copy;
  		} else {
-			struct msghdr *msg = from;
-
-			if (!skb_shinfo(skb)->nr_frags) {
-				if (msg->msg_managed_data)
-					skb_shinfo(skb)->flags |= SKBFL_MANAGED_FRAG_REFS;
-			} else {
-				/* appending, don't mix managed and unmanaged */
-				if (!msg->msg_managed_data)
-					skb_zcopy_downgrade_managed(skb);
-			}
-
-			err = skb_zerocopy_iter_dgram(skb, msg, copy);
+			err = skb_zerocopy_iter_dgram(skb, from, copy);
  			if (err < 0)
  				goto error;
  		}
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 05e2f6271f65..634c16fe8dcd 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1392,18 +1392,11 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
  			 * zerocopy skb
  			 */
  			if (!skb->len) {
-				if (msg->msg_managed_data)
-					skb_shinfo(skb)->flags |= SKBFL_MANAGED_FRAG_REFS;
  				skb_shinfo(skb)->flags |= SKBFL_PURE_ZEROCOPY;
-			} else {
-				/* appending, don't mix managed and unmanaged */
-				if (!msg->msg_managed_data)
-					skb_zcopy_downgrade_managed(skb);
-				if (!skb_zcopy_pure(skb)) {
-					copy = tcp_wmem_schedule(sk, copy);
-					if (!copy)
-						goto wait_for_space;
-				}
+			} else if (!skb_zcopy_pure(skb)) {
+				copy = tcp_wmem_schedule(sk, copy);
+				if (!copy)
+					goto wait_for_space;
  			}
  
  			err = skb_zerocopy_iter_stream(sk, skb, msg, copy, uarg);
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 34eb3b5da5e2..897ca4f9b791 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1796,18 +1796,7 @@ static int __ip6_append_data(struct sock *sk,
  			skb->truesize += copy;
  			wmem_alloc_delta += copy;
  		} else {
-			struct msghdr *msg = from;
-
-			if (!skb_shinfo(skb)->nr_frags) {
-				if (msg->msg_managed_data)
-					skb_shinfo(skb)->flags |= SKBFL_MANAGED_FRAG_REFS;
-			} else {
-				/* appending, don't mix managed and unmanaged */
-				if (!msg->msg_managed_data)
-					skb_zcopy_downgrade_managed(skb);
-			}
-
-			err = skb_zerocopy_iter_dgram(skb, msg, copy);
+			err = skb_zerocopy_iter_dgram(skb, from, copy);
  			if (err < 0)
  				goto error;
  		}
diff --git a/net/socket.c b/net/socket.c
index 0963a02b1472..ed061609265e 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2107,7 +2107,6 @@ int __sys_sendto(int fd, void __user *buff, size_t len, unsigned int flags,
  	msg.msg_controllen = 0;
  	msg.msg_namelen = 0;
  	msg.msg_ubuf = NULL;
-	msg.msg_managed_data = false;
  	if (addr) {
  		err = move_addr_to_kernel(addr, addr_len, &address);
  		if (err < 0)
@@ -2174,7 +2173,6 @@ int __sys_recvfrom(int fd, void __user *ubuf, size_t size, unsigned int flags,
  	msg.msg_iocb = NULL;
  	msg.msg_flags = 0;
  	msg.msg_ubuf = NULL;
-	msg.msg_managed_data = false;
  	if (sock->file->f_flags & O_NONBLOCK)
  		flags |= MSG_DONTWAIT;
  	err = sock_recvmsg(sock, &msg, flags);
@@ -2414,7 +2412,6 @@ int __copy_msghdr_from_user(struct msghdr *kmsg,
  
  	kmsg->msg_iocb = NULL;
  	kmsg->msg_ubuf = NULL;
-	kmsg->msg_managed_data = false;
  	*uiov = msg.msg_iov;
  	*nsegs = msg.msg_iovlen;
  	return 0;
David Ahern July 6, 2022, 3:11 p.m. UTC | #8
On 7/5/22 4:09 PM, Pavel Begunkov wrote:
> On 7/5/22 15:03, Pavel Begunkov wrote:
>> On 7/5/22 03:28, David Ahern wrote:
>>> On 7/4/22 7:31 AM, Pavel Begunkov wrote:
>>>> If the series is going to be picked up for 5.20, how about we delay
>>>> this one for 5.21? I'll have time to think about it (maybe moving
>>>> the skb managed flag setup inside?), and will anyway need to send
>>>> some omitted patches then.
>>>>
>>>
>>> I think it reads better for io_uring and future extensions for io_uring
>>> to contain the optimized bvec iter handler and setting the managed flag.
>>> Too many disjointed assumptions the way the code is now. By pulling that
>>> into io_uring, core code does not make assumptions that "managed" means
>>> bvec and no page references - rather that is embedded in the code that
>>> cares.
>>
>> Core code would still need to know when to remove the skb's managed
>> flag, e.g. in case of mixing. Can be worked out but with assumptions,
>> which doesn't look better that it currently is. I'll post a 5.20
>> rebased version and will iron it out on the way then.

Sure. My comment was that MANAGED means something else (not core code)
manages the page references on the skb frags. That flag does not need to
be linked to a customized bvec.
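
Roughly, what the flag buys on the release side is just this (a simplified
sketch, not the exact code from the series):

static void example_release_frag_refs(struct sk_buff *skb)
{
	struct skb_shared_info *shinfo = skb_shinfo(skb);
	int i;

	/* someone else (io_uring via ->msg_ubuf) holds the page references */
	if (skb_zcopy_managed(skb))
		return;

	for (i = 0; i < shinfo->nr_frags; i++)
		__skb_frag_unref(&shinfo->frags[i], false);
}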

> @@ -66,16 +68,13 @@ struct msghdr {
>      };
>      bool        msg_control_is_user : 1;
>      bool        msg_get_inq : 1;/* return INQ after receive */
> -    /*
> -     * The data pages are pinned and won't be released before ->msg_ubuf
> -     * is released. ->msg_iter should point to a bvec and ->msg_ubuf has
> -     * to be non-NULL.
> -     */
> -    bool        msg_managed_data : 1;
>      unsigned int    msg_flags;    /* flags on received message */
>      __kernel_size_t    msg_controllen;    /* ancillary data buffer length */
>      struct kiocb    *msg_iocb;    /* ptr to iocb for async requests */
>      struct ubuf_info *msg_ubuf;
> +
> +    int (*sg_from_iter)(struct sock *sk, struct sk_buff *skb,
> +                struct iov_iter *from, size_t length);
>  };
>  

Putting it in msghdr works too. I chose ubuf_info because it is directly
related to the ZC path, but that struct is getting tight on space.
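
As a sketch of that alternative (the sg_from_iter member on ubuf_info is
hypothetical here), the dispatch in __zerocopy_sg_from_iter() would become
something like:

	if (msg && msg->msg_ubuf && msg->msg_ubuf->sg_from_iter)
		return msg->msg_ubuf->sg_from_iter(sk, skb, from, length);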

Patch

diff --git a/net/core/datagram.c b/net/core/datagram.c
index 50f4faeea76c..5237cb533bb4 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -613,11 +613,58 @@  int skb_copy_datagram_from_iter(struct sk_buff *skb, int offset,
 }
 EXPORT_SYMBOL(skb_copy_datagram_from_iter);
 
+static int __zerocopy_sg_from_bvec(struct sock *sk, struct sk_buff *skb,
+				   struct iov_iter *from, size_t length)
+{
+	int frag = skb_shinfo(skb)->nr_frags;
+	int ret = 0;
+	struct bvec_iter bi;
+	ssize_t copied = 0;
+	unsigned long truesize = 0;
+
+	bi.bi_size = min(from->count, length);
+	bi.bi_bvec_done = from->iov_offset;
+	bi.bi_idx = 0;
+
+	while (bi.bi_size && frag < MAX_SKB_FRAGS) {
+		struct bio_vec v = mp_bvec_iter_bvec(from->bvec, bi);
+
+		copied += v.bv_len;
+		truesize += PAGE_ALIGN(v.bv_len + v.bv_offset);
+		get_page(v.bv_page);
+		skb_fill_page_desc(skb, frag++, v.bv_page, v.bv_offset, v.bv_len);
+		bvec_iter_advance_single(from->bvec, &bi, v.bv_len);
+	}
+	if (bi.bi_size)
+		ret = -EMSGSIZE;
+
+	from->bvec += bi.bi_idx;
+	from->nr_segs -= bi.bi_idx;
+	from->count = bi.bi_size;
+	from->iov_offset = bi.bi_bvec_done;
+
+	skb->data_len += copied;
+	skb->len += copied;
+	skb->truesize += truesize;
+
+	if (sk && sk->sk_type == SOCK_STREAM) {
+		sk_wmem_queued_add(sk, truesize);
+		if (!skb_zcopy_pure(skb))
+			sk_mem_charge(sk, truesize);
+	} else {
+		refcount_add(truesize, &skb->sk->sk_wmem_alloc);
+	}
+	return ret;
+}
+
 int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
 			    struct iov_iter *from, size_t length)
 {
 	int frag = skb_shinfo(skb)->nr_frags;
 
+	if (iov_iter_is_bvec(from))
+		return __zerocopy_sg_from_bvec(sk, skb, from, length);
+
 	while (length && iov_iter_count(from)) {
 		struct page *pages[MAX_SKB_FRAGS];
 		struct page *last_head = NULL;