
[resend,mm,net-next,3/3] net-zerocopy: Use vm_insert_pages() for tcp rcv zerocopy.

Message ID 20200128025958.43490-3-arjunroy.kdev@gmail.com (mailing list archive)
State New, archived
Series [resend,mm,net-next,1/3] mm: Refactor insert_page to prepare for batched-lock insert.

Commit Message

Arjun Roy Jan. 28, 2020, 2:59 a.m. UTC
From: Arjun Roy <arjunroy@google.com>

Use vm_insert_pages() for tcp receive zerocopy. Spin lock cycles
(as reported by perf) drop from a couple of percentage points
to a fraction of a percent. This results in a roughly 6% increase in
efficiency, measured roughly as zerocopy receive count divided by CPU
utilization.

The intention of this patchset is to reduce atomic ops for
tcp zerocopy receives, which normally hit the same spinlock multiple
times consecutively.

Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>

---
 net/ipv4/tcp.c | 67 +++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 61 insertions(+), 6 deletions(-)
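
For orientation before the discussion and the full diff below: the change batches skb frag pages and hands them to vm_insert_pages() in a single call per batch, instead of calling vm_insert_page() once per page and taking the page-table lock each time. A minimal sketch of that call pattern follows (hypothetical helper name, no error unrolling; the real code is in the patch itself):

#include <linux/mm.h>

/* Sketch only: batched insertion of frag pages into the caller's VMA. */
static int insert_frag_pages(struct vm_area_struct *vma, unsigned long addr,
			     struct page **pages, unsigned long nr)
{
	unsigned long pages_remaining = nr;
	int err;

	/* Previously: nr separate vm_insert_page() calls, one lock round
	 * trip each. vm_insert_pages() amortizes that across the batch and
	 * may partially succeed: (nr - pages_remaining) pages got mapped
	 * even when err is nonzero, which the caller must account for.
	 */
	err = vm_insert_pages(vma, addr, pages, &pages_remaining);

	return err;
}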

Comments

Andrew Morton Feb. 13, 2020, 2:56 a.m. UTC | #1
On Mon, 27 Jan 2020 18:59:58 -0800 Arjun Roy <arjunroy.kdev@gmail.com> wrote:

> Use vm_insert_pages() for tcp receive zerocopy. Spin lock cycles
> (as reported by perf) drop from a couple of percentage points
> to a fraction of a percent. This results in a roughly 6% increase in
> efficiency, measured roughly as zerocopy receive count divided by CPU
> utilization.
> 
> The intention of this patch-set is to reduce atomic ops for
> tcp zerocopy receives, which normally hits the same spinlock multiple
> times consecutively.

For some reason the patch causes this:

In file included from ./arch/x86/include/asm/atomic.h:5:0,
                 from ./include/linux/atomic.h:7,
                 from ./include/linux/crypto.h:15,
                 from ./include/crypto/hash.h:11,
                 from net/ipv4/tcp.c:246:
net/ipv4/tcp.c: In function ‘do_tcp_getsockopt.isra.29’:
./include/linux/compiler.h:225:31: warning: ‘tp’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  case 4: *(volatile __u32 *)p = *(__u32 *)res; break;
          ~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~
net/ipv4/tcp.c:1779:19: note: ‘tp’ was declared here
  struct tcp_sock *tp;
                   ^~

It's a false positive.  gcc-7.2.0

: out:
:        up_read(&current->mm->mmap_sem);
:        if (length) {
:                WRITE_ONCE(tp->copied_seq, seq);

but `length' is zero here.  

This suppresses it:

--- a/net/ipv4/tcp.c~net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy-fix
+++ a/net/ipv4/tcp.c
@@ -1788,6 +1788,8 @@ static int tcp_zerocopy_receive(struct s
 
 	sock_rps_record_flow(sk);
 
+	tp = tcp_sk(sk);
+
 	down_read(&current->mm->mmap_sem);
 
 	ret = -EINVAL;
@@ -1796,7 +1798,6 @@ static int tcp_zerocopy_receive(struct s
 		goto out;
 	zc->length = min_t(unsigned long, zc->length, vma->vm_end - address);
 
-	tp = tcp_sk(sk);
 	seq = tp->copied_seq;
 	inq = tcp_inq(sk);
 	zc->length = min_t(u32, zc->length, inq);

and I guess it's zero-cost.
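
For reference, the shape that trips the warning reduces to roughly the following standalone (non-kernel) illustration: tp is assigned only on the path that also makes length nonzero, and the final use is guarded by length rather than by the pointer itself, a correlation that gcc-7's -Wmaybe-uninitialized does not always track (the report above lands in do_tcp_getsockopt.isra.29 because tcp_zerocopy_receive() is inlined there). Whether a particular compiler version warns on this exact reduction is not guaranteed:

struct state { unsigned int copied_seq; };

unsigned int demo(struct state *s, int ok)
{
	struct state *tp;              /* mirrors `struct tcp_sock *tp' */
	unsigned int length = 0, seq = 0;

	if (ok) {
		tp = s;                /* assigned only here ...           */
		seq = tp->copied_seq + 4;
		length = 4;            /* ... on the same path as length   */
	}
	if (length)
		tp->copied_seq = seq;  /* guarded use: false-positive site */
	return length;
}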


Anyway, I'll sit on this lot for a while, hoping for a davem ack?
Arjun Roy Feb. 17, 2020, 2:49 a.m. UTC | #2
On Wed, Feb 12, 2020 at 6:56 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon, 27 Jan 2020 18:59:58 -0800 Arjun Roy <arjunroy.kdev@gmail.com> wrote:
>
> > Use vm_insert_pages() for tcp receive zerocopy. Spin lock cycles
> > (as reported by perf) drop from a couple of percentage points
> > to a fraction of a percent. This results in a roughly 6% increase in
> > efficiency, measured roughly as zerocopy receive count divided by CPU
> > utilization.
> >
> > The intention of this patch-set is to reduce atomic ops for
> > tcp zerocopy receives, which normally hits the same spinlock multiple
> > times consecutively.
>
> For some reason the patch causes this:
>
> In file included from ./arch/x86/include/asm/atomic.h:5:0,
>                  from ./include/linux/atomic.h:7,
>                  from ./include/linux/crypto.h:15,
>                  from ./include/crypto/hash.h:11,
>                  from net/ipv4/tcp.c:246:
> net/ipv4/tcp.c: In function ‘do_tcp_getsockopt.isra.29’:
> ./include/linux/compiler.h:225:31: warning: ‘tp’ may be used uninitialized in this function [-Wmaybe-uninitialized]
>   case 4: *(volatile __u32 *)p = *(__u32 *)res; break;
>           ~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~
> net/ipv4/tcp.c:1779:19: note: ‘tp’ was declared here
>   struct tcp_sock *tp;
>                    ^~
>
> It's a false positive.  gcc-7.2.0
>
> : out:
> :        up_read(&current->mm->mmap_sem);
> :        if (length) {
> :                WRITE_ONCE(tp->copied_seq, seq);
>
> but `length' is zero here.
>
> This suppresses it:
>
> --- a/net/ipv4/tcp.c~net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy-fix
> +++ a/net/ipv4/tcp.c
> @@ -1788,6 +1788,8 @@ static int tcp_zerocopy_receive(struct s
>
>         sock_rps_record_flow(sk);
>
> +       tp = tcp_sk(sk);
> +
>         down_read(&current->mm->mmap_sem);
>
>         ret = -EINVAL;
> @@ -1796,7 +1798,6 @@ static int tcp_zerocopy_receive(struct s
>                 goto out;
>         zc->length = min_t(unsigned long, zc->length, vma->vm_end - address);
>
> -       tp = tcp_sk(sk);
>         seq = tp->copied_seq;
>         inq = tcp_inq(sk);
>         zc->length = min_t(u32, zc->length, inq);
>
> and I guess it's zero-cost.
>
>
> Anyway, I'll sit on this lot for a while, hoping for a davem ack?

Actually, speaking of the ack on the networking side:

I guess this patch set is a bit weird since it requires some
non-trivial coordination between mm and net-next? Not sure what the
normal approach is in this case.

-Arjun
Arjun Roy Feb. 21, 2020, 9:21 p.m. UTC | #3
Andrew, David -

I remain a bit concerned regarding the merge process for this specific
patch (0003, the net/ipv4/tcp.c change) since I have other in-flight
changes for TCP receive zerocopy that I'd like to upstream for
net-next - and would like to avoid weird merge issues.

So perhaps the following could work:

1. Andrew, perhaps we could remove this particular patch (0003, the
net/ipv4/tcp.c change) from mm-next; that way we merge
vm_insert_pages() but not the call-site within TCP, for now.
2. net-next will eventually pick vm_insert_pages() up.
3. I can modify the zerocopy code to use it at that point?

Else I'm concerned a complicated merge situation may result.

What do you all think?

Thanks,
-Arjun

On Sun, Feb 16, 2020 at 6:49 PM Arjun Roy <arjunroy@google.com> wrote:
>
> On Wed, Feb 12, 2020 at 6:56 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Mon, 27 Jan 2020 18:59:58 -0800 Arjun Roy <arjunroy.kdev@gmail.com> wrote:
> >
> > > Use vm_insert_pages() for tcp receive zerocopy. Spin lock cycles
> > > (as reported by perf) drop from a couple of percentage points
> > > to a fraction of a percent. This results in a roughly 6% increase in
> > > efficiency, measured roughly as zerocopy receive count divided by CPU
> > > utilization.
> > >
> > > The intention of this patch-set is to reduce atomic ops for
> > > tcp zerocopy receives, which normally hits the same spinlock multiple
> > > times consecutively.
> >
> > For some reason the patch causes this:
> >
> > In file included from ./arch/x86/include/asm/atomic.h:5:0,
> >                  from ./include/linux/atomic.h:7,
> >                  from ./include/linux/crypto.h:15,
> >                  from ./include/crypto/hash.h:11,
> >                  from net/ipv4/tcp.c:246:
> > net/ipv4/tcp.c: In function ‘do_tcp_getsockopt.isra.29’:
> > ./include/linux/compiler.h:225:31: warning: ‘tp’ may be used uninitialized in this function [-Wmaybe-uninitialized]
> >   case 4: *(volatile __u32 *)p = *(__u32 *)res; break;
> >           ~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~
> > net/ipv4/tcp.c:1779:19: note: ‘tp’ was declared here
> >   struct tcp_sock *tp;
> >                    ^~
> >
> > It's a false positive.  gcc-7.2.0
> >
> > : out:
> > :        up_read(&current->mm->mmap_sem);
> > :        if (length) {
> > :                WRITE_ONCE(tp->copied_seq, seq);
> >
> > but `length' is zero here.
> >
> > This suppresses it:
> >
> > --- a/net/ipv4/tcp.c~net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy-fix
> > +++ a/net/ipv4/tcp.c
> > @@ -1788,6 +1788,8 @@ static int tcp_zerocopy_receive(struct s
> >
> >         sock_rps_record_flow(sk);
> >
> > +       tp = tcp_sk(sk);
> > +
> >         down_read(&current->mm->mmap_sem);
> >
> >         ret = -EINVAL;
> > @@ -1796,7 +1798,6 @@ static int tcp_zerocopy_receive(struct s
> >                 goto out;
> >         zc->length = min_t(unsigned long, zc->length, vma->vm_end - address);
> >
> > -       tp = tcp_sk(sk);
> >         seq = tp->copied_seq;
> >         inq = tcp_inq(sk);
> >         zc->length = min_t(u32, zc->length, inq);
> >
> > and I guess it's zero-cost.
> >
> >
> > Anyway, I'll sit on this lot for a while, hoping for a davem ack?
>
> Actually, speaking of the ack on the networking side:
>
> I guess this patch set is a bit weird since it requires some
> non-trivial coordination between mm and net-next? Not sure what the
> normal approach is in this case.
>
> -Arjun
Andrew Morton Feb. 24, 2020, 3:37 a.m. UTC | #4
On Fri, 21 Feb 2020 13:21:41 -0800 Arjun Roy <arjunroy@google.com> wrote:

> I remain a bit concerned regarding the merge process for this specific
> patch (0003, the net/ipv4/tcp.c change) since I have other in-flight
> changes for TCP receive zerocopy that I'd like to upstream for
> net-next - and would like to avoid weird merge issues.
> 
> So perhaps the following could work:
> 
> 1. Andrew, perhaps we could remove this particular patch (0003, the
> net/ipv4/tcp.c change) from mm-next; that way we merge
> vm_insert_pages() but not the call-site within TCP, for now.
> 2. net-next will eventually pick vm_insert_pages() up.
> 3. I can modify the zerocopy code to use it at that point?
> 
> Else I'm concerned a complicated merge situation may result.
> 
> What do you all think?

We could do that.

For now, I'll stage the entire patch series after linux-next and shall
wait and see whether things which appear in linux-next cause serious
merge issues to occur.  Sound OK?
Arjun Roy Feb. 24, 2020, 4:19 p.m. UTC | #5
On Sun, Feb 23, 2020 at 7:37 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 21 Feb 2020 13:21:41 -0800 Arjun Roy <arjunroy@google.com> wrote:
>
> > I remain a bit concerned regarding the merge process for this specific
> > patch (0003, the net/ipv4/tcp.c change) since I have other in-flight
> > changes for TCP receive zerocopy that I'd like to upstream for
> > net-next - and would like to avoid weird merge issues.
> >
> > So perhaps the following could work:
> >
> > 1. Andrew, perhaps we could remove this particular patch (0003, the
> > net/ipv4/tcp.c change) from mm-next; that way we merge
> > vm_insert_pages() but not the call-site within TCP, for now.
> > 2. net-next will eventually pick vm_insert_pages() up.
> > 3. I can modify the zerocopy code to use it at that point?
> >
> > Else I'm concerned a complicated merge situation may result.
> >
> > What do you all think?
>
> We could do that.
>
> For now, I'll stage the entire patch series after linux-next and shall
> wait and see whether things which appear in linux-next cause serious
> merge issues to occur.  Sound OK?

Sounds good for now; the conflict itself would be easy enough to fix
when it does crop up.

Thanks,
-Arjun
Andrew Morton April 10, 2020, 7:04 p.m. UTC | #6
On Fri, 21 Feb 2020 13:21:41 -0800 Arjun Roy <arjunroy@google.com> wrote:

> I remain a bit concerned regarding the merge process for this specific
> patch (0003, the net/ipv4/tcp.c change) since I have other in-flight
> changes for TCP receive zerocopy that I'd like to upstream for
> net-next - and would like to avoid weird merge issues.
> 
> So perhaps the following could work:
> 
> 1. Andrew, perhaps we could remove this particular patch (0003, the
> net/ipv4/tcp.c change) from mm-next; that way we merge
> vm_insert_pages() but not the call-site within TCP, for now.
> 2. net-next will eventually pick vm_insert_pages() up.
> 3. I can modify the zerocopy code to use it at that point?
> 
> Else I'm concerned a complicated merge situation may result.

The merge situation is quite clean.

I guess I'll hold off on
net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy.patch (below) and
shall send it to davem after Linus has merged the prerequisites.


From: Arjun Roy <arjunroy@google.com>
Subject: net-zerocopy: use vm_insert_pages() for tcp rcv zerocopy

Use vm_insert_pages() for tcp receive zerocopy.  Spin lock cycles (as
reported by perf) drop from a couple of percentage points to a fraction of
a percent.  This results in a roughly 6% increase in efficiency, measured
roughly as zerocopy receive count divided by CPU utilization.

The intention of this patchset is to reduce atomic ops for tcp zerocopy
receives, which normally hits the same spinlock multiple times
consecutively.

[akpm@linux-foundation.org: suppress gcc-7.2.0 warning]
Link: http://lkml.kernel.org/r/20200128025958.43490-3-arjunroy.kdev@gmail.com
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Cc: David Miller <davem@davemloft.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 net/ipv4/tcp.c |   70 ++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 63 insertions(+), 7 deletions(-)

--- a/net/ipv4/tcp.c~net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy
+++ a/net/ipv4/tcp.c
@@ -1734,14 +1734,48 @@ int tcp_mmap(struct file *file, struct s
 }
 EXPORT_SYMBOL(tcp_mmap);
 
+static int tcp_zerocopy_vm_insert_batch(struct vm_area_struct *vma,
+					struct page **pages,
+					unsigned long pages_to_map,
+					unsigned long *insert_addr,
+					u32 *length_with_pending,
+					u32 *seq,
+					struct tcp_zerocopy_receive *zc)
+{
+	unsigned long pages_remaining = pages_to_map;
+	int bytes_mapped;
+	int ret;
+
+	ret = vm_insert_pages(vma, *insert_addr, pages, &pages_remaining);
+	bytes_mapped = PAGE_SIZE * (pages_to_map - pages_remaining);
+	/* Even if vm_insert_pages fails, it may have partially succeeded in
+	 * mapping (some but not all of the pages).
+	 */
+	*seq += bytes_mapped;
+	*insert_addr += bytes_mapped;
+	if (ret) {
+		/* But if vm_insert_pages did fail, we have to unroll some state
+		 * we speculatively touched before.
+		 */
+		const int bytes_not_mapped = PAGE_SIZE * pages_remaining;
+		*length_with_pending -= bytes_not_mapped;
+		zc->recv_skip_hint += bytes_not_mapped;
+	}
+	return ret;
+}
+
 static int tcp_zerocopy_receive(struct sock *sk,
 				struct tcp_zerocopy_receive *zc)
 {
 	unsigned long address = (unsigned long)zc->address;
 	u32 length = 0, seq, offset, zap_len;
+	#define PAGE_BATCH_SIZE 8
+	struct page *pages[PAGE_BATCH_SIZE];
 	const skb_frag_t *frags = NULL;
 	struct vm_area_struct *vma;
 	struct sk_buff *skb = NULL;
+	unsigned long pg_idx = 0;
+	unsigned long curr_addr;
 	struct tcp_sock *tp;
 	int inq;
 	int ret;
@@ -1754,6 +1788,8 @@ static int tcp_zerocopy_receive(struct s
 
 	sock_rps_record_flow(sk);
 
+	tp = tcp_sk(sk);
+
 	down_read(&current->mm->mmap_sem);
 
 	ret = -EINVAL;
@@ -1762,7 +1798,6 @@ static int tcp_zerocopy_receive(struct s
 		goto out;
 	zc->length = min_t(unsigned long, zc->length, vma->vm_end - address);
 
-	tp = tcp_sk(sk);
 	seq = tp->copied_seq;
 	inq = tcp_inq(sk);
 	zc->length = min_t(u32, zc->length, inq);
@@ -1774,8 +1809,20 @@ static int tcp_zerocopy_receive(struct s
 		zc->recv_skip_hint = zc->length;
 	}
 	ret = 0;
+	curr_addr = address;
 	while (length + PAGE_SIZE <= zc->length) {
 		if (zc->recv_skip_hint < PAGE_SIZE) {
+			/* If we're here, finish the current batch. */
+			if (pg_idx) {
+				ret = tcp_zerocopy_vm_insert_batch(vma, pages,
+								   pg_idx,
+								   &curr_addr,
+								   &length,
+								   &seq, zc);
+				if (ret)
+					goto out;
+				pg_idx = 0;
+			}
 			if (skb) {
 				if (zc->recv_skip_hint > 0)
 					break;
@@ -1784,7 +1831,6 @@ static int tcp_zerocopy_receive(struct s
 			} else {
 				skb = tcp_recv_skb(sk, seq, &offset);
 			}
-
 			zc->recv_skip_hint = skb->len - offset;
 			offset -= skb_headlen(skb);
 			if ((int)offset < 0 || skb_has_frag_list(skb))
@@ -1808,14 +1854,24 @@ static int tcp_zerocopy_receive(struct s
 			zc->recv_skip_hint -= remaining;
 			break;
 		}
-		ret = vm_insert_page(vma, address + length,
-				     skb_frag_page(frags));
-		if (ret)
-			break;
+		pages[pg_idx] = skb_frag_page(frags);
+		pg_idx++;
 		length += PAGE_SIZE;
-		seq += PAGE_SIZE;
 		zc->recv_skip_hint -= PAGE_SIZE;
 		frags++;
+		if (pg_idx == PAGE_BATCH_SIZE) {
+			ret = tcp_zerocopy_vm_insert_batch(vma, pages, pg_idx,
+							   &curr_addr, &length,
+							   &seq, zc);
+			if (ret)
+				goto out;
+			pg_idx = 0;
+		}
+	}
+	if (pg_idx) {
+		ret = tcp_zerocopy_vm_insert_batch(vma, pages, pg_idx,
+						   &curr_addr, &length, &seq,
+						   zc);
 	}
 out:
 	up_read(&current->mm->mmap_sem);
Arjun Roy April 10, 2020, 7:13 p.m. UTC | #7
On Fri, Apr 10, 2020 at 12:04 PM Andrew Morton <akpm@linux-foundation.org> wrote:

> On Fri, 21 Feb 2020 13:21:41 -0800 Arjun Roy <arjunroy@google.com> wrote:
>
> > I remain a bit concerned regarding the merge process for this specific
> > patch (0003, the net/ipv4/tcp.c change) since I have other in-flight
> > changes for TCP receive zerocopy that I'd like to upstream for
> > net-next - and would like to avoid weird merge issues.
> >
> > So perhaps the following could work:
> >
> > 1. Andrew, perhaps we could remove this particular patch (0003, the
> > net/ipv4/tcp.c change) from mm-next; that way we merge
> > vm_insert_pages() but not the call-site within TCP, for now.
> > 2. net-next will eventually pick vm_insert_pages() up.
> > 3. I can modify the zerocopy code to use it at that point?
> >
> > Else I'm concerned a complicated merge situation may result.
>
> The merge situation is quite clean.
>
> I guess I'll hold off on
> net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy.patch (below) and
> shall send it to davem after Linus has merged the prerequisites.
>
>
>
Acknowledged, thank you!

-Arjun
Arjun Roy April 10, 2020, 7:15 p.m. UTC | #8
On Fri, Apr 10, 2020 at 12:04 PM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Fri, 21 Feb 2020 13:21:41 -0800 Arjun Roy <arjunroy@google.com> wrote:
>
> > I remain a bit concerned regarding the merge process for this specific
> > patch (0003, the net/ipv4/tcp.c change) since I have other in-flight
> > changes for TCP receive zerocopy that I'd like to upstream for
> > net-next - and would like to avoid weird merge issues.
> >
> > So perhaps the following could work:
> >
> > 1. Andrew, perhaps we could remove this particular patch (0003, the
> > net/ipv4/tcp.c change) from mm-next; that way we merge
> > vm_insert_pages() but not the call-site within TCP, for now.
> > 2. net-next will eventually pick vm_insert_pages() up.
> > 3. I can modify the zerocopy code to use it at that point?
> >
> > Else I'm concerned a complicated merge situation may result.
>
> The merge situation is quite clean.
>
> I guess I'll hold off on
> net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy.patch (below) and
> shall send it to davem after Linus has merged the prerequisites.
>

Acknowledged, thank you!
(resend because gmail conveniently forgot my plain text mode preference...)

-Arjun

>
> From: Arjun Roy <arjunroy@google.com>
> Subject: net-zerocopy: use vm_insert_pages() for tcp rcv zerocopy
>
> Use vm_insert_pages() for tcp receive zerocopy.  Spin lock cycles (as
> reported by perf) drop from a couple of percentage points to a fraction of
> a percent.  This results in a roughly 6% increase in efficiency, measured
> roughly as zerocopy receive count divided by CPU utilization.
>
> The intention of this patchset is to reduce atomic ops for tcp zerocopy
> receives, which normally hits the same spinlock multiple times
> consecutively.
>
> [akpm@linux-foundation.org: suppress gcc-7.2.0 warning]
> Link: http://lkml.kernel.org/r/20200128025958.43490-3-arjunroy.kdev@gmail.com
> Signed-off-by: Arjun Roy <arjunroy@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
> Cc: David Miller <davem@davemloft.net>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Stephen Rothwell <sfr@canb.auug.org.au>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>
>  net/ipv4/tcp.c |   70 ++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 63 insertions(+), 7 deletions(-)
>
> --- a/net/ipv4/tcp.c~net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy
> +++ a/net/ipv4/tcp.c
> @@ -1734,14 +1734,48 @@ int tcp_mmap(struct file *file, struct s
>  }
>  EXPORT_SYMBOL(tcp_mmap);
>
> +static int tcp_zerocopy_vm_insert_batch(struct vm_area_struct *vma,
> +                                       struct page **pages,
> +                                       unsigned long pages_to_map,
> +                                       unsigned long *insert_addr,
> +                                       u32 *length_with_pending,
> +                                       u32 *seq,
> +                                       struct tcp_zerocopy_receive *zc)
> +{
> +       unsigned long pages_remaining = pages_to_map;
> +       int bytes_mapped;
> +       int ret;
> +
> +       ret = vm_insert_pages(vma, *insert_addr, pages, &pages_remaining);
> +       bytes_mapped = PAGE_SIZE * (pages_to_map - pages_remaining);
> +       /* Even if vm_insert_pages fails, it may have partially succeeded in
> +        * mapping (some but not all of the pages).
> +        */
> +       *seq += bytes_mapped;
> +       *insert_addr += bytes_mapped;
> +       if (ret) {
> +               /* But if vm_insert_pages did fail, we have to unroll some state
> +                * we speculatively touched before.
> +                */
> +               const int bytes_not_mapped = PAGE_SIZE * pages_remaining;
> +               *length_with_pending -= bytes_not_mapped;
> +               zc->recv_skip_hint += bytes_not_mapped;
> +       }
> +       return ret;
> +}
> +
>  static int tcp_zerocopy_receive(struct sock *sk,
>                                 struct tcp_zerocopy_receive *zc)
>  {
>         unsigned long address = (unsigned long)zc->address;
>         u32 length = 0, seq, offset, zap_len;
> +       #define PAGE_BATCH_SIZE 8
> +       struct page *pages[PAGE_BATCH_SIZE];
>         const skb_frag_t *frags = NULL;
>         struct vm_area_struct *vma;
>         struct sk_buff *skb = NULL;
> +       unsigned long pg_idx = 0;
> +       unsigned long curr_addr;
>         struct tcp_sock *tp;
>         int inq;
>         int ret;
> @@ -1754,6 +1788,8 @@ static int tcp_zerocopy_receive(struct s
>
>         sock_rps_record_flow(sk);
>
> +       tp = tcp_sk(sk);
> +
>         down_read(&current->mm->mmap_sem);
>
>         ret = -EINVAL;
> @@ -1762,7 +1798,6 @@ static int tcp_zerocopy_receive(struct s
>                 goto out;
>         zc->length = min_t(unsigned long, zc->length, vma->vm_end - address);
>
> -       tp = tcp_sk(sk);
>         seq = tp->copied_seq;
>         inq = tcp_inq(sk);
>         zc->length = min_t(u32, zc->length, inq);
> @@ -1774,8 +1809,20 @@ static int tcp_zerocopy_receive(struct s
>                 zc->recv_skip_hint = zc->length;
>         }
>         ret = 0;
> +       curr_addr = address;
>         while (length + PAGE_SIZE <= zc->length) {
>                 if (zc->recv_skip_hint < PAGE_SIZE) {
> +                       /* If we're here, finish the current batch. */
> +                       if (pg_idx) {
> +                               ret = tcp_zerocopy_vm_insert_batch(vma, pages,
> +                                                                  pg_idx,
> +                                                                  &curr_addr,
> +                                                                  &length,
> +                                                                  &seq, zc);
> +                               if (ret)
> +                                       goto out;
> +                               pg_idx = 0;
> +                       }
>                         if (skb) {
>                                 if (zc->recv_skip_hint > 0)
>                                         break;
> @@ -1784,7 +1831,6 @@ static int tcp_zerocopy_receive(struct s
>                         } else {
>                                 skb = tcp_recv_skb(sk, seq, &offset);
>                         }
> -
>                         zc->recv_skip_hint = skb->len - offset;
>                         offset -= skb_headlen(skb);
>                         if ((int)offset < 0 || skb_has_frag_list(skb))
> @@ -1808,14 +1854,24 @@ static int tcp_zerocopy_receive(struct s
>                         zc->recv_skip_hint -= remaining;
>                         break;
>                 }
> -               ret = vm_insert_page(vma, address + length,
> -                                    skb_frag_page(frags));
> -               if (ret)
> -                       break;
> +               pages[pg_idx] = skb_frag_page(frags);
> +               pg_idx++;
>                 length += PAGE_SIZE;
> -               seq += PAGE_SIZE;
>                 zc->recv_skip_hint -= PAGE_SIZE;
>                 frags++;
> +               if (pg_idx == PAGE_BATCH_SIZE) {
> +                       ret = tcp_zerocopy_vm_insert_batch(vma, pages, pg_idx,
> +                                                          &curr_addr, &length,
> +                                                          &seq, zc);
> +                       if (ret)
> +                               goto out;
> +                       pg_idx = 0;
> +               }
> +       }
> +       if (pg_idx) {
> +               ret = tcp_zerocopy_vm_insert_batch(vma, pages, pg_idx,
> +                                                  &curr_addr, &length, &seq,
> +                                                  zc);
>         }
>  out:
>         up_read(&current->mm->mmap_sem);
> _
>

Patch

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 34490d972758..52f96c3ceab3 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1861,14 +1861,48 @@  int tcp_mmap(struct file *file, struct socket *sock,
 }
 EXPORT_SYMBOL(tcp_mmap);
 
+static int tcp_zerocopy_vm_insert_batch(struct vm_area_struct *vma,
+					struct page **pages,
+					unsigned long pages_to_map,
+					unsigned long *insert_addr,
+					u32 *length_with_pending,
+					u32 *seq,
+					struct tcp_zerocopy_receive *zc)
+{
+	unsigned long pages_remaining = pages_to_map;
+	int bytes_mapped;
+	int ret;
+
+	ret = vm_insert_pages(vma, *insert_addr, pages, &pages_remaining);
+	bytes_mapped = PAGE_SIZE * (pages_to_map - pages_remaining);
+	/* Even if vm_insert_pages fails, it may have partially succeeded in
+	 * mapping (some but not all of the pages).
+	 */
+	*seq += bytes_mapped;
+	*insert_addr += bytes_mapped;
+	if (ret) {
+		/* But if vm_insert_pages did fail, we have to unroll some state
+		 * we speculatively touched before.
+		 */
+		const int bytes_not_mapped = PAGE_SIZE * pages_remaining;
+		*length_with_pending -= bytes_not_mapped;
+		zc->recv_skip_hint += bytes_not_mapped;
+	}
+	return ret;
+}
+
 static int tcp_zerocopy_receive(struct sock *sk,
 				struct tcp_zerocopy_receive *zc)
 {
 	unsigned long address = (unsigned long)zc->address;
 	u32 length = 0, seq, offset, zap_len;
+	#define PAGE_BATCH_SIZE 8
+	struct page *pages[PAGE_BATCH_SIZE];
 	const skb_frag_t *frags = NULL;
 	struct vm_area_struct *vma;
 	struct sk_buff *skb = NULL;
+	unsigned long pg_idx = 0;
+	unsigned long curr_addr;
 	struct tcp_sock *tp;
 	int inq;
 	int ret;
@@ -1901,15 +1935,26 @@  static int tcp_zerocopy_receive(struct sock *sk,
 		zc->recv_skip_hint = zc->length;
 	}
 	ret = 0;
+	curr_addr = address;
 	while (length + PAGE_SIZE <= zc->length) {
 		if (zc->recv_skip_hint < PAGE_SIZE) {
+			/* If we're here, finish the current batch. */
+			if (pg_idx) {
+				ret = tcp_zerocopy_vm_insert_batch(vma, pages,
+								   pg_idx,
+								   &curr_addr,
+								   &length,
+								   &seq, zc);
+				if (ret)
+					goto out;
+				pg_idx = 0;
+			}
 			if (skb) {
 				skb = skb->next;
 				offset = seq - TCP_SKB_CB(skb)->seq;
 			} else {
 				skb = tcp_recv_skb(sk, seq, &offset);
 			}
-
 			zc->recv_skip_hint = skb->len - offset;
 			offset -= skb_headlen(skb);
 			if ((int)offset < 0 || skb_has_frag_list(skb))
@@ -1933,14 +1978,24 @@  static int tcp_zerocopy_receive(struct sock *sk,
 			zc->recv_skip_hint -= remaining;
 			break;
 		}
-		ret = vm_insert_page(vma, address + length,
-				     skb_frag_page(frags));
-		if (ret)
-			break;
+		pages[pg_idx] = skb_frag_page(frags);
+		pg_idx++;
 		length += PAGE_SIZE;
-		seq += PAGE_SIZE;
 		zc->recv_skip_hint -= PAGE_SIZE;
 		frags++;
+		if (pg_idx == PAGE_BATCH_SIZE) {
+			ret = tcp_zerocopy_vm_insert_batch(vma, pages, pg_idx,
+							   &curr_addr, &length,
+							   &seq, zc);
+			if (ret)
+				goto out;
+			pg_idx = 0;
+		}
+	}
+	if (pg_idx) {
+		ret = tcp_zerocopy_vm_insert_batch(vma, pages, pg_idx,
+						   &curr_addr, &length, &seq,
+						   zc);
 	}
 out:
 	up_read(&current->mm->mmap_sem);
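
For completeness, a rough userspace-side sketch of how this receive path is exercised; the struct and socket option names match the uapi the patch above operates on, but buffer sizing is arbitrary and error handling plus the recv() fallback for recv_skip_hint bytes are omitted (the tcp_mmap selftest in the kernel tree is the complete reference):

#include <netinet/in.h>   /* IPPROTO_TCP */
#include <linux/tcp.h>    /* TCP_ZEROCOPY_RECEIVE, struct tcp_zerocopy_receive */
#include <sys/mman.h>
#include <sys/socket.h>
#include <string.h>

static void zerocopy_read(int fd)
{
	struct tcp_zerocopy_receive zc;
	socklen_t zc_len = sizeof(zc);
	/* A page-aligned window the kernel inserts skb pages into. */
	void *addr = mmap(NULL, 1 << 20, PROT_READ, MAP_SHARED, fd, 0);

	memset(&zc, 0, sizeof(zc));
	zc.address = (__u64)(unsigned long)addr;
	zc.length = 1 << 20;

	/* One call maps as many whole pages as are ready; zc.length comes
	 * back as the bytes mapped, zc.recv_skip_hint as the bytes that
	 * still need to be copied out with recv().
	 */
	getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &zc_len);

	/* ... consume zc.length bytes at addr, recv() the skip-hint bytes,
	 * then reuse or munmap() the window ...
	 */
	munmap(addr, 1 << 20);
}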