Message ID | 20200128025958.43490-3-arjunroy.kdev@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [resend,mm,net-next,1/3] mm: Refactor insert_page to prepare for batched-lock insert. | expand |
On Mon, 27 Jan 2020 18:59:58 -0800 Arjun Roy <arjunroy.kdev@gmail.com> wrote: > Use vm_insert_pages() for tcp receive zerocopy. Spin lock cycles > (as reported by perf) drop from a couple of percentage points > to a fraction of a percent. This results in a roughly 6% increase in > efficiency, measured roughly as zerocopy receive count divided by CPU > utilization. > > The intention of this patch-set is to reduce atomic ops for > tcp zerocopy receives, which normally hits the same spinlock multiple > times consecutively. For some reason the patch causes this: In file included from ./arch/x86/include/asm/atomic.h:5:0, from ./include/linux/atomic.h:7, from ./include/linux/crypto.h:15, from ./include/crypto/hash.h:11, from net/ipv4/tcp.c:246: net/ipv4/tcp.c: In function ‘do_tcp_getsockopt.isra.29’: ./include/linux/compiler.h:225:31: warning: ‘tp’ may be used uninitialized in this function [-Wmaybe-uninitialized] case 4: *(volatile __u32 *)p = *(__u32 *)res; break; ~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~ net/ipv4/tcp.c:1779:19: note: ‘tp’ was declared here struct tcp_sock *tp; ^~ It's a false positive. gcc-7.2.0 : out: : up_read(¤t->mm->mmap_sem); : if (length) { : WRITE_ONCE(tp->copied_seq, seq); but `length' is zero here. This suppresses it: --- a/net/ipv4/tcp.c~net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy-fix +++ a/net/ipv4/tcp.c @@ -1788,6 +1788,8 @@ static int tcp_zerocopy_receive(struct s sock_rps_record_flow(sk); + tp = tcp_sk(sk); + down_read(¤t->mm->mmap_sem); ret = -EINVAL; @@ -1796,7 +1798,6 @@ static int tcp_zerocopy_receive(struct s goto out; zc->length = min_t(unsigned long, zc->length, vma->vm_end - address); - tp = tcp_sk(sk); seq = tp->copied_seq; inq = tcp_inq(sk); zc->length = min_t(u32, zc->length, inq); and I guess it's zero-cost. Anyway, I'll sit on this lot for a while, hoping for a davem ack?
On Wed, Feb 12, 2020 at 6:56 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > On Mon, 27 Jan 2020 18:59:58 -0800 Arjun Roy <arjunroy.kdev@gmail.com> wrote: > > > Use vm_insert_pages() for tcp receive zerocopy. Spin lock cycles > > (as reported by perf) drop from a couple of percentage points > > to a fraction of a percent. This results in a roughly 6% increase in > > efficiency, measured roughly as zerocopy receive count divided by CPU > > utilization. > > > > The intention of this patch-set is to reduce atomic ops for > > tcp zerocopy receives, which normally hits the same spinlock multiple > > times consecutively. > > For some reason the patch causes this: > > In file included from ./arch/x86/include/asm/atomic.h:5:0, > from ./include/linux/atomic.h:7, > from ./include/linux/crypto.h:15, > from ./include/crypto/hash.h:11, > from net/ipv4/tcp.c:246: > net/ipv4/tcp.c: In function ‘do_tcp_getsockopt.isra.29’: > ./include/linux/compiler.h:225:31: warning: ‘tp’ may be used uninitialized in this function [-Wmaybe-uninitialized] > case 4: *(volatile __u32 *)p = *(__u32 *)res; break; > ~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~ > net/ipv4/tcp.c:1779:19: note: ‘tp’ was declared here > struct tcp_sock *tp; > ^~ > > It's a false positive. gcc-7.2.0 > > : out: > : up_read(¤t->mm->mmap_sem); > : if (length) { > : WRITE_ONCE(tp->copied_seq, seq); > > but `length' is zero here. > > This suppresses it: > > --- a/net/ipv4/tcp.c~net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy-fix > +++ a/net/ipv4/tcp.c > @@ -1788,6 +1788,8 @@ static int tcp_zerocopy_receive(struct s > > sock_rps_record_flow(sk); > > + tp = tcp_sk(sk); > + > down_read(¤t->mm->mmap_sem); > > ret = -EINVAL; > @@ -1796,7 +1798,6 @@ static int tcp_zerocopy_receive(struct s > goto out; > zc->length = min_t(unsigned long, zc->length, vma->vm_end - address); > > - tp = tcp_sk(sk); > seq = tp->copied_seq; > inq = tcp_inq(sk); > zc->length = min_t(u32, zc->length, inq); > > and I guess it's zero-cost. > > > Anyway, I'll sit on this lot for a while, hoping for a davem ack? Actually, speaking of the ack on the networking side: I guess this patch set is a bit weird since it requires some non-trivial coordination between mm and net-next? Not sure what the normal approach is in this case. -Arjun
Andrew, David - I remain a bit concerned regarding the merge process for this specific patch (0003, the net/ipv4/tcp.c change) since I have other in-flight changes for TCP receive zerocopy that I'd like to upstream for net-next - and would like to avoid weird merge issues. So perhaps the following could work: 1. Andrew, perhaps we could remove this particular patch (0003, the net/ipv4/tcp.c change) from mm-next; that way we merge vm_insert_pages() but not the call-site within TCP, for now. 2. net-next will eventually pick vm_insert_pages() up. 3. I can modify the zerocopy code to use it at that point? Else I'm concerned a complicated merge situation may result. What do you all think? Thanks, -Arjun On Sun, Feb 16, 2020 at 6:49 PM Arjun Roy <arjunroy@google.com> wrote: > > On Wed, Feb 12, 2020 at 6:56 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > > > On Mon, 27 Jan 2020 18:59:58 -0800 Arjun Roy <arjunroy.kdev@gmail.com> wrote: > > > > > Use vm_insert_pages() for tcp receive zerocopy. Spin lock cycles > > > (as reported by perf) drop from a couple of percentage points > > > to a fraction of a percent. This results in a roughly 6% increase in > > > efficiency, measured roughly as zerocopy receive count divided by CPU > > > utilization. > > > > > > The intention of this patch-set is to reduce atomic ops for > > > tcp zerocopy receives, which normally hits the same spinlock multiple > > > times consecutively. > > > > For some reason the patch causes this: > > > > In file included from ./arch/x86/include/asm/atomic.h:5:0, > > from ./include/linux/atomic.h:7, > > from ./include/linux/crypto.h:15, > > from ./include/crypto/hash.h:11, > > from net/ipv4/tcp.c:246: > > net/ipv4/tcp.c: In function ‘do_tcp_getsockopt.isra.29’: > > ./include/linux/compiler.h:225:31: warning: ‘tp’ may be used uninitialized in this function [-Wmaybe-uninitialized] > > case 4: *(volatile __u32 *)p = *(__u32 *)res; break; > > ~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~ > > net/ipv4/tcp.c:1779:19: note: ‘tp’ was declared here > > struct tcp_sock *tp; > > ^~ > > > > It's a false positive. gcc-7.2.0 > > > > : out: > > : up_read(¤t->mm->mmap_sem); > > : if (length) { > > : WRITE_ONCE(tp->copied_seq, seq); > > > > but `length' is zero here. > > > > This suppresses it: > > > > --- a/net/ipv4/tcp.c~net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy-fix > > +++ a/net/ipv4/tcp.c > > @@ -1788,6 +1788,8 @@ static int tcp_zerocopy_receive(struct s > > > > sock_rps_record_flow(sk); > > > > + tp = tcp_sk(sk); > > + > > down_read(¤t->mm->mmap_sem); > > > > ret = -EINVAL; > > @@ -1796,7 +1798,6 @@ static int tcp_zerocopy_receive(struct s > > goto out; > > zc->length = min_t(unsigned long, zc->length, vma->vm_end - address); > > > > - tp = tcp_sk(sk); > > seq = tp->copied_seq; > > inq = tcp_inq(sk); > > zc->length = min_t(u32, zc->length, inq); > > > > and I guess it's zero-cost. > > > > > > Anyway, I'll sit on this lot for a while, hoping for a davem ack? > > Actually, speaking of the ack on the networking side: > > I guess this patch set is a bit weird since it requires some > non-trivial coordination between mm and net-next? Not sure what the > normal approach is in this case. > > -Arjun
On Fri, 21 Feb 2020 13:21:41 -0800 Arjun Roy <arjunroy@google.com> wrote: > I remain a bit concerned regarding the merge process for this specific > patch (0003, the net/ipv4/tcp.c change) since I have other in-flight > changes for TCP receive zerocopy that I'd like to upstream for > net-next - and would like to avoid weird merge issues. > > So perhaps the following could work: > > 1. Andrew, perhaps we could remove this particular patch (0003, the > net/ipv4/tcp.c change) from mm-next; that way we merge > vm_insert_pages() but not the call-site within TCP, for now. > 2. net-next will eventually pick vm_insert_pages() up. > 3. I can modify the zerocopy code to use it at that point? > > Else I'm concerned a complicated merge situation may result. > > What do you all think? We could do that. For now, I'll stage the entire patch series after linux-next and shall wait and see whether things which appear in linux-next cause serious merge issues to occur. Sound OK?
On Sun, Feb 23, 2020 at 7:37 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > On Fri, 21 Feb 2020 13:21:41 -0800 Arjun Roy <arjunroy@google.com> wrote: > > > I remain a bit concerned regarding the merge process for this specific > > patch (0003, the net/ipv4/tcp.c change) since I have other in-flight > > changes for TCP receive zerocopy that I'd like to upstream for > > net-next - and would like to avoid weird merge issues. > > > > So perhaps the following could work: > > > > 1. Andrew, perhaps we could remove this particular patch (0003, the > > net/ipv4/tcp.c change) from mm-next; that way we merge > > vm_insert_pages() but not the call-site within TCP, for now. > > 2. net-next will eventually pick vm_insert_pages() up. > > 3. I can modify the zerocopy code to use it at that point? > > > > Else I'm concerned a complicated merge situation may result. > > > > What do you all think? > > We could do that. > > For now, I'll stage the entire patch series after linux-next and shall > wait and see whether things which appear in linux-next cause serious > merge issues to occur. Sound OK? Sounds good for now; the conflict itself would be easy enough to fix when it does crop up. Thanks, -Arjun
On Fri, 21 Feb 2020 13:21:41 -0800 Arjun Roy <arjunroy@google.com> wrote: > I remain a bit concerned regarding the merge process for this specific > patch (0003, the net/ipv4/tcp.c change) since I have other in-flight > changes for TCP receive zerocopy that I'd like to upstream for > net-next - and would like to avoid weird merge issues. > > So perhaps the following could work: > > 1. Andrew, perhaps we could remove this particular patch (0003, the > net/ipv4/tcp.c change) from mm-next; that way we merge > vm_insert_pages() but not the call-site within TCP, for now. > 2. net-next will eventually pick vm_insert_pages() up. > 3. I can modify the zerocopy code to use it at that point? > > Else I'm concerned a complicated merge situation may result. The merge situation is quite clean. I guess I'll hold off on net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy.patch (below) and shall send it to davem after Linus has merged the prerequisites. From: Arjun Roy <arjunroy@google.com> Subject: net-zerocopy: use vm_insert_pages() for tcp rcv zerocopy Use vm_insert_pages() for tcp receive zerocopy. Spin lock cycles (as reported by perf) drop from a couple of percentage points to a fraction of a percent. This results in a roughly 6% increase in efficiency, measured roughly as zerocopy receive count divided by CPU utilization. The intention of this patchset is to reduce atomic ops for tcp zerocopy receives, which normally hits the same spinlock multiple times consecutively. [akpm@linux-foundation.org: suppress gcc-7.2.0 warning] Link: http://lkml.kernel.org/r/20200128025958.43490-3-arjunroy.kdev@gmail.com Signed-off-by: Arjun Roy <arjunroy@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Cc: David Miller <davem@davemloft.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- net/ipv4/tcp.c | 70 ++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 63 insertions(+), 7 deletions(-) --- a/net/ipv4/tcp.c~net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy +++ a/net/ipv4/tcp.c @@ -1734,14 +1734,48 @@ int tcp_mmap(struct file *file, struct s } EXPORT_SYMBOL(tcp_mmap); +static int tcp_zerocopy_vm_insert_batch(struct vm_area_struct *vma, + struct page **pages, + unsigned long pages_to_map, + unsigned long *insert_addr, + u32 *length_with_pending, + u32 *seq, + struct tcp_zerocopy_receive *zc) +{ + unsigned long pages_remaining = pages_to_map; + int bytes_mapped; + int ret; + + ret = vm_insert_pages(vma, *insert_addr, pages, &pages_remaining); + bytes_mapped = PAGE_SIZE * (pages_to_map - pages_remaining); + /* Even if vm_insert_pages fails, it may have partially succeeded in + * mapping (some but not all of the pages). + */ + *seq += bytes_mapped; + *insert_addr += bytes_mapped; + if (ret) { + /* But if vm_insert_pages did fail, we have to unroll some state + * we speculatively touched before. + */ + const int bytes_not_mapped = PAGE_SIZE * pages_remaining; + *length_with_pending -= bytes_not_mapped; + zc->recv_skip_hint += bytes_not_mapped; + } + return ret; +} + static int tcp_zerocopy_receive(struct sock *sk, struct tcp_zerocopy_receive *zc) { unsigned long address = (unsigned long)zc->address; u32 length = 0, seq, offset, zap_len; + #define PAGE_BATCH_SIZE 8 + struct page *pages[PAGE_BATCH_SIZE]; const skb_frag_t *frags = NULL; struct vm_area_struct *vma; struct sk_buff *skb = NULL; + unsigned long pg_idx = 0; + unsigned long curr_addr; struct tcp_sock *tp; int inq; int ret; @@ -1754,6 +1788,8 @@ static int tcp_zerocopy_receive(struct s sock_rps_record_flow(sk); + tp = tcp_sk(sk); + down_read(¤t->mm->mmap_sem); ret = -EINVAL; @@ -1762,7 +1798,6 @@ static int tcp_zerocopy_receive(struct s goto out; zc->length = min_t(unsigned long, zc->length, vma->vm_end - address); - tp = tcp_sk(sk); seq = tp->copied_seq; inq = tcp_inq(sk); zc->length = min_t(u32, zc->length, inq); @@ -1774,8 +1809,20 @@ static int tcp_zerocopy_receive(struct s zc->recv_skip_hint = zc->length; } ret = 0; + curr_addr = address; while (length + PAGE_SIZE <= zc->length) { if (zc->recv_skip_hint < PAGE_SIZE) { + /* If we're here, finish the current batch. */ + if (pg_idx) { + ret = tcp_zerocopy_vm_insert_batch(vma, pages, + pg_idx, + &curr_addr, + &length, + &seq, zc); + if (ret) + goto out; + pg_idx = 0; + } if (skb) { if (zc->recv_skip_hint > 0) break; @@ -1784,7 +1831,6 @@ static int tcp_zerocopy_receive(struct s } else { skb = tcp_recv_skb(sk, seq, &offset); } - zc->recv_skip_hint = skb->len - offset; offset -= skb_headlen(skb); if ((int)offset < 0 || skb_has_frag_list(skb)) @@ -1808,14 +1854,24 @@ static int tcp_zerocopy_receive(struct s zc->recv_skip_hint -= remaining; break; } - ret = vm_insert_page(vma, address + length, - skb_frag_page(frags)); - if (ret) - break; + pages[pg_idx] = skb_frag_page(frags); + pg_idx++; length += PAGE_SIZE; - seq += PAGE_SIZE; zc->recv_skip_hint -= PAGE_SIZE; frags++; + if (pg_idx == PAGE_BATCH_SIZE) { + ret = tcp_zerocopy_vm_insert_batch(vma, pages, pg_idx, + &curr_addr, &length, + &seq, zc); + if (ret) + goto out; + pg_idx = 0; + } + } + if (pg_idx) { + ret = tcp_zerocopy_vm_insert_batch(vma, pages, pg_idx, + &curr_addr, &length, &seq, + zc); } out: up_read(¤t->mm->mmap_sem);
On Fri, Apr 10, 2020 at 12:04 PM Andrew Morton <akpm@linux-foundation.org> wrote: > On Fri, 21 Feb 2020 13:21:41 -0800 Arjun Roy <arjunroy@google.com> wrote: > > > I remain a bit concerned regarding the merge process for this specific > > patch (0003, the net/ipv4/tcp.c change) since I have other in-flight > > changes for TCP receive zerocopy that I'd like to upstream for > > net-next - and would like to avoid weird merge issues. > > > > So perhaps the following could work: > > > > 1. Andrew, perhaps we could remove this particular patch (0003, the > > net/ipv4/tcp.c change) from mm-next; that way we merge > > vm_insert_pages() but not the call-site within TCP, for now. > > 2. net-next will eventually pick vm_insert_pages() up. > > 3. I can modify the zerocopy code to use it at that point? > > > > Else I'm concerned a complicated merge situation may result. > > The merge situation is quite clean. > > I guess I'll hold off on > net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy.patch (below) and > shall send it to davem after Linus has merged the prerequisites. > > > Acknowledged, thank you! -Arjun > From: Arjun Roy <arjunroy@google.com> > Subject: net-zerocopy: use vm_insert_pages() for tcp rcv zerocopy > > Use vm_insert_pages() for tcp receive zerocopy. Spin lock cycles (as > reported by perf) drop from a couple of percentage points to a fraction of > a percent. This results in a roughly 6% increase in efficiency, measured > roughly as zerocopy receive count divided by CPU utilization. > > The intention of this patchset is to reduce atomic ops for tcp zerocopy > receives, which normally hits the same spinlock multiple times > consecutively. > > [akpm@linux-foundation.org: suppress gcc-7.2.0 warning] > Link: > http://lkml.kernel.org/r/20200128025958.43490-3-arjunroy.kdev@gmail.com > Signed-off-by: Arjun Roy <arjunroy@google.com> > Signed-off-by: Eric Dumazet <edumazet@google.com> > Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> > Cc: David Miller <davem@davemloft.net> > Cc: Matthew Wilcox <willy@infradead.org> > Cc: Jason Gunthorpe <jgg@ziepe.ca> > Cc: Stephen Rothwell <sfr@canb.auug.org.au> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > --- > > net/ipv4/tcp.c | 70 ++++++++++++++++++++++++++++++++++++++++++----- > 1 file changed, 63 insertions(+), 7 deletions(-) > > --- a/net/ipv4/tcp.c~net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy > +++ a/net/ipv4/tcp.c > @@ -1734,14 +1734,48 @@ int tcp_mmap(struct file *file, struct s > } > EXPORT_SYMBOL(tcp_mmap); > > +static int tcp_zerocopy_vm_insert_batch(struct vm_area_struct *vma, > + struct page **pages, > + unsigned long pages_to_map, > + unsigned long *insert_addr, > + u32 *length_with_pending, > + u32 *seq, > + struct tcp_zerocopy_receive *zc) > +{ > + unsigned long pages_remaining = pages_to_map; > + int bytes_mapped; > + int ret; > + > + ret = vm_insert_pages(vma, *insert_addr, pages, &pages_remaining); > + bytes_mapped = PAGE_SIZE * (pages_to_map - pages_remaining); > + /* Even if vm_insert_pages fails, it may have partially succeeded > in > + * mapping (some but not all of the pages). > + */ > + *seq += bytes_mapped; > + *insert_addr += bytes_mapped; > + if (ret) { > + /* But if vm_insert_pages did fail, we have to unroll some > state > + * we speculatively touched before. > + */ > + const int bytes_not_mapped = PAGE_SIZE * pages_remaining; > + *length_with_pending -= bytes_not_mapped; > + zc->recv_skip_hint += bytes_not_mapped; > + } > + return ret; > +} > + > static int tcp_zerocopy_receive(struct sock *sk, > struct tcp_zerocopy_receive *zc) > { > unsigned long address = (unsigned long)zc->address; > u32 length = 0, seq, offset, zap_len; > + #define PAGE_BATCH_SIZE 8 > + struct page *pages[PAGE_BATCH_SIZE]; > const skb_frag_t *frags = NULL; > struct vm_area_struct *vma; > struct sk_buff *skb = NULL; > + unsigned long pg_idx = 0; > + unsigned long curr_addr; > struct tcp_sock *tp; > int inq; > int ret; > @@ -1754,6 +1788,8 @@ static int tcp_zerocopy_receive(struct s > > sock_rps_record_flow(sk); > > + tp = tcp_sk(sk); > + > down_read(¤t->mm->mmap_sem); > > ret = -EINVAL; > @@ -1762,7 +1798,6 @@ static int tcp_zerocopy_receive(struct s > goto out; > zc->length = min_t(unsigned long, zc->length, vma->vm_end - > address); > > - tp = tcp_sk(sk); > seq = tp->copied_seq; > inq = tcp_inq(sk); > zc->length = min_t(u32, zc->length, inq); > @@ -1774,8 +1809,20 @@ static int tcp_zerocopy_receive(struct s > zc->recv_skip_hint = zc->length; > } > ret = 0; > + curr_addr = address; > while (length + PAGE_SIZE <= zc->length) { > if (zc->recv_skip_hint < PAGE_SIZE) { > + /* If we're here, finish the current batch. */ > + if (pg_idx) { > + ret = tcp_zerocopy_vm_insert_batch(vma, > pages, > + pg_idx, > + > &curr_addr, > + &length, > + &seq, > zc); > + if (ret) > + goto out; > + pg_idx = 0; > + } > if (skb) { > if (zc->recv_skip_hint > 0) > break; > @@ -1784,7 +1831,6 @@ static int tcp_zerocopy_receive(struct s > } else { > skb = tcp_recv_skb(sk, seq, &offset); > } > - > zc->recv_skip_hint = skb->len - offset; > offset -= skb_headlen(skb); > if ((int)offset < 0 || skb_has_frag_list(skb)) > @@ -1808,14 +1854,24 @@ static int tcp_zerocopy_receive(struct s > zc->recv_skip_hint -= remaining; > break; > } > - ret = vm_insert_page(vma, address + length, > - skb_frag_page(frags)); > - if (ret) > - break; > + pages[pg_idx] = skb_frag_page(frags); > + pg_idx++; > length += PAGE_SIZE; > - seq += PAGE_SIZE; > zc->recv_skip_hint -= PAGE_SIZE; > frags++; > + if (pg_idx == PAGE_BATCH_SIZE) { > + ret = tcp_zerocopy_vm_insert_batch(vma, pages, > pg_idx, > + &curr_addr, > &length, > + &seq, zc); > + if (ret) > + goto out; > + pg_idx = 0; > + } > + } > + if (pg_idx) { > + ret = tcp_zerocopy_vm_insert_batch(vma, pages, pg_idx, > + &curr_addr, &length, > &seq, > + zc); > } > out: > up_read(¤t->mm->mmap_sem); > _ > >
On Fri, Apr 10, 2020 at 12:04 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > On Fri, 21 Feb 2020 13:21:41 -0800 Arjun Roy <arjunroy@google.com> wrote: > > > I remain a bit concerned regarding the merge process for this specific > > patch (0003, the net/ipv4/tcp.c change) since I have other in-flight > > changes for TCP receive zerocopy that I'd like to upstream for > > net-next - and would like to avoid weird merge issues. > > > > So perhaps the following could work: > > > > 1. Andrew, perhaps we could remove this particular patch (0003, the > > net/ipv4/tcp.c change) from mm-next; that way we merge > > vm_insert_pages() but not the call-site within TCP, for now. > > 2. net-next will eventually pick vm_insert_pages() up. > > 3. I can modify the zerocopy code to use it at that point? > > > > Else I'm concerned a complicated merge situation may result. > > The merge situation is quite clean. > > I guess I'll hold off on > net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy.patch (below) and > shall send it to davem after Linus has merged the prerequisites. > Acknowledged, thank you! (resend because gmail conveniently forgot my plain text mode preference...) -Arjun > > From: Arjun Roy <arjunroy@google.com> > Subject: net-zerocopy: use vm_insert_pages() for tcp rcv zerocopy > > Use vm_insert_pages() for tcp receive zerocopy. Spin lock cycles (as > reported by perf) drop from a couple of percentage points to a fraction of > a percent. This results in a roughly 6% increase in efficiency, measured > roughly as zerocopy receive count divided by CPU utilization. > > The intention of this patchset is to reduce atomic ops for tcp zerocopy > receives, which normally hits the same spinlock multiple times > consecutively. > > [akpm@linux-foundation.org: suppress gcc-7.2.0 warning] > Link: http://lkml.kernel.org/r/20200128025958.43490-3-arjunroy.kdev@gmail.com > Signed-off-by: Arjun Roy <arjunroy@google.com> > Signed-off-by: Eric Dumazet <edumazet@google.com> > Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> > Cc: David Miller <davem@davemloft.net> > Cc: Matthew Wilcox <willy@infradead.org> > Cc: Jason Gunthorpe <jgg@ziepe.ca> > Cc: Stephen Rothwell <sfr@canb.auug.org.au> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > --- > > net/ipv4/tcp.c | 70 ++++++++++++++++++++++++++++++++++++++++++----- > 1 file changed, 63 insertions(+), 7 deletions(-) > > --- a/net/ipv4/tcp.c~net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy > +++ a/net/ipv4/tcp.c > @@ -1734,14 +1734,48 @@ int tcp_mmap(struct file *file, struct s > } > EXPORT_SYMBOL(tcp_mmap); > > +static int tcp_zerocopy_vm_insert_batch(struct vm_area_struct *vma, > + struct page **pages, > + unsigned long pages_to_map, > + unsigned long *insert_addr, > + u32 *length_with_pending, > + u32 *seq, > + struct tcp_zerocopy_receive *zc) > +{ > + unsigned long pages_remaining = pages_to_map; > + int bytes_mapped; > + int ret; > + > + ret = vm_insert_pages(vma, *insert_addr, pages, &pages_remaining); > + bytes_mapped = PAGE_SIZE * (pages_to_map - pages_remaining); > + /* Even if vm_insert_pages fails, it may have partially succeeded in > + * mapping (some but not all of the pages). > + */ > + *seq += bytes_mapped; > + *insert_addr += bytes_mapped; > + if (ret) { > + /* But if vm_insert_pages did fail, we have to unroll some state > + * we speculatively touched before. > + */ > + const int bytes_not_mapped = PAGE_SIZE * pages_remaining; > + *length_with_pending -= bytes_not_mapped; > + zc->recv_skip_hint += bytes_not_mapped; > + } > + return ret; > +} > + > static int tcp_zerocopy_receive(struct sock *sk, > struct tcp_zerocopy_receive *zc) > { > unsigned long address = (unsigned long)zc->address; > u32 length = 0, seq, offset, zap_len; > + #define PAGE_BATCH_SIZE 8 > + struct page *pages[PAGE_BATCH_SIZE]; > const skb_frag_t *frags = NULL; > struct vm_area_struct *vma; > struct sk_buff *skb = NULL; > + unsigned long pg_idx = 0; > + unsigned long curr_addr; > struct tcp_sock *tp; > int inq; > int ret; > @@ -1754,6 +1788,8 @@ static int tcp_zerocopy_receive(struct s > > sock_rps_record_flow(sk); > > + tp = tcp_sk(sk); > + > down_read(¤t->mm->mmap_sem); > > ret = -EINVAL; > @@ -1762,7 +1798,6 @@ static int tcp_zerocopy_receive(struct s > goto out; > zc->length = min_t(unsigned long, zc->length, vma->vm_end - address); > > - tp = tcp_sk(sk); > seq = tp->copied_seq; > inq = tcp_inq(sk); > zc->length = min_t(u32, zc->length, inq); > @@ -1774,8 +1809,20 @@ static int tcp_zerocopy_receive(struct s > zc->recv_skip_hint = zc->length; > } > ret = 0; > + curr_addr = address; > while (length + PAGE_SIZE <= zc->length) { > if (zc->recv_skip_hint < PAGE_SIZE) { > + /* If we're here, finish the current batch. */ > + if (pg_idx) { > + ret = tcp_zerocopy_vm_insert_batch(vma, pages, > + pg_idx, > + &curr_addr, > + &length, > + &seq, zc); > + if (ret) > + goto out; > + pg_idx = 0; > + } > if (skb) { > if (zc->recv_skip_hint > 0) > break; > @@ -1784,7 +1831,6 @@ static int tcp_zerocopy_receive(struct s > } else { > skb = tcp_recv_skb(sk, seq, &offset); > } > - > zc->recv_skip_hint = skb->len - offset; > offset -= skb_headlen(skb); > if ((int)offset < 0 || skb_has_frag_list(skb)) > @@ -1808,14 +1854,24 @@ static int tcp_zerocopy_receive(struct s > zc->recv_skip_hint -= remaining; > break; > } > - ret = vm_insert_page(vma, address + length, > - skb_frag_page(frags)); > - if (ret) > - break; > + pages[pg_idx] = skb_frag_page(frags); > + pg_idx++; > length += PAGE_SIZE; > - seq += PAGE_SIZE; > zc->recv_skip_hint -= PAGE_SIZE; > frags++; > + if (pg_idx == PAGE_BATCH_SIZE) { > + ret = tcp_zerocopy_vm_insert_batch(vma, pages, pg_idx, > + &curr_addr, &length, > + &seq, zc); > + if (ret) > + goto out; > + pg_idx = 0; > + } > + } > + if (pg_idx) { > + ret = tcp_zerocopy_vm_insert_batch(vma, pages, pg_idx, > + &curr_addr, &length, &seq, > + zc); > } > out: > up_read(¤t->mm->mmap_sem); > _ >
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 34490d972758..52f96c3ceab3 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1861,14 +1861,48 @@ int tcp_mmap(struct file *file, struct socket *sock, } EXPORT_SYMBOL(tcp_mmap); +static int tcp_zerocopy_vm_insert_batch(struct vm_area_struct *vma, + struct page **pages, + unsigned long pages_to_map, + unsigned long *insert_addr, + u32 *length_with_pending, + u32 *seq, + struct tcp_zerocopy_receive *zc) +{ + unsigned long pages_remaining = pages_to_map; + int bytes_mapped; + int ret; + + ret = vm_insert_pages(vma, *insert_addr, pages, &pages_remaining); + bytes_mapped = PAGE_SIZE * (pages_to_map - pages_remaining); + /* Even if vm_insert_pages fails, it may have partially succeeded in + * mapping (some but not all of the pages). + */ + *seq += bytes_mapped; + *insert_addr += bytes_mapped; + if (ret) { + /* But if vm_insert_pages did fail, we have to unroll some state + * we speculatively touched before. + */ + const int bytes_not_mapped = PAGE_SIZE * pages_remaining; + *length_with_pending -= bytes_not_mapped; + zc->recv_skip_hint += bytes_not_mapped; + } + return ret; +} + static int tcp_zerocopy_receive(struct sock *sk, struct tcp_zerocopy_receive *zc) { unsigned long address = (unsigned long)zc->address; u32 length = 0, seq, offset, zap_len; + #define PAGE_BATCH_SIZE 8 + struct page *pages[PAGE_BATCH_SIZE]; const skb_frag_t *frags = NULL; struct vm_area_struct *vma; struct sk_buff *skb = NULL; + unsigned long pg_idx = 0; + unsigned long curr_addr; struct tcp_sock *tp; int inq; int ret; @@ -1901,15 +1935,26 @@ static int tcp_zerocopy_receive(struct sock *sk, zc->recv_skip_hint = zc->length; } ret = 0; + curr_addr = address; while (length + PAGE_SIZE <= zc->length) { if (zc->recv_skip_hint < PAGE_SIZE) { + /* If we're here, finish the current batch. */ + if (pg_idx) { + ret = tcp_zerocopy_vm_insert_batch(vma, pages, + pg_idx, + &curr_addr, + &length, + &seq, zc); + if (ret) + goto out; + pg_idx = 0; + } if (skb) { skb = skb->next; offset = seq - TCP_SKB_CB(skb)->seq; } else { skb = tcp_recv_skb(sk, seq, &offset); } - zc->recv_skip_hint = skb->len - offset; offset -= skb_headlen(skb); if ((int)offset < 0 || skb_has_frag_list(skb)) @@ -1933,14 +1978,24 @@ static int tcp_zerocopy_receive(struct sock *sk, zc->recv_skip_hint -= remaining; break; } - ret = vm_insert_page(vma, address + length, - skb_frag_page(frags)); - if (ret) - break; + pages[pg_idx] = skb_frag_page(frags); + pg_idx++; length += PAGE_SIZE; - seq += PAGE_SIZE; zc->recv_skip_hint -= PAGE_SIZE; frags++; + if (pg_idx == PAGE_BATCH_SIZE) { + ret = tcp_zerocopy_vm_insert_batch(vma, pages, pg_idx, + &curr_addr, &length, + &seq, zc); + if (ret) + goto out; + pg_idx = 0; + } + } + if (pg_idx) { + ret = tcp_zerocopy_vm_insert_batch(vma, pages, pg_idx, + &curr_addr, &length, &seq, + zc); } out: up_read(¤t->mm->mmap_sem);