Message ID | 20220907122505.26953-1-wintera@linux.ibm.com (mailing list archive) |
---|---|
State | RFC |
Delegated to: | Netdev Maintainers |
Series | [RFC,net] tcp: Fix performance regression for request-response workloads |
On Wed, Sep 7, 2022 at 5:26 AM Alexandra Winter <wintera@linux.ibm.com> wrote:
>
> Since linear payload was removed even for single small messages,
> an additional page is required and we are measuring performance impact.
>
> 3613b3dbd1ad ("tcp: prepare skbs for better sack shifting")
> explicitly allowed "payload in skb->head for first skb put in the queue,
> to not impact RPC workloads."
> 472c2e07eef0 ("tcp: add one skb cache for tx")
> made that obsolete and removed it.
> When
> d8b81175e412 ("tcp: remove sk_{tr}x_skb_cache")
> reverted it, this piece was not reverted and not added back in.
>
> When running uperf with a request-response pattern with 1k payload
> and 250 parallel connections, we measure a 13% difference in throughput
> for our PCI-based network interfaces since 472c2e07eef0.
> (our IOMMU is sensitive to the number of mapped pages)
>
> Could you please consider allowing linear payload for the first
> skb in queue again? A patch proposal is appended below.

No.

Please add a workaround in your driver.

You can increase throughput by 20% by premapping a coherent piece of
memory into which you can copy small skbs (skb->head included).

Something like 256 bytes per slot in the TX ring.

>
> Kind regards
> Alexandra
>
> ---------------------------------------------------------------
>
> tcp: allow linear skb payload for first in queue
>
> Allow payload in skb->head for first skb in the queue,
> RPC workloads will benefit.
>
> Fixes: 472c2e07eef0 ("tcp: add one skb cache for tx")
> Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
> ---
>  net/ipv4/tcp.c | 39 +++++++++++++++++++++++++++++++++++++--
>  1 file changed, 37 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index e5011c136fdb..f7cbccd41d85 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1154,6 +1154,30 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset,
>  }
>  EXPORT_SYMBOL(tcp_sendpage);
>
> +/* Do not bother using a page frag for very small frames.
> + * But use this heuristic only for the first skb in write queue.
> + *
> + * Having no payload in skb->head allows better SACK shifting
> + * in tcp_shift_skb_data(), reducing sack/rack overhead, because
> + * write queue has less skbs.
> + * Each skb can hold up to MAX_SKB_FRAGS * 32Kbytes, or ~0.5 MB.
> + * This also speeds up tso_fragment(), since it won't fallback
> + * to tcp_fragment().
> + */
> +static int linear_payload_sz(bool first_skb)
> +{
> +	if (first_skb)
> +		return SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER);
> +	return 0;
> +}
> +
> +static int select_size(bool first_skb, bool zc)
> +{
> +	if (zc)
> +		return 0;
> +	return linear_payload_sz(first_skb);
> +}
> +
>  void tcp_free_fastopen_req(struct tcp_sock *tp)
>  {
>  	if (tp->fastopen_req) {
> @@ -1311,6 +1335,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>
>  		if (copy <= 0 || !tcp_skb_can_collapse_to(skb)) {
>  			bool first_skb;
> +			int linear;
>
>  new_segment:
>  			if (!sk_stream_memory_free(sk))
> @@ -1322,7 +1347,9 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>  					goto restart;
>  			}
>  			first_skb = tcp_rtx_and_write_queues_empty(sk);
> -			skb = tcp_stream_alloc_skb(sk, 0, sk->sk_allocation,
> +			linear = select_size(first_skb, zc);
> +			skb = tcp_stream_alloc_skb(sk, linear,
> +						   sk->sk_allocation,
>  						   first_skb);
>  			if (!skb)
>  				goto wait_for_space;
> @@ -1344,7 +1371,15 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>  		if (copy > msg_data_left(msg))
>  			copy = msg_data_left(msg);
>
> -		if (!zc) {
> +		/* Where to copy to? */
> +		if (skb_availroom(skb) > 0 && !zc) {
> +			/* We have some space in skb head. Superb! */
> +			copy = min_t(int, copy, skb_availroom(skb));
> +			err = skb_add_data_nocache(sk, skb, &msg->msg_iter,
> +						   copy);
> +			if (err)
> +				goto do_error;
> +		} else if (!zc) {
>  			bool merge = true;
>  			int i = skb_shinfo(skb)->nr_frags;
>  			struct page_frag *pfrag = sk_page_frag(sk);
> --
> 2.24.3 (Apple Git-128)
>
On 07.09.22 at 18:06, Eric Dumazet wrote:
> On Wed, Sep 7, 2022 at 5:26 AM Alexandra Winter <wintera@linux.ibm.com> wrote:
>>
>> Since linear payload was removed even for single small messages,
>> an additional page is required and we are measuring performance impact.
>>
>> 3613b3dbd1ad ("tcp: prepare skbs for better sack shifting")
>> explicitly allowed "payload in skb->head for first skb put in the queue,
>> to not impact RPC workloads."
>> 472c2e07eef0 ("tcp: add one skb cache for tx")
>> made that obsolete and removed it.
>> When
>> d8b81175e412 ("tcp: remove sk_{tr}x_skb_cache")
>> reverted it, this piece was not reverted and not added back in.
>>
>> When running uperf with a request-response pattern with 1k payload
>> and 250 parallel connections, we measure a 13% difference in throughput
>> for our PCI-based network interfaces since 472c2e07eef0.
>> (our IOMMU is sensitive to the number of mapped pages)
>>
>> Could you please consider allowing linear payload for the first
>> skb in queue again? A patch proposal is appended below.
>
> No.
>
> Please add a workaround in your driver.
>
> You can increase throughput by 20% by premapping a coherent piece of
> memory into which you can copy small skbs (skb->head included).
>
> Something like 256 bytes per slot in the TX ring.
>

FWIW this regression was with the standard Mellanox driver (nothing s390 specific).
On Thu, Sep 8, 2022 at 2:40 AM Christian Borntraeger <borntraeger@linux.ibm.com> wrote:
>
> On 07.09.22 at 18:06, Eric Dumazet wrote:
> > On Wed, Sep 7, 2022 at 5:26 AM Alexandra Winter <wintera@linux.ibm.com> wrote:
> >>
> >> Since linear payload was removed even for single small messages,
> >> an additional page is required and we are measuring performance impact.
> >>
> >> 3613b3dbd1ad ("tcp: prepare skbs for better sack shifting")
> >> explicitly allowed "payload in skb->head for first skb put in the queue,
> >> to not impact RPC workloads."
> >> 472c2e07eef0 ("tcp: add one skb cache for tx")
> >> made that obsolete and removed it.
> >> When
> >> d8b81175e412 ("tcp: remove sk_{tr}x_skb_cache")
> >> reverted it, this piece was not reverted and not added back in.
> >>
> >> When running uperf with a request-response pattern with 1k payload
> >> and 250 parallel connections, we measure a 13% difference in throughput
> >> for our PCI-based network interfaces since 472c2e07eef0.
> >> (our IOMMU is sensitive to the number of mapped pages)
> >>
> >> Could you please consider allowing linear payload for the first
> >> skb in queue again? A patch proposal is appended below.
> >
> > No.
> >
> > Please add a workaround in your driver.
> >
> > You can increase throughput by 20% by premapping a coherent piece of
> > memory into which you can copy small skbs (skb->head included).
> >
> > Something like 256 bytes per slot in the TX ring.
> >
>
> FWIW this regression was with the standard Mellanox driver (nothing s390 specific).

I did not claim this was s390 specific.

Only IOMMU mode.

I would rather not add back something which makes the TCP stack slower
(more tests in the fast path) for the majority of us _not_ using IOMMU.

In our own tests, this trick of using linear skbs was only helping
benchmarks, not real workloads.

Many drivers have to map skb->head a second time if it contains TCP payload,
thus adding yet another corner case in their fast path.

- Typical RPC workloads are playing with TCP_NODELAY
- Typical bulk flows never have empty write queues...

Really, I do not want this optimization back, this is not worth it.

Again, a driver knows better if it is using IOMMU and if pathological
layouts can be optimized to non-SG ones, and using a pre-dma-map zone
will also benefit pure TCP ACK packets (which do not have any payload).

Here is the changelog of a patch I did for our GQ NIC (not yet
upstreamed, but will be soon)

...

The problem is coming from gq_tx_clean() calling
dma_unmap_single(q->dev, p->addr, -p->len, DMA_TO_DEVICE);

It seems silly to perform possibly expensive IOMMU operations to send
small packets. (TCP pure ACKs are 86 bytes long in total for 99% of
the cases)

The idea of this patch is to pre-dma-map a memory zone to hold the headers
of the packet (if less than 128/256 bytes long).

Then, if the whole packet can be copied into this 128/256-byte zone,
just copy it entirely.

This allows small packets to be consumed right away in ndo_start_xmit()
while the skb (and associated socket sk_wmem_alloc) is hot, instead of
later at TX completion time.

This makes ACK packets much cheaper, but also tiny TCP packets
(say, synthetic benchmarks).

We enable this behavior only if IOMMU is used/forced on GQ,
although we might use it regardless of IOMMU being used or not.

...

To recap, there is a huge difference if we cross the 42-byte limit
(for a 128-byte zone per TX ring slot):

iroa21:/home/edumazet# ./super_netperf 200 -H iroa23 -t TCP_RR -l 20 -- -r40,40
2648141
iroa21:/home/edumazet# ./super_netperf 200 -H iroa23 -t TCP_RR -l 20 -- -r44,44
970691

We might experiment with a bigger GQ_TX_INLINE_HEADER_SIZE in the future?

...
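[Editorial sketch] The pre-mapped inline zone described in the changelog above can be modeled in user space roughly as follows. This is not the GQ driver code, which does not appear in this thread: struct tx_slot, INLINE_ZONE_SIZE and tx_try_inline() are hypothetical names chosen for illustration, and a plain array stands in for memory that a real driver would DMA-map once at ring setup. The sketch assumes a 128-byte zone per slot and the 86 bytes of headers quoted for pure ACKs, which is where the 42-byte payload limit comes from (128 - 86 = 42).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical size of the pre-DMA-mapped zone per TX ring slot. */
#define INLINE_ZONE_SIZE 128

/* Hypothetical TX ring slot. "inline_zone" stands in for memory that a real
 * driver would have DMA-mapped once, when the ring was set up.
 */
struct tx_slot {
	unsigned char inline_zone[INLINE_ZONE_SIZE];
	size_t inline_len;	/* bytes of the packet copied into the zone */
	bool needs_dma_map;	/* true if some data still needs a per-packet mapping */
};

/* Copy the headers, and the payload too if the whole packet fits, into the
 * pre-mapped zone. Returns true when the whole packet was copied, meaning
 * no per-packet map/unmap would be needed.
 */
static bool tx_try_inline(struct tx_slot *slot,
			  const void *hdr, size_t hdr_len,
			  const void *payload, size_t payload_len)
{
	assert(slot && hdr);

	slot->inline_len = 0;
	slot->needs_dma_map = true;

	if (hdr_len > INLINE_ZONE_SIZE)
		return false;	/* headers too large: map everything as usual */

	memcpy(slot->inline_zone, hdr, hdr_len);
	slot->inline_len = hdr_len;

	if (payload_len > INLINE_ZONE_SIZE - hdr_len)
		return false;	/* headers inlined, payload still needs mapping */

	if (payload_len)
		memcpy(slot->inline_zone + hdr_len, payload, payload_len);
	slot->inline_len += payload_len;
	slot->needs_dma_map = false;
	return true;
}
```

Under these assumed sizes, a pure ACK (86 bytes, no payload) and a 40-byte request both fit entirely in the zone, while a 44-byte request does not, which is consistent with the -r40,40 and -r44,44 runs landing on opposite sides of the limit.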
On 08.09.22 14:41, Eric Dumazet wrote:
> On Thu, Sep 8, 2022 at 2:40 AM Christian Borntraeger
> <borntraeger@linux.ibm.com> wrote:
>>
>> On 07.09.22 at 18:06, Eric Dumazet wrote:
>>> On Wed, Sep 7, 2022 at 5:26 AM Alexandra Winter <wintera@linux.ibm.com> wrote:
>>>>
>>>> Since linear payload was removed even for single small messages,
>>>> an additional page is required and we are measuring performance impact.
>>>>
>>>> 3613b3dbd1ad ("tcp: prepare skbs for better sack shifting")
>>>> explicitly allowed "payload in skb->head for first skb put in the queue,
>>>> to not impact RPC workloads."
>>>> 472c2e07eef0 ("tcp: add one skb cache for tx")
>>>> made that obsolete and removed it.
>>>> When
>>>> d8b81175e412 ("tcp: remove sk_{tr}x_skb_cache")
>>>> reverted it, this piece was not reverted and not added back in.
>>>>
>>>> When running uperf with a request-response pattern with 1k payload
>>>> and 250 parallel connections, we measure a 13% difference in throughput
>>>> for our PCI-based network interfaces since 472c2e07eef0.
>>>> (our IOMMU is sensitive to the number of mapped pages)
>>>>
>>>> Could you please consider allowing linear payload for the first
>>>> skb in queue again? A patch proposal is appended below.
>>>
>>> No.
>>>
>>> Please add a workaround in your driver.
>>>
>>> You can increase throughput by 20% by premapping a coherent piece of
>>> memory into which you can copy small skbs (skb->head included).
>>>
>>> Something like 256 bytes per slot in the TX ring.
>>>
>>
>> FWIW this regression was with the standard Mellanox driver (nothing s390 specific).
>
> I did not claim this was s390 specific.
>
> Only IOMMU mode.
>
> I would rather not add back something which makes the TCP stack slower
> (more tests in the fast path)
> for the majority of us _not_ using IOMMU.
>
> In our own tests, this trick of using linear skbs was only helping
> benchmarks, not real workloads.
>
> Many drivers have to map skb->head a second time if it contains TCP payload,
> thus adding yet another corner case in their fast path.
>
> - Typical RPC workloads are playing with TCP_NODELAY
> - Typical bulk flows never have empty write queues...
>
> Really, I do not want this optimization back, this is not worth it.
>
> Again, a driver knows better if it is using IOMMU and if pathological
> layouts can be optimized
> to non-SG ones, and using a pre-dma-map zone will also benefit pure
> TCP ACK packets (which do not have any payload).
>
> Here is the changelog of a patch I did for our GQ NIC (not yet
> upstreamed, but will be soon)
>
[...]

Saeed,
As discussed at LPC, could you please consider adding a workaround to the
Mellanox driver, to use non-SG SKBs for small messages? As mentioned above,
we are seeing a 13% throughput degradation if 2 pages need to be mapped
instead of 1.

While Eric's ideas sound very promising, just using non-SG in these cases
should be enough to mitigate the performance regression we see.

Thank you in advance.
Alexandra
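[Editorial sketch] The "non-SG for small messages" request above can be illustrated with a small user-space model. It is a sketch under stated assumptions, not mlx5 code: struct pkt, struct frag, LINEARIZE_THRESHOLD, pkt_dma_mappings() and pkt_linearize_if_small() are hypothetical names, and the 1500-byte threshold is an arbitrary illustrative value. In a real driver the kernel helper skb_linearize() could play the role of the copy loop. The model counts one DMA mapping for the linear head plus one per fragment, so coalescing a small packet takes it from 2 mappings to 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_FRAGS		4
#define HEAD_ROOM		2048
#define LINEARIZE_THRESHOLD	1500	/* hypothetical cut-off for "small" */

/* Hypothetical stand-in for one page fragment of a scatter-gather packet. */
struct frag {
	const unsigned char *data;
	size_t len;
};

/* Hypothetical stand-in for an skb: a linear head plus optional fragments. */
struct pkt {
	unsigned char head[HEAD_ROOM];
	size_t head_len;
	struct frag frags[MAX_FRAGS];
	size_t nr_frags;
};

static size_t pkt_len(const struct pkt *p)
{
	size_t total = p->head_len;

	for (size_t i = 0; i < p->nr_frags; i++)
		total += p->frags[i].len;
	return total;
}

/* One mapping for the head, plus one per fragment. */
static size_t pkt_dma_mappings(const struct pkt *p)
{
	return 1 + p->nr_frags;
}

/* Copy all fragments into the head if the packet is small enough.
 * Returns true if the packet was converted to a non-SG layout.
 */
static bool pkt_linearize_if_small(struct pkt *p)
{
	size_t total;

	assert(p);
	total = pkt_len(p);
	if (p->nr_frags == 0)
		return false;	/* already linear */
	if (total > LINEARIZE_THRESHOLD || total > HEAD_ROOM)
		return false;	/* too big: keep the scatter-gather layout */

	for (size_t i = 0; i < p->nr_frags; i++) {
		memcpy(p->head + p->head_len, p->frags[i].data, p->frags[i].len);
		p->head_len += p->frags[i].len;
	}
	p->nr_frags = 0;
	return true;
}
```

The trade-off being modeled is a memcpy of at most LINEARIZE_THRESHOLD bytes against one fewer map/unmap pair per packet. Whether that is a net win depends on the IOMMU cost on the platform in question, which is why careful performance testing is called for in the thread.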
On 26 Sep 12:06, Alexandra Winter wrote:
>

[ ... ]

> [...]
>
> Saeed,
> As discussed at LPC, could you please consider adding a workaround to the
> Mellanox driver, to use non-SG SKBs for small messages? As mentioned above,
> we are seeing a 13% throughput degradation if 2 pages need to be mapped
> instead of 1.
>
> While Eric's ideas sound very promising, just using non-SG in these cases
> should be enough to mitigate the performance regression we see.

Hi Alexandra, sorry for the late response.

Yes, linearizing small messages makes sense, but it will require some careful
perf testing.

We will do our best to include this in the next kernel release cycle.
I will take it up with the mlx5e team next week; everybody is on vacation this
time of year :).

Thanks,
Saeed.
On 01.10.22 01:37, Saeed Mahameed wrote:
> On 26 Sep 12:06, Alexandra Winter wrote:
>>
>
> [ ... ]
>
>> [...]
>>
>> Saeed,
>> As discussed at LPC, could you please consider adding a workaround to the
>> Mellanox driver, to use non-SG SKBs for small messages? As mentioned above,
>> we are seeing a 13% throughput degradation if 2 pages need to be mapped
>> instead of 1.
>>
>> While Eric's ideas sound very promising, just using non-SG in these cases
>> should be enough to mitigate the performance regression we see.
>
> Hi Alexandra, sorry for the late response.
>
> Yes, linearizing small messages makes sense, but it will require some careful
> perf testing.
>
> We will do our best to include this in the next kernel release cycle.
> I will take it up with the mlx5e team next week; everybody is on vacation this
> time of year :).
>
> Thanks,
> Saeed.

Hello Saeed,
may I ask whether you had a chance to include such a patch in the 6.2 kernel?
Or is this still on your to-do list?
I haven't seen anything like this on the mailing list, but I may have overlooked it.

All the best for 2023
Alexandra
On 29.12.22 09:27, Alexandra Winter wrote:
>
>
> On 01.10.22 01:37, Saeed Mahameed wrote:
>> On 26 Sep 12:06, Alexandra Winter wrote:
>>
>> [ ... ]
>>>
>>> Saeed,
>>> As discussed at LPC, could you please consider adding a workaround to the
>>> Mellanox driver, to use non-SG SKBs for small messages? As mentioned above,
>>> we are seeing a 13% throughput degradation if 2 pages need to be mapped
>>> instead of 1.
>>>
>>> While Eric's ideas sound very promising, just using non-SG in these cases
>>> should be enough to mitigate the performance regression we see.
>>
>> Hi Alexandra, sorry for the late response.
>>
>> Yes, linearizing small messages makes sense, but it will require some careful
>> perf testing.
>>
>> We will do our best to include this in the next kernel release cycle.
>> I will take it up with the mlx5e team next week; everybody is on vacation this
>> time of year :).
>>
>> Thanks,
>> Saeed.
>
> Hello Saeed,
> may I ask whether you had a chance to include such a patch in the 6.2 kernel?
> Or is this still on your to-do list?
> I haven't seen anything like this on the mailing list, but I may have overlooked it.
>
> All the best for 2023
> Alexandra

Hello Saeed,
any news about linearizing small messages?
Is there any way we could be of help?

Kind regards
Alexandra
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e5011c136fdb..f7cbccd41d85 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1154,6 +1154,30 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset,
 }
 EXPORT_SYMBOL(tcp_sendpage);
 
+/* Do not bother using a page frag for very small frames.
+ * But use this heuristic only for the first skb in write queue.
+ *
+ * Having no payload in skb->head allows better SACK shifting
+ * in tcp_shift_skb_data(), reducing sack/rack overhead, because
+ * write queue has less skbs.
+ * Each skb can hold up to MAX_SKB_FRAGS * 32Kbytes, or ~0.5 MB.
+ * This also speeds up tso_fragment(), since it won't fallback
+ * to tcp_fragment().
+ */
+static int linear_payload_sz(bool first_skb)
+{
+	if (first_skb)
+		return SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER);
+	return 0;
+}
+
+static int select_size(bool first_skb, bool zc)
+{
+	if (zc)
+		return 0;
+	return linear_payload_sz(first_skb);
+}
+
 void tcp_free_fastopen_req(struct tcp_sock *tp)
 {
 	if (tp->fastopen_req) {
@@ -1311,6 +1335,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 
 		if (copy <= 0 || !tcp_skb_can_collapse_to(skb)) {
 			bool first_skb;
+			int linear;
 
 new_segment:
 			if (!sk_stream_memory_free(sk))
@@ -1322,7 +1347,9 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 					goto restart;
 			}
 			first_skb = tcp_rtx_and_write_queues_empty(sk);
-			skb = tcp_stream_alloc_skb(sk, 0, sk->sk_allocation,
+			linear = select_size(first_skb, zc);
+			skb = tcp_stream_alloc_skb(sk, linear,
+						   sk->sk_allocation,
 						   first_skb);
 			if (!skb)
 				goto wait_for_space;
@@ -1344,7 +1371,15 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 		if (copy > msg_data_left(msg))
 			copy = msg_data_left(msg);
 
-		if (!zc) {
+		/* Where to copy to? */
+		if (skb_availroom(skb) > 0 && !zc) {
+			/* We have some space in skb head. Superb! */
+			copy = min_t(int, copy, skb_availroom(skb));
+			err = skb_add_data_nocache(sk, skb, &msg->msg_iter,
+						   copy);
+			if (err)
+				goto do_error;
+		} else if (!zc) {
 			bool merge = true;
 			int i = skb_shinfo(skb)->nr_frags;
 			struct page_frag *pfrag = sk_page_frag(sk);
Since linear payload was removed even for single small messages,
an additional page is required and we are measuring performance impact.

3613b3dbd1ad ("tcp: prepare skbs for better sack shifting")
explicitly allowed "payload in skb->head for first skb put in the queue,
to not impact RPC workloads."
472c2e07eef0 ("tcp: add one skb cache for tx")
made that obsolete and removed it.
When
d8b81175e412 ("tcp: remove sk_{tr}x_skb_cache")
reverted it, this piece was not reverted and not added back in.

When running uperf with a request-response pattern with 1k payload
and 250 parallel connections, we measure a 13% difference in throughput
for our PCI-based network interfaces since 472c2e07eef0.
(our IOMMU is sensitive to the number of mapped pages)

Could you please consider allowing linear payload for the first
skb in queue again? A patch proposal is appended below.

Kind regards
Alexandra

---------------------------------------------------------------

tcp: allow linear skb payload for first in queue

Allow payload in skb->head for first skb in the queue,
RPC workloads will benefit.

Fixes: 472c2e07eef0 ("tcp: add one skb cache for tx")
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
---
 net/ipv4/tcp.c | 39 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 37 insertions(+), 2 deletions(-)