Message ID | 20210302060753.953931-1-kuba@kernel.org (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net] net: tcp: don't allocate fast clones for fastopen SYN | expand |
Context | Check | Description |
---|---|---|
netdev/cover_letter | success | Link |
netdev/fixes_present | success | Link |
netdev/patch_count | success | Link |
netdev/tree_selection | success | Clearly marked for net |
netdev/subject_prefix | success | Link |
netdev/cc_maintainers | fail | 1 blamed authors not CCed: ycheng@google.com; 3 maintainers not CCed: ycheng@google.com yoshfuji@linux-ipv6.org dsahern@kernel.org |
netdev/source_inline | success | Was 0 now: 0 |
netdev/verify_signedoff | success | Link |
netdev/module_param | success | Was 0 now: 0 |
netdev/build_32bit | success | Errors and warnings before: 1 this patch: 1 |
netdev/kdoc | success | Errors and warnings before: 0 this patch: 0 |
netdev/verify_fixes | success | Link |
netdev/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 17 lines checked |
netdev/build_allmodconfig_warn | success | Errors and warnings before: 1 this patch: 1 |
netdev/header_inline | success | Link |
netdev/stable | success | Stable not CCed |
On Tue, Mar 2, 2021 at 7:08 AM Jakub Kicinski <kuba@kernel.org> wrote: > > When receiver does not accept TCP Fast Open it will only ack > the SYN, and not the data. We detect this and immediately queue > the data for (re)transmission in tcp_rcv_fastopen_synack(). > > In DC networks with very low RTT and without RFS the SYN-ACK > may arrive before NIC driver reported Tx completion on > the original SYN. In which case skb_still_in_host_queue() > returns true and sender will need to wait for the retransmission > timer to fire milliseconds later. > > Revert back to non-fast clone skbs, this way > skb_still_in_host_queue() won't prevent the recovery flow > from completing. > > Suggested-by: Eric Dumazet <edumazet@google.com> > Fixes: 355a901e6cf1 ("tcp: make connect() mem charging friendly") Hmmm, not sure if this Fixes: tag makes sense. Really, if we delay TX completions by say 10 ms, other parts of the stack will misbehave anyway. Also, backporting this patch up to linux-3.19 is going to be tricky. The real issue here is that skb_still_in_host_queue() can give a false positive. I have mixed feelings here, as you can read my answer :/ Maybe skb_still_in_host_queue() signal should not be used when a part of the SKB has been received/acknowledged by the remote peer (in this case the SYN part). Alternative is that drivers unable to TX complete their skbs in a reasonable time should call skb_orphan() to avoid skb_unclone() penalties (and this skb_still_in_host_queue() issue) If you really want to play and delay TX completions, maybe provide a way to disable skb_still_in_host_queue() globally, using a static key ? My personal WIP/hack was something like : diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 69a545db80d2ead47ffcf2f3819a6d066e95f35d..666f6f0a6a06fece204199e07a79e21d1faf8f92 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -5995,7 +5995,8 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack, else tp->fastopen_client_fail = TFO_DATA_NOT_ACKED; skb_rbtree_walk_from(data) { - if (__tcp_retransmit_skb(sk, data, 1)) + /* segs = -1 to bypass skb_still_in_host_queue() check */ + if (__tcp_retransmit_skb(sk, data, -1)) break; } tcp_rearm_rto(sk); diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index fbf140a770d8e21b936369b79abbe9857537acd8..1d1489e596976e352fe7d5ccee7a6eae55fdbcce 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -3155,8 +3155,12 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs) sk->sk_sndbuf)) return -EAGAIN; - if (skb_still_in_host_queue(sk, skb)) - return -EBUSY; + if (segs > 0) { + if (skb_still_in_host_queue(sk, skb)) + return -EBUSY; + } else { + segs = -segs; + } if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) { if (unlikely(before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))) { > Signed-off-by: Neil Spring <ntspring@fb.com> > Signed-off-by: Jakub Kicinski <kuba@kernel.org> > --- > net/ipv4/tcp_output.c | 9 ++++++++- > 1 file changed, 8 insertions(+), 1 deletion(-) > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > index fbf140a770d8..cd9461588539 100644 > --- a/net/ipv4/tcp_output.c > +++ b/net/ipv4/tcp_output.c > @@ -3759,9 +3759,16 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn) > /* limit to order-0 allocations */ > space = min_t(size_t, space, SKB_MAX_HEAD(MAX_TCP_HEADER)); > > - syn_data = sk_stream_alloc_skb(sk, space, sk->sk_allocation, false); > + syn_data = alloc_skb(MAX_TCP_HEADER + space, sk->sk_allocation); > if (!syn_data) > goto fallback; > + if (!sk_wmem_schedule(sk, syn_data->truesize)) { > + __kfree_skb(syn_data); > + goto fallback; > + } > + skb_reserve(syn_data, MAX_TCP_HEADER); > + INIT_LIST_HEAD(&syn_data->tcp_tsorted_anchor); > + > syn_data->ip_summed = CHECKSUM_PARTIAL; > memcpy(syn_data->cb, syn->cb, sizeof(syn->cb)); > if (space) { > -- > 2.26.2 >
On Tue, 2 Mar 2021 10:38:46 +0100 Eric Dumazet wrote: > On Tue, Mar 2, 2021 at 7:08 AM Jakub Kicinski <kuba@kernel.org> wrote: > > > > When receiver does not accept TCP Fast Open it will only ack > > the SYN, and not the data. We detect this and immediately queue > > the data for (re)transmission in tcp_rcv_fastopen_synack(). > > > > In DC networks with very low RTT and without RFS the SYN-ACK > > may arrive before NIC driver reported Tx completion on > > the original SYN. In which case skb_still_in_host_queue() > > returns true and sender will need to wait for the retransmission > > timer to fire milliseconds later. > > > > Revert back to non-fast clone skbs, this way > > skb_still_in_host_queue() won't prevent the recovery flow > > from completing. > > > > Suggested-by: Eric Dumazet <edumazet@google.com> > > Fixes: 355a901e6cf1 ("tcp: make connect() mem charging friendly") > > Hmmm, not sure if this Fixes: tag makes sense. > > Really, if we delay TX completions by say 10 ms, other parts of the > stack will misbehave anyway. > > Also, backporting this patch up to linux-3.19 is going to be tricky. Indeed, the problem is minor in practical terms. Maybe it's enough if I spell that out more in the description? Are you thinking net-next or net without a Fixes tag? > The real issue here is that skb_still_in_host_queue() can give a false positive. > > I have mixed feelings here, as you can read my answer :/ > > Maybe skb_still_in_host_queue() signal should not be used when a part > of the SKB has been received/acknowledged by the remote peer > (in this case the SYN part). FWIW I was pondering this, when the rtx is requested by the receiver we are relatively sure we can ignore skb_still_in_host_queue() because we know our system should Tx in order so if receiver saw N + 1, N can't be in our queues. But AFAICT generalizing the test doesn't matter much. In cases other than TFO worst case a loss probe will chase the rtx out. And I don't grasp enough of TCP to implement the general optimization :) > Alternative is that drivers unable to TX complete their skbs in a > reasonable time should call skb_orphan() > to avoid skb_unclone() penalties (and this skb_still_in_host_queue() issue) > > If you really want to play and delay TX completions, maybe provide a > way to disable skb_still_in_host_queue() globally, > using a static key ? I see the TFO issue with rx and tx completions set to 33us both, with two different NIC vendors, so the timing just influences the likelihood. > My personal WIP/hack was something like : LGTM, are you happy with that being the fix? > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index 69a545db80d2ead47ffcf2f3819a6d066e95f35d..666f6f0a6a06fece204199e07a79e21d1faf8f92 > 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -5995,7 +5995,8 @@ static bool tcp_rcv_fastopen_synack(struct sock > *sk, struct sk_buff *synack, > else > tp->fastopen_client_fail = TFO_DATA_NOT_ACKED; > skb_rbtree_walk_from(data) { > - if (__tcp_retransmit_skb(sk, data, 1)) > + /* segs = -1 to bypass > skb_still_in_host_queue() check */ > + if (__tcp_retransmit_skb(sk, data, -1)) > break; > } > tcp_rearm_rto(sk); > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > index fbf140a770d8e21b936369b79abbe9857537acd8..1d1489e596976e352fe7d5ccee7a6eae55fdbcce > 100644 > --- a/net/ipv4/tcp_output.c > +++ b/net/ipv4/tcp_output.c > @@ -3155,8 +3155,12 @@ int __tcp_retransmit_skb(struct sock *sk, > struct sk_buff *skb, int segs) > sk->sk_sndbuf)) > return -EAGAIN; > > - if (skb_still_in_host_queue(sk, skb)) > - return -EBUSY; > + if (segs > 0) { > + if (skb_still_in_host_queue(sk, skb)) > + return -EBUSY; > + } else { > + segs = -segs; > + } > > if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) { > if (unlikely(before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))) {
On Tue, Mar 2, 2021 at 6:00 PM Jakub Kicinski <kuba@kernel.org> wrote: > > On Tue, 2 Mar 2021 10:38:46 +0100 Eric Dumazet wrote: > > On Tue, Mar 2, 2021 at 7:08 AM Jakub Kicinski <kuba@kernel.org> wrote: > > > > > > When receiver does not accept TCP Fast Open it will only ack > > > the SYN, and not the data. We detect this and immediately queue > > > the data for (re)transmission in tcp_rcv_fastopen_synack(). > > > > > > In DC networks with very low RTT and without RFS the SYN-ACK > > > may arrive before NIC driver reported Tx completion on > > > the original SYN. In which case skb_still_in_host_queue() > > > returns true and sender will need to wait for the retransmission > > > timer to fire milliseconds later. > > > > > > Revert back to non-fast clone skbs, this way > > > skb_still_in_host_queue() won't prevent the recovery flow > > > from completing. > > > > > > Suggested-by: Eric Dumazet <edumazet@google.com> > > > Fixes: 355a901e6cf1 ("tcp: make connect() mem charging friendly") > > > > Hmmm, not sure if this Fixes: tag makes sense. > > > > Really, if we delay TX completions by say 10 ms, other parts of the > > stack will misbehave anyway. > > > > Also, backporting this patch up to linux-3.19 is going to be tricky. > > Indeed, the problem is minor in practical terms. Maybe it's enough if I > spell that out more in the description? Are you thinking net-next or > net without a Fixes tag? > > > The real issue here is that skb_still_in_host_queue() can give a false positive. > > > > I have mixed feelings here, as you can read my answer :/ > > > > Maybe skb_still_in_host_queue() signal should not be used when a part > > of the SKB has been received/acknowledged by the remote peer > > (in this case the SYN part). > > FWIW I was pondering this, when the rtx is requested by the receiver > we are relatively sure we can ignore skb_still_in_host_queue() because > we know our system should Tx in order so if receiver saw N + 1, N can't > be in our queues. > > But AFAICT generalizing the test doesn't matter much. In cases other > than TFO worst case a loss probe will chase the rtx out. And I don't > grasp enough of TCP to implement the general optimization :) > > > Alternative is that drivers unable to TX complete their skbs in a > > reasonable time should call skb_orphan() > > to avoid skb_unclone() penalties (and this skb_still_in_host_queue() issue) > > > > If you really want to play and delay TX completions, maybe provide a > > way to disable skb_still_in_host_queue() globally, > > using a static key ? > > I see the TFO issue with rx and tx completions set to 33us both, > with two different NIC vendors, so the timing just influences the > likelihood. > > > My personal WIP/hack was something like : > > LGTM, are you happy with that being the fix? Yes, this seems a bit less intrusive, and net-next should be fine (I guess FB can backport this early if needed) > > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > > index 69a545db80d2ead47ffcf2f3819a6d066e95f35d..666f6f0a6a06fece204199e07a79e21d1faf8f92 > > 100644 > > --- a/net/ipv4/tcp_input.c > > +++ b/net/ipv4/tcp_input.c > > @@ -5995,7 +5995,8 @@ static bool tcp_rcv_fastopen_synack(struct sock > > *sk, struct sk_buff *synack, > > else > > tp->fastopen_client_fail = TFO_DATA_NOT_ACKED; > > skb_rbtree_walk_from(data) { > > - if (__tcp_retransmit_skb(sk, data, 1)) > > + /* segs = -1 to bypass > > skb_still_in_host_queue() check */ > > + if (__tcp_retransmit_skb(sk, data, -1)) > > break; > > } > > tcp_rearm_rto(sk); > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > > index fbf140a770d8e21b936369b79abbe9857537acd8..1d1489e596976e352fe7d5ccee7a6eae55fdbcce > > 100644 > > --- a/net/ipv4/tcp_output.c > > +++ b/net/ipv4/tcp_output.c > > @@ -3155,8 +3155,12 @@ int __tcp_retransmit_skb(struct sock *sk, > > struct sk_buff *skb, int segs) > > sk->sk_sndbuf)) > > return -EAGAIN; > > > > - if (skb_still_in_host_queue(sk, skb)) > > - return -EBUSY; > > + if (segs > 0) { > > + if (skb_still_in_host_queue(sk, skb)) > > + return -EBUSY; > > + } else { > > + segs = -segs; > > + } > > > > if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) { > > if (unlikely(before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))) { >
On Tue, Mar 2, 2021 at 11:58 AM Eric Dumazet <edumazet@google.com> wrote: > > On Tue, Mar 2, 2021 at 7:08 AM Jakub Kicinski <kuba@kernel.org> wrote: > > > > When receiver does not accept TCP Fast Open it will only ack > > the SYN, and not the data. We detect this and immediately queue > > the data for (re)transmission in tcp_rcv_fastopen_synack(). > > > > In DC networks with very low RTT and without RFS the SYN-ACK > > may arrive before NIC driver reported Tx completion on > > the original SYN. In which case skb_still_in_host_queue() > > returns true and sender will need to wait for the retransmission > > timer to fire milliseconds later. > > > > Revert back to non-fast clone skbs, this way > > skb_still_in_host_queue() won't prevent the recovery flow > > from completing. > > > > Suggested-by: Eric Dumazet <edumazet@google.com> > > Fixes: 355a901e6cf1 ("tcp: make connect() mem charging friendly") > > Hmmm, not sure if this Fixes: tag makes sense. > > Really, if we delay TX completions by say 10 ms, other parts of the > stack will misbehave anyway. > > Also, backporting this patch up to linux-3.19 is going to be tricky. > > The real issue here is that skb_still_in_host_queue() can give a false positive. > > I have mixed feelings here, as you can read my answer :/ > > Maybe skb_still_in_host_queue() signal should not be used when a part > of the SKB has been received/acknowledged by the remote peer > (in this case the SYN part). Thank you Eric and Jakub for working on the TFO issue. I like this option the most because it's more generic and easy to understand. Is it easy to implement by checking snd_una etc? > > Alternative is that drivers unable to TX complete their skbs in a > reasonable time should call skb_orphan() > to avoid skb_unclone() penalties (and this skb_still_in_host_queue() issue) > > If you really want to play and delay TX completions, maybe provide a > way to disable skb_still_in_host_queue() globally, > using a static key ? > > My personal WIP/hack was something like : > > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index 69a545db80d2ead47ffcf2f3819a6d066e95f35d..666f6f0a6a06fece204199e07a79e21d1faf8f92 > 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -5995,7 +5995,8 @@ static bool tcp_rcv_fastopen_synack(struct sock > *sk, struct sk_buff *synack, > else > tp->fastopen_client_fail = TFO_DATA_NOT_ACKED; > skb_rbtree_walk_from(data) { > - if (__tcp_retransmit_skb(sk, data, 1)) > + /* segs = -1 to bypass > skb_still_in_host_queue() check */ > + if (__tcp_retransmit_skb(sk, data, -1)) > break; > } > tcp_rearm_rto(sk); > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > index fbf140a770d8e21b936369b79abbe9857537acd8..1d1489e596976e352fe7d5ccee7a6eae55fdbcce > 100644 > --- a/net/ipv4/tcp_output.c > +++ b/net/ipv4/tcp_output.c > @@ -3155,8 +3155,12 @@ int __tcp_retransmit_skb(struct sock *sk, > struct sk_buff *skb, int segs) > sk->sk_sndbuf)) > return -EAGAIN; > > - if (skb_still_in_host_queue(sk, skb)) > - return -EBUSY; > + if (segs > 0) { > + if (skb_still_in_host_queue(sk, skb)) > + return -EBUSY; > + } else { > + segs = -segs; > + } > > if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) { > if (unlikely(before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))) { > > > > Signed-off-by: Neil Spring <ntspring@fb.com> > > Signed-off-by: Jakub Kicinski <kuba@kernel.org> > > --- > > net/ipv4/tcp_output.c | 9 ++++++++- > > 1 file changed, 8 insertions(+), 1 deletion(-) > > > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > > index fbf140a770d8..cd9461588539 100644 > > --- a/net/ipv4/tcp_output.c > > +++ b/net/ipv4/tcp_output.c > > @@ -3759,9 +3759,16 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn) > > /* limit to order-0 allocations */ > > space = min_t(size_t, space, SKB_MAX_HEAD(MAX_TCP_HEADER)); > > > > - syn_data = sk_stream_alloc_skb(sk, space, sk->sk_allocation, false); > > + syn_data = alloc_skb(MAX_TCP_HEADER + space, sk->sk_allocation); > > if (!syn_data) > > goto fallback; > > + if (!sk_wmem_schedule(sk, syn_data->truesize)) { > > + __kfree_skb(syn_data); > > + goto fallback; > > + } > > + skb_reserve(syn_data, MAX_TCP_HEADER); > > + INIT_LIST_HEAD(&syn_data->tcp_tsorted_anchor); > > + > > syn_data->ip_summed = CHECKSUM_PARTIAL; > > memcpy(syn_data->cb, syn->cb, sizeof(syn->cb)); > > if (space) { > > -- > > 2.26.2 > >
On Tue, 2 Mar 2021 12:52:14 -0800 Yuchung Cheng wrote: > On Tue, Mar 2, 2021 at 11:58 AM Eric Dumazet <edumazet@google.com> wrote: > > On Tue, Mar 2, 2021 at 7:08 AM Jakub Kicinski <kuba@kernel.org> wrote: > > > When receiver does not accept TCP Fast Open it will only ack > > > the SYN, and not the data. We detect this and immediately queue > > > the data for (re)transmission in tcp_rcv_fastopen_synack(). > > > > > > In DC networks with very low RTT and without RFS the SYN-ACK > > > may arrive before NIC driver reported Tx completion on > > > the original SYN. In which case skb_still_in_host_queue() > > > returns true and sender will need to wait for the retransmission > > > timer to fire milliseconds later. > > > > > > Revert back to non-fast clone skbs, this way > > > skb_still_in_host_queue() won't prevent the recovery flow > > > from completing. > > > > > > Suggested-by: Eric Dumazet <edumazet@google.com> > > > Fixes: 355a901e6cf1 ("tcp: make connect() mem charging friendly") > > > > Hmmm, not sure if this Fixes: tag makes sense. > > > > Really, if we delay TX completions by say 10 ms, other parts of the > > stack will misbehave anyway. > > > > Also, backporting this patch up to linux-3.19 is going to be tricky. > > > > The real issue here is that skb_still_in_host_queue() can give a false positive. > > > > I have mixed feelings here, as you can read my answer :/ > > > > Maybe skb_still_in_host_queue() signal should not be used when a part > > of the SKB has been received/acknowledged by the remote peer > > (in this case the SYN part). > > Thank you Eric and Jakub for working on the TFO issue. > > I like this option the most because it's more generic and easy to understand. I'm assuming by "this" you mean v1 - as far as understanding goes we can polish v2. I think Eric just shared a quick example, but perhaps we could add an explicit bool to __tcp_retransmit_skb() called expl_req or such to signify that the rtx was explicitly requested by the receiver and therefore we can skip the "in queue" check? Or add some flags to the call? If everyone is okay with targeting -next the new argument won't be an issue. > Is it easy to implement by checking snd_una etc? That is to detect TFO retransmits in __tcp_retransmit_skb() by matching the connections state, like acked == 1?
On Tue, Mar 2, 2021 at 1:37 PM Eric Dumazet <edumazet@google.com> wrote: > > On Tue, Mar 2, 2021 at 7:08 AM Jakub Kicinski <kuba@kernel.org> wrote: > > > > When receiver does not accept TCP Fast Open it will only ack > > the SYN, and not the data. We detect this and immediately queue > > the data for (re)transmission in tcp_rcv_fastopen_synack(). > > > > In DC networks with very low RTT and without RFS the SYN-ACK > > may arrive before NIC driver reported Tx completion on > > the original SYN. In which case skb_still_in_host_queue() > > returns true and sender will need to wait for the retransmission > > timer to fire milliseconds later. > > > > Revert back to non-fast clone skbs, this way > > skb_still_in_host_queue() won't prevent the recovery flow > > from completing. > > > > Suggested-by: Eric Dumazet <edumazet@google.com> > > Fixes: 355a901e6cf1 ("tcp: make connect() mem charging friendly") > > Hmmm, not sure if this Fixes: tag makes sense. > > Really, if we delay TX completions by say 10 ms, other parts of the > stack will misbehave anyway. > > Also, backporting this patch up to linux-3.19 is going to be tricky. > > The real issue here is that skb_still_in_host_queue() can give a false positive. > > I have mixed feelings here, as you can read my answer :/ > > Maybe skb_still_in_host_queue() signal should not be used when a part > of the SKB has been received/acknowledged by the remote peer > (in this case the SYN part). > > Alternative is that drivers unable to TX complete their skbs in a > reasonable time should call skb_orphan() > to avoid skb_unclone() penalties (and this skb_still_in_host_queue() issue) > > If you really want to play and delay TX completions, maybe provide a > way to disable skb_still_in_host_queue() globally, > using a static key ? The problem as I see it is that the original fclone isn't what we sent out on the wire and that is confusing things. What we sent was a SYN with data, but what we have now is just a data frame that hasn't been put out on the wire yet. I wonder if we couldn't get away with doing something like adding a fourth option of SKB_FCLONE_MODIFIED that we could apply to fastopen skbs? That would keep the skb_still_in_host queue from triggering as we would be changing the state from SKB_FCLONE_ORIG to SKB_FCLONE_MODIFIED for the skb we store in the retransmit queue. In addition if we have to clone it again and the fclone reference count is 1 we could reset it back to SKB_FCLONE_ORIG.
On Wed, 3 Mar 2021 13:35:53 -0800 Alexander Duyck wrote: > On Tue, Mar 2, 2021 at 1:37 PM Eric Dumazet <edumazet@google.com> wrote: > > On Tue, Mar 2, 2021 at 7:08 AM Jakub Kicinski <kuba@kernel.org> wrote: > > > When receiver does not accept TCP Fast Open it will only ack > > > the SYN, and not the data. We detect this and immediately queue > > > the data for (re)transmission in tcp_rcv_fastopen_synack(). > > > > > > In DC networks with very low RTT and without RFS the SYN-ACK > > > may arrive before NIC driver reported Tx completion on > > > the original SYN. In which case skb_still_in_host_queue() > > > returns true and sender will need to wait for the retransmission > > > timer to fire milliseconds later. > > > > > > Revert back to non-fast clone skbs, this way > > > skb_still_in_host_queue() won't prevent the recovery flow > > > from completing. > > > > > > Suggested-by: Eric Dumazet <edumazet@google.com> > > > Fixes: 355a901e6cf1 ("tcp: make connect() mem charging friendly") > > > > Hmmm, not sure if this Fixes: tag makes sense. > > > > Really, if we delay TX completions by say 10 ms, other parts of the > > stack will misbehave anyway. > > > > Also, backporting this patch up to linux-3.19 is going to be tricky. > > > > The real issue here is that skb_still_in_host_queue() can give a false positive. > > > > I have mixed feelings here, as you can read my answer :/ > > > > Maybe skb_still_in_host_queue() signal should not be used when a part > > of the SKB has been received/acknowledged by the remote peer > > (in this case the SYN part). > > > > Alternative is that drivers unable to TX complete their skbs in a > > reasonable time should call skb_orphan() > > to avoid skb_unclone() penalties (and this skb_still_in_host_queue() issue) > > > > If you really want to play and delay TX completions, maybe provide a > > way to disable skb_still_in_host_queue() globally, > > using a static key ? > > The problem as I see it is that the original fclone isn't what we sent > out on the wire and that is confusing things. What we sent was a SYN > with data, but what we have now is just a data frame that hasn't been > put out on the wire yet. Not sure I understand why it's the key distinction here. Is it re-transmitting part of the frame or having different flags? Is re-transmit of half of a GSO skb also considered not the same? To me the distinction is that the receiver has implicitly asked us for the re-transmission. If it was requested by SACK we should ignore "in_queue" for the first transmission as well, even if the skb state is identical. > I wonder if we couldn't get away with doing something like adding a > fourth option of SKB_FCLONE_MODIFIED that we could apply to fastopen > skbs? That would keep the skb_still_in_host queue from triggering as > we would be changing the state from SKB_FCLONE_ORIG to > SKB_FCLONE_MODIFIED for the skb we store in the retransmit queue. In > addition if we have to clone it again and the fclone reference count > is 1 we could reset it back to SKB_FCLONE_ORIG. The unused value of fclone was tempting me as well :) AFAICT we have at least these options: 1 - don't use a fclone skb [v1] 2 - mark the fclone as "special" at Tx to escape the "in queue" check 3 - indicate to retansmit that we're sure initial tx is out [v2] 4 - same as above but with a bool / flag instead of negative seg 5 - use the fclone bits but mark them at Rx when we see a rtx request 6 - check the skb state in retransmit to match the TFO case (IIUC Yuchung's suggestion) #5 is my favorite but I didn't know how to extend it to fast re-transmits so I just stuck to the suggestion from the ML :) WDYT? Eric, Yuchung?
On Wed, Mar 3, 2021 at 4:07 PM Jakub Kicinski <kuba@kernel.org> wrote: > > On Wed, 3 Mar 2021 13:35:53 -0800 Alexander Duyck wrote: > > On Tue, Mar 2, 2021 at 1:37 PM Eric Dumazet <edumazet@google.com> wrote: > > > On Tue, Mar 2, 2021 at 7:08 AM Jakub Kicinski <kuba@kernel.org> wrote: > > > > When receiver does not accept TCP Fast Open it will only ack > > > > the SYN, and not the data. We detect this and immediately queue > > > > the data for (re)transmission in tcp_rcv_fastopen_synack(). > > > > > > > > In DC networks with very low RTT and without RFS the SYN-ACK > > > > may arrive before NIC driver reported Tx completion on > > > > the original SYN. In which case skb_still_in_host_queue() > > > > returns true and sender will need to wait for the retransmission > > > > timer to fire milliseconds later. > > > > > > > > Revert back to non-fast clone skbs, this way > > > > skb_still_in_host_queue() won't prevent the recovery flow > > > > from completing. > > > > > > > > Suggested-by: Eric Dumazet <edumazet@google.com> > > > > Fixes: 355a901e6cf1 ("tcp: make connect() mem charging friendly") > > > > > > Hmmm, not sure if this Fixes: tag makes sense. > > > > > > Really, if we delay TX completions by say 10 ms, other parts of the > > > stack will misbehave anyway. > > > > > > Also, backporting this patch up to linux-3.19 is going to be tricky. > > > > > > The real issue here is that skb_still_in_host_queue() can give a false positive. > > > > > > I have mixed feelings here, as you can read my answer :/ > > > > > > Maybe skb_still_in_host_queue() signal should not be used when a part > > > of the SKB has been received/acknowledged by the remote peer > > > (in this case the SYN part). > > > > > > Alternative is that drivers unable to TX complete their skbs in a > > > reasonable time should call skb_orphan() > > > to avoid skb_unclone() penalties (and this skb_still_in_host_queue() issue) > > > > > > If you really want to play and delay TX completions, maybe provide a > > > way to disable skb_still_in_host_queue() globally, > > > using a static key ? > > > > The problem as I see it is that the original fclone isn't what we sent > > out on the wire and that is confusing things. What we sent was a SYN > > with data, but what we have now is just a data frame that hasn't been > > put out on the wire yet. > > Not sure I understand why it's the key distinction here. Is it > re-transmitting part of the frame or having different flags? > Is re-transmit of half of a GSO skb also considered not the same? The difference in my mind is the flags. So specifically the clone of the syn_data frame in the case of the TCP fast open isn't actually a clone of the sent frame. Instead we end up modifying the flags so that it becomes the first data frame. We already have the SYN sitting in the retransmit queue before we send the SYN w/ data frame. In addition the SYN packet in the retransmit queue has a reference count of 1 so it is not encumbered by the fclone reference count check so it could theoretically be retransmitted immediately, it is just the data packet that is being held. If we replay a GSO frame we will get the same frames all over again. In the case of a TCP fast open syn_data packet that isn't the case. The first time out it is one packet, the second time it is two. > To me the distinction is that the receiver has implicitly asked > us for the re-transmission. If it was requested by SACK we should > ignore "in_queue" for the first transmission as well, even if the > skb state is identical. In my mind the distinction is the fact that what we have in the retransmit queue is 2 frames, a SYN and a data. Whereas what we have put on the wire is SYN w/ data. > > I wonder if we couldn't get away with doing something like adding a > > fourth option of SKB_FCLONE_MODIFIED that we could apply to fastopen > > skbs? That would keep the skb_still_in_host queue from triggering as > > we would be changing the state from SKB_FCLONE_ORIG to > > SKB_FCLONE_MODIFIED for the skb we store in the retransmit queue. In > > addition if we have to clone it again and the fclone reference count > > is 1 we could reset it back to SKB_FCLONE_ORIG. > > The unused value of fclone was tempting me as well :) > > AFAICT we have at least these options: > > 1 - don't use a fclone skb [v1] > > 2 - mark the fclone as "special" at Tx to escape the "in queue" check This is what I had in mind. Basically just add a function to transition from SKB_FCLONE_ORIG to SKB_FCLONE_MODIFIED/DIRTY when the clone no longer represents what is actually on the wire. The added benefit is that we can potentially restore it to SKB_FCLONE_ORIG if we clone it again assuming the reference count fell back to 1. > 3 - indicate to retansmit that we're sure initial tx is out [v2] > > 4 - same as above but with a bool / flag instead of negative seg > > 5 - use the fclone bits but mark them at Rx when we see a rtx request > > 6 - check the skb state in retransmit to match the TFO case (IIUC > Yuchung's suggestion) > > #5 is my favorite but I didn't know how to extend it to fast > re-transmits so I just stuck to the suggestion from the ML :) > > WDYT? Eric, Yuchung?
On Thu, Mar 4, 2021 at 3:45 AM Alexander Duyck <alexander.duyck@gmail.com> wrote: > > On Wed, Mar 3, 2021 at 4:07 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > On Wed, 3 Mar 2021 13:35:53 -0800 Alexander Duyck wrote: > > > On Tue, Mar 2, 2021 at 1:37 PM Eric Dumazet <edumazet@google.com> wrote: > > > > On Tue, Mar 2, 2021 at 7:08 AM Jakub Kicinski <kuba@kernel.org> wrote: > > > > > When receiver does not accept TCP Fast Open it will only ack > > > > > the SYN, and not the data. We detect this and immediately queue > > > > > the data for (re)transmission in tcp_rcv_fastopen_synack(). > > > > > > > > > > In DC networks with very low RTT and without RFS the SYN-ACK > > > > > may arrive before NIC driver reported Tx completion on > > > > > the original SYN. In which case skb_still_in_host_queue() > > > > > returns true and sender will need to wait for the retransmission > > > > > timer to fire milliseconds later. > > > > > > > > > > Revert back to non-fast clone skbs, this way > > > > > skb_still_in_host_queue() won't prevent the recovery flow > > > > > from completing. > > > > > > > > > > Suggested-by: Eric Dumazet <edumazet@google.com> > > > > > Fixes: 355a901e6cf1 ("tcp: make connect() mem charging friendly") > > > > > > > > Hmmm, not sure if this Fixes: tag makes sense. > > > > > > > > Really, if we delay TX completions by say 10 ms, other parts of the > > > > stack will misbehave anyway. > > > > > > > > Also, backporting this patch up to linux-3.19 is going to be tricky. > > > > > > > > The real issue here is that skb_still_in_host_queue() can give a false positive. > > > > > > > > I have mixed feelings here, as you can read my answer :/ > > > > > > > > Maybe skb_still_in_host_queue() signal should not be used when a part > > > > of the SKB has been received/acknowledged by the remote peer > > > > (in this case the SYN part). > > > > > > > > Alternative is that drivers unable to TX complete their skbs in a > > > > reasonable time should call skb_orphan() > > > > to avoid skb_unclone() penalties (and this skb_still_in_host_queue() issue) > > > > > > > > If you really want to play and delay TX completions, maybe provide a > > > > way to disable skb_still_in_host_queue() globally, > > > > using a static key ? > > > > > > The problem as I see it is that the original fclone isn't what we sent > > > out on the wire and that is confusing things. What we sent was a SYN > > > with data, but what we have now is just a data frame that hasn't been > > > put out on the wire yet. > > > > Not sure I understand why it's the key distinction here. Is it > > re-transmitting part of the frame or having different flags? > > Is re-transmit of half of a GSO skb also considered not the same? > > The difference in my mind is the flags. So specifically the clone of > the syn_data frame in the case of the TCP fast open isn't actually a > clone of the sent frame. Instead we end up modifying the flags so that > it becomes the first data frame. We already have the SYN sitting in > the retransmit queue before we send the SYN w/ data frame. In addition > the SYN packet in the retransmit queue has a reference count of 1 so > it is not encumbered by the fclone reference count check so it could > theoretically be retransmitted immediately, it is just the data packet > that is being held. > > If we replay a GSO frame we will get the same frames all over again. > In the case of a TCP fast open syn_data packet that isn't the case. > The first time out it is one packet, the second time it is two. > > > To me the distinction is that the receiver has implicitly asked > > us for the re-transmission. If it was requested by SACK we should > > ignore "in_queue" for the first transmission as well, even if the > > skb state is identical. > > In my mind the distinction is the fact that what we have in the > retransmit queue is 2 frames, a SYN and a data. Whereas what we have > put on the wire is SYN w/ data. > > > > I wonder if we couldn't get away with doing something like adding a > > > fourth option of SKB_FCLONE_MODIFIED that we could apply to fastopen > > > skbs? That would keep the skb_still_in_host queue from triggering as > > > we would be changing the state from SKB_FCLONE_ORIG to > > > SKB_FCLONE_MODIFIED for the skb we store in the retransmit queue. In > > > addition if we have to clone it again and the fclone reference count > > > is 1 we could reset it back to SKB_FCLONE_ORIG. > > > > The unused value of fclone was tempting me as well :) > > > > AFAICT we have at least these options: > > > > 1 - don't use a fclone skb [v1] > > > > 2 - mark the fclone as "special" at Tx to escape the "in queue" check > > This is what I had in mind. Basically just add a function to > transition from SKB_FCLONE_ORIG to SKB_FCLONE_MODIFIED/DIRTY when the > clone no longer represents what is actually on the wire. > > The added benefit is that we can potentially restore it to > SKB_FCLONE_ORIG if we clone it again assuming the reference count fell > back to 1. > I think we are over thinking this really (especially if the fix needs a change in core networking or drivers) We can reuse TSQ logic to have a chance to recover when the clone is eventually freed. This will be more generic, not only for the SYN+data of FastOpen. Can you please test the following patch ? diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 6d0a33d1c0db6defaa005335c3e285f0f1ac686c..43317b40df54718edf9165ca2018da92ddfeab62 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -1136,7 +1136,7 @@ static inline bool skb_fclone_busy(const struct sock *sk, return skb->fclone == SKB_FCLONE_ORIG && refcount_read(&fclones->fclone_ref) > 1 && - fclones->skb2.sk == sk; + READ_ONCE(fclones->skb2.sk) == sk; } /** diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index fbf140a770d8e21b936369b79abbe9857537acd8..0996e12e6c75121030e5db2fc8f2bd6d07037164 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -2775,13 +2775,22 @@ bool tcp_schedule_loss_probe(struct sock *sk, bool advancing_rto) * a packet is still in a qdisc or driver queue. * In this case, there is very little point doing a retransmit ! */ -static bool skb_still_in_host_queue(const struct sock *sk, +static bool skb_still_in_host_queue(struct sock *sk, const struct sk_buff *skb) { if (unlikely(skb_fclone_busy(sk, skb))) { - NET_INC_STATS(sock_net(sk), - LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES); - return true; + set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags); + /* If TX completion is really slow, we could have received an ACK + * for a packet not yet freed by the driver. + * Setting TSQ_THROTTLED makes sure we will attempt to retransmit + * later when fclone is no longer busy. + */ + smp_mb__after_atomic(); + if (skb_fclone_busy(sk, skb)) { + NET_INC_STATS(sock_net(sk), + LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES); + return true; + } } return false; }
On Thu, 4 Mar 2021 13:51:15 +0100 Eric Dumazet wrote: > I think we are over thinking this really (especially if the fix needs > a change in core networking or drivers) > > We can reuse TSQ logic to have a chance to recover when the clone is > eventually freed. > This will be more generic, not only for the SYN+data of FastOpen. > > Can you please test the following patch ? #7 - Eric comes up with something much better :) But so far doesn't seem to quite do it, I'm looking but maybe you'll know right away (FWIW testing a v5.6 backport but I don't think TSQ changed?): On __tcp_retransmit_skb kretprobe: ==> Hit TFO case ret:-16 ca_state:0 skb:ffff888fdb4bac00! First hit: __tcp_retransmit_skb+1 tcp_rcv_state_process+2488 tcp_v6_do_rcv+405 tcp_v6_rcv+2984 ip6_protocol_deliver_rcu+180 ip6_input_finish+17 Successful hit: __tcp_retransmit_skb+1 tcp_retransmit_skb+18 tcp_retransmit_timer+716 tcp_write_timer_handler+136 tcp_write_timer+141 call_timer_fn+43 skb:ffff888fdb4bac00 --- delay:51642us bytes_acked:1
On Thu, Mar 4, 2021 at 8:06 PM Jakub Kicinski <kuba@kernel.org> wrote: > > On Thu, 4 Mar 2021 13:51:15 +0100 Eric Dumazet wrote: > > I think we are over thinking this really (especially if the fix needs > > a change in core networking or drivers) > > > > We can reuse TSQ logic to have a chance to recover when the clone is > > eventually freed. > > This will be more generic, not only for the SYN+data of FastOpen. > > > > Can you please test the following patch ? > > #7 - Eric comes up with something much better :) > > > But so far doesn't seem to quite do it, I'm looking but maybe you'll > know right away (FWIW testing a v5.6 backport but I don't think TSQ > changed?): > > On __tcp_retransmit_skb kretprobe: > > ==> Hit TFO case ret:-16 ca_state:0 skb:ffff888fdb4bac00! > > First hit: > __tcp_retransmit_skb+1 > tcp_rcv_state_process+2488 > tcp_v6_do_rcv+405 > tcp_v6_rcv+2984 > ip6_protocol_deliver_rcu+180 > ip6_input_finish+17 > > Successful hit: > __tcp_retransmit_skb+1 > tcp_retransmit_skb+18 > tcp_retransmit_timer+716 > tcp_write_timer_handler+136 > tcp_write_timer+141 > call_timer_fn+43 > > skb:ffff888fdb4bac00 --- delay:51642us bytes_acked:1 Humm maybe one of the conditions used in tcp_tsq_write() does not hold... if (tp->lost_out > tp->retrans_out && tp->snd_cwnd > tcp_packets_in_flight(tp)) { tcp_mstamp_refresh(tp); tcp_xmit_retransmit_queue(sk); } Maybe FastOpen case is 'special' and tp->lost_out is wrong.
On Thu, Mar 4, 2021 at 8:41 PM Eric Dumazet <edumazet@google.com> wrote: > > On Thu, Mar 4, 2021 at 8:06 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > On Thu, 4 Mar 2021 13:51:15 +0100 Eric Dumazet wrote: > > > I think we are over thinking this really (especially if the fix needs > > > a change in core networking or drivers) > > > > > > We can reuse TSQ logic to have a chance to recover when the clone is > > > eventually freed. > > > This will be more generic, not only for the SYN+data of FastOpen. > > > > > > Can you please test the following patch ? > > > > #7 - Eric comes up with something much better :) > > > > > > But so far doesn't seem to quite do it, I'm looking but maybe you'll > > know right away (FWIW testing a v5.6 backport but I don't think TSQ > > changed?): > > > > On __tcp_retransmit_skb kretprobe: > > > > ==> Hit TFO case ret:-16 ca_state:0 skb:ffff888fdb4bac00! > > > > First hit: > > __tcp_retransmit_skb+1 > > tcp_rcv_state_process+2488 > > tcp_v6_do_rcv+405 > > tcp_v6_rcv+2984 > > ip6_protocol_deliver_rcu+180 > > ip6_input_finish+17 > > > > Successful hit: > > __tcp_retransmit_skb+1 > > tcp_retransmit_skb+18 > > tcp_retransmit_timer+716 > > tcp_write_timer_handler+136 > > tcp_write_timer+141 > > call_timer_fn+43 > > > > skb:ffff888fdb4bac00 --- delay:51642us bytes_acked:1 > > > Humm maybe one of the conditions used in tcp_tsq_write() does not hold... > > if (tp->lost_out > tp->retrans_out && > tp->snd_cwnd > tcp_packets_in_flight(tp)) { > tcp_mstamp_refresh(tp); > tcp_xmit_retransmit_queue(sk); > } > > Maybe FastOpen case is 'special' and tp->lost_out is wrong. It would be nice if tun driver would have the ability to delay TX completions by N usecs, so that packetdrill tests could be used. It is probably not too hard to add such a feature.
On Thu, Mar 04, 2021 at 08:41:45PM +0100, Eric Dumazet wrote: > On Thu, Mar 4, 2021 at 8:06 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > On Thu, 4 Mar 2021 13:51:15 +0100 Eric Dumazet wrote: > > > I think we are over thinking this really (especially if the fix needs > > > a change in core networking or drivers) > > > > > > We can reuse TSQ logic to have a chance to recover when the clone is > > > eventually freed. > > > This will be more generic, not only for the SYN+data of FastOpen. > > > > > > Can you please test the following patch ? > > > > #7 - Eric comes up with something much better :) > > > > > > But so far doesn't seem to quite do it, I'm looking but maybe you'll > > know right away (FWIW testing a v5.6 backport but I don't think TSQ > > changed?): > > > > On __tcp_retransmit_skb kretprobe: > > > > ==> Hit TFO case ret:-16 ca_state:0 skb:ffff888fdb4bac00! > > > > First hit: > > __tcp_retransmit_skb+1 > > tcp_rcv_state_process+2488 > > tcp_v6_do_rcv+405 > > tcp_v6_rcv+2984 > > ip6_protocol_deliver_rcu+180 > > ip6_input_finish+17 > > > > Successful hit: > > __tcp_retransmit_skb+1 > > tcp_retransmit_skb+18 > > tcp_retransmit_timer+716 > > tcp_write_timer_handler+136 > > tcp_write_timer+141 > > call_timer_fn+43 > > > > skb:ffff888fdb4bac00 --- delay:51642us bytes_acked:1 > > > Humm maybe one of the conditions used in tcp_tsq_write() does not hold... > > if (tp->lost_out > tp->retrans_out && > tp->snd_cwnd > tcp_packets_in_flight(tp)) { > tcp_mstamp_refresh(tp); > tcp_xmit_retransmit_queue(sk); > } > > Maybe FastOpen case is 'special' and tp->lost_out is wrong. Something like this? (completely untested)
On Thu, Mar 4, 2021 at 10:08 PM Jonathan Lemon <jonathan.lemon@gmail.com> wrote: > > On Thu, Mar 04, 2021 at 08:41:45PM +0100, Eric Dumazet wrote: > > On Thu, Mar 4, 2021 at 8:06 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > > > On Thu, 4 Mar 2021 13:51:15 +0100 Eric Dumazet wrote: > > > > I think we are over thinking this really (especially if the fix needs > > > > a change in core networking or drivers) > > > > > > > > We can reuse TSQ logic to have a chance to recover when the clone is > > > > eventually freed. > > > > This will be more generic, not only for the SYN+data of FastOpen. > > > > > > > > Can you please test the following patch ? > > > > > > #7 - Eric comes up with something much better :) > > > > > > > > > But so far doesn't seem to quite do it, I'm looking but maybe you'll > > > know right away (FWIW testing a v5.6 backport but I don't think TSQ > > > changed?): > > > > > > On __tcp_retransmit_skb kretprobe: > > > > > > ==> Hit TFO case ret:-16 ca_state:0 skb:ffff888fdb4bac00! > > > > > > First hit: > > > __tcp_retransmit_skb+1 > > > tcp_rcv_state_process+2488 > > > tcp_v6_do_rcv+405 > > > tcp_v6_rcv+2984 > > > ip6_protocol_deliver_rcu+180 > > > ip6_input_finish+17 > > > > > > Successful hit: > > > __tcp_retransmit_skb+1 > > > tcp_retransmit_skb+18 > > > tcp_retransmit_timer+716 > > > tcp_write_timer_handler+136 > > > tcp_write_timer+141 > > > call_timer_fn+43 > > > > > > skb:ffff888fdb4bac00 --- delay:51642us bytes_acked:1 > > > > > > Humm maybe one of the conditions used in tcp_tsq_write() does not hold... > > > > if (tp->lost_out > tp->retrans_out && > > tp->snd_cwnd > tcp_packets_in_flight(tp)) { > > tcp_mstamp_refresh(tp); > > tcp_xmit_retransmit_queue(sk); > > } > > > > Maybe FastOpen case is 'special' and tp->lost_out is wrong. > > > Something like this? (completely untested) > -- > Jonathan > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index 69a545db80d2..92bc9b0f4955 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -5995,8 +5995,10 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack, > else > tp->fastopen_client_fail = TFO_DATA_NOT_ACKED; > skb_rbtree_walk_from(data) { > + tcp_mark_skb_lost(sk, data); > if (__tcp_retransmit_skb(sk, data, 1)) > break; > + tp->retrans_out += tcp_skb_pcount(data); > } > tcp_rearm_rto(sk); > NET_INC_STATS(sock_net(sk), > Yes, but the hard part is testing this ;) Once we properly mark the skb lost, we can call regular tcp_xmit_retransmit_queue() to not have to care of tp->retrans_out Not that we really need to make sure tcp_xmit_retransmit_queue() can be called anyway from TSQ handler if TX completion is delayed.
On Thu, Mar 4, 2021 at 10:20 PM Eric Dumazet <edumazet@google.com> wrote: > > On Thu, Mar 4, 2021 at 10:08 PM Jonathan Lemon <jonathan.lemon@gmail.com> wrote: > > > > On Thu, Mar 04, 2021 at 08:41:45PM +0100, Eric Dumazet wrote: > > > On Thu, Mar 4, 2021 at 8:06 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > > > > > On Thu, 4 Mar 2021 13:51:15 +0100 Eric Dumazet wrote: > > > > > I think we are over thinking this really (especially if the fix needs > > > > > a change in core networking or drivers) > > > > > > > > > > We can reuse TSQ logic to have a chance to recover when the clone is > > > > > eventually freed. > > > > > This will be more generic, not only for the SYN+data of FastOpen. > > > > > > > > > > Can you please test the following patch ? > > > > > > > > #7 - Eric comes up with something much better :) > > > > > > > > > > > > But so far doesn't seem to quite do it, I'm looking but maybe you'll > > > > know right away (FWIW testing a v5.6 backport but I don't think TSQ > > > > changed?): > > > > > > > > On __tcp_retransmit_skb kretprobe: > > > > > > > > ==> Hit TFO case ret:-16 ca_state:0 skb:ffff888fdb4bac00! > > > > > > > > First hit: > > > > __tcp_retransmit_skb+1 > > > > tcp_rcv_state_process+2488 > > > > tcp_v6_do_rcv+405 > > > > tcp_v6_rcv+2984 > > > > ip6_protocol_deliver_rcu+180 > > > > ip6_input_finish+17 > > > > > > > > Successful hit: > > > > __tcp_retransmit_skb+1 > > > > tcp_retransmit_skb+18 > > > > tcp_retransmit_timer+716 > > > > tcp_write_timer_handler+136 > > > > tcp_write_timer+141 > > > > call_timer_fn+43 > > > > > > > > skb:ffff888fdb4bac00 --- delay:51642us bytes_acked:1 > > > > > > > > > Humm maybe one of the conditions used in tcp_tsq_write() does not hold... > > > > > > if (tp->lost_out > tp->retrans_out && > > > tp->snd_cwnd > tcp_packets_in_flight(tp)) { > > > tcp_mstamp_refresh(tp); > > > tcp_xmit_retransmit_queue(sk); > > > } > > > > > > Maybe FastOpen case is 'special' and tp->lost_out is wrong. > > > > > > Something like this? (completely untested) > > -- > > Jonathan > > > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > > index 69a545db80d2..92bc9b0f4955 100644 > > --- a/net/ipv4/tcp_input.c > > +++ b/net/ipv4/tcp_input.c > > @@ -5995,8 +5995,10 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack, > > else > > tp->fastopen_client_fail = TFO_DATA_NOT_ACKED; > > skb_rbtree_walk_from(data) { > > + tcp_mark_skb_lost(sk, data); > > if (__tcp_retransmit_skb(sk, data, 1)) > > break; > > + tp->retrans_out += tcp_skb_pcount(data); > > } > > tcp_rearm_rto(sk); > > NET_INC_STATS(sock_net(sk), > > > > Yes, but the hard part is testing this ;) > > Once we properly mark the skb lost, we can call regular > tcp_xmit_retransmit_queue() to not have to care of tp->retrans_out > > Not that we really need to make sure tcp_xmit_retransmit_queue() can > be called anyway from TSQ handler if TX completion is delayed. I was testing this (note how I also removed the tcp_rearm_rto(sk) call) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 6f450e577975c7be9537338c8a4c0673d7d36c4c..9ef92ca55e530f76ad793d7342442c4ec06165f7 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -6471,11 +6471,10 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack, tcp_fastopen_cache_set(sk, mss, cookie, syn_drop, try_exp); if (data) { /* Retransmit unacked data in SYN */ - skb_rbtree_walk_from(data) { - if (__tcp_retransmit_skb(sk, data, 1)) - break; - } - tcp_rearm_rto(sk); + skb_rbtree_walk_from(data) + tcp_mark_skb_lost(sk, data); + + tcp_xmit_retransmit_queue(sk); NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPFASTOPENACTIVEFAIL); return true;
On Thu, 4 Mar 2021 22:26:58 +0100 Eric Dumazet wrote: > It would be nice if tun driver would have the ability to delay TX > completions by N usecs, > so that packetdrill tests could be used. > > It is probably not too hard to add such a feature. Add an ioctl to turn off the skb_orphan, queue the skbs in tun_do_read() to free them from a timer? > I was testing this (note how I also removed the tcp_rearm_rto(sk) call) > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index 6f450e577975c7be9537338c8a4c0673d7d36c4c..9ef92ca55e530f76ad793d7342442c4ec06165f7 > 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -6471,11 +6471,10 @@ static bool tcp_rcv_fastopen_synack(struct > sock *sk, struct sk_buff *synack, > tcp_fastopen_cache_set(sk, mss, cookie, syn_drop, try_exp); > > if (data) { /* Retransmit unacked data in SYN */ > - skb_rbtree_walk_from(data) { > - if (__tcp_retransmit_skb(sk, data, 1)) > - break; > - } > - tcp_rearm_rto(sk); > + skb_rbtree_walk_from(data) > + tcp_mark_skb_lost(sk, data); > + > + tcp_xmit_retransmit_queue(sk); > NET_INC_STATS(sock_net(sk), > LINUX_MIB_TCPFASTOPENACTIVEFAIL); > return true; AFAICT this works great now: ==> TFO case ret:-16 (0) ca_state:0 skb:ffff8881d3513800! FREED swapper/5 -- skb 0xffff8881d3513800 freed after: 1us ----- First: __tcp_retransmit_skb+1 tcp_retransmit_skb+18 tcp_xmit_retransmit_queue.part.70+339 tcp_rcv_state_process+2491 tcp_v6_do_rcv+405 tcp_v6_rcv+2984 Second: __tcp_retransmit_skb+1 tcp_retransmit_skb+18 tcp_xmit_retransmit_queue.part.70+339 tcp_tsq_write.part.71+146 tcp_tsq_handler+53 tcp_tasklet_func+181 sk:0xffff8885adc16f00 skb:ffff8881d3513800 --- 61us acked:1 The other case where we hit RTO after __tcp_retransmit_skb() fails is: ==> non-TFO case ret:-11 (0) ca_state:3 skb:ffff8883d71dd400! ----- First: __tcp_retransmit_skb+1 tcp_retransmit_skb+18 tcp_xmit_retransmit_queue.part.70+339 tcp_ack+2270 tcp_rcv_established+303 tcp_v6_do_rcv+190 Second: __tcp_retransmit_skb+1 tcp_retransmit_skb+18 tcp_retransmit_timer+716 tcp_write_timer_handler+136 tcp_write_timer+141 call_timer_fn+43 sk:0xffff88801772d340 skb:ffff8883d71dd400 --- 51738us acked:47324 Which I believe is this: if (refcount_read(&sk->sk_wmem_alloc) > min_t(u32, sk->sk_wmem_queued + (sk->sk_wmem_queued >> 2), sk->sk_sndbuf)) return -EAGAIN; Because __tcp_retransmit_skb() seems to bail before inet6_sk_rebuild_header gets called. Should we arm TSQ here as well?
On Fri, Mar 5, 2021 at 12:27 AM Jakub Kicinski <kuba@kernel.org> wrote: > > On Thu, 4 Mar 2021 22:26:58 +0100 Eric Dumazet wrote: > > It would be nice if tun driver would have the ability to delay TX > > completions by N usecs, > > so that packetdrill tests could be used. > > > > It is probably not too hard to add such a feature. > > Add an ioctl to turn off the skb_orphan, queue the skbs in tun_do_read() > to free them from a timer? Yes, I cooked a prototype like that a few hours ago before my night shift to launch our test suite. I yet have to add a sane limit to the number of delayed skbs so that syzbot does not oom its hosts ;) > > > I was testing this (note how I also removed the tcp_rearm_rto(sk) call) > > > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > > index 6f450e577975c7be9537338c8a4c0673d7d36c4c..9ef92ca55e530f76ad793d7342442c4ec06165f7 > > 100644 > > --- a/net/ipv4/tcp_input.c > > +++ b/net/ipv4/tcp_input.c > > @@ -6471,11 +6471,10 @@ static bool tcp_rcv_fastopen_synack(struct > > sock *sk, struct sk_buff *synack, > > tcp_fastopen_cache_set(sk, mss, cookie, syn_drop, try_exp); > > > > if (data) { /* Retransmit unacked data in SYN */ > > - skb_rbtree_walk_from(data) { > > - if (__tcp_retransmit_skb(sk, data, 1)) > > - break; > > - } > > - tcp_rearm_rto(sk); > > + skb_rbtree_walk_from(data) > > + tcp_mark_skb_lost(sk, data); > > + > > + tcp_xmit_retransmit_queue(sk); > > NET_INC_STATS(sock_net(sk), > > LINUX_MIB_TCPFASTOPENACTIVEFAIL); > > return true; > > AFAICT this works great now: Yes but some packetdrill tests fail, I have to check them (sponge link for Googlers), but at first glance we have more investigations. Ran 2003 tests: 1990 passed, 13 failed Sponge: http://sponge2/b0d4c652-3173-4837-86ef-5e8cc59730a1 Failures in //net/tcp/fastopen/client:ipv4-mapped-ipv6:cookie-disabled-prod-conn //net/tcp/fastopen/client:ipv4-mapped-ipv6:syn-data-icmp-unreach-frag-needed //net/tcp/fastopen/client:ipv4-mapped-ipv6:syn-data-icmp-unreach-frag-needed-with-seq //net/tcp/fastopen/client:ipv4-mapped-ipv6:syn-data-only-syn-acked //net/tcp/fastopen/client:ipv4-mapped-ipv6:syn-data-partial-or-over-ack //net/tcp/fastopen/client:ipv4:cookie-disabled-prod-conn //net/tcp/fastopen/client:ipv4:syn-data-icmp-unreach-frag-needed //net/tcp/fastopen/client:ipv4:syn-data-icmp-unreach-frag-needed-with-seq //net/tcp/fastopen/client:ipv4:syn-data-only-syn-acked //net/tcp/fastopen/client:ipv4:syn-data-partial-or-over-ack //net/tcp/fastopen/client:ipv6:cookie-disabled-prod-conn //net/tcp/fastopen/client:ipv6:syn-data-only-syn-acked //net/tcp/fastopen/client:ipv6:syn-data-partial-or-over-ack Showing test.log for //net/tcp/fastopen/client:ipv4:syn-data-icmp-unreach-frag-needed syn-data-icmp-unreach-frag-needed.pkt:33: error handling packet: live packet field ipv4_total_length: expected: 800 (0x320) vs actual: 1040 (0x410) script packet: 0.090624 . 1:761(760) ack 1 actual packet: 0.090619 P. 1:1001(1000) ack 1 win 256 Yes, it looks like Alex patch no longer works commit c31b70c9968fe9c4194d1b5d06d07596a3b680de Author: Alexander Duyck <alexanderduyck@fb.com> Date: Sat Dec 12 12:31:24 2020 -0800 tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit > > ==> TFO case ret:-16 (0) ca_state:0 skb:ffff8881d3513800! > FREED swapper/5 -- skb 0xffff8881d3513800 freed after: 1us > ----- > First: > __tcp_retransmit_skb+1 > tcp_retransmit_skb+18 > tcp_xmit_retransmit_queue.part.70+339 > tcp_rcv_state_process+2491 > tcp_v6_do_rcv+405 > tcp_v6_rcv+2984 > > Second: > __tcp_retransmit_skb+1 > tcp_retransmit_skb+18 > tcp_xmit_retransmit_queue.part.70+339 > tcp_tsq_write.part.71+146 > tcp_tsq_handler+53 > tcp_tasklet_func+181 > > sk:0xffff8885adc16f00 skb:ffff8881d3513800 --- 61us acked:1 > > > The other case where we hit RTO after __tcp_retransmit_skb() fails is: > > ==> non-TFO case ret:-11 (0) ca_state:3 skb:ffff8883d71dd400! > ----- > First: > __tcp_retransmit_skb+1 > tcp_retransmit_skb+18 > tcp_xmit_retransmit_queue.part.70+339 > tcp_ack+2270 > tcp_rcv_established+303 > tcp_v6_do_rcv+190 > > Second: > __tcp_retransmit_skb+1 > tcp_retransmit_skb+18 > tcp_retransmit_timer+716 > tcp_write_timer_handler+136 > tcp_write_timer+141 > call_timer_fn+43 > > sk:0xffff88801772d340 skb:ffff8883d71dd400 --- 51738us acked:47324 > > Which I believe is this: > > if (refcount_read(&sk->sk_wmem_alloc) > > min_t(u32, sk->sk_wmem_queued + (sk->sk_wmem_queued >> 2), > sk->sk_sndbuf)) > return -EAGAIN; > > Because __tcp_retransmit_skb() seems to bail before > inet6_sk_rebuild_header gets called. Should we arm TSQ here as well? Yes, or make the limit slightly bigger since we now have the fclone more precise thing for standard drivers [1] if ((refcount_read(&sk->sk_wmem_alloc) >> 1) > min_t(u32, sk->sk_wmem_queued, sk->sk_sndbuf)) return -EAGAIN; [1] It is probably wise to keep this code because some drivers do call skb_orphan() in their ndo_start_xmit()
On Fri, Mar 5, 2021 at 6:17 AM Eric Dumazet <edumazet@google.com> wrote: > > On Fri, Mar 5, 2021 at 12:27 AM Jakub Kicinski <kuba@kernel.org> wrote: > > > > On Thu, 4 Mar 2021 22:26:58 +0100 Eric Dumazet wrote: > > > It would be nice if tun driver would have the ability to delay TX > > > completions by N usecs, > > > so that packetdrill tests could be used. > > > > > > It is probably not too hard to add such a feature. > > > > Add an ioctl to turn off the skb_orphan, queue the skbs in tun_do_read() > > to free them from a timer? > > Yes, I cooked a prototype like that a few hours ago before my night > shift to launch our test suite. > > I yet have to add a sane limit to the number of delayed skbs so that > syzbot does not oom its hosts ;) > > > > > > I was testing this (note how I also removed the tcp_rearm_rto(sk) call) > > > > > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > > > index 6f450e577975c7be9537338c8a4c0673d7d36c4c..9ef92ca55e530f76ad793d7342442c4ec06165f7 > > > 100644 > > > --- a/net/ipv4/tcp_input.c > > > +++ b/net/ipv4/tcp_input.c > > > @@ -6471,11 +6471,10 @@ static bool tcp_rcv_fastopen_synack(struct > > > sock *sk, struct sk_buff *synack, > > > tcp_fastopen_cache_set(sk, mss, cookie, syn_drop, try_exp); > > > > > > if (data) { /* Retransmit unacked data in SYN */ > > > - skb_rbtree_walk_from(data) { > > > - if (__tcp_retransmit_skb(sk, data, 1)) > > > - break; > > > - } > > > - tcp_rearm_rto(sk); > > > + skb_rbtree_walk_from(data) > > > + tcp_mark_skb_lost(sk, data); > > > + > > > + tcp_xmit_retransmit_queue(sk); > > > NET_INC_STATS(sock_net(sk), > > > LINUX_MIB_TCPFASTOPENACTIVEFAIL); > > > return true; > > > > AFAICT this works great now: > > Yes but some packetdrill tests fail, I have to check them (sponge link > for Googlers), but at first glance we have more investigations. > > Ran 2003 tests: 1990 passed, 13 failed > Sponge: http://sponge2/b0d4c652-3173-4837-86ef-5e8cc59730a1 > > Failures in > //net/tcp/fastopen/client:ipv4-mapped-ipv6:cookie-disabled-prod-conn > //net/tcp/fastopen/client:ipv4-mapped-ipv6:syn-data-icmp-unreach-frag-needed > //net/tcp/fastopen/client:ipv4-mapped-ipv6:syn-data-icmp-unreach-frag-needed-with-seq > //net/tcp/fastopen/client:ipv4-mapped-ipv6:syn-data-only-syn-acked > //net/tcp/fastopen/client:ipv4-mapped-ipv6:syn-data-partial-or-over-ack > //net/tcp/fastopen/client:ipv4:cookie-disabled-prod-conn > //net/tcp/fastopen/client:ipv4:syn-data-icmp-unreach-frag-needed > //net/tcp/fastopen/client:ipv4:syn-data-icmp-unreach-frag-needed-with-seq > //net/tcp/fastopen/client:ipv4:syn-data-only-syn-acked > //net/tcp/fastopen/client:ipv4:syn-data-partial-or-over-ack > //net/tcp/fastopen/client:ipv6:cookie-disabled-prod-conn > //net/tcp/fastopen/client:ipv6:syn-data-only-syn-acked > //net/tcp/fastopen/client:ipv6:syn-data-partial-or-over-ack > > Showing test.log for > //net/tcp/fastopen/client:ipv4:syn-data-icmp-unreach-frag-needed > > syn-data-icmp-unreach-frag-needed.pkt:33: error handling packet: live > packet field ipv4_total_length: expected: 800 (0x320) vs actual: 1040 > (0x410) > script packet: 0.090624 . 1:761(760) ack 1 > actual packet: 0.090619 P. 1:1001(1000) ack 1 win 256 > Actually this might be a good thing : We now resend the whole data in a TSO packet, now we call the standard rtx queue mechanism, instead of sending only one MSS I will look at other test failures today (Friday) before submitting a patch series officially. > > Yes, it looks like Alex patch no longer works > > commit c31b70c9968fe9c4194d1b5d06d07596a3b680de > Author: Alexander Duyck <alexanderduyck@fb.com> > Date: Sat Dec 12 12:31:24 2020 -0800 > > tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit > > > > > > > ==> TFO case ret:-16 (0) ca_state:0 skb:ffff8881d3513800! > > FREED swapper/5 -- skb 0xffff8881d3513800 freed after: 1us > > ----- > > First: > > __tcp_retransmit_skb+1 > > tcp_retransmit_skb+18 > > tcp_xmit_retransmit_queue.part.70+339 > > tcp_rcv_state_process+2491 > > tcp_v6_do_rcv+405 > > tcp_v6_rcv+2984 > > > > Second: > > __tcp_retransmit_skb+1 > > tcp_retransmit_skb+18 > > tcp_xmit_retransmit_queue.part.70+339 > > tcp_tsq_write.part.71+146 > > tcp_tsq_handler+53 > > tcp_tasklet_func+181 > > > > sk:0xffff8885adc16f00 skb:ffff8881d3513800 --- 61us acked:1 > > > > > > The other case where we hit RTO after __tcp_retransmit_skb() fails is: > > > > ==> non-TFO case ret:-11 (0) ca_state:3 skb:ffff8883d71dd400! > > ----- > > First: > > __tcp_retransmit_skb+1 > > tcp_retransmit_skb+18 > > tcp_xmit_retransmit_queue.part.70+339 > > tcp_ack+2270 > > tcp_rcv_established+303 > > tcp_v6_do_rcv+190 > > > > Second: > > __tcp_retransmit_skb+1 > > tcp_retransmit_skb+18 > > tcp_retransmit_timer+716 > > tcp_write_timer_handler+136 > > tcp_write_timer+141 > > call_timer_fn+43 > > > > sk:0xffff88801772d340 skb:ffff8883d71dd400 --- 51738us acked:47324 > > > > Which I believe is this: > > > > if (refcount_read(&sk->sk_wmem_alloc) > > > min_t(u32, sk->sk_wmem_queued + (sk->sk_wmem_queued >> 2), > > sk->sk_sndbuf)) > > return -EAGAIN; > > > > Because __tcp_retransmit_skb() seems to bail before > > inet6_sk_rebuild_header gets called. Should we arm TSQ here as well? > > Yes, or make the limit slightly bigger since we now have the fclone > more precise thing for standard drivers [1] > > if ((refcount_read(&sk->sk_wmem_alloc) >> 1) > > min_t(u32, sk->sk_wmem_queued, sk->sk_sndbuf)) > return -EAGAIN; > > [1] It is probably wise to keep this code because some drivers do call > skb_orphan() in their ndo_start_xmit()
On Fri, Mar 5, 2021 at 6:33 AM Eric Dumazet <edumazet@google.com> wrote: > Actually this might be a good thing : We now resend the whole data in > a TSO packet, now > we call the standard rtx queue mechanism, instead of sending only one MSS > > I will look at other test failures today (Friday) before submitting a > patch series officially. > Well, there is also this part to sort out. WARN_ON(tp->retrans_out != 0); in tcp_fastretrans_alert()
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index fbf140a770d8..cd9461588539 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -3759,9 +3759,16 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn) /* limit to order-0 allocations */ space = min_t(size_t, space, SKB_MAX_HEAD(MAX_TCP_HEADER)); - syn_data = sk_stream_alloc_skb(sk, space, sk->sk_allocation, false); + syn_data = alloc_skb(MAX_TCP_HEADER + space, sk->sk_allocation); if (!syn_data) goto fallback; + if (!sk_wmem_schedule(sk, syn_data->truesize)) { + __kfree_skb(syn_data); + goto fallback; + } + skb_reserve(syn_data, MAX_TCP_HEADER); + INIT_LIST_HEAD(&syn_data->tcp_tsorted_anchor); + syn_data->ip_summed = CHECKSUM_PARTIAL; memcpy(syn_data->cb, syn->cb, sizeof(syn->cb)); if (space) {