Message ID | 20240712192405.505553-1-nnac123@linux.ibm.com (mailing list archive)
---|---
Series | bonding: Return TX congested if no active slave
Nick Child <nnac123@linux.ibm.com> wrote:

>Hello. Posting this as RFC because I understand that changing return
>codes can have unexpected results and I am not well versed enough to
>claim with certainty that this won't affect something elsewhere.
>
>We are seeing a rare TCP connection timeout after only ~7.5 seconds of
>inactivity. This is mainly due to the ibmvnic driver hogging the RTNL
>lock for too long (~2 seconds per ibmvnic device). We are working on
>getting the driver off the RTNL lock but figured the core of the issue
>should also be considered.
>
>Because most of this is new ground to me, I listed the background
>knowledge that is needed to identify the issue. Feel free to skip this
>part:
>
>1. During a zero window probe, the socket attempts to get an updated
>window from the peer 15 times (default of tcp_retries2). Looking at
>tcp_send_probe0(), the timer between probes is either doubled or 0.5
>seconds. The timer is doubled if the skb transmit returns
>NET_XMIT_SUCCESS or NET_XMIT_CN (see net_xmit_eval()). Note that the
>timer is set to a static 0.5 seconds if NET_XMIT_DROP is returned. This
>means the socket can ETIMEOUT after 7.5 seconds. The return code is
>typically the return code of __dev_queue_xmit().

I'm not sure that your description of the behavior of tcp_send_probe0() matches the code.

It looks to me like the timer doubles for "if (err <= 0)", meaning a negative value or NET_XMIT_SUCCESS. The 0.5 second timeout value (in TCP_RESOURCE_PROBE_INTERVAL) is set in the "else" after "if (err <= 0)", so NET_XMIT_DROP and NET_XMIT_CN both qualify and would presumably result in the 0.5 second timeout.

However, since tcp_write_wakeup() -> tcp_transmit_skb() -> __tcp_transmit_skb() will convert NET_XMIT_CN to 0 (which is NET_XMIT_SUCCESS) via net_xmit_eval(), I'm not sure that it's possible for err to equal NET_XMIT_CN here.

I'll note that the 0.5 second timeout logic had a relatively recent change in c1d5674f8313 ("tcp: less aggressive window probing on local congestion"). From reading the log, the intent seems to be to bound the maximum probe interval to 0.5 seconds in low-RTT environments.

>2. In __dev_queue_xmit(), the skb is enqueued to the qdisc if the
>enqueue function is defined. In this circumstance, the qdisc enqueue
>function return code propagates up the stack. On the other hand, if the
>qdisc enqueue function is NULL then the driver's xmit function is called
>directly via dev_hard_start_xmit(). In this circumstance, the driver's
>xmit return code propagates up the stack.
>
>3. The bonding device uses IFF_NO_QUEUE, this sets qdisc to
>noqueue_qdisc_ops. noqueue_qdisc_ops has a NULL enqueue
>function. Therefore, when the bonding device is carrier UP,
>bond_start_xmit is called directly. In this function, depending on
>bonding mode, a slave device is assigned to the skb and
>__dev_queue_xmit() is called again. This time the slave's qdisc enqueue
>function (which is almost always defined) is called.

Does your analysis or behavior change if the bond itself does have a qdisc? IFF_NO_QUEUE does not install one by default, but users are free to add one.
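As a sanity check on the return-code handling discussed above, here is a small userspace model of the plumbing. It is a simplified sketch: the NET_XMIT_* values and net_xmit_eval() are copied from include/linux/netdevice.h, the err <= 0 split mirrors tcp_send_probe0(), and the probe0_timer() helper is invented purely for illustration.

#include <stdio.h>

/* Return codes from include/linux/netdevice.h */
#define NET_XMIT_SUCCESS	0x00
#define NET_XMIT_DROP		0x01	/* skb dropped */
#define NET_XMIT_CN		0x02	/* congestion notification */

/* net_xmit_eval(): congestion is folded into "success" on the way up the stack */
#define net_xmit_eval(e)	((e) == NET_XMIT_CN ? 0 : (e))

/* Invented helper modelling tcp_send_probe0(): exponential backoff is taken
 * only for err <= 0; any positive err falls back to the fixed 0.5 second
 * TCP_RESOURCE_PROBE_INTERVAL.
 */
static const char *probe0_timer(int dev_queue_xmit_rc)
{
        /* what __tcp_transmit_skb() reports up after net_xmit_eval() */
        int err = net_xmit_eval(dev_queue_xmit_rc);

        return err <= 0 ? "doubled (backoff)" : "fixed 0.5s";
}

int main(void)
{
        printf("NET_XMIT_SUCCESS -> %s\n", probe0_timer(NET_XMIT_SUCCESS));
        printf("NET_XMIT_CN      -> %s\n", probe0_timer(NET_XMIT_CN));
        printf("NET_XMIT_DROP    -> %s\n", probe0_timer(NET_XMIT_DROP));
        return 0;
}

Both readings above converge on the same observable behavior: a CN from __dev_queue_xmit() (rc = 2) keeps the probe timer doubling, while a DROP (rc = 1) pins it to the 0.5 second interval.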
>4. When a device calls netif_carrier_off(), it schedules dev_deactivate
>which grabs the rtnl lock and sets the qdisc to noop_qdisc. The enqueue
>function of noop_qdisc is defined but simply returns NET_XMIT_CN.
>
>5. The miimon of the bonding device periodically checks for the carrier
>status of its slaves. If it detects that all of its slaves are down,
>then it sets currently_active_slave to NULL and calls
>netif_carrier_off() on itself.
>
>6. In the bonding device's xmit function, if it does not have any active
>slaves, it returns NET_XMIT_DROP.
>
>Given these observations, assume a bonding device's slaves all suddenly
>call netif_carrier_off(). Also assume that a TCP connection is in a zero
>window probe. There is a window for a behavioral issue here:
>
>1. If the bonding device does not notice that its slaves are down
>(maybe its miimon interval is too large or the miimon commit could not
>grab rtnl), then the slave's enqueue function is invoked. This will
>either return NET_XMIT_SUCCESS or NET_XMIT_CN. The probe timer is
>doubled.

As I mentioned above, I'm not sure this accurately describes the behavior in tcp_send_probe0().

	-J

>2. If the bonding device notices the slaves are down, it sets active
>slave to NULL and calls netif_carrier_off() on itself. dev_deactivate()
>is scheduled:
>
> a. If dev_deactivate() executes, the bonding enqueue function
>(which was NULL) is now noop_enqueue and returns NET_XMIT_CN. The probe
>timer is doubled.
>
> b. If dev_deactivate() cannot execute for some time (say because it
>is waiting on rtnl), then bond_start_xmit() is called. It sees that it
>does not have an active slave and returns NET_XMIT_DROP. The probe
>timer is set to 0.5 seconds.
>
>I believe that when the active slave is NULL, it is safe to say that
>the bonding device calls netif_carrier_off() on itself. But there is a
>time where active_slave == NULL and the qdisc.enqueue !=
>noop_enqueue. During this time the return code is different. Consider
>the following timeline:
>
>   slaves go down
>     |
>     | returns NET_XMIT_CN
>     |
>   bond calls carrier_off
>     |
>     | returns NET_XMIT_DROP
>     |
>   dev_deactivate on bond
>     |
>     | returns NET_XMIT_CN
>     |
>
>Because the bonding xmit function should really only be a route to a
>slave device, I propose that it returns NET_XMIT_CN in these scenarios
>instead of NET_XMIT_DROP. Note that the bond will still return
>NET_XMIT_DROP when it has NO slaves. Since miimon calls
>netif_carrier_off() when it doesn't have any active slaves, I believe
>this change will only affect the time between the bonding device's call
>to netif_carrier_off() and the execution of dev_deactivate().
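To make the quoted proposal concrete, the change could take roughly the shape below. This is an illustration only, not the RFC patch itself (which is not reproduced in this thread); it assumes, from memory, the bond_tx_drop() helper and the bond_has_slaves() macro in include/net/bonding.h, and the exact helper or signature in the real patch may differ.

/* Illustration only: report local congestion while slaves exist but none is
 * active (the window between the bond's netif_carrier_off() and the pending
 * dev_deactivate()), and keep NET_XMIT_DROP for a bond with no slaves at all.
 */
static inline netdev_tx_t bond_tx_drop(struct net_device *dev, struct sk_buff *skb)
{
        struct bonding *bond = netdev_priv(dev);

        dev_core_stats_tx_dropped_inc(dev);
        dev_kfree_skb_any(skb);

        /* Slaves are enslaved but none is usable right now: transient state */
        if (bond_has_slaves(bond))
                return NET_XMIT_CN;

        return NET_XMIT_DROP;
}

With something like this in place, the window described between "bond calls carrier_off" and "dev_deactivate on bond" would report NET_XMIT_CN, just like the states on either side of it.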
>
>I was able to see this issue with bpftrace:
> 10:28:27:283929 - dev_deactivate eth252
> 10:28:27:312805 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 0+1
> 10:28:27:760016 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 1+1
> 10:28:28:147410 - dev_deactivate eth251
> 10:28:28:348387 - dev_queue_xmit rc = 2
> 10:28:28:348394 - dev_queue_xmit rc = 2
> 10:28:28:670013 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 2+1
> 10:28:28:670066 - dev_queue_xmit rc = 2
> 10:28:28:670070 - dev_queue_xmit rc = 2
> 10:28:30:440025 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 3+1
> 10:28:30:440084 - dev_queue_xmit rc = 2
> 10:28:30:440088 - dev_queue_xmit rc = 2
> 10:28:33:505982 - netif_carrier_off bond1
>     netif_carrier_off+112
>     bond_set_carrier+296
>     bond_select_active_slave+296
>     bond_mii_monitor+1900
>     process_one_work+760
>     worker_thread+136
>     kthread+456
>     ret_from_kernel_thread+92
> 10:28:33:790050 - dev_queue_xmit rc = 1
> 10:28:33:870015 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 4+1
> 10:28:33:870061 - dev_queue_xmit rc = 1
> 10:28:34:300269 - dev_queue_xmit rc = 1
> 10:28:34:380025 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 5+1
> 10:28:34:380072 - dev_queue_xmit rc = 1
> 10:28:34:432446 - dev_queue_xmit rc = 1
> 10:28:34:810059 - dev_queue_xmit rc = 1
> 10:28:34:900014 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 6+1
> 10:28:34:900059 - dev_queue_xmit rc = 1
> 10:28:35:000050 - dev_queue_xmit rc = 1
> 10:28:35:330054 - dev_queue_xmit rc = 1
> 10:28:35:410020 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 7+1
> 10:28:35:410070 - dev_queue_xmit rc = 1
> 10:28:35:630865 - dev_queue_xmit rc = 1
> 10:28:35:850072 - dev_queue_xmit rc = 1
> 10:28:35:920025 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 8+1
> 10:28:35:920069 - dev_queue_xmit rc = 1
> 10:28:35:940348 - dev_queue_xmit rc = 1
> 10:28:36:140055 - dev_queue_xmit rc = 1
> 10:28:36:370050 - dev_queue_xmit rc = 1
> 10:28:36:430029 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 9+1
> 10:28:36:430089 - dev_queue_xmit rc = 1
> 10:28:36:460052 - dev_queue_xmit rc = 1
> 10:28:36:650049 - dev_queue_xmit rc = 1
> 10:28:36:880059 - dev_queue_xmit rc = 1
> 10:28:36:940024 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 10+1
> 10:28:36:940074 - dev_queue_xmit rc = 1
> 10:28:36:980044 - dev_queue_xmit rc = 1
> 10:28:37:160331 - dev_queue_xmit rc = 1
> 10:28:37:390060 - dev_queue_xmit rc = 1
> 10:28:37:450024 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 11+1
> 10:28:37:450082 - dev_queue_xmit rc = 1
> 10:28:37:480045 - dev_queue_xmit rc = 1
> 10:28:37:730281 - dev_queue_xmit rc = 1
> 10:28:37:900051 - dev_queue_xmit rc = 1
> 10:28:37:970019 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 12+1
> 10:28:37:970062 - dev_queue_xmit rc = 1
> 10:28:38:000045 - dev_queue_xmit rc = 1
> 10:28:38:240089 - dev_queue_xmit rc = 1
> 10:28:38:420053 - dev_queue_xmit rc = 1
> 10:28:38:490012 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 13+1
> 10:28:38:490076 - dev_queue_xmit rc = 1
> 10:28:38:520048 - dev_queue_xmit rc = 1
> 10:28:38:750069 - dev_queue_xmit rc = 1
> 10:28:39:000029 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 14+1
> 10:28:39:000093 - dev_queue_xmit rc = 1
> 10:28:39:030052 - dev_queue_xmit rc = 1
> 10:28:39:260046 - dev_queue_xmit rc = 1
> 10:28:39:450050 - dev_queue_xmit rc = 1
> 10:28:39:520044 - SK_ERR(110) port=8682 - snd_wnd 0
>     sk_error_report+12
>     tcp_write_err+64
>     tcp_write_timer_handler+564
>     tcp_write_timer+424
>     call_timer_fn+88
>     run_timer_softirq+1896
>     __do_softirq+348
>     do_softirq_own_stack+64
>     irq_exit+392
>     timer_interrupt+380
>     decrementer_common_virt+528
> 10:28:47:813297 - dev_deactivate bond1
>
>Again, I know the easier solution is to remove rtnl users to reduce the
>time before dev_deactivate (we are working on that as well).
>
>Nick Child (1):
>  bonding: Return TX congested if no active slave
>
> include/net/bonding.h | 7 +++++++
> 1 file changed, 7 insertions(+)
>
>--
>2.43.0

---
	-Jay Vosburgh, jv@jvosburgh.net
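One more reference point for the trace above: the dev_deactivate on bond1 at 10:28:47 is what finally installs noop_qdisc on the bond, whose enqueue (roughly as in net/sched/sch_generic.c) reports congestion again -- but by then the socket has already returned ETIMEDOUT (error 110).

/* Roughly net/sched/sch_generic.c: after dev_deactivate(), every enqueue on
 * the bond drops the skb but reports NET_XMIT_CN, so the probe timer would
 * have gone back to doubling.
 */
static int noop_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
                        struct sk_buff **to_free)
{
        __qdisc_drop(skb, to_free);
        return NET_XMIT_CN;
}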
Hi Jay, thanks for taking a look.

On 7/12/24 18:34, Jay Vosburgh wrote:
> Nick Child <nnac123@linux.ibm.com> wrote:
>
>> 1. During a zero window probe, the socket attempts to get an updated
>> window from the peer 15 times (default of tcp_retries2). Looking at
>> tcp_send_probe0(), the timer between probes is either doubled or 0.5
>> seconds. The timer is doubled if the skb transmit returns
>> NET_XMIT_SUCCESS or NET_XMIT_CN (see net_xmit_eval()). Note that the
>> timer is set to a static 0.5 seconds if NET_XMIT_DROP is returned. This
>> means the socket can ETIMEOUT after 7.5 seconds. The return code is
>> typically the return code of __dev_queue_xmit().
>
> I'm not sure that your description of the behavior of
> tcp_send_probe0() matches the code.
>
> It looks to me like the timer doubles for "if (err <= 0)",
> meaning a negative value or NET_XMIT_SUCCESS. The 0.5 second timeout
> value (in TCP_RESOURCE_PROBE_INTERVAL) is set in the "else" after
> "if (err <= 0)", so NET_XMIT_DROP and NET_XMIT_CN both qualify and
> would presumably result in the 0.5 second timeout.
>
> However, since tcp_write_wakeup() -> tcp_transmit_skb() ->
> __tcp_transmit_skb() will convert NET_XMIT_CN to 0 (which is
> NET_XMIT_SUCCESS) via net_xmit_eval(), I'm not sure that it's possible
> for err to equal NET_XMIT_CN here.

Apologies, I was oversimplifying in my explanation. I was referencing dev_queue_xmit() returning either CN or DROP. On its way up the stack, return code CN is mapped to 0 (in net_xmit_eval()). So the proper phrasing is: "when dev_queue_xmit() returns CN, the timer is doubled; when dev_queue_xmit() returns DROP, the timer is set to 0.5 seconds".

> I'll note that the 0.5 second timeout logic had a relatively
> recent change in c1d5674f8313 ("tcp: less aggressive window probing on
> local congestion"). From reading the log, the intent seems to be to
> bound the maximum probe interval to 0.5 seconds in low-RTT environments.
>

I don't have any complaints with the quick timeout. I believe it has valid applications, I just don't think this scenario (when the bond device is waiting for dev_deactivate) falls into that category.

>> 2. In __dev_queue_xmit(), the skb is enqueued to the qdisc if the
>> enqueue function is defined. In this circumstance, the qdisc enqueue
>> function return code propagates up the stack. On the other hand, if the
>> qdisc enqueue function is NULL then the driver's xmit function is called
>> directly via dev_hard_start_xmit(). In this circumstance, the driver's
>> xmit return code propagates up the stack.
>>
>> 3. The bonding device uses IFF_NO_QUEUE, this sets qdisc to
>> noqueue_qdisc_ops. noqueue_qdisc_ops has a NULL enqueue
>> function. Therefore, when the bonding device is carrier UP,
>> bond_start_xmit is called directly. In this function, depending on
>> bonding mode, a slave device is assigned to the skb and
>> __dev_queue_xmit() is called again. This time the slave's qdisc enqueue
>> function (which is almost always defined) is called.
>
> Does your analysis or behavior change if the bond itself does
> have a qdisc? IFF_NO_QUEUE does not install one by default, but users
> are free to add one.
>

Good question. I did not try. Though I don't think we would see this issue because there would not be a way for the return code of dev_queue_xmit() to propagate all the way up to tcp_write_wakeup().

>> 4. When a device calls netif_carrier_off(), it schedules dev_deactivate
>> which grabs the rtnl lock and sets the qdisc to noop_qdisc. The enqueue
>> function of noop_qdisc is defined but simply returns NET_XMIT_CN.
>>
>> 5. The miimon of the bonding device periodically checks for the carrier
>> status of its slaves. If it detects that all of its slaves are down,
>> then it sets currently_active_slave to NULL and calls
>> netif_carrier_off() on itself.
>>
>> 6. In the bonding device's xmit function, if it does not have any active
>> slaves, it returns NET_XMIT_DROP.
>>
>> Given these observations, assume a bonding device's slaves all suddenly
>> call netif_carrier_off(). Also assume that a TCP connection is in a zero
>> window probe. There is a window for a behavioral issue here:
>
>> 1. If the bonding device does not notice that its slaves are down
>> (maybe its miimon interval is too large or the miimon commit could not
>> grab rtnl), then the slave's enqueue function is invoked. This will
>> either return NET_XMIT_SUCCESS or NET_XMIT_CN. The probe timer is
>> doubled.
>
> As I mentioned above, I'm not sure this accurately describes the
> behavior in tcp_send_probe0().
>
> -J
>

The call stack I describe here looks like this:

  pfifo_fast_enqueue+12
  dev_qdisc_enqueue+76
  __dev_queue_xmit+1700
  bond_dev_queue_xmit+68
  bond_start_xmit+1092
  dev_hard_start_xmit+300
  __dev_queue_xmit+3124
  ip_finish_output2+1020
  ip_output+228
  ip_local_out+104
  __ip_queue_xmit+384
  __tcp_transmit_skb+1644

Considering that tcp_write_wakeup() calls tcp_transmit_skb() and __tcp_transmit_skb() translates NET_XMIT_CN to 0 through net_xmit_eval(), I believe my statement is correct: if the slave's enqueue function is invoked (in this case pfifo_fast) and NET_XMIT_SUCCESS or NET_XMIT_CN is returned, then the probe timer is doubled.

>> --
>> 2.43.0
>
> ---
> -Jay Vosburgh, jv@jvosburgh.net
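For completeness, the conversion Nick refers to sits near the end of __tcp_transmit_skb() in net/ipv4/tcp_output.c; simplified from memory, it looks roughly like this:

        /* __tcp_transmit_skb(), simplified: the return code of the xmit path
         * (ultimately __dev_queue_xmit()) comes back through queue_xmit(),
         * and NET_XMIT_CN is folded into 0 before the caller sees it.
         */
        err = icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl);

        if (unlikely(err > 0)) {
                tcp_enter_cwr(sk);
                err = net_xmit_eval(err);       /* NET_XMIT_CN -> 0 */
        }
        return err;

So by the time tcp_send_probe0() sees err, only NET_XMIT_DROP survives as a positive value, which is why the bond's choice between DROP and CN decides whether the probe interval keeps backing off or stays pinned at 0.5 seconds.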