Message ID | 20201120150208.6838-1-vincent.whitchurch@axis.com (mailing list archive) |
---|---|
State | Accepted |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net] net: stmmac: Use hrtimer for TX coalescing | expand |
Context | Check | Description |
---|---|---|
netdev/cover_letter | success | Link |
netdev/fixes_present | fail | Series targets non-next tree, but doesn't contain any Fixes tags |
netdev/patch_count | success | Link |
netdev/tree_selection | success | Clearly marked for net |
netdev/subject_prefix | success | Link |
netdev/source_inline | success | Was 0 now: 0 |
netdev/verify_signedoff | success | Link |
netdev/module_param | success | Was 0 now: 0 |
netdev/build_32bit | success | Errors and warnings before: 20 this patch: 20 |
netdev/kdoc | success | Errors and warnings before: 0 this patch: 0 |
netdev/verify_fixes | success | Link |
netdev/checkpatch | warning | WARNING: line length of 85 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns |
netdev/build_allmodconfig_warn | success | Errors and warnings before: 20 this patch: 20 |
netdev/header_inline | success | Link |
netdev/stable | success | Stable not CCed |
On Fri, 20 Nov 2020 16:02:08 +0100 Vincent Whitchurch wrote: > This driver uses a normal timer for TX coalescing, which means that the > with the default tx-usecs of 1000 microseconds the cleanups actually > happen 10 ms or more later with HZ=100. This leads to very low > througput with TCP when bridged to a slow link such as a 4G modem. Fix > this by using an hrtimer instead. > > On my ARM platform with HZ=100 and the default TX coalescing settings > (tx-frames 25 tx-usecs 1000), with "tc qdisc add dev eth0 root netem > delay 60ms 40ms rate 50Mbit" run on the server, netperf's TCP_STREAM > improves from ~5.5 Mbps to ~100 Mbps. > > Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Looks perfectly reasonable, but you marked it for net. Do you consider this to be a bug fix, and need it backported to stable? Otherwise I'd rather apply it to net-next.
On Tue, Nov 24, 2020 at 01:46:00AM +0100, Jakub Kicinski wrote: > On Fri, 20 Nov 2020 16:02:08 +0100 Vincent Whitchurch wrote: > > This driver uses a normal timer for TX coalescing, which means that the > > with the default tx-usecs of 1000 microseconds the cleanups actually > > happen 10 ms or more later with HZ=100. This leads to very low > > througput with TCP when bridged to a slow link such as a 4G modem. Fix > > this by using an hrtimer instead. > > > > On my ARM platform with HZ=100 and the default TX coalescing settings > > (tx-frames 25 tx-usecs 1000), with "tc qdisc add dev eth0 root netem > > delay 60ms 40ms rate 50Mbit" run on the server, netperf's TCP_STREAM > > improves from ~5.5 Mbps to ~100 Mbps. > > > > Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> > > Looks perfectly reasonable, but you marked it for net. Do you consider > this to be a bug fix, and need it backported to stable? Otherwise I'd > rather apply it to net-next. No, sorry, I think a backport to stable is unnecessary. It should be fine to apply it to net-next. Thanks.
On Tue, 24 Nov 2020 05:11:27 +0100 Vincent Whitchurch wrote: > On Tue, Nov 24, 2020 at 01:46:00AM +0100, Jakub Kicinski wrote: > > On Fri, 20 Nov 2020 16:02:08 +0100 Vincent Whitchurch wrote: > > > This driver uses a normal timer for TX coalescing, which means that the > > > with the default tx-usecs of 1000 microseconds the cleanups actually > > > happen 10 ms or more later with HZ=100. This leads to very low > > > througput with TCP when bridged to a slow link such as a 4G modem. Fix > > > this by using an hrtimer instead. > > > > > > On my ARM platform with HZ=100 and the default TX coalescing settings > > > (tx-frames 25 tx-usecs 1000), with "tc qdisc add dev eth0 root netem > > > delay 60ms 40ms rate 50Mbit" run on the server, netperf's TCP_STREAM > > > improves from ~5.5 Mbps to ~100 Mbps. > > > > > > Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> > > > > Looks perfectly reasonable, but you marked it for net. Do you consider > > this to be a bug fix, and need it backported to stable? Otherwise I'd > > rather apply it to net-next. > > No, sorry, I think a backport to stable is unnecessary. It should be > fine to apply it to net-next. Thanks. Applied to net-next, thank you!
On 20.11.20 16:02, Vincent Whitchurch wrote: > This driver uses a normal timer for TX coalescing, which means that the > with the default tx-usecs of 1000 microseconds the cleanups actually > happen 10 ms or more later with HZ=100. This leads to very low > througput with TCP when bridged to a slow link such as a 4G modem. Fix > this by using an hrtimer instead. > > On my ARM platform with HZ=100 and the default TX coalescing settings > (tx-frames 25 tx-usecs 1000), with "tc qdisc add dev eth0 root netem > delay 60ms 40ms rate 50Mbit" run on the server, netperf's TCP_STREAM > improves from ~5.5 Mbps to ~100 Mbps. > > Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Based on tests by OpenWrt users, it seems that this one is causing a significant performance regression caused by wasting lots of CPU cycles re-arming the hrtimer on every single packet. More info: https://github.com/openwrt/openwrt/issues/11676#issuecomment-1690492666 My suggestion for fixing this properly would be: - keep a separate timestamp for last tx packet - do not modify the timer if it's scheduled already - in the timer function, check the last tx timestamp and re-arm the timer if necessary. This should significantly reduce the number of wasted CPU cycles, even when accounting for the additional overhead of hrtimer vs regular timer. - Felix
On Wed, 2023-08-23 at 22:18 +0200, Felix Fietkau wrote: > Based on tests by OpenWrt users, it seems that this one is causing a > significant performance regression caused by wasting lots of CPU cycles > re-arming the hrtimer on every single packet. More info: > https://github.com/openwrt/openwrt/issues/11676#issuecomment-1690492666 It looks like there was an attempt to avoid the re-arming of the timer in ->xmit() a while ago (pre-dating the hrtimer usage) in commit 4ae0169fd1b3c792b66be58995b7e6b629919ecf ("net: stmmac: Do not keep rearming the coalesce timer in stmmac_xmit"), but that got reverted later due to regressions. The coalescing code has been reworked since then but the removal of the re-arming was never attempted again. > My suggestion for fixing this properly would be: > - keep a separate timestamp for last tx packet > - do not modify the timer if it's scheduled already > - in the timer function, check the last tx timestamp and re-arm the > timer if necessary. Would you mind explain the reasons for maintaining a timestamp and checking it in the expiry function? Is that to obtain the same effect as the driver's current behaviour of postponing the expiry of the timer for each packet? Is that really desired? According to the commit message in 4ae0169fd1b3c792b66be58995b7e6b629919ecf, "Once the timer is armed it should run after the time expires for the first packet sent and not the last one." Since the timer expiry function schedules napi, and the napi poll function stmmac_tx_clean() re-arms the timer if it sees that there are pending tx packets, shouldn't an implementation similar to hip04_eth.c (which doesn't save/check the last tx timestamp) be sufficient?
On 25.08.23 15:42, Vincent Whitchurch wrote: > On Wed, 2023-08-23 at 22:18 +0200, Felix Fietkau wrote: >> Based on tests by OpenWrt users, it seems that this one is causing a >> significant performance regression caused by wasting lots of CPU cycles >> re-arming the hrtimer on every single packet. More info: >> https://github.com/openwrt/openwrt/issues/11676#issuecomment-1690492666 > > It looks like there was an attempt to avoid the re-arming of the timer > in ->xmit() a while ago (pre-dating the hrtimer usage) in commit > 4ae0169fd1b3c792b66be58995b7e6b629919ecf ("net: stmmac: Do not keep > rearming the coalesce timer in stmmac_xmit"), but that got reverted > later due to regressions. The coalescing code has been reworked since > then but the removal of the re-arming was never attempted again. > >> My suggestion for fixing this properly would be: >> - keep a separate timestamp for last tx packet >> - do not modify the timer if it's scheduled already >> - in the timer function, check the last tx timestamp and re-arm the >> timer if necessary. > > Would you mind explain the reasons for maintaining a timestamp and > checking it in the expiry function? Is that to obtain the same effect > as the driver's current behaviour of postponing the expiry of the timer > for each packet? Exactly. I did something very similar in mac80211 in the past, when expensive mod_timer calls were showing up in my perf measurements. Functionally it should be the same, it just avoids excessive re-sorting in timer data structures. > Is that really desired? According to the commit > message in 4ae0169fd1b3c792b66be58995b7e6b629919ecf, "Once the timer is > armed it should run after the time expires for the first packet sent and > not the last one." > > Since the timer expiry function schedules napi, and the napi poll > function stmmac_tx_clean() re-arms the timer if it sees that there are > pending tx packets, shouldn't an implementation similar to hip04_eth.c > (which doesn't save/check the last tx timestamp) be sufficient? To be honest, I didn't look very closely at what the timer does and how coalescing works. I don't know if delaying the timer processing with every tx is the right choice, or if it should be armed only once. However, as you pointed out, the commit that dropped the re-arming was reverted because of regressions. My suggestions are intended to preserve the existing behavior as much as possible (in order to avoid regressions), while also achieving the benefit of significantly reducing CPU cycles wasted by re-arming the timer. - Felix
On Fri, 2023-08-25 at 19:38 +0200, Felix Fietkau wrote: > On 25.08.23 15:42, Vincent Whitchurch wrote: > > Since the timer expiry function schedules napi, and the napi poll > > function stmmac_tx_clean() re-arms the timer if it sees that there are > > pending tx packets, shouldn't an implementation similar to hip04_eth.c > > (which doesn't save/check the last tx timestamp) be sufficient? > > To be honest, I didn't look very closely at what the timer does and how > coalescing works. I don't know if delaying the timer processing with > every tx is the right choice, or if it should be armed only once. > However, as you pointed out, the commit that dropped the re-arming was > reverted because of regressions. > > My suggestions are intended to preserve the existing behavior as much as > possible (in order to avoid regressions), while also achieving the > benefit of significantly reducing CPU cycles wasted by re-arming the timer. I looked at it some more and the continuous postponing behaviour strikes me as quite odd. For example, if you set tx-frames coalescing to 0 then cleanups could happen much later than the specified tx-usecs period, in the absence of RX traffic. Also, if we'd have to have a shared timestamp between the callers of stmmac_tx_timer_arm() and the hrtimer to preserve this continuous postponing behaviour, then we'd need to introduce some locking between the timer expiry and those functions, to avoid race conditions. So currently I am experimenting with just the following patch. The second hunk is similar to hip04_eth.c (the comment is mine). AFAICS hip04_eth.c doesn't have code equivalent to the first hunk of the patch and instead unconditionally restarts the timer from its napi poll if it sees that it's needed. I can't reproduce the mod_timer vs hrtimer performance problems on my hardware, but the patch below reduces the number of (re-)starts of the stmmac_tx_timer by around 85% in an iperf3 test over a gigabit link (just the second hunk reduces it by about 30%). Any test results with this patch on the hardware with the performance problems would be appreciated. diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c index 4727f7be4f86..4b6e5061b5a6 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c @@ -2703,9 +2703,7 @@ static int stmmac_tx_clean(struct stmmac_priv *priv, int budget, u32 queue) /* We still have pending packets, let's call for a new scheduling */ if (tx_q->dirty_tx != tx_q->cur_tx) - hrtimer_start(&tx_q->txtimer, - STMMAC_COAL_TIMER(priv->tx_coal_timer[queue]), - HRTIMER_MODE_REL); + stmmac_tx_timer_arm(priv, queue); __netif_tx_unlock_bh(netdev_get_tx_queue(priv->dev, queue)); @@ -2987,6 +2985,20 @@ static void stmmac_tx_timer_arm(struct stmmac_priv *priv, u32 queue) { struct stmmac_tx_queue *tx_q = &priv->dma_conf.tx_queue[queue]; + /* + * Note that the hrtimer could expire immediately after we check this, + * and the hrtimer and the callers of this function do not share a + * lock. + * + * This should however be safe since the only thing the hrtimer does is + * schedule napi (or ask for it run again if it's already running), and + * stmmac_tx_clean(), called from the napi poll function, also calls + * stmmac_tx_timer_arm() at the end if it sees that there are any TX + * packets which have not yet been cleaned. + */ + if (hrtimer_is_queued(&tx_q->txtimer)) + return; + hrtimer_start(&tx_q->txtimer, STMMAC_COAL_TIMER(priv->tx_coal_timer[queue]), HRTIMER_MODE_REL);
On Wed, 30 Aug 2023 at 14:55:37 +0000, Vincent Whitchurch wrote: > Any test results with this patch on the hardware with the performance > problems would be appreciated. TL/DR: it's definitely better than without the patch, but still worse than fully reverting hrtimer [1]. OpenWrt on Netgear R7800, iperf3 test in both directions (LAN->WAN and WAN->LAN), 3 runs for the duration of 1 minute each. OpenWrt options packet_steering (RPS + XPS) and flow_offloading (flowtables) enabled, irqbalance disabled. Numbers in Mbit/s, I had to fit the table into 72 characters, so: U is the original kernel 6.1.46 from OpenWrt, R is reverted hrtimer (patch [1]), P is patched with the new patch from the previous email, ^ is upload (LAN->WAN), v is download (WAN->LAN). | ^ | v | ^ | v | ^ | v | avg ^ | avg v | std ^ | std v - | --- | --- | --- | --- | --- | --- | ----- | ----- | ----- | ----- U | 742 | 709 | 740 | 715 | 556 | 750 | 679 | 725 | 107 | 22 R | 931 | 938 | 935 | 939 | 935 | 939 | 934 | 939 | 2 | 1 P | 845 | 939 | 934 | 939 | 845 | 909 | 875 | 929 | 51 | 17 Full revert allows to get really close to the theoretical maximum goodput (~949 Mbit/s) with minimal deviation. The new patch, however, gives less stable numbers, while sometimes hits the maximum as well. More numbers for the [U]npatched and [R]everted kernels are in the OpenWrt thread [2]. [1]: https://github.com/openwrt/openwrt/files/12422351/revert-stmmac.patch.txt [2]: https://github.com/openwrt/openwrt/issues/11676 > > diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c > index 4727f7be4f86..4b6e5061b5a6 100644 > --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c > +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c > @@ -2703,9 +2703,7 @@ static int stmmac_tx_clean(struct stmmac_priv *priv, int budget, u32 queue) > > /* We still have pending packets, let's call for a new scheduling */ > if (tx_q->dirty_tx != tx_q->cur_tx) > - hrtimer_start(&tx_q->txtimer, > - STMMAC_COAL_TIMER(priv->tx_coal_timer[queue]), > - HRTIMER_MODE_REL); > + stmmac_tx_timer_arm(priv, queue); > > __netif_tx_unlock_bh(netdev_get_tx_queue(priv->dev, queue)); > > @@ -2987,6 +2985,20 @@ static void stmmac_tx_timer_arm(struct stmmac_priv *priv, u32 queue) > { > struct stmmac_tx_queue *tx_q = &priv->dma_conf.tx_queue[queue]; > > + /* > + * Note that the hrtimer could expire immediately after we check this, > + * and the hrtimer and the callers of this function do not share a > + * lock. > + * > + * This should however be safe since the only thing the hrtimer does is > + * schedule napi (or ask for it run again if it's already running), and > + * stmmac_tx_clean(), called from the napi poll function, also calls > + * stmmac_tx_timer_arm() at the end if it sees that there are any TX > + * packets which have not yet been cleaned. > + */ > + if (hrtimer_is_queued(&tx_q->txtimer)) > + return; > + > hrtimer_start(&tx_q->txtimer, > STMMAC_COAL_TIMER(priv->tx_coal_timer[queue]), > HRTIMER_MODE_REL); >
On 30.08.23 16:55, Vincent Whitchurch wrote: > On Fri, 2023-08-25 at 19:38 +0200, Felix Fietkau wrote: >> On 25.08.23 15:42, Vincent Whitchurch wrote: >> > Since the timer expiry function schedules napi, and the napi poll >> > function stmmac_tx_clean() re-arms the timer if it sees that there are >> > pending tx packets, shouldn't an implementation similar to hip04_eth.c >> > (which doesn't save/check the last tx timestamp) be sufficient? >> >> To be honest, I didn't look very closely at what the timer does and how >> coalescing works. I don't know if delaying the timer processing with >> every tx is the right choice, or if it should be armed only once. >> However, as you pointed out, the commit that dropped the re-arming was >> reverted because of regressions. >> >> My suggestions are intended to preserve the existing behavior as much as >> possible (in order to avoid regressions), while also achieving the >> benefit of significantly reducing CPU cycles wasted by re-arming the timer. > > I looked at it some more and the continuous postponing behaviour strikes > me as quite odd. For example, if you set tx-frames coalescing to 0 then > cleanups could happen much later than the specified tx-usecs period, in > the absence of RX traffic. Also, if we'd have to have a shared > timestamp between the callers of stmmac_tx_timer_arm() and the hrtimer > to preserve this continuous postponing behaviour, then we'd need to > introduce some locking between the timer expiry and those functions, to > avoid race conditions. I just spent some time digging through the history of the timer code, figuring out the intention behind the continuous postponing behavior. It's an interrupt mitigation scheme where DMA descriptors are configured to only generate a completion event every 25 packets, and the only purpose of the timer is to avoid keeping packets in the queue for too long after tx activity has stopped. Based on that design, I believe that the continuous postponing actually makes sense and the patches that eliminate it are misguided. When there is constant activity, there will be tx completion interrupts that trigger cleanup. That said, I did even more digging and I found out that the timer code was added at a time when the driver didn't even disable tx and rx interrupts individually, which means that it could not take advantage of interrupt mitigation via NAPI scheduling + IRQ disable/enable. I have a hunch that given the changes made to the driver over time, the timer based interrupt mitigation strategy might just be completely useless and actively harmful now. It certainly messes with things like TSQ and BQL in a nasty way. I suspect that the best and easiest way to deal with this issue is to simply rip out all that timer nonsense, rely on tx IRQs entirely and just let NAPI do its thing. - Felix
On Wed, 2023-08-30 at 23:06 +0200, Felix Fietkau wrote: > On 30.08.23 16:55, Vincent Whitchurch wrote: > > I looked at it some more and the continuous postponing behaviour strikes > > me as quite odd. For example, if you set tx-frames coalescing to 0 then > > cleanups could happen much later than the specified tx-usecs period, in > > the absence of RX traffic. Also, if we'd have to have a shared > > timestamp between the callers of stmmac_tx_timer_arm() and the hrtimer > > to preserve this continuous postponing behaviour, then we'd need to > > introduce some locking between the timer expiry and those functions, to > > avoid race conditions. > > I just spent some time digging through the history of the timer code, > figuring out the intention behind the continuous postponing behavior. > > It's an interrupt mitigation scheme where DMA descriptors are configured > to only generate a completion event every 25 packets, and the only > purpose of the timer is to avoid keeping packets in the queue for too > long after tx activity has stopped. > Based on that design, I believe that the continuous postponing actually > makes sense and the patches that eliminate it are misguided. When there > is constant activity, there will be tx completion interrupts that > trigger cleanup. The tx-frames value (25) can be controlled from user space. If it is set to zero then the driver should still coalesce interrupts based on the interval specified in the tx-usecs setting, but the driver fails to do so because it keeps postponing the cleanup and increasing the latency of the cleanups far beyond the programmed period. > That said, I did even more digging and I found out that the timer code > was added at a time when the driver didn't even disable tx and rx > interrupts individually, which means that it could not take advantage of > interrupt mitigation via NAPI scheduling + IRQ disable/enable. > > I have a hunch that given the changes made to the driver over time, the > timer based interrupt mitigation strategy might just be completely > useless and actively harmful now. It certainly messes with things like > TSQ and BQL in a nasty way. > > I suspect that the best and easiest way to deal with this issue is to > simply rip out all that timer nonsense, rely on tx IRQs entirely and > just let NAPI do its thing. If you want an interrupt for every packet, you can turn off coalescing by setting tx-frames to 1 and tx-usecs 0. Currently, the driver does not turn off the timer even if tx-usecs is set to zero, but that is trivially fixed. With such a fix and that setting, the result is a 30x increase in the number of interrupts in a tx-only test. And tx-frames 25 tx-usecs 0 is of course also bad for throughput since cleanups will not happen when fewer than 25 frames are transmitted.
On 01.09.23 13:31, Vincent Whitchurch wrote: > On Wed, 2023-08-30 at 23:06 +0200, Felix Fietkau wrote: >> On 30.08.23 16:55, Vincent Whitchurch wrote: >> > I looked at it some more and the continuous postponing behaviour strikes >> > me as quite odd. For example, if you set tx-frames coalescing to 0 then >> > cleanups could happen much later than the specified tx-usecs period, in >> > the absence of RX traffic. Also, if we'd have to have a shared >> > timestamp between the callers of stmmac_tx_timer_arm() and the hrtimer >> > to preserve this continuous postponing behaviour, then we'd need to >> > introduce some locking between the timer expiry and those functions, to >> > avoid race conditions. >> >> I just spent some time digging through the history of the timer code, >> figuring out the intention behind the continuous postponing behavior. >> >> It's an interrupt mitigation scheme where DMA descriptors are configured >> to only generate a completion event every 25 packets, and the only >> purpose of the timer is to avoid keeping packets in the queue for too >> long after tx activity has stopped. >> Based on that design, I believe that the continuous postponing actually >> makes sense and the patches that eliminate it are misguided. When there >> is constant activity, there will be tx completion interrupts that >> trigger cleanup. > > The tx-frames value (25) can be controlled from user space. If it is > set to zero then the driver should still coalesce interrupts based on > the interval specified in the tx-usecs setting, but the driver fails to > do so because it keeps postponing the cleanup and increasing the latency > of the cleanups far beyond the programmed period. > >> That said, I did even more digging and I found out that the timer code >> was added at a time when the driver didn't even disable tx and rx >> interrupts individually, which means that it could not take advantage of >> interrupt mitigation via NAPI scheduling + IRQ disable/enable. >> >> I have a hunch that given the changes made to the driver over time, the >> timer based interrupt mitigation strategy might just be completely >> useless and actively harmful now. It certainly messes with things like >> TSQ and BQL in a nasty way. >> >> I suspect that the best and easiest way to deal with this issue is to >> simply rip out all that timer nonsense, rely on tx IRQs entirely and >> just let NAPI do its thing. > > If you want an interrupt for every packet, you can turn off coalescing > by setting tx-frames to 1 and tx-usecs 0. Currently, the driver does > not turn off the timer even if tx-usecs is set to zero, but that is > trivially fixed. With such a fix and that setting, the result is a 30x > increase in the number of interrupts in a tx-only test. > > And tx-frames 25 tx-usecs 0 is of course also bad for throughput since > cleanups will not happen when fewer than 25 frames are transmitted. If the CPU is not saturated, the increase in interrupts is expected, given that NAPI doesn't do any timer based mitigation. I guess for devices operating on battery, the timer based coalescing might conserve some power, unless the power drain caused by the CPU overhead of the timer rescheduling is bigger than the power drain from extra wakeups. For a devices where performance matters more (e.g. the ones used with OpenWrt), I believe that the increase in the number of IRQs won't matter, because mitigation should kick in under load. Could you please post your fix? I think in order to avoid accidental breakage, we should make tx-usecs=0 imply tx-frames=1. Thanks, - Felix
On Fri, 2023-09-01 at 13:53 +0200, Felix Fietkau wrote: > Could you please post your fix? I think in order to avoid accidental > breakage, we should make tx-usecs=0 imply tx-frames=1. I've posted a fix to make tx-usecs=0 not use the timer. The driver already rejects the invalid setting of tx-frames=0 tx-usecs=0. https://lore.kernel.org/netdev/20230907-stmmac-coaloff-v2-1-38ccfac548b9@axis.com/
On Wed, 2023-08-30 at 21:05 +0300, Maxim Mikityanskiy wrote: > On Wed, 30 Aug 2023 at 14:55:37 +0000, Vincent Whitchurch wrote: > > Any test results with this patch on the hardware with the performance > > problems would be appreciated. > > TL/DR: it's definitely better than without the patch, but still worse > than fully reverting hrtimer [1]. Thank you for testing this. Have you also had the chance to try out Felix's suggestion of completely disabling coalescing in the driver? For that you will need to apply the patch at [0] (the command below may appear to work without the patch but the timer will continue to be programmed and expire immediately if the patch is not applied) and then run something like: ethtool -C eth0 tx-frame 1 tx-usecs 0 [0] https://lore.kernel.org/all/20230907-stmmac-coaloff-v2-1-38ccfac548b9@axis.com/ With coalescing disabled in the driver, there is also the option of playing with the generic software IRQ coalescing options available in NAPI in newer kernels (eg. gro_flush_timeout), which may work better for you than the one in the driver.
On Mon, 18 Sep 2023 at 12:56:31 +0000, Vincent Whitchurch wrote: > On Wed, 2023-08-30 at 21:05 +0300, Maxim Mikityanskiy wrote: > > On Wed, 30 Aug 2023 at 14:55:37 +0000, Vincent Whitchurch wrote: > > > Any test results with this patch on the hardware with the performance > > > problems would be appreciated. > > > > TL/DR: it's definitely better than without the patch, but still worse > > than fully reverting hrtimer [1]. > > Thank you for testing this. > > Have you also had the chance to try out Felix's suggestion of completely > disabling coalescing in the driver? For that you will need to apply the > patch at [0] (the command below may appear to work without the patch but > the timer will continue to be programmed and expire immediately if the > patch is not applied) and then run something like: > > ethtool -C eth0 tx-frame 1 tx-usecs 0 It's really good, yet not as perfect as reverting the hrtimer patch. Aggregated summary over 15 20-second iperf3 tests (with XPS, RPS and flow tables): | avg up | avg dn | std up | std dn --------|--------|--------|--------|-------- Revert | 931 | 939 | 6 | 2 No-coal | 925 | 916 | 13 | 26 I'd say it's almost as good as the full revert, but with more deviation, numbers are less stable. Kind of off topic, but I've also been testing PPPoE, and it's pretty bad even with the available fixes: | avg up | avg dn | std up | std dn --------|--------|--------|--------|-------- Revert | 791 | 388 | 13 | 26 No-coal | 806 | 390 | 97 | 60 Thanks, Max > > [0] https://lore.kernel.org/all/20230907-stmmac-coaloff-v2-1-38ccfac548b9@axis.com/ > > With coalescing disabled in the driver, there is also the option of > playing with the generic software IRQ coalescing options available in > NAPI in newer kernels (eg. gro_flush_timeout), which may work better for > you than the one in the driver.
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h index 727e68dfaf1c..04b7162211cb 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h @@ -13,6 +13,7 @@ #define DRV_MODULE_VERSION "Jan_2016" #include <linux/clk.h> +#include <linux/hrtimer.h> #include <linux/if_vlan.h> #include <linux/stmmac.h> #include <linux/phylink.h> @@ -46,7 +47,7 @@ struct stmmac_tx_info { struct stmmac_tx_queue { u32 tx_count_frames; int tbs; - struct timer_list txtimer; + struct hrtimer txtimer; u32 queue_index; struct stmmac_priv *priv_data; struct dma_extended_desc *dma_etx ____cacheline_aligned_in_smp; diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c index ba45fe237512..eea740c06f52 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c @@ -111,7 +111,7 @@ static void stmmac_init_fs(struct net_device *dev); static void stmmac_exit_fs(struct net_device *dev); #endif -#define STMMAC_COAL_TIMER(x) (jiffies + usecs_to_jiffies(x)) +#define STMMAC_COAL_TIMER(x) (ns_to_ktime((x) * NSEC_PER_USEC)) /** * stmmac_verify_args - verify the driver parameters. @@ -2051,7 +2051,8 @@ static int stmmac_tx_clean(struct stmmac_priv *priv, int budget, u32 queue) /* We still have pending packets, let's call for a new scheduling */ if (tx_q->dirty_tx != tx_q->cur_tx) - mod_timer(&tx_q->txtimer, STMMAC_COAL_TIMER(priv->tx_coal_timer)); + hrtimer_start(&tx_q->txtimer, STMMAC_COAL_TIMER(priv->tx_coal_timer), + HRTIMER_MODE_REL); __netif_tx_unlock_bh(netdev_get_tx_queue(priv->dev, queue)); @@ -2335,7 +2336,8 @@ static void stmmac_tx_timer_arm(struct stmmac_priv *priv, u32 queue) { struct stmmac_tx_queue *tx_q = &priv->tx_queue[queue]; - mod_timer(&tx_q->txtimer, STMMAC_COAL_TIMER(priv->tx_coal_timer)); + hrtimer_start(&tx_q->txtimer, STMMAC_COAL_TIMER(priv->tx_coal_timer), + HRTIMER_MODE_REL); } /** @@ -2344,9 +2346,9 @@ static void stmmac_tx_timer_arm(struct stmmac_priv *priv, u32 queue) * Description: * This is the timer handler to directly invoke the stmmac_tx_clean. */ -static void stmmac_tx_timer(struct timer_list *t) +static enum hrtimer_restart stmmac_tx_timer(struct hrtimer *t) { - struct stmmac_tx_queue *tx_q = from_timer(tx_q, t, txtimer); + struct stmmac_tx_queue *tx_q = container_of(t, struct stmmac_tx_queue, txtimer); struct stmmac_priv *priv = tx_q->priv_data; struct stmmac_channel *ch; @@ -2360,6 +2362,8 @@ static void stmmac_tx_timer(struct timer_list *t) spin_unlock_irqrestore(&ch->lock, flags); __napi_schedule(&ch->tx_napi); } + + return HRTIMER_NORESTART; } /** @@ -2382,7 +2386,8 @@ static void stmmac_init_coalesce(struct stmmac_priv *priv) for (chan = 0; chan < tx_channel_count; chan++) { struct stmmac_tx_queue *tx_q = &priv->tx_queue[chan]; - timer_setup(&tx_q->txtimer, stmmac_tx_timer, 0); + hrtimer_init(&tx_q->txtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL); + tx_q->txtimer.function = stmmac_tx_timer; } } @@ -2874,7 +2879,7 @@ static int stmmac_open(struct net_device *dev) phylink_stop(priv->phylink); for (chan = 0; chan < priv->plat->tx_queues_to_use; chan++) - del_timer_sync(&priv->tx_queue[chan].txtimer); + hrtimer_cancel(&priv->tx_queue[chan].txtimer); stmmac_hw_teardown(dev); init_error: @@ -2907,7 +2912,7 @@ static int stmmac_release(struct net_device *dev) stmmac_disable_all_queues(priv); for (chan = 0; chan < priv->plat->tx_queues_to_use; chan++) - del_timer_sync(&priv->tx_queue[chan].txtimer); + hrtimer_cancel(&priv->tx_queue[chan].txtimer); /* Free the IRQ lines */ free_irq(dev->irq, dev); @@ -5140,7 +5145,7 @@ int stmmac_suspend(struct device *dev) stmmac_disable_all_queues(priv); for (chan = 0; chan < priv->plat->tx_queues_to_use; chan++) - del_timer_sync(&priv->tx_queue[chan].txtimer); + hrtimer_cancel(&priv->tx_queue[chan].txtimer); /* Stop TX/RX DMA */ stmmac_stop_all_dma(priv);
This driver uses a normal timer for TX coalescing, which means that the with the default tx-usecs of 1000 microseconds the cleanups actually happen 10 ms or more later with HZ=100. This leads to very low througput with TCP when bridged to a slow link such as a 4G modem. Fix this by using an hrtimer instead. On my ARM platform with HZ=100 and the default TX coalescing settings (tx-frames 25 tx-usecs 1000), with "tc qdisc add dev eth0 root netem delay 60ms 40ms rate 50Mbit" run on the server, netperf's TCP_STREAM improves from ~5.5 Mbps to ~100 Mbps. Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> --- drivers/net/ethernet/stmicro/stmmac/stmmac.h | 3 ++- .../net/ethernet/stmicro/stmmac/stmmac_main.c | 23 +++++++++++-------- 2 files changed, 16 insertions(+), 10 deletions(-)