diff mbox series

[net] net: stmmac: Use hrtimer for TX coalescing

Message ID 20201120150208.6838-1-vincent.whitchurch@axis.com (mailing list archive)
State Accepted
Delegated to: Netdev Maintainers
Headers show
Series [net] net: stmmac: Use hrtimer for TX coalescing | expand

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present fail Series targets non-next tree, but doesn't contain any Fixes tags
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for net
netdev/subject_prefix success Link
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 20 this patch: 20
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch warning WARNING: line length of 85 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns
netdev/build_allmodconfig_warn success Errors and warnings before: 20 this patch: 20
netdev/header_inline success Link
netdev/stable success Stable not CCed

Commit Message

Vincent Whitchurch Nov. 20, 2020, 3:02 p.m. UTC
This driver uses a normal timer for TX coalescing, which means that the
with the default tx-usecs of 1000 microseconds the cleanups actually
happen 10 ms or more later with HZ=100.  This leads to very low
througput with TCP when bridged to a slow link such as a 4G modem.  Fix
this by using an hrtimer instead.

On my ARM platform with HZ=100 and the default TX coalescing settings
(tx-frames 25 tx-usecs 1000), with "tc qdisc add dev eth0 root netem
delay 60ms 40ms rate 50Mbit" run on the server, netperf's TCP_STREAM
improves from ~5.5 Mbps to ~100 Mbps.

Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
---
 drivers/net/ethernet/stmicro/stmmac/stmmac.h  |  3 ++-
 .../net/ethernet/stmicro/stmmac/stmmac_main.c | 23 +++++++++++--------
 2 files changed, 16 insertions(+), 10 deletions(-)

Comments

Jakub Kicinski Nov. 24, 2020, 12:46 a.m. UTC | #1
On Fri, 20 Nov 2020 16:02:08 +0100 Vincent Whitchurch wrote:
> This driver uses a normal timer for TX coalescing, which means that the
> with the default tx-usecs of 1000 microseconds the cleanups actually
> happen 10 ms or more later with HZ=100.  This leads to very low
> througput with TCP when bridged to a slow link such as a 4G modem.  Fix
> this by using an hrtimer instead.
> 
> On my ARM platform with HZ=100 and the default TX coalescing settings
> (tx-frames 25 tx-usecs 1000), with "tc qdisc add dev eth0 root netem
> delay 60ms 40ms rate 50Mbit" run on the server, netperf's TCP_STREAM
> improves from ~5.5 Mbps to ~100 Mbps.
> 
> Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>

Looks perfectly reasonable, but you marked it for net. Do you consider
this to be a bug fix, and need it backported to stable? Otherwise I'd
rather apply it to net-next.
Vincent Whitchurch Nov. 24, 2020, 4:11 a.m. UTC | #2
On Tue, Nov 24, 2020 at 01:46:00AM +0100, Jakub Kicinski wrote:
> On Fri, 20 Nov 2020 16:02:08 +0100 Vincent Whitchurch wrote:
> > This driver uses a normal timer for TX coalescing, which means that the
> > with the default tx-usecs of 1000 microseconds the cleanups actually
> > happen 10 ms or more later with HZ=100.  This leads to very low
> > througput with TCP when bridged to a slow link such as a 4G modem.  Fix
> > this by using an hrtimer instead.
> > 
> > On my ARM platform with HZ=100 and the default TX coalescing settings
> > (tx-frames 25 tx-usecs 1000), with "tc qdisc add dev eth0 root netem
> > delay 60ms 40ms rate 50Mbit" run on the server, netperf's TCP_STREAM
> > improves from ~5.5 Mbps to ~100 Mbps.
> > 
> > Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
> 
> Looks perfectly reasonable, but you marked it for net. Do you consider
> this to be a bug fix, and need it backported to stable? Otherwise I'd
> rather apply it to net-next.

No, sorry, I think a backport to stable is unnecessary.  It should be
fine to apply it to net-next.  Thanks.
Jakub Kicinski Nov. 24, 2020, 4:50 p.m. UTC | #3
On Tue, 24 Nov 2020 05:11:27 +0100 Vincent Whitchurch wrote:
> On Tue, Nov 24, 2020 at 01:46:00AM +0100, Jakub Kicinski wrote:
> > On Fri, 20 Nov 2020 16:02:08 +0100 Vincent Whitchurch wrote:  
> > > This driver uses a normal timer for TX coalescing, which means that the
> > > with the default tx-usecs of 1000 microseconds the cleanups actually
> > > happen 10 ms or more later with HZ=100.  This leads to very low
> > > througput with TCP when bridged to a slow link such as a 4G modem.  Fix
> > > this by using an hrtimer instead.
> > > 
> > > On my ARM platform with HZ=100 and the default TX coalescing settings
> > > (tx-frames 25 tx-usecs 1000), with "tc qdisc add dev eth0 root netem
> > > delay 60ms 40ms rate 50Mbit" run on the server, netperf's TCP_STREAM
> > > improves from ~5.5 Mbps to ~100 Mbps.
> > > 
> > > Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>  
> > 
> > Looks perfectly reasonable, but you marked it for net. Do you consider
> > this to be a bug fix, and need it backported to stable? Otherwise I'd
> > rather apply it to net-next.  
> 
> No, sorry, I think a backport to stable is unnecessary.  It should be
> fine to apply it to net-next.  Thanks.

Applied to net-next, thank you!
Felix Fietkau Aug. 23, 2023, 8:18 p.m. UTC | #4
On 20.11.20 16:02, Vincent Whitchurch wrote:
> This driver uses a normal timer for TX coalescing, which means that the
> with the default tx-usecs of 1000 microseconds the cleanups actually
> happen 10 ms or more later with HZ=100.  This leads to very low
> througput with TCP when bridged to a slow link such as a 4G modem.  Fix
> this by using an hrtimer instead.
> 
> On my ARM platform with HZ=100 and the default TX coalescing settings
> (tx-frames 25 tx-usecs 1000), with "tc qdisc add dev eth0 root netem
> delay 60ms 40ms rate 50Mbit" run on the server, netperf's TCP_STREAM
> improves from ~5.5 Mbps to ~100 Mbps.
> 
> Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>

Based on tests by OpenWrt users, it seems that this one is causing a 
significant performance regression caused by wasting lots of CPU cycles 
re-arming the hrtimer on every single packet. More info:
https://github.com/openwrt/openwrt/issues/11676#issuecomment-1690492666

My suggestion for fixing this properly would be:
- keep a separate timestamp for last tx packet
- do not modify the timer if it's scheduled already
- in the timer function, check the last tx timestamp and re-arm the 
timer if necessary.

This should significantly reduce the number of wasted CPU cycles, even 
when accounting for the additional overhead of hrtimer vs regular timer.

- Felix
Vincent Whitchurch Aug. 25, 2023, 1:42 p.m. UTC | #5
On Wed, 2023-08-23 at 22:18 +0200, Felix Fietkau wrote:
> Based on tests by OpenWrt users, it seems that this one is causing a 
> significant performance regression caused by wasting lots of CPU cycles 
> re-arming the hrtimer on every single packet. More info:
> https://github.com/openwrt/openwrt/issues/11676#issuecomment-1690492666

It looks like there was an attempt to avoid the re-arming of the timer
in ->xmit() a while ago (pre-dating the hrtimer usage) in commit
4ae0169fd1b3c792b66be58995b7e6b629919ecf ("net: stmmac: Do not keep
rearming the coalesce timer in stmmac_xmit"), but that got reverted
later due to regressions.  The coalescing code has been reworked since
then but the removal of the re-arming was never attempted again.

> My suggestion for fixing this properly would be:
> - keep a separate timestamp for last tx packet
> - do not modify the timer if it's scheduled already
> - in the timer function, check the last tx timestamp and re-arm the 
> timer if necessary.

Would you mind explain the reasons for maintaining a timestamp and
checking it in the expiry function?  Is that to obtain the same effect
as the driver's current behaviour of postponing the expiry of the timer
for each packet?  Is that really desired?  According to the commit
message in 4ae0169fd1b3c792b66be58995b7e6b629919ecf, "Once the timer is
armed it should run after the time expires for the first packet sent and
not the last one."

Since the timer expiry function schedules napi, and the napi poll
function stmmac_tx_clean() re-arms the timer if it sees that there are
pending tx packets, shouldn't an implementation similar to hip04_eth.c
(which doesn't save/check the last tx timestamp) be sufficient?
Felix Fietkau Aug. 25, 2023, 5:38 p.m. UTC | #6
On 25.08.23 15:42, Vincent Whitchurch wrote:
> On Wed, 2023-08-23 at 22:18 +0200, Felix Fietkau wrote:
>> Based on tests by OpenWrt users, it seems that this one is causing a 
>> significant performance regression caused by wasting lots of CPU cycles 
>> re-arming the hrtimer on every single packet. More info:
>> https://github.com/openwrt/openwrt/issues/11676#issuecomment-1690492666
> 
> It looks like there was an attempt to avoid the re-arming of the timer
> in ->xmit() a while ago (pre-dating the hrtimer usage) in commit
> 4ae0169fd1b3c792b66be58995b7e6b629919ecf ("net: stmmac: Do not keep
> rearming the coalesce timer in stmmac_xmit"), but that got reverted
> later due to regressions.  The coalescing code has been reworked since
> then but the removal of the re-arming was never attempted again.
> 
>> My suggestion for fixing this properly would be:
>> - keep a separate timestamp for last tx packet
>> - do not modify the timer if it's scheduled already
>> - in the timer function, check the last tx timestamp and re-arm the 
>> timer if necessary.
> 
> Would you mind explain the reasons for maintaining a timestamp and
> checking it in the expiry function?  Is that to obtain the same effect
> as the driver's current behaviour of postponing the expiry of the timer
> for each packet? 

Exactly. I did something very similar in mac80211 in the past, when 
expensive mod_timer calls were showing up in my perf measurements.
Functionally it should be the same, it just avoids excessive re-sorting 
in timer data structures.

> Is that really desired?  According to the commit
> message in 4ae0169fd1b3c792b66be58995b7e6b629919ecf, "Once the timer is
> armed it should run after the time expires for the first packet sent and
> not the last one."
> 
> Since the timer expiry function schedules napi, and the napi poll
> function stmmac_tx_clean() re-arms the timer if it sees that there are
> pending tx packets, shouldn't an implementation similar to hip04_eth.c
> (which doesn't save/check the last tx timestamp) be sufficient?

To be honest, I didn't look very closely at what the timer does and how 
coalescing works. I don't know if delaying the timer processing with 
every tx is the right choice, or if it should be armed only once. 
However, as you pointed out, the commit that dropped the re-arming was 
reverted because of regressions.

My suggestions are intended to preserve the existing behavior as much as 
possible (in order to avoid regressions), while also achieving the 
benefit of significantly reducing CPU cycles wasted by re-arming the timer.

- Felix
Vincent Whitchurch Aug. 30, 2023, 2:55 p.m. UTC | #7
On Fri, 2023-08-25 at 19:38 +0200, Felix Fietkau wrote:
> On 25.08.23 15:42, Vincent Whitchurch wrote:
> > Since the timer expiry function schedules napi, and the napi poll
> > function stmmac_tx_clean() re-arms the timer if it sees that there are
> > pending tx packets, shouldn't an implementation similar to hip04_eth.c
> > (which doesn't save/check the last tx timestamp) be sufficient?
> 
> To be honest, I didn't look very closely at what the timer does and how 
> coalescing works. I don't know if delaying the timer processing with 
> every tx is the right choice, or if it should be armed only once. 
> However, as you pointed out, the commit that dropped the re-arming was 
> reverted because of regressions.
> 
> My suggestions are intended to preserve the existing behavior as much as 
> possible (in order to avoid regressions), while also achieving the 
> benefit of significantly reducing CPU cycles wasted by re-arming the timer.

I looked at it some more and the continuous postponing behaviour strikes
me as quite odd.  For example, if you set tx-frames coalescing to 0 then
cleanups could happen much later than the specified tx-usecs period, in
the absence of RX traffic.  Also, if we'd have to have a shared
timestamp between the callers of stmmac_tx_timer_arm() and the hrtimer
to preserve this continuous postponing behaviour, then we'd need to
introduce some locking between the timer expiry and those functions, to
avoid race conditions.

So currently I am experimenting with just the following patch.  The
second hunk is similar to hip04_eth.c (the comment is mine).  AFAICS
hip04_eth.c doesn't have code equivalent to the first hunk of the patch
and instead unconditionally restarts the timer from its napi poll if it
sees that it's needed.

I can't reproduce the mod_timer vs hrtimer performance problems on my
hardware, but the patch below reduces the number of (re-)starts of the
stmmac_tx_timer by around 85% in an iperf3 test over a gigabit link
(just the second hunk reduces it by about 30%).

Any test results with this patch on the hardware with the performance
problems would be appreciated.

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 4727f7be4f86..4b6e5061b5a6 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -2703,9 +2703,7 @@ static int stmmac_tx_clean(struct stmmac_priv *priv, int budget, u32 queue)
 
 	/* We still have pending packets, let's call for a new scheduling */
 	if (tx_q->dirty_tx != tx_q->cur_tx)
-		hrtimer_start(&tx_q->txtimer,
-			      STMMAC_COAL_TIMER(priv->tx_coal_timer[queue]),
-			      HRTIMER_MODE_REL);
+		stmmac_tx_timer_arm(priv, queue);
 
 	__netif_tx_unlock_bh(netdev_get_tx_queue(priv->dev, queue));
 
@@ -2987,6 +2985,20 @@ static void stmmac_tx_timer_arm(struct stmmac_priv *priv, u32 queue)
 {
 	struct stmmac_tx_queue *tx_q = &priv->dma_conf.tx_queue[queue];
 
+	/*
+	 * Note that the hrtimer could expire immediately after we check this,
+	 * and the hrtimer and the callers of this function do not share a
+	 * lock.
+	 *
+	 * This should however be safe since the only thing the hrtimer does is
+	 * schedule napi (or ask for it run again if it's already running), and
+	 * stmmac_tx_clean(), called from the napi poll function, also calls
+	 * stmmac_tx_timer_arm() at the end if it sees that there are any TX
+	 * packets which have not yet been cleaned.
+	 */
+	if (hrtimer_is_queued(&tx_q->txtimer))
+		return;
+
 	hrtimer_start(&tx_q->txtimer,
 		      STMMAC_COAL_TIMER(priv->tx_coal_timer[queue]),
 		      HRTIMER_MODE_REL);
Maxim Mikityanskiy Aug. 30, 2023, 6:05 p.m. UTC | #8
On Wed, 30 Aug 2023 at 14:55:37 +0000, Vincent Whitchurch wrote:
> Any test results with this patch on the hardware with the performance
> problems would be appreciated.

TL/DR: it's definitely better than without the patch, but still worse
than fully reverting hrtimer [1].

OpenWrt on Netgear R7800, iperf3 test in both directions (LAN->WAN and
WAN->LAN), 3 runs for the duration of 1 minute each. OpenWrt options
packet_steering (RPS + XPS) and flow_offloading (flowtables) enabled,
irqbalance disabled.

Numbers in Mbit/s, I had to fit the table into 72 characters, so:
U is the original kernel 6.1.46 from OpenWrt, R is reverted hrtimer
(patch [1]), P is patched with the new patch from the previous email,
^ is upload (LAN->WAN), v is download (WAN->LAN).

  |  ^  |  v  |  ^  |  v  |  ^  |  v  | avg ^ | avg v | std ^ | std v
- | --- | --- | --- | --- | --- | --- | ----- | ----- | ----- | -----
U | 742 | 709 | 740 | 715 | 556 | 750 |  679  |  725  |  107  |   22
R | 931 | 938 | 935 | 939 | 935 | 939 |  934  |  939  |    2  |    1
P | 845 | 939 | 934 | 939 | 845 | 909 |  875  |  929  |   51  |   17

Full revert allows to get really close to the theoretical maximum
goodput (~949 Mbit/s) with minimal deviation. The new patch, however,
gives less stable numbers, while sometimes hits the maximum as well.

More numbers for the [U]npatched and [R]everted kernels are in the
OpenWrt thread [2].

[1]: https://github.com/openwrt/openwrt/files/12422351/revert-stmmac.patch.txt
[2]: https://github.com/openwrt/openwrt/issues/11676

> 
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index 4727f7be4f86..4b6e5061b5a6 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -2703,9 +2703,7 @@ static int stmmac_tx_clean(struct stmmac_priv *priv, int budget, u32 queue)
>  
>  	/* We still have pending packets, let's call for a new scheduling */
>  	if (tx_q->dirty_tx != tx_q->cur_tx)
> -		hrtimer_start(&tx_q->txtimer,
> -			      STMMAC_COAL_TIMER(priv->tx_coal_timer[queue]),
> -			      HRTIMER_MODE_REL);
> +		stmmac_tx_timer_arm(priv, queue);
>  
>  	__netif_tx_unlock_bh(netdev_get_tx_queue(priv->dev, queue));
>  
> @@ -2987,6 +2985,20 @@ static void stmmac_tx_timer_arm(struct stmmac_priv *priv, u32 queue)
>  {
>  	struct stmmac_tx_queue *tx_q = &priv->dma_conf.tx_queue[queue];
>  
> +	/*
> +	 * Note that the hrtimer could expire immediately after we check this,
> +	 * and the hrtimer and the callers of this function do not share a
> +	 * lock.
> +	 *
> +	 * This should however be safe since the only thing the hrtimer does is
> +	 * schedule napi (or ask for it run again if it's already running), and
> +	 * stmmac_tx_clean(), called from the napi poll function, also calls
> +	 * stmmac_tx_timer_arm() at the end if it sees that there are any TX
> +	 * packets which have not yet been cleaned.
> +	 */
> +	if (hrtimer_is_queued(&tx_q->txtimer))
> +		return;
> +
>  	hrtimer_start(&tx_q->txtimer,
>  		      STMMAC_COAL_TIMER(priv->tx_coal_timer[queue]),
>  		      HRTIMER_MODE_REL);
>
Felix Fietkau Aug. 30, 2023, 9:06 p.m. UTC | #9
On 30.08.23 16:55, Vincent Whitchurch wrote:
> On Fri, 2023-08-25 at 19:38 +0200, Felix Fietkau wrote:
>> On 25.08.23 15:42, Vincent Whitchurch wrote:
>> > Since the timer expiry function schedules napi, and the napi poll
>> > function stmmac_tx_clean() re-arms the timer if it sees that there are
>> > pending tx packets, shouldn't an implementation similar to hip04_eth.c
>> > (which doesn't save/check the last tx timestamp) be sufficient?
>> 
>> To be honest, I didn't look very closely at what the timer does and how 
>> coalescing works. I don't know if delaying the timer processing with 
>> every tx is the right choice, or if it should be armed only once. 
>> However, as you pointed out, the commit that dropped the re-arming was 
>> reverted because of regressions.
>> 
>> My suggestions are intended to preserve the existing behavior as much as 
>> possible (in order to avoid regressions), while also achieving the 
>> benefit of significantly reducing CPU cycles wasted by re-arming the timer.
> 
> I looked at it some more and the continuous postponing behaviour strikes
> me as quite odd.  For example, if you set tx-frames coalescing to 0 then
> cleanups could happen much later than the specified tx-usecs period, in
> the absence of RX traffic.  Also, if we'd have to have a shared
> timestamp between the callers of stmmac_tx_timer_arm() and the hrtimer
> to preserve this continuous postponing behaviour, then we'd need to
> introduce some locking between the timer expiry and those functions, to
> avoid race conditions.

I just spent some time digging through the history of the timer code, 
figuring out the intention behind the continuous postponing behavior.

It's an interrupt mitigation scheme where DMA descriptors are configured 
to only generate a completion event every 25 packets, and the only 
purpose of the timer is to avoid keeping packets in the queue for too 
long after tx activity has stopped.
Based on that design, I believe that the continuous postponing actually 
makes sense and the patches that eliminate it are misguided. When there 
is constant activity, there will be tx completion interrupts that 
trigger cleanup.

That said, I did even more digging and I found out that the timer code 
was added at a time when the driver didn't even disable tx and rx 
interrupts individually, which means that it could not take advantage of 
interrupt mitigation via NAPI scheduling + IRQ disable/enable.

I have a hunch that given the changes made to the driver over time, the 
timer based interrupt mitigation strategy might just be completely 
useless and actively harmful now. It certainly messes with things like 
TSQ and BQL in a nasty way.

I suspect that the best and easiest way to deal with this issue is to 
simply rip out all that timer nonsense, rely on tx IRQs entirely and 
just let NAPI do its thing.

- Felix
Vincent Whitchurch Sept. 1, 2023, 11:31 a.m. UTC | #10
On Wed, 2023-08-30 at 23:06 +0200, Felix Fietkau wrote:
> On 30.08.23 16:55, Vincent Whitchurch wrote:
> > I looked at it some more and the continuous postponing behaviour strikes
> > me as quite odd.  For example, if you set tx-frames coalescing to 0 then
> > cleanups could happen much later than the specified tx-usecs period, in
> > the absence of RX traffic.  Also, if we'd have to have a shared
> > timestamp between the callers of stmmac_tx_timer_arm() and the hrtimer
> > to preserve this continuous postponing behaviour, then we'd need to
> > introduce some locking between the timer expiry and those functions, to
> > avoid race conditions.
> 
> I just spent some time digging through the history of the timer code, 
> figuring out the intention behind the continuous postponing behavior.
> 
> It's an interrupt mitigation scheme where DMA descriptors are configured 
> to only generate a completion event every 25 packets, and the only 
> purpose of the timer is to avoid keeping packets in the queue for too 
> long after tx activity has stopped.
> Based on that design, I believe that the continuous postponing actually 
> makes sense and the patches that eliminate it are misguided. When there 
> is constant activity, there will be tx completion interrupts that 
> trigger cleanup.

The tx-frames value (25) can be controlled from user space.  If it is
set to zero then the driver should still coalesce interrupts based on
the interval specified in the tx-usecs setting, but the driver fails to
do so because it keeps postponing the cleanup and increasing the latency
of the cleanups far beyond the programmed period.

> That said, I did even more digging and I found out that the timer code 
> was added at a time when the driver didn't even disable tx and rx 
> interrupts individually, which means that it could not take advantage of 
> interrupt mitigation via NAPI scheduling + IRQ disable/enable.
> 
> I have a hunch that given the changes made to the driver over time, the 
> timer based interrupt mitigation strategy might just be completely 
> useless and actively harmful now. It certainly messes with things like 
> TSQ and BQL in a nasty way.
> 
> I suspect that the best and easiest way to deal with this issue is to 
> simply rip out all that timer nonsense, rely on tx IRQs entirely and 
> just let NAPI do its thing.

If you want an interrupt for every packet, you can turn off coalescing
by setting tx-frames to 1 and tx-usecs 0.  Currently, the driver does
not turn off the timer even if tx-usecs is set to zero, but that is
trivially fixed.  With such a fix and that setting, the result is a 30x
increase in the number of interrupts in a tx-only test.

And tx-frames 25 tx-usecs 0 is of course also bad for throughput since
cleanups will not happen when fewer than 25 frames are transmitted.
Felix Fietkau Sept. 1, 2023, 11:53 a.m. UTC | #11
On 01.09.23 13:31, Vincent Whitchurch wrote:
> On Wed, 2023-08-30 at 23:06 +0200, Felix Fietkau wrote:
>> On 30.08.23 16:55, Vincent Whitchurch wrote:
>> > I looked at it some more and the continuous postponing behaviour strikes
>> > me as quite odd.  For example, if you set tx-frames coalescing to 0 then
>> > cleanups could happen much later than the specified tx-usecs period, in
>> > the absence of RX traffic.  Also, if we'd have to have a shared
>> > timestamp between the callers of stmmac_tx_timer_arm() and the hrtimer
>> > to preserve this continuous postponing behaviour, then we'd need to
>> > introduce some locking between the timer expiry and those functions, to
>> > avoid race conditions.
>> 
>> I just spent some time digging through the history of the timer code, 
>> figuring out the intention behind the continuous postponing behavior.
>> 
>> It's an interrupt mitigation scheme where DMA descriptors are configured 
>> to only generate a completion event every 25 packets, and the only 
>> purpose of the timer is to avoid keeping packets in the queue for too 
>> long after tx activity has stopped.
>> Based on that design, I believe that the continuous postponing actually 
>> makes sense and the patches that eliminate it are misguided. When there 
>> is constant activity, there will be tx completion interrupts that 
>> trigger cleanup.
> 
> The tx-frames value (25) can be controlled from user space.  If it is
> set to zero then the driver should still coalesce interrupts based on
> the interval specified in the tx-usecs setting, but the driver fails to
> do so because it keeps postponing the cleanup and increasing the latency
> of the cleanups far beyond the programmed period.
> 
>> That said, I did even more digging and I found out that the timer code 
>> was added at a time when the driver didn't even disable tx and rx 
>> interrupts individually, which means that it could not take advantage of 
>> interrupt mitigation via NAPI scheduling + IRQ disable/enable.
>> 
>> I have a hunch that given the changes made to the driver over time, the 
>> timer based interrupt mitigation strategy might just be completely 
>> useless and actively harmful now. It certainly messes with things like 
>> TSQ and BQL in a nasty way.
>> 
>> I suspect that the best and easiest way to deal with this issue is to 
>> simply rip out all that timer nonsense, rely on tx IRQs entirely and 
>> just let NAPI do its thing.
> 
> If you want an interrupt for every packet, you can turn off coalescing
> by setting tx-frames to 1 and tx-usecs 0.  Currently, the driver does
> not turn off the timer even if tx-usecs is set to zero, but that is
> trivially fixed.  With such a fix and that setting, the result is a 30x
> increase in the number of interrupts in a tx-only test.
> 
> And tx-frames 25 tx-usecs 0 is of course also bad for throughput since
> cleanups will not happen when fewer than 25 frames are transmitted.
If the CPU is not saturated, the increase in interrupts is expected, 
given that NAPI doesn't do any timer based mitigation.

I guess for devices operating on battery, the timer based coalescing 
might conserve some power, unless the power drain caused by the CPU 
overhead of the timer rescheduling is bigger than the power drain from 
extra wakeups.

For a devices where performance matters more (e.g. the ones used with 
OpenWrt), I believe that the increase in the number of IRQs won't 
matter, because mitigation should kick in under load.

Could you please post your fix? I think in order to avoid accidental 
breakage, we should make tx-usecs=0 imply tx-frames=1.

Thanks,

- Felix
Vincent Whitchurch Sept. 8, 2023, 10:42 a.m. UTC | #12
On Fri, 2023-09-01 at 13:53 +0200, Felix Fietkau wrote:
> Could you please post your fix? I think in order to avoid accidental 
> breakage, we should make tx-usecs=0 imply tx-frames=1.

I've posted a fix to make tx-usecs=0 not use the timer.  The driver
already rejects the invalid setting of tx-frames=0 tx-usecs=0.

 https://lore.kernel.org/netdev/20230907-stmmac-coaloff-v2-1-38ccfac548b9@axis.com/
Vincent Whitchurch Sept. 18, 2023, 12:56 p.m. UTC | #13
On Wed, 2023-08-30 at 21:05 +0300, Maxim Mikityanskiy wrote:
> On Wed, 30 Aug 2023 at 14:55:37 +0000, Vincent Whitchurch wrote:
> > Any test results with this patch on the hardware with the performance
> > problems would be appreciated.
> 
> TL/DR: it's definitely better than without the patch, but still worse
> than fully reverting hrtimer [1].

Thank you for testing this.

Have you also had the chance to try out Felix's suggestion of completely
disabling coalescing in the driver?  For that you will need to apply the
patch at [0] (the command below may appear to work without the patch but
the timer will continue to be programmed and expire immediately if the
patch is not applied) and then run something like:

  ethtool -C eth0 tx-frame 1 tx-usecs 0

[0] https://lore.kernel.org/all/20230907-stmmac-coaloff-v2-1-38ccfac548b9@axis.com/

With coalescing disabled in the driver, there is also the option of
playing with the generic software IRQ coalescing options available in
NAPI in newer kernels (eg. gro_flush_timeout), which may work better for
you than the one in the driver.
Maxim Mikityanskiy Sept. 19, 2023, 5:53 p.m. UTC | #14
On Mon, 18 Sep 2023 at 12:56:31 +0000, Vincent Whitchurch wrote:
> On Wed, 2023-08-30 at 21:05 +0300, Maxim Mikityanskiy wrote:
> > On Wed, 30 Aug 2023 at 14:55:37 +0000, Vincent Whitchurch wrote:
> > > Any test results with this patch on the hardware with the performance
> > > problems would be appreciated.
> > 
> > TL/DR: it's definitely better than without the patch, but still worse
> > than fully reverting hrtimer [1].
> 
> Thank you for testing this.
> 
> Have you also had the chance to try out Felix's suggestion of completely
> disabling coalescing in the driver?  For that you will need to apply the
> patch at [0] (the command below may appear to work without the patch but
> the timer will continue to be programmed and expire immediately if the
> patch is not applied) and then run something like:
> 
>   ethtool -C eth0 tx-frame 1 tx-usecs 0

It's really good, yet not as perfect as reverting the hrtimer patch.
Aggregated summary over 15 20-second iperf3 tests (with XPS, RPS and
flow tables):

        | avg up | avg dn | std up | std dn
--------|--------|--------|--------|--------
Revert  |   931  |   939  |     6  |     2
No-coal |   925  |   916  |    13  |    26

I'd say it's almost as good as the full revert, but with more deviation,
numbers are less stable.

Kind of off topic, but I've also been testing PPPoE, and it's pretty bad
even with the available fixes:

        | avg up | avg dn | std up | std dn
--------|--------|--------|--------|--------
Revert  |   791  |   388  |    13  |    26
No-coal |   806  |   390  |    97  |    60

Thanks,
Max

> 
> [0] https://lore.kernel.org/all/20230907-stmmac-coaloff-v2-1-38ccfac548b9@axis.com/
> 
> With coalescing disabled in the driver, there is also the option of
> playing with the generic software IRQ coalescing options available in
> NAPI in newer kernels (eg. gro_flush_timeout), which may work better for
> you than the one in the driver.
diff mbox series

Patch

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
index 727e68dfaf1c..04b7162211cb 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
@@ -13,6 +13,7 @@ 
 #define DRV_MODULE_VERSION	"Jan_2016"
 
 #include <linux/clk.h>
+#include <linux/hrtimer.h>
 #include <linux/if_vlan.h>
 #include <linux/stmmac.h>
 #include <linux/phylink.h>
@@ -46,7 +47,7 @@  struct stmmac_tx_info {
 struct stmmac_tx_queue {
 	u32 tx_count_frames;
 	int tbs;
-	struct timer_list txtimer;
+	struct hrtimer txtimer;
 	u32 queue_index;
 	struct stmmac_priv *priv_data;
 	struct dma_extended_desc *dma_etx ____cacheline_aligned_in_smp;
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index ba45fe237512..eea740c06f52 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -111,7 +111,7 @@  static void stmmac_init_fs(struct net_device *dev);
 static void stmmac_exit_fs(struct net_device *dev);
 #endif
 
-#define STMMAC_COAL_TIMER(x) (jiffies + usecs_to_jiffies(x))
+#define STMMAC_COAL_TIMER(x) (ns_to_ktime((x) * NSEC_PER_USEC))
 
 /**
  * stmmac_verify_args - verify the driver parameters.
@@ -2051,7 +2051,8 @@  static int stmmac_tx_clean(struct stmmac_priv *priv, int budget, u32 queue)
 
 	/* We still have pending packets, let's call for a new scheduling */
 	if (tx_q->dirty_tx != tx_q->cur_tx)
-		mod_timer(&tx_q->txtimer, STMMAC_COAL_TIMER(priv->tx_coal_timer));
+		hrtimer_start(&tx_q->txtimer, STMMAC_COAL_TIMER(priv->tx_coal_timer),
+			      HRTIMER_MODE_REL);
 
 	__netif_tx_unlock_bh(netdev_get_tx_queue(priv->dev, queue));
 
@@ -2335,7 +2336,8 @@  static void stmmac_tx_timer_arm(struct stmmac_priv *priv, u32 queue)
 {
 	struct stmmac_tx_queue *tx_q = &priv->tx_queue[queue];
 
-	mod_timer(&tx_q->txtimer, STMMAC_COAL_TIMER(priv->tx_coal_timer));
+	hrtimer_start(&tx_q->txtimer, STMMAC_COAL_TIMER(priv->tx_coal_timer),
+		      HRTIMER_MODE_REL);
 }
 
 /**
@@ -2344,9 +2346,9 @@  static void stmmac_tx_timer_arm(struct stmmac_priv *priv, u32 queue)
  * Description:
  * This is the timer handler to directly invoke the stmmac_tx_clean.
  */
-static void stmmac_tx_timer(struct timer_list *t)
+static enum hrtimer_restart stmmac_tx_timer(struct hrtimer *t)
 {
-	struct stmmac_tx_queue *tx_q = from_timer(tx_q, t, txtimer);
+	struct stmmac_tx_queue *tx_q = container_of(t, struct stmmac_tx_queue, txtimer);
 	struct stmmac_priv *priv = tx_q->priv_data;
 	struct stmmac_channel *ch;
 
@@ -2360,6 +2362,8 @@  static void stmmac_tx_timer(struct timer_list *t)
 		spin_unlock_irqrestore(&ch->lock, flags);
 		__napi_schedule(&ch->tx_napi);
 	}
+
+	return HRTIMER_NORESTART;
 }
 
 /**
@@ -2382,7 +2386,8 @@  static void stmmac_init_coalesce(struct stmmac_priv *priv)
 	for (chan = 0; chan < tx_channel_count; chan++) {
 		struct stmmac_tx_queue *tx_q = &priv->tx_queue[chan];
 
-		timer_setup(&tx_q->txtimer, stmmac_tx_timer, 0);
+		hrtimer_init(&tx_q->txtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+		tx_q->txtimer.function = stmmac_tx_timer;
 	}
 }
 
@@ -2874,7 +2879,7 @@  static int stmmac_open(struct net_device *dev)
 	phylink_stop(priv->phylink);
 
 	for (chan = 0; chan < priv->plat->tx_queues_to_use; chan++)
-		del_timer_sync(&priv->tx_queue[chan].txtimer);
+		hrtimer_cancel(&priv->tx_queue[chan].txtimer);
 
 	stmmac_hw_teardown(dev);
 init_error:
@@ -2907,7 +2912,7 @@  static int stmmac_release(struct net_device *dev)
 	stmmac_disable_all_queues(priv);
 
 	for (chan = 0; chan < priv->plat->tx_queues_to_use; chan++)
-		del_timer_sync(&priv->tx_queue[chan].txtimer);
+		hrtimer_cancel(&priv->tx_queue[chan].txtimer);
 
 	/* Free the IRQ lines */
 	free_irq(dev->irq, dev);
@@ -5140,7 +5145,7 @@  int stmmac_suspend(struct device *dev)
 	stmmac_disable_all_queues(priv);
 
 	for (chan = 0; chan < priv->plat->tx_queues_to_use; chan++)
-		del_timer_sync(&priv->tx_queue[chan].txtimer);
+		hrtimer_cancel(&priv->tx_queue[chan].txtimer);
 
 	/* Stop TX/RX DMA */
 	stmmac_stop_all_dma(priv);