
[BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s

Message ID 20131120191145.GP8581@1wt.eu (mailing list archive)
State New, archived

Commit Message

Willy Tarreau Nov. 20, 2013, 7:11 p.m. UTC
Hi Arnaud,

first, thanks for all these tests.

On Wed, Nov 20, 2013 at 12:53:43AM +0100, Arnaud Ebalard wrote:
(...)
> In the end, here are the conclusions *I* draw from this test session,
> do not hesitate to correct me:
> 
>  - Eric, it seems something changed in Linus' tree between the beginning
>    of the thread and now, which somehow reduces the effect of the
>    regression we were seeing: I never got back the 256KB/s.
>  - Your revert patch still improves the perf a lot
>  - It seems reducing MVNETA_TX_DONE_TIMER_PERIOD does not help
>  - w/ your revert patch, I can confirm that mvneta driver is capable of
>    doing line rate w/ proper tweak of TCP send window (256KB instead of
>    4M)
>  - It seems I will have to spend some time on the SATA issues I
>    previously thought were an artefact of not cleaning my tree during a
>    debug session [1], i.e. there is IMHO an issue.

Could you please try Eric's patch that was just merged into Linus' tree,
if it was not already in the kernel you tried:

  98e09386c0e  tcp: tsq: restore minimal amount of queueing

For me it restored the original performance (I saturate the Gbps with
about 7 concurrent streams).

Further, I wrote the small patch below for mvneta. I'm not sure it's
SMP-safe, but it's a PoC. In mvneta_poll(), which is currently only called
upon Rx interrupt, it tries to flush any remaining Tx descriptors.
That significantly improved my transfer rate; now I easily achieve
1 Gbps using a single TCP stream on the mirabox. Not tried on the AX3 yet.

It also increased the overall connection rate by 10% on empty HTTP responses
(small packets), very likely by reducing the dead time between some segments!

You'll probably want to give it a try, so here it comes.

Cheers,
Willy

From d1a00e593841223c7d871007b1e1fc528afe8e4d Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w@1wt.eu>
Date: Wed, 20 Nov 2013 19:47:11 +0100
Subject: EXP: net: mvneta: try to flush Tx descriptor queue upon Rx
 interrupts

Right now the mvneta driver doesn't handle Tx IRQ, and solely relies on a
timer to flush Tx descriptors. This causes jerky output traffic with bursts
and pauses, making it difficult to reach line rate with very few streams.
This patch tries to improve the situation, which is complicated by the lack
of a public datasheet from Marvell. The workaround consists in trying to flush
pending buffers during the Rx polling. The idea is that for symmetric TCP
traffic, ACKs received in response to the packets sent will trigger the Rx
interrupt and will anticipate the flushing of the descriptors.

The results are quite good, a single TCP stream is now capable of saturating
a gigabit.

This is only a workaround, it doesn't address asymmetric traffic nor datagram
based traffic.

Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 drivers/net/ethernet/marvell/mvneta.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

Comments

Arnaud Ebalard Nov. 20, 2013, 7:26 p.m. UTC | #1
Hi,

Willy Tarreau <w@1wt.eu> writes:

> first, thanks for all these tests.
>
> On Wed, Nov 20, 2013 at 12:53:43AM +0100, Arnaud Ebalard wrote:
> (...)
>> In the end, here are the conclusions *I* draw from this test session,
>> do not hesitate to correct me:
>> 
>>  - Eric, it seems something changed in Linus' tree between the beginning
>>    of the thread and now, which somehow reduces the effect of the
>>    regression we were seeing: I never got back the 256KB/s.
>>  - Your revert patch still improves the perf a lot
>>  - It seems reducing MVNETA_TX_DONE_TIMER_PERIOD does not help
>>  - w/ your revert patch, I can confirm that mvneta driver is capable of
>>    doing line rate w/ proper tweak of TCP send window (256KB instead of
>>    4M)
>>  - It seems I will have to spend some time on the SATA issues I
>>    previously thought were an artefact of not cleaning my tree during a
>>    debug session [1], i.e. there is IMHO an issue.
>
> Could you please try Eric's patch that was just merged into Linus' tree,
> if it was not already in the kernel you tried:
>
>   98e09386c0e  tcp: tsq: restore minimal amount of queueing

I have it in my quilt set.


> For me it restored the original performance (I saturate the Gbps with
> about 7 concurrent streams).
>
> Further, I wrote the small patch below for mvneta. I'm not sure it's
> SMP-safe, but it's a PoC. In mvneta_poll(), which is currently only called
> upon Rx interrupt, it tries to flush any remaining Tx descriptors.
> That significantly improved my transfer rate; now I easily achieve
> 1 Gbps using a single TCP stream on the mirabox. Not tried on the AX3 yet.
>
> It also increased the overall connection rate by 10% on empty HTTP responses
> (small packets), very likely by reducing the dead time between some segments!
>
> You'll probably want to give it a try, so here it comes.

hehe, I was falling short of patches to test tonight ;-) I will give it
a try now.

Cheers,

a+
Arnaud Ebalard Nov. 20, 2013, 9:28 p.m. UTC | #2
Hi,

Willy Tarreau <w@1wt.eu> writes:

> From d1a00e593841223c7d871007b1e1fc528afe8e4d Mon Sep 17 00:00:00 2001
> From: Willy Tarreau <w@1wt.eu>
> Date: Wed, 20 Nov 2013 19:47:11 +0100
> Subject: EXP: net: mvneta: try to flush Tx descriptor queue upon Rx
>  interrupts
>
> Right now the mvneta driver doesn't handle Tx IRQ, and solely relies on a
> timer to flush Tx descriptors. This causes jerky output traffic with bursts
> and pauses, making it difficult to reach line rate with very few streams.
> This patch tries to improve the situation, which is complicated by the lack
> of a public datasheet from Marvell. The workaround consists in trying to flush
> pending buffers during the Rx polling. The idea is that for symmetric TCP
> traffic, ACKs received in response to the packets sent will trigger the Rx
> interrupt and will anticipate the flushing of the descriptors.
>
> The results are quite good, a single TCP stream is now capable of saturating
> a gigabit.
>
> This is only a workaround, it doesn't address asymmetric traffic nor datagram
> based traffic.
>
> Signed-off-by: Willy Tarreau <w@1wt.eu>
> ---
>  drivers/net/ethernet/marvell/mvneta.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
>
> diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
> index 5aed8ed..59e1c86 100644
> --- a/drivers/net/ethernet/marvell/mvneta.c
> +++ b/drivers/net/ethernet/marvell/mvneta.c
> @@ -2013,6 +2013,26 @@ static int mvneta_poll(struct napi_struct *napi, int budget)
>  	}
>  
>  	pp->cause_rx_tx = cause_rx_tx;
> +
> +	/* Try to flush pending Tx buffers if any */
> +	if (test_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags)) {
> +		int tx_todo = 0;
> +
> +		mvneta_tx_done_gbe(pp,
> +	                           (((1 << txq_number) - 1) &
> +	                           MVNETA_CAUSE_TXQ_SENT_DESC_ALL_MASK),
> +	                           &tx_todo);
> +
> +		if (tx_todo > 0) {
> +			mod_timer(&pp->tx_done_timer,
> +			          jiffies + msecs_to_jiffies(MVNETA_TX_DONE_TIMER_PERIOD));
> +		}
> +		else {
> +			clear_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags);
> +			del_timer(&pp->tx_done_timer);
> +		}
> +	}
> +
>  	return rx_done;
>  }

With current Linus tree (head being b4789b8e: aacraid: prevent invalid
pointer dereference), as a baseline here is what I get:

 w/ tcp_wmem left w/ default values (4096 16384 4071360)

  via netperf (TCP_MAERTS/TCP_STREAM): 151.13 / 935.50 Mbits/s
  via wget against apache: 15.4 MB/s
  via wget against nginx: 104 MB/s
 
 w/ tcp_wmem set to 4096 16384 262144:

  via netperf (TCP_MAERTS/TCP_STREAM): 919.89 / 935.50 Mbits/s
  via wget against apache: 63.3 MB/s
  via wget against nginx: 104 MB/s
 
With your patch on top of it (and tcp_wmem kept at its default value):

 via netperf: 939.16 / 935.44 Mbits/s
 via wget against apache: 65.9 MB/s (top reports 69.5 sy, 30.1 si
                                     and 72% CPU for apache2)
 via wget against nginx: 106 MB/s


With your patch and MVNETA_TX_DONE_TIMER_PERIOD set to 1 instead of 10
(still w/ tcp_wmem kept at its default value):

 via netperf: 939.12 / 935.84 Mbits/s
 via wget against apache: 63.7 MB/s
 via wget against nginx: 108 MB/s

So:

 - First, Eric's patch sitting in Linus tree does fix the regression
   I had on 3.11.7 and early 3.12 (15.4 MB/s vs 256KB/s).

 - As can be seen in the results of first test, Eric's patch still
   requires some additional tweaking of tcp_wmem to get netperf and
   apache somewhat happy w/ perfectible drivers (63.3 MB/s instead of
   15.4MB/s by setting max tcp send buffer space to 256KB for apache).

 - For unknown reasons, nginx manages to provide a 104MB/s download rate
   even with a tcp_wmem set to default and no specific patch of mvneta.

 - Now, Willy's patch seems to make netperf happy (link saturated from
   server to client), w/o tweaking tcp_wmem.

 - Again with Willy's patch, I guess the "limitations" of the platform
   (1.2GHz CPU w/ 512MB of RAM) somehow prevent Apache from saturating the
   link. All I can say is that the same test some months ago on a 1.6GHz
   ARMv5TE (kirkwood 88f6282) w/ 256MB of RAM gave me 108MB/s. I do not
   know if it is some apache regression, some mvneta vs mv643xx_eth
   difference or some CPU frequency issue, but having netperf and nginx
   happy makes me wonder about Apache.

 - Willy, setting MVNETA_TX_DONE_TIMER_PERIOD to 1 instead of 10 w/ your
   patch does not improve the already good value I get w/ your patch.


In the end if you iterate on your work to push a version of your patch
upstream, I'll be happy to test it. And thanks for the time you already
spent!

Cheers,

a+
Willy Tarreau Nov. 20, 2013, 9:54 p.m. UTC | #3
Hi Arnaud,

On Wed, Nov 20, 2013 at 10:28:50PM +0100, Arnaud Ebalard wrote:
> With current Linus tree (head being b4789b8e: aacraid: prevent invalid
> pointer dereference), as a baseline here is what I get:
> 
>  w/ tcp_wmem left w/ default values (4096 16384 4071360)
> 
>   via netperf (TCP_MAERTS/TCP_STREAM): 151.13 / 935.50 Mbits/s
>   via wget against apache: 15.4 MB/s
>   via wget against nginx: 104 MB/s
>  
>  w/ tcp_wmem set to 4096 16384 262144:
> 
>   via netperf (TCP_MAERTS/TCP_STREAM): 919.89 / 935.50 Mbits/s
>   via wget against apache: 63.3 MB/s
>   via wget against nginx: 104 MB/s
>  
> With your patch on top of it (and tcp_wmem kept at its default value):
> 
>  via netperf: 939.16 / 935.44 Mbits/s
>  via wget against apache: 65.9 MB/s (top reports 69.5 sy, 30.1 si
>                                      and 72% CPU for apache2)
>  via wget against nginx: 106 MB/s
> 
> 
> With your patch and MVNETA_TX_DONE_TIMER_PERIOD set to 1 instead of 10
> (still w/ tcp_wmem kept at its default value):
> 
>  via netperf: 939.12 / 935.84 Mbits/s
>  via wget against apache: 63.7 MB/s
>  via wget against nginx: 108 MB/s
> 
> So:
> 
>  - First, Eric's patch sitting in Linus tree does fix the regression
>    I had on 3.11.7 and early 3.12 (15.4 MB/s vs 256KB/s).
> 
>  - As can be seen in the results of first test, Eric's patch still
>    requires some additional tweaking of tcp_wmem to get netperf and
>    apache somewhat happy w/ perfectible drivers (63.3 MB/s instead of
>    15.4MB/s by setting max tcp send buffer space to 256KB for apache).
> 
>  - For unknown reasons, nginx manages to provide a 104MB/s download rate
>    even with a tcp_wmem set to default and no specific patch of mvneta.
> 
>  - Now, Willy's patch seems to make netperf happy (link saturated from
>    server to client), w/o tweaking tcp_wmem.
> 
>  - Again with Willy's patch, I guess the "limitations" of the platform
>    (1.2GHz CPU w/ 512MB of RAM) somehow prevent Apache from saturating the
>    link. All I can say is that the same test some months ago on a 1.6GHz
>    ARMv5TE (kirkwood 88f6282) w/ 256MB of RAM gave me 108MB/s. I do not
>    know if it is some apache regression, some mvneta vs mv643xx_eth
>    difference or some CPU frequency issue, but having netperf and nginx
>    happy makes me wonder about Apache.
> 
>  - Willy, setting MVNETA_TX_DONE_TIMER_PERIOD to 1 instead of 10 w/ your
>    patch does not improve the already good value I get w/ your patch.

Great, thanks for your detailed tests! Concerning Apache, it's common to
see it consume more CPU than others, which makes it more sensitive to the
limits of small devices like these ones (which BTW have a very small cache
and only a 16-bit RAM bus). Please also note that there could be a number
of other differences, such as Apache always doing TCP_NODELAY, resulting in
sending incomplete segments at the end of each buffer, which consume
slightly more descriptors.

> In the end if you iterate on your work to push a version of your patch
> upstream, I'll be happy to test it. And thanks for the time you already
> spent!

I'm currently trying to implement Tx IRQ handling. I found the register
descriptions in the neta driver provided in Marvell's LSP kernel, which
is shipped with some devices using their CPUs. That code is utterly
broken (e.g. splice fails with -EBADF), but I think the register
descriptions can be trusted.

I'd rather have real IRQ handling than just relying on mvneta_poll(), so
that we can use it for asymmetric traffic/routing/whatever.

Regards,
Willy

Patch

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 5aed8ed..59e1c86 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -2013,6 +2013,26 @@  static int mvneta_poll(struct napi_struct *napi, int budget)
 	}
 
 	pp->cause_rx_tx = cause_rx_tx;
+
+	/* Try to flush pending Tx buffers if any */
+	if (test_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags)) {
+		int tx_todo = 0;
+
+		mvneta_tx_done_gbe(pp,
+	                           (((1 << txq_number) - 1) &
+	                           MVNETA_CAUSE_TXQ_SENT_DESC_ALL_MASK),
+	                           &tx_todo);
+
+		if (tx_todo > 0) {
+			mod_timer(&pp->tx_done_timer,
+			          jiffies + msecs_to_jiffies(MVNETA_TX_DONE_TIMER_PERIOD));
+		}
+		else {
+			clear_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags);
+			del_timer(&pp->tx_done_timer);
+		}
+	}
+
 	return rx_done;
 }