diff mbox

Issue found in Armada 370: "No buffer space available" error during continuous ping

Message ID 20141201072802.GB21731@1wt.eu (mailing list archive)
State New, archived
Headers show

Commit Message

Willy Tarreau Dec. 1, 2014, 7:28 a.m. UTC
Hi Maggie,

On Mon, Dec 01, 2014 at 02:26:49PM +0800, Maggie Mae Roxas wrote:
> Hi Willy, Thomas.
> Good day.
> 
> I am reopening this discussion because we found an unusual behavior
> after using this combination that we thought was OK as discussed in
> the previous messages of this thread:
> 
> > - use 3.13.9 mvneta.c
> > - apply cd71e246c16b30e3f396a85943d5f596202737ba
> > - revert 4f3a4f701b59a3e4b5c8503ac3d905c0a326f922
> 
> Specifically, if we apply above, the "No buffer space available" error
> during continuous ping does NOT occur anymore.
> # Attached: with_patch_3_13_9_no_buffer_space_solved.txt
> 
> However, after continuous and further testing, we encounter the ff. issues:
> 1. Low throughput during iperf when Armada 370 device is set as iperf
> client. For example, in 1000Mbits/s, we only get below 140Mbits/s.

Yes that was the intent of the original fix.

We recently diagnosed the issue related to "no buffer space available".
What happens is that the "ping" utility uses a very small socket buffer.
It sends a few packets, and the NIC doesn't send interrupts until the
TX interrupt count is reached, so the Tx skbs are not freed and the
socket buffers remain full.

The only solution at the moment is to make the NIC emit an IRQ for each
Tx packet. I'm still trying to find a better way to do this (either find
a way to make the NIC emit an IRQ once the Tx queue is empty or adjust
the IRQ delay when adding more packets, though it creates a race condition).

In the mean time you can apply the attached patch. I haven't submitted it
yet only by lack of time :-(

Best regards,
Willy
From 01b23da3607dbce1d1abfe5b7f092de11ae327cf Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w@1wt.eu>
Date: Sat, 25 Oct 2014 19:12:49 +0200
Subject: net: mvneta: fix TX coalesce interrupt mode

The mvneta driver sets the amount of Tx coalesce packets to 16 by
default. Normally that does not cause any trouble since the driver
uses a much larger Tx ring size (532 packets). But some sockets
might run with very small buffers, much smaller than the equivalent
of 16 packets. This is what ping is doing for example, by setting
SNDBUF to 324 bytes rounded up to 2kB by the kernel.

The problem is that there is no documented method to force a specific
packet to emit an interrupt (eg: the last of the ring) nor is it
possible to make the NIC emit an interrupt after a given delay.

In this case, it causes trouble, because when ping sends packets over
its raw socket, the few first packets leave the system, and the first
15 packets will be emitted without an IRQ being generated, so without
the skbs being freed. And since the socket's buffer is small, there's
no way to reach that amount of packets, and the ping ends up with
"send: no buffer available" after sending 6 packets. Running with 3
instances of ping in parallel is enough to hide the problem, because
with 6 packets per instance, that's 18 packets total, which is enough
to grant a Tx interrupt before all are sent.

The original driver in the LSP kernel worked around this design flaw
by using a software timer to clean up the Tx descriptors. This timer
was slow and caused terrible network performance on some Tx-bound
workloads (such as routing) but was enough to make tools like ping
work correctly.

Instead here, we simply set the packet counts before interrupt to 1.
This ensures that each packet sent will produce an interrupt. NAPI
takes care of coalescing interrupts since the interrupt is disabled
once generated.

No measurable performance impact nor CPU usage were observed on small
nor large packets, including when saturating the link on Tx, and this
fixes tools like ping which rely on too small a send buffer.

This fix needs to be backported to stable kernels starting with 3.10.

Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 drivers/net/ethernet/marvell/mvneta.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Maggie Mae Roxas Dec. 1, 2014, 8:27 a.m. UTC | #1
Hi Willy,
Good day.

Thank you for the quick feedback.

> In the mean time you can apply the attached patch. I haven't submitted it yet only by lack of time :-(
Should we apply this patch on:
1. 3.13.9? or
2. 3.13.9 with cd71e246c16b30e3f396a85943d5f596202737ba and reverted
4f3a4f701b59a3e4b5c8503ac3d905c0a326f922?

This patch is expected resolve the low throughput and the kernel crash as well?
# Not just the "No buffer space available" error?

Thank you.

Regards,
Maggie Roxas

On Mon, Dec 1, 2014 at 3:28 PM, Willy Tarreau <w@1wt.eu> wrote:
> Hi Maggie,
>
> On Mon, Dec 01, 2014 at 02:26:49PM +0800, Maggie Mae Roxas wrote:
>> Hi Willy, Thomas.
>> Good day.
>>
>> I am reopening this discussion because we found an unusual behavior
>> after using this combination that we thought was OK as discussed in
>> the previous messages of this thread:
>>
>> > - use 3.13.9 mvneta.c
>> > - apply cd71e246c16b30e3f396a85943d5f596202737ba
>> > - revert 4f3a4f701b59a3e4b5c8503ac3d905c0a326f922
>>
>> Specifically, if we apply above, the "No buffer space available" error
>> during continuous ping does NOT occur anymore.
>> # Attached: with_patch_3_13_9_no_buffer_space_solved.txt
>>
>> However, after continuous and further testing, we encounter the ff. issues:
>> 1. Low throughput during iperf when Armada 370 device is set as iperf
>> client. For example, in 1000Mbits/s, we only get below 140Mbits/s.
>
> Yes that was the intent of the original fix.
>
> We recently diagnosed the issue related to "no buffer space available".
> What happens is that the "ping" utility uses a very small socket buffer.
> It sends a few packets, and the NIC doesn't send interrupts until the
> TX interrupt count is reached, so the Tx skbs are not freed and the
> socket buffers remain full.
>
> The only solution at the moment is to make the NIC emit an IRQ for each
> Tx packet. I'm still trying to find a better way to do this (either find
> a way to make the NIC emit an IRQ once the Tx queue is empty or adjust
> the IRQ delay when adding more packets, though it creates a race condition).
>
> In the mean time you can apply the attached patch. I haven't submitted it
> yet only by lack of time :-(
>
> Best regards,
> Willy
>
Willy Tarreau Dec. 1, 2014, 9:28 a.m. UTC | #2
On Mon, Dec 01, 2014 at 04:27:46PM +0800, Maggie Mae Roxas wrote:
> Hi Willy,
> Good day.
> 
> Thank you for the quick feedback.
> 
> > In the mean time you can apply the attached patch. I haven't submitted it yet only by lack of time :-(
> Should we apply this patch on:
> 1. 3.13.9? or
> 2. 3.13.9 with cd71e246c16b30e3f396a85943d5f596202737ba and reverted
> 4f3a4f701b59a3e4b5c8503ac3d905c0a326f922?
> 
> This patch is expected resolve the low throughput and the kernel crash as well?
> # Not just the "No buffer space available" error?

Yes absolutely. The low throughput is caused by the use of a timer instead
of an interrupt to flush Tx descriptors. The "No buffer space available"
is caused by the Tx coalesce of 16 which only flushes the buffers after 16
packets have been emitted. When socket buffers are too small for 16 packets
(eg: ping) you get the error above. Thus setting Tx coalesce to 1 fixes it
for all situations. It's slightly less performant than coalesce 16 but you
can change it using ethtool if you want (4 still works with ping and shows
better performance).

Best regards,
Willy
Thomas Petazzoni Dec. 1, 2014, 9:32 a.m. UTC | #3
Dear Willy Tarreau,

On Mon, 1 Dec 2014 10:28:51 +0100, Willy Tarreau wrote:

> > This patch is expected resolve the low throughput and the kernel crash as well?
> > # Not just the "No buffer space available" error?
> 
> Yes absolutely. The low throughput is caused by the use of a timer instead
> of an interrupt to flush Tx descriptors. The "No buffer space available"
> is caused by the Tx coalesce of 16 which only flushes the buffers after 16
> packets have been emitted. When socket buffers are too small for 16 packets
> (eg: ping) you get the error above. Thus setting Tx coalesce to 1 fixes it
> for all situations. It's slightly less performant than coalesce 16 but you
> can change it using ethtool if you want (4 still works with ping and shows
> better performance).

If I understood correctly, on RX the interrupt coalescing can be done
every X packets, or after N milliseconds. However, on TX, it's only
after Y packets, there is no way to configure a delay.

But in any case, with NAPI implemented in software, are these hardware
interrupt coalescing features very important? As soon as the number of
interrupts becomes high, the kernel will disable the interrupt and
switch to polling, no?

Best regards,

Thomas
Willy Tarreau Dec. 1, 2014, 9:58 a.m. UTC | #4
Hi Thomas,

On Mon, Dec 01, 2014 at 10:32:21AM +0100, Thomas Petazzoni wrote:
> If I understood correctly, on RX the interrupt coalescing can be done
> every X packets, or after N milliseconds. However, on TX, it's only
> after Y packets, there is no way to configure a delay.

That was my understanding of the datasheet as well.

> But in any case, with NAPI implemented in software, are these hardware
> interrupt coalescing features very important? As soon as the number of
> interrupts becomes high, the kernel will disable the interrupt and
> switch to polling, no?

Absolutely, but despite this, the interrupts still impact the system's
performance, because instead of waking up the driver once the Tx queue
is about to be empty (let's say twice per Tx queue), we wake up as often
as the system supports it. The net effect is a performance loss of about
5% on small packets, which is not huge of course, but would rather be
spent doing some more useful stuff.

Uri gave me a contact at Marvell who knows this device well, I'll ask
him if there's no other way to work with this chip. Sending an interrupt
at least when the Tx queue is empty would be nice :-/

Best regards,
Willy
Maggie Mae Roxas Dec. 1, 2014, 10:15 a.m. UTC | #5
Hi Thomas, Willy.
Good day.

Thank you for your assistance as always.

We are currently testing the attached patch + 3.13.9.
We'll let you know the results as soon as we've conducted enough tests
- latest tomorrow.

Will keep you posted.

Regards,
Maggie Roxas


On Mon, Dec 1, 2014 at 5:58 PM, Willy Tarreau <w@1wt.eu> wrote:
> Hi Thomas,
>
> On Mon, Dec 01, 2014 at 10:32:21AM +0100, Thomas Petazzoni wrote:
>> If I understood correctly, on RX the interrupt coalescing can be done
>> every X packets, or after N milliseconds. However, on TX, it's only
>> after Y packets, there is no way to configure a delay.
>
> That was my understanding of the datasheet as well.
>
>> But in any case, with NAPI implemented in software, are these hardware
>> interrupt coalescing features very important? As soon as the number of
>> interrupts becomes high, the kernel will disable the interrupt and
>> switch to polling, no?
>
> Absolutely, but despite this, the interrupts still impact the system's
> performance, because instead of waking up the driver once the Tx queue
> is about to be empty (let's say twice per Tx queue), we wake up as often
> as the system supports it. The net effect is a performance loss of about
> 5% on small packets, which is not huge of course, but would rather be
> spent doing some more useful stuff.
>
> Uri gave me a contact at Marvell who knows this device well, I'll ask
> him if there's no other way to work with this chip. Sending an interrupt
> at least when the Tx queue is empty would be nice :-/
>
> Best regards,
> Willy
>
Maggie Mae Roxas Dec. 2, 2014, 4:09 a.m. UTC | #6
Hi Willy, Thomas,
Good day.

This is just to confirm that the patch Willy sent yesterday, patched
to 3.13.9 alone, resolves these issues:
- "No buffer space available" around 17th packet
- Low throughput on iperf as client (ie, ~130 Mbits/s on 1000 Mbits/s)
- Kernel crash

We confirmed it after quite some testing.
# Full logs at:
# as iperf client: https://app.box.com/s/5pzh753dmmnqt4q8s5iy
# as iperf server: https://app.box.com/s/t9ohxrcphpmnfer9dtj9

Thank you very much for your assistance again!!! :)

Regards,
Maggie Roxas

On Mon, Dec 1, 2014 at 6:15 PM, Maggie Mae Roxas
<maggie.mae.roxas@gmail.com> wrote:
> Hi Thomas, Willy.
> Good day.
>
> Thank you for your assistance as always.
>
> We are currently testing the attached patch + 3.13.9.
> We'll let you know the results as soon as we've conducted enough tests
> - latest tomorrow.
>
> Will keep you posted.
>
> Regards,
> Maggie Roxas
>
>
> On Mon, Dec 1, 2014 at 5:58 PM, Willy Tarreau <w@1wt.eu> wrote:
>> Hi Thomas,
>>
>> On Mon, Dec 01, 2014 at 10:32:21AM +0100, Thomas Petazzoni wrote:
>>> If I understood correctly, on RX the interrupt coalescing can be done
>>> every X packets, or after N milliseconds. However, on TX, it's only
>>> after Y packets, there is no way to configure a delay.
>>
>> That was my understanding of the datasheet as well.
>>
>>> But in any case, with NAPI implemented in software, are these hardware
>>> interrupt coalescing features very important? As soon as the number of
>>> interrupts becomes high, the kernel will disable the interrupt and
>>> switch to polling, no?
>>
>> Absolutely, but despite this, the interrupts still impact the system's
>> performance, because instead of waking up the driver once the Tx queue
>> is about to be empty (let's say twice per Tx queue), we wake up as often
>> as the system supports it. The net effect is a performance loss of about
>> 5% on small packets, which is not huge of course, but would rather be
>> spent doing some more useful stuff.
>>
>> Uri gave me a contact at Marvell who knows this device well, I'll ask
>> him if there's no other way to work with this chip. Sending an interrupt
>> at least when the Tx queue is empty would be nice :-/
>>
>> Best regards,
>> Willy
>>
Willy Tarreau Dec. 2, 2014, 6:56 a.m. UTC | #7
Hi Maggie,

On Tue, Dec 02, 2014 at 12:09:05PM +0800, Maggie Mae Roxas wrote:
> Hi Willy, Thomas,
> Good day.
> 
> This is just to confirm that the patch Willy sent yesterday, patched
> to 3.13.9 alone, resolves these issues:
> - "No buffer space available" around 17th packet
> - Low throughput on iperf as client (ie, ~130 Mbits/s on 1000 Mbits/s)
> - Kernel crash
> 
> We confirmed it after quite some testing.
> # Full logs at:
> # as iperf client: https://app.box.com/s/5pzh753dmmnqt4q8s5iy
> # as iperf server: https://app.box.com/s/t9ohxrcphpmnfer9dtj9
> 
> Thank you very much for your assistance again!!! :)

thanks for your valuable feedback, I'll add your tested-by and will
send it to netdev for inclusion before I switch to something else
and forget again!

Willy
Maggie Mae Roxas Dec. 2, 2014, 7:04 a.m. UTC | #8
Hi Willy,
Good day.

No problem. Thanks again!

Regards,
Maggie Roxas

On Tue, Dec 2, 2014 at 2:56 PM, Willy Tarreau <w@1wt.eu> wrote:
> Hi Maggie,
>
> On Tue, Dec 02, 2014 at 12:09:05PM +0800, Maggie Mae Roxas wrote:
>> Hi Willy, Thomas,
>> Good day.
>>
>> This is just to confirm that the patch Willy sent yesterday, patched
>> to 3.13.9 alone, resolves these issues:
>> - "No buffer space available" around 17th packet
>> - Low throughput on iperf as client (ie, ~130 Mbits/s on 1000 Mbits/s)
>> - Kernel crash
>>
>> We confirmed it after quite some testing.
>> # Full logs at:
>> # as iperf client: https://app.box.com/s/5pzh753dmmnqt4q8s5iy
>> # as iperf server: https://app.box.com/s/t9ohxrcphpmnfer9dtj9
>>
>> Thank you very much for your assistance again!!! :)
>
> thanks for your valuable feedback, I'll add your tested-by and will
> send it to netdev for inclusion before I switch to something else
> and forget again!
>
> Willy
>
diff mbox

Patch

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 4762994..35bfba7 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -214,7 +214,7 @@ 
 /* Various constants */
 
 /* Coalescing */
-#define MVNETA_TXDONE_COAL_PKTS		16
+#define MVNETA_TXDONE_COAL_PKTS		1
 #define MVNETA_RX_COAL_PKTS		32
 #define MVNETA_RX_COAL_USEC		100