
[BUG,REGRESSION?] 3.11.6+,3.12: GbE iface rate drops to few KB/s

Message ID 20131121004430.GX8581@1wt.eu (mailing list archive)
State New, archived

Commit Message

Willy Tarreau Nov. 21, 2013, 12:44 a.m. UTC
Hi Arnaud,

On Wed, Nov 20, 2013 at 10:54:35PM +0100, Willy Tarreau wrote:
> I'm currently trying to implement TX IRQ handling. I found the registers
> description in the neta driver that is provided in Marvell's LSP kernel
> that is shipped with some devices using their CPUs. This code is utterly
> broken (eg: splice fails with -EBADF) but I think the register descriptions
> could be trusted.
> 
> I'd rather have real IRQ handling than just relying on mvneta_poll(), so
> that we can use it for asymmetric traffic/routing/whatever.

OK it paid off. And very well :-)

I did it at once and it worked immediately. I generally don't like this
because I always fear that some bug was left there hidden in the code. I have
only tested it on the Mirabox, so I'll have to try on the OpenBlocks AX3-4 and
on the XP-GP board for some SMP stress tests.

I upgraded my Mirabox to latest Linus' git (commit 5527d151) and compared
with and without the patch.

  without :
      - need at least 12 streams to reach gigabit.
      - 60% of idle CPU remains at 1 Gbps
      - HTTP connection rate on empty objects is 9950 connections/s
      - cumulated outgoing traffic on two ports reaches 1.3 Gbps

  with the patch :
      - a single stream easily saturates the gigabit
      - 87% of idle CPU at 1 Gbps (12 streams, 90% idle at 1 stream)
      - HTTP connection rate on empty objects is 10250 connections/s
      - I saturate the two gig ports at 99% CPU, so 2 Gbps sustained output.

BTW I must say I was impressed to see that big an improvement in CPU
usage between 3.10 and 3.13, I suspect some of the Tx queue improvements
that Eric has done in between account for this.

I cut the patch in 3 parts :
   - one which reintroduces the hidden bits of the driver
   - one which replaces the timer with the IRQ
   - one which changes the default Tx coalesce from 16 to 4 packets
     (larger was preferred with the timer, but less is better now).

I'm attaching them, please test them on your device.

Note that this is *not* for inclusion at the moment as it has not been
tested on the SMP CPUs.

Cheers,
Willy
From b77b32dbffdfffc6aa21fa230054e09e2a4cd227 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w@1wt.eu>
Date: Wed, 20 Nov 2013 23:58:30 +0100
Subject: net: mvneta: add missing bit descriptions for interrupt masks and
 causes

Marvell has not published the chip's datasheet yet, so it's very hard
to find the relevant bits to manipulate to change the IRQ behaviour.
Fortunately, these bits are described in the proprietary LSP patch set
which is publicly available here :

    http://www.plugcomputer.org/downloads/mirabox/

So let's put them back in the driver in order to reduce the burden of
current and future maintenance.

Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 drivers/net/ethernet/marvell/mvneta.c | 44 +++++++++++++++++++++++++++++++++--
 1 file changed, 42 insertions(+), 2 deletions(-)

Comments

Thomas Petazzoni Nov. 21, 2013, 7:04 p.m. UTC | #1
Dear Willy Tarreau,

On Thu, 21 Nov 2013 19:38:34 +0100, Willy Tarreau wrote:

> While we were diagnosing a network performance regression that we finally
> found and fixed, it appeared during a test that Linus' tree shows a much
> higher performance on Armada 370 (armv7) than its predecessors. I can
> saturate the two Gig links of my Mirabox each with a single TCP flow and
> keep up to 25% of idle CPU in the optimal case. In 3.12.1 or 3.10.20, I
> can achieve around 1.3 Gbps when the two ports are used in parallel.

Interesting finding and analysis, once again!

> I'm not at ease with these things so I'd like to ask your opinion here, is
> this supposed to be an improvement or a fix ? Is this something we should
> backport into stable versions, or is there something to fix in the armada
> platform so that it works just as if the patch was applied ?

I guess the driver should have been setting its dma_mask to 0xffffffff,
since the platform is capable of doing DMA on the first 32 bits of the
physical address space, probably by calling
pci_set_dma_mask(pdev, DMA_BIT_MASK(32)) or something like that. I know
Russell has recently added some helpers to prevent stupid people (like
me) from making mistakes when setting the DMA masks. Certainly worth
having a look.

Best regards,

Thomas
Arnaud Ebalard Nov. 21, 2013, 9:51 p.m. UTC | #2
Hi,

Willy Tarreau <w@1wt.eu> writes:

> OK it paid off. And very well :-)
>
> I did it at once and it worked immediately. I generally don't like this
> because I always fear that some bug was left there hidden in the code. I have
> only tested it on the Mirabox, so I'll have to try on the OpenBlocks AX3-4 and
> on the XP-GP board for some SMP stress tests.
>
> I upgraded my Mirabox to latest Linus' git (commit 5527d151) and compared
> with and without the patch.
>
>   without :
>       - need at least 12 streams to reach gigabit.
>       - 60% of idle CPU remains at 1 Gbps
>       - HTTP connection rate on empty objects is 9950 connections/s
>       - cumulated outgoing traffic on two ports reaches 1.3 Gbps
>
>   with the patch :
>       - a single stream easily saturates the gigabit
>       - 87% of idle CPU at 1 Gbps (12 streams, 90% idle at 1 stream)
>       - HTTP connection rate on empty objects is 10250 connections/s
>       - I saturate the two gig ports at 99% CPU, so 2 Gbps sustained output.
>
> BTW I must say I was impressed to see that big an improvement in CPU
> usage between 3.10 and 3.13, I suspect some of the Tx queue improvements
> that Eric has done in between account for this.
>
> I cut the patch in 3 parts :
>    - one which reintroduces the hidden bits of the driver
>    - one which replaces the timer with the IRQ
>    - one which changes the default Tx coalesce from 16 to 4 packets
>      (larger was preferred with the timer, but less is better now).
>
> I'm attaching them, please test them on your device.

Well, on the RN102 (Armada 370), I get the same results as with your
previous patch, i.e. netperf and nginx saturate the link. Apache still
lagging behind though.

> Note that this is *not* for inclusion at the moment as it has not been
> tested on the SMP CPUs.

I tested it on my RN2120 (2-core armada XP): I got no problem and the
link saturated w/ apache, nginx and netperf. Good work!

Cheers,

a+
Willy Tarreau Nov. 21, 2013, 9:51 p.m. UTC | #3
Hi Thomas,

On Thu, Nov 21, 2013 at 08:04:33PM +0100, Thomas Petazzoni wrote:
> > I'm not at ease with these things so I'd like to ask your opinion here, is
> > this supposed to be an improvement or a fix ? Is this something we should
> > backport into stable versions, or is there something to fix in the armada
> > platform so that it works just as if the patch was applied ?
> 
> I guess the driver should have been setting its dma_mask to 0xffffffff,
> since the platform is capable of doing DMA on the first 32 bits of the
> physical address space, probably by calling
> pci_set_dma_mask(pdev, DMA_BIT_MASK(32)) or something like that.

Almost, yes. Thanks for the tip! There are so few drivers that do this
that I was convinced something was missing: nobody initializes dma_mask
on this platform, so calls to dma_set_mask() from drivers return -EIO
and are ignored.
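
The check behind that -EIO is trivial; here is a simplified sketch of the
mask-setting logic of that era (illustrative only, not a verbatim copy of
the kernel source):

    #include <linux/device.h>
    #include <linux/dma-mapping.h>
    #include <linux/errno.h>

    /* Simplified illustration: with dev->dma_mask left NULL by the bus
     * code, the mask can never be stored, so the call returns -EIO and
     * most drivers simply ignore the error.
     */
    static int dma_set_mask_sketch(struct device *dev, u64 mask)
    {
            if (!dev->dma_mask || !dma_supported(dev, mask))
                    return -EIO;

            *dev->dma_mask = mask;
            return 0;
    }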

I ended up with this at the end of mvneta_init() :

       /* setting DMA mask significantly improves transfer rates */
       pp->dev->dev.parent->coherent_dma_mask = DMA_BIT_MASK(32);
       pp->dev->dev.parent->dma_mask = &pp->dev->dev.parent->coherent_dma_mask;

This method changed in 3.13 with Russell's commit fa6a8d6 (DMA-API: provide
a helper to setup DMA masks), which does it in a cleaner and safer way,
using dma_coerce_mask_and_coherent().
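
In a driver's probe() this boils down to a single call; a minimal sketch
(the probe function name here is just for illustration):

    #include <linux/dma-mapping.h>
    #include <linux/platform_device.h>

    /* Sets both the streaming and coherent DMA masks in one go; the helper
     * also points dev->dma_mask at dev->coherent_dma_mask when the bus
     * code left it NULL.
     */
    static int example_probe(struct platform_device *pdev)
    {
            int ret;

            ret = dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
            if (ret)
                    return ret;

            /* ... rest of the usual probe sequence ... */
            return 0;
    }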

Then Rob's commit 0589342 (of: set dma_mask to point to coherent_dma_mask),
also merged for 3.13, pre-initialized the dma_mask to point to
&coherent_dma_mask for all devices by default.

> I know
> Russell has recently added some helpers to prevent stupid people (like
> me) from making mistakes when setting the DMA masks. Certainly worth
> having a look.

My change now allows me to proxy HTTP traffic at 1 Gbps between the two
ports of the Mirabox, whereas it is limited to 650 Mbps without the change.
But it's not needed in mainline anymore. However, it might be worth having
in older kernels (I don't know if it's suitable for stable since I don't
know whether this counts as a bug), or at least in your own kernels if you
have to maintain an older branch for some customers.

That said, I tend to believe that applying Rob's patch will be better than
just the change above since it will cover all drivers, not only mvneta.
I'll have to test on the AX3 and the XP-GP to see the performance gain in
SMP and using the PCIe.

Best regards,
Willy
Willy Tarreau Nov. 21, 2013, 9:52 p.m. UTC | #4
On Thu, Nov 21, 2013 at 10:51:09PM +0100, Arnaud Ebalard wrote:
> Hi,
> 
> Willy Tarreau <w@1wt.eu> writes:
> 
> > OK it paid off. And very well :-)
> >
> > I did it at once and it worked immediately. I generally don't like this
> > because I always fear that some bug was left there hidden in the code. I have
> > only tested it on the Mirabox, so I'll have to try on the OpenBlocks AX3-4 and
> > on the XP-GP board for some SMP stress tests.
> >
> > I upgraded my Mirabox to latest Linus' git (commit 5527d151) and compared
> > with and without the patch.
> >
> >   without :
> >       - need at least 12 streams to reach gigabit.
> >       - 60% of idle CPU remains at 1 Gbps
> >       - HTTP connection rate on empty objects is 9950 connections/s
> >       - cumulated outgoing traffic on two ports reaches 1.3 Gbps
> >
> >   with the patch :
> >       - a single stream easily saturates the gigabit
> >       - 87% of idle CPU at 1 Gbps (12 streams, 90% idle at 1 stream)
> >       - HTTP connection rate on empty objects is 10250 connections/s
> >       - I saturate the two gig ports at 99% CPU, so 2 Gbps sustained output.
> >
> > BTW I must say I was impressed to see that big an improvement in CPU
> > usage between 3.10 and 3.13, I suspect some of the Tx queue improvements
> > that Eric has done in between account for this.
> >
> > I cut the patch in 3 parts :
> >    - one which reintroduces the hidden bits of the driver
> >    - one which replaces the timer with the IRQ
> >    - one which changes the default Tx coalesce from 16 to 4 packets
> >      (larger was preferred with the timer, but less is better now).
> >
> > I'm attaching them, please test them on your device.
> 
> Well, on the RN102 (Armada 370), I get the same results as with your
> previous patch, i.e. netperf and nginx saturate the link. Apache still
> lagging behind though.
> 
> > Note that this is *not* for inclusion at the moment as it has not been
> > tested on the SMP CPUs.
> 
> I tested it on my RN2120 (2-core armada XP): I got no problem and the
> link saturated w/ apache, nginx and netperf. Good work!

Great, thanks for your tests Arnaud. I forgot to mention that all my
tests this evening involved this patch as well.

Cheers,
Willy
Eric Dumazet Nov. 21, 2013, 10 p.m. UTC | #5
On Thu, 2013-11-21 at 22:52 +0100, Willy Tarreau wrote:
> On Thu, Nov 21, 2013 at 10:51:09PM +0100, Arnaud Ebalard wrote:
> > Hi,
> > 
> > Willy Tarreau <w@1wt.eu> writes:
> > 
> > > OK it paid off. And very well :-)
> > >
> > > I did it at once and it worked immediately. I generally don't like this
> > > because I always fear that some bug was left there hidden in the code. I have
> > > only tested it on the Mirabox, so I'll have to try on the OpenBlocks AX3-4 and
> > > on the XP-GP board for some SMP stress tests.
> > >
> > > I upgraded my Mirabox to latest Linus' git (commit 5527d151) and compared
> > > with and without the patch.
> > >
> > >   without :
> > >       - need at least 12 streams to reach gigabit.
> > >       - 60% of idle CPU remains at 1 Gbps
> > >       - HTTP connection rate on empty objects is 9950 connections/s
> > >       - cumulated outgoing traffic on two ports reaches 1.3 Gbps
> > >
> > >   with the patch :
> > >       - a single stream easily saturates the gigabit
> > >       - 87% of idle CPU at 1 Gbps (12 streams, 90% idle at 1 stream)
> > >       - HTTP connection rate on empty objects is 10250 connections/s
> > >       - I saturate the two gig ports at 99% CPU, so 2 Gbps sustained output.
> > >
> > > BTW I must say I was impressed to see that big an improvement in CPU
> > > usage between 3.10 and 3.13, I suspect some of the Tx queue improvements
> > > that Eric has done in between account for this.
> > >
> > > I cut the patch in 3 parts :
> > >    - one which reintroduces the hidden bits of the driver
> > >    - one which replaces the timer with the IRQ
> > >    - one which changes the default Tx coalesce from 16 to 4 packets
> > >      (larger was preferred with the timer, but less is better now).
> > >
> > > I'm attaching them, please test them on your device.
> > 
> > Well, on the RN102 (Armada 370), I get the same results as with your
> > previous patch, i.e. netperf and nginx saturate the link. Apache still
> > lagging behind though.
> > 
> > > Note that this is *not* for inclusion at the moment as it has not been
> > > tested on the SMP CPUs.
> > 
> > I tested it on my RN2120 (2-core armada XP): I got no problem and the
> > link saturated w/ apache, nginx and netperf. Good work!
> 
> Great, thanks for your tests Arnaud. I forgot to mention that all my
> tests this evening involved this patch as well.

Now you might try to set a lower value
for /proc/sys/net/ipv4/tcp_limit_output_bytes

Ideally, a value of 8192 (instead of 131072) allows queueing less data
per TCP flow and reacting faster to losses, as retransmits don't have to
wait for the packets already sitting in the Qdisc to leave the host.

131072 bytes for an 80 Mbit flow means more than 11 ms of queueing :(
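
A quick back-of-the-envelope check of that figure, as a standalone
userspace snippet (purely illustrative, nothing kernel-specific):

    #include <stdio.h>

    int main(void)
    {
            const double limit_bytes = 131072.0; /* tcp_limit_output_bytes */
            const double rate_bps    = 80e6;     /* the 80 Mbit flow above */

            /* time needed to drain that much data already queued below TCP */
            printf("%.1f ms of queueing\n", limit_bytes * 8.0 / rate_bps * 1e3);
            return 0;                            /* prints 13.1 ms */
    }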
Rob Herring Nov. 21, 2013, 10:01 p.m. UTC | #6
On 11/21/2013 12:38 PM, Willy Tarreau wrote:
> Hi Rob,
> 
> While we were diagnosing a network performance regression that we finally
> found and fixed, it appeared during a test that Linus' tree shows a much
> higher performance on Armada 370 (armv7) than its predecessors. I can
> saturate the two Gig links of my Mirabox each with a single TCP flow and
> keep up to 25% of idle CPU in the optimal case. In 3.12.1 or 3.10.20, I
> can achieve around 1.3 Gbps when the two ports are used in parallel.
> 
> Today I bisected these kernels to find what was causing this difference.
> I found it was your patch below which I can copy entirely here :
> 
>   commit 0589342c27944e50ebd7a54f5215002b6598b748
>   Author: Rob Herring <rob.herring@calxeda.com>
>   Date:   Tue Oct 29 23:36:46 2013 -0500
> 
>       of: set dma_mask to point to coherent_dma_mask
>     
>       Platform devices created by DT code don't initialize dma_mask pointer to
>       anything. Set it to coherent_dma_mask by default if the architecture
>       code has not set it.
>     
>       Signed-off-by: Rob Herring <rob.herring@calxeda.com>
> 
>   diff --git a/drivers/of/platform.c b/drivers/of/platform.c
>   index 9b439ac..c005495 100644
>   --- a/drivers/of/platform.c
>   +++ b/drivers/of/platform.c
>   @@ -216,6 +216,8 @@ static struct platform_device *of_platform_device_create_pdata(
>           dev->archdata.dma_mask = 0xffffffffUL;
>    #endif
>           dev->dev.coherent_dma_mask = DMA_BIT_MASK(32);
>   +       if (!dev->dev.dma_mask)
>   +               dev->dev.dma_mask = &dev->dev.coherent_dma_mask;
>           dev->dev.bus = &platform_bus_type;
>           dev->dev.platform_data = platform_data;
> 
> And I can confirm that applying this patch on 3.10.20 + the fixes we found
> yesterday substantially boosted my network performance (and reduced the CPU
> usage when running on a single link).
> 
> I'm not at ease with these things so I'd like to ask your opinion here, is
> this supposed to be an improvement or a fix ? Is this something we should
> backport into stable versions, or is there something to fix in the armada
> platform so that it works just as if the patch was applied ?
> 

The patch was to fix this issue[1]. It is fixed in the core code because
dma_mask not being set has been a known issue with DT probing for some
time. Since most drivers don't seem to care, we've gotten away with it.
I thought the normal failure mode was drivers failing to probe.

As to why it helps performance, I'm not really sure. Perhaps the missing
mask was causing some bounce buffers to be used.

Rob

[1] http://lists.xen.org/archives/html/xen-devel/2013-10/msg00092.html
Willy Tarreau Nov. 21, 2013, 10:13 p.m. UTC | #7
On Thu, Nov 21, 2013 at 04:01:42PM -0600, Rob Herring wrote:
> The patch was to fix this issue[1]. It is fixed in the core code because
> dma_mask not being set has been a known issue with DT probing for some
> time. Since most drivers don't seem to care, we've gotten away with it.
> I thought the normal failure mode was drivers failing to probe.

It seems that very few drivers try to set their mask, so the default
value was probably already OK, even though less performant.

> As to why it helps performance, I'm not really sure. Perhaps the missing
> mask was causing some bounce buffers to be used.

That's also the thing I have been thinking about, and given this device
only has a 16-bit DDR bus, bounce buffers can make a difference.

> Rob
> 
> [1] http://lists.xen.org/archives/html/xen-devel/2013-10/msg00092.html

Thanks for your quick explanation Rob!
Willy
Arnaud Ebalard Nov. 21, 2013, 10:55 p.m. UTC | #8
Hi Eric,

Eric Dumazet <eric.dumazet@gmail.com> writes:

>> > I tested it on my RN2120 (2-core armada XP): I got no problem and the
>> > link saturated w/ apache, nginx and netperf. Good work!
>> 
>> Great, thanks for your tests Arnaud. I forgot to mention that all my
>> tests this evening involved this patch as well.
>
> Now you might try to set a lower value
> for /proc/sys/net/ipv4/tcp_limit_output_bytes

On the RN2120, for a file served from /run/shm (for apache and nginx):

          Apache     nginx       netperf
131072:  102 MB/s   112 MB/s   941.11 Mb/s
 65536:  102 MB/s   112 MB/s   935.97 Mb/s
 32768:  101 MB/s   105 MB/s   940.49 Mb/s
 16384:   94 MB/s    90 MB/s   770.07 Mb/s
  8192:   83 MB/s    66 MB/s   556.79 Mb/s

On the RN102, this time for apache and nginx, the file is served from
disks (ext4/lvm/raid1):

          Apache     nginx       netperf
131072:  66 MB/s   105 MB/s   925.63 Mb/s
 65536:  59 MB/s   105 MB/s   862.55 Mb/s
 32768:  62 MB/s   105 MB/s   918.99 Mb/s
 16384:  65 MB/s   105 MB/s   927.71 Mb/s
  8192:  60 MB/s   104 MB/s   915.63 Mb/s

Values above are for a single flow though.

Cheers,

a+
Rick Jones Nov. 21, 2013, 11:23 p.m. UTC | #9
On 11/21/2013 02:55 PM, Arnaud Ebalard wrote:
> On the RN2120, for a file served from /run/shm (for apache and nginx):
>
>            Apache     nginx       netperf
> 131072:  102 MB/s   112 MB/s   941.11 Mb/s
>   65536:  102 MB/s   112 MB/s   935.97 Mb/s
>   32768:  101 MB/s   105 MB/s   940.49 Mb/s
>   16384:   94 MB/s    90 MB/s   770.07 Mb/s
>    8192:   83 MB/s    66 MB/s   556.79 Mb/s

If you want to make the units common across all three tests, netperf 
accepts a global -f option to alter the output units.  If you add -f M 
netperf will then emit results in MB/s (M == 1048576).  I'm assuming of 
course that the MB/s of Apache and nginx are also M == 1048576.
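
For instance, the 941.11 Mbit/s netperf result above works out to roughly
112 MB/s with M == 1048576, right in line with the nginx figure (quick
standalone check, purely illustrative):

    #include <stdio.h>

    int main(void)
    {
            const double mbit_per_s = 941.11; /* netperf result, 10^6 bits/s */

            /* netperf's -f M unit: 1 MB == 1048576 bytes */
            printf("%.1f MB/s\n", mbit_per_s * 1e6 / 8.0 / 1048576.0);
            return 0;                         /* prints 112.2 MB/s */
    }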

happy benchmarking,

rick jones

Patch

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index b8e232b..6630690 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -101,16 +101,56 @@ 
 #define      MVNETA_CPU_RXQ_ACCESS_ALL_MASK      0x000000ff
 #define      MVNETA_CPU_TXQ_ACCESS_ALL_MASK      0x0000ff00
 #define MVNETA_RXQ_TIME_COAL_REG(q)              (0x2580 + ((q) << 2))
+
+/* Exception Interrupt Port/Queue Cause register */
+
 #define MVNETA_INTR_NEW_CAUSE                    0x25a0
-#define      MVNETA_RX_INTR_MASK(nr_rxqs)        (((1 << nr_rxqs) - 1) << 8)
 #define MVNETA_INTR_NEW_MASK                     0x25a4
+
+/* bits  0..7  = TXQ SENT, one bit per queue.
+ * bits  8..15 = RXQ OCCUP, one bit per queue.
+ * bits 16..23 = RXQ FREE, one bit per queue.
+ * bit  29 = OLD_REG_SUM, see old reg ?
+ * bit  30 = TX_ERR_SUM, one bit for 4 ports
+ * bit  31 = MISC_SUM,   one bit for 4 ports
+ */
+#define      MVNETA_TX_INTR_MASK(nr_txqs)        (((1 << nr_txqs) - 1) << 0)
+#define      MVNETA_TX_INTR_MASK_ALL             (0xff << 0)
+#define      MVNETA_RX_INTR_MASK(nr_rxqs)        (((1 << nr_rxqs) - 1) << 8)
+#define      MVNETA_RX_INTR_MASK_ALL             (0xff << 8)
+
 #define MVNETA_INTR_OLD_CAUSE                    0x25a8
 #define MVNETA_INTR_OLD_MASK                     0x25ac
+
+/* Data Path Port/Queue Cause Register */
 #define MVNETA_INTR_MISC_CAUSE                   0x25b0
 #define MVNETA_INTR_MISC_MASK                    0x25b4
+
+#define      MVNETA_CAUSE_PHY_STATUS_CHANGE      BIT(0)
+#define      MVNETA_CAUSE_LINK_CHANGE            BIT(1)
+#define      MVNETA_CAUSE_PTP                    BIT(4)
+
+#define      MVNETA_CAUSE_INTERNAL_ADDR_ERR      BIT(7)
+#define      MVNETA_CAUSE_RX_OVERRUN             BIT(8)
+#define      MVNETA_CAUSE_RX_CRC_ERROR           BIT(9)
+#define      MVNETA_CAUSE_RX_LARGE_PKT           BIT(10)
+#define      MVNETA_CAUSE_TX_UNDERUN             BIT(11)
+#define      MVNETA_CAUSE_PRBS_ERR               BIT(12)
+#define      MVNETA_CAUSE_PSC_SYNC_CHANGE        BIT(13)
+#define      MVNETA_CAUSE_SERDES_SYNC_ERR        BIT(14)
+
+#define      MVNETA_CAUSE_BMU_ALLOC_ERR_SHIFT    16
+#define      MVNETA_CAUSE_BMU_ALLOC_ERR_ALL_MASK   (0xF << MVNETA_CAUSE_BMU_ALLOC_ERR_SHIFT)
+#define      MVNETA_CAUSE_BMU_ALLOC_ERR_MASK(pool) (1 << (MVNETA_CAUSE_BMU_ALLOC_ERR_SHIFT + (pool)))
+
+#define      MVNETA_CAUSE_TXQ_ERROR_SHIFT        24
+#define      MVNETA_CAUSE_TXQ_ERROR_ALL_MASK     (0xFF << MVNETA_CAUSE_TXQ_ERROR_SHIFT)
+#define      MVNETA_CAUSE_TXQ_ERROR_MASK(q)      (1 << (MVNETA_CAUSE_TXQ_ERROR_SHIFT + (q)))
+
 #define MVNETA_INTR_ENABLE                       0x25b8
 #define      MVNETA_TXQ_INTR_ENABLE_ALL_MASK     0x0000ff00
-#define      MVNETA_RXQ_INTR_ENABLE_ALL_MASK     0xff000000
+#define      MVNETA_RXQ_INTR_ENABLE_ALL_MASK     0xff000000  // note: neta says it's 0x000000FF
+
 #define MVNETA_RXQ_CMD                           0x2680
 #define      MVNETA_RXQ_DISABLE_SHIFT            8
 #define      MVNETA_RXQ_ENABLE_MASK              0x000000ff