[net,v1,1/2] lan743x: improve performance: fix rx_napi_poll/interrupt ping-pong

Message ID 20201206034408.31492-1-TheSven73@gmail.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series [net,v1,1/2] lan743x: improve performance: fix rx_napi_poll/interrupt ping-pong

Checks

Context Check Description
netdev/cover_letter success
netdev/fixes_present fail Series targets non-next tree, but doesn't contain any Fixes tags
netdev/patch_count success
netdev/tree_selection success Clearly marked for net
netdev/subject_prefix success
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 28 lines checked
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/header_inline success
netdev/stable success Stable not CCed

Commit Message

Sven Van Asbroeck Dec. 6, 2020, 3:44 a.m. UTC
From: Sven Van Asbroeck <thesven73@gmail.com>

Even if the rx ring is completely full, and there is more rx data
waiting on the chip, the rx napi poll fn will never run more than
once - it will always immediately bail out and re-enable interrupts.
This results in a ping-pong between napi and the interrupt handler.

This defeats the purpose of napi, and is bad for performance.

Fix by addressing two separate issues:

1. Ensure the rx napi poll fn always updates the rx ring tail
   when returning, even when not re-enabling interrupts.

2. Up to half of the elements in a full rx ring are extension
   frames, which do not generate any skbs. Limit the default
   napi weight to the smallest number of skbs that can be
   generated by a full rx ring.

Tested-by: Sven Van Asbroeck <thesven73@gmail.com> # lan7430
Signed-off-by: Sven Van Asbroeck <thesven73@gmail.com>
---

Tree: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git # 905b2032fa42

To: Bryan Whitehead <bryan.whitehead@microchip.com>
To: Microchip Linux Driver Support <UNGLinuxDriver@microchip.com>
To: "David S. Miller" <davem@davemloft.net>
To: Jakub Kicinski <kuba@kernel.org>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

 drivers/net/ethernet/microchip/lan743x_main.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

Comments

Jakub Kicinski Dec. 8, 2020, 7:50 p.m. UTC | #1
On Sat,  5 Dec 2020 22:44:07 -0500 Sven Van Asbroeck wrote:
> From: Sven Van Asbroeck <thesven73@gmail.com>
> 
> Even if the rx ring is completely full, and there is more rx data
> waiting on the chip, the rx napi poll fn will never run more than
> once - it will always immediately bail out and re-enable interrupts.
> Which results in ping-pong between napi and interrupt.
> 
> This defeats the purpose of napi, and is bad for performance.
> 
> Fix by addressing two separate issues:
> 
> 1. Ensure the rx napi poll fn always updates the rx ring tail
>    when returning, even when not re-enabling interrupts.
> 
> 2. Up to half of elements in a full rx ring are extension
>    frames, which do not generate any skbs. Limit the default
>    napi weight to the smallest no. of skbs that can be generated
>    by a full rx ring.
> 
> Tested-by: Sven Van Asbroeck <thesven73@gmail.com> # lan7430
> Signed-off-by: Sven Van Asbroeck <thesven73@gmail.com>
> ---
> 
> Tree: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git # 905b2032fa42
> 
> To: Bryan Whitehead <bryan.whitehead@microchip.com>
> To: Microchip Linux Driver Support <UNGLinuxDriver@microchip.com>
> To: "David S. Miller" <davem@davemloft.net>
> To: Jakub Kicinski <kuba@kernel.org>
> Cc: Andrew Lunn <andrew@lunn.ch>
> Cc: netdev@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> 
>  drivers/net/ethernet/microchip/lan743x_main.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/microchip/lan743x_main.c b/drivers/net/ethernet/microchip/lan743x_main.c
> index 87b6c59a1e03..ebb5e0bc516b 100644
> --- a/drivers/net/ethernet/microchip/lan743x_main.c
> +++ b/drivers/net/ethernet/microchip/lan743x_main.c
> @@ -2260,10 +2260,11 @@ static int lan743x_rx_napi_poll(struct napi_struct *napi, int weight)
>  				  INT_BIT_DMA_RX_(rx->channel_number));
>  	}
>  
> +done:
>  	/* update RX_TAIL */
>  	lan743x_csr_write(adapter, RX_TAIL(rx->channel_number),
>  			  rx_tail_flags | rx->last_tail);
> -done:
> +

I assume this rings the doorbell to let the device know that more
buffers are available? If so it's a little unusual to do this at the
end of NAPI poll. The more usual place would be to do this every n
times a new buffer is allocated (in lan743x_rx_init_ring_element()?)
That's to say for example ring the doorbell every time a buffer is put
at an index divisible by 16.
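
For illustration, here is a minimal sketch of that batching scheme in C,
written against the identifiers visible in this patch. The helper name,
the rx->adapter access and the batch size are assumptions, and the
rx_tail_flags used in the real tail write are omitted; only RX_TAIL(),
lan743x_csr_write(), rx->last_tail and rx->channel_number are taken from
the driver itself.

#define LAN743X_RX_REFILL_BATCH	8

/* hypothetical refill hook, called once per re-initialized ring slot,
 * e.g. from the buffer allocation done in lan743x_rx_init_ring_element()
 */
static void lan743x_rx_refill_one(struct lan743x_rx *rx, int index)
{
	/* ... allocate and DMA-map a fresh buffer for this slot ... */

	rx->last_tail = index;

	/* ring the RX_TAIL doorbell only every Nth refilled buffer, so the
	 * CSR write is amortized over several descriptors instead of being
	 * issued once at the very end of the NAPI poll
	 */
	if ((index % LAN743X_RX_REFILL_BATCH) == 0)
		lan743x_csr_write(rx->adapter, RX_TAIL(rx->channel_number),
				  rx->last_tail);
}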

>  	return count;
>  }
>  
> @@ -2405,9 +2406,15 @@ static int lan743x_rx_open(struct lan743x_rx *rx)
>  	if (ret)
>  		goto return_error;
>  
> +	/* up to half of elements in a full rx ring are
> +	 * extension frames. these do not generate skbs.
> +	 * to prevent napi/interrupt ping-pong, limit default
> +	 * weight to the smallest no. of skbs that can be
> +	 * generated by a full rx ring.
> +	 */
>  	netif_napi_add(adapter->netdev,
>  		       &rx->napi, lan743x_rx_napi_poll,
> -		       rx->ring_size - 1);
> +		       (rx->ring_size - 1) / 2);

This is rather unusual, drivers should generally pass NAPI_POLL_WEIGHT
here.
Sven Van Asbroeck Dec. 8, 2020, 10:23 p.m. UTC | #2
On Tue, Dec 8, 2020 at 2:50 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> >
> > +done:
> >       /* update RX_TAIL */
> >       lan743x_csr_write(adapter, RX_TAIL(rx->channel_number),
> >                         rx_tail_flags | rx->last_tail);
> > -done:
> > +
>
> I assume this rings the doorbell to let the device know that more
> buffers are available? If so it's a little unusual to do this at the
> end of NAPI poll. The more usual place would be to do this every n
> times a new buffer is allocated (in lan743x_rx_init_ring_element()?)
> That's to say for example ring the doorbell every time a buffer is put
> at an index divisible by 16.

Yes, I believe it tells the device that new buffers have become available.

I wonder why it's so unusual to do this at the end of a napi poll? Omitting
this could result in sub-optimal use of buffers, right?

Example:
- tail is at position 0
- core calls napi_poll(weight=64)
- napi poll consumes 15 buffers and pushes 15 skbs, then ring empty
- index not divisible by 16, so tail is not updated
- weight not reached, so napi poll re-enables interrupts and bails out

Result: now there are 15 buffers which the device could potentially use, but
because the tail wasn't updated, it doesn't know about them.

It does make sense to update the tail more frequently than only at the end
of the napi poll, though?

I'm new to napi polling, so I'm quite interested to learn about this.


> >
> > +     /* up to half of elements in a full rx ring are
> > +      * extension frames. these do not generate skbs.
> > +      * to prevent napi/interrupt ping-pong, limit default
> > +      * weight to the smallest no. of skbs that can be
> > +      * generated by a full rx ring.
> > +      */
> >       netif_napi_add(adapter->netdev,
> >                      &rx->napi, lan743x_rx_napi_poll,
> > -                    rx->ring_size - 1);
> > +                    (rx->ring_size - 1) / 2);
>
> This is rather unusual, drivers should generally pass NAPI_POLL_WEIGHT
> here.

I agree. The problem is that a full ring buffer of 64 buffers will only
contain 32 buffers with network data - the others are timestamps.

So napi_poll(weight=64) can never reach its full weight. Even with a full
buffer, it always assumes that it has to stop polling, and re-enable
interrupts, which results in a ping-pong.

Would it be better to fix the weight counting? Increase the count
for every buffer consumed, instead of for every skb pushed?
Jakub Kicinski Dec. 8, 2020, 11:29 p.m. UTC | #3
On Tue, 8 Dec 2020 17:23:08 -0500 Sven Van Asbroeck wrote:
> On Tue, Dec 8, 2020 at 2:50 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >  
> > >
> > > +done:
> > >       /* update RX_TAIL */
> > >       lan743x_csr_write(adapter, RX_TAIL(rx->channel_number),
> > >                         rx_tail_flags | rx->last_tail);
> > > -done:
> > > +  
> >
> > I assume this rings the doorbell to let the device know that more
> > buffers are available? If so it's a little unusual to do this at the
> > end of NAPI poll. The more usual place would be to do this every n
> > times a new buffer is allocated (in lan743x_rx_init_ring_element()?)
> > That's to say for example ring the doorbell every time a buffer is put
> > at an index divisible by 16.  
> 
> Yes, I believe it tells the device that new buffers have become available.
> 
> I wonder why it's so unusual to do this at the end of a napi poll? Omitting
> this could result in sub-optimal use of buffers, right?
> 
> Example:
> - tail is at position 0
> - core calls napi_poll(weight=64)
> - napi poll consumes 15 buffers and pushes 15 skbs, then ring empty
> - index not divisible by 16, so tail is not updated
> - weight not reached, so napi poll re-enables interrupts and bails out
> 
> Result: now there are 15 buffers which the device could potentially use, but
> because the tail wasn't updated, it doesn't know about them.

Perhaps 16 for a device with 64 descriptors is rather high indeed.
Let's say 8. If the device misses 8 packet buffers on the ring, 
that should be negligible. 

It depends on the cost of the CSR write; usually packet processing
puts a lot of pressure on the memory subsystem of the CPU, hence
amortizing the write over multiple descriptors helps. The other thing
is that you could delay the descriptor writes to write full cache lines,
but I don't think that will help on IMX6.

> It does make sense to update the tail more frequently than only at the end
> of the napi poll, though?
> 
> I'm new to napi polling, so I'm quite interested to learn about this.

There is a tracepoint which records how many packets NAPI has polled:
napi:napi_poll. With it you can easily see what your system is doing.

What you want to avoid is the situation where the device already used
up all the descriptors by the time driver finishes the Rx processing.
That'd result in drops. So the driver should push the buffers back to
the device reasonably early.

With a ring of 64 descriptors and NAPI budget of 64 it's not unlikely
that the ring will be completely used when processing runs.

> > > +     /* up to half of elements in a full rx ring are
> > > +      * extension frames. these do not generate skbs.
> > > +      * to prevent napi/interrupt ping-pong, limit default
> > > +      * weight to the smallest no. of skbs that can be
> > > +      * generated by a full rx ring.
> > > +      */
> > >       netif_napi_add(adapter->netdev,
> > >                      &rx->napi, lan743x_rx_napi_poll,
> > > -                    rx->ring_size - 1);
> > > +                    (rx->ring_size - 1) / 2);  
> >
> > This is rather unusual, drivers should generally pass NAPI_POLL_WEIGHT
> > here.  
> 
> I agree. The problem is that a full ring buffer of 64 buffers will only
> contain 32 buffers with network data - the others are timestamps.
> 
> So napi_poll(weight=64) can never reach its full weight. Even with a full
> buffer, it always assumes that it has to stop polling, and re-enable
> interrupts, which results in a ping-pong.

Interesting, I don't think we ever had this problem before. Let me CC
Eric to see if he has any thoughts on the case. AFAIU you should think
of the weight as a way of arbitrating between devices (if there is more
than one). 

NAPI does not do any deferral (in wall clock time terms) of processing,
so the only difference you may get for a lower weight is that the
softirq kthread will get a chance to kick in earlier.

> Would it be better to fix the weight counting? Increase the count
> for every buffer consumed, instead of for every skb pushed?
Eric Dumazet Dec. 8, 2020, 11:50 p.m. UTC | #4
On Wed, Dec 9, 2020 at 12:29 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 8 Dec 2020 17:23:08 -0500 Sven Van Asbroeck wrote:
> > On Tue, Dec 8, 2020 at 2:50 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > >
> > > >
> > > > +done:
> > > >       /* update RX_TAIL */
> > > >       lan743x_csr_write(adapter, RX_TAIL(rx->channel_number),
> > > >                         rx_tail_flags | rx->last_tail);
> > > > -done:
> > > > +
> > >
> > > I assume this rings the doorbell to let the device know that more
> > > buffers are available? If so it's a little unusual to do this at the
> > > end of NAPI poll. The more usual place would be to do this every n
> > > times a new buffer is allocated (in lan743x_rx_init_ring_element()?)
> > > That's to say for example ring the doorbell every time a buffer is put
> > > at an index divisible by 16.
> >
> > Yes, I believe it tells the device that new buffers have become available.
> >
> > I wonder why it's so unusual to do this at the end of a napi poll? Omitting
> > this could result in sub-optimal use of buffers, right?
> >
> > Example:
> > - tail is at position 0
> > - core calls napi_poll(weight=64)
> > - napi poll consumes 15 buffers and pushes 15 skbs, then ring empty
> > - index not divisible by 16, so tail is not updated
> > - weight not reached, so napi poll re-enables interrupts and bails out
> >
> > Result: now there are 15 buffers which the device could potentially use, but
> > because the tail wasn't updated, it doesn't know about them.
>
> Perhaps 16 for a device with 64 descriptors is rather high indeed.
> Let's say 8. If the device misses 8 packet buffers on the ring,
> that should be negligible.
>

mlx4 uses 8 as the threshold ( mlx4_en_refill_rx_buffers())

> Depends on the cost of the CSR write, usually packet processing is
> putting a lot of pressure on the memory subsystem of the CPU, hence
> amortizing the write over multiple descriptors helps. The other thing
> is that you could delay the descriptor writes to write full cache lines,
> but I don't think that will help on IMX6.
>
> > It does make sense to update the tail more frequently than only at the end
> > of the napi poll, though?
> >
> > I'm new to napi polling, so I'm quite interested to learn about this.
>
> There is a tracepoint which records how many packets NAPI has polled:
> napi:napi_poll, you can see easily what your system is doing.
>
> What you want to avoid is the situation where the device already used
> up all the descriptors by the time driver finishes the Rx processing.
> That'd result in drops. So the driver should push the buffers back to
> the device reasonably early.
>
> With a ring of 64 descriptors and NAPI budget of 64 it's not unlikely
> that the ring will be completely used when processing runs.
>
> > > > +     /* up to half of elements in a full rx ring are
> > > > +      * extension frames. these do not generate skbs.
> > > > +      * to prevent napi/interrupt ping-pong, limit default
> > > > +      * weight to the smallest no. of skbs that can be
> > > > +      * generated by a full rx ring.
> > > > +      */
> > > >       netif_napi_add(adapter->netdev,
> > > >                      &rx->napi, lan743x_rx_napi_poll,
> > > > -                    rx->ring_size - 1);
> > > > +                    (rx->ring_size - 1) / 2);
> > >
> > > This is rather unusual, drivers should generally pass NAPI_POLL_WEIGHT
> > > here.
> >
> > I agree. The problem is that a full ring buffer of 64 buffers will only
> > contain 32 buffers with network data - the others are timestamps.
> >
> > So napi_poll(weight=64) can never reach its full weight. Even with a full
> > buffer, it always assumes that it has to stop polling, and re-enable
> > interrupts, which results in a ping-pong.
>
> Interesting I don't think we ever had this problem before. Let me CC
> Eric to see if he has any thoughts on the case. AFAIU you should think
> of the weight as way of arbitrating between devices (if there is more
> than one).

The driver could be called with an arbitrary budget (of 64),
and if its ring buffer has been depleted, return @budget instead of skb counts,
and not re-arm the interrupt:

if (count < budget && !rx_ring_fully_processed) {
	if (napi_complete_done(napi, count))
		rearm_irqs();
	return count;
}
return budget;


>
> NAPI does not do any deferral (in wall clock time terms) of processing,
> so the only difference you may get for lower weight is that softirq
> kthread will get a chance to kick in earlier.
>
> > Would it be better to fix the weight counting? Increase the count
> > for every buffer consumed, instead of for every skb pushed?
>
Sven Van Asbroeck Dec. 9, 2020, 12:17 a.m. UTC | #5
On Tue, Dec 8, 2020 at 6:50 PM Eric Dumazet <edumazet@google.com> wrote:
>
> Driver could be called with an arbitrary budget (of 64),
> and if its ring buffer has been depleted, return @budget instead of skb counts,
> and not re-arm the interrupt
>

Aha, so the decision to re-arm the interrupts is made by looking
at whether the device has run out of ring buffers to fill... instead
of checking whether the weight was reached!

That makes complete sense.
Thank you Eric and Jakub for your expert suggestions.
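
For reference, a minimal sketch of the completion logic Eric describes,
applied to the tail of lan743x_rx_napi_poll (budget here is the poll
function argument, which the current driver names weight). The
ring_drained flag is hypothetical - true when rx processing stopped
because no more frames were waiting, not because the budget ran out -
and the interrupt re-enable write is assumed to be the driver's existing
INT_EN_SET path, of which only the INT_BIT_DMA_RX_() argument is visible
in the hunk quoted above.

	/* push the processed buffers back to the device */
	lan743x_csr_write(adapter, RX_TAIL(rx->channel_number),
			  rx_tail_flags | rx->last_tail);

	if (count < budget && ring_drained) {
		/* ring drained before the budget was spent: complete NAPI
		 * and re-arm the rx interrupt for this channel
		 */
		if (napi_complete_done(napi, count))
			lan743x_csr_write(adapter, INT_EN_SET,
					  INT_BIT_DMA_RX_(rx->channel_number));
		return count;
	}

	/* more work may remain (or the budget was exhausted): return the
	 * full budget so the core keeps polling without an interrupt
	 */
	return budget;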

Patch

diff --git a/drivers/net/ethernet/microchip/lan743x_main.c b/drivers/net/ethernet/microchip/lan743x_main.c
index 87b6c59a1e03..ebb5e0bc516b 100644
--- a/drivers/net/ethernet/microchip/lan743x_main.c
+++ b/drivers/net/ethernet/microchip/lan743x_main.c
@@ -2260,10 +2260,11 @@  static int lan743x_rx_napi_poll(struct napi_struct *napi, int weight)
 				  INT_BIT_DMA_RX_(rx->channel_number));
 	}
 
+done:
 	/* update RX_TAIL */
 	lan743x_csr_write(adapter, RX_TAIL(rx->channel_number),
 			  rx_tail_flags | rx->last_tail);
-done:
+
 	return count;
 }
 
@@ -2405,9 +2406,15 @@  static int lan743x_rx_open(struct lan743x_rx *rx)
 	if (ret)
 		goto return_error;
 
+	/* up to half of elements in a full rx ring are
+	 * extension frames. these do not generate skbs.
+	 * to prevent napi/interrupt ping-pong, limit default
+	 * weight to the smallest no. of skbs that can be
+	 * generated by a full rx ring.
+	 */
 	netif_napi_add(adapter->netdev,
 		       &rx->napi, lan743x_rx_napi_poll,
-		       rx->ring_size - 1);
+		       (rx->ring_size - 1) / 2);
 
 	lan743x_csr_write(adapter, DMAC_CMD,
 			  DMAC_CMD_RX_SWR_(rx->channel_number));