diff mbox series

[net] ibmvnic: Do not reset dql stats on NON_FATAL err

Message ID 20230622190332.29223-1-nnac123@linux.ibm.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [net] ibmvnic: Do not reset dql stats on NON_FATAL err | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 8 this patch: 8
netdev/cc_maintainers fail 2 blamed authors not CCed: kuba@kernel.org tlfalcon@linux.ibm.com; 11 maintainers not CCed: mpe@ellerman.id.au kuba@kernel.org tlfalcon@linux.ibm.com danymadden@us.ibm.com npiggin@gmail.com christophe.leroy@csgroup.eu davem@davemloft.net ricklind@linux.ibm.com pabeni@redhat.com edumazet@google.com linuxppc-dev@lists.ozlabs.org
netdev/build_clang success Errors and warnings before: 8 this patch: 8
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 8 this patch: 8
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 19 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Nick Child June 22, 2023, 7:03 p.m. UTC
All ibmvnic resets make a call to netdev_tx_reset_queue() when
re-opening the device. netdev_tx_reset_queue() resets the num_queued
and num_completed byte counters. These stats are used in Byte Queue
Limit (BQL) algorithms. The difference between these two stats tracks
the number of bytes currently sitting on the physical NIC. ibmvnic
increases the number of queued bytes through calls to
netdev_tx_sent_queue() in the driver's xmit function. When VIOS reports
that it is done transmitting bytes, the ibmvnic device increases the
number of completed bytes through calls to netdev_tx_completed_queue().
It is important to note that the driver batches its transmit calls and
num_queued is increased every time an skb is added to the next
batch, not necessarily when the batch is sent to VIOS for transmission.

Unlike other reset types, a NON_FATAL reset will not flush the sub crq
tx buffers. Therefore, it is possible for the batched skb array to be
partially full. So if there is a call to netdev_tx_reset_queue() when
re-opening the device, the value of num_queued (0) would not account
for the skbs that are currently batched. Eventually, when the batch
is sent to VIOS, the call to netdev_tx_completed_queue() would increase
num_completed to a value greater than num_queued. This causes a
BUG_ON crash:

ibmvnic 30000002: Firmware reports error, cause: adapter problem.
Starting recovery...
ibmvnic 30000002: tx error 600
ibmvnic 30000002: tx error 600
ibmvnic 30000002: tx error 600
ibmvnic 30000002: tx error 600
------------[ cut here ]------------
kernel BUG at lib/dynamic_queue_limits.c:27!
Oops: Exception in kernel mode, sig: 5
[....]
NIP dql_completed+0x28/0x1c0
LR ibmvnic_complete_tx.isra.0+0x23c/0x420 [ibmvnic]
Call Trace:
ibmvnic_complete_tx.isra.0+0x3f8/0x420 [ibmvnic] (unreliable)
ibmvnic_interrupt_tx+0x40/0x70 [ibmvnic]
__handle_irq_event_percpu+0x98/0x270
---[ end trace ]---

Therefore, do not reset the dql stats when performing a NON_FATAL reset.
Simply clearing the queue's off bit is sufficient.

Fixes: 0d973388185d ("ibmvnic: Introduce xmit_more support using batched subCRQ hcalls")
Signed-off-by: Nick Child <nnac123@linux.ibm.com>
---
 drivers/net/ethernet/ibm/ibmvnic.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

Comments

Simon Horman June 23, 2023, 7:52 a.m. UTC | #1
+ maintainers and blamed authors

On Thu, Jun 22, 2023 at 02:03:32PM -0500, Nick Child wrote:
> All ibmvnic resets, make a call to netdev_tx_reset_queue() when
> re-opening the device. netdev_tx_reset_queue() resets the num_queued
> and num_completed byte counters. These stats are used in Byte Queue
> Limit (BQL) algorithms. The difference between these two stats tracks
> the number of bytes currently sitting on the physical NIC. ibmvnic
> increases the number of queued bytes though calls to
> netdev_tx_sent_queue() in the drivers xmit function. When, VIOS reports
> that it is done transmitting bytes, the ibmvnic device increases the
> number of completed bytes through calls to netdev_tx_completed_queue().
> It is important to note that the driver batches its transmit calls and
> num_queued is increased every time that an skb is added to the next
> batch, not necessarily when the batch is sent to VIOS for transmission.
> 
> Unlike other reset types, a NON FATAL reset will not flush the sub crq
> tx buffers. Therefore, it is possible for the batched skb array to be
> partially full. So if there is call to netdev_tx_reset_queue() when
> re-opening the device, the value of num_queued (0) would not account
> for the skb's that are currently batched. Eventually, when the batch
> is sent to VIOS, the call to netdev_tx_completed_queue() would increase
> num_completed to a value greater than the num_queued. This causes a
> BUG_ON crash:
> 
> ibmvnic 30000002: Firmware reports error, cause: adapter problem.
> Starting recovery...
> ibmvnic 30000002: tx error 600
> ibmvnic 30000002: tx error 600
> ibmvnic 30000002: tx error 600
> ibmvnic 30000002: tx error 600
> ------------[ cut here ]------------
> kernel BUG at lib/dynamic_queue_limits.c:27!
> Oops: Exception in kernel mode, sig: 5
> [....]
> NIP dql_completed+0x28/0x1c0
> LR ibmvnic_complete_tx.isra.0+0x23c/0x420 [ibmvnic]
> Call Trace:
> ibmvnic_complete_tx.isra.0+0x3f8/0x420 [ibmvnic] (unreliable)
> ibmvnic_interrupt_tx+0x40/0x70 [ibmvnic]
> __handle_irq_event_percpu+0x98/0x270
> ---[ end trace ]---
> 
> Therefore, do not reset the dql stats when performing a NON_FATAL reset.
> Simply clearing the queues off bit is sufficient.
> 
> Fixes: 0d973388185d ("ibmvnic: Introduce xmit_more support using batched subCRQ hcalls")
> Signed-off-by: Nick Child <nnac123@linux.ibm.com>
> ---
>  drivers/net/ethernet/ibm/ibmvnic.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
> index c63d3ec9d328..5523ab52ff2b 100644
> --- a/drivers/net/ethernet/ibm/ibmvnic.c
> +++ b/drivers/net/ethernet/ibm/ibmvnic.c
> @@ -1816,7 +1816,18 @@ static int __ibmvnic_open(struct net_device *netdev)
>  		if (prev_state == VNIC_CLOSED)
>  			enable_irq(adapter->tx_scrq[i]->irq);
>  		enable_scrq_irq(adapter, adapter->tx_scrq[i]);
> -		netdev_tx_reset_queue(netdev_get_tx_queue(netdev, i));
> +		/* netdev_tx_reset_queue will reset dql stats and set the stacks
> +		 * flag for queue status. During NON_FATAL resets, just
> +		 * clear the bit, don't reset the stats because there could
> +		 * be batched skb's waiting to be sent. If we reset dql stats,
> +		 * we risk num_completed being greater than num_queued.
> +		 * This will cause a BUG_ON in dql_completed().
> +		 */
> +		if (adapter->reset_reason == VNIC_RESET_NON_FATAL)
> +			clear_bit(__QUEUE_STATE_STACK_XOFF,
> +				  &netdev_get_tx_queue(netdev, i)->state);
> +		else
> +			netdev_tx_reset_queue(netdev_get_tx_queue(netdev, i));
>  	}
>  
>  	rc = set_link_state(adapter, IBMVNIC_LOGICAL_LNK_UP);
> -- 
> 2.31.1
> 
>
Rick Lindsley June 23, 2023, 12:24 p.m. UTC | #2
On 6/22/23 12:03, Nick Child wrote:

> Signed-off-by: Nick Child <nnac123@linux.ibm.com>

Reviewed-by: Rick Lindsley <ricklind@us.ibm.com>
Simon Horman June 23, 2023, 12:41 p.m. UTC | #3
On Fri, Jun 23, 2023 at 09:52:26AM +0200, Simon Horman wrote:
> + maintainers and blamed authors

A second time, because something went wrong with the first attempt.

> [...]
Jakub Kicinski June 24, 2023, 10:19 p.m. UTC | #4
On Thu, 22 Jun 2023 14:03:32 -0500 Nick Child wrote:
> +		if (adapter->reset_reason == VNIC_RESET_NON_FATAL)
> +			clear_bit(__QUEUE_STATE_STACK_XOFF,
> +				  &netdev_get_tx_queue(netdev, i)->state);

Why are you trying to clear this bit?

If the completions will still come the bit will be cleared (or not)
during completion handling (netdev_tx_completed_queue() et al.)

Drivers shouldn't be poking into queue state bits directly.
Nick Child June 26, 2023, 3:45 p.m. UTC | #5
On 6/24/23 17:19, Jakub Kicinski wrote:
> On Thu, 22 Jun 2023 14:03:32 -0500 Nick Child wrote:
>> +		if (adapter->reset_reason == VNIC_RESET_NON_FATAL)
>> +			clear_bit(__QUEUE_STATE_STACK_XOFF,
>> +				  &netdev_get_tx_queue(netdev, i)->state);
> 
> Why are you trying to clear this bit?
> 
> If the completions will still come the bit will be cleared (or not)
> during completion handling (netdev_tx_completed_queue() et al.)
> 
> Drivers shouldn't be poking into queue state bits directly.

Most likely, yes, there could be some bytes that will get a completion 
which would clear this bit.

That being said, it is also possible that all bytes sent to the NIC are 
already completed, in which case we would not get a completion. The 
ibmvnic driver sends its skbs to the NIC in batches; it makes a call to 
netdev_tx_sent_queue() every time an skb is added to the batch, which is 
not necessarily every time that the batch is sent to the NIC.

I am not sure what number of bytes causes dql to set 
__QUEUE_STATE_STACK_XOFF, but I do know that it is possible for up to 15 
skbs to be sitting in the queue's batch. If there are no outstanding 
bytes on the NIC (i.e. not expecting a tx completion) and the internal 
batch has 15 references per queue, is this enough to enforce dql and set 
__QUEUE_STATE_STACK_XOFF? If so, then we must clear 
__QUEUE_STATE_STACK_XOFF when resetting.

I had a feeling this would raise some eyebrows. The main intent is to do 
everything that netdev_tx_reset_queue() does besides resetting 
statistics. Setting a "*STACK*" flag in driver code feels wrong 
(especially since a *DRV* flag exists) but I could not find an 
appropriate upper-level function. I suppose an alternative is to read 
this flag after the device finishes the reset and send the batch if 
it is set. Is this any better?

As always, thanks for the review.
Nick
Jakub Kicinski June 26, 2023, 5:33 p.m. UTC | #6
On Mon, 26 Jun 2023 10:45:38 -0500 Nick Child wrote:
> On 6/24/23 17:19, Jakub Kicinski wrote:
> > On Thu, 22 Jun 2023 14:03:32 -0500 Nick Child wrote:  
>  [...]  
> > 
> > Why are you trying to clear this bit?
> > 
> > If the completions will still come the bit will be cleared (or not)
> > during completion handling (netdev_tx_completed_queue() et al.)
> > 
> > Drivers shouldn't be poking into queue state bits directly.  
> 
> Most likely, yes there could be some bytes that will get a completion 
> which would clear this bit.
> 
> That being said, it is also possible that all bytes sent to the NIC are 
> already completed. In which case we would not get a completion. The 
> ibmvnic driver sends its skb's to the NIC in batches, it makes a call to 
> netdev_tx_sent_queue on every time an skb is added to the batch. This is 
> not necessarily every-time that the batch is sent to the NIC.

If packets can get stuck in the driver, that needs a dedicated fix.

> I am not sure what number of bytes causes dql to set 
> __QUEUE_STATE_STACK_XOFF but I do know that it is possible for up to 15 
> skb's to be sitting in the queues batch. If there are no outstanding 
> bytes on the NIC (ie not expecting a tx completion) and the internal 
> batch has 15 references per queue, is this enough to enforce dql and set 
> __QUEUE_STATE_STACK_XOFF? If so, then we must clear 
> __QUEUE_STATE_STACK_XOFF when resetting.

You should only be doing the batching if xmit_more is set, really.
And xmit_more will not be set if core is about to set
__QUEUE_STATE_STACK_XOFF. 

With a correctly written driver STACK_XOFF can only be set if we're
expecting a completion.
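The rule being described here is the standard BQL/xmit_more pattern, sketched below as kernel-style pseudocode (not from the ibmvnic driver; foo_xmit(), tx_batch_add() and tx_batch_flush() are hypothetical helpers, while the netdev_* calls are the real API):

```c
/* Sketch only: batch is flushed whenever the core stops promising
 * more packets, so STACK_XOFF can only be set with a flush pending.
 */
static netdev_tx_t foo_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct netdev_queue *txq =
		netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));

	tx_batch_add(dev, skb);

	/* May set __QUEUE_STATE_STACK_XOFF internally once the BQL
	 * limit is reached; in that case the core clears xmit_more.
	 */
	netdev_tx_sent_queue(txq, skb->len);

	/* Flush unless the core promises more packets immediately.
	 * Because STACK_XOFF implies !xmit_more, a correctly written
	 * driver always has a completion outstanding whenever
	 * STACK_XOFF is set.
	 */
	if (!netdev_xmit_more())
		tx_batch_flush(dev);

	return NETDEV_TX_OK;
}
```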

> I had a feeling this would raise some eyebrows. The main intent is to do 
> everything that netdev_tx_reset_queue() does besides resetting 
> statistics. Setting a "*STACK*" flag in driver code feels wrong 
> (especially since a *DRV* flag exists) but I could not find an 
> appropriate upper-level function. I suppose an alternative is to read 
> this flag after the device finishes the reset and sending the batch if 
> it is set. Is this any better?

AFAIU you shouldn't have to clear this flag. We need to reset DQL when
hard resetting the queue because we may lose completions. But if no
completions are lost, STACK_XOFF should be left alone. 

Just clearing that flag is not valid either because qdiscs will not be
rescheduled, so until some socket tries to xmit next packet the data
already queued will not move.
Rick Lindsley June 26, 2023, 7:06 p.m. UTC | #7
On 6/24/23 15:19, Jakub Kicinski wrote:

> Why are you trying to clear this bit?
> 
> If the completions will still come the bit will be cleared (or not)
> during completion handling (netdev_tx_completed_queue() et al.)
> 
> Drivers shouldn't be poking into queue state bits directly.


+		if (adapter->reset_reason == VNIC_RESET_NON_FATAL)
+			clear_bit(__QUEUE_STATE_STACK_XOFF,
+				  &netdev_get_tx_queue(netdev, i)->state);
+		else
+			netdev_tx_reset_queue(netdev_get_tx_queue(netdev, i));

The problem is the call to dql_reset() in netdev_tx_reset_queue(). If we reset dql stats, we risk num_completed being greater than num_queued, which will cause a BUG_ON to fire in dql_completed().

     static inline void netdev_tx_reset_queue(struct netdev_queue *q)
     {
     #ifdef CONFIG_BQL
         clear_bit(__QUEUE_STATE_STACK_XOFF, &q->state);
         dql_reset(&q->dql);
     #endif
     }

Existing code calls netdev_tx_reset_queue() unconditionally. When the error is non-fatal, though, we were tripping over the BUG_ON on those occasions when a few skbs were already submitted. The patch here is more about NOT calling dql_reset() than it is about clearing __QUEUE_STATE_STACK_XOFF. That was only included because it-was-the-other-thing-the-function-did. So ... maybe we don't care about __QUEUE_STATE_STACK_XOFF?

Nick, do we *need* to have __QUEUE_STATE_STACK_XOFF cleared?  If not, then do we need to call or emulate netdev_tx_reset_queue() at all on a non-fatal error?
Nick Child June 26, 2023, 7:23 p.m. UTC | #8
On 6/26/23 12:33, Jakub Kicinski wrote:
> 
> If packets can get stuck in the driver that needs a dedicated fix.
> 
> You should only be doing the batching if xmit_more is set, really.
> And xmit_more will not be set if core is about to set
> __QUEUE_STATE_STACK_XOFF.
> 

I am not saying that packets can get stuck in the driver; if xmit_more 
is false or the driver is hard resetting, then the batch gets sent.

During a non-fatal reset, we are simply reestablishing communications with 
the NIC and it is okay to keep the batch around. It is possible that 
there are batched skbs because xmit_more and the non-fatal reset work 
independently of each other and are not mutually exclusive. The upper 
level functions have no way of knowing when an issue from the NIC will 
occur.

> With a correctly written driver STACK_XOFF can only be set if we're
> expecting a completion.
> 
> AFAIU you shouldn't have to clear this flag. We need to reset DQL when
> hard resetting the queue because we may lose completions. But if no
> completions are lost, STACK_XOFF should be left alone.
> 
> Just clearing that flag is not valid either because qdiscs will not be
> rescheduled, so until some socket tries to xmit next packet the data
> already queued will not move.

This addresses my concern. So if STACK_XOFF gets set, then xmit_more 
will be false and our batch will get sent. This will eventually lead to 
a completion which will clear STACK_XOFF, meaning the reset should not 
have to worry about clearing STACK_XOFF.

My worry was that STACK_XOFF would get set when the batch was only 
partially full and xmit_more was true. If a non-fatal reset occurred 
here, then there would be no expected completions and a partially filled 
batch. So I thought we would have to clear STACK_XOFF in order to get 
the queue to be usable again. Sounds like this is not possible due to 
xmit_more being false if STACK_XOFF gets set.

On 6/26/23 14:06, Rick Lindsley wrote:
 >
 > Nick, do we *need* to have __QUEUE_STATE_STACK_XOFF cleared?  If not,
 > then do we need to call or emulate netdev_tx_reset_queue() at all on a
 > non-fatal error?

After reading Jakub's review, I believe we should remove the part where 
we clear STACK_XOFF.

If there are no objections, I will test these changes and send a v2.

Thanks
diff mbox series

Patch

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index c63d3ec9d328..5523ab52ff2b 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1816,7 +1816,18 @@  static int __ibmvnic_open(struct net_device *netdev)
 		if (prev_state == VNIC_CLOSED)
 			enable_irq(adapter->tx_scrq[i]->irq);
 		enable_scrq_irq(adapter, adapter->tx_scrq[i]);
-		netdev_tx_reset_queue(netdev_get_tx_queue(netdev, i));
+		/* netdev_tx_reset_queue will reset dql stats and set the stack's
+		 * flag for queue status. During NON_FATAL resets, just
+		 * clear the bit, don't reset the stats because there could
+		 * be batched skbs waiting to be sent. If we reset dql stats,
+		 * we risk num_completed being greater than num_queued.
+		 * This will cause a BUG_ON in dql_completed().
+		 */
+		if (adapter->reset_reason == VNIC_RESET_NON_FATAL)
+			clear_bit(__QUEUE_STATE_STACK_XOFF,
+				  &netdev_get_tx_queue(netdev, i)->state);
+		else
+			netdev_tx_reset_queue(netdev_get_tx_queue(netdev, i));
 	}
 
 	rc = set_link_state(adapter, IBMVNIC_LOGICAL_LNK_UP);