diff mbox series

[V2,net] ibmvnic: Continue with reset if set link down failed

Message ID 20210420213517.24171-1-drt@linux.ibm.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series [V2,net] ibmvnic: Continue with reset if set link down failed | expand

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for net
netdev/subject_prefix success Link
netdev/cc_maintainers fail 1 blamed authors not CCed: nfont@linux.vnet.ibm.com; 1 maintainers not CCed: nfont@linux.vnet.ibm.com
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/kdoc success Errors and warnings before: 3 this patch: 3
netdev/verify_fixes success Link
netdev/checkpatch warning WARNING: line length of 100 exceeds 80 columns
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/header_inline success Link

Commit Message

Dany Madden April 20, 2021, 9:35 p.m. UTC
When ibmvnic gets a FATAL error message from the vnicserver, it marks
the Command Respond Queue (CRQ) inactive and resets the adapter. If this
FATAL reset fails and a transmission timeout reset follows, the CRQ is
still inactive, ibmvnic's attempt to set link down will also fail. If
ibmvnic abandons the reset because of this failed set link down and this
is the last reset in the workqueue, then this adapter will be left in an
inoperable state.

Instead, make the driver ignore this link down failure and continue to
free and re-register CRQ so that the adapter has an opportunity to
recover.

Fixes: ed651a10875f ("ibmvnic: Updated reset handling")
Signed-off-by: Dany Madden <drt@linux.ibm.com>
Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com>
Reviewed-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
---
Changes in V2:
- Update description to clarify background for the patch
- Include Reviewed-by tags
---
 drivers/net/ethernet/ibm/ibmvnic.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Comments

Lijun Pan April 20, 2021, 9:42 p.m. UTC | #1
> On Apr 20, 2021, at 4:35 PM, Dany Madden <drt@linux.ibm.com> wrote:
> 
> When ibmvnic gets a FATAL error message from the vnicserver, it marks
> the Command Respond Queue (CRQ) inactive and resets the adapter. If this
> FATAL reset fails and a transmission timeout reset follows, the CRQ is
> still inactive, ibmvnic's attempt to set link down will also fail. If
> ibmvnic abandons the reset because of this failed set link down and this
> is the last reset in the workqueue, then this adapter will be left in an
> inoperable state.
> 
> Instead, make the driver ignore this link down failure and continue to
> free and re-register CRQ so that the adapter has an opportunity to
> recover.

This v2 does not adddress the concerns mentioned in v1.
And I think it is better to exit with error from do_reset, and schedule a thorough
do_hard_reset if the the adapter is already in unstable state.

> 
> Fixes: ed651a10875f ("ibmvnic: Updated reset handling")
> Signed-off-by: Dany Madden <drt@linux.ibm.com>
> Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com>
> Reviewed-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
> ---
> Changes in V2:
> - Update description to clarify background for the patch
> - Include Reviewed-by tags
> ---
> drivers/net/ethernet/ibm/ibmvnic.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
> index ffb2a91750c7..4bd8c5d1a275 100644
> --- a/drivers/net/ethernet/ibm/ibmvnic.c
> +++ b/drivers/net/ethernet/ibm/ibmvnic.c
> @@ -1970,8 +1970,10 @@ static int do_reset(struct ibmvnic_adapter *adapter,
> 			rtnl_unlock();
> 			rc = set_link_state(adapter, IBMVNIC_LOGICAL_LNK_DN);
> 			rtnl_lock();
> -			if (rc)
> -				goto out;
> +			if (rc) {
> +				netdev_dbg(netdev,
> +					   "Setting link down failed rc=%d. Continue anyway\n", rc);
> +			}
> 
> 			if (adapter->state == VNIC_OPEN) {
> 				/* When we dropped rtnl, ibmvnic_open() got
> -- 
> 2.26.2
>
Sukadev Bhattiprolu April 21, 2021, 6:45 a.m. UTC | #2
Lijun Pan [ljp@linux.vnet.ibm.com] wrote:
> 
> 
> > On Apr 20, 2021, at 4:35 PM, Dany Madden <drt@linux.ibm.com> wrote:
> > 
> > When ibmvnic gets a FATAL error message from the vnicserver, it marks
> > the Command Respond Queue (CRQ) inactive and resets the adapter. If this
> > FATAL reset fails and a transmission timeout reset follows, the CRQ is
> > still inactive, ibmvnic's attempt to set link down will also fail. If
> > ibmvnic abandons the reset because of this failed set link down and this
> > is the last reset in the workqueue, then this adapter will be left in an
> > inoperable state.
> > 
> > Instead, make the driver ignore this link down failure and continue to
> > free and re-register CRQ so that the adapter has an opportunity to
> > recover.
> 
> This v2 does not adddress the concerns mentioned in v1.
> And I think it is better to exit with error from do_reset, and schedule a thorough
> do_hard_reset if the the adapter is already in unstable state.

We had a FATAL error and when handling it, we failed to send a 
link-down message to the VIOS. So what we need to try next is to 
reset the connection with the VIOS. For this we must talk to the 
firmware using the H_FREE_CRQ and H_REG_CRQ hcalls. do_reset()
does just that in ibmvnic_reset_crq().

Now, sure we can attempt a "thorough hard reset" which also does
the same hcalls to reestablish the connection. Is there any
other magic in do_hard_reset()? But in addition, it also frees lot
more Linux kernel buffers and reallocates them for instance.

If we are having a communication problem with the VIOS, what is
the point of freeing and reallocating Linux kernel buffers? Beside
being inefficient, it would expose us to even more errors during
reset under heavy workloads?

From what I understand so far, do_reset() is complicated because
it is attempting some optimizations.  If we are going to fall back
to hard reset for every error we might as well drop the do_reset()
and just do the "thorough hard reset" every time right?

The protocol spec is ambiguous and so far I did not get a clear
answer on whether the link-down is even needed. If it is needed,
then should we add it to do_hard_reset() also? If not, we should
remove it (like you mentioned your earlier) completely but am
waiting for confirmation on that. git history has not been helpful.

While there are other rough edges around do_reset() that we are
working on fixing separately (eg: ignore the error return from 
__ibmvnic_close() right above this change) I see a benefit to
the customer with this patch.

I am not convinced we should perform a hard reset just because
the link down failed.

Sukadev
Rick Lindsley April 21, 2021, 7:54 a.m. UTC | #3
On 4/20/21 2:42 PM, Lijun Pan wrote:
> 
> This v2 does not adddress the concerns mentioned in v1.
> And I think it is better to exit with error from do_reset, and schedule a thorough
> do_hard_reset if the the adapter is already in unstable state.

But the point is that the testing and analysis has indicated that doing a full
hard reset is not necessary. We are about to take the very action which will fix
this situation, but currently do not.

Please describe the advantage in deferring it further by routing it through
do_hard_reset().  I don't see one.

Rick
Lijun Pan April 22, 2021, 5:06 a.m. UTC | #4
On Wed, Apr 21, 2021 at 2:25 AM Sukadev Bhattiprolu
<sukadev@linux.ibm.com> wrote:
>
> Lijun Pan [ljp@linux.vnet.ibm.com] wrote:
> >
> >
> > > On Apr 20, 2021, at 4:35 PM, Dany Madden <drt@linux.ibm.com> wrote:
> > >
> > > When ibmvnic gets a FATAL error message from the vnicserver, it marks
> > > the Command Respond Queue (CRQ) inactive and resets the adapter. If this
> > > FATAL reset fails and a transmission timeout reset follows, the CRQ is
> > > still inactive, ibmvnic's attempt to set link down will also fail. If
> > > ibmvnic abandons the reset because of this failed set link down and this
> > > is the last reset in the workqueue, then this adapter will be left in an
> > > inoperable state.
> > >
> > > Instead, make the driver ignore this link down failure and continue to
> > > free and re-register CRQ so that the adapter has an opportunity to
> > > recover.
> >
> > This v2 does not adddress the concerns mentioned in v1.
> > And I think it is better to exit with error from do_reset, and schedule a thorough
> > do_hard_reset if the the adapter is already in unstable state.
>
> We had a FATAL error and when handling it, we failed to send a
> link-down message to the VIOS. So what we need to try next is to
> reset the connection with the VIOS. For this we must talk to the
> firmware using the H_FREE_CRQ and H_REG_CRQ hcalls. do_reset()
> does just that in ibmvnic_reset_crq().
>
> Now, sure we can attempt a "thorough hard reset" which also does
> the same hcalls to reestablish the connection. Is there any
> other magic in do_hard_reset()? But in addition, it also frees lot
> more Linux kernel buffers and reallocates them for instance.

Working around everything in do_reset will make the code very difficult
to manage. Ultimately do_reset can do anything I am afraid, and do_hard_reset
can be removed completely or merged into do_reset.

>
> If we are having a communication problem with the VIOS, what is
> the point of freeing and reallocating Linux kernel buffers? Beside
> being inefficient, it would expose us to even more errors during
> reset under heavy workloads?

No real customer runs the system under that heavy load created by
HTX stress test, which can tear down any working system.

>
> From what I understand so far, do_reset() is complicated because
> it is attempting some optimizations.  If we are going to fall back
> to hard reset for every error we might as well drop the do_reset()
> and just do the "thorough hard reset" every time right?

I think such optimizations are catered for passing HTX tests. Whether
the optimization benefits the adapter, say making the adapter more stable,
I doubt it. I think there should be a trade off between optimization
and stability.
Lijun Pan April 22, 2021, 5:12 a.m. UTC | #5
On Wed, Apr 21, 2021 at 3:06 AM Rick Lindsley
<ricklind@linux.vnet.ibm.com> wrote:
>
> On 4/20/21 2:42 PM, Lijun Pan wrote:
> >
> > This v2 does not adddress the concerns mentioned in v1.
> > And I think it is better to exit with error from do_reset, and schedule a thorough
> > do_hard_reset if the the adapter is already in unstable state.
>
> But the point is that the testing and analysis has indicated that doing a full
> hard reset is not necessary. We are about to take the very action which will fix
> this situation, but currently do not.

The testing was done on this patch. It was not performed on a full hard reset.
So I don't think you could even compare the two results.

>
> Please describe the advantage in deferring it further by routing it through
> do_hard_reset().  I don't see one.

It is not deferred. It exits with error and calls do_hard_reset.
See my reply to Suka's.
Lijun Pan April 22, 2021, 5:30 a.m. UTC | #6
On Tue, Apr 20, 2021 at 4:37 PM Dany Madden <drt@linux.ibm.com> wrote:
>
> When ibmvnic gets a FATAL error message from the vnicserver, it marks
> the Command Respond Queue (CRQ) inactive and resets the adapter. If this
> FATAL reset fails and a transmission timeout reset follows, the CRQ is
> still inactive, ibmvnic's attempt to set link down will also fail. If
> ibmvnic abandons the reset because of this failed set link down and this
> is the last reset in the workqueue, then this adapter will be left in an
> inoperable state.
>
> Instead, make the driver ignore this link down failure and continue to
> free and re-register CRQ so that the adapter has an opportunity to
> recover.
>
> Fixes: ed651a10875f ("ibmvnic: Updated reset handling")
> Signed-off-by: Dany Madden <drt@linux.ibm.com>
> Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com>
> Reviewed-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>

One thing I would like to point out as already pointed out by Nathan Lynch is
that those review-by tags given by the same groups of people from the same
company loses credibility over time if you never critique or ask
questions on the list.
Sukadev Bhattiprolu April 22, 2021, 6:58 a.m. UTC | #7
Lijun Pan [lijunp213@gmail.com] wrote:
> > Now, sure we can attempt a "thorough hard reset" which also does
> > the same hcalls to reestablish the connection. Is there any
> > other magic in do_hard_reset()? But in addition, it also frees lot
> > more Linux kernel buffers and reallocates them for instance.
> 
> Working around everything in do_reset will make the code very difficult

We are not working around everything. We are doing in do_reset()
exactly what we would do in hard reset for this error (ignore the
set link down error and try to reestablish the connection with the
VIOS).

What we are avoiding is unnecessary work on the Linux side for a
communication problem on the VIOS side.

> to manage. Ultimately do_reset can do anything I am afraid, and do_hard_reset
> can be removed completely or merged into do_reset.
> 
> >
> > If we are having a communication problem with the VIOS, what is
> > the point of freeing and reallocating Linux kernel buffers? Beside
> > being inefficient, it would expose us to even more errors during
> > reset under heavy workloads?
> 
> No real customer runs the system under that heavy load created by
> HTX stress test, which can tear down any working system.

We need to talk to capacity planning and test architects about that,
but all I want to know is what hard reset would do differently to
fix this communication error with VIOS.

Sukadev
Rick Lindsley April 22, 2021, 7:05 a.m. UTC | #8
On Wed, Apr 21, 2021 at 3:06 AM Rick Lindsley
<ricklind@linux.vnet.ibm.com> wrote:

>> Please describe the advantage in deferring it further by routing it through
>> do_hard_reset().  I don't see one.

On 4/21/21 10:12 PM, Lijun Pan replied:

> It is not deferred. It exits with error and calls do_hard_reset.
> See my reply to Suka's.

I saw your reply, but it does not answer the question I asked.  The patch
would have us reinitialize and restart the communication queues.  Your
suggestion would do more work than that.  Please describe the advantage
in deferring the reinitialization - and yes, defer is the right word -
by routing it through do_hard_reset() and executing that extra code.  I
see that route as doing more work than necessary and so introducing
additional risk, for no clear advantage.  So I would find it helpful
if you would describe the advantage.

> The testing was done on this patch. It was not performed on a full hard reset.
> So I don't think you could even compare the two results.

A problem has been noted, a patch has been proposed, and the reasoning that
the patch is correct has been given.  Testing with this patch has
demonstrated the problem has not returned.  So far, that sounds like a
pretty reasonable solution.

Your comment is curious - why would testing for this patch be done on a full
hard reset when this patch does not invoke a full hard reset?  If you have
data to consider then let's have it. I'm willing to be convinced, but so far
this just sounds like "I wouldn't do it that way myself, and I have a bad
feeling about doing it any other way."

Rick
Rick Lindsley April 22, 2021, 7:05 a.m. UTC | #9
On 4/21/21 10:06 PM, Lijun Pan wrote:
> No real customer runs the system under that heavy load created by
> HTX stress test, which can tear down any working system.

So, are you saying the bugs that HTX uncovers are not worth fixing?

There's a fair number of individuals and teams out there that
utilize HTX in the absence of such a workload, and not just for vnic.
What workload would you suggest to better serve "real customers"?

> I think such optimizations are catered for passing HTX tests.

Well, yes.  If the bugs are found with HTX, the fixes are verified
against HTX.  If they were found with a "real customer workload" they'd
be verified against a "real customer workload."
Rick Lindsley April 22, 2021, 7:07 a.m. UTC | #10
On 4/21/21 10:30 PM, Lijun Pan wrote:
>> Fixes: ed651a10875f ("ibmvnic: Updated reset handling")
>> Signed-off-by: Dany Madden <drt@linux.ibm.com>
>> Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com>
>> Reviewed-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
> 
> One thing I would like to point out as already pointed out by Nathan Lynch is
> that those review-by tags given by the same groups of people from the same
> company loses credibility over time if you never critique or ask
> questions on the list.
> 

Well, so far you aren't addressing either my critiques or questions.

I have been asking questions but all I have from you are the above
attempts to discredit the reputation of myself and other people, and
non-technical statements like

     will make the code very difficult to manage
     I think there should be a trade off between optimization and stability.
     So I don't think you could even compare the two results

On the other hand, from the original submission I see some very specific
details:

     If ibmvnic abandons the reset because of this failed set link
     down and this is the last reset in the workqueue, then this
     adapter will be left in an inoperable state.

and from a followup discussion:

     We had a FATAL error and when handling it, we failed to
     send a link-down message to the VIOS. So what we need
     to try next is to reset the connection with the VIOS. For
     this we must ...

These are great technical points that could be argued or discussed.
Problem is, I agree with them.

I will ask again:  can you please supply some technical reasons for
your objections.  Otherwise, your objections are meritless and at worst
simply an ad hominem attack.

Rick
Lijun Pan April 22, 2021, 5:01 p.m. UTC | #11
On Thu, Apr 22, 2021 at 2:07 AM Rick Lindsley
<ricklind@linux.vnet.ibm.com> wrote:
>
> On 4/21/21 10:30 PM, Lijun Pan wrote:
> >> Fixes: ed651a10875f ("ibmvnic: Updated reset handling")
> >> Signed-off-by: Dany Madden <drt@linux.ibm.com>
> >> Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com>
> >> Reviewed-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
> >
> > One thing I would like to point out as already pointed out by Nathan Lynch is
> > that those review-by tags given by the same groups of people from the same
> > company loses credibility over time if you never critique or ask
> > questions on the list.
> >
>
> Well, so far you aren't addressing either my critiques or questions.
>
> I have been asking questions but all I have from you are the above
> attempts to discredit the reputation of myself and other people, and
> non-technical statements like
>
>      will make the code very difficult to manage
>      I think there should be a trade off between optimization and stability.
>      So I don't think you could even compare the two results
>
> On the other hand, from the original submission I see some very specific
> details:
>
>      If ibmvnic abandons the reset because of this failed set link
>      down and this is the last reset in the workqueue, then this
>      adapter will be left in an inoperable state.
>
> and from a followup discussion:
>
>      We had a FATAL error and when handling it, we failed to
>      send a link-down message to the VIOS. So what we need
>      to try next is to reset the connection with the VIOS. For
>      this we must ...
>
> These are great technical points that could be argued or discussed.
> Problem is, I agree with them.
>
> I will ask again:  can you please supply some technical reasons for
> your objections.  Otherwise, your objections are meritless and at worst
> simply an ad hominem attack.

Well, from the beginning of v1, I started to provide technical inputs.
Then I was not
allowed to post anything in the community about this patch and VNIC
via ljp@linux.ibm.com except giving an ack-by/reviewed-by.
Michal Suchánek April 22, 2021, 5:21 p.m. UTC | #12
Hello,

On Thu, Apr 22, 2021 at 12:06:45AM -0500, Lijun Pan wrote:
> On Wed, Apr 21, 2021 at 2:25 AM Sukadev Bhattiprolu
> <sukadev@linux.ibm.com> wrote:
> >
> > Lijun Pan [ljp@linux.vnet.ibm.com] wrote:
> > >
> > >
> > > > On Apr 20, 2021, at 4:35 PM, Dany Madden <drt@linux.ibm.com> wrote:
> > > >
> > > > When ibmvnic gets a FATAL error message from the vnicserver, it marks
> > > > the Command Respond Queue (CRQ) inactive and resets the adapter. If this
> > > > FATAL reset fails and a transmission timeout reset follows, the CRQ is
> > > > still inactive, ibmvnic's attempt to set link down will also fail. If
> > > > ibmvnic abandons the reset because of this failed set link down and this
> > > > is the last reset in the workqueue, then this adapter will be left in an
> > > > inoperable state.
> > > >
> > > > Instead, make the driver ignore this link down failure and continue to
> > > > free and re-register CRQ so that the adapter has an opportunity to
> > > > recover.
> > >
> > > This v2 does not adddress the concerns mentioned in v1.
> > > And I think it is better to exit with error from do_reset, and schedule a thorough
> > > do_hard_reset if the the adapter is already in unstable state.
> >
> > We had a FATAL error and when handling it, we failed to send a
> > link-down message to the VIOS. So what we need to try next is to
> > reset the connection with the VIOS. For this we must talk to the
> > firmware using the H_FREE_CRQ and H_REG_CRQ hcalls. do_reset()
> > does just that in ibmvnic_reset_crq().
> >
> > Now, sure we can attempt a "thorough hard reset" which also does
> > the same hcalls to reestablish the connection. Is there any
> > other magic in do_hard_reset()? But in addition, it also frees lot
> > more Linux kernel buffers and reallocates them for instance.
> 
> Working around everything in do_reset will make the code very difficult
> to manage. Ultimately do_reset can do anything I am afraid, and do_hard_reset
> can be removed completely or merged into do_reset.

This debate is not very constructive.

In the context of driver that has separate do_reset and do_hard_reset
this fix picks the correct one unless you can refute the arguments
provided.

Merging do_reset and do_hard_reset might be a good code cleanup which is
out of the scope of this fix.



Given that vast majority of fixes to the vnic driver are related to the
reset handling it would improve stability and testability if every
reset took the same code path.

In the context of merging do_hard_reset and do_reset the question is
what is the intended distinction and performance gain by having
'lightweight' reset.

I don't have a vnic protocol manual at hand and I suspect I would not
get one even if I searched for one.

From reading through the fixes in the past my understanding is that the
full reset is required when the backend changes which then potentially
requires different size/number of buffers.

What is the expected situation when reset is required without changing
the backend?

Is this so common that it warrants a separate 'lightweight' optimized
function?

Thanks

Michal
Lijun Pan April 22, 2021, 5:38 p.m. UTC | #13
On Thu, Apr 22, 2021 at 12:21 PM Michal Suchánek <msuchanek@suse.de> wrote:
>
> Hello,
>
> On Thu, Apr 22, 2021 at 12:06:45AM -0500, Lijun Pan wrote:
> > On Wed, Apr 21, 2021 at 2:25 AM Sukadev Bhattiprolu
> > <sukadev@linux.ibm.com> wrote:
> > >
> > > Lijun Pan [ljp@linux.vnet.ibm.com] wrote:
> > > >
> > > >
> > > > > On Apr 20, 2021, at 4:35 PM, Dany Madden <drt@linux.ibm.com> wrote:
> > > > >
> > > > > When ibmvnic gets a FATAL error message from the vnicserver, it marks
> > > > > the Command Respond Queue (CRQ) inactive and resets the adapter. If this
> > > > > FATAL reset fails and a transmission timeout reset follows, the CRQ is
> > > > > still inactive, ibmvnic's attempt to set link down will also fail. If
> > > > > ibmvnic abandons the reset because of this failed set link down and this
> > > > > is the last reset in the workqueue, then this adapter will be left in an
> > > > > inoperable state.
> > > > >
> > > > > Instead, make the driver ignore this link down failure and continue to
> > > > > free and re-register CRQ so that the adapter has an opportunity to
> > > > > recover.
> > > >
> > > > This v2 does not adddress the concerns mentioned in v1.
> > > > And I think it is better to exit with error from do_reset, and schedule a thorough
> > > > do_hard_reset if the the adapter is already in unstable state.
> > >
> > > We had a FATAL error and when handling it, we failed to send a
> > > link-down message to the VIOS. So what we need to try next is to
> > > reset the connection with the VIOS. For this we must talk to the
> > > firmware using the H_FREE_CRQ and H_REG_CRQ hcalls. do_reset()
> > > does just that in ibmvnic_reset_crq().
> > >
> > > Now, sure we can attempt a "thorough hard reset" which also does
> > > the same hcalls to reestablish the connection. Is there any
> > > other magic in do_hard_reset()? But in addition, it also frees lot
> > > more Linux kernel buffers and reallocates them for instance.
> >
> > Working around everything in do_reset will make the code very difficult
> > to manage. Ultimately do_reset can do anything I am afraid, and do_hard_reset
> > can be removed completely or merged into do_reset.

Michal,

I should have given more details about the above statement. Thanks for
your detailed info in the below.

>
> This debate is not very constructive.
>
> In the context of driver that has separate do_reset and do_hard_reset
> this fix picks the correct one unless you can refute the arguments
> provided.
>
> Merging do_reset and do_hard_reset might be a good code cleanup which is
> out of the scope of this fix.

Right.

>
>
>
> Given that vast majority of fixes to the vnic driver are related to the
> reset handling it would improve stability and testability if every
> reset took the same code path.

I agree.

>
> In the context of merging do_hard_reset and do_reset the question is
> what is the intended distinction and performance gain by having
> 'lightweight' reset.

Right.

>
> I don't have a vnic protocol manual at hand and I suspect I would not
> get one even if I searched for one.
>
> From reading through the fixes in the past my understanding is that the
> full reset is required when the backend changes which then potentially
> requires different size/number of buffers.

Yes, full reset is better when the backend changes.

>
> What is the expected situation when reset is required without changing
> the backend?
>
> Is this so common that it warrants a separate 'lightweight' optimized
> function?
>
Rick Lindsley April 23, 2021, 2:26 a.m. UTC | #14
On 4/22/21 10:21 AM, Michal Suchánek wrote:

> Merging do_reset and do_hard_reset might be a good code cleanup which is
> out of the scope of this fix.

I agree, on both points. Accepting the patch, and improving the reset
path are not in conflict with each other.

My position is that improving the reset path is certainly on the table,
but not with this bug or this fix.  Within the context of this discovered
problem, the issue is well understood, the fix is small and addresses the
immediate problem, and it has not generated new bugs under subsequent
testing.  For those reasons, the submitted patch has my support.

Rick
Dany Madden May 4, 2021, 7:05 p.m. UTC | #15
On 2021-04-22 19:26, Rick Lindsley wrote:
> On 4/22/21 10:21 AM, Michal Suchánek wrote:
> 
>> Merging do_reset and do_hard_reset might be a good code cleanup which 
>> is
>> out of the scope of this fix.
> 
> I agree, on both points. Accepting the patch, and improving the reset
> path are not in conflict with each other.
> 
> My position is that improving the reset path is certainly on the table,
> but not with this bug or this fix.  Within the context of this 
> discovered
> problem, the issue is well understood, the fix is small and addresses 
> the
> immediate problem, and it has not generated new bugs under subsequent
> testing.  For those reasons, the submitted patch has my support.
> 
> Rick

Thanks for all the feedback on the patch. Refactoring the ibmvnic reset 
functions is something we plan to evaluate, as this would potentially 
simplify the reset code. In the mean time, the proposed patch is simple, 
and has been validated in our test environment to solve an issue 
resulting in vnic interfaces getting stuck in an offline state following 
a vnic timeout reset. Given that, I suggest we merge this patch while we 
look at further optimizations in future. I will submit a V3 patch 
shortly.

Dany
diff mbox series

Patch

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index ffb2a91750c7..4bd8c5d1a275 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1970,8 +1970,10 @@  static int do_reset(struct ibmvnic_adapter *adapter,
 			rtnl_unlock();
 			rc = set_link_state(adapter, IBMVNIC_LOGICAL_LNK_DN);
 			rtnl_lock();
-			if (rc)
-				goto out;
+			if (rc) {
+				netdev_dbg(netdev,
+					   "Setting link down failed rc=%d. Continue anyway\n", rc);
+			}
 
 			if (adapter->state == VNIC_OPEN) {
 				/* When we dropped rtnl, ibmvnic_open() got