diff mbox series

[iwl-net,v4] i40e: fix hot issue NVM content is corrupted after nvmupdate

Message ID 20240618132111.3193963-1-aleksandr.loktionov@intel.com (mailing list archive)
State Awaiting Upstream
Delegated to: Netdev Maintainers
Headers show
Series [iwl-net,v4] i40e: fix hot issue NVM content is corrupted after nvmupdate | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 859 this patch: 859
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers fail 3 blamed authors not CCed: leon@kernel.org kuba@kernel.org jan.sokolowski@intel.com; 6 maintainers not CCed: jan.sokolowski@intel.com pabeni@redhat.com jesse.brandeburg@intel.com leon@kernel.org edumazet@google.com kuba@kernel.org
netdev/build_clang success Errors and warnings before: 863 this patch: 863
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 863 this patch: 863
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 10 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 1 this patch: 1
netdev/source_inline success Was 0 now: 0

Commit Message

Aleksandr Loktionov June 18, 2024, 1:21 p.m. UTC
The bug affects users only at the time when they try to update NVM, and
only F/W versions that generate errors while nvmupdate. For example, X710DA2
with 0x8000ECB7 F/W is affected, but there are probably more...

Command for reproduction is just NVM update:
 ./nvmupdate64

In the log instead of:
 i40e_nvmupd_exec_aq err I40E_ERR_ADMIN_QUEUE_ERROR aq_err I40E_AQ_RC_ENOMEM)
appears:
 i40e_nvmupd_exec_aq err -EIO aq_err I40E_AQ_RC_ENOMEM

But the problematic code did silently convert EIO into EAGAIN which forced
nvmupdate to ignore EAGAIN error and retry the same operation until timeout.
That's why NVM update takes 20+ minutes to finish with the fail in the end.

After commit 230f3d53a547 ("i40e: remove i40e_status"), which should only
replace F/W specific error codes with Linux kernel generic, all EIO errors
suddenly started to be converted into EAGAIN which leads nvmupdate to retry
until it timeouts and sometimes fails after more than 20 minutes in the
middle of NVM update, so NVM becomes corrupted.

Remove wrong EIO to EGAIN conversion and pass all errors as is.

Fixes: 230f3d53a547 ("i40e: remove i40e_status")
Co-developed-by: Kelvin Kang <kelvin.kang@intel.com>
Signed-off-by: Kelvin Kang <kelvin.kang@intel.com>
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
---
v3->v4 commit message update
v2->v3 commit messege typos
v1->v2 commit message update
---
 drivers/net/ethernet/intel/i40e/i40e_adminq.h | 4 ----
 1 file changed, 4 deletions(-)

Comments

Przemek Kitszel June 18, 2024, 1:49 p.m. UTC | #1
On 6/18/24 15:21, Aleksandr Loktionov wrote:
> The bug affects users only at the time when they try to update NVM, and
> only F/W versions that generate errors while nvmupdate. For example, X710DA2
> with 0x8000ECB7 F/W is affected, but there are probably more...
> 
> Command for reproduction is just NVM update:
>   ./nvmupdate64

nvmupdate64 is not an upstream tool, but it's fine to mention it here,
as we don't have a better alternative for i40e as of now

> 
> In the log instead of:
>   i40e_nvmupd_exec_aq err I40E_ERR_ADMIN_QUEUE_ERROR aq_err I40E_AQ_RC_ENOMEM)
> appears:
>   i40e_nvmupd_exec_aq err -EIO aq_err I40E_AQ_RC_ENOMEM
> 
> But the problematic code did silently convert EIO into EAGAIN which forced
> nvmupdate to ignore EAGAIN error and retry the same operation until timeout.
> That's why NVM update takes 20+ minutes to finish with the fail in the end.

this paragraph tells more about nvmupdate tool problems that the driver

> 
> After commit 230f3d53a547 ("i40e: remove i40e_status"), which should only
> replace F/W specific error codes with Linux kernel generic, 

> all EIO errors
> suddenly started to be converted into EAGAIN

all or just this one that you are fixing here?

> which leads nvmupdate to retry
> until it timeouts and sometimes fails after more than 20 minutes in the
> middle of NVM update, so NVM becomes corrupted.
> 
> Remove wrong EIO to EGAIN conversion and pass all errors as is.
> 
> Fixes: 230f3d53a547 ("i40e: remove i40e_status")

That is a proper tag, and the description makes it clear that we want
the patch finally applied, thank you for the efforts to make it well
described.

> Co-developed-by: Kelvin Kang <kelvin.kang@intel.com>
> Signed-off-by: Kelvin Kang <kelvin.kang@intel.com>

this 2 line removal was indeed developed by two of you?

> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> ---
> v3->v4 commit message update

Please fix the subject line too, it's what will be the most frequently
read part of your work, and with stay in git for decades. You could use:
i40e: fix wrong error code replacement

and add link in changelog section:
v4: 
https://lore.kernel.org/netdev/20240618132111.3193963-1-aleksandr.loktionov@intel.com/T/#u

> v2->v3 commit messege typos
> v1->v2 commit message update
> ---
>   drivers/net/ethernet/intel/i40e/i40e_adminq.h | 4 ----
>   1 file changed, 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.h b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> index ee86d2c..55b5bb8 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> @@ -109,10 +109,6 @@ static inline int i40e_aq_rc_to_posix(int aq_ret, int aq_rc)
>   		-EFBIG,      /* I40E_AQ_RC_EFBIG */
>   	};
>   
> -	/* aq_rc is invalid if AQ timed out */
> -	if (aq_ret == -EIO)
> -		return -EAGAIN;
> -
>   	if (!((u32)aq_rc < (sizeof(aq_to_posix) / sizeof((aq_to_posix)[0]))))
>   		return -ERANGE;
>
Aleksandr Loktionov June 18, 2024, 2:21 p.m. UTC | #2
> -----Original Message-----
> From: Kitszel, Przemyslaw <przemyslaw.kitszel@intel.com>
> Sent: Tuesday, June 18, 2024 3:50 PM
> To: Loktionov, Aleksandr <aleksandr.loktionov@intel.com>; Nguyen,
> Anthony L <anthony.l.nguyen@intel.com>
> Cc: netdev@vger.kernel.org; Kang, Kelvin <kelvin.kang@intel.com>;
> Kubalewski, Arkadiusz <arkadiusz.kubalewski@intel.com>; intel-wired-
> lan@lists.osuosl.org
> Subject: Re: [Intel-wired-lan] [PATCH iwl-net v4] i40e: fix hot issue
> NVM content is corrupted after nvmupdate
> 
> On 6/18/24 15:21, Aleksandr Loktionov wrote:
> > The bug affects users only at the time when they try to update NVM,
> > and only F/W versions that generate errors while nvmupdate. For
> > example, X710DA2 with 0x8000ECB7 F/W is affected, but there are
> probably more...
> >
> > Command for reproduction is just NVM update:
> >   ./nvmupdate64
> 
> nvmupdate64 is not an upstream tool, but it's fine to mention it here,
> as we don't have a better alternative for i40e as of now
> 
> >
> > In the log instead of:
> >   i40e_nvmupd_exec_aq err I40E_ERR_ADMIN_QUEUE_ERROR aq_err
> > I40E_AQ_RC_ENOMEM)
> > appears:
> >   i40e_nvmupd_exec_aq err -EIO aq_err I40E_AQ_RC_ENOMEM
> >
> > But the problematic code did silently convert EIO into EAGAIN which
> > forced nvmupdate to ignore EAGAIN error and retry the same operation
> until timeout.
> > That's why NVM update takes 20+ minutes to finish with the fail in
> the end.
> 
> this paragraph tells more about nvmupdate tool problems that the
> driver
What is your proposal for it exactly?

> >
> > After commit 230f3d53a547 ("i40e: remove i40e_status"), which should
> > only replace F/W specific error codes with Linux kernel generic,
> 
> > all EIO errors
> > suddenly started to be converted into EAGAIN
> 
> all or just this one that you are fixing here?
All error codes from FW which exactly I'm fixing of course.

> 
> > which leads nvmupdate to retry
> > until it timeouts and sometimes fails after more than 20 minutes in
> > the middle of NVM update, so NVM becomes corrupted.
> >
> > Remove wrong EIO to EGAIN conversion and pass all errors as is.
> >
> > Fixes: 230f3d53a547 ("i40e: remove i40e_status")
> 
> That is a proper tag, and the description makes it clear that we want
> the patch finally applied, thank you for the efforts to make it well
> described.
> 
> > Co-developed-by: Kelvin Kang <kelvin.kang@intel.com>
> > Signed-off-by: Kelvin Kang <kelvin.kang@intel.com>
> 
> this 2 line removal was indeed developed by two of you?
As written above, yes of course.

> > Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
> > Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> > ---
> > v3->v4 commit message update
> 
> Please fix the subject line too, it's what will be the most frequently
> read part of your work, and with stay in git for decades. You could
> use:
> i40e: fix wrong error code replacement
> 
This patch fixes urgent/hot issue with NVM update, which is stated in the
title. Wrong error conversion code could lead to different severity issues.
From my point it's better to mention in the title the serious problem which
users meet/see, am I wrong?

> and add link in changelog section:
> v4:
> https://lore.kernel.org/netdev/20240618132111.3193963-1-
> aleksandr.loktionov@intel.com/T/#u
> 
> > v2->v3 commit messege typos
> > v1->v2 commit message update
> > ---
> >   drivers/net/ethernet/intel/i40e/i40e_adminq.h | 4 ----
> >   1 file changed, 4 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> > b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> > index ee86d2c..55b5bb8 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> > @@ -109,10 +109,6 @@ static inline int i40e_aq_rc_to_posix(int
> aq_ret, int aq_rc)
> >   		-EFBIG,      /* I40E_AQ_RC_EFBIG */
> >   	};
> >
> > -	/* aq_rc is invalid if AQ timed out */
> > -	if (aq_ret == -EIO)
> > -		return -EAGAIN;
> > -
> >   	if (!((u32)aq_rc < (sizeof(aq_to_posix) /
> sizeof((aq_to_posix)[0]))))
> >   		return -ERANGE;
> >
Przemek Kitszel June 24, 2024, 1:42 p.m. UTC | #3
On 6/18/24 16:21, Loktionov, Aleksandr wrote:
> 
> 
>> -----Original Message-----
>> From: Kitszel, Przemyslaw <przemyslaw.kitszel@intel.com>
>> Sent: Tuesday, June 18, 2024 3:50 PM
>> To: Loktionov, Aleksandr <aleksandr.loktionov@intel.com>; Nguyen,
>> Anthony L <anthony.l.nguyen@intel.com>
>> Cc: netdev@vger.kernel.org; Kang, Kelvin <kelvin.kang@intel.com>;
>> Kubalewski, Arkadiusz <arkadiusz.kubalewski@intel.com>; intel-wired-
>> lan@lists.osuosl.org
>> Subject: Re: [Intel-wired-lan] [PATCH iwl-net v4] i40e: fix hot issue
>> NVM content is corrupted after nvmupdate
>>
>> On 6/18/24 15:21, Aleksandr Loktionov wrote:
>>> The bug affects users only at the time when they try to update NVM,
>>> and only F/W versions that generate errors while nvmupdate. For
>>> example, X710DA2 with 0x8000ECB7 F/W is affected, but there are
>> probably more...
>>>
>>> Command for reproduction is just NVM update:
>>>    ./nvmupdate64
>>
>> nvmupdate64 is not an upstream tool, but it's fine to mention it here,
>> as we don't have a better alternative for i40e as of now
>>
>>>
>>> In the log instead of:
>>>    i40e_nvmupd_exec_aq err I40E_ERR_ADMIN_QUEUE_ERROR aq_err
>>> I40E_AQ_RC_ENOMEM)
>>> appears:
>>>    i40e_nvmupd_exec_aq err -EIO aq_err I40E_AQ_RC_ENOMEM
>>>
>>> But the problematic code did silently convert EIO into EAGAIN which
>>> forced nvmupdate to ignore EAGAIN error and retry the same operation
>> until timeout.
>>> That's why NVM update takes 20+ minutes to finish with the fail in
>> the end.
>>
>> this paragraph tells more about nvmupdate tool problems that the
>> driver
> What is your proposal for it exactly?

Perhaps this could be summarized as:
This driver issue leads to nvmupdate needlessly retrying for over 20
minutes to only fail at the end. It is even worse that NVM content
could be broken after the broken update.

But I'm fine with it as is too.

> 
>>>
>>> After commit 230f3d53a547 ("i40e: remove i40e_status"), which should
>>> only replace F/W specific error codes with Linux kernel generic,
>>
>>> all EIO errors
>>> suddenly started to be converted into EAGAIN
>>
>> all or just this one that you are fixing here?
> All error codes from FW which exactly I'm fixing of course.

fine

> 
>>
>>> which leads nvmupdate to retry
>>> until it timeouts and sometimes fails after more than 20 minutes in
>>> the middle of NVM update, so NVM becomes corrupted.
>>>
>>> Remove wrong EIO to EGAIN conversion and pass all errors as is.
>>>
>>> Fixes: 230f3d53a547 ("i40e: remove i40e_status")
>>
>> That is a proper tag, and the description makes it clear that we want
>> the patch finally applied, thank you for the efforts to make it well
>> described.
>>
>>> Co-developed-by: Kelvin Kang <kelvin.kang@intel.com>
>>> Signed-off-by: Kelvin Kang <kelvin.kang@intel.com>
>>
>> this 2 line removal was indeed developed by two of you?
> As written above, yes of course.

fine

> 
>>> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
>>> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
>>> ---
>>> v3->v4 commit message update
>>
>> Please fix the subject line too, it's what will be the most frequently
>> read part of your work, and with stay in git for decades. You could
>> use:
>> i40e: fix wrong error code replacement
>>
> This patch fixes urgent/hot issue with NVM update, which is stated in the
> title. Wrong error conversion code could lead to different severity issues.
>  From my point it's better to mention in the title the serious problem which
> users meet/see, am I wrong?

I believe so, please send v5 with a new title, like:
i40e: prevent needless retries of NVM update
(no "hot issue", this is as hot as the more severe half of i40e bugs).

I know that you like to have "fix" in the title, but this is not
required for the kernel, as the same info is assumed from Fixes: tag.
If you insist, then:
i40e: fix: prevent needless retries of NVM update

By trying to start the sentence from "fix", you dragged yourself into
this convoluted need for explaining yourself ;)

> 
>> and add link in changelog section:
>> v4:
>> https://lore.kernel.org/netdev/20240618132111.3193963-1-
>> aleksandr.loktionov@intel.com/T/#u
>>
>>> v2->v3 commit messege typos
>>> v1->v2 commit message update
>>> ---
>>>    drivers/net/ethernet/intel/i40e/i40e_adminq.h | 4 ----
>>>    1 file changed, 4 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
>>> b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
>>> index ee86d2c..55b5bb8 100644
>>> --- a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
>>> @@ -109,10 +109,6 @@ static inline int i40e_aq_rc_to_posix(int
>> aq_ret, int aq_rc)
>>>    		-EFBIG,      /* I40E_AQ_RC_EFBIG */
>>>    	};
>>>
>>> -	/* aq_rc is invalid if AQ timed out */
>>> -	if (aq_ret == -EIO)
>>> -		return -EAGAIN;
>>> -
>>>    	if (!((u32)aq_rc < (sizeof(aq_to_posix) /
>> sizeof((aq_to_posix)[0]))))
>>>    		return -ERANGE;
>>>
>
diff mbox series

Patch

diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.h b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
index ee86d2c..55b5bb8 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
@@ -109,10 +109,6 @@  static inline int i40e_aq_rc_to_posix(int aq_ret, int aq_rc)
 		-EFBIG,      /* I40E_AQ_RC_EFBIG */
 	};
 
-	/* aq_rc is invalid if AQ timed out */
-	if (aq_ret == -EIO)
-		return -EAGAIN;
-
 	if (!((u32)aq_rc < (sizeof(aq_to_posix) / sizeof((aq_to_posix)[0]))))
 		return -ERANGE;