diff mbox series

[iwl-net] i40e: fix hot issue NVM content is corrupted after nvmupdate

Message ID 20240610092051.2030587-1-aleksandr.loktionov@intel.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series [iwl-net] i40e: fix hot issue NVM content is corrupted after nvmupdate | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 864 this patch: 864
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers fail 3 blamed authors not CCed: jan.sokolowski@intel.com kuba@kernel.org leon@kernel.org; 6 maintainers not CCed: pabeni@redhat.com kuba@kernel.org leon@kernel.org edumazet@google.com jan.sokolowski@intel.com jesse.brandeburg@intel.com
netdev/build_clang success Errors and warnings before: 868 this patch: 868
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 868 this patch: 868
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 10 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 1 this patch: 1
netdev/source_inline success Was 0 now: 0

Commit Message

Aleksandr Loktionov June 10, 2024, 9:20 a.m. UTC
After 230f3d53a547 patch all I/O errors are being converted into EAGAIN
which leads to retries until timeout so nvmupdate sometimes fails after
more than 20 minutes!

Remove misleading EIO to EGAIN conversion and pass all errors as is.

Fixes: 230f3d53a547 ("i40e: remove i40e_status")
Co-developed-by: Kelvin Kang <kelvin.kang@intel.com>
Signed-off-by: Kelvin Kang <kelvin.kang@intel.com>
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_adminq.h | 4 ----
 1 file changed, 4 deletions(-)

Comments

Paul Menzel June 10, 2024, 9:45 a.m. UTC | #1
Dear Aleksandr, dear Kelvin,


Thank you for your patch.


Am 10.06.24 um 11:20 schrieb Aleksandr Loktionov:
> After 230f3d53a547 patch all I/O errors are being converted into EAGAIN
> which leads to retries until timeout so nvmupdate sometimes fails after
> more than 20 minutes!
> 
> Remove misleading EIO to EGAIN conversion and pass all errors as is.
> 
> Fixes: 230f3d53a547 ("i40e: remove i40e_status")

This commit is present since v6.6-rc1, released September last year 
(2023). So until now, nobody noticed this?

> Co-developed-by: Kelvin Kang <kelvin.kang@intel.com>
> Signed-off-by: Kelvin Kang <kelvin.kang@intel.com>
> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>

Please give more details about your test setup. For me it’s also not 
clear, how the NVM content gets corrupted as stated in the 
summary/title. Could you please elaborate that in the commit message.

> ---
>   drivers/net/ethernet/intel/i40e/i40e_adminq.h | 4 ----
>   1 file changed, 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.h b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> index ee86d2c..55b5bb8 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> @@ -109,10 +109,6 @@ static inline int i40e_aq_rc_to_posix(int aq_ret, int aq_rc)
>   		-EFBIG,      /* I40E_AQ_RC_EFBIG */
>   	};
>   
> -	/* aq_rc is invalid if AQ timed out */
> -	if (aq_ret == -EIO)
> -		return -EAGAIN;
> -
>   	if (!((u32)aq_rc < (sizeof(aq_to_posix) / sizeof((aq_to_posix)[0]))))
>   		return -ERANGE;

The referenced commit 230f3d53a547 does:

```
diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.h 
b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
index ee394aacef4d..267f2e0a21ce 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
@@ -5,7 +5,6 @@
  #define _I40E_ADMINQ_H_

  #include "i40e_osdep.h"
-#include "i40e_status.h"
  #include "i40e_adminq_cmd.h"

  #define I40E_ADMINQ_DESC(R, i)   \
@@ -117,7 +116,7 @@ static inline int i40e_aq_rc_to_posix(int aq_ret, 
int aq_rc)
         };

         /* aq_rc is invalid if AQ timed out */
-       if (aq_ret == I40E_ERR_ADMIN_QUEUE_TIMEOUT)
+       if (aq_ret == -EIO)
                 return -EAGAIN;

         if (!((u32)aq_rc < (sizeof(aq_to_posix) / 
sizeof((aq_to_posix)[0]))))
```

So I do not see yet, why removing the whole hunk is the solution.


Kind regards,

Paul
Aleksandr Loktionov June 10, 2024, 10:14 a.m. UTC | #2
> -----Original Message-----
> From: Paul Menzel <pmenzel@molgen.mpg.de>
> Sent: Monday, June 10, 2024 11:45 AM
> To: Loktionov, Aleksandr <aleksandr.loktionov@intel.com>; Kang,
> Kelvin <kelvin.kang@intel.com>
> Cc: intel-wired-lan@lists.osuosl.org; Nguyen, Anthony L
> <anthony.l.nguyen@intel.com>; netdev@vger.kernel.org; Kubalewski,
> Arkadiusz <arkadiusz.kubalewski@intel.com>; Sokolowski, Jan
> <jan.sokolowski@intel.com>; Leon Romanovsky <leonro@nvidia.com>
> Subject: Re: [Intel-wired-lan] [PATCH iwl-net] i40e: fix hot issue
> NVM content is corrupted after nvmupdate
> 
> Dear Aleksandr, dear Kelvin,
> 
> 
> Thank you for your patch.
> 
> 
> Am 10.06.24 um 11:20 schrieb Aleksandr Loktionov:
> > After 230f3d53a547 patch all I/O errors are being converted into
> > EAGAIN which leads to retries until timeout so nvmupdate
> sometimes
> > fails after more than 20 minutes!
> >
> > Remove misleading EIO to EGAIN conversion and pass all errors as
> is.
> >
> > Fixes: 230f3d53a547 ("i40e: remove i40e_status")
> 
> This commit is present since v6.6-rc1, released September last year
> (2023). So until now, nobody noticed this?
> 
Really, really. The regression affects users only when they update F/W,
and not all F/W are affected, only that generate I/O errors while update.
Not cars are affected, but the consequences are serous as in subj.

> > Co-developed-by: Kelvin Kang <kelvin.kang@intel.com>
> > Signed-off-by: Kelvin Kang <kelvin.kang@intel.com>
> > Reviewed-by: Arkadiusz Kubalewski
> <arkadiusz.kubalewski@intel.com>
> > Signed-off-by: Aleksandr Loktionov
> <aleksandr.loktionov@intel.com>
> 
> Please give more details about your test setup. For me it’s also
> not clear, how the NVM content gets corrupted as stated in the
> summary/title. Could you please elaborate that in the commit
> message.
> 
> > ---
> >   drivers/net/ethernet/intel/i40e/i40e_adminq.h | 4 ----
> >   1 file changed, 4 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> > b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> > index ee86d2c..55b5bb8 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> > @@ -109,10 +109,6 @@ static inline int i40e_aq_rc_to_posix(int
> aq_ret, int aq_rc)
> >   		-EFBIG,      /* I40E_AQ_RC_EFBIG */
> >   	};
> >
> > -	/* aq_rc is invalid if AQ timed out */
> > -	if (aq_ret == -EIO)
> > -		return -EAGAIN;
> > -
> >   	if (!((u32)aq_rc < (sizeof(aq_to_posix) /
> sizeof((aq_to_posix)[0]))))
> >   		return -ERANGE;
> 
> The referenced commit 230f3d53a547 does:
> 
> ```
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> index ee394aacef4d..267f2e0a21ce 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> @@ -5,7 +5,6 @@
>   #define _I40E_ADMINQ_H_
> 
>   #include "i40e_osdep.h"
> -#include "i40e_status.h"
>   #include "i40e_adminq_cmd.h"
> 
>   #define I40E_ADMINQ_DESC(R, i)   \
> @@ -117,7 +116,7 @@ static inline int i40e_aq_rc_to_posix(int
> aq_ret, int aq_rc)
>          };
> 
>          /* aq_rc is invalid if AQ timed out */
> -       if (aq_ret == I40E_ERR_ADMIN_QUEUE_TIMEOUT)
> +       if (aq_ret == -EIO)
>                  return -EAGAIN;
> 
>          if (!((u32)aq_rc < (sizeof(aq_to_posix) /
> sizeof((aq_to_posix)[0]))))
> ```
> 
> So I do not see yet, why removing the whole hunk is the solution.
> 
> 
> Kind regards,
> 
> Paul
Aleksandr Loktionov June 10, 2024, 10:15 a.m. UTC | #3
> -----Original Message-----
> From: Loktionov, Aleksandr
> Sent: Monday, June 10, 2024 12:14 PM
> To: Paul Menzel <pmenzel@molgen.mpg.de>; Kang, Kelvin
> <kelvin.kang@intel.com>
> Cc: intel-wired-lan@lists.osuosl.org; Nguyen, Anthony L
> <anthony.l.nguyen@intel.com>; netdev@vger.kernel.org; Kubalewski,
> Arkadiusz <Arkadiusz.Kubalewski@intel.com>; Sokolowski, Jan
> <jan.sokolowski@intel.com>; Leon Romanovsky <leonro@nvidia.com>
> Subject: RE: [Intel-wired-lan] [PATCH iwl-net] i40e: fix hot issue
> NVM content is corrupted after nvmupdate
> 
> 
> 
> > -----Original Message-----
> > From: Paul Menzel <pmenzel@molgen.mpg.de>
> > Sent: Monday, June 10, 2024 11:45 AM
> > To: Loktionov, Aleksandr <aleksandr.loktionov@intel.com>; Kang,
> Kelvin
> > <kelvin.kang@intel.com>
> > Cc: intel-wired-lan@lists.osuosl.org; Nguyen, Anthony L
> > <anthony.l.nguyen@intel.com>; netdev@vger.kernel.org; Kubalewski,
> > Arkadiusz <arkadiusz.kubalewski@intel.com>; Sokolowski, Jan
> > <jan.sokolowski@intel.com>; Leon Romanovsky <leonro@nvidia.com>
> > Subject: Re: [Intel-wired-lan] [PATCH iwl-net] i40e: fix hot
> issue NVM
> > content is corrupted after nvmupdate
> >
> > Dear Aleksandr, dear Kelvin,
> >
> >
> > Thank you for your patch.
> >
> >
> > Am 10.06.24 um 11:20 schrieb Aleksandr Loktionov:
> > > After 230f3d53a547 patch all I/O errors are being converted
> into
> > > EAGAIN which leads to retries until timeout so nvmupdate
> > sometimes
> > > fails after more than 20 minutes!
> > >
> > > Remove misleading EIO to EGAIN conversion and pass all errors
> as
> > is.
> > >
> > > Fixes: 230f3d53a547 ("i40e: remove i40e_status")
> >
> > This commit is present since v6.6-rc1, released September last
> year
> > (2023). So until now, nobody noticed this?
> >
> Really, really. The regression affects users only when they update
> F/W, and not all F/W are affected, only that generate I/O errors
> while update.
Not all the cards are affected, but the consequences are serous as in subj.

> 
> > > Co-developed-by: Kelvin Kang <kelvin.kang@intel.com>
> > > Signed-off-by: Kelvin Kang <kelvin.kang@intel.com>
> > > Reviewed-by: Arkadiusz Kubalewski
> > <arkadiusz.kubalewski@intel.com>
> > > Signed-off-by: Aleksandr Loktionov
> > <aleksandr.loktionov@intel.com>
> >
> > Please give more details about your test setup. For me it’s also
> not
> > clear, how the NVM content gets corrupted as stated in the
> > summary/title. Could you please elaborate that in the commit
> message.
> >
> > > ---
> > >   drivers/net/ethernet/intel/i40e/i40e_adminq.h | 4 ----
> > >   1 file changed, 4 deletions(-)
> > >
> > > diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> > > b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> > > index ee86d2c..55b5bb8 100644
> > > --- a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> > > +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> > > @@ -109,10 +109,6 @@ static inline int i40e_aq_rc_to_posix(int
> > aq_ret, int aq_rc)
> > >   		-EFBIG,      /* I40E_AQ_RC_EFBIG */
> > >   	};
> > >
> > > -	/* aq_rc is invalid if AQ timed out */
> > > -	if (aq_ret == -EIO)
> > > -		return -EAGAIN;
> > > -
> > >   	if (!((u32)aq_rc < (sizeof(aq_to_posix) /
> > sizeof((aq_to_posix)[0]))))
> > >   		return -ERANGE;
> >
> > The referenced commit 230f3d53a547 does:
> >
> > ```
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> > b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> > index ee394aacef4d..267f2e0a21ce 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> > @@ -5,7 +5,6 @@
> >   #define _I40E_ADMINQ_H_
> >
> >   #include "i40e_osdep.h"
> > -#include "i40e_status.h"
> >   #include "i40e_adminq_cmd.h"
> >
> >   #define I40E_ADMINQ_DESC(R, i)   \
> > @@ -117,7 +116,7 @@ static inline int i40e_aq_rc_to_posix(int
> aq_ret,
> > int aq_rc)
> >          };
> >
> >          /* aq_rc is invalid if AQ timed out */
> > -       if (aq_ret == I40E_ERR_ADMIN_QUEUE_TIMEOUT)
> > +       if (aq_ret == -EIO)
> >                  return -EAGAIN;
> >
> >          if (!((u32)aq_rc < (sizeof(aq_to_posix) /
> > sizeof((aq_to_posix)[0]))))
> > ```
> >
> > So I do not see yet, why removing the whole hunk is the solution.
> >
> >
> > Kind regards,
> >
> > Paul
Paul Menzel June 10, 2024, 10:18 a.m. UTC | #4
Dear Aleksandr,


Thank you for your prompt reply.

Am 10.06.24 um 12:14 schrieb Loktionov, Aleksandr:

>> -----Original Message-----
>> From: Paul Menzel <pmenzel@molgen.mpg.de>
>> Sent: Monday, June 10, 2024 11:45 AM

[…]

>> Am 10.06.24 um 11:20 schrieb Aleksandr Loktionov:
>>> After 230f3d53a547 patch all I/O errors are being converted into
>>> EAGAIN which leads to retries until timeout so nvmupdate sometimes
>>> fails after more than 20 minutes!
>>>
>>> Remove misleading EIO to EGAIN conversion and pass all errors as is.
>>>
>>> Fixes: 230f3d53a547 ("i40e: remove i40e_status")
>>
>> This commit is present since v6.6-rc1, released September last year
>> (2023). So until now, nobody noticed this?
>>
> Really, really. The regression affects users only when they update F/W,
> and not all F/W are affected, only that generate I/O errors while update.

Thank you for the clarification.

> Not cars are affected, but the consequences are serous as in subj.

Please details this also in the commit message (body) as asked for below.

>>> Co-developed-by: Kelvin Kang <kelvin.kang@intel.com>
>>> Signed-off-by: Kelvin Kang <kelvin.kang@intel.com>
>>> Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
>>> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
>>
>> Please give more details about your test setup. For me it’s also
>> not clear, how the NVM content gets corrupted as stated in the
>> summary/title. Could you please elaborate that in the commit
>> message.
>>
>>> ---
>>>    drivers/net/ethernet/intel/i40e/i40e_adminq.h | 4 ----
>>>    1 file changed, 4 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
>>> b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
>>> index ee86d2c..55b5bb8 100644
>>> --- a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
>>> @@ -109,10 +109,6 @@ static inline int i40e_aq_rc_to_posix(int aq_ret, int aq_rc)
>>>    		-EFBIG,      /* I40E_AQ_RC_EFBIG */
>>>    	};
>>>
>>> -	/* aq_rc is invalid if AQ timed out */
>>> -	if (aq_ret == -EIO)
>>> -		return -EAGAIN;
>>> -
>>>    	if (!((u32)aq_rc < (sizeof(aq_to_posix) / sizeof((aq_to_posix)[0]))))
>>>    		return -ERANGE;
>>
>> The referenced commit 230f3d53a547 does:
>>
>> ```
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
>> b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
>> index ee394aacef4d..267f2e0a21ce 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
>> @@ -5,7 +5,6 @@
>>    #define _I40E_ADMINQ_H_
>>
>>    #include "i40e_osdep.h"
>> -  #include "i40e_status.h"
>>    #include "i40e_adminq_cmd.h"
>>
>>    #define I40E_ADMINQ_DESC(R, i)   \
>> @@ -117,7 +116,7 @@ static inline int i40e_aq_rc_to_posix(int
>> aq_ret, int aq_rc)
>>           };
>>
>>           /* aq_rc is invalid if AQ timed out */
>> -       if (aq_ret == I40E_ERR_ADMIN_QUEUE_TIMEOUT)
>> +       if (aq_ret == -EIO)
>>                   return -EAGAIN;
>>
>>           if (!((u32)aq_rc < (sizeof(aq_to_posix) /
>> sizeof((aq_to_posix)[0]))))
>> ```
>>
>> So I do not see yet, why removing the whole hunk is the solution.


Kind regards,

Paul
Aleksandr Loktionov June 10, 2024, 10:20 a.m. UTC | #5
> -----Original Message-----
> From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On
> Behalf Of Loktionov, Aleksandr
> Sent: Monday, June 10, 2024 12:16 PM
> To: Paul Menzel <pmenzel@molgen.mpg.de>; Kang, Kelvin
> <kelvin.kang@intel.com>
> Cc: Sokolowski, Jan <jan.sokolowski@intel.com>;
> netdev@vger.kernel.org; Kubalewski, Arkadiusz
> <arkadiusz.kubalewski@intel.com>; Nguyen, Anthony L
> <anthony.l.nguyen@intel.com>; intel-wired-lan@lists.osuosl.org;
> Leon Romanovsky <leonro@nvidia.com>
> Subject: Re: [Intel-wired-lan] [PATCH iwl-net] i40e: fix hot issue
> NVM content is corrupted after nvmupdate
> 
> 
> 
> > -----Original Message-----
> > From: Loktionov, Aleksandr
> > Sent: Monday, June 10, 2024 12:14 PM
> > To: Paul Menzel <pmenzel@molgen.mpg.de>; Kang, Kelvin
> > <kelvin.kang@intel.com>
> > Cc: intel-wired-lan@lists.osuosl.org; Nguyen, Anthony L
> > <anthony.l.nguyen@intel.com>; netdev@vger.kernel.org; Kubalewski,
> > Arkadiusz <Arkadiusz.Kubalewski@intel.com>; Sokolowski, Jan
> > <jan.sokolowski@intel.com>; Leon Romanovsky <leonro@nvidia.com>
> > Subject: RE: [Intel-wired-lan] [PATCH iwl-net] i40e: fix hot
> issue NVM
> > content is corrupted after nvmupdate
> >
> >
> >
> > > -----Original Message-----
> > > From: Paul Menzel <pmenzel@molgen.mpg.de>
> > > Sent: Monday, June 10, 2024 11:45 AM
> > > To: Loktionov, Aleksandr <aleksandr.loktionov@intel.com>; Kang,
> > Kelvin
> > > <kelvin.kang@intel.com>
> > > Cc: intel-wired-lan@lists.osuosl.org; Nguyen, Anthony L
> > > <anthony.l.nguyen@intel.com>; netdev@vger.kernel.org;
> Kubalewski,
> > > Arkadiusz <arkadiusz.kubalewski@intel.com>; Sokolowski, Jan
> > > <jan.sokolowski@intel.com>; Leon Romanovsky <leonro@nvidia.com>
> > > Subject: Re: [Intel-wired-lan] [PATCH iwl-net] i40e: fix hot
> > issue NVM
> > > content is corrupted after nvmupdate
> > >
> > > Dear Aleksandr, dear Kelvin,
> > >
> > >
> > > Thank you for your patch.
> > >
> > >
> > > Am 10.06.24 um 11:20 schrieb Aleksandr Loktionov:
> > > > After 230f3d53a547 patch all I/O errors are being converted
> > into
> > > > EAGAIN which leads to retries until timeout so nvmupdate
> > > sometimes
> > > > fails after more than 20 minutes!
> > > >
> > > > Remove misleading EIO to EGAIN conversion and pass all errors
> > as
> > > is.
> > > >
> > > > Fixes: 230f3d53a547 ("i40e: remove i40e_status")
> > >
> > > This commit is present since v6.6-rc1, released September last
> > year
> > > (2023). So until now, nobody noticed this?
> > >
> > Really, really. The regression affects users only when they
> update
> > F/W, and not all F/W are affected, only that generate I/O errors
> while
> > update.
> Not all the cards are affected, but the consequences are serous as
> in subj.
> 
> >
> > > > Co-developed-by: Kelvin Kang <kelvin.kang@intel.com>
> > > > Signed-off-by: Kelvin Kang <kelvin.kang@intel.com>
> > > > Reviewed-by: Arkadiusz Kubalewski
> > > <arkadiusz.kubalewski@intel.com>
> > > > Signed-off-by: Aleksandr Loktionov
> > > <aleksandr.loktionov@intel.com>
> > >
> > > Please give more details about your test setup. For me it’s
> also
> > not
> > > clear, how the NVM content gets corrupted as stated in the
> > > summary/title. Could you please elaborate that in the commit
> > message.
For example X710DA2 with 0x8000ECB7 is affected, but there are probably more...
The corruption is already described - because of timeout nvmupdate timeouts failing to update NVM. 

> > >
> > > > ---
> > > >   drivers/net/ethernet/intel/i40e/i40e_adminq.h | 4 ----
> > > >   1 file changed, 4 deletions(-)
> > > >
> > > > diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> > > > b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> > > > index ee86d2c..55b5bb8 100644
> > > > --- a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> > > > +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> > > > @@ -109,10 +109,6 @@ static inline int
> i40e_aq_rc_to_posix(int
> > > aq_ret, int aq_rc)
> > > >   		-EFBIG,      /* I40E_AQ_RC_EFBIG */
> > > >   	};
> > > >
> > > > -	/* aq_rc is invalid if AQ timed out */
> > > > -	if (aq_ret == -EIO)
> > > > -		return -EAGAIN;
> > > > -
> > > >   	if (!((u32)aq_rc < (sizeof(aq_to_posix) /
> > > sizeof((aq_to_posix)[0]))))
> > > >   		return -ERANGE;
> > >
> > > The referenced commit 230f3d53a547 does:
> > >
> > > ```
> > > diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> > > b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> > > index ee394aacef4d..267f2e0a21ce 100644
> > > --- a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> > > +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> > > @@ -5,7 +5,6 @@
> > >   #define _I40E_ADMINQ_H_
> > >
> > >   #include "i40e_osdep.h"
> > > -#include "i40e_status.h"
> > >   #include "i40e_adminq_cmd.h"
> > >
> > >   #define I40E_ADMINQ_DESC(R, i)   \
> > > @@ -117,7 +116,7 @@ static inline int i40e_aq_rc_to_posix(int
> > aq_ret,
> > > int aq_rc)
> > >          };
> > >
> > >          /* aq_rc is invalid if AQ timed out */
> > > -       if (aq_ret == I40E_ERR_ADMIN_QUEUE_TIMEOUT)
> > > +       if (aq_ret == -EIO)
> > >                  return -EAGAIN;
> > >
> > >          if (!((u32)aq_rc < (sizeof(aq_to_posix) /
> > > sizeof((aq_to_posix)[0]))))
> > > ```
> > >
> > > So I do not see yet, why removing the whole hunk is the
> solution.
> > >
> > >
> > > Kind regards,
> > >
> > > Paul
Paul Menzel June 10, 2024, 12:25 p.m. UTC | #6
Dear Aleksandr,


Thank you for your response.


Am 10.06.24 um 12:20 schrieb Loktionov, Aleksandr:
> 
> 
>> -----Original Message-----
>> From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf Of Loktionov, Aleksandr
>> Sent: Monday, June 10, 2024 12:16 PM

>>> -----Original Message-----
>>> From: Loktionov, Aleksandr
>>> Sent: Monday, June 10, 2024 12:14 PM

>>>> -----Original Message-----
>>>> From: Paul Menzel <pmenzel@molgen.mpg.de>
>>>> Sent: Monday, June 10, 2024 11:45 AM

>>>> Am 10.06.24 um 11:20 schrieb Aleksandr Loktionov:
>>>>> After 230f3d53a547 patch all I/O errors are being converted into
>>>>> EAGAIN which leads to retries until timeout so nvmupdate sometimes
>>>>> fails after more than 20 minutes!
>>>>>
>>>>> Remove misleading EIO to EGAIN conversion and pass all errors as
>>>> is.
>>>>>
>>>>> Fixes: 230f3d53a547 ("i40e: remove i40e_status")
>>>>
>>>> This commit is present since v6.6-rc1, released September last year
>>>> (2023). So until now, nobody noticed this?
>>>>
>>> Really, really. The regression affects users only when they update
>>> F/W, and not all F/W are affected, only that generate I/O errors while
>>> update.
>> Not all the cards are affected, but the consequences are serous as
>> in subj.
>>
>>>>> Co-developed-by: Kelvin Kang <kelvin.kang@intel.com>
>>>>> Signed-off-by: Kelvin Kang <kelvin.kang@intel.com>
>>>>> Reviewed-by: Arkadiusz Kubalewski
>>>> <arkadiusz.kubalewski@intel.com>
>>>>> Signed-off-by: Aleksandr Loktionov
>>>> <aleksandr.loktionov@intel.com>
>>>>
>>>> Please give more details about your test setup. For me it’s also not
>>>> clear, how the NVM content gets corrupted as stated in the
>>>> summary/title. Could you please elaborate that in the commit message.

> For example X710DA2 with 0x8000ECB7 is affected, but there are
> probably more...

Please amend the commit message with this information, and for ease also 
the commands you executed.

> The corruption is already described - because of
> timeout nvmupdate timeouts failing to update NVM.

Only because something times out, does not mean it causes corruption.

Please amend the commit message. It looks like you also missed more of 
my questions below.

>>>>> ---
>>>>>    drivers/net/ethernet/intel/i40e/i40e_adminq.h | 4 ----
>>>>>    1 file changed, 4 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
>>>>> b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
>>>>> index ee86d2c..55b5bb8 100644
>>>>> --- a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
>>>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
>>>>> @@ -109,10 +109,6 @@ static inline int i40e_aq_rc_to_posix(int aq_ret, int aq_rc)
>>>>>    		-EFBIG,      /* I40E_AQ_RC_EFBIG */
>>>>>    	};
>>>>>
>>>>> -	/* aq_rc is invalid if AQ timed out */
>>>>> -	if (aq_ret == -EIO)
>>>>> -		return -EAGAIN;
>>>>> -
>>>>>    	if (!((u32)aq_rc < (sizeof(aq_to_posix) / sizeof((aq_to_posix)[0]))))
>>>>>    		return -ERANGE;
>>>>
>>>> The referenced commit 230f3d53a547 does:
>>>>
>>>> ```
>>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
>>>> b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
>>>> index ee394aacef4d..267f2e0a21ce 100644
>>>> --- a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
>>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
>>>> @@ -5,7 +5,6 @@
>>>>    #define _I40E_ADMINQ_H_
>>>>
>>>>    #include "i40e_osdep.h"
>>>> -  #include "i40e_status.h"
>>>>    #include "i40e_adminq_cmd.h"
>>>>
>>>>    #define I40E_ADMINQ_DESC(R, i)   \
>>>> @@ -117,7 +116,7 @@ static inline int i40e_aq_rc_to_posix(int aq_ret, int aq_rc)
>>>>           };
>>>>
>>>>           /* aq_rc is invalid if AQ timed out */
>>>> -       if (aq_ret == I40E_ERR_ADMIN_QUEUE_TIMEOUT)
>>>> +       if (aq_ret == -EIO)
>>>>                   return -EAGAIN;
>>>>
>>>>           if (!((u32)aq_rc < (sizeof(aq_to_posix) / sizeof((aq_to_posix)[0]))))
>>>> ```
>>>>
>>>> So I do not see yet, why removing the whole hunk is the solution.

Kind regards,

Paul


PS: Would it be possible, that you use another email program. The cited 
parts are wrapped awkwardly, and it takes some time to correct it in my 
response.
Aleksandr Loktionov June 11, 2024, 8:58 a.m. UTC | #7
> -----Original Message-----
> From: Paul Menzel <pmenzel@molgen.mpg.de>
> Sent: Monday, June 10, 2024 2:26 PM
> To: Loktionov, Aleksandr <aleksandr.loktionov@intel.com>; Kang, Kelvin
> <kelvin.kang@intel.com>
> Cc: Sokolowski, Jan <jan.sokolowski@intel.com>; netdev@vger.kernel.org;
> Kubalewski, Arkadiusz <arkadiusz.kubalewski@intel.com>; Nguyen, Anthony L
> <anthony.l.nguyen@intel.com>; intel-wired-lan@lists.osuosl.org; Leon
> Romanovsky <leonro@nvidia.com>
> Subject: Re: [Intel-wired-lan] [PATCH iwl-net] i40e: fix hot issue NVM
> content is corrupted after nvmupdate
> 
> Dear Aleksandr,
> 
> 
> Thank you for your response.
> 
> 
> Am 10.06.24 um 12:20 schrieb Loktionov, Aleksandr:
> >
> >
> >> -----Original Message-----
> >> From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf
> >> Of Loktionov, Aleksandr
> >> Sent: Monday, June 10, 2024 12:16 PM
> 
> >>> -----Original Message-----
> >>> From: Loktionov, Aleksandr
> >>> Sent: Monday, June 10, 2024 12:14 PM
> 
> >>>> -----Original Message-----
> >>>> From: Paul Menzel <pmenzel@molgen.mpg.de>
> >>>> Sent: Monday, June 10, 2024 11:45 AM
> 
> >>>> Am 10.06.24 um 11:20 schrieb Aleksandr Loktionov:
> >>>>> After 230f3d53a547 patch all I/O errors are being converted into
> >>>>> EAGAIN which leads to retries until timeout so nvmupdate sometimes
> >>>>> fails after more than 20 minutes!
> >>>>>
> >>>>> Remove misleading EIO to EGAIN conversion and pass all errors as
> >>>> is.
> >>>>>
> >>>>> Fixes: 230f3d53a547 ("i40e: remove i40e_status")
> >>>>
> >>>> This commit is present since v6.6-rc1, released September last year
> >>>> (2023). So until now, nobody noticed this?
> >>>>
> >>> Really, really. The regression affects users only when they update
> >>> F/W, and not all F/W are affected, only that generate I/O errors
> >>> while update.
> >> Not all the cards are affected, but the consequences are serous as in
> >> subj.
> >>
> >>>>> Co-developed-by: Kelvin Kang <kelvin.kang@intel.com>
> >>>>> Signed-off-by: Kelvin Kang <kelvin.kang@intel.com>
> >>>>> Reviewed-by: Arkadiusz Kubalewski
> >>>> <arkadiusz.kubalewski@intel.com>
> >>>>> Signed-off-by: Aleksandr Loktionov
> >>>> <aleksandr.loktionov@intel.com>
> >>>>
> >>>> Please give more details about your test setup. For me it’s also
> >>>> not clear, how the NVM content gets corrupted as stated in the
> >>>> summary/title. Could you please elaborate that in the commit message.
> 
> > For example X710DA2 with 0x8000ECB7 is affected, but there are
> > probably more...
> 
> Please amend the commit message with this information, and for ease also the
> commands you executed.
> 
> > The corruption is already described - because of timeout nvmupdate
> > timeouts failing to update NVM.
> 
> Only because something times out, does not mean it causes corruption.
> 
> Please amend the commit message. It looks like you also missed more of my
> questions below.
> 
> >>>>> ---
> >>>>>    drivers/net/ethernet/intel/i40e/i40e_adminq.h | 4 ----
> >>>>>    1 file changed, 4 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> >>>>> b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> >>>>> index ee86d2c..55b5bb8 100644
> >>>>> --- a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> >>>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> >>>>> @@ -109,10 +109,6 @@ static inline int i40e_aq_rc_to_posix(int aq_ret,
> int aq_rc)
> >>>>>    		-EFBIG,      /* I40E_AQ_RC_EFBIG */
> >>>>>    	};
> >>>>>
> >>>>> -	/* aq_rc is invalid if AQ timed out */
> >>>>> -	if (aq_ret == -EIO)
> >>>>> -		return -EAGAIN;
> >>>>> -
> >>>>>    	if (!((u32)aq_rc < (sizeof(aq_to_posix) /
> sizeof((aq_to_posix)[0]))))
> >>>>>    		return -ERANGE;
> >>>>
> >>>> The referenced commit 230f3d53a547 does:
> >>>>
> >>>> ```
> >>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> >>>> b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> >>>> index ee394aacef4d..267f2e0a21ce 100644
> >>>> --- a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> >>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
> >>>> @@ -5,7 +5,6 @@
> >>>>    #define _I40E_ADMINQ_H_
> >>>>
> >>>>    #include "i40e_osdep.h"
> >>>> -  #include "i40e_status.h"
> >>>>    #include "i40e_adminq_cmd.h"
> >>>>
> >>>>    #define I40E_ADMINQ_DESC(R, i)   \
> >>>> @@ -117,7 +116,7 @@ static inline int i40e_aq_rc_to_posix(int aq_ret,
> int aq_rc)
> >>>>           };
> >>>>
> >>>>           /* aq_rc is invalid if AQ timed out */
> >>>> -       if (aq_ret == I40E_ERR_ADMIN_QUEUE_TIMEOUT)
> >>>> +       if (aq_ret == -EIO)
> >>>>                   return -EAGAIN;
> >>>>
> >>>>           if (!((u32)aq_rc < (sizeof(aq_to_posix) /
> >>>> sizeof((aq_to_posix)[0])))) ```
> >>>>
> >>>> So I do not see yet, why removing the whole hunk is the solution.
The solution is to pass errors as mentioned in the commit as is from f/w to nvmupdate.

> Kind regards,
> 
> Paul
> 
> 
> PS: Would it be possible, that you use another email program. The cited
> parts are wrapped awkwardly, and it takes some time to correct it in my
> response.
diff mbox series

Patch

diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.h b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
index ee86d2c..55b5bb8 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_adminq.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.h
@@ -109,10 +109,6 @@  static inline int i40e_aq_rc_to_posix(int aq_ret, int aq_rc)
 		-EFBIG,      /* I40E_AQ_RC_EFBIG */
 	};
 
-	/* aq_rc is invalid if AQ timed out */
-	if (aq_ret == -EIO)
-		return -EAGAIN;
-
 	if (!((u32)aq_rc < (sizeof(aq_to_posix) / sizeof((aq_to_posix)[0]))))
 		return -ERANGE;