diff mbox series

[net] Revert "ixgbe: Manual AN-37 for troublesome link partners for X550 SFI"

Message ID 20240520-net-2024-05-20-revert-silicom-switch-workaround-v1-1-50f80f261c94@intel.com (mailing list archive)
State Accepted
Commit b35b1c0b4e166a427395deaf61e3140495dfcb89
Delegated to: Netdev Maintainers
Headers show
Series [net] Revert "ixgbe: Manual AN-37 for troublesome link partners for X550 SFI" | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 905 this patch: 905
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers fail 1 blamed authors not CCed: anthony.l.nguyen@intel.com; 6 maintainers not CCed: pabeni@redhat.com kuba@kernel.org edumazet@google.com anthony.l.nguyen@intel.com jesse.brandeburg@intel.com intel-wired-lan@lists.osuosl.org
netdev/build_clang success Errors and warnings before: 909 this patch: 909
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 909 this patch: 909
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 78 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 47 this patch: 47
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-05-21--21-00 (tests: 1039)

Commit Message

Jacob Keller May 21, 2024, 12:21 a.m. UTC
This reverts commit 565736048bd5f9888990569993c6b6bfdf6dcb6d.

According to the commit, it implements a manual AN-37 for some
"troublesome" Juniper MX5 switches. This appears to be a workaround for a
particular switch.

It has been reported that this causes a severe breakage for other switches,
including a Cisco 3560CX-12PD-S.

The code appears to be a workaround for a specific switch which fails to
link in SFI mode. It expects to see AN-37 auto negotiation in order to
link. The Cisco switch is not expecting AN-37 auto negotiation. When the
device starts the manual AN-37, the Cisco switch decides that the port is
confused and stops attempting to link with it. This persists until a power
cycle. A simple driver unload and reload does not resolve the issue, even
if loading with a version of the driver which lacks this workaround.

The authors of the workaround commit have not responded with
clarifications, and the result of the workaround is complete failure to
connect with other switches.

This appears to be a case where the driver can either "correctly" link with
the Juniper MX5 switch, at the cost of bricking the link with the Cisco
switch, or it can behave properly for the Cisco switch, but fail to link
with the Junipir MX5 switch. I do not know enough about the standards
involved to clearly determine whether either switch is at fault or behaving
incorrectly. Nor do I know whether there exists some alternative fix which
corrects behavior with both switches.

Revert the workaround for the Juniper switch.

Fixes: 565736048bd5 ("ixgbe: Manual AN-37 for troublesome link partners for X550 SFI")
Link: https://lore.kernel.org/netdev/cbe874db-9ac9-42b8-afa0-88ea910e1e99@intel.com/T/
Link: https://forum.proxmox.com/threads/intel-x553-sfp-ixgbe-no-go-on-pve8.135129/#post-612291
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Cc: Jeff Daly <jeffd@silicom-usa.com>
Cc: kernel.org-fo5k2w@ycharbi.fr
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_type.h |  3 --
 drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c | 56 ++-------------------------
 2 files changed, 3 insertions(+), 56 deletions(-)


---
base-commit: e4a87abf588536d1cdfb128595e6e680af5cf3ed
change-id: 20240520-net-2024-05-20-revert-silicom-switch-workaround-48a6a7a9b0a0

Best regards,

Comments

Simon Horman May 21, 2024, 4:41 p.m. UTC | #1
On Mon, May 20, 2024 at 05:21:27PM -0700, Jacob Keller wrote:
> This reverts commit 565736048bd5f9888990569993c6b6bfdf6dcb6d.
> 
> According to the commit, it implements a manual AN-37 for some
> "troublesome" Juniper MX5 switches. This appears to be a workaround for a
> particular switch.
> 
> It has been reported that this causes a severe breakage for other switches,
> including a Cisco 3560CX-12PD-S.
> 
> The code appears to be a workaround for a specific switch which fails to
> link in SFI mode. It expects to see AN-37 auto negotiation in order to
> link. The Cisco switch is not expecting AN-37 auto negotiation. When the
> device starts the manual AN-37, the Cisco switch decides that the port is
> confused and stops attempting to link with it. This persists until a power
> cycle. A simple driver unload and reload does not resolve the issue, even
> if loading with a version of the driver which lacks this workaround.
> 
> The authors of the workaround commit have not responded with
> clarifications, and the result of the workaround is complete failure to
> connect with other switches.
> 
> This appears to be a case where the driver can either "correctly" link with
> the Juniper MX5 switch, at the cost of bricking the link with the Cisco
> switch, or it can behave properly for the Cisco switch, but fail to link
> with the Junipir MX5 switch. I do not know enough about the standards
> involved to clearly determine whether either switch is at fault or behaving
> incorrectly. Nor do I know whether there exists some alternative fix which
> corrects behavior with both switches.
> 
> Revert the workaround for the Juniper switch.
> 
> Fixes: 565736048bd5 ("ixgbe: Manual AN-37 for troublesome link partners for X550 SFI")
> Link: https://lore.kernel.org/netdev/cbe874db-9ac9-42b8-afa0-88ea910e1e99@intel.com/T/
> Link: https://forum.proxmox.com/threads/intel-x553-sfp-ixgbe-no-go-on-pve8.135129/#post-612291
> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
> Cc: Jeff Daly <jeffd@silicom-usa.com>
> Cc: kernel.org-fo5k2w@ycharbi.fr

One of those awkward situations where the only known (in this case to Jacob
and me) resolution to a regression is itself a regression (for a different
setup).

I think that in these kind of situations it's best to go back to how things
were.

Reviewed-by: Simon Horman <horms@kernel.org>
Jeff Daly May 21, 2024, 4:49 p.m. UTC | #2
> -----Original Message-----
> From: Simon Horman <horms@kernel.org>
> Sent: Tuesday, May 21, 2024 12:42 PM
> To: Jacob Keller <jacob.e.keller@intel.com>
> Cc: netdev@vger.kernel.org; Jeff Daly <jeffd@silicom-usa.com>; kernel.org-
> fo5k2w@ycharbi.fr
> Subject: Re: [PATCH net] Revert "ixgbe: Manual AN-37 for troublesome link
> partners for X550 SFI"
> 
> Caution: This is an external email. Please take care when clicking links or
> opening attachments.
> 
> 
> On Mon, May 20, 2024 at 05:21:27PM -0700, Jacob Keller wrote:
> > This reverts commit 565736048bd5f9888990569993c6b6bfdf6dcb6d.
> >
> > According to the commit, it implements a manual AN-37 for some
> > "troublesome" Juniper MX5 switches. This appears to be a workaround
> > for a particular switch.
> >
> > It has been reported that this causes a severe breakage for other
> > switches, including a Cisco 3560CX-12PD-S.
> >
> > The code appears to be a workaround for a specific switch which fails
> > to link in SFI mode. It expects to see AN-37 auto negotiation in order
> > to link. The Cisco switch is not expecting AN-37 auto negotiation.
> > When the device starts the manual AN-37, the Cisco switch decides that
> > the port is confused and stops attempting to link with it. This
> > persists until a power cycle. A simple driver unload and reload does
> > not resolve the issue, even if loading with a version of the driver which lacks
> this workaround.
> >
> > The authors of the workaround commit have not responded with
> > clarifications, and the result of the workaround is complete failure
> > to connect with other switches.
> >
> > This appears to be a case where the driver can either "correctly" link
> > with the Juniper MX5 switch, at the cost of bricking the link with the
> > Cisco switch, or it can behave properly for the Cisco switch, but fail
> > to link with the Junipir MX5 switch. I do not know enough about the
> > standards involved to clearly determine whether either switch is at
> > fault or behaving incorrectly. Nor do I know whether there exists some
> > alternative fix which corrects behavior with both switches.
> >
> > Revert the workaround for the Juniper switch.
> >
> > Fixes: 565736048bd5 ("ixgbe: Manual AN-37 for troublesome link
> > partners for X550 SFI")
> > Link:
> > https://lore.kernel.org/netdev/cbe874db-9ac9-42b8-afa0-88ea910e1e99@in
> > tel.com/T/
> > Link:
> > https://forum.proxmox.com/threads/intel-x553-sfp-ixgbe-no-go-on-pve8.1
> > 35129/#post-612291
> > Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
> > Cc: Jeff Daly <jeffd@silicom-usa.com>
> > Cc: kernel.org-fo5k2w@ycharbi.fr
> 
> One of those awkward situations where the only known (in this case to Jacob
> and me) resolution to a regression is itself a regression (for a different setup).
> 
> I think that in these kind of situations it's best to go back to how things were.
> 
> Reviewed-by: Simon Horman <horms@kernel.org>

In principle, I don't disagree.....  However, our customer was very sensitive to having any patches/workarounds needed for their configuration be part of the upstream.  Aside from maintaining our own patchset (or figuring out whether there's a patch that works for everyone) is there a better solution?
kernel.org-fo5k2w@ycharbi.fr May 21, 2024, 5:12 p.m. UTC | #3
If any of you have the skills to develop a patch that tries to satisfy everyone, please know that I'm always available for testing on my hardware. If Jeff also has the possibilities, it's not impossible that we could come to a consensus. All we'd have to do would be to test the behavior of our equipment in the problematic situation.

Isn't there someone at Intel who can contribute their expertise on the underlying technical reasons for the problem (obviously level 1 OSI) in order to guide us towards a state-of-the-art solution?

Best regards.



21 mai 2024 à 18:49 "Jeff Daly" <jeffd@silicom-usa.com> a écrit:


> 
> > 
> > -----Original Message-----
> >  From: Simon Horman <horms@kernel.org>
> >  Sent: Tuesday, May 21, 2024 12:42 PM
> >  To: Jacob Keller <jacob.e.keller@intel.com>
> >  Cc: netdev@vger.kernel.org; Jeff Daly <jeffd@silicom-usa.com>; kernel.org-
> >  fo5k2w@ycharbi.fr
> >  Subject: Re: [PATCH net] Revert "ixgbe: Manual AN-37 for troublesome link
> >  partners for X550 SFI"
> >  
> >  Caution: This is an external email. Please take care when clicking links or
> >  opening attachments.
> >  
> >  
> >  On Mon, May 20, 2024 at 05:21:27PM -0700, Jacob Keller wrote:
> >  This reverts commit 565736048bd5f9888990569993c6b6bfdf6dcb6d.
> > 
> >  According to the commit, it implements a manual AN-37 for some
> >  "troublesome" Juniper MX5 switches. This appears to be a workaround
> >  for a particular switch.
> > 
> >  It has been reported that this causes a severe breakage for other
> >  switches, including a Cisco 3560CX-12PD-S.
> > 
> >  The code appears to be a workaround for a specific switch which fails
> >  to link in SFI mode. It expects to see AN-37 auto negotiation in order
> >  to link. The Cisco switch is not expecting AN-37 auto negotiation.
> >  When the device starts the manual AN-37, the Cisco switch decides that
> >  the port is confused and stops attempting to link with it. This
> >  persists until a power cycle. A simple driver unload and reload does
> >  not resolve the issue, even if loading with a version of the driver which lacks
> >  this workaround.
> > 
> >  The authors of the workaround commit have not responded with
> >  clarifications, and the result of the workaround is complete failure
> >  to connect with other switches.
> > 
> >  This appears to be a case where the driver can either "correctly" link
> >  with the Juniper MX5 switch, at the cost of bricking the link with the
> >  Cisco switch, or it can behave properly for the Cisco switch, but fail
> >  to link with the Junipir MX5 switch. I do not know enough about the
> >  standards involved to clearly determine whether either switch is at
> >  fault or behaving incorrectly. Nor do I know whether there exists some
> >  alternative fix which corrects behavior with both switches.
> > 
> >  Revert the workaround for the Juniper switch.
> > 
> >  Fixes: 565736048bd5 ("ixgbe: Manual AN-37 for troublesome link
> >  partners for X550 SFI")
> >  Link:
> >  https://lore.kernel.org/netdev/cbe874db-9ac9-42b8-afa0-88ea910e1e99@in
> >  tel.com/T/
> >  Link:
> >  https://forum.proxmox.com/threads/intel-x553-sfp-ixgbe-no-go-on-pve8.1
> >  35129/#post-612291
> >  Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
> >  Cc: Jeff Daly <jeffd@silicom-usa.com>
> >  Cc: kernel.org-fo5k2w@ycharbi.fr
> >  
> >  One of those awkward situations where the only known (in this case to Jacob
> >  and me) resolution to a regression is itself a regression (for a different setup).
> >  
> >  I think that in these kind of situations it's best to go back to how things were.
> >  
> >  Reviewed-by: Simon Horman <horms@kernel.org>
> > 
> 
> In principle, I don't disagree..... However, our customer was very sensitive to having any patches/workarounds needed for their configuration be part of the upstream. Aside from maintaining our own patchset (or figuring out whether there's a patch that works for everyone) is there a better solution?
>
Jacob Keller May 21, 2024, 9:05 p.m. UTC | #4
On 5/21/2024 10:12 AM, kernel.org-fo5k2w@ycharbi.fr wrote:
> If any of you have the skills to develop a patch that tries to satisfy everyone, please know that I'm always available for testing on my hardware. If Jeff also has the possibilities, it's not impossible that we could come to a consensus. All we'd have to do would be to test the behavior of our equipment in the problematic situation.
> 

I would love a solution which fixes both cases. I don't currently have
any idea what it would be.

> Isn't there someone at Intel who can contribute their expertise on the underlying technical reasons for the problem (obviously level 1 OSI) in order to guide us towards a state-of-the-art solution?
> 
> Best regards.
> 

Unfortunately I do not know anyone still here who has the expertise to
solve this. The out-of-tree ixgbe driver does not have the fix from
Silicom, so from Intel's perspective the correct implementation matches
the need for the Cisco switch...

> On 5/21/2024 9:49 AM, Jeff Daly wrote:
>> 
>> 
>>> -----Original Message-----
>>> From: Simon Horman <horms@kernel.org>
>>> Sent: Tuesday, May 21, 2024 12:42 PM
>>> To: Jacob Keller <jacob.e.keller@intel.com>
>>> Cc: netdev@vger.kernel.org; Jeff Daly <jeffd@silicom-usa.com>; kernel.org-
>>> fo5k2w@ycharbi.fr
>>> Subject: Re: [PATCH net] Revert "ixgbe: Manual AN-37 for troublesome link
>>> partners for X550 SFI"
>>>
>>> One of those awkward situations where the only known (in this case to Jacob
>>> and me) resolution to a regression is itself a regression (for a different setup).
>>>
>>> I think that in these kind of situations it's best to go back to how things were.
>>>
>>> Reviewed-by: Simon Horman <horms@kernel.org>
>> 
>> In principle, I don't disagree.....  However, our customer was very sensitive to having any patches/workarounds needed for their configuration be part of the upstream.  Aside from maintaining our own patchset (or figuring out whether there's a patch that works for everyone) is there a better solution?
>> 
>> 

We're somewhat stuck between a rock and a hard place here. I don't have
full context for the problem, however I did manage to get a little more
info about this from internal bugs.

Here is the facts as i understand it:

1. The Juniper MX5 switch appears to require clause 37 auto negotiation
(AN-37) to link.
2. The Cisco 3560CX-12PD-S appears to reject AN-37 as invalid and stops
trying to link if it sees it for this case.
3. As far as I understand, AN-37 is intended for 1G links, and is not
generally supported or used in 10GB? It looks like the way this fix
applies affects all 10GB SFP links, which results in the issues with the
Cisco switch.


For context, this document was the best I found from a quick google
search: https://www.ieee802.org/3/by/public/Mar15/booth_3by_01_0315.pdf

It appears the Cisco device is linking at some form of 10GB according to
the bug report here:

> show interface status | include Te1/0/1
> Te1/0/1   --- Vers Qotom --- connected    trunk        full    10G 
> SFP-10GBase-CX1


The link is an SFP-10GBase-CX1?

@Jeff, can you provide any further details about the Juniper MX5 switch
case that the original change fixes?

The function being modified here is the ixgbe_setup_sfi_x550a, which is
called for setting up SFI, and the description says "Used to connect the
internal PHY directly to an SFP cage without auto-negotiation"

It is only called by ixgbe_setup_mac_link_sfp_n which is supposed to
configure the PHY for native SFP support for IXGBE_DEV_ID_X550EM_A_SFP_N
(0x15C4). No other device type is changed with this.

Every comment here implies that this has no auto negotiation, which
makes it extremely weird to me that we try to enable AN-37 in this flow.

Without more context, my gut instinct is that the Cisco switch is likely
following the general expectations here compared to the Juniper switch.

I also don't see a good way currently to have the driver select between
the options, if both cases are standard SFP. It can't know what its
linked against. If we try the AN-37 flow with Cisco, it essentially
bricks the link until a reboot. Even reloading to the out-of-tree driver
which doesn't do this AN-37 flow fails to recover link. This makes any
sort of "fallback" mechanism unlikely to work.

Unless we can find some obvious way to distinguish the two cases, or
there is fundamentally a different fix for the Juniper case, I don't see
how we can support both flows.

I guess there is the option of some sort of toggle via ethtool/otherwise
to allow selection... But users might try to enable this when link is
faulty and end up hitting the case where once we try the AN-37, the
remote switch refuses to try again until a cycle.
patchwork-bot+netdevbpf@kernel.org May 23, 2024, 9 a.m. UTC | #5
Hello:

This patch was applied to netdev/net.git (main)
by Paolo Abeni <pabeni@redhat.com>:

On Mon, 20 May 2024 17:21:27 -0700 you wrote:
> This reverts commit 565736048bd5f9888990569993c6b6bfdf6dcb6d.
> 
> According to the commit, it implements a manual AN-37 for some
> "troublesome" Juniper MX5 switches. This appears to be a workaround for a
> particular switch.
> 
> It has been reported that this causes a severe breakage for other switches,
> including a Cisco 3560CX-12PD-S.
> 
> [...]

Here is the summary with links:
  - [net] Revert "ixgbe: Manual AN-37 for troublesome link partners for X550 SFI"
    https://git.kernel.org/netdev/net/c/b35b1c0b4e16

You are awesome, thank you!
Jacob Keller May 23, 2024, 4:15 p.m. UTC | #6
On 5/21/2024 2:05 PM, Jacob Keller wrote:
> 
> 
> On 5/21/2024 10:12 AM, kernel.org-fo5k2w@ycharbi.fr wrote:
>> If any of you have the skills to develop a patch that tries to satisfy everyone, please know that I'm always available for testing on my hardware. If Jeff also has the possibilities, it's not impossible that we could come to a consensus. All we'd have to do would be to test the behavior of our equipment in the problematic situation.
>>
> 
> I would love a solution which fixes both cases. I don't currently have
> any idea what it would be.
> 

It looks like netdev pulled the revert. Given the lack of a full
understanding of the original fix posted from Jeff, I think this is the
right decision.

>> Isn't there someone at Intel who can contribute their expertise on the underlying technical reasons for the problem (obviously level 1 OSI) in order to guide us towards a state-of-the-art solution?
>>

I did create an internal ticket here to track and try to get a
reproduction so that some of our link experts can diagnose the issue
being seen.

I hope to have news I can report on this soon.

> I guess there is the option of some sort of toggle via ethtool/otherwise
> to allow selection... But users might try to enable this when link is
> faulty and end up hitting the case where once we try the AN-37, the
> remote switch refuses to try again until a cycle.
> 

Given that we have two cases where our current understanding is a need
for mutually exclusive behavior, we (Intel) would be open to some sort
of config option, flag, or otherwise toggle to enable the Silicom folks
without breaking everything else. I don't know what the acceptance for
such an idea is with the rest of the community.

I hope that internal reproduction task above may lead to a better
understanding and possibly a fix that can resolve both cases.
Jeff Daly May 23, 2024, 4:18 p.m. UTC | #7
Jacob, initially a config flag option was put forward but because of the maturity of the driver, that option was shot down by the maintainers.

> -----Original Message-----
> From: Jacob Keller <jacob.e.keller@intel.com>
> Sent: Thursday, May 23, 2024 12:15 PM
> To: kernel.org-fo5k2w@ycharbi.fr; Jeff Daly <jeffd@silicom-usa.com>; Simon
> Horman <horms@kernel.org>
> Cc: netdev@vger.kernel.org
> Subject: Re: [PATCH net] Revert "ixgbe: Manual AN-37 for troublesome link
> partners for X550 SFI"
> 
> Caution: This is an external email. Please take care when clicking links or
> opening attachments.
> 
> 
> On 5/21/2024 2:05 PM, Jacob Keller wrote:
> >
> >
> > On 5/21/2024 10:12 AM, kernel.org-fo5k2w@ycharbi.fr wrote:
> >> If any of you have the skills to develop a patch that tries to satisfy everyone,
> please know that I'm always available for testing on my hardware. If Jeff also
> has the possibilities, it's not impossible that we could come to a consensus. All
> we'd have to do would be to test the behavior of our equipment in the
> problematic situation.
> >>
> >
> > I would love a solution which fixes both cases. I don't currently have
> > any idea what it would be.
> >
> 
> It looks like netdev pulled the revert. Given the lack of a full understanding of the
> original fix posted from Jeff, I think this is the right decision.
> 
> >> Isn't there someone at Intel who can contribute their expertise on the
> underlying technical reasons for the problem (obviously level 1 OSI) in order to
> guide us towards a state-of-the-art solution?
> >>
> 
> I did create an internal ticket here to track and try to get a reproduction so that
> some of our link experts can diagnose the issue being seen.
> 
> I hope to have news I can report on this soon.
> 
> > I guess there is the option of some sort of toggle via
> > ethtool/otherwise to allow selection... But users might try to enable
> > this when link is faulty and end up hitting the case where once we try
> > the AN-37, the remote switch refuses to try again until a cycle.
> >
> 
> Given that we have two cases where our current understanding is a need for
> mutually exclusive behavior, we (Intel) would be open to some sort of config
> option, flag, or otherwise toggle to enable the Silicom folks without breaking
> everything else. I don't know what the acceptance for such an idea is with the
> rest of the community.
> 
> I hope that internal reproduction task above may lead to a better understanding
> and possibly a fix that can resolve both cases.
kernel.org-fo5k2w@ycharbi.fr May 23, 2024, 4:49 p.m. UTC | #8
> It looks like netdev pulled the revert. Given the lack of a full
> understanding of the original fix posted from Jeff, I think this is the
> right decision.

Thank you very much for your investment.
I hope a solution can be found for Jeff.
 
> I did create an internal ticket here to track and try to get a
> reproduction so that some of our link experts can diagnose the issue
> being seen.
> 
> I hope to have news I can report on this soon.

Super. Cela va peut-être faire remonter d'autres problèmes sous-jacent dans l'implémentation de la norme.

> 
> > 
> > I guess there is the option of some sort of toggle via ethtool/otherwise
> >  to allow selection... But users might try to enable this when link is
> >  faulty and end up hitting the case where once we try the AN-37, the
> >  remote switch refuses to try again until a cycle.
> > 
> 
> Given that we have two cases where our current understanding is a need
> for mutually exclusive behavior, we (Intel) would be open to some sort
> of config option, flag, or otherwise toggle to enable the Silicom folks
> without breaking everything else. I don't know what the acceptance for
> such an idea is with the rest of the community.
> 
> I hope that internal reproduction task above may lead to a better
> understanding and possibly a fix that can resolve both cases.
>

> The link is an SFP-10GBase-CX1?

As I understand it, CX1 is the name given to Twinax copper cables such as the one I used in the experiments in this thread. It's therefore a priori the right value to display for this kind of connection (instead of “10000baseT/Full”).

Thanks again for your hard work in finding a solution. You can always contact me later so that I can carry out tests if you need. The machine is at least available for 1 year for testing purposes.

Best regards,
Yohan.
Jacob Keller May 23, 2024, 6:47 p.m. UTC | #9
On 5/23/2024 9:49 AM, kernel.org-fo5k2w@ycharbi.fr wrote:

>> The link is an SFP-10GBase-CX1?
> 
> As I understand it, CX1 is the name given to Twinax copper cables such as the one I used in the experiments in this thread. It's therefore a priori the right value to display for this kind of connection (instead of “10000baseT/Full”).
> 

I'm filing an internal tracking bug regarding this as well.

> Thanks again for your hard work in finding a solution. You can always contact me later so that I can carry out tests if you need. The machine is at least available for 1 year for testing purposes.
> 
> Best regards,
> Yohan.

Appreciate it. Hopefully we can find something that works for both cases.
Jacob Keller May 23, 2024, 6:49 p.m. UTC | #10
On 5/23/2024 9:49 AM, kernel.org-fo5k2w@ycharbi.fr wrote:
>>
> 
>> The link is an SFP-10GBase-CX1?
> 
> As I understand it, CX1 is the name given to Twinax copper cables such as the one I used in the experiments in this thread. It's therefore a priori the right value to display for this kind of connection (instead of “10000baseT/Full”).
> 
One more thing: Could you confirm if this behavior appears in the 5.19.9
driver from the Intel website or source forge? I'm curious if this is a
case of a fix that never got published to the netdev community.

Thanks,
Jake
kernel.org-fo5k2w@ycharbi.fr May 23, 2024, 8:19 p.m. UTC | #11
> One more thing: Could you confirm if this behavior appears in the 5.19.9
> driver from the Intel website or source forge? I'm curious if this is a
> case of a fix that never got published to the netdev community.
> 
> Thanks,
> Jake

I can confirm that the result of the “Supported link modes:” section is identical with the Intel ixgbe-5.19.9 driver:
uname -r
5.10.0-29-amd64

apt install linux-headers-$(uname -r) gcc make
wget https://downloadmirror.intel.com/812532/ixgbe-5.19.9.tar.gz
tar xf ixgbe-5.19.9.tar.gz -C /usr/local/src/
make -j 8

modinfo ./ixgbe.ko
rmmod ixgbe
modprobe dca
insmod ./ixgbe.ko

# eno1 is up
ethtool eno1
Settings for eno1:
	Supported ports: [ FIBRE ]
	Supported link modes:   10000baseT/Full
	Supported pause frame use: Symmetric
	Supports auto-negotiation: No
	Supported FEC modes: Not reported
	Advertised link modes:  10000baseT/Full
	Advertised pause frame use: Symmetric
	Advertised auto-negotiation: No
	Advertised FEC modes: Not reported
	Speed: 10000Mb/s
	Duplex: Full
	Auto-negotiation: off
	Port: Direct Attach Copper
	PHYAD: 0
	Transceiver: internal
	Supports Wake-on: d
	Wake-on: d
        Current message level: 0x00000007 (7)
                               drv probe link
	Link detected: yes
Jacob Keller May 23, 2024, 8:35 p.m. UTC | #12
On 5/23/2024 1:19 PM, kernel.org-fo5k2w@ycharbi.fr wrote:
>> One more thing: Could you confirm if this behavior appears in the 5.19.9
>> driver from the Intel website or source forge? I'm curious if this is a
>> case of a fix that never got published to the netdev community.
>>
>> Thanks,
>> Jake
> 
> I can confirm that the result of the “Supported link modes:” section is identical with the Intel ixgbe-5.19.9 driver:
> uname -r
> 5.10.0-29-amd64
> 
> apt install linux-headers-$(uname -r) gcc make
> wget https://downloadmirror.intel.com/812532/ixgbe-5.19.9.tar.gz
> tar xf ixgbe-5.19.9.tar.gz -C /usr/local/src/
> make -j 8
> 
> modinfo ./ixgbe.ko
> rmmod ixgbe
> modprobe dca
> insmod ./ixgbe.ko
> 
> # eno1 is up
> ethtool eno1
> Settings for eno1:
> 	Supported ports: [ FIBRE ]
> 	Supported link modes:   10000baseT/Full
> 	Supported pause frame use: Symmetric
> 	Supports auto-negotiation: No
> 	Supported FEC modes: Not reported
> 	Advertised link modes:  10000baseT/Full
> 	Advertised pause frame use: Symmetric
> 	Advertised auto-negotiation: No
> 	Advertised FEC modes: Not reported
> 	Speed: 10000Mb/s
> 	Duplex: Full
> 	Auto-negotiation: off
> 	Port: Direct Attach Copper
> 	PHYAD: 0
> 	Transceiver: internal
> 	Supports Wake-on: d
> 	Wake-on: d
>         Current message level: 0x00000007 (7)
>                                drv probe link
> 	Link detected: yes

Thanks. At a glance from reviewing code, it looks like ixgbe simply
collapses all non-backplane links into 10000baseT/Full.
Jeff Daly June 26, 2024, 2:30 p.m. UTC | #13
Looking for a solution that would satisfy everyone....

> -----Original Message-----
> From: Jacob Keller <jacob.e.keller@intel.com>
> Sent: Thursday, May 23, 2024 12:15 PM
> To: kernel.org-fo5k2w@ycharbi.fr; Jeff Daly <jeffd@silicom-usa.com>; Simon
> Horman <horms@kernel.org>
> Cc: netdev@vger.kernel.org
> Subject: Re: [PATCH net] Revert "ixgbe: Manual AN-37 for troublesome link
> partners for X550 SFI"
> 
> Caution: This is an external email. Please take care when clicking links or
> opening attachments.
> 
> 
> On 5/21/2024 2:05 PM, Jacob Keller wrote:
> >
> >
> > On 5/21/2024 10:12 AM, kernel.org-fo5k2w@ycharbi.fr wrote:
> >> If any of you have the skills to develop a patch that tries to satisfy everyone,
> please know that I'm always available for testing on my hardware. If Jeff also
> has the possibilities, it's not impossible that we could come to a consensus. All
> we'd have to do would be to test the behavior of our equipment in the
> problematic situation.
> >>
> >
> > I would love a solution which fixes both cases. I don't currently have
> > any idea what it would be.
> >
> 
> It looks like netdev pulled the revert. Given the lack of a full understanding of the
> original fix posted from Jeff, I think this is the right decision.
> 
> >> Isn't there someone at Intel who can contribute their expertise on the
> underlying technical reasons for the problem (obviously level 1 OSI) in order to
> guide us towards a state-of-the-art solution?
> >>
> 
> I did create an internal ticket here to track and try to get a reproduction so that
> some of our link experts can diagnose the issue being seen.
> 
> I hope to have news I can report on this soon.
> 
> > I guess there is the option of some sort of toggle via
> > ethtool/otherwise to allow selection... But users might try to enable
> > this when link is faulty and end up hitting the case where once we try
> > the AN-37, the remote switch refuses to try again until a cycle.
> >
> 
> Given that we have two cases where our current understanding is a need for
> mutually exclusive behavior, we (Intel) would be open to some sort of config
> option, flag, or otherwise toggle to enable the Silicom folks without breaking
> everything else. I don't know what the acceptance for such an idea is with the
> rest of the community.
> 

Originally, this was the idea that was put forward, and if I recall it was quashed by the maintainers due to the maturity of the driver.  I was told that introducing new config options were a no-go.  If there's a possibility of reworking the patch to introduce a new config option during module loading that would be specific to our fix, I'm sure we can come up with something appropriate....  I just don't want to try to push something that will never get accepted, but I think in this case it's something that could be warranted.
I understand the desire to not have special workaround code for a singular case, but in this instance I think it's the only option.

> I hope that internal reproduction task above may lead to a better understanding
> and possibly a fix that can resolve both cases.
diff mbox series

Patch

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
index 897fe357b65b..346e3d9114a8 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
@@ -3675,9 +3675,7 @@  struct ixgbe_info {
 #define IXGBE_KRM_LINK_S1(P)		((P) ? 0x8200 : 0x4200)
 #define IXGBE_KRM_LINK_CTRL_1(P)	((P) ? 0x820C : 0x420C)
 #define IXGBE_KRM_AN_CNTL_1(P)		((P) ? 0x822C : 0x422C)
-#define IXGBE_KRM_AN_CNTL_4(P)		((P) ? 0x8238 : 0x4238)
 #define IXGBE_KRM_AN_CNTL_8(P)		((P) ? 0x8248 : 0x4248)
-#define IXGBE_KRM_PCS_KX_AN(P)		((P) ? 0x9918 : 0x5918)
 #define IXGBE_KRM_SGMII_CTRL(P)		((P) ? 0x82A0 : 0x42A0)
 #define IXGBE_KRM_LP_BASE_PAGE_HIGH(P)	((P) ? 0x836C : 0x436C)
 #define IXGBE_KRM_DSP_TXFFE_STATE_4(P)	((P) ? 0x8634 : 0x4634)
@@ -3687,7 +3685,6 @@  struct ixgbe_info {
 #define IXGBE_KRM_PMD_FLX_MASK_ST20(P)	((P) ? 0x9054 : 0x5054)
 #define IXGBE_KRM_TX_COEFF_CTRL_1(P)	((P) ? 0x9520 : 0x5520)
 #define IXGBE_KRM_RX_ANA_CTL(P)		((P) ? 0x9A00 : 0x5A00)
-#define IXGBE_KRM_FLX_TMRS_CTRL_ST31(P)	((P) ? 0x9180 : 0x5180)
 
 #define IXGBE_KRM_PMD_FLX_MASK_ST20_SFI_10G_DA		~(0x3 << 20)
 #define IXGBE_KRM_PMD_FLX_MASK_ST20_SFI_10G_SR		BIT(20)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c
index 2decb0710b6e..a5f644934445 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c
@@ -1722,59 +1722,9 @@  static int ixgbe_setup_sfi_x550a(struct ixgbe_hw *hw, ixgbe_link_speed *speed)
 		return -EINVAL;
 	}
 
-	(void)mac->ops.write_iosf_sb_reg(hw,
-			IXGBE_KRM_PMD_FLX_MASK_ST20(hw->bus.lan_id),
-			IXGBE_SB_IOSF_TARGET_KR_PHY, reg_val);
-
-	/* change mode enforcement rules to hybrid */
-	(void)mac->ops.read_iosf_sb_reg(hw,
-			IXGBE_KRM_FLX_TMRS_CTRL_ST31(hw->bus.lan_id),
-			IXGBE_SB_IOSF_TARGET_KR_PHY, &reg_val);
-	reg_val |= 0x0400;
-
-	(void)mac->ops.write_iosf_sb_reg(hw,
-			IXGBE_KRM_FLX_TMRS_CTRL_ST31(hw->bus.lan_id),
-			IXGBE_SB_IOSF_TARGET_KR_PHY, reg_val);
-
-	/* manually control the config */
-	(void)mac->ops.read_iosf_sb_reg(hw,
-			IXGBE_KRM_LINK_CTRL_1(hw->bus.lan_id),
-			IXGBE_SB_IOSF_TARGET_KR_PHY, &reg_val);
-	reg_val |= 0x20002240;
-
-	(void)mac->ops.write_iosf_sb_reg(hw,
-			IXGBE_KRM_LINK_CTRL_1(hw->bus.lan_id),
-			IXGBE_SB_IOSF_TARGET_KR_PHY, reg_val);
-
-	/* move the AN base page values */
-	(void)mac->ops.read_iosf_sb_reg(hw,
-			IXGBE_KRM_PCS_KX_AN(hw->bus.lan_id),
-			IXGBE_SB_IOSF_TARGET_KR_PHY, &reg_val);
-	reg_val |= 0x1;
-
-	(void)mac->ops.write_iosf_sb_reg(hw,
-			IXGBE_KRM_PCS_KX_AN(hw->bus.lan_id),
-			IXGBE_SB_IOSF_TARGET_KR_PHY, reg_val);
-
-	/* set the AN37 over CB mode */
-	(void)mac->ops.read_iosf_sb_reg(hw,
-			IXGBE_KRM_AN_CNTL_4(hw->bus.lan_id),
-			IXGBE_SB_IOSF_TARGET_KR_PHY, &reg_val);
-	reg_val |= 0x20000000;
-
-	(void)mac->ops.write_iosf_sb_reg(hw,
-			IXGBE_KRM_AN_CNTL_4(hw->bus.lan_id),
-			IXGBE_SB_IOSF_TARGET_KR_PHY, reg_val);
-
-	/* restart AN manually */
-	(void)mac->ops.read_iosf_sb_reg(hw,
-			IXGBE_KRM_LINK_CTRL_1(hw->bus.lan_id),
-			IXGBE_SB_IOSF_TARGET_KR_PHY, &reg_val);
-	reg_val |= IXGBE_KRM_LINK_CTRL_1_TETH_AN_RESTART;
-
-	(void)mac->ops.write_iosf_sb_reg(hw,
-			IXGBE_KRM_LINK_CTRL_1(hw->bus.lan_id),
-			IXGBE_SB_IOSF_TARGET_KR_PHY, reg_val);
+	status = mac->ops.write_iosf_sb_reg(hw,
+				IXGBE_KRM_PMD_FLX_MASK_ST20(hw->bus.lan_id),
+				IXGBE_SB_IOSF_TARGET_KR_PHY, reg_val);
 
 	/* Toggle port SW reset by AN reset. */
 	status = ixgbe_restart_an_internal_phy_x550em(hw);