mbox series

[v8,net-next,0/2] mv88e6xxx: Add MAB offload support

Message ID 20221112203748.68995-1-netdev@kapio-technology.com (mailing list archive)
Headers show
Series mv88e6xxx: Add MAB offload support | expand

Message

Hans Schultz Nov. 12, 2022, 8:37 p.m. UTC
This patchset adds MAB [1] offload support in mv88e6xxx.

Patch #1: Fix a problem when reading the FID needed to get the VID.

Patch #2: The MAB implementation for mv88e6xxx.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=a35ec8e38cdd1766f29924ca391a01de20163931

Hans J. Schultz (2):
  net: dsa: mv88e6xxx: allow reading FID when handling ATU violations
  net: dsa: mv88e6xxx: mac-auth/MAB implementation

 drivers/net/dsa/mv88e6xxx/Makefile      |  1 +
 drivers/net/dsa/mv88e6xxx/chip.c        | 19 ++++--
 drivers/net/dsa/mv88e6xxx/chip.h        | 10 ++++
 drivers/net/dsa/mv88e6xxx/global1_atu.c | 70 ++++++++++++++++++++--
 drivers/net/dsa/mv88e6xxx/port.h        |  6 ++
 drivers/net/dsa/mv88e6xxx/switchdev.c   | 77 +++++++++++++++++++++++++
 drivers/net/dsa/mv88e6xxx/switchdev.h   | 19 ++++++
 7 files changed, 190 insertions(+), 12 deletions(-)
 create mode 100644 drivers/net/dsa/mv88e6xxx/switchdev.c
 create mode 100644 drivers/net/dsa/mv88e6xxx/switchdev.h

Comments

Jakub Kicinski Nov. 15, 2022, 2:57 a.m. UTC | #1
On Sat, 12 Nov 2022 21:37:46 +0100 Hans J. Schultz wrote:
> This patchset adds MAB [1] offload support in mv88e6xxx.
> 
> Patch #1: Fix a problem when reading the FID needed to get the VID.
> 
> Patch #2: The MAB implementation for mv88e6xxx.
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=a35ec8e38cdd1766f29924ca391a01de20163931

Vladimir, Ido, ack?
Jakub Kicinski Nov. 15, 2022, 5:18 a.m. UTC | #2
On Mon, 14 Nov 2022 18:57:04 -0800 Jakub Kicinski wrote:
> On Sat, 12 Nov 2022 21:37:46 +0100 Hans J. Schultz wrote:
> > This patchset adds MAB [1] offload support in mv88e6xxx.
> > 
> > Patch #1: Fix a problem when reading the FID needed to get the VID.
> > 
> > Patch #2: The MAB implementation for mv88e6xxx.
> > 
> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=a35ec8e38cdd1766f29924ca391a01de20163931  
> 
> Vladimir, Ido, ack?

Ah, either way a v9 will be needed:

drivers/net/dsa/mv88e6xxx/switchdev.c:33:5: warning: no previous prototype for function 'mv88e6xxx_handle_violation' [-Wmissing-prototypes]
int mv88e6xxx_handle_violation(struct mv88e6xxx_chip *chip, int port,
    ^
drivers/net/dsa/mv88e6xxx/switchdev.c:33:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
int mv88e6xxx_handle_violation(struct mv88e6xxx_chip *chip, int port,
^
Ido Schimmel Nov. 15, 2022, 9:30 a.m. UTC | #3
On Sat, Nov 12, 2022 at 09:37:46PM +0100, Hans J. Schultz wrote:
> This patchset adds MAB [1] offload support in mv88e6xxx.
> 
> Patch #1: Fix a problem when reading the FID needed to get the VID.
> 
> Patch #2: The MAB implementation for mv88e6xxx.

Just to be sure, this was tested with bridge_locked_port.sh, right?
Hans Schultz Nov. 15, 2022, 10:26 a.m. UTC | #4
On 2022-11-15 10:30, Ido Schimmel wrote:
> On Sat, Nov 12, 2022 at 09:37:46PM +0100, Hans J. Schultz wrote:
>> This patchset adds MAB [1] offload support in mv88e6xxx.
>> 
>> Patch #1: Fix a problem when reading the FID needed to get the VID.
>> 
>> Patch #2: The MAB implementation for mv88e6xxx.
> 
> Just to be sure, this was tested with bridge_locked_port.sh, right?

As I have the phy regression I have given notice of, that has simply not
been possible. After maybe 50 resets it worked for me at a point
(something to do with timing), and I tested it manually.

When I have tried to run the selftests, I get errors related to the phy
problem, which I have not been able to find a way around.
Vladimir Oltean Nov. 15, 2022, 10:28 a.m. UTC | #5
On Tue, Nov 15, 2022 at 11:26:55AM +0100, netdev@kapio-technology.com wrote:
> On 2022-11-15 10:30, Ido Schimmel wrote:
> > On Sat, Nov 12, 2022 at 09:37:46PM +0100, Hans J. Schultz wrote:
> > > This patchset adds MAB [1] offload support in mv88e6xxx.
> > > 
> > > Patch #1: Fix a problem when reading the FID needed to get the VID.
> > > 
> > > Patch #2: The MAB implementation for mv88e6xxx.
> > 
> > Just to be sure, this was tested with bridge_locked_port.sh, right?
> 
> As I have the phy regression I have given notice of, that has simply not
> been possible. After maybe 50 resets it worked for me at a point
> (something to do with timing), and I tested it manually.
> 
> When I have tried to run the selftests, I get errors related to the phy
> problem, which I have not been able to find a way around.

What PHY regression?
Hans Schultz Nov. 15, 2022, 10:52 a.m. UTC | #6
On 2022-11-15 11:28, Vladimir Oltean wrote:
> On Tue, Nov 15, 2022 at 11:26:55AM +0100, netdev@kapio-technology.com 
> wrote:
>> On 2022-11-15 10:30, Ido Schimmel wrote:
>> > On Sat, Nov 12, 2022 at 09:37:46PM +0100, Hans J. Schultz wrote:
>> > > This patchset adds MAB [1] offload support in mv88e6xxx.
>> > >
>> > > Patch #1: Fix a problem when reading the FID needed to get the VID.
>> > >
>> > > Patch #2: The MAB implementation for mv88e6xxx.
>> >
>> > Just to be sure, this was tested with bridge_locked_port.sh, right?
>> 
>> As I have the phy regression I have given notice of, that has simply 
>> not
>> been possible. After maybe 50 resets it worked for me at a point
>> (something to do with timing), and I tested it manually.
>> 
>> When I have tried to run the selftests, I get errors related to the 
>> phy
>> problem, which I have not been able to find a way around.
> 
> What PHY regression?

I had a discussion with Jacub, which resulted in me sending a mail to
maintainers on this. The problem is shown below...


the phy is marvell/6097/88e3082

------------[ cut here ]------------
WARNING: CPU: 0 PID: 332 at drivers/net/phy/phy.c:975
phy_error+0x28/0x54
Modules linked in:
CPU: 0 PID: 332 Comm: kworker/0:0 Tainted: G        W          6.0.0 #17
Hardware name: Freescale i.MX27 (Device Tree Support)
Workqueue: events_power_efficient phy_state_machine
   unwind_backtrace from show_stack+0x18/0x1c
   show_stack from dump_stack_lvl+0x28/0x30
   dump_stack_lvl from __warn+0xb8/0x114
   __warn from warn_slowpath_fmt+0x80/0xbc
   warn_slowpath_fmt from phy_error+0x28/0x54
   phy_error from phy_state_machine+0xbc/0x218
   phy_state_machine from process_one_work+0x17c/0x244
   process_one_work from worker_thread+0x248/0x2cc
   worker_thread from kthread+0xb0/0xbc
   kthread from ret_from_fork+0x14/0x2c
Exception stack(0xc4a71fb0 to 0xc4a71ff8)
1fa0:                                     00000000 00000000 00000000
00000000
1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
1fe0: 00000000 00000000 00000000 00000000 00000013 00000000
---[ end trace 0000000000000000 ]---
Vladimir Oltean Nov. 15, 2022, 11:10 a.m. UTC | #7
On Tue, Nov 15, 2022 at 11:52:40AM +0100, netdev@kapio-technology.com wrote:
> I had a discussion with Jacub, which resulted in me sending a mail to
> maintainers on this. The problem is shown below...
> 
> the phy is marvell/6097/88e3082
> 
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 332 at drivers/net/phy/phy.c:975
> phy_error+0x28/0x54
> Modules linked in:
> CPU: 0 PID: 332 Comm: kworker/0:0 Tainted: G        W          6.0.0 #17
> Hardware name: Freescale i.MX27 (Device Tree Support)
> Workqueue: events_power_efficient phy_state_machine
>   unwind_backtrace from show_stack+0x18/0x1c
>   show_stack from dump_stack_lvl+0x28/0x30
>   dump_stack_lvl from __warn+0xb8/0x114
>   __warn from warn_slowpath_fmt+0x80/0xbc
>   warn_slowpath_fmt from phy_error+0x28/0x54
>   phy_error from phy_state_machine+0xbc/0x218
>   phy_state_machine from process_one_work+0x17c/0x244
>   process_one_work from worker_thread+0x248/0x2cc
>   worker_thread from kthread+0xb0/0xbc
>   kthread from ret_from_fork+0x14/0x2c
> Exception stack(0xc4a71fb0 to 0xc4a71ff8)
> 1fa0:                                     00000000 00000000 00000000 00000000
> 1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> 1fe0: 00000000 00000000 00000000 00000000 00000013 00000000
> ---[ end trace 0000000000000000 ]---

Was that email public on the lists? I don't see it...

The phy_error() is called from phy_state_machine() if one of
phy_check_link_status() or phy_start_aneg() fails.

Could you please print exactly the value of "err", as well as dig deeper
to see which call is failing, all the way into the PHY driver?

Easiest way to do that would probably be something like:

$ trace-cmd record -e mdio sleep 10 &
... do your stuff ...
$ trace-cmd report
    kworker/u4:3-337   [001]    59.054741: mdio_access:          0000:00:00.3 read  phy:0x13 reg:0x01 val:0x7949
    kworker/u4:3-337   [001]    59.054941: mdio_access:          0000:00:00.3 read  phy:0x13 reg:0x09 val:0x0700
    kworker/u4:3-337   [001]    59.055262: mdio_access:          0000:00:00.3 read  phy:0x13 reg:0x0a val:0x4000
    kworker/u4:3-337   [001]    60.075808: mdio_access:          0000:00:00.3 read  phy:0x13 reg:0x1c val:0x3005

"val" will be negative when there is an error. This is to see quicker what fails,
and if some MDIO access ever works.

If you don't want to enable CONFIG_FTRACE or install trace-cmd, you
could also probably do the debugging manually.
Hans Schultz Nov. 15, 2022, 11:31 a.m. UTC | #8
On 2022-11-15 12:10, Vladimir Oltean wrote:
> On Tue, Nov 15, 2022 at 11:52:40AM +0100, netdev@kapio-technology.com 
> wrote:
>> I had a discussion with Jacub, which resulted in me sending a mail to
>> maintainers on this. The problem is shown below...
>> 
>> the phy is marvell/6097/88e3082
>> 
>> ------------[ cut here ]------------
>> WARNING: CPU: 0 PID: 332 at drivers/net/phy/phy.c:975
>> phy_error+0x28/0x54
>> Modules linked in:
>> CPU: 0 PID: 332 Comm: kworker/0:0 Tainted: G        W          6.0.0 
>> #17
>> Hardware name: Freescale i.MX27 (Device Tree Support)
>> Workqueue: events_power_efficient phy_state_machine
>>   unwind_backtrace from show_stack+0x18/0x1c
>>   show_stack from dump_stack_lvl+0x28/0x30
>>   dump_stack_lvl from __warn+0xb8/0x114
>>   __warn from warn_slowpath_fmt+0x80/0xbc
>>   warn_slowpath_fmt from phy_error+0x28/0x54
>>   phy_error from phy_state_machine+0xbc/0x218
>>   phy_state_machine from process_one_work+0x17c/0x244
>>   process_one_work from worker_thread+0x248/0x2cc
>>   worker_thread from kthread+0xb0/0xbc
>>   kthread from ret_from_fork+0x14/0x2c
>> Exception stack(0xc4a71fb0 to 0xc4a71ff8)
>> 1fa0:                                     00000000 00000000 00000000 
>> 00000000
>> 1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 
>> 00000000
>> 1fe0: 00000000 00000000 00000000 00000000 00000013 00000000
>> ---[ end trace 0000000000000000 ]---
> 
> Was that email public on the lists? I don't see it...

Sorry, yes the public list was not added.

> 
> The phy_error() is called from phy_state_machine() if one of
> phy_check_link_status() or phy_start_aneg() fails.
> 
> Could you please print exactly the value of "err", as well as dig 
> deeper
> to see which call is failing, all the way into the PHY driver?

It happens on upstart, so I would then have to hack the system upstart 
to add trace.

I also have:
mv88e6085 1002b000.ethernet-1:04: switch 0x990 detected: Marvell 
88E6097/88E6097F, revision 2
mv88e6085 1002b000.ethernet-1:04: configuring for fixed/rgmii-id link 
mode
mv88e6085 1002b000.ethernet-1:04: Link is Up - 100Mbps/Full - flow 
control off
mv88e6085 1002b000.ethernet-1:04 eth10 (uninitialized): PHY 
[!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:00] driver 
[Generic PHY] (irq=POLL)
mv88e6085 1002b000.ethernet-1:04 eth6 (uninitialized): PHY 
[!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:01] driver 
[Generic PHY] (irq=POLL)
mv88e6085 1002b000.ethernet-1:04 eth9 (uninitialized): PHY 
[!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:02] driver 
[Generic PHY] (irq=POLL)
mv88e6085 1002b000.ethernet-1:04 eth5 (uninitialized): PHY 
[!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:03] driver 
[Generic PHY] (irq=POLL)
mv88e6085 1002b000.ethernet-1:04 eth8 (uninitialized): PHY 
[!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:04] driver 
[Generic PHY] (irq=POLL)
mv88e6085 1002b000.ethernet-1:04 eth4 (uninitialized): PHY 
[!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:05] driver 
[Generic PHY] (irq=POLL)
mv88e6085 1002b000.ethernet-1:04 eth7 (uninitialized): PHY 
[!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:06] driver 
[Generic PHY] (irq=POLL)
mv88e6085 1002b000.ethernet-1:04 eth3 (uninitialized): PHY 
[!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:07] driver 
[Generic PHY] (irq=POLL)
mv88e6085 1002b000.ethernet-1:04 eth2 (uninitialized): PHY 
[!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdioe:08] driver 
[Marvell 88E1112] (irq=174)
mv88e6085 1002b000.ethernet-1:04 eth1 (uninitialized): PHY 
[!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdioe:09] driver 
[Marvell 88E1112] (irq=175)

after this and adding the ifaces to the bridge, it continues like:

br0: port 1(eth10) entered blocking state
br0: port 1(eth10) entered disabled state
br0: port 2(eth6) entered blocking state
br0: port 2(eth6) entered disabled state
device eth6 entered promiscuous mode
device eth10 entered promiscuous mode
br0: port 3(eth9) entered blocking state
br0: port 3(eth9) entered disabled state
device eth9 entered promiscuous mode
br0: port 4(eth5) entered blocking state
br0: port 4(eth5) entered disabled state
device eth5 entered promiscuous mode
br0: port 5(eth8) entered blocking state
br0: port 5(eth8) entered disabled state
device eth8 entered promiscuous mode
br0: port 6(eth4) entered blocking state
br0: port 6(eth4) entered disabled state
mv88e6085 1002b000.ethernet-1:04: Timeout while waiting for switch
mv88e6085 1002b000.ethernet-1:04: port 0 failed to add 9a:af:03:f1:bd:0a 
vid 1 to fdb: -110
device eth4 entered promiscuous mode
br0: port 7(eth7) entered blocking state
br0: port 7(eth7) entered disabled state

I don't know if that gives ay clues...?

Otherwise I have to take more time to see what I can dig out. The 
easiest for me is then to
add some printk statements giving targeted information if told what and 
where...

> 
> Easiest way to do that would probably be something like:
> 
> $ trace-cmd record -e mdio sleep 10 &
> ... do your stuff ...
> $ trace-cmd report
>     kworker/u4:3-337   [001]    59.054741: mdio_access:
> 0000:00:00.3 read  phy:0x13 reg:0x01 val:0x7949
>     kworker/u4:3-337   [001]    59.054941: mdio_access:
> 0000:00:00.3 read  phy:0x13 reg:0x09 val:0x0700
>     kworker/u4:3-337   [001]    59.055262: mdio_access:
> 0000:00:00.3 read  phy:0x13 reg:0x0a val:0x4000
>     kworker/u4:3-337   [001]    60.075808: mdio_access:
> 0000:00:00.3 read  phy:0x13 reg:0x1c val:0x3005
> 
> "val" will be negative when there is an error. This is to see quicker
> what fails,
> and if some MDIO access ever works.
> 
> If you don't want to enable CONFIG_FTRACE or install trace-cmd, you
> could also probably do the debugging manually.
Vladimir Oltean Nov. 15, 2022, 12:22 p.m. UTC | #9
On Tue, Nov 15, 2022 at 12:31:59PM +0100, netdev@kapio-technology.com wrote:
> It happens on upstart, so I would then have to hack the system upstart to
> add trace.

Hack upstart or disable the service that brings the switch ports up, and
bring them up manually...

> I also have:
> mv88e6085 1002b000.ethernet-1:04: switch 0x990 detected: Marvell 88E6097/88E6097F, revision 2
> mv88e6085 1002b000.ethernet-1:04: configuring for fixed/rgmii-id link mode
> mv88e6085 1002b000.ethernet-1:04: Link is Up - 100Mbps/Full - flow control off
> mv88e6085 1002b000.ethernet-1:04 eth10 (uninitialized): PHY [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:00] driver [Generic PHY] (irq=POLL)
> mv88e6085 1002b000.ethernet-1:04 eth6 (uninitialized): PHY [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:01] driver [Generic PHY] (irq=POLL)
> mv88e6085 1002b000.ethernet-1:04 eth9 (uninitialized): PHY [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:02] driver [Generic PHY] (irq=POLL)
> mv88e6085 1002b000.ethernet-1:04 eth5 (uninitialized): PHY [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:03] driver [Generic PHY] (irq=POLL)
> mv88e6085 1002b000.ethernet-1:04 eth8 (uninitialized): PHY [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:04] driver [Generic PHY] (irq=POLL)
> mv88e6085 1002b000.ethernet-1:04 eth4 (uninitialized): PHY [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:05] driver [Generic PHY] (irq=POLL)
> mv88e6085 1002b000.ethernet-1:04 eth7 (uninitialized): PHY [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:06] driver [Generic PHY] (irq=POLL)
> mv88e6085 1002b000.ethernet-1:04 eth3 (uninitialized): PHY [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:07] driver [Generic PHY] (irq=POLL)
> mv88e6085 1002b000.ethernet-1:04 eth2 (uninitialized): PHY [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdioe:08] driver [Marvell 88E1112] (irq=174)
> mv88e6085 1002b000.ethernet-1:04 eth1 (uninitialized): PHY [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdioe:09] driver [Marvell 88E1112] (irq=175)
> 
> after this and adding the ifaces to the bridge, it continues like:
> 
> br0: port 1(eth10) entered blocking state
> br0: port 1(eth10) entered disabled state
> br0: port 2(eth6) entered blocking state
> br0: port 2(eth6) entered disabled state
> device eth6 entered promiscuous mode
> device eth10 entered promiscuous mode
> br0: port 3(eth9) entered blocking state
> br0: port 3(eth9) entered disabled state
> device eth9 entered promiscuous mode
> br0: port 4(eth5) entered blocking state
> br0: port 4(eth5) entered disabled state
> device eth5 entered promiscuous mode
> br0: port 5(eth8) entered blocking state
> br0: port 5(eth8) entered disabled state
> device eth8 entered promiscuous mode
> br0: port 6(eth4) entered blocking state
> br0: port 6(eth4) entered disabled state
> mv88e6085 1002b000.ethernet-1:04: Timeout while waiting for switch
> mv88e6085 1002b000.ethernet-1:04: port 0 failed to add 9a:af:03:f1:bd:0a vid 1 to fdb: -110

Dumb question, but if you get errors like this, how can you test anything at all
in the patches that you submit?

> device eth4 entered promiscuous mode
> br0: port 7(eth7) entered blocking state
> br0: port 7(eth7) entered disabled state
> 
> I don't know if that gives ay clues...?

Not really. That error might be related - something indicating a breakage
in the top-level (fec IIUC) MDIO controller, or not. There was "recent"
rework almost everywhere.  For example commit 35da1dfd9484 ("net: dsa:
mv88e6xxx: Improve performance of busy bit polling"). That also hooks
into the mv88e6xxx cascaded MDIO controller (mv88e6xxx_g2_smi_phy_wait),
so there might be something there.

> 
> Otherwise I have to take more time to see what I can dig out. The easiest
> for me is then to add some printk statements giving targeted information if told what and
> where...

Do you have a timeline for when the regression was introduced?
Commit 35da1dfd9484 reverts cleanly, so I suppose giving it a go with
that reverted might be worth a shot. Otherwise, a bisect from a known
working version only takes a couple of hours, and shouldn't require
other changes to the setup.
Hans Schultz Nov. 15, 2022, 12:40 p.m. UTC | #10
On 2022-11-15 13:22, Vladimir Oltean wrote:
> On Tue, Nov 15, 2022 at 12:31:59PM +0100, netdev@kapio-technology.com 
> wrote:
>> It happens on upstart, so I would then have to hack the system upstart 
>> to
>> add trace.
> 
> Hack upstart or disable the service that brings the switch ports up, 
> and
> bring them up manually...
> 
>> I also have:
>> mv88e6085 1002b000.ethernet-1:04: switch 0x990 detected: Marvell 
>> 88E6097/88E6097F, revision 2
>> mv88e6085 1002b000.ethernet-1:04: configuring for fixed/rgmii-id link 
>> mode
>> mv88e6085 1002b000.ethernet-1:04: Link is Up - 100Mbps/Full - flow 
>> control off
>> mv88e6085 1002b000.ethernet-1:04 eth10 (uninitialized): PHY 
>> [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:00] driver 
>> [Generic PHY] (irq=POLL)
>> mv88e6085 1002b000.ethernet-1:04 eth6 (uninitialized): PHY 
>> [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:01] driver 
>> [Generic PHY] (irq=POLL)
>> mv88e6085 1002b000.ethernet-1:04 eth9 (uninitialized): PHY 
>> [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:02] driver 
>> [Generic PHY] (irq=POLL)
>> mv88e6085 1002b000.ethernet-1:04 eth5 (uninitialized): PHY 
>> [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:03] driver 
>> [Generic PHY] (irq=POLL)
>> mv88e6085 1002b000.ethernet-1:04 eth8 (uninitialized): PHY 
>> [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:04] driver 
>> [Generic PHY] (irq=POLL)
>> mv88e6085 1002b000.ethernet-1:04 eth4 (uninitialized): PHY 
>> [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:05] driver 
>> [Generic PHY] (irq=POLL)
>> mv88e6085 1002b000.ethernet-1:04 eth7 (uninitialized): PHY 
>> [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:06] driver 
>> [Generic PHY] (irq=POLL)
>> mv88e6085 1002b000.ethernet-1:04 eth3 (uninitialized): PHY 
>> [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:07] driver 
>> [Generic PHY] (irq=POLL)
>> mv88e6085 1002b000.ethernet-1:04 eth2 (uninitialized): PHY 
>> [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdioe:08] driver 
>> [Marvell 88E1112] (irq=174)
>> mv88e6085 1002b000.ethernet-1:04 eth1 (uninitialized): PHY 
>> [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdioe:09] driver 
>> [Marvell 88E1112] (irq=175)
>> 
>> after this and adding the ifaces to the bridge, it continues like:
>> 
>> br0: port 1(eth10) entered blocking state
>> br0: port 1(eth10) entered disabled state
>> br0: port 2(eth6) entered blocking state
>> br0: port 2(eth6) entered disabled state
>> device eth6 entered promiscuous mode
>> device eth10 entered promiscuous mode
>> br0: port 3(eth9) entered blocking state
>> br0: port 3(eth9) entered disabled state
>> device eth9 entered promiscuous mode
>> br0: port 4(eth5) entered blocking state
>> br0: port 4(eth5) entered disabled state
>> device eth5 entered promiscuous mode
>> br0: port 5(eth8) entered blocking state
>> br0: port 5(eth8) entered disabled state
>> device eth8 entered promiscuous mode
>> br0: port 6(eth4) entered blocking state
>> br0: port 6(eth4) entered disabled state
>> mv88e6085 1002b000.ethernet-1:04: Timeout while waiting for switch
>> mv88e6085 1002b000.ethernet-1:04: port 0 failed to add 
>> 9a:af:03:f1:bd:0a vid 1 to fdb: -110
> 
> Dumb question, but if you get errors like this, how can you test 
> anything at all
> in the patches that you submit?

The answer is that I don't always get these errors... once in a while 
(maaany resets) it does
not happen, and all is fine.

The error code is... well of course -110 (timed out).

> 
>> device eth4 entered promiscuous mode
>> br0: port 7(eth7) entered blocking state
>> br0: port 7(eth7) entered disabled state
>> 
>> I don't know if that gives ay clues...?
> 
> Not really. That error might be related - something indicating a 
> breakage
> in the top-level (fec IIUC) MDIO controller, or not. There was "recent"
> rework almost everywhere.  For example commit 35da1dfd9484 ("net: dsa:
> mv88e6xxx: Improve performance of busy bit polling"). That also hooks
> into the mv88e6xxx cascaded MDIO controller 
> (mv88e6xxx_g2_smi_phy_wait),
> so there might be something there.
> 

I can check that out, but I remember that net-next has not worked on 
this device for quite some
time...

>> 
>> Otherwise I have to take more time to see what I can dig out. The 
>> easiest
>> for me is then to add some printk statements giving targeted 
>> information if told what and
>> where...
> 
> Do you have a timeline for when the regression was introduced?
> Commit 35da1dfd9484 reverts cleanly, so I suppose giving it a go with
> that reverted might be worth a shot. Otherwise, a bisect from a known
> working version only takes a couple of hours, and shouldn't require
> other changes to the setup.

I can't say when the regression was introduced as I used modified 
kernels, but something
between 5.16 and 5.17, I know there was something phy related, but it's 
a bit more complicated,
so it is only a guess...

I would have to get the whole locked port patch set etc. on a 5.16 to 
see if that works.
Andrew Lunn Nov. 15, 2022, 1:21 p.m. UTC | #11
> [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:00] driver [Generic
> PHY] (irq=POLL)

It is interesting it is using the generic PHY driver, not the Marvell
PHY driver. 

> mv88e6085 1002b000.ethernet-1:04 eth6 (uninitialized): PHY
> [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:01] driver [Generic
> PHY] (irq=POLL)
> mv88e6085 1002b000.ethernet-1:04 eth9 (uninitialized): PHY
> [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:02] driver [Generic
> PHY] (irq=POLL)
> mv88e6085 1002b000.ethernet-1:04 eth5 (uninitialized): PHY
> [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:03] driver [Generic
> PHY] (irq=POLL)
> mv88e6085 1002b000.ethernet-1:04 eth8 (uninitialized): PHY
> [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:04] driver [Generic
> PHY] (irq=POLL)
> mv88e6085 1002b000.ethernet-1:04 eth4 (uninitialized): PHY
> [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:05] driver [Generic
> PHY] (irq=POLL)
> mv88e6085 1002b000.ethernet-1:04 eth7 (uninitialized): PHY
> [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:06] driver [Generic
> PHY] (irq=POLL)
> mv88e6085 1002b000.ethernet-1:04 eth3 (uninitialized): PHY
> [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:07] driver [Generic
> PHY] (irq=POLL)
> mv88e6085 1002b000.ethernet-1:04 eth2 (uninitialized): PHY
> [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdioe:08] driver
> [Marvell 88E1112] (irq=174)
> mv88e6085 1002b000.ethernet-1:04 eth1 (uninitialized): PHY
> [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdioe:09] driver
> [Marvell 88E1112] (irq=175)

And here it does use the Marvell PHY driver. Are ports 8 and 9
external, where as the others are internal?

    Andrew
Hans Schultz Nov. 15, 2022, 1:25 p.m. UTC | #12
On 2022-11-15 13:22, Vladimir Oltean wrote:

> Do you have a timeline for when the regression was introduced?
> Commit 35da1dfd9484 reverts cleanly, so I suppose giving it a go with
> that reverted might be worth a shot. Otherwise, a bisect from a known
> working version only takes a couple of hours, and shouldn't require
> other changes to the setup.


Wow! Reverting 35da1dfd9484 and the problem has disappeared. :-)

The bridge_locked_port.sh tests all succeeded... as expected... ;-)
Hans Schultz Nov. 15, 2022, 2:18 p.m. UTC | #13
On 2022-11-15 14:21, Andrew Lunn wrote:
>> [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:00] driver 
>> [Generic
>> PHY] (irq=POLL)
> 
> It is interesting it is using the generic PHY driver, not the Marvell
> PHY driver.
> 
>> mv88e6085 1002b000.ethernet-1:04 eth6 (uninitialized): PHY
>> [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:01] driver 
>> [Generic
>> PHY] (irq=POLL)
>> mv88e6085 1002b000.ethernet-1:04 eth9 (uninitialized): PHY
>> [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:02] driver 
>> [Generic
>> PHY] (irq=POLL)
>> mv88e6085 1002b000.ethernet-1:04 eth5 (uninitialized): PHY
>> [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:03] driver 
>> [Generic
>> PHY] (irq=POLL)
>> mv88e6085 1002b000.ethernet-1:04 eth8 (uninitialized): PHY
>> [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:04] driver 
>> [Generic
>> PHY] (irq=POLL)
>> mv88e6085 1002b000.ethernet-1:04 eth4 (uninitialized): PHY
>> [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:05] driver 
>> [Generic
>> PHY] (irq=POLL)
>> mv88e6085 1002b000.ethernet-1:04 eth7 (uninitialized): PHY
>> [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:06] driver 
>> [Generic
>> PHY] (irq=POLL)
>> mv88e6085 1002b000.ethernet-1:04 eth3 (uninitialized): PHY
>> [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdio:07] driver 
>> [Generic
>> PHY] (irq=POLL)
>> mv88e6085 1002b000.ethernet-1:04 eth2 (uninitialized): PHY
>> [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdioe:08] driver
>> [Marvell 88E1112] (irq=174)
>> mv88e6085 1002b000.ethernet-1:04 eth1 (uninitialized): PHY
>> [!soc!aipi@10020000!ethernet@1002b000!mdio!switch@4!mdioe:09] driver
>> [Marvell 88E1112] (irq=175)
> 
> And here it does use the Marvell PHY driver. Are ports 8 and 9
> external, where as the others are internal?
> 
>     Andrew

It is an 8 port switch, so the two (8+9) are for internal use, I guess, 
as I
have not had any part in the system design etc of this device.

I have it for test and development purposes, incl. upstreaming.
Vladimir Oltean Nov. 15, 2022, 2:56 p.m. UTC | #14
On Tue, Nov 15, 2022 at 02:25:13PM +0100, netdev@kapio-technology.com wrote:
> On 2022-11-15 13:22, Vladimir Oltean wrote:
> > Do you have a timeline for when the regression was introduced?
> > Commit 35da1dfd9484 reverts cleanly, so I suppose giving it a go with
> > that reverted might be worth a shot. Otherwise, a bisect from a known
> > working version only takes a couple of hours, and shouldn't require
> > other changes to the setup.
> 
> Wow! Reverting 35da1dfd9484 and the problem has disappeared. :-)

See? That wasn't very painful, was it.

Now, why doesn't that commit work for you? that's the real question.
I'm going to say there's a big assumption made there. The old code used
to poll up to 16 times with sleeps of up to 2 ms in between.
The new code polls until at least 50 ms have elapsed.
I can imagine the thought process being something like "hmm, 16*2=32ms,
let's round that up to 50 just to be sure". But the effective timeout
was not really increased. Rather said, in the old code there was never
really an effective timeout, since the polling code could have been
preempted many times, and these preemptions would not be accounted
against the msleep(2) calls. Whereas the new code really tracks
something approximating 50 ms now.

Could you please add the reverted patch back to your git tree, and see
by how much do you need to increase the timeout for your system to get
reliable results?

> The bridge_locked_port.sh tests all succeeded... as expected... ;-)

Yeah, I confirm this works on a 6390 over here. But I still don't like
the log spam from the IRQ handlers.

[root@mox:~/.../drivers/net/dsa] # ./bridge_locked_port.sh lan1 lan2 lan3 lan4
[ 1298.218224] mv88e6085 d0032004.mdio-mii:10 lan1: configuring for phy/gmii link mode
[ 1299.150249] mv88e6085 d0032004.mdio-mii:10 lan4: configuring for phy/gmii link mode
[ 1299.791778] br0: port 1(lan2) entered blocking state
[ 1299.800824] br0: port 1(lan2) entered disabled state
[ 1300.419596] br0: port 2(lan3) entered blocking state
[ 1300.425016] br0: port 2(lan3) entered disabled state
[ 1300.527959] device lan3 entered promiscuous mode
[ 1300.538124] device lan2 entered promiscuous mode
[ 1300.733679] mv88e6085 d0032004.mdio-mii:10 lan2: configuring for phy/gmii link mode
[ 1300.765642] 8021q: adding VLAN 0 to HW filter on device lan2
[ 1300.818540] mv88e6085 d0032004.mdio-mii:10 lan3: configuring for phy/gmii link mode
[ 1300.855003] 8021q: adding VLAN 0 to HW filter on device lan3
[ 1303.903840] mv88e6085 d0032004.mdio-mii:10 lan3: Link is Up - 1Gbps/Full - flow control rx/tx
[ 1303.912939] IPv6: ADDRCONF(NETDEV_CHANGE): lan3: link becomes ready
[ 1303.928313] br0: port 2(lan3) entered blocking state
[ 1303.933685] br0: port 2(lan3) entered forwarding state
[ 1303.948607] mv88e6085 d0032004.mdio-mii:10 lan4: Link is Up - 1Gbps/Full - flow control rx/tx
[ 1303.985784] IPv6: ADDRCONF(NETDEV_CHANGE): br0: link becomes ready
[ 1303.999962] IPv6: ADDRCONF(NETDEV_CHANGE): lan4: link becomes ready
[ 1304.003780] mv88e6085 d0032004.mdio-mii:10 lan2: Link is Up - 1Gbps/Full - flow control rx/tx
[ 1304.017407] IPv6: ADDRCONF(NETDEV_CHANGE): lan2: link becomes ready
[ 1304.027922] br0: port 1(lan2) entered blocking state
[ 1304.033157] br0: port 1(lan2) entered forwarding state
[ 1304.043508] mv88e6085 d0032004.mdio-mii:10 lan1: Link is Up - 1Gbps/Full - flow control rx/tx
[ 1304.052601] IPv6: ADDRCONF(NETDEV_CHANGE): lan4.100: link becomes ready
[ 1304.158404] IPv6: ADDRCONF(NETDEV_CHANGE): lan1: link becomes ready
[ 1304.167574] IPv6: ADDRCONF(NETDEV_CHANGE): lan1.100: link becomes ready
TEST: Locked port ipv4                                              [ OK ]
TEST: Locked port ipv6                                              [ OK ]
[ 1337.627010] mv88e6xxx_g1_vtu_prob_irq_thread_fn: 1 callbacks suppressed
[ 1337.627042] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
[ 1337.644822] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
[ 1337.727920] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
[ 1337.862053] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
[ 1337.956972] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
[ 1338.065996] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
[ 1338.136905] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
[ 1338.238182] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
[ 1338.339244] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
[ 1338.440106] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
[ 1342.655520] mv88e6xxx_g1_vtu_prob_irq_thread_fn: 33 callbacks suppressed
[ 1342.655568] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
[ 1342.763619] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
[ 1342.835264] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
[ 1342.847464] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
[ 1342.865387] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
[ 1342.971661] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
[ 1343.075774] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
[ 1343.179562] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
[ 1343.283426] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
[ 1343.387409] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
[ 1348.758296] mv88e6xxx_g1_vtu_prob_irq_thread_fn: 24 callbacks suppressed
[ 1348.758328] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
[ 1348.858879] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
[ 1348.990492] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
[ 1349.063977] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
[ 1349.165979] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
[ 1349.268187] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
[ 1349.373827] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
[ 1349.472601] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
[ 1349.573752] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
[ 1349.585837] mv88e6085 d0032004.mdio-mii:10: VTU member violation for vid 100, source port 9
TEST: Locked port vlan                                              [ OK ]
[ 1352.550194] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for 00:01:02:03:04:01 portvec 4 spid 2
[ 1352.659486] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for 00:01:02:03:04:01 portvec 4 spid 2
[ 1352.792941] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for 00:01:02:03:04:01 portvec 4 spid 2
[ 1352.888959] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for 00:01:02:03:04:01 portvec 4 spid 2
[ 1352.968150] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for 00:01:02:03:04:01 portvec 4 spid 2
[ 1353.072434] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for 00:01:02:03:04:01 portvec 4 spid 2
[ 1353.182844] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for 00:01:02:03:04:01 portvec 4 spid 2
[ 1353.280595] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for 00:01:02:03:04:01 portvec 4 spid 2
[ 1353.384755] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for 00:01:02:03:04:01 portvec 4 spid 2
[ 1353.488602] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for 00:01:02:03:04:01 portvec 4 spid 2
[ 1358.704139] mv88e6xxx_g1_atu_prob_irq_thread_fn: 39 callbacks suppressed
[ 1358.704172] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for 00:01:02:03:04:01 portvec 4 spid 2
[ 1359.280930] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for 00:01:02:03:04:01 portvec 4 spid 2
[ 1359.388609] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for 00:01:02:03:04:01 portvec 4 spid 2
[ 1359.524500] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for 00:01:02:03:04:01 portvec 4 spid 2
[ 1359.620272] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for 00:01:02:03:04:01 portvec 4 spid 2
[ 1359.696597] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for 00:01:02:03:04:01 portvec 4 spid 2
[ 1359.727872] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for 00:01:02:03:04:01 portvec 4 spid 2
[ 1359.801563] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for 00:01:02:03:04:01 portvec 4 spid 2
[ 1359.908665] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for 00:01:02:03:04:01 portvec 4 spid 2
[ 1360.012063] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for 00:01:02:03:04:01 portvec 4 spid 2
[ 1363.858627] mv88e6xxx_g1_atu_prob_irq_thread_fn: 29 callbacks suppressed
[ 1363.858674] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for 00:01:02:03:04:01 portvec 4 spid 2
[ 1364.879407] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for 00:01:02:03:04:01 portvec 4 spid 2
TEST: Locked port MAB                                               [ OK ]
[ 1369.837089] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for a0:b0:c0:c0:b0:a0 portvec 4 spid 2
[ 1369.971489] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for a0:b0:c0:c0:b0:a0 portvec 4 spid 2
[ 1370.070444] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for a0:b0:c0:c0:b0:a0 portvec 4 spid 2
[ 1370.143784] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for a0:b0:c0:c0:b0:a0 portvec 4 spid 2
[ 1370.245843] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for a0:b0:c0:c0:b0:a0 portvec 4 spid 2
[ 1371.465429] mv88e6085 d0032004.mdio-mii:10: ATU member violation for a0:b0:c0:c0:b0:a0 portvec 8 spid 2
[ 1371.599084] mv88e6085 d0032004.mdio-mii:10: ATU member violation for a0:b0:c0:c0:b0:a0 portvec 8 spid 2
[ 1371.703341] mv88e6085 d0032004.mdio-mii:10: ATU member violation for a0:b0:c0:c0:b0:a0 portvec 8 spid 2
[ 1371.794905] mv88e6085 d0032004.mdio-mii:10: ATU member violation for a0:b0:c0:c0:b0:a0 portvec 8 spid 2
[ 1371.873307] mv88e6085 d0032004.mdio-mii:10: ATU member violation for a0:b0:c0:c0:b0:a0 portvec 8 spid 2
[ 1372.022403] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for 00:01:02:03:04:01 portvec 4 spid 2
TEST: Locked port MAB roam                                          [ OK ]
TEST: Locked port MAB configuration                                 [ OK ]
[ 1373.039089] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for 00:01:02:03:04:01 portvec 4 spid 2
[ 1373.060995] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for 00:01:02:03:04:05 portvec 4 spid 2
[ 1373.167964] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for 00:01:02:03:04:05 portvec 4 spid 2
[ 1373.296506] mv88e6085 d0032004.mdio-mii:10: ATU miss violation for 00:01:02:03:04:05 portvec 4 spid 2
TEST: Locked port MAB FDB flush                                     [ OK ]
[ 1375.330376] mv88e6085 d0032004.mdio-mii:10 lan3: Link is Down
[ 1375.341372] br0: port 2(lan3) entered disabled state
[ 1375.461136] mv88e6085 d0032004.mdio-mii:10 lan2: Link is Down
[ 1375.489476] br0: port 1(lan2) entered disabled state
[ 1375.611159] device lan3 left promiscuous mode
[ 1375.618253] br0: port 2(lan3) entered disabled state
[ 1375.737909] device lan2 left promiscuous mode
[ 1375.745305] br0: port 1(lan2) entered disabled state
Hans Schultz Nov. 15, 2022, 3:14 p.m. UTC | #15
On 2022-11-15 15:56, Vladimir Oltean wrote:
> On Tue, Nov 15, 2022 at 02:25:13PM +0100, netdev@kapio-technology.com 
> wrote:
>> On 2022-11-15 13:22, Vladimir Oltean wrote:
> 
>> The bridge_locked_port.sh tests all succeeded... as expected... ;-)
> 
> Yeah, I confirm this works on a 6390 over here.

Thanks :-)

> But I still don't like
> the log spam from the IRQ handlers.
> 
> [root@mox:~/.../drivers/net/dsa] # ./bridge_locked_port.sh lan1 lan2 
> lan3 lan4
> [ 1298.218224] mv88e6085 d0032004.mdio-mii:10 lan1: configuring for
> ...

I think the violation log issue should be handled in a seperate 
patch(set)?
Hans Schultz Nov. 15, 2022, 4:03 p.m. UTC | #16
On 2022-11-15 15:56, Vladimir Oltean wrote:
> On Tue, Nov 15, 2022 at 02:25:13PM +0100, netdev@kapio-technology.com 
> wrote:
>> On 2022-11-15 13:22, Vladimir Oltean wrote:
>> > Do you have a timeline for when the regression was introduced?
>> > Commit 35da1dfd9484 reverts cleanly, so I suppose giving it a go with
>> > that reverted might be worth a shot. Otherwise, a bisect from a known
>> > working version only takes a couple of hours, and shouldn't require
>> > other changes to the setup.
>> 
>> Wow! Reverting 35da1dfd9484 and the problem has disappeared. :-)
> 
> See? That wasn't very painful, was it.

Indeed it was not, when you get a good tip. Thanks alot! :-)

> 
> Now, why doesn't that commit work for you? that's the real question.
> I'm going to say there's a big assumption made there. The old code used
> to poll up to 16 times with sleeps of up to 2 ms in between.
> The new code polls until at least 50 ms have elapsed.
> I can imagine the thought process being something like "hmm, 16*2=32ms,
> let's round that up to 50 just to be sure". But the effective timeout
> was not really increased. Rather said, in the old code there was never
> really an effective timeout, since the polling code could have been
> preempted many times, and these preemptions would not be accounted
> against the msleep(2) calls. Whereas the new code really tracks
> something approximating 50 ms now.
> 
> Could you please add the reverted patch back to your git tree, and see
> by how much do you need to increase the timeout for your system to get
> reliable results?
> 

Yes, so you want me to simply increase the 50ms on line 58 in smi.c...

I have now tried to increase it even to 10000ms == 10s and it didn't 
help,
so something else must be needed...
Vladimir Oltean Nov. 15, 2022, 4:15 p.m. UTC | #17
On Tue, Nov 15, 2022 at 04:14:05PM +0100, netdev@kapio-technology.com wrote:
> I think the violation log issue should be handled in a seperate patch(set)?

idk, what do you plan to do about it?
Vladimir Oltean Nov. 15, 2022, 4:18 p.m. UTC | #18
On Tue, Nov 15, 2022 at 05:03:01PM +0100, netdev@kapio-technology.com wrote:
> Yes, so you want me to simply increase the 50ms on line 58 in smi.c...
> 
> I have now tried to increase it even to 10000ms == 10s and it didn't help,
> so something else must be needed...

Not only that, but also the one in mv88e6xxx_wait_mask().
Hans Schultz Nov. 15, 2022, 5:11 p.m. UTC | #19
On 2022-11-15 17:15, Vladimir Oltean wrote:
> On Tue, Nov 15, 2022 at 04:14:05PM +0100, netdev@kapio-technology.com 
> wrote:
>> I think the violation log issue should be handled in a seperate 
>> patch(set)?
> 
> idk, what do you plan to do about it?

When I think about it, I think that the messages should be disabled by 
default
and like one enables NO_LL_LEARN (echo 1 >/sys/class...), they can be 
activated
if one needs it. And of course the ethtool stats will still be there 
anyhow...

How does that sound?
Vladimir Oltean Nov. 15, 2022, 5:15 p.m. UTC | #20
On Tue, Nov 15, 2022 at 06:11:08PM +0100, netdev@kapio-technology.com wrote:
> On 2022-11-15 17:15, Vladimir Oltean wrote:
> > On Tue, Nov 15, 2022 at 04:14:05PM +0100, netdev@kapio-technology.com
> > wrote:
> > > I think the violation log issue should be handled in a seperate
> > > patch(set)?
> > 
> > idk, what do you plan to do about it?
> 
> When I think about it, I think that the messages should be disabled by default
> and like one enables NO_LL_LEARN (echo 1 >/sys/class...), they can be activated
> if one needs it. And of course the ethtool stats will still be there anyhow...
> 
> How does that sound?

Just make them trace points...
If you don't know how to do that, just search for who has "#define TRACE_SYSTEM"
in drivers/net/ethernet/ and steal from them...
Hans Schultz Nov. 15, 2022, 6:40 p.m. UTC | #21
On 2022-11-15 17:18, Vladimir Oltean wrote:
> On Tue, Nov 15, 2022 at 05:03:01PM +0100, netdev@kapio-technology.com 
> wrote:
>> Yes, so you want me to simply increase the 50ms on line 58 in smi.c...
>> 
>> I have now tried to increase it even to 10000ms == 10s and it didn't 
>> help,
>> so something else must be needed...
> 
> Not only that, but also the one in mv88e6xxx_wait_mask().

So, I will not present you with a graph as it is a tedious process 
(probably
it is some descending gaussian curve wrt timeout occurring).

But 100ms fails, 125 I had 1 port fail, at 140, 150  and 180 I saw 
timeouts
resulting in fdb add fails, like (and occasional port fail):

mv88e6085 1002b000.ethernet-1:04: Timeout while waiting for switch
mv88e6085 1002b000.ethernet-1:04: port 0 failed to add be:7c:96:06:9f:09 
vid 1 to fdb: -110

At around 200 ms it looks like it is getting stable (like 5 runs, no 
problems).

So with the gaussian curve tail whipping ones behind (risque of failure) 
it
might need to be like 300 ms in my case... :-)
Vladimir Oltean Nov. 16, 2022, 10:24 a.m. UTC | #22
On Tue, Nov 15, 2022 at 07:40:02PM +0100, netdev@kapio-technology.com wrote:
> So, I will not present you with a graph as it is a tedious process (probably
> it is some descending gaussian curve wrt timeout occurring).
> 
> But 100ms fails, 125 I had 1 port fail, at 140, 150  and 180 I saw timeouts
> resulting in fdb add fails, like (and occasional port fail):
> 
> mv88e6085 1002b000.ethernet-1:04: Timeout while waiting for switch
> mv88e6085 1002b000.ethernet-1:04: port 0 failed to add be:7c:96:06:9f:09 vid
> 1 to fdb: -110
> 
> At around 200 ms it looks like it is getting stable (like 5 runs, no
> problems).
> 
> So with the gaussian curve tail whipping ones behind (risque of failure) it
> might need to be like 300 ms in my case... :-)

Pick a value that is high enough to be reliable and submit a patch to
"net" where you present the evidence for it (top-level MDIO controller,
SoC, switch, kernel). I don't believe there's much to read into. A large
timeout shouldn't have a negative effect on the MDIO performance,
because it just determines how long it takes until the kernel declares
it dead, rather than how long it takes for transactions to actually take
place.
Andrew Lunn Nov. 16, 2022, 1:37 p.m. UTC | #23
> Pick a value that is high enough to be reliable and submit a patch to
> "net" where you present the evidence for it (top-level MDIO controller,
> SoC, switch, kernel). I don't believe there's much to read into. A large
> timeout shouldn't have a negative effect on the MDIO performance,
> because it just determines how long it takes until the kernel declares
> it dead, rather than how long it takes for transactions to actually take
> place.

Yes, please do that.

It is interesting that you found this. I'm just curious, so no need to
investigate if you don't have time. Is there a pattern, is it the same
register which always times out?

	 Andrew
Hans Schultz Nov. 18, 2022, 1:37 p.m. UTC | #24
On 2022-11-16 11:24, Vladimir Oltean wrote:
> On Tue, Nov 15, 2022 at 07:40:02PM +0100, netdev@kapio-technology.com 
> wrote:
>> So, I will not present you with a graph as it is a tedious process 
>> (probably
>> it is some descending gaussian curve wrt timeout occurring).
>> 
>> But 100ms fails, 125 I had 1 port fail, at 140, 150  and 180 I saw 
>> timeouts
>> resulting in fdb add fails, like (and occasional port fail):
>> 
>> mv88e6085 1002b000.ethernet-1:04: Timeout while waiting for switch
>> mv88e6085 1002b000.ethernet-1:04: port 0 failed to add 
>> be:7c:96:06:9f:09 vid
>> 1 to fdb: -110
>> 
>> At around 200 ms it looks like it is getting stable (like 5 runs, no
>> problems).
>> 
>> So with the gaussian curve tail whipping ones behind (risque of 
>> failure) it
>> might need to be like 300 ms in my case... :-)
> 
> Pick a value that is high enough to be reliable and submit a patch to
> "net" where you present the evidence for it (top-level MDIO controller,
> SoC, switch, kernel). I don't believe there's much to read into. A 
> large
> timeout shouldn't have a negative effect on the MDIO performance,
> because it just determines how long it takes until the kernel declares
> it dead, rather than how long it takes for transactions to actually 
> take
> place.

Would it not be appropriate to have a define that specifies the value 
instead
of the same value two places as it is now?

And in so case, what would be an appropriate name?
Andrew Lunn Nov. 18, 2022, 1:51 p.m. UTC | #25
> Would it not be appropriate to have a define that specifies the
> value instead of the same value two places as it is now?

Yes.

> And in so case, what would be an appropriate name?

MV88E6XXX_WAIT_TIMEOUT_MS ?

	  Andrew