diff mbox series

net: dsa: realtek: silent indirect reg read failures on SMP

Message ID CAJq09z5FCgG-+jVT7uxh1a-0CiiFsoKoHYsAWJtiKwv7LXKofQ@mail.gmail.com (mailing list archive)
State Not Applicable
Delegated to: Netdev Maintainers
Headers show
Series net: dsa: realtek: silent indirect reg read failures on SMP | expand

Checks

Context Check Description
netdev/tree_selection success Guessing tree name failed - patch did not apply

Commit Message

Luiz Angelo Daros de Luca Feb. 6, 2022, 8:26 p.m. UTC
Hello,

Arinç reported an issue with interfaces going down and up in a fixed
interval (a little less than 40s). It happened with both SMI or MDIO
interfaces for two different rtl8365mb supported models.

The indirect read procedure is something like this:

rtl8365mb_phy_read
  rtl8365mb_phy_ocp_read
    rtl8365mb_phy_poll_busy
      regmap_read_poll_timeout(RTL8365MB_INDIRECT_ACCESS_STATUS_REG)
    rtl8365mb_phy_ocp_prepare
      regmap_update_bits(RTL8365MB_GPHY_OCP_MSB_0_REG)
      regmap_write(RTL8365MB_INDIRECT_ACCESS_ADDRESS_REG)
    regmap_write(RTL8365MB_INDIRECT_ACCESS_CTRL_REG)
    rtl8365mb_phy_poll_busy
      regmap_read_poll_timeout(RTL8365MB_INDIRECT_ACCESS_STATUS_REG)
    regmap_read(RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG)

And the write is similar. The regmap read/write does have its own
mutex. realtek-mdio does use a sequence of read/writes to mdio for
each read/write ops but beyond regmap internal locks, I also lock
bus->mdio_lock.

In a non-SMP system or with extra cores disabled, everything works as
expected. However, with an SMP system, there is some kind of
concurrent access to registers and
regmap_read(RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG) will eventually
read 0x0000. It is like the chip didn't have time to update that
register or got lost in the way. It only happens when the system is
idle. If you stress the CPU, the simultaneous register access gets
unsynchronized or serialized.

I can "simulate" a similar issue in a non-SMP system by dumping regmap
from non-used registers like 0x20a0 to 0x20ff (0x20a0 is related to
port 5 state, but I have 0,1,2,3,4 + 6,7). There are other regions
that cause the same behavior but [0x20a0,0x20ff] is the first one I
found. It never failed while reading good memory regions. Maybe it is
just caused by another issue like the chip getting confused while
reading undefined memory regions.

for a in $(seq 1000); do dd
if=/sys/kernel/debug/regmap/mdio-bus:1d/registers bs=11
skip=$((0x20a0)) count=$((0x1)) 2>/dev/null; done

(That range might not be precise as I don't know if dd is enough to
select only a specific range)

While the script is running, I can see both the driver and the debugfs
dump flicking between good data, garbage and zero. When DSA reads
0x0000, it brings the port down and up.

I tried to add a simple mutex over rtl8365mb_phy_ocp_read call but it
still failed to prevent the event in a SMP system (it would not
protect the debugfs dump). Maybe I'm missing something obvious.

I also wonder if all those functions that use a sequence of
regmap_{read,write} might need a mutex to protect from parallel
execution. I didn't find a way to create a "regmap transaction" by
locking it externally as its lock control is private as well as the
non-locked _regmap_read(). I could disable its lock and do it inside
the driver, but it would also disable the debugfs dump and add a bunch
of new code. And I don't even know if it would fix the issue.
Is there a better way to deal with this issue? Maybe force dsa polling
to use a single processor?

This is the failed patch.

       int ret;
@@ -1001,16 +1004,24 @@ static int rtl8365mb_phy_read(struct
realtek_priv *priv, int phy, int regnum)

       ocp_addr = RTL8365MB_PHY_OCP_ADDR_PHYREG_BASE + regnum * 2;

-       ret = rtl8365mb_phy_ocp_read(priv, phy, ocp_addr, &val);
+       ret = mutex_lock_interruptible(&mb->indirect_reg_lock);
       if (ret) {
               dev_err(priv->dev,
-                       "failed to read PHY%d reg %02x @ %04x, ret %d\n", phy,
+                       "read PHY%d reg %02x @ %04x, ret %d interrupted\n", phy,
                       regnum, ocp_addr, ret);
               return ret;
       }

-       dev_dbg(priv->dev, "read PHY%d register 0x%02x @ %04x, val <- %04x\n",
-               phy, regnum, ocp_addr, val);
+       ret = rtl8365mb_phy_ocp_read(priv, phy, ocp_addr, &val);
+       if (ret)
+               dev_err(priv->dev,
+                       "failed to read PHY%d reg %02x @ %04x, ret %d\n", phy,
+                       regnum, ocp_addr, ret);
+       else
+               dev_dbg(priv->dev, "read PHY%d register 0x%02x @ %04x,
val <- %04x\n",
+                       phy, regnum, ocp_addr, val);
+
+       mutex_unlock(&mb->indirect_reg_lock);

       return val;
}
@@ -1018,6 +1029,7 @@ static int rtl8365mb_phy_read(struct
realtek_priv *priv, int phy, int regnum)
static int rtl8365mb_phy_write(struct realtek_priv *priv, int phy, int regnum,
                              u16 val)
{
+       struct rtl8365mb *mb = priv->chip_data;
       u32 ocp_addr;
       int ret;

@@ -1029,18 +1041,26 @@ static int rtl8365mb_phy_write(struct
realtek_priv *priv, int phy, int regnum,

       ocp_addr = RTL8365MB_PHY_OCP_ADDR_PHYREG_BASE + regnum * 2;

-       ret = rtl8365mb_phy_ocp_write(priv, phy, ocp_addr, val);
+       ret = mutex_lock_interruptible(&mb->indirect_reg_lock);
       if (ret) {
               dev_err(priv->dev,
-                       "failed to write PHY%d reg %02x @ %04x, ret %d\n", phy,
+                       "write PHY%d reg %02x @ %04x, ret %d
interrupted\n", phy,
                       regnum, ocp_addr, ret);
               return ret;
       }

-       dev_dbg(priv->dev, "write PHY%d register 0x%02x @ %04x, val -> %04x\n",
-               phy, regnum, ocp_addr, val);
+       ret = rtl8365mb_phy_ocp_write(priv, phy, ocp_addr, val);
+       if (ret)
+               dev_err(priv->dev,
+                       "failed to write PHY%d reg %02x @ %04x, ret %d\n", phy,
+                       regnum, ocp_addr, ret);
+       else
+               dev_dbg(priv->dev, "write PHY%d register 0x%02x @
%04x, val -> %04x\n",
+                       phy, regnum, ocp_addr, val);

-       return 0;
+       mutex_unlock(&mb->indirect_reg_lock);
+
+       return ret;
}

static int rtl8365mb_dsa_phy_read(struct dsa_switch *ds, int phy, int regnum)
@@ -2215,6 +2235,8 @@ static int rtl8365mb_setup(struct dsa_switch *ds)

       mb = priv->chip_data;

+       mutex_init(&mb->indirect_reg_lock);
+
       ret = rtl8365mb_reset_chip(priv);
       if (ret) {
               dev_err(priv->dev, "failed to reset chip: %d\n", ret);
--
2.34.1


---
   Luiz

Comments

Alvin Šipraga Feb. 9, 2022, 10:32 a.m. UTC | #1
Hi Luiz,

Thanks very much for the bug report!

Luiz Angelo Daros de Luca <luizluca@gmail.com> writes:

> Hello,
>
> Arinç reported an issue with interfaces going down and up in a fixed
> interval (a little less than 40s). It happened with both SMI or MDIO
> interfaces for two different rtl8365mb supported models.

Is this using the Generic PHY driver or the Realtek RTL8365MB-VC PHY
driver? If it's the latter then I would be interested to know whether
the interrupt counters are going up too. Maybe the switch really has
something to say?

I think the Generic PHY driver is based on polling of PHY registers,
which seems more likely to trigger the issue you describe.

In any case, I think I can reproduce your issue. Using an SMI setup with
just one cable attached to port 2, I did the following:

1. Enable ftrace event logging for regmap and mdio
2. Poll BSMR PHY register for my connected port; it should always read
   the same (0x79ed).
3. Wait for 2 to read out something else.

So I used this command:

    while true; do phytool read swp2/2/0x01; sleep 0.1; done

After a couple of seconds I already read out 0x0000. And after checking
the ftrace log I can see that this is because of some interleaved
register access:

     kworker/3:4-70      [003] .......  1927.139849: regmap_reg_write: ethernet-switch reg=1004 val=bd
         phytool-16816   [002] .......  1927.139979: regmap_reg_read: ethernet-switch reg=1f01 val=0
     kworker/3:4-70      [003] .......  1927.140381: regmap_reg_read: ethernet-switch reg=1005 val=0
         phytool-16816   [002] .......  1927.140468: regmap_reg_read: ethernet-switch reg=1d15 val=a69
     kworker/3:4-70      [003] .......  1927.140864: regmap_reg_read: ethernet-switch reg=1003 val=0
         phytool-16816   [002] .......  1927.140955: regmap_reg_write: ethernet-switch reg=1f02 val=2041
     kworker/3:4-70      [003] .......  1927.141390: regmap_reg_read: ethernet-switch reg=1002 val=0
         phytool-16816   [002] .......  1927.141479: regmap_reg_write: ethernet-switch reg=1f00 val=1
     kworker/3:4-70      [003] .......  1927.142311: regmap_reg_write: ethernet-switch reg=1004 val=be
         phytool-16816   [002] .......  1927.142410: regmap_reg_read: ethernet-switch reg=1f01 val=0
     kworker/3:4-70      [003] .......  1927.142534: regmap_reg_read: ethernet-switch reg=1005 val=0
         phytool-16816   [002] .......  1927.142618: regmap_reg_read: ethernet-switch reg=1f04 val=0
         phytool-16816   [002] .......  1927.142641: mdio_access: SMI-0 read  phy:0x02 reg:0x01 val:0x0000
     kworker/3:4-70      [003] .......  1927.143037: regmap_reg_read: ethernet-switch reg=1001 val=0
     kworker/3:4-70      [003] .......  1927.143133: regmap_reg_read: ethernet-switch reg=1000 val=2d89
     kworker/3:4-70      [003] .......  1927.143213: regmap_reg_write: ethernet-switch reg=1004 val=be
     kworker/3:4-70      [003] .......  1927.143291: regmap_reg_read: ethernet-switch reg=1005 val=0
     kworker/3:4-70      [003] .......  1927.143368: regmap_reg_read: ethernet-switch reg=1003 val=0
     kworker/3:4-70      [003] .......  1927.143443: regmap_reg_read: ethernet-switch reg=1002 val=6

The kworker here is polling MIB counters for stats, as evidenced by the
register 0x1004 that we are writing to (RTL8365MB_MIB_ADDRESS_REG).

Sometimes I can even read out completely junk values if I tighten my
phytool loop:

    # while true; do phytool read swp2/2/0x01; done | uniq
    0x79ed
    0000
    0x79ed
    0000
    0x79ed
    0000
    0x79ed
    0x0014 <-- ?!

This is just a preliminary investigation - I only checked your mail
today - but I will follow up on this next week and try and make a
fix. In the mean time could you and Arinç perhaps do some similar
tracing and confirm that in your case it is a similar issue? You needn't
use phytool if the issue reproduces itself every 40 seconds - in that
case just enable ftrace events and inspect the register access sequence
when a PHY read gives an unexpected value.

Let's please restrict the investigation to registers which we expect to
give well-defined values though - unused or otherwise inactive registers
may very well give us junk, but it is harder to argue that this is a
bug. Anyway, this seems pretty easy to reproduce, so I don't think you
need to go down that path :-)

Overall we probably need some stricter locking but I want to spend some
time on this before just sending a big-fat-lock solution to the
list. I didn't try your patch yet due to other items in my inbox today.

Kind regards,
Alvin

>
> The indirect read procedure is something like this:
>
> rtl8365mb_phy_read
>   rtl8365mb_phy_ocp_read
>     rtl8365mb_phy_poll_busy
>       regmap_read_poll_timeout(RTL8365MB_INDIRECT_ACCESS_STATUS_REG)
>     rtl8365mb_phy_ocp_prepare
>       regmap_update_bits(RTL8365MB_GPHY_OCP_MSB_0_REG)
>       regmap_write(RTL8365MB_INDIRECT_ACCESS_ADDRESS_REG)
>     regmap_write(RTL8365MB_INDIRECT_ACCESS_CTRL_REG)
>     rtl8365mb_phy_poll_busy
>       regmap_read_poll_timeout(RTL8365MB_INDIRECT_ACCESS_STATUS_REG)
>     regmap_read(RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG)
>
> And the write is similar. The regmap read/write does have its own
> mutex. realtek-mdio does use a sequence of read/writes to mdio for
> each read/write ops but beyond regmap internal locks, I also lock
> bus->mdio_lock.
>
> In a non-SMP system or with extra cores disabled, everything works as
> expected. However, with an SMP system, there is some kind of
> concurrent access to registers and
> regmap_read(RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG) will eventually
> read 0x0000. It is like the chip didn't have time to update that
> register or got lost in the way. It only happens when the system is
> idle. If you stress the CPU, the simultaneous register access gets
> unsynchronized or serialized.
>
> I can "simulate" a similar issue in a non-SMP system by dumping regmap
> from non-used registers like 0x20a0 to 0x20ff (0x20a0 is related to
> port 5 state, but I have 0,1,2,3,4 + 6,7). There are other regions
> that cause the same behavior but [0x20a0,0x20ff] is the first one I
> found. It never failed while reading good memory regions. Maybe it is
> just caused by another issue like the chip getting confused while
> reading undefined memory regions.
>
> for a in $(seq 1000); do dd
> if=/sys/kernel/debug/regmap/mdio-bus:1d/registers bs=11
> skip=$((0x20a0)) count=$((0x1)) 2>/dev/null; done
>
> (That range might not be precise as I don't know if dd is enough to
> select only a specific range)
>
> While the script is running, I can see both the driver and the debugfs
> dump flicking between good data, garbage and zero. When DSA reads
> 0x0000, it brings the port down and up.
>
> I tried to add a simple mutex over rtl8365mb_phy_ocp_read call but it
> still failed to prevent the event in a SMP system (it would not
> protect the debugfs dump). Maybe I'm missing something obvious.
>
> I also wonder if all those functions that use a sequence of
> regmap_{read,write} might need a mutex to protect from parallel
> execution. I didn't find a way to create a "regmap transaction" by
> locking it externally as its lock control is private as well as the
> non-locked _regmap_read(). I could disable its lock and do it inside
> the driver, but it would also disable the debugfs dump and add a bunch
> of new code. And I don't even know if it would fix the issue.
> Is there a better way to deal with this issue? Maybe force dsa polling
> to use a single processor?
>
> This is the failed patch.
>
> diff --git a/drivers/net/dsa/realtek/rtl8365mb.c
> b/drivers/net/dsa/realtek/rtl8365mb.c
> index 1a0562811c1a..5572271a2514 100644
> --- a/drivers/net/dsa/realtek/rtl8365mb.c
> +++ b/drivers/net/dsa/realtek/rtl8365mb.c
> @@ -817,6 +817,7 @@ struct rtl8365mb_port {
>  * @port_mask: mask of all ports
>  * @learn_limit_max: maximum number of L2 addresses the chip can learn
>  * @mib_lock: prevent concurrent reads of MIB counters
> + * @indirect_reg_lock: prevent concurrent access of indirect registers
>  * @ports: per-port data
>  * @jam_table: chip-specific initialization jam table
>  * @jam_size: size of the chip's jam table
> @@ -832,6 +833,7 @@ struct rtl8365mb {
>        u32 port_mask;
>        u32 learn_limit_max;
>        struct mutex mib_lock;
> +       struct mutex indirect_reg_lock;
>        struct rtl8365mb_port ports[RTL8365MB_MAX_NUM_PORTS];
>        const struct rtl8365mb_jam_tbl_entry *jam_table;
>        size_t jam_size;
> @@ -989,6 +991,7 @@ static int rtl8365mb_phy_ocp_write(struct
> realtek_priv *priv, int phy,
>
> static int rtl8365mb_phy_read(struct realtek_priv *priv, int phy, int regnum)
> {
> +       struct rtl8365mb *mb = priv->chip_data;
>        u32 ocp_addr;
>        u16 val;
>        int ret;
> @@ -1001,16 +1004,24 @@ static int rtl8365mb_phy_read(struct
> realtek_priv *priv, int phy, int regnum)
>
>        ocp_addr = RTL8365MB_PHY_OCP_ADDR_PHYREG_BASE + regnum * 2;
>
> -       ret = rtl8365mb_phy_ocp_read(priv, phy, ocp_addr, &val);
> +       ret = mutex_lock_interruptible(&mb->indirect_reg_lock);
>        if (ret) {
>                dev_err(priv->dev,
> -                       "failed to read PHY%d reg %02x @ %04x, ret %d\n", phy,
> +                       "read PHY%d reg %02x @ %04x, ret %d interrupted\n", phy,
>                        regnum, ocp_addr, ret);
>                return ret;
>        }
>
> -       dev_dbg(priv->dev, "read PHY%d register 0x%02x @ %04x, val <- %04x\n",
> -               phy, regnum, ocp_addr, val);
> +       ret = rtl8365mb_phy_ocp_read(priv, phy, ocp_addr, &val);
> +       if (ret)
> +               dev_err(priv->dev,
> +                       "failed to read PHY%d reg %02x @ %04x, ret %d\n", phy,
> +                       regnum, ocp_addr, ret);
> +       else
> +               dev_dbg(priv->dev, "read PHY%d register 0x%02x @ %04x,
> val <- %04x\n",
> +                       phy, regnum, ocp_addr, val);
> +
> +       mutex_unlock(&mb->indirect_reg_lock);
>
>        return val;
> }
> @@ -1018,6 +1029,7 @@ static int rtl8365mb_phy_read(struct
> realtek_priv *priv, int phy, int regnum)
> static int rtl8365mb_phy_write(struct realtek_priv *priv, int phy, int regnum,
>                               u16 val)
> {
> +       struct rtl8365mb *mb = priv->chip_data;
>        u32 ocp_addr;
>        int ret;
>
> @@ -1029,18 +1041,26 @@ static int rtl8365mb_phy_write(struct
> realtek_priv *priv, int phy, int regnum,
>
>        ocp_addr = RTL8365MB_PHY_OCP_ADDR_PHYREG_BASE + regnum * 2;
>
> -       ret = rtl8365mb_phy_ocp_write(priv, phy, ocp_addr, val);
> +       ret = mutex_lock_interruptible(&mb->indirect_reg_lock);
>        if (ret) {
>                dev_err(priv->dev,
> -                       "failed to write PHY%d reg %02x @ %04x, ret %d\n", phy,
> +                       "write PHY%d reg %02x @ %04x, ret %d
> interrupted\n", phy,
>                        regnum, ocp_addr, ret);
>                return ret;
>        }
>
> -       dev_dbg(priv->dev, "write PHY%d register 0x%02x @ %04x, val -> %04x\n",
> -               phy, regnum, ocp_addr, val);
> +       ret = rtl8365mb_phy_ocp_write(priv, phy, ocp_addr, val);
> +       if (ret)
> +               dev_err(priv->dev,
> +                       "failed to write PHY%d reg %02x @ %04x, ret %d\n", phy,
> +                       regnum, ocp_addr, ret);
> +       else
> +               dev_dbg(priv->dev, "write PHY%d register 0x%02x @
> %04x, val -> %04x\n",
> +                       phy, regnum, ocp_addr, val);
>
> -       return 0;
> +       mutex_unlock(&mb->indirect_reg_lock);
> +
> +       return ret;
> }
>
> static int rtl8365mb_dsa_phy_read(struct dsa_switch *ds, int phy, int regnum)
> @@ -2215,6 +2235,8 @@ static int rtl8365mb_setup(struct dsa_switch *ds)
>
>        mb = priv->chip_data;
>
> +       mutex_init(&mb->indirect_reg_lock);
> +
>        ret = rtl8365mb_reset_chip(priv);
>        if (ret) {
>                dev_err(priv->dev, "failed to reset chip: %d\n", ret);
> --
> 2.34.1
>
>
> ---
>    Luiz
Luiz Angelo Daros de Luca Feb. 9, 2022, 11:28 p.m. UTC | #2
'

> Hi Luiz,

Hi Alvin,

> Thanks very much for the bug report!

Thanks Alvin!

> > Arinç reported an issue with interfaces going down and up in a fixed
> > interval (a little less than 40s). It happened with both SMI or MDIO
> > interfaces for two different rtl8365mb supported models.
>
> Is this using the Generic PHY driver or the Realtek RTL8365MB-VC PHY
> driver? If it's the latter then I would be interested to know whether
> the interrupt counters are going up too. Maybe the switch really has
> something to say?

Both are using 'Realtek RTL8365MB-VC PHY' but without interruptions.
In the mdio-connected device, the interrupt pin is not connected yet
(I suggested them to use it).
The other one is unknown if the pin is connected to a gpio port. I
suggested Arunç to export all available GPIO ports and check their
values with the interruption configured as both active high and low.
That's how I found the interrupt gpio in my device.
So, all affected SMP devices are using polling.

> I think the Generic PHY driver is based on polling of PHY registers,
> which seems more likely to trigger the issue you describe.

I didn't even know that 'Generic PHY' would work until yesterday when
I noticed I was missing the patch that added the phy_id.
Anyway, as I said, both drivers will work with polling while only
'RTL8365MB-VC PHY' can handle interruptions. It is not a good
indicator
whether it is using interruption or not. And MDIO also didn't support
interruption without a new patch I just sent.

https://patchwork.kernel.org/project/netdevbpf/patch/20220209224538.9028-1-luizluca@gmail.com/

> In any case, I think I can reproduce your issue. Using an SMI setup with
> just one cable attached to port 2, I did the following:
>
> 1. Enable ftrace event logging for regmap and mdio
> 2. Poll BSMR PHY register for my connected port; it should always read
>    the same (0x79ed).
> 3. Wait for 2 to read out something else.
>
> So I used this command:
>
>     while true; do phytool read swp2/2/0x01; sleep 0.1; done

I'll test it and report.

>
> After a couple of seconds I already read out 0x0000. And after checking
> the ftrace log I can see that this is because of some interleaved
> register access:
>
>      kworker/3:4-70      [003] .......  1927.139849: regmap_reg_write: ethernet-switch reg=1004 val=bd
>          phytool-16816   [002] .......  1927.139979: regmap_reg_read: ethernet-switch reg=1f01 val=0
>      kworker/3:4-70      [003] .......  1927.140381: regmap_reg_read: ethernet-switch reg=1005 val=0
>          phytool-16816   [002] .......  1927.140468: regmap_reg_read: ethernet-switch reg=1d15 val=a69
>      kworker/3:4-70      [003] .......  1927.140864: regmap_reg_read: ethernet-switch reg=1003 val=0
>          phytool-16816   [002] .......  1927.140955: regmap_reg_write: ethernet-switch reg=1f02 val=2041
>      kworker/3:4-70      [003] .......  1927.141390: regmap_reg_read: ethernet-switch reg=1002 val=0
>          phytool-16816   [002] .......  1927.141479: regmap_reg_write: ethernet-switch reg=1f00 val=1
>      kworker/3:4-70      [003] .......  1927.142311: regmap_reg_write: ethernet-switch reg=1004 val=be
>          phytool-16816   [002] .......  1927.142410: regmap_reg_read: ethernet-switch reg=1f01 val=0
>      kworker/3:4-70      [003] .......  1927.142534: regmap_reg_read: ethernet-switch reg=1005 val=0
>          phytool-16816   [002] .......  1927.142618: regmap_reg_read: ethernet-switch reg=1f04 val=0
>          phytool-16816   [002] .......  1927.142641: mdio_access: SMI-0 read  phy:0x02 reg:0x01 val:0x0000
>      kworker/3:4-70      [003] .......  1927.143037: regmap_reg_read: ethernet-switch reg=1001 val=0
>      kworker/3:4-70      [003] .......  1927.143133: regmap_reg_read: ethernet-switch reg=1000 val=2d89
>      kworker/3:4-70      [003] .......  1927.143213: regmap_reg_write: ethernet-switch reg=1004 val=be
>      kworker/3:4-70      [003] .......  1927.143291: regmap_reg_read: ethernet-switch reg=1005 val=0
>      kworker/3:4-70      [003] .......  1927.143368: regmap_reg_read: ethernet-switch reg=1003 val=0
>      kworker/3:4-70      [003] .......  1927.143443: regmap_reg_read: ethernet-switch reg=1002 val=6
>
> The kworker here is polling MIB counters for stats, as evidenced by the
> register 0x1004 that we are writing to (RTL8365MB_MIB_ADDRESS_REG).

At first I thought it was just two threads reading different indirect
registers. And there are really more than one kernel threads reading
port status in a SMP device.
If you disable the extra cores, it was working as expected. So, I
added a mutex to serialize that sequence. But it didn't fix the issue.
It looks like something deep down is broken.

> Sometimes I can even read out completely junk values if I tighten my
> phytool loop:
>
>     # while true; do phytool read swp2/2/0x01; done | uniq
>     0x79ed
>     0000
>     0x79ed
>     0000
>     0x79ed
>     0000
>     0x79ed
>     0x0014 <-- ?!

Yeah, just like what happens if you read that unused register. You
generally get 0x0000 instead of the good data but sometimes random
data.
I also added multiple udelay over and inside the regmap_read with no
success (while reading the unused register). I might test again with
phytool.

> This is just a preliminary investigation - I only checked your mail
> today - but I will follow up on this next week and try and make a
> fix. In the mean time could you and Arinç perhaps do some similar
> tracing and confirm that in your case it is a similar issue? You needn't
> use phytool if the issue reproduces itself every 40 seconds - in that
> case just enable ftrace events and inspect the register access sequence
> when a PHY read gives an unexpected value.

I'll try to enable ftrace but when I added a couple of printk, the
issue was gone.
Let's see what my device can handle.

The 40s is not a global cycle when everything falls apart. I added a
simple test that printed an error when the expected value wasn't
right. And each time a different port was failing in a fixed but
unsynchronized rhythm. Even with disconnected ports.
It looks like the switch "internal job queue" overflows and we get the
null or random data. And that gives the switch a little time to
process the queue. After that, the queue fills up again and we have
the same behavior. Accessing the unused register might simply consume
a lot of "extra loops". It looks like an exaggerated variant of the
same issue. Or not.

> Let's please restrict the investigation to registers which we expect to
> give well-defined values though - unused or otherwise inactive registers
> may very well give us junk, but it is harder to argue that this is a
> bug. Anyway, this seems pretty easy to reproduce, so I don't think you
> need to go down that path :-)

I also believe that we need to stick to registers that should return
good values. But while the indirect registers are failing, you can
also dump
some other switch registers and it should also return some junk data.
I didn't test it but I believe that's what will happen.

> Overall we probably need some stricter locking but I want to spend some
> time on this before just sending a big-fat-lock solution to the
> list. I didn't try your patch yet due to other items in my inbox today.

If nothing serializes the access to the indirect registers, the lock
is needed. Your traces show that two reads will collide.
The regmap also has its own lock by default and I also lock the mdio
during a regmap_read/write. I don't believe an extra lock will do.
I guess we'll need to slow the regmap access, although it wasn't enough for me.

Maybe the switch is yelling an interruption we are ignoring. This is
another interesting thing to check.

---
Luiz
Luiz Angelo Daros de Luca Feb. 10, 2022, 3:10 a.m. UTC | #3
> > So I used this command:
> >
> >     while true; do phytool read swp2/2/0x01; sleep 0.1; done
>
> I'll test it and report.

In my single processor device, I got stable readings:

root@OpenWrt:/# while true; do phytool read lan1/2/0x01; done | time awk '!a[$1]
++'
0x79c9
^CCommand terminated by signal 2
real    4m 56.44s
user    0m 1.61s
sys     0m 0.72s

And most of the time I was executing three copies of that loop in parallel.
I tried with and without my patches. All of them resulted in the same behavior.

Sorry but I believe my non-SMP device is not "affected enough" to help
debug this issue. Is your device SMP?

---
Luiz
Alvin Šipraga Feb. 10, 2022, 8:13 a.m. UTC | #4
Hi Luiz,

Thanks for the info.

Luiz Angelo Daros de Luca <luizluca@gmail.com> writes:

>> > So I used this command:
>> >
>> >     while true; do phytool read swp2/2/0x01; sleep 0.1; done
>>
>> I'll test it and report.
>
> In my single processor device, I got stable readings:
>
> root@OpenWrt:/# while true; do phytool read lan1/2/0x01; done | time awk '!a[$1]
> ++'
> 0x79c9
> ^CCommand terminated by signal 2
> real    4m 56.44s
> user    0m 1.61s
> sys     0m 0.72s
>
> And most of the time I was executing three copies of that loop in parallel.
> I tried with and without my patches. All of them resulted in the same behavior.
>
> Sorry but I believe my non-SMP device is not "affected enough" to help
> debug this issue. Is your device SMP?

Yes, it's SMP and also with PREEMPT_RT ;-)

But no problem, I will look into this and get back to you in this thread
if I need any more feedback. Thanks for your help.

Kind regards,
Alvin

>
> ---
> Luiz
Alvin Šipraga Feb. 10, 2022, 4:01 p.m. UTC | #5
Hi again Luiz,

Alvin Šipraga <ALSI@bang-olufsen.dk> writes:

> Hi Luiz,
>
> Thanks for the info.
>
> Luiz Angelo Daros de Luca <luizluca@gmail.com> writes:
>

<snip>

>>
>> Sorry but I believe my non-SMP device is not "affected enough" to help
>> debug this issue. Is your device SMP?
>
> Yes, it's SMP and also with PREEMPT_RT ;-)
>
> But no problem, I will look into this and get back to you in this thread
> if I need any more feedback. Thanks for your help.

So I had a closer look this afternoon. First some general statements.

1. Generally the switch driver is not accessing registers unless an
   administrator is reconfiguring the interfaces
2. The exception to the above rule is the delayed_work for stats64,
   which you can see in the code is there because get_stats64 runs in
   atomic context. This runs every 3 seconds.
3. Without the interrupt from the switch enabled, the PHY subsystem
   resorts to polling the PHY registers (every 1 second I think you
   said).

As a result of (2) and (3) above, you are bound at some point to have
both the stats work and the PHY register read callbacks executing in
parallel. As I mentioned in my last email, a simple busy loop with
phytool would return some nonsense pretty quickly. On my SMP/PREEMPT_RT
system this happens every 3 seconds while everything else is idle.

I tried to disable the stats delayed_work just to see, and in this case
I did not observe any PHY read issues. The PHY register value was always
as expected.

In that setup I then tried to provoke the error again, this time by
reading a single register with dd via regmap debugfs. And while it's
unlikely for a single such read to interfere with my busy phytool loop,
putting the dd read in a tight loop almost immediately provokes the same
bug. This time I noticed that the value returned by phytool is the same
value that I read out with dd from the other non-PHY-related register.

In general what I found is that if we read from an arbitrary register A
and this read coincides with the multi-register access in
rtl8365mb_phy_ocp_read, then the final read from
RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG will always return the value in
register A. It is quite reliably the case in all my testing, and it
explains the nonsense values we sometimes happened to see during PHY
register access, because of the stats work going on in the
background. Probably we got some MIB counter data and this corrupted the
PHY register value held in the INDIRECT_ACCESS_READ_DATA_REG register.

I am not sure why this happens - likely some peculiarity of the ASIC
hardware - but I wanted to check if this is also the case for the MIB
register access, because we also have a kind of indirect lookup
algorithm there. But in that case I did not see any corruption of the
data in the MIB counter data registers (RTL8365MB_MIB_COUNTER_REG(_x)).

So my conclusion is that this problem is unique to the indirect PHY
register access pattern. It should be pointed out that the regmap is
already protected by a lock, so I don't expect to see any weird data
races for non-PHY register access.

One more thing I wanted to point out: you mentioned that on your system
you conducted multiple phytool read loops and did not observe any
issues. I think this is easily explained by some higher-level locking
in the kernel which prevents concurrent PHY register reads.

****

With all of that said, I think the solution here is simply to guard
against stray register access while we are in the indirect PHY register
access callbacks. You also posted a patch to forego the whole indirect
access method for MDIO-connected switches, and I think that is also a
good thing. My reply to that patch was just taking issue with your
explanation, both because the diagnosis of the bug was rather nebulous,
and also because it did not actually fix the bug - it just worked around
it.

I will take it upon myself to fix this issue of indirect PHY register
access yielding corrupt values and post a patch to the list next week.
I already have a quick-n-dirty change which ensures proper locking and
now I cannot reproduce the issue for several hours.

In the mean time, could you resend your MDIO direct-PHY-register-access
patch and I will give it one more review. Please do not suggest that it
is a fix for this bug (because it's not) -- better yet, just add a Link:
to this thread and explain why you bothered implementing it to begin
with. You can mention that the issue is not seen with direct-access
(which also corroborates our findings here). Then I will base my changes
on your patch.

Alternatively you can drop the patch and we can just fix the indirect
access wholesale - both for SMI and MDIO. That would mean adding less
code (since MDIO with indirect access also works), albeit at the expense
of some technically unnecessary gymnastics in the driver (since MDIO
direct access is simpler). But I'll leave that up to you :-)

What do you think?

Kind regards,
Alvin
Luiz Angelo Daros de Luca Feb. 11, 2022, 5:40 a.m. UTC | #6
> Hi again Luiz,

Hello,

> >>
> >> Sorry but I believe my non-SMP device is not "affected enough" to help
> >> debug this issue. Is your device SMP?
> >
> > Yes, it's SMP and also with PREEMPT_RT ;-)
> >
> > But no problem, I will look into this and get back to you in this thread
> > if I need any more feedback. Thanks for your help.
>
> So I had a closer look this afternoon. First some general statements.
>
> 1. Generally the switch driver is not accessing registers unless an
>    administrator is reconfiguring the interfaces
> 2. The exception to the above rule is the delayed_work for stats64,
>    which you can see in the code is there because get_stats64 runs in
>    atomic context. This runs every 3 seconds.
> 3. Without the interrupt from the switch enabled, the PHY subsystem
>    resorts to polling the PHY registers (every 1 second I think you
>    said).
>
> As a result of (2) and (3) above, you are bound at some point to have
> both the stats work and the PHY register read callbacks executing in
> parallel. As I mentioned in my last email, a simple busy loop with
> phytool would return some nonsense pretty quickly. On my SMP/PREEMPT_RT
> system this happens every 3 seconds while everything else is idle.
>
> I tried to disable the stats delayed_work just to see, and in this case
> I did not observe any PHY read issues. The PHY register value was always
> as expected.
>
> In that setup I then tried to provoke the error again, this time by
> reading a single register with dd via regmap debugfs. And while it's
> unlikely for a single such read to interfere with my busy phytool loop,
> putting the dd read in a tight loop almost immediately provokes the same
> bug. This time I noticed that the value returned by phytool is the same
> value that I read out with dd from the other non-PHY-related register.
>
> In general what I found is that if we read from an arbitrary register A
> and this read coincides with the multi-register access in
> rtl8365mb_phy_ocp_read, then the final read from
> RTL8365MB_INDIRECT_ACCESS_READ_DATA_REG will always return the value in
> register A. It is quite reliably the case in all my testing, and it
> explains the nonsense values we sometimes happened to see during PHY
> register access, because of the stats work going on in the
> background. Probably we got some MIB counter data and this corrupted the
> PHY register value held in the INDIRECT_ACCESS_READ_DATA_REG register.

It makes sense. That explains why a simple lock over indirect access
did not solve the issue.

> I am not sure why this happens - likely some peculiarity of the ASIC
> hardware - but I wanted to check if this is also the case for the MIB
> register access, because we also have a kind of indirect lookup
> algorithm there. But in that case I did not see any corruption of the
> data in the MIB counter data registers (RTL8365MB_MIB_COUNTER_REG(_x)).
>
> So my conclusion is that this problem is unique to the indirect PHY
> register access pattern. It should be pointed out that the regmap is
> already protected by a lock, so I don't expect to see any weird data
> races for non-PHY register access.
>
>
> One more thing I wanted to point out: you mentioned that on your system
> you conducted multiple phytool read loops and did not observe any
> issues. I think this is easily explained by some higher-level locking
> in the kernel which prevents concurrent PHY register reads.
>
> ****
>
> With all of that said, I think the solution here is simply to guard
> against stray register access while we are in the indirect PHY register
> access callbacks. You also posted a patch to forego the whole indirect
> access method for MDIO-connected switches, and I think that is also a
> good thing. My reply to that patch was just taking issue with your
> explanation, both because the diagnosis of the bug was rather nebulous,
> and also because it did not actually fix the bug - it just worked around
> it.
>
> I will take it upon myself to fix this issue of indirect PHY register
> access yielding corrupt values and post a patch to the list next week.
> I already have a quick-n-dirty change which ensures proper locking and
> now I cannot reproduce the issue for several hours.

I tried to find a way to lock regmap while permitting it to be used
only by the indirect reg access.
However, I failed to find a clean solution. It's great you got a proper fix.

> In the mean time, could you resend your MDIO direct-PHY-register-access
> patch and I will give it one more review. Please do not suggest that it
> is a fix for this bug (because it's not) -- better yet, just add a Link:
> to this thread and explain why you bothered implementing it to begin
> with. You can mention that the issue is not seen with direct-access
> (which also corroborates our findings here). Then I will base my changes
> on your patch.
>
> Alternatively you can drop the patch and we can just fix the indirect
> access wholesale - both for SMI and MDIO. That would mean adding less
> code (since MDIO with indirect access also works), albeit at the expense
> of some technically unnecessary gymnastics in the driver (since MDIO
> direct access is simpler). But I'll leave that up to you :-)
>
> What do you think?
>

I would prefer to drop it if we get a shared fix. There is an ongoing
discussion that might allow us to drop the realtek-smi internal mdio
and share phy_read/write between both interfaces. In that case, that
different mdio code path will fit much better as an "if" inside those
functions.

> Kind regards,
> Alvin

Regards,
Luiz
Alvin Šipraga Feb. 11, 2022, 8:41 a.m. UTC | #7
Hi Luiz,

Luiz Angelo Daros de Luca <luizluca@gmail.com> writes:

>
> Hello,

<snip>

> I would prefer to drop it if we get a shared fix. There is an ongoing
> discussion that might allow us to drop the realtek-smi internal mdio
> and share phy_read/write between both interfaces. In that case, that
> different mdio code path will fit much better as an "if" inside those
> functions.

OK, sounds good!

Kind regards,
Alvin
diff mbox series

Patch

diff --git a/drivers/net/dsa/realtek/rtl8365mb.c
b/drivers/net/dsa/realtek/rtl8365mb.c
index 1a0562811c1a..5572271a2514 100644
--- a/drivers/net/dsa/realtek/rtl8365mb.c
+++ b/drivers/net/dsa/realtek/rtl8365mb.c
@@ -817,6 +817,7 @@  struct rtl8365mb_port {
 * @port_mask: mask of all ports
 * @learn_limit_max: maximum number of L2 addresses the chip can learn
 * @mib_lock: prevent concurrent reads of MIB counters
+ * @indirect_reg_lock: prevent concurrent access of indirect registers
 * @ports: per-port data
 * @jam_table: chip-specific initialization jam table
 * @jam_size: size of the chip's jam table
@@ -832,6 +833,7 @@  struct rtl8365mb {
       u32 port_mask;
       u32 learn_limit_max;
       struct mutex mib_lock;
+       struct mutex indirect_reg_lock;
       struct rtl8365mb_port ports[RTL8365MB_MAX_NUM_PORTS];
       const struct rtl8365mb_jam_tbl_entry *jam_table;
       size_t jam_size;
@@ -989,6 +991,7 @@  static int rtl8365mb_phy_ocp_write(struct
realtek_priv *priv, int phy,

static int rtl8365mb_phy_read(struct realtek_priv *priv, int phy, int regnum)
{
+       struct rtl8365mb *mb = priv->chip_data;
       u32 ocp_addr;
       u16 val;