diff mbox series

[net-next] net: mdio: Add netlink interface

Message ID 20230306204517.1953122-1-sean.anderson@seco.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [net-next] net: mdio: Add netlink interface | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit fail Errors and warnings before: 0 this patch: 1
netdev/cc_maintainers success CCed 9 of 9 maintainers
netdev/build_clang fail Errors and warnings before: 0 this patch: 2
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn fail Errors and warnings before: 0 this patch: 1
netdev/checkpatch warning WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? WARNING: line length of 81 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline fail Was 0 now: 3

Commit Message

Sean Anderson March 6, 2023, 8:45 p.m. UTC
This adds a netlink interface to make reads/writes to mdio buses. This
makes it easier to debug devices. This is especially useful when there
is a PCS involved (and the ethtool reads are faked), when there is no
MAC associated with a PHY, or when the MDIO device is not a PHY.

The closest existing in-kernel interfaces are the SIOCG/SMIIREG ioctls, but
they have several drawbacks:

1. "Write register" is not always exactly that. The kernel will try to
   be extra helpful and do things behind the scenes if it detects a
   write to the reset bit of a PHY for example.

2. Only one op per syscall. This means that is impossible to implement
   many operations in a safe manner. Something as simple as a
   read/mask/write cycle can race against an in-kernel driver.

3. Addressing is awkward since you don't address the MDIO bus
   directly, rather "the MDIO bus to which this netdev's PHY is
   connected". This makes it hard to talk to devices on buses to which
   no PHYs are connected, the typical case being Ethernet switches.

To address these shortcomings, this adds a GENL interface with which a user
can interact with an MDIO bus directly.  The user sends a program that
mdio-netlink executes, possibly emitting data back to the user. I.e. it
implements a very simple VM. Read/mask/write operations could be
implemented by dedicated commands, but when you start looking at more
advanced things like reading out the VLAN database of a switch you need
state and branching.

To prevent userspace phy drivers, writes are disabled by default, and can
only be enabled by editing the source. This is the same strategy used by
regmap for debugfs writes. Unfortunately, this disallows several useful
features, including

- Register writes (obviously)
- C45-over-C22
- Atomic access to paged registers
- Better MDIO emulation for e.g. QEMU

However, the read-only interface remains broadly useful for debugging.
Users who want to use the above features can re-enable them by defining
MDIO_NETLINK_ALLOW_WRITE and recompiling their kernel.

Signed-off-by: Sean Anderson <sean.anderson@seco.com>
---
This driver was written by Tobias Waldekranz. It is adapted from the
version he released with mdio-tools [1]. This was last discussed 2.5
years ago [2], and I have incorperated his cover letter into this commit
message. The discussion mainly centered around the write capability
allowing for userspace phy drivers. Although it comes with reduced
functionality, I hope this new approach satisfies Andrew. I have also
made several minor changes for style and to stay abrest of changing
APIs.

Tobias, I've taken the liberty of adding some copyright notices
attributing this to you.

[1] https://github.com/wkz/mdio-tools
[2] https://lore.kernel.org/netdev/C42DZQLTPHM5.2THDSRK84BI3T@wkz-x280/

 drivers/net/mdio/Kconfig          |   8 +
 drivers/net/mdio/Makefile         |   1 +
 drivers/net/mdio/mdio-netlink.c   | 464 ++++++++++++++++++++++++++++++
 include/uapi/linux/mdio-netlink.h |  61 ++++
 4 files changed, 534 insertions(+)
 create mode 100644 drivers/net/mdio/mdio-netlink.c
 create mode 100644 include/uapi/linux/mdio-netlink.h

Comments

Russell King (Oracle) March 6, 2023, 10:48 p.m. UTC | #1
On Mon, Mar 06, 2023 at 03:45:16PM -0500, Sean Anderson wrote:
> +static int mdio_nl_eval(struct mdio_nl_xfer *xfer)
> +{
> +	struct mdio_nl_insn *insn;
> +	unsigned long timeout;
> +	u16 regs[8] = { 0 };
> +	int pc, ret = 0;

So "pc" is signed.

> +	int phy_id, reg, prtad, devad, val;
> +
> +	timeout = jiffies + msecs_to_jiffies(xfer->timeout_ms);
> +
> +	mutex_lock(&xfer->mdio->mdio_lock);
> +
> +	for (insn = xfer->prog, pc = 0;
> +	     pc < xfer->prog_len;

xfer->prog_len is signed, so this is a signed comparison.

> +		case MDIO_NL_OP_JEQ:
> +			if (__arg_ri(insn->arg0, regs) ==
> +			    __arg_ri(insn->arg1, regs))
> +				pc += (s16)__arg_i(insn->arg2);

This adds a signed 16-bit integer to pc, which can make pc negative.

And so the question becomes... what prevents pc becoming negative
and then trying to use a negative number as an index?

I think prog_len and pc should both be unsigned, then the test you
have will be unsigned, and thus wrapping "pc" around zero makes it
a very large integer which fails the test - preventing at least
access outside of the array. Better still would be a validator
that checks that the program is in fact safe to execute.
Sean Anderson March 6, 2023, 11:39 p.m. UTC | #2
On 3/6/23 17:48, Russell King (Oracle) wrote:
> On Mon, Mar 06, 2023 at 03:45:16PM -0500, Sean Anderson wrote:
>> +static int mdio_nl_eval(struct mdio_nl_xfer *xfer)
>> +{
>> +	struct mdio_nl_insn *insn;
>> +	unsigned long timeout;
>> +	u16 regs[8] = { 0 };
>> +	int pc, ret = 0;
> 
> So "pc" is signed.
> 
>> +	int phy_id, reg, prtad, devad, val;
>> +
>> +	timeout = jiffies + msecs_to_jiffies(xfer->timeout_ms);
>> +
>> +	mutex_lock(&xfer->mdio->mdio_lock);
>> +
>> +	for (insn = xfer->prog, pc = 0;
>> +	     pc < xfer->prog_len;
> 
> xfer->prog_len is signed, so this is a signed comparison.
> 
>> +		case MDIO_NL_OP_JEQ:
>> +			if (__arg_ri(insn->arg0, regs) ==
>> +			    __arg_ri(insn->arg1, regs))
>> +				pc += (s16)__arg_i(insn->arg2);
> 
> This adds a signed 16-bit integer to pc, which can make pc negative.
> 
> And so the question becomes... what prevents pc becoming negative
> and then trying to use a negative number as an index?

We start executing from somewhere on the heap :)

> I think prog_len and pc should both be unsigned, then the test you
> have will be unsigned, and thus wrapping "pc" around zero makes it
> a very large integer which fails the test - preventing at least
> access outside of the array.

Will fix.

> Better still would be a validator
> that checks that the program is in fact safe to execute.

I think mdio_nl_validate_prog could be extended to check for this.

--Sean
kernel test robot March 7, 2023, 12:05 a.m. UTC | #3
Hi Sean,

I love your patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Sean-Anderson/net-mdio-Add-netlink-interface/20230307-044742
patch link:    https://lore.kernel.org/r/20230306204517.1953122-1-sean.anderson%40seco.com
patch subject: [PATCH net-next] net: mdio: Add netlink interface
config: sparc-allyesconfig (https://download.01.org/0day-ci/archive/20230307/202303070724.WmNAt4Af-lkp@intel.com/config)
compiler: sparc64-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/78ff5545f403a98977a2db207cc165cb3a3b4d8f
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Sean-Anderson/net-mdio-Add-netlink-interface/20230307-044742
        git checkout 78ff5545f403a98977a2db207cc165cb3a3b4d8f
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=sparc olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=sparc SHELL=/bin/bash drivers/net/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202303070724.WmNAt4Af-lkp@intel.com/

All warnings (new ones prefixed by >>):

   drivers/net/mdio/mdio-netlink.c: In function 'mdio_nl_eval':
>> drivers/net/mdio/mdio-netlink.c:98:40: warning: variable 'val' set but not used [-Wunused-but-set-variable]
      98 |         int phy_id, reg, prtad, devad, val;
         |                                        ^~~


vim +/val +98 drivers/net/mdio/mdio-netlink.c

    91	
    92	static int mdio_nl_eval(struct mdio_nl_xfer *xfer)
    93	{
    94		struct mdio_nl_insn *insn;
    95		unsigned long timeout;
    96		u16 regs[8] = { 0 };
    97		int pc, ret = 0;
  > 98		int phy_id, reg, prtad, devad, val;
    99	
   100		timeout = jiffies + msecs_to_jiffies(xfer->timeout_ms);
   101	
   102		mutex_lock(&xfer->mdio->mdio_lock);
   103	
   104		for (insn = xfer->prog, pc = 0;
   105		     pc < xfer->prog_len;
   106		     insn = &xfer->prog[++pc]) {
   107			if (time_after(jiffies, timeout)) {
   108				ret = -ETIMEDOUT;
   109				break;
   110			}
   111	
   112			switch ((enum mdio_nl_op)insn->op) {
   113			case MDIO_NL_OP_READ:
   114				phy_id = __arg_ri(insn->arg0, regs);
   115				prtad = mdio_phy_id_prtad(phy_id);
   116				devad = mdio_phy_id_devad(phy_id);
   117				reg = __arg_ri(insn->arg1, regs);
   118	
   119				if (mdio_phy_id_is_c45(phy_id))
   120					ret = __mdiobus_c45_read(xfer->mdio, prtad,
   121								 devad, reg);
   122				else
   123					ret = __mdiobus_read(xfer->mdio, phy_id, reg);
   124	
   125				if (ret < 0)
   126					goto exit;
   127				*__arg_r(insn->arg2, regs) = ret;
   128				ret = 0;
   129				break;
   130	
   131			case MDIO_NL_OP_WRITE:
   132				phy_id = __arg_ri(insn->arg0, regs);
   133				prtad = mdio_phy_id_prtad(phy_id);
   134				devad = mdio_phy_id_devad(phy_id);
   135				reg = __arg_ri(insn->arg1, regs);
   136				val = __arg_ri(insn->arg2, regs);
   137
Michael Walle March 7, 2023, 11:23 a.m. UTC | #4
> To prevent userspace phy drivers, writes are disabled by default, and can
> only be enabled by editing the source.

Maybe we can just taint the kernel using add_taint()? I'm not sure if
that will prevent vendors writing user space drivers. Thoughts?

-michael
Tobias Waldekranz March 7, 2023, 12:26 p.m. UTC | #5
On mån, mar 06, 2023 at 15:45, Sean Anderson <sean.anderson@seco.com> wrote:
> This adds a netlink interface to make reads/writes to mdio buses. This
> makes it easier to debug devices. This is especially useful when there
> is a PCS involved (and the ethtool reads are faked), when there is no
> MAC associated with a PHY, or when the MDIO device is not a PHY.
>
> The closest existing in-kernel interfaces are the SIOCG/SMIIREG ioctls, but
> they have several drawbacks:
>
> 1. "Write register" is not always exactly that. The kernel will try to
>    be extra helpful and do things behind the scenes if it detects a
>    write to the reset bit of a PHY for example.
>
> 2. Only one op per syscall. This means that is impossible to implement
>    many operations in a safe manner. Something as simple as a
>    read/mask/write cycle can race against an in-kernel driver.
>
> 3. Addressing is awkward since you don't address the MDIO bus
>    directly, rather "the MDIO bus to which this netdev's PHY is
>    connected". This makes it hard to talk to devices on buses to which
>    no PHYs are connected, the typical case being Ethernet switches.
>
> To address these shortcomings, this adds a GENL interface with which a user
> can interact with an MDIO bus directly.  The user sends a program that
> mdio-netlink executes, possibly emitting data back to the user. I.e. it
> implements a very simple VM. Read/mask/write operations could be
> implemented by dedicated commands, but when you start looking at more
> advanced things like reading out the VLAN database of a switch you need
> state and branching.
>
> To prevent userspace phy drivers, writes are disabled by default, and can
> only be enabled by editing the source. This is the same strategy used by
> regmap for debugfs writes. Unfortunately, this disallows several useful
> features, including
>
> - Register writes (obviously)
> - C45-over-C22
> - Atomic access to paged registers
> - Better MDIO emulation for e.g. QEMU
>
> However, the read-only interface remains broadly useful for debugging.
> Users who want to use the above features can re-enable them by defining
> MDIO_NETLINK_ALLOW_WRITE and recompiling their kernel.

What about taking a page from the BPF playbook and require all loaded
programs (MDIO_GENL_XFERs) to be licensed under GPL?  That would mean
that the userspace program that generated it would also have to be
GPLed.

My view has always been that a vendor looking to build a userspace SDK
won't be deterred by this limitation.  They can easily build
mdio-netlink.ko from mdio-tools and use that to drive it, or (more
likely) they already have their own implementation that they are stuck
with for legacy reasons.  In other words: we are only punishing
legitimate users (mdio-tools being one of them, IMO).

Perhaps with this approach we could have our cake and eat it too.

> Signed-off-by: Sean Anderson <sean.anderson@seco.com>
> ---
> This driver was written by Tobias Waldekranz. It is adapted from the
> version he released with mdio-tools [1]. This was last discussed 2.5
> years ago [2], and I have incorperated his cover letter into this commit
> message. The discussion mainly centered around the write capability
> allowing for userspace phy drivers. Although it comes with reduced
> functionality, I hope this new approach satisfies Andrew. I have also
> made several minor changes for style and to stay abrest of changing
> APIs.
>
> Tobias, I've taken the liberty of adding some copyright notices
> attributing this to you.

Fine by me :)

> [1] https://github.com/wkz/mdio-tools
> [2] https://lore.kernel.org/netdev/C42DZQLTPHM5.2THDSRK84BI3T@wkz-x280/
>
>  drivers/net/mdio/Kconfig          |   8 +
>  drivers/net/mdio/Makefile         |   1 +
>  drivers/net/mdio/mdio-netlink.c   | 464 ++++++++++++++++++++++++++++++
>  include/uapi/linux/mdio-netlink.h |  61 ++++
>  4 files changed, 534 insertions(+)
>  create mode 100644 drivers/net/mdio/mdio-netlink.c
>  create mode 100644 include/uapi/linux/mdio-netlink.h
>
> diff --git a/drivers/net/mdio/Kconfig b/drivers/net/mdio/Kconfig
> index 90309980686e..8a01978e5b51 100644
> --- a/drivers/net/mdio/Kconfig
> +++ b/drivers/net/mdio/Kconfig
> @@ -43,6 +43,14 @@ config ACPI_MDIO
>  
>  if MDIO_BUS
>  
> +config MDIO_NETLINK
> +	tristate "Netlink interface for MDIO buses"
> +	help
> +	  Enable a netlink interface to allow reading MDIO buses from
> +	  userspace. A small virtual machine allows implementing complex
> +	  operations, such as conditional reads or polling. All operations
> +	  submitted in the same program are evaluated atomically.
> +
>  config MDIO_DEVRES
>  	tristate
>  
> diff --git a/drivers/net/mdio/Makefile b/drivers/net/mdio/Makefile
> index 7d4cb4c11e4e..5583d5b8d174 100644
> --- a/drivers/net/mdio/Makefile
> +++ b/drivers/net/mdio/Makefile
> @@ -4,6 +4,7 @@
>  obj-$(CONFIG_ACPI_MDIO)		+= acpi_mdio.o
>  obj-$(CONFIG_FWNODE_MDIO)	+= fwnode_mdio.o
>  obj-$(CONFIG_OF_MDIO)		+= of_mdio.o
> +obj-$(CONFIG_MDIO_NETLINK)	+= mdio-netlink.o
>  
>  obj-$(CONFIG_MDIO_ASPEED)		+= mdio-aspeed.o
>  obj-$(CONFIG_MDIO_BCM_IPROC)		+= mdio-bcm-iproc.o
> diff --git a/drivers/net/mdio/mdio-netlink.c b/drivers/net/mdio/mdio-netlink.c
> new file mode 100644
> index 000000000000..3e32d1a9bab3
> --- /dev/null
> +++ b/drivers/net/mdio/mdio-netlink.c
> @@ -0,0 +1,464 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2022-23 Sean Anderson <sean.anderson@seco.com>
> + * Copyright (C) 2020-22 Tobias Waldekranz <tobias@waldekranz.com>
> + */
> +
> +#include <linux/init.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/netlink.h>
> +#include <linux/phy.h>
> +#include <net/genetlink.h>
> +#include <net/netlink.h>
> +#include <uapi/linux/mdio-netlink.h>
> +
> +struct mdio_nl_xfer {
> +	struct genl_info *info;
> +	struct sk_buff *msg;
> +	void *hdr;
> +	struct nlattr *data;
> +
> +	struct mii_bus *mdio;
> +	int timeout_ms;
> +
> +	int prog_len;
> +	struct mdio_nl_insn *prog;
> +};
> +
> +static int mdio_nl_open(struct mdio_nl_xfer *xfer);
> +static int mdio_nl_close(struct mdio_nl_xfer *xfer, bool last, int xerr);
> +
> +static int mdio_nl_flush(struct mdio_nl_xfer *xfer)
> +{
> +	int err;
> +
> +	err = mdio_nl_close(xfer, false, 0);
> +	if (err)
> +		return err;
> +
> +	return mdio_nl_open(xfer);
> +}
> +
> +static int mdio_nl_emit(struct mdio_nl_xfer *xfer, u32 datum)
> +{
> +	int err = 0;
> +
> +	if (!nla_put_nohdr(xfer->msg, sizeof(datum), &datum))
> +		return 0;
> +
> +	err = mdio_nl_flush(xfer);
> +	if (err)
> +		return err;
> +
> +	return nla_put_nohdr(xfer->msg, sizeof(datum), &datum);
> +}
> +
> +static inline u16 *__arg_r(u32 arg, u16 *regs)
> +{
> +	WARN_ON_ONCE(arg >> 16 != MDIO_NL_ARG_REG);
> +
> +	return &regs[arg & 0x7];
> +}
> +
> +static inline u16 __arg_i(u32 arg)
> +{
> +	WARN_ON_ONCE(arg >> 16 != MDIO_NL_ARG_IMM);
> +
> +	return arg & 0xffff;
> +}
> +
> +static inline u16 __arg_ri(u32 arg, u16 *regs)
> +{
> +	switch ((enum mdio_nl_argmode)(arg >> 16)) {
> +	case MDIO_NL_ARG_IMM:
> +		return arg & 0xffff;
> +	case MDIO_NL_ARG_REG:
> +		return regs[arg & 7];
> +	default:
> +		WARN_ON_ONCE(1);
> +		return 0;
> +	}
> +}
> +
> +/* To prevent out-of-tree drivers from being implemented through this
> + * interface, disallow writes by default. This does disallow read-only uses,
> + * such as c45-over-c22 or reading phys with pages. However, with a such a
> + * flexible interface, we must use a big hammer. People who want to use this
> + * will need to modify the source code directly.
> + */
> +#undef MDIO_NETLINK_ALLOW_WRITE
> +
> +static int mdio_nl_eval(struct mdio_nl_xfer *xfer)
> +{
> +	struct mdio_nl_insn *insn;
> +	unsigned long timeout;
> +	u16 regs[8] = { 0 };
> +	int pc, ret = 0;
> +	int phy_id, reg, prtad, devad, val;
> +
> +	timeout = jiffies + msecs_to_jiffies(xfer->timeout_ms);
> +
> +	mutex_lock(&xfer->mdio->mdio_lock);
> +
> +	for (insn = xfer->prog, pc = 0;
> +	     pc < xfer->prog_len;
> +	     insn = &xfer->prog[++pc]) {
> +		if (time_after(jiffies, timeout)) {
> +			ret = -ETIMEDOUT;
> +			break;
> +		}
> +
> +		switch ((enum mdio_nl_op)insn->op) {
> +		case MDIO_NL_OP_READ:
> +			phy_id = __arg_ri(insn->arg0, regs);
> +			prtad = mdio_phy_id_prtad(phy_id);
> +			devad = mdio_phy_id_devad(phy_id);
> +			reg = __arg_ri(insn->arg1, regs);
> +
> +			if (mdio_phy_id_is_c45(phy_id))
> +				ret = __mdiobus_c45_read(xfer->mdio, prtad,
> +							 devad, reg);
> +			else
> +				ret = __mdiobus_read(xfer->mdio, phy_id, reg);
> +
> +			if (ret < 0)
> +				goto exit;
> +			*__arg_r(insn->arg2, regs) = ret;
> +			ret = 0;
> +			break;
> +
> +		case MDIO_NL_OP_WRITE:
> +			phy_id = __arg_ri(insn->arg0, regs);
> +			prtad = mdio_phy_id_prtad(phy_id);
> +			devad = mdio_phy_id_devad(phy_id);
> +			reg = __arg_ri(insn->arg1, regs);
> +			val = __arg_ri(insn->arg2, regs);
> +
> +#ifdef MDIO_NETLINK_ALLOW_WRITE
> +			add_taint(TAINT_USER, LOCKDEP_STILL_OK);
> +			if (mdio_phy_id_is_c45(phy_id))
> +				ret = __mdiobus_c45_write(xfer->mdio, prtad,
> +							  devad, reg, val
> +			else
> +				ret = __mdiobus_write(xfer->mdio, dev, reg,
> +						      val);
> +#else
> +			ret = -EPERM;
> +#endif
> +			if (ret < 0)
> +				goto exit;
> +			ret = 0;
> +			break;
> +
> +		case MDIO_NL_OP_AND:
> +			*__arg_r(insn->arg2, regs) =
> +				__arg_ri(insn->arg0, regs) &
> +				__arg_ri(insn->arg1, regs);
> +			break;
> +
> +		case MDIO_NL_OP_OR:
> +			*__arg_r(insn->arg2, regs) =
> +				__arg_ri(insn->arg0, regs) |
> +				__arg_ri(insn->arg1, regs);
> +			break;
> +
> +		case MDIO_NL_OP_ADD:
> +			*__arg_r(insn->arg2, regs) =
> +				__arg_ri(insn->arg0, regs) +
> +				__arg_ri(insn->arg1, regs);
> +			break;
> +
> +		case MDIO_NL_OP_JEQ:
> +			if (__arg_ri(insn->arg0, regs) ==
> +			    __arg_ri(insn->arg1, regs))
> +				pc += (s16)__arg_i(insn->arg2);
> +			break;
> +
> +		case MDIO_NL_OP_JNE:
> +			if (__arg_ri(insn->arg0, regs) !=
> +			    __arg_ri(insn->arg1, regs))
> +				pc += (s16)__arg_i(insn->arg2);
> +			break;
> +
> +		case MDIO_NL_OP_EMIT:
> +			ret = mdio_nl_emit(xfer, __arg_ri(insn->arg0, regs));
> +			if (ret < 0)
> +				goto exit;
> +			ret = 0;
> +			break;
> +
> +		case MDIO_NL_OP_UNSPEC:
> +		default:
> +			ret = -EINVAL;
> +			goto exit;
> +		}
> +	}
> +exit:
> +	mutex_unlock(&xfer->mdio->mdio_lock);
> +	return ret;
> +}
> +
> +struct mdio_nl_op_proto {
> +	u8 arg0;
> +	u8 arg1;
> +	u8 arg2;
> +};
> +
> +static const struct mdio_nl_op_proto mdio_nl_op_protos[MDIO_NL_OP_MAX + 1] = {
> +	[MDIO_NL_OP_READ] = {
> +		.arg0 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
> +		.arg1 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
> +		.arg2 = BIT(MDIO_NL_ARG_REG),
> +	},
> +	[MDIO_NL_OP_WRITE] = {
> +		.arg0 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
> +		.arg1 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
> +		.arg2 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
> +	},
> +	[MDIO_NL_OP_AND] = {
> +		.arg0 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
> +		.arg1 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
> +		.arg2 = BIT(MDIO_NL_ARG_REG),
> +	},
> +	[MDIO_NL_OP_OR] = {
> +		.arg0 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
> +		.arg1 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
> +		.arg2 = BIT(MDIO_NL_ARG_REG),
> +	},
> +	[MDIO_NL_OP_ADD] = {
> +		.arg0 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
> +		.arg1 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
> +		.arg2 = BIT(MDIO_NL_ARG_REG),
> +	},
> +	[MDIO_NL_OP_JEQ] = {
> +		.arg0 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
> +		.arg1 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
> +		.arg2 = BIT(MDIO_NL_ARG_IMM),
> +	},
> +	[MDIO_NL_OP_JNE] = {
> +		.arg0 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
> +		.arg1 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
> +		.arg2 = BIT(MDIO_NL_ARG_IMM),
> +	},
> +	[MDIO_NL_OP_EMIT] = {
> +		.arg0 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
> +		.arg1 = BIT(MDIO_NL_ARG_NONE),
> +		.arg2 = BIT(MDIO_NL_ARG_NONE),
> +	},
> +};
> +
> +static int mdio_nl_validate_insn(const struct nlattr *attr,
> +				 struct netlink_ext_ack *extack,
> +				 const struct mdio_nl_insn *insn)
> +{
> +	const struct mdio_nl_op_proto *proto;
> +
> +	if (insn->op > MDIO_NL_OP_MAX) {
> +		NL_SET_ERR_MSG_ATTR(extack, attr, "Illegal instruction");
> +		return -EINVAL;
> +	}
> +
> +	proto = &mdio_nl_op_protos[insn->op];
> +
> +	if (!(BIT(insn->arg0 >> 16) & proto->arg0)) {
> +		NL_SET_ERR_MSG_ATTR(extack, attr, "Argument 0 invalid");
> +		return -EINVAL;
> +	}
> +
> +	if (!(BIT(insn->arg1 >> 16) & proto->arg1)) {
> +		NL_SET_ERR_MSG_ATTR(extack, attr, "Argument 1 invalid");
> +		return -EINVAL;
> +	}
> +
> +	if (!(BIT(insn->arg2 >> 16) & proto->arg2)) {
> +		NL_SET_ERR_MSG_ATTR(extack, attr, "Argument 2 invalid");
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static int mdio_nl_validate_prog(const struct nlattr *attr,
> +				 struct netlink_ext_ack *extack)
> +{
> +	const struct mdio_nl_insn *prog = nla_data(attr);
> +	int len = nla_len(attr);
> +	int i, err = 0;
> +
> +	if (len % sizeof(*prog)) {
> +		NL_SET_ERR_MSG_ATTR(extack, attr, "Unaligned instruction");
> +		return -EINVAL;
> +	}
> +
> +	len /= sizeof(*prog);
> +	for (i = 0; i < len; i++) {
> +		err = mdio_nl_validate_insn(attr, extack, &prog[i]);
> +		if (err)
> +			break;
> +	}
> +
> +	return err;
> +}
> +
> +static const struct nla_policy mdio_nl_policy[MDIO_NLA_MAX + 1] = {
> +	[MDIO_NLA_UNSPEC]  = { .type = NLA_UNSPEC, },
> +	[MDIO_NLA_BUS_ID]  = { .type = NLA_STRING, .len = MII_BUS_ID_SIZE },
> +	[MDIO_NLA_TIMEOUT] = NLA_POLICY_MAX(NLA_U16, 10 * MSEC_PER_SEC),
> +	[MDIO_NLA_PROG]    = NLA_POLICY_VALIDATE_FN(NLA_BINARY,
> +						    mdio_nl_validate_prog,
> +						    0x1000),
> +	[MDIO_NLA_DATA]    = { .type = NLA_NESTED },
> +	[MDIO_NLA_ERROR]   = { .type = NLA_S32, },
> +};
> +
> +static struct genl_family mdio_nl_family;
> +
> +static int mdio_nl_open(struct mdio_nl_xfer *xfer)
> +{
> +	int err;
> +
> +	xfer->msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
> +	if (!xfer->msg) {
> +		err = -ENOMEM;
> +		goto err;
> +	}
> +
> +	xfer->hdr = genlmsg_put(xfer->msg, xfer->info->snd_portid,
> +				xfer->info->snd_seq, &mdio_nl_family,
> +				NLM_F_ACK | NLM_F_MULTI, MDIO_GENL_XFER);
> +	if (!xfer->hdr) {
> +		err = -EMSGSIZE;
> +		goto err_free;
> +	}
> +
> +	xfer->data = nla_nest_start(xfer->msg, MDIO_NLA_DATA);
> +	if (!xfer->data) {
> +		err = -EMSGSIZE;
> +		goto err_free;
> +	}
> +
> +	return 0;
> +
> +err_free:
> +	nlmsg_free(xfer->msg);
> +err:
> +	return err;
> +}
> +
> +static int mdio_nl_close(struct mdio_nl_xfer *xfer, bool last, int xerr)
> +{
> +	struct nlmsghdr *end;
> +	int err;
> +
> +	nla_nest_end(xfer->msg, xfer->data);
> +
> +	if (xerr && nla_put_s32(xfer->msg, MDIO_NLA_ERROR, xerr)) {
> +		err = mdio_nl_flush(xfer);
> +		if (err)
> +			goto err_free;
> +
> +		if (nla_put_s32(xfer->msg, MDIO_NLA_ERROR, xerr)) {
> +			err = -EMSGSIZE;
> +			goto err_free;
> +		}
> +	}
> +
> +	genlmsg_end(xfer->msg, xfer->hdr);
> +
> +	if (last) {
> +		end = nlmsg_put(xfer->msg, xfer->info->snd_portid,
> +				xfer->info->snd_seq, NLMSG_DONE, 0,
> +				NLM_F_ACK | NLM_F_MULTI);
> +		if (!end) {
> +			err = mdio_nl_flush(xfer);
> +			if (err)
> +				goto err_free;
> +
> +			end = nlmsg_put(xfer->msg, xfer->info->snd_portid,
> +					xfer->info->snd_seq, NLMSG_DONE, 0,
> +					NLM_F_ACK | NLM_F_MULTI);
> +			if (!end) {
> +				err = -EMSGSIZE;
> +				goto err_free;
> +			}
> +		}
> +	}
> +
> +	return genlmsg_unicast(genl_info_net(xfer->info), xfer->msg,
> +			       xfer->info->snd_portid);
> +
> +err_free:
> +	nlmsg_free(xfer->msg);
> +	return err;
> +}
> +
> +static int mdio_nl_cmd_xfer(struct sk_buff *skb, struct genl_info *info)
> +{
> +	struct mdio_nl_xfer xfer;
> +	int err;
> +
> +	if (!info->attrs[MDIO_NLA_BUS_ID] ||
> +	    !info->attrs[MDIO_NLA_PROG]   ||
> +	     info->attrs[MDIO_NLA_DATA]   ||
> +	     info->attrs[MDIO_NLA_ERROR])
> +		return -EINVAL;
> +
> +	xfer.mdio = mdio_find_bus(nla_data(info->attrs[MDIO_NLA_BUS_ID]));
> +	if (!xfer.mdio)
> +		return -ENODEV;
> +
> +	if (info->attrs[MDIO_NLA_TIMEOUT])
> +		xfer.timeout_ms = nla_get_u32(info->attrs[MDIO_NLA_TIMEOUT]);
> +	else
> +		xfer.timeout_ms = 100;
> +
> +	xfer.info = info;
> +	xfer.prog_len = nla_len(info->attrs[MDIO_NLA_PROG]) / sizeof(*xfer.prog);
> +	xfer.prog = nla_data(info->attrs[MDIO_NLA_PROG]);
> +
> +	err = mdio_nl_open(&xfer);
> +	if (err)
> +		return err;
> +
> +	err = mdio_nl_eval(&xfer);
> +
> +	err = mdio_nl_close(&xfer, true, err);
> +	return err;
> +}
> +
> +static const struct genl_ops mdio_nl_ops[] = {
> +	{
> +		.cmd = MDIO_GENL_XFER,
> +		.doit = mdio_nl_cmd_xfer,
> +		.flags = GENL_ADMIN_PERM,
> +	},
> +};
> +
> +static struct genl_family mdio_nl_family = {
> +	.name     = "mdio",
> +	.version  = 1,
> +	.maxattr  = MDIO_NLA_MAX,
> +	.netnsok  = false,
> +	.module   = THIS_MODULE,
> +	.ops      = mdio_nl_ops,
> +	.n_ops    = ARRAY_SIZE(mdio_nl_ops),
> +	.policy   = mdio_nl_policy,
> +};
> +
> +static int __init mdio_nl_init(void)
> +{
> +	return genl_register_family(&mdio_nl_family);
> +}
> +
> +static void __exit mdio_nl_exit(void)
> +{
> +	genl_unregister_family(&mdio_nl_family);
> +}
> +
> +MODULE_AUTHOR("Tobias Waldekranz <tobias@waldekranz.com>");
> +MODULE_DESCRIPTION("MDIO Netlink Interface");
> +MODULE_LICENSE("GPL");
> +
> +module_init(mdio_nl_init);
> +module_exit(mdio_nl_exit);
> diff --git a/include/uapi/linux/mdio-netlink.h b/include/uapi/linux/mdio-netlink.h
> new file mode 100644
> index 000000000000..bebd3b45c882
> --- /dev/null
> +++ b/include/uapi/linux/mdio-netlink.h
> @@ -0,0 +1,61 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * Copyright (C) 2020-22 Tobias Waldekranz <tobias@waldekranz.com>
> + */
> +
> +#ifndef _UAPI_LINUX_MDIO_NETLINK_H
> +#define _UAPI_LINUX_MDIO_NETLINK_H
> +
> +#include <linux/types.h>
> +
> +enum {
> +	MDIO_GENL_UNSPEC,
> +	MDIO_GENL_XFER,
> +
> +	__MDIO_GENL_MAX,
> +	MDIO_GENL_MAX = __MDIO_GENL_MAX - 1
> +};
> +
> +enum {
> +	MDIO_NLA_UNSPEC,
> +	MDIO_NLA_BUS_ID,  /* string */
> +	MDIO_NLA_TIMEOUT, /* u32 */
> +	MDIO_NLA_PROG,    /* struct mdio_nl_insn[] */
> +	MDIO_NLA_DATA,    /* nest */
> +	MDIO_NLA_ERROR,   /* s32 */
> +
> +	__MDIO_NLA_MAX,
> +	MDIO_NLA_MAX = __MDIO_NLA_MAX - 1
> +};
> +
> +enum mdio_nl_op {
> +	MDIO_NL_OP_UNSPEC,
> +	MDIO_NL_OP_READ,	/* read  dev(RI), port(RI), dst(R) */
> +	MDIO_NL_OP_WRITE,	/* write dev(RI), port(RI), src(RI) */
> +	MDIO_NL_OP_AND,		/* and   a(RI),   b(RI),    dst(R) */
> +	MDIO_NL_OP_OR,		/* or    a(RI),   b(RI),    dst(R) */
> +	MDIO_NL_OP_ADD,		/* add   a(RI),   b(RI),    dst(R) */
> +	MDIO_NL_OP_JEQ,		/* jeq   a(RI),   b(RI),    jmp(I) */
> +	MDIO_NL_OP_JNE,		/* jeq   a(RI),   b(RI),    jmp(I) */
> +	MDIO_NL_OP_EMIT,	/* emit  src(RI) */
> +
> +	__MDIO_NL_OP_MAX,
> +	MDIO_NL_OP_MAX = __MDIO_NL_OP_MAX - 1
> +};
> +
> +enum mdio_nl_argmode {
> +	MDIO_NL_ARG_NONE,
> +	MDIO_NL_ARG_REG,
> +	MDIO_NL_ARG_IMM,
> +	MDIO_NL_ARG_RESERVED
> +};
> +
> +struct mdio_nl_insn {
> +	__u64 op:8;
> +	__u64 reserved:2;
> +	__u64 arg0:18;
> +	__u64 arg1:18;
> +	__u64 arg2:18;
> +};
> +
> +#endif /* _UAPI_LINUX_MDIO_NETLINK_H */
> -- 
> 2.35.1.1320.gc452695387.dirty
Andrew Lunn March 7, 2023, 1:47 p.m. UTC | #6
On Mon, Mar 06, 2023 at 10:48:48PM +0000, Russell King (Oracle) wrote:
> On Mon, Mar 06, 2023 at 03:45:16PM -0500, Sean Anderson wrote:
> > +static int mdio_nl_eval(struct mdio_nl_xfer *xfer)
> > +{
> > +	struct mdio_nl_insn *insn;
> > +	unsigned long timeout;
> > +	u16 regs[8] = { 0 };
> > +	int pc, ret = 0;
> 
> So "pc" is signed.
> 
> > +	int phy_id, reg, prtad, devad, val;
> > +
> > +	timeout = jiffies + msecs_to_jiffies(xfer->timeout_ms);
> > +
> > +	mutex_lock(&xfer->mdio->mdio_lock);
> > +
> > +	for (insn = xfer->prog, pc = 0;
> > +	     pc < xfer->prog_len;
> 
> xfer->prog_len is signed, so this is a signed comparison.
> 
> > +		case MDIO_NL_OP_JEQ:
> > +			if (__arg_ri(insn->arg0, regs) ==
> > +			    __arg_ri(insn->arg1, regs))
> > +				pc += (s16)__arg_i(insn->arg2);
> 
> This adds a signed 16-bit integer to pc, which can make pc negative.
> 
> And so the question becomes... what prevents pc becoming negative
> and then trying to use a negative number as an index?

I don't know ebpf very well, but would it of caught this?  I know the
aim of this is to be simple, but due to its simplicity, we are loosing
out on all the inherent safety of eBPF. Is a eBPF interface all that
complex? I assume you just need to add some way to identify MDIO
busses and kfunc to perform a read on the bus?

       Andrew
Andrew Lunn March 7, 2023, 1:49 p.m. UTC | #7
On Tue, Mar 07, 2023 at 12:23:07PM +0100, Michael Walle wrote:
> > To prevent userspace phy drivers, writes are disabled by default, and can
> > only be enabled by editing the source.
> 
> Maybe we can just taint the kernel using add_taint()? I'm not sure if
> that will prevent vendors writing user space drivers. Thoughts?

I was thinking about taint as well. But keep the same code structure,
you need to edit it to enable write.

    Andrew
Vladimir Oltean March 7, 2023, 2:05 p.m. UTC | #8
On Tue, Mar 07, 2023 at 02:49:27PM +0100, Andrew Lunn wrote:
> On Tue, Mar 07, 2023 at 12:23:07PM +0100, Michael Walle wrote:
> > > To prevent userspace phy drivers, writes are disabled by default, and can
> > > only be enabled by editing the source.
> > 
> > Maybe we can just taint the kernel using add_taint()? I'm not sure if
> > that will prevent vendors writing user space drivers. Thoughts?
> 
> I was thinking about taint as well. But keep the same code structure,
> you need to edit it to enable write.

But as per the commit message, this locks us out of the following
legitimate uses:

- C45-over-C22 (reads)
- Atomic (why only atomic?) (read) access to paged registers

are we ok with the implications?
Andrew Lunn March 7, 2023, 2:22 p.m. UTC | #9
> To prevent userspace phy drivers, writes are disabled by default, and can
> only be enabled by editing the source. This is the same strategy used by
> regmap for debugfs writes. Unfortunately, this disallows several useful
> features, including
> 
> - Register writes (obviously)
> - C45-over-C22

You could add C45-over-C22 as another op.

This tool is dangerous, even in its read only mode, just like the
IOCTL interface. Interrupt status registers are often clear on read,
so you can loose interrupts. Statistics counters are sometimes clear
on read. BMSR link bit is also latching, so a read of it could mean
you miss link events, etc. Adding C45-over-C22 is just as dangerous,
you can mess up MDIO switches which use the registers for other
things, but by deciding to use this tool you have decided to take the
risk of blowing your foot off.

> - Atomic access to paged registers
> - Better MDIO emulation for e.g. QEMU
> 
> However, the read-only interface remains broadly useful for debugging.

I would say it is broadly useful for PHYs. But not Ethernet switches,
when in read only mode. 

> +static int mdio_nl_open(struct mdio_nl_xfer *xfer);
> +static int mdio_nl_close(struct mdio_nl_xfer *xfer, bool last, int xerr);

I guess i never did a proper review of this code before, due to not
liking the concept....

Move the code around so these are not needed, unless there are
functions which are mutually recursive.

> +static inline u16 *__arg_r(u32 arg, u16 *regs)
> +{
> +	WARN_ON_ONCE(arg >> 16 != MDIO_NL_ARG_REG);
> +
> +	return &regs[arg & 0x7];
> +}

No inline functions in C files. Leave the compiler to decide.

> +static int mdio_nl_eval(struct mdio_nl_xfer *xfer)
> +{
> +	struct mdio_nl_insn *insn;
> +	unsigned long timeout;
> +	u16 regs[8] = { 0 };
> +	int pc, ret = 0;
> +	int phy_id, reg, prtad, devad, val;
> +
> +	timeout = jiffies + msecs_to_jiffies(xfer->timeout_ms);
> +
> +	mutex_lock(&xfer->mdio->mdio_lock);

Should timeout be set inside the lock, for when you have two
applications running in parallel and each take a while?

> +
> +	for (insn = xfer->prog, pc = 0;
> +	     pc < xfer->prog_len;
> +	     insn = &xfer->prog[++pc]) {
> +		if (time_after(jiffies, timeout)) {
> +			ret = -ETIMEDOUT;
> +			break;
> +		}
> +
> +		switch ((enum mdio_nl_op)insn->op) {
> +		case MDIO_NL_OP_READ:
> +			phy_id = __arg_ri(insn->arg0, regs);
> +			prtad = mdio_phy_id_prtad(phy_id);
> +			devad = mdio_phy_id_devad(phy_id);
> +			reg = __arg_ri(insn->arg1, regs);
> +
> +			if (mdio_phy_id_is_c45(phy_id))
> +				ret = __mdiobus_c45_read(xfer->mdio, prtad,
> +							 devad, reg);
> +			else
> +				ret = __mdiobus_read(xfer->mdio, phy_id, reg);

The application should say if it want to do C22 or C45. As you said in
the cover note, the ioctl interface is limiting when there is no PHY,
so you are artificially adding the same restriction here. Also, you
might want to do C45 on a C22 PHY, e.g. to access EEE registers. Plus
you could consider adding C45 over C22 here.

> +
> +			if (ret < 0)
> +				goto exit;
> +			*__arg_r(insn->arg2, regs) = ret;
> +			ret = 0;
> +			break;
> +
> +		case MDIO_NL_OP_WRITE:
> +			phy_id = __arg_ri(insn->arg0, regs);
> +			prtad = mdio_phy_id_prtad(phy_id);
> +			devad = mdio_phy_id_devad(phy_id);
> +			reg = __arg_ri(insn->arg1, regs);
> +			val = __arg_ri(insn->arg2, regs);
> +
> +#ifdef MDIO_NETLINK_ALLOW_WRITE
> +			add_taint(TAINT_USER, LOCKDEP_STILL_OK);

I don't know, but maybe taint on read as well.

> +			if (mdio_phy_id_is_c45(phy_id))
> +				ret = __mdiobus_c45_write(xfer->mdio, prtad,
> +							  devad, reg, val
> +			else
> +				ret = __mdiobus_write(xfer->mdio, dev, reg,
> +						      val);
> +#else
> +			ret = -EPERM;

EPERM is odd, EOPNOTSUPP would be better. EPERM suggests you can run
it as root and it should work.

   Andrew
Andrew Lunn March 7, 2023, 2:33 p.m. UTC | #10
> - Atomic (why only atomic?) (read) access to paged registers

I would say 'atomic' is wrong, you cannot access paged registers at
all.

> are we ok with the implications?

I am. Anybody doing this level of debugging should be able to
recompile the kernel to enable write support. It does limit debugging
in field, where maybe you cannot recompile the kernel, but to me, that
is a reasonable trade off.

   Andrew
Russell King (Oracle) March 7, 2023, 2:50 p.m. UTC | #11
On Tue, Mar 07, 2023 at 03:22:46PM +0100, Andrew Lunn wrote:
> > +		switch ((enum mdio_nl_op)insn->op) {
> > +		case MDIO_NL_OP_READ:
> > +			phy_id = __arg_ri(insn->arg0, regs);
> > +			prtad = mdio_phy_id_prtad(phy_id);
> > +			devad = mdio_phy_id_devad(phy_id);
> > +			reg = __arg_ri(insn->arg1, regs);
> > +
> > +			if (mdio_phy_id_is_c45(phy_id))
> > +				ret = __mdiobus_c45_read(xfer->mdio, prtad,
> > +							 devad, reg);
> > +			else
> > +				ret = __mdiobus_read(xfer->mdio, phy_id, reg);
> 
> The application should say if it want to do C22 or C45. As you said in
> the cover note, the ioctl interface is limiting when there is no PHY,
> so you are artificially adding the same restriction here. Also, you
> might want to do C45 on a C22 PHY, e.g. to access EEE registers. Plus
> you could consider adding C45 over C22 here.

Remembering of course that C45-over-C22 on a device that isn't a PHY
could end up causing havoc, but then if you are using this interface,
you already have the gun pointing at your foot... and if you go and
try C45-over-C22 to scan a MDIO bus, you'd definitely be pulling the
trigger too.
Russell King (Oracle) March 7, 2023, 3 p.m. UTC | #12
On Tue, Mar 07, 2023 at 03:33:20PM +0100, Andrew Lunn wrote:
> > - Atomic (why only atomic?) (read) access to paged registers
> 
> I would say 'atomic' is wrong, you cannot access paged registers at
> all.
> 
> > are we ok with the implications?
> 
> I am. Anybody doing this level of debugging should be able to
> recompile the kernel to enable write support. It does limit debugging
> in field, where maybe you cannot recompile the kernel, but to me, that
> is a reasonable trade off.

However, it should be pointed out that disabling write support means
that one can not even read paged registers through this interface.

That leads me to question the whole point of this, because as far as I
can see, the only thing this interface would offer with writes disabled
is the ability to read several non-paged registers consectively while
holding the mdio bus lock. Apart from that, with writes disabled, it
appears to offer nothing over the existing MII ioctls.

With writes enabled, then yes, it offers a slightly better interface
to be able to perform multiple accesses while holding the bus lock.

In that regard, is there any point to having the configuration option
to control whether writes are supported, or is it just better to have
an option to enable/disable the whole interface, and taint the kernel
on *any* use of this interface?
Sean Anderson March 7, 2023, 4:16 p.m. UTC | #13
On 3/7/23 09:22, Andrew Lunn wrote:
>> To prevent userspace phy drivers, writes are disabled by default, and can
>> only be enabled by editing the source. This is the same strategy used by
>> regmap for debugfs writes. Unfortunately, this disallows several useful
>> features, including
>> 
>> - Register writes (obviously)
>> - C45-over-C22
> 
> You could add C45-over-C22 as another op.
> 
> This tool is dangerous, even in its read only mode, just like the
> IOCTL interface. Interrupt status registers are often clear on read,
> so you can loose interrupts. Statistics counters are sometimes clear
> on read. BMSR link bit is also latching, so a read of it could mean
> you miss link events, etc. Adding C45-over-C22 is just as dangerous,
> you can mess up MDIO switches which use the registers for other
> things, but by deciding to use this tool you have decided to take the
> risk of blowing your foot off.

Yes, and I should probably have commented on this in the commit message.
IMO the things you listed are... iffy but unlikely to cause a
malfunction. Tainting on read would be fine, since it is certainly
possible to imagine hardware which would malfunction on read. I suppose
regmap gets away with this by sticking the whole thing in debugfs.

>> - Atomic access to paged registers
>> - Better MDIO emulation for e.g. QEMU
>> 
>> However, the read-only interface remains broadly useful for debugging.
> 
> I would say it is broadly useful for PHYs. But not Ethernet switches,
> when in read only mode. 
> 
>> +static int mdio_nl_open(struct mdio_nl_xfer *xfer);
>> +static int mdio_nl_close(struct mdio_nl_xfer *xfer, bool last, int xerr);
> 
> I guess i never did a proper review of this code before, due to not
> liking the concept....
> 
> Move the code around so these are not needed, unless there are
> functions which are mutually recursive.

They do indeed call each other (although by my analysis no recursion
results). The forward declaration of mdio_nl_open is unnecessary, so I
will rearrange things to avoid that.

>> +static inline u16 *__arg_r(u32 arg, u16 *regs)
>> +{
>> +	WARN_ON_ONCE(arg >> 16 != MDIO_NL_ARG_REG);
>> +
>> +	return &regs[arg & 0x7];
>> +}
> 
> No inline functions in C files. Leave the compiler to decide.

OK

>> +static int mdio_nl_eval(struct mdio_nl_xfer *xfer)
>> +{
>> +	struct mdio_nl_insn *insn;
>> +	unsigned long timeout;
>> +	u16 regs[8] = { 0 };
>> +	int pc, ret = 0;
>> +	int phy_id, reg, prtad, devad, val;
>> +
>> +	timeout = jiffies + msecs_to_jiffies(xfer->timeout_ms);
>> +
>> +	mutex_lock(&xfer->mdio->mdio_lock);
> 
> Should timeout be set inside the lock, for when you have two
> applications running in parallel and each take a while?

That seems reasonable.

>> +
>> +	for (insn = xfer->prog, pc = 0;
>> +	     pc < xfer->prog_len;
>> +	     insn = &xfer->prog[++pc]) {
>> +		if (time_after(jiffies, timeout)) {
>> +			ret = -ETIMEDOUT;
>> +			break;
>> +		}
>> +
>> +		switch ((enum mdio_nl_op)insn->op) {
>> +		case MDIO_NL_OP_READ:
>> +			phy_id = __arg_ri(insn->arg0, regs);
>> +			prtad = mdio_phy_id_prtad(phy_id);
>> +			devad = mdio_phy_id_devad(phy_id);
>> +			reg = __arg_ri(insn->arg1, regs);
>> +
>> +			if (mdio_phy_id_is_c45(phy_id))
>> +				ret = __mdiobus_c45_read(xfer->mdio, prtad,
>> +							 devad, reg);
>> +			else
>> +				ret = __mdiobus_read(xfer->mdio, phy_id, reg);
> 
> The application should say if it want to do C22 or C45.

The phy_id comes from the application. So it sets MDIO_PHY_ID_C45 if it wants
to use C45.

> As you said in
> the cover note, the ioctl interface is limiting when there is no PHY,
> so you are artificially adding the same restriction here.

I don't follow.

> Also, you
> might want to do C45 on a C22 PHY, e.g. to access EEE registers. Plus
> you could consider adding C45 over C22 here.

As Russell noted, this is dangerous in the general case.

>> +
>> +			if (ret < 0)
>> +				goto exit;
>> +			*__arg_r(insn->arg2, regs) = ret;
>> +			ret = 0;
>> +			break;
>> +
>> +		case MDIO_NL_OP_WRITE:
>> +			phy_id = __arg_ri(insn->arg0, regs);
>> +			prtad = mdio_phy_id_prtad(phy_id);
>> +			devad = mdio_phy_id_devad(phy_id);
>> +			reg = __arg_ri(insn->arg1, regs);
>> +			val = __arg_ri(insn->arg2, regs);
>> +
>> +#ifdef MDIO_NETLINK_ALLOW_WRITE
>> +			add_taint(TAINT_USER, LOCKDEP_STILL_OK);
> 
> I don't know, but maybe taint on read as well.
> 
>> +			if (mdio_phy_id_is_c45(phy_id))
>> +				ret = __mdiobus_c45_write(xfer->mdio, prtad,
>> +							  devad, reg, val
>> +			else
>> +				ret = __mdiobus_write(xfer->mdio, dev, reg,
>> +						      val);
>> +#else
>> +			ret = -EPERM;
> 
> EPERM is odd, EOPNOTSUPP would be better. EPERM suggests you can run
> it as root and it should work.

Well, EPERM is what you get when trying to write a 444 file, which is
effectively what we're enforcing here.

--Sean
Sean Anderson March 7, 2023, 4:30 p.m. UTC | #14
On 3/7/23 07:26, Tobias Waldekranz wrote:
> On mån, mar 06, 2023 at 15:45, Sean Anderson <sean.anderson@seco.com> wrote:
>> This adds a netlink interface to make reads/writes to mdio buses. This
>> makes it easier to debug devices. This is especially useful when there
>> is a PCS involved (and the ethtool reads are faked), when there is no
>> MAC associated with a PHY, or when the MDIO device is not a PHY.
>>
>> The closest existing in-kernel interfaces are the SIOCG/SMIIREG ioctls, but
>> they have several drawbacks:
>>
>> 1. "Write register" is not always exactly that. The kernel will try to
>>    be extra helpful and do things behind the scenes if it detects a
>>    write to the reset bit of a PHY for example.
>>
>> 2. Only one op per syscall. This means that is impossible to implement
>>    many operations in a safe manner. Something as simple as a
>>    read/mask/write cycle can race against an in-kernel driver.
>>
>> 3. Addressing is awkward since you don't address the MDIO bus
>>    directly, rather "the MDIO bus to which this netdev's PHY is
>>    connected". This makes it hard to talk to devices on buses to which
>>    no PHYs are connected, the typical case being Ethernet switches.
>>
>> To address these shortcomings, this adds a GENL interface with which a user
>> can interact with an MDIO bus directly.  The user sends a program that
>> mdio-netlink executes, possibly emitting data back to the user. I.e. it
>> implements a very simple VM. Read/mask/write operations could be
>> implemented by dedicated commands, but when you start looking at more
>> advanced things like reading out the VLAN database of a switch you need
>> state and branching.
>>
>> To prevent userspace phy drivers, writes are disabled by default, and can
>> only be enabled by editing the source. This is the same strategy used by
>> regmap for debugfs writes. Unfortunately, this disallows several useful
>> features, including
>>
>> - Register writes (obviously)
>> - C45-over-C22
>> - Atomic access to paged registers
>> - Better MDIO emulation for e.g. QEMU
>>
>> However, the read-only interface remains broadly useful for debugging.
>> Users who want to use the above features can re-enable them by defining
>> MDIO_NETLINK_ALLOW_WRITE and recompiling their kernel.
> 
> What about taking a page from the BPF playbook and require all loaded
> programs (MDIO_GENL_XFERs) to be licensed under GPL?  That would mean
> that the userspace program that generated it would also have to be
> GPLed.
> 
> My view has always been that a vendor looking to build a userspace SDK
> won't be deterred by this limitation.  They can easily build
> mdio-netlink.ko from mdio-tools and use that to drive it, or (more
> likely) they already have their own implementation that they are stuck
> with for legacy reasons.  In other words: we are only punishing
> legitimate users (mdio-tools being one of them, IMO).

Yes, I agree with this. It's always seemed silly to me to exclude a good
debugging interface on the grounds that it could be used for userspace
drivers when a vendor can just as easily supply their own proprietary
module implementing the same thing.

Last time, the discussion seemed to get hung up on this topic, so I wanted
to start off with an approach which would obviously prevent misuse (albeit
rather draconian).

--Sean

> Perhaps with this approach we could have our cake and eat it too.
> 
>> Signed-off-by: Sean Anderson <sean.anderson@seco.com>
>> ---
>> This driver was written by Tobias Waldekranz. It is adapted from the
>> version he released with mdio-tools [1]. This was last discussed 2.5
>> years ago [2], and I have incorperated his cover letter into this commit
>> message. The discussion mainly centered around the write capability
>> allowing for userspace phy drivers. Although it comes with reduced
>> functionality, I hope this new approach satisfies Andrew. I have also
>> made several minor changes for style and to stay abrest of changing
>> APIs.
>>
>> Tobias, I've taken the liberty of adding some copyright notices
>> attributing this to you.
> 
> Fine by me :)
> 
>> [1] https://github.com/wkz/mdio-tools
>> [2] https://lore.kernel.org/netdev/C42DZQLTPHM5.2THDSRK84BI3T@wkz-x280/
>>
>>  drivers/net/mdio/Kconfig          |   8 +
>>  drivers/net/mdio/Makefile         |   1 +
>>  drivers/net/mdio/mdio-netlink.c   | 464 ++++++++++++++++++++++++++++++
>>  include/uapi/linux/mdio-netlink.h |  61 ++++
>>  4 files changed, 534 insertions(+)
>>  create mode 100644 drivers/net/mdio/mdio-netlink.c
>>  create mode 100644 include/uapi/linux/mdio-netlink.h
>>
>> diff --git a/drivers/net/mdio/Kconfig b/drivers/net/mdio/Kconfig
>> index 90309980686e..8a01978e5b51 100644
>> --- a/drivers/net/mdio/Kconfig
>> +++ b/drivers/net/mdio/Kconfig
>> @@ -43,6 +43,14 @@ config ACPI_MDIO
>>  
>>  if MDIO_BUS
>>  
>> +config MDIO_NETLINK
>> +	tristate "Netlink interface for MDIO buses"
>> +	help
>> +	  Enable a netlink interface to allow reading MDIO buses from
>> +	  userspace. A small virtual machine allows implementing complex
>> +	  operations, such as conditional reads or polling. All operations
>> +	  submitted in the same program are evaluated atomically.
>> +
>>  config MDIO_DEVRES
>>  	tristate
>>  
>> diff --git a/drivers/net/mdio/Makefile b/drivers/net/mdio/Makefile
>> index 7d4cb4c11e4e..5583d5b8d174 100644
>> --- a/drivers/net/mdio/Makefile
>> +++ b/drivers/net/mdio/Makefile
>> @@ -4,6 +4,7 @@
>>  obj-$(CONFIG_ACPI_MDIO)		+= acpi_mdio.o
>>  obj-$(CONFIG_FWNODE_MDIO)	+= fwnode_mdio.o
>>  obj-$(CONFIG_OF_MDIO)		+= of_mdio.o
>> +obj-$(CONFIG_MDIO_NETLINK)	+= mdio-netlink.o
>>  
>>  obj-$(CONFIG_MDIO_ASPEED)		+= mdio-aspeed.o
>>  obj-$(CONFIG_MDIO_BCM_IPROC)		+= mdio-bcm-iproc.o
>> diff --git a/drivers/net/mdio/mdio-netlink.c b/drivers/net/mdio/mdio-netlink.c
>> new file mode 100644
>> index 000000000000..3e32d1a9bab3
>> --- /dev/null
>> +++ b/drivers/net/mdio/mdio-netlink.c
>> @@ -0,0 +1,464 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Copyright (C) 2022-23 Sean Anderson <sean.anderson@seco.com>
>> + * Copyright (C) 2020-22 Tobias Waldekranz <tobias@waldekranz.com>
>> + */
>> +
>> +#include <linux/init.h>
>> +#include <linux/kernel.h>
>> +#include <linux/module.h>
>> +#include <linux/netlink.h>
>> +#include <linux/phy.h>
>> +#include <net/genetlink.h>
>> +#include <net/netlink.h>
>> +#include <uapi/linux/mdio-netlink.h>
>> +
>> +struct mdio_nl_xfer {
>> +	struct genl_info *info;
>> +	struct sk_buff *msg;
>> +	void *hdr;
>> +	struct nlattr *data;
>> +
>> +	struct mii_bus *mdio;
>> +	int timeout_ms;
>> +
>> +	int prog_len;
>> +	struct mdio_nl_insn *prog;
>> +};
>> +
>> +static int mdio_nl_open(struct mdio_nl_xfer *xfer);
>> +static int mdio_nl_close(struct mdio_nl_xfer *xfer, bool last, int xerr);
>> +
>> +static int mdio_nl_flush(struct mdio_nl_xfer *xfer)
>> +{
>> +	int err;
>> +
>> +	err = mdio_nl_close(xfer, false, 0);
>> +	if (err)
>> +		return err;
>> +
>> +	return mdio_nl_open(xfer);
>> +}
>> +
>> +static int mdio_nl_emit(struct mdio_nl_xfer *xfer, u32 datum)
>> +{
>> +	int err = 0;
>> +
>> +	if (!nla_put_nohdr(xfer->msg, sizeof(datum), &datum))
>> +		return 0;
>> +
>> +	err = mdio_nl_flush(xfer);
>> +	if (err)
>> +		return err;
>> +
>> +	return nla_put_nohdr(xfer->msg, sizeof(datum), &datum);
>> +}
>> +
>> +static inline u16 *__arg_r(u32 arg, u16 *regs)
>> +{
>> +	WARN_ON_ONCE(arg >> 16 != MDIO_NL_ARG_REG);
>> +
>> +	return &regs[arg & 0x7];
>> +}
>> +
>> +static inline u16 __arg_i(u32 arg)
>> +{
>> +	WARN_ON_ONCE(arg >> 16 != MDIO_NL_ARG_IMM);
>> +
>> +	return arg & 0xffff;
>> +}
>> +
>> +static inline u16 __arg_ri(u32 arg, u16 *regs)
>> +{
>> +	switch ((enum mdio_nl_argmode)(arg >> 16)) {
>> +	case MDIO_NL_ARG_IMM:
>> +		return arg & 0xffff;
>> +	case MDIO_NL_ARG_REG:
>> +		return regs[arg & 7];
>> +	default:
>> +		WARN_ON_ONCE(1);
>> +		return 0;
>> +	}
>> +}
>> +
>> +/* To prevent out-of-tree drivers from being implemented through this
>> + * interface, disallow writes by default. This does disallow read-only uses,
>> + * such as c45-over-c22 or reading phys with pages. However, with a such a
>> + * flexible interface, we must use a big hammer. People who want to use this
>> + * will need to modify the source code directly.
>> + */
>> +#undef MDIO_NETLINK_ALLOW_WRITE
>> +
>> +static int mdio_nl_eval(struct mdio_nl_xfer *xfer)
>> +{
>> +	struct mdio_nl_insn *insn;
>> +	unsigned long timeout;
>> +	u16 regs[8] = { 0 };
>> +	int pc, ret = 0;
>> +	int phy_id, reg, prtad, devad, val;
>> +
>> +	timeout = jiffies + msecs_to_jiffies(xfer->timeout_ms);
>> +
>> +	mutex_lock(&xfer->mdio->mdio_lock);
>> +
>> +	for (insn = xfer->prog, pc = 0;
>> +	     pc < xfer->prog_len;
>> +	     insn = &xfer->prog[++pc]) {
>> +		if (time_after(jiffies, timeout)) {
>> +			ret = -ETIMEDOUT;
>> +			break;
>> +		}
>> +
>> +		switch ((enum mdio_nl_op)insn->op) {
>> +		case MDIO_NL_OP_READ:
>> +			phy_id = __arg_ri(insn->arg0, regs);
>> +			prtad = mdio_phy_id_prtad(phy_id);
>> +			devad = mdio_phy_id_devad(phy_id);
>> +			reg = __arg_ri(insn->arg1, regs);
>> +
>> +			if (mdio_phy_id_is_c45(phy_id))
>> +				ret = __mdiobus_c45_read(xfer->mdio, prtad,
>> +							 devad, reg);
>> +			else
>> +				ret = __mdiobus_read(xfer->mdio, phy_id, reg);
>> +
>> +			if (ret < 0)
>> +				goto exit;
>> +			*__arg_r(insn->arg2, regs) = ret;
>> +			ret = 0;
>> +			break;
>> +
>> +		case MDIO_NL_OP_WRITE:
>> +			phy_id = __arg_ri(insn->arg0, regs);
>> +			prtad = mdio_phy_id_prtad(phy_id);
>> +			devad = mdio_phy_id_devad(phy_id);
>> +			reg = __arg_ri(insn->arg1, regs);
>> +			val = __arg_ri(insn->arg2, regs);
>> +
>> +#ifdef MDIO_NETLINK_ALLOW_WRITE
>> +			add_taint(TAINT_USER, LOCKDEP_STILL_OK);
>> +			if (mdio_phy_id_is_c45(phy_id))
>> +				ret = __mdiobus_c45_write(xfer->mdio, prtad,
>> +							  devad, reg, val
>> +			else
>> +				ret = __mdiobus_write(xfer->mdio, dev, reg,
>> +						      val);
>> +#else
>> +			ret = -EPERM;
>> +#endif
>> +			if (ret < 0)
>> +				goto exit;
>> +			ret = 0;
>> +			break;
>> +
>> +		case MDIO_NL_OP_AND:
>> +			*__arg_r(insn->arg2, regs) =
>> +				__arg_ri(insn->arg0, regs) &
>> +				__arg_ri(insn->arg1, regs);
>> +			break;
>> +
>> +		case MDIO_NL_OP_OR:
>> +			*__arg_r(insn->arg2, regs) =
>> +				__arg_ri(insn->arg0, regs) |
>> +				__arg_ri(insn->arg1, regs);
>> +			break;
>> +
>> +		case MDIO_NL_OP_ADD:
>> +			*__arg_r(insn->arg2, regs) =
>> +				__arg_ri(insn->arg0, regs) +
>> +				__arg_ri(insn->arg1, regs);
>> +			break;
>> +
>> +		case MDIO_NL_OP_JEQ:
>> +			if (__arg_ri(insn->arg0, regs) ==
>> +			    __arg_ri(insn->arg1, regs))
>> +				pc += (s16)__arg_i(insn->arg2);
>> +			break;
>> +
>> +		case MDIO_NL_OP_JNE:
>> +			if (__arg_ri(insn->arg0, regs) !=
>> +			    __arg_ri(insn->arg1, regs))
>> +				pc += (s16)__arg_i(insn->arg2);
>> +			break;
>> +
>> +		case MDIO_NL_OP_EMIT:
>> +			ret = mdio_nl_emit(xfer, __arg_ri(insn->arg0, regs));
>> +			if (ret < 0)
>> +				goto exit;
>> +			ret = 0;
>> +			break;
>> +
>> +		case MDIO_NL_OP_UNSPEC:
>> +		default:
>> +			ret = -EINVAL;
>> +			goto exit;
>> +		}
>> +	}
>> +exit:
>> +	mutex_unlock(&xfer->mdio->mdio_lock);
>> +	return ret;
>> +}
>> +
>> +struct mdio_nl_op_proto {
>> +	u8 arg0;
>> +	u8 arg1;
>> +	u8 arg2;
>> +};
>> +
>> +static const struct mdio_nl_op_proto mdio_nl_op_protos[MDIO_NL_OP_MAX + 1] = {
>> +	[MDIO_NL_OP_READ] = {
>> +		.arg0 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
>> +		.arg1 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
>> +		.arg2 = BIT(MDIO_NL_ARG_REG),
>> +	},
>> +	[MDIO_NL_OP_WRITE] = {
>> +		.arg0 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
>> +		.arg1 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
>> +		.arg2 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
>> +	},
>> +	[MDIO_NL_OP_AND] = {
>> +		.arg0 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
>> +		.arg1 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
>> +		.arg2 = BIT(MDIO_NL_ARG_REG),
>> +	},
>> +	[MDIO_NL_OP_OR] = {
>> +		.arg0 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
>> +		.arg1 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
>> +		.arg2 = BIT(MDIO_NL_ARG_REG),
>> +	},
>> +	[MDIO_NL_OP_ADD] = {
>> +		.arg0 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
>> +		.arg1 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
>> +		.arg2 = BIT(MDIO_NL_ARG_REG),
>> +	},
>> +	[MDIO_NL_OP_JEQ] = {
>> +		.arg0 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
>> +		.arg1 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
>> +		.arg2 = BIT(MDIO_NL_ARG_IMM),
>> +	},
>> +	[MDIO_NL_OP_JNE] = {
>> +		.arg0 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
>> +		.arg1 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
>> +		.arg2 = BIT(MDIO_NL_ARG_IMM),
>> +	},
>> +	[MDIO_NL_OP_EMIT] = {
>> +		.arg0 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
>> +		.arg1 = BIT(MDIO_NL_ARG_NONE),
>> +		.arg2 = BIT(MDIO_NL_ARG_NONE),
>> +	},
>> +};
>> +
>> +static int mdio_nl_validate_insn(const struct nlattr *attr,
>> +				 struct netlink_ext_ack *extack,
>> +				 const struct mdio_nl_insn *insn)
>> +{
>> +	const struct mdio_nl_op_proto *proto;
>> +
>> +	if (insn->op > MDIO_NL_OP_MAX) {
>> +		NL_SET_ERR_MSG_ATTR(extack, attr, "Illegal instruction");
>> +		return -EINVAL;
>> +	}
>> +
>> +	proto = &mdio_nl_op_protos[insn->op];
>> +
>> +	if (!(BIT(insn->arg0 >> 16) & proto->arg0)) {
>> +		NL_SET_ERR_MSG_ATTR(extack, attr, "Argument 0 invalid");
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (!(BIT(insn->arg1 >> 16) & proto->arg1)) {
>> +		NL_SET_ERR_MSG_ATTR(extack, attr, "Argument 1 invalid");
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (!(BIT(insn->arg2 >> 16) & proto->arg2)) {
>> +		NL_SET_ERR_MSG_ATTR(extack, attr, "Argument 2 invalid");
>> +		return -EINVAL;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static int mdio_nl_validate_prog(const struct nlattr *attr,
>> +				 struct netlink_ext_ack *extack)
>> +{
>> +	const struct mdio_nl_insn *prog = nla_data(attr);
>> +	int len = nla_len(attr);
>> +	int i, err = 0;
>> +
>> +	if (len % sizeof(*prog)) {
>> +		NL_SET_ERR_MSG_ATTR(extack, attr, "Unaligned instruction");
>> +		return -EINVAL;
>> +	}
>> +
>> +	len /= sizeof(*prog);
>> +	for (i = 0; i < len; i++) {
>> +		err = mdio_nl_validate_insn(attr, extack, &prog[i]);
>> +		if (err)
>> +			break;
>> +	}
>> +
>> +	return err;
>> +}
>> +
>> +static const struct nla_policy mdio_nl_policy[MDIO_NLA_MAX + 1] = {
>> +	[MDIO_NLA_UNSPEC]  = { .type = NLA_UNSPEC, },
>> +	[MDIO_NLA_BUS_ID]  = { .type = NLA_STRING, .len = MII_BUS_ID_SIZE },
>> +	[MDIO_NLA_TIMEOUT] = NLA_POLICY_MAX(NLA_U16, 10 * MSEC_PER_SEC),
>> +	[MDIO_NLA_PROG]    = NLA_POLICY_VALIDATE_FN(NLA_BINARY,
>> +						    mdio_nl_validate_prog,
>> +						    0x1000),
>> +	[MDIO_NLA_DATA]    = { .type = NLA_NESTED },
>> +	[MDIO_NLA_ERROR]   = { .type = NLA_S32, },
>> +};
>> +
>> +static struct genl_family mdio_nl_family;
>> +
>> +static int mdio_nl_open(struct mdio_nl_xfer *xfer)
>> +{
>> +	int err;
>> +
>> +	xfer->msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
>> +	if (!xfer->msg) {
>> +		err = -ENOMEM;
>> +		goto err;
>> +	}
>> +
>> +	xfer->hdr = genlmsg_put(xfer->msg, xfer->info->snd_portid,
>> +				xfer->info->snd_seq, &mdio_nl_family,
>> +				NLM_F_ACK | NLM_F_MULTI, MDIO_GENL_XFER);
>> +	if (!xfer->hdr) {
>> +		err = -EMSGSIZE;
>> +		goto err_free;
>> +	}
>> +
>> +	xfer->data = nla_nest_start(xfer->msg, MDIO_NLA_DATA);
>> +	if (!xfer->data) {
>> +		err = -EMSGSIZE;
>> +		goto err_free;
>> +	}
>> +
>> +	return 0;
>> +
>> +err_free:
>> +	nlmsg_free(xfer->msg);
>> +err:
>> +	return err;
>> +}
>> +
>> +static int mdio_nl_close(struct mdio_nl_xfer *xfer, bool last, int xerr)
>> +{
>> +	struct nlmsghdr *end;
>> +	int err;
>> +
>> +	nla_nest_end(xfer->msg, xfer->data);
>> +
>> +	if (xerr && nla_put_s32(xfer->msg, MDIO_NLA_ERROR, xerr)) {
>> +		err = mdio_nl_flush(xfer);
>> +		if (err)
>> +			goto err_free;
>> +
>> +		if (nla_put_s32(xfer->msg, MDIO_NLA_ERROR, xerr)) {
>> +			err = -EMSGSIZE;
>> +			goto err_free;
>> +		}
>> +	}
>> +
>> +	genlmsg_end(xfer->msg, xfer->hdr);
>> +
>> +	if (last) {
>> +		end = nlmsg_put(xfer->msg, xfer->info->snd_portid,
>> +				xfer->info->snd_seq, NLMSG_DONE, 0,
>> +				NLM_F_ACK | NLM_F_MULTI);
>> +		if (!end) {
>> +			err = mdio_nl_flush(xfer);
>> +			if (err)
>> +				goto err_free;
>> +
>> +			end = nlmsg_put(xfer->msg, xfer->info->snd_portid,
>> +					xfer->info->snd_seq, NLMSG_DONE, 0,
>> +					NLM_F_ACK | NLM_F_MULTI);
>> +			if (!end) {
>> +				err = -EMSGSIZE;
>> +				goto err_free;
>> +			}
>> +		}
>> +	}
>> +
>> +	return genlmsg_unicast(genl_info_net(xfer->info), xfer->msg,
>> +			       xfer->info->snd_portid);
>> +
>> +err_free:
>> +	nlmsg_free(xfer->msg);
>> +	return err;
>> +}
>> +
>> +static int mdio_nl_cmd_xfer(struct sk_buff *skb, struct genl_info *info)
>> +{
>> +	struct mdio_nl_xfer xfer;
>> +	int err;
>> +
>> +	if (!info->attrs[MDIO_NLA_BUS_ID] ||
>> +	    !info->attrs[MDIO_NLA_PROG]   ||
>> +	     info->attrs[MDIO_NLA_DATA]   ||
>> +	     info->attrs[MDIO_NLA_ERROR])
>> +		return -EINVAL;
>> +
>> +	xfer.mdio = mdio_find_bus(nla_data(info->attrs[MDIO_NLA_BUS_ID]));
>> +	if (!xfer.mdio)
>> +		return -ENODEV;
>> +
>> +	if (info->attrs[MDIO_NLA_TIMEOUT])
>> +		xfer.timeout_ms = nla_get_u32(info->attrs[MDIO_NLA_TIMEOUT]);
>> +	else
>> +		xfer.timeout_ms = 100;
>> +
>> +	xfer.info = info;
>> +	xfer.prog_len = nla_len(info->attrs[MDIO_NLA_PROG]) / sizeof(*xfer.prog);
>> +	xfer.prog = nla_data(info->attrs[MDIO_NLA_PROG]);
>> +
>> +	err = mdio_nl_open(&xfer);
>> +	if (err)
>> +		return err;
>> +
>> +	err = mdio_nl_eval(&xfer);
>> +
>> +	err = mdio_nl_close(&xfer, true, err);
>> +	return err;
>> +}
>> +
>> +static const struct genl_ops mdio_nl_ops[] = {
>> +	{
>> +		.cmd = MDIO_GENL_XFER,
>> +		.doit = mdio_nl_cmd_xfer,
>> +		.flags = GENL_ADMIN_PERM,
>> +	},
>> +};
>> +
>> +static struct genl_family mdio_nl_family = {
>> +	.name     = "mdio",
>> +	.version  = 1,
>> +	.maxattr  = MDIO_NLA_MAX,
>> +	.netnsok  = false,
>> +	.module   = THIS_MODULE,
>> +	.ops      = mdio_nl_ops,
>> +	.n_ops    = ARRAY_SIZE(mdio_nl_ops),
>> +	.policy   = mdio_nl_policy,
>> +};
>> +
>> +static int __init mdio_nl_init(void)
>> +{
>> +	return genl_register_family(&mdio_nl_family);
>> +}
>> +
>> +static void __exit mdio_nl_exit(void)
>> +{
>> +	genl_unregister_family(&mdio_nl_family);
>> +}
>> +
>> +MODULE_AUTHOR("Tobias Waldekranz <tobias@waldekranz.com>");
>> +MODULE_DESCRIPTION("MDIO Netlink Interface");
>> +MODULE_LICENSE("GPL");
>> +
>> +module_init(mdio_nl_init);
>> +module_exit(mdio_nl_exit);
>> diff --git a/include/uapi/linux/mdio-netlink.h b/include/uapi/linux/mdio-netlink.h
>> new file mode 100644
>> index 000000000000..bebd3b45c882
>> --- /dev/null
>> +++ b/include/uapi/linux/mdio-netlink.h
>> @@ -0,0 +1,61 @@
>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>> +/*
>> + * Copyright (C) 2020-22 Tobias Waldekranz <tobias@waldekranz.com>
>> + */
>> +
>> +#ifndef _UAPI_LINUX_MDIO_NETLINK_H
>> +#define _UAPI_LINUX_MDIO_NETLINK_H
>> +
>> +#include <linux/types.h>
>> +
>> +enum {
>> +	MDIO_GENL_UNSPEC,
>> +	MDIO_GENL_XFER,
>> +
>> +	__MDIO_GENL_MAX,
>> +	MDIO_GENL_MAX = __MDIO_GENL_MAX - 1
>> +};
>> +
>> +enum {
>> +	MDIO_NLA_UNSPEC,
>> +	MDIO_NLA_BUS_ID,  /* string */
>> +	MDIO_NLA_TIMEOUT, /* u32 */
>> +	MDIO_NLA_PROG,    /* struct mdio_nl_insn[] */
>> +	MDIO_NLA_DATA,    /* nest */
>> +	MDIO_NLA_ERROR,   /* s32 */
>> +
>> +	__MDIO_NLA_MAX,
>> +	MDIO_NLA_MAX = __MDIO_NLA_MAX - 1
>> +};
>> +
>> +enum mdio_nl_op {
>> +	MDIO_NL_OP_UNSPEC,
>> +	MDIO_NL_OP_READ,	/* read  dev(RI), port(RI), dst(R) */
>> +	MDIO_NL_OP_WRITE,	/* write dev(RI), port(RI), src(RI) */
>> +	MDIO_NL_OP_AND,		/* and   a(RI),   b(RI),    dst(R) */
>> +	MDIO_NL_OP_OR,		/* or    a(RI),   b(RI),    dst(R) */
>> +	MDIO_NL_OP_ADD,		/* add   a(RI),   b(RI),    dst(R) */
>> +	MDIO_NL_OP_JEQ,		/* jeq   a(RI),   b(RI),    jmp(I) */
>> +	MDIO_NL_OP_JNE,		/* jeq   a(RI),   b(RI),    jmp(I) */
>> +	MDIO_NL_OP_EMIT,	/* emit  src(RI) */
>> +
>> +	__MDIO_NL_OP_MAX,
>> +	MDIO_NL_OP_MAX = __MDIO_NL_OP_MAX - 1
>> +};
>> +
>> +enum mdio_nl_argmode {
>> +	MDIO_NL_ARG_NONE,
>> +	MDIO_NL_ARG_REG,
>> +	MDIO_NL_ARG_IMM,
>> +	MDIO_NL_ARG_RESERVED
>> +};
>> +
>> +struct mdio_nl_insn {
>> +	__u64 op:8;
>> +	__u64 reserved:2;
>> +	__u64 arg0:18;
>> +	__u64 arg1:18;
>> +	__u64 arg2:18;
>> +};
>> +
>> +#endif /* _UAPI_LINUX_MDIO_NETLINK_H */
>> -- 
>> 2.35.1.1320.gc452695387.dirty
Sean Anderson March 7, 2023, 4:41 p.m. UTC | #15
On 3/7/23 08:47, Andrew Lunn wrote:
> On Mon, Mar 06, 2023 at 10:48:48PM +0000, Russell King (Oracle) wrote:
>> On Mon, Mar 06, 2023 at 03:45:16PM -0500, Sean Anderson wrote:
>> > +static int mdio_nl_eval(struct mdio_nl_xfer *xfer)
>> > +{
>> > +	struct mdio_nl_insn *insn;
>> > +	unsigned long timeout;
>> > +	u16 regs[8] = { 0 };
>> > +	int pc, ret = 0;
>> 
>> So "pc" is signed.
>> 
>> > +	int phy_id, reg, prtad, devad, val;
>> > +
>> > +	timeout = jiffies + msecs_to_jiffies(xfer->timeout_ms);
>> > +
>> > +	mutex_lock(&xfer->mdio->mdio_lock);
>> > +
>> > +	for (insn = xfer->prog, pc = 0;
>> > +	     pc < xfer->prog_len;
>> 
>> xfer->prog_len is signed, so this is a signed comparison.
>> 
>> > +		case MDIO_NL_OP_JEQ:
>> > +			if (__arg_ri(insn->arg0, regs) ==
>> > +			    __arg_ri(insn->arg1, regs))
>> > +				pc += (s16)__arg_i(insn->arg2);
>> 
>> This adds a signed 16-bit integer to pc, which can make pc negative.
>> 
>> And so the question becomes... what prevents pc becoming negative
>> and then trying to use a negative number as an index?
> 
> I don't know ebpf very well, but would it of caught this?  I know the
> aim of this is to be simple, but due to its simplicity, we are loosing
> out on all the inherent safety of eBPF. Is a eBPF interface all that
> complex? I assume you just need to add some way to identify MDIO
> busses and kfunc to perform a read on the bus?
Regarding eBPF over netlink, the last time this was discussed, Tobias said

> - Why not use BPF?
> 
>   That could absolutely be one way forward, but the GENL approach was
>   easy to build out-of-tree to prove the idea. Its not obvious how it
>   would work though as BPF programs typically run async on some event
>   (probe hit, packet received etc.) whereas this is a single execution
>   on behalf of a user. So to what would the program be attached? The
>   output path is also not straight forward, but it could be done with
>   perf events i suppose.

I'm not familiar enough with eBPF to comment further.

--Sean
Andrew Lunn March 7, 2023, 5:23 p.m. UTC | #16
> Yes, and I should probably have commented on this in the commit message.
> IMO the things you listed are... iffy but unlikely to cause a
> malfunction.

You consider a missed interrupt not a malfunction?

> >> +
> >> +	for (insn = xfer->prog, pc = 0;
> >> +	     pc < xfer->prog_len;
> >> +	     insn = &xfer->prog[++pc]) {
> >> +		if (time_after(jiffies, timeout)) {
> >> +			ret = -ETIMEDOUT;
> >> +			break;
> >> +		}
> >> +
> >> +		switch ((enum mdio_nl_op)insn->op) {
> >> +		case MDIO_NL_OP_READ:
> >> +			phy_id = __arg_ri(insn->arg0, regs);
> >> +			prtad = mdio_phy_id_prtad(phy_id);
> >> +			devad = mdio_phy_id_devad(phy_id);
> >> +			reg = __arg_ri(insn->arg1, regs);
> >> +
> >> +			if (mdio_phy_id_is_c45(phy_id))
> >> +				ret = __mdiobus_c45_read(xfer->mdio, prtad,
> >> +							 devad, reg);
> >> +			else
> >> +				ret = __mdiobus_read(xfer->mdio, phy_id, reg);
> > 
> > The application should say if it want to do C22 or C45.
> 
> The phy_id comes from the application. So it sets MDIO_PHY_ID_C45 if it wants
> to use C45.

Ah, i misunderstood what mdio_phy_id_is_c45() does.

Anyway, i don't like MDIO_PHY_ID_C45, it has been pretty much removed
everywhere with the refactoring of MDIO drivers to export read and
read_c45 etc. PHY drivers also don't use it, they use c22 or c45
specific methods. So i would prefer an additional attribute. That also
opens up the possibility of adding C45 over C22.

> As Russell noted, this is dangerous in the general case.

And Russell also agreed this whole module is dangerous in general.
Once you accept it is dangerous, its a debug tool only, why not allow
playing with a bit more fire? You could at least poke around the MDIO
bus structures and see if a PHY has been found, and it not, block C45
over C22.

> >> +			if (mdio_phy_id_is_c45(phy_id))
> >> +				ret = __mdiobus_c45_write(xfer->mdio, prtad,
> >> +							  devad, reg, val
> >> +			else
> >> +				ret = __mdiobus_write(xfer->mdio, dev, reg,
> >> +						      val);
> >> +#else
> >> +			ret = -EPERM;
> > 
> > EPERM is odd, EOPNOTSUPP would be better. EPERM suggests you can run
> > it as root and it should work.
> 
> Well, EPERM is what you get when trying to write a 444 file, which is
> effectively what we're enforcing here.

Does it change to 644 when write is enabled? But netlink does not even
use file access permissions. I would probably trap this earlier, where
you have a extack instance you can return a meaningful error message
string.

     Andrew
Sean Anderson March 7, 2023, 5:42 p.m. UTC | #17
On 3/7/23 12:23, Andrew Lunn wrote:
>> Yes, and I should probably have commented on this in the commit message.
>> IMO the things you listed are... iffy but unlikely to cause a
>> malfunction.
> 
> You consider a missed interrupt not a malfunction?

Hm, yeah that would probably do it.

>> >> +
>> >> +	for (insn = xfer->prog, pc = 0;
>> >> +	     pc < xfer->prog_len;
>> >> +	     insn = &xfer->prog[++pc]) {
>> >> +		if (time_after(jiffies, timeout)) {
>> >> +			ret = -ETIMEDOUT;
>> >> +			break;
>> >> +		}
>> >> +
>> >> +		switch ((enum mdio_nl_op)insn->op) {
>> >> +		case MDIO_NL_OP_READ:
>> >> +			phy_id = __arg_ri(insn->arg0, regs);
>> >> +			prtad = mdio_phy_id_prtad(phy_id);
>> >> +			devad = mdio_phy_id_devad(phy_id);
>> >> +			reg = __arg_ri(insn->arg1, regs);
>> >> +
>> >> +			if (mdio_phy_id_is_c45(phy_id))
>> >> +				ret = __mdiobus_c45_read(xfer->mdio, prtad,
>> >> +							 devad, reg);
>> >> +			else
>> >> +				ret = __mdiobus_read(xfer->mdio, phy_id, reg);
>> > 
>> > The application should say if it want to do C22 or C45.
>> 
>> The phy_id comes from the application. So it sets MDIO_PHY_ID_C45 if it wants
>> to use C45.
> 
> Ah, i misunderstood what mdio_phy_id_is_c45() does.
> 
> Anyway, i don't like MDIO_PHY_ID_C45, it has been pretty much removed
> everywhere with the refactoring of MDIO drivers to export read and
> read_c45 etc. PHY drivers also don't use it, they use c22 or c45
> specific methods. So i would prefer an additional attribute. That also
> opens up the possibility of adding C45 over C22.

Well, this is really just because there is an existing way to specify c22
and c45 addresses in a u16. We could definitely add a "please do C45 over
C22" flag. That said, I think that sort of thing is handled better by
allowing writes in the general case.

>> As Russell noted, this is dangerous in the general case.
> 
> And Russell also agreed this whole module is dangerous in general.
> Once you accept it is dangerous, its a debug tool only, why not allow
> playing with a bit more fire? You could at least poke around the MDIO
> bus structures and see if a PHY has been found, and it not, block C45
> over C22.

I can look into that.

>> >> +			if (mdio_phy_id_is_c45(phy_id))
>> >> +				ret = __mdiobus_c45_write(xfer->mdio, prtad,
>> >> +							  devad, reg, val
>> >> +			else
>> >> +				ret = __mdiobus_write(xfer->mdio, dev, reg,
>> >> +						      val);
>> >> +#else
>> >> +			ret = -EPERM;
>> > 
>> > EPERM is odd, EOPNOTSUPP would be better. EPERM suggests you can run
>> > it as root and it should work.
>> 
>> Well, EPERM is what you get when trying to write a 444 file, which is
>> effectively what we're enforcing here.
> 
> Does it change to 644 when write is enabled?

Yes. But it is more like 400 and 600.

> But netlink does not even use file access permissions.

Is EPERM reserved only for files?

> I would probably trap this earlier, where you have a extack instance
> you can return a meaningful error message string.

That sounds good.

--Sean
diff mbox series

Patch

diff --git a/drivers/net/mdio/Kconfig b/drivers/net/mdio/Kconfig
index 90309980686e..8a01978e5b51 100644
--- a/drivers/net/mdio/Kconfig
+++ b/drivers/net/mdio/Kconfig
@@ -43,6 +43,14 @@  config ACPI_MDIO
 
 if MDIO_BUS
 
+config MDIO_NETLINK
+	tristate "Netlink interface for MDIO buses"
+	help
+	  Enable a netlink interface to allow reading MDIO buses from
+	  userspace. A small virtual machine allows implementing complex
+	  operations, such as conditional reads or polling. All operations
+	  submitted in the same program are evaluated atomically.
+
 config MDIO_DEVRES
 	tristate
 
diff --git a/drivers/net/mdio/Makefile b/drivers/net/mdio/Makefile
index 7d4cb4c11e4e..5583d5b8d174 100644
--- a/drivers/net/mdio/Makefile
+++ b/drivers/net/mdio/Makefile
@@ -4,6 +4,7 @@ 
 obj-$(CONFIG_ACPI_MDIO)		+= acpi_mdio.o
 obj-$(CONFIG_FWNODE_MDIO)	+= fwnode_mdio.o
 obj-$(CONFIG_OF_MDIO)		+= of_mdio.o
+obj-$(CONFIG_MDIO_NETLINK)	+= mdio-netlink.o
 
 obj-$(CONFIG_MDIO_ASPEED)		+= mdio-aspeed.o
 obj-$(CONFIG_MDIO_BCM_IPROC)		+= mdio-bcm-iproc.o
diff --git a/drivers/net/mdio/mdio-netlink.c b/drivers/net/mdio/mdio-netlink.c
new file mode 100644
index 000000000000..3e32d1a9bab3
--- /dev/null
+++ b/drivers/net/mdio/mdio-netlink.c
@@ -0,0 +1,464 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2022-23 Sean Anderson <sean.anderson@seco.com>
+ * Copyright (C) 2020-22 Tobias Waldekranz <tobias@waldekranz.com>
+ */
+
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/netlink.h>
+#include <linux/phy.h>
+#include <net/genetlink.h>
+#include <net/netlink.h>
+#include <uapi/linux/mdio-netlink.h>
+
+struct mdio_nl_xfer {
+	struct genl_info *info;
+	struct sk_buff *msg;
+	void *hdr;
+	struct nlattr *data;
+
+	struct mii_bus *mdio;
+	int timeout_ms;
+
+	int prog_len;
+	struct mdio_nl_insn *prog;
+};
+
+static int mdio_nl_open(struct mdio_nl_xfer *xfer);
+static int mdio_nl_close(struct mdio_nl_xfer *xfer, bool last, int xerr);
+
+static int mdio_nl_flush(struct mdio_nl_xfer *xfer)
+{
+	int err;
+
+	err = mdio_nl_close(xfer, false, 0);
+	if (err)
+		return err;
+
+	return mdio_nl_open(xfer);
+}
+
+static int mdio_nl_emit(struct mdio_nl_xfer *xfer, u32 datum)
+{
+	int err = 0;
+
+	if (!nla_put_nohdr(xfer->msg, sizeof(datum), &datum))
+		return 0;
+
+	err = mdio_nl_flush(xfer);
+	if (err)
+		return err;
+
+	return nla_put_nohdr(xfer->msg, sizeof(datum), &datum);
+}
+
+static inline u16 *__arg_r(u32 arg, u16 *regs)
+{
+	WARN_ON_ONCE(arg >> 16 != MDIO_NL_ARG_REG);
+
+	return &regs[arg & 0x7];
+}
+
+static inline u16 __arg_i(u32 arg)
+{
+	WARN_ON_ONCE(arg >> 16 != MDIO_NL_ARG_IMM);
+
+	return arg & 0xffff;
+}
+
+static inline u16 __arg_ri(u32 arg, u16 *regs)
+{
+	switch ((enum mdio_nl_argmode)(arg >> 16)) {
+	case MDIO_NL_ARG_IMM:
+		return arg & 0xffff;
+	case MDIO_NL_ARG_REG:
+		return regs[arg & 7];
+	default:
+		WARN_ON_ONCE(1);
+		return 0;
+	}
+}
+
+/* To prevent out-of-tree drivers from being implemented through this
+ * interface, disallow writes by default. This does disallow read-only uses,
+ * such as c45-over-c22 or reading phys with pages. However, with a such a
+ * flexible interface, we must use a big hammer. People who want to use this
+ * will need to modify the source code directly.
+ */
+#undef MDIO_NETLINK_ALLOW_WRITE
+
+static int mdio_nl_eval(struct mdio_nl_xfer *xfer)
+{
+	struct mdio_nl_insn *insn;
+	unsigned long timeout;
+	u16 regs[8] = { 0 };
+	int pc, ret = 0;
+	int phy_id, reg, prtad, devad, val;
+
+	timeout = jiffies + msecs_to_jiffies(xfer->timeout_ms);
+
+	mutex_lock(&xfer->mdio->mdio_lock);
+
+	for (insn = xfer->prog, pc = 0;
+	     pc < xfer->prog_len;
+	     insn = &xfer->prog[++pc]) {
+		if (time_after(jiffies, timeout)) {
+			ret = -ETIMEDOUT;
+			break;
+		}
+
+		switch ((enum mdio_nl_op)insn->op) {
+		case MDIO_NL_OP_READ:
+			phy_id = __arg_ri(insn->arg0, regs);
+			prtad = mdio_phy_id_prtad(phy_id);
+			devad = mdio_phy_id_devad(phy_id);
+			reg = __arg_ri(insn->arg1, regs);
+
+			if (mdio_phy_id_is_c45(phy_id))
+				ret = __mdiobus_c45_read(xfer->mdio, prtad,
+							 devad, reg);
+			else
+				ret = __mdiobus_read(xfer->mdio, phy_id, reg);
+
+			if (ret < 0)
+				goto exit;
+			*__arg_r(insn->arg2, regs) = ret;
+			ret = 0;
+			break;
+
+		case MDIO_NL_OP_WRITE:
+			phy_id = __arg_ri(insn->arg0, regs);
+			prtad = mdio_phy_id_prtad(phy_id);
+			devad = mdio_phy_id_devad(phy_id);
+			reg = __arg_ri(insn->arg1, regs);
+			val = __arg_ri(insn->arg2, regs);
+
+#ifdef MDIO_NETLINK_ALLOW_WRITE
+			add_taint(TAINT_USER, LOCKDEP_STILL_OK);
+			if (mdio_phy_id_is_c45(phy_id))
+				ret = __mdiobus_c45_write(xfer->mdio, prtad,
+							  devad, reg, val
+			else
+				ret = __mdiobus_write(xfer->mdio, dev, reg,
+						      val);
+#else
+			ret = -EPERM;
+#endif
+			if (ret < 0)
+				goto exit;
+			ret = 0;
+			break;
+
+		case MDIO_NL_OP_AND:
+			*__arg_r(insn->arg2, regs) =
+				__arg_ri(insn->arg0, regs) &
+				__arg_ri(insn->arg1, regs);
+			break;
+
+		case MDIO_NL_OP_OR:
+			*__arg_r(insn->arg2, regs) =
+				__arg_ri(insn->arg0, regs) |
+				__arg_ri(insn->arg1, regs);
+			break;
+
+		case MDIO_NL_OP_ADD:
+			*__arg_r(insn->arg2, regs) =
+				__arg_ri(insn->arg0, regs) +
+				__arg_ri(insn->arg1, regs);
+			break;
+
+		case MDIO_NL_OP_JEQ:
+			if (__arg_ri(insn->arg0, regs) ==
+			    __arg_ri(insn->arg1, regs))
+				pc += (s16)__arg_i(insn->arg2);
+			break;
+
+		case MDIO_NL_OP_JNE:
+			if (__arg_ri(insn->arg0, regs) !=
+			    __arg_ri(insn->arg1, regs))
+				pc += (s16)__arg_i(insn->arg2);
+			break;
+
+		case MDIO_NL_OP_EMIT:
+			ret = mdio_nl_emit(xfer, __arg_ri(insn->arg0, regs));
+			if (ret < 0)
+				goto exit;
+			ret = 0;
+			break;
+
+		case MDIO_NL_OP_UNSPEC:
+		default:
+			ret = -EINVAL;
+			goto exit;
+		}
+	}
+exit:
+	mutex_unlock(&xfer->mdio->mdio_lock);
+	return ret;
+}
+
+struct mdio_nl_op_proto {
+	u8 arg0;
+	u8 arg1;
+	u8 arg2;
+};
+
+static const struct mdio_nl_op_proto mdio_nl_op_protos[MDIO_NL_OP_MAX + 1] = {
+	[MDIO_NL_OP_READ] = {
+		.arg0 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
+		.arg1 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
+		.arg2 = BIT(MDIO_NL_ARG_REG),
+	},
+	[MDIO_NL_OP_WRITE] = {
+		.arg0 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
+		.arg1 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
+		.arg2 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
+	},
+	[MDIO_NL_OP_AND] = {
+		.arg0 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
+		.arg1 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
+		.arg2 = BIT(MDIO_NL_ARG_REG),
+	},
+	[MDIO_NL_OP_OR] = {
+		.arg0 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
+		.arg1 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
+		.arg2 = BIT(MDIO_NL_ARG_REG),
+	},
+	[MDIO_NL_OP_ADD] = {
+		.arg0 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
+		.arg1 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
+		.arg2 = BIT(MDIO_NL_ARG_REG),
+	},
+	[MDIO_NL_OP_JEQ] = {
+		.arg0 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
+		.arg1 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
+		.arg2 = BIT(MDIO_NL_ARG_IMM),
+	},
+	[MDIO_NL_OP_JNE] = {
+		.arg0 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
+		.arg1 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
+		.arg2 = BIT(MDIO_NL_ARG_IMM),
+	},
+	[MDIO_NL_OP_EMIT] = {
+		.arg0 = BIT(MDIO_NL_ARG_REG) | BIT(MDIO_NL_ARG_IMM),
+		.arg1 = BIT(MDIO_NL_ARG_NONE),
+		.arg2 = BIT(MDIO_NL_ARG_NONE),
+	},
+};
+
+static int mdio_nl_validate_insn(const struct nlattr *attr,
+				 struct netlink_ext_ack *extack,
+				 const struct mdio_nl_insn *insn)
+{
+	const struct mdio_nl_op_proto *proto;
+
+	if (insn->op > MDIO_NL_OP_MAX) {
+		NL_SET_ERR_MSG_ATTR(extack, attr, "Illegal instruction");
+		return -EINVAL;
+	}
+
+	proto = &mdio_nl_op_protos[insn->op];
+
+	if (!(BIT(insn->arg0 >> 16) & proto->arg0)) {
+		NL_SET_ERR_MSG_ATTR(extack, attr, "Argument 0 invalid");
+		return -EINVAL;
+	}
+
+	if (!(BIT(insn->arg1 >> 16) & proto->arg1)) {
+		NL_SET_ERR_MSG_ATTR(extack, attr, "Argument 1 invalid");
+		return -EINVAL;
+	}
+
+	if (!(BIT(insn->arg2 >> 16) & proto->arg2)) {
+		NL_SET_ERR_MSG_ATTR(extack, attr, "Argument 2 invalid");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int mdio_nl_validate_prog(const struct nlattr *attr,
+				 struct netlink_ext_ack *extack)
+{
+	const struct mdio_nl_insn *prog = nla_data(attr);
+	int len = nla_len(attr);
+	int i, err = 0;
+
+	if (len % sizeof(*prog)) {
+		NL_SET_ERR_MSG_ATTR(extack, attr, "Unaligned instruction");
+		return -EINVAL;
+	}
+
+	len /= sizeof(*prog);
+	for (i = 0; i < len; i++) {
+		err = mdio_nl_validate_insn(attr, extack, &prog[i]);
+		if (err)
+			break;
+	}
+
+	return err;
+}
+
+static const struct nla_policy mdio_nl_policy[MDIO_NLA_MAX + 1] = {
+	[MDIO_NLA_UNSPEC]  = { .type = NLA_UNSPEC, },
+	[MDIO_NLA_BUS_ID]  = { .type = NLA_STRING, .len = MII_BUS_ID_SIZE },
+	[MDIO_NLA_TIMEOUT] = NLA_POLICY_MAX(NLA_U16, 10 * MSEC_PER_SEC),
+	[MDIO_NLA_PROG]    = NLA_POLICY_VALIDATE_FN(NLA_BINARY,
+						    mdio_nl_validate_prog,
+						    0x1000),
+	[MDIO_NLA_DATA]    = { .type = NLA_NESTED },
+	[MDIO_NLA_ERROR]   = { .type = NLA_S32, },
+};
+
+static struct genl_family mdio_nl_family;
+
+static int mdio_nl_open(struct mdio_nl_xfer *xfer)
+{
+	int err;
+
+	xfer->msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!xfer->msg) {
+		err = -ENOMEM;
+		goto err;
+	}
+
+	xfer->hdr = genlmsg_put(xfer->msg, xfer->info->snd_portid,
+				xfer->info->snd_seq, &mdio_nl_family,
+				NLM_F_ACK | NLM_F_MULTI, MDIO_GENL_XFER);
+	if (!xfer->hdr) {
+		err = -EMSGSIZE;
+		goto err_free;
+	}
+
+	xfer->data = nla_nest_start(xfer->msg, MDIO_NLA_DATA);
+	if (!xfer->data) {
+		err = -EMSGSIZE;
+		goto err_free;
+	}
+
+	return 0;
+
+err_free:
+	nlmsg_free(xfer->msg);
+err:
+	return err;
+}
+
+static int mdio_nl_close(struct mdio_nl_xfer *xfer, bool last, int xerr)
+{
+	struct nlmsghdr *end;
+	int err;
+
+	nla_nest_end(xfer->msg, xfer->data);
+
+	if (xerr && nla_put_s32(xfer->msg, MDIO_NLA_ERROR, xerr)) {
+		err = mdio_nl_flush(xfer);
+		if (err)
+			goto err_free;
+
+		if (nla_put_s32(xfer->msg, MDIO_NLA_ERROR, xerr)) {
+			err = -EMSGSIZE;
+			goto err_free;
+		}
+	}
+
+	genlmsg_end(xfer->msg, xfer->hdr);
+
+	if (last) {
+		end = nlmsg_put(xfer->msg, xfer->info->snd_portid,
+				xfer->info->snd_seq, NLMSG_DONE, 0,
+				NLM_F_ACK | NLM_F_MULTI);
+		if (!end) {
+			err = mdio_nl_flush(xfer);
+			if (err)
+				goto err_free;
+
+			end = nlmsg_put(xfer->msg, xfer->info->snd_portid,
+					xfer->info->snd_seq, NLMSG_DONE, 0,
+					NLM_F_ACK | NLM_F_MULTI);
+			if (!end) {
+				err = -EMSGSIZE;
+				goto err_free;
+			}
+		}
+	}
+
+	return genlmsg_unicast(genl_info_net(xfer->info), xfer->msg,
+			       xfer->info->snd_portid);
+
+err_free:
+	nlmsg_free(xfer->msg);
+	return err;
+}
+
+static int mdio_nl_cmd_xfer(struct sk_buff *skb, struct genl_info *info)
+{
+	struct mdio_nl_xfer xfer;
+	int err;
+
+	if (!info->attrs[MDIO_NLA_BUS_ID] ||
+	    !info->attrs[MDIO_NLA_PROG]   ||
+	     info->attrs[MDIO_NLA_DATA]   ||
+	     info->attrs[MDIO_NLA_ERROR])
+		return -EINVAL;
+
+	xfer.mdio = mdio_find_bus(nla_data(info->attrs[MDIO_NLA_BUS_ID]));
+	if (!xfer.mdio)
+		return -ENODEV;
+
+	if (info->attrs[MDIO_NLA_TIMEOUT])
+		xfer.timeout_ms = nla_get_u32(info->attrs[MDIO_NLA_TIMEOUT]);
+	else
+		xfer.timeout_ms = 100;
+
+	xfer.info = info;
+	xfer.prog_len = nla_len(info->attrs[MDIO_NLA_PROG]) / sizeof(*xfer.prog);
+	xfer.prog = nla_data(info->attrs[MDIO_NLA_PROG]);
+
+	err = mdio_nl_open(&xfer);
+	if (err)
+		return err;
+
+	err = mdio_nl_eval(&xfer);
+
+	err = mdio_nl_close(&xfer, true, err);
+	return err;
+}
+
+static const struct genl_ops mdio_nl_ops[] = {
+	{
+		.cmd = MDIO_GENL_XFER,
+		.doit = mdio_nl_cmd_xfer,
+		.flags = GENL_ADMIN_PERM,
+	},
+};
+
+static struct genl_family mdio_nl_family = {
+	.name     = "mdio",
+	.version  = 1,
+	.maxattr  = MDIO_NLA_MAX,
+	.netnsok  = false,
+	.module   = THIS_MODULE,
+	.ops      = mdio_nl_ops,
+	.n_ops    = ARRAY_SIZE(mdio_nl_ops),
+	.policy   = mdio_nl_policy,
+};
+
+static int __init mdio_nl_init(void)
+{
+	return genl_register_family(&mdio_nl_family);
+}
+
+static void __exit mdio_nl_exit(void)
+{
+	genl_unregister_family(&mdio_nl_family);
+}
+
+MODULE_AUTHOR("Tobias Waldekranz <tobias@waldekranz.com>");
+MODULE_DESCRIPTION("MDIO Netlink Interface");
+MODULE_LICENSE("GPL");
+
+module_init(mdio_nl_init);
+module_exit(mdio_nl_exit);
diff --git a/include/uapi/linux/mdio-netlink.h b/include/uapi/linux/mdio-netlink.h
new file mode 100644
index 000000000000..bebd3b45c882
--- /dev/null
+++ b/include/uapi/linux/mdio-netlink.h
@@ -0,0 +1,61 @@ 
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Copyright (C) 2020-22 Tobias Waldekranz <tobias@waldekranz.com>
+ */
+
+#ifndef _UAPI_LINUX_MDIO_NETLINK_H
+#define _UAPI_LINUX_MDIO_NETLINK_H
+
+#include <linux/types.h>
+
+enum {
+	MDIO_GENL_UNSPEC,
+	MDIO_GENL_XFER,
+
+	__MDIO_GENL_MAX,
+	MDIO_GENL_MAX = __MDIO_GENL_MAX - 1
+};
+
+enum {
+	MDIO_NLA_UNSPEC,
+	MDIO_NLA_BUS_ID,  /* string */
+	MDIO_NLA_TIMEOUT, /* u32 */
+	MDIO_NLA_PROG,    /* struct mdio_nl_insn[] */
+	MDIO_NLA_DATA,    /* nest */
+	MDIO_NLA_ERROR,   /* s32 */
+
+	__MDIO_NLA_MAX,
+	MDIO_NLA_MAX = __MDIO_NLA_MAX - 1
+};
+
+enum mdio_nl_op {
+	MDIO_NL_OP_UNSPEC,
+	MDIO_NL_OP_READ,	/* read  dev(RI), port(RI), dst(R) */
+	MDIO_NL_OP_WRITE,	/* write dev(RI), port(RI), src(RI) */
+	MDIO_NL_OP_AND,		/* and   a(RI),   b(RI),    dst(R) */
+	MDIO_NL_OP_OR,		/* or    a(RI),   b(RI),    dst(R) */
+	MDIO_NL_OP_ADD,		/* add   a(RI),   b(RI),    dst(R) */
+	MDIO_NL_OP_JEQ,		/* jeq   a(RI),   b(RI),    jmp(I) */
+	MDIO_NL_OP_JNE,		/* jeq   a(RI),   b(RI),    jmp(I) */
+	MDIO_NL_OP_EMIT,	/* emit  src(RI) */
+
+	__MDIO_NL_OP_MAX,
+	MDIO_NL_OP_MAX = __MDIO_NL_OP_MAX - 1
+};
+
+enum mdio_nl_argmode {
+	MDIO_NL_ARG_NONE,
+	MDIO_NL_ARG_REG,
+	MDIO_NL_ARG_IMM,
+	MDIO_NL_ARG_RESERVED
+};
+
+struct mdio_nl_insn {
+	__u64 op:8;
+	__u64 reserved:2;
+	__u64 arg0:18;
+	__u64 arg1:18;
+	__u64 arg2:18;
+};
+
+#endif /* _UAPI_LINUX_MDIO_NETLINK_H */