[net-next,0/5] Move devlink_register to be near devlink_reload_enable

Message ID: cover.1628599239.git.leonro@nvidia.com (mailing list archive)
Return-Path: <netdev-owner@kernel.org>
From: Leon Romanovsky <leon@kernel.org>
To: "David S . Miller" <davem@davemloft.net>, Jakub Kicinski <kuba@kernel.org>
Cc: Leon Romanovsky <leonro@nvidia.com>, Guangbin Huang <huangguangbin2@huawei.com>, Ido Schimmel <idosch@nvidia.com>, Jiri Pirko <jiri@nvidia.com>, linux-kernel@vger.kernel.org, Michael Guralnik <michaelgur@mellanox.com>, netdev@vger.kernel.org, Saeed Mahameed <saeedm@nvidia.com>, Salil Mehta <salil.mehta@huawei.com>, Tariq Toukan <tariqt@nvidia.com>, Yisen Zhuang <yisen.zhuang@huawei.com>, Yufeng Mo <moyufeng@huawei.com>
Subject: [PATCH net-next 0/5] Move devlink_register to be near devlink_reload_enable
Date: Tue, 10 Aug 2021 16:37:30 +0300
Message-Id: <cover.1628599239.git.leonro@nvidia.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Leon Romanovsky Aug. 10, 2021, 1:37 p.m. UTC

From: Leon Romanovsky <leonro@nvidia.com>

Hi Dave and Jakub,

This series prepares the code for removing the devlink_reload_enable/_disable
API. In order to do so, we move all devlink_register() calls to be right
before devlink_reload_enable().

The best place for such a call should be right before exiting from
the probe().

This is done because devlink_register() opens the devlink netlink interface
to users and gives them a way to issue commands before initialization
is finished.

1. Some drivers were aware of such "functionality" and tried to protect
themselves with extra locks, state machines and devlink_reload_enable().
Let's assume that it worked for them, but I'm personally skeptical about
it.

2. Some drivers copied that pattern, but without locks and state
machines. That protected them from reload flows, but not from any _set_
routines.

3. And all other drivers simply didn't understand the implications of early
devlink_register() and can be seen as "broken".

In this series, we focus on items #1 and #2.

Please share your opinion on whether I should change ALL other drivers to
make sure that devlink_register() is the last command, or leave them in an
as-is state.

Thanks

Leon Romanovsky (5):
  net: hns3: remove always exist devlink pointer check
  net/mlx4: Move devlink_register to be the last initialization command
  mlxsw: core: Refactor code to publish devlink ops when device is ready
  net/mlx5: Accept devlink user input after driver initialization
    complete
  netdevsim: Delay user access till probe is finished

 .../hisilicon/hns3/hns3pf/hclge_devlink.c     |  8 +---
 .../hisilicon/hns3/hns3vf/hclgevf_devlink.c   |  8 +---
 drivers/net/ethernet/mellanox/mlx4/main.c     | 38 +++++++++++++------
 .../net/ethernet/mellanox/mlx5/core/devlink.c | 10 +----
 .../net/ethernet/mellanox/mlx5/core/main.c    | 13 ++++++-
 .../mellanox/mlx5/core/sf/dev/driver.c        | 12 +++++-
 drivers/net/ethernet/mellanox/mlxsw/core.c    | 27 +++++++------
 drivers/net/netdevsim/dev.c                   | 19 +++++-----
 8 files changed, 76 insertions(+), 59 deletions(-)
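
The ordering argument in this cover letter can be illustrated with a small user-space model. This is only a sketch, not kernel code: every `model_` name is invented for the illustration, `visible` stands in for "devlink_register() has been called", and `hw_ready` stands in for "the driver has finished its probe() work". A user command is assumed to reach the driver only when the instance is visible.

```c
#include <stdbool.h>

/* User-space model only; none of these names are kernel APIs. */
struct model_dev {
	bool visible;     /* stands in for "devlink_register() was called" */
	bool hw_ready;    /* stands in for "driver initialization finished" */
	int unsafe_hits;  /* user commands that reached a half-initialized device */
};

/* A user command: it reaches the driver only if the instance is visible. */
void model_user_cmd(struct model_dev *d)
{
	if (!d->visible)
		return;              /* lookup fails, the driver is never called */
	if (!d->hw_ready)
		d->unsafe_hits++;    /* driver callback ran in the middle of probe */
}

/* Old ordering: publish first, then finish initialization. */
void model_probe_register_first(struct model_dev *d)
{
	d->visible = true;
	model_user_cmd(d);       /* a command racing with probe */
	d->hw_ready = true;
}

/* Ordering proposed in the series: publish as the last step of probe. */
void model_probe_register_last(struct model_dev *d)
{
	model_user_cmd(d);       /* same racing command, instance not visible yet */
	d->hw_ready = true;
	d->visible = true;
}
```

With a zero-initialized `struct model_dev`, the register-first probe records one command that hit a half-initialized device, while the register-last probe records none. The model deliberately ignores locking, which is the part the cover letter argues cannot be verified for every driver.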

Jakub Kicinski Aug. 10, 2021, 11:53 p.m. UTC | #1

On Tue, 10 Aug 2021 16:37:30 +0300 Leon Romanovsky wrote:
> This series prepares code to remove devlink_reload_enable/_disable API
> and in order to do, we move all devlink_register() calls to be right
> before devlink_reload_enable().
> 
> The best place for such a call should be right before exiting from
> the probe().
> 
> This is done because devlink_register() opens devlink netlink to the
> users and gives them a venue to issue commands before initialization
> is finished.
> 
> 1. Some drivers were aware of such "functionality" and tried to protect
> themselves with extra locks, state machines and devlink_reload_enable().
> Let's assume that it worked for them, but I'm personally skeptical about
> it.
> 
> 2. Some drivers copied that pattern, but without locks and state
> machines. That protected them from reload flows, but not from any _set_
> routines.
> 
> 3. And all other drivers simply didn't understand the implications of early
> devlink_register() and can be seen as "broken".

What are those implications for drivers which don't implement reload?
Depending on which parts of devlink the drivers implement there may well
be nothing to worry about.

Plus devlink instances start out with reload disabled. Could you please
take a step back and explain why these changes are needed.

> In this series, we focus on items #1 and #2.
> 
> Please share your opinion if I should change ALL other drivers to make
> sure that devlink_register() is the last command or leave them in an
> as-is state.

Can you please share the output of devlink monitor and ip monitor link
before and after?  The modified drivers will not register ports before
they register the devlink instance itself.

Leon Romanovsky Aug. 11, 2021, 6:10 a.m. UTC | #2

On Tue, Aug 10, 2021 at 04:53:18PM -0700, Jakub Kicinski wrote:
> On Tue, 10 Aug 2021 16:37:30 +0300 Leon Romanovsky wrote:
> > This series prepares code to remove devlink_reload_enable/_disable API
> > and in order to do, we move all devlink_register() calls to be right
> > before devlink_reload_enable().
> > 
> > The best place for such a call should be right before exiting from
> > the probe().
> > 
> > This is done because devlink_register() opens devlink netlink to the
> > users and gives them a venue to issue commands before initialization
> > is finished.
> > 
> > 1. Some drivers were aware of such "functionality" and tried to protect
> > themselves with extra locks, state machines and devlink_reload_enable().
> > Let's assume that it worked for them, but I'm personally skeptical about
> > it.
> > 
> > 2. Some drivers copied that pattern, but without locks and state
> > machines. That protected them from reload flows, but not from any _set_
> > routines.
> > 
> > 3. And all other drivers simply didn't understand the implications of early
> > devlink_register() and can be seen as "broken".
> 
> What are those implications for drivers which don't implement reload?
> Depending on which parts of devlink the drivers implement there may well
> be nothing to worry about.
> 
> Plus devlink instances start out with reload disabled. Could you please
> take a step back and explain why these changes are needed.

The problem is that devlink_register() adds the new devlink instance to the
list of visible devlinks (devlink_list). It means that all devlink_*_dumpit()
callbacks will try to access devices during their initialization, before they
are ready.

The more troublesome case is that devlink_list is iterated in
devlink_get_from_attrs(), which is used in devlink_nl_pre_doit(). The
latter function will tell its caller that the new devlink instance is valid,
and that caller will be able to proceed to the *_set_doit() functions.

Just as an example:
 * user sends netlink message
  * devlink_nl_cmd_eswitch_set_doit()
   * ops->eswitch_mode_set()
    * Are you sure that all drivers are protected here?
      Remember that the driver is in the middle of its probe().

Someone can argue that drivers and devlink are protected from anything
harmful by their global (devlink_mutex and devlink->lock) and internal
(device->lock, etc.) locks. However, that is impossible to prove for all
drivers and is prone to errors.

Reload enable/disable gives the false impression that the problem exists in
that flow only, which is not true.

devlink_reload_enable() is duct tape, applied because the reload flows are
much easier to hit.
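
The call path described above can be sketched as a user-space model. It is an illustration under stated assumptions, not the devlink implementation: all `model_` names are hypothetical, the list stands in for devlink_list, and the lookup and handler stand in for devlink_get_from_attrs() and a *_set_doit() function. The point it shows is that once an instance is on the list, nothing in the handler knows whether probe() has finished.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* User-space model; names prefixed model_ are hypothetical, not kernel APIs. */
struct model_ops {
	int (*eswitch_mode_set)(void *priv, int mode);
};

struct model_devlink {
	const char *name;
	const struct model_ops *ops;
	void *priv;
	struct model_devlink *next;
};

/* Stand-in for the global list of visible instances. */
struct model_devlink *model_list;

/* Stand-in for registration: the instance becomes reachable by name. */
void model_register(struct model_devlink *dl)
{
	dl->next = model_list;
	model_list = dl;
}

/* Stand-in for the lookup done before every "doit" handler. */
struct model_devlink *model_get_from_attrs(const char *name)
{
	struct model_devlink *dl;

	for (dl = model_list; dl; dl = dl->next)
		if (strcmp(dl->name, name) == 0)
			return dl;
	return NULL;
}

/* Stand-in for a set handler: it cannot tell whether probe has finished. */
int model_eswitch_set_doit(const char *name, int mode)
{
	struct model_devlink *dl = model_get_from_attrs(name);

	if (!dl)
		return -ENODEV;
	if (!dl->ops || !dl->ops->eswitch_mode_set)
		return -EOPNOTSUPP;
	return dl->ops->eswitch_mode_set(dl->priv, mode);
}

/* Example driver callback: records the mode it was asked to set. */
int model_drv_mode_set(void *priv, int mode)
{
	*(int *)priv = mode;
	return 0;
}
```

Before `model_register()` is called, `model_eswitch_set_doit()` returns `-ENODEV` and the driver callback is never entered; afterwards the callback runs unconditionally. Registering as the last step of probe is therefore the only gate this model offers.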

> 
> > In this series, we focus on items #1 and #2.
> > 
> > Please share your opinion if I should change ALL other drivers to make
> > sure that devlink_register() is the last command or leave them in an
> > as-is state.
> 
> Can you please share the output of devlink monitor and ip monitor link
> before and after?  The modified drivers will not register ports before
> they register the devlink instance itself.

Not really, they will register but won't be accessible from the user space.
The only difference is the location of "[dev,new] ..." notification.

[leonro@vm ~]$ sudo modprobe mlx5_core
[  105.575790] mlx5_core 0000:00:09.0: firmware version: 4.8.9999
[  105.576349] mlx5_core 0000:00:09.0: 0.000 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x255 link)
[  105.686217] pps pps0: new PPS source ptp0
[  105.688144] mlx5_core 0000:00:09.0: E-Switch: Total vports 2, per vport: max uc(32768) max mc(32768)
[  105.717736] mlx5_core 0000:00:09.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0)
[  106.957028] mlx5_core 0000:00:09.0 eth1: Link down
[  106.960379] mlx5_core 0000:00:09.0 eth1: Link up
[  106.967916] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
================================================================================================
Before:
[leonro@vm ~]$ sudo devlink monitor
[dev,new] pci/0000:00:09.0
[param,new] pci/0000:00:09.0: name flow_steering_mode type driver-specific
  values:
[param,new] pci/0000:00:09.0: name esw_port_metadata type driver-specific
  values:
[param,new] pci/0000:00:09.0: name enable_remote_dev_reset type generic
  values:
[param,new] pci/0000:00:09.0: name enable_roce type generic
  values:
    cmode driverinit value true
[param,new] pci/0000:00:09.0: name fdb_large_groups type driver-specific
  values:
    cmode driverinit value 15
[param,new] pci/0000:00:09.0: name flow_steering_mode type driver-specific
  values:
    cmode runtime value dmfs
[param,new] pci/0000:00:09.0: name enable_roce type generic
  values:
    cmode driverinit value true
[param,new] pci/0000:00:09.0: name fdb_large_groups type driver-specific
  values:
    cmode driverinit value 15
[param,new] pci/0000:00:09.0: name esw_port_metadata type driver-specific
  values:
    cmode runtime value true
[param,new] pci/0000:00:09.0: name enable_remote_dev_reset type generic
  values:
    cmode runtime value true
[trap-group,new] pci/0000:00:09.0: name l2_drops generic true
[trap,new] pci/0000:00:09.0: name ingress_vlan_filter type drop generic true action drop group l2_drops
[trap,new] pci/0000:00:09.0: name dmac_filter type drop generic true action drop group l2_drops
[port,new] pci/0000:00:09.0/131071: type notset flavour physical port 0 splittable false
[port,new] pci/0000:00:09.0/131071: type eth netdev eth1 flavour physical port 0 splittable false

[leonro@vm ~]$ sudo ip monitor
inet eth1 forwarding off rp_filter loose mc_forwarding off proxy_neigh off ignore_routes_with_linkdown off 
inet6 eth1 forwarding off mc_forwarding off proxy_neigh off ignore_routes_with_linkdown off 
4: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default 
    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
4: eth1: <NO-CARRIER,BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state DOWN group default 
    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
multicast ff00::/8 dev eth1 table local proto kernel metric 256 pref medium
fe80::/64 dev eth1 proto kernel metric 256 pref medium
4: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default 
    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
4: eth1    inet6 fe80::5054:ff:fe12:3456/64 scope link 
       valid_lft forever preferred_lft forever
local fe80::5054:ff:fe12:3456 dev eth1 table local proto kernel metric 0 pref medium

===========================================================================================================
After:
[leonro@vm ~]$ sudo devlink monitor
[param,new] pci/0000:00:09.0: name flow_steering_mode type driver-specific
  values:
[param,new] pci/0000:00:09.0: name esw_port_metadata type driver-specific
  values:
[param,new] pci/0000:00:09.0: name enable_remote_dev_reset type generic
  values:
[param,new] pci/0000:00:09.0: name enable_roce type generic
  values:
    cmode driverinit value true
[param,new] pci/0000:00:09.0: name fdb_large_groups type driver-specific
  values:
    cmode driverinit value 15
[param,new] pci/0000:00:09.0: name flow_steering_mode type driver-specific
  values:
    cmode runtime value dmfs
[param,new] pci/0000:00:09.0: name enable_roce type generic
  values:
    cmode driverinit value true
[param,new] pci/0000:00:09.0: name fdb_large_groups type driver-specific
  values:
    cmode driverinit value 15
[param,new] pci/0000:00:09.0: name esw_port_metadata type driver-specific
  values:
    cmode runtime value true
[param,new] pci/0000:00:09.0: name enable_remote_dev_reset type generic
  values:
    cmode runtime value true
[trap-group,new] pci/0000:00:09.0: name l2_drops generic true
[trap,new] pci/0000:00:09.0: name ingress_vlan_filter type drop generic true action drop group l2_drops
[trap,new] pci/0000:00:09.0: name dmac_filter type drop generic true action drop group l2_drops
[dev,new] pci/0000:00:09.0
[port,new] pci/0000:00:09.0/131071: type notset flavour physical port 0 splittable false
[port,new] pci/0000:00:09.0/131071: type eth netdev eth1 flavour physical port 0 splittable false

[leonro@vm ~]$ sudo ip monitor
inet eth1 forwarding off rp_filter loose mc_forwarding off proxy_neigh off ignore_routes_with_linkdown off 
inet6 eth1 forwarding off mc_forwarding off proxy_neigh off ignore_routes_with_linkdown off 
4: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default 
    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
4: eth1: <NO-CARRIER,BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state DOWN group default 
    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
multicast ff00::/8 dev eth1 table local proto kernel metric 256 pref medium
fe80::/64 dev eth1 proto kernel metric 256 pref medium
4: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default 
    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
4: eth1    inet6 fe80::5054:ff:fe12:3456/64 scope link 
       valid_lft forever preferred_lft forever
local fe80::5054:ff:fe12:3456 dev eth1 table local proto kernel metric 0 pref medium

Jakub Kicinski Aug. 11, 2021, 1:27 p.m. UTC | #3

On Wed, 11 Aug 2021 09:10:49 +0300 Leon Romanovsky wrote:
> On Tue, Aug 10, 2021 at 04:53:18PM -0700, Jakub Kicinski wrote:
> > On Tue, 10 Aug 2021 16:37:30 +0300 Leon Romanovsky wrote:  
> > > This series prepares code to remove devlink_reload_enable/_disable API
> > > and in order to do, we move all devlink_register() calls to be right
> > > before devlink_reload_enable().
> > > 
> > > The best place for such a call should be right before exiting from
> > > the probe().
> > > 
> > > This is done because devlink_register() opens devlink netlink to the
> > > users and gives them a venue to issue commands before initialization
> > > is finished.
> > > 
> > > 1. Some drivers were aware of such "functionality" and tried to protect
> > > themselves with extra locks, state machines and devlink_reload_enable().
> > > Let's assume that it worked for them, but I'm personally skeptical about
> > > it.
> > > 
> > > 2. Some drivers copied that pattern, but without locks and state
> > > machines. That protected them from reload flows, but not from any _set_
> > > routines.
> > > 
> > > 3. And all other drivers simply didn't understand the implications of early
> > > devlink_register() and can be seen as "broken".  
> > 
> > What are those implications for drivers which don't implement reload?
> > Depending on which parts of devlink the drivers implement there may well
> > be nothing to worry about.
> > 
> > Plus devlink instances start out with reload disabled. Could you please
> > take a step back and explain why these changes are needed.  
> 
> The problem is that devlink_register() adds new devlink instance to the
> list of visible devlinks (devlink_list). It means that all devlink_*_dumpit()
> will try to access devices during their initialization, before they are ready.
> 
> The more troublesome case is that devlink_list is iterated in the
> devlink_get_from_attrs() and it is used in devlink_nl_pre_doit(). The
> latter function will return to the caller that new devlink is valid and
> such caller will be able to proceed to *_set_doit() functions.
> 
> Just as an example:
>  * user sends netlink message
>   * devlink_nl_cmd_eswitch_set_doit()
>    * ops->eswitch_mode_set()
>     * Are you sure that all drivers protected here?
>       I remind that driver is in the middle of its probe().
> 
> Someone can argue that drivers and devlink are protected from anything
> harmful with their global (devlink_mutex and devlink->lock) and internal
> (device->lock, e.t.c.) locks. However it is impossible to prove for all
> drivers and prone to errors.
> 
> Reload enable/disable gives false impression that the problem exists in
> that flow only, which is not true.
> 
> devlink_reload_enable() is a duct tape because reload flows much easier
> to hit.

Right :/

> > > In this series, we focus on items #1 and #2.
> > > 
> > > Please share your opinion if I should change ALL other drivers to make
> > > sure that devlink_register() is the last command or leave them in an
> > > as-is state.  
> > 
> > Can you please share the output of devlink monitor and ip monitor link
> > before and after?  The modified drivers will not register ports before
> > they register the devlink instance itself.  
> 
> Not really, they will register but won't be accessible from the user space.
> The only difference is the location of "[dev,new] ..." notification.

Is that because of mlx5's use of auxdev, or locking? I don't see
anything that should prevent the port notification from coming out.

I think the notifications need to get straightened out, we can't notify
about sub-objects until the object is registered, since they are
inaccessible.

Leon Romanovsky Aug. 11, 2021, 2:01 p.m. UTC | #4

On Wed, Aug 11, 2021 at 06:27:32AM -0700, Jakub Kicinski wrote:
> On Wed, 11 Aug 2021 09:10:49 +0300 Leon Romanovsky wrote:
> > On Tue, Aug 10, 2021 at 04:53:18PM -0700, Jakub Kicinski wrote:
> > > On Tue, 10 Aug 2021 16:37:30 +0300 Leon Romanovsky wrote:  
> > > > This series prepares code to remove devlink_reload_enable/_disable API
> > > > and in order to do, we move all devlink_register() calls to be right
> > > > before devlink_reload_enable().
> > > > 
> > > > The best place for such a call should be right before exiting from
> > > > the probe().
> > > > 
> > > > This is done because devlink_register() opens devlink netlink to the
> > > > users and gives them a venue to issue commands before initialization
> > > > is finished.
> > > > 
> > > > 1. Some drivers were aware of such "functionality" and tried to protect
> > > > themselves with extra locks, state machines and devlink_reload_enable().
> > > > Let's assume that it worked for them, but I'm personally skeptical about
> > > > it.
> > > > 
> > > > 2. Some drivers copied that pattern, but without locks and state
> > > > machines. That protected them from reload flows, but not from any _set_
> > > > routines.
> > > > 
> > > > 3. And all other drivers simply didn't understand the implications of early
> > > > devlink_register() and can be seen as "broken".  
> > > 
> > > What are those implications for drivers which don't implement reload?
> > > Depending on which parts of devlink the drivers implement there may well
> > > be nothing to worry about.
> > > 
> > > Plus devlink instances start out with reload disabled. Could you please
> > > take a step back and explain why these changes are needed.  
> > 
> > The problem is that devlink_register() adds new devlink instance to the
> > list of visible devlinks (devlink_list). It means that all devlink_*_dumpit()
> > will try to access devices during their initialization, before they are ready.
> > 
> > The more troublesome case is that devlink_list is iterated in the
> > devlink_get_from_attrs() and it is used in devlink_nl_pre_doit(). The
> > latter function will return to the caller that new devlink is valid and
> > such caller will be able to proceed to *_set_doit() functions.
> > 
> > Just as an example:
> >  * user sends netlink message
> >   * devlink_nl_cmd_eswitch_set_doit()
> >    * ops->eswitch_mode_set()
> >     * Are you sure that all drivers protected here?
> >       I remind that driver is in the middle of its probe().
> > 
> > Someone can argue that drivers and devlink are protected from anything
> > harmful with their global (devlink_mutex and devlink->lock) and internal
> > (device->lock, e.t.c.) locks. However it is impossible to prove for all
> > drivers and prone to errors.
> > 
> > Reload enable/disable gives false impression that the problem exists in
> > that flow only, which is not true.
> > 
> > devlink_reload_enable() is a duct tape because reload flows much easier
> > to hit.
> 
> Right :/
> 
> > > > In this series, we focus on items #1 and #2.
> > > > 
> > > > Please share your opinion if I should change ALL other drivers to make
> > > > sure that devlink_register() is the last command or leave them in an
> > > > as-is state.  
> > > 
> > > Can you please share the output of devlink monitor and ip monitor link
> > > before and after?  The modified drivers will not register ports before
> > > they register the devlink instance itself.  
> > 
> > Not really, they will register but won't be accessible from the user space.
> > The only difference is the location of "[dev,new] ..." notification.
> 
> Is that because of mlx5's use of auxdev, or locking? I don't see
> anything that should prevent the port notification from coming out.

And that is OK: the kernel can (and does) send notifications, because we left
the devlink_ops assignment in devlink_alloc(). It ensures that all
flows that worked before will continue to work without too many changes.

> 
> I think the notifications need to get straightened out, we can't notify
> about sub-objects until the object is registered, since they are
> inaccessible.

I'm not sure about that. You present a case where the kernel and user
space race against each other, and historically the kernel doesn't protect
against such flows.

For example, you can randomly remove and add kernel modules. At some
point, you will get "missing symbols" errors, just because
one module tries to load and it depends on an already removed one.

We must protect the kernel, and this is what I do. Users shouldn't access
a devlink instance before they see the "dev name" notification.

Of course, we can move various iterators to devlink_register(), but it
will make the code much more complex, because we have objects that can be
registered at any time (IMHO, trap is one of them) and I will need to
implement notification logic that separates objects that were created
before devlink_register() from those created after.

Thanks

Leon Romanovsky Aug. 11, 2021, 2:15 p.m. UTC | #5

On Wed, Aug 11, 2021 at 05:01:20PM +0300, Leon Romanovsky wrote:
> On Wed, Aug 11, 2021 at 06:27:32AM -0700, Jakub Kicinski wrote:
> > On Wed, 11 Aug 2021 09:10:49 +0300 Leon Romanovsky wrote:
> > > On Tue, Aug 10, 2021 at 04:53:18PM -0700, Jakub Kicinski wrote:
> > > > On Tue, 10 Aug 2021 16:37:30 +0300 Leon Romanovsky wrote:  
> > > > > This series prepares code to remove devlink_reload_enable/_disable API
> > > > > and in order to do, we move all devlink_register() calls to be right
> > > > > before devlink_reload_enable().
> > > > > 
> > > > > The best place for such a call should be right before exiting from
> > > > > the probe().
> > > > > 
> > > > > This is done because devlink_register() opens devlink netlink to the
> > > > > users and gives them a venue to issue commands before initialization
> > > > > is finished.
> > > > > 
> > > > > 1. Some drivers were aware of such "functionality" and tried to protect
> > > > > themselves with extra locks, state machines and devlink_reload_enable().
> > > > > Let's assume that it worked for them, but I'm personally skeptical about
> > > > > it.
> > > > > 
> > > > > 2. Some drivers copied that pattern, but without locks and state
> > > > > machines. That protected them from reload flows, but not from any _set_
> > > > > routines.
> > > > > 
> > > > > 3. And all other drivers simply didn't understand the implications of early
> > > > > devlink_register() and can be seen as "broken".  
> > > > 
> > > > What are those implications for drivers which don't implement reload?
> > > > Depending on which parts of devlink the drivers implement there may well
> > > > be nothing to worry about.
> > > > 
> > > > Plus devlink instances start out with reload disabled. Could you please
> > > > take a step back and explain why these changes are needed.  
> > > 
> > > The problem is that devlink_register() adds new devlink instance to the
> > > list of visible devlinks (devlink_list). It means that all devlink_*_dumpit()
> > > will try to access devices during their initialization, before they are ready.
> > > 
> > > The more troublesome case is that devlink_list is iterated in the
> > > devlink_get_from_attrs() and it is used in devlink_nl_pre_doit(). The
> > > latter function will return to the caller that new devlink is valid and
> > > such caller will be able to proceed to *_set_doit() functions.
> > > 
> > > Just as an example:
> > >  * user sends netlink message
> > >   * devlink_nl_cmd_eswitch_set_doit()
> > >    * ops->eswitch_mode_set()
> > >     * Are you sure that all drivers protected here?
> > >       I remind that driver is in the middle of its probe().
> > > 
> > > Someone can argue that drivers and devlink are protected from anything
> > > harmful with their global (devlink_mutex and devlink->lock) and internal
> > > (device->lock, e.t.c.) locks. However it is impossible to prove for all
> > > drivers and prone to errors.
> > > 
> > > Reload enable/disable gives false impression that the problem exists in
> > > that flow only, which is not true.
> > > 
> > > devlink_reload_enable() is a duct tape because reload flows much easier
> > > to hit.
> > 
> > Right :/
> > 
> > > > > In this series, we focus on items #1 and #2.
> > > > > 
> > > > > Please share your opinion if I should change ALL other drivers to make
> > > > > sure that devlink_register() is the last command or leave them in an
> > > > > as-is state.  
> > > > 
> > > > Can you please share the output of devlink monitor and ip monitor link
> > > > before and after?  The modified drivers will not register ports before
> > > > they register the devlink instance itself.  
> > > 
> > > Not really, they will register but won't be accessible from the user space.
> > > The only difference is the location of "[dev,new] ..." notification.
> > 
> > Is that because of mlx5's use of auxdev, or locking? I don't see
> > anything that should prevent the port notification from coming out.
> 
> And it is ok, kernel can (and does) send notifications, because we left
> devlink_ops assignment to be in devlink_alloc(). It ensures that all
> flows that worked before will continue to work without too much changes.
> 
> > 
> > I think the notifications need to get straightened out, we can't notify
> > about sub-objects until the object is registered, since they are
> > inaccessible.
> 
> I'm not sure about that. You present the case where kernel and user
> space races against each other and historically kernel doesn't protect
> from such flows. 
> 
> For example, you can randomly remove and add kernel modules. At some
> point of time, you will get "missing symbols errors", just because
> one module tries to load and it depends on already removed one.
> 
> We must protect kernel and this is what I do. User shouldn't access
> devlink instance before he sees "dev name" notification.
> 
> Of course, we can move various iterators to devlink_register(), but it
> will make code much complex, because we have objects that can be
> registered at any time (IMHO. trap is one of them) and I will need to 
> implement notification logic that separate objects that were created
> before devlink_register and after.

Bottom line,
I'm trying to make the code simpler, not the opposite :).

> 
> Thanks

Jakub Kicinski Aug. 11, 2021, 2:18 p.m. UTC | #6

On Wed, 11 Aug 2021 17:01:20 +0300 Leon Romanovsky wrote:
> > > Not really, they will register but won't be accessible from the user space.
> > > The only difference is the location of "[dev,new] ..." notification.  
> > 
> > Is that because of mlx5's use of auxdev, or locking? I don't see
> > anything that should prevent the port notification from coming out.  
> 
> And it is ok, kernel can (and does) send notifications, because we left
> devlink_ops assignment to be in devlink_alloc(). It ensures that all
> flows that worked before will continue to work without too much changes.
> 
> > I think the notifications need to get straightened out, we can't notify
> > about sub-objects until the object is registered, since they are
> > inaccessible.  
> 
> I'm not sure about that. You present the case where kernel and user
> space races against each other and historically kernel doesn't protect
> from such flows. 
> 
> For example, you can randomly remove and add kernel modules. At some
> point of time, you will get "missing symbols errors", just because
> one module tries to load and it depends on already removed one.

Sure. But there is a difference between an error because another
actor did something conflicting, asynchronously, and an API which by design
sends notifications that can't be acted upon until a later point in time,
because the kernel sent them too early.

> We must protect kernel and this is what I do. User shouldn't access
> devlink instance before he sees "dev name" notification.

Which is a new rule, and therefore a uAPI change..

> Of course, we can move various iterators to devlink_register(), but it
> will make code much complex, because we have objects that can be
> registered at any time (IMHO. trap is one of them) and I will need to 
> implement notification logic that separate objects that were created
> before devlink_register and after.

I appreciate it's a PITA but it is the downside of a solution where
registration of co-dependent objects exposed via devlink is reordered 
in the kernel.

Leon Romanovsky Aug. 11, 2021, 2:36 p.m. UTC | #7

On Wed, Aug 11, 2021 at 07:18:17AM -0700, Jakub Kicinski wrote:
> On Wed, 11 Aug 2021 17:01:20 +0300 Leon Romanovsky wrote:
> > > > Not really, they will register but won't be accessible from the user space.
> > > > The only difference is the location of "[dev,new] ..." notification.  
> > > 
> > > Is that because of mlx5's use of auxdev, or locking? I don't see
> > > anything that should prevent the port notification from coming out.  
> > 
> > And it is ok, kernel can (and does) send notifications, because we left
> > devlink_ops assignment to be in devlink_alloc(). It ensures that all
> > flows that worked before will continue to work without too much changes.
> > 
> > > I think the notifications need to get straightened out, we can't notify
> > > about sub-objects until the object is registered, since they are
> > > inaccessible.  
> > 
> > I'm not sure about that. You present the case where kernel and user
> > space races against each other and historically kernel doesn't protect
> > from such flows. 
> > 
> > For example, you can randomly remove and add kernel modules. At some
> > point of time, you will get "missing symbols errors", just because
> > one module tries to load and it depends on already removed one.
> 
> Sure. But there is a difference between an error because another
> actor did something conflicting, asynchronously, and API which by design
> sends notifications which can't be acted upon until later point in time,
> because kernel sent them too early.
> 
> > We must protect kernel and this is what I do. User shouldn't access
> > devlink instance before he sees "dev name" notification.
> 
> Which is a new rule, and therefore a uAPI change..
> 
> > Of course, we can move various iterators to devlink_register(), but it
> > will make code much complex, because we have objects that can be
> > registered at any time (IMHO. trap is one of them) and I will need to 
> > implement notification logic that separate objects that were created
> > before devlink_register and after.
> 
> I appreciate it's a PITA but it is the downside of a solution where
> registration of co-dependent objects exposed via devlink is reordered 
> in the kernel.

No problem, I will rewrite the notification logic to be a queue-based mechanism.

Thanks

Leon Romanovsky Aug. 12, 2021, 4:10 a.m. UTC | #8

On Wed, Aug 11, 2021 at 07:18:17AM -0700, Jakub Kicinski wrote:
> On Wed, 11 Aug 2021 17:01:20 +0300 Leon Romanovsky wrote:
> > > > Not really, they will register but won't be accessible from the user space.
> > > > The only difference is the location of "[dev,new] ..." notification.  
> > > 
> > > Is that because of mlx5's use of auxdev, or locking? I don't see
> > > anything that should prevent the port notification from coming out.  
> > 
> > And it is ok, kernel can (and does) send notifications, because we left
> > devlink_ops assignment to be in devlink_alloc(). It ensures that all
> > flows that worked before will continue to work without too much changes.
> > 
> > > I think the notifications need to get straightened out, we can't notify
> > > about sub-objects until the object is registered, since they are
> > > inaccessible.  
> > 
> > I'm not sure about that. You present the case where kernel and user
> > space races against each other and historically kernel doesn't protect
> > from such flows. 
> > 
> > For example, you can randomly remove and add kernel modules. At some
> > point of time, you will get "missing symbols errors", just because
> > one module tries to load and it depends on already removed one.
> 
> Sure. But there is a difference between an error because another
> actor did something conflicting, asynchronously, and API which by design
> sends notifications which can't be acted upon until later point in time,
> because kernel sent them too early.
> 
> > We must protect kernel and this is what I do. User shouldn't access
> > devlink instance before he sees "dev name" notification.
> 
> Which is a new rule, and therefore a uAPI change..
> 
> > Of course, we can move various iterators to devlink_register(), but it
> > will make code much complex, because we have objects that can be
> > registered at any time (IMHO. trap is one of them) and I will need to 
> > implement notification logic that separate objects that were created
> > before devlink_register and after.
> 
> I appreciate it's a PITA but it is the downside of a solution where
> registration of co-dependent objects exposed via devlink is reordered 
> in the kernel.

I thought about it more and realized that while we can make the
registration monitor notifications behave as before, we can't do it for
the unregister path.

For register, we can buffer all notifications till devlink_register
comes, use it as a marker and release everything that was accumulated
till that point. Everything that will come later will be delivered
immediately.

It will give "dev name ..." print at the beginning as you want.
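
A minimal userspace sketch of that buffering idea, for illustration only.
Everything here (struct fake_devlink, fake_notify(), fake_register(), the
message strings and the fixed-size arrays) is hypothetical and is not the
kernel's devlink code; it only models "queue until register, then flush,
then deliver immediately". On overflow the sketch simply asserts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_PENDING 16
#define MAX_SENT    32
#define MSG_LEN     32

/* Hypothetical stand-in for a devlink instance (not struct devlink). */
struct fake_devlink {
	bool registered;
	char pending[MAX_PENDING][MSG_LEN]; /* held back until register */
	size_t n_pending;
	char sent[MAX_SENT][MSG_LEN];       /* what a monitor would see */
	size_t n_sent;
};

/* Copy a message into a fixed-size slot, always NUL-terminated. */
static void copy_msg(char *dst, const char *src)
{
	strncpy(dst, src, MSG_LEN - 1);
	dst[MSG_LEN - 1] = '\0';
}

/* Record a notification as delivered to user space. */
static void deliver(struct fake_devlink *dl, const char *msg)
{
	assert(dl->n_sent < MAX_SENT);
	copy_msg(dl->sent[dl->n_sent++], msg);
}

/* Sub-object notification: buffered before register, immediate after. */
static void fake_notify(struct fake_devlink *dl, const char *msg)
{
	if (dl->registered) {
		deliver(dl, msg);
		return;
	}
	assert(dl->n_pending < MAX_PENDING);
	copy_msg(dl->pending[dl->n_pending++], msg);
}

/* Register acts as the marker: announce the device, then flush the queue. */
static void fake_register(struct fake_devlink *dl)
{
	size_t i;

	dl->registered = true;
	deliver(dl, "dev new");
	for (i = 0; i < dl->n_pending; i++)
		deliver(dl, dl->pending[i]);
	dl->n_pending = 0;
}
```

With a zero-initialized struct fake_devlink, notifications issued before
fake_register() show up only after "dev new", in their original order, and
later ones are delivered right away.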

For unregister, this trick won't work because we don't know if any other
devlink unregister API is used after devlink_unregister. So we can't
delay notifications.

Even if we could, it would be even worse from the user's perspective,
because in that case devlink_unregister() would close netlink access
without notifying the user, who won't understand why ports don't work
(as an example).

Jakub, you are over-engineering here and solving a non-existing problem.

> Which is a new rule, and therefore a uAPI change..

AFAIR, netlink can be out-of-order, because it is UDP, but it is just
impractical to see it in real life. So no, it is not a new rule.

Thanks
