diff mbox series

[RFC,v2,net-next,6/7] net: ethtool: add a mutex protecting RSS contexts

Message ID 9e2bcb887b5cf9cbb8c0c4ba126115fe01a01f3f.1681236654.git.ecree.xilinx@gmail.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Headers show
Series ethtool: track custom RSS contexts in the core | expand

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1891 this patch: 1891
netdev/cc_maintainers warning 3 maintainers not CCed: vladimir.oltean@nxp.com d-tatianin@yandex-team.ru andrew@lunn.ch
netdev/build_clang success Errors and warnings before: 555 this patch: 555
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 2000 this patch: 2000
netdev/checkpatch warning CHECK: struct mutex definition without comment
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

edward.cree@amd.com April 11, 2023, 6:26 p.m. UTC
From: Edward Cree <ecree.xilinx@gmail.com>

While this is not needed to serialise the ethtool entry points (which
 are all under RTNL), drivers may have cause to asynchronously access
 dev->ethtool->rss_ctx; taking dev->ethtool->rss_lock allows them to
 do this safely without needing to take the RTNL.

Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
---
 include/linux/ethtool.h | 3 +++
 net/core/dev.c          | 5 +++++
 net/ethtool/ioctl.c     | 7 +++++++
 3 files changed, 15 insertions(+)

Comments

Andrew Lunn April 11, 2023, 8:40 p.m. UTC | #1
On Tue, Apr 11, 2023 at 07:26:14PM +0100, edward.cree@amd.com wrote:
> From: Edward Cree <ecree.xilinx@gmail.com>
> 
> While this is not needed to serialise the ethtool entry points (which
>  are all under RTNL), drivers may have cause to asynchronously access
>  dev->ethtool->rss_ctx; taking dev->ethtool->rss_lock allows them to
>  do this safely without needing to take the RTNL.

What is actually wrong with taking RTNL? KISS is often best,
especially for locks.

     Andrew
Edward Cree April 12, 2023, 4:16 p.m. UTC | #2
On 11/04/2023 21:40, Andrew Lunn wrote:
> On Tue, Apr 11, 2023 at 07:26:14PM +0100, edward.cree@amd.com wrote:
>> While this is not needed to serialise the ethtool entry points (which
>>  are all under RTNL), drivers may have cause to asynchronously access
>>  dev->ethtool->rss_ctx; taking dev->ethtool->rss_lock allows them to
>>  do this safely without needing to take the RTNL.
> 
> What is actually wrong with taking RTNL? KISS is often best,
> especially for locks.

The examples I have of driver code that needs to access rss_ctx (in the
 sfc driver) are deep inside call chains where RTNL may or may not
 already be held.
1) filter insertion.  E.g. ethtool -U is already holding RTNL, but other
 sources of filters (e.g. aRFS) aren't, and thus taking it if necessary
 might mean passing a 'bool rtnl_locked' all the way down the chain.
2) device reset handling (we have to restore the RSS contexts in the
 hardware after a reset).  Again resets don't always happen under RTNL,
 and I don't fully understand the details (EFX_ASSERT_RESET_SERIALISED()).
So it makes life much simpler if we just have a finer-grained lock we can
 just take when we need to.
Also, RTNL is a very big hammer that serialises all kinds of operations
 across all the netdevs in the system, holding it for any length of time
 can cause annoying user-visible latency (e.g. iirc sshd accepting a new
 connection has to wait for it) so I prefer to avoid it if possible.  If
 anything we want to be breaking up this BKL[1], not making it bigger.

-ed

[1]: https://legacy.netdevconf.info/2.2/slides/westphal-rtnlmutex-talk.pdf
Andrew Lunn April 12, 2023, 5:15 p.m. UTC | #3
On Wed, Apr 12, 2023 at 05:16:11PM +0100, Edward Cree wrote:
> On 11/04/2023 21:40, Andrew Lunn wrote:
> > On Tue, Apr 11, 2023 at 07:26:14PM +0100, edward.cree@amd.com wrote:
> >> While this is not needed to serialise the ethtool entry points (which
> >>  are all under RTNL), drivers may have cause to asynchronously access
> >>  dev->ethtool->rss_ctx; taking dev->ethtool->rss_lock allows them to
> >>  do this safely without needing to take the RTNL.
> > 
> > What is actually wrong with taking RTNL? KISS is often best,
> > especially for locks.
> 
> The examples I have of driver code that needs to access rss_ctx (in the
>  sfc driver) are deep inside call chains where RTNL may or may not
>  already be held.
> 1) filter insertion.  E.g. ethtool -U is already holding RTNL, but other
>  sources of filters (e.g. aRFS) aren't, and thus taking it if necessary
>  might mean passing a 'bool rtnl_locked' all the way down the chain.
> 2) device reset handling (we have to restore the RSS contexts in the
>  hardware after a reset).  Again resets don't always happen under RTNL,
>  and I don't fully understand the details (EFX_ASSERT_RESET_SERIALISED()).
> So it makes life much simpler if we just have a finer-grained lock we can
>  just take when we need to.
> Also, RTNL is a very big hammer that serialises all kinds of operations
>  across all the netdevs in the system, holding it for any length of time
>  can cause annoying user-visible latency (e.g. iirc sshd accepting a new
>  connection has to wait for it) so I prefer to avoid it if possible.  If
>  anything we want to be breaking up this BKL[1], not making it bigger.

Hi Ed

I have to wonder if your locking model is wrong. When i look at the
next patch, i see the driver is also using this lock. And i generally
find that is wrong.

As a rule of thumb, driver writes don't understand locking. Yes, there
are some that do, but most don't. As core code developers, i find it
good practice to have the locks in the core, and only in the
core. Drivers writers should not need to worry about locking. The API
into the driver will take the locks needed before entering the driver,
and release them on exit.

So i don't really agree with 'it makes life much simpler if we just
have a finer-grained lock'. It makes life more complex having to help
driver writers find the corruption and deadlock bugs in their code
because they got the locking wrong.

Please try to work on the abstraction so that drivers don't need this
lock, just the core.

	Andrew
Edward Cree April 12, 2023, 5:42 p.m. UTC | #4
On 12/04/2023 18:15, Andrew Lunn wrote:
> I have to wonder if your locking model is wrong. When i look at the
> next patch, i see the driver is also using this lock. And i generally
> find that is wrong.
...
> Drivers writers should not need to worry about locking. The API
> into the driver will take the locks needed before entering the driver,
> and release them on exit.
I don't think that's possible without increasing driver complexity in
 other ways.  Essentially, for the driver to take advantage of the core
 tracking these contexts, and thus not need its own data structures to
 track them as well (like the efx->rss_context.list we remove in patch
 #7), it has to be able to access them on driver-initiated (not just
 core-initiated) events.  (The central example of this being "oh, the
 NIC MCPU just rebooted, we have to reinstall all our filters".)  And
 it needs to be able to exclude writes while it does that, not only for
 consistency but also because e.g. context deletion will free that
 memory (though I guess we could finesse that part with RCU?).
What I *could* do is add suitable wrapper functions for access to
 dev->ethtool->rss_ctx (e.g. a core equivalent of
 efx_find_rss_context_entry() that wraps the idr_find()) and have them
 assert that the lock is held (like efx_find_rss_context_entry() does);
 that would at least validate the driver locking somewhat.
But having those helper functions perform the locking themselves would
 mean going to a get/put model for managing the lifetime of the
 driver's reference (with a separate get_write for exclusive access),
 at which point it's probably harder for driver writers to understand
 than "any time you're touching rss_ctx you need to hold the rss_lock".

-ed
Jakub Kicinski April 13, 2023, 2:06 a.m. UTC | #5
On Wed, 12 Apr 2023 18:42:16 +0100 Edward Cree wrote:
> On 12/04/2023 18:15, Andrew Lunn wrote:
> > I have to wonder if your locking model is wrong. When i look at the
> > next patch, i see the driver is also using this lock. And i generally
> > find that is wrong.  
> ...
> > Drivers writers should not need to worry about locking. The API
> > into the driver will take the locks needed before entering the driver,
> > and release them on exit.  
> I don't think that's possible without increasing driver complexity in
>  other ways.  Essentially, for the driver to take advantage of the core
>  tracking these contexts, and thus not need its own data structures to
>  track them as well (like the efx->rss_context.list we remove in patch
>  #7), it has to be able to access them on driver-initiated (not just
>  core-initiated) events.  (The central example of this being "oh, the
>  NIC MCPU just rebooted, we have to reinstall all our filters".)  And
>  it needs to be able to exclude writes while it does that, not only for
>  consistency but also because e.g. context deletion will free that
>  memory (though I guess we could finesse that part with RCU?).
> What I *could* do is add suitable wrapper functions for access to
>  dev->ethtool->rss_ctx (e.g. a core equivalent of
>  efx_find_rss_context_entry() that wraps the idr_find()) and have them
>  assert that the lock is held (like efx_find_rss_context_entry() does);
>  that would at least validate the driver locking somewhat.
> But having those helper functions perform the locking themselves would
>  mean going to a get/put model for managing the lifetime of the
>  driver's reference (with a separate get_write for exclusive access),
>  at which point it's probably harder for driver writers to understand
>  than "any time you're touching rss_ctx you need to hold the rss_lock".

IMO the "MCPU has rebooted" case is a control path, taking rtnl seems
like the right thing to do. Does that happen often enough or takes so
long to recover that we need to be concerned about locking down rtnl?
aRFS can't sleep IIUC so the mutex is does not seem like a great match.

IOW I had the same reaction as Andrew first time I saw this mutex :(
Edward Cree April 13, 2023, 9:52 p.m. UTC | #6
On 13/04/2023 03:06, Jakub Kicinski wrote:
> IMO the "MCPU has rebooted" case is a control path, taking rtnl seems
> like the right thing to do. Does that happen often enough or takes so
> long to recover that we need to be concerned about locking down rtnl?

Normally we *do* hold RTNL across the reset handling path, and all the
 callers I can find take it.  But the existence of a more complicated
 condition guarding the ASSERT_RTNL() in EFX_ASSERT_RESET_SERIALISED()
 implies that there is, or at least was, a call site that doesn't;
 that makes me nervous about assuming it.

> aRFS can't sleep IIUC so the mutex is does not seem like a great match.

sfc punts aRFS filter insertion into a workitem, because you can't do
 MCDI without sleeping.  And aRFS was just one example; there's lots
 of other sources of filters in the driver (PTP, sync_rx_mode (device
 UC/MC addresses), VLAN filtering (NETIF_F_HW_VLAN_CTAG_FILTER)...).
 Some of those filters can also have EFX_FILTER_FLAG_RX_RSS (though
 not custom contexts).

So while I *think* the filter insert code could carefully go
   if (spec->flags & EFX_FILTER_FLAG_RX_RSS && spec->rss_context)
       ASSERT_RTNL();
 it's kinda hairy.  And if this is a *normal* level of hair for
 drivers to have in this area, then the "but driver writers don't
 understand locking!" argument doesn't really favour one solution over
 the other.  After all, the driver will still have to hold *some* lock
 to access this stuff, whether it's rss_lock or RTNL.
(Idk, maybe sfc is just uniquely complex and messy.  It wouldn't be
 the first time.)

> IOW I had the same reaction as Andrew first time I saw this mutex :(

Well I seem to be outvoted, so I'll have another crack at getting it
 to work with just RTNL (that's what I went for originally, actually,
 but one of the ASSERT_RTNL()s failed in test and I couldn't figure
 out why at the time.  Possibly trying to argue the case has helped me
 to understand more of the details!).
Andrew Lunn April 13, 2023, 9:58 p.m. UTC | #7
> (Idk, maybe sfc is just uniquely complex and messy.  It wouldn't be
>  the first time.)

Hi Ed

Have you looked at other drivers? It would be bad to design an API
around a messy driver. Maybe this is an opportunity to learn from
other drivers and come up with something cleaner for SFC?

      Andrew
Edward Cree April 14, 2023, 8:20 p.m. UTC | #8
On 13/04/2023 22:58, Andrew Lunn wrote:
>> (Idk, maybe sfc is just uniquely complex and messy.  It wouldn't be
>>  the first time.)
> 
> Hi Ed
> 
> Have you looked at other drivers? It would be bad to design an API
> around a messy driver.

I have; there's really not many that implement custom RSS contexts
 (just Marvell's mvpp2 and octeontx2, and Mellanox's mlx5).  The
 `rss_ctx_max_id` field is designed for those as they all have fixed-
 size arrays currently and idk whether that's a purely software limit
 or whether it reflects the hardware.
I couldn't find anything in any of them that looked like "restore
 stuff after a device reboot"; maybe it's just not something those
 devices expect to experience normally.

I don't know enough about mlx5 hw to really understand their filter
 code, but the rough equivalent of our efx_mcdi_filter_insert_locked()
 in that driver appears to be _mlx5_add_flow_rules(), which seems to
 be doing some kind of hand-over-hand locking.  And no sign (whether
 in comments or in asserts) of whether the function expects callers to
 hold RTNL.  Same goes for their functions operating on TIRs (whatever
 those are) which are called from all over (aRFS, tc, even kTLS!) in
 addition to the ethtool RSS/ntuple paths.

Anyway I'll cc maintainers of those drivers on v3 so they can chime
 in on the API design.  (Should've done that on v1 really, but I
 forgot.  Mea culpa.)

-ed
diff mbox series

Patch

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 724da9234cf1..e8e88d5900d3 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -1026,11 +1026,14 @@  int ethtool_virtdev_set_link_ksettings(struct net_device *dev,
  * struct ethtool_netdev_state - per-netdevice state for ethtool features
  * @rss_ctx_max_id:	maximum (exclusive) supported RSS context ID
  * @rss_ctx:		IDR storing custom RSS context state
+ * @rss_lock:		Protects entries in @rss_ctx.  May be taken from
+ *			within RTNL.
  * @wol_enabled:	Wake-on-LAN is enabled
  */
 struct ethtool_netdev_state {
 	u32			rss_ctx_max_id;
 	struct idr		rss_ctx;
+	struct mutex		rss_lock;
 	unsigned		wol_enabled:1;
 };
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 44668386f376..60c844b372e3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -9987,6 +9987,7 @@  int register_netdevice(struct net_device *dev)
 	idr_init_base(&dev->ethtool->rss_ctx, 1);
 
 	spin_lock_init(&dev->addr_list_lock);
+	mutex_init(&dev->ethtool->rss_lock);
 	netdev_set_addr_lockdep_class(dev);
 
 	ret = dev_get_valid_name(net, dev, dev->name);
@@ -10792,6 +10793,7 @@  static void netdev_rss_contexts_free(struct net_device *dev)
 	if (!dev->ethtool_ops->create_rxfh_context &&
 	    !dev->ethtool_ops->set_rxfh_context)
 		return;
+	mutex_lock(&dev->ethtool->rss_lock);
 	idr_for_each_entry(&dev->ethtool->rss_ctx, ctx, context) {
 		u32 *indir = ethtool_rxfh_context_indir(ctx);
 		u8 *key = ethtool_rxfh_context_key(ctx);
@@ -10806,6 +10808,7 @@  static void netdev_rss_contexts_free(struct net_device *dev)
 							   &context, true);
 		kfree(ctx);
 	}
+	mutex_unlock(&dev->ethtool->rss_lock);
 }
 
 /**
@@ -10919,6 +10922,8 @@  void unregister_netdevice_many_notify(struct list_head *head,
 		if (dev->netdev_ops->ndo_uninit)
 			dev->netdev_ops->ndo_uninit(dev);
 
+		mutex_destroy(&dev->ethtool->rss_lock);
+
 		if (skb)
 			rtmsg_ifinfo_send(skb, dev, GFP_KERNEL, portid, nlh);
 
diff --git a/net/ethtool/ioctl.c b/net/ethtool/ioctl.c
index abd1cf50e681..8b2e90ba03a1 100644
--- a/net/ethtool/ioctl.c
+++ b/net/ethtool/ioctl.c
@@ -1257,6 +1257,7 @@  static noinline_for_stack int ethtool_set_rxfh(struct net_device *dev,
 	u8 *rss_config;
 	u32 rss_cfg_offset = offsetof(struct ethtool_rxfh, rss_config[0]);
 	bool create = false, delete = false;
+	bool locked = false; /* dev->ethtool->rss_lock taken */
 
 	if (!ops->get_rxnfc || !ops->set_rxfh)
 		return -EOPNOTSUPP;
@@ -1334,6 +1335,10 @@  static noinline_for_stack int ethtool_set_rxfh(struct net_device *dev,
 		}
 	}
 
+	if (rxfh.rss_context) {
+		mutex_lock(&dev->ethtool->rss_lock);
+		locked = true;
+	}
 	if (create) {
 		if (delete) {
 			ret = -EINVAL;
@@ -1453,6 +1458,8 @@  static noinline_for_stack int ethtool_set_rxfh(struct net_device *dev,
 	}
 
 out:
+	if (locked)
+		mutex_unlock(&dev->ethtool->rss_lock);
 	kfree(rss_config);
 	return ret;
 }