[RFC,net] net: dsa: tag_dsa: fix suspicious rcu_dereference_check() with br_vlan_get_pvid_rcu

Message ID 20210928233708.1246774-1-vladimir.oltean@nxp.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Series [RFC,net] net: dsa: tag_dsa: fix suspicious rcu_dereference_check() with br_vlan_get_pvid_rcu

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for net
netdev/subject_prefix success Link
netdev/cc_maintainers success CCed 8 of 8 maintainers
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch warning WARNING: Possible repeated word: 'Very' WARNING: Unknown commit id '43da55cbd54e', maybe rebased or not pulled?
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/header_inline success Link

Commit Message

Vladimir Oltean Sept. 28, 2021, 11:37 p.m. UTC
__dev_queue_xmit(), which is our caller, does run under rcu_read_lock_bh(),
and in my foolishness I had thought this would be enough to make the access
safe. Instead, lockdep complains that rcu_read_lock() is not held.
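
To illustrate the complaint, here is a minimal sketch of the pattern (the
pointer names below are just stand-ins for this sketch, not the actual
bridge internals):

	rcu_read_lock_bh();	/* what __dev_queue_xmit() holds around us */

	/* Somewhere inside br_vlan_get_pvid_rcu(), the VLAN group is fetched
	 * with a plain rcu_dereference(). Its lockdep condition is
	 * rcu_read_lock_held(), i.e. the rcu-preempt map, while
	 * rcu_read_lock_bh() acquires the separate rcu-bh map, so PROVE_RCU
	 * prints "suspicious rcu_dereference_check() usage".
	 */
	p = rcu_dereference(gp);

	rcu_read_unlock_bh();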

Which it isn't - as it turns out, RCU-preempt and RCU-bh are two
different flavors, and although Paul McKenney has consolidated
synchronize_rcu() to wait for both preempt and bh read-side critical
sections [1], the reader-side APIs are different, and so are the lockdep
maps and keys.

The bridge calls synchronize_rcu() in br_vlan_flush(), and this does
wait for our TX fastpath reader of br_vlan_group_rcu to complete even
though that reader sits in an rcu-bh read-side section. So even though this
is safe in principle, to lockdep it is a case of "who are you? I don't know
you, you're suspicious".

Side note, I still don't really understand the different RCU flavors.
For example, as far as I can see, the core network stack has never
directly called synchronize_rcu_bh(), not even once - just the initial
synchronize_kernel(), replaced later with the RCU-preempt variant,
synchronize_rcu(). Very very long story short, dev_queue_xmit() has been
calling this exact variant - rcu_read_lock_bh() - since [2], to make
dev_deactivate() properly wait for network interfaces with NETIF_F_LLTX
to finish their dev_queue_xmit(). But that relied on an existing
synchronize_rcu(), not synchronize_rcu_bh(). So does this mean that
synchronize_net() never really waited for the rcu-bh critical section in
dev_queue_xmit() to finish? I've no idea.

So basically there are multiple options.

First would be to duplicate br_vlan_get_pvid_rcu() into a new
br_vlan_get_pvid_rcu_bh() to appease lockdep for the TX path case. But
this function already has another brother, br_vlan_get_pvid(), which is
protected by the update-side rtnl_mutex. We don't want to grow the
family too big either, especially since br_vlan_get_pvid_rcu_bh() would not
be a function used by the bridge at all, just exported by it and used by
the DSA layer.
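
Purely for illustration, such a duplicate might look like the sketch
below. The lookup helper name is invented; the body would have to mirror
whatever br_vlan_get_pvid_rcu() does internally, just with a dereference
whose lockdep condition also accepts rcu_read_lock_bh_held():

	/* Hypothetical sketch, not proposed code. */
	int br_vlan_get_pvid_rcu_bh(const struct net_device *dev, u16 *p_pvid)
	{
		struct net_bridge_vlan_group *vg;

		vg = br_vlan_group_lookup_bh(dev);	/* invented helper name */
		if (!vg)
			return -EINVAL;

		*p_pvid = br_get_pvid(vg);
		return 0;
	}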

The option of getting to the bottom of why __dev_queue_xmit() uses
rcu-bh, and splitting that into local_bh_disable() + rcu_read_lock(), as
it was before [3], might be impractical. There have been 15 years of
development since then, and there are lots of code paths that use
rcu_dereference_bh() in the TX path. Plus, with the consolidation work
done in [1], I'm not even sure what the practical benefits of rcu-bh are
any longer: if the whole point was for synchronize_rcu() to wait for
everything in sight, how can spammy softirqs like networking paint
themselves red any longer, and how can certain RCU updaters avoid waiting
for them, in order to avoid denial of service? It doesn't appear possible
from the distance at which I'm looking at the problem.
So the effort of converting __dev_queue_xmit() from rcu-bh to rcu-preempt
would only appear justified if it went together with the complete
elimination of rcu-bh. Also, it would appear to be quite a strange and
roundabout way to fix a "suspicious RCU usage" lockdep message.
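
Mechanically, that conversion would boil down to something like the sketch
below inside __dev_queue_xmit() (plus auditing every rcu_dereference_bh()
user on the TX path) - shown only to give a sense of the churn, not as a
tested change:

	/* before */
	rcu_read_lock_bh();
	/* ... TX path ... */
	rcu_read_unlock_bh();

	/* after */
	local_bh_disable();
	rcu_read_lock();
	/* ... TX path, with rcu_dereference_bh() users switched over to
	 * rcu_dereference() ...
	 */
	rcu_read_unlock();
	local_bh_enable();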

Last, it appears possible to just give lockdep what it wants, and hold
an rcu-preempt read-side critical section when calling br_vlan_get_pvid_rcu()
from the TX path. In terms of lines of code and amount of thought needed,
it is certainly the easiest path forward, even though it incurs a small
(negligible, and avoidable at that) performance overhead. This is what
this patch does, for lack of a deeper understanding of lockdep, RCU or
the network transmission process.

[1] https://lwn.net/Articles/777036/
[2] commit d4828d85d188 ("[NET]: Prevent transmission after dev_deactivate")
[3] commit 43da55cbd54e ("[NET]: Do less atomic count changes in dev_queue_xmit.")

Fixes: d82f8ab0d874 ("net: dsa: tag_dsa: offload the bridge forwarding process")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/dsa/tag_dsa.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

Comments

Paul E. McKenney Sept. 29, 2021, 5:53 p.m. UTC | #1
On Wed, Sep 29, 2021 at 02:37:08AM +0300, Vladimir Oltean wrote:
> __dev_queue_xmit(), which is our caller, does run under rcu_read_lock_bh(),
> but in my foolishness I had thought this would be enough to make the
> access, lockdep complains that rcu_read_lock() is not held.

Depending on exactly which primitive is complaining, you can inform
lockdep of your intentions.  For example, you can change
rcu_dereference(p) to rcu_dereference_bh(p).  Or you can change:

	list_for_each_entry_rcu(p, &lh, field) {
		...

To:

	list_for_each_entry_rcu(p, &lh, field, rcu_read_lock_bh_held()) {
		...

And hlist_for_each_entry_rcu() can also take that same optional
lockdep parameter.

> Which it isn't - as it turns out, RCU preempt and RCU-bh are two
> different flavors, and although Paul McKenney has consolidated
> synchronize_rcu() to wait for both preempt as well as bh read-side
> critical sections [1], the reader-side API is different, the lockdep
> maps and keys are different.
> 
> The bridge calls synchronize_rcu() in br_vlan_flush(), and this does
> wait for our TX fastpath reader of the br_vlan_group_rcu to complete
> even though it is in an rcu-bh read side section. So even though this is
> in premise safe, to lockdep this is a case of "who are you? I don't know
> you, you're suspicious".
> 
> Side note, I still don't really understand the different RCU flavors.

RCU BH was there to handle denial-of-service networking loads.
Changes over the years to RCU and to softirq have rendered it obsolete.
But rcu_read_lock_bh() still disables softirq for you.

RCU Sched provides the original semantics.

RCU Preempt, as the name suggests, allows RCU readers to be preempted.
Of course, if you are using rcu_read_lock_sched() or rcu_read_lock_bh(),
you are disabling preemption across the critical section.

> For example, as far as I can see, the core network stack has never
> directly called synchronize_rcu_bh, not even once. Just the initial
> synchronize_kernel(), replaced later with the RCU preempt variant -
> synchronize_rcu(). Very very long story short, dev_queue_xmit has
> started calling this exact variant - rcu_read_lock_bh() - since [2], to
> make dev_deactivate properly wait for network interfaces with
> NETIF_F_LLTX to finish their dev_queue_xmit(). But that relied on an
> existing synchronize_rcu(), not synchronize_rcu_bh(). So does this mean
> that synchronize_net() never really waited for the rcu-bh critical
> section in dev_queue_xmit to finish? I've no idea.

The pre-consolidation Linux kernel v4.16 has these calls to
synchronize_rcu_bh():

drivers/net/team/team.c team_port_disable_netpoll 1094 synchronize_rcu_bh();
drivers/vhost/net.c vhost_net_release 1027 synchronize_rcu_bh();
net/netfilter/ipset/ip_set_hash_gen.h mtype_resize 667 synchronize_rcu_bh();

But to your point, nothing in net/core.

And for v4.16 kernels built with CONFIG_PREEMPT=y, there is no guarantee
that synchronize_rcu() will wait for a rcu_read_lock_bh() critical
section.  A CPU in such a critical section could take a scheduling-clock
interrupt, notice that it was not in an rcu_read_lock() critical section,
and report a quiescent state, which could well end that grace period.

But as you say, in more recent kernels, synchronize_rcu() will indeed
wait for rcu_read_lock_bh() critical sections.

But please be very careful when backporting.

> So basically there are multiple options.
> 
> First would be to duplicate br_vlan_get_pvid_rcu() into a new
> br_vlan_get_pvid_rcu_bh() to appease lockdep for the TX path case. But
> this function already has another brother, br_vlan_get_pvid(), which is
> protected by the update-side rtnl_mutex. We don't want to grow the
> family too big too, especially since br_vlan_get_pvid_rcu_bh() would not
> be a function used by the bridge at all, just exported by it and used by
> the DSA layer.
> 
> The option of getting to the bottom of why does __dev_queue_xmit use
> rcu-bh, and splitting that into local_bh_disable + rcu_read_lock, as it
> was before [3], might be impractical. There have been 15 years of
> development since then, and there are lots of code paths that use
> rcu_dereference_bh() in the TX path. Plus, with the consolidation work
> done in [1], I'm not even sure what are the practical benefits of rcu-bh
> any longer, if the whole point was for synchronize_rcu() to wait for
> everything in sight - how can spammy softirqs like networking paint
> themselves red any longer, and how can certain RCU updaters not wait for
> them now, in order to avoid denial of service? It doesn't appear
> possible from the distance from which I'm looking at the problem.
> So the effort of converting __dev_queue_xmit from rcu-bh to rcu-preempt
> would only appear justified if it went together with the complete
> elimination of rcu-bh. Also, it would appear to be quite a strange and
> roundabout way to fix a "suspicious RCU usage" lockdep message.

The thing to be very careful of is code that might be implicitly assuming
that it cannot be interrupted by a softirq handler.  This assumption will
of course be violated by changing rcu_read_lock_bh() to rcu_read_lock().
The resulting low-probability subtle breakage might be hard to find.

> Last, it appears possible to just give lockdep what it wants, and hold
> an rcu-preempt read-side critical section when calling br_vlan_get_pvid_rcu
> from the TX path. In terms of lines of code and amount of thought needed
> it is certainly the easiest path forward, even though it incurs a small
> (negligible) performance overhead (and avoidable, at that). This is what
> this patch does, in lack of a deeper understanding of lockdep, RCU or
> the network transmission process.
> 
> [1] https://lwn.net/Articles/777036/
> [2] commit d4828d85d188 ("[NET]: Prevent transmission after dev_deactivate")
> [3] commit 43da55cbd54e ("[NET]: Do less atomic count changes in dev_queue_xmit.")
> 
> Fixes: d82f8ab0d874 ("net: dsa: tag_dsa: offload the bridge forwarding process")
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

Of course, if no one really needs rcu_read_lock_bh() anymore, I would be
quite happy to simplify my life by getting rid of it.  ;-)

							Thanx, Paul

> ---
>  net/dsa/tag_dsa.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/net/dsa/tag_dsa.c b/net/dsa/tag_dsa.c
> index 77d0ce89ab77..178464cd2bdb 100644
> --- a/net/dsa/tag_dsa.c
> +++ b/net/dsa/tag_dsa.c
> @@ -150,10 +150,9 @@ static struct sk_buff *dsa_xmit_ll(struct sk_buff *skb, struct net_device *dev,
>  		 * that's where the packets ingressed from.
>  		 */
>  		if (!br_vlan_enabled(br)) {
> -			/* Safe because __dev_queue_xmit() runs under
> -			 * rcu_read_lock_bh()
> -			 */
> +			rcu_read_lock();
>  			err = br_vlan_get_pvid_rcu(br, &pvid);
> +			rcu_read_unlock();
>  			if (err)
>  				return NULL;
>  		}
> -- 
> 2.25.1
>
Nikolay Aleksandrov Sept. 29, 2021, 9:21 p.m. UTC | #2
On 29/09/2021 02:37, Vladimir Oltean wrote:
> __dev_queue_xmit(), which is our caller, does run under rcu_read_lock_bh(),
> but in my foolishness I had thought this would be enough to make the
> access, lockdep complains that rcu_read_lock() is not held.
> 
> Which it isn't - as it turns out, RCU preempt and RCU-bh are two
> different flavors, and although Paul McKenney has consolidated
> synchronize_rcu() to wait for both preempt as well as bh read-side
> critical sections [1], the reader-side API is different, the lockdep
> maps and keys are different.
> 
> The bridge calls synchronize_rcu() in br_vlan_flush(), and this does
> wait for our TX fastpath reader of the br_vlan_group_rcu to complete
> even though it is in an rcu-bh read side section. So even though this is
> in premise safe, to lockdep this is a case of "who are you? I don't know
> you, you're suspicious".
> 
> Side note, I still don't really understand the different RCU flavors.
> For example, as far as I can see, the core network stack has never
> directly called synchronize_rcu_bh, not even once. Just the initial
> synchronize_kernel(), replaced later with the RCU preempt variant -
> synchronize_rcu(). Very very long story short, dev_queue_xmit has
> started calling this exact variant - rcu_read_lock_bh() - since [2], to
> make dev_deactivate properly wait for network interfaces with
> NETIF_F_LLTX to finish their dev_queue_xmit(). But that relied on an
> existing synchronize_rcu(), not synchronize_rcu_bh(). So does this mean
> that synchronize_net() never really waited for the rcu-bh critical
> section in dev_queue_xmit to finish? I've no idea.
> 
> So basically there are multiple options.
> 
> First would be to duplicate br_vlan_get_pvid_rcu() into a new
> br_vlan_get_pvid_rcu_bh() to appease lockdep for the TX path case. But
> this function already has another brother, br_vlan_get_pvid(), which is
> protected by the update-side rtnl_mutex. We don't want to grow the
> family too big too, especially since br_vlan_get_pvid_rcu_bh() would not
> be a function used by the bridge at all, just exported by it and used by
> the DSA layer.
> 
> The option of getting to the bottom of why does __dev_queue_xmit use
> rcu-bh, and splitting that into local_bh_disable + rcu_read_lock, as it
> was before [3], might be impractical. There have been 15 years of
> development since then, and there are lots of code paths that use
> rcu_dereference_bh() in the TX path. Plus, with the consolidation work
> done in [1], I'm not even sure what are the practical benefits of rcu-bh
> any longer, if the whole point was for synchronize_rcu() to wait for
> everything in sight - how can spammy softirqs like networking paint
> themselves red any longer, and how can certain RCU updaters not wait for
> them now, in order to avoid denial of service? It doesn't appear
> possible from the distance from which I'm looking at the problem.
> So the effort of converting __dev_queue_xmit from rcu-bh to rcu-preempt
> would only appear justified if it went together with the complete
> elimination of rcu-bh. Also, it would appear to be quite a strange and
> roundabout way to fix a "suspicious RCU usage" lockdep message.
> 
> Last, it appears possible to just give lockdep what it wants, and hold
> an rcu-preempt read-side critical section when calling br_vlan_get_pvid_rcu
> from the TX path. In terms of lines of code and amount of thought needed
> it is certainly the easiest path forward, even though it incurs a small
> (negligible) performance overhead (and avoidable, at that). This is what
> this patch does, in lack of a deeper understanding of lockdep, RCU or
> the network transmission process.
> 
> [1] https://lwn.net/Articles/777036/
> [2] commit d4828d85d188 ("[NET]: Prevent transmission after dev_deactivate")
> [3] commit 43da55cbd54e ("[NET]: Do less atomic count changes in dev_queue_xmit.")
> 
> Fixes: d82f8ab0d874 ("net: dsa: tag_dsa: offload the bridge forwarding process")
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> ---
>  net/dsa/tag_dsa.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)

To answer the question about the patch - it is correct and OK: the needed
dereference is done inside the function, and after that the value is copied.
I'd only check how the DSA bridge pointer is handled; it seems it is
dereferenced directly, and I don't know the DSA locking rules (i.e. whether
anything is held by that time to ensure the pointer is valid). As for the patch:
Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com>

Cheers,
 Nik
Vladimir Oltean Sept. 29, 2021, 9:28 p.m. UTC | #3
On Wed, Sep 29, 2021 at 10:53:33AM -0700, Paul E. McKenney wrote:
> On Wed, Sep 29, 2021 at 02:37:08AM +0300, Vladimir Oltean wrote:
> > __dev_queue_xmit(), which is our caller, does run under rcu_read_lock_bh(),
> > but in my foolishness I had thought this would be enough to make the
> > access, lockdep complains that rcu_read_lock() is not held.
> 
> Depending on exactly which primitive is complaining, you can inform
> lockdep of your intentions.  For example, you can change
> rcu_dereference(p) to rcu_dereference_bh(p).  Or you can change:
> 
> 	list_for_each_entry_rcu(p, &lh, field) {
> 		...
> 
> To:
> 
> 	list_for_each_entry_rcu(p, &lh, field, rcu_read_lock_bh_held()) {
> 		...
> 
> And hlist_for_each_entry_rcu() can also take that same optional
> lockdep parameter.

This is covered by my first option below; basically I don't want to
bloat the Ethernet bridge driver too much for a single user, and that
user being outside the bridge code itself, at that. I'm sure Nikolay and
Roopa would agree :)

The bridge xmit function - br_dev_xmit - takes the rcu_preempt
rcu_read_lock() too. Although I don't think it is very high-overhead in
any incarnation, I think it would be seen as a positive improvement if
it could be removed from that path too (or would it?)

> > Which it isn't - as it turns out, RCU preempt and RCU-bh are two
> > different flavors, and although Paul McKenney has consolidated
> > synchronize_rcu() to wait for both preempt as well as bh read-side
> > critical sections [1], the reader-side API is different, the lockdep
> > maps and keys are different.
> > 
> > The bridge calls synchronize_rcu() in br_vlan_flush(), and this does
> > wait for our TX fastpath reader of the br_vlan_group_rcu to complete
> > even though it is in an rcu-bh read side section. So even though this is
> > in premise safe, to lockdep this is a case of "who are you? I don't know
> > you, you're suspicious".
> > 
> > Side note, I still don't really understand the different RCU flavors.
> 
> RCU BH was there to handle denial-of-service networking loads.
> Changes over the years to RCU and to softirq have rendered it obsolete.
> But rcu_read_lock_bh() still disables softirq for you.

Thank you, I guess? :)

> RCU Sched provides the original semantics.
> 
> RCU Preempt, as the name suggests, allows RCU readers to be preempted.
> Of course, if you are using rcu_read_lock_sched() or rcu_read_lock_bh(),
> you are disabling preemption across the critical section.
> 
> > For example, as far as I can see, the core network stack has never
> > directly called synchronize_rcu_bh, not even once. Just the initial
> > synchronize_kernel(), replaced later with the RCU preempt variant -
> > synchronize_rcu(). Very very long story short, dev_queue_xmit has
> > started calling this exact variant - rcu_read_lock_bh() - since [2], to
> > make dev_deactivate properly wait for network interfaces with
> > NETIF_F_LLTX to finish their dev_queue_xmit(). But that relied on an
> > existing synchronize_rcu(), not synchronize_rcu_bh(). So does this mean
> > that synchronize_net() never really waited for the rcu-bh critical
> > section in dev_queue_xmit to finish? I've no idea.
> 
> The pre-consolidation Linux kernel v4.16 has these calls to
> synchronize_rcu_bh():
> 
> drivers/net/team/team.c team_port_disable_netpoll 1094 synchronize_rcu_bh();
> drivers/vhost/net.c vhost_net_release 1027 synchronize_rcu_bh();
> net/netfilter/ipset/ip_set_hash_gen.h mtype_resize 667 synchronize_rcu_bh();
> 
> But to your point, nothing in net/core.
> 
> And for v4.16 kernels build with CONFIG_PREEMPT=y, there is no guarantee
> that synchronize_rcu() will wait for a rcu_read_lock_bh() critical
> section.  A CPU in such a critical section could take a scheduling-clock
> interrupt, notice that it was not in an rcu_read_lock() critical section,
> and report a quiescent state, which could well end that grace period.

I find it really hard to believe that commit d4828d85d188
("[NET]: Prevent transmission after dev_deactivate") did not provide the
guarantee it promised. It seems much more likely that I'm missing something,
although I don't see what :)

But the team driver, which I did notice, and which you've linked to
above as well, did have a comment which suggested that yes, if you don't
call synchronize_rcu_bh(), you don't really wait for __dev_queue_xmit()
to finish.

> But as you say, in more recent kernels, synchronize_rcu() will indeed
> wait for rcu_read_lock_bh() critical sections.
> 
> But please be very careful when backporting.

No concerns with backporting, the code in question was added last month
or so.

> > So basically there are multiple options.
> > 
> > First would be to duplicate br_vlan_get_pvid_rcu() into a new
> > br_vlan_get_pvid_rcu_bh() to appease lockdep for the TX path case. But
> > this function already has another brother, br_vlan_get_pvid(), which is
> > protected by the update-side rtnl_mutex. We don't want to grow the
> > family too big too, especially since br_vlan_get_pvid_rcu_bh() would not
> > be a function used by the bridge at all, just exported by it and used by
> > the DSA layer.
> > 
> > The option of getting to the bottom of why does __dev_queue_xmit use
> > rcu-bh, and splitting that into local_bh_disable + rcu_read_lock, as it
> > was before [3], might be impractical. There have been 15 years of
> > development since then, and there are lots of code paths that use
> > rcu_dereference_bh() in the TX path. Plus, with the consolidation work
> > done in [1], I'm not even sure what are the practical benefits of rcu-bh
> > any longer, if the whole point was for synchronize_rcu() to wait for
> > everything in sight - how can spammy softirqs like networking paint
> > themselves red any longer, and how can certain RCU updaters not wait for
> > them now, in order to avoid denial of service? It doesn't appear
> > possible from the distance from which I'm looking at the problem.
> > So the effort of converting __dev_queue_xmit from rcu-bh to rcu-preempt
> > would only appear justified if it went together with the complete
> > elimination of rcu-bh. Also, it would appear to be quite a strange and
> > roundabout way to fix a "suspicious RCU usage" lockdep message.
> 
> The thing to be very careful of is code that might be implicitly assuming
> that it cannot be interrupted by a softirq handler.  This assumption will
> of course be violated by changing rcu_read_lock_bh() to rcu_read_lock().
> The resulting low-probability subtle breakage might be hard to find.

Of course, the networking code would not change functionally with the
removal of rcu_read_lock_bh(). So rcu_read_lock_bh() would be
transformed into local_bh_disable() + rcu_read_lock(). I am still a bit
unclear on the details, but there are reasons why we need softirqs
disabled - we don't call hard_start_xmit just from the NET_TX softirq,
that would be just too nice :)

> > Last, it appears possible to just give lockdep what it wants, and hold
> > an rcu-preempt read-side critical section when calling br_vlan_get_pvid_rcu
> > from the TX path. In terms of lines of code and amount of thought needed
> > it is certainly the easiest path forward, even though it incurs a small
> > (negligible) performance overhead (and avoidable, at that). This is what
> > this patch does, in lack of a deeper understanding of lockdep, RCU or
> > the network transmission process.
> > 
> > [1] https://lwn.net/Articles/777036/
> > [2] commit d4828d85d188 ("[NET]: Prevent transmission after dev_deactivate")
> > [3] commit 43da55cbd54e ("[NET]: Do less atomic count changes in dev_queue_xmit.")
> > 
> > Fixes: d82f8ab0d874 ("net: dsa: tag_dsa: offload the bridge forwarding process")
> > Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> 
> Of course, if no one really needs rcu_read_lock_bh() anymore, I would be
> quite happy to simplify my life by getting rid of it.  ;-)
> 
> 							Thanx, Paul

The basic idea that updaters could be resilient against softirq storms
sounds great in principle. Although if I understand correctly, that went
away with the consolidation. So again, it isn't that some resiliency
wouldn't be nice, but I'm looking at the current code and I just don't
see what the benefits of rcu_bh are. If the answer is as self-evident as
a naive person like me thinks it is - i.e. rcu_read_lock_bh is just a
glorified version of rcu_read_lock which also disables softirqs, just
with different lockdep rules - then is the consolidation really complete?
Couldn't the bh-disable readers be modified to just open-code the
disabling of softirqs, and resolve the lockdep issues that ensue from
having a separate flavor?

> > ---
> >  net/dsa/tag_dsa.c | 5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
> > 
> > diff --git a/net/dsa/tag_dsa.c b/net/dsa/tag_dsa.c
> > index 77d0ce89ab77..178464cd2bdb 100644
> > --- a/net/dsa/tag_dsa.c
> > +++ b/net/dsa/tag_dsa.c
> > @@ -150,10 +150,9 @@ static struct sk_buff *dsa_xmit_ll(struct sk_buff *skb, struct net_device *dev,
> >  		 * that's where the packets ingressed from.
> >  		 */
> >  		if (!br_vlan_enabled(br)) {
> > -			/* Safe because __dev_queue_xmit() runs under
> > -			 * rcu_read_lock_bh()
> > -			 */
> > +			rcu_read_lock();
> >  			err = br_vlan_get_pvid_rcu(br, &pvid);
> > +			rcu_read_unlock();
> >  			if (err)
> >  				return NULL;
> >  		}
> > -- 
> > 2.25.1
> >
Vladimir Oltean Sept. 29, 2021, 9:44 p.m. UTC | #4
On Thu, Sep 30, 2021 at 12:21:55AM +0300, Nikolay Aleksandrov wrote:
> To answer the question about the patch - it is correct and ok, the dereference needed
> is done inside the function and after that the value is copied. I'd only check how the
> DSA bridge pointer is handled, it seems it is dereferenced directly and I don't know
> the DSA locking rules (i.e. if anything is held by that time to ensure the pointer is
> valid). As for the patch:
> Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com>

I think you're right that nobody ensures dp->bridge_dev cannot be reset
to NULL by dsa_port_bridge_leave(). I'll try to simulate an issue by
spin-looping for a while in the xmit function while I add and remove the
port from a bridge, and see what happens. Worst case, we'll have to add
READ_ONCE() and WRITE_ONCE() annotations for all users, and work with a
local, on-stack reference - the bridge net_device itself will not go away
until the xmit itself finishes, it's just that it may not be our upper
anymore.
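
For reference, what I have in mind would look roughly like this in
dsa_xmit_ll() - a sketch only, assuming the writer side
(dsa_port_bridge_leave()) pairs it with WRITE_ONCE(dp->bridge_dev, NULL),
and not something I have tested:

	struct net_device *br = READ_ONCE(dp->bridge_dev);

	/* Work on a local snapshot: the bridge net_device will not be freed
	 * while this xmit is in flight, but dp->bridge_dev may be cleared
	 * underneath us, so don't re-read it.
	 */
	if (br && !br_vlan_enabled(br)) {
		rcu_read_lock();
		err = br_vlan_get_pvid_rcu(br, &pvid);
		rcu_read_unlock();
		if (err)
			return NULL;
	}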
Paul E. McKenney Sept. 29, 2021, 11:27 p.m. UTC | #5
On Wed, Sep 29, 2021 at 09:28:22PM +0000, Vladimir Oltean wrote:
> On Wed, Sep 29, 2021 at 10:53:33AM -0700, Paul E. McKenney wrote:
> > On Wed, Sep 29, 2021 at 02:37:08AM +0300, Vladimir Oltean wrote:
> > > __dev_queue_xmit(), which is our caller, does run under rcu_read_lock_bh(),
> > > but in my foolishness I had thought this would be enough to make the
> > > access, lockdep complains that rcu_read_lock() is not held.
> > 
> > Depending on exactly which primitive is complaining, you can inform
> > lockdep of your intentions.  For example, you can change
> > rcu_dereference(p) to rcu_dereference_bh(p).  Or you can change:
> > 
> > 	list_for_each_entry_rcu(p, &lh, field) {
> > 		...
> > 
> > To:
> > 
> > 	list_for_each_entry_rcu(p, &lh, field, rcu_read_lock_bh_held()) {
> > 		...
> > 
> > And hlist_for_each_entry_rcu() can also take that same optional
> > lockdep parameter.
> 
> This is covered by my first option below, basically I don't want to
> bloat the Ethernet bridge driver too much with a single user, and that
> user being outside the bridge code itself, at that. I'm sure Nikolay and
> Roopa would agree :)

No need to duplicate code.  You can pass in both like this:

rcu_dereference_check(p, rcu_read_lock_bh_held() || rcu_read_lock_held());

But if just switching to rcu_read_lock() really works for you, that is
easier.

> The bridge xmit function - br_dev_xmit - takes the rcu_preempt
> rcu_read_lock() too. Although I don't think it is very high-overhead in
> any incarnation, I think it would be seen as a positive improvement if
> it could be removed from that path too (or would it?)

Both rcu_read_lock() and preempt_disable() cost very little, but it
is not zero cost.  The real question is of course "can you see it in a
system-level benchmark?"  Or even in a microbenchmark.

> > > Which it isn't - as it turns out, RCU preempt and RCU-bh are two
> > > different flavors, and although Paul McKenney has consolidated
> > > synchronize_rcu() to wait for both preempt as well as bh read-side
> > > critical sections [1], the reader-side API is different, the lockdep
> > > maps and keys are different.
> > > 
> > > The bridge calls synchronize_rcu() in br_vlan_flush(), and this does
> > > wait for our TX fastpath reader of the br_vlan_group_rcu to complete
> > > even though it is in an rcu-bh read side section. So even though this is
> > > in premise safe, to lockdep this is a case of "who are you? I don't know
> > > you, you're suspicious".
> > > 
> > > Side note, I still don't really understand the different RCU flavors.
> > 
> > RCU BH was there to handle denial-of-service networking loads.
> > Changes over the years to RCU and to softirq have rendered it obsolete.
> > But rcu_read_lock_bh() still disables softirq for you.
> 
> Thank you, I guess? :)

You are welcome, I guess?  ;-)

> > RCU Sched provides the original semantics.
> > 
> > RCU Preempt, as the name suggests, allows RCU readers to be preempted.
> > Of course, if you are using rcu_read_lock_sched() or rcu_read_lock_bh(),
> > you are disabling preemption across the critical section.
> > 
> > > For example, as far as I can see, the core network stack has never
> > > directly called synchronize_rcu_bh, not even once. Just the initial
> > > synchronize_kernel(), replaced later with the RCU preempt variant -
> > > synchronize_rcu(). Very very long story short, dev_queue_xmit has
> > > started calling this exact variant - rcu_read_lock_bh() - since [2], to
> > > make dev_deactivate properly wait for network interfaces with
> > > NETIF_F_LLTX to finish their dev_queue_xmit(). But that relied on an
> > > existing synchronize_rcu(), not synchronize_rcu_bh(). So does this mean
> > > that synchronize_net() never really waited for the rcu-bh critical
> > > section in dev_queue_xmit to finish? I've no idea.
> > 
> > The pre-consolidation Linux kernel v4.16 has these calls to
> > synchronize_rcu_bh():
> > 
> > drivers/net/team/team.c team_port_disable_netpoll 1094 synchronize_rcu_bh();
> > drivers/vhost/net.c vhost_net_release 1027 synchronize_rcu_bh();
> > net/netfilter/ipset/ip_set_hash_gen.h mtype_resize 667 synchronize_rcu_bh();
> > 
> > But to your point, nothing in net/core.
> > 
> > And for v4.16 kernels build with CONFIG_PREEMPT=y, there is no guarantee
> > that synchronize_rcu() will wait for a rcu_read_lock_bh() critical
> > section.  A CPU in such a critical section could take a scheduling-clock
> > interrupt, notice that it was not in an rcu_read_lock() critical section,
> > and report a quiescent state, which could well end that grace period.
> 
> I find it really hard to believe that commit d4828d85d188
> ("[NET]: Prevent transmission after dev_deactivate") did not provide the
> guarantee it promised. It seems much more likely that I'm missing something,
> although I don't see what :)

You guys are the ones who get bitten if there is a problem, so I will
let you guys check it out.  Or not, at your option.  ;-)

> But the team driver, which I did notice, and which you've linked to
> above as well, did have a comment which suggested that yes, if you don't
> call synchronize_rcu_bh(), you don't really wait for __dev_queue_xmit()
> to finish.

And that did change, but then again I changed that to synchronize_rcu()
late in the consolidation effort.

> > But as you say, in more recent kernels, synchronize_rcu() will indeed
> > wait for rcu_read_lock_bh() critical sections.
> > 
> > But please be very careful when backporting.
> 
> No concerns with backporting, the code in question was added last month
> or so.

Very good!

> > > So basically there are multiple options.
> > > 
> > > First would be to duplicate br_vlan_get_pvid_rcu() into a new
> > > br_vlan_get_pvid_rcu_bh() to appease lockdep for the TX path case. But
> > > this function already has another brother, br_vlan_get_pvid(), which is
> > > protected by the update-side rtnl_mutex. We don't want to grow the
> > > family too big too, especially since br_vlan_get_pvid_rcu_bh() would not
> > > be a function used by the bridge at all, just exported by it and used by
> > > the DSA layer.
> > > 
> > > The option of getting to the bottom of why does __dev_queue_xmit use
> > > rcu-bh, and splitting that into local_bh_disable + rcu_read_lock, as it
> > > was before [3], might be impractical. There have been 15 years of
> > > development since then, and there are lots of code paths that use
> > > rcu_dereference_bh() in the TX path. Plus, with the consolidation work
> > > done in [1], I'm not even sure what are the practical benefits of rcu-bh
> > > any longer, if the whole point was for synchronize_rcu() to wait for
> > > everything in sight - how can spammy softirqs like networking paint
> > > themselves red any longer, and how can certain RCU updaters not wait for
> > > them now, in order to avoid denial of service? It doesn't appear
> > > possible from the distance from which I'm looking at the problem.
> > > So the effort of converting __dev_queue_xmit from rcu-bh to rcu-preempt
> > > would only appear justified if it went together with the complete
> > > elimination of rcu-bh. Also, it would appear to be quite a strange and
> > > roundabout way to fix a "suspicious RCU usage" lockdep message.
> > 
> > The thing to be very careful of is code that might be implicitly assuming
> > that it cannot be interrupted by a softirq handler.  This assumption will
> > of course be violated by changing rcu_read_lock_bh() to rcu_read_lock().
> > The resulting low-probability subtle breakage might be hard to find.
> 
> Of course, the networking code would not change functionally with the
> removal of rcu_read_lock_bh(). So rcu_read_lock_bh() would be
> transformed into local_bh_disable() + rcu_read_lock(). I am still a bit
> unclear on the details, but there are reasons why we need softirqs
> disabled - we don't call hard_start_xmit just from the NET_TX softirq,
> that would be just too nice :)
> 
> > > Last, it appears possible to just give lockdep what it wants, and hold
> > > an rcu-preempt read-side critical section when calling br_vlan_get_pvid_rcu
> > > from the TX path. In terms of lines of code and amount of thought needed
> > > it is certainly the easiest path forward, even though it incurs a small
> > > (negligible) performance overhead (and avoidable, at that). This is what
> > > this patch does, in lack of a deeper understanding of lockdep, RCU or
> > > the network transmission process.
> > > 
> > > [1] https://lwn.net/Articles/777036/
> > > [2] commit d4828d85d188 ("[NET]: Prevent transmission after dev_deactivate")
> > > [3] commit 43da55cbd54e ("[NET]: Do less atomic count changes in dev_queue_xmit.")
> > > 
> > > Fixes: d82f8ab0d874 ("net: dsa: tag_dsa: offload the bridge forwarding process")
> > > Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> > 
> > Of course, if no one really needs rcu_read_lock_bh() anymore, I would be
> > quite happy to simplify my life by getting rid of it.  ;-)
> > 
> > 							Thanx, Paul
> 
> The basic idea that updaters could be resilient against softirq storms
> sounds great in principle. Although if I understand correctly, that went
> away with the consolidation. So again, it isn't that some resiliency
> wouldn't be nice, but I'm looking at the current code and I just don't
> see what the benefits of rcu_bh are. If the answer is as self-evident as
> a naive person like me thinks it is - i.e. rcu_read_lock_bh is just a
> glorified version of rcu_read_lock which also disables softirqs, just
> with different lockdep rules - then is the consolidation really complete?
> Couldn't the bh-disable readers be modified to just open-code the
> disabling of softirqs, and resolve the lockdep issues that ensue from
> having a separate flavor?

As part of the consolidation effort, I made synchronize_rcu()
also be resilient against softirq storms in the same way that
synchronize_rcu_bh() used to be.  For example, that is why there is now
a call to rcu_softirq_qs() in __do_softirq().

Why didn't we do it that way to start with?  Because this approach relies
on a bunch of subsequent improvements in softirq in general and ksoftirqd
in particular.

If you change all instances of "rcu_read_lock_bh()" to instead be
"local_bh_disable(); rcu_read_lock()", and of "rcu_read_unlock_bh()"
to "rcu_read_unlock(); local_bh_enable()" that would work.  But
there are probably many that can now be just "rcu_read_lock()" and
"rcu_read_unlock()".  Figuring out which is which is a task for someone
who understands that code way better than I do.

And for all I know, there might be some situation that really wants
an rcu_read_lock_bh() for some reason.  But that does seem to me
to be unlikely, and it would be good to get rid of the _bh() variants.

If the uses go away, I am more than happy to get rid of the definitions!

							Thanx, Paul

> > > ---
> > >  net/dsa/tag_dsa.c | 5 ++---
> > >  1 file changed, 2 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/net/dsa/tag_dsa.c b/net/dsa/tag_dsa.c
> > > index 77d0ce89ab77..178464cd2bdb 100644
> > > --- a/net/dsa/tag_dsa.c
> > > +++ b/net/dsa/tag_dsa.c
> > > @@ -150,10 +150,9 @@ static struct sk_buff *dsa_xmit_ll(struct sk_buff *skb, struct net_device *dev,
> > >  		 * that's where the packets ingressed from.
> > >  		 */
> > >  		if (!br_vlan_enabled(br)) {
> > > -			/* Safe because __dev_queue_xmit() runs under
> > > -			 * rcu_read_lock_bh()
> > > -			 */
> > > +			rcu_read_lock();
> > >  			err = br_vlan_get_pvid_rcu(br, &pvid);
> > > +			rcu_read_unlock();
> > >  			if (err)
> > >  				return NULL;
> > >  		}
> > > -- 
> > > 2.25.1
> > >

Patch

diff --git a/net/dsa/tag_dsa.c b/net/dsa/tag_dsa.c
index 77d0ce89ab77..178464cd2bdb 100644
--- a/net/dsa/tag_dsa.c
+++ b/net/dsa/tag_dsa.c
@@ -150,10 +150,9 @@  static struct sk_buff *dsa_xmit_ll(struct sk_buff *skb, struct net_device *dev,
 		 * that's where the packets ingressed from.
 		 */
 		if (!br_vlan_enabled(br)) {
-			/* Safe because __dev_queue_xmit() runs under
-			 * rcu_read_lock_bh()
-			 */
+			rcu_read_lock();
 			err = br_vlan_get_pvid_rcu(br, &pvid);
+			rcu_read_unlock();
 			if (err)
 				return NULL;
 		}